Bandwidth-Aware Prefetching in Chip Multiprocessors
Marius Grannæs
Master of Science in Computer Science
Submission date: June 2006
Supervisor: Lasse Natvig, IDI
Norwegian University of Science and Technology
Department of Computer and Information Science

Problem Description
A promising area of computer architecture research is chip multiprocessors (CMP), also called multicore architectures. A CMP is an architecture where multiple cores are embedded into a single chip. These new processors typically have one or two levels of cache memory that are private to each processor, and one common shared L2 or L3 cache. The processors on the chip also share the communication channel to the off-chip memory, and thus compete for several shared resources. On the other hand, fetching instructions or data from external memory, or storing them in the shared cache, are operations that may help other processors. This potential balance between negative competition and positive cooperation gives new challenges for prefetching. Performance counters that monitor the usage of the various architectural resources will probably be important for implementing efficient CMP prefetching.
In the course TDT4720 Computer Design and Architecture, Specialization, Marius Grannæs completed the project entitled “Simulation of Hardware Based Prefetching in SimpleScalar”. The project was limited to uniprocessor architectures; it presented a simulator framework and some initial experiments on prefetching. The goals for the diploma work are:
* Experiments evaluating more types of prefetching methods
* An extension of the simulator framework for studying prefetching in relevant CMP architectures
* Experiments evaluating some of the most common prefetching methods in CMPs
* Studies of the topic “performance counters” in this context
* If time allows, development and evaluation of a new CMP prefetching algorithm based on performance counters
Assignment given: 19. January 2005
Supervisor: Lasse Natvig, IDI

Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Reducing Latency
1.3 Subgoals
1.4 Fifth Year Project
1.5 Scope of this Thesis
1.6 Structure of this Document
2 Background
2.1 Caches
2.1.1 Cacti
2.2 Prefetching
2.2.1 Software Prefetching
2.2.2 Hardware Prefetching
2.3 DRAM
2.3.1 The Memory Core
2.3.2 The DRAM Interface
2.3.3 Memory Controller
2.4 Chip Multiprocessors
2.4.1 Commercial Vendors
2.5 SimpleScalar
2.5.1 SimpleScalar Model
2.6 Performance Counters
2.6.1 Performance Counters on x86
2.6.2 Performance Counter Libraries
3 Methodology
3.1 Memory Model
3.1.1 Implementation Details
3.1.2 Model Parameters
3.1.3 DRAM Statistics
3.1.4 Model Verification
3.2 Prefetching
3.2.1 Implementation
3.2.2 New Additions
3.3 CMP
3.3.1 Target Architecture
3.3.2 Implementation
3.3.3 Implementation Shortcomings
3.4 Bandwidth-Aware Prefetching
3.4.1 Motivation
3.4.2 Idea
3.4.3 Implementation
3.5 Benchmarks
3.5.1 CMP Benchmarking
4 Results
4.1 Overall Plan for the Experiments
4.2 Experimental Setup
4.3 Uniprocessor
4.3.1 Unlimited Bandwidth
4.3.2 Limited Bandwidth
4.3.3 Bandwidth-Aware Prefetching
4.4 CMP
4.4.1 Plan for the Experiments
4.4.2 CMP Performance
4.4.3 Prefetching in CMP
4.4.4 Bandwidth-Aware Prefetching in CMP
5 Discussion
5.1 The Simulator
5.1.1 Prefetching
5.1.2 DRAM Model
5.1.3 CMP Extension
5.2 Methodology
5.2.1 Benchmark Selection
5.2.2 Simulation Environment
5.3 Results
5.3.1 Uniprocessor Results
5.3.2 CMP Results
5.3.3 Bandwidth-Aware Prefetching
5.4 Workflow
6 Conclusion
6.1 Results
6.2 Contributions
6.3 Future work
6.3.1 Simulator
6.3.2 Interactions
6.3.3 Bandwidth-Aware Prefetching
6.4 Acknowledgments
Bibliography
A Cacti Output
B Notur 2006 Poster
C Performance Counter Code
C.1 Pmc.c
C.2 Makefile
C.3 Performance.c
D Python Scripts
D.1 Clustisrunbench.py
D.2 Config.py
D.3 Parsebench.py
E Uniprocessor Code
E.1 Makefile
E.2 Sim-outorder.c
E.3 Dram.h
E.4 Dram.c
E.5 Prefetch.c
E.6 Prefetch.h
E.7 Memory.h
F CMP Code
F.1 Makefile
F.2 Sim-outorder.c
F.3 Controller.c
F.4 Shared.c
F.5 Shared.h
F.6 Cache.c
F.7 Dram.c
List of Figures
1.1 Development of CPU performance versus memory latency [4, 6].
2.1 Example of a memory hierarchy with 2 levels of cache.
2.2 CZone / Delta Correlation operation
2.3 State diagram for Chen and Baer’s reference predictor.
2.4 Stream prefetching in the Power4 architecture
2.5 Memory connection diagram
2.6 A single DRAM cell.
2.7 The Athlon X2 [40].
2.8 The architecture of the Niagara (T1) [42].
2.9 The architecture of the Cell
2.10 The SimpleScalar model.
3.1 Memory organization
3.2 DRAM flowchart
3.3 IPC of SPEC benchmarks using the old model and 1, 2 or 3 channels in the new model.
3.4 Prefetching in SimpleScalar.
3.5 Target architecture of the simulated CMP.
4.1 Performance of prefetching on processors with unlimited bandwidth.
4.2 Similar to figure 4.1, but with uninteresting benchmarks removed.
4.3 Prefetching degree of 8
4.4 Speedup in IPC by increasing the prefetching degree from 1 to 8.
4.5 Increasing prefetching degree on Mgrid.
4.6 Increasing prefetching degree on Art.
4.7 Increasing prefetching degree on Swim.
4.8 Varying CZone size with prefetching degree on C/DC for Art.
4.9 Varying CZone size with prefetching degree on C/DC for Mcf.
4.10 Varying CZone size with prefetching degree on C/DC for Mgrid.
4.11 Varying CZone size with prefetching degree on C/DC for Swim.
4.12 Varying Table size on Ammp.
4.13 Varying Table size on Art.
4.14 Varying Table size on Mcf.
4.15 Varying Table size on Mgrid.
4.16 Varying Table size on Swim.
4.17 Baseline with 1 DRAM channel.
4.18 Speedup by using 1 channel
4.19 Baseline with 2 DRAM channels.
4.20 Speedup using 2 channels over the unlimited bandwidth model.
4.21 Plot of increasing prefetching degree versus available bandwidth for Ammp with sequential prefetching.
4.22 Plot of increasing prefetching degree versus available bandwidth for Art with sequential prefetching.
4.23 Plot of increasing prefetching degree versus available bandwidth for Mcf with sequential prefetching.
4.24 Plot of increasing prefetching degree versus available bandwidth for Mgrid with sequential prefetching.
4.25 Plot of increasing prefetching degree versus available bandwidth for Swim with sequential prefetching.
4.26 Plot of increasing prefetching degree versus available bandwidth for Ammp with C/DC prefetching.
4.27 Plot of increasing prefetching degree versus available bandwidth for Art with C/DC prefetching.
4.28 Plot of increasing prefetching degree versus available bandwidth for Mcf with C/DC prefetching.
4.29 Plot of increasing prefetching degree versus available bandwidth for Mgrid with C/DC prefetching.
4.30 Plot of increasing prefetching degree versus available bandwidth for Swim with C/DC prefetching.
4.31 Plot of increasing prefetching degree versus available bandwidth for Ammp with RPT prefetching.
4.32 Plot of increasing prefetching degree versus available bandwidth for Art with RPT prefetching.
4.33 Plot of increasing prefetching degree versus available bandwidth for Mcf with RPT prefetching.
4.34 Plot of increasing prefetching degree versus available bandwidth for Mgrid with RPT prefetching.
4.35 Plot of increasing prefetching degree versus available bandwidth for Swim with RPT prefetching.
4.36 Speedup of bandwidth-aware prefetching
4.37 Reductions in bandwidth usage
4.38 Speedup using bandwidth-aware prefetching. Threshold = 400
4.39 Reductions in bandwidth usage. Threshold = 400
4.40 Speedup using bandwidth-aware prefetching. Threshold = 800
4.41 Reductions in bandwidth usage. Threshold = 800
List of Tables
2.1 Types of memory access patterns.
2.2 Delta correlation in a GHB, newest miss to the right.
2.3 Example Reference Prediction Table.
3.1 SPEC 2000 Integer benchmarks [67].
3.2 SPEC 2000 Floating-Point benchmarks [67].
3.3 Characteristics of the SPEC2000 benchmarks suite.
4.1 Simulation parameters.
4.2 Accuracy of prefetching heuristics
4.3 Coverage of prefetching heuristics
4.4 IPC of the benchmark in the left column in a dual core CMP combined with the benchmarks in the first row.
4.5 Speedup in IPC compared to a L2 cache of 1MB.
4.6 Prefetching in CMP, speedup compared to a CMP with no prefetching.
4.7 Performance of RPT prefetching in CMP
4.8 Bandwidth-aware prefetching using sequential prefetching
4.9 Bandwidth-aware prefetching using C/DC prefetching
4.10 Bandwidth-aware prefetching using RPT prefetching
4.11 Bandwidth-aware prefetching using sequential prefetching
4.12 Bandwidth-aware prefetching using C/DC prefetching
4.13 Bandwidth-aware prefetching using RPT prefetching
Listings
2.1 Example memory patterns
2.2 Prefetching in GCC using intrinsics
3.1 DRAM-model datastructure.
3.2 The prefetch data type.
3.3 The trigger type.
3.4 The location data type.
3.5 Implementation of sequential prefetching.
C.1 Pmc.c
C.2 Makefile
C.3 Performance.c
D.1 Clustisrunbench.py
D.2 Sample configuration file
D.3 Parsebench.py
E.1 Makefile - Unified diff against SimpleScalar 3.0d
E.2 Sim-outorder.c - Unified diff against SimpleScalar 3.0d
E.3 Dram.h
E.4 Dram.c
E.5 Prefetch.c
E.6 Prefetch.h
E.7 Memory.h - Unified diff against SimpleScalar 3.0d
F.1 Makefile - Unified diff against Uniprocessor version
F.2 Sim-outorder.c - Unified diff against Uniprocessor version
F.3 Controller.c
F.4 Shared.c
F.5 Shared.h
F.6 Cache.c - Unified diff against SimpleScalar 3.0d
F.7 Dram.c - Unified diff against Uniprocessor version
Chapter 1
Introduction
1.1 Motivation
Each year, exponentially more transistors are put into integrated circuits [1, 2]. Moore’s law is the empirical observation that, at our rate of technological development, the complexity of an integrated circuit, with respect to minimum component cost, will double every 18 months [3]. Increased transistor density, in turn, translates into faster computers for consumers. In addition, architectural advances in microprocessor design have contributed to increased performance.
CPU performance increased by 35% per year until 1986 and by 55% per year after 1986 [4]. DRAM density has also increased at approximately the same rate. In terms of latency, however, DRAM has improved far more slowly (only 7% per year), as shown in figure 1.1. Because of this speed difference, main memory can no longer keep up with the processor, and the processor frequently stalls while waiting for data from memory. This problem is known as the “memory wall” or “memory gap” [5].
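As a rough illustration of how quickly these rates diverge, the sketch below compounds the annual improvements quoted above (55% for CPU performance, 7% for DRAM latency). The function name and the chosen horizons are illustrative assumptions, and the rates are simply treated as constant.

```python
def relative_gap(cpu_rate, mem_rate, years):
    """Ratio of compounded CPU improvement to compounded memory improvement."""
    return ((1.0 + cpu_rate) / (1.0 + mem_rate)) ** years

# Rates quoted in the text: 55% per year (CPU), 7% per year (DRAM latency).
for years in (1, 5, 10):
    print(years, round(relative_gap(0.55, 0.07, years), 1))
```

Under these assumptions the gap widens by roughly 45% per year, which compounds to a factor of about 40 over ten years.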
In a recent (March 2006) article, the ACM president, David Patterson, argues that the performance of microprocessors has only increased by 20% per year since 2002. This is due to three separate causes:
• The lack of additional power for a chip to dissipate.
• The lack of additional instruction-level parallelism to exploit.
• The lack of improvement in memory latency.
Decreasing main memory latency is difficult. Ultimately, main memory latency has a lower bound determined by the speed of light and the distance from the memory modules to the CPU. At 4GHz, a signal can travel at most 7.5cm per clock tick. This places a natural bound on the lowest possible main memory latency, as traditional main memory is off-chip. Approaching such a limit becomes ever more difficult. In current electronic designs, however, RC-delays are the dominant contributing factor to latency. Finally, there is a practical limitation caused by economics [7]: increased capacity sells chips, while decreased latency does not.
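The 7.5cm figure follows directly from dividing the speed of light by the clock frequency. The sketch below reproduces that arithmetic; the function names and the 15cm CPU-to-memory distance are hypothetical values chosen for illustration, and real signals in copper traces travel slower than light in vacuum.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre

def distance_per_cycle_cm(clock_hz):
    """Upper bound on the distance a signal can travel in one clock period."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz * 100.0

def round_trip_cycles(distance_cm, clock_hz):
    """Minimum cycles for a request to reach memory and the reply to return."""
    return 2.0 * distance_cm / distance_per_cycle_cm(clock_hz)

print(distance_per_cycle_cm(4e9))    # roughly 7.5 cm at 4 GHz
print(round_trip_cycles(15.0, 4e9))  # hypothetical 15 cm path: about 4 cycles
```

Even this idealized bound is a handful of cycles, before any DRAM access time or RC-delay is taken into account.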
[Figure: logarithmic plot of performance (1 to 100000) against year (1980 to 2005), with one curve for CPU and one for memory latency.]
Figure 1.1: Development of CPU performance versus memory latency [4, 6].
1.2 Reducing Latency
It is likely that the memory gap will not be closed in the foreseeable future, and it is therefore important to develop techniques that can either tolerate or decrease latency. Numerous architectural techniques have been developed to compensate for this gap. These include caches, out-of-order execution, chip multiprocessing, simultaneous multithreading, run-ahead execution, bypassing and prefetching. In addition, some have looked at integrating main memory into the processor. The IRAM project [8] is one such initiative. This technique reduces the gap by decreasing the physical distance the signals have to travel. Integrating main memory in this way also makes increased bandwidth available.
Caches are so far the most successful technique used to bridge the gap. Conceptually, caches are smaller but faster memories that store the most often used data [4]. Thus, most of the time, the data required by the processor are in the cache, and there is no need for a stall. Everything that is fetched from main memory is stored in the cache, possibly displacing other data. Whenever data referenced by the processor are not present in the cache, they must be fetched from main memory. Such cache misses reduce the effectiveness of caches.
Out-of-order execution is a technique commonly used in modern processors to increase throughput. Basic out-of-order execution allows instructions to execute in a different order from that specified by the program. This is done through a dynamic analysis of the instruction stream (scoreboarding [4]). Allowing other instructions to execute while one instruction waits for memory increases performance considerably. However, because the instruction window is bounded in size, there are limits to how much latency out-of-order execution can hide. In practice, out-of-order execution can hide L1 misses, but not L2 misses. Thus it is most effective when most of the code and data reside in the cache.
Run-ahead execution is a hot research topic [9, 10]. Basic run-ahead execution allows programs to continue speculatively while a memory stall is in effect. By predicting the value of the returned data, the thread can continue running. When the actual value is received from memory, it is compared to the predicted value. If the two values match, the run-ahead was a success and execution can continue. If they do not match, however, a micro-architectural roll-back has to be performed. Such a roll-back has to restore all registers and flush the pipeline.
Simultaneous Multithreading (SMT) allows other threads to execute while one thread waits for memory, which increases system throughput. In effect, multiple instruction streams utilize the same core at the same time. This technique is used in, among others, Intel’s “Hyperthreading” technology. Normally, two threads execute on a core at the same time, and the scheduler picks instructions from each thread in a round-robin fashion. If one thread becomes stalled for some reason, the scheduler only issues instructions from the non-blocked thread.
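The round-robin selection described above can be sketched as follows. This is a deliberately simplified model, assuming one instruction is issued per slot and that the set of stalled threads is known for each slot; the function and parameter names are hypothetical and do not correspond to any particular processor.

```python
def issue_order(threads, stalled, slots):
    """Pick one thread per issue slot, round-robin, skipping stalled threads.

    threads: list of thread ids.
    stalled: dict mapping slot index -> set of thread ids stalled in that slot.
    Returns the list of chosen thread ids (None if every thread is stalled).
    """
    order = []
    next_index = 0
    for slot in range(slots):
        blocked = stalled.get(slot, set())
        chosen = None
        for offset in range(len(threads)):
            candidate = threads[(next_index + offset) % len(threads)]
            if candidate not in blocked:
                chosen = candidate
                next_index = (next_index + offset + 1) % len(threads)
                break
        order.append(chosen)
    return order

# Thread A waits for memory during slots 2 and 3, so B issues alone there.
print(issue_order(["A", "B"], {2: {"A"}, 3: {"A"}}, 6))
```

With both threads runnable the slots alternate between A and B; while A is stalled, every slot goes to B, which is how the core stays busy.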
Prefetching is a technique used to increase the effectiveness of caches by trying to
predict the memory reference stream. By fetching needed data to the caches before it
is actually referenced by the processor, it is possible to achieve a significant performance
increase. This technique will be explored in more detail in the remainder of this thesis.
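As a minimal sketch of the idea, the code below counts cache misses for a sequential address stream with and without fetching the next line on every miss. It is an idealized model rather than the implementation used in this thesis: the cache is unbounded, prefetched lines arrive instantly, and the function name is an assumption. The 128-byte line size is also an assumed value.

```python
def count_misses(addresses, line_size=128, prefetch_next=False):
    """Count misses in an unbounded cache, optionally prefetching line + 1."""
    cache = set()  # line numbers currently held in the cache
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line not in cache:
            misses += 1
            cache.add(line)
            if prefetch_next:
                cache.add(line + 1)  # speculatively fetch the following line
    return misses

stream = range(0, 128 * 8, 8)  # sequential walk through 8 cache lines
print(count_misses(stream), count_misses(stream, prefetch_next=True))
```

For this perfectly sequential stream, the prediction is always correct and half of the misses disappear; on irregular streams the extra fetches would instead waste bandwidth.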
Bypassing is a technique similar to prefetching, but its purpose is different. The goal of bypassing is to identify data that will not be reused (data with no temporal locality) [11]. Such data do not need to be stored in the cache, and can effectively bypass it. Heuristics used for bypassing share many characteristics with those used for prefetching.
Chip Multiprocessors pack multiple processors into a single chip. Cores are thus able to share caches, which in turn makes it easier for two separate threads to cooperate through shared memory. Since programs often share code in the form of shared libraries (such as libc), sharing such libraries in the cache saves significant space. By having multiple cores, the focus is shifted away from single-thread performance (where latency is important) to throughput (where latency is less important). In essence, by having multiple cores (each of which may also be SMT), the effect of stalling a single thread is much less severe. Chip multiprocessing is an interesting field, because it opens up a variety of new possibilities in terms of architecture (explored further later in this thesis). In addition, most commercial vendors of high-end microprocessors currently offer CMPs. Lawrence Spracklen and Santosh G. Abraham outline the challenges in CMP in their paper “Opportunities and challenges” [12]. One of those challenges is to adapt prefetching to CMP.
It would be too ambitious to try to solve this problem in full within the time frame of this thesis. Instead, this project approaches the challenge by using performance counters to direct prefetching. As this thesis is part of a larger PhD thesis, it will also be used as a foundation for further research.
1.3 Subgoals
The main purpose of this thesis is to investigate prefetching in a chip multiprocessor
setting. The second purpose is to explore the use of performance counters with regard to
prefetching. To achieve these goals, several objectives have been formulated:
• Investigate performance counters in modern processors.
• Investigate modern prefetching methods.
• Expand the simulator (SimpleScalar) from the fifth year project to include more
types of prefetching.
• Expand the simulator to include a more realistic DRAM model.
• Expand the simulator to simulate CMPs.
• Develop a methodology to benchmark CMPs.
• Develop an understanding of current prefetching heuristics by conducting experiments.
• Develop an understanding of known prefetching heuristics in a CMP setting by performing simulations.
• Develop a new prefetching algorithm based on performance counters.
• Document its performance through experimentation.
1.4 Fifth Year Project
This thesis is a continuation of my fifth year project. In that project I studied three prefetching algorithms in a uniprocessor context. I also developed a framework for handling prefetching in SimpleScalar, a widely used cycle-accurate microarchitectural simulator. The prefetching heuristics that were already implemented at the beginning of this project are:
• Sequential prefetching
• Delta Correlation prefetching
• CZone/Delta Correlation prefetching
These heuristics are described in section 2.2.2, and will also be used in this thesis.
In addition, much of the theoretical study of prefetching was done in that project. In this thesis I will reuse the framework that I made in the project, as well as the three reference algorithms.
1.5 Scope of this Thesis
Prefetching and CMP architecture are both very large fields. This section limits the scope of the thesis.
Prefetching can be done both in hardware and in software. In software, one can either use special “prefetch” instructions, or generate hints to hardware about possible prefetching opportunities. This thesis only concerns pure hardware-controlled prefetching, although some theory on software prefetching is given to provide context.
CMPs offer a whole range of different architectures (see section 2.4). In this thesis I will only look at one fixed architecture, chosen for its popularity in commercial settings. Each core has a private L1 cache, but the cores share an L2 cache as well as a memory controller.
1.6 Structure of this Document
In chapter 2 I present the background material needed to understand this thesis. This chapter contains the necessary information about DRAM, caches, chip multiprocessors, SimpleScalar and prefetching. In chapter 3 the methods used to conduct the experiments are presented. My own extensions to SimpleScalar, as well as the benchmarks used, are presented and analyzed. Chapter 4 contains the results from my experiments. It is organized in two distinct sections: first, the prefetching heuristics are tested in a uniprocessor environment, in order to establish an understanding of the different heuristics. Then the heuristics are run in a CMP context to see how this setting affects prefetching. Chapter 5 discusses the results, and chapter 6 concludes and presents some future work.
Chapter 2
Background
In this chapter I will describe the theory behind caches, prefetching, DRAM, chip multiprocessors and performance counters. Furthermore, I will describe some of the tools that I will use, such as SimpleScalar and CACTI.
2.1 Caches
Before talking about prefetching and CMP, a basic understanding of caches is needed.
Caches are used to bridge the memory gap through duplicating data in smaller and faster
storage [13]. A cache can be made in many types of technology, but is usually SRAM
(while main memory is DRAM). This is done as part of a memory hierarchy (see figure
2.1).
CPU <-> L1 <-> L2 <-> Main memory
Figure 2.1: Example of a memory hierarchy with 2 levels of cache.
There might be several levels of caches in the memory hierarchy. In my simulations I
have opted for a 2-level cache system. Registers are the fastest type of memory, and there
is no programmer-visible latency associated with using them. There are usually only a
few registers available. The L1 cache is somewhat larger, usually a few kilobytes large.
Although its latency varies from processor to processor, it is usually between 1 and 4 clock
cycles. The L2 cache is much larger, although smaller than main memory. The amount
of L2 cache varies, but is usually measured in megabytes. Its latency is around 20 clock
cycles. In comparison, main memory usually has a latency of several hundred clock cycles.
Caches work by exploiting the spatial and temporal locality seen in memory refer-
ences [4]. Spatial locality refers to the property that data close together in address space
tend to be referenced around the same time. Temporal locality refers to the property that
data that has been referenced is very likely to be referenced again in the near future. Caches
work by storing every referenced data item. Thus, temporal locality is exploited. A whole cache
line (usually around 128 bytes large) is brought into the cache each time any part of the
line is accessed, thus spatial locality is exploited.
There are several ways to build a cache in hardware. Because caches can only hold
a small portion of main memory at any point, some way to map cache memory to main
memory is needed. A cache can be organized in three major ways:
• Direct mapped - A cache line can only be placed in one position based on its address.
• Fully associative - A cache line can be placed anywhere in the cache.
• Set associative - A cache line can be placed in exactly one set. Each set can hold n
cache lines.
The most common type is the set associative cache as it provides a compromise between
the two extremes. A fully associative cache would be extremely expensive to implement
(in terms of area) when the cache size becomes large as it requires comparison logic for
every entry in the cache. A direct mapped cache provides very little flexibility and the
possibility of collisions in the working set becomes large. The set associative solution is
preferable, because it only requires comparison logic for the n entries of one set and because
a cache line can be put anywhere in the set, reducing conflicts.
In a set associative cache, the cache lines are organized into sets. If there are n cache
lines (or blocks) in each set, the cache is called n-way set associative, and each cache
has several sets. Data can only map onto one set in the cache, but it can be placed in
any of the n cache lines in that set; the replacement policy in use decides which one.
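The mapping from an address to a set can be sketched as follows. This is a minimal illustration, assuming a power-of-two line size and number of sets; the structure and function names are hypothetical and do not come from SimpleScalar or any other tool.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cache geometry; all names are illustrative assumptions. */
struct cache_geometry {
    uint32_t line_size;  /* bytes per cache line, assumed a power of two */
    uint32_t num_sets;   /* number of sets, assumed a power of two */
    uint32_t ways;       /* n, the number of cache lines per set */
};

/* Set that an address maps to: drop the byte offset within the line,
 * then keep the low-order bits of the line number as the set index. */
uint32_t cache_set_index(const struct cache_geometry *g, uint64_t addr)
{
    assert(g->line_size != 0 && (g->line_size & (g->line_size - 1)) == 0);
    assert(g->num_sets != 0 && (g->num_sets & (g->num_sets - 1)) == 0);
    return (uint32_t)((addr / g->line_size) % g->num_sets);
}

/* Tag stored with a line to tell apart addresses that share a set. */
uint64_t cache_tag(const struct cache_geometry *g, uint64_t addr)
{
    assert(g->line_size != 0 && g->num_sets != 0);
    return (addr / g->line_size) / g->num_sets;
}
```

With `ways` equal to 1 this describes a direct mapped cache, and with `num_sets` equal to 1 a fully associative one. In the n-way case the tag has to be compared against the n lines of the selected set only.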
When a new cache line is put into a full set, an old one is evicted. There are several
possible replacement policies. Least-recently-used (LRU) is the most common one,
although FIFO and other variants are possible.
There are several reasons why some data might not be in the cache. It is useful to
categorize these reasons into groups, so that one can more easily reason about them.
According to Hennessy and Patterson [4] there are three major categories of misses:
Definition 1 (Compulsory). The very first access to a block cannot be in the cache,
so the block must be brought into the cache. These are also called cold-start misses or
first-reference misses.
Definition 2 (Capacity). If the cache cannot contain all the blocks needed during execu-
tion of a program, capacity misses (in addition to compulsory misses) will occur because
of blocks being discarded and later retrieved.
Definition 3 (Conflict). If the block placement strategy is set associative or direct
mapped, conflict misses (in addition to compulsory and capacity misses) will occur because
a block may be discarded and later retrieved if too many blocks map to its set.
These misses are also called collision misses or interference misses. The idea is that hits
in a fully associative cache that become misses in an n-way set-associative cache are due
to more than n requests on some popular sets.
These definitions will be used when reasoning about prefetching.
2.1.1 Cacti
CACTI [14] is an advanced cache model capable of modeling the timing, power require-
ments and area of any given cache. CACTI helps computer architects to understand the
trade-offs between power, area and timing. The model used is very complex and considers
most of the available design-techniques. I will use this tool to evaluate different cache
designs, especially to see the trade-offs between timing and size.
As an example I have run the program to demonstrate the fidelity of the model. The
cache being simulated is a 4-way 8 KB cache with 64-byte cache lines. It has 1 read and
1 write port, and the technology being used is 65 nm. The complete output spans several
pages, and can be found in appendix A. An explanation of every field can be found in the
technical report [14].
The most important numbers from this model for our purposes are the following:
Access Time (ns): 0.57591
Cycle Time (wave pipelined) (ns): 0.273554
Total Power all Banks (nJ): 0.139284
Total area 1.074315 (mm^2)
If we double the number of read and write ports (which would be required if it was a
shared cache), we get the following output:
Access Time (ns): 0.729833
Cycle Time (wave pipelined) (ns): 0.361345
Total Power all Banks (nJ): 0.285296
Total area 2.400975 (mm^2)
If we use combined read/write ports instead of separate ones we get the following data:
For one R/W-port:
Access Time (ns): 0.57591
Cycle Time (wave pipelined) (ns): 0.273554
Total Power all Banks (nJ): 0.139284
Total area 1.074315 (mm^2)
With two R/W ports:
Access Time (ns): 0.658765
Cycle Time (wave pipelined) (ns): 0.328369
Total Power all Banks (nJ): 0.233175
Total area 1.889209 (mm^2)
In general, increasing the number of ports of a cache increases its latency moderately,
while its area increases substantially. In the first example above, doubling the number of
ports raises the access time by about 27%, while the area more than doubles. This becomes
quite significant when using a shared cache in CMP designs. It is also important to note
that the energy requirements become much larger when using a multiported cache (roughly
double in the same example). It is clear that giving each processor
its own R/W port does not scale. Fortunately, bigger caches are divided into banks that
are independent of each other, so that parallelism can be achieved without increasing the
number of ports. However, it is worth remembering these numbers when discussing shared
L2 caches in the upcoming chapters.
2.2 Prefetching
Prefetching is a speculative technique that is used to fetch data from main memory to
the cache before it is referenced by the processor. Its main purpose is thus to reduce
the number of compulsory misses. In order to do so, one must accurately predict future
references to memory. Numerous heuristics have been developed, and I will present only
a subset of these heuristics in this section.
By using various heuristics, a prefetching algorithm tries to predict in advance
what data is needed by the processor. Because main memory latency is comparatively
large (around 200 clock cycles), this prediction must be made at least that far
in advance. If the prediction is correct the needed data will be in the cache and
the processor will avoid a costly stall.
Prefetching does come with a cost. Prefetching a cache line can potentially have three
separate ill effects. First, additional bandwidth is used to fetch the prefetched data. This
can potentially delay other memory requests that are on the program’s critical path of
execution. Second, accessing external memory is expensive in terms of energy. This
can become quite significant in hand-held devices. Third, by
fetching a cache line another cache line has to be replaced. The replaced cache line might
contain data that is needed in the near future. If the data has been modified (the dirty bit
is set), the cache line will need to be written back to memory, thus increasing bandwidth
usage.
To reason about prefetching a set of definitions is needed. The following definitions
are taken from Srinivasan’s “A Prefetch Taxonomy” [15].
Definition 4 (Good prefetch). A prefetch is classified as good if the prefetched line is
referenced by the application before it is replaced, and as bad otherwise.
Definition 5 (Accuracy). If a conventional¹ cache has M misses without using any
prefetch algorithm, the accuracy of a given prefetch algorithm that yields G good prefetches
and B bad prefetches is calculated as:
\[ \text{Accuracy} = \frac{G}{G + B} \tag{2.1} \]
Definition 6 (Coverage). If a conventional cache has M misses without using any prefetch
algorithm, the coverage of a given prefetch algorithm that yields G good prefetches and B
bad prefetches is calculated as:
\[ \text{Coverage} = \frac{G}{M} \tag{2.2} \]
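As a small worked example of equations 2.1 and 2.2, the two metrics can be computed from simulator counters as sketched below. The function names are hypothetical, and returning 0 when no prefetch has been issued is an assumption made here to avoid a division by zero, not part of the definitions.

```c
#include <assert.h>

/* Accuracy = G / (G + B), equation 2.1.
 * Assumption: report 0 when no prefetches were issued at all. */
double prefetch_accuracy(unsigned long good, unsigned long bad)
{
    unsigned long issued = good + bad;
    return issued == 0 ? 0.0 : (double)good / (double)issued;
}

/* Coverage = G / M, equation 2.2, where M is the number of misses
 * of the same cache without prefetching. */
double prefetch_coverage(unsigned long good, unsigned long baseline_misses)
{
    assert(baseline_misses > 0);
    return (double)good / (double)baseline_misses;
}
```

For instance, a run with G = 30, B = 10 and M = 120 gives an accuracy of 0.75 but a coverage of only 0.25: most of the issued prefetches were useful, yet three quarters of the original misses remain.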
¹ A conventional cache in this context is a cache without prefetching.
The following definitions are taken from VanderWiel and Lilja’s “A survey of Data
Prefetching techniques” [16].
Definition 7 (Prefetch distance). If a loop contains small computational bodies, it may
be necessary to initiate prefetches δ iterations before the data is referenced, where δ is
known as the prefetch distance and is expressed in units of loop iterations:
\[ \delta = \left\lceil \frac{l}{s} \right\rceil \tag{2.3} \]
where l is the average cache miss latency, measured in processor cycles, and s is the
estimated cycle time of the shortest possible execution path through one loop iteration.
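Equation 2.3 can be evaluated with integer arithmetic only. The sketch below uses a hypothetical function name, and the numbers in the note that follows are illustrative rather than measured.

```c
#include <assert.h>

/* Prefetch distance: delta = ceil(l / s), equation 2.3.
 * miss_latency is l and iteration_cycles is s, both in processor cycles. */
unsigned prefetch_distance(unsigned miss_latency, unsigned iteration_cycles)
{
    assert(iteration_cycles > 0);
    /* Integer ceiling division: round up unless l is a multiple of s. */
    return (miss_latency + iteration_cycles - 1) / iteration_cycles;
}
```

With a miss latency of 200 cycles and a loop body whose shortest path takes 30 cycles, the prefetch has to be issued 7 iterations ahead. If one iteration takes longer than the miss latency, a distance of 1 is sufficient.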
Definition 8 (Prefetch degree). It is possible to increase the number of blocks prefetched
by any arbitrary number K. This number is known as the prefetching degree. As an
example, a prefetching degree of 1 fetches 1 block from memory, while a prefetching degree
of 3 fetches 3 blocks from memory.
A prefetching heuristic with a high accuracy is a heuristic that generates few needless
memory accesses. A prefetching heuristic with a high coverage is a heuristic that removes
many of the misses in the cache (and thus gives high performance). A good prefetching
algorithm needs both high accuracy and high coverage. A heuristic that has a high accuracy
is trivial to create: most of the algorithms described later can be made extremely
conservative and thus increase their accuracy. Coverage can be increased in a similar
manner by using more aggressive prefetching, which increases the number of prefetches issued.
Such an approach is impractical as bandwidth limits its usefulness.
As a primitive benchmark, prefetching is often compared to a “perfect” cache. A
“perfect” cache is a cache that always has the requested data. In other words, there is
no difference between main memory and cache (except for a much lower latency). This
value represents a theoretical upper limit for the effectiveness of prefetching. I will use
this metric as well.
It is useful to distinguish between data and instructions when discussing prefetching
as they exhibit different patterns. Instruction prefetching can be done very accurately by
using the information already present in the branch predictor [17, 18]. The information
in the branch predictor can be used to predict what instructions will be executed (the
program path), and thus prefetch the instructions. Spracklen has shown that there are
large performance gains available by using instruction prefetching on CMPs [19].
Data prefetching is somewhat more complex as data can be accessed in various ways
and can be structured. Consider a single value, a constant, an array, a data structure or a
pointer. Each of these types would likely behave differently. Different programming
constructs produce different memory access patterns as illustrated in listing 2.1.
On line 2, a simple assignment is performed. This is a scalar pattern: x is read once
and then program execution moves forward. A simple loop (as in lines 5-7) produces a
sequential pattern. In this code snippet the array is traversed in a sequential manner
by incrementing the array index. Lines 10-12 are similar; however, only every third data
element is accessed. This pattern is named “strided”. The fourth pattern (lines 15-17) is a
simple pointer chasing snippet. In this example, a linked list is traversed to the end by
using the “next” field of the individual data items. This construct can often be found when
a program needs to traverse a list. The last example (lines 20-22) is somewhat contrived. A random
function indexes into an array. Predicting this case would be hard. However, such code
does exist in the form of jump tables and look-up tables. Of course, mixes of the above
patterns also exist. The different types are summarized in table 2.1.
Caches exploit spatial locality (see section 2.1) by fetching entire cache lines (128 bytes)
even though only a single byte is actually needed. This is a limited form of prefetching and
has been shown to be very successful. In the next two sections I will examine prefetching
methods in more detail.
 1  /* Scalar pattern */
 2  foo = x;
 3
 4  /* Sequential pattern */
 5  for (index = 0; index < 100; index++) {
 6      foo = foo + array[index];
 7  }
 8
 9  /* Strided pattern - Stride = 3 */
10  for (index = 0; index < 100; index = index + 3) {
11      foo = foo + array[index];
12  }
13
14  /* Pointer chasing in a linked list */
15  while (p != NULL) {
16      p = p->next;
17  }
18
19  /* Irregular */
20  for (i = 0; i < 100; i++) {
21      foo = foo + array[random() % 100];
22  }
Listing 2.1: Example memory patterns
2.2.1 Software Prefetching
Although software prefetching is outside the scope of this thesis, the following background
material is included for completeness. Software prefetching is a large group of prefetching
techniques. There are two subcategories of software prefetching; explicit prefetching and
software hinting. Explicit software prefetching [20] consists of prefetching instructions in
the program. As an example a program might issue a prefetch for a part of an array while
it is processing another part. Such methods have proven very useful in pointer-chasing
programs [21]. Most modern processors have support for explicit software prefetching
through special instructions. In addition, most modern compilers have built-in support
for prefetching through intrinsics. An example of using intrinsics in GCC can be seen
in listing 2.2. Note that this code will compile even on platforms that do not support
software prefetching (the built-in then simply generates no prefetch instruction).
Modern compilers also have built-in support for automatically inserting software prefetches.
GCC has support for prefetching loop arrays, although more advanced prefetching features
are planned. In addition several research compilers have support for even more advanced
techniques [21], such as pointer prefetching.
The downside of software prefetching is that it introduces additional instructions into
the program. This increases the program size, thus decreasing instruction cache performance.
In addition, the prefetch instructions will have to run through the entire pipeline
like any other instruction. Thus a prefetching instruction occupies space in the pipeline
that could possibly be used for actual computation.
Type       | Description                                                                | Example pattern (memory addresses)
Scalar     | A simple reference to a single scalar value                                | 42
Sequential | A memory pattern where the address is incremented                          | 42, 43, 44, 45, ...
Strided    | A memory pattern where the address is increased by a value larger than one | 42, 45, 48, 51, ...
Pointer    | No pattern, but effect is due to pointer chasing                           | 42, 3, 18, 7, ...
Irregular  | None of the above                                                          | 97, 2, 1034, 7, ...
Table 2.1: Types of memory access patterns.
for (index = 0; index < 100; index = index + 3) {
    /* Prefetch the next element in the array */
    /* In real code the offset will be larger than 3 */
    /* __builtin_prefetch takes an address, hence the & operator */
    __builtin_prefetch(&array[index + 3]);
    foo = foo + array[index];
}
Listing 2.2: Prefetching in GCC using intrinsics
Hinting is a cooperative method where special instructions guide the
prefetching hardware. There exist numerous such schemes. Guided region prefetching [22]
is one of them. The programmer (or compiler) issues hints to the prefetching hardware
about regions in memory. As an example, one such message might be “The data in this
location is a large array”. The prefetching hardware can then use this information to
predict the behavior of the program. This method is often used in conjunction with
traditional prefetching methods.
2.2.2 Hardware Prefetching
Pure hardware based prefetching is transparent to the programmer and the compiler.
The programmer cannot directly modify the behavior of the hardware prefetching unit.
Therefore the prefetching unit will have to detect memory reference patterns at run-
time. Compared to software prefetching, this can have both advantages and disadvantages.
Hardware prefetching does not have time to analyze the program at a larger scale. However, it
does have dynamic information that is not available to the compiler. Thus a prefetching
algorithm can cooperate with other parts of the processor to examine likely paths of
execution. In addition, a hardware prefetching unit can be made completely independent
of the processor core, avoiding slowdown on the critical path.
Hardware prefetching schemes range from the very simple to the very complex. As
with any architectural technique, choosing the right one is a trade-off between power
consumption, design complexity, area and performance.
The following sections discuss different prefetching algorithms and are not meant to be
a complete reference. Moreover, the following schemes are implemented in the simulator.
Sequential prefetching
Sequential prefetching is a very simple scheme. Whenever a program accesses cache line
X, the system prefetches line X+1 [13]. This approach exploits spatial
locality, the same principle used by caches. Studies have shown that it can be very effective
in a range of programs [23]. There exist two variants of this algorithm. The first variant
prefetches whenever there is an access to the cache memory, the second variant prefetches
only if the access was a miss. The performance of this type of prefetching is documented in my
fifth year project.
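The two variants differ only in the condition that triggers the prefetch, as the following sketch shows. The names are hypothetical and the code is not taken from the simulator; it only illustrates the decision described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The two variants of sequential prefetching (illustrative names). */
enum seq_policy { PREFETCH_ON_ACCESS, PREFETCH_ON_MISS };

/* Decide whether an access to addr triggers a prefetch of line X+1.
 * Returns true and stores the start address of the next cache line in
 * *prefetch_addr if a prefetch should be issued. */
bool sequential_prefetch(enum seq_policy policy, uint64_t addr,
                         uint32_t line_size, bool was_miss,
                         uint64_t *prefetch_addr)
{
    assert(line_size > 0 && prefetch_addr != NULL);
    if (policy == PREFETCH_ON_MISS && !was_miss)
        return false;                      /* hits do not trigger anything */
    uint64_t line = addr / line_size;      /* line X containing addr */
    *prefetch_addr = (line + 1) * line_size;
    return true;
}
```

The prefetch-on-access variant issues more prefetches and therefore uses more bandwidth, while the prefetch-on-miss variant stops prefetching as soon as the accesses start to hit.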
DC prefetching
Delta correlation prefetching is an algorithm that tries to detect patterns in the miss
stream. It stores the miss stream in a Global History Buffer (GHB) [24, 25]. This buffer
stores the address of every miss to the L2 cache in a FIFO buffer of fixed size. This has a
distinct advantage over other storage techniques in that stale or old data will be discarded
first.
On a miss, the delta (the difference between the current address and the previous) is
calculated and stored. The GHB is then traversed to find an earlier occurrence of the two
most recent deltas.
Miss address:  40  44  46  48  52  54
Delta:             4   2   2   4   2
Table 2.2: Delta correlation in a GHB, newest miss to the right.
An example can be seen in table 2.2. In this table the miss stream is represented by
the addresses that cause misses in the L2 cache. Time increases to the right. The two
last misses are to addresses 52 and 54, thus the latest delta equals 2. The previous delta was 4. The
GHB is then scanned backwards for this pair of deltas (4,2). In this example, the pair is
found at the oldest end of the GHB (misses 40, 44 and 46). Then, a prefetch is issued for
the current address plus the delta that followed the pair. In this case a prefetch would be
issued for 54 + (48 − 46) = 56. If the
prefetch degree was larger than 1, the next address would be 56 + (52 − 48) = 60.
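The search described above can be sketched in a few lines. This is a simplified illustration that keeps the GHB as a plain array with the oldest miss first, which is an assumption; the hardware structure in [24, 25] is a linked FIFO, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Delta correlation over a miss history.
 * misses[0..n-1] holds the miss addresses, oldest first.
 * Writes up to `degree` prefetch addresses to out[] and returns how
 * many were written (0 if the newest delta pair has no earlier match). */
size_t dc_predict(const int64_t *misses, size_t n, size_t degree,
                  int64_t *out)
{
    assert(misses != NULL && out != NULL);
    if (n < 4 || degree == 0)
        return 0;  /* need the newest pair plus one earlier candidate */

    /* The two most recent deltas form the search key. */
    int64_t key_old = misses[n - 2] - misses[n - 3];
    int64_t key_new = misses[n - 1] - misses[n - 2];

    /* Scan backwards; i is the miss that ends a candidate delta pair. */
    for (size_t i = n - 2; i >= 2; i--) {
        if (misses[i - 1] - misses[i - 2] == key_old &&
            misses[i] - misses[i - 1] == key_new) {
            /* Replay the deltas that followed the matching pair. */
            int64_t next = misses[n - 1];
            size_t count = 0;
            for (size_t j = i + 1; j < n && count < degree; j++) {
                next += misses[j] - misses[j - 1];
                out[count++] = next;
            }
            return count;
        }
    }
    return 0;
}
```

Applied to the miss stream of table 2.2 with a prefetch degree of 2, the sketch yields the addresses 56 and 60, as in the worked example above.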
C/DC prefetching
CZone / Delta Correlation is a newer variant of the delta correlation prefetcher. It was
invented by Kyle Nesbit [25].
The idea is that processors have different access patterns to separate regions in memory.
For example, the stack might be stored in one area of memory and will show a distinct
memory reference pattern. An array might be stored in another area, and may exhibit
sequential memory reference behavior. Such areas are called CZones and are usually of a
fixed size.
To exploit this, C/DC first stores every miss in a GHB (as in the DC scheme). When
a new miss occurs, C/DC determines which CZone it corresponds to. Then every other
miss in the GHB that corresponds to the same CZone is put into a separate buffer. Then
the delta correlation algorithm is used on this buffer.
Figure 2.2 shows a conceptual model of this type of prefetcher. A table indexes the
different CZones with pointers to the global history buffer. This buffer is then traversed
while the correlation unit looks for matching patterns. A correlation unit is logic that
performs the matching described under DC prefetching.
If the size of each CZone is equal to the size of the program, this type of prefetching
is reduced to the special case of Delta Correlation prefetching.
Figure 2.2: CZone / Delta Correlation operation. This diagram is taken from [25].
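The CZone filtering step of C/DC can be sketched as follows. The sketch assumes that a CZone is a naturally aligned region of 2^czone_bits bytes, so that the zone number is simply the upper address bits; this and the function names are assumptions made for illustration. The buffer it produces would then be handed to the delta correlation search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CZone number of an address, assuming zones of 2^czone_bits bytes. */
uint64_t czone_of(uint64_t addr, unsigned czone_bits)
{
    assert(czone_bits < 64);
    return addr >> czone_bits;
}

/* Copy every miss that lies in the same CZone as the newest miss
 * (ghb[n-1]) into zone_buf, oldest first. Returns the number of
 * entries copied; zone_buf must have room for n entries. */
size_t czone_filter(const uint64_t *ghb, size_t n, unsigned czone_bits,
                    uint64_t *zone_buf)
{
    assert(ghb != NULL && zone_buf != NULL);
    if (n == 0)
        return 0;
    uint64_t zone = czone_of(ghb[n - 1], czone_bits);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (czone_of(ghb[i], czone_bits) == zone)
            zone_buf[count++] = ghb[i];
    }
    return count;
}
```

If the zone is made large enough to cover every address in the history, nothing is filtered out and the scheme degenerates to plain delta correlation, as noted above.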
AVD prefetching
Prefetching dynamic data structures such as linked lists is a difficult task. The first problem
lies in identifying which data are pointers to other data. The other problem is timeliness:
identifying pointers early enough that the prefetch is issued a significant amount of
time before the data that is pointed to is actually used. Thus chains of pointers will have
to be followed.
Prefetching pointers is an active research field, and while some progress has been made,
no simple, efficient method has yet been devised. The problem is that there is nothing
regular about dynamic data structures. Because they are based on pointers, data that are
logically sequential do not have to be sequential in memory. To make matters worse, the
memory allocator or garbage collector might move or compact data.
Tag Prev addr Stride State
43 130 0 Initial
54 126 2 Transient
67 512 10 Steady
Table 2.3: Example Reference Prediction Table.
To identify pointer loads, numerous schemes exist. The simplest one is to match the
upper N bits of the address to the value that has been loaded. If the upper bits of the
value match those of the address, it is considered a pointer load. The idea is to catch
near pointers: it is quite improbable that the upper N bits match if the loaded data is
ordinary data. If it is
a near pointer, the upper bits will probably match. This method is often referred to as
Address-Value Delta [26, 9].
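The upper-bits comparison can be written as a single test. The sketch below assumes 64-bit addresses and a hypothetical function name; the choice of N is a tuning parameter that the text leaves open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Heuristic pointer test: a loaded value is treated as a (near) pointer
 * if its upper n bits equal the upper n bits of the address it was
 * loaded from. Assumes 64-bit addresses. */
bool looks_like_near_pointer(uint64_t load_addr, uint64_t loaded_value,
                             unsigned n)
{
    assert(n >= 1 && n <= 64);
    unsigned shift = 64 - n;   /* number of low-order bits to ignore */
    return (load_addr >> shift) == (loaded_value >> shift);
}
```

A larger N makes false positives less likely but restricts the test to pointers that lie closer to the load address, which is the “near” pointer limitation of this approach.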
The downside of this kind of prefetching is that one has to wait until the data returned
from the previous load is available, thus reducing timeliness. Another problem is that the
algorithm can only detect “near” pointers. Finally, prefetching pointer chains becomes
very hard, thus limiting how aggressive the prefetcher can be.
Other methods track pointer loads in special structures by indexing the load instruc-
tions themselves [27, 28].
PC-based methods
As can be seen from listing 2.1, different load instructions behave differently: some loads
will only be executed once, some are part of a tight inner loop, while others load pointers.
By tracking how different load instructions behave in a separate table, more information
becomes available for prefetching [29].
Tien-Fu Chen and Jean-Loup Baer propose a scheme where this information is stored
in a reference prediction table (RPT). The RPT is a cache-like structure that stores information
about loads, indexed by the address of the load instruction.
An example table is shown in table 2.3. When a load instruction is first encountered it
is entered in the table with its tag and the loaded address, stride is set to 0 and the state
is “Initial”. If a load instruction is already in the table, the difference between prev addr
and the loaded address is calculated. If the difference is equal to the value in the stride
field, prev addr is updated and a state transition along the “Correct” path according to
figure 2.3 occurs. If they do not match, the stride field is updated, prev addr is updated
and a state transition along the “Incorrect” path occurs.
Dahlgren and Stenström experimented with a version of this scheme with 3 states [23].
Figure 2.3: State diagram for Chen and Baer’s reference predictor.
Stream prefetching
Stream prefetching is not a method for detecting prefetching patterns, but a method
for actually issuing prefetches. It is intended to be used in conjunction with any of the
previous prefetching heuristics. The basic principle is simple: a stream is detected by
one of the previous algorithms and a series of prefetching addresses is generated. The
first address is prefetched into the L1 cache, the next addresses are prefetched into the L2
cache. If an L3 cache is present, some of the prefetches are issued to the L3 cache. Thus a
stream is formed [30]. In the beginning, this will not be very efficient as most of the
data will lie in main memory and will not be spread out into the memory hierarchy. Once
the steady state is achieved, the first line to prefetch will be held in the L2 cache, and the
next lines will be in the L3 cache, thus a large performance benefit can be achieved.
An illustration of the stream-principle can be found in figure 2.4. This type of prefetch-
ing is used in the Power4 and 5 by IBM [31, 32] among others. In this work I will use the
streaming technique in conjunction with the RPT heuristic.
Figure 2.4: Stream prefetching in the Power4 architecture [31].
2.3 DRAM
DRAM (Dynamic Random Access Memory) is the most common form of main memory
in computer systems today. This technology has seen tremendous growth in capacity as
feature sizes decrease. This development has also enabled designs with higher bandwidth,
whereas latency has not seen such big improvements [7], leading to a problem that is known
as “The memory wall” [5].
This work is not primarily concerned with DRAM implementations, but requires an
accurate model of the memory subsystem to make accurate simulations. Thus it is required
to develop an understanding of this system, in order to implement a good model. In order
to do so, I will look at some of the different technologies in use today, and with this
knowledge develop a model that can be used with the simulator.
DRAM has three major components: The memory core, the bus interface and the
memory controller. See figure 2.5. In the following sections I will describe each component.
CPU <-> Memory Controller <-> DRAM Interface <-> DRAM Core
Figure 2.5: Diagram of the connections between the memory controller, the DRAM interface
and the memory core.
2.3.1 The Memory Core
The memory core (the part that actually stores information) has been relatively unchanged
over the years. Each bit is stored as a small charge in a capacitor as shown in figure 2.6.
Its operation is simple; First, the corresponding word line is activated, thus opening the
transistor. Then current flows from the capacitor to the bit line. The bitline is connected
to a sense amplifier (not shown). This component amplifies the signal caused by the
charge in the capacitor. That signal is then stored in a latch as the original charge in the
capacitor is lost when it is read. Thus logic is needed to write the data from the latch
back into the capacitor.
18
CHAPTER 2. BACKGROUND 2.3. DRAM
"
Switching Transistor 


 C1


þ
 ß


Bit line





ßWord line  
Figure 2.6: A single DRAM cell.
To store many bits, many such elements are connected in a matrix. In such a configuration,
a whole row is read at the same time. Because of the small feature size, the bitlines
are physically very close together and form a relatively large capacitor. This effect is
countered by precharging the bitlines before an access to the capacitor, thus decreasing
the bias resulting from the previous read. In addition, because capacitors lose charge
over time, there is a need to refresh this charge. In DRAM this is commonly done by
periodically reading an entire row and writing it back into the capacitors.
There are some interesting parameters commonly used by manufacturers regarding the
core of the DRAM chip. The following definitions are taken from John L. Hennessy and
David A. Patterson’s textbook “Computer Architecture: A Quantitative Approach” [4].
Definition 9 (Access time). Access time is the time between when a read is requested
and when the desired word arrives.
Definition 10 (Cycle time). Cycle time is the minimum time between requests to memory.
In addition, a chip might adopt a closed or open page policy. An open page policy is
keeping the last page read from the core in the latches, thus speeding up accesses that hit
the same page (row). The downside is that the latch must be flushed if the access does
not hit the same page, thus increasing latency in those cases.
2.3.2 The DRAM Interface
Most of the architectural development of DRAM has happened in the interface to the
memory core. This development has been driven by three factors:
• The need for higher bandwidth.
• The need for lower latency.
• Cost.
The demand for higher bandwidth was driven by the need to keep processors supplied
with data as they get faster. The need for lower latency was driven by the expense of
stalling the pipeline in a modern processor in case of a cache miss. Cost was a driving factor
for bringing high performance memory subsystems to the market. As can be seen in the
case of Rambus, the limiting factors of the system were licensing issues and cost². Rambus
(rambus.com) is a relatively new company specialising in designing memory subsystems.
Its first large commercial product used very narrow data paths (16 bits), but with multiple
channels. It was a complete redesign of the memory subsystem. It was a good design, but
did not get wide acceptance in the marketplace, even with Intel’s backing.
The first memory systems were asynchronous (no central clock). Instead the memory
controller would wait for a specified amount of time before reading data after issuing
a request. This system did not scale well, so instead a synchronous interface was made,
called SDRAM [33]. This type shared a system clock with the memory controller, enabling
two key improvements. First, it could be run at a higher speed. Second, commands could be
pipelined: the commands for one memory access could be issued while another memory
access was read. In addition, multiple banks of SDRAM could be used, thus several
accesses could be done in parallel. SDRAM also introduced burst mode where a sequential
number of bytes in the same memory array row can be pipelined onto the data output bus
by only updating the column address on each cycle. Many variants were made of SDRAM
including: Enhanced SDRAM, Virtual Channel SDRAM, Fast Cycle DRAM [33].
As the SDRAM architecture reached its maximum potential, a more efficient architecture
was needed. DDR (Double Data Rate) SDRAM increases the bandwidth and lowers
the latency by transferring data on both the rising and falling edge of the clock. In addition,
DDR was license-free (as opposed to Rambus), which helped its market acceptance.
DDR2 is the new standard, and currently lives alongside of DDR in modern systems. This
design uses packets of data transferred over a network, rather than a bus-based system.
XDR[34] is the second generation memory subsystem made by Rambus. It reuses many
of the basic ideas, but uses differential signaling to achieve higher speeds. It also uses a
more network-like structure, using packets and transactions. Many high-end systems use
XDR today, such as the Sony PlayStation 3.
2.3.3 Memory Controller
The memory controller is the final component in the subsystem. Its main task is to decode
the memory request from the bus master (the CPU or a bridge3) and convey it over the
DRAM interface to the DRAM chips. In addition, it is responsible for refreshing the
charge in the memory cells.
Modern memory controllers perform other tasks as well. To be efficient they must
be able to handle concurrency, such that the CPU can operate efficiently by pipelining
requests as well as sending accesses to multiple banks to enable parallelism. In an SMP4
environment the controller also enforces cache coherency, either by snooping the bus (as in the MESI protocol)
or by using a directory. It can also prioritize traffic: refresh operations will have a relatively
low priority when the charge in the cells is high. Fault tolerance, such as parity checking
or ECC, is also assigned to the memory controller.
Many modern CPUs (such as AMD64) include the memory controller on-chip. This
leads to less flexibility, but higher performance.
2However, Rambus is currently undertaking legal action against several DRAM manufacturers under
allegations of price fixing.
3A device that connects two similar or dissimilar buses together.
4Symmetric MultiProcessing.
2.4 Chip Multiprocessors
Chip Multiprocessors5 (CMPs) are multiple processors embedded on a single chip. At a
minimum the processor cores must share a package and physical I/O pins. However,
because the cores are tightly integrated on the same piece of silicon, there exist other
opportunities for sharing as well. The cores can share caches, memory controllers and
functional units, among others. The tight integration allows for new architectural possibilities,
especially when cores cooperate closely (for example in a producer/consumer
relationship).
In the literature, it is common to see the term Multiprocessor System-On-Chip (MP-
SoC). This term refers to the same concept with multiple embedded processors on the
same piece of silicon. However, the MPSoC term is most commonly used in system-on-
chip designs with heterogeneous processors, such as the Trimedia TR-1300. The TR-1300
is made by Philips as a multipurpose multimedia chip [35]. It has multiple application
specific processors to decode video and sound as well as a general purpose VLIW6 CPU.
Most commercial vendors of high performance microprocessors are now turning to
CMPs. Both AMD and Intel manufacture chips with two cores (“Intel Core Duo” and
“AMD Athlon X2”). These two vendors are more conservative than others: Sun Microsys-
tems has developed the T-1 Niagara with 8 independent cores, where each one is capable
of running 4 threads simultaneously. IBM, Sony and Toshiba have developed the Cell
microprocessor in a joint effort. The Cell consists of a heterogeneous architecture, with
one main processor and eight smaller ones connected in a ring.
Why does the semiconductor industry embrace CMPs? The reasons for moving to a
CMP architecture are numerous, but the main issues are [6]:
1. Increasing power consumption causes thermal problems;
2. Design complexity;
3. Limitations of Instruction Level Parallelism (ILP);
4. Increasing memory-processor gap.
As the clock frequency increases the power requirement increases with it. This poses
a problem, as the temperature of the chip increases too. If the temperature goes beyond
a certain point, damage to the chip might occur. The idea is that two slower and smaller
cores can give the same performance as a larger single core, but at a lower frequency and
thus lower power requirements.
Design complexity has become an issue as designs include more functional units and
deeper pipelines. High-end microprocessors have thus become increasingly complex
to design. To put it another way: as more and more transistors become available to
designers, it has become increasingly difficult to make good use of them. Using a
CMP architecture is promising, as it offers a designer the possibility of reuse. A single
CPU design can be reused multiple times on a single chip.
Magnus Ekman and Per Stenström have looked at the trade-off between multiple cores
and issue width, and found the optimum width to be around 4 [36]. Although it is
5The term multicore is also used in the literature.
6Very Long Instruction Word.
theoretically possible to increase ILP up to the data flow limit7 [4], it is not very practical
as it requires very large window sizes, near-perfect branch prediction, very large caches,
large register blocks and perfect alias analysis. It is much easier to exploit Thread Level
Parallelism (TLP), where parallelism is expressed as threads. This approach is common in
server applications such as webservers and database servers, where each execution is only
loosely coupled. Such programs map very elegantly onto a CMP as each processor can
handle one thread, instead of using one large processor to interleave the threads' execution.
Because the cores in a CMP operate at a lower frequency, the memory-processor gap
is reduced. However, the opportunities for prefetching actually increase [12]. This is due
to the bandwidth limitations that occur in a CMP due to the sharing of external pins as
well as caches.
A CMP opens up a vast array of architectural choices. The cores themselves can
be homogeneous or heterogeneous. However, the most active research topic is how to
interconnect the cores. The most basic approach is to simply not connect them, or only
connect them at the memory controller. A more interesting approach is to share one or
two levels of the cache through a crossbar or similar structure8. In the next section I will
briefly look at what commercial vendors have done. The purpose is not to give a complete
overview of all commercial vendors or processors, but rather highlight some differences.
The move to CMP has also prompted a paradigm shift in terms of how programs must
be implemented to achieve the highest performance. In a CMP, it is important to segment
the problem into multiple pieces in such a way that all the cores can be utilized. It is also
important to balance the load across the cores. IBM has released a research compiler called
“Octopiler” for the Cell architecture that automatically compiles code for uniprocessors to
CMP-aware code [38]. However, the performance of the resulting binaries is not very
impressive, and for high-performance applications the best choice is still to parallelize the code by
hand.
2.4.1 Commercial Vendors
AMD X2
AMD has chosen a very straight-forward design. The basic X2 is simply two Athlon64
cores on a single die, with separate L2 caches [39]. The cores are connected by a simple
crossbar switch to the shared memory controller. This design is very simple and offers pin
compatibility with the single core Athlon64 chip. The architecture of the Athlon X2 is
shown in figure 2.7.
SUN Niagara
Sun takes a more radical approach to CMP and has designed the Niagara processor. The
chip consists of 8 cores, each capable of running 4 threads simultaneously [41]. Each core
is about as powerful as the UltraSPARC III. It has a shared 3MB 12-way set associative
7The data flow limit is the maximum possible parallelism, that is only limited by actual true data
dependencies.
8An interesting approach is to build a network on chip, where data is moved as packets on a network
[37]. Magnus Jahre at the NCAR group is currently working on extending the simulator framework with
such networks.
Figure 2.7: The Athlon X2 [40].
level 2 cache. However, the 8 cores share a single floating point unit, making it unsuitable
for most scientific applications. The architecture of this chip is shown in figure 2.8. It is
worthwhile to notice how the cache is divided into four banks, each one connected to its
own memory controller (marked as “System Interface Buffer Switch Core” in the figure).
The Niagara is targeted at server applications, where most instructions are integer
operations. By making each core multithreaded, Sun hopes that the relatively small amount
of cache available is compensated for by the amount of thread level parallelism available.
In 2006, Sun open-sourced the Niagara processor. The full source code can be found
on the website www.opensparc.org. In addition to the full Verilog code, a simulator and
the operating system are available under open-source licenses.
IBM Cell
The Cell is a result of a cooperation between Sony, Toshiba and IBM that started in 2000.
The Cell has a heterogeneous architecture: it has one large core (the PPE) with a traditional cache [43,
44], and eight smaller cores that are optimized for SIMD9, called SPEs (Synergistic
Processing Elements). The smaller cores do not have a cache, but a local store. An
SPE can only operate on its local store, but can request DMA transfers to and from
main memory. The idea is to let the SPE work on one part of the local store, while DMA
transfers data to other parts of memory. In addition, there is a circular ring (EIB) that
9Single Instruction Multiple Data.
Figure 2.8: The architecture of the Niagara (T1) [42].
handles communication across cores and with the PPE and memory controller. By using
a programmer-controlled memory hierarchy, significant speedups can be achieved [45]. On
scientific kernels, a speedup of over 20 times compared to traditional architectures has
been shown for single precision arithmetic. The architecture can be seen in figure 2.9.
Figure 2.9: The architecture of the Cell [46].
2.5 SimpleScalar
SimpleScalar [47, 48] is a cycle-accurate simulator capable of simulating out-of-order su-
perscalar processors. It was developed by Todd Austin during his PhD at the University of
Wisconsin in Madison. Today, the simulator is developed and supported by SimpleScalar
LLC. It can accurately model a wide range of processors, and give accurate information
about cache performance as well as other aspects within the processor.
The main purpose of SimpleScalar is to give researchers as well as designers the abil-
ity to experiment with different configurations and to give accurate feedback about the
performance of a processor. It can produce a pipe-trace that displays what instructions
are in what parts of the pipeline. Graphical interfaces exist to make visualization easier.
Furthermore, SimpleScalar logs many aspects of the simulation into performance counters,
for easy identification of bottlenecks.
SimpleScalar is a very advanced simulator. In addition to supporting cycle-accurate
out-of-order superscalar processors, it has support for advanced branch prediction, cache
hierarchies, virtual memory, debugging and I/O. It supports a variety of instruction
sets (Alpha10 and PISA11 in the default distribution).
SimpleScalar has been used by many researchers, and its open nature (the software is
open source as well as free for academic uses) has made it a very robust tool. It is intended
to be extended, and the code is very modular and well written.
Numerous add-ons exist to SimpleScalar [50], such as Wattch [51], that can predict
power consumption for a given configuration. Other additions are: Value prediction, hot-
spot modeling, multi-threading, multiprocessing and different ISAs.
The next release of SimpleScalar (version 4) has a new redesigned core, called MASE [52].
This new core decouples the functional and the timing simulation. Thus, even if the timing
simulation contains small, insignificant errors, it will not affect correct program execution.
Other alternative simulators were also evaluated for this thesis, such as Simics [53] and
Simflex [54]. Simics does not provide enough detail when dealing with microarchitecture
at this level and cannot be used. Simflex was also considered, but changing the simulator
core from SimpleScalar would discard a large amount of work done in the fifth year project.
2.5.1 SimpleScalar Model
Discrete Time Specification Systems [55] (DTSS) have a long tradition in the modeling
of digital electronics. This is due to the fact that most digital circuitry is governed by
a central clock. This clock ticks at a given frequency and the state of the machine is
only changed at those discrete time intervals. However, analog circuit elements such as
transistors, capacitors and resistors that make up the digital building blocks (such as gates,
memory elements and wires) can only be modeled by using detailed Differential Equations
Specification Systems (DESS). However, DESS modeling of a Pentium 4 class processor
would need several cpu-days to model even a single clock cycle. This is not practical, and
thus using a DTSS model is more convenient and produces very accurate results. The
drawback of using such a simplified model is that information is lost. Specifically the
10A RISC processor developed by DEC.
11The Portable Instruction Set Architecture - a MIPS-like ISA for use by researchers and students [49].
information about capacitance between wires. This information is needed, for example to
compute the power requirements as can be seen in equation 2.4.
\[ P \propto C V^{2} f \tag{2.4} \]
Here P (the power requirement) is proportional to C (the capacitance of the circuit),
to the square of V (the supply voltage) and to f (the frequency).
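Equation 2.4 can be turned into a small helper for comparing two design points. The following is only a sketch: the function name and the idea of expressing every quantity as a ratio against a baseline design are assumptions made for illustration, and leakage power is ignored.

```c
#include <assert.h>

/* Relative dynamic power following equation 2.4, P ∝ C * V^2 * f.
 * Every argument is a ratio against a baseline design (1.0 = unchanged).
 * The function name and the ratio convention are hypothetical. */
double relative_power(double c_ratio, double v_ratio, double f_ratio)
{
    assert(c_ratio >= 0.0 && v_ratio >= 0.0 && f_ratio >= 0.0);
    return c_ratio * v_ratio * v_ratio * f_ratio;
}
```

As a purely illustrative example, doubling the switched capacitance (two cores instead of one) while halving both the supply voltage and the frequency gives relative_power(2.0, 0.5, 0.5), which is a quarter of the baseline power.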
At the core of SimpleScalar is a cycle accurate simulation, using a DTSS model of
the pipeline. This can be seen in figure 2.10, where the main processor pipeline is at the
top. Each step in the pipeline is modeled as a separate component. This is useful because
it allows researchers to concentrate on one area, without worrying too much about the
effects of their changes in other stages of the pipeline. Moreover, such a coupled model
enables researchers to analyze the instructions passing between different stages. In my
research this has allowed me to analyze how pipeline stalls form based on when cache
misses occur. In addition, the information provided at each stage lets me analyze what
kind of loads typically miss (and are thus good candidates for prefetching).
Figure 2.10: The SimpleScalar model.
The simulator processes the pipeline in reverse order. This ensures that instructions
that are in the commit stage are processed before instructions in the write-back stage are
passed to the commit stage. In that sense instructions move from left to right, while the
simulation moves from right to left.
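The effect of the reverse processing order can be illustrated with a toy pipeline. This is a minimal sketch and not SimpleScalar code: the number of stages, the latch array and the function names are all assumptions. Because the last stage is emptied before the stage in front of it is processed, each instruction advances exactly one stage per simulated cycle.

```c
#include <assert.h>

#define NUM_STAGES 5   /* hypothetical number of pipeline stages */
#define EMPTY (-1)

/* latch[s] holds the id of the instruction currently in stage s, or EMPTY. */
typedef struct {
    int latch[NUM_STAGES];
    int committed;      /* number of instructions that left the last stage */
} pipeline_t;

void pipeline_init(pipeline_t *p)
{
    for (int s = 0; s < NUM_STAGES; s++)
        p->latch[s] = EMPTY;
    p->committed = 0;
}

/* Simulate one clock cycle, visiting the stages from last to first. */
void pipeline_cycle(pipeline_t *p, int next_instruction)
{
    assert(p != 0);
    /* Commit: retire whatever is in the last stage. */
    if (p->latch[NUM_STAGES - 1] != EMPTY) {
        p->committed++;
        p->latch[NUM_STAGES - 1] = EMPTY;
    }
    /* Remaining stages, back to front: move into the latch just freed. */
    for (int s = NUM_STAGES - 2; s >= 0; s--) {
        if (p->latch[s] != EMPTY && p->latch[s + 1] == EMPTY) {
            p->latch[s + 1] = p->latch[s];
            p->latch[s] = EMPTY;
        }
    }
    /* Fetch: insert a new instruction if the first latch is free. */
    if (p->latch[0] == EMPTY)
        p->latch[0] = next_instruction;
}
```

If the loop instead ran from the first stage to the last, an instruction moved into stage s+1 would be picked up again in the same cycle and could rush through the whole pipeline at once.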
If the processor needs to access memory (either an instruction, or data) it moves
vertically in the figure. Because memory accesses are typically very slow, these accesses
are not modeled in discrete time. Instead, a function calculates the latency of the operation
(in clock cycles) and an event is put in a queue (the load/store queue). This is done to
gain simulation speed. As the latency is typically high (200+ clock cycles) there is no
point in checking every cycle. This is interesting, because SimpleScalar has two separate
simulation mechanisms working simultaneously. Most of the time, SimpleScalar executes
a simple DTSS model and updates the pipeline on a cycle-by-cycle basis. However, in the
case of long latency operations, such as loads, it uses an event-driven mechanism. This is
purely done for performance reasons. An instruction usually finishes a pipeline stage in a
single cycle, but a load can consume several hundred clock cycles. Thus, by using a mixed
model, huge performance gains can be achieved.
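The event-driven half of this mixed model can be sketched as a small sorted queue of completion times. The code below is an illustration under assumptions, not the SimpleScalar implementation: the type, the fixed capacity and the function names are hypothetical. A long-latency operation is scheduled once, and the cycle-by-cycle loop only has to compare the current cycle with the head of the queue.

```c
#include <assert.h>

#define MAX_EVENTS 16   /* hypothetical capacity */

typedef struct {
    long when[MAX_EVENTS];  /* completion cycles, kept sorted ascending */
    int count;
} event_queue_t;

void eq_init(event_queue_t *q)
{
    q->count = 0;
}

/* Schedule a completion at now + latency; insertion keeps the order. */
void eq_schedule(event_queue_t *q, long now, long latency)
{
    assert(q->count < MAX_EVENTS && latency >= 0);
    long t = now + latency;
    int i = q->count++;
    while (i > 0 && q->when[i - 1] > t) {
        q->when[i] = q->when[i - 1];
        i--;
    }
    q->when[i] = t;
}

/* Called once per simulated cycle: removes every event that is due at or
 * before `now` and returns how many operations completed. */
int eq_service(event_queue_t *q, long now)
{
    int done = 0;
    while (done < q->count && q->when[done] <= now)
        done++;
    for (int i = done; i < q->count; i++)
        q->when[i - done] = q->when[i];
    q->count -= done;
    return done;
}
```

With this structure a load with a latency of 200 cycles costs one insertion and one removal, rather than 200 per-cycle state updates.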
2.6 Performance Counters
Modern processors are complex beasts. Optimizing code for the best performance can be
challenging as the programmer must understand every nuance of the architecture. This can
be a daunting task as one needs to understand what will cause a performance degradation,
simply by looking at the code.
A better approach is to instrument the code. To find out where the performance
bottlenecks are, it is possible to time the execution of the program to see where there is
potential for optimization. But timing can only tell you where your program is stalling,
it cannot tell you why.
Performance counters can give you such insights. A performance counter is simply a
counter that counts discrete events. An event can be a cache miss, a cache hit, a branch
misprediction, etc. The programmer can thus profile the code by using these counters.
Almost every modern high performance processor today has performance counters [56].
Accessing these counters is usually processor specific. In this section I will look at how
performance counters work in a specific CPU (the Athlon XP12) and present some high
level libraries that do the same work without invoking kernel-level assembly magic.
2.6.1 Performance Counters on x86
As an example, I will look at how performance counters work on the x86 architecture,
more specifically, the AMD Athlon XP line of processors. The x86 family of processors
is not homogeneous, each generation and each producer has its own set of performance
counters and different ways to access them.
The x86 family of processors has one special register: the Time Stamp Counter (TSC). The TSC stores
the number of clock cycles elapsed since the boot-up of the system. This register can be
read by issuing the “rdtsc” instruction. The number of clock cycles elapsed is then
stored in the edx:eax register pair. This instruction is particularly useful for timing small
snippets of code where precision is important. However, it is important to note that the
timing code itself consumes clock cycles, and must be accounted for when using it as a
timing device.
Listing 2.3 depicts a code snippet showing how gcc inline assembly can be used to
access this register. The “cpuid” instruction is a serializing instruction [57], which makes
sure that every previous instruction has committed. The “cpuid” instruction is mainly
used for identifying the processor; the serializing effect is only a beneficial side-effect. A
cpuid instruction will flush the pipeline before the next instruction is processed.
An explanation of the gcc inline assembly syntax might be in order. The parts of
the directive are separated by colons (“:”). The first string contains the assembler instructions;
multiple instructions can be separated by semicolons (“;”). The next part is the outputs (with
their constraints). In the case of rdtsc, this part instructs gcc to put the contents of register
eax into the variable eax and the contents of register edx into the variable edx. The remaining
fields are optional, and omitted from the “rdtsc” statement. The third field corresponds
to inputs and is handled in the same manner. The last field is the “clobbered” registers,
which tells gcc which registers will be altered by the code snippet. For a more complete
understanding of GCC inline assembly see [58].
12I chose the Athlon XP because I needed a computer that I had root access to as well as physical access.
static inline unsigned long long rdtsc_time() {
    unsigned int eax, edx;
    unsigned long long val;
    asm volatile("cpuid" : : : "ax", "bx", "cx", "dx");
    asm volatile("rdtsc" : "=a"(eax), "=d"(edx));
    val = edx;
    val = val << 32;
    val += eax;
    return val;
}
Listing 2.3: Elapsed clock cycles performance counter.
In general, the performance counters are accessed through Model Specific Registers
(MSR). In turn, the MSRs are accessed through the instructions “rdmsr” and “wrmsr” for
reading and writing respectively. The register to be accessed is selected through register ecx, and the
value to be written, or the value read, is passed in the edx:eax register pair.
However, issuing a “rdmsr” or “wrmsr” instruction in user-space triggers a segmentation
fault13. This applies to the “rdpmc” (read performance monitor counter) instruction as
well. All of these instructions are privileged instructions and can only be executed by the
kernel. This is due to security concerns. An attacker might use the performance counters
to gain information about what another user on the system is doing.
This type of attack is called a side-channel attack [59].
To enable user-space applications to read performance counters, a bit has to be set
in one of the control registers (cr4). However, the code that sets this bit has to run in kernel
mode, thus enabling this bit requires modifying the kernel. Fortunately, Linux (and most
other operating systems) allows modules that can be dynamically added to or removed from a
running kernel.
I have created such a module and the code can be found in appendix C. This kernel
module is not general purpose; its behavior on systems other than the Athlon XP is undefined
and might cause damage. It is only intended to show how performance counters work in
an actual processor.
On the Athlon XP CPU, MSR 0xC0010000 controls counter number 0 [60]. In the
kernel module, I have set this register to count the number of cache misses. The example
program reads this value by using the “rdpmc” instruction.
At the core of the example program is the code snippet in listing 2.4. This program
must be compiled without optimizations, or the compiler will simply optimize away the
loop. By running the program we obtain the following output:
Clock is 16667368882865
Number of misses: 265
Clock is 16667369028778
Hence the code caused 265 cache misses and 145913 clock cycles elapsed.
13Or a general protection fault depending on the OS.
The complete source code can be found in appendix C, which includes functions for
reading the performance registers, setting the performance registers to the wanted function
and enabling user-space access to the performance counters.
printf("Clock is %lld\n", rdtsc_time());
start = readpc();
for (i = 0; i < 1000; i++) {
    d[i] = a / c;
}
stop = readpc();
printf("Number of misses: %lld\n", stop - start);
printf("Clock is %lld\n", rdtsc_time());
Listing 2.4: Example program using performance counters.
2.6.2 Performance Counter Libraries
Using performance counters directly is a tedious and error-prone task. The resulting code
is also very processor specific. Thus there exist numerous performance counter libraries
as well as tools. Some of the most common (for Linux) are listed below:
• Oprofile[61];
• VTune[62];
• Performance Application Programming Interface (PAPI)[56].
Oprofile is a tool for profiling Linux programs (as the name implies). It uses the built
in performance counters, but also gives applications direct access to the same performance
counters. In addition, Oprofile ships with most modern Linux distributions.
VTune is a proprietary tool made by Intel to enable programmers to profile their code
through the use of performance counters and sampling.
PAPI is an abstraction layer that enables the application programmer to access the
performance counters without worrying about the underlying processor. After installing
PAPI on the system, it is very easy to use. An example program can be seen in listing 2.5.
Running this program on an AMD Opteron generates the following output:
Number of stall cycles : 1033362
Number of DL1 misses : 5309
Number of DL2 misses : 4
Number of DTLB misses : 190
Note the low number of level 2 cache misses; this is the prefetching engine in effect. If there was no
hardware prefetching present, one would expect a much larger number of L2 misses (closer
to 100000 divided by the cache line length in ints).
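The expected number of misses without prefetching can be estimated with a short calculation. The sketch below assumes that the array starts on a cache line boundary and that every line is fetched exactly once; the function name is hypothetical, and the element and line sizes used in the example (4-byte ints and 64-byte cache lines) are assumptions rather than values stated in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct cache lines touched when an array of `count`
 * elements of `elem_size` bytes is written sequentially, assuming the
 * array starts on a line boundary. */
size_t lines_touched(size_t count, size_t elem_size, size_t line_size)
{
    assert(line_size > 0);
    size_t bytes = count * elem_size;
    return (bytes + line_size - 1) / line_size;   /* round up */
}
```

Under these assumptions the loop over 100000 ints touches lines_touched(100000, 4, 64) = 6250 lines, which is the order of magnitude one would expect for the L2 miss count if no lines were prefetched.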
/* Example program using PAPI
 * Compile with:
 * gcc papi.c /opt/xd-tools/papi/3.1.0/lib/libpapi.a -o papi
 */
#include "/opt/xd-tools/papi/3.1.0/include/papi.h"
#include <stdio.h>
#include <stdlib.h> /* exit() */
// Number of events to monitor
#define NUM_EVENTS 4
int main(int argc, char *argv[]) {
    int Events[NUM_EVENTS] = {PAPI_RES_STL, PAPI_L1_DCM,
                              PAPI_L2_DCM, PAPI_TLB_DM};
    long long startvalues[NUM_EVENTS];
    long long stopvalues[NUM_EVENTS];
    int d[100000];
    int i;
    if (PAPI_start_counters(Events, NUM_EVENTS) != PAPI_OK) {
        printf("Could not start counters\n");
        exit(1);
    }
    if (PAPI_read_counters(startvalues, NUM_EVENTS) != PAPI_OK) {
        printf("ERROR: %d\n",
               PAPI_read_counters(startvalues, NUM_EVENTS));
    }
    /* The code to be monitored */
    for (i = 0; i < 100000; i++) {
        d[i] = 6;
    }
    if (PAPI_read_counters(stopvalues, NUM_EVENTS) != PAPI_OK) {
        printf("ERROR: %d\n",
               PAPI_read_counters(stopvalues, NUM_EVENTS));
    }
    printf("Number of stall cycles : %lld\n",
           stopvalues[0] - startvalues[0]);
    printf("Number of DL1 misses : %lld\n",
           stopvalues[1] - startvalues[1]);
    printf("Number of DL2 misses : %lld\n",
           stopvalues[2] - startvalues[2]);
    printf("Number of DTLB misses : %lld\n",
           stopvalues[3] - startvalues[3]);
    return 0;
}
Listing 2.5: Example program using PAPI
Chapter 3
Methodology
3.1 Memory Model
To accurately model prefetching, an accurate model of the underlying memory subsystem
is needed. In the case of prefetching there are two distinct properties that are especially
interesting:
• Latency;
• Bandwidth.
Latency is a measure of the response time (the time between the issue of a memory request
and its completion). Bandwidth is a measure of how much data the communication channel
can transmit per unit of time. It can be measured in bytes per second or as the number of
simultaneous requests. The relation between these two units is a function of both
latency and the size of each request.
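The relation between the two units can be written down explicitly. As a sketch, assume that N requests are kept in flight at all times, that every request transfers S bytes and that every request completes after the same latency L. The symbols N, S, L and B are introduced here for illustration only. The sustained bandwidth B in bytes per second is then:

```latex
B = \frac{N \cdot S}{L}
```

Under these assumptions, halving the latency or doubling the request size doubles the bandwidth in bytes per second for the same number of simultaneous requests.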
A real-life memory subsystem is complex and implementation specific. However, we
need a general model, that is easy to understand and reason about, such that the effects
of prefetching become clearer. Our model needs support for:
• Parallelism;
• Multiple banks;
• Pipelining;
• Open/Closed pages.
Vinodh Cuppu [63] observed a significant amount of locality in the address stream that
reaches the primary memory system (40% on average). A computer architect would use
this knowledge to design a modern memory subsystem that exploits this fact. An open
page policy allows memory addresses that are close to the previous address to be served
faster. Spreading the data across multiple banks also enables more concurrency
and thus more bandwidth, which in turn increases performance.
3.1.1 Implementation Details
To account for parallelism, the notion of a channel is introduced. A channel maps either to
a direct Rambus channel or a memory bank, depending on the underlying technology. By
using a mapping function, a memory access is mapped to a single channel. This mapping
function distributes cache lines such that one channel holds one cache line, and the next
channel holds the next line in a cyclic manner (see figure 3.1). By increasing the number
of channels, more bandwidth becomes available.
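A cyclic mapping of this kind can be sketched in a few lines. The function below is an illustration, not the simulator's code: its name and signature are assumptions, and it takes the block size in bytes and the number of channels as plain arguments.

```c
#include <assert.h>

/* Map a byte address to a channel: block 0 goes to channel 0, block 1 to
 * channel 1, and so on, wrapping around after the last channel. */
int channel_of(unsigned long addr, int block_size, int num_channels)
{
    assert(block_size > 0 && num_channels > 0);
    unsigned long block = addr / (unsigned long)block_size;
    return (int)(block % (unsigned long)num_channels);
}
```

With 128-byte blocks and 4 channels, the addresses 0, 128, 256 and 384 map to channels 0, 1, 2 and 3, and address 512 wraps around to channel 0 again, so a sequential access stream is spread evenly over all channels.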
Pipelining is implemented by checking if some operations can overlap. If a transfer
is currently in progress, the memory subsystem can issue the commands to the memory
chips while the other completes. The total latency would then be:
\[ \mathit{TotalLatency} = \mathit{TimeToWait} + \mathit{MemoryLatency} - \mathit{CommandTransferTime} \tag{3.1} \]
To exploit locality, our model assumes an open page policy (see section 2.3.1). To
accommodate this, the last page accessed is stored and a check is performed to see if the access hits the
same page. If it does, the time to complete the transfer is reduced.
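How contention, pipelining (equation 3.1) and the open page check could interact for a single channel is sketched below. This is one possible reading of the description and not the simulator's actual code: the type, the function and all parameter names are hypothetical, and limiting the overlap to the waiting time is an added assumption for the case where the wait is shorter than the command transfer.

```c
#include <assert.h>

/* State of one channel. Field names are hypothetical. */
typedef struct {
    long ready;      /* cycle at which the channel becomes free */
    long last_page;  /* last page accessed on this channel, -1 if none */
} channel_state_t;

/* Latency (in cycles) of an access issued at cycle `now` to `page`. */
long access_latency(channel_state_t *ch, long now, long page,
                    long memory_latency, long command_time,
                    long page_hit_bonus)
{
    assert(ch != 0 && memory_latency >= command_time);
    long latency = memory_latency;

    /* Open page policy: a hit on the last page accessed is cheaper. */
    if (page == ch->last_page)
        latency -= page_hit_bonus;

    /* Channel busy: wait for it, but overlap the command transfer with
     * the transfer in progress, as in equation 3.1. Assumption: the
     * overlap can never be longer than the wait itself. */
    if (ch->ready > now) {
        long wait = ch->ready - now;
        long overlap = wait < command_time ? wait : command_time;
        latency = wait + latency - overlap;
    }

    ch->ready = now + latency;
    ch->last_page = page;
    return latency;
}
```

For example, with a memory latency of 160 cycles, a command transfer time of 40 cycles and an open page bonus of 40 cycles (illustrative values), a first access at cycle 0 takes 160 cycles. A second access to the same page at cycle 10 has to wait 150 cycles, after which 120 - 40 = 80 further cycles remain, for a total of 230.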
Figure 3.1: Memory organization in the simulator. Each box represents a block (with byte
numbers). Each bank is separated vertically and each page is composed of two blocks
(shown in grey).
Refresh (the regular recharging of the capacitors) is not simulated, as it is assumed to
affect all benchmarks and all the configurations equally. It could instead be modelled as a
stochastic process, stalling with a given probability on every access, or the cost of
refresh could be distributed evenly over every
access. Modern memory controllers can schedule refresh in such a manner that it does not
interfere with normal operation.
3.1.2 Model Parameters
As with any model, realistic parameters are key to a realistic simulation. I have chosen to
model DDR2, based on its widespread commercial use and its documentation.
typedef struct {
    int num_channels;         /* Number of channels */
    int block_size;           /* Size of each block, usually equal
                               * to l2 blocksize (in bytes) */
    int page_size;            /* Number of blocks in a page */
    int control_time;         /* Time to transfer data */
    int core_time;            /* Time to transfer from core to
                               * latches */
    int data_time;            /* Time to transfer from latches to
                               * memory controller */
    tick_t *ready_channels;   /* When are the channels ready? */
    md_addr_t *last_address;  /* Last page accessed */
} dram_system_t;
Listing 3.1: DRAM-model datastructure.
DDR2 comes in many different configurations, differing mainly in bus speed and latency.
These two numbers are used to calculate the peak bandwidth available from a
module. As an example, a PC2-4300 DRAM module is a DDR2 module with a peak
bandwidth of 4300 MB/s [64]. As of February 2005, this type of chip commonly runs at a
clock speed of 400 MHz (double data rate) and a CAS (Column Address Strobe) latency
of 4 clock cycles. Because the data bus is only 64 bits wide and a cache line is 128 bytes,
16 data transfers are required, or 8 clock cycles (double data rate).
The CM2X512A-6400 [65], a 512MB DDR2 DRAM module manufactured by Corsair
Memory Inc, will be the simulator target. It is rated as a 4-4-4-12 module (CAS latency
- RAS to CAS latency - RAS precharge time - Cycle Time). From these numbers we
see that RAS to CAS is 4 clock cycles and CAS is 4 clock cycles. Together with the 8 clock
cycles needed for the data transfer, this gives a total of 16
clock cycles. However, the DDR2 data bus runs at only 400MHz compared to the 4GHz
achieved by the main CPU. Thus the 16 clock cycles required by the memory transaction
translate into 160 clock cycles for the processor.
For simplicity, I will assume that the full extent of the RAS to CAS latency can be
pipelined. I will also assume that a hit to an open page gives a 4 cycle bonus. Each page
is 8 cache lines large (or 1kB). These are the default values for the simulations I will be
conducting.
The number of channels available will also be a parameter. The model number (4300)
signifies that it can deliver a peak of 4.3GB/s. As each access consumes 16 clock cycles at
400MHz, this translates to about 40ns per access of 128 bytes. Each “channel” will then
be able to transfer 3.2GB/s. A full 4300 DRAM module will then require about 1.5 channels
to implement. This number will not be fixed, as the effect of bandwidth on prefetching
is one of the key topics of this report.
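The arithmetic behind these default values can be reproduced with a few helper functions. This is a sketch for checking the numbers: the function names are hypothetical, and frequencies are passed in MHz so that cycle counts convert directly to nanoseconds.

```c
#include <assert.h>

/* Convert cycles on the DRAM bus into CPU cycles. */
long dram_to_cpu_cycles(long dram_cycles, long cpu_mhz, long dram_mhz)
{
    assert(dram_mhz > 0);
    return dram_cycles * cpu_mhz / dram_mhz;
}

/* Duration of one access in nanoseconds. */
double access_time_ns(long dram_cycles, long dram_mhz)
{
    assert(dram_mhz > 0);
    return 1000.0 * (double)dram_cycles / (double)dram_mhz;
}

/* Peak bandwidth of a single channel in GB/s, which is the same as
 * bytes per nanosecond. */
double channel_bandwidth_gbs(long block_bytes, long dram_cycles,
                             long dram_mhz)
{
    return (double)block_bytes / access_time_ns(dram_cycles, dram_mhz);
}
```

With the values from the text, dram_to_cpu_cycles(16, 4000, 400) gives the 160 processor cycles per access, access_time_ns(16, 400) gives 40ns, and channel_bandwidth_gbs(128, 16, 400) gives the 3.2GB/s available per channel.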
3.1.3 DRAM Statistics
The DRAM subsystem collects the following information:
• The number of accesses to DRAM.
• The total latency imposed by the DRAM subsystem.
• The average latency of the DRAM subsystem.
• The number and percentage of accesses that hit on open pages.
• The number and percentage of accesses that are stalled due to contention.
Each of these values is implemented as a simple performance counter, or calculated as
the ratio between two counters. These values are available to the end user after a simulator
run.
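A minimal sketch of how such counters and derived ratios could look is given below. The structure and function names are assumptions made for illustration; the actual simulator code may be organized differently.

```c
#include <stdint.h>

typedef struct {
    uint64_t accesses;        /* number of accesses to DRAM */
    uint64_t total_latency;   /* sum of all access latencies */
    uint64_t open_page_hits;  /* accesses that hit an open page */
    uint64_t stalled;         /* accesses delayed by contention */
} dram_stats_t;

/* Update the raw counters for one DRAM access. */
void dram_stats_record(dram_stats_t *s, uint64_t latency,
                       int page_hit, int was_stalled) {
    s->accesses++;
    s->total_latency += latency;
    if (page_hit)    s->open_page_hits++;
    if (was_stalled) s->stalled++;
}

/* Derived values are ratios between two counters. */
double dram_ratio(uint64_t num, uint64_t den) {
    return den ? (double)num / (double)den : 0.0;
}
```

The average latency is then dram_ratio(total_latency, accesses), and the open-page hit percentage is 100 times dram_ratio(open_page_hits, accesses).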
3.1.4 Model Verification
To ensure model validity, 25 of the 26 benchmarks from the SPEC benchmark suite were
run (the last benchmark would not run with the simulator, due to issues with the data
set). The new model was compared to the old model with a fixed latency, set at 160 clock
cycles. The results can be seen in figure 3.3.
For CPU-bound applications (such as VPR and Twolf) there was little change between
the models. This was expected, as there is little communication with main memory.
In memory-bound applications (such as Apsi and Swim) we see a performance degradation
when there is not sufficient bandwidth available. In some benchmarks we see a
performance increase when using the new model. This is due to the linear way that these
programs access memory, which increases the chance of hitting open pages and reduces
the latency below the fixed latency offered by the standard model.
Figure 3.2: Flowchart indicating how the latency of a memory access is calculated.
Figure 3.3: IPC of SPEC benchmarks using the old model and 1, 2 or 3 channels in the
new model. (Bar chart; x-axis: benchmark, y-axis: IPC from 0 to 2; series: old model,
1 channel, 2 channels, 3 channels.)
3.2 Prefetching
Most of the work in implementing prefetching in SimpleScalar was done in my fifth year
project. The following section includes the documentation produced during the project to
make this thesis self-contained. In section 3.2.2 the new additions and improvements are
presented.
The documentation has been somewhat modified, and references to deprecated functions
and other modules have been omitted. For a complete description of the module,
please refer to my fifth year project report.
3.2.1 Implementation
The modifications need to be as non-intrusive as possible. The design has four major
components, as depicted in figure 3.4.
Figure 3.4: Prefetching in SimpleScalar.
The design is centered around prefetch triggers. A trigger is generated in the
SimpleScalar core when certain events occur. This trigger is then processed by the prefetching
module, which generates prefetches when needed. In brief, the program flow consists of
five steps:
1. An event (memory access, cache miss, etc) occurs in the SimpleScalar core.
2. The event data is sent to the dispatcher where it is packed into a trigger.
3. The dispatcher sends the trigger to the selected algorithm.
4. The algorithm decides how it will react to the trigger.
5. If the algorithm decides to prefetch, a prefetch request is sent to the memory
subsystem.
typedef struct {
    trigger_type_t type;     /* What happened */
    location_t     location; /* Where it happened */
    md_addr_t      address;  /* Address of memory access */
    tick_t         time;     /* Time of access */
} prefetch_trigger_t;
Listing 3.2: The prefetch data type.
In general, each prefetch trigger can cause zero, one or many prefetches, depending
on the algorithm being used. To make things modular, both the prefetch dispatcher
and the algorithms themselves are put into separate files (prefetch.c and prefetch.h).
The changes needed in each module are documented in the following sections. The
documentation follows a module-based approach, as this fits well into the code structure.
The modules will be presented in the same order as a prefetch would be handled.
Triggers and Data Structures
At the core of the design is the notion of prefetch triggers. A prefetch trigger is generated
when special events occur in the simulator. This flow is depicted in figure 3.4. Such an
event can be a miss in the L2 cache. The prefetch triggers are implemented as a data
structure. This decision has two big advantages. First, an algorithm only needs to know
about one data structure. Second, it makes the framework extensible, as designers can simply
add fields to the data structure without breaking the implementation of other algorithms.
The prefetch triggers are defined in prefetch.h (see listing 3.2).
This format is well suited because it answers three important questions:
• What kind of event occurred?
• Where did it occur?
• When did it occur?
The trigger_type_t is another data format and is implemented as an enum (see listing
3.3). This approach makes the framework extensible, as a designer can add events when
needed. In addition, the data structure location_t (listing 3.4) contains information about
where the events occur. Both the location of an event and the type of event are
used by the prefetch heuristics to determine when prefetching is needed. For example,
a heuristic that prefetches to the L2 cache might ignore all events that occur in the L1
cache, or in DRAM.
The address field is used for calculating the prefetch address. It can also be used for
other purposes, like looking up data in DRAM for pointer based prefetching.
typedef enum {
    Cache_Miss,    /* A miss in the cache */
    Cache_Hit,     /* A hit in the cache */
    Memory_Access, /* An access to memory by an instruction */
    PC_Update,     /* Program counter is updated */
    No_event       /* Dummy trigger */
} trigger_type_t;
Listing 3.3: The trigger type.
typedef enum {
    Cache_IL1, /* Event happened in the I-L1 cache */
    Cache_DL1, /* Event happened in the D-L1 cache */
    Cache_IL2, /* Event happened in the I-L2 cache */
    Cache_DL2, /* Event happened in the D-L2 cache */
    DRAM,      /* Event happened in DRAM */
    None       /* When location doesn't matter - e.g. PC_Update */
} location_t;
Listing 3.4: The location data type.
Changes to SimpleScalar Core
The prefetch triggers are generated in the SimpleScalar core. Every interaction with the
memory subsystem generates a prefetch trigger. The cache miss handlers are modified to
generate prefetch triggers, as is the simulation core. Each time such an event occurs,
process_prefetch_trigger is called. The sim-outorder.c file contains command line parsing
and initialization of the prefetch engine.
The code size in this module is kept to a minimum for two reasons: to keep things as
modular as possible, and to ease the insertion of this module into SimpleScalar version 4.
Command line parsing and initialization code could not be put into the prefetch module,
as this would depart from the code style of SimpleScalar.
Prefetch Event Dispatcher
The triggers generated in the simulator core are sent to the prefetch event dispatcher.
The dispatcher has three distinct functions. First, it packs the data into prefetch triggers.
Second, it sends the prefetch trigger to the chosen algorithm. By using function pointers it
is easy to swap algorithms while the simulation is running, and to add new algorithms, as
only a pointer needs to be updated for the new algorithm to take effect. The pointer is
named prefetch_algorithm and it can point to any algorithm. The function pointer approach
is a very clean solution to this problem, as it avoids long switch statements or nested ifs,
reducing code bloat and complexity. In addition, by using function pointers, a set structure
is imposed on a prefetching algorithm. This is a design decision: by not allowing anything
other than a prefetch trigger to be passed as a parameter to the algorithms, the functionality
of the reference implementations can be guaranteed.
/*
 * This is the sequential prefetching on miss algorithm.
 * When a miss in the cache occurs on block X, the blocks
 * X+1 to X+prefetch_degree are prefetched.
 */
void sequential_miss_prefetch(prefetch_trigger_t trigger) {
    int i;
    if ((trigger.type == Cache_Miss) &&
        (trigger.location == prefetch_location)) {
        for (i = 1; i <= prefetch_degree; i++) {
            prefetch(trigger.address + i * target_cache->bsize,
                     trigger.time);
        }
    }
}
Listing 3.5: Implementation of sequential prefetching.
The third function serves as a guard against rippling effects. Issuing a prefetch will
cause a miss to be generated in the target cache. The system will then generate a miss-type
prefetch trigger. This in turn can lead to another prefetch being issued, which is not
the intended effect. Therefore a flag (prefetch_attempt) is set while a prefetch algorithm
is executing, so that triggers caused by the prefetch itself are not processed.
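The three functions of the dispatcher can be sketched as follows. This is a simplified, self-contained illustration and not the code from prefetch.c: the parameter list of process_prefetch_trigger, the uint64_t stand-ins for md_addr_t and tick_t, and the demo_algorithm function are all assumptions.

```c
#include <stdint.h>

typedef enum { Cache_Miss, Cache_Hit, Memory_Access, PC_Update, No_event } trigger_type_t;
typedef enum { Cache_IL1, Cache_DL1, Cache_IL2, Cache_DL2, DRAM, None } location_t;

typedef struct {
    trigger_type_t type;
    location_t     location;
    uint64_t       address;  /* stand-in for md_addr_t */
    uint64_t       time;     /* stand-in for tick_t */
} prefetch_trigger_t;

typedef void (*prefetch_fn_t)(prefetch_trigger_t);

prefetch_fn_t prefetch_algorithm;  /* the currently selected algorithm */
int prefetch_attempt;              /* set while an algorithm is executing */
int triggers_delivered;            /* demo counter, not in the thesis */

void process_prefetch_trigger(trigger_type_t type, location_t location,
                              uint64_t address, uint64_t now) {
    prefetch_trigger_t t;

    /* Guard: ignore triggers caused by a prefetch that is in progress. */
    if (prefetch_attempt || prefetch_algorithm == 0)
        return;

    /* Pack the event data into a trigger. */
    t.type = type;
    t.location = location;
    t.address = address;
    t.time = now;

    /* Send the trigger to the selected algorithm. */
    prefetch_attempt = 1;
    prefetch_algorithm(t);
    prefetch_attempt = 0;
}

/* Demo algorithm: re-enters the dispatcher, as the miss caused by a
 * real prefetch would. The guard drops the nested trigger. */
void demo_algorithm(prefetch_trigger_t t) {
    triggers_delivered++;
    process_prefetch_trigger(Cache_Miss, t.location, t.address + 128, t.time);
}
```

Swapping algorithms only requires assigning a different function to prefetch_algorithm; the dispatcher itself does not change.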
Prefetch Algorithms
The algorithms are intended to be modular. A new algorithm can be plugged into the
framework with little effort. It has to be implemented as a function with a single parameter
of type prefetch_trigger_t and a return type of void. In theory, algorithms only need to
call one function, prefetch(). This function handles the actual prefetching in SimpleScalar.
The designer can thus concentrate on the algorithm and its implementation.
As an example, consider sequential prefetching. It can be implemented in 5 lines as
shown in listing 3.5. The only input is a prefetch trigger. The algorithm uses that trigger
to generate a prefetch through the prefetch() function. As can be seen from this example,
because the algorithm only uses specific parts of the data structure, adding additional data
to the prefetch triggers will not break this implementation. And by using function pointers,
we gain an extra level of abstraction.
Issuing Prefetches
When an algorithm decides to prefetch data, it can call the function prefetch(). This
function serves as an abstraction and a simplification of the memory subsystem. In addition
to sending the prefetch request to the memory subsystem, it probes the cache to see if the
cache line is already present. If it is, the prefetch is aborted. This is very useful, as this
check is common to all the prefetch algorithms. If the cache line is not in the cache, the
function issues a special kind of memory access, using the prefetch type (see memory.h).
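The probe-then-issue behaviour can be sketched with a toy direct-mapped cache. The cache model, the return value and the counter are assumptions for illustration only; the prefetch() function in the simulator operates on the SimpleScalar cache structures.

```c
#include <stdint.h>

#define LINE_BYTES 128u
#define NUM_LINES  8u

uint64_t cache_tag[NUM_LINES];
int      cache_valid[NUM_LINES];
unsigned prefetches_issued;

/* Returns 1 if the line containing addr is already in the toy cache. */
int cache_probe(uint64_t addr) {
    uint64_t block = addr / LINE_BYTES;
    unsigned idx = (unsigned)(block % NUM_LINES);
    return cache_valid[idx] && cache_tag[idx] == block;
}

/* Stand-in for the prefetch-type memory access. */
void cache_fill(uint64_t addr) {
    uint64_t block = addr / LINE_BYTES;
    unsigned idx = (unsigned)(block % NUM_LINES);
    cache_valid[idx] = 1;
    cache_tag[idx] = block;
}

/* Returns 1 if a prefetch was issued, 0 if it was aborted because the
 * line is already present. The return value is added for the demo. */
int prefetch(uint64_t addr, uint64_t now) {
    (void)now;  /* the toy model has no timing */
    if (cache_probe(addr))
        return 0;
    cache_fill(addr);
    prefetches_issued++;
    return 1;
}
```

A second prefetch to any address within the same 128 byte line is dropped without reaching the memory subsystem.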
Memory Subsystem
A prefetch is handled like any other memory request, with one exception: a prefetched
block is marked as prefetched in order to gather data on the prefetching algorithm. Every
prefetch issued is counted, as well as every successful prefetch. By using these two numbers,
the accuracy of the algorithm can be inferred.
The memory subsystem has to be able to handle partial prefetches. A partial prefetch
is a prefetch that is issued too late, thus hiding only part of the latency.
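Both ideas can be expressed in a few lines. The sketch below assumes that accuracy is defined as successful prefetches divided by issued prefetches, and that a partial prefetch is modelled by the time at which the prefetched block becomes available; the names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint64_t issued;  /* every prefetch sent to the memory subsystem */
    uint64_t useful;  /* prefetched blocks that were later referenced */
} prefetch_stats_t;

/* Fraction of issued prefetches that turned out to be useful. */
double prefetch_accuracy(const prefetch_stats_t *s) {
    return s->issued ? (double)s->useful / (double)s->issued : 0.0;
}

/* Cycles a demand load still has to wait for a block whose prefetch
 * completes at ready_time. Zero means the latency was fully hidden. */
uint64_t remaining_latency(uint64_t ready_time, uint64_t now) {
    return ready_time > now ? ready_time - now : 0;
}
```

As an example, a prefetch issued at cycle 100 with a latency of 160 cycles completes at cycle 260. A demand load for the same block at cycle 200 waits 60 cycles instead of 160, so the prefetch hides only part of the latency.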
Statistics
There are many things to consider when evaluating a prefetch scheme. The purpose of
prefetching is to speed up execution, therefore the IPC or running time of a program are
interesting parameters. In addition, the number of hits and misses in the cache is interesting,
because it is a strong indicator of the possible speedup gained by using prefetching.
Comparing the number of misses with and without prefetching gives a clear indication of
the performance of the prefetching algorithm. SimpleScalar has built-in support for these
statistics. Other measurements are collected as well, such as the number of issued prefetches,
which is closely correlated to how aggressive the algorithm is.
3.2.2 New Additions
This section describes the changes made to the original prefetching code developed during
my fifth year project. Its purpose is to differentiate between the work done as part of this
thesis and the work done as part of the project.
New Infrastructure
The infrastructure that handles prefetching has been improved, especially in the
process_prefetch_trigger code. Some common decision logic has been moved to this function
to speed up simulation; this avoids both code duplication and a costly function call. In
addition, the command-line parsing has been made clearer, as the algorithms and the trigger
types have been separated into two distinct parameters. The whole cache structure is
now available to the prefetchers, as opposed to the previous code, where only the target
cache was available for probing and prefetching.
In the previous version of the code, problems could occur with the simulator if the
prefetcher issued memory requests that were out of bounds. This issue has now been
resolved by checking if the prefetched address is within the program's address space.
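One way to perform such a check is to compare the prefetch address against the address ranges the program has mapped. The sketch below is an assumption about how this could be done; the range table and function name are hypothetical and not taken from the simulator.

```c
#include <stdint.h>

/* A half-open address range [start, end) mapped by the program. */
typedef struct {
    uint64_t start;
    uint64_t end;
} addr_range_t;

/* Returns 1 if addr lies inside one of the n mapped ranges. */
int prefetch_addr_valid(const addr_range_t *ranges, unsigned n, uint64_t addr) {
    unsigned i;
    for (i = 0; i < n; i++) {
        if (addr >= ranges[i].start && addr < ranges[i].end)
            return 1;
    }
    return 0;
}
```

A prefetcher that adds a stride to the last address of a segment would produce an address outside every range, and the request is dropped before it reaches the memory subsystem.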
Prefetching Algorithms
Three new algorithms have been implemented:
• Reference Prediction Tables;
• Address-Value Delta;
• Stream prefetching.
These heuristics are described in section 2.2.2.
Finally, due to a bug in the original delta correlation code, the performance of the delta
correlation and the CZone/Delta Correlation algorithms was lower than it should have
been. This issue has been fixed in the new code.
3.3 CMP
SimpleScalar is a simulator that is targeted at uniprocessor simulations where there is only
one program executing. A significant amount of work has been put into building a CMP
extension of SimpleScalar. This work has been done in cooperation with Haakon Dybdahl,
a PhD student at IDI. This section describes how the CMP extension to SimpleScalar
works.
3.3.1 Target Architecture
A conservative approach to CMP is to have separate homogeneous cores with private L1
caches. However, the cores still share the L2 cache as well as the memory controller. The
L1 caches are connected to the L2 by a crossbar. In addition, the L2 cache must be able
to handle multiple simultaneous requests.
This architecture is comparable to the upcoming AMD X2 processors, as well as the
Niagara T1 processor. Such an architecture is shown in figure 3.5, where a CMP with 2
cores is shown, although the simulator should be able to handle more cores.
In addition, the simulator should have the following properties:
• Be able to boot current operating systems.
• Be able to execute true parallel programs.
• Support synchronization per clock cycle. This is to ensure the operation of prefetch-
ers that work on a per clock cycle basis.
• Support an ISA that has a cross compiler available.
• Provide per-core statistics.
3.3.2 Implementation
We have based the CMP simulator on SimpleScalar, as we are familiar with its operation
and source code; it has been used in our group for nearly two years. SimpleScalar is a
uniprocessor simulator, and the easiest way to convert it to a CMP simulator is to simply
run multiple instances at the same time. However, because we want the cores to share
resources, a few issues must be resolved:
1. Sharing of the L2 cache.
2. Sharing of the DRAM interface.
3. Synchronization of the cores to ensure correct execution.
These issues will be presented in the next two sections. I will use a 2-way CMP as an
example throughout these sections, but the simulator can simulate an arbitrary number
of cores (limited by the host machine specifications).
To help with synchronization and provide a way to communicate between cores, a
separate process has been created, called the controller. The controller is invoked every
10000 clock cycles (user defined), and can perform any operation on both cores. This
functionality has been used previously to repartition or reconfigure caches. It can in effect
mimic an operating system.
Figure 3.5: Target architecture of the simulated CMP. (Diagram: CPU 1 and CPU 2, each
with a private L1 cache, connect to a shared L2 cache, which connects through the memory
controller to main memory.)
Shared Structures
As each core executes as its own process, the cores do not share address spaces. This leads
to a problem, because the data structures representing the cache must be shared between
the processes. The solution is to put the shared data structures into system shared memory.
The files shared.h and shared.c contain the routines necessary to create and use shared
memory on the Linux platform.
Using shared memory for a process in Linux is a five-step procedure:
1. Create the shared memory segment using a unique number to identify it.
2. Attach to the shared memory using the identifier in step 1.
3. Use the shared memory.
4. Detach from the shared memory.
5. When all processes have detached from the segment, it can safely be destroyed.
In practice, core #0 creates the data structures, while the other cores simply use
the shared memory already created. This is the case with both the DRAM and L2 data
structures.
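The five steps map naturally onto the System V shared memory calls available on Linux. The sketch below assumes that this is the interface in use, which the text does not state explicitly; the wrapper names are hypothetical and are not the routines from shared.c.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Steps 1 and 2: create (or look up) the segment identified by key and
 * attach it to this process. Returns NULL on failure. */
void *shared_attach(key_t key, size_t size, int *shmid_out) {
    int shmid = shmget(key, size, IPC_CREAT | 0600);
    void *mem;

    if (shmid < 0)
        return NULL;
    mem = shmat(shmid, NULL, 0);
    if (mem == (void *)-1)
        return NULL;
    *shmid_out = shmid;
    return mem;  /* step 3: the caller uses the memory */
}

/* Step 4: detach from the segment. */
int shared_detach(const void *mem) {
    return shmdt(mem);
}

/* Step 5: destroy the segment once all processes have detached. */
int shared_destroy(int shmid) {
    return shmctl(shmid, IPC_RMID, NULL);
}
```

In a setup like the one described, core #0 and the other cores would call shared_attach with the same key, and only one process would call shared_destroy at the end of the simulation.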
However, there is a catch. Pointers cannot be stored in a shared data structure, because
different processes might map the physical shared memory addresses to different logical
addresses. In essence, if one process creates a pointer in shared memory, it might point
to something else for another process. This causes a problem, because the original cache
code was very pointer-intensive and had to be reworked.
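A common way to remove pointers from a shared structure is to replace them with array indices, which have the same meaning in every process regardless of where the segment is mapped. The sketch below illustrates this for a linked list of cache blocks kept in most-recently-used order. It is an assumption about the kind of rework involved, not the actual cache code.

```c
#include <stdint.h>

#define NUM_BLOCKS 4
#define NIL (-1)

typedef struct {
    uint64_t tag;
    int32_t  next;  /* index of the next block in the set, not a pointer */
} shared_block_t;

typedef struct {
    int32_t        head;  /* index of the most recently used block */
    shared_block_t blocks[NUM_BLOCKS];
} shared_set_t;

/* Move block idx to the front of the list (most recently used). */
void set_touch(shared_set_t *set, int32_t idx) {
    int32_t prev = NIL;
    int32_t cur = set->head;

    while (cur != NIL && cur != idx) {
        prev = cur;
        cur = set->blocks[cur].next;
    }
    if (cur == NIL || prev == NIL)
        return;  /* not found, or already at the front */

    set->blocks[prev].next = set->blocks[cur].next;
    set->blocks[cur].next = set->head;
    set->head = cur;
}
```

Because the links are indices into an array inside the segment, the list is valid for every process that attaches the segment.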
Because the cores are simulated in separate address spaces, they map their own simu-
lated memory in the same way. In other words, they use the same simulated addresses for
different simulated data. To differentiate between which core actually holds valid data in
the cache, each cacheline is tagged with the core it belongs to. This has the added benefit
of increasing the amount of information regarding cache performance.
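The effect of the core tag on a lookup can be sketched as follows. The structure layout and the per-core hit counters are assumptions for illustration; the counter array hints at the extra per-core statistics that the tag makes possible.

```c
#include <stdint.h>

#define MAX_CORES 2

typedef struct {
    int      valid;
    int      core_id;  /* the core whose data is held in this line */
    uint64_t tag;
} shared_line_t;

uint64_t hits_per_core[MAX_CORES];

/* A lookup only hits if both the address tag and the owning core match,
 * since two cores use the same simulated address for different data. */
int line_lookup(const shared_line_t *line, int core_id, uint64_t tag) {
    int hit = line->valid && line->core_id == core_id && line->tag == tag;
    if (hit)
        hits_per_core[core_id]++;
    return hit;
}
```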
As the data structures reside in shared memory, it is important to control access to
them, as data corruption might occur if two different processes change a shared structure
simultaneously. In our simulator, this is solved by using semaphores with an initialization
value of one. This provides a locking mechanism for the critical regions of the program.
When the simulation is completed, the controller destroys the shared memory segments.
If the program is aborted (killed or terminated), the shared memory segments are destroyed
through a special signal handler in controller.c.
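A semaphore initialized to one behaves as a mutual exclusion lock. The sketch below uses System V semaphores, which pair naturally with System V shared memory; the text does not say which semaphore interface the simulator uses, so this choice and all function names are assumptions.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* On Linux the caller has to define union semun for semctl(). */
union semun {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

/* Create a semaphore with an initial value of one. Returns its id or -1. */
int lock_create(key_t key) {
    union semun arg;
    int semid = semget(key, 1, IPC_CREAT | 0600);

    if (semid < 0)
        return -1;
    arg.val = 1;
    if (semctl(semid, 0, SETVAL, arg) < 0)
        return -1;
    return semid;
}

/* Add delta to the semaphore value; blocks unless IPC_NOWAIT is given. */
int sem_change(int semid, short delta, short flags) {
    struct sembuf op;
    op.sem_num = 0;
    op.sem_op = delta;
    op.sem_flg = flags;
    return semop(semid, &op, 1);
}

int lock_acquire(int semid) { return sem_change(semid, -1, 0); }          /* enter critical region */
int lock_try(int semid)     { return sem_change(semid, -1, IPC_NOWAIT); } /* -1 if already held */
int lock_release(int semid) { return sem_change(semid, 1, 0); }           /* leave critical region */
int lock_destroy(int semid) { return semctl(semid, 0, IPC_RMID); }
```

Each update of a shared structure is then bracketed by lock_acquire and lock_release, so only one process at a time can be inside the critical region.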
Synchronization
To ensure that simulation is cycle accurate, the simulated cores need to be synchronized.
This ensures that accesses to DRAM and L2 happen in the correct order.
In the simulator, this is achieved by using semaphores in a ripple chain. The chain
works as follows: Core #0 starts executing its first clock cycle, while core #1 waits for its
own lock. When core #0 is finished with its first clock cycle, it releases the lock on core
#1. Then core #0 increments its cycle counter and waits for its own lock. In essence, the
two simulations alternate between executing and waiting.
If a core finishes its simulation before the others, it enters a special state where it simply
waits for its own lock and unlocks the next core without doing any actual simulation. When
all cores are finished, the controller detects this and terminates the simulation.
This is, however, not a very efficient way to handle synchronization. The two programs
only need to be synchronized when they access the L2 cache. In essence, it is only necessary
to enforce that L2 accesses occur in the correct order for correct simulation. We have
developed such an alternative. It works as a barrier that only allows the core with the
lowest clock cycle number to pass. This is implemented as a lock in the same way. We have
observed speedups of more than a factor of 3 by using this method over the ripple
method.
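The decision made by such a barrier can be sketched as a simple predicate over the per-core cycle counters. The tie-breaking rule (lowest core number first) is my assumption, as are the names; the blocking itself would be done with the locks described above.

```c
#include <stdint.h>

/* Returns 1 if the given core may proceed with its L2 access, i.e. no
 * other core is at an earlier clock cycle. Ties go to the lower core
 * number (an assumption made for this sketch). */
int core_may_access_l2(const uint64_t *cycles, int num_cores, int core) {
    int i;
    for (i = 0; i < num_cores; i++) {
        if (cycles[i] < cycles[core])
            return 0;
        if (cycles[i] == cycles[core] && i < core)
            return 0;
    }
    return 1;
}
```

A core that is ahead of the others waits at the barrier until the slower cores have caught up, while cores that do not touch the L2 cache can run freely.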
Unfortunately, this method cannot be used in this thesis. The controller and this
locking method cannot be used simultaneously: as both the controller and the barrier use
semaphores, deadlock can occur. The following example illustrates this: The controller
waits for both cores to invoke it. Core #1 waits for the controller to finish, while core #2
waits for L2 access. This issue is currently being worked on.
3.3.3 Implementation Shortcomings
The implemented simulator falls short of the ideal simulator in many ways. Most of these
are due to the simulator being based on SimpleScalar.
SimpleScalar on its own cannot boot operating systems. This is due both to its limited
ISA emulation (it cannot simulate ring-0 instructions) and to the fact that it
cannot emulate the full system (disks, network, graphics, BIOS etc). This limitation
carries over to the CMP version of the simulator. The controller can provide
limited O/S functionality, but it will have to be written on a per-project basis.
The simulator cannot execute true parallel applications. This is due to the original
SimpleScalar program loader. It has no notion of shared memory, and without this infor-
mation, the extended CMP simulator cannot map shared memory. However, it is possible
to simulate shared memory by using system calls. This has been done in a previous project,
but requires rewriting the target application to use these special system calls, which can
be a significant amount of work.
These two limitations severely restrict the number of benchmarks that can be run on
the simulator. This issue will be handled in section 3.5.1.
It is only possible to measure destructive interference with this simulator, as the memory
of the benchmarks is assumed not to overlap. There is simply no way that one core
can do anything that will benefit the other. It can only create bandwidth contention or
displace cache lines. This is a very simplistic assumption. In a real system, shared libraries
(such as libc) or other code are likely to be used by both programs, and therefore one core
can “prefetch” code for the other.
3.4 Bandwidth-Aware Prefetching
This section describes one of the most important contributions of this thesis, namely a new
heuristic for prefetching, named bandwidth-aware prefetching. This heuristic is a result
of lessons learned both from this thesis work and from previous work in the fifth
year project.
3.4.1 Motivation
In many areas of computer performance, off-chip memory bandwidth is the limiting
factor [45]. As research and industry are moving towards CMP architectures, this will become
an even more severe problem. In a CMP, the cores share the same physical package and
thus the same physical I/O pins. Increasing the number of pins on a package is
costly, as it represents an increase in cost for both packaging and materials. Thus, the
same I/O pins have to be used to serve more cores than previously. In effect, this will
reduce the bandwidth per core.
Prefetching in a CMP is still needed [12], as prefetching can significantly increase
performance. However, current prefetching methods are not 100% accurate and will thus
create unnecessary prefetches that will consume valuable bandwidth.
As the two cores share resources such as the L2 and DRAM, one core might prefetch
data to increase its performance. By doing so it might cause bandwidth contention,
leading to decreased performance for the other core. In addition, the prefetched data
might displace data needed by the other core, thus further decreasing performance.
3.4.2 Idea
The basic idea of bandwidth-aware prefetching is to use existing performance counters
to estimate future bandwidth usage, and to use this estimate to direct prefetching. The
reasoning is as follows: If there is little bandwidth contention, then issuing a prefetch will
probably not cause future bandwidth contention either. However, if there is bandwidth
contention, then the prefetch must be rejected. Even if the prefetch was accurate, it would
probably be delayed for so long that the actual load would be issued before the prefetch is
dispatched to the memory subsystem. By using such a heuristic, one core cannot overrun
the other with prefetches, because the additional prefetches will simply be rejected.
However, predicting future bandwidth usage is hard. The Network Weather Service
(NWS) project is an initiative that does research into predicting network performance
for computational grids. In a paper by Wolski [66] numerous prediction heuristics are
described. However, many of the techniques described in that paper require considerable
computing effort and are thus unsuitable for microarchitectural purposes. A naive, but
inaccurate, approach is to use the previous value as a prediction for future values. A
running mean value is a much better alternative. Performance counters hold data about
previous bandwidth usage. I will use this data to predict future bandwidth usage by
averaging the last 3 values. By averaging three values, spikes in bandwidth usage are
evened out.
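The difference between the two predictors can be seen in a small sketch. The ring buffer and the function names are illustrative assumptions; only the window of three values is taken from the text.

```c
#include <stdint.h>

#define HISTORY 3

typedef struct {
    uint32_t sample[HISTORY];
    unsigned count;  /* number of valid samples, at most HISTORY */
    unsigned next;   /* slot that the next sample will overwrite */
} bw_history_t;

void bw_record(bw_history_t *h, uint32_t value) {
    h->sample[h->next] = value;
    h->next = (h->next + 1) % HISTORY;
    if (h->count < HISTORY)
        h->count++;
}

/* Naive predictor: the most recent value. */
uint32_t bw_predict_last(const bw_history_t *h) {
    if (h->count == 0)
        return 0;
    return h->sample[(h->next + HISTORY - 1) % HISTORY];
}

/* Running mean over the last (up to) three values. */
uint32_t bw_predict_mean(const bw_history_t *h) {
    uint32_t sum = 0;
    unsigned i;
    if (h->count == 0)
        return 0;
    for (i = 0; i < h->count; i++)
        sum += h->sample[i];
    return sum / h->count;
}
```

After the samples 100, 100 and 400, the naive predictor reports 400 while the running mean reports 200, so a single spike has a much smaller influence on the prediction.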
3.4.3 Implementation
Bandwidth-aware prefetching can be seen as a supplement to regular prefetching. The
regular heuristics predict what to prefetch, while bandwidth-aware prefetching predicts
when a prefetch should be issued.
For bandwidth-aware prefetching to work successfully, it is necessary to accurately
predict future bandwidth usage. By using performance counters, it is easy to obtain the
latency of previous DRAM operations. By averaging the latency of the previous three
DRAM operations, we get an indicator of the current bandwidth usage. If the latency per
operation is low, it is an indicator that there is little bandwidth contention. If the latency
per operation is high, it is a sign that there is memory contention.
If this latency is larger than a set threshold, then the prefetch is simply discarded. If
it is less, then the prefetch is processed in the normal way.
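The decision rule can be sketched as a small filter placed in front of the regular prefetcher. The names are hypothetical, and treating a latency exactly equal to the threshold as acceptable is my assumption, since the text only covers the larger and smaller cases.

```c
#include <stdint.h>

typedef struct {
    uint32_t latency[3];  /* latencies of the last three DRAM operations */
    unsigned next;        /* slot that the next latency will overwrite */
} dram_latency_log_t;

/* Called by the memory model after each DRAM operation. */
void log_dram_latency(dram_latency_log_t *log, uint32_t cycles) {
    log->latency[log->next] = cycles;
    log->next = (log->next + 1) % 3;
}

/* Returns 1 if a prefetch may be issued, 0 if it should be discarded.
 * Comparing the sum against 3 * threshold avoids integer division. */
int bandwidth_allows_prefetch(const dram_latency_log_t *log, uint32_t threshold) {
    uint64_t sum = (uint64_t)log->latency[0] + log->latency[1] + log->latency[2];
    return sum <= 3u * (uint64_t)threshold;
}
```

With a threshold of 200 cycles, three operations of 160 cycles each let prefetches through. Once two of the last three operations take 400 cycles, the average rises to 320 cycles and further prefetches are discarded until the contention subsides.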
3.5 Benchmarks
In the fifth year project I used a subset of the SPEC2000 [67] benchmark. SPEC2000 is a
benchmark suite that uses kernels of many common scientific and engineering programs.
Many programs are taken directly from common GNU programs such as Gzip, Bzip2
and Gcc. It is commonly used in computer architecture research and is the standard for
measuring performance in many settings.
In the fifth year project seven benchmarks were chosen somewhat arbitrarily. The ones
used were Gzip, Gcc, Crafty, Mcf, Swim, Mgrid and Equake. Many benchmarks gained
little benefit from prefetching, as they were mainly compute-bound. The benchmarks in
the SPEC2000 suite are summarized in tables 3.1 and 3.2.
To decrease simulation time I have used the reduced datasets [68]. The reduced datasets
are smaller than the originals, but behave in the same manner. The purpose of the reduced
datasets is to decrease simulation time, while keeping the instruction mix of the original
workload. It is important to note that when using the reduced datasets¹ the problems are
smaller and might more easily fit into the cache, thus causing different behavior from the
reference sets. However, in this thesis we are mainly interested in comparing prefetching
schemes to each other, so this problem will not become significant.
Benchmark     Language   Category
164.gzip      C          Compression
175.vpr       C          FPGA Circuit Placement and Routing
176.gcc       C          C Programming Language Compiler
181.mcf       C          Combinatorial Optimization
186.crafty    C          Game Playing: Chess
197.parser    C          Word Processing
252.eon       C++        Computer Visualization
253.perlbmk   C          PERL Programming Language
254.gap       C          Group Theory, Interpreter
255.vortex    C          Object-oriented Database
256.bzip2     C          Compression
300.twolf     C          Place and Route Simulator
Table 3.1: SPEC 2000 Integer benchmarks [67].
Running the reference sets is not an option, as simulation time would amount to
days [69]. This would be very costly in terms of computer time, and it would also limit
the number of experiments that can be conducted, leaving less time to explore the design
space. The main purpose of the experiments in this thesis is not to achieve
high accuracy in specific synthetic benchmarks, but to highlight the effects of prefetching
under different circumstances.
A common way to use the reference set in simulations is to fast forward around 1
billion instructions and then do the accurate simulation for around 1 billion instructions.
This method also has problems, as the large scale structures (branch predictor, caches)
would be in a cold state. Simpoint [70] solves this problem by storing the state of branch
predictors and caches at specific points. However, this approach assumes that the memory
reference stream does not change between executions. This assumption does not hold when
using prefetching, as different prefetching algorithms will affect the state of the cache; thus
Simpoint could not be used.
¹ Also called lgred.
Benchmark      Language     Category
168.wupwise    Fortran 77   Physics / Quantum Chromodynamics
171.swim       Fortran 77   Shallow Water Modeling
172.mgrid      Fortran 77   Multi-grid Solver: 3D Potential Field
173.applu      Fortran 77   Parabolic / Elliptic Partial Differential Equations
177.mesa       C            3-D Graphics Library
178.galgel     Fortran 90   Computational Fluid Dynamics
179.art        C            Image Recognition / Neural Networks
183.equake     C            Seismic Wave Propagation Simulation
187.facerec    Fortran 90   Image Processing: Face Recognition
188.ammp       C            Computational Chemistry
189.lucas      Fortran 90   Number Theory / Primality Testing
191.fma3d      Fortran 90   Finite-element Crash Simulation
200.sixtrack   Fortran 77   High Energy Nuclear Physics Accelerator Design
301.apsi       Fortran 77   Meteorology: Pollutant Distribution
Table 3.2: SPEC 2000 Floating-Point benchmarks [67].
To understand the basic behaviour of the SPEC2000 benchmark suite, a simple simulation
run was done with SimpleScalar. The configuration was chosen to match an Intel
Pentium 4 processor as closely as possible in terms of the cache hierarchy. The simulated
processor had two 8KB level 1 caches (one instruction cache and one data cache). These
caches were 4-way set associative with 32 sets and 64-byte cache lines. The unified level 2
cache was 512KB and organized as an 8-way set-associative cache with 512 sets and 128-byte
cache lines. The latency of the level 1 cache was 2 clock cycles and the level 2 latency
was 7 clock cycles. Memory latency was fixed to 160 clock cycles with no bandwidth
limitations. The processor was a 4-way superscalar out-of-order processor. This large issue
width will ensure that most of the available ILP will be realized, and the effects of the
memory subsystem can be studied. The results from running the benchmarks can be seen
in table 3.3.
Table 3.3: Characteristics of the SPEC2000 benchmark suite.
Benchmark  IPC     # Insts   # Loads   % Loads  # Acc. L2  # Miss L2  % Miss L2  MPI
Gzip       1.6052  5.93E+08  1.50E+08  25.23%   1.86E+07   1.58E+05    0.85%       267.01
Gcc        1.1417  5.12E+09  1.77E+09  34.67%   2.47E+08   5.26E+06    2.13%      1027.48
Crafty     1.1002  8.35E+08  2.61E+08  31.27%   7.38E+07   1.33E+05    0.18%       159.51
Mcf        0.2771  7.94E+08  2.53E+08  31.87%   7.57E+07   1.54E+07   20.31%     19379.82
Swim       0.6374  4.31E+08  1.01E+08  23.51%   1.60E+07   4.69E+06   29.28%     10864.85
Mgrid      1.2834  1.15E+08  3.49E+07  30.34%   2.32E+06   3.75E+05   16.19%      3266.35
Equake     1.3092  1.02E+09  2.84E+08  27.77%   2.27E+07   9.40E+05    4.14%       919.72
Applu      1.0422  8.82E+07  2.31E+07  26.24%   2.90E+06   3.08E+05   10.63%      3496.24
Vpr        1.4494  1.57E+09  4.42E+08  28.21%   5.21E+07   2.73E+03    0.01%         1.74
Ammp       0.0901  1.25E+09  3.26E+08  26.12%   1.47E+08   6.97E+07   47.46%     55851.49
Mesa       1.825   1.61E+09  3.46E+08  21.54%   2.16E+07   1.15E+05    0.53%        71.26
Galgel     1.7781  3.48E+08  1.18E+08  33.85%   9.37E+06   4.08E+04    0.44%       117.30
Lucas      1.6807  1.87E+08  3.40E+07  18.17%   5.32E+06   1.84E+03    0.03%         9.83
Fma        1.0827  6.68E+08  1.62E+08  24.16%   5.63E+07   3.44E+03    0.01%         5.14
Parser     1.0171  4.53E+09  1.27E+09  28.00%   9.04E+07   9.53E+06   10.54%      2104.86
Eon        1.2676  1.07E+09  3.22E+08  30.08%   4.72E+07   4.03E+03    0.01%         3.76
Perlbmk    1.2433  2.06E+09  6.20E+08  30.10%   5.10E+07   2.42E+05    0.48%       117.64
Gap        1.0243  7.61E+08  2.40E+08  31.49%   3.36E+07   1.90E+06    5.64%      2491.11
Vortex     1.1337  4.51E+05  1.35E+05  29.92%   1.70E+04   2.75E+03   16.18%      6110.81
Bzip2      1.1348  1.82E+09  5.09E+08  27.96%   2.98E+07   5.28E+06   17.72%      2904.17
Apsi       1.2933  3.40E+08  6.96E+07  20.46%   1.86E+07   1.59E+06    8.57%      4685.92
Wupwise    1.3589  5.22E+09  9.53E+08  18.27%   3.31E+07   8.52E+06   25.74%      1632.39
Twolf      1.1976  9.73E+08  2.56E+08  26.32%   5.26E+07   5.72E+03    0.01%         5.88
Facerec    1.5768  2.52E+08  5.65E+07  22.38%   1.80E+06   2.80E+05   15.61%      1112.04
Art        0.1491  1.66E+09  5.20E+08  31.33%   2.78E+08   9.19E+07   33.07%     55330.91
3.5. BENCHMARKS CHAPTER 3. METHODOLOGY
Unfortunately, the vortex application did not work correctly due to errors in the data
files. The size of each benchmark can be seen as the number of instructions executed.
Although the percentage of loads is roughly constant across programs (18% - 35%), the
number of misses in the L2 cache varies by a large amount. Suleyman Sair [71] has used
misses per instruction (MPI) as a metric to characterize benchmarks. This metric provides
an indication of how memory intensive an application is.
Using a limit of 3000 MPI the following benchmarks are characterized as memory
intensive:
• Mcf
• Swim
• Mgrid
• Applu
• Ammp
• Apsi
• Art
These benchmarks will be used for most of the experiments due to their memory-
bound nature. A few selected experiments will be run with the full suite to ensure that
prefetching does not introduce regressions in compute-bound applications.
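The selection above can be reproduced from the counts in table 3.3. The sketch below is only an illustration: the function names are hypothetical, and it assumes that the MPI column is expressed as L2 misses per million instructions, which is what the tabulated values suggest (for Gzip, 1.58E+05 / 5.93E+08 is about 266 per million). Vortex is left out because its run was invalid.

```python
# (instructions, L2 misses) copied from Table 3.3.
# Vortex is omitted because its data files were broken.
COUNTS = {
    "Gzip": (5.93e8, 1.58e5), "Gcc": (5.12e9, 5.26e6),
    "Crafty": (8.35e8, 1.33e5), "Mcf": (7.94e8, 1.54e7),
    "Swim": (4.31e8, 4.69e6), "Mgrid": (1.15e8, 3.75e5),
    "Equake": (1.02e9, 9.40e5), "Applu": (8.82e7, 3.08e5),
    "Vpr": (1.57e9, 2.73e3), "Ammp": (1.25e9, 6.97e7),
    "Mesa": (1.61e9, 1.15e5), "Galgel": (3.48e8, 4.08e4),
    "Lucas": (1.87e8, 1.84e3), "Fma": (6.68e8, 3.44e3),
    "Parser": (4.53e9, 9.53e6), "Eon": (1.07e9, 4.03e3),
    "Perlbmk": (2.06e9, 2.42e5), "Gap": (7.61e8, 1.90e6),
    "Bzip2": (1.82e9, 5.28e6), "Apsi": (3.40e8, 1.59e6),
    "Wupwise": (5.22e9, 8.52e6), "Twolf": (9.73e8, 5.72e3),
    "Facerec": (2.52e8, 2.80e5), "Art": (1.66e9, 9.19e7),
}


def misses_per_million(insts, l2_misses):
    """L2 misses per million committed instructions (assumed MPI unit)."""
    return l2_misses / insts * 1e6


def memory_intensive(counts, limit=3000.0):
    """Names of the benchmarks whose MPI exceeds the limit, in table order."""
    return [name for name, (insts, misses) in counts.items()
            if misses_per_million(insts, misses) > limit]


if __name__ == "__main__":
    print(memory_intensive(COUNTS))
```

Because the table entries are rounded, the recomputed values differ slightly from the MPI column, but the set of benchmarks above the limit is the same.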
3.5.1 CMP Benchmarking
Benchmarking a CMP is much more difficult than benchmarking uniprocessors. Several
new questions arise or become more difficult:
1. What benchmarks can be used?
2. If one core experiences performance degradation while the other experiences a speedup,
how do we measure the net result?
3. Can the performance of the entire system be characterized by a single number?
I will start with the first question. On uniprocessors it is very easy to use the SPEC2000
benchmarks. They require little or no operating system support, and precompiled binaries
are available for almost every platform. In addition, the suite is very commonly used in the
literature, and the benchmarks themselves have been studied thoroughly. There are 26
benchmarks in the suite, making it possible to run them all and get a good picture of the
performance of a system.
There exist numerous benchmarks that can be used for CMP systems. Some of the
most commonly seen in the literature are:
• Multiple instances of SPEC2000.
• TPC-C [72]
• SPECWeb [73]
• SPECjAppServer [74]
• Linpack [75]
• NAS Parallel Benchmarks (NPB) [76]
TPC-C is a transaction processing benchmark. It is mainly used for benchmarking
database performance. The main idea is to issue a large number of queries and measure
the throughput of the system in terms of completed transactions. It is important to
note that the performance measure in this benchmark is TPM (transactions per minute).
Simply measuring IPC would be meaningless, as spinlocks generate a lot of committed
instructions but no forward progress in terms of completing transactions. In
addition, TPC-C requires two things that SimpleScalar cannot provide:
• Full system simulation (including OS);
• Shared memory (see footnote 2).
In addition, the benchmark itself introduces a lot of tuneable parameters, both at the
OS level and at the database level. Another problem is that it provides only a single
benchmark that produces a single number. Thus it is harder to analyze exactly what
causes a speedup or a slowdown. Therefore TPC-C cannot be used as a benchmark for
my purposes.
Linpack [75] is a benchmark that is used to benchmark the most powerful supercomputers
in the world. The rankings in the TOP500 list are based on each computer's performance
on the Linpack benchmark. Linpack does not require a large OS (most supercomputers
use small kernels as the operating system on compute nodes). However, it does require
an MPI (Message Passing Interface) and a BLAS (Basic Linear Algebra Subprograms)
implementation. Both BLAS and MPI are highly tuned to the underlying architecture,
and would require a rewrite if the architecture changes. Thus Linpack cannot be used.
Both SPECWeb and SPECjAppServer are transaction-oriented benchmarks. They are
both aimed at web applications. Like TPC-C, the primary purpose of these benchmarks is to
measure throughput in terms of clients served per unit of time. However, the main focus is
network performance, both in terms of latency and bandwidth. Thus, these benchmarks
cannot be used for my purposes, as the main bottleneck being studied is beyond the
microarchitectural level.
NPB is a benchmark suite developed by NASA to help evaluate the performance
of parallel supercomputers. The benchmarks themselves are based on computational
fluid dynamics and consist of five kernels and three applications. The benchmarks
come in different flavours:
• NPB 1: Vendors can choose how to implement the programs using their own pro-
gramming models.
• NPB 2: MPI based source code, that should be able to run efficiently without
modification.
• NPB 3: Implementation based on OpenMP.
In addition, Grid and multi-zone versions exist. The problem with using NPB is that no
MPI or OpenMP implementation exists for the simulator. In addition, writing an NPB 1
implementation from scratch would be a very large undertaking.
Footnote 2: Magnus Jahre at the NCAR group is working on an implementation of shared
memory in SimpleScalar.
On the other hand, running multiple instances of the SPEC2000 benchmark is possible.
The main advantages of such a solution are:
• I am familiar with how they work;
• Low setup time;
• Designed to evaluate microarchitectural performance;
• Does not require any special OS support or libraries.
However, the SPEC2000 benchmarks are designed to run on uniprocessors. Simply
running multiple instances would become unrealistic, as there is no communication be-
tween cores. Thus, this choice in benchmarks limits the research to systems with no
shared memory or cache coherence problems.
Using multiple SPEC2000 benchmarks is a technique that is very common in SMP,
SMT and CMP research, mainly for the above reasons. There is a dramatic shortage of
proper microarchitectural benchmarks for CMP systems.
Running every combination of the SPEC2000 benchmarks would yield 26^4 = 456976
experiments for each configuration. This would require an enormous amount of time,
and cannot be considered. A better solution would be to simply use a subset of all the
possible permutations and use statistics to evaluate the performance.
However, the statistical distribution of the benchmarks is not known, and one cannot
simply assume a gaussian distribution. Thus it is very hard to measure the confidence
interval of any given result. A possibility is to simply assume that a “large enough” sample
would result in a “small enough” confidence interval.
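Drawing such a subset can be sketched as follows. This is only an illustration under stated assumptions: the exponent in 26^4 suggests four cores with one benchmark per core, the benchmark names are placeholders, and the helper names and the fixed seed are hypothetical choices made here so that the sample is reproducible.

```python
import random

# Placeholder names standing in for the 26 SPEC2000 benchmarks.
BENCHMARKS = [f"bench{i:02d}" for i in range(26)]
CORES = 4  # assumption inferred from the 26^4 figure in the text


def total_workloads(n_benchmarks, cores):
    """Number of ordered assignments of one benchmark to each core."""
    return n_benchmarks ** cores


def sample_workloads(benchmarks, cores, count, seed=0):
    """Draw `count` random workloads; each workload has one benchmark per core."""
    rng = random.Random(seed)  # fixed seed so an experiment can be repeated
    return [tuple(rng.choice(benchmarks) for _ in range(cores))
            for _ in range(count)]


if __name__ == "__main__":
    print(total_workloads(len(BENCHMARKS), CORES))
    for workload in sample_workloads(BENCHMARKS, CORES, count=5):
        print(workload)
```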
A weak statistical test that can be used is to reduce each experiment to a binary result:
record how many cases experience a performance increase and how many cases experience
a performance degradation. Then the binomial distribution can be used to determine if
one configuration is truly better than another in a statistically significant manner.
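A test of this kind is commonly called a sign test. The sketch below is one possible formulation, not necessarily the one used for the results in this thesis: it assumes that ties are discarded and that, under the null hypothesis, an improvement and a degradation are equally likely. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided p-value for the null hypothesis P(win) = P(loss) = 0.5.

    `wins` counts the workloads that got faster and `losses` those that
    got slower; ties are assumed to have been discarded beforehand.
    """
    n = wins + losses
    k = max(wins, losses)
    # Probability of seeing k or more outcomes on one side out of n.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical outcome: 9 workloads improve, 1 degrades.
    print(sign_test_p_value(9, 1))
```

A small p-value indicates that the observed imbalance between improvements and degradations is unlikely to be due to chance alone. The test says nothing about the size of the improvement, which is why it is described as weak.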
Characterizing performance with a single number
Measuring performance is a tricky task even for a single processor system. In a CMP
system the problem gets more complicated. A change in the architecture might improve
performance for some benchmarks while degrading performance for others. When this
happens to a CMP, how can we measure the effect on the entire system?
But first, what is performance? James Smith uses the following guideline [77]:
The time required to perform a specified amount of computation is the ultimate
measure of computer performance.
In other words, how much faster can the work be done, if I switch systems?
In my setup I will run several SPEC2000 benchmarks on the simulated CMP system.
Some of the benchmarks will experience a performance degradation due to prefetching,
while others will experience a speedup. How do you measure the overall effect on system
performance?
A possible solution is to simply average the results. The mathematical formula for
averaging numbers is well known, and is shown in equation 3.2. In this equation n is the
number of programs being averaged and M_i is the running time of program i.
\[ A = \frac{1}{n} \sum_{i=1}^{n} M_i \tag{3.2} \]
This method is known as the arithmetic mean. However, there are other ways to
produce a single number. One can also use the geometric mean (equation 3.3) or the
harmonic mean (equation 3.4) [77, 78].
\[ G = \sqrt[n]{\prod_{i=1}^{n} M_i} \tag{3.3} \]
\[ H = \frac{n}{\sum_{i=1}^{n} \frac{1}{M_i}} \tag{3.4} \]
The use of the geometric mean is questionable. The single number produced by such
an operation should be indicative of end user application performance. However, it can
be shown that geometric mean does not predict actual performance in a meaningful way.
The paper by Smith has numerous examples where this is proven [77].
The harmonic mean is useful for averaging rates [79]. As an example, consider a car
driving at 50 km/h. Its destination is 100 km away, so the car spends 2 hours
getting there. On the return trip, the car drives at 100 km/h, and thus spends 1 hour on
the way back home. If one takes the arithmetic mean of the two speeds one obtains:
\[ A = \frac{50 + 100}{2} = 75 \tag{3.5} \]
Which implies that the average speed is 75km/h. However, the total distance travelled is
200km and the driver spends 3 hours in the car, thus the equivalent speed is:
\[ A = \frac{200}{3} \approx 66.67 \tag{3.6} \]
Which is obviously a more informative value. The harmonic mean calculates this value
directly:
\[ H = \frac{2}{\frac{1}{50} + \frac{1}{100}} \approx 66.67 \tag{3.7} \]
To summarize, the arithmetic mean is useful for evaluating performance measured in
time (Execution time), while harmonic mean is useful for evaluating performance measured
in rates (Mflops/IPC).
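The three means in equations 3.2 to 3.4 can be checked numerically against the car example. The sketch below is a direct transcription of the formulas; the function names are chosen here for illustration only.

```python
import math


def arithmetic_mean(values):
    """Equation 3.2: the sum of the values divided by their count."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Equation 3.3: the n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Equation 3.4: the count divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


if __name__ == "__main__":
    speeds = [50.0, 100.0]  # km/h, out and back over the same distance
    print(arithmetic_mean(speeds))
    print(geometric_mean(speeds))
    print(harmonic_mean(speeds))
```

For the two speeds in the example, the harmonic mean equals the total distance divided by the total time (200 km in 3 hours), while the arithmetic mean overestimates the equivalent speed.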
Finally, it is worth noting that normalization should be done after averaging performance.
If normalization is performed before averaging, the results will become erroneous
[78].
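The effect of the order of normalization and averaging can be illustrated with a small worked example. All numbers below are hypothetical running times invented for the illustration (two machines X and Y, two programs), and the helper names are not taken from the thesis.

```python
# Hypothetical running times in seconds for programs P1 and P2.
TIMES = {"X": [1.0, 1000.0], "Y": [10.0, 100.0]}


def mean(values):
    return sum(values) / len(values)


def mean_of_normalized(times, machine, reference):
    """Normalize each program to the reference machine first, then average."""
    return mean([t / r for t, r in zip(times[machine], times[reference])])


def normalized_mean(times, machine, reference):
    """Average the running times first, then normalize the averages."""
    return mean(times[machine]) / mean(times[reference])


if __name__ == "__main__":
    # Normalizing first makes each machine look slower than the other.
    print(mean_of_normalized(TIMES, "Y", "X"), mean_of_normalized(TIMES, "X", "Y"))
    # Averaging first gives two ratios that are reciprocals of each other.
    print(normalized_mean(TIMES, "Y", "X"), normalized_mean(TIMES, "X", "Y"))
```

With these numbers, normalizing before averaging yields a ratio of 5.05 in both directions, so the conclusion depends on which machine is chosen as the reference. Averaging first gives a consistent answer regardless of the reference.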
Chapter 4
Results
4.1 Overall Plan for the Experiments
This chapter describes the results obtained with the simulator. In computer architecture
research there is a large degree of freedom, so it is natural to limit the number of
parameters to be studied. The fixed parameters are presented in section 4.2.
The experiments presented in this chapter use an iterative approach. The first
experiments are performed on a uniprocessor with no bandwidth limitations (section 4.3.1).
As mentioned in section 2.2, prefetching data can cause two ill effects:
1. Useful data is evicted from the cache.
2. Prefetching consumes bandwidth.
By eliminating effect two, we can examine effect one directly. The purpose of the first
batch of experiments is to simply evaluate effect number one. Some sensitivity analysis is
performed to better understand the prefetching heuristics, and for selecting good values
for future experimentation.
In section 4.3.2 limited bandwidth is introduced. This allows us to evaluate effect two,
since we have data about effect one. This work will mainly focus on benchmarks that are
memory-bound.
In section 4.4 we look at prefetching in a CMP environment. This work will be based
on the previously obtained uniprocessor results.
4.2 Experimental Setup
This section describes the parameters that I have used in my experiments. The parameters
were chosen to match an aggressive 4-way superscalar core. The number of functional
units as well as the size of the load/store queue were chosen accordingly. In addition, a
more accurate branch predictor was chosen (rather than the default bimodal predictor).
The cache hierarchy used is the same as for the Pentium 4. The size of the L2 is small
compared to contemporary high-end processors. This choice was deliberate as it would
force more capacity misses as well as conflict misses. This is important as the relatively
small footprints of the lgred benchmarks would fit entirely in a larger cache.
The default DRAM parameters are taken from section 3.1.2. The setup is summarized
in table 4.1.
Parameter                       Value
Clock speed                     4 GHz
Register Update Unit size       16 instructions
Load/Store Queue size           8 instructions
Fetch Queue size                4 instructions
Fetch, Issue, Decode
and Commit width                4 instructions/cycle
Functional units                4 Integer ALU, 1 Integer Multiply/Divide,
                                4 Floating Point ALU,
                                1 Floating Point Multiply/Divide
Branch predictor                Combined, Bimodal 4K entry table,
                                2-level 1K table, 10 bit history table,
                                4K Chooser, 4-way 512 entry BTB
Branch misprediction penalty    15 clock cycles
Translation Lookaside Buffer    128 entry fully associative (both data and
                                instructions), 30 cycle miss penalty
Level 1 Data cache              8KB 4-way set associative,
                                64B blocks, LRU replacement policy,
                                2 cycles latency
Level 1 Instruction cache       8KB 4-way set associative,
                                64B blocks, LRU replacement policy,
                                2 cycles latency
Level 2 Unified cache           512KB 8-way set associative,
                                128B blocks, LRU replacement policy,
                                7 cycles latency
DRAM                            2 channels, 40 clock cycle command transfer,
                                40 clock cycles access time,
                                80 clock cycles transfer time,
                                Double Data Rate @ 400 MHz
Table 4.1: Simulation parameters.
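The DRAM timings in table 4.1 can be related to wall-clock time and bandwidth with some simple arithmetic. The sketch below is a back-of-the-envelope check, not part of the simulator: it assumes that the three DRAM phases are serialized for an isolated access, that each 80-cycle transfer moves one 128 B L2 block, and that transfers on a channel can follow each other back to back. The constant and function names are hypothetical.

```python
CLOCK_HZ = 4e9           # core clock from Table 4.1
COMMAND_CYCLES = 40      # command transfer
ACCESS_CYCLES = 40       # access time
TRANSFER_CYCLES = 80     # data transfer time
L2_BLOCK_BYTES = 128     # assumed to be the unit moved per transfer
CHANNELS = 2


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert core clock cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


def unloaded_latency_cycles():
    """Latency of an isolated access if the three phases are serialized."""
    return COMMAND_CYCLES + ACCESS_CYCLES + TRANSFER_CYCLES


def peak_bandwidth_bytes_per_s(channels=CHANNELS):
    """Upper bound if every channel streams one block per transfer period."""
    seconds_per_block = TRANSFER_CYCLES / CLOCK_HZ
    return L2_BLOCK_BYTES / seconds_per_block * channels


if __name__ == "__main__":
    print(unloaded_latency_cycles(), cycles_to_ns(unloaded_latency_cycles()))
    print(peak_bandwidth_bytes_per_s() / 1e9)
```

Under these assumptions the unloaded latency is 160 clock cycles (40 ns), which agrees with the fixed 160-cycle memory latency used for the benchmark characterization in section 3.5. The per-channel bound works out to 6.4 GB/s.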
4.3 Uniprocessor
Before experimenting with CMP configurations this section will examine each prefetching
heuristic in detail. The purpose is to establish a thorough understanding of how each
heuristic works.
In the next section we examine how prefetching works on a uniprocessor where there are
no bandwidth limitations, in order to examine how increasing prefetching degree interacts
with system performance. In section 4.3.2, I will examine how limiting bandwidth affects
prefetching. The main focus of that section is to examine how limited bandwidth interacts
with the prefetching degree for different heuristics on selected benchmarks. In the last
section (4.3.3) I will explore the performance of my own contribution in a uniprocessor
context. I will explore how bandwidth aware prefetching interacts with performance, both
in terms of IPC and bandwidth usage.
4.3.1 Unlimited Bandwidth
Prefetching can potentially cause performance degradation. As stated in section 2.2, this
effect can be caused by two issues:
1. Displacing useful data already in the cache.
2. Displacing other useful memory requests on the system bus.
To quantify the effects, a simple experiment was run with no bandwidth limitations. The
prefetching degree was set to 1 for every algorithm. In addition, the tables were set to
1024 entries for every type (RPT, Stream, DC and C/DC). The results can be seen in
figure 4.1.
In this experiment we see a lot of variation in terms of the effect of adding prefetching.
By comparing no prefetching to the ’perfect L2’ case we see that some benchmarks can
potentially have large speedups due to prefetching. Other benchmarks get almost no
benefit from a perfect L2. This corroborates the analysis made in section 3.5. In that
section, we observed that some benchmarks have relatively few L2 misses per instruction.
These benchmarks will therefore gain little from both prefetching and a perfect L2. By
removing the benchmarks that receive little or no effect from a perfect L2 we obtain the
graph in figure 4.2.
We see that stream prefetching is by far the most efficient prefetcher. In one case
(art) it outperforms the perfect L2. This might sound surprising, but stream prefetching
prefetches to the L1 cache, thus achieving a small speedup.
Although this experiment is run with a prefetching degree of 1, we see significant
improvements with all prefetching algorithms. The performance in terms of IPC is almost
always at least equal to the case where no prefetching is performed. This leads to the
interpretation that the effect of prefetching displacing other useful data in the cache is
relatively small compared to the benefit.
We see that AVD prefetching is not very effective; only a few benchmarks show any
speedup at all. This is due to the way AVD prefetching works. SPEC2000 does not contain
pointer intensive code, thus the effect would be minimal (the original paper describing AVD
used SPEC95 benchmarks [9]). In addition, AVD prefetching cannot prefetch with enough
distance to be truly useful.
Figure 4.1: Performance of prefetching on processors with unlimited bandwidth.
(Plot of IPC per benchmark for all benchmarks. Series: None, Sequ, DC, CDC, RPT, Stream, AVD, Perfect L2.)
Figure 4.2: Similar to figure 4.1, but with uninteresting benchmarks removed.
(Plot of IPC per benchmark for art, facerec, wupwise, bzip2, gap, parser, ammp, applu, equake, mgrid, swim, mcf and gcc. Series: None, Sequ, DC, CDC, RPT, Stream, AVD, Perfect L2.)
We also see that there is a lot of potential for prefetching, as there is a large performance
gain when using a perfect L2. We see that the stream and RPT prefetchers perform very
well in most cases. This is not surprising, given the large amount of information used by
these prefetchers.
In figure 4.3 a similar experiment is run, but with a prefetching degree of 8. Again,
the uninteresting benchmarks are removed. In this experiment we observe that increasing
the prefetching degree generally increases the performance benefit of prefetching. This
is especially true for the simpler algorithms such as sequential prefetching. We observe
that the performance of Mcf is decreased by 8% by using sequential prefetching with
a prefetching degree of 8, which was the largest degradation observed. However, only 6% of
the experiments showed a performance degradation. Figure 4.4 shows the speedup of each
heuristic when the prefetching degree is increased from 1 to 8. For sequential prefetching
on Swim we observe a performance increase by a factor of 1.7 in terms of IPC. A similar
effect can be seen on Art, where the speed increase is 138% for sequential prefetching.
As in the previous experiment we see little performance degradation due to the effects of
prefetching, although more blocks are prefetched. Mcf has the largest speed degradation
due to increasing the prefetching degree. This is a relatively minor change, but it is
important to remember when moving to CMP, because of the increased pressure on the
cache.
The accuracy and coverage of the heuristics are interesting, because they indicate how
successful the heuristics are at predicting memory accesses. In table 4.2 the accuracy of
each prefetching algorithm is listed. We see that the RPT and Stream prefetchers are very
accurate, close to 100 % for many benchmarks. This is not surprising considering the com-
plexity of these heuristics. In addition, we observe that DC prefetching is generally less
accurate than sequential prefetching, which is somewhat surprising, given the relatively
simple design of the sequential heuristic. However, C/DC prefetching is much more accu-
rate than both of them, and is much more robust across benchmarks. AVD prefetching
has a very low accuracy, as most of these benchmarks are not pointer-intensive. Also note
that the accuracy for sequential prefetching on Ammp is wrong. The performance counters
that were used to count prefetches rolled over and thus produced the wrong results.
In table 4.3 the coverage of each heuristic is listed. Again, AVD performs very badly:
it detects less than 7% of the prefetching opportunities in every benchmark. However,
it is clear that all heuristics can be improved given the relatively low coverage across all
benchmarks. It is interesting to note that on the Art, Perlbmk and Ammp benchmarks,
both coverage and accuracy are above 97% for the RPT and Stream prefetchers. Such an
impressive result will be hard to outperform.
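As a reminder of what the two metrics measure, the sketch below uses the definitions that are common in the prefetching literature. They are an assumption here and may differ in detail from the counters used in the simulator. The `wrap` helper illustrates how a fixed-width counter that rolls over distorts such a ratio, as happened for sequential prefetching on Ammp; the 32-bit width is also an assumption.

```python
def accuracy(useful_prefetches, issued_prefetches):
    """Fraction of the issued prefetches that were later used (assumed definition)."""
    if issued_prefetches == 0:
        return 0.0
    return useful_prefetches / issued_prefetches


def coverage(useful_prefetches, remaining_misses):
    """Fraction of the would-be misses removed by prefetching (assumed definition)."""
    total = useful_prefetches + remaining_misses
    if total == 0:
        return 0.0
    return useful_prefetches / total


def wrap(count, bits=32):
    """Value reported by a counter of the given width after it rolls over."""
    return count % (1 << bits)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the effect of a rollover.
    useful = 2 ** 32 + 3_000_000
    issued = 2 ** 32 + 4_000_000
    print(accuracy(useful, issued))        # ratio from the true counts
    print(accuracy(wrap(useful), issued))  # ratio with a wrapped numerator
```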
Figure 4.3: Performance of prefetching on processors with unlimited bandwidth. Prefetching degree is set at 8.
(Plot of IPC per benchmark. Series: None, Sequ, DC, CDC, RPT, Stream, AVD, Perfect L2.)
Figure 4.4: Speedup in IPC by increasing the prefetching degree from 1 to 8.
(Plot of speedup per benchmark for art, facerec, wupwise, bzip2, gap, parser, ammp, applu, equake, mgrid, swim, mcf and gcc. Series: Sequ, DC, CDC, RPT, Stream, AVD.)
Benchmark Sequential DC C/DC RPT Stream AVD
gzip 72.0 % 40.7 % 90.0 % 99.9 % 99.4 % 38.8 %
gcc 69.4 % 54.7 % 78.5 % 95.6 % 89.4 % 35.5 %
crafty 25.5 % 27.0 % 86.4 % 96.7 % 89.3 % 22.1 %
mcf 72.6 % 49.9 % 91.9 % 99.5 % 98.9 % 64.6 %
swim 80.6 % 49.9 % 85.7 % 98.3 % 94.1 % 30.4 %
mgrid 90.3 % 52.9 % 95.3 % 98.8 % 99.4 % 15.8 %
equake 76.7 % 54.5 % 93.0 % 99.6 % 99.5 % 78.2 %
applu 74.9 % 42.0 % 64.4 % 76.9 % 51.2 % 53.8 %
vpr 90.7 % 87.6 % 88.6 % 97.9 % 97.8 % 22.2 %
ammp 0.3 % 99.2 % 99.6 % 99.5 % 98.4 % 10.8 %
mesa 78.6 % 41.2 % 88.8 % 99.7 % 99.7 % 41.9 %
galgel 57.5 % 60.3 % 85.9 % 81.0 % 70.5 % 39.1 %
lucas 85.5 % 84.0 % 81.3 % 63.6 % 64.9 % 40.0 %
fma 76.1 % 58.8 % 66.0 % 98.1 % 88.5 % 62.5 %
parser 60.8 % 47.8 % 90.9 % 99.5 % 98.5 % 36.0 %
eon 82.0 % 57.6 % 75.7 % 89.2 % 85.2 % 11.1 %
perlbmk 9.0 % 96.2 % 94.1 % 99.5 % 99.5 % 11.6 %
gap 60.2 % 25.6 % 85.0 % 98.4 % 95.6 % 9.7 %
bzip2 32.0 % 48.6 % 85.9 % 97.5 % 95.0 % 39.5 %
apsi 93.3 % 24.0 % 88.9 % 99.5 % 94.9 % 34.8 %
wupwise 88.6 % 46.3 % 90.8 % 96.3 % 85.5 % 31.6 %
twolf 86.8 % 78.0 % 87.9 % 90.8 % 91.2 % 17.1 %
facerec 86.5 % 52.5 % 94.6 % 97.0 % 95.3 % 41.9 %
art 99.1 % 50.5 % 97.5 % 99.9 % 99.6 % 99.9 %
Table 4.2: Accuracy of prefetching heuristics. Values larger than 90% are marked with
green, while values smaller than 50 % are marked with red.
Benchmark Sequential DC C/DC RPT Stream AVD
gzip 47.7 % 28.1 % 44.1 % 29.4 % 29.2 % 0.0 %
gcc 39.3 % 15.4 % 22.0 % 5.1 % 5.7 % 4.3 %
crafty 23.5 % 5.0 % 8.0 % 11.2 % 11.4 % 0.0 %
mcf 39.9 % 16.0 % 17.8 % 49.7 % 50.6 % 0.6 %
swim 47.1 % 28.4 % 38.8 % 67.4 % 67.5 % 0.0 %
mgrid 47.5 % 28.3 % 39.0 % 54.2 % 57.6 % 0.0 %
equake 45.0 % 20.0 % 33.4 % 43.8 % 54.7 % 0.0 %
applu 46.1 % 12.1 % 17.0 % 46.8 % 52.0 % 0.0 %
vpr 41.0 % 19.0 % 23.5 % 16.9 % 22.3 % 0.1 %
ammp 0.3 % 47.2 % 30.7 % 97.9 % 97.2 % 0.0 %
mesa 49.6 % 30.1 % 44.8 % 70.8 % 70.8 % 0.0 %
galgel 41.9 % 11.2 % 11.9 % 54.2 % 70.7 % 0.0 %
lucas 43.4 % 22.0 % 24.4 % 1.9 % 2.6 % 0.2 %
fma 40.4 % 13.7 % 18.3 % 3.0 % 3.4 % 0.1 %
parser 32.9 % 20.5 % 29.4 % 47.7 % 51.3 % 6.3 %
eon 40.6 % 8.1 % 18.2 % 6.0 % 8.2 % 0.0 %
perlbmk 8.4 % 30.7 % 33.0 % 98.6 % 98.6 % 0.0 %
gap 35.4 % 11.5 % 30.9 % 12.2 % 15.6 % 0.5 %
bzip2 23.3 % 2.7 % 5.9 % 7.4 % 7.5 % 0.0 %
apsi 49.9 % 22.2 % 46.8 % 0.8 % 0.9 % 0.0 %
wupwise 49.1 % 24.6 % 40.7 % 48.8 % 49.5 % 0.0 %
twolf 37.0 % 19.0 % 23.0 % 3.4 % 16.2 % 0.1 %
facerec 45.4 % 28.2 % 41.2 % 72.9 % 105.4 % 0.0 %
art 49.7 % 29.3 % 47.3 % 99.6 % 99.4 % 0.0 %
Table 4.3: Coverage of prefetching heuristics. Values larger than 70% are marked with
green, while values smaller than 5 % are marked with red.
Prefetching Degree
In this section I will examine how increasing the prefetching degree affects the performance
of prefetching. I have selected three benchmarks for this experiment. First, I have only
chosen benchmarks with a significant potential for prefetching (the difference between no
prefetching and a perfect L2 is significant). Then I have randomly selected Mgrid, Swim
and Art. More experimentation was not required as all experiments point in the same
direction. In addition, this facet of prefetching will be explored in other sections of this
thesis.
The results of these experiments can be seen in figures 4.5, 4.6 and 4.7. The most
interesting result from these graphs is that performance increases monotonically with
increasing prefetching degree. In other words, an increase in prefetching degree does not
decrease performance for these benchmarks. This reinforces the finding in the previous
section regarding data displacement in the cache, where only one benchmark had performance
degradation when increasing the prefetching degree.
Further, we see that stream and RPT prefetching reach their maximum potential
quite early. This effect can be most clearly observed in figure 4.6. We also observe that
DC and C/DC prefetching do not gain a large benefit from high prefetching degrees.
This is due to the matching function in the delta correlation, which limits the maximum
prefetch degree available. We also observe that sequential prefetching gains the most
benefit from increasing the prefetching degree.
We observe again that stream prefetching outperforms the perfect L2. As before, this
is due to the fact that stream prefetching fetches blocks to the L1 cache rather than the
L2 cache. Although understandable, this is an impressive result.
In figure 4.7 we observe that the curves for CZone/DC and DC
cross each other. This is simply due to DC issuing more prefetch requests.
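To make the notion of prefetching degree concrete, the sketch below shows one simple form of sequential prefetching, where a miss triggers prefetches of the next `degree` consecutive blocks. This is a simplified illustration and not the simulator's implementation: the function name is hypothetical, and the 128 B block size is taken from the L2 configuration in table 4.1.

```python
BLOCK_SIZE = 128  # L2 block size in bytes (Table 4.1)


def sequential_prefetch(miss_address, degree, block_size=BLOCK_SIZE):
    """Block-aligned addresses of the `degree` blocks following the miss."""
    base = miss_address // block_size * block_size  # align to the block start
    return [base + i * block_size for i in range(1, degree + 1)]


if __name__ == "__main__":
    for degree in (1, 2, 8):
        print(degree, [hex(a) for a in sequential_prefetch(0x1000, degree)])
```

A higher degree fetches further ahead of the miss, which explains why a heuristic with no other means of running ahead benefits the most from an increased degree. It also issues more requests per miss, which matters once bandwidth is limited.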
Figure 4.5: Increasing prefetching degree on Mgrid.
(Plot of IPC against prefetching degree, 0 to 16. Series: No prefetching, Sequential, Delta Correlation, CZone/Delta Correlation, RPT, Stream prefetching, Perfect L2.)
Figure 4.6: Increasing prefetching degree on Art.
(Plot of IPC against prefetching degree, 0 to 16. Series: No prefetching, Sequential, Delta Correlation, CZone/Delta Correlation, RPT, Stream prefetching, Perfect L2.)
Figure 4.7: Increasing prefetching degree on Swim.
(Plot of IPC against prefetching degree, 0 to 16. Series: No prefetching, Sequential, Delta Correlation, CZone/Delta Correlation, RPT, Stream prefetching, Perfect L2.)
CZone Size
CZone/Delta correlation prefetching has an additional parameter that needs to be
examined, namely the size of each CZone (or memory division). I have examined this in 4
different benchmarks: Art, Mcf, Mgrid and Swim. Mcf was added because it produced
some interesting results in the original paper [25]. In that paper, increasing the prefetching
degree generally decreases performance on Mcf. In addition, increasing the CZone size
decreases performance as well, until the CZone size is set to 16MB. If the CZone size is
larger than or equal to the size of the dataset, C/DC will perform as the DC algorithm.
The experiment was run with the baseline setup and with C/DC prefetching. By
varying the prefetching degree and CZone size the results in figures 4.8, 4.9, 4.10 and 4.11
were obtained.
From these experiments we see that too small CZone sizes cause a performance degradation,
because 4K is too small for the pattern detection to function properly. This is also
the case for too large CZones. However, it is generally better to have too large CZones
than too small CZones. In Mgrid, we observe that any size beyond 64K gives acceptable
results. This is also true for Swim, which peaks at about 256K CZone size. However, for
Mcf, this does not hold.
Based on these experiments a CZone size of 256K was chosen for the remainder of
the experiments. This result matches the results found in the original paper by Nesbit.
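The role of the CZone size can be illustrated by looking at the address deltas that the pattern detection sees. The sketch below is a simplification made for this illustration: it only splits an address stream into CZones and records the deltas within each zone, and it leaves out the actual delta correlation and table management. The function names and the example address stream are hypothetical.

```python
from collections import defaultdict


def czone_index(address, czone_size):
    """CZone (memory division) that an address belongs to."""
    return address // czone_size


def deltas_per_czone(addresses, czone_size=256 * 1024):
    """Deltas between successive addresses that fall in the same CZone."""
    last_seen = {}
    deltas = defaultdict(list)
    for address in addresses:
        zone = czone_index(address, czone_size)
        if zone in last_seen:
            deltas[zone].append(address - last_seen[zone])
        last_seen[zone] = address
    return dict(deltas)


if __name__ == "__main__":
    # Two interleaved streams: stride 128 from 0 and stride 256 from 1 MB.
    stream = [0, 0x100000, 128, 0x100000 + 256, 256, 0x100000 + 512]
    print(deltas_per_czone(stream))                      # 256K zones
    print(deltas_per_czone(stream, 16 * 1024 * 1024))    # one 16M zone
```

With 256K zones the two streams are separated and each zone sees a constant delta. With a single zone that covers the whole data set the deltas alternate between large positive and negative values, which corresponds to C/DC degenerating to plain DC as described above.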
Figure 4.8: Varying CZone size with prefetching degree on C/DC for Art.
(Surface plot of IPC against CZone size, 4K to 16M, and prefetching degree, 0 to 16.)
Figure 4.9: Varying CZone size with prefetching degree on C/DC for Mcf.
(Surface plot of IPC against CZone size, 4K to 16M, and prefetching degree, 0 to 16.)
Figure 4.10: Varying CZone size with prefetching degree on C/DC for Mgrid.
(Surface plot of IPC against CZone size, 4K to 16M, and prefetching degree, 0 to 16.)
Figure 4.11: Varying CZone size with prefetching degree on C/DC for Swim.
(Surface plot of IPC against CZone size, 4K to 16M, and prefetching degree, 0 to 16.)
Table Size
Many of the prefetching heuristics studied in this thesis require storage. This section ex-
amines the optimal size of the tables. However, it is important to note that the tables are
different in complexity. A table entry might vary in size between the different implemen-
tations, thus the comparison is not fair. A fair comparison would require a comparison
between area requirements. However, it is too simplistic to assume that area
increases linearly with table size (in bits). Some heuristics require cache-like structures,
which increase in area according to the findings in section 2.1.1, while others require
FIFO structures. Estimating the area requirements would require an actual VHDL or
Verilog model of the prefetchers, which is beyond the scope of this thesis.
However, the main purpose of this experiment is to find a table size that is large enough
to cover most cases, such that a prefetching algorithm is not cut off due to resource issues.
The experiment was run with a prefetching degree of 1. By varying the size of the
table we get the results in figures 4.12, 4.13, 4.14, 4.15 and 4.16.
In figures 4.12 and 4.14 we see that there is little significant change in performance by
increasing the table size. A small increase is gained when using DC prefetching and going
from a 16 entry to a 32 entry table.
In figure 4.13 a more dramatic increase is observed. We see a clear increase in perfor-
mance until the table can contain 128 load instructions. This graph also indicates some
of the structure of the program, suggesting that there is somewhere between 32 and 128
loads between each delinquent load.
Such a sharp increase can also be observed in figures 4.15 and 4.16. A table size of
1024 entries was chosen because it covers most of the cases, although more performance
can be gained by using a larger table.
Figure 4.12: Varying Table size on Ammp.
(Plot of IPC against table size in entries, 16 to 2048. Series: DC, CDC, RPT, Stream.)
Figure 4.13: Varying Table size on Art.
(Plot of IPC against table size in entries, 16 to 2048. Series: DC, CDC, RPT, Stream.)
Figure 4.14: Varying Table size on Mcf.
(Plot of IPC against table size in entries, 16 to 2048. Series: DC, CDC, RPT, Stream.)
Figure 4.15: Varying Table size on Mgrid.
(Plot of IPC against table size in entries, 16 to 2048. Series: DC, CDC, RPT, Stream.)
Figure 4.16: Varying Table size on Swim.
(Plot of IPC against table size in entries, 16 to 2048. Series: DC, CDC, RPT, Stream.)
4.3.2 Limited Bandwidth
By adding the new DRAM model described in section 3.1 we can study the effect of limited
bandwidth on performance. In the previous section we looked at how prefetching without
any bandwidth limitation affected performance. We now have an idea of how prefetching
displaces data in the cache. This information can then be applied in the analysis of
prefetching in an environment where bandwidth is limited.
First, we run a simple experiment similar to the one conducted in the previous section.
The experimental setup is unchanged, with the exception of the bandwidth limitation.
In figure 4.17 the results from the experiment are documented. In this experiment we
look at how performance is affected by limiting bandwidth to a single DRAM channel. It
is interesting to compare this to the unlimited version of the same experiment in figure
4.1. To make comparison easier the graph in figure 4.18 was made. This figure shows the
speedup of using the new DRAM model using one channel as opposed to the unlimited
previous model. Benchmarks that show little (< 1.5 %) performance increase or decrease
were omitted.
As noted in section 3.1.4, the effect of using the new DRAM model varies: some
benchmarks experience a speedup due to the effects of open pages (the Ammp benchmark),
while others show degradation. Swim and Apsi show significant performance degradation
due to the reduced bandwidth available. This is especially true for the more
bandwidth-intensive methods. This indicates that bandwidth contention is a more dominant
factor than cache line displacement.
In figure 4.19 we see the results from a similar experiment, but with 2 DRAM channels.
Again, the speedup of using the new DRAM model over the old one is shown in figure 4.20.
Naturally, we see a performance benefit over the previous experiment as more bandwidth is
available for prefetching. This is especially true for benchmarks such as Apsi and Swim,
which were previously bandwidth limited. In addition we observe that open pages have an
especially good effect on sequential prefetching as well as on programs that exhibit
sequential access patterns.
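The benefit of open pages for sequential accesses can be sketched with a toy row-buffer model. The model below is an assumption made for illustration only: it uses a single bank with one open row and a hypothetical row size of 4096 bytes, and it is not the DRAM model of section 3.1.

```python
def row_hit_fraction(addresses, row_bytes=4096):
    """Fraction of accesses that hit the currently open DRAM row.

    Assumptions (hypothetical): one bank, one open row, and the row index
    is simply address // row_bytes.
    """
    open_row = None
    hits = 0
    for addr in addresses:
        row = addr // row_bytes
        if row == open_row:
            hits += 1
        open_row = row  # the accessed row stays open for the next access
    return hits / len(addresses)


sequential = [i * 64 for i in range(1024)]   # consecutive 64-byte lines
scattered = [i * 8192 for i in range(1024)]  # every access opens a new row

print(row_hit_fraction(sequential))
print(row_hit_fraction(scattered))
```

Under these assumptions almost every sequential access finds its row already open, while the scattered stream never does.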
Figure 4.17: Baseline with 1 DRAM channel. (IPC per benchmark for None, Sequ, DC, CDC,
RPT, Stream, AVD and Perfect L2.)
Figure 4.18: Speedup using 1 channel over the unlimited bandwidth model. Benchmarks with
less than 1% difference have been removed. (Speedup per benchmark for None, Sequ, DC, CDC,
RPT, Stream and AVD.)
Figure 4.19: Baseline with 2 DRAM channels. (IPC per benchmark for None, Sequ, DC, CDC,
RPT, Stream, AVD and Perfect L2.)
Figure 4.20: Speedup using 2 channels over the unlimited bandwidth model. (Speedup per
benchmark for None, Sequ, DC, CDC, RPT, Stream and AVD.)
Sequential Prefetching
In this section, I will examine how increasing the prefetching degree interacts with
increasing the amount of bandwidth for sequential prefetching. In the previous section, we
observed that sequential prefetching has a monotonic IPC increase as the prefetching
degree is increased.
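To make the role of the prefetching degree concrete, the sketch below lists the addresses a sequential prefetcher would request on a miss. It assumes a 64-byte cache line and that a degree of d means "fetch the next d lines"; the function name is hypothetical and the simulator may differ in detail.

```python
def sequential_prefetches(miss_addr, degree, line_bytes=64):
    """Addresses of the `degree` cache lines following the missing line."""
    line = miss_addr // line_bytes
    return [(line + k) * line_bytes for k in range(1, degree + 1)]


# A miss at byte address 0x1008 with degree 2 requests the next two lines.
print(sequential_prefetches(0x1008, 2))
```

Each miss therefore generates up to d extra memory requests, which is why a high degree needs matching bandwidth.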
I have conducted five experiments where I have plotted performance (in IPC) as a func-
tion of both the number of DRAM channels and the prefetching degree. The benchmarks
were selected because they are the most memory intensive according to the preliminary
investigations (section 3.5) and in subsequent experiments. The results can be seen in
figures 4.21 to 4.25.
It is interesting that the experiments have very different shapes. In figure 4.21 we see
how sequential prefetching interacts with Ammp: Increasing the number of channels has
the greatest impact on performance, whereas increasing the prefetching degree has little
effect.
In figure 4.22 we see a typical scenario: Increasing the prefetching degree increases
performance, but the bandwidth required to support prefetching must be present, thus we
see a peak with a prefetching degree of 14 and 6 channels.
Mcf behaves quite differently: Increasing the prefetching degree deteriorates perfor-
mance as it causes contention. This experiment is documented in figure 4.23.
On the Mgrid benchmark (figure 4.24) we observe similar results to the results obtained
from Art. We observe that by increasing the prefetching degree we can gain additional
performance, but only if there is available bandwidth.
In the final experiment (Swim, figure 4.25), performance degrades as the prefetching
degree increases. However, increasing the available bandwidth offsets this trend to some
extent.
Generally speaking, no benchmark suffered performance degradation due to increasing
the bandwidth available. There are some small exceptions to this, in the form of irregu-
larities. This is most probably due to memory channel conflicts, where the load becomes
unevenly distributed across the channels.
When increasing the prefetching degree, three clear patterns emerge. In the first group,
increasing the prefetching degree also increases performance; this is true for the Art
and Mgrid benchmarks. In the second, represented by the Ammp benchmark, little is gained
from increasing the prefetching degree. In the third, increasing it degrades performance;
this is the case for the Mcf and Swim benchmarks. This result is interesting, as it tells
us that one cannot simply increase the prefetching degree for sequential prefetching and
assume that it will give increased performance. In addition, we observe that sequential
prefetching is heavily dependent on high bandwidth in order to give the maximum benefit.
Figure 4.21: Plot of increasing prefetching degree versus available bandwidth for Ammp
with sequential prefetching.
Figure 4.22: Plot of increasing prefetching degree versus available bandwidth for Art with
sequential prefetching.
Figure 4.23: Plot of increasing prefetching degree versus available bandwidth for Mcf with
sequential prefetching.
Figure 4.24: Plot of increasing prefetching degree versus available bandwidth for Mgrid
with sequential prefetching.
Figure 4.25: Plot of increasing prefetching degree versus available bandwidth for Swim
with sequential prefetching.
(Figures 4.21 to 4.25 plot IPC against the number of channels, 1 to 6, and the prefetching
degree, 2 to 14.)
C/DC Prefetching
In this section I will look at how C/DC prefetching behaves with limited bandwidth in
the same manner that I looked at sequential prefetching. I will not look at DC prefetch-
ing explicitly, as it behaves in much the same manner as C/DC prefetching, due to the
underlying pattern detection scheme discussed in section 2.2.2.
In the Ammp benchmark (figure 4.26) we observe a peculiar shape. This is due to the
way that the program interacts with the memory subsystem. In essence, the load across
the memory channels becomes uneven on certain configurations. It is clear that a system
with a multiple of three channels outperforms the others. Again, we see that it is not
enough to simply increase the prefetching degree; there must also be available bandwidth
to support it.
In the other experiments (figures 4.27 to 4.30) we get almost identical shapes, which is
interesting. We do not get the performance degradation observed with sequential
prefetching on Swim and Mcf. This result suggests that the C/DC heuristic is much more
robust than sequential prefetching. In essence, for these benchmarks one can assume that
increasing the prefetching degree will not deteriorate performance.
Figure 4.26: Plot of increasing prefetching degree versus available bandwidth for Ammp
with C/DC prefetching.
Figure 4.27: Plot of increasing prefetching degree versus available bandwidth for Art with
C/DC prefetching.
Figure 4.28: Plot of increasing prefetching degree versus available bandwidth for Mcf with
C/DC prefetching.
Figure 4.29: Plot of increasing prefetching degree versus available bandwidth for Mgrid
with C/DC prefetching.
Figure 4.30: Plot of increasing prefetching degree versus available bandwidth for Swim
with C/DC prefetching.
(Figures 4.26 to 4.30 plot IPC against the number of channels, 1 to 6, and the prefetching
degree, 2 to 14.)
RPT Prefetching
In this section I will look at Reference Prediction Table prefetching in order to explore how
RPT prefetching interacts with prefetching degree and varying amounts of bandwidth.
Because stream prefetching uses the same pattern detection algorithm, I will not do a
separate experiment, but rather rely on the results obtained in this section. As a general
rule stream prefetching amplifies the results of RPT prefetching, because it also prefetches
to the L1 cache. In essence, if the predictor is correct, this will improve performance as
L1 misses are turned into L1 hits. However, if the predictor is wrong, the extra pressure
on the L1 cache might cause data displacement. As such, the performance of the RPT
prefetcher is a very good indicator of the stream prefetcher's performance.
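As a reminder of the pattern detection shared by the two schemes, the following is a heavily simplified stride predictor indexed by load PC. It is only a sketch: the class and method names are hypothetical, the table is unbounded, and the confidence state machine of a full reference prediction table is replaced by a plain "same stride twice in a row" rule.

```python
class ReferencePredictionTable:
    """Simplified stride predictor indexed by load PC (illustrative only)."""

    def __init__(self, degree=1):
        self.degree = degree
        self.entries = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a load and return the addresses to prefetch, if any."""
        prefetches = []
        if pc in self.entries:
            last_addr, last_stride = self.entries[pc]
            stride = addr - last_addr
            # Predict only when the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == last_stride:
                prefetches = [addr + stride * k
                              for k in range(1, self.degree + 1)]
            self.entries[pc] = (addr, stride)
        else:
            self.entries[pc] = (addr, 0)
        return prefetches


rpt = ReferencePredictionTable(degree=2)
for address in (1000, 1128, 1256):
    print(rpt.access(pc=0x400, addr=address))
```

The first two accesses only train the entry; the third confirms the stride of 128 bytes and triggers two prefetches.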
I have done the same experiment as in the previous sections, but with a small change
in presentation. Because the RPT prefetcher becomes effective at very low prefetching
degrees, I have chosen to show the results with a prefetching degree from 1 to 6, rather
than 2 to 14. This was done to improve clarity and give more detail. The results can be
seen in figures 4.31 to 4.35.
In most benchmarks, but especially on Ammp (figure 4.31), we observe the same pattern
when increasing the number of memory channels that we observed with C/DC prefetching.
Again, this is due to imbalances in the pressure on each channel. We also observe that
the number of channels should be a multiple of three. It is interesting to note that this
pattern only occurs with RPT and C/DC prefetching, but not with sequential prefetching.
This is most probably due to Ammp having a strided access pattern, which RPT and
C/DC are able to detect, while sequential prefetching is not. This results in
a skewed distribution of prefetches across DRAM channels.
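How a strided stream can load the channels unevenly is easy to show with a toy mapping. The sketch below assumes that consecutive cache lines are interleaved across channels, that is, channel = (address // line size) mod number of channels, and uses a made-up stride of four lines. Neither the mapping nor the stride is taken from the simulator or from Ammp.

```python
def channel_load(addresses, channels, line_bytes=64):
    """Count accesses per channel under line-granularity interleaving.

    Assumption: channel = (address // line_bytes) % channels.
    """
    load = [0] * channels
    for addr in addresses:
        load[(addr // line_bytes) % channels] += 1
    return load


# A hypothetical strided stream touching every fourth cache line.
stream = [i * 4 * 64 for i in range(1200)]

for n in (2, 3, 4, 6):
    print(n, channel_load(stream, n))
```

With this particular stride, two or four channels send every request to a single channel while three channels are perfectly balanced. The point is only that stride and channel count interact; it does not establish the actual access pattern of Ammp.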
On the Art benchmark (figure 4.32) we observe a sharp increase in performance by
increasing the number of channels from one to three, but it has less impact beyond three.
The same holds for the prefetching degree: although there are gains from increasing it
beyond 3, these are small compared to the earlier ones.
In the next three experiments (Mcf, Mgrid and Swim), we observe that increasing the
amount of bandwidth available has the greatest effect on performance, while increasing the
prefetching degree has only a minor effect compared to the amount of bandwidth available.
RPT prefetching works in much the same manner as C/DC prefetching. The extra pattern
recognition logic is especially significant on the Ammp benchmark. RPT prefetching is
also robust: it does not deteriorate performance, whereas sequential prefetching does. In
addition, as already observed, it is not very dependent on the prefetching degree to
function properly; even a low prefetching degree gives a significant performance increase.
Figure 4.31: Plot of increasing prefetching degree versus available bandwidth for Ammp
with RPT prefetching.
Figure 4.32: Plot of increasing prefetching degree versus available bandwidth for Art with
RPT prefetching.
Figure 4.33: Plot of increasing prefetching degree versus available bandwidth for Mcf with
RPT prefetching.
Figure 4.34: Plot of increasing prefetching degree versus available bandwidth for Mgrid
with RPT prefetching.
Figure 4.35: Plot of increasing prefetching degree versus available bandwidth for Swim
with RPT prefetching.
(Figures 4.31 to 4.35 plot IPC against the number of channels, 1 to 6, and the prefetching
degree, 1 to 6.)
4.3.3 Bandwidth-Aware Prefetching
The goal of bandwidth-aware prefetching is to increase performance through a decrease
in memory bandwidth usage. However, because most SPEC2000 benchmarks are not
memory intensive, this section will deal with both performance and robustness (how it
handles varying loads).
All experiments in this section have been conducted with only one DRAM channel. This
was done to increase pressure on the memory subsystem. However, this is not a significant
limitation compared to contemporary systems.
In figure 4.36, I have compared the IPC of bandwidth-aware prefetching to the
bandwidth-oblivious versions of the same prefetching heuristics. In this experiment the
cutoff value is set to 40. This is a very restrictive value, which in practice drops
prefetches if any memory access is in progress. In most cases, we observe a performance
decrease, as fewer prefetches are issued. This is natural, considering that in most cases
bandwidth is not an issue in a uniprocessor. It is also interesting to note that for
sequential prefetching, significant decreases can be observed when the original scheme has
a very high accuracy (Mgrid, Wupwise, Art). However, the difference is relatively small in
most cases.
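The filtering decision can be sketched as a simple comparison against the cutoff. This is a minimal sketch, assuming that the quantity compared with the cutoff is some estimate of how busy the memory system already is; the name `pending_cycles` and both functions are hypothetical, and the actual metric is the one defined where the method is introduced.

```python
def should_issue_prefetch(pending_cycles, cutoff):
    """Issue a prefetch only if the estimated memory backlog is below the cutoff.

    Assumption: `pending_cycles` is a hypothetical estimate of outstanding
    memory work, expressed in the same unit as the cutoff.
    """
    return pending_cycles < cutoff


def filter_prefetches(candidates, pending_cycles, cutoff):
    """Drop all candidate prefetches when the backlog reaches the cutoff."""
    if should_issue_prefetch(pending_cycles, cutoff):
        return list(candidates)
    return []


candidates = [0x1040, 0x1080]
print(filter_prefetches(candidates, pending_cycles=100, cutoff=40))
print(filter_prefetches(candidates, pending_cycles=100, cutoff=400))
```

With a low cutoff almost any outstanding access suppresses prefetching; raising the cutoff lets more prefetches through.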
There are five noteworthy exceptions to this observation. These are the Ammp and Art
benchmarks for the RPT and Stream prefetchers. We see that performance decreases by
about 45% in these cases. This is natural, as these benchmarks are very latency-sensitive,
while the prefetchers are highly accurate (see table 4.2).
It is interesting to note a performance increase in Applu by using the Stream prefetcher.
Applu is one of the few cases where Stream prefetching has a low accuracy, and it will
therefore get a higher performance if the right prefetches, namely the useless ones, are
dropped.
Figure 4.37 shows the positive aspect of bandwidth-aware prefetching. In this figure
the memory bandwidth usage of bandwidth-aware prefetching is compared to the case of
no bandwidth-aware prefetching.
In general, sequential and DC prefetching gain the most from this technique, as these
heuristics are generally less accurate than the others. We observe that the greatest
reductions in bandwidth usage when using sequential prefetching occur on benchmarks where
it has low accuracy. This is also the case for RPT and Stream prefetching in cases where
they have a low accuracy (Galgel and Applu).
From these initial experiments, it is clear that bandwidth-aware prefetching works best
when accuracy is low. If accuracy is relatively low, then it can provide about the same
performance, while reducing the bandwidth requirements significantly.
In figures 4.38 and 4.39 I have conducted the exact same experiment, but with the
threshold increased to 400. In the first figure we observe that for most benchmarks and
prefetchers there are still performance reductions; however, these are not as large as in
the previous case. In addition, we keep most of the reductions in bandwidth usage when
increasing the threshold.
The increased threshold also reduces the benefits by a small amount, but for Crafty
and Perlbmk, the reduction is not significant compared to the results previously obtained.
In the last experiment, the threshold has been increased to 800. The results can be
viewed in figures 4.40 and 4.41. Again, we see that the effect of bandwidth-aware
prefetching diminishes as the threshold is increased.
Figure 4.36: Speedup using bandwidth-aware prefetching. Cutoff value is set at 40.
Figure 4.37: Reductions in bandwidth usage by using bandwidth-aware prefetching. Threshold
value is set at 40.
Figure 4.38: Speedup using bandwidth-aware prefetching. Threshold is set at 400.
Figure 4.39: Reductions in bandwidth usage. Threshold is set at 400.
Figure 4.40: Speedup using bandwidth-aware prefetching. Threshold is set at 800.
Figure 4.41: Reductions in bandwidth usage. Threshold is set at 800.
(Figures 4.36, 4.38 and 4.40 plot the IPC increase in percent, and figures 4.37, 4.39 and
4.41 the bandwidth increase in percent, per benchmark for Sequential, DC, C/DC, RPT and
Stream.)
4.4 CMP
4.4.1 Plan for the Experiments
In this section I will look at how prefetching works in CMPs. The following experiments
will be conducted on a dual core CMP system. By using a 2-way system instead of a
system with a larger number of cores, it becomes easier to analyze the results.
Each core will be configured as in the uniprocessor experiments. However, the L2
cache and memory controller will be shared among the cores, but will retain the same
parameters as in the previous experiments, unless otherwise noted.
I will look at three distinct properties of this CMP:
1. How CMP affects performance for single applications.
2. How prefetching affects performance.
3. How Bandwidth-aware prefetching affects performance.
Lgred benchmarks can be very short (50 million instructions) or comparatively long
(several billion instructions). This poses a problem when running them simultaneously,
because running time would be limited to the shortest running program. If the running
time of the experiment was not limited to the running time of the shortest experiment,
then it would degenerate into a uniprocessor experiment.¹
A solution to this problem is to use the full SPEC2000 benchmarks. Running them to the
end would be prohibitive, so it is only possible to run a small interval of each.
To avoid benchmarking startup code, I will fast-forward (no pipeline or cache simulation)
through the first 500 million to 1500 million instructions (randomly selected). After that,
I will run the cycle-accurate simulation for 200 million clock cycles.
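The run plan for one benchmark can be sketched as follows. The uniform distribution, the seed and the function name are assumptions made for illustration; the text only states that the fast-forward length is randomly selected within the given range.

```python
import random

FAST_FORWARD_MIN = 500_000_000    # fewest instructions skipped
FAST_FORWARD_MAX = 1_500_000_000  # most instructions skipped
SIMULATED_CYCLES = 200_000_000    # length of the cycle-accurate window


def plan_run(benchmark, rng):
    """Return a (benchmark, fast_forward_instructions, cycles) tuple."""
    skip = rng.randint(FAST_FORWARD_MIN, FAST_FORWARD_MAX)  # assumed uniform
    return benchmark, skip, SIMULATED_CYCLES


rng = random.Random(2006)  # hypothetical seed, used only for repeatability
plans = [plan_run(name, rng) for name in ("gzip", "swim")]
for plan in plans:
    print(plan)
```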
As mentioned in section 3.3.2, the CMP simulator is not as fast as the uniprocessor
simulator. This is both due to locking and due to the fact that 2 cores are being simulated
at the same time. This limits the number of benchmarks that can be run. By using the
full SPEC2000 benchmark suite there are 676 possible benchmark pairs (assuming the
cores are not symmetrical). Fortunately, most of the benchmarks in the SPEC2000 suite
are not interesting for our purposes. I have selected 10 benchmarks from two groups. The
first group contains the compute bound applications, which were selected because they have
relatively few L2 misses per instruction.
• Gzip
• Gcc
• Crafty
• Eon
• Twolf
The second group is the memory bound applications:
• Swim
• Mgrid
• Ammp
• Wupwise
• Art
¹ It would be possible to restart benchmarks to compensate for this.
By making such a distinction we can now look at how benchmarks from the two groups
interact.
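The size of the experiment space follows directly from counting ordered pairs. The sketch below enumerates the pairs for the ten selected benchmarks and, for comparison, computes the count for the full suite of 26 benchmarks; the variable names are only illustrative.

```python
from itertools import product

COMPUTE_BOUND = ["gzip", "gcc", "crafty", "eon", "twolf"]
MEMORY_BOUND = ["swim", "mgrid", "ammp", "wupwise", "art"]
SELECTED = COMPUTE_BOUND + MEMORY_BOUND

# Ordered pairs: (measured benchmark, benchmark it is combined with).
pairs = list(product(SELECTED, repeat=2))
full_suite_pairs = 26 ** 2  # SPEC CPU2000 contains 26 benchmarks

print(len(pairs), full_suite_pairs)
```

Restricting the study to ten benchmarks reduces the number of simulations per configuration from 676 to 100.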
4.4.2 CMP Performance
In this section, I will look at how benchmarks perform in a CMP with another program
present. First, I establish a baseline experiment, which I will use as a base metric for the
rest of the experiments.
In table 4.4a I have run all the benchmarks combined with all the others and summarized
the results. In this table the performance in terms of IPC of the benchmark in the first
column is shown. The benchmark is combined with the benchmark in the top row. By
dividing the benchmarks in such a manner, we see that there are four groups of interest:
1. Compute bound combined with compute bound.
2. Compute bound combined with memory bound.
3. Memory bound combined with compute bound.
4. Memory bound combined with memory bound.
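The four groups can be expressed as a small classification function, where the first argument is the measured benchmark and the second the benchmark it is combined with. The function name is hypothetical.

```python
COMPUTE_BOUND = {"gzip", "gcc", "crafty", "eon", "twolf"}
MEMORY_BOUND = {"swim", "mgrid", "ammp", "wupwise", "art"}


def pair_group(measured, co_runner):
    """Return the group number (1 to 4) used in the text for a pair."""
    measured_is_memory = measured in MEMORY_BOUND
    co_runner_is_memory = co_runner in MEMORY_BOUND
    return 1 + 2 * measured_is_memory + co_runner_is_memory


print(pair_group("gzip", "gcc"))   # compute bound with compute bound
print(pair_group("gzip", "art"))   # compute bound with memory bound
print(pair_group("ammp", "gzip"))  # memory bound with compute bound
print(pair_group("art", "swim"))   # memory bound with memory bound
```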
Naturally, combining any application with a memory bound application leads to decreased
performance, as there are more evictions from the cache as well as bandwidth contention.
This is consistent with theory. In addition, compute bound applications show less
performance degradation when combined with memory bound applications than memory bound
applications do.
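Table 4.4 also lists an h-mean column for each row. Assuming this is the harmonic mean of the ten IPC values in the row, it can be recomputed as below for the gzip row of table 4.4a.

```python
from statistics import harmonic_mean

# IPC of gzip when combined with each benchmark (table 4.4a, 1 DRAM channel).
gzip_ipc = [1.4692, 1.4454, 1.4276, 1.5186, 1.5163,
            1.2669, 1.1621, 1.2543, 1.4620, 1.0872]

h_mean = harmonic_mean(gzip_ipc)
print(h_mean)
```

The result is close to the 1.3438 listed in the table. The harmonic mean is the usual choice for averaging rates such as IPC, since it weights the slow combinations more heavily than the arithmetic mean does.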
In table 4.4b the same experiment was run, but with 2 DRAM channels. This experiment
was conducted to establish how much extra bandwidth increases performance for each
benchmark. We observe little change in group 1. However, the most bandwidth intensive
applications, such as Ammp, show a significant improvement. This is natural, and is
compatible with the results obtained in the uniprocessor section. It is especially
worthwhile to note the changes that occur in group 4. Because both benchmarks are memory
bound, each of them gains from the added bandwidth, but the measured benchmark only gains
its fraction. By comparing this to group 3 we see a significant decrease in performance.
However, there is still a large performance increase compared to the system with only 1
DRAM channel.
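The gain from the second channel can be quantified as a relative IPC change. The sketch below assumes the usual definition, (new / baseline - 1) times 100, and applies it to two entries of table 4.4; the function name is hypothetical.

```python
def percent_change(ipc_new, ipc_base):
    """Relative IPC change in percent, positive when the new setup is faster."""
    return (ipc_new / ipc_base - 1.0) * 100.0


# Art combined with gzip: table 4.4a (1 channel) versus table 4.4b (2 channels).
art_gain = percent_change(0.1266, 0.0824)
# Eon combined with gzip (group 1) barely changes.
eon_gain = percent_change(0.8923, 0.8914)

print(round(art_gain, 1), round(eon_gain, 1))
```

The memory bound Art gains roughly half again in IPC, while the compute bound Eon changes by about a tenth of a percent.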
In table 4.5, I examine the effects of increasing the cache size. In subtable a I have
doubled the cache size to 2MB by doubling the number of sets. In effect, this will double
the area requirements of the cache. However, we do not gain significant speedup by doing
so. I have compared the performance with the baseline experiment with 1 DRAM channel.
The largest increase is for the Ammp benchmark combined with the Twolf benchmark with
a 79% increase in performance. The rest of the benchmarks show less than a 24% increase
in performance.

(a) 1 DRAM channel
             gzip      gcc   crafty      eon    twolf     swim    mgrid     ammp  wupwise      art   h-mean
gzip       1.4692   1.4454   1.4276   1.5186   1.5163   1.2669   1.1621   1.2543   1.4620   1.0872   1.3438
gcc        0.6768   0.6672   0.6398   0.7023   0.7195   0.6181   0.5650   0.6343   0.6834   0.5136   0.6358
crafty     0.6905   0.6975   0.5953   0.7280   0.7384   0.6529   0.5988   0.6638   0.7232   0.5554   0.6587
eon        0.8914   0.8877   0.8845   0.8754   0.8949   0.8846   0.8494   0.8634   0.8916   0.8515   0.8771
twolf      0.9341   0.9291   0.9282   0.9390   0.9379   0.9169   0.8671   0.8937   0.9300   0.8511   0.9117
swim       1.3454   1.3025   1.5104   1.5230   1.4012   1.2492   1.5628   1.4623   1.2748   1.3371   1.3889
mgrid      0.7832   0.7884   0.7988   0.9050   0.9158   0.7443   0.6275   0.6125   0.8402   0.5800   0.7419
ammp       0.0636   0.0688   0.5926   0.0782   0.1040   0.0563   0.0495   0.7066   0.0720   0.0342   0.0731
wupwise    1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846
art        0.0824   0.0926   0.0913   0.1053   0.0997   0.0693   0.0618   0.0613   0.0933   0.0542   0.0772

(b) 2 DRAM channels
             gzip      gcc   crafty      eon    twolf     swim    mgrid     ammp  wupwise      art   h-mean
gzip       1.4922   1.4752   1.4550   1.5265   1.5258   1.3712   1.2994   1.3078   1.4838   1.1943   1.4044
gcc        0.6983   0.6857   0.6638   0.7101   0.7255   0.6522   0.6267   0.6795   0.6996   0.5622   0.6670
crafty     0.7142   0.7184   0.6285   0.7369   0.7465   0.6850   0.6553   0.7001   0.7337   0.6095   0.6898
eon        0.8923   0.8893   0.8855   0.8758   0.8952   0.8857   0.8610   0.8694   0.8923   0.8555   0.8800
twolf      0.9364   0.9318   0.9302   0.9400   0.9392   0.9218   0.8918   0.9045   0.9313   0.8783   0.9200
swim       1.3748   1.3484   1.5549   1.5598   1.4363   1.2704   1.5882   1.4901   1.3159   1.3691   1.4229
mgrid      0.9879   0.9842   0.9870   1.0301   1.0335   0.9578   0.8320   0.8433   0.9978   0.7853   0.9357
ammp       0.0840   0.0873   0.7027   0.0913   0.1212   0.0802   0.0808   0.7192   0.0909   0.0608   0.1024
wupwise    1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846   1.8846
art        0.1266   0.1375   0.1360   0.1494   0.1470   0.1140   0.1064   0.0933   0.1436   0.0883   0.1202

Table 4.4: IPC of the benchmark in the left column in a dual core CMP combined with the
benchmarks in the first row.
In subtable b I have experimented with a 4MB cache, by doubling both the number of sets
and the associativity of the cache. In this experiment we see a much more significant
increase in performance. Again, the compute bound applications in group 1 gain little from
the larger cache. In group 2, however, there is a large increase in performance when the
benchmarks are combined with some of the more memory intensive benchmarks. The largest
increases in performance occur for groups 3 and 4. Here we observe increases of up to 776%
for Ammp combined with Gzip. This increase is especially significant for Ammp and Art,
which are the most bandwidth intensive benchmarks in the entire SPEC2000 suite.
(a) Double the number of sets (2MB cache)
             gzip      gcc   crafty      eon    twolf     swim    mgrid     ammp  wupwise      art
gzip        +4.4%    +3.6%    +4.0%    +0.8%    +0.7%    +7.8%   +12.2%    +4.6%    +1.0%   +25.9%
gcc         +5.2%    +5.9%    +6.8%    +3.5%    +2.6%   +10.5%   +10.7%    +3.4%    +4.4%   +11.8%
crafty      +5.7%    +5.5%   +16.3%    +4.6%    +3.1%    +4.5%    +7.5%    +3.8%    +2.7%   +10.3%
eon         +0.2%    +0.4%    +0.6%    +2.0%    +0.0%    +0.4%    +1.4%    +0.9%    +0.3%    +2.3%
twolf       +0.4%    +0.8%    +0.8%    +0.1%    +0.1%    +0.6%    +2.0%    +0.7%    +0.4%    +3.9%
swim        +0.4%    +0.1%    +0.1%    +0.0%    +0.1%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
mgrid      +20.7%   +18.3%   +14.5%   +11.7%   +21.8%   +12.8%   +13.8%   +12.5%   +14.6%    +1.3%
ammp        +2.3%    +2.3%    +1.0%    +0.5%   +79.0%    +1.6%   +10.2%    +6.1%    +1.9%    +0.9%
wupwise     +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art        +23.5%   +13.1%   +13.4%   +11.5%   +17.4%   +12.0%   +12.9%   +12.8%    +7.8%    +5.1%

(b) Double the number of sets and associativity (4MB cache)
             gzip      gcc   crafty      eon    twolf     swim    mgrid     ammp  wupwise      art
gzip        +5.1%    +4.4%    +4.7%    +0.9%    +0.9%    +9.7%   +14.7%   +11.5%    +1.6%   +40.1%
gcc         +6.6%    +8.9%    +9.3%    +4.3%    +3.3%   +13.6%   +15.9%    +6.5%    +6.9%   +24.6%
crafty      +8.2%    +8.2%   +21.6%    +5.9%    +4.7%    +7.3%   +11.3%    +6.3%    +4.4%   +18.3%
eon         +0.2%    +0.5%    +0.6%    +2.1%    +0.0%    +0.6%    +2.3%    +1.7%    +0.4%    +3.2%
twolf       +0.6%    +1.0%    +0.9%    +0.1%    +0.2%    +1.0%    +3.0%    +1.9%    +0.6%    +6.0%
swim        +0.4%    +0.2%    +0.2%    +0.1%    +0.1%    +0.0%    +0.1%    +0.0%    +0.0%    +0.0%
mgrid      +31.4%   +27.6%   +22.7%   +15.8%   +26.2%   +20.6%   +31.4%   +39.6%   +17.5%   +33.0%
ammp      +776.8%  +601.3%   +18.1%  +606.2%  +248.8%  +307.3%  +375.9%    +7.7%  +523.9%  +133.3%
wupwise     +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       +135.1%  +110.1%  +120.1%  +177.8%  +229.3%   +76.4%   +17.5%  +156.2%  +114.4%    +9.8%

Table 4.5: Speedup in IPC compared to an L2 cache of 1MB.
4.4.3 Prefetching in CMP
In this section I will look at how prefetching works in a CMP.
I will only look at sequential, C/DC and RPT prefetching. I have omitted DC prefetch-
ing because it is basically C/DC prefetching with very large CZones. Stream prefetching
was omitted because the underlying memory reference detection mechanism is the same as
RPT prefetching. AVD prefetching was omitted because it showed little effect on unipro-
cessor benchmarks.
In table 4.6, I have compared the performance of sequential and C/DC prefetching
to the baseline experiment. By using sequential prefetching we see a much more dif-
fuse picture, as some benchmarks experience a slowdown, while other combinations show
a speedup. This is most likely due to the inaccuracy of sequential prefetching causing
bandwidth contention. This is especially clear in group 2, where the memory intensive
applications prefetches too much data and degrades performance for the compute intensive
application. This occurs both because the compute bound application evicts useful data
from the cache, but also because it causes bandwidth contention.
In subtable b, the performance of C/DC prefetching in a CMP is documented. Here
we observe the same performance degradation in group 2, but it is dampened compared
to sequential prefetching. In addition, in groups 3 and 4 we no longer observe significant
performance decreases. This is again due to C/DCs increased accuracy.
In table 4.7, the performance of RPT prefetching is shown. RPT prefetching has a
very high accuracy, and it is therefore surprising that performance does not increase as
much as one would expect. For the Art benchmark, performance is actually decreased by a
significant amount in most cases. This might be caused by the relatively short simulation
time not being sufficient to properly initialize the reference prediction tables.
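The table entries in this section are relative changes against a baseline. A minimal sketch of how such an entry could be computed and printed follows; it assumes each entry is the relative IPC change of one benchmark, and the function names and IPC values are hypothetical, not taken from the simulator or its scripts.

```python
def relative_change_percent(value, baseline):
    """Relative change of value with respect to baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value / baseline - 1.0) * 100.0


def format_entry(change):
    """Format a change with an explicit sign and one decimal, e.g. '+1.5%'."""
    return f"{change:+.1f}%"


# Hypothetical IPC values for one benchmark pairing (assumption, not measured data).
entry = format_entry(relative_change_percent(value=1.218, baseline=1.2))
```

With an explicit sign in the format, a very small slowdown rounds to "-0.0%", which would be consistent with the signed zeros that appear in the tables.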
(a) Sequential prefetching
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      +1.5%    -0.5%    -1.4%    +1.0%    +0.8%    -6.4%    -8.4%    -7.2%    -1.3%    +7.6%
gcc       -2.3%    -1.6%    -3.1%    -0.3%    -0.2%    -3.6%    -6.0%    -7.9%    -0.3%    +1.0%
crafty    -4.3%    -3.7%    -7.9%    -2.1%    -1.7%    -7.7%    -9.6%    -9.4%    -1.9%    -6.6%
eon       -0.1%    -0.1%    -0.4%    -0.1%    -0.0%    -0.6%    -1.8%    -1.9%    -0.1%    -0.2%
twolf     +0.0%    -0.1%    -0.3%    +0.0%    +0.0%    -0.7%    -2.8%    -2.9%    -0.0%    -0.7%
swim      +0.1%    +0.5%    +0.4%    +1.1%    +0.7%    +0.4%    -0.1%    +0.5%    -0.2%    +1.6%
mgrid     +15.1%   +18.9%   +15.0%   +20.3%   +26.4%   +16.6%   +15.5%   +3.9%    +22.8%   +45.4%
ammp      -23.5%   -20.6%   -15.7%   -14.5%   -14.3%   -23.9%   -32.1%   -1.8%    -19.4%   -21.4%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       -1.1%    +0.4%    -0.9%    +4.2%    +5.5%    -1.3%    -2.7%    -11.0%   +4.8%    +6.3%

(b) C/DC prefetching
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      +1.1%    +0.9%    +1.2%    +1.0%    +1.0%    -4.2%    -6.2%    -0.1%    -0.1%    -1.0%
gcc       -0.1%    -0.1%    +0.3%    +0.0%    +0.0%    -3.5%    -6.0%    -0.4%    +0.0%    -7.0%
crafty    -0.2%    -0.3%    -0.3%    -0.1%    -0.0%    -4.0%    -5.5%    -0.9%    -0.4%    -5.9%
eon       -0.0%    -0.0%    -0.0%    -0.3%    +0.0%    -0.4%    -1.3%    -0.5%    -0.0%    -1.0%
twolf     +0.0%    +0.0%    +0.1%    +0.0%    +0.0%    -0.7%    -1.8%    -0.4%    +0.0%    -1.2%
swim      +0.8%    +1.4%    +1.0%    +1.2%    +1.1%    +0.6%    +0.2%    +0.6%    +0.9%    +1.4%
mgrid     +19.1%   +21.4%   +20.0%   +17.2%   +23.3%   +17.9%   +16.1%   +8.1%    +18.8%   +26.1%
ammp      +9.7%    +10.7%   +3.1%    +11.3%   +10.5%   +6.5%    +0.5%    +0.1%    +9.4%    +7.1%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       +0.8%    +0.4%    +0.8%    +0.6%    +0.0%    -1.4%    -0.8%    -1.2%    +2.0%    +4.7%

Table 4.6: Prefetching in CMP, speedup compared to a CMP with no prefetching.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      +0.4%    +0.2%    +0.3%    +0.5%    +0.5%    -0.7%    -0.6%    -1.7%    +0.1%    -3.1%
gcc       -0.0%    +0.0%    +0.0%    +0.0%    +0.0%    -1.4%    -0.9%    -0.8%    +0.0%    -6.0%
crafty    -0.0%    -0.0%    -0.0%    +0.0%    +0.0%    -0.5%    -1.1%    -1.2%    -0.0%    -5.0%
eon       +0.0%    +0.0%    -0.0%    +0.0%    +0.0%    -0.0%    -0.2%    -0.9%    +0.0%    +4.9%
twolf     +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    -0.1%    -0.4%    -0.8%    +0.0%    -1.7%
swim      +0.3%    +0.3%    +0.2%    +0.3%    +0.2%    +0.2%    +0.2%    -0.2%    +0.4%    -0.1%
mgrid     +6.7%    +8.4%    +5.8%    +7.5%    +8.2%    +6.7%    +11.1%   -5.2%    +5.4%    +2.2%
ammp      +12.8%   +14.1%   +3.6%    +15.4%   +13.3%   +13.9%   +12.5%   -0.2%    +13.4%   +10.1%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       -3.5%    -9.4%    -9.9%    -14.5%   -11.0%   -2.8%    -0.6%    -4.6%    -8.4%    -7.3%

Table 4.7: Performance of RPT benchmarking in a CMP compared to the case where no prefetching is performed.
4.4.4 Bandwidth-Aware Prefetching in CMP
In this section, I will look at how bandwidth-aware prefetching works in a CMP. Bandwidth-
aware prefetching was designed with CMP in mind.
In table 4.8, I have compared the performance of bandwidth-aware sequential prefetching
with the bandwidth-oblivious version. In this experiment the threshold was set to 40,
which again is a very aggressive threshold, but is used to better highlight the properties of
bandwidth-aware prefetching. For the Crafty benchmark, we see some significant performance
gains, especially when combining two Crafty benchmarks (+4.0%). For Mgrid, there
is a significant performance decrease when combining it with compute-bound applications
(up to a 5.3% decrease). For most applications, there is a small speed increase.
However, there are some very large decreases in bandwidth usage, especially for the
compute-bound applications, in addition to Swim, Mgrid and Wupwise. The most
bandwidth-intensive benchmarks, Art and Ammp, did not gain such large improvements, mostly due
to the relatively high accuracy of sequential prefetching on these benchmarks. The largest
saving in bandwidth usage was 47.8%, which occurred when combining Crafty with Eon.
The next experiment (table 4.9) was conducted with C/DC prefetching. C/DC
prefetching is much more accurate than sequential prefetching in general. As accuracy
increases, the performance benefits of bandwidth-aware prefetching decrease. We observe
the same pattern with C/DC prefetching as we did with sequential prefetching; however,
Crafty does not experience a speedup by using bandwidth-aware prefetching. Mgrid does,
on the other hand, experience a larger performance decrease than in the sequential case
(up to 6.8%).
However, we are still able to significantly reduce the bandwidth required, especially
when combining compute-intensive applications. It is also worth noting that the
largest increase in bandwidth usage is a mere 0.1%, while the largest decrease is 26.8%.
RPT prefetching is the most accurate prefetching heuristic studied in this thesis. In
table 4.10, I have compared its performance with the bandwidth-aware version. Because
RPT prefetching is very accurate it produces few unnecessary prefetches, thus the gains from
using bandwidth-aware prefetching are minor. The largest decrease in performance occurs
for Eon in combination with Art (5.4%). This is also the only difference larger than 2%
observed. However, there are only small reductions in bandwidth, which again is due to the
high accuracy of the RPT heuristic.
By increasing the threshold value, one also decreases the effects of bandwidth-aware
prefetching. The next experiments are conducted by setting the threshold value to 400.
In other words, we allow prefetching to continue, even if there are two or three memory
requests on average in the queue.
In table 4.11, I have compared bandwidth-aware sequential prefetching to the
bandwidth-oblivious implementation. When using the increased threshold, we observe no significant
performance degradation (the largest occurs for Swim combined with Gzip for a 0.3%
decrease). On the other hand, we observe performance increases of up to 1.7% when combining
Crafty with Gzip. There are significant bandwidth reductions as well (up to 38.5%),
especially in group 1.
In table 4.12, I have conducted the same experiment, but with C/DC prefetching.
Again, we observe little impact on performance and a small, but significant impact on
bandwidth usage. The largest decrease in bandwidth usage occurs in group 1. This is
(a) Speedup in IPC.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -0.5%    +0.0%    +1.1%    -0.4%    -0.3%    -0.5%    +0.9%    -0.1%    -0.1%    -0.0%
gcc       +0.9%    +0.6%    +1.8%    +0.1%    +0.1%    +0.4%    +0.2%    +0.1%    +0.6%    +0.0%
crafty    +3.2%    +2.5%    +4.0%    +2.0%    +1.6%    +2.5%    +0.6%    +0.5%    +1.6%    +0.0%
eon       +0.0%    +0.0%    +0.4%    +0.1%    +0.0%    +0.2%    +0.2%    -0.0%    +0.1%    +0.0%
twolf     -0.1%    -0.0%    +0.2%    -0.0%    -0.0%    +0.0%    +0.1%    +0.0%    +0.1%    +0.0%
swim      -0.3%    +0.0%    -0.0%    +0.1%    +0.0%    +0.0%    +0.0%    -0.0%    -0.5%    +0.0%
mgrid     -2.5%    -2.9%    -2.4%    -5.0%    -5.3%    -1.1%    +0.1%    -0.2%    -1.0%    +0.0%
ammp      +0.0%    -0.1%    -0.1%    -0.1%    -0.1%    -0.4%    +0.2%    +0.0%    +1.7%    +0.0%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       +0.0%    +0.0%    +0.0%    +0.2%    +0.1%    +0.0%    +0.0%    +0.0%    -0.3%    +0.0%

(b) Reductions in bandwidth usage.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -6.7%    -12.5%   -21.5%   -3.5%    -4.0%    -10.6%   -2.2%    -0.5%    -10.1%   +0.1%
gcc       -17.7%   -16.3%   -21.6%   -19.4%   -18.7%   -12.7%   -1.4%    -3.9%    -15.0%   -0.0%
crafty    -40.9%   -37.2%   -23.3%   -47.8%   -46.8%   -26.5%   -3.2%    -12.8%   -37.7%   -0.1%
eon       -26.8%   -26.6%   -46.5%   -34.4%   -14.4%   -24.7%   -6.2%    -0.4%    -24.8%   -0.4%
twolf     -14.2%   -19.0%   -33.7%   -7.6%    -12.0%   -13.2%   -4.4%    -1.0%    -24.1%   -0.2%
swim      -9.2%    -9.0%    -16.2%   -13.7%   -10.5%   -17.1%   -1.8%    -0.3%    -14.9%   +0.0%
mgrid     -2.4%    -2.5%    -2.2%    -4.3%    -4.8%    -0.9%    +0.1%    -0.1%    -1.2%    +0.0%
ammp      -0.1%    -0.2%    -1.1%    -0.2%    -0.4%    -0.5%    +0.2%    -0.2%    +1.7%    -0.0%
wupwise   -11.3%   -17.8%   -14.2%   -8.5%    -17.4%   -14.5%   -2.1%    -4.2%    +0.0%    +0.5%
art       -0.0%    -0.0%    -0.0%    -0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.2%    +0.0%

Table 4.8: Bandwidth-aware prefetching using sequential prefetching on a CMP. Threshold is set to 40.
(a) Speedup in IPC.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -0.4%    -0.3%    -0.2%    -0.4%    -0.5%    +0.2%    +0.9%    -0.1%    -0.2%    -0.1%
gcc       -0.1%    -0.0%    -0.2%    -0.1%    -0.1%    +0.7%    +1.0%    -0.1%    +0.0%    +0.4%
crafty    +0.0%    +0.1%    +0.2%    +0.1%    +0.0%    +0.6%    +0.6%    +0.0%    +0.3%    +0.0%
eon       +0.0%    -0.0%    +0.0%    +0.3%    +0.0%    +0.1%    +0.2%    -0.0%    +0.0%    +0.0%
twolf     -0.1%    -0.0%    -0.0%    -0.0%    -0.0%    +0.1%    +0.5%    -0.0%    +0.0%    +0.1%
swim      -0.0%    -0.0%    +0.0%    +0.0%    -0.0%    -0.5%    +0.0%    +0.0%    -0.0%    +0.0%
mgrid     -3.7%    -4.7%    -4.7%    -6.0%    -6.8%    -1.9%    +0.0%    -0.4%    -3.2%    -0.1%
ammp      +0.0%    +0.0%    -0.0%    +0.0%    +0.0%    +0.0%    +0.3%    +0.0%    +0.2%    +0.0%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       -0.6%    -0.2%    -0.2%    -0.4%    -0.4%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%

(b) Reductions in bandwidth usage.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -5.3%    -2.2%    -1.4%    -2.1%    -1.9%    -2.4%    -4.1%    +0.0%    -3.0%    -0.2%
gcc       -3.4%    -3.1%    -3.3%    -4.1%    -3.6%    -6.2%    -1.8%    -1.1%    -3.9%    -0.1%
crafty    -1.0%    -1.1%    -2.8%    -1.8%    -0.7%    -3.4%    -1.3%    -0.3%    -1.4%    +0.1%
eon       -13.1%   -6.2%    -3.6%    -26.8%   -3.7%    -11.2%   -4.1%    -0.4%    -11.0%   -0.4%
twolf     -5.3%    -4.3%    -1.9%    -1.9%    -3.6%    -5.3%    -7.3%    -0.1%    -6.8%    -0.7%
swim      -6.1%    -4.2%    -6.9%    -7.5%    -5.6%    -6.6%    -0.8%    -0.4%    -7.1%    -0.0%
mgrid     -3.6%    -4.0%    -4.0%    -8.3%    -6.1%    -1.6%    -0.0%    -0.4%    -2.6%    -0.0%
ammp      +0.0%    -0.0%    -0.1%    -0.0%    -0.0%    -0.0%    +0.1%    +0.0%    +0.2%    +0.0%
wupwise   -5.5%    -3.8%    -3.4%    -7.2%    -6.0%    +0.1%    +0.2%    -0.9%    -5.0%    -0.2%
art       -0.3%    -0.4%    -0.5%    -0.7%    -1.0%    -0.1%    -0.0%    -0.1%    -0.1%    +0.0%

Table 4.9: Bandwidth-aware prefetching using C/DC prefetching on a CMP. Threshold is set to 40.
(a) Speedup in IPC.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -0.2%    -0.1%    -0.1%    -0.4%    -0.4%    +0.0%    +0.2%    -0.1%    -0.1%    +0.2%
gcc       +0.0%    +0.0%    -0.0%    -0.0%    -0.0%    +0.1%    +0.0%    -0.0%    -0.0%    +0.2%
crafty    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    -0.0%    +0.0%    -0.1%
eon       +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    -0.0%    +0.0%    +0.0%    -5.4%
twolf     -0.0%    +0.0%    -0.0%    +0.0%    +0.0%    +0.0%    -0.0%    -0.0%    +0.0%    -0.0%
swim      +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    -0.0%    +0.0%    +0.0%    +0.0%
mgrid     -0.8%    -0.8%    -1.0%    -0.9%    -1.2%    -0.1%    -0.0%    -0.2%    -0.1%    +0.0%
ammp      +0.0%    +0.0%    -0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       +0.0%    +0.0%    +0.0%    -0.2%    +0.0%    +0.0%    +0.0%    +0.0%    -0.1%    +0.0%

(b) Reductions in bandwidth usage.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -0.0%    -0.0%    -0.4%    -0.5%    -0.3%    -0.2%    -1.8%    -0.1%    +0.9%    -0.2%
gcc       -0.4%    -0.2%    -0.3%    -0.1%    -0.1%    -0.5%    +0.1%    +0.0%    -0.0%    +0.3%
crafty    -0.0%    -0.0%    -0.2%    +0.0%    -0.1%    -0.1%    -0.1%    -0.0%    +0.0%    +0.1%
eon       -2.9%    -0.3%    -0.4%    -0.1%    -0.9%    -0.2%    -0.9%    +0.1%    -0.0%    +1.7%
twolf     +0.2%    -0.3%    -0.3%    -0.0%    -0.1%    -0.0%    -1.1%    +0.0%    +0.0%    -0.3%
swim      -0.3%    -0.2%    -0.1%    -0.0%    -0.1%    -0.0%    -0.8%    -0.0%    -0.0%    +0.3%
mgrid     -0.7%    -0.6%    -0.9%    -1.4%    -1.1%    -0.1%    +0.0%    -0.1%    -0.1%    +0.0%
ammp      -0.0%    +0.0%    +0.0%    -0.0%    +0.0%    -0.0%    +0.0%    -0.1%    -0.0%    +0.3%
wupwise   -0.0%    -0.0%    +0.0%    +0.0%    -0.0%    -0.0%    -0.0%    -0.0%    +0.0%    -0.0%
art       -0.1%    -0.1%    -0.1%    -0.3%    -0.1%    -0.0%    +0.0%    -0.0%    -0.0%    +0.0%

Table 4.10: Bandwidth-aware prefetching using RPT prefetching on a CMP. Threshold is set to 40.
(a) Speedup in IPC.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -0.3%    -0.2%    +0.4%    -0.2%    -0.1%    -0.6%    -0.0%    -0.1%    +0.3%    +0.0%
gcc       +0.4%    +0.1%    +1.0%    +0.1%    +0.1%    +0.4%    +0.0%    +0.1%    +0.6%    -0.0%
crafty    +1.7%    +1.1%    +0.6%    +1.5%    +1.2%    +0.9%    +0.0%    +0.4%    +0.9%    +0.0%
eon       +0.0%    +0.0%    +0.3%    +0.1%    +0.0%    +0.2%    -0.0%    +0.0%    +0.1%    +0.0%
twolf     -0.1%    -0.1%    +0.2%    -0.0%    -0.0%    +0.3%    +0.0%    +0.0%    +0.1%    +0.0%
swim      -0.3%    +0.0%    +0.0%    +0.1%    +0.0%    +0.4%    +0.0%    -0.1%    +0.0%    +0.0%
mgrid     +0.0%    -0.0%    +0.0%    -0.0%    -0.0%    +0.0%    +0.0%    +0.0%    -0.0%    +0.0%
ammp      +0.0%    -0.1%    -0.0%    -0.1%    +0.0%    +0.0%    +0.0%    +0.0%    +0.8%    +0.0%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +1.3%    +0.0%

(b) Reductions in bandwidth usage.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -3.5%    -6.4%    -11.2%   -2.7%    -2.9%    -6.7%    +0.2%    -0.2%    -6.2%    +0.0%
gcc       -9.1%    -7.1%    -10.4%   -12.4%   -12.4%   -4.0%    -0.0%    -2.3%    -5.7%    +0.0%
crafty    -23.3%   -19.8%   -3.6%    -38.4%   -39.1%   -7.8%    -0.0%    -9.6%    -13.3%   +0.0%
eon       -21.6%   -19.4%   -38.5%   -33.5%   -11.1%   -20.4%   +0.2%    -0.4%    -24.4%   -0.2%
twolf     -10.2%   -11.9%   -28.0%   -6.5%    -10.6%   -12.3%   -0.0%    -0.7%    -21.7%   -0.2%
swim      -5.9%    -5.4%    -7.0%    -12.4%   -9.4%    +0.4%    -0.0%    -0.2%    +0.0%    -0.0%
mgrid     -0.0%    -0.0%    +0.0%    -0.0%    -0.0%    -0.0%    +0.0%    -0.0%    -0.1%    +0.0%
ammp      -0.0%    -0.1%    -0.8%    -0.1%    -0.2%    -0.2%    +0.0%    -0.2%    +0.8%    +0.0%
wupwise   -12.7%   -12.9%   -9.5%    -8.4%    -17.3%   -1.7%    +0.1%    -4.1%    -0.0%    -1.0%
art       +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    -0.0%    +0.0%    +0.0%    +0.3%    +0.0%

Table 4.11: Bandwidth-aware prefetching using sequential prefetching on a CMP. Threshold is set to 400.
especially true when combining Eon with Eon (26.4% decrease).
In the previous experiment with RPT we observed little impact from using bandwidth-
aware prefetching. By increasing the threshold to 400 (table 4.13), there is even less
impact. However, there is a significant decrease in bandwidth usage when combining Gcc with Eon (29%).
Again, the minor effect of bandwidth-aware prefetching is due to the high accuracy of the
RPT prefetching heuristic.
(a) Speedup in IPC.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -0.1%    -0.1%    -0.1%    -0.1%    -0.1%    +0.1%    -0.0%    -0.0%    -0.1%    +0.0%
gcc       -0.1%    -0.0%    -0.1%    -0.0%    -0.0%    +0.4%    -0.0%    -0.0%    +0.1%    +0.0%
crafty    +0.0%    +0.0%    +0.1%    +0.1%    +0.0%    +0.2%    -0.0%    +0.0%    +0.1%    -0.0%
eon       +0.0%    +0.0%    +0.0%    +0.3%    +0.0%    +0.1%    -0.0%    -0.0%    +0.0%    +0.0%
twolf     -0.1%    -0.0%    -0.0%    -0.0%    -0.0%    +0.1%    +0.0%    -0.0%    +0.0%    +0.0%
swim      -0.0%    +0.0%    +0.0%    +0.0%    -0.0%    -0.5%    +0.0%    +0.0%    -0.0%    +0.0%
mgrid     +0.0%    -0.0%    +0.0%    -0.0%    -0.0%    -0.0%    +0.0%    -0.0%    +0.0%    +0.0%
ammp      +0.0%    +0.0%    -0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.2%    +0.0%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%

(b) Reductions in bandwidth usage.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -2.1%    -1.0%    -0.8%    -1.1%    -1.1%    -1.2%    +0.1%    +0.0%    -0.8%    +0.0%
gcc       -1.4%    -1.1%    -1.2%    -2.0%    -1.9%    -2.6%    -0.1%    -0.5%    -1.4%    +0.0%
crafty    -0.3%    -0.3%    -0.5%    -1.4%    -0.4%    -1.3%    +0.1%    -0.2%    -0.4%    -0.0%
eon       -8.2%    -4.0%    -3.7%    -26.4%   -2.8%    -7.0%    +0.1%    -0.2%    -9.6%    -0.0%
twolf     -2.9%    -2.3%    -1.5%    -1.5%    -3.1%    -4.0%    -0.1%    -0.1%    -6.0%    -0.1%
swim      -3.0%    -1.9%    -2.6%    -5.3%    -3.9%    -1.0%    +0.0%    -0.1%    -0.2%    -0.0%
mgrid     +0.0%    -0.0%    +0.0%    -0.0%    -0.0%    -0.0%    +0.0%    -0.0%    +0.0%    +0.0%
ammp      -0.0%    -0.0%    -0.1%    -0.0%    -0.0%    -0.0%    +0.0%    +0.1%    +0.2%    +0.0%
wupwise   -4.6%    -2.4%    -1.2%    -7.0%    -5.8%    +0.1%    +0.1%    -0.7%    +7.2%    -0.0%
art       +0.0%    +0.0%    +0.0%    -0.0%    +0.0%    -0.0%    +0.0%    +0.0%    +0.0%    +0.0%

Table 4.12: Bandwidth-aware prefetching using C/DC prefetching on a CMP. Threshold is set to 400.
(a) Speedup in IPC.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -0.1%    -0.0%    -0.0%    -0.1%    -0.1%    -0.0%    +0.0%    -0.1%    +0.0%    +0.0%
gcc       -0.0%    -0.0%    +0.0%    +3.1%    +0.0%    -0.0%    +0.0%    +0.0%    +0.0%    +0.0%
crafty    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    -0.0%    +0.0%    +0.0%
eon       +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
twolf     -0.0%    -0.0%    -0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
swim      -0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
mgrid     +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
ammp      +0.0%    +0.0%    -0.0%    +0.0%    +0.0%    +0.0%    +0.1%    +0.0%    +0.0%    +0.0%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%
art       +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%

(b) Reductions in bandwidth usage.
          gzip     gcc      crafty   eon      twolf    swim     mgrid    ammp     wupwise  art
gzip      -0.4%    +0.0%    -0.1%    -0.1%    -0.1%    +0.1%    -0.0%    -0.1%    -0.1%    +0.0%
gcc       +0.2%    -0.1%    -0.1%    -29.0%   -0.1%    -0.2%    +0.0%    -0.0%    -0.0%    +0.0%
crafty    +0.0%    -0.0%    -0.1%    +0.0%    -0.1%    -0.1%    -0.0%    -0.0%    +0.0%    +0.0%
eon       -2.1%    -0.4%    -0.1%    -0.1%    -1.0%    -0.4%    -0.2%    -0.0%    -0.0%    +0.0%
twolf     +0.2%    -0.1%    -0.2%    -0.0%    -0.1%    -0.0%    +0.0%    -0.0%    +0.0%    +0.0%
swim      -0.4%    -0.2%    -0.1%    -0.0%    -0.1%    +0.0%    -0.0%    -0.0%    -0.0%    -0.0%
mgrid     -0.0%    +0.0%    -0.0%    -0.0%    -0.0%    +0.0%    +0.0%    -0.0%    -0.0%    +0.0%
ammp      +0.0%    +0.0%    -0.0%    -0.0%    -0.0%    -0.0%    +0.0%    -0.1%    -0.0%    +0.0%
wupwise   +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    -0.0%    +0.0%    -0.0%    -0.0%
art       +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%    +0.0%

Table 4.13: Bandwidth-aware prefetching using RPT prefetching on a CMP. Threshold is set to 400.
Chapter 5
Discussion
5.1 The Simulator
Much of the work done in this thesis was implementing the simulator. In the next sections
I will look at some of the choices I have made.
5.1.1 Prefetching
I have built upon the framework for prefetching simulation that I developed in the fifth
year project. Although the framework can handle a large array of prefetching heuristics, I
have only been able to implement 8 reference heuristics (including the perfect L2), due to
time constraints. I selected sequential prefetching because it is simple and effective, and
has been thoroughly studied in the literature. C/DC- and its predecessor DC-prefetching
are relatively new heuristics that have been proposed. RPT prefetching is an old heuristic,
but has a very high accuracy (as demonstrated in the previous chapter). AVD was included
because it is one of the few heuristics designed to handle pointer-chasing problems.
This selection ranges from the very simple, inaccurate heuristics to the complex and
accurate. The area and power requirements also vary considerably across implementations.
For instance, a full RPT prefetcher would not be the ideal
choice for a small embedded processor, as opposed to the much simpler sequential heuristic.
In addition, RPT prefetching requires information from inside the core, which might lengthen
the critical path and reduce system performance. Thus, all prefetching heuristics have
their place in modern processor design.
The framework was extended in many ways. It now checks if the prefetched address
is within the program's address space, so that the simulator does not crash in the
cache-handling code. This was a problem in the fifth year project, especially with DC prefetching
and high prefetching degrees. In addition, the framework has been optimized for performance
by moving much of the logic so that computation is not performed unnecessarily.
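The address check described above can be illustrated with a minimal sketch. It is not the simulator's code: the `Segment` type, the `is_prefetchable` helper and the segment bounds are assumptions made for the example, and the real framework may track the address space differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A contiguous region of the simulated program's address space (hypothetical)."""
    start: int  # first valid byte address
    end: int    # one past the last valid byte address


def is_prefetchable(addr, segments, line_size=64):
    """Return True if the whole cache line containing addr lies inside one segment."""
    line_start = addr - (addr % line_size)
    line_end = line_start + line_size
    return any(s.start <= line_start and line_end <= s.end for s in segments)


# Hypothetical layout: one data segment and one stack segment.
segments = [Segment(0x10000000, 0x10100000), Segment(0x7FFF0000, 0x80000000)]
candidates = [0x100FFFC0, 0x10100000, 0x7FFF0040]

# Only candidates that pass the check would be handed to the cache-handling code.
issued = [a for a in candidates if is_prefetchable(a, segments)]
```

A check of this kind matters most for heuristics that extrapolate far ahead, since a high prefetching degree makes it more likely that a predicted address runs past the end of a segment.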
5.1.2 DRAM Model
Early in this work it became obvious that a good DRAM model was key to understanding
the performance of the various prefetching heuristics. In my previous work I had used
a much simpler model that only addressed contention. This was insufficient, especially
considering the complexity of modern DRAM systems.
An important factor for moving to another model was open pages. Open pages have a
significant impact on prefetching. Consider, for instance, sequential prefetching: because it
always fetches the next cache line, it is likely to hit an open page and will thus
experience a reduced latency. In effect, all prefetching heuristics that work by detecting
memory access patterns will benefit from open pages. Thus, this facet of the DRAM model
contributes significantly to the performance of prefetching.
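The effect of open pages on sequential prefetching can be shown with a toy model. This is a sketch under stated assumptions, not the DRAM model used in the thesis: it has a single bank with one open page, and the page size and latencies are illustrative values chosen for the example.

```python
class OpenPageDram:
    """Toy single-bank DRAM that keeps the most recently accessed page open."""

    def __init__(self, page_size=4096, hit_latency=40, miss_latency=100):
        # All three parameters are illustrative assumptions.
        self.page_size = page_size
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency
        self.open_page = None
        self.accesses = 0
        self.page_hits = 0  # performance counter: accesses that hit the open page

    def access(self, addr):
        """Return the latency of one access and update the counters."""
        page = addr // self.page_size
        self.accesses += 1
        if page == self.open_page:
            self.page_hits += 1
            return self.hit_latency
        self.open_page = page  # close the old page, open the new one
        return self.miss_latency


dram = OpenPageDram()
# One demand miss followed by three sequential prefetches of 64-byte lines.
latencies = [dram.access(0x2000 + i * 64) for i in range(4)]
```

In this model only the first access pays the full latency; the three prefetches fall in the same page and are served at the reduced latency, which is the behaviour the open-page hit counter is meant to expose.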
In addition, by allowing pipelining of requests, it becomes more favorable to prefetch,
simply because the impact of adding another DRAM request is reduced under high band-
width utilization.
However, because of the increased complexity of the model, it becomes more difficult
to analyze the results. In effect, the latency of a DRAM operation is no longer fixed.
To compensate for this problem a number of performance counters have been added to
highlight the effects of the new DRAM model. The number of accesses that hit open pages
has been especially useful for analyzing the results.
5.1.3 CMP Extension
The CMP extension is the result of cooperation with Haakon Dybdahl. Although it works
very well, it has its limitations [11].
CMPs are targeted at parallel applications. It is therefore a serious shortcoming that
the simulator is unable to simulate parallel benchmarks. As mentioned in section 3.3.3 this
shortcoming is inherited from SimpleScalar. It is possible to execute parallel code, and it has
been done in other projects, but it requires rewriting the program to use special system
calls. Otherwise, this issue would require an almost complete rewrite of the SimpleScalar
code, which would be infeasible.
The other issue with the simulator is its inability to execute operating system code.
It has been previously shown that OS code can significantly impact the performance of a
system. However, this effect can be compensated for, as this factor is equal for all experiments.
The problem lies in the inability of the simulator to support dynamic libraries, network
code, native compilation and so on.
The simulator uses a very inefficient way to simulate CMP. Because the two cores are
simulated in separate processes, there is considerable overhead in context switching. In
addition, the two cores are synchronized on every clock cycle. As already mentioned, this
is suboptimal, as it is only necessary to serialize accesses to the L2 cache, which do not
occur as frequently.
For these three reasons the NCAR group has been looking at other possible simula-
tors. Arnt Jørgen Lande’s diploma work analyzes different simulators and will make a
recommendation for a new simulator. I will comment more on this topic in the future
work section (6.3).
5.2 Methodology
Simulation of CMPs offers a vast experimental space. The uniprocessor cores have 82
parameters each. In addition, there is a large number of possible benchmarks to run. Much
of the work in this thesis has been to simply reduce the experimental degrees of freedom,
in such a way that prefetching becomes the dominant factor in realistic conditions.
5.2.1 Benchmark Selection
The number of benchmarks that can be run is limited by the constraints
imposed by the simulator. Because it is not possible to run true parallel benchmarks, it
was a satisfactory choice to use the SPEC2000 benchmark suite. I have used this suite
previously, and it is very common in the literature. In addition, uniprocessor results
can only be compared with CMP results if both use the same benchmarks.
I have used two separate datasets for SPEC2000. The Lgred datasets are reduced
datasets which are said to behave in much the same manner as the reference sets. However,
in practice there is a significant difference, as the reduced sets have a much smaller memory
footprint, and thus fit more easily into the cache.
5.2.2 Simulation Environment
Most of the simulations in this thesis were conducted on the Clustis2 cluster located at IDI.
Clustis2 is composed of 22 nodes, each with a Pentium 4 processor. As my experiments
are trivially parallelizable (each experiment is independent of the others), using this
cluster provides a speedup of approximately 22 times.
Python was used extensively, both to construct and submit jobs to the Clustis queue
system (OpenPBS). As each experiment produces a lot of data (approx 200 MB), Python
was also extensively used to analyze the data. Some of the scripts used can be found in
the appendix.
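As an illustration of how such job construction might look, the sketch below generates one PBS job script per benchmark pairing. It is not one of the scripts from the appendix: the wrapper `run_cmp.sh`, the output file naming and the resource request are hypothetical, while `#PBS -N`, `#PBS -l` and `$PBS_O_WORKDIR` are standard PBS features.

```python
import itertools

BENCHMARKS = ["gzip", "gcc", "crafty", "eon", "twolf",
              "swim", "mgrid", "ammp", "wupwise", "art"]


def make_job_script(bench_a, bench_b, prefetcher):
    """Return the text of a PBS job script for one two-core experiment."""
    return "\n".join([
        "#!/bin/sh",
        f"#PBS -N {bench_a}-{bench_b}",
        "#PBS -l nodes=1",
        'cd "$PBS_O_WORKDIR"',
        # run_cmp.sh is a hypothetical wrapper around the simulator.
        f"./run_cmp.sh {prefetcher} {bench_a} {bench_b}"
        f" > {prefetcher}-{bench_a}-{bench_b}.out",
        "",
    ])


# One independent job per ordered benchmark pair, as in the 10 x 10 result tables.
jobs = {
    (a, b): make_job_script(a, b, "sequential")
    for a, b in itertools.product(BENCHMARKS, repeat=2)
}
```

Each script could then be written to a file and handed to `qsub`. Because the jobs do not depend on each other, the queue system is free to spread them over all available nodes.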
In addition, Musculus (musculus.hpc.ntnu.no) was also used for this project. Musculus
is a Cray XD-1. It has 12 64-bit Opteron CPUs with 24 GB of RAM. However, it could
only be used for uniprocessor experiments, as the CPUs share memory in pairs (and would
thus conflict when running shared-memory based experiments). A possibility would be to
simply randomize or probe shared memory usage. This was not done, as the computing
power of Clustis2 was sufficient for this project.
5.3 Results
5.3.1 Uniprocessor Results
By comparing the performance of a system with no prefetching with the perfect L2 case it
is clear that for some of the benchmarks there is a very large opportunity for prefetching.
As an example, the performance of Art increases by 529% with a perfect L2. Other
benchmarks gain very little from a perfect L2 (such as VPR).
Some of the prefetching methods come very close to this theoretical limit. This is
especially visible with RPT and stream prefetching on the most memory-intensive applications
such as Ammp and Art. This is due to these heuristics having an accuracy close to 100%.
As such they generate very few unnecessary prefetches. In addition, their coverage is also
close to 100%, which indicates that they exploit almost all opportunities to prefetch. On
the other hand, RPT prefetching is expensive to implement. It requires large cache-like
structures that consume both area and power.
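For reference, accuracy and coverage can be computed from three counters, as in the sketch below. It uses the commonly cited definitions (useful prefetches over issued prefetches, and useful prefetches over the misses that would have occurred without prefetching). The counter names and values are hypothetical, and the simulator's own counters may be defined differently.

```python
def prefetch_accuracy(useful_prefetches, issued_prefetches):
    """Fraction of issued prefetches that were used before being evicted."""
    if issued_prefetches == 0:
        return 0.0
    return useful_prefetches / issued_prefetches


def prefetch_coverage(useful_prefetches, remaining_misses):
    """Fraction of the would-be misses that prefetching removed."""
    total = useful_prefetches + remaining_misses
    if total == 0:
        return 0.0
    return useful_prefetches / total


# Hypothetical counter values for one run (assumption, not measured data).
acc = prefetch_accuracy(useful_prefetches=980, issued_prefetches=1000)
cov = prefetch_coverage(useful_prefetches=980, remaining_misses=20)
```

A heuristic can score high on one metric and low on the other: issuing very few, very safe prefetches gives high accuracy but low coverage, which is why both are reported.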
A much simpler alternative is sequential prefetching. It is easy to implement and gives
considerable performance increases. However, it is not as accurate as RPT prefetching,
and therefore generates a lot of unnecessary prefetches.
In between, there are DC and C/DC prefetching. They are more complex than sequen-
tial prefetching, but less complex than RPT prefetching. They are more accurate than
sequential prefetching, and get better results.
AVD prefetching had little effect on the benchmarks used. This is probably due to the
fact that the benchmarks used were not pointer-intensive.
Increasing the size of the tables used by C/DC and RPT prefetching generally increases
performance. However, the most significant increases occur for relatively small tables. This
indicates that the most memory intensive loads occur in relatively tight loops, which is
in line with previous experiments.
It is also worthwhile to note the optimum CZone size (around 256KB). This indicates
how large the different memory regions in these benchmarks are. It is therefore reasonable
that the optimum CZone size is application dependent.
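To make the role of the CZone size concrete, the sketch below splits a miss stream into per-CZone streams. It assumes CZones are aligned, power-of-two sized regions of the address space; the function names and the example addresses are hypothetical, and the delta correlation that C/DC performs within each zone is not shown.

```python
def czone_of(addr, czone_size=256 * 1024):
    """Return the index of the CZone containing addr."""
    if czone_size <= 0 or czone_size & (czone_size - 1):
        raise ValueError("czone_size must be a positive power of two")
    return addr >> (czone_size.bit_length() - 1)


def group_by_czone(miss_addrs, czone_size=256 * 1024):
    """Split a miss stream into per-CZone address lists, preserving order."""
    zones = {}
    for addr in miss_addrs:
        zones.setdefault(czone_of(addr, czone_size), []).append(addr)
    return zones


# Two interleaved strided streams that fall in different 256KB zones.
misses = [0x40000, 0x80040, 0x40040, 0x80080, 0x40080]
zones = group_by_czone(misses)
```

Within each zone the deltas between consecutive misses are constant in this example, whereas the interleaved global stream shows no simple pattern. A zone that is too large mixes unrelated streams, and one that is too small splits a single stream, which is consistent with an application-dependent optimum.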
Increasing the prefetching degree generally increases performance if there is enough
bandwidth available to support it. In other words, the effect of prefetching new data
outweighs the disadvantages of displacing old data in the cache. However, in bandwidth-
limited situations, sequential prefetching might decrease performance when increasing the
prefetching degree. This effect does not occur with the more advanced heuristics such as
C/DC and RPT.
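The prefetching degree can be illustrated for the sequential heuristic with a short sketch. It assumes a prefetch-on-miss variant with 64-byte cache lines; the function name and the miss address are hypothetical.

```python
def sequential_prefetches(miss_addr, degree, line_size=64):
    """Return the addresses of the `degree` cache lines that follow miss_addr."""
    line_start = miss_addr - (miss_addr % line_size)
    return [line_start + i * line_size for i in range(1, degree + 1)]


# Each doubling of the degree doubles the DRAM requests caused by a single miss.
requests_per_miss = {d: sequential_prefetches(0x1010, d) for d in (1, 2, 4)}
```

Every additional line is fetched whether or not it is later used, so for an inaccurate heuristic the extra bandwidth grows with the degree while the number of useful prefetches may not.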
5.3.2 CMP Results
When experimenting with a CMP, I have looked at a 2-way CMP, instead of a system with
more cores. This decision was driven by a need for simplicity of analysis. More cores
would necessarily increase both computing time and the complexity of the analysis
without gaining much new information. Although it is possible to buy systems with up
to 64 cores on a single die, they are not general purpose, and are thus not as interesting for
this type of research.
I have looked at how increasing the number of DRAM channels, the number of sets and
the associativity of the L2 affects performance. Increasing the number of DRAM channels
has a significant impact on benchmarks that are memory bound such as Ammp and Art,
while increasing the number of sets has a small impact relative to the
area requirements. On the other hand, doubling the associativity has a very large impact
on performance. This is in line with similar research.
Prefetching in a CMP is a double-edged sword. For some benchmarks (mainly memory
bound) it can have a large positive impact, but it can also reduce performance for others.
This is especially true for the sequential prefetching heuristic.
As accuracy increases, so does the performance. It is interesting to note that the
performance of RPT prefetching is not as high as can be expected given the uniprocessor
results. This is both due to a change in problem size and a result of the limited time
available to initialize the RPT structures.
5.3.3 Bandwidth-Aware Prefetching
Bandwidth-aware prefetching is a new prefetching heuristic proposed in this thesis. It is
based on using performance counters to direct prefetching. By predicting future bandwidth
requirements, it is possible to use that information to gain a performance benefit. The
idea is based on the assumption that it is better to prefetch when there is little bandwidth
contention.
In the previous two chapters I have examined the performance of bandwidth-aware
prefetching in a uniprocessor setting and in a CMP. It is clear that bandwidth-aware
prefetching works best when the prefetcher's accuracy is low. As such it would be
interesting to combine this heuristic with a heuristic that predicts the accuracy of the prefetching
heuristic. This subject will be further discussed in section 6.3.
Taking the average of the last three latency values was a relatively good predictor. The
Network Weather Service project [66] uses several heuristics to predict future bandwidth
usage and uses the best heuristic for each prediction. In a processor, one cannot afford
to use such advanced methods: due to area constraints, a very
compute- or storage-intensive predictor is not an option.
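A minimal sketch of such a predictor follows. It assumes that the threshold is compared against the predicted DRAM latency and that prefetching is suppressed while the prediction exceeds it; the class name, the comparison rule and the latency samples are illustrative assumptions, not the simulator's implementation.

```python
from collections import deque


class BandwidthAwareFilter:
    """Suppress prefetches while the predicted DRAM latency exceeds a threshold."""

    def __init__(self, threshold, history=3):
        self.threshold = threshold
        self.samples = deque(maxlen=history)  # most recent observed latencies

    def observe(self, latency):
        """Record the latency of a completed DRAM request."""
        self.samples.append(latency)

    def predicted_latency(self):
        """Predict the next latency as the mean of the stored samples."""
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

    def allow_prefetch(self):
        return self.predicted_latency() <= self.threshold


bw_filter = BandwidthAwareFilter(threshold=40)
decisions = []
for latency in (30, 35, 90, 120, 20, 20, 20):  # hypothetical latency samples
    bw_filter.observe(latency)
    decisions.append(bw_filter.allow_prefetch())
```

With a three-entry history the filter needs only three small registers and an adder, which fits the area argument above. The cost is a delayed reaction: in this run prefetching stays disabled for two samples after the contention has ended.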
Bandwidth-aware prefetching worked quite well in most cases. Most benchmarks showed
relatively small changes in performance when it was used. However, most applications
saw a drastic reduction in the number of DRAM accesses (up to 47.8%).
5.4 Workflow
In this section I will briefly discuss the tools that I have used in this thesis and my
experience with them. This will hopefully be useful for someone who is planning to do
something similar.
For development I have used C and the Gcc compiler. I experimented with Icc (Intel
C compiler), and it did provide a speedup of about 20%. However, it produces code
specific to the Pentium series of processors, and its performance is therefore lower on
AMD processors. In addition, Icc is licensed software1, which made it easier for me to
simply use Gcc on all platforms.
Subversion was used as the version control system. This worked very well and caused
few problems. In effect, I had two repositories: one for CMP development (which I share
with the rest of the NCAR group) and a personal repository. The personal repository was also
used for documentation and the preparation of this document.
LaTeX was used as the primary typesetting tool. As it is a plain-text format I could
use Subversion for version control for this part of the work as well.
Python was used extensively, both to set up and execute simulations and for data
analysis.
Gnuplot was used to plot all graphs contained in this document. It was easy to integrate
it with Python in such a way that getting a graphical representation of the results quickly
became possible.
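As an illustration of this kind of integration, the sketch below generates a data file body and a Gnuplot script from a Python dictionary of results. It is not the script used for the figures in this document: the function names and file names are hypothetical, and the script assumes a Gnuplot version that supports the histogram plot style.

```python
def data_rows(results):
    """Format {benchmark: [value, ...]} as whitespace-separated Gnuplot rows."""
    return "".join(
        "%s %s\n" % (name, " ".join("%.3f" % v for v in values))
        for name, values in sorted(results.items())
    )


def gnuplot_script(data_file, output_file, series, ylabel="Normalized IPC"):
    """Build a Gnuplot script that plots one bar series per data column."""
    lines = [
        "set terminal postscript eps",
        'set output "%s"' % output_file,
        "set style data histograms",
        "set style fill solid",
        'set ylabel "%s"' % ylabel,
    ]
    plots = []
    # Column 1 holds the benchmark name, columns 2.. hold one series each.
    for column, title in enumerate(series, start=2):
        source = '"%s"' % data_file if column == 2 else '""'
        plots.append('%s using %d:xtic(1) title "%s"' % (source, column, title))
    lines.append("plot " + ", ".join(plots))
    return "\n".join(lines) + "\n"
```

Writing both strings to files and running `gnuplot` on the script would then produce the graph; keeping the generation in plain functions makes it easy to call right after a batch of simulations has been parsed.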
Lastly, OpenPBS was used for batch processing on the Clustis2 cluster.
1 Although free for academic use.
Chapter 6
Conclusion
6.1 Results
In this thesis I have looked at several types of prefetching. Experiments with the simulator
have produced results comparable to those in the literature. This builds confidence in the
simulator framework.
The most interesting results are those for bandwidth-aware prefetching. The development of this
heuristic was driven by the hypothesis that reducing the bandwidth usage of prefetching
would lead to increased performance in CMPs. Bandwidth-aware prefetching does reduce
bandwidth usage significantly (up to 47.8%), but it does not give a large increase in
IPC: for most applications IPC does not change significantly. The performance
of bandwidth-aware prefetching is closely correlated with the accuracy of the prefetching
heuristic. If the accuracy of the prefetching heuristic is high, then there is little to gain
from rejecting prefetches.
In CMP systems, the largest gains are from compute-bound applications, which is
natural, considering that most compute-bound applications have more erratic memory
access patterns.
During the experimentation with bandwidth-aware prefetching I have discovered several
factors that can be exploited to increase the performance of the heuristic. These will
be explored further in future work.
6.2 Contributions
To summarize, in this thesis I have:
• Investigated performance counters in modern processors.
• Developed kernel modules for Linux for exploring performance counters.
• Investigated modern prefetching methods.
• Expanded the simulator used in my fifth year project.
• Expanded the simulator to include a more realistic DRAM model.
• Expanded the simulator to simulate CMP architectures.
• Developed a methodology to benchmark the performance of CMP architectures.
• Developed an understanding of current prefetching heuristics by conducting
experiments.
• Developed an understanding of known prefetching heuristics in a CMP setting by
performing simulations.
• Developed a new prefetching heuristic based on performance counters.
• Documented its performance through experimentation.
6.3 Future work
As this master’s thesis is the groundwork for further work in my PhD thesis, I would like
to highlight some possible future work:
6.3.1 Simulator
The CMP simulator based on SimpleScalar has some limitations. There exist many
simulators that are already capable of simulating CMPs. Arnt Jørgen Lande is currently
working on a diploma thesis that evaluates many simulators, and I intend to use his work to
consider switching simulators. The purposes of such a switch are:
• A larger community using the simulator.
• A possible speedup.
• Ability to run true parallel benchmarks.
• Ability to run OS code.
From Arnt Jørgen’s initial report, it looks like the M5 simulator [80] is a viable
alternative. M5 is loosely based on some SimpleScalar code, which would be beneficial, as I
am familiar with this code.
As always, extending the simulator framework to support more types of prefetching
would be beneficial when designing heuristics that direct prefetching, such as
bandwidth-aware prefetching. Markov predictors [81] would be especially interesting to implement,
as they have shown very good results in the past.
6.3.2 Interactions
A computer system is very complex and prefetching interacts with other parts of the
system. It would be interesting to see how prefetching interacts with alternative cache
replacement algorithms. In uniprocessors the LRU algorithm has been the dominant
replacement algorithm for decades; however, it is not certain that this algorithm is the
optimal one for CMPs. Many alternative algorithms have therefore been developed
recently (Haakon Dybdahl has done some work in this area [11]).
Prefetching will also necessarily interact with the cache consistency mechanisms. In a
tightly coupled application, this might have a large impact on performance if the prefetcher
generates unnecessary invalidates.
Another new and interesting topic in CMPs is cache partitioning among cores [82].
These schemes dynamically divide the cache into partitions so that each core has its own
subset of the cache. The idea is that some applications require larger caches than others
and that this can be determined dynamically.
6.3.3 Bandwidth-Aware Prefetching
It is obvious that the efficiency of bandwidth-aware prefetching is determined by the
accuracy of the prefetching heuristic. Therefore it would be interesting to estimate the
accuracy of prefetching heuristics and use this as an additional input to the
bandwidth-aware scheme. However, estimating accuracy is difficult. Prefetched data might not be
used for several million clock cycles, and tagging every cache line would be expensive in
terms of hardware. A possible solution would be to use shadow tags that mimic parts of
the original cache and cover only a few sets. Such a structure does not hold actual data,
only metadata such as tags and LRU placement. By using sampling theory it might be
possible to get an estimate of the accuracy of a given algorithm at runtime.
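To make the shadow-tag idea more concrete, the following Python sketch models a tag-only structure that tracks a sampled subset of the cache sets and counts how many prefetched lines are referenced before they are evicted. It is a sketch under stated assumptions, not a proposed hardware design: the class name, the sampling rule (every n-th set), the LRU policy and all parameters are chosen purely for illustration.

```python
from collections import OrderedDict


class ShadowTags:
    """Tag-only model of a few sampled cache sets (no data is stored)."""

    def __init__(self, num_sets, ways, sample_every, line_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.sample_every = sample_every  # assumption: track every n-th set
        self.line_size = line_size
        self.sets = {}    # set index -> OrderedDict(tag -> prefetched flag), LRU first
        self.issued = 0   # prefetches inserted into sampled sets
        self.useful = 0   # of those, referenced before being evicted

    def _locate(self, address):
        line = address // self.line_size
        index = line % self.num_sets
        if index % self.sample_every != 0:
            return None, None  # this set is not sampled
        return self.sets.setdefault(index, OrderedDict()), line // self.num_sets

    def _insert(self, entries, tag, prefetched):
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the LRU entry
        entries[tag] = prefetched

    def prefetch(self, address):
        """Record a prefetched line if it maps to a sampled set."""
        entries, tag = self._locate(address)
        if entries is None or tag in entries:
            return
        self.issued += 1
        self._insert(entries, tag, True)

    def access(self, address):
        """Record a demand access; a hit on a prefetched line counts as useful."""
        entries, tag = self._locate(address)
        if entries is None:
            return
        if tag in entries:
            if entries.pop(tag):
                self.useful += 1
            entries[tag] = False  # re-insert at the MRU position
        else:
            self._insert(entries, tag, False)

    def accuracy(self):
        """Estimated accuracy over the sampled sets, or None before any prefetch."""
        return self.useful / self.issued if self.issued else None
```

Because only the sampled sets are tracked, the storage cost scales with the sampling rate rather than with the full cache size; whether the estimate is representative depends on how evenly the prefetches are spread over the sets.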
Another possible avenue for this type of heuristic is to control the prefetching degree,
rather than making a binary on/off decision. Some initial experimentation was conducted based on this
approach. It yielded good results for some benchmarks, but performed badly for others.
It was therefore dropped in favor of the one presented in this thesis. However, with an
accuracy estimator it might be possible to improve this heuristic.
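One way to picture degree control is shown in the sketch below, which scales the number of sequentially prefetched lines down as the predicted latency rises above the uncontended latency. This is not the variant that was tried in the initial experiments; the scaling rule, the function names and the default values are assumptions made for illustration only.

```python
def prefetch_degree(predicted_latency, idle_latency, max_degree=4):
    """Return how many lines to prefetch for a given predicted latency.

    Assumed rule: the degree is inversely proportional to the predicted
    latency, measured relative to the uncontended (idle) latency.
    """
    if predicted_latency <= idle_latency:
        return max_degree
    return int(max_degree * idle_latency // predicted_latency)


def prefetch_addresses(miss_address, degree, line_size=64):
    """Addresses of the next `degree` cache lines after the missing line."""
    first = (miss_address // line_size + 1) * line_size
    return [first + i * line_size for i in range(degree)]
```

With a degree of zero the list is empty, so the binary on/off decision is the special case where the degree is either zero or a fixed maximum.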
In addition, adding a DRAM controller with a capability of providing priorities and
preemption might increase performance [83]. Such a memory controller could be used to
give prefetches a relatively low priority, while giving a higher priority to actual loads. This
could be expanded to provide fairness across cores.
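The priority part of such a controller can be sketched as a request queue in which demand loads are always served before prefetches, with first-come-first-served order within each class. The sketch below assumes only two priority levels and does not model preemption, DRAM timing or fairness across cores; all names are hypothetical.

```python
import heapq
import itertools

# Lower value means higher priority.
DEMAND, PREFETCH = 0, 1


class PriorityRequestQueue:
    """Memory request queue that serves demand loads before prefetches."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # arrival counter keeps FIFO order per class

    def push(self, address, kind):
        heapq.heappush(self._heap, (kind, next(self._order), address))

    def pop(self):
        """Return (address, kind) of the next request to issue to DRAM."""
        kind, _, address = heapq.heappop(self._heap)
        return address, kind

    def __len__(self):
        return len(self._heap)
```

Fairness across cores could be approximated by adding a per-core component to the sort key, at the cost of a more complex arbiter.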
6.4 Acknowledgments
I would like to acknowledge the following persons for their help with this thesis:
• Lasse Natvig - For his guidance and tutoring.
• Haakon Dybdahl - For working with me on the CMP simulator and sharing
his ideas.
• Cyril Banino-Rokkones - For providing me with example PAPI code and proofreading.
• Thorvald Natvig - For helping me with understanding x86 performance counters.
• Hanne Lian - For her support.
Bibliography
[1] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, Apr.
1965.
[2] International Technology Roadmap for Semiconductors, “ITRS roadmap,” 2004. http://www.
itrs.net/Common/2004Update/2004Update.htm.
[3] Wikipedia, “Moore’s law,” 2006. http://en.wikipedia.org/wiki/Moore%27s_Law.
[4] J. L. Hennessy and D. A. Patterson, Computer Architecture. 340 Pine Street, Sixth Floor,
San Francisco, CA 94104-3205, USA: Morgan Kaufmann Publishers, 2003.
[5] W. Wulf and S. McKee, “Hitting the memory wall: Implications of the obvious,” ACM Computer
Architecture News, vol. 23, Mar. 1995.
[6] D. A. Patterson, “Computer science education in the 21st century,” Commun. ACM, vol. 49,
no. 3, pp. 27–30, 2006.
[7] D. A. Patterson, “Latency lags bandwith,” Commun. ACM, vol. 47, no. 10, pp. 71–75, 2004.
[8] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas,
and K. Yelick, “A case for intelligent RAM,” IEEE Micro, vol. 17, pp. 34–44, 1997.
[9] O. Mutlu, H. Kim, and Y. N. Patt, “Address-value delta (AVD) prediction: Increasing the
effectiveness of runahead execution by exploiting regular memory allocation patterns,” in 38th
Annual International Symposium on Microarchitecture (MICRO-38), pp. 233–244, 2005.
[10] S. R. Sarangi, W. Liu, J. Torrellas, and Y. Zhou, “ReSlice: Selective Re-Execution of Long-
Retired Misspeculated Instructions Using Forward Slicing,” in 38th Annual International Sym-
posium on Microarchitecture (MICRO-38), pp. 245–256, 2005.
[11] H. Dybdahl and P. Stenström, “Enhancing lower level cache performance by early miss determination
and block bypassing,” in Proceedings of ACSAC, 2006.
[12] L. Spracklen and S. G. Abraham, “Chip multithreading: Opportunities and challenges,” in 11th
International Symposium on High-Performance Computer Architecture (HPCA’05), pp. 248–252,
2005.
[13] A. J. Smith, “Cache memories,” ACM Comput. Surv., vol. 14, no. 3, pp. 473–530, 1982.
[14] P. Shivakumar and N. Jouppi, “Cacti 3.0: An integrated cache timing, power, and area model,”
Tech. Rep. 2, Compaq Western Research Laboratory, August 2001.
[15] V. Srinivasan, E. Davidson, and G. Tyson, “A prefetch taxonomy,” Computers, IEEE Trans-
actions on, vol. 53, pp. 126–140, Feb. 2004.
[16] S. VanderWiel, “A survey of data prefetching techniques,” Tech. Rep. 5, University of Min-
nesota, October 1996.
[17] I.-C. K. Chen, C.-C. Lee, and T. Mudge, “Instruction prefetching using branch prediction
information,” in IEEE International Conference on Computer Design, p. 593, 1997.
[18] J. Pierce and T. Mudge, “Wrong-path instruction prefetching,” in 29th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO-29), p. 165, 1996.
[19] L. Spracklen, Y. Chou, and S. G. Abraham, “Effective instruction prefetching in chip multi-
processors for modern commercial applications,” in 11th International Symposium on High-
Performance Computer Architecture, pp. 225–236, 2005.
[20] D. Callahan, K. Kennedy, and A. Porterfield, “Software prefetching,” in ASPLOS-IV: Proceed-
ings of the fourth international conference on Architectural support for programming languages
and operating systems, (New York, NY, USA), pp. 40–52, ACM Press, 1991.
[21] C.-K. Luk and T. Mowry, “Automatic compiler-inserted prefetching for pointer-based applications,”
Computers, IEEE Transactions on, vol. 48, pp. 134–141, Feb. 1999.
[22] Z. Wang, D. Burger, K. McKinley, S. Reinhardt, and C. Weems, “Guided region prefetching:
a cooperative hardware/software approach,” in Computer Architecture, 2003. Proceedings.
30th Annual International Symposium on, pp. 388–398, June 2003.
[23] F. Dahlgren and P. Stenström, “Evaluation of hardware-based stride and sequential prefetching
in shared-memory multiprocessors,” Parallel and Distributed Systems, IEEE Transactions on,
vol. 7, pp. 385–398, Apr. 1996.
[24] K. J. Nesbit and J. E. Smith, “Data cache prefetching using a global history buffer,” Micro,
IEEE, vol. 25, pp. 90–97, Jan. 2005.
[25] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith, “AC/DC: An adaptive data cache prefetcher,”
in Proceedings of the 13th International Conference on Parallel Architecture and Compilation
Techniques, pp. 135–145, 2004.
[26] J. Collins, S. Sair, B. Calder, and D. M. Tullsen, “Pointer cache assisted prefetching,” in
Microarchitecture, 2002. (MICRO-35). Proceedings. 35th Annual IEEE/ACM International
Symposium on, pp. 62–73, 2002.
[27] S.-C. Lai and S.-L. Lu, “Hardware-based pointer data prefetcher,” in Computer Design, 2003.
Proceedings. 21st International Conference on, pp. 290 – 298, Oct. 2003.
[28] A. Roth and G. S. Sohi, “Effective jump-pointer prefetching for linked data structures,” in
Computer Architecture, 1999. Proceedings of the 26th International Symposium on, pp. 111–121,
1999.
[29] T.-F. Chen and J.-L. Baer, “Effective hardware-based data prefetching for high-performance
processors,” Computers, IEEE Transactions on, vol. 44, pp. 609–623, May 1995.
[30] J. Fritts, “Multi-level memory prefetching for media and stream processing,” in 2002 IEEE In-
ternational Conference on Multimedia and Expo, 2002. ICME ’02. Proceedings, vol. 2, pp. 101–
104, Aug. 2002.
[31] J. M. Tendler, J. S. Dodson, J. J. S. Fields, H. Le, and B. Sinharoy, “Power4 system microar-
chitecture,” IBM Journal of Research and Development, vol. 50, 2002.
[32] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner, “Power5 system
microarchitecture,” IBM Journal of Research and Development, vol. 49, no. 4/5, 2005.
[33] T. Mitra, “Dynamic random access memory: A survey,” Research Proficiency Examination
Report, march 1999.
[34] Rambus Inc., “XDR dram : System design overview,” 2006. http://www.rambus.com/
products/xdr/index.aspx.
[35] Philips Semiconductors, “Tm-1300 media processor data book.”
[36] M. Ekman and P. Stenström, “Performance and power impact of issue-width in chip-multiprocessor
cores,” in International Conference on Parallel Processing, 2003.
[37] L. Benini and G. De Micheli, “Networks on chips: a new SoC paradigm,” Computer,
vol. 35, pp. 70–78, Jan. 2002. http://ieeexplore.ieee.org/iel5/2/21069/00976921.pdf.
[38] IBM, “Octopiler webpage,” 2006. http://domino.research.ibm.com/comm/research_
projects.nsf/pages/cellcompiler.index.html.
[39] Advanced Micro Devices, Inc, “Amd athlon 64 x2 key architectural features web page,”
2006. http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_
13041%5E13043,00.html.
[40] R. Shrout, “Amd athlon 64 x2 4400+ dual core processor review,” 2005.
http://www.pcper.com/article.php?aid=141&type=expert.
[41] K. Krewell, “Sun weaves multithreaded future,” Microprocessor Report, Apr. 2003.
[42] J. D. Gelas, “Sun’s ultrasparc t1 - the next generation server cpus,” 2005.
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2657&p=3.
[43] K. Krewell, “Cell moves into the limelight,” Microprocessor Report, Feb. 2005.
[44] D. Pham, S. Asano, M. Bolliger, M. Day, H. Hofstee, C. Johns, J. Kahle, A. Kameyama,
J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock,
S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa, “The design and implementation of a
first-generation cell processor,” in IEEE International Solid-State Circuits Conference, vol. 1,
pp. 184–185, February 2005.
[45] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick, “The potential of the
cell processor for scientific computing,” in Computing Frontiers, 2006.
[46] N. Blachford, “Cell architecture explained,” 2005. http://www.blachford.info/computer/
Cell/Cell0_v2.html.
[47] D. Burger and T. M. Austin, “Simplescalar toolset 3.0b,” 2003. http://www.simplescalar.
com.
[48] T. Austin, E. Larson, and D. Ernst, “Simplescalar: An infrastructure for computer system
modeling,” IEEE Computer, 2002.
[49] D. Burger and T. M. Austin, “The simplescalar tool set, version 2.0,” 1997. http://www.
simplescalar.com/docs/users_guide_v2.pdf.
[50] N. Manjikian, “More enhancements of the simplescalar tool set,” SIGARCH Comput. Archit.
News, vol. 29, no. 4, pp. 5–12, 2001.
[51] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: a framework for architectural-level power
analysis and optimizations,” in Proceedings of the 27th International Symposium on Computer
Architecture, pp. 83–94, 2000.
[52] E. Larson, S. Chatterjee, and T. Austin, “The mase microarchitecture simulation environ-
ment,” in Proceedings of the 2001 International Symposium on Performance Analysis of Sys-
tems and Software, June 2001.
[53] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson,
A. Moestedt, and B. Werner, “Simics: A full system simulation platform,” Computer, vol. 35,
pp. 50–58, Feb. 2002.
[54] T. F. Wenisch and R. E. Wunderlich, “Simflex: Fast, accurate and flexible simulation of computer
systems,” November 2005. Tutorial in the International Symposium on Microarchitecture
(MICRO-38).
[55] B. P. Zeigler, H. Praehofer, and T. G. Kim, Theory of Modeling and Simulation, 2nd ed. 84
Theobalds Road, London WC1X 8RR, UK: Academic Press, 2000.
[56] S. Browne, C. Deane, G. Ho, and P. Mucci, “PAPI: A portable interface to hardware perfor-
mance counters,” in Proceedings of Department of Defense HPCMP Users Group Conference,
June 1999.
[57] Intel corporation, “IA-32 intel architecture software developer’s manual,” January 2006.
[58] S. Sandeep, “Gcc-inline-assembly-howto,” March 2003. http://www.ibiblio.org/gferg/
ldp/GCC-Inline-Assembly-HOWTO.html.
[59] Discretix Technologies Ltd, “Introduction to side channel attacks,” 2006.
http://www.hbarel.com/publications/Introduction_To_Side_Channel_Attacks.pdf.
[60] AMD, “AMD Athlon Processor, x86 Code Optimization Guide,” February 2002.
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf.
[61] J. Levon, OProfile manual, 2004. http://oprofile.sourceforge.net/doc/index.html.
[62] Intel corporation, “VTune website.” http://www.intel.com/cd/software/products/
asmo-na/eng/vtune/index.htm.
[63] V. Cuppu, B. Jacob, B. Davis, and T. Mudge, “A performance comparison of contempo-
rary DRAM architectures,” in Proceedings of the 26th International Symposium on Computer
Architecture, pp. 222–233, 1999.
[64] Wikipedia, “DDR2,” 2005. http://en.wikipedia.org/wiki/DDR-2.
[65] Corsair Memory Inc, “CM2X512A-6400.” http://www.corsairmemory.com/corsair/
products/specs/CM2X512A-6400.pdf.
[66] R. Wolski, “Experiences with predicting resource performance on-line in computational grid
settings,” ACM SIGMETRICS Performance Evaluation Review, vol. 30, pp. 41–49, March
2003.
[67] SPEC, “Spec 2000 benchmark suites,” 2000. http://www.spec.org.
[68] A. KleinOsowski and D. J. Lilja, “MinneSPEC: A new spec benchmark workload for
simulation-based computer architecture research,” Computer Architecture Letters, vol. 1, June
2002.
[69] E. Perelman, G. Hamerly, M. V. Biesbrouck, T. Sherwood, and B. Calder, “Using simpoint
for accurate and efficient simulation,” in ACM SIGMETRICS the International Conference
on Measurement and Modeling of Computer Systems, June 2003.
[70] “Simpoint web page,” 2006. http://www-cse.ucsd.edu/~calder/simpoint/.
[71] S. Sair and M. Charney, “Memory behavior of the spec2000 benchmark suite.” http://
citeseer.ist.psu.edu/431597.html.
[72] Transaction Processing Performance Council, “TPC-C webpage,” 2006. http://www.tpc.
org/tpcc/default.asp.
[73] SPEC, “SPECWeb web page,” 2005. http://www.spec.org/web2005/.
[74] SPEC, “SPECjAppServer2004 web page,” 2004. http://www.spec.org/jAppServer2004/.
[75] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, “HPL - a portable implementation of
the high-performance linpack benchmark for distributed-memory computers,” January 2004.
http://www.netlib.org/benchmark/hpl/.
[76] R. Biswas and R. F. V. der Wijngaart, “NAS parallel benchmarks home page,” 2006. http:
//www.nas.nasa.gov/Software/NPB/.
[77] J. E. Smith, “Characterizing computer performance with a single number,” Communications
of the ACM, vol. 31, October 1988.
[78] B. Jacob and T. Mudge, “Notes on calculating computer performance,” Tech. Rep. 231-95,
University of Michigan, March 1995.
[79] Wikipedia, “Harmonic mean,” 2006. http://en.wikipedia.org/wiki/Harmonic_mean.
[80] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt, “Network-oriented full-system simulation
using m5,” in CAECW - Computer Architecture Evaluation using Commercial Workloads, Feb
2003.
[81] D. Joseph and D. Grunwald, “Prefetching using Markov predictors,” in The 24th Annual
International Symposium on Computer Architecture, pp. 252–263, 1997.
[82] S. Kim, D. Chandra, and Y. Solihin, “Fair cache sharing and partitioning in a chip multiprocessor
architecture,” in Proceedings of the 13th International Conference on Parallel Architecture
and Compilation Techniques, pp. 111–122, 2004.
[83] W.-F. Lin, S. Reinhardt, and D. Burger, “Reducing dram latencies with an integrated memory
hierarchy design,” in High-Performance Computer Architecture, 2001. HPCA. The Seventh
International Symposium on, pp. 301–312, Jan. 2001.
Appendix A
Cacti Output
This is the output of CACTI simulating a 4-way 8 KB cache with 64-byte cache lines. It has 1 read and 1
write port, and the technology used is 65 nm.
---------- CACTI version 3.2 ----------
Cache Parameters:
Number of Subbanks: 1
Total Cache Size: 8192
Size in bytes of Subbank: 8192
Number of sets: 32
Associativity: 4
Block Size (bytes): 64
Read/Write Ports: 1
Read Ports: 1
Write Ports: 1
Technology Size: 0.06um
Vdd: 0.8V
Access Time (ns): 0.57591
Cycle Time (wave pipelined) (ns): 0.273554
Total Power all Banks (nJ): 0.139284
Total Power Without Routing (nJ): 0.139284
Total Routing Power (nJ): 0
Maximum Bank Power (nJ): 0.139284
Best Ndwl (L1): 16
Best Ndbl (L1): 1
Best Nspd (L1): 1
Best Ntwl (L1): 1
Best Ntbl (L1): 4
Best Ntspd (L1): 1
Nor inputs (data): 2
Nor inputs (tag): 1
Area Components:
Cache data
array 1.002876 (mm^2)
pred 0.001504 (mm^2)
colmux pred 0.000754 (mm^2)
colmux post 0.000053 (mm^2)
write sig 0.000872 (mm^2)
total area 1.006058 (mm^2)
Cache tag
array 0.061271 (mm^2)
pred 0.000754 (mm^2)
colmux pred 0.000754 (mm^2)
colmux post 0.000211 (mm^2)
out decode 0.002650 (mm^2)
out sig 0.000872 (mm^2)
total area 0.066512 (mm^2)
Cache
total area 1.074315 (mm^2)
subanked 1.126643 (mm^2)
aspect ratio 1.43
data ramcells 93.4%
tag ramcells 5.7%
control/routing 0.9%
efficiency 95.2%
Time Components:
decode data : 273.554ps 50.726pJ
w&b line data : 112.349ps 15.286pJ
wordline : 91.638ps 0.284pJ
bitline : 20.711ps 15.002pJ
sense amp data : 67.600ps 42.025pJ
decode tag : 71.852ps 2.525pJ
w&b line tag : 55.342ps 4.302pJ
wordline : 45.725ps 0.221pJ
bitline : 9.617ps 4.080pJ
sense amp tag : 21.937ps 7.551pJ
compare address : 122.225ps 1.003pJ
mux driver : 185.040ps 1.683pJ
select inverter : 17.525ps 0.013pJ
data output drv : 101.988ps 14.170pJ
total data ~drv : 453.503ps 108.037pJ
total tag (~DM) : 473.922ps 17.077pJ
Total Data : 555.491ps 123.904pJ
Total TAG : 575.910ps 15.381pJ
Read Energy : 139.284pJ
Write Energy : 162.399pJ
Access Time : 575.910ps
Max Precharge : 0.000ps
Pipe Time (1clk) : 575.910ps (data= 575.910ps) (tag= 575.910ps)
Pipe Time (2clk) : 287.955ps (data= 277.746ps) (tag= 287.955ps)
Pipe Time (3clk) : 273.554ps (data= 185.164ps) (tag= 191.970ps)
Cache bank distributions and energy per access (1 banks)
bank ctrl 0% 0.000pJ
decode 15% 53.250pJ
tag array 4% 4.302pJ
data array 65% 15.286pJ
tag ctrl 0.29% 10.251pJ
data ctrl 16% 56.196pJ
total 139.284pJ
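The "Number of sets: 32" line in the output above follows directly from the cache parameters: 8192 bytes divided by 64-byte lines gives 128 lines, and 128 lines divided over 4 ways gives 32 sets. The small Python sketch below reproduces this arithmetic; the function name is only illustrative and is not part of CACTI.

```python
def cache_geometry(size_bytes, line_size, associativity):
    """Return (lines, sets, offset_bits, index_bits) for a set-associative cache.

    Assumes that line_size and the resulting number of sets are powers of two.
    """
    lines = size_bytes // line_size
    sets = lines // associativity
    offset_bits = line_size.bit_length() - 1  # bits selecting a byte within a line
    index_bits = sets.bit_length() - 1        # bits selecting the set
    return lines, sets, offset_bits, index_bits
```

For the cache simulated here, `cache_geometry(8192, 64, 4)` gives 128 lines, 32 sets, 6 offset bits and 5 index bits.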
Appendix B
Notur 2006 Poster
This poster was submitted to and accepted at the Notur 2006 conference. Its purpose was to showcase work
in progress.
Department of Computer and
Information Science
Simulation of Bandwidth-Aware
Hardware Based Prefetching in Chip
Multiprocessors
by Marius Grannæs (grannas@idi.ntnu.no)
supervised by Lasse Natvig (Lasse.Natvig@idi.ntnu.no)
11th May 2006
1 Prefetching
Prefetching is a technique used to increase the effectiveness of caches by trying to
predict the memory reference stream. By fetching needed data into the caches before
it is actually referenced by the processor, it is possible to achieve a significant
performance increase.
There are two things that can potentially degrade performance on uniprocessors:
• Displacing data from the cache that is still needed
• Causing bandwidth contention
There exist very good heuristics for hardware prefetching on uniprocessors, but in a
CMP where the cache is shared, prefetching might displace other processors’ data
[1, 2].
2 SimpleScalar simulator
We at the NCAR group at NTNU have developed an advanced simulator based on
the open source SimpleScalar simulator. It has been extended with the following
features:
• Support for chip multiprocessing
• Shared caches
• Improved DRAM model that mimics DDR2
• Support for hardware prefetching
• System calls for synchronization and shared memory
[Diagram: four CPUs (CPU 1 to CPU 4), each with a private L1 cache, connected to a shared L2 cache, which is connected to main memory.]
Figure 1: The simulated architecture
3 Uniprocessor results
To understand prefetching in a CMP, one must first understand how the heuristics
work in a simpler setting. In figure 2, we see how the most memory-intensive benchmarks
in the SPEC 2000 suite perform without any bandwidth limitations.
[Bar chart: normalized IPC (y-axis, 0 to 7) per benchmark (x-axis: art, ammp, mgrid, swim, mcf); series: Sequential, CDC, RPT, Perfect L2.]
Figure 2: Prefetching on uniprocessors with unlimited bandwidth to memory. IPC is normalized to the
case where no prefetching is performed.
In figure 3, we limit the amount of bandwidth available:
[Bar chart: normalized IPC (y-axis, 0 to 7) per benchmark (x-axis: art, ammp, mgrid, swim, mcf); series: Sequential, CDC, RPT, Perfect L2.]
Figure 3: Prefetching on uniprocessors with limited bandwidth to memory. IPC is normalized to the
case where no prefetching is performed.
4 CMP results
In figure 4, we run the same prefetching heuristics, but on a 4-way CMP running 4
different SPEC2000 benchmarks.
[Bar chart: normalized IPC (y-axis, 0 to 4.5) per prefetching method (x-axis: Perfect L2, RPT, C/DC, Sequential); series: Vpr, Mcf, Art, Gcc.]
Figure 4: Prefetching on a 4-way CMP. IPC is normalized to the case where no prefetching is performed.
5 Conclusion
It is clear that CMPs offer ample opportunity for prefetching, but it requires new
heuristics. While prefetching might benefit one core, it might seriously degrade
performance for another.
5.1 Future work
This work will be continued as part of my PhD thesis, and is a work in progress.
• SimpleScalar cannot execute true parallel programs.
• Switch to the M5 simulator.
• Cache partitioning
• Study the interaction between the cache coherence protocol and prefetching
• More reference prefetching algorithms
Acknowledgments
I would like to thank Lasse Natvig for guidance and support, Haakon Dybdahl for
sharing his ideas and improvements to the simulator, and Hanne Lian for her support and
patience.
References
[1] L. Spracklen and S. G. Abraham, “Chip multithreading: Opportunities and challenges,” in 11th
International Symposium on High-Performance Computer Architecture (HPCA’05), pp. 248–252,
2005.
[2] L. Spracklen, Y. Chou, and S. G. Abraham, “Effective instruction prefetching in chip multiproces-
sors for modern commercial applications,” in 11th International Symposium on High-Performance
Computer Architecture, pp. 225–236, 2005.
Appendix C
Performance Counter Code
C.1 Pmc.c
Listing C.1: Pmc.c
/*
 * pmc.c
 *
 * This linux kernel module enables RDPMC for user level programs.
 * In addition, it sets up counter #0 to count the number of cache misses.
 *
 * Written by Marius Grannas 2006
 *
 * NOTE! This is AMD Athlon XP specific code, and will possibly break things
 * badly on any other arch.
 *
 */
#include <linux/module.h> /* Needed by all modules */
#include <linux/kernel.h> /* Needed for KERN_INFO */

/* This function is called when the module is loaded */
int init_module(void)
{
    printk(KERN_INFO "Enabling RDPMC...\n");
    // The following is a reimplementation of set_in_cr4() that doesn't
    // require mmu_cr4_features which isn't exported anymore.
    asm volatile("movl %%cr4,%%eax\n\t"
                 "orl %0,%%eax\n\t"
                 "movl %%eax,%%cr4\n"
                 : : "irg" (X86_CR4_PCE)
                 : "ax");
    printk(KERN_INFO "Setting up counter 0.\n");
    // This part sets up counter 0 to count the number of data cache misses.
    // See the Athlon Optimization guidelines for values.
    // WRMSR writes EDX:EAX to the MSR selected by ECX, so the upper
    // half (EDX) is cleared explicitly instead of being left undefined.
    asm volatile("xor %%edx, %%edx; mov $0xC0010000, %%ecx; "
                 "mov $0x00420041, %%eax; wrmsr"
                 :
                 :
                 : "eax", "ecx", "edx");
    // A non 0 return means init_module failed; module can't be loaded.
    return 0;
}

/* This function is called when the module is unloaded */
void cleanup_module(void)
{
    printk(KERN_INFO "Disabling Performance counters.\n");
    // Same as above
    asm volatile("movl %%cr4,%%eax\n\t"
                 "andl %0,%%eax\n\t"
                 "movl %%eax,%%cr4\n"
                 : : "irg" (~X86_CR4_PCE)
                 : "ax");
}
C.2 Makefile
Listing C.2: Makefile
# This makefile creates the kernel module from source
# Created by Marius Grannaes 2006
obj-m += pmc.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
C.3 Performance.c
Listing C.3: Performance.c
/∗
∗ performance . c
∗
∗ The purpose o f t h i s sma l l program i s to show how userspace can acces s
∗ performance r e g i s t e r s . This code i s p r e t t y u s e l e s s un l e s s RDPMC i s
enab led
∗ f o r userspace . See the k e rne l module f o r a d d i t i o n a l d e t a i l s .
∗
∗ Written by Marius Grannas 2006
∗
∗ Note : Some o f t h i s code i s a th l on XP s p e c i f i c and might break on other
∗ a r c h i t e c t u r e s .
∗
∗ Compile wi th
∗ gcc performance . c −o performance
∗
∗ Note : Do not turn on opt imiz ing , the compi ler might op t imize away the
en t i r e
∗ l oop body !
∗/
#include <s t d i o . h>
#include <s t d l i b . h>
/∗
∗ This func t i on reads the va lue o f the b u i l t in c l o c k c y c l e t imer
∗ The cpuid i n s t r u c t i o n i s t h e r e to ensure s e r i a l i z a b i l i t y ( eg i t f o r c e s
∗ a p i p e l i n e f l u s h
∗/
stat ic i n l i n e unsigned long long rdt sc t ime ( ) {
unsigned int eax , edx ;
unsigned long long va l ;
asm v o l a t i l e ( ”cpuid ” : : : ”ax ” , ”bx” , ”cx ” , ”dx”) ;
asm v o l a t i l e ( ” rd t s c ” : ”=a ”( eax ) , ”=d”( edx ) ) ;
va l = edx ;
va l = va l << 32 ;
va l += eax ;
return va l ;
}
/∗
∗ This func t i on reads the va lue o f performance r e g i s t e r 0 .
∗ Cpuid i n s t r u c t i o n i s again used f o r s e r i a l i z a b i l i t y
∗/
stat ic i n l i n e unsigned long long readpc ( ) {
unsigned int eax , edx ;
unsigned long long va l ;
asm v o l a t i l e ( ”cpuid ” : : : ”ax ” , ”bx” , ”cx ” , ”dx”) ;
asm v o l a t i l e ( ”xor %%ecx , %%ecx ; rdpmc”
: ”=a ”( eax ) , ”=d”( edx ) /∗ Output ∗/
: /∗ Input ∗/
: ”ecx ” /∗ Clobbered ∗/
) ;
141
C.3. PERFORMANCE.C APPENDIX C. PERFORMANCE COUNTER CODE
    val = edx;
    val = val << 32;
    val += eax;
    return val;
}

/*
 * Sample test program.
 * This shows how the counter works as well as the timer.
 */
int main(void) {
    long long int start, stop;
    int a = 9;
    int c = 3;
    int d[90000];
    int i;
    printf("Clock is %lld\n", rdtsc_time());
    start = readpc();
    for (i = 0; i < 1000; i++) {
        d[i] = a / c;
    }
    stop = readpc();
    printf("Number of misses: %lld\n", stop - start);
    printf("Clock is %lld\n", rdtsc_time());
    return 0;
}
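Both reader functions in the listing above assemble a 64-bit counter value from the two 32-bit halves that the instruction leaves in EDX (high half) and EAX (low half), and the test program subtracts two such readings. The following is a minimal Python sketch of that arithmetic only. The helper names `combine_halves` and `events_between` are hypothetical, and the sketch assumes the counter does not wrap between the two reads.

```python
MASK32 = 0xFFFFFFFF  # keep each half within 32 bits


def combine_halves(edx, eax):
    """Combine the high (EDX) and low (EAX) 32-bit halves into one value."""
    return ((edx & MASK32) << 32) + (eax & MASK32)


def events_between(start, stop):
    """Events counted between two readings, assuming no wrap-around."""
    return stop - start


if __name__ == "__main__":
    start = combine_halves(0x1, 0xFFFFFFF0)
    stop = combine_halves(0x2, 0x00000010)
    print(events_between(start, stop))
```

The shift-then-add order mirrors the three assignments to `val` in the C code; doing the shift on a 64-bit value is what keeps the high half from being lost.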
Appendix D
Python Scripts
D.1 Clustisrunbench.py
Listing D.1: Clustisrunbench.py
#!/usr/bin/python
import sys
import time
import popen2
import config

header = """#!/bin/bash
#PBS -N sim-out
#PBS -l walltime=8:00:00
#PBS -l nodes=1:ppn=1
#PBS -m bea
#PBS -q default
#
"""

count = 0
for benchmark in config.benchmarks:
    for configuration in config.configurations:
        output = open(config.temppbs, 'w')
        simulation = config.simulator + ' ' + config.commonconfig + ' ' + \
            configuration[1] + ' ' + config.specbinpath + benchmark[1] + ' ' + \
            benchmark[3]
        count = count + 1
        output.write(header)
        # Change directory into the simulation directory
        output.write('cd ' + config.specdatapath + benchmark[2] + '\n');
        output.write(simulation)
        output.close()
        results = popen2.popen3('qsub ' + config.temppbs)
        print results[0].readline(),
print 'Number of submitted jobs: ',
print count
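The script above writes one PBS job file per (benchmark, configuration) pair and submits it with `qsub`. As a hedged illustration of just the string assembly, the sketch below builds the same job text without touching the file system or the queue. It is written for Python 3, the function names `build_job_script` and `build_all` are hypothetical, and the header is shortened for brevity.

```python
# Shortened PBS header; the real script carries more #PBS directives.
PBS_HEADER = "#!/bin/bash\n#PBS -N sim-out\n#PBS -l nodes=1:ppn=1\n#\n"


def build_job_script(simulator, common, configuration, bin_path, data_path,
                     benchmark):
    """Return the job text for one (benchmark, configuration) pair."""
    _name, binary, workdir, params = benchmark   # (Name, binary, dir, params)
    _label, options = configuration              # (Name, Parameters)
    command = " ".join([simulator, common, options, bin_path + binary, params])
    # Change into the benchmark's data directory, then run the simulator.
    return PBS_HEADER + "cd " + data_path + workdir + "\n" + command


def build_all(simulator, common, configurations, bin_path, data_path,
              benchmarks):
    """One job text per pair, benchmarks in the outer loop as in the script."""
    return [build_job_script(simulator, common, c, bin_path, data_path, b)
            for b in benchmarks for c in configurations]
```

Keeping the benchmark loop outermost matters because the result parser reads the job log in the same order.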
D.2 Config.py
Listing D.2: Sample configuration file
#!/usr/bin/python
# Path to binaries
# Full path to executable
simulator = '/home/grannas/hovedoppgave/simplesim-3.0/sim-outorder'
# Full path to simulation executables
specbinpath = '/home/grannas/spec2000binaries/'
# Where you want to store the temporary pbs file
temppbs = 'sim-outorder.pbs'
# Where the data files to the spec benchmarks are
specdatapath = '/home/grannas/SPEC_2000_REDUCED/'
# Common configuration across benchmarks.
commonconfig = ('-cache:dl1 dl1:32:64:4:l -cache:dl1lat 2 '
                '-cache:il1 il1:32:64:4:l -cache:il1lat 2 '
                '-cache:dl2 ul2:512:128:8:l -cache:dl2lat 7 '
                '-cache:il2 dl2 -cache:il2lat 7 '
                '-mem:width 8 -mem:lat 160 2 -fetch:mplat 15 '
                '-ruu:size 16 -lsq:size 8 '
                '-tlb:itlb itlb:1:4096:128:l -tlb:dtlb dtlb:1:4096:128:l '
                '-bpred 2lev -bpred:bimod 4096 -bpred:2lev 1 1024 10 0 '
                '-bpred:comb 4096 '
                '-dram:block 128 -dram:page 8 -dram:comm 40 -dram:core 40 '
                '-dram:data 80 -prefetch:table 1024 -dram:chan 1')
# Experiment name, used to generate plot files
experimentname = 'baseline-chan1'
# Parse out this statistic
#grepname = 'sim_IPC'
#grepname = 'dram.accesses'
#grepname = 'ul2.prefetches_ok'
grepname = 'ul2.misses'
# Configuration
# Format: (Name, Parameters)
configurations = [('None', '-prefetch none'), \
    ('Sequ', '-prefetch sequential'), \
    ('DC', '-prefetch DC'), \
    ('CDC', '-prefetch CDC'), \
    ('RPT', '-prefetch RPT -prefetch:loc DL1 -prefetch:type access'), \
    ('Stream', '-prefetch stream -prefetch:loc DL1 -prefetch:type access'), \
    ('AVD', '-prefetch AVD'), \
    ('Perfect L2', '-perfect:l2 1')]
# Different benchmarks
# Format: (Name, binary, working dir, parameters)
benchmarks = [('gzip', 'gzip00.peak.ev6', '164.gzip/input/', 'lgred.log 1'),\
('gcc', 'gcc00.peak.ev6', '176.gcc/input/', 'lgred.cp-decl.i -o lgred.cp-decl.s'),\
('crafty', 'crafty00.peak.ev6', '186.crafty/input/lgred/', '< lgred.in'),\
('mcf', 'mcf00.peak.ev6', '181.mcf/input/', 'lgred.in'),\
('swim', 'swim00.peak.ev6', '171.swim/input/lgred/', '< swim.in'),\
('mgrid', 'mgrid00.peak.ev6', '172.mgrid/input/lgred/', '< mgrid.in'),\
('equake', 'equake00.peak.ev6', '183.equake/input/', '< lgred/lgred.in'),\
('applu', 'applu00.peak.ev6', '173.applu/input/lgred', '< applu.in'),\
('vpr', 'vpr00.peak.ev6', '175.vpr/input/', 'lgred.net small.arch.in tull1 tull2 -nodisp -place_only -init_t 5 -exit_t 0.005 -alpha_t 0.9412 -inner_num 2'),\
('ammp', 'ammp00.peak.ev6', '188.ammp/', '< ./input/lgred.in'),\
('mesa', 'mesa00.peak.ev6', '177.mesa/input/', '-frames 1 -meshfile lgred.in -ppmfile utfil'),\
('galgel', 'galgel00.peak.ev6', '178.galgel/input/lgred/', '< lgred.in'),\
('lucas', 'lucas00.peak.ev6', '189.lucas/input/lgred/', '< lgred.in'),\
('fma', 'fma3d00.peak.ev6', '191.fma3d/input/lgred/', ''),\
('parser', 'parser00.peak.ev6', '197.parser/input', '2.1.dict -batch < lgred.in'),\
('eon', 'eon00.peak.ev6', '252.eon/input/lgred', 'chair.control.kajiya chair.camera chair.surfaces chair.kajiya.ppm ppm pixels_out.kajiya'),\
('perlbmk', 'perlbmk00.peak.ev6', '253.perlbmk/input/lgred', '-I. -I./lib lgred.makerand.pl'),\
('gap', 'gap00.peak.ev6', '254.gap/input/lgred', '-l . -q -m 64M < lgred.in'),\
('bzip2', 'bzip200.peak.ev6', '256.bzip2/input', 'lgred.source 1'),\
('apsi', 'apsi00.peak.ev6', '301.apsi/input/lgred', ''),\
('wupwise', 'wupwise00.peak.ev6', '168.wupwise/input/lgred', ''),\
('twolf', 'twolf00.peak.ev6', '300.twolf/input/', 'lgred/lgred'),\
('facerec', 'facerec00.peak.ev6', '187.facerec/input/lgred', ' < lgred.in'),\
('art', 'art00.peak.ev6', '179.art/input/', '-scanfile c756hel.in -trainfile1 a10.img -stride 5 -startx 134 -starty 220 -endx 184 -endy 240 -objects 1') \
]
# Not working:
# Sixtrack: Missing files? The file size is 0, in any case.
# Vortex: Corrupt file?
#('sixtrack', 'sixtrack00.peak.ev6', '200.sixtrack/input/lgred', '< inp.in'),\
#('vortex', 'vortex00.peak.ev6', '255.vortex/input', 'lgred.raw'),\
D.3 Parsebench.py
Listing D.3: Parsebench.py
#!/usr/bin/python
import sys
import time
import popen2
import config
import os

print '# This is an autogenerated plot file for gnuplot'
print '# Configuration used:'
print '# ' + config.commonconfig
for i in config.configurations:
    print '#' + i[0] + ': ' + i[1]
sys.stdout.write('#Benchmark\t')
for i in config.configurations:
    sys.stdout.write(i[0])
    sys.stdout.write('\t')
sys.stdout.write('\n')
# Iterate through the different configurations and grep for important information.
count = 1
runlog = open('run.log', 'r')
for benchmark in config.benchmarks:
    print benchmark[0] + '\t',
    print count,
    print '\t',
    for configuration in config.configurations:
        logfile = runlog.readline()
        logfile = logfile.split('.')
        try:
            datafile = open('sim-out.e' + logfile[0], 'r')
        except IOError:
            print 'N/C\t',
            continue
        simresults = datafile.readlines()
        flag = 0
        for line in simresults:
            if line.startswith(config.grepname):
                ipc = line.split()[1]
                flag = 1
                print ipc + '\t',
                break
        if flag == 0:
            print 'ERROR\t',
    count = count + 1
    print '\n',
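The inner loop of the script above scans a simulator output file for the first line that starts with the configured statistic name and takes its second whitespace-separated field. A minimal Python 3 sketch of that extraction step follows. The names `extract_statistic` and `format_row` are hypothetical, the sample lines in the test are made up for illustration, and unlike the original the sketch skips a matching line that has no value field instead of raising an error.

```python
def extract_statistic(lines, name):
    """Return the value field of the first line starting with `name`."""
    for line in lines:
        if line.startswith(name):
            fields = line.split()
            if len(fields) > 1:
                return fields[1]
    return None  # statistic not present in this output


def format_row(benchmark, index, values):
    """Build one tab-separated plot row; missing values become 'ERROR'."""
    cells = [v if v is not None else "ERROR" for v in values]
    return "\t".join([benchmark, str(index)] + cells)
```

Returning `None` rather than printing directly keeps the parsing separate from the gnuplot formatting, which makes the step easy to check in isolation.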
Appendix E
Uniprocessor Code
E.1 Makefile
Listing E.1: Makefile - Unified diff against SimpleScalar 3.0d
--- ../simplesim-3.0-orig/Makefile	2003-10-09 04:42:59.000000000 +0200
+++ ../simplesim-3.0/Makefile	2006-05-25 23:42:04.000000000 +0200
@@ -77,8 +77,8 @@
 ## RS/6000 AIX Unix version 4, GNU GCC version cygnus-2.7-96q4
 ## Windows NT version 4.0, Cygnus CygWin/32 beta 19
 ##
-CC = gcc
-OFLAGS = -O0 -g -Wall
+CC = gcc-3.4
+OFLAGS = -g
 MFLAGS = `./sysprobe -flags`
 MLIBS = `./sysprobe -libs` -lm
 ENDIAN = `./sysprobe -s`
@@ -277,7 +277,7 @@
 #
 # all the sources
 #
-SRCS = main.c sim-fast.c sim-safe.c sim-cache.c sim-profile.c \
+SRCS = main.c dram.c prefetch.c sim-fast.c sim-safe.c sim-cache.c sim-profile.c \
 	sim-eio.c sim-bpred.c sim-cheetah.c sim-outorder.c \
 	memory.c regs.c cache.c bpred.c ptrace.c eventq.c \
 	resource.c endian.c dlite.c symbol.c eval.c options.c range.c \
@@ -287,7 +287,7 @@
 	target-alpha/alpha.c target-alpha/loader.c target-alpha/syscall.c \
 	target-alpha/symbol.c
 
-HDRS = syscall.h memory.h regs.h sim.h loader.h cache.h bpred.h ptrace.h \
+HDRS = dram.h syscall.h memory.h regs.h sim.h loader.h cache.h bpred.h ptrace.h \
 	eventq.h resource.h endian.h dlite.h symbol.h eval.h bitmap.h \
 	eio.h range.h version.h endian.h misc.h \
 	target-pisa/pisa.h target-pisa/pisabig.h target-pisa/pisalittle.h \
@@ -305,9 +305,7 @@
 #
 # programs to build
 #
-PROGS = sim-fast$(EEXT) sim-safe$(EEXT) sim-eio$(EEXT) \
-	sim-bpred$(EEXT) sim-profile$(EEXT) \
-	sim-cache$(EEXT) sim-outorder$(EEXT) # sim-cheetah$(EEXT)
+PROGS = sim-outorder$(EEXT)
 
 #
 # all targets, NOTE: library ordering is important...
@@ -390,8 +388,8 @@
 sim-cache$(EEXT):	sysprobe$(EEXT) sim-cache.$(OEXT) cache.$(OEXT) $(OBJS) libexo/libexo.$(LEXT)
 	$(CC) -o sim-cache$(EEXT) $(CFLAGS) sim-cache.$(OEXT) cache.$(OEXT) $(OBJS) libexo/libexo.$(LEXT) $(MLIBS)
 
-sim-outorder$(EEXT):	sysprobe$(EEXT) sim-outorder.$(OEXT) cache.$(OEXT) bpred.$(OEXT) resource.$(OEXT) ptrace.$(OEXT) $(OBJS) libexo/libexo.$(LEXT)
-	$(CC) -o sim-outorder$(EEXT) $(CFLAGS) sim-outorder.$(OEXT) cache.$(OEXT) bpred.$(OEXT) resource.$(OEXT) ptrace.$(OEXT) $(OBJS) libexo/libexo.$(LEXT) $(MLIBS)
+sim-outorder$(EEXT):	sysprobe$(EEXT) sim-outorder.$(OEXT) dram.$(OEXT) cache.$(OEXT) prefetch.$(OEXT) bpred.$(OEXT) resource.$(OEXT) ptrace.$(OEXT) $(OBJS) libexo/libexo.$(LEXT)
+	$(CC) -o sim-outorder$(EEXT) $(CFLAGS) sim-outorder.$(OEXT) cache.$(OEXT) dram.$(OEXT) prefetch.$(OEXT) bpred.$(OEXT) resource.$(OEXT) ptrace.$(OEXT) $(OBJS) libexo/libexo.$(LEXT) $(MLIBS)
 
 exo libexo/libexo.$(LEXT): sysprobe$(EEXT)
 	cd libexo $(CS) \
@@ -499,6 +497,8 @@
 regs.$(OEXT): options.h stats.h eval.h
 cache.$(OEXT): host.h misc.h machine.h machine.def cache.h memory.h options.h
 cache.$(OEXT): stats.h eval.h
+prefetch.$(OEXT): host.h misc.h machine.h machine.def cache.h memory.h options.h
+prefetch.$(OEXT): stats.h eval.h cache.h
 bpred.$(OEXT): host.h misc.h machine.h machine.def bpred.h stats.h eval.h
 ptrace.$(OEXT): host.h misc.h machine.h machine.def range.h ptrace.h
 eventq.$(OEXT): host.h misc.h machine.h machine.def eventq.h bitmap.h
E.2 Sim-outorder.c
Listing E.2: Sim-outorder.c - Unified diff against SimpleScalar 3.0d
--- ../simplesim-3.0-orig/sim-outorder.c	2003-10-09 03:57:25.000000000 +0200
+++ ../simplesim-3.0/sim-outorder.c	2006-05-31 13:32:54.000000000 +0200
@@ -72,6 +72,8 @@
 #include "ptrace.h"
 #include "dlite.h"
 #include "sim.h"
+#include "dram.h"
+#include "prefetch.h"
 
 /*
  * This file implements a very detailed out-of-order issue superscalar
@@ -91,6 +93,73 @@
  * simulator options
  */
 
+/* Prefetch options */
+
+/* prefetching type {none|sequential....} */
+static char *prefetch_type;
+
+/* prefetching location {DL1|DL2|IL1|IL2} */
+static char *prefetch_location;
+
+/* prefetching degree */
+static int prefetch_degree_in;
+
+/* prefetching trigger type (Cache miss, cache hit etc) */
+static char *prefetch_target;
+
+/* CZone size (in bits) */
+static int czone_size_in;
+
+/* GHB size (in entries) */
+static int table_size_in;
+
+/* Bandwidth-aware prefetching - Threshold */
+int bwa_treshold;
+
+/* Bandwidth-aware prefetching - Enable */
+int bwa_enable;
+
+/* number of channels between cache and DRAM */
+static unsigned int max_dram_chan;
+
+/*
+ * These options are added by Marius Grannaes
+ * They define the DRAM used.
+ * If this model is used the latency specified by -mem:lat is not used.
+ */
+
+/* Dram structure */
+dram_system_t *dram_system;
+
+/* The number of available DRAM channels */
+int num_channels;
+
+/* Block size in DRAM */
+int block_size;
+
+/* Page size in DRAM (in blocks) */
+int page_size;
+
+/* Time to execute command transfer (in clock cycles) */
+int control_time;
+
+/* Time to transfer data from the core to the latches */
+int core_time;
+
+/* Time to transfer data from the latches to the memory controller */
+int data_time;
+
+/* How often to log DRAM trace */
+int dram_trace_interval;
+
+/* Perfect l2 */
+int perfect_l2 = 0;
+
+/*
+ * Original parameters
+ * (not altered)
+ */
+
 /* maximum number of inst's to execute */
 static unsigned int max_insts;
@@ -433,14 +502,19 @@
               tick_t now)               /* time of access */
 {
   unsigned int lat;
-
+  process_prefetch_trigger(Cache_Miss, Cache_DL1, baddr, sim_cycle);
   if (cache_dl2)
     {
-      /* access next level of data cache hierarchy */
-      lat = cache_access(cache_dl2, cmd, baddr, NULL, bsize,
-                         /* now */now, /* pudata */NULL, /* repl addr */NULL);
+      if (perfect_l2) {
+        lat = cache_dl2->hit_latency;
+      } else {
+        /* access next level of data cache hierarchy */
+        lat = cache_access(cache_dl2, cmd, baddr, NULL, bsize,
+                           /* now */now, /* pudata */NULL, /* repl addr */NULL);
+      }
+      process_prefetch_trigger(Memory_Access, Cache_DL2, baddr, sim_cycle);
       if (cmd == Read)
-        return lat;
+        return lat;
       else
         {
           /* FIXME: unlimited write buffers */
@@ -450,8 +524,17 @@
   else
     {
       /* access main memory */
-      if (cmd == Read)
-        return mem_access_latency(bsize);
+      if (cmd == Read) {
+        /* If the dram system is defined using 0 channels, then fallback
+         * to old model */
+        if (dram_system->num_channels == 0) {
+          lat = mem_access_latency(bsize);
+        } else {
+          lat = access_dram(dram_system, baddr, bsize, now);
+        }
+        process_prefetch_trigger(Memory_Access, DRAM, baddr, sim_cycle);
+        return lat;
+      }
       else
         {
           /* FIXME: unlimited write buffers */
@@ -468,14 +551,26 @@
               struct cache_blk_t *blk,  /* ptr to block in upper level */
               tick_t now)               /* time of access */
 {
+  int latency;
   /* this is a miss to the lowest level, so access main memory */
-  if (cmd == Read)
-    return mem_access_latency(bsize);
+  if (cmd == Read) {
+    /* If the dram system is defined using 0 channels, then fallback
+     * to old model */
+    if (dram_system->num_channels == 0) {
+      latency = mem_access_latency(bsize);
+    } else {
+      latency = access_dram(dram_system, baddr, bsize, now);
+    }
+  }
   else
     {
       /* FIXME: unlimited write buffers */
-      return 0;
+      latency = 0;
     }
+  set_return_latency(latency);
+  process_prefetch_trigger(Cache_Miss, Cache_DL2, baddr, sim_cycle);
+  process_prefetch_trigger(Memory_Access, DRAM, baddr, sim_cycle);
+  return latency;
 }
 /* l1 inst cache l1 block miss handler function */
@@ -491,21 +586,34 @@
   if (cache_il2)
     {
       /* access next level of inst cache hierarchy */
-      lat = cache_access(cache_il2, cmd, baddr, NULL, bsize,
+      if (perfect_l2) {
+        lat = cache_il2->hit_latency;
+      } else {
+        lat = cache_access(cache_il2, cmd, baddr, NULL, bsize,
                          /* now */now, /* pudata */NULL, /* repl addr */NULL);
-      if (cmd == Read)
-        return lat;
-      else
-        panic("writes to instruction memory not supported");
+      }
+      if (cmd != Read) {
+        panic("writes to instruction memory not supported");
+      }
+      process_prefetch_trigger(Memory_Access, Cache_IL2, baddr, sim_cycle);
     }
   else
     {
-      /* access main memory */
-      if (cmd == Read)
-        return mem_access_latency(bsize);
-      else
-        panic("writes to instruction memory not supported");
+      /* access main memory */
+      if (cmd == Read) {
+        /* If the dram system is defined using 0 channels, then fallback
+         * to old model */
+        if (dram_system->num_channels == 0) {
+          lat = mem_access_latency(bsize);
+        } else {
+          lat = access_dram(dram_system, baddr, bsize, now);
+        }
+        process_prefetch_trigger(Memory_Access, DRAM, baddr, sim_cycle);
+      } else {
+        panic("writes to instruction memory not supported");
+      }
     }
+  return lat;
 }
 /* l2 inst cache block miss handler function */
@@ -516,14 +624,27 @@
               struct cache_blk_t *blk,  /* ptr to block in upper level */
               tick_t now)               /* time of access */
 {
+  int latency;
   /* this is a miss to the lowest level, so access main memory */
-  if (cmd == Read)
-    return mem_access_latency(bsize);
-  else
-    panic("writes to instruction memory not supported");
+  if (cmd == Read) {
+    /* If the dram system is defined using 0 channels, then fallback
+     * to old model */
+    if (dram_system->num_channels == 0) {
+      latency = mem_access_latency(bsize);
+    } else {
+      latency = access_dram(dram_system, baddr, bsize, now);
+    }
+    process_prefetch_trigger(Memory_Access, DRAM, baddr, sim_cycle);
+  }
+  else {
+    panic("writes to instruction memory not supported");
+  }
+  process_prefetch_trigger(Cache_Miss, Cache_IL2, baddr, sim_cycle);
+  return latency;
 }
+
 /*
  * TLB miss handlers
  */
@@ -580,14 +701,83 @@
 "latency of all pipeline operations.\n"
                  );
-  /* instruction limit */
+  /* New DRAM-options */
+
+  opt_reg_uint(odb, "-dram:chan", "number of DRAM channels",
+               &num_channels, /* default - disabled */ 0,
+               /* print */TRUE, /* format */NULL);
+
+  /* TODO: Doublecheck these values! */
+
+  opt_reg_uint(odb, "-dram:block", "size of a block (in bytes)",
+               &block_size, /* default (same as L2 defaults) */ 128,
+               /* print */TRUE, /* format */NULL);
+
+  opt_reg_uint(odb, "-dram:page", "size of each page (in blocks)",
+               &page_size, /* default */ 8,
+               /* print */TRUE, /* format */NULL);
+
+  opt_reg_uint(odb, "-dram:comm", "latency of command transfer",
+               &control_time, /* default */ 40,
+               /* print */TRUE, /* format */NULL);
+
+  opt_reg_uint(odb, "-dram:core", "latency of DRAM core",
+               &core_time, /* default */ 40,
+               /* print */TRUE, /* format */NULL);
+
+  opt_reg_uint(odb, "-dram:data", "latency of DRAM transfer",
+               &data_time, /* default */ 80,
+               /* print */TRUE, /* format */NULL);
+
+  opt_reg_uint(odb, "-dram:trace", "Number of cycles between each sample",
+               &dram_trace_interval, /* default */ 0,
+               /* print */TRUE, /* format */NULL);
+
+  /* prefetch options */
+
+  opt_reg_string(odb, "-prefetch",
+                 "prefetching type {none|sequential|DC|CDC|RPT|Stream|AVD}",
+                 &prefetch_type, /* default */ "none",
+                 /* print */TRUE, /* format */NULL);
+
+  opt_reg_string(odb, "-prefetch:type",
+                 "prefetching trigger type {miss, hit, access}",
+                 &prefetch_target, /* default */ "miss",
+                 /* print */TRUE, /* format */NULL);
+
+  opt_reg_string(odb, "-prefetch:loc",
+                 "prefetching location {DL1|DL2|IL1|IL2}",
+                 &prefetch_location, /* default */ "DL2",
+                 /* print */TRUE, /* format */NULL);
+
+  opt_reg_int(odb, "-prefetch:degree", "prefetch degree",
+              &prefetch_degree_in, /* default */ 1,
+              /* print */TRUE, /* format */NULL);
+
+  opt_reg_int(odb, "-prefetch:czone", "size of each CZone in bits",
+              &czone_size_in, /* default */ 16,
+              /* print */TRUE, /* format */NULL);
+
+  opt_reg_int(odb, "-prefetch:table", "size of the prefetching table in entries",
+              &table_size_in, /* default */ 1024,
+              /* print */TRUE, /* format */NULL);
+
+  opt_reg_int(odb, "-perfect:l2", "Is the l2 perfect?",
+              &perfect_l2, /* default */ 0,
+              /* print */TRUE, /* format */NULL);
+
+  opt_reg_int(odb, "-bwa", "Enables Bandwidth Aware Prefetching",
+              &bwa_enable, /* default */ 0,
+              /* print */TRUE, /* format */NULL);
+
+  opt_reg_int(odb, "-bwa:tresh", "Bandwidth Aware Prefetching treshold",
+              &bwa_treshold, /* default */ 0,
+              /* print */TRUE, /* format */NULL);
   opt_reg_uint(odb, "-max:inst", "maximum number of inst's to execute",
                &max_insts, /* default */ 0,
                /* print */TRUE, /* format */NULL);
-  /* trace options */
-
   opt_reg_int(odb, "-fastfwd", "number of insts skipped before timing starts",
               &fastfwd_count, /* default */ 0,
               /* print */TRUE, /* format */NULL);
@@ -894,6 +1084,91 @@
   if (fetch_speed < 1)
     fatal("front-end speed must be positive and non-zero");
+  if (!mystricmp(prefetch_type, "sequential"))
+    {
+      /* Sequential prefetch */
+      register_prefetch_algorithm(&sequential_prefetch);
+    }
+  else if (!mystricmp(prefetch_type, "DC"))
+    {
+      /* Delta correlation */
+      register_prefetch_algorithm(&delta_correlation_prefetch);
+    }
+  else if (!mystricmp(prefetch_type, "none"))
+    {
+      /* No prefetching */
+      register_prefetch_algorithm(&no_prefetch);
+    }
+  else if (!mystricmp(prefetch_type, "CDC"))
+    {
+      /* CZone/Delta Correlation */
+      register_prefetch_algorithm(&czone_delta_correlation_prefetch);
+    }
+  else if (!mystricmp(prefetch_type, "RPT"))
+    {
+      /* RPT table */
+      register_prefetch_algorithm(&rpt_prefetch);
+    }
+  else if (!mystricmp(prefetch_type, "AVD"))
+    {
+      /* AVD prefetching */
+      register_prefetch_algorithm(&avd_prefetch);
+    }
+  else if (!mystricmp(prefetch_type, "stream"))
+    {
+      /* Stream prefetching */
+      register_prefetch_algorithm(&stream_prefetch);
+    }
+  else
+    {
+      fatal("cannot parse prefetch type `%s'", prefetch_type);
+    }
+  if (!mystricmp(prefetch_target, "miss"))
+    {
+      /* Prefetch on miss */
+      register_prefetch_target(Cache_Miss);
+    }
+  else if (!mystricmp(prefetch_target, "hit"))
+    {
+      /* Prefetch on hit */
+      register_prefetch_target(Cache_Hit);
+    }
+  else if (!mystricmp(prefetch_target, "access"))
+    {
+      /* Prefetch on access */
+      register_prefetch_target(Memory_Access);
+    }
+  else
+    {
+      fatal("cannot parse prefetch trigger type `%s'", prefetch_target);
+    }
+  if (!mystricmp(prefetch_location, "dl1"))
+    {
+      /* Cache DL1 */
+      register_prefetch_location(Cache_DL1);
+    }
+  else if (!mystricmp(prefetch_location, "dl2"))
+    {
+      /* Cache DL2 */
+      register_prefetch_location(Cache_DL2);
+    }
+  else if (!mystricmp(prefetch_location, "il1"))
+    {
+      /* Cache IL1 */
+      register_prefetch_location(Cache_IL1);
+    }
+  else if (!mystricmp(prefetch_location, "il2"))
+    {
+      /* Cache IL2 */
+      register_prefetch_location(Cache_IL2);
+    }
+  else
+    {
+      fatal("cannot parse prefetch location `%s'", prefetch_location);
+    }
+
+  register_prefetch_degree(prefetch_degree_in);
+
   if (!mystricmp(pred_type, "perfect"))
     {
       /* perfect predictor */
@@ -1292,6 +1567,11 @@
   stat_reg_formula(sdb, "avg_sim_slip",
                    "the average slip between issue and retirement",
                    "sim_slip / sim_num_insn", NULL);
+
+  /* register DRAM stats */
+  if (dram_system->num_channels != 0) {
+    dram_reg_stats(dram_system, sdb);
+  }
   /* register predictor stats */
   if (pred)
@@ -1368,6 +1648,9 @@
 {
   sim_num_refs = 0;
+  /* create the memory hierarchy */
+  dram_system = create_dram(num_channels, block_size, page_size, control_time, core_time, data_time, dram_trace_interval);
+
   /* allocate and initialize register file */
   regs_init(&regs);
@@ -1434,6 +1717,7 @@
   readyq_init();
   ruu_init();
   lsq_init();
+  prefetch_init(cache_il1, cache_il2, cache_dl1, cache_dl2, czone_size_in, table_size_in, mem);
   /* initialize the DLite debugger */
   dlite_init(simoo_reg_obj, simoo_mem_obj, simoo_mstate_obj);
@@ -2184,10 +2468,12 @@
           /* go to the data cache */
           if (cache_dl1)
             {
+              set_last_PC_value(rs->PC);
               /* commit store value to D-cache */
               lat =
                 cache_access(cache_dl1, Write, (LSQ[LSQ_head].addr&~3),
                              NULL, 4, sim_cycle, NULL, NULL);
+              process_prefetch_trigger(Memory_Access, Cache_DL1, (rs->addr & ~3), sim_cycle);
               if (lat > cache_dl1_lat)
                 events |= PEV_CACHEMISS;
             }
@@ -2730,10 +3016,13 @@
                   if (cache_dl1 && valid_addr)
                     {
                       /* access the cache if non-faulting */
+                      /* Set PC value */
+                      set_last_PC_value(rs->PC);
                       load_lat =
                         cache_access(cache_dl1, Read,
                                      (rs->addr & ~3), NULL, 4,
                                      sim_cycle, NULL, NULL);
+                      process_prefetch_trigger(Memory_Access, Cache_DL1, (rs->addr & ~3), sim_cycle);
                       if (load_lat > cache_dl1_lat)
                         events |= PEV_CACHEMISS;
                     }
@@ -4230,10 +4519,15 @@
           if (cache_il1)
             {
               /* access the I-cache */
+              /* Set prefetching PC value to -1 to signify that this is not a
+               * data request going through the hierarchy.
+               */
+              set_last_PC_value(-1);
               lat =
                 cache_access(cache_il1, Read, IACOMPRESS(fetch_regs_PC),
                              NULL, ISCOMPRESS(sizeof(md_inst_t)), sim_cycle,
                              NULL, NULL);
+              process_prefetch_trigger(Memory_Access, Cache_IL1, IACOMPRESS(fetch_regs_PC), sim_cycle);
               if (lat > cache_il1_lat)
                 last_inst_missed = TRUE;
             }
@@ -4428,7 +4722,7 @@
   /* ignore any floating point exceptions, they may occur on mis-speculated execution paths */
   signal(SIGFPE, SIG_IGN);
-
+
   /* set up program entry state */
   regs.regs_PC = ld_prog_entry;
   regs.regs_NPC = regs.regs_PC + sizeof(md_inst_t);
@@ -4595,6 +4889,9 @@
       RUU_fcount += ((RUU_num == RUU_size) ? 1 : 0);
       LSQ_count += LSQ_num;
       LSQ_fcount += ((LSQ_num == LSQ_size) ? 1 : 0);
+
+      /* Dram trace */
+      dram_trace(dram_system, sim_cycle);
       /* go to next cycle */
       sim_cycle++;
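The miss handlers patched in the diff above all follow the same pattern: a perfect L2 always answers with its hit latency, and a read that reaches main memory uses the fixed `mem_access_latency` model when zero DRAM channels are configured, otherwise the new DRAM model. The Python sketch below restates only that selection logic. The function names are hypothetical, and the latency sources are passed in as callables standing in for the simulator's `cache_access`, `mem_access_latency` and `access_dram`.

```python
def l2_latency(perfect_l2, hit_latency, access_l2):
    """A perfect L2 always hits; otherwise ask the real cache model."""
    if perfect_l2:
        return hit_latency
    return access_l2()


def memory_read_latency(num_channels, fixed_model, dram_model):
    """Fall back to the old fixed-latency model when no channels exist."""
    if num_channels == 0:
        return fixed_model()
    return dram_model()


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the thesis.
    print(memory_read_latency(0, lambda: 160, lambda: 200))
    print(memory_read_latency(1, lambda: 160, lambda: 200))
```

Passing the models as callables means the unused model is never evaluated, which matches the C code where only one of the two branches runs.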
E.3 Dram.h
Listing E.3: Dram.h
/* This code describes the memory model for DRAM
 *
 * It includes the following properties
 * - Virtual channels (for parallelisation)
 * - Pipelining
 * - Open pages
 *
 * This file was written by Marius Grannaes in 2006.
 */
#ifndef DRAMH
#define DRAMH
#include <stdio.h>
#include "host.h"
#include "misc.h"
#include "machine.h"
#include "memory.h"
#include "stats.h"
/* This is the size of the circular buffer containing
   the occupancy of the memory channel */
#define CIRC_BUFFER_SIZE 3
/* This data structure defines the DRAM system */
typedef struct {
  int num_channels;         /* Number of channels */
  int block_size;           /* Size of each block, usually equal
                             * to l2 block size (in bytes) */
  int page_size;            /* Number of blocks in a page */
  int control_time;         /* Time to transfer data */
  int core_time;            /* Time to transfer from core to
                             * latches */
  int data_time;            /* Time to transfer from latches to
                             * memory controller */
  tick_t *ready_channels;   /* When are the channels ready? */
  tick_t *circ_buffer;      /* Circular buffer */
  int trace_interval;       /* Sample interval when tracing */
  md_addr_t *last_address;  /* Last page accessed */
  counter_t accesses;       /* Number of accesses to the memory
                             * system */
  counter_t latency;        /* Total latency imposed by system */
  counter_t stalls;         /* Total number of stalls due to busy system */
  counter_t page_hits;      /* Total number of times accessing an open page */
} dram_system_t;
/* Function prototypes - See dram.c for help */
159
E.3. DRAM.H APPENDIX E. UNIPROCESSOR CODE
dram system t ∗ create dram ( int number of channels , int s i z e o f b l o c k , int
page s i z e , int cont ro l t ime , int core t ime , int data time , int
t r a c e i n t e r v a l ) ;
void f ree dram ( dram system t ∗dram system ) ;
int block to bank ( dram system t ∗dram system , md addr t block ) ;
int i s same page ( dram system t ∗dram system , md addr t block1 , md addr t
block2 ) ;
unsigned int access dram ( dram system t ∗dram system , md addr t block , int
bs i ze , t i c k t now) ;
unsigned int g e t channe l s i n u s e ( dram system t ∗dram system , t i c k t now) ;
void dram reg s ta t s ( dram system t ∗dram system , struct s t a t s db t ∗ sdb ) ;
void dram trace ( dram system t ∗dram system , t i c k t now) ;
int get occupancy (void ) ;
#endif
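The header above declares `CIRC_BUFFER_SIZE` and `get_occupancy()`. As a rough, standalone illustration of how such a small ring buffer can produce a moving average that a prefetcher compares against a threshold, consider the sketch below. All names starting with `sketch_` are hypothetical and not part of the simulator; only the layout (write index in slot 0, samples in the remaining slots) follows the dram.c listing.

```c
#define SKETCH_BUFFER_SIZE 3

/* Slot 0 holds the write index; slots 1..SKETCH_BUFFER_SIZE hold samples.
 * This mirrors the layout used in the dram.c listing, but the code itself
 * is an illustrative assumption, not the simulator's implementation. */
static int sketch_buffer[SKETCH_BUFFER_SIZE + 1];

/* Store a new sample, overwriting the oldest one. */
static void sketch_record(int sample) {
    sketch_buffer[sketch_buffer[0] + 1] = sample;
    sketch_buffer[0] = (sketch_buffer[0] + 1) % SKETCH_BUFFER_SIZE;
}

/* Integer average over all sample slots. */
static int sketch_average(void) {
    long total = 0;
    for (int i = 1; i <= SKETCH_BUFFER_SIZE; i++) {
        total += sketch_buffer[i];
    }
    return (int)(total / SKETCH_BUFFER_SIZE);
}

/* Gate a prefetch on the averaged value, in the same way the prefetch
 * routine compares get_occupancy() with its threshold. */
static int sketch_allow_prefetch(int threshold) {
    return sketch_average() < threshold;
}
```

Because only the last three samples are kept, the average reacts quickly to changes in memory traffic, at the cost of being noisy.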
E.4 Dram.c
Listing E.4: Dram.c
/* This code describes the memory model for DRAM
 *
 * It includes the following properties
 * - Virtual channels (for parallelisation)
 * - Pipelining
 * - Open pages
 *
 * This file was written by Marius Grannaes in 2006.
 */
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "host.h"
#include "misc.h"
#include "machine.h"
#include "dram.h"

/* Global circular buffer pointer */
int *circ_buffer;

/* This function creates the dram subsystem */
dram_system_t *create_dram(int number_of_channels, int size_of_block,
                           int page_size, int control_time, int core_time,
                           int data_time, int trace_interval) {
  dram_system_t *dram_system;

  /* Allocate memory for the structure */
  dram_system = calloc(1, sizeof(dram_system_t));
  if (dram_system == 0) {
    printf("Could not allocate memory for DRAM model.\n");
    exit(1);
  }

  /* Set the values given as parameters */
  dram_system->num_channels = number_of_channels;
  dram_system->block_size = size_of_block;
  dram_system->page_size = page_size;
  dram_system->control_time = control_time;
  dram_system->core_time = core_time;
  dram_system->data_time = data_time;
  dram_system->trace_interval = trace_interval;

  /* Reset statistics */
  dram_system->accesses = 0;
  dram_system->latency = 0;
  dram_system->stalls = 0;
  dram_system->page_hits = 0;

  /* Create the arrays */
  /* If the number of channels is 0 then do not create anything */
  if (number_of_channels > 0) {
    dram_system->ready_channels = calloc(number_of_channels, sizeof(tick_t));
    if (dram_system->ready_channels == 0) {
      printf("Error creating ready channels.\n");
      exit(1);
    }
    dram_system->last_address = calloc(number_of_channels, sizeof(md_addr_t));
    if (dram_system->last_address == 0) {
      printf("Error creating Address array.\n");
      exit(1);
    }
  }

  /* Create the circular buffer as usual */
  circ_buffer = calloc(CIRC_BUFFER_SIZE + 1, sizeof(int));
  if (circ_buffer == 0) {
    printf("Error creating Circular buffer.\n");
    exit(1);
  }
  return (dram_system);
}

/* This function frees up the memory used by the DRAM-model */
void free_dram(dram_system_t *dram_system) {
  free(dram_system->last_address);
  free(dram_system->ready_channels);
  free(dram_system);
}

/* This function maps an address to a DRAM bank */
int block_to_bank(dram_system_t *dram_system, md_addr_t block) {
  int bank = 0;
  bank = (block / dram_system->block_size) % dram_system->num_channels;
  if (bank < 0) {
    printf("Negative bank value from hash! This shouldn't happen!");
    exit(1);
  }
  return bank;
}

/* This function determines if two blocks are on the same page
 * Returns 1 if they are on the same page, 0 otherwise
 */
int is_same_page(dram_system_t *dram_system, md_addr_t block1,
                 md_addr_t block2) {
  md_addr_t block_no1, block_no2;

  /* Calculate block numbers by dividing by the size of each block */
  block_no1 = block1 / dram_system->block_size;
  block_no2 = block2 / dram_system->block_size;

  /* Divide by the number of banks to get the distance */
  block_no1 /= dram_system->num_channels;
  block_no2 /= dram_system->num_channels;

  /* If the integer division by the page size is equal, the two are on
   * the same page */
  if ((block_no1 / dram_system->page_size) ==
      (block_no2 / dram_system->page_size)) {
    return 1;
  }

  /* Fall-through: No match */
  return 0;
}

/* This function calculates the access time of a single access based on
 * the state of the DRAM-system.
 * It returns the access time in number of ticks (due to compatibility issues)
 */
unsigned int access_dram(dram_system_t *dram_system, md_addr_t block,
                         int bsize, tick_t now) {
  int dram_bank;    /* The bank in use, calculated based on the address */
  int latency;      /* The calculated latency - in ticks */
  int control_time; /* Time required to transfer a control word to DRAM */
  int core_time;    /* Time required to transfer data from cells to latches */
  int data_time;    /* Time required to transfer data from latches to
                     * controller */
  int overlap;      /* Overlapping time */
  int pipelining;   /* Time that can be overlapped (from 0 to control_time) */

  dram_bank = block_to_bank(dram_system, block);
  control_time = dram_system->control_time;
  core_time = dram_system->core_time;
  data_time = dram_system->data_time;
  pipelining = 0; /* Safe initialization */
  dram_system->accesses++;

  /* Update the circular buffer */
  circ_buffer[circ_buffer[0] + 1] = now - dram_system->ready_channels[dram_bank];
  circ_buffer[0] = (circ_buffer[0] + 1) % CIRC_BUFFER_SIZE;

  /* If the DRAM chip is free */
  if (dram_system->ready_channels[dram_bank] < now) {
    /* Check if we hit an open page */
    if (is_same_page(dram_system, block, dram_system->last_address[dram_bank])) {
      /* We hit an open page, transfer time is reduced. */
      dram_system->page_hits++;
      latency = control_time + data_time;
    } else {
      /* We hit a closed page */
      latency = control_time + core_time + data_time;
    }
  } else {
    /* The DRAM bank is currently occupied */
    dram_system->stalls++;

    /* Calculate overlapping time (pipelining) */
    overlap = dram_system->ready_channels[dram_bank] - now;
    if (overlap > control_time) {
      pipelining = control_time; /* Only pipeline control */
    } else {
      pipelining = overlap;
    }
    if (is_same_page(dram_system, block, dram_system->last_address[dram_bank])) {
      /* We hit an open page */
      dram_system->page_hits++;
      latency = (control_time - pipelining) + data_time;
    } else {
      /* We hit a closed page */
      latency = (control_time - pipelining) + core_time + data_time;
    }
    /* Latency observed by the system, wait for other request to complete */
    latency += (dram_system->ready_channels[dram_bank] - now);
  }

  /* Commit changes to the data structure */
  dram_system->ready_channels[dram_bank] = now + latency;
  dram_system->last_address[dram_bank] = block;

  /* Update Statistics */
  dram_system->latency += latency;
  // printf("Accessing %ld at time %ld with latency %d.\n", block, now, latency);
  return latency;
}

/* This function gets the current bandwidth usage by returning the
 * number of busy channels
 */
unsigned int get_channels_in_use(dram_system_t *dram_system, tick_t now) {
  int no_busy = 0;
  int i;
  for (i = 0; i < dram_system->num_channels; i++) {
    if (dram_system->ready_channels[i] > now) {
      /* Channel is busy */
      no_busy++;
    }
  }
  return no_busy;
}

/* This function registers DRAM stats based on the data structure */
void dram_reg_stats(dram_system_t *dram_system, struct stat_sdb_t *sdb) {
  stat_reg_counter(sdb, "dram.accesses", "total number of memory accesses",
                   &(dram_system->accesses), 0, NULL);
  stat_reg_counter(sdb, "dram.latency", "total latency of all accesses",
                   &(dram_system->latency), 0, NULL);
  stat_reg_formula(sdb, "dram.avg_latency", "average latency in DRAM",
                   "dram.latency / dram.accesses", NULL);
  stat_reg_counter(sdb, "dram.stalls", "total number of stalls",
                   &(dram_system->stalls), 0, NULL);
  stat_reg_formula(sdb, "dram.percent_stall",
                   "percentage of accesses that are stalled",
                   "dram.stalls * 100.0 / dram.accesses", NULL);
  stat_reg_counter(sdb, "dram.page_hits",
                   "total number of hits on open pages",
                   &(dram_system->page_hits), 0, NULL);
  stat_reg_formula(sdb, "dram.percent_hits",
                   "percent of accesses hitting open pages",
                   "dram.page_hits * 100.0 / dram.accesses", NULL);
}

/* This function is called every cycle and generates a bandwidth trace if
   needed */
void dram_trace(dram_system_t *dram_system, tick_t now) {
  /* Only trace if non-zero interval is specified */
  if (dram_system->trace_interval > 0) {
    /* Only trace on given interval, use mod to accomplish this */
    if (now % dram_system->trace_interval == 0) {
      printf("%ld; %d\n", now, get_channels_in_use(dram_system, now));
    }
  }
}

int get_occupancy(void) {
  int i;
  long total = 0;
  for (i = 1; i < CIRC_BUFFER_SIZE + 1; i++) {
    total += circ_buffer[i];
  }
  return (total / CIRC_BUFFER_SIZE);
}
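
The latency rules in `access_dram` can be condensed into one formula: control time plus data time, plus core time on a closed page, plus the wait for a busy bank minus the part of the control transfer that overlaps with that wait. The following standalone sketch restates that arithmetic for hand-checking configurations. The type `dram_timing_sketch` and the function `sketch_latency` are hypothetical names introduced here, and bank selection, the page test and statistics are left out.

```c
/* Hypothetical container for the three timing parameters (an assumption
 * made for this sketch; the listing keeps them in dram_system_t). */
typedef struct {
    int control_time; /* control word transfer */
    int core_time;    /* cells to latches, paid only on a closed page */
    int data_time;    /* latches to memory controller */
} dram_timing_sketch;

/* Latency of one access issued at `now` to a bank that becomes ready at
 * `ready`. `page_hit` is nonzero when the access hits the open page. */
static long sketch_latency(const dram_timing_sketch *t, long ready, long now,
                           int page_hit) {
    long latency = t->control_time + t->data_time;
    if (!page_hit) {
        latency += t->core_time;
    }
    if (ready >= now) {
        /* Bank is busy: wait for it, but overlap up to control_time of the
         * wait with the control transfer (pipelining). */
        long wait = ready - now;
        long pipelining = wait > t->control_time ? t->control_time : wait;
        latency += wait - pipelining;
    }
    return latency;
}
```

With `control_time = 4`, `core_time = 20` and `data_time = 8`, an access to an idle bank costs 12 ticks on an open page and 32 on a closed one. If the bank is busy for another 10 ticks, an open-page access costs 12 + 10 - 4 = 18 ticks.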
E.5 Prefetch.c
Listing E.5: Prefetch.c
/* prefetch.c - prefetching module routines */
/* Written by Marius Grannaes 2006 */
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "host.h"
#include "misc.h"
#include "machine.h"
#include "cache.h"
#include "prefetch.h"
#include "loader.h"
#include "dram.h"

/* Globals */
extern int bwa_treshold;
extern int bwa_enable;

/* We need to store pointers to the caches so that we can access them */
struct cache_t *datal1;
struct cache_t *datal2;
struct cache_t *instructionl1;
struct cache_t *instructionl2;

/* And a pointer to memory so we can examine the returning data */
struct mem_t *mem;

/* This is where the prefetching occurs */
location_t prefetch_location = None;

/* The type of access that triggers a prefetch */
trigger_type_t target_type = Cache_Miss;

/* This is the actual target cache, selected from the above in the
   prefetch_init function */
struct cache_t *target_cache;

/* This is a flag that is set to 1 if a prefetch is attempted. */
int prefetch_attempt = 0;

/* Function pointer to the active algorithm */
void (*prefetch_algorithm)(prefetch_trigger_t);

/* Prefetch degree */
int prefetch_degree = 1;

/* CZone size in bits */
int czone_size = 16;

/* Table size in entries (GHB, RPT etc) */
int table_size = 64;

/* Global History buffer */
md_addr_t *ghb;

/* RPT table */
rpt_entry_t *rpt;

/* Stream offset */
int stream_offset = 4;

/* AVD table */
avd_entry_t *avd;

/* AVD threshold (MAXAVD) */
long int maxavd = 65535;

/* Delta buffer */
md_addr_t *delta_buffer;

/* Global History buffer top */
int ghb_top = 0;

/* Global History buffer fill */
int ghb_is_full = 0;

/* Return latency - Used when prefetches use returned data */
int return_latency = 0;

/* Last PC value - Note this is a hack:
 * This is done to avoid passing the current PC value along with
 * each call to cache_access().
 * Thus the code becomes more modular and easier to maintain.
 */
md_addr_t last_PC_value;

/*
 * This function sets the last PC value, this is typically done when the first
 * access to level 1 data cache occurs - note: This type of prefetching only
 * makes sense for data-prefetchers!
 */
md_addr_t set_last_PC_value(md_addr_t PC) {
  last_PC_value = PC;
  return last_PC_value;
}

/*
 * This function sets where the prefetching occurs.
 */
int register_prefetch_location(location_t location) {
  prefetch_location = location;
  return (0);
}

/*
 * This function sets the event that triggers a prefetch
 */
void register_prefetch_target(trigger_type_t target) {
  target_type = target;
}

/*
 * This function sets the prefetch degree.
 */
void register_prefetch_degree(int degree) {
  prefetch_degree = degree;
}

/*
 * This function sets the function pointer to point to the needed function.
 */
int register_prefetch_algorithm(void (*algorithm)(prefetch_trigger_t)) {
  prefetch_algorithm = algorithm;
  return (0);
}

/*
 * This function sets the return latency
 */
void set_return_latency(int latency) {
  return_latency = latency;
}

/*
 * This function sets up the prefetching. In addition to storing the cache
 * pointers, it resolves which location needs to be prefetched
 */
int prefetch_init(struct cache_t *il1, struct cache_t *il2,
                  struct cache_t *dl1, struct cache_t *dl2,
                  int csize, int tsize, struct mem_t *memory) {
  czone_size = csize;
  table_size = tsize;
  datal1 = dl1;
  datal2 = dl2;
  instructionl1 = il1;
  instructionl2 = il2;
  mem = memory;

  if (prefetch_location == Cache_IL1) {
    target_cache = instructionl1;
  } else if (prefetch_location == Cache_DL1) {
    target_cache = datal1;
  } else if (prefetch_location == Cache_IL2) {
    target_cache = instructionl2;
  } else if (prefetch_location == Cache_DL2) {
    target_cache = datal2;
  }

  /* Set up the reference prediction table */
  rpt = calloc(table_size, sizeof(rpt_entry_t));

  /* Set up the AVD prediction table */
  avd = calloc(table_size, sizeof(avd_entry_t));

  /* Set up the global history buffer */
  ghb = calloc(table_size, sizeof(md_addr_t));

  /* Allocate a delta buffer of equal size to the GHB */
  delta_buffer = calloc(table_size, sizeof(md_addr_t));

  return (1);
}

/*
 * This is the prefetching function. It issues the actual prefetches.
 * This is used by the algorithms as an abstraction
 */
void prefetch(md_addr_t adress, struct cache_t *target, time_t now) {
  /* Check if the prefetched address is valid */
  if (MD_VALID_ADDR(adress)) {
    /* Check if the prefetched address is already in the cache */
    if (cache_probe(target, adress) == FALSE) {
      if (bwa_enable) {
        if (get_occupancy() < bwa_treshold) {
          cache_access(target, Prefetch, adress, NULL, 4, now, NULL, NULL);
        }
      } else {
        cache_access(target, Prefetch, adress, NULL, 4, now, NULL, NULL);
      }
    }
  }
}

/*
 * As the name implies this prefetching method does no prefetching.
 * This is the default case.
 */
void no_prefetch(prefetch_trigger_t trigger) {
}

/*
 * This is the sequential prefetching on Miss algorithm.
 * When a miss in the cache occurs on block X, block X+1 is prefetched.
 */
void sequential_prefetch(prefetch_trigger_t trigger) {
  int i;
  for (i = 1; i <= prefetch_degree; i++) {
    prefetch(trigger.address + i * target_cache->bsize, target_cache,
             trigger.time);
  }
}

/*
 * This is a delta correlation prefetching algorithm on access.
 * It stores the last two accesses. If the delta is equal in access
 * X, X-1, X-2, a prefetch is issued.
 */
void delta_correlation_prefetch(prefetch_trigger_t trigger) {
  int i, j;
  int delta_count = 0;
  int delta_1;
  int delta_2;
  int delta;
  md_addr_t adress;

  /* Insert reference into GHB */
  ghb_top = (ghb_top + 1) % table_size;
  ghb[ghb_top] = trigger.address;

  /* Construct delta buffer */
  /* GHB is a circular buffer, need two separate for loops */
  for (i = ghb_top; i >= 0; i--) {
    delta_buffer[delta_count] = ghb[i];
    delta_count++;
  }
  for (i = table_size - 1; i > ghb_top; i--) {
    delta_buffer[delta_count] = ghb[i];
    delta_count++;
  }

  /* Correlate deltas - Two deltas are used as Nesbit found optimal */
  delta_1 = delta_buffer[0] - delta_buffer[1];
  delta_2 = delta_buffer[1] - delta_buffer[2];

  /* Search for first delta */
  for (i = 2; i < delta_count - 2; i++) {
    if (delta_buffer[i] - delta_buffer[i + 1] == delta_1) {
      if (delta_buffer[i + 1] - delta_buffer[i + 2] == delta_2) {
        /* Pattern found */
        /* Start prefetching */
        adress = trigger.address;
        for (j = 1; j <= prefetch_degree; j++) {
          /* Find next delta */
          i--;
          if (i < 0) {
            break;
          }
          adress += delta_buffer[i] - delta_buffer[i + 1];
          prefetch(adress, target_cache, trigger.time);
        }
        /* Break out of the loop */
        break;
      }
    }
  }
}

/*
 * This is the strided prefetching algorithm using a global history buffer
 * and delta correlation.
 */
void czone_delta_correlation_prefetch(prefetch_trigger_t trigger) {
  int i, j;
  int delta_count = 0;
  int delta_1;
  int delta_2;
  int delta;
  md_addr_t adress;

  /* Insert reference into GHB */
  ghb_top = (ghb_top + 1) % table_size;
  ghb[ghb_top] = trigger.address;

  /* Construct delta buffer */
  /* GHB is a circular buffer, need two separate for loops */
  for (i = ghb_top; i >= 0; i--) {
    /* If this entry matches the czone size, put it into the buffer */
    if (ghb[i] >> czone_size == trigger.address >> czone_size) {
      delta_buffer[delta_count] = ghb[i];
      delta_count++;
    }
  }
  for (i = table_size - 1; i > ghb_top; i--) {
    /* If this entry matches the czone size, put it into the buffer */
    if (ghb[i] >> czone_size == trigger.address >> czone_size) {
      delta_buffer[delta_count] = ghb[i];
      delta_count++;
    }
  }

  /* We can only prefetch if there is enough data available */
  if (delta_count > 3) {
    /* Correlate deltas - Two deltas are used as Nesbit found optimal */
    delta_1 = delta_buffer[0] - delta_buffer[1];
    delta_2 = delta_buffer[1] - delta_buffer[2];

    /* Search for first delta */
    for (i = 2; i < delta_count - 2; i++) {
      if (delta_buffer[i] - delta_buffer[i + 1] == delta_1) {
        if (delta_buffer[i + 1] - delta_buffer[i + 2] == delta_2) {
          /* Pattern found */
          /* Start prefetching */
          adress = trigger.address;
          for (j = 1; j <= prefetch_degree; j++) {
            /* Find next delta */
            i--;
            if (i < 0) {
              break;
            }
            adress += delta_buffer[i] - delta_buffer[i + 1];
            prefetch(adress, target_cache, trigger.time);
          }
          /* Break out of the loop */
          break;
        }
      }
    }
  }
}

/* This is the strided prefetching algorithm using an RPT */
void rpt_prefetch(prefetch_trigger_t trigger) {
  int index;     /* Index into the RPT table */
  int i;         /* Loop index */
  tick_t oldest; /* Used to find a RPT entry to replace */
  int correct;   /* Flag if predictions are correct */

  /* Check if the entry is in the table */
  index = -1;
  /* Linear search through the RPT */
  for (i = 0; i < table_size; i++) {
    if (last_PC_value == rpt[i].pc) {
      index = i;
      break;
    }
  }

  if (index > -1) {
    /* Entry is in table */
    /* Calculate if prediction is correct */
    if (trigger.address == rpt[index].prev_addr + rpt[index].stride) {
      correct = 1;
    } else {
      correct = 0;
    }
    switch (rpt[index].state) {
    case Initial:
      if (!correct) {
        rpt[index].stride = trigger.address - rpt[index].prev_addr;
        rpt[index].state = Transient;
      } else {
        rpt[index].state = Steady;
      }
      break;
    case Transient:
      if (correct) {
        rpt[index].state = Steady;
      } else {
        rpt[index].stride = trigger.address - rpt[index].prev_addr;
        rpt[index].state = No_prediction;
      }
      break;
    case Steady:
      if (correct) {
        rpt[index].state = Steady;
      } else {
        rpt[index].state = Initial;
      }
      break;
    case No_prediction:
      if (correct) {
        rpt[index].state = Transient;
      } else {
        rpt[index].stride = trigger.address - rpt[index].prev_addr;
        rpt[index].state = No_prediction;
      }
      break;
    default:
      printf("Something weird happened, shouldn't be in this state");
    }
    rpt[index].prev_addr = trigger.address;
    /* Update access time */
    rpt[index].atime = trigger.time;

    /* If we now are in the steady state; issue prefetches! */
    if (rpt[index].state == Steady) {
      for (i = 1; i <= prefetch_degree; i++) {
        prefetch(trigger.address + i * rpt[index].stride, datal2, trigger.time);
      }
    }
  } else {
    /* This entry is not in the table, so we insert it. */
    /* Find the oldest entry through linear search */
    oldest = rpt[0].atime;
    index = 0;
    for (i = 0; i < table_size; i++) {
      if (oldest > rpt[i].atime) {
        oldest = rpt[i].atime;
        index = i;
      }
    }
    /* Replace the oldest */
    rpt[index].pc = last_PC_value;
    rpt[index].prev_addr = trigger.address;
    rpt[index].stride = 0;
    rpt[index].atime = trigger.time;
    rpt[index].state = Initial;
  }
}

/* This is the streaming prefetching used in the Power 4 processor by IBM
 * It has been modified to fit a memory hierarchy of only two levels
 * For more information see:
 * http://www.research.ibm.com/journal/rd/461/tendler.html
 * NOTE: To detect streams, we use the rpt
 */
void stream_prefetch(prefetch_trigger_t trigger) {
  int index;     /* Index into the RPT table */
  int i;         /* Loop index */
  tick_t oldest; /* Used to find a RPT entry to replace */
  int correct;   /* Flag if predictions are correct */

  /* Only prefetch if it is a data access */
  if (last_PC_value != -1) {
    /* Check if this is the correct type */
    /* Check if the entry is in the table */
    index = -1;
    /* Linear search through the RPT */
    for (i = 0; i < table_size; i++) {
      if (last_PC_value == rpt[i].pc) {
        index = i;
        break;
      }
    }

    if (index > -1) {
      /* Entry is in table */
      /* Calculate if prediction is correct */
      if (trigger.address == rpt[index].prev_addr + rpt[index].stride) {
        correct = 1;
      } else {
        correct = 0;
      }
      switch (rpt[index].state) {
      case Initial:
        if (!correct) {
          rpt[index].stride = trigger.address - rpt[index].prev_addr;
          rpt[index].state = Transient;
        } else {
          rpt[index].state = Steady;
        }
        break;
      case Transient:
        if (correct) {
          rpt[index].state = Steady;
        } else {
          rpt[index].stride = trigger.address - rpt[index].prev_addr;
          rpt[index].state = No_prediction;
        }
        break;
      case Steady:
        if (correct) {
          rpt[index].state = Steady;
        } else {
          rpt[index].state = Initial;
        }
        break;
      case No_prediction:
        if (correct) {
          rpt[index].state = Transient;
        } else {
          rpt[index].stride = trigger.address - rpt[index].prev_addr;
          rpt[index].state = No_prediction;
        }
        break;
      default:
        printf("Something weird happened, shouldn't be in this state");
      }
      rpt[index].prev_addr = trigger.address;
      /* Update access time */
      rpt[index].atime = trigger.time;

      /* If we now are in the steady state; issue prefetches! */
      if (rpt[index].state == Steady) {
        /* Stream prefetching */
        /* First access goes to the L1 */
        prefetch(trigger.address + rpt[index].stride, datal1, trigger.time);
        /* Then some data is transferred to the L2 */
        for (i = stream_offset; i < prefetch_degree + stream_offset; i++) {
          prefetch(trigger.address + i * rpt[index].stride, datal2,
                   trigger.time);
        }
      }
    } else {
      /* This entry is not in the table, so we insert it. */
      /* Find the oldest entry through linear search */
      oldest = rpt[0].atime;
      index = 0;
      for (i = 0; i < table_size; i++) {
        if (oldest > rpt[i].atime) {
          oldest = rpt[i].atime;
          index = i;
        }
      }
      /* Replace the oldest */
      rpt[index].pc = last_PC_value;
      rpt[index].prev_addr = trigger.address;
      rpt[index].stride = 0;
      rpt[index].atime = trigger.time;
      rpt[index].state = Initial;
    }
  }
}

/* This is the address-value delta algorithm.
 * It compares the load address with the data returned.
 * If it is similar, a prefetch is issued.
 * See prefetch.h for a reference to literature.
 */
void avd_prefetch(prefetch_trigger_t trigger) {
  md_addr_t data; /* Store the returned data */
  int index;      /* Index to the AVD table */
  int i;          /* General purpose loop index */

  /* Get the associated data */
  mem_access(mem, Read, trigger.address, &data, sizeof(md_addr_t));

  /* Check if delta is within bounds (maxavd) */
  if (((trigger.address - data) < maxavd) ||
      ((data - trigger.address) < maxavd)) {
    /* Prefetch into the future!
     * This is due to the fact that the data will not be
     * available until the return latency has passed
     */
    prefetch(((data >> 8) << 8), target_cache, trigger.time + return_latency);
  }
}

/*
 * This is the prefetching engine, it packs a trigger into the
 * prefetch_trigger_t format
 * and uses a function pointer to send it to the correct place.
 */
int process_prefetch_trigger(trigger_type_t type, location_t location,
                             md_addr_t address, tick_t now) {
  prefetch_trigger_t prefetch_trigger;

  /* Avoid prefetching algorithms that generate new prefetching */
  if (prefetch_attempt == 0) {
    /* Pack the data into a prefetch_trigger_t */
    prefetch_trigger.type = type;
    prefetch_trigger.location = location;
    prefetch_trigger.address = address;
    prefetch_trigger.time = now;

    /* Send the prefetch trigger to the
     * correct algorithm using function pointers.
     * prefetch_attempt variable is used as a lock to avoid cascading
     * prefetches
     */
    prefetch_attempt = 1;
    if ((type == target_type) && (location == prefetch_location)) {
      prefetch_algorithm(prefetch_trigger);
    }
    prefetch_attempt = 0;
  }
  return (0);
}
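
The four-state stride predictor shared by `rpt_prefetch` and `stream_prefetch` is easier to follow in isolation. The sketch below reproduces only the state transitions for a single table entry. The names `sketch_state`, `sketch_entry` and `sketch_update` are hypothetical, and table lookup, replacement and the actual prefetch calls are omitted.

```c
/* States as used by the listing's RPT (Chen and Baer); the identifiers in
 * this sketch are illustrative assumptions. */
typedef enum { SK_INITIAL, SK_TRANSIENT, SK_STEADY, SK_NO_PREDICTION } sketch_state;

typedef struct {
    unsigned long prev_addr; /* previous address seen for this load */
    long stride;             /* last recorded stride */
    sketch_state state;
} sketch_entry;

/* Apply one access to the entry. Returns 1 if the entry is in the steady
 * state afterwards, which is when the listing issues prefetches. */
static int sketch_update(sketch_entry *e, unsigned long addr) {
    int correct = (addr == e->prev_addr + (unsigned long)e->stride);

    switch (e->state) {
    case SK_INITIAL:
        if (correct) {
            e->state = SK_STEADY;
        } else {
            e->stride = (long)(addr - e->prev_addr);
            e->state = SK_TRANSIENT;
        }
        break;
    case SK_TRANSIENT:
        if (correct) {
            e->state = SK_STEADY;
        } else {
            e->stride = (long)(addr - e->prev_addr);
            e->state = SK_NO_PREDICTION;
        }
        break;
    case SK_STEADY:
        /* A misprediction drops back to initial but keeps the stride. */
        if (!correct) {
            e->state = SK_INITIAL;
        }
        break;
    case SK_NO_PREDICTION:
        if (correct) {
            e->state = SK_TRANSIENT;
        } else {
            e->stride = (long)(addr - e->prev_addr);
        }
        break;
    }
    e->prev_addr = addr;
    return e->state == SK_STEADY;
}
```

Starting from a fresh entry at address 1000 with stride 0, the accesses 1064 and 1128 take the entry through transient to steady. One access that breaks the pattern returns it to initial with the stride of 64 retained, so a single further access that follows that stride makes it steady again.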
E.6 Prefetch.h
Listing E.6: Prefetch.h
/∗ p r e f e t c h . h − p r e f e t c h module i n t e r f a c e s and d e f i n i t i o n s ∗/
/∗ Written by Marius Grannaes 2006 ∗/
#ifndef PREFETCH H
#define PREFETCH H
#include <s t d i o . h>
#include ”host . h”
#include ”misc . h”
#include ”machine . h”
#include ”memory . h”
#include ” s t a t s . h”
#include ”cache . h”
/∗ The var ious t r i g g e r t ype s ∗/
typedef enum {
Cache Miss , /∗ A miss in the cache ∗/
Cache Hit , /∗ A h i t in the cache ∗/
Memory Access , /∗ An access to memory by an i n s t r u c t i o n ∗/
PC Update , /∗ Program counter i s updated ∗/
No event /∗ Dummy t r i g g e r ∗/
} t r i g g e r t y p e t ;
/∗ This enum hand les whare p r e f e t c h i n g t r i g g e r s happen ∗/
typedef enum {
Cache IL1 , /∗ Event happened in the D−I1 cache
∗/
Cache DL1 , /∗ Event happened in the D−L1 cache
∗/
Cache IL2 , /∗ Event happened in the I−L2 cache
∗/
Cache DL2 , /∗ Event happened in the D−L2 cache
∗/
DRAM, /∗ Event happened in DRAM ∗/
None /∗ When l o c a t i o n doesn ’ t matter − eg
PC Update ∗/
} l o c a t i o n t ;
/∗ The main p r e f e t c h i n g s t r u c t u r e ∗/
typedef struct {
t r i g g e r t y p e t type ; /∗ What happened ∗/
l o c a t i o n t l o c a t i o n ; /∗ Where something did happen ∗/
md addr t address ; /∗ Adress o f memory acces s ∗/
t i c k t time ; /∗ Time o f acces s ∗/
} p r e f e t c h t r i g g e r t ;
/∗ S ta t e s f o r the rp t t a b l e ∗/
/∗ Defined in the paper by Chen and Baer ∗/
177
E.6. PREFETCH.H APPENDIX E. UNIPROCESSOR CODE
typedef enum {
  Initial,
  Transient,
  Steady,
  No_prediction
} rpt_state_t;
/* RPT table structure */
typedef struct {
  md_addr_t pc;        /* Address of loading instruction */
  md_addr_t prev_addr; /* Previous address loaded by instruction */
  int stride;          /* Recorded stride */
  rpt_state_t state;   /* Current state */
  tick_t atime;        /* Time of last access (in ticks) */
} rpt_entry_t;
/* AVD prefetcher as in the paper:
 * "Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of
 * Runahead Execution by Exploiting Regular Memory Allocation Patterns" by
 * Onur Mutlu, Hyesoon Kim and Yale N. Patt
 */
/* AVD table structure */
typedef struct {
  md_addr_t pc;   /* Address of loading instruction */
  int avd;        /* Calculated delta */
  tick_t atime;   /* Time of last access (in ticks) */
  int confidence; /* Confidence in prediction */
} avd_entry_t;
/* Prototype declarations */
md_addr_t set_last_PC_value(md_addr_t PC);
int register_prefetch_location(location_t location);
void register_prefetch_degree(int degree);
void register_prefetch_target(trigger_type_t target);
int register_prefetch_algorithm(void (*algorithm)(prefetch_trigger_t));
void set_return_latency(int latency);
int prefetch_init(struct cache_t *il1, struct cache_t *il2, struct cache_t *dl1,
                  struct cache_t *dl2, int csize, int tsize, struct mem_t *memory);
void prefetch(md_addr_t adress, struct cache_t *target, time_t now);
void no_prefetch(prefetch_trigger_t trigger);
void sequential_prefetch(prefetch_trigger_t trigger);
void delta_correlation_prefetch(prefetch_trigger_t trigger);
void czone_delta_correlation_prefetch(prefetch_trigger_t trigger);
void rpt_prefetch(prefetch_trigger_t trigger);
void avd_prefetch(prefetch_trigger_t trigger);
void stream_prefetch(prefetch_trigger_t trigger);
int process_prefetch_trigger(trigger_type_t type, location_t location,
                             md_addr_t address, tick_t now);
#endif /* PREFETCH_H */
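The four `rpt_state_t` values in this header correspond to the stride-predictor states described by Chen and Baer. As a rough illustration of how one table entry might be updated on each load, the sketch below applies the commonly cited transition rules. The names `entry_t` and `rpt_update`, and the use of `uint64_t` in place of `md_addr_t`, are assumptions made for this example only; the body of `rpt_prefetch` is not part of this listing and may well differ.

```c
#include <stdint.h>

/* Hypothetical stand-ins for rpt_state_t / rpt_entry_t (names are assumptions). */
typedef enum { RPT_INITIAL, RPT_TRANSIENT, RPT_STEADY, RPT_NO_PREDICTION } state_t;

typedef struct {
    uint64_t prev_addr; /* previous address loaded by the instruction */
    int64_t  stride;    /* recorded stride */
    state_t  state;     /* current predictor state */
} entry_t;

/* Update one entry with a newly observed address.
 * Returns 1 if the entry is in the steady state after the update, else 0. */
int rpt_update(entry_t *e, uint64_t addr)
{
    int64_t observed = (int64_t)(addr - e->prev_addr);
    int correct = (observed == e->stride);

    switch (e->state) {
    case RPT_INITIAL:
        if (correct) {
            e->state = RPT_STEADY;
        } else {
            e->state = RPT_TRANSIENT;
            e->stride = observed;
        }
        break;
    case RPT_TRANSIENT:
        if (correct) {
            e->state = RPT_STEADY;
        } else {
            e->state = RPT_NO_PREDICTION;
            e->stride = observed;
        }
        break;
    case RPT_STEADY:
        /* A single misprediction falls back to initial; the stride is kept. */
        if (!correct)
            e->state = RPT_INITIAL;
        break;
    case RPT_NO_PREDICTION:
        if (correct)
            e->state = RPT_TRANSIENT;
        else
            e->stride = observed;
        break;
    }
    e->prev_addr = addr;
    return e->state == RPT_STEADY;
}
```

Keeping the stride unchanged on the first miss from the steady state means a single irregular access, such as the end of an inner loop, does not discard a stride that is likely to be valid again on the next iteration.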
E.7 Memory.h
Listing E.7: Memory.h - Unified diff against SimpleScalar 3.0d
--- ../simplesim-3.0-orig/memory.h	2003-10-09 03:13:46.000000000 +0200
+++ ../simplesim-3.0/memory.h	2006-05-08 23:14:32.000000000 +0200
@@ -86,7 +86,8 @@
 /* memory access command */
 enum mem_cmd {
   Read, /* read memory from target (simulated prog) to host */
-  Write /* write memory from host (simulator) to target */
+  Write, /* write memory from host (simulator) to target */
+  Prefetch /* prefetch memory */
 };
 /* memory access function type, this is a generic function exported for the
Appendix F
CMP Code
F.1 Makefile
Listing F.1: Makefile - Unified diff against Uniprocessor version
--- ../simplesim-3.0/Makefile	2006-05-25 23:42:04.000000000 +0200
+++ ../../../felles-svn/project/branches/grannas_prefetch/Makefile	2006-02-06 22:55:24.000000000 +0100
@@ -78,7 +78,7 @@
 ## Windows NT version 4.0, Cygnus CygWin/32 beta 19
 ##
 CC = gcc-3.4
-OFLAGS = -g
+OFLAGS = -g
 MFLAGS = `./sysprobe -flags`
 MLIBS = `./sysprobe -libs` -lm
 ENDIAN = `./sysprobe -s`
@@ -277,7 +277,7 @@
 #
 # all the sources
 #
-SRCS = main.c dram.c prefetch.c sim-fast.c sim-safe.c sim-cache.c sim-profile.c \
+SRCS = dram.c prefetch.c shared.c main.c sim-fast.c sim-safe.c sim-cache.c sim-profile.c \
 sim-eio.c sim-bpred.c sim-cheetah.c sim-outorder.c \
 memory.c regs.c cache.c bpred.c ptrace.c eventq.c \
 resource.c endian.c dlite.c symbol.c eval.c options.c range.c \
@@ -287,7 +287,7 @@
 target-alpha/alpha.c target-alpha/loader.c target-alpha/syscall.c \
 target-alpha/symbol.c
-HDRS = dram.h syscall.h memory.h regs.h sim.h loader.h cache.h bpred.h ptrace.h \
+HDRS = dram.h prefetch.h syscall.h memory.h regs.h sim.h loader.h cache.h bpred.h ptrace.h \
 eventq.h resource.h endian.h dlite.h symbol.h eval.h bitmap.h \
 eio.h range.h version.h endian.h misc.h \
 target-pisa/pisa.h target-pisa/pisabig.h target-pisa/pisalittle.h \
@@ -305,7 +305,7 @@
 #
 # programs to build
 #
-PROGS = sim-outorder$(EEXT)
+PROGS = controller$(EEXT) sim-outorder$(EEXT)
 #
 # all targets, NOTE: library ordering is important...
@@ -388,8 +388,10 @@
 sim-cache$(EEXT): sysprobe$(EEXT) sim-cache.$(OEXT) cache.$(OEXT) $(OBJS) libexo/libexo.$(LEXT)
 	$(CC) -o sim-cache$(EEXT) $(CFLAGS) sim-cache.$(OEXT) cache.$(OEXT) $(OBJS) libexo/libexo.$(LEXT) $(MLIBS)
-sim-outorder$(EEXT): sysprobe$(EEXT) sim-outorder.$(OEXT) dram.$(OEXT) cache.$(OEXT) prefetch.$(OEXT) bpred.$(OEXT) resource.$(OEXT) ptrace.$(OEXT) $(OBJS) libexo/libexo.$(LEXT)
-	$(CC) -o sim-outorder$(EEXT) $(CFLAGS) sim-outorder.$(OEXT) cache.$(OEXT) dram.$(OEXT) prefetch.$(OEXT) bpred.$(OEXT) resource.$(OEXT) ptrace.$(OEXT) $(OBJS) libexo/libexo.$(LEXT) $(MLIBS)
+sim-outorder$(EEXT): dram.$(OEXT) prefetch.$(OEXT) sysprobe$(EEXT) sim-outorder.$(OEXT) cache.$(OEXT) bpred.$(OEXT) resource.$(OEXT) shared.$(OEXT) ptrace.$(OEXT) $(OBJS) libexo/libexo.$(LEXT)
+	$(CC) -o sim-outorder$(EEXT) $(CFLAGS) sim-outorder.$(OEXT) dram.$(OEXT) prefetch.$(OEXT) cache.$(OEXT) shared.$(OEXT) bpred.$(OEXT) resource.$(OEXT) ptrace.$(OEXT) $(OBJS) libexo/libexo.$(LEXT) $(MLIBS)
+controller$(EEXT): shared.$(OEXT) controller.c
+	$(CC) $(CFLAGS) shared.$(OEXT) controller.c -o controller$(EEXT)
 exo libexo/libexo.$(LEXT): sysprobe$(EEXT)
 	cd libexo $(CS) \
@@ -497,8 +499,6 @@
 regs.$(OEXT): options.h stats.h eval.h
 cache.$(OEXT): host.h misc.h machine.h machine.def cache.h memory.h options.h
 cache.$(OEXT): stats.h eval.h
-prefetch.$(OEXT): host.h misc.h machine.h machine.def cache.h memory.h options.h
-prefetch.$(OEXT): stats.h eval.h cache.h
 bpred.$(OEXT): host.h misc.h machine.h machine.def bpred.h stats.h eval.h
 ptrace.$(OEXT): host.h misc.h machine.h machine.def range.h ptrace.h
 eventq.$(OEXT): host.h misc.h machine.h machine.def eventq.h bitmap.h
@@ -508,6 +508,8 @@
 dlite.$(OEXT): host.h misc.h machine.h machine.def version.h eval.h regs.h
 dlite.$(OEXT): memory.h options.h stats.h sim.h symbol.h loader.h range.h
 dlite.$(OEXT): dlite.h
+prefetch.$(OEXT): host.h misc.h machine.h machine.def cache.h memory.h options.h
+prefetch.$(OEXT): stats.h eval.h cache.h
 symbol.$(OEXT): host.h misc.h target-pisa/ecoff.h loader.h machine.h
 symbol.$(OEXT): machine.def regs.h memory.h options.h stats.h eval.h symbol.h
 eval.$(OEXT): host.h misc.h eval.h machine.h machine.def
F.2 Sim-outorder.c
Listing F.2: Sim-outorder.c - Unified diff against Uniprocessor version
--- ../simplesim-3.0/sim-outorder.c	2006-05-31 13:32:54.000000000 +0200
+++ ../../../felles-svn/project/branches/grannas_prefetch/sim-outorder.c	2006-05-31 13:33:51.000000000 +0200
@@ -54,6 +54,11 @@
 #include <math.h>
 #include <assert.h>
 #include <signal.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/ipc.h>
+#include <sys/sem.h>
+#include <sys/shm.h>
 #include "host.h"
 #include "misc.h"
@@ -72,9 +77,11 @@
 #include "ptrace.h"
 #include "dlite.h"
 #include "sim.h"
+#include "shared.h"
 #include "dram.h"
 #include "prefetch.h"
+
 /*
  * This file implements a very detailed out-of-order issue superscalar
  * processor with a two-level memory system and speculative execution support.
@@ -88,11 +95,36 @@
 /* simulated memory */
 static struct mem_t *mem = NULL;
+union semun {
+  int val;
+  struct semid_ds *buf;
+  ushort *array;
+} arg;
+
+/* Global shared memory variables */
+
+int sync_semaphore_id;
+int controller_semaphore_id;
+int report_semaphore_id;
+int l2_semaphore_id;
+struct sembuf sb = {0, -1, 0}; /* set to allocate resource */
+
+int controller_id; // The id of the shared memory segment for the controller
+int counter_id; // The id containg in the shared memory segment for the counter
+
+int *controller; // Pointer to the controller segment
+counter_t *counter; // Pointer to the counter segment
 /*
  * simulator options
  */
+/* The number of concurrent processors */
+unsigned int total_cpus;
+
+/* This processors number */
+unsigned int my_cpuid;
+
 /* Prefetch options */
 /* prefetching type {none|sequential....} */
@@ -163,6 +195,9 @@
 /* maximum number of inst's to execute */
 static unsigned int max_insts;
+/* maximum number of cycles to execute */
+tick_t max_cycles;
+
 /* number of insts skipped before timing starts */
 static int fastfwd_count;
@@ -476,6 +511,7 @@
 : (panic("bad stat class"), 0))))
+
 /* memory access latency, assumed to not cross a page boundary */
 static unsigned int /* total latency of access */
 mem_access_latency(int blk_sz) /* block size accessed */
@@ -503,44 +539,40 @@
 {
   unsigned int lat;
   process_prefetch_trigger(Cache_Miss, Cache_DL1, baddr, sim_cycle);
-  if (cache_dl2)
-  {
+  if (cache_dl2) {
     if (perfect_l2) {
       lat = cache_dl2->hit_latency;
     } else {
+      my_lock(l2_semaphore_id, 0);
       /* access next level of data cache hierarchy */
       lat = cache_access(cache_dl2, cmd, baddr, NULL, bsize,
-                         /* now */now, /* pudata */NULL, /* repl addr */NULL);
+                         /* now */now, /* pudata */NULL, /* repl addr */NULL);
+      my_unlock(l2_semaphore_id, 0);
     }
     process_prefetch_trigger(Memory_Access, Cache_DL2, baddr, sim_cycle);
     if (cmd == Read)
       return lat;
-    else
-    {
-      /* FIXME: unlimited write buffers */
-      return 0;
-    }
-  }
-  else
-  {
+    else {
+      /* FIXME: unlimited write buffers */
+      return 0;
+    }
+  } else {
     /* access main memory */
     if (cmd == Read) {
       /* If the dram system is defined using 0 channels, then fallback
        * to old model */
-      if (dram_system->num_channels == 0) {
-        lat = mem_access_latency(bsize);
+      if (num_channels == 0) {
+        lat = mem_access_latency(bsize);
       } else {
-        lat = access_dram(dram_system, baddr, bsize, now);
+        lat = access_dram(dram_system, baddr, bsize, my_cpuid, now);
       }
       process_prefetch_trigger(Memory_Access, DRAM, baddr, sim_cycle);
       return lat;
-    }
-    else
-    {
-      /* FIXME: unlimited write buffers */
-      return 0;
-    }
-  }
+    } else {
+      /* FIXME: unlimited write buffers */
+      return 0;
+    }
+  }
 }
 /* l2 data cache block miss handler function */
@@ -556,10 +588,10 @@
   if (cmd == Read) {
     /* If the dram system is defined using 0 channels, then fallback
      * to old model */
-    if (dram_system->num_channels == 0) {
+    if (num_channels == 0) {
       latency = mem_access_latency(bsize);
     } else {
-      latency = access_dram(dram_system, baddr, bsize, now);
+      latency = access_dram(dram_system, baddr, bsize, my_cpuid, now);
     }
   }
   else
@@ -582,18 +614,19 @@
   tick_t now) /* time of access */
 {
   unsigned int lat;
-
   if (cache_il2)
   {
     /* access next level of inst cache hierarchy */
     if (perfect_l2) {
       lat = cache_il2->hit_latency;
     } else {
+      my_lock(l2_semaphore_id, 0);
       lat = cache_access(cache_il2, cmd, baddr, NULL, bsize,
-                         /* now */now, /* pudata */NULL, /* repl addr */NULL);
+                         /* now */now, /* pudata */NULL, /* repl addr */NULL);
+      my_unlock(l2_semaphore_id, 0);
     }
     if (cmd != Read) {
-      panic("writes to instruction memory not supported");
+      panic("writes to instruction memory not supported");
     }
     process_prefetch_trigger(Memory_Access, Cache_IL2, baddr, sim_cycle);
   }
@@ -603,19 +636,20 @@
     if (cmd == Read) {
       /* If the dram system is defined using 0 channels, then fallback
        * to old model */
-      if (dram_system->num_channels == 0) {
+      if (num_channels == 0) {
         lat = mem_access_latency(bsize);
       } else {
-        lat = access_dram(dram_system, baddr, bsize, now);
+        lat = access_dram(dram_system, baddr, bsize, my_cpuid, now);
       }
       process_prefetch_trigger(Memory_Access, DRAM, baddr, sim_cycle);
     } else {
-      panic("writes to instruction memory not supported");
+      panic("writes to instruction memory not supported");
     }
   }
   return lat;
 }
+
 /* l2 inst cache block miss handler function */
 static unsigned int /* latency of block access */
 il2_access_fn(enum mem_cmd cmd, /* access cmd, Read or Write */
@@ -626,25 +660,24 @@
 {
   int latency;
   /* this is a miss to the lowest level, so access main memory */
-  if (cmd == Read) {
-    /* If the dram system is defined using 0 channels, then fallback
-     * to old model */
-    if (dram_system->num_channels == 0) {
-      latency = mem_access_latency(bsize);
-    } else {
-      latency = access_dram(dram_system, baddr, bsize, now);
-    }
-    process_prefetch_trigger(Memory_Access, DRAM, baddr, sim_cycle);
-  }
-  else {
-    panic("writes to instruction memory not supported");
-  }
-  process_prefetch_trigger(Cache_Miss, Cache_IL2, baddr, sim_cycle);
-  return latency;
+  if (cmd == Read) {
+    /* If the dram system is defined using 0 channels, then fallback
+     * to old model */
+    if (num_channels == 0) {
+      latency = mem_access_latency(bsize);
+    } else {
+      latency = access_dram(dram_system, baddr, bsize, my_cpuid, now);
+    }
+    process_prefetch_trigger(Memory_Access, DRAM, baddr, sim_cycle);
+  }
+  else {
+    panic("writes to instruction memory not supported");
+  }
+  process_prefetch_trigger(Cache_Miss, Cache_IL2, baddr, sim_cycle);
+  return latency;
 }
-
 /*
  * TLB miss handlers
  */
@@ -701,6 +734,16 @@
 "latency of all pipeline operations.\n"
 );
+  /* Multiprocessors parameters */
+
+  opt_reg_uint(odb, "-cpu:total", "The total number of cpu's",
+               &total_cpus, /* default */1,
+               /* print */TRUE, /* format */NULL);
+
+  opt_reg_uint(odb, "-cpu:this", "The cpuid for this processor (start with zero)",
+               &my_cpuid, /* default */0,
+               /* print */TRUE, /* format */NULL);
+
   /* New DRAM-options */
   opt_reg_uint(odb, "-dram:chan", "number of DRAM channels",
@@ -778,6 +821,10 @@
               &max_insts, /* default */0,
               /* print */TRUE, /* format */NULL);
+  opt_reg_int(odb, "-max:cycles", "maximum numer of cycles to execute",
+              &max_cycles, /* default */0,
+              /* print */TRUE, /* format */NULL);
+
   opt_reg_int(odb, "-fastfwd", "number of insts skipped before timing starts",
               &fastfwd_count, /* default */0,
               /* print */TRUE, /* format */NULL);
@@ -927,7 +974,7 @@
   opt_reg_string(odb, "-cache:dl1",
                  "l1 data cache config, i.e., {<config>|none}",
-                 &cache_dl1_opt, "dl1:128:32:4:l",
+                 &cache_dl1_opt, "dl1:8:64:2:l",
                  /* print */TRUE, NULL);
   opt_reg_note(odb,
@@ -1302,9 +1349,15 @@
                  name, &nsets, &bsize, &assoc, &c) != 5)
         fatal("bad l2 D-cache parms: "
               "<name>:<nsets>:<bsize>:<assoc>:<repl>");
-      cache_dl2 = cache_create(name, nsets, bsize, /* balloc */FALSE,
+      l2_semaphore_id = get_semaphore_set(SEMAPHORE_L2_KEY, 1);
+
+      if (my_cpuid != 0) {
+        my_lock(l2_semaphore_id, 0);
+      }
+      cache_dl2 = cache_create_shared(name, nsets, bsize, /* balloc */FALSE,
                                /* usize */0, assoc, cache_char2policy(c),
-                               dl2_access_fn, /* hit lat */cache_dl2_lat);
+                               dl2_access_fn, /* hit lat */cache_dl2_lat, /* process id */my_cpuid);
+      my_unlock(l2_semaphore_id, 0);
     }
   }
@@ -1320,9 +1373,10 @@
     }
   else if (!mystricmp(cache_il1_opt, "dl1"))
     {
-      if (!cache_dl1)
+      /* if (!cache_dl1)
         fatal("I-cache l1 cannot access D-cache l1 as it's undefined");
-      cache_il1 = cache_dl1;
+      cache_il1 = cache_dl1; */
+      fatal("Haakon states that I-cache l1 cannot access D-cache due to varying D cache size etc.");
       /* the level 2 I-cache cannot be defined */
       if (strcmp(cache_il2_opt, "none"))
@@ -1569,7 +1623,7 @@
                "sim_slip / sim_num_insn", NULL);
   /* register DRAM stats */
-  if (dram_system->num_channels != 0) {
+  if (num_channels != 0) {
     dram_reg_stats(dram_system, sdb);
   }
@@ -1649,7 +1703,9 @@
   sim_num_refs = 0;
   /* create the memory hierachy */
-  dram_system = create_dram(num_channels, block_size, page_size, control_time, core_time, data_time, dram_trace_interval);
+  if (num_channels > 0) {
+    dram_system = create_dram(num_channels, block_size, page_size, control_time, core_time, data_time, dram_trace_interval, my_cpuid);
+  }
   /* allocate and initialize register file */
   regs_init(&regs);
@@ -4719,6 +4775,22 @@
 void
 sim_main(void)
 {
+  /* Set up semaphore locking for synchronization, One semaphore per cpu */
+  sync_semaphore_id = get_semaphore_set(SEMAPHORE_SYNCH_KEY, total_cpus);
+
+  /* The controller semaphore */
+  controller_semaphore_id = get_semaphore_set(SEMAPHORE_CONTROLLER_KEY, 1);
+
+  /* The report semaphore */
+  report_semaphore_id = get_semaphore_set(SEMAPHORE_REPORT_KEY, 1);
+
+  /* Attatch to shared memory segments */
+  controller_id = get_shmem(SHM_CONTROLLER_KEY, total_cpus * sizeof(int));
+  counter_id = get_shmem(SHM_COUNTER_KEY, total_cpus * sizeof(counter_t));
+
+  controller = (int *) shmem_attatch(controller_id);
+  counter = (counter_t *) shmem_attatch(counter_id);
+
   /* ignore any floating point exceptions, they may occur on mis-speculated
      execution paths */
   signal(SIGFPE, SIG_IGN);
@@ -4824,13 +4896,38 @@
      to eliminate this/next state synchronization and relaxation problems */
   for (;;)
     {
+      /* Connect to controller every 10000 clock ticks */
+
+      if (sim_cycle % RESOLUTION == 0) {
+        /* Report status into shared memory */
+        controller[my_cpuid] = WAITING_FOR_COMMAND;
+        counter[my_cpuid] = cache_dl1->misses;
+
+        /* Signal the controller */
+        my_unlock(controller_semaphore_id, 0);
+
+        /* Wait for signal from controller to continue */
+        my_lock(report_semaphore_id, my_cpuid);
+
+        /* Read command from shared memory */
+        switch (controller[my_cpuid]) {
+        case RUN_COMMAND:
+          break;
+        default:
+          printf("Something bad happened in command transfer!\n");
+          break;
+        }
+      }
+      /* Wait for own semaphore before continuing on next clock cycle */
+      my_lock(sync_semaphore_id, my_cpuid);
+
       /* RUU/LSQ sanity checks */
       if (RUU_num < LSQ_num)
-        panic("RUU_num < LSQ_num");
+        panic("RUU_num < LSQ_num");
       if (((RUU_head + RUU_num) % RUU_size) != RUU_tail)
-        panic("RUU_head/RUU_tail wedged");
+        panic("RUU_head/RUU_tail wedged");
       if (((LSQ_head + LSQ_num) % LSQ_size) != LSQ_tail)
-        panic("LSQ_head/LSQ_tail wedged");
+        panic("LSQ_head/LSQ_tail wedged");
       /* check if pipetracing is still active */
       ptrace_check_active(regs.regs_PC, sim_num_insn, sim_cycle);
@@ -4850,37 +4947,35 @@
       /* ==> inserts operations into ready queue --> register deps resolved */
       ruu_writeback();
-      if (!bugcompat_mode)
-        {
-          /* try to locate memory operations that are ready to execute */
-          /* ==> inserts operations into ready queue --> mem deps resolved */
-          lsq_refresh();
-
-          /* issue operations ready to execute from a previous cycle */
-          /* <== drains ready queue <-- ready operations commence execution */
-          ruu_issue();
-        }
+      if (!bugcompat_mode) {
+        /* try to locate memory operations that are ready to execute */
+        /* ==> inserts operations into ready queue --> mem deps resolved */
+        lsq_refresh();
+
+        /* issue operations ready to execute from a previous cycle */
+        /* <== drains ready queue <-- ready operations commence execution */
+        ruu_issue();
+      }
       /* decode and dispatch new operations */
       /* ==> insert ops w/ no deps or all regs ready --> reg deps resolved */
       ruu_dispatch();
-      if (bugcompat_mode)
-        {
-          /* try to locate memory operations that are ready to execute */
-          /* ==> inserts operations into ready queue --> mem deps resolved */
-          lsq_refresh();
-
-          /* issue operations ready to execute from a previous cycle */
-          /* <== drains ready queue <-- ready operations commence execution */
-          ruu_issue();
-        }
+      if (bugcompat_mode) {
+        /* try to locate memory operations that are ready to execute */
+        /* ==> inserts operations into ready queue --> mem deps resolved */
+        lsq_refresh();
+
+        /* issue operations ready to execute from a previous cycle */
+        /* <== drains ready queue <-- ready operations commence execution */
+        ruu_issue();
+      }
       /* call instruction fetch unit if it is not blocked */
       if (!ruu_fetch_issue_delay)
-        ruu_fetch();
+        ruu_fetch();
       else
-        ruu_fetch_issue_delay--;
+        ruu_fetch_issue_delay--;
       /* update buffer occupancy stats */
       IFQ_count += fetch_num;
@@ -4889,15 +4984,70 @@
       RUU_fcount += ((RUU_num == RUU_size) ? 1 : 0);
       LSQ_count += LSQ_num;
       LSQ_fcount += ((LSQ_num == LSQ_size) ? 1 : 0);
-
-      /* Dram trace */
-      dram_trace(dram_system, sim_cycle);
+
       /* go to next cycle */
       sim_cycle++;
+      /* Signal chained CPU that it can continue execution */
+      my_unlock(sync_semaphore_id, (my_cpuid + 1) % total_cpus);
+
       /* finish early? */
       if (max_insts && sim_num_insn >= max_insts)
-        return;
+        return;
+      /* maximum number of cycles reached */
+      if (max_cycles && sim_cycle >= max_cycles)
+        return;
     }
 }
+
+/* This function is called after a simulation finishes so that it keeps in
+ * synch with everyone else
+ */
+
+void sim_continue_ticking() {
+  /* This clock cycle is over */
+  sim_cycle++;
+  /* Signal chained CPU that it can continue execution */
+
+  my_unlock(sync_semaphore_id, (my_cpuid + 1) % total_cpus);
+
+  /* Only way out of this loop is to get a command from the controller */
+
+  for (;;) {
+    /* Connect to controller every 10000 clock ticks */
+    if (sim_cycle % RESOLUTION == 0) {
+
+      /* Report status into shared memory */
+      controller[my_cpuid] = SIMULATION_COMPLETED_COMMAND;
+      counter[my_cpuid] = cache_dl1->misses;
+
+      /* Signal the controller */
+      my_unlock(controller_semaphore_id, 0);
+
+      /* Wait for signal from controller to continue */
+      my_lock(report_semaphore_id, my_cpuid);
+
+      /* Read command from shared memory */
+      switch (controller[my_cpuid]) {
+      case RUN_COMMAND:
+        break;
+      case HALT_COMMAND:
+        return;
+        break;
+      default:
+        printf("Something bad happened in command transfer!\n");
+        break;
+      }
+    }
+
+    /* Wait for own semaphore before continuing on next clock cycle */
+    my_lock(sync_semaphore_id, my_cpuid);
+
+    /* go to next cycle */
+    sim_cycle++;
+
+    /* Signal chained CPU that it can continue execution */
+    my_unlock(sync_semaphore_id, (my_cpuid + 1) % total_cpus);
+  }
+}
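The patch above keeps the simulator processes in lock-step by passing a token around a ring of semaphores: each CPU waits on its own sync semaphore, simulates one cycle, and then releases the semaphore of CPU `(my_cpuid + 1) % total_cpus`. The sketch below illustrates that pattern with System V semaphores. The helpers `sem_wait_one`, `sem_post_one`, `ring_create`, `ring_step` and `ring_destroy` are hypothetical names introduced here; they stand in for `my_lock` and `my_unlock`, whose actual definitions live in shared.c and are assumed (not confirmed by this listing) to decrement and increment a single semaphore by one.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Caller-defined argument union for semctl(SETVAL), as Linux requires. */
union sem_arg { int val; struct semid_ds *buf; unsigned short *array; };

/* Assumed equivalent of my_lock: block until semaphore idx > 0, then decrement. */
int sem_wait_one(int semid, int idx)
{
    struct sembuf op = { .sem_num = (unsigned short)idx, .sem_op = -1, .sem_flg = 0 };
    return semop(semid, &op, 1);
}

/* Assumed equivalent of my_unlock: increment semaphore idx by one. */
int sem_post_one(int semid, int idx)
{
    struct sembuf op = { .sem_num = (unsigned short)idx, .sem_op = 1, .sem_flg = 0 };
    return semop(semid, &op, 1);
}

/* Create a private ring of ncpus semaphores, all zero, then hand the
 * token to CPU 0 (as the controller does before entering its loop). */
int ring_create(int ncpus)
{
    int semid = semget(IPC_PRIVATE, ncpus, IPC_CREAT | 0600);
    if (semid == -1)
        return -1;
    union sem_arg arg;
    arg.val = 0;
    for (int i = 0; i < ncpus; i++) {
        if (semctl(semid, i, SETVAL, arg) == -1) {
            semctl(semid, 0, IPC_RMID);
            return -1;
        }
    }
    if (sem_post_one(semid, 0) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    return semid;
}

/* One simulated cycle for `cpu`: take the token, do the work, pass it on. */
int ring_step(int semid, int cpu, int ncpus)
{
    if (sem_wait_one(semid, cpu) == -1)
        return -1;
    /* ... simulate one clock cycle here ... */
    return sem_post_one(semid, (cpu + 1) % ncpus);
}

/* Remove the semaphore set. */
int ring_destroy(int semid)
{
    return semctl(semid, 0, IPC_RMID);
}
```

With this scheme only one simulator process runs at any time, which serialises access to shared state at the cost of parallel speed-up on the host.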
F.3 Controller.c
Listing F.3: Controller.c
/* This program controls the execution of parallel simplescalar
 * in a CMP environment.
 * Control is done through shared memory segments.
 */
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/sem.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include "host.h" // For counter_t definition
#include "shared.h" // For the keys
#define SCHEDULER_HISTORY_LENGHTH 5
/* Global variables */
int *controller; // Pointer to the controller segment
counter_t *counter; // Pointer to the counter segment
int cpu_count; // The number of CPUs to control
int controller_semaphore; // The ID of the semaphore for controlling the invocation
                          // of the controller loop
int sync_semaphore; // The ID of the synchronizing semaphores
int report_semaphore; // Used for locking the CPUs while reporting is in progress.
int l2_semaphore; // The semaphore controlling access to the L2 cache
int controller_id; // The id of the shared memory segment for the controller
int counter_id; // The id of the shared memory segment for the counter
/* This is the main controller loop */
void controller_loop() {
  int command;
  int i;
  int finished = 0; // Set this flag when all simulators have finished.
  int flag; // Used to check if everyone is finished.
  struct sembuf sb = {0, 0, 0}; /* Semaphore control operation */
  int cycle = 0; // Cycle counter
  /* Initialization: Allow cpu #0 to start */
  my_unlock(sync_semaphore, 0);
  while (finished != 1) {
    // Wait until all cpus have flagged the controller semaphore
    sb.sem_num = 0;
    sb.sem_op = 0 - (short) cpu_count;
    if (semop(controller_semaphore, &sb, 1) == -1) {
      perror("Something went wrong while getting the semaphore: ");
      exit(1);
    }
    printf("%d cycles have elapsed.\n", RESOLUTION);
    /* Read the misses from each instance */
    /* for (i = 0; i < cpu_count; i++) {
         printf("Cpu %d has %d misses.\n", i, counter[i]);
       } */
    flag = 1;
    for (i = 0; i < cpu_count; i++) {
      if (controller[i] != SIMULATION_COMPLETED_COMMAND) {
        flag = 0;
      }
    }
    /* If everyone is finished then set finished flag and let everyone halt */
    if (flag == 1) {
      finished = 1;
      for (i = 0; i < cpu_count; i++) {
        controller[i] = HALT_COMMAND;
      }
    } else {
      /* Set run command on all cpus */
      for (i = 0; i < cpu_count; i++) {
        controller[i] = RUN_COMMAND;
      }
    }
    /* Unlock all report locks */
    for (i = 0; i < cpu_count; i++) {
      my_unlock(report_semaphore, i);
    }
    cycle++;
  }
  sleep(5); /* Don't deallocate resources too soon */
}
/* This function frees all allocated shared resources */
void cleanup(void) {
  int shared_id;
  /* Detach from the shared memory segments */
  printf("Detaching segments.\n");
  shmdt(controller);
  shmdt(counter);
  /* Deallocate shared memory segments */
  shmctl(controller_id, IPC_RMID, NULL);
  shmctl(counter_id, IPC_RMID, NULL);
  /* Remove semaphores */
  printf("Detach successful.\nRemoving semaphores.\n");
  destroy_semaphore(controller_semaphore);
  destroy_semaphore(sync_semaphore);
  destroy_semaphore(report_semaphore);
  destroy_semaphore(l2_semaphore);
  /* Try to detach the cache */
  /* This is somewhat dirty */
  shared_id = shmget(SHM_L2_KEY, 0, 0);
  shmctl(shared_id, IPC_RMID, 0);
  shared_id = shmget(SHM_L2_KEY + 1, 0, 0);
  shmctl(shared_id, IPC_RMID, 0);
  shared_id = shmget(SHM_DRAM_KEY, 0, 0);
  shmctl(shared_id, IPC_RMID, 0);
  shared_id = shmget(SHM_DRAM_KEY + 1, 0, 0);
  shmctl(shared_id, IPC_RMID, 0);
  shared_id = shmget(SHM_DRAM_KEY + 2, 0, 0);
  shmctl(shared_id, IPC_RMID, 0);
  shared_id = shmget(SHM_DRAM_KEY + 3, 0, 0);
  shmctl(shared_id, IPC_RMID, 0);
}
/* Interrupt signal handler */
/* Used to clean up mess when quitting */
void catch_int(int sig_num) {
  printf("Ctrl-C caught, cleaning up.\n");
  cleanup();
  printf("Cleanup complete.\n");
  exit(2);
}
int main(int argc, char *argv[]) {
  int i;
  /* Check command line */
  if (argc != 2) {
    printf("Usage:\n./controller <number of cpus>\n");
    exit(1);
  }
  /* Set signal handler to own */
  signal(SIGINT, catch_int);
  /* Parse the number of CPUs; refuse to continue with fewer than one */
  if ((cpu_count = atoi(argv[1])) < 1) {
    printf("You have started too few cpus\n");
    exit(1);
  }
  printf("Starting simulation of %d cpus.\n", cpu_count);
  /* Create the shared memory areas */
  controller_id = create_shmem(SHM_CONTROLLER_KEY, cpu_count * sizeof(int));
  counter_id = create_shmem(SHM_COUNTER_KEY, cpu_count * sizeof(counter_t));
  /* Attach to the shared segments */
195
F.3. CONTROLLER.C APPENDIX F. CMP CODE
c o n t r o l l e r = ( int ∗) shmem attatch ( c o n t r o l l e r i d ) ;
counter = ( counte r t ∗) shmem attatch ( counte r i d ) ;
p r i n t f ( ”Al l shared memory created and attatched .\n”) ;
p r i n t f ( ”Creat ing semaphores .\n”) ;
/∗ c r ea t e a semaphore s e t wi th 1 semaphore f o r c o n t r o l l e r ∗/
cont ro l l e r s emaphore = crea te s emaphore s e t (SEMAPHORECONTROLLERKEY, 1) ;
/∗ I n i t i a l i z e the semaphore to 0 ∗/
semaphore se t va lue ( cont ro l l e r s emaphore , 0 , 0) ;
/∗ Create the synch semaphore , one semaphore per CPU ∗/
sync semaphore = crea te s emaphore s e t (SEMAPHORE SYNCH KEY, cpu count ) ;
/∗ I n i t i a l i z e a l l sync semaphores to 0 (no go ! ) ∗/
for ( i =0; i<cpu count ; i++) {
semaphore se t va lue ( sync semaphore , i , 0) ;
}
/∗ Create the repor t semaphore , one semaphore per CPU ∗/
report semaphore = crea te s emaphore s e t (SEMAPHORE REPORTKEY, cpu count ) ;
/∗ I n i t i a l i z e a l l r epor t semaphores to 0 (no go ! ) ∗/
for ( i =0; i<cpu count ; i++) {
semaphore se t va lue ( report semaphore , i , 0) ;
}
/∗Create the L2 cache l o c k and i n i t i a l i z e i t to 0 ∗/
l2 semaphore = crea te s emaphore s e t (SEMAPHORE L2 KEY, 1) ;
s emaphore se t va lue ( l2 semaphore , 0 , 0) ;
p r i n t f ( ”Semaphores c r ea ted .\ nStar t ing c o n t r o l l e r looop \n”) ;
c o n t r o l l e r l o o p ( ) ;
p r i n t f ( ”Simulat ion done .\n”) ;
c leanup ( ) ;
/∗ Suc c e s s f u l re turn ∗/
return 0 ;
}
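The command-line check in main() relies on atoi(), which cannot tell "0" from non-numeric input and silently accepts trailing characters. A minimal sketch of a stricter check is shown below. The helper name parse_cpu_count is hypothetical and is not part of the thesis code; it is only one possible way to validate the argument.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not in the original controller.c): parse the
 * <number of cpus> argument strictly.
 * Returns the CPU count (>= 1), or -1 if the text is not a valid count. */
int parse_cpu_count(const char *text) {
  char *end = NULL;
  long value;

  if (text == NULL || *text == '\0')
    return -1;

  errno = 0;
  value = strtol(text, &end, 10);

  /* Reject overflow, trailing garbage and counts below one. */
  if (errno != 0 || *end != '\0' || value < 1 || value > INT_MAX)
    return -1;

  return (int) value;
}
```

With such a helper, main() could print the usage text and exit whenever the result is -1, before any shared memory or semaphores are created.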
F.4 Shared.c
Listing F.4: Shared.c
/* This file is a collection of abstractions for shared memory control
 * It was made to make it easier to use semaphores in the rest of the program
 * and make things more readable
 */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include "shared.h"
/* This function simply locks a semaphore */
void my_lock(int semaphore_id, int semaphore_number) {
  struct sembuf sb = {0, 0, 0};
  sb.sem_num = semaphore_number;
  sb.sem_op = -1;
  if (semop(semaphore_id, &sb, 1) == -1) {
    perror("Something went wrong while locking a semaphore: ");
    exit(1);
  }
}
/* This function unlocks a semaphore */
void my_unlock(int semaphore_id, int semaphore_number) {
  struct sembuf sb = {0, 0, 0};
  sb.sem_num = semaphore_number;
  sb.sem_op = 1;
  if (semop(semaphore_id, &sb, 1) == -1) {
    perror("Something went wrong while unlocking a semaphore: ");
    exit(1);
  }
}
/* This function destroys a semaphore set */
void destroy_semaphore(int semaphore_id) {
  union semun {
    int val;
    struct semid_ds *buf;
    ushort *array;
  } arg;
  if (semctl(semaphore_id, 0, IPC_RMID, arg) == -1) {
    perror("Could not remove semaphore: ");
    exit(1);
  }
}
/* This function creates a new semaphore set and returns the id */
int create_semaphore_set(key_t semaphore_key, int set_size) {
  int semaphore_id;
  if ((semaphore_id = semget(semaphore_key, set_size, DEFAULT_PERMISSIONS |
                             IPC_CREAT)) == -1) {
    perror("Could not create semaphore: ");
    exit(1);
  }
  return semaphore_id;
}
/* This function gets a semaphore set based on the key */
int get_semaphore_set(key_t semaphore_key, int set_size) {
  int semaphore_id;
  if ((semaphore_id = semget(semaphore_key, set_size, DEFAULT_PERMISSIONS))
      == -1) {
    perror("Could not get semaphore, are they created? ");
    exit(1);
  }
  return semaphore_id;
}
/* This function sets the value of a semaphore (useful for initialization) */
void semaphore_set_value(int semaphore_id, int semaphore_number, int value)
{
  union semun {
    int val;
    struct semid_ds *buf;
    ushort *array;
  } arg;
  arg.val = value;
  if (semctl(semaphore_id, semaphore_number, SETVAL, arg) == -1) {
    perror("Could not set value of semaphore: ");
    exit(1);
  }
}
/* Shared memory functions */
/* Create a shared memory segment based on a key with a set size */
int create_shmem(key_t key, int size) {
  int shmem_id;
  if ((shmem_id = shmget(key, size, IPC_CREAT | IPC_EXCL |
                         DEFAULT_PERMISSIONS)) < 0) {
    printf("Faulting key is %d with size = %d\n", (int) key, size);
    perror("Could not allocate shared memory: ");
    exit(1);
  }
  return shmem_id;
}
/* Get a shared memory segment based on a key with a set size */
int get_shmem(key_t key, int size) {
  int shmem_id;
  if ((shmem_id = shmget(key, size, DEFAULT_PERMISSIONS)) < 0) {
    perror("Could not get shared memory: ");
    exit(1);
  }
  return shmem_id;
}
/* This function returns a pointer to the shared memory segment */
void *shmem_attatch(int shmem_id) {
  void *pointer;
  /* shmat() reports failure as (void *) -1, not as NULL */
  if ((pointer = shmat(shmem_id, NULL, 0)) == (void *) -1) {
    perror("Could not attach to shared memory: ");
    exit(1);
  }
  return pointer;
}
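The lock and unlock wrappers above are thin layers over semop() with sem_op set to -1 and +1. The sketch below walks through one lock/unlock cycle on a private, throwaway semaphore so that the counter values can be observed. The function name demo_lock_cycle and the use of IPC_PRIVATE are assumptions made for illustration; the simulator itself uses the fixed keys from Shared.h.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical self-check (not in the original Shared.c): run one
 * lock/unlock cycle on a private semaphore.
 * Returns 0 if the counter goes 1 -> 0 -> 1 as expected, -1 otherwise. */
int demo_lock_cycle(void) {
  /* Declared locally, in the same way as in Shared.c. */
  union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
  } arg;
  struct sembuf sb = {0, 0, 0};
  int id, locked, unlocked;

  /* IPC_PRIVATE gives a fresh set that cannot clash with simulator keys. */
  id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
  if (id == -1)
    return -1;

  /* Start in the "unlocked" state (value 1). */
  arg.val = 1;
  if (semctl(id, 0, SETVAL, arg) == -1)
    goto fail;

  /* Lock: decrement by one, as my_lock() does. */
  sb.sem_op = -1;
  if (semop(id, &sb, 1) == -1)
    goto fail;
  locked = semctl(id, 0, GETVAL);

  /* Unlock: increment by one, as my_unlock() does. */
  sb.sem_op = 1;
  if (semop(id, &sb, 1) == -1)
    goto fail;
  unlocked = semctl(id, 0, GETVAL);

  semctl(id, 0, IPC_RMID);
  return (locked == 0 && unlocked == 1) ? 0 : -1;

fail:
  semctl(id, 0, IPC_RMID);
  return -1;
}
```

Note that the controller initializes its semaphores to 0, so a my_lock() call on them blocks until some other process calls my_unlock(); the sketch starts at 1 only to avoid blocking in a single process.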
F.5 Shared.h
Listing F.5: Shared.h
/* This header file contains the keys for the shared memory segments as
 * well as the keys for the semaphores used
 */
#define RESOLUTION 1000000
#define DEFAULT_PERMISSIONS 0644
#define SHM_CONTROLLER_KEY 13380
#define SHM_COUNTER_KEY 13390
#define SHM_DRAM_KEY 135
#define SHM_L2_KEY 20000
#define SEMAPHORE_CONTROLLER_KEY 9987
#define SEMAPHORE_SYNCH_KEY 8901
#define SEMAPHORE_REPORT_KEY 7891
#define SEMAPHORE_L2_KEY 197
/* Commands issued through shared memory */
#define WAITING_FOR_COMMAND 0
#define RUN_COMMAND 1
#define SIMULATION_COMPLETED_COMMAND 2
#define HALT_COMMAND 3
/* Some useful functions for locking */
/* This function simply locks a semaphore */
void my_lock(int semaphore_id, int semaphore_number);
/* This function unlocks a semaphore */
void my_unlock(int semaphore_id, int semaphore_number);
/* This function destroys a semaphore */
void destroy_semaphore(int semaphore_id);
/* This function creates a new semaphore set and returns the id */
int create_semaphore_set(key_t semaphore_key, int set_size);
/* This function gets a semaphore set based on the key */
int get_semaphore_set(key_t semaphore_key, int set_size);
/* This function sets the value of a semaphore (useful for initialization) */
void semaphore_set_value(int semaphore_id, int semaphore_number, int value);
/* Shared memory operations */
/* Create a shared memory segment based on a key with a set size */
int create_shmem(key_t key, int size);
/* Get a shared memory segment based on a key with a set size */
int get_shmem(key_t key, int size);
/* This function returns a pointer to the shared memory segment */
void *shmem_attatch(int shmem_id);
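The shared L2 cache in Cache.c draws its segment keys from SHM_L2_KEY upwards: one key for the cache structure, one for the data blocks, then one further key per set when a hash table is used and one per block when user data is requested. The sketch below counts how many consecutive keys a given configuration would consume under that reading of the diff. The function names are hypothetical and the parameter hsize is passed in directly rather than derived from the associativity.

```c
/* Value copied from Shared.h. */
#define SHM_L2_KEY 20000

/* Hypothetical helper: number of consecutive shared memory keys that the
 * shared cache would use, assuming one key for the cache structure, one
 * for the data blocks, one per set if hsize != 0 and one per block if
 * usize != 0. */
long shared_cache_keys_used(int nsets, int assoc, int hsize, int usize) {
  long keys = 2; /* cache structure + data blocks */

  if (hsize != 0)
    keys += nsets; /* one hash table segment per set */
  if (usize != 0)
    keys += (long) nsets * assoc; /* one user data segment per block */

  return keys;
}

/* Hypothetical helper: highest key in use for the given configuration. */
long shared_cache_last_key(int nsets, int assoc, int hsize, int usize) {
  return SHM_L2_KEY + shared_cache_keys_used(nsets, assoc, hsize, usize) - 1;
}
```

Under these assumptions a cache without hash tables and without user data uses exactly SHM_L2_KEY and SHM_L2_KEY+1, which are the two L2 segments that cleanup() in Controller.c removes. Any other configuration would leave segments behind unless the cleanup were extended.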
F.6 Cache.c
Listing F.6: Cache.c - Unified diff against SimpleScalar 3.0d
--- ../simplesim-3.0-orig/cache.c 2003-10-08 17:50:34.000000000 +0200
+++ ../../../felles-svn/project/branches/grannas_prefetch/cache.c 2006-05-20 13:07:15.000000000 +0200
@@ -48,15 +48,22 @@
  * Copyright (C) 1994-2003 by Todd M. Austin, Ph.D. and SimpleScalar, LLC.
  */
-
 #include <stdio.h>
 #include <stdlib.h>
 #include <assert.h>
+#include <sys/types.h>
+#include <sys/ipc.h>
+#include <sys/shm.h>
+
 #include "host.h"
 #include "misc.h"
 #include "machine.h"
 #include "cache.h"
+#include "shared.h"
+
+extern unsigned int total_cpus;
+extern unsigned int my_cpuid;
 /* cache access macros */
 #define CACHE_TAG(cp, addr) ((addr) >> (cp)->tag_shift)
@@ -136,6 +143,40 @@
   }\
 }
+/* The following variables are used as local variables in a shared
+ * cache_t structure
+ */
+
+char *cp_name; /* cache name */
+  /* miss/replacement handler, read/write BSIZE bytes starting at BADDR
+     from/into cache block BLK, returns the latency of the operation
+     if initiated at NOW, returned latencies indicate how long it takes
+     for the cache access to continue (e.g., fill a write buffer), the
+     miss/repl functions are required to track how this operation will
+     effect the latency of later operations (e.g., write buffer fills),
+     if !BALLOC, then just return the latency; BLK_ACCESS_FN is also
+     responsible for generating any user data and incorporating the latency
+     of that operation */
+  unsigned int /* latency of block access */
+    (*cp_blk_access_fn)(enum mem_cmd cmd, /* block access command */
+                        md_addr_t baddr, /* program address to access */
+                        int bsize, /* size of the cache block */
+                        struct cache_blk_t *blk, /* ptr to cache block struct */
+                        tick_t now); /* when fetch was initiated */
+
+
+
+  counter_t cp_hits; /* total number of hits */
+  counter_t cp_misses; /* total number of misses */
+  counter_t cp_replacements; /* total number of replacements at misses */
+  counter_t cp_writebacks; /* total number of writebacks at misses */
+  counter_t cp_invalidations; /* total number of external invalidations */
+  counter_t cp_prefetches; /* total number of prefetches */
+  counter_t cp_prefetches_ok; /* total number of prefetches that worked */
+
+
+
+
 /* bound sqword_t/dfloat_t to positive int */
 #define BOUND_POS(N) ((int)(MIN(MAX(0, (N)), 2147483647)))
@@ -258,7 +299,7 @@
 /* create and initialize a general cache structure */
 struct cache_t * /* pointer to cache created */
-cache_create(char *name, /* name of the cache */
+cache_create_shared(char *name, /* name of the cache */
              int nsets, /* total number of sets in cache */
              int bsize, /* block (line) size of cache */
              int balloc, /* allocate data space for blocks? */
@@ -270,11 +311,14 @@
                                            md_addr_t baddr, int bsize,
                                            struct cache_blk_t *blk,
                                            tick_t now),
-             unsigned int hit_latency) /* latency in cycles for a hit */
+             unsigned int hit_latency /* latency in cycles for a hit */,
+             int processID)
 {
+  key_t key = SHM_L2_KEY;
   struct cache_t *cp;
   struct cache_blk_t *blk;
   int i, j, bindex;
+  int shmid;
   /* check all cache parameters */
   if (nsets <= 0)
@@ -296,13 +340,30 @@
     fatal("must specify miss/replacement functions");
   /* allocate the cache structure */
-  cp = (struct cache_t *)
-    calloc(1, sizeof(struct cache_t) + (nsets-1)*sizeof(struct cache_set_t));
-  if (!cp)
-    fatal("out of virtual memory");
-
+  switch(processID) {
+  case -1:
+    cp = (struct cache_t *)
+      calloc(1, sizeof(struct cache_t) + (nsets-1)*sizeof(struct cache_set_t));
+    if (!cp)
+      fatal("out of virtual memory:");
+    break;
+  case 0:
+    shmid = create_shmem(key, sizeof(struct cache_t) + (nsets-1)*sizeof(struct cache_set_t));
+    cp = shmem_attatch(shmid);
+    memset(cp, 0, sizeof(struct cache_t) + (nsets-1)*sizeof(struct cache_set_t));
+    break;
+  default:
+    shmid = get_shmem(key, sizeof(struct cache_t) + (nsets-1)*sizeof(struct cache_set_t));
+    cp = (struct cache_t *) shmem_attatch(shmid);
+    break;
+  }
+
   /* initialize user parameters */
-  cp->name = mystrdup(name);
+  cp->name = NULL;
+  if(processID==-1)
+    cp->name = mystrdup(name);
+  else
+    cp_name = mystrdup(name);
   cp->nsets = nsets;
   cp->bsize = bsize;
   cp->balloc = balloc;
@@ -312,7 +373,11 @@
   cp->hit_latency = hit_latency;
   /* miss/replacement functions */
-  cp->blk_access_fn = blk_access_fn;
+  cp->blk_access_fn = NULL;
+  if(processID==-1)
+    cp->blk_access_fn = blk_access_fn;
+  else
+    cp_blk_access_fn = blk_access_fn;
   /* compute derived parameters */
   cp->hsize = CACHE_HIGHLY_ASSOC(cp) ? (assoc >> 2) : 0;
@@ -325,12 +390,13 @@
   cp->bus_free = 0;
   /* print derived parameters during debug */
-  debug("%s: cp->hsize = %d", cp->name, cp->hsize);
-  debug("%s: cp->blk_mask = 0x%08x", cp->name, cp->blk_mask);
-  debug("%s: cp->set_shift = %d", cp->name, cp->set_shift);
-  debug("%s: cp->set_mask = 0x%08x", cp->name, cp->set_mask);
-  debug("%s: cp->tag_shift = %d", cp->name, cp->tag_shift);
-  debug("%s: cp->tag_mask = 0x%08x", cp->name, cp->tag_mask);
+  // debug("%s: cp->hsize = %d", cp_name, cp->hsize);
+  // debug("%s: cp->blk_mask = 0x%08x", cp_name, cp->blk_mask);
+  // debug("%s: cp->set_shift = %d", cp_name, cp->set_shift);
+  // debug("%s: cp->set_mask = 0x%08x", cp_name, cp->set_mask);
+  // debug("%s: cp->tag_shift = %d", cp_name, cp->tag_shift);
+  // debug("%s: cp->tag_mask = 0x%08x", cp_name, cp->tag_mask);
+
   /* initialize cache stats */
   cp->hits = 0;
@@ -338,32 +404,83 @@
   cp->replacements = 0;
   cp->writebacks = 0;
   cp->invalidations = 0;
+  cp->prefetches = 0;
+  cp->prefetches_ok = 0;
+
+  if(processID!=-1) {
+    cp->hits = -1; //Flag that these counters are not used
+    cp_hits = 0;
+    cp_misses = 0;
+    cp_replacements = 0;
+    cp_writebacks = 0;
+    cp_invalidations = 0;
+    cp_prefetches = 0;
+    cp_prefetches_ok = 0;
+  }
   /* blow away the last block accessed */
   cp->last_tagset = 0;
   cp->last_blk = NULL;
-  /* allocate data blocks */
-  cp->data = (byte_t *)calloc(nsets * assoc,
-                              sizeof(struct cache_blk_t) +
-                              (cp->balloc ? (bsize*sizeof(byte_t)) : 0));
-  if (!cp->data)
-    fatal("out of virtual memory");
+  key++;
+
+  switch(processID) {
+  case -1:
+    /* allocate data blocks */
+    cp->data = (byte_t *)calloc(nsets * assoc,
+                                sizeof(struct cache_blk_t) +
+                                (cp->balloc ? (bsize*sizeof(byte_t)) : 0));
+    if (!cp->data)
+      fatal("out of virtual memory: b");
+    break;
+  case 0:
+    shmid = create_shmem(key, nsets * assoc * (
+                         sizeof(struct cache_blk_t) +
+                         (cp->balloc ? (bsize*sizeof(byte_t)) : 0)));
+
+    cp->data = (byte_t *) shmem_attatch(shmid);
+    memset(cp->data, 0, sizeof(struct cache_blk_t) +
+           (cp->balloc ? (bsize*sizeof(byte_t)) : 0));
+
+
+    break;
+  default:
+    shmid = get_shmem(key, nsets * assoc * (sizeof(struct cache_blk_t) +
+                      (cp->balloc ? (bsize*sizeof(byte_t)) : 0)));
+    cp->data = (byte_t *) shmem_attatch(shmid);
+    break;
+  }
+
+
   /* slice up the data blocks */
   for (bindex=0,i=0; i<nsets; i++)
     {
       cp->sets[i].way_head = NULL;
       cp->sets[i].way_tail = NULL;
-      /* get a hash table, if needed */
-      if (cp->hsize)
-        {
-          cp->sets[i].hash =
-            (struct cache_blk_t **)calloc(cp->hsize,
-                                          sizeof(struct cache_blk_t *));
+      /* get a hash table, if needed */
+      if (cp->hsize) {
+        switch(processID) {
+        case -1:
+          cp->sets[i].hash = (struct cache_blk_t **)calloc(cp->hsize,
+                                          sizeof(struct cache_blk_t *));
+          break;
+        case 0:
+          key++;
+          shmid = create_shmem(key, cp->hsize * (sizeof(struct cache_blk_t *)));
+          cp->sets[i].hash = (struct cache_blk_t **) shmem_attatch(shmid);
+          memset(cp->sets[i].hash, 0, cp->hsize * (sizeof(struct cache_blk_t *)));
+          break;
+        default:
+          key++;
+          shmid = get_shmem(key, cp->hsize * (sizeof(struct cache_blk_t *)));
+          cp->sets[i].hash = (struct cache_blk_t **) shmem_attatch(shmid);
+          break;
+        }
           if (!cp->sets[i].hash)
-            fatal("out of virtual memory");
+            fatal("out of virtual memory: c");
         }
+
       /* NOTE: all the blocks in a set *must* be allocated contiguously,
          otherwise, block accesses through SET->BLKS will fail (used
          during random replacement selection) */
@@ -376,13 +493,34 @@
           /* locate next cache block */
           blk = CACHE_BINDEX(cp, cp->data, bindex);
           bindex++;
-
+
           /* invalidate new cache block */
           blk->status = 0;
           blk->tag = 0;
           blk->ready = 0;
-          blk->user_data = (usize != 0
-                            ? (byte_t *)calloc(usize, sizeof(byte_t)) : NULL);
+          blk->prefetched = 0;
+          blk->procNo = -1;
+
+          if(usize==0)
+            blk->user_data = NULL;
+          else {
+            switch(processID) {
+            case -1:
+              blk->user_data = (byte_t *)calloc(usize, sizeof(byte_t));
+              break;
+            case 0:
+              key++;
+              shmid = create_shmem(key, usize * sizeof(byte_t));
+              blk->user_data = (byte_t *) shmem_attatch(shmid);
+              memset(blk->user_data, 0, usize * sizeof(byte_t));
+              break;
+            default:
+              key++;
+              shmid = get_shmem(key, usize * sizeof(byte_t));
+              blk->user_data = (byte_t *) shmem_attatch(shmid);
+              break;
+            }
+          }
           /* insert cache block into set hash table */
           if (cp->hsize)
@@ -396,8 +534,8 @@
           cp->sets[i].way_head = blk;
           if (!cp->sets[i].way_tail)
             cp->sets[i].way_tail = blk;
-        }
-    }
+      }
+  }
   return cp;
 }
@@ -413,17 +551,50 @@
     }
 }
+
+
+/* create and initialize a general cache structure */
+struct cache_t * /* pointer to cache created */
+cache_create(char *name, /* name of the cache */
+             int nsets, /* total number of sets in cache */
+             int bsize, /* block (line) size of cache */
+             int balloc, /* allocate data space for blocks? */
+             int usize, /* size of user data to alloc w/blks */
+             int assoc, /* associativity of cache */
+             enum cache_policy policy, /* replacement policy w/in sets */
+             /* block access function, see description w/in struct cache def */
+             unsigned int (*blk_access_fn)(enum mem_cmd cmd,
+                                           md_addr_t baddr, int bsize,
+                                           struct cache_blk_t *blk,
+                                           tick_t now),
+             unsigned int hit_latency) /* latency in cycles for a hit */
+{ return cache_create_shared(name, /* name of the cache */
+             nsets, /* total number of sets in cache */
+             bsize, /* block (line) size of cache */
+             balloc, /* allocate data space for blocks? */
+             usize, /* size of user data to alloc w/blks */
+             assoc, /* associativity of cache */
+             policy, /* replacement policy w/in sets */
+             /* block access function, see description w/in struct cache def */
+             blk_access_fn,
+             hit_latency,
+             -1); /* latency in cycles for a hit */
+}
+
+
 /* print cache configuration */
 void
 cache_config(struct cache_t *cp, /* cache instance */
              FILE *stream) /* output stream */
 {
+  char * threadSafeName = cp->name;
+  if(cp->name==NULL) threadSafeName = cp_name;
   fprintf(stream,
           "cache: %s: %d sets, %d byte blocks, %d bytes user data/block\n",
-          cp->name, cp->nsets, cp->bsize, cp->usize);
+          threadSafeName, cp->nsets, cp->bsize, cp->usize);
   fprintf(stream,
           "cache: %s: %d-way, `%s' replacement policy, write-back\n",
-          cp->name, cp->assoc,
+          threadSafeName, cp->assoc,
           cp->policy == LRU ? "LRU"
           : cp->policy == Random ? "Random"
           : cp->policy == FIFO ? "FIFO"
@@ -436,13 +607,54 @@
               struct stat_sdb_t *sdb) /* stats database */
 {
   char buf[512], buf1[512], *name;
+  char * threadSafeName = cp->name;
+  if(cp->name==NULL) threadSafeName = cp_name;
   /* get a name for this cache */
-  if (!cp->name || !cp->name[0])
+  if (!threadSafeName || !threadSafeName[0])
     name = "<unknown>";
   else
-    name = cp->name;
+    name = threadSafeName;
+  if(cp->hits==-1) {
+  sprintf(buf, "%s.accesses", name);
+  sprintf(buf1, "%s.hits + %s.misses", name, name);
+  stat_reg_formula(sdb, buf, "total number of accesses", buf1, "%12.0f");
+  sprintf(buf, "%s.hits", name);
+  stat_reg_counter(sdb, buf, "total number of hits", &cp_hits, 0, NULL);
+  sprintf(buf, "%s.misses", name);
+  stat_reg_counter(sdb, buf, "total number of misses", &cp_misses, 0, NULL);
+  sprintf(buf, "%s.replacements", name);
+  stat_reg_counter(sdb, buf, "total number of replacements",
+                   &cp_replacements, 0, NULL);
+  sprintf(buf, "%s.writebacks", name);
+  stat_reg_counter(sdb, buf, "total number of writebacks",
+                   &cp_writebacks, 0, NULL);
+  sprintf(buf, "%s.invalidations", name);
+  stat_reg_counter(sdb, buf, "total number of invalidations",
+                   &cp_invalidations, 0, NULL);
+  sprintf(buf, "%s.miss_rate", name);
+  sprintf(buf1, "%s.misses / %s.accesses", name, name);
+  stat_reg_formula(sdb, buf, "miss rate (i.e., misses/ref)", buf1, NULL);
+  sprintf(buf, "%s.repl_rate", name);
+  sprintf(buf1, "%s.replacements / %s.accesses", name, name);
+  stat_reg_formula(sdb, buf, "replacement rate (i.e., repls/ref)", buf1, NULL);
+  sprintf(buf, "%s.wb_rate", name);
+  sprintf(buf1, "%s.writebacks / %s.accesses", name, name);
+  stat_reg_formula(sdb, buf, "writeback rate (i.e., wrbks/ref)", buf1, NULL);
+  sprintf(buf, "%s.inv_rate", name);
+  sprintf(buf1, "%s.invalidations / %s.accesses", name, name);
+  stat_reg_formula(sdb, buf, "invalidation rate (i.e., invs/ref)", buf1, NULL);
+  /* Prefetching statistics */
+  sprintf(buf, "%s.prefetches", name);
+  stat_reg_counter(sdb, buf, "total number of prefetches", &cp_prefetches,
+                   0, NULL);
+  sprintf(buf, "%s.prefetches_ok", name);
+  stat_reg_counter(sdb, buf, "total number of successful prefetches",
+                   &cp_prefetches_ok, 0, NULL);
+  }
+  else
+{
   sprintf(buf, "%s.accesses", name);
   sprintf(buf1, "%s.hits + %s.misses", name, name);
   stat_reg_formula(sdb, buf, "total number of accesses", buf1, "%12.0f");
@@ -471,6 +683,14 @@
   sprintf(buf, "%s.inv_rate", name);
   sprintf(buf1, "%s.invalidations / %s.accesses", name, name);
   stat_reg_formula(sdb, buf, "invalidation rate (i.e., invs/ref)", buf1, NULL);
+  /* Prefetching statistics */
+  sprintf(buf, "%s.prefetches", name);
+  stat_reg_counter(sdb, buf, "total number of prefetches", &cp->prefetches,
+                   0, NULL);
+  sprintf(buf, "%s.prefetches_ok", name);
+  stat_reg_counter(sdb, buf, "total number of successful prefetches",
+                   &cp->prefetches_ok, 0, NULL);
+  }
 }
 /* print cache stats */
@@ -479,16 +699,35 @@
             FILE *stream) /* output stream */
 {
   double sum = (double)(cp->hits + cp->misses);
+  if(cp->hits==-1)
+    sum = (double)(cp_hits + cp_misses);
+  char * threadSafeName = cp->name;
+  if(cp->name==NULL) threadSafeName = cp_name;
+  if(cp->hits==-1) {
   fprintf(stream,
           "cache: %s: %.0f hits %.0f misses %.0f repls %.0f invalidations\n",
-          cp->name, (double)cp->hits, (double)cp->misses,
+          threadSafeName, (double)cp_hits, (double)cp_misses,
+          (double)cp_replacements, (double)cp_invalidations);
+  fprintf(stream,
+          "cache: %s: miss rate=%f repl rate=%f invalidation rate=%f\n",
+          threadSafeName,
+          (double)cp_misses/sum, (double)(double)cp_replacements/sum,
+          (double)cp_invalidations/sum);
+  }
+  else
+  {
+  fprintf(stream,
+          "cache: %s: %.0f hits %.0f misses %.0f repls %.0f invalidations\n",
+          threadSafeName, (double)cp->hits, (double)cp->misses,
           (double)cp->replacements, (double)cp->invalidations);
   fprintf(stream,
           "cache: %s: miss rate=%f repl rate=%f invalidation rate=%f\n",
-          cp->name,
+          threadSafeName,
           (double)cp->misses/sum, (double)(double)cp->replacements/sum,
           (double)cp->invalidations/sum);
+  }
+
 }
 /* access a cache, perform a CMD operation on cache CP at address ADDR,
@@ -513,6 +752,15 @@
   struct cache_blk_t *blk, *repl;
   int lat = 0;
+  /* Increment prefetch counter */
+  if(cmd == Prefetch) {
+    if(cp->hits == -1) {
+      cp_prefetches++;
+    } else {
+      cp->prefetches++;
+    }
+  }
+
   /* default replacement address */
   if (repl_addr)
     *repl_addr = 0;
@@ -526,7 +774,6 @@
      ((addr + (nbytes - 1)) > ((addr & ~cp->blk_mask) + (cp->bsize - 1))) */
   if ((addr + nbytes) > ((addr & ~cp->blk_mask) + cp->bsize))
     fatal("cache: access error: access spans block, addr 0x%08x", addr);
-
   /* permissions are checked on cache misses */
   /* check for a fast hit: access to same block */
@@ -546,8 +793,9 @@
            blk;
            blk=blk->hash_next)
         {
-          if (blk->tag == tag && (blk->status & CACHE_BLK_VALID))
+          if (blk->tag == tag && (blk->status & CACHE_BLK_VALID) && (blk->procNo==my_cpuid)) {
             goto cache_hit;
+          }
         }
     }
   else
@@ -557,15 +805,17 @@
            blk;
            blk=blk->way_next)
         {
-          if (blk->tag == tag && (blk->status & CACHE_BLK_VALID))
+          if (blk->tag == tag && (blk->status & CACHE_BLK_VALID) && (blk->procNo==my_cpuid))
             goto cache_hit;
         }
     }
   /* cache block not found */
-
   /* **MISS** */
-  cp->misses++;
+  if(cp->hits==-1)
+    cp_misses++;
+  else
+    cp->misses++;
   /* select the appropriate block to replace, and re-link this entry to
      the appropriate place in the way list */
@@ -596,7 +846,10 @@
   /* write back replaced block data */
   if (repl->status & CACHE_BLK_VALID)
     {
-      cp->replacements++;
+      if(cp->hits==-1)
+        cp_replacements++;
+      else
+        cp->replacements++;
       if (repl_addr)
         *repl_addr = CACHE_MK_BADDR(cp, repl->tag, set);
@@ -613,20 +866,34 @@
       if (repl->status & CACHE_BLK_DIRTY)
         {
           /* write back the cache block */
-          cp->writebacks++;
-          lat += cp->blk_access_fn(Write,
+          if(cp->hits==-1)
+            cp_writebacks++;
+          else
+            cp->writebacks++;
+          if(cp->blk_access_fn==NULL)
+            lat += cp_blk_access_fn(Write,
                                    CACHE_MK_BADDR(cp, repl->tag, set),
                                    cp->bsize, repl, now+lat);
+          else
+            lat += cp->blk_access_fn(Write,
+                                     CACHE_MK_BADDR(cp, repl->tag, set),
+                                     cp->bsize, repl, now+lat);
+
         }
     }
   /* update block tags */
+  repl->procNo = my_cpuid;
   repl->tag = tag;
   repl->status = CACHE_BLK_VALID; /* dirty bit set on update */
   /* read data block */
-  lat += cp->blk_access_fn(Read, CACHE_BADDR(cp, addr), cp->bsize,
-                           repl, now+lat);
+  if(cp->blk_access_fn==NULL)
+    lat += cp_blk_access_fn(Read, CACHE_BADDR(cp, addr), cp->bsize,
+                            repl, now+lat);
+  else
+    lat += cp->blk_access_fn(Read, CACHE_BADDR(cp, addr), cp->bsize,
+                             repl, now+lat);
   /* copy data out of cache block */
   if (cp->balloc)
@@ -638,6 +905,12 @@
   if (cmd == Write)
     repl->status |= CACHE_BLK_DIRTY;
+  /* Update prefetch status */
+  if(cmd == Prefetch)
+    repl->prefetched = 1;
+  else
+    repl->prefetched = 0;
+
   /* get user block data, if requested and it exists */
   if (udata)
     *udata = repl->user_data;
@@ -656,7 +929,21 @@
  cache_hit: /* slow hit handler */
   /* **HIT** */
-  cp->hits++;
+  if(cp->hits==-1)
+    cp_hits++;
+  else
+    cp->hits++;
+
+
+  if(blk->prefetched == 1) {
+    if(cp->hits == -1) {
+      cp_prefetches_ok++;
+    } else {
+      cp->prefetches_ok++;
+    }
+    blk->prefetched = 0;
+  }
+
   /* copy data out of cache block, if block exists */
   if (cp->balloc)
@@ -691,8 +978,21 @@
  cache_fast_hit: /* fast hit handler */
   /* **FAST HIT** */
-  cp->hits++;
+  if(cp->hits==-1)
+    cp_hits++;
+  else
+    cp->hits++;
+
+  if(blk->prefetched == 1) {
+    if(cp->hits==-1) {
+      cp_prefetches_ok++;
+    } else {
+      cp->prefetches_ok++;
+    }
+    blk->prefetched = 0;
+  }
+
   /* copy data out of cache block, if block exists */
   if (cp->balloc)
     {
@@ −741 ,7 +1041 ,7 @@
blk ;
blk=blk−>hash next )
{
− i f ( blk−>tag == tag && ( blk−>s t a tu s & CACHE BLK VALID) )
+ i f ( blk−>tag == tag && ( blk−>s t a tu s & CACHE BLK VALID) && ( blk−>
procNo==my cpuid ) )
return TRUE;
}
}
@@ −752 ,7 +1052 ,7 @@
blk ;
blk=blk−>way next )
{
− i f ( blk−>tag == tag && ( blk−>s t a tu s & CACHE BLK VALID) )
+ i f ( blk−>tag == tag && ( blk−>s t a tu s & CACHE BLK VALID) && ( blk−>
procNo==my cpuid ) )
return TRUE;
}
}
@@ −780 ,16 +1080 ,29 @@
{
i f ( blk−>s t a tu s & CACHE BLK VALID)
{
− cp−>i n v a l i d a t i o n s++;
+ i f ( cp−>h i t s==−1)
+ cp i n v a l i d a t i o n s++;
+ else
+ cp−>i n v a l i d a t i o n s++;
+
blk−>s t a tu s &= ˜CACHE BLK VALID;
i f ( blk−>s t a tu s & CACHE BLK DIRTY)
{
/∗ wr i t e back the i n v a l i d a t e d b l o c k ∗/
− cp−>wri tebacks++;
− l a t += cp−>b l k a c c e s s f n (Write ,
+ i f ( cp−>h i t s==−1)
+ cp wr i t ebacks++;
+ else
+ cp−>wri tebacks++;
+ i f ( cp−>b l k a c c e s s f n==NULL)
+ l a t += cp b l k a c c e s s f n (Write ,
CACHEMKBADDR( cp , blk−>tag , i ) ,
cp−>bs i ze , blk , now+l a t ) ;
+ else
+ l a t += cp−>b l k a c c e s s f n (Write ,
+ CACHEMKBADDR( cp , blk−>tag , i ) ,
+ cp−>bs i ze , blk , now+l a t ) ;
+
}
}
}
@@ −848 ,10 +1161 ,19 @@
       if (blk->status & CACHE_BLK_DIRTY)
         {
           /* write back the invalidated block */
-          cp->writebacks++;
-          lat += cp->blk_access_fn(Write,
-                                   CACHE_MK_BADDR(cp, blk->tag, set),
-                                   cp->bsize, blk, now+lat);
+          if (cp->hits==-1)
+            cp_writebacks++;
+          else
+            cp->writebacks++;
+          if (cp->blk_access_fn==NULL)
+            lat += cp_blk_access_fn(Write,
+                                    CACHE_MK_BADDR(cp, blk->tag, set),
+                                    cp->bsize, blk, now+lat);
+          else
+            lat += cp->blk_access_fn(Write,
+                                     CACHE_MK_BADDR(cp, blk->tag, set),
+                                     cp->bsize, blk, now+lat);
+
         }
       /* move this block to tail of the way (LRU) list */
       update_way_list(&cp->sets[set], blk, Tail);
F.7 Dram.c
Listing F.7: Dram.c - Unified diff against Uniprocessor version
--- ../simplesim-3.0/dram.c 2006-06-01 23:10:03.000000000 +0200
+++ ../../../felles-svn/project/branches/grannas_prefetch/dram.c 2006-05-31 13:55:19.000000000 +0200
@@ -16,6 +16,7 @@
 #include "misc.h"
 #include "machine.h"
 #include "dram.h"
+#include "shared.h"
 /* Global circular buffer pointer */
@@ -23,17 +24,21 @@
 /* This function creates the dram subsystem */
-dram_system_t *create_dram(int number_of_channels, int size_of_block, int page_size, int control_time, int core_time, int data_time, int trace_interval) {
+dram_system_t *create_dram(int number_of_channels, int size_of_block, int page_size, int control_time, int core_time, int data_time, int trace_interval, int proc_id) {
   dram_system_t *dram_system;
-
-  /* Allocate memory for the structure */
+  int dram_id;
+  int channels_id;
+  int available_id;
+  int last_proc_id;
+  int i;
+  int circ_buffer_id;
   dram_system = calloc(1, sizeof(dram_system_t));
   if (dram_system == 0) {
     printf("Could not allocate memory for DRAM model.\n");
     exit(1);
   }
-
+
   /* Set the values given as parameters */
   dram_system->num_channels = number_of_channels;
   dram_system->block_size = size_of_block;
@@ -50,25 +55,44 @@
   dram_system->page_hits = 0;
   /* Create the arrays */
+  if (proc_id == 0) {
+    channels_id = create_shmem(SHMDRAMKEY+1, number_of_channels * sizeof(tick_t));
+    available_id = create_shmem(SHMDRAMKEY+2, number_of_channels * sizeof(md_addr_t));
+    last_proc_id = create_shmem(SHMDRAMKEY+3, number_of_channels * sizeof(int));
+    circ_buffer_id = create_shmem(SHMDRAMKEY+4, (CIRC_BUFFER_SIZE+1) * sizeof(int));
+  } else {
+    channels_id = get_shmem(SHMDRAMKEY+1, number_of_channels * sizeof(tick_t));
+    available_id = get_shmem(SHMDRAMKEY+2, number_of_channels * sizeof(md_addr_t));
+    last_proc_id = get_shmem(SHMDRAMKEY+3, number_of_channels * sizeof(int));
+    circ_buffer_id = get_shmem(SHMDRAMKEY+4, (CIRC_BUFFER_SIZE+1) * sizeof(int));
+  }
+
+  dram_system->ready_channels = shmem_attatch(channels_id);
+  dram_system->last_address = shmem_attatch(available_id);
+  dram_system->last_proc = shmem_attatch(last_proc_id);
+
+  circ_buffer = shmem_attatch(circ_buffer_id);
+
   /* If the number of channels is 0 then do not create anything */
-  if (number_of_channels > 0) {
-    dram_system->ready_channels = calloc(number_of_channels, sizeof(tick_t));
-    if (dram_system->ready_channels == 0) {
-      printf("Error creating ready channels.\n");
-      exit(1);
-    }
-    dram_system->last_address = calloc(number_of_channels, sizeof(md_addr_t));
-    if (dram_system->last_address == 0) {
-      printf("Error creating Address array.\n");
-      exit(1);
-    }
-  }
-  /* Create the circular buffer as usual */
-  circ_buffer = calloc(CIRC_BUFFER_SIZE+1, sizeof(int));
-  if (circ_buffer == 0) {
-    printf("Error creating Circular buffer.\n");
-    exit(1);
+  if (proc_id == 0) {
+    if (number_of_channels > 0) {
+      if (dram_system->ready_channels == 0) {
+        printf("Error creating ready channels.\n");
+        exit(1);
+      }
+      for (i=0; i< number_of_channels; i++) {
+        dram_system->ready_channels[i] = 0;
+      }
+      if (dram_system->last_address == 0) {
+        printf("Error creating Address array.\n");
+        exit(1);
+      }
+      for (i=0; i< number_of_channels; i++) {
+        dram_system->last_address[i] = 0;
+        dram_system->last_proc[i] = -1;
+      }
+    }
   }
   return (dram_system);
 }
@@ -78,6 +102,7 @@
 void free_dram(dram_system_t *dram_system) {
   free(dram_system->last_address);
   free(dram_system->ready_channels);
+  free(dram_system->last_proc);
   free(dram_system);
 }
@@ -123,7 +148,7 @@
  * It returns the access time in number of ticks (due to compability issues)
  */
-unsigned int access_dram(dram_system_t *dram_system, md_addr_t block, int bsize, tick_t now) {
+unsigned int access_dram(dram_system_t *dram_system, md_addr_t block, int bsize, int proc_id, tick_t now) {
   int dram_bank; /* The bank in use, calculated based on the address */
   int latency; /* The calculated latency - in ticks */
   int control_time; /* Time required to transfer a control word to DRAM */
@@ -144,14 +169,14 @@
   dram_system->accesses++;
   /* Update the circular buffer */
-
-  circ_buffer[circ_buffer[0]+1] = now - dram_system->ready_channels[dram_bank];
+
+  circ_buffer[circ_buffer[0]+1] = now - dram_system->ready_channels[dram_bank];
   circ_buffer[0] = (circ_buffer[0] + 1) % CIRC_BUFFER_SIZE;
   /* If the DRAM chip is free */
   if (dram_system->ready_channels[dram_bank] < now) {
     /*Check if we hit an open page */
-    if (is_same_page(dram_system, block, dram_system->last_address[dram_bank])) {
+    if ((is_same_page(dram_system, block, dram_system->last_address[dram_bank])) && (dram_system->last_proc[dram_bank] == proc_id)) {
       /* We hit an open page, transfer time is reduced. */
       dram_system->page_hits++;
       latency = control_time + data_time;
@@ -183,6 +208,7 @@
   /* Commit changes to the data structure */
   dram_system->ready_channels[dram_bank] = now + latency;
   dram_system->last_address[dram_bank] = block;
+  dram_system->last_proc[dram_bank] = proc_id;
   /* Update Statistics */
   dram_system->latency += latency;
@@ -246,3 +272,4 @@
}
   return (total / CIRC_BUFFER_SIZE);
}
+
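
The change to access_dram in Listing F.7 only counts an access as an open-page hit when the same processor made the previous access to that bank. The sketch below isolates that rule for a single bank. It is an illustration under stated assumptions, not the thesis code: the type bank_state, the function bank_access, the helper same_page, and the page size and timing constants are all hypothetical. The miss latency and the handling of a busy bank are likewise assumed, since those parts of dram.c are not shown in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified state for one DRAM bank
 * (a stand-in for the per-bank arrays in dram_system_t). */
typedef struct {
    uint64_t ready_at;     /* tick at which the bank becomes free */
    uint64_t last_address; /* address of the most recent access */
    int      last_proc;    /* processor that issued it, -1 if none yet */
} bank_state;

/* Assumed illustrative parameters, not values from the thesis. */
enum {
    SKETCH_PAGE_SIZE = 4096,
    CONTROL_TIME     = 4,
    CORE_TIME        = 20,
    DATA_TIME        = 4
};

/* Two addresses are on the same page if they share a page index. */
static int same_page(uint64_t a, uint64_t b)
{
    return (a / SKETCH_PAGE_SIZE) == (b / SKETCH_PAGE_SIZE);
}

/* Returns the access latency in ticks and updates the bank state. */
unsigned bank_access(bank_state *bank, uint64_t addr, int proc_id, uint64_t now)
{
    const unsigned full_cost = CONTROL_TIME + CORE_TIME + DATA_TIME;
    unsigned latency;

    assert(bank != NULL);

    if (bank->ready_at < now) {
        /* Bank is free: an open-page hit requires the same page AND
         * the same processor as the previous access to this bank. */
        if (same_page(addr, bank->last_address) && bank->last_proc == proc_id)
            latency = CONTROL_TIME + DATA_TIME;
        else
            latency = full_cost;
    } else {
        /* Assumption: a busy bank makes the request wait, then pay full cost. */
        latency = (unsigned)(bank->ready_at - now) + full_cost;
    }

    /* Commit the new state, mirroring the "Commit changes" step in the diff. */
    bank->ready_at = now + latency;
    bank->last_address = addr;
    bank->last_proc = proc_id;
    return latency;
}
```

With these assumed constants, a second access to the same page by the same processor costs 8 ticks, while the same access from a different processor costs the full 28 ticks, which is how interleaved requests from several cores lose the open-page benefit in this model.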
