Improving the Performance of Parallel Applications in Chip Multiprocessors with Architectural Techniques
Magnus Jahre
Master of Science in Computer Science
Submission date: July 2007
Supervisor: Lasse Natvig, IDI
Norwegian University of Science and Technology
Department of Computer and Information Science

Problem Description
Chip Multiprocessors (CMPs) are becoming increasingly popular, both in industry and academia.
However, most applications are still single-threaded. The paradox is that these applications will
not experience improved performance when run on a CMP platform. In fact, the performance is
often worse due to competition for shared resources. Consequently, the only way to achieve the
performance potential of CMPs is to run parallel applications.
The candidate must investigate the performance of communication intensive multi-threaded
workloads on a CMP platform. The result of this investigation should be the identification of
performance bottlenecks. Furthermore, the candidate should propose and evaluate architectural
techniques that alleviate these performance issues.
The proposed techniques should be evaluated with the M5 simulator. In addition, both multi-
threaded and multi-programmed workloads should be simulated. The multi-threaded workloads
can be used to explore the merits of the proposed technique, while multi-programmed workloads
can be used for sensitivity analysis. It is advisable to use the multi-threaded SPLASH-2 benchmark
suite to investigate communication effects, and programs from the SPEC-2000 benchmark suite to
create multi-programmed workloads.
Assignment given: 13. September 2006
Supervisor: Lasse Natvig, IDI

Abstract
Chip Multiprocessors (CMPs) or multi-core architectures are a new class of processor
architectures. Here, multiple processing cores are placed on the same physical chip. To reach the
performance potential of these architectures with a single application, it must be multi-threaded.
In these applications, the processing cores cooperate to solve a single task, and this requires a
large amount of inter-processor communication in many cases. Consequently, CMPs need to
support this communication in an efficient manner.
To investigate inter-processor communication in CMPs, a good understanding of the state-of-the-
art of CMP design options, interconnect network design and cache coherence protocol solutions
is required. Furthermore, a good computer architecture simulator is needed to evaluate both
new and conventional architectural solutions. The M5 simulator [BDH+06] is used for this
purpose and has been extended with a generic split transaction bus, a crossbar based on the
IBM Power 5 crossbar [KZT05], a butterfly network and an ideal interconnect. The unrealistic
ideal interconnect provides an upper bound on the performance improvement available from
enhancing the interconnect. In addition, a directory-based coherence protocol proposed by
Stenström has been implemented [Ste89].
The performance of 2-, 4- and 8-core CMPs with crossbar and bus interconnects, private L1
caches and shared L2 caches is investigated. The bus and the crossbar are the conventional ways
of implementing the L1 to L2 cache interconnect. These configurations have been evaluated
with multiprogrammed workloads from the SPEC2000 benchmark suite [SPEa] and parallel,
scientific benchmarks from the SPLASH-2 benchmark suite [WOT+95]. With multiprogrammed
workloads, the crossbar interconnect configurations perform nearly as well as a configuration
with an ideal interconnect. However, the performance of the crossbar CMPs is similar to the
performance of the bus CMPs when there is intensive L1 to L1 cache communication. The reason
is limited L1 to L1 bandwidth. The bus CMPs experience a severe performance degradation
with some benchmarks for all processor counts and workload classes.
A butterfly interconnect is proposed to alleviate the L1 to L1 communication bottleneck. The
butterfly CMP performs on average 3.9 times better than the bus CMP and 3.8 times better
than the crossbar CMP when there are 8 processor cores. These numbers are based only on the
WaterNSquared, Raytrace, Radix and LUNoncontig benchmarks because the other SPLASH-2
benchmarks had issues with the M5 thread implementation for these configurations. For the
multiprogrammed workloads, the butterfly CMPs are slightly slower than the crossbar CMPs.

Contents
1 Introduction
1.1 Assignment Interpretation
1.2 Main Contributions
1.3 Report Outline
2 State-of-the-art
2.1 Chip Multiprocessor Background
2.1.1 CMP Motivation
2.1.2 Future Limitations to CMP Architectures
2.1.3 CMP Design Options
2.2 Interconnect
2.2.1 On-Chip and Off-Chip Interconnects
2.2.2 Topology
2.2.3 Routing
2.2.4 Flow Control
2.3 Cache Coherence Protocols
2.3.1 Snooping-based Cache Coherence Protocols
2.3.2 Directory-based Cache Coherence Protocols
2.3.3 Alternative Cache Coherence Solutions
3 Research Questions and Methods
3.1 Research Questions
3.2 CMP Architecture Model
3.2.1 Processor Parameters
3.2.2 Memory System Parameters
3.2.3 Interconnect Parameters
3.2.4 Coherence Protocol Parameters
3.3 Simulator
3.3.1 Experiment Tool-chain
3.4 Benchmarks
3.4.1 SPEC CPU2000 Multiprogrammed Workloads
3.4.2 SPLASH-2 Communicating Workloads
4 Simulator Extensions
4.1 The M5 simulator
4.1.1 M5 Overview
4.1.2 Flaws in M5
4.2 Interconnect Extensions
4.2.1 Software Architecture
4.2.2 Split Transaction Bus
4.2.3 Crossbar
4.2.4 Butterfly
4.2.5 Ideal Interconnect
4.3 Cache Coherence Protocol Extensions
4.3.1 Software Architecture
4.3.2 Handling Non-blocking Caches
4.3.3 Possible Race Conditions
4.3.4 Implementation Choices
4.4 Simulator Extension Testing
5 CMP Performance with Multiprogrammed Workloads
5.1 2-core CMP Configuration
5.1.1 Miss Intensive Benchmarks
5.1.2 Other Results
5.2 4-core CMP Configuration
5.2.1 Miss Intensive Benchmarks
5.2.2 Other Results
5.3 8-core CMP Configuration
5.3.1 Baseline 8-core CMP Results
5.3.2 8-core CMP with Larger Cache
6 CMP Performance with Communicating Workloads
6.1 Parallel Benchmark Performance with 2 CPUs
6.1.1 Bandwidth Demand with 2 Cores
6.1.2 2-core Interconnect Performance
6.2 Parallel Benchmark Performance with 4 CPUs
6.2.1 Bandwidth Demand with 4 Cores
6.2.2 4-core Performance Results
6.3 Parallel Benchmark Performance with 8 CPUs
6.3.1 Bandwidth Demand with 8 Cores
6.3.2 8-core Performance Results
7 Butterfly Interconnect Evaluation
7.1 Butterfly Performance with Multiprogrammed Workloads
7.1.1 2-core CMP Multiprogrammed Performance
7.1.2 4-core CMP Multiprogrammed Performance
7.1.3 8-core CMP Multiprogrammed Performance
7.2 Butterfly Performance with Scientific Workloads
7.2.1 2-core CMP SPLASH-2 Performance
7.2.2 4-core CMP SPLASH-2 Performance
7.2.3 8-core CMP SPLASH-2 Performance
8 Discussion and Evaluation
8.1 Discussion
8.1.1 Multiprogrammed Workload Performance
8.1.2 Parallel Workload Performance
8.1.3 Performance Impact of Interconnect Enhancements
8.2 Evaluation
8.2.1 CMP Model Configuration
8.2.2 Use of the M5 Simulator
8.2.3 Implementation Decisions
9 Conclusion and Further Work
9.1 Conclusion
9.2 Further Work
A Randomly Generated Multiprogram Workloads
A.1 Multiprogram Workloads for 2 CPUs
A.2 Multiprogram Workloads for 4 CPUs
A.3 Multiprogram Workloads for 8 CPUs
B Mail Correspondence with M5 Development Team
B.1 First Bug Report
B.2 Reply from Steve Reinhardt
B.3 Elaboration on First Bug Report
C Simulator Extension Code
C.1 Interconnect Extension Code
C.1.1 Interconnect Header File
C.1.2 Interconnect Code File
C.1.3 Split Transaction Bus Header File
C.1.4 Split Transaction Bus Code File
C.1.5 Butterfly Header File
C.1.6 Butterfly Code File
C.1.7 Crossbar Header File
C.1.8 Crossbar Code File
C.1.9 Ideal Interconnect Header File
C.1.10 Ideal Interconnect Code File
C.1.11 Interconnect Interface Header File
C.1.12 Interconnect Interface Code File
C.1.13 Interconnect Master Interface Header File
C.1.14 Interconnect Master Interface Code File
C.1.15 Interconnect Slave Interface Header File
C.1.16 Interconnect Slave Interface Code File
C.1.17 Interconnect Profiler Header File
C.1.18 Interconnect Profiler Code File
C.2 Coherence Protocol Extension Code
C.2.1 Directory Protocol Header File
C.2.2 Directory Protocol Code File
C.2.3 Stenström Protocol Header File
C.2.4 Stenström Protocol Code File
D Simulator Configuration Scripts
D.1 run.py
D.2 DetailedConfig.py
D.3 FuncUnitConfig.py
D.4 MemConfig.py
D.5 Spec2000.py
D.6 workloads.py
List of Figures
1.1 A Shared-Cache CMP
1.2 Report Outline
2.1 Processor and Memory Performance [HP07]
2.2 CMP Memory System Scalability
2.3 Wire Scaling and Technology Scaling (Reproduced from [HHM99])
2.4 Chip Multiprocessor Generations (Reproduced from [SA05])
2.5 A Tiled Chip Multiprocessor (Reproduced from [ZA05])
2.6 A 3D Chip Multiprocessor (Adapted from [LNR+06])
2.7 Possible Levels of Sharing in a CMP
2.8 Interconnection Network Terminology
2.9 The Bus Interconnect from Kumar et al. [KZT05]
2.10 Crossbar Topology
2.11 Mesh and Torus Topologies (Reproduced from [BD06])
2.12 Number of Channels in a Butterfly and a Crossbar
2.13 A Radix 2 Butterfly with 8 Nodes
2.14 Fat-Tree and Tapered Fat-Tree Topologies (Reproduced from [BD06])
2.15 An Example Router Architecture (Adapted from [BD06])
2.16 Illustration of the Cache Coherence Problem
2.17 Snooping Protocol Possibilities
2.18 In-Network Cache Coherence Optimisation Example (Reproduced from [EPS06])
2.19 In-Network Cache Coherence Virtual Tree (Reproduced from [EPS06])
2.20 Status Information used in the Stenström Protocol (Adapted from [Ste89])
2.21 Stenström Protocol Read Miss Handling
2.22 Stenström Protocol Write Hit Handling
2.23 Stenström Protocol Write Miss Handling
2.24 Stenström Protocol Block Replacement Handling
3.1 High-level Chip Multiprocessor Architecture
3.2 Shared Bus Model - Data Bus
3.3 Crossbar Model
3.4 Examples of Butterfly Models
3.5 Typical Experiment Work-Flow
4.1 M5 Simulator Configuration
4.2 M5 Memory System Example
4.3 M5 Memory Layout for Multiprogrammed Workloads
4.4 Interconnect Extension Software Architecture
4.5 Interconnect Extension L1 to L2 Cache Transfer Example
4.6 Directory Protocol Extension
4.7 Non-blocking Cache and Coherence Challenge
4.8 Two Processors Request Block Ownership Simultaneously
4.9 A Processor Issues a Redirected Read while Owner Transfer in Progress
5.1 Single-core SPEC Benchmark Cache Performance
5.2 2-core CMP Interconnect Performance
5.3 4-core CMP Interconnect Performance
5.4 8-core CMP Interconnect Performance
5.5 8-core CMP with Large Cache
6.1 Interconnect Request Rate in Sample for a 2-core CMP
6.2 Scientific Workload Performance in a 2-core CMP
6.3 Interconnect Request Rate in Sample for a 4-core CMP
6.4 Scientific Workload Performance in a 4-core CMP
6.5 LUNoncontig Request Profile
6.6 Interconnect Request Rate in Sample for an 8-core CMP
6.7 Scientific Workload Performance in an 8-core CMP
7.1 Butterfly Performance in a 2-core CMP
7.2 Butterfly Performance in a 4-core CMP
7.3 Butterfly Performance in an 8-core CMP
7.4 Butterfly Performance in an 8-core CMP with Large Cache
7.5 Total Interconnect Requests in Sample for a 2-core CMP
7.6 Butterfly Communication Performance in a 2-core CMP
7.7 Total Interconnect Requests in Sample for a 4-core CMP
7.8 Butterfly Communication Performance in a 4-core CMP
7.9 Total Interconnect Requests in Sample for an 8-core CMP
7.10 Butterfly Communication Performance in an 8-core CMP
List of Tables
2.1 Topology Terminology
3.1 Baseline Processor Model Parameters
3.2 Baseline Memory System Parameters
3.3 Baseline Interconnect Parameters
3.4 Baseline Butterfly Interconnect Parameters
3.5 Number of Instructions Simulated with SPLASH-2 Benchmarks
6.1 Splash Benchmarks where M5 Flaws were Observed
7.1 Splash Benchmarks where M5 Flaws were Observed
A.1 Randomly Generated Multiprogram Workloads for 2 CPUs
A.2 Randomly Generated Multiprogram Workloads for 4 CPUs
A.3 Randomly Generated Multiprogram Workloads for 8 CPUs
List of Abbreviations
ACK Acknowledgement
CAS Column Access Strobe
CMOS Complementary Metal-Oxide-Semiconductor
CMP Chip Multiprocessor
CMT Chip Multithreading
CPU Central Processing Unit
flit flow control digit
IDI Department of Computer and Information Science, NTNU
ILP Instruction-Level Parallelism
IPC Instructions per Cycle
ISA Instruction Set Architecture
ITRS International Technology Roadmap for Semiconductors
KB Kilobyte
LRU Least Recently Used
MB Megabyte
MIN Multistage Interconnection Network
MPSoC Multi-Processor System on Chip
MSHR Miss Status Holding Register
MSI Modified, Shared, Invalid
NACK Negative Acknowledgement
NCAR NTNU Computer Architecture Research Group
NUCA Non-Uniform Cache Access
SMP Symmetric Multiprocessor
SMT Simultaneous Multithreading
SoC System on Chip
SPEC Standard Performance Evaluation Corporation
SPLASH Stanford Parallel Applications for Shared Memory
SRAM Static Random Access Memory
SW Switch
TLB Translation Lookaside Buffer
TLP Thread-Level Parallelism
VC Virtual Channel
Chapter 1
Introduction
Chip Multiprocessors (CMPs) or multi-core architectures are a new class of high-performance
processor architectures. Here, multiple processing cores are placed on one physical chip, and the
result is a single-chip multiprocessor. To realise the performance potential of these architectures,
it is important to support efficient inter-processor communication. This report investigates
architectural techniques for providing this much needed capability.
Figure 1.1 shows the high-level CMP architecture that is the main focus of this report. This
architecture is known as a shared-cache CMP and is one of many ways to design a CMP. Other
possible architectures will be discussed when necessary. Since the task at hand is complex, it is
helpful to focus on one high-level architecture.
There are three main reasons for focusing on shared-cache CMPs:
• In this work, the most important reason is that the architecture enables very fast
communication between the processor cores, as communication can be carried out with L1 to
L1 cache data transfers. In other CMP architectures, this communication might need to
access the off-chip memory bus or go via both processors’ L2 caches. Both options add
considerable delay.
• The recent, commercial Intel Core Duo dual-core processor is a shared-cache CMP [GMNR06].
Consequently, a major commercial actor has made a considerable investment in this
high-level architecture.
• As more cores are added to the chip, the off-chip memory bus can become a bottleneck.
Consequently, this shared resource should not be used more than necessary. An important
part in achieving this is to use the on-chip L2 cache efficiently. A shared cache will in most
cases result in better chip-wide utilisation than per-core private L2 caches.
This chapter has the following outline:
• First, section 1.1 discusses the assignment and formulates the tasks that must be carried
out in order to answer it. Furthermore, it explains in which part of the report the tasks
are answered.
• Then, section 1.2 discusses the main contributions of this work.
• Finally, the structure of the report is presented in section 1.3.
Figure 1.1: A Shared-Cache CMP (diagram: CPU 1 and CPU 2, each with L1 Data and L1 Instruction caches, an Interconnect, a shared L2 Cache, and Main Memory reached via the Memory Bus)
1.1 Assignment Interpretation
This section discusses the assignment text and divides it into subtasks. In addition, it
highlights where in the report the subtasks are answered. Consequently, this section clarifies the
relationship between the assignment text and the report.
The assignment consists of the following tasks:
T1 Investigate the performance of communication intensive multi-threaded benchmarks.
T2 Identify performance bottlenecks.
T3 Propose techniques that alleviate the bottlenecks.
Task T1 is answered by the experiments discussed in chapter 6. Here, the SPLASH-2 benchmark
suite [WOT+95] is used to investigate the performance of parallel programs on 2-, 4- and 8-core
shared-cache CMPs. Furthermore, the results are supported by the multiprogrammed workload
experiments with programs from the SPEC2000 benchmark suite [SPEa] which are discussed in
chapter 5.
This investigation has identified a number of bottlenecks:
• Most importantly, efficient L1 to L1 communication is needed for parallel programs. The
state-of-the-art crossbar interconnect simulated in this work uses a shared bus for L1 to
L1 traffic. This crossbar is based on the crossbar used in the IBM Power 5 CMP [KZT05].
The results in chapter 6 show that congestion in this bus severely limits performance for
a number of parallel programs.
• For multiprogrammed workloads, the off-chip memory bus becomes congested in some cases
and this limits performance.
• Lastly, the number of misses that can be serviced simultaneously in the L1 cache is a
bottleneck for the processor simulated in this report. This bottleneck has been investigated
further by us in a different work [JN07].
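The first bottleneck can be illustrated with a deliberately simple timing model. The sketch below is not taken from the simulator extensions in this report; the function names, the fixed transfer time and the assumption that transfers only conflict at their destination in the second model are all hypothetical simplifications. A real butterfly can also experience conflicts on internal channels, so the second function describes an idealised case.

```python
from collections import defaultdict


def shared_bus_finish_time(transfers, cycles_per_transfer):
    """All transfers are serialised on one bus, regardless of their endpoints."""
    return len(transfers) * cycles_per_transfer


def parallel_paths_finish_time(transfers, cycles_per_transfer):
    """Idealised model: a transfer only waits for transfers to the same destination."""
    per_destination = defaultdict(int)
    for _source, destination in transfers:
        per_destination[destination] += 1
    busiest = max(per_destination.values(), default=0)
    return busiest * cycles_per_transfer


# Four L1 to L1 transfers issued at the same time, all to different caches.
# Each tuple is (source cache, destination cache); 4 cycles per transfer is arbitrary.
transfers = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(shared_bus_finish_time(transfers, 4))      # 16
print(parallel_paths_finish_time(transfers, 4))  # 4
```

In this toy case the shared bus needs four times as long as the idealised interconnect, which is the kind of serialisation that the congestion observation above refers to.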
These bottlenecks demonstrate that task T2 is answered. Since this work focuses on CMP
communication, the L1 to L1 communication bottleneck is prioritised. A butterfly interconnect
is proposed because it enables efficient communication both between the L1 caches and between
the L1 caches and the L2 cache. Although the hardware cost of the butterfly is somewhat higher
than the cost of the simulated crossbar, it is less than the cost of a full crossbar. In this context,
the term full crossbar is used to describe an interconnect where all nodes have a direct connection
to all other nodes. The butterfly interconnect results are discussed in chapter 7 and show that
the butterfly is very successful in alleviating the L1 to L1 communication bottleneck.
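As a rough illustration of why a butterfly can be cheaper than a full crossbar, the following sketch counts unidirectional channels under a simple textbook model. The model is an assumption made for this example and is not the hardware cost model used in this report: a radix-k butterfly with N nodes is taken to have log_k N switch stages with N channels before, between and after the stages, and a full crossbar is taken to have one channel for every ordered pair of nodes. All function names are hypothetical.

```python
def butterfly_stages(nodes, radix=2):
    """Number of switch stages; nodes must be an exact power of radix."""
    stages = 0
    size = 1
    while size < nodes:
        size *= radix
        stages += 1
    if size != nodes:
        raise ValueError("nodes must be a power of radix")
    return stages


def butterfly_channels(nodes, radix=2):
    """N channels enter the first stage, leave the last stage and link each pair of stages."""
    return nodes * (butterfly_stages(nodes, radix) + 1)


def full_crossbar_channels(nodes):
    """One direct channel from every node to every other node."""
    return nodes * (nodes - 1)


for n in (2, 4, 8, 16):
    print(n, butterfly_channels(n), full_crossbar_channels(n))
```

Under these assumptions the butterfly needs 32 channels for 8 nodes while the full crossbar needs 56, and the gap widens as the node count grows. For very small networks the model does not favour the butterfly, so the comparison should be read as a trend rather than a precise cost estimate.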
In addition to these tasks, the assignment text makes two suggestions:
• The experiments should be carried out on the M5 simulator [BDH+06].
• The multi-threaded SPLASH-2 [WOT+95] and multi-programmed SPEC2000 [SPEa] benchmark
suites should be used.
Although these suggestions are not absolute demands, it is probably advantageous to follow
them. The M5 simulator is a recent, feature-rich computer architecture simulator and a considerable
improvement compared to the SimpleScalar simulator [ALE02] previously used at the NTNU
Computer Architecture Research Group (NCAR). For instance, M5 accurately models finite
miss status buffers and an L1 to L2 interconnect. In SimpleScalar, these resources have an
infinite capacity. Sadly, M5 does have some issues with its thread implementation in system
call emulation mode, but these were not known at the time the simulator was chosen. These
limitations will be discussed in detail in section 4.1.2. All in all, M5’s advantages outweigh its
disadvantages.
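The difference between finite and infinite miss status buffers can be sketched with a small model. The class below is a hypothetical illustration and does not reproduce M5 or SimpleScalar code; the class name, method names and the use of `None` for unlimited capacity are assumptions made for this example.

```python
class MissStatusRegisters:
    """Tracks outstanding cache misses, one entry per missing block."""

    def __init__(self, capacity):
        # capacity=None models the unlimited resource assumed by a simpler simulator.
        self.capacity = capacity
        self.outstanding = set()

    def request(self, block_address):
        """Return True if the miss can be tracked, False if the cache must block."""
        if block_address in self.outstanding:
            return True  # merge with a miss that is already being serviced
        if self.capacity is not None and len(self.outstanding) >= self.capacity:
            return False  # all registers are in use
        self.outstanding.add(block_address)
        return True

    def complete(self, block_address):
        """Free the register when the block arrives from the next memory level."""
        self.outstanding.discard(block_address)


finite = MissStatusRegisters(capacity=2)
print(finite.request(0x100), finite.request(0x140), finite.request(0x180))
finite.complete(0x100)
print(finite.request(0x180))
```

With two registers, the third distinct miss is rejected until an earlier miss completes. With unlimited capacity this stall never occurs, which is why a simulator that models the finite resource can expose a bottleneck that the simpler model hides.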
The SPLASH-2 and SPEC2000 benchmark suites are a good match for the M5 simulator. Since
the M5 simulator uses the Alpha instruction set, they are both available as precompiled binaries.
This is a great advantage as compiling benchmarks is a non-negligible amount of work.
In addition to the tasks specified in the assignment, the interconnect performance of multi-
programmed workloads is investigated. The reason is that this is a simpler task because it does
not require a cache coherence protocol. Consequently, some insights can be gained earlier in
the work. Hopefully, this leads to a better understanding of the problem at hand than if the
assignment text was followed to the letter.
Answering the tasks given in the assignment text within the time frame of a master's thesis is
ambitious. There are a number of reasons for this. Firstly, the M5 simulator has not been used
by the NCAR group earlier. Consequently, there is no help available locally on how to use and
configure it. This is a challenge, as the M5 simulator is a complex piece of software. Secondly, M5
only supports a bus interconnect between the L1 and L2 caches. The M5 software architecture
makes it difficult to add other interconnects. Lastly, it is preferable to simulate a directory cache
coherence protocol as this gives full freedom in choosing the types of interconnects to investigate.
Although it is considerably easier to implement a directory protocol in a simulator than in real
hardware, it is still a challenge. The reason is that there are numerous rare race conditions that
must be handled correctly.
1.2 Main Contributions
The main contributions in this work are:
C1 A study of the most important interconnects in a shared-cache CMP has been carried out.
C2 A good understanding of the M5 simulator has been gained. Furthermore, a split transaction
bus, a crossbar, a butterfly, an ideal interconnect and a directory-based cache coherence
protocol have been developed.
C3 A problem with the M5 system call emulation thread implementation was identified. It makes
this library less attractive for research than previously known. The problem
was reported to the M5 development team.
C4 A review of the state-of-the-art of CMP design with a particular focus on interconnect and
cache coherence solutions is given.
Figure 1.2: Report Outline (diagram boxes: Introduction; State of the Art; Questions and Methods; Simulator Extensions; Scientific Workload Performance; Multiprogram Workload Performance; Butterfly Interconnect Evaluation; Discussion and Evaluation; Conclusion and Further Work. Side labels: Theoretical Part; Practical Part; Experiment Setup; Experiment Results)
Contribution C1 is the answer to the assignment text and therefore the most important one for
this work. However, contributions C2 and C3 are important for the NCAR group. Detailed
knowledge about a state-of-the-art computer architecture simulator is clearly an asset for future
research. Lastly, contribution C4 is also helpful for future research. In addition, it contains
discussions of the most important, recently published works on interconnects in CMPs.
1.3 Report Outline
Figure 1.2 is a graphical depiction of the structure of this report. The report consists of two
main parts: a theoretical part and a practical part. The practical part is further subdivided into
the experimental setup and experimental results parts. The theoretical part comes first and is
covered by the State-of-the-art chapter (chapter 2). This chapter contains an introduction to
CMP design options, a discussion of CMP on-chip interconnects and an introduction to cache
coherence protocols in a CMP context.
The practical part starts with the Questions and Methods chapter (chapter 3). This chapter
states the research questions that form the basis for the practical part of the report. In addition,
the baseline CMP architecture used in the simulations is discussed. A general introduction to
the M5 simulator is given in the Simulator Extensions chapter (chapter 4). The implementation
of the new interconnects and the directory protocol is also described here. Furthermore, the
problems with the M5 system call emulation thread library are discussed.
There are three chapters describing the simulation results. First, the Multiprogrammed Work-
load Performance chapter (chapter 5) discusses the simulation results from the experiments
with the bus, crossbar and ideal interconnects and multiprogrammed workloads created with
SPEC2000 benchmarks [SPEa]. Then, the Scientific Workload Performance chapter (chapter 6)
discusses the results obtained when using the SPLASH-2 benchmark suite [WOT+95]. Finally,
the Butterfly Interconnect Evaluation chapter (chapter 7) evaluates the butterfly interconnect
with both multiprogrammed and scientific workloads. The butterfly interconnect was chosen
because it should alleviate the L1 to L1 cache bandwidth bottleneck discovered in chapter 6.
The two final chapters wrap up the report. First, the Discussion and Evaluation chapter (chapter
8) answers the research questions. Furthermore, it discusses possible threats to the validity of the
experimental results. Finally, the Conclusion chapter (chapter 9) concludes and gives indications
for possible further work.
Chapter 2
State-of-the-art
It is necessary to have a good understanding of previously proposed techniques to be able to
propose new ones. In this report, there is a need to understand CMP design possibilities,
interconnects and cache coherence solutions. As mentioned in the introduction, this report
focuses on shared L2 cache CMPs. Consequently, the techniques reviewed in this chapter will
mainly be related to this high-level architecture. However, other architectures will be considered
when it is appropriate.
This chapter has the following outline:
• Firstly, the CMP-design space is explored by discussing academic CMP design proposals
and commercial CMP implementations. Section 2.1 covers this point and does not focus
on a specific CMP architecture.
• Section 2.2 discusses the interconnection network design options. The focus is on inter-
connect solutions for shared L2 cache CMPs.
• Finally, section 2.3 presents possible cache coherence protocol implementations. Again,
shared L2 cache CMPs are the main focus.
2.1 Chip Multiprocessor Background
This section explores the design space of single-chip multiprocessors. In the academic community,
these designs are commonly referred to as Chip Multiprocessors (CMPs). However, commercial
CMPs with two cores are known as dual-core processors. This has led to the adoption of the
new term multi-core architectures. These terms are synonymous, and the CMP term will be
used in the remainder of this report.
A CMP or multi-core architecture can be homogeneous or heterogeneous [KTJR05, KTR+04,
KFJ+03]. This refers to how similar the different processing cores in the system are to each
other. In a heterogeneous CMP, the processing cores have different properties. For instance,
some cores can be simple in-order cores and other cores can be speculative and out-of-order.
Here, applications that can efficiently utilise an out-of-order core are run on an out-of-order
core. Applications with limited Instruction-Level Parallelism (ILP) can run on the in-order core
as they experience only a small speed-up when run on an out-of-order core. Depending on the
design constraints, this can result in lower power consumption, higher area efficiency or both.
Multi-Processor System on Chip (MPSoC) is an important class of heterogeneous single-chip
multiprocessors [Wol04]. A System on Chip (SoC) is an embedded system where all or most
components are placed on the same chip. If this system uses more than one general-purpose
processor, it is referred to as an MPSoC. In this context, each processor is often given responsibility
for a small number of tasks. Then, a processor that is well suited for these tasks is selected. The
main difference between CMPs and MPSoCs is that MPSoCs are usually application specific. In
addition, MPSoCs often have strict power budgets, area budgets and real-time demands. Furthermore,
each processor often has an Instruction Set Architecture (ISA) that is different from those of
the other processors in the system. Therefore, MPSoCs differ from the general-purpose single-chip
multiprocessors investigated in this report and will not be discussed further.
This report will focus on homogeneous CMPs. In other words, the CMP’s processing cores are
identical.
The rest of this section is organised as follows:
• Section 2.1.1 discusses the motivation for implementing CMPs.
• Then, section 2.1.2 discusses possible future limitations to the CMP architectures.
• Section 2.1.3 investigates possible high-level architectural choices when implementing a
CMP. Both CMP architectures proposed in academia and commercial implementations
are discussed.
2.1.1 CMP Motivation
The single-chip multiprocessor concept has been around for some time. For instance, Olukotun
et al. [ONH+96] proposed CMPs in the 1990s as a way to increase the processor clock rate.
Although increasing the clock rate is not an important motivation any more, CMPs have recently
gained popularity. A number of commercial vendors now produce CMPs [KST04, KAO05, AMD,
GMNR06].
The recent popularity of CMPs is due to the following factors:
• Technology scaling has made placing multiple cores on one chip feasible.
• It has become increasingly difficult to improve performance by techniques that exploit
Instruction-Level Parallelism (ILP) beyond what is common today.
• The power consumption of single-core, high-performance processors is high. Consequently,
expensive packaging and noisy cooling solutions are needed. This limitation is known as
the power wall.
• Processor performance has been improving at a faster rate than the main memory access
time for over 20 years. Consequently, the performance difference between the processor
and the memory is large and techniques that hide this latency are needed. This limitation
is known as the memory wall.
• When designing a CMP, a processor core is designed once and reused as many times as
there are cores on the chip. Furthermore, these cores can be simpler than their single-core
counterparts. Consequently, CMPs facilitate design reuse and reduce time-to-market.
From 1987 to 2004 the performance of a microprocessor was increased by around 55% per year
[HP07]. This high performance improvement was primarily due to two factors. Firstly, the
number of transistors per chip increased as the production technology scaled down. Secondly,
the clock frequency was increased faster than what was natural given the reduction in feature
size.
[Figure: processor performance and memory latency per year from 1980 to 2007, plotted on a logarithmic performance axis from 1 to 100000]
Figure 2.1: Processor and Memory Performance [HP07]
In 2000, Agarwal et al. argued that the techniques used to exploit ILP in aggressive out-of-order
processors could only support an annual performance improvement of 12.5% [AHKB00]. The
reason is that global wire delays grow faster than gate delay. Consequently, designers can choose
between deeper pipelines, smaller structures or slower clocks. None of these design options will
result in scalable performance. Therefore, there is a need to look into new architectures. In
CMPs, the problem of wire delays is confined to the interconnect between cores. This point will
be discussed further in section 2.1.2.
The power consumption of a processor must be controlled. However, achieving processor per-
formance improvements by increasing the clock frequency results in higher power consumption.
Furthermore, many techniques that exploit ILP are power hungry. The reason is that they do
more work than is needed. This increases performance, but has a non-negligible power cost. In
CMPs, Thread-Level Parallelism (TLP) can be used to achieve high performance. Consequently,
more power efficient cores running on a lower clock frequency can be used. However, achieving
a speed-up comes at the price of parallelising the application.
Figure 2.1 shows the relative difference between microprocessor and memory performance with
1980 as a baseline. According to Hennessy and Patterson [HP07], microprocessor performance
increased by on average 25% per year from 1980 to 1986. Then, processor performance increased
by 55% per year on average from 1987 to 2004. As discussed earlier, this growth is attributed to the
advances in computer architecture as well as the scaling of technology. However, from 2004 the
average performance improvement per year has been reduced to 20%. This is due to the reasons
noted at the beginning of this section: power limitations, limitations to ILP-based techniques,
long memory latency and high design cost.
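The growth rates quoted above compound from year to year, which is why the gap to memory becomes so large. The following is a minimal sketch of that arithmetic, not a reproduction of figure 2.1. The function name growth_index and the segment lists are hypothetical, and it is an assumption that each rate applies to every year up to and including the last year of its period.

```python
def growth_index(year, segments, base_year=1980):
    """Return performance relative to base_year under piecewise annual growth.

    segments is a list of (last_year, annual_rate) pairs in increasing order.
    Assumption: a rate applies to every year up to and including last_year.
    """
    index = 1.0
    for y in range(base_year + 1, year + 1):
        # Pick the first segment whose period still covers year y.
        rate = next(r for last, r in segments if y <= last)
        index *= 1.0 + rate
    return index


# Annual rates quoted in the text; the open-ended last period is an assumption.
PROCESSOR = [(1986, 0.25), (2004, 0.55), (float("inf"), 0.20)]
MEMORY = [(float("inf"), 0.07)]

for year in (1986, 2004, 2007):
    cpu = growth_index(year, PROCESSOR)
    mem = growth_index(year, MEMORY)
    print(f"{year}: processor x{cpu:.0f}, memory x{mem:.1f}, gap x{cpu / mem:.0f}")
```

Under these assumptions, the processor index is several orders of magnitude above the 1980 baseline by 2004, while the memory index has grown by a single-digit factor.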
The average improvements in memory latency are also shown in figure 2.1. This improvement
has remained at 7% per year on average. However, the memory density has improved at roughly
the same rate as the processor performance. Consequently, the challenge is to feed the powerful
processors from a large and slow memory. The main technique to combat this problem is to
create large on-chip caches. Currently, CMPs make this problem worse. When it is difficult to
feed one processor, why should it be easier to feed for instance four processors? The next section
will discuss this point further.
The last reason for introducing CMPs is that processors that are good at exploiting ILP
are very complex. In other words, they are expensive to design. In homogeneous CMPs, the
processor core is designed once and then reused throughout the design. Consequently, CMPs
make more business sense than conventional ILP-based processors.
2.1.2 Future Limitations to CMP Architectures
CMPs address many, but not all, of the challenges that face microprocessor designers. In addition,
the physical limitations of the production technology apply regardless of architectural choices.
Consequently, it is interesting to discuss potentially limiting factors for CMP designs.
The following potential limits to CMP performance will be discussed in this section:
• A single-threaded application will in some cases run slower on a CMP than on a comparable
single-core processor
• Memory latency and off-chip bandwidth are constrained resources
• The latency of global wires does not scale down with technology
2.1.2.1 CMP Single Thread Performance
The great paradox of CMPs is that a single-threaded program never runs faster on a CMP than
on a single-core design with the same processor core. Sadly, it will in some cases run slower. The
reason is that CMPs either share caches or have smaller per-core caches. In the shared cache
case, different cores might compete for space in the same cache set or bank. This problem
is known as hot-sets and hot-banks [SA05]. Consequently, the application is slowed down in
unpredictable ways. In the private cache case, the cache capacity is usually divided equally
between the processors. If this results in the application’s working set not fitting in the cache,
the application is slowed down.
This problem can be solved by parallelising the application. However, single-threaded programs
will probably be the norm for many years. Consequently, there is a need for techniques that
make these programs run efficiently.
2.1.2.2 Memory Latency and Bandwidth
As mentioned in the previous section, there is a need to hide the memory latency in modern
processors. This is done by using caches. Figure 2.2 shows growth trends for processor per-
formance, memory density, pins per chip and memory latency. The performance numbers have
been set to one in 2005 to make comparison of the trends easier. Processor performance (55%
per year), memory latency (7% per year) and memory density (55% per year) are based on the
growth trends used by Hennessy and Patterson [HP07]. The average pin count growth of 6.5%
is based on the ITRS Roadmap [ITR06].
[Figure: relative growth (axis from 0 to 5) per year from 2005 to 2013 for three series: Processor Performance and Memory Density, Total Number of Pins, and Memory Latency]
Figure 2.2: CMP Memory System Scalability
Figure 2.2 shows that processor performance grows faster than pin count and memory latency.
The memory latency can be hidden by using caches. However, the memory bandwidth depends
on the width and clock frequency of the memory bus. Figure 2.2 shows that the number of pins
per chip is expected to grow at roughly the same rate as the memory latency. Although the
clock frequency of the bus can be increased, off-chip bandwidth is likely to become a constrained
resource [HBK01]. Consequently, a good CMP design must use the memory bus in an intelligent
manner.
2.1.2.3 Delay of Global Wires
The delay of a wire depends on its resistance and its capacitance. Both these quantities depend
on wire length. Consequently, the delay of a wire depends on its length. The delay of a gate
also depends on its size, and a smaller gate has a lower delay than a larger gate. When the
production technology improves, gates become smaller and the circuits become faster.
If the wire length is scaled with technology, the relative delay of a gate and the wire stays
roughly the same [HHM99]. Figure 2.3 illustrates this point. However, the long wire connecting
the two modules in the figure does not decrease in length when the technology is scaled down.
Consequently, the delay of this wire is increased relative to gate delay. On the other hand, the
length of the short wire decreases when technology is scaled. In other words, the delay of global
wires is expected to increase relative to gate delay when technology scales down.
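The dependence of wire delay on length can be made concrete with a first-order model. The sketch below treats an unrepeated wire as a distributed RC line and uses the common Elmore approximation of half the product of total resistance and total capacitance. The function name wire_delay and all numeric values are hypothetical illustrations, and holding the per-millimetre resistance and capacitance fixed is a simplification.

```python
def wire_delay(length_mm, r_per_mm, c_per_mm):
    """First-order delay (seconds) of an unrepeated wire as a distributed RC line."""
    resistance = r_per_mm * length_mm    # total resistance grows with length
    capacitance = c_per_mm * length_mm   # total capacitance grows with length
    return 0.5 * resistance * capacitance


# Illustrative values only (assumptions, not taken from the report).
R_PER_MM = 1000.0   # ohm per mm
C_PER_MM = 2e-13    # farad per mm

short = wire_delay(1.0, R_PER_MM, C_PER_MM)
long_ = wire_delay(2.0, R_PER_MM, C_PER_MM)
print(f"doubling the length multiplies the delay by {long_ / short:.1f}")

# A global wire keeps its length while gates get faster, so its delay
# measured in gate delays grows as technology scales down.
global_wire = wire_delay(10.0, R_PER_MM, C_PER_MM)
for gate_delay in (50e-12, 25e-12, 12.5e-12):   # hypothetical gate delays
    print(f"gate delay {gate_delay * 1e12:.1f} ps: "
          f"global wire costs {global_wire / gate_delay:.0f} gate delays")
```

Because both resistance and capacitance are proportional to length, the delay in this model grows with the square of the wire length.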
CMPs cope well with this challenge. As technology is scaled, the size of the individual core is
decreased. The local wiring scales down with core size. However, inter-processor communication
uses global wires. Consequently, this on-chip communication will become more expensive as
technology scales down. CMP researchers should take this trend into account when proposing
new techniques.
Figure 2.3: Wire Scaling and Technology Scaling (Reproduced from [HHM99])
2.1.3 CMP Design Options
This section investigates the high-level architectural choices that can be made when designing a
CMP. The placement and sharing status of the last-level cache is especially important. In this
report, the last-level cache is the on-chip cache closest to memory. This section is based on the
Chip Multiprocessor (CMP) discussion in my preliminary project [Jah06].
This section is organised as follows:
• Firstly, conventional CMPs are discussed. These CMPs are extensions of traditional single-
core designs.
• Tiled CMPs are discussed next. Here, the processor core, caches and communication
routers are allocated on a tile. This tile is then replicated throughout the chip.
• It is also possible to share some functional units between adjacent processing cores. This
type of CMP is called a Conjoined Core CMP.
• Recently, CMP designs that are distributed over several stacked wafers have been proposed.
These are called 3D CMPs.
• Lastly, a number of commercial CMP designs are discussed. Since the details of these
designs often are closely guarded secrets, they will receive less attention than their academic
counterparts.
2.1.3.1 Conventional CMPs
The term Conventional CMP does not appear in the literature. However, this important class
of CMPs needs a name. They will be called Conventional CMPs in this report.
Conventional CMPs have evolved from single-core processors. According to Spracklen and Abraham
[SA05], CMPs have so far gone through three generations. These generations are shown in
figure 2.4. In the first generation of CMPs, ease of design was the primary constraint. Consequently,
two nearly independent single-core designs were added to the same chip. As shown in
figure 2.4(a), only the off-chip memory controller and memory bus are shared.
(a) First Generation (b) Second Generation (c) Third Generation
Figure 2.4: Chip Multiprocessor Generations (Reproduced from [SA05])
Figure 2.4(b) shows a second generation CMP-design. Here, the processing cores have been
designed specifically for inclusion in a CMP. Furthermore, the L2 cache is shared. The move
to the third generation is carried out by increasing the number of cores and adding Simul-
taneous Multithreading (SMT). In SMT, instructions are fetched from more than one thread
simultaneously. In this case, a new thread can start execution the clock cycle after the run-
ning thread encountered a long latency event. Spracklen and Abraham coined the term Chip
Multithreading (CMT) for this approach [SA05].
A third generation CMP can achieve very high throughput when many threads are available.
However, to fit many cores on a chip, the cores must be reasonably simple. Consequently, the
execution time for a single thread can be quite long.
In server workloads, handling many threads efficiently is more important than the execution
time of a single thread. Therefore, third generation CMPs are probably a good choice in these
machines. Many threads are also available in a desktop machine. However, the execution time
of a single thread does matter in this case. Therefore, a second generation CMP with good
single-thread performance might be more appropriate. This highlights the important trade-off
between throughput and single-thread performance.
The last-level cache in second and third generation CMPs can be private or shared. A private
cache design has a lower latency because the cache is smaller than a shared one. Furthermore,
conflicts between two cores are not possible as it is used by only one core.
However, there are two disadvantages. Firstly, global cache utilisation might be low. For example,
the thread running on core A might only use a small amount of its cache space while the
thread running on core B uses all its cache space. In this case, the thread on core B would run faster with
a shared cache because it could use the free space in A’s cache. Secondly, inter-core communication
is slow. The reason is that sharing is implemented between the last-level caches and
memory. Consequently, inter-core communication uses the off-chip memory bus.
It is possible to achieve the advantages of both designs. In the shared case, the key idea is that
cache blocks should be placed in cache banks that are physically close to the processor that uses
this block. This reduces hit latency. Exposing this difference in delay between cache banks results
in what Kim et al. refer to as a Non-Uniform Cache Access (NUCA) architecture [KBK02].
In the private case, the idea is that a processor core can borrow space in a neighbouring core’s
cache. This increases cache utilisation. Chishti et al. [CPV05], Chang and Sohi [CS06] as well
as Dybdahl and Stenström [DS07] have proposed such architectures.
Figure 2.5: A Tiled Chip Multiprocessor (Reproduced from [ZA05])
2.1.3.2 Tiled CMPs
A tiled CMP is shown in figure 2.5. These CMPs are very similar to second or third generation
CMPs with private caches. The main difference is that in tiled CMPs each tile is designed once
and then replicated throughout the chip. This makes design and floor planning easier. However,
there is no clean division in the terminology so the classification is somewhat overlapping.
In a plain tiled CMP architecture, each core has its private last-level cache. This might result in
poor global cache utilisation. Again, this can be improved if a core is allowed to borrow space
in a neighbouring core’s cache. Zhang and Asanović’s Victim Replication scheme makes this
possible [ZA05]. Victim Replication is a way of implementing a shared cache from a number of
per-tile private caches and differs from the previous techniques in that it is designed specifically
for tiled CMPs. In this technique, a cache block is first brought into the L2 cache bank that
is responsible for its address. Furthermore, the block is stored in the local L1 of the processor
that requested it. When the cache block is evicted from the local L1 cache, a replica of the
block is kept in the local L2 cache. In this way, the block migrates towards where it is used.
Consequently, fast access time and good global cache utilisation are achieved.
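The block movement described above can be sketched as a small behavioural model. This is a simplified illustration and not the implementation from [ZA05]: the class and method names are hypothetical, the home bank is assumed to be chosen by address interleaving, and capacity limits, replacement and coherence actions are ignored.

```python
class Tile:
    """One tile: a private L1 plus the local bank of the L2."""

    def __init__(self):
        self.l1 = set()           # blocks currently in the local L1
        self.l2_home = set()      # blocks for which this bank is the home
        self.l2_replicas = set()  # replicas of blocks evicted from the local L1


class TiledCMP:
    def __init__(self, num_tiles):
        self.tiles = [Tile() for _ in range(num_tiles)]

    def home_of(self, block):
        # Assumption: the home bank is selected by simple address interleaving.
        return block % len(self.tiles)

    def access(self, tile_id, block):
        """Report where the block was found, then place it in the local L1."""
        tile = self.tiles[tile_id]
        home = self.tiles[self.home_of(block)]
        if block in tile.l1:
            return "L1 hit"
        if block in tile.l2_replicas:
            where = "local replica hit"
        elif block in home.l2_home:
            where = "home bank hit"
        else:
            home.l2_home.add(block)  # fetched from memory into the home bank
            where = "memory"
        tile.l1.add(block)
        return where

    def evict_from_l1(self, tile_id, block):
        """On an L1 eviction, keep a replica in the local L2 bank."""
        tile = self.tiles[tile_id]
        tile.l1.discard(block)
        if self.home_of(block) != tile_id:  # the home bank already holds it
            tile.l2_replicas.add(block)


chip = TiledCMP(num_tiles=4)
print(chip.access(0, 5))   # first touch: block 5 goes to its home bank and to tile 0's L1
chip.evict_from_l1(0, 5)   # the victim stays in tile 0's local L2 bank
print(chip.access(0, 5))   # served locally instead of from the remote home bank
```

In this sketch the replica stays in the local L2 bank after it is reused, which is a simplification made for brevity.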
2.1.3.3 Conjoined Core CMPs
Kumar et al. have proposed Conjoined Core CMPs [KJT04]. In this case, cores that are physically
close to each other can share resources. It might be more area efficient to share these resources
between cores if they are used rarely. However, the additional wiring complexity must be taken
into account. The reason is that the wiring overhead can quickly outweigh any area benefits.
Kumar et al. investigated the sharing of floating-point units, crossbar ports, data caches and
instruction caches. They showed that processor core area can be substantially reduced with a
small performance degradation.
2.1.3.4 3D Chip Multiprocessors
As feature size is decreased, the relative impact of interconnect delay grows as discussed in
section 2.1.2. 3D CMPs, as shown in figure 2.6, have been proposed to reduce this problem.
Here, multiple wafers are placed on top of each other and vertical vias provide inter-wafer
communication. Since the vertical connections are short, this creates the possibility of placing
many cache banks close to each core. This reduces interconnect delay because the delay through
a wire mainly depends on its length. Li et al. [LNR+06] proposed and evaluated this design
option and report promising results. However, a number of implementation issues must be
resolved before this type of CMP can be implemented.
Figure 2.6: A 3D Chip Multiprocessor (Adapted from [LNR+06])
2.1.3.5 Commercial CMPs
As mentioned earlier, the information commercial companies publicly provide about their CMP
implementations is in many cases intentionally vague. Consequently, this section will focus on
relatively high level architectural choices.
A number of commercial companies sell CMPs:
• The Intel Core Duo processor is a shared L2 cache, two-core CMP [GMNR06].
• AMD Athlon 64 X2 is a two-core CMP with per-core private L2 caches [AMD].
• Another dual-core processor with a shared L2 cache is the IBM Power 5 processor [SKT+05,
KST04]. Each core is two-way multithreaded.
• Sun Microsystems has implemented an 8-core CMP with a shared L2 cache called Ultra-
SPARC T1 or Niagara [KAO05]. Here, each core is 4-way multithreaded. Recently, Sun
has announced the 8-core Niagara 2 processor [McG06]. The cores in this processor are
8-way multithreaded so this processor can execute 64 threads simultaneously.
The processors are located at different solution points in the continuum between single-thread
performance and throughput. For instance, the UltraSPARC T1 has a large throughput focus.
The number of advanced out-of-order features must be limited when eight 4-way multithreaded
cores are placed on a single chip. In fact, when a thread issues a multiply or divide instruction,
the thread is suspended. Consequently, single-thread performance is likely to be low. However,
high throughput is expected when a sufficient number of threads are available.
The Intel Core Duo and AMD Athlon 64 X2 processors emphasise single-thread performance.
Here, powerful, out-of-order cores harvest ILP from the instruction stream. Consequently, a
single-threaded application will run faster. The Power 5 processor is somewhat of a compromise
as it has two, two-way multithreaded cores.
Another difference between the processors is whether the L2 cache is shared or not. The
AMD Athlon 64 X2 processor has private L2 caches while the other processors have shared caches.
This indicates that there is no agreement on what constitutes the best solution to this problem
at this time.
2.2 Interconnect
As mentioned earlier, this section will focus on the interconnect between the private L1 caches
and the shared L2 cache in a shared cache CMP. Consequently, the interconnected nodes are
the L1 caches and the L2 cache banks. However, the interconnection networks are general and
can be used in other CMP architectures as well.
Balfour and Dally investigated the design trade-offs in tiled CMP interconnection networks
[BD06]. They focused on the mesh, concentrated mesh, torus, fat-tree and tapered fat-tree
topologies. In addition, Kumar et al. [KZT05] have investigated bus, hierarchical bus and crossbar
topologies. These seem to be the most important works investigating interconnection
networks in CMPs. Consequently, they will play a central part in this section.
The section follows this outline:
• Section 2.2.1 discusses how on-chip interconnects differ from off-chip interconnects and to
what extent the large body of research carried out on interconnection networks can be
reused in the CMP context.
• Section 2.2.2 discusses network topology. A topology is a description of how network nodes
are connected to each other.
• Routing is the task of choosing how to get from one node to another and is discussed in
section 2.2.3.
• Finally, section 2.2.4 discusses flow control. Flow control manages the allocation of re-
sources to data that flow through the network.
2.2.1 On-Chip and Off-Chip Interconnects
This section is primarily based on Dally’s presentation at the “Workshop on On- and Off-Chip
Interconnection Networks for Multicore Systems” [Dal06]. Furthermore, a quick note on termi-
nology is in order. As will be discussed in section 2.2.3, the main area consuming parts of a
router are the buffers and switches. Consequently, when buffers and switches are discussed in
this section, they refer to functional units within routers.
Off-chip interconnects have the following characteristics:
• The cost of the interconnect is determined by the number of transmission channels. This
determines the number of pins used for the different chips, which connectors can be used,
the number of cables and the number of optical units. Optical channels have lower latency
than electrical channels but are more expensive. Consequently, they should only be used
for channels where low latency is important.
• Network nodes are relatively far apart. Consequently, latency is high in general.
On-chip interconnects are characterised by:
• The cost of the interconnect is primarily the area used for buffers and switches.
• A lot of wires can easily be added. However, repeaters must be added to avoid signal
decay. In addition, pipelining transfers makes it possible for the network to work at a
clock frequency similar to the processor clock frequency. In this case, flip-flops or latches
are inserted. It might be advantageous to add some additional hardware such that these
devices can handle certain network functions.
• Network nodes can communicate in a few processor clock cycles. Consequently, the latency
is considerably lower than for off-chip networks.
• The properties of the channels, buffers and switches influence the performance of a given
topology. Consequently, finding the most suitable interconnection network for a given
design requires optimising all these components together.
• Power consumption is a first order design constraint.
[Figure: two block diagrams. (a) Shared L2 Cache: two processors with private L1 caches, an interconnect, a shared L2 cache and main memory. (b) Private L2 and Shared L3 Cache: two processors with private L1 and L2 caches, an interconnect, a shared L3 cache and main memory.]
Figure 2.7: Possible Levels of Sharing in a CMP
From these lists, two effects are apparent. Firstly, routers are expensive and wires are cheap in
on-chip networks. This differs from traditional multiprocessor networks. Consequently, network
topologies with few routers and many wires are likely to perform well on CMPs. However, the size
of the router depends on the width of the transmission channels. Therefore, wide channels also
create large, expensive routers. Balfour and Dally [BD06] found that providing two independent
networks gave good performance.
Moving communication closer to the processor creates the opportunity for fast interprocessor
communication. However, it also creates additional design constraints. Even though the com-
munication traffic depends on the application and is independent of where in the CMP com-
munication is implemented, non-communication traffic is more frequent closer to the processor
cores. This point is illustrated by figure 2.7 which shows two possible shared cache CMP de-
signs. In figure 2.7(a), the interconnect has to handle all regular L1 cache misses in addition to
the communication traffic created by the application. The pressure on the interconnect can be
traded off against an increased communication latency by inserting private L2 caches as shown
in figure 2.7(b). Since this cache can be larger than the L1 cache, the number of misses will
be reduced. Consequently, the impact of interconnect latency on overall system performance is
likely to be larger when a shared L2 solution is chosen since congestion is more likely to occur
in this case.
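The trade-off between the two designs in figure 2.7 can be illustrated with a small calculation of the non-communication traffic that reaches the interconnect. This is a sketch under stated assumptions: the function name interconnect_requests and all miss rates are hypothetical, and communication traffic is left out because the text states it is independent of where sharing is implemented.

```python
def interconnect_requests(accesses, l1_miss_rate, private_l2_miss_rate=None):
    """Non-communication requests that reach the interconnect.

    Without a private L2 (figure 2.7(a)) every L1 miss crosses the interconnect.
    With a private L2 (figure 2.7(b)) only L1 misses that also miss in the
    private L2 do. private_l2_miss_rate is the fraction of L1 misses that miss
    in the private L2.
    """
    l1_misses = accesses * l1_miss_rate
    if private_l2_miss_rate is None:
        return l1_misses
    return l1_misses * private_l2_miss_rate


# Hypothetical workload: 1000 memory accesses, 5% L1 miss rate, and 40% of
# the L1 misses also miss in a private L2.
shared_l2 = interconnect_requests(1000, 0.05)
private_l2 = interconnect_requests(1000, 0.05, private_l2_miss_rate=0.40)
print(f"shared L2 design: {shared_l2:.0f} requests on the interconnect")
print(f"private L2 design: {private_l2:.0f} requests on the interconnect")
```

With these illustrative numbers the private L2 filters out more than half of the requests, at the price of an extra cache level between the cores and the point where sharing takes place.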
As mentioned, a large amount of research has been carried out on interconnection networks in
traditional multiprocessors. The discussion in this section highlights that this research needs
to be taken into account when designing on-chip interconnects. However, the trade-offs will be
different and there is probably still room for innovation.
Network-on-chip is a popular research topic in the embedded systems domain [BM02]. Here, a
main concern is to create an optimised interconnection network for a given SoC design. In this
case, the workloads and communication patterns are known. This enables radical optimisations
that cannot be used in a general CMP interconnect. However, CMP interconnects and on-chip
networks use similar building blocks, such as power efficient routers. In other words, the
components used are similar, but the embedded system designer often knows more about the
workload.
[Figure: tree of terms. Interconnection Network divides into Node Arrangement (Direct, Indirect) and Connection Flexibility (Blocking, Non-Blocking, Non-Interfering); Non-Blocking divides into Rearrangeably Non-Blocking and Strictly Non-Blocking]
Figure 2.8: Interconnection Network Terminology
2.2.2 Topology
This section discusses some of the possible topologies for an L1 cache to L2 cache interconnection
network in a shared-cache CMP. In this case, the nodes of the network are the L1 caches and
L2 banks and the channels are point-to-point links. A topology refers to how the channels
and nodes of a network are laid out. Furthermore, a network can be direct or indirect. In a
direct network, all channels run between network terminals. An indirect network has nodes that
only perform a switching function and do not have a terminal associated with them. Indirect
networks are also known as multistage interconnection networks (MINs).
There is also a distinction between blocking and non-blocking networks. In a non-blocking net-
work, a path can be formed between all network inputs and all network outputs without a
conflict occurring. In this context, a conflict is a busy channel. If new connections can be
set up incrementally without rerouting an existing connection for all input permutations, the
network is said to be strictly non-blocking. Alternatively, a rearrangeably non-blocking network
might reroute established connections to accommodate new ones. If conflicts can occur, the
interconnection network is blocking.
According to Dally and Towles, creating non-blocking networks is overkill in packet switched
networks [DT03]. The reason is that by allocating network resources carefully, it is possible
to guarantee that one flow does not deny the service of another flow for more than a short time
period. Consequently, it is possible to place an upper bound on the delay. If the network meets
these criteria, Dally and Towles refer to it as a non-interfering network.
The terminology introduced so far in this section is summarised in figure 2.8.
A frequently used term when describing topologies is bisection bandwidth. To understand this
term, the notion of a cut must be understood first. A cut is a set of channels that partition
the nodes of a network into two groups. A cut is a bisection if two conditions are met. Firstly,
half of the nodes must be in one partition and the other half must be in the other partition.
Secondly, half of the terminal nodes must be in one partition and the other half must be in the
other partition. Of course, the number of nodes might be odd so one partition can have one
more node than the other. The bisection bandwidth is then the minimum bandwidth over all
possible bisections of the network.
Term | Explanation
Radix | The number of inputs or outputs to each switching node. In other words, if a switch has 2 inputs and 2 outputs, its radix is 2.
Throughput | The data rate in bits per second that the network accepts per input port.
Ideal Throughput | The throughput of the network with perfect routing and flow control. These concepts will be discussed in sections 2.2.3 and 2.2.4, respectively.
Maximum channel load | The load on the most heavily loaded channel in the network under a particular traffic pattern.
Head latency ($T_h$) | The time it takes for the head of the packet to traverse the network.
Serialisation latency ($T_s$) | The time it takes for a packet of length $L$ to traverse a channel with bandwidth $b$, i.e. $T_s = L/b$.
Total latency | $T = T_h + T_s = T_h + L/b$
Path diversity | The number of minimal paths between two nodes in the network.
Table 2.1: Topology Terminology
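The definition of bisection bandwidth can be checked by brute force on small networks. The following Python sketch is an illustration only: the function names, the unit channel bandwidth and the assumption that every node is also a terminal are choices made here, not part of the cited work.

```python
from itertools import combinations

def bisection_bandwidth(nodes, channels):
    """Minimum cut bandwidth over all balanced two-way partitions.

    channels maps frozenset({u, v}) to the bandwidth of that channel.
    Assumption: every node is also a terminal (a direct network).
    """
    nodes = list(nodes)
    half = len(nodes) // 2          # with an odd count one side gets one more
    best = None
    for group in combinations(nodes, half):
        group = set(group)
        # A channel is part of the cut if exactly one end is in the group.
        cut = sum(bw for ends, bw in channels.items()
                  if len(ends & group) == 1)
        best = cut if best is None else min(best, cut)
    return best

def ring(n, bw=1):
    """Channels of an n-node ring with equal bandwidth (hypothetical units)."""
    return {frozenset({i, (i + 1) % n}): bw for i in range(n)}

def mesh2d(k, bw=1):
    """Channels of a k x k mesh with equal bandwidth (hypothetical units)."""
    ch = {}
    for x in range(k):
        for y in range(k):
            if x + 1 < k:
                ch[frozenset({(x, y), (x + 1, y)})] = bw
            if y + 1 < k:
                ch[frozenset({(x, y), (x, y + 1)})] = bw
    return ch

ring_bw = bisection_bandwidth(range(8), ring(8))
mesh_bw = bisection_bandwidth([(x, y) for x in range(4) for y in range(4)],
                              mesh2d(4))
```

With unit-bandwidth channels, the 8-node ring has a bisection bandwidth of 2 and the 4 x 4 mesh has a bisection bandwidth of 4. The exhaustive search is only practical for small networks, which is sufficient for illustrating the term.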
The interconnection network field of research is terminology-rich, and a detailed discussion of
all terminology is beyond the scope of this report. However, table 2.1 describes a few additional
terms.
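As a small worked example of the latency terms in table 2.1, the sketch below computes the total latency of a packet. The numbers (a head latency of 10 cycles, a 512 bit packet and a channel bandwidth of 64 bits per cycle) are invented for illustration.

```python
def total_latency(head_latency, packet_length, bandwidth):
    """T = T_h + T_s, where the serialisation latency is T_s = L / b."""
    serialisation = packet_length / bandwidth
    return head_latency + serialisation

# Hypothetical values: T_h = 10 cycles, L = 512 bits, b = 64 bits per cycle.
example = total_latency(head_latency=10, packet_length=512, bandwidth=64)
```

Here the serialisation latency is 512 / 64 = 8 cycles, so the total latency is 18 cycles. Doubling the channel bandwidth would only remove 4 of these cycles, since the head latency is unaffected.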
2.2.2.1 Direct Networks
This section discusses some of the direct interconnect networks used in recent CMP publications.
It has the following outline:
• First, the shared bus topology is discussed.
• Then, the crossbar topology is presented.
• Lastly, the mesh, torus, hypercube and ring topologies are discussed together because they
are variations of a single connection scheme.
Shared Bus
A bus is a simple topology where all network nodes are connected to the same channel. An
important reason for using a bus interconnect is that it is simple to construct. Furthermore, it
makes it possible to use a snooping coherence protocol. This further simplifies the design. The
main drawback is that it can become a performance bottleneck.
Kumar et al. have evaluated bus interconnects in a CMP context [KZT05]. Their CMP has
private L2 caches and 4, 8 or 16 cores, and the bus design is shown in figure 2.9. Here, four
processors and L2 caches are connected to the bus interconnect. Note that there is one arbitration
queue and one data queue for each L2 cache. Naturally, there is only one address arbiter and
one data arbiter in the system. The bus implemented in this work is considerably simpler than
this design as this simplifies both the implementation and the result interpretation. However,
this might lead to the performance of the bus being overestimated.
Figure 2.9: The Bus Interconnect from Kumar et al. [KZT05] (diagram labels: Processor Cores and
L1 Caches; L2 Caches; Address Arbiter; Data Arbiter; Arbitration Queues; Data Queues; Address
Queue; Response Queue; Bookkeeping; Address Bus; Snoop Bus; Response Bus; Data Bus)
A typical read transaction on this bus proceeds as follows:
1. The requester tells the address arbiter that it wants to access the bus.
2. When it is granted access, it sends its request over the address bus.
3. The request arrives in the address queue and is sent over the snoop bus when this is free.
4. All nodes listen to the snoop bus and all nodes that have information about the cache
block in question put a response on the response bus after a fixed delay. Then, a message
is broadcast over the response bus from the bookkeeping unit. This message takes into
account the responses from each of the caches and informs the caches of which action they
should take next. Examples of possible actions are data transfers and invalidations.
5. Finally, the responding node asks the data arbiter for access to the data bus. When access
is granted, the data is sent to the original requester.
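The read transaction above is a fixed sequence of phases, so its uncontended latency is simply the sum of the phase latencies. The sketch below makes this explicit. All cycle counts are invented placeholders and are not taken from [KZT05]; the phase names follow the steps in the list.

```python
# Phases of one read transaction, in the order given in the list above.
# The cycle counts are hypothetical and only serve as an illustration.
READ_PHASES = [
    ("address arbitration", 2),
    ("address bus transfer", 3),
    ("snoop bus broadcast", 3),
    ("snoop responses and bookkeeping broadcast", 4),
    ("data arbitration", 2),
    ("data bus transfer", 5),
]

def read_latency(phases, queue_delay=0):
    """Uncontended latency plus an optional wait in the address queue."""
    return queue_delay + sum(cycles for _, cycles in phases)

def timeline(phases, start=0):
    """Return (phase, start_cycle, end_cycle) tuples for one transaction."""
    result, t = [], start
    for name, cycles in phases:
        result.append((name, t, t + cycles))
        t += cycles
    return result

uncontended = read_latency(READ_PHASES)
contended = read_latency(READ_PHASES, queue_delay=4)
```

With these placeholder numbers the data transfer accounts for only 5 of the 19 cycles, which illustrates why the administrative broadcast phases matter for bus performance.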
Both the address bus and the snooping bus are broadcast buses. Therefore, it is possible for the
nodes to snoop on the address bus instead of the snoop bus. However, the snoop bus might be
shared with other chips as the bus is located on the memory side of the last-level cache. The
needed data might be cached on a different chip, and this chip is only allowed access to the
snoop bus. In other words, the snoop bus is the point of serialisation. These broadcast buses
have a significant delay. However, the bidirectional, pipelined data bus ensures that data can
be transferred quickly when the administration tasks are finished.
Kirman et al. [KKD+06] have evaluated the use of optical buses for private last-level cache
CMPs. Optical interconnects provide low-latency, high bandwidth channels. Kirman et al.
maintain that technological advances in CMOS compatible optical components make optical
interconnects a potential replacement for electrical components around 2013. Obviously, there
is a great deal of uncertainty associated with this number. As expected, optical buses outperform
electrical buses in Kirman et al.’s evaluation.
Figure 2.10: Crossbar Topology (diagram labels: two CPU and L1 cache nodes; L2 Banks 1 to 4;
switches (S) at the crossing points; Data and Address In; Data and Address Out; Data In; Data
Out; Coherence Bus)
Crossbar
A crossbar directly connects n inputs to m outputs. If n = m, the crossbar is square. Otherwise,
it is rectangular. Furthermore, a crossbar is strictly non-blocking as every input can be connected
to any output incrementally without influencing any other connections [DT03].
Crossbar interconnects are very popular in shared last-level cache CMPs. They are used in
IBM’s Power 4 and Power 5 as well as Sun’s Niagara [KZT05, KAO05]. However, it is difficult
to find precise technical information about these interconnects. Luckily, Kumar et al. evaluate
a crossbar interconnect based on the interconnect used in the Power 4 and Power 5 processors
[KZT05]. Their crossbar design is shown in figure 2.10. Here, there are data and address lines
in L1 to L2 direction and data lines in the L2 to L1 direction. This crossbar differs slightly from
the crossbar implemented in this work as will be discussed in chapters 3 and 4.
The crossbar shown in figure 2.10 is actually two crossbars. First, there is a rectangular crossbar
with two inputs and four outputs in the CPU to L2 cache direction. In addition, there is a
rectangular crossbar with four inputs and two outputs in the L2 to CPU direction. By closing
the appropriate switch, any input can be connected to any output. Furthermore, all channels
only send data in one direction as this makes pipelined data transfer possible. If the electrical
drivers are sufficiently powerful, multicast can be enabled by closing more than one switch.
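The behaviour of a crossbar can be sketched with a small model in which one switch exists for every input and output pair. The class below is a hypothetical illustration, not the design from [KZT05]. It assumes that an output can be driven by at most one input, and that an input may drive several outputs (multicast).

```python
class Crossbar:
    """n-input, m-output crossbar with one switch per (input, output) pair."""

    def __init__(self, n_inputs, n_outputs):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.closed = {}              # output -> input currently driving it

    def switch_count(self):
        return self.n_inputs * self.n_outputs

    def connect(self, inp, out):
        """Close switch (inp, out); fails only if the output is already driven."""
        if out in self.closed:
            return False
        self.closed[out] = inp
        return True

    def release(self, out):
        """Open whichever switch currently drives the output."""
        self.closed.pop(out, None)

# The CPU to L2 direction with two CPUs and four L2 banks.
cpu_to_l2 = Crossbar(n_inputs=2, n_outputs=4)
first = cpu_to_l2.connect(0, 2)       # CPU 0 to bank 2
second = cpu_to_l2.connect(1, 3)      # set up without touching the first
clash = cpu_to_l2.connect(1, 2)       # bank 2 is already in use
```

The second connection succeeds without rerouting the first, which is the strictly non-blocking property. The only failure is the attempt to reach an output that is already busy.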
Of course, the high performance of a crossbar does not come for free. Since each network node is
connected to all other nodes, there are n connections per node. If there are n nodes in the system,
this gives a growth on the order of $n^2$. This theoretical analysis was validated experimentally by
Kumar et al. They found that if the die size is kept constant, the area overhead of including a
crossbar reduces overall performance. The reason is that the L2 cache size must be reduced, and
this reduction hurts performance more than the introduction of cache sharing improves it.
In a shared L2 cache CMP, the frequent case is L2 access. Therefore, this must be supported in
an efficient manner. However, inter-processor communication requires transferring data between
L1 caches. There are three ways of doing this. Firstly, the crossbar can be extended to allow
L1 to L1 transfers as well. Secondly, data from one L1 to another can be sent via an L2 cache.
The first solution is expensive in terms of area while the second option is slow. Thirdly, as a
compromise, a bus can be added between all L1 caches. This is the solution used in the Power 4
and Power 5.
Figure 2.11: Mesh and Torus Topologies (Reproduced from [BD06]). Panels: (a) Mesh,
(b) Concentrated Mesh, (c) Torus.
Mesh and Torus
Mesh and torus networks are classes of a network family called cubes. Here, the network consists
of $N = k^n$ nodes which are arranged in an n-dimensional grid with k nodes in each dimension.
Each node is connected to its nearest neighbours. If n = 1 the topology is known as a ring and
if k = 2 the topology is called a hypercube. Otherwise, it is known as a mesh or a torus. The
difference between these two is that a torus has connections from each node on the edge of the
network to another edge node. In a mesh, these edge nodes have fewer connections than nodes
in the middle of the network. Furthermore, a ring is a 1-dimensional torus and a hypercube is
a mesh with two nodes in each dimension, so its node count is a power of 2 [DT03].
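The cube family can be illustrated by computing the neighbours of a node from its coordinates. The sketch below is a minimal illustration with hypothetical function names. Nodes are tuples of n coordinates in the range 0 to k - 1, and the wrap flag selects between a torus and a mesh.

```python
from itertools import product

def all_nodes(k, n):
    """All N = k**n nodes of an n-dimensional grid with k nodes per dimension."""
    return list(product(range(k), repeat=n))

def neighbours(node, k, wrap):
    """Nearest neighbours of a node; wrap=True gives a torus, False a mesh."""
    result = set()
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            c = coord + step
            if wrap:
                c %= k                      # edge nodes connect around
            elif not 0 <= c < k:
                continue                    # a mesh has no link past the edge
            other = node[:dim] + (c,) + node[dim + 1:]
            if other != node:
                result.add(other)
    return result

mesh_corner = neighbours((0, 0), k=4, wrap=False)
torus_corner = neighbours((0, 0), k=4, wrap=True)
ring_node = neighbours((0,), k=8, wrap=True)          # n = 1: a ring
cube_node = neighbours((0, 0, 0), k=2, wrap=True)     # k = 2: a hypercube
```

The corner node of the 4 x 4 mesh has two neighbours, while the same node in the torus has four, as every torus node has the same number of connections. With k = 2 the wrap-around link coincides with the direct link, so each hypercube node has exactly n neighbours.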
The mesh, concentrated mesh and torus topologies are shown in figure 2.11. Here, each square
is a processor tile and each dot is a network router. The lines connecting the dots are channels.
Balfour and Dally evaluated these topologies for a tiled CMP [BD06].
Figure 2.11(a) shows a mesh topology. Here, a network node that is not on the edge of the
network is connected to all its neighbours. Edge nodes are connected to all available neighbours.
A torus topology is shown in figure 2.11(c). For nodes that are not on the edge of the network,
the torus topology and the mesh topology are the same. However, in the torus case, edge
nodes have the same number of connections as the internal nodes. Figure 2.11(c) shows a folded
topology. In this case, the connections from the edge nodes do not go around the whole network,
but to a neighbouring node. Here, all channels have the same length but the average channel
length is longer.
Figure 2.11(b) shows a concentrated mesh topology. This is a variation on a mesh topology
where four processors share a router. In addition, express channels are added for the edge
nodes. These channels reduce the number of hops needed in the worst case for communication
between an arbitrary pair of nodes.
Balfour and Dally introduced this topology for tiled CMPs. They found that a concentrated
mesh is superior to a regular mesh and a torus in terms of performance, power and area. The
main reason is that an average transaction requires fewer hops when the concentrated mesh is
used. In addition, the concentrated mesh performs better than the fat-tree and tapered fat-tree
topologies discussed later in this section. The fat-tree and tapered fat-tree topologies have a low
average hop count but their wiring complexity makes them less desirable.
Figure 2.12: Number of Channels in a Butterfly and a Crossbar (plot of the number of channels,
on a logarithmic axis from 1 to 100000, against the number of nodes, from 2 to 128; series:
Crossbar, Butterfly (Radix 2), Butterfly (Radix 4))
Another reason for these good results is that Balfour and Dally were able to create two
independent copies of the mesh and concentrated mesh networks while hiding the area overhead of the
second network. The reason was that they had allocated a specific portion of the total chip area
for interconnect use. Since the mesh and concentrated mesh use little area, they were able to
create two networks in the allocated space. This area could have been used to create larger L2
caches, but Balfour and Dally did not pursue this.
It is unclear to what extent the findings of Balfour and Dally can be extended to shared L2
CMPs. The first reason is that the topologies have only been evaluated for tiled CMPs. In tiled
CMPs, the L2 caches are private and co-located with the processors. This results in less pressure
on the interconnect as there will be fewer misses in these L2 caches than in the private L1 caches
in a shared L2 CMP. Secondly, only synthetic benchmarks were used in their evaluation. It is
unclear to what extent these represent real workloads.
As mentioned earlier, a ring is a 1-dimensional torus. Marty and Hill have proposed an extended
snooping cache coherence protocol that relies on the ordering properties of a ring interconnect
[MH06]. Their solution is discussed in section 2.3.1. The main point here is that this simpler
coherence solution makes a ring more attractive as the complexity of implementing a directory
cache coherence protocol can be avoided.
2.2.2.2 Indirect Networks
The choice of topology for a given design does to some extent depend on how well the topology
can be mapped to the components of the system. In this respect, meshes and tori map well to
tiled CMPs. A shared last-level cache CMP has a less regular structure and mapping meshes
and tori to this CMP type is more difficult. Consequently, it might be worthwhile to look into
indirect networks.
In general, indirect networks provide all-to-all connectivity at an $O(n \log n)$ cost [DT03]. In
comparison, realising this with a crossbar has a cost of $O(n^2)$. This relationship is illustrated
by figure 2.12 which plots the number of channels in the network against the number of nodes
for two different butterfly networks and a crossbar. The number of channels is plotted on a
logarithmic scale to enhance readability. Recall from the beginning of this section that the radix
of a network is the number of inputs or outputs of a switch. The function for computing the
number of channels in a butterfly network will be discussed shortly.
Figure 2.13: A Radix 2 Butterfly with 8 Nodes (terminal nodes 1 to 8 on each side; switches 0.1
to 0.4 in stage 0, 1.1 to 1.4 in stage 1 and 2.1 to 2.4 in stage 2)
This section discusses the butterfly and fat-tree topologies. The butterfly topology is chosen
because it has the minimum possible diameter for an N node network with switches of degree
δ [DT03]. Here, the diameter is the largest, minimal hop count for any pair of network nodes,
and the degree is the number of channels that terminate on a node. In other words, degree =
δ = 2 · radix because the radix is the number of input or output channels to a node. Fat-tree
topologies are discussed because they have been evaluated for tiled CMPs by Balfour and Dally
[BD06].
Butterfly
A butterfly network topology is determined by the radix of its switches and the number of
stages. Here, a stage is a group of switches. The letter k is often used to describe the radix and
n indicates the number of stages. These numbers must be chosen such that the relation $N = k^n$
holds. In this case, N is the number of nodes in the network.
A short example will make this clear. If N = 16, the following topologies are possible:
$N = 16 = k^n = 16^1$
$N = 16 = k^n = 4^2$
$N = 16 = k^n = 2^4$
Consequently, butterflies with radix 16, 4 and 2 are possible. The radix 16 butterfly will have 1
stage, the radix 4 butterfly will have 2 stages and the radix 2 butterfly will have 4 stages. A 1
stage butterfly is simply one crossbar switch.
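The example can be generalised by enumerating every radix and stage count that satisfies the relation between N, k and n. The function below is a small illustrative sketch; its name is chosen here and it uses integer arithmetic only to avoid rounding problems with logarithms.

```python
def butterfly_configs(num_nodes):
    """All (radix k, stages n) pairs with k**n == num_nodes and k >= 2."""
    configs = []
    for k in range(2, num_nodes + 1):
        n, size = 0, 1
        while size < num_nodes:      # multiply until k**n reaches num_nodes
            size *= k
            n += 1
        if size == num_nodes:
            configs.append((k, n))
    return configs

configs_16 = butterfly_configs(16)
```

For N = 16 this yields the pairs (2, 4), (4, 2) and (16, 1), matching the three butterflies listed above.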
Figure 2.14: Fat-Tree and Tapered Fat-Tree Topologies (Reproduced from [BD06]). Panels:
(a) Fat-Tree, (b) Tapered Fat-Tree.
Figure 2.13 shows a butterfly network with 8 nodes, a radix of 2 and 3 stages. This figure
illustrates how the switches are connected. First, the terminal nodes are placed in groups of 2
which are connected to one switch. The middle switch stage is then partitioned into two groups
and each of the first switches is connected to one switch in each group. This procedure is then
repeated for each of the two groups at the middle switch stage, creating four groups at the last
switch level. The number of switch groups created at each stage depends on the radix of the
switches which is 2 in this example. Consequently, this butterfly construction method is general.
From the above discussion and figure 2.13, we can see that the cost of a butterfly topology is
given by the following equations:
Number of switches: $n_s = n \cdot \frac{N}{k} = \log_k N \cdot \frac{N}{k}$
Number of channels: $n_c = k \cdot n_s = k \cdot \log_k N \cdot \frac{N}{k} = N \cdot \log_k N$
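The cost equations can be evaluated directly to reproduce the kind of comparison plotted in figure 2.12. The sketch below is an illustration with hypothetical function names. It assumes that the crossbar cost is modelled as N squared, in line with the quadratic growth discussed in section 2.2.2.1.

```python
def stages(num_nodes, radix):
    """Number of stages n such that radix**n == num_nodes."""
    n, size = 0, 1
    while size < num_nodes:
        size *= radix
        n += 1
    if size != num_nodes:
        raise ValueError("num_nodes must be a power of radix")
    return n

def butterfly_switches(num_nodes, radix):
    """n_s = n * N / k"""
    return stages(num_nodes, radix) * num_nodes // radix

def butterfly_channels(num_nodes, radix):
    """n_c = k * n_s = N * log_k(N)"""
    return radix * butterfly_switches(num_nodes, radix)

def crossbar_channels(num_nodes):
    """Assumed quadratic cost model for the crossbar."""
    return num_nodes ** 2

comparison = {
    "crossbar": crossbar_channels(64),
    "butterfly radix 2": butterfly_channels(64, 2),
    "butterfly radix 4": butterfly_channels(64, 4),
}
```

For 64 nodes the model gives 4096 channels for the crossbar, 384 for the radix 2 butterfly and 192 for the radix 4 butterfly, which shows how quickly the gap grows with the node count.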
A problem with the butterfly network is that it does not have path diversity. In other words,
there is only one path from one node to another. For example, assume that nodes 1 and 6 in
figure 2.13 attempt to send data to 7 and 8 at the same time. This is not possible as they
both need to traverse the link from switch 1.3 to 2.4. Adding an extra stage increases the path
diversity to 2 because there are now two possible paths between two nodes. If n − 1 extra stages are
added, the butterfly becomes a rearrangeably non-blocking Beneš network and the problem is solved [DT03].
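The lack of path diversity can be demonstrated with a small model of destination-tag routing. The sketch below rests on assumptions made here: nodes are numbered from 0, each stage replaces one digit of the source address with the corresponding digit of the destination address (most significant digit first), and a channel is identified by the resulting address. The numbering therefore does not necessarily match the labels in figure 2.13.

```python
def butterfly_path(src, dst, k, n):
    """Channel used after each stage; there is exactly one such path."""
    path = []
    for i in range(n):
        low = k ** (n - i - 1)       # weight of the digits not yet routed
        # Routed digits come from dst, remaining digits still come from src.
        path.append((dst // low) * low + src % low)
    return path

def conflict(p, q):
    """True if two paths need the same channel after some stage."""
    return any(a == b for a, b in zip(p, q))

a = butterfly_path(0, 6, k=2, n=3)
b = butterfly_path(4, 7, k=2, n=3)
c = butterfly_path(1, 7, k=2, n=3)
blocked = conflict(a, b)
free = conflict(a, c)
```

In this model the transfers from node 0 to node 6 and from node 4 to node 7 need the same channel after the first stage, so they cannot proceed at the same time even though their destinations differ. The transfers from 0 to 6 and from 1 to 7 use disjoint channels.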
Fat-Tree
The fat-tree and tapered fat-tree topologies are shown in figure 2.14(a) and 2.14(b), respectively.
The fat-tree is actually a number of interconnected trees. All nodes in each of the interconnected
trees have four children, and each sub-tree contains all nodes in the fat-tree. In other words, the
radix of the fat-trees seen here is 4. Consequently, there are many different ways to get from
one processor to another. The squares in the figures are processor tiles, the points are switches
and the lines are channels.
A more detailed look at figure 2.14(a) illustrates how the fat-tree topology is constructed. In this
work, an informal discussion is sufficient, and the precise mathematical definition is therefore
not discussed. Firstly, the leftmost root node is connected to the leftmost node in each group
of four nodes on the next tree level. This node is then connected to its four “closest” children.
In this context, closest means the four nodes directly below the four nodes in the group the
parent belongs to. Lastly, each node at the last switch layer is connected to four processors.
The processors are the leaf nodes of the tree.
As the previous example illustrates, the wiring complexity of a fat-tree is considerable. The
tapered fat-tree reduces this problem by reducing the bandwidth available towards the root.
As shown in figure 2.14(b), there are only four root nodes. These are in turn connected to all
nodes at the next tree level. Each of the lowest level intermediate nodes has two parents. By
comparing figure 2.14(a) and 2.14(b), we can see the available bandwidth in the tapered fat-tree
is comparable to the fat-tree when close to the leaves.
Balfour and Dally [BD06] evaluated the fat-tree and the tapered fat-tree topologies for a tiled
CMP. They found that the fat-tree has higher performance than the tapered fat-tree. The
downside is that the fat-tree uses considerably more area and power than the tapered fat-tree.
In addition, both topologies are inferior to the concentrated mesh topology described in section
2.2.2.1 in terms of performance, power and area. The reason is that the wiring complexity of the
tree topologies results in long transmission channels on the chip. Furthermore, these channels
must be narrower than the channels in the concentrated mesh to fit within the assigned area
budget. Naturally, longer lines lead to longer delays and more area used. In addition, more
power is needed to drive longer lines. Since these lines are narrower, each packet must be split
into smaller units than in the concentrated mesh case which makes the problems worse.
It is unclear whether Balfour and Dally’s results extend to a shared cache CMP as discussed in
section 2.2.2.1. The main problems are that the evaluation is based on tiled CMPs and that
only synthetic benchmarks were used in the evaluation.
2.2.3 Routing
Routing is the task of selecting which channel a given packet should traverse in order to get to
its destination. The routing task is highly dependent on the topology. This section introduces
the routing problem and how it can be solved. Then, a state-of-the-art router used by Balfour
and Dally is discussed [BD06].
2.2.3.1 Routing Algorithms
According to Dally and Towles, there are three classes of routing algorithms [DT03]:
• Deterministic routing algorithms always choose the same route between two nodes. The
main advantage of this class of algorithms is that they are easy to construct. If the under-
lying topology has multiple paths from one node to another, this is ignored. Consequently,
the primary disadvantage is poor load balancing.
• Oblivious routing algorithms distribute the network traffic over a set of possible paths
without taking the state of the network into account.
• Adaptive routing algorithms consider the network status when making a routing decision.
Examples of such status are whether a link is up or down, queue lengths and history status.
Dally and Towles state that the worst-case performance of these algorithms is often poor
compared to oblivious routing because status information is mostly local.
Furthermore, a routing algorithm is classified according to which paths it considers legal paths
between two nodes. A routing algorithm is minimal if it only considers the shortest paths
between two nodes as candidate paths. More flexibility is gained by considering longer paths as
well. In this case, the algorithm is called non-minimal.
Figure 2.15: An Example Router Architecture (Adapted from [BD06]). Panels: (a) Router
Architecture, (b) Crossbar Switch.
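The three classes of routing algorithms can be contrasted with a small sketch for a 2-dimensional mesh. The functions below are illustrative choices made here and are not taken from [DT03] or [BD06]: dimension-order routing stands in for the deterministic class, a random choice between two dimension orders for the oblivious class, and a queue-length comparison for the adaptive class. All three are minimal.

```python
import random

def xy_route(src, dst):
    """Deterministic dimension-order routing: finish X first, then Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def yx_route(src, dst):
    """The same algorithm with the dimensions swapped."""
    flipped = xy_route(src[::-1], dst[::-1])
    return [p[::-1] for p in flipped]

def oblivious_route(src, dst, rng=random):
    """Pick XY or YX at random without looking at the network state."""
    return rng.choice([xy_route, yx_route])(src, dst)

def adaptive_next_hop(cur, dst, queue_len):
    """Among the minimal next hops, pick the one with the shortest queue."""
    options = []
    if cur[0] != dst[0]:
        options.append((cur[0] + (1 if dst[0] > cur[0] else -1), cur[1]))
    if cur[1] != dst[1]:
        options.append((cur[0], cur[1] + (1 if dst[1] > cur[1] else -1)))
    if not options:
        return cur                   # already at the destination
    return min(options, key=lambda hop: queue_len.get(hop, 0))

fixed = xy_route((0, 0), (2, 1))
spread = oblivious_route((0, 0), (2, 1))
hop = adaptive_next_hop((0, 0), (2, 1), {(1, 0): 5, (0, 1): 1})
```

The deterministic route is always the same, the oblivious route is one of two equally long paths, and the adaptive choice avoids the neighbour with the longer queue. Note that the adaptive function only sees the queues of its immediate neighbours, which is the locality limitation mentioned in the list above.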
2.2.3.2 Router Design
An important contribution from Balfour and Dally’s work on tiled CMP interconnection
networks was their area and power models of the interconnect network [BD06]. The router used
in their work is shown in figure 2.15(a). Since the router is a basic building block of an inter-
connection network, it is helpful to have an idea of how it is constructed.
As shown in figure 2.15(a), the router consists of three main parts:
• The input module is responsible for receiving incoming data and buffering it.
• The switch is a crossbar connecting all input modules to all output modules. The details
of its design are shown in figure 2.15(b).
• Finally, the output module contains a register that temporarily stores the data. The reason
is that traversing the switch takes nearly a clock cycle. Consequently, the data must wait
for the next clock cycle before it can be sent over the channel.
The crossbar switch shown in figure 2.15(b) is called a segmented crossbar. Its main advantages are
a compact layout and low power dissipation [WPM03]. Here, the transmission lines are broken
into segments of approximately equal length with tri-state buffers on the boundary between
segments. Control signals are chosen in a way that minimises the number of active segments.
Consequently, power dissipation is reduced.
The Virtual Channel (VC) allocator and Switch (SW) allocator in figure 2.15(a) are part of the
virtual channel flow control policy implemented in this router. This controls the allocation of
resources to packets and is discussed in the next section.
2.2.4 Flow Control
The flow control task can be viewed in two ways. Firstly, it can be seen as a resource allocation
task. In this case, resources like buffers and channels are allocated to packets as they advance
through the network. In addition, it can be viewed as a conflict resolution task. If two packets
arrive at a router at the same time, the flow control policy decides which packet should be
forwarded first and how the blocked packet should be handled.
There are a number of different ways to do flow control [DT03]:
• In Bufferless flow control there is no temporary storage for packets. Consequently, they
are either dropped or routed along a different path. This simple solution is inefficient as
it uses network resources for packets that are dropped.
• Circuit Switching first reserves resources along a path through the network and then sends
one or more packets down this path. This policy is also simple to implement and the
buffers needed are small as only the packet header must be buffered. The downside is that
setting up a circuit is a considerable overhead if the circuit is only used by a few packets.
• In store-and-forward flow control, buffers are added to the routers. Furthermore, each
router waits until a packet is completely received before it is forwarded.
• Cut-through flow control improves store-and-forward flow control. In this case, the packet
is forwarded immediately if there are no conflicts. Consequently, latency is reduced.
• Wormhole flow control further improves cut-through flow control by introducing the notion
of a flow control digit (flit). Here, a packet is divided into multiple flits. The first flit
allocates a virtual channel through the network, and the rest of the flits follow this path.
This reduces the buffer space required as only a small number of flits need to be buffered
per virtual channel.
• Virtual channel flow control improves wormhole flow control. The problem with wormhole
flow control is that a channel is owned by a packet while buffers are allocated on a
flit-by-flit basis. Consequently, if the buffer fills up, the channel goes idle even though it
could be used by flits belonging to a different packet. Virtual channel flow control solves this
problem by adding a flit-queue for each router output.
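The latency benefit of cut-through over store-and-forward flow control can be quantified with the usual zero-load latency expressions. The sketch below assumes no contention, a fixed per-hop router delay and equal channel bandwidth on every hop. The example numbers are invented for illustration.

```python
def store_and_forward_latency(hops, router_delay, length, bandwidth):
    """Each router receives the whole packet before forwarding it."""
    return hops * (router_delay + length / bandwidth)

def cut_through_latency(hops, router_delay, length, bandwidth):
    """The packet is forwarded at once, so serialisation is paid only once."""
    return hops * router_delay + length / bandwidth

# Hypothetical values: 4 hops, 1 cycle per router, 256 bit packet, 64 bits/cycle.
saf = store_and_forward_latency(hops=4, router_delay=1, length=256, bandwidth=64)
ct = cut_through_latency(hops=4, router_delay=1, length=256, bandwidth=64)
```

With these numbers the store-and-forward latency is 20 cycles while the cut-through latency is 8 cycles. The difference grows with both the hop count and the packet length.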
The cut-through and wormhole flow control policies are often referred to as cut-through routing
and wormhole routing. However, Dally and Towles maintain that this terminology is imprecise
as cut-through and wormhole flow control are flow control policies and have nothing to do with
routing [DT03]. Hennessy and Patterson have adopted the improved terminology in the new
edition of their book [HP07].
2.3 Cache Coherence Protocols
A cache coherence protocol solves the cache coherence problem. This problem arises when at
least two processors write to the same cache block in two different caches at the same time.
Figure 2.16 is an example of this situation in a shared L2 cache CMP with write-back caches.
The numbers in the figure correspond to the numbers in the list below and define the ordering
of the actions.
Figure 2.16: Illustration of the Cache Coherence Problem (diagram labels: Processor 1's L1
Cache; Processor 2's L1 Cache; Shared L2 Cache; numbered actions 1 to 7 on block X with the
values 100, 120 and 90)
The actions in figure 2.16 are:
1. First, the L2 cache is the only cache with a copy of block X which has the value 100.
2. Processor 1 reads the value and stores it in its own cache.
3. Then, processor 2 reads and stores the value.
4. Processor 1 writes the value 120 to its copy of the cache line. This update is invisible to
both processor 2 and the L2 cache.
5. Processor 2 writes the value 90 to its cache line.
6. Some time later, the block is replaced from processor 1’s cache. It is then written back to
the L2 cache and the value is updated to 120.
7. Then, processor 2 writes back its copy of X. The value in the L2 cache becomes 90 and
the modification made by processor 1 disappears.
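The sequence of actions above can be replayed with a few lines of code. The class below is a deliberately naive model of a private write-back cache without any coherence support. Its name and interface are hypothetical and only serve to reproduce the lost update.

```python
class WriteBackCache:
    """Private write-back cache with no coherence protocol."""

    def __init__(self, l2):
        self.l2 = l2                 # shared L2, modelled as a dictionary
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.l2[addr]     # fill from the shared L2
        return self.lines[addr]

    def write(self, addr, value):
        self.read(addr)                          # write-allocate
        self.lines[addr] = value                 # invisible to everyone else

    def evict(self, addr):
        self.l2[addr] = self.lines.pop(addr)     # write back on replacement

l2 = {"X": 100}                                  # step 1
p1, p2 = WriteBackCache(l2), WriteBackCache(l2)
p1.read("X")                                     # step 2
p2.read("X")                                     # step 3
p1.write("X", 120)                               # step 4
p2.write("X", 90)                                # step 5
p1.evict("X")                                    # step 6
after_step_6 = l2["X"]
p2.evict("X")                                    # step 7
after_step_7 = l2["X"]
```

After step 6 the L2 cache holds 120, and after step 7 it holds 90. The write by processor 1 has been silently overwritten, exactly as described in the list.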
This behaviour cannot be allowed in a useful system. Consequently, writes to shared values
must be controlled, and a cache coherence protocol is a popular way of providing this control.
For a programmer to be able to use the memory system, a precise definition is needed of how
memory requests are ordered. This definition is called a memory consistency model [AG96,
GLL+98]. Intuitively, a read of a value should return the last value written. In sequential
programs this last write is defined by the order of operations in the program. For parallel programs
it is more complicated. One possibility is to implement sequential consistency. In this case, the
operations appear to execute one at a time and in one of the possible sequential orderings
defined in the program. However, this model disallows a number of hardware and compiler
optimisations. Therefore, relaxed consistency models have been proposed. Here, varying degrees
of freedom are given to reorder reads and writes in ways that do not affect program output.
Figure 2.17: Snooping Protocol Possibilities (diagram labels: Snooping Coherence; Cache Write
Policy: Write-Through, Write-Back; Protocol Type: Write-Invalidate, Write-Update;
Write-Broadcast, Read-Broadcast)
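Sequential consistency can be made concrete by enumerating every interleaving of a small two-thread program. The example below is a common litmus test chosen here for illustration and does not come from the cited references. Thread 1 writes x and then reads y, while thread 2 writes y and then reads x. Both variables are initially 0.

```python
def interleavings(a, b):
    """All merges of the sequences a and b that keep each program order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

# Thread 1: x = 1; r1 = y        Thread 2: y = 1; r2 = x
T1 = [("write", "x", 1), ("read", "y", "r1")]
T2 = [("write", "y", 1), ("read", "x", "r2")]

def sc_outcomes():
    """Set of (r1, r2) results over all sequentially consistent executions."""
    outcomes = set()
    for order in interleavings(T1, T2):
        mem, regs = {"x": 0, "y": 0}, {}
        for op, var, arg in order:
            if op == "write":
                mem[var] = arg
            else:
                regs[arg] = mem[var]
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes

results = sc_outcomes()
```

The six interleavings produce the outcomes (0, 1), (1, 0) and (1, 1), but never (0, 0). A relaxed model that allows a read to bypass an earlier write to a different address could also produce (0, 0), which is the kind of reordering that sequential consistency forbids.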
This section will focus on coherence protocols between private and shared caches in shared last-
level cache CMPs. However, coherence protocols are also needed if different CMPs are used to
make a larger Symmetric Multiprocessor (SMP). Consequently, solutions like the recent proposal
by Marty et al. [MBH+05] regarding intra-CMP coherence are outside the scope of this report.
Furthermore, only pure hardware coherence solutions are considered.
This section has the following outline:
• Section 2.3.1 is a high-level introduction to snooping cache coherence protocols. Further-
more, three recent optimisations of snooping coherence in CMPs are discussed.
• Then, section 2.3.2 discusses directory-based cache coherence protocols in depth. The main
part of this section is a detailed discussion of a directory protocol proposed by Stenström
[Ste89].
• Finally, section 2.3.3 briefly discusses Token Coherence [MHW03]. This scheme is an
improvement over unoptimised snooping and directory protocols.
2.3.1 Snooping-based Cache Coherence Protocols
Snooping coherence protocols rely on changes to cache blocks being broadcast to all sharers.
Furthermore, these broadcasts must be seen in the same order by all sharers. For these reasons,
snooping coherence protocols are often used in connection with bus interconnects where these
properties come for free. However, it is possible to implement snooping protocols over other
interconnects as well, but this requires some additional features.
2.3.1.1 Snooping Protocol Design Space
Figure 2.17 illustrates a few high-level decisions that must be taken when choosing a snooping
protocol for a system. The first choice is the write policy of the cache, which is whether the
cache is write-back or write-through. Then, a coherence protocol type is selected. The possible
choices are write-invalidate and write-update. In a write-invalidate protocol, writes are carried
out locally and all other copies of the block are invalidated. A write-update protocol updates
the data stored in other caches when the block is written to. In both protocols, only one cache
is allowed to write to a block at a time. There is also a choice of whether this update is
carried out directly when the data is written or later when the cache sees a read on the bus.
This is known as write-broadcast and read-broadcast, respectively. According to Culler et al.,
the write-invalidate protocols are more robust and most vendors provide these protocols as the
default [CGS97].
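The difference between the two protocol types can be illustrated with a minimal sketch. It assumes a toy system in which each cache is a dictionary from block address to value and a missing key means that the cache holds no valid copy. The function name `bus_write` and the policy strings are hypothetical and only serve the illustration.

```python
def bus_write(caches, writer, addr, value, policy):
    """Apply one write to a toy system where each cache is a dict {address: value}."""
    if policy not in ("invalidate", "update"):
        raise ValueError(f"unknown policy: {policy}")
    caches[writer][addr] = value              # the write is carried out locally
    for cpu, cache in enumerate(caches):
        if cpu == writer or addr not in cache:
            continue                          # only other caches holding the block react
        if policy == "invalidate":
            del cache[addr]                   # write-invalidate: remove the stale copy
        else:
            cache[addr] = value               # write-update: refresh the copy in place


# Two caches share block 0x40; cache 0 writes the value 7 under each policy.
inv = [{0x40: 1}, {0x40: 1}, {}]
bus_write(inv, writer=0, addr=0x40, value=7, policy="invalidate")
upd = [{0x40: 1}, {0x40: 1}, {}]
bus_write(upd, writer=0, addr=0x40, value=7, policy="update")
```

After the write, the second cache has lost its copy under write-invalidate and holds the new value under write-update. The third cache never held the block and is unaffected in both cases.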
There are many different snooping cache coherence protocols, and a full survey of these
proposals is beyond the scope of this report. The subject is covered in survey articles and
textbooks. For instance, both Stenström [Ste90] and Archibald and Baer [AB86] have written
survey articles on the subject, and Hennessy and Patterson [HP07] and Culler et al. [CGS97]
discuss it in their textbooks. Consequently, this report focuses on recently proposed
enhancements to snooping protocols rather than on the protocols themselves.
2.3.1.2 Recent Enhancements to Snooping Coherence
Snooping cache coherence protocols are attractive because they are considerably easier to
implement than their directory-based counterparts. However, they need to check whether a block
is in a different processor’s cache before they retrieve the block from the shared L2 cache.
This is most often accomplished by broadcasting the request to all caches, which creates two
problems. Firstly, this broadcasting consumes considerable interconnect bandwidth [CLS05].
Secondly, a request must be broadcast even if the needed cache block is not shared, because the
cache does not know whether the block is shared before it has checked the other caches.
Consequently, the latency of all memory requests is increased.
In other words, there are at least two possible ways of enhancing a snooping cache coherence
protocol:
• It is possible to improve the protocol in a way that reduces the number of broadcasts
issued. Cantin et al. [CLS05] take this approach with their Coarse-grain coherence tracking
technique. Here, each core maintains aggregate coherence information for a region of the
address space. If no other processors have cached blocks in this region, there is no
need to broadcast the request.
• The protocol can be adapted such that more powerful interconnects can be used. Marty et
al. [MH06] extend the snooping protocols so that they can work with a ring interconnect.
Another possibility is to map different coherence messages to on-chip wires with different
electrical characteristics as proposed by Cheng et al. [CMR+06]. Here, latency critical
protocol actions are mapped to fast wires while less critical actions use power efficient,
slower wires.
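The idea behind the first enhancement, filtering broadcasts with per-region information, can be sketched as follows. This is a strongly simplified illustration and not the mechanism of [CLS05]: the class `RegionTracker`, its method names and the 4 KiB region size are all assumptions made for the example.

```python
REGION_BITS = 12  # assumption: 4 KiB regions; the region size is a design parameter


class RegionTracker:
    """Per-core table of regions that no other core is known to cache."""

    def __init__(self):
        self.private_regions = set()

    def record_snoop_result(self, addr, other_sharers):
        # Called after a broadcast: remember whether the region turned out to be private.
        region = addr >> REGION_BITS
        if other_sharers == 0:
            self.private_regions.add(region)
        else:
            self.private_regions.discard(region)

    def observe_remote_request(self, addr):
        # Another core touched the region, so it can no longer be treated as private.
        self.private_regions.discard(addr >> REGION_BITS)

    def needs_broadcast(self, addr):
        # A request to a region known to be private can skip the broadcast.
        return (addr >> REGION_BITS) not in self.private_regions
```

The first request to a region is still broadcast. The saving comes from later requests to the same region, which can go directly to the next level of the memory hierarchy as long as no remote request to that region has been observed.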
2.3.2 Directory-based Cache Coherence Protocols
A cache coherence protocol needs to enforce an ordering of memory accesses to shared blocks.
This is enabled by the interconnect in a snooping protocol. However, this does make it difficult
to use a number of interconnection networks. A directory-based protocol employs a directory
that stores which cores have cached copies of which cache blocks. In this case, the ordering of
memory accesses is enforced by the directory.
The directory can be either centralised or distributed. A centralised directory stores the status
for all cache blocks. However, it can become a bottleneck. Consequently, distributed directories
are more common. Here, each memory module is responsible for a part of the global address
space and has a directory that keeps track of the caching status of its cache blocks [LLG+90].
There is a hardware overhead associated with storing the sharing status of all cache blocks in the
system. Chaiken et al. classify directory protocols according to their directory implementation
[CFKA90]:
• Full-map directories make it possible for all caches to have a copy of a given cache block.
This is enabled by using one bit per processor to indicate that a processor has a copy of
the line.
• Limited directories have a fixed number of processor pointers allocated. Consequently,
only a subset of the processors can have a copy of a given line. If all pointers are in use
and a processor requests the cache block, a cached copy must be invalidated in one of the
processor caches. Each pointer uses log2 N bits, where N is the number of processors.
• In Chained directories the directory only has a pointer to one sharer. In addition, each
cache has a pointer to the next cache that has a copy of a block. In other words, a linked
list of sharers is maintained. The main advantage of this directory implementation is the
low hardware overhead, and the main disadvantage is that keeping the list updated creates
a latency overhead.
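The storage cost of the three directory organisations can be compared with a small calculation. The sketch below counts only the sharer-tracking bits per directory entry and ignores state bits. The function names and the choice of four pointers for the limited directory are assumptions made for the example.

```python
from math import ceil, log2


def pointer_bits(n_processors):
    """Bits needed to name one processor (log2 N, rounded up)."""
    return ceil(log2(n_processors))


def full_map_bits(n_processors):
    return n_processors                      # one presence bit per processor


def limited_bits(n_processors, n_pointers):
    return n_pointers * pointer_bits(n_processors)


def chained_directory_bits(n_processors):
    # Directory side only: one pointer to the head of the sharer list.
    # Every cache line additionally stores a pointer to the next sharer.
    return pointer_bits(n_processors)


for n in (16, 64):
    print(n, full_map_bits(n), limited_bits(n, 4), chained_directory_bits(n))
```

With 16 processors, a full-map entry and a limited entry with four pointers both need 16 bits. With 64 processors, the full-map entry grows to 64 bits while the limited entry needs 24 bits and the chained directory entry only 6 bits.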
The main challenge in implementing a cache coherence protocol is to handle race conditions
correctly [HP07]. For instance, two or more processors can initiate protocol actions for the same
block at the same time in a directory protocol. This race will be resolved when the messages
reach the directory since there is only one directory for a given cache block. However, the loser
must be notified so that it can take an appropriate action. This is accomplished by sending a
Negative Acknowledgement (NACK) message to the loser. In addition, some protocol actions
require Acknowledgement (ACK) messages so that the directory or processor knows that the
action has completed.
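The way a directory resolves such a race can be sketched with a few lines of code. The sketch assumes that the directory marks a block as busy while a transaction is in flight and answers competing requests with a NACK. The class name, the method names and the "GRANT" reply are hypothetical.

```python
class Directory:
    """Toy directory that serialises protocol actions per block."""

    def __init__(self):
        self.busy = {}                       # block -> requester with a transaction in flight

    def request(self, block, requester):
        if block in self.busy:
            return "NACK"                    # the loser of the race must retry later
        self.busy[block] = requester
        return "GRANT"                       # hypothetical name for a positive reply

    def complete(self, block, requester):
        # Models the ACK that tells the directory that the action has completed.
        if self.busy.get(block) == requester:
            del self.busy[block]


d = Directory()
first = d.request("X", requester=1)          # wins the race
second = d.request("X", requester=2)         # arrives while X is busy
d.complete("X", requester=1)
retry = d.request("X", requester=2)          # succeeds after the first transaction ended
```

Since there is only one directory for a given block, the order in which the two requests arrive decides the winner. The loser learns about the outcome through the NACK and retries.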
Finite buffering in the interconnect is a source of possible deadlocks. The key to avoiding
these is to ensure that all replies can be accepted and that all requests are eventually serviced.
When implementing a coherence protocol in a simulator, it is possible to model infinite buffers.
This assumption is made in the coherence implementation described later in the report as it
simplifies the implementation.
The rest of this section is a case study of two possible directory protocol solutions:
• First, a recently proposed technique by Eisley et al. called In-Network Cache Coherence
is discussed [EPS06]. Here, directory protocol requests are optimised while traversing the
interconnection network by protocol agents embedded in network routers.
• Then, an old directory protocol proposed by Stenström [Ste89] is discussed in detail. The
reason for focusing on this protocol is that it makes frequent protocol actions fast by
storing sharer status in the private caches. In a traditional multiprocessor, this entails
a large hardware overhead as the size of the directory depends on the number of different
cache blocks that can be stored in private caches at the same time. In a shared L2 cache
CMP, the overhead is lower because the L1 cache size is small.
Figure 2.18: In-Network Cache Coherence Optimisation Example (Reproduced from [EPS06])
2.3.2.1 In-Network Cache Coherence
The In-Network Cache Coherence technique by Eisley et al. aims to improve the performance
of a directory protocol by reducing the latency of protocol actions [EPS06]. In their approach,
both the coherence protocol and directories are embedded within network routers.
Figure 2.18 illustrates how In-Network Cache Coherence reduces the latency of protocol actions.
In figure 2.18(a), processor B issues a read request for a block currently cached by A. First,
consider a conventional Modified, Shared, Invalid (MSI) directory protocol. Here, B’s request is
sent to the directory H, also known as the home node. H then instructs A to supply the data
to B and the transaction is finished. In an in-network MSI directory protocol, A intercepts the
directory request while it traverses the interconnection network and supplies the data directly.
Consequently, latency is reduced.
Figure 2.18(b) shows the protocol actions when processor C wants to write to a cache block
present in the caches of processor A and processor B. In the conventional MSI protocol, C
sends a request to the directory H. H sends invalidate messages to A and B and sends the data
to processor C when it receives the ACK messages from A and B. The in-network protocol
optimises the protocol actions by making the write request pass all sharers. A and B then start
processing the invalidations when they see C’s request. Then, H can supply the data as soon
as it has received the request and all ACK messages.
The in-network coherence protocol works by creating a virtual tree for each cache block where
the root of the tree is the directory and each leaf is a sharer. Intermediate nodes are routers
on the path from the sharer to the directory. To simplify the discussion, we assume that the
directory is located at a shared L2 cache bank and the sharers are private L1 caches.
Figure 2.19(a) shows a read request to a cache block that is not cached by any cache. Here,
bold arrows represent protocol actions and normal arrows represent virtual tree edges. First,
the read is sent to the directory. Then, the L2 cache retrieves the data from memory and sends
a reply to the requester. As this reply traverses the network, a virtual tree is constructed. The
virtual tree is created by storing the block address and a direction pointer in a small cache in
each router. This cache is known as a virtual tree cache. The direction pointer points to the next
router on the path towards the sharer. Choosing a large virtual tree cache makes it possible for
many virtual trees to exist at the same time. However, the size of this cache must be traded off
against the impact of its access time on overall router delay.
Figure 2.19: In-Network Cache Coherence Virtual Tree (Reproduced from [EPS06])
Figure 2.19(b) illustrates the protocol actions when another cache requests read access to the
cache block. Here, the read request is forwarded towards the home node until it arrives at a
router which is a member of the virtual tree. This router recognises the block address and
redirects the message towards the nearest copy. The sharer cache then provides the requested
data and the virtual tree is extended as this message traverses the interconnect.
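The virtual tree mechanism can be sketched in a few lines. The sketch is a simplification under several assumptions: routers are identified by integers, message paths are given as explicit lists, each router stores a single direction pointer per block, and the extension of the tree by the second reply is left out. The function names are hypothetical.

```python
def record_reply(tree_caches, block, path_home_to_sharer):
    """Store, in every router on the reply path, the next hop towards the sharer."""
    for here, nxt in zip(path_home_to_sharer, path_home_to_sharer[1:]):
        tree_caches.setdefault(here, {})[block] = nxt


def route_read(tree_caches, block, path_to_home):
    """Return the routers visited by a read request for the given block."""
    for hops, router in enumerate(path_to_home):
        if block in tree_caches.get(router, {}):
            # The router is part of the virtual tree: follow the pointers to the sharer.
            trail = [router]
            while block in tree_caches.get(trail[-1], {}):
                trail.append(tree_caches[trail[-1]][block])
            return path_to_home[:hops] + trail
    return list(path_to_home)                # no tree found: continue to the home node


tree = {}
record_reply(tree, "X", [0, 1, 2, 3])        # home node at router 0, first sharer at router 3
route = route_read(tree, "X", [5, 2, 1, 0])  # second reader at router 5
```

In this example the request from router 5 meets the virtual tree at router 2 and is redirected to the sharer at router 3, so it never reaches the home node at router 0.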
The In-Network Cache Coherence technique does a good job at optimising the baseline protocol.
However, it is unclear how this protocol performs in comparison to other protocol optimisations
such as the Stenström protocol discussed in the next section. Furthermore, it is designed for a
torus network and might need modifications to be adapted to other interconnection networks.
2.3.2.2 The Stenström Directory Protocol
The Stenström Directory Protocol was proposed by Per Stenström [Ste89] and was primarily
designed for multiprocessors with multistage interconnection networks. It differs from other
directory protocols in that if cache A owns a cache block X, cache A knows which other caches
have copies of X. In addition, all other caches that have a copy of X know that cache A owns
it. Consequently, they can access cache A directly when they need to read block X. Tang,
Censier and Feautrier, Yen and Fu as well as Archibald and Baer have all proposed protocols
that require the caches to access the directory before a cache-to-cache transfer. These protocols
have all been reviewed by Agarwal et al. [ASHH88] in their survey article and differ mainly in
the area overhead of their directory implementation.
In summary, there is a trade-off between area and performance. The traditional directory
protocols use less area, but require one access to the directory before all cache-to-cache transfers.
The Stenström Protocol uses more area but can send a read request directly to the owner if the
owner information is stored in the cache. For writes, the Stenström protocol works similarly to
the traditional directory protocols.
This section will discuss how the Stenström protocol can be used in a CMP. To make the
discussion easier to understand, a two-level cache hierarchy with private L1 caches and a shared
L2 cache is assumed. Implementing the cache coherence protocol between private L2 caches and
a shared L3 cache would work in exactly the same way. However, since private L2 caches are
normally larger than private L1, the area overhead would be larger.
Figure 2.20: Status Information used in the Stenström Protocol (Adapted from [Ste89])
Figure 2.20 shows the hardware data structures needed for the Stenström protocol. The Tag
field has the same meaning as in a normal cache. V denotes the valid bit. A cache can only
read or write the block if the valid bit is set to 1. This is only the case for cache 1 in the figure.
The reason is that cache 1 owns block X, which is given by the fact that the owned bit (O) is
set. The last two single bit fields in the figure are the modified bit (M) and the distributed write
mode bit (DW). The modified bit is set when the block data is altered and signifies that the
block must be written back when the block is replaced. The distributed write bit selects whether
the Distributed Write or the Global Read mode should be used. The difference between these
modes will be explained later in this section. Since the discussions in this report are limited to
the Global Read mode, this bit will always be set to 0. The owned, modified and distributed
write bits only carry meaning for the cache that owns a block. In all sharer caches these bits
are don’t cares.
The remaining data fields in the caches are the present flags and the owner identification. The
present flags contain one bit for each processor. If a processor’s bit is set to 1, there is a
copy of the block in this processor’s cache. In the Global Read mode, every copy other than the
owner’s is invalid. The owner identification field identifies the cache
that owns the block. Consequently, read requests to block X from cache 2 or 3 in figure 2.20
can be forwarded directly to cache 1. The present flags are only used in the owner cache, while
the owner identification is only used in sharer caches.
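The status fields described above can be summarised as a small record type. The sketch below is an illustration and not a hardware description. The class name `LineStatus`, the field names and the helper `read_target` are assumptions, and processors are numbered from 0.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LineStatus:
    tag: int
    valid: bool = False                 # V: the block may be read or written locally
    owned: bool = False                 # O: this cache owns the block
    modified: bool = False              # M: write back on replacement
    distributed_write: bool = False     # DW: always False in the Global Read mode
    present: List[bool] = field(default_factory=list)   # used by the owner only
    owner_id: Optional[int] = None      # used by sharer caches only

    def read_target(self):
        """Where a read is served: locally, at the recorded owner, or via the L2 cache."""
        if self.valid:
            return "local"
        return self.owner_id if self.owner_id is not None else "L2"


# Three caches: cache 0 owns the block, the other two hold invalid copies.
owner = LineStatus(tag=0x12, valid=True, owned=True, present=[True, True, True])
sharer = LineStatus(tag=0x12, owner_id=0)
```

The owner serves reads locally, while a sharer uses its owner identification to send the read directly to cache 0.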
The memory module or L2 cache needs to know who owns each cache block, because new requests
to block X must be redirected to the L1 cache that has the updated copy. This information is
stored in a structure which Stenström calls the block store.
Since only the present flags and owner identification sizes grow with the number of processors
in the system, the area overhead of the Stenström protocol in a CMP context is given by the
equations:
L1 Cache Area Overhead = O(Number of Blocks in DL1 · (P + log2 P))
= O(Number of Blocks in DL1 · P)
Shared L2 Cache Area Overhead = O(Number of Blocks in L2 · log2 P)
Total Area Overhead = O((P · L1 Cache Area Overhead) + Shared L2 Cache Area Overhead)
The above equations assume that the directory state is stored with the cache blocks in the L2
cache. However, it is possible to use a dedicated memory for this purpose. The rationale is
that the worst-case situation occurs when all L1 caches own all blocks stored in them. In other
words, this is the situation where all L1 caches are full and there is no sharing. Depending
on the cache sizes, this dedicated memory might be larger or smaller than storing the owner
information with the L2 blocks. Note that coherence state is only needed for the L1 data cache
since the L1 instruction cache is read only.
The area overhead with a dedicated directory memory is given by:
Dedicated Directory Memory Size = O(P · Number of Blocks in DL1 · log2 P)
The shared L2 cache area overhead and dedicated directory memory size computations assume
that a full-map directory is used. As mentioned earlier, this area overhead can be reduced by
using limited or chained directories at the cost of increased latency for protocol actions.
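To give a feeling for the magnitudes involved, the expressions above can be evaluated for a concrete configuration. The sketch below treats the asymptotic expressions as exact bit counts, which ignores the constant number of single-bit fields per block. The configuration (4 cores, 512 blocks per L1 data cache and 65536 blocks in the L2 cache) is a hypothetical example and not a configuration taken from this report.

```python
from math import ceil, log2


def stenstrom_overhead_bits(p, dl1_blocks, l2_blocks):
    """Evaluate the area overhead expressions as bit counts."""
    owner_bits = ceil(log2(p))                        # owner identification
    l1_per_cache = dl1_blocks * (p + owner_bits)      # present flags + owner identification
    l2_shared = l2_blocks * owner_bits                # owner stored with every L2 block
    return {
        "l1_per_cache": l1_per_cache,
        "l2_shared": l2_shared,
        "total": p * l1_per_cache + l2_shared,
        "dedicated_directory": p * dl1_blocks * owner_bits,
    }


bits = stenstrom_overhead_bits(p=4, dl1_blocks=512, l2_blocks=65536)
```

For this example, each L1 data cache needs 3072 bits and the L2 cache needs 131072 bits, which gives 143360 bits in total. A dedicated directory memory would need only 4096 bits instead of the 131072 bits stored with the L2 blocks, because the L1 caches hold far fewer blocks than the L2 cache.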
Figure 2.20 is reproduced from Stenström’s original article, but has been changed to reflect the
state of the Global Read mode. In Stenström’s paper, the figure exemplifies the operation of the
Distributed Write mode.
The area overhead of the Stenström protocol is less important in a CMP context than in a
traditional multiprocessor. The reason is that the number of state bits depends on the size of
the caches and the size of the memory or cache at the level where sharing starts. In a CMP with
private L1 caches and a shared L2 cache, both the L1 data cache and the shared L2 cache are
considerably smaller than the off-chip memory. Consequently, an on-chip realisation
of a directory protocol will use significantly less area than a directory protocol between the last
level cache and memory. Furthermore, the area difference between a conventional protocol and
the Stenström scheme will be small. However, the increase in L1 cache size must be implemented
without increasing the access time of this cache. The reason is that the L1 cache is often on the
processor’s critical path, and consequently the impact of increasing its hit latency would be large.
This subsection focuses on the Global Read mode of the Stenström Protocol. The reason is
that it only allows one valid copy of a cache line in the system at any time. In contrast, the
Distributed Write mode can have one writer and many readers. The key idea is that every write
is distributed to all readers. Consequently, the cache block is already up to date when the
reader needs it. In summary, the Global Read mode is probably easier to understand while the
Distributed Write mode has the potential of giving better performance.
The reason for including two different modes in the protocol is that the application can select
the mode that gives the best performance for its communication pattern. Furthermore, different
phases of the program might use different modes. The Global Read mode is simple, but it adds
the extra latency of accessing a remote cache for the readers. In the Distributed Write mode,
writes to a shared block are distributed to all caches that have a copy of it. Consequently,
the block might already be in the reader’s cache when it needs it. However, all updates are
forwarded to all copies, and this might create a lot of unnecessary network traffic. In other
words, the protocol is optimised for the communication pattern with one writer and many
readers. Stenström states that many supercomputing applications follow this communication
pattern [Ste89].
There are six possible states in the protocol, but only three of them are used in the Global Read
mode:
• Invalid - The cache block is not valid. If the owner information is present, reads are
forwarded to the cache that owns the block. This point is the key to understanding how
the protocol works and the reason for naming the protocol mode global read. On writes,
ownership must be acquired before the write can be carried out.
• Owned Exclusively Global Read - This cache owns the cache block, and it is the only cached
copy. Both reads and writes can be carried out without delay.
• Owned NonExclusively Global Read - This cache owns the cache block, but at least one
other cache has a copy in the Invalid state. Since all other copies of the cache line are
invalid, reads and writes can proceed without delay.
The other three states are UnOwned, Owned Exclusively Distributed Write and Owned NonEx-
clusively Distributed Write. These states are only used in the Distributed Write mode.
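The three Global Read states follow directly from the owned bit and the present flags. The sketch below makes this mapping explicit. The function names are hypothetical, and processors are numbered from 0.

```python
def global_read_state(owned, present, self_id):
    """Derive the Global Read state of a block from the status bits of one cache."""
    if not owned:
        return "Invalid"
    # The block is shared if any present flag other than the owner's own flag is set.
    shared = any(flag for cpu, flag in enumerate(present) if cpu != self_id)
    if shared:
        return "Owned NonExclusively Global Read"
    return "Owned Exclusively Global Read"


def can_proceed_locally(state):
    """Reads and writes proceed without delay in both owned states."""
    return state != "Invalid"


only_copy = global_read_state(owned=True, present=[True, False], self_id=0)
with_sharer = global_read_state(owned=True, present=[True, True], self_id=0)
not_owner = global_read_state(owned=False, present=[], self_id=1)
```

In the Invalid state, reads are forwarded to the owner and writes must first acquire ownership, which is why `can_proceed_locally` returns False for this state.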
The rest of this section discusses the protocol actions for read hits, read misses, write hits, write
misses and cache block replacements from the L1 data caches.
Read Hit
A request is only a hit in the cache if the valid bit of the cache block is set to 1. This can only
happen in the states Owned Exclusively Global Read and Owned NonExclusively Global Read
in the Global Read mode. In other words, this is the most recent data copy and the read can
proceed without delay.
We do not need to inform any other sharers for two reasons. Firstly, we are reading the cache
and consequently not changing the data. Also, since we are in the Global Read mode, this cache
has the only valid copy of the block. Consequently, other sharers will access this cache when
they need the data.
Read Miss
There are three main cases for a read miss when using the Global Read policy. Firstly, the cache
block might not be present in any L1 cache. Secondly, the block might be present in a different
L1 cache but not in the requesting cache. Finally, the cache block might be present in the
requesting cache but invalid. These cases are handled differently by the protocol, and they are
shown in figure 2.21.
In the first two cases a block might be replaced in the requesting cache. The protocol actions in
this case are described later in this section.
Consider the case where the only available copy of cache block X is in the shared L2 cache. This
case is shown in figure 2.21(a). The protocol actions are described in the following list, and the
numbers in the list correspond to the numbers in the figure:
1. At first, the only valid copy of X resides in the shared L2 cache.
2. Processor 2’s L1 cache does not have a copy of cache block X and issues a load request to
the L2 cache.
3. The L2 cache sets Cache 2 as the owner of block X.
4. Block X is sent to Cache 2.
5. Cache 2 stores the data in its cache, initialises the present flags and sets the state to Owned
Exclusively Global Read.
Figure 2.21: Stenström Protocol Read Miss Handling. Panels: (a) The other caches do not have a
copy of X, (b) At least one other cache has a copy of X, (c) The requesting cache has an invalid
copy of X.
Figure 2.22: Stenström Protocol Write Hit Handling
The actions taken when block X is present in a different L1 cache are shown in figure 2.21(b):
1. Initially, the valid cached copy is in processor 1’s cache and the L2 cache has recorded
processor 1 as the owner of block X. Block X is not present in processor 2’s cache.
2. Processor 2 issues a load request for block X to the L2 cache.
3. The L2 cache tells processor 2 that processor 1 is the owner. Consequently, processor 2
will redirect its request to processor 1.
4. Processor 2 sends the load request to processor 1.
5. Processor 1 changes the state of block X from Owned Exclusively Global Read to Owned
NonExclusively Global Read. In addition it sets processor 2’s present flag. If the state of
block X in processor 1’s cache was Owned NonExclusively Global Read, block X would
remain in this state and only the present flag would be set. This could happen in a system
with more than 2 processors if one of the other processors previously had read block X.
6. Processor 1 sends the data and the owner identification to processor 2. The owner iden-
tification is attached so that processor 2 does not need to remember which processor the
message was sent to. In other words, a buffer is avoided.
7. Processor 2 stores block X in its cache and uses the received data. The state of block X
is set to Invalid to ensure that subsequent reads of the block are redirected to the owner
cache. In other words, the protocol does not migrate data.
When block X is present in the cache with state Invalid, the protocol actions shown in figure
2.21(c) are taken:
1. At first, block X is present in both processor 1’s and processor 2’s L1 caches. Processor
1 owns block X. Since there is more than one copy, the state is Owned NonExclusively
Global Read. Processor 2’s copy of X is invalid since all copies that are not owned must
be invalid in the Global Read mode.
2. Processor 2 knows that processor 1 is the owner and issues the load request directly to
processor 1 without accessing the L2 cache. Other directory based protocols would access
the L2 cache first in this case.
3. Processor 1 sends the data to processor 2. There are no changes to the protocol state in
any of the L1 caches.
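The three read miss cases can be summarised in a small executable model. This is a minimal sketch and not the coherence implementation described later in the report. It assumes that the block store is a dictionary from block to owner, that each L1 cache is a dictionary from block to a status record, and that processors are numbered from 0. Data values, replacements and the message names are left out or invented for the illustration.

```python
INVALID = "Invalid"
OWNED_EX = "Owned Exclusively Global Read"
OWNED_NONEX = "Owned NonExclusively Global Read"


class System:
    def __init__(self, n_caches):
        self.n = n_caches
        self.block_store = {}                            # L2 side: block -> owner
        self.caches = [dict() for _ in range(n_caches)]  # block -> status record

    def read_miss(self, cpu, block):
        """Handle a read miss and return the messages as (kind, source, destination)."""
        entry = self.caches[cpu].get(block)
        if entry is not None and entry["state"] == INVALID:
            # Case (c): the owner is known locally, so the L2 cache is bypassed.
            owner = entry["owner"]
            return [("load", cpu, owner), ("data", owner, cpu)]
        owner = self.block_store.get(block)
        if owner is None:
            # Case (a): no L1 cache has the block, the requester becomes the owner.
            self.block_store[block] = cpu
            present = [i == cpu for i in range(self.n)]
            self.caches[cpu][block] = {"state": OWNED_EX, "owner": cpu, "present": present}
            return [("load", cpu, "L2"), ("data", "L2", cpu)]
        # Case (b): the L2 cache names the owner and the request is redirected to it.
        owner_entry = self.caches[owner][block]
        owner_entry["state"] = OWNED_NONEX
        owner_entry["present"][cpu] = True
        self.caches[cpu][block] = {"state": INVALID, "owner": owner, "present": None}
        return [("load", cpu, "L2"), ("owner", "L2", cpu),
                ("load", cpu, owner), ("data+owner", owner, cpu)]


s = System(2)
case_a = s.read_miss(1, "X")   # cache 1 becomes the owner
case_b = s.read_miss(0, "X")   # cache 0 is redirected to cache 1 by the L2 cache
case_c = s.read_miss(0, "X")   # cache 0 now contacts cache 1 directly
```

The model shows why the third case is attractive: once the owner identification is stored in the requesting cache, a read needs two messages instead of four and does not involve the L2 cache.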
Write Hit
In the Global Read mode there are three possible protocol actions on a write hit. Firstly, the
cache block can be in the state Owned Exclusively Global Read. In this case, the write can
proceed without delay as there is no other cached copy of the cache block. Secondly, the cache
block can have the state Owned NonExclusively Global Read. Here, the write can also proceed
without delay because all other cached copies of the block are invalid. Consequently, reads to
these blocks will be redirected to this cache.
The last possibility is that the cache line is invalid. In this case, the cache needs to obtain
ownership of the block before it can write to it. The protocol actions needed are shown in figure
2.22 and described in the following list:
1. First, processor 2’s cache has an invalid copy of block X. The valid copy is present in
processor 1’s cache in the state Owned NonExclusively Global Read. In addition, the L2
cache has stored that processor 1 is the owner of cache block X.
2. Processor 2 requests ownership of cache block X from the L2 cache. The request is sent
to the directory as this is the point of serialisation.
3. The L2 cache sets processor 2 as the owner of block X.
4. The L2 cache informs processor 1 that processor 2 is the new owner of block X.
5. Processor 1 sets the state of block X to Invalid and sets processor 2 as the new owner.
6. Processor 1 then sends the data and the present flags for block X to processor 2. In
addition, it informs all other caches that have a copy of X that the new owner is processor
2. The present flags tell processor 1 which processors to inform.
7. Finally, processor 2 stores the data and the present flags of block X in its cache. The state
is set to Owned NonExclusively Global Read. The write operation is carried out and the
modified bit is set to 1.
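The ownership transfer in the list above can be sketched as a single function. The sketch assumes the same toy representation as a dictionary-based block store and one dictionary of status records per cache, with processors numbered from 0. The function name `transfer_ownership` is hypothetical and the block data itself is not modelled.

```python
def transfer_ownership(block_store, caches, new_owner, block):
    """Steps 2 to 7 of a write to an Invalid copy; returns the new owner's record."""
    old_owner = block_store[block]             # step 2: the directory serialises the request
    block_store[block] = new_owner             # step 3: the L2 cache records the new owner
    old_entry = caches[old_owner][block]       # step 4: the L2 cache informs the old owner
    present = old_entry["present"]
    old_entry.update(state="Invalid", owner=new_owner, present=None)    # step 5
    for cpu, has_copy in enumerate(present):   # step 6: other sharers learn the new owner
        if has_copy and cpu not in (old_owner, new_owner):
            caches[cpu][block]["owner"] = new_owner
    present[new_owner] = True
    entry = {"state": "Owned NonExclusively Global Read", "owner": new_owner,
             "present": present, "modified": True}                      # step 7
    caches[new_owner][block] = entry
    return entry


# Cache 0 owns X; caches 1 and 2 hold invalid copies. Cache 1 wants to write.
block_store = {"X": 0}
caches = [
    {"X": {"state": "Owned NonExclusively Global Read", "owner": 0,
           "present": [True, True, True], "modified": False}},
    {"X": {"state": "Invalid", "owner": 0, "present": None}},
    {"X": {"state": "Invalid", "owner": 0, "present": None}},
]
new_entry = transfer_ownership(block_store, caches, new_owner=1, block="X")
```

The old owner keeps an invalid copy that points to the new owner, so the new state is Owned NonExclusively Global Read and the present flags travel with the ownership.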
Write Miss
The protocol actions on a write miss are similar to the actions taken on a write hit. Again,
only the owner can write to a block. Consequently, a cache must obtain ownership of the block
before the write can proceed. There are two possibilities. Either the block is already present in
an L1 cache or it must be retrieved from the L2 cache.
Consider the possibility where the cache block is not present in any L1 caches. The protocol
actions for this case are shown in figure 2.23(a) and are described in the following list:
1. Initially, block X is present neither in processor 2’s L1 cache nor in the L2 cache.
2. Processor 2 requests ownership of block X from the shared L2 cache.
3. The L2 cache retrieves block X from memory if necessary and sets processor 2 as the
owner.
4. The L2 cache sends the data to processor 2.
5. Processor 2 stores block X in its cache and sets the state to Owned Exclusively Global Read.
The present flags are initialised with processor 2’s flag set to 1 and all other flags set to 0.
The write operation is carried out and the modified bit is set to 1.
Figure 2.23: Stenström Protocol Write Miss Handling. Panels: (a) The other caches do not have a
copy of X, (b) At least one other cache has a copy of X.
The other write miss possibility is that the block is present in a different L1 cache. Here, it does
not matter whether the block is not present or invalid in the requesting processor’s L1 cache.
Figure 2.23(b) shows the case where block X is not present in the requesting processor’s cache.
In either case, the following protocol actions are carried out:
1. At first, block X is not present in processor 2’s cache, and processor 1 owns block X.
Block X is in the state Owned Exclusively Global Read, but the protocol actions would be
essentially the same if there were other sharers. The only difference is that the owner
information is forwarded to all sharers in point 6.
2. Processor 2 requests ownership of block X from the L2 cache.
3. The L2 cache changes the owner of block X to processor 2.
4. The L2 cache issues a request to processor 1 telling it that processor 2 is the new owner
of block X.
5. Processor 1 sets processor 2 as the new owner of block X and changes the state to Invalid.
6. Processor 1 sends the data and the present flags of block X to processor 2. If there are
other processors with a copy of block X in the system, processor 1 forwards the new owner
information to them.
7. Processor 2 creates or updates the cache entry for block X with the data and present flags
received from processor 1. The new state of block X is Owned NonExclusively Global Read.
The write operation is carried out and the modified bit is set to 1.
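Both write miss cases can be captured in one function. The sketch below uses the same toy representation as a dictionary-based block store and one dictionary of status records per cache. The function name and the returned strings are hypothetical, processors are numbered from 0, and neither the block data nor a possible replacement in the requesting cache is modelled.

```python
def write_miss(block_store, caches, cpu, block):
    """Handle a write miss in the Global Read mode and report the data source."""
    owner = block_store.get(block)
    if owner is None:
        # Case (a): no L1 cache owns the block, so the L2 cache supplies it.
        block_store[block] = cpu
        caches[cpu][block] = {"state": "Owned Exclusively Global Read", "owner": cpu,
                              "present": [i == cpu for i in range(len(caches))],
                              "modified": True}
        return "from L2"
    # Case (b): ownership moves from the current owner to the requester.
    block_store[block] = cpu
    old_entry = caches[owner][block]
    present = old_entry["present"]
    old_entry.update(state="Invalid", owner=cpu, present=None)
    for other, has_copy in enumerate(present):
        if has_copy and other not in (owner, cpu):
            caches[other][block]["owner"] = cpu      # sharers learn the new owner
    present[cpu] = True
    caches[cpu][block] = {"state": "Owned NonExclusively Global Read", "owner": cpu,
                          "present": present, "modified": True}
    return "from cache %d" % owner


block_store, caches = {}, [{}, {}]
first = write_miss(block_store, caches, cpu=1, block="X")    # case (a)
second = write_miss(block_store, caches, cpu=0, block="X")   # case (b)
```

After the second write, cache 0 owns the block non-exclusively because cache 1 keeps an invalid copy that points to the new owner.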
Block replacement
There are three possible protocol actions on a block replacement. If the block is owned exclu-
sively, it is the only cached copy. In this case, the L2 cache must be notified that the cache no
longer owns the block. Furthermore, if the block is modified it must be written back.
In addition, the block might be non-exclusively owned or invalid. These cases are shown in
figures 2.24(a) and 2.24(b) respectively.
Figure 2.24: Stenström Protocol Block Replacement Handling. Panels: (a) The to-be-replaced
block X is in the state Owned NonExclusively Global Read, (b) The replaced block X is invalid.
Consider first the case where the block is non-exclusively owned. Here, ownership of the cache
block must be transferred to a different L1 cache. The new owner can be chosen arbitrarily from
the present flags. The following actions are carried out:
1. At first, processor 2 is the owner of cache block X and processor 1 has an invalid copy.
2. Processor 2 wants to replace block X, and requests processor 1 to become the new owner.
3. (a) If processor 1 still has block X in its cache, it responds with an Acknowledgement
(ACK) message. Processor 2 still cannot replace block X, as processor 1 will go
through the usual protocol steps for acquiring ownership of block X.
(b) If processor 1 does not have block X in its cache, it responds with a Negative
Acknowledgement (NACK) message. Processor 2 must then set processor 1’s present
flag to 0 and try to transfer ownership to a different cache. If there are no more shar-
ers, processor 2 can change the state to Owned Exclusively Global Read and follow
the protocol actions for the owned exclusively case.
4. In figure 2.24(a), processor 1 accepts ownership of block X. Then, it requests ownership
for the block using the normal protocol. When processor 2 has delivered the block data
and present flags to processor 1, it can replace block X.
The protocol actions when the block is invalid are considerably simpler as shown in figure 2.24(b):
1. Initially, block X is owned by processor 1, and processor 2 has an invalid copy. Processor
2 wants to replace block X.
2. Processor 2 informs the L2 cache that it will replace block X. There is no need to issue a
writeback as an invalid copy can never be modified. Once this request is sent, processor
2 is free to replace block X.
3. The L2 cache informs processor 1 that processor 2 no longer has a copy of block X.
4. Processor 1 sets processor 2’s bit in the present flags to 0. If this results in processor
1 being the only processor with a copy of X, the state is changed to Owned Exclusively
Global Read.
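The two replacement cases above can be summarised in a minimal Python sketch. The function names and the dictionary representation of the present flags are illustrative assumptions and are not taken from the protocol description or from M5. Message passing is reduced to a lookup that tells whether a cache would answer ACK or NACK.

```python
def replace_owned_nonexclusive(owner, present, has_block):
    """Owner wants to replace a non-exclusively owned block.

    present   -- dict: cache id -> present flag, as kept by the owner
    has_block -- dict: cache id -> True if that cache still holds the block
                 (it answers ACK if True and NACK otherwise)
    Returns (new_owner, outcome); new_owner is None if no sharer accepted.
    """
    for cache, flag in sorted(present.items()):
        if cache == owner or not flag:
            continue
        if has_block.get(cache, False):
            # ACK: this cache now acquires ownership through the normal protocol
            return cache, "ownership transferred"
        # NACK: the cache has replaced the block, so clear its present flag
        present[cache] = False
    # No sharers left: continue as in the owned exclusively case
    return None, "Owned Exclusively Global Read"


def sharer_replaced_invalid_copy(present, replacing_cache):
    """Owner-side update when another cache replaces its invalid copy."""
    present[replacing_cache] = False
    set_bits = sum(1 for flag in present.values() if flag)
    if set_bits <= 1:
        return "Owned Exclusively Global Read"
    return "Owned NonExclusively Global Read"
```

With the flags of figure 2.24(b), `sharer_replaced_invalid_copy({1: True, 2: True}, 2)` leaves only the owner's flag set and therefore yields the exclusive state.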
2.3.3 Alternative Cache Coherence Solutions
The cache coherence problem can be solved in other ways than by snooping or directory-based
cache coherence protocols. Although this report mainly focuses on these traditional approaches,
a brief look at an alternative solution is in order.
Martin et al. maintain that snooping and directory-based protocols solve the coherence problem
in an inefficient way [MHW03]:
• Snooping protocols require a totally ordered interconnection network. This avoids protocol
race conditions, but can limit performance.
• Traditional directory protocols resolve races by accessing the directory in an ordered fashion.
Although this removes the totally ordered interconnect restriction, it adds overhead by
requiring an access to the directory before a cache-to-cache transfer. Martin et al. refer
to this property as adding indirection.
According to Martin et al., the fastest way to service a request for shared data is to broadcast
it and let the cache with updated data respond. The problem is that this might lead to race
conditions on an unordered interconnect. However, these race conditions will be rare.
Token Coherence uses these properties to solve the cache coherence problem efficiently [MHW03].
Here, a number of tokens equal to the number of processors in the system are added for each
cache block. If a cache has all the tokens for a block, it can safely write to the block as there
are no other cached copies. Similarly, a cache can read a block if it has at least one token.
Furthermore, all requests containing tokens must contain valid data. Consequently, the protocol
enforces that only one processor writes to a block and that multiple processors can read a block.
If two processors attempt to write to the same block simultaneously, they will compete for the
same tokens. This situation can lead to starvation. Martin et al. use persistent requests to handle
this situation. A persistent request causes all tokens belonging to a given block to be
sent to the requester. There can only be one persistent request for a given cache block in the
system at a time. Since the requester gets all tokens, the request is guaranteed to complete.
A cache detects possible starvation by measuring the time it takes to service a cache miss. If
this time increases beyond a given threshold, a persistent request is issued. For this mechanism
to guarantee freedom from starvation, a fair mechanism must be provided to decide which cache
should be allowed to issue the persistent request.
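The token counting rules described above can be illustrated with a small Python sketch. The class and method names are hypothetical, and the sketch assumes that all tokens initially reside in cache 0. Data transfer, timeouts and the fair arbitration between persistent requests are left out.

```python
class TokenBlock:
    """Token bookkeeping for one cache block."""

    def __init__(self, n_processors):
        self.total = n_processors
        # Assumption for this sketch: all tokens start in cache 0
        self.tokens = [0] * n_processors
        self.tokens[0] = n_processors

    def can_read(self, cache):
        # At least one token is needed to read the block
        return self.tokens[cache] >= 1

    def can_write(self, cache):
        # All tokens are needed to write, so no other cached copy can exist
        return self.tokens[cache] == self.total

    def transfer(self, src, dst, count):
        """Move tokens between caches; tokens are never created or destroyed."""
        if count > self.tokens[src]:
            raise ValueError("cache does not hold that many tokens")
        self.tokens[src] -= count
        self.tokens[dst] += count

    def persistent_request(self, requester):
        """All caches send every token they hold for the block to the requester."""
        for cache in range(self.total):
            if cache != requester:
                self.transfer(cache, requester, self.tokens[cache])
```

After a persistent request the requester holds every token, which is why the request is guaranteed to complete.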
Token Coherence has been compared to a snooping and a directory-based protocol [MHW03].
It performs better than a snooping protocol if it is run on an interconnect that does not order
requests. Furthermore, it outperforms a directory-based protocol at the cost of a somewhat
higher interconnect bandwidth requirement. The scalability of token coherence is somewhat
limited because it uses broadcasts. However, Martin et al. state that they expect the protocol
to scale well enough to be used in a 32 or 64 processor SMP provided that the interconnect is
powerful enough.
Chapter 3
Research Questions and Methods
This chapter is a discussion of the research questions this report is based on and how they will
be answered. The research questions lay the foundation for the practical work described in this
report. The chapter has the following outline:
• Section 3.1 states and discusses the research questions.
• Section 3.2 describes the baseline CMP architecture used in this report.
• Section 3.3 discusses possible simulators and explains why the M5 simulator [BDH+06]
was chosen.
• Finally, section 3.4 discusses possible benchmarks and how they will be used.
3.1 Research Questions
The research questions should be well-formulated and within the boundaries given by the as-
signment text. Consequently, they are a tool for focusing the work as well as giving the reader
a feeling of where the report is heading.
This report will focus on the following research questions:
1. How does the CMP on-chip interconnect between private and shared caches influence
overall system performance for multiprogrammed workloads?
2. How does the CMP on-chip interconnect between private and shared caches influence
overall system performance for scientific workloads?
3. Can improvements to the private to shared cache interconnect improve performance for
both multiprogrammed and scientific workloads?
The research questions follow one suggestion for further work outlined in my 5th year project
report [Jah06]. In this work, the interconnect, the cache coherence protocol and Non-Uniform
Cache Access (NUCA) cache designs were identified as the most promising areas for CMP
communication research. Since it is probably a good idea to start with a reasonably simple
system, NUCA architectures are not considered here.
The cache coherence protocol depends heavily on the properties of the interconnect. In particu-
lar, a snooping protocol can be used if the interconnect guarantees a global ordering of requests.
One problem with a snooping protocol is that it broadcasts requests, and this can create a large
strain on the interconnect. Furthermore, creating a global ordering might prevent an interconnect
from reaching its full performance potential. In addition, it is difficult to create an ordering
in some interconnects. Choosing a directory protocol creates great freedom in which interconnects
can be investigated. Furthermore, this will give insights into cache coherence protocols
in general. In this sense, the research questions attack both the interconnect and the cache
coherence protocol research fields.
Figure 3.1: High-level Chip Multiprocessor Architecture (CPU 1 and CPU 2 each have an L1 data
cache and an L1 instruction cache; an interconnect links these caches to the L2 cache, which
reaches main memory over the memory bus)
The assignment text states that this report should investigate the performance of communication
intensive workloads in CMPs and identify performance bottlenecks. Furthermore, architectural
techniques that alleviate these bottlenecks should be proposed. The first issue is addressed by
research question 2, and the second issue is addressed by question 3. Furthermore, these ques-
tions focus the work towards the private to shared cache interconnect. However, question 1 does
not address any of these issues. The reason for including this question is that multiprogrammed
workloads are an easier problem to deal with as they do not require a cache coherence protocol.
Consequently, the proposed interconnects can be evaluated before the cache coherence protocol
is ready. The effect is that it is possible to investigate architectural effects early on. Hopefully,
this will result in a better understanding of the problem at hand than if the assigned problem
was addressed right away.
For these reasons, answering the research questions will keep the report within the further work
identified in my project report and fulfil the requirements given in the assignment text.
3.2 CMP Architecture Model
This section presents the CMP architecture used in this report. As shown in figure 3.1, each
core has a private instruction cache and a private data cache. These caches are connected to a
shared L2 cache through an interconnect. The L2 cache is then connected to the main memory
with a memory bus.
A possible design alternative is to have private per-core L2 caches as used in AMD’s Athlon
64 X2 processor [AMD]. In general, these private caches will have lower access times which in
many cases will result in increased performance [CS06]. However, if the working set for one
processor is larger than the local L2 cache size, the result will be many costly off-chip memory
accesses and reduced performance. Another important drawback of a pure private cache scheme
is that inter-core communication must go via the memory bus. Since the aim of this work is to
investigate communicating workloads, a shared cache scheme will probably be more appropriate.
Many decisions must be taken when configuring a CMP simulator. For instance, the type and
size of the branch predictor, the size of the issue queue and the L1 cache hit latency must be
decided. This is a difficult task and the numbers used in academic publications differ widely.
Consequently, values from existing processors will be used in this report if they are available.
The goal is that the simulated CMP should be representative of current CMP implementations.
The scripts used to configure the M5 simulator can be found in appendix D.
Parameter                      Value
Clock frequency                3.2 GHz
Reorder Buffer Size            128 entries
Store Buffer Size              32 entries
Instruction Queue Size         64 instructions
Instruction Fetch Queue Size   32 entries
Load/Store Queue Size          32 instructions
Issue Width                    8 instructions/cycle
Functional units               4 Integer ALUs
                               2 Integer Multiply/Divide
                               4 Floating Point ALUs
                               2 Floating Point Multiply/Divide
                               4 Memory Read/Write Ports
                               1 Internal Register Access Port
Branch predictor               Hybrid, 2048 local history registers,
                               2-way 2048 entry Branch Target Buffer (BTB)
Decode to Dispatch latency     10 cycles
Dispatch to Issue latency      1 cycle
Table 3.1: Baseline Processor Model Parameters
The rest of this section has the following outline:
• Section 3.2.1 discusses the parameters chosen for the processor core.
• Section 3.2.2 discusses the parameters chosen for the caches and the main memory.
• Section 3.2.3 presents the modelled interconnects and their parameters.
• Finally, section 3.2.4 presents the cache coherence protocols and their parameters.
3.2.1 Processor Parameters
The work is not primarily aimed at the processor cores. However, the design of the processor
cores has a significant impact on the results of the memory system research. For instance,
a simple in-order processor core will not create as many simultaneous memory requests as a
powerful out-of-order core. Consequently, it would be advantageous to have powerful processor
cores as this would most likely put a large strain on the memory system.
This design philosophy is apparent in the processor core parameters shown in table 3.1. Firstly,
the different queues and buffers are relatively large. Therefore, the processor can issue many
parallel memory requests. In addition, the processor has many functional units available. This
makes the memory accesses more frequent as the CPU-bound portions of the programs are
executed faster than on a less powerful design.
The clock frequency of 3.2 GHz is taken from the Intel Core 2 Extreme shared L2 cache 2-core
processor [GMNR06]. Choosing this low but realistic clock frequency has the added benefit that
the interconnect and the processor cores can be clocked at the same frequency. If the clock
frequency were higher, each interconnect clock cycle might need to be 2 or more processor clock
cycles. Using a single clock simplifies the simulator and is further discussed in section 3.2.3. In
addition, 3.2 GHz makes one effective memory bus cycle correspond to 4 processor cycles with an
800 MHz PC2-6400 DDR2 memory bus. This point is discussed further in section 3.2.2.
With the deep pipelines of recent processors, branch prediction is an important issue. Conse-
quently, a lot of research has been done on this subject. The main requirement for the branch
predictor in this context is that it is sufficiently powerful to not stall the processor pipeline too
often. Again, the reason is that many pipeline stalls will result in less strain on the memory
system.
Using the terminology of Smith [Smi81], a branch prediction strategy can be static or dynamic. A
dynamic strategy takes into account the run-time behaviour of the branch while a static strategy
does not. Naturally, a dynamic strategy is more accurate, but it does need more hardware
support. Furthermore, McFarling has shown that the best accuracy for a given predictor size is
attainable by combining different dynamic strategies [McF93].
McFarling based his work on three different schemes:
• A bimodal predictor assumes that a branch will go in the same direction as it did the last
few times it was executed.
• A local predictor stores the history of one branch instruction and makes predictions for
it based on this history information.
• A global predictor uses the combined history of all recent branches when making a prediction.
The M5 simulator supports the local and global branch predictor schemes. The simulated
CMP uses the scheme that performs best for each branch at any given time. Furthermore, the
Branch Target Buffer (BTB) and the number of local history registers are both fairly large.
Consequently, the simulated branch predictor is reasonably powerful.
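To make the combination of a local and a global scheme concrete, the following Python sketch shows one way a hybrid predictor in the style of McFarling can be organised. It is not M5's implementation. The table sizes, the 2-bit saturating counters, the indexing by `pc % entries` and all names are assumptions made for illustration.

```python
def saturate(counter, up):
    """Move a 2-bit saturating counter one step towards 3 (up) or 0 (down)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class HybridPredictor:
    """A chooser selects the local or the global component per branch."""

    def __init__(self, entries=2048, history_bits=10):
        self.entries = entries
        self.mask = (1 << history_bits) - 1
        self.local_history = [0] * entries            # one history register per slot
        self.local_counters = [1] * (1 << history_bits)
        self.global_history = 0                       # history of all recent branches
        self.global_counters = [1] * (1 << history_bits)
        self.chooser = [1] * entries                  # >= 2 means "trust global"

    def _lookup(self, pc):
        slot = pc % self.entries
        local = self.local_counters[self.local_history[slot]] >= 2
        global_ = self.global_counters[self.global_history] >= 2
        return slot, local, global_

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        slot, local, global_ = self._lookup(pc)
        return global_ if self.chooser[slot] >= 2 else local

    def update(self, pc, taken):
        """Train both components and the chooser with the actual outcome."""
        slot, local, global_ = self._lookup(pc)
        if local != global_:
            # Only train the chooser when the components disagree
            self.chooser[slot] = saturate(self.chooser[slot], global_ == taken)
        lh, gh = self.local_history[slot], self.global_history
        self.local_counters[lh] = saturate(self.local_counters[lh], taken)
        self.global_counters[gh] = saturate(self.global_counters[gh], taken)
        taken_bit = 1 if taken else 0
        self.local_history[slot] = ((lh << 1) | taken_bit) & self.mask
        self.global_history = ((gh << 1) | taken_bit) & self.mask
```

A branch that is always taken is predicted as taken once its history register has filled up and the corresponding counter has been trained.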
The last parameters in table 3.1 are decode-to-dispatch latency and dispatch-to-issue latency.
These parameters make it possible to simulate a deeper pipeline than the 5-stage Alpha pipeline
M5 is based on. By setting the minimum time an instruction must spend between the
decode and the dispatch stage to 10 cycles, the pipeline behaves as if it had approximately
15 stages. 15 pipeline stages are more reasonable than 5 stages given a clock rate of 3.2 GHz.
The M5 simulator does not support Translation Lookaside Buffer (TLB) simulation in system
call emulation mode. In this mode, the benchmark’s system calls are executed by the operating
system the simulator is run on. Consequently, the modelled CMP does not contain a TLB.
Alternatively, the full system mode of M5 could be used. Here, the operating system is simulated
as well as the benchmark. The reason for not choosing full system simulation is that carrying
out the simulations takes more time and interpreting the results becomes more difficult. Not
simulating the TLB is probably a small price to pay to avoid these difficulties. This point is
further discussed in section 3.3.
3.2.2 Memory System Parameters
The cache latencies in table 3.2 are based on the cache parameters for Intel’s new Core 2 Duo
architecture [Int06]. This architecture has a similar memory system to the simulated CMP with
one L1 instruction cache and one L1 data cache per core. A large L2 cache is shared between all
cores. The cache sizes and latencies as well as the write strategy in table 3.2 are all identical
to the Intel architecture.
Parameter                      Value
Level 1 Data Cache             32 KB 8-way set associative
                               64 B blocks
                               LRU replacement policy
                               Write-Back
                               3 cycles latency
                               1 bank
                               4 MSHRs
Level 1 Instruction Cache      32 KB 8-way set associative
                               64 B blocks
                               LRU replacement policy
                               Write-Back
                               1 cycle latency
                               1 bank
                               4 MSHRs
Level 2 Unified Shared Cache   4 MB 8-way set associative
                               64 B blocks
                               LRU replacement policy
                               Write-Back
                               14 cycles latency
                               4 banks
                               8 MSHRs per bank
Main memory                    112 cycles access time
                               8 byte wide, DDR2-800 memory bus
Table 3.2: Baseline Memory System Parameters
The L2 cache is similar to the static non-uniform access cache architecture (S-NUCA) design
described by Kim et al. [CKB03]. The reason for it having a non-uniform access time in the work
by Kim et al., is that each bank will be at a different distance from each core. Consequently,
the transfer times through the interconnect can be different between different cores and banks.
In this report, all banks have the same interconnect delay and access times. This will probably
make the results easier to interpret, but the delay must be large enough to enable the cores to
reach the farthest bank within this time. Consequently, some performance might be lost.
Kim et al. argue that up to 32 banks are reasonable for a 4MB L2 cache in a 100 nm tech-
nology. However, the IBM Power 4 L2 cache has only 3 banks [KZT05]. Consequently, it is
somewhat unclear which number of banks is the best choice for the simulated CMP. Following
the commercial designs, the bank count is set to 4.
The caches in M5 are non-blocking. In other words, they can service more requests while
outstanding misses are being serviced by a unit further down in the memory hierarchy. This
cache design scheme was first proposed by Kroft [Kro81], and became a necessity with the
deeply pipelined superscalar out-of-order designs of the 1990s. The key idea in this scheme is to
add a hardware structure called a Miss Status Holding Register (MSHR) to keep track of the
outstanding misses. Furthermore, by searching the MSHRs on each miss, a missing cache block
is only requested once.
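The role of the MSHRs can be sketched in a few lines of Python. This is a simplified illustration and not the M5 cache code. The class name, the return values and the policy of refusing new misses when all registers are in use are assumptions.

```python
class MSHRFile:
    """Tracks outstanding misses, one register per missing cache block."""

    def __init__(self, num_mshrs=4, block_size=64):
        self.num_mshrs = num_mshrs
        self.block_size = block_size
        self.pending = {}  # block address -> list of waiting request ids

    def handle_miss(self, address, request_id):
        """Return 'merged', 'allocated' or 'blocked' for a cache miss."""
        block = address - (address % self.block_size)
        if block in self.pending:
            # The block has already been requested: only remember the waiter
            self.pending[block].append(request_id)
            return "merged"
        if len(self.pending) == self.num_mshrs:
            # No free register: the cache cannot accept further misses
            return "blocked"
        self.pending[block] = [request_id]
        return "allocated"

    def fill(self, block_address):
        """Data arrived from the next level: free the register, return waiters."""
        return self.pending.pop(block_address, [])
```

Searching `pending` before allocating is what ensures that a missing block is requested only once, even when several accesses miss on it.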
According to Kroft [Kro81], 4 MSHRs per bank is a good balance between allowing a sufficient
number of outstanding misses and hardware cost. Sohi and Franklin also used 4 MSHRs per
bank in their evaluation of a superscalar processor [SF91]. Consequently, the simulated CMP
will use 4 MSHRs per bank in the L1 caches. However, it is unclear if these results extend to a
large L2 cache in a CMP. Consequently, 8 MSHRs per bank will be used here as the shared L2
cache probably should tolerate more outstanding misses than the private L1 caches.
The focus of the research in this report is the interaction between the different L1 caches and
the shared L2 cache. Consequently, it is not necessary to go into too much detail on how the
main memory and the memory bus are modelled. However, the parameters chosen to describe
these units must be representative.
The memory bus performance is modelled on the DDR2-800 memory bus standard. The reason
for choosing this standard is that it is relatively recent and fits well with a core
clock of 3.2 GHz. M5 requires the processor clock frequency to be a multiple of the effective
bus clock frequency. In the DDR2-800 standard, the memory bus is clocked twice as fast as the
memory itself, and data is transferred on both rising and falling clock edges. The 800 part of the
DDR2-800 name indicates that the bus has an effective clock frequency of 800 MHz. However,
the real bus frequency is 400 MHz and the memory is clocked at 200 MHz. This relation is
important as vendors provide the memory latency measured in the number of memory clock
cycles used to supply the data, while they use the 800 number when
they market their chips. Therefore, knowledge about the bus interface is needed to compute the
actual latency.
The latencies used in this report are taken from the Corsair TWIN2X2048-6400 memory chip
[Cor07]. This chip was chosen because it has the DDR2-800 interface and the manufacturer’s
documentation is sufficiently detailed to compute the access latency. However, it is only an
example of a memory chip and not representative of memory chips in general.
Ali [Ali06] states that the read latency of a DDR2 chip can be found with the following formula:
Read latency = CAS latency + CAS additive latency
The Column Access Strobe (CAS) latency is the number of memory cycles that pass from when the
last part of a read command is received until the data is ready. The CAS additive latency is added
to prevent different commands from using the same resources internally in the DRAM. This makes
it possible to use every bus cycle to supply data when the reads are directed to adjacent banks.
However, this effect is not taken into account in this work, and the details of the DDR2 scheme
are beyond the scope of this report.
The Corsair TWIN2X2048-6400 memory chip [Cor07] has a CAS latency of 5 memory cycles and
a CAS additive latency of 2 cycles. Consequently, the memory latency of this chip is 7 memory
cycles. From the preceding discussion, we know that the memory clock frequency is 200 MHz.
Since the processor frequency is 3200 MHz, a memory clock cycle is
$\frac{3200\,\mathrm{MHz}}{200\,\mathrm{MHz}} = 16$ processor cycles.
Consequently, the memory latency in processor cycles is $16 \cdot 7 = 112$.
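The latency computation above is simple enough to check with a short Python sketch. The function name and its parameters are illustrative. The check that the clock ratio is an integer is an added assumption that mirrors the whole-cycle arithmetic used in the text.

```python
def memory_latency_in_cpu_cycles(cas_latency, additive_latency, cpu_mhz, memory_mhz):
    """Convert a DDR2 read latency from memory clock cycles to processor cycles."""
    if cpu_mhz % memory_mhz != 0:
        raise ValueError("processor clock must be a multiple of the memory clock")
    cpu_cycles_per_memory_cycle = cpu_mhz // memory_mhz   # 3200 / 200 = 16
    read_latency = cas_latency + additive_latency         # 5 + 2 = 7 memory cycles
    return read_latency * cpu_cycles_per_memory_cycle


print(memory_latency_in_cpu_cycles(5, 2, 3200, 200))  # prints 112
```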
The caches used in M5 do not enforce inclusion. In other words, the blocks kept in an L1 cache
are not necessarily a subset of the blocks stored in the L2 cache. This was an implementation
choice taken by the M5 developers and changing it would involve a considerable amount of work.
Consequently, inclusion is not enforced in this report. However, this does create some problems
for the cache coherence protocol implementation as discussed in chapter 4.
Parameter             Value
Transfer latency      4 processor clock cycles
Arbitration latency   5 processor clock cycles
Width                 64 bytes
Table 3.3: Baseline Interconnect Parameters
3.2.3 Interconnect Parameters
When investigating the interconnect in a CMP, a possible starting point is to investigate what
types of interconnects are used in commercial CMPs. However, it is difficult to find accurate
and credible information about this. For instance, the interconnect is part of the Intel Advanced
Smart Cache used in the Intel Core Microarchitecture, but it is unclear how it is actually im-
plemented [Int06]. However, Intel has disclosed that one core can retrieve data from a different
core’s L1 cache. Consequently, there must be some form of communication channel between the
cores as well as between the cores and the L2 cache.
AMD’s Athlon X2 has a private L2 cache for each core [AMD05]. Consequently, there is no need
for an advanced interconnect between each core’s L1 and L2 caches. A bus would suffice.
The IBM Power4 and Power5 CMPs use a crossbar interconnect called the Core Interface Unit
between the private L1 caches and the shared L2 cache [KZT05]. In this case, each core has
address and data lines that can be connected to each L2 cache bank. However, one core can
only access one bank each clock cycle. In addition, each L2 cache bank has data lines that can
be connected to all cores. Again, one L2 bank can only deliver data to one core each clock
cycle. The crossbar model used in this report will be based on IBM’s crossbar implementation.
This model is further discussed later in this section. Sun’s Niagara processor also uses a crossbar
interconnect [KAO05].
All interconnects consist of a number of transmission channels and a way of controlling access
to these channels. The interconnects differ in how many transmission channels are available and
how they are organised. The interconnect latencies used in this report are shown in table 3.3.
By keeping the total delay the same for all configurations, it becomes easy to compare the queue
delay experienced in each of the interconnects. In reality, these delays differ from interconnect to
interconnect, and taking this into account is interesting further work. Furthermore, these values
do not make much sense for the butterfly interconnect as it does not have an explicit arbitration
step. However, its parameters are chosen such that the baseline delay is approximately the same.
This point is discussed further in section 3.2.3.3.
Kumar et al. [KZT05] have carried out comprehensive simulations of bus and crossbar intercon-
nects. In particular, they present detailed calculations of the transfer and arbitration latencies
of a 5 GHz CMP bus. They found that the transfer latency is 6 processor cycles if the wires are
routed through the 8X plane. In addition, they used an arbitration latency of 8 processor cycles.
By scaling these values to a bus clock frequency of 3 GHz, we get the values shown in table
3.3. For instance, $\lceil 6 \text{ clock cycles} \cdot \frac{3\,\mathrm{GHz}}{5\,\mathrm{GHz}} \rceil = 4$ clock cycles
computes the transfer time. The sum
of these delays is higher than the 5.5 bus cycles used in Intel’s Advanced Smart Cache [Int06].
However, it is unclear how this value is computed. Consequently, Kumar’s numbers will be used
in this report. All channels are wide enough to accommodate one cache line each clock cycle.
Again, this choice has been made to make it easier to compare the interconnects. In reality,
the width of the channels would most likely be adjusted to provide maximum performance for
a given area budget. Investigating these trade-offs is further work.
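The scaling of Kumar's latencies can be reproduced with a small Python sketch. The function name is hypothetical, and the sketch assumes that the clock rates are given as integers so that the ceiling can be computed without floating-point arithmetic.

```python
def scale_latency(cycles, source_ghz, target_ghz):
    """Scale a latency in clock cycles from one clock rate to another, rounding up."""
    # Ceiling division on integers: -((-a) // b) equals ceil(a / b)
    return -((-cycles * target_ghz) // source_ghz)


transfer = scale_latency(6, 5, 3)      # ceil(3.6) = 4
arbitration = scale_latency(8, 5, 3)   # ceil(4.8) = 5
print(transfer, arbitration, transfer + arbitration)  # prints 4 5 9
```

The sum of 9 processor clock cycles is the end-to-end delay that results from the two values in table 3.3.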
Figure 3.2: Shared Bus Model - Data Bus (CPU 1, CPU 2 and L2 cache banks 1 to 4 each connect
their address and data lines to one shared data bus)
The rest of this section will describe the following interconnect models:
• Split transaction bus
• Crossbar
• Butterfly
• Ideal interconnect
3.2.3.1 Split Transaction Bus Model
The data bus of the split transaction bus for a 2-core CMP and an L2 cache with 4 banks is
shown in figure 3.2. As there is only one transmission channel and it is shared among all cores
and L2 banks, only one of these units can use the bus at one time.
A read transaction will need arbitration both when sending the address to the L2 cache and
when the requested cache line is sent to the L1 cache. Note that this is an advantage as other
transactions can use the bus while the L2 cache is fetching the requested cache line. When a
unit is granted access, it gets both the data and the address bus regardless of whether it needs
both. Consequently, only one arbitration unit is needed.
Since the bus is bi-directional, pipelining of transfers is not possible. However, this makes it
easy to use a snooping cache coherence protocol.
3.2.3.2 Crossbar Model
The crossbar model used in this thesis is based on the IBM crossbar implementation used in
the Power4 and Power5 CMPs [KZT05] and is shown in figure 3.3. As in the IBM scheme,
the crossbar is really two crossbars: one in the L1 to L2 cache direction and one in the L2 to
L1 direction. However, the model differs from the IBM scheme in two respects. Firstly, all
transmission channels have both address lines and data lines. Secondly, there is one set of
channels for the data cache and one set of channels for the instruction cache. Adding these
connections significantly increases the area overhead of the crossbar. In addition, it means that
the model indicates an upper bound on the performance achievable with this type of crossbar.
Figure 3.3: Crossbar Model (switches, marked S, connect the L1 instruction and L1 data caches to
L2 banks 1 to 4; a coherence bus links the L1 caches; all channels have data and address lines in
both directions)
Since the transmission lines are uni-directional, pipelined transfer is possible. The boxes marked
S in figure 3.3 are switches.
As mentioned, the IBM scheme only has address lines in the L1 cache to L2 cache direction.
However, the L1 caches in the simulated CMP can have several outstanding requests. Conse-
quently, we need a way to communicate which cache line is being delivered to the L1 cache. A
simple way to do this is to have address lines in both directions. It is possible to find more area
efficient ways of communicating this information but this simple solution will be used in this
report.
Arbitration in the crossbar is carried out by setting the appropriate control signals for the
switches and resolving contention. Since each cache only has one input line, only one unit
can send to it at a time. This task is less work than the arbitration carried out with the
split transaction bus. However, it is set to take the same amount of time to make it easier to
compare the two interconnects. Furthermore, arbitration for the address and data lines is done
simultaneously. Consequently, one request from a core will result in it being granted both the
data and the address lines.
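The contention rule described above can be illustrated with the following Python sketch. It is not the arbitration logic of the simulated crossbar. The fixed priority given by list order, the rule that a sender is granted at most one destination per cycle and all names are assumptions.

```python
def arbitrate(requests):
    """Grant at most one sender per destination in a single cycle.

    requests -- list of (source, destination) pairs that want to send
    Returns the granted pairs; all other requests must retry later.
    """
    busy_sources, busy_destinations, granted = set(), set(), []
    for source, destination in requests:   # list order acts as priority
        if source in busy_sources or destination in busy_destinations:
            continue                        # contention: this request waits
        busy_sources.add(source)
        busy_destinations.add(destination)
        granted.append((source, destination))
    return granted


print(arbitrate([("cpu0", "bank1"), ("cpu1", "bank1"), ("cpu1", "bank2")]))
# prints [('cpu0', 'bank1'), ('cpu1', 'bank2')]
```

A real arbiter would need a fair policy, for instance round-robin, so that a low-priority sender is not starved.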
Data transfer from the private L1 caches to the shared L2 cache is the common case in a CMP.
However, adding the possibility of transferring data from one L1 cache to another can be used
to get faster inter-processor communication. As shown in figure 3.3, this is accomplished by
adding a bus between the L1 caches. According to Kumar, this is the solution used in the IBM
Power processors but the details are unclear [KZT05].
3.2.3.3 Butterfly
The Butterfly interconnect differs from the other interconnects in that it is not helpful to divide
the total delay into an arbitration latency and a transfer latency. The reason is that arbitration
is done in each switch in the butterfly. If the data at two different switch input ports need
the same output channel, one is granted access and the other is blocked. Consequently, the
arbitration delay is set to 0 in the butterfly interconnect.
Furthermore, a butterfly is really a class of interconnects. In this report, only radix 2 butterflies
are investigated. In other words, all switches have 2 input ports and 2 output ports. Different
radix configurations change the trade-off between switch delay and the number of hops needed
to cross the network and are left as further work.
Setting realistic values for the delay of the different butterfly channels and switches requires
making a detailed floorplan. Consequently, it is probably a bit too ambitious for this work.
However, by making the assumption that the total transfer delay through all interconnects is
approximately the same, we can investigate the impact of congestion. This is the main focus
of this report. In other words, the end-to-end delay of the butterfly should be as close to the 9
processor clock cycle latency used in the other interconnects as possible.
The butterfly models for the two and four-core CMPs are shown in figure 3.4. The 8-
core butterfly is not shown as it is relatively large and quite similar to the butterflies shown.
Consequently, it is only discussed in the text.
Figure 3.4(a) shows the 2-core CMP butterfly used in this work. Here, the data cache and
instruction cache of one processor are mapped to one terminal node in the butterfly. In addition,
two L2 banks are mapped to the same terminal node. An alternative way of mapping caches to
nodes is to assign one node to each instruction cache, data cache and L2 bank. This results in
a total of 8 nodes which is significantly more expensive in terms of chip area than a four-node
butterfly. The downside of choosing the four-node butterfly is that congestion is probably more
likely.
The four-core butterfly is shown in figure 3.4(b). Here, an eight-node butterfly is a perfect fit
when each processor and L2 bank is a terminal node. The situation is more complex for the
eight-core butterfly. Here, the chosen solution is to map each processor and L2 bank to a terminal
node. However, this results in 12 terminal nodes while the next possible butterfly network has
16 terminal nodes. Consequently, four terminal nodes are left unconnected. Since not all terminal
nodes will inject traffic, congestion is probably less likely than in a butterfly where all nodes
are in use. An alternative scheme would be to add four more L2 banks. However, this would
make it difficult to compare the butterfly to the other interconnects.
The transmission latency through a butterfly is given by the equation:
total latency = (number of switches traversed · switch latency)
+ (number of channels traversed · channel latency)
The number of switches and channels traversed is the same regardless of origin and destination
node. Consequently, this equation can be used to compute the end-to-end delay of a given
butterfly interconnect. For instance, in the 2-core CMP butterfly shown in figure 3.4(a), each
request must traverse two switches and three channels.
[Figure 3.4: Examples of Butterfly Models. (a) 2 Core Butterfly Model: Processor 0 and Processor 1 (L1D + L1I) and L2 Banks 0 to 3 on terminal nodes 1 to 4, with switches 0.1 and 0.2 in stage 0 and switches 1.1 and 1.2 in stage 1. (b) 4 Core Butterfly Model: Processors 0 to 3 (L1D + L1I) and L2 Banks 0 to 3 on terminal nodes 1 to 8, with switches 0.1 to 0.4, 1.1 to 1.4 and 2.1 to 2.4 in stages 0 to 2.]
The switch and channel latencies for the different butterflies are given in table 3.4. These values
are chosen to get as close as possible to a total delay of 9 processor clock cycles. The reason is
that this is the end-to-end delay of the other interconnects investigated in this report. For the
8-core butterfly where each request traverses 5 channels and 4 switches, this corresponds to a
switch latency and channel latency of 1 processor clock cycle. For the other two butterflies, it
is not possible to get exactly 9 clock cycles total delay by manipulating the switch and channel
latencies. Here, the values are chosen such that the total delay is 10 clock cycles. Consequently,
the butterfly has a small latency penalty compared to the other interconnects in the 2 and 4-core
CMPs.
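The latency equation can be checked against the values in table 3.4 with a few lines of Python. The sketch below is illustrative only: the function and dictionary names are hypothetical, and the hop counts for the 4-core butterfly (three switches and four channels) are inferred from the three stages shown in figure 3.4(b) rather than stated in the text.

```python
def butterfly_latency(switches, channels, switch_latency, channel_latency):
    """End-to-end delay in processor clock cycles, following the equation above."""
    return switches * switch_latency + channels * channel_latency


# cores -> (switches traversed, channels traversed, switch latency, channel latency)
# Latencies are taken from table 3.4. The 4-core hop counts are an assumption
# derived from the three-stage butterfly in figure 3.4(b).
BUTTERFLIES = {
    2: (2, 3, 2, 2),
    4: (3, 4, 2, 1),
    8: (4, 5, 1, 1),
}


def total_latencies():
    """Return the end-to-end delay for each butterfly configuration."""
    return {cores: butterfly_latency(*params)
            for cores, params in BUTTERFLIES.items()}


if __name__ == "__main__":
    for cores, latency in sorted(total_latencies().items()):
        print(f"{cores}-core butterfly: {latency} processor clock cycles")
```

Under these assumptions the 2 and 4-core butterflies come out at 10 clock cycles and the 8-core butterfly at 9 clock cycles, which agrees with the discussion above.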
Parameter                 Value
2-core channel latency    2 processor clock cycles
2-core switch latency     2 processor clock cycles
4-core channel latency    1 processor clock cycle
4-core switch latency     2 processor clock cycles
8-core channel latency    1 processor clock cycle
8-core switch latency     1 processor clock cycle
Arbitration latency       0 processor clock cycles
Width                     64 bytes
Table 3.4: Baseline Butterfly Interconnect Parameters
3.2.3.4 Ideal Interconnect
To estimate the potential performance improvement available for the different simulated benchmarks,
it is useful to compare to an ideal interconnect. Here, transmission and arbitration take
the same number of clock cycles as in the bus and crossbar interconnects. The reason is that
these are to some extent physical limits that cannot be avoided. A request is granted access
as soon as the specified delay has passed. In other words, there is no queueing delay. Consequently,
the ideal interconnect gives an indication of the maximum possible speedup achievable
from interconnect improvements.
3.2.4 Coherence Protocol Parameters
The possible cache coherence protocols were discussed extensively in chapter 2. Furthermore, the
research questions focus on the private cache to shared cache interconnect. A directory protocol
will create full freedom for the choice of interconnect.
The only problem is that the M5 simulator only supports snooping protocols. Consequently, M5
must be extended with a directory protocol. The Stenström protocol discussed in section 2.3.2
was chosen. The reason is that it optimises L1 to L1 communication by storing which other L1
caches have copies of a cache block in the owner’s cache. Therefore, it is likely that this protocol
will outperform traditional directory protocols that require a directory access before all cache
to cache transfers.
The choices taken regarding the Stenström protocol are largely implementation choices that will
be discussed in section 4.3. Consequently, the protocol will not be discussed further in this
chapter.
3.3 Simulator
There are two simulators that are the most likely candidates for being used in this work:
• The SimpleScalar [ALE02] based IDI CMP Simulator developed by Haakon Dybdahl and
Marius Grannæs
• The M5 Simulator [BDH+06]
The reason for the IDI CMP simulator being a candidate is that it was the simulator used in
my 5th year project [Jah06]. Furthermore, a number of useful extensions to this simulator were
developed during this project. However, it is not trivial to get scientific benchmark suites like
SPLASH-2 to run on this simulator. As this is crucial for the research described in this report,
the IDI CMP simulator is probably not a good choice.
[Figure 3.5: Typical Experiment Work-Flow. Boxes shown: Run Benchmark (Python script), M5 Simulator on Clusters, Simulation Results (Text files), Parse Result (Python script), Result database, Plot Result (Python script).]
Arnt Jørgen Lande investigated possible CMP simulators for use at the NTNU Computer Ar-
chitecture Research Group (NCAR) at IDI [Lan06]. He investigated the Rsim, Asim, SimOS,
Simics, TFSim, SimFlex, GEMS and M5 simulators and concluded that M5 was the most ap-
propriate for the NCAR group. The most important point for this work is that M5 supports
cache coherent CMP simulation. Although this feature will not be used directly, some of the
facilities needed to implement a coherence protocol will be available. Furthermore, pre-compiled
binaries for the SPLASH-2 benchmark suite [WOT+95] are distributed with the simulator.
Another feature of M5 is that it supports both full system simulation and system call emulation
simulation. In full system simulation, the simulated program is run on a simulated operating
system. Consequently, the behaviour of the operating system can be studied as well as the be-
haviour of the benchmark program. In system call emulation simulation, the operating system
calls in the benchmark are executed by the operating system of the computer running the simu-
lator. Because the operating system is not simulated, system call emulation simulation is faster
than full system simulation. However, the simulation results will say nothing about the impact
of the operating system on performance. This is not necessarily a drawback as it probably makes
the simulation results easier to understand. Consequently, system call emulation simulation is
used in this report. Sadly, some problems with M5’s thread implementation discovered late in
the work made system call emulation a poor choice. These problems will be discussed
in chapter 4.
3.3.1 Experiment Tool-chain
Carrying out experiments consists of running the simulator with different parameters, recording
the results and analysing them. Furthermore, this procedure must often be iterated to gain a
good understanding of the results. Consequently, there is a need for automating this process.
This typical experiment work-flow is illustrated in figure 3.5.
The following processes have been automated with Python scripts:
• Running experiments with different parameters on the Clustis2 [Clu] and Norgrid [Nor]
clusters.
• Retrieving selected values from the large simulation result text files and storing them in a
SQLite [SQL] database.
• Plotting the results in a graphical form to ease analysis.
This experiment set-up turned out to be too complicated in practice. The main problem was
that it was too cumbersome to make it retrieve statistics other than the ones identified when the
system was first made. A better way was to use a number of short Python scripts that retrieve
statistics according to a regular expression. These short scripts can be easily modified to extract
different simulation data.
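A short script of the kind described above might look as follows. This is a minimal sketch and not the script used in this work: it assumes that each statistic occupies one line of the result file, with the statistic name followed by whitespace and a numeric value, and the statistic names in the sample are hypothetical.

```python
import re


def extract_stats(text, pattern):
    """Return {name: value} for all statistics whose name matches pattern.

    Assumes one statistic per line in the form '<name> <value> [comment]'.
    Lines that do not fit this form are skipped.
    """
    name_re = re.compile(pattern)
    results = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not name_re.search(fields[0]):
            continue
        try:
            results[fields[0]] = float(fields[1])
        except ValueError:
            continue  # second field is not numeric, so this is not a statistic
    return results


# Hypothetical excerpt of a simulation result file.
SAMPLE = """\
sim_ticks 100000000 # number of ticks simulated
L1dcache0.miss_rate 0.05
L1dcache1.miss_rate 0.07
"""

if __name__ == "__main__":
    for name, value in extract_stats(SAMPLE, r"miss_rate$").items():
        print(name, value)
```

Changing the regular expression is enough to retrieve a different set of statistics, which is the property that made the short scripts easier to work with than the database-based set-up.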
3.4 Benchmarks
In this report, there is a need for two different benchmark types:
• Multi-threaded benchmarks make it possible to investigate application communication
behaviour
• Single-threaded benchmarks can be combined to create multiprogrammed workloads. This
makes it possible to investigate CMP interconnects without adding a cache coherence
protocol.
In the assignment text, it is suggested to use the SPLASH-2 benchmark suite [WOT+95] to inves-
tigate application communication behaviour. Furthermore, it suggests that the SPEC CPU2000
benchmark suite [SPEa] should be used for sensitivity analysis. In other words, the SPEC2000
benchmarks should be used to verify that new schemes aimed at scientific workloads do not re-
duce the performance of single-threaded applications. The main problem with only using these
two benchmark suites is that they do not include commercial applications. However, scientific
shared memory workloads and single-threaded workloads are represented. Consequently, this
will limit the generality of the conclusions in this report to these application classes. In addi-
tion, the SPEC CPU2006 benchmark suite [SPEb] has recently been introduced. Consequently,
this should replace SPEC2000 in future work.
3.4.1 SPEC CPU2000 Multiprogrammed Workloads
SPEC CPU2000 has been used extensively by the computer architecture research community.
Sadly, Citron points out that it has also been misused [Cit03]. He makes the following points:
• Many papers do not simulate all applications in the benchmark suite. Consequently,
average performance measures might be misleading. In this report, all benchmarks from
the SPEC2000 suite are used.
• The benchmarks are not run to completion with the reference datasets.
• In 2003 when the paper was written, many researchers still used the retired SPEC CPU95
benchmark suite.
As Citron acknowledges, running the SPEC CPU2000 benchmarks to completion with the reference
datasets is intractable in practice because of long simulation times. However, it is possible
to run the CPU2000 benchmarks to completion with the reduced datasets. On the other hand,
Perelman [PHC03] maintains that these datasets either put too much emphasis on program
initialisation or are still too large to be simulated to completion.
Perelman advocates the use of simulation points (SimPoints) [PHC03]. Here, the benchmark
is analysed to find some sections of the program that together represent the behaviour of the
whole program. These sections are then simulated in detail. Since fast-forwarding is used
between simulation points to keep for instance the cache state up to date, it is advantageous
to choose simulation points early in the program’s execution. This technique is not used in
this report. The reason is that when simulating multiprogrammed workloads, the interaction
between benchmarks is just as interesting as one benchmark on its own. SimPoints does not
take this into account.
Another possible technique is to fast-forward past the initialisation phase of the benchmark.
This fast-forwarding is followed by a warm-up phase to reduce the cold-start bias from execution
history aware units such as caches and branch predictors. Then, all performance counters are
reset and the benchmark is simulated for a fixed number of clock cycles. Although this technique
removes the start-up phase from the results, the simulated part of the benchmark might still
not be representative. However, since each SPEC2000 benchmark is part of a multiprogrammed
workload it is the interplay of these benchmarks that is the main concern. By choosing both
benchmarks in a workload and the number of fast-forward cycles per benchmark in the workload
at random, we can get interesting workloads. This approach was inspired by the experiment
methodology used by Haakon Dybdahl [DS07].
In this report, 40 workloads were created for 2, 4 and 8 CPUs by randomly choosing SPEC2000
benchmarks. Consequently, there are 120 workloads in total. All SPEC2000 benchmarks are
represented in at least one workload for each number of CPUs. The benchmarks are fast-
forwarded a random number of clock cycles between 1 and 1.1 billion and then simulated in detail
for 100 million clock cycles measured from when the last benchmark finished fast-forwarding.
There is no need for warm-up in the M5 simulator because the memory hierarchy is simulated in
detail while fast-forwarding. The multiprogrammed workloads used in this report can be found
in appendix A.
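The workload generation procedure can be sketched in Python as shown below. This is an illustration under stated assumptions and not the script used to create the workloads in appendix A: the function name and parameters are hypothetical, benchmarks are drawn independently (so a benchmark may appear more than once in a workload), and the requirement that all benchmarks are represented is met by simply redrawing until it holds.

```python
import random


def make_workloads(benchmarks, num_cpus, num_workloads=40, seed=0,
                   ff_min=1_000_000_000, ff_max=1_100_000_000):
    """Create random multiprogrammed workloads.

    Each workload is a list of (benchmark, fast_forward_cycles) pairs, one
    pair per CPU. The whole set is redrawn until every benchmark appears in
    at least one workload.
    """
    if num_cpus * num_workloads < len(benchmarks):
        raise ValueError("too few workload slots to cover all benchmarks")
    rng = random.Random(seed)  # seeded so that the workloads are reproducible
    while True:
        workloads = [
            [(rng.choice(benchmarks), rng.randint(ff_min, ff_max))
             for _ in range(num_cpus)]
            for _ in range(num_workloads)
        ]
        used = {name for workload in workloads for name, _ in workload}
        if used == set(benchmarks):
            return workloads


if __name__ == "__main__":
    # Placeholder names; the real input would be the SPEC2000 benchmark list.
    demo = make_workloads(["bench_a", "bench_b", "bench_c"], num_cpus=2,
                          num_workloads=5)
    for workload in demo:
        print(workload)
```

Calling the function once for each of 2, 4 and 8 CPUs with 40 workloads per call gives the 120 workloads described above.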
It is an open question to what extent these workloads will be representative of real-life workloads.
However, they will probably serve to identify bottlenecks in the CMP memory system which is
the main aim of this work. Consequently, this methodology will be used in the simulations in
this report.
To compare two architecture schemes, it is important that the exact same part of the benchmark
is simulated. Creating samples by specifying clock cycles is not ideal in this respect. If one scheme
performs substantially better than another scheme, it might move into a different execution phase
which makes the results less accurate. An alternative would be to use the number of committed
instructions to create samples. The problem here is that the different applications in a workload
commit instructions at very different rates. Consequently, a situation where one benchmark
is running alone can occur. This is even worse than simulating slightly different parts of the
benchmark. Therefore, samples are defined in terms of clock cycles in this work.
3.4.2 SPLASH-2 Communicating Workloads
To simulate the SPLASH-2 benchmarks [WOT+95] to completion takes less time than with the
SPEC2000 benchmarks. However, it still takes too long to be practical. Consequently, the fast-
forwarding techniques described for the SPEC2000 benchmarks can also be useful here. Sadly,
flaws in the M5 thread implementation make it difficult to use fast-forwarding. Furthermore,
they limit how much of the SPLASH-2 benchmarks can be simulated. These flaws are related
to the system call emulation thread implementation and will be discussed in detail in section
4.1.2. Sadly, the flaws are difficult to fix. The reason is that the benchmarks and the M5 specific
pthreads library only compile with a library only found on a Tru64 UNIX operating system in
an Alpha-based system.
The best way to get round these problems is to switch to full system simulation. However,
there was not enough time left to do this when the problem was detected. Consequently, a
workaround is needed for the system call emulation mode. One possibility is to detect the
problem and abandon simulation with an error message. Sadly, some benchmarks deadlock
very early in their execution and this results in very short simulations. Another option is to
detect the situation and start a waiting processor. This is most likely not the correct behaviour,
but it makes it possible to simulate for a reasonable number of clock cycles. Consequently,
the applicability of the results will be limited, but at least some results can be gathered and
analysed.
Benchmark         2 CPUs         4 CPUs           8 CPUs
Barnes            200 million    200 million      200 million
Cholesky          200 million    100 million      200 million
FFT               150 million    125 million      100 million
FMM               80 million     80 million       15 million
LUContig          200 million    150 million      100 million
LUNoncontig       200 million    100 million      75 million
OceanContig       180 million    60 million       30 million
OceanNoncontig    170 million    50 million       30 million
Radix             50 million     25 million       15 million
Raytrace          180 million    To completion    To completion
WaterNSquared     200 million    150 million      75 million
WaterSpatial      200 million    125 million      75 million
Table 3.5: Number of Instructions Simulated with SPLASH-2 Benchmarks
Table 3.5 shows the simulation lengths chosen. These were found by trial-and-error and are
chosen such that all simulation runs finish within 16 hours of simulation time on the Norgrid cluster.
Consequently, it is difficult to say if they are representative for the whole program execution or
not. This further limits the applicability of the experimental results. However, some insight can
be gained into the performance of parallel programs.
In addition to the benchmarks listed in table 3.5, radiosity and volrend are also part of the
SPLASH-2 benchmark suite. These were not included with the binaries shipped with M5.
Since a Tru64 UNIX operating system and Alpha processor are needed to compile the library, it
is also difficult to compile these benchmarks for the M5 simulator.
The default problem size for the LUContig and LUNoncontig benchmarks is a 512 × 512 matrix
with 16 element blocks. However, the precompiled LUNoncontig benchmark was compiled with
a default matrix size of 128 × 128. The effect of this was that simulation terminated after 10 to
15 million clock cycles. This is too short for it to be useful as a benchmark. The problem was
fixed by setting the matrix size to 512 × 512 with a command line parameter.
Chapter 4
Simulator Extensions
Ideally, the chosen simulator would provide all necessary features for investigating the problem at
hand. However, this is rarely the case. Consequently, there is a need to extend the simulator with
the needed features. This chapter describes the extensions to the M5 simulator [BDH+06] developed
as a part of this work. The complete simulator source code used in this work is attached as a
digital appendix. Furthermore, the extension source code can be found in appendix C.
This chapter has the following outline:
• Section 4.1 contains a quick introduction to the M5 simulator.
• Then, section 4.2 describes the extensions to the private to shared cache interconnect.
• Section 4.3 describes the implementation of the Stenström directory protocol.
• Finally, section 4.4 describes the steps taken to make sure that the new components behave
as specified.
4.1 The M5 simulator
The M5 simulator is a computer architecture simulator originally developed at the University of
Michigan for modelling networked systems [BDH+06]. This requires a large number of features
and results in M5 being a large software system. Consequently, a full review of the simulator is
beyond the scope of this report. On the other hand, it is necessary to provide enough detail to
understand the extensions presented in this chapter.
The M5 team has at the time of writing just released the third beta version of M5 2.0. Beta 1
of version 2.0 was released in the late autumn of 2006. Sadly, all beta versions lack coherence
protocol support and this makes them poorly suited to communication research. Although a
directory protocol would have to be developed anyway, it is preferable to work on a simulator
where a coherence protocol is supported. The reason is that the developers have been thinking
about coherence support while developing the simulator. Consequently, it was considered too
risky to move to version 2.0, and the simulator used in this report is M5 version 1.1.
4.1.1 M5 Overview
Understanding the simulator extensions requires a basic understanding of how the M5 simulator
works. This section starts with describing how the M5 simulator is configured. The reason is that
this is done in a special way that gives some insights into how the simulator is constructed. Then,
the basic simulator architecture is described with a strong emphasis on the memory system.
[Figure 4.1: M5 Simulator Configuration. Components shown: User defined command-line interface, User defined configuration script, M5 Python, M5 defined ini-file, M5 SimObject Builder, SimObjects.]
4.1.1.1 M5 Simulator Configuration
A simulator is really a collection of components that can be interconnected in different ways.
In M5, these components are known as SimObjects. More precisely, a SimObject is a simulator
component that can be configured outside the simulator. If these components can be connected
in many ways, it is more likely that the simulator can be used for many different research
projects. However, this flexibility also makes the simulator difficult to use. This is less of a
concern as researchers are willing to invest a considerable amount of time in understanding a
simulator.
Figure 4.1 shows how M5 is configured. The simulator is run from the command line. Here, a
number of user defined options are input to a user defined configuration script. This configuration
is written in Python and sets the parameters for the different SimObjects and their connections.
Each SimObject has an internal M5 Python file that defines its external interface. The user
defined configuration script uses this interface to set the parameters of the SimObjects. These
parameters are then used to create a configuration tree which is flattened and written to an
ini-style configuration file.
This configuration file is the interface between the input parsing part written in Python and the
actual simulator which is written in C++. In the simulator, the configuration file is read and the
configuration tree is recreated. The SimObject Builder then creates the SimObjects as specified
in the configuration file. Each SimObject also has an object builder class associated with it. This
class is created in a standardised way, and instantiates the simulator object with the parameters
defined in the configuration file. When this configuration phase is finished, simulation is started.
Consequently, each SimObject has three classes associated with it: the actual C++ SimObject
class, a C++ SimObject builder class and a Python class. This makes the elaborate simulator
configuration scheme used in M5 possible. The main advantage of this scheme is that only the
necessary configuration options are made available on the command line. In other simulators like
for instance SimpleScalar [ALE02], all parameters are set with command line options. Conse-
quently, a long command line is needed and this makes it difficult to start the simulator without
a script. The downside of this scheme is that implementing new SimObjects takes time.
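The idea of flattening a configuration tree into an ini-style file can be illustrated with a small Python sketch. The code below is not taken from M5 and does not use M5's interfaces: the tree layout, the section naming with dotted paths and the children key are assumptions made for the example, and only the Python standard library configparser module is used.

```python
import configparser
import io


def flatten(tree, path="root", out=None):
    """Flatten a nested dict into {section name: {parameter: value}}.

    Nested dicts are treated as child objects and get their own section,
    named by the dotted path from the root (a hypothetical convention).
    """
    out = {} if out is None else out
    params = {k: str(v) for k, v in tree.items() if not isinstance(v, dict)}
    children = {k: v for k, v in tree.items() if isinstance(v, dict)}
    if children:
        params["children"] = " ".join(children)
    out[path] = params
    for name, child in children.items():
        flatten(child, f"{path}.{name}", out)
    return out


def to_ini(tree):
    """Serialise a configuration tree to ini-style text."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names case sensitive
    parser.read_dict(flatten(tree))
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical object and parameter names.
    config = {"type": "Root", "cpu0": {"type": "CPU", "width": 4}}
    print(to_ini(config))
```

A consumer on the simulator side can then read the sections back and rebuild the tree from the children entries before instantiating the objects, which mirrors the role of the SimObject Builder described above.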
[Figure 4.2: M5 Memory System Example. Components shown from left to right: Processor, Memory Interface, Cache (with Coherence Protocol), Interconnect (Master Interface, Bus, Slave Interface), Cache (with Coherence Protocol), Interconnect (Master Interface, Bus, Slave Interface), Memory.]
4.1.1.2 M5 Memory Hierarchy
Figure 4.2 shows an example of a memory system configuration in the M5 simulator. Here, a
processor is connected to two levels of caches and a main memory. The blue boxes are SimObjects
and the grey boxes are helper classes used in the simulator. The processor core is shown at the
left side of figure 4.2. In M5, this can be a detailed or a simple processor. The simple processor
is used for fast-forwarding while the detailed processor is used when measurements are taken. In
M5 version 1.1, which is the version used in this work, the detailed CPU implements an execute-in-fetch
model. In other words, an instruction is executed in the fetch stage of the simulated pipeline
and only timing analysis is done in later stages. In contrast, M5 version 2.0 uses an execute-in-execute
model. This model provides accurate simulation of time dependent instructions such as
synchronisation operations [BDH+06]. The processor core communicates with the
memory system through the MemoryInterface class which hides the memory system access
details.
The cache implementation used in M5 is the same for first and second level caches. Furthermore,
it makes heavy use of the C++ template construct. This creates difficulties for the M5 config-
uration system. Consequently, the components used by the cache are not SimObjects and their
external interface is made available through the cache’s external interface. This is the reason
for the coherence protocol class in figure 4.2 not being a SimObject.
The M5 cache-to-cache interconnect is shown in the middle of figure 4.2. The only available
interconnect is a split transaction bus. This bus is accessed through two interface classes,
namely MasterInterface and SlaveInterface. The master term signifies that the interface is
on the processor side of the interconnect. Conversely, the slave term means that it is on the
memory side of the interconnect. The interconnect can be extended by modifying the files in
the interconnect box in figure 4.2.
The main memory implementation used in this report is very simple. If a response is needed, it
is returned after the user specified memory latency. For this work, this simple memory model is
sufficient. In research that focuses mainly on main memory access, it is probably too simplistic
and should be extended.
[Figure 4.3: M5 Memory Layout for Multiprogrammed Workloads. The process address space of each CPU (data from 0x00000000 + CPU offset, instructions from 0x12000000 + CPU offset, data from 0x14000000 + CPU offset) is placed in the total memory address space (0x0000000000000000 to 0xFFFFFFFFFFFFFFFF) for CPU 0, CPU 1, ..., CPU n at multiples of the per CPU memory size.]
4.1.2 Flaws in M5
Unfortunately, the M5 simulator is not perfect. Even worse, there are a few flaws in M5 that
may influence the simulation results. The first flaw is due to the lack of address translation and
is described in section 4.1.2.1. This results in all applications being mapped to the same address
space when running multiprogrammed workloads. Consequently, they might warm up the cache
for each other. The second flaw is that the mapping of requests to L2 banks makes all requests go
to one L2 cache bank when the scientific benchmarks are run. This problem is discussed in section
4.1.2.2. The third flaw is the thread implementation used in the system call emulation mode. This
implementation has several problems, and these are discussed in section 4.1.2.3.
4.1.2.1 Multiprogrammed Workload Address Translation Flaw
In a multiprogrammed workload, all benchmarks end up using the same address space in M5.
The reason is that there is no address translation support in the system call emulation mode.
The benchmarks will still run correctly, but the statistics gathered may not be accurate. For
instance, one processor can move a cache block into a shared cache. If this block is accessed
from a different processor, the result is a cache hit. However, it should have been a cache miss.
Luckily, this flaw is easy to fix. The key idea is to intercept all memory requests and move them
to a part of address space reserved for the processor the application is running on. When the
request is returned to the processor, the address is changed back to the original address.
Figure 4.3 illustrates how this fix works. Basically, each processor is given a part of the total
address space as given by the equation:
Per CPU Memory Size = 2^{64} / Number of CPUs
Then, the start address and new address can be computed:
CPU offset = CPU ID · Per CPU Memory Size
New address = Old address + CPU offset
The borders between the different address spaces are guarded by assertions. This makes the
simulator quit with an error message if a request is relocated into another processor's address
space. The address boundaries in figure 4.3 have been estimated by tracing which addresses the
different caches access when running various SPEC2000 benchmarks. It seems like the address
space between addresses 0x12000000 and 0x14000000 is used for instructions and that the rest
of the address space is used for data.
This fix is not very realistic. The reason is that address translation is normally carried out by
the hardware in cooperation with the operating system. Consequently, full system simulation is
needed to do address translation realistically. Nevertheless, this fix is sufficient for system call
emulation simulation.
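The relocation scheme described in this section can be written out as a small Python sketch. The actual fix is part of the simulator source code, so the function names below are hypothetical and the code only illustrates the arithmetic and the assertion that guards the address space borders.

```python
ADDRESS_SPACE_SIZE = 2 ** 64  # total memory address space


def per_cpu_memory_size(num_cpus):
    """Size of the address range reserved for each processor."""
    return ADDRESS_SPACE_SIZE // num_cpus


def relocate(address, cpu_id, num_cpus):
    """Move a request address into the range reserved for cpu_id."""
    size = per_cpu_memory_size(num_cpus)
    # Guard the border: the original address must fit inside one range,
    # otherwise the request would end up in another processor's range.
    assert 0 <= address < size, "request relocated outside its address space"
    return address + cpu_id * size


def restore(new_address, cpu_id, num_cpus):
    """Undo the relocation when the response returns to the processor."""
    return new_address - cpu_id * per_cpu_memory_size(num_cpus)


if __name__ == "__main__":
    relocated = relocate(0x12000000, cpu_id=1, num_cpus=4)
    print(hex(relocated), hex(restore(relocated, cpu_id=1, num_cpus=4)))
```

With four processors, each range covers 2^62 addresses, so an access to 0x12000000 from CPU 1 is relocated to 0x12000000 + 2^62 and mapped back to 0x12000000 on return.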
4.1.2.2 Address to L2 Cache Bank Mapping
In standard M5, each L2 cache bank is responsible for a contiguous part of the address space.
For multiprogrammed workloads, the address translation described in section 4.1.2.1 is sufficient
to distribute accesses across banks. However, all accesses are mapped to the same bank for the
scientific workloads. This leads to a very low cache utilisation and can be fixed by using the
least significant bits of the cache block address to select the bank. When the number of banks
is a power of two, this is equivalent to using the modulo operator. This solution is used in
the scientific workload experiments in this report.
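The bank selection can be expressed in a few lines of Python. This is a sketch and not the simulator code: the function name is hypothetical and the 64 byte cache block size is an assumption made for the example.

```python
def select_bank(address, num_banks, block_size=64):
    """Select an L2 bank from the least significant bits of the block address.

    block_size is assumed to be 64 bytes for illustration. Both block_size
    and num_banks must be powers of two for the bit mask to be valid.
    """
    assert num_banks > 0 and num_banks & (num_banks - 1) == 0
    assert block_size > 0 and block_size & (block_size - 1) == 0
    block_address = address // block_size  # drop the block offset bits
    return block_address & (num_banks - 1)  # same as block_address % num_banks


if __name__ == "__main__":
    # Consecutive cache blocks are spread round-robin over four banks.
    for block in range(6):
        print(block, select_bank(block * 64, num_banks=4))
```

Because consecutive cache blocks map to consecutive banks, a benchmark that works on a small contiguous region still spreads its accesses across all banks.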
4.1.2.3 Faulty System Call Emulation Thread Implementation
The M5 system call emulation thread implementation actually contains at least two different
flaws:
• Firstly, the thread implementation is prone to deadlocks. These are not protected by
assertions and their only effect is very strange simulation results.
• Furthermore, some communication system calls are not implemented. Here, the only action
taken is to write a message to the standard output stream.
These problems were discovered for the first time during this work and have been reproduced on
an unmodified M5 simulator. Consequently, they are not caused by the extensions developed
in this work. The thread implementation only compiles on a Tru64 UNIX machine. As such a
machine is not available to the NCAR group, it is impossible to fix these problems. According
to Steve Reinhardt, the best option is to move to full system simulation where the thread
implementation is better. Reinhardt is one of the main developers of the M5 simulator. The
e-mail correspondence with Reinhardt can be found in appendix B.
According to Reinhardt, the pthreads library used in M5 allocates P + 1 threads where P is
the number of processors. The extra thread handles management tasks. When the system
call emulation thread support was developed, it was assumed that this thread was not used.
Consequently, it was not created since this makes it possible to allocate one thread to each
processor. In other words, there is no need for a thread scheduler. The deadlocks are probably
due to all processing threads waiting for this manager thread.
To make things worse, the only effect of this problem is inaccurate simulation results, and this
makes it difficult to discover. The reason is that the deadlock situation is not protected by any
assertions. Consequently, it looks like the simulation finishes successfully if the experiment is
terminated after a fixed number of clock cycles. As an example, consider the situation where
this problem arises in configuration A and not in configuration B. Then, some architectural
effect can be blamed for creating the performance difference which is really due to a simulator
bug.
According to Reinhardt, moving to full system simulation is the best way to avoid these problems.
Here, the faulty thread implementation is not used. Sadly, there was no time to change simulation
mode when the problem was eventually diagnosed. Instead, the decision was made to go with
the system call emulation mode and write the occurrence of the known problems to a tracefile.
The idea is to use this problem trace to discard results where the problems might have influenced
the results. Since some benchmarks deadlock very early in their execution, simply detecting the
problem and exiting with an error message would lead to very short simulations. Therefore,
the processor that has been waiting for the longest time is started when a deadlock is detected.
The rationale is that when all processors are waiting, it is safe to start one of them. This can
clearly lead to wrong results so we need to know how often this happens. Consequently, these
events are noted in the problem trace file. Discussing the occurrences of these problems will
be a central part of chapters 6 and 7 where the scientific benchmark results are presented. In
addition, verifying the findings of this report in the full system mode of M5 is important future
work.
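The deadlock workaround can be summarised with the following Python sketch. It is an illustration of the policy only: the data structure that records when each processor started waiting and the format of the problem trace are hypothetical and do not correspond to the simulator code.

```python
def resolve_deadlock(waiting_since, current_cycle, trace):
    """Start the longest-waiting processor if all processors are waiting.

    waiting_since maps a CPU ID to the clock cycle at which that CPU started
    waiting, or to None if the CPU is running (hypothetical bookkeeping).
    Returns the ID of the started CPU, or None if no deadlock was detected.
    """
    if not waiting_since:
        return None
    if any(start is None for start in waiting_since.values()):
        return None  # at least one processor is running, so no deadlock
    cpu_id = min(waiting_since, key=waiting_since.get)  # earliest wait start
    trace.append((current_cycle, cpu_id))  # record the event for later filtering
    waiting_since[cpu_id] = None  # the processor is running again
    return cpu_id


if __name__ == "__main__":
    problem_trace = []
    state = {0: 100, 1: 50, 2: 75}
    print(resolve_deadlock(state, current_cycle=200, trace=problem_trace))
    print(problem_trace)
```

The recorded events play the role of the problem trace: results from runs where the trace is not empty can be inspected or discarded.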
4.2 Interconnect Extensions
This section describes the extensions developed to the cache-to-cache interconnect. First, the
general software architecture is discussed with a focus on how the extensions communicate
with the existing M5 code. Then, the new split transaction bus, crossbar, butterfly and ideal
interconnects are discussed. The source code for the interconnect extensions can be found in
appendix C.1.
4.2.1 Software Architecture
Figure 4.4 shows the classes that implement the interconnect extension. The blue classes in the
figure have been developed in this work. The other classes are needed to glue the extensions to
the rest of the simulator.
The interconnect extensions can be divided into two types: interconnects and interfaces. Both
types inherit from the BaseHier and SimObject classes. In M5, all classes that can be configured
from outside the simulator must inherit from the SimObject class. Although the M5 interfaces
are not configured themselves, the BaseHier class is. The purpose of this class is to contain user
configurable variables that define if data should be transported by the memory requests and if
events should be scheduled in the memory system. In this report, event simulation is always on
and data is not transferred. The reason for not transferring data is that the data is not used by
the processor core anyway. Therefore, nothing is gained by explicitly transferring data.
M5 expects that the interconnect can be called through an interface. Both a master and a
slave interface are needed. As mentioned earlier, the master term tells us that the interface
is on the processor side of the interconnect, and the slave term signifies that the interface
is on the memory side. These both inherit from the InterconnectInterface class. This class
implements a few features that are needed by both the InterconnectMaster and InterconnectSlave
classes. InterconnectInterface then inherits from BaseInterface which implements methods that
are needed in all memory system interfaces in M5.
Figure 4.4: Interconnect Extension Software Architecture (classes shown: SimObject, BaseHier,
BaseInterface, InterconnectInterface, InterconnectMaster, InterconnectSlave, Interconnect,
SplitTransBus, Crossbar, Butterfly and IdealInterconnect)
It is a great advantage to be able to switch between interconnects easily. Furthermore, it
should be easy to add new ones. The software architecture in figure 4.4 meets both these
design constraints. All interfaces have references to the abstract Interconnect class and this
defines the methods that must be implemented by all interconnects. Furthermore, it declares
the measurement variables used to record the performance of the interconnect. This ensures
that all interconnects provide compatible statistics. Of course, the interconnects can add more
measurement variables if these are needed. Adding a new interconnect is easy as only one new
class that inherits from Interconnect and implements the abstract methods must be created. An
additional advantage of this architecture is that all interconnects have the same configuration
parameters which simplifies the configuration scripts.
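The role of the abstract Interconnect class can be illustrated with a small Python sketch. The real extension is part of the M5 source code, so everything below is an assumption made for illustration: the method names enqueue and arbitrate, the two statistics and the FifoBus subclass are hypothetical and only show how shared parameters and statistics can live in one base class.

```python
from abc import ABC, abstractmethod


class Interconnect(ABC):
    """Common base: shared configuration parameters and statistics."""

    def __init__(self, arbitration_delay, transfer_delay):
        self.arbitration_delay = arbitration_delay
        self.transfer_delay = transfer_delay
        # Statistics declared once so every interconnect reports them.
        self.total_requests = 0
        self.total_queue_cycles = 0

    def request(self, now, req):
        """Called by an interface; records the request for arbitration."""
        self.total_requests += 1
        self.enqueue(now, req)

    @abstractmethod
    def enqueue(self, now, req):
        """Store a request until it is granted."""

    @abstractmethod
    def arbitrate(self, now):
        """Return the list of requests granted access at cycle `now`."""


class FifoBus(Interconnect):
    """Toy subclass: grants the oldest request, one per arbitration."""

    def __init__(self, arbitration_delay, transfer_delay):
        super().__init__(arbitration_delay, transfer_delay)
        self.queue = []

    def enqueue(self, now, req):
        self.queue.append((now, req))

    def arbitrate(self, now):
        if not self.queue:
            return []
        arrived, req = self.queue.pop(0)
        self.total_queue_cycles += now - arrived
        return [req]
```

A new interconnect only has to provide the two abstract methods, and the base class cannot be instantiated by mistake.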
Figure 4.5 should make the function of the different interconnect classes clearer. Here, a typical
call sequence on a transfer from an L1 cache to an L2 cache is illustrated. The return of data
from the L2 cache to the L1 cache is not shown. First, the L1 cache calls the request method
of its MasterInterface and the master interface passes on the request to the Interconnect. The
Interconnect object stores the request and schedules an arbitration event. When this arbitration
event is serviced, the current requests are checked and at least one request is granted access.
The specifics of the arbitration method depend on the interconnect used.
When the Interconnect object decides to grant access to a request, it calls the GrantData method
on the MasterInterface object. The MasterInterface then retrieves the current request from the
L1 cache and calls the interconnect’s Send method. This results in a deliver event being scheduled
after the number of clock cycles specified by the transfer delay parameter. The deliver event
calls the Deliver method in the slave interface. Then, the Access method for the L2 Cache
object is called and the request is delivered.
Figure 4.5: Interconnect Extension L1 to L2 Cache Transfer Example (sequence between L1 Cache,
MasterInterface, Interconnect, SlaveInterface and L2 Cache: Request, arbitration delay,
GrantData, GetMemReq, Send, transfer delay, Deliver, Access)
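The call sequence above can be replayed with a tiny discrete-event sketch. The EventQueue class and the simulate_transfer function are hypothetical helpers written for this illustration and are not part of M5; only the call names are taken from the description above.

```python
import heapq


class EventQueue:
    """Minimal discrete-event queue ordered by cycle."""

    def __init__(self):
        self.now = 0
        self._events = []
        self._seq = 0  # tie-breaker so callbacks are never compared

    def schedule(self, delay, callback):
        heapq.heappush(self._events, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._events:
            self.now, _, callback = heapq.heappop(self._events)
            callback()


def simulate_transfer(arbitration_delay, transfer_delay):
    """Replay one L1-to-L2 transfer and return (cycle, call) pairs."""
    eq = EventQueue()
    log = []

    def request():      # L1 cache -> master interface -> interconnect
        log.append((eq.now, "Request"))
        eq.schedule(arbitration_delay, grant_data)

    def grant_data():   # interconnect -> master interface
        log.append((eq.now, "GrantData"))
        log.append((eq.now, "GetMemReq"))  # master fetches request from L1
        log.append((eq.now, "Send"))       # master -> interconnect
        eq.schedule(transfer_delay, deliver)

    def deliver():      # interconnect -> slave interface -> L2 cache
        log.append((eq.now, "Deliver"))
        log.append((eq.now, "Access"))

    eq.schedule(0, request)
    eq.run()
    return log
```

With an arbitration delay of 2 and a transfer delay of 3, the request issued at cycle 0 is granted at cycle 2 and reaches the L2 cache at cycle 5.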
The rest of this section is a quick introduction to the different interconnect extensions. The main
part of the implementation is the method that does arbitration. Consequently, the discussion of
each interconnect will focus on this method.
4.2.2 Split Transaction Bus
As noted earlier in this section, standard M5 has a split transaction bus interconnect.
Consequently, it does seem strange to extend the simulator with an interconnect that is already
supported. However, there are three good reasons to do just that. First, it is a good starting
point to implement the simplest possible component. Then, this can be tested thoroughly and
form the basis for other more advanced components.
Secondly, it is an advantage to have a clean interface to the other interconnects. This is especially
important for simulator statistics as one needs to measure the same thing for all interconnects.
The measurements taken in the original M5 bus are very bus specific and not well suited to
other interconnects. In addition, a common configuration interface simplifies the configuration
scripts. Thirdly, it is an advantage to have implemented the key components yourself. Then,
you know exactly how they work. This is a great advantage when analysing the results.
As mentioned in section 3.2.3, the split transaction bus does not implement pipelined arbitration
or transfer. Consequently, carrying out one arbitration operation takes the number of clock cycles
specified by the user. If a request is received while an arbitration is in progress, it is not serviced
until the previous arbitration operation is finished. Therefore, the arbitration time for a given
request can be more than the arbitration delay even if the bus is not very busy. Naturally, only
one request is granted access to the bus after one arbitration operation.
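The effect of non-pipelined arbitration on the waiting time can be shown with a few lines of Python. This is a simplified sketch under stated assumptions: requests are served in arrival order, the transfer delay is ignored, and the function name grant_times is hypothetical.

```python
def grant_times(arrivals, arbitration_delay):
    """Return the cycle at which each request is granted the bus.

    arrivals must be sorted. Arbitration is not pipelined: a new
    arbitration cannot start before the previous one has finished,
    and each arbitration grants exactly one request.
    """
    grants = []
    arbiter_free = 0  # cycle at which the arbiter becomes idle
    for arrival in arrivals:
        start = max(arrival, arbiter_free)
        arbiter_free = start + arbitration_delay
        grants.append(arbiter_free)
    return grants
```

With an arbitration delay of 4 cycles, two requests arriving at cycles 0 and 1 are granted at cycles 4 and 8. The second request therefore waits 7 cycles although only two requests are present.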
The bus only supports an arbitration delay that is less than the transfer delay. In this case,
the arbitration delay determines the delay of the arbitration operation and the transfer delay
determines the transmission time. If the transfer time is longer than the arbitration delay, the
arbitration delay is determined by the transfer delay. The reason is that the bus would be
occupied when the arbitration is finished in this case.
4.2.3 Crossbar
The implemented crossbar simulates one channel from all L1 caches to all L2 banks and was
shown in figure 3.3. In other words, each instruction cache and each data cache has its own
channel to all L2 banks. Furthermore, there is a channel from all L2 banks to all L1 caches.
Both arbitration and transmission are pipelined with each stage taking one processor clock cycle.
L1 to L1 traffic is enabled by adding a split transaction bus between these. This crossbar was
described in section 3.2.3 and is based on the design used by Kumar et al. [KZT05].
This design differs from the crossbar used by Kumar et al. in two ways. Firstly, Kumar et al.
only use one channel from each core to each L2 bank. In this implementation, there is one
channel for the data cache and one for the instruction cache. The reason for doing this is that
the implementation in the simulator becomes easier. Furthermore, it creates an upper bound
on the achievable performance with a crossbar. The downside is that the area overhead of this
crossbar is significantly higher than the area overhead of Kumar’s design. The other difference
is that my implementation has both data and address channels in both directions.
When one L2 bank blocks, the crossbar blocks. An L2 cache bank will block if it can not
guarantee that it has space to store the state of an additional cache miss. Since the crossbar
blocks as well, access to all banks is blocked when one cache bank blocks. This is realistic if we
assume that there is no buffering in the crossbar. Consequently, a request might go to a blocked
bank and it is not possible to guarantee delivery. However, changing this implementation choice
and exploring the results is possible further work.
4.2.4 Butterfly
The butterfly is an implementation of the model discussed in section 3.2.3.3. Because of time
constraints, it can only handle the input combinations used in this work. In other words, only
radix 2 butterflies with 2, 4 and 8 processor cores are implemented. The blocking behaviour is
identical to the crossbar’s blocking behaviour as this makes it easier to compare them to each
other. Removing the blocking restriction and supporting other butterflies is left as further work.
4.2.5 Ideal Interconnect
The ideal interconnect is an interconnect which can issue an unlimited number of requests in
parallel. However, these requests experience an arbitration delay and a transmission delay.
Consequently, it gives an upper bound on the performance attainable when a request is never
queued due to interconnect capacity constraints. The implementation keeps a sorted list of
requests and delivers them when the specified number of clock cycles has passed. Again, the
whole interconnect blocks if one L2 bank blocks.
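A minimal Python sketch of this behaviour is given below. It is not the simulator code: a binary heap from the standard library is used in place of the sorted list, and the class layout and the any_bank_blocked flag are assumptions made for the example.

```python
import heapq


class IdealInterconnect:
    """Unlimited parallelism: every request only pays the fixed delays."""

    def __init__(self, arbitration_delay, transfer_delay):
        self.latency = arbitration_delay + transfer_delay
        self._pending = []  # min-heap of (delivery_cycle, seq, request)
        self._seq = 0       # keeps requests with equal cycles in order

    def request(self, now, req):
        heapq.heappush(self._pending, (now + self.latency, self._seq, req))
        self._seq += 1

    def deliver(self, now, any_bank_blocked=False):
        """Return the requests due at or before `now`, oldest first."""
        if any_bank_blocked:
            return []  # the whole interconnect blocks with one L2 bank
        due = []
        while self._pending and self._pending[0][0] <= now:
            due.append(heapq.heappop(self._pending)[2])
        return due
```

Since no request ever waits for another request, the only queueing in this model comes from blocked L2 banks.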
Figure 4.6: Directory Protocol Extension (classes shown: Cache, DirectoryProtocol and
StenstromProtocol)
4.3 Cache Coherence Protocol Extensions
This section describes the implementation of the Stenström directory-based cache coherence
protocol [Ste89] discussed in section 2.3.2 and has the following outline:
• Section 4.3.1 discusses the software architecture chosen for the directory protocol
implementation.
• The non-blocking caches implemented in M5 create a few additional challenges which are
discussed in section 4.3.2.
• Protocol race conditions are often cited as the most challenging part of implementing a
directory-based protocol. Section 4.3.3 presents two examples of such races in the
Stenström protocol and describes how they are handled in this implementation.
• Finally, section 4.3.4 discusses a few additional implementation choices.
The source code for the coherence protocol implementation can be found in appendix C.2.
4.3.1 Software Architecture
Figure 4.6 shows the software architecture of the directory protocol implementation. The Cache
class has a pointer to a DirectoryProtocol class. Furthermore, the cache implementation was
modified to call the directory protocol methods at certain places. The abstract DirectoryProtocol
class defines an interface which the classes that inherit from it must implement. Consequently,
it is easy to interface a new coherence protocol with the rest of the simulator.
As noted earlier, the DirectoryProtocol does not inherit from SimObject. Consequently, it can
not be configured directly from outside the simulator. The reason is that the M5 cache
implementation uses the C++ template construct heavily, which results in it not interfacing cleanly
with the simulator configuration scripts. The original M5 code works around this problem by
using the cache’s external interface to configure the objects that are associated with it. For
instance, the existing prefetcher and snooping cache coherence protocols are configured in this
way. The directory protocol implementation uses the same workaround.
4.3.2 Handling Non-blocking Caches
The memory system used in the M5 simulator uses non-blocking or lockup-free caches [Kro81].
This means that the state of a miss is stored in a special-purpose register called a Miss Status
Holding Register (MSHR) and that the cache continues to service hits and misses even if several
earlier requests have missed in the cache. If there is more than one miss to the same cache
block, this information is stored in the MSHR. However, a new request is not sent to the next
memory hierarchy level. The number of misses to the same cache block that can be handled
without blocking is called the number of targets of an MSHR in M5.
Figure 4.7: Non-blocking Cache and Coherence Challenge (processor 1's L1 cache holds an MSHR
for block address X with a read and a write target, processor 2's L1 cache holds block X in
the owned exclusive state, and the directory records processor 2 as the owner of X)
There is a limited number of registers and targets. Consequently, the cache must block if it
can not guarantee that it will have space to allocate an additional miss. In this case, the cache
does not accept any new requests until an outstanding request has been serviced. There are two
situations in M5 where the cache might block. Firstly, all MSHRs might be in use. In this case,
the cache will be unable to service a miss to a cache block which has no previous miss allocated.
Secondly, an MSHR might use all its targets. Here, the cache can not service a new miss to this
cache block.
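The two blocking conditions can be made concrete with a small Python sketch. The MSHRFile class and its method names are hypothetical and are not taken from the M5 source code. The sketch assumes that the first miss to a block also occupies one target.

```python
class MSHRFile:
    """Tracks outstanding misses and reports when the cache must block."""

    def __init__(self, num_mshrs, targets_per_mshr):
        self.num_mshrs = num_mshrs
        self.targets_per_mshr = targets_per_mshr
        self.entries = {}  # block address -> list of waiting accesses

    def allocate(self, block_addr, access):
        """Record a miss. Returns True if a request must be sent onwards."""
        if block_addr in self.entries:
            self.entries[block_addr].append(access)
            return False  # merged into the outstanding miss
        self.entries[block_addr] = [access]
        return True

    def must_block(self):
        """True if the cache cannot guarantee room for one more miss."""
        all_in_use = len(self.entries) >= self.num_mshrs
        targets_full = any(len(t) >= self.targets_per_mshr
                           for t in self.entries.values())
        return all_in_use or targets_full

    def complete(self, block_addr):
        """Free the MSHR when the block arrives and return its targets."""
        return self.entries.pop(block_addr)
```

Only the first miss to a block generates a request to the next memory hierarchy level. Later misses to the same block are merged into the existing register.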
Non-blocking caches create additional complications for a cache coherence protocol. The reason
is that a write request can be hidden behind a read request. If the read does not result in the
requesting cache becoming the owner, the write must not complete.
Figure 4.7 exemplifies this problem. The numbers in the figure correspond to the numbers in
the following list:
1. First, processor 1 has a read miss to block X. This results in the allocation of one MSHR.
In addition, a read request is sent to the directory. Recall from section 2.3.2 that the
directory is co-located with the shared L2 cache.
2. Processor 1 writes to block X before the read has completed. Since there is an outstanding
miss for this block, this write is allocated as a target in the MSHR.
3. The directory answers that block X is owned by processor 2. This results in the read
being redirected to processor 2’s L1 cache. However, care must be taken that the waiting
write is not carried out on the received block. If the write goes through, it is not
noticed by the directory protocol and is lost.
There are a number of possible ways of avoiding this erroneous behaviour:
• Firstly, we could use blocking caches. In this case, there is only one outstanding miss at
any time and the problem disappears. The downside is reduced performance and reduced
pressure on the interconnect. Since this work is aimed at investigating communication
performance, this option would severely limit the applicability of the results.
• Reads and writes to the same address could use different MSHRs.
• Additional hardware can be used to detect the hazard situation shown in figure 4.7 and
resend the hidden write request if needed.
Figure 4.8: Two Processors Request Block Ownership Simultaneously
Applications that do not share data should not be slowed down by the cache coherence protocol.
This makes the option of allocating different MSHRs to reads and writes less attractive. The
reason is that it will increase the number of memory requests in the system. Consequently, the
strain on the interconnect and the shared cache will be larger than necessary.
The implemented protocol detects the hazard and resends the write request. This ensures that
requests are only resent when it is necessary. Furthermore, the performance of non-communicating
applications is not affected.
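The hazard check can be sketched in Python as shown below. The function handle_read_response is a hypothetical illustration of the idea and not the implemented protocol code. For simplicity, it ignores the ordering between a deferred write and accesses that follow it.

```python
def handle_read_response(targets, granted_ownership):
    """Split MSHR targets once the read response has arrived.

    Returns (completed, resend): accesses that may complete on the
    received block, and writes that must be reissued as ownership
    requests because the read did not make this cache the owner.
    """
    completed, resend = [], []
    for access in targets:
        if access == "write" and not granted_ownership:
            resend.append(access)   # hazard: write hidden behind a read
        else:
            completed.append(access)
    return completed, resend
```

If no write is waiting, or if the cache became the owner of the block, nothing is resent and no extra requests enter the interconnect.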
4.3.3 Possible Race Conditions
During the implementation of the Stenström protocol, many race conditions were observed.
These race conditions are rare [MHW03]. For example, some race conditions were observed
after simulating for only a few million clock cycles with 2 processors while others were only
observed after over 100 million clock cycles with 8 processors. Consequently, high performance
is not a key issue when handling them. However, they must be handled correctly. This section
presents two observed races and describes how they are handled in the protocol implementation.
Figure 4.8 illustrates a race condition where two processors attempt to acquire ownership of a
block simultaneously. The protocol actions are described in the following list:
1. At first, block X is not present in any of the caches and the L2 cache has recorded the
block as not being owned.
2. Then, processor 1 issues an owner request for block X.
3. Moments later, processor 2 issues an owner request as well.
4. The directory receives the owner request from processor 1 first and stores processor 1 as
the new owner. Furthermore, it sends a message to processor 1 with the data for block X.
5. Then, the request from processor 2 is received by the directory. Since the directory must
create a globally consistent ordering of the requests, the owner request can not be granted
until processor 1 has finished its write to block X. Consequently, a NACK message is sent
to processor 2.
6. Processor 1 receives the block and stores it in its cache. In addition, it sends an ACK
message to the directory to signify that the owner request is finished. This tells the
directory that it can accept other owner transfer requests for block X.
7. When processor 2 reissues the request, a normal owner transfer operation is carried out.
Figure 4.9: A Processor Issues a Redirected Read while Owner Transfer in Progress
In summary, two or more simultaneous owner transfer requests are handled by granting one
and sending NACK messages to all other processors. This way of handling races is for instance
described by Hennessy and Patterson [HP03].
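The grant and NACK behaviour for simultaneous owner requests can be sketched as follows. The DirectoryEntry class is a hypothetical simplification for a single block and leaves out the data transfer and the present flags.

```python
class DirectoryEntry:
    """Directory state for one block during owner transfers."""

    def __init__(self):
        self.owner = None         # cache id of the current owner, if any
        self.in_transfer = False  # True until the new owner sends an ACK

    def owner_request(self, cache_id):
        """Return 'GRANT' or 'NACK' for an ownership request."""
        if self.in_transfer:
            return "NACK"         # another transfer is still in flight
        self.owner = cache_id
        self.in_transfer = True
        return "GRANT"

    def ack(self, cache_id):
        """The new owner reports that the transfer has finished."""
        if cache_id == self.owner:
            self.in_transfer = False
```

A processor that receives a NACK simply reissues its request, which is granted once the ACK from the previous requester has arrived.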
Figure 4.9 illustrates a more complicated race condition where a processor issues a redirected
read while two other processors are in the middle of an owner transfer. The example uses three
processors as this is the minimum number of processors needed for this case to occur.
The following list describes the situation:
1. First, block X is owned by processor 3 and both processor 1 and processor 2 have invalid
copies of it.
2. Processor 1 wants to write to the block and issues an owner transfer request.
3. The directory receives the request and sets processor 1 as the owner of block X.
4. A message is sent to processor 3 that instructs it to send the updated block data to
processor 1.
5. Processor 3 prepares the message and invalidates its own copy.
6. Processor 3 sends the block data and present flags to processor 1. Furthermore, it sends a
message to processor 2 to inform it that processor 1 is now the owner of block X. At the
same time, processor 2 attempts to read block X and issues a redirected read to processor 3.
As far as processor 2 is concerned, processor 3 is still the owner of block X.
7. Processor 3 has sent the updated data to processor 1 and can not answer processor 2’s
request. Consequently, it answers by sending a NACK message. At the same time, processor
1 has received the data, updated its cache and sent an ACK message to the directory.
8. Processor 2 receives the NACK message. Now, there are two possible implementations.
We can assume that the new owner information has been received and redirect the request
there. However, the updated data and present flags might not have been received yet.
Consequently, it is safer to consult the directory first and then forward the request. This
safe option is taken in this implementation.
4.3.4 Implementation Choices
Implementing a directory coherence protocol carries with it a number of problems. This section
discusses a few of them:
• The M5 cache implementation does not enforce inclusion and this must be handled.
• M5 supports software prefetches.
• Some protocol operations require accessing the cache. What is the latency of these oper-
ations?
• Finally, requests that are issued to acquire ownership of a given cache block are not buffered
in the cache in this implementation. Consequently, multiple owner transfer requests for
the same cache block to the same cache can circulate in the interconnect at the same time.
The inclusion property is that all cache blocks stored in the L1 cache are also stored in the L2
cache [HP07]. The effect of not enforcing inclusion is that the coherence state of a cache block
can not be stored together with the L2 cache data. Fixing this requires decoupling the directory
from the L2 cache data storage. This is easy in a simulator as one can create a map that stores
the owner of a given cache block. Solving the problem in hardware requires an upper bound
on the storage required. The worst case storage requirement is that all L1 caches are full and
that they own every block they store. In other words, there is no sharing. Then, this structure
can be stored in an SRAM memory. Consequently, this implementation assumes that such a
memory is available. Of course, the area requirements of this memory can be reduced by using
techniques inspired by limited or chained directories as discussed in section 2.3.2 and by Chaiken
et al. [CFKA90].
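Both the simulator map and the worst case storage bound can be illustrated with a short Python sketch. The names OwnerMap and max_directory_entries are hypothetical, and the cache parameters used in the usage note are example values rather than the configuration used in this report.

```python
class OwnerMap:
    """Directory decoupled from the L2 data: block address -> owner id."""

    def __init__(self):
        self._owner = {}

    def set_owner(self, block_addr, cache_id):
        self._owner[block_addr] = cache_id

    def clear_owner(self, block_addr):
        self._owner.pop(block_addr, None)

    def owner_of(self, block_addr):
        return self._owner.get(block_addr)  # None means not owned


def max_directory_entries(num_l1_caches, l1_size_bytes, block_size_bytes):
    """Worst case: every L1 cache is full and owns every block it stores."""
    blocks_per_cache = l1_size_bytes // block_size_bytes
    return num_l1_caches * blocks_per_cache
```

For example, 8 caches of 32 KB with 64 byte blocks give 512 blocks per cache and an upper bound of 4096 directory entries.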
A few of the SPLASH-2 benchmarks issue software prefetches. If these miss in the L1 cache,
they are simply discarded. This is not a problem if they hit in the cache. However, there might
be a performance penalty on discarding them when they miss in the L1 cache. One problem is
that it is difficult to know whether the cache should request ownership of the block or simply
register itself as a sharer in this case. Exploring different strategies with software prefetches is
left as further work.
Some protocol actions require retrieving data from a cache. For instance, when a read is
redirected to the owner cache, the updated data must be retrieved from this cache and returned to
the requesting processor. The latency of this check is set to the hit latency of the cache. This is
realistic if the only overhead associated with this look-up is retrieving the data. However, this
request might happen at the same time as the owner processor requests data from the cache. If
the cache only has one port, one of the requests must be delayed. Investigating the impact of
such real world effects is also possible further work.
In this protocol implementation, owner transfer requests are not buffered in the caches. As only
one cache can be the owner at one time, the directory will only grant one and send Negative
Acknowledgement (NACK) messages to the other caches. Consequently, the writes are processed
in a globally consistent order. However, this lack of buffering creates the possibility that one
processor can issue more than one owner transfer request for a given block. This might lead to it
requesting ownership of a block that it already owns, and this must be handled by the protocol.
An alternative solution would be to only allow one outstanding owner transfer request per block
and block the cache if it tries to issue more requests. Investigating this solution is left as further
work.
4.4 Simulator Extension Testing
This section discusses how the simulator extensions have been tested. When writing simulator
code, testing is especially important for two reasons. Firstly, programming errors can create
inaccuracies that only show up in the simulator results. These inaccuracies might be difficult
to discover and can lead to erroneous conclusions. Secondly, the simulators often run for a long
time to gather the needed results. An implementation that fails unpredictably after a few hours
of simulation is consequently not acceptable.
The testing carried out in this work is mainly based on placing assertions in the code. These
assertions contain a boolean expression that is true if the simulator is in a correct state. If
this expression evaluates to false, the simulator exits with an error message. Consequently, the
simulator has executed a correct sequence of states if simulation finishes without errors. Of
course, this only helps if there are assertions that guard against the error that arises. Therefore,
the simulator extensions contain a large number of assertions.
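The style of assertion used can be illustrated with a Python sketch. The actual assertions are part of the simulator source code. The function check_grant and the invariants it checks are hypothetical examples of what an arbitration routine could guard against.

```python
def check_grant(granted, pending, max_grants):
    """Assert invariants after one arbitration round (illustrative)."""
    assert len(granted) <= max_grants, "too many requests granted"
    assert all(req in pending for req in granted), "granted unknown request"
    assert len(set(granted)) == len(granted), "request granted twice"
    # Return the requests that are still waiting for access.
    return [req for req in pending if req not in granted]
```

If one of the expressions is false, execution stops with the given message instead of silently producing wrong statistics.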
A test script was developed to facilitate automatic testing of the simulator. This script simply
runs all available benchmarks in all relevant configurations. When one run finishes, it checks if
the simulation finished successfully. If the simulator exited with an error message, the simulator
output is written to a file and the test is reported as failed. As the simulator code matures, the
parts of the benchmarks simulated in a test are increased.
This approach was very successful as only two simulator extension bugs were found during
simulation on the clusters. Furthermore, these errors were caught by assertions and did not
result in wrong simulator statistics being reported.
This testing scheme created a need for additional computing power. Therefore, the aocdev
computer has been used as a dedicated test machine. This computer is normally used to test
new features added to the Age of Computers (AoC) teaching system before they are added to
the production server. Luckily, there has not been any AoC development this term so aocdev
has been available to run tests.
Chapter 5
CMP Performance with
Multiprogrammed Workloads
Single-threaded programs will probably be common for many years to come. Consequently, it is
important that CMPs perform well with multiprogrammed workloads. This chapter investigates
the performance of multiprogrammed workloads consisting of SPEC2000 [SPEa] benchmarks
with a split transaction bus and a state-of-the-art crossbar interconnect. CMPs with 2, 4 and 8
cores are considered.
The workloads are created by randomly adding SPEC2000 benchmarks to each workload. The
only guidance provided to the random selection is that a SPEC2000 benchmark must be present
in at least one workload for each CPU count. Each benchmark is fast forwarded for a random
number of clock cycles between 1 and 1.1 billion. Then, detailed simulation is carried out for 100
million clock cycles after the last benchmark has finished fast forwarding. 40 different workloads
were generated for each number of cores, and the workloads can be found in appendix A.
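One way to generate workloads with these properties is sketched below in Python. This is not the script used to create the workloads in appendix A. The function make_workloads is hypothetical, and it assumes that the same benchmark may appear more than once in a workload.

```python
import random


def make_workloads(benchmarks, num_cores, num_workloads, rng):
    """Randomly build workloads where every benchmark appears at least once."""
    slots = num_cores * num_workloads
    assert slots >= len(benchmarks), "not enough slots to cover all benchmarks"
    # Start with one copy of each benchmark, then fill up at random.
    picks = list(benchmarks)
    picks += [rng.choice(benchmarks) for _ in range(slots - len(picks))]
    rng.shuffle(picks)
    workloads = []
    for i in range(num_workloads):
        members = picks[i * num_cores:(i + 1) * num_cores]
        # Fast-forward length in clock cycles, drawn per benchmark.
        workloads.append([(b, rng.randint(1_000_000_000, 1_100_000_000))
                          for b in members])
    return workloads
```

Passing a seeded random.Random instance as rng makes the generated workloads reproducible.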
Figure 5.1 shows the number of L1 and L2 cache misses the SPEC2000 benchmarks encounter
when they are run on a single core processor. The main point of this graph is that most
SPEC benchmarks have relatively few misses. However, gap, ammp and apsi all miss in the
L1 cache in more than 7.4% of their instructions. Consequently, they are classified as cache
intensive. The experimental analysis will focus specifically on these applications as it is natural
to expect that they will be sensitive to the properties of the interconnect. Of course, the
sections of the benchmarks that are simulated with the single-core processor are not identical to
the sections used in the CMP simulations so the actual number of misses might differ. However,
the qualitative trends should hold.
As mentioned, the analysis will focus on gap, ammp and apsi. In addition, other benchmarks
will be discussed if the results make this necessary. However, an exhaustive analysis of the
performance of all benchmarks will be too lengthy to fit in this report.
The performance measurements will be reported relative to the performance of an ideal
interconnect. In this context, an ideal interconnect is an interconnect where each request experiences
a transmission delay and an unlimited number of requests can be sent in parallel as described
in section 3.2.3. Consequently, it represents a gold standard that the other interconnects
can be compared to. In addition, the performance measurements are reported per benchmark.
The number reported is the harmonic mean of the Instructions per Cycle (IPC) this benchmark
achieved in each workload. This makes it possible to compare the performance of a benchmark in
the different CMP configurations. Since IPC is a rate, the harmonic mean is used [Smi88, JM95].
Furthermore, the benchmarks are sorted by their bus performance in the figures so that similar
results are shown next to each other. This makes the general trends easier to see but comes at
the cost of making comparisons between the same benchmark in different CMP configurations
more difficult.
Figure 5.1: Single-core SPEC Benchmark Cache Performance (x-axis: Benchmark; y-axis: Misses per
1000 Instructions; series: L1 Misses and L2 Misses)
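The per benchmark number can be computed as in the following Python sketch. The function relative_to_ideal assumes that the relative performance is the ratio between the harmonic mean IPC of an interconnect and the harmonic mean IPC of the ideal interconnect. This is an assumption made for illustration, and both function names are hypothetical.

```python
def harmonic_mean(values):
    """Harmonic mean of positive rates such as IPC."""
    assert values and all(v > 0 for v in values)
    return len(values) / sum(1.0 / v for v in values)


def relative_to_ideal(ipcs, ideal_ipcs):
    """Per-benchmark performance relative to the ideal interconnect."""
    return harmonic_mean(ipcs) / harmonic_mean(ideal_ipcs) - 1.0
```

A benchmark that reaches an IPC of 0.8 in all its workloads, where the ideal interconnect gives 1.0, is reported with a relative performance of about -20%.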
The rest of this chapter has the following outline:
• Section 5.1 discusses the results from the 2-core CMP.
• Section 5.2 presents the 4-core CMP results.
• Finally, section 5.3 discusses the 8-core CMP results.
Each section will first discuss the performance of the miss intensive apsi, ammp and gap
benchmarks. Then, benchmarks that either perform better or worse than expected will be discussed.
5.1 2-core CMP Configuration
Figure 5.2 shows the per benchmark average IPC relative to the ideal interconnect for the 2-core
CMP. As expected, the bus performs a lot worse than the crossbar and the ideal interconnect
in some cases.
This section has the following outline:
• Section 5.1.1 discusses the apsi, ammp and gap benchmarks.
• Then, art, wupwise, swim, eon and fma3d are discussed in section 5.1.2. The reason for
choosing these benchmarks is that art performs worse than expected while the realistic
interconnects outperform the ideal interconnect in the other benchmarks.
Figure 5.2: 2-core CMP Interconnect Performance (x-axis: Benchmarks; y-axis: Average IPC
Relative to Ideal with Delay; series: Bus and Crossbar)
5.1.1 Miss Intensive Benchmarks
The cache miss intensive apsi benchmark has the worst bus performance with a speed
degradation of over 20%. The reason for this poor performance is severe congestion in the L1 to L2
bus. Workload 18 is an example of this and the workload where apsi performs worst. With the
bus interconnect, a request spends on average 10.4 clock cycles in the queue. In comparison, the
average queue time is 0.15 clock cycles with the crossbar interconnect.
The miss intensive ammp and gap benchmarks do not experience a performance degradation
proportional to their L1 cache misses. In the ammp case, workload 18 can offer some insights. Here,
ammp is run together with apsi and experiences a considerable speed degradation compared to
the other workloads it is a member of. In this case, the bus, crossbar and ideal interconnects
are all slowed down. The reason is that the memory bus is badly congested. Consequently, the
L1 to L2 interconnect only has a secondary effect on performance. However, this effect is large
enough to create a small performance difference between the crossbar and ideal interconnects.
All interconnects have identical performance for the gap benchmark. This is not expected, as the
large number of cache misses should put considerable strain on the interconnect. A careful
study of the simulator statistics shows that these misses do not reach the interconnect
in sufficient numbers. The reason is that the L1 cache blocks. This will happen if there are 4
misses to different cache blocks or 4 misses to the same cache block in the L1 cache. In the gap
benchmark this becomes the predominant bottleneck. Consequently, the misses are injected into
the interconnect in bursts, but these bursts have sufficiently large idle periods between them.
Therefore, even the bus is able to handle the traffic load.
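The blocking condition described above can be sketched as a small bookkeeping model. This is a minimal illustration and not M5 code: the class name, the method names and the interpretation of "4 misses to the same cache block" as a per-block limit on pending misses are assumptions made for the example. Only the two limits of 4 are taken from the text.

```python
from collections import defaultdict

MAX_BLOCKS_IN_FLIGHT = 4   # misses to different cache blocks (from the text)
MAX_MISSES_PER_BLOCK = 4   # misses to the same cache block (from the text)


class L1MissTracker:
    """Toy model of when the L1 cache blocks; names are hypothetical."""

    def __init__(self):
        # block address -> number of misses still waiting for that block
        self.outstanding = defaultdict(int)

    def record_miss(self, block_addr):
        self.outstanding[block_addr] += 1

    def fill(self, block_addr):
        # The block has arrived, so every miss waiting for it is satisfied.
        self.outstanding.pop(block_addr, None)

    def is_blocked(self):
        too_many_blocks = len(self.outstanding) >= MAX_BLOCKS_IN_FLIGHT
        too_many_on_one_block = any(
            count >= MAX_MISSES_PER_BLOCK for count in self.outstanding.values()
        )
        return too_many_blocks or too_many_on_one_block
```

In this sketch the cache stays blocked until a fill removes one of the outstanding blocks, which is why requests would reach the interconnect in bursts.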
5.1.2 Other Results
The performance degradation of art is due to the low bus performance in workload 9. Here, art
is run together with twolf. Although these benchmarks do not miss too much by themselves, the
sum of the misses creates the problem. Profiling the bus utilisation reveals that it is occupied 30%
of the time for most of the detailed simulation. However, at some points the utilisation reaches
around 60%. This high utilisation results in long queues and long wait times. Consequently, the
combined miss behaviour of art and twolf is the cause of their low performance.
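The link between utilisation and queue time can be illustrated with a textbook queueing formula. The sketch below assumes an M/D/1 queue (random arrivals, fixed service time), which is not how the simulated bus behaves, and the service time of 4 cycles is a hypothetical value. It only shows the qualitative trend that the waiting time grows faster than the utilisation.

```python
def md1_mean_wait(utilisation, service_cycles):
    """Mean time in queue for an M/D/1 server (Pollaczek-Khinchine formula).

    utilisation: fraction of time the server is busy, in [0, 1).
    service_cycles: fixed service time per request (hypothetical value).
    """
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1)")
    return utilisation * service_cycles / (2.0 * (1.0 - utilisation))


if __name__ == "__main__":
    for rho in (0.3, 0.6, 0.9):
        wait = md1_mean_wait(rho, service_cycles=4)
        print(f"utilisation {rho:.0%}: mean wait {wait:.2f} cycles")
```

Under these assumptions, doubling the utilisation from 30% to 60% more than triples the mean wait, which is consistent with short periods of high utilisation dominating the queueing delay.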
The bus and crossbar interconnects do perform better than the ideal interconnect in some cases.
This might seem counter-intuitive at first. The key idea is that even if the interconnect is ideal,
other parts of the processor are not ideal. If the performance difference due to the interconnect is
small, then small changes in the timing of misses can make other parts of the processor perform
better or worse.
This is the case for eon, fma3d and swim. Here, all memory system statistics indicate that the
ideal interconnect is better than the crossbar which again is better than the bus. However, the
bus interconnect leads to fewer speculative instructions being rolled back in the processor core. In
other words, the crossbar and the ideal interconnects execute further down a wrong path. The
reason is that memory accesses are delivered faster. In this case, waiting for memory accesses to
complete is better since this reduces the number of speculative instructions issued. Consequently,
fewer instructions need to be rolled back and the bus outperforms the other interconnects.
The good bus performance of wupwise is due to the blocking behaviour of the L1 cache. Here,
the ideal interconnect has the largest number of cycles where the cache is blocked. Recall that
when a cache is blocked, it cannot service any requests. In the bus case, the processor execution
is slowed down due to longer interconnect delays. Consequently, it generates memory requests
at a slower rate than with the ideal interconnect. This slow rate results in the sequence of misses
that cause cache blocking to be further spaced in time. The misses might even be slowed down so
much that the first miss is serviced before the blocking miss arrives. This effect makes wupwise
perform better with a bus interconnect than with the ideal interconnect.
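The effect of spacing the misses further apart can be sketched with a simple timeline model. The miss latency, the issue intervals and the function name are hypothetical. The model also ignores that a blocked cache would delay later misses and that the miss latency itself differs between interconnects, so it only illustrates the mechanism.

```python
def blocked_cycles(issue_times, miss_latency, max_outstanding=4):
    """Count cycles in which at least max_outstanding misses are in flight.

    Simplification: misses keep their issue times even while the cache is
    blocked, and every miss takes the same number of cycles.
    """
    if not issue_times:
        return 0
    first = min(issue_times)
    last = max(issue_times) + miss_latency
    blocked = 0
    for cycle in range(first, last):
        in_flight = sum(1 for t in issue_times if t <= cycle < t + miss_latency)
        if in_flight >= max_outstanding:
            blocked += 1
    return blocked


MISS_LATENCY = 40  # hypothetical, in clock cycles

# The same eight misses, issued at two hypothetical rates.
fast_issue = [i * 5 for i in range(8)]    # one miss every 5 cycles
slow_issue = [i * 14 for i in range(8)]   # one miss every 14 cycles
```

With the slower issue rate, the first miss completes at cycle 40 and the fourth miss is not issued until cycle 42, so the limit of four outstanding misses is never reached in this example.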
5.2 4-core CMP Configuration
Figure 5.3 shows the interconnect performance on a 4-core CMP. There is a large performance
degradation with the bus interconnect while the crossbar performs close to the ideal. In other
words, the trends from the 2-core experiment still hold, but the performance impact of the bus
is larger. A new trend is that the crossbar in some cases performs slightly better than the ideal
interconnect. The reason is that the crossbar can avoid some cache blocking by reducing the
memory access rate as discussed in the previous section.
This section has the following outline:
• Section 5.2.1 discusses the miss intensive apsi, ammp and gap benchmarks.
• Then, section 5.2.2 discusses the surprising performance degradation of vortex1 and parser
as well as the counter-intuitive results from the wupwise, gcc, applu and facerec applica-
tions.
[Figure 5.3 plot omitted. x-axis: Benchmark; y-axis: Average IPC Relative to Ideal with Delay; series: Bus, Crossbar]
Figure 5.3: 4-core CMP Interconnect Performance
5.2.1 Miss Intensive Benchmarks
Apsi has the largest performance degradation for the bus interconnect of all the benchmarks.
Again, this is due to bus congestion. For instance, the average queue time with the bus inter-
connect was 20.1 clock cycles and only 0.1 with the crossbar in workload 40.
In this experiment, ammp is over 5% faster with the bus interconnect than with the ideal
interconnect. Given the large number of misses in ammp, it is very unlikely that this is actually
due to the bus being fast. This intuition is confirmed by the simulator statistics. In fact, the
only workload with ammp where the bus beat the other interconnects is workload 20. In all
other workloads, the performance with the bus is lower than with the other interconnects. The
reason why the harmonic mean over these values is better for the bus is that it emphasises the
lowest number. When measuring performance, this is an advantage as it makes the harmonic
mean somewhat of a worst case measure. If the arithmetic mean was used, the bus performance
would be worse than the other interconnects and this important workload might not have been
discovered.
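The difference between the two means can be shown with a small numerical example. The IPC values below are hypothetical and are not measurements from this work. They are chosen so that the bus is slower in three workloads but less slow in a fourth workload where every configuration performs badly.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-workload IPC values, not measurements from the thesis.
# In the last workload every configuration is slow, but the bus is less slow.
bus_ipc = [0.90, 0.90, 0.90, 0.40]
ideal_ipc = [1.00, 1.00, 1.00, 0.30]


def summarise(values):
    """Return (arithmetic mean, harmonic mean) of a list of positive IPC values."""
    return mean(values), harmonic_mean(values)


bus_arith, bus_harm = summarise(bus_ipc)
ideal_arith, ideal_harm = summarise(ideal_ipc)
```

With these numbers the arithmetic mean ranks the ideal interconnect first, while the harmonic mean ranks the bus first because the single low value dominates it.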
Since the bus is not actually performing any better than the other interconnects, a different
property of the CMP must be responsible for slowing down execution for all configurations. The
simulator statistics show that ammp causes the L1 cache to be blocked for a considerable period.
Furthermore, the off-chip memory bus is congested. It is unlikely that congestion alone makes
the bus perform better than the ideal interconnect. To achieve this, the time in the memory bus
queue in the ideal case must be larger than the combined time in queue in the L1 to L2 bus and
the off-chip memory bus with the bus interconnect. Although this effect would probably not
lead to the bus outperforming the ideal interconnect, it will certainly reduce the performance
difference between them. An important consequence of this queueing is that it influences the
memory access issue rate and therefore the blocking behaviour of the cache. Consequently, the
end result is probably a combination of these two factors. To verify this hypothesis, an additional
experiment was carried out in the 8-core CMP where this problem is more pronounced. This
experiment will be discussed in section 5.3.2.1.
The performance of gap is similar to the 2-core case. Here, the choice of interconnect does not
have a large impact on overall performance. Again, the reason is that the L1 cache is blocked
for a considerable time period and this has a larger performance impact. Furthermore, the
crossbar interconnect makes the L1 cache block slightly fewer times than the ideal interconnect.
Consequently, it performs slightly better.
5.2.2 Other Results
Vortex1 and parser experience a considerable slowdown with the bus interconnect. This was
not expected as they have relatively few cache misses. However, looking into the results reveals
an interesting situation. Vortex1 is mainly slowed down in workload 9 and workload 29, while
parser performs worst in workload 17. Furthermore, the miss intensive apsi benchmark is a
member of all these workloads. The result is bus congestion which delays the few misses vortex1
and parser have enough to slow them down. In other words, apsi creates problems for the
applications it is run together with.
As in the 2-core case, the strange results where the bus or the crossbar performs better than the
ideal are due to either cache blocking or wrong-path execution. In swim, wupwise and applu, the
cache lock-up time is the reason for the bus or crossbar outperforming the ideal interconnect.
In eon and facerec, the bus outperforms ideal because it has fewer instructions that are rolled
back. Both these effects were discussed in section 5.1.2.
5.3 8-core CMP Configuration
This section explores the experimental results with the 8-core CMP:
• First, the experimental results with the baseline CMP architecture are presented in section
5.3.1. Here, the L2 cache becomes a large performance bottleneck. Consequently, the
performance of the interconnect becomes secondary and only has a small impact on overall
performance.
• Section 5.3.2 applies this observation by increasing the L2 cache size to 8 MB and the
number of MSHRs to 16. Here, the performance impact of the interconnect is much
larger.
5.3.1 Baseline 8-core CMP Results
Figure 5.4 shows the results from the 8-core baseline architecture. The general trend is that
the performance degradation with a bus interconnect is very small. For instance, apsi has a
degradation of only just over 5%. This is very low compared to the degradation of almost 35%
seen in the 4-core CMP.
A detailed look at the simulation statistics reveals that the benchmarks that use the bus most are
slowed down by an increased number of L2 misses. The reason is that two cores now compete
[Figure 5.4 plot omitted. x-axis: Benchmark; y-axis: Average IPC Relative to Ideal with Delay; series: Bus, Crossbar]
Figure 5.4: 8-core CMP Interconnect Performance
for space in the same cache bank because of the way address translation is done in the M5
simulator. This was described in detail in section 4.1.2.
This hypothesis is validated by running the experiments again with a larger cache size and
comparing the results. Workload 4 illustrates the problem. Here, apsi and mgrid compete for
space in the same cache bank. The larger L2 cache size reduces the miss rate from 28% to 20%
for the ideal and crossbar interconnects. For the bus interconnect, the miss rate is reduced from
28% to 25%. The reduced L2 cache performance is probably due to a lower access rate in the
bus case. A possible explanation is that the memory access rate is slowed down enough that
frequently accessed blocks get thrown out of the cache. Here, the accesses that would keep them
in the cache with the fast interconnects arrive too rarely to keep the block in the cache with the
bus interconnect.
Another strange result is that gap suddenly has become sensitive to interconnect performance.
Again, the reason is a hot L2 bank which it shares with ammp. When the L2 cache size is
increased, the miss rate stays the same. However, the number of hits is increased. This makes
gap run fast enough to make L1 cache blocking the main bottleneck.
The results shown in figure 5.4 will not be discussed further in this report. The reason is that
they primarily illustrate L2 cache performance which is not the focus of this work. Instead, the
next section discusses the results gathered from the 8 MB 8-core configuration in detail.
5.3.2 8-core CMP with Larger Cache
Figure 5.5 shows the results from the 8-core CMP with 8 MB L2 cache and 16 MSHRs per
L2 bank. Here, the overall performance with the bus interconnect is considerably worse than
[Figure 5.5 plot omitted. x-axis: Benchmark; y-axis: Average IPC Relative to Ideal with Delay; series: Bus, Crossbar]
Figure 5.5: 8-core CMP with Large Cache
with the other interconnects as expected. The crossbar scales well in terms of performance and
performs close to the ideal interconnect for all benchmarks.
5.3.2.1 Miss Intensive Benchmarks
As usual, apsi is sensitive to the performance of the interconnect. For instance, in workload 4
the average queue delay of a request was 23.3 clock cycles with the bus interconnect and 0.1
with the crossbar interconnect.
Both gap and ammp perform better with a bus interconnect than with the ideal interconnect.
In gap, this happens in 10 out of the 14 workloads it is a member of. Again, this is due to the
amount of time the L1 cache was blocked. With the bus interconnect, the L1 cache is blocked
for fewer clock cycles. Therefore, it can service cache hits during this time and the result is
higher overall performance. In addition, the off-chip memory bus is congested and this reduces
the advantage of having fast L1 to L2 transfers.
The bus interconnect results in better performance than the ideal interconnect for all workloads
containing the ammp benchmark. This pattern was also seen in one ammp workload in the
4-core configuration. The hypothesis is that memory bus congestion and L1 cache blocking
combined make the bus outperform the ideal interconnect. To investigate this, workloads 9, 20,
21, 24 and 31 were run with a memory bus that was clocked at the same rate as the processor.
Obviously, this is not realistic but it will tell us whether or not the memory bus is the performance
bottleneck. The chosen workloads are a subset of the workloads that contain ammp, and they
were chosen at random.
This experiment resulted in the overall performance of ammp with the bus being 22% worse than
the performance with the ideal interconnect. Consequently, the strange speed-up was removed.
The main reason is that there is less contention for the memory bus. For example, the fraction
of the time the memory bus is idle is increased from 0% to 40% for the ideal interconnect in
workload 20. In addition, the time the L1 cache is blocked is reduced. These results support
the hypothesis that memory bus congestion and L1 cache blocking create the strange results for
the ammp benchmark.
5.3.2.2 Other Results
The art benchmark has a very large performance degradation in the 8 MB 8-core CMP. This
is probably caused by a combination of memory bus congestion and the time the L1 cache was
blocked. The simulation statistics show that for some workloads the L1 cache blocks more with
the bus interconnect than with the ideal interconnect. This is the opposite effect of what we
observed earlier. The reason is that it takes a long time to service the misses that have caused
the cache to block. In other workloads, the performance of art is determined by the sum of delay
through the L1 to L2 interconnect and the off-chip interconnect. With a bus interconnect, L1 to
L2 delay is large while the memory bus is not as badly congested as with the other interconnects.
With the crossbar and ideal interconnects, there is virtually no extra delay in the L1 to L2
interconnect but the memory bus is severely congested.
Mcf, swim, wupwise, bzip, lucas and applu all perform better with the bus interconnect than
with the ideal interconnect. The reason is that the L1 cache is blocked for a smaller amount of
time with the bus interconnect. This effect was discussed in section 5.1.2.
Gcc also performs better with a bus interconnect than with the other interconnects. However,
pinning down the exact cause of this has proved difficult. It is not L1 cache blocking as it is
blocked a much larger fraction of the time with the bus interconnect than with the ideal. Wrong
path execution is also ruled out as the number of rolled back instructions is greater with the
bus interconnect. By examining the workloads where the bus beats the other interconnects, a
number of possible reasons are found. However, the most likely reason differs from workload to
workload. Most likely, the good performance of the bus is due to complex interactions between
the L1 miss rate, L1 lock-up time, L1 to L2 interconnect delay, L2 miss rate and memory bus
congestion. Looking further into this problem has not been prioritised.
Chapter 6
CMP Performance with
Communicating Workloads
The main aim of this work is to investigate the performance of parallel programs on a CMP
platform. In this chapter, the performance of the bus and crossbar interconnects is investigated.
As noted in section 4.1.2, there are some problems with the thread implementation in the M5
simulator’s system call emulation mode. Since these flaws might influence the results, we need
to record how often they happen. This is done by writing some text describing the occurrence
of a problem to a trace file when the simulation is run. Table 6.1 shows the benchmarks where
the known flaws were observed. This information is taken into account when the results are
analysed. Note that there are a number of cases where no problems are observed for all core
counts.
6.1 Parallel Benchmark Performance with 2 CPUs
This section discusses the results from the experiments with the SPLASH-2 benchmarks on a
2-core CMP. Since the main focus of this work is on the interconnect, it is helpful to quan-
tify the interconnect bandwidth demands of the different benchmarks. Then, the interconnect
performance results are discussed.
6.1.1 Bandwidth Demand with 2 Cores
The number of interconnect requests per committed instruction for all benchmarks and
configurations is shown in figure 6.1. In this figure, the benchmarks are sorted in ascending order by
the number of requests for the bus interconnect. If the interconnects have a large difference in
request rate for the same benchmark, it must be taken into account when the experiments are
analysed. The reason is that a difference in network load can make some of the interconnects
seem better or worse than they really are.
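The request rate metric can be sketched as follows. The function name and all numbers are hypothetical. The example shows one way in which the rate can differ between configurations: the same number of requests gives a higher rate when one core commits fewer instructions before the simulation ends.

```python
def requests_per_instruction(interconnect_requests, committed_per_core):
    """Interconnect requests divided by the total committed instructions."""
    total_committed = sum(committed_per_core)
    if total_committed == 0:
        raise ValueError("no committed instructions")
    return interconnect_requests / total_committed


# Hypothetical 2-core samples with the same number of requests.
both_cores_fast = requests_per_instruction(300_000, [1_000_000, 1_000_000])
one_core_slow = requests_per_instruction(300_000, [1_000_000, 500_000])
```

Here the second sample has a rate of 0.2 instead of 0.15 although the interconnect saw exactly the same number of requests.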
There are three different causes for the differences in network load shown in figure 6.1:
• The distribution of data to the different L1 caches is controlled by the cache coherence
protocol. Consequently, the cache that first requests a cache block becomes the owner of
that block. If this block is only read, the cache will continue to be the owner until it writes
Configuration  Benchmark       Bus  Crossbar  Ideal
2 CPUs         LUContig        X    X         X
               OceanContig     X    X         -
               OceanNoncontig  X    X         X
               WaterSpatial    -    X         X
4 CPUs         LUContig        X    X         X
               OceanContig     X    X         X
               OceanNoncontig  X    X         X
               WaterNSquared   X    X         X
               WaterSpatial    -    X         X
8 CPUs         Cholesky        X    -         -
               LUContig        X    X         X
               LUNoncontig     X    X         X
               OceanContig     X    X         X
               OceanNoncontig  X    X         X
               Radix           X    X         X
               Raytrace        X    -         -
               WaterNSquared   X    X         X
               WaterSpatial    X    X         X
Table 6.1: Splash Benchmarks where M5 Flaws were Observed
back the block. For instance, say that cache A reads block X one time and that cache B
reads block X five times. If the access from cache A arrives at the directory first, cache B
will have to issue five reads to A’s cache. If B reads block X first, cache A issues one access
to B’s cache. Therefore, it is best to put the block in B’s cache in this case. However, in
the simulated protocol the block simply ends up in the cache that requested it first.
• Simulation is finished when one processor reaches a certain number of committed instruc-
tions. However, the number of instructions the other processor manages to commit varies
with interconnect performance.
• The M5 flaws can manifest themselves in different ways for the different configurations. In
this case, it is difficult to know how to interpret the results, and they should be discarded.
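The data distribution example in the first item above can be worked through in a few lines. The function name and the dictionary layout are assumptions made for the illustration. The read counts for cache A and cache B are the ones used in the text.

```python
def remote_reads(reads_per_cache, owner):
    """Reads that must be served from another cache's copy, given the owner."""
    return sum(count for cache, count in reads_per_cache.items() if cache != owner)


reads = {"A": 1, "B": 5}  # cache A reads block X once, cache B five times

a_requested_first = remote_reads(reads, owner="A")  # B's reads go to A's cache
b_requested_first = remote_reads(reads, owner="B")  # A's read goes to B's cache

# The owner that minimises remote reads is the cache that reads the most.
best_owner = max(reads, key=reads.get)
```

The request order alone therefore changes the interconnect load from one remote read to five for the same program behaviour.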
The differences in request rate for the Barnes, FMM and LUContig benchmarks are all because
of different data distributions. In particular, the number of reads from the other processor’s L1
cache is different. The M5 flaws manifest themselves a few times for the LUContig benchmark,
but in a similar way for all configurations. A possible way to reduce the effects of the initial
data distribution is to migrate cache blocks such that they are close to the processor that uses
them the most. However, a more thorough investigation is needed to fully understand how this
affects protocol and benchmark performance. Pursuing this interesting opportunity is left as
further work.
The variation in request rate for the Cholesky benchmark is probably not due to data distribu-
tion. The reason is that the number of reads to the other processor’s L1 cache is about the same
for all configurations. However, the number of instructions committed on the processor that
does not reach the maximum instruction count, is considerably less for the crossbar than for the
two other interconnects. None of the known bugs manifest themselves for this benchmark.
The OceanContig result is a typical example of results that must be discarded due to M5
thread implementation problems. Here, two unimplemented methods are called a lot when
[Figure 6.1 plot omitted. x-axis: Benchmark; y-axis: Interconnect Requests per Committed Instruction; series: Bus, Crossbar, Ideal]
Figure 6.1: Interconnect Requests Rate in Sample for a 2-core CMP
the bus interconnect is used and there are a number of deadlocks when the crossbar is used.
Consequently, it is difficult to know how to interpret the results.
6.1.2 2-core Interconnect Performance
Figure 6.2 shows the sum of the Instructions per Cycle (IPC) for each benchmark and
configuration relative to the performance of the configuration with an ideal interconnect. As earlier, the ideal
interconnect can service an unlimited number of requests in parallel but all requests experience
a constant transmission delay. The sum of IPC is used because it is a throughput metric. For a
parallel application, it is the total amount of work carried out that matters, and the sum of IPC
measures this.
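A minimal sketch of this metric is given below. It assumes that the relative value is the signed percentage difference between the two IPC sums, which matches the axis range of the figure but is not stated explicitly in the text. The function name and the IPC values are hypothetical.

```python
def relative_sum_ipc(ipc_per_core, ideal_ipc_per_core):
    """Sum of IPC relative to the ideal configuration, as a signed percentage."""
    ideal_total = sum(ideal_ipc_per_core)
    if ideal_total <= 0:
        raise ValueError("ideal configuration must have positive total IPC")
    return 100.0 * (sum(ipc_per_core) - ideal_total) / ideal_total


# Hypothetical 2-core sample: per-core IPC with a bus and with the ideal interconnect.
bus_relative = relative_sum_ipc([0.4, 0.5], [1.0, 0.8])
```

In this example the bus configuration commits half as many instructions per cycle in total, so the value is -50%.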
The results in figure 6.2 show that the interconnect has a considerable performance impact for
the SPLASH-2 benchmarks. In chapter 5, only apsi had a performance degradation of more than
20% for the bus compared to the ideal interconnect. Here, 7 out of 12 benchmarks have more
than 20% performance degradation for the bus interconnect. Furthermore, the bus interconnect
configuration performs over 90% worse than the ideal interconnect for FMM and Barnes.
The large performance degradation for the bus interconnect with the FMM and Barnes bench-
marks is due to bus congestion. However, the crossbar and ideal interconnects have considerably
fewer requests to deal with. The reason is that they have a more favourable data distribution and
therefore less coherence traffic. Consequently, it is difficult to know whether the crossbar could
handle the load the bus received. Conversely, the bus might perform better if it only had to
handle the load the crossbar got.
For the Cholesky benchmark, there is only a small performance difference between the bus and
[Figure 6.2 plot omitted. x-axis: Benchmark; y-axis: Sum of IPC Relative to Ideal; series: Bus, Crossbar; off-scale value label: 315 %]
Figure 6.2: Scientific Workload Performance in a 2-core CMP
the crossbar interconnects. The reason is that the bus used for L1 to L1 communication in the
crossbar becomes congested. Furthermore, the network load in the crossbar case is considerably
less than with the bus and ideal interconnects. The WaterSpatial and WaterNSquared results
also support this theory. Here, the bus and crossbar configurations have very similar perfor-
mance. In this case, the request rates are similar and only very few M5 flaws were observed with
the WaterSpatial benchmark.
The LUContig benchmark performance with the crossbar is nearly as good as its performance
with the ideal interconnect. However, the crossbar has far fewer requests to deal with than
the bus and ideal interconnects in this case. Consequently, it is difficult to say whether the
performance difference is real or created by the more favourable data distribution in the crossbar
case. The M5 flaws manifest themselves in a similar way for the different interconnects. The
opposite situation happens with the Raytrace benchmark. Here, the crossbar interconnect gets
the largest load and therefore performs worst. It is unlikely that the bus could handle a similar
load.
Radix and FFT do not create enough interconnect requests for the bus or crossbar to limit
performance. Consequently, there is nearly no performance difference. OceanNoncontig
and OceanContig experience a large number of M5 deadlocks. Consequently, the results cannot
be used and they are only shown for completeness.
[Figure 6.3 plot omitted. x-axis: Benchmark; y-axis: Interconnect Requests per Committed Instruction; series: Bus, Crossbar, Ideal]
Figure 6.3: Interconnect Request Rate in Sample for a 4-core CMP
6.2 Parallel Benchmark Performance with 4 CPUs
This section discusses the performance of the SPLASH-2 benchmarks on a 4-core CMP. First,
the bandwidth demands of the different benchmarks are quantified. Then, the performance
results are presented and discussed.
6.2.1 Bandwidth Demand with 4 Cores
Figure 6.3 shows the number of interconnect requests per committed instruction for the 4-core
CMP. Here, most of the benchmarks have similar request rates with all interconnects. However,
Barnes, Radix, LUContig and Cholesky have considerable differences and need to be discussed
further.
In the 2-core CMP, a similar request pattern was observed with the Barnes benchmark. There,
the difference was due to a different data distribution, and this is also the case here. With
the bus interconnect, two processors initiate a large number of reads to other L1 caches while
all processors initiate some reads to other L1 caches with the crossbar interconnect. The total
number of remote reads is larger with the bus interconnect and this creates the difference in
request rate.
The Cholesky benchmark also behaves similarly in the 4-core and 2-core CMP experiments.
Again, a larger number of instructions are committed in total with the bus and ideal intercon-
nects. By inspecting the benchmark source code, it becomes clear that there is relatively little
global synchronisation. Basically, the processors have relatively large sections of computation
[Figure 6.4 plot omitted. x-axis: Benchmark; y-axis: Sum of IPC Relative to Ideal; series: Bus, Crossbar; off-scale value labels: 131 %, 224 %]
Figure 6.4: Scientific Workload Performance in a 4-core CMP
divided by barriers. This can result in some processors entering a phase with more intercon-
nect usage for some interconnects and might explain the difference observed. However, further
investigation is needed to pin down the exact cause of this situation.
The Radix benchmark has one processor that commits very few instructions with the bus and
crossbar interconnects. Consequently, the number of requests per committed instruction is larger
than with the ideal interconnect. It is unclear whether this difference is the correct behaviour or
due to an unknown problem with the M5 system call emulation thread implementation. The FFT
benchmark is only simulated in the serial section. This is the farthest it is possible to simulate
without the simulation taking more than 16 hours on the Norgrid cluster. Sadly, the simulation
finishes quickly when the maximum number of instructions is set appropriately. This suggests
that it is another problem with the M5 thread implementation. Rather than investigating these
problems, full system simulation should be used. There, the faulty thread library is not used
and the problems are hopefully avoided.
6.2.2 4-core Performance Results
Figure 6.4 shows the performance results for the SPLASH-2 benchmarks on the 4-core CMP.
Again, the interconnect has a considerable performance impact. For instance, FMM has a
performance degradation of over 90% with both the bus and the crossbar interconnects. As
expected, the general performance degradation is larger with four cores than with two cores.
According to the results, the benchmarks can be grouped into four classes:
• For the LUNoncontig benchmark, the crossbar interconnect outperforms the bus.
[Figure 6.5 plots omitted. Panels: (a) Bus Interconnect, (b) Crossbar Interconnect. x-axis: Million Clock Cycles; y-axis: Interconnect Requests per 250000 Clock Cycles; series: Data Sends, Instruction Sends, Coherence Sends]
Figure 6.5: LUNoncontig Request Profile
• There are a number of benchmarks where the bus and the crossbar have nearly identi-
cal performance. FMM, WaterSpatial, WaterNSquared, Radix and Raytrace are in this
category.
• Barnes and Cholesky have a large variation in the request rate between the different
interconnects.
• Lastly, a few benchmarks must be discarded due to problems with the M5 system call
emulation thread implementation. OceanNoncontig, OceanContig and FFT fall into this
category.
Figure 6.5 illustrates why the crossbar outperforms the bus with the LUNoncontig benchmark.
Here, the lines represent the number of requests to the interconnect over time. The blue lines
represent coherence requests and the red lines represent requests for data from the L2 cache. In
the bus case, L1 to L1 transfers and data transfers from the L2 cache share the same transmission
channel. This can be seen by noticing that when there are many coherence requests, there
are few data requests and vice-versa. In the crossbar case, these requests can be handled in
parallel. Consequently, the variations of the graph follow the variation in bandwidth demand.
An additional point is that there are very few instruction requests as shown by the green line
being practically invisible.
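The difference between a shared and a separate transmission channel can be illustrated with a toy capacity model. Everything in this sketch is an assumption made for the illustration: the function name, the per-interval capacity, the demand values and the choice to serve coherence requests first on the shared channel. It does not describe the arbitration used in the simulated interconnects.

```python
def served_per_interval(coherence_demand, data_demand, capacity, shared_channel):
    """Requests served in one interval; the arbitration policy is an assumption."""
    if shared_channel:
        # Bus: both request types compete for one channel.
        # Coherence requests are served first in this sketch.
        coherence = min(coherence_demand, capacity)
        data = min(data_demand, capacity - coherence)
    else:
        # Crossbar: separate paths, each with its own capacity.
        coherence = min(coherence_demand, capacity)
        data = min(data_demand, capacity)
    return coherence, data


# Hypothetical demand in one sampling interval.
bus_served = served_per_interval(50, 40, capacity=60, shared_channel=True)
crossbar_served = served_per_interval(50, 40, capacity=60, shared_channel=False)
```

On the shared channel, a burst of coherence requests leaves little room for data requests in the same interval, while the separate paths serve both demands in full.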
For FMM, WaterNSquared, WaterSpatial, Raytrace and Radix the performance of the bus and
the crossbar is nearly the same. The reason is that the interconnect requests are predominately
L1 to L1 transfers. Consequently, most requests to the crossbar use its L1 to L1 bus. This bus is
very similar to the bus interconnect, so it is no surprise that they have similar performance. The
small differences in performance are due to the small differences in network load shown in figure
6.3. As noted earlier, it is unclear if these results are representative for the Raytrace benchmark.
In addition, a few M5 bugs manifest themselves with the WaterNSquared and WaterSpatial
benchmarks but in a similar way for all interconnects.
Barnes, Cholesky and LUContig have large differences in interconnect bandwidth demand be-
tween different configurations. Consequently, it is difficult to know how the interconnects per-
form relative to each other. In the Barnes and LUContig cases, the crossbar outperforms the bus.
However, the bandwidth demand is also higher with the bus than with the crossbar. Conversely,
the bus performs better than the crossbar with the Cholesky application. Here, the crossbar has
a significantly larger number of requests to deal with. In summary, the performance difference
is probably due to variations in bandwidth demand and does not provide evidence either way
on which interconnect is better.
The last group of benchmarks is the one where the results must be discarded. With OceanContig
and OceanNoncontig, there are large performance differences and unlikely results. These are
due to the known M5 system call emulation thread implementation flaws. The FFT benchmark
is simulated for a very short period of time to avoid what is probably a different M5 problem.
Analysing these benchmarks with the better thread implementation in the full system simulation
mode of M5 is left as further work.
6.3 Parallel Benchmark Performance with 8 CPUs
In this section, the simulation results for the 8-core CMP are presented and discussed. As usual,
it starts with a discussion of the bandwidth needs of the different applications before the actual
performance measurements are discussed.
6.3.1 Bandwidth Demand with 8 Cores
Figure 6.6 shows the number of requests per committed instruction. Again, most benchmarks
have similar request rates across the different configurations. However, Barnes and Cholesky
stand out. Here, one processor has a lot of reads from a different processor’s L1 cache for both
the bus and crossbar interconnects. Consequently, this processor is slowed down and it commits
fewer instructions. Since the total number of instructions committed is reduced, the number
of requests per committed instruction is also lower in these cases. Note that this might suggest
that the impact of data distribution is reduced when the number of processors is increased for
these interconnects. Investigating this theory is interesting further work.
Figure 6.6: Interconnect Request Rate in Sample for an 8-core CMP (x-axis: benchmark; y-axis: interconnect requests per committed instruction; series: Bus, Crossbar, Ideal)
6.3.2 8-core Performance Results
Figure 6.7 shows the performance results for the 8-core CMP. Again, the choice of interconnect
has a considerable impact on overall system performance. Furthermore, the performance of the
bus and crossbar interconnects is more similar than with the 4-core CMP. It
is somewhat surprising that the performance degradation compared to the ideal interconnect
is not larger. When comparing figure 6.7 to the 4-core results in figure 6.4, the performance
degradations experienced are reasonably similar. This might be due to the problem sizes being
kept constant when the number of processors is increased. Consequently, the total amount of
data communication is the same in both cases. Since the interconnects are already operating
at their maximum capacity in the 4-core case, this communication cannot be carried out any
faster with eight processors. However, more work is needed to verify this theory.
The division of the benchmarks into categories presented in section 6.2.2 for the 4-core CMP holds
for the 8-core CMP as well. Again, FMM, WaterNSquared, WaterSpatial and Radix all have
nearly identical performance for the bus and crossbar interconnects. However, the M5 thread
library problems manifest themselves more often with 8 processor cores. Consequently, these
results are probably not exact. Taking these limitations into account, the results still support
the theory that the performance of the crossbar approaches the performance of a bus when there
are a lot of L1 to L1 transfers.
Figure 6.7: Scientific Workload Performance in an 8-core CMP (x-axis: benchmark; y-axis: sum of IPC relative to ideal; series: Bus, Crossbar)
LUNoncontig is again the only benchmark where the crossbar outperforms the bus. In addition,
there is a higher bandwidth demand for the crossbar than the bus. Consequently, the results
support the theory that this benchmark benefits from an interconnect where L1 to L2 and L1
to L1 transfers can be handled in parallel. A possible source of error is the M5 thread library
flaws, but these manifest themselves in a similar way for all interconnects for this benchmark.
As in the 4-core case, the LUContig, Barnes and Cholesky benchmarks have large variations in
the bandwidth demand between interconnects. For LUContig, there is only a small performance
difference between the bus and crossbar interconnects, and this is due to the difference in band-
width demand. Cholesky and Barnes both have a surprisingly low performance degradation
with the bus and crossbar compared to the ideal interconnect. The reason is that the bandwidth
demand with the ideal interconnect is much larger than for the other interconnects. Although
it can handle an unlimited number of requests in parallel, they all experience a transmission
delay. The result is that the performance degradation with the realistic interconnects is underestimated.
The small performance difference between the bus and crossbar interconnects is
caused by a small difference in bandwidth demand in the Cholesky case and the presence of a
number of L1 to L2 transfers for Barnes.
The results with the OceanContig, OceanNoncontig and FFT benchmarks must be discarded
in this case as well. For the OceanContig and OceanNoncontig benchmarks, there are so many
deadlocks that it is likely that a lot of the execution of the benchmark is serial. Consequently,
the bandwidth demands are quite limited. Again, FFT is only executed in the serial section to
avoid what is probably an M5 thread library problem.
Chapter 7
Butterfly Interconnect Evaluation
The aim of this chapter is to apply some of the lessons learnt in the experiments carried out in
this work. Because of time constraints, a full exploration of possible performance improvement
techniques will not be carried out. Instead, the focus will be on one technique: a multistage
butterfly network.
To achieve good performance with communicating workloads in a shared L2 cache CMP, efficient
L1 to L1 communication should be supported. A conventional CMP crossbar cannot do this, as
the previous chapter shows. The reason is that this network optimises the single-thread common
case of efficient L2 access and has limited L1 to L1 bandwidth. A simple solution to this problem
would be to use a full crossbar network. In this case, all network nodes can communicate with
all other nodes through point-to-point links. However, the number of channels in a full crossbar
is O(N^2), where N is the number of nodes in the network. In other words, the hardware cost is
prohibitive when there are many nodes.
Multistage networks support this all-to-all interconnection with O(N log N) channels. An important
multistage interconnect is the butterfly network [DT03], which was discussed in chapter
2. This chapter evaluates the butterfly network in a CMP context.
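To make the difference in growth rates concrete, the following sketch tabulates channel counts for a few network sizes. It is only an illustration under stated assumptions: the full crossbar is counted as one unidirectional link per ordered node pair, N(N-1), and the butterfly is assumed to be radix-2 with N channels in each of log2(N)+1 channel ranks. Both counting conventions and the function names are hypothetical and are not taken from the simulator used in this work.

```python
def full_crossbar_channels(n: int) -> int:
    """Assumed model: one unidirectional link from every node to every other node."""
    return n * (n - 1)


def butterfly_channels(n: int, radix: int = 2) -> int:
    """Assumed model: n channels per rank, and (number of stages + 1) ranks."""
    stages, size = 0, 1
    while size < n:
        size *= radix
        stages += 1
    if size != n:
        raise ValueError("n must be a power of the radix")
    return n * (stages + 1)


if __name__ == "__main__":
    print("nodes  crossbar  butterfly")
    for nodes in (4, 8, 16, 64):
        print(f"{nodes:5d}  {full_crossbar_channels(nodes):8d}  {butterfly_channels(nodes):9d}")
```

Under these assumptions the two counts are equal for 4 nodes, and the crossbar count grows much faster as nodes are added.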
It is unlikely that the butterfly network will outperform the crossbar for the multiprogrammed
workloads. As shown in chapter 5, the crossbar performs close to the ideal interconnect in all
cases with multiprogrammed workloads. However, chapter 6 showed that the performance of
a crossbar approaches that of a shared bus when there is frequent L1 to L1 communication.
Therefore, a performance improvement is expected for this application type.
This chapter is organised as follows:
• First, the performance of the butterfly with multiprogrammed workloads is presented and
discussed in section 7.1.
• Then, section 7.2 discusses performance of the butterfly with communicating programs
from the SPLASH-2 benchmark suite.
The results from chapters 5 and 6 are shown again in the graphs in this chapter. This makes
it easy for the reader to compare the CMP performance with the butterfly to the performance
with the other interconnects.
Figure 7.1: Butterfly Performance in a 2-core CMP (x-axis: benchmark; y-axis: average IPC relative to ideal; series: Bus, Butterfly, Crossbar)
7.1 Butterfly Performance with Multiprogrammed Workloads
This section discusses the performance of a CMP with a butterfly interconnect and multipro-
grammed workloads from the SPEC 2000 benchmark suite. It has the following outline:
• Section 7.1.1 discusses the results from simulating a 2-core CMP.
• Then, 7.1.2 presents the results from the 4-core CMP experiments.
• Finally, 7.1.3 investigates the butterfly performance in an 8-core CMP.
7.1.1 2-core CMP Multiprogrammed Performance
Figure 7.1 shows the performance of the CMP with bus, crossbar and butterfly interconnects
relative to the performance of an ideal interconnect. Recall from chapter 3 that the minimum
transfer latency of the butterfly is one clock cycle larger than that of the other interconnects for the 2-core
CMP. As mentioned in chapter 5, apsi is the only benchmark where the interconnect is critical
for performance. Here, the butterfly performs much better than the bus but worse than the
crossbar, as expected. This difference arises mostly because all transfers take one clock cycle
longer with the butterfly than with the crossbar. For the other benchmarks, the choice of interconnect
has only a small performance impact.
The CMP performance with the butterfly interconnect is better than with the ideal interconnect
for the ammp, bzip, mgrid, facerec and applu benchmarks. Obviously, this is not due to the
delay of the butterfly being lower. On the contrary, the lower performance of the interconnect
causes the caches to perform better. For ammp, bzip, mgrid and facerec, there are fewer L2
cache misses with the butterfly than with the ideal interconnect. With applu, there are fewer L1
instruction cache misses. A likely explanation is that the butterfly delays the requests enough
to keep some frequently used data in the caches.
Figure 7.2: Butterfly Performance in a 4-core CMP (x-axis: benchmark; y-axis: average IPC relative to ideal; series: Bus, Butterfly, Crossbar)
The CMP performance is worse with the butterfly than with all other interconnects for 10
benchmarks. With the lucas, vortex, perlbmk, wupwise, gap and mcf benchmarks, the reason is
that there are not enough requests to make the queue delay of the bus large enough for the
total delay to be greater than the minimum butterfly delay. For vpr, gzip, mesa and gcc the
delay of the bus is very similar to the butterfly delay. Consequently, small changes in the timing
of requests create the small effects that create the performance difference. Since the differences
are small, it is time-consuming to track down the exact causes. It is better to spend this time
on effects relevant to the problem at hand. Mainly, these effects highlight that the choice of
interconnect only has a small performance impact for a 2-core CMP for most of the SPEC2000
benchmarks.
This point is further backed up by the performance of swim, eon and fma3d. Here, the CMP
performance with the butterfly is better than with the ideal interconnect. This time, the reason
is that the better performance of the ideal interconnect results in the programs continuing further
down a wrong execution path.
7.1.2 4-core CMP Multiprogrammed Performance
This section discusses the performance of the butterfly interconnect with multiprogrammed
workloads and a 4-core CMP. The results are shown in figure 7.2. As expected, the performance
with the butterfly is better than the performance with the bus and worse than the performance
with the crossbar. Again, the minimum transfer delay of the butterfly is one clock cycle longer
than the minimum latency of the other interconnects.
The performance with the butterfly is better than the performance with the ideal interconnect
in a few cases. Again, the reason is that the somewhat slower performance of the butterfly
makes some other part of the architecture perform better. With bzip and ammp, the reason is
that the butterfly avoids a few cache misses. The situation is more complex with gcc, applu and
facerec because the butterfly configuration performs better than the bus in some workloads and
worse in others. Since the harmonic mean emphasises the lowest number, the configurations
with the lowest performance have a stronger impact on the result. For these benchmarks, the
good performance is due to fewer cache misses, less cache blocking or a combination of these. In
some workloads, the memory bus is badly congested and this amplifies the performance effect
of these small changes.
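The effect of the harmonic mean described above can be illustrated with a short sketch. The IPC values are hypothetical and chosen only to show the behaviour; they are not measurements from these experiments.

```python
from statistics import harmonic_mean, mean

# Hypothetical IPC of one benchmark in three different workloads (not measured data).
ipc_per_workload = [1.0, 1.0, 0.25]

arithmetic = mean(ipc_per_workload)
harmonic = harmonic_mean(ipc_per_workload)

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"harmonic mean:   {harmonic:.2f}")
```

With these values a single slow workload pulls the harmonic mean well below the arithmetic mean, which is why a few badly performing workloads can dominate the reported result for a benchmark.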
With wupwise, the butterfly configuration performs worse than all other interconnects. This
is due to workload 32 where the L1 cache is blocked for a longer time with the butterfly than
with the other interconnects. Consequently, the performance of the butterfly configuration is
degraded.
7.1.3 8-core CMP Multiprogrammed Performance
This section discusses the performance of the butterfly with multiprogrammed workloads and
an 8-core CMP. First, the performance results from the 8-core CMP described in chapter 3
are presented. This configuration provides too little L2 cache space per core, and this becomes
the predominant bottleneck. Consequently, the interconnect only has a secondary impact on
performance. Therefore, a CMP where the L2 cache size is increased to 8 MB and the number
of MSHRs per bank is increased to 16 is simulated. For both configurations, the minimum delay
of the butterfly is the same as the minimum delay for the other configurations.
7.1.3.1 Original 8-core CMP Configuration
Figure 7.3 shows the simulation results for the original 8-core configuration. Here, the perfor-
mance with the butterfly is chaotic. For some benchmarks a large speed-up is observed and for
others a large performance degradation. A possible explanation is that the butterfly can give
very good performance for a favourable traffic pattern while it can give low performance for an
unfavourable pattern. When this is coupled with a strained cache system, the effects can be
severe.
Again, this experiment does not give too much insight into interconnect performance as the
caches are the predominant bottleneck. Therefore, only the performance of the applu and the
gap benchmarks will be discussed. Applu has very good performance with the butterfly network
while gap experiences a severe performance degradation. In the applu case, the performance
improvement is due to applu being very lucky with how its L1 cache misses are serviced. In
workload 8, applu’s L1 cache hardly blocks at all. Since the available resources have not
been increased, this results in much more blocking for the other applications in this workload.
Gap has very low performance with the butterfly in some workloads and very good performance
in others. Workload 17 is an example where gap experiences a performance degradation. Here,
gap’s L1 cache blocks a lot more than with the other interconnects. Furthermore, there are more
L2 cache misses in the L2 bank that gap uses.
Figure 7.3: Butterfly Performance in an 8-core CMP (x-axis: benchmark; y-axis: average IPC relative to ideal; series: Bus, Butterfly, Crossbar)
These examples indicate an interesting point. When there are not enough resources to serve
the needs of all applications, the achieved performance can be quite random with the butterfly
interconnect. A possible explanation is that conflicts in the butterfly impact some benchmarks
more than others. Consequently, the benchmarks that have few conflicts will get more of their
data into the L2 cache as they are able to access it more often. This leads to other applications'
data being thrown out and therefore they experience a performance degradation. Ideally, per-
formance should be predictable when the processor is heavily loaded. Therefore, the butterfly
seems badly suited to a CMP where this situation is expected to occur relatively often. Inves-
tigating techniques that ensure that all cores are given a fair share of the shared resources is
interesting further work.
7.1.3.2 8-core CMP Configuration with Large Cache
Figure 7.4 shows the performance results from the experiments with the large cache, 8-core CMP.
Here, the performance with the butterfly is very close to the performance with the crossbar and
the ideal interconnects. This is actually a bit better than expected, and two factors contribute
to these good results. Firstly, the minimum delay through the butterfly is the same as through
the other interconnects with the 8-core CMP. Secondly, there are some unused nodes in the
butterfly. The reason is that only 12 nodes are needed, but the number of nodes in a butterfly
must be a power of two. This point was discussed in section 3.2.3.3, and the reason is that
there are 8 cores and 4 L2 banks. Consequently, the butterfly has 16 nodes, and this reduces
the probability of conflicts in the butterfly.
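The node count arithmetic above can be written out as a minimal sketch. The function name is hypothetical; the only inputs taken from the text are the 8 cores and 4 L2 banks.

```python
def butterfly_nodes(cores: int, l2_banks: int) -> tuple[int, int]:
    """Return (nodes in the butterfly, unused nodes) when rounding up to a power of two."""
    needed = cores + l2_banks
    nodes = 1
    while nodes < needed:
        nodes *= 2
    return nodes, nodes - needed


if __name__ == "__main__":
    nodes, unused = butterfly_nodes(cores=8, l2_banks=4)
    print(f"{nodes} nodes, {unused} unused")
```

For the 8-core configuration this gives 16 nodes, 4 of which are unused.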
Figure 7.4: Butterfly Performance in an 8-core CMP with Large Cache (x-axis: benchmark; y-axis: average IPC relative to ideal; series: Bus, Butterfly, Crossbar)
The only strange result that can be seen in the graph is that the butterfly configuration performs
worse than all other interconnects with gcc. This is mainly due to more L1 misses with the butterfly
than with the other interconnects in a few workloads. Since the harmonic mean emphasises the
lowest number, these workloads make the performance degradation show up in the graph. The
exact cause of the good bus performance is unclear, but it is probably due to complex interactions
between the number of L1 misses, the number of L2 misses and cache lock-up time. This point
was discussed more thoroughly in section 5.3.2.2.
With mcf, swim, wupwise, gap, bzip and applu the bus outperforms the other interconnects due
to less L1 cache lock-up time. This point was also discussed in section 5.3.2.2.
7.2 Butterfly Performance with Scientific Workloads
This section discusses the CMP performance with the butterfly interconnect and applications
from the SPLASH-2 benchmark suite. It discusses the results from experiments with 2-, 4- and
8-core CMPs.
Table 7.1 shows the configurations where the M5 system call emulation thread library flaws were
observed. The bus, crossbar and ideal values are repeated for completeness. These flaws were
discussed in chapter 4. As in chapter 6, the results from the OceanContig and OceanNoncontig
benchmarks must be discarded. The reasons are that the flaws are observed very often or in a
different way for each interconnect with one benchmark. For the other benchmarks, the flaws
are observed rarely enough that the results can be used. However, it is likely that some error is
introduced. Consequently, important future work is to rerun the experiments in the full system
simulation mode of M5 where the thread implementation allegedly is better. Furthermore, the
results for FFT are discarded as it is only simulated in the serial section. This is due to what
is probably another M5 thread library bug. Section 6.2.1 contains a more detailed discussion of
this problem.

Configuration  Benchmark       Bus  Butterfly  Crossbar  Ideal
2 CPUs         LUContig        X    -          X         X
               OceanContig     X    X          X         -
               OceanNoncontig  X    X          X         X
               WaterNSquared   -    X          -         -
               WaterSpatial    -    -          X         X
4 CPUs         LUContig        X    -          X         X
               OceanContig     X    X          X         X
               OceanNoncontig  X    X          X         X
               WaterNSquared   X    X          X         X
               WaterSpatial    -    -          X         X
8 CPUs         Cholesky        X    -          -         -
               LUContig        X    -          X         X
               LUNoncontig     X    X          X         X
               OceanContig     X    X          X         X
               OceanNoncontig  X    X          X         X
               Radix           X    -          X         X
               Raytrace        X    -          -         -
               WaterNSquared   X    X          X         X
               WaterSpatial    X    X          X         X

Table 7.1: Splash Benchmarks where M5 Flaws were Observed
7.2.1 2-core CMP SPLASH-2 Performance
This section discusses the performance of the 2-core CMP with the different interconnects and
benchmarks from the SPLASH-2 benchmark suite. Figure 7.5 shows the interconnect request
rate and figure 7.6 shows the performance results. The coherence protocol, parallel phase be-
haviour of the benchmark and the M5 thread library flaws can create variation in the inter-
connect request rate between different interconnects for the same benchmark. This point was
thoroughly discussed in section 6.1.1. Furthermore, the minimum transfer delay of the butterfly
is one clock cycle larger than in the other interconnects.
For most benchmarks, the butterfly configuration outperforms the bus and crossbar configurations.
Furthermore, it performs close to the ideal interconnect with the FMM and LUContig
benchmarks. For FMM, this is due to actual good performance, but for LUContig it is probably
a bit flattering. The reason is that the butterfly has fewer requests to handle than the ideal interconnect.
With Cholesky the variation in requests makes the crossbar look better than it really is
because it has many fewer requests to handle than the other interconnects.
With the Barnes and LUNoncontig benchmarks, the butterfly configuration performs better than
the bus but worse than the crossbar. In the Barnes case, the explanation is simply that the
crossbar has significantly fewer requests to deal with. However, the situation is somewhat more
complicated with LUNoncontig. Here, the average delay each request experiences is actually
only 10.2 clock cycles with the butterfly while it is 11.0 clock cycles with the crossbar. The
performance difference is due to more cache blocking with the butterfly.
Figure 7.5: Total Interconnect Requests in Sample for a 2-core CMP (x-axis: benchmark; y-axis: interconnect requests per committed instruction; series: Bus, Butterfly, Crossbar, Ideal)
Figure 7.6: Butterfly Communication Performance in a 2-core CMP (x-axis: benchmark; y-axis: sum of IPC relative to ideal; series: Bus, Butterfly, Crossbar; one off-scale bar is labelled 315%)
Figure 7.7: Total Interconnect Requests in Sample for a 4-core CMP (x-axis: benchmark; y-axis: interconnect requests per committed instruction; series: Bus, Butterfly, Crossbar, Ideal)
FFT and Radix both have little communication and all interconnects can handle it. Conse-
quently, there is no performance difference. The results from the experiments with OceanContig
and OceanNoncontig are discarded as these benchmarks have been severely exposed to the M5
thread library flaws. They are shown for completeness.
7.2.2 4-core CMP SPLASH-2 Performance
Figure 7.7 shows the number of requests per committed instruction for all benchmarks and con-
figurations, and figure 7.8 shows the 4-core CMP performance with the different interconnects.
Again, the minimum latency through the butterfly is one clock cycle larger than the minimum
latency through the other interconnects.
The most interesting result in figure 7.8 is probably the Radix performance with the butterfly
interconnect. Here, it actually performs as well as the ideal interconnect. This highlights the
abundant bandwidth available with a butterfly when the traffic pattern is favourable.
The butterfly configuration outperforms the ideal with the Barnes and Cholesky benchmarks.
For Cholesky, a part of the explanation is that the butterfly has far fewer requests to deal with
than the ideal interconnect. Furthermore, the processors that do not reach the maximum instruc-
tion count get more work done before simulation is finished with the butterfly. Consequently,
the sum of IPC is larger and it seems like the butterfly outperforms the ideal interconnect.
For Barnes, the butterfly actually has more requests to deal with than the ideal interconnect.
However, the lower performance of the butterfly reduces the number of cache misses and cache
lock-up time compared to the ideal configuration.
Figure 7.8: Butterfly Communication Performance in a 4-core CMP (x-axis: benchmark; y-axis: sum of IPC relative to ideal; series: Bus, Butterfly, Crossbar; two off-scale bars are labelled 131% and 224%)
Again, the results from the OceanContig, OceanNoncontig and FFT benchmarks are discarded
due to problems with M5.
7.2.3 8-core CMP SPLASH-2 Performance
This section discusses the experimental results with the SPLASH-2 benchmarks and the 8-core
CMP. Here, the applications are so communication intensive that the interconnect continues to
be the predominant bottleneck. Consequently, there was no need to increase the L2 cache size.
Figure 7.9 shows the interconnect request rate for this configuration and figure 7.10 shows the
performance results.
With FMM, WaterNSquared, Raytrace and LUContig, the butterfly configurations outperform
the bus and crossbar configurations due to a better performing interconnect. Furthermore, the
butterfly configurations are slower than the ideal interconnect configurations for these bench-
marks. Consequently, they perform as expected in these cases. However, the butterfly configu-
ration should perform closer to the ideal configuration with the FMM benchmark. The reason
for this low performance is that an unfavourable traffic pattern creates a hot channel in the
butterfly. This results in the average request delay being as large as 126 clock cycles. With the
ideal interconnect this delay is 9 clock cycles, and the bus and crossbar have an average request
delay of 1819 and 1820 clock cycles, respectively. This illustrates how the parallelism available
within the butterfly can be wasted for some traffic patterns.
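The size of these differences is easier to see as ratios. The sketch below uses only the four average request delays quoted above for the FMM benchmark on the 8-core CMP; the dictionary and helper function are illustrative names, not part of the simulator.

```python
# Average request delay in clock cycles for FMM on the 8-core CMP, as quoted in the text.
avg_delay = {"ideal": 9, "butterfly": 126, "bus": 1819, "crossbar": 1820}


def delay_ratio(slower: str, faster: str) -> float:
    """How many times larger the average delay of `slower` is than that of `faster`."""
    return avg_delay[slower] / avg_delay[faster]


if __name__ == "__main__":
    print(f"butterfly vs ideal: {delay_ratio('butterfly', 'ideal'):.1f}x")
    print(f"bus vs butterfly:   {delay_ratio('bus', 'butterfly'):.1f}x")
```

The butterfly delay is 14 times the ideal delay, while the bus delay is in turn roughly 14 times the butterfly delay.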
Figure 7.9: Total Interconnect Requests in Sample for an 8-core CMP (x-axis: benchmark; y-axis: interconnect requests per committed instruction; series: Bus, Butterfly, Crossbar, Ideal)
Figure 7.10: Butterfly Communication Performance in an 8-core CMP (x-axis: benchmark; y-axis: sum of IPC relative to ideal; series: Bus, Butterfly, Crossbar)
There are also a number of cases where the butterfly configuration outperforms the ideal configuration.
With the Radix benchmark, the butterfly slows down the execution of the processor
that reaches the maximum instruction count. Consequently, the other processors get more work
done and it seems like the butterfly performs better than the ideal interconnect. LUNoncontig
has between 1000 and 2000 fewer L2 misses in each bank when the butterfly interconnect is used.
This increased cache performance causes the good butterfly performance in this case. For Wa-
terSpatial, the difference is probably due to the M5 problems manifesting themselves in different
ways. In particular, there are a few more deadlocks with the butterfly than with the other
interconnects. Consequently, the results from this benchmark can probably not be trusted.
With the Barnes benchmark, the butterfly performs worse than both the bus and the crossbar.
The reason is that it has a lot more requests to deal with. Consequently, it is difficult to compare
the different interconnects in this case. The situation with the Cholesky benchmark is similar
with the butterfly and ideal interconnects having many more requests to deal with. However, the
butterfly still outperforms the crossbar and the bus. It is likely that this performance difference
would be much larger if the bus and crossbar had to handle the same number of requests as the
butterfly.
As usual, the results from the OceanContig, OceanNoncontig and FFT benchmarks are discarded
due to problems with M5.
Chapter 8
Discussion and Evaluation
The main purpose of this chapter is to answer the research questions stated in section 3.1.
Section 8.1 takes care of this. Then, section 8.2 discusses a few choices made while carrying out
this work. With the benefit of hindsight, they do not seem as clever as they did when they were
made. Consequently, it is important to document them so that better choices can be made in
the future.
8.1 Discussion
8.1.1 Multiprogrammed Workload Performance
How does the CMP on-chip interconnect between private and shared caches influence overall
system performance for multiprogrammed workloads?
The answer to this question is highly dependent on the benchmark and interconnect used. A
CMP that uses the crossbar interconnect performs very close to the ideal interconnect for all
multiprogrammed workloads and CMP configurations investigated in this report. In this case,
the performance impact of the interconnect is very small. If the bus interconnect is used, the
performance impact can be severe. For instance, the bus configuration performs 58% worse than
the ideal interconnect with the apsi benchmark on the 8-core, large cache CMP. Consequently,
the choice of interconnect is important for overall system performance as a bad choice can
severely limit performance.
8.1.2 Parallel Workload Performance
How does the CMP on-chip interconnect between private and shared caches influence overall
system performance for scientific workloads?
For the scientific workloads, the impact of the interconnect on overall system performance is
large. The FFM benchmark experiences a 93% performance degradation with the bus intercon-
nect and a 62% degradation with the crossbar interconnect compared to the ideal interconnect
configuration on the 2-core CMP. In other words, the interconnect performance is critical even
with only two processing cores. With the 8-core CMP, the performance degradation is 97% for
both the bus and crossbar interconnects. The reason for the small difference is probably that the
interconnects are operating at their maximum capacity and that the same problem size is used
for the 2, 4 and 8 processor experiments. In other words, the total amount of communication
is not increased when the number of processors is increased. Investigating the effects of scaling
the problem sizes with the number of processors is possible further work.
8.1.3 Performance Impact of Interconnect Enhancements
Can improvements to the private to shared cache interconnect improve performance for both
multiprogrammed and scientific workloads?
This question is covered by the evaluation of the butterfly interconnect with both multipro-
grammed workloads and scientific applications. For the scientific applications, overall system
performance with the butterfly interconnect is much better than with the crossbar and bus
interconnects. If we discard the results from the applications where the comparison is not valid
due to variation in request rate or problems with the M5 system call emulation thread library
(see footnote 1), the butterfly configuration on average performs 3.9 times better than the bus
and 3.8 times better than the crossbar on the 8-core CMP. This impressive speed-up is probably to a larger extent
an indication of the low L1 to L1 cache communication performance of the crossbar than the
merits of the butterfly interconnect. In other words, the results should be interpreted as a large
speed-up being available if sufficient L1 to L1 cache bandwidth is provided rather than the but-
terfly network being the definitive answer to the problem. Furthermore, the hardware cost of
the butterfly networks used in this work is probably higher than the cost of the crossbars used.
With the multiprogrammed workloads, the crossbar is difficult to beat in terms of performance.
The reason is that overall system performance with the crossbar is close to the ideal interconnect
in all cases simulated in this work. However, the performance of the butterfly is not much worse.
Sadly, conflicts in the butterfly impact benchmarks in a non-uniform way and this leads to
some benchmarks in a workload experiencing a speed-up and others a considerable performance
degradation. This effect is evident when there is extensive competition for space in the L2
cache as shown in the small L2 cache, 8-core CMP experiments. Consequently, some fairness
constraints are needed before the butterfly can be applied in this context.
The good performance of the crossbar and butterfly interconnects is due to overprovisioning of
resources. Consequently, the utilisation of many channels in these interconnects can be very low.
In other words, we have a hardware structure that uses a lot of area, and much of this hardware
is only rarely used. This is hardly efficient. By using the allocated hardware in a more efficient
manner, it should be possible to reduce the hardware needs at the cost of a modest performance
degradation. This saved area can then be used to provide more cache space and this might make
up for the performance loss in the interconnect. Pursuing this further is promising further work.
8.2 Evaluation
This section discusses a few choices made in this work that were not the best possible choices.
First, the choice of a few CMP model parameters is discussed. Then, a couple of improvements
to the use of the M5 simulator are presented. Finally, some implementation decisions regarding
the simulator extensions are discussed.
Footnote 1: The discarded benchmarks are FFM, LUContig, Cholesky, Barnes, OceanContig,
OceanNoncontig, WaterSpatial and FFT.
8.2.1 CMP Model Configuration
The main problem with the CMP model is that only 4 MSHRs are available in each L1 data
cache. This hardware structure determines the number of outstanding misses the cache can
handle without blocking. For comparison, the Intel Pentium 4 processor has 8 MSHRs in its L1
data cache [BBH+04]. Choosing too few MSHRs results in the cache being blocked
for long periods of time. Furthermore, this behaviour depends on the timing of cache misses, and
it therefore often differs between interconnects for the same benchmark. In addition,
a low number of MSHRs limits the pressure on the interconnect. Consequently, increasing the
number of MSHRs will probably increase the performance impact of the choice of interconnect.
In a different work, we
found that 8 or 16 MSHRs in the L1 cache is a good compromise between area and performance
for a 4-core CMP [JN07].
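The blocking behaviour described above can be illustrated with a toy model. This is a minimal sketch and not the M5 implementation: the class name MSHRFile and its methods are hypothetical, and it is an assumption that a second miss to a block that already has a miss in flight is merged into the existing MSHR.

```python
class MSHRFile:
    """Toy model of the miss status holding registers (MSHRs) of one cache."""

    def __init__(self, num_mshrs):
        self.num_mshrs = num_mshrs
        self.outstanding = set()  # block addresses with a miss in flight

    def is_blocked(self):
        # The cache cannot accept a new primary miss when every MSHR is in use.
        return len(self.outstanding) >= self.num_mshrs

    def miss(self, block_addr):
        """Register a miss. Returns False if the cache has to block."""
        if block_addr in self.outstanding:
            return True  # assumed: secondary miss merges into the existing MSHR
        if self.is_blocked():
            return False
        self.outstanding.add(block_addr)
        return True

    def fill(self, block_addr):
        """The block has arrived from the next level; free its MSHR."""
        self.outstanding.discard(block_addr)
```

With 4 MSHRs, a fifth miss to a new block is rejected until one of the first four has been filled, which is the situation that limits the pressure on the interconnect.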
The latencies of the different interconnects were based on the numbers computed by Kumar et
al. [KZT05]. However, these numbers are highly dependent on the CMP floorplan and properties
of the interconnect. Consequently, more work is needed to explore the realism of these numbers.
The impact of this uncertainty is reduced by using the same transfer latency for all interconnects.
This is probably not an unreasonable assumption as this delay is mostly given by the distance
the signals must travel on the chip. Since the same units are connected in all cases, this delay is
likely to be reasonably uniform across interconnects. To summarise, it is the available parallelism
in the interconnect that is investigated in this work and not the impact of end-to-end transfer
latency.
8.2.2 Use of the M5 Simulator
This section discusses how the M5 simulator can be used better. The main point here is that
using the M5 system call emulation thread library is a bad idea. If parallel applications are
simulated, it is better to use M5’s full system mode. Sadly, the extent of the problems with this
library was not even known to the M5 development team when this choice was made. Changing
from system call emulation to full system simulation is time consuming, and there was simply not
enough time to do this when the problem was discovered. Therefore, the system call emulation
mode was used for the experiments. To control the introduction of errors, extensive tracing of
the occurrence of bugs was implemented. This strategy worked well and made it possible to
remove many obviously erroneous results.
The choice of which cache bank to use in M5 is carried out by allocating a contiguous address
range to each bank. For the scientific benchmarks, this results in all accesses going to the
same bank. A better way to choose the bank is to use the least significant address bits. This
is equivalent to the modulo operation as long as the number of banks is a power of two. The
experiments with the scientific workloads use modulo bank selection while the multiprogrammed
workloads use the M5 default. This might have resulted in poor chip-wide cache utilisation for
the multiprogrammed workloads. Consequently, modulo bank selection should be used in future
experiments.
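The two bank selection schemes can be compared with a short sketch. The function names, the 64-byte block size and the 32-bit address space are assumptions made for illustration, and the sketch takes the least significant bits of the block address, that is, the bits just above the block offset.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed value)


def contiguous_bank(addr, num_banks, addr_space_size=2**32):
    """Each bank owns one contiguous slice of the address space."""
    range_size = addr_space_size // num_banks
    return min(addr // range_size, num_banks - 1)


def modulo_bank(addr, num_banks):
    """Interleave consecutive blocks across banks using low-order bits."""
    if num_banks & (num_banks - 1) != 0:
        raise ValueError("bit selection requires a power-of-two bank count")
    block_addr = addr // BLOCK_SIZE
    # Masking the low bits equals block_addr % num_banks for powers of two.
    return block_addr & (num_banks - 1)


# Eight consecutive blocks from a small working set
addrs = [0x1000 + i * BLOCK_SIZE for i in range(8)]
contiguous = [contiguous_bank(a, 4) for a in addrs]
interleaved = [modulo_bank(a, 4) for a in addrs]
```

For these addresses, the contiguous scheme maps every access to bank 0, while the modulo scheme cycles through banks 0 to 3.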
The memory model and memory bus model used in this work are simple. For instance, all
accesses to main memory have the same latency. In reality, accesses to adjacent memory blocks
can be carried out in parallel, which reduces the access latency. Furthermore, the memory bus is
a generic bus design which does not model any real memory bus standard. This is probably an
advantage in this work as it makes the results easier to interpret. However, a more advanced
model is needed for work that has the memory and memory bus as its main focus.
Lastly, simulation of scientific workloads is terminated when one processor has committed a
certain number of instructions. This introduces some error as the number of instructions com-
mitted by the other processors might vary between configurations. It might be better to finish
simulation when the total number of committed instructions reaches a certain number. Another
possibility is to simulate for a fixed number of clock cycles. More work is needed to find the
best solution to this problem.
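The three termination criteria mentioned above can be stated compactly as follows. This is only a sketch: the function name, the policy names and the argument layout are hypothetical and do not correspond to M5 options.

```python
def should_stop(committed_per_cpu, cycle, policy, limit):
    """Return True when the simulation should end under the given policy."""
    if policy == "any_cpu":
        # Criterion used in this work: one processor reaches the limit.
        return max(committed_per_cpu) >= limit
    if policy == "total":
        # Alternative: the sum over all processors reaches the limit.
        return sum(committed_per_cpu) >= limit
    if policy == "cycles":
        # Alternative: a fixed number of simulated clock cycles.
        return cycle >= limit
    raise ValueError(f"unknown policy: {policy}")
```

Under the "any_cpu" policy, the work done by the remaining processors is not fixed, which is the source of the error discussed above.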
8.2.3 Implementation Decisions
There is no throttling mechanism implemented for the coherence protocol. This was not a good
choice as it makes the problem of congestion in the interconnect worse. Since the scientific
workloads are communication intensive, congestion was observed with all realistic interconnects.
A better solution might be to block the L1 cache when there are many active coherence messages.
Figuring out the details of this scheme is further work.
The crossbar implemented in this work considers the L1 data cache and L1 instruction cache as
different nodes and gives both of them a full set of transmission channels. It is likely that these
two sets of channels can be replaced by one set with only a minute performance impact. The
reason is that there are very few cache misses in the instruction cache. Verifying this assumption
is left as further work.
Chapter 9
Conclusion and Further Work
9.1 Conclusion
This report started with a review of the state-of-the-art of CMP design, on-chip CMP inter-
connects and cache coherence solutions for CMPs. Although a lot of previous work has focused
on interconnection networks and cache coherence protocols, relatively few researchers have
addressed this problem in a CMP context. Consequently, it is probably still possible to develop
better solutions to these problems. Furthermore, older techniques, such as the Stenström
protocol, might prove to be a good match for the new multi-core processor architectures.
The most important part of this work is the exploration of the CMP interconnect design space.
In particular, the bus and crossbar interconnects were evaluated with multiprogrammed work-
loads created from the SPEC2000 [SPEa] benchmark suite and SPLASH-2 scientific applications
[WOT+95]. With multiprogrammed workloads, a CMP with a crossbar interconnect performs
almost as well as one using the ideal interconnect. Consequently, the crossbar interconnect
should be used if this is the predominant application class and the hardware cost can be toler-
ated. However, this must be balanced against the other parts of the system as optimising the
parts independently might not yield the best solution [KZT05]. The bus interconnect does not
perform well for some benchmarks and should not be used.
The impact of the interconnect on overall system performance is severe for scientific applications.
Unfortunately, some of these results had to be discarded due to problems with the M5 system
call emulation thread library implementation. With scientific workloads, both the configurations
with the bus and crossbar interconnects experience a considerable performance degradation com-
pared to the ideal interconnect configuration. Furthermore, the performance with the crossbar
approaches the performance with the bus when the L1 to L1 communication intensity is high.
The reason is that the crossbar has very limited L1 to L1 bandwidth available.
A butterfly interconnection network was proposed to counter the problems observed with the
bus and crossbar for scientific applications. For the 8-core CMP used in this work, the butterfly
configuration on average performs 3.9 times and 3.8 times better than the bus and crossbar
configurations, respectively. These numbers are based on the results from the benchmarks
where the M5 problems had small effects and there was little variation in interconnect request
rate. For the multiprogrammed workloads, the butterfly configurations perform a bit worse
than the crossbar configurations. This is expected, as the crossbar performs very close to the
ideal configuration in all experiments with multiprogrammed workloads. However, there is large
variation in the performance of the benchmarks within a workload when there is extensive
competition for space in the shared L2 cache. Consequently, some fairness scheme is needed if
the butterfly is used for this class of workloads.
9.2 Further Work
This section describes possible further work. It is organised as three lists where the first one
discusses some further work of a general nature. Then, further work regarding the on-chip
interconnection network is discussed. Finally, a few possible future directions for cache coherence
protocol investigation are presented. Since this work will be continued in a PhD, it is especially
useful to write down these ideas.
The following list describes future work of a general nature:
• The most important further work is probably to rerun the scientific workload experiments
in M5’s full system mode. In particular, it is interesting to confirm whether the large
variation in interconnect request rate between configurations in the Cholesky and Barnes
benchmarks is an architectural effect or due to a problem with M5’s system call emulation
mode thread implementation.
• Non-Uniform Cache Access (NUCA) caches [KBK02] make the difference in access time
to cache banks due to the distances between a given core and the banks visible at the
architecture level. It is interesting to see how moving to this type of architecture influences
the findings in this report.
• There is a considerable body of research on the behaviour of the Splash-2 benchmarks in
the SMP context [WOT+95, CGS97]. Do these findings still hold on a CMP platform?
The interconnect investigations have also given some ideas for further work. These are described
in the following list:
• The butterfly and crossbar perform well because most channels are lightly loaded. However,
this is inefficient in terms of area. By mapping requests to channels in a more intelligent
fashion, it should be possible to use these channels more efficiently. A related possibility is
to investigate the impact of mapping the instruction and data traffic to the same crossbar
transmission channels.
• It is expected that the number of cores in a CMP will grow in the future. Since we
already have commercial implementations of 8-core CMPs, it is interesting to investigate
the impact increasing the number of cores has on the interconnect.
• The crossbar and butterfly implementations used in this work block when one of the
cache banks blocks. However, it is possible to continue to deliver requests to the banks
that are not blocked. Quantifying the performance impact of this optimisation is possible
further work. In addition, more intelligent blocking strategies can possibly be used to
ensure fair use of shared resources.
• The butterfly interconnection network used in this work is only one out of a number of
possible butterfly networks. A possible future direction is to investigate the performance
of butterflies with a different radix. In this case, the actual chip area consumption and
latencies should be estimated. It might be necessary to make a floorplan to get sufficiently
good estimates.
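To give a feeling for how the radix changes the structure of a butterfly, the sketch below computes the number of stages and switches of a generic k-ary n-fly, where n stages of radix-k switches connect k^n terminals on each side. This is the textbook construction and not necessarily the network configuration used in this work. The function name is hypothetical, and area and latency are not modelled.

```python
def butterfly_shape(num_terminals, radix):
    """Return (stages, switches_per_stage, total_switches) for a k-ary n-fly."""
    if radix < 2 or num_terminals < radix:
        raise ValueError("need radix >= 2 and at least radix terminals")
    stages, reach = 0, 1
    while reach < num_terminals:
        reach *= radix
        stages += 1
    if reach != num_terminals:
        raise ValueError("number of terminals must be a power of the radix")
    switches_per_stage = num_terminals // radix
    return stages, switches_per_stage, stages * switches_per_stage
```

For 8 terminals, a radix of 2 gives three stages of four 2x2 switches, while a radix of 8 degenerates to a single 8x8 switch, which is a crossbar. A higher radix thus trades fewer stages for larger switches.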
Although cache coherence protocol solutions have only had a secondary focus in this work, a
few possibilities for future work have been discovered. These are discussed in the following list:
• The Stenström directory protocol is an improvement over the simple MSI directory
protocol. Consequently, possible further work is to compare these protocols. This will give an
idea of the merits of the Stenström scheme in a CMP.
• Although the Stenström protocol has not been compared to any other directory protocols,
some improvements might be possible. Firstly, it might be possible to alleviate the
congestion in the interconnect by reducing the number of requests injected into the network.
Secondly, the experimental results seem to indicate that the protocol is sensitive to the
initial data distribution. A possible strategy is to improve on this distribution by migrating
cache blocks to the cache of the processor that most frequently uses the data. Finally,
software prefetches are discarded in the current implementation. It is unclear whether it is
better to retrieve the block for reading or for writing in this case.
Bibliography
[AB86] James Archibald and Jean-Loup Baer. Cache coherence protocols: evaluation using
a multiprocessor simulation model. ACM Trans. Comput. Syst., 4(4):273–298, 1986.
[AG96] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial.
IEEE Computer, 29(12):66–76, 1996.
[AHKB00] Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger. Clock rate
versus IPC: the end of the road for conventional microarchitectures. In ISCA ’00:
Proceedings of the 27th annual international symposium on Computer architecture,
pages 248–259, New York, NY, USA, 2000. ACM Press.
[ALE02] T. Austin, E. Larson, and D. Ernst. SimpleScalar: an infrastructure for computer
system modelling. IEEE Computer, 35(2):59–67, 2002.
[Ali06] Razak Mohammed Ali. DDR2 SDRAM Interfaces for Next-Gen Systems. Electronic
Engineering Times-Asia, 2006.
[AMD] AMD. AMD Athlon 64 X2 Dual-Core Processor for Desktop. http://www.amd.
com/us-en/Processors/ProductInformation/0,,30_118_9485_13041,00.html.
[AMD05] AMD. Software Optimization Guide for AMD64 Processors. http://www.amd.com/
gb-uk/Processors/TechnicalResources/0,,30_182_739_7203,00.html, 2005.
[ASHH88] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz. An evaluation of directory
schemes for cache coherence. SIGARCH Comput. Archit. News, 16(2):280–298, 1988.
[BBH+04] Darrell Boggs, Aravindh Baktha, Jason Hawkins, Deborah T. Marr, J. Alan Miller,
Patrice Roussel, Ronak Singhal, Bret Toll, and K.S. Venkatraman. The Microar-
chitecture of the Intel Pentium 4 Processor on 90nm Technology. Intel Technology
Journal, 8(1), 2004.
[BD06] James Balfour and William J. Dally. Design tradeoffs for tiled CMP on-chip net-
works. In ICS ’06: Proceedings of the 20th annual international conference on
Supercomputing, pages 187–198, New York, NY, USA, 2006. ACM Press.
[BDH+06] Nathan L. Binkert, Ronald G. Dreslinski, Lisa R. Hsu, Kevin T. Lim, Ali G. Saidi,
and Steven K. Reinhardt. The M5 Simulator: Modeling Networked Systems. IEEE
Micro, 26(4):52–60, 2006.
[BM02] Luca Benini and Giovanni De Micheli. Networks on Chips: A New SoC Paradigm.
Computer, 35(1):70–78, 2002.
[CFKA90] David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-
Based Cache Coherence in Large-Scale Multiprocessors. Computer, 23(6):49–58,
1990.
[CGS97] David E. Culler, Anoop Gupta, and Jaswinder Pal Singh. Parallel Computer Archi-
tecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 1997.
[Cit03] Daniel Citron. MisSPECulation: partial and misleading use of SPEC CPU2000 in
computer architecture conferences. In ISCA ’03: Proceedings of the 30th annual
international symposium on Computer architecture, pages 52–61, New York, NY,
USA, 2003. ACM Press.
[CKB03] Changkyu Kim, Doug Burger, and Stephen W. Keckler. Nonuniform cache architectures
for wire-delay dominated on-chip caches. IEEE Micro, 23(6):99–107, 2003.
[CLS05] Jason F. Cantin, Mikko H. Lipasti, and James E. Smith. Improving Multiprocessor
Performance with Coarse-Grain Coherence Tracking. In ISCA ’05: Proceedings of
the 32nd Annual International Symposium on Computer Architecture, pages 246–
257, Washington, DC, USA, 2005. IEEE Computer Society.
[Clu] Clustis2 Cluster Web Page. http://clustis2.idi.ntnu.no/.
[CMR+06] Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian,
and John B. Carter. Interconnect-Aware Coherence Protocols for Chip Multipro-
cessors. In ISCA ’06: Proceedings of the 33rd annual international symposium on
Computer Architecture, pages 339–351, Washington, DC, USA, 2006. IEEE Com-
puter Society.
[Cor07] Corsair. TWIN2X2048-6400. http://www.corsairmemory.com/corsair/
products/specs/TWIN2X2048-6400.pdf, 2007.
[CPV05] Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar. Optimizing Replication,
Communication, and Capacity Allocation in CMPs. In ISCA ’05: Proceedings of the
32nd Annual International Symposium on Computer Architecture, pages 357–368,
Washington, DC, USA, 2005. IEEE Computer Society.
[CS06] Jinchuan Chang and Gurindar S. Sohi. Cooperative Caching for Chip Multiprocessors.
33rd Annual International Symposium on Computer Architecture (ISCA’06),
2006.
[Dal06] William J. Dally. Future Directions for On-Chip Interconnection Networks - Pre-
sentation at Workshop on On- and Off-Chip Interconnection Networks for Multicore
Systems. http://www.ece.ucdavis.edu/~ocin06/talks/dally.pdf, 2006.
[DS07] Haakon Dybdahl and Per Stenström. An Adaptive Shared/Private NUCA Cache
Partitioning Scheme for Chip Multiprocessors. In HPCA ’07: Proceedings of the
13th International Symposium on High-Performance Computer Architecture, 2007.
[DT03] William Dally and Brian Towles. Principles and Practices of Interconnection Net-
works. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[EPS06] Noel Eisley, Li-Shiuan Peh, and Li Shang. In-Network Cache Coherence. In MI-
CRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on
Microarchitecture, pages 321–332, Washington, DC, USA, 2006. IEEE Computer
Society.
[GLL+98] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop
Gupta, and John Hennessy. Memory consistency and event ordering in scalable
shared-memory multiprocessors. In ISCA ’98: 25 years of the international symposia
on Computer architecture (selected papers), pages 376–387, New York, NY, USA,
1998. ACM Press.
[GMNR06] Simcha Gochman, Avi Mendelson, Alon Naveh, and Efraim Rotem. Introduction to
Intel Core Duo Processor Architecture. Intel Technology Journal, 10(2), 2006.
[HBK01] Jaehyuk Huh, Doug Burger, and Stephen W. Keckler. Exploring the Design Space
of Future CMPs. In PACT ’01: Proceedings of the 2001 International Conference
on Parallel Architectures and Compilation Techniques, pages 199–210, Washington,
DC, USA, 2001. IEEE Computer Society.
[HHM99] M. Horowitz, R. Ho, and K. Mai. The Future of Wires, 1999.
[HP03] John L. Hennessy and David A. Patterson. Computer Architecture - A quantitative
approach, Third Edition. Morgan Kaufmann Publishers, 2003.
[HP07] John L. Hennessy and David A. Patterson. Computer Architecture - A quantitative
approach, Fourth Edition. Morgan Kaufmann Publishers, 2007.
[Int06] Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual. http:
//developer.intel.com/products/processor/manuals/index.htm, 2006.
[ITR06] ITRS. International Technology Roadmap for Semiconductors. http://www.itrs.
net/, 2006.
[Jah06] Magnus Jahre. Interprocessor Communication in Chip Multiprocessors. Project
Report in TDT 4720 Computer Design and Architecture, Specialisation, 2006.
[JM95] Bruce Jacob and Trevor Mudge. Notes on Calculating Computer Performance. Tech-
nical Report 231-95, University of Michigan, March 1995.
[JN07] Magnus Jahre and Lasse Natvig. Performance Effects of a Cache Miss Handling
Architecture in a Multi-core Processor. Submitted to Norsk Informatikkonferanse
(NIK-2007), 2007.
[KAO05] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A
32-Way Multithreaded Sparc Processor. IEEE Micro, 25(2):21–29, 2005.
[KBK02] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform
cache structure for wire-delay dominated on-chip caches. In ASPLOS-X: Proceed-
ings of the 10th international conference on Architectural support for programming
languages and operating systems, pages 211–222, New York, NY, USA, 2002. ACM
Press.
[KFJ+03] Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan,
and Dean M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures: The Potential
for Processor Power Reduction. In MICRO 36: Proceedings of the 36th annual
IEEE/ACM International Symposium on Microarchitecture, page 81, Washington,
DC, USA, 2003. IEEE Computer Society.
[KJT04] Rakesh Kumar, Norman P. Jouppi, and Dean M. Tullsen. Conjoined-Core Chip
Multiprocessing. In MICRO 37: Proceedings of the 37th annual IEEE/ACM Inter-
national Symposium on Microarchitecture, pages 195–206, Washington, DC, USA,
2004. IEEE Computer Society.
[KKD+06] Nevin Kirman, Meyrem Kirman, Rajeev K. Dokania, Jose F. Martinez, Alyssa B.
Apsel, Matthew A. Watkins, and David H. Albonesi. Leveraging Optical Technology
in Future Bus-based Chip Multiprocessors. In MICRO 39: Proceedings of the 39th
Annual IEEE/ACM International Symposium on Microarchitecture, pages 492–503,
Washington, DC, USA, 2006. IEEE Computer Society.
[Kro81] David Kroft. Lockup-free instruction fetch/prefetch cache organization. In ISCA ’81:
Proceedings of the 8th annual symposium on Computer Architecture, pages 81–87,
Los Alamitos, CA, USA, 1981. IEEE Computer Society Press.
[KST04] Ron Kalla, Balaram Sinharoy, and Joel M. Tendler. IBM Power5 chip: a dual-core
multithreaded processor. IEEE Micro, 2004.
[KTJR05] Rakesh Kumar, Dean M. Tullsen, Norman P. Jouppi, and Parthasarathy Ran-
ganathan. Heterogeneous Chip Multiprocessors. Computer, 38(11):32–38, 2005.
[KTR+04] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi,
and Keith I. Farkas. Single-ISA Heterogeneous Multi-Core Architectures for Multi-
threaded Workload Performance. In ISCA ’04: Proceedings of the 31st annual in-
ternational symposium on Computer architecture, page 64, Washington, DC, USA,
2004. IEEE Computer Society.
[KZT05] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen. Interconnections in Multi-
Core Architectures: Understanding Mechanisms, Overheads and Scaling. 32nd An-
nual International Symposium on Computer Architecture (ISCA’05), 2005.
[Lan06] Arnt Jørgen Lande. Evaluering av Chip Multiprocessor simulatorer. Master Thesis,
2006.
[LLG+90] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John
Hennessy. The directory-based cache coherence protocol for the DASH multipro-
cessor. In ISCA ’90: Proceedings of the 17th annual international symposium on
Computer Architecture, pages 148–159, New York, NY, USA, 1990. ACM Press.
[LNR+06] Feihui Li, Chrysostomos Nicopoulos, Thomas Richardson, Yuan Xie, Vijaykrishnan
Narayanan, and Mahmut Kandemir. Design and Management of 3D Chip Multi-
processors Using Network-in-Memory. In ISCA ’06: Proceedings of the 33rd annual
international symposium on Computer Architecture, pages 130–141, Washington,
DC, USA, 2006. IEEE Computer Society.
[MBH+05] Michael R. Marty, Jesse D. Bingham, Mark D. Hill, Alan J. Hu, Milo M. K. Martin,
and David A. Wood. Improving Multiple-CMP Systems Using Token Coherence. In
HPCA, pages 328–339, 2005.
[McF93] Scott McFarling. Combining Branch Predictors. Technical Report TN-36, June
1993.
[McG06] Harlan McGahn. Niagara 2 Opens the Floodgates. Microprocessor Report, 2006.
[MH06] Michael R. Marty and Mark D. Hill. Coherence Ordering for Ring-based Chip
Multiprocessors. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM Inter-
national Symposium on Microarchitecture, pages 309–320, Washington, DC, USA,
2006. IEEE Computer Society.
[MHW03] Milo M. K. Martin, Mark D. Hill, and David A. Wood. Token coherence: decoupling
performance and correctness. In ISCA ’03: Proceedings of the 30th annual interna-
tional symposium on Computer architecture, pages 182–193, New York, NY, USA,
2003. ACM Press.
[Nor] Norgrid Cluster Web Page. http://norgrid.ntnu.no/.
[ONH+96] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung
Chang. The case for a single-chip multiprocessor. In ASPLOS-VII: Proceedings
of the seventh international conference on Architectural support for programming
languages and operating systems, pages 2–11, New York, NY, USA, 1996. ACM
Press.
[PHC03] E. Perelman, G. Hamerly, and B. Calder. Picking statistically valid and early sim-
ulation points, 2003.
[SA05] Lawrence Spracklen and Santosh G. Abraham. Chip Multithreading: Opportunities
and Challenges. In HPCA ’05: Proceedings of the 11th International Symposium on
High-Performance Computer Architecture, pages 248–252, Washington, DC, USA,
2005. IEEE Computer Society.
[SF91] Gurindar S. Sohi and Manoj Franklin. High-bandwidth data memory systems for
superscalar processors. In ASPLOS-IV: Proceedings of the fourth international con-
ference on Architectural support for programming languages and operating systems,
pages 53–62, New York, NY, USA, 1991. ACM Press.
[SKT+05] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner.
POWER5 System microarchitecture. IBM J. Res. Dev., 49(4/5):505–521, 2005.
[Smi81] James E. Smith. A study of branch prediction strategies. In ISCA ’81: Proceedings of
the 8th annual symposium on Computer Architecture, pages 135–148, Los Alamitos,
CA, USA, 1981. IEEE Computer Society Press.
[Smi88] James E. Smith. Characterizing Computer Performance With a Single Number.
Communications of the ACM, 31(10), October 1988.
[SPEa] SPEC CPU 2000 Web Page. http://www.spec.org/cpu2000/.
[SPEb] SPEC CPU 2006 Web Page. http://www.spec.org/cpu2006/.
[SQL] SQLite Web Page. http://www.sqlite.org/.
[Ste89] P. Stenström. A cache consistency protocol for multiprocessors with multistage
networks. SIGARCH Comput. Archit. News, 17(3):407–415, 1989.
[Ste90] Per Stenström. A Survey of Cache Coherence Schemes for Multiprocessors.
Computer, 23(6):12–24, 1990.
[Wol04] Wayne Wolf. The future of multiprocessor systems-on-chips. In DAC ’04: Proceed-
ings of the 41st annual conference on Design automation, pages 681–685, New York,
NY, USA, 2004. ACM Press.
[WOT+95] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and
Anoop Gupta. The SPLASH-2 programs: Characterization and methodological
considerations. In Proceedings of the 22nd International Symposium on Computer
Architecture, pages 24–36, Santa Margherita Ligure, Italy, 1995.
[WPM03] Hangsheng Wang, Li-Shiuan Peh, and Sharad Malik. Power-driven Design of Router
Microarchitectures in On-chip Networks. In MICRO 36: Proceedings of the 36th an-
nual IEEE/ACM International Symposium on Microarchitecture, page 105, Wash-
ington, DC, USA, 2003. IEEE Computer Society.
[ZA05] Michael Zhang and Krste Asanovic. Victim Replication: Maximizing Capacity while
Hiding Wire Delay in Tiled Chip Multiprocessors. In ISCA ’05: Proceedings of the
32nd Annual International Symposium on Computer Architecture, pages 336–345,
Washington, DC, USA, 2005. IEEE Computer Society.
Appendix A
Randomly Generated Multiprogram
Workloads
A.1 Multiprogram Workloads for 2 CPUs
The randomly generated multiprogrammed workloads used in experiments with 2-way CMPs
are shown in table A.1.
A.2 Multiprogram Workloads for 4 CPUs
The randomly generated multiprogrammed workloads used in experiments with 4-way CMPs
are shown in table A.2.
A.3 Multiprogram Workloads for 8 CPUs
The randomly generated multiprogrammed workloads used in experiments with 8-way CMPs
are shown in table A.3.
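The tables in this appendix could be produced by a procedure along the following lines. This is a sketch under stated assumptions and not the script used to create the tables: the function name and the seed are hypothetical, the benchmark list is the set of names that appear in the tables, and sampling with replacement is assumed because some workloads contain the same benchmark more than once.

```python
import random

# SPEC CPU2000 benchmark names as they appear in the workload tables
BENCHMARKS = [
    "ammp", "applu", "apsi", "art", "bzip", "crafty", "eon", "equake",
    "facerec", "fma3d", "galgel", "gap", "gcc", "gzip", "lucas", "mcf",
    "mesa", "mgrid", "parser", "perlbmk", "sixtrack", "swim", "twolf",
    "vortex1", "vpr", "wupwise",
]


def generate_workloads(num_cpus, num_workloads=40, seed=0):
    """Draw num_workloads workloads with one random benchmark per CPU."""
    rng = random.Random(seed)  # fixed seed (assumed) makes the draw repeatable
    return [
        [rng.choice(BENCHMARKS) for _ in range(num_cpus)]
        for _ in range(num_workloads)
    ]


# One workload list per CMP size used in the experiments
workloads = {cpus: generate_workloads(cpus) for cpus in (2, 4, 8)}
```

Fixing the seed means that the same workloads can be regenerated when experiments are rerun.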
Workload ID SPEC Benchmarks
1 sixtrack, gcc
2 twolf, mcf
3 twolf, twolf
4 gcc, bzip
5 equake, vpr
6 applu, mesa
7 vortex1, vortex1
8 galgel, gcc
9 art, twolf
10 crafty, gap
11 parser, gcc
12 perlbmk, ammp
13 vortex1, perlbmk
14 mgrid, gcc
15 art, eon
16 fma3d, sixtrack
17 apsi, bzip
18 ammp, apsi
19 sixtrack, apsi
20 gap, vortex1
21 vpr, parser
22 sixtrack, gcc
23 crafty, ammp
24 bzip, twolf
25 fma3d, fma3d
26 bzip, ammp
27 eon, bzip
28 mgrid, ammp
29 mesa, mgrid
30 eon, gcc
31 mgrid, mgrid
32 twolf, gzip
33 facerec, lucas
34 galgel, twolf
35 gcc, wupwise
36 mgrid, swim
37 fma3d, lucas
38 wupwise, galgel
39 perlbmk, vortex1
40 sixtrack, bzip
Table A.1: Randomly Generated Multiprogram Workloads for 2 CPUs
Workload ID SPEC Benchmarks
1 ammp, mgrid, perlbmk, parser
2 lucas, gcc, mcf, twolf
3 eon, eon, mesa, facerec
4 vortex1, ammp, equake, galgel
5 gcc, galgel, apsi, crafty
6 applu, equake, art, facerec
7 applu, gap, gcc, parser
8 gap, swim, twolf, mesa
9 sixtrack, fma3d, apsi, vortex1
10 ammp, bzip, equake, parser
11 vpr, twolf, applu, eon
12 galgel, crafty, mgrid, swim
13 twolf, fma3d, galgel, vpr
14 bzip, vpr, bzip, equake
15 galgel, crafty, vpr, swim
16 mcf, wupwise, mesa, mesa
17 applu, parser, apsi, perlbmk
18 mgrid, perlbmk, gzip, mgrid
19 mcf, sixtrack, gcc, apsi
20 ammp, gcc, art, mesa
21 perlbmk, apsi, lucas, equake
22 vpr, crafty, vpr, mcf
23 gzip, equake, mgrid, mesa
24 facerec, applu, fma3d, lucas
25 gap, applu, parser, facerec
26 mcf, apsi, twolf, ammp
27 swim, sixtrack, ammp, applu
28 art, fma3d, swim, parser
29 apsi, gcc, vortex1, twolf
30 mgrid, gzip, apsi, equake
31 mgrid, equake, vpr, eon
32 wupwise, gap, twolf, facerec
33 galgel, equake, lucas, gzip
34 facerec, gcc, facerec, apsi
35 mesa, mcf, swim, sixtrack
36 mesa, sixtrack, equake, bzip
37 mcf, gap, gcc, vortex1
38 facerec, lucas, mcf, parser
39 twolf, eon, mesa, eon
40 apsi, apsi, mcf, equake
Table A.2: Randomly Generated Multiprogram Workloads for 4 CPUs
Workload ID SPEC Benchmarks
1 gap, applu, vpr, gap, mcf, mcf, twolf, vortex1
2 galgel, mgrid, twolf, mesa, equake, equake, swim, facerec
3 ammp, mgrid, vpr, art, lucas, parser, galgel, gzip
4 mgrid, apsi, equake, eon, crafty, twolf, mcf, bzip
5 bzip, lucas, ammp, eon, perlbmk, gcc, parser, vpr
6 parser, gzip, equake, bzip, wupwise, gcc, perlbmk, mcf
7 parser, eon, gcc, swim, swim, vpr, galgel, swim
8 lucas, bzip, applu, equake, mgrid, ammp, ammp, gcc
9 ammp, gap, mesa, facerec, eon, vpr, bzip, galgel
10 parser, swim, twolf, gcc, vpr, bzip, facerec, gzip
11 crafty, vpr, sixtrack, crafty, lucas, crafty, equake, apsi
12 art, crafty, eon, vortex1, fma3d, mgrid, crafty, equake
13 twolf, vpr, mesa, fma3d, equake, sixtrack, gap, gzip
14 twolf, mesa, crafty, equake, vortex1, mgrid, swim, gap
15 eon, mgrid, mcf, perlbmk, wupwise, crafty, twolf, swim
16 crafty, bzip, applu, apsi, gzip, galgel, equake, perlbmk
17 gzip, apsi, bzip, mgrid, gap, art, art, bzip
18 eon, equake, vortex1, art, gcc, apsi, facerec, gzip
19 eon, mesa, vortex1, eon, gcc, lucas, equake, galgel
20 apsi, bzip, galgel, ammp, art, galgel, ammp, sixtrack
21 parser, parser, gap, gap, ammp, applu, vortex1, art
22 crafty, swim, twolf, galgel, swim, twolf, twolf, parser
23 vpr, vortex1, parser, twolf, eon, equake, gzip, fma3d
24 vortex1, galgel, ammp, parser, bzip, vpr, mesa, ammp
25 twolf, facerec, perlbmk, gzip, vpr, vortex1, wupwise, eon
26 gap, sixtrack, eon, applu, swim, perlbmk, vpr, apsi
27 gap, gap, gap, twolf, mcf, gap, lucas, bzip
28 vpr, vpr, twolf, mesa, gap, bzip, gzip, sixtrack
29 swim, equake, swim, wupwise, fma3d, sixtrack, lucas, vortex1
30 wupwise, vortex1, gap, vpr, fma3d, vortex1, art, mgrid
31 applu, perlbmk, applu, galgel, crafty, wupwise, gap, ammp
32 swim, bzip, swim, apsi, vpr, gcc, twolf, twolf
33 swim, galgel, eon, gap, lucas, ammp, equake, apsi
34 vpr, twolf, apsi, vpr, mesa, applu, mgrid, fma3d
35 vortex1, perlbmk, mesa, eon, lucas, equake, mesa, equake
36 gap, eon, mgrid, gcc, parser, mesa, swim, bzip
37 equake, mcf, galgel, crafty, bzip, ammp, vortex1, crafty
38 facerec, wupwise, vpr, eon, sixtrack, bzip, perlbmk, art
39 gzip, crafty, crafty, wupwise, gap, gap, eon, art
40 vpr, mcf, mgrid, equake, galgel, mcf, facerec, gzip
Table A.3: Randomly Generated Multiprogram Workloads for 8 CPUs
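Several workloads in tables A.1 to A.3 contain the same benchmark more than once (workload 3 in Table A.1 is twolf, twolf), which is consistent with drawing every slot independently from the benchmark list. The generator itself is not listed here, so the following is only a minimal sketch under that assumption: the function name generateWorkloads, the seed argument and the choice of std::mt19937 are illustrative and will not reproduce the exact workloads in the tables.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Benchmark names as they appear in tables A.1 to A.3.
static const std::vector<std::string> kSpecBenchmarks = {
    "ammp", "applu", "apsi", "art", "bzip", "crafty", "eon", "equake",
    "facerec", "fma3d", "galgel", "gap", "gcc", "gzip", "lucas", "mcf",
    "mesa", "mgrid", "parser", "perlbmk", "sixtrack", "swim", "twolf",
    "vortex1", "vpr", "wupwise"};

// Hypothetical helper (not from the thesis): draws `count` workloads of
// `cpus` benchmarks each, sampling uniformly with replacement so that a
// benchmark may occur several times in one workload.
std::vector<std::vector<std::string>>
generateWorkloads(std::size_t count, std::size_t cpus, unsigned seed)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(
        0, kSpecBenchmarks.size() - 1);

    std::vector<std::vector<std::string>> workloads;
    for (std::size_t w = 0; w < count; ++w) {
        std::vector<std::string> workload;
        for (std::size_t c = 0; c < cpus; ++c) {
            workload.push_back(kSpecBenchmarks[pick(rng)]);
        }
        workloads.push_back(workload);
    }
    return workloads;
}
```

Calling generateWorkloads(40, 2, seed), generateWorkloads(40, 4, seed) and generateWorkloads(40, 8, seed) would give three sets with the same shape as the tables above. Fixing the seed keeps the sets repeatable on a given standard library.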
Appendix B
Mail Correspondence with M5
Development Team
B.1 First Bug Report
From: magnus.jahre@idi.ntnu.no
To: m5-users@m5sim.org
Subject: Deadlock with Splash benchmarks in SE mode in version 1.1
Date: 27. May 2007
Hi,
I’m using the precompiled Splash benchmarks for M5 version 1.1 and the SE mode. The problem
is that many of these benchmarks eventually deadlock. I have traced the problem to the
synchronization functions in alpha_tru64_process.cc. Here, all processors end up calling the
m5_cond_waitFunc() method and they all suspend. The really bad thing is that if you simulate
for a fixed number of clock cycles, it is very difficult to see that something has gone wrong.
I’ve added the following code to the end of the m5_cond_waitFunc() method to detect the
problem:
if(process->waitList.size() == process->numCpus()){
    fatal("We have a deadlock");
}
Have you seen this problem before?
I think it might be a problem with the thread library. However, I have not been able to look
into this as I do not have a working cross compiler. Have you been able to build a cross compiler
for the Splash benchmarks and the thread libraries?
Regards,
Magnus Jahre
B.2 Reply from Steve Reinhardt
From: stever@eecs.umich.edu
To: m5-users@m5sim.org
Subject: Re: [m5-users] Deadlock with Splash benchmarks in SE mode in version 1.1
Date: 28. May 2007
Hi Magnus,
What inputs are you using for the benchmarks? I don’t recall this happening, but they haven’t
been tested extensively, so it wouldn’t be a huge surprise if this happened when you use different
input sizes or numbers of processors. The Tru64 pthreads library (like the Linux pthreads library)
has a “manager” thread in addition to the application threads. In order to keep the m5 support
manageable, we don’t create a manager thread (because then you’d have N+1 threads trying
to run on only N CPUs). If the application ever gets into the situation where all the worker
threads are waiting on the manager thread then you’re hosed. We don’t see that happen under
SPLASH (since they’re pretty simple in their use of threads) but it could be you’ve run into a
situation where that’s happening.
I don’t believe it’s possible to build a cross-compiler for Tru64 Alpha binaries; at least we’ve
never been able to. Thus unless you have a native Tru64 Alpha machine you can’t really generate
new binaries for the existing SPLASH support.
Unfortunately supporting Linux pthreads in SE mode is even harder than under Tru64 (and the
situation with Tru64 is that my goal was to support pthreads but I gave up partway through,
which is why it’s kind of a mess). This question comes up a lot, so if you’re interested you can
probably find more detailed answers in the mailing list archive.
Actually this question comes up often enough that I’m just going to create a wiki page for it:
http://www.m5sim.org/wiki/index.php/Splash benchmarks
The bottom line is that you’re probably best off running with Linux pthreads in FS mode.
Steve
B.3 Elaboration on First Bug Report
From: magnus.jahre@idi.ntnu.no
To: m5-users@m5sim.org
Subject: Deadlock with Splash benchmarks in SE mode in version 1.1
Date: 28. May 2007
Hi,
Thanks for the quick reply!
First, I agree that switching to FS mode is the better option. I will get FS mode up and running
as soon as I can find the time :-)
I’ll elaborate a bit on the deadlock problem with Splash in SE mode. I’m using the standard
inputs except for with LUNoncontig where I have increased the problem size to 512. The Ocean
benchmark is the one that is causing the most trouble. With my extensions (different L1-L2
interconnects and directory coherence) it simulates less than 1 million instructions before all
threads suspend. This happens for 2, 4 and 8 cores. To make sure that it is not a problem with
my code, I ran a test with vanilla M5, 8 cores and the MSI, MESI or MOESI coherence protocols.
In this case, the problem arises for Barnes, both Ocean benchmarks, both LU benchmarks,
WaterSpatial and WaterNSquared with all coherence protocols. Furthermore, FMM has the
problem with the MSI protocol and Radix with the MOESI protocol.
If we assume that all threads are waiting for the scheduler thread, a possible dirty hack would
be to start up the thread at the head of the wait queue. Then, it should be possible to get at
least some results. Of course, it is far from ideal and might create more problems than it solves.
What do you think?
In addition, I notice that some benchmarks call the nxm_thread_blockFunc, nxm_blockFunc and
nxm_unblockFunc. However, these methods only print their arguments to stdout and return 0.
What should these methods do?
Regards,
Magnus
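The deadlock check from the first bug report and the workaround proposed in the mail above can be illustrated outside the simulator. The sketch below is a self-contained model and not M5 code: ProcessModel, suspend, allThreadsWaiting and wakeHeadOfWaitQueue are hypothetical names, and only the condition that the wait list holds as many entries as there are CPUs is taken from the correspondence.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for the simulated process; this is not an M5 class.
struct ProcessModel {
    std::size_t numCpus;       // number of simulated CPUs
    std::deque<int> waitList;  // IDs of suspended threads, oldest first
};

// Records that a thread has suspended in a condition wait.
void suspend(ProcessModel& p, int threadID)
{
    p.waitList.push_back(threadID);
}

// Same condition as in the bug report: every CPU is waiting.
bool allThreadsWaiting(const ProcessModel& p)
{
    return p.waitList.size() == p.numCpus;
}

// The workaround discussed in the mail: resume the thread at the head of
// the wait queue. Returns the woken thread ID, or -1 if nothing is waiting.
int wakeHeadOfWaitQueue(ProcessModel& p)
{
    if (p.waitList.empty()) return -1;
    int woken = p.waitList.front();
    p.waitList.pop_front();
    return woken;
}
```

As the mail itself points out, waking a thread in this way may create more problems than it solves. The model only shows where such a check and fallback would sit.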
Appendix C
Simulator Extension Code
C.1 Interconnect Extension Code
C.1.1 Interconnect Header File
#ifndef INTERCONNECT_HH
#define INTERCONNECT_HH

#include <iostream>
#include <vector>
#include <queue>
#include <fstream>
#include <list> // for std::list, used by the isSorted() declarations
#include <map>  // for std::map, used by the ID translation maps

#include "mem/base_hier.hh"
#include "interconnect_interface.hh"
#include "interconnect_profile.hh"
#include "sim/eventq.hh"
#include "sim/stats.hh"
#include "cpu/exec_context.hh" // for ExecContext, needed for cpu_id
#include "cpu/base.hh" // for BaseCPU, needed for cpu_id

/** The maximum value of type Tick. */
#define TICK_T_MAX ULL(0x3FFFFFFFFFFFFF)

class InterconnectInterface;
class InterconnectArbitrationEvent;
class InterconnectDeliverQueueEvent;
class InterconnectProfile;

/**
 * This class is the parent class of all interconnect extensions. In other
 * words, all interconnects are sub-classes of this class.
 *
 * It has three functions:
 * - Firstly, it defines the interface which all interconnects must
 *   implement.
 * - Secondly, it takes care of registration and administration of
 *   interconnect interfaces. This functionality is common to all
 *   interconnects.
 * - In addition, it defines some classes that the interconnects need. These
 *   classes are the event objects used to create delays and two convenience
 *   classes that represent requests and deliveries.
 *
 * @author Magnus Jahre
 */
class Interconnect : public BaseHier
{
  private:
    int masterInterfaceCount;
    int slaveInterfaceCount;
    int totalInterfaceCount;

  protected:
    bool blocked;
    int waitingFor;
    Tick blockedAt;

    int cpu_count;

    InterconnectProfile* profiler;

    std::map<int, int> processorIDToInterconnectIDMap;
    std::map<int, int> interconnectIDToProcessorIDMap;
    std::map<int, int> interconnectIDToL2IDMap;

    std::vector<InterconnectInterface*> masterInterfaces;
    std::vector<InterconnectInterface*> slaveInterfaces;
    std::vector<InterconnectInterface*> allInterfaces;

    /* Statistics variables */
    Stats::Scalar<> totalArbitrationCycles;
    Stats::Scalar<> totalArbQueueCycles;
    Stats::Formula avgArbCyclesPerRequest;
    Stats::Formula avgArbQueueCyclesPerRequest;

    Stats::Scalar<> totalTransferCycles;
    Stats::Scalar<> totalTransQueueCycles;
    Stats::Formula avgTransCyclesPerRequest;
    Stats::Formula avgTransQueueCyclesPerRequest;

    Stats::Vector<> perCpuTotalTransferCycles;
    Stats::Vector<> perCpuTotalTransQueueCycles;

    Stats::Formula avgTotalDelayCyclesPerRequest;

    Stats::Scalar<> requests;
    Stats::Scalar<> arbitratedRequests;
    Stats::Scalar<> sentRequests;
    Stats::Scalar<> nullRequests;
    // Stats::Scalar<> duplicateRequests;
    Stats::Scalar<> numClearBlocked;
    Stats::Scalar<> numSetBlocked;

    /**
     * Convenience class that represents a transfer request.
     */
    class InterconnectRequest {
      public:
        Tick time;
        int fromID;

        /**
         * Default constructor
         *
         * @param _time   The tick the request was issued
         * @param _fromID The ID of the requesting interface
         */
        InterconnectRequest(Tick _time, int _fromID){
            time = _time;
            fromID = _fromID;
        }
    };

    /**
     * Convenience class that represents a granted request which is in the
     * process of being delivered.
     */
    class InterconnectDelivery {
      public:
        Tick grantTime;
        int fromID;
        int toID;
        MemReqPtr req;

        /**
         * Default constructor
         *
         * @param _grantTime The tick the request was granted access
         *                   and the delivery object created
         * @param _fromID    The ID of the requesting interface
         * @param _toID      The ID of the destination interface
         * @param _req       The MemReqPtr that will be delivered
         */
        InterconnectDelivery(Tick _grantTime,
                             int _fromID,
                             int _toID,
                             MemReqPtr& _req)
        {
            grantTime = _grantTime;
            fromID = _fromID;
            toID = _toID;
            req = _req;
        }
    };

  public:
    int clock;
    int width;
    int transferDelay;
    int arbitrationDelay;

    std::vector<InterconnectArbitrationEvent*> arbitrationEvents;
    std::vector<InterconnectDeliverQueueEvent*> deliverEvents;

  protected:
    /**
     * Checks that a list of InterconnectRequest objects is sorted
     * in ascending order according to their request times. This is
     * important because the arbitration methods usually assume that the
     * request list is sorted. It is used in assertions in the subclasses.
     *
     * @param inList The list to check
     *
     * @return True if the list is sorted
     *
     * @see InterconnectRequest
     */
    bool isSorted(std::list<InterconnectRequest*>* inList);

    /**
     * Checks that a list of InterconnectDelivery objects is sorted
     * in ascending order according to their grant times. This is
     * important because the arbitration methods usually assume that the
     * request list is sorted. It is used in assertions in the subclasses.
     *
     * @param inList The list to check
     *
     * @return True if the list is sorted
     *
     * @see InterconnectDelivery
     */
    bool isSorted(std::list<InterconnectDelivery*>* inList);

  public:
    /**
     * This is the default constructor for the Interconnect class. It stores
     * the arguments, initialises some member variables and does some
     * input checking.
     *
     * The interconnect only supports running at the same frequency as the
     * processor core, and there must be at least one CPU in the system.
     *
     * @param _name       The object name from the configuration file. This
     *                    is passed on to BaseHier and SimObject
     * @param _width      The bit width of the transmission lines in the
     *                    interconnect
     * @param _clock      The number of processor cycles in one interconnect
     *                    clock cycle.
     * @param _transDelay The end-to-end transfer delay through the
     *                    interconnect in CPU cycles
     * @param _arbDelay   The length of an arbitration in CPU cycles
     * @param _cpu_count  The number of processors in the system
     * @param _hier       Hierarchy parameters for BaseHier
     *
     */
    Interconnect(const std::string &_name,
                 int _width,
                 int _clock,
                 int _transDelay,
                 int _arbDelay,
                 int _cpu_count,
                 HierParams *_hier)
        : BaseHier(_name, _hier){

        width = _width;
        clock = _clock;
        transferDelay = _transDelay;
        arbitrationDelay = _arbDelay;
        cpu_count = _cpu_count;

        if(clock != 1){
            fatal("The interconnects are only implemented to run "
                  "at the same frequency as the CPU core");
        }

        if(cpu_count < 1){
            fatal("There must be at least one CPU in the system");
        }

        masterInterfaceCount = -1;
        slaveInterfaceCount = -1;
        totalInterfaceCount = -1;

        blocked = false;
        blockedAt = -1;
        waitingFor = -1;
    }

    ~Interconnect(){ /* does nothing */ }

    /**
     * This method registers an InterconnectProfile. The profiler is used to
     * dump selected statistics to a file at regular time intervals.
     *
     * @param _profiler The InterconnectProfile to use
     */
    void registerProfiler(InterconnectProfile* _profiler){
        profiler = _profiler;
    }

    /**
     * This method is called from the M5 statistics package and initialises
     * the statistics variables used in all interconnects.
     */
    void regStats();

    /**
     * This method is supposed to reset the statistics values. However, it
     * is not used in any set-up in this work and is not implemented.
     */
    void resetStats();

    /**
     * An InterconnectInterface must register with the interconnect to be
     * able to use it. This function is handled by this method.
     *
     * @param interface   A pointer to the interconnect interface that is
     *                    registering itself
     * @param isL2        Is true if the interface represents an L2 cache,
     *                    false otherwise
     * @param processorID The ID of the processor connected to the
     *                    interface. If the cache is not connected to any
     *                    particular processor, -1 should be supplied.
     *
     * @return The ID the interface is given
     */
    int registerInterface(InterconnectInterface* interface,
                          bool isL2,
                          int processorID);

    /**
     * This method makes all registered interconnect interfaces re-evaluate
     * which address ranges they are responsible for.
     */
    void rangeChange();

    /**
     * The cache might issue requests that are later squashed. The
     * interfaces might detect this situation when they try to retrieve the
     * current request from the cache. This situation needs to be measured
     * as it does cause empty issue slots in the interconnect.
     *
     * The InterconnectInterface calls this method when a squashed request
     * was encountered. It increments a statistic variable that is printed
     * when simulation is finished.
     */
    void incNullRequests();

    // void incDuplicateRequests();

    /**
     * To enable transfers between caches at the same level, a means of
     * translating from interconnect IDs to processor IDs is needed.
     * This information is stored in a map when the interface registers
     * itself and is retrieved through this method.
     *
     * @param processorID The processor ID to translate
     *
     * @return The interconnect ID of the data cache belonging to this
     *         processor
     */
    int getInterconnectID(int processorID);

    /**
     * This method provides statistic values to the InterconnectProfile
     * object. The values are stored in the memory pointed to by the
     * arguments, and the internal counters are reset.
     *
     * @param dataSends      Pointer to a memory area where the number of
     *                       data sends can be stored
     * @param instSends      Pointer to a memory area where the number of
     *                       instruction sends can be stored
     * @param coherenceSends Pointer to a memory area where the number of
     *                       coherence sends can be stored
     * @param totalSends     Total number of sends which is the sum of the
     *                       other request types
     *
     * @see InterconnectProfile
     */
    void getSendSample(int* dataSends,
                       int* instSends,
                       int* coherenceSends,
                       int* totalSends);

    /**
     * Convenience method that finds the ID of the slave interface that is
     * responsible for answering requests related to a given address.
     *
     * @param address The address in question
     *
     * @return The interface ID of the interface responsible for the address
     */
    int getTarget(Addr address);

    /**
     * This method is commented in the subclasses where it is implemented
     */
    virtual void request(Tick time, int fromID) = 0;

    /**
     * This method is commented in the subclasses where it is implemented
     */
    virtual void send(MemReqPtr& req, Tick time, int fromID) = 0;

    /**
     * This method is commented in the subclasses where it is implemented
     */
    virtual void arbitrate(Tick cycle) = 0;

    /**
     * This method is commented in the subclasses where it is implemented
     */
    virtual void deliver(MemReqPtr& req,
                         Tick cycle,
                         int toID,
                         int fromID) = 0;

    /**
     * This method is commented in the subclasses where it is implemented
     */
    virtual void setBlocked(int fromInterface) = 0;

    /**
     * This method is commented in the subclasses where it is implemented
     */
    virtual void clearBlocked(int fromInterface) = 0;

    /**
     * This method is commented in the subclasses where it is implemented
     */
    virtual int getChannelCount() = 0;

    /**
     * This method is commented in the subclasses where it is implemented
     */
    virtual std::vector<int> getChannelSample() = 0;

    /**
     * This method is commented in the subclasses where it is implemented
     */
    virtual void writeChannelDecriptor(std::ofstream &stream) = 0;
};

/**
 * This class creates an arbitration event that is compatible with the M5 event
 * queue. It is used by the Interconnect classes to create a time delay from
 * when a request is received until the arbitration is carried out.
 *
 * @see Interconnect
 * @see SplitTransBus
 * @see Crossbar
 * @see Butterfly
 * @see IdealInterconnect
 *
 * @author Magnus Jahre
 */
class InterconnectArbitrationEvent : public Event
{
  public:
    Interconnect *interconnect;

    /**
     * Default constructor.
     *
     * @param _interconnect A pointer to the associated interconnect
     */
    InterconnectArbitrationEvent(Interconnect *_interconnect)
        : Event(&mainEventQueue), interconnect(_interconnect)
    {
    }

    /**
     * This method is called when the event is serviced. It searches through
     * the Interconnect's arbitration tick queue to find the current clock
     * tick, removes this and calls the arbitrate method in Interconnect.
     * Then, it deletes itself.
     *
     * @see Interconnect
     */
    void process();

    /**
     * @return A textual description of the event
     */
    virtual const char *description();
};

/**
 * This class creates a deliver event that is compatible with the M5 event
 * queue. It is used by the Interconnect classes to create a time delay from
 * when a request is granted access until it is delivered.
 *
 * This event is for use in interconnects that do not use a delivery queue.
 * Interconnects that do use a delivery queue should use
 * InterconnectDeliverQueueEvent instead.
 *
 * @see InterconnectDeliverQueueEvent
 * @see Interconnect
 * @see SplitTransBus
 * @see Crossbar
 * @see Butterfly
 * @see IdealInterconnect
 *
 * @author Magnus Jahre
 */
class InterconnectDeliverEvent : public Event
{
  public:
    Interconnect *interconnect;
    MemReqPtr req;
    int toID;
    int fromID;

    /**
     * Constructs a delivery event for interconnects that do not use a
     * delivery queue.
     *
     * @param _interconnect A pointer to the interconnect that created the
     *                      event
     * @param _req          The request to deliver
     * @param _toID         The interface ID the request will be delivered to
     * @param _fromID       The interface ID the request was sent from
     */
    InterconnectDeliverEvent(Interconnect *_interconnect,
                             MemReqPtr& _req,
                             int _toID,
                             int _fromID)
        : Event(&mainEventQueue)
    {
        interconnect = _interconnect;
        req = _req;
        toID = _toID;
        fromID = _fromID;
    }

    /**
     * This method is called when the event is serviced and calls the
     * deliver method in an Interconnect class. Afterwards, it deletes
     * itself.
     *
     * @see Interconnect
     */
    void process();

    /**
     * @return A textual description of the event
     */
    virtual const char *description();
};

/**
 * This class creates a deliver event that is compatible with the M5 event
 * queue. It is used by the Interconnect classes to create a time delay from
 * when a request is granted access until it is delivered.
 *
 * This event is for use in interconnects that use a delivery queue.
 * Interconnects without a delivery queue should use
 * InterconnectDeliverEvent instead.
 *
 * @see InterconnectDeliverEvent
 * @see Interconnect
 * @see SplitTransBus
 * @see Crossbar
 * @see Butterfly
 * @see IdealInterconnect
 *
 * @author Magnus Jahre
 */
class InterconnectDeliverQueueEvent : public Event
{
  public:
    Interconnect *interconnect;

    /**
     * Constructs a delivery event for interconnects that use a delivery
     * queue.
     *
     * @param _interconnect A pointer to the interconnect that created the
     *                      event
     */
    InterconnectDeliverQueueEvent(Interconnect *_interconnect)
        : Event(&mainEventQueue) {
        interconnect = _interconnect;
    }

    /**
     * This method is called when the event is serviced. First, it removes
     * itself from the delivery queue. Then it calls the deliver method in an
     * Interconnect subclass.
     *
     * Only the tick argument to deliver is provided when this method is
     * serviced. The memory request provided is NULL and the from and to IDs
     * are set to -1.
     *
     * @see Interconnect
     */
    void process(){

        // Find this event's position in the interconnect's delivery queue
        bool found = false;
        int foundIndex = -1;
        for(int i=0; i<interconnect->deliverEvents.size(); i++){
            if((InterconnectDeliverQueueEvent*) interconnect->deliverEvents[i]
                == this){
                foundIndex = i;
                found = true;
            }
        }
        assert(found);
        interconnect->deliverEvents.erase(
            interconnect->deliverEvents.begin() + foundIndex);

        MemReqPtr noReq = NULL;
        interconnect->deliver(noReq, this->when(), -1, -1);
        delete this;
    }

    /**
     * @return A textual description of the event
     */
    virtual const char *description(){
        return "InterconnectDeliverQueueEvent";
    }
};

#endif // INTERCONNECT_HH
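The sortedness check that the isSorted() declarations in the header are documented to perform can be sketched as follows. This is a minimal illustration and not the thesis implementation: RequestSketch and isSortedSketch are hypothetical names, Tick is assumed to be a 64-bit unsigned integer, and "ascending" is interpreted as non-decreasing so that two requests issued in the same tick are accepted.

```cpp
#include <cassert>
#include <cstdint>
#include <list>

typedef std::uint64_t Tick;  // assumption: Tick is a 64-bit unsigned integer

// Reduced, hypothetical stand-in for InterconnectRequest.
struct RequestSketch {
    Tick time;
    int fromID;

    // A member initializer list keeps parameters and members apart.
    RequestSketch(Tick _time, int _fromID) : time(_time), fromID(_fromID) {}
};

// Returns true if the request times are in ascending (non-decreasing) order.
// An empty list counts as sorted.
bool isSortedSketch(const std::list<RequestSketch*>& inList)
{
    if (inList.empty()) return true;

    std::list<RequestSketch*>::const_iterator it = inList.begin();
    Tick previous = (*it)->time;
    for (++it; it != inList.end(); ++it) {
        if ((*it)->time < previous) return false;
        previous = (*it)->time;
    }
    return true;
}
```

A check of this kind is cheap enough to place inside assert(), which matches the stated use in assertions in the subclasses.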
C.1.2 Interconnect Code File
#include "interconnect.hh"
#include "sim/builder.hh"
#include "mem/base_hier.hh"

using namespace std;

void
Interconnect::regStats(){

    using namespace Stats;

    /* Arbitration */
    totalArbitrationCycles
        .name(name() + ".total_arbitration_cycles")
        .desc("total number of arbitration cycles for all requests")
        ;

    avgArbCyclesPerRequest
        .name(name() + ".avg_arbitration_cycles_per_req")
        .desc("average number of arbitration cycles per requests")
        ;

    avgArbCyclesPerRequest = totalArbitrationCycles / arbitratedRequests;

    totalArbQueueCycles
        .name(name() + ".total_arbitration_queue_cycles")
        .desc("total number of cycles in the arbitration queue "
              "for all requests")
        ;

    avgArbQueueCyclesPerRequest
        .name(name() + ".avg_arbitration_queue_cycles_per_req")
        .desc("average number of arbitration queue cycles per requests")
        ;

    avgArbQueueCyclesPerRequest = totalArbQueueCycles / arbitratedRequests;

    /* Transfer */
    totalTransferCycles
        .name(name() + ".total_transfer_cycles")
        .desc("total number of transfer cycles for all requests")
        ;

    avgTransCyclesPerRequest
        .name(name() + ".avg_transfer_cycles_per_request")
        .desc("average number of transfer cycles per requests")
        ;

    avgTransCyclesPerRequest = totalTransferCycles / sentRequests;

    totalTransQueueCycles
        .name(name() + ".total_transfer_queue_cycles")
        .desc("total number of transfer queue cycles for all requests")
        ;

    avgTransQueueCyclesPerRequest
        .name(name() + ".avg_transfer_queue_cycles_per_request")
        .desc("average number of transfer queue cycles per request")
        ;

    avgTransQueueCyclesPerRequest = totalTransQueueCycles / sentRequests;

    perCpuTotalTransferCycles
        .init(cpu_count)
        .name(name() + ".per_cpu_total_transfer_cycles")
        .desc("total number of transfer cycles per cpu for all requests")
        .flags(total)
        ;

    perCpuTotalTransQueueCycles
        .init(cpu_count)
        .name(name() + ".per_cpu_total_transfer_queue_cycles")
        .desc("total number of cycles in the transfer queue per cpu "
              "for all requests")
        .flags(total)
        ;

    /* Other statistics */
    avgTotalDelayCyclesPerRequest
        .name(name() + ".avg_total_delay_cycles_per_request")
        .desc("average number of delay cycles per request")
        ;

    avgTotalDelayCyclesPerRequest = avgArbCyclesPerRequest
                                    + avgArbQueueCyclesPerRequest
                                    + avgTransCyclesPerRequest
                                    + avgTransQueueCyclesPerRequest;

    requests
        .name(name() + ".requests")
        .desc("total number of requests")
        ;

    arbitratedRequests
        .name(name() + ".arbitrated_requests")
        .desc("total number of requests that reached arbitration")
        ;

    sentRequests
        .name(name() + ".sent_requests")
        .desc("total number of requests that are actually sent")
        ;

    nullRequests
        .name(name() + ".null_requests")
        .desc("total number of null requests")
        ;

    // duplicateRequests
    //     .name(name() + ".duplicate_requests")
    //     .desc("total number of duplicate requests")
    //     ;

    numSetBlocked
        .name(name() + ".num_set_blocked")
        .desc("the number of times the interconnect has been blocked")
        ;

    numClearBlocked
        .name(name() + ".num_clear_blocked")
        .desc("the number of times the interconnect has been cleared")
        ;
}

void
Interconnect::resetStats(){
    /* seems like function is not needed as measurements only are taken in the
     * second phase when using fast forwarding. Consequently, it is not
     * implemented.
     */
}

int
Interconnect::registerInterface(InterconnectInterface* interface,
                                bool isL2,
                                int processorID){

    ++totalInterfaceCount;
    allInterfaces.push_back(interface);
    assert(totalInterfaceCount == (allInterfaces.size()-1));

    if(isL2){
        // This is a slave interface (i.e. interface to a L2 bank)
        ++slaveInterfaceCount;
        slaveInterfaces.push_back(interface);
        assert(slaveInterfaceCount == (slaveInterfaces.size()-1));
    }
    else{
        // This is a master interface (i.e. interface to a L1 cache)
        ++masterInterfaceCount;
        masterInterfaces.push_back(interface);
        assert(masterInterfaceCount == (masterInterfaces.size()-1));
    }

    if(processorID != -1){
        assert(processorID >= 0);
        processorIDToInterconnectIDMap.insert(
            make_pair(processorID, totalInterfaceCount));
        interconnectIDToProcessorIDMap.insert(
            make_pair(totalInterfaceCount, processorID));
    }
    else{
        assert(isL2);
        interconnectIDToL2IDMap.insert(
            make_pair(totalInterfaceCount, slaveInterfaceCount));
    }

    return totalInterfaceCount;
}

void
Interconnect::rangeChange(){
    for(int i=0;i<allInterfaces.size();++i){
        list<Range<Addr> > range_list;
        allInterfaces[i]->getRange(range_list);
    }
}

void
Interconnect::incNullRequests(){
    nullRequests++;
}

// void
// Interconnect::incDuplicateRequests(){
//     duplicateRequests++;
// }

int
Interconnect::getInterconnectID(int processorID){
    if(processorID == -1) return -1;

    map<int,int>::iterator tmp =
        processorIDToInterconnectIDMap.find(processorID);

    // make sure at least one result is returned
    assert(tmp != processorIDToInterconnectIDMap.end());

    return tmp->second;
}

void
Interconnect::getSendSample(int* dataSends,
                            int* instSends,
                            int* coherenceSends,
                            int* totalSends){

    assert(*dataSends == 0);
    assert(*instSends == 0);
    assert(*coherenceSends == 0);
    assert(*totalSends == 0);

    int tmpSends, tmpInsts, tmpCoh, tmpTotal;
    for(int i=0;i<allInterfaces.size();i++){
        allInterfaces[i]->getSendSample(&tmpSends,
                                        &tmpInsts,
                                        &tmpCoh,
                                        &tmpTotal);
        *dataSends += tmpSends;
        *instSends += tmpInsts;
        *coherenceSends += tmpCoh;
        *totalSends += tmpTotal;
    }
}

int
Interconnect::getTarget(Addr address){
    int toID = -1;
    int hitCount = 0;

    for(int i=0;i<allInterfaces.size();i++){
        if(allInterfaces[i]->isMaster()) continue;

        if(allInterfaces[i]->inRange(address)){
            toID = i;
            hitCount++;
        }
    }

    if(hitCount == 0) fatal("No supplier for address in interconnect");
    if(hitCount > 1) fatal("More than one supplier for address in interconnect");

    return toID;
}

bool
Interconnect::isSorted(list<InterconnectDelivery*>* inList){
    InterconnectDelivery* prev = NULL;
    bool first = true;
    bool nonSeqDataExists = false;

    for(list<InterconnectDelivery*>::iterator i=inList->begin();
        i!=inList->end();
        i++){
        if(first){
            first = false;
            prev = *i;
            continue;
        }

        if(prev->grantTime > (*i)->grantTime) nonSeqDataExists = true;
        prev = *i;
    }

    return !nonSeqDataExists;
}

bool
Interconnect::isSorted(list<InterconnectRequest*>* inList){
    InterconnectRequest* prev = NULL;
    bool first = true;
    bool nonSeqDataExists = false;

    for(list<InterconnectRequest*>::iterator i=inList->begin();
        i!=inList->end();
        i++){
        if(first){
            first = false;
            prev = *i;
            continue;
        }

        if(prev->time > (*i)->time) nonSeqDataExists = true;
        prev = *i;
    }

    return !nonSeqDataExists;
}

void
InterconnectArbitrationEvent::process(){

    int foundIndex = -1;
    int eventHitCount = 0;
    for(int i=0;i<interconnect->arbitrationEvents.size();++i){
        if(interconnect->arbitrationEvents[i] == this){
            foundIndex = i;
            eventHitCount++;
        }
    }
    assert(foundIndex >= 0);
    assert(eventHitCount == 1);
    interconnect->arbitrationEvents.erase(
            interconnect->arbitrationEvents.begin()+foundIndex);

    interconnect->arbitrate(this->when());
    delete this;
}

const char*
InterconnectArbitrationEvent::description(){
    return "Interconnect arbitration event";
}

void
InterconnectDeliverEvent::process(){
    interconnect->deliver(this->req, this->when(), this->toID, this->fromID);
    delete this;
}

const char*
InterconnectDeliverEvent::description(){
    return "Interconnect deliver event";
}

#ifndef DOXYGEN_SHOULD_SKIP_THIS

DEFINE_SIM_OBJECT_CLASS_NAME("Interconnect", Interconnect);

#endif
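
The two isSorted overloads in the listing above differ only in the field that is compared (time versus grantTime). As a minimal sketch, and not the code used in the thesis, the check can be written once and parameterised on the key. The two structs below are hypothetical stand-ins that only model the compared field, the name isSortedBy is an assumption, and the sketch requires C++11.

```cpp
#include <algorithm>
#include <cassert>
#include <list>

// Hypothetical stand-ins for the simulator classes; only the field used
// for ordering is modelled here.
struct InterconnectRequest  { long long time; };
struct InterconnectDelivery { long long grantTime; };

// Returns true when no element has a smaller key than its predecessor.
template <typename T, typename Key>
bool isSortedBy(const std::list<T*>& inList, Key key){
    // adjacent_find returns the first neighbouring pair for which the
    // predicate holds, i.e. the first out-of-order pair
    return std::adjacent_find(inList.begin(), inList.end(),
               [&](const T* prev, const T* cur){
                   return key(prev) > key(cur);
               }) == inList.end();
}

inline bool isSorted(const std::list<InterconnectRequest*>& inList){
    return isSortedBy(inList,
                      [](const InterconnectRequest* r){ return r->time; });
}

inline bool isSorted(const std::list<InterconnectDelivery*>& inList){
    return isSortedBy(inList,
                      [](const InterconnectDelivery* d){ return d->grantTime; });
}

// Debug helper in the spirit of checkIfSorted: abort if the order is broken.
inline void checkIfSorted(const std::list<InterconnectRequest*>& inList){
    assert(isSorted(inList));
}
```

Equal keys count as sorted, which matches the strict comparison in the listing; an empty or single-element list is trivially sorted.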
C.1.3 Split Transaction Bus Header File
#ifndef __SPLIT_TRANS_BUS_HH__
#define __SPLIT_TRANS_BUS_HH__

#include <iostream>
#include <vector>
#include <queue>

#include "interconnect.hh"

#define DEBUG_SPLIT_TRANS_BUS

/**
 * This class implements a Split Transaction Bus interconnect. Here, all
 * interfaces are connected to one transmission channel. After arbitration,
 * a request is granted both the address bus and the data bus.
 *
 * Two bus types have been implemented. One version has pipelined arbitration
 * and pipelined transfer while the other one is not pipelined. The pipelined
 * version is not realistic as it assumes that a request can be injected into
 * the data bus from any interface each clock cycle.
 *
 * @author Magnus Jahre
 */
class SplitTransBus : public Interconnect
{

    private:

        std::list<InterconnectRequest* > requestQueue;
        std::list<InterconnectDelivery* > deliverQueue;

        /* in a pipelined, bi-directional bus we can issue one request
           in each direction each clock cycle */
        std::list<InterconnectRequest* >* slaveRequestQueue;

        bool pipelined;

        void addToList(std::list<InterconnectRequest*>* inList,
                       InterconnectRequest* icReq);

        typedef enum{STB_MASTER, STB_SLAVE, STB_NOT_PIPELINED} grant_type;

        void grantInterface(grant_type gt, Tick cycle);

        void scheduleArbitrationEvent(Tick possibleArbCycle);

        void scheduleDeliverEvent(Tick possibleArbCycle);

        bool doProfile;
        int useCycleSample;

#ifdef DEBUG_SPLIT_TRANS_BUS
        void checkIfSorted(std::list<InterconnectRequest*>* inList);
        void printRequestQueue();
        void printDeliverQueue();
#endif //DEBUG_SPLIT_TRANS_BUS

    public:

        /**
         * This constructor creates a split transaction bus object. If the bus
         * is not pipelined, the arbitration delay must be longer than or equal
         * to the transfer delay. The reason is that the arbitration method
         * assumes that the previous bus transfer has finished when an
         * arbitration operation finishes.
         *
         * @param _name       The name provided in the config file
         * @param _width      The bus width in bytes
         * @param _clock      The number of processor clock cycles in one bus
         *                    cycle
         * @param _transDelay The number of bus cycles one transfer takes
         * @param _arbDelay   The number of bus cycles one arbitration takes
         * @param _cpu_count  The number of processors in the system
         * @param _pipelined  True if arbitration and transfer are pipelined
         * @param _hier       Hierarchy params for BaseHier
         */
        SplitTransBus(const std::string &_name,
                      int _width,
                      int _clock,
                      int _transDelay,
                      int _arbDelay,
                      int _cpu_count,
                      bool _pipelined,
                      HierParams *_hier)
            : Interconnect(_name,
                           _width,
                           _clock,
                           _transDelay,
                           _arbDelay,
                           _cpu_count,
                           _hier){

            pipelined = _pipelined;

            if(arbitrationDelay < transferDelay && !pipelined){
                fatal("This bus implementation requires the arbitration "
                      "delay to be longer than or equal to the transfer "
                      "delay");
            }

            doProfile = false;
            useCycleSample = 0;

            if(pipelined){
                slaveRequestQueue = new std::list<InterconnectRequest*>;
            }
        }

        /**
         * Default destructor. Deletes the dynamically allocated
         * slaveRequestQueue if the bus is pipelined.
         */
        ~SplitTransBus(){
            if(pipelined){
                delete slaveRequestQueue;
            }
        }

        /**
         * This method is called when an interface needs to use the bus. It
         * adds the request to a queue and schedules an arbitration event. If
         * the bus is pipelined, there are two request queues. The reason is
         * that there are two buses in this case. One runs from the slave
         * interfaces to the master interfaces and one in the opposite
         * direction.
         *
         * @param time   The clock cycle in which the request is made
         * @param fromID The ID of the interface requesting access
         */
        void request(Tick time, int fromID);

        /**
         * This method creates an InterconnectDelivery object based on
         * the arguments given. Then, an InterconnectDeliverQueueEvent is
         * scheduled after the specified transmission delay. The request
         * queue(s) are kept sorted in ascending order with the oldest
         * request first.
         *
         * @param req    The memory request
         * @param time   The clock cycle the method is called in
         * @param fromID The ID of the interface sending the request
         *
         * @see InterconnectDelivery
         * @see InterconnectDeliverQueueEvent
         */
        void send(MemReqPtr& req, Tick time, int fromID);

        /**
         * This method is called when an arbitration event is serviced. Each
         * time it is called, it issues at least one request. In the
         * non-pipelined version the oldest request is granted access.
         * In the pipelined version, the oldest master request and the oldest
         * slave request are granted access.
         *
         * The method assumes that the request queues are sorted.
         *
         * @param cycle The clock cycle the arbitration method is called
         */
        void arbitrate(Tick cycle);

        /**
         * This method is called when an InterconnectDeliverQueueEvent is
         * serviced. It delivers one request each time it is called. If the
         * cache does not block and there are more requests that need to be
         * delivered, it checks whether a delivery event has been registered.
         * This is needed because there might be requests waiting from an
         * earlier cache blocking.
         *
         * @param req    The request to deliver
         * @param cycle  The clock tick the method was called
         * @param toID   The interface ID of the destination interface
         * @param fromID The interface ID of the sender interface
         */
        void deliver(MemReqPtr& req, Tick cycle, int toID, int fromID);

        /**
         * This method is called if one of the slave cache banks blocks. Then,
         * it removes all arbitration and deliver events. Requests that arrive
         * while a cache bank is blocked are simply queued.
         *
         * @param fromInterface The interface that is blocked
         */
        void setBlocked(int fromInterface);

        /**
         * This method is called when a slave cache can receive requests again.
         *
         * @param fromInterface The interface that is no longer blocked
         */
        void clearBlocked(int fromInterface);

        /**
         * This method is called from the InterconnectProfile class when it
         * needs to know how many transmission channels the bus has.
         *
         * @return 1 since the bus only has one channel
         *
         * @see InterconnectProfile
         */
        int getChannelCount(){
            return 1;
        }

        /**
         * This method is called from the InterconnectProfile class and returns
         * the number of clock cycles the bus was in use since the last time it
         * was called.
         *
         * @return The number of clock cycles the bus was used since the last
         *         time the method was called.
         *
         * @see InterconnectProfile
         */
        std::vector<int> getChannelSample();

        /**
         * This method writes a description of the transmission channels used
         * to the provided stream.
         *
         * @param stream The stream to write to
         */
        void writeChannelDecriptor(std::ofstream &stream){
            stream << "0: The shared bus\n";
        }
};

#endif // __SPLIT_TRANS_BUS_HH__
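
The header comments state that the request queue(s) are kept sorted with the oldest request first. The sketch below illustrates one way to maintain that invariant so that requests arriving in the same cycle keep their arrival order. It is a simplified illustration under stated assumptions, not the thesis implementation: the Request struct, the value-based list and popOldest are hypothetical, and Tick is assumed to be a signed 64-bit integer.

```cpp
#include <cassert>
#include <list>

typedef long long Tick;   // assumption: a signed 64-bit tick counter

// Hypothetical, simplified request record
struct Request { Tick time; int fromID; };

// Insert req so that the queue stays sorted by time. A request with the
// same time as existing entries is placed after them, so earlier arrivals
// keep priority within a cycle.
inline void addToList(std::list<Request>& queue, const Request& req){
    std::list<Request>::iterator pos = queue.begin();
    while(pos != queue.end() && pos->time <= req.time) ++pos;
    queue.insert(pos, req);
}

// Remove and return the oldest request; the queue must not be empty.
inline Request popOldest(std::list<Request>& queue){
    assert(!queue.empty());
    Request oldest = queue.front();
    queue.pop_front();
    return oldest;
}
```

Because the insert position is found with a linear scan, insertion costs O(n) in the queue length; this is usually acceptable when only a handful of requests are outstanding at a time.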
C.1.4 Split Transaction Bus Code File
#include "sim/builder.hh"
#include "split_trans_bus.hh"

using namespace std;

void
SplitTransBus::request(Tick time, int fromID){

    requests++;

    assert(fromID >= 0);

    // keep linked list of requests sorted at all times
    // first request takes priority over later requests at same cycle
    InterconnectRequest* newReq = new InterconnectRequest(time, fromID);

    if(pipelined){
        if(allInterfaces[fromID]->isMaster()){
            addToList(&requestQueue, newReq);
        }
        else{
            addToList(slaveRequestQueue, newReq);
        }
    }
    else{
        addToList(&requestQueue, newReq);
    }

#ifdef DEBUG_SPLIT_TRANS_BUS
    checkIfSorted(&requestQueue);
    if(pipelined) checkIfSorted(slaveRequestQueue);
#endif //DEBUG_SPLIT_TRANS_BUS

    if(!blocked){
        if(!pipelined){
            if(arbitrationEvents.empty()){
                scheduleArbitrationEvent(time + arbitrationDelay);
            }
            else{
                Tick nextArbCycle = TICK_T_MAX;
                int hitIndex = -1;
                for(int i=0;i<arbitrationEvents.size();i++){
                    if(arbitrationEvents[i]->when() < nextArbCycle){
                        nextArbCycle = arbitrationEvents[i]->when();
                        hitIndex = i;
                    }
                }
                assert(nextArbCycle < TICK_T_MAX);

                if(nextArbCycle > (time + arbitrationDelay)){
                    /* the arbitration events are out of synch */
                    for(int i=0;i<arbitrationEvents.size();i++){
                        if(arbitrationEvents[i]->scheduled()){
                            arbitrationEvents[i]->deschedule();
                        }
                        delete arbitrationEvents[i];
                    }
                    arbitrationEvents.clear();

                    scheduleArbitrationEvent(time + arbitrationDelay);
                }
            }
        }
        else{
            scheduleArbitrationEvent(time + arbitrationDelay);
        }
    }
}

void
SplitTransBus::addToList(std::list<InterconnectRequest*>* inList,
                         InterconnectRequest* icReq){

    list<InterconnectRequest*>::iterator findPos;
    for(findPos=inList->begin();
        findPos!=inList->end();
        findPos++){
        InterconnectRequest* tempReq = *findPos;
        if(icReq->time < tempReq->time) break;
    }

    inList->insert(findPos, icReq);
}

void
SplitTransBus::scheduleArbitrationEvent(Tick possibleArbCycle){

    assert(!blocked);

    bool addArbCycle = true;
    for(int i=0;i<arbitrationEvents.size();++i){
        if(arbitrationEvents[i]->when() == possibleArbCycle){
            addArbCycle = false;
        }
    }

    if(addArbCycle){
        InterconnectArbitrationEvent* event =
            new InterconnectArbitrationEvent(this);
        event->schedule(possibleArbCycle);
        arbitrationEvents.push_back(event);
    }
}

void
SplitTransBus::arbitrate(Tick cycle){

    assert(!blocked);

    if(pipelined) assert(!requestQueue.empty() || !slaveRequestQueue->empty());
    else assert(!requestQueue.empty());

    if(pipelined){
        grantInterface(STB_SLAVE, cycle);
        grantInterface(STB_MASTER, cycle);
    }
    else{
        grantInterface(STB_NOT_PIPELINED, cycle);
    }

    if(!requestQueue.empty()){
        Tick nextReqTime = requestQueue.front()->time;

        if(pipelined){
            if(nextReqTime <= (cycle - arbitrationDelay)){
                scheduleArbitrationEvent(cycle + 1);
            }
            else{
                scheduleArbitrationEvent(nextReqTime + arbitrationDelay);
            }
        }
        else{
            if(nextReqTime <= cycle){
                scheduleArbitrationEvent(cycle + arbitrationDelay);
            }
            else{
                scheduleArbitrationEvent(nextReqTime + arbitrationDelay);
            }
        }
    }

    if(pipelined){
        if(!slaveRequestQueue->empty()){
            Tick nextReqTime = slaveRequestQueue->front()->time;
            if(nextReqTime <= (cycle - arbitrationDelay)){
                scheduleArbitrationEvent(cycle + 1);
            }
            else{
                scheduleArbitrationEvent(nextReqTime + arbitrationDelay);
            }
        }
    }
}

void
SplitTransBus::grantInterface(grant_type gt, Tick cycle){

    Tick goodReqTime = cycle - arbitrationDelay;
    InterconnectRequest* grantReq;

    /* remove the request from the correct queue */
    switch(gt){
        case STB_NOT_PIPELINED:
            grantReq = requestQueue.front();
            requestQueue.pop_front();
            break;

        case STB_MASTER:
            if(requestQueue.empty()) return;
            grantReq = requestQueue.front();
            /* check if requests are available */
            if(grantReq->time > goodReqTime) return;
            requestQueue.pop_front();
            break;

        case STB_SLAVE:
            if(slaveRequestQueue->empty()) return;
            grantReq = slaveRequestQueue->front();
            /* check if requests are available */
            if(grantReq->time > goodReqTime) return;
            slaveRequestQueue->pop_front();
            break;

        default:
            fatal("Unknown grant type encountered");
    }

    /* grant access */
    allInterfaces[grantReq->fromID]->grantData();

    /* update statistics */
    arbitratedRequests++;
    totalArbQueueCycles += ((cycle - grantReq->time) - arbitrationDelay);
    totalArbitrationCycles += arbitrationDelay;

    delete grantReq;
}

void
SplitTransBus::send(MemReqPtr& req, Tick time, int fromID){

    assert(!blocked);
    assert((req->size / width) <= 1);

    bool isFromMaster = false;
    if(allInterfaces[fromID]->isMaster()) isFromMaster = true;

    if(req->toInterfaceID != -1){
        // L1 to L1 request
        deliverQueue.push_back(
                new InterconnectDelivery(time,
                                         fromID,
                                         req->toInterfaceID,
                                         req));
    }
    else if(isFromMaster){
        // Try all slaves and check if they can supply the needed data
        int successCount = 0;
        int toID = -1;
        for(int i=0;i<allInterfaces.size();++i){

            if(allInterfaces[i]->isMaster()) continue;

            if(allInterfaces[i]->inRange(req->paddr)){
                successCount++;
                toID = i;
            }
        }

        if(successCount == 0){
            fatal("No supplier for data on SplitTransBus");
        }
        if(successCount > 1){
            fatal("More than one supplier for data on SplitTransBus");
        }

        /* deliver to L2 cache */
        deliverQueue.push_back(
                new InterconnectDelivery(time, fromID, toID, req));
    }
    else{
        /* deliver to L1 cache */
        deliverQueue.push_back(
                new InterconnectDelivery(time,
                                         fromID,
                                         req->fromInterfaceID,
                                         req));
    }

#ifdef DEBUG_SPLIT_TRANS_BUS
    /* check that the queue is sorted */
    InterconnectDelivery* prev = NULL;
    bool first = true;
    for(list<InterconnectDelivery*>::iterator i=deliverQueue.begin();
        i!=deliverQueue.end();
        i++){
        if(first){
            first = false;
            prev = *i;
            continue;
        }

        assert(prev->grantTime <= (*i)->grantTime);
        prev = *i;
    }
#endif //DEBUG_SPLIT_TRANS_BUS

    if(doProfile) useCycleSample += transferDelay;

    scheduleDeliverEvent(time + transferDelay);
}

void
SplitTransBus::scheduleDeliverEvent(Tick possibleArbCycle){

    bool addEvent = true;
    for(int i=0;i<deliverEvents.size();i++){
        if(deliverEvents[i]->when() == possibleArbCycle){
            addEvent = false;
        }
    }

    if(addEvent){
        InterconnectDeliverQueueEvent* event =
            new InterconnectDeliverQueueEvent(this);
        event->schedule(possibleArbCycle);
        deliverEvents.push_back(event);
    }
}

void
SplitTransBus::deliver(MemReqPtr& req, Tick cycle, int toID, int fromID){

    assert(!blocked);
    assert(!deliverQueue.empty());

    InterconnectDelivery* delivery = deliverQueue.front();
    deliverQueue.pop_front();

    /* update statistics */
    sentRequests++;
    int queueTime = (cycle - delivery->grantTime) - transferDelay;
    totalTransQueueCycles += queueTime;
    totalTransferCycles += transferDelay;
    int curCpuId = delivery->req->xc->cpu->params->cpu_id;
    perCpuTotalTransQueueCycles[curCpuId] += queueTime;
    perCpuTotalTransferCycles[curCpuId] += transferDelay;

    int retval = BA_NO_RESULT;
    assert(delivery->toID > -1);
    if(allInterfaces[delivery->toID]->isMaster()){
        allInterfaces[delivery->toID]->deliver(delivery->req);
    }
    else{
        retval = allInterfaces[delivery->toID]->access(delivery->req);
    }

    delete delivery;

    if(retval != BA_BLOCKED){
        /* see if we need to schedule another delivery */
        if(!deliverQueue.empty()){
            InterconnectDelivery* nextDelivery = deliverQueue.front();
            if(nextDelivery->grantTime <= (cycle - transferDelay)){
                if(pipelined) scheduleDeliverEvent(cycle + 1);
                else scheduleDeliverEvent(cycle + transferDelay);
            }
            else{
                scheduleDeliverEvent(nextDelivery->grantTime + transferDelay);
            }
        }
    }
}

void
SplitTransBus::setBlocked(int fromInterface){

    if(blocked) warn("SplitTransBus blocking on a second cause");

    blocked = true;
    numSetBlocked++;
    waitingFor = fromInterface;

    /* remove all scheduled arbitration events */
    for(int i=0;i<arbitrationEvents.size();++i){
        if(arbitrationEvents[i]->scheduled()){
            arbitrationEvents[i]->deschedule();
        }
        delete arbitrationEvents[i];
    }
    arbitrationEvents.clear();

    /* remove all deliver events */
    for(int i=0;i<deliverEvents.size();i++){
        if(deliverEvents[i]->scheduled()){
            deliverEvents[i]->deschedule();
        }
        delete deliverEvents[i];
    }
    deliverEvents.clear();

    blockedAt = curTick;
}

void
SplitTransBus::clearBlocked(int fromInterface){

    assert(blocked);
    assert(blockedAt >= 0);

    if(blocked && waitingFor == fromInterface){
        blocked = false;

        if(!requestQueue.empty()){
            Tick min = requestQueue.front()->time;
            if(min >= curTick){
                scheduleArbitrationEvent(min + arbitrationDelay);
            }
            else{
                scheduleArbitrationEvent(curTick + arbitrationDelay);
            }
        }

        if(pipelined){
            if(!slaveRequestQueue->empty()){
                Tick min = slaveRequestQueue->front()->time;
                if(min >= curTick){
                    scheduleArbitrationEvent(min + arbitrationDelay);
                }
                else{
                    scheduleArbitrationEvent(curTick + arbitrationDelay);
                }
            }
        }

        if(!deliverQueue.empty()){
            Tick min = deliverQueue.front()->grantTime;
            if(min >= curTick){
                scheduleDeliverEvent(min + transferDelay);
            }
            else{
                scheduleDeliverEvent(curTick + transferDelay);
            }
        }

        numClearBlocked++;
        blockedAt = -1;
    }
}

vector<int>
SplitTransBus::getChannelSample(){

    if(!doProfile) doProfile = true;

    std::vector<int> retval(1, 0);
    retval[0] = useCycleSample;
    useCycleSample = 0;

    return retval;
}

#ifdef DEBUG_SPLIT_TRANS_BUS

void
SplitTransBus::checkIfSorted(std::list<InterconnectRequest* >* inList){
    /* check that the queue is sorted */
    InterconnectRequest* prev = NULL;
    bool first = true;
    for(list<InterconnectRequest*>::iterator i=inList->begin();
        i!=inList->end();
        i++){
        if(first){
            first = false;
            prev = *i;
            continue;
        }

        assert(prev->time <= (*i)->time);
        prev = *i;
    }
}

void
SplitTransBus::printRequestQueue(){
    cout << "Request queue: ";
    for(list<InterconnectRequest*>::iterator i = requestQueue.begin();
        i != requestQueue.end();
        i++){
        cout << "("
             << (*i)->fromID
             << ", "
             << (*i)->time
             << ") ";
    }
    cout << "\n";

    if(pipelined){
        cout << "Slave request queue: ";
        for(list<InterconnectRequest*>::iterator i = slaveRequestQueue->begin();
            i != slaveRequestQueue->end();
            i++){
            cout << "("
                 << (*i)->fromID
                 << ", "
                 << (*i)->time
                 << ") ";
        }
        cout << "\n";
    }
}

void
SplitTransBus::printDeliverQueue(){
    cout << "Deliver queue: ";
    for(list<InterconnectDelivery*>::iterator i = deliverQueue.begin();
        i != deliverQueue.end();
        i++){
        cout << "("
             << (*i)->fromID
             << ", "
             << (*i)->toID
             << ", "
             << (*i)->grantTime
             << ") ";
    }
    cout << "\n";
}

#endif //DEBUG_SPLIT_TRANS_BUS

#ifndef DOXYGEN_SHOULD_SKIP_THIS

BEGIN_DECLARE_SIM_OBJECT_PARAMS(SplitTransBus)
    Param<int> width;
    Param<int> clock;
    Param<int> transferDelay;
    Param<int> arbitrationDelay;
    Param<int> cpu_count;
    Param<bool> pipelined;
    SimObjectParam<HierParams *> hier;
END_DECLARE_SIM_OBJECT_PARAMS(SplitTransBus)

BEGIN_INIT_SIM_OBJECT_PARAMS(SplitTransBus)
    INIT_PARAM(width, "bus width in bytes"),
    INIT_PARAM(clock, "bus clock"),
    INIT_PARAM(transferDelay, "bus transfer delay in CPU cycles"),
    INIT_PARAM(arbitrationDelay, "bus arbitration delay in CPU cycles"),
    INIT_PARAM(cpu_count, "the number of CPUs in the system"),
    INIT_PARAM(pipelined, "true if the bus has pipelined arbitration "
                          "and transmission"),
    INIT_PARAM_DFLT(hier,
                    "Hierarchy global variables",
                    &defaultHierParams)
END_INIT_SIM_OBJECT_PARAMS(SplitTransBus)

CREATE_SIM_OBJECT(SplitTransBus)
{
    return new SplitTransBus(getInstanceName(),
                             width,
                             clock,
                             transferDelay,
                             arbitrationDelay,
                             cpu_count,
                             pipelined,
                             hier);
}

REGISTER_SIM_OBJECT("SplitTransBus", SplitTransBus)

#endif //DOXYGEN_SHOULD_SKIP_THIS
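
The rescheduling rule at the end of arbitrate() is easier to see in isolation. The sketch below restates it as a pure function under stated assumptions: nextArbitrationTick is a hypothetical name, Tick is assumed to be a signed integer type (so that cycle - arbitrationDelay cannot wrap around), and the function only computes the tick rather than scheduling an event.

```cpp
#include <cassert>

typedef long long Tick;   // assumption: signed tick type

// Tick at which the next arbitration event should fire, given that an
// arbitration ran at `cycle` and the oldest request still waiting in a
// queue arrived at `nextReqTime`.
inline Tick nextArbitrationTick(bool pipelined, Tick cycle,
                                Tick nextReqTime, Tick arbitrationDelay){
    assert(arbitrationDelay >= 0);
    if(pipelined){
        // the request has already waited a full arbitration delay,
        // so it can be granted in the very next cycle
        if(nextReqTime <= cycle - arbitrationDelay) return cycle + 1;
    }
    else{
        // the request is already waiting, so a new arbitration
        // starts now and completes one arbitration delay later
        if(nextReqTime <= cycle) return cycle + arbitrationDelay;
    }
    // otherwise arbitration completes one delay after the request arrives
    return nextReqTime + arbitrationDelay;
}
```

For example, with an arbitration delay of 4 and an arbitration at cycle 100, a pipelined bus handles a request from cycle 90 at cycle 101, while a non-pipelined bus handles it at cycle 104.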
C.1.5 Butterfly Header File
#ifndef __BUTTERFLY_HH__
#define __BUTTERFLY_HH__

#include "interconnect.hh"

/**
 * This class implements a butterfly interconnect. It was only developed to
 * investigate the performance of a multistage interconnection network.
 * Consequently, it is only possible to configure it to represent a subset of
 * all possible butterfly networks.
 *
 * In particular, it handles 2, 4 or 8 processor cores. The reason is that the
 * mapping from interface to butterfly node is defined in the constructor.
 * Since 8 processors are the maximum number used in this work, it was not
 * prioritised to add support for more processors. Furthermore, only radix 2
 * switches are supported and only L2 caches with 4 banks.
 *
 * Path diversity can be added to a butterfly by adding extra stages. This
 * implementation has no path diversity.
 *
 * @author Magnus Jahre
 */
class Butterfly : public Interconnect
{

    private:

        int switchDelay;
        int radix;
        int butterflyCpuCount;
        int butterflyCacheBanks;

        int terminalNodes;
        int stages;
        int switches;
        int butterflyHeight;
        int hopCount;
        int chanBetweenStages;

        std::map<int,int> cpuIDtoNode;
        std::map<int,int> l2IDtoNode;

        std::list<InterconnectRequest*> requestQueue;
        std::list<InterconnectDelivery*> deliverQueue;

        std::vector<bool> butterflyStatus;
        std::vector<int> channelUsage;

        std::vector<int> blockedInterfaces;

    public:

        /**
         * This constructor creates the interface to node mapping for a given
         * number of CPUs. Furthermore, a number of convenience values are
         * computed. Examples of such values are the width and height of the
         * butterfly.
         *
         * @param _name        The name given in the configuration file
         * @param _width       The width of the transmission channels
         * @param _clock       The number of processor clock cycles in one
         *                     interconnect clock cycle
         * @param _transDelay  The transfer delay per channel in the
         *                     butterfly
         * @param _arbDelay    Arbitration delay for the Interconnect
         *                     constructor. This should be set to 0 as there is
         *                     no explicit arbitration in a butterfly.
         * @param _cpu_count   The number of cpus in the system
         * @param _hier        Hierarchy parameters for BaseHier
         * @param _switchDelay The delay through the switches in the butterfly
         * @param _radix       The number of inputs or outputs for each switch
         *                     (only 2 are supported in this implementation).
         * @param _banks       The number of L2 banks (only 4 are supported in
         *                     this implementation).
         */
        Butterfly(const std::string &_name,
                  int _width,
                  int _clock,
                  int _transDelay,
                  int _arbDelay,
                  int _cpu_count,
                  HierParams *_hier,
                  int _switchDelay,
                  int _radix,
                  int _banks);

        /**
         * This destructor does nothing.
         */
        ~Butterfly(){
            /* noop */
        }
/∗∗
∗ This method puts the r eques t i n to a queue and schedu l e s an
∗ a r b i t r a t i o n event i f needed . The reques t queue i s kept so r t ed in
∗ ascending order on the c l o ck cy c l e i t was r e c i e v ed as t h i s s i m p l i f i e s
∗ the a r b i t r a t i o n method .
∗
∗ @param time The c l o ck cy c l e the method was c a l l e d
∗ @param fromID The i n t e r f a c e ID o f the r eque s t i ng i n t e r f a c e
∗/
void r eque s t ( Tick time , int fromID ) ;
/∗∗
∗ This method i s c a l l e d from an i n t e r f a c e when i t i s granted ac c e s s . I t
∗ computes the i n t e r f a c e ID o f the r e c i p i e n t based on the reques t g iven
∗ and adds t h i s to a d e l i v e r y queue . Then , i t s chedu l e s a d e l i v e r y
∗ event i f needed .
∗
∗ @param req The memory reques t to send .
∗ @param time The c l o ck cy c l e the method was c a l l e d at .
∗ @param fromID The i n t e r f a c e ID o f the sender i n t e r f a c e .
∗/
void send (MemReqPtr& req , Tick time , int fromID ) ;
/∗∗
∗ This method i s c a l l e d when an a r b i t r a t i o n event i s s e r v i c ed . I t
∗ attempts to grant a c c e s s to as many i n t e r f a c e s as p o s s i b l e g iven the
∗ l im i t a t i o n s o f the bu t t e r f l y i n t e r connec t . S ince the r eques t queue
∗ i s sorted , the o ld e r r eque s t s are p r i o r i t i s e d .
     *
     * If not all requests can be granted at a given cycle, an arbitration
     * event is scheduled at the next clock cycle if at least one request is
     * old enough to be scheduled at this cycle. If not, an arbitration
     * event is added at the request time + arbitration delay.
     *
     * @param cycle The clock cycle the method is called.
     */
    void arbitrate(Tick cycle);

    /**
     * This method tries to deliver as many requests as possible to their
     * destinations. Only requests that have experienced the defined delay
     * can be delivered. However, if an L2 bank blocks, some requests that
     * are old enough might not be delivered. Since the delivery queue
     * is kept sorted, the oldest requests are delivered first.
     *
     * Since this class uses a delivery queue, all parameters except
     * cycle are discarded.
     *
     * @param req    Not used, must be NULL.
     * @param cycle  The clock cycle the method is called.
     * @param toID   Not used, must be -1.
     * @param fromID Not used, must be -1.
     */
    void deliver(MemReqPtr& req, Tick cycle, int toID, int fromID);

    /**
     * This method is called when an L2 bank blocks. It deschedules all
     * arbitration events and delivery events. Consequently, no requests
     * are delivered to interfaces that are not blocked either.
     *
     * @param fromInterface The ID of the interface that has blocked
     */
    void setBlocked(int fromInterface);

    /**
     * This method is called when an L2 bank becomes unblocked. If there are
     * waiting requests or deliveries, new arbitration events or deliver
     * events are scheduled respectively.
     *
     * @param fromInterface The ID of the interface that was blocked
     */
    void clearBlocked(int fromInterface);

    /**
     * This method returns the number of transmission channels in the
     * interconnect and is used by the InterconnectProfile class.
     *
     * @return The number of transmission channels
     *
     * @see InterconnectProfile
     */
    int getChannelCount();

    /**
     * This method returns the number of cycles each channel was
     * occupied since it was called last.
     *
     * @return The number of clock cycles each channel was used since last
     *         time the method was called.
     *
     * @see InterconnectProfile
     */
    std::vector<int> getChannelSample();

    /**
     * This method writes a description of the different channels to
     * the provided stream.
     *
     * @param stream The output stream to write to.
     *
     * @see InterconnectProfile
     */
    void writeChannelDecriptor(std::ofstream &stream);

  private:
    void scheduleArbitrationEvent(Tick candidateTime);
    bool setChannelsOccupied(int fromInterfaceID, int toInterfaceID);
    int getDestinationId(int fromID);
    void printChannelStatus();
};
#endif // __BUTTERFLY_HH__
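The documentation of request() states that the request queue is kept sorted in ascending order on the receive cycle. The following is a minimal, stand-alone sketch of such an insertion. The Request struct and the insertSorted function are hypothetical stand-ins introduced for illustration and are not part of the simulator code.

```cpp
#include <cassert>
#include <list>

// Hypothetical stand-in for InterconnectRequest with only the fields used here.
struct Request {
    long time;   // clock cycle the request was received
    int fromID;  // interface ID of the requesting interface
};

// Inserts a request so that the list stays sorted on ascending time.
// Requests with equal timestamps keep their arrival order.
void insertSorted(std::list<Request>& queue, long time, int fromID) {
    if (queue.empty() || queue.back().time <= time) {
        // Common case: requests arrive in time order.
        queue.push_back(Request{time, fromID});
        return;
    }
    // Otherwise, find the first entry that is strictly newer.
    std::list<Request>::iterator pos = queue.begin();
    while (pos != queue.end() && pos->time <= time) ++pos;
    queue.insert(pos, Request{time, fromID});
}
```

With a sorted queue, the arbitration method only needs to inspect the front of the queue to find the oldest request.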
C.1.6 Butterfly Code File
#include "sim/builder.hh"
#include "butterfly.hh"
#include <math.h>

using namespace std;

Butterfly::Butterfly(const std::string &name,
                     int width,
                     int clock,
                     int transDelay,
                     int arbDelay,
                     int cpu_count,
                     HierParams *hier,
                     int switchDelay,
                     int radix,
                     int banks)
    : Interconnect(name,
                   width,
                   clock,
                   transDelay,
                   arbDelay,
                   cpu_count,
                   hier)
{
    // the parameters shadow the members with the same names
    this->switchDelay = switchDelay;
    this->radix = radix;
    butterflyCpuCount = cpu_count;
    butterflyCacheBanks = banks;

    if(butterflyCacheBanks != 4){
        fatal("mappings only implemented for 4 L2 cache banks");
    }

    if(cpu_count == 2){
        cpuIDtoNode[0] = 0;
        cpuIDtoNode[1] = 1;
        l2IDtoNode[0] = 2;
        l2IDtoNode[1] = 2;
        l2IDtoNode[2] = 3;
        l2IDtoNode[3] = 3;
        terminalNodes = 4;
    }
    else if(cpu_count == 4){
        cpuIDtoNode[0] = 0;
        cpuIDtoNode[1] = 1;
        cpuIDtoNode[2] = 2;
        cpuIDtoNode[3] = 3;
        l2IDtoNode[0] = 4;
        l2IDtoNode[1] = 5;
        l2IDtoNode[2] = 6;
        l2IDtoNode[3] = 7;
        terminalNodes = 8;
    }
    else if(cpu_count == 8){
        cpuIDtoNode[0] = 0;
        cpuIDtoNode[1] = 1;
        cpuIDtoNode[2] = 2;
        cpuIDtoNode[3] = 3;
        cpuIDtoNode[4] = 4;
        cpuIDtoNode[5] = 5;
        cpuIDtoNode[6] = 6;
        cpuIDtoNode[7] = 7;
        cpuIDtoNode[8] = -1;
        cpuIDtoNode[9] = -1;
        cpuIDtoNode[10] = -1;
        cpuIDtoNode[11] = -1;
        l2IDtoNode[0] = 12;
        l2IDtoNode[1] = 13;
        l2IDtoNode[2] = 14;
        l2IDtoNode[3] = 15;
        terminalNodes = 16;
    }
    else{
        fatal("The butterfly only supports 2, 4 or 8 processors");
    }

    // compute topology utility vars
    // ceil needed because the answer is not completely accurate
    double tmp = (log10((double) terminalNodes) / log10((double) radix));
    stages = (int) ceil(tmp-1e-9);
    switches = stages * (terminalNodes / radix);
    butterflyHeight = (terminalNodes / radix);
    hopCount = stages + 1;
    chanBetweenStages = butterflyHeight * radix;

    int totalChannels = hopCount * chanBetweenStages;
    butterflyStatus.insert(butterflyStatus.begin(), totalChannels, false);
    channelUsage.insert(channelUsage.begin(), totalChannels, 0);

    if(radix != 2) fatal("Only radix 2 butterflies are implemented");
}

void
Butterfly::request(Tick time, int fromID){
    requests++;

    if(requestQueue.empty() || requestQueue.back()->time < time){
        requestQueue.push_back(new InterconnectRequest(time, fromID));
    }
    else{
        list<InterconnectRequest*>::iterator pos;
        for(pos = requestQueue.begin();
            pos != requestQueue.end();
            pos++){
            if((*pos)->time > time) break;
        }
        requestQueue.insert(pos, new InterconnectRequest(time, fromID));
    }

    assert(isSorted(&requestQueue));
    if(!blocked) scheduleArbitrationEvent(time + arbitrationDelay);
}

void
Butterfly::send(MemReqPtr& req, Tick time, int fromID){
    assert(!blocked);
    assert((req->size / width) <= 1);

    int toID = -1;
    if(allInterfaces[fromID]->isMaster() && req->toInterfaceID != -1){
        toID = req->toInterfaceID;
    }
    else if(allInterfaces[fromID]->isMaster()){
        toID = getTarget(req->paddr);
    }
    else{
        toID = req->fromInterfaceID;
    }

    deliverQueue.push_back(new InterconnectDelivery(time, fromID, toID, req));

    /* check if we need to schedule a deliver event */
    Tick deliverTime = time + (stages * switchDelay + hopCount*transferDelay);
    bool found = false;
    for(int i=0;i<deliverEvents.size();i++){
        if(deliverEvents[i]->when() == deliverTime) found = true;
    }

    if(!found){
        InterconnectDeliverQueueEvent* event =
            new InterconnectDeliverQueueEvent(this);
        event->schedule(deliverTime);
        deliverEvents.push_back(event);
    }
}

void
Butterfly::arbitrate(Tick cycle){
    // reset internal state
    for(int i=0;i<butterflyStatus.size();i++) butterflyStatus[i] = false;

    list<InterconnectRequest* > notGrantedReqs;
    Tick legalRequestTime = cycle - arbitrationDelay;

    list<InterconnectRequest*>::iterator pos;
    while(!requestQueue.empty()){
        if(requestQueue.front()->time <= legalRequestTime){
            int toInterface = getDestinationId(requestQueue.front()->fromID);
            if(toInterface == -1){
                // null request, remove
                delete requestQueue.front();
                requestQueue.pop_front();
                continue;
            }

            if(setChannelsOccupied(
                   requestQueue.front()->fromID,
                   toInterface)){
                // update statistics
                arbitratedRequests++;
                totalArbQueueCycles +=
                    (cycle - requestQueue.front()->time) - arbitrationDelay;
                totalArbitrationCycles += arbitrationDelay;

                // grant access
                allInterfaces[requestQueue.front()->fromID]->grantData();
                delete requestQueue.front();
                requestQueue.pop_front();
            }
            else{
                notGrantedReqs.push_back(requestQueue.front());
                requestQueue.pop_front();
            }
        }
        else{
            notGrantedReqs.push_back(requestQueue.front());
            requestQueue.pop_front();
        }
    }

    if(!notGrantedReqs.empty()){
        // there were requests we could not issue
        // put them back in the queue and schedule new arb event
        assert(requestQueue.empty());
        requestQueue.splice(requestQueue.begin(), notGrantedReqs);

        if(requestQueue.front()->time <= cycle){
            scheduleArbitrationEvent(cycle+1);
        }
        else{
            scheduleArbitrationEvent(
                requestQueue.front()->time + arbitrationDelay);
        }
    }

    assert(butterflyStatus.size() == channelUsage.size());
    for(int i=0;i<butterflyStatus.size();i++){
        if(butterflyStatus[i]) channelUsage[i]++;
    }
}

bool
Butterfly::setChannelsOccupied(int fromInterfaceID, int toInterfaceID){
    assert(fromInterfaceID >= 0 && toInterfaceID >= 0);

    // translate into butterfly node IDs
    int fromNodeId = (allInterfaces[fromInterfaceID]->isMaster() ?
        cpuIDtoNode[interconnectIDToProcessorIDMap[fromInterfaceID]]
        : l2IDtoNode[interconnectIDToL2IDMap[fromInterfaceID]]);
    int toNodeId = (allInterfaces[toInterfaceID]->isMaster() ?
        cpuIDtoNode[interconnectIDToProcessorIDMap[toInterfaceID]]
        : l2IDtoNode[interconnectIDToL2IDMap[toInterfaceID]]);

    // store old state in case we can't grant the request
    vector<bool> tmpState = butterflyStatus;

    int atSwitch = -1;
    for(int i=0;i<hopCount;i++){
        if(i == 0){
            if(butterflyStatus[fromNodeId]){
                butterflyStatus = tmpState;
                return false;
            }
            butterflyStatus[fromNodeId] = true;
            atSwitch = fromNodeId / radix;
        }
        else if(i == hopCount-1){
            int lastStageChanID = (chanBetweenStages * i) + toNodeId;
            if(butterflyStatus[lastStageChanID]){
                butterflyStatus = tmpState;
                return false;
            }
            butterflyStatus[lastStageChanID] = true;
        }
        else{
            int useChannelNum = -1;
            if((toNodeId & (1 << (stages - i))) > 0) useChannelNum = 1;
            else useChannelNum = 0;
            int channelID = (atSwitch * 2) + useChannelNum;

            int offset = 1 << (stages - i - 1);
            int nextSwitch = -1;
            if((atSwitch & offset) == 0 && useChannelNum == 1){
                nextSwitch = atSwitch + offset;
            }
            else if((atSwitch & offset) > 0 && useChannelNum == 0){
                nextSwitch = atSwitch - offset;
            }
            else{
                nextSwitch = atSwitch;
            }

            int stageOffset = chanBetweenStages * i;
            if(butterflyStatus[stageOffset + channelID]){
                butterflyStatus = tmpState;
                return false;
            }
            butterflyStatus[stageOffset + channelID] = true;
            atSwitch = nextSwitch;
        }
    }

    return true;
}

void
Butterfly::deliver(MemReqPtr& req, Tick cycle, int toID, int fromID){
    assert(!blocked);
    assert(!req);
    assert(toID == -1);
    assert(fromID == -1);
    assert(isSorted(&deliverQueue));

    int butterflyTransDelay = stages * switchDelay + hopCount*transferDelay;
    Tick legalGrantTime = cycle - butterflyTransDelay;

    /* attempt to deliver as many requests as possible */
    /* since the queue is sorted, starvation is not possible */
    while(!deliverQueue.empty()){
        InterconnectDelivery* delivery = deliverQueue.front();

        /* check if this grant has experienced the proper delay */
        /* since the requests are sorted, we know that all others are later */
        if(delivery->grantTime > legalGrantTime) break;

        deliverQueue.pop_front();

        /* update statistics */
        sentRequests++;
        int curCpuId = delivery->req->xc->cpu->params->cpu_id;
        int queueCycles = (cycle - delivery->grantTime) - butterflyTransDelay;
        totalTransQueueCycles += queueCycles;
        totalTransferCycles += butterflyTransDelay;
        perCpuTotalTransQueueCycles[curCpuId] += queueCycles;
        perCpuTotalTransferCycles[curCpuId] += butterflyTransDelay;

        int retval = BA_NO_RESULT;
        if(allInterfaces[delivery->toID]->isMaster()){
            allInterfaces[delivery->toID]->deliver(delivery->req);
        }
        else{
            retval = allInterfaces[delivery->toID]->access(delivery->req);
        }
        delete delivery;

        /* if the cache returns blocked we cannot deliver any more data */
        if(retval == BA_BLOCKED) break;
    }
}

void
Butterfly::setBlocked(int fromInterface){
    if(blocked) warn("blocking on a second cause");
    blocked = true;
    waitingFor = fromInterface;
    blockedAt = curTick;
    numSetBlocked++;
    blockedInterfaces.push_back(fromInterface);

    for(int i=0;i<arbitrationEvents.size();++i){
        if(arbitrationEvents[i]->scheduled()){
            arbitrationEvents[i]->deschedule();
        }
        delete arbitrationEvents[i];
    }
    arbitrationEvents.clear();

    for(int i=0;i<deliverEvents.size();++i){
        if(deliverEvents[i]->scheduled()){
            deliverEvents[i]->deschedule();
        }
        delete deliverEvents[i];
    }
    deliverEvents.clear();
}

void
Butterfly::clearBlocked(int fromInterface){
    assert(blocked);
    assert(blockedAt > -1);

    int hitIndex = -1;
    int hitCount = 0;
    for(int i=0;i<blockedInterfaces.size();i++){
        if(blockedInterfaces[i] == fromInterface){
            hitIndex = i;
            hitCount++;
        }
    }
    assert(hitCount == 1 && hitIndex > -1);
    blockedInterfaces.erase(blockedInterfaces.begin()+hitIndex);

    if(blockedInterfaces.empty()){
        blocked = false;
        numClearBlocked++;

        assert(arbitrationEvents.empty());
        if(!requestQueue.empty()){
            /* schedule new arbitration event */
            Tick firstReq = (requestQueue.front())->time;
            InterconnectArbitrationEvent* event =
                new InterconnectArbitrationEvent(this);
            arbitrationEvents.push_back(event);
            if((firstReq + arbitrationDelay) <= curTick){
                event->schedule(curTick);
            }
            else{
                event->schedule(firstReq + arbitrationDelay);
            }
        }
        assert(deliverEvents.empty());
        if(!deliverQueue.empty()){
            /* schedule new deliver event */
            Tick firstGrant = (deliverQueue.front())->grantTime;
            InterconnectDeliverQueueEvent* event =
                new InterconnectDeliverQueueEvent(this);
            deliverEvents.push_back(event);
            if((firstGrant + transferDelay) <= curTick){
                event->schedule(curTick);
            }
            else event->schedule(firstGrant + transferDelay);
        }

        blockedAt = -1;
    }
}

int
Butterfly::getChannelCount(){
    return hopCount * chanBetweenStages;
}

vector<int>
Butterfly::getChannelSample(){
    vector<int> copy = channelUsage;
    assert(channelUsage.size() == getChannelCount());
    for(int i=0;i<channelUsage.size();i++) channelUsage[i] = 0;
    return copy;
}

void
Butterfly::writeChannelDecriptor(std::ofstream &stream){
    stream << "Interfaces:\n";
    for(int i=0;i<allInterfaces.size();i++){
        stream << "Interface " << i
               << " (" << allInterfaces[i]->getCacheName() << "): "
               << " mapped to node id "
               << (allInterfaces[i]->isMaster() ?
                   cpuIDtoNode[interconnectIDToProcessorIDMap[i]] :
                   l2IDtoNode[interconnectIDToL2IDMap[i]])
               << "\n";
    }

    stream << "\nChannels:\n";
    int chanSet = 0;
    for(int i=0;i<(hopCount * chanBetweenStages);i++){
        if(i != 0 && i % chanBetweenStages == 0){
            chanSet++;
            stream << "\n";
        }
        stream << "Channel ID " << i << ": In set "
               << chanSet << ", id in set "
               << (i % chanBetweenStages) << "\n";
    }
}
void
Butterfly::printChannelStatus(){
    cout << "ID: ";
    for(int i=0;i<chanBetweenStages;i++){
        if(i<=10) cout << " " << i << " ";
        else cout << " " << i << " ";
    }
    cout << "\n";

    int chanGroup = 0;
    cout << "Channel Group " << chanGroup << ": ";
    for(int i=0;i<butterflyStatus.size();i++){
        cout << (butterflyStatus[i] ? "true" : "false") << " ";
        if(i != 1 && (i+1) % chanBetweenStages == 0){
            chanGroup++;
            cout << "\n";
            if(i != butterflyStatus.size()-1){
                cout << "Channel Group " << chanGroup << ": ";
            }
        }
    }
}

void
Butterfly::scheduleArbitrationEvent(Tick candidateTime){
    bool found = false;
    for(int i=0;i<arbitrationEvents.size();i++){
        if(arbitrationEvents[i]->when() == candidateTime) found = true;
    }

    if(!found){
        InterconnectArbitrationEvent* event =
            new InterconnectArbitrationEvent(this);
        event->schedule(candidateTime);
        arbitrationEvents.push_back(event);
    }
}

int
Butterfly::getDestinationId(int fromID){
    if(allInterfaces[fromID]->isMaster()){
        pair<Addr, int> tmp = allInterfaces[fromID]->getTargetAddr();
        Addr targetAddr = tmp.first;
        int toInterfaceId = tmp.second;

        // The request was a null request, remove
        if(targetAddr == 0) return -1;

        // we already know the to interface id if it's an L1 to L1 transfer
        if(toInterfaceId != -1) return toInterfaceId;

        return getTarget(targetAddr);
    }

    int retID = allInterfaces[fromID]->getTargetId();
    assert(retID != -1);
    return retID;
}

#ifndef DOXYGEN_SHOULD_SKIP_THIS

BEGIN_DECLARE_SIM_OBJECT_PARAMS(Butterfly)
    Param<int> width;
    Param<int> clock;
    Param<int> transferDelay;
    Param<int> arbitrationDelay;
    Param<int> cpu_count;
    SimObjectParam<HierParams *> hier;
    Param<int> switch_delay;
    Param<int> radix;
    Param<int> banks;
END_DECLARE_SIM_OBJECT_PARAMS(Butterfly)

BEGIN_INIT_SIM_OBJECT_PARAMS(Butterfly)
    INIT_PARAM(width, "the width of the crossbar transmission channels"),
    INIT_PARAM(clock, "butterfly clock"),
    INIT_PARAM(transferDelay, "butterfly transfer delay in CPU cycles"),
    INIT_PARAM(arbitrationDelay, "butterfly arbitration delay in CPU cycles"),
    INIT_PARAM(cpu_count, "the number of CPUs in the system"),
    INIT_PARAM_DFLT(hier,
                    "Hierarchy global variables",
                    &defaultHierParams),
    INIT_PARAM_DFLT(switch_delay,
                    "The delay of a switch in CPU cycles",
                    1),
    INIT_PARAM(radix, "The switching-degree of each switch"),
    INIT_PARAM(banks, "the number of last-level cache banks")
END_INIT_SIM_OBJECT_PARAMS(Butterfly)

CREATE_SIM_OBJECT(Butterfly)
{
    return new Butterfly(getInstanceName(),
                         width,
                         clock,
                         transferDelay,
                         arbitrationDelay,
                         cpu_count,
                         hier,
                         switch_delay,
                         radix,
                         banks);
}

REGISTER_SIM_OBJECT("Butterfly", Butterfly)

#endif //DOXYGEN_SHOULD_SKIP_THIS
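The topology arithmetic in the constructor and the channel selection in setChannelsOccupied() can be illustrated in isolation. The sketch below is a stand-alone approximation under stated assumptions: the ButterflyTopology struct and the functions computeTopology, traversalDelay and routeChannels are hypothetical names introduced here, the number of stages is found with an integer loop rather than the log10/ceil expression of the listing, and the occupancy check is left out so that only the channel indices of a radix-2 route are produced.

```cpp
#include <cassert>
#include <vector>

// Derived sizes of a butterfly network; field names follow the listing.
struct ButterflyTopology {
    int stages;             // number of switch stages
    int switches;           // total number of switches
    int height;             // switches per stage
    int hopCount;           // channels traversed from source to destination
    int chanBetweenStages;  // channels in each channel group
    int totalChannels;      // size of the channel status vector
};

// Computes the sizes with an integer loop instead of floating-point logs
// (a substitution made for this sketch).
ButterflyTopology computeTopology(int terminalNodes, int radix) {
    assert(terminalNodes > 1 && radix > 1);
    ButterflyTopology t;
    t.stages = 0;
    for (long reach = 1; reach < terminalNodes; reach *= radix) ++t.stages;
    t.height = terminalNodes / radix;
    t.switches = t.stages * t.height;
    t.hopCount = t.stages + 1;
    t.chanBetweenStages = t.height * radix;
    t.totalChannels = t.hopCount * t.chanBetweenStages;
    return t;
}

// End-to-end latency of one transfer: every stage adds a switch delay and
// every hop adds a channel transfer delay.
int traversalDelay(const ButterflyTopology& t, int switchDelay, int transferDelay) {
    return t.stages * switchDelay + t.hopCount * transferDelay;
}

// Channel indices occupied by a radix-2 transfer, one per hop.
std::vector<int> routeChannels(const ButterflyTopology& t, int fromNode, int toNode) {
    std::vector<int> path;
    int atSwitch = fromNode / 2;
    path.push_back(fromNode);  // hop 0: the injection channel of the source
    for (int i = 1; i < t.hopCount - 1; ++i) {
        // The destination bit examined at this hop selects the output channel.
        int useChannel = (toNode >> (t.stages - i)) & 1;
        path.push_back(t.chanBetweenStages * i + atSwitch * 2 + useChannel);
        // Set or clear the matching bit of the switch index for the next stage.
        int offset = 1 << (t.stages - i - 1);
        if (useChannel == 1) atSwitch |= offset;
        else atSwitch &= ~offset;
    }
    path.push_back(t.chanBetweenStages * (t.hopCount - 1) + toNode);  // last hop
    return path;
}
```

For 8 terminal nodes and radix 2 this gives 3 stages, 4 hops and 32 channels, so with a switch delay and a transfer delay of one cycle each a transfer takes 7 cycles.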
C.1.7 Crossbar Header File
#ifndef __CROSSBAR_HH__
#define __CROSSBAR_HH__

#include <iostream>
#include <list>
#include <vector>
#include <queue>

#include "interconnect.hh"

#define DEBUG_CROSSBAR

/**
 * This class implements a crossbar interconnect inspired by the crossbar used
 * in IBM's Power 4 and Power 5 processors. Here, two crossbars connect all L1
 * caches to all L2 banks. One crossbar is added in the L1 to L2 direction and
 * the other crossbar runs in the L2 to L1 direction. L1 to L1 transfers are
 * made possible by connecting all L1 caches to a shared bus.
 *
 * The crossbar modelled in this class differs from the IBM design in a few
 * ways:
 * - The IBM crossbar only has address lines in the L2 to L1 direction. This
 *   implementation has address lines in both directions.
 * - The data and instruction caches share transmission channels in the IBM
 *   design. In this implementation, the instruction and data caches have
 *   separate transmission channels.
 *
 * Arbitration and transfer in the crossbar are pipelined.
 *
 * @author Magnus Jahre
 */
class Crossbar : public Interconnect
{
  private:
    bool isFirstRequest;
    Tick nextBusFreeTime;
    std::vector<std::list<InterconnectRequest*>* > requestQueues;
    std::list<InterconnectDelivery*> deliverQueue;

    void insertIntoList(std::list<InterconnectRequest*>* inList,
                        Tick reqTime,
                        int fromID);
    bool moreRequestsAvailable();
    int getDestinationId(int fromID);
    void scheduleArbitrationEvent(Tick reqTime);

    std::vector<int> blockedInterfaces;
    bool doProfiling;
    std::vector<int> channelUseCycles;

#ifdef DEBUG_CROSSBAR
    void printRequests();
#endif //DEBUG_CROSSBAR

  public:
    /**
     * This constructor initialises a few member variables, but sends all
     * parameters to the Interconnect constructor.
     *
     * @param name       The object name from the configuration file. This
     *                   is passed on to BaseHier and SimObject
     * @param width      The bit width of the transmission lines in the
     *                   interconnect
     * @param clock      The number of processor cycles in one interconnect
     *                   clock cycle.
     * @param transDelay The end-to-end transfer delay through the
     *                   interconnect in CPU cycles
     * @param arbDelay   The length of an arbitration in CPU cycles
     * @param cpu_count  The number of processors in the system
     * @param hier       Hierarchy parameters for BaseHier
     *
     * @see Interconnect
     */
    Crossbar(const std::string &name,
             int width,
             int clock,
             int transDelay,
             int arbDelay,
             int cpu_count,
             HierParams *hier)
        : Interconnect(name,
                       width,
                       clock,
                       transDelay,
                       arbDelay,
                       cpu_count,
                       hier){
        isFirstRequest = true;
        nextBusFreeTime = 0;
        doProfiling = false;
    }

    /**
     * This destructor deletes the request queues that are dynamically
     * allocated when the first request is received.
     */
    ~Crossbar(){
        for(int i=0;i<requestQueues.size();i++){
            delete requestQueues[i];
        }
    }

    /**
     * The request method administrates one queue for each transmission
     * channel. All queues are kept sorted to simplify arbitration. If an
     * arbitration event is needed, this method adds one.
     *
     * @param time   The clock cycle the method was called
     * @param fromID The interface ID of the requesting interface
     */
    void request(Tick time, int fromID);
    /**
     * The send method is called when an interface is granted access and
     * finds the destination interface from the values in the request. Then,
     * it adds the request to a delivery queue and schedules a delivery
     * event if needed.
     *
     * @param req    The memory request to send.
     * @param time   The clock cycle the method was called at.
     * @param fromID The interface ID of the sender interface.
     */
    void send(MemReqPtr& req, Tick time, int fromID);

    /**
     * The crossbar arbitration method removes the oldest request from each
     * request queue each cycle. The request must have experienced the
     * specified arbitration delay to be eligible for being granted access.
     * If not all requests can be granted, it attempts to schedule a new
     * arbitration event.
     *
     * @param cycle The clock cycle the method was called.
     */
    void arbitrate(Tick cycle);

    /**
     * This method tries to deliver as many requests as possible to their
     * destinations. Only requests that have experienced the defined delay
     * can be delivered. However, if an L2 bank blocks, some requests that
     * are old enough might not be delivered. Since the delivery queue
     * is kept sorted, the oldest requests are delivered first.
     *
     * Since this class uses a delivery queue, all parameters except
     * cycle are discarded.
     *
     * @param req    Not used, must be NULL.
     * @param cycle  The clock cycle the method is called.
     * @param toID   Not used, must be -1.
     * @param fromID Not used, must be -1.
     */
    void deliver(MemReqPtr& req, Tick cycle, int toID, int fromID);

    /**
     * This method is called when an L2 bank blocks. It deschedules all
     * arbitration events and delivery events. Consequently, no requests
     * are delivered to interfaces that are not blocked either.
     *
     * @param fromInterface The ID of the interface that has blocked
     */
    void setBlocked(int fromInterface);

    /**
     * This method is called when an L2 bank becomes unblocked. If there are
     * waiting requests or deliveries, new arbitration events or deliver
     * events are scheduled respectively.
     *
     * @param fromInterface The ID of the interface that was blocked
     */
    void clearBlocked(int fromInterface);

    /**
     * This method returns the number of transmission channels and is used
     * by the InterconnectProfile class. In this crossbar implementation,
     * the number of channels is the number of interfaces plus the shared
     * bus.
     *
     * @return The number of transmission channels
     *
     * @see InterconnectProfile
     */
    int getChannelCount(){
        // one channel for each interface and one coherence bus
        channelUseCycles.resize(allInterfaces.size()+1, 0);
        return allInterfaces.size() + 1;
    }

    /**
     * This method returns the number of cycles each channel was
     * occupied since it was called last.
     *
     * @return The number of clock cycles each channel was used since last
     *         time the method was called.
     *
     * @see InterconnectProfile
     */
    std::vector<int> getChannelSample();

    /**
     * This method writes a description of the different channels to
     * the provided stream.
     *
     * @param stream The output stream to write to.
     *
     * @see InterconnectProfile
     */
    void writeChannelDecriptor(std::ofstream &stream);
};
#endif // __CROSSBAR_HH__
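The arbitration policy documented for arbitrate() can be sketched without the simulator infrastructure. The example below is a simplified illustration under stated assumptions: PendingRequest and arbitrateOnce are hypothetical names, the destination is stored in the request instead of being looked up through getDestinationId(), queue i is assumed to hold the requests of interface i, and the shared L1 to L1 bus is left out. It repeatedly picks the oldest queue front, grants it if the request has waited for the arbitration delay and both its sender and its destination are free, and otherwise keeps it for a later cycle.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical request record; toID replaces the lookup done in the listing.
struct PendingRequest {
    long time;   // clock cycle the request was received
    int fromID;  // sending interface, equal to the index of its queue
    int toID;    // destination interface
};

// Runs one arbitration round and returns the granted requests. Requests
// that are not granted stay in their queues in the original order.
std::vector<PendingRequest> arbitrateOnce(
        std::vector<std::list<PendingRequest> >& queues,
        long cycle, long arbitrationDelay) {
    const long legalRequestTime = cycle - arbitrationDelay;
    std::vector<bool> deliverBusy(queues.size(), false);
    std::vector<bool> senderBusy(queues.size(), false);
    std::vector<std::list<PendingRequest> > delayed(queues.size());
    std::vector<PendingRequest> granted;

    for (;;) {
        // Find the queue whose front request is oldest (lowest index on ties).
        int oldest = -1;
        for (std::size_t i = 0; i < queues.size(); ++i) {
            if (queues[i].empty()) continue;
            if (oldest < 0 ||
                queues[i].front().time < queues[oldest].front().time) {
                oldest = static_cast<int>(i);
            }
        }
        if (oldest < 0) break;  // every queue has been drained

        PendingRequest req = queues[oldest].front();
        queues[oldest].pop_front();

        bool oldEnough = req.time <= legalRequestTime;
        if (oldEnough && !deliverBusy[req.toID] && !senderBusy[req.fromID]) {
            deliverBusy[req.toID] = true;
            senderBusy[req.fromID] = true;
            granted.push_back(req);
        } else {
            delayed[oldest].push_back(req);
        }
    }

    // Put the requests that were not granted back into their queues.
    for (std::size_t i = 0; i < queues.size(); ++i) queues[i].swap(delayed[i]);
    return granted;
}
```

Because the oldest request is always considered first, a request that loses a conflict in one round is ahead of all younger requests in the next round.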
C.1.8 Crossbar Code File
#include "sim/builder.hh"
#include "crossbar.hh"

using namespace std;

void
Crossbar::request(Tick time, int fromID){
    requests++;

    if(isFirstRequest){
        // NOTE: This cannot be initialised before all interfaces have registered.
        // Consequently, doing it when the first request arrives is safe
        for(int i=0;i<allInterfaces.size();i++){
            requestQueues.push_back(new list<InterconnectRequest*>);
        }
        isFirstRequest = false;
    }

    insertIntoList(requestQueues[fromID], time, fromID);
    if(!blocked) scheduleArbitrationEvent(time + arbitrationDelay);
}

void
Crossbar::insertIntoList(list<InterconnectRequest*>* inList,
                         Tick reqTime,
                         int fromID){
    if(inList->empty() || inList->back()->time <= reqTime){
        // fast common case
        inList->push_back(new InterconnectRequest(reqTime, fromID));
    }
    else{
        list<InterconnectRequest*>::iterator pos;
        for(pos = inList->begin();
            pos != inList->end();
            pos++){
            if((*pos)->time > reqTime) break;
        }
        inList->insert(pos, new InterconnectRequest(reqTime, fromID));
    }

#ifdef DEBUG_CROSSBAR
    /* Make sure that the queues are sorted */
    for(int i=0;i<requestQueues.size();i++){
        assert(isSorted(requestQueues[i]));
    }
#endif //DEBUG_CROSSBAR
}

void
Crossbar::scheduleArbitrationEvent(Tick candidateTime){
    bool found = false;
    for(int i=0;i<arbitrationEvents.size();i++){
i f ( a rb i t r a t i onEvent s [ i ]−>when ( ) == candidateTime ) found = true ;
}
i f ( ! found ){
In te r connec tArb i t ra t i onEvent ∗ event =
new In te r connec tArb i t ra t i onEvent ( this ) ;
event−>schedu le ( candidateTime ) ;
a rb i t r a t i onEvent s . push back ( event ) ;
}
}
void
Crossbar::arbitrate(Tick cycle){
    assert(!blocked);
    Tick legalRequestTime = cycle - arbitrationDelay;
    /* create storage for the arbitration process */
    vector<bool> deliverBusy(allInterfaces.size(), false);
    vector<bool> senderBusy(allInterfaces.size(), false);
    bool toOtherL1Busy = false;
    if(cycle < nextBusFreeTime) toOtherL1Busy = true;
    vector<list<InterconnectRequest*> >
        delayedRequests(allInterfaces.size());
    for(int i=0;i<delayedRequests.size();i++){
        delayedRequests[i] = list<InterconnectRequest*>();
    }
    while(moreRequestsAvailable()){
        /* start with the oldest request, regardless of which queue it is in.
           Consequently, starvation is avoided */
        int smallestId = -1;
        Tick smallest = TICK_T_MAX;
        for(int i=0;i<requestQueues.size();i++){
            if(!requestQueues[i]->empty()
               && requestQueues[i]->front()->time < smallest){
                smallest = requestQueues[i]->front()->time;
                smallestId = i;
            }
        }
        assert(smallestId >= 0);
        InterconnectRequest* tempReq = requestQueues[smallestId]->front();
        requestQueues[smallestId]->pop_front();
        if(tempReq->time <= legalRequestTime){
            int toID = getDestinationId(tempReq->fromID);
            if(toID == -1){
                /* this was a null request, remove it and move on */
                delete tempReq;
                continue;
            }
            if(allInterfaces[toID]->isMaster()
               && allInterfaces[tempReq->fromID]->isMaster()){
                if(toOtherL1Busy){
                    delayedRequests[tempReq->fromID].push_back(tempReq);
                    continue;
                }
                else{
                    toOtherL1Busy = true;
                    nextBusFreeTime = (cycle + arbitrationDelay);
                    /* update statistics */
                    arbitratedRequests++;
                    totalArbQueueCycles +=
                        (cycle - tempReq->time) - arbitrationDelay;
                    totalArbitrationCycles += arbitrationDelay;
                    allInterfaces[tempReq->fromID]->grantData();
                    delete tempReq;
                    continue;
                }
            }
            if(!deliverBusy[toID] && !senderBusy[tempReq->fromID]){
                /* request can be delivered */
                /* update statistics */
                arbitratedRequests++;
                totalArbQueueCycles +=
                    (cycle - tempReq->time) - arbitrationDelay;
                totalArbitrationCycles += arbitrationDelay;
                allInterfaces[tempReq->fromID]->grantData();
                /* mark both channels busy before the request is freed,
                   since fromID is read from the request object */
                deliverBusy[toID] = true;
                senderBusy[tempReq->fromID] = true;
                delete tempReq;
                continue;
            }
        }
        delayedRequests[tempReq->fromID].push_back(tempReq);
    }
    /* put not granted requests back in the request queues */
    assert(requestQueues.size() == delayedRequests.size());
    for(int i=0;i<requestQueues.size();i++){
        assert(requestQueues[i]->empty());
        (*requestQueues[i]) = delayedRequests[i];
    }
    /* check if we need to schedule new arbitration events */
    Tick nextArbTime = TICK_T_MAX;
    bool addArbEvent = false;
    for(int i=0;i<requestQueues.size();i++){
        if(!requestQueues[i]->empty()){
            InterconnectRequest* first = requestQueues[i]->front();
            Tick candidateTime = first->time + arbitrationDelay;
            if(candidateTime <= cycle){
                if((cycle + 1) < nextArbTime){
                    addArbEvent = true;
                    nextArbTime = cycle + 1;
                }
            }
            else{
                if(candidateTime < nextArbTime){
                    addArbEvent = true;
                    nextArbTime = candidateTime;
                }
            }
        }
    }
    if(addArbEvent){
        assert(nextArbTime != TICK_T_MAX);
        scheduleArbitrationEvent(nextArbTime);
    }
}
bool
Crossbar::moreRequestsAvailable(){
    bool moreReqs = false;
    for(int i=0;i<requestQueues.size();i++){
        if(!requestQueues[i]->empty()) moreReqs = true;
    }
    return moreReqs;
}
int
Crossbar::getDestinationId(int fromID){
    if(allInterfaces[fromID]->isMaster()){
        pair<Addr, int> tmp = allInterfaces[fromID]->getTargetAddr();
        Addr targetAddr = tmp.first;
        int toInterfaceId = tmp.second;
        // The request was a null request, remove
        if(targetAddr == 0) return -1;
        // we already know the to interface id if it's an L1 to L1 transfer
        if(toInterfaceId != -1) return toInterfaceId;
        return getTarget(targetAddr);
    }
    int retID = allInterfaces[fromID]->getTargetId();
    assert(retID != -1);
    return retID;
}
void
Crossbar::send(MemReqPtr& req, Tick time, int fromID){
    assert(!blocked);
    assert((req->size / width) <= 1);
    int toID = -1;
    bool busIsUsed = false;
    if(allInterfaces[fromID]->isMaster() && req->toInterfaceID != -1){
        busIsUsed = true;
        toID = req->toInterfaceID;
    }
    else if(allInterfaces[fromID]->isMaster()){
        toID = getTarget(req->paddr);
    }
    else{
        toID = req->fromInterfaceID;
    }
    // update profile stats
    if(doProfiling){
        if(busIsUsed){
            // the coherence bus is not pipelined
            channelUseCycles[allInterfaces.size()] += transferDelay;
        }
        else{
            // regular pipelined crossbar channels used
            // one pipeline slot is allocated in both the sender's
            // and the receiver's channel
            channelUseCycles[fromID] += 1;
            channelUseCycles[toID] += 1;
        }
    }
    deliverQueue.push_back(new InterconnectDelivery(time, fromID, toID, req));
    /* check if we need to schedule a deliver event */
    Tick deliverTime = time + transferDelay;
    bool found = false;
    for(int i=0;i<deliverEvents.size();i++){
        if(deliverEvents[i]->when() == deliverTime) found = true;
    }
    if(!found){
        InterconnectDeliverQueueEvent* event =
            new InterconnectDeliverQueueEvent(this);
        event->schedule(deliverTime);
        deliverEvents.push_back(event);
    }
}
void
Crossbar::deliver(MemReqPtr& req, Tick cycle, int toID, int fromID){
    assert(!blocked);
    assert(!req);
    assert(toID == -1);
    assert(fromID == -1);
#ifdef DEBUG_CROSSBAR
    assert(isSorted(&deliverQueue));
#endif //DEBUG_CROSSBAR
    Tick legalGrantTime = cycle - transferDelay;
    /* attempt to deliver as many requests as possible */
    /* since the queue is sorted, starvation is not possible */
    while(!deliverQueue.empty()){
        InterconnectDelivery* delivery = deliverQueue.front();
        /* check if this grant has experienced the proper delay */
        /* since the requests are sorted, we know that all others are later */
        if(delivery->grantTime > legalGrantTime) break;
        deliverQueue.pop_front();
        /* update statistics */
        sentRequests++;
        int curCpuId = delivery->req->xc->cpu->params->cpu_id;
        int queueCycles = (cycle - delivery->grantTime) - transferDelay;
        totalTransQueueCycles += queueCycles;
        totalTransferCycles += transferDelay;
        perCpuTotalTransQueueCycles[curCpuId] += queueCycles;
        perCpuTotalTransferCycles[curCpuId] += transferDelay;
        int retval = BA_NO_RESULT;
        if(allInterfaces[delivery->toID]->isMaster()){
            allInterfaces[delivery->toID]->deliver(delivery->req);
        }
        else{
            retval = allInterfaces[delivery->toID]->access(delivery->req);
        }
        delete delivery;
        /* if the cache returns blocked we cannot deliver any more data */
        if(retval == BA_BLOCKED) break;
    }
}
void
Crossbar::setBlocked(int fromInterface){
    if(blocked) warn("blocking on a second cause");
    blocked = true;
    waitingFor = fromInterface;
    blockedAt = curTick;
    numSetBlocked++;
    blockedInterfaces.push_back(fromInterface);
    for(int i=0;i<arbitrationEvents.size();++i){
        if(arbitrationEvents[i]->scheduled()){
            arbitrationEvents[i]->deschedule();
        }
        delete arbitrationEvents[i];
    }
    arbitrationEvents.clear();
    for(int i=0;i<deliverEvents.size();++i){
        if(deliverEvents[i]->scheduled()){
            deliverEvents[i]->deschedule();
        }
        delete deliverEvents[i];
    }
    deliverEvents.clear();
}
void
Crossbar::clearBlocked(int fromInterface){
    assert(blocked);
    assert(blockedAt > -1);
    int hitIndex = -1;
    int hitCount = 0;
    for(int i=0;i<blockedInterfaces.size();i++){
        if(blockedInterfaces[i] == fromInterface){
            hitIndex = i;
            hitCount++;
        }
    }
    assert(hitCount == 1 && hitIndex > -1);
    blockedInterfaces.erase(blockedInterfaces.begin()+hitIndex);
    if(blockedInterfaces.empty()){
        blocked = false;
        numClearBlocked++;
        assert(arbitrationEvents.empty());
        if(moreRequestsAvailable()){
            /* schedule new arbitration event */
            Tick firstReq = TICK_T_MAX;
            for(int i=0;i<requestQueues.size();i++){
                if(!requestQueues[i]->empty()
                   && requestQueues[i]->front()->time < firstReq){
                    firstReq = requestQueues[i]->front()->time;
                }
            }
            assert(firstReq != TICK_T_MAX);
            InterconnectArbitrationEvent* event =
                new InterconnectArbitrationEvent(this);
            arbitrationEvents.push_back(event);
            if(firstReq <= curTick) event->schedule(curTick);
            else event->schedule(firstReq);
        }
        assert(deliverEvents.empty());
        if(!deliverQueue.empty()){
            /* schedule new deliver event */
            Tick firstGrant = (deliverQueue.front())->grantTime;
            InterconnectDeliverQueueEvent* event =
                new InterconnectDeliverQueueEvent(this);
            deliverEvents.push_back(event);
            if(firstGrant <= curTick) event->schedule(curTick);
            else event->schedule(firstGrant);
        }
        blockedAt = -1;
    }
}
vector<int>
Crossbar::getChannelSample(){
    if(!doProfiling) doProfiling = true;
    std::vector<int> retval(channelUseCycles);
    for(int i=0;i<channelUseCycles.size();i++){
        channelUseCycles[i] = 0;
    }
    return retval;
}
void
Crossbar::writeChannelDecriptor(std::ofstream &stream){
    for(int i=0;i<allInterfaces.size();i++){
        stream << "Channel " << i << ": "
               << allInterfaces[i]->getCacheName() << "\n";
    }
    stream << "Channel " << allInterfaces.size() << ": Coherence bus\n";
}
#ifdef DEBUG_CROSSBAR
void
Crossbar::printRequests(){
    for(int i=0;i<requestQueues.size();i++){
        cout << i << ": ";
        for(list<InterconnectRequest*>::iterator j=requestQueues[i]->begin();
            j != requestQueues[i]->end();
            j++){
            InterconnectRequest* current = *j;
            cout << "("
                 << current->fromID
                 << ", "
                 << current->time
                 << ") ";
        }
        cout << "\n";
    }
    cout << "\n";
}
#endif //DEBUG_CROSSBAR
#ifndef DOXYGEN_SHOULD_SKIP_THIS
BEGIN_DECLARE_SIM_OBJECT_PARAMS(Crossbar)
    Param<int> width;
    Param<int> clock;
    Param<int> transferDelay;
    Param<int> arbitrationDelay;
    Param<int> cpu_count;
    SimObjectParam<HierParams *> hier;
END_DECLARE_SIM_OBJECT_PARAMS(Crossbar)
BEGIN_INIT_SIM_OBJECT_PARAMS(Crossbar)
    INIT_PARAM(width, "the width of the crossbar transmission channels"),
    INIT_PARAM(clock, "crossbar clock"),
    INIT_PARAM(transferDelay, "crossbar transfer delay in CPU cycles"),
    INIT_PARAM(arbitrationDelay, "crossbar arbitration delay in CPU cycles"),
    INIT_PARAM(cpu_count, "the number of CPUs in the system"),
    INIT_PARAM_DFLT(hier,
                    "Hierarchy global variables",
                    &defaultHierParams)
END_INIT_SIM_OBJECT_PARAMS(Crossbar)
CREATE_SIM_OBJECT(Crossbar)
{
    return new Crossbar(getInstanceName(),
                        width,
                        clock,
                        transferDelay,
                        arbitrationDelay,
                        cpu_count,
                        hier);
}
REGISTER_SIM_OBJECT("Crossbar", Crossbar)
#endif //DOXYGEN_SHOULD_SKIP_THIS
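The core of the crossbar arbitration in the listing above can be illustrated in isolation. The following is a minimal sketch, not the simulator code: `Request` and `arbitrateOldestFirst` are hypothetical names, the destination is stored directly in the request instead of being looked up through the interface, and the coherence bus and null-request handling are left out. It assumes that every queue is sorted by ascending request time and that all port IDs are smaller than the number of queues.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical, simplified stand-ins for the simulator types.
typedef long long Tick;

struct Request {
    Tick time;   // cycle at which the request was issued
    int fromID;  // requesting port, also the index of its queue
    int toID;    // destination port
};

// Runs one arbitration round. The oldest queued request is always
// considered first, and at most one request is granted per sender and
// per destination. Requests that are too young or that conflict with an
// earlier grant are put back in the queue of their sender.
std::vector<Request> arbitrateOldestFirst(
    std::vector<std::list<Request> >& queues, Tick cycle, Tick arbDelay)
{
    const Tick legalRequestTime = cycle - arbDelay;
    std::vector<bool> senderBusy(queues.size(), false);
    std::vector<bool> receiverBusy(queues.size(), false);
    std::vector<std::list<Request> > delayed(queues.size());
    std::vector<Request> granted;

    while (true) {
        // Find the queue whose head is the oldest pending request.
        int oldest = -1;
        for (std::size_t i = 0; i < queues.size(); ++i) {
            if (queues[i].empty()) continue;
            if (oldest < 0
                || queues[i].front().time < queues[oldest].front().time) {
                oldest = static_cast<int>(i);
            }
        }
        if (oldest < 0) break;  // all queues are drained

        Request req = queues[oldest].front();
        queues[oldest].pop_front();

        const bool oldEnough = req.time <= legalRequestTime;
        if (oldEnough && !senderBusy[req.fromID] && !receiverBusy[req.toID]) {
            senderBusy[req.fromID] = true;
            receiverBusy[req.toID] = true;
            granted.push_back(req);
        } else {
            delayed[req.fromID].push_back(req);
        }
    }

    // Requests that were not granted wait for the next round.
    queues.swap(delayed);
    return granted;
}
```

Because the sketch copies requests by value, it sidesteps the ownership questions that the pointer-based simulator code has to handle when a granted request is deleted.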
C.1.9 Ideal Interconnect Header File
#ifndef __IDEAL_INTERCONNECT_HH__
#define __IDEAL_INTERCONNECT_HH__
#include <iostream>
#include <vector>
#include <list>
#include "interconnect.hh"
#define DEBUG_IDEAL_INTERCONNECT
/**
* This class implements an ideal interconnect. Ideal in this context means that
* a request will experience the specified delay, but an unlimited number of
* requests will be granted access in parallel. The rationale for this choice
* is that transmission delays are mainly due to physical factors that cannot
* be changed by architectural techniques.
*
* @author Magnus Jahre
*/
class IdealInterconnect : public Interconnect
{
  private:
    std::list<InterconnectRequest* > requestQueue;
    std::list<InterconnectDelivery* > grantQueue;
    std::vector<int> blockedInterfaces;
    void scheduleArbitrationEvent(Tick cycle);
#ifdef DEBUG_IDEAL_INTERCONNECT
    void printRequestQueue();
    void printGrantQueue();
#endif //DEBUG_IDEAL_INTERCONNECT
  public:
    /**
    * This constructor initialises a few member variables and passes on
    * parameters to the Interconnect constructor. The cache implementation
    * requires that the interconnect has a finite width. Consequently, a
    * width equal to the cache block size should be provided.
    *
    * @param _name      The object name from the configuration file. This
    *                   is passed on to BaseHier and SimObject
    * @param _width     The bit width of the transmission lines in the
    *                   interconnect
    * @param _clock     The number of processor cycles in one interconnect
    *                   clock cycle.
    * @param _transDelay The end-to-end transfer delay through the
    *                   interconnect in CPU cycles
    * @param _arbDelay  The length of an arbitration in CPU cycles
    * @param _cpu_count The number of processors in the system
    * @param _hier      Hierarchy parameters for BaseHier
    *
    * @see Interconnect
    */
    IdealInterconnect(const std::string &_name,
                      int _width,
                      int _clock,
                      int _transDelay,
                      int _arbDelay,
                      int _cpu_count,
                      HierParams *_hier)
        : Interconnect(_name,
                       _width,
                       _clock,
                       _transDelay,
                       _arbDelay,
                       _cpu_count,
                       _hier){
        if(_width <= 0){
            fatal("The idealInterconnect must have a finite width, "
                  "or else the cache implementation won't work");
        }
        transferDelay = _transDelay;
        arbitrationDelay = _arbDelay;
    }
    /**
    * Empty destructor.
    */
    ~IdealInterconnect(){ /* does nothing */ }
    /**
    * This method puts the requests in a queue according to the clock cycle
    * the request was received. Consequently, the queue is kept sorted in
    * ascending order. This simplifies arbitration.
    *
    * @param time   The clock cycle the method was called
    * @param fromID The interface ID of the requesting interface
    */
    void request(Tick time, int fromID);
    /**
    * The send method is called when an interface is granted access and
    * finds the destination interface from the values in the request. Then,
    * it adds the request to a delivery queue and schedules a delivery
    * event if needed.
    *
    * @param req    The memory request to send.
    * @param time   The clock cycle the method was called at.
    * @param fromID The interface ID of the sender interface.
    */
    void send(MemReqPtr& req, Tick time, int fromID);
    /**
    * The ideal interconnect arbitration method grants access to all
    * requests that can be granted access every time it runs. A request
    * can be granted if it has experienced the specified arbitration
    * delay.
    *
    * @param cycle The clock cycle the method was called.
    */
    void arbitrate(Tick cycle);
    /**
    * This method tries to deliver as many requests as possible to their
    * destinations. Only requests that have experienced the defined delay
    * can be delivered. However, if an L2 bank blocks, some requests that
    * are old enough might not be delivered. Since the delivery queue
    * is kept sorted, the oldest requests are delivered first.
    *
    * Since this class uses a delivery queue, all parameters except
    * cycle are discarded.
    *
    * @param req    Not used, must be NULL.
    * @param cycle  The clock cycle the method is called.
    * @param toID   Not used, must be -1.
    * @param fromID Not used, must be -1.
    */
    void deliver(MemReqPtr& req, Tick cycle, int toID, int fromID);
    /**
    * When this method is called, all arbitration events and delivery
    * events are descheduled.
    *
    * @param fromInterface The interface that is blocked
    */
    void setBlocked(int fromInterface);
    /**
    * This method is called when an L2 cache unblocks. Depending on the
    * requests and deliveries waiting, a new arbitration event and
    * delivery event are scheduled.
    *
    * @param fromInterface The interface that unblocks
    */
    void clearBlocked(int fromInterface);
    /**
    * Since there are an unlimited number of channels in the ideal
    * interconnect, this method returns -1.
    *
    * @return Always -1
    *
    * @see InterconnectProfile
    */
    int getChannelCount(){
        return -1;
    }
    /**
    * It makes no sense to get a channel sample in an ideal interconnect,
    * so if this method is called it prints a fatal error message.
    *
    * @return An empty vector
    */
    std::vector<int> getChannelSample(){
        fatal("IdealInterconnect has no channels");
        std::vector<int> retval;
        return retval;
    }
    /**
    * The ideal interconnect has an unlimited number of channels.
    * Consequently, this method reports a fatal error if it is called.
    *
    * @param stream The stream that is never used
    */
    void writeChannelDecriptor(std::ofstream &stream){
        fatal("IdealInterconnect has no channel descriptor");
    }
};
#endif //__IDEAL_INTERCONNECT_HH__
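The arbitration rule documented for `arbitrate` in the header above (grant every request that has experienced the arbitration delay, however many there are) can be sketched as a prefix extraction from a time-sorted list. The names `Request` and `takeReadyRequests` are hypothetical and requests are stored by value for brevity. Unlike the listing, the sketch does not inspect the last element first, so it also accepts an empty list.

```cpp
#include <cassert>
#include <list>

// Hypothetical, simplified stand-ins for the simulator types.
typedef long long Tick;

struct Request {
    Tick time;   // cycle at which the request was issued
    int fromID;  // requesting interface
};

// Moves every request that has waited at least arbDelay cycles out of
// `pending` and returns them in their original order. `pending` must be
// sorted by ascending time, so the ready requests form a prefix of it.
std::list<Request> takeReadyRequests(std::list<Request>& pending,
                                     Tick cycle, Tick arbDelay)
{
    std::list<Request> ready;
    std::list<Request>::iterator firstYoung = pending.begin();
    while (firstYoung != pending.end()
           && firstYoung->time + arbDelay <= cycle) {
        ++firstYoung;
    }
    // splice relinks the list nodes instead of copying the elements
    ready.splice(ready.end(), pending, pending.begin(), firstYoung);
    return ready;
}
```

The head of whatever remains in `pending` after the call determines when the next arbitration event is needed, namely at its request time plus the arbitration delay.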
C.1.10 Ideal Interconnect Code File
#include "sim/builder.hh"
#include "ideal_interconnect.hh"
using namespace std;
void
IdealInterconnect::request(Tick time, int fromID){
    requests++;
    // keep linked list of requests sorted at all times
    // first request takes priority over later requests at same cycle
    list<InterconnectRequest* >::iterator findPos;
    for(findPos=requestQueue.begin();
        findPos!=requestQueue.end();
        findPos++){
        InterconnectRequest* tempReq = *findPos;
        if(time < tempReq->time) break;
    }
    requestQueue.insert(findPos, new InterconnectRequest(time, fromID));
#ifdef DEBUG_IDEAL_INTERCONNECT
    /* check that the queue is sorted */
    InterconnectRequest* prev = NULL;
    bool first = true;
    for(list<InterconnectRequest* >::iterator i=requestQueue.begin();
        i!=requestQueue.end();
        i++){
        if(first){
            first = false;
            prev = *i;
            continue;
        }
        assert(prev->time <= (*i)->time);
        prev = *i;
    }
    // printRequestQueue();
#endif //DEBUG_IDEAL_INTERCONNECT
    /* add arbitration event if we are not blocked */
    if(!blocked){
        scheduleArbitrationEvent(time + arbitrationDelay);
    }
}
void
IdealInterconnect::send(MemReqPtr& req, Tick time, int fromID){
    assert(!blocked);
    assert((req->size / width) <= 1);
    bool isFromMaster = allInterfaces[fromID]->isMaster();
    if(req->toInterfaceID != -1){
        /* cache-to-cache transfer */
        assert(req->toProcessorID != -1);
        assert(req->toInterfaceID == getInterconnectID(req->toProcessorID));
        grantQueue.push_back(new InterconnectDelivery(time,
                                                      fromID,
                                                      req->toInterfaceID,
                                                      req));
    }
    else if(isFromMaster){
        /* receiver is a slave interface */
        int recvCount = 0;
        int recvID = -1;
        for(int i=0;i<allInterfaces.size();++i){
            if(allInterfaces[i]->isMaster()) continue;
            if(allInterfaces[i]->inRange(req->paddr)){
                recvCount++;
                recvID = i;
            }
        }
        /* check for errors */
        if(recvCount > 1){
            fatal("More than one supplier for address in IdealInterconnect");
        }
        if(recvCount != 1){
            fatal("No supplier for address in IdealInterconnect");
        }
        assert(recvID >= 0);
        grantQueue.push_back(new InterconnectDelivery(time,
                                                      fromID,
                                                      recvID,
                                                      req));
    }
    else{
        /* receiver is a master interface */
        grantQueue.push_back(new InterconnectDelivery(time,
                                                      fromID,
                                                      req->fromInterfaceID,
                                                      req));
    }
#ifdef DEBUG_IDEAL_INTERCONNECT
    /* check that the queue is sorted */
    InterconnectDelivery* prev = NULL;
    bool first = true;
    for(list<InterconnectDelivery* >::iterator i=grantQueue.begin();
        i!=grantQueue.end();
        i++){
        if(first){
            first = false;
            prev = *i;
            continue;
        }
        assert(prev->grantTime <= (*i)->grantTime);
        prev = *i;
    }
#endif //DEBUG_IDEAL_INTERCONNECT
    bool found = false;
    for(int i=0;i<deliverEvents.size();i++){
        if(deliverEvents[i]->when() == (time + transferDelay)) found = true;
    }
    if(!found){
        InterconnectDeliverQueueEvent* event =
            new InterconnectDeliverQueueEvent(this);
        event->schedule(time + transferDelay);
        deliverEvents.push_back(event);
    }
}
void
IdealInterconnect::arbitrate(Tick cycle){
    assert(!blocked);
    list<InterconnectRequest* > tempGrantQueue;
    InterconnectRequest* lastReq = requestQueue.back();
    if((lastReq->time + arbitrationDelay) <= cycle){
        // all requests can be issued
        tempGrantQueue.splice(tempGrantQueue.end(), requestQueue);
    }
    else{
        list<InterconnectRequest* >::iterator firstOldReq;
        for(firstOldReq = requestQueue.begin();
            firstOldReq != requestQueue.end();
            firstOldReq++){
            if(((*firstOldReq)->time + arbitrationDelay) > cycle) break;
        }
        tempGrantQueue.splice(tempGrantQueue.end(),
                              requestQueue,
                              requestQueue.begin(),
                              firstOldReq);
    }
    for(list<InterconnectRequest* >::iterator i = tempGrantQueue.begin();
        i != tempGrantQueue.end();
        i++){
        /* update statistics */
        arbitratedRequests++;
        totalArbQueueCycles += (cycle - (*i)->time) - arbitrationDelay;
        totalArbitrationCycles += arbitrationDelay;
        allInterfaces[(*i)->fromID]->grantData();
        delete *i;
    }
    if(!requestQueue.empty()){
        InterconnectRequest* firstReq = requestQueue.front();
        scheduleArbitrationEvent(firstReq->time + arbitrationDelay);
    }
}
void
IdealInterconnect::scheduleArbitrationEvent(Tick time){
    bool arbEventExists = false;
    for(int i=0;i<arbitrationEvents.size();++i){
        if(time == arbitrationEvents[i]->when()) arbEventExists = true;
    }
    if(!arbEventExists){
        InterconnectArbitrationEvent* event =
            new InterconnectArbitrationEvent(this);
        event->schedule(time);
        arbitrationEvents.push_back(event);
    }
}
void
IdealInterconnect::deliver(MemReqPtr& req, Tick cycle, int toID, int fromID){
    assert(!blocked);
    assert(!req);
    assert(toID == -1);
    assert(fromID == -1);
    list<InterconnectDelivery* > grantsThisCycle;
    InterconnectDelivery* lastReq = grantQueue.back();
    if((lastReq->grantTime + transferDelay) <= cycle){
        // all requests can be issued
        grantsThisCycle.splice(grantsThisCycle.end(), grantQueue);
    }
    else{
        list<InterconnectDelivery* >::iterator firstOldReq;
        for(firstOldReq = grantQueue.begin();
            firstOldReq != grantQueue.end();
            firstOldReq++){
            if(((*firstOldReq)->grantTime + transferDelay) > cycle) break;
        }
        grantsThisCycle.splice(grantsThisCycle.end(),
                               grantQueue,
                               grantQueue.begin(),
                               firstOldReq);
    }
    while(!grantsThisCycle.empty()){
        InterconnectDelivery* tempGrant = grantsThisCycle.front();
        grantsThisCycle.pop_front();
        /* update statistics */
        sentRequests++;
        int queueCycles = (cycle - tempGrant->grantTime) - transferDelay;
        int curCpuId = tempGrant->req->xc->cpu->params->cpu_id;
        totalTransQueueCycles += queueCycles;
        totalTransferCycles += transferDelay;
        perCpuTotalTransQueueCycles[curCpuId] += queueCycles;
        perCpuTotalTransferCycles[curCpuId] += transferDelay;
        int retval = BA_NO_RESULT;
        if(allInterfaces[tempGrant->toID]->isMaster()){
            allInterfaces[tempGrant->toID]->deliver(tempGrant->req);
        }
        else{
            retval = allInterfaces[tempGrant->toID]->access(tempGrant->req);
        }
        /* this delivery got through, free memory */
        delete tempGrant;
        /* if the cache returns blocked we cannot deliver any more data */
        if(retval == BA_BLOCKED) break;
    }
    if(!grantsThisCycle.empty()){
        grantQueue.splice(grantQueue.begin(),
                          grantsThisCycle,
                          grantsThisCycle.begin(),
                          grantsThisCycle.end());
    }
}
void
IdealInterconnect::setBlocked(int fromInterface){
    if(blocked) warn("blocking on a second cause");
    blocked = true;
    waitingFor = fromInterface;
    blockedAt = curTick;
    numSetBlocked++;
    blockedInterfaces.push_back(fromInterface);
    for(int i=0;i<arbitrationEvents.size();++i){
        if(arbitrationEvents[i]->scheduled()){
            arbitrationEvents[i]->deschedule();
        }
        delete arbitrationEvents[i];
    }
    arbitrationEvents.clear();
    for(int i=0;i<deliverEvents.size();++i){
        if(deliverEvents[i]->scheduled()){
            deliverEvents[i]->deschedule();
        }
        delete deliverEvents[i];
    }
    deliverEvents.clear();
}
void
IdealInterconnect::clearBlocked(int fromInterface){
    assert(blocked);
    assert(blockedAt > -1);
    int hitIndex = -1;
    int hitCount = 0;
    for(int i=0;i<blockedInterfaces.size();i++){
        if(blockedInterfaces[i] == fromInterface){
            hitIndex = i;
            hitCount++;
        }
    }
    assert(hitCount == 1 && hitIndex > -1);
    blockedInterfaces.erase(blockedInterfaces.begin()+hitIndex);
    if(blockedInterfaces.empty()){
        blocked = false;
        numClearBlocked++;
        assert(arbitrationEvents.empty());
        if(!requestQueue.empty()){
            /* schedule new arbitration event */
            Tick firstReq = (requestQueue.front())->time;
            InterconnectArbitrationEvent* event =
                new InterconnectArbitrationEvent(this);
            arbitrationEvents.push_back(event);
            if((firstReq + arbitrationDelay) <= curTick){
                event->schedule(curTick);
            }
            else{
                event->schedule(firstReq + arbitrationDelay);
            }
        }
        assert(deliverEvents.empty());
        if(!grantQueue.empty()){
            /* schedule new deliver event */
            Tick firstGrant = (grantQueue.front())->grantTime;
            InterconnectDeliverQueueEvent* event =
                new InterconnectDeliverQueueEvent(this);
            deliverEvents.push_back(event);
            if((firstGrant + transferDelay) <= curTick){
                event->schedule(curTick);
            }
            else{
                event->schedule(firstGrant + transferDelay);
            }
        }
        blockedAt = -1;
    }
}
#ifdef DEBUG_IDEAL_INTERCONNECT
void
IdealInterconnect::printRequestQueue(){
    cout << "ReqQueue: ";
    for(list<InterconnectRequest* >::iterator i = requestQueue.begin();
        i != requestQueue.end();
        i++){
        cout << "(" << (*i)->time << ", " << (*i)->fromID << ") ";
    }
    cout << "\n";
}
void
IdealInterconnect::printGrantQueue(){
    cout << "GrantQueue: ";
    for(list<InterconnectDelivery* >::iterator i = grantQueue.begin();
        i != grantQueue.end();
        i++){
        cout << "("
             << (*i)->grantTime
             << ", "
             << (*i)->fromID
             << ", "
             << (*i)->toID
             << ") ";
    }
    cout << "\n";
}
#endif //DEBUG_IDEAL_INTERCONNECT
#ifndef DOXYGEN_SHOULD_SKIP_THIS
BEGIN_DECLARE_SIM_OBJECT_PARAMS(IdealInterconnect)
    Param<int> width;
    Param<int> clock;
    Param<int> transferDelay;
    Param<int> arbitrationDelay;
    Param<int> cpu_count;
    SimObjectParam<HierParams *> hier;
END_DECLARE_SIM_OBJECT_PARAMS(IdealInterconnect)
BEGIN_INIT_SIM_OBJECT_PARAMS(IdealInterconnect)
    INIT_PARAM(width, "ideal interconnect width, set this to the cache line "
                      "width"),
    INIT_PARAM(clock, "ideal interconnect clock"),
    INIT_PARAM(transferDelay, "ideal interconnect transfer delay in CPU "
                              "cycles"),
    INIT_PARAM(arbitrationDelay, "ideal interconnect arbitration delay in CPU "
                                 "cycles"),
    INIT_PARAM(cpu_count, "the number of CPUs in the system"),
    INIT_PARAM_DFLT(hier,
                    "Hierarchy global variables",
                    &defaultHierParams)
END_INIT_SIM_OBJECT_PARAMS(IdealInterconnect)
CREATE_SIM_OBJECT(IdealInterconnect)
{
    return new IdealInterconnect(getInstanceName(),
                                 width,
                                 clock,
                                 transferDelay,
                                 arbitrationDelay,
                                 cpu_count,
                                 hier);
}
REGISTER_SIM_OBJECT("IdealInterconnect", IdealInterconnect)
#endif //DOXYGEN_SHOULD_SKIP_THIS
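Both the request queue and the grant queue in the listing above rely on staying sorted by time, with earlier arrivals keeping priority among requests issued in the same cycle. The following is a minimal sketch of that insertion rule under simplifying assumptions: `Request` and `insertSorted` are hypothetical names, and requests are stored by value rather than as heap-allocated objects.

```cpp
#include <cassert>
#include <list>

// Hypothetical, simplified stand-ins for the simulator types.
typedef long long Tick;

struct Request {
    Tick time;   // cycle at which the request was issued
    int fromID;  // requesting interface
};

// Inserts a request so that `queue` stays sorted by ascending time.
// The new request is placed behind all entries with the same time, so
// requests from the same cycle keep their arrival order.
void insertSorted(std::list<Request>& queue, Tick time, int fromID)
{
    std::list<Request>::iterator pos = queue.begin();
    while (pos != queue.end() && pos->time <= time) {
        ++pos;
    }
    Request r = {time, fromID};
    queue.insert(pos, r);
}
```

The linear scan matches the approach in the listing. It is usually cheap in this setting, since a new request tends to carry the largest time seen so far and the queues are typically short.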
C.1.11 Interconnect Interface Header File
#ifndef __INTERCONNECT_INTERFACE_HH__
#define __INTERCONNECT_INTERFACE_HH__
#include <iostream>
#include "base/range.hh"
#include "targetarch/isa_traits.hh" // for Addr
#include "mem/bus/base_interface.hh"
#include "interconnect.hh"
class Interconnect;
/**
* This class glues the interconnect extensions together with the rest of the M5
* memory system.
*
* This class is based on the bus_interface.hh class in standard M5 but has been
* rewritten from scratch. Consequently, a number of methods that are not used
* are not implemented.
*
* @author Magnus Jahre
*/
class InterconnectInterface : public BaseInterface
{
  protected:
    int interfaceID;
    Interconnect* thisInterconnect;
    bool trace_on;
    int dataSends;
    int instSends;
    int coherenceSends;
    int totalSends;
    bool doProfiling;
  public:
/∗∗
∗ This con s t ruc to r c r e a t e s an in t e r connec t i n t e r f a c e and i n i t i a l i s e s
∗ a few member v a r i a b l e s .
∗
∗ @param in t e r c onne c t A po in t e r to the i n t e r f a c e t h i s c l a s s i n t e r f a c e s
∗ to .
∗ @param name The name o f the c l a s s from the con f i g f i l e
∗ @param h i e r Hierarchy parameters f o r BaseHier
∗/
I n t e r c onn e c t I n t e r f a c e ( In t e r connec t ∗ i n t e r connec t ,
const std : : s t r i n g &name ,
HierParams ∗ h i e r )
: Ba s e In t e r f a c e (name , h i e r )
{
blocked = fa l se ;
t h i s I n t e r c onn e c t = in t e r c onne c t ;
t race on = fa l se ;
dataSends = 0 ;
        instSends = 0;
        coherenceSends = 0;
        totalSends = 0;
        doProfiling = false;
    }

    /**
     * Mark this interface as blocked.
     */
    void setBlocked();

    /**
     * Mark this interface as unblocked.
     */
    void clearBlocked();

    /**
     * Access the connected memory and make it perform a given request.
     *
     * @param req The request to perform.
     *
     * @return The result of the access.
     */
    virtual MemAccessResult access(MemReqPtr &req) = 0;

    /**
     * Request access to the interconnect.
     *
     * @param time The time to request the bus.
     */
    virtual void request(Tick time) = 0;

    /**
     * Respond to the given request at the given time.
     *
     * @param req  The request being responded to.
     * @param time The time the response is ready.
     */
    virtual void respond(MemReqPtr &req, Tick time) = 0;

    /**
     * This method must be implemented to fit into the rest of the M5
     * memory system. In this implementation it is never used and issues
     * a fatal error message if it is called.
     *
     * @return Always false
     */
    bool grantAddr(){
        fatal("CrossbarInterface grantAddr() method not implemented\n");
        return false;
    }

    /**
     * This method is used when an interface is granted access to the bus.
     *
     * @return True if another request is outstanding.
     */
    virtual bool grantData() = 0;

    /**
     * The interconnects do not support snooping, so this method issues a
     * fatal error message if it is called.
     *
     * @param req Not used
     */
    void snoop(MemReqPtr &req){
        fatal("CrossbarInterface snoop not implemented");
    }

    /**
     * The interconnects do not support snooping, so this method issues a
     * fatal error message if it is called.
     *
     * @param req Not used
     */
    void snoopResponse(MemReqPtr &req){
        fatal("CrossbarInterface snoopResponse not implemented");
    }

    /**
     * The interconnects do not support snooping, so this method issues a
     * fatal error message if it is called.
     *
     * @param req Not used
     */
    void snoopResponseCall(MemReqPtr &req){
        fatal("CrossbarInterface snoopResponse not implemented");
    }

    /**
     * This method is never called with the configurations used in this
     * work. Consequently, it is not implemented.
     *
     * @param req    Not used.
     * @param update Not used.
     *
     * @return Nothing
     */
    Tick sendProbe(MemReqPtr &req, bool update){
        fatal("CrossbarSlave sendProbe() method not implemented");
        return -1;
    }

    /**
     * This method is never called with the configurations used in this
     * work. Consequently, it is not implemented.
     *
     * @param req    Not used.
     * @param update Not used.
     *
     * @return Nothing
     */
    Tick probe(MemReqPtr &req, bool update){
        fatal("CrossbarSlave probe() method not implemented");
        return -1;
    }

    /**
     * This method is never called in the configuration used in this work
     * and is not implemented.
     *
     * @param range_list Not used
     */
    void collectRanges(std::list<Range<Addr> > &range_list){
        fatal("CrossbarSlave collectRanges() method not implemented");
    }

    /**
     * Adds the address ranges of this interface to the provided list.
     *
     * @param range_list The list of ranges.
     */
    void getRange(std::list<Range<Addr> > &range_list);

    /**
     * Notify this interface of a range change in the interconnect.
     */
    void rangeChange();

    /**
     * Set the address ranges of this interface to the list provided. This
     * function removes any existing ranges.
     *
     * @param range_list List of addr ranges to add.
     */
    void setAddrRange(std::list<Range<Addr> > &range_list);

    /**
     * Add an address range for this interface.
     *
     * @param range The address range to add.
     */
    void addAddrRange(const Range<Addr> &range);

    /**
     * This method returns the number of requests sent since the last time
     * it was called. Furthermore, it divides the sends into data sends,
     * instruction sends and coherence sends as well as providing a grand
     * total.
     *
     * @param dataSends      A pointer to a memory location where the number
     *                       of data sends can be stored
     * @param instSends      A pointer to a memory location where the number
     *                       of instruction sends can be stored
     * @param coherenceSends A pointer to a memory location where the number
     *                       of coherence sends can be stored
     * @param totalSends     A pointer to a memory location where the total
     *                       number of sends can be stored
     *
     * @see InterconnectProfile
     */
    void getSendSample(int* dataSends,
                       int* instSends,
                       int* coherenceSends,
                       int* totalSends);

    /**
     * This method ensures that the profile values are updated consistently
     * from all interfaces. It is called by the subclasses.
     *
     * @param req The current memory request
     *
     * @see InterconnectProfile
     */
    void updateProfileValues(MemReqPtr &req);

    /**
     * This method delivers a response to the interconnect. In a slave
     * interface, this sends the request over the interconnect. In a
     * master interface, the request is delivered to the cache.
     *
     * @param req The request that will be delivered
     */
    virtual void deliver(MemReqPtr &req) = 0;

    /**
     * The interconnects often need to distinguish between a master and a
     * slave interface in an efficient manner. This method enables this.
     *
     * @return True if the interface is a master interface
     */
    virtual bool isMaster() = 0;

    /**
     * This method accesses the cache and finds out which interface the
     * next request should be sent to. Then, it returns a pair of the
     * address and the destination interface.
     *
     * Note that the destination interface might be -1. In this case, the
     * interconnect must find the destination itself by inspecting the
     * destination address.
     *
     * @return The address and the destination interface. If the destination
     *         is -1, the interconnect must derive the destination based on
     *         the requested address.
     */
    virtual std::pair<Addr, int> getTargetAddr() = 0;

    /**
     * This method returns the ID of the destination interface of the
     * request at the front of the request queue in a slave interface.
     *
     * @return The destination of the next request to be sent from a slave
     *         interface.
     */
    virtual int getTargetId() = 0;

    /**
     * Convenience method that returns the name of the cache associated with
     * a given interface.
     *
     * @return The name of the cache associated with a given interface.
     */
    virtual std::string getCacheName() = 0;
};

#endif // INTERCONNECT_INTERFACE_HH
C.1.12 Interconnect Interface Code File
#include <iostream>
#include <vector>

#include "interconnect_interface.hh"

using namespace std;

void
InterconnectInterface::setBlocked(){
    if(!blocked){
        blocked = true;
        thisInterconnect->setBlocked(interfaceID);
    }
}

void
InterconnectInterface::clearBlocked(){
    if(blocked){
        blocked = false;
        thisInterconnect->clearBlocked(interfaceID);
    }
}

void
InterconnectInterface::getRange(std::list<Range<Addr> > &range_list)
{
    for (int i = 0; i < ranges.size(); ++i) {
        range_list.push_back(ranges[i]);
    }
}

void
InterconnectInterface::rangeChange(){
    thisInterconnect->rangeChange();
}

void
InterconnectInterface::setAddrRange(list<Range<Addr> > &range_list){
    ranges.clear();
    while (!range_list.empty()) {
        ranges.push_back(range_list.front());
        range_list.pop_front();
    }
    rangeChange();
}

void
InterconnectInterface::addAddrRange(const Range<Addr> &range){
    ranges.push_back(range);
    rangeChange();
}

void
InterconnectInterface::getSendSample(int* data,
                                     int* inst,
                                     int* coherence,
                                     int* total){
    // start profiling if this is the first call to this function
    if(!doProfiling) doProfiling = true;

    // return the sampled values
    *data = dataSends;
    *inst = instSends;
    *coherence = coherenceSends;
    *total = totalSends;

    // reset counters
    dataSends = 0;
    instSends = 0;
    coherenceSends = 0;
    totalSends = 0;
}

void
InterconnectInterface::updateProfileValues(MemReqPtr &req){
    if(doProfiling){
        if(req->cmd.isDirectoryMessage()){
            coherenceSends++;
        }
        else if(req->readOnlyCache){
            instSends++;
        }
        else{
            dataSends++;
        }
        totalSends++;
    }
}
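
The sample-and-reset behaviour of getSendSample and updateProfileValues can be illustrated in isolation. The following is a minimal sketch and not part of the simulator: the names SendType, SendSample and SendCounters are hypothetical, and the message type is passed in directly instead of being derived from a memory request. As in the listing above, sends are only counted after the first sample has been taken.

```cpp
// Hypothetical message classes; the simulator derives these from the request.
enum SendType { DATA_SEND, INST_SEND, COHERENCE_SEND };

// One sample of the send counters.
struct SendSample {
    int data;
    int inst;
    int coherence;
    int total;
};

class SendCounters {
  public:
    SendCounters() : profiling(false) { reset(); }

    // Count one send. Sends issued before the first sample are ignored.
    void record(SendType type) {
        if (!profiling) return;
        switch (type) {
          case COHERENCE_SEND: ++current.coherence; break;
          case INST_SEND:      ++current.inst;      break;
          case DATA_SEND:      ++current.data;      break;
        }
        ++current.total;
    }

    // Return the counts accumulated since the previous call and reset them.
    // The first call also switches profiling on.
    SendSample sample() {
        profiling = true;
        SendSample result = current;
        reset();
        return result;
    }

  private:
    void reset() {
        current.data = 0;
        current.inst = 0;
        current.coherence = 0;
        current.total = 0;
    }

    bool profiling;
    SendSample current;
};
```

Because each call to sample() resets the counters, consecutive samples describe disjoint time intervals, which is what a periodic profile event needs.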
C.1.13 Interconnect Master Interface Header File
#ifndef INTERCONNECT_MASTER_HH
#define INTERCONNECT_MASTER_HH

#include <iostream>

#include "base/range.hh"
#include "targetarch/isa_traits.hh" // for Addr
#include "mem/bus/base_interface.hh"

#include "interconnect_interface.hh"
#include "interconnect.hh"

class Interconnect;

/**
 * This class implements a master interface as needed by the M5 cache
 * implementation. In the terminology of M5, master is synonymous with 'on the
 * processor side of an interconnect'.
 *
 * This class is based on the master_interface files in the original M5 memory
 * system but has been rewritten from scratch.
 *
 * @author Magnus Jahre
 */
template <class MemType>
class InterconnectMaster : public InterconnectInterface
{
  private:
    MemType* thisCache;

    Addr currentAddr;
    int currentToCpuId;
    bool currentValsValid;

    // debug
    std::vector<std::pair<Addr, Tick>* > outstandingRequestAddrs;

  public:
    /**
     * This constructor creates a master interface and registers it with the
     * associated interconnect. In the terminology of M5, master is
     * synonymous with 'on the processor side of an interconnect'.
     *
     * @param name         The name of the interface from the config file
     * @param interconnect A pointer to the associated interconnect
     * @param cache        A pointer to the associated cache
     * @param hier         Hierarchy parameters for BaseHier
     */
    InterconnectMaster(const std::string &name,
                       Interconnect* interconnect,
                       MemType* cache,
                       HierParams *hier);

    /**
     * Access the connected memory to perform the given request.
     *
     * @param req The request to perform.
     *
     * @return The result of the access.
     */
    MemAccessResult access(MemReqPtr &req);

    /**
     * Request access to the interconnect at the given time.
     *
     * @param time The time to request the bus.
     */
    void request(Tick time);

    /**
     * Responses are carried out through the deliver method in the master
     * interface. Consequently, this method exits with an error message if
     * it is called.
     *
     * @param req  Not used.
     * @param time Not used.
     */
    void respond(MemReqPtr &req, Tick time){
        fatal("CrossbarMaster respond method not implemented");
    }

    /**
     * When the interface is granted access to the interconnect, this method
     * is called. It retrieves the request with the highest priority from
     * the cache and provides it to the send method in the interconnect.
     *
     * @return True if another request is outstanding.
     */
    bool grantData();

    /**
     * This method delivers the request to the associated cache.
     *
     * @param req The memory request to deliver
     */
    void deliver(MemReqPtr &req);

    /**
     * Convenience method that identifies this interface as a master
     * interface.
     *
     * @return True, since this is a master interface
     */
    bool isMaster(){
        return true;
    }

    /**
     * This method accesses the cache and finds out which interface the
     * next request should be sent to. Then, it returns a pair of the
     * address and the destination interface.
     *
     * Note that the destination interface might be -1. In this case, the
     * interconnect must find the destination itself by inspecting the
     * destination address.
     *
     * @return The address and the destination interface. If the destination
     *         is -1, the interconnect must derive the destination based on
     *         the requested address.
     */
    std::pair<Addr, int> getTargetAddr();

    /**
     * This method is only valid for slave interfaces and produces a fatal
     * error message if it is called.
     *
     * @return Nothing
     */
    int getTargetId(){
        fatal("getTargetId() not valid for a MasterInterface");
        return -1;
    }

    /**
     * Convenience method that returns the name of the associated cache.
     *
     * @return The name of the associated cache
     */
    std::string getCacheName(){
        return thisCache->name();
    }
};

#endif // INTERCONNECT_MASTER_HH
C.1.14 Interconnect Master Interface Code File
#include <iostream>
#include <vector>

#include "interconnect.hh"
#include "interconnect_master.hh"

using namespace std;

template<class MemType>
InterconnectMaster<MemType>::InterconnectMaster(const string &name,
                                                Interconnect* interconnect,
                                                MemType* cache,
                                                HierParams *hier)
    : InterconnectInterface(interconnect, name, hier)
{
    thisCache = cache;

    interfaceID = thisInterconnect->registerInterface(this,
                                                      false,
                                                      cache->getProcessorID());
    thisCache->setInterfaceID(interfaceID);

    currentAddr = 0;
    currentToCpuId = -1;
    currentValsValid = false;

    if(trace_on) cout << "InterconnectMaster with id "
                      << interfaceID << " created\n";
}

template<class MemType>
MemAccessResult
InterconnectMaster<MemType>::access(MemReqPtr &req){

    // NOTE: copied from MasterInterface
    int satisfied_before = req->flags & SATISFIED;
    // Cache Coherence call goes here
    //mem->snoop(req);
    if (satisfied_before != (req->flags & SATISFIED)) {
        return BA_SUCCESS;
    }

    return BA_NO_RESULT;
}

template<class MemType>
void
InterconnectMaster<MemType>::deliver(MemReqPtr &req){

    if(!req->cmd.isDirectoryMessage()){
        pair<Addr, Tick>* hitPair = NULL;
        int hitIndex = -1;
        for(int i=0;i<outstandingRequestAddrs.size();++i){
            pair<Addr, Tick>* tmpPair = outstandingRequestAddrs[i];
            if(req->paddr == tmpPair->first){
                hitIndex = i;
                hitPair = tmpPair;
            }
        }

        /* check if this actually was an answer to something we requested */
        assert(hitIndex >= 0);
        outstandingRequestAddrs.erase(outstandingRequestAddrs.begin()+hitIndex);
        delete hitPair;
    }

    if(trace_on){
        cout << "Master "<< interfaceID <<" is waiting for: ";
        for(int i=0;i<outstandingRequestAddrs.size();++i){
            cout << "(" << outstandingRequestAddrs[i]->first
                 << ", " << outstandingRequestAddrs[i]->second << ") ";
        }
        cout << "at tick " << curTick << "\n";
    }

    if(trace_on) cout << "TRACE: MASTER RESPONSE id " << interfaceID
                      << " addr " << req->paddr << " at " << curTick << "\n";

    thisCache->handleResponse(req);
}

template<class MemType>
void
InterconnectMaster<MemType>::request(Tick time){

    if(trace_on) cout << "TRACE: MASTER REQUEST id " << interfaceID << " at "
                      << curTick << "\n";

    thisInterconnect->request(time, interfaceID);
}

template<class MemType>
bool
InterconnectMaster<MemType>::grantData(){

    MemReqPtr req = thisCache->getMemReq();

    if(!req){
        thisInterconnect->incNullRequests();
        return false;
    }

    if(trace_on) cout << "TRACE: MASTER SEND " << req->cmd.toString()
                      << " from id " << interfaceID << " addr " << req->paddr
                      << " at " << curTick << "\n";

    if(!req->cmd.isNoResponse()){
        if(!req->cmd.isDirectoryMessage()){
            outstandingRequestAddrs.push_back(
                    new pair<Addr, Tick>(req->paddr, curTick));
        }
    }

    req->fromInterfaceID = interfaceID;
    req->firstSendTime = curTick;
    req->readOnlyCache = thisCache->isInstructionCache();
    req->toInterfaceID =
            thisInterconnect->getInterconnectID(req->toProcessorID);
    // make sure destination was set properly
    if(req->toProcessorID != -1) assert(req->toInterfaceID != -1);

    if(currentValsValid){
        assert(currentAddr == req->paddr);
        assert(currentToCpuId == req->toProcessorID);
        currentValsValid = false;
    }

    // Update send profile
    updateProfileValues(req);

    // Currently sends can't fail, so all reqs will be a success
    thisCache->sendResult(req, true);
    thisInterconnect->send(req, curTick, interfaceID);

    return thisCache->doMasterRequest();
}

template<class MemType>
pair<Addr, int>
InterconnectMaster<MemType>::getTargetAddr(){

    MemReqPtr currentRequest = thisCache->getMemReq();

    if(!currentRequest) return pair<Addr, int>(0,-2);

    assert(currentRequest->paddr != 0);

    int toInterface =
        thisInterconnect->getInterconnectID(currentRequest->toProcessorID);
    if(currentRequest->toProcessorID != -1) assert(toInterface != -1);

    currentAddr = currentRequest->paddr;
    currentToCpuId = toInterface;
    currentValsValid = true;

    return pair<Addr, int>(currentRequest->paddr, toInterface);
}
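
The debug bookkeeping in grantData and deliver pairs each response with an outstanding request by physical address. The following is a minimal sketch of that idea and not the simulator code: the class OutstandingRequests is hypothetical, Addr and Tick are assumed to be 64-bit unsigned integers, and entries are stored by value rather than as heap-allocated pairs. Unlike the listing above, which scans the whole vector and therefore removes the last matching entry, the sketch removes the first (oldest) match.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumption: both types are 64-bit unsigned integers in this sketch.
typedef std::uint64_t Addr;
typedef std::uint64_t Tick;

class OutstandingRequests {
  public:
    // Remember that a request for addr was sent at tick sentAt.
    void add(Addr addr, Tick sentAt) {
        pending.push_back(std::make_pair(addr, sentAt));
    }

    // Remove the oldest entry that matches addr.
    // Returns false if no request for addr was outstanding.
    bool complete(Addr addr) {
        for (std::size_t i = 0; i < pending.size(); ++i) {
            if (pending[i].first == addr) {
                pending.erase(pending.begin()
                              + static_cast<std::ptrdiff_t>(i));
                return true;
            }
        }
        return false;
    }

    // Number of requests still waiting for a response.
    std::size_t size() const { return pending.size(); }

  private:
    std::vector<std::pair<Addr, Tick> > pending;
};
```

Storing the pairs by value removes the need for the matching new and delete calls, at the cost of copying a small pair on insertion.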
C.1.15 Interconnect Slave Interface Header File
#ifndef INTERCONNECT_SLAVE_HH
#define INTERCONNECT_SLAVE_HH

#include <iostream>

#include "base/range.hh"
#include "targetarch/isa_traits.hh" // for Addr
#include "mem/bus/base_interface.hh"

#include "interconnect_interface.hh"
#include "interconnect.hh"

class Interconnect;

/**
 * This class implements a slave interface as needed by the M5 cache
 * implementation. In the terminology of M5, slave is synonymous with 'on the
 * memory side of an interconnect'.
 *
 * This class is based on the slave_interface files in the original M5 memory
 * system but has been rewritten from scratch.
 *
 * @author Magnus Jahre
 */
template <class MemType>
class InterconnectSlave : public InterconnectInterface
{
  private:
    /**
     * Convenience class for storing cache responses from the time they are
     * requested until they are granted access.
     *
     * @author Magnus Jahre
     */
    class InterconnectResponse{
      public:
        MemReqPtr req;
        Tick time;

        /**
         * Stores the request and the time it was requested in this
         * object.
         *
         * @param _req  The memory request waiting for access to the
         *              interconnect
         * @param _time The clock cycle the response was received
         *
         * @author Magnus Jahre
         */
        InterconnectResponse(MemReqPtr &_req, Tick _time)
        {
            req = _req;
            time = _time;
        }
    };

    MemType* thisCache;
    std::vector<InterconnectResponse* > responseQueue;

  public:
    /**
     * This constructor creates an interconnect slave interface and
     * registers it with the provided interconnect.
     *
     * @param name         The name from the config file
     * @param interconnect A pointer to the interconnect
     * @param cache        A pointer to the cache
     * @param hier         Hierarchy parameters for BaseHier
     */
    InterconnectSlave(const std::string &name,
                      Interconnect* interconnect,
                      MemType* cache,
                      HierParams *hier);

    /**
     * Access the connected memory to perform the given request.
     *
     * @param req The request to perform.
     *
     * @return The result of the access.
     */
    MemAccessResult access(MemReqPtr &req);

    /**
     * The request method is not needed in slave interfaces and is not
     * implemented. If it is called, it issues a fatal error message.
     *
     * @param time Not used.
     */
    void request(Tick time){
        fatal("InterconnectSlave request(Tick time) not implemented");
    }

    /**
     * The respond method is called when the connected cache responds to an
     * access. Then, an InterconnectResponse object is allocated and put
     * into a queue. Furthermore, access to the interconnect is requested.
     *
     * @param req  The request being responded to.
     * @param time The time the response is ready.
     *
     * @see InterconnectResponse
     */
    void respond(MemReqPtr &req, Tick time);

    /**
     * Called when this interface is granted access to the interconnect.
     *
     * @return True if another request is outstanding.
     */
    bool grantData();

    /**
     * The deliver method is not used in a slave interface and issues a
     * fatal error message if it is called.
     *
     * @param req Not used.
     */
    void deliver(MemReqPtr &req){
        fatal("InterconnectSlave deliver() not implemented");
    }

    /**
     * Since this is a slave interface, this method always returns false.
     *
     * @return False, since this is a slave interface.
     */
    bool isMaster(){
        return false;
    }

    /**
     * This method has no relevance for a slave interface and issues a
     * fatal error message if it is called.
     *
     * @return Nothing.
     */
    std::pair<Addr, int> getTargetAddr(){
        fatal("getTargetAddr() not valid for slave interfaces");
        return std::pair<Addr, int>(-42,-42);
    }

    /**
     * This method is used to find the destination interface of the request
     * at the head of the response queue. This information is stored in the
     * request, so this method simply accesses this information.
     *
     * @return The destination interface.
     */
    int getTargetId(){
        assert(!responseQueue.empty());
        return responseQueue.front()->req->fromInterfaceID;
    }

    /**
     * Retrieves the name of the associated cache.
     *
     * @return The name of the associated cache.
     */
    std::string getCacheName(){
        return thisCache->name();
    }

    /**
     * This method is overridden here to implement modulo bank addressing.
     * The default in M5 is that each bank is responsible for a contiguous
     * part of the address space. This functionality is retained, but in
     * addition it implements modulo addressing. In this case the address
     * modulo the number of banks is used to decide which bank should
     * service a given request.
     *
     * A configuration option in the cache selects which bank addressing
     * type should be used.
     *
     * @param addr The address to be checked
     *
     * @return True if this interface is responsible for this address.
     */
    virtual bool inRange(Addr addr);
};

#endif // INTERCONNECT_SLAVE_HH
C.1.16 Interconnect Slave Interface Code File
#include <iostream>
#include <vector>

#include "interconnect.hh"
#include "interconnect_slave.hh"

using namespace std;

template<class MemType>
InterconnectSlave<MemType>::InterconnectSlave(const string &name,
                                              Interconnect* interconnect,
                                              MemType* cache,
                                              HierParams *hier)
    : InterconnectInterface(interconnect, name, hier)
{
    thisCache = cache;

    interfaceID = thisInterconnect->registerInterface(this,
                                                      true,
                                                      cache->getProcessorID());
    thisCache->setInterfaceID(interfaceID);

    if(trace_on) cout << "InterconnectSlave with id " << interfaceID
                      << " created\n";
}

template<class MemType>
MemAccessResult
InterconnectSlave<MemType>::access(MemReqPtr &req){

    bool already_satisfied = req->isSatisfied();
    if (already_satisfied && !req->cmd.isDirectoryMessage()) {
        warn("Request is already satisfied (InterconnectSlave: access())");
        return BA_NO_RESULT;
    }

    if (this->inRange(req->paddr)) {
        assert(!blocked);

        if(trace_on) cout << "TRACE: SLAVE ACCESS from id "
                          << req->fromInterfaceID << " addr " << req->paddr
                          << " at " << curTick << "\n";

        thisCache->access(req);

        assert(!this->isBlocked() || thisCache->isBlocked());
        if (this->isBlocked()) {
            // Out of MSHRs, now we block
            return BA_BLOCKED;
        } else {
            // This transaction went through ok
            return BA_SUCCESS;
        }
    }

    return BA_NO_RESULT;
}

template<class MemType>
void
InterconnectSlave<MemType>::respond(MemReqPtr &req, Tick time){

    if (!req->cmd.isNoResponse()) {

        if(trace_on) cout << "TRACE: SLAVE RESPONSE " << req->cmd.toString()
                          << " from id " << req->fromInterfaceID
                          << " addr " << req->paddr
                          << " at " << curTick << "\n";

        // handle directory requests
        if(req->toProcessorID != -1){
            // the sender interface receives slave responses
            req->fromInterfaceID =
                thisInterconnect->getInterconnectID(req->toProcessorID);
            assert(req->fromInterfaceID != -1);
        }

        responseQueue.push_back(new InterconnectResponse(req, time));
        thisInterconnect->request(time, interfaceID);
    }
}

template<class MemType>
bool
InterconnectSlave<MemType>::grantData(){

    assert(responseQueue.size() > 0);
    InterconnectResponse* response = responseQueue.front();
    responseQueue.erase(responseQueue.begin());

    MemReqPtr req = response->req;

    // Update send profile
    updateProfileValues(req);

    thisInterconnect->send(req, curTick, interfaceID);

    delete response;

    return !responseQueue.empty();
}

template<class MemType>
bool
InterconnectSlave<MemType>::inRange(Addr addr)
{
    assert(thisCache != NULL);

    if(thisCache->isModuloAddressedBank()){

        int bankID = thisCache->getBankID();
        int bankCount = thisCache->getBankCount();
        int localBlkSize = thisCache->getBlockSize();

        int bitCnt = 1;
        assert(localBlkSize != 0);
        while((localBlkSize >>= 1) != 1) bitCnt++;

        assert(bankID != -1);
        assert(bankCount != -1);

        Addr effectiveAddr = addr >> bitCnt;

        if((effectiveAddr % bankCount) == bankID) return true;
        return false;
    }
    else{
        for (int i = 0; i < ranges.size(); ++i) {
            if (addr == ranges[i]) {
                return true;
            }
        }
        return false;
    }
}
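
The modulo bank addressing branch of inRange can be illustrated in isolation. The following is a minimal sketch and not the simulator code: the function names log2BlockSize and isResponsibleBank are hypothetical, Addr is assumed to be a 64-bit unsigned integer, and the block size is assumed to be a power of two. The block offset bits are shifted away first, so consecutive cache blocks are interleaved across the banks.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: addresses are 64-bit unsigned integers in this sketch.
typedef std::uint64_t Addr;

// Number of block-offset bits, i.e. log2(blockSize).
// Assumes blockSize is a non-zero power of two.
inline int log2BlockSize(unsigned blockSize)
{
    assert(blockSize != 0 && (blockSize & (blockSize - 1)) == 0);
    int bits = 0;
    while (blockSize > 1) {
        blockSize >>= 1;
        ++bits;
    }
    return bits;
}

// True if the bank with ID bankID services addr when consecutive
// cache blocks are interleaved across bankCount banks.
inline bool isResponsibleBank(Addr addr, int bankID, int bankCount,
                              unsigned blockSize)
{
    assert(bankCount > 0 && bankID >= 0 && bankID < bankCount);
    Addr blockNumber = addr >> log2BlockSize(blockSize);
    return (blockNumber % static_cast<Addr>(bankCount))
           == static_cast<Addr>(bankID);
}
```

With 64-byte blocks and four banks, addresses 0x00 to 0x3F map to bank 0, addresses 0x40 to 0x7F map to bank 1, and address 0x100 wraps around to bank 0 again.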
C.1.17 Interconnect Profiler Header File
#ifndef INTERCONNECT_PROFILE_HH
#define INTERCONNECT_PROFILE_HH

#include "sim/sim_object.hh"
#include "sim/eventq.hh"
#include "interconnect.hh"

/** The number of clock cycles between each profile event */
#define RESOLUTION 250000

class Interconnect;
class InterconnectProfileEvent;

/** Differentiates send profile events from channel profile events */
typedef enum {SEND, CHANNEL} INTERCONNECT_PROFILE_TYPE;

/**
 * This class implements a simple profiler for interconnects. It retrieves some
 * statistics from the interconnect object and writes them to a file at a
 * regular interval. This interval is decided by the RESOLUTION definition at
 * the beginning of this file.
 *
 * Currently, this class provides two forms of profiles:
 * - Firstly, it profiles the number of sends injected into the interconnect.
 *   This gives a measure of the external pressure on the interconnect as a
 *   function of time.
 * - Secondly, it prints the utilisation of each channel inside the
 *   interconnect at regular intervals. This feature can be used to identify
 *   bottlenecks inside a given interconnect.
 *
 * @author Magnus Jahre
 */
class InterconnectProfile : public SimObject
{
  private:
    Interconnect* interconnect;
    bool traceSends;
    bool traceChannelUtil;
    Tick startTick;

    InterconnectProfileEvent* sendEvent;
    InterconnectProfileEvent* channelEvent;

    std::string sendFileName;
    std::string channelFileName;
    std::string channelExplFileName;

  public:
    /**
     * This constructor creates the interconnect profiler and schedules the
     * first profile event.
     *
     * @param name             The name from the config file
     * @param traceSends       Whether sends should be traced
     * @param traceChannelUtil Whether channel utilisation should be traced
     * @param startTick        The clock cycle the profiling will start
     * @param interconnect     A pointer to the interconnect that will be
     *                         profiled
∗/
I n t e r c onn e c tP r o f i l e ( const std : : s t r i n g & name ,
bool traceSends ,
bool t raceChanne lUt i l ,
Tick s ta r tT i ck ,
In t e r connec t ∗ i n t e r c onne c t ) ;
/∗∗
∗ This method i n i t i a l i s e s the send p r o f i l e f i l e .
∗/
void i n i t S endF i l e ( ) ;
/∗∗
∗ This method i s c a l l e d when a p r o f i l e event i s e r v i c ed . I t r e t r i e v e s
∗ the cur rent p r o f i l e va lue s from the in t e r connec t and wr i t e s them to
∗ the send p r o f i l e f i l e .
∗/
void writeSendEntry ( ) ;
/∗∗
∗ This method i n i t i a l i s e s the channel f i l e . I t checks i f the
∗ i n t e r connec t supports channel p r o f i l i n g and i n i t a l i s e s the f i l e i f i t
∗ does . The reason f o r t h i s check i s that channel p r o f i l i n g makes no
∗ s ense with an i d e a l i n t e r connec t .
∗
∗ @return True , i f the i n t e r connec t supports channel p r o f i l i n g .
∗/
bool i n i tChanne lF i l e ( ) ;
/∗∗
∗ This method r e t r i e v e s the updated channel u t i l i s a t i o n from the
∗ i n t e r connec t and wr i t e s the se va lue s to the channel u t i l i s a t i o n f i l e .
∗/
void writeChannelEntry ( ) ;
} ;
/**
 * This class implements a profile event. It schedules itself at regular
 * intervals when it has been started and calls the appropriate methods for the
 * statistics to be written to the profile files.
 *
 * @author Magnus Jahre
 */
class InterconnectProfileEvent : public Event
{
  public:
    InterconnectProfile* profiler;
    INTERCONNECT_PROFILE_TYPE traceType;

    /**
     * Initialises the member variables.
     *
     * @param _profiler A pointer to the associated profiler object
     * @param _type     The type of profiler
     */
    InterconnectProfileEvent(InterconnectProfile* _profiler,
                             INTERCONNECT_PROFILE_TYPE _type)
        : Event(&mainEventQueue)
    {
        profiler = _profiler;
        traceType = _type;
    }

    /**
     * This method is called when the event is serviced. It calls a method
     * of the profiler object according to the event type and schedules
     * itself RESOLUTION ticks later.
     */
    void process(){
        switch(traceType){
            case SEND:
                profiler->writeSendEntry();
                break;
            case CHANNEL:
                profiler->writeChannelEntry();
                break;
            default:
                fatal("Unimplemented interconnect trace type");
        }
        this->schedule(curTick + RESOLUTION);
    }

    /**
     * @return A textual description of the event
     */
    virtual const char* description(){
        return "InterconnectProfileEvent";
    }
};

#endif // INTERCONNECT_PROFILE_HH
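The profile event above reschedules itself a fixed number of ticks after each time it is serviced. The following is a minimal, self-contained sketch of that pattern. It does not use the M5 event queue: `ToyEventQueue`, `PeriodicSampler` and `sampleTicksFor` are hypothetical names introduced only for illustration, and the sampler merely records the tick at which it fired instead of writing a profile line.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Tick = std::uint64_t;

// Hypothetical stand-in for a simulator event queue (NOT the M5 API).
class ToyEventQueue {
  public:
    using Callback = std::function<void()>;

    void schedule(Tick when, Callback cb) {
        events.push(Entry{when, nextOrder++, std::move(cb)});
    }

    // Services every event scheduled at or before `until`, in tick order.
    void runUntil(Tick until) {
        while (!events.empty() && events.top().when <= until) {
            Entry e = events.top();
            events.pop();
            now = e.when;
            e.cb();
        }
    }

    Tick curTick() const { return now; }

  private:
    struct Entry {
        Tick when;
        std::uint64_t order;  // keeps insertion order for equal ticks
        Callback cb;
        bool operator>(const Entry& o) const {
            return when != o.when ? when > o.when : order > o.order;
        }
    };
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> events;
    Tick now = 0;
    std::uint64_t nextOrder = 0;
};

// Records the tick of each sample and reschedules itself `interval` ticks later.
class PeriodicSampler {
  public:
    PeriodicSampler(ToyEventQueue& q, Tick sampleInterval)
        : queue(q), interval(sampleInterval) {
        assert(interval > 0);
    }

    void start(Tick startTick) {
        queue.schedule(startTick, [this] { process(); });
    }

    const std::vector<Tick>& samples() const { return sampleTicks; }

  private:
    void process() {
        sampleTicks.push_back(queue.curTick());
        queue.schedule(queue.curTick() + interval, [this] { process(); });
    }

    ToyEventQueue& queue;
    Tick interval;
    std::vector<Tick> sampleTicks;
};

// Convenience wrapper: the ticks at which samples are taken up to `until`.
std::vector<Tick> sampleTicksFor(Tick start, Tick interval, Tick until) {
    ToyEventQueue q;
    PeriodicSampler sampler(q, interval);
    sampler.start(start);
    q.runUntil(until);
    return sampler.samples();
}
```

Calling `sampleTicksFor(100, 50, 260)` in this sketch yields samples at ticks 100, 150, 200 and 250; the event scheduled for tick 300 stays in the queue unserviced.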
C.1.18 Interconnect Profiler Code File
#include "interconnect_profile.hh"
#include "sim/builder.hh"

#include <fstream>

using namespace std;

class Interconnect;

InterconnectProfile::InterconnectProfile(const std::string& _name,
                                         bool _traceSends,
                                         bool _traceChannelUtil,
                                         Tick _startTick,
                                         Interconnect* _interconnect)
    : SimObject(_name)
{
    assert(_interconnect != NULL);

    traceSends = _traceSends;
    traceChannelUtil = _traceChannelUtil;
    startTick = _startTick;
    interconnect = _interconnect;

    interconnect->registerProfiler(this);

    sendFileName = "interconnectSendProfile.txt";
    channelFileName = "interconnectChannelProfile.txt";
    channelExplFileName = "interconnectChannelExplanation.txt";

    initSendFile();
    sendEvent = new InterconnectProfileEvent(this, SEND);
    sendEvent->schedule(startTick);

    bool doChannelTrace = initChannelFile();
    if(doChannelTrace){
        channelEvent = new InterconnectProfileEvent(this, CHANNEL);
        channelEvent->schedule(startTick);
    }
}
void
InterconnectProfile::initSendFile(){
    ofstream sendfile(sendFileName.c_str());
    sendfile << "Clock Cycle;Data Sends;Instruction Sends;"
             << "Coherence Sends;Total Sends\n";
    sendfile.flush();
    sendfile.close();
}

void
InterconnectProfile::writeSendEntry(){

    // get sample
    int data = 0, insts = 0, coherence = 0, total = 0;
    interconnect->getSendSample(&data, &insts, &coherence, &total);
    assert(data + insts + coherence == total);

    // write to file
    ofstream sendfile(sendFileName.c_str(), ofstream::app);
    sendfile << curTick << ";"
             << data << ";"
             << insts << ";"
             << coherence << ";"
             << total << "\n";
    sendfile.flush();
    sendfile.close();
}
bool
InterconnectProfile::initChannelFile(){

    int channelCount = interconnect->getChannelCount();

    if(channelCount != -1){
        // Write first line in trace file
        ofstream chanfile(channelFileName.c_str());
        chanfile << "Clock Cycle;";
        for(int i=0; i<channelCount; i++){
            chanfile << "Channel " << i;
            if(i == channelCount-1) chanfile << "\n";
            else chanfile << ";";
        }
        chanfile.flush();
        chanfile.close();

        // Write channel explanation
        ofstream explFile(channelExplFileName.c_str());
        interconnect->writeChannelDecriptor(explFile);
        explFile.flush();
        explFile.close();
        return true;
    }

    // interconnect does not support channel profiling
    return false;
}
void
InterconnectProfile::writeChannelEntry(){

    int channelCount = interconnect->getChannelCount();
    vector<int> res = interconnect->getChannelSample();
    assert(res.size() == channelCount);

    ofstream channelfile(channelFileName.c_str(), ofstream::app);
    channelfile << curTick << ";";
    for(int i=0; i<res.size(); i++){
        channelfile << ((double) res[i] / (double) RESOLUTION);
        if(i == res.size()-1) channelfile << "\n";
        else channelfile << ";";
    }
    channelfile.flush();
    channelfile.close();
}
#ifndef DOXYGEN_SHOULD_SKIP_THIS

BEGIN_DECLARE_SIM_OBJECT_PARAMS(InterconnectProfile)
    Param<bool> traceSends;
    Param<bool> traceChannelUtil;
    Param<Tick> traceStartTick;
    SimObjectParam<Interconnect*> interconnect;
END_DECLARE_SIM_OBJECT_PARAMS(InterconnectProfile)

BEGIN_INIT_SIM_OBJECT_PARAMS(InterconnectProfile)
    INIT_PARAM(traceSends, "Trace number of sends?"),
    INIT_PARAM(traceChannelUtil, "Trace channel utilisation?"),
    INIT_PARAM(traceStartTick, "The clock cycle to start the trace"),
    INIT_PARAM(interconnect, "The interconnect to profile")
END_INIT_SIM_OBJECT_PARAMS(InterconnectProfile)

CREATE_SIM_OBJECT(InterconnectProfile)
{
    return new InterconnectProfile(getInstanceName(),
                                   traceSends,
                                   traceChannelUtil,
                                   traceStartTick,
                                   interconnect);
}

REGISTER_SIM_OBJECT("InterconnectProfile", InterconnectProfile)

#endif //DOXYGEN_SHOULD_SKIP_THIS
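The channel profile written by writeChannelEntry() converts each channel's sample into a utilisation by dividing it by the length of the sampling interval, and emits one semicolon-separated row per interval. The sketch below isolates that formatting step. It is an illustration under assumptions, not the profiler itself: `formatChannelRow` is a hypothetical helper, the samples are assumed to be busy-cycle counts for the interval, and the separator is assumed to be a bare `;`.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Builds one profile row: "<tick>;<util 0>;<util 1>;...\n".
// `busyCycles[i]` is assumed to be the number of cycles channel i was busy
// during the last interval, and `resolution` the interval length in cycles.
std::string formatChannelRow(unsigned long long tick,
                             const std::vector<int>& busyCycles,
                             int resolution) {
    assert(resolution > 0);
    std::ostringstream row;
    row << tick << ";";
    for (std::size_t i = 0; i < busyCycles.size(); ++i) {
        // Utilisation is the fraction of the interval the channel was busy.
        row << static_cast<double>(busyCycles[i]) / resolution;
        row << (i + 1 == busyCycles.size() ? "\n" : ";");
    }
    return row.str();
}
```

With an interval of 100 cycles, samples of 50, 100 and 0 busy cycles at tick 1000 give the row `1000;0.5;1;0`, so a fully saturated channel appears as 1 and an idle one as 0.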
C.2 Coherence Protocol Extension Code
C.2.1 Directory Protocol Header File
#ifndef DIRECTORY_PROTOCOL_HH
#define DIRECTORY_PROTOCOL_HH

#include "sim/sim_object.hh"
#include "mem/cache/base_cache.hh"
#include "mem/mem_req.hh"
#include "mem/cache/cache_blk.hh"
#include "mem/cache/tags/cache_tags.hh"

#define OUTFILENAME "coherencetrace.txt"

class BaseCache;
template <class TagStore> class DirectoryProtocolDumpEvent;
/**
 * This class implements an interface between the directory protocol
 * implementation added in this work and the M5 cache implementation. To add
 * more directory implementations a new subclass can be added to this file.
 *
 * The responsibility of this class is to specify convenience methods that are
 * needed for more than one protocol. In the current implementation, this
 * functionality is limited to protocol trace facilities for debugging,
 * coherence message tracing and access to the directory.
 *
 * @author Magnus Jahre
 */
template <class TagStore>
class DirectoryProtocol
{
  protected:
    BaseCache* cache;
    typedef typename TagStore::BlkType BlkType;
    std::string cacheName;
    std::string protocol;
    int directoryCpuCount;
    int directoryCpuID;
    std::map<Addr, int> blockStore;
    bool doTrace;
    int traceStart;
    DirectoryProtocolDumpEvent<TagStore>* dumpEvent;
    double lastNumRedirectedReads;
    double lastNumOwnerRequests;
    double lastNumOwnerWritebacks;
    double lastNumSharerWritebacks;
    double lastNumNACKs;
    std::string dumpFileName;
  public:
    MemReqList directoryRequests;

    // Stats
    Stats::Scalar<> numRedirectedReads;
    Stats::Scalar<> numOwnerRequests;
    Stats::Scalar<> numOwnerWritebacks;
    Stats::Scalar<> numSharerWritebacks;
    Stats::Scalar<> numNACKs;

  public:
    /**
     * This constructor creates the message trace and message profile if
     * needed.
     *
     * @param _name         The name from the config file.
     * @param _protocol     The name of the protocol that will be used
     * @param _doTrace      If true, the protocol actions are written to a
     *                      trace file
     * @param _dumpInterval The number of clock cycles between each protocol
     *                      profile event
     * @param _traceStart   The clock cycle to start tracing protocol
     *                      actions
     */
    DirectoryProtocol(const std::string& _name,
                      const std::string& _protocol,
                      bool _doTrace,
                      int _dumpInterval,
                      int _traceStart);

    /**
     * Empty destructor.
     */
    virtual ~DirectoryProtocol(){}
    /**
     * This method is called in the cache builder and makes sure the
     * protocol knows which cache it is associated with.
     *
     * @param _cache A pointer to the associated cache
     */
    void setCache(BaseCache* _cache){
        cache = _cache;
        assert(cache != NULL);
    }

    /**
     * This method sets the number of CPUs and the CPU ID of the associated
     * cache for the directory protocol. It is called in the cache
     * constructor.
     *
     * @param num_cpus The number of CPUs in the system.
     * @param cpuId    The CPU ID of the cache associated with this
     *                 protocol.
     */
    void setCpuCount(int num_cpus, int cpuId){
        directoryCpuCount = num_cpus;
        directoryCpuID = cpuId;
    }

    /**
     * Registers the statistics variables with the M5 statistics module.
     */
    void regStats();
    /**
     * Writes a trace line into the trace file.
     *
     * @param cachename    The name of the cache that called this method
     * @param message      The message to write to the file
     * @param owner        The ID of the cache that owns the block
     * @param state        The state of the cache block
     * @param paddr        The address of the cache block
     * @param blkSize      The block size of the cache
     * @param presentFlags A bool array showing which caches have a copy of
     *                     the cache block
     */
    void writeTraceLine(const std::string cachename,
                        const std::string message,
                        const int owner,
                        const DirectoryState state,
                        const Addr paddr,
                        const Addr blkSize,
                        bool* presentFlags);

    /**
     * When this method is called, the coherence message profile is written
     * to the profile file and the counters are reset.
     */
    void dumpStats();

    /**
     * This method checks if a given address is owned.
     *
     * @param address The cache block address
     *
     * @return True if the cache block is owned
     */
    bool isOwned(Addr address);

    /**
     * Retrieves the CPU ID of the cache currently owning a block.
     *
     * @param address The cache block address.
     *
     * @return The CPU ID of the cache block owner.
     */
    int getOwner(Addr address);

    /**
     * This method sets the owner of a cache block.
     *
     * @param address  The cache block address.
     * @param newOwner The CPU ID of the new owner.
     */
    void setOwner(Addr address, int newOwner);

    /**
     * This method removes the owner of a given cache block.
     *
     * @param address The cache block address.
     */
    void removeOwner(Addr address);
    /**
     * This method is a part of the directory coherence interface and is
     * documented in the subclasses.
     */
    virtual void sendDirectoryMessage(MemReqPtr& req, int lat) = 0;

    /**
     * This method is a part of the directory coherence interface and is
     * documented in the subclasses.
     */
    virtual void sendNACK(MemReqPtr& req,
                          int lat,
                          int toID,
                          int fromID) = 0;

    /**
     * This method is a part of the directory coherence interface and is
     * documented in the subclasses.
     */
    virtual bool doDirectoryAccess(MemReqPtr& req) = 0;

    /**
     * This method is a part of the directory coherence interface and is
     * documented in the subclasses.
     */
    virtual bool doL1DirectoryAccess(MemReqPtr& req, BlkType* blk) = 0;

    /**
     * This method is a part of the directory coherence interface and is
     * documented in the subclasses.
     */
    virtual bool handleDirectoryResponse(MemReqPtr& req,
                                         TagStore* tags) = 0;

    /**
     * This method is a part of the directory coherence interface and is
     * documented in the subclasses.
     */
    virtual bool handleDirectoryFill(MemReqPtr& req,
                                     BlkType* blk,
                                     MemReqList& writebacks,
                                     TagStore* tags) = 0;

    /**
     * This method is a part of the directory coherence interface and is
     * documented in the subclasses.
     */
    virtual bool doDirectoryWriteback(MemReqPtr& req) = 0;

    /**
     * This method is a part of the directory coherence interface and is
     * documented in the subclasses.
     */
    virtual MemAccessResult handleL1DirectoryMiss(MemReqPtr& req) = 0;
};

/**
 * This class implements an event that dumps message statistics to a file.
 *
 * @author Magnus Jahre
 */
template <class TagStore>
class DirectoryProtocolDumpEvent : public Event
{
  public:
    DirectoryProtocol<TagStore>* protocol;
    int dumpInterval;

    /**
     * This constructor initialises the member variables with the provided
     * arguments.
     *
     * @param _protocol     A pointer to the directory protocol used
     * @param _dumpInterval The number of clock cycles between each dump
     */
    DirectoryProtocolDumpEvent(DirectoryProtocol<TagStore>* _protocol,
                               int _dumpInterval)
        : Event(&mainEventQueue)
    {
        protocol = _protocol;
        dumpInterval = _dumpInterval;
    }

    /**
     * This method is called when the event is serviced. It makes the
     * protocol dump the gathered statistics and reschedules itself.
     */
    void process(){
        protocol->dumpStats();
        this->schedule(curTick + dumpInterval);
    }

    /**
     * @return A textual description of the event.
     */
    virtual const char* description(){
        return "DirectoryProtocol dump event";
    }
};

#endif // DIRECTORY_PROTOCOL_HH
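The header above keeps a `lastNum...` member next to each statistics counter, and dumpStats() is documented as writing a per-interval message profile. One way to obtain per-interval values without clearing a cumulative counter is to remember the value seen at the previous dump and report the difference. The sketch below shows that idea in isolation; `IntervalCounter` is a hypothetical class for illustration and is not part of M5 or its `Stats` package.

```cpp
#include <cassert>

// Cumulative counter that can also report how much it grew since the
// previous dump. The running total itself is never cleared.
class IntervalCounter {
  public:
    void increment() { total += 1.0; }

    // Returns the growth since the last call and remembers the current total.
    double takeDelta() {
        double delta = total - lastDumped;
        lastDumped = total;
        return delta;
    }

    double value() const { return total; }

  private:
    double total = 0.0;
    double lastDumped = 0.0;
};
```

In this sketch, three increments followed by `takeDelta()` report 3, two further increments report 2 at the next call, and `value()` still returns the cumulative 5, so end-of-simulation totals remain available alongside the interval profile.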
C.2.2 Directory Protocol Code File
#include "directory.hh"
#include "sim/builder.hh"

#include <fstream>

using namespace std;
// using namespace __gnu_cxx;

template<class TagStore>
DirectoryProtocol<TagStore>::DirectoryProtocol(const std::string& _cacheName,
                                               const std::string& _protocol,
                                               bool _doTrace,
                                               int _dumpInterval,
                                               int _traceStart){

    protocol = _protocol;
    cacheName = _cacheName;
    doTrace = _doTrace;
    traceStart = _traceStart;

    if(_dumpInterval != 0){
        dumpEvent =
            new DirectoryProtocolDumpEvent<TagStore>(this, _dumpInterval);
        dumpEvent->schedule(curTick + _dumpInterval);

        dumpFileName = "coherencedump." + cacheName + ".txt";
        ofstream dumpfile(dumpFileName.c_str());
        dumpfile << "Clock Cycle;Redirected Reads;Transfer Owner Request;"
                 << "Owner Writebacks;Sharer Writebacks;NACKs\n";
        dumpfile.flush();
        dumpfile.close();
    }
    else{
        dumpEvent = NULL;
    }

    lastNumRedirectedReads = 0;
    lastNumOwnerRequests = 0;
    lastNumOwnerWritebacks = 0;
    lastNumSharerWritebacks = 0;
    lastNumNACKs = 0;

    if(doTrace){
        ofstream tracefile(OUTFILENAME);
        tracefile << "M5 coherence trace:\n\n";
        tracefile.flush();
        tracefile.close();
    }
}
template<class TagStore>
void
DirectoryProtocol<TagStore>::regStats(){

    using namespace Stats;

    numRedirectedReads
        .name(name() + ".num_redirected_reads")
        .desc("total number of reads redirected to a different cache")
        ;

    numOwnerRequests
        .name(name() + ".num_owner_requests")
        .desc("total number of owner requests issued to already owned"
              " blocks")
        ;

    numOwnerWritebacks
        .name(name() + ".num_owner_writebacks")
        .desc("total number of times this cache has written back an owned"
              " block")
        ;

    numSharerWritebacks
        .name(name() + ".num_sharer_writebacks")
        .desc("total number of times this cache has written back a block"
              " owned by a different cache")
        ;

    numNACKs
        .name(name() + ".num_nacks")
        .desc("total number of negative acknowledgements sent from this"
              " cache")
        ;
}
template<class TagStore>
void
DirectoryProtocol<TagStore>::writeTraceLine(const std::string cachename,
                                            const std::string message,
                                            const int owner,
                                            const DirectoryState state,
                                            const Addr paddr,
                                            const Addr blkSize,
                                            bool* presentFlags){

    if(doTrace && curTick >= traceStart){

        Addr blkAddr = (paddr & ~((Addr) blkSize - 1));

        ofstream tracefile(OUTFILENAME, ofstream::app);
        tracefile << curTick
                  << "; " << cachename
                  << "; " << blkAddr << ", " << hex
                  << showbase << blkAddr << dec
                  << "; Owner " << owner
                  << "; " << state
                  << "; " << message;

        if(presentFlags != NULL){
            tracefile << "; [";
            for(int i=0; i<directoryCpuCount; i++){
                tracefile << (presentFlags[i] ? "1" : "0");
                if(i != (directoryCpuCount-1)) tracefile << ", ";
            }
            tracefile << "]";
        }

        tracefile << "\n";
        tracefile.flush();
        tracefile.close();
    }
}
template<class TagStore>
void
DirectoryProtocol<TagStore>::dumpStats(){

    ofstream dumpfile(dumpFileName.c_str(), ofstream::app);
    dumpfile << curTick << ";"
        << (numRedirectedReads.value() - lastNumRedirectedReads) << ";"
        << (numOwnerRequests.value() - lastNumOwnerRequests) << ";"
        << (numOwnerWritebacks.value() - lastNumOwnerWritebacks) << ";"
        << (numSharerWritebacks.value() - lastNumSharerWritebacks) << ";"
        << (numNACKs.value() - lastNumNACKs) << "\n";

    lastNumRedirectedReads = numRedirectedReads.value();
    lastNumOwnerRequests = numOwnerRequests.value();
    lastNumOwnerWritebacks = numOwnerWritebacks.value();
    lastNumSharerWritebacks = numSharerWritebacks.value();
    lastNumNACKs = numNACKs.value();

    dumpfile.flush();
    dumpfile.close();
}
template<class TagStore>
bool
DirectoryProtocol<TagStore>::isOwned(Addr address){
    if(blockStore.find(address) == blockStore.end()){
        return false;
    }
    return true;
}

template<class TagStore>
int
DirectoryProtocol<TagStore>::getOwner(Addr address){
    map<Addr, int>::iterator found = blockStore.find(address);
    if(found != blockStore.end()){
        return found->second;
    }
    return -1;
}

template<class TagStore>
void
DirectoryProtocol<TagStore>::setOwner(Addr address, int newOwner){
    blockStore[address] = newOwner;
}

template<class TagStore>
void
DirectoryProtocol<TagStore>::removeOwner(Addr address){
    map<Addr, int>::iterator eraseIt = blockStore.find(address);
    assert(eraseIt != blockStore.end());
    blockStore.erase(eraseIt);
}
#ifndef DOXYGEN_SHOULD_SKIP_THIS

// Include config files
// Must be included first to determine which caches we want
#include "mem/config/cache.hh"
#include "mem/config/compression.hh"

// Tag Templates
#if defined(USE_CACHE_LRU)
#include "mem/cache/tags/lru.hh"
#endif

#if defined(USE_CACHE_FALRU)
#include "mem/cache/tags/fa_lru.hh"
#endif

#if defined(USE_CACHE_IIC)
#include "mem/cache/tags/iic.hh"
#endif

#if defined(USE_CACHE_SPLIT)
#include "mem/cache/tags/split.hh"
#endif

#if defined(USE_CACHE_SPLIT_LIFO)
#include "mem/cache/tags/split_lifo.hh"
#endif

// Compression Templates
#include "base/compression/null_compression.hh"
#if defined(USE_LZSS_COMPRESSION)
#include "base/compression/lzss_compression.hh"
#endif

#if defined(USE_CACHE_FALRU)
template class DirectoryProtocol<CacheTags<FALRU, NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class DirectoryProtocol<CacheTags<FALRU, LZSSCompression> >;
#endif
#endif

#if defined(USE_CACHE_IIC)
template class DirectoryProtocol<CacheTags<IIC, NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class DirectoryProtocol<CacheTags<IIC, LZSSCompression> >;
#endif
#endif

#if defined(USE_CACHE_LRU)
template class DirectoryProtocol<CacheTags<LRU, NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class DirectoryProtocol<CacheTags<LRU, LZSSCompression> >;
#endif
#endif

#if defined(USE_CACHE_SPLIT)
template class DirectoryProtocol<CacheTags<Split, NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class DirectoryProtocol<CacheTags<Split, LZSSCompression> >;
#endif
#endif

#if defined(USE_CACHE_SPLIT_LIFO)
template class DirectoryProtocol<CacheTags<SplitLIFO, NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class DirectoryProtocol<CacheTags<SplitLIFO, LZSSCompression> >;
#endif
#endif

#endif // DOXYGEN_SHOULD_SKIP_THIS
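Two small building blocks recur in the directory code above: aligning a physical address to its cache block with the mask `paddr & ~(blkSize - 1)`, and a map from block address to owning CPU in which a missing entry means "not owned". The sketch below restates both outside the simulator. `blockAlign` and `OwnerDirectory` are hypothetical names, `Addr` is assumed to be a 64-bit unsigned integer, the block size is assumed to be a power of two, and `removeOwner` here reports a missing entry through its return value instead of asserting.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using Addr = std::uint64_t;  // assumption: 64-bit physical addresses

// Rounds `paddr` down to the start of its cache block.
// Only valid when `blkSize` is a non-zero power of two.
Addr blockAlign(Addr paddr, Addr blkSize) {
    assert(blkSize != 0 && (blkSize & (blkSize - 1)) == 0);
    return paddr & ~(blkSize - 1);
}

// Minimal owner table: an absent entry means the block has no owner.
class OwnerDirectory {
  public:
    bool isOwned(Addr blk) const { return owners.find(blk) != owners.end(); }

    // Returns the owning CPU ID, or -1 if the block is not owned.
    int getOwner(Addr blk) const {
        std::map<Addr, int>::const_iterator it = owners.find(blk);
        return it == owners.end() ? -1 : it->second;
    }

    void setOwner(Addr blk, int cpuId) { owners[blk] = cpuId; }

    // Returns true if an entry was actually removed.
    bool removeOwner(Addr blk) { return owners.erase(blk) > 0; }

  private:
    std::map<Addr, int> owners;
};
```

With 64-byte blocks, `blockAlign(0x12345, 64)` gives `0x12340`, so every address inside the same block maps to the same directory entry.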
C.2.3 Stenström Protocol Header File
#include "directory.hh"

/**
 * This class implements the Stenstrom directory cache coherence protocol.
 *
 * @author Magnus Jahre
 */
template <class TagStore>
class StenstromProtocol : public DirectoryProtocol<TagStore>
{
    using DirectoryProtocol<TagStore>::cache;
    using DirectoryProtocol<TagStore>::directoryRequests;
    using DirectoryProtocol<TagStore>::cacheName;
    using DirectoryProtocol<TagStore>::numRedirectedReads;
    using DirectoryProtocol<TagStore>::numOwnerRequests;
    using DirectoryProtocol<TagStore>::numOwnerWritebacks;
    using DirectoryProtocol<TagStore>::numSharerWritebacks;
    using DirectoryProtocol<TagStore>::numNACKs;

    typedef typename TagStore::BlkType BlkType;

  private:
    DirectoryProtocol<TagStore>* parentPtr;
    std::map<Addr, bool*> outstandingWritebackWSAddrs;
    std::map<Addr, int> outstandingOwnerTransAddrs;

  public:
    /**
     * This constructor creates a Stenstrom protocol object.
     *
     * @param _name         The name from the config file.
     * @param _protocol     The name of the protocol that will be used
     * @param _doTrace      If true, the protocol actions are written to a
     *                      trace file
     * @param _dumpInterval The number of clock cycles between each protocol
     *                      profile event
     * @param _traceStart   The clock cycle to start tracing protocol
     *                      actions
     */
    StenstromProtocol(const std::string& _name,
                      const std::string& _protocol,
                      bool _doTrace,
                      int _dumpInterval,
                      int _traceStart) :
        DirectoryProtocol<TagStore>(_name,
                                    _protocol,
                                    _doTrace,
                                    _dumpInterval,
                                    _traceStart)
    {
        parentPtr = dynamic_cast<DirectoryProtocol<TagStore>* >(this);
        assert(parentPtr != NULL);
    }

    /**
     * Depending on whether the cache is an L1 cache or an L2 cache, the
     * details of sending a message are different. This method hides
     * these details.
     *
     * @param req The request to send.
     * @param lat The number of clock cycles before the request should be
     *            sent.
     */
    void sendDirectoryMessage(MemReqPtr& req, int lat);

    /**
     * Negative Acknowledge messages (NACKs) are issued quite often by the
     * protocol. This method hides the details of how they are sent.
     *
     * @param req    The request to send
     * @param lat    The latency before the request is sent
     * @param toID   The receiver CPU ID
     * @param fromID The sender CPU ID
     */
    void sendNACK(MemReqPtr& req, int lat, int toID, int fromID);

    /**
     * This method handles directory accesses in the L2 cache access
     * method.
     *
     * @param req The current request
     *
     * @return True, if the request was handled by the protocol.
     */
    bool doDirectoryAccess(MemReqPtr& req);

    /**
     * This method handles directory accesses in the L1 cache access
     * method. It is only called if it is a hit in the cache.
     *
     * @param req The current request
     * @param blk A pointer to the current cache block
     *
     * @return True, if the request was handled by the protocol.
     */
    bool doL1DirectoryAccess(MemReqPtr& req, BlkType* blk);

    /**
     * When an L1 cache receives a response, some of these must be handled
     * without actually accessing the cache. These situations are handled
     * by this method.
     *
     * Some of these cases require accessing the tag store to acquire updated
     * information on a cache block. Consequently, the tag store is provided
     * as a parameter.
     *
     * @param req  The current request
     * @param tags A pointer to the cache tag store
     *
     * @return True, if the request was handled by the protocol.
     */
    bool handleDirectoryResponse(MemReqPtr& req, TagStore* tags);

    /**
     * When a cache fill is received, this must be checked by the coherence
     * protocol. These cases are handled by this method.
     *
     * @param req        The current request.
     * @param blk        The current cache block.
     * @param writebacks The current list of writebacks.
     * @param tags       A pointer to the cache's tag store.
     *
     * @return True, if the request was handled by the protocol.
     */
    bool handleDirectoryFill(MemReqPtr& req,
                             BlkType* blk,
                             MemReqList& writebacks,
                             TagStore* tags);

    /**
     * Writebacks of shared blocks require special handling. This method
     * checks the writebacks and handles them according to the protocol.
     *
     * @param req The writeback request
     *
     * @return True, if the request was handled by the protocol.
     */
    bool doDirectoryWriteback(MemReqPtr& req);

    /**
     * Some race conditions cause directory messages to miss in the L1
     * cache. These cases are handled by this method.
     *
     * @param req The current memory request
     *
     * @return If the return value is different from BA_NO_RESULT, the
     *         result should be used directly.
     */
    MemAccessResult handleL1DirectoryMiss(MemReqPtr& req);

  private:
    void setUpRedirectedRead(MemReqPtr& req,
                             int fromProcessorID,
                             int toProcessorID);

    void setUpRedirectedReadReply(MemReqPtr& req,
                                  int fromProcessorID,
                                  int toProcessorID);

    void setUpOwnerTransferInL2(MemReqPtr& req,
                                int oldOwner,
                                int newOwner);

    void setUpACK(MemReqPtr& req, int toID, int fromID);
};
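The `outstandingOwnerTransAddrs` map declared above records, per block address, an owner transfer that is still in flight. The sketch below illustrates one plausible use of such a table: a second transfer request for the same block is refused with a NACK until the first one has been acknowledged. This is a simplified illustration under that assumption, not the protocol implementation; `OwnerTransferTracker` and `TransferReply` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using Addr = std::uint64_t;  // assumption: 64-bit block addresses

enum class TransferReply { Started, Nack };

// Tracks blocks whose ownership is currently being transferred.
class OwnerTransferTracker {
  public:
    // Starts a transfer unless one is already in flight for this block.
    TransferReply request(Addr blk, int requesterCpu) {
        if (inFlight.find(blk) != inFlight.end()) {
            return TransferReply::Nack;  // requester has to retry later
        }
        inFlight[blk] = requesterCpu;
        return TransferReply::Started;
    }

    // Called when the transfer is acknowledged; unblocks the address.
    void acknowledge(Addr blk) { inFlight.erase(blk); }

    bool pending(Addr blk) const { return inFlight.find(blk) != inFlight.end(); }

  private:
    std::map<Addr, int> inFlight;  // block address -> requesting CPU ID
};
```

Serialising transfers per block in this way keeps the directory from having to reason about two new owners at once, at the cost of extra retry traffic when several caches write to the same block.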
C.2.4 Stenström Protocol Code File
#include "stenstrom.hh"

using namespace std;

template<class TagStore>
void
StenstromProtocol<TagStore>::sendDirectoryMessage(MemReqPtr& req, int lat){
    if(cache->isDirectoryAndL1DataCache()){
        directoryRequests.push_back(req);
        cache->setMasterRequest(Request_DirectoryCoherence, curTick + lat);
        return;
    }
    if(cache->isDirectoryAndL2Cache()){
        cache->respond(req, curTick + lat);
    }
}

template<class TagStore>
void
StenstromProtocol<TagStore>::sendNACK(MemReqPtr& req,
                                      int lat,
                                      int toID,
                                      int fromID){

    req->toProcessorID = toID;
    req->fromProcessorID = fromID;
    req->toInterfaceID = -1;
    req->fromInterfaceID = -1;
    req->dirNACK = true;

    numNACKs++;

    writeTraceLine(cacheName,
                   "NACK Sent",
                   -1,
                   DirNoState,
                   (req->paddr & ~((Addr) cache->getBlockSize() - 1)),
                   cache->getBlockSize(),
                   NULL);

    sendDirectoryMessage(req, lat);
}
template<class TagStore>
bool
StenstromProtocol<TagStore>::doDirectoryAccess(MemReqPtr& req){

    int lat = cache->getHitLatency();
    int fromCpuId = req->xc->cpu->params->cpu_id;
    Addr tmpL2BlkAddr = req->paddr & ~((Addr) cache->getBlockSize() - 1);

    // Directory info is only stored for data blocks
    if(!req->readOnlyCache){
        if(!parentPtr->isOwned(tmpL2BlkAddr)){

            if(req->dirNACK){
                if(req->cmd == DirSharerWriteback){
                    // the block is not shared anymore, discard message
                    writeTraceLine(cacheName,
                                   "Received NACK to block that is no longer "
                                   "shared, discarding message",
                                   -1,
                                   DirNoState,
                                   tmpL2BlkAddr,
                                   cache->getBlockSize(),
                                   NULL);
                    return true;
                }
                else if(req->cmd == DirOwnerTransfer){
                    // the old owner wrote back the block
                    // make the requester retransmit the request
                    assert(outstandingOwnerTransAddrs.find(tmpL2BlkAddr)
                           != outstandingOwnerTransAddrs.end());
                    outstandingOwnerTransAddrs.erase(
                        outstandingOwnerTransAddrs.find(tmpL2BlkAddr));

                    req->ownerWroteBack = true;
                    sendNACK(req, cache->getHitLatency(), fromCpuId, -1);
                    return true;
                }
                else{
                    fatal("Unimplemented NACK type in doDirectoryAccess()");
                    return true;
                }
            }

            if(req->cmd == DirRedirectRead){
                parentPtr->setOwner(tmpL2BlkAddr, fromCpuId);
                req->owner = fromCpuId;

                writeTraceLine(cacheName,
                               "Redirected Read to not owned block received, "
                               "requester is new owner",
                               parentPtr->getOwner(tmpL2BlkAddr),
                               DirNoState,
                               tmpL2BlkAddr,
                               cache->getBlockSize(),
                               NULL);

                req->toProcessorID = fromCpuId;
                req->fromProcessorID = -1;
                req->toInterfaceID = -1;
                req->fromInterfaceID = -1;

                sendDirectoryMessage(req, cache->getHitLatency());
                return true;
            }

            if(req->cmd == DirOwnerTransfer){
                // the previous owner wrote back the block
                // in the middle of the transfer
                // make the requester resend as a read
                assert(outstandingOwnerTransAddrs.find(tmpL2BlkAddr)
                       == outstandingOwnerTransAddrs.end());

                req->ownerWroteBack = true;
                sendNACK(req, cache->getHitLatency(), fromCpuId, -1);
                return true;
            }

            assert(!req->cmd.isDirectoryMessage());

            // The requesting cache is now the owner, do normal response
            parentPtr->setOwner(tmpL2BlkAddr, fromCpuId);
            req->owner = fromCpuId;
            req->fromProcessorID = -1;
        }
else {
i f ( req−>dirACK){
i f ( req−>cmd == DirOwnerTransfer ){
// owner t r an s f e r to t h i s b l o c k i s a l l owed again
outstandingOwnerTransAddrs . e r a s e (
outstandingOwnerTransAddrs . f i nd ( tmpL2BlkAddr ) ) ;
wr i teTraceLine ( cacheName ,
”Owner Trans fe r ACK rec i e v ed ” ,
parentPtr−>getOwner ( tmpL2BlkAddr ) ,
DirNoState ,
tmpL2BlkAddr ,
cache−>ge tB lockS i ze ( ) ,
NULL) ;
return true ;
}
else {
f a t a l ( ”ACK type not implemented ” ) ;
}
}
else i f ( req−>dirNACK){
i f ( req−>cmd == DirOwnerTransfer ){
// the owner had not r e c i e v ed the b l o c k ye t
// c l ean up and send a NACK to the r e qu e s t e r
// NOTE: use fromProcessorID here
// fromCPUId i d e n t i f i e s the o r i g i n a l r e qu e s t e r
a s s e r t ( fromCpuId == parentPtr−>getOwner ( tmpL2BlkAddr ) ) ;
a s s e r t ( req−>f romProcessorID > −1);
// r e s e t too o r i g i n a l owner and remove address
// from b locked l i s t
int reques te r ID = parentPtr−>getOwner ( tmpL2BlkAddr ) ;
parentPtr−>setOwner ( tmpL2BlkAddr , req−>f romProcessorID ) ;
outstandingOwnerTransAddrs . e r a s e (
outstandingOwnerTransAddrs . f i nd ( tmpL2BlkAddr ) ) ;
sendNACK( req , l a t , requesterID , −1);
return true ;
}
else i f ( req−>cmd == DirSharerWriteback ){
req−>dirNACK = fa l se ;
req−>toProcessor ID = parentPtr−>getOwner ( tmpL2BlkAddr ) ;
req−>f romProcessorID = −1;
                    req->toInterfaceID = -1;
                    writeTraceLine(cacheName,
                                   "Recieved NACK on sharer writeback, "
                                   "retransmitting",
                                   parentPtr->getOwner(tmpL2BlkAddr),
                                   DirNoState,
                                   tmpL2BlkAddr,
                                   cache->getBlockSize(),
                                   NULL);
                    sendDirectoryMessage(req, lat);
                    return true;
                }
                else{
                    fatal("Unimplemented NACK type (in L2)");
                }
            }
            else if(req->writeMiss){
                if(outstandingOwnerTransAddrs.find(tmpL2BlkAddr)
                   != outstandingOwnerTransAddrs.end()){
                    // destination was req->fromProcessorID
                    sendNACK(req, lat, fromCpuId, -1);
                    return true;
                }
                outstandingOwnerTransAddrs[tmpL2BlkAddr] = fromCpuId;
                // write to already owned block
                int oldOwner = parentPtr->getOwner(tmpL2BlkAddr);
                parentPtr->setOwner(tmpL2BlkAddr, fromCpuId);
                writeTraceLine(cacheName,
                               "Write miss to owned block",
                               oldOwner,
                               DirNoState,
                               tmpL2BlkAddr,
                               cache->getBlockSize(),
                               NULL);
                /* update and send request */
                setUpOwnerTransferInL2(req, oldOwner, fromCpuId);
                sendDirectoryMessage(req, lat);
                return true;
            }
            else if(req->cmd == Writeback || req->cmd == DirWriteback){
                if(outstandingOwnerTransAddrs.find(tmpL2BlkAddr)
                   == outstandingOwnerTransAddrs.end()){
                    // not part of an owner transfer, must be from owner
                    assert(parentPtr->getOwner(tmpL2BlkAddr) == fromCpuId);
                }
                // this block is not in any L1 cache, reset owner status
                parentPtr->removeOwner(tmpL2BlkAddr);
                // carry out normal write back actions in the write back case
                if(req->cmd == DirWriteback){
                    return true;
                }
            }
            else if(req->cmd == DirOwnerTransfer){
                if(outstandingOwnerTransAddrs.find(tmpL2BlkAddr)
                   != outstandingOwnerTransAddrs.end()){
                    // destination was req->fromProcessorID
                    sendNACK(req, lat, fromCpuId, -1);
                    return true;
                }
                outstandingOwnerTransAddrs[tmpL2BlkAddr] = fromCpuId;
                // change owner status and forward to old owner
                int oldOwner = parentPtr->getOwner(tmpL2BlkAddr);
                int newOwner = req->fromProcessorID;
                parentPtr->setOwner(tmpL2BlkAddr, newOwner);
                writeTraceLine(cacheName,
                               "Owner Change Request Granted",
                               parentPtr->getOwner(tmpL2BlkAddr),
                               DirNoState,
                               tmpL2BlkAddr,
                               cache->getBlockSize(),
                               NULL);
                /* update and send request */
                setUpOwnerTransferInL2(req, oldOwner, newOwner);
                sendDirectoryMessage(req, lat);
                return true;
            }
            else if(req->cmd == Read){
                // return the owner state to the requesting cache
                int owner = parentPtr->getOwner(tmpL2BlkAddr);
                assert(fromCpuId != owner);
                req->owner = owner;
            }
            else if(req->cmd == DirSharerWriteback){
                // block replaced from a non-owner cache, inform owner
                assert(req->replacedByID == -1);
                req->toProcessorID = parentPtr->getOwner(tmpL2BlkAddr);
                req->fromProcessorID = -1;
                req->toInterfaceID = -1;
                req->fromInterfaceID = -1;
                req->replacedByID = fromCpuId;
                writeTraceLine(cacheName,
                               "Forwarding sharer writeback to owner",
                               req->owner,
                               DirNoState,
                               tmpL2BlkAddr,
                               cache->getBlockSize(),
                               req->presentFlags);
                sendDirectoryMessage(req, lat);
                return true;
            }
            else if(req->cmd == DirRedirectRead){
                // a redirected read got NACKed
                // return the owner state to the requester
                req->owner = parentPtr->getOwner(tmpL2BlkAddr);
                req->toProcessorID = req->fromProcessorID;
                req->fromProcessorID = -1;
                req->toInterfaceID = -1;
                req->fromInterfaceID = -1;
                writeTraceLine(cacheName,
                               "Got Redirected Read, "
                               "informing requester of current owner",
                               req->owner,
                               DirNoState,
                               tmpL2BlkAddr,
                               cache->getBlockSize(),
                               NULL);
                sendDirectoryMessage(req, lat);
                return true;
            }
            else{
                fatal("In L2: access to a block that is allready owned, "
                      "unimplemented request type");
            }
        }
    }
    return false;
}
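The L2 directory code above refuses a second ownership request for a block while an earlier transfer is still unacknowledged, using outstandingOwnerTransAddrs as the set of in-flight transfers. The following is a minimal standalone sketch of that gating idea only. OwnerTransferGate, tryBegin, complete and inFlight are hypothetical names introduced for illustration, and the std::map container and 64-bit Addr type are assumptions, not the simulator's actual declarations.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Assumption: a 64-bit physical address type stands in for the simulator's Addr.
using Addr = std::uint64_t;

// Hypothetical helper: tracks blocks with an ownership transfer in flight.
class OwnerTransferGate {
  public:
    // Returns true if a transfer for blkAddr may start and records the
    // requester. Returns false if one is already in flight, in which case
    // the caller would answer the request with a NACK.
    bool tryBegin(Addr blkAddr, int requesterId) {
        if (outstanding_.find(blkAddr) != outstanding_.end()) {
            return false;
        }
        outstanding_[blkAddr] = requesterId;
        return true;
    }

    // Called when the ACK for the transfer arrives: the block may be
    // transferred again afterwards.
    void complete(Addr blkAddr) {
        auto it = outstanding_.find(blkAddr);
        assert(it != outstanding_.end());
        outstanding_.erase(it);
    }

    // True while a transfer for blkAddr has been started but not completed.
    bool inFlight(Addr blkAddr) const {
        return outstanding_.count(blkAddr) != 0;
    }

  private:
    std::map<Addr, int> outstanding_;  // block address -> requesting CPU id
};
```

In this sketch a caller would send a NACK whenever tryBegin returns false and call complete when the owner transfer ACK is received, which mirrors the erase performed in the dirACK branch of the listing.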
template<class TagStore>
bool
StenstromProtocol<TagStore>::doL1DirectoryAccess(MemReqPtr& req, BlkType* blk){
    int lat = cache->getHitLatency();
    Addr tmpL1BlkAddr = req->paddr & ~((Addr) cache->getBlockSize() - 1);
    if(!req->cmd.isDirectoryMessage()){
        assert(req->xc->cpu->params->cpu_id == cache->getCacheCPUid());
    }
    switch(req->cmd){
        case Write:
            if(blk->dirState == DirOwnedExGR
               || blk->dirState == DirOwnedNonExGR){
                // Write is OK
            }
            else{
                // we need to request ownership for this block
                req->oldCmd = req->cmd;
                req->cmd = DirOwnerTransfer;
                req->toProcessorID = -1;
                req->toInterfaceID = -1;
                req->fromProcessorID = cache->getCacheCPUid();
                numOwnerRequests++;
                writeTraceLine(cacheName,
                               "Issuing owner transfer request",
                               blk->owner,
                               blk->dirState,
                               tmpL1BlkAddr,
                               cache->getBlockSize(),
                               blk->presentFlags);
                sendDirectoryMessage(req, lat);
                return true;
            }
            break;
        case Read:
            if(blk->dirState == DirOwnedExGR
               || blk->dirState == DirOwnedNonExGR){
                // Read is OK
            }
            else{
                setUpRedirectedRead(req, cache->getCacheCPUid(), blk->owner);
                numRedirectedReads++;
                writeTraceLine(cacheName,
                               "Issuing Redirected Read (1)",
                               blk->owner,
                               blk->dirState,
                               tmpL1BlkAddr,
                               cache->getBlockSize(),
                               blk->presentFlags);
                sendDirectoryMessage(req, lat);
                return true;
            }
            break;
        case Soft_Prefetch:
            // discard, since it is a hit in the L1 cache we don't care
            break;
        case DirRedirectRead:
            if(blk->dirState == DirOwnedExGR){
                // make sure it is really one owner
                int presentCount = 0;
                for(int i=0; i<cache->cpuCount; i++){
                    if(blk->presentFlags[i]) presentCount++;
                }
                assert(presentCount == 1);
                // update block state
                int fromCpuID = req->fromProcessorID;
                blk->presentFlags[fromCpuID] = true;
                // this request might be from ourselves
                int newSharerCount = 0;
                for(int i=0; i<cache->cpuCount; i++){
                    if(blk->presentFlags[i]) newSharerCount++;
                }
                if(newSharerCount == 1){
                    // the block is present in our cache, answer the request
                    writeTraceLine(cacheName,
                                   "Answering Redirected Read from myself "
                                   "(return to cache) (1)",
                                   blk->owner,
                                   blk->dirState,
                                   tmpL1BlkAddr,
                                   cache->getBlockSize(),
                                   blk->presentFlags);
                    blk->dirState = DirOwnedExGR;
                    return false;
                }
                blk->dirState = DirOwnedNonExGR;
                // update request and send it
                setUpRedirectedReadReply(req,
                                         cache->getCacheCPUid(),
                                         fromCpuID);
                sendDirectoryMessage(req, lat);
            }
            else if(blk->dirState == DirOwnedNonExGR){
                int fromCpuID = req->fromProcessorID;
                // sending redirected reads to ourselves
                // will cause a deadlock, let it finish
                if(fromCpuID == cache->getCacheCPUid()){
                    writeTraceLine(cacheName,
                                   "Answering Redirected Read from myself "
                                   "(return to cache) (2)",
                                   blk->owner,
                                   blk->dirState,
                                   tmpL1BlkAddr,
                                   cache->getBlockSize(),
                                   blk->presentFlags);
                    return false;
                }
                assert(fromCpuID != cache->getCacheCPUid());
                blk->presentFlags[fromCpuID] = true;
                setUpRedirectedReadReply(req,
                                         cache->getCacheCPUid(),
                                         fromCpuID);
                sendDirectoryMessage(req, lat);
            }
            else{
                sendNACK(req,
                         lat,
                         req->fromProcessorID,
                         cache->getCacheCPUid());
                return true;
            }
            writeTraceLine(cacheName,
                           "Answering Redirected Read Request",
                           blk->owner,
                           blk->dirState,
                           tmpL1BlkAddr,
                           cache->getBlockSize(),
                           blk->presentFlags);
            return true;
            break;
        case DirOwnerTransfer:
        {
            // Owner transfer is finished
            // Update the block state and let the request go through
            assert(req->owner == cache->getCacheCPUid());
            assert(req->presentFlags != NULL);
            // ownership might be requested by more than one request
            // consequently, we might be transferring ownership to ourselves
            // i.e. no blk->owner != req->owner assertion
            assert(blk->presentFlags == NULL);
            blk->presentFlags = req->presentFlags;
            // the flags must not be removed when the request is deleted
            req->presentFlags = NULL;
            blk->owner = req->owner;
            // the previous owner does not necessarily know about this cache
            blk->presentFlags[cache->getCacheCPUid()] = true;
            int sharerCount = 0;
            for(int i=0; i<cache->cpuCount; i++){
                if(blk->presentFlags[i]) sharerCount++;
            }
            assert(sharerCount > 0);
            if(sharerCount == 1) blk->dirState = DirOwnedExGR;
            else blk->dirState = DirOwnedNonExGR;
            writeTraceLine(cacheName,
                           "Owner transfer complete",
                           blk->owner,
                           blk->dirState,
                           tmpL1BlkAddr,
                           cache->getBlockSize(),
                           blk->presentFlags);
            // Send ACK to the L2 cache
            MemReqPtr tmpReq = buildReqCopy(req,
                                            cache->cpuCount,
                                            DirOwnerTransfer);
            setUpACK(tmpReq, -1, cache->getCacheCPUid());
            sendDirectoryMessage(tmpReq, cache->getHitLatency());
        }
            break;
        case DirOwnerWriteback:
        {
            // hit on owner write back
            assert(req->presentFlags == NULL);
            assert(req->paddr ==
                   (req->paddr & ~((Addr) cache->getBlockSize() - 1)));
            // make a copy that will become the ownership request
            MemReqPtr ownershipReq = buildReqCopy(req,
                                                  cache->cpuCount,
                                                  DirOwnerTransfer);
            // set addressing info
            ownershipReq->toProcessorID = -1;
            ownershipReq->fromProcessorID = cache->getCacheCPUid();
            ownershipReq->toInterfaceID = -1;
            // send an ACK to the current owner
            req->toProcessorID = req->fromProcessorID;
            req->fromProcessorID = cache->getCacheCPUid();
            req->toInterfaceID = -1;
            req->fromInterfaceID = -1;
            req->dirACK = true;
            // send the messages
            sendDirectoryMessage(req, lat);
            sendDirectoryMessage(ownershipReq, lat);
            writeTraceLine(cacheName,
                           "Accepting ownership, ACK and ownership request sent",
                           blk->owner,
                           blk->dirState,
                           tmpL1BlkAddr,
                           cache->getBlockSize(),
                           blk->presentFlags);
            return true;
        }
            break;
        default:
            cout << req->cmd.toString() << "\n";
            fatal("L1: cache access(), unknown request type");
    }
    return false;
}
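Several branches above recount the present flags and then pick between the two owned states: DirOwnedExGR when the owner is the only cache holding the block, DirOwnedNonExGR otherwise. That rule can be sketched in isolation as follows. The enum DirState, its members, and the functions countSharers and ownerStateFor are hypothetical stand-ins chosen for illustration, and std::vector<bool> replaces the listing's raw bool array as an assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the two owned directory states in the listing.
enum class DirState { OwnedExclusive, OwnedShared };

// Count the caches whose present flag is set.
inline int countSharers(const std::vector<bool>& presentFlags) {
    int count = 0;
    for (bool present : presentFlags) {
        if (present) {
            ++count;
        }
    }
    return count;
}

// The owner always holds the block, so at least one flag must be set.
// Exactly one set flag means only the owner has a copy (exclusive);
// more than one means other caches share the block.
inline DirState ownerStateFor(const std::vector<bool>& presentFlags) {
    const int sharers = countSharers(presentFlags);
    assert(sharers > 0);
    return sharers == 1 ? DirState::OwnedExclusive : DirState::OwnedShared;
}
```

Factoring the count into one helper is a design choice of this sketch; the listing repeats the loop inline at each site.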
template<class TagStore>
bool
StenstromProtocol<TagStore>::handleDirectoryResponse(MemReqPtr& req,
                                                     TagStore* tags){
    if(req->dirNACK){
        if(req->cmd == DirOwnerTransfer){
            req->dirNACK = false;
            req->flags &= ~SATISFIED;
            BlkType* tmpBlk = tags->findBlock(req);
            if(req->ownerWroteBack){
                assert(req->writeMiss);
                req->cmd = Read;
                writeTraceLine(cacheName,
                               "Owner Transfer NACK recieved, owner wrote back, "
                               "retransmitting as read",
                               req->owner,
                               DirNoState,
                               req->paddr,
                               cache->getBlockSize(),
                               NULL);
            }
            else if(tmpBlk != NULL &&
                    (tmpBlk->dirState == DirOwnedExGR
                     || tmpBlk->dirState == DirOwnedNonExGR)){
                writeTraceLine(cacheName,
                               "Owner Transfer NACK recieved, "
                               "we have become the owner",
                               tmpBlk->owner,
                               tmpBlk->dirState,
                               req->paddr,
                               cache->getBlockSize(),
                               NULL);
                // we have become the owner, return request to processor
                if(req->mshr == NULL){
                    assert(req->completionEvent != NULL);
                    cache->respond(req, curTick + cache->getHitLatency());
                    return true;
                }
                else{
                    assert(req->mshr != NULL);
                    cache->missQueueHandleResponse(req,
                        curTick + cache->getHitLatency());
                    return true;
                }
            }
            else{
                writeTraceLine(cacheName,
                               "Owner Transfer NACK recieved, "
                               "retransmitting to L2 cache",
                               req->owner,
                               DirNoState,
                               req->paddr,
                               cache->getBlockSize(),
                               NULL);
            }
            req->toProcessorID = -1;
            req->fromProcessorID = cache->getCacheCPUid();
            req->fromInterfaceID = -1;
            req->toInterfaceID = -1;
            sendDirectoryMessage(req, cache->getHitLatency());
            return true;
        }
        else if(req->cmd == DirRedirectRead){
            req->dirNACK = false;
            req->flags &= ~SATISFIED;
            writeTraceLine(cacheName,
                           "Redirected Read NACK recieved, "
                           "retransmitting to L2 cache",
                           req->owner,
                           DirNoState,
                           req->paddr,
                           cache->getBlockSize(),
                           NULL);
            req->toProcessorID = -1;
            req->fromProcessorID = cache->getCacheCPUid();
            req->fromInterfaceID = -1;
            req->toInterfaceID = -1;
            sendDirectoryMessage(req, cache->getHitLatency());
            return true;
        }
        else if(req->cmd == DirOwnerWriteback){
            assert(req->presentFlags == NULL);
            Addr tmpBlkAddr = req->paddr & ~((Addr) cache->getBlockSize() - 1);
            assert(outstandingWritebackWSAddrs.find(tmpBlkAddr)
                   != outstandingWritebackWSAddrs.end());
            outstandingWritebackWSAddrs[tmpBlkAddr][req->fromProcessorID]
                = false;
            bool* tmpPresentFlags = outstandingWritebackWSAddrs[tmpBlkAddr];
            int presentCount = 0;
            for(int i=0; i<cache->cpuCount; i++){
                if(tmpPresentFlags[i]) presentCount++;
            }
            req->dirNACK = false;
            if(presentCount == 0){
                // no other sharers left, send to L2
                req->toProcessorID = -1;
                req->fromProcessorID = cache->getCacheCPUid();
                req->toInterfaceID = -1;
                req->fromInterfaceID = -1;
                req->cmd = DirWriteback;
                sendDirectoryMessage(req, cache->getHitLatency());
                // write back has been handled, remove it
                outstandingWritebackWSAddrs.erase(
                    outstandingWritebackWSAddrs.find(tmpBlkAddr));
                writeTraceLine(cacheName,
                               "No sharers left, doing normal writeback",
                               req->owner,
                               DirNoState,
                               req->paddr,
                               cache->getBlockSize(),
                               tmpPresentFlags);
                return true;
            }
            else{
                int nextSharer = -1;
                for(int i=0; i<cache->cpuCount; i++){
                    if(tmpPresentFlags[i]){
                        nextSharer = i;
                        break;
                    }
                }
                assert(nextSharer != -1);
                req->toProcessorID = nextSharer;
                req->fromProcessorID = cache->getCacheCPUid();
                req->toInterfaceID = -1;
                req->fromInterfaceID = -1;
                writeTraceLine(cacheName,
                               "Attempting to transfer ownership to "
                               "different sharer",
                               req->owner,
                               DirNoState,
                               req->paddr,
                               cache->getBlockSize(),
                               tmpPresentFlags);
                sendDirectoryMessage(req, cache->getHitLatency());
                return true;
            }
        }
        else if(req->cmd == Read){
            assert(req->fromProcessorID == -1);
            if(req->writeMiss){
                assert(req->writeMiss);
                // change the request to an owner transfer and resend
                req->cmd = DirOwnerTransfer;
                req->dirNACK = false;
                req->fromProcessorID = cache->getCacheCPUid();
                req->toProcessorID = -1;
                req->toInterfaceID = -1;
                writeTraceLine(cacheName,
                               "Write miss, recieved NACK, retransmitting",
                               -1,
                               DirNoState,
                               req->paddr,
                               cache->getBlockSize(),
                               NULL);
                //TODO: have a different retransmit delay?
                sendDirectoryMessage(req, cache->getHitLatency());
                return true;
            }
            else{
                fatal("NACK on read miss not implemented");
            }
        }
        else{
            fatal("Recieved NACK, request type not implemented");
        }
    }
    else if(req->cmd == DirOwnerWriteback){
        assert(!req->isDirectoryNACK());
        if(req->isDirectoryACK()){
            assert(req->presentFlags == NULL);
            writeTraceLine(cacheName,
                           "Owner with sharers, recieved ACK",
                           req->owner,
                           DirNoState,
                           req->paddr,
                           cache->getBlockSize(),
                           NULL);
            return true;
        }
        else{
            // received a request to take over ownership of this block
            assert(req->owner != cache->getCacheCPUid());
            cache->access(req);
            return true;
        }
    }
    else if(req->cmd == DirSharerWriteback){
        // a sharer has written back a block owned by this cache
        BlkType* tmpBlk = tags->findBlock(req);
        if(tmpBlk == NULL){
            // we are in the process of writing back the block
            Addr tmpBlkAddr = req->paddr & ~((Addr) cache->getBlockSize() - 1);
            if(outstandingWritebackWSAddrs.find(tmpBlkAddr)
               == outstandingWritebackWSAddrs.end()){
                // we have written back the block
                // or the block has been given a new owner
                // send NACK to L2 and let it handle it
                sendNACK(req,
                         cache->getHitLatency(),
                         -1,
                         cache->getCacheCPUid());
                return true;
            }
            assert(outstandingWritebackWSAddrs.find(tmpBlkAddr)
                   != outstandingWritebackWSAddrs.end());
            outstandingWritebackWSAddrs[tmpBlkAddr][req->replacedByID] = false;
            writeTraceLine(cacheName,
                           "Sharer writeback recieved to block that is "
                           "being written back",
                           -1,
                           DirNoState,
                           req->paddr,
                           cache->getBlockSize(),
                           outstandingWritebackWSAddrs[tmpBlkAddr]);
            return true;
        }
        if(tmpBlk->dirState == DirInvalid){
            // we are in the middle of an owner transfer and
            // haven't received the data yet
            // make the L2 resend the request
            sendNACK(req, cache->getHitLatency(), -1, cache->getCacheCPUid());
            return true;
        }
        assert(tmpBlk != NULL);
        assert(tmpBlk->dirState == DirOwnedNonExGR
               || tmpBlk->dirState == DirOwnedExGR);
        assert(tmpBlk->presentFlags != NULL);
        assert(req->replacedByID >= 0);
        if(req->replacedByID == cache->cacheCpuID){
            // if a sharer and the owner write back a block at the same time
            // this can happen, discard this update
            writeTraceLine(cacheName,
                           "Recieved sharer writeback from ourselves",
                           tmpBlk->owner,
                           tmpBlk->dirState,
                           req->paddr,
                           cache->getBlockSize(),
                           tmpBlk->presentFlags);
            return true;
        }
        tmpBlk->presentFlags[req->replacedByID] = false;
        int sharers = 0;
        for(int i=0; i<cache->cpuCount; i++){
            if(tmpBlk->presentFlags[i]) sharers++;
        }
        assert(sharers > 0);
        if(sharers == 1) tmpBlk->dirState = DirOwnedExGR;
        else tmpBlk->dirState = DirOwnedNonExGR;
        writeTraceLine(cacheName,
                       "Sharer writeback recieved and handled",
                       tmpBlk->owner,
                       tmpBlk->dirState,
                       req->paddr,
                       cache->getBlockSize(),
                       tmpBlk->presentFlags);
        return true;
    }
    else if(req->cmd == DirNewOwnerMulticast){
        BlkType* tmpBlk = tags->findBlock(req);
        if(tmpBlk == NULL){
            // we have written back this block and don't care who owns it
            writeTraceLine(cacheName,
                           "Owner info discarded, block written back",
                           -1,
                           DirNoState,
                           req->paddr,
                           cache->getBlockSize(),
                           NULL);
            return true;
        }
        assert(tmpBlk != NULL);
        assert(tmpBlk->dirState == DirInvalid);
        assert(tmpBlk->presentFlags == NULL);
        assert(req->owner >= 0);
        tmpBlk->owner = req->owner;
        writeTraceLine(cacheName,
                       "New owner info recieved and stored",
                       tmpBlk->owner,
                       tmpBlk->dirState,
                       req->paddr,
                       cache->getBlockSize(),
                       tmpBlk->presentFlags);
        return true;
    }
    else if(req->owner != -1
            && req->owner != cache->getCacheCPUid()
            && req->fromProcessorID == -1
            && req->cmd != DirOwnerTransfer){
        // This is an L1 cache, but is not the owner
        // redirected request to owner, must be a read
        assert(!req->writeMiss);
        setUpRedirectedRead(req, cache->getCacheCPUid(), req->owner);
        numRedirectedReads++;
        writeTraceLine(cacheName,
                       "Issuing Redirected Read (2)",
                       req->owner,
                       DirNoState,
                       req->paddr,
                       cache->getBlockSize(),
                       NULL);
        sendDirectoryMessage(req, cache->getHitLatency());
        return true;
    }
    else if(req->fromProcessorID != -1
            && req->owner == cache->getCacheCPUid()){
        // Case 1: request from a different L1 cache and we are the owner
        // Case 2: we have received the owner state at the end of
        //         an owner transfer request
        if(req->writeMiss){
            // this is a cache fill, let it through
        }
        else{
            assert(req->cmd == DirRedirectRead || req->cmd == DirOwnerTransfer);
            if(req->cmd == DirOwnerTransfer){
                // the block must be in the cache for this forwarding to work
                BlkType* tmpBlk = tags->findBlock(req);
                if(tmpBlk == NULL){
                    // we have replaced this block, let it back in
                    assert(req->mshr == NULL);
                    return false;
                }
            }
            cache->access(req);
            return true;
        }
    }
    else if(req->cmd == DirRedirectRead
            && req->owner == cache->getCacheCPUid()
            && req->fromProcessorID == -1){
        if(req->mshr != NULL){
            // an MSHR is allocated
            // let it go through and handle the fill normally
            return false;
        }
        else{
            // no MSHR is allocated because the block is already in our cache
            // this code assumes that we have become the owner while the
            // Redirected Read has been transported through the system
            BlkType* tmpBlk = tags->findBlock(req);
            assert(tmpBlk != NULL);
            if(tmpBlk->dirState == DirInvalid){
                // we have not become the owner yet, send nack to ourselves
                sendNACK(req,
                         cache->getHitLatency(),
                         cache->getCacheCPUid(),
                         cache->getCacheCPUid());
                return true;
            }
            assert(tmpBlk->dirState == DirOwnedExGR
                   || tmpBlk->dirState == DirOwnedNonExGR);
            assert(tmpBlk->owner == cache->getCacheCPUid());
            writeTraceLine(cacheName,
                           "This cache is the owner, send response to CPU",
                           req->owner,
                           DirNoState,
                           req->paddr,
                           cache->getBlockSize(),
                           NULL);
            assert(req->completionEvent != NULL);
            cache->respond(req, curTick + cache->getHitLatency());
            return true;
        }
    }
    return false;
}
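The expression req->paddr & ~((Addr) cache->getBlockSize() - 1) recurs throughout this listing to turn a physical address into the address of its enclosing cache block. A small worked sketch of that masking step is given here. The function name blockAlign is hypothetical, the 64-bit Addr type is an assumption, and the power-of-two check is added for this sketch only (the mask is meaningful only for power-of-two block sizes).

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a 64-bit physical address type stands in for the simulator's Addr.
using Addr = std::uint64_t;

// Clear the low-order offset bits of paddr so that the result is the address
// of the first byte of the enclosing block.
inline Addr blockAlign(Addr paddr, unsigned blockSize) {
    // The mask trick only works when blockSize is a non-zero power of two.
    assert(blockSize != 0 && (blockSize & (blockSize - 1)) == 0);
    return paddr & ~(static_cast<Addr>(blockSize) - 1);
}
```

With 64-byte blocks, for example, address 0x1234 lies in the block that starts at 0x1200, because the mask clears the six offset bits.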
template<class TagStore>
bool
StenstromProtocol<TagStore>::handleDirectoryFill(MemReqPtr& req,
                                                 BlkType* blk,
                                                 MemReqList& writebacks,
                                                 TagStore* tags){
    // This is an L1 data cache
    if(req->cmd == DirOwnerTransfer){
        if(req->fromProcessorID == -1){
            int newOwner = -1;
            bool* oldFlags = NULL;
            if(outstandingWritebackWSAddrs.find(
                   req->paddr & ~((Addr) cache->getBlockSize() - 1))
               != outstandingWritebackWSAddrs.end()){
                // remove this entry
                Addr tmpAddr = req->paddr & ~((Addr) cache->getBlockSize() - 1);
                oldFlags = outstandingWritebackWSAddrs[tmpAddr];
                outstandingWritebackWSAddrs.erase(
                    outstandingWritebackWSAddrs.find(tmpAddr));
                // no need to check for other sharers
                // (end of ownership replacement with sharers)
                newOwner = req->owner;
                req->toProcessorID = newOwner;
                req->fromProcessorID = cache->getCacheCPUid();
                req->presentFlags = oldFlags;
                req->owner = newOwner;
                writeTraceLine(cacheName,
                               "Transfering Owner State "
                               "(end of owner with sharers replacement)",
                               newOwner,
                               DirInvalid,
                               req->paddr,
                               cache->getBlockSize(),
                               oldFlags);
            }
            else{
                // Owner transfer received from L2, we must be the owner
                if(blk == NULL){
                    // The block hasn't been delivered to us yet
                    // or we have written it back, send NACK
                    sendNACK(req,
                             cache->getHitLatency(),
                             -1,
                             cache->getCacheCPUid());
                    return true;
                }
                assert(blk->dirState == DirOwnedExGR
                       || blk->dirState == DirOwnedNonExGR);
                assert(blk->presentFlags != NULL);
                newOwner = req->owner;
                blk->presentFlags[newOwner] = true;
                oldFlags = blk->presentFlags;
                // update local block
                blk->presentFlags = NULL;
                blk->dirState = DirInvalid;
                blk->status &= ~BlkDirty;
                blk->owner = newOwner;
                writeTraceLine(cacheName,
                               "Transfering Owner State",
                               blk->owner,
                               blk->dirState,
                               req->paddr,
                               cache->getBlockSize(),
                               oldFlags);
                // send owner state to new owner
                req->toProcessorID = newOwner;
                req->fromProcessorID = cache->getCacheCPUid();
                req->presentFlags = oldFlags;
                req->owner = newOwner;
            }
            assert(oldFlags != NULL);
            assert(newOwner != -1);
            for(int i=0; i<cache->cpuCount; i++){
                if(oldFlags[i]
                   && i != cache->getCacheCPUid()
                   && i != newOwner){
                    // send request to all other sharers
                    MemReqPtr tmpReq = buildReqCopy(req,
                                                    cache->cpuCount,
                                                    DirNewOwnerMulticast);
                    assert(tmpReq->cmd == DirNewOwnerMulticast);
                    tmpReq->toProcessorID = i;
                    tmpReq->presentFlags = NULL;
                    writeTraceLine(cacheName,
                                   "Informing sharer of new owner",
                                   (blk != NULL) ? blk->owner : -1,
                                   (blk != NULL) ? blk->dirState : DirNoState,
                                   req->paddr,
                                   cache->getBlockSize(),
                                   oldFlags);
                    sendDirectoryMessage(tmpReq, cache->getHitLatency());
                }
            }
            // send request to new owner
            sendDirectoryMessage(req, cache->getHitLatency());
            return true;
        }
        else if(req->writeMiss){
            // We have received ownership of a block because of an L1 write
            // miss
            assert(cache->getCacheCPUid() == req->owner);
            int sharerCount = 0;
            for(int i=0; i<cache->cpuCount; i++){
                if(req->presentFlags[i]) sharerCount++;
            }
            // we must get the block, because the previous call is bypassed
            CacheBlk::State old_state = (blk) ? blk->status : 0;
            blk = tags->handleFill(blk,
                                   req->mshr,
                                   cache->getNewCoherenceState(req, old_state),
                                   writebacks);
            blk->owner = cache->getCacheCPUid();
            blk->presentFlags = req->presentFlags;
            if(sharerCount > 1) blk->dirState = DirOwnedNonExGR;
            else blk->dirState = DirOwnedExGR;
            // remove the reference to these flags, so they are
            // not deleted together with the request
            req->presentFlags = NULL;
            writeTraceLine(cacheName,
                           "Recieved owner state (write miss)",
                           blk->owner,
                           blk->dirState,
                           req->paddr,
                           cache->getBlockSize(),
                           blk->presentFlags);
            // Send ACK to the L2 cache
            MemReqPtr tmpReq = buildReqCopy(req,
                                            cache->cpuCount,
                                            DirOwnerTransfer);
            setUpACK(tmpReq, -1, cache->getCacheCPUid());
            sendDirectoryMessage(tmpReq, cache->getHitLatency());
        }
        else{
            // the block was replaced in the middle of an owner transfer
            // put it back in and update the stats
            assert(req->mshr == NULL);
            assert(blk == NULL);
            blk = tags->handleFill(blk,
                                   req,
                                   BlkValid | BlkWritable,
                                   writebacks);
            blk->owner = cache->getCacheCPUid();
            blk->presentFlags = req->presentFlags;
            blk->presentFlags[cache->getCacheCPUid()] = true;
            blk->dirState = DirOwnedNonExGR;
            // remove the reference to these flags, so they are
            // not deleted together with the request
            req->presentFlags = NULL;
            writeTraceLine(cacheName,
                           "Owner transfer complete, needed block was replaced",
                           blk->owner,
                           blk->dirState,
                           req->paddr,
                           cache->getBlockSize(),
                           blk->presentFlags);
            // check that the cache fill worked
            BlkType* checkBlk = tags->findBlock(req->paddr, req->asid);
            assert(checkBlk != NULL);
            // Send ACK to the L2 cache
            MemReqPtr tmpReq = buildReqCopy(req,
                                            cache->cpuCount,
                                            DirOwnerTransfer);
            setUpACK(tmpReq, -1, cache->getCacheCPUid());
            sendDirectoryMessage(tmpReq, cache->getHitLatency());
            return true;
        }
    }
    else if(req->cmd == DirRedirectRead){
        // response from a redirected read received
        if(blk->owner == cache->getCacheCPUid()
           || req->owner == cache->getCacheCPUid()){
            // CASE 1: we have become the owner while
            //         the redirected read was in transit
            // CASE 2: the previous owner wrote the line back while the RR was
            //         in transit and we are the new owner
            // the block might be brought into the cache so it might not have
            // a state yet
            if(blk->dirState == DirNoState){
                assert(req->presentFlags == NULL);
                blk->presentFlags = new bool[cache->cpuCount];
                for(int i=0; i<cache->cpuCount; i++){
                    blk->presentFlags[i] = false;
                }
                blk->presentFlags[cache->getCacheCPUid()] = true;
                blk->dirState = DirOwnedExGR;
                blk->owner = cache->getCacheCPUid(); // needed in case 2
            }
            assert(blk->dirState == DirOwnedExGR
                   || blk->dirState == DirOwnedNonExGR);
            assert(blk->presentFlags != NULL);
            assert(req->presentFlags == NULL);
            writeTraceLine(cacheName,
                           "Redirected Read Response Recieved, "
                           "we have become the owner",
                           blk->owner,
                           blk->dirState,
                           req->paddr,
                           cache->getBlockSize(),
                           blk->presentFlags);
        }
        else{
            assert(blk->owner != cache->getCacheCPUid());
            assert(blk->presentFlags == NULL);
            blk->dirState = DirInvalid;
            blk->owner = req->owner;
            writeTraceLine(cacheName,
                           "Redirected Read Response Recieved",
                           blk->owner,
                           blk->dirState,
                           req->paddr,
                           cache->getBlockSize(),
                           blk->presentFlags);
        }
        if(req->mshr == NULL){
            assert(req->completionEvent != NULL);
            cache->respond(req, curTick);
            return true;
        }
    }
    else if(req->cmd == Read
            && (blk->dirState == DirOwnedExGR
                || blk->dirState == DirOwnedNonExGR)){
        // command is a read or write and we are the owner
        // let it go through
        assert(req->mshr != NULL);
        return false;
    }
    else if(req->fromProcessorID == -1
            && req->owner == cache->getCacheCPUid()){
        assert(blk->presentFlags == NULL);
        assert(blk->dirState != DirOwnedExGR);
        assert(blk->dirState != DirOwnedNonExGR);
        blk->owner = req->owner;
        blk->dirState = DirOwnedExGR;
        if(blk->presentFlags == NULL){
            blk->presentFlags = new bool[cache->cpuCount];
        }
        for(int i=0; i<cache->cpuCount; i++){
            blk->presentFlags[i] = false;
        }
        blk->presentFlags[cache->getCacheCPUid()] = true;
    }
    else{
        fatal("response type not implemented (handleResponse())");
    }
    return false;
}
template<class TagStore>
bool
StenstromProtocol<TagStore>::doDirectoryWriteback(MemReqPtr& req){
    if(req->cmd == DirWriteback){
        // Directory write back of non-modified block
        // latency has already been counted
        sendDirectoryMessage(req, 0);
        return true;
    }
    else if(req->cmd == DirOwnerWriteback){
261
APPENDIX C. SIMULATOR EXTENSION CODE
a s s e r t ( req−>pre s entF lags != NULL) ;
// s e t our pre sen t f l a g to f a l s e
req−>pre s entF lags [ cache−>getCacheCPUid ( ) ] = fa l se ;
int foundCount = 0 ;
int newOwner = −1;
for ( int i =0; i<cache−>cpuCount ; i++){
i f ( i != cache−>getCacheCPUid ( ) && req−>pre s entF lags [ i ] ) {
newOwner = i ;
foundCount++;
break ;
}
}
a s s e r t ( foundCount == 1 ) ;
a s s e r t (newOwner >= 0 ) ;
// update the r e que s t s t a t s
req−>toProcessor ID = newOwner ;
req−>f romProcessorID = cache−>getCacheCPUid ( ) ;
req−>t o In t e r f a c e ID = −1;
req−>owner = cache−>getCacheCPUid ( ) ;
bool∗ tmpFlags = req−>pre s entF lags ;
req−>pre s entF lags = NULL;
Addr tmpBlkAddr = req−>paddr & ˜( (Addr ) cache−>ge tB lockS i ze ( ) − 1 ) ;
i f ( outstandingWritebackWSAddrs . f i nd ( tmpBlkAddr )
== outstandingWritebackWSAddrs . end ( ) ){
// the r e i s no ou t s tand ing wr i t e back to t h i s address
outstandingWritebackWSAddrs [ tmpBlkAddr ] = tmpFlags ;
}
else {
f a t a l ( ”We are wr i t i ng back the same block twice , ”
” t h i s i s not n i c e . . . ” ) ;
}
numOwnerWritebacks++;
wr i teTraceLine ( cacheName ,
”Replac ing owned block with sha r e r s ” ,
cache−>getCacheCPUid ( ) ,
D i r Inva l id ,
req−>paddr ,
cache−>ge tB lockS i ze ( ) ,
tmpFlags ) ;
// forward the r e que s t to the new owner
sendDirectoryMessage ( req , 0 ) ;
return true ;
}
else i f ( req−>cmd == DirSharerWriteback ){
// send i t to the L2 cache , i t w i l l inform the owner
req−>toProcessor ID = −1;
req−>f romProcessorID = cache−>getCacheCPUid ( ) ;
req−>t o In t e r f a c e ID = −1;
req−>f romInter face ID = −1;
numSharerWritebacks++;
262
C.2. COHERENCE PROTOCOL EXTENSION CODE
writeTraceLine ( cacheName ,
”Replac ing not owned block ” ,
−1,
Di r Inva l id ,
req−>paddr ,
cache−>ge tB lockS i ze ( ) ,
req−>pre s entF lags ) ;
// forward the r e que s t to the L2 cache
sendDirectoryMessage ( req , 0 ) ;
return true ;
}
return fa lse ;
}

template<class TagStore>
MemAccessResult
StenstromProtocol<TagStore>::handleL1DirectoryMiss(MemReqPtr& req){

    if(req->cmd == Soft_Prefetch){
        return MA_CACHE_MISS;
    }
    else if(req->cmd == DirRedirectRead){
        // Miss on a redirected read, we have written back the block, send NACK
        sendNACK(req,
                 cache->getHitLatency(),
                 req->fromProcessorID,
                 cache->getCacheCPUid());
        return MA_CACHE_MISS;
    }
    else if(req->cmd == DirOwnerWriteback){
        // Miss on an owner transfer request,
        // we have written back the block, send NACK
        sendNACK(req,
                 cache->getHitLatency(),
                 req->fromProcessorID,
                 cache->getCacheCPUid());
        return MA_CACHE_MISS;
    }
    else if(req->cmd == Read || req->cmd == Write){
        Addr tmpAddr = req->paddr & ~((Addr) cache->getBlockSize() - 1);
        if(outstandingWritebackWSAddrs.find(tmpAddr)
           != outstandingWritebackWSAddrs.end()){
            // this cache still has updated state for this cache
            // respond to request
            cache->respond(req, curTick + cache->getHitLatency());
            return MA_HIT;
        }
    }
    return BA_NO_RESULT;
}

/* Private helper methods */

template<class TagStore>
void
StenstromProtocol<TagStore>::setUpRedirectedRead(MemReqPtr& req,
                                                 int fromProcessorID,
                                                 int toProcessorID){
    req->oldCmd = req->cmd;
    req->cmd = DirRedirectRead;
    req->fromProcessorID = fromProcessorID;
    req->toProcessorID = toProcessorID;
    //must be updated if the req did not come from L2 just now
    req->owner = toProcessorID;
    req->toInterfaceID = -1;
}

template<class TagStore>
void
StenstromProtocol<TagStore>::setUpRedirectedReadReply(MemReqPtr& req,
                                                      int fromProcessorID,
                                                      int toProcessorID){
    req->toInterfaceID = -1;
    req->fromProcessorID = fromProcessorID;
    req->toProcessorID = toProcessorID;
    // only owners reply to redirected reads
    req->owner = fromProcessorID;
}

template<class TagStore>
void
StenstromProtocol<TagStore>::setUpOwnerTransferInL2(MemReqPtr& req,
                                                    int oldOwner,
                                                    int newOwner){
    req->cmd = DirOwnerTransfer;
    req->toProcessorID = oldOwner;
    req->owner = newOwner; // blk->owner;
    req->fromProcessorID = -1;
    req->toInterfaceID = -1;
}

template<class TagStore>
void
StenstromProtocol<TagStore>::setUpACK(MemReqPtr& req, int toID, int fromID){
    req->toProcessorID = toID;
    req->fromProcessorID = fromID;
    req->toInterfaceID = -1;
    req->fromInterfaceID = -1;
    req->presentFlags = NULL;
    req->owner = -1;
    req->dirACK = true;
}

/* The rest of this file consists of template definitions */

#ifndef DOXYGEN_SHOULD_SKIP_THIS

// Include config files
// Must be included first to determine which caches we want
#include "mem/config/cache.hh"
#include "mem/config/compression.hh"

// Tag Templates
#if defined(USE_CACHE_LRU)
#include "mem/cache/tags/lru.hh"
#endif

#if defined(USE_CACHE_FALRU)
#include "mem/cache/tags/fa_lru.hh"
#endif

#if defined(USE_CACHE_IIC)
#include "mem/cache/tags/iic.hh"
#endif

#if defined(USE_CACHE_SPLIT)
#include "mem/cache/tags/split.hh"
#endif

#if defined(USE_CACHE_SPLIT_LIFO)
#include "mem/cache/tags/split_lifo.hh"
#endif

// Compression Templates
#include "base/compression/null_compression.hh"
#if defined(USE_LZSS_COMPRESSION)
#include "base/compression/lzss_compression.hh"
#endif

#if defined(USE_CACHE_FALRU)
template class StenstromProtocol<CacheTags<FALRU,NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class StenstromProtocol<CacheTags<FALRU,LZSSCompression> >;
#endif
#endif

#if defined(USE_CACHE_IIC)
template class StenstromProtocol<CacheTags<IIC,NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class StenstromProtocol<CacheTags<IIC,LZSSCompression> >;
#endif
#endif

#if defined(USE_CACHE_LRU)
template class StenstromProtocol<CacheTags<LRU,NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class StenstromProtocol<CacheTags<LRU,LZSSCompression> >;
#endif
#endif

#if defined(USE_CACHE_SPLIT)
template class StenstromProtocol<CacheTags<Split,NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class StenstromProtocol<CacheTags<Split,LZSSCompression> >;
#endif
#endif

#if defined(USE_CACHE_SPLIT_LIFO)
template class StenstromProtocol<CacheTags<SplitLIFO,NullCompression> >;
#if defined(USE_LZSS_COMPRESSION)
template class StenstromProtocol<CacheTags<SplitLIFO,LZSSCompression> >;
#endif
#endif

#endif // DOXYGEN_SHOULD_SKIP_THIS
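
The owner write-back path in doDirectoryWriteback() rests on two small computations: aligning the physical address to the start of its cache block, and picking one of the remaining sharers as the new owner. The following is a minimal Python sketch of both steps, written only to illustrate the listing above. The function names block_address and pick_new_owner are hypothetical and do not exist in the simulator, and the block size is assumed to be a power of two.

```python
def block_address(paddr, block_size):
    """Align paddr down to the start of its cache block.

    Mirrors `req->paddr & ~((Addr) cache->getBlockSize() - 1)` in the
    listing; block_size is assumed to be a power of two.
    """
    assert block_size > 0 and block_size & (block_size - 1) == 0
    return paddr & ~(block_size - 1)


def pick_new_owner(present_flags, writer_id):
    """Return the lowest-numbered sharer other than writer_id, or -1.

    Like the loop in the listing, the search stops at the first sharer
    it finds, so it does not check how many sharers remain.
    """
    for cpu_id, present in enumerate(present_flags):
        if cpu_id != writer_id and present:
            return cpu_id
    return -1


# Hypothetical example: CPU 0 evicts a block that CPUs 2 and 3 also hold.
print(hex(block_address(0x12345, 64)))
print(pick_new_owner([True, False, True, True], 0))
```

Because the search stops at the first match, the found-count assertion in the C++ code can only detect the case where no sharer is left; it cannot detect more than one.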
Appendix D
Simulator Configuration Scripts
D.1 run.py
from m5 import *
import Splash2
import TestPrograms
import Spec2000
import workloads
from DetailedConfig import *

###############################################################################
# Constants
###############################################################################

L2_BANK_COUNT = 4
all_protocols = ['none', 'msi', 'mesi', 'mosi', 'moesi', 'stenstrom']
snoop_protocols = ['msi', 'mesi', 'mosi', 'moesi']
directory_protocols = ['stenstrom']

###############################################################################
# Check command line options
###############################################################################

if env['PROTOCOL'] not in all_protocols:
    panic('No/Invalid cache coherence protocol specified!')

if 'BENCHMARK' not in env:
    panic("The BENCHMARK environment variable must be set!\ne.g. \
-EBENCHMARK=Cholesky\n")

# Multi-programmed workloads (numbered 1 to N) read fast-forward cycles
# from a config file
# Splash benchmarks can read from config file
if not ((env['BENCHMARK'].isdigit()) or (env['BENCHMARK']
                                         in Splash2.benchmarkNames)):
    if 'FASTFORWARDTICKS' not in env:
        panic("The FASTFORWARDTICKS environment variable must be set!\n\
e.g. -EFASTFORWARDTICKS=10000\n")

if 'SIMULATETICKS' not in env and 'SIMINSTS' not in env \
       and 'ISEXPERIMENT' not in env:
    panic("One of the SIMULATETICKS/SIMINSTS/ISEXPERIMENT environment \
variable must be set!\ne.g. -ESIMULATETICKS=10000\n")

if 'INTERCONNECT' not in env:
    panic("The INTERCONNECT environment variable must be set!\ne.g. \
-EINTERCONNECT=bus\n")

if 'STATSFILE' not in env:
    panic('No statistics file name given! (-ESTATSFILE=foobar.txt)')

coherenceTrace = False
coherenceTraceStart = 0
if 'TRACE' in env:
    if env['PROTOCOL'] not in directory_protocols:
        panic('Tracing is only supported for directory protocols')
    coherenceTrace = True
    # converted to int like the other numeric options
    coherenceTraceStart = int(env['TRACE'])
    print >>sys.stderr, 'warning: Protocol tracing is turned on!'

inDumpInterval = 0
if 'DUMPCCSTATS' in env:
    inDumpInterval = int(env['DUMPCCSTATS'])

icProfileStart = -1
if 'PROFILEIC' in env:
    icProfileStart = int(env['PROFILEIC'])

progressInterval = 0
if 'PROGRESS' in env:
    progressInterval = int(env['PROGRESS'])

# MSHR parameters
l1mshrTargets = -1
l1mshrsData = -1
if 'MSHRSL1D' in env and 'MSHRL1TARGETS' in env:
    l1mshrsData = int(env['MSHRSL1D'])
    l1mshrTargets = int(env['MSHRL1TARGETS'])

l1mshrsInst = -1
if 'MSHRSL1I' in env and 'MSHRL1TARGETS' in env:
    l1mshrsInst = int(env['MSHRSL1I'])
    l1mshrTargets = int(env['MSHRL1TARGETS'])

l2mshrTargets = -1
l2mshrs = -1
if 'MSHRSL2' in env and 'MSHRL2TARGETS' in env:
    l2mshrs = int(env['MSHRSL2'])
    l2mshrTargets = int(env['MSHRL2TARGETS'])

###############################################################################
# Root, CPUs and L1 caches
###############################################################################

root = DetailedStandAlone()
if progressInterval > 0:
    root.progress_interval = progressInterval

# Create CPUs
BaseCPU.workload = Parent.workload
root.simpleCPU = [ CPU(defer_registration=True, cpu_id=i)
                   for i in xrange(int(env['NP'])) ]
root.detailedCPU = [ DetailedCPU(defer_registration=True, cpu_id=i)
                     for i in xrange(int(env['NP'])) ]

# Create L1 caches
if env['INTERCONNECT'] == 'bus':
    root.L1dcaches = [ DL1(out_bus=Parent.interconnect)
                       for i in xrange(int(env['NP'])) ]
    root.L1icaches = [ IL1(out_bus=Parent.interconnect)
                       for i in xrange(int(env['NP'])) ]
else:
    root.L1dcaches = [ DL1(out_interconnect=Parent.interconnect)
                       for i in xrange(int(env['NP'])) ]
    root.L1icaches = [ IL1(out_interconnect=Parent.interconnect)
                       for i in xrange(int(env['NP'])) ]

if env['PROTOCOL'] != 'none':
    if env['PROTOCOL'] in snoop_protocols:
        for cache in root.L1dcaches:
            cache.protocol = CoherenceProtocol(protocol=env['PROTOCOL'])
    elif env['PROTOCOL'] in directory_protocols:
        for cache in root.L1dcaches:
            cache.dirProtocolName = env['PROTOCOL']
            cache.dirProtocolDoTrace = coherenceTrace
            if coherenceTraceStart != 0:
                cache.dirProtocolTraceStart = coherenceTraceStart
            cache.dirProtocolDumpInterval = inDumpInterval

# Connect L1 caches to CPUs
for i in xrange(int(env['NP'])):
    root.simpleCPU[i].dcache = root.L1dcaches[i]
    root.simpleCPU[i].icache = root.L1icaches[i]
    root.detailedCPU[i].dcache = root.L1dcaches[i]
    root.detailedCPU[i].icache = root.L1icaches[i]
    root.L1dcaches[i].cpu_id = i
    root.L1icaches[i].cpu_id = i

if l1mshrsData != -1:
    for l1 in root.L1dcaches:
        l1.mshrs = l1mshrsData
        l1.tgts_per_mshr = l1mshrTargets

if l1mshrsInst != -1:
    for l1 in root.L1icaches:
        l1.mshrs = l1mshrsInst
        l1.tgts_per_mshr = l1mshrTargets

###############################################################################
# Fast-forwarding
###############################################################################

if env['BENCHMARK'] in Splash2.benchmarkNames:
    # Scientific workloads
    root.sampler = Sampler()
    root.sampler.phase0_cpus = Parent.simpleCPU
    root.sampler.phase1_cpus = Parent.detailedCPU

    if 'ISEXPERIMENT' in env and env['PROTOCOL'] in directory_protocols:
        root.sampler.periods = [0, 50000000000] # sampler is not used
        for cpu in root.detailedCPU:
            cpu.max_insts_any_thread = \
                Splash2.instructions[int(env['NP'])][env['BENCHMARK']]
    elif 'SIMINSTS' in env:
        root.sampler.periods = [0, 50000000000] # sampler is not used
        for cpu in root.detailedCPU:
            cpu.max_insts_any_thread = int(env['SIMINSTS'])
    elif 'FASTFORWARDTICKS' not in env:
        fwticks, simticks = Splash2.fastforward[env['BENCHMARK']]
        root.sampler.periods = [fwticks, simticks]
    else:
        # both values converted to int, as in the test workload branch below
        root.sampler.periods = [int(env['FASTFORWARDTICKS']),
                                int(env['SIMULATETICKS'])]
    root.setCPU(root.simpleCPU)
elif not env['BENCHMARK'].isdigit():
    # Simulator test workloads
    root.sampler = Sampler()
    root.sampler.phase0_cpus = Parent.simpleCPU
    root.sampler.phase1_cpus = Parent.detailedCPU
    root.sampler.periods = [int(env['FASTFORWARDTICKS']),
                            int(env['SIMULATETICKS'])]
    root.setCPU(root.simpleCPU)
else:
    # Multi-programmed workload
    root.samplers = [ Sampler() for i in xrange(int(env['NP'])) ]
    fwCycles = \
        workloads.workloads[int(env['NP'])][int(env['BENCHMARK'])][1]
    simulateCycles = int(env['SIMULATETICKS'])
    simulateStart = max(fwCycles)
    for i in xrange(int(env['NP'])):
        root.samplers[i].phase0_cpus = [Parent.simpleCPU[i]]
        root.samplers[i].phase1_cpus = [Parent.detailedCPU[i]]
        root.samplers[i].periods = [fwCycles[i], simulateCycles
                                    + (simulateStart - fwCycles[i])]
    root.setCPU(root.simpleCPU)

###############################################################################
# Interconnect and L2 caches
###############################################################################

if env['BENCHMARK'] in Splash2.benchmarkNames:
    BaseCache.multiprog_workload = False
else:
    BaseCache.multiprog_workload = True

if env['BENCHMARK'] in Splash2.benchmarkNames and 'FASTFORWARDTICKS' not in env:
    if 'PROFILEIC' in env:
        print >>sys.stderr, "warning: Production workload, \
ignoring user supplied profile start"
    icProfileStart = 0 #Splash2.fastforward[env['BENCHMARK']][0]

if env['BENCHMARK'].isdigit():
    if 'PROFILEIC' in env:
        print >>sys.stderr, "warning: Production workload, \
ignoring user supplied profile start"
    fwCycles = workloads.workloads[int(env['NP'])][int(env['BENCHMARK'])][1]
    icProfileStart = max(fwCycles)

moduloAddr = False
if env['BENCHMARK'] in Splash2.benchmarkNames:
    moduloAddr = True

Interconnect.cpu_count = int(env['NP'])
root.setInterconnect(env['INTERCONNECT'],
                     L2_BANK_COUNT,
                     icProfileStart,
                     moduloAddr)
root.setL2Banks()

if env['PROTOCOL'] in directory_protocols:
    for bank in root.l2:
        bank.dirProtocolName = env['PROTOCOL']
        bank.dirProtocolDoTrace = coherenceTrace
        if coherenceTraceStart != 0:
            bank.dirProtocolTraceStart = coherenceTraceStart

if l2mshrs != -1:
    for bank in root.l2:
        bank.mshrs = l2mshrs
        bank.tgts_per_mshr = l2mshrTargets

###############################################################################
# Workloads
###############################################################################

# Storage for multiprogrammed workloads
prog = []

###############################################################################
# SPLASH-2
###############################################################################

if env['BENCHMARK'] == 'Cholesky':
    root.workload = Splash2.Cholesky()
elif env['BENCHMARK'] == 'FFT':
    root.workload = Splash2.FFT()
elif env['BENCHMARK'] == 'LUContig':
    root.workload = Splash2.LU_contig()
elif env['BENCHMARK'] == 'LUNoncontig':
    root.workload = Splash2.LU_noncontig()
elif env['BENCHMARK'] == 'Radix':
    root.workload = Splash2.Radix()
elif env['BENCHMARK'] == 'Barnes':
    root.workload = Splash2.Barnes()
elif env['BENCHMARK'] == 'FMM':
    root.workload = Splash2.FMM()
elif env['BENCHMARK'] == 'OceanContig':
    root.workload = Splash2.Ocean_contig()
elif env['BENCHMARK'] == 'OceanNoncontig':
    root.workload = Splash2.Ocean_noncontig()
elif env['BENCHMARK'] == 'Raytrace':
    root.workload = Splash2.Raytrace()
elif env['BENCHMARK'] == 'WaterNSquared':
    root.workload = Splash2.Water_nsquared()
elif env['BENCHMARK'] == 'WaterSpatial':
    root.workload = Splash2.Water_spatial()

###############################################################################
# SPEC 2000
###############################################################################

elif env['BENCHMARK'] == 'gzip':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.GzipSource())
elif env['BENCHMARK'] == 'vpr':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.VprPlace())
elif env['BENCHMARK'] == 'gcc':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Gcc166())
elif env['BENCHMARK'] == 'mcf':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Mcf())
elif env['BENCHMARK'] == 'crafty':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Crafty())
elif env['BENCHMARK'] == 'parser':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Parser())
elif env['BENCHMARK'] == 'eon':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Eon1())
elif env['BENCHMARK'] == 'perlbmk':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Perlbmk1())
elif env['BENCHMARK'] == 'gap':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Gap())
elif env['BENCHMARK'] == 'vortex1':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Vortex1())
elif env['BENCHMARK'] == 'bzip':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Bzip2Source())
elif env['BENCHMARK'] == 'twolf':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Twolf())
elif env['BENCHMARK'] == 'wupwise':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Wupwise())
elif env['BENCHMARK'] == 'swim':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Swim())
elif env['BENCHMARK'] == 'mgrid':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Mgrid())
elif env['BENCHMARK'] == 'applu':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Applu())
elif env['BENCHMARK'] == 'mesa':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Mesa())
elif env['BENCHMARK'] == 'galgel':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Galgel())
elif env['BENCHMARK'] == 'art':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Art1())
elif env['BENCHMARK'] == 'equake':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Equake())
elif env['BENCHMARK'] == 'facerec':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Facerec())
elif env['BENCHMARK'] == 'ammp':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Ammp())
elif env['BENCHMARK'] == 'lucas':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Lucas())
elif env['BENCHMARK'] == 'fma3d':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Fma3d())
elif env['BENCHMARK'] == 'sixtrack':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Sixtrack())
elif env['BENCHMARK'] == 'apsi':
    for i in range(int(env['NP'])):
        prog.append(Spec2000.Apsi())

###############################################################################
# Multi-programmed workloads
###############################################################################

elif env['BENCHMARK'].isdigit():
    prog = Spec2000.createWorkload(
        workloads.workloads[int(env['NP'])][int(env['BENCHMARK'])][0])

###############################################################################
# Testprograms
###############################################################################

elif env['BENCHMARK'] == 'hello':
    root.workload = TestPrograms.HelloWorld()
else:
    panic("The BENCHMARK environment variable was set to something improper\n")

# Create multi-programmed workloads
if prog != []:
    for i in range(int(env['NP'])):
        root.simpleCPU[i].workload = prog[i]
        root.detailedCPU[i].workload = prog[i]

###############################################################################
# Statistics
###############################################################################

root.stats = Statistics(text_file=env['STATSFILE'])
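
For multi-programmed workloads, run.py gives each CPU its own sampler so that every program is fast-forwarded by its own cycle count while all CPUs still stop at the same tick. The arithmetic can be sketched as follows. This is an illustration only: the function name sampler_periods and the cycle counts are hypothetical, and the sketch uses Python 3 syntax although the listing itself is Python 2.

```python
def sampler_periods(fw_cycles, simulate_cycles):
    """Return one [fast_forward, detailed] period pair per CPU.

    Detailed simulation of the CPU with the longest fast-forward lasts
    simulate_cycles; every other CPU switches to detailed mode earlier
    and therefore gets a correspondingly longer detailed period.
    """
    simulate_start = max(fw_cycles)
    return [[fw, simulate_cycles + (simulate_start - fw)] for fw in fw_cycles]


# Hypothetical fast-forward counts for a 4-CPU workload.
periods = sampler_periods([100, 250, 400, 50], 1000)
for cpu_id, (fast_forward, detailed) in enumerate(periods):
    # The last column is the tick at which this CPU finishes.
    print(cpu_id, fast_forward, detailed, fast_forward + detailed)
```

In this example every CPU finishes at tick 1400, which is the longest fast-forward (400) plus the requested simulation length (1000).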
D.2 DetailedConfig.py
from m5 import *
from MemConfig import *
from FuncUnitConfig import *

###############################################################################
# Branch Predictor
###############################################################################

class DefaultBranchPred(BranchPred):
    pred_class = 'hybrid'
    local_hist_regs = '2ki'
    local_hist_bits = 11
    local_index_bits = 11
    local_xor = False
    global_hist_bits = 13
    global_index_bits = 13
    global_xor = False
    choice_index_bits = 13
    choice_xor = False
    ras_size = 16
    btb_size = '2ki'
    btb_assoc = 4

###############################################################################
# CPUs
###############################################################################

class DetailedCPU(FullCPU):
    iq = StandardIQ(size = 64, caps = [0, 0, 0, 0])
    iq_comm_latency = 1
    fupools = DefaultFUP()
    lsq_size = 32
    rob_size = 128
    rob_caps = [0, 0, 0, 0]
    storebuffer_size = 32
    width = 8
    issue_bandwidth = [8, 8]
    prioritized_issue = False
    thread_weights = [1, 1, 1, 1]
    dispatch_to_issue = 1
    decode_to_dispatch = 10
    mispred_recover = 3
    fetch_branches = 3
    ifq_size = 32
    num_icache_ports = 1
    branch_pred = DefaultBranchPred()

    def setCache(self, dcache, icache):
        self.dcache = dcache
        self.icache = icache

class CPU(SimpleCPU):
    def setCache(self, dcache, icache):
        self.dcache = dcache
        self.icache = icache

###############################################################################
# Root
###############################################################################

class DetailedStandAlone(Root):
    #clock = '3Hz'
    clock = '3200MHz'
    toMemBus = ToMemBus()
    ram = SDRAM(in_bus=Parent.toMemBus)
    l2 = []

    def setCPU(self, inCPU):
        self.cpu = inCPU

    #def setNumCPUs(self, numCPUs):
        #self.interconnect.L1CacheCount = (numCPUs*2)

    def setInterconnect(self, optionString, L2BankCount, profileStart,
                        moduloAddr):
        if optionString == 'bus':
            self.interconnect = ToL2Bus()
            self.createL2(True, L2BankCount, moduloAddr)
        elif optionString == 'myBus':
            self.interconnect = InterconnectBus()
            self.createL2(False, L2BankCount, moduloAddr)
        elif optionString == 'crossbar':
            self.interconnect = InterconnectCrossbar()
            self.createL2(False, L2BankCount, moduloAddr)
        elif optionString == 'ideal':
            self.interconnect = InterconnectIdeal()
            self.createL2(False, L2BankCount, moduloAddr)
        elif optionString == 'idealwdelay':
            self.interconnect = InterconnectIdealWithDelay()
            self.createL2(False, L2BankCount, moduloAddr)
        elif optionString == 'pipeBus':
            self.interconnect = PipelinedBus()
            self.createL2(False, L2BankCount, moduloAddr)
        elif optionString == 'butterfly':
            self.interconnect = InterconnectButterfly()
            self.createL2(False, L2BankCount, moduloAddr)
        else:
            panic('Unknown interconnect selected')

        if profileStart != -1 and optionString != 'bus':
            self.interconnectProfiler = InterconnectProfile()
            self.interconnectProfiler.traceSends = True
            self.interconnectProfiler.traceChannelUtil = True
            self.interconnectProfiler.traceStartTick = profileStart
            self.interconnectProfiler.interconnect = self.interconnect

    def createL2(self, bus, L2BankCount, moduloAddr):
        for bankID in range(0, L2BankCount):
            thisBank = None
            if bus:
                thisBank = L2Bank(in_bus=Parent.interconnect,
                                  out_bus=Parent.toMemBus)
            else:
                thisBank = L2Bank(in_interconnect=Parent.interconnect,
                                  out_bus=Parent.toMemBus)

            if moduloAddr and not bus:
                thisBank.setModuloAddr(bankID, L2BankCount)
            else:
                thisBank.setAddrRange(bankID, L2BankCount)
            self.l2.append(thisBank)

    def setL2Banks(self):
        self.L2Bank0 = self.l2[0]
        self.L2Bank1 = self.l2[1]
        self.L2Bank2 = self.l2[2]
        self.L2Bank3 = self.l2[3]
D.3 FuncUnitConfig.py
from m5 import *

class IntALU(FUDesc):
    opList = [ OpDesc(opClass='IntAlu') ]
    count = 4

class IntMultDiv(FUDesc):
    opList = [ OpDesc(opClass='IntMult', opLat=3),
               OpDesc(opClass='IntDiv', opLat=20, issueLat=19) ]
    count = 2

class FP_ALU(FUDesc):
    opList = [ OpDesc(opClass='FloatAdd', opLat=2),
               OpDesc(opClass='FloatCmp', opLat=2),
               OpDesc(opClass='FloatCvt', opLat=2) ]
    count = 4

class FP_MultDiv(FUDesc):
    opList = [ OpDesc(opClass='FloatMult', opLat=4),
               OpDesc(opClass='FloatDiv', opLat=12, issueLat=12),
               OpDesc(opClass='FloatSqrt', opLat=24, issueLat=24) ]
    count = 2

class ReadPort(FUDesc):
    opList = [ OpDesc(opClass='MemRead') ]
    count = 0

class WritePort(FUDesc):
    opList = [ OpDesc(opClass='MemWrite') ]
    count = 0

class RdWrPort(FUDesc):
    opList = [ OpDesc(opClass='MemRead'), OpDesc(opClass='MemWrite') ]
    count = 4

class IprPort(FUDesc):
    opList = [ OpDesc(opClass='IprAccess', opLat = 3, issueLat = 3) ]
    count = 1

class DefaultFUP(FuncUnitPool):
    FUList = [ IntALU(), IntMultDiv(), FP_ALU(), FP_MultDiv(), ReadPort(),
               WritePort(), RdWrPort(), IprPort() ]
D.4 MemConfig.py
from m5 import ∗
###############################################################################
# CACHES
###############################################################################
class BaseL1Cache ( BaseCache ) :
in bus = NULL
s i z e = ’ 64kB ’
as soc = 8
b l o c k s i z e = 64
mshrs = 4
tgt s per mshr = 4
cpu count = in t ( env [ ’NP ’ ] )
i s s h a r ed = False
class IL1 (BaseL1Cache ) :
l a t ency = Parent . c l o ck . per iod
i s r e a d on l y = True
class DL1(BaseL1Cache ) :
l a t ency = 3 ∗ Parent . c l o ck . per iod
i s r e a d on l y = False
class L2Bank(BaseCache ) :
s i z e = ’ 1MB’ # 1MB ∗ 4 banks = 4MB t o t a l cache s i z e
as soc = 8
b l o c k s i z e = 64
la t ency = 14 ∗ Parent . c l o ck . per iod
mshrs = 8
tgt s per mshr = 4
cpu count = in t ( env [ ’NP ’ ] )
i s s h a r ed = True
i s r e a d on l y = False
def setModuloAddr ( s e l f , bankID , bank count ) :
s e l f . do modulo addr = True
s e l f . bank id = bankID
s e l f . bank count = bank count
def setAddrRange ( s e l f , bankID , bank count ) :
o f f s e t = MaxAddr / bank count
i f bankID == 0 :
s e l f . addr range = AddrRange (0 , o f f s e t )
e l i f bankID == ( bank count−1) :
s e l f . addr range = AddrRange ( ( bankID∗ o f f s e t )+1, MaxAddr)
else :
s e l f . addr range = AddrRange ( ( bankID∗ o f f s e t )+1, ( ( bankID+1)∗ o f f s e t ) )
###############################################################################
# INTERCONNECT
###############################################################################
class ToL2Bus(Bus ) :
width = 64
c l o ck = Parent . c l o ck . per iod
class InterconnectBus ( Spl itTransBus ) :
278
D.4. MEMCONFIG.PY
width = 64
c l o ck = 1 ∗ Parent . c l o ck . per iod
t ran s f e rDe l ay = 4
arb i t r a t i onDe l ay = 5
p ip e l i n ed = False
class Pipe l inedBus ( InterconnectBus ) :
p i p e l i n ed = True
class I n t e r c onne c t I d e a l ( I d e a l I n t e r c onne c t ) :
width = 64 # the cache needs f i n i t e width
c l o ck = 1 ∗ Parent . c l o ck . per iod
t ran s f e rDe l ay = 0
arb i t r a t i onDe l ay = 0
class InterconnectIdea lWithDelay ( Id ea l I n t e r c onne c t ) :
width = 64 # the cache needs f i n i t e width
c l o ck = 1 ∗ Parent . c l o ck . per iod
t ran s f e rDe l ay = 4
arb i t r a t i onDe l ay = 5
class Inte rconnectCros sbar ( Crossbar ) :
width = 64
c l o ck = 1 ∗ Parent . c l o ck . per iod
t ran s f e rDe l ay = 4
arb i t r a t i onDe l ay = 5
class I n t e r c onne c tBut t e r f l y ( But t e r f l y ) :
width = 64
c l o ck = 1 ∗ Parent . c l o ck . per iod
rad ix = 2
banks = 4
i f i n t ( env [ ’NP ’ ] ) == 2 :
# t o t a l de l ay i s 10 c l o c k c y c l e s (2∗2+3∗2)
t r an s f e rDe l ay = 2 # per l i n k t r an s f e r de lay
a rb i t r a t i onDe l ay = 0 # arb in swi tches , no e x p l i c i t de l ay
sw i t ch de l ay = 2
e l i f i n t ( env [ ’NP ’ ] ) == 4 :
# t o t a l de l ay i s 10 c l o c k c y c l e s (2∗3+1∗4)
t r an s f e rDe l ay = 1 # per l i n k t r an s f e r de lay
a rb i t r a t i onDe l ay = 0 # arb in swi tches , no e x p l i c i t de l ay
sw i t ch de l ay = 2
else :
# t o t a l de l ay i s 9 c l o c k c y c l e s (1∗4+1∗5)
t r an s f e rDe l ay = 1 # per l i n k t r an s f e r de lay
a rb i t r a t i onDe l ay = 0 # arb in swi tches , no e x p l i c i t de l ay
sw i t ch de l ay = 1
###############################################################################
# MEMORY AND MEMORY BUS
###############################################################################
class ToMemBus(Bus):
    width = 8
    #clock = 1.5 * Parent.clock.period
    clock = 4 * Parent.clock.period

class SDRAM(BaseMemory):
    #latency = 200 * Parent.clock.period
    latency = 112 * Parent.clock.period
    uncacheable_latency = 1000 * Parent.clock.period
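The delay comments in InterconnectButterfly can be checked with a few lines of arithmetic. The sketch below is not part of MemConfig.py. It assumes that a request crosses a number of switches and one more link than switches, and the switch counts (2, 3 and 4) are read off the comments in the listing rather than taken from the simulator source. The function name `butterfly_delay` is hypothetical.

```python
def butterfly_delay(switches, switch_delay, transfer_delay):
    """Total traversal delay in clock cycles.

    Assumption: a path crosses `switches` switches and `switches + 1` links,
    which matches the terms in the comments of InterconnectButterfly.
    """
    links = switches + 1
    return switches * switch_delay + links * transfer_delay


# (switches, switch_delay, transferDelay) for each branch of the listing
np2_delay = butterfly_delay(2, 2, 2)    # NP == 2 branch: 2*2 + 3*2
np4_delay = butterfly_delay(3, 2, 1)    # NP == 4 branch: 2*3 + 1*4
other_delay = butterfly_delay(4, 1, 1)  # else branch:    1*4 + 1*5
```

Under these assumptions the three branches give 10, 10 and 9 cycles, which agrees with the totals stated in the comments.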
D.5 Spec2000.py
from m5 import *
import os
import os.path
import sys
from shutil import copy, copytree
import glob

# Originally written by James Srinivasan
# Further modified by Magnus Jahre <jahre@idi.ntnu.no>

if 'NP' not in env:
    panic("No number of processors was defined.\ne.g. -ENP=4\n")

rootdir = os.getenv("DIPPROOT")
if rootdir is None:
    print("Environment variable DIPPROOT not set. Quitting...")
    sys.exit(-1)

# Assumes current working directory is where we ought to run the benchmarks from,
# copy datasets to etc.

# Root of where SPEC2000 install lives
spec_root = rootdir + '/experiments/benchmarks/spec2000/SPEC_2000_REDUCED'

# Location of SPEC binaries
spec_bin = rootdir + '/experiments/benchmarks/spec2000/'

# String to benchmark mappings
def createWorkload(benchmarkStrings):
    returnArray = []
    for string in benchmarkStrings:
        if string == 'gzip':
            returnArray.append(GzipSource())
        elif string == 'vpr':
            returnArray.append(VprPlace())
        elif string == 'gcc':
            returnArray.append(Gcc166())
        elif string == 'mcf':
            returnArray.append(Mcf())
        elif string == 'crafty':
            returnArray.append(Crafty())
        elif string == 'parser':
            returnArray.append(Parser())
        elif string == 'eon':
            returnArray.append(Eon1())
        elif string == 'perlbmk':
            returnArray.append(Perlbmk1())
        elif string == 'gap':
            returnArray.append(Gap())
        elif string == 'vortex1':
            returnArray.append(Vortex1())
        elif string == 'bzip':
            returnArray.append(Bzip2Source())
        elif string == 'twolf':
            returnArray.append(Twolf())
        elif string == 'wupwise':
            returnArray.append(Wupwise())
        elif string == 'swim':
            returnArray.append(Swim())
        elif string == 'mgrid':
            returnArray.append(Mgrid())
        elif string == 'applu':
            returnArray.append(Applu())
        elif string == 'mesa':
            returnArray.append(Mesa())
        elif string == 'galgel':
            returnArray.append(Galgel())
        elif string == 'art':
            returnArray.append(Art1())
        elif string == 'equake':
            returnArray.append(Equake())
        elif string == 'facerec':
            returnArray.append(Facerec())
        elif string == 'ammp':
            returnArray.append(Ammp())
        elif string == 'lucas':
            returnArray.append(Lucas())
        elif string == 'fma3d':
            returnArray.append(Fma3d())
        elif string == 'sixtrack':
            returnArray.append(Sixtrack())
        elif string == 'apsi':
            returnArray.append(Apsi())
        else:
            panic("Unknown benchmark is part of workload")
    return returnArray
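The string-to-class mapping in createWorkload could also be expressed as a lookup table. The following is a minimal sketch of that alternative, not the script used for the experiments. The two stand-in classes replace the LiveProcess subclasses so the example runs without M5, `ValueError` stands in for M5's `panic`, and the names `BENCHMARKS` and `create_workload` are hypothetical.

```python
class GzipSource(object):
    """Stand-in for the LiveProcess subclass of the same name."""
    pass


class Mcf(object):
    """Stand-in for the LiveProcess subclass of the same name."""
    pass


# Benchmark name -> process class; the real table would list every benchmark.
BENCHMARKS = {
    'gzip': GzipSource,
    'mcf': Mcf,
}


def create_workload(names, table=BENCHMARKS):
    """Instantiate one process object per benchmark name, in order."""
    workload = []
    for name in names:
        if name not in table:
            # The original script calls panic() here.
            raise ValueError("Unknown benchmark is part of workload: %s" % name)
        workload.append(table[name]())
    return workload
```

With a table, adding a benchmark is a one-line change, and the set of valid names can be listed with `sorted(BENCHMARKS)`. In the real script the table would have to be defined after the process classes.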
###############################################################################
###############################################################################
class GzipSource(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '164.gzip/input/ref.source'), '.')
    executable = os.path.join(spec_bin, 'gzip00.peak.ev6')
    cmd = 'gzip00.peak.ev6 ref.source 60'

class GzipLog(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '164.gzip/input/ref.log'), '.')
    executable = os.path.join(spec_bin, 'gzip00.peak.ev6')
    cmd = 'gzip00.peak.ev6 ref.log 60'

class GzipGraphic(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '164.gzip/input/ref.graphic'), '.')
    executable = os.path.join(spec_bin, 'gzip00.peak.ev6')
    cmd = 'gzip00.peak.ev6 ref.graphic 60'

class GzipRandom(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '164.gzip/input/ref.random'), '.')
    executable = os.path.join(spec_bin, 'gzip00.peak.ev6')
    cmd = 'gzip00.peak.ev6 ref.random 60'

class GzipProgram(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '164.gzip/input/ref.program'), '.')
    executable = os.path.join(spec_bin, 'gzip00.peak.ev6')
    cmd = 'gzip00.peak.ev6 ref.program 60'
###############################################################################
class VprPlace(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '175.vpr/input/ref.net'), 'refVpr.net')
        copy(os.path.join(spec_root, '175.vpr/input/ref.arch.in'), 'refVpr.arch.in')
    executable = os.path.join(spec_bin, 'vpr00.peak.ev6')
    cmd = 'vpr00.peak.ev6 ' + \
          'refVpr.net refVpr.arch.in place.out dum.out ' + \
          '-nodisp -place_only -init_t 5 -exit_t 0.005 -alpha_t 0.9412 -inner_num 2'
###############################################################################
class Gcc166(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '176.gcc/input/ref.166.i'), ".")
    executable = os.path.join(spec_bin, 'gcc00.peak.ev6')
    cmd = 'gcc00.peak.ev6 ref.166.i -o ref.166.s'

class Gcc200(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '176.gcc/input/ref.200.i'), ".")
    #executable = os.path.join(spec_bin, 'cc100.peak.ev6')
    #cmd = 'cc100.peak.ev6 200.i -o 200.s'
    executable = os.path.join(spec_bin, 'gcc00.peak.ev6')
    cmd = 'gcc00.peak.ev6 ref.200.i -o ref.200.s'

class GccExpr(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '176.gcc/input/ref.expr.i'), ".")
    executable = os.path.join(spec_bin, 'gcc00.peak.ev6')
    cmd = 'gcc00.peak.ev6 ref.expr.i -o ref.expr.s'

class GccIntegrate(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '176.gcc/input/ref.integrate.i'), ".")
    executable = os.path.join(spec_bin, 'gcc00.peak.ev6')
    cmd = 'gcc00.peak.ev6 ref.integrate.i -o ref.integrate.s'

class GccScilab(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '176.gcc/input/ref.scilab.i'), ".")
    executable = os.path.join(spec_bin, 'gcc00.peak.ev6')
    cmd = 'gcc00.peak.ev6 ref.scilab.i -o ref.scilab.s'
###############################################################################
class Mcf(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '181.mcf/input/ref.in'), "refMcf.in")
    executable = os.path.join(spec_bin, 'mcf00.peak.ev6')
    cmd = 'mcf00.peak.ev6 refMcf.in'
###############################################################################
class Crafty(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '186.crafty/input/ref/ref.in'), "./craftyref.in")
    executable = os.path.join(spec_bin, 'crafty00.peak.ev6')
    cmd = 'crafty00.peak.ev6'
    input = 'craftyref.in' # source for stdin
###############################################################################
class Parser(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '197.parser/input/ref.in'), "refParser.in")
        copy(os.path.join(spec_root, '197.parser/input/2.1.dict'), ".")
        # for some reason this constructor gets called twice but if the target
        # already exists copytree will fail so check first
        if not os.path.exists("words"):
            copytree(os.path.join(spec_root, '197.parser/input/words'), "words")
    executable = os.path.join(spec_bin, 'parser00.peak.ev6')
    cmd = 'parser00.peak.ev6 2.1.dict -batch'
    input = 'refParser.in' # source for stdin
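The check-before-copytree pattern used in Parser recurs in several of the classes that follow. It could be factored into a small helper; the sketch below illustrates the idea and is not part of the original script. The helper name `copytree_once` is hypothetical.

```python
import os
import shutil


def copytree_once(src, dst):
    """Copy the directory tree src to dst unless dst already exists.

    shutil.copytree raises an error when the destination exists, so a
    constructor that runs twice must skip the second copy.
    Returns True if a copy was made and False if it was skipped.
    """
    if os.path.exists(dst):
        return False
    shutil.copytree(src, dst)
    return True
```

A constructor could then call `copytree_once(os.path.join(spec_root, '197.parser/input/words'), "words")` and be run repeatedly without failing.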
###############################################################################
class Eon1(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '252.eon/input/ref/eon.dat'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/materials'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/spectra.dat'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/chair.control.cook'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/chair.camera'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/chair.surfaces'), ".")
    executable = os.path.join(spec_bin, 'eon00.peak.ev6')
    cmd = 'eon00.peak.ev6 chair.control.cook chair.camera chair.surfaces chair.cook.ppm ppm pixels_out.cook'

class Eon2(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '252.eon/input/ref/eon.dat'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/materials'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/spectra.dat'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/chair.control.rushmeier'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/chair.camera'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/chair.surfaces'), ".")
    executable = os.path.join(spec_bin, 'eon00.peak.ev6')
    cmd = 'eon00.peak.ev6 chair.control.rushmeier chair.camera chair.surfaces chair.rushmeier.ppm ppm pixels_out.rushmeier'

class Eon3(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '252.eon/input/ref/eon.dat'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/materials'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/spectra.dat'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/chair.control.kajiya'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/chair.camera'), ".")
        copy(os.path.join(spec_root, '252.eon/input/ref/chair.surfaces'), ".")
    executable = os.path.join(spec_bin, 'eon00.peak.ev6')
    cmd = 'eon00.peak.ev6 chair.control.kajiya chair.camera chair.surfaces chair.kajiya.ppm ppm pixels_out.kajiya'
###############################################################################
class Perlbmk1(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/lenums'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/diffmail.pl'), ".")
        # for some reason this constructor gets called twice but if the target
        # already exists copytree will fail so check first
        if not os.path.exists("lib"):
            copytree(os.path.join(spec_root, '253.perlbmk/input/ref/lib'), "lib")
    executable = os.path.join(spec_bin, 'perlbmk00.peak.ev6')
    cmd = 'perlbmk00.peak.ev6 -I./lib diffmail.pl 2 550 15 24 23 100'

class Perlbmk2(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/lenums'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/cpu2000_mhonarc.rc'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/makerand.pl'), ".")
        # for some reason this constructor gets called twice but if the target
        # already exists copytree will fail so check first
        if not os.path.exists("lib"):
            copytree(os.path.join(spec_root, '253.perlbmk/input/ref/lib'), "lib")
    executable = os.path.join(spec_bin, 'perlbmk00.peak.ev6')
    cmd = 'perlbmk00.peak.ev6 -I./lib makerand.pl'

class Perlbmk3(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/lenums'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/cpu2000_mhonarc.rc'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/perfect.pl'), ".")
        # for some reason this constructor gets called twice but if the target
        # already exists copytree will fail so check first
        if not os.path.exists("lib"):
            copytree(os.path.join(spec_root, '253.perlbmk/input/ref/lib'), "lib")
    executable = os.path.join(spec_bin, 'perlbmk00.peak.ev6')
    cmd = 'perlbmk00.peak.ev6 -I./lib perfect.pl b 3 m 4'
class Perlbmk4(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/lenums'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/cpu2000_mhonarc.rc'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/splitmail.pl'), ".")
        # for some reason this constructor gets called twice but if the target
        # already exists copytree will fail so check first
        if not os.path.exists("lib"):
            copytree(os.path.join(spec_root, '253.perlbmk/input/ref/lib'), "lib")
    executable = os.path.join(spec_bin, 'perlbmk00.peak.ev6')
    cmd = 'perlbmk00.peak.ev6 -I./lib splitmail.pl 850 5 19 18 1500'

class Perlbmk5(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/lenums'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/cpu2000_mhonarc.rc'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/splitmail.pl'), ".")
        # for some reason this constructor gets called twice but if the target
        # already exists copytree will fail so check first
        if not os.path.exists("lib"):
            copytree(os.path.join(spec_root, '253.perlbmk/input/ref/lib'), "lib")
    executable = os.path.join(spec_bin, 'perlbmk00.peak.ev6')
    cmd = 'perlbmk00.peak.ev6 -I./lib splitmail.pl 704 12 26 16 836'

class Perlbmk6(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/lenums'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/cpu2000_mhonarc.rc'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/splitmail.pl'), ".")
        # for some reason this constructor gets called twice but if the target
        # already exists copytree will fail so check first
        if not os.path.exists("lib"):
            copytree(os.path.join(spec_root, '253.perlbmk/input/ref/lib'), "lib")
    executable = os.path.join(spec_bin, 'perlbmk00.peak.ev6')
    cmd = 'perlbmk00.peak.ev6 -I./lib splitmail.pl 535 13 25 24 1091'

class Perlbmk7(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/lenums'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/cpu2000_mhonarc.rc'), ".")
        copy(os.path.join(spec_root, '253.perlbmk/input/ref/splitmail.pl'), ".")
        # for some reason this constructor gets called twice but if the target
        # already exists copytree will fail so check first
        if not os.path.exists("lib"):
            copytree(os.path.join(spec_root, '253.perlbmk/input/ref/lib'), "lib")
    executable = os.path.join(spec_bin, 'perlbmk00.peak.ev6')
    cmd = 'perlbmk00.peak.ev6 -I./lib splitmail.pl 957 12 23 26 1014'
###############################################################################
class Gap(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '254.gap/input/ref/ref.in'), "./gapref.in")
        # copy input file by file
        for file in glob.glob(os.path.join(spec_root, '254.gap/input/ref/*')):
            if os.path.basename(file) != "ref.in":
                copy(file, ".")
    executable = os.path.join(spec_bin, 'gap00.peak.ev6')
    cmd = 'gap00.peak.ev6 -l ./ -q -m 192M'
    input = 'gapref.in'
###############################################################################
class Vortex1(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '255.vortex/input/persons.1k'), "./persons.1k")
        copy(os.path.join(spec_root, '255.vortex/input/lendian.rnv'), "./lendian.rnv")
        copy(os.path.join(spec_root, '255.vortex/input/lendian.wnv'), "./lendian.wnv")
        copy(os.path.join(spec_root, '255.vortex/input/lendian1.raw'), "./lendian1.raw")
    executable = os.path.join(spec_bin, 'vortex00.peak.ev6')
    cmd = 'vortex00.peak.ev6 lendian1.raw'

class Vortex2(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '255.vortex/input/persons.1k'), "./persons.1k")
        copy(os.path.join(spec_root, '255.vortex/input/lendian.rnv'), "./lendian.rnv")
        copy(os.path.join(spec_root, '255.vortex/input/lendian.wnv'), "./lendian.wnv")
        copy(os.path.join(spec_root, '255.vortex/input/lendian2.raw'), "./lendian2.raw")
    executable = os.path.join(spec_bin, 'vortex00.peak.ev6')
    cmd = 'vortex00.peak.ev6 lendian2.raw'

class Vortex3(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '255.vortex/input/persons.1k'), "./persons.1k")
        copy(os.path.join(spec_root, '255.vortex/input/lendian.rnv'), "./lendian.rnv")
        copy(os.path.join(spec_root, '255.vortex/input/lendian.wnv'), "./lendian.wnv")
        copy(os.path.join(spec_root, '255.vortex/input/lendian3.raw'), "./lendian3.raw")
    executable = os.path.join(spec_bin, 'vortex00.peak.ev6')
    cmd = 'vortex00.peak.ev6 lendian3.raw'
###############################################################################
class Bzip2Source(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '256.bzip2/input/ref.source'), ".")
    executable = os.path.join(spec_bin, 'bzip200.peak.ev6')
    cmd = 'bzip200.peak.ev6 ref.source 58'

class Bzip2Graphic(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '256.bzip2/input/ref.graphic'), ".")
    executable = os.path.join(spec_bin, 'bzip200.peak.ev6')
    cmd = 'bzip200.peak.ev6 ref.graphic 58'

class Bzip2Program(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '256.bzip2/input/ref.program'), ".")
    executable = os.path.join(spec_bin, 'bzip200.peak.ev6')
    cmd = 'bzip200.peak.ev6 ref.program 58'
###############################################################################
class Twolf(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file by file
        for file in glob.glob(os.path.join(spec_root, '300.twolf/input/ref/*')):
            copy(file, ".")
    executable = os.path.join(spec_bin, 'twolf00.peak.ev6')
    cmd = 'twolf00.peak.ev6 ref'
###############################################################################
class Wupwise(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '168.wupwise/input/ref/wupwise.in'), ".")
    executable = os.path.join(spec_bin, 'wupwise00.peak.ev6')
    cmd = 'wupwise00.peak.ev6'
###############################################################################
class Swim(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '171.swim/input/ref/swim.in'), ".")
    executable = os.path.join(spec_bin, 'swim00.peak.ev6')
    cmd = 'swim00.peak.ev6'
    input = 'swim.in'
###############################################################################
class Mgrid(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '172.mgrid/input/ref/mgrid.in'), ".")
    executable = os.path.join(spec_bin, 'mgrid00.peak.ev6')
    cmd = 'mgrid00.peak.ev6'
    input = 'mgrid.in'
###############################################################################
class Applu(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '173.applu/input/ref/applu.in'), ".")
    executable = os.path.join(spec_bin, 'applu00.peak.ev6')
    cmd = 'applu00.peak.ev6'
    input = 'applu.in'
###############################################################################
class Mesa(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '177.mesa/input/ref.in'), "./refMesa.in")
        # Can't find this file
        #copy(os.path.join(spec_root, 'benchspec/CFP2000/177.mesa/data/ref/input/numbers'), ".")
    executable = os.path.join(spec_bin, 'mesa00.peak.ev6')
    cmd = 'mesa00.peak.ev6 -frames 1000 -meshfile refMesa.in -ppmfile mesa.ppm'
###############################################################################
class Galgel(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '178.galgel/input/ref/ref.in'), "refGalgel.in")
    executable = os.path.join(spec_bin, 'galgel00.peak.ev6')
    cmd = 'galgel00.peak.ev6'
    input = 'refGalgel.in'
###############################################################################
class Art1(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '179.art/input/a10.img'), ".")
        copy(os.path.join(spec_root, '179.art/input/c756hel.in'), ".")
        copy(os.path.join(spec_root, '179.art/input/hc.img'), ".")
    executable = os.path.join(spec_bin, 'art00.peak.ev6')
    cmd = 'art00.peak.ev6 -scanfile c756hel.in -trainfile1 a10.img -trainfile2 hc.img -stride 2 -startx 110 -starty 200 -endx 160 -endy 240 -objects 10'
###############################################################################
class Art2(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '179.art/input/a10.img'), ".")
        copy(os.path.join(spec_root, '179.art/input/c756hel.in'), ".")
        copy(os.path.join(spec_root, '179.art/input/hc.img'), ".")
    executable = os.path.join(spec_bin, 'art00.peak.ev6')
    cmd = 'art00.peak.ev6 -scanfile c756hel.in -trainfile1 a10.img -trainfile2 hc.img -stride 2 -startx 470 -starty 140 -endx 520 -endy 180 -objects 10'
###############################################################################
class Equake(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '183.equake/input/ref/inp.in'), "inpEquake.in")
    executable = os.path.join(spec_bin, 'equake00.peak.ev6')
    cmd = 'equake00.peak.ev6'
    input = 'inpEquake.in'
###############################################################################
class Facerec ( L iveProces s ) :
def i n i t ( s e l f , va lue par ent = None , ∗∗kwargs ) :
L iveProces s . i n i t ( s e l f ) # c a l l parent cons t ruc t o r
# copy input f i l e by f i l e
for f i l e in glob . g lob ( os . path . j o i n ( spec root , ’ 187 . f a c e r e c / input / r e f /∗ ’ ) ) :
i f os . path . basename ( f i l e ) == ” r e f . in ” :
copy ( f i l e , ” . / r e fFac e r e c . in ”)
else :
copy ( f i l e , ” . ”)
for f i l e in glob . g lob ( os . path . j o i n ( spec root , ’ 187 . f a c e r e c / input / a l l / input
/∗ ’ ) ) :
copy ( f i l e , ” . ”)
executab l e = os . path . j o i n ( spec b in , ’ f a c e r e c 00 . peak . ev6 ’ )
cmd = ’ f a c e r e c 00 . peak . ev6 ’
input = ’ r e fFac e r e c . in ’
###############################################################################
class Ammp(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file by file
        #for file in glob.glob(os.path.join(spec_root, '188.ammp/input/*')):
        #    copy(file, ".")
        if not os.path.exists("input"):
            copytree(os.path.join(spec_root, '188.ammp/input'), "input")
    executable = os.path.join(spec_bin, 'ammp00.peak.ev6')
    cmd = 'ammp00.peak.ev6'
    input = 'input/ref.in'
###############################################################################
class Lucas(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '189.lucas/input/ref/ref.in'), "./lucasref.in")
    executable = os.path.join(spec_bin, 'lucas00.peak.ev6')
    cmd = 'lucas00.peak.ev6'
    input = 'lucasref.in'
###############################################################################
class Fma3d(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '191.fma3d/input/ref/fma3d.in'), ".")
    executable = os.path.join(spec_bin, 'fma3d00.peak.ev6')
    cmd = 'fma3d00.peak.ev6'
###############################################################################
class Sixtrack(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file by file
        for file in glob.glob(os.path.join(spec_root, '200.sixtrack/input/all/input/*')):
            copy(file, ".")
        for file in glob.glob(os.path.join(spec_root, '200.sixtrack/input/ref/*')):
            copy(file, ".")
    executable = os.path.join(spec_bin, 'sixtrack00.peak.ev6')
    cmd = 'sixtrack00.peak.ev6'
    input = 'inp.in'
###############################################################################
class Apsi(LiveProcess):
    def __init__(self, _value_parent = None, **kwargs):
        LiveProcess.__init__(self) # call parent constructor
        # copy input file(s) to run directory
        copy(os.path.join(spec_root, '301.apsi/input/ref/apsi.in'), ".")
    executable = os.path.join(spec_bin, 'apsi00.peak.ev6')
    cmd = 'apsi00.peak.ev6'
    input = 'apsi.in'
###############################################################################
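The benchmark classes in this listing repeat one pattern: copy the reference inputs into the run directory, sometimes under a new name. The following is a minimal sketch of that pattern as a stand-alone helper. The function `stage_inputs` and its `renames` parameter are hypothetical names introduced here for illustration. They are not part of spec2000.py or of the M5 simulator.

```python
import glob
import os
import shutil


def stage_inputs(src_dir, run_dir, renames=None):
    """Copy every regular file in src_dir into run_dir (hypothetical helper).

    renames maps a source basename to the name it should get in run_dir,
    for example {"ref.in": "refFacerec.in"}; other files keep their name.
    Returns the sorted list of file names created in run_dir.
    """
    renames = renames or {}
    staged = []
    for path in glob.glob(os.path.join(src_dir, "*")):
        if not os.path.isfile(path):
            continue  # sub-directories are skipped; the listing uses copytree for those
        name = os.path.basename(path)
        target = renames.get(name, name)
        shutil.copy(path, os.path.join(run_dir, target))
        staged.append(target)
    return sorted(staged)
```

With such a helper, the first loop in the Facerec class would reduce to a single call with `renames={"ref.in": "refFacerec.in"}`.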
D.6 workloads.py
# Autogenerated workload configuration file
workloads = {
    2: {
        1: (['sixtrack', 'gcc'], [1092486526, 1089243335]),
        2: (['twolf', 'mcf'], [1059202723, 1053008801]),
        3: (['twolf', 'twolf'], [1019199715, 1022103917]),
        4: (['gcc', 'bzip'], [1002476696, 1071000820]),
        5: (['equake', 'vpr'], [1051263392, 1035899636]),
        6: (['applu', 'mesa'], [1078310236, 1065791271]),
        7: (['vortex1', 'vortex1'], [1097073523, 1075978661]),
        8: (['galgel', 'gcc'], [1057204678, 1031439111]),
        9: (['art', 'twolf'], [1047406642, 1059060970]),
        10: (['crafty', 'gap'], [1064237847, 1017853995]),
        11: (['parser', 'gcc'], [1081587350, 1011430120]),
        12: (['perlbmk', 'ammp'], [1003893316, 1057282190]),
        13: (['vortex1', 'perlbmk'], [1033080485, 1028848432]),
        14: (['mgrid', 'gcc'], [1077758250, 1025841852]),
        15: (['art', 'eon'], [1078380907, 1082856533]),
        16: (['fma3d', 'sixtrack'], [1054547902, 1059269574]),
        17: (['apsi', 'bzip'], [1016725975, 1005591980]),
        18: (['ammp', 'apsi'], [1002336280, 1068720536]),
        19: (['sixtrack', 'apsi'], [1080496374, 1053859612]),
        20: (['gap', 'vortex1'], [1074901437, 1097439116]),
        21: (['vpr', 'parser'], [1019331389, 1046595780]),
        22: (['sixtrack', 'gcc'], [1093357813, 1021595111]),
        23: (['crafty', 'ammp'], [1052896999, 1054364834]),
        24: (['bzip', 'twolf'], [1025306235, 1090087001]),
        25: (['fma3d', 'fma3d'], [1034277622, 1081375595]),
        26: (['bzip', 'ammp'], [1068774434, 1069302295]),
        27: (['eon', 'bzip'], [1063951215, 1029525317]),
        28: (['mgrid', 'ammp'], [1077166566, 1054993749]),
        29: (['mesa', 'mgrid'], [1010844627, 1033250230]),
        30: (['eon', 'gcc'], [1083534947, 1095151804]),
        31: (['mgrid', 'mgrid'], [1056202128, 1023508059]),
        32: (['twolf', 'gzip'], [1047072697, 1071838019]),
        33: (['facerec', 'lucas'], [1037717684, 1014186334]),
        34: (['galgel', 'twolf'], [1020359060, 1017316543]),
        35: (['gcc', 'wupwise'], [1042255024, 1069960666]),
        36: (['mgrid', 'swim'], [1068900962, 1055587084]),
        37: (['fma3d', 'lucas'], [1029779454, 1008639060]),
        38: (['wupwise', 'galgel'], [1086346408, 1077694823]),
        39: (['perlbmk', 'vortex1'], [1060507355, 1031964886]),
        40: (['sixtrack', 'bzip'], [1030217175, 1034127697])
    },
    4: {
        1: (['ammp', 'mgrid', 'perlbmk', 'parser'],
            [1041955945, 1047775879, 1025548197, 1008908800]),
        2: (['lucas', 'gcc', 'mcf', 'twolf'],
            [1026388815, 1050246990, 1046272284, 1056602690]),
        3: (['eon', 'eon', 'mesa', 'facerec'],
            [1085828439, 1085086202, 1098261402, 1004267962]),
        4: (['vortex1', 'ammp', 'equake', 'galgel'],
            [1027564658, 1014658355, 1009037399, 1039572509]),
        5: (['gcc', 'galgel', 'apsi', 'crafty'],
            [1029306590, 1091516994, 1016968254, 1091181775]),
        6: (['applu', 'equake', 'art', 'facerec'],
            [1001909990, 1013915170, 1046887563, 1056979138]),
        7: (['applu', 'gap', 'gcc', 'parser'],
            [1082162802, 1059139806, 1013409002, 1085694384]),
        8: (['gap', 'swim', 'twolf', 'mesa'],
            [1042656444, 1061963955, 1085903965, 1036190567]),
        9: (['sixtrack', 'fma3d', 'apsi', 'vortex1'],
            [1074480257, 1031183064, 1098143364, 1012919523]),
        10: (['ammp', 'bzip', 'equake', 'parser'],
            [1077398959, 1003951563, 1072415593, 1053509179]),
        11: (['vpr', 'twolf', 'applu', 'eon'],
            [1040680776, 1031568211, 1082293995, 1041436570]),
        12: (['galgel', 'crafty', 'mgrid', 'swim'],
            [1031527863, 1044545857, 1082173250, 1096751917]),
        13: (['twolf', 'fma3d', 'galgel', 'vpr'],
            [1062306790, 1060828350, 1098129008, 1043023932]),
        14: (['bzip', 'vpr', 'bzip', 'equake'],
            [1084019868, 1038244774, 1003412847, 1097472955]),
        15: (['galgel', 'crafty', 'vpr', 'swim'],
            [1070880481, 1027287316, 1060235344, 1058807655]),
        16: (['mcf', 'wupwise', 'mesa', 'mesa'],
            [1054249832, 1006759950, 1014557494, 1030953598]),
        17: (['applu', 'parser', 'apsi', 'perlbmk'],
            [1075021039, 1053158322, 1034718910, 1026856922]),
        18: (['mgrid', 'perlbmk', 'gzip', 'mgrid'],
            [1049328406, 1079074439, 1096282781, 1079036253]),
        19: (['mcf', 'sixtrack', 'gcc', 'apsi'],
            [1090116441, 1068921998, 1066705590, 1092093538]),
        20: (['ammp', 'gcc', 'art', 'mesa'],
            [1011080402, 1007932868, 1079537464, 1095718719]),
        21: (['perlbmk', 'apsi', 'lucas', 'equake'],
            [1051169802, 1057285545, 1064666557, 1019744818]),
        22: (['vpr', 'crafty', 'vpr', 'mcf'],
            [1073177627, 1082019945, 1021734200, 1066267018]),
        23: (['gzip', 'equake', 'mgrid', 'mesa'],
            [1097569789, 1080949028, 1056929996, 1079797826]),
        24: (['facerec', 'applu', 'fma3d', 'lucas'],
            [1013937124, 1035387836, 1051243465, 1041436071]),
        25: (['gap', 'applu', 'parser', 'facerec'],
            [1008180602, 1067057433, 1083231912, 1080419219]),
        26: (['mcf', 'apsi', 'twolf', 'ammp'],
            [1014292526, 1058328743, 1061373130, 1050686626]),
        27: (['swim', 'sixtrack', 'ammp', 'applu'],
            [1052228680, 1059328443, 1080039777, 1026620495]),
        28: (['art', 'fma3d', 'swim', 'parser'],
            [1082308602, 1095181635, 1012762841, 1035776155]),
        29: (['apsi', 'gcc', 'vortex1', 'twolf'],
            [1050080000, 1076827259, 1024773007, 1088514951]),
        30: (['mgrid', 'gzip', 'apsi', 'equake'],
            [1015952145, 1024722623, 1059266770, 1077591627]),
        31: (['mgrid', 'equake', 'vpr', 'eon'],
            [1015263556, 1063692577, 1044670814, 1092770749]),
        32: (['wupwise', 'gap', 'twolf', 'facerec'],
            [1073842062, 1077919529, 1009246189, 1048001712]),
        33: (['galgel', 'equake', 'lucas', 'gzip'],
            [1040375492, 1037630973, 1017422599, 1094439053]),
        34: (['facerec', 'gcc', 'facerec', 'apsi'],
            [1085839746, 1069300438, 1073285869, 1062627766]),
        35: (['mesa', 'mcf', 'swim', 'sixtrack'],
            [1094695081, 1092502223, 1029829307, 1052267670]),
        36: (['mesa', 'sixtrack', 'equake', 'bzip'],
            [1063594040, 1062127033, 1040041781,
             1060015597]),
        37: (['mcf', 'gap', 'gcc', 'vortex1'],
            [1033902102, 1001090684, 1030020762, 1048547872]),
        38: (['facerec', 'lucas', 'mcf', 'parser'],
            [1092600483, 1066508342, 1027466999, 1060969516]),
        39: (['twolf', 'eon', 'mesa', 'eon'],
            [1098504941, 1088612335, 1009372945, 1069289808]),
        40: (['apsi', 'apsi', 'mcf', 'equake'],
            [1092680129, 1068726226, 1098316344, 1073035913])
    },
    8: {
        1: (['gap', 'applu', 'vpr', 'gap', 'mcf', 'mcf', 'twolf', 'vortex1'],
            [1073187984, 1054331802, 1084995449, 1037520798, 1079155778, 1006168982, 1011603043, 1086569852]),
        2: (['galgel', 'mgrid', 'twolf', 'mesa', 'equake', 'equake', 'swim', 'facerec'],
            [1010114816, 1033266768, 1074762683, 1073328214, 1085178419, 1090120043, 1028221923, 1077227382]),
        3: (['ammp', 'mgrid', 'vpr', 'art', 'lucas', 'parser', 'galgel', 'gzip'],
            [1073035280, 1082456083, 1064051258, 1047745378, 1091351248, 1043930366, 1088556254, 1019159065]),
        4: (['mgrid', 'apsi', 'equake', 'eon', 'crafty', 'twolf', 'mcf', 'bzip'],
            [1087805348, 1010946963, 1058724149, 1080641094, 1052238876, 1061243098, 1079000456, 1020553687]),
        5: (['bzip', 'lucas', 'ammp', 'eon', 'perlbmk', 'gcc', 'parser', 'vpr'],
            [1000157189, 1096256620, 1018004216, 1022509362, 1063759077, 1020126159, 1017677198, 1091187089]),
        6: (['parser', 'gzip', 'equake', 'bzip', 'wupwise', 'gcc', 'perlbmk', 'mcf'],
            [1054015368, 1072027794, 1037141802, 1002590313, 1032598159, 1062865867, 1086592320, 1077206505]),
        7: (['parser', 'eon', 'gcc', 'swim', 'swim', 'vpr', 'galgel', 'swim'],
            [1009382201, 1008386991, 1013955362, 1085186502, 1049653866, 1081052784, 1004535245, 1000701983]),
        8: (['lucas', 'bzip', 'applu', 'equake', 'mgrid', 'ammp', 'ammp', 'gcc'],
            [1062490836, 1000066629, 1031295467, 1003301732, 1095699052, 1066621695, 1084377485, 1048723512]),
        9: (['ammp', 'gap', 'mesa', 'facerec', 'eon', 'vpr', 'bzip', 'galgel'],
            [1079442762, 1070093875, 1097495616, 1031240052, 1022413214, 1090524731, 1086977938, 1057202237]),
        10: (['parser', 'swim', 'twolf', 'gcc', 'vpr', 'bzip', 'facerec', 'gzip'],
            [1090257985, 1009293715, 1022410113, 1011023325, 1056748310, 1026660178, 1009495664, 1090156263]),
        11: (['crafty', 'vpr', 'sixtrack', 'crafty', 'lucas', 'crafty', 'equake', 'apsi'],
            [1080684386, 1072953538, 1074549432, 1025906445, 1034048076, 1078018402, 1090184500, 1095714609]),
        12: (['art', 'crafty', 'eon', 'vortex1', 'fma3d', 'mgrid', 'crafty', 'equake'],
            [1015843197, 1002447314, 1053646470, 1021827013, 1045943050, 1058122519, 1071926742, 1080330548]),
        13: (['twolf', 'vpr', 'mesa', 'fma3d', 'equake', 'sixtrack', 'gap', 'gzip'],
            [1081428568, 1054183010, 1045848482, 1094992822, 1033922911, 1085605808, 1068053379, 1087356592]),
        14: (['twolf', 'mesa', 'crafty', 'equake', 'vortex1', 'mgrid', 'swim', 'gap'],
            [1056751279, 1088275153, 1020060137, 1003606008, 1078323826, 1022366158, 1019548656, 1034359417]),
        15: (['eon', 'mgrid', 'mcf', 'perlbmk', 'wupwise', 'crafty', 'twolf', 'swim'],
            [1069111934, 1033143088, 1094184021, 1030649130, 1031499067, 1082028261, 1071878030, 1057197432]),
        16: (['crafty', 'bzip', 'applu', 'apsi', 'gzip', 'galgel', 'equake', 'perlbmk'],
            [1060295839, 1061605217, 1078408538, 1040793226, 1017904531, 1012317818, 1009187526, 1008829265]),
        17: (['gzip', 'apsi', 'bzip', 'mgrid', 'gap', 'art', 'art', 'bzip'],
            [1062227060,
             1043287066, 1072789027, 1090842725, 1081090176, 1038163398, 1076091368,
             1015033152]),
        18: (['eon', 'equake', 'vortex1', 'art', 'gcc', 'apsi', 'facerec', 'gzip'],
            [1031823376, 1011627676, 1014152137, 1001876755, 1062533802, 1095016852, 1075002641, 1070354554]),
        19: (['eon', 'mesa', 'vortex1', 'eon', 'gcc', 'lucas', 'equake', 'galgel'],
            [1027310441, 1002162446, 1059120707, 1074827012, 1028078935, 1023522418, 1049001044, 1086781145]),
        20: (['apsi', 'bzip', 'galgel', 'ammp', 'art', 'galgel', 'ammp', 'sixtrack'],
            [1036116693, 1008009383, 1088504472, 1035365251, 1075786712, 1026648985, 1065582559, 1051291310]),
        21: (['parser', 'parser', 'gap', 'gap', 'ammp', 'applu', 'vortex1', 'art'],
            [1072012746, 1057840199, 1061743354, 1099794688, 1078982271, 1037341159, 1023276131, 1043919426]),
        22: (['crafty', 'swim', 'twolf', 'galgel', 'swim', 'twolf', 'twolf', 'parser'],
            [1050818960, 1064897126, 1052515062, 1086678782, 1066161812, 1040178194, 1067884105, 1006949019]),
        23: (['vpr', 'vortex1', 'parser', 'twolf', 'eon', 'equake', 'gzip', 'fma3d'],
            [1005348033, 1035925273, 1045642080, 1023304669, 1029043677, 1049160390, 1006073261, 1044794122]),
        24: (['vortex1', 'galgel', 'ammp', 'parser', 'bzip', 'vpr', 'mesa', 'ammp'],
            [1063194330, 1097275798, 1082541931, 1003883854, 1079087447, 1090030082, 1093793294, 1041142266]),
        25: (['twolf', 'facerec', 'perlbmk', 'gzip', 'vpr', 'vortex1', 'wupwise', 'eon'],
            [1053550239, 1018430586, 1014624132, 1008809943, 1029707942, 1046921866, 1001808722, 1028638607]),
        26: (['gap', 'sixtrack', 'eon', 'applu', 'swim', 'perlbmk', 'vpr', 'apsi'],
            [1054115093, 1005445699, 1097166435, 1065930690, 1024858175, 1026537869, 1045341467, 1023002078]),
        27: (['gap', 'gap', 'gap', 'twolf', 'mcf', 'gap', 'lucas', 'bzip'],
            [1077017001, 1099668442, 1085122280, 1089412190, 1087185975, 1039440799, 1078171971, 1013558211]),
        28: (['vpr', 'vpr', 'twolf', 'mesa', 'gap', 'bzip', 'gzip', 'sixtrack'],
            [1017259907, 1013013234, 1030175803, 1090459182, 1076078700, 1032172360, 1023272393, 1027539069]),
        29: (['swim', 'equake', 'swim', 'wupwise', 'fma3d', 'sixtrack', 'lucas', 'vortex1'],
            [1005106200, 1023192637, 1003810844, 1072497715, 1014349757, 1001914569, 1081817129, 1086550774]),
        30: (['wupwise', 'vortex1', 'gap', 'vpr', 'fma3d', 'vortex1', 'art', 'mgrid'],
            [1062614948, 1058495568, 1085462700, 1056121083, 1005862000, 1035665738, 1098565664, 1015543999]),
        31: (['applu', 'perlbmk', 'applu', 'galgel', 'crafty', 'wupwise', 'gap', 'ammp'],
            [1048307705, 1027263945, 1094469129, 1055082711, 1075216695, 1078291964, 1015047097, 1099078359]),
        32: (['swim', 'bzip', 'swim', 'apsi', 'vpr', 'gcc', 'twolf', 'twolf'],
            [1090335490, 1063630973, 1073476422, 1035755503, 1050094400, 1081428824, 1027995780, 1021164635]),
        33: (['swim', 'galgel', 'eon', 'gap', 'lucas', 'ammp', 'equake', 'apsi'],
            [1043985407, 1065211955, 1005038440, 1011662553, 1045691442, 1001412047, 1015576344, 1084843544]),
        34: (['vpr', 'twolf', 'apsi', 'vpr', 'mesa', 'applu', 'mgrid', 'fma3d'],
            [1057562916, 1028743098, 1067453907, 1087649740, 1012866796, 1020715300, 1050920626, 1034913162]),
        35: (['vortex1', 'perlbmk', 'mesa', 'eon', 'lucas', 'equake', 'mesa', 'equake'],
            [1029345593, 1016169734, 1013286006, 1002052717, 1080794866, 1033967832, 1080996824, 1062109113]),
        36: (['gap', 'eon', 'mgrid', 'gcc', 'parser', 'mesa', 'swim', 'bzip'],
            [1031957692, 1025634021, 1051330127, 1019081762, 1002465854, 1015611877, 1081170265, 1064397098]),
        37: (['equake', 'mcf', 'galgel', 'crafty', 'bzip', 'ammp', 'vortex1', 'crafty'],
            [1007445950, 1076581005, 1031048765, 1070570356, 1038710655, 1054579808,
             1014641175, 1092263752]),
        38: (['facerec', 'wupwise', 'vpr', 'eon', 'sixtrack', 'bzip', 'perlbmk', 'art'],
            [1017668229, 1062388660, 1045846686, 1064095596, 1044136084, 1049590197, 1035253692, 1061796023]),
        39: (['gzip', 'crafty', 'crafty', 'wupwise', 'gap', 'gap', 'eon', 'art'],
            [1011530049, 1039498517, 1059870493, 1050430777, 1097320584, 1026210416, 1065924402, 1024174983]),
        40: (['vpr', 'mcf', 'mgrid', 'equake', 'galgel', 'mcf', 'facerec', 'gzip'],
            [1052954017, 1084642384, 1029631047, 1045662487, 1014823135, 1017907122, 1069887119, 1010792942])
    },
}
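The table above is keyed first by the number of cores (2, 4 or 8) and then by a workload number. Each entry pairs a list of benchmark names with a list of integers of the same length. The following is a minimal sketch of a consistency check for that structure. The function name `check_workloads` is hypothetical, and the sketch makes no assumption about what the integers mean, only that there is one per benchmark.

```python
def check_workloads(workloads):
    """Return the sorted (cpu_count, workload_id) pairs whose entry is malformed.

    An entry is considered well formed when both of its lists contain exactly
    cpu_count elements (hypothetical check, not part of workloads.py).
    """
    bad = []
    for cpu_count, entries in workloads.items():
        for workload_id, (names, values) in entries.items():
            if len(names) != cpu_count or len(values) != cpu_count:
                bad.append((cpu_count, workload_id))
    return sorted(bad)


# Two entries copied from the listing, plus one deliberately truncated entry
# (the 4-core entry below is missing its fourth benchmark and value).
sample = {
    2: {
        1: (['sixtrack', 'gcc'], [1092486526, 1089243335]),
        2: (['twolf', 'mcf'], [1059202723, 1053008801]),
    },
    4: {
        1: (['ammp', 'mgrid', 'perlbmk'], [1041955945, 1047775879, 1025548197]),
    },
}

print(check_workloads(sample))
```

Running such a check on the full table before a simulation batch is a cheap way to catch an entry damaged by hand editing.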
