Automated Cache Coherence Model Reduction using Abstraction Patterns by Hut, Geert-Jan
Automated Cache Coherence Model Reduction using Abstraction Patterns
Masters Thesis
G. J. Hut
Student: 834.078.219
Date: June 1, 2016

AUTOMATED CACHE COHERENCE MODEL REDUCTION USING
ABSTRACTION PATTERNS
MASTERS THESIS
by
G. J. Hut
in partial fulfillment of the requirements for the degree of
Master of Science
in Software Engineering
at the Open University, faculty of Management, Science and Technology
Master Software Engineering
to be defended publicly on Wednesday, June 1, 2016 at 13:00.
Student number: 834.078.219
Course code: T75317
Thesis committee: Prof. dr. M. C. J. D. van Eekelen (chairman), Open University
Dr. F. Verbeek (supervisor), Open University
An electronic version of this thesis is available at http://dspace.ou.nl/.
ACKNOWLEDGEMENTS
Writing this thesis would not have been possible without the support of several people
whom I want to gratefully thank.
I wish to thank my supervisor dr. Freek Verbeek for providing me with the opportunity
to write this thesis under his guidance. Cache coherence and bisimulation are not sim-
ple subjects, and he provided me with enough railing alongside the abyss of complexity I
sometimes felt I was in danger of falling into.
I would like to thank my parents and my brothers. They always supported me to pursue
my ideals and dreams, and where would I be without the constant positive competition
with my brothers.
Finally, without the never ending support of my wife Marjolijn I would not have been
able to continue this effort for almost four years, alongside my regular job. I will try to
support her the same way back with the study she is now starting.
Geert-Jan Hut
Vleuten, May 2016
CONTENTS
List of Figures
List of Tables
Code Listings
Summary
1 Introduction
1.1 Motivation
1.2 Document Structure
2 Research Context
2.1 Caches
2.2 Multi Core
2.3 Memory Consistency
2.4 Cache Coherence
2.5 Core connectivity
2.6 The gem5 simulator
3 Research Design
3.1 Research subject
3.2 Research Questions
3.3 Validation
3.4 Deliverables
4 Related Works
4.1 Cache Coherence
4.2 Networks on Chips
4.3 Model Checking and Bisimulation
5 State Machine Minimization Approach
5.1 Overview
5.1.1 network-impacting minimizations
5.2 The minimization process
5.2.1 Implementation
5.3 Contribution and research boundaries
6 State Machine Abstraction Patterns
6.1 Cache coherence components
6.1.1 Coherence devices
6.1.2 Coherence features
6.2 Abstraction patterns
6.2.1 Removal of non-essential actions pattern
6.2.2 Merge similar actions pattern
6.2.3 Removal of non-essential device pattern
6.3 Have we found all abstraction patterns?
7 State Machine Reduction Algorithms
7.1 State machine equivalence
7.1.1 Trace Equivalence
7.1.2 Structural equivalence
7.1.3 (Bi-)simulation equivalence
7.1.4 equality between trace equivalence and (bi-)similarity
7.2 NFA to DFA conversion
7.2.1 subset construction
7.2.2 tau-a reduction
7.2.3 Haskell Implementation
7.2.4 Impact selection of initial state
7.3 Strong bisimulation minimization
7.3.1 Mark phase
7.3.2 Reduce phase
7.4 Do we generate the most minimal solution?
8 Experimental Results
8.1 The MI protocol
8.2 The MESI protocol
9 Conclusions and future work
9.1 Discussion
9.2 Answers to Research Questions
9.3 Future work
A Appendix Cache Coherence Protocol details
A.1 MI protocol
A.1.1 Cache controller
A.1.2 Directory Controller
A.1.3 DMA Controller
A.1.4 More information
A.2 MESI protocol
A.2.1 L1 Cache controller
A.2.2 L2 Cache controller
A.2.3 Directory controller
A.2.4 DMA controller
A.2.5 More information
Bibliography
Books
Scientific Articles
Conferences
Technical Documentation
Acronyms
Glossary
LIST OF FIGURES
2.1 The MSI Protocol
5.1 State Machine Minimization Approach
6.1 Cache Coherence Devices
6.2 MI cache controller node
6.3 MI directory controller node
6.4 MI DMA controller node
6.5 Partially abstracted state machine with DMA transitions
7.1 Trace equivalent
7.2 Trace equivalent, but not structural equivalent
7.3 Merging of similar follow up states
7.4 equivalent DFAs
7.5 Initial state dependent NFA to DFA conversion, starting with P1
7.6 NFA to DFA conversion without remaining states
8.1 Gem 5 MESI component configuration
8.2 Original gem5 MI cache controller state machine
8.3 Original gem5 MI directory controller state machine
8.4 Original gem5 MI DMA controller state machine
8.5 Minimized gem5 MI cache controller state machine
8.6 Minimized gem5 MI directory controller state machine
8.7 Original gem5 MESI Layer 1 Cache controller state machine
8.8 Original gem5 MESI Layer 2 cache controller state machine
8.9 Original gem5 MESI directory controller state machine
8.10 Minimized gem5 MESI Layer 1 Cache controller state machine
8.11 Minimized gem5 MESI Layer 2 cache controller state machine
8.12 Minimized gem5 MESI directory controller state machine
LIST OF TABLES
8.1 MI minimization results
8.2 MESI minimization results
A.1 MI_example cache controller actions
A.2 MI_example directory controller actions
A.3 MI_example DMA controller actions
A.4 MESI L1 cache controller actions
A.5 MESI L2 cache controller actions
A.6 MESI Directory controller actions
A.7 MESI L2 cache controller actions
CODE LISTINGS
6.1 Network-impacting actions
6.2 Actions without network impact
6.3 Similar actions in Directory controller
6.4 DMA queue processing in Directory node
SUMMARY
State machines of cache coherence protocol implementations are complex because of the many
optimizations and other behavioral requirements these implementations support. This makes
for efficient cache coherence protocols, but it is a disadvantage when only part of an
implementation is of interest, for example when only the network protocol behavior is to be
investigated, as complex state machines are harder to analyze and to use in formal
verification.
This document first describes why cache coherence is needed, followed by a description of
an automatable process to algorithmically minimize complex cache coherence state machines.
The process is intended to facilitate research that applies model checking to cache
coherence protocols, such as the research in [VS12a] and [VS12b] on Network on a Chip
deadlock detection. The process consists of a number of abstraction patterns which remove
uninteresting coherence state details, followed by a reduction step which minimizes the
abstracted state machine using a bisimulation reduction algorithm. A software tool is
provided that implements this process to reduce gem5 simulator (gem5) cache coherence state
diagrams. The patterns and process are validated by reducing gem5 state machines using the
software tool.
The developed software and the data sets used in the thesis research will be made avail-
able to the research community to enable validation and extension of the work.
Keywords: Memory Consistency, Cache Coherence, Network on Chip, State Machine minimization,
Reduction algorithms, Abstraction patterns, gem5, bisimulation
1. INTRODUCTION
Recent years have seen a proliferation in the number of execution cores that
microprocessors are provided with. Where two decades ago only high-end systems were
equipped with multiple cores, now even low-end smartphones and tablets use dual-core,
quad-core or even eight-core processors. This trend is continuing. As an example, Reilly
describes in [Rei+08] a high-end configuration where a system was built from 972 processors
with 6 cores each. Fang et al. describe in [Fan+14] their experiences with the Intel Xeon
Phi, a commercially available 60-core microprocessor.
One of the main challenges with multi-core processors is that the applications that run
on them must be able to depend on memory to act consistently. So when a thread updates
a value in location A, and next sets a flag in location B (maybe to indicate that A was set),
there should be methods available to the programmer to make sure that when another
thread gets the new value for B, it will then always get the new value for A.
This should even work when locations A and B are on completely separated memory
blocks, managed separately by the processor. The rules which programmers must follow
and what they must do to enforce the memory behavior they expect are defined with mem-
ory consistency models.
Implementations of these memory consistency models are made complex by the caches, queues
and other components that are added to the processor and its cores to improve performance.
These components work on memory blocks or cache-lines, and this requires some way to keep
the content of these blocks coherent between the queues, caches and main memory. This
coherency is enforced by the caches and memory controllers themselves, by way of cache
coherence protocols.
Cache coherence protocols are protocols with which cache devices exchange messages to keep
each other up to date on the status and content of memory blocks in the system. The
messages are exchanged between the cache controller and directory controller devices in the
system.
1.1. MOTIVATION
The motivation for this research is the realization that even when the cache coherence pro-
tocol as well as the used interconnection network are deadlock-free, the interaction be-
tween these two layers can still cause deadlocks [VS12b] [VS12a]. To research this, models
of the cache coherence protocol and the interconnecting network must be combined so
they can be researched as a single model. This single model is fed into model checking soft-
ware, which can check this complete model for deadlocks. These analysis tools are more
effective when the models to analyze are kept simple, as simpler models are less prone to
state space explosions.
This thesis answers the question whether the minimization of cache coherence protocols can
be automated, so that the number of states that needs to be examined is programmatically
minimized. Only states and state transitions that are relevant for the
network analysis should remain. This state minimization should be done in a way that
the model abstraction remains sound, so that all network related properties proven on the
abstraction remain true on the original model.
The outcome of this research can be used to more efficiently research deadlock related
aspects of the combination of these coherence protocol state machines with network mod-
els.
1.2. DOCUMENT STRUCTURE
This document is set up as follows: in Chapter 2 a number of concepts used in this research
are described. This way the context the research is executed in is introduced. Next, Chap-
ter 3 presents the way the research is structured, and states the research questions and
deliverables. Chapter 4 presents a number of significant papers and other background ma-
terial which describe the topic in further detail, and which were studied in the context of
this research.
In the next chapters we present the main result of our research. Chapter 5 presents the
overall approach on how to minimize state machines. This approach consists of two parts.
The abstraction patterns part is further discussed in Chapter 6. The reduction algorithms
are described in detail in Chapter 7. Next we provide the results of our experiments in
Chapter 8; here we apply the implementation of the process to the device state machines
of two gem5 cache coherence protocols. We finish with our conclusions and give pointers
to potential future research areas in Chapter 9.
2. RESEARCH CONTEXT
This chapter describes the context, the technical landscape in which the research is done,
by highlighting a number of its aspects. The chapter only gives a high level overview of the
used concepts and terminology. An excellent and detailed introduction on most of these
concepts can be found in “A Primer on Memory Consistency and Cache Coherence” written
by Sorin et al. [SHW11].
2.1. CACHES
Since the 1990s caches have become commonplace in microprocessors (CPUs). They are
used to limit the impact of the speed difference between a computer’s memory and its pro-
cessor(s). This speed difference is caused by the fact that processors can request and pro-
cess data much faster than main memory can deliver. Caches function as an intermediate
between the processor and the main memory to speed up this value retrieval, by retaining
the retrieved values for a while in the cache. This works because most programs exhibit
locality of reference, which means that most programs re-use specific data (and program
code) in a short time interval (temporal locality). They tend to also access data in close
proximity to an already accessed memory location (spatial locality). The spatial locality
aspect is used by fetching not only the used data, but a whole cache-line (normally a 64 or
128 byte block of memory) at once. Additionally the processor can speculatively pre-fetch
any cache-lines it expects to be required next.
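The effect of fetching whole cache-lines can be illustrated with a small simulation. The sketch below is not part of the research tooling; it assumes a 64-byte cache-line and a hypothetical four-line cache with least-recently-used replacement, and compares an access pattern with spatial locality against one without.

```python
LINE_SIZE = 64  # bytes per cache-line (assumption; the text mentions 64 or 128)

def hit_rate(addresses, num_lines=4):
    """Simulate a tiny fully associative LRU cache and return the hit fraction."""
    cache = []  # cached line numbers, least recently used first
    hits = 0
    for addr in addresses:
        line = addr // LINE_SIZE      # the cache-line this address belongs to
        if line in cache:
            hits += 1
            cache.remove(line)        # re-appended below as most recently used
        elif len(cache) == num_lines:
            cache.pop(0)              # cache full: evict the least recently used line
        cache.append(line)
    return hits / len(addresses)

# 256 consecutive 4-byte words: 16 accesses fall on each cache-line.
sequential = list(range(0, 1024, 4))
# 256 accesses that each touch a different cache-line.
scattered = [i * 4096 for i in range(256)]

print(hit_rate(sequential), hit_rate(scattered))
```

In the sequential pattern only the first access to each line misses, so 15 out of every 16 accesses are served from the cache; the scattered pattern never hits.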
Nowadays these caches are normally placed on-die, so on the same silicon chip as the ex-
ecution core itself, to minimize the physical distance between the execution unit needing
the data and program code and the cache which temporarily holds copies of these values.
With the current processor clock speeds this distance has an impact on the time it takes to
retrieve data from the caches.
2.2. MULTI CORE
Another way that the performance of modern microprocessors is increased is by placing
multiple cores on a single processor die, creating a multi-core processor. The advantage of
having multiple cores as compared to a single, very powerful core is that there is a maximum
to how fast you can make a single core run. When you want to make a core run faster you
need to increase its clock frequency. With an increased frequency you get a much higher
power dissipation heating your processor [Sut05]. So the current answer to the need for
increased performance is not to supercharge a single processor core, but to add multiple
cores to a processor. Currently, this is one of the preferred ways to make use of the ever
increasing number of transistors that can be placed on a die.
However, there is a challenge when you have a system with multiple cores. The applications
running on it must be able to use these cores, and use them efficiently. They must
support multi-threading, so that an application can run concurrently on more than one
core. Only then can an application benefit from the increased performance the multi-core
processor brings.
To do this, programmers must be able to make certain assumptions on how the system
(the multiple processor cores together with their caches) handles the application's memory
accesses and updates, so from a programmer's view this memory handling must be dependable
and consistent.
2.3. MEMORY CONSISTENCY
As described in the previous section, to allow developers to write correct programs they
must be able to depend on how the system deals with memory loads and stores. For in-
stance, if a piece of code in a single thread writes a memory location, and afterwards reads
back this location, the value must be the new value. Only if another thread modifies this
memory location in the meantime could the value have changed. This becomes an is-
sue when store actions are queued, something that is commonly done as an optimization
method [SHW11, p. 37].
Also, to write correct programs the system must provide methods to make sure that
when location A is written before location B by one program thread, another program
thread that has read the new B value can afterwards only see the new A value (so not the
old value).
This should even be the case when the A and B locations are on different cache blocks or
caches and when instructions are potentially reordered by the processor cores (another op-
timization method commonly used in modern processors). If the methods to implement
this kind of synchronization are not available, threads would not be able to synchronize,
and writing correct multi-threaded programs that share memory between threads would
become impossible.
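The write-A-then-B pattern described above can be sketched at the programming level as follows. This is only an illustration of the pattern, not of hardware behavior: it assumes Python's `threading.Event` as the synchronization method that the system provides, and the names `memory` and `flag_b` are hypothetical.

```python
import threading

def message_passing_demo():
    """Writer sets A and then B; reader waits for B and then reads A."""
    memory = {"A": 0}               # location A, initially holding the old value
    flag_b = threading.Event()      # stands in for location B (the flag)
    seen = []

    def writer():
        memory["A"] = 42            # first update A
        flag_b.set()                # then signal through B that A was set

    def reader():
        flag_b.wait()               # block until the new value of B is visible
        seen.append(memory["A"])    # A must now hold the new value

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen[0]

print(message_passing_demo())
```

The reader can only pass `flag_b.wait()` after the writer has stored the new value in A, which is the guarantee a memory consistency model has to make available in some form.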
Always providing this guarantee is expensive, as it impacts performance. Different
memory consistency models therefore provide more relaxed guarantees. This is done by
allowing the programmer or compiler to specify when synchronization is needed. The
trade-off is that the programmer or compiler must then do more work to guarantee a
correctly functioning application, with more room for mistakes.
2.4. CACHE COHERENCE
Cache coherence becomes an issue in multi-core systems in which each core has its own
local cache. The main issue that cache coherence is concerned with is that multiple cores
can keep values from a specific memory location cached in their own local or intermediate
caches, effectively creating copies of those values. When one core updates the value
for that memory location, other cores must not use the old value anymore, and must not
update other values at memory locations on that cache-line. Instead the other caches must
first update their copy of the cache-line or invalidate it by deleting their copy. When a
core then requests a memory location on this cache-line again, it will get the new values.
This invalidation/update mechanism effectively keeps the caches coherent.
Figure 2.1: The MSI Protocol (states I, S and M; transitions labeled GetS, GetM, Release and Release+data)
To orchestrate these coherence activities, cache coherence protocols come into play.
These coherence protocols are state machines that for each memory caching device deter-
mine the state of each cache-line. A well-known cache coherence protocol is for instance
the MSI protocol [SHW11, p. 104]. This protocol, shown in Figure 2.1, has three conceptual
states in which the cache-line can be, namely: Modified (the core has written in the cache-
line, modifying it), Shared (a read but unmodified copy, potentially shared by others) and
Invalid (the cache-line contains no valid data).
When a core wants to read a memory location, and its cache does not contain a copy
(the cache-line is in the Invalid state), a GetS message is sent out. This message can be
answered with a value that is already cached elsewhere, before the memory controller is
asked to obtain the value from main memory.
When a core wants to write to a memory location, and the cache-line is not already in
the Modified state, a GetM message is sent out. Any other core receiving that message will
invalidate its copy of that cache-line by releasing it, optionally forwarding the modified
data directly to the requesting core.
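The read and write behavior described above can be written down as a small transition table for a single cache-line in a single cache. This is a simplified sketch and not the gem5 implementation: the event names `load`, `store` and `remote_GetM` are hypothetical, it is an assumption that `Release+data` is sent from the Modified state and plain `Release` from the Shared state, and the handling of a GetS from another core is left out.

```python
# (current state, event) -> (next state, message sent or None)
MSI = {
    ("I", "load"): ("S", "GetS"),        # read miss: request a shared copy
    ("I", "store"): ("M", "GetM"),       # write miss: request a modifiable copy
    ("S", "load"): ("S", None),          # read hit
    ("S", "store"): ("M", "GetM"),       # upgrade: not yet in the Modified state
    ("M", "load"): ("M", None),          # read hit
    ("M", "store"): ("M", None),         # write hit
    ("I", "remote_GetM"): ("I", None),   # no copy held, nothing to release
    ("S", "remote_GetM"): ("I", "Release"),
    ("M", "remote_GetM"): ("I", "Release+data"),  # forward the modified data
}

def step(state, event):
    """Return (next_state, message) for one event; unmodelled pairs raise KeyError."""
    return MSI[(state, event)]

def run(events, state="I"):
    """Apply a sequence of events and collect the messages that were sent."""
    messages = []
    for event in events:
        state, message = step(state, event)
        if message is not None:
            messages.append(message)
    return state, messages

print(run(["load", "store", "remote_GetM"]))
```

The example run reads the line, then writes it, and finally loses it to another core's write, ending in the Invalid state again.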
Apart from the evident overhead needed for the exchange of these messages, additional
intermediate states are also needed to implement the message exchanges, for instance be-
cause the cache controller needs to wait for results.
More advanced protocols have also been devised. For instance the MESI protocol is an
extension to the MSI protocol, extending it with an extra Exclusive state, indicating
situations where a core is the only core with a read data copy in its caches [SHW11, p. 115].
When the core then wishes to update this value, no communication is needed with other
cores before going to the Modified state.
However, this efficiency increase comes at a cost, as the MESI protocol state machine is
much more complex. This makes it more difficult to create an error-free implementation
of the coherence protocol. Typical errors in state machines lead either to a wrong state
or, more often, to deadlocks. A deadlock means that the state machine encounters a
situation (event) that it can never handle, so it can never go to a next state and
effectively halts forever; a situation an implementation always wants to avoid. However,
this is difficult. In [KAC14] Komuravelli investigated the gem5 simulator (gem5) MESI
protocol implementation (see Section 2.6) using formal methods and found six errors in
the implementation. This was much to the surprise of the original protocol implementers,
who thought the implementation to be stable and error-free, given that it had already been
in use for a number of years. However, Komuravelli also encountered problems verifying the
protocol because of its complexity, having to restrict the state diagram to prevent state
space explosions. Restricting was done by creating simpler state machines that still
exhibited the behavior under test, effectively creating sound abstractions. These
abstractions allow the state diagram to become simpler while retaining the aspects of the
state diagram that were under scrutiny.
2.5. CORE CONNECTIVITY
To exchange messages, the cache devices need to be connected to one another. With a
limited number of devices a shared bus based approach is normally used, where all cache
devices are exchanging messages over a single bus. The logic to exchange messages over
the bus also provides for message synchronization, as only one message can be sent at a
time, and all devices can see all messages. However, this only works up to a certain number
of devices: when the number becomes larger the single shared bus becomes a limiting
factor, simply because it cannot keep up with the increased number of messages
exchanged between the different cores.
For systems with larger counts of cores the message bus is normally replaced by a topol-
ogy of point-to-point connections, or with Network on a Chip (NoC) connectivity. NoC
configurations can be ring structures, grids, spidergon meshes (an adaptation of a ring
structure where also opposite nodes are linked) or other topologies. Marty described in
[Mar08] the impact of different network topologies on cache coherence.
Each core has a router and a number of connections to other cores or devices. Messages
are exchanged between these routers using a routing protocol. This routing protocol is re-
sponsible for the optimal routing of the messages without creating deadlocks. Deadlocks
can occur when a router cannot forward one or more messages to a next destination router
because the relevant message queue is full. When a cycle of such queues is created, message
transmission halts and a deadlock is created, as described in [DS87].
Sorin et al. also dedicate a paragraph to these so-called liveness issues [SHW11, p. 186–]. It
is the responsibility of the routing protocol to make sure these kinds of cycles never appear.
Previous research has provided efficient routing protocols that are proven to be
deadlock-free for a certain topology. See for instance [GN92], where Glass and Ni describe
a model for deadlock-free wormhole routing.
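The cyclic dependency between full queues can be made concrete with a small check. The sketch below is illustrative only and is not taken from the cited work: it assumes a hypothetical `waits_for` mapping from each router to the routers whose full queue it is blocked on, and uses a depth-first search to detect a cycle in that mapping.

```python
def has_cycle(waits_for):
    """Return True if the 'blocked on' relation between routers contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in waits_for.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:      # reached a router already on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(waits_for))

# Three routers in a ring, each blocked on the next one: a deadlock.
ring = {"R0": ["R1"], "R1": ["R2"], "R2": ["R0"]}
# The same routers in a line: R2 can always drain its queue, so no deadlock.
chain = {"R0": ["R1"], "R1": ["R2"], "R2": []}

print(has_cycle(ring), has_cycle(chain))
```

A deadlock-free routing protocol guarantees, for its topology, that a configuration like `ring` can never arise.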
So for cache coherence to work, an error-free cache coherence protocol needs to be
implemented whereby the coherence messages are exchanged using an error-free routing
protocol for the network configuration. However, this is not sufficient. Verbeek and
Schmaltz have shown in [VS12b] that a combination of a coherence protocol and a routing
protocol which are each deadlock-free in themselves can still suffer from deadlocks.
This problem is currently being researched, and one of the issues is that to prove or
disprove deadlocks both state machine models must be combined; this creates much more
complex models as a result. This requires smaller state machine abstractions to start with
to create combined models that can be proven deadlock-free without suffering from state
space explosion.
2.6. THE GEM5 SIMULATOR
Having all cores, caches and connectivity on a single die makes it hard to observe what
exactly is going on within the processor. For instance, measuring the timing of detailed
instruction executions and verifying the (coherence) mechanisms used in the processor are
hard. It is also difficult to experiment with these concepts in hardware, as creating a die
is expensive and time consuming.
Therefore simulators are built to allow engineers to experiment and to better observe
what exactly is happening. One of these simulators is the gem5 simulator. This open source
hardware simulator provides an implementation of many of the components used in normal
PCs, such as CPUs, memory and IO components. Because it is a software simulator with
publicly available source code, it allows one to simulate existing hardware and to test new
hardware designs. Cache coherence protocol state machines created for gem5 simulator
configurations were used as input for this research.
3. RESEARCH DESIGN
This research has been designed based on the problem description in Section 1.1. The
next section describes the research subject, which further details the problem description.
Section 3.2 details the research question that is answered during this research. How the
research outcome is validated is documented in Section 3.3. The deliverables this research
produced are listed in Section 3.4.
3.1. RESEARCH SUBJECT
The subject of this research is to determine whether the automatic creation of minimized
cache coherence state machines from gem5 simulator (gem5) configurations is possible.
The research attempts to programmatically create sound abstractions of these state ma-
chines. These are to be simplified so that they still represent the network behavior of the
original machine, and - where possible - other non-relevant state machine complexity is
removed. In other research these minimized state machines can then be combined with
routing logic models, allowing the combination of the coherence state machines and the
routing logic to be proven deadlock-free.
The programmatic minimization of the state machines is done by applying abstraction
and reduction patterns. These patterns generate proven minimizations for our purpose:
sound abstractions of the original state machines which retain their original
network routing behavior.
These abstraction and reduction patterns are the main focus of this research and are
created by manually inspecting coherence protocol state diagrams and investigating how
the minimization methods found can be made generic.
3.2. RESEARCH QUESTIONS
The question that this research is trying to answer is the following:
Is it possible to create sound minimizations of selected gem5 cache coherence
state machines by programmatically applying state machine minimization pat-
terns?
Here, minimization patterns are defined as transformations on a state machine that
reduce the number of states or state transitions.
This research question can be divided into and amended with a number of sub-questions:
1. what patterns can be found that can be applied to minimize cache coher-
ence state machines creating sound abstractions?
2. which state machine aspects determine the applicability of these mini-
mization patterns?
3. can programmatically implementable strategies be found when to apply
which pattern?
4. can limits be found in the application of these reduction patterns?
5. how can these reduction patterns be best documented?
6. do the patterns have an ordering, is the use of certain patterns enabled by
the application of other minimization patterns?
7. if the patterns have an ordering, can an optimal ordering be applied to
obtain the best result?
For the last question we define the “best result” as the state machine with the fewest
states and, if equal, the fewest state transitions.
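The ordering behind this definition of “best result” can be sketched as a lexicographic comparison. The snippet below is an illustration only: the `Candidate` type, its field names and the example sizes are assumptions made for this sketch and are not taken from the tool or from the results.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Size of one minimized state machine (hypothetical record type)."""
    name: str
    num_states: int
    num_transitions: int


def best_result(candidates: List[Candidate]) -> Candidate:
    # Fewest states wins; on a tie the fewest transitions wins.
    return min(candidates, key=lambda c: (c.num_states, c.num_transitions))


# Made-up sizes, purely to exercise the ordering.
candidates = [
    Candidate("order-1", 7, 20),
    Candidate("order-2", 5, 14),
    Candidate("order-3", 5, 12),
]
```

With these made-up figures, `best_result(candidates)` selects `order-3`, because it ties with `order-2` on states and has fewer transitions.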
3.3. VALIDATION
This research is validated by implementing the found patterns and algorithm in a software
tool. It is used to minimize gem5 cache coherence protocol state machines for multiple
protocols and devices. The original gem5 state machines are provided in JSON format, and
the software tool minimizes these state machines, and provides the minimized state ma-
chine also in JSON format for further processing and analysis.
The obtained results can be found in Chapter 8.
3.4. DELIVERABLES
The deliverables of the research are the following:
• A master thesis describing the research results, containing:
– The definition of all found patterns and algorithms for cache coherence state
machine minimization and how to apply them;
– A description of cache coherence protocols which were targeted for minimization,
both in their original form and their minimized form;
– A catalog of analyzed gem5 actions used in the analyzed cache coherence pro-
tocols;
• A tool for automatically reducing gem5 state machines.
4
RELATED WORKS
This chapter describes a number of relevant papers and books that have been investigated
to gain insight into the area in which the research has taken place.
4.1. CACHE COHERENCE
To investigate memory consistency and cache coherence protocols, the primer written by
Sorin et al. [SHW11] was studied. This primer gives an introduction on both these sub-
jects, and introduces various levels of memory consistency and coherence protocols. The
coherence protocols are described for both snooping protocols and directory protocols.
Archibald and Baer [AB86] provide another, less verbose introduction to cache coherence
protocols. They introduce a number of coherence protocols and try to model performance
aspects of these protocols using applications written in the Simula language.
Marty provides another introduction to cache coherence protocols in [Mar08], but with
a focus on Network on a Chip (NoC) configurations. He argues that cache coherence is
needed in these kinds of networks. Chapters 1 and 2 of his thesis were studied for this
subject.
Martin et al. provide in [MHS12] arguments ‘Why on-chip cache coherence is here to
stay’. They state that on-chip (implemented in hardware) cache coherence implementations
can scale to a large number of cores. They give examples of perceived scaling problems
in the areas of traffic, storage, maintaining inclusion, latency and energy usage, and
give solutions to each of these problems.
Komuravelli et al. on the other hand describe in [KAC14] problems with hardware im-
plemented cache coherence protocols, by performing a study where they tried to formally
verify a gem5 simulator (gem5) implementation of a MESI coherence protocol. They found
that even after four years of extensive use bugs still could be found in the implementa-
tion. They use this result as evidence that hardware coherence protocols are still hard to
understand and design. Because of this they propose that hardware-software co-designed
protocols might offer a simpler alternative.
4.2. NETWORKS ON CHIPS
A number of papers were studied about NoCs and routing. Dally and Seitz present in
[DS87] a way to create message networks without deadlocks using wormhole routing. They
do this by creating virtual channels with which to separate the message paths. They state
that “A deadlock-free routing algorithm can be generated for arbitrary interconnection
networks using the concept of virtual channels”. This removes all potential circular
dependencies on network level and makes deadlocks impossible. This would imply that the
current research into verifying cache coherence protocols on NoC networks could
be seen as an optimization problem, as any cycle that could create deadlocks could also be
removed by adding extra hardware for virtual channels. However, Verbeek et al. describe
in [Ver+16] a case where cross-layer deadlocks occur when queues are wrongly sized. This
happens even when virtual channels are applied and the routing algorithm itself is
deadlock-free. These deadlocks emerge from the interaction between the network topology,
the routing algorithm and the queue sizes. That research found that queues need to be of a
minimum size to avoid deadlocks. Networks made of virtual channels only decrease the
minimum queue size; deadlocks still occur when the queues are sized below these minimum
required sizes.
Verbeek and Schmaltz describe in [VS12a] and [VS12b] work done on verification of
coherence protocols on NoC networks. These papers are examples of the kind of research
in which the minimized cache coherence models provided by this research are to be used.
Given the complexity of these papers, only the introductions and first chapters were
studied.
Intel® has created a whitepaper [Int09] on their QuickPath Interconnect interprocessor
communication method. This interconnect method is implemented in their current processor
line, and features point-to-point connectivity using virtual channels per message
type. This would be an example of the application of the theory of Dally and Seitz. Also
interesting is the fact that Intel uses point-to-point messages with protocols called
home- and source-snooping. These are based on MESI type snooping coherence protocol
variations, using point-to-point connectivity between the different on- and off-die caches.
4.3. MODEL CHECKING AND BISIMULATION
Reduction of state machines is a subject that is investigated in the context of Bisimula-
tion equivalence, or Bisimilarity. Bisimulation is a relatively recent research area, which in-
vestigates a stronger equivalence relation compared to trace equivalence, but weaker than
structural equivalence. An advantage of bisimulation as an equivalence relation is that the
equality relation can be checked efficiently.
Bisimulation is explained in detail in Introduction to bisimulation and coinduction,
where Sangiorgi gives an introduction to bisimulation theory, both for strong and weak
bisimulation [San12]. Baier and Katoen give a comprehensive summary of many subjects
related to model checking in [B+08]. They devote a chapter to equivalence relations and
abstractions, where they also give an introduction to bisimulation. A short introduction to
branching bisimulation is also given in [Bas96].
Bisimulation reduction algorithms have been defined that allow for the minimization of
state machines while retaining bisimilarity. Linz describes in [Lin11] a strong
bisimulation reduction algorithm, without explicitly naming it as such.
5
STATE MACHINE MINIMIZATION APPROACH
This chapter gives an overview on the outcome of the research we have done. It provides a
high level view on the process we devised to minimize the cache coherence state machines.
The process generates simpler state machines containing only the features that we are in-
terested in, and that are better suited for model checking.
The overall approach to minimize cache coherence state machines is modeled in Figure
5.1. The reduction and abstraction steps are described in full detail in the
next two chapters. There we also describe the specific aspects of the cache coherence state
machines we used in the algorithms. We end this chapter by detailing what this research
added as new work, and what research results were found in other papers.
5.1. OVERVIEW
A cache coherence protocol state machine, or ‘state machine’, consists of states and
transitions between these states. Each state specifies the condition a cached memory block - or
cache-line - has for a specific cache coherence device. A cache coherence device can be
of a cache controller, a directory controller or a DMA controller type. Each device contains
its own implementation of the state machine; the state machines in all devices combined
implement the cache coherence protocol.
Multiple devices of a type can have their own state machine implementations, for in-
stance the MESI protocol has L1 and L2 cache controllers, with unique state machines.
In a state machine, a transition between states is made when an event is received,
normally in the form of a message. On a transition zero or more actions are executed; each
action implements part of the features of a coherence device. This way, an implementation
feature, like handling requests from the core or maintaining coherence statistics, is
implemented as a set of actions executed on various state transitions.
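The structure described above can be sketched as a small data type. This is a minimal illustration; the state, event and action names are invented for the example and are not the identifiers used in the gem5 protocol files.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Transition:
    source: str                # state before the event
    event: str                 # received event, normally a message
    actions: Tuple[str, ...]   # zero or more actions executed on the transition
    target: str                # state after the event


# Hypothetical fragment of a cache controller state machine.
TRANSITIONS: List[Transition] = [
    Transition("I", "Load", ("send_get_request", "count_miss"), "IS"),
    Transition("IS", "Data", ("store_data", "reply_to_core"), "S"),
    Transition("S", "Load", ("reply_to_core", "count_hit"), "S"),
    Transition("S", "Invalidate", ("send_ack",), "I"),
]

# A feature is modelled here as the set of actions that implement it (assumed grouping).
STATISTICS_FEATURE: Set[str] = {"count_miss", "count_hit"}


def transitions_using(feature: Set[str], transitions: List[Transition]) -> List[Transition]:
    """Return the transitions on which at least one action of the feature runs."""
    return [t for t in transitions if feature.intersection(t.actions)]
```

In this fragment the hypothetical statistics feature is spread over the two `Load` transitions, which shows how one feature is distributed over several transitions.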
A cache coherence state machine is a single Strongly Connected Component (SCC); this
means that all states in the state machine are always accessible from any other state, di-
rectly or indirectly.
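Whether a state machine forms a single SCC can be checked with two reachability searches: every state must be reachable from an arbitrary root, and the root must be reachable from every state. The sketch below assumes that only the (source, target) pairs of the transitions are given; the function names are chosen for this example.

```python
from typing import Dict, Iterable, Set, Tuple


def reachable(start: str, edges: Dict[str, Set[str]]) -> Set[str]:
    """Depth-first search returning every state reachable from start."""
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for nxt in edges.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_single_scc(states: Iterable[str], transitions: Iterable[Tuple[str, str]]) -> bool:
    states = set(states)
    if not states:
        return True
    forward: Dict[str, Set[str]] = {}
    backward: Dict[str, Set[str]] = {}
    for source, target in transitions:
        forward.setdefault(source, set()).add(target)
        backward.setdefault(target, set()).add(source)
    root = next(iter(states))
    # Root reaches every state, and every state reaches the root.
    return reachable(root, forward) >= states and reachable(root, backward) >= states
```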
Because the cache coherence protocol never terminates it has no terminal states.
Additionally, we define the protocol as having no initial state. Having no initial state could be
seen as counter-intuitive; one would expect the initial state to be the Invalid state, as
implementations normally use. However, an explicit initial state is not needed, as explained
in Section 7.2.4.
Figure 5.1: State Machine Minimization Approach
The state machines consist of long-lived states, like the M, S and I states shown in
Figure 2.1. In addition to these long-lived states, the state machines also contain transient
states between the long-lived states. These transient states are used to synchronize
between the state machines of the cache coherence devices.
As an example, when a cache controller requires access to a cache-line, it will
first move from the Invalid state to a transient state. Only after receiving the response
message will the state machine move from the transient state to the long-lived Shared state.
The intermediate transient state is defined here to allow the device to wait for the response
from the memory or directory controller. More complex protocols can define state ma-
chines with a sequence of multiple transient states between two long-lived states.
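The example above can be traced with a small transition table. The transient state name `IS` and the event names are assumptions made for this sketch; the actual protocols use their own identifiers.

```python
from typing import Dict, List, Tuple

# (state, event) -> next state; "IS" is a transient state between I and S.
NEXT_STATE: Dict[Tuple[str, str], str] = {
    ("I", "Load"): "IS",        # request sent, now waiting for the response
    ("IS", "Data"): "S",        # response received, cache-line is now Shared
    ("S", "Invalidate"): "I",
}
LONG_LIVED = {"I", "S"}


def run(state: str, events: List[str]) -> List[str]:
    """Return the sequence of states visited while handling the events."""
    visited = [state]
    for event in events:
        state = NEXT_STATE[(state, event)]
        visited.append(state)
    return visited


def transient_states(visited: List[str]) -> List[str]:
    """Filter the visited states down to the transient ones."""
    return [s for s in visited if s not in LONG_LIVED]
```

Handling a `Load` followed by a `Data` event from the Invalid state visits `I`, `IS` and `S` in that order, with `IS` as the only transient state.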
5.1.1. NETWORK-IMPACTING MINIMIZATIONS
What we are interested in for this research is abstracting the state machines to only
their network-impacting operations. This means that we want to remove all cache coherence
features from the state machines that have no impact on the network behaviour of the
protocol. This includes the content of the messages: we are only interested in when messages
are sent, to which device node, and of which message type.
We are also only interested in the externally visible behaviour of the state machines;
how a state machine is internally constructed is of no concern to us. This implies that when
we compare state machines for equality, we are interested in whether they are
trace equivalent.
5.2. THE MINIMIZATION PROCESS
The process we devised to minimize the cache coherence protocol state machines consists
of two phases. In the first phase we apply a number of abstraction patterns to the state
machine. Each pattern deletes or changes part of the state machine so that non-essential
features are removed.
The three patterns we found are named:
• Removal of non-essential actions;
• Removal of non-essential devices;
• Merge similar actions.
These patterns work independently from one another, and are described in detail in
Chapter 6.
After the abstraction patterns have been applied, we are left with a sparse SCC state
machine; a Non-deterministic finite automaton (NFA) with internal τ-transitions for tran-
sitions where all actions have been removed. In this sparse state machine opportunities
exist to combine states and transitions, or to remove transitions.
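How removing actions leads to τ-transitions and non-determinism can be sketched as follows. The sketch assumes that an abstracted transition is labelled by the actions that survive the removal; this labelling and all action names are illustrative assumptions, not the representation used by the tool.

```python
from typing import Iterable, Set, Tuple

TAU = "tau"
Transition = Tuple[str, Tuple[str, ...], str]   # (source, actions, target)
Labelled = Tuple[str, str, str]                 # (source, label, target)


def abstract(transitions: Iterable[Transition], non_essential: Set[str]) -> Set[Labelled]:
    """Drop non-essential actions; a transition left without actions becomes tau."""
    result = set()
    for source, actions, target in transitions:
        kept = tuple(a for a in actions if a not in non_essential)
        label = "+".join(kept) if kept else TAU
        result.add((source, label, target))
    return result


# Hypothetical fragment; the action names are invented for this example.
transitions = [
    ("I", ("send_get_request", "count_miss"), "IS"),
    ("I", ("send_get_request",), "IM"),
    ("S", ("reply_to_core", "count_hit"), "S"),
]
non_essential = {"count_miss", "count_hit", "reply_to_core"}
```

In this fragment state `I` ends up with two transitions that carry the same label but lead to different states, and the self-loop on `S` becomes a τ-transition, so the result is an NFA.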
The second phase is the reduction phase. In this phase we apply a strong bisimulation
reduction algorithm to the abstracted sparse state machine to minimize the
number of states and transitions. Before applying this reduction algorithm we first convert
the sparse state machine from an NFA to a Deterministic Finite Automaton (DFA) to allow
the algorithm to be used. To retain the SCC state machine attribute, a post processing step
is executed to remove any superfluous state introduced during the NFA to DFA conversion,
before performing the strong bisimulation reduction step.
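The final reduction step can be illustrated with a generic partition refinement on a deterministic state machine. This is a textbook-style sketch under the assumption that the DFA is given as a (state, label) to state mapping. It is not the implementation of the tool, and the subset construction and post processing steps are not shown.

```python
from typing import Dict, Iterable, Tuple


def bisimulation_classes(states: Iterable[str],
                         delta: Dict[Tuple[str, str], str]) -> Dict[str, int]:
    """Map every state to the index of its strong bisimulation class."""
    states = sorted(set(states))
    labels = sorted({label for (_state, label) in delta})
    block_of = {s: 0 for s in states}          # start with a single block
    num_blocks = 1 if states else 0
    while True:
        signatures: Dict[tuple, int] = {}
        refined: Dict[str, int] = {}
        for s in states:
            # For each label: the block of the successor, or None if there is none.
            successors = tuple(block_of.get(delta.get((s, a))) for a in labels)
            signature = (block_of[s], successors)
            refined[s] = signatures.setdefault(signature, len(signatures))
        if len(signatures) == num_blocks:      # no block was split: stable
            return refined
        block_of, num_blocks = refined, len(signatures)


# Hypothetical DFA with two pairs of equivalent states.
delta = {
    ("A", "req"): "B", ("B", "ack"): "A",
    ("C", "req"): "D", ("D", "ack"): "C",
}
classes = bisimulation_classes(["A", "B", "C", "D"], delta)
```

States that end up with the same class index can be merged into one state of the minimized machine; in the example `A` and `C` fall into one class and `B` and `D` into another.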
5.2.1. IMPLEMENTATION
The first abstraction pattern and the reduction algorithms are implemented in a software
tool written in Haskell. This software tool is fed with a cache coherence state machine and
a list of non-network-impacting actions. The tool will minimize the state machine by first
applying the abstractions, after which it will use the reduction algorithms to generate the
minimized state machine.
With this tool the applicability of the minimization process is shown.
5.3. CONTRIBUTION AND RESEARCH BOUNDARIES
The subset construction, τ-elimination and strong bisimulation reduction algorithms are
well-known algorithms. We have used the description in [Lin11] of these algorithms to
guide our implementation.
The main contributions of this research are the following:
• We have created a number of pattern descriptions to guide in the removal of unnec-
essary state machine features;
• We have devised a post processing step to adapt the well-known subset construction
algorithm to state machines without initial states, and proven its correctness;
• We have provided a formal proof that for Deterministic Finite Automata (DFAs)
similarity, bisimilarity and trace equivalence coincide;
• We have devised the overall minimization process, using the above knowledge.
The three patterns we documented were found in the investigated MI and MESI state
machines. It could be that other cache coherence state machines provide evidence of ad-
ditional patterns; these three patterns were found during the inspection of the MI protocol
state machines, and were subsequently validated on the MESI state machines.
We could not find a formal proof for the equality of bisimilarity and trace equivalence
for LTS DFAs in the literature. The closest we came was a theorem in [B+08, p.579] that
states the equivalence for AP-deterministic state machines. This theorem was not provided
with a proof; the proof was left as an exercise for the reader. The coincidence of these two
equivalences is essential to allow the strong bisimulation reduction to be used in our
algorithm; therefore this thesis provides the proof for this theorem.
Together the individual proofs in this document and references to proofs in other doc-
uments provide the mathematical proof of the correctness of the complete reduction algo-
rithm.
We did not attempt to prove that the combination of these algorithms provides minimal
results. The subset construction algorithm in particular can introduce additional states. We
did not observe this during our experiments; our expectation is that any such additional
states are removed again during the strong bisimulation reduction step.
Weak bisimilarity is an equivalence relation that can be directly applied on state ma-
chines with internal τ-transitions. It could be that research in this area provides more ap-
plicable reduction algorithms, resulting in simplifications to more compact NFA versions
of the state machines. This is left as future research.
6
STATE MACHINE ABSTRACTION PATTERNS
When we want to minimize the gem5 simulator (gem5) cache coherence state machines,
we want to remove those components, device nodes and structural features - named fea-
tures in the rest of this chapter - that are not needed for our purposes.
This chapter presents the abstraction patterns used to remove non-essential features
from cache coherence protocol implementations. The described abstraction patterns are
the ‘Removal of non-essential actions’ pattern, the ‘Merge similar actions’ pattern, and the
‘Removal of non-essential devices’ pattern.
For this research we are interested in the networking behaviour of the state machine
implementations. In Section 6.1 we will first identify which components and features are
defined to implement the protocol. We indicate which opportunities exist to remove the
features we do not need to model for our research. We explain why these features are non-
essential for our research by describing them in sufficient detail.
In Section 6.2 we will describe the patterns we have identified that can be used to re-
move the non-essential features from the state machines. The information in Section 6.1 is
used with these patterns to abstract away these non-essential features from the state ma-
chines.
These patterns are like recipes; when the proper information on the features is pro-
vided, an implementation of the pattern algorithm can programmatically remove the non-
essential features. The patterns can in principle be re-used by other research, to remove
other subsets of state machine features deemed non-essential for their research.
Definition Essential feature - A feature is essential if and only if it can influence the dead-
lock behaviour of the cache coherence protocol implementation subsystems we are inter-
ested in. All other features are deemed non-essential.
For our research, a feature is essential if and only if it influences the way messages are
transmitted and received over the on-die communication network. Examples of features
that are essential are the queues with which the cores send and receive messages. Features
which influence the message routing (addressing) are also seen as essential.
Definition Network-impacting - A feature is network-impacting if and only if it directly
impacts the network behaviour of the protocol. All other features are deemed non-network-
impacting.
The abstractions we create are sound and complete:
Definition Sound abstraction - An abstraction is sound if all properties on the abstraction
are also present in the original model.
Definition Complete abstraction - An abstraction is complete if all relevant network prop-
erties from the original model are also present in the abstraction.
However, we do not guarantee that all non-essential features are removed when these
patterns are applied. In Chapter 7 we describe methods to reduce the size of the sparse
state machine that remains after the application of these abstraction patterns.
6.1. CACHE COHERENCE COMPONENTS
The gem5 implementation of the cache coherence protocols consists of a number of co-
herence device nodes, which implement structural features. The coherence device nodes
are the component nodes that execute the coherence protocol, like the cache controller(s),
directory controller(s) and the DMA controller(s).
The structural features are used to implement these devices, and with which the nodes
implement services like message reception and transmission, temporary buffers for cache
data, message queues and usage statistics, etc.
In gem5, the device node types can be identified by their cache coherence protocol
implementation. Each node has its own implementation source file in the gem5 source
code. This source file describes and implements the state machine that provides the device
functionality for that node. The features are implemented as action functions in the state
machine implementations. In Appendix A more detailed information on the gem5
implementation for the investigated coherence protocols is given.
In this paper we will concentrate on the devices and features found in the examined
gem5 protocols. Other protocols can have additional features and device node types to
implement additional functionality, for instance to support special caches or additional
cache-line states, like the Owned or Forwarded states used in more complex protocols.
Before abstracting, the protocols need to be inspected to obtain the information needed
to apply the abstraction patterns. Depending on the outcome of this inspection additional
non-essential components and features can be identified.
The terminology and device naming used in this paper follows the gem5 protocol
implementation; additional detail on this is given in [Sor+02].
This section will provide a high level description of the devices and structural features
we encountered, in sufficient detail to determine their network-impacting state. With this
information we can determine if the device or feature can be abstracted away, with the
patterns described in the following section.
6.1.1. COHERENCE DEVICES
The coherence devices are the components that together implement and execute the
coherence protocol. They communicate with messages, sent and received via message queues.
Each device executes its own state machine, implementing the part of the cache
coherence protocol that the device is responsible for. The state machines are executed for
cache-lines, which are the memory blocks on which the caching is implemented. For each
cache-line the state is separately maintained, and this state is used to run a separate
instance of the cache coherence protocol. A cache controller can simultaneously have
cache-lines in a read-only state, in a modified state, and in the various transitional states.
These cache states and transitions are independent from one another from a
coherence protocol view, so we do not have to look at interactions between the state
machine instances for different cache-lines. State transitions within devices are normally
executed based on messages that are received from other components. This in turn can lead
to messages sent to other components. The message recipients can be other coherence
devices or they can be external components, like cores and memory subsystems that interact
with the cache coherence subsystem.
Figure 6.1: Cache Coherence Devices (diagram labels: Core #1 … Core #N, Cache Controller,
Private data cache, Communication Bus/Network, Multicore Processor Chip, LLC/Directory
Controller, Last Level Cache, DMA Device, DMA Controller, Main Memory)
As a side note, any multi-threaded program that is executed by multiple micro-processor
cores creates dependencies between state machine executions of different cache-lines. The
application code defines the order in which cores trigger memory loads and stores to
memory locations. We ignore any dependency that is created because of the application, as we
define these as being non-essential to our research.
For the device descriptions we have used the MI devices; other cache coherence protocol
configurations will have comparable devices.
COMMUNICATION NETWORK
The communication network, or network, is responsible for the transport of messages be-
tween the other coherence devices. It is not a direct part of the cache coherence protocol
implementation, and normally assumed to be a lower level feature when looking at cache
coherence protocols. For our research it is however an essential component, as it is this
device which is to be model checked with the minimized cache coherence protocols for
deadlock behaviour. This device is modeled separately, and the model is later joined with
the minimized cache coherence state machines.
Depending on the network infrastructure, it can be composed of one network or a num-
ber of sub-networks. The requests and responses can be transmitted over their own net-
work, and the DMA communication can also be done over a separate network.
One reason to separate requests from responses is that the coherence protocol is syn-
chronized on the request messages. For snooping/bus protocols this is especially impor-
tant, as all cache controllers need to monitor the request network to synchronize amongst
themselves.
In the cache coherence protocol definition, these networks are represented by the
different queues the cache coherence protocol implementations use to send and receive
messages. Depending on the network configuration, networks can combine messages from
different queues. Network implementations will normally not split messages from one queue
over different networks, as the network normally isn’t aware of the message content; if
messages are to be sent over different networks, the cache coherence devices will use
different queues.
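The relation between queues and networks described above amounts to a mapping in which every queue is assigned to exactly one network, while one network may serve several queues. The sketch below uses hypothetical queue and network names to show one such configuration.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical queue names. A queue maps to exactly one network, so the
# messages of one queue are never split over different networks.
QUEUE_TO_NETWORK: Dict[str, str] = {
    "cache_request_out": "request",
    "directory_forward_out": "request",
    "cache_response_out": "response",
    "directory_response_out": "response",
    "dma_request_out": "dma",
}


def queues_per_network(queue_to_network: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the queues by the network that transports their messages."""
    grouped = defaultdict(list)
    for queue, network in sorted(queue_to_network.items()):
        grouped[network].append(queue)
    return dict(grouped)
```

A configuration with a single shared network would map every queue to the same network name; the grouping then yields one entry containing all queues.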
CACHE CONTROLLER NODE
A first level cache controller, such as the one used in the MI protocol, is attached to a
processing core, and implements the interaction between the core, the local cache or caches
and the rest of the cache/memory subsystem. Lower level caches, like the L2 cache controller
in the MESI protocol, provide caching functionality between two component layers.
When a processor core needs to access a memory location for a read or write action, it
requests the value from the L1 cache controller. When the requested cache-line is already
present in a private data cache, the controller can immediately write the value or return it.
When the cache-line is not in the local cache, the controller will send out a request to
obtain a read-only or writeable copy of the cache-line. How this request is executed, how the
cache-line is subsequently handled and written back to memory, is the primary focus of the
cache coherence protocol.
Figure 6.2: MI cache controller node
The gem5 cache controller implementations use
queues to decouple the interaction between the con-
troller and the core, as well as to decouple the interaction
with the other nodes via the communication network. Figure 6.2 shows the queues that are
present in the MI cache coherence protocol implementation we examined.
The incoming requests from the core are queued in the Mandatory Queue; the responses
are directly returned to the core sequencer, without queueing. The requests in the
mandatory queue come from a sequencer, which is the gem5 core component that initiates and
handles memory requests.
The MI cache controller has two sets of queues for requests and responses to and from
other components. This separation exists because requests and responses can be handled
via separate networks, depending on the exact network configuration. One of the reasons
these networks can be split is that the ordering of memory requests for the various
cores is done on request network messages.
Requests are transmitted and received via the request network. Outgoing requests are
sent when the current core requires a cache-line that is not already in the private data
cache of the controller. Incoming requests come from the directory controller, when another
cache controller requires access to cache-lines that the current cache controller uses
(‘owns’). In addition to request type and addressing information, a request message can also
contain message data in the case of a ‘PUT’ request.
The response network is used to send and receive responses to the requests. The
responses are either acks or nacks, to (negatively) acknowledge the handling of requests. The
response messages can also contain memory data.
In addition to the mandatory queue shown in Figure 6.2, a core can also communicate
with the cache controller via an optional queue, if it exists. This queue can also be used
by the execution core to send memory requests to the cache controller. The mandatory
queue is used by the execution core to send ‘mandatory’ requests, the requests that must
be handled by the coherence subsystem, to the cache controller node. An optional queue
however, will be used by the core sequencer to initiate optional requests, like pre-fetches of
cache-lines needed in the (near) future. The cache-line is then retrieved from the memory
subsystem, and stored in the private data cache. The main difference between the optional
and mandatory queue handling is that the results of a fetch via the optional queue are not
returned to the sequencer. If the cache coherence subsystem is busy with other requests it
can ignore the pre-fetch request. It will always be followed by a request via the mandatory
queue when the cache-line is actually needed for a load or store action. We could not find
evidence for it, but presumably the names mandatory and optional queue were chosen to
indicate the status of memory requests on these queues.
We found an optional queue definition in the gem5 MESI protocol; the MI protocol did
not define an optional queue.
When an abstraction is created for the network-impacting aspects of this node, the core
communication and private data cache interaction can be removed. All actions that use the
queues to both the request- and response networks need to remain. These actions generate
and receive the requests and responses that are handled via the communications network
that we are researching.
Another reason that responses cannot be abstracted away is that they are at minimum
needed to retain the synchronisation between the state machines running on the various
device nodes. This implies that we cannot do our research using only request messages.
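The selection described above can be sketched by recording, per action, which resources it touches and keeping only the actions that use a network queue. The action names, resource names and the table itself are assumptions made for this illustration; in practice this information follows from inspecting the protocol implementation.

```python
from typing import Dict, Set, Tuple

# Queues that connect the cache controller to the request and response networks.
NETWORK_QUEUES: Set[str] = {"request_out", "request_in", "response_out", "response_in"}

# Hypothetical table: action name -> resources the action touches.
ACTION_RESOURCES: Dict[str, Set[str]] = {
    "send_get_request": {"request_out"},
    "send_ack": {"response_out"},
    "pop_forwarded_request": {"request_in"},
    "reply_to_core": {"sequencer"},
    "pop_mandatory_queue": {"mandatory_queue"},
    "store_data": {"private_data_cache"},
}


def split_actions(action_resources: Dict[str, Set[str]],
                  network_queues: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (actions to keep, actions that can be abstracted away)."""
    keep = {name for name, used in action_resources.items() if used & network_queues}
    return keep, set(action_resources) - keep


keep, removable = split_actions(ACTION_RESOURCES, NETWORK_QUEUES)
```

In this hypothetical table the core communication and private data cache actions end up in `removable`, while every action that sends or receives on a network queue is kept.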
DIRECTORY CONTROLLER NODE
In the gem5 implementations, the directory controller is responsible for the provisioning
of cache-lines from memory subsystems. For protocols where the cache controllers do
not communicate with one another directly, the directory controller can also maintain the
ownership of the cache-lines. For the MI protocol this is indeed the case; for the MESI
protocol this function is implemented in the L2 cache controller. The component that
maintains the ownership also makes sure that requests for cache-lines already claimed by
another cache controller are forwarded to that controller.
When looking at Figure 6.3, the queues drawn above the directory controller communicate
with the network and are as such essential for our research. There is one incoming request
queue from the cache controllers, but a directory controller has two outgoing queues to the
cache controllers. The incoming queue contains the received requests to be handled from
cache controllers, via the request network. Incoming requests are forwarded when they are
for a cache-line that is currently used (‘owned’) by another core. One outgoing queue is used
to forward these requests over the request network; the other outgoing queue is used for
responses to requests the directory controller could handle itself. The latter are sent over
the response network. The responses to the forwarded requests are sent directly between the
cache controllers. The directory controller does not initiate requests itself, and also
doesn’t directly receive responses from the network. Because of the latter, it doesn’t
contain an incoming response queue.
Figure 6.3: MI directory controller node
For the MI protocol the interaction with the DMA controller uses dedicated queues,
separate from the communication with the cache controllers. DMA requests are received
from the DMA controller via a separate incoming queue, and responses are returned via a
separate DMA response queue.
Whether the DMA controller queues are network-impacting, and therefore essential for our
research, depends on the network configuration: they are when the DMA messages are
transmitted via the same network.
The directory controller directly sends requests to the memory subsystem, but has an
incoming queue for memory responses. These are non-essential for our research, and can
be removed.
The directory controller is only needed for protocols which are not ‘snooping’. When
a snooping protocol is used, the cache controllers synchronize their controller states by
snooping on all requests on the request network. This makes it unnecessary to retain the
cache-line states in a directory, and no dedicated controller is needed to forward requests.
Obviously we still need a controller to handle requests to the memory device. In the MESI
protocol the directory controller is used for access to the memory device, and to handle DMA
requests.
In the literature the directory controller or memory controller also contains a shared
‘Last Level Cache’, or LLC. This cache is used as a shared secondary cache for the cores,
and provides an additional optimization for accesses to the memory subsystem. The gem5
implementations of the cache coherence protocols we investigated do not implement this
LLC.
A directory controller node is mandatory for all memory subsystems present in the system.
At least one will be present in each configuration.
DMA CONTROLLER NODE
Figure 6.4: MI DMA controller node
A DMA controller is a special kind of controller that requests cache-lines for block oriented
IO actions. In some architectures it can also be used for memory block-copy actions,
relieving the core of this work. To implement these DMA actions, the DMA controller
sequentially claims a set of cache-lines, either for reading or for writing.
As a DMA controller normally only reads or writes a memory location once, it exhibits
excellent spatial locality, but no temporal locality. This makes the memory request behaviour
more predictable when compared to a cache controller. A DMA controller therefore does not
need to have a private data cache. It does not need to hold on to cache-lines for a longer
time and does not need to retain access to many blocks simultaneously.
When we look at the behaviour of DMA controllers from a distance, they look very similar to
the cache coherence devices connected to the cores. However, in the protocol implementations
we investigated they run a different, much simpler state machine. If one wants to take this
different state machine behaviour into account, one or more DMA controllers should be added
to the model. For our research we decided to remove the DMA controllers.
When investigating network-impacting behaviour, another criterion is whether the DMA
controller communicates via a network which is shared with the cache controller devices.
Only if the network is shared can it impact and cause deadlocks in the communication
between the directory- and cache controllers. Otherwise the component will never impact
this network behaviour, and can be removed.
This last attribute can only be determined by looking at the network configuration itself.
MICROPROCESSOR CORE AND DMA IO DEVICES
The microprocessor core and DMA IO devices are responsible for the memory requests
issued to respectively the cache- and DMA controllers. For this they contain sequencer
components that request cache-lines from the cache controller.
For our model checking purposes the core and IO devices can be abstracted away, as
they do not directly impact network interactions.
MEMORY
The memory subsystem contains the system memory. The cache coherence subsystem
communicates with the memory subsystem via the directory controller.
For our model checking purposes we are not interested in the content of the memory
or the cache-lines we exchange. We can abstract away this subsystem and the interaction
with it. Also any lower level cache or Last Level Cache (LLC) can be abstracted away for this
reason, if it is present in a protocol implementation.
6.1.2. COHERENCE FEATURES
The structural features discussed in this section are attributes of the implementations which
enrich the core protocol with functionality. Some of these functions are mandatory for a
correct working of the protocol. Others are supplementary, as they provide performance
optimizations or data concerning the state machine execution. An example of the latter is
statistics gathering, which can provide performance data like cache hit ratio.
All features that are not impacting the deadlock behaviour we are interested in can be
abstracted away.
MESSAGE CONTENTS
One of the things that we are not interested in for our research is the memory data in the
exchanged messages. This is because the memory content only becomes relevant outside
of the cache coherence subsystem; the subsystem itself does not interpret message content.
The core, memory and DMA IO devices are of course interested in the memory content, but
we also exclude these from the checked model.
TBE BUFFERS
The Transaction Buffer Entries (TBEs) are records that hold information on the transactions
that are currently being executed [Sor+02, p.5]. One of the uses of the TBEs is to make
sure that only one transaction is executed per cache-line per coherence device node. The
number of TBEs present in the node limits the number of transactions that a node can
issue simultaneously.
When not looking at the memory content itself, the main impact this feature has on the
network behaviour is the limit on the number of transactions that can be issued. This
limits the number of messages that the node can wait on. If that is no concern, for instance
because this message limit is already implemented with queue sizes, the feature can be
removed.
QUEUES
Queues are the primary means with which the gem5 cache coherence device nodes com-
municate amongst each other and with the subsystems in the environment. They are used
to de-couple the various components, and to allow the cache to defer handling of mes-
sages when performing transitions. The queues are shared for all messages, so there is for
instance only one mandatory queue used for all memory requests. Any message at the top
of the queue will block all further messages if not handled, so special handling is needed
when messages are to be deferred.
The queues that are needed in our abstracted model are those that communicate with
the network and environment devices that we want to model check. The other queues
and their interactions can safely be removed, as they will not impact the behaviour we are
interested in. This does not mean that no deadlocks can occur in the interaction with these
devices, but this is out of scope for our current research.
For instance the mandatory queue can be removed as it is not network-impacting. The
model checking will be done directly on the messages initiated from cache controllers to
the directory controllers.
REQUEST RECYCLING
The queues that the different components use combine requests for all cache-lines. When
a cache-line is in a transient state, the state machine is not capable of handling another
request for this cache-line. If such a message stayed at the head of the queue, it would
also block this queue for all other messages, including those for other cache-lines.
To prevent this blocking and to increase the performance, the request can be recycled.
This can for instance happen by either placing it back in the queue, or by placing it
into a side table. The request is kept there until the associated memory block state machine
is capable of handling this request again (see footnote 1).
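The head-of-line blocking that recycling avoids can be illustrated with a minimal Python sketch. This is not gem5 code: the function names, the message tuples and the `transient_lines` set are assumptions made for the illustration, and only the "place it back in the queue" variant is shown.

```python
from collections import deque

def process_blocking(queue, transient):
    """Handle messages in order; stop at the first one whose cache-line is transient."""
    handled = []
    while queue and queue[0][0] not in transient:
        handled.append(queue.popleft())
    return handled

def process_recycling(queue, transient):
    """Make one pass over the queue; re-append messages that cannot be handled yet."""
    handled = []
    for _ in range(len(queue)):
        message = queue.popleft()
        if message[0] in transient:
            queue.append(message)  # recycle: retry once the line is stable again
        else:
            handled.append(message)
    return handled

# Hypothetical messages as (cache-line address, request type) pairs.
messages = [(0x10, "GETX"), (0x20, "GETX"), (0x30, "PUTX")]
transient_lines = {0x10}  # assumption: line 0x10 is currently in a transient state

blocked = process_blocking(deque(messages), transient_lines)
recycled_queue = deque(messages)
recycled = process_recycling(recycled_queue, transient_lines)
```

Without recycling, the message for the transient line blocks the two messages behind it; with recycling those two are handled and only the deferred message remains queued.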
For our purposes this request recycling functionality can be removed, as we are not
interested in performance aspects.
STATISTIC GATHERING
Statistics gathering is done by gem5 to get insight into the performance of the cache coher-
ence subsystem, for instance to measure the cache hit ratio with differing cache sizes.
This feature can be safely removed during abstraction.
CACHE-LINE OWNERSHIP
Cache-line ownership is a feature that is present in the directory controllers. It is needed
to forward requests for already used (‘owned’) memory blocks to the cache controller node
that currently owns the block. As such, it defines the addressing of forwarded messages.
This makes it a network-impacting feature.
For the MI protocol the cache-line ownership is tracked by the directory controller. This
controller tracks which device has the cache-line in Modified state. For the MESI protocol
the L2 cache controller tracks the list of L1 caches that have the cache-line in Shared state,
or the cache that has the cache-line in Exclusive/Modified state.
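Why ownership is network-impacting can be shown with a toy Python model of an MI-style directory. This is a sketch under our own assumptions, not the gem5 implementation: `route_getx`, the `owners` table and the node names are hypothetical, and the memory fetch that precedes a data response is left out.

```python
def route_getx(owners, address, requestor):
    """Decide on which network a GETX is answered and to which node.

    owners maps a cache-line address to the node holding it in Modified state.
    Returns ("response", node) when the directory answers itself, or
    ("forward", node) when the request is forwarded to the current owner.
    """
    current_owner = owners.get(address)
    owners[address] = requestor  # the requestor becomes the new owner
    if current_owner is None:
        return ("response", requestor)
    return ("forward", current_owner)

owners = {}
first = route_getx(owners, 0x40, "L1Cache-0")   # line not owned yet
second = route_getx(owners, 0x40, "L1Cache-1")  # line owned by L1Cache-0
```

The destination of the second message depends on the ownership table, which is why this feature cannot be abstracted away.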
6.2. ABSTRACTION PATTERNS
The abstraction patterns provide a structured method to allow automated removal of
non-essential protocol features. We have documented three patterns; each of these patterns
implements a way to remove non-essential functionality to minimize the size of the state
machine.
The three documented patterns are:
• Removal of non-essential actions;
• Merge similar actions;
• Removal of non-essential device.
These three patterns are described in detail in the following sections. We assume that
when actions are deemed non-essential, they can be abstracted away with these three pat-
terns.
Our methodology and these three patterns are sound because we don’t add features or
transitions, and complete because all removed features are non-essential.
As mentioned in the introduction of this chapter, we do not guarantee that all non-
essential features are removed when these patterns are applied.
1 See http://www.m5sim.org/SLICC#Special_Functions for more information on the features gem5
provides for request recycling.
This could be caused by the way the patterns are applied, in that we did not identify
all non-essential features. More complex protocols could also allow identification
of additional abstraction patterns, which could allow removal of even more non-essential
features.
6.2.1. REMOVAL OF NON-ESSENTIAL ACTIONS PATTERN
The state transitions of the gem5 state machines are all labelled with an event and a list
of actions. Each action in this list is executed during the state transition that is taken
when the event is received. Events are received and responses are sent via the queues used
for message exchange. As we are only interested in the networking related behaviour, we
only need to look at the actions that use networking related queues or that directly
influence the message sending, like the cache ownership feature. The network related queues
are the queues that are used to communicate between the components running the cache
coherence protocol.
For each action in all protocol components we inspect whether it performs an enqueue or
dequeue of a message on any of the networking queues. When the action neither enqueues or
dequeues messages on network-impacting queues nor directly influences message sending, it
can be abstracted away.
An example of an action which is network-impacting is given in the following listing:
action(a_issueRequest, "a", desc="Issue a request") {
  enqueue(requestNetwork_out, RequestMsg, issue_latency) {
    out_msg.addr := address;
    out_msg.Type := CoherenceRequestType:GETX;
    out_msg.Requestor := machineID;
    out_msg.Destination.add(map_Address_to_Directory(address));
    out_msg.MessageSize := MessageSizeType:Control;
  }
}
Listing 6.1: Network-impacting actions
The a_issueRequest action queues a GETX request on the requestNetwork_out queue.
This clearly makes it relevant for the network behaviour. This action is therefore essential
to our research.
An example of actions that can be removed is given in the following listing:
action(x_copyDataFromCacheToTBE, "x", desc="Copy data from cache to TBE") {
  assert(is_valid(cache_entry));
  assert(is_valid(tbe));
  tbe.DataBlk := cache_entry.DataBlk;
}

action(z_stall, "z", desc="stall") {
  // do nothing
}
Listing 6.2: Actions without network impact
The action x_copyDataFromCacheToTBE copies the data block of a cache entry into the TBE.
The action z_stall stalls the processing of a queue message. Neither action is
network-impacting; therefore they are non-essential to our research and can be removed.
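The classification applied to the two listings can be sketched in Python as follows. This is an illustration under assumptions, not the tooling used in this thesis: the `Action` record, the `affects_addressing` flag and the `is_essential` helper are hypothetical, and the queues each action touches are entered by hand rather than parsed from SLICC.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    queues: frozenset = frozenset()   # queues the action enqueues to or dequeues from
    affects_addressing: bool = False  # e.g. updates cache-line ownership

def is_essential(action, network_queues):
    """An action is kept when it touches a network queue or influences message sending."""
    return bool(action.queues & network_queues) or action.affects_addressing

# Assumption: the set of network-impacting queues is known from the configuration.
NETWORK_QUEUES = frozenset({"requestNetwork_out", "forwardNetwork_out"})

actions = [
    Action("a_issueRequest", frozenset({"requestNetwork_out"})),
    Action("x_copyDataFromCacheToTBE"),
    Action("z_stall"),
]
essential_names = [a.name for a in actions if is_essential(a, NETWORK_QUEUES)]
```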
IMPACT ON THE MINIMIZED STATE MACHINE
The removal of actions executed during transitions can result in empty action lists; these
represent internal τ-transitions. Internal τ-transitions can be removed during the reduc-
tion phase.
Also, transitions whose action lists differ on only non-essential actions will become
equal. Equality for transitions is defined as having equal action lists. When the action lists
become equal because of this pattern, these transitions can potentially be combined by the
reduction algorithms in Chapter 7.
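Both effects can be seen in a small Python sketch. The transition tuples, the state and event names and the `strip_actions` helper are hypothetical; only the action names are taken from the listings above.

```python
TAU = ()  # an empty action list represents an internal tau-transition

def strip_actions(transitions, essential):
    """Drop non-essential actions from every (source, event, actions, target) transition."""
    return [
        (src, event, tuple(a for a in actions if a in essential), dst)
        for src, event, actions, dst in transitions
    ]

essential = {"a_issueRequest"}
transitions = [
    ("S0", "E1", ("a_issueRequest", "x_copyDataFromCacheToTBE"), "S1"),
    ("S0", "E2", ("a_issueRequest",), "S1"),
    ("S1", "E3", ("z_stall",), "S1"),
]
abstracted = strip_actions(transitions, essential)
```

After abstraction the first two transitions carry the same action list and are therefore equal in the sense used here, while the third has become a τ-transition.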
DEADLOCK BEHAVIOUR
This pattern is sound, as we only remove non-essential actions. Any deadlock that occurs
when non-essential actions are removed also occurs in the real system, and vice versa.
6.2.2. MERGE SIMILAR ACTIONS PATTERN
The previous pattern removes all non-essential actions from the Cache Coherence state
machines, and during the reduction phase any equal transitions are combined, where pos-
sible. So different transitions that use the same network-impacting actions can already be
combined. However, when similar but not exactly equal network-impacting actions are
used, this optimization should also be possible. See for instance the following actions
from the MI_example directory controller:
action(a_sendWriteBackAck, "a", desc="Send writeback ack to requestor") {
  peek(requestQueue_in, RequestMsg) {
    enqueue(forwardNetwork_out, RequestMsg, directory_latency) {
      out_msg.addr := address;
      out_msg.Type := CoherenceRequestType:WB_ACK;
      out_msg.Requestor := in_msg.Requestor;
      out_msg.Destination.add(in_msg.Requestor);
      out_msg.MessageSize := MessageSizeType:Writeback_Control;
    }
  }
}

action(l_sendWriteBackAck, "la", desc="Send writeback ack to requestor") {
  peek(memQueue_in, MemoryMsg) {
    enqueue(forwardNetwork_out, RequestMsg, 1) {
      out_msg.addr := address;
      out_msg.Type := CoherenceRequestType:WB_ACK;
      out_msg.Requestor := in_msg.OriginalRequestorMachId;
      out_msg.Destination.add(in_msg.OriginalRequestorMachId);
      out_msg.MessageSize := MessageSizeType:Writeback_Control;
    }
  }
}
Listing 6.3: Similar actions in Directory controller
It is easy to see that these actions, while different, are very similar when only looking
at their networking aspects. They both respond with a WB_ACK acknowledgement message
on the forwardNetwork_out queue. The differences are mainly where the data comes from:
either from an incoming message or from the memory. As we are not interested in the message
content, from a network-impact point of view any two transitions with only these actions
are eligible to be combined.
This can be achieved by simply renaming all similar actions to the same action name.
This ensures that any transitions that can be combined will be combined during the
reduction phase. Obviously this also requires that all other essential actions executed
during the transitions are equal.
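The renaming step amounts to mapping every action name to a canonical representative. A minimal Python sketch, in which the `CANONICAL` mapping and the `merge_similar` helper are our own illustrative names and the source and target states are simplified:

```python
# Assumption: the set of similar actions has been identified by inspection.
CANONICAL = {"l_sendWriteBackAck": "a_sendWriteBackAck"}

def merge_similar(transitions, canonical):
    """Rename similar actions so that their transitions get equal action lists."""
    return [
        (src, event, tuple(canonical.get(a, a) for a in actions), dst)
        for src, event, actions, dst in transitions
    ]

transitions = [
    ("MI", "Memory_Ack", ("l_sendWriteBackAck",), "I"),
    ("MI", "PUTX", ("a_sendWriteBackAck",), "I"),
]
merged = merge_similar(transitions, CANONICAL)
```

Whether the two transitions are then actually combined is left to the reduction phase.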
IMPACT ON THE MINIMIZED STATE MACHINE
By making actions equal, we can also make action lists equal. This means that the transi-
tions can also become equal, as equality for transitions is defined as having equal action
lists. Any equal set of transitions can potentially be combined, again depending on the
state machine configuration.
DEADLOCK BEHAVIOUR
The similar actions differ in the original, fully functional cache coherence protocol.
As we remove irrelevant aspects with the abstraction, what we retain are not similar,
but completely equal actions. If you look at the example in Listing 6.3, the difference is
whether the message content is read from the memQueue_in, or from the requestQueue_in.
From a network-impacting perspective we are only interested in the type of message
returned, not in the message content. This also implies that the deadlock behaviour of the
resulting actions is the same, so they can be combined.
6.2.3. REMOVAL OF NON-ESSENTIAL DEVICE PATTERN
When the DMA coherence node is deemed out of scope for model checking, we cannot
simply remove all actions related to the DMA protocol in the state machines of the other
devices. This would make all transitions for the DMA events internal τ-transitions, which
is not correct: when we remove the whole device, these transitions will never be executed
anymore. Transitions that are never taken are different from transitions that become
internal. Internal transitions can always be taken, while these transitions will never be
taken.
Figure 6.5 shows a MI Directory state machine partially abstracted and reduced with
the first pattern, ‘Removal of non-essential actions’. If we removed the
inv_sendCacheInvalidate action from the transition between the state M and the state
M_DRD,M_DWR (see footnote 2), the transition for the DMA_READ and DMA_WRITE events would
become an internal τ-transition. The reduction would then combine these states, and
immediately allow triggering of the PUTX event from the combined M,M_DRD,M_DWR state. In
this case this looks harmless, as the M state also allows a transition on the PUTX event;
in other state machine configurations, however, new transitions would be added to the M
state. These new transitions violate the soundness property of our patterns.
The correct way is to remove all impacted transitions directly from the coherence state
machines. In Figure 6.5 we would remove all transitions for the events DMA_READ and
DMA_WRITE.
2 This is how the reduction algorithm outputs merged states.
Figure 6.5: Partially abstracted state machine with DMA transitions (states I, IM, M, the
merged state M_DRD,M_DWR and the merged state MI,M_DRDI,M_DWRI; the transition from M to
M_DRD,M_DWR is labelled DMA_READ||DMA_WRITE [inv_sendCacheInvalidate])
The relevant transitions can be identified by inspecting the receiving queue handling
functions. For instance, the following code handles the incoming DMA queue for the Di-
rectory node:
in_port(dmaRequestQueue_in, DMARequestMsg, dmaRequestToDir) {
  if (dmaRequestQueue_in.isReady(clockEdge())) {
    peek(dmaRequestQueue_in, DMARequestMsg) {
      TBE tbe := TBEs[in_msg.LineAddress];
      if (in_msg.Type == DMARequestType:READ) {
        trigger(Event:DMA_READ, in_msg.LineAddress, tbe);
      } else if (in_msg.Type == DMARequestType:WRITE) {
        trigger(Event:DMA_WRITE, in_msg.LineAddress, tbe);
      } else {
        error("Invalid message");
      }
    }
  }
}
Listing 6.4: DMA queue processing in Directory node
This code defines that on a DMARequestType:READ request a transition with
Event:DMA_READ is to be handled. For the DMARequestType:WRITE type message a transi-
tion with an Event:DMA_WRITE is to be handled.
When we remove all transitions with these events, we effectively remove all behaviour
linked to the DMA messages, without introducing additional internal τ-transitions. With
this pattern we can also remove transitions that use network-impacting actions, which is
not possible with the ‘Removal of non-essential actions’ pattern.
We need to make sure that the events are only issued for messages coming from the
device to remove. If the event definitions are shared between various devices we cannot
remove the transitions, as these are also re-used by the other device.
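The event-based removal, including the check that an event is only triggered from the ports of the removed device, can be sketched in Python. The `event_sources` table and the `remove_device_events` helper are assumptions made for this illustration; in practice the table would have to be derived from the in_port definitions such as the one in Listing 6.4.

```python
def remove_device_events(transitions, event_sources, device_ports):
    """Remove transitions whose event is triggered only from the removed device's ports.

    Returns the kept transitions, the removed events, and the events that are
    shared with other devices and must therefore stay.
    """
    removable = {e for e, ports in event_sources.items()
                 if ports and ports <= device_ports}
    shared = {e for e, ports in event_sources.items()
              if ports & device_ports and not ports <= device_ports}
    kept = [t for t in transitions if t[1] not in removable]
    return kept, removable, shared

# Hypothetical data: which in_port triggers which event.
event_sources = {
    "DMA_READ": {"dmaRequestQueue_in"},
    "DMA_WRITE": {"dmaRequestQueue_in"},
    "GETX": {"requestQueue_in"},
}
transitions = [
    ("M", "GETX", ("f_forwardRequest",), "M"),
    ("M", "DMA_READ", ("inv_sendCacheInvalidate",), "M_DRD"),
    ("M", "DMA_WRITE", ("inv_sendCacheInvalidate",), "M_DWR"),
]
kept, removed, shared = remove_device_events(
    transitions, event_sources, {"dmaRequestQueue_in"})
```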
To retain the Strongly Connected Component (SCC) property of the cache coherence
state machines, we also need to remove all states and transitions that have become
unreachable by removing these transitions. For the example in Figure 6.5 we would remove
the M_DRD and M_DWR states, and the PUTX transitions starting from these states. Effectively
we are removing the traces that were only implemented to support the removed component.
These are the states that have no incoming transitions apart from the ones we just
removed, and all transitions that originate from these removed states.
The states can be identified automatically by either searching for states without incoming
transitions, or by using an algorithm as used in Section 7.2.4 (see footnote 3). The latter
algorithm also covers the situation that multiple states to remove form a loop, analogous
to the discussion in Section 7.2.4.
Additionally, we need to remove all states that become end-states because of the re-
moved transitions. This can be done in a similar way, for instance by using Tarjan’s strongly
connected components algorithm, and selecting the SCC containing a stable state, in case
transition states form a loop.
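As a sketch of this clean-up step, the Python code below keeps only the states that can both be reached from a chosen stable state and can reach it again, which is exactly the SCC containing that state. It uses plain forward and backward reachability instead of Tarjan's algorithm, and the example transitions are a simplified, assumed reading of Figure 6.5 after the DMA transitions were removed.

```python
def reachable(start, edges):
    """Return all states reachable from start in the adjacency mapping edges."""
    seen, stack = {start}, [start]
    while stack:
        state = stack.pop()
        for nxt in edges.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def prune_to_scc(transitions, stable):
    """Keep only states and transitions inside the SCC that contains the stable state."""
    forward, backward = {}, {}
    for src, _event, _actions, dst in transitions:
        forward.setdefault(src, set()).add(dst)
        backward.setdefault(dst, set()).add(src)
    keep = reachable(stable, forward) & reachable(stable, backward)
    return [t for t in transitions if t[0] in keep and t[3] in keep], keep

# Assumed, simplified transitions; M_DRD lost its only incoming transition.
transitions = [
    ("I", "GETX", ("qf_queueMemoryFetchRequest",), "IM"),
    ("IM", "Memory_Data", ("d_sendData",), "M"),
    ("M", "GETX", ("f_forwardRequest",), "M"),
    ("M", "PUTX", ("l_writeDataToMemory",), "MI"),
    ("MI", "Memory_Ack", ("l_sendWriteBackAck",), "I"),
    ("M_DRD", "PUTX", ("v_allocateTBEFromRequestNet",), "MI"),
]
pruned, kept_states = prune_to_scc(transitions, "I")
```

The forward pass drops states that became unreachable, the backward pass drops states that became end-states.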
3 In this case we can use any non-transient state as initial state. The Invalid state would
be a good starting state.
IMPACT ON THE MINIMIZED STATE MACHINE
The state machines of the cache coherence devices that are still in scope will be
simplified, by removing all traces that were solely implemented to handle the device now
removed. States and transitions may be removed as a result of the removal of these traces;
this will result in a more minimal state machine.
DEADLOCK BEHAVIOUR
We should only remove components for which we have determined that they are not essen-
tial for our model checking. When we apply this pattern we remove traces from the state
machine that are only used to support the removed component. Therefore, removing these
components will not change the deadlock behaviour that is relevant to us, and the pattern
is sound.
6.3. HAVE WE FOUND ALL ABSTRACTION PATTERNS?
With the three abstractions we described in this paper we cover many cases where we
remove non-essential features from the cache coherence state machines. The question now
arises whether we can remove all non-essential features with these three patterns.
The answer depends on the state machine we investigate. What we want to do with
these patterns is to retain only the essential complexity of the state machine: the
complexity needed to implement the features that are relevant to us, the network-impacting
features. What we want to remove is the incidental or accidental complexity, introduced
by the state machine designers to implement features.
As an example, when we look at the MESI L1 cache state machine the concept of similar
states arises from the design choice to implement load/stores, optional/mandatory and
data versus Instruction fetches as separate, parallel transitions. In the simpler MI state
machine we see no instance of this pattern. An alternative implementation could have put
this complexity in the action implementation itself, which would have led to a simpler state
machine, but more complex and less granular actions.
The design decision to handle the complexity this way gives rise to the emergence of
this abstraction pattern, which effectively reverses this decision and removes (abstracts)
this complexity again.
What we want to do with these patterns is to remove the complexity introduced by the
design decisions the state machine designers made. This means that the number of
abstraction patterns is perhaps only bounded by the ingenuity of the designers when
implementing these protocols.
As more complex state machines can lead to more ingenuity in the found solutions, our
expectation is indeed that we might find more patterns when we investigate more complex
and different coherence protocols and protocol implementations. As an example, in the
discussion on the reduction of the Layer 2 cache controller in Section 8.2 we give evidence
of yet another pattern. We did not document this pattern because of time constraints, oth-
erwise we would have described four abstraction patterns in this paper.
7. STATE MACHINE REDUCTION ALGORITHMS
While the abstraction process can be seen as removing unnecessary features, the reduction
process is responsible for the minimization of the state machine to the most simple form,
while preserving all relevant features. So what the reduction step tries to do is to morph
the state machine into a similar but simpler state machine, with the same characteristics.
This new state machine is supposed to be less complex, but can still be used in place of the
original, abstracted state machine.
In this chapter we first determine the kind of equivalence we are interested in, after
which we describe the method we use to implement the state machine reduction, while
preserving this kind of equivalence.
As an input we have cache coherence state machines that, while abstracted, are still fi-
nite, single Strongly Connected Component (SCC) state machines without initial- and end
states. Every state can still be reached from every other state, either directly or indirectly.
For all transitions both event labels and action lists are defined, but transition equality is
only determined by the action lists. As these event labels normally identify the incoming
message types, this means that we will handle incoming message types handled with the
same network-impacting actions as equal.
Empty action lists are seen as internal τ-transitions from the originating to the destina-
tion state, so a transition that is not externally visible. While these internal τ-transitions
are normally not present in the provided, original cache coherence state machines, they
can occur in the state machines after abstraction. They are created when all actions are
removed from a transition.
Because of these τ-transitions and the fact that we can get equal transitions emanat-
ing from a state for different message types, we must treat the state machines as Non-
deterministic Finite Automata (NFA).
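The two sources of non-determinism named above can be detected mechanically. The following Python sketch assumes transitions are given as (source, event, action list, target) tuples; the helper name and the example states are hypothetical.

```python
def nondeterministic_states(transitions):
    """Return states with a tau-transition or with equal action lists to different targets."""
    targets_by_label = {}
    result = set()
    for src, _event, actions, dst in transitions:
        if actions == ():  # empty action list: internal tau-transition
            result.add(src)
            continue
        targets = targets_by_label.setdefault((src, actions), set())
        targets.add(dst)
        if len(targets) > 1:  # same visible label leads to different states
            result.add(src)
    return result

transitions = [
    ("A", "E1", ("x",), "B"),
    ("A", "E2", ("x",), "C"),  # equal action list, different target
    ("B", "E3", (), "A"),      # tau-transition
    ("C", "E4", ("y",), "A"),
]
nondet = nondeterministic_states(transitions)
```

Note that the event labels are ignored on purpose, as transition equality is determined by the action lists only.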
We can also state that we are only interested in the externally visible behaviour of cache
coherence state machines. The model checking that these state machines are going to be
used for has no requirements on the internal structure of the state machine.
This chapter describes a number of existing techniques to reduce abstracted state
machines. We also describe two contributions that were not found in existing literature.
The first is the proof that trace equivalence and (bi-)simulation equivalence coincide for
DFAs (Deterministic Finite Automata), as we describe in Section 7.1.4. For this we did find
theorems in [B+08, p.511,p.514] stating that this is the case for AP-deterministic systems,
but without proof and not for the LTSs (Labelled Transition Systems) we use.
The second new technique we provide is the removal of superfluous initial states cre-
ated by the standard NFA to DFA conversion algorithm, as explained in Section 7.2.4.
7.1. STATE MACHINE EQUIVALENCE
What we first need to determine is how we are going to define when two state machines are
equal. We need this to be able to define what ‘preserving relevant features’ exactly means,
so when the abstracted sparse state machines and the final reduced state machines are still
considered equal for our purposes.
To determine what we can safely remove from a state machine while still preserving
this kind of equality, we first need to determine the kind of equality we are interested in.
What we are interested in is behavioural equivalence. We want to treat the cache coherence
state machines as black box implementations, and want to have equality based on
the traces that the state machine accepts. When the original state machine allows or
generates some trace, we want the reduced state machine to allow or generate that exact
same trace.
We will first look at trace equivalence and structural equivalence, after which we will
look at bisimilarity type equality. For much more detail on these kinds of equivalence,
see [San12].
7.1.1. TRACE EQUIVALENCE
Trace equivalence checks if two state machines can handle the same traces, so the same
sequences of state transitions. In our context, a trace is defined as a sequence of
event/action lists that are handled in order.
For instance in Figure 7.1 it can be seen that both state machines accept the string ‘ab’.
This makes them trace equivalent. However, trace equivalence does not compare for which
sequences of state transitions a trace is not accepted. In the left state machine, when the
transition to state P4 is taken on accepting ‘a’, the subsequent ‘b’ leads to a deadlock.
These state machines are not behaviourally equivalent when you look at them from a system
execution perspective, as only one can experience this deadlock.
However, in our case this does not present problems, as our state machines are defined as
having no end states. Each alternative choice in the state machine just leads to different
behaviour, and never to a deadlock.
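The point about Figure 7.1 can be reproduced with a bounded trace comparison in Python. This is only a sketch: it compares traces up to a fixed depth instead of deciding trace equivalence in general, and the transition lists are our assumed reading of the two state machines in the figure.

```python
def traces(transitions, start, depth):
    """Return all label sequences of length <= depth that can be executed from start."""
    found = {()}
    frontier = {((), start)}
    for _ in range(depth):
        next_frontier = set()
        for trace, state in frontier:
            for src, label, dst in transitions:
                if src == state:
                    next_frontier.add((trace + (label,), dst))
        found |= {trace for trace, _state in next_frontier}
        frontier = next_frontier
    return found

# Assumed reading of Figure 7.1: the left machine can also move to P4 on 'a'.
left = [("P1", "a", "P2"), ("P2", "b", "P3"), ("P1", "a", "P4")]
right = [("Q1", "a", "Q2"), ("Q2", "b", "Q3")]

left_traces = traces(left, "P1", 3)
right_traces = traces(right, "Q1", 3)
same_traces = left_traces == right_traces
```

Both machines yield the same set of traces, although only the left one can get stuck in P4 after ‘a’; the comparison does not see this difference.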
On first look, one would say that trace equivalence is a very good match for our state ma-
chine behaviour (as even the name suggests it). However, we run into two problems with
this kind of equality definition:
• Trace equivalence is difficult to determine. You would have to compare all traces
possible in the original state machine for acceptance in the reduced state machine,
and vice versa. This is not made easier by the fact that in principle our traces have
infinite length;
• No state machine reduction algorithms are readily available that reduce according to
trace equivalence.
Figure 7.1: Trace equivalent (left: states P1, P2, P3, P4 with transitions labelled a, a, b;
right: states Q1, Q2, Q3 with transitions labelled a, b)
Figure 7.2: Trace equivalent, but not structural equivalent (left: states P1, P2 with
transitions labelled a, b; right: states Q1, Q2, Q3 with transitions labelled a, b, a)
In the next sections we show that there still are efficient methods to reduce according
to trace equivalence, but these rely on specific attributes of the state machine itself.
7.1.2. STRUCTURAL EQUIVALENCE
Another type of equivalence on state machines is structural equivalence or isomorphism.
Two state machines are structurally equivalent when for all states and state transitions a
bijection, a reciprocal 1-1 relation, can be established.
This is a very strong kind of equivalence. It takes not only the behaviour of a state
machine into account, but also the structure of the state machines.
What this can lead to can be seen in Figure 7.2. These two state machines are trace equiva-
lent, but not structurally equivalent. Using an equality relation like this would mean that a
reduction from the Q state machine to the P state machine would not be allowed, while the
behaviour is exactly equal. With these restrictions not many reductions would be possible.
So we need an equality relation less strong than structural equivalence.
7.1.3. (BI-)SIMULATION EQUIVALENCE
The equivalence relation we could be looking for is something like bisimulation equivalence.
Bisimulation equivalence is also called bisimilarity, and if we use this equivalence relation
we are looking for reduced, simpler state machines that are bisimilar with the original state
machines.
According to [San12], ‘Bisimilarity is accepted as the finest extensional behavioural equiv-
alence one would like to impose on processes’. This means that all behaviour of the state
machines is taken into account when comparing for bisimilarity.
Bisimilarity in this respect is stronger than trace equivalence, but weaker than structural
equivalence. Additionally it provides equivalence on the behaviour of the state machine,
which is the kind of equivalence that we want. So while in Figure 7.2 the state machines are
structurally not equivalent, they are bisimilar. Bisimilarity is denoted with the ‘tilde’ sign ∼,
as can be seen between the two state machines in Figure 7.2.
Figure 7.3: Merging of similar follow up states (left: states P1 to P5 with transitions labelled [µ1], [µ1], [µ2] and [µ3]; right: states Q1, Q2,3, Q4 and Q5 with transitions labelled [µ1], [µ2] and [µ3])
Informally, two state machines are bisimulation equivalent when for all transitions from
all states a similar state can be found with the same transitions, and in which the destina-
tion states of the transitions in both state machines also have the same property. In the
other direction the same relation is to be found.
Formally this can be defined as:
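A relation $\mathcal{R}$ on processes is a bisimulation if whenever $P \mathrel{\mathcal{R}} Q$:
$$
\begin{aligned}
&\forall \mu, P' \text{ such that } P \xrightarrow{\mu} P', \text{ then } \exists Q' \text{ such that } Q \xrightarrow{\mu} Q' \text{ and } P' \mathrel{\mathcal{R}} Q'; \\
&\forall \mu, Q' \text{ such that } Q \xrightarrow{\mu} Q', \text{ then } \exists P' \text{ such that } P \xrightarrow{\mu} P' \text{ and } P' \mathrel{\mathcal{R}} Q';
\end{aligned}
\tag{7.1}
$$
The bisimulation relation $\mathcal{R}$ is a set of pairs of states, which have the same transitions,
and for which the destinations of the transitions are also in the relation.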
In Figure 7.2 for the state P1 there are two states that fulfill this relation, namely Q1 and
Q3, and vice versa the two Q states map also on P1. For P2 the state that is bisimulation
equal is Q2, which also applies in the other direction. So the bisimulation relation for this
figure is:
R = {(P1,Q1), (P1,Q3), (P2,Q2)}
One of the nice things about bisimilarity is that it can be checked very efficiently, much
more efficiently than trace equivalence. Where you need to check all traces in a state machine to determine trace equivalence, you only need to set up the bisimulation relation to
determine bisimulation equivalence. Bisimilarity implies trace equivalence: all bisimilar
state machines are also trace equivalent. Another good thing is that efficient reduction algorithms exist, also based on the above equations.
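To make the check of Equation 7.1 concrete, the following is a minimal Python sketch (not the Haskell tool of this thesis) that tests whether a given set of state pairs is a bisimulation. The dictionary encoding and the function name `is_bisimulation` are illustrative choices, and the transition structure of Figure 7.2 is an assumed reading of the figure: P1 -a-> P2 -b-> P1 and Q1 -a-> Q2 -b-> Q3 -a-> Q2.

```python
# Transitions as {state: {(label, destination), ...}}.
# ASSUMPTION: Figure 7.2 is read as P1 -a-> P2 -b-> P1 and
# Q1 -a-> Q2 -b-> Q3 -a-> Q2.
P = {"P1": {("a", "P2")}, "P2": {("b", "P1")}}
Q = {"Q1": {("a", "Q2")}, "Q2": {("b", "Q3")}, "Q3": {("a", "Q2")}}


def is_bisimulation(rel, left, right):
    """Check both clauses of Equation 7.1 for every pair in rel."""
    for p, q in rel:
        p_moves = left.get(p, set())
        q_moves = right.get(q, set())
        # Every move of p must be matched by q, ending in a related pair.
        for mu, p2 in p_moves:
            if not any(m == mu and (p2, q2) in rel for m, q2 in q_moves):
                return False
        # Every move of q must be matched by p, ending in a related pair.
        for mu, q2 in q_moves:
            if not any(m == mu and (p2, q2) in rel for m, p2 in p_moves):
                return False
    return True


R = {("P1", "Q1"), ("P1", "Q3"), ("P2", "Q2")}
R_BROKEN = {("P1", "Q1"), ("P2", "Q2")}  # the pair (P1, Q3) is missing
```

Under this reading, `is_bisimulation(R, P, Q)` holds for the relation given above, while the relation without the pair (P1, Q3) fails, because the b-transition of P2 then leads to an unrelated pair.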
BISIMULATION EQUIVALENCE OR BISIMILARITY
According to Equation 7.1 the state machines in Figure 7.3 are not bisimilar. This can
easily be seen, as no state can be found in the left state machine that has the same transitions as Q2,3. However, we would still like to have this kind of optimization, as it removes
one state and one transition from the state machine. So for our purposes standard (strong)
bisimilarity is still too strong. Also, standard bisimilarity does not work with the
internal τ-transitions which we have in our abstracted state machines.
Figure 7.4: equivalent DFAs (state machine P with states Pa, Pb, Pc and Pd, related by ∼ to state machine Q with states Qa and Qb; transitions labelled a and b)
One option to solve this would be to use a weaker bisimilarity relation; many such relations exist. A good example would be branching bisimilarity, which is ‘an equivalence relation on processes that
preserves the branching structure of processes’ [Bas96]. Branching bisimilarity also allows
τ-actions. It could very well be that a branching bisimulation variant could be applicable
to our solution. We did not pursue this direction as in our case there are simpler methods
to reduce our state machines. These methods are detailed in the next sections.
SIMULATION EQUIVALENCE
Another type of equality we could use is a ‘similarity’ relation. This relation is denoted as ≤ or
≥ and is the bisimulation relation, but only in one direction. This is a weaker form of equivalence than bisimilarity. When looking again at Figure 7.1 the Q state machine is similar to
P (Q ≤ P), but not vice versa. We can create a relation R from all Q states to the P states,
but have no Q equivalent for P4.
When we combine this simulation relation in two directions, P ≤ Q and Q ≤ P, we get
simulation equivalence P ≤≥ Q. As an example, P and Q in Figure 7.1 are not simulation
equivalent, as we cannot create a relation in both directions.
Simulation equivalence is weaker than bisimulation equivalence, because a separate relation may be used for each direction. However, just like trace equivalence, standard simulation equivalence does not respect deadlock, as stated in [San12, p. 168]. Therefore, we
cannot use it.¹
7.1.4. EQUALITY BETWEEN TRACE EQUIVALENCE AND (BI-)SIMILARITY
What we would like to do is to compare using trace equivalence, but reduce the state machine using strong bisimulation reduction algorithms. We would then have the needed form
of equality, while also being able to leverage the simple bisimulation based algorithms to
reduce a state machine. For AP-deterministic state machines this is possible, as [B+08,
p.579] states, “when we have an AP-deterministic state machine, bisimulation, simulation
and trace equivalence coincide”.
This means that if we can provide the same proof for deterministic Labeled Transition
System (LTS) state machines, we can use (strong) bisimulation equivalence and reduction
algorithms to reduce the resulting state machine. The resulting minimized state machine
will not necessarily be (bi-)simulation equivalent to the original state machine, only to the
already abstracted variant that we converted into a Deterministic Finite Automaton (DFA)
with the NFA to DFA step.
¹ As with bisimulation equivalence, there are also other types of simulation equivalence that could be more
appropriate for us. We do not pursue these alternatives as we assume these are more complex than the
solution below.
Bisimulation reduction algorithms reduce the size of state machines while retaining
bisimulation equivalence. If we can make use of the above, the algorithms will also retain
trace equivalence, as the two equivalences then coincide.
The question is now, can we make the above statement also for the type of action-oriented
LTSs that the cache coherence state machines are? This is indeed the case as proven below,
because for DFAs we can also determine that these equivalence relations are equal.
DFAs are deterministic action-oriented LTSs which have only a single transition for each
action starting from a state. For any action you never have a choice in destination states,
as for each action/label at most a single transition can be chosen from a given state. Also, DFAs
do not have internal τ-transitions.
This is in contrast to non-deterministic finite automata (NFAs), which can have both
choices and internal τ-transitions. The abstracted cache coherence state machines we
want to reduce are NFAs.
Definition We define a DFA as:²
$$\mathit{DFA} = \langle Q, L, T \rangle$$
For our purposes a DFA is defined as a combination of a set of states Q, a set of labels L
and a transition relation T.
For these DFAs we need to define what we exactly mean with traces and paths through
a DFA.
Definition Let p be a (state of a) labeled transition system, with:
1. $\mathit{traces}(p) =_{def} \{\sigma \in L^* \mid p \stackrel{\sigma}{\Longrightarrow}\}$
2. $\mathit{paths}(p) =_{def} \{(\sigma, \pi) \mid \sigma \in L^*, \pi \in Q^*, \sigma = [\mu_1, \mu_2, \cdots, \mu_m], \pi = [p, p_1, \cdots, p_m], p \xrightarrow{\mu_1} p_1 \xrightarrow{\mu_2} \cdots \xrightarrow{\mu_m} p_m\}$
A single trace is an element of the Kleene closure $L^*$ over the set of labels. This is a list of
labels, of length 0 or more, which can be executed by the DFA starting from state p.
A single path is defined as a list of transition labels, combined with the list of the states the
path visits. This results in a pair of lists, which can be executed by the DFA starting from state
p.
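The two definitions can be illustrated with a small Python sketch. As the state machines have cycles and therefore infinitely many traces, the sketch only enumerates paths up to a maximum length. The encoding, the names `paths` and `traces`, and the `max_len` bound are illustrative assumptions. The Q machine of Figure 7.4 is read here as Qa -a-> Qb -b-> Qa, and the left machine of Figure 7.3 as P1 with two [µ1] transitions, to P2 and P3.

```python
# Transitions as {(state, label): set of destination states}.
# ASSUMPTION: Figure 7.4 (Q) is read as Qa -a-> Qb -b-> Qa.
DFA_Q = {("Qa", "a"): {"Qb"}, ("Qb", "b"): {"Qa"}}
# ASSUMPTION: Figure 7.3 (left) is read as an NFA with a choice on mu1.
NFA_P = {
    ("P1", "mu1"): {"P2", "P3"},
    ("P2", "mu2"): {"P4"},
    ("P3", "mu3"): {"P5"},
}


def paths(lts, p, max_len):
    """All (sigma, pi) pairs starting in p with at most max_len transitions."""
    result = [((), (p,))]
    frontier = [((), (p,))]
    for _ in range(max_len):
        next_frontier = []
        for sigma, pi in frontier:
            for (src, label), dests in lts.items():
                if src == pi[-1]:
                    for dest in sorted(dests):
                        next_frontier.append((sigma + (label,), pi + (dest,)))
        result.extend(next_frontier)
        frontier = next_frontier
    return result


def traces(lts, p, max_len):
    """The label lists of paths(p): a bounded slice of traces(p)."""
    return {sigma for sigma, _ in paths(lts, p, max_len)}
```

For the DFA every trace has exactly one path, so `paths` and `traces` return the same number of elements. For the NFA the single trace [µ1] already has two paths, which is the difference the proof of Lemma 1 relies on.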
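If we have two DFAs, what we want to prove is that they are trace equivalent if and only if they
are also simulation equivalent and bisimulation equivalent:
Lemma 1
$$\mathit{traces}(\mathit{DFA}_1) = \mathit{traces}(\mathit{DFA}_2) \longleftrightarrow \mathit{DFA}_1 \mathrel{\leq\geq} \mathit{DFA}_2 \longleftrightarrow \mathit{DFA}_1 \sim \mathit{DFA}_2$$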
Proof 1 We ignore the concept of initial and terminal states here, as these are not present
in the state machines we use, but they are also not relevant for this proof. The starting state
of a trace can be any state in the LTS.
² Adapted from [Tre08, p.9]; the labels L are equivalent to action lists.
For DFAs we can see that for each trace ∈ traces(p) there is a unique path ∈ paths(p). This is
because of the lack of choices and internal τ-transitions. If we compare two DFAs, named $\mathit{DFA}_1$
and $\mathit{DFA}_2$, and they are trace equivalent, then by definition there is for each trace in $\mathit{DFA}_1$ a
corresponding, equal trace in $\mathit{DFA}_2$. This means that these DFAs also have equal paths,
where only the state names themselves differ. If we create a relation $\mathcal{N}$ which maps the
states between the DFAs, we can state the following:
$$\mathit{traces}(\mathit{DFA}_1) = \mathit{traces}(\mathit{DFA}_2) \longleftrightarrow \exists \mathcal{N} : \mathit{paths}(\mathcal{N}(\mathit{DFA}_1)) = \mathit{paths}(\mathit{DFA}_2)$$
For the DFAs in Figure 7.4 the relation $\mathcal{N}$ would be:
$$\mathcal{N} = \{(P_a, Q_a), (P_b, Q_b), (P_c, Q_a), (P_d, Q_b)\}$$
As we make this definition for all traces in the DFA, and additionally for all paths, the path
equivalence covers all outgoing transitions for all states. For all states in P, there is a state in
Q with the exact same transitions, to destination states with the same equality. Otherwise
not all paths could be equal.
This path equivalence is exactly the equality relation that we cannot make for NFAs, as
the choices and internal τ-transitions can cause multiple paths for a trace.
An example can be seen in the left part of Figure 7.3, where we have two paths with the same
[µ1] trace. The right state machine is the DFA variant, which does not have these different
paths. Also, not shown in this example, internal τ-transitions could have been part of
traces in the NFA on the left; these would not have been visible in the trace, but could have
resulted in extra states and transitions in the paths. These extra states and transitions, and
the choices one can make, mean that any trace in an NFA can result in multiple paths.
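Simulation equivalence ≤≥ for two LTSs is defined as follows:
$$
\begin{aligned}
&\forall \mu, P' \text{ such that } P \xrightarrow{\mu} P', \text{ then } \exists Q' \text{ such that } Q \xrightarrow{\mu} Q' \text{ and } P' \mathrel{\mathcal{R}} Q'; \\
&\forall \mu, Q' \text{ such that } Q \xrightarrow{\mu} Q', \text{ then } \exists P' \text{ such that } P \xrightarrow{\mu} P' \text{ and } P' \mathrel{\mathcal{S}} Q';
\end{aligned}
\tag{7.2}
$$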
What this definition describes is that when P ≤≥ Q, then for all states in P, we can find a state in
Q with the exact same transitions. If we have two trace equivalent DFAs, we know that this
is the case, as it is shown above that these have equal paths, and so the state
relation $\mathcal{N}$ can be used as $\mathcal{R}$ and $\mathcal{S}$ in the similarity relation. So this means:
$$\mathit{DFA}_1 \mathrel{\leq\geq} \mathit{DFA}_2 \longleftrightarrow \mathit{traces}(\mathit{DFA}_1) = \mathit{traces}(\mathit{DFA}_2) \tag{7.3}$$
As $\mathcal{R}$ equals $\mathcal{S}$ (as both equal $\mathcal{N}$, the trace equivalence mapping above), the similarity
relation for ≤≥ above is equal to the bisimilarity relation ∼, as defined in Equation 7.1.
Therefore we can state that in our case similarity and bisimilarity coincide:
$$\mathit{DFA}_1 \mathrel{\leq\geq} \mathit{DFA}_2 \longleftrightarrow \mathit{DFA}_1 \sim \mathit{DFA}_2 \tag{7.4}$$
By combining Equations 7.3 and 7.4 we can conclude that for deterministic action oriented state machines (so DFAs), trace equivalence, simulation equivalence and bisimilarity
coincide. This means that we have proven Lemma 1.
This result is interesting, because it means that if we convert our NFA coherence state
machines to a DFA, we can then use standard strong bisimulation reduction to minimize
this DFA, and this minimal DFA remains trace equivalent to the original NFA. Additionally,
all bisimulation defined attributes of this state machine, such as eventual deadlock behaviour,
are retained.
As this result is only valid for DFAs, we first need to convert our NFA state machine with internal states to a trace equivalent DFA. Luckily, standard algorithms are available to assist
us in that. How this is done is the subject of the next section.
7.2. NFA TO DFA CONVERSION
For the conversion from an NFA to a DFA we use the algorithm as described in detail by
Linz in [Lin11, p.59], which also provides proofs of the correctness of the algorithm steps.
Linz’s algorithm basically creates combination states every time a choice is made in the NFA,
or an internal transition is made. This way it constructs a trace equivalent DFA without any
choices and internal τ-transitions. This is the well-known subset construction algorithm,
combined with the also standard τ-reduction algorithm.
The DFA resulting from this conversion can have more states, in theory up to $2^{|\mathit{NFA}|}$
states, where $|\mathit{NFA}|$ is the number of states in the original NFA. However, this is dependent on
the topology of the NFA, and we assume the topology of the state machines we work with
is such that in the end, after we also reduced the state machine, we will have a smaller state
machine as a result.
7.2.1. SUBSET CONSTRUCTION
The subset construction (also called powerset construction) is an algorithm which is com-
posed of the following steps (taken from [Lin11, p.59]):
1. Create a graph $G_D$ with vertex $\{q_0\}$. Identify this vertex as the initial vertex.
2. Repeat the following steps until no more edges are missing:
(a) Take any vertex $\{q_i, q_j, \ldots, q_k\}$ of $G_D$ that has no outgoing edge for some $a \in \Sigma$.
(b) Compute $\delta^*_N(q_i, a), \delta^*_N(q_j, a), \ldots, \delta^*_N(q_k, a)$.
$\delta^*_N(q_i, a)$ is the transition function which computes the destination state set $\{q_l\}$
for $q_i$ and $a$ as: $\delta^*_N(q_i, a) = \{q_l \mid q_i \xrightarrow{a} q_l\}$.
(c) If $\delta^*_N(q_i, a) \cup \delta^*_N(q_j, a) \cup \cdots \cup \delta^*_N(q_k, a) = \{q_l, q_m, \ldots, q_n\}$, create a vertex for $G_D$
labeled $\{q_l, q_m, \ldots, q_n\}$ if it does not already exist.
(d) Add to $G_D$ an edge from $\{q_i, q_j, \ldots, q_k\}$ to $\{q_l, q_m, \ldots, q_n\}$ and label it with $a$.
3. Every state of $G_D$ whose label contains any $q_f \in F_N$ is identified as a final vertex.
4. If $M_N$ accepts $\lambda$, the vertex $\{q_0\}$ in $G_D$ is also made a final vertex.
For us the only interesting step is Step 2. We have no initial states, so we cannot create
a vertex $\{q_0\}$. We also have no final states, so the last two steps of the list are not in scope
for us either. The issues encountered with determining the state to start with if we have no DFA
initial state are covered in Section 7.2.4.
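Steps 2(b) and 2(c) can be sketched in Python as follows. This is only an illustration of a single step of the subset construction, not the implementation used in this thesis. The names `delta_n` and `subset_step` are hypothetical, and the example NFA is an assumed reading of the left state machine of Figure 7.3.

```python
def delta_n(nfa, q, a):
    """delta*_N(q, a): the NFA states reachable from q with one a-transition."""
    return frozenset(nfa.get((q, a), ()))


def subset_step(nfa, vertex, a):
    """Steps 2(b) and 2(c): the union of delta*_N over all members of vertex."""
    dest = frozenset()
    for q in vertex:
        dest |= delta_n(nfa, q, a)
    return dest


# ASSUMPTION: left machine of Figure 7.3, where P1 has a choice on mu1.
NFA = {
    ("P1", "mu1"): {"P2", "P3"},
    ("P2", "mu2"): {"P4"},
    ("P3", "mu3"): {"P5"},
}
```

Applying `subset_step` to the vertex {P1} and label mu1 gives the combined vertex {P2, P3}, which corresponds to the state labelled Q2,3 in the right part of the figure.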
7.2.2. TAU-A REDUCTION
The tau-a reduction algorithm used to remove all τ-transitions is combined with Step 2(b)
above. When computing $\delta^*_N(q_i, a)$, we also take into account all states which are reachable from $q_i$ with τ-transitions and which have an ‘a’ transition themselves. This results in the following extension
to the transition function:
$$\delta^*_N(q_i, a) = \{q_l \mid q_i \xrightarrow{\tau^* a} q_l\} \text{ where } \tau^* \text{ is 0 or more } \tau \text{ transitions}$$
As already mentioned, in the cache coherence state machines the τ-transitions are transitions with empty action lists.
The result of this delta calculation is a set of 0 or more NFA destination states for
transitions with action a, which together with the result of the calculation of the other NFA
states in the vertex $\{q_i, q_j, \ldots, q_k\}$ result in the new destination state vertex $\{q_l, q_m, \ldots, q_n\}$.
As a side note, the current version of the tool used to validate the simplification process
completely removes the intermediate states and transitions from the state machine. This is
the default implementation of the above algorithm, but must be taken into account when
validating the result.
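The extended transition function can be sketched in Python as shown below. The encoding of a τ-transition as an empty tuple follows the remark that τ-transitions have empty action lists, but is otherwise an illustrative assumption, as are the function names and the small example NFA.

```python
TAU = ()  # hypothetical encoding: a tau-transition has an empty action list


def tau_closure(nfa, q):
    """All states reachable from q using zero or more tau-transitions."""
    seen, todo = {q}, [q]
    while todo:
        state = todo.pop()
        for nxt in nfa.get((state, TAU), ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def delta_tau_a(nfa, q, a):
    """Extended delta*_N(q, a): the destinations of q --tau* a--> q_l."""
    dests = set()
    for state in tau_closure(nfa, q):
        dests |= set(nfa.get((state, a), ()))
    return dests


# Hypothetical NFA: S0 --tau--> S1 --a--> S2, and S0 --a--> S3.
NFA = {
    ("S0", TAU): {"S1"},
    ("S1", ("a",)): {"S2"},
    ("S0", ("a",)): {"S3"},
}
```

In this example the a-transition of S1 is found through the τ-transition from S0, so the destination set for S0 and action a contains both S2 and S3.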
7.2.3. HASKELL IMPLEMENTATION
The Haskell implementation of this algorithm is as follows. We keep a to-do list of states
still to process, and seed it with the start state of the first transition. This start state we take
as the ‘random’ initial state, as defined in Section 7.2.4.
As long as the to-do list of states is not empty, we repeat the following actions:
1. We take the state s from the top of the list.
2. We do the tau-a reduction for s. To be able to automate the algorithm, we slightly
modify the tau-a reduction algorithm. We calculate all states reachable from s using
τ-transitions, using:
$$\mathit{startStates}(s) = \{p \mid s \xrightarrow{\tau^*} p\}$$
3. We determine all actions originating from these states, using:
$$\mathit{actions}(\mathit{startStates}) = \{a \mid p \xrightarrow{a} \ldots \wedge p \in \mathit{startStates}\}$$
4. We determine for each action a ∈ actions the transition by combining the NFA destination states:
$$\mathit{destStates}(a, \mathit{startStates}) = \{q \mid p \xrightarrow{a} q \wedge p \in \mathit{startStates}\}$$
5. We create for each action a new edge or DfaTransition that links s to the new destState,
created by combining the destStates just calculated for that action. We add this new
DfaTransition value to the DFA to create.
6. If the new destState is not already seen (in the to-do list or already present in the
created DFA), we add the destState to the to-do list for later processing.
When this process ends with an empty to-do list, we have gone through all states we can
reach from the original initial state, and have created a DFA, trace-equivalent with the orig-
inal NFA, as proven by Linz in [Lin11, p.59].
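The six steps above can be sketched as a worklist loop. The sketch below is written in Python for illustration and is not the Haskell code of the tool; the triple encoding of transitions, the `TAU` marker and the function names are assumptions. DFA states are represented as frozensets of NFA states.

```python
TAU = None  # hypothetical marker for a transition with an empty action list


def tau_closure(transitions, states):
    """startStates: everything reachable from `states` via tau-transitions."""
    seen, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for src, act, dst in transitions:
            if src == s and act is TAU and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return frozenset(seen)


def nfa_to_dfa(transitions):
    """Worklist version of steps 1 to 6; returns a list of DFA transitions."""
    first = frozenset({transitions[0][0]})  # start state of first transition
    todo, seen, dfa = [first], {first}, []
    while todo:
        s = todo.pop(0)                                    # step 1
        start = tau_closure(transitions, s)                # step 2
        actions = sorted({a for p, a, _ in transitions
                          if p in start and a is not TAU})  # step 3
        for a in actions:
            dest = frozenset(q for p, act, q in transitions
                             if p in start and act == a)   # step 4
            dfa.append((s, a, dest))                       # step 5
            if dest not in seen:                           # step 6
                seen.add(dest)
                todo.append(dest)
    return dfa


# ASSUMPTION: left machine of Figure 7.3 as a list of (source, action, dest).
FIG_7_3 = [("P1", "mu1", "P2"), ("P1", "mu1", "P3"),
           ("P2", "mu2", "P4"), ("P3", "mu3", "P5")]
```

For this input the result contains three DFA transitions, with the combined state {P2, P3} as the single destination of the [µ1] transition, which matches the right part of Figure 7.3.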
Figure 7.5: Initial state dependent NFA to DFA conversion, starting with P1 (NFA with states P1 to P5; resulting DFA with states P1, P2, P3, P2,4 and P5,1; transitions labelled a and b)
7.2.4. IMPACT SELECTION OF INITIAL STATE
What we needed to add to use the algorithm described in [Lin11, p.59] was an initial state.
As we have not defined an initial state, we will start with a random state, but this will have
an impact on the eventual DFA state machine. This can be seen in Figure 7.3. The right state
machine is in fact the DFA version of the left state machine. You can see that the states
P2 and P3 are subsumed in a new state, here labelled Q2,3. Now suppose these are parts of
larger state machines. It could be that if you start the subset construction algorithm with P2
you would add P2 also to the DFA state machine as an extra state.
These extra states are easy to identify. In the SCC cache coherence state machines we analyse we can always reach each state from each other state, directly or indirectly. However, if
we use P2 from Figure 7.3 as the initial state, it could be that this is not a destination of a
transition from any state in the rest of the DFA, because these transitions would lead to the
combined state labelled Q2,3.³
This is also the reason that a Cache Coherence Invalid state is not chosen as the default
initial state. It would suffer the same problem as any other state, as this state too could be
combined during the subset construction. Choosing a ‘random’ state will result in the same
outcome, and makes the algorithm input simpler.
To correct this startup behaviour, we have developed a correction algorithm to find
these extra initial states, to remove them again from the created DFA.
The question is now how can we determine these extra states and transitions. As an
example, see Figure 7.5. In this figure we have calculated the resulting DFA (on the right)
when picking state P1 as the initial state. As a counter example, had we started with P3 as
the initial state, we would not have the states P1 and P2 in the resulting DFA. This means
that P1 and P2 are extra intermediate states, which we must remove from the result. As they
both have incoming transitions, this shows that a naïve algorithm that would only check for
states without incoming transitions will fail.
For this state machine one would be tempted to think that a strategy to be ‘smart’ when
selecting a good starting state from the NFA would work. A counter example showing that
a smart selection would not work either is shown in Figure 7.6. There we see a variant of the previous figure, slightly modified with an extra state. As a result the ideal resulting
DFA does not re-use any original NFA states anymore, and therefore the strategy to do
a ‘smart’ selection of the initial state is bound to fail.
³ If it did get an incoming transition from the rest of the DFA, it would be because of a single non-combined
transition from another state in the DFA itself. Then it would obviously rightfully be part of the final DFA,
but this would have shown in Figure 7.3 as an extra incoming transition.
Figure 7.6: NFA to DFA conversion without remaining states (NFA with states P1 to P6; resulting DFA with states P3,6, P2,4 and P5,1; transitions labelled a and b)
However, we know that the final DFA is an SCC, therefore all states which have to become
part of the resulting DFA are directly or indirectly reachable from each other state. This
is because the original NFA is an SCC, and so should the DFA be to retain trace equivalence.
This is why we have the one-way transitions from the ‘intermediate’ part to the SCC part of
the DFA. See for an example of this Figure 7.5, between P2 and P3. We know that the last
state we construct is by definition part of the SCC. Therefore, when we determine the states
reachable from this last state, we will get only the states in the SCC. This way we will reliably
be able to separate the SCC from the intermediate states. This works regardless of the
state with which we started the original NFA to DFA algorithm, as we will show in the rest of this
section.
Let $\mathit{DFA}_{rough} = (Q_{rough}, L, \delta_{rough})$ denote the original DFA with the extra initial states.
We can then find $\mathit{DFA}_{final} = (Q_{final}, L, \delta_{final})$, which is the SCC part of the calculated
$\mathit{DFA}_{rough}$ with the intermediate starting states and their transitions removed.
State $q_{rough,last}$ is the last state added to $\mathit{DFA}_{rough}$, and we can use it to construct
$Q_{final}$:
$$Q_{final} = \{q' \mid \exists \sigma \in L^* : q_{rough,last} \stackrel{\sigma}{\Longrightarrow} q'\}$$
The transition relation $\delta_{final}$ contains all transitions from $\delta_{rough}$ which originate from
a state in $Q_{final}$. Combining both $\delta_{final}$ and $Q_{final}$, we have $\mathit{DFA}_{final}$.
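The construction of $Q_{final}$ and $\delta_{final}$ amounts to a reachability search from the last state added. A minimal Python sketch, assuming that DFA transitions are stored as (source, action, destination) triples and using a hypothetical rough DFA with two startup states:

```python
def final_dfa(rough_transitions, last_state):
    """Keep only the part of DFA_rough reachable from the last state added."""
    q_final, todo = {last_state}, [last_state]
    while todo:
        s = todo.pop()
        for src, _, dst in rough_transitions:
            if src == s and dst not in q_final:
                q_final.add(dst)
                todo.append(dst)
    # delta_final: all transitions that originate from a state in Q_final.
    delta_final = [t for t in rough_transitions if t[0] in q_final]
    return q_final, delta_final


# Hypothetical DFA_rough: I1 and I2 are startup states that lead into the
# strongly connected part {A, B}; B is assumed to be the last state added.
ROUGH = [("I1", "a", "I2"), ("I2", "b", "A"),
         ("A", "a", "B"), ("B", "b", "A")]
```

In this example the startup states I1 and I2 have outgoing transitions into the SCC but cannot be reached from B, so they and their transitions are dropped.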
What we want to retain over all NFA to DFA conversion actions is that the accepted
traces must not change. We formalize this with the Lemmas below.
We want to retain trace equivalence when we convert the NFA to a DFA:
Lemma 2
$$\mathit{traces}(\mathit{NFA}_{orig}) = \mathit{traces}(\mathit{DFA}_{rough})$$
Proof 2 The equivalence between $\mathit{NFA}_{orig}$ and $\mathit{DFA}_{rough}$ is defined by the NFA to DFA
conversion algorithm. The proof for this algorithm can be found in [Lin11, p.59].
Similarly, we want to retain the trace equivalence when removing the additional states
in $\mathit{DFA}_{rough}$:
Lemma 3
$$\mathit{traces}(\mathit{DFA}_{rough}) = \mathit{traces}(\mathit{DFA}_{final})$$
Proof 3 $\mathit{DFA}_{final}$ is the SCC part of $\mathit{DFA}_{rough}$. Any difference between $\mathit{DFA}_{rough}$ and
$\mathit{DFA}_{final}$ is caused by the choice of the initial state, where each of these states (if not
in $\mathit{DFA}_{final}$) is also present in the SCC, but then combined with other NFA states. All
traces originating from the removed initial states (if any) can also be found in the resulting $\mathit{DFA}_{final}$, which is the SCC part of $\mathit{DFA}_{rough}$. The SCC transitions will emanate from
the combined NFA states.
Because of this, we can state that there are no extra traces in $\mathit{DFA}_{rough}$ that are not in
$\mathit{DFA}_{final}$. In the other direction, as $\mathit{DFA}_{final}$ is the SCC subset of $\mathit{DFA}_{rough}$, there are no
additional traces in $\mathit{DFA}_{final}$ that are not in $\mathit{DFA}_{rough}$. Therefore we can state that their traces
are equal, and $\mathit{traces}(\mathit{DFA}_{rough}) = \mathit{traces}(\mathit{DFA}_{final})$.
Combining Lemma 2 and Lemma 3 results in the following:
Corollary 1
$$\mathit{traces}(\mathit{NFA}_{orig}) = \mathit{traces}(\mathit{DFA}_{rough}) = \mathit{traces}(\mathit{DFA}_{final})$$
ALGORITHM TERMINATION
With this algorithm in place we are left with one question: does this initial state algorithm
terminate in all cases when a random initial state is chosen, given a finite state machine (NFA) in which all states can be reached from all other states? In other words, will
the eventual DFA always be constructed? The answer is positive, as the proof below shows:
Lemma 4 The algorithm for the removal of the extra initial startup states always terminates.
Proof 4 Suppose we have an NFA N, resulting in a DFA D after tau-a reduction and subset
construction, with optionally extra startup states. When the original NFA to DFA algorithm
reaches a state that is destined for D, it will subsequently only build states for D. We start
with the last state, which is therefore always part of D. As D is to be a single SCC, we can
reach any state from any other state, and so any state not reachable from the last state
is not part of D. The algorithm is such that we will never re-visit an already found or generated state, so after evaluating |D| states we will have constructed D fully. So the algorithm
will eventually always terminate with the creation of the DFA.
The initial state removal only visits states in D, and so is not impacted by any number of
initial startup states that are generated outside of D.
The conclusion is that even when we do not start the tau-a reduction and subset construction from a state in D, we will still create the final DFA after executing the algorithm for a
finite amount of time, proportional to |D|.
In normal cases we would expect to have only one or two extra startup states before a
DFA state is constructed.
7.3. STRONG BISIMULATION MINIMIZATION
After we have done the NFA to DFA conversion, we have the DFA needed to perform the
bisimulation reduction step. This reduction basically works by partitioning the set of DFA
states into partition blocks of equal states, where all states in a block have, for each action, transitions to
states in the same destination block. This partitioning must first start by creating
two or more partitions. This can for instance be done based on whether states are final or not
[Lin11, p.63] or by the labels for Kripke structures [B+08, p.476]. After this initial partitioning into partition blocks, these blocks are then further refined.
After executing the partitioning algorithm, when the DFA is completely refined, we have
an end result with blocks of states where all states have the same labeled transitions. All
transitions with the same label are to states in the same destination block. This closely
resembles the bisimilarity relation from Equation 7.1, where the transitions for states also
must be to ‘equal’ states. So in effect what we have created is a form of the bisimulation
relation $\mathcal{R}$.
This completely refined set of blocks can in turn be used to generate the minimized
DFA we would like to achieve, by replacing each block with a single state. The transitions
from the block are replaced by transitions from this state, and the destination state of the
transition is the state in which the destination block is transformed.
We will follow the algorithm as described in [Lin11, p.63], where we - as we have no final states - assume a demonic completion of all non-existing transitions for actions to a
single ‘virtual final state’. This virtual final state represents the deadlock that occurs when an
action that is not modeled is encountered; this allows us to start the partitioning algorithm. This
way we can use the algorithm described by Linz, and refer to the proof in that document.
See [Tre08, p.17] for an explanation of demonic completion.
Linz divides the algorithm in a mark and a reduce part, conforming to the partitioning
and merge steps described in the previous paragraphs. The mark and reduce algorithms
are explained in the following sections.
7.3.1. MARK PHASE
As described by Linz, the mark part consists of the following activities on the DFA $M = (Q, \Sigma, \delta, q_0, F)$:
1. Remove all inaccessible states;
2. Consider all pairs of states $(p, q)$. If $p \in F$ and $q \notin F$ or vice versa, mark the pair $(p, q)$
as distinguishable;
3. Repeat the following step until no previously unmarked pairs are marked. For all
pairs $(p, q)$ and all $a \in \Sigma$, compute $\delta(p, a) = p_a$ and $\delta(q, a) = q_a$. If the pair $(p_a, q_a)$ is
marked as distinguishable, then mark $(p, q)$ as distinguishable.
Step 1 is not used, as the output of the NFA to DFA transformation is an SCC
where all states are reachable from all other states. This obviously implies we have no inaccessible states. Step 2 we will do, but we will use our demonic completion to create the
‘virtual final state’ as described above.
We will use the transition for an action to this virtual final state, so in effect whether the action
exists for the state, as the differentiator. How this is implemented is described below.
SPLIT ON ACTIONS
For our code we will separate Step 3 into two sub-steps. We will first distinguish all states
depending on whether they have a transition labeled with a certain action list. We use the
fact that not all states have transitions for all action lists as the starting point. So we
take the first action list, and partition all states according to whether they have a transition
defined for that action list. This we repeat for all other action lists present. This should
result in partitions of states with equal action lists.
So we partition DFA M according to the following equation, where actions(s) returns the
actions which have transitions emanating from state s:
$$\mathit{partitionOnActions}(M) = \{\{s_j \in M \mid \mathit{actions}(s_i) = \mathit{actions}(s_j)\} \mid s_i \in M\}$$
This formula partitions DFA M into a set of partition blocks, which are themselves sets
of states that all have equal action lists.
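As an illustration, this first sub-step can be sketched in Python by grouping the states on their set of outgoing actions. The dictionary encoding and the function names are illustrative assumptions, and the P machine of Figure 7.4 is read here as Pa -a-> Pb -b-> Pc -a-> Pd -b-> Pa.

```python
def actions(dfa, s):
    """The set of action labels with a transition leaving state s."""
    return frozenset(a for (src, a) in dfa if src == s)


def partition_on_actions(dfa, states):
    """Group the states that have transitions for exactly the same actions."""
    blocks = {}
    for s in states:
        blocks.setdefault(actions(dfa, s), set()).add(s)
    return {frozenset(block) for block in blocks.values()}


# ASSUMPTION: P machine of Figure 7.4, stored as {(state, action): dest}.
DFA_P = {
    ("Pa", "a"): "Pb",
    ("Pb", "b"): "Pc",
    ("Pc", "a"): "Pd",
    ("Pd", "b"): "Pa",
}
```

Under this reading, `partition_on_actions(DFA_P, ["Pa", "Pb", "Pc", "Pd"])` gives the two blocks {Pa, Pc} and {Pb, Pd}: the states with only an a-transition and the states with only a b-transition.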
SPLIT ON DESTINATION PARTITION
The second sub-step divides the partitions we get again, based on the destinations of these
actions. We know that all states in a partition have transitions for the same actions, as
we just partitioned them on this in the previous step. We split each partition block further,
based on the partition block each transition destination for an action is part of. This implementation works as follows:
When partition block p contains more than one state, we take the first state $s_1$ from the partition p, and for each action emanating from this state we search the destination state $s_{1,a}$,
so that $s_1 \xrightarrow{a} s_{1,a}$ with $a \in \mathit{actions}(s_1)$. We do the same for the other states $s_i$ in p, where
$2 \leq i \leq |p|$. Their destinations are calculated as $s_i \xrightarrow{a} s_{i,a}$.
We then can do the following split, based on the partition of the destination states:
$$p_{in} = \{s_i \mid \forall a \in \mathit{actions}(s_1) : \mathit{partition}(s_{i,a}) = \mathit{partition}(s_{1,a})\}$$
$$p_{out} = \{s_i \mid \exists a \in \mathit{actions}(s_1) : \mathit{partition}(s_{i,a}) \neq \mathit{partition}(s_{1,a})\}$$
The partition $p_{in}$ now contains the states with the same destination partition per action
for all states, while $p_{out}$ contains the states that have destinations to other partitions for at
least one action. We repeat this split action again for $p_{out}$, as the states in this partition can
also still be eligible to be split further.
We repeat this second sub-step for all partitions, until the
full partition set becomes stable, i.e. no partitions are split anymore.
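The split into $p_{in}$ and $p_{out}$ and the repetition until the partition set is stable can be sketched in Python as follows. The sketch assumes that all states in a block already have transitions for the same actions, as guaranteed by the previous sub-step. Function names and the example DFA are hypothetical; the alphabetically first state of a block is used as $s_1$.

```python
def block_of(partition, s):
    """The partition block that contains state s."""
    return next(b for b in partition if s in b)


def split_block(dfa, partition, block):
    """Split one block into p_in (agrees with its first state) and p_out."""
    first = min(block)
    acts = [a for (src, a) in dfa if src == first]

    def agrees(s):
        return all(block_of(partition, dfa[(s, a)])
                   == block_of(partition, dfa[(first, a)]) for a in acts)

    p_in = frozenset(s for s in block if agrees(s))
    return p_in, block - p_in


def refine(dfa, partition):
    """Repeat the destination split until no block is split any more."""
    partition = set(partition)
    changed = True
    while changed:
        changed = False
        for block in list(partition):
            p_in, p_out = split_block(dfa, partition, block)
            if p_out:
                partition.remove(block)
                partition |= {p_in, p_out}
                changed = True
                break  # restart with the updated partition
    return partition


# Hypothetical DFA: A and B both have an x-transition, but to different blocks.
DFA = {("A", "x"): "C", ("B", "x"): "D", ("C", "y"): "A", ("D", "z"): "A"}
START = {frozenset({"A", "B"}), frozenset({"C"}), frozenset({"D"})}
```

Here the block {A, B} is split, because the x-transitions of A and B end in different blocks. Every split increases the number of blocks, which is bounded by the number of states, so the loop ends.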
7.3.2. REDUCE PHASE
Following Linz also for the reduce phase, given DFA $M = (Q, \Sigma, \delta, q_0, F)$, we will construct a
reduced DFA $\hat{M} = (\hat{Q}, \Sigma, \hat{\delta}, \hat{q}_0, \hat{F})$ as follows:
1. Use the procedure mark to generate the equivalence classes, say $\{q_i, q_j, \ldots, q_k\}$, as described above;
2. For each set $\{q_i, q_j, \ldots, q_k\}$ of such indistinguishable states, create a state labeled $ij \ldots k$
for $\hat{M}$;
3. For each transition rule of M of the form $\delta(q_r, a) = q_p$, find the partition to which
$q_r$ and $q_p$ belong. If $q_r \in \{q_i, q_j, \ldots, q_k\}$ and $q_p \in \{q_l, q_m, \ldots, q_n\}$ we add to $\hat{\delta}$ a rule
$\hat{\delta}(ij \ldots k, a) = lm \ldots n$;
4. The initial state $\hat{q}_0$ is that state of $\hat{M}$ which contains the 0;⁴
5. $\hat{F}$ is the set of all states whose label contains i such that $q_i \in F$.
For our use we can ignore the last two steps, as we have no initial or final states. Step 1 is
described in the previous section; what remains is walking through the list of transitions
we have, and converting their start and end states to simple states, one state representing
each partition block. This creates a DFA with states for all partition blocks, effectively de-duplicating the bisimilar states of the state machine.
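Steps 2 and 3 of the reduce phase can be sketched in Python as a renaming of all transitions. The naming of a block by joining its sorted state names is an illustrative choice, and the P machine of Figure 7.4 is again read as Pa -a-> Pb -b-> Pc -a-> Pd -b-> Pa.

```python
def reduce_dfa(dfa, partition):
    """Replace every partition block by one state and rewrite the transitions."""
    # Step 2: one new state per block, named after the states it contains.
    name = {s: ",".join(sorted(block)) for block in partition for s in block}
    # Step 3: map the start and end state of every transition to its block.
    return {(name[src], a): name[dst] for (src, a), dst in dfa.items()}


# ASSUMPTION: P machine of Figure 7.4, stored as {(state, action): dest}.
DFA_P = {
    ("Pa", "a"): "Pb",
    ("Pb", "b"): "Pc",
    ("Pc", "a"): "Pd",
    ("Pd", "b"): "Pa",
}
BLOCKS = [{"Pa", "Pc"}, {"Pb", "Pd"}]
```

With these two blocks the four transitions collapse into two, giving a two-state DFA with the same shape as the Q machine of Figure 7.4. Duplicate transitions disappear automatically because the result is a dictionary keyed on (state, action).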
7.4. DO WE GENERATE THE MOST MINIMAL SOLUTION?
We have proven that we create a sound and complete solution in the preceding sections.
However, the question remains whether we generate the optimal, minimal solution.
Our hypothesis is that this could be the case, because of the combination of τ-reduction
and subset construction steps on the one hand, and the strong bisimulation reduction on
the other hand.
The τ-reduction and subset construction steps create a DFA, which can be seen as the
semantically simplest way to implement the traces already present in the original NFA.
However, there is a cost involved, as these algorithms can create a structurally complex
solution, with potentially up to $2^{|N|}$ states.
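The growth in the number of states can be illustrated with a small sketch of the textbook subset construction. It assumes a τ-free NFA with a designated start state, which differs from the state machines in this thesis (they have no initial state), so it only illustrates the cost and is not the procedure used by the tool. The example NFA is hypothetical: it has three states and yields four reachable DFA states, below the bound of 2^3 = 8.

```python
def subset_construction(nfa_delta, start, alphabet):
    """Return the reachable DFA states and the DFA transition function.

    nfa_delta maps (state, symbol) to a set of successor states.
    Each DFA state is a frozenset of NFA states.
    """
    start_set = frozenset([start])
    seen = {start_set}
    todo = [start_set]
    dfa_delta = {}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            target = frozenset(
                t for s in current for t in nfa_delta.get((s, symbol), ())
            )
            if not target:
                continue  # no successor for this symbol
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return seen, dfa_delta


# Hypothetical NFA with a nondeterministic choice on 'a' in state 0.
example_nfa = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2}, (1, "b"): {2},
}
dfa_states, dfa_delta = subset_construction(example_nfa, 0, ["a", "b"])
```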
On the other hand, the strong bisimulation reduction algorithm should find the struc-
turally most simple solution while retaining the state machine semantics. The combination
of these three algorithms will then result in the semantically and structurally most simple
DFA.
We leave the proof (or disproof ) of the conjecture that this is the minimal solution as
future research, as there could be NFAs or differently constructed DFAs that are even sim-
pler. For NFAs we have documented evidence for a potentially simpler state machine with
an internal τ-transition in Section 8.2.
⁴ This is the exact wording from Linz; we assume the ‘0’ stands for the initial state of the DFA.
8
EXPERIMENTAL RESULTS
This chapter describes the results we obtained when applying the patterns and algorithms described in Chapter 6 and Chapter 7, respectively. We have investigated two cache coherence protocols, the MI_example protocol and the MESI protocol, as provided in the gem5 distribution.
The results were obtained by examining the code as it existed on August 26, 2014 ¹. To replicate the results, the codebase version from this date should be used.
8.1. THE MI PROTOCOL
Figures 8.2, 8.3 and 8.4 show the original state machines of the gem5 simulator (gem5) MI cache coherence protocol. Figures 8.5 and 8.6 show the minimized state machines after processing. The patterns were used with the MI definitions as found in Appendix A.1. Given its small initial size, we did not attempt to minimize the DMA coherence device state machine.
The reduction results we obtained for the MI protocol minimization are:
Table 8.1: MI minimization results
                     Original                Minimized
State Machine     States  Transitions     States  Transitions
MI Cache               7           43          6           10
MI Directory          10           44          5            9
MI DMA                 3            4          -            -
totals                20           91         11           19
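The totals row of Table 8.1 can be re-derived with a few lines of Python. The sketch below only uses the numbers from the table; the helper names column_totals and reduction are hypothetical. Note that the minimized totals do not include the DMA controller, so the original and minimized totals do not cover the same set of state machines; a per-machine comparison is therefore more informative.

```python
# Rows of Table 8.1: (original states, original transitions,
#                     minimized states, minimized transitions).
# The DMA controller was not minimized, so its minimized counts are None.
table_8_1 = {
    "MI Cache": (7, 43, 6, 10),
    "MI Directory": (10, 44, 5, 9),
    "MI DMA": (3, 4, None, None),
}


def column_totals(rows):
    """Sum each column, skipping entries that are None."""
    return tuple(
        sum(row[i] for row in rows.values() if row[i] is not None)
        for i in range(4)
    )


def reduction(original, minimized):
    """Relative reduction as a fraction of the original count."""
    return 1 - minimized / original


totals = column_totals(table_8_1)
cache_transition_reduction = reduction(43, 10)  # MI Cache transitions
```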
By applying our minimization process, we obtained the results shown in Table 8.1. These results were obtained by applying the abstraction pattern ‘Removal of non-essential actions’ to the state machines, as described in Section 6.2.1. As can be seen in Figure 8.6, we could reduce the MI Directory state machine by at least two additional transitions and one additional state when applying the ‘Remove non-essential devices’ pattern, as described in Section 6.2.3. This pattern would then be used to remove the transitions associated with the events DMA_READ and DMA_WRITE. In the MI state machines we could not find opportunities to apply the ‘Merge similar actions’ pattern.
¹ This date was chosen because actions used in the supplied JSON state machine definition files were removed by commits on November 26 of that year. It is the date of the last stable code version that is in sync with the supplied JSON files. It is assumed that the JSON files were generated from this version. See also http://repo.gem5.org/gem5
The directory state machine implements the functionality to maintain the ownership of the cache-lines. This is done with two actions, e_ownerIsRequestor and c_clearOwner, which record and clear the owner of a cache-line. The recorded owner is subsequently used as the destination for forwarded GETX requests when the directory controller is not the owner of the cache-line.
We obtained the original transition counts by counting the transitions in the JSON state machine definition files that were provided ². The original states were counted in the state_declaration structure in the SLICC source code files. We completely removed the MI DMA controller.
8.2. THE MESI PROTOCOL
Figure 8.1 shows the gem5 MESI component and network configuration. The original state machines for the gem5 MESI protocol are shown in Figure 8.7 for the Layer 1 cache, in Figure 8.8 for the Layer 2 cache, and in Figure 8.9 for the Directory controller. The DMA controller has a state machine similar to that of the MI protocol in Figure 8.4.
As with the MI protocol, these results were obtained by applying the single abstraction
pattern ‘Removal of non-essential actions’ to the state machines, as described in Section
6.2.1. We again did not attempt to minimize the DMA device for the same reason.
[Figure 8.1: gem5 MESI component configuration. Components shown: Core #n, L1 Cache Controller, Data cache, Instr. Cache, L1 Request Network, L2 Cache Controller, 2nd Level Cache, L2 Request Network, Response Network, Directory Controller, DMA Controller, Main Memory.]
² The state lists on the http://www.m5sim.org website are not complete.
The reduction results we obtained for the MESI protocol minimization are:
Table 8.2: MESI minimization results
                     Original                Minimized
State Machine     States  Transitions     States  Transitions
MESI L1 Cache         15          146          8           43
MESI L2 Cache         17          155         14           45
MESI Directory        10           44          5            8
MESI DMA               3            4          -            -
totals                45          349         27           96
These results were obtained by executing the minimization process with the impacting
classifications as defined in Appendix A.2. We removed all actions with a classification of
‘No’ or ‘DMA’.
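The removal of actions by classification can be sketched as a simple filter. The following Python fragment is an illustration under assumptions, not the tool's implementation: the tuple layout of a transition, the function name strip_actions, the class label "Yes" and all state, event and action names in the example are hypothetical. Only the two removed classifications, ‘No’ and ‘DMA’, are taken from the text.

```python
REMOVED_CLASSES = {"No", "DMA"}


def strip_actions(transitions, classification):
    """Remove actions whose classification is 'No' or 'DMA'.

    transitions: list of (source, event, actions, destination) tuples.
    classification: maps an action name to its class; unknown actions are kept.
    """
    stripped = []
    for src, event, actions, dst in transitions:
        kept = [a for a in actions if classification.get(a) not in REMOVED_CLASSES]
        stripped.append((src, event, kept, dst))
    return stripped


# Hypothetical classification table and transitions.
example_classes = {"send_data": "Yes", "profile_hit": "No", "dma_ack": "DMA"}
example_input = [
    ("S0", "Request", ["send_data", "profile_hit"], "S1"),
    ("S1", "Done", ["dma_ack"], "S0"),
]
stripped = strip_actions(example_input, example_classes)
```

The second transition ends up with an empty action list; how such a transition is treated afterwards is outside the scope of this sketch.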
The Layer 1 cache controller state machine minimization outcome can be seen in Figure 8.10. One obvious feature of this state machine is the parallel transitions. These are caused by the fact that the L1 cache has a number of features that are similar from a network-impacting point of view, e.g. instruction versus data caches, and prefetches versus mandatory fetches. These have their own actions, as their non-networking impact is different. Applying the ‘Merge similar actions’ pattern from Section 6.2.2 would combine these transitions.
Another interesting feature of the reduced L1 cache state machine is that the Exclusive state and the Modified state seem not to be related. This is counterintuitive, as one of the features of the MESI protocol is that one can ‘silently’ move from the E to the M state. In the reduced state machines, the states and transitions that are removed because of the internal τ-transitions are really removed from the state machine. The E state is in reality the combined E and M state, where only one of the Inv transitions is unique to the E state. The others are combined transitions, and appear both for the E and M states.
This is an example where an NFA could lead to a more compact state machine. Leaving
the internal τ-transition in place would allow for all non-exclusive transitions now linked
to the E state to be removed, as these transitions could be re-used from the M state. This
would allow for an additional simplification of the state machine.
A future improvement to the tool could be to display the removed internal states and events on the transitions from which they were removed. This would give more insight into the simplification process and more potential to validate it.
The minimized Layer 2 cache controller state machine is shown in Figure 8.11.
Ideally we would like to further reduce this state machine by splitting it for Layer 1 and
Layer 2 network functionality. This would allow the separate investigation and modeling of
the two layers. In this implementation this is not possible by removing actions, as the ac-
tions to send response messages to the L1 cache controller are also used to send responses
to the Directory and DMA controller. We also cannot apply the ‘Remove non-essential de-
vices’ pattern from Section 6.2.3 here. When this pattern is used to remove device activities,
we remove full traces between stable states. This we do not want in this case, as the inter-
action with the L1 cache is combined with Directory controller cache interaction in single
traces.
As an example, when the L2 cache-line is in the Invalid (NP) state, the block is not present in this cache. When a GetX request is received from an L1 cache controller for this cache line, the L2 state machine sends out a fetch request to memory (a_issueFetchToMemory) and moves to the IM state. After receiving a Mem_Data event from the directory controller, the data is sent back with exclusive permission to the requesting L1 cache with the action ee_sendDataToGetXRequestor. The L2 cache state machine then moves to the MT_MB,SS_MB state, as seen in the reduced state machine in Figure 8.11. Only after the Exclusive Unblock response from the L1 cache is received will the state machine move to the stable ‘MT’ state.
We obviously cannot remove any transitions from this sequence, as this would break the functionality. What we want to do here instead is to turn the transitions triggered by events from the removed device into internal τ-transitions. This is an additional pattern, whose necessity is caused by the design and implementation decision to use a shared response network.
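A minimal sketch of such a hiding step is shown below. It assumes transitions are (source, event, destination) triples and relabels the events of a removed device as "tau" while keeping the transition itself, so the trace stays connected. The function name hide_events is hypothetical, the example follows the L2 sequence described above with the unreduced state name MT_MB, and the event spelling Exclusive_Unblock is an assumption; choosing the L1 cache as the hidden device is only for illustration.

```python
TAU = "tau"


def hide_events(transitions, hidden_events):
    """Relabel the given events as internal tau steps; keep all transitions."""
    return [
        (src, TAU if event in hidden_events else event, dst)
        for src, event, dst in transitions
    ]


# L2 trace from the example: NP -> IM -> MT_MB -> MT.
l2_trace = [
    ("NP", "L1_GETX", "IM"),
    ("IM", "Mem_Data", "MT_MB"),
    ("MT_MB", "Exclusive_Unblock", "MT"),
]
# Assumption for illustration: the L1 cache is the device being hidden.
hidden = hide_events(l2_trace, {"L1_GETX", "Exclusive_Unblock"})
```

In contrast to removing the transitions, the path from NP to MT remains intact; a subsequent τ-reduction could then merge the states connected by the hidden steps.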
The Directory controller again provides evidence for the applicability of the ‘Remove non-essential devices’ pattern from Section 6.2.3, to remove the transitions for the events DMA_READ||DMA_WRITE that are associated with the DMA controller.
In contrast to the MI protocol, in the MESI protocol the directory controller does not maintain the cache-line owner. This is delegated to the L2 cache controller. The L2 cache controller maintains a list of sharers, controlled with the actions marked with ‘Share’ in the L2 action table, Table A.5.
[Figure 8.2: Original gem5 MI cache controller state machine. States shown: I, II, IM, IS, M, MI, MII. Events shown: Load, Ifetch, Store, Replacement, Fwd_GETX, Inv, Data, Writeback_Ack, Writeback_Nack.]
[Figure 8.3: Original gem5 MI directory controller state machine. States shown: I, ID, ID_W, IM, M, MI, M_DRD, M_DRDI, M_DWR, M_DWRI. Events shown: GETX, GETS, PUTX, PUTX_NotOwner, DMA_READ, DMA_WRITE, Memory_Ack, Memory_Data.]
[Figure 8.4: Original gem5 MI DMA controller state machine. States shown: READY, BUSY_RD, BUSY_WR. Transition labels: ReadRequest [s_sendReadRequest, p_popRequestQueue]; WriteRequest [s_sendWriteRequest, p_popRequestQueue]; Data [d_dataCallback, p_popResponseQueue]; Ack [a_ackCallback, p_popResponseQueue].]
[Figure 8.5: Minimized gem5 MI cache controller state machine. States shown: M; I,II; MI; IM,IS; I,MI,MII; II. Transition labels (some occur more than once): Fwd_GETX [e_sendData, o_popForwardedRequestQueue]; Fwd_GETX [ee_sendDataFromTBE, o_popForwardedRequestQueue]; Inv||Replacement [b_issuePUT]; Inv||Writeback_Nack [o_popForwardedRequestQueue]; Inv||Writeback_Ack||Writeback_Nack [o_popForwardedRequestQueue]; Writeback_Nack [o_popForwardedRequestQueue]; Ifetch||Load||Store [a_issueRequest]; Data [n_popResponseQueue].]
[Figure 8.6: Minimized gem5 MI directory controller state machine. States shown: MI,M_DRDI,M_DWRI; I; IM; M; M_DRD,M_DWR. Transition labels (some occur more than once): Memory_Ack [l_sendWriteBackAck]; Memory_Data [d_sendData]; GETX [e_ownerIsRequestor, i_popIncomingRequestQueue]; GETX [f_forwardRequest, e_ownerIsRequestor, i_popIncomingRequestQueue]; PUTX [c_clearOwner, i_popIncomingRequestQueue]; PUTX_NotOwner [b_sendWriteBackNack, i_popIncomingRequestQueue]; DMA_READ||DMA_WRITE [inv_sendCacheInvalidate].]
[Figure 8.7: Original gem5 MESI Layer 1 Cache controller state machine. States shown: NP, I, S, E, M, IS, IM, SM, IS_I, M_I, SINK_WB_ACK, PF_IS, PF_IM, PF_SM, PF_IS_I.]
DE
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
S
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
Ac
k
[q_
up
dat
eA
ck,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
Ac
k_
all
[ct
_ex
clu
siv
eR
epl
ace
me
ntF
rom
TB
E,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
L1
_P
UT
X
[t_
sen
dW
BA
ck,
jj_
po
pL
1R
equ
est
Qu
eue
]
L1
_P
UT
X_
old
[t_
sen
dW
BA
ck,
jj_
po
pL
1R
equ
est
Qu
eue
]
ME
M_
Inv
[o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
L1
_G
ET
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_U
PG
RA
DE
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
S
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
Ac
k
[q_
up
dat
eA
ck,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
MT
_IB
WB
_D
ata
[m_
wr
ite
Da
taT
oC
ach
e,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
,
kd
_w
ake
Up
De
pen
den
ts]
WB
_D
ata
_cl
ean
[m_
wr
ite
Da
taT
oC
ach
e,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
,
kd
_w
ake
Up
De
pen
den
ts]
L1
_P
UT
X
[t_
sen
dW
BA
ck,
jj_
po
pL
1R
equ
est
Qu
eue
]
L1
_P
UT
X_
old
[t_
sen
dW
BA
ck,
jj_
po
pL
1R
equ
est
Qu
eue
]
L2
_R
epl
ace
me
nt
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L2
_R
epl
ace
me
nt_
cle
an
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
ME
M_
Inv
[zn
_re
cyc
leR
esp
on
seN
etw
ork
]
L1
_G
ET
S
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_U
PG
RA
DE
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
MT
_S
B
Un
blo
ck
[nn
u_
add
Sh
are
rFr
om
Un
blo
ck,
k_
po
pU
nb
loc
kQ
ueu
e,
kd
_w
ake
Up
De
pen
den
ts]
L1
_P
UT
X
[t_
sen
dW
BA
ck,
jj_
po
pL
1R
equ
est
Qu
eue
]
L1
_P
UT
X_
old
[t_
sen
dW
BA
ck,
jj_
po
pL
1R
equ
est
Qu
eue
]
L2
_R
epl
ace
me
nt
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L2
_R
epl
ace
me
nt_
cle
an
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
ME
M_
Inv
[zn
_re
cyc
leR
esp
on
seN
etw
ork
]
L1
_G
ET
S
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_U
PG
RA
DE
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
MT
L1
_P
UT
X
[ll_
cle
arS
har
ers
,
mr_
wr
ite
Da
taT
oC
ach
eFr
om
Re
qu
est
,
t_s
end
WB
Ac
k,
jj_
po
pL
1R
equ
est
Qu
eue
]
L1
_P
UT
X_
old
[t_
sen
dW
BA
ck,
jj_
po
pL
1R
equ
est
Qu
eue
]
L1
_G
ET
X
[b_
for
wa
rdR
equ
est
To
Ex
clu
siv
e,
uu
_p
rof
ile
Mi
ss,
set
_se
tM
RU
,
jj_
po
pL
1R
equ
est
Qu
eue
]
MT
_II
B
L1
_G
ET
S
[b_
for
wa
rdR
equ
est
To
Ex
clu
siv
e,
uu
_p
rof
ile
Mi
ss,
set
_se
tM
RU
,
jj_
po
pL
1R
equ
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[b_
for
wa
rdR
equ
est
To
Ex
clu
siv
e,
uu
_p
rof
ile
Mi
ss,
set
_se
tM
RU
,
jj_
po
pL
1R
equ
est
Qu
eue
]
MT
_I
L2
_R
epl
ace
me
nt
[i_
all
oca
teT
BE
,
f_s
end
Inv
To
Sh
are
rs,
rr_
dea
llo
cat
eL
2C
ach
eB
loc
k]
ME
M_
Inv
[i_
all
oca
teT
BE
,
f_s
end
Inv
To
Sh
are
rs,
rr_
dea
llo
cat
eL
2C
ach
eB
loc
k]
MC
T_
I
L2
_R
epl
ace
me
nt_
cle
an
[i_
all
oca
teT
BE
,
f_s
end
Inv
To
Sh
are
rs,
rr_
dea
llo
cat
eL
2C
ach
eB
loc
k]
Ex
clu
siv
e_U
nb
loc
k
[mm
u_
ma
rkE
xcl
usi
veF
rom
Un
blo
ck,
k_
po
pU
nb
loc
kQ
ueu
e,
kd
_w
ake
Up
De
pen
den
ts]L
2_
Re
pla
cem
ent
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L2
_R
epl
ace
me
nt_
cle
an
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
ME
M_
Inv
[zn
_re
cyc
leR
esp
on
seN
etw
ork
]
L1
_G
ET
S
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_U
PG
RA
DE
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X_
old
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
Ex
clu
siv
e_U
nb
loc
k
[mm
u_
ma
rkE
xcl
usi
veF
rom
Un
blo
ck,
k_
po
pU
nb
loc
kQ
ueu
e,
kd
_w
ake
Up
De
pen
den
ts]
L2
_R
epl
ace
me
nt
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L2
_R
epl
ace
me
nt_
cle
an
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
ME
M_
Inv
[zn
_re
cyc
leR
esp
on
seN
etw
ork
]
L1
_G
ET
S
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_U
PG
RA
DE
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X_
old
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
Un
blo
ck
[nn
u_
add
Sh
are
rFr
om
Un
blo
ck,
k_
po
pU
nb
loc
kQ
ueu
e]
WB
_D
ata
[m_
wr
ite
Da
taT
oC
ach
e,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
WB
_D
ata
_cl
ean
[m_
wr
ite
Da
taT
oC
ach
e,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
L2
_R
epl
ace
me
nt
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L2
_R
epl
ace
me
nt_
cle
an
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
ME
M_
Inv
[zn
_re
cyc
leR
esp
on
seN
etw
ork
]
L1
_G
ET
S
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_U
PG
RA
DE
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X_
old
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
WB
_D
ata
[qq
_w
rite
Da
taT
oT
BE
,
ct_
exc
lus
ive
Re
pla
cem
ent
Fro
mT
BE
,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
WB
_D
ata
_cl
ean
[ct
_ex
clu
siv
eR
epl
ace
me
ntF
rom
TB
E,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
Ac
k_
all
[ct
_ex
clu
siv
eR
epl
ace
me
ntF
rom
TB
E,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
ME
M_
Inv
[o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
L1
_G
ET
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_U
PG
RA
DE
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
S
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X_
old
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
WB
_D
ata
[qq
_w
rite
Da
taT
oT
BE
,
ct_
exc
lus
ive
Re
pla
cem
ent
Fro
mT
BE
,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
WB
_D
ata
_cl
ean
[c_
exc
lus
ive
Cle
anR
epl
ace
me
nt,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
Ac
k_
all
[c_
exc
lus
ive
Cle
anR
epl
ace
me
nt,
o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]M
EM
_In
v
[o_
po
pIn
com
ing
Re
spo
nse
Qu
eue
]
L1
_G
ET
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_U
PG
RA
DE
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
S
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_G
ET
_IN
ST
R
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
L1
_P
UT
X_
old
[zz
_st
all
An
dW
ait
L1
Re
qu
est
Qu
eue
]
Figure 8.8: Original gem5 MESI Layer 2 cache controller state machine
Figure 8.9: Original gem5 MESI directory controller state machine
Figure 8.10: Minimized gem5 MESI Layer 1 Cache controller state machine
Figure 8.11: Minimized gem5 MESI Layer 2 cache controller state machine
MI,M_DRDI,M_DWRI
I
Memory_Ack [aa_sendAck]
ID,ID_W,IM
DMA_READ||DMA_WRITE||Fetch [j_popIncomingRequestQueue]
DMA_READ||DMA_WRITE||Fetch [j_popIncomingRequestQueue]
M
Memory_Data [d_sendData]
Data [k_popIncomingResponseQueue]
CleanReplacement [a_sendAck, k_popIncomingResponseQueue]
M_DRD,M_DWR
DMA_READ||DMA_WRITE [j_popIncomingRequestQueue]
Data [k_popIncomingResponseQueue]
Figure 8.12: Minimized gem5 MESI directory controller state machine
9. CONCLUSIONS AND FUTURE WORK
We conclude this thesis by discussing our findings. We list the contributions our research
has made to the model checking of cache coherence protocols and to the methodical creation
of abstractions of Labelled Transition Systems. We then summarize the answers to the
research questions from Section 3.2 and suggest a number of areas for further research.
9.1. DISCUSSION
During our research we found that the minimization of state machines is best done in
two phases. The first phase, abstraction, removes all non-relevant features
from the state machines. To implement this we defined three patterns, which we used to
remove non-relevant features from the cache coherence state machines we investigated. This
resulted in an NFA with internal τ-transitions for state transitions from which all features
were removed, and equal transitions where all differing features were removed. All cache
coherence protocol state machines are a single Strongly Connected Component (SCC) without
initial and final states. The abstraction patterns retain this SCC structure and the lack
of initial and final states.
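The abstraction phase can be illustrated with a minimal Python sketch. It is not the pattern definition from Chapter 6: the tuple layout of a transition, the `TAU` marker and the choice to use the remaining action list as the transition label are assumptions made only for this illustration.

```python
from typing import FrozenSet, Iterable, List, Tuple

TAU = "tau"  # marker used in this sketch for an internal transition

# Assumed layout: (source state, event, actions, target state)
Transition = Tuple[str, str, Tuple[str, ...], str]
# Abstracted edge: (source state, label, target state)
Edge = Tuple[str, object, str]


def abstract(transitions: Iterable[Transition],
             relevant: FrozenSet[str]) -> List[Edge]:
    """Keep only the relevant actions of every transition.

    A transition that loses all of its actions becomes a TAU edge, and
    transitions that become identical are merged into a single edge.
    """
    edges = set()
    for src, _event, actions, dst in transitions:
        kept = tuple(a for a in actions if a in relevant)
        label = kept if kept else TAU
        edges.add((src, label, dst))  # the set merges equal edges
    return sorted(edges, key=repr)
```

With `relevant = frozenset({"send"})`, two transitions between the same pair of states that differ only in a non-relevant action collapse into one edge, and a transition that carries only non-relevant actions turns into a `TAU` edge.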
As noted under future work in Section 9.3, we did not pursue the identification of
patterns further. Other cache coherence state machines could reveal
additional abstraction patterns. In the MESI L2 cache coherence protocol we identified
an additional opportunity for a new abstraction pattern.
The second phase, reduction, consists of algorithmically shrinking the NFA state
machine by applying strong bisimulation reduction. This phase minimizes the state
machine without changing its traces, so its end result is trace equivalent
to the abstracted NFA produced by the first phase. This also implies that the
resulting state machine is a single SCC without initial and terminal states.
We convert the NFA state machine to a DFA state machine with a modified subset
construction algorithm. We modified the subset construction algorithm because we could only
find algorithms for NFA state machines with initial states, which are used to start the
algorithm. The modification we developed consists of an added post-processing step which
removes any artifacts introduced by the selection of the initial state.
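For reference, the textbook subset construction with τ-closure can be sketched as follows. This is a minimal sketch of the standard algorithm started from an arbitrarily chosen state; the post-processing step described above is not reproduced here, and the edge representation and the `TAU` marker are assumptions of the sketch.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

TAU = "tau"  # marker used in this sketch for an internal transition
Edge = Tuple[str, str, str]  # (source, label, target)
Dfa = Dict[FrozenSet[str], Dict[str, FrozenSet[str]]]


def tau_closure(states: Iterable[str], edges: List[Edge]) -> FrozenSet[str]:
    """All states reachable from `states` using only TAU edges."""
    seen: Set[str] = set(states)
    todo = deque(seen)
    while todo:
        state = todo.popleft()
        for src, label, dst in edges:
            if src == state and label == TAU and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return frozenset(seen)


def subset_construction(edges: List[Edge],
                        start: str) -> Tuple[FrozenSet[str], Dfa]:
    """Standard subset construction, begun at an arbitrarily chosen state."""
    labels = sorted({label for _, label, _ in edges if label != TAU})
    first = tau_closure([start], edges)
    dfa: Dfa = {}
    todo = deque([first])
    while todo:
        current = todo.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for label in labels:
            step = {dst for src, lab, dst in edges
                    if src in current and lab == label}
            if step:
                target = tau_closure(step, edges)
                dfa[current][label] = target
                todo.append(target)
    return first, dfa
```

The resulting DFA has sets of NFA states as its states and contains no `TAU` edges, which is the precondition for the bisimulation reduction that follows.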
After this subset construction step we have an SCC DFA to which we can apply strong
bisimulation reduction. Strong bisimulation reduction algorithms require that the state
machines do not have internal τ-transitions; we removed these during the subset
construction. We have proven that for DFAs trace equivalence and bisimulation equivalence
(bisimilarity) coincide. This means that two DFAs are trace equivalent iff they are
bisimulation equivalent. The bisimulation reduction algorithm retains bisimulation
equivalence, so according to the proof it also retains trace equivalence.
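Strong bisimulation reduction of a DFA without initial or final states can be sketched as a naive partition refinement. This is a minimal illustration, not the algorithm of Chapter 7; the dictionary representation (state to label to successor, with every successor also present as a key) is an assumption of the sketch.

```python
from typing import Dict, Hashable, List, Set

# Assumed layout: state -> label -> successor; every successor is also a key.
Dfa = Dict[Hashable, Dict[str, Hashable]]


def bisimulation_classes(dfa: Dfa) -> List[Set[Hashable]]:
    """Group the states of a DFA into strong-bisimulation classes."""
    block_of = {state: 0 for state in dfa}  # start with a single block
    while True:
        # Two states stay together only if they are in the same block and,
        # for every enabled label, move to the same block.
        signature = {
            s: (block_of[s],
                frozenset((label, block_of[t]) for label, t in dfa[s].items()))
            for s in dfa
        }
        ids: Dict[object, int] = {}
        refined = {s: ids.setdefault(signature[s], len(ids)) for s in dfa}
        if len(ids) == len(set(block_of.values())):
            break  # no block was split, the partition is stable
        block_of = refined
    blocks: Dict[int, Set[Hashable]] = {}
    for state, block in block_of.items():
        blocks.setdefault(block, set()).add(state)
    return list(blocks.values())
```

Each returned class can be collapsed into one state of the reduced machine. Because the input is deterministic, states in the same class also have the same traces.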
We have not investigated whether this combination of algorithms yields a minimal result.
The abstracted state machine could possibly be reduced to smaller state machines by applying
alternative algorithms, e.g. weak or branching bisimulation reduction.
9.2. ANSWERS TO RESEARCH QUESTIONS
In this section we provide answers for the research questions we devised at the start of our
research.
1. What patterns can be found that can be applied to reduce cache coherence state machines,
creating sound abstractions?
We found three patterns that can be used to abstract away unneeded features of the cache
coherence state machines, as described in Chapter 6. We also found evidence of additional
patterns. In addition, we found a sequence of algorithms that reduces the size of the
abstracted state machine. These algorithms are described in Chapter 7 and reduce the size
of the state machine without changing the state machine semantics.
2. Which state machine aspects determine the applicability of the minimization process?
The minimization process we determined relies on the cache coherence state machines
being Strongly Connected Components, without initial and terminal states.
The abstraction algorithms make use of the fact that the state machines have stable states,
interleaved with transitional states.
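The precondition that a state machine forms a single SCC can be checked mechanically. The sketch below is one possible check, assuming the machine is given as a list of (source, label, target) edges; it tests that one state reaches all others and is reached from all others.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, label, target)


def reachable(start: str, successors: Dict[str, Set[str]]) -> Set[str]:
    """States reachable from `start` by following `successors`."""
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for nxt in successors.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_single_scc(edges: List[Edge]) -> bool:
    """True if all states in `edges` form one strongly connected component."""
    states = {s for src, _, dst in edges for s in (src, dst)}
    if not states:
        return False
    forward: Dict[str, Set[str]] = {}
    backward: Dict[str, Set[str]] = {}
    for src, _, dst in edges:
        forward.setdefault(src, set()).add(dst)
        backward.setdefault(dst, set()).add(src)
    root = next(iter(states))
    return (reachable(root, forward) == states
            and reachable(root, backward) == states)
```

A machine that fails this check has states that cannot be re-entered or cannot be left, which is outside the class of state machines the minimization process relies on.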
3. Can programmatically implementable strategies be found when to apply which pattern?
The reduction algorithm can be fully automated. For the abstraction patterns the pattern
application can be automated. The features to abstract or remove must still be identified
by inspecting the implementation of the state machine that is to be minimized.
4. Can limits be found in the application of the minimization process?
We did not find any limits or restrictions on applying the abstraction patterns or reduction
algorithms.
5. How can the minimization process be best documented?
The reduction algorithm is mathematically described and proven. The abstraction patterns
are provided as named recipes, which define how to obtain the features that can be
removed from the state machines.
6. Do the patterns have an ordering, is the use of certain patterns enabled by the application
of other reduction patterns?
The abstraction patterns must be followed by the reduction algorithm. The reduction
algorithm reduces the size of the state machine based on internal τ-transitions and equal
transitions that appear when features are removed by the abstraction patterns.
The semantics-preserving reduction patterns are therefore enabled by the abstraction
patterns.
7. If the patterns have an ordering, can an optimal ordering be applied to obtain the best
result?
The abstraction patterns we have found can be applied in any order, as they all target
different types of features.
The reduction algorithms will only provide meaningful results when applied after the
abstraction patterns, as these rely on the removal of semantics by the abstraction patterns.
The main question that this research is trying to answer is the following:
Is it possible to create sound abstractions of selected gem5 cache coherence state machines by
programmatically applying state machine minimization patterns?
We have found that we can create minimized state machines for the MI and MESI protocols.
We have found no reasons to assume that we cannot apply the same patterns and
algorithms to other and more complex cache coherence protocols. Therefore we conclude
that it is indeed possible to create sound abstractions by automatically applying the
abstraction patterns and reduction algorithms we found.
9.3. FUTURE WORK
We have applied the abstraction patterns and reduction algorithms to the MI and MESI
protocols. Other protocols could also be targeted. This could further validate the
applicability of the abstraction algorithms. Analysing more complex protocols may also
allow for the identification of other abstraction patterns that do not express themselves in
the cache coherence state machines investigated so far.
With regard to the semantics-preserving reduction patterns, there could be algorithms
that provide more efficient solutions. Future research could focus on finding these more
efficient algorithms, e.g. weak or branching bisimilarity reduction algorithms.
We have focussed on the creation of abstractions to research the network impact of cache
coherence protocols. Other research could be targeted at abstractions for other aspects of
the cache coherence protocols, e.g. memory interfaces, the core interface or private data
cache configurations. Our results should also be usable for these research directions, as
the abstraction patterns and reduction algorithms are not limited to network interactions.
Creating abstractions for these other cache coherence features could also lead to the
identification of additional abstraction patterns.
The reduction algorithm we devised is useable for all state machines which are SCCs
and which have no initial and terminal states. Other research areas where these types of
state machines are used could benefit from this research.
A. CACHE COHERENCE PROTOCOL DETAILS
This appendix describes the information obtained and used during the analysis of the cache
coherence protocols. It describes the various components that are used in the protocol, the
queues that are used to communicate between these components, and the actions that are
executed on state transitions. More information on these protocols can be obtained from
the gem5 website; the URL to the protocol-specific information is also provided.
More information on the terminology can be found in [Sor+02].
A.1. MI PROTOCOL
The MI protocol is the simplest gem5 protocol, mainly used for educational purposes. The
protocol is implemented for three types of components: cache controllers, each connected
to a processing core; directory controllers, connected to memory; and DMA controllers,
connected to IO-related devices. These component types are detailed in the next sections.
A.1.1. CACHE CONTROLLER
The cache controller receives memory requests from cores and retrieves the associated
memory data block from the directory controllers. When another core or a DMA controller
requires the memory block, it is relinquished again. A block can also be relinquished when
the core requires the cache location for other purposes.
QUEUES
The input and output queues are defined in the source code with the in_port and out_port
functions, respectively. The following queues are defined for the cache controller:
1. forwardRequestNetwork_in: used to receive invalidates, GETX requests, and writeback
responses from the directory controller. These requests originate from other components.
2. responseNetwork_in: used to receive DATA packets over the network from the cache
coherence components.
3. mandatoryQueue_in: input queue for LOAD, STORE and IFetch memory requests from the core.
4. requestNetwork_out: used to issue GETX and PUTX requests to the directory controller.
5. responseNetwork_out: used to send DATA packets to the other cache coherence
components.
The queues that we are interested in are all the above queues, apart from the
mandatoryQueue_in queue. That queue is used to receive requests from the processor core.
STATES
The description of all states in this protocol can be found at:
http://www.m5sim.org/MI_example
ACTIONS
Table A.1: MI_example cache controller actions
Action name                 Impacting  Description
a_issueRequest              Yes        Issues a GETX to the directory controller over the requestNetwork_out queue.
b_issuePUT                  Yes        Issues a PUT request to the directory controller over the requestNetwork_out queue.
e_sendData                  Yes        Send data from the cache to the requestor over the responseNetwork_out queue.
ee_sendDataFromTBE          Yes        Send data from the TBE buffers to the requestor over the responseNetwork_out queue.
n_popResponseQueue          Yes        Pop the responseNetwork_in queue.
o_popForwardedRequestQueue  Yes        Pop the forwardRequestNetwork_in queue.
i_allocateL1CacheBlock      No         Allocate a cache block.
h_deallocateL1CacheBlock    No         Deallocate a cache block.
m_popMandatoryQueue         No         Pop the mandatory request queue mandatoryQueue_in.
p_profileMiss               No         Update gem5 statistics.
p_profileHit                No         Update gem5 statistics.
r_load_hit                  No         Notify sequencer the load completed.
rx_load_hit                 No         Notify the sequencer an external load completed.
s_store_hit                 No         Notify the sequencer a store completed.
sx_store_hit                No         Notify the sequencer an external store completed.
u_writeDataToCache          No         Write data to cache.
forward_eviction_to_cpu     No         Send eviction information to the processor.
v_allocateTBE               No         Allocate TBE buffer.
w_deallocateTBE             No         Deallocate TBE buffer.
x_copyDataFromCacheToTBE    No         Copy data from cache to TBE.
z_stall                     No         Do nothing.
The actions for this component are described in Table A.1. The impacting column indicates
whether the action is network-impacting (‘Yes’) or not network-impacting (‘No’).
A.1.2. DIRECTORY CONTROLLER
The directory controller manages the communication with the memory subsystem. It also
handles DMA write and read requests from the DMA component.
QUEUES
1. dmaRequestQueue_in: receives DMA READ and WRITE requests from the DMA controller.
2. requestQueue_in: incoming GETS, GETX and PUTX requests from cache controllers.
3. memQueue_in: incoming queue for memory read and writeback messages.
4. forwardNetwork_out: used to send INV, writeback ACKs, NACKs, and forward requests
to cache controllers.
5. responseNetwork_out: used to send DATA packets to the other cache coherence
components.
6. requestQueue_out: does not seem to be used; the queue definition is annotated with
‘For recycling requests’.
7. dmaResponseNetwork_out: used to send DATA and ACK messages.
STATES
The description of all states in this protocol can be found at:
http://www.m5sim.org/MI_example
ACTIONS
Table A.2: MI_example directory controller actions
Action name                         Impacting  Description
a_sendWriteBackAck                  Yes        Send writeback ack to requestor over the forwardNetwork_out queue.
l_sendWriteBackAck                  Yes        Send writeback ack to requestor over the forwardNetwork_out queue.
b_sendWriteBackNack                 Yes        Send writeback nack to requestor over the forwardNetwork_out queue.
f_forwardRequest                    Yes        Forward the request from the requestQueue_in queue to the forwardNetwork_out queue.
i_popIncomingRequestQueue           Yes        Pop the incoming request queue.
d_sendData                          Yes        Send data to the requestor over the responseNetwork_out queue.
inv_sendCacheInvalidate             DMA        Invalidate a cache block for DMA by sending an INV message over the forwardNetwork_out queue.
dr_sendDMAData                      DMA        Send data to DMA controller from directory over the dmaResponseNetwork_out queue.
drp_sendDMAData                     DMA        Send data to DMA controller from incoming PUTX over the dmaResponseNetwork_out queue.
da_sendDMAAck                       DMA        Send Ack to the DMA controller over the dmaResponseNetwork_out queue.
p_popIncomingDMARequestQueue        DMA        Pop incoming DMA queue.
c_clearOwner                        Share      Clear the owner field of a directory entry.
e_ownerIsRequestor                  Share      Set the requestor as the owner of the cache block.
v_allocateTBE                       No         Allocate TBE buffer for DMA request.
r_allocateTbeForDmaRead             No         Allocate TBE for DMA read.
v_allocateTBEFromRequestNet         No         Allocate TBE.
w_deallocateTBE                     No         Deallocate TBE.
z_recycleRequestQueue               No         Recycle Request Queue.
y_recycleDMARequestQueue            No         Recycle DMA Request Queue.
qf_queueMemoryFetchRequest          No         Queue off-chip fetch request.
qf_queueMemoryFetchRequestDMA       No         Queue off-chip fetch request.
qw_queueMemoryWBRequest_partial     No         Queue off-chip writeback request.
qw_queueMemoryWBRequest_partialTBE  No         Queue off-chip writeback request.
l_queueMemoryWBRequest              No         Write PUTX data to memory.
l_popMemQueue                       No         Pop off-chip request queue.
l_writeDataToMemory                 No         Write PUTX data to memory.
dwt_writeDMADataFromTBE             No         DMA Write data to memory from TBE.
The actions for this component are given in Table A.2. The impacted column indicates whether
the action is network-impacting (‘Yes’), used for cache-line ownership (‘Share’),
network-impacting only when DMA operations are taken into account (‘DMA’), or not
network-impacting (‘No’).
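The four categories act as a filter on the action list: which actions count as network-impacting depends on whether DMA operations are taken into account. The sketch below illustrates this in Python on a small subset of Table A.2. The dictionary DIRECTORY_ACTIONS and the helper network_impacting are hypothetical names introduced for this illustration; they are not part of gem5 or of the tooling described in this thesis.

```python
# Hypothetical subset of Table A.2: action name -> impact category.
DIRECTORY_ACTIONS = {
    "a_sendWriteBackAck": "Yes",
    "f_forwardRequest": "Yes",
    "inv_sendCacheInvalidate": "DMA",
    "dr_sendDMAData": "DMA",
    "c_clearOwner": "Share",
    "e_ownerIsRequestor": "Share",
    "v_allocateTBE": "No",
    "l_popMemQueue": "No",
}


def network_impacting(actions, include_dma=False):
    """Return the sorted names of the actions that touch the network.

    'Yes' actions always count; 'DMA' actions count only when DMA
    operations are part of the model. 'Share' and 'No' never count.
    """
    wanted = {"Yes", "DMA"} if include_dma else {"Yes"}
    return sorted(name for name, impact in actions.items() if impact in wanted)
```

Calling network_impacting(DIRECTORY_ACTIONS) selects only the ‘Yes’ actions, while passing include_dma=True also selects the ‘DMA’ actions.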
A.1.3. DMA CONTROLLER
QUEUES
1. requestToDir_out: send READ and WRITE requests to the directory.
2. dmaRequestQueue_in: input queue for LOAD and STORE requests from the DMA
subsystem.
3. dmaResponseQueue_in: DATA and ACK messages from the directory controller.
STATES
The description of all states in this protocol can be found at: http://www.m5sim.org/MI_example
ACTIONS
The actions for this protocol are given in Table A.3. The impacted column indicates whether
the action is not network-impacting (‘No’) or only network-impacting when DMA opera-
tions are taken into account (‘DMA’).
Table A.3: MI_example DMA controller actions
Action name Impacting Description
s_sendReadRequest DMA Send a DMA read request to memory over the requestToDir_out queue.
s_sendWriteRequest DMA Send a DMA write request to memory over the requestToDir_out queue.
p_popRequestQueue DMA Pop request queue dmaRequestQueue_in.
p_popResponseQueue DMA Pop response queue dmaResponseQueue_in.
a_ackCallback No Notify dma controller that write request completed.
d_dataCallback No Write data to dma sequencer.
A.1.4. MORE INFORMATION
More information on the MI_example gem5 protocol can be obtained from the following
locations:
Description -
http://www.m5sim.org/MI_example
SLICC description -
http://www.m5sim.org/SLICC
Cache code -
http://grok.gem5.org/source/xref/gem5/src/mem/protocol/MI_example-cache.sm
Directory code -
http://grok.gem5.org/source/xref/gem5/src/mem/protocol/MI_example-dir.sm
DMA code -
http://grok.gem5.org/source/xref/gem5/src/mem/protocol/MI_example-dma.sm
Code messages -
http://grok.gem5.org/source/xref/gem5/src/mem/protocol/MI_example-msg.sm
To validate the data above, use the latest revision of each source listing dated before
August 26, 2014.
A.2. MESI PROTOCOL
The gem5 simulator (gem5) MESI protocol we investigated is a two-level protocol in which each
core has a private L1 cache with separate data and instruction caches. The L2 cache is a
second-level cache that is shared among the cores.
The protocol has four stable states: M (Modified) means that the cache-line has been written;
E (Exclusive) means that the cache has exclusive permission to write the cache-line, which is
still clean; S (Shared) is a shared read-only copy; and I (Invalid) means that the cache-line
is invalid.
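The four stable states can be summarised by the permissions they grant. The following is a minimal Python sketch, assuming the usual reading that an invalid line grants neither read nor write permission and that only a modified line is dirty. The names MesiState and needs_writeback are hypothetical and do not occur in the gem5 sources.

```python
from enum import Enum


class MesiState(Enum):
    """Stable MESI states as (can_read, can_write, is_dirty)."""

    M = (True, True, True)     # Modified: the line has been written
    E = (True, True, False)    # Exclusive: write permission, still clean
    S = (True, False, False)   # Shared: read-only copy
    I = (False, False, False)  # Invalid: no permissions (assumption)

    def __init__(self, can_read, can_write, is_dirty):
        self.can_read = can_read
        self.can_write = can_write
        self.is_dirty = is_dirty


def needs_writeback(state):
    """Assumption: only a dirty line has to be written back when relinquished."""
    return state.is_dirty
```

In this reading, M and E differ only in the dirty flag, which is why relinquishing an E line does not require data to be written back.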
The protocol is implemented with four types of components: L1 cache controllers, each
connected to a processing core; L2 cache controllers for the shared cache; directory
controllers, connected to memory; and DMA controllers, connected to I/O-related devices.
These four component types are detailed in the next sections.
A.2.1. L1 CACHE CONTROLLER
The L1 cache controller receives memory requests from cores and retrieves the associated
cache-line from the L2 cache controllers. When another core or a DMA controller requires
the cache-line, it is relinquished again. A cache-line can also be relinquished in a
replacement, when the L1 cache wants to replace the cache-line content with another memory
block.
QUEUES
The input and output queues are defined in the source code with the in_port and out_port
functions, respectively. The following queues are defined for the L1 cache controller:
1. optionalQueue_in: used by the core prefetcher to send pre-fetch cache-line instructions.
2. optionalQueue_out: used by the core prefetcher to send pre-fetch cache-line instructions.
3. mandatoryQueue_in: input queue for LOAD, STORE and IFetch memory requests from the core.
4. requestL1Network_out∗: used to issue requests to the other nodes.
5. requestL1Network_in∗: requests from this L1 cache to the shared L2.
6. responseL1Network_in∗: used to receive DATA packets over the network from the
cache coherence components.
7. responseL1Network_out∗: used to send DATA packets to the other cache coherence
components.
8. unblockNetwork_out∗: used to send unblock requests to the L2 cache controller.
The queues that we are interested in are all the above queues marked with a ∗. The other
queues are used to receive requests from the processor core.
STATES
The description of all states in this protocol can be found at: http://www.m5sim.org/MESI_Two_Level
ACTIONS
Table A.4: MESI L1 cache controller actions
Action name Impacting Description
a_issueGETS Yes Issue a GETS over the requestL1Network_out queue.
pa_issuePfGETS Yes Issue a prefetch GETS over the requestL1Network_out queue.
ai_issueGETINSTR Yes Issue GETINSTR over the requestL1Network_out queue.
pai_issuePfGETSINSTR Yes Issue GETINSTR for Prefetch request over the requestL1Network_out queue.
b_issueGETX Yes Issue GETX over the requestL1Network_out queue.
pb_issuePfGETX Yes Issue prefetch GETX over the requestL1Network_out queue.
c_issueUPGRADE Yes Issue an upgrade message over the requestL1Network_out queue.
d_sendDataToRequestor Yes Send data to requestor over the responseL1Network_out queue.
d2_sendDataToL2 Yes Send data to the L2 cache because of M downgrade over the responseL1Network_out queue.
dt_sendDataToRequestor_fromTBE Yes Send data to requestor over the responseL1Network_out queue.
d2t_sendDataToL2_fromTBE Yes Send data to the L2 cache over the responseL1Network_out queue.
e_sendAckToRequestor Yes Send invalidate ack to requestor (could be L2 or L1) over the responseL1Network_out queue.
f_sendDataToL2 Yes Send data to the L2 cache over the responseL1Network_out queue.
ft_sendDataToL2_fromTBE Yes Send data to the L2 cache from TBE over the responseL1Network_out queue.
fi_sendInvAck Yes Send Acknowledge to the L2 over the responseL1Network_out queue.
forward_eviction_to_cpu No Send eviction information to the processor.
g_issuePUTX Yes Send data to the L2 cache over the requestL1Network_out queue.
j_sendUnblock Yes Send unblock to the L2 cache over the unblockNetwork_out queue.
jj_sendUnblock Yes Send exclusive unblock to the L2 cache over the unblockNetwork_out queue.
dg_invalidate_sc No Invalidate store conditional as the cache lost permission.
h_load_hit No If not prefetch, notify sequencer the load completed.
hx_load_hit No If not prefetch, notify sequencer the load completed.
hh_store_hit No If not prefetch, notify sequencer the store completed.
hhx_store_hit No If not prefetch, notify sequencer the store completed.
i_allocateTBE No Allocate TBE buffer.
k_popMandatoryQueue No Pop the mandatory queue.
l_popRequestQueue Yes Pop the requestL1Network_in queue.
o_popIncomingResponseQueue Yes Pop the responseL1Network_in queue.
s_deallocateTBE No Deallocate the TBE.
u_writeDataToL1Cache No Write data to cache.
q_updateAckCount No Update the ack count.
ff_deallocateL1CacheBlock No Deallocate L1 cache block.
oo_allocateL1DCacheBlock No Set L1 D-cache tag equal to tag of block B.
pp_allocateL1ICacheBlock No Set L1 I-cache tag equal to tag of block B.
z_stallAndWaitMandatoryQueue No Recycle mandatory queue.
kd_wakeUpDependents No Wake up dependents.
uu_profileInstMiss No Profile the demand miss.
uu_profileInstHit No Profile the demand hit.
uu_profileDataMiss No Profile the demand miss.
uu_profileDataHit No Profile the demand hit.
po_observeMiss No Inform the prefetcher about the miss.
ppm_observePfMiss No Inform the prefetcher about the partial miss.
pq_popPrefetchQueue No Pop the optionalQueue_in queue.
mp_markPrefetched No Mark the cache_entry (cache-line) as pre-fetched.
The actions for this component are described in Table A.4. The impacted column indicates
whether the action is network-impacting (‘Yes’) or not network-impacting (‘No’).
A.2.2. L2 CACHE CONTROLLER
The L2 cache controller receives memory requests from the L1 caches and retrieves the
associated cache-line from memory. The Directory controller can request that a cache-line
be relinquished so that it can be used by a DMA controller.
QUEUES
The input and output queues are defined in the source code with the in_port and out_port
functions, respectively. The following queues are defined for the L2 cache controller:
1. L1RequestL2Network_out: L1 Requests from L2 cache
2. L1unblockNetwork_in: Unblock requests from L1 caches
3. L1RequestL2Network_in: L1 cache requests to L2 cache
4. DirRequestL2Network_out: Directory requests from L2 cache
5. responseL2Network_out: Responses from L2 cache to Directory and L1 caches
6. responseL2Network_in: Responses from Directory and L1 caches to L2 cache
Which queues are network-impacting depends on the network configuration. The
first three queues are related to the L1 network, between the L1 caches and the L2 cache.
The last three queues in the above list are related to the L2 network, between the L2
cache, the Directory controller and the DMA controller(s).
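As a sketch of this grouping, the Python fragment below assigns each queue to its network by its position in the list above and selects the queues that belong to the networks included in a model. The names QUEUE_NETWORK and impacting_queues, and the idea of passing the modelled networks as a set, are assumptions made for illustration only.

```python
# Queue order follows the numbered list for the L2 cache controller.
L2_QUEUES = [
    "L1RequestL2Network_out",
    "L1unblockNetwork_in",
    "L1RequestL2Network_in",
    "DirRequestL2Network_out",
    "responseL2Network_out",
    "responseL2Network_in",
]

# First three queues: L1 network; last three queues: L2 network.
QUEUE_NETWORK = {q: ("L1" if i < 3 else "L2") for i, q in enumerate(L2_QUEUES)}


def impacting_queues(modelled_networks):
    """Queues that are network-impacting when only the networks in
    `modelled_networks` (a set such as {"L1"}) are part of the model."""
    return [q for q in L2_QUEUES if QUEUE_NETWORK[q] in modelled_networks]
```

With impacting_queues({"L1"}) only the first three queues are selected; passing {"L1", "L2"} selects all six.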
STATES
The description of all states in this protocol can be found at: http://www.m5sim.org/MESI_Two_Level
ACTIONS
Table A.5: MESI L2 cache controller actions
Action name Impacting Description
a_issueFetchToMemory Yes Fetch data (GETS) from memory over the DirRequestL2Network_out queue.
b_forwardRequestToExclusive Yes Forward request to the exclusive L1 over the L1RequestL2Network_out.
c_exclusiveReplacement Yes Send MEMORY_DATA to memory over the responseL2Network_out.
c_exclusiveCleanReplacement Yes Send ACK to memory for clean replacement over the responseL2Network_out.
ct_exclusiveReplacementFromTBE Yes Send MEMORY_DATA to memory over responseL2Network_out.
d_sendDataToRequestor Yes Send DATA from cache to requestor over responseL2Network_out.
dd_sendExclusiveDataToRequestor Yes Send DATA_EXCLUSIVE from cache to requestor over responseL2Network_out.
ds_sendSharedDataToRequestor Yes Send DATA from cache to requestor over responseL2Network_out.
e_sendDataToGetSRequestors Yes Send DATA from cache to all GetS IDs over responseL2Network_out.
ex_sendExclusiveDataToGetSRequestors Yes Send DATA_EXCLUSIVE from cache to all GetS IDs.
ee_sendDataToGetXRequestor Yes Send DATA from cache to GetX ID over responseL2Network_out.
f_sendInvToSharers Yes Send INV (invalidate) to sharers for L2 replacement over L1RequestL2Network_out.
fw_sendFwdInvToSharers Yes Send INV (invalidate) to sharers for request, over L1RequestL2Network_out.
fwm_sendFwdInvToSharersMinusRequestor Yes Send INV (invalidate) to sharers for request, requestor is sharer, over L1RequestL2Network_out.
i_allocateTBE No Allocate TBE for request.
s_deallocateTBE No Deallocate external TBE.
jj_popL1RequestQueue Yes Pop the incoming L1 request queue L1RequestL2Network_in.
k_popUnblockQueue Yes Pop the incoming unblock queue L1unblockNetwork_in.
o_popIncomingResponseQueue Yes Pop the incoming response queue responseL2Network_in.
m_writeDataToCache No Write data from response queue to cache.
mr_writeDataToCacheFromRequest No Write data from response queue to cache.
q_updateAck No Update pending ack count.
qq_writeDataToTBE No Write data from response queue to TBE.
ss_recordGetSL1ID No Record L1 GetS for load response.
xx_recordGetXL1ID No Record L1 GetX for store response.
set_setMRU No Set the MRU entry.
qq_allocateL2CacheBlock No Set the L2 cache tag equal to tag of block B.
rr_deallocateL2CacheBlock No Deallocate L2 cache block. Sets the cache to not present, allowing a replacement in parallel with a fetch.
t_sendWBAck Yes Send a writeback ACK over the responseL2Network_out queue.
ts_sendInvAckToUpgrader Yes Send ACK to upgrader over the responseL2Network_out queue.
uu_profileMiss No Profile the demand miss.
uu_profileHit No Profile the demand hit.
nn_addSharer Share Add L1 sharer to the list via the L1RequestL2Network_in queue.
nnu_addSharerFromUnblock Share Add L1 sharer to the list via the L1unblockNetwork_in queue.
kk_removeRequestSharer Share Remove L1 Request sharer from list via the L1RequestL2Network_in queue.
ll_clearSharers Share Remove all L1 sharers from list, via the L1RequestL2Network_in queue.
mm_markExclusive Share Set the exclusive owner via the L1RequestL2Network_in queue.
mmu_markExclusiveFromUnblock Share Set the exclusive owner via the L1unblockNetwork_in queue.
zz_stallAndWaitL1RequestQueue No Recycle L1 request queue.
zn_recycleResponseNetwork No Recycle memory request.
kd_wakeUpDependents No Wake up dependents.
The actions for this component are described in Table A.5. The impacted column indicates
whether the action is network-impacting (‘Yes’) or not network-impacting (‘No’). The
entries marked with ‘Share’ are not network-impacting in themselves, but are needed to
maintain the cache-line ownership state.
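To illustrate why the ‘Share’ actions matter even though they send nothing themselves, the following Python sketch keeps the sharer list and the exclusive owner for one cache-line and derives from it which L1 caches would receive an invalidation. The class OwnershipEntry and the function invalidation_targets are hypothetical; the comments only indicate which action of Table A.5 each method loosely corresponds to.

```python
class OwnershipEntry:
    """Ownership bookkeeping for a single cache-line (hypothetical model)."""

    def __init__(self):
        self.sharers = set()   # IDs of L1 caches holding a shared copy
        self.exclusive = None  # ID of the L1 cache owning the line, if any

    def add_sharer(self, l1_id):        # loosely: nn_addSharer
        self.sharers.add(l1_id)

    def remove_sharer(self, l1_id):     # loosely: kk_removeRequestSharer
        self.sharers.discard(l1_id)

    def clear_sharers(self):            # loosely: ll_clearSharers
        self.sharers.clear()

    def mark_exclusive(self, l1_id):    # loosely: mm_markExclusive
        self.exclusive = l1_id


def invalidation_targets(entry, requestor):
    """L1 caches that would receive an INV when `requestor` asks for the
    line: all current sharers except the requestor itself."""
    return sorted(entry.sharers - {requestor})
```

The set of invalidation targets, and therefore the network traffic, is determined entirely by state that only the ‘Share’ actions modify.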
A.2.3. DIRECTORY CONTROLLER
The Directory controller receives memory requests and stores from the L2 cache and the
DMA controller. When the DMA controller requests a memory block the directory con-
troller can request the L2 cache to relinquish the cache-line.
QUEUES
The input and output queues are defined in the source code with the in_port and out_port
functions, respectively. The following queues are defined for the directory controller:
1. responseNetwork_in: Memory DATA or ACKs from the L2 caches
2. responseNetwork_out: Responses from Directory to the L2 caches
3. requestNetwork_in: Requests from L2 caches or DMA nodes
4. memQueue_out: Requests to memory
5. memQueue_in: DATA or ACK messages from memory.
The queues that we are interested in are the first three queues from the list above. The
memQueues are used for communication with the memory subsystem.
The responseNetwork_out queue is also used for INV requests to the L2 cache, in the
inv_sendCacheInvalidate action.
STATES
The description of all states in this protocol can be found at: http://www.m5sim.org/MESI_Two_Level
ACTIONS
Table A.6: MESI Directory controller actions
Action name Impacting Description
a_sendAck Yes Send a MEMORY_ACK to L2 over the responseNetwork_out (from responseNetwork_in, back to sender).
d_sendData Yes Send MEMORY_DATA to L2 over the responseNetwork_out.
aa_sendAck Yes Send a MEMORY_ACK to L2 over the responseNetwork_out (from memQueue_in).
j_popIncomingRequestQueue Yes Pop incoming request queue requestNetwork_in.
k_popIncomingResponseQueue Yes Pop incoming response queue responseNetwork_in.
l_popMemQueue No Pop off-chip request queue memQueue_in.
kd_wakeUpDependents No Wake up dependents.
qf_queueMemoryFetchRequest No Queue off-chip fetch request using memQueue_out.
qw_queueMemoryWBRequest No Queue off-chip writeback request using memQueue_out.
m_writeDataToMemory No Write dirty writeback to memory.
qf_queueMemoryFetchRequestDMA No Queue off-chip fetch request using memQueue_out.
p_popIncomingDMARequestQueue DMA Pop incoming DMA queue from requestNetwork_in.
dr_sendDMAData DMA Send Data to DMA controller from directory over the responseNetwork_out.
dw_writeDMAData No DMA write data to memory.
qw_queueMemoryWBRequest_partial No Queue off-chip writeback request using memQueue_out.
da_sendDMAAck DMA Send Ack to DMA controller over the responseNetwork_out queue.
z_stallAndWaitRequest No Recycle request queue requestNetwork_in.
zz_recycleDMAQueue No Recycle DMA queue requestNetwork_in.
inv_sendCacheInvalidate DMA Invalidate a cache block to responseNetwork_out.
drp_sendDMAData DMA Send Data to DMA controller from incoming PUTX over responseNetwork_out.
v_allocateTBE No Allocate TBE.
dwt_writeDMADataFromTBE No DMA Write data to memory from TBE.
qw_queueMemoryWBRequest_partialTBE No Queue off-chip writeback request using memQueue_out.
w_deallocateTBE No Deallocate TBE.
The actions for this component are described in Table A.6. The impacted column indicates
whether the action is network-impacting (‘Yes’), network-impacting for DMA actions
(‘DMA’) or not network-impacting (‘No’).
A.2.4. DMA CONTROLLER
The DMA controller initiates memory requests for block read and write actions. It targets
these requests at the directory controller.
QUEUES
The input and output queues are defined in the source code with the in_port and out_port
functions, respectively. The following queues are defined for the DMA controller:
1. requestToDir_out: DMA requests to the Directory controller
2. dmaRequestQueue_in: DMA requests from the IO device.
3. dmaResponseQueue_in: Responses from L2 cache
The requestToDir_out queue and the dmaResponseQueue_in are the queues that we
are interested in. The dmaRequestQueue_in is non-network-impacting.
STATES
There are three states in this state machine:
1. READY: Invalid, Ready to accept a new request;
2. BUSY_RD: Busy: currently processing a read request;
3. BUSY_WR: Busy: currently processing a write request;
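These three states can be read as a small state machine. The Python sketch below is one possible reading: the event names (ReadRequest, WriteRequest, Data, Ack) and the pairing of states with the actions of Table A.7 are assumptions made for illustration, and should be checked against the SLICC source before being relied upon.

```python
# Transition table: (state, event) -> (next state, actions performed).
# Event names and the state/action pairing are assumptions for illustration.
TRANSITIONS = {
    ("READY", "ReadRequest"): ("BUSY_RD", ["s_sendReadRequest", "p_popRequestQueue"]),
    ("READY", "WriteRequest"): ("BUSY_WR", ["s_sendWriteRequest", "p_popRequestQueue"]),
    ("BUSY_RD", "Data"): ("READY", ["d_dataCallback", "p_popResponseQueue"]),
    ("BUSY_WR", "Ack"): ("READY", ["a_ackCallback", "p_popResponseQueue"]),
}


def step(state, event):
    """Return (next_state, actions). An undefined pair leaves the state
    unchanged and performs no actions (the event has to wait)."""
    return TRANSITIONS.get((state, event), (state, []))


def run(events, state="READY"):
    """Feed a sequence of events and collect the actions performed."""
    trace = []
    for event in events:
        state, actions = step(state, event)
        trace.extend(actions)
    return state, trace
```

For example, run(["ReadRequest", "Data"]) returns to READY after one read, whereas a WriteRequest that arrives in BUSY_RD is not consumed in this sketch.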
ACTIONS
Table A.7: MESI DMA controller actions
Action name Impacting Description
s_sendReadRequest DMA Send a DMA READ request over the requestToDir_out queue.
s_sendWriteRequest DMA Send a DMA WRITE request over the requestToDir_out queue.
a_ackCallback No Notify dma controller that write request completed.
d_dataCallback No Write data to dma sequencer.
p_popRequestQueue No Pop dmaRequestQueue_in request queue.
p_popResponseQueue DMA Pop dmaResponseQueue_in response queue.
The actions for this component are described in Table A.7. The impacted column indi-
cates whether the action is network-impacting for DMA actions (‘DMA’) or not network-
impacting (‘No’).
A.2.5. MORE INFORMATION
More information on the MESI gem5 protocol can be obtained from the following locations:
Description -
http://www.m5sim.org/MESI_Two_Level
SLICC description -
http://www.m5sim.org/SLICC
L1 Cache code -
http://grok.gem5.org/xref/gem5/src/mem/protocol/MESI_Two_Level-L1cache.sm
L2 Cache code -
http://grok.gem5.org/xref/gem5/src/mem/protocol/MESI_Two_Level-L2cache.sm
Directory code -
http://grok.gem5.org/xref/gem5/src/mem/protocol/MESI_Two_Level-dir.sm
DMA code -
http://grok.gem5.org/xref/gem5/src/mem/protocol/MESI_Two_Level-dma.sm
Code messages -
http://grok.gem5.org/xref/gem5/src/mem/protocol/MESI_Two_Level-msg.sm
To validate the data above, use the latest revision of each source listing dated before
August 26, 2014.
BIBLIOGRAPHY
BOOKS
[B+08] Christel Baier, Joost-Pieter Katoen, et al. Principles of model checking.
MIT Press, Cambridge, 2008.
[Lin11] Peter Linz. An introduction to formal languages and automata. Jones & Bartlett
Publishers, 2011.
[Mar08] Michael R Marty. Cache coherence techniques for multicore processors. ProQuest,
2008.
[San12] Davide Sangiorgi. Introduction to bisimulation and coinduction. Cambridge University
Press, 2012. ISBN: 978-1-107-00363-7.
[SHW11] Daniel J Sorin, Mark D Hill, and David A Wood. A Primer on Memory Consistency
and Cache Coherence. Vol. 6. 3. Morgan & Claypool Publishers, 2011, pp. 1–212.
SCIENTIFIC ARTICLES
[AB86] James Archibald and Jean-Loup Baer. “Cache coherence protocols: Evaluation
using a multiprocessor simulation model”. In: ACM Transactions on Computer
Systems (TOCS) 4.4 (1986), pp. 273–298.
[Bas96] Twan Basten. “Branching bisimilarity is an equivalence indeed!” In: Information
Processing Letters 58.3 (1996), pp. 141–147.
[DS87] William J Dally and Charles L Seitz. “Deadlock-free message routing in multi-
processor interconnection networks”. In: Computers, IEEE Transactions on 100.5
(1987), pp. 547–553.
[KAC14] Rakesh Komuravelli, Sarita V Adve, and Ching-Tsun Chou. “Revisiting the com-
plexity of hardware cache coherence and some implications”. In: ACM Transac-
tions on Architecture and Code Optimization (TACO) 11.4 (2014), p. 37.
[MHS12] Milo MK Martin, Mark D Hill, and Daniel J Sorin. “Why on-chip cache coherence
is here to stay”. In: Communications of the ACM 55.7 (2012), pp. 78–89.
[Sor+02] Daniel J Sorin, Manoj Plakal, Anne E Condon, Mark D Hill, Milo MK Martin, and
David A Wood. “Specifying and verifying a broadcast and a multicast snooping
cache coherence protocol”. In: Parallel and Distributed Systems, IEEE Transac-
tions on 13.6 (2002), pp. 556–578.
[VS12a] Freek Verbeek and Julien Schmaltz. “Easy formal specification and validation of
unbounded networks-on-chips architectures”. In: ACM Transactions on Design
Automation of Electronic Systems (TODAES) 17.1 (2012), p. 1.
[VS12b] Freek Verbeek and Julien Schmaltz. “Towards the formal verification of cache
coherency at the architectural level”. In: ACM Transactions on Design Automa-
tion of Electronic Systems (TODAES) 17.3 (2012), p. 20.
CONFERENCES
[Fan+14] Jianbin Fang, Henk Sips, Lilun Zhang, Chuanfu Xu, Yonggang Che, and Ana Lucia
Varbanescu. “Test-driving Intel Xeon Phi”. In: Proceedings of the 5th ACM/SPEC
international conference on Performance engineering. ACM. 2014, pp. 137–148.
[GN92] Christopher J Glass and Lionel M Ni. “The turn model for adaptive routing”. In:
ACM SIGARCH Computer Architecture News. Vol. 20. 2. ACM. 1992, pp. 278–287.
[Rei+08] Matthew Reilly et al. “When multicore isn’t enough: Trends and the future for
multi-multicore systems”. In: Proceedings of Workshop on High Performance
Embedded Computing. 2008.
[Tre08] Jan Tretmans. “Model based testing with labelled transition systems”. In: For-
mal methods and testing. Springer, 2008, pp. 1–38.
[Ver+16] Freek Verbeek, Pooria M Yaghini, Ashkan Eghbal, and Nader Bagherzadeh. “AD-
VOCAT: Automated deadlock verification for on-chip cache coherence and in-
terconnects”. In: 2016 Design, Automation & Test in Europe Conference & Exhi-
bition (DATE). IEEE. 2016, pp. 1640–1645.
TECHNICAL DOCUMENTATION
[Int09] Intel. An Introduction to the Intel QuickPath Interconnect. Tech. rep. 2009.
[Sut05] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in
software. 2005.
ACRONYMS
CPU Central Processing Unit.
DFA Deterministic Finite Automaton.
LTS Labeled Transition System.
NFA Non-deterministic finite automaton.
NoC Network on a Chip.
SCC Strongly Connected Component.
SE Software Engineering.
GLOSSARY
AP-Deterministic a deterministic state machine where all states are labeled with the prop-
erties that are valid in that state. AP stands for Atomic Propositions or Properties. De-
terministic means that the state machine only allows at most a single state to transi-
tion to with a certain set of properties. Also known as deterministic Kripke structures.
bijection in mathematics, a bijection, bijective function or one-to-one correspondence is
a function between the elements of two sets, where every element of one set is paired
with exactly one element of the other set, and every element of the other set is paired
with exactly one element of the first set.
bisimilarity the finest extensional behavioural equivalence one would like to impose on
processes.
cache coherence cache communication protocols with which to make the caches of a shared-
memory system as functionally invisible as caches in a single core system.
cache controller a module that services load and store requests for cached values in the
cache. See [SHW11, p. 83].
cache-line a block of memory, normally in the order of 64 or 128 bytes, that is handled as a
single entity in the system caches. This is normally also equal to the size of a memory
request from and to main memory.
core a microprocessor execution unit capable of actively executing application actions. In
the context of this thesis these are the units initiating memory load and store actions.
deadlock a situation where two or more actions are waiting for each other, caused by a
circular dependency. These situations typically never resolve without external inter-
ference.
Deterministic Finite Automaton state machines that allow at most a single transition for
an event. No choices need to be made, and it is always clear what the next state will
be.
die a small block of semiconductor material, on which an integrated circuit is fabricated.
It is fabricated in large batches on a single wafer. A microprocessor or memory chip
is normally composed of a single die placed in a ’chip’ package.
directory controller a module that services load and store requests for main memory con-
tent. It can also maintain the ownership of the cache-lines. If it does not, the compo-
nent is sometimes also called a memory controller. See [SHW11, p. 84].
extensional property a property whose definition only takes into account the interactions
that the processes may, or may not, perform [San12, p. 2].
gem5 the gem5 simulator is an open-source software application for simulation of computer
systems in software. See: http://gem5.org.
Labeled Transition System a state machine where the transitions are labeled. The labels
model the actions that are performed during the transition, the states model the sys-
tem states.
memory consistency ruleset defining the allowed behaviour of multithreaded programs
executing with shared memory.
multi-threading for software: the ability of a single process or application to subdivide its
work in multiple threads of execution, which can be run in parallel; for microproces-
sor cores: the ability to concurrently run multiple software threads in parallel on a
single core.
network on a chip a communication subsystem between various execution cores and sup-
porting subsystems on an integrated circuit.
Non-deterministic finite automaton state machines that allow multiple transitions for an
event, and internal transitions, taken without external stimuli. Choices potentially
need to be made, and a trace through an NFA can use multiple paths.
state machine a finite-state machine, or a state machine, is an abstract machine that de-
scribes a number of states, and the events and triggers with which the machine tran-
sitions between states.
state space explosion a combinatorial explosion or blow up of the number of states that
must be examined to get a state machine property validated with automatic analysis.
Strongly Connected Component part of a state machine where all states are strongly con-
nected, i.e. where the states can transition to one another, via a path in both direc-
tions between each individual state.
structural equivalence two state machines are structurally equivalent or isomorphic when
a bijection can be established on the states and their transitions. See also [San12, p.
16].
trace equivalence two state machines are trace equivalent when they can perform the
same finite sequences of transitions. See also [San12, p. 18].
wormhole routing a simple flow control method where a packet is split into a number of
short, fixed-size messages called ‘flits’. The first flit contains the address and with it
creates the connection for the following flits, which follow the head.
