Modeling Cache Coherence to Expose Interference by Sensfelder, Nathanaël et al.
HAL Id: hal-02165139
https://hal.archives-ouvertes.fr/hal-02165139
Submitted on 25 Jun 2019
Distributed under a Creative Commons Attribution 4.0 International License
Modeling Cache Coherence to Expose Interference
Nathanaël Sensfelder, Julien Brunel, Claire Pagetti
To cite this version:
Nathanaël Sensfelder, Julien Brunel, Claire Pagetti. Modeling Cache Coherence to Expose Interference. ECRTS 2019, Jul 2019, Stuttgart, Germany. 10.4230/LIPIcs.ECRTS.2019.26. hal-02165139
Modeling Cache Coherence to Expose Interference
Nathanaël Sensfelder
ONERA, France
Julien Brunel
ONERA, France
Claire Pagetti
ONERA, France
Abstract
To facilitate programming, most multi-core processors feature automated mechanisms maintaining coherence between each core’s cache. These mechanisms introduce interference, that is, delays caused by concurrent access to a shared resource. This type of interference is hard to predict, leading to the mechanisms being shunned by real-time system designers, at the cost of potential benefits in both running time and system complexity.
We believe that formal methods can provide the means to ensure that the effects of this interference are properly exposed and mitigated. Consequently, this paper proposes a nascent framework relying on timed automata to model and analyze the interference caused by cache coherence.
2012 ACM Subject Classification Computer systems organization → Multicore architectures; Computer systems organization → Real-time systems
Keywords and phrases Real-time systems, multi-core processor, cache coherence, formal methods
Digital Object Identifier 10.4230/LIPIcs.ECRTS.2019.26
Supplement Material https://www.onera.fr/sites/default/files/598/ecrts19.zip
Acknowledgements We would like to thank Mamoun Filali-Amine (IRIT-CNRS) for his helpful insights on how to validate our model and for his suggestions of related work.
1 Introduction
The next generation of aircraft will embed multi-core processors. Indeed, it will be more and more difficult to find mono-core processors on the market and, when correctly programmed, multi-core processors offer huge opportunities to reduce the amount of equipment required to host multiple applications compared to federated or single-core IMA (Integrated Modular Avionics) architectures. However, multi-core processors come with several drawbacks, among which is the lack of predictability [26, 27], one of the key elements of certification expectations. This lack of predictability is caused by interference, a delay inherent to the concurrent access to a shared resource.
Cache Coherence. In most multi-core processors, each core has its own cache memory, of which it is virtually the sole accessor. A cache coherence protocol ensures that:
- At any given time, a memory location can either be accessed by a single cache controller, in which case both writing and reading are allowed, or by any number of cache controllers, in which case only reading is allowed.
- Any copy of a memory location held in a cache has the most up-to-date value.
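The first of these two properties can be checked mechanically. The following Python sketch is only an illustration: the access-right names (`NONE`, `READ`, `READ_WRITE`) and the `is_coherent` helper are hypothetical and do not come from the paper. It tests whether a snapshot of access rights for one memory location respects the single-writer or multiple-readers rule; the second property (value freshness) is not covered.

```python
from typing import Dict

# Hypothetical access rights a cache controller may hold on one memory location.
NONE, READ, READ_WRITE = "none", "read", "read-write"


def is_coherent(rights: Dict[str, str]) -> bool:
    """Check the single-writer / multiple-readers rule for one memory location.

    `rights` maps a cache controller name to the access it currently holds.
    """
    writers = [cc for cc, r in rights.items() if r == READ_WRITE]
    readers = [cc for cc, r in rights.items() if r == READ]
    if len(writers) > 1:
        return False  # two controllers may never hold write access at once
    if writers and readers:
        return False  # a writer excludes every read-only copy
    return True       # at most one writer alone, or readers only


print(is_coherent({"CC0": READ, "CC1": READ}))        # True
print(is_coherent({"CC0": READ_WRITE, "CC1": NONE}))  # True
print(is_coherent({"CC0": READ_WRITE, "CC1": READ}))  # False
```

A coherence protocol can be seen as the mechanism that keeps every reachable snapshot inside the set accepted by such a check.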
Maintaining this cache coherence requires exchanges of information between cache memories. These exchanges can be the source of a large amount of additional traffic, a potential hindrance that we qualify as implicit interference, because of how difficult they are to predict. Additionally, maintaining coherence can result in the removal of elements from the cache, which may lead to time-consuming communications with the system’s memory (cache misses).
© Nathanaël Sensfelder and Julien Brunel and Claire Pagetti;
licensed under Creative Commons License CC-BY
31st Euromicro Conference on Real-Time Systems (ECRTS 2019).
Editor: Sophie Quinton; Article No. 26; pp. 26:1–26:22
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
While multi-core processors feature hardware to efficiently and automatically handle cache coherence, the black-box nature of commercial processors leads to a lack of control, visibility, and predictability of the cache coherence protocol and, by extension, of the delays it may create.
Current Research Practices. Several approaches have been developed in the literature to deal with the interference found in multi-core processors. The main solutions to ensure predictability are 1) preventing any kind of uncontrolled interference (e.g. run-time services [15, 28]); 2) enforcing a unique access to any shared resource at any time, so as to be equivalent to a single core situation (e.g. execution models [18, 5, 12]). Because the interference it causes is difficult to predict, most of the hardware considered either lacks automatic cache coherency or does not use it. Instead, the burden of cache management is placed on the developers, forcing an application-specific solution (e.g. scratchpad memory [25, 22]). Such solutions prevent the gains in performance that would otherwise be provided by automatic hardware cache management mechanisms.
Contributions. We believe that the implicit interference generated by the cache coherence can be exposed and taken into account to achieve predictable programming of a multi-core processor. In this work, we focus on exposing these unexpected delays through the analysis of a formal model of the processor.
We start this paper by going into more detail on how cache coherence can be achieved (Section 2), the type of system we are interested in (Section 3), and the categories of interference it can host (Section 4). We then present the tools that we use to model and analyze it (Section 5). Afterwards, we explain our choices in how we modeled the cache coherence in a multi-core processor (Section 6). Finally, we showcase some of the results that can be extracted from our model (Section 7), before listing related work (Section 8) and concluding (Section 9).
2 Cache Coherence Protocols
We start by introducing archetypal systems on which coherence protocols run. We then present how those protocols behave.
A number of components (see Figure 1) are involved in the coherence.
Figure 1 Components involved in cache coherence (cores, cache controllers, interconnect, coherency manager, memory controller)
Memory Element: The main memory is composed of chunks (or memory elements) which have a fixed size and contain multiple addressable elements. Reading/writing from/to an address in the main memory actually corresponds to reading or writing a whole memory element. The distinction between an addressable space and a memory element is not relevant to cache coherence, and thus, for simplification purposes, this paper considers that each memory element has a single address.
Core: The component actually using and modifying memory element values. Instead of accessing the original memory elements through the interconnect, each core is linked to its own private cache. The content of this cache is managed by an associated cache controller. The core can ask its cache controller for the value held by the memory element at a given address through a load request. It can also send a store request to modify this value. Additionally, the core can issue an evict request, which tells its cache controller to invalidate a memory element copy. While it is rare for cores to be the initiators of evict requests, it remains a possibility (e.g. for micro-optimization). Cores can be made to stall by their cache controller, delaying the emission of a request until the cache controller is ready to accept it.
Cache Controller: Component that handles requests from its core, potentially initiating a transaction by making a query on the interconnect. Such queries take the form of a GetS when asking for a read-only copy of a memory element and that of a GetM when asking for a read-and-write copy. Queries that indicate a new value for the memory element are done through PutM messages. Depending on the protocol, variants of these messages may be used. Cache controllers are also able to reply to the query of another cache controller with a data reply (data). Additionally, cache controllers may initiate evict requests on themselves to make space for new memory element copies. These self-requests are controlled by a cache replacement policy, which is most commonly a speed-over-accuracy variation on the Least Recently Used policy.
Coherency Manager: Component that stores information on the state of the cache controllers, to help maintain the cache coherence. Using this stored information, it can tell if a query should be answered by the memory controller or not. This component is very much dependent on which protocol is being implemented, and can range from being a simple link between the cache controllers to actually being multiple separate components (e.g. all directory nodes of a directory-based cache coherence protocol). It is usually found inside the interconnect.
Memory Controller: Component that handles the modification or copy of the original memory elements.
Interconnect: Component that regulates and handles the propagation of messages between cache controllers, memory controllers, and the coherency manager.
Definition 1 (Request, Message, Query, and Data Reply). To keep things separate, we use the term request when talking about communications between a core and its cache controller, and the term message when talking about communications that use the interconnect. As such, queries (e.g. GetM, GetS, PutM) are messages, and so are data replies (e.g. data). Thus, messages = queries ∪ data replies.
Definition 2 (Transaction). A transaction is composed of a query and of all the data messages the completion of that query requires.
Each message transiting through the interconnect, and each cache controller query, is about a specific memory element. Upon receiving either one of those, cache controllers look up the state they associate with their copy of the memory element for this address, and act according to the cache coherence protocol.
2.1 Protocols
Most cache coherence protocols are based on the MSI protocol, named after the states given to copies of the memory elements by the cache controllers. M stands for Modified, the state a cache controller gives its copy of the memory element to indicate that it has both read and write access to the original. S stands for Shared, and is the equivalent for read-only access. I stands for Invalid, when a cache controller does not currently have a copy.
MSI-based protocols are all categorized as Write-Back, because caches may contain a more up-to-date value of the memory element than the RAM.
The aforementioned protocols are referred to according to their states and general idea; however, the definition of their behavior depends on the system they are implemented on.
There are two main families of cache coherence implementation: snooping-based and directory-based. When using a snooping-based protocol, cache controller queries are broadcast to all cache controllers and to the coherency manager. The protocol also ensures that only one of the components answers the query. This answer is not broadcast, but is instead only meant for the query’s originator. For such protocols to properly function, all the components have to receive the queries in the same order. In the sequel, we only take into consideration snooping-based protocols.
Figure 2 Generic MSI Cache Controller (automaton over the states I, S, and M; transition labels: load?GetS!data?, store?GetM!data?, evict?, GetM?, load?, store?, GetM?data!, GetS?mem_data!data!, evict?PutM!mem_data!)
Figure 3 Generic MSI Coherency Manager (automaton over the states U and M; transition labels: GetM?data!, GetS?data!, PutM?mem_data?, GetS?mem_data?)
Automata describing a generic snooping-based MSI protocol can be seen in Figure 2 and Figure 3. Figure 2 shows how the state given to a memory element’s copy evolves when receiving a request (store?, load?, or evict?), or a query (GetM? or GetS?). Data exchanges between cache controllers are also represented (data! and data?). Cache controllers do not differentiate between data sent from another cache controller and data sent from the memory controller (both use data?). Sending data to the memory controller, however, is marked as mem_data!. Figure 3 represents the coherency manager, which keeps track of whether the memory has the most up-to-date value for a memory element (state U) or not (state M).
This particular protocol considers that cache controllers delay incoming requests until they are able to use the interconnect, and that transactions cannot take place simultaneously.
These automata actually describe a generic snooping-based MSI protocol. They feature macro-transitions (a succession of atomic transitions). The next section presents a more detailed protocol.
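The macro-transitions of the generic cache controller can be written down as a lookup table. The Python sketch below is one possible reading of Figure 2, not the authors' model: only the I to S transition labelled load?GetS!data? is spelled out in the text, and the assignment of the other labels to edges follows the usual MSI behavior and should be treated as an assumption. The names `GENERIC_MSI`, `step`, and `run` are hypothetical.

```python
from typing import Dict, List, Tuple

# (state, event) -> (messages exchanged in order, next state).
# "x!" is an emission and "x?" a reception, as in the figure labels.
# Edge assignment is an assumption based on standard MSI behavior.
GENERIC_MSI: Dict[Tuple[str, str], Tuple[List[str], str]] = {
    ("I", "load"):  (["GetS!", "data?"], "S"),
    ("I", "store"): (["GetM!", "data?"], "M"),
    ("S", "load"):  ([], "S"),
    ("S", "store"): (["GetM!", "data?"], "M"),
    ("S", "evict"): ([], "I"),
    ("S", "GetM"):  ([], "I"),
    ("M", "load"):  ([], "M"),
    ("M", "store"): ([], "M"),
    ("M", "evict"): (["PutM!", "mem_data!"], "I"),
    ("M", "GetS"):  (["mem_data!", "data!"], "S"),
    ("M", "GetM"):  (["data!"], "I"),
}


def step(state: str, event: str) -> Tuple[List[str], str]:
    """Fire one macro-transition; unlisted pairs are treated as no-ops (assumption)."""
    return GENERIC_MSI.get((state, event), ([], state))


def run(state: str, events: List[str]) -> List[str]:
    """Return every state visited, starting with the initial one."""
    visited = [state]
    for event in events:
        _, state = step(state, event)
        visited.append(state)
    return visited


# A core loads then stores; another controller then asks for S and for M access.
print(run("I", ["load", "store", "GetS", "GetM"]))  # ['I', 'S', 'M', 'S', 'I']
```

Because each entry bundles several message exchanges into one step, this table cannot show what happens between the emission of a query and the reception of its data, which is exactly what the detailed protocol makes visible.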
3 MSI Snooping-Based Protocol
3.1 A Few Caveats
These are the hypotheses made on the targeted hardware. Placing such hypotheses (or lack thereof) is necessary to properly define the targeted cache coherence protocol.
Hypothesis 1 (Non-Atomic Requests). Cores are able to issue load, store, and evict requests to their cache controller regardless of whether the cache controller is currently able to initiate a transaction on the interconnect. In this paper, we consider that this is implemented through the use of a FIFO queue between each cache controller and the interconnect.
Hypothesis 2 (Unique Interconnect). The interconnect is unique. As a result, all cache controllers are able to see all transactions, and those transactions are all seen in the same order. Examples of excluded hardware include many-core processors, which feature a Network-On-Chip.
Hypothesis 3 (Split-Transaction Interconnect). The interconnect supports simultaneous transfer of data and queries and allows multiple transactions to take place simultaneously.
3.2 From Abstract to Concrete Behaviour
In Subsection 2.1, we have seen automata using macro-transitions to describe a generic snooping-based MSI protocol. Let us now look in detail at what makes up the transition from I to S with the load?GetS!data? label. Let us consider two cache controllers, CC0 and CC1, each of which is driven by its own core (CU0 and CU1, respectively) and a memory element. Let us assume that, while CC1 already has read-only access to that memory element, CC0 does not, and that core CU0 issues a load request to acquire it.
In the sequence diagram of Figure 4, we see the behavior of all components involved. Once the core issues the load request, the cache controller generates a GetS query to the interconnect. The latter broadcasts the GetS to all cache controllers, including the query’s originator, and the coherency manager. As the owner of the data is the memory, the coherency manager transmits that query to the memory controller, which, in turn, sends the data to the core CU0.
In order to expose the interference, we need to model the atomic transitions and intermediate states, such as the ones shown in the figure.
3.3 Detailed Snooping-Based MSI Protocol
Instead of representing the full automaton as a graph, we use a matrix representation (see Figure 6). The first column details every possible state. As in [20], the naming of each state is determined by the following reasoning: Invalid (I), Shared (S), and Modified (M) are the three stable states of the MSI protocol. The other states are transient. Reception of a request that requires use of the interconnect will usually lead to a XYBD transient state, which means that the cache controller is handling a transition between the stable states X and Y, with (B) indicating that this transition requires the acquisition of the interconnect and (D) the reception of a related data reply (whether it comes from another cache controller or the memory). This can be followed by XYD if the cache controller sees its own query before receiving a reply, or XYB if a reply is received before the query is processed. The latter happens when, despite processing all queries in the same order, not all cache controllers take the same time to do so. Another possibility is for an external query to be received when in the XYD state. Indeed, at that point, the system pretty much considers that the cache controller is in the Y state and thus has the responsibilities that the Y state would require. This makes it possible for a cache controller to see a query it needs to act upon before being actually ready to do so (e.g. observing a GetM query while waiting for data). These states have a XYDA form (which means that when all is handled, the cache controller ends up in the A state), or XYDAB (which ends up leading to the B state). As it may be that the required action is to reply to said query, it is sometimes necessary to remember the originator of the query. This is marked as r←s.
Figure 4 load request (sequence diagram between CU0, CC0, CU1, CC1, the interconnect, the coherency manager, and the memory controller; CC0 goes through I, ISBD, ISD, and S, CC1 stays in S, and the coherency manager stays in U)
Figure 5 Double store (sequence diagram between CU0, CC0, CU1, CC1, CU2, CC2, the interconnect, and the coherency manager)
When the core makes a request (load, store or evict), the second macro-column indicates how the cache controller behaves. The a/b notation denotes the emission of an a message on the interconnect, followed by a transition of the memory element copy’s state to b. For instance, the cell for a load from the I state indicates that a GetS query is generated and that the reached state is ISBD. We recognize the beginning of the sequence diagram described in Figure 4. Grayed out cells indicate situations that cannot occur in the protocol, due to our hypotheses.
The third macro-column (named Interconnect access) indicates what happens when the previously queued query is broadcast on the interconnect. When in the ISBD state, we know that, at some point, our previously queued GetS query is going to be broadcast on the interconnect. This will result in reaching the ISD state. As a side note, if the core makes a second load request on the same memory element while the copy is ISBD, that new request is stalled.
Columns: State | Core Request: load, store, evict | Interconnect Access | Data Reply | Received Queries: GetS, GetM, PutM. Blank cells are the grayed out ones.
State | load      | store     | evict    | Interconnect Access | Data Reply          | GetS                  | GetM          | PutM
I     | GetS/ISBD | GetM/IMBD |          |                     |                     | -                     | -             | -
ISBD  | stall     | stall     | stall    | -/ISD               | -/ISB               | -                     | -             | -
ISB   | stall     | stall     | stall    | -/S                 |                     | -                     | -             |
ISD   | stall     | stall     | stall    |                     | -/S                 | -                     | -/ISDI        |
ISDI  | stall     | stall     | stall    |                     | -/I                 | -                     | -             |
IMBD  | stall     | stall     | stall    | -/IMD               | -/IMB               | -                     | -             | -
IMB   | stall     | stall     | stall    | -/M                 |                     | -                     | -             | -
IMD   | stall     | stall     | stall    |                     | -/M                 | r←s; -/IMDS           | r←s; -/IMDI   |
IMDI  | stall     | stall     | stall    |                     | r!data; -/I         | -                     | -             |
IMDS  | stall     | stall     | stall    |                     | r!data; m!data; -/I | -                     | -/IMDSI       |
IMDSI | stall     | stall     | stall    |                     | r!data; m!data; -/I | -                     | -             |
S     | hit       | GetM/SMBD | -/I      |                     |                     | -                     | -/I           |
SMBD  | hit       | stall     | stall    | -/SMD               | -/SMB               | -                     | -/IMBD        |
SMB   | hit       | stall     | stall    | -/M                 |                     | -                     | -/IMB         |
SMD   | hit       | stall     | stall    |                     | -/M                 | r←s; -/SMDS           | r←s; -/SMDI   |
SMDI  | hit       | stall     | stall    |                     | r!data; -/I         | -                     | -             |
SMDS  | hit       | stall     | stall    |                     | r!data; m!data; -/S | -                     | -/SMDSI       |
SMDSI | hit       | stall     | stall    |                     | r!data; m!data; -/I | -                     | -             |
M     | hit       | hit       | PutM/MIB |                     |                     | m!data; s!data; -/S   | s!data; -/I   |
MIB   | hit       | hit       | stall    | m!data; -/I         |                     | m!data; s!data; -/IIB | s!data; -/IIB |
IIB   | stall     | stall     | stall    | -/I                 |                     | -                     | -             | -
Legend: Handling Requests, Handling Queries
Figure 6 Cache Controller Memory Element State Changes (adapted from [20])
The fourth macro-column describes the behavior upon reception of a data reply.
The fifth macro-column (named Received Queries) defines the behavior of the cache controller when snooping a transiting query that is not its own (which would otherwise pertain to the third macro-column). For instance, from state S, when snooping a GetS, the cache controller does not do anything, as can be seen with core CU1 in the sequence diagram of Figure 4.
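The matrix can be read as a transition table indexed by the current state and the incoming event. The Python sketch below replays the load request of Figure 4 using only the rows that lie on the path from I to S. It is an illustration under assumptions: the event name `own_query` (standing for the Interconnect access column), the `ROWS` dictionary, and the `CacheController` class are hypothetical names, and every row outside this small subset is deliberately left out.

```python
from collections import deque

# Subset of the cache controller matrix: rows on the I -> S path only.
# (state, event) -> (query to enqueue or None, next state)
ROWS = {
    ("I", "load"): ("GetS", "ISBD"),
    ("ISBD", "own_query"): (None, "ISD"),   # own GetS seen on the interconnect
    ("ISBD", "data"): (None, "ISB"),        # data arrived before own GetS was seen
    ("ISB", "own_query"): (None, "S"),
    ("ISD", "data"): (None, "S"),
    ("ISD", "GetM"): (None, "ISDI"),
    ("ISDI", "data"): (None, "I"),
}
TRANSIENT = {"ISBD", "ISB", "ISD", "ISDI"}
CORE_REQUESTS = {"load", "store", "evict"}


class CacheController:
    def __init__(self):
        self.state = "I"
        self.queue = deque()  # FIFO of queries waiting for the interconnect

    def handle(self, event):
        """Apply one event and return the outcome (new state, 'stall', or 'hit')."""
        if self.state in TRANSIENT and event in CORE_REQUESTS:
            return "stall"  # the core waits until the transaction completes
        if (self.state, event) == ("S", "load"):
            return "hit"
        query, next_state = ROWS[(self.state, event)]  # KeyError: row not modeled
        if query is not None:
            self.queue.append(query)
        if event == "own_query":
            self.queue.popleft()  # the queued query has been broadcast
        self.state = next_state
        return next_state


cc0 = CacheController()
trace = [cc0.handle(e) for e in ["load", "load", "own_query", "data", "load"]]
print(trace)  # ['ISBD', 'stall', 'ISD', 'S', 'hit']
```

The second load is stalled because the copy is still in ISBD, which matches the side note made about the third macro-column.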
Replying with a message d, meant for t (t = m when sending to the memory controller and the coherency manager, t = s when sending to the cache controller that initiated the transaction, and t = r when sending to the initiator of an earlier query) is written as t!d.
Example 3. Let us have a look at a more complex behavior: two cores attempting to modify the same memory element. This is illustrated in the sequence diagram of Figure 5. CC0 starts with read-and-write access to the memory element (its copy being in the M state), neither CC1 nor CC2 have a copy (state I), and the coherency manager knows that its value is out of date (state M).
The sequence starts when CU0 issues an evict request and both CU1 and CU2 issue a store request. CC0 receives the evict request, queues a PutM query and now considers the memory element to be MIB (that is, “was Modified, will be Invalid once access to the interconnect is granted”). On the other hand, the other two caches receive their store requests, queue a GetM query, and now consider the memory element to be IMBD (“was Invalid, will be Modified after access to the interconnect and reception of a data reply”).
All the cache controllers want to access the interconnect. The internal behavior of the interconnect determines which one is served first. Most of the time, the interconnect is based on Fair-RR (Round Robin) [11]. In this scenario, the interconnect first broadcasts the GetM query from CC1’s queue, which is now empty.
CC1, seeing its own query, confirms that it has accessed the interconnect, and switches to the IMD state to await a data reply. The coherency manager ignores the query. Seeing CC1’s GetM query passing through the interconnect, CC0 has to reply with a data message (this corresponds to s!data in the protocol definition), containing its value for the memory element, and to transition to the IIB state.
CC2’s GetM query is then broadcast. As it is about to receive the data with read-and-write access, CC1 is the component that should reply to CC2’s query. Not having the data yet, CC1 is currently unable to do so. Instead, it transitions to the IMDI state, remembering that it should send the data to CC2 as soon as possible.
Finally receiving the data, CC1 completes CU1’s request, sends the updated data to CC2 and transitions to the I state (as CC2 wants read-and-write access).
CC2 receives the data and completes CU2’s request.
CC0’s PutM is broadcast, but has been superseded by a previous GetM and thus causes no reaction in the other cache controllers or the coherency manager. CC0 transitions to I, completing its core’s request.
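3.4 Coherency Manager
Columns: State | Received Queries: GetS, GetM, PutM (Owner), PutM (Other) | Data Reply: data. Blank cells are the grayed out ones.
State | GetS      | GetM             | PutM (Owner) | PutM (Other) | data
U     | s!data    | s!data; o←s; -/M |              | -            |
UD    | stall     | stall            | stall        | -            | -/U
UB    | o←∅; -/U  | -                | o←∅; -/U     | -            |
M     | o←∅; -/UD | o←s              | o←∅; -/UD    | -            | -/UB
Figure 7 Coherency Manager Memory Element State Changes (adapted from [20])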
Figure 7 shows how the coherency manager keeps track of whether the RAM has the most up-to-date value for a memory element (state U) or if a cache controller does (state M). This is used to determine if the RAM should be the one to reply when either a GetS or a GetM query passes through the interconnect. The U state indicates that the RAM currently has the most up-to-date value. The UD state indicates that the RAM should be the one to respond to queries, but it still hasn’t received the latest value. Unlike the cache controller, it will not switch to a dedicated state but instead force queries from the interconnect to stall until the problematic query can be fulfilled. UB indicates that the RAM has received the latest value, but has not yet seen the query that led this data to be sent.
The exact cache controller currently in charge of the memory element is kept track of. Changes of ownership are marked as o←s (the query originator becomes the new owner) and o← ∅ (there is no longer an owner, meaning that the RAM is currently responsible for it).
Example 4. Back to the sequence diagram of Figure 5 and to Example 3, let us observe the behavior of the coherency manager. The coherency manager reacts to each GetM query, updating its internal state to reflect the change of ownership. Thus, the coherency manager starts by considering that CC0 is the only one to have a valid (i.e. up-to-date) value of the memory element, then, upon seeing the first GetM, considers CC1 to be responsible for it (o←s in the table). As a result, at the end of the execution, the coherency manager knows that the PutM query is originating from a cache controller that is not currently responsible for that memory element and can thus safely ignore it.
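The ownership bookkeeping of this example can be replayed in a few lines. The Python sketch below is a simplification under stated assumptions: it only models the owner variable o for GetM and PutM queries, it ignores GetS queries, data replies, and the U, UD, and UB states, and the `track_owner` function is a hypothetical helper rather than part of the authors' model.

```python
from typing import List, Optional, Tuple

Query = Tuple[str, str]  # (kind, sender), e.g. ("GetM", "CC1")


def track_owner(initial_owner: Optional[str],
                queries: List[Query]) -> Tuple[Optional[str], List[Query]]:
    """Replay queries in broadcast order; return (final owner, ignored queries)."""
    owner = initial_owner
    ignored: List[Query] = []
    for kind, sender in queries:
        if kind == "GetM":
            owner = sender                   # o <- s: the originator becomes owner
        elif kind == "PutM":
            if sender == owner:
                owner = None                 # o <- empty: the RAM is responsible again
            else:
                ignored.append((kind, sender))  # PutM (Other): safely ignored
    return owner, ignored


# Broadcast order of Example 3: CC1's GetM, CC2's GetM, then CC0's PutM.
owner, ignored = track_owner("CC0", [("GetM", "CC1"), ("GetM", "CC2"), ("PutM", "CC0")])
print(owner)    # CC2
print(ignored)  # [('PutM', 'CC0')]
```

By the time CC0's PutM is broadcast, ownership has already moved to CC2, so the query falls in the PutM (Other) case.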
4 Interference
Columns: State | Received Queries: GetS, GetM, PutM. Blank cells are the grayed out ones.
State | GetS | GetM | PutM
I     | Mi.  | Mi.  | Mi.
ISBD  | Mi.  | Mi.  | Mi.
ISB   | Mi.  | Mi.  |
ISD   | Mi.  | Ex.  |
ISDI  | Mi.  | Mi.  |
IMBD  | Mi.  | Mi.  | Mi.
IMB   | Mi.  | Mi.  | Mi.
IMD   | De.  | Ex.  |
IMDI  | Mi.  | Mi.  |
IMDS  | Mi.  | Ex.  |
IMDSI | Mi.  | Mi.  |
SMBD  | Mi.  | Ex.  |
SMB   | Mi.  | Mi.  |
SMD   | De.  | Ex.  |
SMDI  | Mi.  | Mi.  |
SMDS  | Mi.  | Ex.  |
SMDSI | Mi.  | Mi.  |
M     | De.  | Ex.  |
MIB   | Ex.  | Ex.  |
IIB   | Mi.  | Mi.  | Mi.
Legend: Mi. = Minor, Ex. = Expelling, De. = Demoting
Figure 8 Occurrences of Interference
Let us now categorize how a cache controller may be negatively affected by the actions of another. Figure 8 summarizes the occurrences of each interference category. In Figures 9, 10, and 11, the dark gray area indicates when the cache controller is unavailable due to having to handle the incoming query (deciding how to act and, potentially, updating its internal state), and the light gray area shows when its core’s next request for that memory element may be negatively impacted by the change of state.
Definition 5 (Minor Interference). Cache controllers have actions to perform upon receiving any type of request. Because of this, every time a cache controller has to deal with an incoming query, there is a very small amount of time during which it cannot be used by its core. We call this unavailability period minor interference. While the effect of each minor interference is so small as to be considered negligible, their accumulation most definitely is not. Indeed, minor interferences are one of the main motivations behind the use of a directory-based coherency protocol (in which minor interferences are only experienced by cache controllers likely to have a use for that query) over a snooping-based one (in which all cache controllers are affected by every query).
Figure 9 shows an example of minor interference: the CC1 cache controller has to process the GetS broadcast, despite that message not requiring any reply or internal state update from CC1.
Figure 9 Minor
Figure 10 Expelling
Figure 11 Demoting
Definition 6 (Expelling Interference). To maintain the principles of cache coherency, it may be required for a cache controller to dispose of its copy of a memory element, relinquishing its access rights. This is caused by another cache controller demanding read-and-write access to that memory element (a GetM query). We have, however, marked the reception of a GetS query for an element in the MIB state as being an expelling interference in Figure 8. It could be argued that reaching the MIB state indicates that the cache controller is already in the process of evicting its copy of the memory element. But, as the MIB state allows immediate (i.e. hit) access for both writing and reading that memory element, we still consider this event to have a negative impact.
Figure 10 shows an example of expelling interference: the CC1 cache controller, receiving a demand for read-and-write access, is forced to relinquish its read-only copy.
Definition 7 (Demoting Interference). Another type of interference is the demoting interference, in which a cache controller has to abandon its writing access rights to a memory element, while retaining its reading access.
Figure 11 shows an example of demoting interference: the CC1 cache controller, receiving a demand for read access on that memory element, has to update the value held in the main memory and go from read-and-write access to read-only access.
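For the three stable states, these categories can be told apart by comparing the access rights held before and after a received query. The Python sketch below is a simplification that assumes stable states only: transient states, and special cases such as a GetS received in the MIB state, are not covered. The `ACCESS` ranking and the `classify` function are hypothetical names introduced for illustration.

```python
# Access rights granted by each stable state: none, read-only, read-and-write.
ACCESS = {"I": 0, "S": 1, "M": 2}


def classify(before: str, after: str) -> str:
    """Classify the interference caused by a received query (stable states only)."""
    if ACCESS[after] == ACCESS[before]:
        return "minor"      # the query was processed but changed nothing
    if ACCESS[after] == 0:
        return "expelling"  # the copy is lost, along with all access rights
    if ACCESS[before] == 2 and ACCESS[after] == 1:
        return "demoting"   # write access is lost, read access is kept
    raise ValueError("a received query never increases access rights")


print(classify("S", "S"))  # minor: GetS snooped while in S (Figure 9)
print(classify("S", "I"))  # expelling: GetM snooped while in S (Figure 10)
print(classify("M", "S"))  # demoting: GetS snooped while in M (Figure 11)
```

Every received query therefore causes at least a minor interference, and the other two categories add a lasting cost on top of it.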
5 Formal Modeling of Real-Time Systems with Timed Automata
To expose the interference presented in the previous section, we chose to use formal methods. More precisely, we are relying on timed automata [1] to model and analyze our system.
A timed automaton is an extended automaton with variables and clocks. During the system’s execution, the state of a timed automaton is defined as a location, the value of its integer variables and of its clocks. The evolution of these integer variables is controlled by the automaton’s transitions, whereas all of the system’s clocks progress at the same rate, following the passing of time. To indicate that a location should be left immediately, UPPAAL [4] offers the following location modifiers:
Urgent: The location must be left before any time passes.
Committed: The location must be left before any time passes, and the next transition must originate from a committed location.
Invariant φ: The location is defined only if a linear constraint φ holds true. φ may reason over the automaton’s integer variables, clocks, or both.
The automata transitions are composed of the following:
Guard: Prerequisite (linear constraint) for this transition to be able to fire. The condition uses the automaton’s integer variables, clocks, or both.
Synchronization: Allows more than one automaton to transition during a step, by synchronizing multiple transitions over a channel. The channel can be used in either receiver (with a ? suffix) or sender (with a ! suffix) mode. On a channel that was declared without modifier, the transition requires exactly two automata to synchronize during this step: the sender, and the receiver. It is also possible for a channel to have been declared as a broadcast channel, in which case the sender synchronizes with all available receivers. Furthermore, the channel may have been declared as urgent, which prevents waiting in a location if the synchronization can occur. Finally, priorities between channels may be put in place.
Update: Sequence of instructions to alter the automaton’s integer variables, or reset its clocks.
Select: The transition selects the given integer variables’ next value from a specified range.
Example This subsection presents an example of a UPPAAL model: a processor attempts
to read a variable, which may be either in RAM or in its cache. The automaton in Figure 12
corresponds to the core, the one in Figure 13 to the RAM controller, and the remaining
one (Figure 14) is used to mark a transition as urgent by having an automaton always
ready to synchronize on a dedicated urgent channel (FORCE_URGENT). In this model, the
FORCE_URGENT and READ_LINE channels are both declared as urgent.
Figure 12 Core and cache
Figure 13 RAM
Figure 14 Urgence
The Core Automaton (Figure 12) Its initial location is marked as committed, meaning
that it is left immediately. The exiting transition sets the x clock to 0, and the var_is_cached
variable to a value in the [0, 1] range. The x clock is used to measure how long it takes for
the processor to get its variable. Two transitions can fire from the S1 location, depending
on whether the targeted variable is cached or not. If it is cached, the transition
labeled FORCE_URGENT is the only one that can fire. It synchronizes with the automaton of
Figure 14, which forces it to be taken as soon as possible. Additionally, the transition increases an
integer variable that counts the number of times a variable was found in the cache. Taking
said transition leads to a location in which the only exiting transition requires the x clock to
equal 1 unit of time before arriving in the Done location.
If the variable was not in the cache, the other transition from S1 is active and leads
to a synchronization on the READ_LINE channel, which is also to be taken as soon as possible. This
time, however, it is possible for that synchronization to not be immediately available, as
the RAM controller automaton may be handling another query and thus not be ready to
synchronize as it would not be in its initial location. This also justifies not marking the
location as urgent or committed: the automaton may have to wait an unknown amount of
time. Once the synchronization does happen, an integer variable counting the number of
times the variable was not found in the cache is incremented, then the automaton waits for
the RAM automaton to synchronize on the REPLY channel before considering it has acquired
the variable.
The RAM Controller Automaton (Figure 13) Its initial location awaits synchronization
on the READ_LINE channel. Since READ_LINE is urgent, the transition happens as soon as
possible. It resets the automaton’s time clock back to 0. The synchronization leads to a
location which cannot be occupied for more than 2 units of time, as defined by
the invariant. To ensure that the automaton stays in this location for exactly 2 units of
time, the only exiting transition has a guard requiring that 2 units of time have elapsed. This
transition also requires a synchronization on the REPLY channel before allowing a return to
the automaton’s initial location.
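As a rough illustration of the timing behavior of this example, the following Python sketch (not part of the UPPAAL model) computes the value of each core’s x clock when it reaches Done. It assumes that all cores start at time 0, that a hit costs 1 time unit, that a miss costs exactly the 2 time units spent in the RAM controller, and that concurrent misses are served one after the other. The function and constant names are ours.

```python
from itertools import product

CACHE_HIT_TIME = 1   # the hit branch is left when x == 1
RAM_ACCESS_TIME = 2  # the RAM controller stays busy for exactly 2 time units


def completion_times(cached_flags):
    """Return the value of each core's x clock when it reaches Done.

    cached_flags[i] is the var_is_cached choice of core i. Misses are served
    one at a time, in list order, which stands for one possible interleaving.
    """
    ram_free_at = 0
    times = []
    for cached in cached_flags:
        if cached:
            times.append(CACHE_HIT_TIME)
        else:
            # Wait for the RAM controller to be free, then for its REPLY.
            ram_free_at += RAM_ACCESS_TIME
            times.append(ram_free_at)
    return times


def worst_case(num_cores):
    """Largest completion time over every combination of hits and misses."""
    return max(
        max(completion_times(flags))
        for flags in product([True, False], repeat=num_cores)
    )
```

Under these assumptions, two cores that both miss complete at times 2 and 4, so the worst case for a dual core configuration is 4 time units.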
6 Model of the Cache Coherence
This section describes the general ideas behind how we modeled the cache coherence in
UPPAAL. We have released the model under an LGPL v3 license at
https://www.onera.fr/sites/default/files/598/ecrts19.zip.
6.1 Modeling Strategy
The model contains one automaton per component present in Figure 1, an automaton in
charge of synchronizing on the FORCE_URGENT channel (in an identical manner to the one in
Figure 14), as well as message queues for both queries and data (Sub-section 6.6). Each core
runs exactly one program. To change the number of cores, one simply has to add or remove
cores (and associated cache controllers) and to change the value of a dedicated system-wide
constant. Moreover, each component has a unique identifier, which is used both to target
a specific automaton on some synchronization, and to indicate the emitter of requests and
queries.
The states and transitions seen on the automata do not visibly reflect any program
or protocol. This means that the stable states (M, S, I) and the transient states (IS^BD,
IS^D, ...) will not appear explicitly. Instead, the automata’s designs are focused on their
synchronizations, with the logic (and state) of the protocols being held in their variables.
As such, the same automaton can easily be used for any program or protocol
(provided the hypotheses from Sub-Section 3.1 still hold), only requiring small changes in the
definition of the functions found in its transitions.
Priorities on synchronizations are used to reduce the number of redundant system states.
For example, any transition that exits a waiting location (i.e. a location in which nothing
happens until a clock has reached a given value) has a higher priority than any other type
of transition.
6.2 Core
Programs are modeled using arrays of address-targeting instructions, not unlike
their binary executable. These arrays only contain instructions related to memory
accesses (INSTR_LOAD, INSTR_STORE, INSTR_EVICT), and one (INSTR_END) to indicate that
the execution of the program is completed. An example can be seen in Figure 16.
Figure 15 Model of the Core
program_line_t program_0[7] =
{
    { INSTR_LOAD,  1 },
    { INSTR_LOAD,  2 },
    { INSTR_STORE, 3 },
    { INSTR_LOAD,  3 },
    { INSTR_STORE, 1 },
    { INSTR_EVICT, 1 },
    { INSTR_END,   0 }
};
Figure 16 Model of a Program
The automaton corresponding to the core is shown in Figure 15. Progress of the program’s
execution is tracked by the program_counter, which is incremented each time an instruction
has been started. Another integer variable, received_acks, counts how many times the cache
controller has confirmed that a request has been fulfilled. The sending of each instruction to
the cache controller is separated by at least the time of a clock cycle.
To ensure that synchronization occurs with the right automaton, the request uses the
cache controller’s identifier to select a sub-channel of CPU_REQ. Conversely, acknowledgments
are received on the sub-channel of CPU_ACK corresponding to the core’s identifier. Upon
reaching the INSTR_END instruction, the automaton has to wait until all of its outstanding
requests have been fulfilled before being able to reach the TERMINATED location.
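The bookkeeping done by the core automaton can be sketched in Python as follows. This is an illustration only, not the UPPAAL model: the class and method names are ours, timing (the clock cycle between two requests) is left out, and the program is the one from Figure 16.

```python
INSTR_LOAD, INSTR_STORE, INSTR_EVICT, INSTR_END = range(4)

# Same content as the program of Figure 16.
PROGRAM_0 = [
    (INSTR_LOAD, 1), (INSTR_LOAD, 2), (INSTR_STORE, 3), (INSTR_LOAD, 3),
    (INSTR_STORE, 1), (INSTR_EVICT, 1), (INSTR_END, 0),
]


class Core:
    def __init__(self, program):
        self.program = program
        self.program_counter = 0  # incremented when an instruction is started
        self.received_acks = 0    # incremented on each cache controller ack

    def next_request(self):
        """Return the next (instruction, address) pair, or None at INSTR_END."""
        instruction, address = self.program[self.program_counter]
        if instruction == INSTR_END:
            return None
        self.program_counter += 1
        return instruction, address

    def acknowledge(self):
        self.received_acks += 1

    def terminated(self):
        """True once INSTR_END is reached and every request was acknowledged."""
        at_end = self.program[self.program_counter][0] == INSTR_END
        return at_end and self.received_acks == self.program_counter
```

A core running PROGRAM_0 sends six requests and only reports termination after the sixth acknowledgment.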
6.3 Coherency Manager and Memory Controller
Figure 17 Model of the Coherency Manager
Figure 18 Model of the Memory Controller
The timed automaton modeling the coherency manager can be seen in Figure 17. The
coherency manager has to know for which memory elements the RAM copy is to be considered
as superseded by a cache controller. For this purpose, it maintains an array associating a
state to each memory element address. The size of this array must be able to accommodate
all cache controllers having their caches full of superseding copies of memory elements. In
effect, |mem_array| = |cache_array| × |caches|.
After initializing its array with default values, the timed automaton waits for either a
cache controller query or a data message. Receiving either of these leads to an update of the
internal state associated with the related memory element, as described by the array in
Figure 7.
Upon receiving a cache controller query, the update to the internal state may indicate
the need to provide data from the RAM, leading the automaton to synchronize with the
memory controller to wait for RAM_READ_TIME units of time before providing a reply to the
query’s originator. Alternatively, when receiving data, the automaton synchronizes with
the memory controller to wait for RAM_WRITE_TIME units of time. The memory controller’s
automaton is shown in Figure 18. It has a local clock, clk, which is used to wait either
RAM_WRITE_TIME or RAM_READ_TIME, depending on what the coherency manager demands.
6.4 Interconnect
Figure 19 Model of the Interconnect
Figure 19 shows the timed automaton for the interconnect. It starts (S1) by waiting for
cache controllers to synchronize through the ADD_BUS_MASTER channel so that they can be added to
the bus policy. The order in which the cache controllers make that synchronization is not
deterministic. This results in all possible orders being explored when analyzing the system.
Once all cache controllers have been added, the automaton proceeds and synchronizes with
all the other components by broadcasting on the SYS_INIT channel.
Using a component identifier to select the appropriate sub-channel, the interconnect
awaits either an incoming cache controller query, or a notice that the cache controller does
not have any to send (Ready). If the latter happens, the access policy is followed to determine
which cache controller should be made able to send its query (e.g. with a Fair-Round-Robin
the next cache controller is chosen). With the former, the query is first received by the
interconnect (Ready→S2), then, in a second transition (S2→Ready), it is broadcast to all
components that listen for cache controller queries. This broadcast is stalled if any of the
components that need to receive it indicate that they are not ready to do so (e.g. because
their incoming query queue is full).
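The arbitration described above can be approximated by the following Python sketch. It is a simplification written for illustration, not the automaton of Figure 19: the class and method names are ours, the two transitions Ready→S2 and S2→Ready are folded into a single call that the caller retries while the broadcast is stalled, and, since the text does not say whether the turn advances after a broadcast, the sketch leaves it unchanged in that case.

```python
class Interconnect:
    """Round-robin arbitration among registered cache controllers."""

    def __init__(self):
        self.masters = []  # filled in ADD_BUS_MASTER synchronization order
        self.turn = 0      # index of the controller currently allowed to send

    def add_bus_master(self, controller_id):
        self.masters.append(controller_id)

    def current_master(self):
        return self.masters[self.turn]

    def step(self, query, receivers_ready):
        """Process the current master's answer and return the broadcast query.

        `query` is None when the master has nothing to send. `receivers_ready`
        holds one boolean per component listening for queries.
        """
        if query is None:
            # Nothing to send: follow the access policy and pick the next one.
            self.turn = (self.turn + 1) % len(self.masters)
            return None
        if not all(receivers_ready):
            return None  # broadcast stalled until every receiver is ready
        return query
```

Because the registration order is not deterministic in the model, a faithful exploration would repeat any analysis for every permutation of `masters`.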
6.5 Cache Controller
The automaton used to model a cache controller is rather complex. As previously stated,
it does not feature any of the states found in the protocol description (e.g. the ones of the
matrix in Figure 6). Instead, this automaton keeps an array that indicates the protocol state
associated with a given memory element. The automaton starts by synchronizing with the
interconnect so that it is taken into account by the interconnect’s access policy (S0→S1). It
then waits for the broadcast on the SYS_INIT channel (S1→Ready).
Figure 20 Model of the Cache Controller
CPU Communications Each cache controller has a queue of outstanding requests from its
core, as well as a queue of completed requests to inform the core of. Both queues are first in,
first out. Upon receiving a request from its core (middle Ready→S5 transition), the cache
controller attempts to find a line in its array either corresponding to the associated address,
or, if none exists, one that is not currently used (Invalid). If no such line is found, the
request is stalled, meaning that it is simply put in the outstanding requests queue for later.
Otherwise, the behavior of the cache controller depends on the cache coherence protocol and
the state held by the line, as indicated in Figure 6. If the eviction policy is applicable
and no line can currently accommodate the request, an automated eviction occurs. The
request is re-evaluated once the eviction has been completed (leftmost Ready→S5
transition). In our model, we use an accurate LRU eviction policy, meaning that the cache
controller keeps track of the order in which its cache lines have been used and will allow an
automated eviction to occur if the least recently used line is in a state for which the
protocol does not indicate a stall in case of an evict request.
There are two possible reasons for a request to be acted upon: it is an incoming request
from a core, or it is a previously stalled request on a memory element which just changed
state. In both cases, the protocol indicates one of the following outcomes:
hit: the request is moved to the completed requests queue. The handling of stalled requests
continues. This also counts as a use of the line according to the eviction policy, if the
request is not an evict.
stall: the request is put in the outstanding requests queue, if it is not already there. The
handling of stalled requests is stopped.
msg/state: the state of the line is set to state, the request is put in the outstanding requests
queue, if it is not already there. If this is encountered during the un-stalling of requests,
the request is re-evaluated. In the latter case, this counts as a use of the line according
to the eviction policy, if the request is not an evict.
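The line lookup and the LRU eviction rule described in this subsection can be sketched in Python as follows. This is a minimal illustration under our own assumptions: a cache is a list of lines holding an address and a protocol state, the state names other than Invalid are placeholders, and whether the protocol stalls an evict request in a given state is supplied by the caller, since that information lives in the protocol table of Figure 6.

```python
INVALID = "I"


def find_line(lines, address):
    """Return the index of the line holding `address`, else of a free line.

    `lines` is a list of {"address": int, "state": str} entries. None is
    returned when no line matches and none is Invalid.
    """
    for index, line in enumerate(lines):
        if line["state"] != INVALID and line["address"] == address:
            return index
    for index, line in enumerate(lines):
        if line["state"] == INVALID:
            return index
    return None


def lru_victim(lines, use_order, stalls_on_evict):
    """Return the least recently used line if it may be evicted, else None.

    `use_order` lists line indices from least to most recently used.
    `stalls_on_evict(state)` tells whether the protocol answers an evict
    request with a stall in that state.
    """
    candidate = use_order[0]
    if stalls_on_evict(lines[candidate]["state"]):
        return None
    return candidate
```

Only the least recently used line is considered: if the protocol would stall its eviction, no other line is tried and the core request stays stalled.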
Interconnect Communications Handling of pending incoming queries is done through the
Ready→S2→S3 transitions. This updates the internal state of the cache following what
was indicated by Figure 6 and has a waiting period that accounts for the simulated query
handling time period. Handling of pending incoming data is similar (Ready→S3). The S3
location is where data emission is handled. Data can be sent to either memory or another
cache controller (the latter introducing yet another delay). This data is actually sent to
a FIFO queue and not to the other components directly. When there is no data to send,
the S3→S5 transition evaluates the impact the changes had on the currently stalled core
requests.
6.6 Message Queues
Figure 21 Model of the Data FIFOs
Figure 22 Model of the Query FIFOs
Access to the bus is done through message queues. We use separate automata for data and
query queues to avoid over-encumbering the automata that use them (we would otherwise
need to add their transitions to nearly all the locations of the cache controller automaton).
These automata actually handle both an incoming and outgoing queue. Each cache controller
has a dedicated instance of both automata. The memory controller has an instance of the
data queues automaton.
The data and query queues automata are fairly straightforward, having one transition to
take and one transition to push items in either direction. However, the actual condition for
incoming queries to be allowed in is hidden behind a shared variable. Indeed, the queries
come from broadcasts made by the bus and UPPAAL does not allow conditions on transitions
receiving from a broadcast channel. Thus, the condition of having all query message queues
ready to receive is actually handled on the side of the interconnect.
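The following Python sketch illustrates the idea of checking readiness on the sender’s side. It is not the automaton of Figure 22: the bounded capacity, the `ready` property standing for the shared variable, and the `broadcast` helper standing for the interconnect are all assumptions made for the example.

```python
from collections import deque


class QueryQueue:
    """Bounded FIFO holding the incoming queries of one component."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.incoming = deque()

    @property
    def ready(self):
        # Plays the role of the shared variable read by the interconnect.
        return len(self.incoming) < self.capacity

    def push(self, query):
        if not self.ready:
            raise OverflowError("queue is full")
        self.incoming.append(query)

    def take(self):
        return self.incoming.popleft()


def broadcast(query, queues):
    """Deliver `query` to every queue, or to none if one of them is full."""
    if not all(queue.ready for queue in queues):
        return False  # the sender stalls; receivers have no guard of their own
    for queue in queues:
        queue.push(query)
    return True
```

The receivers never refuse a message themselves: the check is done once, before the broadcast, which mirrors the restriction on receiving transitions mentioned above.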
7 Checking Properties
UPPAAL lets users check whether their model verifies properties. These properties can be
used to know if at least one (E) or all (A) execution paths always (□) or at least once (♦)
verify a given formula over the automata’s clocks, integer variables, or locations. In addition,
UPPAAL has an operator that looks for the highest value reachable by an automaton’s
clock or integer variable.
For example, taking the system from Section 5, with two CPUs (C0 and C1), we can
know if both processors always end up getting their variable (all paths lead to both automata
reaching the Done location, A♦(C0.Done && C1.Done)), or the longest time it would take for
one of them to do so (what is the maximum value the clock can reach before the automaton
reaches its Done location, sup{not C0.Done}: C0.x).
program_line_t program_200[11] =
{
    { INSTR_STORE, 1 },
    { INSTR_STORE, 2 },
    { INSTR_LOAD,  1 },
    { INSTR_STORE, 1 },
    { INSTR_LOAD,  2 },
    { INSTR_STORE, 2 },
    { INSTR_LOAD,  1 },
    { INSTR_STORE, 1 },
    { INSTR_LOAD,  2 },
    { INSTR_STORE, 2 },
    { INSTR_END,   0 }
};
Figure 23 Program Model 200
program_line_t program_201[11] =
{
    { INSTR_STORE, 3 },
    { INSTR_STORE, 4 },
    { INSTR_LOAD,  3 },
    { INSTR_STORE, 3 },
    { INSTR_LOAD,  4 },
    { INSTR_STORE, 4 },
    { INSTR_LOAD,  3 },
    { INSTR_STORE, 3 },
    { INSTR_LOAD,  4 },
    { INSTR_STORE, 4 },
    { INSTR_END,   0 }
};
Figure 24 Program Model 201
7.1 Exposing Interference
Using such properties, we are able to expose the interference in a number of ways. The
example we use to showcase them is that of a dual core on which two instances of
the program modeled by Figure 23 are running.
Counting Hits & Misses: An easy metric to measure is the number of cache hits
and misses for each address. This can be achieved by simply looking at the state of
the memory element upon reception of a core’s request, and increasing the right integer
variable accordingly (much like in Section 5).
In the dual core example, this shows that each core has 2 cache hits and 3 cache misses
for the first address; one core has 2 cache hits and 3 cache misses for the second address,
whereas the other has 1 cache hit and 4 cache misses.
Counting All Occurrences: We can expose interference by counting all of its
occurrences, without regard for whether it had an impact on the system’s execution or not.
In effect, this equates to having one integer variable per address and type of interference,
and increasing the right one according to what is described in Figure 8.
When applied to the dual core example, we can see that for the second address, both
caches have 4 occurrences of minor interference, 1 occurrence of demoting interference,
and 2 occurrences of expelling interference. For the first address, one cache has 4 minor,
1 demoting, and 1 expelling, whereas the other has 3 minor, no demoting, and 3 expelling.
Counting Meaningful Occurrences: Another pertinent piece of information is an account of
the interference that actually has an impact on the system. Since we are already able to
detect any occurrence of the interference, we simply have to isolate the occurrences which
impacted the cache’s completion of the core’s requests. To do so, each cache keeps track, for
each address, of whether an interference occurred since that address was last involved in
a core request. Thus, if the CPU requests a read on an address for which the expelling
flag is active, we consider that a meaningful expelling interference occurred.
Using this with the dual core example, we can see that, for the second address, both
caches are affected by the effects of 1 demoting and 1 expelling interference. For the
first address, one cache has the same and the other experiences the effects of 2 expelling
interferences.
Execution Time Analysis: A more general metric is the execution time. Indeed, we
can measure the impact that cache coherence has on an application’s execution time. This
can be achieved by simply replacing all accesses to shared variables made by the target
application with accesses to new variables, setting the time impact of minor interferences
to zero, and having the framework compute the new maximum execution time so that it
can be compared to the one with shared variables left intact.
On the dual core example, we first measure the execution time with the system as is,
then replace the program running on one of the two cores by Figure 24 and set the cost
of minor interferences to zero. Our first analysis indicates a maximum execution time
of 1218 time units, the second one indicates 1050 time units. This implies that cache
coherence causes a 16 percent increase in execution time.
Alternatively, keeping the time impact of minor interferences at its default value and
computing the WCET of these two programs lets us deduce how much time is lost due to minor
interferences. In the dual core example, the result is still 1050 time units, showing a lack
of negative effects from minor interference.
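The difference between counting all occurrences and counting meaningful ones can be sketched in Python as follows. The class is ours and simplifies the model: flags are kept per address, they are cleared by any core request on that address, minor interference is counted but never flagged, and we assume that a demoting interference only matters to a later store whereas an expelling one matters to any later request. The text only states the rule for a read on an address with an active expelling flag, so the rest is an assumption.

```python
from collections import Counter

MINOR = "minor"
DEMOTING = "demoting"
EXPELLING = "expelling"


class InterferenceCounter:
    def __init__(self):
        self.all_occurrences = Counter()  # (address, kind) -> count
        self.meaningful = Counter()       # (address, kind) -> count
        self.flags = {}                   # address -> kinds since last request

    def interference(self, address, kind):
        """Record one occurrence and raise the matching flag."""
        self.all_occurrences[(address, kind)] += 1
        if kind != MINOR:
            self.flags.setdefault(address, set()).add(kind)

    def core_request(self, address, is_store):
        """Turn raised flags into meaningful occurrences, then clear them."""
        raised = self.flags.pop(address, set())
        if EXPELLING in raised:
            self.meaningful[(address, EXPELLING)] += 1
        if DEMOTING in raised and is_store:
            self.meaningful[(address, DEMOTING)] += 1
```

Since a flag is a boolean, two expelling interferences between two core requests count twice as occurrences but only once as meaningful.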
7.2 Model Validation
In addition, we can assert that the behavior of our model does indeed correspond to what
we expect. The successful verification of the following properties gives us reasonable confidence
in the validity of the protocol used in our model. The validation of the chosen timing
parameters, however, would still require a few judicious benchmarks.
Programs Always Terminate: By checking that all possible execution paths lead all
cores to the Terminated location, we ensured that there are no deadlocks in our model.
No Incompatible States: As stated in Section 1, there should never be two cache
controllers simultaneously having writing access to the same memory element. Thus, we
checked that if a cache controller is in a state where it may write to a memory element,
then the others are not in a state where they may read that memory element.
Values Are Always Up-To-Date: Another point stated in Section 1 is that the values
in cache should be up-to-date. We verified that it is the case in our model by creating a
version in which the exact value of each memory element is taken into account. Using a
shared variable to keep track of the expected system-wide value, we tested that every
time an action (either read or write) was taken on a memory element, the local copy
of that memory element had a value equal to the system-wide one. This is a standard
property to validate coherency protocols [10, 19].
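As an illustration of the No Incompatible States property, the following Python function checks it on a single snapshot of the caches. It is a sketch under our own assumptions: only the stable states M, S and I are considered, M is taken to grant read and write access, S read access only, and a missing entry is treated as I. In the model itself the property is checked by UPPAAL over all reachable states rather than on one snapshot.

```python
def no_incompatible_states(states_by_cache):
    """Check that a writer never coexists with another reader or writer.

    `states_by_cache` maps a cache identifier to a dictionary associating
    each memory element address with its state ("M", "S" or "I").
    """
    addresses = set()
    for states in states_by_cache.values():
        addresses.update(states)
    for address in addresses:
        held = [states.get(address, "I") for states in states_by_cache.values()]
        writers = held.count("M")
        readers = held.count("S")
        if writers > 1 or (writers == 1 and readers > 0):
            return False
    return True
```

For instance, one cache in M and another in S for the same address is rejected, whereas two caches in S are accepted.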
8 Related Works
WCET Analysis for Single-Core: The authors of [9] introduce METAMOC, a
UPPAAL-based framework for modular WCET analysis of programs running on single-core
processors. It transforms program binary executables into timed automata, one for
each function of the program. These programs are simplified. For example, a conditional
jump may be removed if it would lead to fewer instructions being executed. This is justified
by the assumption that the more instructions there are, the longer the execution time
is (the reverse of which is called a timing anomaly). METAMOC supports instruction
pipelines, which are modeled using five timed automata (fetch, decode, execute, memory,
and writeback). These five automata have to be manually made for the targeted
architecture. Caching is also supported, and requires similar attention to the architecture’s
specifics. As it is intended for single-core architectures, METAMOC does not
have any concept of cache coherence. We are, however, taking a very similar approach to
tackle our problem.
The work in [6] also shares similarities with [9], as UPPAAL is used to estimate WCET for
programs running on single-core processors with pipeline and cache, in what is presented
as a modular framework. It attempts to improve on the weaknesses of METAMOC by
replacing the control flow graphs based on value analysis with program slicing. In effect,
statements that do not affect dynamic jump addresses are replaced with nop (i.e. “do
nothing”) operations. In [7], the authors address the state explosion issue.
WCET Analysis for Multi-Core with Private Caches: Readers can refer to [17] for
an overview of Multi-Core WCET Analysis. [16] proposes a UPPAAL-based framework
to estimate the WCET of applications running on a multi-core processor. They consider
the delays caused by contention on the interconnect and a private instruction cache for
each core (data caches are not considered). They perform analysis on the memory blocks
pertinent to the instructions of the program. A memory block may contain one or more
instructions. For each instruction, they are only interested in whether it: is always found
in the cache; is always found except on the first access; is never found in the cache; is
undecided. They have defined a timed automaton to model each of these possibilities
(modeling the need for interconnect access, time to read the memory blocks, and updates
to the cache). They consider programs as control flow graphs in which each node is a
memory block. As such, they model each program by a single timed automaton based
on the control flow graph, but in which each instruction has been replaced by one of
the aforementioned timed automata corresponding to its impact. Their paper presents
models for two types of interconnects: TDMA and FCFS, which control the order in which the
bus can be accessed by the timed automata modeling the instructions. Cache coherence
is not addressed.
WCET Analysis for Multi-Core with Shared Caches: The authors of [8] focus
on the estimation of WCET on multi-core processors. Their points of interest are the
delays caused by hierarchical caches, the use of a shared cache, and the interconnect.
They do not use UPPAAL, but instead model the applications as task-dependency graphs
and perform computations to estimate the WCET. Their approach starts by analyzing
how the L1 caches are accessed, to remove elements that are sure to always be present
from further consideration. The other accesses are dependent on both the content of the
L2 cache, and access to the bus. The content of the L2 cache depends on which tasks
are running, which, in turn, depends on bus access time. To resolve the circular
dependency, they propose an iterative approach: starting by considering the worst-case
scenario in which all tasks interfere, they estimate the running time of the tasks, which
lets them remove any interference between two tasks whose running times are disjoint,
and start over until a fixed point is reached. Data caches are not taken into account and
are assumed to have no effect on the calculations. Cache coherence is not addressed.
The authors of [29] study the impact of a shared cache (including data caches) on execution
time. To do so, they represent each program as an address flow graph, in which edges
correspond to instructions, and vertices correspond to the state of the cache and its access
history. They then build a combined cache conflict graph, which is essentially the
combination of each core’s address flow graph into a single graph. Cache coherence is not
addressed.
The work done in [13] has similarities with ours, as it uses UPPAAL to calculate the WCET
of programs running on multi-core processors. Their focus is not on cache coherence, but
their model does feature a basic form of it, as write requests lead to the invalidation of the
memory element in the other caches.
Cache Coherence Protocol Comparison: The authors of [2] compare the efficiency of
common snooping-based cache coherence protocols. To do so, they described a multi-core
processor and the cache coherence protocols in Simula. Much like ours, the programs
running on this simulation are described as a succession of memory related instructions.
However, they do not use explicit addresses for these instructions. Instead, they have
defined system-wide weights to regulate the probability of an instruction to be applied
on a private memory element (i.e. a memory element the cache is the sole user of) or a
shared block (i.e. a memory element used by multiple caches). Thus, cache coherence is
addressed, but only in a very broad context. Indeed, whereas our work focuses on the
impact of cache coherence on specific applications on a specific architecture, the cache
coherence protocol comparison made by the authors of [2] provides a general idea of
which protocol is better suited to which type of application.
Predictable Cache Coherence: An alternative to trying to predict how cache
coherence is going to behave is to use a kind of cache coherence designed to be predictable.
[24] lists the cache coherence related latencies that need to be known before predictability
of the protocol can be achieved. Its authors argue that write-through, update-based
protocols (i.e. writes are propagated to other caches and to the memory) can be made
predictable.
[14] presents PMSI, a variation on the MSI protocol that uses a TDM bus to achieve
predictability. Emission of coherence queries is restricted to a core’s TDM slots.
As a result, a cache does not suffer from interference during its own TDM slots. [21]
expands on this by introducing HourGlass, which allows separate handling of critical
and non-critical cores. HourGlass uses timers to allow cores to hold access to a memory
element for a predefined time duration. The evaluation of queries that would remove
an access currently protected by a timer is delayed until its time is up. Both PMSI and
HourGlass require hardware modification, which prevents them from being used in a
context that relies on COTS.
9 Conclusion and Future Work
When using cache coherence, the execution of a program running on a core is affected by
the execution of the programs running on the other cores. Because of this, analysis of the
execution time becomes much more difficult. In this paper, we categorized the types of
interference that cache coherence induces: minor interference, caused by the handling of
queries irrelevant to the cache controller; demoting interference, when an external event
forces the loss of writing rights; and expelling interference, when an external event forces
eviction of a cache line.
We also presented timed automata as a way to model cache coherence so that this
interference can be studied and exposed. For this purpose, we showed and explained our
current model for the analysis of cache coherence, as well as the hypotheses made for that
model to be applicable.
We are working on a tool to automatically switch which MSI variant (MESI, MOSI,
MOESI, MESIF) is used by the model. We also intend to add another type of instruction
to programs soon, adding more non-determinism to the model by having an INSTR_CALC
instruction that causes the CPU to wait for any amount of time in a given range. Lastly, we
plan to perform a benchmark comparison on the Keystone TCI6630K2L [23] from
Texas Instruments to further validate our approach.
Our current model was tested with up to 6 cores. We are working on its scalability issues,
and intend to make use of SAT/SMT [3] to tackle this limitation.
References686
1 Rajeev Alur and David L. Dill. A theory of timed automata. Theor. Comput. Sci., 126(2):183–687
235, April 1994. URL: http://dx.doi.org/10.1016/0304-3975(94)90010-8, doi:10.1016/688
0304-3975(94)90010-8.689
2 James Archibald and Jean-Loup Baer. Cache coherence protocols: Evaluation using a690
multiprocessor simulation model. ACM Trans. Comput. Syst., 4(4):273–298, September 1986.691
URL: http://doi.acm.org/10.1145/6513.6514, doi:10.1145/6513.6514.692
3 Clark W. Barrett, Roberto Sebastiani, Sanjit A. Seshia, and Cesare Tinelli. Handbook of693
Satisfiability, chapter Satisfiability Modulo Theories, pages 825–885. IOS Press, 2009.694
4 Gerd Behrmann, Alexandre David, and Kim G. Larsen. A Tutorial on Uppaal, pages 200–695
236. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. URL: https://doi.org/10.1007/696
978-3-540-30080-9_7, doi:10.1007/978-3-540-30080-9_7.697
5 Frédéric Boniol, Hugues Cassé, Eric Noulard, and Claire Pagetti. Deterministic execution698
model on cots hardware. In Proceedings of the 25th International Conference on Architecture699
of Computing Systems (ARCS’12), pages 98–110, 2012.700
6 Franck Cassez and Jean-Luc Béchennec. Timing analysis of binary programs with UPPAAL.701
In 13th International Conference on Application of Concurrency to System Design, ACSD702
2013, pages 41–50. IEEE Computer Society, July 2013. doi:http://dx.doi.org/10.1109/703
ACSD.2013.7.704
7 Franck Cassez and Pablo González de Aledo Marugán. Timed automata for modelling caches705
and pipelines. In Rob J. van Glabbeek, Jan Friso Groote, and Peter Höfner, editors, Proceedings706
Workshop on Models for Formal Analysis of Real Systems, MARS 2015, Suva, Fiji, November707
23, 2015., volume 196 of EPTCS, pages 37–45, 2015. doi:10.4204/EPTCS.196.4.708
8 Sudipta Chattopadhyay, Abhik Roychoudhury, and Tulika Mitra. Modeling shared cache and
bus in multi-cores for timing analysis. In Proceedings of the 13th International Workshop
on Software & Compilers for Embedded Systems, SCOPES ’10, pages 6:1–6:10, New York,
NY, USA, 2010. ACM. URL: http://doi.acm.org/10.1145/1811212.1811220,
doi:10.1145/1811212.1811220.
9 Andreas E. Dalsgaard, Mads Chr. Olesen, Martin Toft, René Rydhof Hansen, and Kim
Guldstrand Larsen. METAMOC: Modular Execution Time Analysis using Model Checking.
In Björn Lisper, editor, 10th International Workshop on Worst-Case Execution Time
Analysis (WCET 2010), volume 15 of OpenAccess Series in Informatics (OASIcs), pages
113–123, Dagstuhl, Germany, 2010. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
The printed version of the WCET’10 proceedings is published by OCG (www.ocg.at),
ISBN 978-3-85403-268-7. URL: http://drops.dagstuhl.de/opus/volltexte/2010/2831,
doi:10.4230/OASIcs.WCET.2010.113.
10 Giorgio Delzanno. Automatic verification of parameterized cache coherence protocols. In
Proceedings of the 12th International Conference on Computer Aided Verification, CAV ’00,
pages 53–68, London, UK, 2000. Springer-Verlag.
URL: http://dl.acm.org/citation.cfm?id=647769.734088.
11 Philip Enslow, Jr. Multiprocessor organization - a survey. ACM Comput. Surv., 9(1):103–129,
March 1977.
12 Sylvain Girbal, Xavier Jean, Jimmy le Rhun, Daniel Gracia Pérez, and Marc Gatti.
Deterministic Platform Software for Hard Real-Time systems using multi-core COTS. In 34th Digital
Avionics Systems Conference (DASC’15), 2015.
13 Andreas Gustavsson, Andreas Ermedahl, Björn Lisper, and Paul Pettersson. Towards WCET
analysis of multicore architectures using UPPAAL. In 10th International Workshop on
Worst-Case Execution Time Analysis, WCET 2010, July 6, 2010, Brussels, Belgium, pages 101–112,
2010. URL: https://doi.org/10.4230/OASIcs.WCET.2010.101,
doi:10.4230/OASIcs.WCET.2010.101.
14 Mohamed Hassan, Anirudh M. Kaushik, and Hiren D. Patel. Predictable cache coherence
for multi-core real-time systems. In 2017 IEEE Real-Time and Embedded Technology and
Applications Symposium, RTAS 2017, Pittsburgh, PA, USA, April 18-21, 2017, pages 235–246,
2017. URL: https://doi.org/10.1109/RTAS.2017.13, doi:10.1109/RTAS.2017.13.
15 Xavier Jean, David Faura, Marc Gatti, Laurent Pautet, and Thomas Robert. Ensuring
robust partitioning in multicore platforms for IMA systems. In 31st Digital Avionics Systems
Conference (DASC’12), 2012.
16 M. Lv, W. Yi, N. Guan, and G. Yu. Combining abstract interpretation with model checking
for timing analysis of multicore software. In 2010 31st IEEE Real-Time Systems Symposium,
pages 339–349, Nov 2010. doi:10.1109/RTSS.2010.30.
17 Claire Maiza, Hamza Rihani, Juan Maria Rivas, Joël Goossens, Sebastian Altmeyer,
and Robert I. Davis. A survey of timing verification techniques for multi-core real-time systems.
Technical report, Grenoble INP/Ensimag/Verimag, 2018.
18 Rodolfo Pellizzoni, Emiliano Betti, Stanley Bak, Gang Yao, John Criswell, Marco Caccamo,
and Russell Kegley. A predictable execution model for COTS-based embedded systems. In 17th
IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2011, pages
269–279, 2011.
19 Fong Pong and Michel Dubois. A new approach for the verification of cache coherence
protocols. IEEE Trans. Parallel Distrib. Syst., 6(8):773–787, August 1995.
URL: http://dx.doi.org/10.1109/71.406955, doi:10.1109/71.406955.
20 Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory Consistency and
Cache Coherence. Morgan & Claypool Publishers, 1st edition, 2011.
21 Nivedita Sritharan, Anirudh M. Kaushik, Mohamed Hassan, and Hiren D. Patel. Hourglass:
Predictable time-based cache coherence protocol for dual-critical multi-core systems. CoRR,
abs/1706.07568, 2017. URL: http://arxiv.org/abs/1706.07568, arXiv:1706.07568.
22 V. Suhendra, T. Mitra, and A. Roychoudhury. WCET centric data allocation to scratchpad
memory. In 26th IEEE International Real-Time Systems Symposium (RTSS’05), pages
223–232, Dec 2005. doi:10.1109/RTSS.2005.45.
23 Texas Instruments. TCI6630K2L Multicore DSP+ARM KeyStone II System-on-Chip.
Technical Report SPRS893E, Texas Instruments Incorporated, 2013.
24 Sascha Uhrig, Lillian Tadros, and Arthur Pyka. MESI-based cache coherence for hard real-time
multicore systems. In Luís Miguel Pinho, Wolfgang Karl, Albert Cohen, and Uwe
Brinkschulte, editors, Architecture of Computing Systems – ARCS 2015, pages 212–223, Cham,
2015. Springer International Publishing.
25 L. Wehmeyer and P. Marwedel. Influence of memory hierarchies on predictability for time
constrained embedded software. In Design, Automation and Test in Europe, pages 600–605
Vol. 1, March 2005. doi:10.1109/DATE.2005.183.
26 Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David
Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, Frank
Mueller, Isabelle Puaut, Peter Puschner, Jan Staschulat, and Per Stenström. The
worst-case execution-time problem - overview of methods and survey of tools. ACM Transactions
on Embedded Computing Systems, 7(3):36:1–36:53, May 2008.
27 Reinhard Wilhelm and Jan Reineke. Embedded systems: Many cores - many problems. In 7th
IEEE International Symposium on Industrial Embedded Systems (SIES’12), pages 176–180,
2012.
28 Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Marco Caccamo, and Lui Sha. MemGuard:
Memory bandwidth reservation system for efficient performance isolation in multi-core platforms.
In 19th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS’13),
pages 55–64, 2013.
29 Wei Zhang and Jun Yan. Static timing analysis of shared caches for multicore processors.
JCSE, 6(4):267–278, 2012. URL: https://doi.org/10.5626/JCSE.2012.6.4.267,
doi:10.5626/JCSE.2012.6.4.267.
