Cache Where you Want! Reconciling Predictability and Coherent Caching by Bansal, Ayoosh et al.
Cache Where you Want!
Reconciling Predictability and Coherent Caching
Ayoosh Bansal , Jayati Singh , Yifan Hao, Jen-Yang Wen, Student Member, IEEE,
Renato Mancuso, Member, IEEE, and Marco Caccamo, Fellow, IEEE
Abstract—Real-time and cyber-physical systems need to interact with and respond to their physical environment in a predictable
time. While multicore platforms provide incredible computational power and throughput, they also introduce new sources of
unpredictability. Large fluctuations in latency to access data shared between multiple cores is an important contributor to the overall
execution-time variability. In addition to the temporal unpredictability introduced by caching, parallel applications with data shared
across multiple cores also pay additional latency overheads due to data coherence.
Analyzing the impact of data coherence on the worst-case execution-time of real-time applications is challenging because only scarce
implementation details are revealed by manufacturers. This paper presents application level control for caching data at different levels
of the cache hierarchy. The rationale is that by caching data only in shared cache it is possible to bypass private caches. The access
latency to data present in caches becomes independent of its coherence state. We discuss the existing architectural support as well as
the required hardware and OS modifications to support the proposed cacheability control. We evaluate the system on an architectural
simulator. We show that the worst case execution time for a single memory write request is reduced by 52%.
Index Terms—Cache memories, Multi-core/single-chip multiprocessors, Real-time and embedded systems, Worst-case analysis
©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future
media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
1 Introduction
The last decade has witnessed a profound transformation in the way real-time systems are designed and integrated.
At the root of this transformation are the ever growing data-
heavy and time-sensitive real time applications. As scaling in
processor speed has reached a limit, multi-core solutions [1]
have proliferated. Embedded multi-core systems have intro-
duced a plethora of new challenges for real-time applications.
Not only does this add a new dimension to scheduling, but, remarkably, the fundamental principle that the worst-case execution time (WCET) of applications can be estimated in isolation has been shaken.
In multi-core systems, major sources of unpredictabil-
ity arise from inter-core contention over shared memory
resources [2], [3]. Memory resource partitioning techniques
present a suitable approach to mitigate undesired temporal
interference between cores [4], [5], [6]. However, memory re-
source partitioning is particularly well suited only for systems
where data exchange between cores is scarce or nonexis-
tent [7]. Heavy data processing or data pipelining workloads,
on the other hand, are often internally structured as multi-
threaded applications, where coordination and fast data ex-
change between parallel execution flows on different cores is
crucial. This exchange is based on shared memory.
Modern platforms generally feature a multi-level cache hierarchy, with the first cache level (L1) comprised of private per-core caches. When multiple threads access the same memory locations, it is crucial to ensure the coherence of different copies of the same memory block in multiple L1 caches. Dedicated hardware circuitry, namely the coherence controller, exists to maintain this invariant. Because maintaining coherence requires coordination among distributed L1 caches, it introduces overhead.
• A. Bansal, J. Singh, Y. Hao and J. Wen are with Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801-2302.
Corresponding E-mail: ayooshb2@illinois.edu
• R. Mancuso is with Boston University.
• M. Caccamo is with Technical University of Munich.
Cache coherence introduces two main obstacles for real-time systems. First, hardware coherence protocols are closely guarded intellectual property of hardware manufacturers. As such, scarce details are available to study the worst-case behavior for coherent data exchange. Second, coherence
controllers are not designed to optimize for worst-case behav-
ior. One approach to achieve predictable coherence consists
of re-designing coherence protocols and controllers [8], [9].
Doing so, however, requires extensive modifications to exist-
ing processor design, with a possibly significant impact on
performance.
In this paper, we propose a new approach to achieve
predictable coherence. The key intuition is that if memory
blocks accessed by multiple applications are cached only in
shared levels —e.g., last-level cache— multiple copies of cache
lines do not exist and coherence is trivially satisfied. Based on
this, we define a new memory type that is non-cacheable in
private (inner) cache levels, but cacheable in shared (outer)
caches, namely Inner-Non-Cacheable, Outer-Cacheable (INC-
OC). INC-OC memory type coexists with traditional memory
types. Control over which type of memory should be used for
different areas of an application’s working set is then provided
to the developer and/or compiler. The key contributions of
this work can be summarized as follows:
• A novel solution for predictable time access to coherent data
without variability induced by coherence mechanisms.
• Prototype evaluation on an architectural simulator.
arXiv:1909.05349v1 [cs.DC] 11 Sep 2019
2 Related Work
Multi-core systems have enabled multi-threaded real-time
workloads. Real-time applications with parallel, precedence-
constrained processing tasks are often represented as peri-
odic (or sporadic) DAG tasks [10]. Scheduling of DAG tasks
has received considerable attention [10], [11], [12]. Alongside,
application-level frameworks to structure real-time applica-
tions to follow the DAG model have been proposed and
evaluated [13].
In practice, however, the implementation and analysis
of DAG real-time tasks is challenging because analytically
upper-bounding the worst-case execution time (WCET) of
a processing node is hard. As multiple DAG nodes are ex-
ecuted in parallel, they interfere with each other on the
shared memory hierarchy. Well known sources of interference
arise from space contention in the shared cache levels [14],
[15]; contention for allocation of miss status holding regis-
ters (MSHR) [16]; bandwidth contention when accessing the
DRAM memory controller [5], [17]; and bank-level contention
due to request re-ordering in the DRAM storage [6]. Our
work focuses on tightening and simplifying the analysis of the
WCET bound by making shared data accesses immune from
unpredictable temporal effects of cache coherence controllers.
Cache contention among parallel tasks is a major cause
of interference [15]. Mitigation approaches include selective
caching [18], [19] and cache partitioning [4], [20]. Strict re-
source partitioning among cores is effective for independent
tasks. But data sharing is necessary to implement processing
pipelines, such as real-time tasks following the sporadic DAG
model. In [21], the authors acknowledge that data-sharing
between tasks is inevitable in mixed criticality, multi-core
systems. In light of these considerations, previously proposed
solutions have been adapted to allow sharing [22], [23].
The problems introduced by data sharing in real-time
signal processing applications were studied in [24], [25], [26],
which demonstrate that the overhead from cache coherence
protocols can severely diminish the gains expected from parallelism in multi-core systems. More importantly, the overheads of cache coherence are unpredictable and
can have large variance. In this work we focus exclusively on
the unpredictability caused by cache coherence. Our solution
is to allow shared data to bypass private levels and be cached
directly in shared cache levels.
Previous works have addressed cache coherence in multiple
ways. Giovani et al. propose a coherence aware scheduler [25]
which staggers the execution of tasks that share data. The
solution only works at task-level granularity and may force idle times on the processor. Predictable MSI [8] solves the coherence
unpredictability by using a TDMA coherence bus and a mod-
ified MSI coherence protocol. The solution is invisible to soft-
ware and provides predictability with reasonable overheads,
but requires major changes to hardware coherence controllers
that are difficult to implement and verify [27], [28], [29]. MC2
[21] improved upon [22], [23] by allowing data sharing across
processors. The coherence effects are avoided by making the
shared memory uncacheable or assigning tasks with shared
memory accesses to the same core. The scheduling option
is restrictive and like [25] may force processor idling. Extra
accesses to uncached memory, i.e. main memory, are slow
and may increase the WCET. On-Demand Coherent Cache
[30] converts tasks accessing shared data to critical sections,
hence disallowing concurrent execution of any tasks that share
data. SWEL [31] focuses on high performance computing and
message passing workloads. It proposes heuristic mechanisms
in hardware to cache only private and read only data in L1
caches. There are no predictability assurances, as the cache line placement decisions are based on observations at run time and optimize for throughput.
Our proposed solution does not impose any scheduling
restrictions on the application and does not burden it to
maintain coherence. In addition, it provides the developer
freedom to choose which data to cache where, to optimize the
average and worst-case performance trade-offs. This solution
can be implemented without any changes to hardware coher-
ence implementations and works with any coherence protocol.
Minimal changes to the cache controller’s logic are required.
The overall effect is a complete avoidance of cache coherence
overheads based on the developer’s choices.
3 Background
Let us first familiarize ourselves with the terms and concepts
used in this work.
3.1 Cache Coherence
Cache Coherence [32] is a feature of modern multi-level,
distributed CPU caches. In traditional cache architectures, it
is fundamental that the contents of private levels of caches
are kept coherent across multiple cores. A hardware cache
coherence controller is present for this purpose. It ensures that
any valid copies of a cache line contain the same data. Cache
coherence controllers work by assigning additional states to
cache lines. Let us consider the example of MSI cache co-
herence protocol [33] as shown in Figure 1. In this protocol a
cache line can be in one of the following three states:
[Figure 1: state diagram with the states Invalid, Shared and Modified. Transitions are labeled Self LD, Self ST, Self Eviction, Other LD and Other Invalidation, where LD denotes a Load and ST a Store.]
Fig. 1. MSI States and Transitions
• Invalid: Cache line is not allocated or contains invalid data.
This is the initial state.
• Shared: Cache line contains valid data. A cache line in shared state can only be read from. Other caches may have the same cache line in Shared or Invalid state. Data cached in this line is the same as the corresponding location in main memory.
• Modified: Cache line contains valid data. A cache line in
modified state can be used to read or write data. Other
caches cannot have the same cache line in any state other
than Invalid. Cache line contains dirty data, i.e., the cur-
rent content of the cache line may be different from the
corresponding location in main memory.
A cache line transitions between these states based on Self vs. Other load/store (LD/ST) events, as shown in Figure 1. Here, Self refers to events generated by the core under analysis. Evictions are cache line replacements. Other refers to messages that handle events generated by other cores.
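The stable-state transitions described above can be written as a small transition function. The following is a minimal sketch of textbook MSI as seen by one private cache, not the simulator's controller: the type and function names are illustrative assumptions, transient states are ignored, and the rule that a Modified line drops to Shared on a remote load is the standard MSI behavior assumed here.

```c
#include <assert.h>

/* Stable MSI states of one cache line in one private cache. */
typedef enum { MSI_INVALID, MSI_SHARED, MSI_MODIFIED } msi_state;

/* Events observed by that cache. Self = local core, Other = remote core. */
typedef enum {
    EV_SELF_LD,          /* local load */
    EV_SELF_ST,          /* local store */
    EV_SELF_EVICT,       /* local replacement of the line */
    EV_OTHER_LD,         /* remote core loads the line */
    EV_OTHER_INVALIDATE  /* remote core requests exclusive ownership */
} msi_event;

/* Return the next stable state (hypothetical helper, textbook MSI). */
msi_state msi_next(msi_state s, msi_event e)
{
    assert(s == MSI_INVALID || s == MSI_SHARED || s == MSI_MODIFIED);
    switch (e) {
    case EV_SELF_LD:
        /* A load allocates the line as Shared if it was not present. */
        return (s == MSI_INVALID) ? MSI_SHARED : s;
    case EV_SELF_ST:
        /* A store always ends with exclusive, dirty ownership. */
        return MSI_MODIFIED;
    case EV_SELF_EVICT:
    case EV_OTHER_INVALIDATE:
        return MSI_INVALID;
    case EV_OTHER_LD:
        /* A dirty line is downgraded so the remote reader can share it. */
        return (s == MSI_MODIFIED) ? MSI_SHARED : s;
    }
    return s;
}
```

For instance, a store to a Shared line moves it to Modified, and a later invalidation from another core returns it to Invalid.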
3.2 Memory Types
A vast majority of modern multi-core embedded systems
are implemented using ARM architectures. We focus on the
latest major version ARMv8-A, extensively used in current
platforms. This includes recent versions of Nvidia Tegra,
Qualcomm Snapdragon and Samsung Exynos, among others
[34]. There are 100+ mobile and embedded SoCs compliant with the ARMv8-A Instruction Set Architecture (ISA). In this
architecture, a uniform physical memory address space de-
scribes traditional memory resources (e.g. DRAM space), as
well as configuration space for on-chip and external devices.
In order to adopt the correct caching policy for any given
memory region, the hardware allows specifying a set of meta-
data, or memory type, for each memory page. The memory
type specification informs the hardware of how load/store
operations within a given memory range should be handled.
Memory type attributes are encoded in each virtual memory
page table descriptor. In setting up virtual memory, the OS
is responsible for encoding the correct memory type in the
page table entry (PTE) of any portion of memory being
accessed. If virtual memory is disabled, no memory type can
be associated with a memory range. This forces the hardware
to be conservative and to treat any load/store operation as
non-cacheable accesses.
The ARMv8-A standard allows defining two main attributes in the memory type. First, cacheability, i.e., whether or not a memory location should be cached. There exist two
cacheability attributes: inner cacheability and outer cacheabil-
ity. If a memory region is marked as inner (resp., outer)
cacheable, its content can be cached in the inner (resp., outer)
cache levels. What constitutes inner vs. outer is implementa-
tion defined. Generally, however, private cache levels (e.g., L1)
are inner caches. Conversely, shared cache levels are usually
outer caches. Second, memory types encode a shareability
attribute. Once again, this can be specified independently for
inner and outer caches. If a memory region is defined as inner
shareable, any of its cached lines are kept coherent by the
hardware in the inner cache levels. The same goes for outer
shareable memory.
In this work, we focus on a subset of possible memory
types [35]. Specifically, Table 1 defines the memory attributes
used throughout the paper. The default memory type is
Normal Cacheable. This type of memory is cacheable (and
shareable) at all levels of caches. The other frequently used
memory type is Uncacheable. This memory type is typically
used to describe I/O memory. We define a new memory type:
Inner Non Cacheable, Outer Cacheable (INC-OC) and also
address kernel support in Section 5.2. This type of memory is
accessed by the processor cores only. It is cached in all shared
(outer) cache levels but not cached in any caches private
(inner) to any cores.
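To make the three memory types of Table 1 concrete, the sketch below shows how they could be expressed as ARMv8-A memory attribute bytes. It assumes the MAIR_ELx layout for Normal memory, with the outer cacheability in the upper four bits and the inner cacheability in the lower four bits, and it considers only two field values (Non-cacheable and Write-Back). The macro and function names are illustrative and are not taken from the paper's kernel patch.

```c
#include <assert.h>
#include <stdint.h>

/* 4-bit cacheability fields for Normal memory (assumed MAIR_ELx encoding). */
#define CACHE_NC 0x4u /* Non-cacheable */
#define CACHE_WB 0xFu /* Write-Back, non-transient, read/write allocate */

/* The three memory types of Table 1, plus a catch-all. */
typedef enum {
    MEMTYPE_NORMAL_CACHEABLE, /* inner and outer cacheable */
    MEMTYPE_UNCACHEABLE,      /* inner and outer non-cacheable */
    MEMTYPE_INC_OC,           /* inner non-cacheable, outer cacheable */
    MEMTYPE_OTHER
} mem_type;

/* Build one attribute byte: outer field in bits [7:4], inner in bits [3:0]. */
uint8_t mair_attr(uint8_t outer, uint8_t inner)
{
    assert(outer <= 0xFu && inner <= 0xFu);
    return (uint8_t)((outer << 4) | inner);
}

/* Map an attribute byte back to a Table 1 memory type. */
mem_type classify_attr(uint8_t attr)
{
    uint8_t outer = (uint8_t)(attr >> 4);
    uint8_t inner = (uint8_t)(attr & 0xFu);

    if (outer == CACHE_WB && inner == CACHE_WB)
        return MEMTYPE_NORMAL_CACHEABLE;
    if (outer == CACHE_NC && inner == CACHE_NC)
        return MEMTYPE_UNCACHEABLE;
    if (outer == CACHE_WB && inner == CACHE_NC)
        return MEMTYPE_INC_OC;
    return MEMTYPE_OTHER;
}
```

Under these assumptions INC-OC corresponds to the attribute byte 0xF4, while 0xFF and 0x44 correspond to Normal Cacheable and Uncacheable memory respectively.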
3.3 Architectural Support
One of our goals is to require minimal changes. We therefore first explore existing support for cacheability control in popular Instruction Set Architectures (ISAs) and the corresponding compliant processors.
We find that while ARMv8-A ISA supports inner and outer
cacheability control, hardware implementations simplify away
this support. Conversely, X86 and MIPS ISA do not support
high granularity cacheability control for different cache levels.
• X86 : Intel 64 defines various levels of caching like Un-
cacheable, Write Combining, Write Through, Write Back
[36]. Two methods are provided to specify the type of
caching, namely, Memory Type Range Registers and Page
Attribute Table. The defined caching types do not differen-
tiate between the various levels of caches.
• ARM : ARMv8-A ISA allows managing the cacheability of
Inner and Outer regions independently [37]. Most ARM pro-
cessors however simplify their design by treating Inner Non-
cacheable Outer Cacheable type as Non-Cacheable memory.
This applies to Cortex-A53 [38], Cortex-A57 [39], and Cortex-A72 [40]. Other ARM compliant processors implement similar simplifications. The Nvidia Denver and Carmel architectures ignore the Outer Cacheability attribute [41], [42].
• MIPS : MIPS treats cacheability control in a simpler
manner. MIPS32/64 ISA only defines Cached and Uncached
memory types [43]. A lot of fields are left as implemen-
tation dependent though. One of the recent MIPS proces-
sors, M6200 supports only Cached and Uncached memory
types [44].
4 Approach Overview and Motivation
The main idea behind our solution is to allow application developers or compilers to use their knowledge of the application to choose between the trade-offs of worst-case vs. average-case access time. The choice of cacheability determines whether or not certain data will be cached in private L1 levels. Data locations for which strong worst-case latency guarantees are required can be selectively cached only in shared levels. Data coherence is achieved as only one cached copy of such data locations can exist. Coherence overheads and variability are avoided for any access to such locations.
TABLE 1
Memory Types
Name               Cacheability                                Description
Normal Cacheable   Inner Cacheable, Outer Cacheable            Data caching allowed in all caches
Uncacheable        Inner Non-Cacheable, Outer Non-Cacheable    Data caching not allowed
INC-OC             Inner Non-Cacheable, Outer Cacheable        Data caching allowed in shared caches only
4.1 Architectural Prerequisites
For cacheability control to provide predictable time access,
two core requirements must be met. First, a shared cache
level must exist between the entities that share the worst
case latency sensitive data. Consider Multi-Socket Multi-Core
(MSMC) architectures. The chips in such sockets do not have
a shared cache level. Their first common memory level is
the main memory and hence our solution would translate to
using uncacheable memory type. The second requirement is
that for the given shared cache level, a method must exist to
limit the cacheability to that level. Consider a multilevel cache architecture like the Nvidia Carmel SoC. A cluster of 2 cores shares an L2 cache, and 2 such clusters share an L3 cache. In such
cases, either the data sharing needs to be limited or enough
memory types must exist to limit cacheability to each cache
level. This level of fine grained control is not supported by
current ARMv8-A ISA. For the remainder of this paper we
consider a system as shown in Figure 2.
[Figure 2: Core 0 to Core 3, each with a private L1 cache, share one L2 cache connected to main memory. Annotated latencies: L1 hit 3.4 ns, L2 hit 16.6 ns, main memory access 154 ns.]
Fig. 2. System model
4.2 Coherence Cost
Cache coherence introduces a new dimension to cache func-
tion. A cache hit can no longer be defined as simply having
the data in the cache. Correct state and privilege are now also
a requirement for a hit. The implementation details of cache
coherence controllers are proprietary and hence it is generally
difficult to estimate or measure the exact latency of every
operation. Cache access operations can be one of:
• Hit: The data block is present in the cache and in a state
that allows the desired operation. For example Shared for
Load and Modified for Store.
• Miss: The data block is not present in the cache and needs
to be retrieved from a lower cache level or main memory.
• Coherence Miss: The data block is present in the local cache
or a remote cache at the same level. But the state of the
block does not allow the desired operation. For example
Store on a cache line in Shared state.
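The three outcomes listed above can be expressed as a small classification function. This is a sketch that follows a literal reading of the definitions, assuming MSI states; the names are illustrative, and the flag `in_remote_l1` stands for "some other cache at the same level holds the block".

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { LINE_INVALID, LINE_SHARED, LINE_MODIFIED } line_state;
typedef enum { OP_LOAD, OP_STORE } mem_op;
typedef enum { ACCESS_HIT, ACCESS_MISS, ACCESS_COHERENCE_MISS } access_kind;

/* Does the local MSI state permit the operation without further messages? */
static bool state_allows(line_state s, mem_op op)
{
    if (op == OP_LOAD)
        return s == LINE_SHARED || s == LINE_MODIFIED;
    return s == LINE_MODIFIED; /* stores need exclusive ownership */
}

/* Classify one access from the point of view of the requesting L1 cache. */
access_kind classify_access(line_state local, bool in_remote_l1, mem_op op)
{
    assert(op == OP_LOAD || op == OP_STORE);
    if (state_allows(local, op))
        return ACCESS_HIT;
    /* Block exists at this cache level, but in a state that blocks the op. */
    if (local != LINE_INVALID || in_remote_l1)
        return ACCESS_COHERENCE_MISS;
    /* Block is absent from this level: fetch from a lower level. */
    return ACCESS_MISS;
}
```

A store to a line held locally in Shared state, or to a line held only by a remote L1 cache, is classified as a coherence miss under this reading.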
Another example is shown in Figure 3. Consider a 2
core processor with 2 cache levels. In this example, the core
attached to L1 Cache1 initiates a store operation. This cache
does not have the data block for the Store, but L1 Cache2
has the data block in Modified state. Cache2 has to invalidate
its cache line and write back the dirty data to the shared L2
cache. The L2 cache can then send the data to L1 Cache1
which can finally execute the Store. L1 Cache1 now contains
the cache line in Modified state. This series of events can lead to a long latency in executing a single memory access. We refer to this situation as a Dirty Miss in this paper.
Figure 4 illustrates the cost of a dirty miss. Using an
instrumented simulation, see Section 5.6.1, we can set up
custom scenarios. Simulation logs show when some events of
interest complete. The time reported is simulation ticks which
are a direct equivalent to cycles in a hardware system.
At time 0, a core initiates a single write request. The target
cache line is in Modified state in another L1 cache. It takes 4
cycles for the L1 cache to get the request from the processor,
determine the current state of the cache line and take actions
accordingly. It takes a further 124 cycles or total 128 cycles
for L2 cache to receive the first request and take actions
accordingly. The next 601 cycles are spent in completing the
coherence steps, as shown in Figure 3, and delivering data to
the requesting L1 cache. Resolving the dirty miss therefore takes about 4.7× as many cycles (601) as the L2 communication delay (128).
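The arithmetic behind these numbers can be checked with a few lines of C. This is only a worked restatement of the cycle counts reported in the text (4, 128 and 729); the struct and function names are illustrative.

```c
#include <assert.h>

/* Completion times, in cycles, of the events on the dirty-miss timeline. */
typedef struct {
    unsigned l1_done;     /* L1 cache has processed the request */
    unsigned l2_received; /* L2 cache has received the first request */
    unsigned write_done;  /* data delivered, write complete */
} dirty_miss_timeline;

/* Cycles between L1 handling and the request reaching L2. */
unsigned l1_to_l2_cycles(const dirty_miss_timeline *t)
{
    assert(t->l2_received >= t->l1_done);
    return t->l2_received - t->l1_done;
}

/* Cycles spent on coherence steps and data delivery after L2 is reached. */
unsigned coherence_cycles(const dirty_miss_timeline *t)
{
    assert(t->write_done >= t->l2_received);
    return t->write_done - t->l2_received;
}

/* Coherence cost relative to the L2 communication delay. */
double coherence_ratio(const dirty_miss_timeline *t)
{
    return (double)coherence_cycles(t) / (double)t->l2_received;
}
```

With the values {4, 128, 729}, the helpers give 124 cycles from L1 to L2, 601 cycles of coherence handling, and a ratio of 601 / 128, roughly 4.7.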
4.3 Coherence Complexity
Hardware cache coherence simplifies the development of gen-
eral purpose multi-threaded software. Many applications are
served well by the transparent handling of data coherence
by the hardware. But for real-time applications this creates
another uncontrolled source of unpredictability in their worst-
case execution time. Cache coherence protocols in SoCs are
defined by vendors with only the main stable states [42],
[45]. There are a plethora of transient states in coherence
state machines and many low level details that impact the
overall coherence state machine operation [46]. This makes
any analysis on existing cache coherence controllers difficult.
Our approach of caching shared data in the shared cache only completely removes the cache coherence controller from the shared data access process. Hence, the effect of coherence on worst-case analysis and on certification of the system is trivially handled.
4.4 Private vs Shared Access
In this section we discuss the difference between accessing
Shared vs Private data on two real platforms.
• Cortex-A53 [47]: Quad-core ARMv8-A processor [48].
• Xeon E5-2658 [49]: 14-core Intel processor with 2 hyperthreads per core, on a desktop workstation.
We developed synthetic benchmarks to study the effect of
cache coherence on real platforms. We measure the average
latency to complete a Load or Store to data already present
in L1 caches. All cores do the same operations simultaneously.
The resulting average latency is a combined effect of single
access latency, parallelization, bandwidth contention and op-
portunistic hardware optimization like prefetchers. Consider
Figure 5. For the first cluster of bars, Load, every core reads
sequential memory. For the second cluster, Store, every core
writes to sequential memory and then reads the value back.
The read back ensures that the captured latency includes
Store to cache and not to an internal write buffer only.
In the Private case (left bars) every core accesses a private data set; in the Shared case (right bars) all cores share one data set. The Shared case therefore shows the additional overhead of cache coherence protocols.
[Figure 3: L1 Cache 1, L1 Cache 2 and the L2 Cache. (1) Initial states: I in L1 Cache 1 and M in L1 Cache 2. (2) Event: Store. (3) Event: Get Exclusive. (4) Event: Invalidate. (5) State: I. (6) Event: Write-Back. (7) Event: Data. (8) State: M.]
Fig. 3. Transitions of a Dirty Miss
[Figure 4: timeline in cycles. L1 Cache: 4. L2 Cache: 128. Write 1 Complete: 729.]
Fig. 4. Timeline of a Dirty Miss
[Figure 5: two bar charts, Cortex-A53 and Xeon E5-2658, plotting time (ns) for Load, Store and Lock, each with a Private and a Shared bar.]
Fig. 5. Private vs Shared Access Latency
The Lock latency is the average latency to acquire and
release a spinlock. In case of Private every core accesses a
different spinlock. All lock operations complete within the
Private L1 cache and there is no waiting on the lock and
no one else is trying to acquire it. In Shared every core
acquires and releases the same lock. The critical section is
empty. Spinlock implementation is below and uses gcc built-
in atomics [50]. The acquire and release sequence is hence less
than 10 assembly instructions. In case of shared lock, hence,
most of the time is spent on access contention.
    lock:   while (__sync_lock_test_and_set(&lck, 1)) {};
    unlock: __sync_lock_release(&lck);
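For reference, the two-line spinlock above can be wrapped into small functions. The sketch below uses the same gcc built-ins; the wrapper names and the `spin_t` type are illustrative assumptions, and `spin_trylock` is an added convenience that is not part of the measured benchmark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lock word: 0 means free, 1 means held. */
typedef volatile int spin_t;

/* Spin until the previous value was 0, i.e. until we took the lock. */
void spin_lock(spin_t *lck)
{
    assert(lck != NULL);
    while (__sync_lock_test_and_set(lck, 1)) {
        /* busy-wait: every retry is a write request to the shared line */
    }
}

/* Single attempt; returns true if the lock was acquired. */
bool spin_trylock(spin_t *lck)
{
    assert(lck != NULL);
    return __sync_lock_test_and_set(lck, 1) == 0;
}

/* Release the lock by writing 0 with release semantics. */
void spin_unlock(spin_t *lck)
{
    assert(lck != NULL);
    __sync_lock_release(lck);
}
```

Because the test-and-set is a write, every acquisition attempt on a shared lock requires exclusive ownership of the cache line, which is why the Shared lock case is dominated by coherence traffic.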
While exact overheads of cache coherence are hard to mea-
sure on real platforms, it is evident that the latency to access
data is dependent on whether it is being shared across dif-
ferent cores. The Load/Store measurements represent latency
differences near full memory bandwidth and hence the effect of
coherence itself is diminished. Locks on the other hand require
that the underlying micro-ops/instructions complete in order.
Locks are hence affected more by the overheads of maintaining
coherence.
5 Implementation
As noted in Section 3.3, existing COTS ARMv8-A platforms cannot use the INC-OC memory type; hence, the evaluation is limited to simulations. We implement the controlled cacheability
on the gem5 architectural simulator [51] to realize a system
as shown in Figures 2 and 6. The simulation system is config-
ured to be representative of a Cortex-A53 [47]. The memory
hierarchy is comprised of 32 KB L1 cache per core, 2 MB L2
cache and 4 GB DRAM. This is in line with typical Cortex-
A53 based SoCs [48]. We chose configurable cache parameters
to mirror Cortex-A53 though some differences would surely
exist. For both the Cortex-A53 [47], [48] and the simulated
system, memory access latencies are as shown in Figure 2.
Further in this section we describe our implementation and
how it would function on a real system. Our modifications
add the support for the INC-OC memory type in gem5
simulator. The design is presented top to bottom, starting
from the application layer all the way to the cache microarchitecture. This design is close to what the authors, to the best of their knowledge, believe a hardware implementation should look like. gem5’s cache framework Ruby does not support
uncacheable memory type. Access latency to uncacheable
memory is much higher than any cache. Hence any comparison
with uncacheable memory is not interesting.
5.1 Application
Our primary aim is to provide applications the mechanism to
decide between the tradeoffs of worst case vs average access
time. ARM ISA allows expressing cacheability of memory
at a per-page granularity. We modified mmap [52] memory
allocation API to accept additional flags that are used by the
kernel to determine the cacheability of allocated pages. As
part of the evaluation of this work we modified some standard
benchmarks to use this cacheability control. An example INC-
OC allocation is shown here:
    buf = mmap(0, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_INCOC, fd, offset);
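A slightly fuller version of this allocation might look as follows. It is a sketch under stated assumptions: `MAP_INCOC` exists only in the modified kernel, its numeric value is not given in the paper, and here it falls back to 0 so that the code degrades to an ordinary shared mapping on a stock kernel. The helper names and the use of an anonymous mapping instead of a file descriptor are illustrative choices.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes MAP_ANONYMOUS under strict C standards */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical flag provided by the patched kernel headers.
 * Fallback of 0 (assumption): request an ordinary mapping instead. */
#ifndef MAP_INCOC
#define MAP_INCOC 0
#endif

/* Allocate `size` bytes intended to be cached in shared caches only.
 * Returns NULL on failure. */
void *alloc_incoc(size_t size)
{
    assert(size > 0);
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_INCOC, -1, 0);
    return (buf == MAP_FAILED) ? NULL : buf;
}

/* Release a buffer obtained from alloc_incoc(); returns 0 on success. */
int free_incoc(void *buf, size_t size)
{
    return munmap(buf, size);
}
```

Since cacheability is expressed per page, data that needs the worst-case guarantee would be grouped into such a buffer, while the rest of the working set keeps the default Normal Cacheable type.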
5.2 Kernel
For an OS to provide the cacheability control to the userspace application, two components are required. First, APIs must allow applications to choose the memory type, as shown above. Second, page table entries need to be set up with the right value as defined by the ISA to use the INC-OC memory type. We implemented both these components for Linux ARM64. Our mmap syscall implementation allows for additional flags to be passed by an application. These flags are then used to determine the memory type to set in the page tables. We created a new memory type in the kernel for INC-OC. The Linux kernel defines 6 memory types (see footnote 1). Two more memory types can be defined. So we were able to add the new memory type with minimal changes. The kernel changes can run on any ARMv8-A compliant platform and set the memory type bits for INC-OC as defined in the ARMv8-A ISA, but the eventual handling depends on the underlying hardware. According to the Cortex-A53 documentation [38], when such cacheability bits for the INC-OC type are set by the Linux kernel, Cortex-A53 treats these memory pages as uncacheable. This is a simplification in processor implementation as described in Section 3.3. In our simulation system the same cacheability bits allow the underlying caches to handle cacheability of memory requests in accordance with the full specification of the ARMv8-A ISA. The kernel changes in form of a patch will be made available here [53].
[Figure 6: a request from Core 0 is checked for inner cacheability (IC?). If yes, it is handled by the L1 cache controller and L1 cache (hit/miss and coherence); if no, it bypasses them. Below are the L2 cache with its L2 cache controller, coherence controller and directory, followed by main memory.]
Fig. 6. Memory Access Flow
5.3 Processor
We use the existing Timing Simple Processor in gem5 that
runs ARMv8-A ISA. We modify the Translation lookaside
buffer (TLB) to cache the additional page table attributes
and add them to every memory access sent to the memory
subsystem. Similar changes will be required in the real pro-
cessors to propagate the cacheability information from page
tables to memory requests.
5.4 L1 Cache Controller
We modified the L1 cache controller to check if the request
is for Inner Cacheable (IC) memory. The further handling is
dependent on this check.
5.4.1 Normal Memory
Our L1 cache controller follows the MSI protocol. The L1 cache controllers do not communicate with each other. Any L1 cache only communicates with its processor core and the shared L2 Cache. Hence L1 caches send requests to the L2 cache controller which completes all relevant actions before sending a response back to the L1 cache.
1. arch/arm64/include/asm/memory.h
5.4.2 INC-OC Memory
If a memory request is marked as Inner Non-Cacheable the
request is directly forwarded to the L2 cache as shown in
Figure 6. Similarly, any responses to these requests from the
L2 cache are forwarded to the processor. Since INC-OC data blocks are never cached in the private L1 caches, multiple copies of the same data cannot exist; therefore, no extra logic is required to maintain coherence. The cost of this change is
that all requests to INC-OC cache lines go directly to the L2
cache, including what could have been L1 cache hits.
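The check performed by the modified L1 cache controller amounts to a single routing decision per request. The following sketch models only that decision; the request structure, field names and counters are illustrative assumptions and do not correspond to gem5 or Ruby data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A memory request as it leaves the core, tagged from the page table. */
typedef struct {
    uint64_t addr;
    bool is_store;
    bool inner_cacheable; /* false for INC-OC pages */
} mem_request;

typedef enum { ROUTE_L1_LOOKUP, ROUTE_FORWARD_TO_L2 } l1_route;

/* Simple counters to observe how much traffic bypasses the L1 cache. */
typedef struct {
    unsigned l1_lookups;
    unsigned l2_forwards;
} l1_stats;

/* Decide where the L1 controller sends the request and record it. */
l1_route l1_route_request(const mem_request *req, l1_stats *stats)
{
    assert(req != NULL && stats != NULL);
    if (req->inner_cacheable) {
        stats->l1_lookups++;  /* normal path: MSI hit/miss handling in L1 */
        return ROUTE_L1_LOOKUP;
    }
    stats->l2_forwards++;     /* INC-OC path: no L1 allocation, no coherence */
    return ROUTE_FORWARD_TO_L2;
}
```

The `l2_forwards` counter makes the stated cost visible: every INC-OC access adds to the request load on the shared L2 cache, even when a private copy would have produced an L1 hit.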
5.5 L2 Cache Controller
The L2 cache is the only shared cache in our system as
depicted in Figure 2. It is strictly inclusive i.e. it contains
any cache line that is cached in an L1 cache. The DRAM
subsystem that connects to this L2 cache was not modified
and emulates a constant access time main memory.
5.5.1 Normal Memory
The L2 cache controller serves as the coherence directory for
this machine, in addition to the shared cache. It maintains
information about all cached lines and manages all requests
from L1 caches. It is able to ascertain the steps involved
in fulfilling such requests, like sending out invalidations or
collecting acknowledgements for the invalidations.
5.5.2 INC-OC Memory
L2 Cache Controller was modified to support the INC-OC
memory type. INC-OC cache lines share the same memory
space as normal memory type. But once allocated as an
INC-OC cache line, the coherence state machine marks them
separately and treats them like single-core system cache lines.
A cache line can convert between INC-OC and normal types but needs to be invalidated in between. A cache line cannot
be simultaneously treated as INC-OC and Normal Cacheable
memory. Due to the direct forwarding of all INC-OC requests
to L2 caches, there is increased contention on the L2 cache
bandwidth. Since the L2 cache is already inclusive of L1 caches
there is no increase in contention for L2 cache space. This
change converts the coherence problem to a cache bandwidth
contention problem which is well studied in literature [5], [17].
A major functional change in the L2 cache is the requirement
to handle the Load/Store exclusive (LDXR/STXR)
instruction pair. To this end, we mark cache lines that are the
target of an LDXR instruction and accept an STXR only if the
cache line has not been modified since the corresponding
LDXR from the same core. This is a new requirement for the
INC-OC memory type; for the Normal memory type, all
LDXR/STXR instructions are handled in the private cache
itself.
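The acceptance rule for STXR can be sketched as follows. This is a behavioural model under stated assumptions, not the modified gem5 controller: the class `L2ExclusiveMonitor`, its per-line set of marking cores and the boolean return value are hypothetical (the real STXR instruction reports success through a status register).

```python
class L2ExclusiveMonitor:
    """Toy model of LDXR/STXR handling for INC-OC lines at the shared L2."""

    def __init__(self):
        self.data = {}   # line address -> stored value
        self.marks = {}  # line address -> cores with an outstanding LDXR

    def ldxr(self, core, addr):
        """Load exclusive: read the line and mark it for `core`."""
        self.marks.setdefault(addr, set()).add(core)
        return self.data.get(addr, 0)

    def store(self, core, addr, value):
        """Plain store: modifies the line, so every mark on it is dropped."""
        self.data[addr] = value
        self.marks.pop(addr, None)

    def stxr(self, core, addr, value):
        """Store exclusive: accepted only if the mark of `core` is intact."""
        if core not in self.marks.get(addr, set()):
            return False  # line was modified since this core's LDXR
        self.data[addr] = value
        self.marks.pop(addr, None)  # the line changed: clear all marks
        return True
```

If two cores issue LDXR on the same lock word and both attempt STXR, only the first STXR is accepted in this model; the second core has to repeat the LDXR and then observes the updated value.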
5.6 Simulation Modes
We use the simulation system in two modes. Trace mode
uses the memory subsystem in isolation to replay data ac-
cess traces. Full system mode emulates a complete hardware
platform.
5.6.1 Trace Mode Simulation
Trace mode for gem5 was developed by Hassan et al. [8]. It allows
the gem5 memory subsystem to be used on its own. The processor
subsystem is replaced by a synthetic request injector: based
on an input trace file, a dummy processor generates memory
requests. The trace file contains the memory address, the operation
(read/write) and the time stamp for the initial request injection.
We manually construct traces to create custom scenarios.
These scenarios do occur in real systems but are difficult to
reproduce and observe.
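As an illustration of how such a scenario can be written down, the sketch below builds one trace per core in which every core writes the same address at the same time. The on-disk layout is an assumption: the text only states that an entry carries an address, an operation and a time stamp, so the whitespace-separated `address op time` line format, the `RD`/`WR` mnemonics and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEntry:
    """One memory request: address, operation and injection time stamp."""
    addr: int
    op: str    # "RD" or "WR" (assumed mnemonics)
    time: int  # time stamp of the initial request injection

    def to_line(self):
        """Serialize as 'address op time' (assumed layout)."""
        return f"{self.addr:#x} {self.op} {self.time}"

    @staticmethod
    def from_line(line):
        """Parse a line produced by to_line()."""
        addr, op, time = line.split()
        return TraceEntry(int(addr, 16), op, int(time))


def simultaneous_write_traces(num_cores, addr, time):
    """Return {core id: [trace lines]} with one write per core to `addr`."""
    entry = TraceEntry(addr, "WR", time)
    return {core: [entry.to_line()] for core in range(num_cores)}
```

Calling `simultaneous_write_traces(4, 0x1000, 100)` yields four single-line traces, one per core, that request the same cache line at the same instant.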
5.6.2 Full System Simulation
In this mode the simulator emulates a real platform. The cache
and memory subsystems are the same as in the previous mode, but
there is a full-fledged ARMv8-A compatible processor subsystem.
This mode supports running a Linux kernel. Benchmark
applications are oblivious of the underlying simulator and run
as if on a real platform. As noted in Section 5 and Figure 2,
we have selected simulation parameters so that the simulated
system's cache access latencies are close to those of the Cortex-A53 [47],
[48]. This full system simulation brings together all aspects
of our implementation as described in Sections 5.1-5.5. The
simulator with all the changes will be made available [53].
5.7 Discussion and Limitations
The design is close to what the authors, to the best of their
knowledge, believe a hardware implementation should look
like. It seems impractical to have dedicated ports from the
processor to each level of the memory system, so the processor
should only interact with its own L1 cache. The
L1 cache, based on parameters in the memory request, can
then determine if it should allocate lines for such a request.
This design choice does slow down access to INC-OC lines.
Additional cycles are spent competing with any pending
L1 cacheable memory requests and going through the L1 cache
controller logic. A direct port would have bypassed all this, but
we believe that the silicon and metal costs would be prohibitive.
The cascading communication path also makes our solution
scalable to multilevel caches as long as there is enough ISA
support to express the memory types that define cacheability
at each level.
6 Evaluation
The evaluation is based on custom scenarios, microbench-
marks and benchmarks from the SPLASH2 [54] suite.
6.1 Trace Mode Simulation
As described before in Section 5.6.1, Trace mode synthetically
injects accesses to the memory subsystem of gem5, based on
input trace files. The traces can be manually written to create
specific scenarios.
Figure 7 shows an extended scenario similar to Section
4.2. A write request to the same address is generated by each
core, simultaneously. Write 1 complete is the same dirty miss
scenario as Figure 4. We additionally note the time when all
4 write requests finish. Similar experiments for read requests
were conducted but are not shown. Shared-state data, which
can safely coexist in multiple private caches, is served better
by normal memory than by INC-OC.
The trace files and the gem5 coherence log that provides the
timing information are available at [53].
Observation: The total time to process the dirty miss is
52% shorter for INC-OC memory. For all 4 write requests the
total time was reduced by 74%.
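The reductions can be recomputed from the cycle counts annotated in Figure 7. The sketch below assumes that 729 and 1826 cycles belong to Normal memory and 354 and 474 cycles to INC-OC; the dictionary keys and the helper name are chosen for illustration only.

```python
def reduction_percent(baseline_cycles, improved_cycles):
    """Relative reduction of `improved_cycles` with respect to the baseline."""
    return 100.0 * (baseline_cycles - improved_cycles) / baseline_cycles


# Cycle counts as read from Figure 7 (assignment to the series is assumed).
normal = {"write_1_complete": 729, "all_writes_complete": 1826}
inc_oc = {"write_1_complete": 354, "all_writes_complete": 474}

reductions = {
    event: reduction_percent(normal[event], inc_oc[event])
    for event in normal
}
```

With these inputs the reduction is roughly 51.4% for the first write and roughly 74.0% for all four writes, close to the values reported above.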
Limitation: A dirty miss is a fairly common occurrence
for shared data accesses, but multiple parallel write requests
for the same data do not happen in well-written programs.
From a coherence perspective, a similar situation can occur
when atomic accesses, like spinlocks, request exclusive access
to cache lines in the course of executing an LDXR instruction.
6.2 Full System Simulation
In full system simulation, programs run on a full-fledged
Linux kernel running on top of the simulation platform.
This closely approximates a real machine. To reduce
the variability in program execution time caused by the CPU and
DRAM, we use the timing simple models for both included in
gem5. These models provide constant time DRAM accesses
and simplified instruction pipelines, while still maintaining
detailed timing of events. Also, between Normal and INC-OC
only an mmap argument flag (footnote 2) is changed before compilation.
The programs are otherwise identical. All this helps limit the
evaluation to purely a comparison between the Normal and
INC-OC memory types.
6.2.1 Synthetic Benchmarks
In Figure 8 we show the results of running the microbenchmarks
discussed in Section 4. As before, Load, Store and Lock
refer to the average latency of completing these operations.
For Normal Cacheable memory we measure the latency on
private memory blocks per core and also on memory blocks
shared among all cores. Next, we repeat the shared memory
experiments with the INC-OC memory type.
Footnote 2: As shown in Section 5.1.
[Figure 7: timeline of the write requests through the L1 and L2 caches, in cycles. Normal: Write 1 complete at 729, all writes complete at 1826. INC-OC: Write 1 complete at 354, all writes complete at 474.]
Fig. 7. Worst-Case Write-Request Contention
Observation: The latency to acquire locks is reduced significantly
by the use of the INC-OC memory type. Load and Store
times increase for INC-OC.
Discussion: The forced ordering of locking primitives
avoids other effects, but since data coherence among contending
cores depends on coherence hardware, the effect of
coherence dominates. In the case of loads and stores, the combined
effect of bandwidth limitations, additional latency and parallel
handling of coherence of individual lines makes the INC-OC
average access latency higher.
[Figure 8: average latency in ns.
Operation   Normal Private   Normal Shared   INC-OC Shared
Load        5                10              23
Store       11               30              46
Lock        9                788             116]
Fig. 8. Full system simulation synthetic benchmarks
6.2.2 Benchmark Evaluation
In Figure 9 we compare the worst case run time for SPLASH2
[54] benchmarks on the full system simulation. For Normal,
all memory used is normal cacheable. Blind INC-OC allocates
every variable that is visible across threads created
by the benchmark with the INC-OC memory type. For Program
Aware INC-OC we identify program variables and memory
locations that can be safely accessed as Normal memory.
Our optimization treats as normal memory those variables
that (1) are accessed by a single thread, that (2) remain
constant in parallel parts of the benchmark, or that (3) are
within memory ranges that are divided among threads by
offset ranges. Results are normalized to the Normal case.
The reported measurements represent the worst-case runtime
observed in 100 runs with warmed caches. All benchmarks
use spinlocks for synchronization which are allocated with an
INC-OC memory type, except in the Normal case.
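The three rules used for Program Aware INC-OC can be written as a small decision function. This is a sketch of the manual classification described above, not a tool used in the evaluation: the `VarInfo` record and its fields are hypothetical and would have to be filled in by the developer after inspecting the program.

```python
from dataclasses import dataclass
from enum import Enum


class MemType(Enum):
    NORMAL = "normal"
    INC_OC = "inc-oc"


@dataclass
class VarInfo:
    """Facts about one variable, gathered by inspecting the program."""
    name: str
    visible_across_threads: bool  # reachable by more than one thread
    accessing_threads: int        # threads that actually access it
    written_in_parallel: bool     # modified in parallel parts of the program
    partitioned_by_offset: bool   # threads own disjoint offset ranges
    is_spinlock: bool = False


def blind_type(var):
    """Blind INC-OC: everything visible across threads becomes INC-OC."""
    if var.is_spinlock or var.visible_across_threads:
        return MemType.INC_OC
    return MemType.NORMAL


def program_aware_type(var):
    """Program Aware INC-OC: keep Normal memory wherever it is safe."""
    if var.is_spinlock:
        return MemType.INC_OC   # synchronization variables stay INC-OC
    if var.accessing_threads <= 1:
        return MemType.NORMAL   # rule (1): accessed by a single thread
    if not var.written_in_parallel:
        return MemType.NORMAL   # rule (2): constant in parallel parts
    if var.partitioned_by_offset:
        return MemType.NORMAL   # rule (3): divided among threads by offset
    return MemType.INC_OC       # truly shared and modified data
```

A read-only input array that every thread can see is INC-OC under the blind policy but Normal under the program aware policy, which is where the difference in access counts between the two configurations comes from.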
Observation: Blind INC-OC execution times are up to
5.7× higher than Normal, while the performance of Program
Aware INC-OC is nearly identical to Normal.
Discussion: The blind approach is overly conservative in
INC-OC allocation, while Program Aware INC-OC precisely
targets truly shared memory requests. However, this is a manual
process of understanding the program and making choices
about cacheability. We expect this to become part of the regular
practice for the development of multi-threaded real-time
applications. Nevertheless, this is the true cost of the proposed
solution. The nearly identical performance is attributable to the
small percentage of INC-OC accesses in Program Aware INC-OC,
at most 0.08% across benchmarks. This highlights the need
for a precise and selective mechanism for handling coherence
overheads in multi-threaded and parallel applications (footnote 3).
[Figure 9: normalized worst-case execution time.
Benchmark          Normal   Blind INC-OC   Program Aware INC-OC
FFT                1        1.3            0.9
Ocean              1        2.1            1.0
LU Factorization   1        5.7            1.0
Radix              1        1.1            1.0
Water Nsq          1        1.6            1.0]
Fig. 9. Benchmark evaluation for Full System simulation.
Due to space limitations, a variability analysis is not
shown. The gem5 simulator aims at deterministic repeatability
of execution. Consequently, even in full system mode the
observed standard deviation in the execution time of benchmarks
is quite small, roughly 1% of the execution time.
7 Security Implications
Since we are expanding the user space software’s control of
low level hardware, security implications need to be carefully
considered.
Footnote 3: See Section 8.3 for further discussion on this.
7.1 Locks
RISC and CISC architectures handle atomic operations differently
at the architectural level. This difference has important
implications for the use of the INC-OC memory type.
7.1.1 CISC
Atomic operations in x86, like XCHG, disallow any access to
the targeted memory area until all involved micro-operations
are complete. This is achieved by asserting a LOCK [36]
signal to block any access to the relevant memory buses. If
the targeted memory area is in a private cache, the memory
location is modified internally and the cache coherency mechanism
ensures that the operation is carried out atomically.
For atomic operations on INC-OC and Uncacheable memory,
the LOCK signal will have to be asserted on the shared memory
buses. This opens up the possibility of a core starving all other
cores of memory bandwidth by continuously executing atomic
operations on INC-OC or Uncacheable memory [55].
7.1.2 RISC
RISC ISAs like ARM and MIPS use Load-Linked/Store-Conditional
semantics to implement lock-free atomic operations.
Since system resources are never locked, the vulnerability
discussed above does not apply.
7.2 Cache Conflicts
As a result of the introduction of the INC-OC memory type,
userspace applications can now directly allocate cache lines
in shared cache levels. Such allocations can be used to force
cache lines owned by other applications to be evicted from the
cache. However, all related attacks are already possible with Normal
cacheable memory via directed private cache misses [56]. The
access patterns and size required to achieve the same effect
via normal cacheable memory depend on the inclusivity and
allocation policy of the caches.
8 Discussion
In this section we discuss the strengths and limitations of
cacheability control and INC-OC memory type.
8.1 Hardware/Software changes
Our solution requires changes to existing hardware. But the
changes are minimal as we leverage existing ISA features.
In comparison, competing techniques modify the cache co-
herence controller implementation and behavior. Cache co-
herence protocols are difficult to verify [57] and certify for
real-time behaviour. Any solution that changes the coherence
protocol states, transient states, timing, etc. is hence difficult
to adopt. The development and verification costs can be
prohibitive. INC-OC memory does not modify the coherence
protocols or controllers. The memory type itself classifies the
memory requests for INC-OC type to be handled outside
of the standard coherence state machine. So while cache
controllers need to be modified to handle the new memory
type, the coherence controller itself remains unchanged.
The kernel modifications required to support the INC-OC
memory type are also small. Since the Linux kernel already
has a notion of memory types, the introduction of the INC-OC
type required fewer than 10 new lines of code. Allowing the mmap
system call to set the memory type in a simple implementation
required fewer than 100 new lines of code.
8.2 WCET Analysis
Due to the complex nature of the coherence protocols, a
WCET analysis of coherence controllers, even for those de-
signed for real-time systems, can be an onerous task. INC-OC
memory type eliminates the need for such an analysis.
8.3 Precise Impact
A problem with existing solutions for predictable cache coherence
is that they impact all data accesses, regardless of whether
the data is private or shared. In contrast, the
INC-OC memory type is used explicitly and precisely. The default
memory type for the memory allocation API is normal memory.
As discussed before, the mechanism of this selection can be as
simple as passing an additional flag during memory allocation.
The solution can be selectively applied by the developer, who
can judiciously decide between the worst case and average
memory access times on a case-by-case basis. For this reason,
significant performance benefits can be observed with INC-OC
on the same benchmarks compared to the alternative
solution proposed in [8], namely Predictable MSI (PMSI).
Specifically, although [8] uses a less realistic simulation setup,
applications using PMSI incur a 1.45× slowdown compared to
the baseline. Conversely, our evaluations indicate that INC-OC
has nearly identical performance to the Normal baseline.
As previously mentioned, this precision comes at the cost of
manual application code refinement. We argue however that
such refinement could largely be automated at compile-time.
That is, as long as identification of shared data can be auto-
matically performed, which has been successfully attempted
in the past [19].
8.4 Applicability
The detailed cache coherence parameters of a system may limit
the usefulness of the INC-OC memory type.
8.4.1 Coherence Protocol
A prerequisite for INC-OC memory type being useful is that
the L2 cache access time is less than the worst case access time
for normal cacheable memory. While in our observation that
is true for some platforms, it is possible to build a processor
where this is not true. A larger number of cores trying to
modify the same data at the same time will increase the worst
case if the data is cacheable in private caches. On the other
hand, coherence protocols like MOESI that allow dirty data
to be communicated directly among private caches without
requiring write-back would decrease the worst case latency for
private caches. The exact applicability of INC-OC memory
type depends on the full characteristics of the cache and co-
herence controllers that are usually not publicly documented
by vendors. Vendors tend to include only the stable coherence
state information in their product documentation [36].
8.4.2 Bus Structure
We implement directory-based coherence, i.e., the
directory, the L2 cache in this case, maintains the list of all L1
caches that are using a particular cache line. The directory
sends directed messages to these L1 caches when required. A
snooping coherence bus may reduce the worst case access time
for normal cacheable memory, although snooping approaches
are known not to scale well with a large number of cores.
8.4.3 Non-Uniform Cache Architectures
Non-Uniform Cache Architectures (NUCA) [58] use a phys-
ically distributed last level cache to reduce wire delays for
cache access. From a given processor core, different banks of
the NUCA cache have different access latency. For worst case
analysis with INC-OC memory type the largest latency bank,
accessible by a core, should be considered.
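The rule for NUCA systems amounts to taking a maximum over the banks a core can reach. The sketch below uses hypothetical bank identifiers and latency values; it is not based on measurements of any platform.

```python
def worst_case_bank_latency(bank_latency_cycles, reachable_banks):
    """Largest access latency among the NUCA banks reachable by one core."""
    if not reachable_banks:
        raise ValueError("the core must be able to reach at least one bank")
    return max(bank_latency_cycles[bank] for bank in reachable_banks)


# Hypothetical latencies (in cycles) from one core to four cache banks.
latency_from_core0 = {"bank0": 20, "bank1": 28, "bank2": 36, "bank3": 44}
bound_all_banks = worst_case_bank_latency(latency_from_core0, latency_from_core0)
bound_near_banks = worst_case_bank_latency(latency_from_core0, ["bank0", "bank1"])
```

If INC-OC pages of a task can be confined to nearby banks, the bound tightens accordingly, as `bound_near_banks` shows for the two closest banks in this made-up configuration.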
8.5 Other Sources of Variability
Use of the INC-OC memory type eliminates only the inter-core
interference due to cache coherence. Both shared and
private caches are still affected by cache misses and bandwidth
contention caused by different cores. One core's cache line
can still be evicted by another core's accesses. Additionally,
multiple cores still share the bandwidth of the shared
caches and of the path all the way to main memory. This is another
source of contention that the use of INC-OC memory does not
remove. In fact, due to the direct forwarding of requests to shared
caches, the use of INC-OC memory increases the shared cache
bandwidth usage. These sources of contention have existed
since the use of memory caches. They are not specific to
multi-core systems. Many existing works have addressed these
problems as discussed in Section 2.
8.6 Application Models
The impact of the INC-OC memory type is heavily dependent on
application characteristics. At one end of the spectrum
are Non-Blocking algorithms [59] or Worker Queue models
that severely limit data sharing. For these applications, the
contribution of INC-OC will be small. At the other end, Data
Streaming models or chains of producers and consumers involve
multiple threads continuously sharing and modifying
large amounts of data. In this case, INC-OC would lead to
significant improvements in predictability. Due to limitations
of platform and benchmarks, we have not been able to evaluate
these application models yet.
9 Conclusion
In this paper we present the INC-OC memory type and
a series of mechanisms to select memory types from user-
space. Memory types can be defined at a page granularity.
A developer can selectively and judiciously decide which
memory type to use based on application requirements. INC-
OC memory type bypasses private caches, hence avoiding
coherence overheads and reducing worst-case memory access
latency. Overall, judicious use of INC-OC memory type can
help applications reduce their worst-case execution-time by
reducing the unpredictability arising from black-box hardware
coherence management.
Acknowledgments
The material presented in this paper is based upon work
supported by the National Science Foundation (NSF) under
grant number CNS-1646383. M. Caccamo was also supported
by an Alexander von Humboldt Professorship endowed by
the German Federal Ministry of Education and Research.
Any opinions, findings, and conclusions or recommendations
expressed in this publication are those of the authors and do
not necessarily reflect the views of the NSF.
References
[1] K. J. Kuhn, “Moore’s law past 32nm: Future challenges in device
scaling,” in 2009 13th International Workshop on Computational
Electronics, May 2009, pp. 1–6.
[2] L. Sha, M. Caccamo, R. Mancuso, J.-E. Kim, M.-K. Yoon,
R. Pellizzoni, H. Yun, R. Kegley, D. Perlman, G. Arundale et al.,
“Single core equivalent virtual machines for hard real-time
computing on multicore processors,” Tech. Rep., 2014.
[3] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and
R. Rajkumar, “Bounding memory interference delay in cots-
based multi-core systems,” in 2014 IEEE 19th Real-Time and
Embedded Technology and Applications Symposium (RTAS),
April 2014, pp. 145–154.
[4] R. Mancuso, R. Dudko, E. Betti, M. Cesati, M. Caccamo,
and R. Pellizzoni, “Real-time cache management framework for
multi-core architectures,” in 2013 IEEE 19th Real-Time and
Embedded Technology and Applications Symposium, RTAS 2013,
2013, pp. 45–54.
[5] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “Mem-
ory bandwidth management for efficient performance isolation
in multi-core platforms,” IEEE Transactions on Computers,
vol. 65, no. 2, pp. 562–576, 2015.
[6] H. Yun, R. Mancuso, Z. Wu, and R. Pellizzoni, “Palloc: Dram
bank-aware memory allocator for performance isolation on mul-
ticore platforms,” in 2014 IEEE 19th Real-Time and Embedded
Technology and Applications Symposium (RTAS), April 2014,
pp. 155–166.
[7] M. Chisholm, N. Kim, B. C. Ward, N. Otterness, J. H. Anderson,
and F. D. Smith, “Reconciling the tension between hardware iso-
lation and data sharing in mixed-criticality, multicore systems,”
in 2016 IEEE Real-Time Systems Symposium (RTSS). IEEE,
2016, pp. 57–68.
[8] M. Hassan, A. M. Kaushik, and H. Patel, “Predictable cache
coherence for multi-core real-time systems,” in 2017 IEEE Real-
Time and Embedded Technology and Applications Symposium
(RTAS), April 2017, pp. 235–246.
[9] N. Sritharan, A. M. Kaushik, M. Hassan, and H. D. Patel,
“Hourglass: Predictable time-based cache coherence protocol for
dual-critical multi-core systems,” CoRR, abs/1706.07568, 2017.
[10] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, L. Stougie,
and A. Wiese, “A generalized parallel task model for recurrent
real-time processes,” in 2012 IEEE 33rd Real-Time Systems
Symposium, Dec 2012, pp. 63–72.
[11] S. Baruah, “Improved multiprocessor global schedulability anal-
ysis of sporadic dag task systems,” in 2014 26th Euromicro
Conference on Real-Time Systems, July 2014, pp. 97–105.
[12] V. Bonifaci, A. Marchetti-Spaccamela, S. Stiller, and A. Wiese,
“Feasibility analysis in the sporadic dag task model,” in 2013
25th Euromicro Conference on Real-Time Systems, July 2013,
pp. 225–233.
[13] M. Yang, T. Amert, K. Yang, N. Otterness, J. H. Anderson, F. D.
Smith, and S. Wang, “Making OpenVX really ‘real time’,” in 2018
IEEE Real-Time Systems Symposium (RTSS). IEEE, 2018, pp.
80–93.
[14] H. Kim, A. Kandhalu, and R. Rajkumar, “Coordinated cache
management for predictable multi-core real-time systems,”
Technical report, 2014.
[15] G. Gracioli, A. Alhammad, R. Mancuso, A. A. Fröhlich, and
R. Pellizzoni, “A survey on cache management mechanisms
for real-time embedded systems,” ACM Comput. Surv.,
vol. 48, no. 2, pp. 32:1–32:36, Nov. 2015. [Online]. Available:
http://doi.acm.org/10.1145/2830555
[16] P. K. Valsan, H. Yun, and F. Farshchi, “Taming non-blocking
caches to improve isolation in multicore real-time systems,” in
2016 IEEE Real-Time and Embedded Technology and Applica-
tions Symposium (RTAS), April 2016, pp. 1–12.
[17] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “Mem-
ory access control in multiprocessor for real-time systems with
mixed criticality,” in 2012 24th Euromicro Conference on Real-
Time Systems, July 2012, pp. 299–308.
[18] D. Hardy, T. Piquet, and I. Puaut, “Using bypass to tighten
wcet estimates for multi-core processors with shared instruction
caches,” in 2009 30th IEEE Real-Time Systems Symposium, Dec
2009, pp. 68–77.
[19] B. Lesage, D. Hardy, and I. Puaut, “Shared data caches conflicts
reduction for wcet computation in multi-core architectures,”
in 18th International Conference on Real-Time and Network
Systems, 2010, p. 2283.
[20] J. Liedtke, H. Hartig, and M. Hohmuth, “OS-controlled cache
predictability for real-time systems,” in Proceedings Third IEEE
Real-Time Technology and Applications Symposium. IEEE,
1997, pp. 213–224.
[21] M. Chisholm, N. Kim, B. C. Ward, N. Otterness, J. H. Anderson,
and F. D. Smith, “Reconciling the tension between hardware iso-
lation and data sharing in mixed-criticality, multicore systems,”
in 2016 IEEE Real-Time Systems Symposium (RTSS). IEEE,
2016, pp. 57–68.
[22] N. Kim, B. C. Ward, M. Chisholm, C. Fu, J. H. Anderson, and
F. D. Smith, “Attacking the one-out-of-m multicore problem by
combining hardware management with mixed-criticality provi-
sioning,” in 2016 IEEE Real-Time and Embedded Technology
and Applications Symposium (RTAS), April 2016, pp. 1–12.
[23] B. C. Ward, J. L. Herman, C. J. Kenna, and J. H. Ander-
son, “Outstanding paper award: Making shared caches more
predictable on multicore platforms,” in 2013 25th Euromicro
Conference on Real-Time Systems, July 2013, pp. 157–167.
[24] G. Gracioli and A. A. Fröhlich, “On the influence of shared
memory contention in real-time multicore applications,” in 2014
Brazilian Symposium on Computing Systems Engineering, Nov
2014, pp. 25–30.
[25] G. Gracioli and A. A. Fröhlich, “On the design and
evaluation of a real-time operating system for cache-
coherent multicore architectures,” SIGOPS Oper. Syst. Rev.,
vol. 49, no. 2, pp. 2–16, Jan. 2016. [Online]. Available:
http://doi.acm.org/10.1145/2883591.2883594
[26] A. Bansal, R. Tabish, G. Gracioli, R. Mancuso, R. Pellizzoni,
and M. Caccamo, “Evaluating the memory subsystem of a
configurable heterogeneous mpsoc,” in Workshop on Operating
Systems Platforms for Embedded Real-Time Applications (OS-
PERT), 2018, p. 55.
[27] E. A. Emerson and K. S. Namjoshi, “Verification of a parameter-
ized bus arbitration protocol,” in Computer Aided Verification,
A. J. Hu and M. Y. Vardi, Eds. Berlin, Heidelberg: Springer
Berlin Heidelberg, 1998, pp. 452–463.
[28] S. Qadeer, “Verifying sequential consistency on shared-memory
multiprocessors by model checking,” IEEE Transactions on Par-
allel and Distributed Systems, vol. 14, no. 8, pp. 730–741, Aug
2003.
[29] F. Pong and M. Dubois, “A new approach for the verification
of cache coherence protocols,” IEEE Trans. Parallel Distrib.
Syst., vol. 6, no. 8, pp. 773–787, Aug. 1995. [Online]. Available:
http://dx.doi.org/10.1109/71.406955
[30] A. Pyka, M. Rohde, and S. Uhrig, “Extended performance anal-
ysis of the time predictable on-demand coherent data cache for
multi- and many-core systems,” in 2014 International Confer-
ence on Embedded Computer Systems: Architectures, Modeling,
and Simulation (SAMOS XIV), July 2014, pp. 107–114.
[31] S. H. Pugsley, J. B. Spjut, D. W. Nellans, and
R. Balasubramonian, “SWEL: Hardware Cache Coherence
Protocols to Map Shared Data Onto Shared Caches,” in
Proceedings of the 19th International Conference on Parallel
Architectures and Compilation Techniques, ser. PACT ’10.
New York, NY, USA: ACM, 2010, pp. 465–476. [Online].
Available: http://doi.acm.org/10.1145/1854273.1854331
[32] D. A. Patterson and J. L. Hennessy, Computer Organization
and Design: The Hardware/Software Interface, 3rd ed. San
Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007.
[33] T. Suh, D. M. Blough, and H.-H. S. Lee, “Supporting cache
coherence in heterogeneous multiprocessor systems,” in Proceed-
ings Design, Automation and Test in Europe Conference and
Exhibition, vol. 2, Feb 2004, pp. 1150–1155 Vol.2.
[34] Wikipedia, “Comparison of armv8-a cores,” 2019, [Online;
accessed 24-Jan-2019]. [Online]. Available: https://en.wikipedia.
org/wiki/Comparison_of_ARMv8-A_cores
[35] Arm Holdings, “Arm cortex-a series program-
mer’s guide for armv8-a,” 2018. [Online]. Avail-
able: http://infocenter.arm.com/help/index.jsp?topic=/com.
arm.doc.den0024a/CEGDBEJE.html
[36] Intel Corporation, “Intel 64 and ia-32 architectures software
developer manual,” 2016. [Online]. Available: https://software.
intel.com/en-us/articles/intel-sdm
[37] Arm Holdings, “ARM Architecture Reference Manual ARMv8,
for ARMv8-A architecture profile,” 2017.
[38] ——, “ARM Cortex-A53 MPCore Processor Tech-
nical Reference Manual,” 2018. [Online]. Avail-
able: http://infocenter.arm.com/help/index.jsp?topic=/com.
arm.doc.ddi0500j/CHDGIBBD.html
[39] ——, “ARM Cortex-A57 MPCore Processor Tech-
nical Reference Manual,” 2018. [Online]. Avail-
able: http://infocenter.arm.com/help/index.jsp?topic=/com.
arm.doc.ddi0488h/Chunk477744436.html
[40] ——, “ARM Cortex-A72 MPCore Processor Tech-
nical Reference Manual,” 2018. [Online]. Avail-
able: http://infocenter.arm.com/help/index.jsp?topic=/com.
arm.doc.100095 0003 06 en/Chunk1869832971.html
[41] Nvidia Corporation, “Tegra x2 (parker series soc) technical
reference manual,” 2017. [Online]. Available: https://developer.
nvidia.com/embedded/downloads
[42] ——, “Technical Reference Manual Xavier Series SoC,” 2019.
[Online]. Available: https://developer.nvidia.com/embedded/
dlc/xavier-technical-reference-manual
[43] “Mips architecture for programmers volume
iii: The mips64 and micro mips64 privi-
leged resource architecture,” 2014. [Online]. Avail-
able: https://s3-eu-west-1.amazonaws.com/downloads-mips/
documents/MD00091-2B-MIPS64PRA-AFP-05.04.pdf
[44] “MIPS32 M6200 Processor Core Family Pro-
grammer’s Guide,” 2016. [Online]. Avail-
able: https://s3-eu-west-1.amazonaws.com/downloads-mips/
documents/MD01093-2B-M6200SW-USG-01.00.pdf
[45] Arm Holdings, “6.2.5. Data cache coherency,” 2019. [Online].
Available: http://infocenter.arm.com/help/topic/com.arm.doc.
ddi0500j/ch06s02s05.html
[46] D. Abts, S. Scott, and D. J. Lilja, “So many states, so little
time: Verifying memory coherence in the cray x1,” in Parallel
and Distributed Processing Symposium, 2003. Proceedings. In-
ternational. IEEE, 2003, pp. 10–pp.
[47] Arm Holdings, “Cortex A-53,” 2018. [Online]. Available: https://
developer.arm.com/products/processors/cortex-a/cortex-a53
[48] Xilinx, Inc., “Ultrascale+ MPSoC ZCU102,” 2018. [Online].
Available: https://www.xilinx.com/products/boards-and-kits/
ek-u1-zcu102-g.html
[49] Intel Corporation, “Intel Xeon Processor E5-2658 v4,” 2018.
[Online]. Available: https://ark.intel.com/products/91771/
Intel-Xeon-Processor-E5-2658-v4-35M-Cache-2-30-GHz-
[50] GCC, “Built-in functions for atomic memory access,” 2018.
[Online]. Available: https://gcc.gnu.org/onlinedocs/gcc-4.1.0/
gcc/Atomic-Builtins.html
[51] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi,
A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti,
R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A.
Wood, “The gem5 simulator,” SIGARCH Comput. Archit.
News, vol. 39, no. 2, pp. 1–7, Aug. 2011. [Online]. Available:
http://doi.acm.org/10.1145/2024716.2024718
[52] M. Kerrisk, “Linux Programmer’s Manual,” 2019. [Online].
Available: http://man7.org/linux/man-pages/man2/mmap.2.
html
[53] “INC-OC,” 2019. [Online]. Available: https://gitlab.engr.illinois.
edu/rtesl/inc-oc
[54] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The
splash-2 programs: characterization and methodological consid-
erations,” in Proceedings 22nd Annual International Symposium
on Computer Architecture, June 1995, pp. 24–36.
[55] T. Zhang, Y. Zhang, and R. B. Lee, “Dos attacks on your
memory in cloud,” in Proceedings of the 2017 ACM on Asia
Conference on Computer and Communications Security. ACM,
2017, pp. 253–265.
[56] D. J. Bernstein, “Cache-timing attacks on aes,” 2005.
[57] D. J. Sorin, M. Plakal, A. E. Condon, M. D. Hill, M. M. K.
Martin, and D. A. Wood, “Specifying and verifying a broad-
cast and a multicast snooping cache coherence protocol,” IEEE
Transactions on Parallel and Distributed Systems, vol. 13, no. 6,
pp. 556–578, June 2002.
[58] C. Kim, D. Burger, and S. W. Keckler, “An adaptive, non-
uniform cache structure for wire-delay dominated on-chip
caches,” in Acm Sigplan Notices, vol. 37, no. 10. ACM, 2002,
pp. 211–222.
[59] M. B. Greenwald and D. R. Cheriton, Non-blocking synchroniza-
tion and system design. Citeseer, 1999, vol. 99, no. 1624.
