Deterministic memory abstraction and supporting multicore system architecture by Farshchi, Farzad et al.
Citation (published version): Farzad Farshchi, Prathap Kumar Valsan, Renato Mancuso, Heechul
Yun. "Deterministic Memory Abstraction and Supporting Multicore
System Architecture."
https://hdl.handle.net/2144/28189
The Deterministic Memory Abstraction and
Supporting Cache Architecture for Multicore
Real-Time Systems
Farzad Farshchi†, Prathap Kumar Valsan*, Renato Mancuso‡, Heechul Yun†
† University of Kansas, {farshchi, heechul.yun}@ku.edu
* Intel, prathap.kumar.valsan@intel.com
‡ Boston University, rmancuso@bu.edu
Abstract—Poor timing predictability of multicore processors
has been a long-standing challenge in the real-time systems
community. In this paper, we make a case that a fundamental
problem that prevents efficient and predictable real-time com-
puting on multicore is the lack of a proper memory abstraction
to express memory criticality, which cuts across various layers
of the system: the application, OS, and hardware. We therefore
propose a new holistic resource management approach driven by
a new memory abstraction, which we call Deterministic Memory.
The key characteristic of deterministic memory is that the
platform—the OS and hardware—guarantees small and tightly
bounded worst-case memory access timing. In contrast, we call
the conventional memory abstraction best-effort memory, for
which only highly pessimistic worst-case bounds can be achieved.
We propose to utilize both abstractions to achieve high time
predictability but without significantly sacrificing performance.
We present how the two memory abstractions can be realized
with small extensions to existing OS and hardware architecture.
In particular, we show the potential benefits of our approach
in the context of shared cache management, by presenting a
deterministic memory-aware cache architecture and its manage-
ment scheme. We evaluate the effectiveness of the deterministic
memory-aware cache management approach compared with a
conventional way-based cache partitioning approach, using a set
of synthetic and real benchmarks. The results show that our
approach achieves (i) the same degree of temporal determinism
as traditional way-based cache partitioning for deterministic
memory, while (ii) freeing up to 49% of additional cache space, on
average, for best-effort memory, and consequently improving the
cache hit rate by 39%, on average, for non-real-time workloads.
We also discuss how the deterministic memory abstraction can
be leveraged in other parts of the memory hierarchy, particularly
in the memory controller.
I. INTRODUCTION
High-performance embedded multicore platforms are in-
creasingly demanded in cyber-physical systems (CPS)—
especially those in automotive and aviation applications—to
cut cost and reduce size, weight, and power (SWaP) of the
system via consolidation [1].
Consolidating multiple tasks with different criticality levels
(a.k.a. mixed-criticality systems [2], [3]) on a single multicore
processor is, however, extremely challenging because interfer-
ence in the shared hardware resources in the memory hierarchy
can significantly alter the tasks’ timing characteristics. Poor
time predictability of multicore platforms is a major hurdle
that makes their adoption challenging in many safety-critical
CPS. For example, the CAST-32A position paper by the
avionics certification authorities comprehensively discusses the
certification challenges of multicore avionics [4]. Therefore,
in the aerospace industry, it is common practice to disable all
but one core [5], because extremely pessimistic worst-case-
execution times (WCETs) nullify any performance benefits
of using multicore processors in critical applications. This
phenomenon is also known as the “one-out-of-m” problem [6].
There have been significant research efforts to address the
problem. Two common strategies are (1) partitioning the
shared resources among the tasks or cores to achieve spatial
isolation and (2) applying analyzable arbitration schemes
(e.g., time-division multiple access) in accessing the shared
resources to achieve temporal isolation. These strategies
have been studied individually (e.g., cache [7]–[9], DRAM
banks [10], [11], memory bus [12], [13]) or in combination
(e.g., [6], [14]). However, most of these efforts improve
predictability at the cost of significant sacrifice in efficiency
and performance.
In this paper, we argue that the fundamental problem
that prevents efficient and predictable real-time computing
on multicore is the lack of a proper memory abstraction to
express memory criticality, which cuts across various layers of
the system: the application, OS, and hardware. Thus, our ap-
proach starts by defining a new OS-level memory abstraction,
which we call Deterministic Memory. The key characteristic
of deterministic memory is that the platform—the OS and
hardware—guarantees small and tightly bounded worst-case
memory access timing. In contrast, we call the conventional
memory abstraction best-effort memory, for which only
highly pessimistic worst-case bounds can be achieved.
We propose a new holistic cross-layer resource manage-
ment approach that leverages the deterministic and best-effort
memory abstractions. In our approach, a task can allocate
either type of memory blocks in its address space, at the page
granularity, based on the desired WCET requirement in ac-
cessing the memory blocks. The OS and hardware then apply
different resource management strategies depending on the
memory type. Specifically, predictability-focused strategies,
such as resource reservation and predictable scheduling, shall
be used for deterministic memory, while strategies focused on
average performance and efficiency, such as space sharing and
out-of-order scheduling, shall be used for best-effort memory.
Because not all tasks are time-critical, and not all memory
blocks of a time-critical task are equally important with respect
to the task’s WCET, our approach enables the possibility
of achieving high time predictability without significantly
affecting performance and efficiency via selective use of
deterministic memory.
While our approach is a generic framework that can be
applied to any shared hardware resource management, in this
paper, we particularly focus on shared cache and demonstrate
the potential benefits of our approach in the context of
shared cache space management. Specifically, we propose a
deterministic memory-aware cache replacement scheme that
(i) provides the same level of performance isolation as
conventional way-based partitioning techniques; and
(ii) achieves significantly higher cache space utilization and
throughput by reducing unused cache space.
We implement the deterministic memory abstraction in the
Linux 3.13 kernel and implement the proposed deterministic
memory-aware hardware architecture extensions in the gem5
full-system simulator [15], which models a high-performance
(out-of-order) quad-core platform as baseline. We evaluate the
system using a set of synthetic and real-world benchmarks,
which include benchmarks from EEMBC [16], SD-VBS [17]
and SPEC2006 [18] suites. The results demonstrate the same
degree of isolation as conventional way-based
cache partitioning for real-time tasks while improving the
cache hit rate of co-scheduled non-real-time workloads sharing
the machine by 39% on average.
The main contributions of this work are as follows:
• We propose a new OS-level memory abstraction, called
deterministic memory.
• We propose a deterministic memory-aware cache replace-
ment scheme that showcases the potential benefits of the
new memory abstraction.
• We implement a realistic prototype system—both OS and
hardware—that realizes the proposed memory abstraction
and the cache-management mechanism in the Linux kernel
and a cycle-accurate full-system simulator. 1
• We provide extensive empirical results, using both synthetic
and real-world benchmarks, that demonstrate the
effectiveness of our approach.
The remainder of the paper is organized as follows. Section II
provides background and motivation. Section III describes
the proposed Deterministic Memory abstraction. Section IV
provides an overview of the deterministic memory-aware
system design. Section V describes a deterministic memory-
aware cache controller design. Section VI details our prototype
implementation. Section VII presents evaluation results. We re-
view related work in Section VIII and conclude in Section IX.
1 We provide the modified Linux kernel source, the modified gem5 simulator
source, and the simulation methodology at http://github.com/CSL-KU/detmem
for replication study.
II. BACKGROUND AND MOTIVATION
In this section, we describe why the standard uniform
memory abstraction is a fundamental limitation for the de-
velopment of efficient and predictable real-time computing
infrastructures.
To address the problem of time predictability on multicore
platforms, many research works have proposed a number of
OS-level shared resource management techniques. In gen-
eral, these techniques are based on the notion of resource
reservation that gives dedicated resources—cache space [7]–
[9], DRAM banks [6], [11], bus bandwidth [12], or their
combinations [6], [14], [19]—to each individual task or core.
However, OS-based techniques generally have two main
limitations. First, the achieved degree of isolation is often
insufficient. For instance, a recent study [20] showed that,
despite partitioning the shared cache, a task’s WCET can
increase by 5X - 21X due to contention on shared miss status
holding registers (MSHRs); similarly, despite partitioning both
shared cache and DRAM banks, some SPEC2006 benchmarks
suffer up to 6X WCET increase due to contention on the bus
and other shared hardware resources [11]. At the root of this
problem is the fact that all the memory requests are treated
equally by the hardware because there is no way of knowing
which memory requests are time critical.
A second, important limitation is that reservation-based
isolation techniques are often enforced at the granularity of
a task or core. Due to the coarse granularity, these techniques
suffer from serious resource under-utilization problems. For
instance, when a fraction of cache space is reserved for a
task, it cannot be used by other tasks, even if it is not fully
utilized by the reserved task. Likewise, when DRAM banks
are reserved for a task, they cannot be utilized by other
tasks, resulting in under-utilized DRAM space. Furthermore,
reserving DRAM banks also limits the maximum memory
capacity that the task can allocate [21]. These restrictions make
the efficient use of memory resources difficult or impossible.
Resource under-utilization is a serious problem for resource
constrained embedded systems because it limits the degree of
consolidation that the system can achieve, which in turn affects
cost, size, weight, and power of the system.
The Uniform Memory Abstraction: Operating systems
and hardware traditionally have provided a simple uniform
memory abstraction that hides all the complex details of the
memory hierarchy. When an application requests to allocate
more memory, the OS simply maps the necessary amount of
any physical memory pages available at the time to the appli-
cation’s address space—without considering (i) how memory
pages are actually mapped to shared hardware resources in
the memory hierarchy, and (ii) how the page(s) selected for
allocation will affect application performance. Likewise, the
underlying hardware components treat all memory accesses
from the CPU as equal without any regard to differences in
criticality and timing requirements in allocating and schedul-
ing the requests at the hardware level.
We argue that this uniform memory abstraction is funda-
mentally inadequate for multicore systems because it prevents
the OS and the memory hierarchy hardware from making
informed decisions in allocating and scheduling access to
shared hardware resources. Thus, we believe that new memory
abstractions are needed to enable efficient and predictable real-
time resource management. It is important to note that the said
abstractions should not expose too many architectural details
about the memory hierarchy to the users, to ensure portability
in spite of rapid changes in hardware architectures.
III. DETERMINISTIC MEMORY ABSTRACTION
In this section, we introduce the Deterministic Memory
abstraction to address the aforementioned challenges.
We define deterministic memory as special memory space
for which the OS and hardware guarantee small and tightly
bounded worst-case access delay. In contrast, we call conventional
memory best-effort memory, for which only highly
pessimistic worst-case bounds can be achieved. Both memory
types are supported by the OS and hardware in a single
multicore system. These memory abstractions allow applica-
tions to express their memory access timing requirements in
an architecture-neutral way, while leaving the implementation
details to the OS and to the hardware architecture. This enables
us to achieve predictable, analyzable and efficient management
of shared hardware resources in multicore platforms, as we
discuss in the rest of the section.
Fig. 1. Conceptual differences of deterministic and best-effort memories
(bars: best-effort memory, deterministic memory; axis: worst-case memory
delay; legend: inherent timing, inter-core interference)
Figure 1 shows the conceptual differences between the two
types of memory with respect to worst-case memory access
delay. For clarity, we divide memory access delay into two
components: inherent access delay and inter-core interference
delay. The inherent access delay is the minimum necessary
timing in isolation. In this regard, deterministic memory can
be slower—in principle, but not necessarily—than best-effort
memory, as its main objective is predictability and not perfor-
mance, while in the case of best-effort memory, the reverse
is true. On the other hand, the inter-core interference delay is
additional delay caused by concurrently sharing hardware re-
sources between multiple cores. This is where the two memory
types differ the most. For best-effort memory, the worst-case
delay bound is highly pessimistic mainly because the inter-core
interference delay can be severe. For deterministic memory, on
the other hand, the worst-case delay bound is small and tight
as the inter-core interference delay is minimized.
Predictability on deterministic memory space is achieved via
ad-hoc management policies at both the OS and the hardware.
For instance, (i) a fraction of the cache space and (ii) a subset
of DRAM banks are reserved for deterministic memory; (iii)
at the memory controller, requests in deterministic memory
space are handled using real-time scheduling algorithms, while
throughput oriented algorithms can be used for best-effort
memory. Due to space constraints, we do not detail all the
allocation strategies that are possible thanks to deterministic
memory. Instead, we describe how the new abstraction can be
constructed, and detail, as a use-case, a deterministic memory-
aware shared cache controller (see Section V).
In the deterministic memory approach, a task can map
all or part of its memory from the deterministic memory.
For example, an entire address space of a real-time task can
be allocated from the deterministic memory; or, only the
important buffers used in a control loop of the real-time task
can be allocated from the deterministic memory.
Fig. 2. Application view (logical): a task's address space containing
deterministic memory and best-effort memory regions
Figure 2 shows a possible virtual address space of a task
using both types of memory. From the point of view of
the task, the two memory types differ only in their worst-
case timing characteristics. The deterministic memory will
be realized by extending the virtual memory system at the
page granularity. Whether a certain page is deterministic or
best-effort memory is stored in the task’s page table and
the information is propagated throughout the shared memory
hierarchy, which is then used in allocation and scheduling
decisions made by the OS and the memory hierarchy hardware.
IV. SYSTEM DESIGN
In this section, we describe OS and hardware architecture
extensions to support the deterministic memory abstraction.
The deterministic memory abstraction is realized by extend-
ing the OS’s virtual memory subsystem. Whether a certain
page has the deterministic memory property or not is stored
in the corresponding page table entry. Note that in most
architectures, a page table entry contains not only the virtual-
to-physical address translation but also a number of auxiliary
attributes such as access permission and cacheability. The
deterministic memory can be encoded as just another attribute,
Fig. 3. System-level view (physical): four cores (Core1 to Core4), each with
private I and D caches, on top of a deterministic memory-aware memory
hierarchy with cache ways W1 to W5 and DRAM banks B1 to B8, divided between
deterministic and best-effort memory
which we call a DM bit, in the page table entry. 2 The OS then
uses the information in making memory allocation decisions.
For example, the OS might apply cache and DRAM bank-
aware page allocation strategies [11], [14] for deterministic
memory to improve predictability. Likewise, the underlying
hardware also can use the same information in making low-
level resource allocation and scheduling decisions, such as
cache replacement decisions in the shared cache (which will
be presented in Section V) or memory request scheduling
decisions in the shared DRAM controller.
Figure 3 shows the system-level (OS and architecture)
view of a possible implementation of a multicore system
that supports the two memory types. In this example, each
core is given one cache way and a DRAM bank which
will be used to serve deterministic memory for the core.
One cache way and four DRAM banks are assigned for the
best-effort memory of all cores. The deterministic memory-
aware memory hierarchy refers to hardware support for the
deterministic memory abstraction, which will be described in
the following.
A. Hardware Support: The Memory Hierarchy
In a modern processor, the processor’s view of memory
is determined by the Memory Management Unit (MMU),
which translates a virtual address to a corresponding phys-
ical address. The translation information, along with other
auxiliary information, is stored in a page table entry, which
is managed by the OS. The Translation Look-aside Buffer (TLB)
then caches frequently accessed page table entries to accelerate
translation. As discussed above, in our design, the
DM bit in each page table entry indicates whether the page
is for deterministic memory or for best-effort memory. Thus,
the TLB also must store the DM bit and pass the information
down to the memory hierarchy.
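The flow described above can be sketched in C as a chain of three records. This is only an illustrative model under stated assumptions: the struct layouts, the 4 KiB page size, and the helper names `tlb_fill` and `make_request` are hypothetical and are not taken from any real MMU or from the prototype.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u  /* assumption: 4 KiB pages */

/* Hypothetical page table entry: translation plus the DM attribute. */
struct pte {
    uint32_t pfn;  /* physical frame number */
    bool     dm;   /* true = deterministic memory, false = best-effort */
};

/* Hypothetical TLB entry: caches the translation and the DM bit. */
struct tlb_entry {
    uint32_t vpn;
    uint32_t pfn;
    bool     dm;
    bool     valid;
};

/* Hypothetical memory request as seen by the shared cache and DRAM. */
struct mem_request {
    uint32_t paddr;
    bool     dm;
};

/* The page table walk result is copied into the TLB, DM bit included. */
static struct tlb_entry tlb_fill(uint32_t vaddr, const struct pte *pte)
{
    struct tlb_entry e = {
        .vpn = vaddr >> PAGE_SHIFT,
        .pfn = pte->pfn,
        .dm = pte->dm,
        .valid = true,
    };
    return e;
}

/* On a load or store, the DM bit travels with the translated address. */
static struct mem_request make_request(uint32_t vaddr,
                                       const struct tlb_entry *e)
{
    assert(e->valid && e->vpn == (vaddr >> PAGE_SHIFT));
    struct mem_request r = {
        .paddr = (e->pfn << PAGE_SHIFT) |
                 (vaddr & ((1u << PAGE_SHIFT) - 1u)),
        .dm = e->dm,
    };
    return r;
}
```

The point of the sketch is that no component below the TLB needs to consult the page table again: the one-bit attribute is simply copied along with the physical address.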
Figure 4 shows this information flow of deterministic memory.
Note that bus protocols (e.g., AMBA [22]) should also
provide a means to carry the deterministic memory information
in each request packet. In fact, many of the existing bus
2 In our implementation, we used an unused memory attribute in the page
table entry of the ARM architecture; see Section VI for details.
Fig. 4. Deterministic memory information flow in the memory hierarchy.
(The core issues a virtual address (VA); the TLB, filled by page table
walks, holds VA, PA, and DM per entry; on data loads and stores, the LLC
holds PA, DM, and line data per line.)
protocols already support some forms of priority information
as part of the bus transaction 3. These fields are currently
used to distinguish priority between bus masters (e.g., CPU vs.
GPU vs. DMA controllers). A bus transaction for deterministic
memory can be incorporated into these bus protocols, for
example, as a special priority class. The deterministic memory
information can then be utilized in mapping and scheduling
decisions made by the respective hardware components in the
memory hierarchy.
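As one possible reading of the "special priority class" idea, the following C sketch maps a request to a 4-bit QoS value. The choice of reserving the highest value for deterministic memory, the clamping of best-effort traffic, and the name `axqos_for` are assumptions made for illustration; the text does not prescribe a particular encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AXQOS_MAX    15u        /* a 4-bit QoS field covers values 0..15 */
#define AXQOS_DETMEM AXQOS_MAX  /* assumption: top class reserved for DM */

/* Pick the QoS value for a transaction. Best-effort traffic keeps its
 * per-master priority but is clamped below the class reserved for DM. */
static uint8_t axqos_for(bool dm, uint8_t master_prio)
{
    assert(master_prio <= AXQOS_MAX);
    if (dm)
        return (uint8_t)AXQOS_DETMEM;
    return master_prio < AXQOS_DETMEM ? master_prio
                                      : (uint8_t)(AXQOS_DETMEM - 1u);
}
```

Under this encoding, a downstream arbiter can recognize deterministic requests by a single comparison while the existing per-master priorities remain usable for best-effort traffic.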
In the following section, we focus on shared caches and
demonstrate how the deterministic memory abstraction can
be utilized in the caches to achieve both predictability and
efficiency at the same time.
V. DETERMINISTIC MEMORY-AWARE CACHE
In this section, we present a deterministic memory-aware
cache design that provides the same isolation benefits of
traditional way-based partitioning while at the same time
achieving high cache space efficiency.
A. Way-based Cache Partitioning
In a standard way-based partitioning, which is supported
in several COTS multicore processors [23], [24], each core is
given a non-overlapping subset of cache ways. When a cache
miss occurs, a new cache line (loaded from the memory) is
allocated on one of the assigned cache ways in order not to
evict useful cache lines of the other cores that share the same
cache set. An important shortcoming of way-partitioning is,
however, that its partitioning granularity is coarse (i.e., way
granularity) and the cache space of each partition may be
wasted if it is underutilized. Furthermore, even if fine-grain
partition adjustment is possible, it is not easy to determine the
“optimal” partition size of a task because the task’s behavior
may change over time or depending on the input. As a
result, it is often a common practice to conservatively allocate
3 For example, ARM AXI4 protocol includes a 4-bit QoS identifier AxQOS
signal [22] that supports up to 16 different priority classes for bus transactions.
Fig. 5. Deterministic memory-aware cache management
(Example cache with Set 0 to Set 3 and Way 0 to Way 4, organized into a
Core 0 partition, a Core 1 partition, and a shared partition (Way 4). Each
line holds a DM bit, a tag, and line data, and is either a best-effort line
(any core, DM=0) or a deterministic line of Core 0 or Core 1 (DM=1).)
sufficient amount of resource (over-provisioning), which will
waste much of the reserved space most of the time.
B. Deterministic Memory-Aware Replacement Algorithm
We address the shortcomings of way-based partitioning by
taking advantage of the deterministic memory abstraction.
At the high level, the idea is that we use way partitioning
only for deterministic memory accesses while allowing best-
effort memory accesses to use all the cache ways that do not
currently hold deterministic cache lines.
Figure 5 shows an example cache status of our design in
which two cores share a 4-set, 5-way set-associative cache. In
our design, there is a DM bit per cache line stored along with
other status bits (e.g. valid and dirty bits) to indicate whether
a cache line is for deterministic memory or best-effort memory
(see the upper-right side of Figure 5). When inserting a new
cache line (of a given set), if the requesting memory access
is for deterministic memory, then the victim line is chosen
from the core’s way partition (e.g., way 0 and 1 for Core 0 in
Figure 5). On the other hand, if the requesting memory access
is for best-effort memory, the victim line is chosen from ways
that do not hold deterministic cache lines (e.g., in set 0, all
but way 2 are best-effort cache lines; in set 1, on the other
hand, only way 4 holds a best-effort cache line).
Input : PartMask_i - way partition mask of Core i
        DetMask_s - deterministic ways of Set s
Output: victim - the victim way to be replaced.
 1 if DM == 1 then
 2     if (PartMask_i ∧ ¬DetMask_s) ≠ NULL then
           // evict a best-effort line first
 3         victim = LRU(PartMask_i ∧ ¬DetMask_s)
 4         DetMask_s |= 1 << victim
 5     else
           // evict a deterministic line
 6         victim = LRU(PartMask_i)
 7     end
 8 else
       // evict a best-effort line
 9     victim = LRU(¬DetMask_s)
10 end
11 return victim
Algorithm 1: Deterministic memory-aware cache line replacement algorithm.
Algorithm 1 shows the pseudo code of the cache line
replacement algorithm. As in standard cache way partitioning,
we assign dedicated cache ways to each core,
denoted as PartMask_i, to eliminate inter-core cache interference.
Note that DetMask_s denotes the bitmask of set
s's cache lines that contain deterministic memory. If a request
from core i is a deterministic memory request (DM = 1),
then a line is allocated from the core's cache way partition
(PartMask_i). Among the ways of the partition, the algorithm
first tries to evict a best-effort cache line, if such a line exists
(Lines 3-4). If not (i.e., all lines are deterministic), it
chooses one of the deterministic lines as the victim (Line
6). On the other hand, if best-effort memory is requested
(DM ≠ 1), it evicts one of the best-effort cache lines, but
not any of the deterministic cache lines (Line 9). In this way,
even though the deterministic cache lines of a partition are
completely isolated from any accesses other than the assigned
core of the partition, the under-utilized cache lines of the
partition can still be utilized as best-effort cache lines of all
cores.
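The replacement policy of Algorithm 1 can be sketched in C as follows. This is a minimal software model, not the prototype's cache controller: the 5-way geometry, the timestamp-based LRU, the struct layout, and the convention of returning -1 when no way is eligible are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS 5  /* assumption: 5-way cache, as in the Figure 5 example */

/* Per-set state. Bit w of det_mask is set when way w holds a DM line. */
struct cache_set {
    uint64_t last_use[NUM_WAYS];  /* timestamp of each way's last access */
    uint32_t det_mask;
};

/* Least recently used way among those enabled in mask, or -1 if none. */
static int lru_in_mask(const struct cache_set *set, uint32_t mask)
{
    int victim = -1;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!(mask & (1u << w)))
            continue;
        if (victim < 0 || set->last_use[w] < set->last_use[victim])
            victim = w;
    }
    return victim;
}

/* Algorithm 1: choose the victim way for a miss issued by a core whose
 * way partition is part_mask. Only victim selection is modeled; the
 * caller is expected to refresh last_use when the line is filled. */
static int detmem_victim(struct cache_set *set, uint32_t part_mask, bool dm)
{
    const uint32_t all_ways = (1u << NUM_WAYS) - 1u;
    int victim;

    assert((part_mask & ~all_ways) == 0);
    if (dm) {
        uint32_t be_in_part = part_mask & ~set->det_mask;
        if (be_in_part != 0) {
            /* Lines 3-4: evict a best-effort line, mark the way as DM. */
            victim = lru_in_mask(set, be_in_part);
            set->det_mask |= 1u << victim;
        } else {
            /* Line 6: the partition is all DM, evict one of its lines. */
            victim = lru_in_mask(set, part_mask);
        }
    } else {
        /* Line 9: best-effort requests never evict a DM line. */
        victim = lru_in_mask(set, all_ways & ~set->det_mask);
    }
    return victim;
}
```

Note that the best-effort branch ignores `part_mask` entirely, which is what lets best-effort lines of any core occupy under-utilized ways of another core's partition.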
C. Deterministic Memory Cleanup
Note that a core’s way partition would eventually be filled
with deterministic cache lines (ones with DM = 1) if left
unmanaged (e.g., scheduling multiple different real-time tasks
on the core). This would eliminate the space efficiency gains of
using deterministic memory because the deterministic memory
cache lines cannot be evicted by best-effort memory requests.
In order to keep only a minimal number of deterministic
cache lines on any given partition in a predictable manner, our
cache controller provides a special hardware mechanism that
clears the DM bits of all deterministic cache lines, effectively
turning them into best-effort cache lines. This mechanism is
used by the core’s OS scheduler on each context switch so
that the deterministic cache-lines of the previous tasks can be
evicted by the current task. When the deterministic-turned-
best-effort cache-lines of a task are accessed again and they
still exist in the cache, they will be simply re-marked as
deterministic without needing to reload from memory. In the
worst case, however, all deterministic cache lines of a task
shall be reloaded when the task is re-scheduled on the CPU.
Note that our cache controller reports the number of deter-
ministic cache lines that are cleared on a context switch back
to the OS. This information can be used to more accurately
estimate cache-related preemption delays (CRPD) [25].
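The cleanup step can be modeled in C as shown below. This is a hedged sketch: it assumes the operation is scoped to the ways of one core's partition and that the count is returned directly to the caller, and the names `detmem_cleanup` and `popcount32` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-set state. Bit w of det_mask is set when way w holds a DM line. */
struct cache_set {
    uint32_t det_mask;
};

/* Count the set bits in a way mask. */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    for (; x != 0; x &= x - 1u)
        n++;
    return n;
}

/* Clear the DM bit of every line in the ways selected by part_mask,
 * across all sets, and return how many DM bits were cleared. The lines
 * themselves stay in the cache as best-effort lines. */
static unsigned detmem_cleanup(struct cache_set *sets, size_t num_sets,
                               uint32_t part_mask)
{
    unsigned cleared = 0;

    assert(sets != NULL || num_sets == 0);
    for (size_t s = 0; s < num_sets; s++) {
        cleared += popcount32(sets[s].det_mask & part_mask);
        sets[s].det_mask &= ~part_mask;
    }
    return cleared;
}
```

An OS scheduler could call such an operation on each context switch and feed the returned count into its CRPD estimate.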
Fig. 6. Small descriptor format for 2nd level page table entry in ARMv7-A
family SoCs [24].
(Bit layout: [31:12] small page base address, PA[31:12]; [11] nG; [10] S;
[9] AP[2]; [8:6] TEX[2:0]; [5:4] AP[1:0]; [3] C; [2] B; [1] 1; [0] XN.)
D. Guarantees
The premise of the proposed cache replacement strategy is
that a core’s deterministic cache lines will never be evicted by
other cores’ cache allocations, hence preserving the benefit of
cache partitioning. At the same time, non-deterministic cache
lines in the core’s cache partition can safely be used by other
cores’ best-effort memory requests, hence minimizing wasted
cache capacity due to partitioning.
VI. PROTOTYPE IMPLEMENTATION
We implemented the proposed deterministic memory ab-
straction on a Linux 3.13 kernel, and tested the aforementioned
hardware extensions on a cycle accurate full-system simulator,
gem5 [15]. This section provides implementation details of our
prototype. First, we briefly review the ARMv7 architecture on
which our implementation is based. We then describe our mod-
ifications to the Linux kernel to support deterministic memory
abstraction. Finally, we describe the hardware extensions on
the gem5 simulator. We also discuss the feasibility of the
proposed hardware extensions in real silicon.
A. ARM Architecture Background
In this paper, we consider the ARMv7-A [24] architecture,
which is fully supported in gem5. The ARMv7 defines four
primary memory types and several memory related attributes
such as cache policy (Write-back/write-through) and coher-
ence boundaries (between cores or beyond). Up to 8 different
combinations are allowed by the architecture. Each page’s
memory type is determined by a set of bits in the correspond-
ing 2nd-level page table entry. Figure 6 illustrates the structure
of a page table entry (PTE).
In the figure, the bits TEX[0], C and B are used to define
one of the 8 memory types. The property of each memory type
is determined by two global architectural registers, namely
Primary Region Remap Register (PRRR) and Normal Memory
Region Register (NMRR) 4.
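To make the encoding concrete, the following C sketch decodes the memory type index from a small page descriptor and looks up its 2-bit primary type field in a PRRR value. The bit positions follow the layout in Figure 6; the function names are hypothetical, and the sketch assumes TEX remapping is enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in the ARMv7-A small page descriptor (see Figure 6). */
#define PTE_B_SHIFT   2u
#define PTE_C_SHIFT   3u
#define PTE_TEX_SHIFT 6u  /* TEX[2:0] occupies bits [8:6] */

/* With TEX remapping, {TEX[0], C, B} forms a 3-bit index n in 0..7. */
static unsigned pte_mem_type_index(uint32_t pte)
{
    unsigned tex0 = (pte >> PTE_TEX_SHIFT) & 1u;
    unsigned c    = (pte >> PTE_C_SHIFT) & 1u;
    unsigned b    = (pte >> PTE_B_SHIFT) & 1u;
    return (tex0 << 2) | (c << 1) | b;
}

/* PRRR holds one 2-bit field per index n at bits [2n+1:2n]. */
static unsigned prrr_type(uint32_t prrr, unsigned n)
{
    assert(n < 8u);
    return (prrr >> (2u * n)) & 3u;
}
```

Because the index is only three bits wide, exactly eight attribute combinations can be described at a time, which is why a spare index is needed to introduce a new memory type.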
B. Linux Extensions
We have modified Linux kernel 3.13 to support deterministic
memory. The modifications include declaring an additional memory
type inside the kernel and providing user-level interfaces
through which applications use deterministic memory.
At the lowest level, we define a new memory type that
corresponds to the deterministic memory. Note that the default
Linux uses only 6 out of 8 possible memory types, leaving
4 The hardware behaves as described only when the so-called “TEX
remapping” mechanism is in use. TEX remapping can be controlled via a
configuration bit (TRE) in the System Control Register (SCTLR). The Linux
kernel enables TEX remapping by default.
two undefined memory types. For deterministic memory, we
define one of the unused memory types as the deterministic
memory type, by updating PRRR and NMRR registers at boot
time. A page is marked as deterministic memory when the
corresponding page table entry’s memory attributes point to
the deterministic memory type.
At the user level, we currently provide two interfaces
for applications to use deterministic memory. First,
we extend the mmap system call to support deterministic heap
memory. For example, to allocate a deterministic memory
block in the heap, a program might call mmap as follows:
/* critical region (heap) */
char *buf = mmap(..., MAP_DETMEM, ...);
where the MAP_DETMEM flag indicates the requested memory
type is deterministic memory. Second, we extend Linux’s
ELF (Executable and Linkable Format [26]) loader so that the
entire address space of a task or a subset of the task’s memory
regions—code, data, and stack sections—can be declared as
deterministic memory. We currently use a special file extension
to indicate the use of deterministic memory; note that a field
in the ELF header of a program binary, for example e_flags,
could be used instead to encode finer-grained control.
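A fuller, hedged version of the mmap usage might look like the sketch below. MAP_DETMEM exists only in the modified kernel and its numeric value is not given in the text, so the sketch defines it as 0 when it is absent, which degrades to an ordinary best-effort mapping on a stock kernel. The choice of an anonymous private mapping and the wrapper names `detmem_alloc` and `detmem_free` are likewise assumptions.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict C modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: MAP_DETMEM comes from the modified kernel's headers.
 * Fall back to 0 so the sketch still builds and runs elsewhere. */
#ifndef MAP_DETMEM
#define MAP_DETMEM 0
#endif

/* Allocate len bytes of anonymous memory, requesting deterministic
 * memory when the kernel supports it. Returns NULL on failure. */
static void *detmem_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_DETMEM, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release a block obtained from detmem_alloc. */
static void detmem_free(void *p, size_t len)
{
    int rc = munmap(p, len);
    assert(rc == 0);
    (void)rc;
}
```

A real-time task would typically allocate only its time-critical buffers this way and keep the rest of its heap in best-effort memory.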
Within the Linux kernel, a task’s virtual address space is
represented as a set of memory regions, each of which is
represented by a data structure, vm_area_struct, called
a VMA descriptor. Each VMA descriptor contains a variety
of metadata about the memory region, including its memory
type information. Whenever a new physical memory block is
allocated (at a page fault), the kernel uses the information
stored in the corresponding VMA descriptor to construct the
page table entry for the new page. We added a new flag
VM_DETMEM to indicate the deterministic memory type in a
VMA descriptor. If the VM_DETMEM flag of a memory region
is set, then the OS sets the TEX[0], C and B bits when allocating
pages of the memory region to mark that they are deterministic
memory pages.
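The page-fault path described above can be modeled in user-space C as follows. This is not the kernel patch: the flag value standing in for VM_DETMEM, the memory type indexes chosen for deterministic and default memory, and the function names are all hypothetical placeholders, since the text does not state which index the prototype uses.

```c
#include <assert.h>
#include <stdint.h>

#define VM_DETMEM_FLAG   (1ul << 0)  /* hypothetical stand-in for VM_DETMEM */
#define DETMEM_TYPE_IDX  5u          /* hypothetical index for the DM type */
#define DEFAULT_TYPE_IDX 3u          /* hypothetical index for normal pages */

#define PTE_B    (1u << 2)
#define PTE_C    (1u << 3)
#define PTE_TEX0 (1u << 6)

/* Encode a 3-bit memory type index n into the TEX[0], C and B bits. */
static uint32_t mem_type_bits(unsigned n)
{
    assert(n < 8u);
    return ((n & 4u) ? PTE_TEX0 : 0u) |
           ((n & 2u) ? PTE_C : 0u) |
           ((n & 1u) ? PTE_B : 0u);
}

/* Model of the page-fault path: derive the memory attribute bits of a
 * new page table entry from the flags of the VMA that covers it. */
static uint32_t pte_attrs_for_vma(unsigned long vm_flags)
{
    unsigned n = (vm_flags & VM_DETMEM_FLAG) ? DETMEM_TYPE_IDX
                                             : DEFAULT_TYPE_IDX;
    return mem_type_bits(n);
}
```

Since the decision is made once per page at fault time, the per-access cost of the scheme is carried entirely by hardware reading the resulting attribute bits.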
Note that the aforementioned code changes are minimal.
In total, we added or modified fewer than 200 lines of C
and assembly code across 12 files in the Linux kernel source
tree. Furthermore, because most changes are in page table
descriptors and their initialization, no runtime overhead is
incurred by the code changes.
C. Architecture Simulator Extensions
The deterministic memory type information stored in the
page table is read by the MMU and passed throughout the
memory hierarchy.
1) MMU and TLB: When a TLB miss occurs, the MMU
performs a page table walk to determine the physical address
of the requested virtual address. In the process, it also reads
auxiliary information, such as memory attributes and
access permissions, from the page table entry and stores it
in a TLB entry in the processor. The deterministic memory
attribute is stored alongside the other memory attributes
in the TLB entry. Specifically, we add a single bit to gem5’s
implementation of a TLB entry to indicate the deterministic
memory type. As a reference, Cortex-A17’s TLB entry has
80 bits and a significant fraction of the bits are already used
to store various auxiliary information [27] or reserved for
future use. Thus, requiring a single bit in a TLB entry does
not pose significant overhead in practice. We also extend
the memory request packet format in the gem5 simulator to
include the deterministic memory type information. In this
way, the memory type information of each memory request
can be passed down through the memory hierarchy. In real
hardware, bus protocols should be extended to include such
information. As discussed earlier, existing bus protocols such
as AXI4 already support the inclusion of such additional
information in each bus packet [22].
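The data flow described above can be sketched in a few lines. The structures below are hypothetical simplifications written for this illustration; they are not gem5's actual TLB entry or packet classes, and the 4 KiB page size is an assumption. The only point is that the DM bit is stored once in the TLB entry and then copied into every request generated through that entry.

```c
#include <stdbool.h>
#include <stdint.h>

#define DM_PAGE_SHIFT 12   /* 4 KiB pages assumed */

/* Hypothetical, simplified TLB entry; real entries hold more fields. */
struct tlb_entry {
    uint32_t vpn;        /* virtual page number */
    uint32_t ppn;        /* physical page number */
    uint8_t  mem_attrs;  /* pre-existing memory attributes */
    bool     dm;         /* added bit: page is deterministic memory */
};

/* Hypothetical memory request handed to the cache hierarchy. */
struct mem_request {
    uint64_t paddr;
    bool     is_write;
    bool     dm;         /* copied from the TLB entry at translation time */
};

/* Translate a virtual address through a matching TLB entry and build the
 * request, carrying the DM bit along with the physical address. */
static struct mem_request translate(const struct tlb_entry *e,
                                    uint32_t vaddr, bool is_write)
{
    struct mem_request req;
    uint32_t offset = vaddr & ((UINT32_C(1) << DM_PAGE_SHIFT) - 1);

    req.paddr = ((uint64_t)e->ppn << DM_PAGE_SHIFT) | offset;
    req.is_write = is_write;
    req.dm = e->dm;
    return req;
}
```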
2) Cache Controller: gem5’s cache subsystem implements
a flexibly configurable non-blocking cache architecture
and supports standard LRU and random replacement
algorithms. Our modifications are as follows. First, we extend
gem5’s cache controller to support a standard way-based
partitioning capability (see footnote 5). The way partition is configured via a set
of programmable registers. When a cache miss occurs, instead
of replacing the cache line in the global LRU position, the controller
replaces the LRU line among the ways configured for the requesting core.
This way-based partitioning mechanism serves as our baseline.
On top of the way-based partitioning, we implement the
proposed deterministic memory-aware replacement algorithm
and the deterministic memory cleanup algorithm (Section V-B
and Section V-C, respectively).
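The baseline victim selection can be sketched as follows. This illustrates only the plain way-partitioned LRU policy described in this subsection, not the deterministic memory-aware replacement algorithm of Section V-B. The timestamp-based LRU bookkeeping, the bit-mask encoding of the partition registers, and all names are assumptions of the sketch.

```c
#include <stdint.h>

#define NUM_WAYS 16

/* One cache set; last_use[w] is a timestamp, smaller means older. */
struct cache_set {
    uint64_t last_use[NUM_WAYS];
};

/* Return the LRU way among those enabled in way_mask (bit w set means way w
 * is assigned to the requesting core), or -1 if the mask selects no way. */
static int pick_victim(const struct cache_set *set, uint32_t way_mask)
{
    int victim = -1;

    for (int w = 0; w < NUM_WAYS; w++) {
        if (!(way_mask & (UINT32_C(1) << w)))
            continue;               /* way belongs to another partition */
        if (victim < 0 || set->last_use[w] < set->last_use[victim])
            victim = w;
    }
    return victim;
}
```

With a mask of 0x00F0, for example, only ways 4 through 7 are candidates, so a globally older line in another core's ways is never chosen.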
D. Hardware Implementation Overhead
We now briefly discuss the overhead of an actual hardware
implementation. First, the space overhead of our approach is
small: one bit of storage per cache line, which is less
than 0.2% for a standard 64-byte cache line. The
timing overhead of the deterministic memory-aware cache
replacement algorithm is difficult to analyze without an actual
hardware implementation, but we conjecture that it would also
be low because of the simplicity of our design and because
cache replacement occurs only on cache misses,
which are relatively infrequent.
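The space overhead figure can be checked with a one-line calculation. The version below counts only the data bits of a line; including tag and state bits in the denominator would make the relative overhead slightly smaller.

```latex
\[
\text{space overhead}
  = \frac{1\ \text{bit}}{64\ \text{bytes} \times 8\ \text{bits/byte}}
  = \frac{1}{512}
  \approx 0.195\% < 0.2\%
\]
```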
One potentially costly operation is the deterministic memory
cleanup operation, which requires clearing the DM bit (one bit
per line) of every cache line marked as deterministic when the
corresponding task is context-switched. A simple hardware design would
clear the DM bits of a cache way partition one line at a time
until all cache lines of the partition are cleared. All accesses to the cache are blocked
until the operation completes. For a 2 MB shared cache, we
estimate that this takes around 1 µs, assuming bank-level
parallelization. Because context switches occur
infrequently, we believe this overhead is acceptable. The
overhead can be further reduced by using custom SRAM
arrays that provide additional signal lines, one per cache way,
to clear all DM bit cells in the way at once. In this case, the operation
could be performed in a few clock cycles.
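The simple line-by-line design can be modeled in software as below. This is a behavioral sketch only, under stated assumptions: the geometry (2048 sets of 16 ways, i.e. 2 MiB with 64-byte lines) is taken from the simulated L2 cache, the metadata structure and function name are hypothetical, and the loop says nothing about the hardware timing discussed above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS 2048   /* 2 MiB / (64 B x 16 ways) */
#define NUM_WAYS 16

/* Hypothetical per-line metadata; only the DM bit matters here. */
struct line_meta {
    bool valid;
    bool dm;
};

static struct line_meta l2[NUM_SETS][NUM_WAYS];

/* Clear the DM bit of every line in the ways selected by way_mask, one line
 * at a time, and return how many bits were actually cleared. */
static size_t dm_cleanup(uint32_t way_mask)
{
    size_t cleared = 0;

    for (size_t s = 0; s < NUM_SETS; s++) {
        for (int w = 0; w < NUM_WAYS; w++) {
            if ((way_mask & (UINT32_C(1) << w)) && l2[s][w].dm) {
                l2[s][w].dm = false;   /* data stays; only the mark goes */
                cleared++;
            }
        }
    }
    return cleared;
}
```

The per-way clear line mentioned above would replace the inner per-line work with a single operation per selected way.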
Footnote 5: https://github.com/farzadfch/gem5-cache-partitioning
VII. EVALUATION
In this section, we present evaluation results to support
the feasibility and effectiveness of the proposed deterministic
memory-aware system design.
A. System Setup
For the OS, we use a modified Linux kernel 3.13, which
implements the modifications explained in Section VI-B to support
the deterministic memory abstraction. For the hardware,
we use a modified gem5 full-system simulator, which
implements the proposed deterministic memory support described
in Section VI-C. The simulator is configured as a quad-core
out-of-order processor (O3CPU model [28]) with per-core
private L1 I/D caches, a shared L2 cache, and a shared DRAM.
The baseline architecture parameters are shown in Table I.
Since the focus of this paper is on user-space memory
regions, kernel memory regions are not guaranteed to
stay in the L2 cache. We therefore used a few techniques to
reduce the number of kernel memory accesses. We used the mlockall
system call to back the program’s entire virtual address space
with physical pages up front, which avoids page faults
during the rest of the program’s execution. In addition, we
enabled the kernel configuration option NO_HZ_FULL to reduce
the number of scheduling-clock interrupts. With this option,
unnecessary scheduling-clock ticks on the CPUs are omitted,
reducing the number of accesses to kernel memory regions.
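For reference, the mlockall step might look as follows in a benchmark's initialization code. The text does not state which flags were passed, so the combination MCL_CURRENT | MCL_FUTURE is an assumption of this sketch, as is the wrapper name.

```c
#include <sys/mman.h>

/* Lock all currently mapped pages, and all pages mapped in the future, into
 * RAM. Returns 0 on success and -1 on failure (errno is set by mlockall). */
static int lock_all_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

The call can fail if the process lacks the required privilege or exceeds its locked-memory limit (RLIMIT_MEMLOCK), so the return value should be checked before the timed part of the program starts.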
TABLE I
SIMULATOR CONFIGURATION
Component     | Configuration
Core          | Quad-core, out-of-order, 2 GHz; IQ: 96, ROB: 128, LSQ: 48/48
L1-I/D caches | Private 16/16 KiB (2-way); MSHRs: 2 (I) / 6 (D)
L2 cache      | Shared 2 MiB (16-way); LRU, MSHRs: 56, hit latency: 12
DRAM module   | LPDDR2 @ 533 MHz, 1 rank, 8 banks
B. Real-Time Benchmark Characteristics
We use a set of EEMBC [16] automotive and SD-VBS [17]
vision benchmarks as real-time workloads. We profile each
benchmark, using the gem5 simulator, to better understand
memory characteristics of the benchmarks.
Figure 7(a) shows the ratio of the number of pages accessed
within the main loop to the number of all pages accessed
by each benchmark; the pages accessed in the loop are
denoted as critical pages. As can be seen in the figure, 39%
of pages (on average) are critical pages, and this number can
be as low as 6% in the case of svm. The rest of
the pages are accessed only during initialization and other
non-time-critical procedures. To further analyze the characteristics
of the critical pages, we profiled L1 cache misses of each
critical page to see which pages contribute most to the overall
L1 cache misses. Figure 7(b) shows the L1 cache miss count
distribution of the critical pages, which are grouped based on
whether they belong to code, data, heap, or stack sections.
[Figure 7, plots omitted. X-axis: disparity, mser, sift, svm, texture_synth, aifftr01, aiifft01, matrix01; panel (a) also shows the average.]
(a) Critical pages among all touched pages. Y-axis: critical/all pages (%).
(b) L1 miss distribution of critical pages. Y-axis: L1 miss distribution (%); categories: code, data, heap, stack.
Fig. 7. Space and temporal characteristics of application memory pages.
Critical pages refer to the pages touched within the main loop of each benchmark.
Note that L1 cache misses are directed to the shared L2
cache, which is shared by all cores. Thus, those pages that
show high L1 misses likely contribute most to the WCET of
the application because they can suffer from high inter-core
interference due to contention at the shared L2 cache and/or
the shared DRAM. As can be seen in the figure, the majority
of L1 misses originate from heap pages. This suggests
that, even among the critical pages, a few memory areas
contribute disproportionately to the WCET. Additional analysis shows that,
on average, fewer than half of the critical pages account for 80%
of L1 misses. These results suggest that selective, fine-grained
application of deterministic memory can significantly reduce
WCETs while minimizing resource waste.
C. Effects of Deterministic Memory-Aware Cache
In this experiment, we run a real-time benchmark on Core 3
and three instances of a memory-intensive synthetic benchmark
as best-effort co-runners on Cores 0 through 2; we use the
Bandwidth benchmark with a write memory access pattern [20]
as the co-runners. The working-set size of each best-effort
co-runner is chosen so that the combined working set of all co-runners
equals the size of the entire L2 cache. This increases the
likelihood that the cache lines of the real-time task are evicted if they
are not protected.
We evaluate the system with 4 different configurations: NoP,
WP, DM(H), and DM(A). In NoP, the L2 cache is shared
among all cores without any restrictions. In WP, the L2
cache is partitioned using the standard way-based partitioning
method, and 4 ways are given to each core. In DM(H), we
configure the heap section of the program as deterministic
memory, while leaving the rest of the address space as best-
effort memory. Lastly, in DM(A), the entire address space of
the program is marked as deterministic memory.
[Figure 8, plot omitted. Y-axis: hit rate (0.00 to 1.00); x-axis: disparity, mser, sift, svm, texture_synth, aifftr01, aiifft01, matrix01, average; series: NoP, WP, DM(H), DM(A).]
Fig. 8. L2 hit rate.
Figure 8 compares the L2 hit rates of the different
configurations. As can be seen, the hit rate can be as low as 0.53
in NoP. This configuration suffers the most cache misses because
the cache lines of the real-time benchmarks are evicted by
the co-running Bandwidth benchmarks. In WP, on the other
hand, all benchmarks show hit rates that are close to 1. This
is because the dedicated private L2 cache space (4 out of 16
cache ways = 512 KB) is sufficient to hold the working sets
of the benchmarks, and the co-runners cannot evict the cache
lines of the real-time tasks. The hit rates are also close to 1
in both DM(H) and DM(A). In DM(A), this is because the
co-runners are not allowed to evict any of the cache lines allocated
for the real-time task, as all of its memory regions are marked as
deterministic memory. In DM(H), not all memory regions are
marked as deterministic, but because most L2
accesses by the real-time tasks are to the heap region (as
shown in Section VII-B), marking only the heap region as
deterministic memory is almost as effective as DM(A).
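The size of the private partition quoted above follows directly from the L2 geometry in Table I (2 MiB, 16 ways), under the assumption that all ways are of equal size:

```latex
\[
\frac{4\ \text{ways}}{16\ \text{ways}} \times 2\ \text{MiB}
  = 0.25 \times 2048\ \text{KiB}
  = 512\ \text{KiB}
\]
```

This is also the 25% share of the cache that each core receives under WP.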
[Figure 9, plot omitted. Y-axis: slowdown (0.0 to 1.5); x-axis: disparity, mser, sift, svm, texture_synth, aifftr01, aiifft01, matrix01, average; series: NoP, WP, DM(H), DM(A).]
Fig. 9. Measured slowdown.
Figure 9 shows the measured execution time slowdowns of
the evaluated benchmarks (normalized to their solo execution
times on a 4-way cache partition). In NoP, without cache
partitioning, we observe up to 1.6x slowdown, which is expected.
In WP, where way-based cache partitioning is applied, we
observe virtually no execution time increase in any of the tested
real-time benchmarks. Both deterministic memory configurations,
DM(H) and DM(A), achieve isolation performance
comparable to that of WP. It is important to note, however, that
DM(H) and DM(A) use less cache space than WP because
unused cache lines of a cache partition can be used by
best-effort tasks.
[Figure 10, plot omitted. Y-axis: partition utilization (0% to 100%); x-axis: disparity, mser, sift, svm, texture_synth, aifftr01, aiifft01, matrix01, average; series: DM(H), DM(A).]
Fig. 10. Cache partition utilization of deterministic memory.
We instrumented the gem5 simulator to count the number
of cache lines holding deterministic memory by checking the DM
bit of each cache line. Figure 10 shows the percentage of
the cache lines in the real-time task’s partition that are
marked as deterministic memory. This number varies between
7% (for aifftr01 in DM(H)) and 99% (for svm), with an average
of 49% across all benchmarks in DM(A). With conventional way
partitioning, the unused cache space in the private
partition is essentially wasted. In the proposed deterministic
memory-aware system, best-effort tasks can use the part
of the partition that is not flagged as deterministic memory.
Thus, the hit rate of the best-effort tasks can potentially be
improved, as more cache space is available to them. This
is shown in the experiment in Section VII-D. Note also that, as
expected, DM(H) uses fewer deterministic memory cache lines
than DM(A).
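The utilization metric plotted in Figure 10 amounts to a simple ratio, sketched below. This is not the instrumentation code added to gem5; the flat array of DM bits and the function name are assumptions made to keep the example self-contained.

```c
#include <stdbool.h>
#include <stddef.h>

/* Percentage of the lines in a partition whose DM bit is set.
 * dm_bits[i] is the DM bit of the i-th line of the partition. */
static double dm_utilization_percent(const bool *dm_bits, size_t num_lines)
{
    size_t dm_lines = 0;

    if (num_lines == 0)
        return 0.0;   /* empty partition: define utilization as zero */
    for (size_t i = 0; i < num_lines; i++) {
        if (dm_bits[i])
            dm_lines++;
    }
    return 100.0 * (double)dm_lines / (double)num_lines;
}
```

Under this definition, whatever fraction of the partition is not counted here is the space that best-effort tasks may claim.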
D. Effects on Best-Effort Tasks
To study the effect of the deterministic memory-aware cache on
best-effort tasks, we designed an experiment with two
scenarios: 1) a best-effort task runs on Core 0, and three
instances of a real-time task run on Cores 1 through 3;
2) a real-time task runs on Core 3, and three instances
of a best-effort task run on Cores 0 through 2.
We select two benchmarks from the SPEC CPU2006 benchmark
suite as best-effort tasks, based on two criteria:
1) they frequently access the shared cache; and 2) they
are sensitive to extra cache space (i.e., their hit rate
improves if more cache space is given to them).
Based on a memory characterization study of the SPEC CPU2006
suite [29], we selected bzip2 and mcf, which satisfy both
criteria.
Figure 11 shows the results for bzip2 in the first scenario.
Inset (a) shows the percentage of cache space used by bzip2 for
each real-time task pairing, while inset (b) shows its hit rates.
Note first that in WP, bzip2 can use only 25% of the cache space
(512 KB out of 2 MB), as this is the size of its private cache
partition. In DM(H) and DM(A), on the other hand, bzip2 can
also allocate cache lines from the private partitions of the
other cores, provided they are not marked as deterministic memory
cache lines. Consequently, the average hit rate improves
by 39% and 38% in DM(H) and DM(A), respectively, compared with
WP. Looking more closely, we can also observe that
bzip2 allocates more cache lines in DM(H) than in
DM(A). This is because only the heap region of the real-time
benchmark is marked as deterministic in DM(H). Therefore,
more space is left for bzip2, which positively impacts the hit
rate of this task compared to the DM(A) case. The results for
the second scenario and for mcf are included in Appendix A due
to space limitations.
[Figure 11, plots omitted. X-axis: disparity, mser, sift, svm, texture_synth, aifftr01, aiifft01, matrix01, average; series: WP, DM(H), DM(A).]
(a) The percentage of cache space occupied by bzip2. Y-axis: cache occupancy (0% to 100%).
(b) bzip2 hit rate. Y-axis: hit rate (0.00 to 1.00).
Fig. 11. Cache usage and hit rate of bzip2. bzip2 is running on Core 0, and
3 instances of the real-time task are running on Core 1 through 3.
VIII. RELATED WORK
Our work is divided into two parts. In the first part, we
demonstrate that on commercial hardware it is possible to transmit an
extra bit of information about the “importance” of a memory
location (and corresponding transactions). In the second, we provide an
example of how shared cache management could be performed
by relying on this extra piece of information. Little work
has been done in the past to export OS/application-level
fine-grained awareness of memory importance down to the
hardware. Conversely, a substantial body of work has explored
the problem of cache and memory resource management.
Describing memory importance: the majority of previous
works have used CPU-centric models to distinguish memory
that is crucial for real-time performance. For instance, page-
coloring [30] is one such scheme: each task, say task A, is
given a certain cache space, but all the memory of task A
is treated equally. The same scheme has also been studied
to partition shared caches in multicore systems [6]–[8]. Page
coloring has also been applied to partition DRAM banks [10],
[11], [14] and TLB [31]. Cache partitioning techniques that use
way-based partitioning [32] exhibit essentially the same
limitations in terms of the granularity at which memory management
decisions are taken: the cache ways available for allocation are
determined either (i) from the ID of the requesting core, or
(ii) from the ID of the requesting task.
Another approach taken in the past is to entirely re-design
the hardware to always provide real-time performance. A
predictable CPU micro-architecture was proposed in [33];
a predictable L2 cache architecture in [34]; real-time bus
arbitration schemes in [35], [36]; and predictable DRAM
controllers in [37], [38].
More closely, the importance of differentiating memory
blocks that are crucial for real-time performance has been well
understood in the context of cache locking and scratchpad
management [9], [39]–[42]. In this class of works the dis-
tinction between real-time memory and best-effort memory is
used by the OS (and/or the compiler) to explicitly manage the
hardware, but it is not propagated down the memory hierarchy.
Conversely, the main objective of our technique is to make
the hardware aware of this distinction, so that the hardware itself
can manage its resources accordingly.
Real-time memory management: Once the importance of
a memory location is available, the way management is
performed is also crucial. While the DM bit is propagated
to the entire memory subsystem, in this work we discuss a
possible scheme only for shared cache management.
The main novelty of the proposed scheme with respect to
cache partitioning [6]–[8], [30] is twofold. First, unused cache
space is automatically available for use by all the applications
in the system. Second, the cache occupancy of a real-time
application tracks changes in its working-set
size, since DM bits are cleared at context switches.
The work in [42] proposes to dynamically lock the cache
throughout the execution of a task. Locking statements are
inserted in the task’s execution flow at compilation time,
whenever uncertainty about the memory locations being
accessed negatively impacts the WCET calculation. A number
of variations have also been explored [9], [43], [44]. Similarly,
compiler support is used in [40] to split a task into intervals. At
the beginning of each interval, the required memory is loaded
onto a scratchpad. The division of a task in a sequence of
intervals with well-defined memory and computation phases
was originally proposed in [45], [46]. The common limitation
of these approaches is that memory locations deemed “important” for real-time
performance may be allocated long before they
are accessed (if they are accessed at all). Conversely, in our scheme
cache space is allocated at the moment a DM-flagged
memory block is requested.
Cache replacement algorithms and their impact on task
WCETs have been studied extensively in the real-time systems
community [47], [48]. For example, Reineke et al. proposed
an LRU-variant cache replacement policy that is aware of
preemptive multitasking [49] to reduce cache-related preemption delay (CRPD) on
single-core processors. On multicore processors, extensions to
the existing cache replacement algorithms have been made to
provide static/dynamic cache partitioning capabilities among
the cores [50]. However, strict core-based partitioning could
under-utilize the reserved cache space. Our replacement algo-
rithm leverages the proposed deterministic memory abstraction
to eliminate resource waste while providing strong isolation
benefits of partitioning.
The deterministic memory abstraction can be used by pre-
dictability enhanced memory controllers. For example, Kim et
al. proposed a predictable DRAM controller design in which
real-time tasks are allocated on dedicated banks and their
requests are prioritized over the requests from the non-real-
time tasks [51]. Similar designs were proposed in [52], [53].
These DRAM controller designs can easily be modified to
support/prioritize deterministic memory requests. As a part
of our future work, we plan to implement a deterministic
memory-aware memory controller based on one of these
DRAM controller designs.
IX. CONCLUSION AND FUTURE WORK
In this paper, we proposed a new memory abstraction, which
we call Deterministic Memory, for predictable and efficient
resource management in multicore systems. We define deterministic
memory as a special memory space for which the platform (the OS and
the hardware architecture) guarantees small and tightly bounded
worst-case access timing.
We presented OS and architecture designs to efficiently
support the deterministic memory abstraction. In particular,
we presented a deterministic memory-aware cache design
that leverages the abstraction to improve the efficiency of the shared
cache without losing the isolation benefits of traditional way-based
cache partitioning. We implemented the proposed OS
extension on a real operating system (Linux) and implemented
the proposed architecture extensions on a cycle-accurate
full-system simulator (gem5).
Evaluation results obtained using a set of EEMBC and SD-
VBS benchmarks support the potential of using the determin-
istic memory abstraction. Concretely, by using deterministic
memory, we achieved the same degree of strong isolation while
using 49% less cache space, on average, than the conventional
way-based cache partitioning method.
As future work, we plan to develop methodologies and tools
to identify “optimal” deterministic memory blocks that max-
imize the overall schedulability. We plan to develop a deter-
ministic memory-aware DRAM controller, extending recently
developed mixed criticality real-time DRAM controllers [51]–
[53]. Lastly, we plan to implement the proposed architecture
extensions on an FPGA using an open-source RISC-V core [54].
REFERENCES
[1] R. Leibinger, “Software architectures for advanced driver assistance
systems (ADAS),” in Int. Workshop on Operating Syst. Platforms for
Embedded Real-Time Applicat. (OSPERT), 2015.
[2] S. Vestal, “Preemptive scheduling of multi-criticality systems with
varying degrees of execution time assurance,” in Real-Time Syst. Symp.
(RTSS). IEEE, 2007, pp. 239–243.
[3] A. Burns and R. Davis, “Mixed criticality systems - A review,” Depart-
ment of Computer Science, University of York, Tech. Rep, 2013.
[4] Certification Authorities Software Team, “CAST-32A: Multi-core Pro-
cessors (Rev 0),” Federal Aviation Administration (FAA), Tech. Rep.,
November 2016.
[5] O. Kotaba et al., “Multicore in real-time systems temporal isolation
challenges due to shared resources,” in Workshop on Industry-Driven
Approaches for Cost-effective Certification of Safety-Critical, Mixed-
Criticality Syst., 2013.
[6] N. Kim et al., “Attacking the one-out-of-m multicore problem by
combining hardware management with mixed-criticality provisioning,”
in 2016 IEEE Real-Time and Embedded Technology and Applicat.
Symp. (RTAS). IEEE, 2016, pp. 1–12.
[7] H. Kim et al., “A coordinated approach for practical os-level cache
management in multi-core real-time systems,” in Real-Time Syst.
(ECRTS). IEEE, 2013, pp. 80–89.
[8] B. Ward et al., “Making shared caches more predictable on multicore
platforms,” in Euromicro Conf. Real-Time Syst. (ECRTS), 2013.
[9] R. Mancuso et al., “Real-time cache management framework for
multi-core architectures,” in Real-Time and Embedded Technology and
Applicat. Symp. (RTAS). IEEE, 2013.
[10] L. Liu et al., “A software memory partition approach for eliminating
bank-level interference in multicore systems,” in Parallel Architecture
and Compilation Techniques (PACT). ACM, 2012, pp. 367–376.
[11] H. Yun et al., “PALLOC: DRAM bank-aware memory allocator for
performance isolation on multicore platforms,” in Real-Time and Em-
bedded Technology and Applicat. Symp. (RTAS), 2014.
[12] H. Yun and G. Yao, “MemGuard: Memory bandwidth reservation
system for efficient performance isolation in multi-core platforms,”
in Real-Time and Embedded Technology and Applicat. Symp. (RTAS),
2013.
[13] J. Nowotsch et al., “Multi-core interference-sensitive WCET analy-
sis leveraging runtime resource capacity enforcement,” in Euromicro
Conf. Real-Time Syst. (ECRTS), 2014.
[14] N. Suzuki et al., “Coordinated bank and cache coloring for temporal
protection of memory accesses,” in Computational Sci. and Eng.
(CSE). IEEE, 2013, pp. 685–692.
[15] N. Binkert et al., “The gem5 simulator,” ACM SIGARCH Comput.
Architecture News, 2011.
[16] “EEMBC benchmark suite,” www.eembc.org.
[17] S. K. Venkata et al., “SD-VBS: The San Diego vision benchmark
suite,” in Int. Symp. Workload Characterization (ISWC). IEEE, 2009,
pp. 55–64.
[18] J. Henning, “SPEC CPU2006 benchmark descriptions,” ACM
SIGARCH Comput. Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[19] R. Mancuso et al., “WCET(m) estimation in multi-core systems using
single core equivalence,” in 2015 27th Euromicro Conf. Real-Time
Syst., July 2015, pp. 174–183.
[20] P. K. Valsan et al., “Taming non-blocking caches to improve isolation
in multicore real-time systems,” in Real-Time and Embedded Technol-
ogy and Applicat. Symp. (RTAS). IEEE, 2016.
[21] H. Kim et al., “Bounding memory interference delay in COTS-based
multi-core systems,” in Real-Time and Embedded Technology and
Applicat. Symp. (RTAS), 2014.
[22] ARM, AMBA AXI and ACE Protocol Specification, 2013.
[23] e500mc Core Reference Manual, Freescale, 2012.
[24] ARM Architecture Reference Manual. ARMv7-A and ARMv7-R Edi-
tion, ARM, 2014.
[25] S. Altmeyer et al., “Improved cache related pre-emption delay aware
response time analysis for fixed priority pre-emptive systems,” Real-
Time Syst. Symp. (RTSS), vol. 48, no. 5, pp. 499–526, 2012.
[26] T. Committee, “Executable and linking format (ELF) specification
version 1.2,” TIS Committee, 1995.
[27] Cortex-A17 Technical Reference Manual, Rev: r1p1, ARM, 2014.
[28] “Gem5: O3CPU,” http://gem5.org/O3CPU.
[29] A. Jaleel, “Memory characterization of workloads using
instrumentation-driven simulation,” http://www.jaleels.org/ajaleel/
publications/SPECanalysis.pdf, 2010.
[30] A. Wolfe, “Software-based cache partitioning for real-time applica-
tions,” J. Comput. and Software Eng., vol. 2, no. 3, pp. 315–327, 1994.
[31] S. A. Panchamukhi and F. Mueller, “Providing task isolation via
TLB coloring,” in Real-Time and Embedded Technology and Applicat.
Symp. (RTAS). IEEE, 2015, pp. 3–13.
[32] K. T. Sundararajan et al., “RECAP: Region-Aware cache partitioning,”
in Int. Conf. Comput. Design (ICCD). IEEE, 2013, pp. 294–301.
[33] M. Zimmer et al., “FlexPRET: A processor platform for mixed-
criticality systems,” in Real-Time and Embedded Technology and Ap-
plicat. Symp. (RTAS). IEEE, 2014, pp. 101–110.
[34] J. Yan and W. Zhang, “Time-predictable L2 cache design for high-
performance real-time systems,” in Embedded and Real-Time Comput-
ing Syst. and Applicat. (RTCSA). IEEE, 2010, pp. 357–366.
[35] J. Rosen et al., “Bus access optimization for predictable implementa-
tion of real-time applications on multiprocessor systems-on-chip,” in
Real-Time Syst. Symp. (RTSS), 2007, pp. 49–60.
[36] J. Jalle et al., “AHRB: A high-performance time-composable AMBA
AHB bus,” in Real-Time and Embedded Technology and Applicat.
Symp. (RTAS). IEEE, 2014, pp. 225–236.
[37] Z. Wu et al., “Worst case analysis of DRAM latency in multi-requestor
systems,” in Real-Time Syst. Symp. (RTSS), 2013.
[38] S. Goossens et al., “Conservative open-page policy for mixed time-
criticality memory controllers,” in Design, Automation and Test in
Europe (DATE), 2013.
[39] I. Puaut and C. Pais, “Scratchpad memories vs locked caches in hard
real-time systems: A quantitative comparison,” in Design, Automation
& Test in Europe (DATE). IEEE, 2007, pp. 1–6.
[40] M. R. Soliman and R. Pellizzoni, “WCET-Driven dynamic data
scratchpad management with compiler-directed prefetching,” in Eu-
romicro Conf. Real-Time Syst. (ECRTS), vol. 76, 2017, pp. 24:1–24:23.
[41] S. Wasly and R. Pellizzoni, “A dynamic scratchpad memory unit for
predictable real-time embedded systems,” in Euromicro Conf. Real-
Time Syst. (ECRTS). IEEE, 2013, pp. 183–192.
[42] X. Vera et al., “Data cache locking for tight timing calculations,” ACM
Trans. Embed. Comput. Syst., vol. 7, no. 1, pp. 4:1–4:38, Dec. 2007.
[43] T. Liu et al., “Task assignment with cache partitioning and locking
for wcet minimization on mpsoc,” in 2010 39th Int. Conf. Parallel
Processing, Sept 2010, pp. 573–582.
[44] A. Sarkar et al., “Static task partitioning for locked caches in multicore
real-time systems,” ACM Trans. Embed. Comput. Syst., vol. 14, no. 1,
pp. 4:1–4:30, Jan. 2015.
[45] R. Pellizzoni et al., “A predictable execution model for COTS-based
embedded systems,” in Real-Time and Embedded Technology and
Applicat. Symp. (RTAS). IEEE, 2011, pp. 269–279.
[46] G. Yao et al., “Memory-centric scheduling for multicore hard real-time
systems,” Real-Time Syst., vol. 48, no. 6, pp. 681–715, 2012.
[47] R. Wilhelm et al., “The worst-case execution-time problem - overview
of methods and survey of tools,” ACM Trans. Embedded Comput. Syst.
(TECS), vol. 7, no. 3, 2008.
[48] J. Reineke and R. Sen, “Sound and efficient WCET analysis in the
presence of timing anomalies,” in OASIcs-OpenAccess Series in
Informatics, vol. 10. Schloss Dagstuhl-Leibniz-Zentrum für Informatik,
2009.
[49] J. Reineke et al., “Selfish-LRU: Preemption-aware caching for pre-
dictability and performance,” in Real-Time and Embedded Technology
and Applicat. Symp. (RTAS). IEEE, 2014, pp. 135–144.
[50] R. Iyer et al., “QoS policies and architecture for cache/memory in
CMP platforms,” ACM SIGMETRICS Performance Evaluation Review,
vol. 35, no. 1, pp. 25–36, 2007.
[51] H. Kim et al., “A predictable and command-level priority-based
DRAM controller for mixed-criticality systems,” in Real-Time and
Embedded Technology and Applicat. Symp. (RTAS). IEEE, 2015, pp.
317–326.
[52] J. Jalle et al., “A dual-criticality memory controller (DCmc): Proposal
and evaluation of a space case study,” in Real-Time Syst. Symp. (RTSS).
IEEE, 2014, pp. 207–217.
[53] P. Valsan and H. Yun, “MEDUSA: A predictable and high-performance
DRAM controller for multicore based embedded systems,” in Cyber-
Physical Syst., Networks, and Applicat. (CPSNA). IEEE, 2015.
[54] “The berkeley out-of-order RISC-V processor code repository,”
https://github.com/ucb-bar/riscv-boom.
[Figure 12, plots omitted. X-axis: disparity, mser, sift, svm, texture_synth, aifftr01, aiifft01, matrix01, average; series: WP, DM(H), DM(A).]
(a) The percentage of cache space occupied by mcf. Y-axis: cache occupancy (0% to 100%).
(b) mcf hit rate. Y-axis: hit rate (0.00 to 1.00).
(c) The percentage of total cache space occupied by mcf running on 3 cores. Y-axis: cache occupancy (0% to 100%).
(d) Average hit rate for mcf running on 3 cores. Y-axis: hit rate (0.00 to 1.00).
Fig. 12. Cache usage and hit rate of mcf for two different scenarios: 1) mcf is running on one core, and 3 instances of the real-time task are running on
the rest of the cores (figures (a) and (b)); 2) the real-time task is running on one core, and three instances of mcf are running on the rest of the cores (figures (c)
and (d)).
APPENDIX
A. Additional Results for Best-effort Tasks
Figure 13(a) shows the result for Scenario 2 (see Section VII-D). As
can be seen, three instances of bzip2 occupy 75% of the cache
space in WP (each gets 25%). To calculate the cache space
each bzip2 instance occupies in DM(H) and DM(A), the numbers in
Figure 13(a) must be divided by 3. Doing so shows
that each bzip2 instance occupies less cache space than
in Scenario 1 (Figure 11(a)).
This results in a smaller hit rate improvement than in
Scenario 1, as can be observed by comparing Figure 11(b)
and Figure 13(b).
Figure 12 shows the results for mcf. By comparing this
figure to Figures 11 and 13, we can observe that the amount
of cache space that mcf and bzip2 are allowed to occupy is
the same. In other words, although the increase in cache occupancy
from WP to DM(H) or DM(A) is the same for both
benchmarks, bzip2 benefits more from the additional space in
terms of hit rate.
[Figure 13, plots omitted. X-axis: disparity, mser, sift, svm, texture_synth, aifftr01, aiifft01, matrix01, average; series: WP, DM(H), DM(A).]
(a) The percentage of total cache space occupied by bzip2 running on 3 cores. Y-axis: cache occupancy (0% to 100%).
(b) Average hit rate for bzip2 running on 3 cores. Y-axis: hit rate (0.00 to 1.00).
Fig. 13. Cache usage and hit rate of bzip2. The real-time task is running
on Core 3, and three instances of bzip2 are running on Core 0 through 2.
