Real-time systems on multicore platforms: managing hardware resources for predictable execution by Ye, Ying
Boston University
OpenBU http://open.bu.edu
Theses & Dissertations Boston University Theses & Dissertations
2017
https://hdl.handle.net/2144/27475
BOSTON UNIVERSITY
GRADUATE SCHOOL OF ARTS AND SCIENCES
Dissertation
REAL-TIME SYSTEMS ON MULTICORE PLATFORMS:
MANAGING HARDWARE RESOURCES FOR PREDICTABLE
EXECUTION
by
YING YE
B.E., Tongji University, China, 2011
Submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
2017
© Copyright by
YING YE
2017
Approved by
First Reader
Richard West, PhD
Professor of Computer Science
Second Reader
Jonathan Appavoo, PhD
Associate Professor of Computer Science
Third Reader
Abraham Matta, PhD
Professor of Computer Science
Acknowledgments
First, I want to thank my advisor, Professor Richard West, for guiding and supporting
me throughout the past years. With his help, I learned the whole process of conducting
independent research, from identifying and defining problems, coming up with solutions,
to evaluating solutions and writing up research papers. His constant push makes me realize
how much potential I have. Without his encouragement, my last research project vLibOS,
of which I am most proud, would not exist. I am also grateful for his strict requirements
on presentation quality, which have made me a much better public speaker than I otherwise would be.
I want to thank Professor Jonathan Appavoo. His feedback on my research has been invaluable,
and his generous help with my last project is greatly appreciated. I would like to thank all
other thesis committee members as well, including Professor Abraham Matta, Professor
Hongwei Xi and Professor Renato Mancuso. Their comments have greatly improved the
quality of this thesis.
I also want to thank my colleagues: Ye Li, Zhuoqun Cheng, Jingyi Zhang and Soham
Sinha. I really enjoyed the fruitful discussions with them. I owe them a lot for their kind
help with my research projects at different stages. Ye was like a mentor to me when I had just
started my PhD program. I benefited a lot from his experience and insights. Zhuoqun is a
good friend of mine. I enjoyed all the research collaborations and the fun moments outside
school with him. Finally, I have to thank James Cadden. Although we were not working
in the same lab, he always tried to help me generously. Without him, my life at BU would
not have been so much fun.
This work is supported in part by the National Science Foundation (NSF) under Grant
#1527050. Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the NSF.
REAL-TIME SYSTEMS ON MULTICORE PLATFORMS:
MANAGING HARDWARE RESOURCES FOR PREDICTABLE
EXECUTION
(Order No. )
YING YE
Boston University, Graduate School of Arts and Sciences, 2017
Major Professor: Richard West, Professor of Computer Science
ABSTRACT
Shared hardware resources in commodity multicore processors are subject to contention
from co-running threads. The resultant interference can lead to highly-variable perfor-
mance for individual applications. This is particularly problematic for real-time applica-
tions, which require predictable timing guarantees. It also leads to a pessimistic estimate
of the Worst Case Execution Time (WCET) for every real-time application. More CPU
time needs to be reserved, thus fewer applications can enter the system. As the average
execution time is usually far less than the WCET, a significant amount of reserved CPU
resource would be wasted.
Previous works have attempted partitioning the shared resources, amongst either CPUs
or processes, to improve performance isolation. However, they have not proven to be both
efficient and effective. In this thesis, we propose several mechanisms and frameworks that
manage the shared caches and memory buses on multicore platforms. Firstly, we introduce
a multicore real-time scheduling framework with the foreground/background scheduling
model. Combining real-time load balancing with background scheduling, CPU utilization
is greatly improved. In addition, a memory bus management mechanism is implemented
on top of the background scheduling, making sure bus contention is under control while
utilizing unused CPU cycles. Also, cache partitioning is thoroughly studied in this thesis,
with a cache-aware load balancing algorithm and a dynamic cache partitioning framework
proposed. Lastly, we describe a system architecture to integrate the above solutions all
together. It tackles one of the toughest problems in OS innovation, legacy support, by
converting existing OSes into libraries in a virtualized environment. Thus, within a single
multicore platform, we benefit from the fine-grained resource control of a real-time OS
and the richness of functionality of a general-purpose OS.
Contents
1 Introduction
1.1 Motivation
1.2 Thesis Statement
1.3 Contributions
1.4 Thesis Organization
2 Related Work
2.1 Multicore CPU Scheduling
2.2 Software Cache Partitioning
2.3 Memory Bus and DRAM Management
2.4 OS Evolution Strategies
3 Multicore CPU Scheduling
3.1 Quest Operating System
3.2 Background Scheduling
3.3 Predictable Migration
3.4 VCPU Load Balancing
3.5 Experimental Evaluation
3.5.1 Background Scheduling
3.5.2 VCPU Load Balancing
3.6 Summary
4 Cache Partitioning
4.1 Page Coloring
4.2 Color-aware Memory Allocator
4.3 Static Cache Partitioning
4.4 Dynamic Cache Partitioning
4.4.1 COLORIS Architecture
4.4.2 Page Color Manager
4.5 Experimental Evaluation
4.5.1 Static Cache Partitioning
4.5.2 Dynamic Cache Partitioning
4.6 Summary
5 Memory Bus Management
5.1 Bus Performance Metric
5.2 Memory-aware Scheduling
5.3 Experimental Evaluation
5.4 Summary
6 System Integration
6.1 vLibOS Design
6.1.1 User APIs
6.1.2 Master OS
6.1.3 vLib OS
6.1.4 Hypervisor
6.1.5 Applications
6.2 Implementation: A Multicore Real-Time System
6.2.1 Partitioning Hypervisor
6.2.2 vLib OS: Linux
6.2.3 Discussion
6.3 Experimental Evaluation
6.3.1 vLib Call Overhead
6.3.2 Performance of Partitioned I/O Devices
6.3.3 Effectiveness of Memory Throttling
6.3.4 Autonomous Driving Case Study
6.4 Summary
7 Conclusions and Future Work
7.1 Future Directions
Bibliography
Curriculum Vitae
List of Tables
3.1 Hardware Specification
4.1 Allocator Implementation Overhead
4.2 Experimental Configurations
4.3 Instructions Retired in One Hour (×10^11)
4.4 Recoloring Overhead (Total # Pages Recolored)
4.5 Stable-State Color Assignments (povray, tonto, omnetpp, gamess)
5.1 Profile Configurations
6.1 Hardware Specification
6.2 Mechanism Overhead
6.3 GPU Performance (10^6 CPU cycles)
List of Figures
3.1 VCPU Scheduling Framework
3.2 Background Scheduling
3.3 Instructions Retired in Background Mode
3.4 Differences between task1 and task2
4.1 Page Color Bits
4.2 Mapping Between Memory Pages and Cache Space
4.3 Memory Page Allocator
4.4 COLORIS Architecture Overview
4.5 COLORIS Page Coloring Scheme
4.6 Overhead of Hot Page Identification
4.7 Cache-aware Load Balancing
4.8 mwalk With Different Working Set Sizes
4.9 Execution Time of Foreground Workload
4.10 Cache Miss Ratio of Foreground Workload
4.11 LLC Miss Ratios
4.12 Over-Committed System Performance
5.1 Sync Effect
5.2 Example of Memory Controller Occupancy and Requests
5.3 Bus Traffic & Instructions Retired versus Latency
5.4 Comparison between Rate- and Latency-based Throttling
5.5 Foreground Performance of canny
6.1 vLibOS Concept
6.2 vLibOS Architecture
6.3 vLibOS Unified Scheduling (left is sync call, right is async call)
6.4 Autonomous Vehicle Platform
6.5 Effective Memory Throttling
6.6 Autonomous Driving System
6.7 Object Avoidance Algorithm
6.8 lidar Performance in Quest-V
6.9 lidar Performance in Vanilla Linux
Chapter 1
Introduction
1.1 Motivation
There is an increasing prevalence of multicore processors in embedded and real-time systems.
These processors offer power and performance benefits over single-core alternatives
running at higher clock frequencies. However, complex on-chip cache hierarchies, including
shared last-level caches, and memory buses (connecting main memory to the memory controller)
that are common to all cores pose challenges for tasks with real-time requirements. A task
(here, a single-threaded application) running on one core may experience
harmful contention for cache lines that are shared with tasks running on other cores.
Consequently, a task that should seemingly run in isolation from tasks on other cores experiences
timing unpredictability due to unforeseen cache line evictions, misses, and reloads.
Similarly, cache-line fills with instructions and data require accesses to a shared memory bus.
This may lead to one or more co-running tasks being forced to stall while a memory bus
transaction is performed for another task.
For safety-critical real-time systems, shared-resource-related performance variations
can lead to task timing violations and potentially life-threatening situations. Moreover, arbitrary
contention forces the timing analysis to be pessimistic. In the worst case, only one core’s
processing capacity can be utilized on an M-core processor. This “One-out-of-M” problem [29]
leads to very low utilization of multicore processors, compromising their
power and performance benefits. To deal with these issues, performance isolation is
essential.
Solutions have been proposed for managing the shared caches and memory buses.
A common way to control cache contention is to partition the last-level cache (LLC),
amongst either CPUs or processes. While software-based partitioning allows for better
flexibility, hardware-based partitioning incurs less overhead. The same resource partition-
ing idea applies to memory bus management. It is also possible to construct a resource-
aware scheduler that dispatches particular threads to avoid heavy resource contention.
However, existing solutions have not proven to be both efficient and effective.
Even with existing solutions, integrating them into a complete system remains a
challenge. One approach is to retrofit a general-purpose OS (GPOS) with the requisite
features that mitigate the effects of resource sharing. But this approach does not eas-
ily achieve the desired end goals, since fundamentally there is a mismatch between the
system’s design goals (e.g., high performance, fairness) and real-time applications’ requirements.
For example, despite years of effort, Linux PREEMPT_RT still has not been adopted in hard
real-time systems. An alternative approach is to carefully construct a from-scratch real-time
OS that controls the impact of sharing. While effective, this approach suffers from the lack
of the rich functionality of a GPOS. Device drivers and libraries, though
available in GPOSes, need to be rewritten or ported in order to support the minimum fea-
tures required by modern applications. This, oftentimes, is a very time-consuming task.
We believe work needs to be done to ease the development burden of new OSes.
1.2 Thesis Statement
Shared micro-architectural resources such as caches and memory buses impose timing
unpredictability on the execution of real-time tasks on modern commodity multicore pro-
cessors. This thesis shows that it is possible to build a multicore real-time system with the
required temporal isolation while benefiting from the rich features of a general-purpose
OS.
1.3 Contributions
The contributions of this thesis include the following:
• We propose the foreground/background scheduling model to increase CPU utiliza-
tion in real-time systems. This model also serves as a flexible mechanism for man-
aging shared hardware resources. Combined with our predictable load balancing
algorithm, the efficiency of the whole platform is improved.
• We present a dynamic cache partitioning framework, which addresses issues with
previous dynamic partitioning schemes and provides quality-of-service (QoS) guarantees
to applications. In addition, a static partitioning scheme is evaluated together with a
cache-aware load balancing algorithm.
• We discuss a novel memory bus performance metric, called average memory request
latency. It outperforms the traditional bandwidth metric by detecting bus contention
more accurately. Based on it, we further describe a mechanism for effective bus
contention management.
• We propose the vLibOS model for building new OSes. While legacy support is ful-
filled by existing OSes, shared resources are managed by the new OS. A reference
implementation of this model is presented, targeting multicore real-time systems.
We then discuss the potential applications of this model and its alternative imple-
mentations in different settings.
1.4 Thesis Organization
The rest of this thesis is organized as follows: Chapter 2 reviews related literature on mul-
ticore CPU scheduling, cache partitioning, memory bus and DRAM management and OS
evolution strategies. Chapter 3 presents our basic scheduling model and a predictable load
balancing algorithm. Chapter 4 describes our methods for partitioning the shared caches,
both statically and dynamically. Chapter 5 discusses a novel memory bus performance
metric and its application in bus traffic control. Chapter 6 introduces an innovative system
architecture for building feature-rich real-time operating systems. Finally, Chapter 7
summarizes the research contributions of this thesis and shares ideas about future directions.
Chapter 2
Related Work
2.1 Multicore CPU Scheduling
Prior work on multicore real-time scheduling has predominantly focused on global [16,
11] and partitioned [4, 5] approaches. Global scheduling selects tasks from a system-wide
run queue and allows for task migrations between cores. Partitioned scheduling statically
assigns each task to a core, where it is scheduled from a local run queue. While the global
approach tends to achieve better CPU utilization across the whole system [15], its utilization
bound can be as low as 1 + ε in the worst case [16], for arbitrarily small ε. Partitioned
scheduling, on the other hand, divides the multicore scheduling problem into separate
uniprocessor scheduling problems. Then it takes advantage of well-studied uniprocessor
scheduling techniques to schedule local tasks [37, 36], with fairly good utilization bound.
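
The low worst-case bound of global scheduling mentioned above can be illustrated with a small numeric sketch. It uses the classic construction commonly referred to as the Dhall effect; the task parameters and the function names below are illustrative assumptions and are not taken from the cited works. On m cores, m light tasks (C = 2ε, T = 1) and one heavy task (C = 1, T = 1 + ε) are released together. Under global EDF or RM the light tasks run first on all cores, so the heavy task misses its deadline even though total utilization is only slightly above 1.

```python
from fractions import Fraction


def dhall_task_set(m, eps):
    """Return (light_tasks, heavy_task) as (C, T) pairs for an m-core system."""
    light = [(2 * eps, Fraction(1))] * m   # m light tasks: C = 2*eps, T = 1
    heavy = (Fraction(1), 1 + eps)         # one heavy task: C = 1, T = 1 + eps
    return light, heavy


def total_utilization(light, heavy):
    """Sum of C/T over all tasks in the set."""
    return sum(c / t for c, t in light) + heavy[0] / heavy[1]


def heavy_task_misses(m, eps):
    """Check the first job of the heavy task (implicit deadlines, release at 0).

    The m light tasks have the earlier deadline and the shorter period, so
    under global EDF or RM they occupy all m cores during [0, 2*eps]. The
    heavy task can only start at time 2*eps and needs 1 unit of CPU time.
    """
    _, heavy = dhall_task_set(m, eps)
    finish = 2 * eps + heavy[0]
    return finish > heavy[1]


if __name__ == "__main__":
    m, eps = 4, Fraction(1, 100)
    light, heavy = dhall_task_set(m, eps)
    print(float(total_utilization(light, heavy)), heavy_task_misses(m, eps))
```

A partitioned scheduler would place the heavy task alone on one core and the light tasks on the remaining cores, which is schedulable for this particular task set.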
To improve average CPU utilization while avoiding the overhead of global scheduling,
researchers have developed semi-partitioned schemes. Examples include EDF-fm [1] and
EDF-WM [27]. Semi-partitioned scheduling works by statically mapping tasks to cores
first. If there are tasks left that cannot fit into any core, the algorithm splits each task’s
periodic job into multiple parts and maps them to different cores. Once the job finishes a
part on one core, it is migrated to another core, which is assigned to run the next part of
the job. Through this limited migration, CPU utilization is increased. However, the cost of
migration has not been discussed in previous works. More importantly, they all assumed a
closed system where the set of running tasks is fixed so that core mapping could be done
statically in advance. For an open real-time system, in which tasks start and terminate
dynamically, dynamic load balancing is needed to handle the low utilization problem.
2.2 Software Cache Partitioning
Page coloring [63] was first introduced as a way to manage shared caches on multicore
processors in [50]. Cho and Jin [13] applied page coloring to a multicore system with a
distributed shared cache, with the goal of placing data in cache slices that are closer to the
CPU running the target application. Tam et al. [61] implemented static page coloring in a
prototype Linux system.
Lin et al. [34] later investigated re-partitioning (recoloring) policies for fully committed
systems, which have the same number of co-running threads as cores. While these policies
can be applied to a 2-core platform, their complexity grows exponentially as the number
of cores increases. Moreover, page hotness (page access frequency) is ignored when picking
pages for recoloring, leading to inefficient use of newly assigned colors when the recolored
pages are rarely accessed. Recoloring only hot pages can effectively reduce the number of
pages to be recolored, but it incurs additional overhead from hotness identification [75].
Recent cache management work has attempted to divide the cache among different usages.
Soares et al. [55] determined cache-unfriendly pages via online profiling and mapped
them to a pollute buffer, which is a small portion of the cache. Lu et al. [38] let a user-level
allocator cooperate with a kernel page allocator to provide object-level cache partitioning,
in which the partitions are decided offline. The basic idea is to force user-level data structures
with weak locality to use a reserved part of the cache, while other data structures use
the rest.
Another interesting work is SRM-Buffer [18], which reduces cache interference from
kernel address space. In systems like Linux, the page cache usually occupies a significant
amount of memory – one burst of accesses to it may incur large-scale cache evictions,
hurting application performance. To address this problem, the authors limited the range of
page colors that can be used by the Linux page cache during a file I/O burst.
A number of other researchers have also looked at real-time cache-aware resource man-
agement. This includes work on page coloring with cache lockdown for use in mixed crit-
icality systems [29]. Ward et al. [67] studied cache locking and scheduling techniques, to
reduce WCETs of higher-criticality hard real-time tasks in the presence of lower-criticality
soft real-time tasks. Calandrino et al. [12] studied several real-time cache-aware schedul-
ing policies based on the cache utilization of multi-threaded tasks. Metrics such as the
working set size were used to establish a utilization threshold which, when reached, would
trigger a cache-aware policy to select a task based on under-utilized cache space and avail-
able cores. Kim et al. [28] developed an OS-level cache management scheme for multicore
real-time systems, using Linux/RK. The work included the development of a response time
schedulability test for tasks that share cache partitions. Mancuso et al. [39] also developed
a framework to analyze and profile task memory access patterns, including a kernel-level
cache management scheme to enforce deterministic cache allocations for the most fre-
quently accessed memory areas.
2.3 Memory Bus and DRAM Management
In 1997, Bellosa et al. [8] developed a memory throttling technique (a way to reduce the
memory access rate), using hardware performance counters to determine memory bus usage.
More recently, MemGuard [74] was
developed to address timing variations caused by memory references from different cores.
Each core is assigned a memory budget, which limits the number of memory accesses
in a specified interval. To improve bandwidth utilization, MemGuard predicts the actual
bandwidth usage of each core in the upcoming period. Cores that do not use their entire
budget contribute the surplus to a global pool, which is shared amongst all cores.
A similar user-space technique [24] was developed to allow memory to be budgeted to
individual, or groups of, tasks.
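
The budget-and-reclaim idea described above can be sketched in a few lines. This is a simplified illustration under our own assumptions, not MemGuard's actual implementation: the class name BudgetRegulator, its methods, and the way predictions are supplied are all hypothetical, and hardware counters and interrupts are replaced by explicit calls.

```python
class BudgetRegulator:
    """Per-core memory access budgets with a shared pool of donated surplus."""

    def __init__(self, budgets):
        self.assigned = list(budgets)   # accesses each core may issue per period
        self.remaining = list(budgets)  # what is left in the current period
        self.pool = 0                   # surplus donated by cores this period

    def start_period(self, predicted):
        """Replenish budgets; cores predicted to under-use donate the surplus."""
        self.pool = 0
        for core, budget in enumerate(self.assigned):
            keep = min(budget, predicted[core])
            self.remaining[core] = keep
            self.pool += budget - keep

    def access(self, core, n=1):
        """Request n accesses for a core; return how many are allowed.

        The core's own budget is consumed first, then the global pool.
        A result smaller than n means the core would be throttled until
        the next period.
        """
        granted = min(n, self.remaining[core])
        self.remaining[core] -= granted
        from_pool = min(n - granted, self.pool)
        self.pool -= from_pool
        return granted + from_pool


if __name__ == "__main__":
    reg = BudgetRegulator([100, 100])
    reg.start_period(predicted=[30, 150])   # core 0 donates 70 accesses
    print(reg.access(1, 150), reg.pool)
```

One consequence visible in this sketch is that a core whose usage was under-predicted can run out of budget once the pool is empty; how a real regulator compensates for such mispredictions is outside the scope of this example.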
PALLOC uses a DRAM bank-aware buddy allocator to assign page frames to applica-
tions so that bank-level contention is avoided [73]. Both MemGuard and PALLOC are part
of an effort to develop a Single Core Equivalence (SCE) framework [49]. SCE attempts to
treat each core in a multicore processor as if it were a separate chip, to ensure that a task’s
worst-case execution time is not affected by other tasks running on different cores.
Dirigent [77] is a system that regulates the progress of latency-constrained (foreground)
tasks in the presence of non-time-constrained (background) tasks. Dirigent reduces the
performance variation of foreground applications caused by memory contention, while
maintaining a high throughput for background tasks. The system works by first offline
profiling the execution of latency-constrained tasks when running alone. An online ex-
ecution time predictor monitors the progress of foreground tasks. If the progress of any
foreground task falls behind, a controller then adjusts the resources available to it by slowing
down background task execution and allocating more cache space to it.
2.4 OS Evolution Strategies
OSKit [20] is an early work with an explicit goal to ease new OS development. It pro-
vides a set of commonly used OS components, like bootstrapping, architecture-specific
manipulation and protocol stacks. Modularization and encapsulation of legacy code via a
glue layer allows developers to concentrate their engineering effort on innovative features.
However, it is hard to achieve comparable performance with existing systems if strict mod-
ularization is used. Besides, creating adaptation layers is often a non-trivial task.
There have been a number of research efforts that focus on OS structure and extensibil-
ity. Extensible operating systems research [53] aims at providing applications with greater
control over the management of their resources. For example, the Exokernel [19] tries to
efficiently multiplex hardware resources among applications that utilize library operating
systems. Resource management is thus delegated to library operating systems, which can
be readily modified to suit the needs of individual applications. SPIN [9] is an extensible
operating system that supports extensions written in the Modula-3 programming language.
Interaction between the core kernel and SPIN extensions is mediated by an event system,
which dispatches events to handler functions in the kernel. By providing handlers for
events, extensions can implement application-specific resource management policies with
low overhead.
Some works have attempted to virtualize existing OSes, so that different OSes share
services. User-Mode Linux [17] and L4Linux [64], for example, implement a modified
Linux as a user-level address space on top of a host OS. Virtualization technologies like
KVM [30] or Xen [3] create machine abstractions for hosting guest services with unmod-
ified kernels, or kernels with minimal change.
Dune [6] uses hardware virtualization to expose privileged hardware features to user-level
processes, greatly improving efficiency for certain applications such as garbage collection.
As with extensible kernels, the aim is to enrich the functionality within existing
systems.
VirtuOS [43] delegates part of system services to other VMs through the exception-
less system call mechanism [54]. The execution of services is controlled by each service
domain instead of the primary domain. VirtuOS, as well as Nooks [59], focuses on safety
isolation of existing system components rather than recycling legacy components for new
OSes.
FusedOS [44] proposes the use of a full-blown OS as a master OS, which spawns
light-weight kernels to a subset of CPUs, but without virtualization. Shimosawa et al. [51]
further formalized this hybrid kernel design and defined a corresponding interface. Devel-
opers can focus on innovating new features in light-weight kernels while requesting legacy
services through cross-kernel service delegation. The downside is that, without virtualiza-
tion, protection between kernels is not enforced. A light-weight kernel does not have the
ability to manage global resources in order to regulate cache or memory bus contention.
EbbRT [48] employs a similar approach towards enabling kernel innovation in a dis-
tributed environment. In EbbRT, services are offloaded between machines running light-
weight kernels and full-featured OSes. Offloading is facilitated by an object model which
encapsulates the distributed implementation of system components.
Chapter 3
Multicore CPU Scheduling
There are predominantly two schemes for multicore real-time scheduling, global schedul-
ing and partitioned scheduling. Global scheduling selects tasks from a system-wide run
queue and allows for task migrations between cores. Partitioned scheduling statically as-
signs each task to a core, where it is scheduled from a local run queue. The former, while
achieving better CPU utilization, incurs heavier scheduling overhead due to contention
for the central run queue. Moreover, it is very hard to derive utilization bounds for
global scheduling algorithms [16]. With partitioned scheduling, it is possible to address
the low utilization issue with dynamic task migration, especially when dealing with open
systems. However, it remains a challenge to migrate tasks without timing violation. Also,
contention on shared hardware resources is a tough problem on multicore platforms.
Based on the partitioned approach, we present our design here. Our
goals for multicore real-time scheduling go beyond basic CPU reservation guarantees.
First, we want to take advantage of unreserved CPU time and reservation
leftovers (unused reservation time) to improve the utilization of each CPU. Second, we want
the unused CPU time to be shared fairly amongst tasks. Third, we need a flexible scheduling
model so that shared resource contention can be managed as well. Most of all, we must
achieve these goals without violating any real-time constraints.
In this chapter, we present MARACAS, an implementation based on the Quest
OS [72]. Section 3.1 describes the basic concepts in Quest. The MARACAS multicore
real-time scheduling framework is then proposed and evaluated. We cover the
scheduling model that enables contention-aware resource management, but defer the
introduction of the actual shared resource management policy to Chapter 5.
3.1 Quest Operating System
Quest is a small-footprint real-time OS. It implements a novel virtual CPU (VCPU) schedul-
ing framework [14]. Rather than scheduling threads directly on physical CPUs, the schedul-
ing problem is decomposed into a two-level hierarchy (Figure 3.1). One or more threads
are assigned and scheduled on a VCPU, which is then scheduled on a physical CPU. This
way, groups of threads that are not time-critical or which are part of an equivalent class
can share a single VCPU, while specific real-time tasks may be assigned separate VCPUs.
Figure 3.1: VCPU Scheduling Framework
VCPUs are resource containers [2] for threads that are assigned to them. They account
for budget usage in specific windows of real time. By default, each VCPU is assigned a
processor capacity reserve [40] consisting of a budget capacity, C, and a period, T. A VCPU
is required to receive at least its budget every period when it is runnable. For simplicity,
here we assume that every VCPU is implemented as a Sporadic Server [56, 57], and each
VCPU is assigned a single thread. If multiple threads are assigned to a VCPU, they will
be migrated together as a single entity, which complicates the migration process.
Each core is associated with a separate run queue. All VCPUs assigned to the same
core are scheduled using Rate-Monotonic Scheduling (RMS) [35]. The RMS utilization
bound is then applied on a per-core basis when assigning VCPUs to cores. Schedulability
tests are performed when new VCPUs are created and when they are migrated between
cores.
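
As a minimal sketch of such a per-core schedulability test, the code below assumes that the bound in question is the classic Liu and Layland bound n(2^(1/n) - 1) for n tasks under RMS. The VCPU class and the function names are hypothetical and are not Quest's actual interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VCPU:
    budget: float  # C: CPU time guaranteed in every period
    period: float  # T: replenishment period

    @property
    def utilization(self):
        return self.budget / self.period


def rms_bound(n):
    """Liu and Layland utilization bound for n tasks under RMS."""
    return n * (2 ** (1.0 / n) - 1)


def admits(core_vcpus, candidate):
    """Return True if the core stays within the bound after adding candidate.

    The same check can be applied both when a VCPU is created and when an
    existing VCPU is a migration target for this core.
    """
    vcpus = list(core_vcpus) + [candidate]
    total = sum(v.utilization for v in vcpus)
    return total <= rms_bound(len(vcpus))


if __name__ == "__main__":
    core = [VCPU(20, 100), VCPU(10, 50)]   # 40% of the core already reserved
    print(admits(core, VCPU(30, 100)), admits(core, VCPU(40, 100)))
```

The bound is sufficient but not necessary, so a test of this form may reject VCPU sets that an exact response-time analysis would accept.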
3.2 Background Scheduling
Each VCPU with available budget at the current time operates in foreground mode. When
a VCPU depletes its budget it enters background mode, where it will only be scheduled
if there are no other runnable VCPUs in foreground mode on the same core. A core is
said to be in background mode when all VCPUs assigned to it are in background mode.
At this point, the core invokes its background scheduling policy. MARACAS implements
a background scheduling algorithm that attempts to fairly distribute surplus CPU time
amongst VCPUs. Every task and, hence, VCPU² is tracked for the amount of background
CPU time (BGT) it has used so far. When a core enters background mode, the local
scheduler picks a task with the smallest BGT and keeps it running until the core switches
back to foreground mode. The mode switch occurs when a VCPU is replenished with new
budget.
² Unless otherwise stated, we use “task” and “VCPU” interchangeably in this chapter.
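The foreground and background selection rule described above can be sketched as follows. This is only an illustration: the VCPU record and its field names are hypothetical, and foreground ordering is assumed to be rate-monotonic (shortest period first).

```python
from dataclasses import dataclass


@dataclass
class VCPU:
    name: str
    period: int          # T; a shorter period means higher RMS priority
    budget_left: int     # remaining foreground budget in this period
    bgt: int = 0         # background CPU time (BGT) used so far
    runnable: bool = True


def pick_next(vcpus):
    """Pick the VCPU to run next on one core, or None if nothing is runnable."""
    runnable = [v for v in vcpus if v.runnable]
    if not runnable:
        return None
    foreground = [v for v in runnable if v.budget_left > 0]
    if foreground:
        # Foreground mode: rate-monotonic order.
        return min(foreground, key=lambda v: v.period)
    # Background mode: every runnable VCPU has depleted its budget, so the
    # one with the smallest BGT receives the surplus CPU time.
    return min(runnable, key=lambda v: v.bgt)
```

A VCPU with remaining budget always wins over depleted ones; once all budgets are depleted, the VCPU with the least accumulated background time is chosen.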
An alternative background scheduling approach is to keep the same task running on
a core when it switches from foreground to background mode. If that task happens to
block during background mode, the system schedules the task that is expected to run first
when the core switches back into foreground mode. This method attempts to reduce con-
text switches, but preliminary experimental results suggest there is negligible performance
benefit.
A further background scheduling option attempts to reduce cache and memory bus
contention. A scheduler running on a core in background mode gives precedence to tasks
that are less memory intensive. This approach guarantees budgeted foreground time for a
set of tasks, while trying to use surplus CPU time without increasing resource contention.
Other approaches also have attempted to co-schedule tasks to avoid cache and memory
bus contention, but at the cost of meeting timing requirements [10, 69]. The use of sepa-
rate foreground and background modes enables VCPU timing constraints to be met, while
allowing for objectives such as fairness or performance to be addressed. This flexibility al-
lows us to easily integrate contention-aware resource management policies into the system
(see Chapter 5).
3.3 Predictable Migration
In MARACAS, special kernel threads (migration threads), associated with dedicated VCPUs,
are responsible for the movement of VCPUs between their local cores and the destination
cores. Every core has one migration thread, but only one can be active at any given
time while the others are blocked. A migration thread is responsible for checking the load of
every core and deciding whether to perform load balancing. If a migration thread decides
to move a local VCPU, it must first identify a destination core. A VCPU is only migrated
if it passes a schedulability test for the destination core. The cost of the test, which is
a relatively simple utilization bound calculation, is factored into the migration thread’s
CPU budget. A migration thread is woken up when there is a scheduling event (e.g., a
task blocks, wakes up or terminates), at which point it performs a rebalance check and
potential VCPU migration. A migration thread is not woken up by the periodic sleeping
of real-time tasks. Currently, only one VCPU is allowed to be migrated within one period
of the migration thread. This is purely a policy decision and not an inherent design
limitation. A pending event counter records the number of scheduling events (up to the number
of cores) that happen before the active migration thread completes the transfer of a VCPU,
making sure events are not ignored.
For real-time systems, the migration process has to be predictable. Care must be taken
to make sure migration cost does not impact the timing requirement of the VCPU being
relocated, as well as those running on the destination core. First, migration threads are set
to the highest priority on their respective cores to avoid preemption during the migration
process. Second, each VCPU that is created must pass a schedulability test on its assigned
core. This means the migration thread’s execution of its entire budget $C_m$ does not lead to
any other local VCPUs missing their deadlines. Therefore, as long as the migration cost
does not exceed $C_m$, timing constraints on the local core will not be violated. Finally, the
migrated VCPU must pass a schedulability test at the destination to ensure that it does not
violate any timing guarantees on that core.
In MARACAS, the VCPU migration process is as simple as locking two run queues
(on the source and destination cores), detaching the VCPU data structure from the source
queue and attaching it to the destination. Memory address space copying is not needed
during migration unless software cache partitioning is enabled in the system.
Let $E_{lock}$ be the overhead of locking a run queue, and $E_{struct}$ be the overhead of moving
a VCPU data structure in the worst case. Then, the following condition must hold:
$$C_m \geq 2 \times E_{lock} + E_{struct}$$
If software cache partitioning (page coloring) is enabled, then pages of a migrated
address space may need to be recolored [71] on the destination core. MARACAS builds
upon the Quest system guarantee that process address spaces are limited to a maximum
size. Hence, it is possible to place an upper bound on the memory copying overheads. Let
$E_{page}$ be the cost of copying one page, and $P_{max}$ be the maximum number of pages. Then,
the new migration constraint is:
$$C_m \geq 2 \times E_{lock} + E_{struct} + P_{max} \times E_{page}$$
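As a worked illustration of the two migration constraints, the sketch below computes the smallest budget that satisfies them. The function name and the numbers in the usage note are hypothetical; they are not measurements from MARACAS.

```python
def min_migration_budget(e_lock, e_struct, p_max=0, e_page=0):
    """Smallest migration-thread budget C_m satisfying the constraint.

    e_lock:   worst-case cost of locking one run queue
    e_struct: worst-case cost of moving the VCPU data structure
    p_max:    maximum number of pages to recolor (0 without page coloring)
    e_page:   cost of copying one page
    All costs are assumed to be expressed in the same time unit.
    """
    return 2 * e_lock + e_struct + p_max * e_page
```

With assumed costs of 2 and 5 time units for locking and structure transfer, the bound is 9 units without page coloring. Adding 1024 pages at 1 unit each raises it to 1033 units, which shows how recoloring can dominate the reservation.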
To keep migration cost down, it makes sense to reduce the frequency with which migrations
occur. As an optimization, a timestamp is taken at the end of a VCPU migration.
A new migration can only be performed after a predefined minimum interval has elapsed.
3.4 VCPU Load Balancing
In Linux, CPU load is defined as the sum of all local tasks’ scheduling weights, where
a weight is decided by a task’s priority. Every core periodically runs a load balancing
algorithm, which attempts to minimize the difference in load amongst all cores. The goal
is to let every task of the same priority have the same amount of CPU time.
Load balancing in our real-time VCPU framework differs from that in a general-
purpose OS, as it deals with VCPUs that have CPU reservations. Each VCPU is guaranteed
its CPU reservation irrespective of the mapping of VCPUs to cores. Balancing VCPUs so
they receive the same amount of CPU time would penalize those with larger reservations.
VCPUs with larger budgets in a given period of time would have less background time
than those with smaller reservations. In observance of these differences, we propose an
alternative method of load balancing for a system of VCPUs on multicore platforms.
Let each VCPU, $V_i$, have a utilization factor $U_i = C_i / T_i$. We then define the
Slack-Per-VCPU (SPV) of a core as $\frac{1.0 - \sum_{i=1}^{v} U_i}{v}$, where $v$ is the number
of VCPUs on the corresponding core. The smaller the SPV is, the heavier the load is for the
core. For load balancing, we attempt to equalize the SPV values across cores.
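The SPV definition can be checked with a short calculation. In this sketch, the function name and the (C, T) tuple format are illustrative, and treating an empty core as having unbounded slack is an assumption, since the formula is undefined for v = 0.

```python
def spv(vcpus):
    """Slack-Per-VCPU of one core; vcpus is a list of (C, T) pairs."""
    if not vcpus:
        # Assumption: an empty core is treated as having unbounded slack.
        return float("inf")
    total_utilization = sum(c / t for c, t in vcpus)
    return (1.0 - total_utilization) / len(vcpus)
```

A core with VCPUs (20, 100) and (30, 100) has a total utilization of 0.5, so its SPV is 0.5 / 2 = 0.25. A core with a single (10, 100) VCPU has an SPV of 0.9 and is therefore the more lightly loaded of the two.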
Algorithm 1 is the main body of the VCPU load balancing scheme. When any VCPU
blocks, the kernel tries to find a core with the smallest SPV value and activates its
corresponding migration thread. Whenever a VCPU is awoken, the migration thread on the
same core is activated as well. Every migration thread runs the procedure REBALANCE
when active, which migrates VCPUs from the current core to the one that has the most idle
time, as indicated by the largest SPV. FIND_HOST_CPU identifies the core with the largest
idle time that can feasibly schedule a new VCPU. For a feasible schedule, all VCPUs must
satisfy their foreground scheduling requirements on the given core. Line 18 starts by
finding a target core for the VCPU with the lowest utilization on the local (source) core. If
such a core exists, lines 29 onwards check to see if an alternative VCPU could be migrated
to the target core to reduce the SPV imbalance between the source and destination. By
checking the feasibility of migrating the lowest utilization VCPU first, we avoid attempting
to reduce the SPV imbalance across cores for higher utilization VCPUs that would
not be schedulable at the destination. Finally, line 48 is the condition to terminate the
rebalancing procedure.
Algorithm 1 VCPU Load Balancing
 1: procedure FIND_HOST_CPU(new_vcpu)
 2:     max = 0
 3:     for all cpu do
 4:         if schedulability_test(cpu, new_vcpu) == FALSE then
 5:             continue
 6:         end if
 7:         if SPV(cpu) > max then
 8:             max = SPV(cpu)
 9:             host = cpu
10:         end if
11:     end for
12:     return host
13: end procedure
14: procedure REBALANCE
15:     src_cpu = current_cpu_id()
16:     /* return the VCPU with the smallest utilization on a core */
17:     min_v = get_smallest_ut_vcpu(src_cpu)
18:     dst_cpu = FIND_HOST_CPU(min_v)
19:     if dst_cpu == src_cpu then
20:         return
21:     end if
22:     src_spv = get_SPV_of(src_cpu)
23:     dst_spv = get_SPV_of(dst_cpu)
24:     if runqueue_length(dst_cpu) == 0 then
25:         imbalance = ∞
26:     else
27:         imbalance = dst_spv − src_spv
28:     end if
29:     for all vcpu in runqueue(src_cpu) do
30:         if runqueue_length(src_cpu) <= 1 then
31:             break
32:         end if
33:         /* calculate a core’s new SPV as if a VCPU is added */
34:         /* without actually adding VCPU to the core */
35:         dst_spv = get_SPV_add_one(dst_cpu, vcpu)
36:         /* calculate a core’s new SPV as if a VCPU is removed */
37:         /* without actually removing VCPU from the core */
38:         src_spv = get_SPV_remove_one(src_cpu, vcpu)
39:         if dst_spv < src_spv and
40:                 src_spv − dst_spv >= imbalance then
41:             continue
42:         end if
43:         if schedulability_test(dst_cpu, vcpu) == FALSE then
44:             continue
45:         end if
46:         move_vcpu(src_cpu, dst_cpu, vcpu)
47:         imbalance = dst_spv − src_spv
48:         if imbalance <= 0 then
49:             break
50:         end if
51:     end for
52: end procedure
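To make the control flow of Algorithm 1 concrete, the following is a rough Python rendering. It is a sketch, not the MARACAS implementation: the VCPU record, the dictionary of run queues, the RMS utilization bound used as the schedulability test, and the treatment of an empty core as having infinite SPV are all assumptions made for illustration.

```python
import math
from collections import namedtuple

# Hypothetical VCPU record: budget C and period T in the same time unit.
VCPU = namedtuple("VCPU", "name C T")


def util(v):
    return v.C / v.T


def spv(queue):
    # Assumption: an empty core has unbounded slack per VCPU.
    return math.inf if not queue else (1.0 - sum(map(util, queue))) / len(queue)


def schedulable(queue, v):
    # Assumption: RMS utilization bound test; a VCPU that is already on
    # the core is not counted twice.
    trial = queue if v in queue else queue + [v]
    return sum(map(util, trial)) <= len(trial) * (2 ** (1.0 / len(trial)) - 1)


def find_host_cpu(cores, new_vcpu):
    """Return the feasible core with the largest SPV (lines 1 to 13)."""
    best, host = 0.0, None
    for cpu, queue in cores.items():
        if schedulable(queue, new_vcpu) and spv(queue) > best:
            best, host = spv(queue), cpu
    return host


def rebalance(cores, src):
    """Migrate VCPUs away from core src (lines 14 to 52)."""
    if not cores[src]:
        return
    min_v = min(cores[src], key=util)
    dst = find_host_cpu(cores, min_v)
    if dst is None or dst == src:
        return
    imbalance = math.inf if not cores[dst] else spv(cores[dst]) - spv(cores[src])
    for v in list(cores[src]):
        if len(cores[src]) <= 1:
            break
        dst_spv = spv(cores[dst] + [v])                    # SPV if v is added
        src_spv = spv([x for x in cores[src] if x != v])   # SPV if v is removed
        if dst_spv < src_spv and src_spv - dst_spv >= imbalance:
            continue
        if not schedulable(cores[dst], v):
            continue
        cores[src].remove(v)
        cores[dst].append(v)
        imbalance = dst_spv - src_spv
        if imbalance <= 0:
            break
```

Under these assumptions, starting with three VCPUs of utilization 0.1, 0.2 and 0.3 on core 0 and an empty core 1, one call to rebalance(cores, 0) moves the two lighter VCPUs to core 1 and then stops because the imbalance becomes negative.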
3.5 Experimental Evaluation
We evaluated the MARACAS multicore scheduling framework, using the hardware plat-
form in Table 3.1.
Component   Specification
Processor   Intel Core i5-2500k quad-core
Caches      6MB L3 cache, 12-way set associative, 4 cache slices
Memory      8GB 1333MHz DDR3, 1 channel, 2 ranks, 8KB row buffers
Table 3.1: Hardware Specification
3.5.1 Background Scheduling
Firstly, we wanted to validate the effectiveness of background scheduling for improving
CPU utilization and task performance. In this experiment, two test cases were used. In
the first case (vcpu + bg), tasks were run with background scheduling enabled; in the
second case (vcpu), background scheduling was disabled. Four Mälardalen benchmarks [22]
(compress, adpcm, fir, and matmult) were started simultaneously on the same core.
Every task was assigned a VCPU with the same capacity C (ms) and a fixed period of
T = 100 ms. Unless stated otherwise, the value T = 100 ms was used throughout the
evaluations in this chapter. All benchmarks were executed for 5 minutes, after which
the counts of their instructions retired were collected. Figure 3.2 shows that background
scheduling improves progress in every case. The total instructions retired are approxi-
mately equal for each benchmark with different values of C. Greater values of C increase
the base level instructions retired when a VCPU obtains its guaranteed share of the CPU.
3.5.2 VCPU Load Balancing
With VCPU load balancing, every task has a fair share of BGT. This is shown by the fol-
lowing experiment comprising two groups. A static group used only the FIND HOST CPU
procedure from Algorithm 1 and statically mapped tasks to cores. A second VLB group
used the complete load balancing algorithm, which allowed threads to be migrated be-
tween cores. In each case, we created 16 instances of the compress benchmark that were
started at 1 second intervals. Every benchmark was assigned a VCPU capacity, C, and
duration D. The duration specified how long the task executed in minutes. In all cases,
the VCPU periods assigned to each benchmark were set to T = 100 ms.
[Figure 3.2: Background Scheduling. Three panels (C=5, C=10, C=18); x-axis: benchmarks
(compress, adpcm, fir, matmult); y-axis: Instructions Retired (×10^9); series: vcpu+bg, vcpu.]
We generated 10 sets (k0 to k9) of parameter values for each group of 16 tasks.
Parameters C and D were generated from a uniform random distribution over the ranges
1 to 14 ms and 2 to 11 minutes, respectively. The range of C values caused variation in the
foreground utilization of each VCPU, while ensuring the total utilization remained below
the RMS bound [35]. The range of D values ensured that within a 10 minute monitoring
period the system load was dynamic: some tasks terminated while others remained active.
Each of the 16 tasks in the same experimental group was assigned a randomly chosen
value of C, while the first 14 tasks were assigned randomly chosen values of D. The two other
tasks in each group (task1 and task2) were executed for the full duration of each
experimental run. The total instructions retired in background mode were recorded for task1 and
task2 over a 10 minute interval from when all 16 tasks were first assigned to cores.
Figure 3.3 shows the results of VCPU load balancing. In the static case, task1 and
task2 exhibit highly variable progress across the 10 parameter sets. In contrast, the VLB
case shows that dynamic load balancing achieves more evenly distributed progress for the
two observed tasks. We use Figure 3.4 to better illustrate the performance differences
between task1 and task2. Our results suggest that VCPU load balancing is very effective
at distributing background CPU time equally.
[Figure 3.3: Instructions Retired in Background Mode. Two panels (static, VLB); x-axis:
parameter sets k0 to k9; y-axis: Instructions Retired (×10^10); series: task1, task2.]
[Figure 3.4: Differences between task1 and task2. x-axis: parameter sets k0 to k9; y-axis:
Instructions Retired (×10^10); series: static, VLB.]
3.6 Summary
This chapter describes a real-time multicore scheduling framework called MARACAS. It
takes advantage of surplus CPU cycles on each core, after meeting the foreground timing
requirements of each VCPU, to improve system performance. A migration mechanism
is proposed, which guarantees no deadline will be missed during the whole migration
process. Based on this mechanism, we show how to balance load across cores to both
guarantee VCPU timing requirements and evenly distribute surplus CPU cycles.
The foreground/background model enables us to accommodate real-time tasks and
non-real-time tasks in a single system. It is also beneficial to applications that improve
their performance (or service quality) when given more execution time, such as in data
sampling, numerical integration and imprecise computations. Moreover, it lays down the
foundation for contention-aware resource management on multicore platforms, which will
be discussed in a later chapter.
Chapter 4
Cache Partitioning
Modern processors often devote the largest fraction of on-chip transistors to caches, mostly
the shared LLC, the performance of which is crucial for the overall processing capability
of processors, especially for running memory-bound applications. Yet, in most commer-
cial off-the-shelf (COTS) systems, only best effort service is provided for accessing the
shared LLC. Multiple processes run simultaneously on those systems, interfering with one
another on cache accesses, leading to unpredictable application performance.
Common solutions to this problem are to partition the shared LLC, either statically
or dynamically. While static partitioning provides the maximum degree of performance
isolation and simplicity, it suffers from low utilization of cache resources. Dynamic par-
titioning, on the other hand, benefits from increased resource management flexibility and
improved utilization. The average application performance using dynamic partitioning
tends to be better than with the static solution. However, the online re-partitioning
process inevitably hinders the predictable execution of real-time tasks.
In this chapter, we evaluate both static and dynamic cache partitioning,
using a software approach known as page coloring. For the former, we implemented it
inside the MARACAS framework on top of Quest (Section 4.3). For the latter, a Linux-based
design is presented in Section 4.4.
4.1 Page Coloring
Figure 4.1: Page Color Bits
Figure 4.2: Mapping Between Memory Pages and Cache Space
On most modern architectures, a shared LLC is physically indexed and set associative.
When a physical address is used in a cache lookup operation, it is divided into a Tag, an
Index and an Offset as shown in Figure 4.1. The Offset bits are used to specify the byte
offset location of the desired data within a specific cache line. Index bits select a cache
set. Tag bits are checked against the current cache line to determine a cache hit or miss.
Operating systems that manage memory at page granularity use the least significant p bits
of a physical address as a byte offset within a page of memory. These p bits typically
overlap with the least-significant Index bits, leaving the most significant indexing bits to
form the Color bits (as shown in the shaded area of Figure 4.1). Two or more pages of
physical memory addressed with different color bits are guaranteed not to map to the same
set of cache lines.
Using color bits, it is possible for an operating system to implement a color-aware
memory allocator to control the mapping of physical pages to sets of cache lines. This
is the Page Coloring technique (illustrated by Figure 4.2). Pages mapped to the same set
of cache lines are said to have the same page color. The total number of page colors is
derived from the following equation:
$$\text{number of colors} = \frac{\text{cache size}}{\text{number of ways} \times \text{page size}}$$
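The equation can be applied directly, as in the sketch below. The cache geometry in the usage note is hypothetical, and the page_color helper assumes a plainly physically indexed cache in which the color bits sit immediately above the page offset; caches that hash addresses across slices do not satisfy this assumption.

```python
def number_of_colors(cache_size, ways, page_size):
    """Number of page colors, following the equation above (sizes in bytes)."""
    return cache_size // (ways * page_size)


def page_color(phys_addr, colors, page_size):
    """Color of the page containing phys_addr.

    Assumption: the color is given by the low-order bits of the physical
    frame number, with no slice hashing involved.
    """
    return (phys_addr // page_size) % colors
```

For an assumed 4 MiB, 16-way cache with 4 KiB pages, this gives 64 colors. Physical address 0x12345000 then belongs to frame 0x12345, whose low six bits give color 5, and the next page in physical memory has color 6.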
4.2 Color-aware Memory Allocator
Figure 4.3: Memory Page Allocator
In our design, the color-aware memory page allocator maintains a large memory pool.
Inside the pool, free page frames of the same color are linked together, forming one list
per color (Figure 4.3). Upon receiving a request, the allocator checks the page color assignment
of the requester (e.g., a core or a process) and then picks a color from it in a round-robin
manner. The corresponding color list is located and the page at the head of the list is returned.
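A minimal model of such an allocator is sketched below. The class and method names are hypothetical, frames are assumed to map to colors by frame number modulo the number of colors, and skipping an exhausted color list is an assumption about behavior the text does not specify.

```python
from collections import deque


class ColorAwareAllocator:
    def __init__(self, num_colors, frames):
        # One free list per color.
        self.free = [deque() for _ in range(num_colors)]
        for frame in frames:
            self.free[frame % num_colors].append(frame)
        self.colors = {}     # requester -> ordered list of assigned colors
        self.next_idx = {}   # requester -> round-robin position

    def assign(self, requester, colors):
        self.colors[requester] = sorted(colors)
        self.next_idx[requester] = 0

    def alloc_page(self, requester):
        """Return a free frame of the next color in round-robin order."""
        colors = self.colors[requester]
        for _ in range(len(colors)):
            idx = self.next_idx[requester]
            self.next_idx[requester] = (idx + 1) % len(colors)
            if self.free[colors[idx]]:
                return self.free[colors[idx]].popleft()
        return None  # every color assigned to the requester is exhausted
```

With 16 frames, 4 colors, and a requester assigned colors 1 and 2, successive allocations alternate between the two lists and return frames 1, 2, 5, 6, and so on.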
4.3 Static Cache Partitioning
The static partitioning scheme is part of the MARACAS scheduling framework. During
system initialization, the LLC is partitioned amongst cores based on a static configuration.
Instead of equally dividing the LLC, some cores have more cache space while others have
less. The VCPU load balancing algorithm is extended so that it is aware of cache partitions
and each application’s cache requirement. Applications running on a core are allocated page
frames whose colors are restricted to the set reserved for the corresponding core. The
VCPU creation API in Quest is modified to allow applications to specify the minimum
number of page colors they need:
bool vcpu_create(uint C, uint T, uint colors);
The new cache-aware load balancing algorithm is similar to the original one. The only
difference is that, before a VCPU is migrated, the destination core is checked to see if it has
sufficient page colors. Migration only takes place if there are enough page colors to meet
the VCPU’s cache requirement. The migration process takes longer since the application’s
address space associated with the migrating VCPU has to be recolored [71]. As a result,
migration threads in MARACAS need larger CPU reservations (see Chapter 3).
4.4 Dynamic Cache Partitioning
The design of a dynamic cache partitioning mechanism is much harder than its static coun-
terpart. There are at least four issues concerning re-partitioning with page coloring.
First of all, knowing when to perform re-partitioning is non-trivial. Dynamic phase
changing behaviors¹ of applications lead to fluctuating resource demands, which may
cause poor cache utilization under static partitioning. To re-partition the shared cache, we
want to clearly capture program phase transitions on-the-fly. Even without phase changes,
when an application is just started, it is hard to determine its best cache partition size with-
out a-priori knowledge. Existing page coloring techniques either do not adaptively adjust
partitions, at least not efficiently, or they fail to identify application phase transitions.
Secondly, finding the optimal partition size to minimize cache misses for a given work-
load is difficult. If the goal is fairness or QoS, proper performance metrics have to be
defined to guide dynamic re-partitioning. For example, some researchers have attempted
to construct cache utility curves, which capture the miss rates for applications at different
cache occupancies [46, 58, 62, 69, 76], but this is typically expensive to do in a working
system.
Another issue is the significant overhead with cache re-partitioning, also called recolor-
ing. Recoloring is cumbersome. It involves allocating new page frames, copying memory
between pages and freeing old page frames as necessary. To make matters worse, naive
page selection during recoloring may cause inefficient use of newly allocated cache space.
Altogether, the benefit of dynamic partitioning can be undermined.
¹ Phase changes might be due to switching between localities such as function loops.
Lastly, in over-committed systems where there are more runnable threads than cores,
excessive recoloring operations may result from the interleaved execution of multiple
threads. A simple example would be a 2-core, 6-page-color system with two running
single-threaded processes (P1, P2) and one ready process (P3). The page colors allocated
to these three processes might be {1, 2, 3}, {4, 5, 6} and {1, 2, 3}, respectively. If a
scheduler suspends P2 and dispatches P3 to co-run with P1, two processes will be contending
for the same three page colors. This may result in significant contention on a subset of
the entire cache space. At this point, either recoloring has to be performed or performance
isolation is compromised.
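The over-commitment example can be expressed as a small check for colors shared by co-running processes. The function name and the dictionary layout are illustrative only.

```python
def contended_colors(assignment, running):
    """Return the colors owned by more than one co-running process.

    assignment: dict mapping a process name to its set of page colors
    running:    names of the processes currently executing on the cores
    """
    seen, shared = set(), set()
    for process in running:
        shared |= seen & assignment[process]
        seen |= assignment[process]
    return shared


assignment = {"P1": {1, 2, 3}, "P2": {4, 5, 6}, "P3": {1, 2, 3}}
```

While P1 and P2 run together no color is shared, but once P3 replaces P2 the two running processes contend for colors 1, 2 and 3.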
Our work tries to solve the problems associated with implementing dynamic page col-
oring in production systems. Specifically, this section describes the implementation of an
efficient page coloring framework in the Linux kernel, called COLORIS [71].
4.4.1 COLORIS Architecture
Figure 4.4: COLORIS Architecture Overview
COLORIS is comprised of two major components: a Page Color Manager and a Color-
aware Page Allocator. The Color-aware Page Allocator is capable of allocating page
frames of specific colors, as described in Section 4.2. The Page Color Manager is re-
sponsible for assigning initial page colors to processes, monitoring process cache usage
metrics, and performing color assignment adjustment according to system specific objec-
tives (e.g., fairness, QoS, or performance). An architectural overview of COLORIS is
shown in Figure 4.4.
4.4.2 Page Color Manager
The Page Color Manager manages page color resources amongst application processes
according to specific policies. In the simplest form, the Page Color Manager statically
partitions hardware caches according to various color assignment schemes. The most
intuitive approach is to strictly assign different page colors to different processes so that
they land on different parts of the shared cache. While this provides the maximum degree
of isolation, it also limits the maximum number of processes supported. Consider, for
example, a system with 4 cores, 64 page colors and 16MB memory per color – at most 64
processes can be supported with 1 color each, with the memory footprint of every process
limited to no more than 16 MB. Additionally, cache utilization is as low as 6.25% even
when all cores are active.
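The figures in this example follow from simple arithmetic, reproduced in the sketch below. The function name is hypothetical; the inputs are the values given in the text.

```python
def strict_partition_limits(cores, colors, mem_per_color_mb, colors_per_process=1):
    """Limits of strictly disjoint color assignment.

    Returns (maximum processes, footprint limit in MB, cache utilization).
    """
    max_processes = colors // colors_per_process
    footprint_mb = colors_per_process * mem_per_color_mb
    # At most one process runs per core, each touching only its own colors.
    active = min(cores, max_processes)
    utilization = active * colors_per_process / colors
    return max_processes, footprint_mb, utilization
```

For 4 cores, 64 colors and 16 MB per color, this yields 64 processes, a 16 MB footprint limit, and a utilization of 4/64 = 6.25%.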
To mitigate the problem of cache utilization, we want to make optimal color assign-
ments for application processes according to their cache demands. However, this requires
a-priori knowledge that is difficult to acquire for most applications. Consequently, we
decided to adopt a more flexible color assignment scheme that does not require offline
profiling.
In this new scheme, the cache is equally divided into N sections of contiguous page
colors on a platform of N processing cores. Each section is then assigned to a specific
core, which we call the local core of all the colors within this section. All the other cores
will be referred to as remote cores. When a new process is created, COLORIS searches
for a core with the lightest workload and assigns the whole cache section of the core to
the process. As a result, in a system with a total of C page colors, every process will be
assigned $C/N$ colors. This means that N co-running processes can fully utilize the cache.
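The initial assignment scheme can be sketched as follows. The helper names are hypothetical, and the number of processes per core stands in for "workload", which the text does not define precisely.

```python
def cache_sections(total_colors, num_cores):
    """Split C colors into N contiguous sections, one per core."""
    per_core = total_colors // num_cores
    return {core: list(range(core * per_core, (core + 1) * per_core))
            for core in range(num_cores)}


def place_new_process(sections, load):
    """Assign a new process to the least loaded core and its whole section.

    load maps a core to its current number of processes (an assumed
    stand-in for the workload measure).
    """
    core = min(load, key=load.get)
    load[core] += 1
    return core, set(sections[core])
```

With 12 colors and 4 cores, each section holds 3 colors. A new process arriving when core 1 is idle is placed on core 1 and receives colors 3, 4 and 5.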
For load balancing, process migration can be invoked, in which case the entire address space
needs to be recolored. Migration may also be applied to reduce memory bus contention,
by placing memory intensive processes on the same cores. However, since migration incurs
prohibitive overhead, it should be avoided when possible.
Though this scheme is simple, it can still potentially lead to low cache utilization for
an underutilized system and when there are dynamic program phase changes. A more
practical way would be to assign a default-sized partition first, and then gradually approach
the best size through re-partitioning (i.e., recoloring). Thus, for a page coloring technique
to be truly useful in production systems, dynamic partitioning is essential.
In COLORIS, we extended the Page Color Manager with dynamic partitioning capa-
bilities, by introducing a comprehensive recoloring mechanism. Our primary design focus
of this recoloring mechanism is to make it effective and efficient even in over-committed
systems.
Starting with the initial color assignments described earlier, the Page Color Manager
attempts to make online color assignment changes based on dynamic application behav-
iors. For applications that do not require the entire local cache section, some colors can be
reclaimed; for applications that demand more cache space, colors from other sections can
be shared. An example is illustrated in Figure 4.5. In this example, the twelve columns
represent twelve different page colors. Every block within a column is an ownership token
denoting that the process having this token (in the same row) is assigned the corresponding
color.
Figure 4.5: COLORIS Page Coloring Scheme
Since the local cache section can now be shared, limited cache interference between
processes may happen. In this case, global coordination amongst the schedulers on each
core is useful. By sharing information on page color usage and scheduling processes
accordingly, the likelihood of cache contention is reduced. This is left as future work.
We have also considered an alternative design for color assignment. Instead of re-
partitioning at the granularity of processes, we can adjust the size of a cache section for
each core. Before doing that, we need a mechanism for online application classification
in terms of cache demands. We want to move applications of the same class to the same
core in order to maintain high cache utilization. For online classification, there are several
useful areas of study [26, 46, 68, 69, 70].
Cache Utilization Monitor. The Cache Utilization Monitor is responsible for measuring
the cache usage of individual applications, and making decisions about partition
adjustments. To start with, we need to introduce the concept of hotness, which is the key for
partition adjustment. In previous work, page color hotness is defined as the aggregate
access frequency of page frames with the same color [75]. It was used to help reduce the
cost of cache re-partitioning by only recoloring pages of hot colors. However, hotness
identification requires expensive periodic page table scans, which may result in worse
performance than if hot page identification is not used at all. We conducted a preliminary
experiment on the approach proposed in that work, using the SPEC CPU2006 benchmarks.
The results in Figure 4.6 support our argument.
[Figure 4.6: Overhead of Hot Page Identification. x-axis: benchmarks (mcf, gcc, gobmk,
omnetpp, hmmer, lbm, soplex, povray, sphinx3); y-axis: Execution Time Overhead (%);
100 milliseconds sampling interval.]
Based on this observation, we decided upon a new definition of hotness. One approach
is to define it as a function of the number of free pages in the target color. A second
approach, which is more appropriate and has been adopted in COLORIS, is to use the
number of processes sharing the same color as the definition of hotness. This information
can be used to avoid heavy cache contention by assigning cold colors to newly created
processes. Due to its simplicity, we believe this is a suitable metric in practical systems.
Algorithm 2 Cache Utilization Monitor
procedure MONITOR(cmr)
    assignment = assignment_of(current)
    if cmr > HighThreshold then
        if isCold = False then
            isHot ← True
            return
        end if
        new = ALLOC_COLORS(UNIT)
        /* triggers Recoloring Engine */
        assignment += new
        isCold ← False
    else if cmr < LowThreshold then
        if isHot = True then
            isHot ← False
            isCold ← True
            victims = PICK_VICTIMS(UNIT)
            /* triggers Recoloring Engine */
            assignment −= victims
        end if
    end if
end procedure
procedure ALLOC_COLORS(num)
    new ← ∅
    while num > 0 do
        if needRemote() then
            new += pick_coldest_remote()
        else
            new += pick_coldest_local()
        end if
        num ← num − 1
    end while
    return new
end procedure
procedure PICK_VICTIMS(num)
    victims ← ∅
    while num > 0 do
        if hasRemote() then
            victims += pick_hottest_remote()
        else
            victims += pick_hottest_local()
        end if
        num ← num − 1
    end while
    return victims
end procedure
Following the re-definition of hotness, we now describe the two recoloring approaches
in COLORIS. In the first approach, ALLOC_COLORS(UNIT) is invoked to add UNIT
colors to a process, whenever it runs out of memory with its pre-existing colors. Here,
UNIT is a configurable number of colors. The second approach triggers recoloring when a
process’s cache demand exceeds its current assignment. This is determined by monitoring
the cache miss ratio for a given process, defined as cache misses with respect to total
accesses over a sample period, using hardware performance counters commonly available
on modern processors.
We set up two global cache miss ratio thresholds when the system starts up, High-
Threshold and LowThreshold. Applications with miss ratios higher than HighThreshold
are the ones needing more cache space; others with miss ratios below LowThreshold are
willing to provide vacant cache space for re-partitioning. Procedure MONITOR in Algo-
rithm 2 is used to trigger recoloring. It takes the cache miss ratio (cmr) of the current
process (current) over a period of time (PERIOD) as input. The needRemote() function
returns True if current is already using the entire local cache section. The hasRemote()
function returns True if current owns colors from remote cache sections. The
boolean variable isHot is used to indicate when the cache miss ratio of a process goes
above a specific threshold. It acts as a signal to indicate that a process needs more page
colors. Conversely, isCold is set when extra page colors are available. Both variables are
global to all processes.
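The handshake between the two flags in procedure MONITOR can be followed more easily in executable form. The sketch below mirrors the pseudocode of Algorithm 2, but the class layout is an assumption, the threshold values in the usage note are arbitrary, and alloc_colors and pick_victims are hypothetical callables standing in for the procedures of the same names.

```python
class Monitor:
    def __init__(self, high, low, unit, alloc_colors, pick_victims):
        self.high, self.low, self.unit = high, low, unit
        self.alloc_colors = alloc_colors   # stand-in for ALLOC_COLORS
        self.pick_victims = pick_victims   # stand-in for PICK_VICTIMS
        # Flags shared by all processes, as in the text.
        self.is_hot = False
        self.is_cold = False

    def monitor(self, assignment, cmr):
        """Return the updated color set of the current process."""
        if cmr > self.high:
            if not self.is_cold:
                # No colors have been released yet: signal demand and wait.
                self.is_hot = True
                return assignment
            self.is_cold = False
            return assignment | self.alloc_colors(self.unit)
        if cmr < self.low and self.is_hot:
            # Another process asked for cache space: give up UNIT colors.
            self.is_hot = False
            self.is_cold = True
            return assignment - self.pick_victims(self.unit)
        return assignment
```

In this reading, a process with a high miss ratio first raises isHot and gains nothing. A process with a low miss ratio then releases colors and raises isCold, and only on its next sample does the first process receive additional colors.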
Functions pick_coldest_remote() and pick_hottest_remote() choose a color in a remote
cache section belonging to the current process, with the smallest or largest global hotness
value, respectively. Similarly, pick_coldest_local() and pick_hottest_local() return a color
from the local cache section, owned by current, with the smallest or largest remote hotness
value, respectively. Here, we define global hotness as the number of owners of a color
running on all cores, while remote hotness is the number of owners of a color running on
remote cores. We also define a local color to be any color within a local cache section.
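The two hotness definitions amount to counting owners, as the following sketch shows. The owners dictionary, which maps each process to the core it runs on and the colors it holds, is a hypothetical representation chosen for illustration.

```python
def global_hotness(color, owners):
    """Number of owners of a color running on any core."""
    return sum(1 for _core, colors in owners.values() if color in colors)


def remote_hotness(color, owners, local_core):
    """Number of owners of a color running on cores other than local_core."""
    return sum(1 for core, colors in owners.values()
               if color in colors and core != local_core)


# Hypothetical state: process -> (core it runs on, colors it owns).
owners = {"P1": (0, {0, 1, 2}), "P2": (0, {1, 2}), "P3": (1, {2, 3})}
```

Here color 2 has a global hotness of 3, but its remote hotness with respect to core 0 is only 1, because P1 and P2 share core 0 and never run at the same time.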
The key insight of using remote hotness is that, when adding or taking away a local
color to/from a process, we do not care about other processes running on the same core.
Since they cannot run simultaneously with one another, there is no cache interference
amongst them for sharing colors. In Figure 4.5, we show all four cases where the four
functions above are called respectively. The figure illustrates a state transition during re-
coloring. Solid white blocks indicate the process owns the corresponding color before and
after recoloring. Dashed white blocks indicate the process has the color before recoloring.
Similarly, dark blocks indicate the process is assigned the color after recoloring.
In addition to the global thresholds, we also allow individual applications to provide their
own private threshold pair as part of a QoS specification. When no QoS specification is
available, the global thresholds are used.
While dynamic partitioning benefits from the information provided by cache utility
curves, such information is not easily obtainable in a running system. Without this in-
formation, COLORIS’ objective is to enhance QoS by attempting to maintain application
miss ratios below HighThreshold, given sufficient capacity. Sufficient capacity here means
there are some other applications, with miss ratios lower than LowThreshold, that are able
to provide enough free colors. If there exists a cache partitioning scheme that guarantees
no application’s miss ratio exceeds HighThreshold, COLORIS attempts to achieve such
a guarantee.
Notice that the Cache Utilization Monitor takes into account both cache miss ratios
and the number of cache references. When the number of references is small, miss ratio
for a process is set to zero, to indicate the cache is not being used significantly. Likewise,
in situations where frequent page recoloring is not beneficial to overall performance, it is
possible to disable recoloring for individual processes.
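The monitoring rule described above can be sketched in C as follows. The cutoff MIN_REFERENCES, the enum names and the function signatures are hypothetical; the source does not state the actual reference-count cutoff or how the state is encoded.

```c
#include <stdint.h>

typedef enum { CACHE_COLD, CACHE_NEUTRAL, CACHE_HOT } cache_state;

/* Hypothetical cutoff: with fewer references than this in a sample period,
 * the cache is treated as not being used significantly. */
#define MIN_REFERENCES 1000u

/* Miss ratio (in percent) over one sample period, forced to zero when the
 * number of references is small. */
double miss_ratio_pct(uint64_t misses, uint64_t references)
{
    if (references < MIN_REFERENCES)
        return 0.0;
    return 100.0 * (double)misses / (double)references;
}

/* Compare a miss ratio against a (LowThreshold, HighThreshold) pair. */
cache_state classify(double cmr, double low_threshold, double high_threshold)
{
    if (cmr > high_threshold)
        return CACHE_HOT;      /* needs more cache space */
    if (cmr < low_threshold)
        return CACHE_COLD;     /* can give up cache space */
    return CACHE_NEUTRAL;
}
```

A process whose private thresholds are (0, 100) can never be classified as hot or cold here, which is one way to express disabling recoloring for that process.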
Recoloring Engine. The Recoloring Engine performs two tasks: (1) shrinkage of color
assignments, and (2) expansion of color assignments. Lazy recoloring [34] is adopted
for shrinking a color assignment. Basically, we look for pages of specific colors that are
going to be taken away and clear the present bits of their page table entries (and flush TLB
entries). At the same time, an unused bit of every page table entry whose present bit is
cleared is set to identify the page as needing recoloring. However, we do not set a target
color for each page to be recolored. In the page fault handler, we allocate a new page from
the page allocator and copy the content of the old page to it. Since round-robin is used in
page allocation, as described in Section 4.2, pages to be recolored are eventually spread
out uniformly across the cache partition assigned to the process.
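The page table manipulation behind lazy recoloring can be sketched on a 32-bit x86 page table entry. Bit 0 is the architectural present bit; the choice of bit 9 as the recoloring marker and all function names are assumptions for illustration, and the TLB flush and page copy are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT    (1u << 0)  /* x86 present bit */
#define PTE_SW_RECOLOR (1u << 9)  /* assumed marker in a software-available bit */

/* Shrinkage: make the next access fault and tag the page for recoloring. */
uint32_t pte_mark_for_recolor(uint32_t pte)
{
    pte &= ~PTE_PRESENT;
    pte |= PTE_SW_RECOLOR;
    return pte;
}

/* Page fault handler check: is this a fault on a page awaiting recoloring? */
bool pte_needs_recolor(uint32_t pte)
{
    return !(pte & PTE_PRESENT) && (pte & PTE_SW_RECOLOR);
}

/* After copying the old page into a newly allocated frame, point the entry
 * at the new frame, clear the marker and restore the present bit. */
uint32_t pte_finish_recolor(uint32_t pte, uint32_t new_frame)
{
    pte &= 0xFFFu;                 /* keep the flag bits, drop the old frame */
    pte &= ~PTE_SW_RECOLOR;
    pte |= (new_frame << 12) | PTE_PRESENT;
    return pte;
}
```

No target color appears anywhere in this sketch, in line with the text: the new frame simply comes from whatever color the round-robin allocator hands out next.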
Assignment expansion is more complicated than shrinkage, mainly because
it is difficult to figure out which pages should be moved to the new colors. Ideally,
we want memory accesses to be redistributed evenly across the new cache partition. In
COLORIS, we currently consider two selection policies.
1) Selective Moving – In this policy, we take the set associativity of the cache into con-
sideration. Assuming an n-way set associative cache, we know that one page color of
the cache can hold up to n pages at the same time. We therefore scan the whole page
table of the current process and recolor one in every n + 1 pages of the same color, try-
ing to minimize cache evictions when big data structures with contiguous memory are
being accessed. These pages will be immediately moved to the newly assigned colors in a
round-robin manner.
2) Redistribution – When the expansion is triggered, we first go through the entire page
table of the current process and clear the access bit of every entry (flushing TLB entries as well).
The access bit is a special bit in each page table entry on x86 platforms; whenever the page in
that entry is accessed, this bit is automatically set by hardware. After a fixed time window
WIN, we scan the page table again and find all entries with the access bit set. Pages
in those entries have all been accessed during the time window. Since it is hard to re-
balance the cache hotness by moving selected pages, we let the process itself perform the
redistribution. That is, all accessed pages are recolored using lazy recoloring as mentioned
above.
Comparing these two policies, Selective Moving is simpler and more light-weight. How-
ever, Redistribution is likely to be more powerful for re-balancing cache hotness. Their
effectiveness is evaluated later.
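The page selection step of Selective Moving can be sketched as below. The flat array of per-page colors stands in for a page table scan, and the choice of moving exactly the (n + 1)-th page seen of each color is one possible reading of "one in every n + 1 pages"; both are assumptions for illustration.

```c
#include <stddef.h>

#define MAX_COLORS 64

/* page_color[i] is the current color of page i and is updated in place.
 * `ways` is the cache associativity n. Every (n + 1)-th page seen of a
 * given color is moved to the newly assigned colors in round-robin order.
 * Returns the number of pages recolored. */
size_t selective_moving(unsigned *page_color, size_t num_pages, unsigned ways,
                        const unsigned *new_colors, size_t num_new)
{
    size_t seen[MAX_COLORS] = {0};
    size_t next = 0, moved = 0;

    if (num_new == 0)
        return 0;
    for (size_t i = 0; i < num_pages; i++) {
        unsigned c = page_color[i];
        if (c >= MAX_COLORS)
            continue;
        seen[c]++;
        if (seen[c] % (ways + 1) == 0) {
            page_color[i] = new_colors[next];
            next = (next + 1) % num_new;
            moved++;
        }
    }
    return moved;
}
```

With a 16-way cache, a process holding 34 pages of a single color would have two of them moved under this reading, leaving 16 pages per group that can coexist in the sets of that color.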
4.5 Experimental Evaluation
4.5.1 Static Cache Partitioning
In this experiment, we evaluated the static cache partitioning design as part of the load
balancing algorithm. The hardware platform in use is described in Table 3.1 and provides
a total of 32 allocatable page colors.
We set the memory pool inside MARACAS’ cache-aware memory allocator to be
1 GB. Then we devised several programs to observe the effects of shared caches on task
execution. Our mwalk program writes to elements of a 1 MB array in a pseudo-random
order to eliminate the benefits of hardware cache prefetching. Another program, hog,
repeatedly scans a 2 MB array of integer elements in sequential order.
Listing 4.1: mwalk
int array[SIZE];
while (TRUE) {
    for (j = 0; j < 1025; j++) {
        c = j;
        for (i = 0; i < SIZE; i += 1025) {
            if (i + c >= SIZE) break;
            array[i + c] += 2;
            c = (c + 1) % 1025;
        }
    }
}
We first started 15 hog tasks, each with parameters C and D, at 1 second intervals.
Values for C and D were randomly generated in the same way as for the experiment from
Section 3.5.2. After the last hog was activated, we executed an instance of mwalk for 10
minutes on a VCPU with C = 6 ms and T = 100 ms. The LLC miss ratio was then
recorded for mwalk’s execution.
The experiment from Section 3.5.2 was repeated 10 times with different random sets
of parameters (t0 to t9) applied to the 15 instances of hog. Three cases were considered:
share, c0 and c6. In the share case, the original VCPU load balancing algorithm was tested on
a Quest system without page coloring. In the c0 and c6 cases, cache-aware load balancing
was used while page coloring was enabled, with the LLC partitioned in the page color
ratio 4 : 4 : 10 : 14 across the four cores. The mwalk task requested 0 page colors in the c0
case, and 6 page colors in the c6 case. c0 represents the case where mwalk does not specify
its cache requirement.
Figure 4.7 shows the LLC miss ratio for mwalk in all 10 experimental runs. In the
presence of inter-core cache interference, mwalk suffers a very high cache miss ratio as
seen by the share cases. Even with cache partitioning, the c0 case does not always perform
well (e.g., for experimental runs t3, t4 and t6). This happens because a migration thread
may place mwalk on a core with a cache partition that is smaller than its working set,
causing self-conflict misses. In general, the c6 case performs best, although cache misses
still occur due to context-switching between mwalk and other tasks on the same core.
When a new task executes on a given core, it may evict cache lines for the previously
running task.
To show the effects of context-switching between tasks on the same core, we ran an-
other set of experiments similar to the above. This time, we varied the working set of
mwalk using three different array sizes, 0.5 MB, 1 MB and 2 MB. In each case, the cache
requirement to accommodate the working set of mwalk was passed to the kernel. Results
from Figure 4.8 show that for a larger working set, the task may consume all its VCPU
budget before finishing the scan of the entire array, only to see its cache contents evicted
by another task when resuming execution.

Figure 4.7: Cache-aware Load Balancing (y-axis: LLC miss ratio (%); x-axis: runs t0 to t9; series: share, c0, c6)
4.5.2 Dynamic Cache Partitioning
We conducted a series of experiments to evaluate the performance and effectiveness of
the COLORIS page coloring framework. We implemented a prototype system for a 32-bit
Ubuntu 12.04 Linux OS with kernel version 3.8.8. All the experiments were conducted
using the SPEC CPU2006 benchmark suite on a Dell PowerEdge T410 machine with a
quad-core Intel Xeon E5506 2.13 GHz processor and 8 GB of RAM. A total of 4 MB
16-way set-associative L3 cache was shared amongst the 4 cores of the processor. As a
result, there were 64 page colors available in the system with 4 KB page size. Since we
set the size of the memory pool in our Color-aware Page Allocator to be 1 GB, each color
provided up to 16MB of memory for application use.
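The color count and per-color capacity quoted above follow from standard page coloring arithmetic, sketched here in C. The function names are illustrative and not part of COLORIS.

```c
#include <stdint.h>

/* Number of page colors in a physically indexed cache:
 * cache size / (associativity * page size). */
uint64_t num_page_colors(uint64_t cache_bytes, unsigned ways, uint64_t page_bytes)
{
    return cache_bytes / ((uint64_t)ways * page_bytes);
}

/* Memory available per color when a pool is split evenly across colors. */
uint64_t bytes_per_color(uint64_t pool_bytes, uint64_t colors)
{
    return pool_bytes / colors;
}
```

For the platform described, a 4 MB 16-way cache with 4 KB pages gives 64 colors, and a 1 GB pool gives 16 MB per color.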
Figure 4.8: mwalk With Different Working Set Sizes (y-axis: LLC miss ratio (%); x-axis: runs t0 to t9; series: 0.5 MB, 1 MB, 2 MB)
4.5.2.1 Allocator Implementation Overhead
In the first experiment, we compared the performance of the original Linux memory allo-
cator with COLORIS’ page allocator to evaluate the efficiency of our implementation. In
Linux, we ran each SPEC benchmark alone in the system without co-runners and recorded
their execution times. In COLORIS, we assigned all colors, effectively the entire cache,
to every program and ran them with the same configuration. In the latter case, COLORIS
picked page colors in round-robin manner for each application page request. Table 4.1
shows that our page allocator achieved performance similar to that of the original Linux Buddy
System.
4.5.2.2 Effectiveness of Page Coloring
The benefit of page coloring is the performance isolation between running processes. In
the following experiments, we tried to evaluate the effectiveness of page coloring for
Benchmark     Linux (sec)   COLORIS (sec)
gobmk              736            738
gcc                507            495
libquantum         773            778
bzip2              961            956
sphinx3            864            866
omnetpp            476            480
povray             366            367
hmmer              849            850
h264ref           1131           1128
mcf                450            444
soplex             429            428
leslie3d           797            799
gromacs           1386           1384
namd               790            792
milc               671            679
gamess            1503           1510
zeusmp             842            841
soplex             429            428
tonto              977            977
wrf               1328           1323
calculix          1517           1512
sjeng              871            868
astar              777            773
perlbench          569            569
cactusADM         1796           1788
GemsFDTD           725            728
lbm                533            535
Table 4.1: Allocator Implementation Overhead
applications of different characteristics. We first selected three groups of benchmarks:
{sphinx3, leslie3d, libquantum}, {leslie3d, h264ref, gromacs}, and {povray, h264ref, gro-
macs}. According to their memory access intensity, we call them the heavy background
workload (H), the medium background workload (M) and the light background workload
(L), respectively. Three foreground benchmarks were then selected: omnetpp, gobmk and
hmmer. omnetpp has a very large memory footprint and is memory-intensive. gobmk, with
a small footprint, is less memory-intensive but cache-sensitive. hmmer is similar to gobmk
except for cache-sensitivity, due to its small working set.
Figure 4.9: Execution Time of Foreground Workload (y-axis: execution time (s); x-axis: foreground benchmarks gobmk, hmmer, omnetpp; series: H, H+P, M, M+P, L, L+P, S, F)
For every experiment, we chose one foreground workload and one background work-
load. The three background programs were started first, each pinned to a different core.
The foreground workload started execution after a delay of one second on the fourth core.
The background workload remained executing during the entire life of the foreground
workload. We denote experiments with page coloring as P experiments, and experiments
with heavy, medium, and light background workloads as H, M, and L experiments, respec-
tively. For example, an experiment running foreground and heavy background workloads
together with page coloring is labeled as H + P. In any experiment with page coloring,
every workload was assigned 16 page colors.
We also had two control groups: the first group, S, ran only the foreground workloads,
with page coloring limiting access to a quarter of the last-level cache; the second group, F,
also ran only the foreground workloads but without page coloring, thereby allowing access
to the full cache. We conducted all experiments three times and recorded the average
execution time and LLC miss ratio of each foreground workload. The results are shown in
Figure 4.9 and Figure 4.10.

Figure 4.10: Cache Miss Ratio of Foreground Workload (y-axis: LLC miss ratio (%); x-axis: foreground benchmarks gobmk, hmmer, omnetpp; series: H, H+P, M, M+P, L, L+P, S, F)
As can be seen in Figure 4.9 (H and H + P), the cache-sensitive workload gobmk
experienced a performance gain of as much as 13% under the interference of a heavy
background workload with page coloring. For non-cache-sensitive workload hmmer, page
coloring was less effective.
With ideal performance isolation, given the same cache partition size, a foreground
workload running with background workload should exhibit the same behavior as when it
is running alone in the system (which is represented by the control group S). As Figure 4.9
shows, with the presence of interference, COLORIS was able to constrain the performance
variations of gobmk, hmmer and omnetpp to within 5%, 0.5% and 55% respectively, as
compared to S. Without page coloring, the variations can be as much as 21%, 2% and
90%, as compared to F. For omnetpp, a 39% reduction in variation was achieved.
Notice that omnetpp suffered a higher miss ratio when page coloring was enabled (comparing
L and L+P). This is mainly caused by the high self-conflict from limiting its memory
accesses to just 1/4 of the entire cache (as shown in Figure 4.10).
4.5.2.3 Recoloring Evaluation
Configuration   LowThreshold (%)   HighThreshold (%)
C1              30                 65
C2              30                 75
C3              0                  100
C4              (40, 0, 30, 40)    (80, 60, 65, 80)
C5              -                  -
C6              30                 75
Table 4.2: Experimental Configurations
To demonstrate the benefit of recoloring, we designed a set of experiments with four
benchmarks, povray, tonto, omnetpp and gamess. Among them, only omnetpp is memory-
intensive and cache-hungry, meaning the more cache space there is, the lower its cache
miss ratio will be. If the other three benchmarks are similar to omnetpp in terms of cache
demand, then there will be no possibility for re-partitioning. Configurations for different
experiments are listed in Table 4.2. In all the configurations WIN was set to 3ms (except
for C6), UNIT to 4, and PERIOD to 5 seconds. (40, 0, 30, 40) in C4 means assigning these
four LowThresholds to povray, tonto, omnetpp and gamess, respectively, instead of using
global LowThreshold. Note that the thresholds of configuration C3 prevent page recoloring
altogether. Configuration C5 was used for the special case in which every benchmark ran
alone with full cache space access.
All the experiments were carried out by fixing each benchmark to a different core.
The four benchmarks were started at the same time, each with 16 non-overlapping page
colors. Each experiment was run for more than an hour. After the first minute, we set
up the performance counters to collect LLC miss ratios and the numbers of instructions
retired at user level (to exclude kernel overhead) over a measured 60-minute interval.
At the same time, the Cache Utilization Monitor was also enabled. For these experiments,
the LLC was fully utilized, which means colors need to be removed from one application
before they could be re-assigned to another. The final results were produced by repeating
the same experiments three times and taking the average, which is shown in Figure 4.11,
Table 4.3 and Table 4.4.
Policy            Benchmark   C1      C2 (C6.WIN=3)   C3      C4      C6.WIN=1   C6.WIN=5
Selective Moving  povray      100.1   101.8           102.2   100.7   -          -
                  tonto        88.9    90.7            91.6    91.4   -          -
                  omnetpp      46.2    43.6            42.2    45.8   -          -
                  gamess      120.9   122.8           123.1   119.2   -          -
                  total       356.1   358.9           359.1   357.1   -          -
Redistribution    povray      100.2   101.7           -       100.7   101.7      101.8
                  tonto        90.0    91.4           -        91.4    91.2       89.9
                  omnetpp      46.0    44.1           -        46.0    43.9       43.8
                  gamess      121.1   122.3           -       119.3   121.3      122.5
                  total       357.3   359.5           -       357.4   358.1      358.0
Table 4.3: Instructions Retired in One Hour (×10^11)
Policy            C1      C2 (C6.WIN=3)   C4      C6.WIN=1   C6.WIN=5
Selective Moving  10746    3845            8791   -          -
Redistribution    32749   22243           33547   6476       30174
Table 4.4: Recoloring Overhead (Total # Pages Recolored)
With static cache partitioning (C3), the miss rate of omnetpp reached 77.8%. With
the help of recoloring (C1 and C2), the miss rate was successfully limited to below High-
Threshold, thus meeting the system default QoS requirement. By comparing miss ratios
of omnetpp between C1 and C2, we see that the degree of improvement for cache-hungry
applications depends on how we set up the thresholds. As shown in Table 4.3, despite
the recoloring overhead, omnetpp achieved a 3.3% to 9.4% performance gain, showing
that our approaches are practical. povray managed to keep the same miss ratio with fewer
cache lines since its working set is small enough to fit into the private L2 cache. Although the
other two benchmarks, tonto and gamess, were negatively affected by reduced cache sizes,
overall system performance remained roughly the same.

Figure 4.11: LLC Miss Ratios (one panel per configuration; y-axis: LLC miss ratio (%); x-axis: povray, tonto, omnetpp, gamess; panels C1, C2 and C4 compare Selective Moving with Redistribution; panels C3 and C5 show a single series; panel C6 compares WIN = 1 ms, 3 ms and 5 ms)
Instead of using system default QoS specification, reasonable individual specifications
can also be satisfied by the COLORIS framework as shown in C4. By setting LowThresh-
old to 0 for tonto, we avoided decreasing its cache space, and limited its miss ratio to below
40%. The results show that recoloring is effective for guaranteeing QoS requirements of
individual applications.
Table 4.4 lists the total number of pages recolored in all the experiments. It can be seen
that Redistribution incurred higher overhead as compared to Selective Moving. However,
Redistribution achieved a slightly better overall system performance if we look at the total
instructions retired in Table 4.3. The results indicate that Redistribution is more effective
in utilizing newly assigned cache space.
Following this observation, we then evaluated the impact of window size WIN on the
effectiveness of the Redistribution policy in C6. omnetpp was the only application that
received an expanded cache space during the experiment. Its performance under differ-
ent window sizes stayed the same, which suggests that larger window size may not help
improve performance, as recoloring overhead increases as well.
Finally, Table 4.5 shows the stable color assignments for each repetition of the above
experiments. Within each 4-element tuple, the numbers are the sizes of color assignments
for povray, tonto, omnetpp and gamess respectively. Across multiple repetitions of the
same experiment, stable-state color assignments are fairly consistent. This indicates the
COLORIS recoloring mechanism is stable.
Policy            Run      C1           C2 (C6.WIN=3)   C4           C6.WIN=1       C6.WIN=5
Selective Moving  EXP(1)   (4,8,44,8)   (8,16,28,12)    (4,16,40,4)  -              -
                  EXP(2)   (4,12,40,8)  (16,12,24,12)   (4,16,40,4)  -              -
                  EXP(3)   (4,8,44,8)   (12,12,28,12)   (4,16,40,4)  -              -
Redistribution    EXP(1)   (4,12,40,8)  (12,16,28,8)    (4,16,40,4)  (8,12,28,16)   (16,8,28,12)
                  EXP(2)   (4,12,40,8)  (8,16,28,12)    (4,16,40,4)  (12,16,28,8)   (16,12,28,8)
                  EXP(3)   (4,12,40,8)  (12,12,28,12)   (4,16,40,4)  (16,16,28,4)   (12,12,28,12)
Table 4.5: Stable-State Color Assignments (povray, tonto, omnetpp, gamess)
Figure 4.12: Over-Committed System Performance (two panels: instructions retired (×10^11) and LLC miss ratio (%); x-axis: gobmk, gromacs, h264ref, povray, tonto, leslie3d, omnetpp, gamess; series: C7, C8, C9)
4.5.2.4 Performance in Over-Committed Systems
In these experiments, we tried to evaluate the design of COLORIS for over-committed
systems. We first created four groups of SPEC benchmarks (G1, G2, G3, G4): {gobmk,
sphinx3}, {gromacs, leslie3d}, {h264ref, omnetpp}, {povray, gamess}. Programs were
started at ten-second intervals in the following order: gobmk, gromacs, h264ref, povray,
sphinx3, leslie3d, omnetpp, gamess. One minute after the last benchmark program was
started, we enabled hardware performance counters and the Cache Utilization Monitor
(if used). All experiments were then run for an hour, at the end of which results were
collected.
When running experiments in COLORIS, benchmark programs received their initial
color assignments in a round-robin fashion, since all colors had zero hotness at the be-
ginning. According to their launching order, programs in group Gx were assigned the
cache section belonging to core x. After assignment, they were pinned automatically to
those cores. For experiments in Linux, we pinned each group onto the same core similar
to the COLORIS experiments. By doing this, we eliminated the difference of memory
bus contention between the two sets of experiments. The same experiments were carried
out with three different configurations, C7, C8 and C9. In C7, recoloring used the Re-
distribution policy, WIN = 3ms, UNIT = 4, PERIOD = 5s and system-wide thresholds
were set to (LowThreshold, HighThreshold) = (20%, 75%). An individual QoS specification
(0%, 100%) was also provided to leslie3d, which essentially disabled recoloring for it. The
reason is that leslie3d is an LLC-thrashing [26] application that does not benefit from
assignment expansion (known through offline profiling). In C8, a static partitioning scheme
was used. For C9, applications were run in Linux without page coloring. Results from
Figure 4.12 show that COLORIS performs no worse than Linux in heavily over-committed
cases, while at the same time guaranteeing QoS for applications.
4.6 Summary
In this chapter, we have studied both the static and dynamic cache partitioning approaches
for improving performance isolation on multicore platforms. For the former, the LLC is
statically partitioned amongst cores in order to get strict real-time performance. A cache-
aware load balancing algorithm is developed, along with an API to specify cache require-
ment. For the latter, a Cache Utilization Monitor measures application cache usage on-
the-fly and triggers re-partitioning to improve QoS for individual applications. To achieve
efficient cache re-partitioning, two page selection policies are described: Selective Moving
and Redistribution. When applied to over-committed systems, our system tries to main-
tain good cache isolation amongst processes, attempting to minimize cache interference
by carefully managing page colors in the Page Color Manager.
Chapter 5
Memory Bus Management
While most people have been focusing on managing shared cache contention on multicore
platforms, some works point out that memory bus (or memory controller) contention could
be one of the dominant factors for application performance degradation [78, 42]. Heavy
contention on the bus often leads to unpredictable task completion time, posing significant
challenges to real-time system design.
In general, there are three ways to handle memory bus contention. One way is to
control the memory access rate from the source [74]. This is achieved through scheduling
algorithms that slow down memory-intensive threads’ execution. For the second approach,
memory bus bandwidth needs to be partitioned amongst CPUs so that isolation can be built
up [33]. But unlike the space-partitioning scheme from cache management, it partitions
time. The last approach tries to reduce the response time of memory accesses by partition-
ing DRAM banks [73].
Our method for controlling bus contention belongs to the first approach. In this chapter,
we will introduce a memory-aware scheduling algorithm, based on the foreground/background
scheduling model established before. It is implemented as a part of the MARACAS
scheduling framework [72] and has successfully demonstrated the power of the scheduling
model.
5.1 Bus Performance Metric
Memory-aware scheduling considers the effects of memory accesses when ordering the
execution of a set of tasks. Memory accesses on one core might incur delays caused by
concurrent accesses on another core, because of contention on a shared memory bus. One
approach to address this problem is to regulate the rate of off-chip memory references (i.e.,
those missing in a cache), so that each core cannot exceed a pre-defined threshold [74].
However, there are additional problems that affect the throughput of memory requests.
DRAM bank-level parallelism leads to significant variations in the throughput of memory
traffic, depending on whether memory accesses are to the same or separate banks [73].
While separate banks are accessible in parallel, requests to the same bank are serialized.
Similarly, servicing sequential accesses is faster than servicing random accesses within the
same bank, due to row buffering in DRAM. Interleaved accesses to separate rows within
the same bank impact row locality, leading to repeated pre-charging of a row buffer. These
factors combine to make it difficult to determine the correct memory access rate threshold
for avoiding excessive bus contention. If we assume separate accesses map to different
banks, we may set the threshold too high and it may never be reached even when the bus
is heavily contended. Similarly, if we pessimistically assume all accesses are to the same
DRAM bank and set the threshold too low, we may trigger memory access throttling
(slowing down program execution to reduce memory accesses) when the bus is not
heavily contended.
A lower memory access rate does not necessarily mean lower contention. Consider the
case where two tasks, task1 and task2, are allowed a budgeted number of memory requests
every period, T [74]. Figure 5.1 shows the situation where the tasks run concurrently
on separate cores until t, when they exhaust their request budget. Both tasks are then
suspended until T. Assuming uniform memory accesses in time, each task reduces its
memory access rate by a factor of $(T - t)/T$ in the interval [0, T]. However, because the tasks
execute at exactly the same time, a reduction in memory access rate does not reduce the
contention experienced in the interval [0, t]. We call this phenomenon the Sync Effect,
which occurs when two or more cores have overlapping idle times due to the suspension
of tasks. The Sync Effect leads to a drop in CPU and bus utilization without improving
task performance.
Figure 5.1: Sync Effect
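The rate reduction behind the Sync Effect can be written out explicitly. The symbol m, denoting the per-period request budget of one task, is introduced here only for illustration and does not appear in the original text.

```latex
\[
r_{\text{active}} = \frac{m}{t}, \qquad
r_{\text{avg}} = \frac{m}{T}
             = r_{\text{active}} \left( 1 - \frac{T - t}{T} \right)
\]
```

The average rate over [0, T] drops by the fraction (T − t)/T, yet within [0, t] both tasks still issue requests at the rate m/t, so the contention they impose on each other in that interval is unchanged.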
To eliminate memory contention requires complete knowledge of access patterns from
all cores and Direct Memory Access (DMA) devices, including how they interleave inside
the memory controller. Monitoring system wide memory traffic by only looking at each
core’s requests, either through cache miss events or off-core events [25], is insufficient.
This would not detect DMA requests or accesses to a memory domain from a remote
node in a Non-Uniform Memory Access (NUMA) system. Fortunately, some multicore
architectures, such as the Intel Xeon, now provide monitoring events for all types of DRAM
traffic.
Our method to deal with memory contention neither relies on memory access rates
nor ignores traffic outside cores. It measures memory traffic by looking at the average
latency to service memory requests. Unlike a rate-based metric, latency is directly related
to application performance. Intel Sandy Bridge and more recent processors provide
two uncore performance monitoring events: UNC_ARB_TRK_REQUEST.ALL and
UNC_ARB_TRK_OCCUPANCY.ALL. The first event counts all memory requests going
to the memory controller request queue (requests), and the second one counts cycles
weighted by the number of pending requests in the queue (occupancy). For example, in
Figure 5.2, request r1 arrives at time 0 and finishes at time 2. r2 and r3 both arrive at time 1
and complete at time 5. At the end of this 5-cycle period, occupancy = 10 and requests = 3.
We derive the average latency (cycles) per request as follows:
$$\text{latency} = \frac{\text{occupancy}}{\text{requests}}$$
Figure 5.2: Example of Memory Controller Occupancy and Requests
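The latency metric can be sketched in C using the example from Figure 5.2. Reading the two uncore counters is platform-specific and not shown; the helper that derives occupancy from arrival and completion times is a hypothetical stand-in for the hardware counter.

```c
#include <stddef.h>
#include <stdint.h>

/* Occupancy as the hardware would accumulate it: each pending request
 * contributes one count per cycle, so the total is the sum of the
 * per-request residence times. */
uint64_t occupancy_from_intervals(const uint64_t *arrive,
                                  const uint64_t *finish, size_t n)
{
    uint64_t occupancy = 0;
    for (size_t i = 0; i < n; i++)
        occupancy += finish[i] - arrive[i];
    return occupancy;
}

/* Average latency in cycles per request; 0.0 when no request was seen. */
double avg_mem_latency(uint64_t occupancy, uint64_t requests)
{
    if (requests == 0)
        return 0.0;
    return (double)occupancy / (double)requests;
}
```

For r1 = [0, 2], r2 = [1, 5] and r3 = [1, 5], the occupancy is 2 + 4 + 4 = 10 and the average latency is 10/3, roughly 3.3 cycles per request.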
5.2 Memory-aware Scheduling
MARACAS is configured with a request latency threshold, MAX_MEM_LAT. The threshold
is global rather than per-core for comparison with the observed overall bus traffic.
Memory throttling commences when the observed average latency exceeds or equals the
threshold, as shown in Algorithm 3 (line 9). A memory monitoring thread assigned to a
dedicated VCPU periodically updates the average latency via a MONITOR procedure. The
period is set to a constant MEM_PERIOD.
When throttling is applied in MARACAS, background scheduling on the correspond-
ing core is temporarily disabled, and the core goes idle. This reduces contention on the
memory controller, shared cache lines and Miss Status Holding Registers [65]. While the
Sync Effect is still possible, MARACAS is able to detect the contention and apply further
throttling. When one or more cores throttle their usage of background time, other cores in
foreground mode are able to make greater progress due to the reduced contention.
Instead of simply disabling the cores with the most traffic, we adopt a proportional
throttling scheme. Suppose the ith core in a set of n cores generates $m_i$ requests to the
memory controller in time $t_i$, which yields a memory access rate $r_i = m_i / t_i$. Larger values
of $r_i$ cause a greater degree of memory throttling on the ith core. Global variable
num_throttle is used to tell the scheduling sub-system how many cores (referred to as
cpus in Algorithm 3) need to be throttled. When the core-local scheduler is switched to
background mode, it calls function IS_BG_SCHED, which returns TRUE if a task is able
to run. count keeps track of how many cores should be allowed to run in background
mode if the current core is allowed to do so as well. bg_vtime[i] on core i is the product
of the background execution time consumed in the current period (MEM_PERIOD) and
the weight, mem_weight[i], which is generated inside CALC_WEIGHTS. A higher value of
bg_vtime[i] relative to those on other cores increases the likelihood of core i being throttled.
Once memory throttling is activated, a new question is how long it should be applied.
While average memory request latency is used to determine bus contention, it is not nec-
essarily the best metric to identify a reduction in memory bandwidth demand. This is
because a reduction in memory access latency could be due to throttling background time,
rather than a drop in memory demand from the running VCPUs. Let $R_{cur} = \sum_{i=1}^{n} r_i$ in
the current period, and $R_{high}$ be the largest $R_{cur}$ since throttling began. In MARACAS,
if $R_{cur} \le R_{high} \times$ IDLE_MEM (IDLE_MEM is a configurable parameter between 0 and
1), the system-wide memory access intensity is considered to be lower than before and
throttling is reduced gradually (by decreasing num_throttle, Algorithm 3 line 14).
Algorithm 3 Memory-aware Scheduling
1: procedure MONITOR
2:     /* update all m_i */
3:     /* clear all bg_vtime[i] */
4:     /* UNC_ARB_TRK_REQUEST.ALL */
5:     requests = get_requests()
6:     /* UNC_ARB_TRK_OCCUPANCY.ALL */
7:     occupancy = get_occupancy()
8:     latency = occupancy / requests
9:     if latency >= MAX_MEM_LAT and
10:        num_throttle < num_cpus then
11:        num_throttle++
12:    else if IS_LESS_TRAFFIC() and
13:        num_throttle > 0 then
14:        num_throttle--
15:    end if
16:    if num_throttle > 0 then
17:        CALC_WEIGHTS()
18:    end if
19: end procedure
20: procedure IS_LESS_TRAFFIC
21:    if R_cur <= R_high × IDLE_MEM then
22:        return TRUE
23:    else
24:        return FALSE
25:    end if
26: end procedure
27: procedure CALC_WEIGHTS
28:    for all cpu do
29:        mem_weight[cpu] = m_cpu / R_cur
30:    end for
31: end procedure
32: procedure IS_BG_SCHED
33:    if num_throttle <= 0 then
34:        return TRUE
35:    end if
36:    if num_throttle >= num_cpus then
37:        return FALSE
38:    end if
39:    count = 0
40:    self = get_local_cpu_id()
41:    for all cpu do
42:        if cpu != self and
43:            bg_vtime[cpu] <= bg_vtime[self] then
44:            count++
45:        end if
46:    end for
47:    if count < num_cpus - num_throttle then
48:        return TRUE
49:    else
50:        return FALSE
51:    end if
52: end procedure
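The throttling decisions in Algorithm 3 can be sketched in C as follows. This is a minimal illustration rather than the MARACAS source: the struct layout, the fixed core count, the IDLE_MEM value of 0.8, the guards against division by zero, and passing the local core ID and the per-core m values as arguments are all assumptions made for the example. The bookkeeping on lines 2 and 3 of the algorithm (updating m_i and clearing bg_vtime) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS    4       /* assumed core count */
#define MAX_MEM_LAT 180.0   /* latency threshold in cycles (example value) */
#define IDLE_MEM    0.8     /* assumed; the text only requires 0 < IDLE_MEM < 1 */

typedef struct {
    int    num_throttle;          /* number of cores to throttle */
    double bg_vtime[NUM_CPUS];    /* weighted background time per core */
    double mem_weight[NUM_CPUS];  /* weights produced by calc_weights() */
    double r_cur;                 /* R_cur for the current period */
    double r_high;                /* largest R_cur since throttling began */
} mem_sched;

static bool is_less_traffic(const mem_sched *s) {
    return s->r_cur <= s->r_high * IDLE_MEM;
}

static void calc_weights(mem_sched *s, const double m[NUM_CPUS]) {
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        /* zero weight when R_cur is zero is an assumption, not from the text */
        s->mem_weight[cpu] = (s->r_cur > 0.0) ? m[cpu] / s->r_cur : 0.0;
}

/* Lines 4-18 of MONITOR: adjust num_throttle from the measured latency. */
void monitor(mem_sched *s, uint64_t requests, uint64_t occupancy,
             const double m[NUM_CPUS]) {
    double latency = requests ? (double)occupancy / (double)requests : 0.0;
    if (latency >= MAX_MEM_LAT && s->num_throttle < NUM_CPUS)
        s->num_throttle++;
    else if (is_less_traffic(s) && s->num_throttle > 0)
        s->num_throttle--;
    if (s->num_throttle > 0)
        calc_weights(s, m);
}

/* IS_BG_SCHED: may core `self` run background work in this period? */
bool is_bg_sched(const mem_sched *s, int self) {
    if (s->num_throttle <= 0) return true;
    if (s->num_throttle >= NUM_CPUS) return false;
    int count = 0;
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        if (cpu != self && s->bg_vtime[cpu] <= s->bg_vtime[self])
            count++;
    return count < NUM_CPUS - s->num_throttle;
}
```

With bg_vtime values of 10, 20, 30 and 40 and num_throttle equal to 1, only the core with the largest bg_vtime is denied background time, which matches the intent that heavier background memory users are throttled first.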
5.3 Experimental Evaluation
In this section, we investigated the effectiveness of our memory-aware scheduling algo-
rithm and compared the latency metric against the traditional rate metric. Experiments
were conducted on the same platform as described in Table 3.1.
We developed a memory-intensive benchmark, m_jump (with pseudocode shown in
Listing 5.1), that operates on a 6 MB data array, which is large enough to span the entire
last-level cache. The benchmark writes to the first 4 of every 64 bytes in the array. As
every cache line is also 64 bytes, this causes the entire cache to be filled. After every
write, m_jump jumps 8 KB forward in order to avoid cache prefetching effects. It is worth
noting that caches cannot be disabled for this experiment, even though our focus is on
memory performance. If caches were disabled, every instruction would be fetched from
memory. This would effectively force CPUs to run at the same speed as the memory bus,
reducing the likelihood of bus congestion.
Listing 5.1: m_jump
byte array[6M];
for (uint32 j = 0; j < 8192; j += 64)
    for (uint32 i = j; i < 6M; i += 8192) {
        <Variable delay added here>
        array[i] = i;
    }
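For readers who want to experiment with the access pattern, the listing can be written as compilable C roughly as follows. This is a sketch under assumptions: the function names, the spin helper used for the variable delay, and the sink parameter that keeps the delay result live are hypothetical and not taken from the benchmark source. An optimizing compiler may simplify the delay loop, so the timing behaviour of the original is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ARRAY_BYTES (6u * 1024u * 1024u)  /* 6 MB data array */
#define LINE_BYTES  64u                   /* cache line size */
#define STRIDE      8192u                 /* 8 KB jump between writes */

/* Variable delay: repeated multiplications on a local value, intended
 * to stay in a register so that no extra memory requests are issued. */
static uint32_t spin(uint32_t iterations) {
    uint32_t x = 3;
    for (uint32_t k = 0; k < iterations; k++)
        x *= 7u;
    return x;
}

/* One full pass over the array; returns the number of writes issued. */
size_t m_jump_pass(volatile uint8_t *array, uint32_t delay_iters,
                   uint32_t *sink) {
    size_t writes = 0;
    for (uint32_t j = 0; j < STRIDE; j += LINE_BYTES) {
        for (uint32_t i = j; i < ARRAY_BYTES; i += STRIDE) {
            *sink += spin(delay_iters);  /* delay between memory accesses */
            array[i] = (uint8_t)i;       /* touch one cache line */
            writes++;
        }
    }
    return writes;
}
```

Each pass touches every 64-byte line of the array exactly once, so it issues 6 MB / 64 B = 98304 writes, visiting lines in 8 KB strides rather than sequentially.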
[Plot omitted. x-axis: Latency (Cycles); y-axes: Bus Traffic (GB) and task3 Instructions Retired (×10^8); curves: Bus Traffic, Instructions.]
Figure 5.3: Bus Traffic & Instructions Retired versus Latency
To establish a fair comparison between the rate and latency metrics, we first performed
several profiling experiments. Three m_jump tasks (task1, task2 and task3) were executed
on separate cores for 5 minutes without memory throttling. The task parameters, (C, T),
were set to (20, 40), (25, 50) and (30, 60), respectively. In each run, we inserted a time
delay between memory accesses in the m_jump code for task1 and task2, by performing
multiplication operations on a register value for a variable number of iterations. A
register was used to avoid any extra memory requests that might affect the experiment. At
the end of the experiment, we recorded the total system-wide bus traffic, average memory
request latency and task3's instructions retired in foreground mode. Results are shown in
Figure 5.3. The Bus Traffic curve shows data points for the memory latency X and
corresponding traffic Y. Matching X and Y data points on the Bus Traffic curve are used to
establish latency and rate thresholds, respectively, for memory throttling. Derivation of
these thresholds is described later. The corresponding Instructions curve enables
thresholds to be set that trade off the performance of the target application (task3) against
the memory throughput of the entire system. Note that our scheduling algorithm does not
require performance profiling to function properly. Profiling is used here only to establish
comparable thresholds for the two memory throttling metrics.
From Figure 5.3, we chose three data points on the Bus Traffic curve that straddled the
intersection with the Instructions curve. The chosen values represent several cases when
the bus traffic is rising to its limit. The latencies for these three points were 157, 183 and
228 cycles, respectively, as shown by the vertical lines. For each latency, we also recorded
the foreground performance of task3 on the Instructions curve, which resulted in three
experimental configurations, E1, E2 and E3 (see Table 5.1).
Configuration   Bus Traffic (GB)   Latency (cycles)   task3 Instructions Retired (×10^8)
E1              1128               228                249
E2              1049               183                304
E3              976                157                357
Table 5.1: Profile Configurations
The values in the Latency column were used as thresholds for latency-based memory
throttling. Data in the Bus Traffic column shows the gigabytes transferred across the
memory bus in a 5 minute interval. We converted these values into a memory service rate
per MEM_PERIOD (set to 2 seconds), to establish comparable thresholds for rate-based
memory throttling. MEM_PERIOD is set empirically, with smaller values enabling
finer-grained monitoring of bus traffic and larger values imposing lower system overhead on
memory-aware scheduling. We chose MEM_PERIOD = 2 s as a reasonable trade-off
when considering task scheduling periods in milliseconds. The last column in Table 5.1
serves as a reference (labeled expected in Figure 5.4), showing the expected performance of
task3 using the corresponding thresholds.
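The conversion from a 5 minute traffic total to a per-period rate threshold is simple proportional scaling. The helper below is a sketch of one plausible reading of that step; the function name and the choice of gigabytes per period as the unit are assumptions, since the text does not state the exact form of the rate threshold.

```c
/* Scale total bus traffic measured over run_seconds down to the amount
 * expected in one monitoring period of period_seconds. */
double rate_threshold_gb(double total_gb, double run_seconds,
                         double period_seconds) {
    return total_gb * period_seconds / run_seconds;
}
```

Under this reading, with a 300 second run and MEM_PERIOD = 2 s, configuration E1 (1128 GB) corresponds to about 7.52 GB per period, E2 (1049 GB) to about 6.99 GB, and E3 (976 GB) to about 6.51 GB.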
Next, we repeated the previous experiment with memory throttling. A fixed delay was
added to the m_jump code of task1 and task2 so they would cause heavy bus contention.
With each configuration (E1, E2, E3), we compared rate- and latency-based memory throt-
tling. Figure 5.4 shows the resultant foreground performance of task3. In both E1 and E2
cases, our latency-based throttling approach was able to reduce bus contention so that
the target application’s performance was better than expected. In contrast, the rate-based
approach failed to achieve the expected performance of task3. In case E3, the latency
threshold was too low, leading to insufficient background time (BGT) to reduce bus con-
tention. However, latency-based throttling still enabled task3 to execute more instructions
(hence, make further progress) than the rate-based approach.
The E3 case reveals a limitation of our algorithm: the effectiveness of memory
throttling depends on the amount of BGT for each core. We demonstrated this dependence
through another experiment. A canny edge detection benchmark used in image processing
was executed for 10 minutes on a VCPU with parameters C = 50 ms, T = 100 ms. During
this time, canny repeatedly processed a 720×480 pixel image on a single core. Three
m_jump benchmarks were executed on the other three cores, with their VCPU periods set
to T = 150, 100 and 50 ms, respectively. Different T values were used to avoid the Sync
Effect described in Section 5.1. The foreground utilizations of the VCPUs associated with
the m_jump benchmarks were varied to yield five different cases in this experiment. For
cases U=30%, U=60%, U=80% and U=100%, each m_jump VCPU was allocated 30, 60,
80 and 100% CPU utilization, respectively. For the special case alone, canny was executed
without any m_jump co-runners.
[Plot omitted. x-axis: configurations E1, E2, E3; y-axis: Instructions Retired (×10^8); series: rate, latency, expected.]
Figure 5.4: Comparison between Rate- and Latency-based Throttling
Figure 5.5 shows the performance of canny in foreground mode. For m_jump
utilizations below 80%, the system was able to maintain memory request latency below the
threshold, MAX_MEM_LAT=180 cycles. As the m_jump utilization increased further,
MARACAS gradually lost its ability to preserve service quality for canny.
[Plot omitted. x-axis: cases alone, U=30%, U=60%, U=80%, U=100%; y-axes: Instructions Retired (×10^10) and Average Latency (bus cycles); series: instructions, latency.]
Figure 5.5: Foreground Performance of canny
5.4 Summary
In this chapter, we discussed issues with the traditional performance metric for
memory buses. We then introduced a new metric, average memory request latency, which
more accurately reflects the traffic condition on the bus. Through the use of hardware
performance counters, the monitored data can be gathered in a lightweight fashion.
Combined with the foreground/background scheduling model, bus contention is handled
by managing the allocation of background CPU time to VCPUs. When the average memory
request latency rises above a predefined threshold, some cores temporarily disable their
background scheduling to reduce memory access rates. Our experimental results
have validated the effectiveness of the proposed memory-aware scheduling algorithm.
Chapter 6
System Integration
In previous chapters, we discussed a new scheduling framework, several cache
partitioning mechanisms and a memory bus management policy. While it is possible to
integrate them into a general-purpose OS (GPOS) kernel (e.g., Linux), the engineering effort
required is usually significant. More importantly, there is a fundamental mismatch between
a GPOS's design goals and the requirements of real-time applications. A GPOS has various
issues, such as a large memory footprint, security vulnerabilities, fairness-oriented resource
management, multiple layers of indirection and lengthy critical sections (scattered around
the kernel). Consequently, writing a multicore real-time OS kernel from scratch seems
to be a better choice. That being said, writing a new OS is also a time-consuming and
difficult exercise. Device drivers, libraries and application programming interfaces must
all be written for a new OS to support any kind of non-trivial modern application.
Machine virtualization provides an opportunity to combine legacy system features with
new OS abstractions, greatly saving engineering effort. It is possible to encapsulate full-
blown OSes in separate virtual machines, and have their services made accessible to an-
other OS using an appropriately implemented Remote Procedure Call (RPC) mechanism.
However, this solution does not solve the problem of performance isolation. Virtualization
has traditionally only provided a logical separation between guest virtual machines. When
multiple VMs execute on separate cores, they compete for shared last-level caches,
memory buses and DRAM. Uncontrolled access to shared physical resources leads to
detrimental performance interference. Moreover, newly-developed timing-sensitive services are in
jeopardy of unbounded timing delays from the execution of other VM-based services.
It is possible to modify the hypervisor in existing virtual machine systems such as Xen
and VMware ESXi to ensure performance isolation. This requires the implementation of
timing-aware shared resource management functionality. However, there are several prob-
lems with this approach. First, since every isolation-demanding OS should already have its
own mechanisms and policies for managing shared resources, adding OS-specific policies
to a hypervisor not only creates dependencies but also duplicates functionality. Replacing
or updating the guest OS may require modification to the underlying virtualization infrastructure as
well. Second, adding resource management policies to the hypervisor increases the size
of the Trusted Computing Base (TCB), potentially reducing the security and reliability of
the entire system [60]. Third, hypervisors perform resource accounting and management
at the granularity of virtual CPUs, which adds overhead to accounting mechanisms al-
ready in the guest OS. For instance, a QoS-aware guest OS tries to improve the worst case
DRAM access latency. It tracks every thread’s DRAM access rate and performs memory
traffic regulation when the memory bus is in heavy contention [74]. To avoid the inter-VM
interference and meet the required QoS, the hypervisor also needs to manage the DRAM
access of every VM, introducing an extra layer of resource accounting on the bus. Fourth,
applications that request services from other guest OSes, referred to as dual-mode appli-
cations, require unified resource accounting (e.g., CPU budget, cache occupancy, memory
bandwidth) across multiple VMs, whereas current hypervisors account for resource usage only
per VM. Without accurately accounting for the resource usage of individual dual-mode
applications, resource contention management cannot be carried out effectively.
In-kernel virtualization technologies, such as KVM, are able to deal more effectively with
the above issues because they merge a hypervisor with an OS. However, KVM (and similar
technologies) has a fundamental problem: for every new QoS-aware OS, KVM has to be
implemented inside its kernel, which requires a huge amount of engineering effort.
Figure 6.1: vLibOS Concept
With the above considerations in mind, we believe it is still possible to use
virtualization for running legacy services in conjunction with newly-defined OS functionality.
However, for virtualization to be effectively used in the construction of evolvable and
timing-sensitive systems, it is important to provide performance isolation and proper
resource accounting across multiple VMs. In this chapter, we present vLibOS,
a master-slave paradigm that allows new systems (masters) to be temporally and spatially
isolated from legacy services in separate VMs. With the assistance of hardware
virtualization, slave OSes run side by side with the master OS and provide legacy services.
Each slave OS is called a virtualized library OS, or vLib OS. A vLib OS is not a
traditional library OS. Traditional library OS models focus on re-implementing OS services as
application components, which incurs significant engineering cost. In contrast, our
vLibOS model helps the construction and adoption of new OSes by relying on legacy OSes
to provide a feature-rich environment with minimum effort. Unlike traditional library
OS designs that treat an OS as a set of application libraries, we view an entire legacy OS as
a single library to be added to the new OS (or baby OS, see Figure 6.1). Thus, a vLib OS
acts as a standard library and only executes when it is called by threads from the master.
Figure 6.2: vLibOS Architecture
For the rest of this chapter, we first propose the general design of vLibOS, which can
be applied to the building of real-time or QoS-aware OSes. Then we discuss and evaluate
a real-time implementation of the vLibOS architecture.
6.1 vLibOS Design
The architecture of vLibOS is shown in Figure 6.2. It consists of a set of user APIs, a mas-
ter OS, a set of N vLib OSes, and an underlying hypervisor. The master OS implements
new features while leveraging pre-existing services in one or more vLib OSes. Each vLib
OS runs under the control of the master.
6.1.1 User APIs
Each vLib OS runs a server program to provide services to the master. A call to
channelAddr* vLib_listen(port)
from the vlibService library blocks the entire vLib OS until it is requested to execute
on behalf of a service caller. This causes all virtual CPUs to be suspended inside the
hypervisor, waiting for vLib calls from client threads inside the master OS. A port number
is used to uniquely identify vLib OSes. Once this function is unblocked, it returns a virtual
address to the communication channel established between the client and itself, which
contains all the data needed for the service. The data includes the function requested by
the client, the name of the library that contains the function, and the input values. Service
completion causes the server to write the output back to the channel, followed by calling
vLib_listen again. This leads to a completion signal being sent to the client, and the vLib
OS waits for another request. An example vLib server is listed below (Listing 6.1).
Listing 6.1: Example vLib Server
while (channel = vLib_listen(port)) {
    /* locate service */
    /* unmarshal data */
    /* perform service */
    /* write result back to channel */
}
A client application uses the vlibCall library to make vLib calls into services from one
of the vLib OSes. The following APIs are provided:
• errCode vLib_init(port, channel_size, **channel_addr);
• errCode vLib_call(port, timeout);
• errCode vLib_async_call(port, callback, timeout);
• errCode vLib_channel_resize(port, newSize);
• errCode vLib_channel_destroy(port);
vLib_init establishes a communication channel to the vLib OS listening on port. After
data is copied into the channel, a subsequent vLib_call requests a service with an optional
timeout for the server-side processing. Multiple service requests to a single vLib OS are
serialized in FIFO order. The timeout is used to terminate the service wait, which
guarantees a bounded delay for the call and avoids liability inversion [23]. A vLib_call is
a blocking (i.e., synchronous) request, while a vLib_async_call provides a non-blocking
asynchronous interface. The channel itself is resized using vLib_channel_resize, while an
existing channel is closed and its resources reclaimed using vLib_channel_destroy.
6.1.2 Master OS
A vLibOS system includes a single master OS, which acts as a centralized manager of all
hardware resources with the help of a hypervisor. Both native applications and dual-mode
applications (spanning the master and one or more vLib OSes) are supported. Dual-mode
applications start in the master OS, for proper cross-VM resource accounting. A vlibShm
kernel module maps communication channels into applications’ address spaces.
6.1.3 vLib OS
A vLib OS is any pre-existing OS, such as a UNIX-based system with process address
spaces, or a traditional library OS having a single address space [19]. Each vLib OS
provides libraries, or functionality, that are available for use by the master OS. The hardware
Performance Monitoring Unit (PMU) is virtualized for each vLib OS. Core-local performance
counters are not exposed to a vLib OS while they are in use by the hypervisor or the master.
Also, global performance events are made inaccessible. This hardens security isolation between the master
and vLib OSes. For example, denial-of-service attacks on shared caches or the memory
bus [41] are preventable by the resource-aware scheduler inside the master OS.
Information leakage from side channels based on PMU data is also avoided [31]. The separation of
a master OS and one or more legacy vLib OSes provides the basis for a mixed-criticality
system. Each vLib OS establishes a sandbox domain for services of different timing, safety
and security criticalities. As with the master OS, each vLib OS uses a vlibShm module to
map communication channels into the server’s address space.
6.1.4 Hypervisor
The hypervisor in vLibOS is responsible for booting OSes and delegating resources (CPUs,
memory and devices) to them. It provides an interface to VMs to support the user APIs
from Section 6.1.1. For vLib calls, it allocates communication channels between client
applications and servers upon request, and routes calls to the right destinations (using the
vLibCall Router in Figure 6.2). To avoid time-related issues inside a vLib OS due to
blocking, guest time is virtualized. More importantly, the hypervisor empowers a master
OS with the capability to block and wake up other VMs. This means the execution of a
vLib OS is integrated into the scheduling framework of the master.
Figure 6.3 illustrates the unified scheduling framework in the vLibOS architecture. A
vLib call resembles an RPC, with the callee occupying a separate address space to the
caller. However, the callee shares the same resource accounting entity (i.e., thread from
the master OS) with the caller. For synchronous vLib calls, a callee in a vLib OS executes
with the CPU budget of the calling thread in the master OS. Once a budget is depleted,
or preemption occurs, the callee is descheduled. For asynchronous vLib calls, the caller
runs simultaneously with its callee on different cores and both will be descheduled at the
same time. This mechanism extends the capability of a contention-aware scheduler within
a master OS, to manage resource contention across the entire platform.
Figure 6.3: vLibOS Unified Scheduling (left is sync call, right is async call)
6.1.5 Applications
Applications can utilize vLib OS services in several ways. In the first case, dual-mode
applications start in the master OS and send each request through a vLib call. As a vLib call
incurs higher overhead than a library call, this approach should be avoided on
performance-critical paths. In the second case, a dual-mode application first makes an async vLib call
(with no input) to pass its CPU budget to the vLib OS. Then it writes a series of service
requests to the communication channel. The server on the other side, upon returning from
vLib_listen(), starts polling on the communication channel to get the actual requests. When
an ending signal is received, the server jumps back to vLib_listen(), indicating the end of
this async vLib call session. While this approach greatly reduces the overhead of
sending each request, it also locks the vLib OS for a longer time, thus blocking other
applications from requesting services. In the third case, applications run entirely inside
a vLib OS, while a dummy thread is created in the master. The dummy thread makes a
single sync vLib call with no input. After receiving the call, the server terminates without
signaling service completion. Consequently, all other applications on the vLib OS inherit
the dummy thread's CPU budget and keep running.
6.2 Implementation: A Multicore Real-Time System
We have implemented our vLibOS architecture by extending an existing virtualization
system, Quest-V [32]. Quest-V is targeted at secure and predictable embedded systems,
where virtualization is used to isolate legacy functionality from timing, safety and security-
critical custom OS features. The system currently runs on multicore x86 architectures (in
IA32, 32-bit mode) with VT-x virtualization extensions. It is the basis for our evaluation
version of vLibOS, providing separate virtual machine domains for a master OS and a
vLib OS. We use Quest as the master OS. Since it has been described in previous sections,
we will skip it here. For the vLib OS, we choose one of the Linux distributions.
6.2.1 Partitioning Hypervisor
Our hypervisor relies on hardware-assisted virtualization to achieve efficient resource par-
titioning. CPU cores, memory and I/O devices are statically partitioned during system boot
time. Each VM is then only allowed to access the physical resources within its domain.
To partition I/O devices, the hypervisor has to restrict access to device-specific
hardware registers, which are either memory mapped or addressed by I/O ports. For
memory-mapped registers, Extended Page Tables (EPT) are used to prevent access by
unauthorized VMs. For port-addressed registers, Intel VT-x supports hypervisor trapping
of accesses to specified ports. It is then possible to ignore unauthorized I/O operations
or trigger an appropriate fault handler. Additionally, interrupts from I/O devices are
partitioned amongst VMs. On x86 processors, the I/O Advanced Programmable Interrupt
Controller (IOAPIC) controls interrupt routing to specific cores using an I/O redirection
table. An unauthorized write to the IOAPIC registers causes a VM exit, thereby
ensuring that hardware resources are securely partitioned. I/O passthrough [21] additionally
allows interrupts to be delivered directly to guests without trapping into the hypervisor.
This provides an efficient method for secure I/O partitioning, and is made possible by not
overcommitting resources amongst multiple guest VMs on the same physical machine.
Unified Scheduling. Although hardware resources are partitioned, the hypervisor still
allows the master OS to indirectly control resource usage of vLib OSes. This is achieved
by extending the original hypervisor with our unified scheduling mechanism, which re-
quires coordination between the master OS, the vLib server and the hypervisor.
First, when the vLib server invokes vLib_listen, it jumps into the hypervisor and
blocks the entire VM waiting for vLib calls. Later, a client thread in the master OS makes
a vLib call into the kernel, which then transfers control to the hypervisor. A request flag
is set to unblock the destination vLib OS, while input data is passed to it through a
pre-created communication channel. After unblocking, the CPU running the client returns to
the master OS kernel space. Instead of blocking the user thread, the kernel marks it as
being in a remote state and forces it to busy wait on the channel for a request completion
signal, with interrupts and kernel preemption enabled. This effectively turns the thread into
an idle thread. From the perspective of the user-level client, the vLib call is a blocking
call. However, inside the busy waiting loop, timestamps are checked to enforce the vLib
call timeout. Notice that this busy waiting approach greatly simplifies the changes that need
to be made to the scheduler, though it leads to lower CPU utilization (which can be
avoided, see Section 6.2.3).
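The busy-wait with timeout enforcement described above might look roughly like the following. This is a sketch, not the Quest kernel code: the result codes, the completion flag, the injected clock function and the fake_clock helper are hypothetical names introduced so the loop can be exercised without real hardware timestamps. It assumes a nonzero timeout; the handling of an omitted timeout is not specified in the text.

```c
#include <stdint.h>

/* Hypothetical result codes for the wait. */
typedef enum { WAIT_DONE = 0, WAIT_TIMEOUT = 1 } wait_result;

/* Timestamp source, injected so the loop can be driven by a fake clock. */
typedef uint64_t (*clock_fn)(void);

/* Spin on the channel's completion flag, checking timestamps on every
 * iteration so that the wait is bounded by `timeout` ticks. */
wait_result wait_for_completion(volatile const int *done, uint64_t timeout,
                                clock_fn now) {
    uint64_t start = now();
    while (!*done) {
        if (now() - start >= timeout)
            return WAIT_TIMEOUT;  /* bounded delay for the caller */
    }
    return WAIT_DONE;
}

/* Deterministic clock for illustration: advances one tick per read. */
static uint64_t fake_ticks;
uint64_t fake_clock(void) { return fake_ticks++; }
```

Because the thread keeps running with interrupts and preemption enabled, the scheduler can still deschedule it when its budget expires; the loop only adds the timeout check on top of that.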
On the server side, the vLib OS is unblocked and the channel ID is passed to the
vLib server. If the channel has not been mapped before, the server passes control into the
vlibShm kernel module (details in Section 6.2.2). This module maps the specified
communication channel into the server's virtual address space. After the channel is mapped, the
server commences request handling.
If the waiting client thread runs out of CPU budget, it is descheduled. At the same
time, the master OS generates an Inter-Processor Interrupt (IPI) to the vLib OS CPU(s).
Although normal interrupts are directly delivered to guest OSes, Non-Maskable Interrupts
(NMIs) cause VM exits. By setting the delivery mode of an IPI to be non-maskable, con-
trol is passed into the hypervisor (similar to how Jailhouse behaves [52]). The hypervisor
performs the resource accounting (including cache occupancy and memory bandwidth us-
age) for the VM which is then blocked. We call this process remote descheduling. All
resource usage data is returned to the master OS and budgeted to the client thread. When
the client thread is dispatched again, an unblock signal is set so that the vLib OS resumes
its execution from where it was previously descheduled. From the scheduler’s perspective,
the client thread is executing the vLib OS code the entire time. Hence, the execution of
client threads and services are unified.
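The accounting side of remote descheduling can be illustrated with a small model. The sketch below is an assumption-laden simplification: the struct fields, function names and microsecond units are hypothetical, and only CPU time is charged, whereas the prototype also accounts for cache occupancy and memory bandwidth.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int64_t  budget_us;   /* remaining CPU budget of the client thread */
    uint64_t charged_us;  /* vLib OS execution charged to this thread */
    bool     remote;      /* true while a sync vLib call is in flight */
} client_thread;

typedef struct {
    bool blocked;         /* VM suspended inside the hypervisor */
} vlib_vm;

/* Charge ran_us of vLib OS execution to the calling thread. Returns
 * true if the callee may keep running; when the budget is depleted the
 * VM is blocked, modelling the NMI-based remote descheduling step. */
bool charge_remote_execution(client_thread *t, vlib_vm *vm, uint32_t ran_us) {
    t->charged_us += ran_us;
    t->budget_us  -= ran_us;
    if (t->budget_us <= 0) {
        vm->blocked = true;
        return false;
    }
    return true;
}

/* The client is dispatched again with a fresh budget: the vLib OS
 * resumes from where it was previously descheduled. */
void replenish_and_dispatch(client_thread *t, vlib_vm *vm, int64_t budget_us) {
    t->budget_us = budget_us;
    if (t->remote)
        vm->blocked = false;
}
```

The point of the model is that the scheduler only ever sees the client thread; the vLib OS has no budget of its own and runs or blocks as a consequence of the client's state.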
After the server completes a service, it calls vLib_listen again. Before waiting for
another request, it sets a completion signal. This allows the client thread to exit its idle
loop, discard its remote state and return to user space with the service result. Note that
in the current implementation, we focus only on synchronous vLib calls. We will discuss the
issue of asynchronous vLib calls in Section 6.2.3.
6.2.2 vLib OS: Linux
We use the 32-bit non-SMP Ubuntu Server 14.04.5 (4.4.0 kernel) as our single vLib OS.
Our applications do not need CPU parallelism in Linux and a non-SMP version greatly
reduces the engineering effort to virtualize CPUs. One CPU core is dedicated to the Linux
vLib OS, while the rest of the cores are assigned to Quest. Virtualization is further simpli-
fied by applying a patch comprising approximately 100 lines of code to the Linux vLib OS.
This limits the vLib OS’s view of available physical memory, and adjusts I/O device DMA
offsets to account for memory virtualization. From Linux’s view, most of the processor
capabilities are exposed except those associated with VT-x and PMU.
vlibShm. The designs of the vlibShm modules in the master OS and the vLib OS
are very similar to each other, so we focus on our Linux vLib OS for brevity. A
communication channel's physical memory is outside of Linux's memory range and cannot
be easily mapped into the vLib server. The vlibShm kernel module we have developed is
specifically designed to handle this. The server calls mmap into the module, passing in
the machine physical address of the channel and its memory size, which are returned by
the hypervisor. vlibShm then creates a vm_area_struct object with a customized page fault
handler. When a page (A) inside the channel is accessed for the first time, a page fault oc-
curs and the handler is invoked. This leads to the allocation of a page (B) within Linux’s
memory range. B’s guest physical address is then passed to the hypervisor through a hy-
percall, together with the machine physical address of page A. An EPT entry swap is
performed in the hypercall so that the channel’s page A now maps to a legitimate memory
address (B) in Linux.
6.2.3 Discussion
In this prototype, our goal is to combine the timing predictability of an RTOS and the rich
body of software available on a general-purpose OS. The RTOS hosts the time-critical
and latency sensitive tasks or control loops, which are required to ensure the safety of
our targeted embedded platforms. At the same time, we take advantage of the abundant
commodity software, including vision, data logging and communication code hosted on
Linux. Our system design not only allows us to combine legacy and custom software, but
also enables fine-grained resource control required to achieve strong performance
isolation.
Given our design goals, we adopted a partitioning hypervisor for maximum predictabil-
ity and fast I/O manipulation. The downside of this approach is that when a vLib OS is not
servicing requests its assigned CPUs are unused. We believe this is an appropriate tradeoff
when targeting real-time systems or general systems with tight tail latency requirements.
However, when higher resource utilization is desired, a traditional hypervisor with hard-
ware resource multiplexing can be used. Through CPU multiplexing, a vLib call would
cause a VM switch and CPU utilization can be increased. However, we pay the cost of
VM switching, in terms of pipeline stalls, cache and TLB flushing.
In our implementation, a significant development burden has been avoided by exploit-
ing the vLibOS model. Rather than rewriting the scheduler to manage the threads asso-
ciated with vLib OSes side by side with native Quest threads (or VCPUs), we treat the
execution of the former as a special state (remote) of the latter. As a consequence, the
modification to the scheduler in Quest is minimal (< 50 lines). Admittedly, this approach
poses two disadvantages: 1) running services from N vLib OSes concurrently would re-
quire at least N cores dedicated to the master OS, and 2) while the client thread is busy
waiting, it cannot yield its CPU before depleting its current budget. The client-side CPU
cannot do additional useful work even if there is no contention on platform resources. To
avoid the utilization issue, one can rewrite the master OS scheduler and manage
different types of threads separately. However, as demonstrated in our evaluation section, our
model affords us the luxury of doing this only when needed. Our existing platform has
sufficient resources to meet our application goals without these optimizations, which will
be investigated in future work.
We have made a design choice to not support asynchronous vLib calls in real-time
systems. With blocking calls, both critical and non-critical code (including Linux services)
have temporal separation within a thread. If an asynchronous vLib call is made, then the
same thread is able to run its critical code and non-critical code simultaneously on different
cores, impacting each other when accessing shared resources. The added contention would
make it even harder to guarantee predictable execution on multicore platforms.
While our vLibOS prototype uses a research RTOS as the master and a mature legacy
OS as the slave vLib OS, this need not be the case. In fact, a full-blown OS such as Linux
could be the master as well. Such an architecture can be exploited to achieve monolithic
kernel decomposition [43]. Similarly, the slaves need not be commodity
OSes; it is equally viable to construct a runtime in which several specialized slave
OSes are included. For example, specialized new OS kernels [7, 45] can be forked by
the master to perform specific and highly optimized tasks (e.g., network I/O). Mixtures of
specialized OSes allow developers to focus their effort on optimizing one service while
delegating other services to general-purpose OSes. Unlike hybrid systems that lack isola-
tion amongst kernels [44, 51], our approach provides security and performance isolation
for the master OS.
Although our prototyping effort has focused on a particular platform, the techniques
it introduces for controlled interaction between the master and the slaves are applicable
to other situations. Achieving higher utilization in data centers leads to dramatic energy
and cost savings [66]. Our architecture provides a VM-based framework to construct
cloud runtimes in which high priority service applications with stringent QoS requirements
are consolidated with lower priority batch and best effort workloads on shared hardware
nodes. As is demonstrated in our evaluation, despite using virtualization, it is possible to
achieve precise resource throttling and isolation between VMs even with respect to low-
level resources such as shared caches and memory buses. For instance, vLibOS can be
used to structure a Xen-based cloud environment where the Dom0 acts as a master OS and
is extended with contention management policies.
6.3 Experimental Evaluation
In this section, we evaluate our vLibOS implementation using the hardware platform as
shown in Figure 6.4 and Table 6.1. The autonomous ground vehicle houses a custom-
made PC. Only three cores are enabled in the firmware, for the purposes of running all
needed tasks and to conserve energy usage from the main battery powering the vehicle. A
GeForce GT 710 GPU is used because of its relatively small form-factor, single PCIE slot
requirement, fanless design and low power consumption.
Component   Specification
Processor   Intel Core i5-2500k quad-core
Caches      6MB L3 cache
Memory      4GB 1333MHz DDR3
GPU         MSI GeForce GT 710 2GB
Camera      Logitech QuickCam Pro 9000
LIDAR       Hokuyo URG-04LX-UG01
Table 6.1: Hardware Specification
We assigned two of the three cores to Quest and one to Linux. Our hypervisor and
Quest relied on an in-RAM file system while Linux used a USB drive for its storage.
Both the servo controller and the LIDAR (serving mission-critical tasks) were connected
through serial ports. Based on this system requirement, we assigned the serial ports to Quest
and granted Linux exclusive access to the GPU and USB stack (camera and storage).
6.3.1 vLib Call Overhead
In this experiment, we examined the overhead of the vLib call mechanism. We started by
measuring VM entry/exit costs followed by the cost of making a vLib call. We ran a test
thread in Quest and a vLib server in Linux. The test thread establishes a communication
channel and keeps making vLib calls without input data. The vLib server receives requests
and immediately signals completion without performing any service. To exclude
the impact of scheduling in Quest, we measured the time difference, T1, between when the
thread entered the kernel and when it was about to return to user space. To exclude Linux
scheduling overheads, we measured the time, T2, between when vLib_listen() was about
to return from the hypervisor (for servicing new requests) and when the next vLib_listen()
call entered the hypervisor (to generate a completion signal). The vLib call overhead
is represented as T1 − T2. For the remote descheduling cost, we also measured the time
difference between the moment the Quest kernel sent out an IPI and when a VM exit was
completed on the Linux core. Finally, we measured the execution time of our customized
page fault handler as the cost of mapping a single-page communication channel
into Linux. All measurements were averaged over 1000 runs and are shown in Table 6.2.
Note that the channel is mapped only when it is accessed for the first time, so the mapping
is not on the critical path of vLib calls.
Figure 6.4: Autonomous Vehicle Platform
             VM Entry   VM Exit   vLib Call   Remote Desched   Channel Mapping
CPU Cycles   531        481       4754        1153             2377
Table 6.2: Mechanism Overhead
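To put the cycle counts of Table 6.2 into wall-clock terms, the following minimal Python sketch converts them to microseconds. It assumes a fixed 3.3 GHz clock (the nominal frequency of the i5-2500k) with no turbo boost or frequency scaling. The names cycles_to_us, vlib_call_overhead and ASSUMED_CLOCK_HZ are illustrative only and are not part of Quest-V.

```python
# Cycle counts copied from Table 6.2.
OVERHEAD_CYCLES = {
    "VM Entry": 531,
    "VM Exit": 481,
    "vLib Call": 4754,
    "Remote Desched": 1153,
    "Channel Mapping": 2377,
}

# Assumption: cycles are counted at the nominal 3.3 GHz of the i5-2500k,
# with no turbo boost and no frequency scaling during the measurement.
ASSUMED_CLOCK_HZ = 3.3e9


def cycles_to_us(cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a CPU cycle count to microseconds at a fixed clock rate."""
    return cycles / clock_hz * 1e6


def vlib_call_overhead(t1_cycles, t2_cycles):
    """Overhead of one vLib call as defined in the text: T1 - T2."""
    return t1_cycles - t2_cycles


if __name__ == "__main__":
    for name, cycles in OVERHEAD_CYCLES.items():
        print(f"{name:16s} {cycles:5d} cycles  ~{cycles_to_us(cycles):.2f} us")
```

Under this assumption, every mechanism in the table costs on the order of a microseconds or less per invocation; a different effective clock rate would scale the figures proportionally.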
6.3.2 Performance of Partitioned I/O Devices
Our partitioning hypervisor incurs minimal overhead for guest I/O operations. To measure
this overhead, we evaluated the GPU performance in the Linux vLib OS using the
open source Convolutional Neural Network (CNN) application, Darknet [47]. Although
there are other popular deep learning frameworks available, we chose Darknet because of
its support for 32-bit platforms. Darknet is implemented in C and CUDA with high effi-
ciency and only a few dependencies. We believe it is well suited to embedded applications.
In this experiment, we compared the performance of running Darknet in the stand-
alone Linux (vanilla Linux) and Linux running on top of Quest-V (vLib Linux). Both
Linux kernels were built without SMP support for fair comparison. The vLib Linux still
retains execution control over its CPU since the vLib server was not started for this exper-
iment. We measured the execution time of the Darknet image classification operation on
the GPU for both systems. The results were averaged over 1000 operations on the same
single image and are shown in Table 6.3. As can be seen, vLib Linux achieved similar
GPU performance (7% slowdown) compared to vanilla Linux. Note that part of the
slowdown comes from the memory virtualization overhead.
vanilla Linux   859
vLib Linux      920
Table 6.3: GPU Performance (10^6 CPU cycles)
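The 7% figure quoted above follows directly from the two entries of Table 6.3. The short Python sketch below reproduces the arithmetic; the function name slowdown_percent is ours and is used only for illustration.

```python
def slowdown_percent(baseline_cycles, measured_cycles):
    """Relative increase in execution time over the baseline, in percent."""
    return (measured_cycles - baseline_cycles) / baseline_cycles * 100.0


# Values from Table 6.3, in units of 10^6 CPU cycles.
VANILLA_LINUX = 859
VLIB_LINUX = 920

if __name__ == "__main__":
    print(f"slowdown: {slowdown_percent(VANILLA_LINUX, VLIB_LINUX):.1f}%")
```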
6.3.3 Effectiveness of Memory Throttling
To evaluate the effectiveness of the Quest memory throttling mechanism, we used a
memory-intensive micro-benchmark, m_jump, to measure the memory bus performance. The details
of m_jump are covered in Section 5.3. No delay was added to the memory scan.
We set up three groups of experiments, each with a different foreground CPU utilization
(C/T) for the m_jump instances. In each group, we ran five 10-minute experiments for comparison.
The first experiment (alone) ran a single m_jump under Quest, without a co-runner. At the
end of the experiment, we recorded the instructions that m_jump retired in foreground
mode only (FG Inst). For the second (quest) and the third (quest + mem) experiments, two
m_jump instances were started at the same time on different cores in Quest. We disabled memory
throttling for the second experiment and enabled it for the third. We measured the FG Inst
of the first of the two m_jump instances in both experiments. In the fourth experiment (linux), the
first m_jump was started in Quest while the second was started in Linux. Memory throttling
was enabled in Quest. The FG Inst of the m_jump in Quest was measured. The last
experiment (linux + mem) was similar to the fourth, but the m_jump in Linux was invoked
by a Quest thread through a single vLib call (timeout set to null) so that memory throttling
could be applied to Linux.
In group one (U=10%), the first m_jump was bound to a VCPU with C=10 and T=100,
where the time unit was milliseconds. For the second m_jump, C=9 and T=90. We refer to
this setting as {C=(10, 9), T=(100, 90)}. The CPU utilization of both threads was 10%,
but the periods (Ts) were set differently to reduce the Sync Effect [72]. When running
m_jump in Linux, there was no VCPU assignment. The VCPU with C=9 and T=90 was
assigned to the Quest thread performing the vLib call in the fifth experiment. For group
two (U=30%) and group three (U=60%), VCPU settings were {C=(30, 27), T=(100, 90)}
and {C=(60, 54), T=(100, 90)}, respectively. Figure 6.5 shows the FG Inst of the first
m_jump in different cases.
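As a quick check of the VCPU settings above, the following Python sketch computes the utilization C/T of each VCPU in the three groups and confirms that both VCPUs in a group have the same utilization despite their different periods. It also reports how often the two periods line up, under the simplifying assumption (ours) that both VCPUs are released at time zero. The helper names are illustrative and do not come from Quest.

```python
from fractions import Fraction
from math import gcd

# VCPU settings from the text, in milliseconds: (first m_jump, second m_jump).
GROUPS = {
    "U=10%": {"C": (10, 9), "T": (100, 90)},
    "U=30%": {"C": (30, 27), "T": (100, 90)},
    "U=60%": {"C": (60, 54), "T": (100, 90)},
}


def utilization(c_ms, t_ms):
    """Foreground CPU utilization C/T as an exact fraction."""
    return Fraction(c_ms, t_ms)


def coincidence_interval_ms(t1_ms, t2_ms):
    """Least common multiple of two periods.

    Assumption: both VCPUs start at time zero, so their period boundaries
    coincide only at multiples of this interval.
    """
    return t1_ms * t2_ms // gcd(t1_ms, t2_ms)


if __name__ == "__main__":
    for name, cfg in GROUPS.items():
        utils = [utilization(c, t) for c, t in zip(cfg["C"], cfg["T"])]
        print(name, [f"{float(u):.0%}" for u in utils],
              "periods coincide every", coincidence_interval_ms(*cfg["T"]), "ms")
```

With periods of 100 ms and 90 ms, the period boundaries coincide only every 900 ms, rather than at every period as they would with identical periods.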
Figure 6.5: Effective Memory Throttling
[Bar chart. Y-axis: Instructions Retired (x 10^8), 0 to 1600. Groups: U=10%, U=30%, U=60%. Bars per group: alone, quest, quest+mem, linux, linux+mem.]
Comparing the linux case to the alone base case, we see that having uncontrolled
memory bus contention inside Linux leads to a significant performance drop for real-
time threads running in Quest. If memory throttling is applied to Linux using our unified
scheduling mechanism, as in linux + mem, a large reduction in performance slowdown is
achieved. In group U=10%, there is a 50% reduction in slowdown.
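One way to read the phrase "reduction in slowdown" is sketched below in Python. We assume that slowdown is measured as the relative loss in FG Inst with respect to the alone case, and that the reduction compares the slowdown with and without throttling. The instruction counts in the example are hypothetical placeholders chosen to give round numbers; they are not the measured values behind Figure 6.5.

```python
def slowdown(alone_inst, contended_inst):
    """Fraction of foreground instructions lost relative to running alone."""
    return (alone_inst - contended_inst) / alone_inst


def slowdown_reduction(alone_inst, unthrottled_inst, throttled_inst):
    """Relative decrease in slowdown once memory throttling is applied."""
    before = slowdown(alone_inst, unthrottled_inst)
    after = slowdown(alone_inst, throttled_inst)
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical FG Inst values (not taken from the thesis).
    alone, linux, linux_mem = 1000, 800, 900
    print(f"slowdown without throttling: {slowdown(alone, linux):.0%}")
    print(f"slowdown with throttling:    {slowdown(alone, linux_mem):.0%}")
    print(f"reduction in slowdown:       "
          f"{slowdown_reduction(alone, linux, linux_mem):.0%}")
```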
As we increase the CPU utilization of m_jump, its foreground performance increases
correspondingly, due to increased foreground time. However, background time decreases
at the same time. With less background time, the effectiveness of memory throttling is
reduced because there is less slack time to stall execution on individual cores.
6.3.4 Autonomous Driving Case Study
We then tested Quest-V using a real-time application involving an autonomous ground
vehicle. This system (see Figure 6.6) consists of a real-time program, lidar, and two Linux
programs, logger and Darknet. lidar takes LIDAR data as input and makes steering decisions
to avoid objects. Meanwhile, it dumps data to a 4MB memory buffer, which is shared
with the logger. The logger periodically checks the buffer and saves data to a log file when
the buffer is full. Saved data is used for offline diagnostics. Darknet runs side by side
in Linux, reading camera frames and performing object classification, which is useful in
autonomous vehicle control. For example, Darknet is able to identify traffic signs while
LIDAR is not.
Figure 6.6: Autonomous Driving System
We considered the functioning of lidar as the most critical in the system, so it was
placed inside the Quest RTOS. Data logging and object classification improve quality of
service, but their failure is tolerable as long as obstacle avoidance is maintained. Although
less critical, implementing them from scratch would take significant engineering effort.
Using the legacy Linux implementations of Darknet and data logging greatly reduced the time
needed to build our mixed-criticality system.
In our object avoidance solution, the LIDAR device sends out distance data (around
700 bytes) every 100 ms. Periodically, lidar decodes the received data and scans object
distances from all angles (240 degrees) in order to identify objects within a certain distance.
If a nearby object is found directly in front of the vehicle, then lidar looks for
the closest open space on either the left side or the right side. In the example in
Figure 6.7, since α < β, a left turn decision is made.
Figure 6.7: Object Avoidance Algorithm
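The decision rule described above can be sketched in a few lines of Python. This is a simplified illustration and not the lidar program itself. We assume that a scan is a list of (angle, distance) pairs with 0 degrees straight ahead and negative angles to the left, that α and β are the angular offsets to the closest open direction on the left and right, and that the distance threshold, the width of the frontal cone and the "stop" fallback are free parameters of our own choosing.

```python
def steering_decision(scan, threshold_mm=1000, front_half_width_deg=10):
    """Return 'straight', 'left', 'right' or 'stop' for one LIDAR scan.

    scan: iterable of (angle_deg, distance_mm); 0 is straight ahead,
    negative angles are to the left (assumed convention).
    """
    scan = list(scan)
    # Is there a nearby object directly in front of the vehicle?
    blocked_ahead = any(
        abs(angle) <= front_half_width_deg and dist < threshold_mm
        for angle, dist in scan
    )
    if not blocked_ahead:
        return "straight"

    # alpha / beta: angular offset to the closest open direction on each side.
    inf = float("inf")
    alpha = min((-a for a, d in scan
                 if a < -front_half_width_deg and d >= threshold_mm), default=inf)
    beta = min((a for a, d in scan
                if a > front_half_width_deg and d >= threshold_mm), default=inf)

    if alpha == inf and beta == inf:
        return "stop"  # assumption: no open space on either side
    # A tie (alpha == beta) is resolved to the right here; this is arbitrary.
    return "left" if alpha < beta else "right"


if __name__ == "__main__":
    # Obstacle spanning -10..30 degrees, open space elsewhere (240-degree scan).
    scan = [(a, 500 if -10 <= a <= 30 else 3000) for a in range(-120, 121, 10)]
    print(steering_decision(scan))
```

In the example scan, the closest open direction is 20 degrees to the left and 40 degrees to the right, so α < β and the sketch chooses a left turn, matching the rule illustrated in Figure 6.7.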
In our evaluation, we divided the experiments into three cases. In the first one (lidar), lidar
was running in Quest on a VCPU configured with {C=12, T=40}. Although the logger
was also started in Linux, the shared buffer did not fill to capacity at any point during
the experiment. Thus, the logger did not perform any work. We consider this case
as lidar running alone on the platform. This allows us to focus on evaluating only the
performance impact from co-running Darknet later. We will not mention the logger again
in the experiment description that follows.
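A back-of-the-envelope estimate suggests how long the shared buffer takes to fill. The Python sketch below assumes that the 4MB buffer holds 4 x 1024 x 1024 bytes and that lidar appends roughly the 700 raw bytes of each scan, arriving every 100 ms. The actual logged record size is not stated here, so the result should be read as an order-of-magnitude figure only.

```python
# Assumptions (ours): 4MB means 4 * 1024 * 1024 bytes, and each LIDAR scan
# is logged as about 700 bytes, one scan every 100 ms.
BUFFER_BYTES = 4 * 1024 * 1024
SCAN_BYTES = 700
SCAN_PERIOD_S = 0.1


def scans_until_full(buffer_bytes=BUFFER_BYTES, scan_bytes=SCAN_BYTES):
    """Number of whole scans that fit into the shared buffer."""
    return buffer_bytes // scan_bytes


def seconds_until_full(buffer_bytes=BUFFER_BYTES, scan_bytes=SCAN_BYTES,
                       scan_period_s=SCAN_PERIOD_S):
    """Time needed to log that many scans at the given scan period."""
    return scans_until_full(buffer_bytes, scan_bytes) * scan_period_s


if __name__ == "__main__":
    print(scans_until_full(), "scans,", round(seconds_until_full()), "seconds")
```

Under these assumptions the buffer holds several thousand scans, which corresponds to roughly ten minutes of continuous logging before the logger has any work to do.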
In every period, we measured the execution time of lidar from right after it received
LIDAR data to the end of its object avoidance algorithm. 2000 samples were taken during
the experiment. In the second case (lidar+Darknet w/o mem), we simultaneously ran lidar
in Quest and Darknet in Linux. In Linux, we did not start a vLib server, so Linux still
had control over its own execution. The same measurement was carried out. The last
experiment (lidar+Darknet w/ mem) was similar to the second, except a vLib server ran
in Linux and Quest acquired full system control. A Quest thread (client) was created with
VCPU {C=12, T=40}. Immediately after starting, it made a blocking vLib call to Linux
with timeout set to null. Without sending out a request completion signal, the server simply
terminated itself. Darknet then kept running inside Linux with the client’s CPU budget, so
that memory throttling could be applied.
From Figure 6.8, we can see that lidar, when running alone inside Quest, has a very sta-
ble performance. When Darknet starts competing for shared resources in case lidar+Darknet
w/o mem, lidar suffers from increased performance variation. The worst case execution
time we observed was around 24000 CPU cycles, which is twice the average. This is be-
cause Linux was not running a vLib server and could not be controlled by Quest. When
the vLib call mechanism was enabled in case lidar+Darknet w/ mem, memory bus con-
tention was effectively managed, leading to reduced performance variation. The worst
case execution time dropped to around 17000 cycles.
It is worth mentioning that the lidar program’s working set fits into the private L1/L2
caches. This explains why lidar’s average execution time does not increase as much
as the results in Section 6.3.3 would suggest in the presence of memory bus contention.
However, more advanced LIDAR devices, better object avoidance algorithms and
more complicated sensor fusion algorithms would enlarge the working set of a real-time
program. As a result, memory bus management would become essential.
For comparison, we investigated how a vanilla Linux performed on our ground vehicle.
Ideally, we wanted to use the RT-PREEMPT patch for Linux, which improves performance
for real-time tasks. However, we discovered that the Nvidia GPU driver did not support
the real-time patch, so we were restricted to using an unpatched SMP Linux system.
Figure 6.8: lidar Performance in Quest-V
[Three histograms: lidar, lidar+Darknet w/o mem, and lidar+Darknet w/ mem. X-axis: CPU Cycles (x1000), 10 to 24. Y-axis: Number of Samples.]
Figure 6.9: lidar Performance in Vanilla Linux
[Two histograms: lidar and lidar+Darknet. X-axis: CPU Cycles (x1000), 10 to 35. Y-axis: Number of Samples.]
For the first experiment (lidar), we pinned lidar to a core together with a CPU hog,
which runs an empty while loop. Since lidar itself runs only periodically, it does not
create much workload for the CPU. If we do not assign a hog on the same CPU, Linux
performs Dynamic Voltage and Frequency Scaling (DVFS) on the CPU, thereby decreas-
ing its frequency. This would unnecessarily slow down the execution of lidar and impact
our measurements. We also assigned lidar the real-time scheduling class SCHED_FIFO with
the highest priority in order to avoid preemption. The execution time of lidar’s periodic
task was measured over 2000 samples. Next, we put Darknet on another CPU and repeated
the same experiment (lidar+Darknet). Results are shown in Figure 6.9.
As Linux is not designed for real-time applications, lidar experienced noticeable performance
variation even when running alone. In the presence of interference from Darknet,
both the average and worst case execution times were prolonged significantly. Comparing
these results with the results from Figure 6.8, we believe Quest-V provides better real-time
service to the lidar application.
6.4 Summary
This chapter presents vLibOS, a master-slave paradigm that integrates services from mul-
tiple OSes into a single, custom system. The approach allows pre-existing OSes to provide
legacy services to new systems with specialized requirements. The new system features
are implemented in a master OS that calls upon legacy system software in different virtual
machines. The master OS communicates and schedules the execution of slave OS services,
which behave like virtualized library calls. As the master OS in one virtual machine is able
to coordinate and schedule the execution of services in other VMs on separate cores, per-
formance isolation is greatly improved. This is critical to systems that require temporal
predictability (e.g., real-time guarantees).
In our prototype system, Quest-V, we implemented a partitioning hypervisor with vLi-
bOS API support. The unified scheduling mechanism in Quest-V enables legacy services
to be an extension of a client thread in the master OS. Quest-V uses a latency-based mem-
ory throttling technique to ensure the shared memory bus is controlled by the master OS.
This avoids concurrent memory accesses by different VMs that would otherwise lead to
unpredictable execution times of real-time tasks. Experiments show the benefits of our
vLibOS approach.
Chapter 7
Conclusions and Future Work
In this thesis, we studied the performance interference issue on multicore platforms. Although
programs run on separate cores, they rely on shared hardware resources like
LLCs and memory buses. Concurrent accesses to those resources create highly-variable
contention and significant execution slowdown. To help bring real-time systems into the
multicore era, we proposed several mechanisms and policies to efficiently put resource
access under control.
At the CPU level, we introduced the foreground/background scheduling model. Our
scheduler takes advantage of surplus CPU cycles on each core, after meeting the fore-
ground timing requirements of each VCPU, to improve system performance. The back-
ground scheduling aims at increasing CPU utilization without causing heavy contention
on shared resources. It serves as the basis for contention-aware scheduling policies. In addition,
we proposed a predictable load balancing mechanism and showed how to balance
VCPUs across cores to both guarantee VCPU timing requirements and evenly distribute
surplus CPU cycles.
At the cache level, we studied both the static and dynamic cache partitioning ap-
proaches for improving performance isolation on multicore platforms. For the former, the
LLC is statically partitioned amongst cores in order to get strict real-time performance.
A cache-aware load balancing algorithm was developed, along with an API to specify
cache requirements. For the latter, a Cache Utilization Monitor measures application cache
usage on-the-fly and triggers re-partitioning to improve QoS for individual applications.
To achieve efficient cache re-partitioning, two page selection policies were proposed and
evaluated.
At the memory bus level, we discussed issues with the traditional performance metric
for memory buses. Then our new metric, average memory request latency, was intro-
duced, which more accurately reflects the traffic condition on the bus. Through the use
of hardware performance counters, monitored data can be gathered in a lightweight fash-
ion as well. Combined with the foreground/background scheduling model, bus contention
is taken care of by managing the allocation of background CPU time to VCPUs. When
bus contention goes above a predefined latency threshold, some cores will disable their
background scheduling temporarily, effectively reducing memory traffic.
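The throttling rule summarized above can be illustrated with a small Python sketch. It is a simplified model under our own assumptions and not the Quest implementation: the average latency is taken as a ratio of two hypothetical counter deltas (cycles spent with requests outstanding, and requests completed), and the set of cores that give up background time is passed in explicitly because the selection policy is not restated here.

```python
def average_request_latency(outstanding_cycles, completed_requests):
    """Average memory request latency over one monitoring interval.

    Both arguments are hypothetical counter deltas for the interval.
    """
    if completed_requests == 0:
        return 0.0
    return outstanding_cycles / completed_requests


def update_background(cores, avg_latency, threshold, throttle_candidates):
    """Return {core: background_enabled} for the next interval.

    When latency exceeds the threshold, the candidate cores lose their
    background time; foreground budgets are never affected.
    """
    over_threshold = avg_latency > threshold
    return {
        core: not (over_threshold and core in throttle_candidates)
        for core in cores
    }


if __name__ == "__main__":
    latency = average_request_latency(outstanding_cycles=120_000,
                                      completed_requests=1_000)
    print(update_background([0, 1, 2], latency, threshold=100,
                            throttle_candidates={1, 2}))
```

Because only background time is withheld, each VCPU still receives its foreground budget C in every period T, which is what keeps the timing guarantees intact while memory traffic is reduced.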
Finally, we discussed system integration for practical multicore real-time systems.
This tackles one of the toughest problems in OS innovation: legacy support. We presented
vLibOS, a master-slave paradigm that integrates services from multiple OSes into a
single, custom system. The approach allows pre-existing OSes to provide legacy services
to new systems with specialized requirements. The new system features are implemented
in a master OS that calls upon legacy system software in different VMs. The master OS
communicates and schedules the execution of slave OS services, which behave like vir-
tualized library calls. As the master OS in one VM is able to coordinate and schedule
the execution of services in other VMs, performance isolation is greatly improved. We
showed how our memory traffic control mechanism can be integrated into vLibOS to
effectively manage bus contention across VMs.
7.1 Future Directions
I/O Management. Shared I/O devices are another source of performance interference on
multicore platforms. Requests may suffer from queuing delays due to concurrent accesses
to the same device. A real-time I/O scheduler needs to be developed in order to provide
predictable services to real-time applications.
Device interrupt handling is also a tough problem. Although interrupt bottom halves
have been converted into threads, top halves are still handled within the interrupt context.
Frequent interrupt handling slows down applications being interrupted. An RTOS should
be able to offer a noiseless execution environment through new APIs. For bottom halves,
a mechanism is also needed to associate resource usages (e.g., cache occupancy, memory
bandwidth) with the correct service requesters, for more fine-grained resource control.
In addition, we will look into extensions to our load balancing algorithm for I/O support.
Bottom half handling may be migrated across cores so that data is closer to the service
requesters, reducing cross-core cache coherence traffic.
vLibOS. Although a reference implementation has been presented and evaluated, it is not
the only way to implement the vLibOS architecture. We plan to investigate a traditional
hypervisor design for vLibOS. Potentially, a master OS takes all available cores on the
platform. When a vLib call is made, the current core switches to the slave VM hosting the
service. It will be important to find out whether the VM switch overhead is prohibitive for
real-time systems.
As stated before, in our current implementation, we treat the execution of a vLib OS
as a special state (remote) of a master OS thread. This design choice hurts hardware
utilization. Alternatively, we can extend the current scheduler in the master OS. There
will be a separate scheduling module that handles vLib OS execution. The complexity of
such a module is worth looking into.
Moreover, the application of vLibOS is not limited to real-time systems. Future work
should explore its application in cloud systems. Current solutions provide low-latency
cloud services by reserving entire physical machines for single VMs. This avoids resource
contention at the cost of lowered server utilization. By applying vLibOS to cloud systems,
we expect to see enhanced QoS control. Thus server consolidation can be more effective.
Bibliography
[1] ANDERSON, J. H., BUD, V., AND DEVI, U. C. An EDF-based Restricted-migration
Scheduling Algorithm for Multiprocessor Soft Real-Time Systems. Real-Time Syst.
38, 2 (Feb. 2008), 85–131.
[2] BANGA, G., DRUSCHEL, P., AND MOGUL, J. C. Resource Containers: A New
Facility for Resource Management in Server Systems. In Proceedings of the 3rd
USENIX Symposium on Operating Systems Design and Implementation (1999).
[3] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A.,
NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the Art of Virtual-
ization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems
Principles (New York, NY, USA, 2003), SOSP ’03, ACM, pp. 164–177.
[4] BARUAH, S., AND FISHER, N. The Partitioned Multiprocessor Scheduling of Spo-
radic Task Systems. In Proceedings of the 26th IEEE International Real-Time Sys-
tems Symposium (2005), RTSS ’05, pp. 321–329.
[5] BARUAH, S., AND FISHER, N. The Partitioned Multiprocessor Scheduling of
Deadline-constrained Sporadic Task Systems. IEEE Transactions on Computers 55,
7 (July 2006), 918–923.
[6] BELAY, A., BITTAU, A., MASHTIZADEH, A., TEREI, D., MAZIÈRES, D., AND
KOZYRAKIS, C. Dune: Safe User-level Access to Privileged CPU Features. In
Presented as part of the 10th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 12) (Hollywood, CA, 2012), USENIX, pp. 335–348.
[7] BELAY, A., PREKAS, G., KLIMOVIC, A., GROSSMAN, S., KOZYRAKIS, C., AND
BUGNION, E. IX: A Protected Dataplane Operating System for High Throughput
and Low Latency. In 11th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 14) (CO, 2014), USENIX Association, pp. 49–65.
[8] BELLOSA, F. Process Cruise Control: Throttling Memory Access in a Soft Real-
Time Environment. Tech. Rep. TR-14-97-02, University of Erlangen, Germany, July
1997.
[9] BERSHAD, B. N., SAVAGE, S., PARDYAK, P., SIRER, E. G., FIUCZYNSKI, M.,
AND CHAMBERS, B. E. Extensibility, safety, and performance in the SPIN oper-
ating system. In Proceedings of the 15th ACM Symposium on Operating Systems
Principles (Copper Mountain, Colorado, December 1995), pp. 267–284.
[10] BLAGODUROV, S., ZHURAVLEV, S., AND FEDOROVA, A. Contention-Aware
Scheduling on Multicore Systems. ACM Trans. Comput. Syst. 28, 4 (Dec. 2010),
8:1–8:45.
[11] BRANDENBURG, B. B., AND ANDERSON, J. H. On the Implementation of Global
Real-Time Schedulers. In Proceedings of the 30th IEEE Real-Time Systems Sympo-
sium (2009), RTSS ’09, pp. 214–224.
[12] CALANDRINO, J., AND ANDERSON, J. Cache-Aware Real-Time Scheduling on
Multicore Platforms: Heuristics and a Case Study. In Proceedings of the 20th Eu-
romicro Conference on Real-Time Systems (July 2008).
[13] CHO, S., AND JIN, L. Managing Distributed, Shared L2 Caches Through OS-level
Page Allocation. In Proceedings of the 39th Annual IEEE/ACM International Sym-
posium on Microarchitecture (2006), pp. 455–468.
[14] DANISH, M., LI, Y., AND WEST, R. Virtual-CPU Scheduling in the Quest Operat-
ing System. In Proceedings of the 17th IEEE Real-Time and Embedded Technology
and Applications Symposium (2011), RTAS ’11, pp. 169–179.
[15] DAVIS, R. I., AND BURNS, A. A Survey of Hard Real-time Scheduling for Multi-
processor Systems. ACM Comput. Surv. 43, 4 (Oct. 2011), 35:1–35:44.
[16] DHALL, S. K., AND LIU, C. L. On a Real-Time Scheduling Problem. Oper. Res.
26, 1 (Feb. 1978), 127–140.
[17] DIKE, J. The User Mode Linux Kernel.
http://user-mode-linux.sourceforge.net/, 2006.
[18] DING, X., WANG, K., AND ZHANG, X. SRM-Buffer: an OS Buffer Management
Technique to Prevent Last Level Cache from Thrashing in Multicores. In Proceed-
ings of the 6th ACM European Conference on Computer Systems (2011), pp. 243–
256.
[19] ENGLER, D. R., KAASHOEK, M. F., AND O’TOOLE, JR., J. Exokernel: An Oper-
ating System Architecture for Application-level Resource Management. In Proceed-
ings of the Fifteenth ACM Symposium on Operating Systems Principles (New York,
NY, USA, 1995), SOSP ’95, ACM, pp. 251–266.
[20] FORD, B., BACK, G., BENSON, G., LEPREAU, J., LIN, A., AND SHIVERS, O. The
Flux OSKit: A Substrate for Kernel and Language Research. In Proceedings of the
Sixteenth ACM Symposium on Operating Systems Principles (New York, NY, USA,
1997), SOSP ’97, ACM, pp. 38–51.
[21] GORDON, A., AMIT, N., HAR’EL, N., BEN-YEHUDA, M., LANDAU, A., SCHUS-
TER, A., AND TSAFRIR, D. ELI: Bare-Metal Performance for I/O Virtualization.
In Proceedings of the 17th International Conference on Architectural Support for
Programming Languages and Operating Systems (2012), pp. 411–422.
[22] GUSTAFSSON, J., BETTS, A., ERMEDAHL, A., AND LISPER, B. The Mälardalen
WCET Benchmarks – Past, Present and Future. In Proceedings of the 10th Interna-
tional Workshop on Worst-Case Execution Time Analysis (July 2010), B. Lisper, Ed.,
OCG, pp. 137–147.
[23] HAND, S., WARFIELD, A., FRASER, K., KOTSOVINOS, E., AND MAGENHEIMER,
D. Are Virtual Machine Monitors Microkernels Done Right? In Proceedings of the
10th Conference on Hot Topics in Operating Systems - Volume 10 (Berkeley, CA,
USA, 2005), HOTOS’05, USENIX Association, pp. 1–1.
[24] INAM, R., MAHMUD, N., BEHNAM, M., NOLTE, T., AND SJÖDIN, M. The Multi-
Resource Server for Predictable Execution on Multi-core Platforms. In Proceedings
of the 20th IEEE Real-Time and Embedded Technology and Applications Symposium
(April 2014), pp. 1–12.
[25] INTEL. Intel 64 and IA-32 Architectures Software Developer’s Manual Combined
Volumes 3A, 3B, 3C, and 3D: System Programming Guide, 2014.
[26] JALEEL, A., NAJAF-ABADI, H. H., SUBRAMANIAM, S., STEELY, S. C., AND
EMER, J. CRUISE: Cache Replacement and Utility-aware Scheduling. In Proceed-
ings of the 17th International Conference on Architectural Support for Programming
Languages and Operating Systems (2012), pp. 249–260.
[27] KATO, S., YAMASAKI, N., AND ISHIKAWA, Y. Semi-partitioned Scheduling of
Sporadic Task Systems on Multiprocessors. In Proceedings of the 21st Euromicro
Conference on Real-Time Systems (July 2009), pp. 249–258.
[28] KIM, H., KANDHALU, A., AND RAJKUMAR, R. A Coordinated Approach for Prac-
tical OS-level Cache Management in Multi-core Real-Time Systems. In Proceedings
of the 25th Euromicro Conference on Real-Time Systems (ECRTS) (July 2013).
[29] KIM, N., WARD, B. C., CHISHOLM, M., FU, C.-Y., ANDERSON, J. H., AND
SMITH, F. D. Attacking the One-Out-Of-m Multicore Problem by Combining Hard-
ware Management with Mixed-Criticality Provisioning. In Proceedings of the 22nd
IEEE Real-Time and Embedded Technology and Applications Symposium (2016),
RTAS ’16.
[30] KIVITY, A., KAMAY, Y., LAOR, D., LUBLIN, U., AND LIGUORI, A. KVM: the
Linux Virtual Machine Monitor. In Proceedings of the 2007 Ottawa Linux Symposium
(2007), OLS ’07.
[31] KOCHER, P. C., JAFFE, J., AND JUN, B. Differential Power Analysis. In Pro-
ceedings of the 19th Annual International Cryptology Conference on Advances in
Cryptology (London, UK, 1999), CRYPTO ’99, Springer-Verlag, pp. 388–397.
[32] LI, Y., WEST, R., AND MISSIMER, E. A Virtualized Separation Kernel for Mixed
Criticality Systems. In Proceedings of the 10th ACM SIGPLAN/SIGOPS Interna-
tional Conference on Virtual Execution Environments (New York, NY, USA, 2014),
VEE ’14, ACM, pp. 201–212.
[33] LICKLY, B., LIU, I., KIM, S., PATEL, H. D., EDWARDS, S. A., AND LEE, E. A.
Predictable Programming on a Precision Timed Architecture. In Proceedings of the
2008 International Conference on Compilers, Architectures and Synthesis for Em-
bedded Systems (2008), CASES ’08, pp. 137–146.
[34] LIN, J., LU, Q., DING, X., ZHANG, Z., ZHANG, X., AND SADAYAPPAN, P. Gain-
ing Insights into Multicore Cache Partitioning: Bridging the Gap between Simu-
lation and Real Systems. In Proceedings of the 14th International Symposium on
High-Performance Computer Architecture (2008), pp. 367–378.
[35] LIU, C. L., AND LAYLAND, J. W. Scheduling Algorithms for Multiprogramming
in a Hard-Real-Time Environment. J. ACM 20, 1 (Jan. 1973), 46–61.
[36] LOPEZ, J. M., DIAZ, J. L., AND GARCIA, D. F. Minimum and Maximum Utiliza-
tion Bounds for Multiprocessor RM Scheduling. In Proceedings of the 13th Euromi-
cro Conference on Real-Time Systems (2001), ECRTS ’01, pp. 67–75.
[37] LOPEZ, J. M., GARCIA, M., DIAZ, J. L., AND GARCIA, D. F. Worst-case Utiliza-
tion Bound for EDF Scheduling on Real-Time Multiprocessor Systems. In Proceed-
ings of the 12th Euromicro Conference on Real-Time Systems (2000), ECRTS ’00,
pp. 25–33.
[38] LU, Q., LIN, J., DING, X., ZHANG, Z., ZHANG, X., AND SADAYAPPAN, P. Soft-
OLP: Improving Hardware Cache Performance through Software-Controlled Object-
Level Partitioning. In Proceedings of the 18th International Conference on Parallel
Architectures and Compilation Techniques (2009), pp. 246–257.
[39] MANCUSO, R., DUDKO, R., BETTI, E., CESATI, M., CACCAMO, M., AND PEL-
LIZZONI, R. Real-time Cache Management Framework for Multi-core Architec-
tures. In Proceedings of the 2013 IEEE 19th Real-Time and Embedded Technol-
ogy and Applications Symposium (RTAS) (Washington, DC, USA, 2013), RTAS ’13,
IEEE Computer Society, pp. 45–54.
[40] MERCER, C. W., SAVAGE, S., AND TOKUDA, H. Processor Capacity Reserves: Op-
erating System Support for Multimedia Applications. In Proceedings of the IEEE In-
ternational Conference on Multimedia Computing and Systems (May 1994), pp. 90–
99.
[41] MOSCIBRODA, T., AND MUTLU, O. Memory Performance Attacks: Denial of
Memory Service in Multi-core Systems. In Proceedings of 16th USENIX Security
Symposium on USENIX Security Symposium (Berkeley, CA, USA, 2007), SS’07,
USENIX Association, pp. 18:1–18:18.
[42] NESBIT, K. J., AGGARWAL, N., LAUDON, J., AND SMITH, J. E. Fair Queuing
Memory Systems. In Proceedings of the 39th Annual IEEE/ACM International Sym-
posium on Microarchitecture (2006), MICRO 39, pp. 208–222.
[43] NIKOLAEV, R., AND BACK, G. VirtuOS: An Operating System with Kernel Virtual-
ization. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems
Principles (New York, NY, USA, 2013), SOSP ’13, ACM, pp. 116–132.
[44] PARK, Y., HENSBERGEN, E. V., HILLENBRAND, M., INGLETT, T., ROSENBURG,
B., RYU, K. D., AND WISNIEWSKI, R. W. FusedOS: Fusing LWK Performance
with FWK Functionality in a Heterogeneous Environment. In 2012 IEEE 24th Inter-
national Symposium on Computer Architecture and High Performance Computing
(Oct 2012), pp. 211–218.
[45] PETER, S., LI, J., ZHANG, I., PORTS, D. R. K., WOOS, D., KRISHNAMURTHY,
A., ANDERSON, T., AND ROSCOE, T. Arrakis: The Operating System is the Control
Plane. In 11th USENIX Symposium on Operating Systems Design and Implementa-
tion (OSDI 14) (CO, 2014), USENIX Association, pp. 1–16.
[46] QURESHI, M. K., AND PATT, Y. N. Utility-Based Cache Partitioning: A Low-
Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchi-
tecture (2006), pp. 423–432.
[47] REDMON, J. Darknet: Open Source Neural Networks in C.
http://pjreddie.com/darknet/, 2013–2016.
[48] SCHATZBERG, D., CADDEN, J., DONG, H., KRIEGER, O., AND APPAVOO, J.
EbbRT: A Framework for Building Per-Application Library Operating Systems. In
12th USENIX Symposium on Operating Systems Design and Implementation (OSDI
16) (GA, 2016), USENIX Association, pp. 671–688.
[49] SHA, L., CACCAMO, M., MANCUSO, R., KIM, J.-E., YOON, M.-K., PELLIZ-
ZONI, R., YUN, H., KEGLEY, R., PERLMAN, D., ARUNDALE, G., AND BRAD-
FORD, R. Single Core Equivalent Virtual Machines for Hard Real-Time Computing
on Multicore Processors. Tech. rep., University of Illinois at Urbana-Champaign,
October 2014.
[50] SHERWOOD, T., CALDER, B., AND EMER, J. Reducing Cache Misses using Hard-
ware and Software Page Placement. In Proceedings of the 13th International Con-
ference on Supercomputing (1999), pp. 155–164.
[51] SHIMOSAWA, T., GEROFI, B., TAKAGI, M., NAKAMURA, G., SHIRASAWA, T.,
SAEKI, Y., SHIMIZU, M., HORI, A., AND ISHIKAWA, Y. Interface for Heteroge-
neous Kernels: A Framework to Enable Hybrid OS Designs Targeting High Perfor-
mance Computing on Manycore Architectures. In 2014 21st International Confer-
ence on High Performance Computing (HiPC) (Dec 2014), pp. 1–10.
[52] SIEMENS. Jailhouse: Linux-based Partitioning Hypervisor.
https://github.com/siemens/jailhouse/, 2016.
[53] SMALL, C., AND SELTZER, M. I. A Comparison of OS Extension Technologies. In
USENIX Annual Technical Conference (1996), pp. 41–54.
[54] SOARES, L., AND STUMM, M. FlexSC: Flexible System Call Scheduling with
Exception-less System Calls. In Proceedings of the 9th USENIX Conference on Op-
erating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI’10,
USENIX Association, pp. 33–46.
[55] SOARES, L., TAM, D., AND STUMM, M. Reducing the Harmful Effects of Last-
Level Cache Polluters with an OS-level, Software-only Pollute Buffer. In Proceed-
ings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
(2008), pp. 258–269.
[56] SPRUNT, B., SHA, L., AND LEHOCZKY, J. Aperiodic Task Scheduling for Hard
Real-Time Systems. Real-Time Systems Journal 1, 1 (1989), 27–60.
[57] STANOVICH, M., BAKER, T. P., WANG, A. I., AND HARBOUR, M. G. Defects of
the POSIX Sporadic Server and How to Correct Them. In Proceedings of the 16th
IEEE Real-Time and Embedded Technology and Applications Symposium (2010).
[58] SUH, G. E., DEVADAS, S., AND RUDOLPH, L. A New Memory Monitoring Scheme
for Memory-aware Scheduling and Partitioning. In Proceedings of the 8th Inter-
national Symposium on High-Performance Computer Architecture (2002), pp. 117–
128.
[59] SWIFT, M. M., BERSHAD, B. N., AND LEVY, H. M. Improving the Reliability of
Commodity Operating Systems. In Proceedings of the Nineteenth ACM Symposium
on Operating Systems Principles (New York, NY, USA, 2003), SOSP ’03, ACM,
pp. 207–222.
[60] SZEFER, J., KELLER, E., LEE, R. B., AND REXFORD, J. Eliminating the Hypervi-
sor Attack Surface for a More Secure Cloud. In Proceedings of the 18th ACM Con-
ference on Computer and Communications Security (New York, NY, USA, 2011),
CCS ’11, ACM, pp. 401–412.
[61] TAM, D., AZIMI, R., SOARES, L., AND STUMM, M. Managing Shared L2 Caches
on Multicore Systems in Software. In Proceedings of the Workshop on the Interaction
between Operating Systems and Computer Architecture (2007).
[62] TAM, D. K., AZIMI, R., SOARES, L. B., AND STUMM, M. RapidMRC: Approx-
imating L2 Miss Rate Curves on Commodity Systems for Online Optimizations. In
Proceedings of the 14th International Conference on Architectural Support for Pro-
gramming Languages and Operating Systems (2009), pp. 121–132.
[63] TAYLOR, G., DAVIES, P., AND FARMWALD, M. The TLB Slice–A Low-cost High-
speed Address Translation Mechanism. In Proceedings of the 17th Annual Interna-
tional Symposium on Computer Architecture (1990), pp. 355–363.
[64] TU-DRESDEN. L4linux. https://l4linux.org/, 2016.
[65] VALSAN, P. K., YUN, H., AND FARSHCHI, F. Taming Non-blocking Caches to
Improve Isolation in Multicore Real-Time Systems. In Proceedings of the 22nd IEEE
Real-Time and Embedded Technology and Applications Symposium (2016), RTAS
’16.
[66] VERMA, A., PEDROSA, L., KORUPOLU, M., OPPENHEIMER, D., TUNE, E., AND
WILKES, J. Large-scale Cluster Management at Google with Borg. In Proceed-
ings of the Tenth European Conference on Computer Systems (New York, NY, USA,
2015), EuroSys ’15, ACM, pp. 18:1–18:17.
[67] WARD, B. C., HERMAN, J. L., KENNA, C. J., AND ANDERSON, J. H. Making
Shared Caches More Predictable on Multicore Platforms. In Proceedings of the 25th
Euromicro Conference on Real-Time Systems (ECRTS) (July 2013).
[68] WEST, R., ZAROO, P., WALDSPURGER, C. A., AND ZHANG, X. Online Cache
Modeling for Commodity Multicore Processors. SIGOPS Oper. Syst. Rev. 44, 4 (Dec.
2010), 19–29.
[69] WEST, R., ZAROO, P., WALDSPURGER, C. A., AND ZHANG, X. CAFÉ: Cache-Aware
Fair and Efficient Scheduling for CMPs. In Multicore Technology: Architecture,
Reconfiguration and Modeling. CRC Press, 2013.
[70] XIE, Y., AND LOH, G. Dynamic Classification of Program Memory Behaviors in
CMPs. In the 2nd Workshop on Chip Multiprocessor Memory Systems and Intercon-
nects (2008).
[71] YE, Y., WEST, R., CHENG, Z., AND LI, Y. COLORIS: A Dynamic Cache Par-
titioning System Using Page Coloring. In Proceedings of the 23rd International
Conference on Parallel Architectures and Compilation (2014), PACT ’14, ACM,
pp. 381–392.
[72] YE, Y., WEST, R., ZHANG, J., AND CHENG, Z. MARACAS: A Real-Time Mul-
ticore VCPU Scheduling Framework. In Proceedings of the 37th IEEE Real-Time
Systems Symposium (2016), RTSS ’16.
[73] YUN, H., MANCUSO, R., WU, Z.-P., AND PELLIZZONI, R. PALLOC: DRAM
Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.
In Proceedings of the 20th IEEE Real-Time and Embedded Technology and Applica-
tions Symposium (2014), RTAS ’14.
[74] YUN, H., YAO, G., PELLIZZONI, R., CACCAMO, M., AND SHA, L. Mem-
Guard: Memory Bandwidth Reservation System for Efficient Performance Isolation
in Multi-core Platforms. In Proceedings of the 19th IEEE Real-Time and Embedded
Technology and Applications Symposium (2013), RTAS ’13, pp. 55–64.
[75] ZHANG, X., DWARKADAS, S., AND SHEN, K. Towards Practical Page Coloring-
based Multicore Cache Management. In Proceedings of the 4th ACM European
Conference on Computer Systems (2009), pp. 89–102.
[76] ZHOU, P., PANDEY, V., SUNDARESAN, J., RAGHURAMAN, A., ZHOU, Y., AND
KUMAR, S. Dynamic Tracking of Page Miss Ratio Curve for Memory Management.
In Proceedings of the 11th International Conference on Architectural Support for
Programming Languages and Operating Systems (2004), pp. 177–188.
[77] ZHU, H., AND EREZ, M. Dirigent: Enforcing QoS for Latency-Critical Tasks on
Shared Multicore Systems. In Proceedings of the 21st ACM International Conference
on Architectural Support for Programming Languages and Operating Systems
(ASPLOS) (April 2016).
[78] ZHURAVLEV, S., BLAGODUROV, S., AND FEDOROVA, A. Addressing Shared Re-
source Contention in Multicore Processors via Scheduling. In Proceedings of the
Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages
and Operating Systems (2010), ASPLOS XV, pp. 129–142.
Curriculum Vitae
