Modeling Microarchitectural Side Channel Attacks for Fun & Profit by Ravichandran, Joseph
© 2021 Joseph Ravichandran
IF YOU WANTED IT TO BE SECURE, THEN YOU
SHOULDN’T HAVE PUT A RING IN IT:
Modeling Microarchitectural Side Channel Attacks for Fun & Profit
By
Joseph Ravichandran
Senior Thesis in Computer Engineering




The discovery of hardware vulnerabilities has become increasingly
frequent in recent years. In the wake of the Spectre and Meltdown attacks,
architectural security research has exposed many flaws in the processors we
use every day that we trust to be secure. The purpose of this research is to
construct a detailed architectural simulator for understanding, analyzing, and
prototyping attacks in multicore computer systems.
This work presents a custom multicore RISC-V computer system, complete
with a custom multitasking kernel and simulation/RTL verification environment.
All of the hardware (each individual processor core, the memory and
cache hierarchies, and the ring interconnect) was designed with the intention
of its use in analyzing real-world architectural attacks. This system allows
for deep introspection into the state of the processor system, allowing for
a complete understanding of the exact mechanisms of architectural attacks.
The computer system implements an interconnect and cache model similar to
those of real-world systems, allowing recent real-world attacks to be modeled.
The simulation environment provides an array of tools that provide realtime
deep introspection into the architectural state of the processor, allowing the
user to have a complete picture of the attack being modeled at a glance. The
simulator also provides GDB debugging support directly to the core, allowing
the user to debug their attack software running on the simulated processor in
realtime.
To my fellow beginners of the world, may we one day become graceful experts.
Acknowledgments
First, I would like to thank Prof. Chris Fletcher and Riccardo Paccagnella for
their support throughout this project. Without their support and enthusiasm
I would not have been able to complete this work. You made research not
only intellectually rewarding, but extremely fun as well. Thank you for your
willingness to support me!
I would like to thank Prof. Sarita Adve, Prof. Rakesh Kumar, Prof. Josh
Mason, Muhammad Huzaifa, and Sam Grayson for being incredible mentors
throughout my journey into academic research. Thank you for helping me
find my way. I would also like to thank Prof. Deming Chen for offering ECE
527 (SoC Design), where I began the work that eventually grew into this
thesis. I would also like to thank Mrs. Schoeler, Mr. Bosco, Mrs. Hyma, and
Mr. Leszczynski for inspiring a love of learning that has only grown stronger
as the years have gone on. Thank you for teaching me to stay curious!
I would like to also thank the members of SIGPwny— Josh, Ian, Ankur,
Kuilin, Jesse, Thomas, Chris, Nathan, and everyone else, thank you for
pushing me to learn more, think differently, and for your contagious passion
about this wonderful and terrifying subject. Hack the planet!
I could not complete this list without acknowledging Finn Sinclair and Kuilin
Li, both of whom I’ve worked with on various projects and research these
past four years. Finn and Kuilin have been a constant source of laughs, deep
discussions, and inspiration these past four years. I’ll never forget our late
nights at Everitt Lab! Dead Coders Society for life.
Finally, none of this would be possible without the support from my family.
Mom, Doc, and Joshua— thanks for everything.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
2.1 Side Channels
2.2 Caching
2.3 Multicore Interconnects
2.4 Covert Channels
2.5 Cache Attacks
2.6 Ring Interconnect Attacks
2.7 Existing Cores and Simulators
CHAPTER 3 HARDWARE
3.1 Core Architecture
3.2 Memory Hierarchy
3.3 Ring Interconnect
CHAPTER 4 SOFTWARE
4.1 System Bringup
4.2 Syscalls, Exceptions, and Interrupts
4.3 Processes
4.4 Scheduling and Context Switching
4.5 Compiling and Linking
CHAPTER 5 SIMULATOR
5.1 Debugging Server
5.2 Graphics
5.3 Visualizers
5.4 Runtime Verification
CHAPTER 6 ATTACK MODELS
6.1 Using the High Resolution Timer
6.2 Prime+Probe Cache Attack
6.3 Lord of the Rings Interconnect Attack
CHAPTER 7 CONCLUSION




CHAPTER 1 INTRODUCTION
In 2018, security researcher Christopher Domas demonstrated a hardware
backdoor in the VIA C3 family of x86 CPUs that allows for a complete bypass
of the x86 ring protection model [1]. By triggering a sequence of undocumented
instructions, a userspace program can run kernel-mode instructions on a
hidden coprocessor included in every VIA C3 CPU, defeating any potential
software exploit mitigations. This debug coprocessor was included with every
VIA C3 CPU taped out during its production run without any references to
the hidden ISA (Instruction Set Architecture) in the CPU documentation.
Vendors including this CPU in their products had no idea such a coprocessor
even existed. Problems such as this naturally lead to the key question
motivating this work, namely:
How can we be sure a given processor is secure?
This work will present a custom multicore RISC-V based [2] processor system,
multitasking kernel, and realtime simulation engine. This processor system
was designed from the ground up with the intention that it be used for
modeling and studying architectural attacks. The custom multitasking kernel
provides support to the attack software running on the simulated processor
system, and the simulator engine provides intuitive realtime analysis features.
These features make analyzing attacks on the system simple and efficient.
This work will then apply this processor system to model recent architectural
attacks against real-world multicore systems, using the unique debugging and
visualization features built into the simulator. We will attempt to answer
the question of “How can we be sure a given processor is secure?” by first
understanding precisely what makes processors insecure. We will then show
how the model presented can be used to develop further attack and defense
research for multicore systems.
Figure 1.1 shows a screenshot of the Pretty Secure System simulator environment.
In this screenshot, the simulated Pretty Secure Processor core
has drawn some text and sine waves to the realtime graphics framebuffer by
communicating with the graphics endpoint over the ring interconnect. This
result can be seen in realtime by the user.
Figure 1.1: A screenshot of the Pretty Secure Processor simulator
Chapter 2 will introduce the requisite background information. This includes
an overview of the specifics of caching in modern multicore processor systems,
an introduction to the interconnect model, and a survey of architectural
attacks and existing open-source multicore processors.
Chapter 3 will introduce Pretty Secure System, the processor system presented
by this work. It breaks down the exact implementation details of the hardware,
everything from the processor core architecture through the memory hierarchy
and ring interconnect.
Chapter 4 will describe the kernel that runs on Pretty Secure System and the
software features that make prototyping attacks with the simulated processor
simple.
Chapter 5 will describe the runtime simulation and verification engine, and
visualization tools that help researchers analyze attacks on Pretty Secure
System.
Chapter 6 applies the hardware, software, and simulator introduced in Chapters
3, 4, and 5 respectively, to demonstrate how Pretty Secure System is
ideal for modeling architectural attacks.





CHAPTER 2 BACKGROUND
Two recent high-profile architectural attacks, Spectre and Meltdown, changed
the way the computing community at large views architectural attacks [3, 4].
This chapter will provide background on architectural features in modern
systems, and explore a few examples of architectural security attacks on those
features.
2.1 Side Channels
A side channel in computer architecture is a means of leaking information
through a channel that is outside of the architectural state of the CPU. Two
common side channel domains are time-driven (where the amount of time
taken to perform some action leaks information), and power-driven (where the
amount of power used by a resource can be measured to result in information
leaks). This work will explore time-driven side channel attacks.
2.2 Caching
Modern processors use an optimization called caching, wherein recently used
memory is stored in a small region of high-speed storage near the processor
core [5, 6]. This allows for a processor to quickly retrieve commonly used data
without needing to access main memory. Due to the asymmetry inherent to
hierarchical memory models, architectural state can be revealed by inspecting
the latency of accessing a particular line of memory. That is, data in the
cache will be faster to access than data not in the cache. This results in a
variety of time-based side channels present inside the memory hierarchy of
modern CPUs.
Cache hierarchies for computer systems typically partition the cache into a
variety of levels referred to as L1, L2, L3, and so on, where the last level
cache in a particular system is referred to as the “LLC” (or “last level cache”).
L1 is the highest level of the cache (closest to the main core), and the LLC
is the lowest level of the cache (closest to the memory controller). A cache
hierarchy is said to be inclusive if all memory present in a given level is also
present in every lower level [5]. For example, in an inclusive cache hierarchy,
memory lines present in L1 are also present in L2 and L3.
On multicore systems, each core will typically have its own private cache [5].
While cache hierarchies vary from implementation to implementation, the
system modeled by this work will use the common practice of having the
L1 and L2 caches be local per-core, and the L3 being an inclusive last-level
cache shared across all cores [5]. The L1 cache is typically split into two
halves, one half for instructions and the other for data [5]. Requests from
both halves are served by the unified, inclusive L2 cache, which stores both
instructions and data. In modern
processors, it is common to partition the LLC into a distributed set of slices
that are located at different points around the interconnect. Time to access
a given slice may vary from core to core depending on the topology of the
interconnect and the number of memory lanes present. Slices are typically
sized to evenly distribute the load across the interconnect, and prevent the
LLC from becoming a single serialization point for all cores. By having
multiple LLC slices operate independently, greater parallelism is achieved,
which improves system performance and resource utilization.
Caches organize memory into lines, which are groupings of multiple adjacent
memory addresses. A typical cache line size is 64 Bytes [7], as in the AMD
Opteron Processor [5]. A set is a group of lines sharing some common bit
pattern. For example, a cache with eight sets might use bits 10-12 in the
address to determine which set a given address falls in [6]. Addresses where
all three bits are 0 might map to the first set; addresses where bit 10 is a 1
and the rest are 0 might map to the second, and so on. Associativity is
the property of a cache to fit multiple addresses that map to the same set
at the same time. A direct mapped cache is a cache where every memory
address can only map to one line in the cache. A fully associative cache can
fit any address at any line in the cache. A set associative cache is a mix
between the two— a given address can fit into one of a subset of cache lines,
determined by which set the address falls into. The number of lines belonging
to a given set in the cache is called the number of ways belonging to that
set. Thus a set associative cache’s dimensions are described by its number of
ways, number of sets, and number of bytes in a line. A cache hit occurs when
the requested address is present in the cache. A cache miss occurs when the
requested address is not present in the cache, and incurs extra latency as the
cache needs to read the missing line from lower memory, potentially writing
back an evicted line.
The tag is the upper bits of an address that are stored alongside the cached
data. Together, the tag and set allow the cache to know if a particular address
is present in the cache or not. The tag is necessary as multiple addresses may
map to the same set, and without a tag, there is no way of knowing which
address is currently stored in the cache. Addresses that have the same tag and
set can be distinguished by their line offset. The line offset determines how
far into the line this address is located. Since multiple adjacent addresses are
stored together in a line, the line offset is required to fully describe the location
of the data in the cache. Table 2.1 shows how an address can be decoded to
determine its set, line offset, and tag. The bit widths are intentionally left
out, as these vary depending on the exact dimensions of the cache.
Table 2.1: Address fields used during cache lookup
Tag | Set | Line Offset
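As a concrete illustration of the decoding in Table 2.1, the following Python sketch splits an address into its three fields. The geometry (64-byte lines, eight sets) and the names LINE_BYTES, NUM_SETS, and decode_address are assumptions chosen for illustration, and the set index is taken from the bits directly above the line offset. These are not the parameters of the caches in Pretty Secure System.

```python
# Hypothetical geometry for illustration only.
LINE_BYTES = 64   # bytes per line -> 6 line-offset bits
NUM_SETS = 8      # number of sets -> 3 set-index bits

OFFSET_BITS = LINE_BYTES.bit_length() - 1
SET_BITS = NUM_SETS.bit_length() - 1


def decode_address(addr):
    """Split an address into (tag, set index, line offset)."""
    offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset


if __name__ == "__main__":
    print(decode_address(0x1234))
```

Two addresses with the same set index but different tags compete for the same set, which is the property the attacks in Section 2.5 rely on.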
Figure 2.1: Generic Cache
Figure 2.1 shows a 4-way set-associative cache with six sets. This convention
will be used for cache illustrations throughout the rest of this paper. Each
smaller box represents one cache line. Each cache set is a row containing
multiple ways. The ways are represented as columns. A given address can only
map to one set, but can be placed in any way within that set. Graphically,
this means that an address must fit into one of the boxes within its associated
row.
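The row-and-column picture above can be sketched as a toy model in Python. This is a minimal sketch, not the thesis's hardware: the class name SetAssociativeCache, the default dimensions, and the first-in-first-out eviction are assumptions made for brevity (the caches in this work use Pseudo Least-Recently Used replacement, described next).

```python
class SetAssociativeCache:
    """Toy set-associative cache that tracks only which tags are resident."""

    def __init__(self, num_sets=4, num_ways=2, line_bytes=16):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.line_bytes = line_bytes
        # One list of resident tags per set (one row of boxes in Figure 2.1).
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        """Return "hit" or "miss", inserting the line on a miss."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            return "hit"
        if len(ways) == self.num_ways:
            ways.pop(0)  # set is full: evict the oldest line (FIFO assumption)
        ways.append(tag)
        return "miss"
```

With the default dimensions, addresses 0, 64, and 128 all map to set 0, so accessing all three forces an eviction even though the other sets are empty.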
When a set in the cache is full and a new address needs to be inserted, one of
the old lines must be evicted (removed from the cache). There are several
common algorithms to choose which line to evict. This work uses the Pseudo
Least-Recently Used algorithm, which approximates the least recently used
address using a binary decision tree [5]. The tree has a leaf for every line
in the set. To choose the line to evict, the algorithm traverses the tree,
choosing the line corresponding to the leaf that the tree points to. Each time
a line is accessed, this tree is updated to point away from the most recently
used line. Over time, this will approximate the least recently used line in the
cache. This algorithm is useful because it uses very little metadata regarding
the cache set and can be quickly evaluated. As caches can grow to be quite
large, having a fast and low-overhead eviction policy is important.
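The tree-based scheme can be sketched for a single 4-way set as follows. This is an illustrative Python model under the assumption of a three-node binary tree per set; the class name TreePLRU4 and the bit encoding are hypothetical and are not taken from the Pretty Secure Processor RTL.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way set (three one-bit nodes)."""

    def __init__(self):
        self.root = 0   # 0: next victim is in ways 0-1, 1: in ways 2-3
        self.left = 0   # 0: victim is way 0, 1: victim is way 1
        self.right = 0  # 0: victim is way 2, 1: victim is way 3

    def touch(self, way):
        """Point every node on the path away from the way just used."""
        if way < 2:
            self.root = 1
            self.left = 1 - way
        else:
            self.root = 0
            self.right = 3 - way

    def victim(self):
        """Follow the node bits from the root to the way to evict."""
        if self.root == 0:
            return self.left
        return 2 + self.right
```

Only three bits of state are needed per set. The result is an approximation: after touching ways 0, 1, 2, 3, and then 0 again, this model selects way 2 as the victim, whereas true LRU would select way 1.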
Caches need to track the state of the lines they contain, as a line may have
been written to during cache operation. This work implements a writeback
cache policy for managing writes [5]. A line can be invalid, clean, or dirty
[5]. An invalid line means it is not in use and should not be read from. A
clean line is a line that has not been written to. Clean lines contain the same
memory data as the lower memory source that this cache operates on. Clean
lines can be evicted without concern. Dirty lines are lines that the cache has
modified locally but whose changes have not yet reached lower memory. When a
dirty line is evicted, the modified value must first be written to lower memory
before the line is reused. Caches can also implement a write-through policy,
where writes are propagated to lower memory when they happen [5].
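The three line states and the writeback rule can be sketched with a deliberately tiny model: a cache holding a single line in front of a dictionary that stands in for lower memory. The names LineState and OneLineWritebackCache are hypothetical, and the one-line capacity is an assumption made to keep the example short.

```python
from enum import Enum


class LineState(Enum):
    INVALID = 0
    CLEAN = 1
    DIRTY = 2


class OneLineWritebackCache:
    """A cache with exactly one line, using a writeback policy."""

    def __init__(self, memory):
        self.memory = memory  # dict: address -> value (stands in for lower memory)
        self.state = LineState.INVALID
        self.addr = None
        self.value = None

    def _load(self, addr):
        if self.state is not LineState.INVALID and self.addr == addr:
            return  # hit: nothing to do
        if self.state is LineState.DIRTY:
            self.memory[self.addr] = self.value  # write back before reuse
        self.addr, self.value = addr, self.memory[addr]
        self.state = LineState.CLEAN

    def read(self, addr):
        self._load(addr)
        return self.value

    def write(self, addr, value):
        self._load(addr)
        self.value = value
        self.state = LineState.DIRTY  # lower memory is now stale
```

A write leaves lower memory untouched until the dirty line is evicted by an access to a different address. A write-through cache would instead update lower memory inside write().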
Side channels present within shared cache hierarchies can be used to leak
information about the architectural state across privilege boundaries. Malicious
attacker processes may be able to reveal more privileged information,
which can be used to exploit the system. Side channels in the cache system
may also allow covert communication channels between processes. This work
presents a custom CPU core that includes a cache hierarchy that models
timing side channels present in real caches used by today’s systems. Namely,
this work will demonstrate a viable cache-based timing side channel attack
that establishes a covert channel between two processes, allowing them to
communicate without shared memory or inter-process communication via the
kernel.
2.3 Multicore Interconnects
The interconnect is the means by which multiple cores communicate with one
another in multicore systems. Typically, this is the mechanism through which
the cores synchronize their caches to achieve multicore cache coherence.
Cache coherence is the property where all cores agree on the contents of
memory— writes from one core are visible to other cores [5]. The interconnect
is also used for other things, such as issuing inter-processor interrupts (IPIs)
and transferring data between L2 and the LLC when L2 misses in a given
core [7].
The processor system presented in this work includes a ring style interconnect
that models the architecture exploited by the Lord of the Rings attack [7].
While there are multiple ways to attack the ring interconnect, this work will
demonstrate a Lord of the Rings style cross-core covert channel on the shared
last level cache where the sender and receiver cores are targeting the same
last level cache slice. The interconnect presented in this work does not feature
full cross-core cache coherence, as the ring interconnect attack presented does
not rely on cache coherence to function. Future work may include adding
support for cache coherence protocols to examine side channels related to
coherence, such as side channel attacks in cache coherence directories [8].
2.4 Covert Channels
A covert channel is the use of a side channel for communicating between
two processes that should normally not be able to communicate. In a covert
channel scenario, there will be a sender and receiver process that are both
attempting to communicate using a side channel.
The processor system presented in this work demonstrates the presence
of exploitable side channels by establishing covert channels. These covert
channels illustrate the usefulness of the side channel primitives exposed by
the architecture.
2.5 Cache Attacks
Caching systems are common victims of timing side channel attacks due
to their inherently asymmetric access times. Prime+Probe is
a technique for leaking information from the cache between two processes
resident on the same core [9, 10]. In a Prime+Probe attack, the attacker first
“primes” the shared cache by filling it with attacker controlled data. Then,
the victim process may access some data, causing attacker lines to be evicted
from the cache. Finally, the attacker “probes” the cache by re-accessing
its previously inserted data and measuring the latency of each
access. Lines that were evicted by the victim process will take longer to
load. From this timing analysis, the attacker can determine which lines the
victim accessed, and can gain information about the execution context of the
victim process. The Prime+Probe attack does not rely on shared memory,
and works when the attacker and victim process are using the same cache
hierarchy [9, 10].
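The prime, wait, and probe steps can be sketched against a toy model of a single cache set. Everything here is an assumption for illustration: the latencies are made-up cycle counts, the set uses true LRU replacement rather than the Pseudo-LRU policy used in this work, and the names ToySet, prime, and probe are hypothetical.

```python
HIT_CYCLES, MISS_CYCLES = 1, 10  # made-up latencies
WAYS = 4


class ToySet:
    """One cache set with true-LRU replacement (a simplification)."""

    def __init__(self):
        self.lines = []  # most recently used tag is last

    def access(self, tag):
        """Access a tag and return the simulated latency in cycles."""
        if tag in self.lines:
            self.lines.remove(tag)
            self.lines.append(tag)
            return HIT_CYCLES
        if len(self.lines) == WAYS:
            self.lines.pop(0)  # evict the least recently used line
        self.lines.append(tag)
        return MISS_CYCLES


def prime(cache_set, attacker_tags):
    """Fill the set with attacker-controlled lines."""
    for tag in attacker_tags:
        cache_set.access(tag)


def probe(cache_set, attacker_tags):
    """Re-access the attacker lines and return the total latency."""
    return sum(cache_set.access(tag) for tag in attacker_tags)
```

In this model, a probe after an idle victim costs 4 cycles, while a probe after a single victim access to the same set costs 40 cycles, because the victim's line displaces one attacker line and each probe miss then displaces the next.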
When using the Prime+Probe technique to create a covert channel, contention
only needs to be created on a single set of the cache, because contention
on a single line is enough to allow communication between
two processes. An eviction set is a set of addresses that, when accessed,
will result in the eviction of all other lines from a particular set [11]. Using
an eviction set to evict only a single set (instead of the entire cache) can
significantly increase the bandwidth of covert cross-process communication
while using Prime+Probe. On Pretty Secure System, the cache mapping
function for the default caches is linear, which makes crafting eviction sets
straightforward. Chapter 6 will explore the creation and use of eviction sets
on Pretty Secure System for creating a Prime+Probe covert channel. Figure
2.2 shows an overview of the Prime+Probe attack using the top set as the
eviction set. The cache pictured in Figure 2.2 is a 4-way set associative cache,
so the eviction set used by the sender process involves four addresses.
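Under a linear mapping, an eviction set can be computed directly from the cache dimensions. The sketch below assumes that "linear" means the set index is the line number modulo the number of sets; the function names and the example dimensions are hypothetical and are not the default Pretty Secure System cache parameters.

```python
def set_index(addr, num_sets, line_bytes):
    """Set index under the assumed linear (modulo) mapping."""
    return (addr // line_bytes) % num_sets


def eviction_set(target_set, num_sets, num_ways, line_bytes):
    """Return num_ways line-aligned addresses that all map to target_set."""
    stride = num_sets * line_bytes  # addresses this far apart share a set
    first = target_set * line_bytes
    return [first + way * stride for way in range(num_ways)]


if __name__ == "__main__":
    # Example: 8 sets, 4 ways, 64-byte lines, targeting set 3.
    print([hex(a) for a in eviction_set(3, 8, 4, 64)])
```

Accessing every address in the returned list fills all ways of the target set, so any line the other process had in that set is evicted.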
Figure 2.2: Overview of the Prime+Probe attack
The Flush+Reload technique is another cache timing attack technique that
targets the last level cache, allowing for cross-core cache-based timing attacks
[12]. Flush+Reload relies on shared memory between two processes running
on different cores and requires an inclusive last level cache. Flush+Reload
also requires access to an architectural means of flushing a cache line. On x86
architectures, the clflush instruction causes a given address to be evicted
from the cache [13]. If a shared address between two processes is evicted from
the inclusive last level cache, it will also be evicted from all cores’ private
caches. When the victim re-accesses that line, it will be brought back into the
last level cache. The attacker can then reload that address and measure the
time taken to reload. If the access latency was small, the address was reloaded
into the last level cache by the victim. Otherwise, the victim never accessed
that line. By analyzing the victim memory usage patterns, an attacker can
leak information about the victim process [14].
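The flush, wait, and reload sequence can be sketched against a toy shared cache. This is a model under stated assumptions, not real hardware behavior: the latencies and threshold are made-up values, the cache is modeled as a set of resident addresses, and ToySharedCache and victim_touched are hypothetical names.

```python
HIT_CYCLES, MISS_CYCLES = 1, 10  # made-up latencies
THRESHOLD = 5                    # latency below this is treated as a hit


class ToySharedCache:
    """Shared last level cache modeled as a set of resident addresses."""

    def __init__(self):
        self.present = set()

    def flush(self, addr):
        """Evict an address, as clflush would on x86."""
        self.present.discard(addr)

    def access(self, addr):
        """Access an address and return the simulated latency."""
        latency = HIT_CYCLES if addr in self.present else MISS_CYCLES
        self.present.add(addr)
        return latency


def victim_touched(cache, shared_addr, victim):
    """Flush, let the victim run, then time a reload of the shared address."""
    cache.flush(shared_addr)
    victim(cache)
    return cache.access(shared_addr) < THRESHOLD
```

For example, victim_touched(cache, 0x100, lambda c: c.access(0x100)) reports that the victim used the line, while a victim that does nothing yields a slow reload.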
Cache-based attacks have been shown to be quite practical, and are able to
leak the secret keys of cryptographic routines [9, 10, 12].
While this work focuses on attacks and not defenses, a brief overview of some
cache defenses is presented here. The simplest means of disabling cache
attacks is to completely disable caching altogether. The performance penalty
for this is massive, and it is not a practical solution for most applications.
Another option is to partition the caches such that different security domains
cannot contend with one another. Intel Cache Allocation Technology (CAT) is
a means of partitioning the LLC in Intel processors to create various security
contexts that cannot contend with one another in the LLC [15, 16]. Various
other cache partitioning schemes have been proposed [17, 18].
2.6 Ring Interconnect Attacks
Figure 2.3 shows an overview of the Intel Skylake ring interconnect [7, 19].
This interconnect architecture is scalable to any number of CPU cores;
however, the diagram pictured in Figure 2.3 shows four CPU cores with four
corresponding LLC slices. Additional cores can be added by adding
new ring stops adjacent to the other cores following the pattern seen here.
Figure 2.3: Overview of the Intel Ring Interconnect
In Figure 2.3, the CPU cores are pictured at the top, the distributed sliced
LLC on the bottom, and graphics and the system agent off to the sides. The
shaded boxes that link the CPU cores to LLC slices and the ring are referred
to as ring stops. Each core is paired with a particular LLC slice through
an extremely high bandwidth communication channel. As the topological
distance between a core and a particular LLC slice increases, so does the
latency to access that LLC slice. Ring stops allow the core and LLC to
exchange packets with the rest of the ring. The behavior of packet injection
and contention is described below.
When a core misses in its private cache, a request to read from the LLC is
issued along the ring. If the requested line is not present in the LLC, the LLC
will perform a load request through the System Agent which communicates
with DRAM [7, 20]. The ring interconnect consists of four separate rings:
data, request, snoop, and acknowledge [7]. The ring is bidirectional, meaning
packets can be issued in either direction, and packets issued along the ring
always take the shortest possible path [7]. Endpoints on the ring waiting to
inject packets are blocked until a free slot becomes available [20]. The ring
can be thought of as a large train with individual boxcars traveling in circles
around the track. When a boxcar is free, a packet can be injected into the
ring. If the passing boxcar is full, no new packet can be injected [7]. The
Lord of the Rings attack paper [7] presents multiple ways to create contention
on the ring— namely, boxcars can block certain ring stops from injecting new
packets, or contention can be created at ring stops, causing increased latency
due to extra ring traffic [7]. Figure 2.4 shows an overview of the latter attack
model.
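Two of the ring behaviors described above, shortest-path routing and slot-based injection, can be sketched in a few lines of Python. This is a heavily simplified model: it assumes one ring stop per slot, a ring that rotates by one stop per cycle, and slots that never empty while a stop is waiting. The function names are hypothetical.

```python
def ring_hops(src, dst, num_stops):
    """Hop count between two stops on a bidirectional ring (shortest path)."""
    clockwise = (dst - src) % num_stops
    return min(clockwise, num_stops - clockwise)


def cycles_until_inject(slots, stop):
    """Cycles a stop waits before a free slot (boxcar) passes it.

    slots[i] is True if the slot currently at stop i is full. After `wait`
    cycles, the slot that started at position (stop - wait) arrives at `stop`.
    Returns None if every slot is full.
    """
    n = len(slots)
    for wait in range(n):
        if not slots[(stop - wait) % n]:
            return wait
    return None
```

In this model, a stop surrounded by full slots waits longer to inject, which is the latency difference a receiver measures when another core saturates the ring.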
Figure 2.4: Covert Channel Targeting a single LLC Slice
In Figure 2.4, the sender and receiver cores are attempting to communicate
via a covert channel by targeting a previously agreed upon LLC slice. Determining
the mapping between addresses and LLC slices can be hard due to
undocumented slice hash mapping functions being used in different hardware
versions [7]. Assuming the sender and receiver can find addresses that map
to the same LLC slice, they can create a covert channel on the ring. The
receiver process continually causes L2 misses using the eviction set techniques
described in Section 2.5 (Cache Attacks). The receiver process
monitors the time for each L2 miss. When the sender wants to send a “1,” it
also forces an L2 miss. When both the sender and receiver are missing, they
will issue packets that travel the ring and end up in the target LLC slice. The
contention created on this LLC slice results in a noticeably increased latency
observed by the receiver. When the sender wants to send a “0,” it will not perform
any tasks. The receiver will observe decreased latency due to the alleviated
contention on the target LLC slice. In this manner, the two cores can create
a covert channel that does not rely on shared memory. Such a covert channel
has been found to be very fast due to the high bandwidth of the ring
interconnect [7, 20]. SurfNoC [21] is a provably non-interfering approach to
securing network on chip systems that may defend against this attack.
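The receiver's side of this protocol reduces to thresholding a series of latency measurements. The sketch below models that step only. The cycle counts are invented for illustration and are not measurements from Pretty Secure System or from real hardware, and the function names are hypothetical.

```python
BASE_CYCLES = 40        # made-up L2 miss latency with an idle sender
CONTENTION_CYCLES = 15  # made-up extra latency when the sender also misses


def receiver_latency(sender_bit):
    """Latency the receiver observes for one timed L2 miss."""
    return BASE_CYCLES + (CONTENTION_CYCLES if sender_bit else 0)


def transmit(bits):
    """One receiver measurement per bit interval."""
    return [receiver_latency(bit) for bit in bits]


def decode(latencies, threshold=BASE_CYCLES + CONTENTION_CYCLES / 2):
    """Recover bits: slow misses mean the sender was creating contention."""
    return [1 if latency > threshold else 0 for latency in latencies]


if __name__ == "__main__":
    print(decode(transmit([1, 0, 1, 1, 0])))
```

On a real system the measurements are noisy, so a receiver would typically average several samples per bit before applying the threshold.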
2.7 Existing Cores and Simulators
A number of open-source RISC-V multicore systems and simulators
exist. The Vex RISC-V CPU project is a multicore open-source RISC-V
processor that can run on various FPGAs [22]. The RISC-V foundation
maintains a list of open-source cores, many of which are multicore [23]. The
Frizzle testbench is a RISC-V oriented testbench that integrates with the
PULP Platform debugger, a JTAG-based OpenOCD [24] debugger peripheral
for PULP RISC-V cores [24, 25, 26]. The Frizzle testbench uses Verilator [27]
to compile the design, and provides GDB [28] support over OpenOCD [24],
along with a VGA-style graphical window. Since Frizzle relies on hardware
debugging inside the simulated core, it cannot be used easily for simulating
microarchitectural attacks, as the act of debugging using simulated hardware
may modify microarchitectural state.
Instead of using an existing off-the-shelf core and simulator (meaning simulated
peripherals such as a display), this work presents a custom design
compliant with the unprivileged RV32I architecture and some of the RISC-V
privileged architecture [2], designed in SystemVerilog [29]. Most off-the-shelf
debuggers are implemented as extensions to the hardware, which means that
the act of debugging the simulated processor may change the microarchitectural
state. As modeling microarchitectural attacks relies on deterministic
microarchitectural state, these solutions are not viable for this purpose.
Additionally, as this work focuses on modeling a particular attack on a certain
interconnect architecture, total control over the implementation of the
interconnect is required. By building a custom core, system-aware tooling can be
built that integrates directly into the hardware for analysis. Developing the
core alongside the simulated peripherals and visualizers allows for the two
to be built for one another in a vertically integrated way not possible with
off-the-shelf cores. The debugging interface can also be designed from the
ground up to not modify any microarchitectural state for seamless tracing of
the exact mechanisms of the attacks being studied.
The core presented here departs from the RISC-V privileged specification in
a few ways to simplify writing basic system software on the system. These
simplifications were intended to make it easier for system designers to focus
on modeling attacks, and to simplify the hardware logic.
This system uses Verilator [27] to compile the design and connect the C++
testbench to the Verilated design. The simulator features presented here, such
as the graphics, custom debugging server, cache, and interconnect visualizers,
are part of the custom testbench that interacts with the Verilated design.
This system uses the “VGA Font” from the QEMU project [30, 31] in the




CHAPTER 3 HARDWARE
Pretty Secure System is a custom multicore computer system designed
specifically for modeling and prototyping architectural attacks and defenses.
Studying microarchitectural attacks can be difficult in practice due to system
background noise [7, 10]. Pretty Secure System is a platform containing
no unwanted noise, as the user has complete control over the processes,
scheduling, and architectural state. This allows for quickly understanding
the exact exploit primitives and reduces the chance of architects finding false
positives in their research. This level of complete customization also allows for
quickly controlling all processor state that an attacker may want to control,
such as which processes run on which cores, or how full a particular cache
is. This results in a significantly quicker turnaround time for developing
attack concepts. Pretty Secure System is open-source, so researchers do
not need to perform extensive reverse engineering to understand how
components work. These factors make Pretty Secure System an ideal platform
for interactively modeling architectural exploits.
The ability to prototype the efficacy of various attacks and defenses within
a simple, user friendly architecture simulator will enable new architecture
research workflows. Pretty Secure System makes understanding architectural
attacks simple and intuitive, while also providing enough complexity to allow
for real-world attack modeling. Having complete control over the system model
also allows for deep introspection into the entire state of the architecture, and
enables custom data acquisition solutions for profiling various implementations
of architectural features.
Pretty Secure System consists of three main components: the CPU hardware
(Pretty Secure Processor), the software kernel that runs on the virtual
CPU, and the simulator infrastructure which seamlessly connects the components
together using Verilator [27]. This chapter will focus on the hardware
components of Pretty Secure System, designed in SystemVerilog [29].
3.1 Core Architecture
Pretty Secure System is first and foremost a multicore processor system.
Each individual core within Pretty Secure System is called a “Pretty Secure
Processor.” The Pretty Secure Processor core is a custom five-stage pipelined
core based on the RISC-V ISA [2]. These cores support the majority
of the RV32I ISA, as well as user and machine privilege modes, exceptions,
interrupts, system calls, sleep/wake functionality, and inter-processor interrupts
[2]. They also feature a cycle-accurate clock readable from userspace, which
enables timing side channel attacks. The cores are connected together by a
modular and extensible ring interconnect, modeled after the ring interconnect
architecture exploited by the Lord of the Rings attack [7].
While these cores are built on the RISC-V ISA standard, they depart from
the standard in some key ways. The exact specification of the processor cores
themselves, as well as the departures from the RISC-V standard, will be
explained in this section.
3.1.1 System Block Diagram
Figure 3.1 contains an overview of a single Pretty Secure Processor core.
Figure 3.1: Block diagram of a single Pretty Secure Processor core
3.1.2 Core Pipeline
The individual cores are built using a standard five stage pipelined design
consisting of the FETCH, DECODE, EXECUTE, MEMORY, and WRITEBACK stages
[5, 6]. A brief overview of the inner core is presented here, with emphasis
on the privileged architecture and connection to the ring interconnect and
caching hierarchy.
The FETCH stage fetches new instructions from the L1 instruction cache. If
the L1 instruction cache does not contain the desired address, it will request
data from the L2 cache over the low-priority port (the L1 data cache takes
priority). If the L2 cache does not have the address in question, it will issue
a broadcast memory read request on the ring interconnect. This read request
will be handled by the LLC, which may make requests from main memory
as needed. The LLC will issue a response over the ring interconnect, where
L2, L1, and eventually the core itself receive the result. Branches, exceptions,
ECALLs, and MRETs will update the program counter in FETCH to point to new
code to be executed.
The DECODE stage takes the fetched instruction and decodes it. This involves
creating a control word with commands for the rest of the hardware to
execute. If the instruction decoded is a CSR or general system instruction,
the pipeline will be stalled until this instruction is able to retire— no CSR
register hazard forwarding is performed. This allows for the entire pipeline to
be in one privilege state at a time, simplifying system logic. That is, in-flight
instructions from two privilege domains cannot share the pipeline at the same
time.
The EXECUTE stage executes arithmetic operations and detects exceptions.
If an instruction attempts to access a privileged CSR that it does not have
permission to access, or if the instruction is an ECALL, an exception is generated.
This causes the instructions in FETCH and DECODE to be invalidated, and the
program counter is changed to the exception vector handler pointed to by
mtvec. A privilege transition also occurs, and the instruction that generated
the fault is invalidated.
The MEMORY stage allows the pipeline to access the L1 data cache and issue
writes to peripherals. Depending on the kind of access (to main memory or
to memory-mapped IO), the core forwards the request either to the L1
data cache or, through its vertical memory port, directly to the ring
interconnect. If the L1 data cache and L2 both miss, that same vertical port
is used to issue the data miss to the memory ring, as each core has only
one memory port. More information on memory ring transactions can be
found in Section 3.3.
The WRITEBACK stage allows the core to commit all architectural state changes.
Notice that the cache microarchitectural state may have been modified prior to
an instruction entering WRITEBACK. During WRITEBACK, the CSRs and the integer
register file are updated. The debugger port always reads from the register
file, and never observes forwarded hazards from instructions in the pipeline.
This allows the debugger to always receive the latest committed architectural
state. Hazard forwarding between pipeline stages is performed as expected
internally.
3.1.3 Control and Status Registers
RISC-V defines a set of CSRs, or Control and Status Registers [2]. Pretty
Secure Processor supports a subset of the RISC-V ISA CSRs, and defines
some new ones to simplify writing system level software for Pretty Secure
Processor. As compatibility with existing RISC-V Operating Systems is not
a design goal, this tradeoff is acceptable. Each core in the system has its
own distinct set of CSRs. Tables 3.1 and 3.2 show the CSRs accessible from
various processor privilege levels. The “Standard?” field refers to whether a
given CSR is in compliance with the RISC-V ISA specification [2].
Table 3.1: CSRs accessible from machine mode
Address  Name         Use                                Standard?
0x037    utimer       Current Time (number of cycles)    No
0x304    mie          Interrupt Enable                   No
0x305    mtvec        Machine Trap Vector Table Pointer  Yes
0x340    mscratch     Scratch Register                   Yes
0x341    mepc         Exception Saved PC                 Yes
0x342    mcause       Exception Cause                    Yes
0x343    mtval        Exception Value                    Yes
0x397    mipi_issuer  IPI Issuer Core ID                 No
0x398    mpie         Previous Interrupt Enable          No
0x399    mpp          Previous Privilege Level           No
0xf14    mhartid      Hardware Thread (core) ID          Yes
Table 3.2: CSRs accessible from user mode
Address  Name    Use                              Standard?
0x037    utimer  Current Time (number of cycles)  No
Most notably, the RISC-V mstatus register has been split into several CSRs.
This makes writing system code that interacts with the processor state quite
simple, as no bit manipulation is required. It also reduces the complexity of
the privilege transition logic, as the distinct CSRs allow simple register logic
to handle saving and restoring execution state.
The interrupt enable register and previous interrupt enable register, mie and
mpie respectively, each have a single 1-bit field at bit position 0. When it is
set to 1, interrupts are allowed, and when it is set to 0, interrupts are
disallowed. All other bits are ignored. Table 3.3 shows the bit fields of these
registers.
Table 3.3: The mie and mpie CSRs
Bits 31-1: Unused (31 bits)    Bit 0: Interrupts Enabled (1 bit)
The machine trap vector table pointer CSR, mtvec, points to the handler
to run when an interrupt, syscall, or exception is encountered [2]. This is a
common entrypoint for all traps regardless of origin or cause.
The scratch register, mscratch, has no particular architectural designation.
The kernel uses this to save the kernel stack pointer during user code execution.
The saved program counter, mepc, contains the address of the next instruction
to execute after handling a given exception (when the MRET instruction is
executed) [2].
mcause and mtval are set during an exception, interrupt, or system call
to describe what happened. During IPIs, the mipi_issuer CSR is
updated with the core ID of the issuing core.
mpie, mepc, and mpp are used to save previous architectural state during
an exception, syscall, or interrupt context. They are written to during an
exception, syscall, or interrupt, and are read from during the
MRET instruction. More information on privilege transitions can be found in
Section 3.1.5 on interrupts, syscalls, and exceptions.
mhartid refers to the current core’s ID. It is 0 for core 0, 1 for core 1, etc.
utimer is a timer that is accessible from user and machine contexts. It is the
only read-only CSR, and every cycle it is incremented by 1 (even while the
core is asleep).
3.1.4 Privilege Levels
Each core supports two of the four RISC-V privilege levels [2]. Namely, a
core can be operating in either PSP_MACHINE (kernel/privileged) mode or
PSP_USER (user/unprivileged) mode. Table 3.4 shows the different privilege
levels allowed, along with the permissions granted at each level.
Table 3.4: Privilege Levels in Pretty Secure Processor
Privilege    Code  Restrictions
PSP_USER     00    Cannot use machine CSRs
PSP_MACHINE  11    None
When a core boots up, it begins execution in PSP_MACHINE mode. After setting
up the required CSRs and kernel data structures, the kernel can switch to
executing a user process in user mode by configuring the desired CSRs and
executing the MRET instruction. Interrupts, syscalls, or exceptions will cause
the processor to switch from whatever mode it is currently in to PSP_MACHINE
mode. Ecalls taken from either mode will be handled at the PSP_MACHINE
privilege level.
3.1.5 Interrupts, Syscalls, and Exceptions
Pretty Secure Processor supports interrupts, system calls, and exceptions.
All of these flow through the common exception handler trap vector routine
stored in the mtvec CSR. Table 3.5 shows the exception kinds available. The
mcause values for Illegal CSR Accesses, Ecalls, and external interrupts are
compliant with the RISC-V ISA [2]. The mcause for IPIs is non-standard.
Table 3.5: All exception causes
Exception                  mcause      mtval
Illegal CSR Access         0x00000001  None
Ecall from user mode       0x00000008  None
Ecall from machine mode    0x0000000b  None
External Interrupt         0x8000000b  Keycode
Inter-Processor Interrupt  0x8000000c  IPI Reason
From now on, we will refer to system calls, interrupts, and exceptions
collectively as exceptions. When an exception occurs in the processor, the current
PC value is saved from the EXECUTE stage. If this stage is invalid, the PC
in DECODE is used, and if that is invalid, then the PC currently in the FETCH
stage is saved. The instructions contained in the FETCH, DECODE, and EXECUTE
stages are then marked as invalid. The saved PC value is the PC value that is
used when the processor returns from the interrupt context. This allows the
core to continue doing what it was doing immediately before the exception
occurred.
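The fallback order for the saved PC can be sketched as a small Python model. This is illustrative only: the function name select_saved_pc and the (valid, pc) pair used to represent a pipeline stage are assumptions made for this example and are not taken from the hardware sources.

```python
def select_saved_pc(execute, decode, fetch):
    """Return the PC to save when an exception is taken.

    Each argument is a hypothetical (valid, pc) pair describing one stage.
    """
    # Prefer EXECUTE, then DECODE, whenever the stage holds a valid instruction.
    for valid, pc in (execute, decode):
        if valid:
            return pc
    # Otherwise fall back to whatever PC is currently in FETCH.
    return fetch[1]
```

For example, if EXECUTE holds a bubble but DECODE is valid, the PC from DECODE is the one that is saved.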
Immediately after an exception, the current value of mie is recorded in mpie
(previous interrupt enable), the current processor privilege level is recorded
in mpp (previous privilege), and the address of the current instruction being
worked on is recorded in mepc (exception PC). mcause and mtval are updated
with the reason for that particular exception, and mipi_issuer will be updated
if this exception is an IPI. The program counter is loaded with mtvec regardless
of the exception kind— all exceptions, syscalls, and interrupts go through the
same handler routine.
To leave an exception context, the processor should issue the MRET instruction.
During MRET, mie is loaded with mpie, PC is loaded with mepc, and the current
privilege level is loaded with mpp. Note that there is no other architectural
means of changing the current privilege level; it is not backed by any CSR.
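On exception:
mpie ← mie
mpp ← Current Privilege Level
mepc ← PC
Current Privilege Level ← PSP_MACHINE
PC ← mtvec
On MRET:
mie ← mpie
Current Privilege Level ← mpp
PC ← mepc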
For core-generated exceptions (Illegal CSR Access and Ecalls), the exception
is generated when the faulting instruction is in the EXECUTE stage. The
instruction is then marked as invalid in the pipeline such that it does not
retire or modify architectural state. The saved PC value during the exception
points to the instruction that caused the exception in compliance with the
RISC-V ISA [2]. So, to avoid infinite syscall loops, the kernel should increment
mepc by 4 before issuing MRET. A side effect of this is that when debugging
the processor, exception-generating instructions such as ECALL will not be
visible from GDB, as they never fully retire through the WRITEBACK stage of
the processor.
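The exception entry and MRET register transfers described in this section can be summarized in a short behavioral model. This is a sketch under stated assumptions, not the processor's RTL: the Core class, its field layout, and the function names are invented for illustration, and only the transfers stated in the text are modeled (in particular, the text does not say whether mie changes on exception entry, so the model leaves it untouched).

```python
from dataclasses import dataclass

PSP_USER = 0b00      # privilege codes from Table 3.4
PSP_MACHINE = 0b11
MCAUSE_ECALL_USER = 0x00000008  # from Table 3.5


@dataclass
class Core:
    """Hypothetical container for the state touched by traps."""
    pc: int = 0
    priv: int = PSP_MACHINE
    mie: int = 0
    mpie: int = 0
    mpp: int = PSP_MACHINE
    mepc: int = 0
    mcause: int = 0
    mtval: int = 0
    mtvec: int = 0


def take_exception(core, cause, tval=0):
    """Save the interrupted context and jump to the common handler."""
    core.mpie = core.mie
    core.mpp = core.priv
    core.mepc = core.pc          # points at the faulting instruction
    core.mcause = cause
    core.mtval = tval
    core.priv = PSP_MACHINE
    core.pc = core.mtvec


def mret(core):
    """Restore the context saved by take_exception."""
    core.mie = core.mpie
    core.pc = core.mepc
    core.priv = core.mpp
```

A syscall handler in this model would add 4 to mepc before calling mret, so that execution resumes after the ECALL rather than re-executing it.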
The core has two interrupt ports— one for external device interrupts, and
one connected to its System Management Core (SMC) for inter-processor
interrupts (IPIs). The only supported external device interrupt kind in Pretty
Secure System is a keyboard interrupt, which is issued from the simulation
environment when the user presses a key with the main framebuffer window
highlighted. During this interrupt, the mtval register is set to the value of
the keyboard data register. The keyboard data register is read from the
main system platform engine, and is synchronized with the processor clock
domain along with the asynchronous external device interrupt signal inside
the core. Only core 0 can receive keyboard interrupts.
The other interrupt port on the core is the IPI port, which is connected
directly to the core’s associated SMC. The SMC will issue IPI interrupts
and pass in the IPI reason as well as the IPI issuer directly to the core.
The SMC will ignore further IPI requests until the core issues an interrupt
acknowledgement. The interrupt acknowledgement is automatically issued
when the processor begins execution of an interrupt— kernel code does not
need to handle interrupt acknowledgement.
Syscalls follow the RISC-V ABI, where arguments are passed in registers
x10-x17, and the return value is passed in x10 and x11 [2]. Syscalls should
not change any registers except for x10 and x11 (the return value).
3.1.6 Architectural State
The entire architectural state of a non-sleeping processor (excluding memory)
can be represented by the 31 integer registers (x1-x31), the program counter,
the value of all CSRs, and the current processor executing privilege level.
A kernel wishing to perform context switches should record any CSRs that
differ between processes and swap them when swapping contexts, along with
the integer registers. The PC register does not need to be saved as long as
the context switching function’s stack frame and integer registers are restored
correctly— exiting this function will tear down the stack frame and return
to previous calls in the new context. All context switching between different
execution contexts should happen in PSP_MACHINE mode, as that is the only
privilege level allowed to read/write the necessary CSRs.
3.1.7 Sleep/ Wake
Cores can issue the WFI (wait for interrupt) instruction to cause the processor
to go to sleep. Cores should avoid going to sleep with interrupts disabled, as
there is no way to wake up a sleeping core with mie set to 0. All instructions
prior to and including the WFI will retire before the processor goes to sleep,
and no instructions after it will complete until after the next exception.
Cores can be awoken via any interrupt kind— a keyboard press (“external
device interrupt”) or an inter-processor interrupt (IPI). This will cause the
core to begin execution at mtvec exactly as a normal interrupt, and will
effectively wake the core up. After the exception concludes with MRET, the
core will resume execution at the WFI instruction itself, as
mepc will be loaded with the address of the WFI instruction when the next
exception occurs. The exception handler should increment mepc by 4 to
prevent immediately going back to sleep if that is not desired.
3.1.8 System Management Core
The System Management Core (SMC) is responsible for receiving and issuing
inter-processor interrupts (IPIs) on the IPI ring interconnect. It handles
writes from its corresponding core to the IPI core addresses (defined in Table
3.6). It also acts as a ring stop receiver, passively monitoring the ring for
packets destined for this particular core. When an IPI destined for this core
is detected, the SMC triggers an interrupt on the core, potentially waking it
up. The SMC will not accept new IPIs until a waiting IPI is acknowledged
by the core.
Writing to an address in the SMC MMIO range will cause an IPI to be sent
to the destination core corresponding to the address. Table 3.6 shows the
mapping between addresses and cores. These addresses are global for all
cores, and cores can issue IPIs to themselves. IPIs are not guaranteed to
enter the ring as the SMC currently performs no buffering— since IPIs are
rarely issued, this is not presently a problem.







Communication with the SMC consists of writing a 32-bit value to the address
for a particular core. This sends an IPI to that core, where the 32-bit value
written is the “IPI Reason” visible from the IPI receiver as the mtval CSR.
More information on the memory map of the system can be found in the
Memory section.
When an IPI is received by a core’s SMC, it will issue an IPI interrupt request
to the core. It will then wait for the core to acknowledge that interrupt
before accepting new IPIs. The SMC will ignore any further IPI requests
while the core has not issued an interrupt acknowledgement. The core will
automatically issue an interrupt acknowledgement when it begins executing
that exception— the kernel does not need to acknowledge the interrupts itself.
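The accept-then-wait-for-acknowledgement behavior of the SMC can be modeled in a few lines. This is a hedged sketch: the SMC class and its method names are hypothetical, and the model only captures the rule that a second IPI is ignored while an earlier one is still unacknowledged.

```python
class SMC:
    """Toy model of one core's System Management Core (names are illustrative)."""

    def __init__(self):
        self.pending = None  # (reason, issuer) awaiting the core, or None

    def receive_ipi(self, reason, issuer):
        """Return True if the IPI is accepted, False if it is ignored."""
        if self.pending is not None:
            return False     # core has not acknowledged the previous IPI
        self.pending = (reason, issuer)
        return True

    def core_acknowledge(self):
        """Called when the core begins executing the interrupt.

        Returns the (reason, issuer) pair that the core would observe
        in mtval and mipi_issuer, and frees the SMC for the next IPI.
        """
        taken, self.pending = self.pending, None
        return taken
```

In this model an ignored IPI is simply dropped, which matches the description above; any retry policy would have to live in the sender.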
3.1.9 System Platform
The system platform is the glue logic that ties the cores, caches, and ring
interconnect together. This part of the processor system also provides debug
status information to allow the simulator to monitor and control the hardware
via GDB, and attaches to the RISC-V Formal Verification Monitor
[28, 32]. At this time only the main core (core 0) can be debugged with
GDB, and only the main core has a formal verification monitor attached.
The formal verification monitor interface port contains all signals needed to
send privilege transitions and asynchronous code execution to the verification
monitor to ensure complete correct core operation. The RVFI engine and
cache correctness checkers make hardware bugs easy to spot in simulation.
More information on the runtime verification tools can be found in Chapter 5
in the section on Runtime Verification.
3.2 Memory Hierarchy
There are a variety of memory-mapped devices and levels to the memory
hierarchy attached to each individual core. This section will introduce the
memory map available to each processor in the system, elaborate on the
cache hierarchy and implementation, and introduce the main memory model
and memory-mapped IO (MMIO) devices. The memory architecture lays the
groundwork for the attacks demonstrated on Pretty Secure System in the
later sections of this work. The entire memory architecture only uses physical
addressing— there is no hardware support for virtual memory. This simplifies
architectural attacks (creating eviction sets and cross-core contention is
simpler), and reduces implementation complexity for attack models. Attacks
tend to deal with memory at a very low level, and the simplification provided
by not introducing virtual memory allows for quicker attack prototyping.
Since there is no virtual memory, all caches are physically indexed, physically
tagged (PIPT), and there is no translation lookaside buffer (TLB) present [6].
3.2.1 Memory Map
Table 3.7: Pretty Secure Processor Memory Map
Start       End         Region                Allowed Ops
0x00000000  0x04000000  Kernel Binary         RWX
0x40000000  0x41000000  Text Overlay          W
0x80000000  0x81000000  Graphics Framebuffer  W
0xA0000000  0xA0001000  SMC MMIO              W
Table 3.7 shows the different address regions accessible from within a particular
core. Each core’s memory management system services requests to each region
as appropriate. Transactions may include accessing the cache, transacting
on the ring, or communicating with the System Management Core (SMC).
The read and execute permissions are currently equivalent— any readable
region of memory is also executable. As memory protection features are not
necessary for modeling architectural side channel attacks, there is no memory
protection unit in place at this time. That is, the permissions are enforced not
via a programmable memory protection unit, but simply due to the hardware
not supporting certain operations from certain memory types. Additional
constraints on the address space (such as non-executable memory regions)
are not supported by the hardware at this time.
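The region lookup implied by Table 3.7 can be written as a small decoder. This is only a sketch: it assumes the End column is exclusive, and the names REGIONS, decode, and is_allowed are invented for this example rather than taken from the hardware or kernel sources.

```python
# (start, end, region name, allowed operations), copied from Table 3.7.
# Assumption: "End" is treated as an exclusive upper bound.
REGIONS = [
    (0x00000000, 0x04000000, "Kernel Binary", "RWX"),
    (0x40000000, 0x41000000, "Text Overlay", "W"),
    (0x80000000, 0x81000000, "Graphics Framebuffer", "W"),
    (0xA0000000, 0xA0001000, "SMC MMIO", "W"),
]


def decode(addr):
    """Return (name, ops) for the region containing addr, or None."""
    for start, end, name, ops in REGIONS:
        if start <= addr < end:
            return name, ops
    return None


def is_allowed(addr, op):
    """op is one of 'R', 'W', or 'X'."""
    region = decode(addr)
    return region is not None and op in region[1]
```

For instance, is_allowed(0x40000000, "R") is False in this model because the text overlay is write-only.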
The kernel binary region of memory is readable, writeable, and executable.
It is a large region of memory that is preloaded with the raw bytes of the
compiled kernel object using $readmemh at runtime by the simulator. This
region of memory is backed by the cache hierarchy, which is described later
in this document. At reset, each processor core will begin executing from
address 0x00000000 in the kernel binary with interrupts disabled.
The simulation framework includes a synthesizable cache correctness checker,
which independently maintains a copy of the entire kernel binary region of
memory. Every transaction by the core is copied to that local copy, and the
verification engine ensures that transactions committed by the processor are
reflected by the core’s cache hierarchy when it is read from. This correctness
checker is only safe for single-core programs, as writes from other cores do
not propagate to the verification engine, but may propagate to the per-core
cache hierarchy.
The text overlay region of memory is an 80 by 30 character buffer. Characters
can be written directly into it in row-major order. The byte
located at 0x40000000 corresponds to the character drawn at (0,0)
in (row, column) screen coordinates (top left), the byte at 0x40000001 is the
character at (0,1) (immediately to the right of (0,0)), and so on. These
characters are 8 pixels wide and 16 pixels tall and are drawn directly over
the graphics framebuffer. The font used is the “VGA font” from the QEMU
project [30, 31]. Writing a 0 to any byte in this buffer causes the overlay to
display no character at that location. The text overlay draws all characters
with a pure white color (0x00ffffff).
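As a worked example of the row-major layout, the address of a character cell can be computed as follows. The helper name text_cell_address is hypothetical, and the sketch assumes one byte per character with 80 columns and 30 rows, as described above.

```python
TEXT_BASE = 0x40000000   # start of the text overlay region (Table 3.7)
COLS, ROWS = 80, 30      # 80 by 30 character buffer


def text_cell_address(row, col):
    """Byte address of the character cell at (row, col), row-major."""
    if not (0 <= row < ROWS and 0 <= col < COLS):
        raise ValueError("cell is outside the 80 by 30 overlay")
    return TEXT_BASE + row * COLS + col
```

The first cell of the second row therefore sits 80 bytes after the base address, at 0x40000050.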
The graphics framebuffer memory region is a row-major 640 by 480 array of
32-bit integer colors. Each pixel is represented by one 32-bit value, where
Table 3.8 shows the layout of colors within a given 32-bit value.
Table 3.8: Layout of a single 32-bit pixel
Unused (8) Red (8) Green (8) Blue (8)
As each pixel uses 24-bit color directly without a palette of any kind, arbitrary
images up to 640 by 480 pixels can be displayed using the graphics framebuffer
with 16 million possible colors per pixel. Any graphics in this region will
be drawn underneath the text overlay. Chapter 5 provides more detailed
information on the graphics system, including screenshots of the running
system. Figure 5.4 in Chapter 5 shows the core drawing an arbitrary image
into the graphics framebuffer and using the text overlay to present some text
on top.
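The framebuffer arithmetic can be illustrated with two small helpers. These are assumptions layered on the description above: the sketch treats the region as byte-addressed with 4 bytes per pixel, and it reads Table 3.8 from most significant to least significant byte, which is consistent with pure white being 0x00ffffff. The function names are invented for this example.

```python
FB_BASE = 0x80000000     # start of the framebuffer region (Table 3.7)
WIDTH, HEIGHT = 640, 480
BYTES_PER_PIXEL = 4      # one 32-bit value per pixel


def pixel_address(x, y):
    """Byte address of pixel (x, y) in the row-major framebuffer."""
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        raise ValueError("pixel is outside the 640 by 480 framebuffer")
    return FB_BASE + BYTES_PER_PIXEL * (y * WIDTH + x)


def pack_rgb(red, green, blue):
    """Pack three 8-bit channels; the top 8 bits stay unused (zero)."""
    return ((red & 0xFF) << 16) | ((green & 0xFF) << 8) | (blue & 0xFF)
```

Under these assumptions the whole framebuffer occupies 640 * 480 * 4 = 1,228,800 bytes, well inside the region reserved for it.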
The System Management Core (SMC) MMIO port provides memory-mapped
IO access to communicate with the per-core SMC. The SMC handles
transactions on the inter-processor interrupt (IPI) ring; this includes issuing
and receiving IPIs, which may wake a sleeping core
up. More information on interfacing with the SMC is found in
Section 3.1.8 on the System Management Core.
3.2.2 Cache Hierarchy
Each core contains a parametric split L1 and unified L2 cache. The L2 caches
are linked together via the ring interconnect, which allows them to interface
with the shared last level cache. In the default configuration, each core’s
L1 cache consists of 8 kB of high-speed storage, divided evenly amongst the
instruction and data caches. The unified L2 cache is 8 kB large. The shared
last level cache (L3) is 16 kB large. As the dimensions and line sizes are
completely parametric, these parameters are user-configurable to tune for any
cache hierarchy size desired.
Each core’s L1 instruction and data caches are, by default, both set-associative
caches consisting of 4 ways and 16 sets, with each line containing 64 bytes.
The associativity, number of sets, and line size are all adjustable at design
synthesis time. Each core’s L2 cache is by default an 8-way set-associative
cache with 16 sets, where each line is 64 bytes large. The shared last level
cache (L3) is by default a 4-way set-associative cache with 64 sets, where
each line is also 64 bytes large. The L2 and L3 associativity, number of sets,
and line size are also user configurable at synthesis/ compile time. Table 3.9
provides an overview of the cache hierarchy.
Table 3.9: Default cache dimensions for all levels
Cache           Ways  Sets  Line Size  Size
L1 instruction  4     16    64 Bytes   4 kB
L1 data         4     16    64 Bytes   4 kB
L2              8     16    64 Bytes   8 kB
L3 (LLC)        4     64    64 Bytes   16 kB
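The sizes in Table 3.9 follow from ways * sets * line size, and the same parameters determine how an address splits into tag, set index, and line offset. The sketch below assumes the conventional power-of-two mapping (low bits select the byte within a line, the next bits select the set); the text does not state the exact index function used by the hardware, and the function names are illustrative.

```python
def cache_size(ways, sets, line_bytes):
    """Total data capacity of a set-associative cache in bytes."""
    return ways * sets * line_bytes


def split_address(addr, sets, line_bytes):
    """Return (tag, set_index, offset), assuming power-of-two dimensions."""
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    offset = addr & (line_bytes - 1)
    set_index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, offset
```

With the default L1 geometry, addresses that are 16 * 64 = 1024 bytes apart fall into the same set under this mapping, which is the property an eviction set relies on.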
The main pipeline features two memory ports— one for instructions, and
one for data. These ports are attached to the L1 instruction and L1 data
caches, respectively. These caches are unified through an arbitration unit,
which selects which port to stream to L2 for a particular transaction.
The arbitration unit monitors two memory interfaces with one marked as
the “priority” interface. When either port requests a new transaction, the
arbiter forwards the request to lower memory, preferring the priority interface
when both interfaces are requesting new memory. The arbiter then waits
for the lower port to finish communication and transmits the data upwards
to the requesting interface. After a transaction is complete, the arbiter
starts working on the next available transaction. The data port is marked
as the priority port in this design, as a miss on the data cache will stall the
entire pipeline, whereas a miss on the instruction cache will only prevent new
instructions from entering the pipeline.
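The arbitration policy can be sketched as a one-transaction-at-a-time state machine. This is a simplified behavioral model, not the SystemVerilog implementation: the Arbiter class, the step method, and the choice to start the next transaction in the same cycle that the previous one finishes are all assumptions made for illustration.

```python
class Arbiter:
    """Chooses which L1 port owns the L2 interface (illustrative model)."""

    def __init__(self):
        self.active = None  # "data", "instr", or None when idle

    def step(self, data_req, instr_req, lower_done):
        """Advance one cycle and return the port currently being served."""
        if self.active is not None and lower_done:
            self.active = None        # current transaction has finished
        if self.active is None:
            if data_req:
                self.active = "data"  # priority interface wins ties
            elif instr_req:
                self.active = "instr"
        return self.active
```

When both ports request in the same cycle, the instruction fetch waits until the data transaction completes.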
The caches use a writeback policy, where modified data is stored in the cache
until it is evicted. Table 3.10 provides an overview of the allowed states for a
particular cache line.
Table 3.10: Cache line states
State     Meaning
invalid   This line has no data
shared    This line is clean (never written to)
modified  This line is dirty (written to)
The cache eviction policy uses an implementation of the Pseudo Least-Recently
Used (PLRU) algorithm to evict lines from the cache [5]. Evictions happen
when all ways in a particular set are full (either shared or modified) and the
requester is asking for an address that also maps to that set but is not present
in any of the ways. Eviction of a modified line first causes the cache to
issue a write request to its lower memory, and then perform a read. Eviction
of a shared line will only result in a read request being issued. This is in
accordance with the writeback policy [5].
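One common way to realize PLRU for a 4-way set is a binary tree of three bits. The text does not say which PLRU variant the hardware uses, so the tree-based scheme below is an assumption chosen for illustration, as are the class and function names. The second helper restates the writeback rule from the paragraph above.

```python
class TreePLRU4:
    """Tree-based pseudo-LRU for one 4-way set (assumed variant)."""

    def __init__(self):
        # bits[0]: 0 -> victim is in ways 0/1, 1 -> victim is in ways 2/3
        # bits[1]: 0 -> victim is way 0, 1 -> victim is way 1
        # bits[2]: 0 -> victim is way 2, 1 -> victim is way 3
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: point the tree away from the way just used."""
        if way < 2:
            self.bits[0] = 1
            self.bits[1] = 1 - way
        else:
            self.bits[0] = 0
            self.bits[2] = 3 - way

    def victim(self):
        """Way that would be evicted on the next miss in a full set."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]


def miss_requests(victim_state):
    """Requests sent to lower memory when replacing a line in this state."""
    if victim_state == "modified":
        return ["write", "read"]   # write the dirty line back, then fetch
    return ["read"]
```

After touching ways 0, 1, 2, 3, and then 0 again, this tree selects way 2 as the victim even though way 1 is the true least recently used line, which is the kind of approximation that gives PLRU its name.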
3.2.3 Memory Interface
The memory interface consists of eight data signals between a driver (issues
memory requests) and a bus (responds to memory requests). Table 3.11 shows
all signals in the memory interface.
Table 3.11: Memory interface signals
Signal    Width (bits)    Issuer  Meaning
addr      32              Driver  Address
data_i    8 * LINE_BYTES  Driver  Data in
data_o    8 * LINE_BYTES  Bus     Data out
data_en   LINE_BYTES      Driver  Enable bit vector
write_en  1               Driver  Write request
read_en   1               Driver  Read request
hit       1               Bus     Transaction complete next cycle
done      1               Bus     Transaction complete
The memory interface includes five fields controlled by the driver (request
issuer), and three fields controlled by the bus in response to transactions. The
driver asserts the address, the data sent to lower memory, a data enable bit
vector that disables bytes in the input data signal, and lines to select a read
or write request. The bus issues three signals in response: data out as the
result of a read transaction, the hit signal, and the done signal. The hit
signal is set when the bus will be ready for a request next cycle. Data out is
not ready until the cycle after the hit signal is asserted. The done signal is
the hit signal delayed by a single cycle. It is part of the interface instead
of internal memory controller state since some memory devices transition on
the done signal as opposed to the hit signal. Making this part of the memory
interface generalizes it to allow more devices to communicate directly using
this interface without maintaining extra internal state. This is particularly
useful for components that transact on the ring interconnect. There is no
ready signal issued from the bus up to the driver; it is the driver’s
responsibility to ensure that it waits for the lower memory to issue a response
before issuing a new request. The driver is allowed to issue new requests the
cycle immediately after the bus asserts the hit signal; the driver does not
need to wait for the done signal.
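Two details of the interface lend themselves to a short illustration: the byte-enable vector and the relationship between hit and done. The helpers below are sketches with invented names, and they assume that bit i of the enable vector corresponds to byte i of the line, which the text does not spell out.

```python
def apply_write(line, data_i, data_en):
    """Merge a write into a line, touching only the enabled bytes.

    Assumption: bit i of data_en enables byte i of data_i.
    """
    return bytes(
        new if (data_en >> i) & 1 else old
        for i, (old, new) in enumerate(zip(line, data_i))
    )


def done_from_hit(hit_trace):
    """Per-cycle done values: the hit signal delayed by one cycle."""
    if not hit_trace:
        return []
    return [0] + list(hit_trace)[:-1]
```

For a hit trace of 0, 1, 0, 0 the corresponding done trace is 0, 0, 1, 0, so a device that transitions on done reacts one cycle after a device that transitions on hit.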
The multicore cache system currently does not maintain cache coherence across
the cores. This is because the current work does not explore attacks that rely
on shared memory between processes on different cores, and thus this feature
was not implemented. Additionally, self-modifying code is not supported, as
the L1 instruction cache is not flushed on data writes to addresses contained
within it.
3.2.4 Memory Couplers
The system also provides two modules that are useful for creating arbitrary
memory-mapped devices that can communicate with caches and for creating
cache hierarchies with varying line sizes. The short_to_long coupler converts
a shorter memory interface to a longer one. This allows memory interfaces of
smaller line size (for example, the core, which communicates in chunks of 4 bytes)
to communicate with memory regions of a larger size (for example, the 64 byte
cache lines). long_to_short provides support for the opposite, allowing
a larger memory interface to make requests of smaller memory interfaces.
As physical memory can only be accessed 4 bytes at a time, this coupler is
useful for connecting 64 byte line caches to physical memory. This module
works by serializing requests to the lower memory such that the entire long
line is serviced fully. Much of the simulated latency in accessing physical
memory comes from serializing cache-line width transactions into single 4-byte
transactions with the physical memory controller during an LLC miss.
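The serialization performed by the long-to-short coupler can be sketched by listing the narrow requests that one wide request expands into. The function name serialize_line is hypothetical, and the sketch assumes that the line is aligned to its own size and that words are requested in ascending address order.

```python
def serialize_line(addr, line_bytes=64, word_bytes=4):
    """Word addresses issued to lower memory to service one full line."""
    base = addr - (addr % line_bytes)   # assumed line-aligned base address
    return [base + off for off in range(0, line_bytes, word_bytes)]
```

With the default parameters, a single 64 byte line turns into 16 separate 4-byte transactions, which is where much of the simulated main memory latency comes from.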
3.3 Ring Interconnect
The cores are connected to one another via a shared inter-processor interconnect
interface called the ring interconnect. This interconnect was designed to model
the interconnect architecture used in Intel processors exploited by the Lord
of the Rings attack [7] so that this attack could be simulated on Pretty
Secure System. This section provides an overview of the ring interconnect
and elaborates on how the System Management Cores interact with the IPI
rings, as well as how L2 misses are handled by the system. This section lays
the groundwork for the Lord of the Rings attack demonstration in Chapter 6.
Figure 3.2 provides an overview of the ring interconnect architecture on Pretty
Secure System.
Figure 3.2: Overview of the ring interconnect architecture
The ring interconnect provides an extensible and modular means of connecting
multiple memory producers/ consumers on a shared memory ring bus. The
IPI ring provides that same level of modularity and flexibility for adding extra
processing cores to the system. The current configuration of Pretty Secure
System includes two processing cores, a single slice of LLC, and a graphics
engine. Additional cores can easily be added to the system; however,
only two cores are necessary to model the Lord of the Rings attack, so only
two cores are present at this time.
The ring interconnect implementation differs from the Intel one
presented in the Lord of the Rings attack paper in two key ways. First, the
LLC only contains one slice in this design, instead of the two that would be
present if using per-core LLC slices. However, as the sliced nature of the LLC
is not required for the version of the Lord of the Rings attack that will be
modeled, this is not an issue [7]. Adding support for additional LLC slices is
relatively straightforward and will only require an additional arbitration ring
for when different slices miss. Future research may extend the LLC to model
other kinds of LLC attacks. Second, the ring implementation only allows
unidirectional traffic, as opposed to the bidirectional traffic lanes present
in Intel CPUs. This is not an issue either, as one of the prerequisites for
the attack demonstrated is that traffic flowing in a given direction can be
controlled by the attacker [7].
3.3.1 Ring Packets
Ring packets are the mechanism by which devices on the ring communicate
with one another. Table 3.12 contains all fields of a ring packet, and Table
3.13 shows all possible ring packet kinds.
Table 3.12: Fields in a ring packet
Field        Meaning
valid        Is this packet valid?
kind         Ring packet kind
sender_id    Ring ID of the sender
dest_vector  Bit vector of recipients
ipi_reason   Reason for the IPI
mem_address  Memory address
mem_data     Memory data
mem_data_en  Memory data enable vector
Table 3.13: Kinds of ring packets
Kind                    Code
RING_PACKET_KIND_IPI    0
RING_PACKET_KIND_SNOOP  1
RING_PACKET_KIND_READ   2
RING_PACKET_KIND_WRITE  3
RING_PACKET_KIND_ACK    4
Ring packets can be multicast simultaneously to multiple receiver nodes by
setting the corresponding bits in the destination vector field of the packet.
Packets can also be transmitted to the same core that sent them. As the
packet moves around the ring, the ring stops automatically check if the packet
is destined for the device attached to it. If the packet is chosen to be sent to
the associated device, the current ring stop ID is cleared from the destination
bit vector, the packet is sent to the device, and the packet continues along
the ring to other recipients. If the destination bit vector becomes empty, the
packet is marked as complete and discarded by the ring. Because a destination
bit is cleared only when the packet is actually delivered, a ring stop whose
device is blocked can pass the packet back into the ring, where it keeps
circling until the device is ready to respond. This has the added benefit of
allowing the ring itself to buffer packets automatically while waiting for a
busy device. Extra ring stops without any associated device are added to
provide additional packet buffering. The next sections provide a breakdown
of the various ring packet kinds and what their purpose is.
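The destination vector bookkeeping described above can be sketched in a few
lines of Python. This is a behavioral illustration only, not the RTL: the
RingPacket class and the multicast and deliver helpers are names introduced
here, and only the routing fields from Table 3.12 are modeled.

```python
from dataclasses import dataclass


@dataclass
class RingPacket:
    valid: bool = False
    kind: int = 0
    sender_id: int = 0
    dest_vector: int = 0  # bit i set => ring stop i is still a recipient


def multicast(sender_id, kind, recipients):
    """Build a packet addressed to every ring ID in `recipients`."""
    vec = 0
    for ring_id in recipients:
        vec |= 1 << ring_id
    return RingPacket(valid=True, kind=kind, sender_id=sender_id, dest_vector=vec)


def deliver(packet, ring_id):
    """Clear this stop's bit after delivery; retire the packet once no recipients remain."""
    packet.dest_vector &= ~(1 << ring_id)
    if packet.dest_vector == 0:
        packet.valid = False
    return packet
```

For example, a read packet (kind 2) multicast from ring ID 0 to stops 0, 2,
and 3 starts with a destination vector of 0b1101 and is retired only after
all three stops have taken delivery.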
3.3.2 Ring Stops
Ring stops are the main interconnect building blocks that make up the ring
interconnect. Each ring stop has an associated device, which can issue and
receive ring packets. Ring stops are assigned a unique ring ID, which is how
other ring devices can target a given device. Most importantly, ring stops are
connected to other ring stops via a forward and backward port. Ring stops
read a packet from the backward port and issue a packet to the forward port
each cycle. The forward and backward ports that connect ring stops to one
another are called lateral ports, and the ports that connect a ring stop to its
device are called vertical ports.
A ring stop has two input ports and two output ports. The lateral ports
(forward and back ring stop connections) provide the lateral injector (first
input) and lateral receiver (first output) ports. The ring stop behind a given
stop acts as the injector, and the ring stop after it acts as the receiver. The
vertical port to the device also acts as an injector (second input) and receiver
(second output). The ring stop can issue a packet to just its lateral output
port, just its vertical output port, or both. If the packet being
read is intended for this ring stop (dest_vector contains our ring ID), the
packet will be sent to the device via the vertical output port if the device
is ready. If the dest_vector is not empty after removing our ring ID from
it, the packet is also forwarded down the lateral port to other receivers. If
the dest_vector is empty after issuing a packet to our associated device, the
packet is marked as invalid, and the invalid packet is sent to the next ring
stop via the lateral output port. If the packet is not intended for a particular
ring stop, it will just forward the packet along its lateral output port. Figure
3.3 provides a block diagram of the ring stop.
Figure 3.3: Block diagram of a ring stop
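The per-packet routing decision of a ring stop can be summarized as a small
pure function. The sketch below is a Python model of the cases described in
the text, assuming bit i of the destination vector corresponds to ring ID i;
the function name route and its return convention are illustrative and are
not taken from the hardware source.

```python
def route(valid, dest_vector, ring_id, device_ready):
    """Decide what one ring stop does with the packet on its lateral input.

    Returns (to_device, lateral_valid, lateral_dest_vector).
    """
    mine = valid and bool(dest_vector & (1 << ring_id))
    if mine and device_ready:
        # Deliver on the vertical port and clear our bit.
        remaining = dest_vector & ~(1 << ring_id)
        # If recipients remain, keep forwarding; otherwise the slot becomes invalid.
        return True, remaining != 0, remaining
    # Not for us, or our device is busy: forward unchanged on the lateral port.
    return False, valid, dest_vector
```

A packet whose device is busy leaves the stop with its destination bit still
set, which is what sends it on another trip around the ring.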
Every cycle, a ring stop will read from both its lateral and vertical input (that
is, its input from the ring and input from its device, respectively). If the ring
packet is valid, it will forward the packet to the chosen output port. This is to
model the “boxcar” behavior of the Intel ring interconnect [7]. Packets may
only be injected into the ring when a free slot opens up— packets already in
the ring take priority. If the packet being issued from the lateral input port
is invalid, and the ring stop’s associated device is trying to inject a packet,
the ring stop will pick that packet up and send it to the chosen output port.
Every cycle each ring stop makes a transaction— the ring never halts or stalls.
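The injection rule can be modeled with a short Python sketch. It is a
simplification: delivery to devices is left out so that only the arbitration
between ring traffic and new injections is visible, and the names step,
slots, and pending are introduced here for illustration.

```python
def step(slots, pending):
    """Advance the ring by one cycle.

    slots[i]   -- packet arriving at stop i this cycle (None means a free slot)
    pending[i] -- list of packets the device at stop i is waiting to inject
    Returns the slots as they arrive at each stop on the next cycle.
    """
    n = len(slots)
    out = [None] * n
    for i in range(n):
        packet = slots[i]
        if packet is None and pending[i]:
            # Free slot: the device at this stop may inject.
            packet = pending[i].pop(0)
        # Every stop forwards something (possibly an empty slot) every cycle.
        out[(i + 1) % n] = packet
    return out
```

In this model a device facing a continuous stream of valid packets never gets
to inject, which is the contention that the boxcar behavior exposes.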
3.3.3 Ring Interface
The ring interface consists of four signals between an issuer (generates packets)
and receiver (receives packets). Table 3.14 shows all signals in the ring
interface.
Table 3.14: Ring interface signals
Signal   Driver    Meaning
issue    Issuer    Inject this packet into the ring?
packet   Issuer    The packet to insert
issuing  Receiver  Injecting the packet into the ring
ready    Receiver  Ready for more packets
The ring interface includes two signals controlled by the issuer (packet
generator) and two signals controlled by the receiver in response to
transactions. The issuer drives the packet (packet) as well as a signal
indicating whether this
packet should be injected (issue). The receiver asserts a signal indicating
that it is ready for more packets (ready) and another signal when it is picking
up the packet (issuing). When the issuing signal is asserted, the packet
will be injected into the ring at the next clock edge, and the issuer should
stop asserting the issue signal.
Ring stops are always ready on their incoming lateral ring connection, and are
always asserting issue (injecting) to the next ring stop on their lateral
connection.
They will always assert the ready signal to their vertical connection to their
associated device, and will assert the issuing signal when the packet from
the device is being picked up. They will assert issue to the device when a
packet intended for the device is detected. If a ring stop observes the device
is not asserting ready on its receiver port, the ring stop will not issue the
packet to the device, and will instead send it for another trip around the ring.
The packet will eventually return to the same stop, where the ring stop will
again try to inject the packet to the device. Due to this behavior, packets
may arrive out of the order they are issued in.
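A small Python model illustrates how the retry behavior reorders packets. It
assumes an idealized ring in which every packet advances one stop per cycle;
the function deliver_order and its parameters are hypothetical names used
only for this sketch, and the caller must supply a readiness function that
eventually returns True.

```python
def deliver_order(ring_size, stop, packets, ready_at):
    """Return packet names in the order the device at `stop` receives them.

    packets  -- list of (name, starting_position) with distinct positions
    ready_at -- function mapping a cycle number to True if the device is ready
    """
    in_flight = [[name, pos] for name, pos in packets]
    delivered = []
    cycle = 0
    while in_flight:
        for entry in list(in_flight):
            name, pos = entry
            if pos == stop and ready_at(cycle):
                delivered.append(name)        # picked up by the device
                in_flight.remove(entry)
            else:
                entry[1] = (pos + 1) % ring_size  # forwarded (or sent around again)
        cycle += 1
    return delivered
```

With a four-stop ring, a device at stop 2 that is busy only in cycle 0, packet
A starting at stop 2, and packet B one stop behind it, B is delivered in cycle
1 while A must complete a full lap and arrives in cycle 4.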
3.3.4 Inter-Processor Interrupts
Inter-processor interrupts are served on the dedicated IPI ring. The IPI ring
is connected directly to each core’s System Management Core (SMC), and
is not part of the memory ring. When a core’s SMC wants to issue an IPI
to another core, it creates a ring packet with the destination vector pointing
at the target core. The packet kind is set to RING_PACKET_KIND_IPI, and the
ipi_reason field is populated with the given IPI reason. Upon receiving an
IPI packet, the SMC will issue an interrupt to the processor and inform it of
the reported IPI reason, as well as the IPI issuer. The IPI issuer is equal to
the received ring packet’s sender_id, and the receiving core’s mipi_issuer
CSR is set to this value.
Table 3.15 shows the data fields used in a RING_PACKET_KIND_IPI packet. The
routing metadata fields are excluded as they act the same for all packet kinds.
Table 3.15: Data fields in a RING_PACKET_KIND_IPI packet
Signal      Meaning
ipi_reason  Reason for this IPI
3.3.5 Memory Ring
The memory ring is a larger ring than the IPI ring as it contains both processor
cores and memory peripherals. Cores 0 and 1 are located at ring ports 0 and
1 respectively. The last level cache consists of a single slice located at port
2. The graphics controller is located at port 3. There are a few additional
buffering stops not pictured in the diagram that do not have an associated
device. The extra buffering ring stops are used to offload some traffic when
the ring gets too busy, ensuring a free slot will always arrive even when all
devices are blocked waiting for some transaction to complete. These extra
stops introduce extra latency for packets cycling the ring, with the added
benefit of extra built-in buffering capacity for the ring itself.
Two modules are provided that couple memory interfaces to ring stop
interfaces: mem_to_ring and ring_to_mem. mem_to_ring converts a memory
interface into a ring stop interface and handles the conversion of memory
requests to ring packets. mem_to_ring may broadcast RING_PACKET_KIND_READ
or RING_PACKET_KIND_WRITE packets to all other ring stops and listens for
RING_PACKET_KIND_ACK packets during reads. This module also maintains
the internal state required to wait for responses to read requests from lower
memory. ring_to_mem converts a ring stop interface into a memory interface.
ring_to_mem listens for RING_PACKET_KIND_READ and RING_PACKET_KIND_WRITE
packets destined for the associated device. It decodes the appropriate memory
fields from the ring packet and creates an associated memory request on its
lower memory interface. It will issue RING_PACKET_KIND_ACK packets back
along the ring to the requester ID if the request was a read request. This
allows the requester to see the results of its read request.
The data fields of RING_PACKET_KIND_READ, RING_PACKET_KIND_WRITE, and
RING_PACKET_KIND_ACK packets are shown in Table 3.16.
Table 3.16: Data fields in a read, write, or ack packet
Signal       Meaning
mem_address  Address of this transaction
mem_data     Cache line size of data to transact
mem_data_en  Byte enable for each byte of mem_data
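As an illustration of how a per-byte enable vector is typically applied, the
following Python sketch merges a write packet's data into an existing line.
It assumes that bit i of the enable vector guards byte i of the data, which
the text does not state explicitly; the function name apply_write is also
introduced here for illustration.

```python
def apply_write(line, mem_data, mem_data_en):
    """Return a copy of `line` with the enabled bytes of `mem_data` written in."""
    if len(line) != len(mem_data):
        raise ValueError("line and mem_data must be the same length")
    out = bytearray(line)
    for i in range(len(line)):
        if (mem_data_en >> i) & 1:   # assumption: bit i enables byte i
            out[i] = mem_data[i]
    return bytes(out)
```

Under this assumption, a store narrower than a cache line sets only the
enable bits for the bytes it covers and leaves the rest of the line untouched.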
A memory server is any device on the memory ring that can process memory
reads or writes. This includes the last level cache and the graphics controller,
and they use ring_to_mem couplers to attach to the ring. A memory client
is any device on the memory ring that can request memory reads or writes.
This includes the cores themselves, and they use mem_to_ring couplers to
attach to the ring.
Misses in each core’s L2 are injected to the memory ring via the core’s memory
to ring coupler. This coupler converts outgoing core memory requests to a
ring stop interface as a memory client. Each individual processor subsystem
only has one memory port— this port is attached directly to its ring stop on
the memory ring. If a core detects a write to graphics memory, this same
memory port will request a write to the graphics region of memory, bypassing
the cache. Issuing graphics writes over the ring allows multiple cores to issue
graphics writes simultaneously.
Memory read/write requests are sent as broadcast packets to all ring stops
in the memory ring by memory clients. Since these packets are broadcast
to all ring stops, all ring stops must acknowledge and receive them for the
packets to complete their lifespan. mem_to_ring couplers (attached to clients)
are configured to always accept every packet that they are given by their
ring stop. The packet will be discarded internally if one was not expected. If
a packet was expected, the incoming packet is inspected at that time. We
cannot broadcast only to memory servers, because a ring stop does not know
what kind of device is attached to any other ring stop. Therefore, we must
broadcast read/write requests to all ring stops in the system, including other
memory clients. This is why memory clients must acknowledge (and ignore
internally) any broadcast packets they receive that they weren’t expecting.
ring_to_mem couplers must also screen packets before passing
packet data to their associated memory server. Since reads and writes are
broadcast to all stops, a ring_to_mem coupler should only initiate a transaction
with its associated memory server device if the packet’s address falls within
that server’s address range. Packets outside of this dedicated range are
acknowledged and ignored by the ring_to_mem coupler. This way, broadcast
packets can be acknowledged and discarded when not intended for a particular
stop before the memory server itself ever receives the request. When the
ring_to_mem coupler observes a broadcast packet that matches its particular
address range, it will accept and service it, and then issue an acknowledgement
packet to the core that requested it. Acknowledgement packets are not broadcast;
they are sent directly to the requester. This is possible because memory servers
record which ring stop last requested a transaction of their attached resource.
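The screening step can be sketched as follows. This Python model represents
packets as dictionaries keyed by the field names of Table 3.12, ignores byte
enables, and follows the earlier statement that acknowledgement packets are
issued for reads. The function name, the address range parameters, and the
dictionary-backed memory are assumptions made for the sketch.

```python
RING_PACKET_KIND_READ = 2
RING_PACKET_KIND_WRITE = 3
RING_PACKET_KIND_ACK = 4


def ring_to_mem(packet, my_ring_id, base, limit, memory):
    """Screen one broadcast packet at a memory server; return an ACK packet or None."""
    addr = packet["mem_address"]
    if not (base <= addr < limit):
        return None  # outside our range: accepted by the coupler, then ignored
    if packet["kind"] == RING_PACKET_KIND_WRITE:
        memory[addr] = packet["mem_data"]
        return None
    if packet["kind"] == RING_PACKET_KIND_READ:
        return {
            "valid": True,
            "kind": RING_PACKET_KIND_ACK,
            "sender_id": my_ring_id,
            # Not a broadcast: only the requester's bit is set.
            "dest_vector": 1 << packet["sender_id"],
            "mem_address": addr,
            "mem_data": memory.get(addr, 0),
        }
    return None
```

Because the reply's destination vector contains only the requester's bit, the
acknowledgement is retired by the ring as soon as the requesting stop accepts it.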
Figure 3.4: L2 miss broadcast from core 0
Figure 3.4 shows how the ring handles the packet broadcast on an L2 miss
in core 0. First, the packet is injected from core 0, and is
immediately delivered back to core 0 since the packet was broadcast to all
ring stops, core 0 included. Core 0 will acknowledge and ignore this packet
from the ring stop. Next, the packet will arrive at ring stop 1 for core 1, where
it will be received, acknowledged, and ignored by the mem_to_ring module
at that memory client. Next, the packet will arrive at port 2 belonging to
the LLC memory server. The packet will be accepted and serviced by the L3
cache. Lastly, the packet will arrive at the graphics memory server, and will
be acknowledged and ignored since it does not belong to the graphics address
space. For packets issued from core 1 or packets destined for the graphics
controller, the same sequence is performed by the ring.
If any one of the devices was busy when the packet passed through its ring
stop, the packet would be forwarded to the next ring stop and not be delivered
to the device. When the packet is forwarded to the next ring stop, it will not
have its destination vector updated, and will continue to circle the ring. Since
the only ring stop capable of clearing a particular bit in the destination
vector is the ring stop that corresponds to that bit, the packet will circle the
ring indefinitely until that device becomes free and acknowledges it.
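The broadcast walk, including the retry case, can be traced with a short
Python model. It assumes one hop per cycle, omits the unpictured buffering
stops, and treats each device as busy until a fixed cycle; broadcast_trace
and busy_until are names introduced for this sketch.

```python
def broadcast_trace(num_stops, sender, busy_until):
    """Trace a broadcast packet around the ring.

    busy_until[i] is the first cycle at which the device at stop i is ready.
    Returns a list of (cycle, stop) pairs, one per successful delivery.
    """
    dest_vector = (1 << num_stops) - 1  # broadcast: one bit per ring stop
    pos, cycle, trace = sender, 0, []
    while dest_vector:
        bit = 1 << pos
        if dest_vector & bit and cycle >= busy_until[pos]:
            dest_vector &= ~bit          # only this stop can clear its own bit
            trace.append((cycle, pos))
        pos = (pos + 1) % num_stops      # the packet always moves on
        cycle += 1
    return trace
```

With all four devices ready, the packet from core 0 is delivered to stops 0,
1, 2, and 3 in consecutive cycles. If the LLC at stop 2 is busy until cycle 5,
the other three stops still take delivery on the first lap, and the LLC
receives the packet only when it comes around again in cycle 6.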
3.3.6 Coherence Ring
At this time, there is no support for inter-core cache coherence, as it is
not needed for modeling the attacks presented in this work. However, the
RING_PACKET_KIND_SNOOP packet kind is designated as a placeholder for future
snoop-based cache coherence protocols. Such a protocol may involve modifying
mem_to_ring to first issue RING_PACKET_KIND_READ or RING_PACKET_KIND_WRITE
packets along a dedicated lateral cache ring, allowing other cores’ caches to
directly serve the request. If such a request times out, then mem_to_ring can
issue the packet to the memory ring where it is served by the LLC. This lateral
coherence ring would also be responsible for issuing RING_PACKET_KIND_SNOOP
packets, which are sent when a core moves a line from shared to modified,
informing all other caches to invalidate that particular line if they have it.
Such a protocol is susceptible to race conditions, which could be solved by
inserting a serialization lock that imposes total request ordering and atomicity
amongst all the requests [5]. A cache directory protocol may instead be used,





This work provides a complete set of system software designed specifically
to work with the processor introduced in Chapter 3. This software system
provides kernel-mode utility functions to user-mode attack threads running on
the system. It takes care of scheduling, process creation, interrupts, syscalls,
exceptions, multicore bringup, and IPIs so that attack designers can focus on
writing the best attack software without needing to handle processor-specific
system code. This section provides an overview of the system software offered
with Pretty Secure System.
4.1 System Bringup
This section on system bringup will provide a detailed look at the early boot
stages of the processor system. This is meant to provide a deep look into how
to write system software for Pretty Secure System, and will cover basic CSR
usage and system state early in bringup. The intention is that there is no
uncertainty about the exact operations the kernel performs prior to beginning
the execution of user processes, which is where attack models will run.
At reset, each core in the system will immediately begin executing from
physical address 0x00000000 in kernel mode in the running state. A possible
implementation choice would be to have all cores but the main bringup core
begin in sleep mode, and wake the other cores up as needed using inter-
processor interrupts. However, the cores need to be configured to receive
inter-processor interrupts by setting up a kernel stack and configuring mtvec.
So, instead all cores begin execution immediately, and the kernel bringup
routine handles putting the non-essential cores to sleep once they are in a
state where they are ready to receive interrupts.
Listing 4.1 shows the first few instructions that execute on every core.
Listing 4.1: Bringup Interface
start:
    # Set up exception handler
    la x2, exception_handler_entry
    csrw CSR_MTVEC, x2
    # Set up stack pointer to default
    # kernel stack for this core
    # Anonymous stack =
    #   _kernel_anonymous_stack - 0x100 * mhartid
    csrr x1, CSR_MHARTID
    slli x1, x1, 8
    la sp, _kernel_anonymous_stack
    sub sp, sp, x1
    # Save kernel stack for exception contexts
    csrw CSR_MSCRATCH, sp
    # Begin execution in bootloader_main
    la x1, bootloader_main
    csrrw x1, CSR_MEPC, x1
    # Start in kernel mode
    li x1, PSP_PRIV_MACHINE
    csrrw x1, CSR_MPP, x1
    # Start with interrupts disabled
    li x1, 0
    csrrw x1, CSR_MPIE, x1
    # Give bootloader_main a return address
    la x1, wfi_forever
    # Jump to bootloader_main in kernel mode
    mret
First, mtvec is loaded with the global exception handler entrypoint
(exception_handler_entry). Note that there is only one handler for all
exceptions, syscalls, and interrupts. Next, the per-core stack is calculated by
reading the current core ID from mhartid and computing an offset relative to
the global anonymous kernel stack. The anonymous kernel stack is a region of
memory created by the custom linker script outside of any kernel object data
or instruction memory. The _kernel_anonymous_stack symbol points to the
very end of this region. Each core is given a default stack of size 0x100 bytes,
which is more than adequate for the basic bringup tasks performed by the
bootloader. We then load mscratch, mepc, mpp, and mpie with the execution
context we want to begin running in when C code takes over. By default, this
is with interrupts disabled in kernel mode at the bootloader_main method.
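The per-core stack arithmetic performed by the slli and sub instructions in
Listing 4.1 can be checked with a few lines of Python. The symbol address
0x8000 used in the usage note is a made-up value chosen only to make the
numbers concrete; the real address is decided by the linker script.

```python
STACK_BYTES_PER_CORE = 0x100


def anonymous_stack_top(kernel_anonymous_stack, mhartid):
    """Initial sp for a core: _kernel_anonymous_stack - 0x100 * mhartid."""
    # Mirrors `slli x1, x1, 8` followed by `sub sp, sp, x1`.
    return kernel_anonymous_stack - (mhartid << 8)
```

Assuming a hypothetical symbol address of 0x8000, core 0 starts with sp at
0x8000 and core 1 at 0x7F00. Since the stack grows downward, each core owns
the 0x100 bytes directly below its starting stack pointer.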
During interrupts, syscalls, and exceptions, mscratch is used as the kernel
stack pointer, so we set mscratch to our current anonymous stack. When
processes are scheduled, they will fill mscratch with their own per-process
kernel stack. We only set it to the anonymous stack for the time being so
that any interrupts, syscalls, or exceptions taken during early kernel execution
have some sort of a stack to use. (However, this stack is the same stack as the
one used for the other bringup tasks, so recovery from these states is unlikely
early in bringup. This is acceptable for now, as early bringup tasks are quite
simple: no interrupts, exceptions, or syscalls need to be handled yet.)
The return address is set to wfi_forever, which is where bootloader_main
will go if it ever returns. This method causes the processor to execute the
wfi (wait for interrupt) instruction in a loop.
bootloader_main will read the core ID from mhartid, and if it is not 0 (the
main core), it will put the core to sleep with interrupts enabled. If the kernel is
running on the main core, it will issue IPIs to wake up all other cores. The IPI
issued is the IPI_KICKOFF kind, which instructs the receiving core to execute
the kickoff method. This method initializes the per-core process table and
launches the launchd launcher daemon process. This process never quits and
handles all remaining kernel tasks. This includes booting attack user processes.
The launcher daemon is always kept running to ensure the scheduler can
always schedule a process, which simplifies context switching logic. After
launching the launcher daemon process, the earlier anonymous kernel stack
execution context is discarded. Syscalls, exceptions, and interrupts are also
properly supported immediately after handing execution over to launchd.
The launcher daemon itself can check the core ID and perform different actions
depending on which core it is executing on. For the multicore attacks, this
means it launches the sender process on one core, and the receiver on the
other. While the launcher daemon logic is currently quite simple (its primary
purpose beyond being the initial process is to keep the scheduler busy), it
provides an excellent launch point for expanding kernel logic for future attack
demonstrations and kernel features.
The reason that core 0 issues IPIs to wake up other cores instead of having
all cores immediately enter the launcher daemon in their respective bringup
contexts is so that different cores can be scheduled for execution at different
times. For one of the demonstrations presented in this work, only one core is
needed, so the second core is never brought up. This architecture also allows
for the main core to initialize various system components, or perhaps initialize
the cache system, without having another core running in the background.
It also provides a forced meet-up point between the two cores— as opposed
to having one core spin constantly waiting for a signal from the first one,
all cores are asleep by default until woken up. This design decision was
made as another means of handing absolute control of the system over to the
researcher using it. There should be as little noise as possible by default, and
this provides another means of eliminating unnecessary noise.
4.2 Syscalls, Exceptions, and Interrupts
The kernel provides a common entrypoint for all syscalls, exceptions, and
interrupts per the RISC-V ABI [2]. Collectively, we will refer to all three of
these as exceptions from now on. When an exception is encountered, the
processor will begin executing code pointed to by mtvec. Listing 4.2 provides
an overview of the steps that occur when an exception happens.
Listing 4.2: Exception Handler Entry
exception_handler_entry:
    atomic swap sp with CSR_MSCRATCH
    push all registers to the kernel stack
    mov arg 0, CSR_MCAUSE
    mov arg 1, CSR_MTVAL
    mov arg 2, stack top - 4
    jal exception_handler
    pop all registers from kernel stack
    atomic swap sp with CSR_MSCRATCH
    mret
The exception_handler C method takes three arguments: the mcause and
mtval that caused the exception, and a pointer to the saved user registers.
The kernel is free to read and modify the saved user registers. The mcause
and mtval values provide the exception handling routines with all information
required to understand what caused this exception.
4.2.1 Supported Syscalls
System calls, or syscalls, are a means of communicating with the kernel by
intentionally executing a specific instruction that causes a fault. The ecall
instruction, defined by the RISC-V ISA [2], provides support for processes
to communicate with kernel code. System calls use the RISC-V calling
convention [2] just like normal functions. A syscall can be triggered from
either user or kernel mode just by running the ecall instruction. Listing 4.3
provides the code stub required for executing a system call.
Listing 4.3: Code to Perform Syscall
# Perform syscall
# Arguments are passed in a0, a1, ..., a7





Table 4.1 provides an overview of all system calls supported by the kernel.
Table 4.1: All System Calls






































The quit syscall allows a process to exit permanently, passing its return code
to its parent if the parent called a blocking execute. The execute blocking and
execute nonblocking syscalls allow a process to schedule new child processes.
The start address field specifies the address to begin executing— this should
be set to the function that the new process will launch in. The privilege
level field allows the requester to specify an access level (PSP MACHINE or
PSP USER) for the new process. The interrupts enabled argument allows the
caller to specify whether the new process should run with interrupts enabled
or disabled. If the call was a blocking call, the parent process will not be
scheduled again until the child process completes. If the call was a nonblocking
call, then the parent process will continue execution and will not be able to
read the result of the child process’s call to quit.
The getc syscall allows a process to read a key from the keyboard. It will block
execution until a key is pressed, and will return the key’s value. The mmap
syscall allows a process to allocate a chunk of 0x1000 bytes from the kernel
memory allocator. There are no memory protection features in place, so this
syscall is not needed; however, it will be useful if memory protection features
are added. The release scheduler syscall allows a process to pass execution
over to some other process in the system. Since the scheduler is cooperative
and not preemptive, processes are expected to regularly call release scheduler
to allow other processes to continue execution.
The syscall numbering is out of order, and there are several numbers missing.
This is intentional— some spaces are left for future system calls.
4.3 Processes
The kernel supports multiple processes operating simultaneously. Processes
represent an execution context, which consists of an architectural state
and some metadata regarding that architectural state to provide support for
scheduling and blocking conditions. The root process, launchd— the launcher
daemon, is always running in the background, and all other processes can
trace their origin back to a process launched by launchd. To create a process,
the start address, desired privilege level, and whether or not it should run
with interrupts enabled are required. The kernel maintains processes with a process
control block (PCB) array. Each process in the PCB array has its own kernel
stack, user stack, saved CSRs, and some metadata for the scheduler regarding
whether this process is blocked or not.
When a process is first launched, the mpie, mpp, mepc, and mscratch CSRs
are populated with the desired interrupt enable status, desired privilege level,
desired code location, and kernel stack respectively. The registers are all set
to 0 except for the stack pointer, which is set to point to the user stack. Then,
the process is launched using the MRET instruction, which puts the processor
into the execution context of the new process. When an exception occurs, the
exception handler will use the kernel stack saved in mscratch as the kernel
stack. The user process will continue execution until it performs a syscall, an
interrupt occurs (if interrupts are enabled for this process), or an exception is
encountered. Upon entering an exception context, the system will switch into
kernel mode and run the exception handler.
Execution contexts can run in either user mode or kernel mode. The distinction
is what value gets loaded into mpp during process launch and later context
switches. There is no difference in how user and kernel mode processes are
treated by the rest of the kernel.
4.3.1 Blocking Reads from Keyboard
When a process performs the getc syscall, the scheduler puts it to sleep and
schedules other processes. (This is part of the reason that launchd can never
quit or perform blocking syscalls— the scheduler always needs something to
schedule). When the next keyboard interrupt occurs, the exception handler
walks the per-core process table looking for processes that are blocked on a
keyboard read. It writes the latest key into their control block and clears
the sleeping flag. The exception handler will then issue a context switch
to wake up the previously sleeping process. This (along with the execute
syscalls) is the only time that cooperative scheduling is violated, and is done
so to increase the responsiveness of pressing keys. Without this, the scheduler
might not schedule the blocked process for quite a while depending on which
process is running when the interrupt is received. Preempting the current
process to service the getc syscall results in a much smoother experience for
the user. Blocking executions perform a similar set of tasks to block a parent
process, and later return the child process’s exit code to the parent.
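The wake-up path for a blocked keyboard read can be sketched as below. This
is an illustrative Python model, not the kernel's C code: the PCB field names
(sleeping, waiting_for_key, key) and the function on_keyboard_interrupt are
assumptions, and the follow-up context switch is left out.

```python
from dataclasses import dataclass


@dataclass
class PCB:
    pid: int
    sleeping: bool = False
    waiting_for_key: bool = False
    key: int = 0


def on_keyboard_interrupt(process_table, key):
    """Hand `key` to every process blocked on a keyboard read and wake it."""
    woken = []
    for pcb in process_table:
        if pcb.sleeping and pcb.waiting_for_key:
            pcb.key = key                 # value the blocked getc will return
            pcb.sleeping = False          # eligible for scheduling again
            pcb.waiting_for_key = False
            woken.append(pcb.pid)
    return woken
```

Processes that are asleep for another reason, such as a parent waiting on a
blocking execute, are left untouched by this walk.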
4.4 Scheduling and Context Switching
The kernel supports a cooperative scheduling model. This section will discuss
the approach to scheduling and the rationale behind it, and explain the
mechanisms of context switching.
4.4.1 Cooperative Scheduling
The system uses a cooperative scheduler approach. Cooperative scheduling is
the scheduling paradigm where user processes are never interrupted by the
kernel for the express purpose of scheduling a new process to achieve time-
multiplexing of processes. User processes cooperate to share the resources
of the system, and release their execution time to the system at times each
process deems appropriate. Of course, a malicious user process could
hog all the time on the system by running with interrupts disabled and
never releasing the scheduler back to the kernel. However, as the purpose
of this platform is to research architectural attacks as opposed to software
attacks, having a cooperative scheduler is a large advantage. The cooperative
scheduling approach means that processes operate at an exact granularity
specified by the process. This allows for processes to synchronize with each
other natively through the scheduling interface, as opposed to using a variety
of other means. This also allows for simple atomic operations to happen
between processes pinned to the same core. Overall, using a cooperative
scheduling approach instead of the common preemptive scheduling model
of today’s systems allows for simpler and faster attack modeling on this
system. Additionally, adding support for preemptive multitasking would not
be difficult given the modularity of the current architecture. (All that is
needed is a hardware timer, a new interrupt port on the core, and a bit of
code in the kernel to link it all together).
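Cooperative scheduling can be demonstrated in miniature with Python
generators, where each yield stands in for a call to the release scheduler
syscall. This is only an analogy for the scheduling discipline, not a model
of the kernel; the run and worker functions are hypothetical.

```python
from collections import deque


def run(processes):
    """Round-robin over generator-based processes until all have finished."""
    queue = deque(processes)
    while queue:
        proc = queue.popleft()
        try:
            next(proc)          # runs until the process releases the scheduler
            queue.append(proc)
        except StopIteration:
            pass                # the process has quit


def worker(name, steps, log):
    """A process that does one unit of work and then yields, `steps` times."""
    for i in range(steps):
        log.append((name, i))
        yield                   # stands in for the release scheduler syscall
```

Running a "sender" and a "receiver" worker for two steps each interleaves
them strictly: sender 0, receiver 0, sender 1, receiver 1. The interleaving
is decided entirely by where each process chooses to yield, which is the
property that lets attack processes synchronize through the scheduler.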
4.4.2 Context Switching
Context switching is the process of swapping one execution context for
another. This involves swapping all CSRs, process registers, and the program
counter such that a process can be interrupted and later resumed without
any modification to the architectural state of that execution context. The
context switching approach used by this kernel is meant to maximize interface
simplicity while allowing flexibility for future additions to the multiprocessing
system design. Context switching can only occur when a process is operating
in an exception context in kernel code, where the kernel stack in use is the
kernel stack set up for that process at its creation. Additionally, only one
method is allowed to swap execution contexts, and this method is the aptly
named context_switch. By enforcing these restrictions, we can guarantee
that any sleeping (swapped away) processes are currently stopped within the
context_switch method.
When a process is launched for the first time as described in the Processes
section, its CSRs are configured, its integer registers are cleared, its user
stack pointer is set, and the system enters the execution context of the new
process. When an interrupt, syscall, or exception moves the system into the
exception context, the kernel stack for that process is used and all the user
registers are pushed to it. This kernel stack is then used as we call into kernel
methods. Eventually the kernel calls the context_switch method, where the
scheduler picks a new process to execute round-robin style. The CSRs for
this current process are recorded, and all the kernel registers are pushed to
its kernel stack.
No matter which process the scheduler picked, by the earlier assumptions it is
known that this process went through the same steps as the current process
within the context_switch method. We load the CSRs of the new process.
We swap the stack to point to the top of the kernel stack of the new process
and pop all the registers off (this restores the kernel callee-saved registers).
Now, we allow context_switch to exit. Since we assumed all processes enter
and exit through the context_switch method, the stack frame is torn down
appropriately, and any registers that were saved to the stack are properly
restored. As RISC-V passes the return address via a register [2], and there
are multiple other callee-saved registers, the compiler may place the
saved registers in any order on the stack within the context_switch prologue
and epilogue. This ordering will be the same for both the prologue and the
epilogue, and as we exit the context_switch method after restoring all the
kernel registers of the stopped process, we will tear down the stack frame
correctly for context_switch, restore any registers saved to the stack, and
return to the appropriate kernel function. By following those steps, we have
swapped from one execution context to another. This new execution context
will continue returning through kernel code to where it will eventually pop
the user registers off the stack and MRET back into the user execution context.
Since the CSRs were saved and restored, this will restore the execution context
to the appropriate privilege level and interrupt status.
Listing 4.4 provides an overview of the context switch process.
Listing 4.4: Context Switch Overview
def context_switch():
    pick a process from valid, non-blocked processes
    if current_process isn't null:
        save all CSRs to current_process's PCB
    load CSRs from next process PCB
    if current_process isn't null:
        save all kernel registers to kernel stack
        save stack pointer in current_process's PCB
    if next process has been run before:
        set sp to next process's saved stack pointer
        read all registers from stack
        return
    else:
        set all registers to 0
        set sp to user stack
        mret to new process
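The process selection step at the top of the listing can be sketched as a small Python model. This is an illustration only, not the kernel's code: the Process class, its fields, and pick_next are hypothetical names, and the real kernel is written in C.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    """Hypothetical stand-in for a PCB; the field names are assumptions."""
    pid: int
    valid: bool = True
    blocked: bool = False


def pick_next(table: List[Process], current: Optional[int]) -> Optional[int]:
    """Return the index of the next runnable process, round-robin style.

    The search starts just after the current slot and wraps around, so the
    current process is only chosen again if nothing else is runnable.
    """
    n = len(table)
    start = 0 if current is None else (current + 1) % n
    for offset in range(n):
        idx = (start + offset) % n
        if table[idx].valid and not table[idx].blocked:
            return idx
    return None  # nothing is runnable


# Process 1 is blocked, so the scheduler alternates between 0 and 2.
table = [Process(0), Process(1, blocked=True), Process(2)]
order = []
cur = None
for _ in range(4):
    cur = pick_next(table, cur)
    order.append(table[cur].pid)
print(order)
```

Starting the search one slot past the current process is what gives the round-robin behavior: a process that yields goes to the back of the rotation.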
4.5 Compiling and Linking
As the processor begins at reset with all registers set to 0x00000000 and
executes code starting at address 0x00000000, the first bytes of main memory
must be the instructions required to bring up the system. The compiling
and linking process includes a set of automation scripts for compiling the
kernel code, linking it such that the desired bootloader code starts at address
0x00000000, and extracting the raw binary of the built executable so that it
can be uploaded directly to the processor as firmware.
The compiler used is the RISC-V GNU C Compiler [33]. The compiler is
configured to compile for the rv32i subset of the RISC-V architecture [2, 33].
libgcc is used to provide software division routines (needed for divisors that
are not powers of two), as there is no hardware division support in the processor [33]. The
binary is then extracted to a binary file with objcopy [34]. This binary file is
passed through a Python script to convert it into a format readable by the
simulator [35].
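The conversion step can be sketched as follows. The simulator's actual input format is not described in this section, so the sketch assumes one 32-bit little-endian word per line, written as eight hex digits (the layout Verilog's $readmemh reads). The function name bin_to_hex_words is hypothetical.

```python
import struct


def bin_to_hex_words(raw: bytes) -> list:
    """Convert a raw binary image into one hex string per 32-bit word."""
    # Pad with zero bytes so the image is a whole number of words.
    padded = raw + b"\x00" * (-len(raw) % 4)
    # "<I" decodes each 4-byte group as a little-endian unsigned word,
    # matching RISC-V's little-endian instruction encoding.
    return ["%08x" % word for (word,) in struct.iter_unpack("<I", padded)]


# 0x00000013 is the rv32i encoding of `addi x0, x0, 0` (a NOP).
# The trailing two bytes are padded out to a full word.
image = bytes([0x13, 0x00, 0x00, 0x00, 0x6F, 0x00])
lines = bin_to_hex_words(image)
print("\n".join(lines))
```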
The linker script is used to ensure the very first instructions correspond to
the bringup bootloader entrypoint assembly code. The linker script is also
used to create the anonymous kernel stack region of memory for (currently)




So far, we have introduced Pretty Secure System, a multicore RISC-V
processor system with caching and interconnect support [2]. We have also
introduced a kernel that runs on this processor and provides cooperative
scheduling, multiprocessing, system call, interrupt, and exception support for
the processor system. In this chapter, we introduce the simulation suite that
provides a user-interactive realtime simulation of the processor, complete with
graphical feedback, a realtime debugger attached to the simulated processor,
and keyboard support.
The processor RTL design is written in SystemVerilog, a hardware
description language [29]. The Verilator SystemVerilog simulator and compiler
toolchain is used as the backbone of the simulator engine [27, 29]. Verilator
compiles SystemVerilog designs to a cycle-accurate C++ model [27, 36].
This C++ model provides an efficient means of interfacing with the
simulated hardware design. The simulator program can cycle the simulated clock,
issue interrupts, and read the registers and memory of the Verilated design.
These capabilities enable an entirely new class of security analysis tooling to
interface with Pretty Secure Processor. Every aspect of the entire design can
be examined, stepped through, and visualized by connecting to the Verilated
design. Figure 5.1 provides an overview of the simulator code.
Figure 5.1: Overview of the simulator
The simulator consists of two threads. The window and keyboard thread (the
main thread) manages all graphics and keyboard interrupts from the host OS.
This thread is responsible for periodically updating the windows displayed
to the user, and responding to key presses depending on what window is
present. When a key press that is intended for the processor is detected (a key
pressed while the main framebuffer window is selected), the window server
acquires the simulator lock and schedules an interrupt. The graphics thread
is allowed to asynchronously read memory contents of the Verilated model
without acquiring the simulation lock, as it is not crucial that all data is
synchronized before being graphically represented. This may result in a slight
amount of screen tearing, or the cache viewer being behind a cycle or two;
however, this is not noticeable in practice, and the speed benefits achieved by
not synchronizing allow the entire simulator system to feel extremely fluid.
As most of the quantitative analysis will be performed using the debugging
server anyway, it is only important that the debugging interface provides
the latest synchronized data, which it does.
The second thread, the debugging server thread, handles management of the
Verilated model. When the simulator is run in run mode, this thread will run
the processor continuously without waiting for a debugging client. When the
simulator is run in debug mode, it will respond to user commands. Depending
on the user action, this may involve single-stepping until a single instruction
retires or the processor goes to sleep, continuing execution until the processor
hits a breakpoint, or reading the register or memory contents from a core and
reporting back to the client over the network. The Debugging Server section
provides more information on how this thread behaves.
The window and keyboard thread always runs in the main execution context
(not as a separate pthread), as running it as the main thread is required for
compatibility across host environments. The simulator supports
running natively on the macOS and Linux operating systems [37, 38].
For cross-platform compatibility, the simulator uses GTK [39] for all window
rendering, refreshing, and keyboard/mouse interaction. This allows the
simulator to be compiled on both macOS and Linux without the need to
rewrite the windowing code. The code responsible for populating
each window uses raw image buffers and draws directly into them pixel by
pixel for maximum performance and a consistent user interface across all
platforms.
5.1 Debugging Server
The debugging server acts as a GNU GDB remote target debugging server
[28]. This allows GDB clients using the RISC-V GDB binary to target the
simulator using the target remote command from within GDB. This server
is intended as a means of inspecting the state of the running system. Since the
kernel is compiled with symbols and debug information, and is configured to
place all code at the appropriate addresses in the processor physical address
space, debugging symbols are accessible from GDB as well. Variables can
be inspected by name, functions can be set as breakpoint targets by name,
and code can be disassembled by referring to exact symbols. Hook stops
are functional as well, as is reading memory contents and register state. At
this time, the debug interface provides read-only access to the processor
architectural state. The debug server is intended as a means for architects
to inspect the exact state of their attack code and software, and collect data
from their experiments, without modifying any microarchitectural state.
Listing 5.1: GDB Demo
void gdb_demo() {
    int a = 1;
    int b = 2;
    int c = 0;
    char textbuf[64];
    c = a + b;
    snprintf(textbuf, sizeof(textbuf),
             "%x + %x = %x", a, b, c);
    draw_string(0, 0, textbuf, COLOR_WHITE);
}
Figure 5.2: Graphics window during debugging
Figure 5.3: GDB remote debugging client
Listing 5.1 contains a simple function that will be used to demonstrate the
interactive debugging capabilities of the simulator. Figure 5.2 shows the
graphics window attached to the processor system at the end
of the debugging session for the gdb_demo method. Figure 5.3 shows the
remote GDB client attached to the custom debug server running as part of
the simulator program. This window is running regular GDB, and is issuing
commands over the network that the simulator runs on the Verilated hardware
design [28]. This demo shows that full kernel symbols are present, and that
stepping through the program, setting and hitting breakpoints, and printing
variables and memory work completely naturally. Since the GDB protocol
operates over a network socket, the GDB client can debug the Verilated
processor model remotely over the Internet.
The server itself is custom designed to be modular and interact directly with
the processor model, and to follow the majority of the GDB server protocol
[28]. Checksums are computed automatically before a packet is sent to the
client via the server code. Since the GDB protocol is string based, there are
several GDB string packet helper methods provided as part of the Pretty
Secure System simulator codebase that allow future modification to this
custom debugging server. These helper methods allow for expansion to the
debugging server without needing to reimplement the basic transaction level
socket programming, or the GDB checksum logic.
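For readers unfamiliar with the GDB remote serial protocol, the framing and checksum logic those helpers take care of can be sketched as below. A packet has the form $<payload>#<checksum>, where the checksum is the sum of the payload bytes modulo 256, written as two hex digits. The function names are hypothetical and are not the helpers from the simulator codebase; escaping of special characters inside the payload is left out of this sketch.

```python
def gdb_checksum(payload: str) -> int:
    """Sum of the payload bytes, modulo 256, per the GDB remote protocol."""
    return sum(payload.encode("ascii")) % 256


def frame_packet(payload: str) -> str:
    """Wrap a payload as $<payload>#<two hex checksum digits>."""
    return "$%s#%02x" % (payload, gdb_checksum(payload))


def parse_packet(packet: str) -> str:
    """Return the payload of a framed packet, raising on a bad checksum."""
    if len(packet) < 4 or not packet.startswith("$") or packet[-3] != "#":
        raise ValueError("malformed packet")
    payload, received = packet[1:-3], int(packet[-2:], 16)
    if gdb_checksum(payload) != received:
        raise ValueError("bad checksum")
    return payload


print(frame_packet("OK"))
```

A server built this way only has to produce payload strings; framing and verification stay in one place.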
5.1.1 Breakpoints
The Pretty Secure System model has methods to individually single-step until
a new instruction is detected by the simulator, or to continue until a breakpoint
is hit. Breakpoints are maintained by the simulator. While the processor
is continuing until a breakpoint is hit, the simulator compares the current
committed PC against a list of valid breakpoints. If a breakpoint is found, the
processor stops and the GDB server reports that the breakpoint has been hit to
the client. From the perspective of the GDB client, these breakpoints appear
as GDB software breakpoints. They actually act more
like hardware breakpoints, as they are built into the testbench program and
do not ever modify the software running on the virtual processor. However,
for simplicity with the GDB server protocol, the custom debugging server
pretends that all breakpoints are software based when reporting to the client.
Recall that the processor system being modeled is a pipelined processor, so
there are multiple in-flight instructions simultaneously. The debug interface
operates on the very end of the pipeline (the WRITEBACK stage) such that all
instructions that are visible to the debugger are valid committed instructions.
This has the drawback that exception-causing instructions that are invalidated
prior to reaching the WRITEBACK stage, such as ECALL and invalid CSR accesses
by user code, will not be visible to the debugger. Any breakpoints set at
exception-generating instructions will also fail to be triggered. If a core goes
to sleep, instructing it to step will result in the simulator instantly returning
control back to the debugger. This way the debugger does not freeze when
the debuggee sleeps.
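The continue-until-breakpoint behavior, and the reason a breakpoint on a squashed instruction never fires, can be modeled with a short Python sketch. It is a toy model under stated assumptions: the real simulator is C++ driving the Verilated design, and the function name and the per-cycle trace representation are hypothetical.

```python
from typing import Iterable, Optional, Set


def continue_until_breakpoint(commits: Iterable[Optional[int]],
                              breakpoints: Set[int]) -> Optional[int]:
    """Advance cycle by cycle until a committed PC matches a breakpoint.

    `commits` yields, per simulated cycle, the PC retiring from WRITEBACK,
    or None when no instruction commits that cycle (a bubble or a squashed
    instruction). Returns the PC that was hit, or None if the trace ends.
    """
    for pc in commits:
        if pc is not None and pc in breakpoints:
            return pc  # stop the model and report the hit to the client
    return None


# An exception-raising instruction at 0x108 is invalidated before
# WRITEBACK, so it never shows up as a committed PC (None here).
trace = [0x100, None, 0x104, None, 0x200, 0x204]
print(continue_until_breakpoint(trace, {0x108}))
print(hex(continue_until_breakpoint(trace, {0x200})))
```

Because the comparison happens in the testbench against committed PCs, the program image is never patched, which is why these behave like hardware breakpoints.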
5.1.2 Reading Architectural State
While the inner pipeline of the core handles all hazard forwarding, hazards
are not visible at the architectural level. Each committing instruction
modifies the register file as it exits WRITEBACK, so the register file always
holds the latest committed version of the registers, and the debugger reads
the current architectural state directly from it (the effects of in-flight
instructions should not be architecturally visible to the debugger). To read
from memory, the debugger reads from
the copy of memory that is simultaneously used for cache verification. This
means that multicore writes will not be propagated to the memory visible
from GDB— the debugging server is primarily designed for debugging a single
core without any shared memory.
5.2 Graphics
Figure 5.4: The 640 by 480 graphics framebuffer and text overlay window
Figure 5.4¹ is a screenshot taken of the running system. The “Pretty Secure
Processor” window provides a 640 by 480 24-bit color region into which the
cores can draw arbitrary graphics. In this example, the main core has drawn
a picture to the background layer using 24-bit color. The core has also used
the text overlay layer to draw some text in the upper left corner.
Figure 5.5: Path that graphics packets take through the ring interconnect
Figure 5.5 shows the path that graphics write packets take through the ring
interconnect to reach the graphics controller from core 0. When the graphics
controller ring stop observes a broadcast write packet intended for an address
in the graphics range, it forwards that write request to the graphics controller
core. The graphics core contains two large buffers of memory— one for the 640
by 480 32-bit framebuffer, and another for the 80 by 30 8-bit character overlay
array. The graphics controller calculates the appropriate offsets depending
on which region of memory is targeted by a particular request packet and
stores the data at the appropriate location in the internal buffer, acting as a
large write-only scratchpad of memory.
¹This screenshot contains a picture of Skittles the green cheek conure.
5.2.1 Compositing
The simulator window asynchronously reads from both memory buffers in
the graphics controller each time it is scheduled for a refresh, and displays
the contents to the screen. For the full-resolution graphics framebuffer, this
involves reading the data out pixel by pixel and copying it into the window
presented to the user. For the text overlay, the simulator thread iterates over
each character in the character buffer one by one, and if it is a printable
character, draws it over the full-resolution 640 by 480 graphics framebuffer
using a pure white pixel (0x00ffffff). The font used is the “VGA Font”
from the QEMU project [30, 31]. Each character is 8 pixels wide and 16
pixels tall. The simulator thread then submits this completed image to the
windowing library for display to the user [39].
Both the 640 by 480 graphics framebuffer and text overlay regions of memory
are laid out in row-major order. This means that the first data point inside
of each corresponds to the top left of the screen, the second data point
corresponds to the index immediately to the right of the top left, and so on.
For the graphics framebuffer, each data point in the memory region is a 32-bit
integer. For the text overlay, each data point is a single 8-bit character. More
information on the graphics memory system can be seen in Chapter 3 Section
3.2 on the Memory Hierarchy.
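The row-major layout can be made concrete with a short Python sketch of the offset arithmetic. Offsets are relative to the start of each region, since the base addresses are not restated in this section; the function names are hypothetical.

```python
FB_WIDTH, FB_HEIGHT, FB_BYTES_PER_PIXEL = 640, 480, 4   # 32-bit pixels
TEXT_COLS, TEXT_ROWS = 80, 30                           # 8-bit characters
CHAR_W, CHAR_H = 8, 16                                  # glyph size in pixels


def pixel_offset(x: int, y: int) -> int:
    """Byte offset of pixel (x, y) from the start of the framebuffer."""
    if not (0 <= x < FB_WIDTH and 0 <= y < FB_HEIGHT):
        raise ValueError("pixel out of range")
    return (y * FB_WIDTH + x) * FB_BYTES_PER_PIXEL


def char_offset(col: int, row: int) -> int:
    """Byte offset of character cell (col, row) from the start of the overlay."""
    if not (0 <= col < TEXT_COLS and 0 <= row < TEXT_ROWS):
        raise ValueError("character cell out of range")
    return row * TEXT_COLS + col


def char_origin(col: int, row: int) -> tuple:
    """Top-left framebuffer pixel at which cell (col, row) is drawn."""
    return col * CHAR_W, row * CHAR_H


print(pixel_offset(0, 1), char_offset(0, 1), char_origin(79, 29))
```

Note that 80 columns of 8-pixel glyphs and 30 rows of 16-pixel glyphs tile the 640 by 480 framebuffer exactly.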
5.2.2 Keyboard
Pressing a key while the processor graphics window is selected will schedule
an interrupt to run on core 0 during the next instruction. The next time the
debugger thread runs an instruction, it will issue an interrupt to the Verilated
design. It will also pass in the keycode to the System Platform, which will
record the keycode and send it to the core as the keyboard data register. The
core will read this value and send it to the kernel via the mtval CSR.
5.3 Visualizers
The simulator infrastructure comes with two windows that provide an even
deeper level of introspection into the state of the system than GDB offers.
While the debugging server is great for understanding architectural state,
hardware attacks typically exploit state below the ISA abstraction. The
simulator system ships with two visualizers for the two main
microarchitectural components attacked in this work: the cache visualizer and the ring
interconnect visualizer.
5.3.1 Cache Visualizer
Figure 5.6 shows the cache visualization window.
Figure 5.6: The cache visualization window
This window is drawn directly into an image buffer from C++ that is then
presented to the user using GTK [39]. This provides a consistent user
experience across all host environments, and ensures the user interface is fast
and responsive. Mouse and keyboard support are provided by GTK. Clicking
on any cache line in any of the caches will select it and allow the user to see
more about that particular line. GTK is also used to periodically refresh
the window such that it maintains accuracy in realtime. As the processor is
running, the cache window is automatically updating to constantly represent
the latest processor cache state. The user can even type keystrokes into the
graphics window to issue keyboard interrupts to the processor, and watch the
cache respond to the core handling those interrupts in realtime.
The upper left shows an overview of the split L1 cache (icache and dcache)
and the L2 cache for core 0 in the system. Invalid lines are colored black, and
valid lines are given one of eight possible colors. These colors are mapped to
repeating segments of the total address space so that addresses separated by
a large distance will appear colored differently. This is so that, in general,
addresses owned by different processes will be colored differently in the cache
viewer. The address distance granularity before changing colors is configurable
at compile time to suit any particular application. If a cache line is modified,
it will be colored with a thin white outline. Figure 5.7 shows the cache
visualization window in the presence of modified lines.
Figure 5.7: The cache visualization window with modified lines
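One plausible form of the color mapping function is sketched below in Python. The exact function used by the visualizer is not given here, so this is an assumption: the address space is cut into fixed-size segments and the segment index wraps around the eight colors. The granularity value 0x100000 is a hypothetical choice for illustration.

```python
NUM_COLORS = 8
GRANULARITY = 0x100000  # hypothetical segment size; configurable in the real tool


def color_domain(addr: int, granularity: int = GRANULARITY) -> int:
    """Map an address to one of NUM_COLORS repeating color domains."""
    return (addr // granularity) % NUM_COLORS


# Nearby addresses share a domain, distant ones usually do not.
for addr in (0x00000, 0x17000, 0xB00000):
    print(hex(addr), color_domain(addr))
```

Since the segments repeat, two addresses exactly eight segments apart land on the same color, which is why the text says distant addresses are colored differently only "in general".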
The bottom left window within the cache visualizer contains metadata about
the currently selected line. When a line is clicked, it is given a thick white
outline, and the selected line and data windows appear. The selected line
window shows the exact cache line size aligned address, tag, set, domain
(which color the line is mapped to by the color mapping function), its status
(invalid, shared, or modified), and which cache it belongs to. The data
window shows the data contained within that line broken down into 32-bit
words. This data may be dirty and may not have been written back to lower
memory yet. At this time the cache visualization tools only support viewing
the per-core cache, and not the last level cache. However, adding support for
an LLC visualizer would not be difficult, as the cache viewer window code is
modular by design to visualize any cache created with the cache module.
5.3.2 Ring Interconnect Visualizer
Figure 5.8 shows the ring interconnect visualizer window.
Figure 5.8: The ring interconnect visualizer window
The ring interconnect visualizer window is currently non-interactive and only
provides visual feedback on the activity of the ring. There are eight different
zones that can be active in the ring: cores 0 and 1, the LLC, graphics, and
the four ring stops they connect to. Whenever any component is handling a
transaction, it will light up to indicate that it is busy. Over time, this allows
an intuitive sense of the contention occurring on the ring at a glance, and
provides insight into the state of the ring over a given time period.
5.4 Runtime Verification
The simulator includes a series of runtime verification components to ensure
the hardware is operating correctly. As each individual core conforms to the
RISC-V ISA [2], RISC-V verification tools will work with them. The
RISC-V Formal Verification Monitor is an open-source synthesizable verification
monitor for RISC-V processors [32]. The main core in Pretty Secure System
has an associated RVFI monitor attached, monitoring every instruction
committed by the core. If an error is detected, RVFI will stop the simulation
and report the error code to the console.
Additionally, for verifying single-core programs, there is a shadow region of
single-cycle memory attached to the core that bypasses the cache. This region
is never read by the core itself, but is used for verification. Every time
the core writes to its cache, the shadow memory also records the write. If
the cache ever reports data that differs from what the shadow memory has
stored, it is likely the cache has produced an erroneous value, and the shadow
memory will stop the simulation. This shadow memory is implemented with
a custom series of checks that run within the Verilated processor model. This
region of memory is also used for serving GDB memory requests. Instead of
running the simulated design to request memory from the caches, memory is
read from this shadow region of memory. This is a significantly faster and
less complicated model, with the caveat that writes from other cores are not
propagated to the results visible from GDB. This makes the shadow memory
feature and GDB memory features best suited for single-core applications, or
programs that have no shared memory between cores.
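The shadow memory idea can be sketched in a few lines of Python. This is a behavioral model only: the real checks run inside the Verilated processor model, and the class, method, and exception names here are hypothetical. The sketch also assumes that addresses never written read back as zero when served to the debugger.

```python
class CacheMismatch(Exception):
    """Raised when the cache returns data that disagrees with the shadow copy."""


class ShadowMemory:
    def __init__(self):
        self.words = {}  # word address -> last value written by the core

    def record_write(self, addr: int, value: int) -> None:
        """Mirror every store the core makes to its cache."""
        self.words[addr] = value & 0xFFFFFFFF

    def check_read(self, addr: int, cache_value: int) -> None:
        """Compare a value returned by the cache against the shadow copy."""
        if addr in self.words and self.words[addr] != cache_value:
            raise CacheMismatch(
                "addr 0x%08x: cache 0x%08x, shadow 0x%08x"
                % (addr, cache_value, self.words[addr]))

    def debugger_read(self, addr: int) -> int:
        """Serve a GDB memory request without stepping the simulated design."""
        return self.words.get(addr, 0)


shadow = ShadowMemory()
shadow.record_write(0x1000, 0xDEADBEEF)
shadow.check_read(0x1000, 0xDEADBEEF)      # matches, so no error is raised
print(hex(shadow.debugger_read(0x1000)))
```

The model also shows the caveat from the text: a store made by another core never reaches record_write, so neither the check nor the debugger view would see it.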
The system also features an array of non-synthesizable SystemVerilog test-





Now that we have introduced Pretty Secure Processor and the software and
simulator support for it, we will use the system to model two architectural
attacks. The first attack is the Prime+Probe cache attack, and the second is
the Lord of the Rings interconnect attack. Both attacks will set up a covert
channel using the Pretty Secure System kernel’s multiprocessing capabilities.
The first attack will run on a single core of the processor and attack the
shared L1 and L2 cache. The second attack will run the receiver on one core
and the sender on another, and attack the ring interconnect and shared L3
cache. We will apply the unique visualization tools to visualize how these
attacks work.
6.1 Using the High Resolution Timer
In order to execute the Prime+Probe and Lord of the Rings attacks, we will
need access to the high resolution timer provided by Pretty Secure System.
The high resolution timer increments by one every cycle regardless of whether
or not an instruction was completed that cycle. This timer is accessible
from user mode with the CSR utimer, and has address 0x037. Listing 6.1
demonstrates how to time memory accesses on Pretty Secure Processor. As
the processor is an in-order design without any memory reordering, there is
no need to use memory fencing here.
Listing 6.1: Timing latency to read an address
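uint32_t time_access(uint32_t addr_to_poke) {
    uint32_t utimer1, utimer2;
    asm volatile (
        "csrr %[timer_r1], 0x037\n"      // read utimer before the load
        "lw x0, 0(%[r_addr_to_poke])\n"  // access the address, discard the data
        "csrr %[timer_r2], 0x037"        // read utimer after the load
        // "=&r" (early clobber): timer_r1 is written before the input is
        // read, so it must not share a register with r_addr_to_poke.
        : [timer_r1]"=&r"(utimer1), [timer_r2]"=r"(utimer2)
        : [r_addr_to_poke]"r"(addr_to_poke)
        :
    );
    return utimer2 - utimer1;
}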
Running time_access with some address returns the observed latency
(in clock cycles) to read from that address.
6.2 Prime+Probe Cache Attack
A Prime+Probe covert channel is a means of allowing two processes to
communicate with one another by creating contention on the shared cache.
For this attack, we will be creating two processes as shown in Listing 6.2
using the Pretty Secure System kernel cooperative scheduling multiprocess
system. We will then communicate between these processes using contention
on the shared per-core cache.
Listing 6.2: Prime+Probe launchd
void launchd_prime_probe() {
    do_syscall(
        SYSCALL_EXECUTE_NONBLOCKING,
        (uint32_t)&prime_probe_receiver_thread,
        PSP_PRIV_USER, // User Process





        (uint32_t)&prime_probe_sender_thread,
        PSP_PRIV_USER, // User Process
        true // Interrupts Allowed
    );
    while (1) {




Two user processes will be scheduled concurrently with interrupts enabled for
both— the sender process and the receiver process. Since the execute call
was performed on the same core for both processes, these processes will be
executed pinned locally to the same core as launchd (in this case, core 0).
This means that the L1 and L2 cache will be shared between them and we
can use it to create a covert channel.
6.2.1 Demonstrating Contention
Since the system uses a cooperative scheduler, each process can choose when
to give up its time slice. We will demonstrate viable contention by sending
alternating 1s and 0s and measuring the latency from the receiver.
First, the sender process will create an eviction set and fill the entirety of
the L2 cache in set 0. Then, the receiver process will be scheduled, and will
time the latency to access a field from set 0. Then, the sender process will
be scheduled again, and will not modify its eviction set. Finally, the receiver
process will be scheduled, and will time the latency to access that same field.
The receiver process should observe a larger latency the first time than the
second time. While this attack does rely on the sender and receiver being
able to knowingly target a particular set of the cache, this attack does not
require shared memory. The sender only needs a single eviction set that maps
to the same set that the receiver uses for priming and probing. Figure 6.1
shows the receiver printing the access latencies to the screen while the sender
sends an alternating pattern of ones and zeros.
Figure 6.1: The receiver process printing latencies
Figure 6.2 shows using GDB to measure these same latencies by setting a
breakpoint just after the receiver calls time_access and printing the access
latency observed by the receiver.
Figure 6.2: Measuring latencies with GDB
Using a GDB script, we automatically collected the latency of 100 iterations
of the receiver. Plotted in Figure 6.3 are these results, demonstrating the
extremely low noise environment provided by Pretty Secure System. The
spikes to high latencies correspond to the sender sending a 1 (evicting the
primed receiver line from L1 and L2, resulting in a cache miss), and the dips
correspond to the sender sending a 0 (the sender not evicting anything so the
line is still present in L1).
Figure 6.3: Latency vs time for 100 iterations
6.2.2 Modeling a Noisy Channel
As Figure 6.3 shows, the amount of noise present by default on the system is
nearly 0. For cache-based side channels, the noise incurred due to scheduling
and context switches is quite avoidable. In this section, we will spawn a
noisy fourth process (alongside launchd, the sender, and receiver) to introduce
noise into the system to demonstrate how the system is capable of modeling
noisy environments. Figure 6.4 shows the result of 100 iterations of sending
alternating 1s and 0s in the presence of this noisy fourth process, which has a
50% chance of evicting everything in the L1 cache (at the set targeted by the
covert channel).
Figure 6.4: Latency vs time for 100 iterations with a noisy background
process
This noisy process introduces two new latency possibilities. The primed data
could have been evicted by the noisy process but not by the sender. In this
case, the data will be in L2 but not L1, and the latency will increase from 6
cycles to 14. The primed data could also have been evicted by both the sender
and noisy process, resulting in a latency of 53 cycles due to a trip through the
LLC to physical memory. Naturally, the amount of noise could be increased
or tuned to create more varying conditions to test the signal-to-noise ratio
of particular implementations of covert channels. The latency at each stage
could also be tuned by adjusting the hardware. For example, this system’s
physical memory controller is quite quick compared to real systems to offset
the performance penalty coming from simulating an entire processor system.
Having the ability to completely disable noise makes prototyping attacks
simple, as there is no question as to what is creating the contention. This
system allows for both noiseless testing environments and customizable levels
of noise to verify an attack’s integrity.
6.2.3 Visualizing Contention
Now that we have demonstrated a viable cross-process covert channel, let’s
use the cache visualization features of the simulator to demonstrate the covert
channel visually. Figure 6.5 shows the state of the cache immediately before
the receiver reads a 1. Figure 6.6 shows the state of the cache immediately
before the receiver reads a 0. In both Figures 6.5 and 6.6, the lines belonging
to the sender process have been colored red, and lines belonging to the receiver
process and kernel have been colored green. Contention is created on the
top set, set 0. The receiver primes the cache by inserting its single address
into L1D. The sender fills the entire L1D and L2 top sets with its eviction
set, consisting of 8 cache lines, to send a 1 (and does nothing to send a 0).
The receiver then probes the latency to access its line to determine if the
sender sent a 1 or a 0. Sender addresses begin at 0xb00000 and count up
by increments of 0x400. The receiver primes and probes using the address
granted to it by mmap, which is 0x17000 in this example.
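The sender's eviction set described above can be written out explicitly. The Python sketch below uses only the numbers stated in the text (base 0xb00000, stride 0x400, 8 lines, receiver address 0x17000). The final check rests on an inference, not a stated fact: that addresses a multiple of 0x400 apart map to the same cache set, which is what the chosen stride suggests.

```python
SENDER_BASE = 0xB00000   # first sender address
STRIDE = 0x400           # increment between sender addresses
LINES = 8                # cache lines in the eviction set
RECEIVER_ADDR = 0x17000  # address granted to the receiver by mmap


def eviction_set(base: int = SENDER_BASE, stride: int = STRIDE,
                 count: int = LINES) -> list:
    """Addresses the sender touches to fill the targeted set."""
    return [base + i * stride for i in range(count)]


addrs = eviction_set()
print([hex(a) for a in addrs])

# Assumption: equal residues modulo the stride imply the same cache set.
same_set = all(a % STRIDE == RECEIVER_ADDR % STRIDE for a in addrs)
print(same_set)
```

Under that assumption, every sender address collides with the receiver's primed line without the two processes sharing any memory.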
Figure 6.5: Reading a 1
Figure 6.6: Reading a 0
In Figure 6.5, both the entire L1D and L2 top set are filled with sender
addresses (notice how all lines are colored red). In Figure 6.6, some sender
lines are still present in the top set as they are left over from previous
transmissions. However, the primed receiver address, 0x17000, is still present
in the L1D cache immediately prior to probing. The latency for probing will
be noticeably smaller in this case than in the case where the primed address
is completely evicted from both L1D and L2.
6.2.4 Creating a Covert Channel
Now that contention has been shown to be capable of inducing extra latency
on the receiver when it probes a line it has previously primed, we can use the
cache to create a covert channel and send data between the processes without
any shared memory or inter-process communication through the kernel. Notice
how the use of a cooperative multitasking scheduler makes synchronization
between processes simple— this design decision helps architects focus on
modeling the attacks themselves without worrying about system noise due to
preemptive scheduling and uncontrolled background tasks. This covert channel
demo will operate without the presence of a noisy process for simplicity.
The sender process will send data to the receiver one bit at a time. It will read
from a string variable and leak the contents of the string over the cache, bit
by bit. When the string is complete, it will send a NULL byte to indicate the
end of transmission. The receiver will read bits from the covert channel and
store them in a buffer. When it reads a NULL byte it will display the string
it received to the graphics window. Figure 6.7 demonstrates transmitting a
message over the covert channel. For this demo, an observed latency ≥ 16
cycles is defined to be a 1, and smaller latencies are defined to be 0.
Figure 6.7: Sending a message over the covert channel
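The receiver's decoding rule can be sketched in Python as follows. The threshold of 16 cycles and the NULL terminator come from the text; the bit order (most significant bit first) is an assumption, as are the function names. The encode helper exists only to produce sample input, and its two latency values are illustrative, not measurements.

```python
THRESHOLD = 16  # cycles; a latency >= 16 is read as a 1


def latency_to_bit(latency: int) -> int:
    return 1 if latency >= THRESHOLD else 0


def decode(latencies: list) -> str:
    """Group probe latencies into bytes (MSB first) until a NULL byte."""
    out = bytearray()
    for i in range(0, len(latencies) - 7, 8):
        byte = 0
        for lat in latencies[i:i + 8]:
            byte = (byte << 1) | latency_to_bit(lat)
        if byte == 0:
            break  # NULL byte marks the end of transmission
        out.append(byte)
    return out.decode("latin-1")


def encode(message: str, high: int = 53, low: int = 6) -> list:
    """Produce sample latencies for a message; values are illustrative."""
    bits = []
    for ch in message.encode("ascii") + b"\x00":
        bits += [(ch >> shift) & 1 for shift in range(7, -1, -1)]
    return [high if b else low for b in bits]


print(decode(encode("hi")))
```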
6.3 Lord of the Rings Interconnect Attack
The Lord of the Rings attack is a cross-core attack that targets the ring
interconnect. Contention is created on the shared last level cache ring stop
by issuing packets on the ring, which results in increased latency visible from
the receiver process. Figure 6.8 provides an overview of the variant of the
Lord of the Rings attack being demonstrated here.
Figure 6.8: Overview of the Lord of the Rings attack
The receiver core periodically misses in its L2 cache, resulting in probe LLC
read packets being issued along the memory ring. To send a 1, the sender
core intentionally misses in its L2 cache, resulting in increased latency on the
receiver due to contention on the LLC and extra packets on the memory ring.
To send a 0, the sender core performs no action, which reduces the latency
observed by the receiver core.
To schedule processes on different cores, IPIs are required. First, the main core
boots up, issues an IPI to core 1, and launches its launchd process. When core
1 receives the IPI, it launches its own copy of launchd. launchd reads the
current core ID, and executes the correct Lord of the Rings process depending
on which core is running. A handler process is launched on each core which
then performs a blocking execute to start the appropriate Lord of the Rings
attack process. Due to the reliance on the ring interconnect for issuing IPIs
and reading from the LLC, the cores cannot be assumed to be synchronized
with one another. To compensate for this, multiple measurements are taken
for each bit received. This lengthens the time needed to transmit each bit,
but also improves accuracy.
6.3.1 Measuring Contention
Figure 6.9 shows the measurements of contention taken by the receiver process
probing the memory ring interconnect.
Figure 6.9: Contention measured by the receiver for {0xff, 0x00}
In this example, the sender process is sending a sequence of 16 bits, where the
first eight bits are 1’s, and the second eight bits are 0’s. As the data shows,
this attack can be quite noisy, and taking multiple data points together helps
ensure good accuracy in the observed side channel.
As the sender and receiver cannot communicate beyond independently timing
the latency of accessing the ring and the shared LLC, there is no way for
them to synchronize explicitly. Each must instead maintain its own local schedule
to stay in step after the attack has begun (that is, the receiver should
always sample a single bit from the sender, not multiple). Since the latency
observed will be significantly larger when there is contention than when there
is not, the sender must take care to ensure the total time spent sending a 1
is equivalent to the total time spent sending a 0. If these two times are not
equal, the receiver will slowly get out of sync with the data stream depending
on the data being sent. So, the sender and receiver must both time themselves
to ensure that they are scheduling transactions for roughly equivalent time
slices. This is done by using the utimer CSR. After issuing a transaction,
both the sender and receiver spin until the total elapsed time is 0x30000
cycles, a heuristic determined experimentally.
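The effect of padding every transaction to a fixed slot can be illustrated with a short Python model. This is a sketch under stated assumptions: the function name and the per-bit work times are hypothetical, and on the real system the padding would be a spin loop on the utimer CSR rather than arithmetic. Only the 0x30000-cycle slot length comes from the text.

```python
SLOT_CYCLES = 0x30000  # slot length from the text


def slot_end_times(bits, work_cycles, pad_to_slot):
    """Return the time at which each bit's transaction ends.

    work_cycles maps a bit value to the cycles of active work it costs
    (hypothetical numbers, chosen only to show the drift).
    """
    now = 0
    ends = []
    for bit in bits:
        start = now
        now += work_cycles[bit]
        if pad_to_slot:
            # Models spinning on the timer until the full slot has elapsed.
            now = max(now, start + SLOT_CYCLES)
        ends.append(now)
    return ends


bits = [1] * 8 + [0] * 8            # the {0xff, 0x00} pattern
work = {1: 0x2C000, 0: 0x1000}      # assumed: sending a 1 takes far longer
ideal = [(i + 1) * SLOT_CYCLES for i in range(len(bits))]

padded = slot_end_times(bits, work, pad_to_slot=True)
unpadded = slot_end_times(bits, work, pad_to_slot=False)

print("drift with padding:   ", ideal[-1] - padded[-1])
print("drift without padding:", ideal[-1] - unpadded[-1])
```

With padding, every bit boundary falls on a multiple of the slot length regardless of the data. Without it, the boundaries depend on how many 1s and 0s have been sent, which is the data-dependent loss of synchronization described above.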
During a transaction, the receiver will probe the ring interconnect 50 times
and compute the total latency observed across all accesses. As can be seen in
Figure 6.9, in some intervals the receiver observes alternations between
extremely high and low latency, while in others the observed latency is relatively
consistent. This is due to the order in which packets arrive
on the ring. It is possible for the receiver packets to be blocked by sender
packets multiple times (by the boxcar model) if the LLC is particularly busy,
and other packets are circling the ring. It is also possible that the receiver
packets arrive at the LLC before the sender packets do. This uncertainty in
the network on chip results in the variations in latency even in an extremely
low-noise environment. Collecting large sample sizes helps keep both the
sender and receiver synchronized, and reduces error due to network on chip
timing variability.
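The benefit of summing many probes per transaction can be shown with a minimal Python sketch. The per-probe latencies below are made up for illustration and are not measurements from the simulator, and the function name is hypothetical. Only the count of 50 probes per transaction comes from the text.

```python
PROBES_PER_TRANSACTION = 50  # from the text


def transaction_totals(probe_latencies, probes=PROBES_PER_TRANSACTION):
    """Sum consecutive groups of `probes` latencies, one total per transaction.

    A trailing partial group is ignored.
    """
    usable = len(probe_latencies) - len(probe_latencies) % probes
    return [
        sum(probe_latencies[i:i + probes])
        for i in range(0, usable, probes)
    ]


# Hypothetical per-probe latencies in cycles:
alternating = [20, 60] * 25  # contended, swinging between low and very high
steady = [40] * 50           # contended, consistently elevated
quiet = [20] * 50            # no contention

totals = transaction_totals(alternating + steady + quiet)
print(totals)  # [2000, 2000, 1000]
```

In this model the two contended patterns look very different probe by probe, but their totals are identical and clearly separated from the quiet transaction, which is why the total is a more robust signal than any single probe.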
6.3.2 Visualizing Contention on the Ring
The ring interconnect visualizer tool in the simulator provides an
overview of the state of the ring during simulation. This tool is meant
to be used dynamically during execution to show which parts of the
processor system are active at a given time. Two screenshots are provided
here to demonstrate the interface.
Figure 6.10 shows the ring interconnect visualizer while both cores 0 and
1 are missing, and the LLC is active. In this screenshot there are multiple
packets traversing the ring, which can be seen by the highlighted ring stops.
Figure 6.11 shows the ring interconnect visualizer while just core 0 is missing.
In this screenshot there happen to be no visible packets traversing the ring,
although there may be some inside internal buffering structures in the ring.
Figure 6.10: Contention present in the ring (both cores missing)
Figure 6.11: No contention present in the ring (only core 0 missing)
6.3.3 Creating a Covert Channel
Figure 6.12 shows the latencies for sending the string “UIUC ECE” using the
Lord of the Rings attack. Figure 6.13 shows a screenshot of the graphical
framebuffer output during the execution of the attack whose latencies are
plotted in Figure 6.12. In Figure 6.13, the utimer values are reported by
each core prior to beginning execution, and it can be seen that they do not
start synchronized due to IPI and LLC misses adding variable latencies to
the start time of the second core after bringup.
Figure 6.12: Contention measured by the receiver for the string “UIUC ECE”
Figure 6.13: Framebuffer during the LOTR attack sending “UIUC ECE”
The Lord of the Rings attack uses no shared memory, inter-process com-
munication, or IPIs beyond the one required for bringup. All cross-core
communication is performed by measuring the latency of accessing the last
level cache through the memory ring during L2 misses. An experimentally
determined threshold is used to distinguish between 1s and 0s in the receiver.
The ring interconnect of Pretty Secure Processor is sophisticated enough to
demonstrate such an attack, and the unique visualization tooling provided by
the simulator allows a deep level of insight into the state of execution of the
processor during runtime.
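One simple way such a threshold could be chosen is sketched below in Python. The text does not say how the threshold was derived beyond it being experimental, so the midpoint rule, the function names, and the calibration totals here are all assumptions made for illustration.

```python
from statistics import mean


def midpoint_threshold(totals_for_ones, totals_for_zeros):
    """Place the threshold halfway between the two calibration means."""
    return (mean(totals_for_ones) + mean(totals_for_zeros)) / 2


def classify(totals, threshold):
    """Map each per-transaction total latency to a received bit."""
    return [1 if total >= threshold else 0 for total in totals]


# Hypothetical totals recorded while the sender transmits a known
# {0xff, 0x00} calibration pattern (eight 1s, then eight 0s).
ones = [2100, 1900, 2050, 1950, 2000, 2200, 1800, 2000]
zeros = [1000, 1100, 900, 1050, 950, 1000, 1000, 1000]

threshold = midpoint_threshold(ones, zeros)
print(threshold)
print(classify(ones + zeros, threshold))
```

A midpoint works when the two groups are well separated, as in this example. A noisier channel might call for a threshold that weighs the spread of each group as well.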
Most notably, this system provides a uniquely elegant data collection interface
via the GDB integration. Since the debug interface is built into the testbench
and never modifies architectural state, it is possible to extract the latencies
observed by the receiver without modifying them. Alternative debugging
approaches using hardware debug features built into the simulated model
would incur extra latencies during debugging, and would change the system
behavior depending on whether a debugger was attached or not. With this
system, logging data without modifying any part of the system state (not




In this work, we have introduced Pretty Secure System, a complete computer
system consisting of a multicore RISC-V computer (Pretty Secure Processor),
a multitasking operating system, and a realtime simulator with keyboard and
640 by 480 graphics support. We have shown how the unique simulation and
debug features of Pretty Secure System allow for elegant modeling of real-
world microarchitectural side channel attacks. We have demonstrated using
the system to model two such attacks and shown how the unique features of
the system allow researchers to quickly extract meaningful results from the
system. We have also produced a working demonstration of the Lord of the
Rings attack, which is a very recent microarchitectural side channel attack on
Intel processors [7].
Pretty Secure System is an intuitive, simple, and fun new way of interact-
ing with microarchitectural attacks. We believe this system will be useful
in future research, as well as in classroom settings for teaching students




[1] C. Domas, “Hardware Backdoors in x86 CPUs,” Black Hat Conference,
2018.
[2] RISC-V International, “RISC-V Instruction Set Manual,”
https://github.com/riscv/riscv-isa-manual.
[3] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg,
M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom,
“Spectre attacks: Exploiting speculative execution,” in 40th IEEE Sym-
posium on Security and Privacy (S&P’19), 2019.
[4] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn,
S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, “Melt-
down: Reading kernel memory from user space,” in 27th USENIX Secu-
rity Symposium (USENIX Security 18), 2018.
[5] J. L. Hennessy and D. A. Patterson, Computer Architecture, Sixth Edi-
tion: A Quantitative Approach, 6th ed. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc., 2017.
[6] D. A. Patterson and J. L. Hennessy, Computer Organization and Design,
Fifth Edition: The Hardware/Software Interface, 5th ed. San Francisco,
CA, USA: Morgan Kaufmann Publishers Inc., 2013.
[7] R. Paccagnella, L. Luo, and C. W. Fletcher, “Lord of the ring(s): Side
channel attacks on the CPU on-chip ring interconnect are practical,” in
Proc. of the USENIX Security Symposium (USENIX), 2021.
[8] M. Yan, R. Sprabery, B. Gopireddy, C. Fletcher, R. Campbell, and
J. Torrellas, “Attack directories, not caches: Side channel attacks in a
non-inclusive world,” in 2019 IEEE Symposium on Security and Privacy
(SP), 2019, pp. 888–904.
[9] C. Percival, “Cache missing for fun and profit,” in Proc. of BSDCan
2005, 2005.
[10] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacks and
countermeasures: The case of AES,” in Proceedings of the 2006 The
Cryptographers’ Track at the RSA Conference on Topics in Cryptology,
ser. CT-RSA’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 1–20.
[Online]. Available: https://doi.org/10.1007/11605805_1
[11] P. Vila, B. Köpf, and J. F. Morales, “Theory and practice of finding
eviction sets,” CoRR, vol. abs/1810.01497, 2018. [Online]. Available:
http://arxiv.org/abs/1810.01497
[12] Y. Yarom and K. Falkner, “Flush+reload: A high resolution, low noise,
L3 cache side-channel attack,” in Proceedings of the 23rd USENIX Conference
on Security Symposium, ser. SEC’14. USA: USENIX Association,
2014, p. 719–732.




[14] M. Yan, “Cache-based side channels: Modern attacks and defenses,”
Ph.D. dissertation, University of Illinois at Urbana-Champaign, Cham-
paign, IL, 2019.
[15] Khang T Nguyen, “Introduction to Cache Allocation
Technology in the Intel Xeon Processor E5 v4 Family,”
https://software.intel.com/content/www/us/en/develop/articles/
introduction-to-cache-allocation-technology.html, 2016, accessed: May
2021.
[16] F. Liu, Q. Ge, Y. Yarom, F. Mckeen, C. Rozas, G. Heiser, and R. B. Lee,
“Catalyst: Defeating last-level cache side channel attacks in cloud com-
puting,” in 2016 IEEE International Symposium on High Performance
Computer Architecture (HPCA), 2016, pp. 406–418.
[17] D. Page, “Partitioned cache architecture as a side-channel defence mech-
anism.” IACR Cryptology ePrint Archive, vol. 2005, p. 280, 01 2005.
[18] Y. Wang, A. Ferraiuolo, D. Zhang, A. C. Myers, and G. E.
Suh, “SecDCP: Secure dynamic cache partitioning for efficient
timing channel protection,” in 2016 53rd ACM/EDAC/IEEE Design
Automation Conference (DAC). IEEE Press, 2016. [Online]. Available:
https://doi.org/10.1145/2897937.2898086 p. 1–6.
[19] J. Doweck, W.-F. Kao, A. K.-y. Lu, J. Mandelblat, A. Rahatekar, L. Rap-
poport, E. Rotem, A. Yasin, and A. Yoaz, “Inside 6th-generation intel
core: New microarchitecture code-named skylake,” IEEE Micro, vol. 37,
no. 2, pp. 52–62, 2017.
[20] R. Myslewski, “Intel Sandy Bridge many-core
secret sauce: One ring to rule them all,”
https://www.theregister.com/2010/09/16/sandy_bridge_ring_interconnect,
2010, accessed May 2021.
[21] H. M. G. Wassel, Y. Gao, J. K. Oberg, T. Huffmire, R. Kastner,
F. T. Chong, and T. Sherwood, “Surfnoc: A low latency and provably
non-interfering approach to secure networks-on-chip,” in Proceedings of
the 40th Annual International Symposium on Computer Architecture, ser.
ISCA ’13. New York, NY, USA: Association for Computing Machinery,
2013. [Online]. Available: https://doi.org/10.1145/2485922.2485972 p.
583–594.
[22] SpinalHDL, “VexRiscv: A FPGA friendly 32 bit RISC-V CPU imple-
mentation,” https://github.com/SpinalHDL/VexRiscv, 2021.
[23] RISC-V International, “RISC-V Exchange: Cores & SoCs,”
https://riscv.org/exchange/cores-socs/, accessed May 2021.
[24] D. Rath, “Open On-Chip Debugger: an On-Chip Debug Solution for
Embedded Target Systems based on the ARM7 and ARM9 Family,”
http://openocd.org/files/thesis.pdf, 2005.
[25] K. Umenthum, “Frizzle RISC-V Testbench,”
https://github.com/umenthum/frizzle, 2020, accessed May 2021.
[26] PULP Platform, “Risc-v debug support for pulp cores,”
https://github.com/pulp-platform/riscv-dbg, 2021, accessed May
2021.
[27] Wilson Snyder, “Verilator,” https://www.veripool.org/verilator, 2021,
accessed May 2021.
[28] Free Software Foundation, “GDB: The GNU Project Debugger,”
https://www.gnu.org/software/gdb/, 2021, accessed May 2021.
[29] IEEE Standards Association, “Standard for systemverilog,”
https://standards.ieee.org/project/1800.html, 2019, accessed May
2021.
[30] F. Bellard, “Qemu, a fast and portable dynamic translator,” in Proceed-
ings of the Annual Conference on USENIX Annual Technical Conference,
ser. ATEC ’05. USA: USENIX Association, 2005, p. 41.
[31] F. Bellard, “Official qemu mirror,” https://github.com/qemu/qemu,
2021, accessed May 2021.
[32] SymbioticEDA, “Risc-v formal verification framework,”
https://github.com/SymbioticEDA/riscv-formal.
[33] Free Software Foundation, “GCC, the GNU Compiler Collection,”
https://gcc.gnu.org/, 2021, accessed May 2021.
[34] RISC-V Foundation, “RISC-V GNU Compiler Toolchain,”
https://github.com/riscv/riscv-gnu-toolchain, 2021, accessed May
2021.
[35] Python Software Foundation, “Python,” https://www.python.org/, 2021,
accessed May 2021.
[36] Standard C++ Foundation, “Standard C++,” https://isocpp.org/, 2021,
accessed May 2021.
[37] Apple, “macOS Big Sur,” https://www.apple.com/macos/big-sur/, 2021,
accessed May 2021.
[38] Linux Foundation, “Linux Foundation- Decentralized Innovation, Built
on Trust,” https://linuxfoundation.org/, 2021, accessed May 2021.
[39] The GTK Team, “The GTK Project- A free and open-source cross-
platform widget toolkit,” https://www.gtk.org/, 2021, accessed May
2021.
[40] Xilinx, “Vivado Design Suite,” https://www.xilinx.com/products/design-
tools/vivado.html, 2021, accessed May 2021.
