Speculose: Analyzing the Security Implications of Speculative Execution
  in CPUs by Maisuradze, Giorgi & Rossow, Christian
SPECULOSE: Analyzing the Security Implications
of Speculative Execution in CPUs
Giorgi Maisuradze
CISPA, Saarland University
Saarland Informatics Campus
giorgi.maisuradze@cispa.saarland
Christian Rossow
CISPA, Saarland University
Saarland Informatics Campus
rossow@cispa.saarland
Abstract—Whenever modern CPUs encounter a conditional
branch for which the condition cannot be evaluated yet, they
predict the likely branch target and speculatively execute code.
Such pipelining is key to optimizing runtime performance and has been
incorporated in CPUs for more than 15 years. In this paper,
to the best of our knowledge, we are the first to study the
inner workings and the security implications of such speculative
execution. We revisit the assumption that speculatively executed
code leaves no traces in case it is not committed. We reveal several
measurable side effects that allow adversaries to enumerate
mapped memory pages and to read arbitrary memory—all
using only speculated code that was never fully executed. To
demonstrate the practicality of such attacks, we show how a
user-space adversary can probe for kernel pages to reliably break
kernel-level ASLR in Linux in under three seconds and reduce
the Windows 10 KASLR entropy by 18 bits in less than a second.
Disclaimer: This work on speculative execution was con-
ducted independently from other research groups and was
submitted to IEEE S&P ’17 in October 2017. Any techniques
and experiments presented in this paper predate the public
disclosure of attacks that became known as Meltdown [25]
and Spectre [22], which were released in early January 2018.
I. INTRODUCTION
Being at the core of any computer system, CPUs have
always strived for maximum execution efficiency. Several
hardware-based efforts have been undertaken to increase CPU
performance by higher clock frequencies, increasing the num-
ber of cores, or adding more cache levels. Orthogonal to
such developments, vendors have long invested in logical
optimization techniques, such as complex cache eviction al-
gorithms, branch predictors or instruction reordering. These
developments made clear that CPUs do not represent hardware-
only components any longer. Yet our level of understanding
of the algorithmic, i.e., software-side aspects of CPUs is in its
infancy. Given that many CPU design details remain corporate
secrets, it requires tedious reverse engineering attempts to
understand the inner workings of CPUs [23].
In this paper, we argue that this is a necessary direction
of research and investigate the security implications of one of
the core logical optimization techniques that is ubiquitous in
modern CPUs: speculative execution. Whenever facing a con-
ditional branch for which the outcome is not yet known, instead
of waiting (stalling), CPUs usually speculate on one of the branch
targets. This way, CPUs can still fully leverage their instruction
pipelines. They predict the outcome of conditional branches
and follow the more likely branch target to continue execution.
Upon visiting a particular branch for the first time, CPUs use
static prediction, which usually guesses that backward jumps
(common in loops to repeat the loop body) are taken, whereas
forward jumps fall through (common in loops so as not to
abort the loop). Over time, the CPU will learn the likely
branch target and then use dynamic prediction to take the
more likely target. When a CPU discovers that it mispredicted
a branch, it will roll back the speculated instructions and
their results. Despite this risk of mispredictions, speculative
execution has significantly sped up CPU performance and is
part of most modern CPUs from popular vendors such as Intel and AMD,
as well as ARM-licensed products.
To the best of our knowledge, we are the first to analyze the
security implications of speculative execution. So far, the only
known drawback of speculative execution is a slightly higher
energy consumption due to non-committed instructions [27].
As we show, its drawbacks go far beyond reduced energy
efficiency. Our analysis follows the observation that CPUs,
when executing code speculatively, might leak data from
the speculated execution branch, although this code would
never have been executed in a non-speculative world. Ideally,
speculated code should not change the CPU state unless it is
committed at a later stage (e.g., because the predicted branch
target was confirmed). We analyze how an adversary might
undermine this assumption by causing measurable side effects
during speculative execution. We find that at least two feedback
channels exist to leak data from speculative execution, even if
the speculated code, and thus its results, are never committed.
Whereas one side channel uses our observation that speculated
code can change cache states, the other side channel observes
differences in the time it takes to flush the instruction pipeline.
With these techniques at hand, we then analyze the se-
curity implications of the possibility to leak data from within
speculative execution. We first show how a user-space attacker
can abuse speculative execution to reliably read arbitrary user-space
memory. While this sounds unremarkable at first, we then also
discuss how this might help an attacker to access memory
regions that are guarded by conditionals, such as in sandboxes.
We then analyze if an unprivileged user can use speculation
to read even kernel memory. We show that this attack is
fortunately not possible. That is, speculative execution protects
against invalid reads in that the results of access-violating reads
(e.g., from user to kernel) are zeroed.
This observation, however, leads us to discover a severe
side channel that allows one to distinguish between mapped
and unmapped kernel pages. In stark contrast to access-
violating memory reads (which are zeroed), page faults (i.e.,
accesses to non-mapped kernel pages) stall the speculative
execution. We show that an attacker can use this distinction
to reliably and efficiently determine whether a virtual memory
page is mapped. This effectively undermines a fundamental
assumption of kernel-based Address Space Layout Randomiza-
tion (KASLR) designs [10] present in modern OSes (Windows
8.x+, Linux kernel 4.4+, iOS 4.3+, Android 8.x): KASLR’s
foremost goal is to hide the location of the kernel image,
which is easily broken with speculative execution. Access
violations in speculative execution—in contrast to violations
in non-speculated execution—do not cause program crashes
due to segmentation faults, allowing one to easily repeat
checks in multiple memory ranges. In our experiments, we
use commodity hardware to show that one can reliably break
KASLR on Ubuntu 16.04 with kernel 4.13.0 in less than three
seconds.
In this paper, we provide the following contributions:
• To the best of our knowledge, we are the first to explore
the internals of speculative execution in modern CPUs.
We reveal details of branch predictors and speculative ex-
ecution in general. Furthermore, we propose two feedback
channels that allow us to transfer data from speculation
to normal execution. We evaluate these primitives on five
Intel CPU architectures, ranging from models from 2004
to recent CPU architectures such as Intel Skylake.
• Based on these new primitives, we discuss potential
security implications of speculative execution. We first
present how to read arbitrary user-mode memory from
inside speculative execution, which may be useful to read
beyond software-enforced memory boundaries. We then
extend this scheme to a (failed) attempt to read arbitrary
kernel memory from user space.
• We discover a severe side channel in Intel’s speculative
execution engine that allows us to distinguish between a
mapped and a non-mapped kernel page. We show how we
can leverage this concept to break KASLR implementa-
tions of modern operating systems, prototyping it against
the Linux kernel 4.13 and Windows 10.
• We discuss potential countermeasures against security
degradation caused by speculative execution, ranging
from hardware- and software- to compiler-assisted at-
tempts to fix the discovered weaknesses.
II. ANALYSIS OF SPECULATIVE EXECUTION IN X86
In this section, we first provide an overview of the basic
x86 architecture in general. We then introduce the concept
of speculative execution and its implementation details in the
recent x86 microarchitectures such as Haswell and Skylake.
To understand the inner workings of speculative execution,
this section also describes the details of branch prediction
techniques that are used by modern processors.
These inner details allow for an attack that abuses spec-
ulative execution to break kernel-level Address Space Layout
Randomization (KASLR). This section therefore also intro-
duces KASLR, an in-kernel defense mechanism, deployed
in most modern operating systems. To emphasize the
importance of KASLR, we will show a wide range of kernel
attacks that are possible in the absence of KASLR. Later, in
Section IV, we will combine branch prediction and speculative
execution to anticipate the speculatively executed path after a
conditional branch. This will be a key to executing arbitrary
code speculatively, which we will use in our attack to remove
the randomness introduced by KASLR.
A. Generic x86 architecture
Despite being a CISC (Complex Instruction Set Computing)
architecture, x86¹ constitutes the prevalent architecture
used for desktop and server environments. Extensive optimiza-
tions are among the reasons for x86’s popularity.
Realizing the benefits of RISC (Reduced Instruction Set
Computing) architectures, both Intel and AMD have switched
to a RISC-like architecture under the hood. That is, although
the CPUs still expose a complex instruction set to programmers,
these instructions are internally translated into sequences of
simpler RISC-like instructions. The high-level and low-level
instructions are usually called macro-OPs and micro-OPs, respectively.
This construction brings the benefits of a RISC architecture
and allows for simpler designs and better optimizations,
while programmers can still use a rich CISC instruction set.
An example of translating a macro-OP into micro-OPs is shown
in the following:
                   load  t1, [rax]
add [rax], 1   =>  add   t1, 1
                   store [rax], t1
Using micro-OPs requires far less circuitry in the CPU for
their implementation. For example, similar micro-OPs can be
grouped together into execution units (also called ports). Such
ports allow micro-OPs from different groups to be executed
in parallel, given that one does not depend on the results of
another. The following is an example that can trivially be run
in parallel once converted into micro-OPs:
1 mov [rax], 1
2 add rbx, 2
Both instructions can be executed in parallel. Whereas instruc-
tion #1 is executed in the CPU’s memory unit, instruction #2
is computed in the CPU’s Arithmetic Logic Unit (ALU)—
effectively increasing the throughput of the CPU. Based on
the availability of their corresponding execution ports and the
source data, each of these instructions could also be executed
out of order. In the previous example, the CPU could easily
swap the order of the addition and the memory write, as there are
no dependencies between them. Such reordering frequently
allows optimizing a sequence of instructions for speed and data
locality. Given the possibility of data dependencies in general,
an execution cannot be done entirely out-of-order, though.
Otherwise, such reordering would create data hazards and
possibly even result in entirely different executions that do not
reflect the macro-OP specification. To capture the complexity
of such reordering, CPUs include another unit that maintains
the consistency of the executed instructions.
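The independence condition described above can be made concrete with a small sketch. The following Python model is an illustration only: the `MicroOp` type, the read/write sets assigned to each instruction, and the `independent` helper are assumptions made for this example and do not describe how a real scheduler tracks dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    """Hypothetical micro-OP description: what it reads and what it writes."""
    name: str
    reads: frozenset
    writes: frozenset


def independent(a: MicroOp, b: MicroOp) -> bool:
    """Return True if a and b may be reordered or executed in parallel.

    Two operations conflict if one reads what the other writes
    (read-after-write, write-after-read) or if both write the same
    location (write-after-write).
    """
    raw = a.writes & b.reads
    war = a.reads & b.writes
    waw = a.writes & b.writes
    return not (raw or war or waw)


# The two-instruction example from the text: no shared operands.
store = MicroOp("mov [rax], 1", frozenset({"rax"}), frozenset({"[rax]"}))
add = MicroOp("add rbx, 2", frozenset({"rbx"}), frozenset({"rbx", "flags"}))

# Two micro-OPs of the translated "add [rax], 1": the add needs t1 from the load.
load_t1 = MicroOp("load t1, [rax]", frozenset({"rax", "[rax]"}), frozenset({"t1"}))
add_t1 = MicroOp("add t1, 1", frozenset({"t1"}), frozenset({"t1", "flags"}))

print(independent(store, add))       # no hazard between the two instructions
print(independent(load_t1, add_t1))  # data dependency through t1
```

In this model, the store and the addition share no operands and can be swapped freely, whereas the micro-OPs that make up a single macro-OP form a dependency chain and must keep their order.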
¹ Note that we will refer to both IA-32 (32-bit CPUs) and AMD64/Intel 64
(64-bit CPUs) as x86 for simplicity.
[Figure 1: pipeline diagram in three parts. IF/ID: BPU, L1 ICache,
Micro-OP Cache, Fetch, Decode, Micro-OP Queue. Ex: Reorder Buffer,
Scheduler, execution ports P0 to P7 (including load and store ports).
Mem: Load Buffer, L1 Data Cache, L2 Cache.]
Fig. 1: CPU Core Pipeline Functionality of the Haswell
Microarchitecture [18].
1) Execution Units in Modern x86 CPUs: As software
changes over time, the underlying logic and hardware setup
of CPUs is also subject to change. CPU vendors usually
maintain microarchitectures to summarize various CPUs of the
same generation, possibly varying in terms of clock speeds or
number of cores. We will now describe the basic x86 CPU
architecture based on the example of the Haswell microarchi-
tecture, which is a recent Intel x86 microarchitecture and one
of the microarchitectures that we will use in our experiments.
Although the inner details differ from one microarchitecture
to another, the basic building blocks (especially the ones that
we use) remain mostly the same.
In Figure 1, we outline the most important parts of the CPU’s
pipeline, based on the Haswell microarchitecture [18]. We split
the whole pipeline into three parts: (i) Instruction fetch/decode,
(ii) Execution, and (iii) Memory.
Fetch/Decode Unit: The fetch/decode unit is responsible
for fetching the forthcoming (i.e., to-be-executed) instructions
and preparing them for execution. There are different places
where these instructions might be coming from. The fastest of
these places is a micro-OP cache, which is a storage of instruc-
tions that are already decoded into their corresponding micro-
OPs. If there is a miss in this cache, then the instruction has to
be retrieved from one of the caches in the hierarchy (L1-I, L2,
L3, L4), or ultimately, if none of the caches contain the data,
from main memory. The fetched instructions are then decoded
and passed to the execution unit as well as to the micro-OP
cache. The details of how this is handled are not important for
the remainder of our paper and we refer the interested reader
to the respective CPU vendor documentation [18]. One crucial
part to note, though, is that the address of the next instruction
to fetch is controlled by the Branch Prediction Unit (BPU).
Execution: After an instruction is decoded into its corresponding
micro-OPs, these are put on the instruction dispatch
queue to prepare for execution. Micro-OPs on the queue
are checked for dependencies, and the ones with all source
operands resolved are scheduled to one of 8 (in Haswell;
varies in other architectures) execution ports. Further, the
scheduled micro-OPs are added to the reorder buffer (ROB),
which contains the sequentially consistent view of all executed
instructions. Abstracting away from its details, it is sufficient
for us to imagine the ROB as a unit that keeps micro-OPs
in order while executing them out of order. Usually, the ROB
is implemented as a ring buffer, with its head pointing to the
newest micro-OP and its tail to the oldest not-yet-committed
operation. Committing a micro-OP means applying the changes
that it made to the machine state. Micro-OPs are committed
in order, i.e., a micro-OP can be committed only after it
is done executing and all instructions before it are already
committed. Conversely, finishing the computation
does not by itself mean that the micro-OP can be committed.
A relevant example in our case is speculative execution, in
which a micro-OP can only be committed after the
outcome of the speculation is known (i.e., after knowing that
the speculated path was correctly predicted). This implies that
CPUs have to keep micro-OPs uncommitted until they can
be committed. CPUs typically enforce a maximum number
of uncommitted/speculated instructions, which corresponds to the
number of ROB entries (192 in Haswell). This limit might
be further reduced by the availability of other resources, such
as physical registers or available execution ports.
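The commit rule described above (execute out of order, commit in order, and never commit while speculation is unresolved) can be sketched as a toy model. The class below is a simplification written for illustration; the `ReorderBuffer` name, its methods, and the use of unique micro-OP names as identifiers are assumptions of this sketch, not a description of the actual hardware structure.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are kept in program order, oldest on the left."""

    def __init__(self, capacity=192):  # 192 entries as stated for Haswell
        self.capacity = capacity
        self.entries = deque()

    def allocate(self, name, speculative=False):
        """Add a micro-OP; return False (stall) if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"name": name, "done": False, "speculative": speculative})
        return True

    def finish(self, name):
        """Mark a micro-OP as executed (may happen out of order)."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True

    def resolve_speculation(self, correct):
        """Confirm speculative entries, or flush them on a misprediction."""
        if correct:
            for entry in self.entries:
                entry["speculative"] = False
        else:
            self.entries = deque(e for e in self.entries if not e["speculative"])

    def commit(self):
        """Commit from the oldest entry onward; stop at the first blocker."""
        committed = []
        while (self.entries and self.entries[0]["done"]
               and not self.entries[0]["speculative"]):
            committed.append(self.entries.popleft()["name"])
        return committed


rob = ReorderBuffer(capacity=4)
for op in ("a", "b", "c"):
    rob.allocate(op)
rob.finish("c")
print(rob.commit())  # "c" is done, but older micro-OPs block it
rob.finish("a")
print(rob.commit())
```

A finished micro-OP at the head of the buffer that is still marked speculative blocks all younger entries, which is why the number of in-flight speculated instructions is bounded by the buffer size.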
Memory Unit: The third part of the execution pipeline,
the memory unit, is used for micro-OPs that access memory.
Because memory access is the bottleneck in most executions, CPU designers pay
special attention to this unit to improve its efficiency. That is
the reason why in modern architectures we have a hierarchy
of memory caches, each containing different sizes of buffers
with varying access speeds. Haswell’s memory hierarchy, for
example, comprises Load and Store Buffers (72 and 42 entries
in Haswell, respectively), followed by various levels of caches
(L1-I/L1-D, L2, L3, L4), and finally the main memory. Each
load operation is done in the following way:
1) When a load micro-OP is scheduled, a load buffer entry
is allocated that will hold the loaded data until it is
committed.
2) To retrieve the data, the L1 data cache is queried. In case
of a cache hit, the data will be loaded into the load buffer
entry of the corresponding micro-OP.
3) In the case of an L1 cache miss, the data will be queried
from the higher cache hierarchies or, ultimately, from the
main memory.
Executing a memory load involves two buffers. First, a load
buffer entry is allocated which will be tied to the corresponding
load micro-OP, and will hold the loaded data after it is
returned from the L1 data cache. Second, in the case of a cache
miss, the L1 data cache has to load the missing data from the
higher memory hierarchy. To service multiple cache misses in
parallel, the L1 data cache has its own buffer consisting of
memory accesses that have to be resolved. Haswell, e.g., has
a 32-entry buffer and can handle 32 cache misses² in parallel.
Executing load micro-OPs speculatively means that a new
load buffer entry is allocated for each of them; however,
these entries are not freed until the CPU is done speculating. As
commits are not carried out until the result of the speculation
is known, the loaded data can neither reach the architectural
register nor be discarded yet. This means that while
speculating, the maximum number of load instructions that
can be executed is limited by the size of the load buffer. After
reaching the limit, the execution of subsequent load micro-OPs
will be stalled. Given the generous sizes of load buffers
in recent microarchitectures (72 entries in Haswell), for most
programs this will not be an issue. However, we will use this
limitation to our advantage when measuring the access times
of mapped/unmapped memory accesses in Section IV-F. Also
worth noting here is the case of cache misses in the L1 data
cache. As previously mentioned, the L1 data cache can handle
only a limited number of outstanding cache misses (32 in Haswell).
This, combined with the fact that faults
cannot be handled speculatively, means that after 32
faulting cache misses (e.g., cache misses resulting in page faults),
all subsequent memory operations will be stalled.
² Note that in reality even cache hits allocate entries in the buffer;
however, as cache hits are serviced instantaneously, they can be ignored.
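The two limits discussed above can be illustrated with a deliberately coarse model. It assumes that no load buffer entry is released during a speculation window and that faulting misses occupy the L1 miss buffer until speculation ends; the function name, the `"hit"`/`"miss"`/`"fault"` labels, and the treatment of ordinary misses as completing immediately are assumptions of this sketch.

```python
def issued_before_stall(loads, load_buffer_size=72, miss_buffer_size=32):
    """Count how many speculative loads are issued before the pipeline stalls.

    loads: list of "hit", "miss", or "fault", in program order.
    Sizes default to the Haswell figures quoted in the text.
    """
    used_load_entries = 0  # never freed while speculating
    pending_faults = 0     # faulting misses cannot be resolved speculatively
    for issued, kind in enumerate(loads):
        if used_load_entries == load_buffer_size:
            return issued  # load buffer exhausted
        if pending_faults == miss_buffer_size:
            return issued  # miss buffer clogged by unresolved faults
        used_load_entries += 1
        if kind == "fault":
            pending_faults += 1
    return len(loads)


print(issued_before_stall(["hit"] * 100))    # limited by the load buffer
print(issued_before_stall(["fault"] * 100))  # limited by the miss buffer
```

Under these assumptions, a window of faulting accesses stalls much earlier than a window of valid accesses, which is the kind of difference a timing measurement can pick up.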
2) Speculative Execution: Speculative execution is an im-
portant optimization carried out by most CPU architectures.
Programs, generally, have many conditional branches. Evalu-
ating the condition in a conditional branch might take some
time (e.g., reading data from memory, which can take hundreds
of cycles if it comes from main memory). Instead of waiting
for a branch condition to be evaluated, the CPU speculates
on the outcome and starts speculatively executing the more
likely execution path. Once the condition for the branch can
be evaluated, this results in two possible scenarios:
1) The prediction was correct → The speculatively executed
instructions were correct and thus will now be committed.
2) The prediction was wrong → The speculatively executed
instructions are flushed and the correct path is executed
from the beginning (non-speculatively).
Optimizing the correctness of branch prediction is thus of
utmost importance for performance. Perfect prediction means
that the CPU never has to wait for slow branches and will
always execute the correct path. In contrast, mispredictions
penalize the execution by forcing the CPU to flush all the
progress it has made since the speculation and start over from
scratch.
B. Branch Prediction
As the efficacy of branch predictors directly influences
the performance of CPUs, hardware vendors have improved
predictors over time. Therefore, branch predictors not only
differ from vendor to vendor, but also between different
generations of microarchitectures (e.g., Haswell vs. Skylake).
The basic principle is the same: branch predictors try to foresee
the outcome of conditional branches. There are generally
two types of predictors, static and dynamic. When a CPU
sees a conditional branch for the first time, it has to resort
to a static predictor to guess the outcome based on simple
expert-knowledge heuristics. In contrast, if the branch has
been executed previously, a dynamic predictor can make more
informed decisions by looking at previous branch outcomes.
The simplest example of a static predictor can be predicting
any branch to be either always taken or never taken. However,
intuitively, these predictors are not highly precise. The most
widely used static predictor, that also exists in most modern
CPUs, is called BTFNT (backwards taken forwards not taken).
As its name suggests, this static predictor predicts forward
branches not to be taken and backward branches to be taken. The
reasoning is that loops (i.e., backward jumps) are usually taken
more than once, and that the first case of a conditional (i.e., the
fall-through case of a forward jump) is more likely to be executed.
The success rate of the predictor can be further improved by
the compiler, which aligns the cases of conditionals according
to the static predictions of the underlying hardware.
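The BTFNT rule reduces to a comparison of two addresses. The helper below is a minimal sketch of that rule; the function name and the example addresses are made up for illustration, and real CPUs may deviate from pure BTFNT in ways that are not publicly documented.

```python
def btfnt_predict_taken(branch_addr: int, target_addr: int) -> bool:
    """Backward taken, forward not taken.

    A branch whose target lies at or before its own address is treated as a
    loop back-edge and predicted taken; a forward branch is predicted to
    fall through.
    """
    return target_addr <= branch_addr


# Loop back-edge: the target precedes the branch.
print(btfnt_predict_taken(branch_addr=0x401020, target_addr=0x401000))
# Forward conditional, such as skipping an if-body.
print(btfnt_predict_taken(branch_addr=0x401020, target_addr=0x401060))
```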
The simplest example of a dynamic branch predictor is
a one-bit history predictor, which stores a single bit for
each branch instruction. This bit denotes whether the branch
was taken or not in the previous execution. Most CPUs are
believed³ to have 2-bit predictors (also called 2-bit saturating
counters) [37]. These two bits represent a counter that is
incremented every time the branch is taken and decremented
otherwise. To decide the outcome of the predictor, the values
of the counter {0, 1, 2, 3} are mapped to decisions {strongly
not-taken, weakly not-taken, weakly taken, strongly taken}.
Apart from simple history matching, dynamic predictors can
also detect cycles or nested branches, e.g., by using local or
global history bits or even combining them together. However,
dynamic branch predictors are out of scope of this paper and
therefore, we will not go into more detail here.
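For reference, the 2-bit saturating counter described above can be written in a few lines. This is a textbook-style sketch, not a model of any specific CPU; the class name and the initial state (weakly not-taken) are assumptions.

```python
class TwoBitCounter:
    """2-bit saturating counter for a single branch."""

    LABELS = ("strongly not-taken", "weakly not-taken",
              "weakly taken", "strongly taken")

    def __init__(self, state=1):  # assumed initial state: weakly not-taken
        self.state = state

    def predict(self) -> bool:
        """Predict taken for counter values 2 and 3."""
        return self.state >= 2

    def update(self, taken: bool) -> None:
        """Increment on a taken branch, decrement otherwise, saturating at 0 and 3."""
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

    def label(self) -> str:
        return self.LABELS[self.state]


counter = TwoBitCounter()
for outcome in (True, True, True, False):
    counter.update(outcome)
print(counter.label(), counter.predict())
```

Because the counter saturates, a single deviating outcome after a long run of taken branches only moves it from "strongly taken" to "weakly taken", so the prediction does not flip immediately.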
In the following, we will use static predictors, as they show
a relatively coherent behavior across different CPUs, making
it possible to reliably anticipate their outcome. For example, if
a CPU sees a branch at address A with a forward conditional
branch for the first time, the static branch predictor will predict
the branch to be not taken (according to BTFNT). In principle,
the same thing can also be done on dynamic predictors by
training them at the beginning and then taking the opposite
branch of what the trained predictor expects.
C. Kernel ASLR
With this basic CPU knowledge, we now turn to ASLR,
a defense mechanism whose underlying assumptions an ad-
versary can undermine with speculative execution. Address
Space Layout Randomization (ASLR) [33] is a widely de-
ployed defense technique against code-reuse attacks. In user
space, ASLR randomizes the base addresses of the program’s
memory segments, thus preventing the attacker from knowing
or predicting the addresses of gadgets that she needs for code-
reuse attacks. ASLR is applied to a program at load-time and
whenever a program requests a new memory allocation. ASLR
raises the bar for successful exploitation
and is thus supported by all modern operating systems.
The recent shift of attacks towards the kernel
side [16], [20], [28], [30], [32] motivated operating system
developers to create similar countermeasures against code-
reuse attacks for kernel code. This resulted in creating KASLR
(Kernel ASLR), which is an ASLR implementation for the
kernel, i.e., randomizing the kernel’s image base address as
well as its loaded modules/drivers and allocated data. While
historically the kernel image was loaded at a fixed address
(e.g., 0xffffffff80000000 in Linux), with KASLR, the
kernel is placed at a different address at boot time. KASLR
is supported by all major operating systems, Windows (from
Vista), OSX (from Mountain Lion 10.8) and Linux (from
3.14, and enabled by default from 4.12). KASLR is meant
to (1) protect against remote attackers and (2) defend
against privilege escalation attacks by local adversaries. In
our KASLR derandomization attack, we assume the latter
case, i.e., the attacker tries to reveal the location of
critical kernel functions in order to use it in a subsequent
exploit.
³ Branch predictor details are not part of official vendor documentation.
Windows KASLR: KASLR implementations differ be-
tween OSes, or even among different versions of the same
OS. In general, at every boot, the kernel image and the
loaded modules/drivers are allocated at random addresses.
In Windows 10 (version 1709), there is a dedicated kernel
memory region in the range from 0xfffff80000000000
to 0xfffff88000000000, containing allocation slots of
2MiB each (i.e., large pages). KASLR will then randomly
assign these slots to the kernel image and to the loaded
modules, resulting in 262144 possible slots where the kernel
image can reside, i.e., 18 bits of randomness.
Linux KASLR: In Linux, KASLR randomly chooses any
page from which to start loading the image. By default,
the size of this page is 2MiB, and the range in which the
randomization can happen is from 0xffffffff80000000
to 0xffffffffc0000000. This gives at most 512 possible
kernel base image start pages (minus the image size itself),
i.e., 9 bits of randomness. The dedicated memory range of
the kernel image is followed by a memory region for kernel
modules. It starts at 0xffffffffc0000000 and can go
as far as 0xfffffffffd200000. However, only the load
offset is randomized, i.e., the first loaded module will be loaded
at a randomly chosen 4KiB page from the range [1,1024]. All
consecutively loaded modules will then follow the first one,
giving in total 10 bits of randomness.
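The entropy figures quoted above follow directly from the address ranges and slot sizes. The short calculation below reproduces them; the helper names are chosen for this sketch, and it ignores the small reduction caused by the size of the kernel image itself.

```python
from math import log2

MIB = 1 << 20


def slot_count(start: int, end: int, slot_size: int) -> int:
    """Number of slot-aligned positions in the half-open range [start, end)."""
    return (end - start) // slot_size


def entropy_bits(positions: int) -> float:
    """Bits of randomness for a uniform choice among `positions` slots."""
    return log2(positions)


windows_slots = slot_count(0xfffff80000000000, 0xfffff88000000000, 2 * MIB)
linux_slots = slot_count(0xffffffff80000000, 0xffffffffc0000000, 2 * MIB)
linux_module_offsets = 1024  # first module: 4 KiB page chosen from [1, 1024]

print(windows_slots, entropy_bits(windows_slots))
print(linux_slots, entropy_bits(linux_slots))
print(linux_module_offsets, entropy_bits(linux_module_offsets))
```

With so few candidate positions, an attacker who can test whether a single slot is mapped only needs a few hundred probes on Linux and at most a few hundred thousand on Windows.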
III. THREAT MODEL
This section details our threat model and assumptions we
make about the attack environment. Our threat model assumes
a local attacker and is in accordance with the threat model
of other attacks against KASLR [12], [14], [17], [19]. More
specifically, we envision the following scenario:
1) An attacker can already execute arbitrary code on the
victim’s system in user mode without elevated privileges.
This means that either the attacker is running her own
executable as a local user on the machine, or can inject
arbitrary code in an already running user program.
2) The attacker knows an exploitable kernel vulnerability
and aims to abuse this vulnerability to elevate her priv-
ileges. For example, a vulnerability could allow her to
hijack the kernel process’s control flow (if the attacker
knows the precise memory location of the kernel).
3) We assume the OS deploys KASLR that mitigates such
exploitation attempts by randomizing the kernel’s image
base address. This implies that there is no other way to
leak address space information from the kernel that would
otherwise undermine KASLR.
4) Given that a naïve trial-and-error search would crash the
kernel, we assume that the attacker cannot just brute-force
kernel addresses.
IV. POTENTIAL ABUSES OF SPECULATIVE EXECUTION
A key observation of our analysis in Section II is that in
case of a branch misprediction, the CPU has executed a
set of instructions that it was not actually supposed to execute.
We will now outline techniques where an attacker can abuse
speculative execution to leak certain information (e.g., memory
content, mapped pages), ultimately compromising fundamental
assumptions of KASLR in Linux and Windows. First, we will
provide an overview of the basic approach and show several
code patterns that allow abuse of speculative execution as a
potential side channel. We will then discuss how an attacker
could use these techniques to identify the memory region of a
KASLR-protected kernel.
A. Enforced Speculative Execution
If we trick the CPU into a branch misprediction, it will
speculatively execute code that was not actually meant to be
executed. But how can we force the CPU to enter the code
that we want to execute speculatively? To execute a piece of
code speculatively, we first have to ensure that the branch
condition takes “long” to evaluate, so that in the meantime the code
after the conditional jump is executed in parallel. To slow down
the computation of the branch condition, we propose to fill up a port of
the CPU with many computations that the condition has to wait
for before it can be evaluated. For the remainder of this paper,
we will use integer multiplications (imul reg32,imm32)
to compute an input for a conditional branch. The imul
instruction multiplies register reg by immediate imm and
stores the result in reg. In our example, a chain of subsequent
imul instructions (we use 2048 in the remainder of this paper)
will fill the multiplication port.
In addition, we have to make sure that the condition and
the speculated code are not scheduled on the same execution
ports. Otherwise, there will be no free execution units to run
the speculated code. For example, if we want to use memory
reads (mov reg32,mem32) within speculative execution,
these instructions will be scheduled on memory load ports
(2 & 3 in Haswell). While ports may slightly differ between
architectures, they (and thus the described procedure) are
relatively stable and well-documented in vendor manuals [18].
1 imul r9, 3 ; Repeat imul to fill
2 ... ; the ALU’s queue
3 imul r9, 3 ;
4 cmp r9b, 3 ; Requires imul result
5 je C1True ; and has to speculate
6 << ... >> ; Code that is executed
7 << ... >> ; speculatively
8 jmp Exit
9 C1True:
10 << ... >>
11 Exit:
Listing 1: Code pattern to trigger speculative execution via
static prediction (BTFNT) of the conditional jump in line 5
Listing 1 outlines the basic idea. We first fill a port
with many imul instructions that depend upon each other
(lines #1–#3). The compare instruction (line #4) has to wait
for the result of this multiplication chain, while the CPU
speculates that the conditional forward jump (line #5) is not
taken. This code pattern thus allows us to (speculatively)
execute arbitrary code (placeholders in lines #6–#7) that would
have never been executed without speculation. In all our
experiments, the initial value of r9 is 3. Therefore, after
multiplying it 2048 times by 3, the least significant byte of
r9 will be $3^{2048+1} \bmod 256 = 3$. The comparison in line #4
thus succeeds and the jump is actually taken, i.e., the outcome of the
conditional jump (line #5) was mispredicted.
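The value of the least significant byte after the multiplication chain can be checked with a short calculation. The sketch below emulates the chain on a 64-bit register (truncation to 32 or 64 bits does not affect the lowest byte); the function name and its parameters are chosen for illustration.

```python
def low_byte_after_imuls(initial=3, factor=3, count=2048):
    """Emulate `count` dependent multiplications and return the low byte (r9b)."""
    r9 = initial
    for _ in range(count):
        r9 = (r9 * factor) & 0xFFFFFFFFFFFFFFFF  # wrap around like a 64-bit register
    return r9 & 0xFF


# Direct check via modular exponentiation: 3 * 3^2048 = 3^2049.
print(pow(3, 2048 + 1, 256))
print(low_byte_after_imuls())
```

The result is 3 because 3 has multiplicative order 64 modulo 256 and 2048 is a multiple of 64, so the 2048 multiplications leave the low byte unchanged and the comparison against 3 succeeds.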
When executing this code several times, the dynamic
branch predictor will kick in and correctly predict the outcome.
To avoid this, we leverage ASLR, i.e., we re-run the program after
each execution, which re-randomizes the base address of
the executable and thus the address for which the prediction is
stored. This forces the static predictor to make decisions again.
If an OS does not re-randomize the base address on each run (either
due to a lack of ASLR, or because ASLR randomizes only at boot time,
such as in Windows), we have to search for alternative solutions. One
(that we will use later) is writing a larger probe program that
runs several speculative executions in batch and thus just needs
to be called once. This leaves dynamic prediction no chance,
but allows executing long code speculatively. Another solution
would be to manually unlearn the prediction result. This could,
e.g., be done by running the program with inverted conditionals
several times before calling it with the actual conditionals,
thereby forcing the dynamic predictor to mispredict.
B. Feedback Channel from Speculation
Having explained how speculative execution can be
enforced, we will now discuss how computations done in
speculation can feed data back to the outside (non-speculated)
execution. We face a natural challenge. As soon as the CPU
identifies a misprediction, it flushes (reverts) the
mispredicted branch and thus does not commit its results. That
is, without special care, the outside world cannot see any
effects of speculative execution. To reason about the outcome
of the speculated code, we therefore need a feedback channel
that communicates from the speculative execution back to the
non-speculated execution flow of the program.
Although ideally it is not supposed to, we find that
speculative execution modifies the CPU state in such a way
that it becomes measurable in non-speculated execution. This
allows us to leak data from within speculation, although the
explicit results of the speculated code have been flushed and
are not observable. As direct communication from speculative
to regular execution is infeasible, we have to resort to
side channels. In the following, we describe two exemplary
feedback channels that we can use for this goal.
1) Caching Side Channel Feedback: Our first feedback
channel reads memory inside speculative execution. Such reads
store the memory content in the L1 data cache, which can be
detected by measuring its subsequent access time from normal
execution. This technique can be used to leak a single bit of
information per cached line. In the remainder of this paper, we
mainly use this approach, as shown in Listing 2.
1 << ... >> ; slow condition -> ZF=1
2 jz C1True
3 mov rbx, [rax] ; cache memory at rax
4 jmp Exit
5 C1True:
6 << ... >> ; real execution
7 Exit:
8 << measure [rax] access time >>
9 << fast when cached, slow otherwise >>
Listing 2: Code fragment for caching feedback channel
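The idea behind this channel can be illustrated with a deliberately simplified model. The class `ToyCore`, its method names, and the latency numbers below are our own placeholders and do not describe any real CPU; the sketch only shows that a squashed path may leave a cache line behind while its register writes are discarded.

```python
class ToyCore:
    """Toy model: speculative loads fill the cache, register writes roll back."""

    def __init__(self):
        self.cache = set()      # addresses whose cache line is present
        self.registers = {}     # committed architectural state

    def speculate(self, loads):
        """Run loads on a mispredicted path, then squash the path."""
        shadow = dict(self.registers)
        for reg, addr in loads:
            self.cache.add(addr)   # microarchitectural side effect survives
            shadow[reg] = addr     # stand-in for the loaded value
        # Misprediction detected: 'shadow' is dropped, nothing is committed.

    def access_time(self, addr, hit=40, miss=300):
        """Return an illustrative latency in cycles (placeholder numbers)."""
        latency = hit if addr in self.cache else miss
        self.cache.add(addr)       # the measuring access caches the line too
        return latency

core = ToyCore()
FEEDBACK = 0x1000
core.speculate([("rbx", FEEDBACK)])   # corresponds to 'mov rbx, [rax]'
fast = core.access_time(FEEDBACK)
slow = core.access_time(0x2000)
print(fast < slow, core.registers)
```

In this model the committed registers stay empty, yet the later access to the feedback address is fast, which is the single bit of information the channel conveys.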
2) Conditional Flushing Slowdown Feedback: Our second
feedback channel measures the time it takes to recover from
mispredicted branches. The core observation here is that the
recovery time will depend on the state of the pipeline when
flushing occurs. For example, if we consider the code in
Listing 3, the time it takes to flush the pipeline will be less
if we stop the execution (hlt) from inside the speculative
execution. This is true even if the executed instructions are
simple nops. Note that, as we use caching as our main
feedback channel, we have not investigated all possible
types of such side-effect-incurring instructions.
1 << ... >> ; slow condition -> ZF=1
2 << Start Measurement >>
3 jz C1True
4 nop ; repeat
5 ... ; nops
6 nop ; 128 times
7 [hlt] ; halt the execution
8 jmp Exit
9 C1True:
10 << ... >> ; real execution
11 Exit:
12 << End Measurement >>
Listing 3: Code for conditional flushing slowdown channel
C. Detecting Static Predictor Behavior
We have described in Section II that there are multiple
strategies for static predictors. The code mentioned in List-
ing 1, for example, assumes that the CPU statically predicts
that forward jumps are not taken. In the following, we outline
two tests to reveal the forward jump prediction strategy for
a specific CPU. The first test aims to find out if the forward
jumps are predicted to be taken or not. The code in Listing 4
checks if forward jumps are predicted to be not taken:
1 << ... >> ; slow condition->ZF=1
2 je C1True ; Condition is true
3 C1False:
4 mov rsi, [r11] ; r11 user address
5 jmp Exit
6 C1True:
7 jmp Exit
8 Exit:
9 << measure [r11] access time >>
Listing 4: Detect forward-not-taken predictor
In the above listed code, the conditional branch (line #2)
will eventually evaluate to true, i.e., it will jump to C1True.
That is, the repetition of imul before (line #1) will fill the
port such that the condition is not known and speculative
execution will be started. If the branch was predicted not to
take the forward jump, then the memory read (line #4) will
be speculated, caching the value at r11. Checking the access
time of memory at r11 (line #9) reveals the decision, i.e., if
access to r11 is faster than non-cached access, the predictor
uses the fall-through case for forward jumps.
On the other hand, to check if forward jumps are predicted
as taken, we can invert the above listed code by inverting the
condition of the conditional jump:
1 << ... >> ; slow condition->ZF=1
2 jne C1True ; Condition false
3 C1False:
4 jmp Exit
5 C1True:
6 mov rsi, [r11] ; r11 user address
7 jmp Exit
8 Exit:
9 << measure [r11] access time >>
Out of the CPUs that we tested, only the Sandy Bridge mi-
croarchitecture seems to be predicting forward jumps as taken.
Others speculate the fall-through case, which is the expected
outcome. Note that the prediction behavior is independent of
the condition, e.g., both je and jne will be predicted the
same way (either both jump, or both fall through). For simplicity,
all following examples will assume that fall-through cases are
speculated. In our experiments (e.g., on Sandy Bridge), we
will adjust the branch behavior accordingly.
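The outcome of the two tests can be combined as in the following sketch. The function name, the cycle counts, and the threshold are hypothetical; the inputs are assumed to be the measured access times of the feedback address after running Listing 4 and its inverted variant, respectively.

```python
def classify_forward_jump(fallthrough_cycles: int, taken_cycles: int,
                          threshold: int) -> str:
    """Interpret the two probes: a fast access means that path was speculated."""
    fallthrough_hit = fallthrough_cycles < threshold  # Listing 4 cached [r11]
    taken_hit = taken_cycles < threshold              # inverted test cached [r11]
    if fallthrough_hit and not taken_hit:
        return "not taken"
    if taken_hit and not fallthrough_hit:
        return "taken"
    return "inconclusive"

# Invented timings for illustration only.
print(classify_forward_jump(45, 310, threshold=150))
print(classify_forward_jump(305, 48, threshold=150))
```

Requiring the two tests to disagree guards against a feedback address that is cached (or evicted) for unrelated reasons.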
D. Arbitrary Memory Read
We now try to abuse speculative execution to read arbitrary
memory that should not be accessible during normal (com-
mitted) execution. The basic idea is as follows: in speculative
execution we check the memory content at a specified address,
and report the result back using one of the feedback channels
described in Section IV-B. To leak an arbitrary byte in memory,
we use several test instructions to leak a byte bit-by-bit, i.e.,
it takes 8 checks to read a byte from memory.
1 << ... >> ; slow condition->ZF=1
2 je C1True ; Taken
3 C1False:
4 ; --- START of SPECULATION ---
5 mov r10b, BYTE PTR[r10]
6 test r10b, 1 ; test 1st bit
7 jz C2True
8 C2False: ; 1st bit is 1
9 jmp Exit ; Lvl2 Speculation
10 C2True: ; 1st bit is 0
11 mov rsi, [r11] ; cache mem @ r11
12 jmp Exit
13 ; --- END of SPECULATION ---
14 C1True:
15 << ... >>
16 Exit:
17 ; if [r11] is cached
18 ; 1st bit of [r10] = 0
19 ; else
20 ; 1st bit of [r10] = 1
Listing 5: Arbitrary memory read via nested speculation
Listing 5 shows the code that we can use to read one
bit of arbitrary memory. In this example, we want to read a
value that is located at address r10. While the multiplication
results are being evaluated (line #1), C1False is executed
speculatively. Inside, we read a byte stored at the target address
(r10) into register r10b (line #5). Based on the least signifi-
cant bit of the read value, we make the decision to either jump
to C2False or C2True. Until the memory at r10 (line #5)
is read, C2False (line #8) will be run in nested speculation—
regardless of the pending test result (line #6). Assuming that
the memory read (condition for second-level speculation) is
faster than the sequence of multiplications (condition for first-
level speculation)4, the outcome of the second branch condition
will be known sooner than the first one. Once the test result
and thus the condition of the second conditional jump (line
#7) is known, the second-level speculation will be flushed. The
CPU would then start to execute the correct jump target (line
#10). Note that this does not flush the first-level speculation
(that started in line #3). Therefore, C2True, which caches the
feedback address r11 (line #11), is executed only in the case
that the read memory has 0 for its least significant bit. The side
effects of caching r11 can be easily detected from outside the
speculative execution (line #16), when the multiplication result
is finally known and the speculated paths are flushed.
By repeating this code snippet with different bit offsets to
test, we can leak entire bytes or memory ranges. Technically,
we need to flush the feedback address r11 from the cache so
that we can start fresh measurements, e.g., using clflush.
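Assembling a byte from eight such probes can be sketched as follows. The callable `feedback_cached` is a hypothetical oracle standing in for one run of the Listing 5 pattern with the tested bit set accordingly (including the flush of the feedback address beforehand); the simulated memory only serves to make the sketch self-contained.

```python
from typing import Callable

def leak_byte(address: int, feedback_cached: Callable[[int, int], bool]) -> int:
    """Rebuild one byte from eight one-bit probes."""
    value = 0
    for bit in range(8):
        # The caching path (C2True) runs only when the tested bit is 0,
        # so an uncached feedback address means the bit is 1.
        if not feedback_cached(address, bit):
            value |= 1 << bit
    return value

# Simulated memory standing in for the real probe, for illustration only.
memory = {0x7000: 0xA5}

def simulated_oracle(address: int, bit: int) -> bool:
    """Report 'cached' exactly when the selected bit of the byte is 0."""
    return (memory[address] >> bit) & 1 == 0

print(hex(leak_byte(0x7000, simulated_oracle)))
```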
E. Arbitrary Kernel-Space Memory Read (But...)
Being able to read arbitrary user memory might have some
interesting use cases, even for local attackers who seemingly
can already read arbitrary code (for a discussion see Section VI).
Following our threat model of a user-space attacker, the im-
mediate follow-up question is now: Can we even read memory
in regions for which we do not have privileges?
In our second experiment we thus aim to read privileged
memory in kernel-land as an unprivileged user. In other words,
we want to check if privilege separation between user and ker-
nel was taken into account also during speculative execution.
To this end, we ran the code from Listing 5 with probe address
pointing to a kernel address. After running the code against
various microarchitectures, much to our surprise, we saw that
the code was being executed and the kernel memory was read.
However, the values that we read were not the true values
we expected. More specifically, we observed that instead of
getting the actual content of the memory, we would read 0s
for all read kernel addresses. This is good news: an
arbitrary kernel memory read from user space, which would be
a complete disaster for security, is not possible.
1 << ... >> ; slow condition->ZF=1
2 je C1True
3 C1False:
4 ; --- START of SPECULATION ---
5 cmp BYTE PTR[r10], 0 ; r10 = K
6 jz C2True
7 C2False: ; page is not mapped
8 jmp Exit
9 C2True: ; page is mapped
10 mov rsi, [r11] ; r11 user address
11 jmp Exit
12 ; --- END of SPECULATION ---
13 C1True:
14 jmp Exit
15 Exit:
16 << measure [r11] access time >>
Listing 6: Checking if address K is mapped
4 We verify this assumption as part of our experiments.
Interestingly, however, our measurement results differed
completely when we tried to read the values of kernel-land
virtual addresses that did not map to any physical page. In
such a case, the execution of the corresponding micro-OP
would stop, together with the micro-OPs that use the value
read from it. For example, if we read a non-mapped value into
a register rax, using rax anywhere in subsequent instructions
as a source operand will stall its execution. This yields a clear
distinction: an access violation results in a 0 being read,
whereas in the case of a page fault (e.g., when the memory
page is not mapped) the entire execution stalls. An attacker
can use this side channel to distinguish between mapped and
unmapped kernel pages. If a byte read from the kernel is 0, the
page is mapped; if the read stalls, the page is not mapped. To
check if a kernel address K is mapped or not, we use the code
from Listing 5 with the exception that now we check for the
memory content to be 0. The assembly code for it is displayed
in Listing 6 (r10 is K).
We can further simplify this code by removing the second level
conditional as shown in Listing 7. The observation here is that
whenever we read something from a mapped kernel page we
get a 0 back, and whenever the page is not mapped execution
stalls. Therefore, we can use the read value as an offset in a
memory access that is caching the user-space address (line #6).
The offset in rax is 0 in the case of a general protection fault
(GPF, i.e., the page is mapped but not accessible to us) and thus
r11 will be cached. In the case of a non-mapped page, the execution
stalls after line #5. Note that removing the register offset would
remove the dependency between lines #5 and #6 and thus allow
for instruction reordering, destroying the side channel.
1 << ... >> ; slow condition->ZF=1
2 je C1True
3 C1False:
4 ; --- START of SPECULATION ---
5 mov rax, [r10] ; r10 = K
6 mov rsi, [r11+rax] ; r11 user address
7 jmp Exit
8 ; --- END of SPECULATION ---
9 C1True:
10 jmp Exit
11 Exit:
12 << measure [r11] access time >>
Listing 7: Checking mapped pages via load dependency
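The load dependency in Listing 7 can be modeled in a few lines. This is a toy model of the observed behavior only, not of the hardware; the function names are ours, and a stalled load is represented by `None`.

```python
from typing import Optional

def speculative_kernel_load(mapped: bool) -> Optional[int]:
    """Model 'mov rax, [K]' in speculation, following the observation above.

    Mapped but inaccessible page: the load yields 0.
    Unmapped page: the load never completes, modeled here as None.
    """
    return 0 if mapped else None

def feedback_is_cached(mapped: bool) -> bool:
    """Model 'mov rsi, [r11+rax]': it can only issue once rax is available."""
    rax = speculative_kernel_load(mapped)
    if rax is None:
        return False       # dependent load stalls, r11 is never touched
    return rax == 0        # offset 0, so the line at r11 itself is cached

print(feedback_is_cached(True), feedback_is_cached(False))
```

Without the `rax` offset the second load would not depend on the first, which corresponds to the reordering problem mentioned above.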
F. Reasoning About the Observed Side Channel
Having this side channel at hand, it is not hard to see
that schemes like KASLR that try to hide the presence of
kernel pages are severely threatened. One immediate question
that came to our mind was: Why does the CPU proceed with
speculative execution in the case of access violations (thereby
creating this side channel)? We expected the CPU to stall the
execution whenever any fault occurs, be it either insufficient
privileges or a page fault. In practice, however, we see a clear
difference.
In the following, we try to explain the underlying process in
these two scenarios. Due to the lack of documented details on
CPU internals, this explanation has to be taken with a grain of
salt. The core problem is caused by the fact that faults cannot
be handled by the memory unit during speculative execution,
i.e., they are only dealt with at the commit stage. In the case
of kernel memory accesses, we have two types of faults. A
general protection fault (GPF) occurs if memory is mapped,
but we do not have sufficient access rights to read it. Second,
a page fault (PF) occurs when memory is not mapped at all.
Given that memory units cannot handle faults speculatively
(until later at commit stage), there are two different scenarios
of how these two faulty memory accesses are serviced. In the
case of GPF, the memory access is actually still carried out
by the L1 data cache. However, because of a privilege fault,
the actual data is not put into a load buffer entry; instead
the entry is marked as faulty to be handled later at commit
stage. Remember that each memory access allocates a new
load buffer entry, which remains allocated until the micro-OP
is committed. The observed value 0 in such cases can either
be a default load buffer value set when the entry is reserved for
the micro-OP, or a value used by the L1 cache for faulty accesses.
In contrast, for PF, we have a cache miss that will never
be resolved as the page is not mapped. Therefore, the L1 data
cache cannot return any value to the load buffer (because it
does not have any), and it cannot finish the request (because it
cannot handle faults while speculating). Note that the L1 data
cache has a limited number of entries for handling ongoing
cache misses. This means that the entry for handling the
memory read will remain allocated for the entire speculation.
This will additionally stall the load buffer entry that waits for
the result and any instructions that depend on the read value.
G. Kernel-Probing Specific Feedback Channel
We can leverage the fact that page faults stall the execution
engine to design an even easier and faster way to determine
whether a page is mapped. For this we use the following
observation: for non-mapped pages, the number of memory accesses
that can be in flight is bounded by the number of ongoing loads
(outstanding cache misses) that the L1 data cache can track
(e.g., 32 in Haswell). For mapped addresses, the
bottleneck is the number of load buffer entries (e.g., 72 in
Haswell). This means that, if we have a non-mapped address
K, doing mov rax,[K] 32 times will stall the execution of
following load instructions. However, if address K is mapped,
then the execution of loads will only stall after the CPU runs
out of load buffer entries, i.e., after executing mov rax,[K]
72 times in Haswell. Therefore, any number of loads (mov
rax,[K]) between 32 and 71, followed by reading a user-
space feedback address U (mov rsi,[U]), will result in
caching U if the page at address K is mapped.
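The window of usable load counts follows directly from the two limits. The sketch below is a toy model under the assumption that the feedback load executes if and only if fewer loads than the relevant limit precede it; the defaults are the Haswell figures quoted above, and the function names are ours.

```python
def feedback_cached(n_loads: int, mapped: bool,
                    parallel_misses: int = 32, load_buffer: int = 72) -> bool:
    """Toy model: does 'mov rsi, [U]' execute after n_loads of 'mov rax, [K]'?"""
    limit = load_buffer if mapped else parallel_misses
    return n_loads < limit   # the feedback load stalls once the limit is reached

def distinguishing_counts(parallel_misses: int = 32, load_buffer: int = 72):
    """Load counts for which mapped and unmapped pages behave differently."""
    return [n for n in range(1, load_buffer + 1)
            if feedback_cached(n, True, parallel_misses, load_buffer)
            != feedback_cached(n, False, parallel_misses, load_buffer)]

window = distinguishing_counts()
print(window[0], window[-1])
```

Under this model the distinguishing counts are exactly 32 through 71, matching the range given in the text.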
V. EVALUATION
We will now perform the described analyses on five x86
microarchitectures. Each microarchitecture is a different
iteration of the CPU circuitry, which affects the
execution process while keeping the high-level semantics
mostly the same. To see how the inner workings of CPUs may
have changed, we gathered a wide selection of Intel CPUs
with microarchitectures ranging from 2004 to 2015. The complete
list of the microarchitectures used in our experiments is shown
in Table I. Throughout the remainder of the paper, we refer to
each of the CPUs by its microarchitecture codename.
µ-arch.        Model Name                          Year
Skylake        Intel® Core™ i5-6200U @2.30GHz      2015
Haswell        Intel® Core™ i5-4690 @3.50GHz       2013
Sandy Bridge   Intel® Xeon® E5-2430 v2 @2.50GHz    2011
Nehalem        Intel® Xeon® L5520 @2.27GHz         2008
Prescott       Intel® Xeon™ 3.20GHz                2004
TABLE I: List of Intel’s microarchitectures tested.
A. Revealing Microarchitecture Details
As a first experiment, we applied several tests described in
Section IV to reveal the inner workings of the different archi-
tectures. That is, we experimentally reveal the static prediction
strategy, the number of load buffer entries, and the maximum
number of parallel cache misses. Table II summarizes the
outcomes of those measurements, showing that the various
Intel architectures differ in their characteristics. The general
trend is that both the number of load buffer entries and parallel
loads increased over the years. For static prediction,
the picture is more uniform: all microarchitectures except
Sandy Bridge fall through forward jumps. We will use these
architecture-dependent results to adapt the code snippets used
in the following experiments, especially regarding prediction
expectation and load buffer usage.
For Kernel ASLR derandomization we use both our attack
techniques from Section IV-E and Section IV-G. In the
following sections we present our findings.
B. Measuring the Execution Time
In Section IV we have described ways to identify whether
a page is mapped or not. All feedback channels we pre-
sented relied on timing differences due to memory caching.
To measure such time differences between cached and non-
cached memory accesses, we use hardware timestamp counters
(rdtsc and rdtscp). Such counters offer sub-nanosecond
resolution and are sufficient to differentiate the two
cases. However, as these instructions could be reordered during
the execution on a CPU, care needs to be taken to get the
measurements as precise as possible. To this end, we resort
to serializing instructions that avoid instruction reordering (in
particular, we use cpuid). The code for measuring a memory
access at address U is shown below:
1 cpuid ; serializing point
2 rdtsc ; start measuring
3 mov r11, rax ; store measurement
4 mov rax, [U] ; access memory
5 rdtscp ; end measuring
6 mov r12, rax ; store measurement
7 cpuid ; serializing point
8 sub r12, r11 ; r12 = time diff
Listing 8: Measuring access time via timestamp counters.
µ-arch.    FJ¹   LB² Entries   PL³
Skylake    N     72            40
Haswell    N     72            32
S-Bridge   N     64            32
Nehalem    T     48            11
Prescott   N     –             19
TABLE II: Prediction behavior and characteristics.
¹Forward Jump: Taken (T), Not taken (N); ²Load Buffer; ³Parallel Loads
The code in Listing 8 performs the following steps (by line):
1) cpuid waits for all preceding instructions to finish before
it gets executed. This introduces a serializing point to
ensure that no instruction before it will get executed after.
2) rdtsc reads the initial counter value, which will be
returned in the edx:eax register pair.
3) We are interested in smaller values that can fit in 32 bits
and thus only store the lower part of it, i.e., eax, in r11.
4) We read the data at address U. This is the access for which
we want to measure the time.
5) For the second measurement, we use the rdtscp in-
struction to get the final timestamp counter. rdtscp is
specially designed to read the timestamp counter only
after all memory operations before it are done, which
exactly matches our needs.
6) We store the new counter value into r12.
7) To be sure that no subsequent instructions will be
scheduled together with the rdtscp instruction, we use
another serializing point (cpuid).
8) We subtract the start timestamp in r11 from the end
timestamp in r12 to get the difference.
This code allows us to measure only the time needed for
memory access, without adding too much noise. For older
CPUs that do not support rdtscp yet, we resort to using
rdtsc for the second measurement as well and additionally
preceding it with cpuid, so that rdtsc is not scheduled
together with the memory access that we want to measure. By
doing so we introduce another instruction (cpuid) in between
two measurement points. This will add some overhead to the
returned values. Nevertheless, the overhead will be included
in both cached and non-cached memory accesses, and thus the
difference will remain the same.
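Since only the lower 32 bits of each timestamp are kept, the low word may wrap around between the two reads. A minimal sketch of a wrap-tolerant difference is shown below; it assumes that the measured interval is shorter than 2^32 cycles, and the function name is ours.

```python
MASK32 = (1 << 32) - 1

def elapsed_cycles(start_low32: int, end_low32: int) -> int:
    """Difference of two 32-bit counter readings, tolerant of one wrap-around."""
    return (end_low32 - start_low32) & MASK32

print(elapsed_cycles(1_000, 1_250))             # no wrap between the reads
print(elapsed_cycles(0xFFFFFF00, 0x00000040))   # low word wrapped once
```

In the assembly of Listing 8 this corresponds to using only the lower 32 bits of r12 after the subtraction.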
C. Differentiating Between Mapped/Unmapped Kernel Pages
At this point we have all the prerequisites to measure tim-
ing differences accurately. We will now use this methodology
to evaluate the feedback channels as proposed in Section IV-B
and Section IV-G. We will first focus on the two feedback
channels that cache a specific user-mode memory address.
Given lack of conceptual differences, we will treat them
equally in this section and benchmark the timing differences
between accessing a cached (mapped kernel page) and non-
cached (non-mapped kernel page) address using the two-
level speculation method proposed in Section IV-B1. Figure 2
summarizes the timings, more specifically their minimum,
median, and average values over 1000 runs.
[Figure 2: five panels, (a) Skylake, (b) Haswell, (c) Sandy Bridge, (d) Nehalem, (e) Prescott, each comparing access timings for Mapped and Not Mapped pages]
Fig. 2: Results of the caching side channel. If left (mapped) and right (not mapped) deviate, feedback is reliable.
Given that we use rdtsc for measurements, the timings approximate
the number of cycles5 for corresponding memory accesses. As we
want to see whether memory is cached or not, we will later
resort to the more reliable minimum values.
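Why the minimum is the more robust statistic can be seen from a small example. The timings and the threshold below are invented for illustration and are not measurements from our experiments; the sketch assumes that disturbances such as interrupts add cycles to a run rather than remove them.

```python
import statistics

def summarize(samples):
    """Return (minimum, median, mean) of a list of cycle counts."""
    return min(samples), statistics.median(samples), statistics.mean(samples)

def looks_cached(samples, threshold):
    """Classify by the minimum, which a single disturbed run cannot raise."""
    return min(samples) < threshold

# Made-up timings, not measurements from the paper.
cached_runs = [45, 47, 44, 900, 46]        # one run disturbed by noise
uncached_runs = [310, 305, 2200, 298, 301]
THRESHOLD = 150                            # hypothetical cut-off in cycles

print(summarize(cached_runs))
print(looks_cached(cached_runs, THRESHOLD), looks_cached(uncached_runs, THRESHOLD))
```

In this example the mean of the cached runs exceeds the threshold because of one outlier, whereas the minimum still classifies both cases correctly.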
Figure 2 clearly shows timing differences between mapped
and non-mapped pages for most architectures. Two architec-
tures are worth mentioning in particular, as their behavior
differed from the others. In Nehalem, accessing a non-mapped
memory page does not stall the dependent instructions in the
pipeline; instead it would also return 0 as a value. Therefore,
the technique defined in Section IV-E (two-level speculation
in Listing 6 or dependent load in Listing 7) will not work
there. However, resource exhaustion of architectural buffers
(Section IV-G) works seamlessly and we report these numbers
instead. In particular, following Table II, we used between 11
and 47 load instructions to the probed address followed by
a caching instruction, as Nehalem features 11 parallel cache
misses and 48 load buffer entries, respectively.
Prescott also prevents the first approach, as both kinds of
faulting addresses stall the pipeline and therefore the caching
instruction is not executed at all. Similarly, the resource exhaustion
technique (Section IV-G) also fails. Regardless of the type of
fault (i.e., page fault or access violation), consecutive load
instructions are stalled after 19 faulty ones.
The flushing-based feedback channel turned out to be
less reliable than caching-based feedback, as the minimum
measurements did not show any measurable difference re-
gardless of the branch condition. In the following, we will
thus rely on the caching-based feedback channels that have
shown to be accurate. Having said this, we observed significant
timing differences between confirmed-but-halted speculative
execution (via hlt) and mispredicted execution. For example,
on Sandy Bridge, the median of 1000 measurements differed
by 16 cycles, and the average execution time on Nehalem
was also significantly higher for non-mapped pages. We refer the
interested reader to Section IX-A to see how and on which
architectures one could use flushing-based feedback.
5http://www.forwardscattering.org/post/15
D. Breaking Linux KASLR
By now, we can successfully identify whether a single
kernel address space memory page is mapped, using any of our
proposed approaches with equal success. The dependent load
technique (as shown in Listing 7) is by far the simplest means
of doing so, considering that it uses only two instructions.
We will thus use this test to derandomize KASLR on 64-
bit Linux (Ubuntu 16.04 LTS; kernel 4.13.0) running on a 4-
core Haswell CPU with 32GB RAM. Kernel ASLR is enabled
in the OS by default, which, as described in Section II-C,
randomizes the location of the kernel image in the range
from 0xffffffff80000000 to 0xffffffffc0000000.
Given that the starting address of the image is aligned to 2MiB
pages this results in 9 bits of randomness.
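The entropy figure follows from the size of the range and the alignment, as the following sketch shows. The helper name `slots_and_bits` is ours; the addresses are the ones given above.

```python
import math

MIB = 1 << 20

def slots_and_bits(start: int, end: int, alignment: int):
    """Number of aligned candidate addresses in [start, end) and its log2."""
    slots = (end - start) // alignment
    return slots, math.log2(slots)

slots, bits = slots_and_bits(0xffffffff80000000, 0xffffffffc0000000, 2 * MIB)
print(slots, bits)
```

The range spans 1GiB, which yields 512 candidate base addresses at 2MiB alignment.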
For the derandomization attack, we created an executable,
which takes the probe address as an input and outputs the time
it takes to access the user-mode memory address (i.e., the one
that is being cached in the case of the mapped kernel page). To
reduce the possible noise, we run the program twice per probed
address and look at the minimum duration of the two runs. We
use a single-threaded process to run the experiment. Increasing
the number of parallel processes is possible, but would intro-
duce noise because of hyper-threading, which would require
more than two tests per probed address. According to our
experiments, a single process running each probe twice showed
the best compromise between efficiency and accuracy. Using
the program as an oracle, we probe each possible address
in the KASLR memory range with 2MiB increments (which
corresponds to Linux KASLR’s alignment size). In principle,
this could be further improved with larger increments that take
into account the size of the kernel image, or by using a binary
(instead of linear) search. Note that we deliberately did not
optimize the search time, in order to obtain a generic
solution that works with any kernel image size (even small
ones). Still, using this naïve search, we reliably find every
mapped 2MiB kernel page in 2.63 seconds on average.
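The linear search can be sketched as follows. The callable `measure` is a hypothetical wrapper that runs the probe program once for a given address and returns the access time of the user-mode feedback address; the base address, the timings, and the threshold of the stand-in oracle are invented for illustration.

```python
from typing import Callable, List

def scan_range(start: int, end: int, step: int,
               measure: Callable[[int], int],
               threshold: int, trials: int = 2) -> List[int]:
    """Return the probe addresses classified as mapped."""
    mapped = []
    for addr in range(start, end, step):
        best = min(measure(addr) for _ in range(trials))
        if best < threshold:           # feedback address was cached
            mapped.append(addr)
    return mapped

# Stand-in oracle with invented timings, for illustration only.
KERNEL_BASE = 0xffffffff9a000000
IMAGE = {KERNEL_BASE + i * 0x200000 for i in range(11)}

def fake_measure(addr: int) -> int:
    return 45 if addr in IMAGE else 300

found = scan_range(0xffffffff80000000, 0xffffffffc0000000, 0x200000,
                   fake_measure, threshold=150)
print(hex(found[0]), len(found))
```

Taking the minimum over the trials mirrors our choice of running the program twice per probed address.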
We use the same approach to search for kernel modules,
but modified the search area, given that the possible start
address of modules is between 0xffffffffc0000000 and
0xffffffffc0c00000 (page-aligned). This gives us 3072
possible places where modules can be allocated (∼11 bits of
entropy). Running the above experiment for modules with the
modified parameters found all allocated pages for modules
reliably in 14.70 seconds on average.
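The number of candidate module addresses and a rough per-probe cost can be derived from the figures above. The sketch assumes that the module scan also used two runs per probed address, as in the kernel image scan; the helper name `per_probe_ms` is ours.

```python
import math

PAGE = 4096
modules_slots = (0xffffffffc0c00000 - 0xffffffffc0000000) // PAGE
modules_bits = math.log2(modules_slots)

def per_probe_ms(total_seconds: float, addresses: int, trials: int = 2) -> float:
    """Average wall-clock cost of one probe run, in milliseconds."""
    return total_seconds * 1000 / (addresses * trials)

print(modules_slots, round(modules_bits, 2))
print(round(per_probe_ms(2.63, 512), 2),
      round(per_probe_ms(14.70, modules_slots), 2))
```

Both scans come out at roughly the same cost per probe run, which is consistent with the total time being dominated by starting the probe program for every address.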
To evaluate the accuracy of this scheme, we ran the
program 1000 times. We use the list of kernel pages
in /sys/kernel/debug/kernel_page_tables as a
ground truth to detect possible false positives and false nega-
tives. Out of 1000 runs, our program always correctly identified
11 out of 11 kernel image pages. As for detecting loaded
modules, our program successfully identified all 1316 mapped
module pages in 968 out of 1000 runs. For the remaining 32
runs it only missed a single mapped page. Throughout the test
run, our program did not report any false positives.
E. Breaking Linux KASLR in a VM
So far we have constrained the proposed attacks to physical
hardware, yet an attacker might be interested in abusing similar
techniques in virtualized environments. A core driver for
the recent increase of virtualization was the CPUs’ virtualization
support, which made the execution more efficient. One
of the features that hardware supports nowadays is extended
page tables (EPT), which add another layer of translation
to page tables in order to translate guest physical addresses
into host physical addresses. Because EPTs require no hypervisor
intervention to translate virtual addresses, we can mount our
attack even in a VM. In our next experiment, we carry out our
measurements on virtualized 64-bit Linux (Ubuntu 16.04 LTS;
kernel 4.14.0) running on a 4-core Haswell CPU with 8GB of
virtual RAM. For virtualization we use VirtualBox (Version
5.1.30) with hardware acceleration enabled (the default).
We ran the same measurements as we did for physical
machines (Section V-D). As expected, a virtualized environ-
ment reduced the efficiency of our approach. Keeping the
same parameters (i.e., 2 trials per probed address on a single
core), the measurements took on average 2.51 seconds and
17.78 seconds to find mapped pages for the kernel image
and loaded modules, respectively. Further, the precision of
the approach was decreased, missing 47 out of 528 pages on
average (min/max being 26 and 72 respectively, and median
being 47) for modules, and missing 2 out of 11 pages on
average for the kernel image.
To cope with this reduced accuracy, in a second experiment,
we doubled the number of trials per probed address (i.e., 4
trials instead of 2). This significantly improved the precision
of our approach; however, it also doubled the execution time.
With increased trials, it takes on average 5.66 seconds to
find the kernel image and 33.80 seconds to find the loaded
modules. As for the precision of the attack, we observed on
average 5 missed pages for modules (min/max being 0 and 52,
respectively, and median being 4) and 1 missed page of the
kernel image. Note, however, that despite the missed pages,
it is trivial for an attacker to correctly identify the area of
the kernel image if the missing pages are not
at the edge of the image, which, given the overall size
of the region found, can be easily verified. Additionally, one
can probe for neighboring pages after the initial scan to find
possible false negatives, or simply repeat measurements.
F. Breaking Windows KASLR
After we have demonstrated how security guarantees of
Linux KASLR are undermined with speculative execution, we
will now turn to the Windows OS. We chose a Skylake CPU
and Windows 10 (version 1709) for our evaluation. Although
the attack itself is OS agnostic, we still had to adjust our test
program for Windows. More specifically, we have to work around
user-land ASLR implementation differences between Windows and
Linux. In contrast to Linux, which randomizes executables
at every run, Windows randomizes them using a fixed seed
determined at boot time. Consequently, running the same
program multiple times no longer re-randomizes base addresses,
which is what we relied on to evade dynamic prediction. To solve
the problem, we created a single executable probing multiple
addresses one after another in consecutive speculative memory
accesses, derandomizing the address space in a single run.
As a side effect, this drastically improved the performance
of our test program. Similar to our attack against the Linux
kernel (Section V-D), we also run this program as a single
process, this time using just one probe per address.
The second challenge in Windows comes from KASLR
itself. We observed that the kernel image is always loaded
together with the hardware abstraction layer (HAL) module.
However, the loading order is randomized (e.g., Kernel→HAL
or HAL→Kernel). Additionally, although both the kernel
image and HAL are allocated on consecutive large pages
(pages that are 2MiB large instead of 4KiB), their entry points
are randomized inside the large page and can start at any
4KiB boundary. Given that we can only probe for mapped
pages, this attack only allows us to find the base address of
the large page containing the kernel image and HAL, and
not their actual addresses. However, this still removes 18 bits
of KASLR randomness and can be used in combination with
other techniques (e.g., those discussed in Section VII) to break
the remaining 9 bits (i.e., the 4KiB-aligned kernel offset).
According to Section II-C, we scan the kernel address range
from 0xfffff80000000000 to 0xfffff88000000000.
Given the alignment of large pages, we step through the
memory in 2MiB increments, giving us in total 262144 tries.
We verified that only the kernel image would allocate five
consecutive 2MiB pages and use this as its fingerprint. Running
the experiment 100 times showed that the kernel image area in
Windows can be found in under a second (0.55s) on average6.
Out of these 100 runs, all 5 pages were found 74 times, while
a single page was missed 15 times and two pages were missed 11 times.
Throughout the test run, we did not see any false positives.
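The search space and the fingerprint check can be sketched as follows. The function `find_runs` and the example layout are ours; the sketch assumes that the fingerprint is a run of exactly five consecutive mapped 2MiB pages and ignores missed pages.

```python
from typing import List

def find_runs(mapped: List[bool], run_length: int = 5) -> List[int]:
    """Start indices of runs of exactly run_length consecutive mapped probes."""
    starts, i, n = [], 0, len(mapped)
    while i < n:
        if not mapped[i]:
            i += 1
            continue
        j = i
        while j < n and mapped[j]:
            j += 1
        if j - i == run_length:
            starts.append(i)
        i = j
    return starts

START, STEP = 0xfffff80000000000, 2 << 20
probes = (0xfffff88000000000 - START) // STEP   # number of 2MiB candidates

# Invented layout for illustration: a 5-page image and a 2-page region.
flags = [False] * 64
for k in (10, 11, 12, 13, 14, 40, 41):
    flags[k] = True

hits = find_runs(flags)
print(probes, [hex(START + i * STEP) for i in hits])
```

The 262144 candidates correspond to the 18 bits removed by the attack; the remaining 9 bits are the 512 possible 4KiB-aligned offsets inside a 2MiB page.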
VI. DISCUSSION
A. Security Implications of Speculative Execution
We believe that showing how one can derandomize KASLR
is just the tip of the iceberg of security implications that
speculative execution may have. Executing arbitrary code on
the CPU without consequences (such as segmentation faults),
and being able to report results back is a powerful tool, which
could have many different use cases. One such case is using the
arbitrary memory read technique (Section IV-D) from inside a
restricted environment that sandboxes memory accesses using
bounds checking. For example, consider a JavaScript execution
environment in Web browsers that checks if memory accesses
in attacker-controlled code target addresses outside of a well-defined
safe region (and thus should be rejected). Given that the
majority of modern browsers compile JavaScript into native
code, executing arbitrary code on a CPU is feasible even from
web applications. Therefore, to gain arbitrary memory reads
from inside JavaScript, one has to generate native code similar
to that described in Section IV-D. Such an attack could be
simplified by using WebAssembly, which allows compilation
of C/C++ programs into WebAssembly bytecode, which
itself will be compiled into native code by browsers. We leave
an evaluation of such attacks open for future work and hope
that our analysis of speculative execution will foster new
research to understand the full scope of the problem.
6 Note that the additional overhead in Linux was due to loading and
removing the executable for each memory probe, which could also be heavily
optimized.
B. Possible Mitigations
We now turn to possible mitigation techniques that can be
applied to make systems resistant to our attack.
Hardware Modifications: First, we discuss approaches
that require hardware modifications. The naive approach
would disable speculative execution altogether, which, how-
ever, would drastically reduce the performance of the CPU—
every branch instruction would stall the execution pipeline.
A more reasonable approach would be to stall the execution
when a privilege fault occurs. This successfully removes the
information leak that we use to distinguish between mapped
and non-mapped pages. However, this technique does not get
rid of the general problem of side channels in speculative
execution. We have already provided two feedback channels
in Section IV-B: (i) measuring the complete execution time
and detecting differences in rollback complexity in case of
mispredictions, and (ii) caching user-space memory. To defeat
the former feedback channel, one would need to unify
execution times for micro-OPs. Disabling the latter would
require forbidding memory loads in speculative execution.
However, given that memory accesses are already a bottle-
neck for modern CPUs, blocking their speculation will incur
a significant slowdown. Note that merely disabling nested
speculation would not remedy the problem, as one could cache
two different memory pages based on the value to leak in first-level
speculation. In summary, while mitigation in hardware
is possible, rigorous security would imply a significant
performance degradation.
Attack-Specific Defenses: Although the problem origi-
nates from hardware, some attack-specific mitigations can
also be applied in software. One such mitigation against our
KASLR attack is stronger kernel/user space isolation, which
has been proposed multiple times [9], [11], [20], [24]. The
basic idea is to separate kernel and user address spaces into
different virtual spaces. This would remove kernel addresses
entirely from user-level page tables, thus making them inacces-
sible. The downside to this approach is the overhead caused by
flushing translation lookaside buffers (TLBs) on every context
switch from user to kernel mode. Alternatively, one could try
to increase the entropy of KASLR, which however would only
slow down the attack, and efficient binary search techniques
would reduce the search effort to logarithmic complexity. Also,
while any of this would mitigate leaks from kernel into user
mode, as we have discussed in the previous subsection, even
pure user-mode attacks can be of concern.
It seems more promising to hide the kernel content rather
than just its existence. One such solution was recently imple-
mented in OpenBSD and is dubbed Kernel Address Random-
ized Link (KARL). Instead of randomizing the base address
of the kernel image, KARL randomizes the kernel’s content
(similar to proposed fine-grained randomization schemes in
user space [21], [29], [35]). At every system startup a new
kernel image will be linked together by combining object files
in a random order. Having a completely different layout at each
boot prevents the attacker from predicting the addresses of required
code or data pieces. KARL thus withstands our proposed
derandomization attack, as an attacker can only learn where
the kernel is mapped, but not how it is laid out.
Compiler-Assisted Solutions: An interesting defense
opens up in a setting where the attacker does not control the
compiler that generates the measurement code. For example,
consider a setting where an adversary can specify code that
WebAssembly ultimately compiles into bytecode. Assume
the attacker aims to read beyond a critical memory check
via speculative execution. The compiler, assuming knowledge
of such critical conditionals (e.g., by code annotations), could
then create specific branches that are guarded against misuse.
Listing 9 shows an example of such a guarded conditional
in a BTFNT setting for the compiled code snippet if (rax
== somevar) {...} else {...}. To prevent speculative
execution of the fall-through code, a guard injects a backward jump
(line #4) that is never taken in normal execution but is predicted
as taken, effectively creating an infinite loop in speculative
execution. This way, an attacker has no possibility to inject
code in the fall-through case that is executed speculatively.
1 cmp rax, [somevar]
2 je Equal ; forward jump, predicted not taken
3 FallThrough1:
4 je FallThrough1 ; GUARD: backward jump, never taken in
5 ; normal execution, predicted taken in BTFNT
6 ... ; code in else {} branch
7 jmp Exit
8 Equal:
9 ... ; code in if {} branch
10 Exit:
Listing 9: A guarded conditional hinders static
misprediction, assuming BTFNT in this example.
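To make the effect of the guard easier to follow, the static prediction rule can be modeled in a few lines. The following Python sketch is a simplified model under stated assumptions, not a description of real hardware: instruction indices stand in for code addresses, only the static BTFNT rule is modeled (no dynamic predictor), and the names `btfnt_predicts_taken` and `speculative_path` are introduced here for illustration.

```python
def btfnt_predicts_taken(branch_index, target_index):
    """Static BTFNT rule: backward targets (including a jump to itself)
    are predicted taken, forward targets are predicted not taken."""
    return target_index <= branch_index


# Simplified model of the guarded conditional: (text, branch target index or None).
# Indices stand in for code addresses; they are illustrative, not real offsets.
program = [
    ("cmp rax, [somevar]", None),   # 0
    ("je Equal", 4),                # 1: forward jump
    ("je FallThrough1", 2),         # 2: GUARD, jumps back to itself
    ("else-branch code", None),     # 3
    ("if-branch code", None),       # 4: Equal
]


def speculative_path(program, start=0, max_steps=8):
    """Follow the statically predicted path for at most `max_steps` instructions."""
    path, pc = [], start
    for _ in range(max_steps):
        if pc >= len(program):
            break
        path.append(pc)
        _, target = program[pc]
        if target is not None and btfnt_predicts_taken(pc, target):
            pc = target          # predicted taken
        else:
            pc += 1              # predicted not taken: fall through
    return path


path = speculative_path(program)
print(path)
```

In this model the predicted path stays on the guard (index 2) and never reaches the else-branch code (index 3), which is the property the guard is meant to provide.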
C. Future Work
Affected Vendors and ISAs: The focus of our experiments
was Intel x86 CPUs. We have not tried to apply our speculative
execution-based attacks against other architectures, but also
have not seen any fundamental reasons why this should not be
possible. For example, the CPUs of other CISC vendors like
AMD also offer speculative execution engines that would open
up similar side channels. In principle, even RISC processors
like ARM-based CPUs feature speculative execution, such as
Cortex-R processors [5] or the Cortex-A57 in the LG Nexus 5X
smartphone [4]. Having said this, the inner workings of CPUs
might differ significantly. For example, if a CPU does not
feature nested speculation, the presented side channels require
adaptations. With this work, we have proposed several auto-
mated tests that identify whether a certain CPU is susceptible
to speculation-based side channels. Using this test suite, we
aim to broaden our analyses to other architectures in the future.
User-Mode Attacks: We have already conceptually de-
scribed that user mode processes might also be at risk due
to speculative execution (Section VI-A). In immediate future
work, we plan to test whether sandboxed environments that
allow attacker-controlled code to be JIT-compiled (in particular
browsers) can be abused to read out-of-bounds. Such a proof-
of-concept will be a significant research and engineering effort
on its own and is thus out of scope for this paper.
VII. RELATED WORK
In the following, we list some of the works that are related
to our approach. This includes recent works targeting KASLR,
but also side channels in general. We highlight that none of
these works has proposed to use speculative execution for
their side channels. In contrast, our work is the first to show
(i) how speculation can be reliably abused to execute attacker-controlled
code that is never executed without speculation, (ii)
that such speculative execution is not free of side effects, as
assumed so far, and (iii) that even modern KASLR implementations
using a high entropy (such as in Windows) are
undermined when facing speculation.
A. Using Hardware to Challenge (K)ASLR
Hund et al. [17] propose timing-based side channel attacks
against KASLR. They describe three different scenarios, Cache
Probing, Double Page Fault, and Cache Preloading, in which
the attacker is able to leverage side-channels in the hardware to
derandomize KASLR. The general observation of these attacks
is that, even though the memory accesses in kernel space result
in privilege faults, their corresponding cache/translation entries
are still stored and allow faster consecutive accesses.
The translation cache side channel was further improved
by Jang et al. [19] by using Intel’s TSX (Transactional Syn-
chronization Extensions). TSX allows handling faulty mem-
ory accesses without OS intervention, reducing measurement
noise. The authors managed to map the whole kernel space, as
well as to distinguish the executable privileges of its pages.
Instead of accessing privileged memory pages directly,
Gruss et al. [14] suggest using prefetch instructions. Prefetch
instructions have the advantage that they do not
cause page faults and ignore privilege checks. However, they
still go through the same translation/lookup as regular memory
accesses would, and thus leave a measurable trace behind.
Another KASLR derandomization attack, from Evtyushkin
et al. [12], is more closely related to our approach in that it also uses
branch predictors. However, instead of abusing speculative
execution to run arbitrary code without consequences, the
authors try to cause collisions in branch target buffers (BTB)
and thus leak the lower addresses of randomized pages.
User-space ASLR has also been challenged by others. Gras
et al. [13] use MMU features to derandomize ASLR from
JavaScript code executing in a victim’s browser.
Bosman et al. [7] use the memory deduplication feature in op-
erating systems to leak memory pointers from JavaScript, thus
also resulting in successful ASLR derandomization (although
the consequences of the complete attack were more significant
than breaking ASLR). Similarly, Barresi et al. [6] use memory
deduplication in virtual machine monitors to leak the address
space layout of neighbor VMs.
B. CPU-Related Side Channels
The need to run multiple processes on the same machine
creates a requirement to share limited resources among all of
them. This opens an opportunity for an adversarial process to
leak sensitive information about the environment by looking
at the utilization of those resources. For example, instruction
caches have been used to leak the information about the
execution trace of co-existing programs on the same execution
core [1]–[3], [8], or even across different VMs on the same
machine [38], allowing the attacker to reconstruct sensitive
information, e.g., private keys of cryptographic protocols.
In contrast to instruction caches, where the attacker recovers
the execution trace of the program, data cache attacks can
reveal the patterns of data accesses of neighboring processes
and thus reveal the underlying operation. These types of attacks
have been shown to be successful in reconstructing different
cryptographic secrets, such as AES keys [34]. Similarly, a last-level
data cache side channel can be used to reconstruct private
keys [15], or leak sensitive information across VMs, e.g.,
keystroke timings to snoop user-typed passwords in SSH [31].
One of the techniques for leaking cache usage information
on x86 is the Flush&Reload attack [36]. In this attack, the attacker
issues the clflush instruction to flush the victim’s cache lines.
Measuring the access time to the same cache lines later reveals
whether they have been accessed. However, Flush&Reload assumes that the attacker
has access to the victim’s memory pages (e.g., via shared pages).
Prime&Probe [26], in contrast, does not require shared mem-
ory regions, and instead relies on cache collisions on shared
caches between the attacker and the victim.
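The logic of a Flush&Reload round can be sketched with a toy model. The following Python code simulates a shared cache rather than touching real hardware: the class `ToyCache`, the latency constants, the threshold, and the victim function are all assumptions invented for this illustration, and no real timing measurement or clflush instruction is involved.

```python
CACHE_HIT_CYCLES = 40     # illustrative latency, not a measurement
CACHE_MISS_CYCLES = 300   # illustrative latency, not a measurement
THRESHOLD = 100           # assumed cut-off between hit and miss


class ToyCache:
    """Minimal stand-in for a shared cache: tracks which lines are cached."""

    def __init__(self):
        self.lines = set()

    def flush(self, line):
        # Models the effect of clflush: evict the line if present.
        self.lines.discard(line)

    def access(self, line):
        # Returns a simulated latency and leaves the line cached.
        latency = CACHE_HIT_CYCLES if line in self.lines else CACHE_MISS_CYCLES
        self.lines.add(line)
        return latency


def flush_reload(cache, shared_lines, victim):
    """One Flush&Reload round: flush, let the victim run, then time reloads."""
    for line in shared_lines:
        cache.flush(line)
    victim(cache)
    return [line for line in shared_lines if cache.access(line) < THRESHOLD]


def make_victim(secret_bit):
    # Hypothetical victim that touches a different shared line per secret bit.
    return lambda cache: cache.access("line1" if secret_bit else "line0")


cache = ToyCache()
observed = flush_reload(cache, ["line0", "line1"], make_victim(1))
print(observed)
```

The lines reported as fast on reload are those the victim touched between the flush and the reload, which is how the attacker infers the victim's access pattern.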
VIII. CONCLUSION
Speculative execution has long been ignored by the security
community, although it bears critical threats that undermine
important assumptions of existing defensive schemes. We are
the first to provide a comprehensive overview of the security
implications. Such an overview is crucial for security
researchers to understand the guarantees of existing solutions.
Our adversarial setting may seem contrived compared to usual
threat models, as we assume that the attacker has influence on
the code. However, we argue that the proposed side channels
will highly influence existing defenses, such as conditionals
that constrain unsafe code to certain memory regions,
especially in the realm of JIT compilation like in browsers. The
recent rise of KASLR techniques demonstrates that abusing
speculative execution is clearly a novel attack class
that needs further consideration.
REFERENCES
[1] O. Acıiçmez, “Yet another microarchitectural attack: exploiting
i-cache,” in Proceedings of the 2007 ACM workshop on Computer
security architecture. ACM, 2007, pp. 11–18.
[2] O. Acıiçmez, B. B. Brumley, and P. Grabher, “New results on instruction
cache attacks,” in CHES, vol. 2010. Springer, 2010, pp. 110–124.
[3] O. Acıiçmez and W. Schindler, “A vulnerability in rsa implementations
due to instruction cache analysis and its demonstration on openssl,” in
CT-RSA, vol. 8. Springer, 2008, pp. 256–273.
[4] ARM, “ARM Cortex-A57 MPCore Processor Technical Reference
Manual.” [Online]. Available: http://infocenter.arm.com/help/index.jsp?
topic=/com.arm.doc.ddi0488c/BABHEHHA.html
[5] ——, “ARM Cortex-R7 MPCore, Technical Reference Manual.”
[Online]. Available: https://static.docs.arm.com/ddi0458/c/DDI0458.pdf
[6] A. Barresi, K. Razavi, M. Payer, and T. R. Gross, “CAIN: Silently
breaking ASLR in the cloud,” in 9th USENIX Workshop on Offensive
Technologies (WOOT 15). Washington, D.C.: USENIX Association,
2015. [Online]. Available: https://www.usenix.org/conference/woot15/
workshop-program/presentation/barresi
[7] E. Bosman, K. Razavi, H. Bos, and C. Giuffrida, “Dedup est machina:
Memory deduplication as an advanced exploitation vector,” in Security
and Privacy (SP), 2016 IEEE Symposium on. IEEE, 2016, pp. 987–
1004.
[8] C. Chen, T. Wang, Y. Kou, X. Chen, and X. Li, “Improvement of trace-
driven i-cache timing attack on the rsa algorithm,” Journal of Systems
and Software, vol. 86, no. 1, pp. 100–107, 2013.
[9] O. R. Chick, J. Snee, L. Carata, R. Sohan, A. Rice, and A. Hopper,
“Shadow kernels: A general mechanism for kernel specialization.”
[10] K. Cook, “Linux kernel aslr (kaslr),” Linux Security Summit, vol. 69,
2013.
[11] N. Dautenhahn, T. Kasampalis, W. Dietz, J. Criswell, and V. Adve,
“Nested kernel: An operating system architecture for intra-kernel priv-
ilege separation,” ACM SIGPLAN Notices, vol. 50, no. 4, pp. 191–206,
2015.
[12] D. Evtyushkin, D. Ponomarev, and N. Abu-Ghazaleh, “Jump over
aslr: Attacking branch predictors to bypass aslr,” in Microarchitecture
(MICRO), 2016 49th Annual IEEE/ACM International Symposium on.
IEEE, 2016, pp. 1–13.
[13] B. Gras, K. Razavi, E. Bosman, H. Bos, and C. Giuffrida, “Aslr on the
line: Practical cache attacks on the mmu,” NDSS (Feb. 2017), 2017.
[14] D. Gruss, C. Maurice, A. Fogh, M. Lipp, and S. Mangard, “Prefetch
side-channel attacks: Bypassing smap and kernel aslr,” in Proceedings of
the 2016 ACM SIGSAC Conference on Computer and Communications
Security. ACM, 2016, pp. 368–379.
[15] D. Gullasch, E. Bangerter, and S. Krenn, “Cache games–bringing
access-based cache attacks on aes to practice,” in Security and Privacy
(SP), 2011 IEEE Symposium on. IEEE, 2011, pp. 490–505.
[16] R. Hund, T. Holz, and F. C. Freiling, “Return-oriented rootkits: Bypass-
ing kernel code integrity protection mechanisms,” in USENIX Security
Symposium, 2009, pp. 383–398.
[17] R. Hund, C. Willems, and T. Holz, “Practical timing side channel attacks
against kernel space aslr,” in Security and Privacy (SP), 2013 IEEE
Symposium on. IEEE, 2013, pp. 191–205.
[18] Intel, “Intel 64 and IA-32 Architectures Optimization
Reference Manual.” [Online]. Available:
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/
64-ia-32-architectures-optimization-manual.pdf
[19] Y. Jang, S. Lee, and T. Kim, “Breaking kernel address space layout
randomization with intel tsx,” in Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security. ACM, 2016,
pp. 380–392.
[20] V. P. Kemerlis, M. Polychronakis, and A. D. Keromytis, “ret2dir:
Rethinking kernel isolation.” in USENIX Security Symposium, 2014,
pp. 957–972.
[21] C. Kil, J. Jun, C. Bookholt, J. Xu, and P. Ning, “Address Space
Layout Permutation (ASLP): Towards Fine-Grained Randomization of
Commodity Software,” in Proceedings of the 22nd Annual Computer
Security Applications Conference, ser. ACSAC ’06, Washington, DC,
2006. [Online]. Available: http://dx.doi.org/10.1109/ACSAC.2006.9
[22] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp,
S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, “Spectre attacks:
Exploiting speculative execution,” ArXiv e-prints, Jan. 2018.
[23] P. Koppe, B. Kollenda, M. Fyrbiak, C. Kison, R. Gawlik, C. Paar,
and T. Holz, “Reverse engineering x86 processor microcode,” in
26th USENIX Security Symposium (USENIX Security 17). Vancouver,
BC: USENIX Association, 2017, pp. 1163–1180. [Online]. Available:
https://www.usenix.org/conference/usenixsecurity17/technical-sessions/
presentation/koppe
[24] A. Kurmus and R. Zippel, “A tale of two kernels: Towards ending
kernel hardening wars with split kernel,” in Proceedings of the 2014
ACM SIGSAC Conference on Computer and Communications Security.
ACM, 2014, pp. 1366–1377.
[25] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard,
P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, “Meltdown,” ArXiv
e-prints, Jan. 2018.
[26] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-level cache
side-channel attacks are practical,” in Security and Privacy (SP), 2015
IEEE Symposium on. IEEE, 2015, pp. 605–622.
[27] S. Manne, A. Klauser, and D. Grunwald, “Pipeline gating: Speculation
control for energy reduction,” in ACM SIGARCH Computer Architecture
News, vol. 26, no. 3. IEEE Computer Society, 1998, pp. 132–141.
[28] P. Oester, “Dirty COW.” [Online]. Available: https://www.exploit-db.
com/exploits/40611/
[29] V. Pappas, M. Polychronakis, and A. D. Keromytis, “Smashing
the Gadgets: Hindering Return-Oriented Programming Using In-place
Code Randomization,” in Proceedings of the 2012 IEEE Symposium
on Security and Privacy, ser. SP ’12, Washington, DC, USA, 2012.
[Online]. Available: http://dx.doi.org/10.1109/SP.2012.41
[30] G. S. Research, “Rowhammer Privilege Escalation.” [Online].
Available: https://www.exploit-db.com/exploits/40611/
[31] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, “Hey, you, get
off of my cloud: exploring information leakage in third-party compute
clouds,” in Proceedings of the 16th ACM conference on Computer and
communications security. ACM, 2009, pp. 199–212.
[32] W. Song, H. Choi, J. Kim, E. Kim, Y. Kim, and J. Kim, “Pikit: A new
kernel-independent processor-interconnect rootkit.” in USENIX Security
Symposium, 2016, pp. 37–51.
[33] P. Team, “Address Space Layout Randomization (ASLR).” [Online].
Available: http://pax.grsecurity.net/docs/aslr.txt
[34] E. Tromer, D. A. Osvik, and A. Shamir, “Efficient cache attacks on aes,
and countermeasures,” Journal of Cryptology, vol. 23, no. 1, pp. 37–71,
2010.
[35] R. Wartell, V. Mohan, K. W. Hamlen, and Z. Lin, “Binary
Stirring: Self-randomizing Instruction Addresses of Legacy x86
Binary Code,” in Proceedings of the 2012 ACM Conference on
Computer and Communications Security, ser. CCS ’12. New
York, NY, USA: ACM, 2012, pp. 157–168. [Online]. Available:
http://doi.acm.org/10.1145/2382196.2382216
[36] Y. Yarom and K. Falkner, “Flush+ reload: A high resolution, low noise,
l3 cache side-channel attack.” in USENIX Security Symposium, 2014,
pp. 719–732.
[37] T.-Y. Yeh and Y. N. Patt, “Alternative implementations of two-level
adaptive branch prediction,” in ACM SIGARCH Computer Architecture
News, vol. 20, no. 2. ACM, 1992, pp. 124–134.
[38] Y. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Cross-vm side
channels and their use to extract private keys,” in Proceedings of
the 2012 ACM conference on Computer and communications security.
ACM, 2012, pp. 305–316.
IX. APPENDIX
[Figure 3: five panels, each comparing measurements for Mapped vs. Not Mapped pages: (a) Skylake, (b) Haswell, (c) Sandy Bridge, (d) Nehalem, (e) Prescott.]
Fig. 3: Results of the flushing side channel. If left (mapped)
and right (not mapped) deviate, feedback is reliable.
A. Flushing-Based Side Channel Analysis
This section details evaluation results of our flushing-based
side channel. In Section V, we measured and compared the
minimum number of cycles for cache accesses. We cannot
directly apply the same methodology to the execution time side
channel. First, we cannot wrap our timing measurements in two
serializing instructions such as cpuid, as this would allow
the sequence of imul instructions to complete—avoiding
speculation. Therefore, we removed the first call to cpuid,
and also included the entire speculative execution block in our
measurements (instead of access to a memory address).
Second, we found that the flushing execution time dif-
fers significantly between repeated executions, frequently not
showing any difference in the condition (e.g., page mapped
or not). This implies that the minimum execution time among
several (again, we used 1000) executions cannot be used to
leak a condition, but instead we have to inspect the value
distributions, averages or median values.
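The difference between comparing minima and comparing distributions can be illustrated with a small example. The following Python sketch uses synthetic cycle counts that were invented for this illustration and are not measured data. The helper `summarize` is likewise an assumption of this example.

```python
import statistics


def summarize(samples):
    """Return (minimum, median, mean) of a list of cycle counts."""
    return min(samples), statistics.median(samples), statistics.mean(samples)


# Synthetic cycle counts, invented for illustration only (not measured data).
# Both conditions share the same fastest run, but their distributions differ.
mapped = [200, 210, 220, 230, 240, 250, 260]
not_mapped = [200, 300, 310, 320, 330, 340, 350]

min_m, med_m, mean_m = summarize(mapped)
min_n, med_n, mean_n = summarize(not_mapped)

print(min_m == min_n)   # the minima coincide and hide the condition
print(med_n - med_m)    # the medians separate the two conditions
```

In this constructed case a minimum-based comparison cannot tell the two conditions apart, while the median (or mean) can, which mirrors the reasoning for inspecting distributions in the flushing-based channel.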
To allow for a fair comparison between the feedback chan-
nels, we provide the flushing-based feedback measurement
results for architectures in Figure 3. For two architectures,
Prescott and Skylake, there is no significant measurable timing
difference between the page being mapped or not. We believe
that Prescott fails for the same reason as for the other feedback
channels. But also Skylake, the most recent architecture in our
measurements, does not yield a measurable difference in our
concrete instantiation of instruction flushing. While improved
side channels of similar methodology (e.g., other instructions
than hlt) might still exist, this indicates that recent Intel CPUs
have more stable execution times regardless of speculation.
The other three architectures, however, show significant
differences that can reliably be confirmed when repeating the
experiment. To our surprise the results were not as coherent
as expected. Haswell (Figure 3b) and Nehalem (Figure 3d),
on average, showed faster executions for mapped pages. In
contrast, Sandy Bridge (Figure 3c) slows down executions
(median) accessing mapped pages. We cannot explain this
deviation in detail, but speculate that the inner workings of
the CPU’s pipeline engine and its different branch prediction
rollback algorithms lead to such characteristic behavior.
