SpectreRewind: Leaking Secrets to Past Instructions by Fustos, Jacob et al.
Jacob Fustos
University of Kansas
Michael Bechtel
University of Kansas
Heechul Yun
University of Kansas
arXiv:2003.12208v2 [cs.CR] 23 Jun 2020
ABSTRACT
Transient execution attacks utilize micro-architectural covert
channels to leak secrets that should not have been accessible during
logical program execution. Commonly used micro-architectural
covert channels are those that leave lasting footprints in the micro-
architectural state, for example, a cache state change, from which
the secret is recovered after the transient execution is completed.
In this paper, we present SpectreRewind, a new approach to
create contention based covert channels for transient execution
attacks. In our approach, a covert channel is established by issuing
the necessary instructions logically before the transiently executed
victim code. Unlike prior contention based covert channels, which
require simultaneous multi-threading (SMT), SpectreRewind supports
single hardware thread based covert channels, making it
viable on systems where the attacker cannot utilize SMT. We show
that contention on the floating point division unit on commodity
processors can be used to create a high-performance (∼100 KB/s),
low-noise covert channel for transient execution attacks instead of
commonly used flush+reload based cache covert channels.
We implement a Meltdown attack utilizing the proposed covert
channel showing competitive performance compared to the state-
of-the-art cache based covert channel implementation. We also
show that the covert channel works in the JavaScript engine of a
Chrome browser.
1 INTRODUCTION
Modern out-of-order microprocessors support speculative exe-
cution to improve performance. In speculative execution, instruc-
tions can be executed speculatively before knowing whether they
are in the correct program execution path. If the speculation was
wrong, the instructions that were executed incorrectly—known as
transient instructions [15]—are squashed and the processor then
simply retries to fetch and execute the correct instruction stream.
Unfortunately, it turned out that these transient instructions can
potentially bypass both software and hardware defenses to ac-
cess secrets. The disclosure of Spectre [15] and Meltdown [18]
and many other subsequently disclosed transient execution at-
tacks [1, 7, 11, 12, 14, 16, 19, 20, 24, 26, 30–33, 35] have shown
the danger of these transient instructions, as the secrets they had
access to could be encoded and transmitted into microarchitectural
covert channels, from which normal, non-speculative instructions
could then read, allowing the secrets to be visible to the attacker.
All known transient execution attacks share the same three basic
steps: (1) the attacker initiates speculative execution where the
secret is read improperly from memory or registers; (2) the secret
dependent transient instructions then encode and transmit the
secret to a micro-architectural covert channel; (3) finally, the secret
is recovered from the covert channel by normal (non-transient)
receiver instructions. Commonly used covert channels, such as
cache, are stateful as they leave lasting footprints in the micro-
architectural state, from which the secret is recovered after the
transient execution is completed. Many hardware defense proposals
aim to prevent such stateful covert channels either by hiding the
changes into additional hardware buffers [10, 13, 36] or by reverting
them [22] when the transient instructions are squashed. Such a
mitigation strategy is attractive from a performance standpoint, as
the transient instructions are allowed to execute normally, retaining
many of the performance benefits of speculative execution.
These types of defenses are effective at blocking transient execution attacks that utilize stateful covert channels. Unfortunately, these techniques cannot be used to block attacks that both transmit into and read from a covert channel before the transient instructions have been squashed. SmotherSpectre [4] is the first to demonstrate such an attack, utilizing a simultaneous covert channel in a simultaneous multi-threading (SMT) setup, where contention on issue ports within the processor is used as a covert channel to transmit a secret between the hardware threads in the context of a Spectre-based attack. Such contention cannot be buffered or reverted, as instructions have already waited to use the issue ports, affecting their execution time.
In this paper, we present a new class of contention-based covert
channels for transient execution attacks, which we call Spectr-
eRewind. Like SmotherSpectre, SpectreRewind allows the attacker
to both transmit and receive secret data before transient execu-
tion has completed, allowing the attacker to bypass most defense
mechanisms that attempt to revert or hide micro-architectural
changes caused by the attack. However, unlike SmotherSpectre,
SpectreRewind does not require the attacker to utilize SMT; instead,
the attack can be executed from a single hardware thread. While
traditional transient attacks locate the instructions that will read
from the covert channel logically after the instruction that triggers
the transient execution (e.g., a branch), SpectreRewind takes the
opposite approach and locates these instructions logically before
the triggering instruction. This structure allows the transmitting
and receiving instructions to execute concurrently on a modern out-
of-order core and communicate the secret even before the transient
execution completes.
We start by presenting our SpectreRewind strategy and exam-
ining the unique challenges associated with it (Section 4 and Sec-
tion 5). We then apply SpectreRewind to create a new covert chan-
nel, which utilizes contention on a floating point division unit in
commodity Intel and AMD processors (Section 6). Next, we analyze
how SpectreRewind can be integrated into different transient execu-
tion attacks, modifying the Meltdown POC programs to utilize our
covert channel, and implementing our covert channel within the
Google Chrome JavaScript sandbox (Section 7). Finally, we discuss
the security implications that this attack has on currently proposed
and implemented hardware and software defenses (Section 8). Our
evaluation results show that our technique allows the creation of a
simultaneous covert channel from a single threaded context that
has low noise, and is viable for use in transient execution attacks,
and can leak data at a rate up to 100 KB/s with an error rate of less
than 0.01%.
2 BACKGROUND
In this section, we provide necessary background on out-of-order
cores, transient execution attacks, and simultaneous multithreading
(SMT) hardware.
2.1 Out-of-order Processors
Figure 1: Simplified out-of-order processor design. The ReOrder Buffer holds and retires µops in logical program order, while µops are issued to the execution units out of order.
Modern high performance microprocessors utilize out-of-order
execution to execute multiple independent instructions in parallel—
taking advantage of instruction level parallelism—allowing for
higher throughput, while also reducing the penalty of a stall caused
by independent instructions.
Figure 1 shows a simplified example of an out-of-order processor.
In this example, instructions have first been translated into micro-
operations (µops). These µops are first placed into the ReOrder
Buffer (ROB) in logical program order. They are then passed to the
scheduler where they are then issued to a proper functional unit,
once their operands become available and the necessary resources
are available. In this example, the functional units are clustered
into two execution units. Each execution unit contains a single
issue port, which can only issue a single µop to one of the enclosed
functional units every clock cycle, but once issued, the functional
units run independent of each other. Once executed by a functional
unit, the scheduler is notified so that it can forward the results to
following dependent µops. The µop then waits in the ROB until it
reaches the head where it may be retired. It is only now that the
changes made by the µop become architecturally visible, giving the
illusion—from the architecture’s point of view—that the instructions
are executed in-order.
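The in-order retirement rule described above can be sketched in a few lines of C. This is a minimal model, not a description of any real core: the function name retire_cycles is hypothetical, and the model assumes unlimited retire bandwidth, so a µop retires in the first cycle in which it and every older µop have finished executing.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of in-order retirement from a ReOrder Buffer (ROB).
 * done[i]   : cycle in which uop i (in program order) finishes executing.
 * retire[i] : earliest cycle in which uop i can retire, i.e. the cycle in
 *             which it and all older uops have finished.
 * Assumption: retire bandwidth is unlimited (any number of uops per cycle).
 */
void retire_cycles(const uint32_t *done, uint32_t *retire, size_t n)
{
    uint32_t head = 0; /* latest completion cycle seen among older uops */
    for (size_t i = 0; i < n; i++) {
        if (done[i] > head)
            head = done[i];
        retire[i] = head;
    }
}
```

For completion cycles {5, 2, 3, 9, 4}, the second and third µops finish early but still wait for the first one, which is why out-of-order execution stays invisible at the architectural level.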
To further reduce branch related stalls, modern processors im-
plement complex branch predictors to predict what instructions
should be executed. As the results of execution are not made archi-
tecturally visible until the instructions retire, it is architecturally
safe to create a checkpoint of the processor’s state before the predic-
tion and then store the predicted instruction stream in the ROB and
execute the instructions—speculative execution—further improving
performance if the prediction was correct. If the prediction was false, these instructions are squashed, returning the processor state to the checkpoint, and the processor begins executing the correct instructions. Instructions that were executed but then squashed, and therefore never become architecturally visible, are known as transient instructions.
2.2 Transient Execution Attacks
As transient instructions were not supposed to have executed, it
follows that they can perform tasks—such as accessing secret data—
that should not have been accessible during proper program execu-
tion. While they do not retire—and do not become architecturally
visible—they still can contend for shared resources with instruc-
tions that will retire, creating a microarchitectural side-channel
that can leak the secrets.
if (x < array1_size) {
    secret = array1[x];
    y = array2[secret * 4096];
}
Figure 2: A Spectre gadget. Adapted from [15].
Figure 2 shows an example of a speculative execution attack, Spectre variant 1 [15]. In this example, the if statement on line 1 has been trained by the attacker such that the body (lines 2 and 3) is executed transiently even though x is out of bounds. This allows the attacker to choose the value of x such that array1[x] points to the secret. When line 2 is executed, the secret, which would have been inaccessible during normal program flow, is unintentionally loaded from memory. Line 3 then encodes the secret value into the cache state. Later, after the transient instructions have been squashed, the attacker can read the cache state using architecturally visible instructions and decode it to complete the side channel.
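The final decoding step of such a cache-based channel can be sketched as follows. This is only an illustration under stated assumptions: the function name decode_secret and the hit threshold are hypothetical, the secret is assumed to be a single byte, and the 256 reload latencies are assumed to have been measured already (one per 4096-byte stride of array2). The timing code itself is omitted.

```c
#include <stdint.h>

/*
 * Decode a one-byte secret from reload latencies of a flush+reload style
 * cache channel. latency[v] is the measured access time (in cycles) of
 * array2[v * 4096]. The entry touched by the transient load is cached and
 * therefore fast. Returns the fastest index below hit_threshold, or -1 if
 * no entry looks like a cache hit.
 */
int decode_secret(const uint64_t latency[256], uint64_t hit_threshold)
{
    int best = -1;
    for (int v = 0; v < 256; v++) {
        if (latency[v] < hit_threshold &&
            (best < 0 || latency[v] < latency[best]))
            best = v;
    }
    return best;
}
```

The decoder only works if the cached line survives until the attacker reads it, which is exactly the lasting footprint that stateful defenses try to hide or revert.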
Transient execution attacks can be broken down into two categories. Spectre-type attacks utilize control- and data-flow mis-speculation to force a victim to access secrets from its own address space and leak them into the covert channel, where they can be accessed by the attacker. Each Spectre variant (1 [15], 1.1 [14], 2 [15], 4 [11], and ret2spec [16, 19]) is distinguished by the microarchitectural component that is responsible for causing the mis-speculation, namely the Branch History Buffer (BHB), Branch Target Buffer (BTB), Memory Disambiguator, or Return Stack Buffer (RSB). Meltdown-type attacks take advantage of the fact that processor exceptions are deferred until the instruction that caused the exception is retired, that is, becomes architecturally visible. Instructions that occur logically after the exception should not execute with regard to logical program order, but they can be executed out of order, potentially bypassing the security that the exception was intended to provide. These attacks can be run from within the attacker's own address space while allowing access to secrets from other processes and privilege levels. Each Meltdown variant (1.2 [14], 3 [18], 3a [1, 12], Lazy FP [26], and L1TF [30, 35]) corresponds to the exception that caused the fault. Microarchitectural Data Sampling (MDS) attacks [20, 24, 32] are also considered Meltdown-type attacks. These attacks target speculative loads that have incorrectly loaded data from internal buffers (Store Buffer, Load Port, Line Fill Buffer) and leak the data into covert channels before the fault is realized. The incorrectly loaded data could have come from other SMT threads on the same processor executing at any privilege level.
2.3 Simultaneous Multithreading (SMT)
To improve hardware utilization, manufacturers often employ a
technique called Simultaneous Multithreading (SMT) [29], where
a single core is allowed to execute multiple hardware threads—
instruction streams—simultaneously. These hardware threads share
the underutilized structures, improving utilization, while appearing—
from the architectural point of view—to be independent process-
ing cores. Because the hardware threads in a SMT capable core
share various hardware structures—e.g. issue ports and functional
units—these structures can be used to create covert channels and
microarchitectural side channels between processes running on the
multiple threads within a single core.
3 THREAT MODEL
We assume that the attacker has the ability to control some code
that executes both logically before and after a transient execution
attack in program order. We assume that the attacker would like to
construct code that will transmit the secret over a covert channel
such that they may read the value of the secret at the architectural
level. We assume that stateful covert channels, such as caches,
are not available to the attacker because the platform either does
not provide necessary means to control cache state (e.g., clflush)
or implements hardware level defense mechanisms that prevent
stateful covert channels [10, 13, 22, 36].
4 SPECTREREWIND
SpectreRewind is an approach to utilize contention-based covert
channels for transient execution attacks. It allows the attacker to
both transmit into and receive from a covert channel before the
transient execution phase of the attack is completed.
Figure 3 illustrates the basic concept of SpectreRewind in com-
parison to the conventional stateful covert channel based approach.
In this figure, we break up involved µops into three distinct cate-
gories: the µops that come logically before, during, and after the
transient execution attack. In both approaches, we assume that we
send only a single bit over a covert channel at a time. For each
approach, we depict two timing diagrams: transmitting ‘0’ and ‘1’
over a covert channel.
In case of the traditional transient execution attack approach,
the attacker will use a covert channel that causes a lasting state
change in the micro-architecture, and read from the covert channel
from µops that occur logically after the transient execution. Data can be read from the channel by measuring the timing differences of these µops (t3-t4 for a value ‘0’ and t3-t5 for a value ‘1’). Hardware defenses (e.g., [13, 36]) that remove the secret from the covert channel after transient execution (t2) will be able to stop this attack by disrupting the transmission of the secret.
In case of the SpectreRewind approach, however, transient instructions will contend for resources with the µops that come logically before the transient instructions. Because the covert channel will be read from before transient execution completes (t2), the aforementioned hardware defense mechanisms, which attempt to remove the secret from the covert channel at that time, will be ineffective. In our approach, the attacker measures the entire execution time of the attack to detect the timing differences. Since the covert channel must be read from before transient execution completes, this adds the challenge of needing to fit the entire attack in the ROB at the same time.
Figure 3: Simplified timing diagram comparing the traditional Spectre attack framework to the SpectreRewind framework.
SpectreRewind assumes that younger transient µops can contend with older µops that began before the transient µops on certain micro-architectural resources. In the following, we will discuss the kinds of micro-architectural resources that are viable covert channels in SpectreRewind.
5 NOT FULLY PIPELINED FUNCTIONAL
UNITS
Since we aim to contend with instructions that are logically older than the transient instructions, our covert channel options are limited. We will not be able to cause port contention or contention on fully pipelined functional units as in [4]. However, we will show that it is still possible to cause contention on certain functional units that contain at least one non-pipelined stage.
Figure 4 shows visual examples of this problem. In Figure 4a, we see an example of an attacker µop trying to cause slowdown on a victim µop that is trying to use a shared integer multiplier. Unfortunately, because both the attacker and victim are ready to issue, the scheduler will choose the older victim, preventing any contention.
Figure 4b shows the situation where the attacker becomes ready the cycle before the victim. The attacker is issued into the multiplier, but still cannot create contention on the victim, as the victim is issued in the cycle in which it becomes ready, just as if the attacker was not there.
Finally, Figure 4c shows an attack on a shared functional unit that is not fully pipelined (stage 1 takes 3 clock cycles to complete). As the victim is not initially ready, the attacker is scheduled on the unit. As the unit is not pipelined, the victim cannot be issued on the unit until the attacker completes, which affects the execution time of the victim, making a covert channel possible. Thus, for our attack we will only focus on functional units that have at least one stage that is not fully pipelined. Note that it is well known that floating point division is difficult to pipeline because each step of the division depends on the previous step [21]. In the following, we will develop a floating point division unit based covert channel.
(a) Ready victim, pipelined functional unit
(b) Waiting victim, pipelined functional unit
(c) Waiting victim, not fully pipelined functional unit
Figure 4: Multiple attempts by attacker to delay the execution of the victim, causing measurable timing differences. If the attacker is younger than the victim, an age-ordered scheduler will prevent most contention.
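The three scenarios of Figure 4 can be reproduced with a toy scheduling model. This is a sketch under simplifying assumptions, not a model of any real scheduler: the names victim_issue_cycle and victim_delay are hypothetical, there is exactly one attacker µop and one older victim µop, and occupancy is the number of cycles the unit stays blocked after accepting a µop (1 for a fully pipelined unit, 3 for the unit in Figure 4c).

```c
#include <stdint.h>

/*
 * Cycle in which the older "victim" uop issues to a shared functional unit.
 * attacker_ready / victim_ready : cycle in which each uop's operands are ready.
 * occupancy : cycles the unit is blocked after accepting a uop
 *             (1 = fully pipelined, >1 = has a non-pipelined stage).
 */
uint32_t victim_issue_cycle(uint32_t attacker_ready, uint32_t victim_ready,
                            uint32_t occupancy)
{
    /* Age-ordered scheduling: on a tie, or if the victim is ready first,
     * the older victim is picked and the attacker cannot delay it. */
    if (victim_ready <= attacker_ready)
        return victim_ready;

    /* The attacker issued first; the unit accepts the next uop only after
     * it is free again. */
    uint32_t unit_free = attacker_ready + occupancy;
    return victim_ready > unit_free ? victim_ready : unit_free;
}

/* Extra cycles the victim waits because of the attacker. */
uint32_t victim_delay(uint32_t attacker_ready, uint32_t victim_ready,
                      uint32_t occupancy)
{
    return victim_issue_cycle(attacker_ready, victim_ready, occupancy)
           - victim_ready;
}
```

In this model, victim_delay(0, 0, 1) and victim_delay(0, 1, 1) are both 0 (Figures 4a and 4b), while victim_delay(0, 1, 3) is 2 (Figure 4c): only the not fully pipelined unit lets a younger attacker slow down an older victim.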
6 FLOATING POINT DIVISION COVERT
CHANNEL
In this section, we utilize our SpectreRewind approach to create a
covert channel on real commodity hardware that can transmit data
from transient execution without using stateful covert channels, or
SMT co-scheduled processes. We do this by causing contention on
the floating point division unit.
 1  double recv, div;
 2  double send1, send2, send3, send4;
 3  int message; // secret
 4
 5  start = rdtscp(); // start timer
 6
 7  // begin receiver (12 dependent FP divisions)
 8  recv /= div;
 9  recv /= div;
10  ...
11  recv /= div;
12  // end of receiver
13
14  if (recv == 1) { // begin speculative execution
15      m_bit = bit(message, k);
16      if (m_bit) { // secret dependent branch
17          // begin sender (independent FP divisions)
18          for (int x = 0; x < 100; x++) {
19              send1 /= div;
20              send2 /= div;
21              send3 /= div;
22              send4 /= div;
23          }
24          // end of sender
25      }
26  }
27
28  end = rdtscp(); // end timer
Figure 5: Pseudocode of our floating point division unit contention based covert channel.
Our covert channel utilizes contention on a functional unit,
namely the floating point division unit (see Figure 1), to transmit data
from transient instructions to non-transient instructions, which
will retire and become architecturally visible. The floating point
division unit was chosen as it is not fully pipelined (see Section 5)
in all Intel, AMD, and ARM microarchitectures we tested. Table 1
shows the tested microarchitectures and the latency and throughput characteristics of the DIVSD instruction, which are obtained from [2] (see footnote 1). Note that in all tested microarchitectures, the throughput of the DIVSD instruction is 4 or 8 cycles, meaning that while a DIVSD instruction is being executed, a pending DIVSD instruction
has to wait 4 or 8 cycles before entering the floating point division
unit. This delay makes the floating point division unit an ideal
candidate for us to create a covert channel.
Figure 5 shows the code used to form the ideal covert channel.
(1) A timer is started (Line 5); (2) A chain of dependent floating
point division instructions begins execution (Line 8). Because the
instructions are dependent, each instruction suffers the full round-
trip latency of the floating point division unit (see Table 1). This
chain of division instructions acts as a receiver; (3) The result of
the receiver instruction chain is compared in the if statement (Line
14). The if statement has been previously trained to be true, so
the body will execute speculatively while the result of the receiver
chain is being calculated; (4) A single bit of the (secret) message to
transmit is accessed (Line 15) and the inner if statement branches
depending on the value of the secret bit (Line 16); (5) The inner if statement is trained to be false. Thus, if the secret bit was ‘1’, the processor backtracks and begins to speculatively execute a set of independent floating point division instructions (Lines 18-23), which act as a sender. The “sender” instructions are independent of each other so that they can be issued concurrently and maximally contend with the “receiver” instructions on the floating point division unit of the processor; (6) When the “receiver” instructions are completed, the processor will realize the mis-speculation (the condition on recv in Line 14 was false) and squash the speculative instructions from the “sender”. We then stop the timer (Line 28) and measure the time difference.
Note that if the secret bit was ‘1’, the observed time difference
will be longer, due to the contention in the floating point division
unit with the mis-speculated “sender” instructions, compared to
the case when the secret bit was ‘0’ where there was no contention.
This secret-dependent timing difference creates a covert channel.
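Since the channel carries one bit per run, a multi-bit secret has to be sent bit by bit and reassembled on the receiving side. The following sketch shows one way to do this. The helper names classify and decode_byte, the bit ordering (bit 0 first), and the fixed threshold are assumptions made for illustration; bit() mirrors the bit(message, k) call in Figure 5, whose exact definition the pseudocode does not give.

```c
#include <stdint.h>

/* Extract bit k of the message (k = 0 is the least significant bit). */
int bit(int message, int k)
{
    return (message >> k) & 1;
}

/* Receiver side: a run that took longer than the threshold means the
 * sender contended on the division unit, i.e. a '1' was transmitted. */
int classify(uint64_t cycles, uint64_t threshold)
{
    return cycles > threshold;
}

/* Rebuild one byte from eight timing measurements, where timings[k] is
 * the measured duration (in cycles) of the run that transmitted bit k. */
unsigned char decode_byte(const uint64_t timings[8], uint64_t threshold)
{
    unsigned char value = 0;
    for (int k = 0; k < 8; k++)
        value |= (unsigned char)(classify(timings[k], threshold) << k);
    return value;
}
```

As a usage example, with a threshold of 328 cycles the timings {342, 314, 342, 314, 314, 342, 342, 314} decode to 0x65. In practice the threshold has to be calibrated per platform from measured ‘0’ and ‘1’ samples.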
6.1 Covert Channel Properties
We experimentally evaluate the characteristics of the covert channel on a number of commodity Intel, AMD, and ARM systems, as
listed in Table 1.
Each system runs Linux (Ubuntu 18.04 or 16.04). For x86 platforms from Intel and AMD, we use rdtscp instructions for cycle-accurate timing measurements. For ARM, we instead use an additional thread-based software counter due to architectural limitations. We repeatedly send 0 and 1 values over the covert channel, 1,000,000 times each, and measure the timing results. To minimize noise, we use Linux's performance governor and disable Turbo Boost (for x86 platforms) to improve the reliability of the measurements.
1 As defined in [2], latency refers to the clock cycles needed from the time the µop is issued to the time the result becomes available to dependent µops, while throughput refers to the clock cycles needed from the time the µop is issued until the time the functional unit becomes available again.
Figure 6 shows the results. The X-axis shows the number of
cycles taken to transmit, while the Y axis displays the probability
a measurement has to take that many cycles. Note first that on
all tested platforms, we are able to see clear timing differences
between ‘0’ and ‘1’ values. As explained in Section 5, not fully
pipelined floating point division units in these platforms allow the
mis-speculated division instructions to contend with the logically
prior “receiver” instructions, resulting in clearly measurable timing
differences.
Another interesting observation is that the two AMD processors and the ARM Cortex-A57 show discrete timing characteristics (a large proportion of the samples are concentrated on a few measured cycle counts), whereas Intel processors show more varied timing behaviors, especially the Skylake processors. These differences are likely due to the way the floating point division unit is implemented by each vendor.
We also evaluated floating point multiplication instructions but
were not able to observe any noticeable timing difference, suggest-
ing that the floating point multiplication units in these platforms are
well pipelined, and thus cannot be used to create covert channels.
6.2 Performance Analysis
Next, we analyze the performance of the covert channel in terms
of transfer rate and error rate. The measured transfer rates of our
tested platforms are calculated by simply dividing the total bits sent
(1 million bits of 0 and 1 million bits of 1) with the time it took to
send them. The error rate of each system is calculated as follows.
We first sort the one million timing samples of 0 and of 1. We then find the 99th percentile value of the ‘0’ samples and the 1st percentile value of the ‘1’ samples. If the former (99th percentile of ‘0’ samples) is smaller than the latter (1st percentile of ‘1’ samples), we pick the average of the two values as the threshold to determine 0 or 1. If the 99th percentile of 0 is not smaller than the 1st percentile of 1, we set the average of the
median values of 0 and 1 samples as the threshold value. We then
apply the threshold against the collected samples to determine if it
correctly classifies the sample against its known correct value.
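The threshold selection and error rate computation described above can be sketched as follows. The function names are hypothetical, and the percentile index convention (element n*p/100 of the sorted array) is an assumption, since the text does not specify one. Both sample arrays must be non-empty.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Value at percentile p (0..100) of a sorted, non-empty array.
 * Assumption: simple index convention n*p/100, clamped to the last element. */
static uint64_t percentile(const uint64_t *sorted, size_t n, unsigned p)
{
    size_t idx = (n * p) / 100;
    if (idx >= n)
        idx = n - 1;
    return sorted[idx];
}

/* Sorts both sample sets in place and returns the decision threshold. */
uint64_t pick_threshold(uint64_t *zeros, size_t n0, uint64_t *ones, size_t n1)
{
    qsort(zeros, n0, sizeof *zeros, cmp_u64);
    qsort(ones, n1, sizeof *ones, cmp_u64);

    uint64_t p99_zero = percentile(zeros, n0, 99);
    uint64_t p1_one = percentile(ones, n1, 1);

    if (p99_zero < p1_one)              /* distributions are well separated */
        return (p99_zero + p1_one) / 2;

    /* Overlapping distributions: fall back to the midpoint of the medians. */
    return (percentile(zeros, n0, 50) + percentile(ones, n1, 50)) / 2;
}

/* Percentage of samples that the threshold classifies incorrectly. */
double error_rate(const uint64_t *zeros, size_t n0,
                  const uint64_t *ones, size_t n1, uint64_t threshold)
{
    size_t wrong = 0;
    for (size_t i = 0; i < n0; i++)
        if (zeros[i] > threshold)       /* a '0' read as '1' */
            wrong++;
    for (size_t i = 0; i < n1; i++)
        if (ones[i] <= threshold)       /* a '1' read as '0' */
            wrong++;
    return 100.0 * (double)wrong / (double)(n0 + n1);
}
```

The median fallback keeps the classifier usable when the two distributions overlap, at the cost of a non-zero error rate.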
The results are shown in Table 1 (see the ‘Transfer Rate’ and
‘Error Rate’ columns). First, notice that the proposed covert channel
supports very high transfer rates on all tested platforms, ranging
from 63 to 105 KB/s. Furthermore, the error rates are also very low,
especially on Intel processors, where we observe error rates below 0.5%. AMD processors show higher error rates, of up to 5.5% on the low-end Ryzen 3 APU.
6.3 Sensitivity Analysis
An interesting aspect of our covert channel is that the size (dura-
tion) of the speculation window can be controlled by adjusting the
number of dependent division instructions used in the “receiver”
part of the covert channel, i.e., Lines 8-11 in Figure 5. This is because speculatively executed sender instructions are squashed after the receiver instruction chain is completed. As such, the longer the receiver instruction chain is, the longer the sender instructions can contend on the floating point division unit. To understand the effect of the length of the receiver on the effectiveness of the covert channel, we measure the characteristics of the covert channel as a function of the number of divisions in the receiver chain.

[Figure 6 panels, each plotting probability (Y axis) against number of clock cycles (X axis): (a) Kabylake R (i5-8250U), (b) Skylake (i5-6500), (c) Skylake (i5-6200U), (d) Haswell (E5-2658v3), (e) Ivybridge (i5-3340M), (f) Zen (Ryzen3 2200G), (g) Zen+ (Ryzen5 2600), (h) Cortex-A57 (Jetson Nano)*]
Figure 6: Floating point division unit based covert channel (Figure 5) timing characteristics; 1,000,000 timing measurement samples of transmitting 0 (purple) and 1 (green). (*) For Cortex-A57, we use an additional thread based software counter for time measurement due to the lack of a high-precision clock source (such as rdtsc in x86) available at the user level.

CPU                    Microarch.  Latency   Throughput  Transfer Rate  Error Rate
                                   (cycles)  (cycles)    (KB/s)         (%)
Intel Core i5-8250U    Kabylake R  13–15     4           53.1           0.02
Intel Core i5-6500     Skylake     13–15     4           105.3          <0.01
Intel Core i5-6200U    Skylake     13–15     4           74.9           0.04
Intel Xeon E5-2658 v3  Haswell     10–20     8           64.1           <0.01
Intel Core i5-3340M    Ivybridge   10–20     8           75.6           0.16
AMD Ryzen 3 2200G      Zen         8–13      4           83.1           5.50
AMD Ryzen 5 2600       Zen+        8–13      4           84.8           3.30
NVIDIA Jetson Nano     Cortex A57  N/A       N/A         87.7           0.02
Table 1: Evaluation platforms; DIVSD (SSE floating point division) instruction latency and throughput [2] and performance (transfer and error rates) of the proposed floating point division unit based covert channel.
#divs  ‘0’       ‘1’       Diff.     Transfer  Error
       (cycles)  (cycles)  (cycles)  (KB/s)    (%)
3      190       192       2         102.9     45.77
6      232       236       4         92.5      11.95
9      274       290       16        82.4      0.70
12     314       342       28        74.9      0.04
15     354       392       38        69.6      0.18
24     394       436       42        55.7      0.09
48     800       928       128       35.7      0.43
72     1128      1252      137       27.3      0.12
Table 2: Sensitivity to the number of divisions (DIVSD) used in the “receiver” part of the covert channel on Intel Core i5-6200U.
Table 2 shows the results. The first column shows the number
of division instructions in the receiver chain. The second and third
columns show the median cycles observed when sending ‘0’ and ‘1’
values over the covert channel, respectively. The fourth column is
the cycle difference between 0 and 1 samples. Finally, the fifth and
sixth columns show the transfer and error rates of the channel.
Note first that the transfer rate decreases as the number of divisions in the receiver increases, which is expected: the more divisions are used, the more time is needed to execute them before squashing the speculation. As such, from the transfer rate perspective, using a smaller number of divisions in the receiver may be desirable. However, when the number of divisions is too small, as in the case of 3 divisions, the covert channel becomes ineffective as the error rate is too high (>45%). This is because the speculation window is not long enough for the sender instructions to effectively contend with the receiver instructions on the floating point division unit.
The error rate dramatically decreases as we increase the number of divisions in the receiver. At 9 or more divisions, the covert channel shows a very low error rate while showing gradually decreasing
transfer rates. For this platform, we can see using 12 divisions
in the receiver chain is a “sweet spot” in the sense that it offers
high enough performance and low noise. While different platforms
may have different sweet spots, we nevertheless used the same 12
divisions in all platforms, unless noted otherwise, as it performed
reasonably well in all of them.
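The trade-off can be made concrete with a small selection routine over the rows of Table 2. This is a sketch of one possible selection rule, not the procedure used in the paper: the names sample, table2, and pick_divs are hypothetical, and the rule (highest transfer rate among configurations whose error rate does not exceed a chosen bound) is an assumption.

```c
#include <stddef.h>

struct sample {
    int divs;            /* number of divisions in the receiver chain */
    double transfer_kbs; /* transfer rate in KB/s */
    double error_pct;    /* error rate in percent */
};

/* Rows copied from Table 2 (Intel Core i5-6200U). */
const struct sample table2[] = {
    {3, 102.9, 45.77}, {6, 92.5, 11.95}, {9, 82.4, 0.70}, {12, 74.9, 0.04},
    {15, 69.6, 0.18},  {24, 55.7, 0.09}, {48, 35.7, 0.43}, {72, 27.3, 0.12},
};
const size_t table2_len = sizeof table2 / sizeof table2[0];

/* Returns the receiver length with the highest transfer rate among rows
 * whose error rate is at most max_error_pct, or -1 if no row qualifies. */
int pick_divs(const struct sample *rows, size_t n, double max_error_pct)
{
    int best = -1;
    double best_rate = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (rows[i].error_pct <= max_error_pct &&
            rows[i].transfer_kbs > best_rate) {
            best = rows[i].divs;
            best_rate = rows[i].transfer_kbs;
        }
    }
    return best;
}
```

With an error bound of 0.1%, this rule selects 12 divisions on this platform, while relaxing the bound to 1% selects 9 divisions.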
7 TRANSIENT EXECUTION ATTACKS
In this section, we present transient execution attacks using the
proposed SpectreRewind DIVSD covert channel.
7.1 Meltdown Attack
Meltdown attacks [18] allow transient instructions to access se-
crets belonging to other processes and security domains, including
the OS and virtual machines. In this section we describe our modi-
fications to such attacks to utilize SpectreRewind covert channel.
In the Meltdown attack, the attacker attempts to read from a
memory address, such as a kernel virtual address. While architec-
turally such an access will generate an exception, speculatively
the access can forward secret data to a dependent load instruction,
which encodes the secret into a cache state change, before it can be
squashed.
In our modification, we simply surround the exception generat-
ing memory access with DIVSD sender and receiver instructions as
shown in Figure 5. In more detail, we base our implementation on
the original Meltdown open-source repository (see footnote 2). We modified a single
function libkdump_read(addr) in libkdump.c, which reads
a single byte from the given address (addr), to utilize our Spectr-
eRewind DIVSD covert channel. The rest of the code and other
settings are unchanged.
Note that because a Meltdown attack generates exceptions, it is
necessary to suppress such exceptions. In the original PoC, either
Intel’s Transactional Synchronization Extension (TSX) or signal
handling was used to suppress exceptions. In our approach, how-
ever, an exception generating secret memory access can only occur
in a mis-speculated transient execution, which will be squashed
when the receiver code has completed. Thus, we effectively suppress
the exception without needing to use the TSX or signal handling
methods used in the original Meltdown PoC.
Table 3 compares the performance of our modified Meltdown
attack with the original ones, which utilize flush+reload based
covert channels. Of the two original versions we evaluated, Original
(SigHandle) suppresses the exception by installing a signal handler,
while Original (TSX) does so by utilizing TSX. We use the reliability
PoC in the official Meltdown repository, which continuously
reads a single byte from a kernel memory address and reports the
number of reads and the success (i.e., correct reading) rate. In each
configuration, we ran the reliability PoC for 60 seconds and measured
the performance on the Intel Core i5-6500 (Skylake) processor,
which supports TSX. As can be seen in the table, our SpectreRewind
version of Meltdown performs significantly better than the signal
handler version of the original Meltdown in terms of both the number
of reads and the success rate, while it performs similarly to the TSX
version of the original attack.
2 https://github.com/IAIK/meltdown
Method                  # Reads   Success (%)
Original (SigHandle)       2197         97.78
Original (TSX)           217691        100.00
SpectreRewind            193399         99.97
Table 3: Performance of SpectreRewind covert channel
(DIVSD) based Meltdown attack (Demo #3: Reliability test of
the original Meltdown PoC repository) on Intel i5-6500.
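The read counts in Table 3 can be converted into rates, since each configuration ran for 60 seconds and each read returns a single byte. The following short Python sketch only restates that arithmetic. The helper names are hypothetical and the numbers are copied from Table 3.

```python
DURATION_S = 60  # each configuration ran for 60 seconds

# (reads completed, success rate in percent), as reported in Table 3
TABLE3 = {
    "Original (SigHandle)": (2197, 97.78),
    "Original (TSX)": (217691, 100.00),
    "SpectreRewind": (193399, 99.97),
}


def reads_per_second(method: str) -> float:
    """Single-byte reads completed per second for the given method."""
    reads, _ = TABLE3[method]
    return reads / DURATION_S


def correct_reads_per_second(method: str) -> float:
    """Reads per second, scaled by the reported success rate."""
    reads, success_pct = TABLE3[method]
    return reads * (success_pct / 100.0) / DURATION_S


def relative_reads(method: str, baseline: str) -> float:
    """Ratio of the read counts of two methods over the same 60 s window."""
    return TABLE3[method][0] / TABLE3[baseline][0]
```

Under this arithmetic, the SpectreRewind version completes roughly 88 times as many reads as the signal handler version and about 89% as many as the TSX version, which is consistent with the comparison in the text.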
7.2 Sandbox (JavaScript) Attack
[Figure 7 plot: probability (y-axis, 0 to 0.2) versus number of clock cycles (x-axis, 60 to 110)]
Figure 7: Timing characteristics of division floating point
unit covert channel execution in Google Chrome JavaScript
sandbox
One of the drawbacks of SpectreRewind is the number of µops
that must reside in the ReOrder Buffer simultaneously. While
we have shown that this is not a problem when executing attacks
in a native environment, sandbox environments—e.g. JavaScript—
pose additional challenges. In this section, we show that it is possi-
ble to mount SpectreRewind attacks from sandbox environments
by porting our floating point DIVSD based covert channel code to
JavaScript, and successfully executing it on Google Chrome ver-
sion 62.0.3202.75. For a high resolution timer, we take the same
approach as past research [25] and utilize Web Workers along with
SharedArrayBuffer. This allows for the creation of a separate
thread that continuously increments a value in shared memory
that the original thread can use to time code execution. Overall, we
found that minimal changes were needed to port our code.
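The counting-thread timer idea can be illustrated outside the browser. The following is a minimal Python analogue, not the JavaScript used in the paper: a background thread stands in for the Web Worker, and a plain attribute stands in for the value stored in the SharedArrayBuffer. The class `CountingTimer` and its methods are hypothetical names, and the tick counts it reports are only meaningful relative to each other.

```python
import threading
import time


class CountingTimer:
    """A background thread increments a counter that the measuring thread
    reads before and after the code under test."""

    def __init__(self) -> None:
        self.ticks = 0          # shared value, written only by the counter thread
        self._running = False
        self._thread = None

    def _spin(self) -> None:
        # Busy loop: increment the shared counter until asked to stop.
        while self._running:
            self.ticks += 1

    def start(self) -> None:
        self._running = True
        self._thread = threading.Thread(target=self._spin, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._running = False
        self._thread.join()

    def measure(self, fn) -> int:
        """Return how many ticks elapsed while fn() ran."""
        before = self.ticks
        fn()
        return self.ticks - before
```

A caller would start the timer, wrap the code to be timed in `measure`, and compare the resulting tick counts between runs. As with the browser technique, the resolution depends on how fast the counting thread can increment the shared value.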
Figure 8 shows a snippet of the final code along with the gener-
ated assembly produced by the JavaScript JIT compiler side-by-side
with the original attack code assembly. While the µop footprint is
increased from the native compiled version, we interestingly find
that the majority of these extra instructions happen in the section
of code that is responsible for accessing the message and branching
on bit values. We see that the division operations are compiled
down neatly into only a couple of floating point instructions, and
we note that the extra vmovapd instructions added in the JIT com-
piler version do not take up any µops and thus the two areas of the
code provide equivalent pressure on the ROB and scheduler. We
also find that the attack still fits within the scheduler and ReOrder
Buffer of a Kaby Lake R micro-architecture. While the resolution
of the SharedArrayBuffer based timer is poor compared to native
timers, the difference between the timings of the two bit values is
sufficient for data transmission. We have, however, increased the
number of receiver code divisions from 12 to 24 to improve the signal
over the lower resolution timer. We show the probability distribution
of the transmission in Figure 7.
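Recovering bits from a timing distribution such as the one in Figure 7 amounts to separating two latency clusters. The following is a minimal Python sketch of one possible decoder, assuming that contention from the sender corresponds to bit 1 and therefore to the slower cluster, and using the midpoint between calibration means as the threshold. Both the mapping and the midpoint rule are assumptions, the function names are hypothetical, and the sample latencies are synthetic placeholders rather than measured data.

```python
from statistics import mean


def calibrate_threshold(zero_samples, one_samples):
    """Midpoint between the mean receiver latencies for bit 0 and bit 1."""
    return (mean(zero_samples) + mean(one_samples)) / 2.0


def decode_bit(latency, threshold):
    """Assumed mapping: contention slows the receiver, so a latency above
    the threshold is read as 1 and anything else as 0."""
    return 1 if latency > threshold else 0


def decode_bits(latencies, threshold):
    """Decode a sequence of receiver latencies into bits."""
    return [decode_bit(t, threshold) for t in latencies]


# Synthetic calibration samples in timer ticks; NOT data from the paper.
ZERO_CAL = [70, 72, 71]
ONE_CAL = [98, 100, 102]
THRESHOLD = calibrate_threshold(ZERO_CAL, ONE_CAL)
```

A wider gap between the two clusters leaves more margin around the threshold, which is the effect that increasing the number of receiver divisions aims for when the timer is coarse.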
8 DISCUSSION
In this section, we discuss the benefits and shortcomings of
SpectreRewind, and its mitigation options.
8.1 Comparison to Cache Covert Channels
To date, most transient execution attacks leverage cache based
covert channels, especially Flush+Reload [37], due to their high
performance and low noise. In this paper, we find that our floating
point division unit based covert channel is available in a wide range
of micro-architectures while providing similarly high performance
and low noise characteristics. As such, we believe that our floating
point division unit covert channel can be used as an alternative
to cache based ones for transient execution attacks. Our covert
channel may be preferable to Flush+Reload in environments where
instructions to flush cache lines (e.g. CLFLUSH in x86) are unavail-
able. In these environments, eviction sets must be created, but these
channels can be noisy, especially in processors that use pseudo
random cache replacement schemes [17].
We also note that many researchers have recently proposed
solutions to defend against cache based covert channels. For example,
InvisiSpec [36] and SafeSpec [13] are both recently proposed hardware
solutions that defer updating the microarchitectural states of caches
(and TLBs) until such changes are considered to be safe. Gonzalez
et al. [10] implemented such a defense on an out-of-order
open source RISC-V processor core. CleanupSpec [22] lets
the microarchitectural changes from transient instructions occur
but later undoes those changes after recognizing mis-speculation.
SpectreRewind can bypass these defense mechanisms because the
complete transmission of the secret over the covert channel is
accomplished, in the form of increased execution time of the receiver,
before the transient instructions are completed, which makes the
aforementioned defense mechanisms ineffective.
One downside of our approach is that it requires that sender
and receiver instructions be present simultaneously on the same
hardware thread, which restricts its use in cross-process attack
scenarios (e.g., cross-process branch target injection attack [15]). In
addition, the sender and receiver instructions must be in the same
protection domain: either both in the kernel or both in user space.
Therefore, initiating the receiver instructions at the user level while
executing the sender instructions in the kernel (e.g., in a system call)
may not be feasible.
Figure 8: Excerpt from JavaScript covert channel code (Left), the assembly the JIT compiler created (Center), and the native
generated assembly (Right)
8.2 Mitigation Strategies
As SpectreRewind requires out-of-order contention on functional
units that are not fully pipelined, one mitigation strategy
is to redesign those functional units to be fully pipelined. But such a
redesign may not always be possible. Another alternative is to adopt
a strict in-order scheduling policy such that younger instructions
(sender) can never be issued before all older instructions have been
issued, though it would incur a high performance cost.
SpectreRewind also requires secret data be forwarded transiently
to the dependent instructions that cause the contention. Therefore,
it can be mitigated through solutions that block or delay such
forwarding. SpectreGuard [9] is an example of such an approach,
where secret data is marked as secret in the page tables and then is
disallowed from being forwarded to dependent instructions until it
reaches a point where it can be logically considered safe to forward.
ConTExT [23] uses a similar approach, still marking data as secret
in page tables and delaying forwarding of the secret value, but
unlike SpectreGuard, not considering operations as safe until they
reach the head of the ROB. Intel and NVIDIA also proposed new
memory type based solutions [5, 27]. STT [38] improved upon these
approaches, by considering if the instructions being forwarded
secret data could transiently leak data into a covert channel, and
if it could not, allowing that instruction to execute speculatively.
All these techniques that prevent secret dependent speculative
execution may mitigate SpectreRewind covert channels.
9 RELATED WORK
Cache based covert channels generally utilize the property that
timing differences occur when accessing different levels of the cache,
and these timing differences can be large (e.g., 1s of nanoseconds to
access the L1 cache vs. 100s of nanoseconds to access main memory
on common high performance systems). Changes to the cache can
also be long lasting (remaining across context switches), as cache
state changes generally remain until they are replaced by other memory
accesses. Prime+Probe [28] takes advantage of the fact that caches
are generally partitioned into sets that can hold a limited number
of cache lines. Once the attacker finds a group of addresses that
perfectly fills a set, measuring the time to access the entire group allows
the attacker to monitor other memory accesses across the system
that utilize the same set, which can be used as a side-channel
to spy on a victim. Flush+Reload [37] is a cache based technique
that uses special hardware instructions provided by architectures
to flush cache lines from the caches. If the attacker shares memory
with the victim, they need only flush a shared memory location, let
the victim execute, and then time a reload of the memory location
to determine if it was accessed by the victim. Both techniques are
commonly used in transient execution attacks to monitor cache
activity performed by the transient instructions.
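The three Flush+Reload steps (flush, let the victim run, time a reload) can be walked through with a toy model. The following Python sketch is a simulation only, not an attack implementation: `ToyCache` is a hypothetical class that tracks which addresses are cached, with no sets, ways, or replacement policy, and the two latency constants are assumed values that merely follow the orders of magnitude mentioned in the text.

```python
class ToyCache:
    """Tracks only which line addresses are currently cached."""

    HIT_NS = 1      # assumed: order of 1s of nanoseconds for a cache hit
    MISS_NS = 100   # assumed: order of 100s of nanoseconds for main memory

    def __init__(self) -> None:
        self._lines = set()

    def flush(self, addr: int) -> None:
        """Remove the line from the cache (models a flush instruction)."""
        self._lines.discard(addr)

    def access(self, addr: int) -> int:
        """Return the modeled latency, then leave the line cached."""
        latency = self.HIT_NS if addr in self._lines else self.MISS_NS
        self._lines.add(addr)
        return latency


def flush_reload(cache: ToyCache, shared_addr: int, victim) -> bool:
    """Return True if the victim touched shared_addr between flush and reload."""
    cache.flush(shared_addr)              # 1. flush the shared line
    victim(cache)                         # 2. let the victim execute
    latency = cache.access(shared_addr)   # 3. time the reload
    return latency < ToyCache.MISS_NS     # a fast reload means the line was cached
```

Calling `flush_reload` with a victim callback that accesses the shared address reports True, and with a victim that accesses a different address it reports False. Real caches add noise from prefetchers, replacement, and other activity that this model ignores.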
Systems that implement simultaneous multi-threading (SMT)
are particularly open to the creation of covert channels, as the
multiple threads that run on a single core can potentially share and
thus compete for the resources on the core. Wang and Lee explored
functional unit sharing in SMT processors to create a covert
channel [34]. In their work, they created a covert channel on a Pentium
4 processor by utilizing contention on the shared integer
multiplication unit. Concurrently, Acıiçmez and Seifert utilized contention
on the same Intel processor, again using the shared integer
multiplication unit, to create a microarchitectural side channel [3].
Utilizing this channel, the attacker could spy on another process
running a square and multiply cryptographic function
concurrently on a separate hardware thread on the same
core. Our work differs from these papers, as we explore the unique
challenges of both creating similar functional unit covert channels
from a non-SMT context (a single hardware thread) and utilizing
such contention in Spectre attacks.
In 2016, Fogh introduced a technique called Covert Shotgun [8]
in which two processes running on threads in the same SMT core
run through an iterative set of instruction groupings and time the
results to determine if those instructions can cause measurable
contention on the shared resources. Recently, two works have con-
currently implemented such an approach to test the viability of port
contention as a covert channel between two such processes. This
approach utilizes the fact that ports may only issue one instruction
per cycle to their underlying functional units, thus if both processes
attempt to issue instructions that require functional units on the
same port, one of the processes will need to stall that clock cycle—
causing a measurable timing difference. PortSmash [6] utilized port
contention to create a microarchitectural side-channel to leak the
secret key from a vulnerable version of OpenSSL. SmotherSpec-
tre [4] utilized port contention as the covert channel in a Branch
Target Injection (BTI [15]) Spectre attack. Using BTI allowed this
attack to run attacker code to transiently access a secret in the
victim and then to execute specific instructions, dependent on the secret
value, that could be easily detected by the attacker’s process. Our
work differs from all three approaches as we focus on non-SMT
contention channels which face unique challenges, and from the
latter two as we focus on functional unit contention, and not port
contention.
10 CONCLUSION AND FUTURE WORK
In this paper, we showed that it is possible to create a covert
channel utilizing concurrent contention on functional units from
a single hardware thread. We introduced a new covert channel,
which utilizes contention on the floating point division unit in
commodity Intel, AMD, and ARM processors. Our covert channel
achieved high performance and low noise comparable to that of the
widely used Flush+Reload cache covert channel. We then showed
how the covert channel can be used in transient execution
attacks. We implemented a Meltdown attack with our covert
channel. We also showed that our covert channel can be used in the
JavaScript sandbox of a Chrome browser. As future work, we plan
to investigate if other microarchitectural structures can be used to
create concurrent contention based covert channels.
REFERENCES
[1] 2018. Cache Speculation Side-channels. ARM White paper (2018).
[2] Andreas Abel and Jan Reineke. 2019. uops.info: Characterizing Latency, Through-
put, and Port Usage of Instructions on Intel Microarchitectures. In Proceedings of
the Twenty-Fourth International Conference on Architectural Support for Program-
ming Languages and Operating Systems (ASPLOS ’19). ACM, New York, NY, USA,
673–686. https://doi.org/10.1145/3297858.3304062
[3] O. Aciicmez and J. Seifert. 2007. Cheap Hardware Parallelism Implies Cheap
Security. In Workshop on Fault Diagnosis and Tolerance in Cryptography (FDTC
2007). 80–91. https://doi.org/10.1109/FDTC.2007.16
[4] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessan-
dro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. 2019. SMoTher-
Spectre: exploiting speculative execution through port contention. arXiv preprint
arXiv:1903.01843 (2019).
[5] Darrell D Boggs, Ross Segelken, Mike Cornaby, Nick Fortino, Shailender
Chaudhry, Denis Khartikov, Alok Mooley, Nathan Tuck, and Gordon Vreug-
denhil. 2019. Memory type which is cacheable yet inaccessible by speculative
instructions. (Jan. 3 2019). US Patent App. 16/022,274.
[6] Alejandro Cabrera Aldaya, Billy Bob Brumley, Sohaib ul Hassan, Cesar
Pereida García, and Nicola Tuveri. 2019. Port Contention for Fun and Profit. In
2019 IEEE Symposium on Security and Privacy (SP). https://doi.org/10.1109/SP.
2019.00066
[7] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von
Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. 2018.
A Systematic Evaluation of Transient Execution Attacks and Defenses. CoRR
abs/1811.05441 (2018). arXiv:1811.05441 http://arxiv.org/abs/1811.05441
[8] Anders Fogh. 2016. https://cyber.wtf/2016/09/27/covertshotgun/. (2016).
[9] Jacob Fustos, Farzad Farshchi, and Heechul Yun. 2019. SpectreGuard: An Efficient
Data-centric Defense Mechanism against Spectre Attacks. In DAC. 61–1.
[10] Abraham Gonzalez, Ben Korpan, Jerry Zhao, Ed Younis, and Krste Asanović.
2019. Replicating and Mitigating Spectre Attacks on an Open Source RISC-V
Microarchitecture. In Third Workshop on Computer Architecture Research with
RISC-V (CARRV).
[11] Jann Horn. 2018. speculative execution, variant 4: speculative store bypass.
https://bugs.chromium.org/p/project-zero/issues/detail?id=1528. (2018).
[12] Intel. 2018. Intel Analysis of Speculative Execution Side Channels (Rev. 4.0).
Technical Report. https://software.intel.com/sites/default/files/managed/b9/f9/
336983-Intel-Analysis-of-Speculative-Execution-Side-Channels-White-Paper.
pdf
[13] Khaled N. Khasawneh, Esmaeil Mohammadian Koruyeh, Chengyu Song, Dmitry
Evtyushkin, Dmitry Ponomarev, and Nael Abu-Ghazaleh. 2019. SafeSpec: Ban-
ishing the Spectre of a Meltdown with Leakage-Free Speculation. In 56th Annual
Design Automation Conference (ACM DAC).
[14] Vladimir Kiriansky and Carl Waldspurger. 2018. Speculative buffer overflows:
Attacks and defenses. arXiv preprint arXiv:1807.03757 (2018).
[15] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M.
Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom. 2019. Spectre Attacks:
Exploiting Speculative Execution. In 2019 IEEE Symposium on Security and Privacy
(SP). IEEE Computer Society, Los Alamitos, CA, USA. https://doi.org/10.1109/SP.
2019.00002
[16] Esmaeil Mohammadian Koruyeh, Khaled N Khasawneh, Chengyu Song, and
Nael Abu-Ghazaleh. 2018. Spectre returns! speculation attacks using the return
stack buffer. In WOOT.
[17] Moritz Lipp, Daniel Gruss, Raphael Spreitzer, Clémentine Maurice, and Stefan
Mangard. 2016. ARMageddon: Cache Attacks on Mobile Devices. In 25th USENIX
Security Symposium (USENIX Security 16). USENIX Association, Austin, TX, 549–
564. https://www.usenix.org/conference/usenixsecurity16/technical-sessions/
presentation/lipp
[18] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas,
Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval
Yarom, and Mike Hamburg. 2018. Meltdown: Reading Kernel Memory from User
Space. In USENIX Security.
[19] Giorgi Maisuradze and Christian Rossow. 2018. ret2spec: Speculative execution
using return stack buffers. In ACM(CCS). ACM, 2109–2122.
[20] Marina Minkin, Daniel Moghimi, Moritz Lipp, Michael Schwarz, Jo Van Bulck,
Daniel Genkin, Daniel Gruss, Berk Sunar, Frank Piessens, and Yuval Yarom. 2019.
Fallout: Reading Kernel Writes From User Space.
[21] Stuart F Oberman. 1999. Floating point division and square root algorithms and
implementation in the AMD-K7™ microprocessor. In IEEE Symposium on
Computer Arithmetic (Cat. No. 99CB36336). IEEE, 106–115.
[22] Gururaj Saileshwar and Moinuddin K. Qureshi. 2019. CleanupSpec: An “Undo”
Approach to Safe Speculation. In Proceedings of the 52nd Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO ’52). Association for Computing
Machinery, New York, NY, USA, 73–86. https://doi.org/10.1145/3352460.3358314
[23] Michael Schwarz, Moritz Lipp, Claudio Alberto Canella, Robert Schilling, Florian
Kargl, and Daniel Gruß. 2020. ConTExT: A Generic Approach for Mitigating
Spectre. In Network and Distributed System Security Symposium 2020. https:
//doi.org/10.14722/ndss.2020.24271
[24] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Steck-
lina, Thomas Prescher, and Daniel Gruss. 2019. ZombieLoad: Cross-Privilege-
Boundary Data Sampling. In CCS.
[25] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. 2017.
Fantastic Timers and Where to Find Them: High-Resolution Microarchitectural
Attacks in JavaScript. In Financial Cryptography and Data Security, Aggelos
Kiayias (Ed.). Springer International Publishing, Cham, 247–267.
[26] Julian Stecklina and Thomas Prescher. 2018. LazyFP: Leaking FPU Register State
using Microarchitectural Side-Channels. arXiv preprint arXiv:1806.07480 (2018).
[27] K Sun, R Branco, and K Hu. 2019. A New Memory Type Against Speculative
Side Channel Attacks. (2019).
[28] Eran Tromer, Dag Arne Osvik, and Adi Shamir. 2010. Efficient Cache Attacks on
AES, and Countermeasures. J. Cryptology 23 (07 2010), 37–71. https://doi.org/10.
1007/s00145-009-9049-y
[29] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. 1995. Simultaneous
Multithreading: Maximizing On-chip Parallelism. In Proceedings of the 22nd
Annual International Symposium on Computer Architecture (ISCA ’95). ACM, New
York, NY, USA, 392–403. https://doi.org/10.1145/223982.224449
[30] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank
Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx.
2018. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient
Out-of-Order Execution. In Proceedings of the 27th USENIX Security Symposium.
USENIX Association. See also technical report Foreshadow-NG [35].
[31] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin,
Daniel Genkin, Yarom Yuval, Berk Sunar, Daniel Gruss, and Frank Piessens.
2020. LVI: Hijacking Transient Execution through Microarchitectural Load Value
Injection. In 41st IEEE Symposium on Security and Privacy (S&P’20).
[32] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi
Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2019. RIDL:
Rogue In-flight Data Load. In S&P.
[33] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval
Yarom. 2020. CacheOut: Leaking Data on Intel CPUs via Cache Evictions. https:
//cacheoutattack.com/. (2020).
[34] Z. Wang and R. B. Lee. 2006. Covert and Side Channels Due to Processor
Architecture. In 2006 22nd Annual Computer Security Applications Conference
(ACSAC’06). 473–482. https://doi.org/10.1109/ACSAC.2006.20
[35] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank
Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom.
2018. Foreshadow-NG: Breaking the Virtual Memory Abstraction with Transient
Out-of-Order Execution. Technical report (2018). See also USENIX Security
paper Foreshadow [30].
[36] Mengjia Yan, Jiho Choi, Dimitrios Skarlatos, Adam Morrison, Christopher W
Fletcher, and Josep Torrellas. 2018. InvisiSpec: Making Speculative Execution
Invisible in the Cache Hierarchy. In International Symposium on Microarchitecture
(MICRO).
[37] Yuval Yarom and Katrina Falkner. 2014. FLUSH+RELOAD: A High Reso-
lution, Low Noise, L3 Cache Side-Channel Attack. In 23rd USENIX Security
Symposium (USENIX Security 14). USENIX Association, San Diego, CA, 719–
732. https://www.usenix.org/conference/usenixsecurity14/technical-sessions/
presentation/yarom
[38] Jiyong Yu, Mengjia Yan, Artem Khyzha, Adam Morrison, Josep Torrellas, and
Christopher W Fletcher. 2019. Speculative Taint Tracking (STT) A Comprehen-
sive Protection for Speculatively Accessed Data. In International Symposium on
Microarchitecture (MICRO). 954–968.
