Shining Light On Shadow Stacks by Burow, Nathan et al.
SoK: Shining Light on Shadow Stacks
Nathan Burow
Purdue University
Xinping Zhang
Purdue University
Mathias Payer
EPFL
Abstract—Control-Flow Hijacking attacks are the dominant
attack vector against C/C++ programs. Control-Flow Integrity
(CFI) solutions mitigate these attacks on the forward edge,
i.e., indirect calls through function pointers and virtual calls.
Protecting the backward edge is left to stack canaries, which are
easily bypassed through information leaks. Shadow Stacks are
a fully precise mechanism for protecting backwards edges, and
should be deployed with CFI mitigations.
We present a comprehensive analysis of all possible shadow
stack mechanisms along three axes: performance, compatibil-
ity, and security. For performance comparisons we use SPEC
CPU2006, while security and compatibility are qualitatively
analyzed. Based on our study, we renew calls for a shadow
stack design that leverages a dedicated register, resulting in
low performance overhead, and minimal memory overhead,
but sacrifices compatibility. We present case studies of our
implementation of such a design, Shadesmar, on Phoronix and
Apache to demonstrate the feasibility of dedicating a general
purpose register to a security monitor on modern architectures,
and Shadesmar’s deployability. Our comprehensive analysis,
including detailed case studies for our novel design, allows
compiler designers and practitioners to select the correct shadow
stack design for different usage scenarios.
Shadow stacks belong to the class of defense mechanisms
that require metadata about the program’s state to enforce
their defense policies. Protecting this metadata for deployed
mitigations requires in-process isolation of a segment of the
virtual address space. Prior work on defenses in this class has
relied on information hiding to protect metadata. We show that
stronger guarantees are possible by repurposing two new Intel
x86 extensions for memory protection (MPX), and page table
control (MPK). Building on our isolation efforts with MPX
and MPK, we present the design requirements for a dedicated
hardware mechanism to support intra-process memory isolation,
and discuss how such a mechanism can empower the next wave of
highly precise software security mitigations that rely on partially
isolated information in a process.
I. INTRODUCTION
Arbitrary code execution exploits give an attacker fine-
grained control over a system. Such exploits leverage software
bugs to corrupt code pointers to hijack the control-flow of an
application. Code pointers can be divided into two categories:
backward edge, i.e., return addresses or forward edge pointers,
such as function pointers or virtual table pointers. Control-
Flow Integrity (CFI) [1], [2] protects forward edges, and is
being deployed by Google [3] to protect Chrome and Android,
and Microsoft [4] to protect Windows 10 and Edge. CFI
assumes that backward edges are protected. However, stack
canaries [5] and safe stacks [6], [7] are the strongest backward
edge protections available in mainline compilers, and both are
easily bypassed by information leaks.
Control-flow hijacking attacks that target backward
edges [8], [9], [10], e.g., Return Oriented Programming
(ROP) [10], [11], [12], are a significant problem in prac-
tice, and will only increase in frequency. In the last year,
Google’s Project Zero has published exploits against Android
libraries, trusted execution environments, and Windows device
drivers [13], [14], [15], [16], [17]. These exploits use arbi-
trary write primitives to overwrite return addresses, leading
to privilege escalation in the form of arbitrary execution in
user space or root privileges. The widespread adoption of
CFI increases the difficulty for attacks on forward edge code
pointers. Consequently, attackers will increasingly focus on
the easier target, backward edges.
C / C++ applications are fundamentally vulnerable to ROP
style attacks for two reasons: (i) the languages provide neither
memory nor type safety, and (ii) the implementation of the
call-return abstraction relies on storing values in writeable
memory. In the absence of memory or type safety, an attacker
may corrupt any memory location that is writeable. Consider,
for the sake of exposition, x86 64 machine code where the
call-return abstraction is implemented by pushing the address
of the next instruction in the caller function, i.e., the return
address, onto the stack; the callee function then pops this
address off the stack and sets the instruction pointer to that
value to perform a return. As C / C++ are memory unsafe,
attackers may modify return addresses on the stack to arbitrary
values and perform code-reuse attacks such as ROP.
Mitigating ROP attacks requires guaranteeing the integrity
of the return address used to reset the instruction pointer after
a function executes. There are four principle attempts to do
this: (i) stack canaries, (ii) back edge CFI, (iii) safe stacks,
and (iv) shadow stacks. Stack Canaries [5] protect against
sequential overwrites of a return address through, e.g., buffer
overflows by inserting a magic value onto the stack after the
return address, which is then checked before returns. However,
canaries are not effective against arbitrary writes where, e.g.,
an attacker controls a pointer and can precisely overwrite
memory. CFI computes a valid set of targets for indirect
control-flow transfers, for returns this means any potential call
site of the function. As shown by Control-Flow Bending [18],
this is too imprecise to prevent control-flow hijacking attacks
in the general case. Safe Stacks [6] move potentially unsafe
stack variables to a separate stack, thereby protecting return
addresses. However, Safe Stacks offer limited compatibility
with unprotected code, so are unlikely to be deployed.
Shadow stacks [19], [20], [21], [7] enforce stack integrity,
protecting against stack pivot attacks and overwriting return
addresses. Shadow stacks store the return address in a separate,
isolated region of memory that is not accessible by the attacker.
Upon returning, the integrity of the program return address
ar
X
iv
:1
81
1.
03
16
5v
2 
 [c
s.C
R]
  1
 M
ar 
20
19
is checked against the protected copy on the shadow stack.
By protecting return addresses, shadow stacks enforce a one
to one mapping between calls and returns, thereby preventing
ROP. Two shadow stack designs have been proposed: compact
shadow stacks [20], which rely on a separate shadow stack
pointer, and parallel shadow stacks [19], which place the
shadow stack at a constant offset to the original stack. These
existing shadow stack designs suffer from a combination of
poor performance — greater than the 5% threshold suggested
by [22] and far more than the 2% of LLVM-CFI, high memory
overhead, and difficulty supporting C and C++ programming
paradigms such as multi-threading and exception handling.
To improve the state of shadow stack design, we conduct
a detailed survey of the design space. Our design study
includes two novel designs for modern platforms that rely on
a dedicated register for performance. While a 2002 technical
report [7] initially proposed leveraging a dedicated general
purpose register, we believe that such designs deserve renewed
attention on 64 bit architectures. In total, our survey considers
five shadow stack mechanisms. We fully explore the trade-offs
of these designs in terms of performance, compatibility, and
security. We consider the impact of high level design decisions
on runtime, memory overhead, and support for threading,
stack unwinding, and unprotected code. For the performance
comparison we use SPEC CPU2006 as the standard bench-
mark, with qualitative arguments based on design for features
like threading that are not exercised by SPEC CPU2006. For
security, we note that the best shadow stack design approach,
instrumenting function prologues and epilogues, results in a
time of check to time of use (TOCTTOU) window on x86.
The TOCTTOU window requires impeccable timing to be
exploitable [23], and we discuss design approaches to avoid
it. Further, we propose novel optimizations for comparing the
shadow and stack return addresses, improving the performance
of all shadow stack schemes by 25%. As LLVM [24] is in the
process of developing a shadow stack implementation [25],
the time is ripe for such a design survey and optimizations to
maximize impact.
Beyond the design of the shadow stacks, we analyze the
options for guaranteeing their integrity, including existing
software solutions and two new ISA extensions. Unlike CFI,
which relies on immutable metadata stored on read-only
pages, shadow stacks, and other security mechanisms, require
mutable metadata that must be integrity protected. Integrity
protection is accomplished by isolating an area of the address
space within a process, preventing attackers from modifying
it. We discuss the limitations of existing hardware mechanisms
for intra-process isolation, and propose a new primitive better
suited for use by software security mechanisms.
Based on our design study, we propose Shadesmar, a new
compact shadow stack mechanism that leverages a dedicated
register for the shadow stack pointer and our optimizations
for comparing the program and shadow return addresses. We
present case studies of Shadesmar on Phoronix and Apache to
highlight its practicality and thoroughly evaluate it. We provide
a detailed discussion of the trade-offs between the different
shadow stacks along the axes of performance, security, and
compatibility. We hope that our thorough evaluation will lead
to the adoption and deployment of shadow stacks in practice,
closing a significant loop-hole in modern software’s protection
against code-reuse attacks. Shadesmar, along with ports of all
prior shadow stack techniques to LLVM-7.0.0 is available at
https://github.com/HexHive/ShadowStack, to aid deployment
of shadow stacks.
We present the following contributions: (i) Comprehensive
evaluation of the shadow stack design space along the axes
of performance, compatibility, and security; (ii) Performance
evaluation of each shadow stack design, including sources of
overhead, and our optimizations for x86; (iii) Comparative
study of new ISA features that can be used to create integrity
protected memory regions for any runtime mitigation, and
a proposal for an intra-process isolation mechanism; (iv)
Shadesmar a register-based compact performant, secure, and
deployable shadow stack scheme and its evaluation.
II. BACKGROUND
To enable security analysis of shadow stacks, we first
establish our attacker model. Using this attacker model, we
then discuss common attacks on the stack, e.g., ROP, which
overwrite return addresses for interested readers. Knowledge-
able readers may wish to move directly to our discussion of
the shadow stack design space in Section III.
A. Attacker Model
As is standard for defenses that aim to mitigate exploits,
e.g., CFI and Shadow Stacks, rather than the underlying cor-
ruptions, e.g., memory or type safety, we assume an attacker
with arbitrary memory read and write primitives. The attacker
uses these arbitrary reads and writes to inject her payload, and
then corrupts a code pointer to hijack the program’s execution,
executing her payload and exploiting the application. The
adversary is constrained only by standard defenses: DEP [26]
and ASLR [27]. We disable stack canaries [5] as they are
strictly weaker than Shadow Stacks.
For the final step of the attack, corrupting a code pointer,
we assume that the attacker only targets backward edges, i.e.,
return addresses of functions. Protection for forward edges is
an orthogonal problem, covered by defenses such as CFI [1],
[2]. Attacking forward edge control flow is therefore out of
scope for this paper. Also out of scope are data-only attacks,
i.e., attacks that do not corrupt code pointers.
B. Attacks on the Stack
Attacks against stack integrity began with Aleph One’s
seminal work on stack smashing [8]. To this day, control-flow
information on the stack remains an active battle ground in
software security [22]. Code reuse attacks such as ROP and
Stack Pivots are the latest iteration of this threat.
ROP [10] is a style of code-reuse attack that hijacks applica-
tion control flow by overwriting return addresses on the stack.
When the function returns, control is redirected to the attacker
chosen address. Absent any hardening, return addresses on the
Return Address
Local Data
Return Address
Local Data
Call Stack
foo()
bar()
ROP Payload
&(“/bin/sh”)
&(pop rdi; ret) 
Local Data
&(system)
Return Address
Local Data
Fig. 1: ROP Illustration
stack can be modified to target any executable byte in the
program. If a non-executable byte is targeted, attempting to
execute that byte will lead to a fault, terminating the program.
In practice, attackers target so called “gadgets”, which are
sequences of executable bytes ending in a return instruction
that perform some useful computation for the attacker. The
attacker’s payload consists of a sequence of addresses of such
gadgets that combined perform the desired computation, e.g.,
open a shell, or, in most real-world attacks map a memory
page as executable and writable and memcpy target shellcode
to that page before executing the injected shellcode.
Figure 1 illustrates a payload that executes system() to
spawn a shell. When function bar() returns, the first gadget
is executed. Returning to the first gadget moves the stack
pointer to &("/bin/sh"), which is then popped into rdi,
and moving the stack pointer to &(system). Consequently,
the return in the first gadget calls system("/bin/sh"),
opening a shell.
Stack Pivots are an attack technique wherein the adversary
controls the stack pointer, i.e., rsp on x86 architectures.
Consequently, instead of having to selectively overwrite data
on the stack, the attacker can move the stack frame to a region
of memory she entirely controls, thereby making, e.g., ROP
attacks significantly easier. This technique has also been used
to bypass ASLR [28], [29]. While stack pivoting changes how
the payload is delivered, code-reuse attacks utilizing it must
still overwrite a code pointer. Consequently, for the purposes
of shadow stacks and back edge defenses in general, stack
pivoting is just a payload delivery variant of ROP.
III. SHADOW STACK DESIGN SPACE
For any shadow stack mechanism to be adopted in practice,
it must be highly performant, compatible with existing code,
and provide meaningful security. We analyze the performance
of each shadow stack mechanism that we identify in terms of
runtime, memory, and code size overhead qualitatively in this
section, and quantitatively in our evaluation. Compatibility for
shadow stacks means supporting C and C++ paradigms such
as multi-threading and stack unwinding, as well as interfacing
correctly with unprotected code. Security is dictated both by
how a shadow stack mechanism validates the return address,
and by any orthogonal technique the mechanism uses to
guarantee the integrity of the shadow stack. See Section V
for details on such integrity mechanisms.
Shadow stack mechanisms are defined by how they map
from the program stack to the shadow stack, illustrated in
Figure 2. This includes the type of mapping, as well as how
the mapping is encoded in the protected binary. We analyze
five such mechanisms using the two types of shadow stack
identified by the literature: compact [20] and parallel [19].
For compact shadow stacks we identify three ways to encode
the mapping in the binary, and two such ways for parallel
shadow stacks. Each of these mechanisms has unique per-
formance and compatibility characteristics. All shadow stack
mechanisms must adopt a policy on validating the return
address. Traditionally, this has been to compare the shadow
and program return addresses and only proceed if they match.
We examine the security impact of utilizing the shadow
return address without a comparison and find it increases
performance without impacting security.
A. Shadow Stack Mechanisms
Direct mappings schemes for parallel shadow stacks use the
location of the return address on the program stack to directly
find the corresponding entry on the shadow stack. The parallel
shadow stack is as large as the program stack, and a simple
offset maps from the program stack to the shadow stack.
Consequently, the direct mapping trades memory overhead –
twice the stack memory usage, for performance – a very simple
shadow stack look up.
Indirect mapping schemes for compact shadow stacks main-
tain a shadow stack pointer, equivalent to the stack pointer
used for the program stack. The shadow stack pointer points
to the last entry on the shadow stack, exactly as the stack
pointer does for the program stack. Maintaining a shadow
stack pointer allows a compact shadow stack to allocate sig-
nificantly less memory, as only room for the return address is
required, instead of duplicating the program stack. Therefore,
indirect mappings trade performance overhead – from using
the shadow stack pointer, for reduced memory overhead – by
only requiring a compact shadow stack.
In addition to the performance versus memory overhead
trade-off, parallel and compact shadow stacks have different
compatibility implications. If calls and returns were always
perfectly matched, there would be no difference. However,
the setjmp / longjmp functionality of C, which allows
jumping multiple stack frames back up the stack, and the
equivalent stack unwinding capability used by C++ for excep-
tion handling, both break the assumption of perfectly matched
calls and returns. The direct shadow stack paradigm naturally
handles these, as C / C++ adjust the stack accordingly, and
then it uses the adjusted stack to find the appropriate shadow
stack entry. The indirect shadow stack scheme on the other
hand must know how many stack frames the program stack has
been unwound to appropriately adjust its shadow stack pointer.
Consequently, stack unwinding leads to additional overhead
for indirect shadow stack mapping schemes, while having no
effect on direct mapping schemes.
Stack Stack
Shadow 
Stack
8MB
8MB
constant
Direct Mapping
Stack Stack
Shadow 
Stack
8MB
Indirect Mapping
Grows on 
demand
Fig. 2: Shadow Stack Designs – Mapping Options
Mapping Encoding Performance Memory CompatibilityThreading Stack Unwinding Unprotected Code
Compact
Global Variable Slow Low 7 3 3
Segment Medium Low 3 3 3
Register Fast Low 3 3 G
Parallel Constant Offset Fast High 7 3 3Register Offset Medium High G 3 3
TABLE I: Summary of Performance Overhead, Memory Overhead, and Compatibility trade-offs between shadow stack
mechanisms. 3– supported; 7– not supported; G – implementation dependent
For each shadow stack mapping scheme, there are multiple
possible mechanisms with different implications for perfor-
mance and compatibility. In particular, we introduce the use
of a register for the shadow stack pointer for compact shadow
stacks, or the offset for parallel shadow stacks. Now that all
64 bit architectures have at least 16 general purpose registers,
it is possible to dedicate a general purpose register to the
shadow stack mechanism, unlike in 2001 when the original
shadow stack proposal was made [20] and only eight general
purpose registers were available on x86. We find that using
a dedicated register allows compact shadow stack mappings
to be as performant as parallel shadow stacks, and allows
parallel shadow stacks to increase their compatibility with
multi-threading while also being more secure.
A summary of our shadow stack mechanisms and their
trade-offs for each design is shown in Table I. Each row in
the table represents a shadow stack mechanism that we eval-
uate. The table reports qualitative differences between them,
we refer to the evaluation in Section VII-A for quantitative
measurements.
1) Parallel Shadow Stack Mechanisms: Parallel shadow
stack mechanisms effectively use the stack pointer as the
shadow stack pointer. The existing mechanism [19] places
shadow stack entries at a constant offset from the program
stack. This is very efficient, requiring no extra registers
or memory access, and no instrumentation to maintain the
shadow stack pointer. This performance benefit is offset by
higher memory overhead, compatibility problems, and lower
security. All parallel shadow stacks suffer from higher memory
overhead, as they fundamentally require the program stack to
be duplicated. The compatibility concerns arise from requiring
a constant offset, which is limited to 32 bits for immediate
operands in x86, from the program to the shadow stack from
all threads, severely constraining the address space layout for
programs with many threads, such as browsers. Hard-coding
the offset in the binary is also a security hazard, as recovering
the offset leaks the address of the shadow stack to adversaries.
To mitigate the compatibility and security concerns, we
propose a new parallel shadow stack mechanism. Our parallel
shadow stack mechanism encodes the offset in a dedicated
register, see Figure 3, allowing the offset to the shadow stack
to be determined at runtime. Further, the offset may vary from
thread to thread as registers are thread local, and the offset
can be set when the thread is created. This register is only
updated once, when the offset is determined for the thread, and
therefore adds no per function call overhead (unlike shadow
stack pointers for compact shadow stacks).
2) Compact Shadow Stack Mechanisms: For compact
shadow stack mechanisms, the key question is where to store
the shadow stack pointer. This decision will not impact the
memory overhead of the implementation, but does have per-
formance and compatibility ramifications. The shadow stack
pointer will be dereferenced twice in every function: once in
the prologue to push the correct return address, and once in
the epilogue to pop the shadow return address. Consequently,
the speed of accessing the shadow stack pointer is critical for
the performance of shadow stacks that are indirectly mapped.
There are three locations to store a variable: in memory,
in a segment, or in a register. We discuss and evaluate the
performance and compatibility trade-offs of all three, and x86
code for each is shown in Figure 4.
Using a memory location, e.g., a global variable is the
simplest solution, and we present it as a straw man. Accessing
memory is orders of magnitude slower than accessing a value
stored in a register. Even with caching, this effect is noticeable,
see Figure 7. This slow down is aggravated by the need for
an additional move instruction to load the location of the
global variable into a register to access it – x86 does not
1 mov rax , [ r s p ]
mov [ r s p +CONSTANT] , r a x
(a) Constant Offset
1 mov rax , [ r s p ]
mov [ r s p + r15 ] , r a x
(b) Offset in Register
Fig. 3: Direct Mapping Shadow Stack Prologues. The epilogues execute the inverse.
1 mov r10 , r c x
mov r11 , rdx
3 mov rax , [ r s p ]
mov rdx , GLOBAL
5 mov rcx , [ rdx ]
mov [ r c x ] , r a x
7 mov [ r c x ] , r s p
add [ rdx ] , 16
9 mov rcx , r10
mov rdx , r11
(a) Global Variable
1 mov rax , [ r s p ]
mov r10 , gs : [ 0 ]
3 mov [ r10 ] , r a x
mov [ r10 + 8] , r s p
5 add r10 , 16
mov gs : [ 0 ] , r10
(b) Segment
1 mov rax , [ r s p ]
mov [ r15 ] , r a x
3 mov [ r15 + 8] , r s p
l e a r15 , [ r15 +16]
(c) Register
Fig. 4: Indirect Mapping Shadow Stack Prologues. Note - Epilogues are the inverse.
support 64 bit immediate values. Further, changing memory
access patterns can affect cache behavior, with unpredictable
effects on the program’s performance. An additional problem
for this scheme is that the memory must be thread local to
support multi-threaded programs. Consequently, a scheme that
has better performance characteristics and is inherently thread
local is desirable.
Segment registers, used by existing shadow stack mecha-
nisms [20] to store the location of the shadow stack base, are
an architectural feature left over from when physical memory
was larger than the virtual address space. Segment registers
are faster to access than memory, and are inherently thread
local. Consequently, they improve performance significantly
over using a memory location to store the shadow stack
pointer, while also improving compatibility by supporting
multi-threading. We point the segment register at the base of
the shadow stack, and store the shadow stack pointer there.
Accessing the shadow stack is thus double indirect, through
the segment register and then the shadow stack pointer.
The earliest approach of a shadow stack scheme with a
dedicated register we know of is a 2002 technical report [7]
that focuses on x86 32. We rejuvenate this idea for 64 bit
architectures as general purpose registers provide the fastest
possible option for storing the shadow stack pointer. Compared
to x86 32, the x86 64 architecture defines twice as many
general purpose registers. The disadvantage of using a general
purpose register is that one register must be reserved for
the shadow stack pointer, reducing the number of registers
available to the compiler’s register allocation pass, and thereby
increasing register pressure. Increased register pressure can
reduce performance if it leads to additional register spills to
the stack. Despite this potential overhead, our evaluation finds
that this is the fastest shadow stack encoding, see Figure 7.
B. Return Address Validation
Shadow stack mechanisms can ensure a valid return address
in two ways: by either comparing the program and shadow
return addresses, or by using the shadow return address.
Comparing the shadow and program return addresses detects
corruptions of the program return address immediately, and
can halt execution. Immediate detection is useful during testing
and debugging as it helps isolate the bug. In deployment, how-
ever, preventing control-flow hijacking attacks only requires
that the corrupt program return address not be used. Checking
the program return address is equivalent to a low entropy
stack canary, possibly detecting sequential buffer overflows.
Consequently, the shadow stack mechanism can simply use the
return address on the shadow stack. Doing so fully mitigates
control-flow hijacking attacks as the attacker controlled return
address is not used and avoids the overhead of comparing the
return addresses. Either policy provides the same security: an
attacker cannot control the target address of a function return.
IV. SHADOW STACK IMPLEMENTATIONS
Each of the shadow stack mechanisms we evaluate is
implemented as a backend compiler pass in LLVM [24] 7.0.0,
and shares some common implementation details. In particular,
each shadow stack mechanism must instrument calls and
returns to update its shadow stack and validate the return
address before using it to transfer control. We show that the
best way to accomplish this is to instrument function prologues
and epilogues. Our implementations further include a small
runtime library to set up the shadow stacks, and support stack
unwinding for compact shadow stack schemes. Additionally,
we introduce novel peep hole optimizations for x86 epilogues.
1 pop r10
; r11 h o l d s shadow RA
3 xor r11 , r10
po pc n t r11 , r11
5 s h l r11 , 48
or r11 , r10
7 ; f a u l t s i f r11 != 0
jmp r11
(a) Fault Epilogue
pop r10
2 ; r11 h o l d s shadow RA
xor r11 , r10
4 po pc n t r11 , r11
; w i l l f a u l t i f r11 != 0
6 mov r11b , [ L a s t B y t e o f P a g e + r11 ]
jmp r10
(b) LBP Epilogue
Fig. 5: Shadow Stack Epilogue Optimizations
A. Instrumented Locations
Shadow stack mechanisms can instrument function calls
either at the location of the call instruction or in the function
prologue on the callee side. This instrumentation is responsible
for pushing the return address to the shadow stack, and
updating the shadow stack pointer for compact shadow stacks.
Returns must be instrumented to pop from the shadow stack
and validate the program return address in the function epi-
logue before the control-flow transfer to mitigate control-flow
hijacking attacks. Code that can unwind stack frames, such as
longjmp and C++’s exception handling mechanism, which
uses libunwind, must also be instrumented to maintain the
shadow stack pointer for compact shadow stacks. Failing to
handle stack unwinding correctly can lead to false positives
as the shadow and program stack are out of sync.
The elegant solution for instrumenting calls is to place the
protection in the function prologue. In this way, the function
is protected, not particular call sites. The compiler does not
have to distinguish between calls to protected and unprotected
functions as it would if call sites were instrumented instead.
The distinction must be made if call sites are instrumented to
keep the shadow stack in sync for compact shadow stack where
calls and returns must be perfectly matched. Instrumenting
function prologues and epilogues maintains this symmetry
naturally, as each will be executed for every function call.
On x86, instrumenting the function prologues results in a one-
instruction wide Time Of Check To Time Of Use (TOCTTOU)
opportunity due to architectural limitations. The call instruc-
tion pushes the return address to the stack where it may be
modified by an attacker before it is picked up by the prologue
in the called function. Architectures, such as ARM, where the
address of the called function is stored in a register, do not
have this limitation.
While the TOCTTOU window exists, given the extremely
precise timing required, we do not believe this potential weak-
ness to be readily exploitable. Any such attack would rely on
accurately timing the victim process, and manipulating the OS
scheduler to pause the victim’s execution precisely between
the call and mov instruction. After the call instruction
pushes the return address onto the stack, it remains in the
cache and the mov instruction can immediately use the value,
resulting in a minimal window of only a few cycles. Microsoft
researchers proposed and redacted Return Flow Guard (RFG)
as it was vulnerable to TOCTTOU windows in the prologue
and epilogue. The Microsoft red team discovered a viable
attack against their proposed epilogue, targeting epilogues of
leaf functions [23]. Our proposed mechanism halves the attack
window as we jump to the verified address, but do not fully
mitigate the TOCTTOU window. Intel Control Enforcement
Technology (CET) [30] introduces a shadow stack based
on hardware and compiler support. This extension, when
available, will mitigate the TOCTTOU window on x86 and
simplify the required instrumentation.
It is also possible to mitigate this vulnerability by using
rsp as the shadow stack pointer. However, doing so comes
with significant side effects, see Section VI. Alternatively,
resolving the TOCTTOU window requires instrumenting call
sites to pass the return address in a register, e.g., r9 which
is currently used for the sixth function argument, changing
the ABI. However, this creates compatibility problems, as
protected functions called from unprotected code would need
to read the return address from the stack. Such a scheme thus
requires whole program analysis, and reduces compatibility
with unprotected code. For users in highly sensitive environ-
ments that are concerned about sophisticated adversaries this
may be a worthwhile trade-off.
Our prologue and epilogue rely on the stack pointer to find
the return address, and are therefore agnostic to optimizations
that delete the stack frame base pointer. Once our epilogue
has popped the return address, we do not read it again from
memory, thereby preventing TOCTTOU attacks that modify
the return address in memory between the time it is read for
the shadow stack check and the time it is used by the return
instruction. One consequence of this is that ret instructions
become pop and jmp instructions. This single transformation
accounts for approximately half of the shadow stack overhead,
see Figure 8. Hardware solutions that avoid this overhead are
discussed in Section VI.
Stack unwinding mechanisms such as longjmp and C++
exceptions require additional instrumentation for compact
shadow stacks. Parallel shadow stacks are unaffected as they
do not require adjustment to track stack frames, i.e., they do
not maintain a shadow stack pointer. For compact shadow
stacks, we must be able to unwind to the correct point on
the shadow stack as well. Simply matching return addresses
does not suffice for this, as the same return address can show
up multiple times in the call stack due to, e.g., recursive calls.
To deal with this, our compact shadow stack implementations
also push the stack pointer, i.e., rsp. The stack pointer and
return address uniquely identify the stack frame to unwind to,
allowing our mechanisms to support stack unwinding.
For the shadow stack mechanisms that use a register to
encode the shadow stack mapping, ensuring compatibility with
unprotected code constrains our selection of register. A callee
saved register must be used, so that any unprotected code that
is called will restore the shadow stack pointer, but only if it
is clobbered, which helps performance. Our implementations
use r15 in practice. An alternative would be to use rsp
as the shadow stack pointer, and r15 as the stack pointer.
The ramifications of such an implementation are considered
in Section VI.
B. Runtime Support
Our runtime library is responsible for allocating the shadow
stack, and hooking setjmp and longjmp. We add a new
function in the pre_init array that initializes the shadow
stack for the main program thread. This function also ini-
tializes the shadow stack pointer for compact shadow stack
mappings. In particular, for segment encodings it invokes the
system call to assign the shadow stack to the segment register.
Setjmp and longjmp are redirected to versions that are
aware of our shadow stacks. These patched versions required
less than 20 lines of assembly to modify.
For compact shadow stack mappings to support multi-
threading and libunwind, we preload a small support library. It
intercepts calls to pthread_create and pthread_exit
to set up and tear down shadow stacks for additional threads.
We use a patched version of libunwind, to which we added 20
lines of code for compatibility with our shadow stacks. These
changes are minimal, and easily deployable by having, e.g.,
two version of the library on the system and a compiler flag to
chose which one is linked in. If shadow stacks are universally
used to harden libraries, no such additional support would be
required. Consequently, we believe compact shadow stacks are
readily deployable.
C. Shadow Stack Epilogue Optimizations
Traditionally, shadow stacks have relied on compare in-
structions to validate the shadow return address and program
return address are equivalent. However, the compare and
jump paradigm is relatively expensive, potentially leading to
pipeline stalls even with branch prediction. Consequently, as
an optimization, we explore two different methods to optimize
this validation. Our optimizations rely on the insight that a full
comparison is not required, only an equality test.
To replace the compare instruction of traditional shadow
stack epilogues, we propose an xor of the program return
address and shadow return address. This will result in 0 bits
anywhere the two are identical, and 1s elsewhere. x86 has an
instruction, popcnt, that returns the number of bits set to
1. Consequently, if the popcnt of the xor of the program
return address and shadow return address is 0, then the two
are equivalent.
We leverage the memory management unit (MMU) to
compare the popcnt to zero as a side effect by creating
a protection fault. We propose two different ways to do so:
fault and last byte in page (LBP), see the code in Figure 5.
For fault, we note that the maximum value of the popcnt
is 64, therefore fitting in six bits. By shifting this value left
48 and oring it into the return address, we create a general
purpose fault for a non-canonical address form if its value is
not zero, by setting one of the high order 16 bits to one in user
space. This scheme abuses the fact that the high order 16 bits
are currently unused, and may break if those bits are utilized
in future processors. Alternately, the LBP scheme creates two
pages in memory, the first of which is mapped read write, the
second of which has no permissions. We then attempt to read
from the first page at the address of the last valid byte, plus
the popcnt value. If the popcnt value is zero, we read the
last byte of the valid page, otherwise we read from the guard
page, causing the MPU to return a fault. The trade-off between
the two is that the fault scheme requires serialization in the
processor, while the LBP scheme requires a memory access
and the MMU. We show the performance of both schemes in
Figure 9.
V. HARDWARE INTEGRITY MECHANISMS
Once a shadow stack design has been chosen, the shadow
stack mechanism must guarantee the integrity of the shadow
stack. How to guarantee the integrity of a protected region
of memory is a problem faced not only by shadow stacks,
but also by all mitigations that rely on writable runtime
metadata. Integrity guarantees are best provided by hardware
solutions, though software solutions exist and are covered here.
Hardware solutions offer greater security and performance
than software solutions, and can be as generic. Hardware
solutions for integrity protecting part of the address space
within a process should be evaluated on two metrics: their
performance, and the number of supported concurrent code
regions.
Existing hardware mechanisms take two different ap-
proaches to encoding access privileges to provide integrity
protection: (i) MPK which encodes access privileges in each
thread’s register file, providing per thread (thread centric)
integrity, and (ii) MPX which encodes access in the individual
instructions, so that access privileges are the same across all
threads and depend only on the executed instruction (code
centric). Note that thread centric solutions require additional
instructions to change the register file, consequently, code
centric solutions are (potentially) more performant as they
operate in a single step, checking an instruction’s permissions,
instead of first toggling bits in the register file and then check-
ing permissions. For code centric mechanisms, the ability to
execute the instruction grants the necessary permissions while
for thread centric mechanisms, the state of the register file
determines the policy.
; Read Wr i t e ( d i s a b l e a l l MPKs)
2 mov eax , 0
xor ecx , ecx
4 xor edx , edx
wrpkru
6 ; p r o t e c t i o n i s o f f , w r i t e t o shadow s t a c k
. . .
8 ; Read Only ( e n a b l e w r i t e d i s a b l e b i t f o r
shadow s t a c k )
mov eax , 8
10 xor ecx , ecx
xor edx , edx
12 wrpkru
Fig. 6: MPK Page Permission Toggling
Assuming code integrity and a control-flow hijacking de-
fense such as CFI, we prefer code centric solutions for
their potential performance and flexibility. Unfortunately, no
existing code centric solution is fully satisfactory in that they
have excessive code size increases, lack performance, and
are not as flexible as required, i.e., split memory into only
two regions. Consequently, we call for a new ISA extension
that is hardware-based for performance, supports multiple
secure regions to be general purpose (e.g., to support multiple
concurrent security monitors, each with its own protected
region), and requires minimal code changes. Such an extension
would support many different security policies, as opposed to
past proposals for policy specific extensions [30], [7], [31],
[32]. We show how our proposed mechanism is a code centric
adaptation of the state of the art thread centric mechanism,
and thus is fully practical.
A. Thread Centric Solutions
Thread centric solutions operate by changing the permis-
sions on the pages of the protected memory region. Adding
write permissions elevates the thread’s privileges, thereby
creating a privileged region that is able to modify the pro-
tected memory region, i.e., the shadow stack. Removing the
write permissions ends the privileged region. The traditional
mechanism for doing this is the mprotect system call.
Using mprotect is prohibitively expensive as it not only
requires a context switch into the kernel, but a full page
table walk to change the permissions on the indicated pages.
In addition, mprotect enables write capabilities for all
concurrent threads and not just for the thread writing the
privileged data.
As a hardware-enforced isolation mechanism, segment reg-
isters used to provide privilege-based isolation for x86, where
the segmentation register served to give an instruction access
privileges to the protected region. For 64 bit architectures
however, x86 no longer enforces the isolation property while
still providing the segmentation registers.
A new Intel ISA extension, Memory Protection Keys (MPK)
aims to address this by providing a single, unprivileged
instruction that can change page access permissions on a per-
thread basis. MPK repurposes four unused bits in the page
table to assign one of sixteen keys to each page, and adds
a per-thread 32-bit register that, for each key, stores if reads
or writes are disabled. The new wrpkru instruction writes
to this new register, selectively disabling reads or writes for
pages with a given key. This approach elegantly solves the
TOCTTOU problem of mprotect and allows per-thread
protected regions.
The assembly to enforce privileged code regions using MPK
is shown in Figure 6. Note that the wrpkru instructions
requires edx and ecx to be set to 0. Intel did not disclose
why the two registers are required to be 0, it may be for future
extension of the wrpkru instruction to allow a full API to be
developed. The System V calling convention, used by Linux,
uses these registers to pass the third and fourth arguments
to a function respectively. Consequently, for functions which
take more than two arguments, it is necessary to preserve the
original values of these registers, which is accomplished by
moving their values to caller save registers, and then restoring
them after the wrpkru instruction. Surprisingly, this scheme
is slower than MPX which must instrument almost every
memory write in the program, see Figure 10 for full results.
B. Code Centric Solutions
The most common code centric solution is information
hiding, where a pointer to the protected region gives any
instruction access privileges. Information hiding is attractive
because it adds no additional overhead; however, it is the
weakest option as the many attacks against ASLR and other
information hiding schemes attest [33], [34], [35], [36], [37].
Zieris and Horsch [38] present a detailed study of these
attacks against shadow stacks, including proposed mitigations.
Nonetheless, given the history of successful attacks against
randomization defenses, we consider information hiding to
provide minimal security for the shadow stack, and recom-
mend against it.
Software Fault Isolation (SFI) [39], [40], [41] is a secure
software solution for isolating intra-process address regions.
Even the best SFI implementations [39] still have 7% overhead
just for the isolation, significantly more than is acceptable in
total for a deployed security monitor. Additionally, the x86
ISA supports an address override prefix that limits addressable
memory to 32 bits. This can be used to crudely separate the
program’s address space in a 4GB region for the process to
access, leaving all other memory for the security monitor.
4GB of memory is insufficient for many modern applications
however. Consequently, a more flexible hardware mechanism
is required.
The Intel ISA extension Memory Protection Extension
(MPX) provides a hardware mechanism that can be used to im-
plement segmentation [42] in a flexible manner. MPX provides
a bounds checking mechanism, with four new 128 bit registers
to store the bounds, and two new primitives to perform the
upper and lower bounds checks. MPX segmentation schemes
divide writes into two categories, those that are privileged
to write into the protected region, and all others. All non-
privileged writes in the code are instrumented with a bounds
check to ensure that they do not touch the privileged region.
In essence, unprivileged writes are restricted to an “array” of
memory that consists of all unprotected regions. This approach
is surprisingly performant, see Figure 10.
C. Privileged Move
Intel’s MPK comes closest of all existing hardware mech-
anisms to meeting our requirements – it is a hardware based
mechanism so should be performant, and supports 16 code
regions within a process. However, while faster than rewriting
page tables, MPK is still too expensive to execute for every
function call, see Figure 10. Further, security monitors do
not require a thread centric protection scheme. Rather, a code
centric scheme with a single privileged move instruction would
suffice. This instruction could take a one byte immediate
specifying the region of memory it is allowed to write to.
Unprivileged moves would be limited by default to the un-
protected code region, allowing minimal changes. Privileged
moves which encode their access permissions should be faster
than toggling a thread control register as MPK does. Further,
its implementation should be largely similar, relying on the
same four bits in the page table that MPK does, and with the
same checks. The difference being that instead of referencing
a thread local state for permissions, the permissions would be
encoded in the instruction proper.
Such a privileged move instruction makes entire class of
security policies that rely on runtime metadata practical.
Currently, protecting metadata at runtime is the bottleneck
for many of these policies, covering areas as diverse as type
safety [43], use after free protection [44], and partial memory
safety for function pointers [6]. This hardware primitive would
allow for the creation of flexible security policies in software
that can change and adapt, such as shadow stacks. With
the availability of such a primitive, the policies would be
secure in practice, and make them deployable in adversarial
environments, instead of only being useful for testing as they
cannot withstand direct attacks.
Protection schemes that rely on new ISA extensions are
unlikely to be immediately adopted by the wider community.
However, analyzing them can show which hardware schemes
are useful, hopefully paving the way for eventual broad
deployment as happened with the DEP and the NX bit.
VI. DISCUSSION
Orthogonal to the main design, optimization, and protec-
tion points above there are interesting details around dealing
with unprotected code, existing compiler optimizations with
ramifications for shadow stacks, and forthcoming hardware
extensions that we include here for completeness.
Unprotected Code. Unprotected code weakens the guar-
antees of shadow stack schemes, as they cannot prevent a
control-flow hijacking attack in the unprotected region. Both
parallel and compact shadow stack can be fully compatible
with unprotected code regions. Parallel shadow stacks are com-
pletely oblivious to unprotected code as they do not require
a shadow stack pointer. Compact shadow stack schemes fully
support unprotected code as long as the shadow stack pointer
is not clobbered. In particular, the register implementation of
the compact shadow stack scheme is exposed to this. The
register implementation can handle calls into unprotected code
that return directly to protected code, as the register used
is callee saved and thus restored before protected code runs
again. However, if the unprotected region calls into protected
code due to, e.g., a call back function to a sorting routine, the
shadow stack pointer may have been clobbered causing the
call back function to fail. Preventing this requires modifying
the linker so that it is aware of whether modules have been
compiled with shadow stacks or not. If not, the linker could
add wrapper functions for calls across the protection boundary
that save the shadow stack pointer. With such a linker, compact
register stacks are fully compatible with unprotected code. We
leave such engineering issues as future work.
Tail Call Optimizations. Tail calls allow call return pairs
to be omitted by the compiler, when, for example, a function
returns the value of another function call, or for recursive calls.
In these cases, the same program return address can be used
for the tail called function. However, new stack frames are
required for the case where the call being optimized is the last
instruction in an arbitrary function. The optimization simply
saves instructions by omitting a call return pair by jumping
directly to the callee, which can then use one return to exit
itself and the caller. As a function can be both tail called and
called normally, the full function prologue is executed even
when the function has been tail called. To keep the shadow
stack in sync, we execute the normal shadow stack epilogue
before tail calls, though we omit the jump through the return
address in these cases. Consequently, fault epilogues fall
back to LBP for tail calls, as there is no jmp to modify.
Mobile Architectures. Beyond x86, ARM is in wide use for
mobile and embedded devices. ARM uses the link register
to store the return address for the current function, only
pushing the return address to the stack when additional func-
tions are called. Consequently, shadow stacks can instrument
function prologues without a potential TOCTTOU window.
Our analysis of the design space applies to other architectures
while our epilogue optimizations are x86 specific because
of the popcnt instruction. Of course, this instruction can
be replaced with shift and or instructions. We leave the
evaluation of an ARM implementation as future work.
Intel Control Enforcement Technology. Intel has released
a preview document for a proposed new ISA extension
called Control Enforcement Technology (CET) [30]. CET
provides hardware support for shadow stacks, and checks on
forward edge indirect control-flow transfers. CET modifies
call instructions to push the return address to a hardware
protected shadow stack as well as the program stack, and
return instructions to compare the program and shadow return
addresses, raising a fault if they are not equal. While this
technology has great promise, no release date has been made
public so it is unclear when / if it will become available. In
the meantime, software solutions for hardening programs are
required. Orthogonally, other architectures and legacy systems
equally require protection.
Tagged Architectures. Recently, there has been renewed
interest in tagged architectures [45], [46], in which the hard-
ware associates a “tag” with each byte in memory that encodes
a security policy. Tags can be used to, for example, allow
call instructions exclusive rights to write to certain memory
areas, preventing return addresses from being overwritten [47].
More generally, such architectures can readily be leveraged
for in-process virtual address isolation, by assigning access
permissions to each instruction in the code section, and each
byte in memory. Such architectures are still in development
however, and likely years from widespread deployment.
RSP as Shadow Stack Pointer. Using rsp as the stack
pointer instead of r15 for compact register based shadow
stacks would have two benefits: (i) improved performance, as
shown in Figure 8, half the overhead of shadow stack comes
from replacing return with pop; jmp; and (ii) using rsp
would remove the TOCTTOU window as call instructions
would directly push the return address onto the shadow stack.
However, such a design has fairly radical compatibility im-
plications. It would remove push and pop from the instruc-
tion set, as these implicitly use rsp. Furthermore, a linker
modification to detect unmodified code and provide wrappers
for changing the stack pointer to the expected semantics, i.e.,
pointing rsp back at the normal stack while preserving the
value of the shadow stack pointer, would be mandatory. Other
interesting engineering challenges include (i) support for inline
assembly, (ii) rewriting threading and standard libraries to
support the two stacks, (iii) kernel support for process creation,
signal delivery, and argument passing (which implicitly uses
the rsp register), and (iv) the stack initialization code that
switches rsp to point to the new shadow stack. A clean
field implementation would leverage kernel support but would
require rather serious changes to the threading and standard
libraries as well as the kernel ABI.
128 Bit Returns. The System V ABI specifies what regis-
ters are callee saved. Our epilogues must preserve these reg-
isters, and of course the return address in rax. Additionally,
128 bit wide returns are possible under the ABI and used by
LLVM in practice. These further use rdx for the extra 64 bits
of the return value, removing it from the pool of registers that
our epilogue can use for its computation.
Assembly Files. Our shadow stack mechanisms treat assem-
bly files as unprotected code, and do not add instrumentation.
Consequently, they are allowed to use r15 even for the
register-based shadow stack encoding. Extending support to
assembly files, including refactoring to remove uses of r15
is left as an engineering challenge in future work.
VII. EVALUATION
We evaluate the five different shadow stack implemen-
tations from Table I, and we examine the impact of our
proposed epilogue optimizations. Orthogonally, we evaluate
the cost of providing deterministic integrity protection for the
shadow stack. Based on this evaluation, we recommend a
shadow stack mechanism, Shadesmar, for broad use. To show
Shadesmar’s practicality, we present two deployability case
studies: Phoronix and the Apache web server. The Phoronix
benchmarks are common use cases for widely used, real-world
applications, and Apache is the most popular web server. All
of our evaluation is done on an Intel(R) Xeon(R) Bronze
3106 CPU at 1.7GHz, with 48GB memory, running Debian-
9.3.0. This was the first machine that supported MPK in 2017
when this work began, and is used for all experiments for
consistency. Our results do not change on Skylake 3.0GHz
desktops with 16GB of memory running Ubuntu 16.04, which
were also used during development. We compile software at
O2 and for SPEC CPU2006 we use the default configuration
with three reportable runs on the ref dataset.
A. Shadow Stack Evaluation
For each of the five different shadow stack designs, we
first evaluate their performance on SPEC CPU2006. For the
existing shadow stack designs identified in Section III, we
reimplemented them on LLVM 7.0.0 to control for perfor-
mance effects from compiler improvements. For these ex-
periments, we used the traditional cmp-based epilogue, and
information hiding to protect the shadow stack. The results
are in Figure 7. Note that the parallel shadow stack constant
offset implementation and the compact shadow stack register
implementation are within measurement noise of each other
at 5.78% overhead and 5.33% respectively. This removes the
performance justification for parallel shadow stack’s greater
memory use, if a dedicated register is used for the shadow
stack pointer. The register based parallel shadow stack scheme,
needed for compatibility as per Section III-A1, has 7.10%
overhead, noticeably more than the constant offset version.
Consequently, the performance case for compact shadow is
even more compelling when equally compatible designs are
considered. The compact and parallel shadow stacks have
effectively the same code size impact as well, 15.57% and
14.88% respectively. Consequently, we recommend compact
shadow stacks.
Figure 9 shows the overheads for the compact shadow stack
register implementation with our different epilogue optimiza-
tions. The traditional cmp-based epilogue has 5.33% overhead,
25% more than our optimized epilogues at 4.31% for the
fault epilogue, and 4.44% for the LBP epilogue. Further, the
cmp epilogue has significant outliers on perlbench, povray, and
Xalancbmk. Consequently, we believe our epilogue optimiza-
tions are highly effective as they not only reduce overhead
but also reduce its variation. As the fault-based epilogue
is faster (albeit marginally) and does not require additional
changes to the address space (LBP introduces guard pages),
we recommend it for vulnerability discovery settings, e.g.,
software testing and fuzzing. Using the shadow return address
without any comparison as discussed in Section III-B results in
3.65% overhead, and is our recommendation for deployment.
We break down the sources of overhead within the compact
shadow stack register implementation in Figure 8. Changing
the ret instruction to a pop; jmp sequence has 1.97%
 0
 10
 20
 30
 40
 50
 60
40
0.p
erl
ben
ch
40
1.b
zip
2
40
3.g
cc
42
9.m
cf
43
3.m
ilc
44
5.g
ob
mk
45
6.h
mm
er
45
8.s
jen
g
46
2.l
ibq
uan
tum
46
4.h
26
4re
f
47
0.l
bm
48
2.s
ph
inx
3
44
4.n
am
d
45
0.s
op
lex
45
3.p
ov
ray
47
1.o
mn
etp
p
47
3.a
sta
r
48
3.x
ala
ncb
mk
geo
me
an
Pe
rc
en
ta
ge
 O
ve
rh
ea
d
parallel - c.o.
parallel - register
compact - g.v.
compact - segment
compact - register
Fig. 7: Design Comparison
 0
 5
 10
 15
 20
 25
 30
40
0.p
erl
ben
ch
40
1.b
zip
2
40
3.g
cc
42
9.m
cf
43
3.m
ilc
44
5.g
ob
mk
45
6.h
mm
er
45
8.s
jen
g
46
2.l
ibq
uan
tum
46
4.h
26
4re
f
47
0.l
bm
48
2.s
ph
inx
3
44
4.n
am
d
45
0.s
op
lex
45
3.p
ov
ray
47
1.o
mn
etp
p
47
3.a
sta
r
48
3.x
ala
ncb
mk
geo
me
an
Pe
rc
en
ta
ge
 O
ve
rh
ea
d
Jump RA
Create Shadow Stack
Shadow RA
Comparison
Fig. 8: Overhead Breakdown for Compact Register
overhead (the overhead is likely due to the loss of the CPU’s
return value prediction). Maintaining the shadow stack but
leaving the normal return instruction has 1.85% overhead. If
the epilogue jumps through the shadow stack return address,
there is 3.65% overhead, effectively the sum of the return
instruction transformation and maintaining the shadow stack,
as expected. Our experiment highlights an opportunity for
architectural improvement: moving the return stack buffer to
the shadow stack would recover most of the overhead and, due
to the compact design and fixed layout of the shadow stack,
could simplify the management of that buffer and possibly
improve performance.
Our last experiment on SPEC CPU2006 evaluates the over-
head of our three different shadow stack integrity mechanisms.
For these experiments, we used a compact shadow stack with
the register implementation and the fault-based epilogue.
The results are in Figure 10. As expected, the information
hiding scheme is the fastest with 4.31% overhead. The MPX-
based, code centric, isolation scheme was the next fastest
with 12.12% overhead on average. The MPK thread centric,
 0
 5
 10
 15
 20
 25
 30
40
0.p
erl
ben
ch
40
1.b
zip
2
40
3.g
cc
42
9.m
cf
43
3.m
ilc
44
5.g
ob
mk
45
6.h
mm
er
45
8.s
jen
g
46
2.l
ibq
uan
tum
46
4.h
26
4re
f
47
0.l
bm
48
2.s
ph
inx
3
44
4.n
am
d
45
0.s
op
lex
45
3.p
ov
ray
47
1.o
mn
etp
p
47
3.a
sta
r
48
3.x
ala
ncb
mk
geo
me
an
Pe
rc
en
ta
ge
 O
ve
rh
ea
d
Comparison
Fault
LBP
Shadow-RA
Fig. 9: Epilogue Micro-Optimizations
 0
 50
 100
 150
 200
 250
 300
 350
 400
 450
40
0.p
erl
ben
ch
40
1.b
zip
2
40
3.g
cc
42
9.m
cf
43
3.m
ilc
44
5.g
ob
mk
45
6.h
mm
er
45
8.s
jen
g
46
2.l
ibq
uan
tum
46
4.h
26
4re
f
47
0.l
bm
48
2.s
ph
inx
3
44
4.n
am
d
45
0.s
op
lex
45
3.p
ov
ray
47
1.o
mn
etp
p
47
3.a
sta
r
48
3.x
ala
ncb
mk
geo
me
an
Pe
rc
en
ta
ge
 O
ve
rh
ea
d
Information Hiding
MPK
MPX
Fig. 10: Integrity Protection Overhead
isolation scheme had 61.18% overhead. Our finding is in line
with Erim [48] which finds that adding a permission switch to
a direct call increases the number of cycles for the call from
8 to 69. Consequently, we conclude that MPK is serializing
execution, and was not intended for hot path use. MPX has
a code size increase of 41.67% vs 21.24% for MPK. Neither
the MPX nor MPK overhead numbers are acceptable for a
deployed mechanism, highlighting the need for our proposed
privileged move instruction, as per Section V-C.
B. Shadesmar Case Studies
We believe that Shadesmar— a compact, register based
shadow stack that directly uses the shadow RA, and relies
Benchmark Overhead Deviation
sqlite 8.94% 0.22%
flac 1.19% 0.85%
MP3 1.47% 0.28%
wavpack 0.35% 0.15%
crafty 0.84% 0.15%
hmmer 0.28% 0.42%
LZMA 0.84% 0.29%
apache -2.05% 0.40%
minion-graceful 1.18% 0.16%
minion-quasigroup 3.39% 0.13%
TABLE II: Phoronix Benchmark Results
File Size Simultaneous Connections1 4 8
70K - HTML 6.21% 0.63% -0.40%
1.4M - Image 1.13% 0.45% -0.31%
TABLE III: Apache Throughput Reduction
on information hiding to protect the shadow stack — is the
best candidate for adoption by mainline compilers based on
our initial experiments with SPEC CPU2006. Consequently,
we present a more in depth evaluation of Shadesmar here, on
real world applications of interest to potential users of shadow
stacks. Note that information hiding still significantly raises
the bar for attackers by requiring an information leak, and
a write to a region of memory with only one pointer into it
(the shadow stack pointer) to bypass Shadesmar. Shadesmar
exclusively keeps the shadow stack pointer in a register,
making leaking the location of the shadow stack extremely
difficult. Nonetheless, from a security perspective, information
hiding is fundamentally broken as discussed in Section V-B.
We recommend it only because of the resistance to deploying
any protection mechanism with greater than 5% overhead [22].
To demonstrate the usefulness of Shadesmar for real soft-
ware, we run benchmarks from the Phoronix test suite for typ-
ical desktop user experiences, and benchmark the throughput
of the Apache webserver. For all case studies, Shadesmar has
minimal performance impact while greatly increasing security
by removing backward edge control-flow transfers from the
attack surface. In particular, this shows that on modern 64 bit
architectures with 16 general purpose registers, dedicating one
general purpose register to a security mechanism is acceptable
in practice.
Phoronix. We run ten benchmarks from Phoronix with
workloads including databases, audio encoding, data compres-
sion, chess, protein sequencing, and their version of Apache.
These workloads are representative of common workloads for
user space computation. The results are in Table II. The only
benchmark with high overhead is sqlite. Our primary source
of overhead is instrumenting calls and returns, however sqlite
has the same frequency of function calls as other benchmarks.
After further analysis, we attribute the performance difference
to code layout changes (affecting the instruction cache) and
increased register pressure.
For eight of the ten benchmarks, the overhead is less than
2%; for five benchmarks overhead is within 1%; and it is
within measurement noise of zero for two benchmarks. Con-
sequently, we believe that Shadesmar is performant enough
to be deployed in desktop computing environments, and that
users would not notice any slow down.
Apache. To evaluate Shadesmar in server settings, we
benchmarked the throughput of Apache with Shadesmar in-
strumentation. For this experiment, we used two different files,
a 70KB HTML file and a 1.4MB image file, representative of
the average size of webpages in 2016 [49]. The experiment
was run on localhost to minimize measurement noise
from network effects. Throughput was measured over five
minutes using the standard ab tool. Note that the overhead
drops with the number of connections, and file size, and is
non-existent for eight concurrent connections, as shown in
Table III. This demonstrates that Shadesmar has no impact
on the performance of IO bound applications like servers.
VIII. RELATED WORK
Prior work on code-reuse attacks and defenses has focused
on three major areas: (i) offensive papers that seek to fully
evaluate the potential of code-reuse attacks, (ii) CFI defense
mechanisms for mitigating forward edge attacks, and (iii)
shadow stacks for mitigating backward edge attacks.
Code-Reuse Attack Surface. Code-reuse attacks as an
attack vector began with the original ROP attack [10]. Since
then, the research community has worked to fully understand
the scope of this attack vector. Follow on work established that
any indirect control-flow transfer could be used for code-reuse
attacks, not just returns [12], [50]. JIT-ROP [29] showed how
just in time compiled code, like JavaScript, can be abused for
code-reuse attacks. Counterfeit Object Oriented Programming
(COOP) [51] specialized code reuse attacks for C++ programs,
while PIROP [28] shows how to perform ROP in the face of
ASLR. Control Jujustu [52], Control-Flow Bending [18], and
Block-Oriented Programming [53] showed that CFI defenses
cannot prevent code-reuse attacks in general. Newton [54] pro-
vides a framework for analyzing code-reuse defenses’ security.
Side channels are a powerful primitive to attack existing code
reuse defenses, e.g., by widening TOCTTOU windows [55] or
by carefully monitoring reads/writes [56].
Control-Flow Integrity. CFI [1] mitigates forward edge
code-reuse attacks. CFI mechanisms work by using static
analysis to create an over approximation of the control-flow
graph (CFG), and then enforce at runtime that all transitions
must be within the statically computed CFG. After the initial
proposal, follow on research has removed the need for whole
program analysis [57], [58], and specialized CFI to use ad-
ditional information in C++ programs when protecting virtual
calls [59], [60]. To improve the precision of the CFG construc-
tion underlying CFI, more advanced static analysis techniques
have been proposed [61]. Alternately, dynamic analysis-based
approaches that leverage execution history [62], or analyze
execution history on a separate core [63] significantly increase
the precision of CFI, and thereby the security it provides. See
Burow et al. [2] for a survey of CFI techniques.
Alternatives to CFI for forward edge protection have been
proposed. Code Pointer Integrity (CPI) [6] isolates and protects
code pointers, thereby keeping them from being corrupted. CPI
included a proposal for Safe Stacks which rely on a precise
escape analysis for stack variables, and other inter-procedural
analysis to divide the stack into two new stacks: a safe stack
with the return address, and variables that cannot be accessed
through pointers, and an unsafe stack. Safe stacks have sig-
nificant compatibility problems, particularly with unprotected
code and without full program analysis the conservative anal-
ysis ends up allocating a large number of unsafe stack frames,
resulting in unnecessary overhead. CFIXX [42] provides object
type integrity by protecting the virtual table pointers of C++
objects, thereby precisely protecting virtual dispatch.
Shadow Stacks. Prior work is split between binary transla-
tion solutions [64], [65], [21], [66], [67], [68], [69], [70] and
compiler-based solutions [20], [19], [71], [72], [73], [38]. The
binary solutions employ binary rewriting to add trampolines to
the shadow stack instrumentation, and may enforce additional
policies such as CFI, or utilize Intel’s Process Trace (PT)
feature and an additional core to analyze the process trace [70].
Compiler-based solutions come in three flavors: those that
only attempt to prevent stack pivots [71], [72], an attempt to
remove all ROP gadgets from the binary [73], and finally full
shadow stacks [20], [19], which offer the strongest security.
Shadesmar builds on full shadow stacks and introduces a
dedicated shadow stack register to improve performance for
compact shadow stacks, and compatibility by fully supporting
stack unwinding. We also introduce hardware mechanisms to
integrity protect the shadow stack.
IX. CONCLUSION
With the increasing deployment of CFI to protect against
forward-edge attacks, backward-edge defenses are required
to fully mitigate control-flow hijack attacks. We conduct a
qualitative and quantitative study of the design space of
shadow stacks along performance, compatibility, and security
dimensions. Based on this study, we believe that Shadesmar,
a register-based compact shadow stack that is compatible
with all required C/C++ paradigms, should be deployed. We
provide implementations of all known shadow stack schemes,
in addition to Shadesmar, in LLVM 7.0.0 to aid the de-
ployment of shadow stacks. Our case studies on Apache,
where we had no performance impact for real work loads,
and Phoronix where we had less than 2% overhead for 8 of
the 10 benchmarks show the feasibility of using a dedicated
general purpose register for shadow stacks. Orthogonally,
we show that currently no existing hardware mechanism is
usable in practice for intra-process address space isolation,
and propose a new code-centric mechanism to fit this need
for general security monitors that require mutable metadata.
Our design study of shadow stack demonstrates that they have
low performance and memory overhead, support all required
C/C++ paradigms, and fill an important security gap left by
deployed CFI mechanisms against control-flow hijacking. The
source code of our prototype implementations are available at
https://github.com/HexHive/ShadowStack.
ACKNOWLEDGEMENTS
We thank our shepherd Anders Fogh and the anonymous re-
viewers for their insightful comments. This research was sup-
ported by ONR awards N00014-17-1-2513, by CNS-1801601,
and a gift from Intel corporation. Any opinions, findings, and
conclusions or recommendations expressed in this material are
those of the authors and do not necessarily reflect the views
of our sponsors.
REFERENCES
[1] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti, “Control-flow
integrity,” in CCS ’05, 2005.
[2] N. Burow, S. A. Carr, J. Nash, P. Larsen, M. Franz, S. Brunthaler, and
M. Payer, “Control-flow integrity: Precision, security, and performance,”
CSUR, 2017.
[3] “Control flow integrity,” http://clang.llvm.org/docs/ControlFlowIntegrity.
html, 2016.
[4] “Control flow guard (windows),” https://msdn.microsoft.com/en-us/
library/windows/desktop/mt637065(v=vs.85).aspx, 2016.
[5] C. Cowan, C. Pu, D. Maier, J. Walpole, P. Bakke, S. Beattie, A. Grier,
P. Wagle, Q. Zhang, and H. Hinton, “Stackguard: Automatic adaptive
detection and prevention of buffer-overflow attacks.” in SEC ’98, 1998.
[6] V. Kuznetsov, L. Szekeres, M. Payer, G. Candea, R. Sekar, and D. Song,
“Code-pointer integrity,” in OSDI ’14, 2014.
[7] J. Xu, Z. Kalbarczyk, S. Patel, and R. K. Iyer, “Architecture support for
defending against buffer overflow attacks,” https://www.ideals.illinois.
edu/handle/2142/74493, 2002.
[8] A. One, “Smashing the stack for fun and profit,” Phrack magazine, 1996.
[9] Nergal, “The advanced return-to-lib(c) exploits: Pax case study,” Phrack
magazine, 2001.
[10] H. Shacham, “The geometry of innocent flesh on the bone: Return-into-
libc without function calls (on the x86),” in CCS ’07, 2007.
[11] R. Roemer, E. Buchanan, H. Shacham, and S. Savage, “Return-oriented
programming: Systems, languages, and applications,” TISSEC, 2012.
[12] S. Checkoway, L. Davi, A. Dmitrienko, A.-R. Sadeghi, H. Shacham, and
M. Winandy, “Return-oriented programming without returns,” in CCS
’10, 2010.
[13] https://googleprojectzero.blogspot.com/2017/12/apacolypse-now-
exploiting-windows-10-in 18.html.
[14] https://googleprojectzero.blogspot.com/2016/09/return-to-libstagefright-
exploiting.html.
[15] https://googleprojectzero.blogspot.com/2017/07/trust-issues-exploiting-
trustzone-tees.html.
[16] https://googleprojectzero.blogspot.com/2017/02/attacking-windows-
nvidia-driver.html.
[17] https://googleprojectzero.blogspot.com/2017/02/lifting-hyper-visor-
bypassing-samsungs.html.
[18] N. Carlini, A. Barresi, M. Payer, D. Wagner, and T. R. Gross, “Control-
flow bending: On the effectiveness of control-flow integrity,” in SEC’15,
2015.
[19] T. H. Dang, P. Maniatis, and D. Wagner, “The performance cost of
shadow stacks and stack canaries,” in AsiaCCS ’15, 2015.
[20] T.-c. Chiueh and F.-H. Hsu, “Rad: A compile-time solution to buffer
overflow attacks,” in ICDCS ’01, 2001.
[21] L. Davi, A.-R. Sadeghi, and M. Winandy, “Ropdefender: A detection
tool to defend against return-oriented programming attacks,” in AsiaCCS
’11, 2011.
[22] L. Szekeres, M. Payer, T. Wei, and D. Song, “Sok: Eternal war in
memory,” in SP ’13, 2013.
[23] https://github.com/Microsoft/MSRC-Security-Research/blob/master/
presentations/2018 02 OffensiveCon/The%20Evolution%20of%
20CFI%20Attacks%20and%20Defenses.pdf.
[24] C. Lattner and V. Adve, “Llvm: A compilation framework for lifelong
program analysis & transformation,” in CGO, 2004.
[25] LLVM, https://clang.llvm.org/docs/ShadowCallStack.html.
[26] M. Corporation, “A detailed description of the data execution prevention
(dep) feature in windows xp service pack 2, windows xp tablet pc edition
2005, and windows server 2003,” https://support.microsoft.com/en-us/
kb/875352, 2013.
[27] P. Team, “Pax address space layout randomization (aslr),” 2003.
[28] E. Goktas, B. Kollenda, P. Koppe, E. Bosman, G. Portokalidis, T. Holz,
H. Bos, and C. Giuffrida, “Position-independent code reuse: On the
effectiveness of aslr in the absence of information disclosure,” in
EuroSP’18, 2018.
[29] K. Z. Snow, F. Monrose, L. Davi, A. Dmitrienko, C. Liebchen, and A.-R.
Sadeghi, “Just-in-time code reuse: On the effectiveness of fine-grained
address space layout randomization,” in SP ’13, 2013.
[30] https://software.intel.com/sites/default/files/managed/4d/2a/control-
flow-enforcement-technology-preview.pdf.
[31] S. Nagarakatte, M. M. K. Martin, and S. Zdancewic, “Watchdog:
Hardware for safe and secure manual memory management and full
memory safety,” in ISCA ’12, 2012.
[32] S. Nagarakatte, M. M. Martin, and S. Zdancewic, “Watchdoglite:
Hardware-accelerated compiler-based pointer checking,” in CGO ’14,
2014.
[33] B. Gras, K. Razavi, E. Bosman, H. Bos, and C. Giuffrida, “Aslr on the
line: Practical cache attacks on the mmu,” NDSS ’17, 2017.
[34] E. Go¨ktas¸, R. Gawlik, and B. Kollenda, “Undermining information
hiding (and what to do about it),” in SEC ’16, 2016.
[35] R. Gawlik, B. Kollenda, P. Koppe, B. Garmany, and T. Holz, “Enabling
client-side crash-resistance to overcome diversification and information
hiding.” in NDSS ’16, 2016.
[36] A. Oikonomopoulos, E. Athanasopoulos, H. Bos, and C. Giuffrida,
“Poking holes in information hiding.” in SEC ’16, 2016.
[37] I. Evans, S. Fingeret, J. Gonza´lez, U. Otgonbaatar, T. Tang, H. Shrobe,
S. Sidiroglou-Douskos, M. Rinard, and H. Okhravi, “Missing the point
(er): On the effectiveness of code pointer integrity,” in SP ’15, 2015.
[38] P. Zieris and J. Horsch, “A leak-resilient dual stack scheme for
backward-edge control-flow integrity,” in AsiaCCS ’18, 2018.
[39] D. Sehr, R. Muth, C. Biffle, V. Khimenko, E. Pasko, K. Schimpf, B. Yee,
and B. Chen, “Adapting software fault isolation to contemporary cpu
architectures.” in SEC ’10, 2010.
[40] S. McCamant and G. Morrisett, “Evaluating sfi for a cisc architecture.”
in SEC ’06, 2006.
[41] B. Yee, D. Sehr, G. Dardyk, J. B. Chen, R. Muth, T. Ormandy,
S. Okasaka, N. Narula, and N. Fullagar, “Native client: A sandbox for
portable, untrusted x86 native code,” in SP ’09, 2009.
[42] N. Burow, D. McKee, S. A. Carr, and M. Payer, “Cfixx: Object type
integrity for c++ virtual dispatch,” in NDSS’18, 2018.
[43] Y. Jeon, P. Biswas, S. A. Carr, B. Lee, and M. Payer, “Hextype: Efficient
detection of type confusion errors for c++,” in CCS ’17, 2017.
[44] B. Lee, C. Song, Y. Jang, T. Wang, T. Kim, L. Lu, and W. Lee,
“Preventing use-after-free with dangling pointers nullification.” in NDSS
’15, 2015.
[45] U. Dhawan, C. Hritcu, R. Rubin, N. Vasilakis, S. Chiricescu, J. M.
Smith, T. F. Knight Jr, B. C. Pierce, and A. DeHon, “Architectural
support for software-defined metadata processing,” in Proceedings of
the International Conference on Architectural Support for Programming
Languages and Operating Systems, 2015.
[46] https://www.darpa.mil/program/system-security-integration-through-
hardware-and-firmware.
[47] N. Roessler and A. DeHon, “Protecting the stack with metadata policies
and tagged hardware,” in 2018 IEEE Symposium on Security and Privacy
(SP), 2018.
[48] A. Vahldiek-Oberwagner, E. Elnikety, D. Garg, and P. Druschel, “Erim:
Secure and efficient in-process isolation with memory protection keys,”
arXiv preprint arXiv:1801.06822, 2018.
[49] https://www.keycdn.com/support/the-growth-of-web-page-size/.
[50] T. Bletsch, X. Jiang, V. W. Freeh, and Z. Liang, “Jump-oriented
programming: a new class of code-reuse attack,” in CCS ’11, 2011.
[51] F. Schuster, T. Tendyck, C. Liebchen, L. Davi, A.-R. Sadeghi, and
T. Holz, “Counterfeit object-oriented programming: On the difficulty
of preventing code reuse attacks in c++ applications,” in SP ’15, 2015.
[52] I. Evans, F. Long, U. Otgonbaatar, H. Shrobe, M. Rinard, H. Okhravi,
and S. Sidiroglou-Douskos, “Control jujutsu: On the weaknesses of fine-
grained control flow integrity,” in CCS ’15, 2015.
[53] K. Ispoglou, B. AlBassam, T. Jaeger, and M. Payer, “Block Oriented
Programming: Automating Data-Only Attacks,” in ACM CCS, 2018.
[54] V. van der Veen, D. Andriesse, M. Stamatogiannakis, X. Chen, H. Bos,
and C. Giuffrdia, “The dynamics of innocent flesh on the bone: Code
reuse ten years later,” in CCS ’17, 2017.
[55] T. Allan, B. B. Brumley, K. Falkner, and J. van de Pol, “Amplifying
side channels through performance degradation,” in ACSAC ’16, 2016.
[56] M. Jurczyk and G. Coldwind, “Identifying and exploiting
windows kernel race conditions via memory access patterns,”
https://static.googleusercontent.com/media/research.google.com/en/
/pubs/archive/42189.pdf, 2013.
[57] B. Niu and G. Tan, “Modular control-flow integrity,” in PLDI ’14, 2014.
[58] ——, “Per-input control-flow integrity,” in CCS ’15, 2015.
[59] C. Zhang, S. A. Carr, T. Li, Y. Ding, C. Song, M. Payer, and D. Song,
“Vtrust: Regaining trust on virtual calls,” in Symposium on Network and
Distributed System Security (NDSS), 2016.
[60] D. Bounov, R. Kici, and S. Lerner, “Protecting c++ dynamic dispatch
through vtable interleaving,” in Annual Network and Distributed System
Security Symposium (NDSS), 2016.
[61] X. Fan, Y. Sui, X. Liao, and J. Xue, “Boosting the precision of virtual
call integrity protection with partial pointer analysis for c++,” in ISSTA
’17, 2017.
[62] V. van der Veen, D. Andriesse, E. Go¨ktas¸, B. Gras, L. Sambuc,
A. Slowinska, H. Bos, and C. Giuffrida, “Practical context-sensitive cfi,”
in CCS ’15, 2015.
[63] R. Ding, C. Qian, C. Song, B. Harris, T. Kim, and W. Lee, “Efficient
protection of path-sensitive control security,” in CCS ’17, 2017.
[64] M. Prasad and T.-c. Chiueh, “A binary rewriting defense against stack
based buffer overflow attacks.” in ATC ’03, 2003.
[65] U. Erlingsson, M. Abadi, M. Vrable, M. Budiu, and G. C. Necula, “Xfi:
Software guards for system address spaces,” in OSDI’06, 2006.
[66] M. Payer and T. R. Gross, “Fine-grained user-space security through
virtualization,” in ACM VEE: Conference on Virtual Execution Environ-
ments, 2011.
[67] M. Payer, T. Hartmann, and T. R. Gross, “Safe Loading - A Foundation
for Secure Execution of Untrusted Programs,” in IEEE Symposium on
Security and Privacy, 2012.
[68] R. Qiao, M. Zhang, and R. Sekar, “A principled approach for rop
defense,” in ACSAC ’15, 2015.
[69] M. Payer, A. Barresi, and T. R. Gross, “Fine-Grained Control-Flow
Integrity through Binary Hardening,” in DIMVA, 2015.
[70] X. Ge, W. Cui, and T. Jaeger, “Griffin: Guarding control flows using
intel processor trace,” in ASPLOS ’17, 2017.
[71] A. Quach, M. Cole, and A. Prakash, “Supplementing modern software
defenses with stack-pointer sanity,” in ACSAC ’17, 2017.
[72] A. Prakash and H. Yin, “Defeating rop through denial of stack pivot,”
in ACSAC ’15, 2015.
[73] K. Onarlioglu, L. Bilge, A. Lanzi, D. Balzarotti, and E. Kirda, “G-free:
defeating return-oriented programming through gadget-less binaries,” in
ACSAC ’10, 2010.
