A popular run-time attack technique is to compromise the control-flow integrity of a program by modifying function return addresses on the stack. So far, shadow stacks have proven to be essential for comprehensively preventing return address manipulation. Shadow stacks record return addresses in integrity-protected memory secured with hardware assistance or software access control. Software shadow stacks incur high overheads or trade off security for efficiency. Hardware-assisted shadow stacks are efficient and secure, but require the deployment of special-purpose hardware.
INTRODUCTION
Traditional code-injection attacks are ineffective in the presence of W⊕X policies that prevent the modification of executable memory [50]. However, code-reuse attacks can alter the run-time behavior of a program without modifying any of its executable code sections. Return-oriented programming (ROP) is a prevalent attack technique that corrupts function return addresses to hijack a program's control flow. ROP can be used to achieve Turing-complete computation by chaining together existing code sequences in the victim program. To prevent ROP, return addresses must be protected when stored in memory. At present, the most powerful protection against ROP is an integrity-protected shadow stack that maintains a secure reference copy of each return address [1]. Integrity of the shadow stack is ensured by making it inaccessible to the adversary, either by randomizing its location in memory or by using specialized hardware [28]. Recent software-based shadow stacks show reasonable performance [11], but are vulnerable to an adversary capable of exploiting memory vulnerabilities to infer the location of the shadow stack. To date, only hardware-assisted schemes, such as Intel CET [28], achieve negligible overhead without any security trade-offs. But employing such a custom hardware mechanism incurs a development and deployment cost.
Recent ARM processors include support for general-purpose pointer authentication (PA), a hardware extension that uses tweakable message authentication codes (MACs) to sign and verify pointers [2]. One initial use case of PA is the authentication of return addresses [47]. However, current PA schemes are vulnerable to reuse attacks, where the adversary can reuse previously observed valid protected pointers [35]. Prior work [35, 47] and current implementations by GCC and LLVM mitigate reuse attacks, but cannot completely prevent them.
In this paper, we propose a new approach, authenticated call stack (ACS), providing security comparable to hardware-assisted shadow stacks, with minimal overhead and without requiring new hardware-protected memory. ACS binds all return addresses into a chain of MACs that allow verification of return addresses before their use. We show how ACS can be efficiently realized using ARM PA while resisting reuse attacks. The resulting system, PACStack, can withstand strong adversaries with full memory access. Our contributions are:
• ACS, a new approach for precise verification of function return addresses by chaining MACs (Section 5).
• PACStack, an LLVM-based realization of ACS using ARM PA without requiring additional hardware (Section 6).
• A systematic evaluation of PACStack security, showing that its security is comparable to shadow stacks (Section 7).
• A demonstration that the performance overhead of PACStack is negligible (<1%) (Section 8).
For realizing PACStack, we implemented an efficient authenticated stack using ARM PA. This approach may be generalizable to other data structures and applications (Section 10.1). We plan to make our PACStack implementation and associated evaluation code available as open source.
BACKGROUND
2.1 ROP on ARM
In ROP, the adversary exploits a memory vulnerability to manipulate return addresses stored on the stack, thereby altering the program's backward-edge control flow. ROP allows Turing-complete attacks by chaining together multiple gadgets, i.e., adversary-chosen sequences of pre-existing program instructions that together perform the desired operations. ARM processors use the link register (LR) to hold the current function's return address. LR is automatically set by the branch with link (bl) or branch with link to register (blr) instructions that are used to implement regular and indirect function calls. Because LR is overwritten on call, non-leaf functions must store the return address onto the stack. This opens up the possibility of ROP on ARM architectures [30].
Figure 1: PA uses an embedded authentication token based on the pointer's address, a modifier, and a key.
ARM Pointer Authentication
The ARMv8.3-A PA extension supports calculating and verifying pointer authentication codes (PACs) [2]. A pac instruction calculates a keyed tweakable MAC, H_K(A_P, M), over the address A_P of a pointer P using a 64-bit modifier M as the tweak. The resulting authentication token, referred to as a PAC, is embedded into the unused high-order bits of P. It can be verified using an aut instruction that recalculates H_K(A_P, M) and compares the result to P's PAC.
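The pac / aut behavior described above can be sketched in software. This is a minimal model, not the hardware algorithm: HMAC-SHA256 stands in for the tweakable MAC, the 39-bit address and 16-bit PAC widths are taken from the Linux configuration discussed in this section, and the names `mac`, `pac`, `aut`, and `ERROR_BIT` (including the bit position chosen for the error flag) are illustrative assumptions.

```python
import hashlib
import hmac

VA_SIZE = 39    # address bits (assumed, matching the Linux default cited in the text)
PAC_BITS = 16   # PAC width for that configuration
ADDR_MASK = (1 << VA_SIZE) - 1
PAC_FIELD = (1 << PAC_BITS) - 1
ERROR_BIT = 1 << (VA_SIZE + PAC_BITS - 2)  # illustrative position only


def mac(key: bytes, address: int, modifier: int) -> int:
    """Stand-in for H_K(A_P, M): HMAC-SHA256 truncated to PAC_BITS."""
    msg = address.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "little") & PAC_FIELD


def pac(key: bytes, pointer: int, modifier: int) -> int:
    """Embed the PAC in the unused bits directly above the address."""
    address = pointer & ADDR_MASK
    return address | (mac(key, address, modifier) << VA_SIZE)


def aut(key: bytes, signed: int, modifier: int) -> int:
    """Return the plain address if the PAC matches, else an invalid pointer."""
    address = signed & ADDR_MASK
    if ((signed >> VA_SIZE) & PAC_FIELD) == mac(key, address, modifier):
        return address
    return address | ERROR_BIT  # PAC stripped, one high-order bit set


KEY = bytes(16)  # placeholder; real keys live in kernel-managed registers
signed = pac(KEY, 0x4005D0, 0x7FFFF000)
```

Verifying `signed` with the same modifier yields the original address; flipping any PAC bit yields a pointer with the error bit set, which models the translation fault that follows when the pointer is dereferenced.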
Since the PAC is stored in unused bits of a pointer, its size is limited by the virtual address size (VA_SIZE in Figure 1) and whether address tagging is enabled [2]. On a 64-bit ARM machine running a default Linux kernel, VA_SIZE is 39, which leaves 16 bits for the PAC when excluding the reserved and address tag bits. PA provides five different keys: two for code pointers, two for data pointers, and one for generic use. Each key has a separate set of instructions³, e.g., the autia and pacia instructions always operate on the instruction key A, stored in the APIAKey_EL1 register. Access to the key registers and PA configuration registers can be restricted to a higher exception level (EL). Linux v5.0 adds full support for PA, such that the kernel (at EL1) manages user-space (EL0) keys and prevents EL0 from modifying them.
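The arithmetic behind the 16-bit figure can be written out as a small helper. The sketch assumes one reserved bit (bit 55) and an 8-bit address tag in the top byte; `pac_bits` is a hypothetical name used only for this illustration.

```python
def pac_bits(va_size: int, tagging: bool = True) -> int:
    """Pointer bits left over for the PAC in a 64-bit pointer."""
    reserved = 1                 # assumed: bit 55 selects the address range
    tag = 8 if tagging else 0    # assumed: top byte holds the address tag
    return 64 - va_size - reserved - tag
```

With `va_size=39` and tagging enabled this gives the 16 bits quoted above; disabling tagging frees the top byte and widens the PAC accordingly.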
As currently specified, PA does not cause a fault on verification failure; instead, it strips the PAC from the pointer P and flips one of the high-order bits such that P becomes invalid. If the invalid pointer is used by an instruction that causes the pointer to be translated, such as a load or an instruction fetch that dereferences the pointer, the memory management unit issues a memory translation fault.
PA also supports the generic pacga instruction, which outputs a 32-bit PAC based on a 64-bit input value and a 64-bit modifier. There is no corresponding verification instruction. To verify the pacga PAC, instrumented code must explicitly compare it to the expected value.
2.2.1 PA-based return address protection. Return address protection is the first published PA-based control-flow protection [47]. It is implemented as the -msign-return-address feature of GCC and LLVM/Clang. An authenticated return address is computed using paciasp (❶ in Listing 1) and verified with autiasp (❷ in Listing 1). These instructions implicitly use the value of the stack pointer (SP) as the modifier. An adversary cannot create the correct PAC for an arbitrary pointer and therefore cannot modify the return address without causing a fault on function return.
Listing 1: The -msign-return-address feature in GCC and LLVM/Clang uses PA to sign and verify the return address in LR when storing and loading it from the stack.
The -msign-return-address feature and other prior PA-based solutions are vulnerable to reuse attacks where an adversary replaces a valid authenticated return address with another authenticated return address previously read from the process' memory. For a reused PAC to pass verification, both the original and replacement PAC must have been computed using the same PA key and modifier. This applies to any PA scheme, not only authenticated return addresses. For instance, if a constant modifier is used, then all pointers based on the same key are interchangeable. Using only the SP value as a modifier reduces the set of interchangeable pointers, but still allows reuse attacks when SP values coincide. Reuse attacks can be mitigated, but not completely prevented, by further narrowing the scope of modifier values [35].
³ A full list of PA instructions from [35] is available in Appendix D.
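The reuse condition (same key, same modifier) can be made concrete with a small model. This is a sketch under assumptions: HMAC-SHA256 truncated to 16 bits stands in for the PAC, and the addresses, the SP value, and the helper names `sign` and `verifies` are made up for the example.

```python
import hashlib
import hmac

KEY = bytes(16)            # placeholder PA key
SHIFT, FIELD = 39, 0xFFFF  # assumed 39-bit address, 16-bit PAC


def sign(ret: int, modifier: int) -> int:
    """Model of paciasp: PAC over the return address, SP as modifier."""
    msg = ret.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    tag = int.from_bytes(digest[:2], "little") & FIELD
    return ret | (tag << SHIFT)


def verifies(signed: int, modifier: int) -> bool:
    """Model of autiasp: recompute the PAC and compare."""
    ret = signed & ((1 << SHIFT) - 1)
    return sign(ret, modifier) == signed


SP = 0x7FFFF000                   # two calls that happen to use the same SP
lr_from_a = sign(0x400100, SP)    # legitimate return into caller A
lr_from_b = sign(0x400200, SP)    # harvested earlier from a call by B

stack_slot = lr_from_a            # value spilled by the callee
stack_slot = lr_from_b            # adversary overwrites it with the harvested one
reused_ok = verifies(stack_slot, SP)
```

Because both pointers were signed under the same key and the same SP value, the substituted pointer verifies and the function returns into B. Narrowing the modifier shrinks the set of interchangeable pointers but does not remove this condition.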
ADVERSARY MODEL
In this work, we consider a powerful adversary, A, with arbitrary control of process memory but restricted by a W⊕X policy. Therefore A can read all process memory, but write operations and execution are restricted such that A can neither modify program code nor execute memory pages reserved for data (e.g., the program stack). This adversary model is consistent with prior work on run-time attacks [50].
These abilities allow A to modify any pointer in the process data memory pages. In particular, A can modify function return addresses while they reside on the program call stack.
In this work, we exclude adversaries with kernel mode privilege escalation capabilities, i.e., A cannot undermine kernel integrity or confidentiality. As a consequence, A cannot modify or read sensitive data in kernel memory or kernel-managed registers, such as the PA keys. As in prior work on control-flow integrity (CFI), we do not consider non-control data attacks [13], such as data-oriented programming (DOP) [26].
REQUIREMENTS & ASSUMPTIONS
Our goal is to thwart A who modifies function return addresses on the call stack in order to hijack the program control flow. We define the following requirements for our solution:
R1 Return address integrity: Detect if a function return address has been modified while in program memory.
R2 Memory disclosure tolerance: Remain effective even when A can read the entire process address space.
R3 Compatibility: Be applicable to typical (standards-compliant) C code, without requiring source code modifications.
R4 Performance: Impose only minimal run-time performance and memory overhead, while meeting R1-R3.
We make the following assumptions about the system:
A1 A W⊕X policy that protects code memory pages from modification by non-privileged processes. W⊕X is today supported by all major processor architectures, including ARMv8-A.
A2 Coarse-grained forward-edge CFI. We assume that ACS is combined with a CFI solution that restricts forward control-flow transfers to a set of valid targets. Specifically, we assume that indirect function calls always target the beginning of a function and that indirect jumps to arbitrary addresses are infeasible. This property can be satisfied by several preexisting software-only CFI solutions with reasonable overhead [1, 18, 31, 37], as well as with negligible overhead by using hardware-assisted mechanisms like ARM PA itself [35], branch target indicators [2], or TrustZone-M [5, 40].
Coarse-grained forward-edge CFI (A2) and W⊕X (A1) are used to prevent A from tampering with the instrumentation that maintains the ACS, as discussed in Section 7.2.
DESIGN: AUTHENTICATED CALL STACK
In this section we present our general design for ACS, not tied to a particular hardware-assisted mechanism. In Section 6, we present our implementation that efficiently realizes ACS using PA. While PA approximates pointer integrity, it falls short when the modifier is not unique to a pointer. Our key idea is to provide a modifier for the return address by cryptographically binding it to all previous return addresses in the call stack. This makes the modifier statistically unique to a particular control-flow path, thus preventing reuse-type attacks and allowing precise verification of return addresses.
Recall that on ARM systems, the return address is initially stored in LR, which cannot be manipulated by A (Section 2.1). However, non-leaf functions need to store their return address on the stack before invoking a nested function. The return addresses of active function records must thus always be stored on the stack, where A can modify them by exploiting memory vulnerabilities. ACS protects these values by computing a series of chained authentication tokens auth_i, i ∈ [0, n], that cryptographically bind the latest auth_n to all return addresses ret_i, i ∈ [0, n − 1], stored on the stack (Figure 2). Only the MAC key and the last authentication token auth_n must be stored securely to ensure that previous auth tokens and return addresses can be correctly verified when unwinding the call stack. We use a tweakable MAC function H_K to generate a b-bit authentication token auth_i:
auth_i = H_K(ret_i, auth_{i−1})
auth_n is maintained in a register unmodifiable by A. Figure 3 shows how authentication tokens and return addresses are stored on the call stack. On function calls, auth_i is retained across the call to the callee, which calculates auth_{i+1} and stores both auth_i and the corresponding return address ret_{i+1} on its stack frame. On return, auth_i and ret_{i+1} are loaded from the stack and verified against auth_{i+1} before use.
Figure 4: To maintain the integrity of ACS, the last authentication token is maintained and retained through function calls in the designated CR. The notation x′ indicates that x is read from the stack and may have been compromised.
Authenticated return addresses
We can avoid the need to maintain separate auth and ret values by defining a combined authenticated return address:
aret_i = auth_i ∥ ret_i, where auth_i = H_K(ret_i, aret_{i−1})
We call auth_i and the corresponding aret_i valid if auth_i equals H_K(ret_i, aret_{i−1}) for some given aret_{i−1}. In this variant, not only the current authentication token, but also the current return address is securely stored. Because the plain return address ret_i is never stored on the stack, A is limited to manipulating the earlier authenticated return addresses on the stack, i.e., aret_i, i ∈ [0, n−1]. A compromised authenticated return address must therefore pass two authentications before use: first when being restored from the stack, and second, when being used as the target of a function return. We discuss the security properties in Section 7.
The remainder of Section 5 will focus on aret, but unless otherwise noted, similar properties also apply for separate auth tokens.
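The aret chain can be exercised with a short simulation. It is a sketch under assumptions rather than the authors' implementation: HMAC-SHA256 truncated to b = 16 bits stands in for H_K, each aret is modelled as a token placed above a 39-bit return address, the chain starts from an assumed seed of 0, and the class and helper names are hypothetical.

```python
import hashlib
import hmac

B = 16                      # token width b
ADDR_BITS = 39              # assumed address width
ADDR_MASK = (1 << ADDR_BITS) - 1


def H(key: bytes, ret: int, modifier: int) -> int:
    """Stand-in for the tweakable MAC H_K(ret, modifier)."""
    msg = ret.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << B) - 1)


class AuthenticatedCallStack:
    def __init__(self, key: bytes, seed: int = 0):
        self.key = key
        self.head = seed    # aret_n, kept in a register the adversary cannot write
        self.memory = []    # stack memory, readable and writable by the adversary

    def call(self, ret: int) -> None:
        """Function entry: spill aret_{i-1}, then derive aret_i from it."""
        self.memory.append(self.head)
        self.head = (H(self.key, ret, self.head) << ADDR_BITS) | ret

    def ret(self) -> int:
        """Function return: reload aret_{i-1}' and verify aret_i against it."""
        prev = self.memory.pop()
        target = self.head & ADDR_MASK
        if (self.head >> ADDR_BITS) != H(self.key, target, prev):
            raise RuntimeError("authentication failure")
        self.head = prev
        return target


def tamper_detected(flip: int) -> bool:
    """Flip bits in the spilled aret and report whether the return faults."""
    acs = AuthenticatedCallStack(bytes(16))
    acs.call(0x400100)
    acs.call(0x400200)
    acs.memory[-1] ^= flip
    try:
        acs.ret()
    except RuntimeError:
        return True
    return False
```

Only `head` needs protection. Everything in `memory` may be overwritten, yet a modified value changes the modifier used to check the register-resident aret, so the check fails except with probability about 2^{−b}.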
Securing the authentication token
The current authenticated return address aret_n is secured by keeping it exclusively in a CPU register. On processors with a dedicated link register, LR can be used to store aret; otherwise an additional register must be reserved for this purpose. aret must be securely retained across function calls, which overwrite LR. This is done by modifying the calling convention such that aret is kept in a specific register which we call a chain register (CR) (Figure 4).
ACS protects the integrity of backward-edge control-flow transfers. Combined with coarse-grained forward-edge CFI (Assumption A2), it ensures that: 1) immediately after function return, the aret_n in CR is valid, 2) at function entry the aret_{n−1} stored in CR is valid, and 3) LR is always used as or set to a valid aret. This ensures that token updates are done securely, and that the ACS instrumentation cannot be bypassed or used to generate arbitrary authenticated return addresses.
Mitigation of hash-collisions: authentication token masking
Though aret_n is protected by hardware, the fact that it is embedded in the return pointer means that the size b of the authentication token auth is limited by the pointer address size. This is significant, as collisions can be found after A has seen, on average, approximately 1.253 · 2^{b/2} tokens [48, Section 1.4.2] (e.g., 321 tokens for b = 16). Despite this, we can still prevent A from recognizing collisions, thus forcing A to guess which authenticated return addresses yield a collision, succeeding with probability 2^{−b}. The auth of any aret stored on the stack is masked using a pseudo-random value derived from the previous aret value:
aret_i = (H_K(ret_i, aret_{i−1}) ⊕ H_K(0, aret_{i−1})) ∥ ret_i
The mask H_K(0, aret_{i−1}) is exclusive-OR-ed with the token H_K(ret_i, aret_{i−1}) after the token is generated and before it is authenticated, thereby preventing A from identifying opportunities for pointer reuse.
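Masking and unmasking can be sketched as follows, assuming the masked token is H_K(ret_i, aret_{i−1}) XOR H_K(0, aret_{i−1}), with HMAC-SHA256 truncated to 16 bits standing in for H_K. The function names and the example values are illustrative.

```python
import hashlib
import hmac

B = 16
ADDR_BITS = 39              # assumed address width
ADDR_MASK = (1 << ADDR_BITS) - 1


def H(key: bytes, ret: int, modifier: int) -> int:
    """Stand-in for the tweakable MAC H_K(ret, modifier)."""
    msg = ret.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << B) - 1)


def mask_aret(key: bytes, ret: int, prev: int) -> int:
    """Generate the token, then hide it under the mask H_K(0, prev)."""
    token = H(key, ret, prev) ^ H(key, 0, prev)
    return (token << ADDR_BITS) | ret


def unmask_and_verify(key: bytes, aret: int, prev: int):
    """Remove the mask, then authenticate; None models a failed check."""
    ret = aret & ADDR_MASK
    token = (aret >> ADDR_BITS) ^ H(key, 0, prev)
    return ret if token == H(key, ret, prev) else None


KEY = bytes(16)             # placeholder key
PREV = 0x1234_0040_0100     # some earlier aret value, made up for the example
stored = mask_aret(KEY, 0x400200, PREV)
```

Two arets whose unmasked tokens collide no longer look alike in memory unless their masks also coincide, so harvesting values from the stack does not reveal which pairs are interchangeable.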
Mitigation of brute-force guessing: re-seeding authentication token chain
A brute force attack against PA where A guesses a PAC correctly succeeds with probability p for a b-bit PAC after log(1 − p) / log(1 − 2^{−b}) guesses, on the assumption that an authentication attempt with an incorrect PAC terminates the program and subsequent program runs receive a new, random set of PA keys [35] (the current behavior in Linux 5.0).
However, pre-forked or multi-threaded programs currently share the PA key between the parent and sibling processes / threads. This could allow A to target a vulnerability in a sibling, and unless a failed authentication terminates the entire process tree that shares the PA key, A can attempt a new guess against another sibling process. In this scenario, 2^{b−1} guesses on average are enough to guess a b-bit PAC [35].
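The two guessing regimes can be compared numerically. With fresh keys after every failure, each guess is an independent trial with success probability 2^{−b}, so reaching overall success probability p after g guesses means p = 1 − (1 − 2^{−b})^g. With a shared key, wrong guesses can be ruled out one by one. The function names below are hypothetical.

```python
import math


def guesses_fresh_keys(p: float, b: int) -> float:
    """Guesses needed for success probability p when every failure re-keys."""
    return math.log(1.0 - p) / math.log1p(-(2.0 ** -b))


def guesses_shared_key_avg(b: int) -> int:
    """Average guesses when siblings keep sharing one key (2^(b-1))."""
    return 2 ** (b - 1)


fresh = guesses_fresh_keys(0.5, 16)
shared = guesses_shared_key_avg(16)
```

For b = 16, reaching a 50% success chance against fresh keys takes roughly 45,000 guesses, each of which crashes the target, while the shared-key average is 32,768.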
Multi-threaded applications are also affected, since address translation errors due to PAC authentication failures are delivered in Linux via the SIGSEGV signal, which is always directed against the offending thread, and the thread cannot change the signal's disposition such that it would not be delivered.
Liljestrand et al. [35] recommend hardening pre-forking and multi-threaded applications against guessing attacks by having the application restart all of its processes if the number of PAC failures in child processes exceeds a pre-defined threshold. Since ACS does not exhibit false positives in a typical program (a corrupt return address is a strong indication that the process is being subject to a run-time attack), we recommend an alternative mitigation specific to ACS: "re-seeding" the auth calculation after a fork or thread creation, for example by calculating auth_0 = H_K(ret_0, pid/tid), where pid / tid is the process or thread ID, or any other value unique to the task. This solution is straightforward to apply to threads, as a return from the function starting the thread causes the thread to exit. Therefore, the ACS for the thread stacks can be made disjoint from the main ACS chain. Forked processes may include auth tokens generated by the parent process in stack frames inherited from the parent. If a child process never returns to inherited stack frames, re-seeding any new auth tokens beyond the point of the fork is sufficient. However, if the child process returns to inherited stack frames, the ACS must be re-seeded starting from auth_0 by rewriting any auth tokens in pre-existing stack frames, similar to some stack canary re-randomization schemes [24, 45].
#include <setjmp.h>
Listing 2: setjmp / longjmp allows the programmer to transfer execution to another location, potentially in another function. The location, and the state of the environment after the transfer, is determined by an in-memory buffer containing the calling environment of a previous setjmp call. Calling longjmp after the calling environment is destroyed results in undefined behavior.
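Returning to the re-seeding mitigation of Section 5.4, its effect can be illustrated by starting the auth chain from a per-task value. This is a sketch under assumptions: HMAC-SHA256 truncated to 32 bits stands in for H_K, the thread IDs and return addresses are made up, and `chain_head` is a hypothetical helper.

```python
import hashlib
import hmac

B = 32           # assumed token width for this illustration
KEY = bytes(16)  # placeholder key shared by all sibling tasks


def H(key: bytes, ret: int, modifier: int) -> int:
    """Stand-in for the tweakable MAC H_K(ret, modifier)."""
    msg = ret.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << B) - 1)


def chain_head(key: bytes, seed: int, path) -> int:
    """auth_n for a call path, with auth_0 = H_K(ret_0, seed)."""
    auth = seed
    for ret in path:
        auth = H(key, ret, auth)
    return auth


PATH = [0x400100, 0x400200, 0x400300]   # identical call path in every thread
heads = {tid: chain_head(KEY, tid, PATH) for tid in range(1001, 1009)}
```

Although all eight tasks share the key and follow the same call path, their chains diverge from the first token onward, so a token harvested or guessed in one sibling is not expected to validate in another.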
Irregular stack unwinding
The C standard includes the setjmp / longjmp programming interface, which can be used to add exception-like functionality to C (Listing 2). The longjmp C function executes a non-local jump to a prior calling environment stored using the setjmp function. At setjmp, callee-saved registers (whose values are guaranteed to persist through function invocations), as well as the stack pointer SP and return address, are stored in the given jmp_buf buffer (➀ in Listing 2). setjmp returns 0 to indicate that execution is continuing directly after the call. Upon executing longjmp, the environment is restored from jmp_buf (➂); program execution continues at the setjmp return site with a non-zero value (➁). Calling longjmp using an expired buffer, i.e., after the corresponding setjmp caller has returned (➃), results in undefined behavior (the implications of this are discussed in Section 10.2). Because jmp_buf also stores the latest authenticated token, ACS needs a mechanism to ensure its integrity when using setjmp and longjmp.
Listing 3: PACStack retains the last auth / aret via CR, defined as the general purpose register X28.
When stored in memory, the integrity of jmp_buf cannot be guaranteed. Nonetheless, the stored aret_i is bound to the corresponding aret_{i−1} on the setjmp caller's stack. This ensures that longjmp always restores a valid ACS state. To limit the set of values A can inject into jmp_buf, we replace the setjmp return address ret_b in jmp_buf with aret_b, defined as:
where SP_b is the SP value stored in jmp_buf. When executing longjmp, aret_b is recalculated based on the buffer values to verify that the stored aret_i was stored by a setjmp. A cannot generate the aret_b value for an arbitrary aret_i, nor replace aret_b with a previously observed aret_i. However, because longjmp explicitly allows jumping to prior states, ACS cannot ensure that the target is the intended one, i.e., A could substitute the correct jmp_buf with another. Shadow stacks share a similar limitation [17], and cannot guarantee that the intended state has been reached, only that the return address (and stack pointer) in that state is intact.
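The binding of jmp_buf contents can be illustrated with a toy model. The construction below, a nested MAC over ret_b, SP_b, and aret_i, is an assumption chosen only to show the idea of recomputing aret_b at longjmp; it is not claimed to be the definition used by ACS. HMAC-SHA256 truncated to 16 bits stands in for H_K, and the names `seal` and `unseal` are hypothetical.

```python
import hashlib
import hmac

B = 16
ADDR_BITS = 39              # assumed address width
ADDR_MASK = (1 << ADDR_BITS) - 1
KEY = bytes(16)             # placeholder key


def H(key: bytes, value: int, modifier: int) -> int:
    """Stand-in for the tweakable MAC H_K(value, modifier)."""
    msg = value.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << B) - 1)


def seal(key: bytes, ret_b: int, sp_b: int, aret_i: int) -> int:
    """setjmp side (assumed form): bind ret_b to SP_b and the chain head."""
    token = H(key, ret_b, H(key, sp_b, aret_i))
    return (token << ADDR_BITS) | ret_b


def unseal(key: bytes, aret_b: int, sp_b: int, aret_i: int) -> int:
    """longjmp side: recompute aret_b from the buffer values and compare."""
    ret_b = aret_b & ADDR_MASK
    if aret_b != seal(key, ret_b, sp_b, aret_i):
        raise RuntimeError("jmp_buf authentication failure")
    return ret_b


ARET_I = 0x0ABC_0040_0100   # chain head at the time of setjmp (made up)
buf = {"aret_b": seal(KEY, 0x400500, 0x7FFFF000, ARET_I),
       "sp": 0x7FFFF000, "aret_i": ARET_I}
```

A buffer copied wholesale from an earlier setjmp still verifies, which mirrors the limitation noted above: the check establishes that the restored state was produced by some setjmp, not that it is the intended one.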
IMPLEMENTATION: PACSTACK
We present PACStack, an ACS realization using ARMv8.3-A PA. PACStack is based on LLVM 7.0 and integrated into the 64-bit ARM backend, used via llc, the LLVM static compiler. PACStack adds two compilation passes: 1) to instrument function calls for aret propagation, and 2) to instrument function prologues and epilogues. The instrumentation is applied by passing the -pafss-ng flag to llc when transforming LLVM bitcode to target-specific assembly. We plan to add PACStack support to Clang. Compiler source code is available at https://pacstack.github.io
Function call instrumentation
Recall from Section 5 that ACS can be implemented using separate auth and ret tokens (variant 1), or using a combined authenticated return address (variant 2).
In both PACStack variants, we designate the general purpose register X28 as the chain register (CR) and reserve it for instrumentation use. PACStack instruments call sites to move auth (variant 1) or aret (variant 2) to CR (❶ in Listing 3) in order to retain its value through function calls that overwrite the link register (LR) (❷). After function return, the contents of CR are restored to LR (❸).
The advantage of using X28 is that it is a callee-saved register. Whenever a function uses a callee-saved register, it must also ensure that the old value is restored before return. By using X28 as CR, PACStack can be transparently mixed with uninstrumented code (either PACStack-instrumented applications using uninstrumented libraries, or vice-versa). We discuss the security implications of mixing instrumented and uninstrumented code in Section 10.3.
Listing 4: Variant 1 of PACStack generates and verifies auth tokens using pacga (❶ and ❸). Both auth_{i−1} and ret_i are stored on the stack, and are hence validated against auth_i on function return (❷). Where possible, the store pair (stp) / load pair (ldp) instructions are used to minimize the latency for successive loads / stores.
Our current PACStack implementation reserves X28 exclusively for instrumentation use because the LLVM 7.0 implementation prevents LR-use without substantial changes to compiler internals⁷. However, we expect the performance cost to be negligible, as cases where the compiler needs to utilize all callee-saved registers (X19-X29) are infrequent. Note that reserving exclusive use of a register has also been proposed for shadow stacks on the x86 architecture [11], even though x86 has fewer general purpose registers than 64-bit ARM processors. Unlike shadow stacks, ACS in general can avoid consuming additional registers by using LR to store auth (variant 1, Section 6.2) or aret (variant 2, Section 6.2).
Authenticated return addresses with PA
Variant 1: generating auth with pacga. In this variant, we use pacga to generate auth tokens:
auth_i = pacga(ret_i, auth_{i−1})
To generate and verify authentication tokens, PACStack instruments function prologues and epilogues (Listing 4). In the function prologue, auth_{i−1} and ret_i (in CR and LR, respectively) are stored on the function stack frame and used to generate a new auth_i with pacga (❶). Before function return, PACStack verifies the auth′_{i−1} and ret′_i read from the stack by calculating the corresponding auth′_i (❸) and comparing it to auth_i, stored in LR (❹). For auth_0, any value currently in CR is used and stored for later validation. This allows PACStack to operate without explicit initialization by the C Library (libc) startup code. To enable re-seeding the auth token chain (Section 5.4), the process and thread initialization, and the fork() wrapper in libc, should be modified to set the initial value of CR accordingly.
Variant 1 can efficiently compute 32-bit authentication token values using pacga. However, it has two drawbacks. First, an additional stack store / load is added for the 4-byte token; to preserve the callee-saved behavior of CR, the full 8-byte register content must be stored on the stack. Second, the output of pacga must be explicitly checked using a comparison and a conditional branch instruction. For this reason, our current implementation only supports variant 2 below. However, in Section 10.1 we discuss using pacga to bind other stack-based write-once data to a specific ACS state.
⁷ https://github.com/llvm/llvm-project/blob/llvmorg-7.0.0/llvm/lib/Target/AArch64/AArch64CallingConvention.td#L278
Listing 5: At function entry, PACStack stores the prior aret_{i−1} on the stack (➀) and generates the new aret_i (➁). Before return, aret_{i−1} is loaded from the stack (➂) and verified against aret_i (➃). On verification failure, LR is set to an invalid address ret*_i, causing a fault on return.
Variant 2: generating aret with autib. In this variant, we use pacib and autib instructions to efficiently calculate and verify ACS authenticated return addresses (Listing 5). These instructions differ from pacga in that the output is an authenticated return address which is directly written to LR:
aret_i = pacib(ret_i, aret_{i−1})
The corresponding verification is similar, and defined as:
ret_i = autib(aret_i, aret_{i−1})
where autib will automatically handle verification errors by setting LR to an unusable address ret*_i. No additional checking is needed; executing a return to ret*_i causes an address translation fault (Section 2.2). In variant 2, PACStack requires no additional stack space, as aret_{i−1} is stored on the stack in place of ret_i, not in addition to it. The value of CR for aret_0 is handled identically as in variant 1 for auth_0.
Mitigating hash collisions: PAC masking
To prevent A from identifying PAC collisions that can be reused to violate the integrity of the call stack, PACStack masks all authentication token values before storing them on the stack (Listing 6). A pseudo-random value is obtained by generating a PAC for address 0x0, pacib(0, aret_{i−1}) (❶, ❸).
By using pacib we efficiently obtain a pseudo-random value that can be directly applied to the authentication token part of aret using only an exclusive-or instruction (eor).
Because this construction uses the same key to generate both authentication tokens and masks, A must not obtain an aret_i for a ret_i = 0x0 and any existing aret_{i−1}. PACStack will never generate such aret values, as the return address never points to memory address zero. To prevent leaking the mask directly, it is cleared after use. We can thus be certain that no H_K(0, x) value is visible to A nor possible to pre-compute without the confidential PA key. This approach to masking requires two additional PAC calculations for each function activation. Our current implementation supports this as an optional feature that can be invoked using the -pafss-ng-cp flag.
Listing 6: PACStack masks authentication tokens to prevent A from detecting PAC collisions. The mask is created in CR with pacib(0, aret_{i−1}) (❶), and exclusive-OR-ed with the unmasked authentication token (❷). On return, the mask is recreated (❸) and applied to the masked authentication token aret_{i−1} (❹) before verification.
Irregular stack unwinding
PACStack binds jmp_buf buffers to the aret_i at the time of the setjmp call by replacing the setjmp return address ret_b with its authenticated counterpart aret_b (Section 5.5). The libc implementation is not modified; instead, setjmp / longjmp calls are replaced with the wrapper functions in Listings 7 and 8.
The setjmp_wrapper function (Listing 7) executes setjmp and updates the buffer with aret_b. PACStack generates aret_b based on the current SP value, CR, and the setjmp return address; this avoids the need to read the values setjmp has stored. The longjmp_wrapper (Listing 8) retrieves aret_b, aret_i, and the SP values from the buffer. It then verifies the values and writes ret_b into jmp_buf.
Multi-threading
The values of ARMv8-A general purpose registers are stored in memory when entering EL1 (i.e., kernel mode) from EL0 (i.e., user mode), for example during context switches and system calls. This must not allow A to modify the aret values or read the mask, which are both exclusively in either CR or LR during execution (Listings 5 and 6), but must be stored in memory during the context switch. On ARMv8-A, system calls are implemented using the supervisor call instruction (svc) that switches the CPU to EL1 and triggers a configured handler. On 64-bit ARM, Linux v5.0 uses the kernel_entry⁸ macro to store all register values on the EL1 stack, where they cannot be accessed by user-space processes. During context switches, callee-saved registers (including CR) and LR are stored in struct cpu_context⁹, which belongs to the in-kernel task structure and cannot be accessed by user space. The CR and LR values of a non-executing task are thus securely stored within the kernel, beyond the reach of other processes or other threads within the same process. Thus, no kernel modifications are needed to securely apply PACStack to multi-threaded applications.
Listings 7 and 8 (caption fragment): … for arbitrary values and therefore cannot inject them in jmp_buf. #r, #a and #s are the offsets to ret_b, CR, and ret_i within jmp_buf.
SECURITY EVALUATION
We address two questions in this section: 1) Is the ACS scheme cryptographically secure? 2) Do ACS's guarantees hold when instantiated as PACStack?
⁸ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/entry.S?h=v5.0
⁹ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/processor.h?h=v5.0
ACS security
A generic representation of an attack against ACS is shown in Figure 5. Under normal operation, function C returns to function A if called from function A (Figure 5a); i.e., when called from function A, the return address of C is an address ret_A in function A. The goal of the adversary A (Figure 5b) is to cause C to return to some other address ret_B.
Since the authenticated return address aret_A containing ret_A is protected from the adversary, in order to perform a backward-edge control-flow attack, A must achieve two goals successfully:
AG-Jump: Obtain an authenticated return address aret_B, valid with respect to some known modifier, which will validate successfully when C returns.
AG-Load: Violate the integrity of the call stack such that the LR register is loaded with aret_B from AG-Jump rather than the correct authenticated return address aret_A.
This requires two returns: one from a 'loader' function to load A's aret_B into LR, and another from C to the return address ret_B contained in aret_B.
In the analyses below, we treat the auth token H_K(P, m) as a random oracle with respect to both the pointer P and modifier m. This means that if H_K(P, m) has never been computed by a function call, H_K(P, m) will match any value with probability 2^{−b}. We also assume that programs that share the same PA keys between multiple processes or threads employ the mitigation strategy against brute-force attacks described in Section 5.4. This assumption and the design of ACS ensure that there is no authentication oracle available: the only way to test whether an auth token is valid with respect to some address and modifier is to attempt to return using the address and token, triggering a crash if the token is incorrect.
The difficulty of achieving these goals therefore depends on whether A's desired control-flow violation follows the call graph of the program and whether auth tokens are masked. Violating control-flow integrity while still traversing the call graph is easier because this allows A to harvest auth tokens and search for collisions; violations that do not follow the call graph are more difficult because they require that A make one or more guesses, risking a crash.
7.1.1 Violations that follow the call graph. As A can harvest authenticated return pointers when they are written to the stack, the short auth tokens mean that in the absence of masking an attacker can violate the integrity of the call stack by finding collisions in H K (·, ·).
In order to achieve goal AG-Load, A must find two authenticated return addresses aret A and aret B , such that i) they are both returned to by a function C, ii) C contains a call-site to the loader function with a corresponding return address ret C , and iii)
$$H_K(\mathit{ret}_C, \mathit{aret}_A) = H_K(\mathit{ret}_C, \mathit{aret}_B) \qquad (1)$$
Note that the collisions must be for different values in the second argument only, since that is the value in A's control. Collisions that require different values for ret C cannot be exploited because ret C is in CR and cannot be modified by A.
The auth tokens contained in aret A and aret B depend on the path that A has taken through the call graph. A can obtain as many auth tokens with ret C as a pointer as there are distinct execution paths leading to C. The number of such paths will explode combinatorially as the complexity of the program increases, and cycles in the call graph-as occur in Figure 5 -make the number of paths essentially infinite, limited only by available stack space.
Having found such a collision, A then arranges for function C to be called, traversing the call graph in such a way that it is set up to return to A using aret A . Then, when the function C calls into the loader function, it will set LR to aret C . When the loader function returns to ret C , it will attempt to load aret A from the stack. Instead, A substitutes aret B , which because of (1) will validate correctly when returning to ret C . Since aret B is a valid authenticated return address, C will successfully return to ret B , thereby violating the integrity of the call stack.
More concretely, after collecting q auth tokens, according to the birthday paradox [48, Section 1.4.2], the probability that some pair collides is:
$$1 - \prod_{i=1}^{q-1}\left(1 - \frac{i}{2^{b}}\right) \approx 1 - e^{-q(q-1)/2^{b+1}}$$
This quickly approaches 1 as A collects more tokens, on average occurring after obtaining
$$\sqrt{\pi 2^{b}/2}$$
tokens. With a 16-bit PAC, A will therefore obtain a collision after harvesting 321 pointers on average. In order to successfully mount the above attack, A must find two colliding auth tokens and perform the substitution. Without masking, A can read the auth token from the stack. A can then keep collecting auth tokens until they find two that collide; since these are both valid pointers, A will always succeed once this occurs, and thus the attack succeeds with probability 1 once a collision has been found.
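As a sanity check on the figure of 321 pointers, the following minimal Python sketch evaluates the standard birthday bound for b-bit tokens. It assumes that tokens are uniformly distributed, as in the random-oracle model used above; the function names are illustrative and not part of PACStack.

```python
import math


def collision_probability(q: int, b: int) -> float:
    """Probability that q uniformly random b-bit tokens contain a collision."""
    n = 2 ** b
    if q > n:
        return 1.0  # pigeonhole: more tokens than possible values
    p_all_distinct = 1.0
    for i in range(1, q):
        p_all_distinct *= 1.0 - i / n
    return 1.0 - p_all_distinct


def expected_tokens_until_collision(b: int) -> float:
    """Standard approximation sqrt(pi * 2^b / 2) of the expected number of
    tokens that must be harvested before the first collision appears."""
    return math.sqrt(math.pi * 2 ** b / 2)


if __name__ == "__main__":
    b = 16  # PAC width assumed in the text
    print(round(expected_tokens_until_collision(b)))
    print(collision_probability(321, b))
```

For b = 16 the approximation rounds to 321, and the chance of having seen at least one collision after that many tokens is already above one half.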
With masking, A cannot identify auth token collisions: aret A and aret B have different mask values H K (0, aret A ) and H K (0, aret B ). Therefore it is impossible to identify a collision with a probability better than that of random selection. This means that A will succeed in the attack above with a probability of 2 −b . We give a detailed proof in Appendix A.
In practice, this means that A can use this attack to traverse the program's call graph, but cannot jump to an address that is not a valid return address for function C.
Violations that leave the call graph.
We now consider A's probability of success when attempting to return to an address ret B in a way that does not follow the program's call graph.
In this case, the path from B to C has not been traversed, and the instrumentation has never before computed the auth token H K (ret C , aret B ). Therefore, A succeeds at AG-Load-i.e., H K (ret C , aret B ) = H K (ret C , aret A )-with probability P[AG-Load] = 2 −b , irrespective of whether the substituted aret B is a valid authenticated return address. On failure, which has probability 1 − 2 −b , the process will crash. A's probability of then achieving goal AG-Jump depends on whether ret B is the return address of a valid call-site. If it is, then A can obtain a valid authenticated return pointer for that location in the same way as in Section 7.1.1, thereby succeeding with probability P[AG-Jump] = 1. If ret B has never been used as a return address, then no auth token has ever been generated for that pointer. Therefore, AG-Jump is achieved with probability at most P[AG-Jump] = 2 −b ; otherwise, the process crashes.
A can therefore succeed with probability 2 −b when the return address is a valid call-site return address, or with probability of 2 −2b when the return address is not.
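The cases analysed in this section can be collected into one small function. The sketch below is only a restatement of the probabilities derived in the text, using exact fractions; the function name and its parameters are illustrative assumptions.

```python
from fractions import Fraction


def success_probability(b: int, follows_call_graph: bool, masked: bool,
                        target_is_call_site: bool = True) -> Fraction:
    """Adversary success probability for b-bit auth tokens, per the analysis above."""
    guess = Fraction(1, 2 ** b)  # chance that one unseen token matches
    if follows_call_graph:
        # Without masking, a harvested collision always validates.
        # With masking, collisions cannot be recognised, so A must guess.
        return guess if masked else Fraction(1)
    # Off-graph: AG-Load always costs one guess; AG-Jump costs a second
    # guess only if ret_B was never used as a return address.
    return guess if target_is_call_site else guess * guess


if __name__ == "__main__":
    for args in [(True, False), (True, True), (False, True)]:
        print(args, success_probability(16, *args))
    print("off-graph, arbitrary address:",
          success_probability(16, False, True, target_is_call_site=False))
```

With b = 16, a single guess succeeds with probability 1/65536, and an off-graph jump to an address that is not a call-site return address succeeds with probability 1/4294967296.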
We summarize our results in Table 1 .
Run-time attack resistance of PACStack
PACStack must ensure the integrity of aret n and the confidentiality of the masks. The former is achieved by storing aret n in LR or CR, which are reserved for this purpose, not used by regular code, and hence inaccessible to A (Section 6.1). The latter is maintained as the mask is re-generated each time it is needed, only stored in LR, and cleared after use (Section 6.3). This holds true also in multi-threaded environments (Section 6.5).
Recent results have shown that traditional CFI solutions are unable to withstand control-flow bending [12] ; attacks where each control-flow transfer follows the program's CFG, but the program execution trace conforms to no feasible benign execution trace. PACStack-or ACS in general-is not susceptible to backward-edge control-flow bending, because it precisely protects the integrity of the authenticated return addresses while they remain on the stack. A cannot trick PACStack to deviate from an expected return flow by replacing aret n with a valid, but outdated aret value, because PACStack never writes aret n onto the stack. A also cannot reliably exploit PAC collisions to replace part of the aret chain, as each aret is masked. A cannot tamper with the instrumentation itself by modifying the instructions in memory (Assumption A1). By requiring coarse-grained forward-edge CFI (Assumption A2), PACStack ensures that auth token calculations and masking are executed atomically and cannot be used to manipulate ret i , aret i−1 or the mask during the function prologue and epilogue. This holds even if the forward-edge CFI is susceptible to control-flow bending.
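The chaining argument can be illustrated with a toy model. The Python sketch below is not the PACStack instrumentation. It assumes HMAC-SHA256 truncated to 16 bits as a stand-in for the PA MAC H K, a 48-bit address with the token packed into the upper bits, and hypothetical names (ACS, make_aret, cr). The latest aret lives in a field that models CR, older values live in an attacker-writable list that models the stack, and each token is masked as H K (ret, aret) ⊕ H K (0, aret).

```python
import hashlib
import hmac

TOKEN_BITS = 16              # assumed PAC width
VA_BITS = 48                 # assumed virtual address width
ADDR_MASK = (1 << VA_BITS) - 1


def H(key: bytes, pointer: int, modifier: int) -> int:
    """Stand-in for H_K(pointer, modifier): truncated HMAC-SHA256."""
    msg = pointer.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << TOKEN_BITS) - 1)


def make_aret(key: bytes, ret: int, prev_aret: int) -> int:
    """Authenticated return address: masked token in the bits above the address."""
    token = H(key, ret, prev_aret) ^ H(key, 0, prev_aret)
    return (token << VA_BITS) | ret


class ACS:
    def __init__(self, key: bytes):
        self.key = key
        self.cr = 0       # models CR: newest aret, never written to memory
        self.stack = []   # models the stack: older arets, attacker-writable

    def call(self, ret: int) -> None:
        self.stack.append(self.cr)            # spill previous aret
        self.cr = make_aret(self.key, ret, self.cr)

    def ret(self) -> int:
        prev = self.stack.pop()               # possibly tampered with
        addr = self.cr & ADDR_MASK
        if make_aret(self.key, addr, prev) != self.cr:
            raise RuntimeError("authentication failure")
        self.cr = prev
        return addr
```

In this model, overwriting the spilled aret changes the modifier that is used to re-compute the token held in cr, so the next return fails unless the substituted value happens to collide, which an adversary who cannot see unmasked tokens can only guess.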
Tail-calls and signing gadgets.
A recent discovery by Google Project Zero 10 shows that PA schemes can be vulnerable to an attack whereby specific code sequences can be used as gadgets to generate PACs for arbitrary pointers. Recall that on PAC verification failure an aut instruction removes the PAC, but corrupts a well-known high-order bit such that the pointer becomes invalid. If a pac instruction adds a PAC to a pointer P with corrupt high-order bits, it treats the high-order bits as though they were correct when calculating the new PAC, and flips a well-known bit p of the PAC if any high-order bit was corrupt. This means that instruction sequences such as the one shown in Listing 9, consisting of an aut instruction followed by a pac instruction, can be used to generate a valid PAC for a pointer even if the original pointer is not valid to begin with. A writes an arbitrary pointer P to memory (❶) and allows it to be verified. When verification fails, autia removes the PAC, and corrupts the high-order bit in P, writing the resulting P * to the destination register (❷). The subsequent pacia will add the correct PAC for P, then flip bit p of the PAC to indicate that the input pointer was invalid (❸). A can now flip bit p back (❺) in order to obtain the correct PAC for pointer P (❻).
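The behaviour described above can be modelled in a few lines. The following Python sketch is a simplified model, not the architectural definition of the PA instructions: the MAC is a truncated HMAC-SHA256, and the positions of the error bit (ERROR_BIT) and of bit p (FLIP_BIT) are hypothetical choices made for illustration.

```python
import hashlib
import hmac

VA_BITS = 48
ADDR_MASK = (1 << VA_BITS) - 1
ERROR_BIT = 1 << 62             # hypothetical high-order bit set when aut fails
FLIP_BIT = 1 << (VA_BITS + 5)   # hypothetical bit p inside the PAC field


def compute_pac(key: bytes, address: int, modifier: int) -> int:
    """Stand-in 16-bit MAC over (address, modifier)."""
    msg = address.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest()[:2], "little")


def pac(key: bytes, pointer: int, modifier: int) -> int:
    """Model of a pac instruction."""
    address = pointer & ADDR_MASK
    signed = address | (compute_pac(key, address, modifier) << VA_BITS)
    if pointer >> VA_BITS:      # input had corrupt high-order bits
        signed ^= FLIP_BIT      # mark the result by flipping bit p
    return signed


def aut(key: bytes, pointer: int, modifier: int) -> int:
    """Model of an aut instruction: strip the PAC, set the error bit on failure."""
    address = pointer & ADDR_MASK
    expected = address | (compute_pac(key, address, modifier) << VA_BITS)
    return address if pointer == expected else address | ERROR_BIT


def signing_gadget(key: bytes, pointer: int, modifier: int) -> int:
    """An aut immediately followed by a pac on the same register."""
    return pac(key, aut(key, pointer, modifier), modifier)


def forge(key: bytes, invalid_pointer: int, modifier: int) -> int:
    """Adversary step: run the gadget on an invalid pointer, then undo the flip."""
    return signing_gadget(key, invalid_pointer, modifier) ^ FLIP_BIT
```

In this model, forge returns a pointer that aut accepts whenever the input pointer fails verification, which is why PACStack must ensure that a failed verification is detected before any subsequent signing.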
The PA signing gadget requires finding a matching ⟨autib, pacib⟩ pair operating on pointer P in the code without any use of P between these instructions. In PACStack each verification is immediately followed by a return, which ensures that the failure is detected. Tail-calls are a notable exception. Tail-calls are function calls executed before return and optimized so that the callee directly returns to the caller of the optimized invocation of B in Listing 10. On 64-bit ARM processors, tail-calls are implemented using the b or br instructions that do not update LR (➀). The tail-called function can return (➁) to the LR value set before the tail-call (➂). PACStack limits A to modifying the previous auth token on the stack. A could attempt to exploit the signing gadget to trick PACStack to accept an invalid aret ′ i−1 (➃), and subsequently load it into LR after return. This is not possible as A cannot flip the bit p of aret ′ i (➄), because aret i ⊕ p is: 1) kept in LR while in B, and 2) verified against aret i+1 on subsequent function calls from B. The invalid aret ′ i−1 is thus rejected by autib (➃) before the return from B.
Sigreturn-oriented programming.
Sigreturn-oriented programming [9] is an exploitation technique in UNIX-like operating systems, including Linux, that abuses the signal frame to take complete control of a process's execution state, i.e., the values of general purpose registers, SP, program counter (PC), status flags, etc. When the kernel delivers a signal, it suspends the process and changes the user-space processor context such that the appropriate signal handler is executed with the right arguments. When the signal handler returns, the original user-space processor context is restored. In a sigreturn attack A sets up a fake signal frame and initiates a return from a signal that the kernel never delivered. Specifically, a program returns from the handler using a sigreturn system call that reads a signal frame (struct sigcontext in Linux) from the process stack. A sigreturn attack is problematic for PACStack because, if successful, it would allow A control of any EL0 register, including CR.
A number of defense strategies against sigreturn attacks have been proposed for the Linux kernel. Bosman and Bos [9] propose placing keyed signal canaries in the signal frame that are validated by the kernel before performing a sigreturn, or to keep a counter of the number of currently executing signal handlers. However, modern Linux versions rely solely on address space layout randomization (ASLR) [33] to make it difficult for the attacker to trigger an unwarranted sigreturn. Fortunately sigreturn is never called directly from program code (in fact the GNU C library sigreturn simply returns an error value). Instead the system call is triggered by signal trampoline code placed either in the kernel's virtual dynamic shared object (vdso) or in the C library, both subject to ASLR. For our chosen adversary model (Section 3) ASLR is not sufficient as A can determine the contents of any readable memory in the process memory space. However, PACStack itself, together with coarse-grained CFI (Assumption A2), ensures that A cannot divert control flow from program code to the signal trampoline. Nonetheless, 64-bit ARM programs that might call system calls directly using the svc instruction (without going through C library system call wrappers), would not be protected against the presence of such gadgets. We discuss a potential general solution against sigreturn attacks that utilizes the ACS construction in Appendix B.
PERFORMANCE EVALUATION
At present, the only publicly available PA-enabled SoCs are the Apple A12 and S4, neither of which support PA for 3rd party code at the time of writing. To verify the correctness of instrumentation we ran all benchmarks on the ARMv8-A Base Platform Fixed Virtual Platform (FVP), based on Fast Models 11.4, which supports ARMv8.3-A [4] . Because the FVP runs the v4.14 kernel, we have used PA RFC patches 11 modified to support all PA keys.
The FVP is not cycle-accurate and executes all instructions in one master cycle; therefore, it cannot be used for performance evaluation. Based on prior evaluations of the QARMA cipher [7] , which is used as the underlying cryptographic primitive in reference implementations of PA [47] , Liljestrand et al. estimate that the PAC calculations incur an average overhead of four cycles on a 1.2GHz CPU [35] . We employ the PA-analog (Listing 11) introduced by Liljestrand et al. to estimate the run-time overhead of PACStack. The PA-analog consists of four eor instructions that both read and write the registers used by the corresponding PA instruction in order to induce similar constraints on instruction pipelining within the CPU. To preserve compiler behavior, the PA-analog is swapped in during a separate pre-emit pass, i.e., after both register allocation and instruction scheduling.
Using the PA-analog, we conducted benchmarks on a 96board Kirin 620 HiKey (LeMaker version) with an ARMv8-A Cortex A53 Octa-core CPU (1.2GHz) / 2GB LPDDR3 SDRAM (800MHz) / 8GB eMMC, running the Linux kernel v4.18
Listing 11: PA-analog used to simulate overhead on non-PA hardware, based on an estimated overhead of 4 cycles. Three exclusive-or inputs are constants, whereas the last instruction uses both inputs to ensure instruction pipelining must get both values.
We have performed benchmarks using both the nbench-byte-2.2.3 12 program and the SPEC CPU 2017 benchmark package 13 .
nbench-byte-2.2.3
The nbench program includes 10 separate benchmarks and is designed to measure CPU and memory performance. The benchmarks employ dynamic workload adjustment to ensure that a test run takes at least a certain amount of time. In order to determine the relative overhead introduced by PACStack, we took the same approach as prior work [10, 35] and modified nbench to perform a pre-determined number of iterations of each benchmark and measured the execution time of each separately. All binaries used in the performance evaluation were produced by our PACStack-enabled compiler. We disabled all optimizations when compiling benchmark binaries (-O0 flag for Clang and LLVM, and -O=0 for llc). We evaluated the performance of nbench in three configurations: i) PACStack disabled, to determine the baseline execution time; ii) PACStack enabled, without PAC masking; and iii) PACStack enabled, with PAC masking. We repeated each benchmark 10 times and measured the user time using the time utility for each benchmark run. The results are shown in Figure 6 , and indicate an overhead of 0.5% when using PAC masking, and an overhead of < 0.3% without (geometric mean of all benchmarks [52] ).
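For readers reproducing the aggregation, the sketch below shows one way to compute a geometric-mean overhead from per-benchmark execution times. The function name is illustrative and the timings in the usage example are hypothetical placeholders, not measurements from this evaluation.

```python
import math


def geometric_mean_overhead(baseline_times, instrumented_times):
    """Geometric mean of per-benchmark slowdown ratios, returned as a relative
    overhead (0.01 corresponds to 1%)."""
    ratios = [inst / base for base, inst in zip(baseline_times, instrumented_times)]
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return math.exp(log_mean) - 1.0


if __name__ == "__main__":
    # Hypothetical user times in seconds for three benchmarks.
    baseline = [12.0, 30.0, 45.0]
    instrumented = [12.1, 30.1, 45.3]
    print(f"{geometric_mean_overhead(baseline, instrumented):.2%}")
```

The geometric mean is used rather than the arithmetic mean so that no single benchmark with a long absolute run time dominates the summary.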
SPEC CPU 2017
In contrast to nbench-byte-2.2.3, SPEC CPU 2017 is an industry-standard benchmarking suite that consists of larger units of work based on real-world applications. Due to resource constraints it was not feasible to install both the PACStack compiler and the SPEC CPU suite on the FVP or HiKey board. Instead we compiled the benchmarks with the SPEC runcpu utility configured to use WLLVM 14 as the compiler. WLLVM produces binaries containing the LLVM Intermediate Representation (IR), which we extracted and instrumented using PACStack and the PA-analog. For comparison, we also measured the run-time overhead of ShadowCallStack [15], an instrumentation pass added in LLVM/Clang 7.0. ShadowCallStack protects programs against return address overwrites by saving a function's return address in the function prolog to a separately allocated shadow stack and checking the return address on the stack against the shadow stack in the function epilog. On 64-bit ARM the instrumentation makes use of the X18 register to reference the shadow stack. Currently the runtime support for the ShadowCallStack instrumentation is only available in Android's Bionic C library. In addition, ShadowCallStack is only compatible with uninstrumented libraries which reserve the X18 register, i.e., binaries that are built for a platform whose ABI reserves X18 (e.g., Android, Darwin, Fuchsia and Windows) or that are compiled with the -ffixed-x18 flag. To be able to perform a fair comparison against PACStack-instrumented SPEC using the GNU C library (glibc), we ported ShadowCallStack support to glibc version 2.23 and compiled versions of our modified glibc and libgcc 6.4.1, the GCC low-level runtime library, with the -ffixed-x18 flag. Our changes to glibc were based on revision da772e2 15 of the Bionic C library. For PACStack measurements we used a prebuilt version of glibc 2.23 and libgcc 6.4.1 distributed by Linaro 16 .
12 http://www.math.utah.edu/~mayer/linux/bmark.html
All benchmarks were compiled with the -O0 flag to disable optimizations. The benchmark execution command and input files were determined using the SPEC specinvoke utility and then timed on the HiKey board using time.
Figure 7: Relative performance overhead for SPEC CPU 2017 benchmarks; error bars show the standard error for n measurements. n = 20 for the ShadowCallStack configuration.
Our measurements include all C-language SPECrate benchmarks, with the exception of two benchmarks that were incompatible with the WLLVM build environment that we used. For each benchmark, we compared the performance of the baseline (with PACStack and ShadowCallStack disabled) with three different configurations: i) PACStack without masking, ii) PACStack with masking, and iii) ShadowCallStack. Results are shown in Figure 7 and are reported as the mean overhead (w.r.t. the baseline) and corresponding standard error. The SPEC CPU 2017 benchmark suite is resource intensive [42] ; a single iteration of all SPEC benchmarks in Figure 7 took 13 times longer than an iteration of all nbench benchmarks. We therefore performed fewer measurements for SPEC than for nbench. Consequently, though the SPEC benchmarks are more representative of real-world workloads, they are more sensitive to outliers than those in Figure 6. The results show PACStack incurs an overhead of 0.9% with masking, and 0.4% without masking (geometric mean of all benchmarks). PACStack without masking performs marginally better than ShadowCallStack, which incurs an overhead of 0.5% (geometric mean of all benchmarks). The performance overhead of PACStack is proportional to the frequency of function calls; benchmarks with few function calls are affected less by the instrumentation compared to benchmarks with frequent function calls. For instance, the 519.lbm_r benchmark performs fluid dynamics and consists of large nested loops with few function calls.
Consequently we see little effect on performance in 519.lbm_r; in fact, our measurements show a small improvement in performance, which is likely caused by CPU pipeline optimizations that happen to be advantageous. We observe the same behavior for ShadowCallStack.
Based on these results, we expect the overhead for both PACStack configurations to be a) comparable to ShadowCallStack, and b) negligible on ARMv8.3-A PA-capable hardware.
RELATED WORK
Control-flow hijacking attacks were discovered and popularized more than two decades ago [49] . The majority of CFI solutions proposed since then are stateless: they validate each control-flow transfer in isolation without distinguishing among different paths in the control-flow graph (CFG). Fully-precise static CFI [12] is in theory the most restrictive stateless policy that is possible without breaking the intended functionality of the protected program. In fully-precise static CFI, and by extension any stateless policy, the best possible policy for return instructions is to allow returns within a function F to target any instruction that follows a call to F . All stateless CFI schemes, including fully-precise static CFI, are vulnerable to control-flow bending [12] .
Stateful CFI can express policies that take previous control-flow transfers into account. HAFIX [19] is a hardware-assisted CFI scheme that confines function returns to active call sites. Context-sensitive CFI [20, 27, 53] further ensures that each control-flow transfer taken by the program is consistent with a non-malicious trace. This leads to a more precise policy compared to stateless CFI, but context-sensitive CFI enforcement has been dismissed as impractical for real-world adoption [1] . Hardware-assisted branch recording features available in modern 64-bit Intel microprocessors show promise in enabling context-sensitive CFI enforcement on commodity hardware, but suffer from i) limited branch history used to make CFI decisions, ii) over-approximation of the program CFG, and iii) reliance on complex run-time monitoring. HAFIX, on the other hand, requires changes to the underlying processor architecture.
Stateless forward-edge CFI enforcement is often combined with a shadow stack [1, 14-18, 22, 23, 28, 39, 40, 51] to enforce the integrity of return addresses stored on the call stack. In fact, the results by Carlini et al. [12] show that a shadow stack (or equivalent mechanism) is essential for the security of CFI. The shadow stack maintains a copy of each return address in a separate region of memory. Each return instruction is then instrumented to validate that the return addresses on the call and shadow stack match. This ensures that each return is restricted only to its corresponding call site.
Although shadow stacks provide precise protection, traditional shadow stacks incur significant performance overhead and lead to false positives for programming constructs that cause mismatches between calls and returns (C++ exceptions with stack unwinding, setjmp / longjmp). Recent shadow stack designs demonstrate that performance can be increased by either leveraging a parallel shadow stack [17] , or using a dedicated register for shadow stack addressing [11] . However, in these schemes the shadow stack still resides in the same address space as the target application, and can be compromised if the shadow stack location is known to A. For traditional shadow stacks, a typical solution for dealing with mismatches between calls and returns is to pop return addresses off the shadow stack until a match is found, or the shadow stack is empty (e.g., binary RAD [14] ). This not only increases the complexity and run-time of the shadow stack instrumentation placed in the function epilogue, but also sacrifices precision, e.g., it allows A to redirect longjmp to any previously active call site. This can be avoided by storing and validating both the return address and stack pointer [16, 41, 51] . So far, only hardware-assisted shadow stacks promise to achieve negligible overhead without security trade-offs (e.g., Intel CET [28] ).
Park et al. [44] present a microarchitectural shadow stack implementation that utilizes the branch predictor return address stack, a common hardware feature found in modern speculative superscalar processor designs. The return address stack is typically a circular buffer, so to avoid the loss of stored return addresses when the maximum capacity is reached, Park et al. modify the return address stack to spill a portion of its content to backup storage in main memory. They use a Merkle tree caching scheme to efficiently authenticate the backup storage before it is read back to the return address stack. The latency of spill / fill operations on backup memory is effectively offset by the 100% hit rate for branch prediction thanks to the ability to retain return addresses that exceed the return address stack capacity.
The idea of using MACs to protect the return address at run-time was introduced in Cryptographic CFI (CCFI) [37] , which uses MACs to protect return addresses and other control-flow data (e.g., function pointers and C++ vtable pointers). CCFI's return address protection is similar to PA-based return address signing [47] ; both bind the return address to the address of the function's stack frame and thus provide only coarse-grained resistance against pointer reuse attacks [35] .
Program Counter Encoding [34, 43, 46] protects return addresses on the stack by encoding them with either a register-resident secret key [34] , the SP [46] , or the address at which the return address itself is stored (a.k.a. the self-address) [43] . These schemes are efficient, but relying on a userspace-resident secret key makes such encoding schemes susceptible to buffer over-reads, and SP or self-address encoding suffers from the same drawbacks as -msign-return-address [35, 47] (Section 2.2.1).
Other prominent defenses against control-flow attacks include fine-grained code randomization [33] , and code-pointer integrity (CPI) [31] . Code randomization makes it more difficult for A to find suitable gadgets to exploit in their attacks, but is not effective if the memory layout of the program becomes known. CPI protects code pointers by storing them in a separate safe stack. The safe stack requires similar integrity guarantees as shadow stacks to remain effective [21] .
PACStack targets the ARM architecture, which traditionally has received less attention compared to the x86 family of computer architectures in terms of CFI research. MoCFI [18] is a software-based CFI approach specifically targeting ARM application processors used in smartphones. It uses a combination of a shadow stack, static analysis and run-time heuristics to determine the set of valid targets for control-flow transfers, but suffers from the same drawbacks that plague traditional shadow stack schemes. CFI CaRE [40] is a CFI solution targeting small, embedded ARM-based microcontrollers (MCUs). It uses the ability to perform hardware-enforced isolated execution on ARMv8-M MCUs to isolate the shadow stack to a secure processor state. The ARMv8-M [5] architecture enforces that calls to secure functions must target secure gate instructions placed at the beginning of such functions. The ARMv8.5-A architecture introduces similar branch target indicators (BTI) [3] also to ARM application processors. BTI constitutes one way to meet the PACStack pre-requisite of coarse-grained CFI for indirect branch instructions, e.g., calls via function pointers.
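The precision loss caused by the pop-until-match policy can be seen in a short sketch. The Python model below is a generic illustration with hypothetical class and method names; it does not reproduce the instrumentation of any cited scheme.

```python
class ShadowStack:
    """Toy shadow stack holding a reference copy of each return address."""

    def __init__(self):
        self.entries = []

    def on_call(self, return_address: int) -> None:
        self.entries.append(return_address)

    def on_return_strict(self, return_address: int) -> bool:
        """Accept only the most recent call site."""
        if not self.entries:
            return False
        return self.entries.pop() == return_address

    def on_return_pop_until_match(self, return_address: int) -> bool:
        """Tolerate call/return mismatches (e.g. longjmp) by unwinding until
        a match is found or the shadow stack is empty."""
        while self.entries:
            if self.entries.pop() == return_address:
                return True
        return False
```

With three nested calls recorded, the strict check rejects a return to the outermost call site, whereas the pop-until-match check accepts it. This tolerance is what allows a redirect to any previously active call site.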
DISCUSSION
10.1 Generalizing ACS to other data structures
ACS builds on the idea of chaining cryptographic authentication codes. This simple, yet powerful, construct is similar to hash chains, which have been used before as means of password protection (Lamport signatures [32] ), digital signatures (Merkle trees [38] ), and have seen use in technologies such as blockchain [54] and trusted hardware access control authorization policies [6] .
While the focus of this work is on applying this idea to protect the integrity of return addresses in the program call stack, the same approach can be generalized to other data structures and applications. For example, the call-stack protection could easily be extended to cover the frame pointer, or other data stored in a function's stack frame, and protect such data from unauthorized modification.
In addition to instrumentation that can protect the call stack, an ACS-like authenticated stack, or other data structure such as a Merkle tree [38] , can be implemented as a reusable library, which would allow application developers to protect the integrity of critical data structures from manipulation as a result of software [13, 26] , or hardware attacks [29] .
An example of such a use case is data structures in operating system kernels. For instance, the Linux kernel source code features a generic doubly linked list implementation, which doubles as a queue and stack, depending on where in the kernel it is used 17 . Kernel data structures are critical to system security. Many of the vulnerabilities found in the kernel allow limited access to kernel data. Malicious modification of kernel data can lead to a wide range of effects, including privilege escalation and process hiding [8] . Applying ACS-like protection to critical kernel stacks can protect such structures from: i) malicious modification by A in an effort to compromise kernel data integrity, and ii) accidental misuse by programmers, e.g., operating on a stack as a queue and vice versa (a side-effect of reuse of generic list implementations).
Support for software exceptions
The setjmp / longjmp interface has traditionally been used to provide exception-like functionality in C. However, modern coding standards for C and C++ that aim to facilitate code safety, security, and reliability consider them harmful and forbid their use, e.g., MISRA C:2004 [25, Rule 20.7] and JSF AV C++ [36, Rule 20] . Recall from Section 5.5 that calling longjmp with an expired jmp_buf is undefined behavior. For PACStack, this means that although it binds the aret b in jmp_buf to the corresponding SP and auth i , it cannot guarantee their freshness. A can modify jmp_buf to contain the previously used aret b and SP b , but must also modify the stack frame at SP b , such that it contains the prior aret i . This allows a control-flow transfer to a previously valid setjmp return site and SP value. To prevent reuse of expired jmp_buf buffers, longjmp can be rewound step-by-step, i.e., conceptually performing returns until the correct stack frame is reached.
We plan to extend PACStack support to LLVM libunwind 18 . libunwind performs frame-by-frame unwinding of the call stack. By validating the ACS on each stack frame unwinding, PACStack can ensure that a fresh and valid state is reached.
Because C++ exceptions also cause irregular stack unwinding they pose a similar challenge. However, C++ already performs more fine-grained stack unwinding to correctly destroy objects in unwound stack frames. The LLVM libcxxabi library will, depending on configuration, use libunwind for this purpose. With PACStack support in libunwind, we will be able to secure both setjmp / longjmp and support C++ exception handling.
Interoperability with unprotected code
Interoperability with unprotected (uninstrumented) code is an important deployment consideration. On one hand, a PACStack-protected application may need to interoperate with unprotected shared libraries. On the other, an unprotected app may need to interoperate with PACStack-protected shared libraries. The latter scenario is relevant for deployment in mobile operating systems such as Android, where multiple stakeholders provide application binaries to consumer devices. The deployment of PACStack, or any other run-time protection mechanism, is likely to be driven by OEMs that enable specific protection schemes for the operating system and system applications. However, OEMs are not in control of native code deployed as part of applications distributed through standard application marketplaces. It should be possible for one version of the shared libraries shipped with the operating system to remain interoperable with both PACStack-protected and unprotected apps.
In Section 6.1 we explain how the use of callee-saved registers allows PACStack to remain interoperable with unprotected code. Recall that because CR is a callee-saved register it will be restored upon return. However, PACStack cannot guarantee that CR remains unmodified during the execution of the unprotected code, which could temporarily store its value on the stack. To achieve the security guarantees described in Section 7, PACStack instrumentation must be applied to both the application and any shared libraries. However, partial protection, e.g., PACStack-protected shared libraries, can significantly raise the bar for the attacker, as calls into protected functions can still benefit from return address authentication. Common shared libraries like libc are a popular source of gadgets for run-time attacks because of their size and availability. Because functions in a PACStack-protected library validate the return address in returns from library functions, they effectively remove a potentially large set of reusable gadgets from A's disposal.
CONCLUSION
We showed how a general-purpose hardware security mechanism (ARM PA) can provide guarantees on par with hardware-assisted shadow stacks, without requiring additional hardware support or compromising security. Other general-purpose primitives like memory tagging and branch target indicators are being rolled out. Creative uses of such primitives hold the promise of significantly improving software protection.
We will bound this advantage by reduction to a semantic security game for the masks. We consider the following games, shown in Figure 11 , and described in Figure 10 .
Figure 10 (game descriptions):
$G^B_1(1^\lambda, H, q)$: B obtains masked authentication tokens $H_K(x, y) \oplus H_K(0, y)$ for up to q pairs (x, y) of B's choice, and must then distinguish the masks $H_K(0, \cdot)$ from a random oracle.
$G^B_2(1^\lambda, H, q)$: This is the same as the previous game, except that $H_K(\cdot, \cdot)$ is replaced by a random oracle and B is not limited in their number of queries. B must now distinguish between two random oracles, one of which is used in computing the authentication tokens, and one of which is independent of the authentication tokens.
Reformulation: This is the semantic security game for repeated one-time-pad encryptions of a random string.
The first hop, from $G_1$ to $G_2$, is based on indistinguishability and relaxation: we suppose that $H_K(\cdot, \cdot)$ can be distinguished from a random oracle with probability no more than $\mathrm{Adv}^A_{\text{PAC-Distinguish}}(1^\lambda, H, q)$, and that the adversary is not limited in the number of queries that can be made to the masked authentication token oracle. Then,
The second hop, from $G_2$ to $G_3$, is a mere reformulation of $G_2$ such that random oracles are represented as strings, and that rather than allowing B to request arbitrarily many authentication tokens from the challenger, we instead give B direct access to the oracle, as represented by the sequence of strings $T_{1 \ldots 2^{\mathrm{VA\_SIZE}}}$. The third game is a semantic security game for the one-time pad, where A is given $2^{\mathrm{VA\_SIZE}}$ encryptions of $S_1$ and then asked to distinguish between $S_1$ and a random string. The perfect secrecy of the one-time pad means that $P[G^B_1(1^\lambda) = 1] = \frac{1}{2}$ and so
Finally, we provide a reduction from $G^A_{\text{PAC-Collision}}(1^\lambda, H, q)$ to
If the MAC $H_K(\cdot, \cdot)$ is a pseudo-random function family with respect to K, then $\mathrm{Adv}^A_{\text{PAC-Distinguish}}(1^\lambda, H, q)$ is negligible, and thus so is $\mathrm{Adv}^A_{\text{PAC-Collision}}(1^\lambda, H, q)$. □
With a bound on A's probability of successfully obtaining a PAC collision, we may now obtain a bound on their probability of violating the integrity of an ACS-protected call stack.
Theorem A.2 (Security of ACS). Consider a program whose call stack is protected by ACS, which has a call-graph C and b-bit masked authentication tokens $T_K(x, y) = H_K(x, y) \oplus H_K(0, y)$. Then, an adversary with arbitrary control over memory can violate backward-edge control-flow integrity with probability
Proof. We begin with a security game for ACS, shown in Figure 13.
Our goal is to provide a black-box reduction from $G^A_{\text{ACS}}(1^\lambda, H, C, q)$ to $G^A_{\text{PAC-Collision}}(1^\lambda, H, q)$. From line 24 of Figure 13, winning $G^A_{\text{ACS}}$ implies that A has obtained colliding authentication tokens, and therefore A can win $G^A_{\text{PAC-Collision}}$ with probability at least $P[G^A_{\text{ACS}}]$. Substituting the bound from Theorem A.1, we obtain the bound given. □
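The masked authentication token that both theorems rely on, T_K(x, y) = H_K(x, y) ⊕ H_K(0, y), can be illustrated with a minimal sketch. It assumes a truncated HMAC-SHA256 as a stand-in for the MAC H_K, which is not the primitive used by ARM PA; the names `H`, `masked_token`, and `B_BITS`, and the 16-bit token width, are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac

B_BITS = 16  # hypothetical token width b, chosen only for illustration


def H(key: bytes, x: int, y: int) -> int:
    """Stand-in for the MAC H_K(x, y): HMAC-SHA256 truncated to B_BITS bits."""
    msg = x.to_bytes(8, "little") + y.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "little") % (1 << B_BITS)


def masked_token(key: bytes, x: int, y: int) -> int:
    """Masked token T_K(x, y) = H_K(x, y) XOR H_K(0, y)."""
    return H(key, x, y) ^ H(key, 0, y)
```

In this sketch the mask H_K(0, y) depends only on the key and on y, so an observer who sees `masked_token(key, x, y)` learns the raw MAC value only if they can also predict the mask. This is the quantity the mask-distinguishing games above reason about.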
B MITIGATION OF SIGRETURN ATTACKS
A solution for precluding sigreturn attacks against PACStack would be to include the signal return value in the PACStack chain via the PC value stored on the signal frame:
Upon signal delivery, the kernel stores a copy of $asigret_n$ securely in kernel space as a reference value. If the process was already executing a signal handler, and thus the kernel already has a reference copy of $asigret_{n-1}$ on record, it stores $asigret_{n-1}$ in the new signal frame and overwrites the secure copy with $asigret_n$. On sigreturn the kernel attempts to validate the PC and CR values in the signal frame as though the reference value was $asigret_0$. If successful it performs the signal return to $sigret_n$ and restores $aret_n$ to CR. Otherwise the kernel assumes a return to a nested signal handler, and retrieves $sigret'_n$ and $asigret'_{n-1}$ from the signal frame, validates them by calculating $asigret'_n = \mathrm{pacib}(sigret'_n, asigret'_{n-1})$ and comparing the result against the stored $asigret_n$ reference value. If successful the kernel replaces $asigret_n$ with $asigret_{n-1}$ in the secure kernel store and performs the signal return to $sigret_n$. If the validation fails the kernel terminates the process. This prevents A from 1) overwriting CR, and 2) forging the PC values in signal frames. For general protection against sigreturn attacks corrupting any register stored in the signal frame, all register values could be included in the $asigret$ calculation using the pacga instruction and validated at the time of sigreturn.
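The delivery and validation steps above can be sketched as a small user-space simulation. This is only an illustration under stated assumptions: `pacib` is modelled by a truncated HMAC rather than the real instruction, the first-level token is assumed to be pacib(sigret, aret) because only the nested formula is spelled out in the text, and the class `SignalKernel`, the key, and the frame layout are hypothetical.

```python
import hashlib
import hmac

KEY = b"hypothetical-pa-key"  # stands in for the PA key; not a real key source


def pacib(pointer: int, modifier: int) -> int:
    """Software stand-in for pacib: 64-bit truncated HMAC of (pointer, modifier)."""
    msg = pointer.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    tag = hmac.new(KEY, msg, hashlib.sha256).digest()[:8]
    return int.from_bytes(tag, "little")


class SignalKernel:
    def __init__(self):
        self.reference = None  # secure kernel-space copy of asigret_n

    def deliver(self, sigret: int, aret: int) -> dict:
        """Build a signal frame and update the secure reference value."""
        frame = {"pc": sigret, "cr": aret, "prev": 0}
        if self.reference is None:
            token = pacib(sigret, aret)  # assumed first-level binding
        else:
            frame["prev"] = self.reference  # asigret_{n-1} goes into the frame
            token = pacib(sigret, self.reference)
        self.reference = token
        return frame

    def sigreturn(self, frame: dict) -> int:
        """Validate an untrusted frame and return the PC to resume at."""
        if self.reference is None:
            raise RuntimeError("no signal pending")
        # First attempt: treat the frame as the outermost signal.
        if pacib(frame["pc"], frame["cr"]) == self.reference:
            self.reference = None
            return frame["pc"]
        # Otherwise assume a return to a nested signal handler.
        if pacib(frame["pc"], frame["prev"]) == self.reference:
            self.reference = frame["prev"]
            return frame["pc"]
        raise RuntimeError("signal frame validation failed: terminate process")
```

In this model a frame whose PC has been altered fails both checks, while unmodified frames unwind in order and leave the secure store empty after the outermost return. Binding the remaining registers, as suggested with pacga, is not modelled here.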
