Widespread use of memory unsafe programming languages (e.g., C and C++) leaves many systems vulnerable to memory corruption attacks. A variety of defenses have been proposed to mitigate attacks that exploit memory errors to hijack the control flow of the code at run-time, e.g., (fine-grained) randomization or Control Flow Integrity. However, recent work on data-oriented programming (DOP) demonstrated highly expressive (Turing-complete) attacks, even in the presence of these state-of-the-art defenses. Although multiple real-world DOP attacks have been demonstrated, no efficient defenses are yet available. We propose run-time scope enforcement (RSE), a novel approach designed to efficiently mitigate all currently known DOP attacks by enforcing compile-time memory safety constraints (e.g., variable visibility rules) at run-time. We present HardScope, a proof-of-concept implementation of hardware-assisted RSE for the new RISC-V open instruction set architecture. We discuss our systematic empirical evaluation of HardScope which demonstrates that it can mitigate all currently known DOP attacks, and has a real-world performance overhead of 3.2% in embedded benchmarks.
INTRODUCTION
Although known for over two decades, memory corruption vulnerabilities are still a persistent source of threats against software systems. The main problem is that modern software still contains a lot of unsafe code written in memory unsafe programming languages (e.g., C and C++), especially in embedded systems and the Internet of Things [15]. The lack of memory safety in these languages and the inevitability of software bugs leave many systems vulnerable to attacks that exploit memory errors.
Control-flow attacks, such as Return-Oriented Programming (ROP) [32], which hijack the execution flow of a program, are well-known, and various defenses against them have been proposed (e.g., [1, 21, 23]). In contrast, non-control-data attacks do not need to modify the control-flow, and thus cannot be prevented by these defenses. Instead, non-control-data attacks corrupt data used for decision-making, e.g., to leak sensitive data or escalate privileges by corrupting variables used in authorization decisions. Some defenses against non-control-data attacks (e.g., [8, 30]) provide protection against attacks that only target individual pieces of (security-critical) data.
However, recent work has shown that non-control-data attacks can be generalized to achieve Turing-complete execution, called Data-Oriented Programming (DOP) [17]. In DOP, the attacker carefully corrupts non-control-data to build up sequences of operations (data-oriented gadgets) without modifying the program's control-flow. Each gadget simulates a virtual operation on some attacker-controlled input. Unlike previous non-control-data attacks, DOP can be highly expressive (e.g., including assignment, arithmetic, and conditional decisions). This allows DOP to actively break state-of-the-art defenses, such as Address Space Layout Randomization (ASLR) [23]. Since DOP can reuse virtually any data, preventing DOP is a significant challenge.
Practical DOP attacks have already been shown against real-world software [14, 17]. Hu et al. [17] discuss various existing schemes that could reduce the number of DOP attacks, including memory safety, data-flow integrity, fine-grained data-plane randomization, and hardware/software fault isolation. However, they explain that mitigating DOP with existing approaches results in high performance overheads, and do not offer viable alternatives. As defenses against control-flow attacks become widespread [19], DOP is likely to become the next appealing attack technique for modern run-time exploitation.
Goals and Contributions. We propose a new efficient defense against non-control-data attacks and all currently known DOP attacks. The intuition behind our approach is simple: In block-structured languages, such as C and C++, every variable has a so-called lexical scope, denoting the block(s) of source code in which the variable is visible. Developers can thus define variables with limited scope of visibility (e.g., local variables). All correct compilers enforce variable scope at compile-time by checking these variable visibility rules. However, all currently known DOP attacks violate variable scope rules at run-time, since there is no equivalent enforcement. Consequently, mechanisms for variable scope enforcement at run-time can significantly reduce the number of available DOP gadgets.
arXiv:1705.10295v2 [cs.CR] 12 Mar 2018
In this paper, we define the notion of Run-time Scope Enforcement (RSE) that provides fine-grained compartmentalization of data memory within programs. We demonstrate that RSE can mitigate all currently known DOP attacks. Although it is not possible to guarantee the absence of DOP gadgets in arbitrary programs,¹ we argue that RSE can prevent DOP in typical programs. Unlike RSE, existing defenses (a) do not provide complete mediation of all variable accesses [17] (as we explain in Section 7) and (b) suffer from high performance and memory overhead [7, 25, 26].
We then describe HardScope, a hardware-assisted RSE scheme. HardScope introduces a set of six new instructions. Compiler-assisted instrumentation places HardScope instructions in the program to ensure that all memory access constraints are also enforced at run-time. As the program executes, these instructions dynamically create rules that define which code blocks can access which pieces of memory. One significant challenge is to minimize the performance overhead of checking these rules on every memory load/store operation. To overcome this challenge, we designed an efficient method for storing the rules as a stack, such that all rules applicable to the currently executing code block are always at the top of the stack and can be checked simultaneously (Section 5).
Since enforcement rules are created and updated dynamically, HardScope (or any other RSE scheme) enables context-specific memory isolation, unlike prior defenses that only allow static policies [3, 5-7, 13, 35]. This means that the same piece of code can be granted access to different memory locations depending on the context in which the code is executed. For example, if a particular function can be called as part of either a privileged or unprivileged execution path, HardScope can allow/deny it access to certain variables in memory depending on which path was executed.² This is critical to reducing the number of available DOP gadgets. In Section 5.2 we describe how HardScope can enforce memory protection at either coarser or finer granularity. RSE can be instantiated with HardScope to enforce various protection models, such as defending against ROP attacks by protecting function return addresses on the stack.
We developed a proof-of-concept implementation of HardScope targeting the RISC-V instruction set architecture. We integrated HardScope with the open-source Pulpino core³ on a Zedboard FPGA⁴ (Section 5). We also added support for HardScope to the RISC-V toolchain and enhanced the RISC-V compiler to automatically instrument programs to protect variables at run-time to mitigate known DOP attacks. Finally, we extended the official RISC-V simulator, Spike, to support our new instructions. Our evaluation indicates that HardScope is efficient enough to be realized even on small embedded devices.
¹ A pathological case would be a program that contains all necessary gadgets, a gadget dispatcher, and all the data in the same function.
² RSE can in fact enforce different policies on each distinct invocation of a given function even in a single execution path.
³ http://www.pulp-platform.org/
⁴ http://zedboard.org/
In summary, our main contributions are as follows:
• Run-time Scope Enforcement: A novel approach for finegrained context-specific memory isolation within programs (Section 4) to defeat non-control-data attacks, and all currently known DOP attacks.
• HardScope: An open-source proof-of-concept implementation of hardware-assisted runtime scope enforcement on the RISC-V instruction set architecture to achieve efficient compartmentalization of memory accesses within programs capable of mitigating all currently known DOP attacks (Section 5).
• Automatic Instrumentation: Compiler support for protecting large classes of variables at run-time without requiring any developer input or data-flow analysis (Section 5.2).
• Evaluation: Systematic analyses of how HardScope mitigates published DOP attacks (Section 6.1), discussion of RSE security guarantees (Section 6.2), and evaluation of HardScope's hardware area overhead and performance impact (Section 6.3). HardScope is efficient, incurring only a 3.2% performance overhead in embedded benchmarks.
Code Availability. To enable our results to be reproduced, and to encourage further research in this area, the source code for the HardScope-enhanced GCC toolchain, including the RISC-V simulator, is available in the accompanying supplementary material at https://goo.gl/TAjLxy.
BACKGROUND
Memory errors and bounds checking. Exploitable memory errors may be used as entry-points to a vulnerable program. These may provide read or write access to program memory. At run-time, C and C++ programs can access or overwrite data anywhere in their own memory space. Modern compilers may insert checks around operations on local, global, and heap objects (e.g., arrays) that verify at run-time whether data written to a memory object is within the boundaries of that object. Such bounds checking can prevent buffer overflows, but requires instrumentation of code and incurs high performance and memory overhead [31], even with hardware support for such bounds checks [28].⁵
Existing defenses. Modern systems prevent direct modification of program code by W⊕X memory access policies such as Data Execution Prevention [16], use control-flow integrity [1, 19] to deter control-flow attacks that do not modify program code, and raise the bar against run-time attacks using Address Space Layout Randomization (ASLR) [23].
W⊕X. An attacker can use memory vulnerabilities to subvert the integrity of the program memory. Direct modification of program code in modern processors is prevented by W⊕X memory access policies such as Data Execution Prevention [16].
Probabilistic defenses. In sophisticated run-time attacks, the attacker crafts payloads that cause the program to behave in an unintended manner. These payloads usually refer to data and code by their addresses in memory. Attackers can find these addresses by offline analysis of the program memory layout. Address Space Layout Randomization (ASLR) [23] randomizes the memory layout of the program on each execution. ASLR typically randomizes the base address of the executable and the positions of the stack, heap, and libraries. This prevents an attacker from reliably addressing known targets, for instance identifying a particular function to jump to, or reading/modifying a particular variable. However, ASLR defenses are susceptible to information leakage (e.g., by obtaining the value of a well-known pointer), and are routinely bypassed in real-world exploits [33].
Data-Oriented Programming. In DOP [17], an attacker carefully tampers with non-control-data to execute sequences of operations within the program on attacker-controlled input. Each sequence of operations constitutes a data-oriented gadget which represents a single virtual machine instruction executing on top of benign program logic. A gadget dispatcher (e.g., an attacker-controlled loop in the benign program) enables the attacker to chain together data-oriented gadgets to realize expressive computation.
Hu et al. [17] demonstrate three practical DOP attacks against the ProFTPD file server and one DOP attack against the Wireshark network packet analyzer. In the following, we describe the first attack against ProFTPD in detail. The remaining DOP attacks described by Hu et al. and an independently discovered attack by Evans [14] against the GStreamer FLIC decoder are described in Appendix C.
Each attack against ProFTPD exploits the same stack buffer overflow vulnerability in a general-purpose string replacement function, sreplace(),⁶ allowing the attacker to read and write arbitrary memory locations (as shown in Appendix A). The attacker's goal is to obtain the server's OpenSSL private key, but the memory address of this key is randomized by the program. The attacker constructs a virtual DOP program that accesses the OpenSSL context structure (ssl_ctx) from a well-known location in memory, then dereferences a chain of pointers to determine the key's location. The attack requires three different data-oriented gadgets: assignment, addition, and pointer dereferencing. The assignment gadget is constructed from the vulnerable sreplace() function. The addition gadget is realized by corrupting two integer fields in a global data structure, session.total_bytes_out and session.xfer.total_out, and performs the operation session.total_bytes_out += session.xfer.total_out. The dereference gadget is obtained by corrupting a string pointer in another global data structure, main_server.ServerName. This dereferences the pointer main_server->ServerName and copies the result to a known position in a static buffer. These gadgets can be triggered in arbitrary sequences using specially crafted input without compromising the control-flow of ProFTPD. Note that during benign execution, sreplace() need not access ssl_ctx, nor any of the nested structures that lead to the key, nor the key itself. In Section 6.1 we demonstrate how enforcing RSE in ProFTPD thwarts this attack.
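The way three simple gadgets compose into a pointer-chasing computation can be pictured with a toy model. The following Python sketch is an illustration under stated assumptions, not ProFTPD code: memory is a dictionary from address to integer, the functions assign, add, and deref stand in for the three gadgets, run plays the role of the gadget dispatcher, and every address, offset, and name is hypothetical.

```python
# Toy model of data-oriented gadgets. All addresses, offsets, and names are
# hypothetical; this abstracts the idea and is not ProFTPD code.

def assign(mem, dst, value):
    """Assignment gadget: write an attacker-chosen value to dst."""
    mem[dst] = value

def add(mem, dst, src):
    """Addition gadget: mem[dst] += mem[src]."""
    mem[dst] = mem[dst] + mem[src]

def deref(mem, dst, ptr):
    """Dereference gadget: copy the word that mem[ptr] points to into dst."""
    mem[dst] = mem[mem[ptr]]

def run(mem, program):
    """Gadget dispatcher: a benign loop driven by attacker-supplied input."""
    for gadget, dst, operand in program:
        gadget(mem, dst, operand)

def find_key(mem, ctx_addr, key_offset, scratch):
    """Follow ctx -> structure -> key field using only the three gadgets."""
    acc, off, out = scratch
    program = [
        (assign, acc, ctx_addr),    # acc = address of the well-known pointer
        (deref, acc, acc),          # acc = randomized structure address
        (assign, off, key_offset),  # off = field offset inside the structure
        (add, acc, off),            # acc = address of the key field
        (deref, out, acc),          # out = key pointer stored in that field
    ]
    run(mem, program)
    return mem[out]
```

None of the steps changes which code executes; only data is rewritten, which is why defenses that monitor control flow do not observe the chain.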
REQUIREMENTS & ASSUMPTIONS
Adversary Model. We consider a powerful adversary who has full control over the data memory of the target program. This models buffer overflows and other memory corruption vulnerabilities (e.g., an externally controlled format string⁷) that could lead to arbitrary corruption of data memory. However, the adversary cannot modify program code at run-time (W⊕X protection). Our adversary model is standard for run-time attacks and consistent with the adversary in Hu et al.'s DOP attacks against ProFTPD (Section 2 and Appendix C).
⁶ CVE-2006-5815: https://cve.mitre.org/cgi-bin/cvename.cgi?name=cve-2006-5815
Requirements. We require a mechanism that prevents the above adversary from mounting DOP attacks. Since DOP attacks require the adversary to modify data in unintended ways at run-time, these attacks can be prevented by a run-time enforcement mechanism that prevents any modification of control-data and non-control-data that would not be permitted during a compile-time check by a correct compiler. We derive the following requirements for a mechanism that mitigates all currently known DOP attacks.
R1. Multi-granularity enforcement. Enforce memory protection at run-time for any granularity of protection domain (subject) and protected region (object).
R2. Context-specific enforcement. Enforce different permissions on each invocation of the same subject (e.g., each function), to minimize the attack surface following the principle of least privilege.
R3. Complete mediation. Protection domains cannot increase their permissions accidentally or maliciously, and all memory accesses can be checked with only minimal performance impact and memory overhead.
Design goals. We define two design goals for the system:
• Legacy software should run without recompilation even if selected components, such as libraries, make use of finegrained protection.
• Performance and memory overhead should scale gracefully with the number of protection domains (subjects), the number of protected regions (objects), and the frequency of domain transitions.
Assumptions. In our implementation of RSE we make the following assumptions:
• We restrict our attention to single-threaded C programs. We outline what would be needed to relax this assumption in Appendix F.
• Typical programs minimize the scope of variables and interdependence between modules, e.g., local and static variables are preferred over global variables. We discuss narrowing the run-time visibility of global variables in Section 5.2.
• Typical programs enhance spatial locality by nesting structures instead of creating links between separate structures by nesting pointers. This is reasonable to assume because it improves performance by making better use of processor caches, and may also improve power consumption in embedded applications [18]. Nevertheless, we also discuss how RSE can be applied to nested pointers in Section 5.2.
• We focus on an adversary that employs DOP and other non-control-data attacks. A real-world adversary may also attempt to influence a program's control flow. Defenses against control-flow attacks, such as Control-Flow Integrity (CFI) [1] are complementary to RSE. In Appendix E we show how applying RSE at a suitable granularity can also prevent large classes of control-flow attacks, e.g., ROP.
DESIGN OVERVIEW
Designing a solution to meet the requirements identified in Section 3 requires addressing two major challenges:
(1) Run-time enforcement: enforcing variable scopes at run-time requires information which is usually only available at compile-time (necessary to meet R1).
(2) Context-specific enforcement: enforcing different rules for each invocation of a code block requires rules to be created, modified, and deleted dynamically at run-time (necessary to meet R2).
R3 implies that solutions to these challenges must be efficient. The high-level idea of HardScope is to extend the compiler to emit compile-time information about the visibility of variables, and to extend the underlying hardware to use this compiler-supplied information to dynamically create and update a set of memory access rules against which all memory accesses are checked. We chose function-level compartmentalization as the granularity of isolation, since this is sufficient to mitigate all currently known DOP attacks (Section 6.1, Appendix C). However, RSE can also be implemented at other granularities, without changes to the new HardScope hardware, as described in Section 6.2 and Appendix E.
Run-time enforcement. Binary program code produced from languages such as C and C++ does not include information available to the compiler about variables and code blocks. RSE needs this information to assign in-memory variables to specific execution contexts. To bridge this gap between compile-time lexical scope and run-time execution context, we modified the compiler to instrument the program code with special instructions that record the variables that may be used by each block of code. HardScope introduces an instruction set extension for this purpose (Section 5.1).
The compile-time components and behavior of HardScope are illustrated in Figure 1. An unmodified source code program (❶) is fed to the compiler (❷), which checks that all variable accesses are correctly scoped (as usual). Our new RSE Plug-in (❸) in the compiler adds HardScope instructions (❹) at particular locations in the binary (e.g., at the start of functions). This results in a fully-functional program binary, instrumented with HardScope instructions. These instructions are used by the HardScope hardware to create a set of rules against which all memory accesses can be checked at run-time.
Context-specific enforcement. Consider the program (❶) in Figure 1: function C receives two pointers as input and copies data from the first pointer to the second. It can be called from either function A or function B (the call graph is shown in Figure 2). In benign program execution, variables x and y are only used in a privileged execution path, where access control checks prevent their misuse (e.g., one of them could be a secret key). Function B contains an exploitable vulnerability allowing the attacker to control the pointers passed to function C. Since function C can be used to copy arbitrary data between two attacker-controlled pointers, this constitutes a usable DOP gadget. The attacker could use this to bypass the access control checks on variables x and y by accessing them through the unprivileged execution path. To enable context-specific enforcement, HardScope must be able to associate different memory access rules with each active instance of a function.
To address this challenge, the HardScope hardware creates memory access rules dynamically for each individual function invocation, and stores these in a stack data structure called the Storage Region Stack (SRS). The SRS is kept in hardware-isolated protected memory; only HardScope instructions can add or remove SRS entries. Each entry in the SRS defines an area of data memory (e.g., the location of a variable) that may be accessed. The SRS is organized into frames; each frame corresponds to an execution context (i.e., contains all the entries for that context). The topmost SRS frame corresponds to the active execution context. On each memory access, e.g., load or store, HardScope validates that the memory address matches an entry in the topmost SRS frame.
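The check performed on each load or store can be summarized with a small behavioral model. The Python sketch below is an illustration rather than the HardScope hardware: the class and method names are hypothetical, each SRS entry is modeled as a (base, limit) pair with an inclusive limit, and an empty SRS is treated as enforcement being disabled.

```python
class ScopeViolation(Exception):
    """Raised when an access falls outside the topmost SRS frame."""

class SRS:
    """Behavioral model of the Storage Region Stack (names are hypothetical)."""

    def __init__(self):
        self.frames = []  # one list of (base, limit) entries per context

    def push_frame(self):
        self.frames.append([])

    def pop_frame(self):
        self.frames.pop()

    def add_entry(self, base, limit):
        # Entries always go into the frame of the active execution context.
        self.frames[-1].append((base, limit))

    def check(self, addr):
        """Validate a load/store address against the topmost frame only."""
        if not self.frames:
            return  # assumption: no frames means enforcement is disabled
        if not any(base <= addr <= limit for base, limit in self.frames[-1]):
            raise ScopeViolation(hex(addr))
```

Because only the topmost frame is consulted, an entry held by a caller does not by itself grant the callee access; the callee's frame must contain its own entry for the address.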
To prevent the attack scenario in Figure 2, HardScope's enforcement of function C's memory accesses must distinguish between legitimate accesses to variables x and y when invoked by function A, and illegitimate accesses to them when invoked by the exploited function B. By default, HardScope prevents function C from accessing both x and y.
The SRS for function A (❺) includes variables x and y, and the SRS for function B (❻) includes variables i and j (Figure 2). To allow function C to access certain variables, the calling function must use a special instruction (Figure 1 ❼) to delegate access to a variable to function C: e.g., function A must delegate access to x and y. For valid delegation, the calling function must already have access to the delegated variables. HardScope RSE therefore prevents the attack shown in Figure 2: even though the attacker can manipulate the pointers in function B, this function does not have access to x and y (its SRS lacks the corresponding entries) and hence function B cannot delegate access to these variables to function C.
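The scenario with functions A, B, and C can be replayed in a few lines. This Python sketch is a simplified model under stated assumptions: variables are one-word regions at made-up addresses, frames are lists of (base, limit) pairs with inclusive limits, and the helper names delegate and call_c are hypothetical.

```python
def delegate(caller_frame, addr):
    """Return the caller's entry covering addr, or None if it has none."""
    for base, limit in reversed(caller_frame):
        if base <= addr <= limit:
            return (base, limit)
    return None

def call_c(caller_frame, src, dst):
    """Build function C's frame from the caller's delegations and report
    whether C's copy from src to dst would pass the access check."""
    delegated = (delegate(caller_frame, src), delegate(caller_frame, dst))
    callee_frame = [entry for entry in delegated if entry is not None]

    def allowed(addr):
        return any(base <= addr <= limit for base, limit in callee_frame)

    return allowed(src) and allowed(dst)

# Hypothetical one-word variables at made-up addresses.
X, Y, I, J = 0x100, 0x104, 0x200, 0x204
frame_a = [(X, X + 3), (Y, Y + 3)]  # function A may access x and y
frame_b = [(I, I + 3), (J, J + 3)]  # function B may access i and j
```

In this model, call_c(frame_a, X, Y) succeeds, whereas call_c(frame_b, X, Y), which corresponds to B passing corrupted pointers, fails because B holds no entry for x or y that it could delegate.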
IMPLEMENTATION
To realize Run-time Scope Enforcement (RSE), the new instructions must provide the following functionality:
(1) Specify what storage regions are accessible by an execution context.
(2) Allow an execution context to dynamically delegate access to a storage region to another execution context (e.g., during function invocation or return).
(3) Subdivide a storage region so that partial access can be delegated.
Section 5.1 describes the HardScope instructions in detail. Further, the compiler needs to be modified to emit these instructions to describe visibility rules derived from the program source code. Section 5.2 describes our compiler RSE Plug-in.
Finally, the underlying processor hardware must be extended to efficiently store and enforce the visibility rules described by the new instructions. Section 5.3 describes our processor modifications and proof-of-concept hardware realization of HardScope.
HardScope Instructions
HardScope extends the RISC-V instruction set with six new SRS management instructions, as shown in Table 1.
Figure 2: Call graph of the program in Figure 1. In (a), access to variables x and y is successfully delegated from A to C. In (b), function B should not have access to x and y (e.g., one of them could be a secret key), but a memory corruption vulnerability in B has been used to corrupt the pointers passed to C so that they point to x and y instead of i and j. HardScope prevents B from accessing or delegating x and y.
The sbent and sbxit instructions are used by the compiler RSE Plug-in to mark the beginning and end of each execution context. HardScope hardware uses these instructions to track when HardScope is first enabled and when execution context changes, and thus when new enforcement rules should be loaded in the SRS. When sbent is executed, a new frame is pushed on top of the SRS. Conversely, sbxit pops the topmost SRS frame. Program execution starts with an empty SRS and HardScope enforcement is initially disabled. The first function that supports HardScope, typically the program's main function, executes sbent to enable HardScope. HardScope remains enabled until a matching sbxit empties the stack. Due to the associated SRS management (explained in Section 5.3), sbent and sbxit may consume up to n additional cycles, where n is the number of storage region entries in the topmost SRS frame. However, this only stalls the processor if the instruction is followed by another sbent or sbxit within the next n executed instructions.
The sradd and srdda instructions create a storage region entry (SRS entry) in the current (topmost) SRS frame. HardScope hardware uses these instructions to determine the bounds of memory areas that the current execution context is allowed to access. The value of the first register operand of each instruction sets the base address, and the second operand sets the limit address. An optional offset is added to either the limit (sradd) or base (srdda) register operand.
The srdlg and srdsub instructions delegate an SRS entry from the currently executing function either to an invoked callee function or to the caller when the current function returns. HardScope hardware uses these instructions to derive SRS entries for flow of data which cannot be fully tracked by the compiler, such as context-specific accesses (Section 4). The srdlg instruction takes a single register operand and an immediate offset or only an immediate operand specifying an absolute address to determine which memory address to delegate. The resulting memory address is compared with the entries in the current SRS frame and if a match is found, the matching entry is copied to the next execution context entered. If the delegation is followed by a sbent, the delegated entry is added to the newly created SRS frame. If the delegation is followed by a sbxit, the delegated entry is added to the caller's SRS frame. If multiple matching entries exist, only the most recently added entry is delegated.
The srdsub instruction is used to delegate a new SRS entry that is a subset of an existing SRS entry. It takes the same operands as sradd. HardScope hardware uses the operands to decide when storage region entries should be subdivided. If the new subdivided memory region is a subset of an existing SRS entry in the current SRS frame, a new SRS entry is created for a sub-region using the new base and limit.
If no matching entry is found in the SRS when srdlg or srdsub execute, no entry is delegated. This prevents the use of srdsub to elevate the access rights of the next execution context beyond the rights of the current, but allows the delegation instructions to be applied to pointers which are not dereferenced directly in the current context. These include null-pointers and intentionally created out-of-scope pointers (e.g., via the use of pointer arithmetic) that are passed to callees for which they are in scope (e.g., accessor functions that receive opaque pointers from the caller). We refer to this as lax delegation and describe a strict delegation variant in Appendix D.
The srdlg and srdsub instructions each consume one additional cycle if immediately followed by context switching HardScope instructions.
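The delegation rules described above can be condensed into a behavioral model. The Python sketch below follows the lax delegation semantics as stated in the text and is not the hardware design: method names mirror the instruction mnemonics, operands are plain integers rather than registers and offsets, limits are treated as inclusive, and the pending list is a hypothetical stand-in for however the hardware carries delegated entries into the next context.

```python
class SRS:
    """Behavioral model of sbent, sbxit, sradd, srdlg, and srdsub."""

    def __init__(self):
        self.frames = []   # stack of frames, each a list of (base, limit)
        self.pending = []  # entries delegated to the next context entered

    def sbent(self):
        # The callee's new frame starts with whatever the caller delegated.
        self.frames.append(self.pending)
        self.pending = []

    def sbxit(self):
        self.frames.pop()
        if self.frames:
            # Entries delegated before the exit go to the caller's frame.
            self.frames[-1].extend(self.pending)
        self.pending = []

    def sradd(self, base, limit):
        self.frames[-1].append((base, limit))

    def srdlg(self, addr):
        # The most recently added matching entry is delegated;
        # if nothing matches, nothing is delegated (lax delegation).
        for base, limit in reversed(self.frames[-1]):
            if base <= addr <= limit:
                self.pending.append((base, limit))
                return

    def srdsub(self, base, limit):
        # Delegate [base, limit] only if it lies inside an existing entry.
        for b, l in reversed(self.frames[-1]):
            if b <= base and limit <= l:
                self.pending.append((base, limit))
                return
```

Since both delegation instructions only ever copy or shrink entries that the current frame already holds, a context cannot hand the next context more access than it has itself.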
RSE GCC Plug-in and Backend
We developed an enhanced version of the GCC compiler incorporating a proof-of-concept RSE GCC plug-in and a modified RISC-V backend that can automatically instrument C programs to benefit from HardScope without requiring any changes to program code or additional information from the developer (e.g., code annotations).
The GCC plug-in analyzes the program's Intermediate Representation (IR) within GCC. The plug-in targets the high-level GIMPLE representation, which is a processor independent abstraction of the program. From the IR, the plug-in extracts information about the use of global and static variables in each function, the type of pointers passed as arguments in function calls, and the return type of each function to assess whether delegation is needed. The results of the analysis are passed to the modified RISC-V backend that operates on the low-level Register Transfer Language (RTL) representation of the program and emits sequences of assembly. While the RTL lacks information about the lexical scope of variables, the backend supplements the information in the RTL with information retained from the prior RSE plug-in analysis pass and emits HardScope instructions when expanding function prologues, epilogues and function call sites.
Function instrumentation. Our modified RISC-V backend currently supports automatic instrumentation of C programs at function granularity to protect 1) the return address and other return state information, 2) stack variables, 3) arguments passed on the stack, 4) heap objects, and 5) global and static variables. The beginning of each distinct execution context is marked by inserting a single sbent instruction at the function call site just before the jump instruction. The end of an execution context is marked by inserting an sbxit instruction just before the return in the callee function.
Listing 1: Function prologue instrumentation. Registers used are the stack pointer (sp), return address register (ra), frame pointer (s0), and temporary register zero (t0).
In Section 6 we show that function-level isolation is sufficient to mitigate all currently known DOP attacks. However, RSE can also be implemented at other granularities, without changes to the new HardScope instructions or HardScope hardware as described in Section 6.2.
Return state and stack variables. By convention, the compiler adds a function prologue to the beginning, and a function epilogue to the end of each function. The function prologue is responsible for allocating space on the call stack for local variables, and storing the return address and old frame pointer to the stack. HardScope instructions are inserted before the standard prologue begins in order to create new SRS entries for local variables allocated by the prologue, as well as for any static or global variables accessed by the function.
Listing 1 shows a function prologue (❸) that reserves space for a stack frame containing two 32-bit variables, the return address, and frame pointer (16 bytes in total) from the stack (line 8). It stores the value of the return address register (ra), and the register holding the frame pointer (s0) to the stack frame (lines 9, 10). Our instrumentation prepends the prologue with a srdda instruction (❶) that adds an SRS entry covering the whole stack frame (e.g., 16 bytes in Listing 1). The limit for this entry is one less than the current stack pointer value, and the base address is calculated by subtracting the size of the stack frame from the current stack pointer. The compiler already knows the size of the stack frame since this is used to decrement the stack pointer (line 8). The SRS entry must be added before the standard prologue so that the prologue can access the stack to store the return address (line 9).
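The bounds of the stack-frame entry follow directly from the stack pointer and the frame size. The short Python sketch below works through that arithmetic; the function names are hypothetical, and the store offset used in the usage note (12 bytes above the new stack pointer for ra) is an assumption based on a typical RV32 prologue.

```python
def stack_frame_entry(sp, frame_size):
    """SRS entry for the frame that the prologue is about to allocate.

    sp is the stack pointer before the prologue decrements it.
    """
    base = sp - frame_size  # lowest address of the new frame
    limit = sp - 1          # highest address of the new frame (inclusive)
    return base, limit

def covers(entry, addr, width=4):
    """True if a width-byte access at addr lies entirely inside entry."""
    base, limit = entry
    return base <= addr and addr + width - 1 <= limit
```

For sp = 0x7ffff000 and a 16-byte frame, the entry is (0x7fffeff0, 0x7fffefff). A 4-byte store of ra at the new stack pointer plus 12 (0x7fffeffc) is covered, whereas a store at the old stack pointer, which belongs to the caller's frame, is not.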
Global and static variables. In C programs, global objects can be accessed from any scope. However, RSE instrumentation can effectively narrow the scope of global objects to only those functions that refer directly to these objects. For example, in Listing 1, a global myobject is accessed from the function's scope, so an SRS entry for myobject is created before the standard prologue begins (❷). Separate SRS entries are added for each object accessed by the function. The sizes of global objects are known from the IR analysis and the address is evaluated by the linker. Static variables are handled similarly. Conversely, functions that access global objects indirectly (e.g., via function pointers) must receive the necessary SRS entries for these objects via run-time delegation. Without such delegated rules HardScope would prevent access to indirectly accessed data.

     1  ; ❶ prepare arguments
     2  lw a0,-12(s0)         ; dest
     3  add a1,s0,-1060       ; src
     4  li a2,1024            ; n
     5  ; ❷ setup delegations and callee srs context
     6  srdlg a0              ; delegate dest
     7  srdsub a1,1024(a1)    ; delegate n bytes of src
     8  sbent
     9  ; ❸ jump to memcpy
    10  jal ra,memcpy

Listing 2: Instrumented memcpy() call. dest (a0) points to a buffer in the program's data section. src pointer (a1) points to a local buffer allocated from the function's stack frame. Integer n (a2) holds the number of bytes to copy.

    ; ❶ prepare return value
    lw a0,-12(s0)
    ; ❷ delegate returned object and exit context
    srdlg a0              ; delegate returned object
    sbxit
    ; ❸ return to caller
    ret

Listing 3: Instrumented function returning a pointer via register a0.
Function arguments and return values. Listing 2 shows an instrumented call to the memcpy() function in the C standard library. The memcpy() function copies n bytes from src to dest, so the caller prepares the arguments dest, src, and n as usual (❶). To allow memcpy() to operate on these memory areas, two delegation instructions are added to the program (❷) just before the call (❸). The dest pointer is held in register a0 and points to a global buffer in the program's data section. The caller already has an SRS entry for this specific buffer, which it delegates using the srdlg instruction with register a0 (line 6).
The src buffer exists in the caller's own stack frame. To avoid giving memcpy() access to the caller's whole stack frame, the caller delegates only the sub-region spanning the src buffer using srdsub (line 7). In this example, the programmer defined the number of bytes to copy (n) based on the size of src. If the size of dest is less than n, this would result in a buffer overflow, which could be used to overwrite variables in the program's data section. However, since memcpy() is only delegated access to the memory area containing dest, HardScope prevents this memory error.
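The sub-region delegation and the resulting overflow protection can be illustrated with a small software model. The sketch below is an assumption-laden illustration rather than HardScope's implementation: srs_entry, entry_covers() and delegate_sub() are hypothetical names, and delegate_sub() only mimics the effect described for srdsub.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an SRS entry: inclusive [base, limit] range. */
typedef struct {
    uint32_t base;
    uint32_t limit;
} srs_entry;

/* True if the len-byte access starting at addr lies entirely inside e. */
bool entry_covers(srs_entry e, uint32_t addr, uint32_t len)
{
    if (len == 0 || addr < e.base || addr > e.limit)
        return false;
    return len - 1 <= e.limit - addr; /* written to avoid wrap-around */
}

/* Models the effect of srdsub: derive an entry for [base, base + len - 1],
 * but only if the delegating context holds an entry covering that region. */
bool delegate_sub(srs_entry parent, uint32_t base, uint32_t len, srs_entry *out)
{
    assert(out != 0);
    if (!entry_covers(parent, base, len))
        return false;
    out->base  = base;
    out->limit = base + (len - 1);
    return true;
}
```

With a callee that holds only an entry for a 512-byte dest buffer, entry_covers() rejects the first byte written past the end of dest, which corresponds to the fault HardScope would raise when n exceeds the size of dest.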
Listing 3 shows an instrumented function returning a pointer (❶). Before it returns (❸), the function delegates access to the returned object and exits the current execution context (❷).
Heap object allocation. We implemented a wrapper on top of the C standard library malloc() function that creates SRS entries for the memory allocated on the heap, and delegates these to the caller.
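A wrapper of this kind might look as follows. This is a hedged sketch, not the wrapper shipped with HardScope: hs_malloc(), hs_delegate_to_caller(), last_delegated and delegation_count are hypothetical names, and the delegation step is replaced by a software stub that merely records the entry, since the real step would be carried out by HardScope instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record of a delegated heap region, [base, limit] inclusive. */
typedef struct {
    uintptr_t base;
    uintptr_t limit;
} srs_entry;

/* Stand-in for the hardware delegation step (an assumption of this sketch):
 * it only records the most recent entry so that it can be inspected. */
srs_entry last_delegated;
int delegation_count;

void hs_delegate_to_caller(srs_entry e)
{
    assert(e.limit >= e.base);
    last_delegated = e;
    delegation_count++;
}

/* Wrapper around malloc(): allocate, then delegate exactly the allocated
 * bytes to the caller. Nothing is delegated if the allocation fails. */
void *hs_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL && size > 0) {
        srs_entry e = { (uintptr_t)p, (uintptr_t)p + (size - 1) };
        hs_delegate_to_caller(e);
    }
    return p;
}
```

The caller uses hs_malloc() exactly like malloc(); the only visible difference in this model is that the entry for the new allocation is recorded before the pointer is returned.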
Deeply nested pointers. Keeping track of SRS entries is a challenge when deeply nested data structures are delegated, e.g., when the head of a linked list is passed as an argument. Our current PoC GCC Plug-in cannot automatically infer the complete set of SRS entries that should be delegated when a callee receives a pointer to the beginning of a nested pointer chain. Additionally, the number of SRS entries that must be accumulated and delegated for large nested data structures results in more frequent stalls at run-time (cf. additional profiling on CoreMark in Appendix G).
To handle delegation of such complex data structures, the HardScope GCC Plug-in must infer the relationship between linked data structures and emit instructions to allow HardScope hardware to derive the complete set of SRS entries that must be delegated for that structure. Our current GCC Plug-in implementation can derive SRS entries for structures allocated from memory pools known statically (a common pattern in embedded applications). However, to provide finer granularity policies that better match developer intent, a developer writing software for a HardScope-enabled platform can insert srdlg and srdsub instructions manually where needed. The HardScope support files include compiler macros for this purpose. Standard data structures can be provided with wrapper functions. Employing more sophisticated program analysis could insert such wrappers automatically.
Hardware Implementation
We developed a proof-of-concept hardware implementation of HardScope and integrated it into the open-source RISC-V Pulpino core. 8 We modified the instruction decoding stage of the processor pipeline to interpret the new instructions (Section 5.1). To minimize modifications to the decode stage, all new instructions were encoded in RISC-V's existing S-type instruction format, which allows up to two register operands and a 12-bit signed integer immediate operand for each instruction. After decoding, the appropriate control signals are sent to the HardScope unit, which realizes the execute stage of the new instructions.

Figure 3 shows the main components of the HardScope unit: the SRS controller (❶), dedicated memory to hold the SRS (❷), and three register banks (❸, ❹, ❺). The active bank (❸) holds the entries in the SRS frame for the current execution context, enabling each memory access to be compared against all active entries efficiently. The spare bank (❹) holds entries delegated via srdlg and srdsub before a HardScope context switch occurs. It allows delegated entries for the next execution context to be accumulated ahead of time. When a HardScope context switch occurs, the spare bank becomes the active bank (and vice versa), thus activating the delegated entries. The third bank (❺) is used as a cache to hold a copy of the topmost frame of the SRS. This reduces the latency when the topmost SRS frame is transferred between stack memory and the spare bank.
When executing sbent, the controller activates the spare bank and transfers the contents of the currently active bank to the cache (❻) in a single cycle. The bank that held the previously active frame becomes the spare, and can be used for subsequent delegations. The entries in the cache must be stored for future use, and are transferred to the SRS in protected memory (❼) over at most n subsequent cycles, where n is the maximum number of entries in the cache. During this time, the CPU continues to execute subsequent instructions normally until a new HardScope context switch occurs. Only if a HardScope context switch occurs before the cache has been emptied does the processor stall until the transfer is complete.
When executing sbxit, the controller copies the SRS frame from the cache into the spare bank (❽) while retaining delegated entries (i.e., activating the entries that are already in the spare bank). The SRS frame in the previously active bank is no longer needed and is discarded. This executes in a single cycle. The cache, which now holds an out-of-date copy of the active frame, is updated with the topmost SRS frame from the protected memory (❾), which takes at most n cycles, where n is the number of entries in the topmost SRS frame in memory. This transfer to the cache does not stall the processor unless another sbxit is encountered before the cache is fully populated, in which case the CPU stalls until the next frame is available. However, if an sbent is encountered before the cache is fully populated, the partial cache is discarded and replaced with the contents of the active bank, without stalling.
The sradd and srdda instructions always operate on the active bank. When executing srdsub, the controller checks the active bank for an entry containing the given memory region and, if found, adds the new sub-entry to the spare bank. Similarly, in srdlg, the controller checks for the matching entry in the active bank and, if found, copies the entry to the spare bank (❿). The srdlg and srdsub instructions require an additional cycle only if followed immediately by a context switching HardScope instruction that modifies the spare bank.
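The interplay of the active bank, the spare bank, and the stack of inactive frames can be summarized in a simplified software model. The sketch below makes several assumptions: all hs_* names and the bank and hs_unit types are hypothetical, the bank size and frame count of 16 are illustrative values, and the cache bank together with its transfer latency is omitted because it affects timing only, not which entries are active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BANK_SIZE  16  /* entries per bank (assumed value) */
#define MAX_FRAMES 16  /* frames held in protected memory (assumed value) */

typedef struct { uint32_t base, limit; } srs_entry;
typedef struct { srs_entry e[BANK_SIZE]; int n; } bank;

typedef struct {
    bank active;             /* entries checked on every load/store */
    bank spare;              /* entries delegated to the next context */
    bank stack[MAX_FRAMES];  /* inactive frames (protected memory) */
    int  depth;
} hs_unit;

void hs_init(hs_unit *u) { memset(u, 0, sizeof *u); }

static bool bank_add(bank *b, srs_entry x)
{
    if (b->n == BANK_SIZE) return false;
    b->e[b->n++] = x;
    return true;
}

/* Models sradd/srdda: create an entry in the active bank. */
bool hs_add(hs_unit *u, srs_entry x) { return bank_add(&u->active, x); }

/* Models srdlg: copy the matching active entry to the spare bank. */
bool hs_delegate(hs_unit *u, uint32_t addr)
{
    for (int i = 0; i < u->active.n; i++) {
        srs_entry x = u->active.e[i];
        if (addr >= x.base && addr <= x.limit)
            return bank_add(&u->spare, x);
    }
    return false; /* the current context holds no such entry */
}

/* Models sbent: save the active frame and activate the delegated entries. */
bool hs_enter(hs_unit *u)
{
    if (u->depth == MAX_FRAMES) return false;
    u->stack[u->depth++] = u->active;
    u->active = u->spare;
    u->spare.n = 0;
    return true;
}

/* Models sbxit: restore the saved frame, keeping entries delegated back. */
bool hs_exit(hs_unit *u)
{
    if (u->depth == 0) return false;
    bank restored = u->stack[u->depth - 1];
    if (restored.n + u->spare.n > BANK_SIZE) return false;
    for (int i = 0; i < u->spare.n; i++)
        restored.e[restored.n++] = u->spare.e[i];
    u->depth--;
    u->active = restored; /* the exiting context's frame is discarded */
    u->spare.n = 0;
    return true;
}
```

In this model a callee entered via hs_enter() sees only what was delegated to it, and hs_exit() returns the caller to its saved frame plus anything the callee delegated back.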
Integrating HardScope into the processor pipeline at the execute stage also required modifying the memory access stage to intercept all memory access requests to the load/store unit. At each load or store instruction, the requested memory address and the number of requested bytes (one byte, half-word (two bytes), or word (four bytes)) are evaluated to a memory address range to be fetched from memory. The request is forwarded to the SRS controller, which compares it against all entries in the active bank. The registers in each bank are wired to comparators that enable all entries in the bank to be checked in parallel. If a match is found, i.e., the requested address range is a subset of any of the active entries, then the request is allowed to be evaluated by the processor's load/store unit; otherwise a hardware fault is raised. Because the check is performed in parallel, intercepting memory accesses adds no additional cycles to load and store instructions.
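The subset test applied to each load and store can be written as a sequential software equivalent of the parallel comparators. This is a sketch under stated assumptions: access_allowed() and srs_entry are hypothetical names, entries are modelled as inclusive address ranges, and the loop stands in for hardware that evaluates all entries at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t base, limit; } srs_entry; /* inclusive range */

/* The access is allowed if [addr, addr + size - 1] is a subset of at least
 * one entry in the bank. size is 1, 2 or 4 (byte, half-word, word). */
bool access_allowed(const srs_entry *bank, size_t n, uint32_t addr, uint32_t size)
{
    assert(size == 1 || size == 2 || size == 4);
    for (size_t i = 0; i < n; i++) {
        if (addr >= bank[i].base && addr <= bank[i].limit &&
            size - 1 <= bank[i].limit - addr)
            return true;
    }
    return false; /* no match: the hardware would raise a fault */
}
```

Note that an access straddling the end of an entry is rejected even if its first byte is covered, which matches the subset requirement stated above.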
Since both HardScope instructions as well as load and store instructions require access to the HardScope hardware unit, we multiplexed shared access to it according to the currently decoded instruction. To ensure correct and hazard-free execution of the
To overcome potential data hazards, operand and data forwarding from preceding instructions is also supported for our HardScope instructions. This eliminates the need to inject additional stalls when results from preceding instructions are not yet updated in the registers but are required by the current HardScope instruction, which often occurs with sradd and srdda instructions that are preceded by auxiliary instructions that load immediate values to their operand registers.
We fully integrated HardScope with the Pulpino core and synthesized the extended processor on a Xilinx Zynq-7020 ZedBoard All Programmable SoC. 9 The instruction-level functionality was validated by performing a register-transfer level (RTL) cycle-accurate simulation of the integrated hardware design using ModelSim/QuestaSim. 10 We also extended Spike 11 , the official RISC-V ISA simulator to support our HardScope instruction set extension.
Simulator Implementation
We extended Spike 12 , the official RISC-V ISA simulator, to support our HardScope instruction set extension. Spike is part of the official RISC-V infrastructure and is currently the most accurate simulator for RISC-V assembler programs. It is regularly maintained by the RISC-V community, well integrated into the toolchain, and supports debugging with the GNU debugger (GDB). We used it to analyze the security properties and performance profile of RSE, and we have also made it available for developers and researchers who wish to reproduce our results or experiment with other uses of our HardScope instruction set extension. 13

9 http://zedboard.org
10 https://www.mentor.com/products/fv/questa/
11 https://github.com/riscv/riscv-isa-sim
12 https://github.com/riscv/riscv-isa-sim
13 The enhanced version of the Spike simulator is included with our accompanying materials at: https://goo.gl/TAjLxy
The simulator was extended by adding an SRS module to the Memory Management Unit (MMU) of the processor. Similar to our hardware design (Section 5.3), this module includes two banks of SRS registers for active and delegated entries, and a stack for inactive frames. Each executed HardScope instruction is passed to our SRS module, which faithfully simulates the behavior of a real hardware implementation. It also collects performance profiling statistics including the number of executed instructions, frequency of context switches, sizes of SRS frames, and the number of access checks performed.
EVALUATION
To demonstrate the functionality of HardScope, we first show how it comprehensively mitigates Hu et al.'s [17] DOP attack (described in Section 2), and in Appendix C we explain how it mitigates all other known DOP attacks, including the attack by Evans [14] . We then evaluate the security of HardScope with reference to the requirements defined in Section 3, and analyze its performance and area overhead.
DOP Mitigation Example
We replicated the DOP attack by Hu et al. [17] and ported the code to Pulpino to evaluate the effectiveness of HardScope. Since it was not possible to port the complete ProFTPD to our FPGA testbed or simulation environment, we concentrated our evaluation on the vulnerable sreplace() function. 14 We automatically instrumented this code with HardScope instructions using our GCC compiler extensions. This means that all enforcement rules in the test programs are derived without any developer annotations: the GCC intermediate representation contains all information necessary for compile-time instrumentation, including stack-frame sizes, global variable accesses, function calls, parameters, and return values. We used our modified Spike simulator to trace the execution for both benign and malicious inputs, and verified that our instrumentation did not affect the correctness of the program under benign inputs. The source code and instrumented symbolic assembler files are included in the supplementary material. 15

We identified five ways in which RSE prevents this DOP attack. Any one of these would be sufficient to stop the attack, and thus the existence of five distinct mitigation points demonstrates the effectiveness of RSE's layered defense strategy. All five were also verified experimentally using our modified Spike simulator.
1) The initial memory violation in sreplace() is caused by an out-of-bounds sstrncpy() to a local stack buffer buf. The bound for the sstrncpy() call is calculated as sizeof(buf) - strlen(pbuf). The contents of pbuf are attacker-controlled and, when left without a trailing null terminator, cause the subtraction to yield a negative value. Interpreted as an unsigned integer, this value is very large and causes sstrncpy() to overflow. RSE instrumentation records and enforces the intended bounds of buf and pbuf, thus preventing the out-of-bounds access by strlen() and sstrncpy().
2) The DOP program keeps internal state in unused areas of the program's data section. By default, RSE denies access to such unused areas from all functions. The attacker could attempt to work around this by using pre-existing global variables. However, because access to globals can be narrowed by the use of RSE, all the DOP gadgets must either share access to the same global data structure, or all be reachable by data flows to and from this data structure. Gadgets that legitimately share access to such data are more likely to use this data in benign program operation. This is undesirable for DOP because re-purposing such data could have unwanted side-effects on program execution, or be overwritten during benign operations, thus significantly limiting the amount of data or the set of gadgets that can be used in the attack.
3) The exploit corrupts variables in global data structures to control the operands of the addition and dereference gadgets. During benign execution, the sreplace() function should only operate on a copy of the main_server.ServerName pointer passed by value, and on unrelated fields in the global session structure, also passed by value. Therefore, RSE denies sreplace() access to these global variables, and thus blocks the addition and dereference gadgets by preventing the attacker from controlling their respective operands.
4) The dereference gadget accesses ssl_ctx via a pointer included in the DOP payload and traverses the chain of linked structures to determine the location of the secret key. Since ssl_ctx is defined as a global static in the mod_tls.c source file, RSE prevents the dereference gadget from accessing this structure or linked structures.
5) Once the address is known, sreplace() is used to corrupt a local static mons array containing string pointers. The mons array (Appendix B) contains pointers to string literals in the program's data section that are used by the pr_strtime() function to format human readable dates. Each pointer in mons is redirected to the same memory location. One byte of the secret key is then copied to that location. Whenever any date is formatted by the corrupted pr_strtime(), it leaks a few bytes of the key to the attacker. The process is repeated until the entire key has been extracted. RSE prevents this exfiltration in two ways. First, because the scope of mons is local to the pr_strtime() function, sreplace() cannot overwrite it with new pointers. Second, RSE ensures that pr_strtime() cannot access the key, even if it attempts to dereference corrupted mons pointers.
Although RSE cannot guarantee the absence of usable data-oriented gadgets in arbitrary programs, this example demonstrates how RSE significantly limits the expressiveness of DOP attacks in programs with at least some degree of structural data separation, e.g., by minimizing the scope of variables (see Section 3), as is the case in virtually all real-world programs.
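The unsigned wrap-around behind the first mitigation point can be reproduced in isolation. The sketch below is a generic illustration of the arithmetic, not ProFTPD code: unchecked_bound() and checked_bound() are hypothetical helper names, cap stands for sizeof(buf), and used stands for the value returned by strlen(pbuf).

```c
#include <assert.h>
#include <stddef.h>

/* Bound as computed in the vulnerable pattern: if used > cap, the unsigned
 * subtraction wraps around to a very large value instead of going negative. */
size_t unchecked_bound(size_t cap, size_t used)
{
    return cap - used;
}

/* Defensive variant (an illustration only, not the actual ProFTPD fix):
 * clamp the bound to zero when the buffer is already full or overrun. */
size_t checked_bound(size_t cap, size_t used)
{
    return used < cap ? cap - used : 0;
}
```

When strlen() runs past an unterminated buffer and returns a value larger than the buffer size, unchecked_bound() yields a bound far greater than the buffer, which is what lets the copy overflow.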
Security Considerations
R1. Multi-granularity enforcement. With the appropriate instrumentation, HardScope instructions can be used to enforce memory protection at run-time for any granularity of protection domain and any granularity of protected region. Our enhanced GCC compiler automatically emits HardScope instructions at function-level granularity which, as shown in Section 2 and Appendix C, is necessary and sufficient to thwart currently known DOP attacks. However, HardScope can also be used to enforce policies with either coarser or finer granularity of execution contexts. This is possible because HardScope instructions are agnostic to programming language and language constructs. We show an example of finer granularity in Appendix E by isolating the function prologue and epilogue from the function body. Thus, even if the function body contains a memory error, this cannot be used to corrupt the return address on the stack (i.e., it prevents control-flow hijacking). For both of the above granularities, all necessary information about the protection domains can be deduced by the compiler, thus allowing automatic instrumentation. Instrumentation at other granularities could also be inferred from existing language constructs (e.g., loops) or may require developer annotations.
As another example, consider the simple loop in Listing 4. The for loop forms a separate code block in terms of lexical scope. Since the index variable i is declared in the loop signature, it cannot be accessed from outside the for loop. The statement in the loop body accesses the name array, the buf array, and the len integer. The loop can be isolated in a separate execution context from the rest of the function body by surrounding it with sbent and sbxit instructions. Access to name and the buffer referenced by buf are delegated to the loop's execution context via srdlg, and access to the variable i, which is not accessed from outside the loop, is delegated with srdlgm. This ensures that the for loop is executed with minimal privileges. Should the value of len exceed the size of the name buffer, RSE prevents the buffer operation from overflowing into the password array, which should only be accessed later in the function body.
In addition to execution contexts deduced from C control constructs, such as loops, the programmer may use unnamed blocks to group related code and data together. Any variables declared inside an unnamed block are considered by the compiler to exist only within the block's lexical scope. HardScope can make use of this standard language feature to automatically infer developer intent when determining execution contexts during instrumentation.
R2. Context-specific enforcement. In HardScope, the set of active SRS entries can differ between different invocations of the same subject, depending on which entries have been delegated to this subject (e.g., variables passed to a function by its caller or callee functions).
R3. Complete mediation. Since HardScope hardware checks every memory access against the currently active set of SRS entries, a memory access without a matching entry will fail. Therefore, the only possible memory accesses are those that would be allowed by a compile-time check. We discuss the scalability, performance and area overhead in Section 6.3.
Preventing confused deputies. In a confused deputy attack, the attacker attempts to subvert the RSE property by misusing existing HardScope instructions at run-time to create unintended rules (i.e., rule-creating instructions are the confused deputies). Our design ensures that no such instructions are available to the attacker. Instructions that create rules for static allocations (stack and global variables) are encoded directly into the instrumentation. Since these cannot be modified at run-time, they cannot be used as confused deputies. Instructions that create rules for dynamic allocations could potentially be used as confused deputies, but this is practically infeasible because these instructions are only found within memory allocators, e.g., malloc(). It is reasonable to assume that memory allocators are trusted (or at least that the absence of run-time vulnerabilities can be easily verified). We recommend that manually annotated code is vetted for allocators that create arbitrary rules at run-time. Furthermore, an attacker can only initiate a confused deputy attack if they already control some part of the code, which is very difficult since every memory access in the instrumented program is checked by the HardScope hardware.
Interfacing with legacy code. Legacy code, such as precompiled shared libraries, can be instrumented using wrapper functions. For example, our malloc() wrapper (Section 5.2) provides function-granularity isolation for delegated objects and the stack frame, and provides coarse-grained module-level isolation for other library data.
Mitigating DOP attacks. As shown above, HardScope fulfils all requirements defined in Section 3, and as shown in Section 6.1 and Appendix C, function-level granularity is sufficient to mitigate all currently known DOP attacks (at multiple points in each attack).
Mitigating ROP attacks. Additionally, HardScope can defend against ROP attacks. As explained in Appendix E, the function prologue and epilogue are placed in a separate execution context from the function's main body. By restricting a function's return state information only to the prologue/epilogue's execution context, HardScope protects this information from the potentially corrupted main body of the function. Without the ability to control the function's return address, an attacker cannot mount ROP attacks.
Performance and Area Overhead
We evaluated the performance and area overhead of HardScope using the extended Pulpino processor synthesized using Xilinx Vivado 16 for the ZedBoard 7020 prototyping kit. 17 For the performance evaluations, we used both the ProFTPD code excerpt (Section 6.1) and the CoreMark processor benchmark. 18
Performance evaluation (ProFTPD excerpt). This program consists of 527 lines of C code and its deepest call chain is seven levels deep. Table 2 shows the total cycle count for each type of HardScope instruction, and the number of cycles for which the processor was stalled when running on a HardScope-equipped core, compared with the unmodified program. In the instrumented version, the cycle counts for already existing instructions, including load/store, are not affected. Since rules are checked in parallel, the number of enforcement rules per subject does not impact performance up to the number of available HardScope registers. Taking into account the added HardScope instructions, the processor stalls, and the additional instructions needed to support the instrumentation, the total performance overhead was 1.6%.

16 https://www.xilinx.com/products/design-tools/vivado.html
17 http://zedboard.org
18 http://www.eembc.org/coremark/index.php

Performance evaluation (CoreMark). CoreMark is a synthetic CPU performance benchmark for embedded systems based on a realistic mixture of commonly used algorithms including matrix and linked list manipulation, state machine operations, and Cyclic Redundancy Checks (CRCs). 19 It consists of approximately 1500 lines of C code, with a deepest call chain of 11 levels. All instrumentation in CoreMark was automatically generated by our extended GCC compiler, and we ran the benchmark on the Pulpino SoC extended with HardScope instructions. Binary size increased by 11% as a result of instrumentation. Table 3 shows the total number of executed HardScope instructions, the number of consumed cycles, and the number of cycles the processor was stalled for a single iteration of CoreMark, compared with an unmodified version. The added instructions account for a performance overhead of 3.1%. We also ran CoreMark with varying iteration counts on the FPGA and observed an average overall performance overhead of 3.2%.
The number of entries per SRS frame varied between 1 and 23, with a maximum of 11 frames.

Area and memory overhead. We synthesized HardScope using Xilinx Vivado for a Xilinx Zynq-7020 ZedBoard (Virtex-7 XC7Z020 FPGA). Figure 4 shows the number of look-up tables (LUTs) and registers required to support different numbers of entries per SRS frame. As expected, the area overhead increases linearly with the bank size (i.e., the maximum number of entries per frame), since more entries must be checked in parallel.
To support a bank size of 16 and a maximum of 16 frames (i.e., a protected memory size of 16×16 entries), HardScope uses 10,376 LUTs, 3,221 registers, and one 18 kB block RAM. This is less area than a 128-bit high-throughput pipelined hardware AES cipher, which uses 12,475 LUTs and 10,769 registers. 20 The Pulpino SoC itself uses 15,444 LUTs and 9,758 registers. We also synthesized this HardScope configuration using Synopsys Design Compiler targeting the NanGate 45 nm Open Cell Library 21 , which gave a logic size of approximately 800,000 transistors. In comparison to a modern general-purpose SoC, such as the Apple A10 quad-core ARM64 mobile SoC (3.3 billion transistors), the area overhead of HardScope is negligible. Thus, HardScope is suitable for deployment on a wide range of SoCs, including small MCUs like the Pulpino.

19 http://www.eembc.org/coremark/faq.php
20 https://opencores.org/project,aes-128_pipelined_encryption

a Calculations based on uninstrumented CoreMark (458150 cycles). CoreMark configuration details are in Appendix G.
RELATED WORK
Various software-only and hardware-assisted memory safety technologies have been proposed and/or deployed (e.g., [1, 3, 5-8, 11, 13, 20-22, 30, 31]). We discuss those that aim to mitigate non-control-data attacks. Figure 5 shows a taxonomy of defenses that can instantiate policies effective against DOP. To the best of our knowledge, RSE is the first scheme that specifically considers DOP attacks in its threat model.

Pointer Safety can prevent spatial memory errors that are exploited in memory corruption attacks. Typical realizations associate a base address and upper bound with each pointer, and check that the memory accesses that occur when dereferencing the pointer fall within those bounds. We call these schemes pointer-oriented. The associated bounds metadata can either be stored with each pointer or in a disjoint area of memory.
Fat-pointer schemes add bounds metadata directly to the pointer, e.g., by increasing the length of the pointer [27] or borrowing unused bits from the pointer [20]. This incurs only a small memory overhead but changes the memory layout of the program in ways that break both binary and source code compatibility.
BIMA [22] is a hardware-assisted fat-pointer scheme developed within the SAFE project. 22 BIMA encodes pointer bounds metadata within the pointer itself (e.g., on a 64 bit system, it assumes that 46 bit addresses are sufficient). On the simplified SAFE processor, a clean-slate ISA design which includes various security enhancements, this scheme has no performance overhead and worst-case 16% memory overhead. However, the compact encoding of pointer metadata results in alignment restrictions on pointers. This necessitates the use of custom stack allocators when applying this approach to stack data structures [12].

SoftBound [25] and HardBound [11] use pointer bounds metadata stored in disjoint shadow space memory to ensure pointer safety. Although this retains the program's original memory layout, it requires bounds information to be fetched from the shadow space before checks. SoftBound breaks cache locality and leads to additional cache misses when retrieving pointer bounds, incurring an average performance overhead of 67% in standard benchmarks. HardBound is a hardware-assisted scheme where the processor checks associated pointer bounds implicitly when a pointer is dereferenced. HardBound incurs an average performance overhead of ≈10%. Both schemes exhibit a worst-case memory overhead of ≈200%.
Intel's Memory Protection Extensions 23 (MPX), introduced in the Intel Skylake microarchitecture in late 2015, is currently the only example of hardware-assisted pointer safety technology being deployed in real systems. MPX adds new instructions for performing bounds checks on pointers. MPX stores bounds metadata in a disjoint memory area. Bounds information for up to four pointers can be stored in dedicated registers for fast checks. Oleksenko et al. [28] found that MPX incurs an average performance overhead of 50% and a memory overhead of 1.9x, largely due to the time and memory required for storing and loading bounds metadata.

22 http://www.crash-safe.org/
23 https://software.intel.com/en-us/isa-extensions/intel-mpx

Unlike the above pointer-oriented schemes, RSE (e.g., HardScope) is block-oriented as it associates access control rules with blocks of code, rather than individual pointers. However, as HardScope policies can be applied at the granularity of even single instructions, it can also enforce pointer safety. HardScope retains the program's original memory layout and does not require special alignment of pointers. In addition, HardScope also validates that pointer dereferences occur from legitimate execution contexts.
Red-Zone Tripwires can be used to ensure partial pointer safety against contiguous overflows by placing a block of invalid memory that acts as a "red-zone" between memory objects. Loads and stores are instrumented to verify whether the red-zone is tripped. Contiguous overflows, e.g., past an array boundary, will hit the red-zone tripwire. This provides only partial pointer safety, as non-contiguous accesses or accesses with a larger step distance than the size of the red-zone can violate spatial safety without setting off the tripwire. Modern compilers, such as GCC and LLVM, support instrumenting code with red-zone-based run-time bounds checks via AddressSanitizer [31].
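The limitation of red-zones with respect to large step distances can be made concrete with a toy layout model. This is a simplified sketch, not how AddressSanitizer is implemented: rz_result and rz_classify() are hypothetical names, and the model assumes a single object followed by one red-zone and then other valid memory.

```c
#include <assert.h>
#include <stddef.h>

/* Layout model: [object: obj_size bytes][red-zone: rz bytes][other data].
 * off is the byte offset of an access relative to the start of the object. */
typedef enum { RZ_IN_BOUNDS, RZ_TRIPPED, RZ_MISSED } rz_result;

rz_result rz_classify(size_t obj_size, size_t rz, size_t off)
{
    if (off < obj_size)
        return RZ_IN_BOUNDS;
    if (off - obj_size < rz)
        return RZ_TRIPPED;  /* lands in the red-zone: detected */
    return RZ_MISSED;       /* jumps over the red-zone: undetected */
}
```

For a 16-byte object with an 8-byte red-zone, an access at offset 16 is caught, whereas an access at offset 24 skips the red-zone entirely and goes unnoticed.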
Pointer authenticity can ensure the unforgeability of pointers, preventing non-control-data attacks and use of DOP gadgets that rely on retargeting pointers.
Yarra [30] is a variant of C that ensures the authenticity of a pointer's type for critical data types ascribed by the programmer. Yarra guarantees that such critical data is only written through pointers with the given static type. However, Yarra incurs a prohibitively high overhead when used for whole-program protection (4x-6x).
PointGuard [9] instruments programs to encrypt all pointers at run-time by XORing them against a key generated at program initialization. Pointers are decrypted before dereference. PointGuard incurs a small to medium overhead (0% -20%) in real-world programs, but is vulnerable to information disclosure e.g., if the ciphertext of a known pointer becomes known to an attacker.
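The XOR scheme and its sensitivity to information disclosure can be illustrated in a few lines. The sketch below is a toy model and not PointGuard's compiler instrumentation: pg_encrypt(), pg_decrypt() and pg_recover_key() are hypothetical names, and the key is passed explicitly instead of being generated at program initialization.

```c
#include <assert.h>
#include <stdint.h>

/* XOR "encryption" of a pointer value with a per-process key. */
uintptr_t pg_encrypt(const void *p, uintptr_t key)
{
    return (uintptr_t)p ^ key;
}

/* Decryption before dereference is the same XOR. */
void *pg_decrypt(uintptr_t c, uintptr_t key)
{
    return (void *)(c ^ key);
}

/* Why disclosure is fatal: a single known (pointer, ciphertext) pair
 * reveals the key, since key = pointer XOR ciphertext. */
uintptr_t pg_recover_key(const void *known_ptr, uintptr_t ciphertext)
{
    return (uintptr_t)known_ptr ^ ciphertext;
}
```

Once an attacker has recovered the key in this way, they can forge the encrypted form of any pointer value of their choosing.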
ARMv8.3 Pointer Authentication [29] is a hardware-assisted mechanism in the ARMv8.3 processor architecture that ensures the authenticity of pointers by calculating a Pointer Authentication Code (PAC) as a keyed MAC of the pointer value and a 64-bit context (e.g., the current value of the stack pointer). The PAC is stored in the unused bits of 64-bit pointers and verified before dereferencing the pointer to ensure its authenticity. The inclusion of the context value prevents unauthorized copying of the pointer and its PAC to another context within the program. The ARMv8.3 architecture provides four keys for PAC (two for code pointers and two for data pointers) and a fifth key usable for general-purpose authentication code generation. The keys are stored in internal CPU registers and are not accessible from user mode, but must be managed by privileged software (e.g., ephemeral keys per process for user mode, or per boot for kernel mode).
HardScope does not provide pointer authenticity, but when applied at a fine granularity, it can greatly reduce the attack surface of pointers and non-pointer data.
Software compartmentalization aims to mitigate the consequences of memory vulnerabilities by isolating software components into distinct protection domains.
Software Fault Isolation (SFI) [13, 24, 35] compartmentalizes software at a module level, e.g., kernel modules and dynamically loaded libraries. Non-control-data and DOP attacks cannot interact with data outside the module boundary, but attacks that operate fully within the confines of a single module remain viable. Byte Granularity Isolation (BGI) [8] is an SFI variant that instruments kernel extensions and can enforce access control policies at fine data granularity with moderate overhead (0%-16%).
Data-Flow Integrity (DFI) [7] is a software-only approach for mitigating control-flow and non-control-data attacks. At compile time, static analysis constructs a data-flow graph of a program. The code is instrumented to record the last instruction that wrote to each variable. On every read, the origin of the last write is checked against the pre-computed data-flow graph. Like HardScope, DFI can be instantiated at various granularities. Intraproc DFI only instruments uses of control-data and uses of local variables without definitions outside their function, thus providing function-granularity isolation for stack data. Interproc DFI isolates individual data flows from each other. Intraproc DFI incurs 46% and Interproc DFI incurs 104% performance overhead and 50% memory overhead. Hardware-Assisted Data-flow Isolation (HDFI) [34] provides instruction-level granularity isolation by tagging each machine word in memory and every memory access instruction with a protection domain. However, it only supports two simultaneous protection domains.
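The DFI check described above can be approximated with a small sketch. The table layout, the bitmask encoding of the data-flow graph, and the function names are assumptions made for illustration and are not taken from the DFI implementation [7].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_VARS 4  /* hypothetical number of tracked variables */

/* Run-time table: ID of the last store instruction that wrote each variable. */
static uint8_t last_writer[NUM_VARS];

/* Pre-computed data-flow graph, one bitmask per variable: bit i is set if
   store instruction i may legitimately define the variable. */
static uint32_t allowed_writers[NUM_VARS];

/* Instrumentation placed before every store. */
static void dfi_on_write(unsigned var, uint8_t store_id) {
    assert(var < NUM_VARS && store_id < 32);
    last_writer[var] = store_id;
}

/* Instrumentation placed before every load: true if the last write to the
   variable came from a store that the static analysis permits. */
static bool dfi_check_read(unsigned var) {
    assert(var < NUM_VARS);
    return ((allowed_writers[var] >> last_writer[var]) & 1u) != 0;
}
```

If a store that is only permitted to define one variable overflows into another, the next read of the corrupted variable finds a writer ID outside its allowed set and the check fails.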
Probabilistic schemes aim to randomize the data or its layout at run-time so that unauthorized accesses would have unpredictable results. Data Randomization [6] uses static analysis to partition code into equivalence classes, and then instruments all load/store operations to XOR the data with a class-specific mask. Data Space Randomization [5] randomizes the layout of data in memory. However, probabilistic schemes rely on some secret information (e.g., the XOR mask or randomization secret) which if leaked or inferred by the attacker could undermine the scheme.
Hardware architectures that enable different protection models have been proposed. Fine-grained tagged memory systems, e.g., lowRISC 24, can be used to assist the implementation of sophisticated memory access policies, including RSE. However, unlike HardScope, lowRISC only differentiates between access types (read/write) and cannot apply different policies per subject without reprogramming. Intel Memory Protection Keys (MPK) 25 provides hardware support to associate memory at page granularity to one of 16 distinct protection domains, but unlike HardScope, it does not support context-specific policies or delegation. CHERI [37] is a hardware-assisted capability model that extends the 64-bit MIPS ISA with byte-granularity enforcement of memory accesses. CHERI can support various protection models, such as pointer safety [37] and software compartmentalization [36] at library or module level. However, programs must be re-engineered by hand to benefit from CHERI.
Run-time attestation, such as Control-Flow Attestation [2, 10], can detect, but not prevent, both control-flow and some non-control-data attacks.
Although HardScope shares many of the same goals as the above schemes, it differs in several fundamental aspects. Compared to software-based schemes (e.g., DFI [7] and SoftBound [25]), HardScope has significantly lower overhead, does not require whole-program static analysis, and can enforce different rules during different invocations of the same function (context-specific memory isolation). HardScope RSE policies can be instantiated for a large class of programs without additional input from developers (cf., YARRA [30]), or software re-engineering (cf., CHERI). It also reduces the amount of metadata that must be stored at execution time by requiring only storage for rules pertaining to the active set of execution contexts. Because the rules needed for enforcement at any given time are limited, HardScope also makes it feasible to cache that metadata in on-chip memory and to perform implicit access checks with no overhead.
CONCLUSION
By implementing and evaluating HardScope, we demonstrated that RSE is a novel, effective approach to protect against memory vulnerabilities at run time. HardScope can also enforce memory isolation at coarser or finer granularity, to facilitate different types of memory protection strategies. In future work we plan to integrate HardScope with a general purpose RISC-V core (e.g., the Rocket Core [4]) and extend our RSE GCC Plug-in to support more protection models. To support reproducibility of our results, we provide 1) our enhanced GCC compiler, which can automatically instrument unmodified C programs; 2) instrumented binaries of our test programs; and 3) a version of the official RISC-V simulator with added support for HardScope instructions.
A EXCERPT FROM SREPLACE
Listing 5: Excerpt from sreplace() function in proftpd/src/support.c [17]. The src pointer is set to the next character of the input string containing replacement patterns (❶). When input does not match a replacement pattern, the character is copied verbatim to the output buffer (❷). The preceding bounds check is off-by-one (❸), allowing the null-terminator to be overwritten. During the immediately following iteration of the while loop, strlen(pbuf) will exceed the size of blen, resulting in an integer underflow of the n parameter to sstrncpy() (❹), allowing the attacker to overwrite the local variables on the stack and gain control of sreplace().
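The integer underflow described in the caption of Listing 5 can be reproduced in isolation. The following is a minimal sketch that assumes the length argument is computed with unsigned size_t arithmetic as blen minus the string length; the function names are hypothetical and the code is not the ProFTPD source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Unchecked computation of the remaining space in the output buffer.
   If the string is longer than blen (e.g., because its terminator was
   overwritten), the unsigned subtraction wraps around to a huge value. */
static size_t remaining_unchecked(const char *pbuf, size_t blen) {
    return blen - strlen(pbuf);
}

/* Guarded variant: clamps the result to zero instead of wrapping. */
static size_t remaining_checked(const char *pbuf, size_t blen) {
    size_t used = strlen(pbuf);
    return used < blen ? blen - used : 0;
}
```

With a six-character string and blen set to five, the unchecked variant yields the maximum size_t value, which a bounded copy routine would then treat as its length limit.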

B EXCERPT FROM PR_STRTIME
Listing 6 shows an excerpt from the pr_strtime() function in ProFTPD [17] .
const char *pr_strtime(time_t t) {
  static char buf[30]; // output buffer
  static char *mons[] = { "Jan", "Feb", // . . .
Listing 6: Excerpt of ProFTPD's pr_strtime() function with mons array. pr_strtime() indexes mons by month number, copies the corresponding string literal to the static output buffer, and returns a pointer to the output buffer.
C.3 DOP attacks against GStreamer
Evans [14] demonstrates an attack against the GStreamer FLIC decoder 28 that exploits a combination of control-flow hijacking and DOP techniques. The attack exploits a decode loop that lacks bounds checks against the output frame_data buffer, as shown in Listing 8. The exploit can be triggered via a specially crafted media file that causes the decode loop to write past the bounds of the heap-resident buffer. The goal of the attacker is to escalate the buffer overflow into arbitrary code execution, but they cannot perform a traditional control-flow attack due to the presence of ASLR. The initial memory corruption vulnerability is non-linear, allowing the attacker to skip over heap memory before the write, but only allows the attacker to tamper with memory below the overflowing frame_data buffer. Additionally, the attacker targets program logic that extracts metadata from a media file, which only runs the decode loop for two media frames, considerably limiting the initial number of write gadgets that can be chained together.
GStreamer decoders are typically run in their own dedicated threads with their own thread heap. When decoding begins in a newly created decoder thread, the predictability of the initial heap layout allows the attacker to corrupt the flxdec metadata object (Listing 9), which is typically allocated in the heap at a predictable offset below the frame_data buffer used by the decoder. By massaging the flxdec object into a state that keeps the decode loop running beyond the initial two-frame window, the attacker obtains a DOP gadget dispatcher used to execute the subsequent DOP payload.
The DOP program leverages data-oriented gadgets found in flx_decode_delta_fli() (Listing 8) and gst_flxdec_chain() (Listing 10). The assignment gadget leverages the memcpy() from flxdec->delta_data to flxdec->frame_data (❶). By corrupting these pointers in the decode loop, the gadget can be used for arbitrary memory loads and stores. Combining this with the assignment gadget from flxdec->frame_data to flxdec->delta_data in gst_flxdec_chain() (❷) provides a dereference gadget. Together with the assignment gadget (❸), these gadgets allow the DOP program to perform a load / add / store during frame processing. With this computational capability the attacker can obtain a code pointer to a known function within the program's code section and add an offset to it to obtain a pointer into the program's global offset table (GOT), which contains relocated pointers to shared library functions. Reading a GOT entry for a shared library function reveals the randomized code pointer. This in turn enables the attacker to mount a traditional control-flow attack on the derandomized code.
The GStreamer attack operates by corrupting a heap object which is legitimately within the scope of the decode loop with the initial memory corruption vulnerability. Thus, preventing the GStreamer attack is more challenging compared to the ProFTPD attacks described in Section 6.1 and Appendix C.1. However, RSE can be utilized by the caller to selectively delegate individual fields of the dynamically allocated flxdec in flx_decode_delta_fli() and gst_flxdec_chain(). Furthermore, by applying RSE at a block granularity, the vulnerable decode loop can be constrained to stay within the bounds of the flxdec->delta_data and flxdec->frame_data buffers.
in extending HardScope to support multi-threaded programs, Symmetric Multi-Processing (SMP), and C++, and we suggest possible solutions to these challenges. The discussion here is not exhaustive, and the implementation and evaluation of these extensions is beyond the scope of this work.
Multi-threading. The current implementation of HardScope maintains only a single SRS, and hence does not support multiple concurrent execution contexts. However, when multiple processes or threads are executed concurrently (e.g., interleaved by the CPU), each must be associated with a distinct HardScope SRS. To facilitate this, HardScope must provide the means to store the full SRS state of a thread when the thread is pre-empted, and restore the thread's SRS state when it is scheduled. The system must allocate separate storage for each thread's SRS, and the HardScope hardware must maintain a pointer to the current SRS, e.g., in a dedicated register. The system scheduler must update this SRS pointer register when switching threads. Access controls should be put in place to prevent unprivileged threads from tampering with any of the stored SRS states, including their own, or with the SRS pointer register. Privileged software must also have the ability to flush all records in the active and spare banks into the SRS in memory during thread scheduling, to ensure the coherency of the stored SRS.
Threads that co-operate on a computing task may need to share data amongst each other, e.g., by passing pointers to shared memory areas. To facilitate this, threads may need to delegate storage region entries to one another. Currently, HardScope does not provide means of delegation across multiple SRSs. In order to enable delegated access to shared memory areas, HardScope must be extended with a messagebox facility that allows the delegating thread to identify a recipient thread and mark a storage region entry for delegation to the recipient.
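The proposed scheduling support can be modelled in software as follows. This is a sketch under simplifying assumptions: a single on-chip bank stands in for the active and spare banks, all sizes are hypothetical, and none of the type or function names correspond to actual HardScope hardware interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BANK_ENTRIES 4   /* hypothetical on-chip bank size */
#define SRS_CAPACITY 16  /* hypothetical per-thread SRS storage */

typedef struct { uintptr_t base; size_t len; } srs_entry;

/* Per-thread SRS held in memory that unprivileged code cannot modify. */
typedef struct { srs_entry entries[SRS_CAPACITY]; size_t count; } srs_state;

/* Model of one core: on-chip records plus the SRS pointer register. */
typedef struct {
    srs_entry bank[BANK_ENTRIES];
    size_t bank_count;
    srs_state *srs_ptr;  /* names the SRS of the running thread */
} hs_core;

/* Record a new storage region rule in the on-chip bank. */
static int hs_add_rule(hs_core *c, uintptr_t base, size_t len) {
    if (c->bank_count == BANK_ENTRIES) return -1;
    c->bank[c->bank_count++] = (srs_entry){ base, len };
    return 0;
}

/* Privileged flush: write all on-chip records back to the SRS in memory. */
static void hs_flush(hs_core *c) {
    srs_state *s = c->srs_ptr;
    assert(s != NULL && s->count + c->bank_count <= SRS_CAPACITY);
    for (size_t i = 0; i < c->bank_count; i++)
        s->entries[s->count++] = c->bank[i];
    c->bank_count = 0;
}

/* Scheduler hook: flush the outgoing thread, then retarget the register. */
static void hs_switch_thread(hs_core *c, srs_state *next) {
    if (c->srs_ptr != NULL) hs_flush(c);
    c->srs_ptr = next;
}

static int in_entry(const srs_entry *e, uintptr_t a) {
    return a >= e->base && a - e->base < e->len;
}

/* Access check against the rules of the currently scheduled thread only. */
static int hs_check(const hs_core *c, uintptr_t addr) {
    for (size_t i = 0; i < c->bank_count; i++)
        if (in_entry(&c->bank[i], addr)) return 1;
    if (c->srs_ptr != NULL)
        for (size_t i = 0; i < c->srs_ptr->count; i++)
            if (in_entry(&c->srs_ptr->entries[i], addr)) return 1;
    return 0;
}
```

In this model, a rule added by one thread is invisible while another thread is scheduled and becomes effective again once the first thread's SRS pointer is restored.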
Symmetric Multi-Processing. In addition to the enhancements to support multi-threading, SMP systems require additional considerations to allow HardScope to maintain SRSs across multiple processors or cores. Most importantly, the HardScope active, spare, and cache banks must be duplicated for each distinct core. The SRS should be maintained in memory that is accessible to each core, and the messagebox facility must be extended to allow delegations across cores.
C++. The most important factor in terms of extending HardScope to support the C++ runtime is enabling C++ exception handling. Exceptions provide a way to transfer control from one execution context to another, possibly separated by several links in the call chain. The exception handling facility has the means to unwind the call stack to the stack frame of the execution context that handles the exception. In HardScope-enabled programs, the exception handling facility must also unwind the SRS to the corresponding SRS frame belonging to the exception context. In the current implementation of HardScope, the only way to achieve this is via the execution of sbxit. While it would be possible for the exception handling facility to issue multiple consecutive sbxit instructions until the correct SRS frame has been restored, this is likely to cause undesirable performance overhead due to the associated management operations on the active, spare, and cache banks. Moreover, many of these operations are unnecessary, as the result of the management operations is immediately discarded on the next sbxit. To improve the efficiency of exception handling, HardScope could be extended with a facility that would enable fast unwinding of the SRS stack to a previous state. A similar facility would also improve the implementation of a HardScope-aware C setjmp / longjmp API. 29
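The cost difference between repeated sbxit and a fast unwind can be illustrated with a toy cost model. The model assumes that each frame exit reloads the caller's rules into the banks and counts one bank operation per reloaded rule; the data structure, the cost accounting, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAMES 32

typedef struct {
    size_t depth;                /* number of live SRS frames */
    size_t entries[MAX_FRAMES];  /* number of rules held by each frame */
    unsigned long bank_ops;      /* simulated bank-management operations */
} srs_stack;

/* Model of entering a function: push a new SRS frame. */
static void srs_enter(srs_stack *s, size_t nrules) {
    assert(s->depth < MAX_FRAMES);
    s->entries[s->depth++] = nrules;
}

/* Model of sbxit: drop the top frame and reload the caller's rules. */
static void srs_exit(srs_stack *s) {
    assert(s->depth > 0);
    s->depth--;
    if (s->depth > 0)
        s->bank_ops += s->entries[s->depth - 1];
}

/* Unwinding with consecutive sbxit: every intermediate frame is reloaded
   even though its rules are discarded by the next exit. */
static void srs_unwind_slow(srs_stack *s, size_t target) {
    while (s->depth > target)
        srs_exit(s);
}

/* Hypothetical fast unwind: jump to the target depth and reload once. */
static void srs_unwind_fast(srs_stack *s, size_t target) {
    assert(target <= s->depth);
    if (target == s->depth) return;
    s->depth = target;
    if (target > 0)
        s->bank_ops += s->entries[target - 1];
}
```

Under these assumptions, unwinding five frames of three rules each down to the first frame costs twelve bank operations with repeated exits, but only three with the fast unwind.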
G COREMARK CONFIGURATION
The CoreMark benchmark program must be passed three seed values used for initialization of data during the benchmark. The seeds must be input from a source that cannot be determined at compile time to ensure that the compiler cannot pre-compute results to completely optimize away the work intended to be performed during a benchmark. While in principle any seed values could be used, three common sets of seeds have been designated by the CoreMark developers to ensure that results from CoreMark runs remain comparable. The benchmark also contains self-test logic that ensures the correctness of the work performed during the benchmark for the common seeds.
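The requirement that seeds be opaque to the compiler can be met, for example, by reading them through volatile objects. The sketch below illustrates this idea only; the seed values are placeholders, and workload() is a hypothetical stand-in rather than CoreMark code.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder seeds. The volatile qualifier forces the compiler to load
   them at run time, so results depending on them cannot be pre-computed. */
static volatile int32_t seed1 = 1, seed2 = 2, seed3 = 3;

/* Stand-in for benchmark work whose result depends on the seeds. */
static uint32_t workload(int iterations) {
    uint32_t acc = (uint32_t)seed1
                 ^ ((uint32_t)seed2 << 8)
                 ^ ((uint32_t)seed3 << 16);
    for (int i = 0; i < iterations; i++)
        acc = acc * 1664525u + 1013904223u;  /* arbitrary mixing step */
    return acc;
}
```

Without the volatile qualifier, an optimizing compiler would be free to fold the whole computation into a constant, which is precisely what the benchmark rules aim to prevent.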
As per industry recommendations 30 we use the profile run seeds to obtain the results reported in Section 6.3. Table 5 shows the performance overhead for different numbers of CoreMark iterations measured on the FPGA. For completeness, we also ran the CoreMark benchmark using the validation run and performance run seeds to ensure that our instrumentation did not affect the correctness of the benchmark. The results of the validation and performance runs are reported in Table 4. For each run, we fixed the number of CoreMark iterations to one. Because CoreMark was run in a cycle-accurate simulated environment, repeated runs yielded the same cycle counts. For both the validation and performance runs, the number of entries per SRS frame varied between 1 and 120, with a maximum of 11 frames required. The increased SRS utilization was attributed to the increased number of allocations performed by CoreMark's linked list benchmark with the validation and performance seed values. Due to area limitations of our FPGA testbed, we were unable to synthesize HardScope with a frame size of 120 entries. Instead, we ran the validation and performance run benchmarks on both the Spike-based simulator implementation and a cycle-accurate simulation of the hardware implementation (Pulpino SoC extended with HardScope) in Questa Advanced Simulator 31. The cycle-accurate overhead measured for the hardware implementation was 4.9% for the validation and performance runs, in contrast to 3.2% for the profile run determined on the FPGA. The principal reason for the increased overhead in the Table 4 results compared to the profile run was the increased number of stalls (2.8% overhead for both validation and performance runs). The increase in stalls is a direct result of the increased SRS utilization. Consequently, minimizing the number of stalls is an important consideration for maintaining efficient HardScope performance in more complex programs.
We discuss a possible hardware-design optimization that would reduce the number of stall cycles in Appendix H. However, the number of stalls can also be reduced by reducing the required SRS entries for each 29 
