Isolating sensitive data and state can increase the security and robustness of many applications. Examples include protecting cryptographic keys against exploits like OpenSSL's Heartbleed bug or protecting a language runtime from native libraries written in unsafe languages. When runtime references across isolation boundaries occur relatively infrequently, page-based hardware isolation can be used, because the cost of kernel- or hypervisor-mediated domain switching is tolerable. However, some applications, such as isolating cryptographic session keys in a network-facing application or isolating frequently invoked native libraries in managed runtimes, require very frequent domain switching. In such applications, the overhead of kernel- or hypervisor-mediated domain switching is prohibitive.
Introduction
It is good software security practice to partition sensitive data and code into isolated components, thereby limiting the effects of bugs and vulnerabilities in a component to the confidentiality and integrity of that component's data. For instance, isolating cryptographic keys from all but the crypto functions that use them can thwart vulnerabilities like the OpenSSL Heartbleed bug [34]; isolating a managed language's runtime can protect its security invariants from bugs and vulnerabilities in co-linked native libraries; and write-protecting jump tables can prevent some attacks on the integrity of an application's control flow.
Isolation prevents an untrusted component from directly accessing the private memory of other components. Broadly speaking, isolation can be enforced using one of two approaches. First, with software fault isolation (SFI) [42], we can instrument the code of untrusted components with bounds checks on indirect memory accesses, restricting accesses to the other component's memory. The bounds checks can be added by the compiler, as is the case in memory-safe languages, or through binary rewriting. Bounds checks impose overhead on the execution of all untrusted components; additional overhead may be required to prevent control-flow hijacks [28], which could circumvent the checks.
A second approach is to use hardware page protection for memory isolation [10, 12, 30, 9]. Here, access checks are performed in hardware as part of the address translation with no additional overhead on execution within a component. However, transferring control between components requires a switch to supervisor or hypervisor mode in order to change the (extended) page table base. Recent work such as Wedge, Shreds, SMVs, and light-weight contexts (lwCs) [10, 12, 23, 30] has reduced the cost of switching between isolated components within a process, but the cost is still substantial. For instance, Litton et al. [30] report a switching cost of about 1 µs per switch for lwCs, which amounts to an overhead of nearly 10% for an application that switches 100,000 times a second.
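The 10% figure follows from multiplying the per-switch cost by the switching rate. A minimal sketch of that arithmetic (the function name is illustrative and not taken from any of the cited systems):

```python
def switch_overhead(cost_ns_per_switch: int, switches_per_second: int) -> float:
    """Fraction of each second of CPU time spent purely on domain switches."""
    ns_per_second = 1_000_000_000
    return cost_ns_per_switch * switches_per_second / ns_per_second


# Figures quoted for lwCs: about 1 us (1000 ns) per switch, 100,000 switches/s.
lwc_overhead = switch_overhead(1_000, 100_000)
print(f"{lwc_overhead:.0%} of CPU time spent switching")
```

The same function shows why the cost grows linearly with the switching rate: at ten times the rate, the same per-switch cost would consume the entire core.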
In this paper, we present ERIM, the first isolation technique that combines zero-overhead in-component execution with very low cost switching among components. ERIM relies on a recent x86 ISA extension called memory protection keys or MPK [26] . With MPK, each virtual page can be tagged with a 4-bit domain id, thus partitioning a process' address space into up to 16 disjoint domains. A special per-core register, PKRU, determines which domains the core can read or write. Switching domain permissions requires writing the PKRU register with the user-mode WRPKRU instruction, which takes only 11-260 cycles on current Intel CPUs in our experiments.
However, MPK by itself does not provide strong security because a compromised or malicious component can simply write to the PKRU register and grant itself permission to access any of the other domains in the process. ERIM relies on binary inspection to ensure that all occurrences of the WRPKRU instruction in the binary are safe, i.e., they cannot be exploited to gain unauthorized access. By design, this property holds even if there is a control-flow hijack in the untrusted component. Hence, ERIM provides isolation without requiring control-flow integrity, which cannot be ensured in unsafe languages without additional runtime overhead.
While ERIM's binary inspection enforces the safety of its MPK-based isolation, it creates a potential usability issue: What to do if a binary has unintentional occurrences of the WRPKRU instruction? Since x86 does not have instruction alignment, such sequences could arise within a longer instruction, or spanning the bytes of two or more adjacent instructions. Any such sequence could be exploited by a control-flow hijack attack, so such a binary must be rejected by the binary inspection mechanism. To handle such cases, we describe a novel procedure to rewrite any instruction sequence containing an unaligned WRPKRU to a functionally equivalent sequence without any WRPKRUs. The procedure can be integrated with a compiler or our binary inspection to remove unaligned WRPKRU instructions.
ERIM is the first technique that enables efficient isolation in applications that require very high domain switching rates (~10^5/s or more) and also spend significant time executing inside untrusted components. We evaluate our ERIM prototype on three such applications. First, in the web server nginx, we show that ERIM can isolate the frequently accessed session keys as opposed to merely isolating the long-term signing keys, which are accessed only infrequently and therefore can be isolated with low overhead using existing techniques. Second, we show that ERIM can efficiently isolate a managed language runtime from native libraries written in unsafe languages. Third, we show that ERIM can efficiently isolate the safe region in code-pointer integrity [29]. In all cases, we observe switching rates on the order of 10^5 switches/s or more per core. ERIM provides strong, fine-grained hardware isolation with overheads less than 1% for every 100,000 switches/s, which is considerably lower than that of existing techniques. Moreover, ERIM does not require compiler support and can run on stock Linux.
In summary, we make the following contributions.
- We present ERIM, an efficient memory isolation technique that relies on a combination of Intel's MPK ISA extension and binary inspection, but does not require or assume control-flow integrity.
- We describe a complete rewriting procedure to make binaries unexploitable for circumventing ERIM.
- Through experiments, we demonstrate that ERIM can protect applications with very high inter-component switching rates with low overhead.
Background and related work
In this section, we survey background and related work. Enforcing relevant security or correctness invariants while trusting only a small portion of an application's code generally requires data encapsulation. Encapsulation itself requires isolating sensitive data so it cannot be accessed by untrusted code, and facilitating switches to trusted code that has access to the isolated state. We survey techniques for isolation and secure control transfer provided by operating systems, hypervisors, compilers, language runtimes, and binary rewriting, as well as other work that uses MPK for memory isolation.
OS-based techniques One data encapsulation technique is to split application components into separate processes. This is feasible only if the rate of cross-component invocations is relatively low, so that context switching overheads remain tolerable. Novel kernel abstractions like light-weight contexts (lwCs) [30] and secure memory views (SMVs) [23], combined with additional compiler support as in Shreds [12] or runtime analysis tools as in Wedge [10], have reduced the cost of data encapsulation to the point where isolating long-term signing keys in a web server is feasible with little overhead [30]. Settings that require more frequent switches, like isolating session keys or isolating the safe region in CPI [29], remain beyond the reach of OS-based techniques, and indeed all existing techniques based on hardware isolation.
Mimosa [19] relies on the Intel TSX hardware transactional memory to protect private cryptographic keys from software vulnerabilities and cold-boot attacks. Mimosa restricts cleartext keys to exist within uncommitted transactions, and TSX ensures that an uncommitted transaction's data is never written to the DRAM or other cores. Unlike ERIM, which is a general-purpose isolation technique, Mimosa specifically targets cryptographic keys, and is constrained by hardware capacity limits of TSX.
Virtualization-based techniques In-process data encapsulation can be provided by a hypervisor. Dune [9] enables user-level processes to implement isolated compartments by leveraging the Intel VT-x x86 virtualization ISA extensions [26] . Koning et al. [28] sketch how to use the VT-x VMFUNC instruction to switch extended page tables in order to achieve in-process data isolation. SeCage [31] similarly relies on VMFUNC to switch between isolated compartments; it also provides static and dynamic program analysis based techniques to automatically partition monolithic software into compartments, which is orthogonal to our work. TrustVisor [33] uses a thin hypervisor and nested page tables to support isolation; it additionally supports code attestation. SIM [39] relies on VT-x to isolate a security monitor within an untrusted guest VM, where it can access guest memory with native speed. In addition to the overhead of the VMFUNC calls during switching, these techniques incur overheads on TLB misses and syscalls due to the use of extended page tables and hypercalls, respectively. Overall, the overheads of virtualization-based encapsulation are comparable to those of OS-based techniques.
Nexen [40] decomposes the Xen hypervisor into isolated components and a security monitor, using page-based protection within the hypervisor's privilege ring 0. Control of the MMU is restricted to the monitor; compartments are de-privileged by scanning and removing exploitable MMU-modifying instructions. The goal of Nexen is quite different from ERIM's: Nexen aims to isolate co-hosted VMs and the hypervisor's components from each other, while ERIM isolates components of a user process. Nonetheless, ERIM is similar to Nexen in that it removes exploitable instructions.
Language and runtime techniques Memory isolation can be provided as part of a memory-safe programming language. This encapsulation is efficient if most of the checks can be done statically. However, such isolation is language-specific, relies on the compiler and runtime, and can be undermined by co-linked libraries written in unsafe languages.
Software fault isolation (SFI) [42] provides memory isolation in unsafe languages using runtime memory access checks inserted by the compiler or by rewriting binaries. As mentioned, SFI imposes an overhead on all execution of untrusted code. Additionally, SFI by itself does not protect against attacks that hijack control flow (to possibly bypass the memory access checks). To get strong security, SFI must be coupled with an additional technique for control-flow integrity (CFI) [6]. However, existing CFI solutions have nontrivial overhead. For example, code-pointer integrity (CPI), one of the cheapest reasonably strong CFI defenses, has a runtime overhead of at least 15% on the throughput of a moderately performant web server (Apache) [29, Section 5.3]. In contrast, ERIM does not rely on CFI for data encapsulation and has much lower overhead. Concretely, we show in Section 6 that ERIM's overhead on the throughput of a much more performant web server (nginx) is no more than 5%.
The Intel MPX ISA extension [26] provides architectural support for bounds checking needed by SFI. A compiler can use up to four bounds registers, and each register can store a pair of 64-bit starting and ending addresses. Specialized instructions check a given address and raise an exception if the bounds are violated. However, even with MPX support, the overhead of bounds checks is on the order of several tens of percent in many applications [37] .
Hardware-based trusted execution environments Intel's SGX [25] and ARM's TrustZone [8] ISA extensions allow (components of) applications to execute with hardware-enforced isolation, even from the operating system. However, switching overheads are similar to other hardware-based isolation mechanisms [28].
IMIX [16] and MicroStache [35] propose minimal extensions to the x86 ISA, adding load and store instructions to access secrets in a safe region. The extended ISA can provide data encapsulation. Both systems provide a compiler that automatically partitions secrets. However, for data encapsulation in the face of control-flow hijack attacks, both systems require CFI. As mentioned, CFI techniques have nontrivial overhead. ERIM, on the other hand, provides strong isolation without relying on CFI and has lower overhead.
ASLR Address space layout randomization (ASLR) is widely used to mitigate code reuse exploits such as those based on buffer overflow attacks [38, 22] . By randomizing the layout of code in an address space, ASLR makes it difficult for attackers to reuse code as part of an exploit. ASLR has also been used for data encapsulation by randomizing data layout. For example, as one of the isolation techniques used in CPI [29, 41] , a region of sensitive data is allocated at a random address within the 48-bit x86-64 address space and its base address is stored in a segment descriptor. All pointers stored in memory are offsets into the region and do not reveal its actual address. However, all forms of ASLR are vulnerable to attacks like thread spraying [38, 24, 14, 18, 36] . Consequently, ASLR is no longer considered a viable technique for strong memory isolation, despite proposals such as [32] to harden it further.
ARM memory domains ARM memory domains [7] are similar to Intel MPK, the x86 hardware feature that ERIM relies on. However, unlike in MPK, changing domain permissions is a kernel operation in ARM. Therefore, unlike MPK, ARM's memory domains do not support low-cost user-mode switching.
MPK-based techniques Koning et al. [28] present MemSentry, a general framework for data encapsulation, implemented as a pass in the LLVM compiler toolchain. They instantiate the framework with several different memory isolation techniques, including many described above and, importantly, one based on MPK's domains. However, MemSentry's MPK instance is secure only with, and assumes the existence of, a separate defense against control-flow hijack/code-reuse attacks to prevent adversarial misuse of WRPKRU instructions in the binary. As mentioned earlier, such defenses have significant overhead of their own. As a result, the overall overhead of MemSentry's MPK instance is significantly higher than that of ERIM, which does not rely on a defense against control-flow hijacks.
In concurrent work [21], Hedayati et al. describe how to isolate userspace libraries using VMFUNC or Intel MPK. Their MPK-based method is very similar to ERIM, but they do not describe how to rewrite binaries to ensure that there are no exploitable occurrences of WRPKRU. In contrast, such rewriting is a key contribution of our work (Section 4). Finally, while Hedayati et al. rely on kernel changes, we describe how ERIM can run safely on a stock Linux kernel.
In recent work, Burow et al. [11] survey implementation techniques for shadow stacks. In particular, they examine the use of MPK for protecting the integrity of shadow stacks. Burow et al.'s measurements of MPK overheads (Fig. 10 in [11]) are consistent with ours. Their use of MPK could be a specific use-case for ERIM, which is a more general framework for memory isolation.
Design
Goals ERIM enables efficient data isolation within a userspace process. Like prior work, it enables trusted application components to isolate their sensitive data and state from less trusted components. Unlike prior work, ERIM supports such isolation with low overhead even at high switching rates between components (hence enabling fine-grained isolation) and without requiring control-flow integrity.
In the following, we focus on the case of two components that are fully isolated from each other within a single-threaded process. As we will show later in this section, ERIM can be generalized to multi-threaded processes, more than two components per process, partial and read-only sharing among components, as well as other extensions.
We use the letter T to denote the trusted component and U to denote the remaining, untrusted application component. The main primitive ERIM provides is memory isolation: it reserves a region of the address space accessible exclusively from the trusted component T. This reserved region is denoted M T and can be used by T to store sensitive data. The rest of the address space, denoted M U, holds the application's regular heap and stack and is accessible from both U and T. ERIM enforces the following invariants:
(1) While control is in U, access to M T remains disabled.
(2) Access to M T is enabled atomically with a control transfer to a designated entry point in T and disabled when T transfers control back to U.
The first invariant provides isolation of M T from U, while the second invariant prevents U from confusing T into accessing M T improperly by jumping into the middle of M T's code.
Background: Intel MPK To realize its goals, ERIM uses the recent MPK extension to the x86 ISA [26] . With MPK, each virtual page of a process can be associated with one of 16 protection keys, thus partitioning the address space into up to 16 domains. A per-core register, called PKRU, determines the current access permissions (read, write, neither or both) on each domain for the code running on that core. Access checks against the PKRU are implemented in hardware and impose no overhead on program execution.
Changing access privileges requires writing new permissions to the PKRU register with a user-mode instruction, WRPKRU. This instruction is relatively fast (11-260 cycles on current Intel CPUs) and does not require a syscall, changes to page tables, a TLB flush, or inter-core synchronization.
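To make the permission model concrete, the following sketch models the PKRU register as a 32-bit integer with two bits per protection key, following the layout in Intel's architecture manual: bit 2i disables all data access to key i, and bit 2i+1 disables writes. The helper names are illustrative and not part of ERIM or any library.

```python
NUM_KEYS = 16  # MPK provides 16 protection keys


def access_disable_bit(key: int) -> int:
    """Mask of the bit that disables all data access to `key`."""
    return 1 << (2 * key)


def write_disable_bit(key: int) -> int:
    """Mask of the bit that disables writes to `key`."""
    return 1 << (2 * key + 1)


def make_pkru(readable: set, writable: set) -> int:
    """Build a PKRU value; keys in `writable` are treated as readable too."""
    pkru = 0
    for key in range(NUM_KEYS):
        if key in writable:
            continue  # both bits clear: read and write allowed
        if key in readable:
            pkru |= write_disable_bit(key)  # read-only
        else:
            pkru |= access_disable_bit(key)  # no access at all
    return pkru


def can_read(pkru: int, key: int) -> bool:
    return pkru & access_disable_bit(key) == 0


def can_write(pkru: int, key: int) -> bool:
    return pkru & (access_disable_bit(key) | write_disable_bit(key)) == 0


# Read/write access to key 0 only, no access to any other key.
only_default = make_pkru(readable={0}, writable={0})
print(hex(only_default))
```

In this model, a domain switch is a single assignment of a new integer to the register, which mirrors why WRPKRU needs no page-table change or TLB flush.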
Since WRPKRU is a user-mode instruction, ERIM must ensure that untrusted code cannot use it to improperly elevate its privileges. To this end, ERIM combines MPK with binary inspection to ensure that all WRPKRU occurrences on executable pages are safe, i.e., they cannot be exploited to improperly elevate privilege.
Background: Linux support for MPK As of version 4.6, the mainstream Linux kernel supports MPK. Page-table entries are tagged with MPK domains, there are additional syscall options to associate pages with specific domains, and the PKRU register is saved and restored during context switches. Since hardware PKRU checks are disabled in kernel mode, the kernel has also been modified to check PKRU permissions explicitly before dereferencing any userspace pointer. To avoid executing a signal handler with inappropriate privileges, the kernel updates the PKRU register to its initial set of privileges (read/write access only to domain 0) before delivering signals to a process.
High-level design overview
ERIM can be configured to provide either complete isolation of M T from U (confidentiality and integrity), or only write protection (only integrity). We describe the design for complete isolation first. Section 3.7 explains a slight design re-configuration that provides only write protection.
ERIM's isolation mechanism is conceptually simple: It maps T's reserved memory, M T , and the application's general memory, M U , to two different MPK domains. It manages MPK permissions (the per-core PKRU registers) to ensure that M U is always accessible, while M T is never accessible when control is in U. It allows U to securely transfer control to T and back via call gates. A call gate enables access to M T using the WRPKRU instruction and immediately transfers control to a specified entry point of T, which may be an explicit or inlined function. When T is done executing, the call gate disables access to M T and returns control to U. This enforces ERIM's two invariants (1) and (2) from Section 3. Call gates operate entirely in user-mode (they don't use syscalls) and are described in Section 3.3.
Preventing WRPKRU exploitation A key difficulty in ERIM's design is preventing the untrusted U from exploiting occurrences of the WRPKRU instruction sequence on executable pages to elevate its privileges. For instance, if the sequence appeared at any byte address on an executable page, it could be exploited using control-flow hijack attacks. To prevent such exploits, ERIM relies on binary inspection to enforce the invariant that only safe WRPKRU occurrences appear on executable pages. A WRPKRU occurrence is safe if it is immediately followed by one of the following:
(A) a pre-designated entry point of T;
(B) a specific sequence of instructions that checks that the permissions set by WRPKRU do not include access to M T and terminates the program otherwise.
A safe WRPKRU occurrence cannot be exploited to execute untrusted code with access to M T . If the occurrence satisfies (A), then it does not give control to U at all; instead, it enters T at a designated entry point. If the occurrence satisfies (B), then it would terminate the program immediately were it used by a control-flow hijack to enable access to M T . ERIM's call gates use only safe WRPKRU occurrences and, therefore, pass the binary inspection. The binary inspection is described further in Section 3.4.
Creating safe binaries An important question is how to construct binaries that do not have unsafe WRPKRUs. On x86, an inadvertent WRPKRU may arise spanning the bytes of two adjacent instructions or as a subsequence in a longer instruction. To eliminate inadvertent WRPKRUs, we describe a binary rewriting mechanism that rewrites any sequence of instructions containing a WRPKRU to a functionally equivalent sequence without any WRPKRUs. The mechanism can be deployed as a compiler pass or integrated with our binary inspection, as explained in Section 4.
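As an illustration of how inadvertent occurrences arise, the sketch below searches a byte string for the WRPKRU opcode (0F 01 EF) at every byte offset. The two example encodings are chosen by hand for illustration: in the first, the opcode bytes sit inside a 32-bit immediate; in the second, they span the last byte of one instruction and the whole of the next. The function name is hypothetical.

```python
WRPKRU = bytes.fromhex("0f01ef")  # opcode bytes of the WRPKRU instruction


def find_wrpkru(code: bytes) -> list:
    """Return every byte offset at which the WRPKRU opcode appears."""
    offsets = []
    offset = code.find(WRPKRU)
    while offset != -1:
        offsets.append(offset)
        offset = code.find(WRPKRU, offset + 1)
    return offsets


# mov eax, 0x00ef010f  ->  b8 0f 01 ef 00 (opcode hidden in the immediate)
inside_immediate = bytes.fromhex("b80f01ef00")

# mov eax, 0x0f000000 ; add edi, ebp  ->  b8 00 00 00 0f | 01 ef
spanning_two = bytes.fromhex("b80000000f" + "01ef")

print(find_wrpkru(inside_immediate), find_wrpkru(spanning_two))
```

Neither sequence contains a WRPKRU instruction when decoded from its start, yet a hijacked jump to the reported offset would execute one, which is why the search must consider every byte offset rather than only instruction boundaries.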
Threat model
ERIM makes no assumptions about the untrusted component (U) of an application. U may behave arbitrarily and may contain memory corruption and control-flow hijack vulnerabilities that may be exploited during its execution.
However, ERIM assumes that the trusted component T's binary does not have such vulnerabilities and does not compromise sensitive data by calling back into U while access to M T is enabled, through information leaks, or by mapping executable pages with unsafe/exploitable occurrences of the WRPKRU instruction.
The hardware, the OS kernel, and a small library added by ERIM to each process that uses ERIM are trusted to be secure. We also assume that the kernel enforces standard DEP: an executable page must not be simultaneously mapped with write permissions. ERIM relies on a list of legitimate entry points into T provided either by the programmer or the compiler, and this list is assumed to be correct (see Section 3.4). The OS's dynamic program loader/linker is trusted to invoke ERIM's initialization function before any other code in a new process.
Side-channel and rowhammer attacks, and microarchitectural leaks, although important, are beyond the scope of this work. However, ERIM is compatible with existing defenses.
Call gates
A call gate transfers control from U to T, enabling access to M T , then runs code from a designated entry point of T, and later returns control to U after disabling access to M T . This requires two WRPKRUs. The primary challenge in designing the call gate is ensuring that both these WRPKRUs are safe in the sense explained in Section 3.1.
Listing 1 shows the assembly code of a call gate. WRPKRU expects the new PKRU value in the eax register and requires ecx and edx to be 0. The call gate works as follows. First, it sets PKRU to enable access to M T (lines 1-4). The macro PKRU_ALLOW_TRUSTED is a PKRU setting that allows access to M T. Next, the call gate transfers control to the designated entry point of T (line 6). T's code may be invoked either by a direct call, or it may be inlined here.
After T has finished, the call gate sets PKRU to disable access to M T (lines 8-11). The macro PKRU_DISALLOW_TRUSTED is a PKRU setting that excludes access to M T. Next, the call gate checks that the PKRU was actually loaded with PKRU_DISALLOW_TRUSTED (line 12). If this is not the case, it terminates the program (line 14); otherwise, it returns control to U (lines 15-16). It may seem that the check on line 12 is pointless since it will always succeed (eax is set to PKRU_DISALLOW_TRUSTED on line 10). While this will be the case under normal operation, the check prevents exploitation of the WRPKRU on line 11 with control-flow hijack attacks (explained next).
Safety Both occurrences of WRPKRU in the call gate are safe. Neither can be exploited by a control flow hijack to get unauthorized access to M T. The first occurrence of WRPKRU (line 4) is of form (A) and has a control transfer to a designated entry point of T right after it. This occurrence cannot be exploited to transfer control to anything else. The second occurrence of WRPKRU (line 11) is followed by a check that terminates the program if the new permissions include access to M T. If, as part of an attack, the execution jumped directly to line 11 with PKRU_ALLOW_TRUSTED in eax, the program would be terminated on line 14.
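The safety argument can be traced with a small simulation of the call gate's control flow. This is a model, not the gate itself: the two PKRU constants are hypothetical values (they assume M T is tagged with key 1), and the class and function names are invented for illustration. The line-number comments refer to the call gate steps described above.

```python
PKRU_ALLOW_TRUSTED = 0x0     # hypothetical value: access to every domain
PKRU_DISALLOW_TRUSTED = 0x4  # hypothetical value: access-disable bit of key 1


class ProgramTerminated(Exception):
    """Stands in for terminating the process."""


class Core:
    def __init__(self):
        self.eax = 0
        self.pkru = PKRU_DISALLOW_TRUSTED  # control starts in U


def leave_trusted(core, hijacked_jump_to_wrpkru=False):
    """Second half of the gate; a hijack may enter directly at the WRPKRU."""
    if not hijacked_jump_to_wrpkru:
        core.eax = PKRU_DISALLOW_TRUSTED   # line 10
    core.pkru = core.eax                   # line 11: WRPKRU
    if core.eax != PKRU_DISALLOW_TRUSTED:  # line 12: the check
        raise ProgramTerminated()          # line 14
    # lines 15-16: control returns to U


def call_gate(core, trusted_entry):
    core.eax = PKRU_ALLOW_TRUSTED          # lines 1-3
    core.pkru = core.eax                   # line 4: WRPKRU
    result = trusted_entry(core)           # line 6: designated entry point of T
    leave_trusted(core)
    return result


core = Core()
pkru_seen_by_trusted = call_gate(core, lambda c: c.pkru)
```

In the normal path the trusted entry point observes PKRU_ALLOW_TRUSTED and the core ends with PKRU_DISALLOW_TRUSTED. Calling `leave_trusted` with `hijacked_jump_to_wrpkru=True` and PKRU_ALLOW_TRUSTED preloaded in `eax` raises `ProgramTerminated` before any untrusted code could run with elevated permissions.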
Efficiency A call gate's overhead on a roundtrip from U to T is two WRPKRUs, a few very fast, standard register operations and one conditional branch instruction. This overhead is very low compared to other hardware isolation techniques that rely on inter-process communication, syscalls or hypervisor trampolines to change privileges.
Use considerations ERIM's call gate omits features that some readers may naturally expect. These features have been omitted to avoid having to pay their overhead when they are not needed. First, the call gate does not include support to pass parameters from U to T or to pass a result from T to U. These can be passed via a designated shared buffer in M U (both U and T have access to M U ). Second, the call gate does not scrub registers when switching from T to U. So, if T uses confidential data, it should scrub any secrets from registers before returning to U. Further, because T and U share the call stack, T must also scrub secrets from the stack prior to returning. Alternatively, T can allocate a private stack for itself in M T , and T's entry point can switch to that stack immediately upon entry. This prevents T's secrets from being written to U's stack in the first place. (A private stack is also necessary for multi-threaded applications; see Section 3.7).
Binary inspection
Next, we describe ERIM's binary inspection. The inspection prevents U from mapping any executable pages with unsafe WRPKRU occurrences and consists of two parts: (i) an inspection function that verifies that a sequence of pages does not contain unsafe WRPKRU occurrences; and, (ii) an interception mechanism that prevents U from mapping executable pages without inspection.
Inspection function The inspection function scans a sequence of pages for occurrences of WRPKRU. It also inspects any adjacent executable pages in the address space for WRPKRU occurrences straddling a page boundary. For every WRPKRU, it checks that the WRPKRU is safe, i.e., either condition (A) or condition (B) from Section 3.1 holds.
To check for condition (A), ERIM needs a list of designated entry points of T. The source of this list depends on the nature of T and is trusted. If T consists of library functions, then the programmer marks these functions, e.g., by including a unique character sequence in their names. If the functions are not inlined by the compiler, their names will appear in the symbol table. If T's functions are subject to inlining or if they are generated by a compiler pass, then the compiler must be directed to add their entry locations to the symbol table with the unique character sequence. In all cases, ERIM can identify designated entry points by looking at the symbol table and make them available to the inspection function.
Condition (B) is checked easily by verifying that the WRPKRU is immediately followed by exactly the instructions on lines 12-15 of Listing 1. These instructions ensure that the WRPKRU cannot be used to enable access to M T and continue execution.
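A minimal sketch of such an inspection function follows. It is a simplification under stated assumptions: entry points are given as byte offsets into the scanned region, condition (A) is modeled as the entry point starting at the byte right after the WRPKRU, and CHECK_SEQUENCE is a placeholder rather than the real encoding of lines 12-15 of Listing 1. All names are hypothetical.

```python
WRPKRU = bytes.fromhex("0f01ef")  # opcode bytes of the WRPKRU instruction

# Placeholder bytes standing in for the encoded check sequence
# (lines 12-15 of Listing 1); NOT the real encoding.
CHECK_SEQUENCE = bytes.fromhex("aabbccdd")


def unsafe_wrpkru_offsets(code: bytes, entry_points: set) -> list:
    """Return the offsets of WRPKRU occurrences that satisfy neither
    condition (A) nor condition (B)."""
    unsafe = []
    offset = code.find(WRPKRU)
    while offset != -1:
        after = offset + len(WRPKRU)
        enters_trusted = after in entry_points            # condition (A)
        is_checked = code.startswith(CHECK_SEQUENCE, after)  # condition (B)
        if not (enters_trusted or is_checked):
            unsafe.append(offset)
        offset = code.find(WRPKRU, offset + 1)
    return unsafe


# Offset 0: WRPKRU followed by a T entry point at offset 3 (safe, A).
# Offset 7: WRPKRU followed by the check sequence (safe, B).
# Offset 16: WRPKRU bytes inside an immediate, followed by neither (unsafe).
region = (WRPKRU + b"\x90" * 4 + WRPKRU + CHECK_SEQUENCE
          + b"\xc3" + b"\xb8" + WRPKRU + b"\x00")
print(unsafe_wrpkru_offsets(region, entry_points={3}))
```

Because the scan operates on a contiguous byte string, passing adjacent executable pages as one region also covers occurrences that straddle a page boundary.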
Interception With recent versions of Linux (v4.6), the interception can be implemented as follows without requiring kernel changes. We install a seccomp-bpf filter [27] that catches mmap, mprotect, and pkey_mprotect syscalls whose protection argument includes PROT_EXEC, i.e., calls that attempt to map a region of memory as executable. Since the bpf filtering language currently has no provisions for reading the PKRU register, we rely on seccomp-bpf's SECCOMP_RET_TRACE option to notify a ptrace()-based tracer process. The tracer inspects the tracee process and allows the syscall if it was invoked from T and denies it otherwise. The tracer process is configured so that it traces any child of the tracee process as well. While ptrace() interception is expensive, note that it is required only when a program maps pages as executable, which is normally an infrequent operation.
Alternatively, the interception can be implemented with a simple Linux Security Module (LSM) [45] , which allows mmap, mprotect and pkey_mprotect system calls only from T. (Whether such a call is made by U or T is easily determined by examining the PKRU register value at the time of the syscall.) With either interception approach in place, U must go through T to map executable pages. T maps the pages only after they have passed the inspection function.
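The decision logic of the seccomp-based variant can be summarized in a few lines. The sketch below is a model of the policy only; it does not install a real filter, and the function names and the string verdicts are illustrative. PROT_EXEC is given its Linux value.

```python
PROT_READ = 0x1
PROT_EXEC = 0x4  # value of PROT_EXEC on Linux
INTERCEPTED = {"mmap", "mprotect", "pkey_mprotect"}


def filter_action(syscall: str, prot: int) -> str:
    """Model of the filter: hand requests for executable memory to the tracer."""
    if syscall in INTERCEPTED and prot & PROT_EXEC:
        return "TRACE"
    return "ALLOW"


def decide(syscall: str, prot: int, invoked_from_trusted: bool) -> str:
    """Final verdict: the tracer allows a traced syscall only if T issued it."""
    if filter_action(syscall, prot) == "TRACE":
        return "ALLOW" if invoked_from_trusted else "DENY"
    return "ALLOW"


print(decide("mprotect", PROT_READ | PROT_EXEC, invoked_from_trusted=False))
```

Only the rare requests for executable memory reach the expensive tracer path; every other syscall is allowed by the filter directly.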
Regardless of the interception method, pages can be inspected upfront at the time T attempts to map them as executable, or on demand when they are executed for the first time. On-demand inspection is preferable when a program maps a large executable segment but commonly executes only a small number of pages. With on-demand inspection, when the process maps a region as executable, T instead maps the region read-only but records that the pages in the region are supposed to be executable pending inspection.
When control transfers to such a page, a fault occurs. The fault traps to a dedicated signal handler, which ERIM installs when it initializes (the LSM or the tracer prevents U from overriding this signal handler). This signal handler calls a T function which checks whether the faulting page is pending inspection and, if so, inspects the page. If the inspection passes, then the handler remaps the page with the execute permission and resumes execution of the faulting instruction, which will now succeed. If not, the program is terminated.
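The on-demand scheme amounts to a small state machine per page. The following sketch models it with plain Python data structures; the class and method names are hypothetical, and the treatment of faults on pages that are not pending inspection (termination) is an assumption, since the text does not specify it.

```python
class ProgramTerminated(Exception):
    """Stands in for terminating the process."""


class OnDemandInspector:
    def __init__(self, is_safe):
        self.is_safe = is_safe   # inspection function: bytes -> bool
        self.pending = {}        # page number -> contents, mapped read-only
        self.executable = set()  # pages that passed inspection

    def map_executable(self, page: int, contents: bytes) -> None:
        """T maps the page read-only and records it as pending inspection."""
        self.pending[page] = contents

    def on_execute_fault(self, page: int) -> None:
        """Signal-handler path taken when control reaches a pending page."""
        if page not in self.pending:
            # Assumption: a fault on any other page is treated as fatal here.
            raise ProgramTerminated(f"unexpected fault on page {page}")
        contents = self.pending.pop(page)
        if not self.is_safe(contents):
            raise ProgramTerminated(f"unsafe WRPKRU on page {page}")
        self.executable.add(page)  # remap with execute permission, then resume


# Toy inspection function: reject any page containing the WRPKRU opcode.
inspector = OnDemandInspector(lambda code: bytes.fromhex("0f01ef") not in code)
inspector.map_executable(1, b"\x90\x90\xc3")
inspector.map_executable(2, bytes.fromhex("b80f01ef00"))
inspector.on_execute_fault(1)
```

Pages that are mapped but never executed, such as page 2 if control never reaches it, are never inspected, which is the source of the savings for large executable segments.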
The interception and binary inspection have very low overhead in practice because an executable page is scanned at most once. They are also fully transparent to U's code if all WRPKRUs in the binary are already safe.
Security We briefly summarize how ERIM attains security. The binary inspection mechanism prevents U from mapping any executable page with an unsafe WRPKRU. T does not contain any executable unsafe WRPKRU by assumption. Consequently, only safe WRPKRUs are executable in the entire address space at any point. Safe WRPKRUs preserve ERIM's two security invariants (1) and (2) by design. Thus M T is accessible only while T executes starting from legitimate T entry points, which are safe by assumption. Hence, M T remains isolated from U.
Lifecycle of an ERIM application
Besides binary inspection (Section 3.4), all of ERIM is implemented as a runtime library that is linked into a binary either statically or at load time through LD_PRELOAD. Importantly, before starting U's main(), as part of ERIM's initialization, existing malloc-like functions are overridden with functions that select the allocator based on the PKRU value. This is not necessary for security, but rather for the convenience of the programmer. If U calls the allocator of M T, the allocator fails to access its internal data structures and the program crashes.
The initialization function, called init here, creates the memory domain M T and maps memory to it (M U occupies the default MPK domain, which is automatically created with the process). It then loads T's code and data from a dynamic link library. Next, init scans the code of T for unsafe WRPKRUs and installs one of the interception mechanisms described in Section 3.4. Finally, init scans U's code, disables access to M T and transfers control to U's main().
After main() has control, U executes almost as usual. It maps and unmaps memory in the domain M U . However, to access T's exported services, U must invoke a call gate to enable access to M T and invoke a T entry point. Hence, U's binary must be constructed to invoke call gates to T at appropriate points. This is done with two techniques. First, LD_PRELOAD can be used to re-link explicit T function calls to a library of wrappers that invoke a call gate. Second, T calls from functions inserted by the compiler can be made to directly invoke the call gate by modifying these functions.
Developing ERIM applications
Next, we describe the process of developing an application that isolates sensitive state using ERIM. Here, we describe the process for programs written in C; we have also implemented a language binding for Rust.
For C, ERIM provides a runtime library and a C header file. The header file defines macros that can be used by the programmer to insert call gates at appropriate points in the program. The runtime library provides the init functions and dynamic memory allocation stubs, which replace the standard libc malloc and free. When these stub functions are invoked, they check the current value of PKRU and then redirect the call to affect either U's or T's heap.
Listing 2 demonstrates an example C program that isolates a data structure called secret (lines 1-3). The structure contains an integer value.
Two functions, initSecret and compute, reference the secret and bracket their respective accesses with call gates using the macros ERIM_SWITCH_T and ERIM_SWITCH_U. ERIM isolates the secret such that only code that appears between ERIM_SWITCH_T and ERIM_SWITCH_U, i.e., code in T, may access the secret. initSecret allocates an instance of secret while executing inside T, which implicitly allocates secret in M T , and initializes the secret value. compute computes a function f of the secret inside T.
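A minimal sketch of a program in this style follows. It is an illustration under stated assumptions, not the listing itself: the two macros are stubbed to set a simulated domain flag instead of executing WRPKRU, `erim_malloc_stub`, the counters and the function `f` are hypothetical, and the field name and the stored value are placeholders.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simulated domain flag standing in for the PKRU register (assumption:
 * real call gates execute WRPKRU instead of setting a variable). */
int in_trusted = 0;

/* Stubs for the header's macros so the sketch builds without MPK. */
#define ERIM_SWITCH_T do { in_trusted = 1; } while (0)
#define ERIM_SWITCH_U do { in_trusted = 0; } while (0)

/* Counters that model which heap served an allocation. */
size_t trusted_allocs = 0, untrusted_allocs = 0;

/* Hypothetical allocator stub: picks T's or U's heap by current domain. */
void *erim_malloc_stub(size_t n)
{
    if (in_trusted)
        trusted_allocs++;
    else
        untrusted_allocs++;
    return malloc(n);
}

/* The isolated data structure: a single integer value. */
typedef struct secret {
    int number;
} secret;

/* Allocates and initializes the secret while executing inside T. */
secret *initSecret(void)
{
    ERIM_SWITCH_T;
    secret *s = erim_malloc_stub(sizeof *s);
    if (s)
        s->number = 20;   /* placeholder secret value */
    ERIM_SWITCH_U;
    return s;
}

/* Hypothetical f: stands in for any function of the secret. */
static int f(int x) { return x + 1; }

/* Computes f of the secret inside T; only the result leaves T. */
int compute(const secret *s)
{
    int ret;
    ERIM_SWITCH_T;
    ret = f(s->number);
    ERIM_SWITCH_U;
    return ret;
}
```

Because the allocation in `initSecret` happens between the two macros, the stub attributes it to T's heap, which mirrors the implicit allocation in M T described above.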
Extensions
Next, we discuss various extensions to the basic design.
Multi-threaded processes ERIM's design works as-is with multi-threaded applications because MPK uses a per-core PKRU register. Threads are created as usual, e.g. using libpthread. The PKRU register is saved and restored by the kernel during context switches. However, multi-threading imposes an additional requirement on T (not on ERIM): In a multi-threaded application, it is essential that T allocate a private stack in M T (not M U ) for each thread and execute its code on these stacks. This is easy to implement by switching stacks at T's entry points. Not doing so and executing T on standard stacks in M U runs the risk that, while a thread is executing in T, another thread executing in U may corrupt or read the first thread's stack frames. This can potentially destroy T's integrity, leak its secrets and hijack control while access to M T is enabled. By executing T's code on stacks in M T , such attacks are prevented.
More than two components per process Our description of ERIM so far has been limited to two components (T and U) per process. However, ERIM generalizes easily to support as many components as the number of domains Linux's MPK support can provide (this could be less than 16 because the kernel may reserve a few domains for specific purposes).
ERIM for integrity only Some applications care only about the integrity of protected data, but not its confidentiality. Examples include CPI, which needs to protect only the integrity of code pointers. In such applications, efficiency can be improved by allowing U to read M T directly, thus avoiding the need to invoke a call gate for reading M T .
The ERIM design we have described so far can be easily modified to support this. Only the definition of the constant PKRU_DISALLOW_TRUSTED in Listing 1 has to change to also allow read-only access to M T . With this change, read access to M T is always enabled.
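To make the change concrete, the sketch below derives the two PKRU settings from the architectural register layout, in which protection key k owns an access-disable bit at position 2k and a write-disable bit at position 2k+1. The choice of key 1 for M T and all function names are assumptions made for illustration; the values are not claimed to match the constants in Listing 1.

```c
#include <stdint.h>

/* Architectural PKRU layout: for protection key k, bit 2k is
 * access-disable (AD) and bit 2k+1 is write-disable (WD). */
#define PKRU_AD(k) (UINT32_C(1) << (2 * (k)))
#define PKRU_WD(k) (UINT32_C(1) << (2 * (k) + 1))

/* Assumption for illustration: M_T is tagged with protection key 1. */
enum { TRUSTED_KEY = 1 };

/* While U runs, M_T can be neither read nor written
 * (confidentiality and integrity). */
uint32_t pkru_disallow_all(void)
{
    return PKRU_AD(TRUSTED_KEY) | PKRU_WD(TRUSTED_KEY);
}

/* Integrity-only variant: U may read M_T but not write it. */
uint32_t pkru_disallow_write(void)
{
    return PKRU_WD(TRUSTED_KEY);
}

/* Reads are allowed unless the key's AD bit is set. */
int can_read(uint32_t pkru, int key)
{
    return (pkru & PKRU_AD(key)) == 0;
}

/* Writes are allowed only if neither AD nor WD is set for the key. */
int can_write(uint32_t pkru, int key)
{
    return (pkru & (PKRU_AD(key) | PKRU_WD(key))) == 0;
}
```

With the integrity-only value, `can_read` holds for the trusted key while `can_write` does not, which is the behaviour the modified constant is meant to produce.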
Just-in-time (jit) compilers with ERIM Existing jit compilers allocate writable code pages and alter their permissions to execute-only once the compilation finishes. ERIM's binary inspection scans the page and only enables the execute bit if no unsafe WRPKRU exists. This mechanism is safe, but may lead to program crashes if the jit compiler accidentally emits a WRPKRU sequence. ERIM-aware jit compilers can emit WRPKRU-free binary code by relying on the rewrite strategy described in Section 4, and inserting call gates when necessary. Beyond being compatible with ERIM, jit compilers can also use ERIM to defend against memory-corruption attacks [17] that target the jit compiler's own state. ERIM can efficiently protect this state by isolating the jit compiler in the trusted domain, while the application runs in the untrusted domain. As a result, the untrusted application cannot access the jit compiler's state, which prevents such attacks. Compared to existing work [15], which relies on Intel SGX to isolate the compiler's state, ERIM's isolation is highly efficient.
OS privilege separation (extension)
The design described so far provides memory isolation. Some applications, however, require privilege separation between T and U with respect to OS resources. For instance, an application might need to restrict the filesystem name space accessible to U or restrict the system calls available to U.
ERIM can be easily extended with a few kernel changes to support privilege separation with respect to OS resources. During process initialization, ERIM's init function can instruct the kernel to restrict U's access rights. After this, the kernel refuses to grant access to restricted resources whenever the value of the PKRU is not PKRU_ALLOW_TRUSTED, indicating that the syscall does not originate from T. To access restricted resources, U has to invoke T, which can act as a reference monitor.
Rewriting inadvertent WRPKRUs
For security, our binary inspection (Section 3.4) requires binaries to have only safe WRPKRU occurrences. WRPKRUs emitted purposefully by a compiler can be made safe by changing the compiler to insert the check on lines 12-15 of Listing 1 after every WRPKRU that is not followed by an entry point to T. Inadvertent WRPKRUs, those that occur unintentionally as parts of longer x86 instructions or spanning two consecutive x86 instructions, are more interesting. We describe a rewrite strategy to eliminate such WRPKRUs. The strategy is complete: It can rewrite any sequence of x86 instructions containing an inadvertent WRPKRU to a functionally equivalent sequence without any WRPKRUs.
Rewrite strategy WRPKRU is a 3 byte instruction, 0x0F01EF. WRPKRU sequences that span two or more instructions can be "broken" by inserting a 1 byte nop like 0x90 between any two consecutive instructions. 0x90 does not coincide with any individual byte of WRPKRU, so this insertion cannot generate a new WRPKRU.
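The nop insertion can be sketched as follows. This is an illustrative model only: `find_wrpkru` and `insert_nop` are hypothetical helper names, the caller is assumed to know where the instruction boundary lies, and a real rewriter would also have to adjust relative branch displacements that cross the insertion point.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the offset of the first 0F 01 EF in buf, or -1 if absent. */
long find_wrpkru(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 3 <= len; i++)
        if (buf[i] == 0x0F && buf[i + 1] == 0x01 && buf[i + 2] == 0xEF)
            return (long)i;
    return -1;
}

/* Copies in[0..len) to out with a 1-byte nop (0x90) inserted at offset
 * `boundary`, which must be an instruction boundary (0 <= boundary <= len).
 * out must have room for len + 1 bytes. Returns the new length. */
size_t insert_nop(uint8_t *out, const uint8_t *in, size_t len,
                  size_t boundary)
{
    memcpy(out, in, boundary);
    out[boundary] = 0x90;
    memcpy(out + boundary + 1, in + boundary, len - boundary);
    return len + 1;
}
```

As an example input, the bytes 05 00 00 00 0F (add eax, 0x0F000000) followed by 01 EF (add edi, ebp) contain 0F 01 EF across the boundary at offset 5; inserting the nop there yields 05 00 00 00 0F 90 01 EF, in which the sequence no longer occurs.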
A WRPKRU sequence that lies entirely in a longer instruction can be eliminated by finding an equivalent sequence of instructions. Doing so systematically requires an understanding of x86 instruction encoding. An x86 instruction contains: (i) an opcode field possibly with prefix, (ii) a MOD R/M field that determines the addressing mode and includes a register operand, (iii) an optional SIB field that specifies registers for indirect memory addressing, and (iv) optional displacement and/or immediate fields that specify constant offsets for memory operations and other constant operands.
Our strategy for rewriting an instruction depends on the fields with which the WRPKRU subsequence overlaps. Table 1 summarizes the strategy. If the WRPKRU sequence lies entirely in the opcode field, then the instruction is WRPKRU. This case is handled by adding the check (A) after the instruction to make it safe.
If the sequence overlaps with the MOD R/M field, we change the register in the MOD R/M field. This requires a free register. If one does not exist, we rewrite to push an existing register to the stack, use it in the instruction, and pop it back. (See lines 2 and 3 in Table 1.) If the sequence overlaps with the displacement or the immediate field, we change the mode of the instruction to use a register instead of a constant. The constant is computed in the register before the instruction (lines 4 and 6). If a free register is unavailable, we push and pop one. Two instruction-specific optimizations are possible. First, for jump-like instructions, the jump target can be relocated in the binary; this changes the displacement in the instruction, thus not needing a free register (line 5). Second, associative operations like addition can be performed in two increments without an extra register (line 7).
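The second optimization, performing an addition in two increments, can be sketched as below. This is a simplified illustration, not the paper's rewriter: the function names are hypothetical, only a 32-bit immediate is considered (checked in its little-endian encoded form, as x86 stores it), bytes of neighbouring fields are ignored, and differences in intermediate flag values are not modelled.

```c
#include <stdint.h>

/* Writes the little-endian encoding of imm (as x86 stores an imm32). */
void encode_imm32(uint32_t imm, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(imm >> (8 * i));
}

/* Returns the byte offset (0 or 1) at which 0F 01 EF occurs inside the
 * encoded immediate, or -1 if it does not occur. */
int wrpkru_offset_in_imm32(uint32_t imm)
{
    uint8_t b[4];
    encode_imm32(imm, b);
    for (int i = 0; i + 3 <= 4; i++)
        if (b[i] == 0x0F && b[i + 1] == 0x01 && b[i + 2] == 0xEF)
            return i;
    return -1;
}

/* Splits imm into first + second (mod 2^32) so that neither encoding
 * contains 0F 01 EF: the offending 0x0F byte is lowered to 0x0E and the
 * difference is carried by the second addend. If imm is already free of
 * the sequence, second is 0 and no split is needed. */
void split_add_imm32(uint32_t imm, uint32_t *first, uint32_t *second)
{
    int off = wrpkru_offset_in_imm32(imm);
    uint32_t delta = (off < 0) ? 0 : (UINT32_C(1) << (8 * off));
    *first = imm - delta;
    *second = delta;
}
```

Subtracting 1 from a byte equal to 0x0F never borrows, so only that byte changes, and the second addend has a single non-zero byte (0x01) that cannot form the sequence on its own.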
We never rewrite the SIB field. This does not affect the completeness of our technique since any WRPKRU must overlap with at least one non-SIB field (the SIB field is 1 byte long while WRPKRU is 3 bytes long).
Implementing the rewriting For binaries that can be (re)compiled from source, rewriting can be added to the codegen phase of the compiler, which converts the intermediate representation (IR) to machine instructions. Whenever codegen outputs an inadvertent WRPKRU, the surrounding instructions in the IR can be replaced with equivalent WRPKRU-free instructions as described above, and codegen can be run again on the updated IR.
For binaries that cannot be compiled, the rewrite strategy can be integrated with our binary inspection mechanism (Section 3.4). If the mechanism discovers an unsafe WRPKRU on an executable page during its scan, it can overwrite the page with 1-byte trap instructions, make it executable, and store the original page in reserve without enabling it for execution. Later, if there is a jump into the executable page, a trap occurs and the trap handler discovers an entry point into the page. It can then disassemble the reserved page from that entry point on, rewriting any discovered WRPKRU occurrences, and copy the WRPKRU-free instruction sequences back to the executable page. To prevent other threads from executing partially overwritten instruction sequences, we actually rewrite a fresh copy of the executable page with the WRPKRU-free sequences, and then swap this rewritten copy for the executable page. This technique is transparent to the application, has an overhead proportional to the number of entry points in offending pages (it disassembles from every entry point only once) and maintains the invariant that only safe WRPKRU sequences are executable. Alternatively, binaries can also be disassembled and rewritten ahead of time, an idea we actually implemented to test our rewrite strategy.
Table 1: Rewrite strategy for intra-instruction occurrences of WRPKRU. Row for an associative operation with an immediate: add ebx, 0x0F01EF00 → add ebx, 0x0E01EF00; add ebx, 0x01000000
Use Cases
ERIM advances prior work by providing efficient isolation when switches between trusted and untrusted components are very frequent, of the order of 10^5 or 10^6 times a second. We describe three such use cases here, and show in Section 6 that ERIM's overhead is low on all of them.
Isolating cryptographic keys in web servers Isolating long-term SSL keys to protect from web server vulnerabilities such as the Heartbleed bug [34] is well-studied [30, 31]. However, long-term keys are accessed relatively infrequently, typically only a few times per user session. Session keys, on the other hand, are accessed far more frequently, up to 10^6 times a second per core in a high throughput web server like nginx. Isolating session keys is relevant because these keys protect the confidentiality of individual users. With its low-cost switching, ERIM can be used to isolate session keys efficiently. To verify this, we partitioned OpenSSL's low-level crypto library (libcrypto) to isolate the session keys and basic crypto routines, which run as T, from the rest of the web server, which runs as U.
Native libraries in managed runtimes
Managed runtimes such as a Java or JavaScript VM often rely on third-party native libraries written in unsafe languages for performance. A relevant security goal is to isolate the managed runtime from bugs and vulnerabilities in the native libraries. ERIM can be used for this purpose by mapping the managed runtime to T and the native libraries to U. We test this by isolating a native SQLite plugin from Node.js. Node.js is a state-of-the-art managed runtime for JavaScript and SQLite is a state-of-the-art database library written in C [1, 2].
CPI/CPS Code-pointer integrity (CPI) [29] prevents control-flow hijacks by isolating sensitive objects (code pointers and objects that can lead to code pointers) in a safe region that cannot be written without bounds checks. CPS is a lighter, less-secure variant of CPI that isolates only code pointers. Switching rates to the safe region can be very high in CPI, of the order of 10^6 switches per second on standard benchmarks. A key challenge is to isolate the safe region efficiently. We show that ERIM can provide strong isolation for the safe region at low cost. To do this, we override the CPI/CPS-enabled compiler's intrinsic function for writing the sensitive region to use a call gate around an inlined sequence of T code that performs a bounds check before the write. (MemSentry [28] also proposes using MPK for isolating the safe region, but does not actually implement this.)
Evaluation
We implemented a prototype of ERIM on Linux v4.9.60. The prototype includes a 77-line Linux Security Module (LSM) that intercepts all mmap and mprotect calls to prevent U from mapping pages in executable mode, and prevents U from overriding the binary inspection handler. We additionally added 26 LoC in kernel hooks needed for this module. We also implemented ERIM on an unmodified v4.9.60 Linux kernel using the ptrace-based technique described in Section 3.4. In the following, however, we show results obtained with the modified kernel. The performance of the stock Linux kernel with ERIM is very similar, except that the cost of mmap, mprotect, and pkey_mprotect syscalls that enable execute permissions is about 10x higher. Since these are infrequently used operations, the impact on the overall performance of the applications we considered is negligible.
Our implementation also includes the ERIM runtime library, which provides a memory allocator over M T , call gates, the ERIM initialization code, and binary inspection. These comprise 569 LoC. Separately, we have implemented the rewriting logic to eliminate inadvertent WRPKRU occurrences (about 2250 LoC). While we have not yet integrated the logic into either a compiler or our inspection handler, we have integrated it into a standalone binary rewriting tool that uses Dyninst [13] to disassemble binaries. The binaries used in our evaluation do not have any unsafe WRPKRU occurrences and do not load any libraries at runtime.
We evaluate the ERIM prototype on microbenchmarks and on the three applications mentioned in Section 5. We perform our experiments on Dell PowerEdge R640 machines with 16-core MPK-enabled Intel Xeon Gold 6142 2.6GHz CPUs (with the latest firmware, Turbo Boost and SpeedStep disabled), 384GB memory, 10Gbps Ethernet links, running Debian 8. For the OpenSSL/webserver experiment, we use nginx v1.12.1 and OpenSSL v1.1.1 and the ECDHE-RSA-AES128-GCM-SHA256 cipher. For the managed language runtime experiment, we use Node.js v9.11.1 and SQLite v3.22.0. As a comparison baseline, we use SQLite compiled to WebAssembly via emscripten v1.37.37's WebAssembly backend [3]. For the CPI experiment, we use the Levee prototype v0.2 available from http://dslab.epfl.ch/proj/cpi/ and Clang v3.3.1 including its CPI compile pass, runtime library extensions and link-time optimization.
Microbenchmarks
Switch cost We performed a microbenchmark to measure the overhead of invoking a function with and without a switch to a trusted component. The function adds a constant to an integer argument and returns the result. Table 2 shows the cost of invoking the function, in cycles, as an inlined function (I), as a directly called function (DC), and as a function called via a function pointer (FP). For reference, the table also includes the cost of a simple syscall (getpid) and the cost of a switch on lwCs, a recent in-process isolation mechanism based on standard page table protections [30]. In our microbenchmark, calls with an ERIM switch are between 55 and 80 cycles more expensive than their no-switch counterparts. The most expensive indirect call costs less than the simplest system call (getpid). ERIM switches are up to 100x faster than lwC switches.
Table 2: Cycle counts for basic call and return.
Because the CPU must not reorder loads and stores with respect to a WRPKRU instruction, the overhead of an ERIM switch depends on the CPU pipeline state at the time the WRPKRUs are executed. In experiments described later in this section, we observed average overheads ranging from 11 to 260 cycles per switch. At a clock rate of 2.6GHz, this corresponds to overheads between 0.04% and 1.0% for 100,000 switches per second, which is significantly lower than the overhead of any kernel- or hypervisor-based isolation.
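The conversion from cycles per switch to a percentage overhead is simple arithmetic: the cycles spent switching per second divided by the cycles available per second. A minimal sketch, with a hypothetical function name and assuming all switches run on a single core at the stated clock rate:

```c
/* Percentage of one core's cycles consumed by domain switches.
 * cycles_per_switch: average cost of one switch, in cycles
 * switches_per_sec:  switch rate on that core
 * clock_hz:          core clock rate in Hz */
double switch_overhead_percent(double cycles_per_switch,
                               double switches_per_sec,
                               double clock_hz)
{
    return 100.0 * cycles_per_switch * switches_per_sec / clock_hz;
}
```

For 11 and 260 cycles per switch at 100,000 switches per second and 2.6GHz, this gives roughly 0.04% and 1.0%, matching the range quoted above.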
Binary inspection To determine the cost of ERIM's binary inspection, we measured the cost of scanning the binaries of each of the 17 applications in the SPEC 2006 CPU benchmark, which range in size from 9 to 3918 4KB pages and contain between 35 and 63765 WRPKRU instructions when compiled with CPI (see Section 6.4). The overhead is largely independent of the number of WRPKRU instructions and ranges between 3.5 and 6.2 microseconds per page. Even for the largest binary, this amounts to only 17.7 milliseconds, a tiny fraction of the typical runtime of a process.
Protecting session keys in nginx
Next, we use ERIM to isolate SSL session keys in a high performance web server, nginx. We modified OpenSSL's libcrypto to isolate the keys and the functions for AES key allocation and encryption/decryption into ERIM's T and use ERIM call gates to invoke these functions.
Our goal is to measure ERIM's overhead on the peak throughput of nginx. To start, we configure nginx to run a single worker pinned to a CPU core, and connect to it remotely over HTTPS with keep-alive from 4 concurrent ApacheBench [4] instances each simulating 75 concurrent clients. The clients all request the same file, whose size we vary from 0 to 128KB across experiments. Figure 1 shows the average throughput of 10 runs of an ERIM-protected nginx relative to native nginx without any protection for different file sizes, measured after an initial warm-up period. ERIM-protected nginx achieves at least 95.18% of the throughput of the unprotected server for all request sizes. To explain the overhead further, we list the number of ERIM switches per second in the nginx worker and the worker's CPU utilization in Table 3 for request sizes up to 128KB. The overhead shows a general trend up to requests of size 32 KB: The worker's core remains saturated but as the request size increases, the number of ERIM switches per second decreases, and so does ERIM's relative overhead. The observations are consistent with an overhead of about 0.31%-0.44% for 100,000 switches per second. For request sizes of 64KB and higher, the 10Gbps network card saturates and the worker does not utilize its CPU core completely in the baseline. The free CPU cycles absorb ERIM's CPU overhead, so ERIM's throughput matches that of the baseline.
Table 3: Nginx throughput with a single worker. The standard deviation is below 1.1% in all cases.
Note that this is an extreme test case for a web server. Here, the web server does almost nothing and serves the same cached file repeatedly. To get a more realistic assessment, we set up nginx to serve from a 571 MB corpus of 15,520 static HTML Wikipedia pages snapshotted in 2006 [43]. File sizes vary from 417 bytes to 522 KB (average size 37.7 KB). 75 keep-alive clients request random pages (selected based on pageviews on Wikipedia [44]). The average throughput with a single nginx worker was 22,415 requests/s in the baseline and 21,802 requests/s with ERIM (std. devs. below 0.6% in both cases). On average, there were 615,000 switches a second. This corresponds to a total overhead of 2.7%, or about 0.43% for 100,000 switches a second.
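The normalization used here can be written out explicitly. The sketch below uses hypothetical function names and assumes the overhead scales linearly with the switch rate; with the throughputs and switch rate reported above it yields a total overhead of roughly 2.7% and a little over 0.4% per 100,000 switches a second, in line with the quoted figures up to rounding.

```c
/* Relative throughput loss of the protected server, in percent. */
double throughput_overhead_percent(double baseline_rps,
                                   double protected_rps)
{
    return 100.0 * (baseline_rps - protected_rps) / baseline_rps;
}

/* Scales a total overhead to a rate of 100,000 switches per second,
 * assuming the overhead grows linearly with the switch rate. */
double overhead_per_100k_switches(double overhead_percent,
                                  double switches_per_sec)
{
    return overhead_percent * 100000.0 / switches_per_sec;
}
```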
Scaling with multiple workers To verify that ERIM scales with core parallelism, we re-ran the first experiment above with 3, 5 and 10 nginx workers pinned to separate cores, and sufficient numbers of concurrent clients to saturate all the workers. Table 4 shows the relative overheads with different numbers of workers. (For requests larger than those shown in the table, the network card saturates, and the spare CPU cycles absorb ERIM's overhead completely.) The overheads were independent of the number of workers (cores), indicating that ERIM adds no additional synchronization and scales perfectly with core parallelism. This is unsurprising since MPK's PKRU is per-core and updates to the PKRU of a core affect execution on that core only.
Comparison to kernel-based isolation Using the first experiment above, we also compare ERIM's overhead to that of lwCs [30], a state-of-the-art system for in-process isolation based on standard page-table protections. LwCs map each isolated component to a separate address space (in the same process). A switch between components requires kernel mediation to change page tables. Due to lack of space, we omit the detailed results but, briefly, we find that lwCs perform significantly worse than ERIM in this experiment: The throughput of nginx with lwC-based isolation is never above 80% of native nginx and, for small requests, where the switch rate is higher, it is below 50% of native nginx. In contrast, with ERIM's isolation, the throughput is at least 95% of native nginx in all configurations. Hence, ERIM performs significantly better than kernel-mediated isolation.
Isolating managed runtimes
Next, we use ERIM to isolate a managed language runtime from an untrusted native library. Specifically, we link the widely-used C database library, SQLite, to Node.js, a state-of-the-art JavaScript runtime and use ERIM to isolate Node.js from SQLite by mapping Node.js's runtime to T and SQLite to U. We manually instrumented SQLite's entrypoints to invoke call gates. Additionally, since we want to isolate Node.js's stack from SQLite, we run Node.js on a separate stack in M T , and add code to switch to the standard stack (in M U ) prior to calling a SQLite function. Finally, SQLite uses the libc function memmove, which accesses libc constants that are in M T , so we implemented a separate memmove for SQLite. In total, we added 437 LoC.
We measure overheads on the speedtest1 benchmark that comes with SQLite and emulates a typical database workload [ deletes. We increased the iterations in each test by a factor of four to make the tests longer. Our baseline for comparison is native SQLite linked to Node.js without any protection. We configure the benchmark to store the database in-memory and report averages of 20 runs.
The geometric mean of ERIM's runtime overhead across all tests is 4.3%. The overhead is below 6.7% on all tests except those with more than 10^6 switches per second. This suggests that ERIM can be used for isolating native libraries from managed language runtimes with low overheads up to a switching rate of the order of 10^6 per second. Beyond that the overhead is noticeable. Table 5, columns 1-3, shows the relative overheads for tests with switching rates of at least 100,000/s. The numbers are consistent with an average overhead between 0.07% and 0.41% for 100,000 switches/s. The actual switch cost measured from direct CPU cycle counts varies from 73 to 260 cycles across all tests. It exceeds 100 cycles only when the switch rate is less than 2,000 times/s. We verified that these higher costs are due to i-cache misses: at low rates, the call gate instructions are evicted between switches.
Comparison to isolation with bounds checks (SFI) We also use the above experiment to compare ERIM to isolation based on bounds checks. For this, we re-compile SQLite to native code indirectly through WebAssembly, a new memory-safe, low-level language designed specifically for writing safe native plugins for JavaScript environments [20] . The WebAssembly to native code translation inserts bounds checks prior to indirect memory accesses.
Across all tests, the geometric mean of the relative overhead of WebAssembly-based isolation on run time is 133.5%. The overheads range from 66.4% to 280.6%, which is significantly higher than ERIM's overheads. However, WebAssembly's overheads do not increase with the switching rate since it does not interpose on switches. Instead, it imposes a continuous overhead while executing SQLite. Other work using bounds checks has found similarly high overheads on performance-intensive benchmarks [37, 20] .
Protecting sensitive data in CPI/CPS
Next, we use ERIM to isolate the safe region of CPI and CPS [29] in a separate domain. We modified CPI/CPS's LLVM compiler pass to emit additional ERIM switches, which bracket any code that modifies the safe region. The switch code, as well as the instructions modifying the safe region, are inlined with the application code. In addition, we implemented simple optimizations to safely reduce the frequency of ERIM domain switches. For instance, the original implementation initializes sensitive code pointers to zero during initialization. Rather than generate a domain switch for each pointer initialization, we generate loops of pointer set operations that are bracketed by a single pair of ERIM domain switches. This is safe because the loop relies on direct jumps and the code to set a pointer is inlined in the loop's body. In all, we modified 300 LoC in LLVM's CPI/CPS pass.
Like the original CPI/CPS paper [29], we compare the overhead of the original and our ERIM-protected CPI/CPS system on the SPEC CPU 2006 benchmarks, relative to a baseline compiled with Clang without any protection. The original CPI/CPS system is configured to use ASLR for isolation, the default technique used on x86-64 in the original paper. ASLR imposes almost no switching overhead, but also provides no security [38, 24, 14, 18, 36]. Figure 2 shows the average runtime overhead of 10 runs of the original CPI/CPS (lines "CPI/CPS") and CPI/CPS over ERIM (lines "ERIM-CPI/CPS"). All overheads are normalized to the unprotected SPEC benchmark. We could not obtain results for 400.perlbench for CPI and 453.povray for both CPS and CPI. 400.perlbench does not halt when compiled with CPI and SPEC's result verification for 453.povray fails due to unexpected output. These problems exist in the code generated by the Levee CPI/CPS prototype with CPI/CPS enabled (-fcps/-fcpi), not our modifications.
Table 6: Domain switch rates of selected SPEC CPU benchmarks and overheads for ERIM-CPI without binary inspection, relative to the original CPI with ASLR.
CPI: The geometric means of the overheads (relative to no protection) of the original CPI and ERIM-CPI across all benchmarks are 4.7% and 5.3%, respectively. The relative overheads of ERIM-CPI are low on all individual benchmarks except gcc, omnetpp, and xalancbmk.
To understand this better, we examined switching rates across benchmarks. Table 6 shows the switching rates for benchmarks that require more than 100,000 switches/s. From the table, we see that the high overheads on gcc, omnetpp and xalancbmk are due to extremely high switching rates on these three benchmarks (between 1.6 × 10^7 and 8.9 × 10^7 per second). Further profiling indicated that the reason for the high switch rate is tight loops with pointer updates (each pointer update incurs a switch). An optimization pass could hoist the domain switches out of the loops safely using only direct control flow instructions and enforcing store instructions to be bound to the application memory, but we have not implemented this yet. Table 6 also shows the overhead of ERIM-CPI excluding binary inspection, relative to the original CPI over ASLR (not relative to an unprotected baseline as in Figure 2). This relative overhead is exactly the cost of ERIM's switching. Depending on the benchmark, it varies from 0.03% to 0.16% for 100,000 switches per second or, equivalently, 7.8 to 41.6 cycles per switch. These results indicate that ERIM can support inlined reference monitors with switching rates of up to 10^6 times a second with low overhead. Beyond this rate, the overhead becomes noticeable.
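The equivalence between the per-100,000-switch overhead and the per-switch cycle count can be checked with a short calculation. The sketch below uses a hypothetical function name and assumes the 2.6GHz clock rate of the evaluation machines.

```c
/* Converts an overhead expressed as "percent per 100,000 switches per
 * second" into the average number of cycles spent per switch. */
double cycles_per_switch(double overhead_percent_per_100k, double clock_hz)
{
    double cycles_lost_per_sec = overhead_percent_per_100k / 100.0 * clock_hz;
    return cycles_lost_per_sec / 100000.0;
}
```

At 2.6GHz, overheads of 0.03% and 0.16% correspond to 7.8 and 41.6 cycles per switch, as stated above.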
CPS: The results for CPS are similar to those for CPI, but the overheads are generally lower. Relative to SPEC without protection, the geometric means of the overheads of the original CPS and ERIM-CPS are 1.1% and 2.4%, respectively. ERIM-CPS' overhead relative to the original CPS is within 2.5% on all benchmarks except perlbench, omnetpp and xalancbmk, where it ranges up to 17.9%.
Conclusion
We conclude the paper with a brief summary of ERIM and how it compares to other memory isolation techniques. Relying on the recent Intel MPK ISA extension and simple binary inspection, ERIM provides hardware-enforced isolation with an overhead of less than 1% for every 100,000 switches/s between components on current CPUs. It imposes no overhead on execution within a component.
Existing hardware isolation techniques rely on either kernel- or hypervisor-mediation for switching protection domains and incur much higher switching costs; the state-of-the-art lwCs impose approximately 10% overhead for every 100,000 switches/s. Other techniques based on access bounds-checks such as SFI or memory-safe languages provide isolation without interposing on domain switches, but impose overheads of several tens of percent on the normal execution of untrusted code, even with mainstream hardware support for bounds checking (e.g., MPX). Additionally, they require control-flow integrity to provide strong security. Yet other software techniques like ASLR impose negligible runtime overhead but offer very limited defense against strong user-space adversaries.
ERIM's comparative advantage prominently stands out on applications that switch very rapidly and spend a nontrivial fraction of time in untrusted code. We have demonstrated ERIM's efficacy on three such applications: isolating session keys in a web server, isolating a reference monitor's private state, and isolating managed runtimes from native libraries. In all cases, ERIM provides strong isolation with overheads significantly lower than those of existing techniques, without necessarily requiring compiler support or OS modifications.
