All the state-of-the-art rowhammer attacks can break the MMU-enforced inter-domain isolation because the physical memory owned by each domain is adjacent to each other. To mitigate these attacks, CATT [6], as the first generic and practical technique, physically separates each domain: it divides the physical memory into multiple partitions and keeps each partition occupied by only one domain.
INTRODUCTION
A memory management unit (MMU) is an essential component of the CPU. It plays a critical role in enforcing isolation in the operating system (OS). For example, the kernel relies on the MMU to mediate all memory accesses from user processes in order to prevent them from modifying the kernel or accessing its sensitive information. Any unauthorized access will be stopped with a hardware exception. Without the strict kernel-user isolation, the whole system can be easily compromised by a malicious user process, such as a browser [13]. The MMU is also used widely in other forms of isolation, such as intra-process isolation (e.g., sandbox) and inter-virtual machine (VM) isolation. Therefore, the MMU and its key data structure, page tables, are critical to the security of the whole system. However, the recent rowhammer attacks have posed a serious challenge to the status quo.
Table 1: A comparison of rowhammer attacks. The memory ambush technique allows an unprivileged process to gain both root and kernel privileges with low memory.
√ ⋆ means that the exploit does not break the kernel and user domain isolation enforced by CATT; instead, it gains the root privilege within the user domain.
Among software-based defenses, CATT is the first generic mitigation method against rowhammer attacks [6]. Based on the observation that rowhammer attacks essentially require attacker-controlled memory to be physically adjacent to the privileged memory (e.g., page tables), CATT aims at physically separating the memory of different domains. Specifically, it divides the physical memory into multiple partitions and further ensures that partitions are separated by at least one unused DRAM row and that each partition is owned by only a single domain. For example, the heap in the user space will be allocated from the user partition, and page tables are allocated from the kernel partition. By doing so, CATT can confine bit flips induced by one domain to its own partition and thus prevent rowhammer attacks from affecting other domains. Although CATT currently only enforces the domain separation between the kernel and the user spaces, its approach can theoretically be applied to multiple domains (e.g., regular and privileged processes, multiple VMs, the hypervisor and guests), thus mitigating all the previous rowhammer attacks [2, 11, 28, 29, 31, 34, 35].
Our contributions: the principle of CATT is sound, but its assumptions and fundamental invariants do not always hold. Even though CATT can theoretically support multiple domains, the number of domains is still limited. Consequently, multiple processes have to share the same partition. Although bit flips are confined to a single domain, an attacker can still target other privileged processes in the same domain and gain the root privilege [11] (e.g., memory waylay in Table 1). However, such techniques cannot break the kernel and user domain isolation to directly gain the kernel privilege. In certain secure environments, such as containers [33] or highly secure systems [7] with features such as kernel.modules_disabled, kexec_load_disabled and namespace, it is not trivial for a malicious root user to gain the kernel privilege.
Even worse, the concept of memory ownership in modern operating systems is very dynamic. In particular, a block of memory can be allocated for the kernel (i.e., from the kernel partition) but later mapped into the user space, allowing the user process to access/hammer the memory. This kind of change in the ownership renders CATT's static partition assignment ineffective, leaving the kernel still hammerable. By analyzing the Linux kernel source code, we have identified a number of such cases. For brevity, we call such vulnerable memory the double-owned memory. In this paper, we focus on evaluating the security of CATT in face of the double-owned memory.
Although the double-owned memory potentially allows a malicious process to hammer the kernel, it is still challenging to stealthily launch the rowhammer attack. First, the double-owned memory is often associated with device drivers. This limits the operations that can be performed on that memory. For example, some device drivers limit the amount of the memory buffer that can be mapped to the user space. Our attack thus needs to take these constraints into consideration.
Second, to position the attacker-controlled memory next to security-critical objects, existing rowhammer attacks, shown in Table 1, require exhausting either the page cache or the system memory [2, 11, 14, 28, 31, 34, 35], as summarized by Daniel et al. [11]. Such an anomaly could easily be detected by an attentive system administrator. To address that, we propose a novel technique called memory ambush that is able to stealthily achieve the expected position with a small amount of memory (e.g., 88MB).
Last, existing single-sided rowhammer attacks [31] cannot be simply adopted by us because they require costly random address selections but we are limited by the choice of the double-owned memory. Meanwhile, double-sided rowhammer attacks require the now-inaccessible address mapping information [31, 34] . To address this challenge, we leverage the timing channel [26] to selectively pick addresses that are most likely in the same DRAM bank. This technique avoids the need to access virtual-to-physical address mapping without losing efficiency.
To demonstrate the feasibility of our technique, we have implemented a proof-of-concept attack against CATT-based rowhammer defense on the Linux operating system. Our exploit uses the video buffers as the double-owned memory and targets page tables, the critical data structure for the MMU-based isolation. Video buffers are owned and managed by the kernel and thus are allocated from the kernel partition in CATT. However, video buffers can also be mapped in the user space, essentially allowing a malicious process to hammer the kernel memory. We then use our memory ambush technique to stealthily place the accessible video buffers around page-table pages, by exploiting the intrinsic design of Linux's buddy allocator and mmap syscall. After positioning the video buffers next to the page-table pages, we hammer the buffers, which might flip certain bits in the page-table pages. We repeat the process until a page-table page is found to be writable. This essentially allows the attacker to read and write all system memory (i.e., kernel privilege). We also demonstrate how to gain the root privilege by changing the uid of the current process to 0. Our exploit can be launched by any user process without exhausting the page cache or the system memory, and without relying on the virtual-to-physical address mapping information. Our experiments show that the exploit only requires about 88MB of memory with a success rate of 6%. The average time of all successful exploits is about 4 minutes. In the best case, the attack succeeds within 1 minute to gain the kernel privilege. To defend against our exploit, we also discuss possible improvements to CATT.
The main contributions of this paper are threefold:
• We identify the inherent weakness in CATT [6], the only existing generic software-based defense against rowhammer attacks, and empirically demonstrate a working attack against it.
• We present a novel rowhammer attack that allows an unprivileged user process to gain the root and kernel privileges. We also discuss possible countermeasures against our attack.
• Our exploit combines a new memory ambush technique and a timing channel to make itself stealthy and efficient without relying on the virtual-to-physical address mapping information.
The rest of the paper is structured as follows. In Section 2, we briefly introduce the background information. In Section 3, we present the general idea of our attack in detail. Section 4 demonstrates the attack and evaluates it. In Section 5, Section 6 and Section 7, we propose possible improvements to CATT against our attack, discuss possible limitations, and summarize the related work, respectively. We conclude this paper in Section 8.
BACKGROUND
In this section, we first describe the memory organization as it is critical to understand rowhammer attacks. We then summarize the existing rowhammer techniques.
Memory Organization
Main memory of most modern computers uses the dynamic random-access memory technology, or DRAM. Data in DRAM require periodic refresh (i.e., rewrite) to keep their value. Memory modules are usually produced in the form of dual inline memory module, or DIMM, where both sides of the memory module have separate electrical contacts for memory chips. Each memory module is directly connected to the CPU's memory controller through one of the two channels. Logically, each memory module consists of two ranks, corresponding to its two sides, and each rank consists of multiple banks. A bank is further structured as arrays of memory cells with rows and columns. For example, our test machine has a Sandy Bridge-based Core i7 CPU with two 4GB DDR3 DIMM modules. Each module has two ranks (2GB each), and each rank is vertically partitioned into 8 banks, each of which in turn consists of 32K rows of memory (8KB each). Fig. 1 shows the structure of a bank. Note that a typical page-table page in the x86-64 architecture is 4KB.
Every cell of a bank stores one bit of data whose value depends on whether the cell is electrically charged or not. A row is the basic unit for memory access. Each access to a bank "opens" a row by transferring the data in all the cells of this row to the bank's row buffer. This operation discharges all the cells of the row. To prevent data loss, the row buffer is then copied back into the cells, thus recharging the cells. Consecutive access to the same row will be fulfilled by the row buffer, while accessing another row replaces the content of the row buffer.
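The row-buffer behavior described above can be illustrated with a small sketch. The following Python model is only a toy under simplifying assumptions (a single bank, no refresh, no timing); the class name Bank and the helper count_activations are hypothetical and introduced purely for illustration. It counts how often each row is actually opened, showing that repeated accesses to one row are served by the row buffer, whereas alternating between two rows of the same bank re-opens a row on every access.

```python
class Bank:
    """Toy model of one DRAM bank with a single row buffer."""

    def __init__(self):
        self.open_row = None   # row currently held in the row buffer
        self.activations = {}  # row index -> number of times the row was opened

    def access(self, row):
        """Access `row`; return True on a row-buffer hit, False otherwise."""
        if row == self.open_row:
            return True
        # A different row is requested: it replaces the row-buffer content.
        self.open_row = row
        self.activations[row] = self.activations.get(row, 0) + 1
        return False


def count_activations(pattern, repeats):
    """Replay an access pattern `repeats` times and report row openings."""
    bank = Bank()
    for _ in range(repeats):
        for row in pattern:
            bank.access(row)
    return bank.activations


same_row = count_activations([5], 1000)        # one row, accessed repeatedly
alternating = count_activations([5, 9], 1000)  # two rows, accessed in turn
print(same_row)     # {5: 1}
print(alternating)  # {5: 1000, 9: 1000}
```

In this model only the alternating pattern repeatedly opens a row, which is the property that the hammering methods discussed later rely on.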
Rowhammer Overview
Rowhammer bugs: Kim et al. [20] discovered that current DRAMs are vulnerable to disturbance errors induced by charge leakage. In particular, their experiments have shown that frequently opening the same row (i.e., hammering the row) can cause sufficient disturbance to a neighboring row and flip its bits without even accessing the neighboring row. Because the row buffer acts as a cache, another row in the same bank is accessed to replace the row buffer after each hammering so that the next hammering will re-open the hammered row, leading to bit flips of its neighboring row.
Rowhammer methods: generally speaking, there are three methods to hammer a vulnerable DRAM, classified by their memory access patterns:
Double-sided hammering: in this method, the two rows immediately adjacent to the victim row are hammered simultaneously, as shown in Fig. 1. These two adjacent rows are called the aggressor rows. They are repeatedly accessed in turn, leading to quick charges and discharges of these two rows. If the memory module is vulnerable to the rowhammer bug, this may cause some cells in the victim row to leak charge and lose their data.
Because the aggressor rows and the victim row must lie in the same bank, this method requires at least a partial knowledge of the virtual-to-physical address mapping and the mapping between physical addresses and the DRAM layout. The Linux kernel originally allowed any user process to access its address mapping through the pagemap interface, but has limited the access to root processes since version 4.0 [32]. Another option is to use the huge page, which allows the process to allocate a large block of continuous physical memory (2MB or 1GB). It is very likely to find two candidate aggressor rows in a huge page. However, huge pages may not be available if the kernel has the feature disabled or the memory is severely fragmented. Meanwhile, the mapping between physical addresses and the DRAM layout (i.e., DIMMs, ranks, banks, and rows) can be either obtained from the processor's architectural manual or through reverse-engineering [27, 35].
Single-sided hammering: double-sided hammering requires knowledge of the virtual-to-physical address mapping that is sometimes difficult to obtain. To remove this requirement, Seaborn et al. [31] proposed the single-sided hammering. The main idea is to randomly pick multiple addresses and just hammer them. The probability that an aggressor row is adjacent to the victim row is determined by the total number of rows. It can be significantly improved if many aggressor rows are hammered at the same time. However, without precise positioning of aggressor rows, this method usually induces fewer bit flips than double-sided hammering.
One-location hammering: unlike the previous methods where multiple rows are hammered, one-location hammering [11] just quickly hammers one aggressor row. It relies on other, unrelated processes (e.g., a browser) to clear the row buffer. This is essentially a less aggressive form of single-sided hammering.
Key requirements: there are three key requirements for exploiting rowhammer bugs:
First, modern CPUs employ multiple levels of caches to effectively reduce the memory access time. If data is present in the CPU cache, accessing it will be fulfilled by the cache and never reach the physical memory. As such, the CPU cache must be flushed in order to hammer aggressor rows. Even though CPU caches are mostly transparent to the user programs, they can be explicitly invalidated by instructions such as clflush on x86. In addition, conflicts in the cache can evict data from the cache since CPU caches are much smaller than the main memory. Therefore, to evict aggressor rows from the cache, we can use a crafted access pattern to cause cache conflicts with the aggressor rows. Subsequent accesses to them will be fetched directly from the memory.
Second, the row buffer must be cleared between consecutive hammering of an aggressor row. Both double-sided and single-sided hammering explicitly perform alternate access to two or more rows within the same bank to clear the row buffer. One-location hammering itself accesses only one row but relies on concurrently running processes to clear the row buffer.
Third, for rowhammer attacks to succeed, the attacker-controlled aggressor rows must be positioned adjacent to the victim row and the victim row must contain the sensitive data (e.g., page tables) we target. Usually, the attacker does not have direct control of the (physical) memory allocation. To address that, a probabilistic approach is usually adopted on the x86 architectures. Specifically, the attacker allocates a large number of potential aggressor rows and induces the kernel to create many copies of the target objects. This strategy is very similar to the heap spray attack in that by spraying the memory with potential aggressor and victim rows, the probability of the correct positioning is high. Page tables are often targeted as the victim row because they control the system memory mapping and it is relatively easy to create many page-table pages (by allocating and using a large block of memory). An attacker-controlled page table essentially allows the attacker to read/write/execute all the memory in the system.
CATT
CATT is a software-based defense against generic rowhammer attacks. Since the kernel is the most appealing target, CATT focuses on protecting the kernel from user processes (the kernel-user isolation). As previously mentioned, rowhammer attacks must correctly position the aggressor rows and the victim row and ensure that the victim row contains sensitive data. CATT aims at breaking this requirement by physically separating the kernel and user memory. Specifically, it partitions each bank into a kernel part and a user part. These two parts are separated by at least one unused row. When physical memory is allocated, CATT allocates it from either the kernel part or the user part according to the intended use of the memory (specifically the flags, such as GFP_USER, passed to the kernel page allocator). For example, the user heap and stacks are allocated from the user part and page tables are allocated from the kernel part. By separating the physical memory of the kernel and the user space, CATT guarantees that bit flips caused by rowhammer attacks are confined strictly to the memory partition that induced them, thus protecting the kernel from rowhammer attacks by malicious user processes.
This design, although sound in principle, has two potential weaknesses: the lack of isolation in the user domain and the static view of the memory ownership. First, even though CATT can be extended to support multiple domains, user processes can significantly outnumber the domains. The sharing of a user domain is inevitable. This would allow one user process to attack another, especially one with the root privilege. Nevertheless, this problem can be alleviated by further partitioning user processes into finer domains, for example, one for normal processes and one for setuid programs like passwd. Second, CATT allocates physical pages according to whether the memory is intended for the kernel or the user space. This is incompatible with modern operating systems, where the ownership of the memory is rather dynamic. For instance, some memory can be used by a device driver to send data to or receive data from the device and then be mapped in the user space to avoid extra copying of the data. Such memory cannot be allocated from the user partition; otherwise, a crafted rowhammer exploit may badly affect the operation of the device. In the worst case, the exploit can gain the kernel privilege when the memory contains control data for the DMA operations (e.g., in the network packet transmission mechanism, the ring buffer stores transmit descriptors that point to the packets to transmit). On the other hand, allocating from the kernel partition would allow a malicious user process to hammer the kernel memory as soon as the memory is mapped to the user space. This creates a dilemma for CATT that cannot be solved under its current design.
ATTACK OVERVIEW
Our primary goal is to evaluate the security of the CATT-protected kernel under our rowhammer attack that leverages the double-owned buffers. In this section, we first present the threat model and assumptions, then identify the main challenges and introduce new techniques to overcome them. In the next section, we will present a working proof-of-concept attack that employs the techniques.
Threat Model and Assumptions
Our threat model differs slightly from that of other rowhammer attacks [4, 13, 28, 29, 31, 35]. Specifically:
• The kernel is considered to be secure against software-only attacks. In other words, our attack does not rely on any software vulnerabilities. Even though this assumption does not generally hold in practice, it lets us focus on the study of the rowhammer defense and attack.
• The kernel is protected by CATT [6] . That is, the kernel and the user memory are allocated from physically separated partitions, and bit flips caused by rowhammer attacks are confined to their related partition.
• Unlike other rowhammer attacks, the attacker has no knowledge about the kernel memory locations that are bit-flippable, since CATT protects the kernel partition from being scanned.
• The attacker controls an unprivileged user process that has no special privileges such as accessing pagemap. That is, the attacker cannot obtain the virtual-to-physical address mapping.
• The installed memory modules are susceptible to rowhammer-induced bit flips. Pessl et al. [27] report that many mainstream DRAM manufacturers have vulnerable DRAM modules, including both DDR3 and DDR4 memory.
Key Steps and Main Challenges
CATT employs a static kernel/user memory partition to protect the kernel from rowhammer attacks by a malicious user process. This implies that a physical page can only be owned by a single domain. However, modern OS kernels often have double-owned memory that is shared between the kernel and user processes, such as memory-mapped files and video buffers. If the double-owned buffer is allocated from the kernel partition, it would allow a malicious user process to hammer the kernel.
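The consequence of double-owned memory for the partitioning scheme can be made concrete with a small sketch. The Python model below is a deliberately simplified illustration, not CATT's implementation: it assumes a single bank, one unused guard row, and that hammering a row disturbs only its two immediate neighbors. The function names and the row numbers are hypothetical.

```python
def neighbours(row, total_rows):
    """Rows physically adjacent to `row` inside one bank."""
    return [r for r in (row - 1, row + 1) if 0 <= r < total_rows]


def reachable_kernel_rows(total_rows, guard_row, double_owned=()):
    """Kernel-only rows adjacent to a row that a user process can hammer.

    Rows below `guard_row` form the user partition, `guard_row` itself is
    left unused, and rows above it form the kernel partition. `double_owned`
    lists kernel-partition rows that are also mapped into user space.
    """
    kernel = set(range(guard_row + 1, total_rows))
    hammerable = set(range(guard_row)) | set(double_owned)
    victims = set()
    for row in hammerable:
        victims.update(n for n in neighbours(row, total_rows) if n in kernel)
    # Flips inside the attacker-accessible buffer itself are not of interest.
    return victims - set(double_owned)


# With a strict partition, no kernel row is adjacent to a user row.
print(reachable_kernel_rows(total_rows=16, guard_row=7))
# One double-owned row (row 10) exposes its kernel neighbours.
print(reachable_kernel_rows(total_rows=16, guard_row=7, double_owned=(10,)))
```

Under these assumptions the first call yields an empty set, while the second yields rows 9 and 11, which is the situation the attack exploits.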
To successfully launch a rowhammer attack, the following five steps are necessary: (1) identify the double-owned buffers that can be hammered; (2) stealthily position the hammerable buffers and victim kernel objects next to each other; (3) efficiently hammer the buffer without the virtual-to-physical address mapping information; (4) verify whether "useful" bit flips have occurred and, if not, go back to step 2 or 3 to restart hammering based on the strategy; (5) gain the root/kernel privileges, say, by changing uid to 0 for the current process. The last two steps have been well studied [31]. In the following, we describe the challenges of the first three steps.
Identify hammerable buffers: not all double-owned buffers are useful for our exploit. A hammerable buffer should satisfy the following requirements: the buffer should be allocated from the kernel partition but can be accessed by unprivileged user processes. In addition, its size should be reasonably large (e.g., on the order of KB or MB). If it is too small, the number of bit flips could be considerably low. By imposing these constraints, our attack is broadly applicable and potentially stealthy.
Stealthily position hammerable buffers and target objects: for rowhammer attacks to succeed, hammerable buffers and target objects must be physically adjacent to each other in the DRAM layout. Previous rowhammer attacks rely on techniques such as page deduplication [29] or exhausting either the page cache [11] or the system memory [13, 34] for this purpose. However, page deduplication is usually disabled for security reasons [34], and the other techniques are relatively easy to detect due to the anomaly in the page cache usage or the memory usage. As such, we need to design a new strategy that can position the hammerable buffers next to the target objects without exhausting the memory.
Efficiently hammering: we cannot use double-sided hammering because the unprivileged user process no longer has access to the virtual-to-physical address mapping information or huge pages that are required to determine whether a pair of candidate addresses is separated by one row. On the other hand, the random hammering strategy of single-sided hammering could be inefficient. As such, we need to propose a new efficient hammer strategy without relying on the virtual-to-physical address mapping or huge pages.
New Techniques
To address the aforementioned challenges, we present our main techniques as follows.
Identification of Hammerable Buffers.
Double-owned buffers are frequently used by the kernel for efficient data exchange between device drivers and user processes without copying the data when crossing the kernel boundary. Such capability is usually implemented in the Linux kernel by calling the mmap function in the device drivers. A quick search of the mmap function under the drivers directory returns 568 matching files in a recent Linux kernel. These matched files belong to drivers such as infiniband, graphics drivers, Ethernet drivers, media devices, Video4Linux drivers, as well as logic devices such as RDMA (remote direct memory access) and the Lustre file system. All these drivers are possible candidates for our attack. Because the mmapped buffers are used by device drivers, CATT accordingly allocates them from the kernel partition. This potentially allows a malicious user process to hammer the kernel after mmapping these buffers. Arguably, CATT could allocate all these buffers from the user partition. However, this modification will expose the hardware devices and their drivers to rowhammer attacks by user processes. Specifically, past experience shows that device drivers are very complex and vulnerable, and it is foreseeable that such situations will be inevitably exacerbated under the new modifications. In addition, hardware devices most likely assume certain integrity of the data passed in from the drivers. Consequently, they would misbehave under rowhammer attacks. This creates a dilemma for CATT: whether to expose the main kernel or the device drivers to the rowhammer attacks. In this paper, we decide not to change the design decision made by CATT. It would be an interesting future work to evaluate the security of the other option. Certainly, not all the mmapped buffers of those device drivers are useful to our attack. We plan to design a program analysis system to help identify the usable double-owned buffers.
In our proof-of-concept attack, we have selected the video buffers in the Video4Linux subsystem for hammering. These buffers can be mmapped into the user process for real-time video capturing and thus become double-owned. They are allocated in a relatively large size (e.g., 18.75MB) and the buffers are virtually continuous but physically discontinuous, i.e., they can be mapped to any allocated physical pages.
Memory Ambush.
Our attack uses double-owned video buffers for hammering and targets the page tables. Therefore, we need to position video buffers and page tables next to each other. To address that, we propose the memory ambush technique to target page tables, which leverages the inherent design of the Linux kernel's mmap and buddy physical page allocator. We briefly introduce them first.
Mmap and page-table page allocation: mmap is a POSIX API that allows a process to map files or devices into user-accessible memory. The caller of mmap can specify the destination address, the source file descriptor, the protection, and a number of flags. For example, the MAP_FIXED flag requests the kernel to place the mapping at a specified address. This feature could be used to control the allocation of page-table pages. When a mapping is created, the kernel needs to populate the corresponding page tables and map the file/device (or the anonymous pages if the mapping is not backed by a file) at the selected addresses. However, this is usually done lazily, i.e., the page-table pages are not allocated or populated until the mapped addresses are accessed by the user process. Based on the above observations, we could make an mmap-based primitive function, which takes a number as input and allocates page-table pages accordingly [31].
Linux buddy allocator: like most OS kernels, the Linux kernel uses layers of memory allocators to fulfill different needs of the kernel. In particular, the physical pages are allocated using the buddy allocator [9]. As shown in Fig. 2(A), the buddy allocator splits memory into equal halves called blocks. Each block initially contains a power-of-two number of pages that are physically continuous. Upon an allocation request, the kernel searches the blocks that best match the request. If the blocks do not have enough continuous pages for the request, the kernel splits a larger block in half and returns one half to the request. This process can happen recursively.
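The split-and-merge behavior of a buddy allocator can be sketched with a toy model. The Python code below is a simplified illustration and not the Linux implementation: it tracks free blocks by their first page frame number, always splits the lowest-addressed candidate, and ignores zones, migration types, and per-CPU lists. The class name and the chosen orders (a 128-page block split to serve a 64-page request) are illustrative assumptions.

```python
class BuddyAllocator:
    """Toy buddy allocator over page frame numbers (not the Linux code)."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free_lists[k] holds the first frame of every free 2**k-page block.
        self.free_lists = {k: set() for k in range(max_order + 1)}
        self.free_lists[max_order].add(0)

    def alloc(self, order):
        """Return the first frame of a free 2**order-page block."""
        k = order
        while k <= self.max_order and not self.free_lists[k]:
            k += 1                       # look for a larger block to split
        if k > self.max_order:
            raise MemoryError("no free block large enough")
        start = min(self.free_lists[k])
        self.free_lists[k].remove(start)
        while k > order:                 # split; the upper half stays free
            k -= 1
            self.free_lists[k].add(start + (1 << k))
        return start

    def free(self, start, order):
        """Free a block and merge it with its buddy while the buddy is free."""
        while order < self.max_order:
            buddy = start ^ (1 << order)
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].add(start)


allocator = BuddyAllocator(max_order=7)   # one free 128-page (512KB) block
frame = allocator.alloc(6)                # request 64 pages (256KB)
after_alloc = set(allocator.free_lists[6])
allocator.free(frame, 6)                  # buddy is free, so the halves merge
after_free = set(allocator.free_lists[7])
print(frame, after_alloc, after_free)
```

In this model the 64-page request is served from frame 0, its buddy at frame 64 remains free, and freeing the allocation restores the single 128-page block.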
For example, to allocate 256KB of memory, the buddy allocator will first search the blocks that contain 64 pages (a single page is 4KB). If none is found, the kernel tries to split a large block of 512KB (128 pages) in half to fulfill the request. When the requested pages are freed, the kernel tries to merge them with other free pages if possible. To continue the previous example, if the allocated 64-page memory is freed, the kernel checks whether its buddy (i.e., the other 64-page memory in the 512KB block) is free. If so, the kernel merges them to recreate the split large block.
Memory Ambush: as mentioned above, the main purpose of memory ambush is to position the double-owned buffers next to the page tables by leveraging the Linux features. It is relatively easy to create multiple page-table pages in Linux, for example, by repetitively invoking the mmap-based primitive function to map the same user file into different parts of the user address space. When a page-table page is created, the kernel will ask the buddy allocator to allocate a 4KB block (x86-64 also supports large page sizes, such as 2MB and 1GB).
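To see why mapping the same file at different parts of the address space yields many page-table pages, it helps to look at the address arithmetic. The sketch below assumes standard 4-level x86-64 paging with 4KB pages, where one last-level page-table page holds 512 entries and therefore translates 2MB of virtual address space. The base address and the helper names are hypothetical, and the actual mmap calls with MAP_FIXED (and the subsequent touch of each mapping) are intentionally omitted.

```python
PAGE_SIZE = 4096                              # x86-64 base page size
PTES_PER_TABLE = 512                          # 8-byte entries per 4KB table
SPAN_PER_TABLE = PAGE_SIZE * PTES_PER_TABLE   # 2MB covered by one table page


def plan_fixed_mappings(count, base=0x10_0000_0000):
    """Return `count` virtual addresses, each in a different 2MB region.

    `base` is an arbitrary, hypothetical 2MB-aligned address; a real
    exploit would pass each address to mmap with MAP_FIXED and then
    touch the mapping so that the kernel populates the page table.
    """
    if base % SPAN_PER_TABLE:
        raise ValueError("base must be 2MB-aligned")
    return [base + i * SPAN_PER_TABLE for i in range(count)]


def page_table_index(vaddr):
    """Identify which last-level page-table page translates `vaddr`."""
    return vaddr // SPAN_PER_TABLE


addrs = plan_fixed_mappings(4)
tables = {page_table_index(a) for a in addrs}
print(len(tables))   # one distinct page-table page per planned mapping
```

Because each planned address falls into its own 2MB region, touching a single page at each address forces the kernel to allocate a separate 4KB page-table page per mapping while consuming very little data memory.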
Note that two objects (i.e., the buffers and the page tables) that are physically consecutive may not be adjacent to each other in the DRAM, because the mapping between the physical address and the DRAM layout is not linear [6]. Some bits of a physical address are used to select the DIMM, rank, bank and row. For example, our test machine has two DIMMs and the 6th bit selects between the two DIMMs. Consecutive physical addresses such as 0x1000000 and 0x0FFFFFF are located on different DIMMs and thus are not next to each other.
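The example of the two consecutive addresses can be checked with a few lines of bit arithmetic. The sketch below is an illustration under one stated assumption: "the 6th bit" is read as the zero-based bit index 6 of the physical address. Real address mappings typically combine several bits, so the single-bit selector dimm_of is a hypothetical simplification.

```python
DIMM_SELECT_BIT = 6   # assumption: "the 6th bit" means zero-based bit 6


def dimm_of(phys_addr, bit=DIMM_SELECT_BIT):
    """Return 0 or 1 depending on the DIMM-select bit of the address."""
    return (phys_addr >> bit) & 1


low, high = 0x0FFFFFF, 0x1000000
print(high - low)                    # 1: the addresses are consecutive
print(dimm_of(low), dimm_of(high))   # 1 0: they select different DIMMs
```

The two addresses differ in every one of their low 25 bits, so the result is the same whichever nearby bit is taken as the DIMM selector.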
To address this challenge, we need to find certain blocks that can occupy two adjacent rows in the DRAM layout (one row for double-owned buffers and the other one for page tables). The size of a target block (i.e., TargetBlockSize) should be twice the row size (shown in Equation 1). The row size (i.e., RowsSizePerRowIndex) is determined by the number of DIMMs, the number of banks and the size of a single row in one bank (shown in Equations 2 and 3). The specific equations are listed as follows:
Fig. 2: The kernel memory is divided into blocks of different sizes and some blocks have been allocated (e.g., network ring buffer). In (B), the mmap-based primitive function is called repeatedly to fill the remaining small blocks with page tables. In (C), a target block is split and allocated for double-owned buffers and page tables. (D) shows the state of the kernel partition after the previous step is repeated until a specified threshold is reached.
The memory ambush technique is illustrated in Fig. 2. Specifically, the blocks smaller than the target blocks are small blocks, and the blocks larger than the target blocks are large blocks. On our test machine, the row size (i.e., RowsSizePerRowIndex) is 256KB, and the size of the target block (i.e., TargetBlockSize) is 512KB.
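The sizes quoted for the test machine can be reproduced from the memory organization described earlier (two DIMMs, two ranks per DIMM, eight banks per rank, 8KB per row). The sketch below is a reconstruction under assumptions rather than a restatement of the original equations: in particular, counting the banks as ranks times banks per rank is our reading of "the number of banks", and the function names are hypothetical.

```python
KB = 1024
PAGE_SIZE = 4 * KB


def rows_size_per_row_index(dimms, ranks_per_dimm, banks_per_rank, row_bytes):
    """Bytes that share one row index across all banks of all DIMMs.

    Assumption: the total bank count is dimms * ranks_per_dimm *
    banks_per_rank, following the test machine described in the text.
    """
    return dimms * ranks_per_dimm * banks_per_rank * row_bytes


def target_block_size(row_index_bytes):
    """A target block spans two adjacent row indices."""
    return 2 * row_index_bytes


row_index_bytes = rows_size_per_row_index(2, 2, 8, 8 * KB)
target_bytes = target_block_size(row_index_bytes)
# Buddy order of the target block (a block of order k holds 2**k pages).
target_order = (target_bytes // PAGE_SIZE).bit_length() - 1

print(row_index_bytes // KB, target_bytes // KB, target_order)
```

With these inputs the computation gives 256KB per row index and a 512KB target block, i.e., a 128-page block of buddy order 7, matching the values stated for the test machine.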
At the beginning of our technique, the memory of the kernel partition could be fragmented, especially for small blocks (Fig. 2(A)). Next, we exhaust the small blocks by repeatedly invoking the mmap-based primitive function to allocate page-table pages (Fig. 2(B)). We check the exhaustion of small blocks by reading the file /proc/buddyinfo. Note that any process, privileged or not, can read this file to obtain data about the available and allocated blocks.
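Checking whether the small blocks are exhausted amounts to parsing the per-order free-block counts that /proc/buddyinfo reports. The following sketch parses text in that format and sums the free pages held in blocks below a given order. The sample text is made up for illustration, the function names are hypothetical, and the choice of order 7 as the target order assumes a 512KB target block with 4KB pages; which zone backs the kernel partition is not specified here and is left to the caller.

```python
def parse_buddyinfo(text):
    """Map (node, zone) to the list of free-block counts per order."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: Node <n>, zone <name> <count order 0> <count order 1> ...
        if len(fields) < 5 or fields[0] != "Node":
            continue
        node = int(fields[1].rstrip(","))
        zones[(node, fields[3])] = [int(x) for x in fields[4:]]
    return zones


def free_small_pages(counts, target_order):
    """Free pages held in blocks strictly smaller than the target order."""
    return sum(n << order for order, n in enumerate(counts[:target_order]))


# Made-up sample in the /proc/buddyinfo layout (orders 0 through 10).
sample = "Node 0, zone   Normal     12      7      3      0      0      0      0      5      2      1      0"
zones = parse_buddyinfo(sample)
remaining = free_small_pages(zones[(0, "Normal")], target_order=7)
print(remaining)   # free pages still available in small blocks
```

On a live system the text would instead come from reading /proc/buddyinfo, and the page-table allocation would be repeated until the remaining count reaches zero.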
Based on the obtained data, we can then position the double-owned buffers next to the page tables, since they are expected to share the same target block (Fig. 2(C)). Note that the double-owned buffers usually have a limited/fixed size. If the size happens to be a multiple of RowsSizePerRowIndex, the two objects share the target block equally. If the buffers cannot fill one row, or there is a remainder after filling one or multiple rows, the page tables ought to occupy the remaining empty pages of the split target block. We repeat this step until a specified memory threshold is reached (Fig. 2(D)). By doing so, we can fill the empty pages of the split target block and position more page-table pages next to the buffer pages, increasing the probability that the buffers are in the aggressor rows while the page tables fill the victim rows.
In our experiments, we keep the initial Linux system running a typical workload (i.e., a browser, a mail client, and a music player). Under this workload, the depleted small blocks amount to 56MB, taking up a small part of the system memory. In addition, filling the small blocks with page tables might also increase the chance of positioning the two objects next to each other. Certainly, the access to /proc/buddyinfo can be removed or protected without causing problems for most programs. We argue that this is similar to previous systems that use pagemap to obtain the virtual-to-physical address mapping: the access to this file is currently enabled by default and thus can be misused by anyone.
Efficient Hammering. Since Linux kernel 4.0, the access to pagemap has been protected from unprivileged processes. Without the information about the virtual-to-physical address mapping, an intuitive solution is to randomly select a pair of virtual addresses to hammer, also known as single-sided rowhammer. Overall, this approach is less effective than double-sided hammering (e.g., the two addresses may happen to lie in the same row). In our system, we resort to a timing channel [26] to improve the efficiency of single-sided hammering.
Specifically, this timing channel is created by the row-buffer conflicts within the same DRAM bank. As we previously mentioned, each bank has a row buffer that caches the last accessed row. If a pair of virtual addresses reside in two different rows of the same bank and they are accessed alternately, the row buffer will be repeatedly cleared and reloaded. This causes the so-called row-buffer conflicts. Clearly, row-buffer conflicts lead to a higher latency in accessing the two addresses than when they lie either within the same row or in different banks. As such, we can reliably distinguish whether two addresses are in different rows within the same bank. When we perform the hammering, we can target such pairs of addresses, significantly improving the efficiency of single-sided hammering.
A PROOF-OF-CONCEPT ATTACK
In this section, we present in detail a proof-of-concept attack that exploits double-owned buffers to break the kernel-user separation enforced by CATT. At a high level, our attack uses the double-owned video buffers of a Video4Linux driver as the aggressor rows and targets the page table. It relies on our memory ambush technique to stealthily position the aggressor and victim rows adjacent to each other. Furthermore, the attack performs the improved single-sided hammering. We also briefly describe the steps to verify whether the attack has succeeded and to gain the root and kernel privileges if so.
Double-owned Video Buffers
Video4Linux (V4L) [22] is a collection of device drivers that provide the API for programs to capture real-time video on Linux systems. The current specification of V4L is version 2 (V4L2) [23]. A V4L device is usually presented as a char device under the directory /dev of the file system, such as /dev/video0. The device is accessible by unprivileged processes by default.
We examine the memory allocation performed by the V4L2 device and find that the video buffer is allocated by the kernel and mapped into the user space. Specifically, by issuing the VIDIOC_REQBUFS ioctl command, an unprivileged process can request the V4L driver to allocate some device memory as the video buffers that can be mapped into the user space later. To handle this request, the driver calls the vmalloc_user function, passing the flags (GFP_KERNEL | __GFP_ZERO) in the arguments. This function allocates a block of zeroed, virtually continuous memory in the kernel space (at this point the memory is allocated but has not yet been mmapped). The allocated memory is thus virtually continuous but physically discontinuous. Note that CATT will allocate this block of memory from the kernel partition 2 . When the request is complete, the unprivileged process can then map the allocated memory into its own address space with read and write permissions. At this point, the video buffers have become double-owned buffers, allowing the unprivileged user process to hammer the kernel. The maximum size of the video buffer for a V4L device is limited to 18.75MB, a sufficient size for our attack.
In the following, we briefly summarize the five steps for an unprivileged process to obtain read and write access to the video buffers:
• Open the video device: the V4L2 video capture device is a char device (as opposed to a block device) located in the /dev directory. Linux can support up to 64 V4L2 devices, from /dev/video0 to /dev/video63, with a major number of 81 and a minor number from 0 to 63. We select /dev/video0 as our device.
• Configure the video device: different video capture devices support different capabilities, such as cropping limits, the pixel aspect of images, and the stream data format. We apply the default settings to this device.
• Request the video buffer: after the configuration, we can issue the VIDIOC_REQBUFS command to ask the driver to allocate the video buffer. The command provides three ways for a user process to access the allocated kernel memory, i.e., memory mapped, user pointer, or DMABUF based I/O [23] . Moreover, it allows the process to request up to 32 buffers and the size of each buffer is 600KB (i.e., 18.75MB in total). For our attack, we specify the memory mapped I/O and use the maximum buffer size.
• Map the video buffer: the VIDIOC_QUERYBUF command returns detailed information about the allocated video buffers (e.g., the size and address of each buffer). Based on this information, we can map all these buffers into the user space.
• Close the video device: after our rowhammer exploit completes, we should unmap the video buffers from the user space and close the video device.
Memory Ambush
In this technique, we need to create sufficient page-table pages within the given memory threshold. Specifically, we create a temporary file tmp using tmpfs, which is stored in memory only. We then map this file repeatedly in order to create numerous virtual memory areas (VMAs) mapped to the file. In each call to mmap, all pages in the mapped area are accessed in order to populate the page tables. The size of the tmp file should not be too large, otherwise we risk being detected; the size should not be too small because Linux limits the number of VMAs that can be created by mmap (i.e., 65536). We need a sufficiently large number of page-table pages to be targeted for the attack to succeed.

Algorithm 1 (lines 21-32):
21:   mmap(map_each_base, file_size, file)
22:   read_access(map_each_base, file_size)
23:   map_each_base ← map_each_base + file_size
24:   pt_size_sum ← pt_size_each_while + pt_size_sum
25:   if idx == 0 then
26:     add_marker_to_each_file_page_header()
27:   end if
28:   if pt_size_sum == small_blocks_size then
29:     video_buffers_allocation()
30:   end if
31:   idx ← idx + 1
32: end while
The size of the tmp file is calculated in lines 1 to 12 of Algorithm 1. Specifically, the file size is initialized to 2MB in line 1, which can be mapped by a single PTE (page table entry) page 3 . In lines 4 to 6, we calculate how many VMAs we need to create if we were to use a specified memory threshold for the attack. More specifically, line 4 calculates the size of the page tables that can be created. Note that threshold_mem_size is the total memory specified for the attack. It is a user-configurable parameter. In our experiment, threshold_mem_size can be as low as 88MB. Line 5 calculates how much memory can be mapped by these page-table pages; line 6 calculates the number of VMAs to be created. If the number is less than the limit, we have found the right file size. Otherwise, we double the file size and try again.
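The file-size search described above can be sketched as follows. Since lines 1 to 12 of Algorithm 1 are only described in prose here, the exact arithmetic is an assumption: the page-table budget is taken to be threshold_mem_size minus the file size, each 4KB PTE page is taken to map 2MB, and the function name choose_file_size is hypothetical.

```python
MB = 1024 * 1024
PAGE_SIZE = 4096             # assumption: 4KB pages
PTE_PAGE_COVERAGE = 2 * MB   # one PTE page (512 entries x 4KB) maps 2MB
VMA_LIMIT = 65536            # VMA limit as quoted in the text


def choose_file_size(threshold_mem_size, vma_limit=VMA_LIMIT):
    """Return (file_size, vma_num) for a given memory threshold.

    Starts from a 2MB file and doubles it until the number of mappings
    needed to populate the page-table budget falls below the VMA limit.
    """
    file_size = 2 * MB
    while file_size < threshold_mem_size:
        # Memory left for page-table pages once the tmp file is paid for.
        pt_size = threshold_mem_size - file_size
        # Virtual memory those page-table pages can map.
        mapped_size = (pt_size // PAGE_SIZE) * PTE_PAGE_COVERAGE
        # Number of mappings of the tmp file needed to cover that range.
        vma_num = mapped_size // file_size
        if vma_num < vma_limit:
            return file_size, vma_num
        file_size *= 2
    raise ValueError("threshold too small to hold the tmp file")


print(choose_file_size(88 * MB))
```

Under these assumptions, an 88MB threshold keeps the initial 2MB file, while a larger threshold such as 512MB forces one doubling to 4MB before the VMA count fits under the limit.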
Based on vma_num, the tmp file is mmapped and accessed repeatedly to indirectly populate many PTE pages (lines 20-31). In the first iteration, we place a special marker in every page of the tmp file. Since the file is mapped at all the locations, we can look for this marker to check whether the bit flips caused by rowhammer have changed the page table, i.e., whether the attack succeeds or not. When all the small blocks are exhausted, we call the video_buffers_allocation function, which requests 32 buffers (the size of each buffer is 600KB). As such, 32 large blocks of 1024KB will be allocated and shared by the video buffers and subsequent PTE pages. Given that the size of all rows per row index is 256KB, each video buffer crosses three consecutive row indices, filling the first two rows and leaving the last row partially occupied. The last row is then filled by the PTE pages, neighboring one row of the video buffers.
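The row arithmetic above can be checked with a small sketch. It assumes, purely for illustration, that a buffer starts at the beginning of its 1024KB block and occupies it contiguously; the function name block_layout is hypothetical.

```python
KB = 1024
ROW_INDEX_SIZE = 256 * KB   # size of all rows sharing one row index (from the text)
BLOCK_SIZE = 1024 * KB      # one large block
BUFFER_SIZE = 600 * KB      # one video buffer
NUM_BUFFERS = 32


def block_layout(buffer_size=BUFFER_SIZE, block_size=BLOCK_SIZE,
                 row_size=ROW_INDEX_SIZE):
    """Per row index: bytes taken by the buffer and bytes left for PTE pages."""
    layout = []
    for row in range(block_size // row_size):
        start = row * row_size
        # Part of the buffer that falls into this row index (0 if none).
        used = min(max(buffer_size - start, 0), row_size)
        layout.append({"row": row,
                       "buffer_bytes": used,
                       "free_bytes": row_size - used})
    return layout


for entry in block_layout():
    print(entry["row"], entry["buffer_bytes"] // KB, entry["free_bytes"] // KB)
print(NUM_BUFFERS * BUFFER_SIZE / (1024 * KB), "MB of video buffers in total")
```

With these numbers the buffer fills row indices 0 and 1, takes 88KB of row index 2, and leaves 168KB there plus all of row index 3 for PTE pages.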
Efficient Single-sided Hammering
Since we do not have access to the virtual-to-physical address mapping information, we rely on single-sided hammering but improve its efficiency with the timing channel based on the row buffer. As mentioned before, a pair of virtual addresses in different rows of the same bank has a longer access latency than a pair in the same row or in different banks. Such virtual addresses are better candidates for hammering. For brevity, we call such a pair of addresses DRSB (different rows within same bank).
In our experiments, most pairs of virtual addresses in DRSB have an access latency that is no less than 130ns. When a pair of virtual addresses from the video buffers is randomly selected, we will time its latency. If it is no less than 130ns, then we use the pair for hammering. Otherwise, the pair will be discarded. We repeat the step until the attack succeeds.
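The selection rule above amounts to a simple threshold filter, sketched below. The latency source is deliberately left abstract: measure_latency_ns is a hypothetical callable supplied by the caller (here a stand-in lambda), and the function names are illustrative only.

```python
import random

DRSB_THRESHOLD_NS = 130  # threshold reported in the text


def is_drsb_candidate(latency_ns, threshold_ns=DRSB_THRESHOLD_NS):
    """A pair is kept when its access latency is no less than the threshold."""
    return latency_ns >= threshold_ns


def pick_candidate_pair(addresses, measure_latency_ns, rng=random,
                        max_tries=10000):
    """Draw random address pairs until one passes the latency filter.

    Returns the accepted pair, or None if max_tries pairs were discarded.
    """
    for _ in range(max_tries):
        first, second = rng.sample(addresses, 2)
        if is_drsb_candidate(measure_latency_ns(first, second)):
            return first, second
    return None


# Stand-in latency model for demonstration: pairs whose page indices have
# different parity are "slow". This is not a real DRAM model.
pages = list(range(0, 64 * 4096, 4096))
fake_latency = lambda a, b: 150 if (a // 4096) % 2 != (b // 4096) % 2 else 90
print(pick_candidate_pair(pages, fake_latency, rng=random.Random(1)))
```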
Privilege Escalation
In our attack, we randomly select a pair of addresses in DRSB for hammering. We then check whether the attack succeeds and select a new pair if not. Our attack targets the page tables. We aim at hammering the video buffers to flip bits in adjacent page tables. If the bit flips happen to change the mapped physical address to a page table, we can gain full control over all the system memory. This is feasible because we (lightly) spray the kernel memory with page tables. This process is described in detail in Algorithm 2.
As previously mentioned, we embed a special marker at the beginning of every mapped page. We can check for this marker to tell whether the hammering has caused the address to be remapped. Specifically, after each round of hammering, we read every page-aligned virtual address mapped to the tmp file and check whether the returned value (i.e., val1) is equal to the marker (line 6). If they are equal, we continue to check the next page. Otherwise, we have found a virtual page V_a that points to a physical page outside of the tmp file due to the bit flips. Next, we need to check whether the page V_a itself is a writable page table (lines 10-18). To this end, we pretend that V_a is a writable page table and tentatively modify one of its entries. We then read all the markers again. If another virtual page V_b has been remapped, page V_a is an attacker-controlled page table, and we can use V_b to access the maliciously mapped memory. Now that we have read-write access to any system memory, we have essentially gained full control over the system (i.e., kernel privilege).
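The marker scan can be illustrated on an ordinary in-memory buffer. This is only a simulation of the check itself: the 64-bit marker value, the little-endian layout, and the function name find_remapped_pages are assumptions for illustration.

```python
import struct

PAGE_SIZE = 4096
MARKER = 0x5A5A5A5AA5A5A5A5  # hypothetical 64-bit marker value


def find_remapped_pages(view, marker=MARKER, page_size=PAGE_SIZE):
    """Return indices of pages whose first 8 bytes no longer hold the marker."""
    remapped = []
    for offset in range(0, len(view) - page_size + 1, page_size):
        (value,) = struct.unpack_from("<Q", view, offset)
        if value != marker:
            remapped.append(offset // page_size)
    return remapped


# Simulated mapping of four pages, each starting with the marker.
buf = bytearray(4 * PAGE_SIZE)
for page in range(4):
    struct.pack_into("<Q", buf, page * PAGE_SIZE, MARKER)
print(find_remapped_pages(buf))      # every page still carries the marker

# Simulate page 2 now showing different content.
struct.pack_into("<Q", buf, 2 * PAGE_SIZE, 0)
print(find_remapped_pages(buf))
```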
We also try to change the uid of the current user process to 0 to gain the root privilege. Without access to pagemap, we have to scan all the available physical memory to locate the current process's credential structure, struct cred, which stores the critical uid field. To make this search fast and precise, we construct a distinct pattern in the cred structure and search each page for this pattern. Specifically, the cred structure contains four user ids (e.g., uid and suid) and four group ids (e.g., gid and sgid) stored sequentially. We first set the other three user ids and group ids to be the same as uid using syscalls such as seteuid. We can then use these ids as a pattern to search for our cred structure. Specifically, in a loop, we point a page table entry in page V_a at every available physical page in turn and scan the newly mapped page to check whether it contains the pattern. Once the pattern is located, we overwrite uid with 0. This essentially gives us the root privilege.

Note that after each change to the address translation in V_a, we need to ensure that the CPU's TLB is reloaded; otherwise the CPU will continue using the old address translation. However, a user process does not have the privilege to flush the TLB. To address this problem, we instead flush the CPU cache with the clflush instruction to ensure that the change to the page table is committed to memory, and then migrate the attacking process to another CPU core. When the process is reloaded on that core, the TLB is naturally reloaded from the page table and the new translation comes into effect.

Algorithm 2 (excerpt):
 4:   ▷ map_virt_mem_base is the memory-mapped base address.
 5:   ptr1 ← map_virt_mem_base + idx1 * page_size
 6:   val1 ← *(map_virt_mem_base + idx1 * page_size)
 7:   ▷ find a page that may be an attacker-controlled page table.
 8:   if val1 ≠ marker then
 9:     ▷ save the second entry of the page table.
10:     old_pte ← ptr1[1]
11:     ▷ set the physical page #0 readable and writable.
12:     new_pte ← 0x27
13:     idx2 ← 1
14:     while idx2 < map_virt_mem_size / page_size do
15:       ▷ pick the second virtual page for each mapping.
16:       ptr2 ← map_virt_mem_base + idx2 * page_size
17:       val2 ← *(map_virt_mem_base + idx2 * page_size)
18:       ▷ find out the target page table.
19:       if val2 ≠ marker and idx2 ≠ idx1 then
20:         privilege_escalation(ptr1, ptr2)
          ▷ each page-table page has 512 entries.
24:       idx2 ← idx2 + 512
25:     end while
      end if
28:   idx1 ← idx1 + 1
29: end while

Figure 3: (a) The access latency of 921 out of 1000 pairs in DRSB is no less than 130ns. (b) The access latency of 974 out of 1000 pairs in non-DRSB is less than 130ns.
Evaluation
In this section, we first describe how to measure the memory access latency required for efficient hammering, and then evaluate the effectiveness and stealthiness of our attack. All the experiments were conducted on a Dell Latitude E6420 PC with a 2.8GHz Intel Core i7-2640M (dual cores, four threads) and 8GB of DDR3 memory. The operating system is Ubuntu 16.04 LTS for x86-64 with the Linux kernel 4.10.0-generic.

Memory access latency distribution: the technique of efficient hammering is highly dependent on the distribution of the memory access latency. The distribution must clearly separate DRSB from non-DRSB pairs; otherwise it will introduce false positives, significantly reducing the efficiency of our hammering technique. To measure the latency, we randomly select 1,000 pairs of page-aligned virtual addresses that are DRSB and 1,000 pairs that are non-DRSB. We can easily tell whether a pair of addresses is DRSB or not by using the Linux pagemap and the address-to-memory-module mapping on the Intel Sandy Bridge platform. Note that this information is only used to measure the timing channel. The attack itself does not use it directly. For each pair of addresses, we first perform read accesses to them, call clflush to flush the CPU cache lines, and then execute a memory barrier (mfence) to ensure that the flush operation has finished. By doing so, subsequent accesses to the addresses will be served directly from the memory (instead of the CPU cache). We repeat these steps 5,000 times and use the rdtscp instruction to measure the total time used by the loop. The distributions of the average latency for the DRSB and non-DRSB pairs are shown in Fig. 3a and Fig. 3b, respectively. Clearly, most pairs in DRSB (92.1%) have a latency of at least 130ns, whereas most pairs in non-DRSB (97.4%) stay below it. Based on the latency, we can perform efficient single-sided hammering and verify whether the hammering succeeds or not.
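The quality of the 130ns threshold as a classifier follows directly from the counts reported for Fig. 3. The short sketch below only restates that arithmetic; the function name threshold_rates is hypothetical.

```python
def threshold_rates(drsb_at_or_above, drsb_total, non_drsb_below, non_drsb_total):
    """True-positive and false-positive rates of the latency threshold.

    A DRSB pair at or above the threshold is a true positive; a non-DRSB
    pair at or above the threshold is a false positive.
    """
    true_positive_rate = drsb_at_or_above / drsb_total
    false_positive_rate = (non_drsb_total - non_drsb_below) / non_drsb_total
    return true_positive_rate, false_positive_rate


tpr, fpr = threshold_rates(921, 1000, 974, 1000)
print(f"true positive rate: {tpr:.1%}, false positive rate: {fpr:.1%}")
```

In other words, roughly 8 in 100 DRSB pairs are discarded by the filter, and fewer than 3 in 100 non-DRSB pairs are hammered needlessly.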
Memory footprints: as shown in Algorithm 1, we can set the threshold through the parameter threshold_mem_size. The memory threshold refers to the combined size of the in-memory tmp file and the page-table pages. A smaller threshold_mem_size indicates a stealthier attack. Theoretically, the minimum threshold size can be as low as 88MB, i.e., the tmp file size is 2MB while the page-table pages occupy all the free small blocks (i.e., 56MB) and all the free rows neighboring the buffer rows (i.e., 30MB). In our experiment, we set threshold_mem_size to 88MB and launch the exploit for 50 runs.
The results are shown in Table 2. In every run, the memory ambush technique is effective in positioning the video buffers adjacent to the page tables, indicating that our technique requires only a small memory footprint. Note that we verify the adjacency by using a kernel module and the pagemap. The module is developed to walk the created page tables and return the physical addresses of the last-level page tables (PTE). The pagemap provides the physical addresses of the video buffers. By doing so, we can obtain their DRAM layout and thus confirm that some of the video-buffer pages are neighboring the page-table pages within the same bank.
Exploit efficiency: as shown in Table 2, the success rate of flippable bits is much higher than that of exploitable bits (i.e., a successful attack). In 20 out of 50 runs, bits have been flipped in the page tables, and 3 out of these 20 runs have found exploitable bits and thus succeed, indicating that some flipped page tables provide read and write access to other page tables. One round of hammering and verification takes about 1 second. Based on that, we measure the time that each successful run takes. The results are shown in Table 3. This also gives an indication for other test machines. For example, the average time until the first flippable bit occurs on the Sandy Bridge i5-2500 (4GB) is 6 milliseconds [35], meaning that the machine needs 40 milliseconds on average for a successful attack.
We also experiment with the traditional single-sided rowhammer [31] given the same limited memory (88MB). Its success rate is slightly higher (about 8%), since it hammers different rows within the same bank exhaustively. However, its average execution time is up to 72 hours. Our attack is therefore much more efficient: we can complete the attack in roughly 4 minutes, increasing the efficiency by 1080 times compared to the traditional one.
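The figures quoted above can be reproduced with a few lines of arithmetic. The scaling of the 6ms figure by the ratio of flipped runs to successful runs is one reading that matches the reported 40ms; it is an assumption about how the estimate was obtained rather than a statement of the exact method.

```python
runs = 50
flipped_runs = 20       # runs in which page-table bits were flipped
successful_runs = 3     # runs that found exploitable bits

success_rate = successful_runs / runs                        # fraction of all runs
first_flip_ms = 6                                            # i5-2500 figure from [35]
est_success_ms = first_flip_ms * flipped_runs / successful_runs

baseline_minutes = 72 * 60                                   # traditional single-sided hammering
our_minutes = 4
speedup = baseline_minutes / our_minutes

print(f"success rate: {success_rate:.0%}")
print(f"estimated time to success on the i5-2500: {est_success_ms:.0f} ms")
print(f"speedup over traditional single-sided hammering: {speedup:.0f}x")
```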
MITIGATION
As we have demonstrated so far, CATT's static kernel and user partition is ineffective in the face of double-owned memory. Allocating double-owned memory either from the kernel memory or the user memory does not seem to be secure: the former exposes the kernel to rowhammer attacks, while the latter exposes the device drivers and hardware devices to the same attacks. Our attack has demonstrated solidly that the former is not secure. The latter is likely insecure as well given that device drivers are notoriously vulnerable and recent years' research has shown that hardware devices, even the CPU, are not immune from vulnerabilities. Therefore, we need to introduce better memory assignment algorithms beyond the static kernel-user partition.
To design a more effective defense against double-owned memory, we observe that bit flips caused by rowhammer attacks are limited to adjacent rows in the same bank. Therefore, we can isolate the memory for a device driver by allocating it in physically continuous pages and leaving one guard row on each side of the buffer. By doing so, hammering the double-owned buffer will only affect the buffer itself, and the buffer can no longer be used to hammer security-sensitive objects. This wastes two rows per device driver buffer, i.e., 16KB of memory on our test platform. Given that modern computers often have more than 8GB of memory and a computer has a limited number of devices that require double-owned buffers, the memory waste does not seem to be a problem. A drawback is that physically continuous memory is less flexible than virtually continuous, physically discontinuous memory. If memory is too fragmented, it might be difficult to allocate large physically continuous memory. In that case, we may need to slightly restructure the driver to avoid frequent allocation and release of the buffer (defenses like CATT need to change the kernel anyway).
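The cost of the guard-row layout can be sketched with simple arithmetic. The sketch treats memory as a flat sequence of rows and takes the 8KB row size implied by the 16KB figure above; bank interleaving is ignored, and the function name guarded_region is hypothetical.

```python
KB = 1024


def guarded_region(buffer_size, row_size, guard_rows=1):
    """Size of a physically continuous region holding a buffer plus guard rows.

    The buffer is rounded up to whole rows and padded with guard_rows unused
    rows on each side.
    """
    buffer_rows = -(-buffer_size // row_size)   # ceiling division
    total_rows = buffer_rows + 2 * guard_rows
    return {
        "buffer_rows": buffer_rows,
        "buffer_offset": guard_rows * row_size,  # where the buffer starts
        "guard_bytes": 2 * guard_rows * row_size,
        "total_bytes": total_rows * row_size,
    }


region = guarded_region(600 * KB, 8 * KB)
print(region["guard_bytes"] // KB, "KB of guard rows,",
      region["total_bytes"] // KB, "KB reserved in total")
```

For a 600KB buffer the guard rows add 16KB on top of the buffer's own 75 rows, which is negligible next to the buffer itself.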
DISCUSSION
In this section, we discuss possible improvements to our system and potential future work.

Search for other hammerable buffers: in our attack, we have read and write access to the video buffers. We thus can easily check whether the video buffers themselves are vulnerable to rowhammer attacks or not. Unfortunately, the kernel API does not allow us to partially release the video buffers. Otherwise, we could pin the aggressor pages and free the vulnerable pages while creating page tables, which would make our attack more efficient and deterministic. We thus plan to design a program analysis technique to search for suitable double-owned buffers, especially buffers that have this kind of flexibility.

Obtain the virtual-to-physical address mapping: by leveraging the prefetch side channel [12], an adversary can obtain the virtual-to-physical address mapping without pagemap, making it possible again to perform double-sided hammering. However, a recent kernel patch called KAISER [10] (also known as kernel page table isolation) protects against this channel and has been widely applied in recent Linux kernel versions. Note that the memory waylaying technique [11], which relies on the side channel, will no longer be applicable in such Linux kernels.

Make the attack stealthier: like other rowhammer attacks, our attack uses specific instructions and exhibits abnormal memory access patterns that can be detected by static or dynamic analysis tools. Also, our attack has high cache miss rates, which can be observed by monitoring CPU performance counters. For example, MASCAT [18] performs a static code analysis of a target application to detect state-of-the-art DRAM access attacks, including rowhammer attacks. ANVIL [2] uses the hardware performance counters to monitor the miss rate of the last-level CPU cache. Whenever the rate is high enough to conduct a rowhammer attack, ANVIL will be triggered to further analyze the process for malicious behaviors. Further, it can discover this unusual access pattern and use heuristics to identify a potential rowhammer attack.
Such countermeasures can be bypassed by applying both one-location hammering and Intel Software Guard Extensions (SGX) [8]. Although one-location hammering induces fewer bit flips than the other two rowhammer methods, it only keeps opening and closing a single row, which makes it stealthy enough to bypass ANVIL. Intel SGX is a hardware extension in Intel CPUs to securely run trusted code in an untrusted system. We can hide our attack inside an SGX enclave, where the attack code cannot be analyzed by other software because any external access to the enclave is denied. Features like performance counters and debug registers also cannot be used to monitor the enclave activities [8]. In particular, Schwarz et al. [30] have confirmed that performance counters will not record the CPU cache hit or miss data inside the enclave.
RELATED WORK
In this section, we compare our system to the existing rowhammer attacks and discuss the related defenses.
Rowhammer Attacks
We first review how rowhammer attacks meet their different requirements, specifically, how the CPU cache is flushed, how the row buffer is cleared, and how the aggressor and victim rows are placed.

Flush CPU cache: since frequent and direct memory access is a prerequisite for hammering, a simple solution is to use the clflush instruction to explicitly flush the CPU cache [20, 31]. This instruction can flush a cache entry related to a specific virtual address, so that subsequent reads of the address will be served directly from the memory. clflush is included in the instruction set to allow a process to fetch the updated data from the memory instead of the obsolete cached ones. It can be executed by an unprivileged process. It has been proposed to prohibit user processes from executing the instruction as a defense against rowhammer attacks [31]. However, Qiao et al. [28] reported that commonly used x86 instructions such as movnti and movntdqa actually bypass the CPU cache and access the memory directly. Moreover, carefully crafted memory-access patterns [2, 3, 4, 13] can cause cache conflicts and effectively evict the target address from the cache. This approach is particularly useful for scripting environments where cache-related instructions are not directly available.

Clear row buffer: besides flushing the cache, rowhammer attacks also need to clear the row buffer in order to keep "opening" a row. Different rowhammer attacks have achieved this goal with various techniques.
Double-sided hammering performs alternate reads on different rows in the same bank. Therefore, it requires the virtual-to-physical and physical-to-hardware mappings in order to position the aggressor and victim rows. The pagemap provides complete information about the first mapping, but it is no longer accessible to unprivileged processes. Although huge pages on x86 [31] and the DMA buffers on the ARM architecture [34] only give partial information about the virtual-to-physical address mapping, they ensure that two virtually continuous addresses are also physically continuous, and can therefore also be used by rowhammer attacks. For the second mapping, AMD provides the details in their manual, and the mapping for various Intel CPUs has been reverse engineered [27, 35].
In contrast, single-sided [31] and one-location hammering [11] do not need either mapping. However, the virtual addresses they select may not lie in the same bank, in which case the row buffer is not cleared, making them less efficient. Our attack addresses this problem by leveraging the timing channel based on the row buffer.

Place target objects: the last requirement of rowhammer attacks is to manipulate the security domain into placing a security-sensitive object in a vulnerable row. This can be achieved through page-table spraying [13, 31], page deduplication [4, 29], and Phys Feng Shui [34]. However, they all require exhausting the memory in order to place the target page in the vulnerable row. Instead of depleting the system memory, memory waylay [11] exhausts the page cache to influence the physical location of a target page. Our page-table ambush technique can achieve the same effect with a much more constrained amount of memory.
Rowhammer Defenses
Both hardware and software defenses against rowhammer attacks have been proposed. Hardware defenses can be based on the firmware or new hardware designs. For example, computer manufacturers, such as HP [15], Lenovo [21] and Apple [1], propose doubling the DRAM refresh rate, i.e., shortening the refresh interval from 64ms to 32ms. This slightly raises the bar for the attack but has been proven to be ineffective [2]. Intel suggests using Error Correcting Code (ECC) memory to catch and correct single-bit errors on-the-fly, thus alleviating bit flips caused by rowhammer attacks [17]. Typically, ECC can correct single-bit errors and detect double-bit errors (e.g., SECDED). However, ECC cannot prevent multiple-bit errors and is normally only available on server systems. Probabilistic adjacent row activation (PARA) [20] activates/refreshes adjacent rows with a high probability when the aggressor rows are hammered many times. This could be effective but requires changes to the memory controller. For future DRAM architectures, new DDR4 modules [24] and the LPDDR4 specification [19] propose a targeted row refresh (TRR) capability to mitigate rowhammer attacks.

Many software-based defenses have also been proposed. Some defenses aim at preventing attacks from misusing specific system features. For example, researchers and developers have worked to prevent the pagemap [31, 32], page deduplication [25], specific x86 CPU instructions [31] and memory/page-cache exhaustion [11, 13, 34] from being abused by unprivileged attackers. ANVIL [2] is the first system to detect rowhammer behaviors using the Intel hardware performance counters [16]. However, ANVIL incurs a high performance overhead in its worst case and has false positives [6]. Brasser et al. [5] patch an open-source bootloader and disable the vulnerable memory modules. Although this approach effectively eliminates all the rowhammer vulnerabilities for legacy systems, it is not practical when most memory is susceptible to rowhammer, and it is not compatible with Windows. Currently, CATT [6] is a practical and efficient approach to prevent rowhammer attacks by partitioning the physical memory. It introduces a low performance overhead but requires changing the kernel. Moreover, our attack demonstrates that CATT's current design is not secure against double-owned buffers but can be improved as presented in Section 5.
CONCLUSION
In this paper, we presented a novel practical exploit, which can effectively defeat CATT and gain the root and kernel privileges. Our attack does not need to exhaust the page cache or the system memory. In addition, it does not rely on the virtual-to-physical address mapping information. To achieve these unique features, we proposed the memory ambush technique, which leverages inherent Linux memory management features to make our attack stealthy. We improved single-sided hammering by utilizing the timing channel caused by the row buffer. We have implemented a proof-of-concept attack on the Linux platform. The experimental results show that our attack can complete in roughly 1 minute and requires as little as 88MB of memory.
