
    CATTmew: Defeating Software-only Physical Kernel Isolation

    All state-of-the-art rowhammer attacks can break MMU-enforced inter-domain isolation because the physical memory owned by each domain is adjacent to that of the others. To mitigate these attacks, physical domain isolation, introduced by CATT, physically separates the domains by dividing physical memory into multiple partitions and keeping each partition occupied by only one domain. CATT implemented physical kernel isolation as the first generic and practical software-only defense to protect the kernel from being rowhammered, as the kernel is one of the most appealing targets. In this paper, we develop a novel exploit that can effectively defeat physical kernel isolation and gain both root and kernel privileges. Our exploit works without exhausting the page cache or the system memory, and without relying on any information about the virtual-to-physical address mapping. The exploit is motivated by our key observation that modern OSes have double-owned kernel buffers (e.g., video buffers and SCSI Generic buffers) owned concurrently by the kernel and user domains. The existence of such buffers invalidates the physical kernel isolation and makes rowhammer-based attacks possible again. Existing rowhammer attacks that achieve root/kernel privilege escalation are conspicuous: they exhaust the page cache or even the whole system memory. Instead, we propose a new technique, named memory ambush, which places the hammerable double-owned kernel buffers physically adjacent to the target objects (e.g., page tables) using only a small amount of memory. As a result, our exploit is stealthier and has a smaller memory footprint. We also replace the inefficient rowhammer algorithm, which blindly picks addresses to hammer, with an efficient one that selects suitable addresses based on an existing timing channel.
    Comment: Preprint of the work accepted at the IEEE Transactions on Dependable and Secure Computing 201
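
    The address selection above relies on a classic DRAM timing channel: two uncached addresses that fall in the same bank but different rows conflict in the row buffer, so alternating accesses to them are measurably slower. Below is a minimal user-space sketch of that probe, not the authors' exploit code; the buffer size, scan stride, and cycle threshold are illustrative assumptions that a real attack would calibrate per machine.

```c
/* Minimal sketch of a row-buffer-conflict timing probe, the kind of
 * timing channel the paper's address-selection algorithm builds on.
 * x86-64 with GCC/Clang assumed; all constants are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

#define ROUNDS 1000

/* Average latency of paired uncached accesses: if a and b map to
 * different rows in the same DRAM bank, alternating accesses force
 * row buffer conflicts and the average rises measurably. */
static uint64_t time_pair(volatile uint8_t *a, volatile uint8_t *b)
{
    uint64_t total = 0;
    unsigned aux;
    for (int i = 0; i < ROUNDS; i++) {
        _mm_clflush((const void *)a);
        _mm_clflush((const void *)b);
        _mm_mfence();
        uint64_t start = __rdtscp(&aux);
        (void)*a;
        (void)*b;
        _mm_mfence();
        total += __rdtscp(&aux) - start;
    }
    return total / ROUNDS;
}

int main(void)
{
    uint8_t *buf = malloc(1 << 22);        /* 4 MiB probe buffer */
    if (!buf) return 1;
    volatile uint8_t *base = buf;
    const uint64_t threshold = 300;        /* assumed cycles; calibrate */
    for (size_t off = 1 << 12; off < (1 << 22); off += 1 << 12) {
        uint64_t t = time_pair(base, base + off);
        if (t > threshold)
            printf("offset 0x%zx: avg %llu cycles -> likely same bank, "
                   "different row\n", off, (unsigned long long)t);
    }
    free(buf);
    return 0;
}
```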

    Jigsaw: Scalable software-defined caches

    Shared last-level caches, widely used in chip multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency.
    United States. Defense Advanced Research Projects Agency (DARPA PERFECT contract HR0011-13-2-0005); Quanta Computer (Firm)

    Jigsaw: Scalable Software-Defined Caches (Extended Version)

    Shared last-level caches, widely used in chip multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency. We present Jigsaw, a technique that jointly addresses the scalability and interference problems of shared caches. Hardware lets software define shares, collections of cache bank partitions that act as virtual caches, and map data to shares. Shares give software full control over both data placement and capacity allocation. Jigsaw implements efficient hardware support for share management, monitoring, and adaptation. We propose novel resource-management algorithms and use them to develop a system-level runtime that leverages Jigsaw to both maximize cache utilization and place data close to where it is used. We evaluate Jigsaw using extensive simulations of 16- and 64-core tiled CMPs. Jigsaw improves performance by up to 2.2x (18% avg) over a conventional shared cache, and significantly outperforms state-of-the-art NUCA and partitioning techniques.
    This work was supported in part by DARPA PERFECT contract HR0011-13-2-0005 and Quanta Computer.
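
    As a rough illustration of the share abstraction, the toy software model below tracks per-bank capacity allocations for each share and a page-to-share mapping. All structure and function names here are our own illustrative assumptions; Jigsaw implements this bookkeeping in hardware alongside virtual memory, which a short C model cannot capture.

```c
/* Toy model of Jigsaw-style "shares": each share is a set of per-bank
 * capacity allocations acting as a virtual cache, and pages are mapped
 * to shares. Names and scale are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define NBANKS  16
#define NSHARES 4

struct share {
    uint32_t bank_alloc[NBANKS];   /* cache lines allocated per bank */
    uint32_t total;                /* total capacity of this virtual cache */
};

static struct share shares[NSHARES];
static int page_share[1 << 12];    /* page number -> share id (toy scale) */

/* Grant a share capacity in a specific bank: software controls both
 * placement (which bank) and allocation (how much). */
static void share_grant(int s, int bank, uint32_t lines)
{
    shares[s].bank_alloc[bank] += lines;
    shares[s].total += lines;
}

/* Map a page to a share so its accesses are steered to that share's
 * bank partitions. */
static void map_page(uint32_t vpn, int s) { page_share[vpn] = s; }

int main(void)
{
    /* Keep a latency-critical workload's data in its two local banks. */
    share_grant(0, 0, 4096);
    share_grant(0, 1, 4096);
    map_page(42, 0);
    printf("page 42 -> share %d (%u lines)\n",
           page_share[42], shares[0].total);
    return 0;
}
```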

    Rethinking the I/O Stack for Persistent Memory

    Modern operating systems have been designed around the hypotheses that (a) memory is both byte-addressable and volatile and (b) storage is block-addressable and persistent. The arrival of new Persistent Memory (PM) technologies has made these assumptions obsolete. Despite much recent work in this space, the need to consistently share PM data across multiple applications remains an urgent, unsolved problem, and simple yet powerful operating system support remains elusive. In this dissertation, we propose and build the Region System, a high-performance operating system stack for PM that implements usable consistency and persistence for application data. The region system supports consistently mapping and sharing data resident in PM across user application address spaces. It introduces a novel IPI-based PMSYNC operation, which ensures atomic persistence of mapped pages across multiple address spaces, allowing applications to consume PM through the well-understood and much-desired memory-like model with an easy-to-use interface. Next, we propose a metadata structure without any redundant metadata to reduce CPU cache flushes; this high-performance design minimizes the expensive PM ordering and durability operations by embracing a minimalistic approach to metadata construction and management. To strengthen the case for the region system, we analyze different types of applications to identify their dependence on memory-mapped data usage, and propose the user-level libraries LIBPM-R and LIBPMEMOBJ-R to support shared persistent containers. Together with the region system, these libraries demonstrate a comprehensive end-to-end software stack for consuming PM devices.
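
    The durability half of a PMSYNC-style operation ultimately reduces to writing dirty cache lines back to PM and fencing. The sketch below shows only that user-visible building block, assuming a CLWB-capable x86-64 CPU (compile with -mclwb); the kernel-side IPI coordination that makes persistence atomic across multiple address spaces is beyond a user-space example, and pm_persist_range is our own name, not the system's API.

```c
/* User-level sketch of the flush-and-fence building block behind a
 * PMSYNC-style persist: write back every cache line of a mapped range,
 * then fence so the stores are durable before returning. On ordinary
 * DRAM this runs but provides no durability. */
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

#define CACHELINE 64

static void pm_persist_range(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);    /* write line back, keep it cached */
    _mm_sfence();               /* order flushes before completion */
}

int main(void)
{
    static char record[256];    /* stand-in for a PM-mapped page */
    record[0] = 'x';            /* mutate the mapped data */
    pm_persist_range(record, sizeof record);
    return 0;
}
```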

    Understanding and Optimizing Flash-based Key-value Systems in Data Centers

    Flash-based key-value systems are widely deployed in today's data centers to provide high-speed data processing services. These systems deploy flash-friendly data structures, such as slabs and Log-Structured Merge (LSM) trees, on flash-based Solid State Drives (SSDs) and provide efficient solutions for caching and storage scenarios. As data centers rapidly evolve, plenty of challenges and opportunities for future optimization appear. In this dissertation, we focus on understanding and optimizing flash-based key-value systems from the perspectives of workloads, software, and hardware as data centers evolve. We first propose an online compression scheme, called SlimCache, that exploits the unique characteristics of key-value workloads to virtually enlarge the cache space, increase the hit ratio, and improve cache performance. Furthermore, to appropriately configure increasingly complex modern key-value data systems, which can have more than 50 parameters in addition to hardware and system settings, we quantitatively study and compare five multi-objective optimization methods for auto-tuning the performance of an LSM-tree based key-value store in terms of throughput, 99th-percentile tail latency, convergence time, real-time system throughput, and the iteration process. Last but not least, we conduct an in-depth, comprehensive measurement study of flash-optimized key-value stores on recently emerging 3D XPoint SSDs. We reveal several unexpected bottlenecks in current key-value store designs and present three exemplary case studies to showcase the efficacy of removing these bottlenecks with simple methods on 3D XPoint SSDs. Our experimental results show that our proposed solutions significantly outperform traditional methods. Our study also provides system implications for auto-tuning key-value systems on flash-based SSDs and optimizing them on revolutionary 3D XPoint based SSDs.
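
    The core admission idea behind an online compression scheme like SlimCache can be sketched simply: compress a value on insertion and keep the compressed copy only when it actually shrinks, so the same cache budget holds more objects. The example below uses zlib (link with -lz) purely for illustration; the paper's codec, batching, and hot/cold data policies differ, and cache_encode is a hypothetical helper, not the system's API.

```c
/* Admission-time sketch of online cache compression: store the
 * compressed form of a value only if it saves space. zlib is an
 * illustrative choice, not the paper's codec. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Return a malloc'd buffer holding either the compressed value (if it
 * shrank) or a raw copy; *out_len and *compressed report which. */
static unsigned char *cache_encode(const unsigned char *val, size_t len,
                                   size_t *out_len, int *compressed)
{
    uLongf clen = compressBound(len);
    unsigned char *cbuf = malloc(clen);
    if (cbuf && compress(cbuf, &clen, val, len) == Z_OK && clen < len) {
        *out_len = clen;
        *compressed = 1;
        return cbuf;             /* compressed: cache holds more objects */
    }
    free(cbuf);
    unsigned char *copy = malloc(len);
    if (copy) memcpy(copy, val, len);
    *out_len = len;
    *compressed = 0;
    return copy;                 /* incompressible: store as-is */
}

int main(void)
{
    const char *v = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
    size_t n; int c;
    unsigned char *enc = cache_encode((const unsigned char *)v,
                                      strlen(v) + 1, &n, &c);
    printf("stored %zu bytes (%s)\n", n, c ? "compressed" : "raw");
    free(enc);
    return 0;
}
```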

    Large Pages May Be Harmful on NUMA Systems

    Application virtual address space is divided into pages, each requiring a virtual-to-physical translation in the page table and the TLB. Large working sets, common among modern applications, necessitate many translations, which increases memory consumption and leads to high TLB and page fault rates. To address this problem, recent hardware introduced support for large pages. Large pages require fewer translations to cover the same address space, so the associated problems diminish. We discover, however, that on systems with non-uniform memory access times (NUMA), large pages may fail to deliver benefits or even cause performance degradation. On NUMA systems the memory is spread across several physical nodes; using large pages may contribute to imbalance in the distribution of memory-controller requests and reduced locality of accesses, both of which can drive up memory latencies. Our analysis concluded that: (a) on NUMA systems with large pages it is more crucial than ever to use memory placement algorithms that balance the load across memory controllers and maintain locality; (b) there are cases when NUMA-aware memory placement is not sufficient for optimal performance, and the only resort is to split the offending large pages. To address these challenges, we extend an existing NUMA page placement algorithm with support for large pages. We demonstrate that it recovers the performance lost due to the use of large pages and makes their benefits accessible to applications.
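
    On Linux, the "split the offending large pages" remedy maps naturally onto madvise(MADV_NOHUGEPAGE), which tells the kernel to stop backing a range with transparent huge pages so it can be placed at base-page granularity. The snippet below illustrates only that mechanism; the detection policy that decides which regions to split is the substance of the paper and is not shown here.

```c
/* Minimal Linux illustration of splitting a region away from
 * transparent huge pages when it causes NUMA imbalance. The decision
 * of *which* region to split is assumed to come from a profiler. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define REGION (256UL << 20)   /* 256 MiB */

int main(void)
{
    void *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Suppose profiling found this region skews memory-controller load
     * when backed by 2 MiB pages: ask for base pages instead. */
    if (madvise(p, REGION, MADV_NOHUGEPAGE) != 0)
        perror("madvise(MADV_NOHUGEPAGE)");
    else
        puts("region will be backed by base pages");

    munmap(p, REGION);
    return 0;
}
```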

    Macro-modeling and energy efficiency studies of file management in embedded systems with flash memory

    Technological advancements in computer hardware and software have made embedded systems highly affordable and widely used. Consumers have ever-increasing demands for powerful embedded devices such as cell phones, PDAs, and media players. Such complex and feature-rich embedded devices are strictly limited by their battery lifetime. Embedded systems typically are diskless and use flash for secondary storage due to their low-power, persistent-storage, and small-form-factor needs. The energy efficiency of a processor and flash in an embedded system heavily depends on the choice of file system in use. To address this problem, it is necessary to provide system developers with energy profiles of file system activities and energy-efficient file systems. In the first part of the thesis, a macro-model for the CRAMFS file system is established that characterizes the processor and flash energy consumption due to file system calls. This macro-model allows a system developer to estimate the energy consumed by CRAMFS without an actual power measurement setup. The second part of the thesis examines the effects of using non-volatile memory as a write-behind buffer to improve the energy efficiency of JFFS2. Experimental results show that a 4 KB write-behind buffer reduces energy consumption by up to 2-3 times for consecutive small writes. In addition, the write-behind buffer conserves flash space, since transient data may never be written to flash.
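
    The write-behind buffer's effect is easy to see in a toy model: small consecutive writes accumulate in a 4 KB NVRAM-resident buffer and reach flash only when it fills, so many tiny flash programs collapse into one, and transient data overwritten before a flush never touches flash at all. In the sketch below, flash_program is a hypothetical stand-in for the real flash driver.

```c
/* Toy model of a 4 KB write-behind buffer in front of flash: small
 * writes accumulate in NVRAM and are programmed to flash in bulk. */
#include <stdio.h>
#include <string.h>

#define BUF_SZ 4096

static unsigned char wbuf[BUF_SZ];  /* models the NVRAM buffer */
static size_t wbuf_used;

/* Hypothetical stand-in for the flash driver's page program. */
static void flash_program(const unsigned char *data, size_t len)
{
    (void)data;
    printf("flash write: %zu bytes\n", len);
}

static void wb_flush(void)
{
    if (wbuf_used) {
        flash_program(wbuf, wbuf_used);
        wbuf_used = 0;
    }
}

/* Append a small write; flash is touched only when the buffer fills. */
static void wb_write(const void *data, size_t len)
{
    const unsigned char *p = data;
    while (len) {
        size_t n = BUF_SZ - wbuf_used;
        if (n > len) n = len;
        memcpy(wbuf + wbuf_used, p, n);
        wbuf_used += n; p += n; len -= n;
        if (wbuf_used == BUF_SZ)
            wb_flush();
    }
}

int main(void)
{
    for (int i = 0; i < 200; i++)                 /* 200 x 32 B writes */
        wb_write("0123456789abcdef0123456789abcdef", 32);
    wb_flush();                                   /* only 2 flash writes */
    return 0;
}
```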