15 research outputs found

    Redesigning Transaction Processing Systems for Non-Volatile Memory

    Get PDF
    Department of Computer Science and EngineeringTransaction Processing Systems are widely used because they make the user be able to manage their data more efficiently. However, they suffer performance bottleneck due to the redundant I/O for guaranteeing data consistency. In addition to the redundant I/O, slow storage device makes the performance more degraded. Leveraging non-volatile memory is one of the promising solutions the performance bottleneck in Transaction Processing Systems. However, since the I/O granularity of legacy storage devices and non-volatile memory is not equal, traditional Transaction Processing System cannot fully exploit the performance of persistent memory. The goal of this dissertation is to fully exploit non-volatile memory for improving the performance of Transaction Processing Systems. Write amplification between Transaction Processing System is pointed out as a performance bottleneck. As first approach, we redesigned Transaction Processing Systems to minimize the redundant I/O between the Transaction Processing Systems. We present LS-MVBT that integrates recovery information into the main database file to remove temporary files for recovery. The LS-MVBT also employs five optimizations to reduce the write traffics in single fsync() calls. We also exploit the persistent memory to reduce the performance bottleneck from slow storage devices. However, since the traditional recovery method is for slow storage devices, we develop byte-addressable differential logging, user-level heap manager, and transaction-aware persistence to fully exploit the persistent memory. To minimize the redundant I/O for guarantee data consistency, we present the failure-atomic slotted paging with persistent buffer cache. Redesigning indexing structure is the second approach to exploit the non-volatile memory fully. Since the B+-tree is originally designed for block granularity, It generates excessive I/O traffics in persistent memory. To mitigate this traffic, we develop cache line friendly B+-tree which aligns its node size to cache line size. It can minimize the write traffic. Moreover, with hardware transactional memory, it can update its single node atomically without any additional redundant I/O for guaranteeing data consistency. It can also adapt Failure-Atomic Shift and Failure-Atomic In-place Rebalancing to eliminate unnecessary I/O. Furthermore, We improved the persistent memory manager that exploit traditional memory heap structure with free-list instead of segregated lists for small memory allocations to minimize the memory allocation overhead. Our performance evaluation shows that our improved version that consider I/O granularity of non-volatile memory can efficiently reduce the redundant I/O traffic and improve the performance by large of a margin.ope

    Memory Management for Emerging Memory Technologies

    Get PDF
    The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing demanding more performance of the data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory technologies (NVM) are emerging as an answer to the current main memory issues. This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM. The first solution we propose is “Adaptive Replacement and Insertion" (ARI), an adaptive approach to last-level CPU cache management, optimizing the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving miss rate relative to conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from SPEC2006 suite on average by 32.9% while also decreasing misses on average by 4.7%. In a PCM based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling. Our second proposal is “Variable-Timeslice Thread Scheduling" (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large, last-level caches (LLC), the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta, and allowing larger periods of time between interfering threads. Our approach, compared to conventional scheduling, mitigates up to 100% of the performance loss caused by CPU LLC interference. The system throughput is boosted by up to 15%. As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design with Flash transistors. TCAM is successfully used in network routing but can also be utilized in the OS Virtual Memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach. In order to lower the cost of Main Memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with the less-expensive NVMe Flash, for a much lower system cost. However, given the relatively high Flash devices access latency, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to gain about 50% of the lost performance due to swapping into the NVM and thus enable the utilization of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping the performance comparable to the DRAM-only system

    Architectural support for persistent memory systems

    Get PDF
    The long stated vision of persistent memory is set to be realized with the release of 3D XPoint memory by Intel and Micron. Persistent memory, as the name suggests, amalgamates the persistence (non-volatility) property of storage devices (like disks) with byte-addressability and low latency of memory. These properties of persistent memory coupled with its accessibility through the processor load/store interface enable programmers to design in-memory persistent data structures. An important challenge in designing persistent memory systems is to provide support for maintaining crash consistency of these in-memory data structures. Crash consistency is necessary to ensure the correct recovery of program state after a crash. Ordering is a primitive that can be used to design crash consistent programs. It provides guarantees on the order of updates to persistent memory. Atomicity can also be used to design crash consistent programs via two primitives. First, as an atomic durability primitive which guarantees that in the presence of system crashes updates are made durable atomically, which means either all or none of the updates are made durable. Second, in the form of ACID transactions that guarantee atomic visibility and atomic durability. Existing systems do not support ordering, let alone atomic durability or ACID. In fact, these systems implement various performance enhancing optimizations that deliberately reorder updates to memory. Moreover, software in these systems cannot explicitly control the movement of data from volatile cache to persistent memory. Therefore, any ordering requirement has to be enforced synchronously which degrades performance because program execution is stalled waiting for updates to reach persistent memory. This thesis aims to provide the design principles and efficient implementations for three crash consistency primitives: ordering, atomic durability and ACID transactions. A set of persistency models have been proposed recently which provide support for the ordering primitive. This thesis extends the taxonomy of these models by adding buffering, which allows the hardware to enforce ordering in the background, as a new layer of classification. It then goes on show how the existing implementation of a buffered model degenerates to a performance inefficient non-buffered model because of the presence of conflicts and proposes efficient solutions to eliminate or limit the impact of these conflicts with minimal hardware modifications. This thesis also proposes the first implementation of a buffered model for a server class processor with multi-banked caches and multiple memory controllers. Write ahead logging (WAL) is a commonly used approach to provide atomic durability. This thesis argues that existing implementations ofWAL in software are not only inefficient, because of the fine grained ordering dependencies, but also waste precious execution cycles to implement a fundamentally data movement task. It then proposes ATOM, a hardware log manager based on undo logging that performs the logging operation out of the critical path. This thesis presents the design principles behind ATOM and two techniques that optimize its performance. These techniques enable the memory controller to enforce fine grained ordering required for logging and to even perform logging in some cases. In doing so, ATOM significantly reduces processor stall cycles and improves performance. The most commonly used abstraction employed to atomically update persistent data is that of durable transactions with ACID (Atomicity, Consistency, Isolation and Durability) semantics that make updates within a transaction both visible and durable atomically. As a final contribution, this thesis tackles the problem of providing efficient support for durable transactions in hardware by integrating hardware support for atomic durability with hardware transactional memory (HTM). It proposes DHTM (durable hardware transactional memory) in which durability is considered as a first class design constraint. DHTM guarantees atomic durability via hardware redo-logging, and integrates this logging support with a commercial HTM to provide atomic visibility. Furthermore, DHTM leverages the same logging infrastructure to extend the supported transaction size, from being L1-limited to the LLC, with minor changes to the coherence protocol

    Virtualization of Non-Volatile Ram

    Get PDF
    Virtualization technology is powering today's cloud industry. Virtualization inserts a software layer, the hypervisor, below the Operating System, to manage multiple OS environments simultaneously. Offering numerous benefits such as fault isolation, load balancing, faster server provisioning, etc., virtualization occupies a dominant position, especially in IT infrastructure in data centers. Memory management is one of the core components of a hypervisor. Current implementations assume the underlying memory to be homogenous and volatile. However, with the emergence of NVRAM in the form of Storage Class Memory (SCM), this assumption remains no longer valid. New motherboard architectures will support several different memory classes each with distinct properties and characteristics. The hypervisor has to recognize, manage, and expose them separately to the different virtual machines. This study focuses on building a separate memory management module for Non-Volatile RAM in Xen hypervisor. We show that it can be efficiently implemented with a few code changes and minimal runtime performance overhead
    corecore