Evaluating TLB (Translation Lookaside Buffer) Performance Overhead for NVM (Non-Volatile Memory) Hybrid Systems
As non-volatile memory (NVM) technology offers near-DRAM performance and near-disk capacity, NVM has emerged as a new storage class. Conventional file systems, designed for hard disk drives or solid-state drives, need to be re-examined or even re-designed for NVM storage. For example, new file systems such as NOVA, HMFS, HMVFS and Ext4-DAX have been developed and implemented to fully leverage NVM's characteristics, such as fast fine-grained access. This thesis research uses a variety of I/O workloads to evaluate the performance overhead of the TLB (translation lookaside buffer) in various file systems on emulated NVM storage systems, in which NVM resides on the memory bus. As NVM's capacity becomes much greater than DRAM's and applications' footprints continue to grow rapidly, the number of TLB entries scales up at the same pace, leading to a significant number of TLB misses. The goal of this research is to gain insights into file system optimizations on storage-class memory. Experimental results show that NVM-based file systems can incur 50% more TLB overhead than conventional file systems under the same file operations. Profiling based on performance counters shows that TLB-friendly journaling/logging should be taken into consideration in future file system design.
ECHOFS: a scheduler-guided temporary filesystem to leverage node-local NVMs
© 2018 IEEE. The growth in data-intensive scientific applications poses strong demands on the HPC storage subsystem, as data needs to be copied from compute nodes to I/O nodes and vice versa for jobs to run. The emerging trend of adding denser, NVM-based burst buffers to compute nodes, however, offers the possibility of using these resources to build temporary file systems with specific I/O optimizations for a batch job. In this work, we present echofs, a temporary filesystem that coordinates with the job scheduler to preload a job's input files into node-local burst buffers. We present results measured with NVM emulation, and with different FS backends using DAX/FUSE on a local node, to show the benefits of our proposal and of such coordination. This work was partially supported by the Spanish Ministry of Science and Innovation under the TIN2015-65316 grant, the Generalitat de Catalunya under contract 2014-SGR-1051, and the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement no. 671951 (NEXTGenIO). Source code available at https://github.com/bsc-ssrg/echofs.
Redesigning Transaction Processing Systems for Non-Volatile Memory
Department of Computer Science and Engineering

Transaction Processing Systems are widely used because they allow users to manage their data more efficiently. However, they suffer from a performance bottleneck due to the redundant I/O required to guarantee data consistency. In addition to the redundant I/O, slow storage devices degrade performance further. Leveraging non-volatile memory is one of the promising solutions to the performance bottleneck in Transaction Processing Systems. However, since the I/O granularity of legacy storage devices and of non-volatile memory differs, traditional Transaction Processing Systems cannot fully exploit the performance of persistent memory. The goal of this dissertation is to fully exploit non-volatile memory to improve the performance of Transaction Processing Systems.
Write amplification in Transaction Processing Systems is identified as a performance bottleneck. As a first approach, we redesign Transaction Processing Systems to minimize the redundant I/O they generate. We present LS-MVBT, which integrates recovery information into the main database file to remove temporary files for recovery. LS-MVBT also employs five optimizations to reduce the write traffic of single fsync() calls. We also exploit persistent memory to reduce the performance bottleneck caused by slow storage devices. However, since traditional recovery methods are designed for slow storage devices, we develop byte-addressable differential logging, a user-level heap manager, and transaction-aware persistence to fully exploit persistent memory. To minimize the redundant I/O needed to guarantee data consistency, we present failure-atomic slotted paging with a persistent buffer cache.
Redesigning the indexing structure is the second approach to fully exploiting non-volatile memory. Since the B+-tree was originally designed for block granularity, it generates excessive I/O traffic on persistent memory. To mitigate this traffic, we develop a cache-line-friendly B+-tree that aligns its node size to the cache line size, minimizing write traffic. Moreover, with hardware transactional memory, it can update a single node atomically without any additional redundant I/O for guaranteeing data consistency. It also adopts Failure-Atomic Shift and Failure-Atomic In-place Rebalancing to eliminate unnecessary I/O.
Furthermore, we improve the persistent memory manager by using a traditional memory heap structure with a free list, instead of segregated lists, for small memory allocations to minimize memory allocation overhead.
Our performance evaluation shows that our improved designs, which consider the I/O granularity of non-volatile memory, efficiently reduce redundant I/O traffic and improve performance by a large margin.
Building Distributed Systems with Non-Volatile Main Memories and RDMA Networks
High-performance, byte-addressable non-volatile main memories (NVMMs) allow application developers to combine storage and memory into a single layer. These high-performance storage systems would be especially useful in large-scale data center environments where data is distributed and replicated across multiple servers. Unfortunately, existing approaches to providing remote storage access rest on the assumption that storage is slow, so the cost of the software and protocols is acceptable. This assumption no longer holds for fast NVMM. As a result, taking full advantage of NVMMs' potential will require changes in system software and networking protocols. This thesis focuses on accessing remote NVMM efficiently using remote direct memory access (RDMA) networks. RDMA enables a client to directly access memory on a remote machine without involving the remote machine's CPU.

This thesis first presents Mojim, a system that provides replicated, reliable, and highly available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Our evaluation shows that Mojim adds little overhead to the un-replicated system and provides 0.4x to 2.7x the throughput of the un-replicated system.

This thesis then presents Orion, a distributed file system designed for NVMM and RDMA networks. Traditional distributed file systems are designed for slower hard drives; these slower media incentivize complex optimizations (e.g., queuing, striping, and batching) around disk accesses. Orion combines file system functions and network operations into a single layer. It provides low-latency metadata accesses and outperforms existing distributed file systems by a large margin.

Finally, an NVMM application can map files backed by an NVMM file system into its address space and access them using CPU instructions.
In this case, RDMA and NVMM file systems introduce duplication of effort in permissions, naming, and address translation. We introduce two changes to the existing RDMA protocol: the file memory region (FileMR) and range-based address translation. By eliminating redundant translations, FileMR minimizes the number of translations done at the NIC, reducing the load on the NIC's translation cache and improving application performance by 1.8x to 2.0x.
Rethinking the I/O Stack for Persistent Memory
Modern operating systems have been designed around the hypotheses that (a) memory is both byte-addressable and volatile and (b) storage is block-addressable and persistent. The arrival of new Persistent Memory (PM) technologies has made these assumptions obsolete. Despite much recent work in this space, the need to consistently share PM data across multiple applications remains an urgent, unsolved problem. Furthermore, the availability of simple yet powerful operating system support remains elusive.
In this dissertation, we propose and build the Region System, a high-performance operating system stack for PM that implements usable consistency and persistence for application data. The region system provides support for consistently mapping and sharing data resident in PM across user application address spaces. The region system introduces a novel IPI-based PMSYNC operation, which ensures atomic persistence of mapped pages across multiple address spaces. This allows applications to consume PM using the well-understood and much-desired memory-like model with an easy-to-use interface. Next, we propose a metadata structure without any redundant metadata to reduce CPU cache flushes. The high-performance design minimizes expensive PM ordering and durability operations by embracing a minimalistic approach to metadata construction and management.
To strengthen the case for the region system, in this dissertation we analyze different types of applications to identify their dependence on memory-mapped data usage, and propose the user-level libraries LIBPM-R and LIBPMEMOBJ-R to support shared persistent containers. The user-level libraries, along with the region system, demonstrate a comprehensive end-to-end software stack for consuming PM devices.