
    NVB-tree: Failure-Atomic B+-tree for Persistent Memory

    Emerging non-volatile memory has opened new opportunities to re-design the entire system software stack, and it is expected to break the boundary between memory and storage devices, enabling storage-less systems. Traditionally, the B-tree has been used to organize data blocks in storage systems. However, the B-tree is optimized for disk-based systems that read and write large blocks of data. When byte-addressable non-volatile memory replaces block-device storage, the byte-addressability of NVRAM makes it challenging to enforce the failure-atomicity of B-tree nodes. In this work, we present NVB-tree, which addresses this challenge, reducing cache line flush overhead and avoiding expensive logging methods. NVB-tree is a hybrid tree that combines the binary search tree and the B+-tree: keys in each NVB-tree node are stored as a binary search tree so that the node can benefit from the byte-addressability of binary search trees. We also present a logging-less split/merge scheme that guarantees failure-atomicity with 8-byte memory writes. Our performance study shows that NVB-tree outperforms the state-of-the-art persistent index, wB+-tree, by a large margin.
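    The core idea can be illustrated with a small sketch (not the authors' code): keys in a node occupy append-only slots, and a binary search tree over those slots defines the key order, so an insert first writes the new key into a free slot and then publishes it with a single pointer-sized write. On real NVM that final write would be an 8-byte failure-atomic store, so a crash before it leaves the node consistent. All names below are illustrative assumptions.

```python
# Illustrative sketch of an NVB-tree-style node: keys live in slots, and a
# BST over the slots gives the order. The "publish" step is one pointer
# assignment, standing in for an 8-byte failure-atomic store on real NVM.

class Slot:
    def __init__(self, key):
        self.key = key
        self.left = None    # BST links between slots (8-byte pointers on HW)
        self.right = None

class NVBNode:
    def __init__(self):
        self.root = None    # 8-byte root pointer on real hardware

    def insert(self, key):
        slot = Slot(key)            # step 1: write the key into a free slot
        # (real code would flush the slot's cache line here, before publishing)
        if self.root is None:
            self.root = slot        # step 2: publish with one atomic write
            return
        cur = self.root
        while True:
            if key < cur.key:
                if cur.left is None:
                    cur.left = slot     # publish: one 8-byte pointer write
                    return
                cur = cur.left
            else:
                if cur.right is None:
                    cur.right = slot    # publish: one 8-byte pointer write
                    return
                cur = cur.right

    def keys_in_order(self):
        """In-order walk of the in-node BST yields the sorted keys."""
        out, stack, cur = [], [], self.root
        while stack or cur:
            while cur:
                stack.append(cur)
                cur = cur.left
            cur = stack.pop()
            out.append(cur.key)
            cur = cur.right
        return out
```

    Because a crash can only lose the final pointer write, recovery sees either the old BST or the new one, with no torn intermediate state; this is what lets the scheme avoid logging for in-node inserts.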

    Design Guidelines for High-Performance SCM Hierarchies

    With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask the question of how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity. We identify the set of memory hierarchy design parameters that plays a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements.
    Comment: Published at MEMSYS'1
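    The provisioning trade-off described above can be captured with a back-of-the-envelope model. The sketch below is not the paper's methodology; all latencies, per-GB costs, and the hit rate are assumed parameters chosen only to show the shape of the calculation: a DRAM cache amortizes SCM's higher latency while cheap SCM capacity cuts the subsystem cost.

```python
# Toy performance/cost model for an SCM-mostly hierarchy with a 3D-stacked
# DRAM cache. All numbers are illustrative assumptions, not measurements.

def avg_latency(hit_rate, cache_ns, scm_ns):
    """Average access latency with a DRAM cache in front of SCM."""
    return hit_rate * cache_ns + (1 - hit_rate) * scm_ns

def subsystem_cost(cache_gb, backing_gb, cache_cost_gb, backing_cost_gb):
    """Relative memory subsystem cost from per-GB cost assumptions."""
    return cache_gb * cache_cost_gb + backing_gb * backing_cost_gb

# Assumptions: stacked DRAM cache at 1.5x planar-DRAM cost per GB,
# 2-bit/cell PCM at 0.25x, SCM latency roughly 4x DRAM, 90% cache hit rate.
dram_only_cost = subsystem_cost(0, 256, 1.5, 1.0)    # 256 GB of plain DRAM
hybrid_cost = subsystem_cost(16, 256, 1.5, 0.25)     # 16 GB cache + 256 GB PCM
hybrid_latency = avg_latency(0.90, 50, 320)          # ns, assumed values
```

    Under these assumed numbers the hybrid configuration costs roughly a third of the DRAM-only one while its average latency stays close to DRAM, which is the sweet-spot reasoning the paper's methodology makes rigorous.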

    ๋ฐ์ดํ„ฐ ์ง‘์•ฝ์  ์‘์šฉ์„ ์œ„ํ•œ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ ๊ธฐ๋ฐ˜์˜ I/O ์ตœ์ ํ™”

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ)--์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› :๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€,2019. 8. ๊น€์ง€ํ™.์˜ค๋Š˜๋‚ ์—๋Š” ๋‹ค์–‘ํ•œ ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ ์ง‘์•ฝ์ ์ธ ์‘์šฉ์ด ํ™œ์šฉ๋˜๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ์‘์šฉ๋“ค์€ ๋Œ€์šฉ๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ฑฐ๋‚˜, ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ์กฐํ™”ํ•˜์—ฌ ์Šคํ† ๋ฆฌ์ง€์— ์ €์žฅํ•˜๋Š” ๋“ฑ ๋งŽ์€ I/O๋ฅผ ๋ฐœ์ƒ์‹œ์ผœ, ์‹œ์Šคํ…œ์ด I/O๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ์†๋„์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฐ›๊ฒŒ ๋œ๋‹ค. ์šด์˜์ฒด์ œ๋Š” ๋ฉ”์ธ ๋ฉ”๋ชจ๋ฆฌ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๋–จ์–ด์ง€๋Š” ์ €์žฅ ์žฅ์น˜๋กœ์˜ ์ ‘๊ทผ์„ ์ตœ์†Œํ™”ํ•˜์—ฌ ํŒŒ์ผ I/O์˜ ์„ฑ๋Šฅ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ณ ์ž ๋ฉ”์ธ ๋ฉ”๋ชจ๋ฆฌ์˜ ์ผ๋ถ€๋ฅผ ํŽ˜์ด์ง€ ์บ์‹œ๋กœ ํ• ๋‹นํ•œ๋‹ค. ํ•˜์ง€๋งŒ ๋ฉ”๋ชจ๋ฆฌ์˜ ํฌ๊ธฐ๋Š” ์ €์žฅ ์žฅ์น˜์— ๋น„ํ•ด ํฌ๊ฒŒ ์ œํ•œ๋˜์–ด ์žˆ์–ด, ํŒŒ์ผ I/O์˜ ์„ฑ๋Šฅ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด์„œ๋Š” ์•ž์œผ๋กœ ์ฐธ์กฐ๋˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ž˜ ๋ณด๊ด€ํ•˜๊ณ  ์ฐธ์กฐ๋˜์ง€ ์•Š์„ ๋ฐ์ดํ„ฐ๋ฅผ ์บ์‹œ๋กœ๋ถ€ํ„ฐ ๋‚ด๋ณด๋‚ด๋ฉฐ ํšจ์œจ์ ์œผ๋กœ ๊ด€๋ฆฌํ•˜๋Š” ๊ฒƒ์ด ๋งค์šฐ ์ค‘์š”ํ•˜๋‹ค. ํ•˜์ง€๋งŒ ์–ด๋–ค ๋ฐ์ดํ„ฐ๊ฐ€ ์•ž์œผ๋กœ ์ฐธ์กฐ๋ ์ง€, ๊ทธ๋ฆฌ๊ณ  ์–ด๋–ค ๋ฐ์ดํ„ฐ๊ฐ€ ์ฐธ์กฐ๋˜์ง€ ์•Š์„์ง€์— ๋Œ€ํ•ด์„œ ์‹œ์Šคํ…œ์ด ์ž์ฒด์ ์œผ๋กœ ์™„๋ฒฝํ•˜๊ฒŒ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์€ ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค. ๋”ฐ๋ผ์„œ, ์‹œ์Šคํ…œ๋ณด๋‹ค ์ƒ์œ„ ๊ณ„์ธต์—์„œ์˜ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ๋…ธ๋ ฅ ์—†์ด๋Š” I/O ์ตœ์ ํ™”์— ์žˆ์–ด ๋ช…๋ฐฑํ•œ ํ•œ๊ณ„๊ฐ€ ์กด์žฌํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์‘์šฉ์ด I/O๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๋งฅ๋ฝ, ์ฆ‰ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ I/O๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ์‹œ์ ๊ณผ ๊ทธ ํŒจํ„ด์„ ์ž๋™์œผ๋กœ ํŒŒ์•…ํ•˜์—ฌ ๋ถ„์„ํ•˜๋Š” ๊ธฐ๋ฒ•๊ณผ, ์ด๋ฅผ ํ†ตํ•ด ๋ถ„์„ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜์—ฌ ๊ฐ๊ฐ์˜ I/O๊ฐ€ ๋ฐœ์ƒํ•œ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ์— ์ ์šฉํ•  ์ตœ์ ํ™” ๋ฐฉ์•ˆ ์ถ”์ฒœ์„ ์ž๋™ํ™”ํ•˜๋Š” ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์‹œ์Šคํ…œ์—์„œ ์ž์ฒด์ ์œผ๋กœ ํŒŒ์•…ํ•  ์ˆ˜ ์—†๋Š” ๋‹ค์–‘ํ•œ ํžŒํŠธ๋ฅผ ์‚ฌ์ „์— ์ œ๊ณตํ•˜๊ณ , ์ด ์ •๋ณด๋ฅผ ์‹œ์Šคํ…œ์ด ์ ๊ทน์ ์œผ๋กœ ํ™œ์šฉํ•˜์—ฌ ์ด์ „๋ณด๋‹ค ํšจ์œจ์ ์ธ I/O๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•œ๋‹ค.Many kinds of data intensive applications are broadly utilized nowadays. 
These applications generate a lot of I/O such as analyzing a large amount of data, structuring the data and storing it in the storage, and the performance is greatly influenced by the speed of the I/O the system performs. The operating system allocates a portion of main memory to the page cache to maximize the performance of file I/O by minimizing access to the storage device which is much lower in performance than main memory. However, since the size of memory is limited compared to the size of the storage device, it is very important to keep the data to be referenced to in future and to export the data not to be referenced from the cache and to manage efficiently to improve the performance of the file I/O. However, it is impossible for the system to predict perfectly about which data will be referenced in the future and which data will not be. Thus, without I/O optimization at the application level, there is a clear limit to performance improvement. In this thesis, we propose a method to automatically detect and analyze I/O characteristics based on I/O program contexts of which an application executes I/O. We propose a technique to automate the optimization recommendation to be applied to the program context in which I/O occurs. 
Through this, the application can provide various hints to the system that can not be grasped by the system itself, and the system actively reflects this information so that I/O can be performed faster and resources can be used more efficiently than before.์ œ 1 ์žฅ ์„œ ๋ก  1 ์ œ 1 ์ ˆ ์—ฐ๊ตฌ์˜ ๋ฐฐ๊ฒฝ 1 ์ œ 2 ์ ˆ ์—ฐ๊ตฌ์˜ ๋ชฉ์  ๋ฐ ๊ธฐ์—ฌ 4 ์ œ 3 ์ ˆ ๋…ผ๋ฌธ ๊ตฌ์„ฑ 8 ์ œ 2 ์žฅ ๊ด€๋ จ ์—ฐ๊ตฌ 9 ์ œ 1 ์ ˆ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ๋ฅผ ํ™œ์šฉํ•œ ๋ฒ„ํผ ์บ์‹ฑ 9 ์ œ 2 ์ ˆ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ ๊ธฐ๋ฐ˜์˜ ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ ๊ธฐ๋ฒ• 13 ์ œ 3 ์žฅ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ์— ๊ธฐ๋ฐ˜ํ•œ ์‘์šฉ I/O ๋ถ„์„ 19 ์ œ 1 ์ ˆ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ์˜ ์ •์˜์™€ ์ถ”์ถœ ๋ฐฉ๋ฒ• 19 ์ œ 2 ์ ˆ PCStat: ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ์— ๋”ฐ๋ฅธ I/O ํŒจํ„ด ๋ถ„์„ 22 ์ œ 3 ์ ˆ I/O ์“ฐ๋ ˆ๋“œ ํ™˜๊ฒฝ์„ ์œ„ํ•œ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ์˜ ์ถ”์ถœ ๊ธฐ๋ฒ• 28 ์ œ 4 ์žฅ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ์— ๊ธฐ๋ฐ˜ํ•œ I/O ์ตœ์ ํ™” ์ ์šฉ 30 ์ œ 1 ์ ˆ ํŽ˜์ด์ง€ ์บ์‹œ์— ์ œ๊ณตํ•˜๋Š” ํžŒํŠธ 30 ์ œ 2 ์ ˆ fadvise ์ ์šฉ์„ ํ†ตํ•œ ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ ๊ธฐ๋ฐ˜์˜ I/O ์ตœ์ ํ™” 32 ์ œ 3 ์ ˆ PCAdvisor: ํ”„๋กœ๊ทธ๋žจ ์ปจํ…์ŠคํŠธ ๊ธฐ๋ฐ˜์˜ I/O ์ตœ์ ํ™” ์ž๋™ํ™” 35 ์ œ 5 ์žฅ ํ‰๊ฐ€ ์‹คํ—˜ 38 ์ œ 1 ์ ˆ ์‹คํ—˜ ํ™˜๊ฒฝ 38 ์ œ 2 ์ ˆ ์‹คํ—˜ ๊ฒฐ๊ณผ 39 ์ œ 6 ์žฅ ๊ฒฐ ๋ก  44 ์ œ 1 ์ ˆ ๊ฒฐ๋ก  ๋ฐ ํ–ฅํ›„ ๊ณ„ํš 44 ์ฐธ๊ณ ๋ฌธํ—Œ 46 Abstract 49Maste
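    The thesis applies its recommendations through the `fadvise` interface. As a minimal sketch of the kind of hint involved (the helper name and chunk size are illustrative, and `os.posix_fadvise` requires Linux), a program context classified as a one-pass sequential scan can tell the kernel to expect sequential reads and then to drop the pages afterwards, so they do not evict more useful cached data:

```python
# Sketch of a page-cache hint for a one-pass sequential-scan context,
# using posix_fadvise. Linux-only; names here are illustrative.
import os

def sequential_scan(path, chunk=1 << 20):
    """Read a file once and advise the kernel its pages won't be reused."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint 1: reads will be sequential, so read-ahead aggressively.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        total = 0
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            total += len(buf)
        # Hint 2: this context won't touch the data again; free the pages.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return total
    finally:
        os.close(fd)
```

    In the thesis's setting, PCStat identifies which program contexts behave this way and PCAdvisor automates attaching the corresponding hint, instead of requiring the programmer to place such calls by hand.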

    The Inherent Cost of Remembering Consistently

    Non-volatile memory (NVM) promises fast, byte-addressable and durable storage, with raw access latencies in the same order of magnitude as DRAM. But in order to take advantage of the durability of NVM, programmers need to design persistent objects which maintain consistent state across system crashes and restarts. Concurrent implementations of persistent objects typically make heavy use of expensive persistent fence instructions to order NVM accesses, thus negating some of the performance benefits of NVM. This raises the question of the minimal number of persistent fence instructions required to implement a persistent object. We answer this question in the deterministic lock-free case by providing lower and upper bounds on the required number of fence instructions. We obtain our upper bound by presenting a new universal construction that durably implements any object using at most one persistent fence per update operation invoked. Our lower bound states that in the worst case, each process needs to issue at least one persistent fence per update operation invoked.
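    To make the "one fence per update" bound concrete, here is a toy model, not the paper's universal construction: a durably recoverable counter on a simulated NVM log, where `persist_fence()` stands in for a persistent fence (e.g., an `sfence` after cache-line write-backs). All class and method names are invented for illustration; the point is only that each update issues exactly one fence and the object still recovers after a crash.

```python
# Toy model: writes land in a pending buffer and become durable only when a
# persistent fence is issued; each update issues exactly one fence.

class SimNVM:
    def __init__(self):
        self.log = []       # records guaranteed durable
        self.pending = []   # written back, but not yet ordered/durable
        self.fences = 0     # persistent fences issued so far

    def write(self, record):
        self.pending.append(record)

    def persist_fence(self):
        """Model of a persistent fence: all prior writes become durable."""
        self.log.extend(self.pending)
        self.pending.clear()
        self.fences += 1

class DurableCounter:
    def __init__(self, nvm):
        self.nvm = nvm
        self.value = 0

    def increment(self):
        self.value += 1
        self.nvm.write(("inc", self.value))  # log the update's effect
        self.nvm.persist_fence()             # exactly one fence per update

    def recover(self):
        """After a simulated crash, replay the durable log."""
        self.value = self.nvm.log[-1][1] if self.nvm.log else 0
```

    A crash between `write` and `persist_fence` loses only the not-yet-fenced update, which is exactly the window the lower bound says cannot be closed with fewer than one fence per update in the worst case.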

    Bridging the Gap between Application and Solid-State-Drives

    Data storage is one of the most important and often critical parts of a computing system in terms of performance, cost, reliability, and energy. Numerous new memory technologies, such as NAND flash, phase-change memory (PCM), magnetic RAM (STT-RAM), and the memristor, have emerged recently, and many of them have already entered production systems. Traditional storage optimization and caching algorithms are far from optimal because storage I/Os do not exhibit simple locality. Providing optimal storage requires accurate prediction of I/O behavior; however, workloads are increasingly dynamic and diverse, making both long- and short-term I/O prediction challenging. Because of the evolution of storage technologies and the increasing diversity of workloads, storage software is becoming more and more complex. For example, a Flash Translation Layer (FTL) is added to NAND-flash-based solid-state disks (NAND-SSDs), but it introduces overheads such as address translation delay and garbage collection costs. Many recent studies aim to address these overheads; unfortunately, there is no one-size-fits-all solution due to the variety of workloads. Despite rapid evolution in storage technologies, the increasing heterogeneity and diversity of machines and workloads, coupled with the continued data explosion, exacerbate the gap between computing and storage speeds. In this dissertation, we improve data storage performance with both top-down and bottom-up approaches. First, we investigate exposing storage-level parallelism so that applications can avoid I/O contention and workload skew when scheduling jobs. Second, we study how architecture-aware task scheduling can improve application performance when PCM-based NVRAM is equipped. Third, we develop an I/O-correlation-aware flash translation layer for NAND-flash-based solid-state disks.
    Fourth, we build a DRAM-based correlation-aware FTL emulator and study its performance on various filesystems.
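    For readers unfamiliar with the FTL overheads mentioned above, a minimal page-mapping sketch shows where they come from. This is a generic textbook-style model, not the dissertation's correlation-aware design: every read pays an address translation, and every overwrite goes out-of-place and leaves a stale page for garbage collection to reclaim.

```python
# Simplified page-mapping FTL: logical page numbers (LPNs) map to physical
# page numbers (PPNs); overwrites are out-of-place, invalidating the old copy.

class PageFTL:
    def __init__(self):
        self.mapping = {}      # LPN -> PPN (the translation table)
        self.invalid = set()   # stale PPNs awaiting garbage collection
        self.next_free = 0     # next free physical page (no GC modeled)

    def write(self, lpn):
        if lpn in self.mapping:
            # Flash pages can't be overwritten in place: the old copy
            # becomes garbage that GC must later reclaim.
            self.invalid.add(self.mapping[lpn])
        ppn = self.next_free
        self.next_free += 1
        self.mapping[lpn] = ppn
        return ppn

    def read(self, lpn):
        # Every read pays the address-translation step.
        return self.mapping.get(lpn)
```

    A correlation-aware FTL, as pursued in the dissertation, would additionally use which LPNs are accessed together to decide data placement, reducing the garbage-collection cost this simple model leaves implicit.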

    Memory Subsystems for Security, Consistency, and Scalability

    In response to the continuous demand for the ability to process ever larger datasets, as well as discoveries in next-generation memory technologies, researchers have been vigorously studying memory-driven computing architectures that shall allow data-intensive applications to access enormous amounts of pooled non-volatile memory. As applications continue to interact with increasing numbers of components and datasets, existing systems struggle to efficiently enforce the principle of least privilege for security. While non-volatile memory can retain data even after a power loss and allow for large main memory capacity, programmers have to bear the burdens of maintaining the consistency of program memory for fault tolerance as well as handling huge datasets with traditional yet expensive memory management interfaces for scalability. Today's computer systems have become too sophisticated for existing memory subsystems to handle many design requirements. In this dissertation, we introduce three memory subsystems to address challenges in terms of security, consistency, and scalability. Specifically, we propose SMVs to provide threads with fine-grained control over access privileges for a partially shared address space for security, NVthreads to allow programmers to easily leverage non-volatile memory with automatic persistence for consistency, and PetaMem to enable memory-centric applications to freely access memory beyond the traditional process boundary with support for memory isolation and crash recovery for security, consistency, and scalability.