14 research outputs found
NVB-tree: Failure-Atomic B+-tree for Persistent Memory
Department of Computer Engineering
Emerging non-volatile memory has opened new opportunities to redesign the entire system software stack, and it is expected to break the boundary between memory and storage devices, enabling storage-less systems. Traditionally, the B-tree has been used to organize data blocks in storage systems. However, the B-tree is optimized for disk-based systems that read and write large blocks of data. When byte-addressable non-volatile memory replaces block-device storage, the byte-addressability of NVRAM makes it challenging to enforce the failure atomicity of B-tree nodes.
In this work, we present NVB-tree, which addresses this challenge by reducing cache-line flush overhead and avoiding expensive logging. NVB-tree is a hybrid index that combines the binary search tree and the B+-tree: keys within each NVB-tree node are organized as a binary search tree so that the node can benefit from the byte-addressability of binary search trees. We also present a logging-less split/merge scheme that guarantees failure atomicity using 8-byte memory writes. Our performance study shows that NVB-tree outperforms the state-of-the-art persistent index wB+-tree by a large margin.
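A hypothetical sketch of the idea in C: entries inside a node are linked into a binary search tree, and an insert becomes visible, and thus failure-atomic, through a single 8-byte link update. All names, the node layout, and the flush stub are illustrative assumptions, not the paper's actual design.

```c
#include <stdint.h>
#include <string.h>

#define NODE_CAP 32

/* Illustrative NVB-tree-style node: entries are appended to a slot array
 * and linked into a binary search tree via 8-byte indices. An entry is
 * made reachable by one 8-byte store, which NVM persists atomically, so
 * inserts need no undo logging. */
typedef struct {
    uint64_t key;
    uint64_t val;
    uint64_t left;   /* slot index of left child; 0 means "no child" */
    uint64_t right;  /* slot index of right child */
} entry_t;

typedef struct {
    uint64_t root;       /* 8-byte root index: the commit point */
    uint64_t next_free;  /* next unused slot (slot 0 is reserved) */
    entry_t  slots[NODE_CAP];
} node_t;

/* Stand-in for a cache-line writeback + fence on real NVM hardware. */
static void persist(const void *p, uint64_t n) { (void)p; (void)n; }

void node_init(node_t *n) {
    memset(n, 0, sizeof *n);
    n->next_free = 1;  /* keep slot 0 unused so index 0 can mean "none" */
    persist(n, sizeof *n);
}

int node_insert(node_t *n, uint64_t key, uint64_t val) {
    if (n->next_free >= NODE_CAP) return -1;    /* would trigger a split */
    uint64_t slot = n->next_free;
    n->slots[slot] = (entry_t){ key, val, 0, 0 };
    persist(&n->slots[slot], sizeof(entry_t));  /* durable but unreachable */
    n->next_free = slot + 1;
    /* Walk the in-node BST to find the parent link to redirect. */
    uint64_t *link = &n->root;
    while (*link != 0) {
        entry_t *e = &n->slots[*link];
        link = key < e->key ? &e->left : &e->right;
    }
    *link = slot;                    /* single 8-byte commit store */
    persist(link, sizeof(uint64_t));
    return 0;
}

int node_search(const node_t *n, uint64_t key, uint64_t *val) {
    uint64_t cur = n->root;
    while (cur != 0) {
        const entry_t *e = &n->slots[cur];
        if (key == e->key) { *val = e->val; return 0; }
        cur = key < e->key ? e->left : e->right;
    }
    return -1;
}
```

A crash between the entry flush and the link update leaves only an unreachable slot behind, which is harmless: the node's reachable state is always consistent.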
Design Guidelines for High-Performance SCM Hierarchies
With emerging storage-class memory (SCM) nearing commercialization, there is
evidence that it will deliver the much-anticipated high density and access
latencies within only a few factors of DRAM. Nevertheless, the
latency-sensitive nature of memory-resident services makes seamless integration
of SCM in servers questionable. In this paper, we ask the question of how best
to introduce SCM for such servers to improve overall performance/cost over
existing DRAM-only architectures. We first show that even with the most
optimistic latency projections for SCM, the higher memory access latency
results in prohibitive performance degradation. However, we find that
deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the
performance of an SCM-mostly memory system competitive. The high degree of
spatial locality that memory-resident services exhibit not only simplifies the
DRAM cache's design as page-based, but also enables the amortization of
increased SCM access latencies and the mitigation of SCM's read/write latency
disparity.
We identify the set of memory hierarchy design parameters that plays a key
role in the performance and cost of a memory system combining an SCM technology
and a 3D stacked DRAM cache. We then introduce a methodology to drive
provisioning for each of these design parameters under a target
performance/cost goal. Finally, we use our methodology to derive concrete
results for specific SCM technologies. With PCM as a case study, we show that a
two bits/cell technology hits the performance/cost sweet spot, reducing the
memory subsystem cost by 40% while keeping performance within 3% of the best
performing DRAM-only system, whereas single-level and triple-level cell
organizations are impractical for use as memory replacements.
Comment: Published at MEMSYS'1
Program Context-based I/O Optimization for Data-Intensive Applications
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2019. 8.
Many kinds of data-intensive applications are in broad use today. These applications generate a great deal of I/O, for example when analyzing large volumes of data or when structuring data and storing it, so their performance is strongly influenced by the speed at which the system performs that I/O.
The operating system allocates a portion of main memory to the page cache to maximize file I/O performance by minimizing accesses to the storage device, which is far slower than main memory. Because memory is small relative to the storage device, however, achieving good file I/O performance requires managing the cache efficiently: keeping data that will be referenced again and evicting data that will not. Yet it is impossible for the system by itself to predict perfectly which data will be referenced in the future and which will not. Thus, without I/O optimization at the application level, there is a clear limit to performance improvement.
In this thesis, we propose a technique that automatically identifies and analyzes the points where I/O occurs and its patterns, based on the program context in which the application performs the I/O, and a technique that, building on the analysis results, automates the recommendation of optimizations to apply to each program context that issues I/O. Through this, the application can provide the system with various hints that the system cannot discover by itself, and the system actively exploits this information so that I/O is performed faster and resources are used more efficiently than before.
Chapter 1 Introduction
Section 1 Background of the Research
Section 2 Goals and Contributions
Section 3 Organization of the Thesis
Chapter 2 Related Work
Section 1 Buffer Caching Using Program Context
Section 2 Program-Context-based Data Separation Techniques
Chapter 3 Application I/O Analysis Based on Program Context
Section 1 Definition and Extraction of Program Contexts
Section 2 PCStat: I/O Pattern Analysis by Program Context
Section 3 Program Context Extraction for I/O-Threaded Environments
Chapter 4 Applying Program-Context-based I/O Optimization
Section 1 Hints Provided to the Page Cache
Section 2 Program-Context-based I/O Optimization via fadvise
Section 3 PCAdvisor: Automating Program-Context-based I/O Optimization
Chapter 5 Evaluation
Section 1 Experimental Environment
Section 2 Experimental Results
Chapter 6 Conclusion
Section 1 Conclusion and Future Work
References
Abstract
The Inherent Cost of Remembering Consistently
Non-volatile memory (NVM) promises fast, byte-addressable, and durable storage, with raw access latencies of the same order of magnitude as DRAM. To take advantage of NVM's durability, however, programmers need to design persistent objects that maintain consistent state across system crashes and restarts. Concurrent implementations of persistent objects typically make heavy use of expensive persistent fence instructions to order NVM accesses, negating some of NVM's performance benefits. This raises the question of the minimal number of persistent fence instructions required to implement a persistent object. We answer this question in the deterministic lock-free case by providing lower and upper bounds on the required number of fence instructions. We obtain our upper bound by presenting a new universal construction that durably implements any object using at most one persistent fence per update operation invoked. Our lower bound states that, in the worst case, each process needs to issue at least one persistent fence per update operation invoked.
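The flavor of the one-fence-per-update idea can be sketched in C: all stores of an update go into a log record, a single persist fence orders them before an 8-byte commit flag, and recovery ignores uncommitted records. The names, the record layout, and the stand-in fence (real x86 NVM code would write back the dirty lines with clwb and then issue sfence) are illustrative assumptions, not the paper's actual construction.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative per-operation log record: 'val' is the update's payload,
 * 'committed' is an 8-byte flag flipped only after the payload is
 * guaranteed durable, so recovery never sees a half-written update. */
typedef struct {
    uint64_t val;
    _Atomic uint64_t committed;
} log_rec_t;

static void persist_fence(void) {
    /* stand-in for: clwb(dirty cache lines); sfence */
    atomic_thread_fence(memory_order_release);
}

/* Apply one update with exactly one persist fence. */
void update(log_rec_t *rec, uint64_t new_val) {
    rec->val = new_val;               /* may reach NVM in any order ... */
    persist_fence();                  /* ... until this single fence */
    atomic_store(&rec->committed, 1); /* commit point */
}

/* Recovery: a record counts only if its commit flag reached NVM. */
int recover(const log_rec_t *rec, uint64_t *out) {
    if (!atomic_load(&rec->committed)) return -1;
    *out = rec->val;
    return 0;
}
```

The point of the bound is that this one fence per update is both sufficient (via a universal construction) and, in the worst case, necessary.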
Bridging the Gap between Application and Solid-State-Drives
Data storage is one of the important and often critical parts of a computing system in terms of performance, cost, reliability, and energy. Numerous new memory technologies, such as NAND flash, phase-change memory (PCM), magnetic RAM (STT-RAM), and the memristor, have emerged recently, and many of them have already entered production systems. Traditional storage optimization and caching algorithms are far from optimal because storage I/Os do not exhibit simple locality. Providing optimal storage requires accurate predictions of I/O behavior, yet workloads are increasingly dynamic and diverse, making both long- and short-term I/O prediction challenging. Because of the evolution of storage technologies and the increasing diversity of workloads, storage software is becoming more and more complex. For example, a Flash Translation Layer (FTL) is added to NAND-flash-based solid-state disks (NAND-SSDs), but it introduces overheads such as address-translation delay and garbage-collection costs. Many recent studies aim to address these overheads; unfortunately, there is no one-size-fits-all solution due to the variety of workloads. Despite rapid evolution in storage technologies, the increasing heterogeneity and diversity of machines and workloads, coupled with the continued data explosion, exacerbate the gap between computing and storage speeds. In this dissertation, we improve data storage performance through both top-down and bottom-up approaches. First, we investigate exposing storage-level parallelism so that applications can avoid I/O contention and workload skew when scheduling jobs. Second, we study how architecture-aware task scheduling can improve application performance when PCM-based NVRAM is equipped. Third, we develop an I/O-correlation-aware flash translation layer for NAND-flash-based solid-state disks.
Fourth, we build a DRAM-based correlation-aware FTL emulator and study its performance on various filesystems.
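The address-translation overhead mentioned above stems from the FTL's out-of-place writes: NAND pages cannot be overwritten in place, so every logical write lands on a fresh physical page and the mapping is redirected, leaving stale pages for garbage collection. A minimal page-mapping FTL sketch; the sizes, names, and naive append-only allocator are illustrative assumptions, not the dissertation's design:

```c
#include <stdint.h>

#define N_LOGICAL  64
#define N_PHYSICAL 128
#define INVALID    UINT32_MAX

/* Illustrative page-mapping FTL state. */
typedef struct {
    uint32_t l2p[N_LOGICAL];    /* logical -> physical page map */
    uint8_t  stale[N_PHYSICAL]; /* pages invalidated by rewrites */
    uint32_t next_free;         /* naive append-only allocator */
} ftl_t;

void ftl_init(ftl_t *f) {
    for (int i = 0; i < N_LOGICAL; i++) f->l2p[i] = INVALID;
    for (int i = 0; i < N_PHYSICAL; i++) f->stale[i] = 0;
    f->next_free = 0;
}

/* Write a logical page: returns the physical page the data landed on,
 * or INVALID when no free page remains (a real FTL would trigger
 * garbage collection here to reclaim stale pages). */
uint32_t ftl_write(ftl_t *f, uint32_t lpn) {
    if (lpn >= N_LOGICAL || f->next_free >= N_PHYSICAL) return INVALID;
    if (f->l2p[lpn] != INVALID)
        f->stale[f->l2p[lpn]] = 1;  /* old copy becomes garbage */
    f->l2p[lpn] = f->next_free++;   /* out-of-place: always a new page */
    return f->l2p[lpn];
}

/* Translate a read: this indirection is the address-translation delay. */
uint32_t ftl_read(const ftl_t *f, uint32_t lpn) {
    return lpn < N_LOGICAL ? f->l2p[lpn] : INVALID;
}
```

An I/O-correlation-aware FTL, as studied in the dissertation, would go further and choose where correlated pages land so that they can be reclaimed together.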
Memory Subsystems for Security, Consistency, and Scalability
In response to the continuous demand for processing ever-larger datasets, as well as discoveries in next-generation memory technologies, researchers have been vigorously studying memory-driven computing architectures that allow data-intensive applications to access enormous amounts of pooled non-volatile memory. As applications interact with ever more components and datasets, existing systems struggle to efficiently enforce the principle of least privilege for security. While non-volatile memory can retain data even after a power loss and allows for large main-memory capacity, programmers must bear the burdens of maintaining the consistency of program memory for fault tolerance and of handling huge datasets through traditional yet expensive memory-management interfaces for scalability. Today's computer systems have become too sophisticated for existing memory subsystems to handle many design requirements. In this dissertation, we introduce three memory subsystems that address challenges in security, consistency, and scalability. Specifically, we propose SMVs to give threads fine-grained control over access privileges in a partially shared address space for security, NVthreads to let programmers easily leverage non-volatile memory with automatic persistence for consistency, and PetaMem to enable memory-centric applications to freely access memory beyond the traditional process boundary, with support for memory isolation and crash recovery, for security, consistency, and scalability.
In response to the continuous demand for the ability to process ever larger datasets, as well as discoveries in next-generation memory technologies, researchers have been vigorously studying memory-driven computing architectures that shall allow data-intensive applications to access enormous amounts of pooled non-volatile memory. As applications continue to interact with increasing amounts of components and datasets, existing systems struggle to eรฟciently enforce the principle of least privilege for security. While non-volatile memory can retain data even after a power loss and allow for large main memory capacity, programmers have to bear the burdens of maintaining the consistency of program memory for fault tolerance as well as handling huge datasets with traditional yet expensive memory management interfaces for scalability. Todayโs computer systems have become too sophisticated for existing memory subsystems to handle many design requirements. In this dissertation, we introduce three memory subsystems to address challenges in terms of security, consistency, and scalability. Specifcally, we propose SMVs to provide threads with fne-grained control over access privileges for a partially shared address space for security, NVthreads to allow programmers to easily leverage nonvolatile memory with automatic persistence for consistency, and PetaMem to enable memory-centric applications to freely access memory beyond the traditional process boundary with support for memory isolation and crash recovery for security, consistency, and scalability