162 research outputs found

    Distributed Multi-writer Multi-reader Atomic Register with Optimistically Fast Read and Write

    Full text link
    A distributed multi-writer multi-reader (MWMR) atomic register is an important primitive that enables a wide range of distributed algorithms. Hence, improving its performance can have large-scale consequences. Since the seminal work of ABD emulation in the message-passing networks [JACM '95], many researchers study fast implementations of atomic registers under various conditions. "Fast" means that a read or a write can be completed with 1 round-trip time (RTT), by contacting a simple majority. In this work, we explore an atomic register with optimal resilience and "optimistically fast" read and write operations. That is, both operations can be fast if there is no concurrent write. This paper has three contributions: (i) We present Gus, the emulation of an MWMR atomic register with optimal resilience and optimistically fast reads and writes when there are up to 5 nodes; (ii) We show that when there are > 5 nodes, it is impossible to emulate an MWMR atomic register with both properties; and (iii) We implement Gus in the framework of EPaxos and Gryff, and show that Gus provides lower tail latency than state-of-the-art systems such as EPaxos, Gryff, Giza, and Tempo under various workloads in the context of geo-replicated object storage systems

    PrismDB: Read-aware Log-structured Merge Trees for Heterogeneous Storage

    Full text link
    In recent years, emerging hardware storage technologies have focused on divergent goals: better performance or lower cost-per-bit of storage. Correspondingly, data systems that employ these new technologies are optimized either to be fast (but expensive) or cheap (but slow). We take a different approach: by combining multiple tiers of fast and low-cost storage technologies within the same system, we can achieve a Pareto-efficient balance between performance and cost-per-bit. This paper presents the design and implementation of PrismDB, a novel log-structured merge tree based key-value store that exploits a full spectrum of heterogeneous storage technologies (from 3D XPoint to QLC NAND). We introduce the notion of "read-awareness" to log-structured merge trees, which allows hot objects to be pinned to faster storage, achieving better tiering and hot-cold separation of objects. Compared to the standard use of RocksDB on flash in datacenters today, PrismDB's average throughput on heterogeneous storage is 2.3×\times faster and its tail latency is more than an order of magnitude better, using hardware than is half the cost

    RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design

    Full text link
    Software-defined networking (SDN) and software-defined flash (SDF) have been serving as the backbone of modern data centers. They are managed separately to handle I/O requests. At first glance, this is a reasonable design by following the rack-scale hierarchical design principles. However, it suffers from suboptimal end-to-end performance, due to the lack of coordination between SDN and SDF. In this paper, we co-design the SDN and SDF stack by redefining the functions of their control plane and data plane, and splitting up them within a new architecture named RackBlox. RackBlox decouples the storage management functions of flash-based solid-state drives (SSDs), and allow the SDN to track and manage the states of SSDs in a rack. Therefore, we can enable the state sharing between SDN and SDF, and facilitate global storage resource management. RackBlox has three major components: (1) coordinated I/O scheduling, in which it dynamically adjusts the I/O scheduling in the storage stack with the measured and predicted network latency, such that it can coordinate the effort of I/O scheduling across the network and storage stack for achieving predictable end-to-end performance; (2) coordinated garbage collection (GC), in which it will coordinate the GC activities across the SSDs in a rack to minimize their impact on incoming I/O requests; (3) rack-scale wear leveling, in which it enables global wear leveling among SSDs in a rack by periodically swapping data, for achieving improved device lifetime for the entire rack. We implement RackBlox using programmable SSDs and switch. Our experiments demonstrate that RackBlox can reduce the tail latency of I/O requests by up to 5.8x over state-of-the-art rack-scale storage systems.Comment: 14 pages. Published in published in ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP'23

    WLFC: Write Less in Flash-based Cache

    Full text link
    Flash-based disk caches, for example Bcache and Flashcache, has gained tremendous popularity in industry in the last decade because of its low energy consumption, non-volatile nature and high I/O speed. But these cache systems have a worse write performance than the read performance because of the asymmetric I/O costs and the the internal GC mechanism. In addition to the performance issues, since the NAND flash is a type of EEPROM device, the lifespan is also limited by the Program/Erase (P/E) cycles. So how to improve the performance and the lifespan of flash-based caches in write-intensive scenarios has always been a hot issue. Benefiting from Open-Channel SSDs (OCSSDs), we propose a write-friendly flash-based disk cache system, which is called WLFC (Write Less in the Flash-based Cache). In WLFC, a strictly sequential writing method is used to minimize the write amplification. A new replacement algorithm for the write buffer is designed to minimize the erase count caused by the evicting. And a new data layout strategy is designed to minimize the metadata size persisted in SSDs. As a result, the Over-Provisioned (OP) space is completely removed, the erase count of the flash is greatly reduced, and the metadata size is 1/10 or less than that in BCache. Even with a small amount of metadata, the data consistency after the crash is still guaranteed. Compared with the existing mechanism, WLFC brings a 7%-80% reduction in write latency, a 1.07*-4.5* increment in write throughput, and a 50%-88.9% reduction in erase count, with a moderate overhead in read performance

    Sync+Sync: A Covert Channel Built on fsync with Storage

    Full text link
    Scientists have built a variety of covert channels for secretive information transmission with CPU cache and main memory. In this paper, we turn to a lower level in the memory hierarchy, i.e., persistent storage. Most programs store intermediate or eventual results in the form of files and some of them call fsync to synchronously persist a file with storage device for orderly persistence. Our quantitative study shows that one program would undergo significantly longer response time for fsync call if the other program is concurrently calling fsync, although they do not share any data. We further find that, concurrent fsync calls contend at multiple levels of storage stack due to sharing software structures (e.g., Ext4's journal) and hardware resources (e.g., disk's I/O dispatch queue). We accordingly build a covert channel named Sync+Sync. Sync+Sync delivers a transmission bandwidth of 20,000 bits per second at an error rate of about 0.40% with an ordinary solid-state drive. Sync+Sync can be conducted in cross-disk partition, cross-file system, cross-container, cross-virtual machine, and even cross-disk drive fashions, without sharing data between programs. Next, we launch side-channel attacks with Sync+Sync and manage to precisely detect operations of a victim database (e.g., insert/update and B-Tree node split). We also leverage Sync+Sync to distinguish applications and websites with high accuracy by detecting and analyzing their fsync frequencies and flushed data volumes. These attacks are useful to support further fine-grained information leakage.Comment: A full version for the paper with the same title accepted by the 33rd USENIX Security Symposium (USENIX Security 2024

    DINOMO: An Elastic, Scalable, High-Performance Key-Value Store for Disaggregated Persistent Memory (Extended Version)

    Full text link
    We present Dinomo, a novel key-value store for disaggregated persistent memory (DPM). Dinomo is the first key-value store for DPM that simultaneously achieves high common-case performance, scalability, and lightweight online reconfiguration. We observe that previously proposed key-value stores for DPM had architectural limitations that prevent them from achieving all three goals simultaneously. Dinomo uses a novel combination of techniques such as ownership partitioning, disaggregated adaptive caching, selective replication, and lock-free and log-free indexing to achieve these goals. Compared to a state-of-the-art DPM key-value store, Dinomo achieves at least 3.8x better throughput on various workloads at scale and higher scalability, while providing fast reconfiguration.Comment: This is an extended version of the full paper to appear in PVLDB 15.13 (VLDB 2023

    Replicating Persistent Memory Key-Value Stores with Efficient RDMA Abstraction

    Full text link
    Combining persistent memory (PM) with RDMA is a promising approach to performant replicated distributed key-value stores (KVSs). However, existing replication approaches do not work well when applied to PM KVSs: 1) Using RPC induces software queueing and execution at backups, increasing request latency; 2) Using one-sided RDMA WRITE causes many streams of small PM writes, leading to severe device-level write amplification (DLWA) on PM. In this paper, we propose Rowan, an efficient RDMA abstraction to handle replication writes in PM KVSs; it aggregates concurrent remote writes from different servers, and lands these writes to PM in a sequential (thus low DLWA) and one-sided (thus low latency) manner. We realize Rowan with off-the-shelf RDMA NICs. Further, we build Rowan-KV, a log-structured PM KVS using Rowan for replication. Evaluation shows that under write-intensive workloads, compared with PM KVSs using RPC and RDMA WRITE for replication, Rowan-KV boosts throughput by 1.22X and 1.39X as well as lowers median PUT latency by 1.77X and 2.11X, respectively, while largely eliminating DLWA.Comment: Accepted to OSDI 202
    • …
    corecore