Distributed Multi-writer Multi-reader Atomic Register with Optimistically Fast Read and Write
A distributed multi-writer multi-reader (MWMR) atomic register is an
important primitive that enables a wide range of distributed algorithms. Hence,
improving its performance can have large-scale consequences. Since the seminal
work on ABD emulation in message-passing networks [JACM '95], many
researchers have studied fast implementations of atomic registers under various
conditions. "Fast" means that a read or a write can be completed in 1
round-trip time (RTT), by contacting a simple majority. In this work, we
explore an atomic register with optimal resilience and "optimistically fast"
read and write operations. That is, both operations can be fast if there is no
concurrent write.
This paper has three contributions: (i) We present Gus, an emulation of an
MWMR atomic register with optimal resilience and optimistically fast reads and
writes when there are up to 5 nodes; (ii) We show that when there are > 5
nodes, it is impossible to emulate an MWMR atomic register with both
properties; and (iii) We implement Gus in the framework of EPaxos and Gryff,
and show that Gus provides lower tail latency than state-of-the-art systems
such as EPaxos, Gryff, Giza, and Tempo under various workloads in the context
of geo-replicated object storage systems.
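To make the fast/slow dichotomy concrete, here is a minimal single-process sketch of the optimistic fast path: a read completes in one round when a majority reports the same (timestamp, value) pair, and otherwise falls back to an ABD-style write-back round. The ToyRegister class, its fields, and the fixed-majority shortcut are illustrative assumptions, not Gus's actual protocol or message format.

```python
# Toy sketch of "optimistically fast" reads in an ABD-style register.
# Everything here is a simplification: real protocols exchange messages
# with arbitrary majorities instead of touching a fixed replica slice.
from dataclasses import dataclass

@dataclass
class Replica:
    ts: int = 0        # logical timestamp of the stored value
    val: object = None

class ToyRegister:
    def __init__(self, n=5):
        self.replicas = [Replica() for _ in range(n)]
        self.majority = n // 2 + 1

    def write(self, val):
        # Stands in for a query round; an optimistically fast write
        # would piggyback or skip this under no contention.
        ts = max(r.ts for r in self.replicas) + 1
        for r in self.replicas[:self.majority]:
            r.ts, r.val = ts, val

    def read(self):
        acks = [(r.ts, r.val) for r in self.replicas[:self.majority]]
        top = max(acks)
        if all(a == top for a in acks):
            return top[1]   # fast path: 1 RTT, no concurrent write seen
        # Slow path: write back the freshest value before returning (ABD).
        for r in self.replicas[:self.majority]:
            if (r.ts, r.val) < top:
                r.ts, r.val = top
        return top[1]

reg = ToyRegister()
reg.write("x")
print(reg.read())   # -> "x", taken on the fast path
```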
PrismDB: Read-aware Log-structured Merge Trees for Heterogeneous Storage
In recent years, emerging hardware storage technologies have focused on
divergent goals: better performance or lower cost-per-bit of storage.
Correspondingly, data systems that employ these new technologies are optimized
either to be fast (but expensive) or cheap (but slow). We take a different
approach: by combining multiple tiers of fast and low-cost storage technologies
within the same system, we can achieve a Pareto-efficient balance between
performance and cost-per-bit.
This paper presents the design and implementation of PrismDB, a novel
log-structured merge tree based key-value store that exploits a full spectrum
of heterogeneous storage technologies (from 3D XPoint to QLC NAND). We
introduce the notion of "read-awareness" to log-structured merge trees, which
allows hot objects to be pinned to faster storage, achieving better tiering and
hot-cold separation of objects. Compared to the standard use of RocksDB on
flash in datacenters today, PrismDB's average throughput on heterogeneous
storage is 2.3x faster and its tail latency is more than an order of
magnitude better, using hardware that is half the cost.
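As a rough illustration of read-awareness, the sketch below assumes a simple per-key read counter and two tiers; the TieredPlacer class and its fast_capacity parameter are hypothetical, and PrismDB's real tracking and placement policy is more refined.

```python
# Minimal sketch of "read-aware" placement: the hottest keys are pinned
# to the fast tier during compaction, the rest are demoted. Assumes a
# plain access counter; PrismDB's actual policy differs.
from collections import Counter

class TieredPlacer:
    def __init__(self, fast_capacity):
        self.reads = Counter()            # per-key read popularity
        self.fast_capacity = fast_capacity

    def record_read(self, key):
        self.reads[key] += 1

    def place(self, keys):
        """Pin the hottest keys to the fast tier (e.g., 3D XPoint)
        and demote the rest (e.g., QLC NAND)."""
        hot = {k for k, _ in self.reads.most_common(self.fast_capacity)}
        return {k: ("fast" if k in hot else "slow") for k in keys}

placer = TieredPlacer(fast_capacity=2)
for k in ["a", "a", "b", "c", "a", "b"]:
    placer.record_read(k)
print(placer.place(["a", "b", "c", "d"]))
# -> {'a': 'fast', 'b': 'fast', 'c': 'slow', 'd': 'slow'}
```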
RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design
Software-defined networking (SDN) and software-defined flash (SDF) have been
serving as the backbone of modern data centers. They are managed separately to
handle I/O requests. At first glance, this is a reasonable design by following
the rack-scale hierarchical design principles. However, it suffers from
suboptimal end-to-end performance, due to the lack of coordination between SDN
and SDF.
In this paper, we co-design the SDN and SDF stack by redefining the functions
of their control plane and data plane, and splitting them up within a new
architecture named RackBlox. RackBlox decouples the storage management
functions of flash-based solid-state drives (SSDs), and allows the SDN to track
and manage the states of SSDs in a rack. This enables state sharing between
SDN and SDF, and facilitates global storage resource management.
RackBlox has three major components: (1) coordinated I/O scheduling, in which
it dynamically adjusts the I/O scheduling in the storage stack with the
measured and predicted network latency, such that it can coordinate the effort
of I/O scheduling across the network and storage stack for achieving
predictable end-to-end performance; (2) coordinated garbage collection (GC), in
which it coordinates the GC activities across the SSDs in a rack to
minimize their impact on incoming I/O requests; (3) rack-scale wear leveling,
in which it enables global wear leveling among SSDs in a rack by periodically
swapping data, for achieving improved device lifetime for the entire rack. We
implement RackBlox using programmable SSDs and a programmable switch. Our experiments
demonstrate that RackBlox can reduce the tail latency of I/O requests by up to
5.8x over state-of-the-art rack-scale storage systems.
Comment: 14 pages. Published in ACM SIGOPS 29th Symposium on
Operating Systems Principles (SOSP'23).
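The coordinated-GC idea can be sketched roughly as follows, assuming a rack-level coordinator that grants a GC "token" to one SSD at a time and steers reads to GC-free replicas; the RackCoordinator class is a hypothetical stand-in that ignores the programmable-switch data plane and the latency-prediction machinery.

```python
# Toy sketch of coordinated GC: at most one SSD collects at a time, and
# reads are routed away from it so GC never sits on their critical path.
import itertools

class RackCoordinator:
    def __init__(self, ssds, replicas_of):
        self.gc_token = itertools.cycle(ssds)   # one SSD may GC at a time
        self.in_gc = None
        self.replicas_of = replicas_of          # key -> list of replica SSDs

    def start_gc_round(self):
        self.in_gc = next(self.gc_token)
        return self.in_gc

    def route_read(self, key):
        # Prefer a replica whose SSD is not garbage-collecting.
        candidates = self.replicas_of[key]
        return next((s for s in candidates if s != self.in_gc),
                    candidates[0])

coord = RackCoordinator(["ssd0", "ssd1", "ssd2"],
                        {"k": ["ssd0", "ssd1"]})
coord.start_gc_round()        # ssd0 enters GC
print(coord.route_read("k"))  # -> ssd1, the GC-free replica
```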
WLFC: Write Less in Flash-based Cache
Flash-based disk caches, for example Bcache and Flashcache, have gained
tremendous popularity in industry in the last decade because of their low
energy consumption, non-volatile nature and high I/O speed. But these cache
systems have worse write performance than read performance because of the
asymmetric I/O costs and the internal GC mechanism. In addition to the
performance issues, since the NAND flash is a type of EEPROM device, the
lifespan is also limited by the Program/Erase (P/E) cycles. So how to improve
the performance and the lifespan of flash-based caches in write-intensive
scenarios has always been a hot issue. Benefiting from Open-Channel SSDs
(OCSSDs), we propose a write-friendly flash-based disk cache system, which is
called WLFC (Write Less in the Flash-based Cache). In WLFC, a strictly
sequential writing method is used to minimize the write amplification. A new
replacement algorithm for the write buffer is designed to minimize the erase
count caused by eviction, and a new data layout strategy is designed to
minimize the metadata size persisted in SSDs. As a result, the Over-Provisioned
(OP) space is completely removed, the erase count of the flash is greatly
reduced, and the metadata size is 1/10 or less of that in Bcache. Even with a
small amount of metadata, the data consistency after the crash is still
guaranteed. Compared with the existing mechanism, WLFC brings a 7%-80%
reduction in write latency, a 1.07x-4.5x improvement in write throughput, and a
50%-88.9% reduction in erase count, with a moderate overhead in read
performance.
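A toy model of the strictly sequential writing method: pages are only ever appended to the current block, and space is reclaimed by erasing whole blocks in write order, so no page-level read-modify-write occurs. The block geometry and the SequentialCache class are made-up illustrations, not WLFC's on-device layout.

```python
# Strictly sequential writing into open-channel flash blocks: append-only
# at a single head block, whole-block erase in write order.
BLOCK_PAGES = 4   # toy geometry

class SequentialCache:
    def __init__(self, num_blocks):
        self.blocks = [[] for _ in range(num_blocks)]
        self.head = 0            # block currently being appended to
        self.erases = 0

    def append(self, page):
        if len(self.blocks[self.head]) == BLOCK_PAGES:
            self.head = (self.head + 1) % len(self.blocks)
            if self.blocks[self.head]:       # recycle the oldest block
                self.blocks[self.head] = []  # one erase per whole block,
                self.erases += 1             # never per page
        self.blocks[self.head].append(page)

cache = SequentialCache(num_blocks=2)
for i in range(16):
    cache.append(f"p{i}")
print(cache.erases)   # -> 2: erases scale with recycled blocks only
```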
Sync+Sync: A Covert Channel Built on fsync with Storage
Scientists have built a variety of covert channels for secretive information
transmission with CPU cache and main memory. In this paper, we turn to a lower
level in the memory hierarchy, i.e., persistent storage. Most programs store
intermediate or eventual results in the form of files and some of them call
fsync to synchronously persist a file with the storage device for orderly
persistence. Our quantitative study shows that a program undergoes a
significantly longer response time for an fsync call if another program is
concurrently calling fsync, although the two do not share any data. We further
find that concurrent fsync calls contend at multiple levels of the storage stack
due to sharing software structures (e.g., Ext4's journal) and hardware
resources (e.g., disk's I/O dispatch queue).
We accordingly build a covert channel named Sync+Sync. Sync+Sync delivers a
transmission bandwidth of 20,000 bits per second at an error rate of about
0.40% with an ordinary solid-state drive. Sync+Sync can be conducted in
cross-disk partition, cross-file system, cross-container, cross-virtual
machine, and even cross-disk drive fashions, without sharing data between
programs. Next, we launch side-channel attacks with Sync+Sync and manage to
precisely detect operations of a victim database (e.g., insert/update and
B-Tree node split). We also leverage Sync+Sync to distinguish applications and
websites with high accuracy by detecting and analyzing their fsync frequencies
and flushed data volumes. These attacks can be used to support further
fine-grained information leakage.
Comment: A full version of the paper with the same title accepted by the 33rd
USENIX Security Symposium (USENIX Security 2024).
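A toy sketch of such a channel's encoding, assuming the sender and receiver run as separate processes against the same storage device: the sender hammers fsync during a time slot to transmit a 1 and idles to transmit a 0, while the receiver times fsync on its own, unrelated file. The slot length and threshold below are illustrative guesses, not the paper's tuned parameters, and real use would need clock alignment between the two sides.

```python
# Toy fsync covert channel: contention on the shared storage stack makes
# the receiver's fsync slower whenever the sender is fsync-ing, even
# though the two processes share no data.
import os, time

SLOT = 0.05  # seconds per bit (toy value)

def sender(path, bits):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    for b in bits:
        end = time.monotonic() + SLOT
        while time.monotonic() < end:
            if b:                      # bit 1: hammer fsync
                os.write(fd, b"x")
                os.fsync(fd)
            else:                      # bit 0: stay idle
                time.sleep(0.001)
    os.close(fd)

def receiver(path, nbits, threshold_s):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    bits = []
    for _ in range(nbits):
        lat, end = [], time.monotonic() + SLOT
        while time.monotonic() < end:
            t0 = time.monotonic()
            os.write(fd, b"y")
            os.fsync(fd)               # probe the shared stack
            lat.append(time.monotonic() - t0)
        # Slow fsyncs in this slot imply a concurrent sender.
        bits.append(1 if max(lat) > threshold_s else 0)
    os.close(fd)
    return bits
```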
DINOMO: An Elastic, Scalable, High-Performance Key-Value Store for Disaggregated Persistent Memory (Extended Version)
We present Dinomo, a novel key-value store for disaggregated persistent
memory (DPM). Dinomo is the first key-value store for DPM that simultaneously
achieves high common-case performance, scalability, and lightweight online
reconfiguration. We observe that previously proposed key-value stores for DPM
have architectural limitations that prevent them from achieving all three goals
simultaneously. Dinomo uses a novel combination of techniques such as ownership
partitioning, disaggregated adaptive caching, selective replication, and
lock-free and log-free indexing to achieve these goals. Compared to a
state-of-the-art DPM key-value store, Dinomo achieves at least 3.8x better
throughput on various workloads at scale and higher scalability, while
providing fast reconfiguration.
Comment: This is an extended version of the full paper to appear in PVLDB
15.13 (VLDB 2023).
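Ownership partitioning can be approximated with a consistent-hashing ring that maps keys to front-end owners while the data itself stays in disaggregated PM; the OwnershipRing class below is an assumption-laden illustration of that one idea, not Dinomo's actual mechanism, which layers adaptive caching and selective replication on top.

```python
# Consistent-hash sketch of ownership partitioning: reconfiguration only
# reassigns which node *owns* a key range; PM-resident data never moves,
# which is what keeps reconfiguration lightweight.
import bisect, hashlib

class OwnershipRing:
    def __init__(self, nodes, vnodes=8):
        self.ring = sorted(
            (int(hashlib.md5(f"{n}#{i}".encode()).hexdigest(), 16), n)
            for n in nodes for i in range(vnodes))

    def owner(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        i = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[i][1]

ring = OwnershipRing(["kn0", "kn1", "kn2"])
print(ring.owner("user:42"))
# Adding a node rebuilds the ring; only ownership metadata changes.
```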
Replicating Persistent Memory Key-Value Stores with Efficient RDMA Abstraction
Combining persistent memory (PM) with RDMA is a promising approach to
performant replicated distributed key-value stores (KVSs). However, existing
replication approaches do not work well when applied to PM KVSs: 1) Using RPC
induces software queueing and execution at backups, increasing request latency;
2) Using one-sided RDMA WRITE causes many streams of small PM writes, leading
to severe device-level write amplification (DLWA) on PM. In this paper, we
propose Rowan, an efficient RDMA abstraction to handle replication writes in PM
KVSs; it aggregates concurrent remote writes from different servers, and lands
these writes to PM in a sequential (thus low DLWA) and one-sided (thus low
latency) manner. We realize Rowan with off-the-shelf RDMA NICs. Further, we
build Rowan-KV, a log-structured PM KVS using Rowan for replication. Evaluation
shows that under write-intensive workloads, compared with PM KVSs using RPC and
RDMA WRITE for replication, Rowan-KV boosts throughput by 1.22X and 1.39X as
well as lowers median PUT latency by 1.77X and 2.11X, respectively, while
largely eliminating DLWA.
Comment: Accepted to OSDI 2023.
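The aggregation idea can be caricatured in plain Python: records from many concurrent senders land through a single append point, yielding one sequential stream instead of scattered small writes. Real Rowan achieves this one-sidedly inside the RDMA NIC with no receiver CPU on the path; the SequentialLandingLog class and its lock are purely illustrative.

```python
# Toy sketch of sequential landing: concurrent replication writes are
# aggregated into one append-only stream, so the PM device sees large
# sequential writes (low DLWA) instead of many small scattered ones.
import threading

class SequentialLandingLog:
    def __init__(self):
        self.log = bytearray()
        self.lock = threading.Lock()   # stands in for NIC-side ordering

    def land(self, payload: bytes) -> int:
        with self.lock:
            off = len(self.log)        # single append point
            self.log += payload
            return off                 # where this record landed

log = SequentialLandingLog()
threads = [threading.Thread(target=log.land, args=(f"rec{i}".encode(),))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(log.log))   # all records landed back-to-back in one stream
```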