PrismDB: Read-aware Log-structured Merge Trees for Heterogeneous Storage
In recent years, emerging hardware storage technologies have focused on
divergent goals: better performance or lower cost-per-bit of storage.
Correspondingly, data systems that employ these new technologies are optimized
either to be fast (but expensive) or cheap (but slow). We take a different
approach: by combining multiple tiers of fast and low-cost storage technologies
within the same system, we can achieve a Pareto-efficient balance between
performance and cost-per-bit.
This paper presents the design and implementation of PrismDB, a novel
log-structured merge tree based key-value store that exploits a full spectrum
of heterogeneous storage technologies (from 3D XPoint to QLC NAND). We
introduce the notion of "read-awareness" to log-structured merge trees, which
allows hot objects to be pinned to faster storage, achieving better tiering and
hot-cold separation of objects. Compared to the standard use of RocksDB on
flash in datacenters today, PrismDB's average throughput on heterogeneous
storage is 2.3x faster and its tail latency is more than an order of
magnitude better, using hardware that is half the cost.
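To make the idea concrete, here is a minimal Python sketch of a read-aware placement policy that pins frequently read keys to the fast tier during compaction; the class, threshold, and tier names are illustrative assumptions, not PrismDB's actual code.

```python
# Minimal sketch of read-aware tiering: objects whose estimated read
# frequency exceeds a threshold are pinned to the fast tier during
# compaction. Names and thresholds are invented for illustration.
from collections import Counter

FAST_TIER, SLOW_TIER = "3dxpoint", "qlc_nand"

class ReadAwarePlacer:
    def __init__(self, hot_threshold=100):
        self.read_counts = Counter()   # approximate per-key read popularity
        self.hot_threshold = hot_threshold

    def record_read(self, key):
        self.read_counts[key] += 1

    def tier_for(self, key):
        # Hot keys stay on the fast, expensive tier; cold keys are
        # demoted to the cheap, dense tier.
        if self.read_counts[key] >= self.hot_threshold:
            return FAST_TIER
        return SLOW_TIER

placer = ReadAwarePlacer(hot_threshold=3)
for _ in range(3):
    placer.record_read("user:42")
assert placer.tier_for("user:42") == FAST_TIER   # hot: pinned fast
assert placer.tier_for("user:7") == SLOW_TIER    # cold: demoted
```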
Understanding (Un)Written Contracts of NVMe ZNS Devices with zns-tools
Operational and performance characteristics of flash SSDs have long been
associated with a set of Unwritten Contracts due to their hidden, complex
internals and lack of control from the host software stack. These unwritten
contracts govern how data should be stored, accessed, and garbage collected.
The emergence of Zoned Namespace (ZNS) flash devices with their open and
standardized interface allows us to write these unwritten contracts for the
storage stack. However, even with a standardized storage-host interface, due to
the lack of appropriate end-to-end operational data collection tools, the
quantification and reasoning of such contracts remain a challenge. In this
paper, we propose zns.tools, an open-source framework for end-to-end event
and metadata collection, analysis, and visualization for ZNS SSD contract
analysis. We showcase how zns.tools can be used to understand how the
combination of RocksDB with the F2FS file system interacts with the underlying
storage. Our tools are openly available at
\url{https://github.com/stonet-research/zns-tools}.
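As a toy illustration of the kind of correlation such a framework performs, the following Python sketch maps file extents on a zoned device to the zones they occupy, so writes can be attributed per zone; the extent records and zone size are invented for the example and are not zns.tools output.

```python
# Toy zone attribution: map file extents to the ZNS zones they fall in.
ZONE_SIZE = 512 * 1024 * 1024  # bytes; device-specific in practice

extents = [  # (file, start_offset_bytes, length_bytes), made up
    ("db/000123.sst", 0x00000000, 64 << 20),
    ("db/000124.sst", 0x20000000, 64 << 20),
]

def zones_for(start, length, zone_size=ZONE_SIZE):
    first = start // zone_size
    last = (start + length - 1) // zone_size
    return range(first, last + 1)

per_zone = {}
for name, start, length in extents:
    for z in zones_for(start, length):
        per_zone.setdefault(z, []).append(name)

for zone, files in sorted(per_zone.items()):
    print(f"zone {zone}: {files}")
```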
Simurgh: a fully decentralized and secure NVMM user space file system
The availability of non-volatile main memory (NVMM) has started a new era for storage systems, and NVMM-specific file systems can support the extremely high data and metadata rates required by many HPC and data-intensive applications. Scaling metadata performance within NVMM file systems is nevertheless often restricted by the Linux kernel storage stack, while simply moving metadata management to user space can compromise security or flexibility. This paper introduces Simurgh, a hardware-assisted user space file system with decentralized metadata management that allows secure metadata updates from within user space. Simurgh guarantees consistency, durability, and ordering of updates without sacrificing scalability. Security is enforced by only allowing NVMM access from protected user space functions, which can be implemented through two proposed instructions. Comparisons with other NVMM file systems show that Simurgh improves metadata performance up to 18x and application performance up to 89% compared to the second-fastest file system. This work has been supported by the European Commission's BigStorage project H2020-MSCA-ITN2014-642963. It is also supported by the Big Data in Atmospheric Physics (BINARY) project, funded by the Carl Zeiss Foundation under Grant No.: P2018-02-003.
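The ordering guarantee at the heart of such a design can be sketched in a few lines: persist the metadata record before the commit flag that makes it visible. The Python below uses an mmap'd file as a stand-in for NVMM stores plus cache-line flushes, with an invented record layout; it is a sketch of the discipline, not Simurgh's implementation.

```python
# Ordered, durable metadata update: record body first, commit flag second.
import mmap
import struct
import tempfile

RECORD = struct.Struct("<Q32s")  # (inode number, file name); hypothetical layout

with tempfile.TemporaryFile() as f:
    f.truncate(4096)
    mem = mmap.mmap(f.fileno(), 4096)

    # Step 1: write the record body into the journal area and persist it.
    RECORD.pack_into(mem, 8, 42, b"hello.txt".ljust(32, b"\0"))
    mem.flush()  # stands in for clwb + sfence on real NVMM

    # Step 2: only now flip the commit flag. A crash before this point
    # leaves the record invisible, preserving consistency.
    struct.pack_into("<Q", mem, 0, 1)
    mem.flush()
    mem.close()
```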
RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design
Software-defined networking (SDN) and software-defined flash (SDF) have been
serving as the backbone of modern data centers. They are managed separately to
handle I/O requests. At first glance, this is a reasonable design by following
the rack-scale hierarchical design principles. However, it suffers from
suboptimal end-to-end performance, due to the lack of coordination between SDN
and SDF.
In this paper, we co-design the SDN and SDF stacks by redefining the functions
of their control and data planes, and splitting them up within a new
architecture named RackBlox. RackBlox decouples the storage management
functions of flash-based solid-state drives (SSDs) and allows the SDN to track
and manage the states of SSDs in a rack. Therefore, we can enable state
sharing between SDN and SDF and facilitate global storage resource management.
RackBlox has three major components: (1) coordinated I/O scheduling, in which
it dynamically adjusts the I/O scheduling in the storage stack with the
measured and predicted network latency, such that it can coordinate the effort
of I/O scheduling across the network and storage stack for achieving
predictable end-to-end performance; (2) coordinated garbage collection (GC), in
which it will coordinate the GC activities across the SSDs in a rack to
minimize their impact on incoming I/O requests; (3) rack-scale wear leveling,
in which it enables global wear leveling among SSDs in a rack by periodically
swapping data, achieving improved device lifetime for the entire rack. We
implement RackBlox using programmable SSDs and a programmable switch. Our
experiments demonstrate that RackBlox can reduce the tail latency of I/O
requests by up to 5.8x over state-of-the-art rack-scale storage
systems. Comment: 14 pages. Published in the ACM SIGOPS 29th Symposium on
Operating Systems Principles (SOSP'23).
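A toy sketch of the coordinated-GC idea: when rack-wide GC state is visible to the control plane, a read can be steered to a replica whose SSD is not currently collecting, avoiding GC-induced tail latency. The state tables and names below are illustrative assumptions, not RackBlox's interfaces.

```python
# GC-aware replica selection: avoid SSDs that are in a GC window.
gc_active = {"ssd-a": False, "ssd-b": True, "ssd-c": False}
replicas = {"key:42": ["ssd-b", "ssd-a", "ssd-c"]}  # preference order

def pick_replica(key):
    candidates = replicas[key]
    # Prefer the first replica whose device is not garbage-collecting;
    # fall back to the primary if every replica is collecting.
    for ssd in candidates:
        if not gc_active[ssd]:
            return ssd
    return candidates[0]

print(pick_replica("key:42"))  # -> ssd-a, since ssd-b is busy collecting
```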
Understanding and Optimizing Flash-based Key-value Systems in Data Centers
Flash-based key-value systems are widely deployed in today's data centers to provide high-speed data processing services. These systems deploy flash-friendly data structures, such as slabs and Log-Structured Merge (LSM) trees, on flash-based Solid State Drives (SSDs) and provide efficient solutions in caching and storage scenarios. As data centers evolve rapidly, plenty of challenges and opportunities for future optimizations emerge.
In this dissertation, we focus on understanding and optimizing flash-based key-value systems from the perspectives of workloads, software, and hardware as data centers evolve. We first propose an on-line compression scheme, called SlimCache, that exploits the unique characteristics of key-value workloads to virtually enlarge the cache space, increase the hit ratio, and improve cache performance. Furthermore, to appropriately configure increasingly complex modern key-value data systems, which can have more than 50 parameters in addition to hardware and system settings, we quantitatively study and compare five multi-objective optimization methods for auto-tuning the performance of an LSM-tree based key-value store in terms of throughput, 99th-percentile tail latency, convergence time, real-time system throughput, and the iteration process. Last but not least, we conduct an in-depth, comprehensive measurement study of flash-optimized key-value stores on recently emerging 3D XPoint SSDs. We reveal several unexpected bottlenecks in current key-value store designs and present three exemplary case studies to showcase the efficacy of removing these bottlenecks with simple methods on 3D XPoint SSDs. Our experimental results show that our proposed solutions significantly outperform traditional methods. Our study also provides system implications for auto-tuning key-value systems on flash-based SSDs and optimizing them on revolutionary 3D XPoint based SSDs.
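The intuition behind on-line cache compression can be sketched briefly: store a value compressed only when the space saving justifies the decompression cost on reads. The Python below is a hedged illustration with an invented threshold, not SlimCache's actual scheme.

```python
# Selective on-line compression: keep the compressed form only when it
# saves enough space; small or incompressible values stay raw.
import zlib

def store(value: bytes, min_gain=0.2):
    packed = zlib.compress(value)
    # Require at least min_gain fractional space saving to justify
    # paying decompression cost on every read.
    if len(packed) <= len(value) * (1 - min_gain):
        return ("z", packed)
    return ("raw", value)

def load(entry):
    tag, data = entry
    return zlib.decompress(data) if tag == "z" else data

entry = store(b"aaaaaaaaaabbbbbbbbbb" * 50)  # highly compressible
assert entry[0] == "z"
assert load(entry) == b"aaaaaaaaaabbbbbbbbbb" * 50
```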