83 research outputs found
PrismDB: Read-aware Log-structured Merge Trees for Heterogeneous Storage
In recent years, emerging hardware storage technologies have focused on
divergent goals: better performance or lower cost-per-bit of storage.
Correspondingly, data systems that employ these new technologies are optimized
either to be fast (but expensive) or cheap (but slow). We take a different
approach: by combining multiple tiers of fast and low-cost storage technologies
within the same system, we can achieve a Pareto-efficient balance between
performance and cost-per-bit.
This paper presents the design and implementation of PrismDB, a novel
log-structured merge tree based key-value store that exploits a full spectrum
of heterogeneous storage technologies (from 3D XPoint to QLC NAND). We
introduce the notion of "read-awareness" to log-structured merge trees, which
allows hot objects to be pinned to faster storage, achieving better tiering and
hot-cold separation of objects. Compared to the standard use of RocksDB on
flash in datacenters today, PrismDB's average throughput on heterogeneous
storage is 2.3 faster and its tail latency is more than an order of
magnitude better, using hardware than is half the cost
Elevating commodity storage with the SALSA host translation layer
To satisfy increasing storage demands in both capacity and performance,
industry has turned to multiple storage technologies, including Flash SSDs and
SMR disks. These devices employ a translation layer that conceals the
idiosyncrasies of their mediums and enables random access. Device translation
layers are, however, inherently constrained: resources on the drive are scarce,
they cannot be adapted to application requirements, and lack visibility across
multiple devices. As a result, performance and durability of many storage
devices is severely degraded.
In this paper, we present SALSA: a translation layer that executes on the
host and allows unmodified applications to better utilize commodity storage.
SALSA supports a wide range of single- and multi-device optimizations and,
because is implemented in software, can adapt to specific workloads. We
describe SALSA's design, and demonstrate its significant benefits using
microbenchmarks and case studies based on three applications: MySQL, the Swift
object store, and a video server.Comment: Presented at 2018 IEEE 26th International Symposium on Modeling,
Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS
Recommended from our members
Building Distributed Systems with Non-Volatile Main Memories and RDMA Networks
High-performance, byte-addressable non-volatile main memories (NVMMs) allow application developers to combine storage and memory into a single layer. These high-performance storage systems would be especially useful in large-scale data center environments where data is distributed and replicated across multiple servers.Unfortunately, existing approaches of providing remote storage access rest on the assumption that storage is slow, so the cost of the software and protocols is acceptable. Such assumption no longer holds for the fast NVMM. As a result, taking full advantage of NVMMs’ potential will require changes in system software and networking protocol. This thesis focuses on accessing remote NVMM efficiently using remote direct memory access (RDMA) network. RDMA enables a client to directly access memory on a remote machine without involving its local CPU.This thesis first presents Mojim, a system that provides replicated, reliable, and highly-available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Our evaluation shows Mojim adds little overhead to the un-replicated system and provides 0.4x to 2.7x the throughput of the un-replicated system.This thesis then presents Orion, a distributed file system designed from for NVMM and RDMA networks. Traditional distributed file systems are designed for slower hard drives. These slower media incentivizes complex optimizations (e.g., queuing, striping, and batching) around disk accesses. Orion combines file system functions and network operations into a single layer. It provides low latency metadata accesses and outperforms existing distributed file systems by a large margin.Finally, an NVMM application can map files backed by an NVMM file system into its address space, and accesses them using CPU instructions. In this case, RDMA and NVMM file systems introduce duplication of effort on permissions, naming, and address translation. We introduce two changes to the existing RDMA protocol: the file memory region (FileMR) and range based address translation. By eliminating redundant translations, FileMR minimizes the number of translations done at the NIC, reducing the load on the NIC’s translation cache and resulting in application performance improvement by 1.8x - 2.0x
WiscSort: External Sorting For Byte-Addressable Storage
We present WiscSort, a new approach to high-performance concurrent sorting
for existing and future byte-addressable storage (BAS) devices. WiscSort
carefully reduces writes, exploits random reads by splitting keys and values
during sorting, and performs interference-aware scheduling with thread pool
sizing to avoid I/O bandwidth degradation. We introduce the BRAID model which
encompasses the unique characteristics of BAS devices. Many state-of-the-art
sorting systems do not comply with the BRAID model and deliver sub-optimal
performance, whereas WiscSort demonstrates the effectiveness of complying with
BRAID. We show that WiscSort is 2-7x faster than competing approaches on a
standard sort benchmark. We evaluate the effectiveness of key-value separation
on different key-value sizes and compare our concurrency optimizations with
various other concurrency models. Finally, we emulate generic BAS devices and
show how our techniques perform well with various combinations of hardware
properties
- …