353 research outputs found
Optimal column layout for hybrid workloads
Data-intensive analytical applications need to support both efficient reads and writes. However, what is usually a good data layout for an update-heavy workload, is not well-suited for a read-mostly one and vice versa. Modern analytical data systems rely on columnar layouts and employ delta stores to inject new data and updates. We show that for hybrid workloads we can achieve close to one order of magnitude better performance by tailoring the column layout design to the data and query workload. Our approach navigates the possible design space of the physical layout: it organizes each column’s data by determining the number of partitions, their corresponding sizes and ranges, and the amount of buffer space and how it is allocated. We frame these design decisions as an optimization problem that, given workload knowledge and performance requirements, provides an optimal physical layout for the workload at hand. To evaluate this work, we build an in-memory storage engine, Casper, and we show that it outperforms state-of-the-art data layouts of analytical systems for hybrid workloads. Casper delivers up to 2.32x higher throughput for update-intensive workloads and up to 2.14x higher throughput for hybrid workloads. We further show how to make data layout decisions robust to workload variation by carefully selecting the input of the optimization.http://www.vldb.org/pvldb/vol12/p2393-athanassoulis.pdfPublished versionPublished versio
Positional Delta Trees to reconcile updates with read-optimized data storage
We investigate techniques that marry the high readonly analytical query performance of compressed, replicated column storage (“read-optimized” databases) with the ability to handle a high-throughput update workload. Today’s large RAM sizes and the growing gap between sequential vs. random IO disk throughput, bring this once elusive goal in reach, as it has become possible to buffer enough updates in memory to allow background migration of these updates to disk, where efficient sequential IO is amortized among many updates. Our key goal is that read-only queries always see the latest database state, yet are not (significantly) slowed down by the update processing. To this end, we propose the Positional Delta Tree (PDT), that is designed to minimize the overhead of on-the-fly merging of differential updates into (index) scans on stale disk-based data. We describe the PDT data structure and its basic operations (lookup, insert, delete, modify) and provide an in-detail study of their performance. Further, we propose a storage architecture called Replicated Mirrors, that replicates tables in multiple orders, storing each table copy mirrored in both column- and row-wise data formats, and uses PDTs to handle updates. Experiments in the MonetDB/X100 system show that this integrated architecture is able to achieve our main goals
No Bits Left Behind
One of the key tenets of database system design is making efficient
use of storage and memory resources. However, existing database
system implementations are actually extremely wasteful of such
resources; for example, most systems leave a great deal of empty
space in tuples, index pages, and data pages, and spend many
CPU cycles reading cold records from disk that are never used.
In this paper, we identify a number of such sources of waste, and
present a series of techniques that limit this waste (e.g., forcing
better memory locality for hot data and using empty space in index
pages to cache popular tuples) without substantially complicating
interfaces or system design. We show that these techniques
effectively reduce memory requirements for real scenarios from
the Wikipedia database (by up to 17.8×) while increasing query
performance (by up to 8×)
Redesigning Transaction Processing Systems for Non-Volatile Memory
Department of Computer Science and EngineeringTransaction Processing Systems are widely used because they make the user be able to manage
their data more efficiently. However, they suffer performance bottleneck due to the redundant
I/O for guaranteeing data consistency. In addition to the redundant I/O, slow storage device
makes the performance more degraded. Leveraging non-volatile memory is one of the promising
solutions the performance bottleneck in Transaction Processing Systems. However, since the
I/O granularity of legacy storage devices and non-volatile memory is not equal, traditional
Transaction Processing System cannot fully exploit the performance of persistent memory.
The goal of this dissertation is to fully exploit non-volatile memory for improving the performance
of Transaction Processing Systems.
Write amplification between Transaction Processing System is pointed out as a performance
bottleneck. As first approach, we redesigned Transaction Processing Systems to minimize the
redundant I/O between the Transaction Processing Systems. We present LS-MVBT that integrates
recovery information into the main database file to remove temporary files for recovery.
The LS-MVBT also employs five optimizations to reduce the write traffics in single fsync() calls.
We also exploit the persistent memory to reduce the performance bottleneck from slow storage
devices. However, since the traditional recovery method is for slow storage devices, we develop
byte-addressable differential logging, user-level heap manager, and transaction-aware persistence
to fully exploit the persistent memory. To minimize the redundant I/O for guarantee data consistency,
we present the failure-atomic slotted paging with persistent buffer cache.
Redesigning indexing structure is the second approach to exploit the non-volatile memory
fully. Since the B+-tree is originally designed for block granularity, It generates excessive I/O
traffics in persistent memory. To mitigate this traffic, we develop cache line friendly B+-tree
which aligns its node size to cache line size. It can minimize the write traffic. Moreover, with
hardware transactional memory, it can update its single node atomically without any additional
redundant I/O for guaranteeing data consistency. It can also adapt Failure-Atomic Shift and
Failure-Atomic In-place Rebalancing to eliminate unnecessary I/O.
Furthermore, We improved the persistent memory manager that exploit traditional memory
heap structure with free-list instead of segregated lists for small memory allocations to minimize
the memory allocation overhead.
Our performance evaluation shows that our improved version that consider I/O granularity
of non-volatile memory can efficiently reduce the redundant I/O traffic and improve the
performance by large of a margin.ope
LNCS
Concurrent accesses to shared data structures must be synchronized to avoid data races. Coarse-grained synchronization, which locks the entire data structure, is easy to implement but does not scale. Fine-grained synchronization can scale well, but can be hard to reason about. Hand-over-hand locking, in which operations are pipelined as they traverse the data structure, combines fine-grained synchronization with ease of use. However, the traditional implementation suffers from inherent overheads. This paper introduces snapshot-based synchronization (SBS), a novel hand-over-hand locking mechanism. SBS decouples the synchronization state from the data, significantly improving cache utilization. Further, it relies on guarantees provided by pipelining to minimize synchronization that requires cross-thread communication. Snapshot-based synchronization thus scales much better than traditional hand-over-hand locking, while maintaining the same ease of use
Optimal column layout for hybrid workloads (VLDB 2020 talk)
Data-intensive analytical applications need to support both efficient
reads and writes. However, what is usually a good data layout for
an update-heavy workload, is not well-suited for a read-mostly one
and vice versa. Modern analytical data systems rely on columnar
layouts and employ delta stores to inject new data and updates.
We show that for hybrid workloads we can achieve close to one
order of magnitude better performance by tailoring the column layout
design to the data and query workload. Our approach navigates
the possible design space of the physical layout: it organizes each
column’s data by determining the number of partitions, their corresponding
sizes and ranges, and the amount of buffer space and how
it is allocated. We frame these design decisions as an optimization
problem that, given workload knowledge and performance requirements,
provides an optimal physical layout for the workload
at hand. To evaluate this work, we build an in-memory storage engine,
Casper, and we show that it outperforms state-of-the-art data
layouts of analytical systems for hybrid workloads. Casper delivers
up to 2:32 higher throughput for update-intensive workloads
and up to 2:14 higher throughput for hybrid workloads. We further
show how to make data layout decisions robust to workload
variation by carefully selecting the input of the optimization.http://www.vldb.org/pvldb/vol12/p2393-athanassoulis.pdfPublished versio
NVB-tree: Failure-Atomic B+-tree for Persistent Memory
Department of Computer EngineeringEmerging non-volatile memory has opened new opportunities to re-design the entire system software stack and it is expected to break the boundaries between memory and storage devices to enable storage-less systems. Traditionally, B-tree has been used to organize data blocks in storage systems. However, B-tree is optimized for disk-based systems that read and write large blocks of data. When byte-addressable non-volatile memory replaces the block device storage systems, the byte-addressability of NVRAM makes it challenge to enforce the failure-atomicity of B-tree nodes.
In this work, we present NVB-tree that addresses this challenge, reducing cache line flush overhead and avoiding expensive logging methods. NVB-tree is a hybrid tree that combines the binary search tree and the B+-tree, i.e., keys in each NVB-tree node are stored as a binary search tree so that it can benefit from the byte-addressability of binary search trees. We also present a logging-less split/merge scheme that guarantees failure-atomicity with 8-byte memory writes. Our performance study shows that NVB-tree outperforms the state-of-the-art persistent index - wB+-tree by a large margin.ope
Lethe: a tunable delete-aware LSM engine (updated version)
Data-intensive applications fueled the evolution of log structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as a second-class citizen. A delete inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, LSM engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written. We highlight that fast persistent deletion without affecting read performance is key to support: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on the right-to-be-forgotten, and (iii) en masse cloud deployment of data systems that makes storage a precious resource. To address these challenges, in this paper, we build a new key-value storage engine, Lethe, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe supports any user-defined threshold for the delete persistence latency offering higher read throughput (1.17-1.4×) and lower space amplification (2.1-9.8×), with a modest increase in write amplification (between 4% and 25%). In addition, Lethe supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance nor employing a costly full tree merge.First author draf
- …