Optimal column layout for hybrid workloads
Data-intensive analytical applications need to support both efficient reads and writes. However, what is usually a good data layout for an update-heavy workload is not well-suited for a read-mostly one and vice versa. Modern analytical data systems rely on columnar layouts and employ delta stores to inject new data and updates. We show that for hybrid workloads we can achieve close to one order of magnitude better performance by tailoring the column layout design to the data and query workload. Our approach navigates the possible design space of the physical layout: it organizes each column’s data by determining the number of partitions, their corresponding sizes and ranges, and the amount of buffer space and how it is allocated. We frame these design decisions as an optimization problem that, given workload knowledge and performance requirements, provides an optimal physical layout for the workload at hand. To evaluate this work, we build an in-memory storage engine, Casper, and we show that it outperforms state-of-the-art data layouts of analytical systems for hybrid workloads. Casper delivers up to 2.32x higher throughput for update-intensive workloads and up to 2.14x higher throughput for hybrid workloads. We further show how to make data layout decisions robust to workload variation by carefully selecting the input of the optimization. http://www.vldb.org/pvldb/vol12/p2393-athanassoulis.pdf Published version
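The idea of framing layout choice as an optimization over workload knowledge can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration and is not taken from Casper: the `Workload` type, the toy cost model (a point read scans one equal-sized partition, an insert touches every partition once), and the exhaustive search over partition counts. The real system also chooses partition sizes, ranges, and buffer space, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload summary: relative operation frequencies."""
    point_reads: float
    inserts: float


def layout_cost(num_values: int, num_partitions: int, workload: Workload) -> float:
    # Assumed toy cost model, not the paper's:
    # a point read scans one equal-sized partition, and
    # an insert moves one element across every partition.
    read_cost = num_values / num_partitions
    insert_cost = num_partitions
    return workload.point_reads * read_cost + workload.inserts * insert_cost


def best_partition_count(num_values: int, workload: Workload,
                         max_partitions: int = 1024) -> int:
    # Exhaustive search over candidate partition counts; fine for a sketch.
    candidates = range(1, min(num_values, max_partitions) + 1)
    return min(candidates, key=lambda k: layout_cost(num_values, k, workload))


if __name__ == "__main__":
    read_heavy = Workload(point_reads=0.9, inserts=0.1)
    write_heavy = Workload(point_reads=0.1, inserts=0.9)
    print(best_partition_count(10_000, read_heavy))
    print(best_partition_count(10_000, write_heavy))
```

Under this toy model a read-heavy workload ends up with many small partitions and a write-heavy one with few large partitions, which mirrors the tension between read-optimized and update-optimized layouts that the abstract describes.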
Optimal column layout for hybrid workloads (VLDB 2020 talk)
Data-intensive analytical applications need to support both efficient
reads and writes. However, what is usually a good data layout for
an update-heavy workload is not well-suited for a read-mostly one
and vice versa. Modern analytical data systems rely on columnar
layouts and employ delta stores to inject new data and updates.
We show that for hybrid workloads we can achieve close to one
order of magnitude better performance by tailoring the column layout
design to the data and query workload. Our approach navigates
the possible design space of the physical layout: it organizes each
column’s data by determining the number of partitions, their corresponding
sizes and ranges, and the amount of buffer space and how
it is allocated. We frame these design decisions as an optimization
problem that, given workload knowledge and performance requirements,
provides an optimal physical layout for the workload
at hand. To evaluate this work, we build an in-memory storage engine,
Casper, and we show that it outperforms state-of-the-art data
layouts of analytical systems for hybrid workloads. Casper delivers
up to 2.32x higher throughput for update-intensive workloads
and up to 2.14x higher throughput for hybrid workloads. We further
show how to make data layout decisions robust to workload
variation by carefully selecting the input of the optimization.
http://www.vldb.org/pvldb/vol12/p2393-athanassoulis.pdf Published version
Resilient store: a heuristic-based data format selector for intermediate results
The final publication is available at link.springer.com.
Large-scale data analysis is an important activity in many organizations that typically requires the deployment of data-intensive workflows. As data is processed, these workflows generate large intermediate results, which are typically pipelined from one operator to the next. However, if materialized, these results become reusable, so subsequent workflows need not recompute them. There are already many solutions that materialize intermediate results, but all of them assume a fixed data format. A fixed format, however, may not be the optimal one for every situation. For example, it is well known that different data fragmentation strategies (e.g., horizontal and vertical) behave better or worse according to the access patterns of the subsequent operations. In this paper, we present ResilientStore, which assists in selecting the most appropriate data format for materializing intermediate results. Given a workflow and a set of materialization points, it uses rule-based heuristics to choose the best storage data format based on subsequent access patterns. We have implemented ResilientStore for HDFS and three different data formats: SequenceFile, Parquet and Avro. Experimental results show that our solution gives 18% better performance than any solution based on a single fixed format.
Peer Reviewed. Postprint (author's final draft)
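A rule-based selector of the kind described above could look roughly like the following sketch. The rules, thresholds, and names (`AccessPattern`, `choose_format`) are hypothetical placeholders and are not the heuristics used by ResilientStore. The only facts relied on are that Parquet is a columnar format, while Avro and SequenceFile store data row by row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    """Hypothetical summary of how downstream operators read a result."""
    columns_read: int
    total_columns: int
    row_selectivity: float  # fraction of rows read, between 0.0 and 1.0


def choose_format(pattern: AccessPattern) -> str:
    # Assumed rule 1: narrow projections favour a columnar (vertical) layout.
    projection_ratio = pattern.columns_read / pattern.total_columns
    if projection_ratio < 0.5:
        return "Parquet"
    # Assumed rule 2: wide reads that touch every row are plain sequential
    # scans, so a simple row-oriented container is enough.
    if pattern.row_selectivity >= 1.0:
        return "SequenceFile"
    # Assumed rule 3 (placeholder): remaining row-oriented cases go to Avro.
    return "Avro"


if __name__ == "__main__":
    print(choose_format(AccessPattern(columns_read=2, total_columns=20,
                                      row_selectivity=1.0)))
    print(choose_format(AccessPattern(columns_read=20, total_columns=20,
                                      row_selectivity=1.0)))
```

Keeping the decision in one pure function makes it easy to evaluate each materialization point of a workflow independently and to swap in different rules once real access statistics are available.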