In-memory caching for multi-query optimization of data-intensive scalable computing workloads
In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work. Instead of optimizing jobs independently, multi-query optimization techniques can be employed to save a considerable amount of cluster resources. In this work, we introduce a novel method combining in-memory cache primitives and multi-query optimization to improve the efficiency of data-intensive, scalable computing frameworks. By careful selection and exploitation of common (sub)expressions, while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient one which avoids unnecessary recomputations. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Experiments on a prototype implementation of our system show significant benefits of work sharing for TPC-DS workloads.
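As a rough illustration of the knapsack-style selection mentioned in the abstract above, the sketch below picks which shared subexpressions to cache under a memory budget. It is a generic textbook dynamic program, not the authors' formulation: the function name `choose_cache_plan`, the option tuples, and all numbers are assumptions made for this example, and each group may also be left uncached (equivalent to adding a zero-cost, zero-gain option).

```python
from typing import List, Tuple

# Hypothetical input: each group holds mutually exclusive caching options for
# one shared subexpression, given as (memory_cost, estimated_saving).
Group = List[Tuple[int, float]]


def choose_cache_plan(groups: List[Group], budget: int) -> Tuple[float, List[int]]:
    """Pick at most one option per group to maximize saving within budget.

    Returns (best_saving, choices), where choices[g] is the chosen option
    index for group g, or -1 when nothing from that group is cached.
    """
    # best[m] = (saving, choices) achievable with at most m memory units.
    best = [(0.0, []) for _ in range(budget + 1)]
    for group in groups:
        new_best = []
        for m in range(budget + 1):
            # Default: cache nothing from this group.
            saving, picks = best[m]
            cand = (saving, picks + [-1])
            for idx, (cost, gain) in enumerate(group):
                if cost <= m:
                    prev_saving, prev_picks = best[m - cost]
                    if prev_saving + gain > cand[0]:
                        cand = (prev_saving + gain, prev_picks + [idx])
            new_best.append(cand)
        best = new_best
    return best[budget]


if __name__ == "__main__":
    # Made-up costs and savings for three shared subexpressions.
    example = [[(4, 10.0), (2, 6.0)], [(3, 7.0)], [(5, 12.0), (1, 2.0)]]
    print(choose_cache_plan(example, budget=6))
```

The table is indexed by integer memory units, so real sizes would need to be discretized first; a cost-based optimizer would also have to estimate the savings rather than take them as given.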
Revisiting reuse in main memory database systems
Reusing intermediates in databases to speed up analytical query processing
has been studied in the past. Existing solutions typically require intermediate
results of individual operators to be materialized into temporary tables to be
considered for reuse in subsequent queries. However, these approaches are
fundamentally ill-suited for use in modern main memory databases. The reason is
that modern main memory DBMSs are typically limited by the bandwidth of the
memory bus, thus query execution is heavily optimized to keep tuples in the CPU
caches and registers. Consequently, adding additional materialization operations
into a query plan not only adds traffic to the memory bus but, more
importantly, prevents the important cache- and register-locality opportunities,
resulting in high performance penalties.
In this paper we study a novel reuse model for intermediates, which caches
internal physical data structures materialized during query processing (due to
pipeline breakers) and externalizes them so that they become reusable for
upcoming operations. We focus on hash tables, the most commonly used internal
data structure in main memory databases to perform join and aggregation
operations. As queries arrive, our reuse-aware optimizer reasons about the
reuse opportunities for hash tables, employing cost models that take into
account hash table statistics together with the CPU and data movement costs
within the cache hierarchy. Experimental results, based on our HashStash
prototype, demonstrate performance gains for typical analytical
workloads with no additional overhead for materializing intermediates.
Comment: 13 pages, 11 figures
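The reuse idea in the abstract above, keeping the hash table that a join builds anyway and handing it to later queries, can be sketched minimally as follows. This is not the HashStash implementation: `HashTableCache`, `get_or_build`, and `hash_join` are hypothetical names, the cache only reuses a table on an exact match of relation name and join key, and there is no cost model or eviction.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]


class HashTableCache:
    """Keeps build-side hash tables so later joins can reuse them."""

    def __init__(self) -> None:
        self._tables: Dict[Tuple[str, str], Dict[object, List[Row]]] = {}
        self.builds = 0  # number of hash tables actually built

    def get_or_build(self, name: str, rows: List[Row], key: str):
        signature = (name, key)  # assumed: exact-match reuse only
        if signature not in self._tables:
            table = defaultdict(list)
            for row in rows:
                table[row[key]].append(row)
            self._tables[signature] = table
            self.builds += 1
        return self._tables[signature]


def hash_join(cache: HashTableCache, build_name: str, build_rows: List[Row],
              probe_rows: List[Row], key: str) -> List[Row]:
    """Equi-join that builds (or reuses) a hash table on the build side."""
    table = cache.get_or_build(build_name, build_rows, key)
    # .get avoids inserting empty buckets into the shared table while probing.
    return [{**b, **p} for p in probe_rows for b in table.get(p[key], [])]


if __name__ == "__main__":
    cache = HashTableCache()
    customers = [{"cid": 1, "name": "a"}, {"cid": 2, "name": "b"}]
    q1 = hash_join(cache, "customers", customers, [{"cid": 1, "amt": 10}], "cid")
    q2 = hash_join(cache, "customers", customers, [{"cid": 2, "amt": 5}], "cid")
    print(q1, q2, cache.builds)
```

The second join probes the table built by the first, so no extra materialization step is added to either query plan; deciding when reuse actually pays off is the part the abstract assigns to the reuse-aware optimizer and its cost models.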