
2 research outputs found

On Disk Allocation of Intermediate Query Results in Parallel Database Systems

Author: Märtens Holger
Publication date: 07/11/2018

For complex queries in parallel database systems, substantial amounts of data must be redistributed between operators executed on different processing nodes. Frequently, such intermediate results cannot be held in main memory and must be stored on disk. To limit the ensuing performance penalty, a data allocation must be found that supports parallel I/O to the greatest possible extent. In this paper, we propose declustering even self-contained units of temporary data processed in a single operation (such as individual buckets of parallel hash joins) across multiple disks. Using a suitable analytical model, we find that the improvement of parallel I/O outweighs the penalty of increased fragmentation.
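The trade-off described in this abstract can be illustrated with a minimal sketch. It is not the paper's analytical model: the round-robin placement, the function names, and the cost parameters `page_time` and `fragment_overhead` are all assumptions chosen for illustration. The idea is that spreading one bucket over several disks adds a fixed overhead per fragment but lets the disks transfer pages in parallel.

```python
from collections import Counter

def decluster_round_robin(num_pages, num_disks):
    """Assign page i of one temporary bucket to disk i mod num_disks (assumed scheme)."""
    return [page % num_disks for page in range(num_pages)]

def elapsed_time(placement, page_time=1.0, fragment_overhead=5.0):
    """Toy elapsed time: disks work in parallel, so the slowest disk decides.
    Each disk pays one fixed overhead for its fragment plus a per-page cost."""
    pages_per_disk = Counter(placement)
    return max(fragment_overhead + n * page_time for n in pages_per_disk.values())

def total_disk_time(placement, page_time=1.0, fragment_overhead=5.0):
    """Toy total work: sum of busy time over all disks that hold a fragment."""
    pages_per_disk = Counter(placement)
    return sum(fragment_overhead + n * page_time for n in pages_per_disk.values())

one_disk = decluster_round_robin(100, 1)
four_disks = decluster_round_robin(100, 4)

print(elapsed_time(one_disk), total_disk_time(one_disk))      # 105.0 105.0
print(elapsed_time(four_disks), total_disk_time(four_disks))  # 30.0 120.0
```

Under these assumed parameters, declustering raises the total disk work (the fragmentation penalty) while cutting the elapsed time, which is the direction of the result the abstract reports.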

Qucosa - Publikationsserver der Universität Leipzig

Near-Optimal Distributed Band-Joins through Recursive Partitioning

Authors: Gatterbauer Wolfgang, Li Rundong, Riedewald Mirek
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/04/2020

We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is effective not only for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.
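The tension between load balance and input duplication can be seen in a minimal one-dimensional sketch. It uses a fixed range partitioning with hand-picked split points, not the recursive partitioning or split scoring measure of the paper. The band condition `|s - t| <= eps`, the function names, and the sample data are assumptions for illustration.

```python
from bisect import bisect_right

def band_join(S, T, eps):
    """Reference result: all pairs (s, t) with |s - t| <= eps."""
    return sorted((s, t) for s in S for t in T if abs(s - t) <= eps)

def partitioned_band_join(S, T, eps, boundaries):
    """Range-partition S on sorted split points; copy each t to every
    partition that may hold a matching s. Partition i covers [b[i-1], b[i])."""
    k = len(boundaries) + 1
    s_parts = [[] for _ in range(k)]
    t_parts = [[] for _ in range(k)]
    for s in S:
        s_parts[bisect_right(boundaries, s)].append(s)  # exactly one partition
    for t in T:
        lo = bisect_right(boundaries, t - eps)
        hi = bisect_right(boundaries, t + eps)
        for i in range(lo, hi + 1):                     # duplicated near split points
            t_parts[i].append(t)
    pairs = sorted(p for i in range(k) for p in band_join(s_parts[i], t_parts[i], eps))
    max_load = max(len(s_parts[i]) + len(t_parts[i]) for i in range(k))
    duplication = sum(len(part) for part in t_parts) - len(T)
    return pairs, max_load, duplication

S, T, eps = [1, 4, 6, 9], [2, 5, 8], 1
pairs, max_load, duplication = partitioned_band_join(S, T, eps, boundaries=[5])
print(pairs == band_join(S, T, eps), max_load, duplication)  # True 4 1
```

Here the tuple `t = 5` lies within `eps` of the split point and is sent to both workers. Adding split points lowers the maximum load per worker but tends to duplicate more tuples.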

arXiv.org e-Print Archive

Crossref