Search CORE

8 research outputs found

Skew-Insensitive Join Processing in Shared-Disk Database Systems

Author: Märtens Holger
Publication venue
Publication date: 05/02/2019
Field of study

Skew effects are still a significant problem for efficient query processing in parallel database systems. Especially in shared-nothing environments, this problem is aggravated by the substantial cost of data redistribution. Shared-disk systems, on the other hand, promise much higher flexibility in the distribution of workload among processing nodes because all input data can be accessed by any node at equal cost. In order to verify this potential for dynamic load balancing, we have devised a new technique for skew-tolerant join processing. In contrast to conventional solutions, our algorithm is not restricted to estimating processing costs in advance and assigning tasks to nodes accordingly. Instead, it monitors the actual progression of work and dynamically allocates tasks to processors, thus capitalizing on the uniform access pathlength in shared-disk architectures. This approach has the potential to alleviate not only any kind of data-inherent skew, but also execution skew caused by query- external workloads, by disk contention, or simply by inaccurate estimates used in predictive scheduling. We employ a detailed simulation system to evaluate the new algorithm under different types and degrees of skew

Qucosa - Publikationsserver der Universität Leipzig

Load-balancing distributed outer joins through operator decomposition

Author: Cheng Long
Kotoulas Spyros
Liu Qingzhi
Wang Ying
Publication venue: 'Elsevier BV'
Publication date: 14/05/2019
Field of study

High-performance data analytics largely relies on being able to efficiently execute various distributed data operators such as distributed joins. So far, large amounts of join methods have been proposed and evaluated in parallel and distributed environments. However, most of them focus on inner joins, and there is little published work providing the detailed implementations and analysis of outer joins. In this work, we present POPI (Partial Outer join & Partial Inner join), a novel method to load-balance large parallel outer joins by decomposing them into two operations: a large outer join over data that does not present significant skew in the input and an inner join over data presenting significant skew. We present the detailed implementation of our approach and show that POPI is implementable over a variety of architectures and underlying join implementations. Moreover, our experimental evaluation over a distributed memory platform also demonstrates that the proposed method is able to improve outer join performance under varying data skew and present excellent load-balancing properties, compared to current approaches

Irish Universities

DCU Online Research Access Service

Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing

Author: Viswanath Poosala
Yannis E. Ioannidis
Publication venue
Publication date
Field of study

Many commercial database systems use some form of statistics, typically histograms, to summarize the contents of relations and permit efficient estimation of required quantities. While there has been considerable work done on identifying good histograms for the estimation of query-result sizes, little attention has been paid to the estimation of the data distribution of the result, which is of importance in query optimization. In this paper, we prove that the optimal histogram for estimating the size of the result of a join operator is optimal for estimating its data distribution as well. We also study the effectiveness of these optimal histograms in the context of an important application that requires estimates for the data distribution of a query result: load-balancing for parallel Hybrid hash joins. We derive a cost formula to capture the effect of data skew in both the input and output relations on the load and use the optimal histograms to estimate this cost most accurately. We h..

CiteSeerX

Load Balancing and Skew Resilience for Parallel Joins

Author: ElSeidy Mohammed
Koch Christoph
Vitorovic Aleksandar
Publication venue
Publication date: 03/12/2014
Field of study

We address the problem of load balancing for parallel joins. We show that the distribution of input data received and the output data produced by worker machines are both important for performance. As a result, previous work, which optimizes either for input or output, stands ineffective for load balancing. To that end, we propose a multi-stage load-balancing algorithm which considers the properties of both input and output data through sampling of the original join matrix. To do this efficiently, we propose a novel category of equi-weight histograms. To build them, we exploit state-of-the-art computational geometry algorithms for rectangle tiling. To our knowledge, we are the first to employ tiling algorithms for join load-balancing. In addition, we propose a novel, join-specialized tiling algorithm that has drastically lower time and space complexity than existing algorithms. Experiments show that our scheme outperforms state-of-the-art techniques by up to a factor of 15

Infoscience - École polytechnique fédérale de Lausanne

Near-Optimal Distributed Band-Joins through Recursive Partitioning

Author: Gatterbauer Wolfgang
Li Rundong
Riedewald Mirek
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/04/2020
Field of study

We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work

arXiv.org e-Print Archive

Crossref

Query estimation techniques in database systems

Author: König Arnd Christian
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 23/09/2004
Field of study

The effctiveness of query optimization in database systems critically depends on the system';s ability to assess the execution costs of different query execution plans. For this purpose, the sizes and data distributions of the intermediate results generated during plan execution need to be estimated as accurately as possible. This estimation requires the maintenance of statistics on the data stored in the database, which are referred to as data synopses. While the problem of query cost estimation has received significant attention for over a decade, it has remained an open issue in practice, because most previous techniques have focused on singular aspects of the problem such as minimizing the estimation error of a single type of query and a single data distribution, whereas database management systems generally need to support a wide range of queries over a number of datasets. In this thesis I introduce a new technique for query result estimation, which extends existing techniques in that it offers estimation for all combinations of the three major database operators selection, projection, and join. The approach is based on separate and independent approximations of the attribute values contained in a dataset and their frequencies. Through the use of space-filling curves, the approach extends to multi-dimensional data, while maintaining its accuracy and computational properties. The resulting estimation accuracy is competitive with specialized techniques and superior to the histogram techniques currently implemented in commercial database management systems. Because data synopses reside in main memory, they compete for available space with the database cache and query execution buffers. Consequently, the memory available to data synopses needs to be used efficiently. This results in a physical design problem for data synopses, which is to determine the best set of synopses for a given combination of datasets, queries, and available memory. This thesis introduces a formalization of the problem, and efficient algorithmic solutions. All discussed techniques are evaluated with regard to their overhead and resulting estimation accuracy on a variety of synthetic and real-life datasets.Die Effektivität der Anfrage-Optimierung in Datenbanksystemen hängt entscheidend von der Fähigkeit des Systems ab, die Kosten der verschiedenen Möglichkeiten, eine Anfrage auszuführen, abzuschätzen. Zu diesem Zweck ist es nötig, die Größen und Datenverteilungen der Zwischenresultate, die während der Ausführung einer Anfrage generiert werden, so genau wie möglich zu schätzen. Zur Lösung dieses Schätzproblems benötigt man Statistiken über die Daten, welche in dem Datenbanksystem gespeichert werden; diese Statistiken werden auch als Daten Synopsen bezeichnet. Obwohl das Problem der Schätzung von Anfragekosten innerhalb der letzten 10 Jahre intensiv untersucht wurde, gilt es weiterhin als offen, da viele der vorgeschlagenen Ansätze nur einen Teilaspekt des Problems betrachten. In den meisten Fällen wurden Techniken für das Abschätzen eines einzelnen Operators auf einer einzelnen Datenverteilung untersucht, wohingegen Datenbanksysteme in der Praxis eine Vielfalt von Anfragen über diverse Datensätze unterstützen müssen. Aus diesem Grund stellt diese Arbeit einen neuen Ansatz zur Resultatsabschätzung vor, welcher insofern über bestehende Ansätze hinausgeht, als dass er akkurate Abschätzung beliebiger Kombinationen der drei wichtigsten Datenbank-Operatoren erlaubt: Selektion, Projektion und Join. Meine Technik basiert auf separaten und unabhängigen Approximationen der Verteilung der Attributwerte eines Datensatzes und der Verteilung der Häufigkeiten dieser Attributwerte. Durch den Einsatz raumfüllender Kurven können diese Approximationstechniken zudem auf mehrdimensionale Datenverteilungen angewandt werden, ohne ihre Genauigkeit und geringen Berechnungskosten einzubüßen. Die resultierende Schätzgenauigkeit ist vergleichbar mit der von auf einen einzigen Operator spezialisierten Techniken, und deutlich höher als die der auf Histogrammen basierenden Ansätze, welche momentan in kommerziellen Datenbanksystemen eingesetzt werden. Da Daten Synopsen im Arbeitsspeicher residieren, reduzieren sie den Speicher, der für den Seitencache oder Ausführungspuffer zur Verfügung steht. Somit sollte der für Synopsen reservierte Speicher effizient genutzt werden, bzw. möglichst klein sein. Dies führt zu dem Problem, die optimale Kombination von Synopsen für eine gegebene Kombination an Daten, Anfragen und verfügbarem Speicher zu bestimmen. Diese Arbeit stellt eine formale Beschreibung des Problems, sowie effiziente Algorithmen zu dessen Lösung vor. Alle beschriebenen Techniken werden in Hinsicht auf ihren Aufwand und die resultierende Schätzgenauigkeit mittels Experimenten über eine Vielzahl von Datenverteilungen evaluiert

Universaar

Acronym

Query estimation techniques in database systems

Author: König Arnd Christian
Publication venue: Fakultät 6 - Naturwissenschaftlich-Technische Fakultät I. Fachrichtung 6.2 - Informatik
Publication date: 01/01/2001
Field of study

The effctiveness of query optimization in database systems critically depends on the system\u27;s ability to assess the execution costs of different query execution plans. For this purpose, the sizes and data distributions of the intermediate results generated during plan execution need to be estimated as accurately as possible. This estimation requires the maintenance of statistics on the data stored in the database, which are referred to as data synopses. While the problem of query cost estimation has received significant attention for over a decade, it has remained an open issue in practice, because most previous techniques have focused on singular aspects of the problem such as minimizing the estimation error of a single type of query and a single data distribution, whereas database management systems generally need to support a wide range of queries over a number of datasets. In this thesis I introduce a new technique for query result estimation, which extends existing techniques in that it offers estimation for all combinations of the three major database operators selection, projection, and join. The approach is based on separate and independent approximations of the attribute values contained in a dataset and their frequencies. Through the use of space-filling curves, the approach extends to multi-dimensional data, while maintaining its accuracy and computational properties. The resulting estimation accuracy is competitive with specialized techniques and superior to the histogram techniques currently implemented in commercial database management systems. Because data synopses reside in main memory, they compete for available space with the database cache and query execution buffers. Consequently, the memory available to data synopses needs to be used efficiently. This results in a physical design problem for data synopses, which is to determine the best set of synopses for a given combination of datasets, queries, and available memory. This thesis introduces a formalization of the problem, and efficient algorithmic solutions. All discussed techniques are evaluated with regard to their overhead and resulting estimation accuracy on a variety of synthetic and real-life datasets.Die Effektivität der Anfrage-Optimierung in Datenbanksystemen hängt entscheidend von der Fähigkeit des Systems ab, die Kosten der verschiedenen Möglichkeiten, eine Anfrage auszuführen, abzuschätzen. Zu diesem Zweck ist es nötig, die Größen und Datenverteilungen der Zwischenresultate, die während der Ausführung einer Anfrage generiert werden, so genau wie möglich zu schätzen. Zur Lösung dieses Schätzproblems benötigt man Statistiken über die Daten, welche in dem Datenbanksystem gespeichert werden; diese Statistiken werden auch als Daten Synopsen bezeichnet. Obwohl das Problem der Schätzung von Anfragekosten innerhalb der letzten 10 Jahre intensiv untersucht wurde, gilt es weiterhin als offen, da viele der vorgeschlagenen Ansätze nur einen Teilaspekt des Problems betrachten. In den meisten Fällen wurden Techniken für das Abschätzen eines einzelnen Operators auf einer einzelnen Datenverteilung untersucht, wohingegen Datenbanksysteme in der Praxis eine Vielfalt von Anfragen über diverse Datensätze unterstützen müssen. Aus diesem Grund stellt diese Arbeit einen neuen Ansatz zur Resultatsabschätzung vor, welcher insofern über bestehende Ansätze hinausgeht, als dass er akkurate Abschätzung beliebiger Kombinationen der drei wichtigsten Datenbank-Operatoren erlaubt: Selektion, Projektion und Join. Meine Technik basiert auf separaten und unabhängigen Approximationen der Verteilung der Attributwerte eines Datensatzes und der Verteilung der Häufigkeiten dieser Attributwerte. Durch den Einsatz raumfüllender Kurven können diese Approximationstechniken zudem auf mehrdimensionale Datenverteilungen angewandt werden, ohne ihre Genauigkeit und geringen Berechnungskosten einzubüßen. Die resultierende Schätzgenauigkeit ist vergleichbar mit der von auf einen einzigen Operator spezialisierten Techniken, und deutlich höher als die der auf Histogrammen basierenden Ansätze, welche momentan in kommerziellen Datenbanksystemen eingesetzt werden. Da Daten Synopsen im Arbeitsspeicher residieren, reduzieren sie den Speicher, der für den Seitencache oder Ausführungspuffer zur Verfügung steht. Somit sollte der für Synopsen reservierte Speicher effizient genutzt werden, bzw. möglichst klein sein. Dies führt zu dem Problem, die optimale Kombination von Synopsen für eine gegebene Kombination an Daten, Anfragen und verfügbarem Speicher zu bestimmen. Diese Arbeit stellt eine formale Beschreibung des Problems, sowie effiziente Algorithmen zu dessen Lösung vor. Alle beschriebenen Techniken werden in Hinsicht auf ihren Aufwand und die resultierende Schätzgenauigkeit mittels Experimenten über eine Vielzahl von Datenverteilungen evaluiert

CiteSeerX