A New Framework for Join Product Skew
Different types of data skew can result in load imbalance in the context of
parallel joins under the shared-nothing architecture. We study one important
type of skew, join product skew (JPS). We propose a static approach based on
frequency classes, which assumes that the data distribution of the join
attribute values is known. It comes from the observation that the join selectivity can
be expressed as a sum of products of frequencies of the join attribute values.
As a consequence, an assignment of join sub-tasks that takes into
consideration the magnitude of the frequency products can alleviate join
product skew. Motivated by this observation, we propose an algorithm,
called Handling Join Product Skew (HJPS), to handle join product skew
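The observation in this abstract can be illustrated with a minimal Python sketch. It is not the HJPS algorithm itself, whose details the abstract does not give. It only computes the join output size as a sum of frequency products and then applies a generic largest-first greedy heuristic to spread join values over workers. The function names `join_output_size` and `assign_by_frequency_product` are hypothetical, chosen for illustration.

```python
from collections import Counter
import heapq


def join_output_size(r_keys, s_keys):
    """Size of the equi-join: sum over values v of freq_R(v) * freq_S(v)."""
    fr, fs = Counter(r_keys), Counter(s_keys)
    return sum(fr[v] * fs[v] for v in fr.keys() & fs.keys())


def assign_by_frequency_product(r_keys, s_keys, n_workers):
    """Greedy largest-first assignment of join values to workers.

    Assumption: the cost of joining value v is its frequency product.
    The heaviest values are placed first, each on the least-loaded worker.
    """
    fr, fs = Counter(r_keys), Counter(s_keys)
    costs = {v: fr[v] * fs[v] for v in fr.keys() & fs.keys()}
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {}
    for v, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[v] = w
        heapq.heappush(heap, (load + cost, w))
    loads = [0] * n_workers
    for v, w in assignment.items():
        loads[w] += costs[v]
    return assignment, loads


if __name__ == "__main__":
    r = [1, 1, 1, 2, 2, 3]
    s = [1, 1, 2, 3, 3, 3, 4]
    print(join_output_size(r, s))
    print(assign_by_frequency_product(r, s, n_workers=2))
```

A plain hash partitioning of the join values would ignore these products, so a single value with a large product could dominate one worker's output. Weighting the assignment by the product is one simple way to even this out.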
Cloud-Scale Entity Resolution: Current State and Open Challenges
Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown so large over the past two decades, parallel techniques are needed to meet the high-performance and scalability requirements of ER. The development of parallel ER has reached a relatively mature stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art of parallel ER approaches. From this overview, we then extract classification criteria for parallel ER, and classify and compare the approaches based on these criteria. Finally, we identify open research questions and challenges and discuss potential solutions and directions for further research in this field
Efficiently Handling Skew in Outer Joins on Distributed Systems
Outer joins are ubiquitous in databases and big data systems. The question of how best to execute outer joins in large parallel systems is particularly challenging, as real-world datasets are characterized by data skew, which leads to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding problem for parallel outer joins. Conventional approaches to this problem, such as those based on hash redistribution, often lead to load balancing problems, while duplication-based approaches incur significant overhead in terms of network communication. In this paper, we propose a new algorithm, query with counters (QC), for directly handling skew in outer joins on distributed architectures. We present an efficient implementation of our approach based on the asynchronous partitioned global address space (APGAS) parallel programming model. We evaluate the performance of our approach on a cluster of 192 cores (16 nodes) and datasets of 1 billion tuples with different skew. Experimental results show that our method is scalable and, in cases of high skew, faster than the state-of-the-art
A scalable analysis framework for large-scale RDF data
With the growth of the Semantic Web, the availability of RDF datasets from multiple domains
as Linked Data has taken the corpora of this web to a terabyte-scale, and challenges
modern knowledge storage and discovery techniques. Research and engineering on RDF
data management systems is a very active area with many standalone systems being introduced.
However, as the size of RDF data increases, such single-machine approaches meet
performance bottlenecks, in terms of both data loading and querying, due to the limited
parallelism inherent to symmetric multi-threaded systems and the limited available system
I/O and system memory. Although several approaches for distributed RDF data processing
have been proposed, along with clustered versions of more traditional approaches, their
techniques are limited by the trade-off they exploit between loading complexity and query
efficiency in the presence of big RDF data. This thesis therefore introduces a scalable analysis
framework for processing large-scale RDF data, which focuses on various techniques to
reduce inter-machine communication, computation and load imbalance so as to achieve
fast data loading and querying on distributed infrastructures.
The first part of this thesis focuses on the study of RDF store implementation and parallel
hashing on big data processing. (1) A system-level investigation of RDF store implementation
has been conducted on the basis of a comparative analysis of runtime characteristics
of a representative set of RDF stores. The detailed time cost and system consumption are
measured for data loading and querying so as to provide insight into different triple store
implementations as well as an understanding of performance differences between different
platforms. (2) A high-level structured parallel hashing approach over distributed memory is
proposed and theoretically analyzed. The detailed performance of hashing implementations
using different lock-free strategies has been characterized through extensive experiments,
thereby allowing system developers to make a more informed choice for the implementation
of their high-performance analytical data processing systems.
The second part of this thesis proposes three main techniques for fast processing of large
RDF data within the proposed framework. (1) A very efficient parallel dictionary encoding
algorithm, to avoid unnecessary disk-space consumption and reduce the computational
complexity of query execution. The presented implementation has achieved notable speedups
compared to the state-of-the-art method and has also achieved excellent scalability. (2) Several
novel parallel join algorithms, to efficiently handle skew over large data during query processing.
The approaches have achieved good load balancing and have been demonstrated
to be faster than the state-of-the-art techniques in both theoretical and experimental comparisons.
(3) A two-tier dynamic indexing approach for processing SPARQL queries has been
devised which keeps loading times low and decreases or in some instances removes inter-machine
data movement for subsequent queries that contain the same graph patterns. The
results demonstrate that this design can load data at least an order of magnitude faster than
a clustered store operating in RAM while remaining within an interactive range for query
processing, and that it even outperforms current systems for various queries
Hardware-conscious query processing for the many-core era
Exploiting the opportunities offered by modern hardware to accelerate query processing is no trivial task. Many DBMS and DSMS from past decades are based on assumptions that no longer hold; for example, today's servers with terabytes of main memory make it possible to avoid spilling data to disk entirely, which paved the way for main-memory databases. One recent hardware trend is many-core processors with hundreds of logical cores on a single CPU, providing a high degree of parallelism through multithreading as well as vectorized instructions (SIMD). Their demand for memory bandwidth has led to the development of high-bandwidth memory (HBM) to overcome the memory wall. However, many-core CPUs and HBM have many pitfalls that can easily nullify any performance gain. In this work, we explore the many-core architecture along with HBM for database and data stream query processing. We demonstrate that a hardware-conscious cost model with a calibration approach allows reliable performance prediction of various query operations. Based on that information, we derive an adaptive partitioning and merging strategy for stream query parallelization and find an ideal parameter configuration for one of the most common tasks in the history of DBMS, join processing. However, not all operations and applications can exploit a many-core processor or HBM.
Stream queries optimized for low latency and quick individual responses usually benefit little from more bandwidth and also suffer from penalties such as the low clock frequencies of many-core CPUs. Data structures shared between cores also lead to problems with cache coherence and high contention. Based on our insights, we give a rule of thumb for which data structures are suitable for parallelization with a focus on HBM usage. In addition, different parallelization schemes and synchronization techniques are evaluated using the example of a multiway stream join operation
Scaling and Load-Balancing Equi-Joins
The task of joining two tables is fundamental for querying databases. In this
paper, we focus on the equi-join problem, where a pair of records from the two
joined tables are part of the join results if equality holds between their
values in the join column(s). While this is a tractable problem when the number
of records in the joined tables is relatively small, it becomes very
challenging as the table sizes increase, especially if hot keys (join column
values with a large number of records) exist in both joined tables.
This paper, an extended version of [metwally-SIGMOD-2022], proposes
Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in
distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a
proposed novel algorithm that scales well when the joined tables share hot
keys, and (b) Broadcast-Join, the fastest known algorithm for joining keys
that are hot in only one table.
Unlike the state-of-the-art algorithms, AM-Join (a) holistically solves the
join-skew problem by achieving load balancing throughout the join execution,
and (b) supports all outer-join variants without record deduplication or custom
table partitioning. For the fastest AM-Join outer-join performance, we propose
the Index-Broadcast-Join (IB-Join) family of algorithms for Small-Large joins,
where one table fits in memory and the other can be up to orders of magnitude
larger. The outer-join variants of IB-Join improve on the state-of-the-art
Small-Large outer-join algorithms.
The proposed algorithms can be adopted in any shared-nothing architecture. We
implemented a MapReduce version using Spark. Our evaluation shows the proposed
algorithms execute significantly faster and scale to more skewed and
orders-of-magnitude bigger tables when compared to the state-of-the-art
algorithms
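As a rough illustration of the broadcast idea mentioned in this abstract, the following Python sketch simulates a generic broadcast hash join in a single process. It is a textbook-style sketch under stated assumptions, not the paper's AM-Join, Tree-Join, or IB-Join: the small table is turned into a hash index that every worker would receive, and the large table stays in its existing partitions, so a hot key in the large table never has to be moved to a single worker. The name `broadcast_join` and the `(key, value)` row layout are assumptions made for this example.

```python
from collections import defaultdict


def broadcast_join(large_partitions, small_table):
    """Inner-join each partition of the large table against a replicated index.

    Rows are (key, value) pairs. The index built from the small table stands in
    for the copy that would be broadcast to every worker.
    """
    index = defaultdict(list)
    for key, value in small_table:
        index[key].append(value)

    results = []
    for partition in large_partitions:  # one iteration per (simulated) worker
        out = []
        for key, value in partition:
            for small_value in index.get(key, ()):
                out.append((key, value, small_value))
        results.append(out)
    return results


if __name__ == "__main__":
    large = [[("a", 1), ("a", 2)], [("a", 3), ("b", 4)]]
    small = [("a", "x"), ("c", "y")]
    for worker_id, rows in enumerate(broadcast_join(large, small)):
        print(worker_id, rows)
```

Because the large table is never repartitioned by key, the rows of the hot key `"a"` remain spread over both partitions. This is the property that makes broadcasting attractive when a key is hot in only one table.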
10381 Summary and Abstracts Collection -- Robust Query Processing
Dagstuhl seminar 10381 on robust query processing (held 19.09.10 -
24.09.10) brought together a diverse set of researchers and practitioners
with a broad range of expertise for the purpose of fostering discussion
and collaboration regarding causes, opportunities, and solutions for
achieving robust query processing.
The seminar strove to build a unified view across
the loosely-coupled system components responsible for
the various stages of database query processing.
Participants were chosen for their experience with database
query processing and, where possible, their prior work in academic
research or in product development towards robustness in database query
processing.
In order to pave the way to motivate, measure, and protect future advances
in robust query processing, seminar 10381 focused on developing tests
for measuring the robustness of query processing.
In these proceedings, we first review the seminar topics, goals,
and results, then present abstracts or notes of some of the seminar break-out
sessions.
We also include, as an appendix,
the robust query processing reading list that
was collected and distributed to participants before the seminar began,
as well as summaries of a few of those papers that were
contributed by some participants
Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System
Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and various scientific domains. To this day, many data analysts and scientists use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, external statistics packages and custom analysis programs, which often run on single workstations, are unable to keep up with the vast increase in data volume and size. In particular, scientists increasingly demand large-scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main-memory database systems, it has now become feasible to also consider applications that build on linear algebra.
This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need to transfer data and without being restricted by hard disk latencies. From various application examples cited in this work, we deduce a number of requirements relevant to a database system that includes linear algebra functionality. Besides the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are as follows. First, we show that the columnar storage layer of an in-memory DBMS lends itself to easy adoption of efficient sparse matrix data types and algorithms. Furthermore, we show that the execution of linear algebra expressions benefits significantly from different techniques inspired by database technology. In a novel approach, we implemented several of these optimization strategies in LAPEG’s optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type, AT Matrix, that relieves scientists of having to select appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching a speed-up of up to 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to data manipulation, where we propose a matrix manipulation API and present different mutable matrix types to enable fast insertions and deletions.
We conclude that our linear algebra engine is well suited to processing dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG fills the linear algebra gap and makes columnar in-memory DBMS attractive as an efficient, scalable ad-hoc analysis platform for scientists
Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems
Two emerging hardware trends will dominate the database system technology in
the near future: increasing main memory capacities of several TB per server and
massively parallel multi-core processing. Many algorithmic and control
techniques in current database technology were devised for disk-based systems
where I/O dominated the performance. In this work we take a new look at the
well-known sort-merge join which, so far, has not been in the focus of research
in scalable massively parallel multi-core data processing as it was deemed
inferior to hash joins. We devise a suite of new massively parallel sort-merge
(MPSM) join algorithms that are based on partial partition-based sorting.
Contrary to classical sort-merge joins, our MPSM algorithms do not rely on a
hard to parallelize final merge step to create one complete sort order. Rather
they work on the independently created runs in parallel. This way our MPSM
algorithms are NUMA-affine as all the sorting is carried out on local memory
partitions. An extensive experimental evaluation on a modern 32-core machine
with one TB of main memory proves the competitive performance of MPSM on large
main memory databases with billions of objects. It scales (almost) linearly in
the number of employed cores and clearly outperforms competing hash join
proposals - in particular it outperforms the "cutting-edge" Vectorwise parallel
query engine by a factor of four. Comment: VLDB201
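The central idea stated in this abstract, joining independently sorted runs without a final global merge, can be sketched in Python as follows. This is a simplified single-process illustration, not the MPSM algorithms themselves: it omits the partitioning and NUMA placement the abstract refers to, and simply lets each sorted run of one input be merge-joined against every sorted run of the other. The helper names `merge_join`, `chunked`, and `run_based_sort_merge_join` are hypothetical.

```python
def merge_join(run_r, run_s):
    """Merge-join two lists of (key, payload) pairs, both sorted by key."""
    out, i, j = [], 0, 0
    while i < len(run_r) and j < len(run_s):
        kr, ks = run_r[i][0], run_s[j][0]
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Locate the end of the group of equal keys in the S run.
            j_end = j
            while j_end < len(run_s) and run_s[j_end][0] == kr:
                j_end += 1
            # Pair every R row of this key with every S row of the group.
            while i < len(run_r) and run_r[i][0] == kr:
                out.extend((kr, run_r[i][1], run_s[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


def chunked(rows, n):
    """Split rows into at most n contiguous chunks of near-equal size."""
    size = max(1, -(-len(rows) // n))  # ceiling division, at least 1
    return [rows[k:k + size] for k in range(0, len(rows), size)]


def run_based_sort_merge_join(r, s, n_workers):
    """Sort each chunk locally, then join every R run against every S run."""
    r_runs = [sorted(c, key=lambda t: t[0]) for c in chunked(r, n_workers)]
    s_runs = [sorted(c, key=lambda t: t[0]) for c in chunked(s, n_workers)]
    out = []
    for r_run in r_runs:  # in a parallel system, one worker per R run
        for s_run in s_runs:
            out.extend(merge_join(r_run, s_run))
    return out


if __name__ == "__main__":
    r = [(3, "a"), (1, "b"), (2, "c"), (3, "d")]
    s = [(3, "x"), (2, "y"), (3, "z"), (5, "w")]
    print(sorted(run_based_sort_merge_join(r, s, n_workers=2)))
```

Since every pair of runs is joined exactly once, the union of the per-run results equals the full join even though no global sort order is ever produced.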
Options in Scan Processing for Shared-Disk Parallel Database Systems
Shared-disk database systems offer a high degree of freedom in the allocation of workload compared to shared-nothing architectures. This creates a great potential for load balancing but also introduces additional complexity into the process of query scheduling. This report surveys the problems and opportunities faced in scan processing in a shared-disk environment. We list the parameters to tune and the decisions to make, as well as some known solutions and commonsense considerations, in order to identify the most promising areas of future research