Search CORE

23 research outputs found

Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems

Author: Albutiu Martina-Cezara
Kemper Alfons
Neumann Thomas
Publication venue
Publication date: 01/01/2012
Field of study

Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel multi-core data processing as it was deemed inferior to hash joins. We devise a suite of new massively parallel sort-merge (MPSM) join algorithms that are based on partial partition-based sorting. Contrary to classical sort-merge joins, our MPSM algorithms do not rely on a hard to parallelize final merge step to create one complete sort order. Rather they work on the independently created runs in parallel. This way our MPSM algorithms are NUMA-affine as all the sorting is carried out on local memory partitions. An extensive experimental evaluation on a modern 32-core machine with one TB of main memory proves the competitive performance of MPSM on large main memory databases with billions of objects. It scales (almost) linearly in the number of employed cores and clearly outperforms competing hash join proposals - in particular it outperforms the "cutting-edge" Vectorwise parallel query engine by a factor of four.Comment: VLDB201

arXiv.org e-Print Archive

CiteSeerX

Sort vs. Hash Join Revisited for Near-Memory Execution

Author: Falsafi Babak
Grot Boris
Kocberber Onur
Mirzadeh Nooshin S.
Publication venue
Publication date: 01/01/2015
Field of study

Data movement between memory and CPU is a well-known energy bottleneck for analytics. Near-Memory Processing (NMP) is a promising approach for eliminating this bottleneck by shifting the bulk of the computation toward memory arrays in emerging stacked DRAM chips. Recent work in this space has been limited to regular computations that can be localized to a single DRAM partition. This paper examines a Join workload, which is fundamental to analytics and is characterized by irregular memory access patterns. We consider several join algorithms and show that while near-data execution can improve both energy-efficiency and performance, effective NMP algorithms must consider locality, access granularity, and microarchitecture of the stacked memory devices

Infoscience - École polytechnique fédérale de Lausanne

Edinburgh Research Explorer

Main Memory Adaptive Indexing for Multi-core Systems

Author: Alvarez Victor
Dittrich Jens
Richter Stefan
Schuhknecht Felix Martin
Publication venue
Publication date: 01/01/2014
Field of study

Adaptive indexing is a concept that considers index creation in databases as a by-product of query processing; as opposed to traditional full index creation where the indexing effort is performed up front before answering any queries. Adaptive indexing has received a considerable amount of attention, and several algorithms have been proposed over the past few years; including a recent experimental study comparing a large number of existing methods. Until now, however, most adaptive indexing algorithms have been designed single-threaded, yet with multi-core systems already well established, the idea of designing parallel algorithms for adaptive indexing is very natural. In this regard only one parallel algorithm for adaptive indexing has recently appeared in the literature: The parallel version of standard cracking. In this paper we describe three alternative parallel algorithms for adaptive indexing, including a second variant of a parallel standard cracking algorithm. Additionally, we describe a hybrid parallel sorting algorithm, and a NUMA-aware method based on sorting. We then thoroughly compare all these algorithms experimentally; along a variant of a recently published parallel version of radix sort. Parallel sorting algorithms serve as a realistic baseline for multi-threaded adaptive indexing techniques. In total we experimentally compare seven parallel algorithms. Additionally, we extensively profile all considered algorithms. The initial set of experiments considered in this paper indicates that our parallel algorithms significantly improve over previously known ones. Our results suggest that, although adaptive indexing algorithms are a good design choice in single-threaded environments, the rules change considerably in the parallel case. That is, in future highly-parallel environments, sorting algorithms could be serious alternatives to adaptive indexing.Comment: 26 pages, 7 figure

arXiv.org e-Print Archive

Crossref

CISPA – Helmholtz-Zentrum für Informationssicherheit

A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

Author: Cisco
Davidson A.
Dehne F.
Harris M.
Kipfer P.
Krueger J.
Merrill D.
Wassenberg J.
Ye X.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 19/05/2017
Field of study

Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.Comment: 16 pages, accepted at SIGMOD 201

arXiv.org e-Print Archive

Crossref

Designing Databases for Future High-Performance Networks

Author: Claude Barthels
Gustavo Alonso
Torsten Hoefler
Publication venue
Publication date: 05/03/2020
Field of study

CiteSeerX

How to Stop Under-Utilization and Love Multicores

Author: Ailamaki Anastasia
Liarou Erietta
Porobic Danica
Psaroudakis Iraklis
Tözün Pinar
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/02/2014
Field of study

Designing scalable transaction processing systems on modern hardware has been a challenge for almost a decade. Hardware trends oblige software to overcome three major challenges against systems scalability: (1) Exploiting the abundant thread-level parallelism provided by multicores, (2) Achieving predictively efficient execution despite the variability in communication latencies among cores on multisocket multicores, and (3) Taking advantage of the aggressive micro-architectural features. In this tutorial, we shed light on the above three challenges and survey recent proposals to alleviate them. First, we present a systematic way of eliminating scalability bottlenecks based on minimizing unbounded communication and show several techniques that apply the presented methodology to minimize bottlenecks in major components of transaction processing systems. Then, we analyze the problems that arise from the non-uniform nature of communication latencies on modern multisockets and ways to address them for systems that already scale well on multicores. Finally, we examine the sources of under-utilization within a modern processor and present insights and techniques to better exploit the micro-architectural resources of a processor by improving cache locality at the right level

Infoscience - École polytechnique fédérale de Lausanne

Crossref