19,630 research outputs found

    Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems

    Full text link
    Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel multi-core data processing as it was deemed inferior to hash joins. We devise a suite of new massively parallel sort-merge (MPSM) join algorithms that are based on partial partition-based sorting. Contrary to classical sort-merge joins, our MPSM algorithms do not rely on a hard to parallelize final merge step to create one complete sort order. Rather they work on the independently created runs in parallel. This way our MPSM algorithms are NUMA-affine as all the sorting is carried out on local memory partitions. An extensive experimental evaluation on a modern 32-core machine with one TB of main memory proves the competitive performance of MPSM on large main memory databases with billions of objects. It scales (almost) linearly in the number of employed cores and clearly outperforms competing hash join proposals - in particular it outperforms the "cutting-edge" Vectorwise parallel query engine by a factor of four.Comment: VLDB201

    Data Oblivious Algorithms for Multicores

    Get PDF
    As secure processors such as Intel SGX (with hyperthreading) become widely adopted, there is a growing appetite for private analytics on big data. Most prior works on data-oblivious algorithms adopt the classical PRAM model to capture parallelism. However, it is widely understood that PRAM does not best capture realistic multicore processors, nor does it reflect parallel programming models adopted in practice. In this paper, we initiate the study of parallel data oblivious algorithms on realistic multicores, best captured by the binary fork-join model of computation. We first show that data-oblivious sorting can be accomplished by a binary fork-join algorithm with optimal total work and optimal (cache-oblivious) cache complexity, and in O(log n log log n) span (i.e., parallel time) that matches the best-known insecure algorithm. Using our sorting algorithm as a core primitive, we show how to data-obliviously simulate general PRAM algorithms in the binary fork-join model with non-trivial efficiency. We also present results for several applications including list ranking, Euler tour, tree contraction, connected components, and minimum spanning forest. For a subset of these applications, our data-oblivious algorithms asymptotically outperform the best known insecure algorithms. For other applications, we show data oblivious algorithms whose performance bounds match the best known insecure algorithms. Complementing these asymptotically efficient results, we present a practical variant of our sorting algorithm that is self-contained and potentially implementable. It has optimal caching cost, and it is only a log log n factor off from optimal work and about a log n factor off in terms of span; moreover, it achieves small constant factors in its bounds

    Main Memory Adaptive Indexing for Multi-core Systems

    Full text link
    Adaptive indexing is a concept that considers index creation in databases as a by-product of query processing; as opposed to traditional full index creation where the indexing effort is performed up front before answering any queries. Adaptive indexing has received a considerable amount of attention, and several algorithms have been proposed over the past few years; including a recent experimental study comparing a large number of existing methods. Until now, however, most adaptive indexing algorithms have been designed single-threaded, yet with multi-core systems already well established, the idea of designing parallel algorithms for adaptive indexing is very natural. In this regard only one parallel algorithm for adaptive indexing has recently appeared in the literature: The parallel version of standard cracking. In this paper we describe three alternative parallel algorithms for adaptive indexing, including a second variant of a parallel standard cracking algorithm. Additionally, we describe a hybrid parallel sorting algorithm, and a NUMA-aware method based on sorting. We then thoroughly compare all these algorithms experimentally; along a variant of a recently published parallel version of radix sort. Parallel sorting algorithms serve as a realistic baseline for multi-threaded adaptive indexing techniques. In total we experimentally compare seven parallel algorithms. Additionally, we extensively profile all considered algorithms. The initial set of experiments considered in this paper indicates that our parallel algorithms significantly improve over previously known ones. Our results suggest that, although adaptive indexing algorithms are a good design choice in single-threaded environments, the rules change considerably in the parallel case. That is, in future highly-parallel environments, sorting algorithms could be serious alternatives to adaptive indexing.Comment: 26 pages, 7 figure

    Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers

    Full text link
    We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known O(Q+Sâ‹…(M/B))O(Q+S \cdot (M/B)) caching cost for the randomized work stealing (RWS) scheduler, where SS is the number of steals, QQ is the sequential caching cost, and MM and BB are the cache size and block (or cache line) size respectively.Comment: Extended abstract in Proceedings of ACM Symp. on Parallel Alg. and Architectures (SPAA) 2017, pp. 339-350. This revision has a few small updates including a missing citation and the replacement of some big Oh terms with precise constant

    Parallel Finger Search Structures

    Get PDF
    In this paper we present two versions of a parallel finger structure FS on p processors that supports searches, insertions and deletions, and has a finger at each end. This is to our knowledge the first implementation of a parallel search structure that is work-optimal with respect to the finger bound and yet has very good parallelism (within a factor of O(log p)^2) of optimal). We utilize an extended implicit batching framework that transparently facilitates the use of FS by any parallel program P that is modelled by a dynamically generated DAG D where each node is either a unit-time instruction or a call to FS. The work done by FS is bounded by the finger bound F_L (for some linearization L of D), i.e. each operation on an item with distance r from a finger takes O(log r+1) amortized work. Running P using the simpler version takes O((T_1+F_L)/p + T_infty + d * ((log p)^2 + log n)) time on a greedy scheduler, where T_1, T_infty are the size and span of D respectively, and n is the maximum number of items in FS, and d is the maximum number of calls to FS along any path in D. Using the faster version, this is reduced to O((T_1+F_L)/p + T_infty + d *(log p)^2 + s_L) time, where s_L is the weighted span of D where each call to FS is weighted by its cost according to F_L. FS can be extended to a fixed number of movable fingers. The data structures in our paper fit into the dynamic multithreading paradigm, and their performance bounds are directly composable with other data structures given in the same paradigm. Also, the results can be translated to practical implementations using work-stealing schedulers

    An empirical evaluation of High-Level Synthesis languages and tools for database acceleration

    Get PDF
    High Level Synthesis (HLS) languages and tools are emerging as the most promising technique to make FPGAs more accessible to software developers. Nevertheless, picking the most suitable HLS for a certain class of algorithms depends on requirements such as area and throughput, as well as on programmer experience. In this paper, we explore the different trade-offs present when using a representative set of HLS tools in the context of Database Management Systems (DBMS) acceleration. More specifically, we conduct an empirical analysis of four representative frameworks (Bluespec SystemVerilog, Altera OpenCL, LegUp and Chisel) that we utilize to accelerate commonly-used database algorithms such as sorting, the median operator, and hash joins. Through our implementation experience and empirical results for database acceleration, we conclude that the selection of the most suitable HLS depends on a set of orthogonal characteristics, which we highlight for each HLS framework.Peer ReviewedPostprint (author’s final draft
    • …
    corecore