Search CORE

12,910 research outputs found

Parallel Performance of MPI Sorting Algorithms on Dual-Core Processor Windows-Based Systems

Author: Elnashar Alaa Ismail
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 30/05/2011
Field of study

Message Passing Interface (MPI) is widely used to implement parallel programs. Although Windowsbased architectures provide the facilities of parallel execution and multi-threading, little attention has been focused on using MPI on these platforms. In this paper we use the dual core Window-based platform to study the effect of parallel processes number and also the number of cores on the performance of three MPI parallel implementations for some sorting algorithms

arXiv.org e-Print Archive

CiteSeerX

Crossref

Main Memory Adaptive Indexing for Multi-core Systems

Author: Alvarez Victor
Dittrich Jens
Richter Stefan
Schuhknecht Felix Martin
Publication venue
Publication date: 01/01/2014
Field of study

Adaptive indexing is a concept that considers index creation in databases as a by-product of query processing; as opposed to traditional full index creation where the indexing effort is performed up front before answering any queries. Adaptive indexing has received a considerable amount of attention, and several algorithms have been proposed over the past few years; including a recent experimental study comparing a large number of existing methods. Until now, however, most adaptive indexing algorithms have been designed single-threaded, yet with multi-core systems already well established, the idea of designing parallel algorithms for adaptive indexing is very natural. In this regard only one parallel algorithm for adaptive indexing has recently appeared in the literature: The parallel version of standard cracking. In this paper we describe three alternative parallel algorithms for adaptive indexing, including a second variant of a parallel standard cracking algorithm. Additionally, we describe a hybrid parallel sorting algorithm, and a NUMA-aware method based on sorting. We then thoroughly compare all these algorithms experimentally; along a variant of a recently published parallel version of radix sort. Parallel sorting algorithms serve as a realistic baseline for multi-threaded adaptive indexing techniques. In total we experimentally compare seven parallel algorithms. Additionally, we extensively profile all considered algorithms. The initial set of experiments considered in this paper indicates that our parallel algorithms significantly improve over previously known ones. Our results suggest that, although adaptive indexing algorithms are a good design choice in single-threaded environments, the rules change considerably in the parallel case. That is, in future highly-parallel environments, sorting algorithms could be serious alternatives to adaptive indexing.Comment: 26 pages, 7 figure

arXiv.org e-Print Archive

Crossref

CISPA – Helmholtz-Zentrum für Informationssicherheit

A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

Author: Cisco
Davidson A.
Dehne F.
Harris M.
Kipfer P.
Krueger J.
Merrill D.
Wassenberg J.
Ye X.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 19/05/2017
Field of study

Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.Comment: 16 pages, accepted at SIGMOD 201

arXiv.org e-Print Archive

Crossref

Platform Dependent Verification: On Engineering Verification Tools for 21st Century

Author: A. Aggarwal
A. B. Kahn
Alfons Laarman
Armin Biere
B. R. Haverkort
Boudewijn R. Haverkort
Brad Bingham
Cornelia P. Inggs
D. Bosnacki
David L. Dill
Doron Peled
E. Allen Emerson
E. M. Clarke
E.M. Clarke
Flavio Lerda
Flavio Lerda
G. Behrmann
G. Ciardo
G. Jayachandran
Gerard J. Holzmann
Gerard J. Holzmann
Gerard J. Holzmann
Gianfranco Ciardo
Giuseppe Della Penna
H. Garavel
I. Černá
I. Černá
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. Barnat
J. R. Burch
Jaco Geldenhuys
Jiří Barnat
Jiří Barnat
K. Verstoep
Keijo Heljanko
Keijo Heljanko
L. Brim
L. Brim
Luboš Brim
M.Y. Vardi
Michael Jones
Moritz Hammer
Naga K. Govindaraju
P. Harish
Peter Lamborn
R. Korf
R. Korf
R. Pel\IeC ánek
Rahul Kumar
Rong Zhou
S. Allmaier
S. Caselli
Sami Evangelista
Shahid Jabbar
Shahid Jabbar
Stefan Edelkamp
T. von Eicken
Tonglaga Bao
U. Stern
U. Stern
W. Knottenbelt
W. Knottenbelt
Yi-Jen Chiang
Publication venue: 'Open Publishing Association'
Publication date: 01/10/2011
Field of study

The paper overviews recent developments in platform-dependent explicit-state LTL model checking.Comment: In Proceedings PDMC 2011, arXiv:1111.006

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems

Author: Albutiu Martina-Cezara
Kemper Alfons
Neumann Thomas
Publication venue
Publication date: 01/01/2012
Field of study

Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel multi-core data processing as it was deemed inferior to hash joins. We devise a suite of new massively parallel sort-merge (MPSM) join algorithms that are based on partial partition-based sorting. Contrary to classical sort-merge joins, our MPSM algorithms do not rely on a hard to parallelize final merge step to create one complete sort order. Rather they work on the independently created runs in parallel. This way our MPSM algorithms are NUMA-affine as all the sorting is carried out on local memory partitions. An extensive experimental evaluation on a modern 32-core machine with one TB of main memory proves the competitive performance of MPSM on large main memory databases with billions of objects. It scales (almost) linearly in the number of employed cores and clearly outperforms competing hash join proposals - in particular it outperforms the "cutting-edge" Vectorwise parallel query engine by a factor of four.Comment: VLDB201

arXiv.org e-Print Archive

CiteSeerX

A Parallel Monte Carlo Code for Simulating Collisional N-body Systems

Author: Alok Choudhary
Bharath Pattabiraman
Binney
Böker
Chatterjee
Collins
Frederic A. Rasio
Fregeau
Fregeau
Giersz
Gokhan Memik
Goswami
Gropp
Heggie
Heggie
Heggie
Joshi
Joshi
L'Ecuyer
Li
Lightman
Lusk
McLaughlin
Merritt
Miller
Nvidia.
Spitzer
Stefan Umbreit
Stodolkiewicz
Trenti
Umbreit
Vassiliki Kalogera
Wei-keng Liao
Publication venue: 'IOP Publishing'
Publication date: 15/11/2012
Field of study

We present a new parallel code for computing the dynamical evolution of collisional N-body systems with up to N~10^7 particles. Our code is based on the the Henon Monte Carlo method for solving the Fokker-Planck equation, and makes assumptions of spherical symmetry and dynamical equilibrium. The principal algorithmic developments involve optimizing data structures, and the introduction of a parallel random number generation scheme, as well as a parallel sorting algorithm, required to find nearest neighbors for interactions and to compute the gravitational potential. The new algorithms we introduce along with our choice of decomposition scheme minimize communication costs and ensure optimal distribution of data and workload among the processing units. The implementation uses the Message Passing Interface (MPI) library for communication, which makes it portable to many different supercomputing architectures. We validate the code by calculating the evolution of clusters with initial Plummer distribution functions up to core collapse with the number of stars, N, spanning three orders of magnitude, from 10^5 to 10^7. We find that our results are in good agreement with self-similar core-collapse solutions, and the core collapse times generally agree with expectations from the literature. Also, we observe good total energy conservation, within less than 0.04% throughout all simulations. We analyze the performance of the code, and demonstrate near-linear scaling of the runtime with the number of processors up to 64 processors for N=10^5, 128 for N=10^6 and 256 for N=10^7. The runtime reaches a saturation with the addition of more processors beyond these limits which is a characteristic of the parallel sorting algorithm. The resulting maximum speedups we achieve are approximately 60x, 100x, and 220x, respectively.Comment: 53 pages, 13 figures, accepted for publication in ApJ Supplement

arXiv.org e-Print Archive

Crossref

One machine, one minute, three billion tetrahedra

Author: Marot Célestin
Pellerin Jeanne
Remacle Jean-François
Publication venue: 'Wiley'
Publication date: 01/01/2018
Field of study

This paper presents a new scalable parallelization scheme to generate the 3D Delaunay triangulation of a given set of points. Our first contribution is an efficient serial implementation of the incremental Delaunay insertion algorithm. A simple dedicated data structure, an efficient sorting of the points and the optimization of the insertion algorithm have permitted to accelerate reference implementations by a factor three. Our second contribution is a multi-threaded version of the Delaunay kernel that is able to concurrently insert vertices. Moore curve coordinates are used to partition the point set, avoiding heavy synchronization overheads. Conflicts are managed by modifying the partitions with a simple rescaling of the space-filling curve. The performances of our implementation have been measured on three different processors, an Intel core-i7, an Intel Xeon Phi and an AMD EPYC, on which we have been able to compute 3 billion tetrahedra in 53 seconds. This corresponds to a generation rate of over 55 million tetrahedra per second. We finally show how this very efficient parallel Delaunay triangulation can be integrated in a Delaunay refinement mesh generator which takes as input the triangulated surface boundary of the volume to mesh

arXiv.org e-Print Archive

DIAL UCLouvain