12,910 research outputs found
Parallel Performance of MPI Sorting Algorithms on Dual-Core Processor Windows-Based Systems
Message Passing Interface (MPI) is widely used to implement parallel
programs. Although Windowsbased architectures provide the facilities of
parallel execution and multi-threading, little attention has been focused on
using MPI on these platforms. In this paper we use the dual core Window-based
platform to study the effect of parallel processes number and also the number
of cores on the performance of three MPI parallel implementations for some
sorting algorithms
Main Memory Adaptive Indexing for Multi-core Systems
Adaptive indexing is a concept that considers index creation in databases as
a by-product of query processing; as opposed to traditional full index creation
where the indexing effort is performed up front before answering any queries.
Adaptive indexing has received a considerable amount of attention, and several
algorithms have been proposed over the past few years; including a recent
experimental study comparing a large number of existing methods. Until now,
however, most adaptive indexing algorithms have been designed single-threaded,
yet with multi-core systems already well established, the idea of designing
parallel algorithms for adaptive indexing is very natural. In this regard only
one parallel algorithm for adaptive indexing has recently appeared in the
literature: The parallel version of standard cracking. In this paper we
describe three alternative parallel algorithms for adaptive indexing, including
a second variant of a parallel standard cracking algorithm. Additionally, we
describe a hybrid parallel sorting algorithm, and a NUMA-aware method based on
sorting. We then thoroughly compare all these algorithms experimentally; along
a variant of a recently published parallel version of radix sort. Parallel
sorting algorithms serve as a realistic baseline for multi-threaded adaptive
indexing techniques. In total we experimentally compare seven parallel
algorithms. Additionally, we extensively profile all considered algorithms. The
initial set of experiments considered in this paper indicates that our parallel
algorithms significantly improve over previously known ones. Our results
suggest that, although adaptive indexing algorithms are a good design choice in
single-threaded environments, the rules change considerably in the parallel
case. That is, in future highly-parallel environments, sorting algorithms could
be serious alternatives to adaptive indexing.Comment: 26 pages, 7 figure
A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs
Sorting is at the core of many database operations, such as index creation,
sort-merge joins, and user-requested output sorting. As GPUs are emerging as a
promising platform to accelerate various operations, sorting on GPUs becomes a
viable endeavour. Over the past few years, several improvements have been
proposed for sorting on GPUs, leading to the first radix sort implementations
that achieve a sorting rate of over one billion 32-bit keys per second. Yet,
state-of-the-art approaches are heavily memory bandwidth-bound, as they require
substantially more memory transfers than their CPU-based counterparts.
Our work proposes a novel approach that almost halves the amount of memory
transfers and, therefore, considerably lifts the memory bandwidth limitation.
Being able to sort two gigabytes of eight-byte records in as little as 50
milliseconds, our approach achieves a 2.32-fold improvement over the
state-of-the-art GPU-based radix sort for uniform distributions, sustaining a
minimum speed-up of no less than a factor of 1.66 for skewed distributions.
To address inputs that either do not reside on the GPU or exceed the
available device memory, we build on our efficient GPU sorting approach with a
pipelined heterogeneous sorting algorithm that mitigates the overhead
associated with PCIe data transfers. Comparing the end-to-end sorting
performance to the state-of-the-art CPU-based radix sort running 16 threads,
our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for
sorting 64 GB key-value pairs with a skewed and a uniform distribution,
respectively.Comment: 16 pages, accepted at SIGMOD 201
Platform Dependent Verification: On Engineering Verification Tools for 21st Century
The paper overviews recent developments in platform-dependent explicit-state
LTL model checking.Comment: In Proceedings PDMC 2011, arXiv:1111.006
Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems
Two emerging hardware trends will dominate the database system technology in
the near future: increasing main memory capacities of several TB per server and
massively parallel multi-core processing. Many algorithmic and control
techniques in current database technology were devised for disk-based systems
where I/O dominated the performance. In this work we take a new look at the
well-known sort-merge join which, so far, has not been in the focus of research
in scalable massively parallel multi-core data processing as it was deemed
inferior to hash joins. We devise a suite of new massively parallel sort-merge
(MPSM) join algorithms that are based on partial partition-based sorting.
Contrary to classical sort-merge joins, our MPSM algorithms do not rely on a
hard to parallelize final merge step to create one complete sort order. Rather
they work on the independently created runs in parallel. This way our MPSM
algorithms are NUMA-affine as all the sorting is carried out on local memory
partitions. An extensive experimental evaluation on a modern 32-core machine
with one TB of main memory proves the competitive performance of MPSM on large
main memory databases with billions of objects. It scales (almost) linearly in
the number of employed cores and clearly outperforms competing hash join
proposals - in particular it outperforms the "cutting-edge" Vectorwise parallel
query engine by a factor of four.Comment: VLDB201
A Parallel Monte Carlo Code for Simulating Collisional N-body Systems
We present a new parallel code for computing the dynamical evolution of
collisional N-body systems with up to N~10^7 particles. Our code is based on
the the Henon Monte Carlo method for solving the Fokker-Planck equation, and
makes assumptions of spherical symmetry and dynamical equilibrium. The
principal algorithmic developments involve optimizing data structures, and the
introduction of a parallel random number generation scheme, as well as a
parallel sorting algorithm, required to find nearest neighbors for interactions
and to compute the gravitational potential. The new algorithms we introduce
along with our choice of decomposition scheme minimize communication costs and
ensure optimal distribution of data and workload among the processing units.
The implementation uses the Message Passing Interface (MPI) library for
communication, which makes it portable to many different supercomputing
architectures. We validate the code by calculating the evolution of clusters
with initial Plummer distribution functions up to core collapse with the number
of stars, N, spanning three orders of magnitude, from 10^5 to 10^7. We find
that our results are in good agreement with self-similar core-collapse
solutions, and the core collapse times generally agree with expectations from
the literature. Also, we observe good total energy conservation, within less
than 0.04% throughout all simulations. We analyze the performance of the code,
and demonstrate near-linear scaling of the runtime with the number of
processors up to 64 processors for N=10^5, 128 for N=10^6 and 256 for N=10^7.
The runtime reaches a saturation with the addition of more processors beyond
these limits which is a characteristic of the parallel sorting algorithm. The
resulting maximum speedups we achieve are approximately 60x, 100x, and 220x,
respectively.Comment: 53 pages, 13 figures, accepted for publication in ApJ Supplement
One machine, one minute, three billion tetrahedra
This paper presents a new scalable parallelization scheme to generate the 3D
Delaunay triangulation of a given set of points. Our first contribution is an
efficient serial implementation of the incremental Delaunay insertion
algorithm. A simple dedicated data structure, an efficient sorting of the
points and the optimization of the insertion algorithm have permitted to
accelerate reference implementations by a factor three. Our second contribution
is a multi-threaded version of the Delaunay kernel that is able to concurrently
insert vertices. Moore curve coordinates are used to partition the point set,
avoiding heavy synchronization overheads. Conflicts are managed by modifying
the partitions with a simple rescaling of the space-filling curve. The
performances of our implementation have been measured on three different
processors, an Intel core-i7, an Intel Xeon Phi and an AMD EPYC, on which we
have been able to compute 3 billion tetrahedra in 53 seconds. This corresponds
to a generation rate of over 55 million tetrahedra per second. We finally show
how this very efficient parallel Delaunay triangulation can be integrated in a
Delaunay refinement mesh generator which takes as input the triangulated
surface boundary of the volume to mesh
- …