Design Principles for Sparse Matrix Multiplication on the GPU
We implement two novel algorithms for sparse-matrix dense-matrix
multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the
popular compressed-sparse-row (CSR) format and thus do not require expensive
format conversion. While previous SpMM work concentrates on thread-level
parallelism, we additionally focus on latency hiding with instruction-level
parallelism and load-balancing. We show, both theoretically and experimentally,
that the proposed SpMM is a better fit for the GPU than previous approaches. We
identify a key memory access pattern that allows efficient access into both
input and output matrices that is crucial to getting excellent performance on
SpMM. By combining these two ingredients---(i) merge-based load-balancing and
(ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and
a 31.7% geomean speedup over state-of-the-art SpMM implementations on
real-world datasets.
Comment: 16 pages, 7 figures, International European Conference on Parallel
and Distributed Computing (Euro-Par) 201
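For orientation, the operation this abstract optimizes can be written as a plain row-wise reference loop. The sketch below is not the paper's GPU kernels (no merge-based load balancing, no coalescing); it only shows what SpMM over a CSR input computes. The names `spmm_csr`, `row_ptr`, `col_idx`, and `values` are illustrative assumptions, and the dense matrix is assumed to be stored row-major as a list of rows.

```python
def spmm_csr(row_ptr, col_idx, values, B):
    """Compute C = A @ B with A given in CSR form and B as a row-major list of rows."""
    n_rows = len(row_ptr) - 1
    n_cols = len(B[0]) if B else 0
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # The nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = values[k]
            b_row = B[col_idx[k]]  # each nonzero touches one contiguous row of B
            for j in range(n_cols):
                C[i][j] += a * b_row[j]
    return C


# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]]
row_ptr = [0, 2, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = spmm_csr(row_ptr, col_idx, values, B)
```

Because each nonzero reads a whole row of B and updates a whole row of C, storing both matrices row-major keeps the inner loop over contiguous memory, which is the access pattern the abstract singles out.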
Load-Balancing for Parallel Delaunay Triangulations
Computing the Delaunay triangulation (DT) of a given point set in $\mathbb{R}^D$
is one of the fundamental operations in computational geometry.
Recently, Funke and Sanders (2017) presented a divide-and-conquer DT algorithm
that merges two partial triangulations by re-triangulating a small subset of
their vertices - the border vertices - and combining the three triangulations
efficiently via parallel hash table lookups. The input point division should
therefore yield roughly equal-sized partitions for good load-balancing and also
result in a small number of border vertices for fast merging. In this paper, we
present a novel divide-step based on partitioning the triangulation of a small
sample of the input points. In experiments on synthetic and real-world data
sets, we achieve nearly perfectly balanced partitions and small border
triangulations. This almost cuts running time in half compared to
non-data-sensitive division schemes on inputs exhibiting an exploitable
underlying structure.
Comment: Short version submitted to EuroPar 201
RadixSpline: A Single-Pass Learned Index
Recent research has shown that learned models can outperform state-of-the-art
index structures in size and lookup performance. While this is a very promising
result, existing learned structures are often cumbersome to implement and are
slow to build. In fact, most approaches that we are aware of require multiple
training passes over the data.
We introduce RadixSpline (RS), a learned index that can be built in a single
pass over the data and is competitive with state-of-the-art learned index
models, like RMI, in size and lookup performance. We evaluate RS using the SOSD
benchmark and show that it achieves competitive results on all datasets,
despite the fact that it only has two parameters.
Comment: Third International Workshop on Exploiting Artificial Intelligence
Techniques for Data Management (aiDM 2020
Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems
Two emerging hardware trends will dominate the database system technology in
the near future: increasing main memory capacities of several TB per server and
massively parallel multi-core processing. Many algorithmic and control
techniques in current database technology were devised for disk-based systems
where I/O dominated the performance. In this work we take a new look at the
well-known sort-merge join which, so far, has not been in the focus of research
in scalable massively parallel multi-core data processing as it was deemed
inferior to hash joins. We devise a suite of new massively parallel sort-merge
(MPSM) join algorithms that are based on partial partition-based sorting.
Contrary to classical sort-merge joins, our MPSM algorithms do not rely on a
hard-to-parallelize final merge step to create one complete sort order. Rather,
they work on the independently created runs in parallel. This way our MPSM
algorithms are NUMA-affine as all the sorting is carried out on local memory
partitions. An extensive experimental evaluation on a modern 32-core machine
with one TB of main memory proves the competitive performance of MPSM on large
main memory databases with billions of objects. It scales (almost) linearly in
the number of employed cores and clearly outperforms competing hash join
proposals - in particular it outperforms the "cutting-edge" Vectorwise parallel
query engine by a factor of four.
Comment: VLDB201
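The central idea in this abstract, joining independently sorted runs without ever merging them into one global order, can be sketched in a few lines. This is a sequential illustration under stated assumptions, not the authors' implementation: the function names, the round-robin chunking, and the choice to join every run of one input against every run of the other are all hypothetical simplifications, and the NUMA placement is not modeled.

```python
def split_into_runs(rows, n_workers):
    """Deal rows round-robin to workers, then sort each worker's chunk by key."""
    chunks = [rows[w::n_workers] for w in range(n_workers)]
    return [sorted(chunk, key=lambda row: row[0]) for chunk in chunks]


def merge_join(r_run, s_run):
    """Merge-join two runs of (key, payload) rows that are each sorted by key."""
    out = []
    i = j = 0
    while i < len(r_run) and j < len(s_run):
        rk, sk = r_run[i][0], s_run[j][0]
        if rk < sk:
            i += 1
        elif rk > sk:
            j += 1
        else:
            # Pair the current R row with every S row that shares its key.
            j_end = j
            while j_end < len(s_run) and s_run[j_end][0] == rk:
                out.append((rk, r_run[i][1], s_run[j_end][1]))
                j_end += 1
            i += 1  # j stays put so a duplicate R key sees the same S group
    return out


def mpsm_join_sketch(R, S, n_workers=2):
    """Join R and S on key using only locally sorted runs, with no global merge."""
    r_runs = split_into_runs(R, n_workers)
    s_runs = split_into_runs(S, n_workers)
    result = []
    for r_run in r_runs:       # each iteration stands in for one worker
        for s_run in s_runs:   # that worker scans every run of the other input
            result.extend(merge_join(r_run, s_run))
    return result


R = [(3, "r3"), (1, "r1"), (2, "r2a"), (2, "r2b")]
S = [(2, "s2"), (4, "s4"), (3, "s3a"), (3, "s3b")]
joined = sorted(mpsm_join_sketch(R, S))
```

Each outer iteration touches only its own sorted run plus read-only scans of the other input's runs, so the iterations are independent and could be handed to separate cores.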
Achieving Extreme Resolution in Numerical Cosmology Using Adaptive Mesh Refinement: Resolving Primordial Star Formation
As an entry for the 2001 Gordon Bell Award in the "special" category, we
describe our 3-d, hybrid, adaptive mesh refinement (AMR) code, Enzo, designed
for high-resolution, multiphysics, cosmological structure formation
simulations. Our parallel implementation places no limit on the depth or
complexity of the adaptive grid hierarchy, allowing us to achieve unprecedented
spatial and temporal dynamic range. We report on a simulation of primordial
star formation which develops over 8000 subgrids at 34 levels of refinement to
achieve a local refinement of a factor of 10^12 in space and time. This allows
us to resolve the properties of the first stars which form in the universe
assuming standard physics and a standard cosmological model. Achieving extreme
resolution requires the use of 128-bit extended precision arithmetic (EPA) to
accurately specify the subgrid positions. We describe our EPA AMR
implementation on the IBM SP2 Blue Horizon system at the San Diego
Supercomputer Center.
Comment: 23 pages, 5 figures. Peer reviewed technical paper accepted to the
proceedings of Supercomputing 2001. This entry was a Gordon Bell Prize
finalist. For more information visit http://www.TomAbel.com/GB
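A short piece of arithmetic suggests why 64-bit coordinates run out at this dynamic range. The sketch below assumes a domain of unit length and treats the factor of 10^12 as the ratio of domain length to finest cell width; both are simplifying assumptions, not details taken from the paper. If the factor is instead measured relative to a root grid cell, the margin is smaller still.

```python
import sys
from fractions import Fraction

refinement = 10**12                   # local refinement factor quoted in the abstract
cell_width = Fraction(1, refinement)  # finest cell width, assuming a unit-length domain

# Gap between adjacent 64-bit floats in [0.5, 1): half the machine epsilon,
# i.e. 2**-53. Coordinates in the upper half of the domain cannot be finer.
spacing = Fraction(sys.float_info.epsilon) / 2

# Number of distinct 64-bit coordinate values that fit across one finest cell.
positions_per_cell = cell_width / spacing
print(float(positions_per_cell))
```

Under these assumptions only about nine thousand distinct double-precision values fit across a finest-level cell, which leaves little headroom for positioning subgrids and accumulating small offsets within it.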
A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters
Sustaining a large fraction of single GPU performance in parallel
computations is considered to be the major problem of GPU-based clusters. In
this article, this topic is addressed in the context of a lattice Boltzmann
flow solver that is integrated in the WaLBerla software framework. We propose a
multi-GPU implementation using a block-structured MPI parallelization, suitable
for load balancing and heterogeneous computations on CPUs and GPUs. The
overhead required for multi-GPU simulations is discussed in detail and it is
demonstrated that the kernel performance can be sustained to a large extent.
With our GPU implementation, we achieve nearly perfect weak scalability on
InfiniBand clusters. However, in strong scaling scenarios multi-GPUs make less
efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost
analysis must determine the best course of action for a particular simulation
task. Additionally, weak scaling results of heterogeneous simulations conducted
on CPUs and GPUs simultaneously are presented using clusters equipped with
varying node configurations.
Comment: 20 pages, 12 figure
Scalable Parallel Numerical Constraint Solver Using Global Load Balancing
We present a scalable parallel solver for numerical constraint satisfaction
problems (NCSPs). Our parallelization scheme consists of homogeneous worker
solvers, each of which runs on an available core and communicates with others
via the global load balancing (GLB) method. The parallel solver is implemented
in X10, which provides an implementation of GLB as a library. In experiments,
several NCSPs from the literature were solved and attained up to 516-fold
speedup using 600 cores of the TSUBAME2.5 supercomputer.
Comment: To be presented at X10'15 Worksho