76,527 research outputs found
A well-separated pairs decomposition algorithm for k-d trees implemented on multi-core architectures
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.Variations of k-d trees represent a fundamental data structure used in Computational Geometry with numerous applications in science. For example particle track tting in the software of the LHC experiments, and in simulations of N-body systems in the study of dynamics of interacting galaxies, particle beam physics, and molecular dynamics in biochemistry. The many-body tree methods devised by Barnes and Hutt in the 1980s and the Fast Multipole Method introduced in 1987 by Greengard and Rokhlin use variants of k-d trees to reduce the computation time upper bounds to O(n log n) and even O(n) from O(n2). We present an algorithm that uses the principle of well-separated pairs decomposition to always produce compressed trees in O(n log n) work. We present and evaluate parallel implementations for the algorithm that can take advantage of multi-core architectures.The Science and Technology Facilities Council, UK
A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
We present a hierarchically blocked one-sided Jacobi algorithm for the
singular value decomposition (SVD), targeting both single and multiple graphics
processing units (GPUs). The blocking structure reflects the levels of GPU's
memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining
high relative accuracy. To this end, we developed a family of parallel pivot
strategies on GPU's shared address space, but applicable also to inter-GPU
communication. Unlike common hybrid approaches, our algorithm in a single GPU
setting needs a CPU for the controlling purposes only, while utilizing GPU's
resources to the fullest extent permitted by the hardware. When required by the
problem size, the algorithm, in principle, scales to an arbitrary number of GPU
nodes. The scalability is demonstrated by more than twofold speedup for
sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single
Fermi card.Comment: Accepted for publication in SIAM Journal on Scientific Computin
- …