Parallelized Inference for Gravitational-Wave Astronomy
Bayesian inference is the workhorse of gravitational-wave astronomy, for
example, determining the mass and spins of merging black holes, revealing the
neutron star equation of state, and unveiling the population properties of
compact binaries. The science enabled by these inferences comes with a
computational cost that can limit the questions we are able to answer. This
cost is expected to grow. As detectors improve, the detection rate will go up,
allowing less time to analyze each event. Improvement in low-frequency
sensitivity will yield longer signals, increasing the number of computations
per event. The growing number of entries in the transient catalog will drive up
the cost of population studies. While Bayesian inference calculations are not
entirely parallelizable, key components are embarrassingly parallel:
calculating the gravitational waveform and evaluating the likelihood function.
Graphics processing units (GPUs) are adept at such parallel calculations. We
report on progress porting gravitational-wave inference calculations to GPUs.
Using a single code - which takes advantage of GPU architecture if it is
available - we compare computation times using modern GPUs (NVIDIA P100) and
CPUs (Intel Gold 6140). We demonstrate speed-ups for compact binary
coalescence gravitational waveform generation and likelihood evaluation, and
still greater speed-ups for population inference within the lifetime of
current detectors. Further improvement is likely with continued
development. Our Python-based code is publicly available and can be used
without familiarity with CUDA, the underlying parallel computing platform.
Comment: 5 pages, 4 figures, submitted to PRD; code can be found at
https://github.com/ColmTalbot/gwpopulation
https://github.com/ColmTalbot/GPUCBC
https://github.com/ADACS-Australia/ADACS-SS18A-RSmith
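The "single code" approach described above, where the same routines run on a GPU when one is available and on a CPU otherwise, can be sketched with a swappable array backend. This is a minimal illustration only: the function name and likelihood form here are hypothetical and are not the actual API of the gwpopulation or GPUCBC packages.

```python
# Backend-agnostic likelihood sketch: use CuPy (GPU) when available,
# otherwise fall back to NumPy (CPU). Names are illustrative, not the
# actual API of the packages linked above.
try:
    import cupy as xp  # GPU arrays; elementwise ops run as CUDA kernels
except ImportError:
    import numpy as xp  # CPU fallback with the same array interface


def gaussian_log_likelihood(data, template, sigma):
    """Log-likelihood of data given a waveform template.

    Every operation is an elementwise array operation, so the work
    parallelizes trivially across frequency bins on a GPU.
    """
    residual = data - template
    return float(-0.5 * xp.sum(xp.abs(residual) ** 2 / sigma ** 2))
```

Because NumPy and CuPy expose the same interface for these operations, identical source code runs on either device, which is the sense in which waveform generation and likelihood evaluation are embarrassingly parallel.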
A Lower Bound Technique for Communication in BSP
Communication is a major factor determining the performance of algorithms on
current computing systems; it is therefore valuable to provide tight lower
bounds on the communication complexity of computations. This paper presents a
lower bound technique for the communication complexity in the bulk-synchronous
parallel (BSP) model of a given class of DAG computations. The derived bound is
expressed in terms of the switching potential of a DAG, that is, the number of
permutations that the DAG can realize when viewed as a switching network. The
proposed technique yields tight lower bounds for the fast Fourier transform
(FFT), and for any sorting and permutation network. A stronger bound is also
derived for the periodic balanced sorting network, by applying this technique
to suitable subnetworks. Finally, we demonstrate that the switching potential
captures communication requirements even in computational models different from
BSP, such as the I/O model and the LPRAM model.
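The switching potential can be made concrete with a brute-force count on a toy network. The two-stage, 4-input butterfly below is an illustrative construction, not one of the paper's networks: each 2x2 switch either passes or swaps its inputs, and the switching potential is the number of distinct input-to-output permutations over all switch settings.

```python
from itertools import product

# Toy 4-input, 2-stage butterfly; each pair is a 2x2 switch.
STAGES = [
    [(0, 1), (2, 3)],  # stage 1: switches on adjacent pairs
    [(0, 2), (1, 3)],  # stage 2: switches across the two halves
]


def realizable_permutations(stages):
    """Enumerate all switch settings and collect the distinct
    input-to-output permutations the network realizes."""
    switches = [sw for stage in stages for sw in stage]
    perms = set()
    for settings in product([False, True], repeat=len(switches)):
        wires = list(range(4))  # wires[pos] = original input index
        for (a, b), do_swap in zip(switches, settings):
            if do_swap:
                wires[a], wires[b] = wires[b], wires[a]
        perms.add(tuple(wires))
    return perms
```

Here all 2^4 = 16 switch settings yield distinct permutations, so this toy network's switching potential is 16 (out of the 4! = 24 possible permutations); the lower bound technique expresses communication cost in terms of this quantity.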
On the acceleration of wavefront applications using distributed many-core architectures
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications -- a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C, and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions far exceeds that of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithms on GPU-based architectures.
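The wavefront dependency pattern underlying this class of algorithms is simple to state: cell (i, j) depends on its north and west neighbours, so the grid is swept along anti-diagonals, and all cells on one anti-diagonal are independent. A minimal serial sketch of such a sweep follows; it is a generic illustration with a made-up update rule, not the LU benchmark itself.

```python
import numpy as np


def wavefront_sweep(grid):
    """Update a 2D grid in wavefront order: cell (i, j) is computed
    from its already-updated north (i-1, j) and west (i, j-1)
    neighbours. All cells on one anti-diagonal (constant i + j) are
    independent; this is the parallelism a GPU implementation exploits.
    The update rule here is an arbitrary stencil for illustration.
    """
    n, m = grid.shape
    out = grid.astype(float).copy()
    for d in range(2, n + m - 1):          # anti-diagonal index i + j
        # interior cells on this diagonal: 1 <= i <= n-1, 1 <= j <= m-1
        for i in range(max(1, d - m + 1), min(n, d)):
            j = d - i
            out[i, j] += 0.5 * (out[i - 1, j] + out[i, j - 1])
    return out
```

The inner loop over each anti-diagonal is the part that maps onto GPU threads; the outer loop over diagonals is inherently sequential, which is why strategies such as the k-blocking proposed in the paper matter for pipelining the work.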
Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors
In this paper we consider the problem of identifying intersections between
two sets of d-dimensional axis-parallel rectangles. This is a common problem
that arises in many agent-based simulation studies, and is of central
importance in the context of High Level Architecture (HLA), where it is at the
core of the Data Distribution Management (DDM) service. Several realizations of
the DDM service have been proposed; however, many of them are either
inefficient or inherently sequential. These are serious limitations since
multicore processors are now ubiquitous, and DDM algorithms -- being
CPU-intensive -- could benefit from additional computing power. We propose a
parallel version of the Sort-Based Matching algorithm for shared-memory
multiprocessors. Sort-Based Matching is one of the most efficient serial
algorithms for the DDM problem, but is quite difficult to parallelize due to
data dependencies. We describe the algorithm and compute its asymptotic running
time; we complete the analysis by assessing its performance and scalability
through extensive experiments on two commodity multicore systems based on a
dual socket Intel Xeon processor and a single socket Intel Core i7 processor.
Comment: Proceedings of the 21st ACM/IEEE International Symposium on
Distributed Simulation and Real Time Applications (DS-RT 2017). Best Paper
Award at DS-RT 2017
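The core idea of serial Sort-Based Matching is a single sort of all interval endpoints followed by a left-to-right sweep that maintains the sets of currently-open intervals. The one-dimensional sketch below illustrates that serial idea only (the paper's contribution is its shared-memory parallelization, which this does not show), and the function name is illustrative.

```python
def sort_based_matching(subs, upds):
    """Report all intersecting pairs between two sets of closed 1-D
    intervals (subscription and update extents, as in the DDM problem).

    Sort all endpoints once, then sweep: when an interval opens, it
    intersects exactly the currently-open intervals of the other set.
    """
    events = []  # (coordinate, kind, set_id, index)
    for sid, intervals in enumerate((subs, upds)):
        for idx, (lo, hi) in enumerate(intervals):
            events.append((lo, 0, sid, idx))  # kind 0 = open
            events.append((hi, 1, sid, idx))  # kind 1 = close
    # At equal coordinates, opens sort before closes, so intervals
    # that merely touch are counted as intersecting (closed intervals).
    events.sort()
    open_sets = (set(), set())
    matches = set()
    for _, kind, sid, idx in events:
        if kind == 0:
            for other in open_sets[1 - sid]:
                matches.add((idx, other) if sid == 0 else (other, idx))
            open_sets[sid].add(idx)
        else:
            open_sets[sid].discard(idx)
    return matches
```

The sweep itself carries a loop-carried dependency through the open sets, which is the data dependency the abstract identifies as making the algorithm difficult to parallelize.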