1,581 research outputs found
Accelerating MCMC via Parallel Predictive Prefetching
We present a general framework for accelerating a large class of widely used
Markov chain Monte Carlo (MCMC) algorithms. Our approach exploits fast,
iterative approximations to the target density to speculatively evaluate many
potential future steps of the chain in parallel. The approach can accelerate
computation of the target distribution of a Bayesian inference problem, without
compromising exactness, by exploiting subsets of data. It takes advantage of
whatever parallel resources are available, but produces results exactly
equivalent to standard serial execution. In the initial burn-in phase of chain
evaluation, it achieves speedup over serial evaluation that is close to linear
in the number of available cores
Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors
This paper presents a low-overhead optimizer for the ubiquitous sparse
matrix-vector multiplication (SpMV) kernel. Architectural diversity among
different processors together with structural diversity among different sparse
matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is
both matrix- and architecture-adaptive through runtime specialization. To this
direction, we present an approach that first identifies the performance
bottlenecks of SpMV for a given sparse matrix on the target platform either
through profiling or by matrix property inspection, and then selects suitable
optimizations to tackle those bottlenecks. Our optimization pool is based on
the widely used Compressed Sparse Row (CSR) sparse matrix storage format and
has low preprocessing overheads, making our overall approach practical even in
cases where fast decision making and optimization setup is required. We
evaluate our optimizer on three x86-based computing platforms and demonstrate
that it is able to distinguish and appropriately optimize SpMV for the majority
of matrices in a representative test suite, leading to significant speedups
over the CSR and Inspector-Executor CSR SpMV kernels available in the latest
release of the Intel MKL library.Comment: 10 pages, 7 figures, ICPP 201
Accelerating Metropolis-Hastings algorithms: Delayed acceptance with prefetching
MCMC algorithms such as Metropolis-Hastings algorithms are slowed down by the
computation of complex target distributions as exemplified by huge datasets. We
offer in this paper an approach to reduce the computational costs of such
algorithms by a simple and universal divide-and-conquer strategy. The idea
behind the generic acceleration is to divide the acceptance step into several
parts, aiming at a major reduction in computing time that outranks the
corresponding reduction in acceptance probability. The division decomposes the
"prior x likelihood" term into a product such that some of its components are
much cheaper to compute than others. Each of the components can be sequentially
compared with a uniform variate, the first rejection signalling that the
proposed value is considered no further, This approach can in turn be
accelerated as part of a prefetching algorithm taking advantage of the parallel
abilities of the computer at hand. We illustrate those accelerating features on
a series of toy and realistic examples.Comment: 20 pages, 12 figures, 2 tables, submitte
Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques
Maximizing parallelism level in applications can be achieved by minimizing
overheads due to load imbalances and waiting time due to memory latencies.
Compiler optimization is one of the most effective solutions to tackle this
problem. The compiler is able to detect the data dependencies in an application
and is able to analyze the specific sections of code for parallelization
potential. However, all of these techniques provided with a compiler are
usually applied at compile time, so they rely on static analysis, which is
insufficient for achieving maximum parallelism and producing desired
application scalability. One solution to address this challenge is the use of
runtime methods. This strategy can be implemented by delaying certain amount of
code analysis to be done at runtime. In this research, we improve the parallel
application performance generated by the OP2 compiler by leveraging HPX, a C++
runtime system, to provide runtime optimizations. These optimizations include
asynchronous tasking, loop interleaving, dynamic chunk sizing, and data
prefetching. The results of the research were evaluated using an Airfoil
application which showed a 40-50% improvement in parallel performance.Comment: 18th IEEE International Workshop on Parallel and Distributed
Scientific and Engineering Computing (PDSEC 2017
Patterns of Scalable Bayesian Inference
Datasets are growing not just in size but in complexity, creating a demand
for rich models and quantification of uncertainty. Bayesian methods are an
excellent fit for this demand, but scaling Bayesian inference is a challenge.
In response to this challenge, there has been considerable recent work based on
varying assumptions about model structure, underlying computational resources,
and the importance of asymptotic correctness. As a result, there is a zoo of
ideas with few clear overarching principles.
In this paper, we seek to identify unifying principles, patterns, and
intuitions for scaling Bayesian inference. We review existing work on utilizing
modern computing resources with both MCMC and variational approximation
techniques. From this taxonomy of ideas, we characterize the general principles
that have proven successful for designing scalable inference procedures and
comment on the path forward
- …