Accelerated Event-by-Event Neutrino Oscillation Reweighting with Matter Effects on a GPU
Oscillation probability calculations are becoming increasingly CPU intensive
in modern neutrino oscillation analyses. The independence of reweighting
individual events in a Monte Carlo sample lends itself to parallel
implementation on a Graphics Processing Unit. The library "Prob3++" was ported
to the GPU using the CUDA C API, allowing for large-scale parallelized
calculation of neutrino oscillation probabilities through matter of constant
density and decreasing the execution time by a factor of 75 compared to
performance on a single CPU.
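The per-event independence described above is what makes the problem embarrassingly parallel: each event's weight depends only on that event's kinematics. A minimal sketch of the idea, using the standard two-flavor vacuum survival probability as a stand-in for the full Prob3++ matter calculation and NumPy vectorization as a stand-in for the GPU kernel (all function names and parameter values here are illustrative, not from the paper):

```python
import numpy as np

def osc_prob_2flavor(E_gev, L_km=295.0, sin2_2theta=0.95, dm2_ev2=2.5e-3):
    """Two-flavor vacuum survival probability, applied element-wise to an
    array of event energies. Because each event is independent, the same
    loop body maps directly onto one GPU thread per event."""
    arg = 1.267 * dm2_ev2 * L_km / E_gev  # dm2 in eV^2, L in km, E in GeV
    return 1.0 - sin2_2theta * np.sin(arg) ** 2

# Reweight a toy Monte Carlo sample event by event.
rng = np.random.default_rng(0)
energies = rng.uniform(0.2, 5.0, size=1_000_000)  # event energies in GeV
weights = osc_prob_2flavor(energies)
```

In a CUDA port, the vectorized expression above becomes a kernel launched with one thread per event; the speedup reported in the abstract comes from executing those independent weight calculations concurrently rather than sequentially.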
Distributed Block Coordinate Descent for Minimizing Partially Separable Functions
In this work we propose a distributed randomized block coordinate descent
method for minimizing a convex function with a huge number of
variables/coordinates. We analyze its complexity under the assumption that the
smooth part of the objective function is partially block separable, and show
that the degree of separability directly influences the complexity. This
extends the results in [Richtarik, Takac: Parallel coordinate descent methods
for big data optimization] to a distributed environment. We first show that
partially block separable functions admit an expected separable
overapproximation (ESO) with respect to a distributed sampling, compute the ESO
parameters, and then specialize complexity results from recent literature that
hold under the generic ESO assumption. We describe several approaches to
distribution and synchronization of the computation across a cluster of
multi-core computers and provide promising computational results.
Comment: in Recent Developments in Numerical Analysis and Optimization, 201
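The core update in a randomized block coordinate descent method can be sketched in a few lines. Below is a minimal single-machine version for a least-squares objective, where each iteration samples one block of coordinates and takes a gradient step scaled by that block's Lipschitz constant; in the distributed setting described above, blocks would instead be assigned to nodes by the distributed sampling (the function name and problem instance are illustrative, not from the paper):

```python
import numpy as np

def randomized_block_cd(A, b, n_iters=2000, block_size=2, seed=0):
    """Randomized block coordinate descent for f(x) = 0.5 * ||Ax - b||^2.
    Each step picks a block uniformly at random and minimizes along it
    using the block Lipschitz constant L_blk = ||A_blk||_2^2."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x = np.zeros(n)
    r = A @ x - b                              # residual, kept up to date
    blocks = [np.arange(i, min(i + block_size, n))
              for i in range(0, n, block_size)]
    for _ in range(n_iters):
        blk = blocks[rng.integers(len(blocks))]
        g = A[:, blk].T @ r                    # block gradient
        L = np.linalg.norm(A[:, blk], 2) ** 2  # block Lipschitz constant
        step = g / L
        x[blk] -= step
        r -= A[:, blk] @ step                  # cheap residual update
    return x

# Solve a small consistent least-squares problem as a demonstration.
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 10))
b = A @ rng.normal(size=10)
x_hat = randomized_block_cd(A, b)
```

A quadratic objective like this is fully block separable; the partial block separability analyzed in the paper generalizes this, with the degree of overlap between blocks entering the ESO parameters and hence the complexity bound.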
Workload decomposition strategies for hierarchical distributed-shared memory parallel systems and their implementation with integration of high-level parallel languages
Dissecting sequential programs for parallelization-An approach based on computational units
When trying to parallelize a sequential program, programmers routinely struggle during the first step: finding out which code sections can be made to run in parallel. While identifying such code sections, most of the current parallelism discovery techniques focus on specific language constructs. In contrast, we propose to concentrate on the computations performed by a program. In our approach, a program is treated as a collection of computations communicating with one another using a number of variables. Each computation is represented as a computational unit (CU). A CU contains the inputs and outputs of a computation; the three phases of a computation are read, compute, and write. Based on the notion of CU, which ensures that the read phase executes before the write phase, we present a unified framework to identify both loop parallelism and task parallelism in sequential programs. We conducted a range of experiments on 23 applications from four different benchmark suites. Our approach accurately identified the parallelization opportunities in the benchmark applications, as verified by comparison with their parallel versions. We have also parallelized the opportunities identified by our approach that were not implemented in the parallel versions of the benchmarks, and we report the resulting speedups.
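The CU idea can be illustrated with a small sketch: represent each computation by the variables it reads and writes, and declare two CUs parallel-safe only when no data dependence (read-after-write, write-after-read, or write-after-write) links them. This toy model is an assumption-laden simplification of the paper's framework, and all names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class CU:
    """A computational unit: the variables its read phase consumes
    and the variables its write phase produces."""
    name: str
    reads: set
    writes: set

def can_run_in_parallel(a: CU, b: CU) -> bool:
    """Two CUs may run in parallel only if neither writes a variable
    the other reads or writes (no RAW, WAR, or WAW dependence)."""
    return (a.writes.isdisjoint(b.reads | b.writes) and
            b.writes.isdisjoint(a.reads))

# Toy program: two independent reductions over the same input,
# followed by a normalization that depends on one of them.
cu_sum  = CU("sum",  reads={"data"},      writes={"s"})
cu_max  = CU("max",  reads={"data"},      writes={"m"})
cu_norm = CU("norm", reads={"data", "s"}, writes={"data"})
```

Here `cu_sum` and `cu_max` only read the shared input, so they are task-parallel candidates, while `cu_norm` must wait for `cu_sum` because it reads `s` (and rewrites `data`). Loop parallelism falls out of the same test applied to the CUs of different loop iterations.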
