779 research outputs found
Efficient Parallel Random Sampling : Vectorized, Cache-Efficient, and Online
We consider the problem of sampling numbers from the range
without replacement on modern architectures. The main result
is a simple divide-and-conquer scheme that makes sequential algorithms more
cache efficient and leads to a parallel algorithm running in expected time
on processors, i.e., scales to massively parallel
machines even for moderate values of . The amount of communication between
the processors is very small (at most ) and independent of
the sample size. We also discuss modifications needed for load balancing,
online sampling, sampling with replacement, Bernoulli sampling, and
vectorization on SIMD units or GPUs
Analysis of hybrid parallelization strategies: simulation of Anderson localization and Kalman Filter for LHCb triggers
This thesis presents two experiences of hybrid programming applied to condensed matter and high energy physics. The two projects differ in various aspects, but both of them aim to analyse the benefits of using accelerated hardware to speedup the calculations in current science-research scenarios. The first project enables massively parallelism in a simulation of the Anderson localisation phenomenon in a disordered quantum system. The code represents a Hamiltonian in momentum space, then it executes a diagonalization of the corresponding matrix using linear algebra libraries, and finally it analyses the energy-levels spacing statistics averaged over several realisations of the disorder. The implementation combines different parallelization approaches in an hybrid scheme. The averaging over the ensemble of disorder realisations exploits massively parallelism with a master-slave configuration based on both multi-threading and message passing interface (MPI). This framework is designed and implemented to easily interface similar application commonly adopted in scientific research, for example
in Monte Carlo simulations. The diagonalization uses multi-core and GPU hardware interfacing with MAGMA, PLASMA or MKL libraries. The access to the libraries is modular to guarantee portability, maintainability and the extension in a near future.
The second project is the development of a Kalman Filter, including the porting on GPU architectures and autovectorization for online LHCb triggers. The developed codes provide information about the viability and advantages for the application of GPU technologies in the first triggering step for Large Hadron Collider beauty experiment (LHCb). The optimisation introduced on both codes for CPU and GPU delivered a relevant speedup on the Kalman Filter. The two GPU versions in CUDA R and OpenCLTM have similar performances and are adequate to be considered in the upgrade and in the corresponding implementations of the Gaudi framework. In both projects we implement optimisation techniques in the CPU code. This report presents extensive
benchmark analyses of the correctness and of the performances for both projects
One machine, one minute, three billion tetrahedra
This paper presents a new scalable parallelization scheme to generate the 3D
Delaunay triangulation of a given set of points. Our first contribution is an
efficient serial implementation of the incremental Delaunay insertion
algorithm. A simple dedicated data structure, an efficient sorting of the
points and the optimization of the insertion algorithm have permitted to
accelerate reference implementations by a factor three. Our second contribution
is a multi-threaded version of the Delaunay kernel that is able to concurrently
insert vertices. Moore curve coordinates are used to partition the point set,
avoiding heavy synchronization overheads. Conflicts are managed by modifying
the partitions with a simple rescaling of the space-filling curve. The
performances of our implementation have been measured on three different
processors, an Intel core-i7, an Intel Xeon Phi and an AMD EPYC, on which we
have been able to compute 3 billion tetrahedra in 53 seconds. This corresponds
to a generation rate of over 55 million tetrahedra per second. We finally show
how this very efficient parallel Delaunay triangulation can be integrated in a
Delaunay refinement mesh generator which takes as input the triangulated
surface boundary of the volume to mesh
Parallel Algorithms for Summing Floating-Point Numbers
The problem of exactly summing n floating-point numbers is a fundamental
problem that has many applications in large-scale simulations and computational
geometry. Unfortunately, due to the round-off error in standard floating-point
operations, this problem becomes very challenging. Moreover, all existing
solutions rely on sequential algorithms which cannot scale to the huge datasets
that need to be processed.
In this paper, we provide several efficient parallel algorithms for summing n
floating point numbers, so as to produce a faithfully rounded floating-point
representation of the sum. We present algorithms in PRAM, external-memory, and
MapReduce models, and we also provide an experimental analysis of our MapReduce
algorithms, due to their simplicity and practical efficiency.Comment: Conference version appears in SPAA 201
- …