Quantum Monte Carlo for large chemical systems: Implementing efficient strategies for petascale platforms and beyond
Various strategies for implementing QMC simulations efficiently for large
chemical systems are presented. These include: (i) an efficient algorithm for
calculating the computationally expensive Slater matrices, a novel scheme based
on the highly localized character of the atomic Gaussian basis functions
(rather than of the molecular orbitals, as is usually done); (ii) the
possibility of keeping the memory footprint minimal; (iii) a substantial
enhancement of single-core performance when efficient optimization tools are
employed; and (iv) the definition of a universal, dynamic, fault-tolerant, and
load-balanced computational framework adapted to all kinds of computational
platforms (massively parallel machines, clusters, or distributed grids). These
strategies have been implemented in the QMC=Chem code developed at Toulouse
and are illustrated with numerical applications on small peptides of
increasing size (158, 434, 1056, and 1731 electrons). Using 10,000 to 80,000
computing cores of the Curie machine (GENCI-TGCC-CEA, France), QMC=Chem has
been shown to run at the petascale level, demonstrating that a large fraction
of this machine's peak performance can be achieved. Implementing large-scale
QMC simulations on future exascale platforms with a comparable level of
efficiency is expected to be feasible.
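To make the idea in (i) concrete, the following is a minimal, hypothetical
sketch (not the QMC=Chem kernel) of how the locality of atomic Gaussian basis
functions can be exploited: atomic-orbital values are computed only for centers
within a cutoff of each electron, so the AO-value matrix stays sparse, and the
Slater matrix of molecular-orbital values is then recovered by a matrix product
with the MO coefficients. The s-type-only Gaussians, the function names, and
the cutoff value are all illustrative assumptions.

    import numpy as np

    def slater_matrix(r_elec, centers, alphas, mo_coeffs, cutoff=8.0):
        # S[i, k] = phi_k(r_i): molecular-orbital values at the electron
        # positions. Each s-type Gaussian exp(-alpha * |r - R|^2) is
        # negligible far from its center R, so AO values are evaluated
        # only within `cutoff` of the electron.
        n_elec, n_ao = len(r_elec), len(centers)
        ao_vals = np.zeros((n_elec, n_ao))
        for i, r in enumerate(r_elec):
            d2 = np.sum((centers - r) ** 2, axis=1)
            near = d2 < cutoff ** 2        # locality: most AOs are skipped
            ao_vals[i, near] = np.exp(-alphas[near] * d2[near])
        return ao_vals @ mo_coeffs         # MO values from the sparse AO block

The point of working in the AO basis is that the sparsity of ao_vals survives
even when the molecular orbitals themselves are delocalized.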
GPU-Accelerated Algorithms for Compressed Signals Recovery with Application to Astronomical Imagery Deblurring
Compressive sensing promises to enable bandwidth-efficient on-board
compression of astronomical data by lifting the encoding complexity from the
source to the receiver. The signal is recovered off-line, exploiting the
parallel computation capabilities of GPUs to speed up the reconstruction
process. However, inherent GPU hardware constraints limit the size of the
recoverable signal and the speedup practically achievable. In this work, we
design parallel algorithms that exploit the properties of circulant matrices
for efficient GPU-accelerated recovery of sparse signals. Our approach reduces
the memory requirements, allowing us to recover very large signals with
limited memory. In addition, it achieves a tenfold signal-recovery speedup
thanks to ad hoc parallelization of matrix-vector multiplications and matrix
inversions. Finally, we demonstrate our algorithms in practice on a typical
application of circulant matrices: deblurring a sparse astronomical image in
the compressed domain.
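The circulant property the abstract exploits can be illustrated independently
of the GPU code: a circulant matrix is diagonalized by the discrete Fourier
transform, so its matrix-vector product and its inversion reduce to
elementwise operations between FFTs, with O(n) storage for the defining vector
instead of O(n^2) for the dense matrix. The NumPy sketch below is a generic
CPU illustration of this property, not the authors' implementation.

    import numpy as np

    def circulant_matvec(c, x):
        # C @ x where C[i, j] = c[(i - j) % n] (first column c):
        # O(n log n) time and O(n) memory via the convolution theorem.
        return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

    def circulant_solve(c, b):
        # C^{-1} @ b by dividing in Fourier space
        # (assumes no eigenvalue of C, i.e. no entry of fft(c), is zero).
        return np.fft.ifft(np.fft.fft(b) / np.fft.fft(c)).real

    # Quick check against the explicit dense circulant matrix:
    n = 8
    c, x = np.random.rand(n), np.random.rand(n)
    C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
    assert np.allclose(C @ x, circulant_matvec(c, x))
    assert np.allclose(np.linalg.solve(C, x), circulant_solve(c, x))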
On the parallelization of stellar evolution codes
Multidimensional nucleosynthesis studies with hundreds of nuclei linked through thousands of nuclear processes are still computationally prohibitive. To date, most nucleosynthesis studies rely either on hydrostatic/hydrodynamic simulations in spherical symmetry, or on post-processing simulations using temperature and density versus time profiles directly linked to huge nuclear reaction networks.
Parallel computing has been regarded as the main enabler of computationally intensive simulations. This paper explores the pros and cons of parallelizing stellar codes, providing recommendations on when and how parallelization may help to improve the performance of a code for astrophysical applications.
We report on different parallelization strategies successfully applied to the spherically symmetric, Lagrangian, implicit hydrodynamic code SHIVA, extensively used in the modeling of classical novae and type I X-ray bursts.
When only the matrix build-up and inversion processes in the nucleosynthesis subroutines are parallelized (a suitable approach for post-processing calculations), the huge amount of time spent on communications between cores, together with the small problem size (limited by the number of isotopes of the nuclear network), results in the parallel application performing much worse than the one-core, sequential version of the code. Parallelization of the matrix build-up and inversion processes in the nucleosynthesis subroutines is therefore not recommended unless the number of isotopes adopted largely exceeds 10,000.
In sharp contrast, speed-up factors of 26 and 35 have been obtained with a parallelized version of SHIVA in a 200-shell simulation of a type I X-ray burst carried out with two nuclear reaction networks: a reduced one, consisting of 324 isotopes and 1392 reactions, and a more extended network with 606 nuclides and 3551 nuclear interactions. Maximum speed-ups of ~41 (324-isotope network) and ~85 (606-isotope network) are also predicted for 200 cores, stressing that the number of shells of the computational domain constitutes an effective upper limit on the number of cores that can be used in a parallel application.
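The shell-imposed ceiling quoted above follows from Amdahl-style reasoning:
only the per-shell work parallelizes, and no more cores than shells can be
kept busy. A back-of-the-envelope sketch (the parallel fractions below are
illustrative choices, not measured SHIVA numbers):

    def speedup(cores, parallel_fraction, n_shells=200):
        # Amdahl's law with the usable core count capped by the number
        # of shells, since shells are the unit of parallel work.
        p = min(cores, n_shells)
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / p)

    for f in (0.95, 0.99):
        print(f, round(speedup(200, f), 1))
    # f = 0.95 -> ~18x, f = 0.99 -> ~67x: the serial fraction and the
    # 200-shell cap together bound the attainable speed-up.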
Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion
We describe a new data format for storing triangular, symmetric, and Hermitian
matrices called RFPF (Rectangular Full Packed Format). The standard
two-dimensional arrays of Fortran and C (also known as full format) that are
used to represent triangular and symmetric matrices waste nearly half of the
storage space but provide high performance via the use of Level 3 BLAS.
Standard packed-format arrays fully utilize storage (array space) but provide
low performance, as there is no Level 3 packed BLAS. RFPF combines the good
features of packed and full storage: because RFPF is a standard full-format
representation, it obtains high performance by using Level 3 BLAS, yet it
requires exactly the same minimal storage as packed format. Each LAPACK full
and/or packed triangular, symmetric, and Hermitian routine becomes a single
new RFPF routine based on eight possible data layouts of RFPF. This new RFPF
routine usually consists of two calls to the corresponding LAPACK full-format
routine and two calls to Level 3 BLAS routines; hence no new software is
required. As examples, we present LAPACK routines for Cholesky factorization,
Cholesky solution, and Cholesky inverse computation in RFPF to illustrate this
new work and to describe its performance on several commonly used computer
platforms. Performance of LAPACK full routines using RFPF is about the same as
that of LAPACK full routines using standard format, for both serial and SMP
parallel processing, while using half the storage. Performance gains range
from parity up to a factor of 43 for serial processing, and up to a factor of
97 for SMP parallel processing, when vendor LAPACK full routines are used with
RFPF rather than vendor and/or reference packed routines.
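To make the layout concrete, the sketch below packs the lower triangle of an
odd-order symmetric matrix into an n x (n+1)/2 rectangle in the style of RFPF
(TRANSR='N', UPLO='L' in LAPACK terms): the leading lower trapezoid is stored
as-is, and the trailing lower triangle is stored transposed in the otherwise
wasted upper corner. The even-n case and the other seven layouts are omitted;
this is an illustration of the idea, not a replacement for LAPACK's own
conversion routines (e.g. DTRTTF).

    import numpy as np

    def lower_to_rfpf_odd(A):
        # Pack the lower triangle of A (odd order n) into an n x (n+1)/2
        # rectangle: exactly n*(n+1)/2 elements, no wasted space, and the
        # result is a standard full-format array usable with Level 3 BLAS.
        n = A.shape[0]
        assert n % 2 == 1, "this sketch handles odd n only"
        k = (n + 1) // 2
        rfp = np.zeros((n, k))
        for j in range(k):                  # leading lower trapezoid, as-is
            rfp[j:, j] = A[j:, j]
        for j in range(k, n):               # trailing triangle, transposed
            rfp[0:j - k + 1, j - k + 1] = A[j, k:j + 1]
        return rfp

For n = 5 the rectangle is 5 x 3 and holds all 15 unique entries of the
triangle, illustrating why RFPF needs exactly the packed-format storage.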