Search CORE

10,182 research outputs found

AutoParallel: A Python module for automatic parallelization and distributed execution of affine loop nests

Author: Amela Ramon
Badia Rosa M.
Clauss Philippe
Ejarque Jorge
Ramon-Cortes Cristian
Publication venue
Publication date: 26/10/2018
Field of study

The last improvements in programming languages, programming models, and frameworks have focused on abstracting the users from many programming issues. Among others, recent programming frameworks include simpler syntax, automatic memory management and garbage collection, which simplifies code re-usage through library packages, and easily configurable tools for deployment. For instance, Python has risen to the top of the list of the programming languages due to the simplicity of its syntax, while still achieving a good performance even being an interpreted language. Moreover, the community has helped to develop a large number of libraries and modules, tuning them to obtain great performance. However, there is still room for improvement when preventing users from dealing directly with distributed and parallel computing issues. This paper proposes and evaluates AutoParallel, a Python module to automatically find an appropriate task-based parallelization of affine loop nests to execute them in parallel in a distributed computing infrastructure. This parallelization can also include the building of data blocks to increase task granularity in order to achieve a good execution performance. Moreover, AutoParallel is based on sequential programming and only contains a small annotation in the form of a Python decorator so that anyone with little programming skills can scale up an application to hundreds of cores.Comment: Accepted to the 8th Workshop on Python for High-Performance and Scientific Computing (PyHPC 2018

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

Author: A Arnold
A Faradjian
B Hess
C Schütte
G Wilson
JA Anderson
JC Phillips
KJ Bowers
KJ Bowers
L Verlet
M Eleftheriou
M Shirts
MJ Abraham
P Eastman
R Yokota
S Pronk
S Páll
U Essmann
W Humphrey
WM Brown
Y Andoh
Y Sugita
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighborsearching, and we discuss the present and future challenges we see for exascale simulation - in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin

arXiv.org e-Print Archive

Publikationer från KTH

Crossref

Digitala Vetenskapliga Arkivet - Academic Archive On-line

MPG.PuRe

Solution of the Dirac equation in lattice QCD using a domain decomposition method

Author: Aoki
Frommer
Giusti
Lüscher
Lüscher
Lüscher
Martin Lüscher
Quarteroni
Saad
Schwarz
Sheikholeslami
van der Vorst
van der Vorst
Vuik
Wilson
Publication venue: 'Elsevier BV'
Publication date: 20/10/2003
Field of study

Efficient algorithms for the solution of partial differential equations on parallel computers are often based on domain decomposition methods. Schwarz preconditioners combined with standard Krylov space solvers are widely used in this context, and such a combination is shown here to perform very well in the case of the Wilson--Dirac equation in lattice QCD. In particular, with respect to even-odd preconditioned solvers, the communication overhead is significantly reduced, which allows the computational work to be distributed over a large number of processors with only small parallelization losses.Comment: Plain TeX source, 21 pages, figures include

arXiv.org e-Print Archive

Crossref

CERN Document Server