Search CORE

arXiv.org e-Print Archive

Analyzing the effect of local rounding error propagation on the maximal attainable accuracy of the pipelined Conjugate Gradient method

Author: Agullo Emmanuel
Cools Siegfried
Giraud Luc
Vanroose Wim
Yetkin Emrullah Fatih
Publication venue
Publication date: 29/11/2017
Field of study

Pipelined Krylov subspace methods typically offer improved strong scaling on parallel HPC hardware compared to standard Krylov subspace methods for large and sparse linear systems. In pipelined methods the traditional synchronization bottleneck is mitigated by overlapping time-consuming global communications with useful computations. However, to achieve this communication hiding strategy, pipelined methods introduce additional recurrence relations for a number of auxiliary variables that are required to update the approximate solution. This paper aims at studying the influence of local rounding errors that are introduced by the additional recurrences in the pipelined Conjugate Gradient method. Specifically, we analyze the impact of local round-off effects on the attainable accuracy of the pipelined CG algorithm and compare to the traditional CG method. Furthermore, we estimate the gap between the true residual and the recursively computed residual used in the algorithm. Based on this estimate we suggest an automated residual replacement strategy to reduce the loss of attainable accuracy on the final iterative solution. The resulting pipelined CG method with residual replacement improves the maximal attainable accuracy of pipelined CG, while maintaining the efficient parallel performance of the pipelined method. This conclusion is substantiated by numerical results for a variety of benchmark problems.Comment: 26 pages, 6 figures, 2 tables, 4 algorithm

Institutional Repository Universiteit Antwerpen

HAL Descartes

Méthodes directes hors-mémoire (out-of-core) pour la résolution de systèmes linéaires creux de grande taille

Author: Agullo Emmanuel
Publication venue: HAL CCSD
Publication date: 28/11/2008
Field of study

Factorizing a sparse matrix is a robust way to solve large sparse systems of linear equations. However such an approach is known to be costly both in terms of computation and storage. When the storage required to process a matrix is greater than the amount of memory available on the platform, so-called out-of-core approaches have to be employed: disks extend the main memory to provide enough storage capacity. In this thesis, we investigate both theoretical and practical aspects of such out-of-core factorizations. The MUMPS and SuperLU software packages are used to illustrate our discussions on real-life matrices. First, we propose and study various out-of-core models that aim at limiting the overhead due to data transfers between memory and disks on uniprocessor machines. To do so, we revisit the algorithms to schedule the operations of the factorization and propose new memory management schemes to fit out-of-core constraints. Then we focus on a particular factorization method, the multifrontal method, that we push as far as possible in a parallel out-of-core context with a pragmatic approach. We show that out-of-core techniques allow to solve large sparse linear systems efficiently. When only the factors are stored on disks, a particular attention must be paid to temporary data, which remain in core memory. To achieve a high scalability of core memory usage, we rethink the whole schedule of the out-of-core parallel factorization.La factorisation d'une matrice creuse est une approche robuste pour la résolution de systèmes linéaires creux de grande taille. Néanmoins, une telle factorisation est connue pour être coûteuse aussi bien en temps de calcul qu'en occupation mémoire. Quand l'espace mémoire nécessaire au traitement d'une matrice est plus grand que la quantité de mémoire disponible sur la plate-forme utilisée, des approches dites hors-mémoire (out-of-core) doivent être employées : les disques étendent la mémoire centrale pour fournir une capacité de stockage suffisante. Dans cette thèse, nous nous intéressons à la fois aux aspects théoriques et pratiques de telles factorisations hors-mémoire. Les environnements logiciel MUMPS et SuperLU sont utilisés pour illustrer nos discussions sur des matrices issues du monde industriel et académique. Tout d'abord, nous proposons et étudions dans un cadre séquentiel différents modèles hors-mémoire qui ont pour but de limiter le surcoût dû aux transferts de données entre la mémoire et les disques. Pour ce faire, nous revisitons les algorithmes qui ordonnancent les opérations de la factorisation et proposons de nouveaux schémas de gestion mémoire s'accommodant aux contraintes hors-mémoire. Ensuite, nous nous focalisons sur une méthode de factorisation particulière, la méthode multifrontale, que nous poussons aussi loin que possible dans un contexte parallèle hors-mémoire. Suivant une démarche pragmatique, nous montrons que les techniques hors-mémoire permettent de résoudre efficacement des systèmes linéaires creux de grande taille. Quand seuls les facteurs sont stockés sur disque, une attention particulière doit être portée aux données temporaires, qui restent en mémoire centrale. Pour faire décroître efficacement l'occupation mémoire associée à ces données temporaires avec le nombre de processeurs, nous repensons l'ordonnancement de la factorisation parallèle hors-mémoire dans son ensemble

HAL-ENS-LYON

Thèses en Ligne

arXiv.org e-Print Archive

Hal-Diderot

Pipelining the Fast Multipole Method over a Runtime System

Author: Agullo Emmanuel
Bramas Béranger
Coulaud Olivier
Darve Eric
Messner Matthias
Toru Takahashi
Publication venue
Publication date: 01/01/2012
Field of study

Fast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high performance design of such methods usually requires to carefully tune the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, in order to process the tasks on the different processing units. We carefully design the task flow, the mathematical operators, their Central Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as well as scheduling schemes. We compute potentials and forces of 200 million particles in 48.7 seconds on a homogeneous 160 cores SGI Altix UV 100 and of 38 million particles in 13.34 seconds on a heterogeneous 12 cores Intel Nehalem processor enhanced with 3 Nvidia M2090 Fermi GPUs.Comment: No. RR-7981 (2012

CiteSeerX

Multifrontal QR Factorization for Multicore Architectures over Runtime Systems

Author: Agullo Emmanuel
Buttari Alfredo
Guermouche Abdou
Lopez Florent
Publication venue: HAL CCSD
Publication date: 01/01/2013
Field of study

International audienceTo face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper evaluates the usability of runtime systems for complex applications, namely, sparse matrix multifrontal factorizations which constitute extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Experimental results on real-life matrices show that it is possible to achieve the same efficiency as with an ad hoc scheduler which relies on the knowledge of the algorithm. A detailed analysis shows the performance behavior of the resulting code and possible ways of improving the effectiveness of runtime systems

Scientific Publications of the University of Toulouse II Le Mirail

Open Archive Toulouse Archive Ouverte

Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime systems

Author: Agullo Emmanuel
Buttari Alfredo
Guermouche Abdou
Lopez Florent
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/07/2016
Field of study

International audienceTo face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications , namely, sparse matrix multifrontal factorizations which feature extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse, direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system. ACM Reference Format: Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche and Florent Lopez, 2014. Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime system

Scientific Publications of the University of Toulouse II Le Mirail

Exploiting a Parametrized Task Graph model for the parallelization of a sparse direct multifrontal solver

Author: Agullo Emmanuel
Bosilca George
Buttari Alfredo
Guermouche Abdou
Lopez Florent
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 24/08/2016
Field of study

International audienceThe advent of multicore processors requires to reconsider the design of high performance computing libraries to embrace portable and effective techniques of parallel software engineering. One of the most promising approaches consists in abstracting an application as a directed acyclic graph (DAG) of tasks. While this approach has been popularized for shared memory environments by the OpenMP 4.0 standard where dependencies between tasks are automatically inferred, we investigate an alternative approach, capable of describing the DAG of task in a distributed setting, where task dependencies are explicitly encoded. So far this approach has been mostly used in the case of algorithms with a regular data access pattern and we show in this study that it can be efficiently applied to a higly irregular numerical algorithm such as a sparse multifrontal QR method. We present the resulting implementation and discuss the potential and limits of this approach in terms of productivity and effectiveness in comparison with more common parallelization techniques. Although at an early stage of development, preliminary results show the potential of the parallel programming model that we investigate in this work

Scientific Publications of the University of Toulouse II Le Mirail

UPCommons. Portal del coneixement obert de la UPC

Resiliency in numerical algorithm design for extreme scale simulations

Author: Agullo Emmanuel
Altenbernd Mirco
Anzt Hartwig
Bautista Gomez Leonardo
Benacchio Tommaso
Publication venue: 'SAGE Publications'
Publication date: 01/12/2021
Field of study

This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 1023 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.Peer Reviewed"Article signat per 36 autors/es: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik G ̈oddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortiz, Francesco Rizzi, Ulrich Rude, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thonnes, Andreas Wagner and Barbara Wohlmuth"Postprint (author's final draft

Comparison of coupled solvers for FEM/BEM linear systems arising from discretization of aeroacoustic problems

Author: Agullo Emmanuel
Felšöci Marek
Sylvand Guillaume
Publication venue: HAL CCSD
Publication date: 06/07/2021
Field of study

National audienceWhen discretization of an aeroacoustic physical model is based on the application of both the Finite Elements Method (FEM) and the Boundary Elements Method (BEM), this leads to coupled FEM/BEM linear systems combining sparse and dense parts. In this work, we propose and compare a set of implementation schemes relying on the coupling of the open-source sparse direct solver MUMPS with the proprietary direct solvers from Airbus Central R&T, i.e. the scalapack-like dense solver SPIDO and the hierarchical H-matrix compressed solver HMAT. For this preliminary study, we limit ourselves to a single 24-core computational node

HAL Descartes

Hal-Diderot

On the resilience of a parallel sparse hybrid solver

Author: Agullo Emmanuel
Giraud Luc
Zounon Mawussi
Publication venue: HAL CCSD
Publication date: 01/06/2015
Field of study

As the computational power of high performance computing (HPC) systemscontinues to increase by using a huge number of CPU cores orspecialized processing units, extreme-scale applications areincreasingly prone to faults. As a consequence, the HPC community hasproposed many contributions to design resilient HPC applications, maythese contributions be system-oriented, theoretical or numerical. Inthis study we consider an actual fully-featured parallel sparse hybrid(direct/iterative) linear solver, \maphys, and we propose numericalremedies to design a resilient version of the solver. The solver beinghybrid, we focus in this study on the iterative solution step, whichis often the dominant step in practice. We furthermore assume that aseparate mechanism ensures fault detection and that a system layerprovides support for setting back the environment (processes, \ldots)in a running state. The present manuscript therefore focuses on (andonly on) strategies for recovering lost data \emph{after} the faulthas been detected (a separate concern out of the scope of this study)and \emph{once} the system is back in a running state (anotherseparate concern not studied here either). The numerical remedies wepropose are twofold. Whenever possible, we exploit the natural data redundancybetween processes from the solver to perform exact recovery throughclever copies over processes. Otherwise, data that has been lost andis not available anymore on any process is recovered through aso-called interpolation-restart mechanism. This mechanism is derivedfrom~\cite{aggr:13} to carefully take into account the properties ofthe target hybrid solver. These numerical remedies have beenimplemented in the \maphys parallel solver so that we can assess theirefficiency on a large number of processing units (up to

12,288

CPUcores) for solving large-scale real-life problems