182 research outputs found
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speed up by using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to articulate a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).Comment: Accepted at IPDPS10. (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA.
Analyzing the effect of local rounding error propagation on the maximal attainable accuracy of the pipelined Conjugate Gradient method
Pipelined Krylov subspace methods typically offer improved strong scaling on
parallel HPC hardware compared to standard Krylov subspace methods for large
and sparse linear systems. In pipelined methods the traditional synchronization
bottleneck is mitigated by overlapping time-consuming global communications
with useful computations. However, to achieve this communication hiding
strategy, pipelined methods introduce additional recurrence relations for a
number of auxiliary variables that are required to update the approximate
solution. This paper aims at studying the influence of local rounding errors
that are introduced by the additional recurrences in the pipelined Conjugate
Gradient method. Specifically, we analyze the impact of local round-off effects
on the attainable accuracy of the pipelined CG algorithm and compare to the
traditional CG method. Furthermore, we estimate the gap between the true
residual and the recursively computed residual used in the algorithm. Based on
this estimate we suggest an automated residual replacement strategy to reduce
the loss of attainable accuracy on the final iterative solution. The resulting
pipelined CG method with residual replacement improves the maximal attainable
accuracy of pipelined CG, while maintaining the efficient parallel performance
of the pipelined method. This conclusion is substantiated by numerical results
for a variety of benchmark problems.Comment: 26 pages, 6 figures, 2 tables, 4 algorithm
Méthodes directes hors-mémoire (out-of-core) pour la résolution de systÚmes linéaires creux de grande taille
Factorizing a sparse matrix is a robust way to solve large sparse systems of linear equations. However such an approach is known to be costly both in terms of computation and storage. When the storage required to process a matrix is greater than the amount of memory available on the platform, so-called out-of-core approaches have to be employed: disks extend the main memory to provide enough storage capacity. In this thesis, we investigate both theoretical and practical aspects of such out-of-core factorizations. The MUMPS and SuperLU software packages are used to illustrate our discussions on real-life matrices. First, we propose and study various out-of-core models that aim at limiting the overhead due to data transfers between memory and disks on uniprocessor machines. To do so, we revisit the algorithms to schedule the operations of the factorization and propose new memory management schemes to fit out-of-core constraints. Then we focus on a particular factorization method, the multifrontal method, that we push as far as possible in a parallel out-of-core context with a pragmatic approach. We show that out-of-core techniques allow to solve large sparse linear systems efficiently. When only the factors are stored on disks, a particular attention must be paid to temporary data, which remain in core memory. To achieve a high scalability of core memory usage, we rethink the whole schedule of the out-of-core parallel factorization.La factorisation d'une matrice creuse est une approche robuste pour la rĂ©solution de systĂšmes linĂ©aires creux de grande taille. NĂ©anmoins, une telle factorisation est connue pour ĂȘtre coĂ»teuse aussi bien en temps de calcul qu'en occupation mĂ©moire. Quand l'espace mĂ©moire nĂ©cessaire au traitement d'une matrice est plus grand que la quantitĂ© de mĂ©moire disponible sur la plate-forme utilisĂ©e, des approches dites hors-mĂ©moire (out-of-core) doivent ĂȘtre employĂ©es : les disques Ă©tendent la mĂ©moire centrale pour fournir une capacitĂ© de stockage suffisante. Dans cette thĂšse, nous nous intĂ©ressons Ă la fois aux aspects thĂ©oriques et pratiques de telles factorisations hors-mĂ©moire. Les environnements logiciel MUMPS et SuperLU sont utilisĂ©s pour illustrer nos discussions sur des matrices issues du monde industriel et acadĂ©mique. Tout d'abord, nous proposons et Ă©tudions dans un cadre sĂ©quentiel diffĂ©rents modĂšles hors-mĂ©moire qui ont pour but de limiter le surcoĂ»t dĂ» aux transferts de donnĂ©es entre la mĂ©moire et les disques. Pour ce faire, nous revisitons les algorithmes qui ordonnancent les opĂ©rations de la factorisation et proposons de nouveaux schĂ©mas de gestion mĂ©moire s'accommodant aux contraintes hors-mĂ©moire. Ensuite, nous nous focalisons sur une mĂ©thode de factorisation particuliĂšre, la mĂ©thode multifrontale, que nous poussons aussi loin que possible dans un contexte parallĂšle hors-mĂ©moire. Suivant une dĂ©marche pragmatique, nous montrons que les techniques hors-mĂ©moire permettent de rĂ©soudre efficacement des systĂšmes linĂ©aires creux de grande taille. Quand seuls les facteurs sont stockĂ©s sur disque, une attention particuliĂšre doit ĂȘtre portĂ©e aux donnĂ©es temporaires, qui restent en mĂ©moire centrale. Pour faire dĂ©croĂźtre efficacement l'occupation mĂ©moire associĂ©e Ă ces donnĂ©es temporaires avec le nombre de processeurs, nous repensons l'ordonnancement de la factorisation parallĂšle hors-mĂ©moire dans son ensemble
Pipelining the Fast Multipole Method over a Runtime System
Fast Multipole Methods (FMM) are a fundamental operation for the simulation
of many physical problems. The high performance design of such methods usually
requires to carefully tune the algorithm for both the targeted physics and the
hardware. In this paper, we propose a new approach that achieves high
performance across architectures. Our method consists of expressing the FMM
algorithm as a task flow and employing a state-of-the-art runtime system,
StarPU, in order to process the tasks on the different processing units. We
carefully design the task flow, the mathematical operators, their Central
Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as
well as scheduling schemes. We compute potentials and forces of 200 million
particles in 48.7 seconds on a homogeneous 160 cores SGI Altix UV 100 and of 38
million particles in 13.34 seconds on a heterogeneous 12 cores Intel Nehalem
processor enhanced with 3 Nvidia M2090 Fermi GPUs.Comment: No. RR-7981 (2012
Multifrontal QR Factorization for Multicore Architectures over Runtime Systems
International audienceTo face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper evaluates the usability of runtime systems for complex applications, namely, sparse matrix multifrontal factorizations which constitute extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Experimental results on real-life matrices show that it is possible to achieve the same efficiency as with an ad hoc scheduler which relies on the knowledge of the algorithm. A detailed analysis shows the performance behavior of the resulting code and possible ways of improving the effectiveness of runtime systems
Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime systems
International audienceTo face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications , namely, sparse matrix multifrontal factorizations which feature extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse, direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system. ACM Reference Format: Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche and Florent Lopez, 2014. Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime system
Exploiting a Parametrized Task Graph model for the parallelization of a sparse direct multifrontal solver
International audienceThe advent of multicore processors requires to reconsider the design of high performance computing libraries to embrace portable and effective techniques of parallel software engineering. One of the most promising approaches consists in abstracting an application as a directed acyclic graph (DAG) of tasks. While this approach has been popularized for shared memory environments by the OpenMP 4.0 standard where dependencies between tasks are automatically inferred, we investigate an alternative approach, capable of describing the DAG of task in a distributed setting, where task dependencies are explicitly encoded. So far this approach has been mostly used in the case of algorithms with a regular data access pattern and we show in this study that it can be efficiently applied to a higly irregular numerical algorithm such as a sparse multifrontal QR method. We present the resulting implementation and discuss the potential and limits of this approach in terms of productivity and effectiveness in comparison with more common parallelization techniques. Although at an early stage of development, preliminary results show the potential of the parallel programming model that we investigate in this work
Resiliency in numerical algorithm design for extreme scale simulations
This work is based on the seminar titled âResiliency in Numerical Algorithm Design for Extreme Scale Simulationsâ held March 1â6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 1023 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.Peer Reviewed"Article signat per 36 autors/es: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik G Ìoddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortiz,
Francesco Rizzi, Ulrich Rude, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thonnes, Andreas Wagner and Barbara Wohlmuth"Postprint (author's final draft
Comparison of coupled solvers for FEM/BEM linear systems arising from discretization of aeroacoustic problems
National audienceWhen discretization of an aeroacoustic physical model is based on the application of both the Finite Elements Method (FEM) and the Boundary Elements Method (BEM), this leads to coupled FEM/BEM linear systems combining sparse and dense parts. In this work, we propose and compare a set of implementation schemes relying on the coupling of the open-source sparse direct solver MUMPS with the proprietary direct solvers from Airbus Central R&T, i.e. the scalapack-like dense solver SPIDO and the hierarchical H-matrix compressed solver HMAT. For this preliminary study, we limit ourselves to a single 24-core computational node
On the resilience of a parallel sparse hybrid solver
As the computational power of high performance computing (HPC) systemscontinues to increase by using a huge number of CPU cores orspecialized processing units, extreme-scale applications areincreasingly prone to faults. As a consequence, the HPC community hasproposed many contributions to design resilient HPC applications, maythese contributions be system-oriented, theoretical or numerical. Inthis study we consider an actual fully-featured parallel sparse hybrid(direct/iterative) linear solver, \maphys, and we propose numericalremedies to design a resilient version of the solver. The solver beinghybrid, we focus in this study on the iterative solution step, whichis often the dominant step in practice. We furthermore assume that aseparate mechanism ensures fault detection and that a system layerprovides support for setting back the environment (processes, \ldots)in a running state. The present manuscript therefore focuses on (andonly on) strategies for recovering lost data \emph{after} the faulthas been detected (a separate concern out of the scope of this study)and \emph{once} the system is back in a running state (anotherseparate concern not studied here either). The numerical remedies wepropose are twofold. Whenever possible, we exploit the natural data redundancybetween processes from the solver to perform exact recovery throughclever copies over processes. Otherwise, data that has been lost andis not available anymore on any process is recovered through aso-called interpolation-restart mechanism. This mechanism is derivedfrom~\cite{aggr:13} to carefully take into account the properties ofthe target hybrid solver. These numerical remedies have beenimplemented in the \maphys parallel solver so that we can assess theirefficiency on a large number of processing units (up to CPUcores) for solving large-scale real-life problems
- âŠ