62 research outputs found

Multifrontal QR Factorization for Multicore Architectures over Runtime Systems

Author: Agullo Emmanuel
Buttari Alfredo
Guermouche Abdou
Lopez Florent
Publication venue: HAL CCSD
Publication date: 01/01/2013

To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper evaluates the usability of runtime systems for complex applications, namely, sparse matrix multifrontal factorizations which constitute extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Experimental results on real-life matrices show that it is possible to achieve the same efficiency as with an ad hoc scheduler which relies on the knowledge of the algorithm. A detailed analysis shows the performance behavior of the resulting code and possible ways of improving the effectiveness of runtime systems

Scientific Publications of the University of Toulouse II Le Mirail

Open Archive Toulouse Archive Ouverte

HAL-Rennes 1

Parallel sparse direct solvers for Poisson's equation in streamer discharges

Author: Ebert U. (Ute)
Genseberger M. (Menno)
Nool M. (Margreet)
Publication date: 30/05/2017

The aim of this paper is to examine whether a hybrid approach to parallel computing, a combination of the message passing model (MPI) with the threads model (OpenMP), can deliver good performance in streamer discharge simulations. Since one of the bottlenecks of almost all streamer models is the solution of Poisson's equation, we focused on several direct solvers, which can solve large sparse systems in parallel. For this purpose, our basic thought was to concentrate on 'easy to get' performance improvements, that is, improvements obtained without rewriting the code. We have investigated PARDISO, a shared memory solver, and CLUSTER SPARSE SOLVER and MUMPS, which both can apply hybrid parallelism; the latter two solvers can be called from a single core and do not require minor awareness of MPI. We show their performance for solving two- and three-dimensional Poisson's equations on the Dutch national supercomputer, Cartesius. A runtime study of a code developed for streamer propagation near a dielectric rod is included. We discuss various issues that appear to be critical in a mixed MPI-OpenMP environment

CWI's Institutional Repository

Performance Improvements of Common Sparse Numerical Linear Algebra Computations

Author: Luszczek Piotr Rafal
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/01/2003

Manufacturers of computer hardware are able to continuously sustain an unprecedented pace of progress in computing speed of their products, partially due to increased clock rates but also because of ever more complicated chip designs. With new processor families appearing every few years, it is increasingly hard to achieve high performance rates in sparse matrix computations. This research proposes new methods for sparse matrix factorizations and applies generalizations of known concepts from related disciplines in an iterative code. The proposed solutions and extensions are implemented in ways that tend to deliver efficiency while retaining ease of use of existing solutions. The implementations are thoroughly timed and analyzed using a commonly accepted set of test matrices. The tests were conducted on modern processors that seem to have gained an appreciable level of popularity and are fairly representative of a wider range of processor types that are available on the market now or in the near future. The new factorization technique formally introduced in the early chapters is later shown to be quite competitive with state of the art software currently available. Although not totally superior in all cases (as probably no single approach could possibly be), the new factorization algorithm exhibits a few promising features. In addition, an all-embracing optimization effort is applied to an iterative algorithm that stands out for its robustness. This also gives satisfactory results on the tested computing platforms in terms of performance improvement. The same set of test matrices is used to enable an easy comparison between both investigated techniques, even though they are customarily treated separately in the literature. Possible extensions of the presented work are discussed. They range from easily conceivable merging with existing solutions to rather more evolved schemes dependent on hard to predict progress in theoretical and algorithmic research

University of Tennessee, Knoxville: Trace

CiteSeerX

Concurrent Probabilistic Simulation of High Temperature Composite Structural Response

Author: Abdi Frank

A computational structural/material analysis and design tool which would meet industry's future demand for expedience and reduced cost is presented. This unique software 'GENOA' is dedicated to parallel and high speed analysis to perform probabilistic evaluation of high temperature composite response of aerospace systems. The development is based on detailed integration and modification of diverse fields of specialized analysis techniques and mathematical models to combine their latest innovative capabilities into a commercially viable software package. The technique is specifically designed to exploit the availability of processors to perform computationally intense probabilistic analysis assessing uncertainties in structural reliability analysis and composite micromechanics. The primary objectives which were achieved in performing the development were: (1) Utilization of the power of parallel processing and static/dynamic load balancing optimization to make the complex simulation of structure, material and processing of high temperature composite affordable; (2) Computational integration and synchronization of probabilistic mathematics, structural/material mechanics and parallel computing; (3) Implementation of an innovative multi-level domain decomposition technique to identify the inherent parallelism, and increasing convergence rates through high- and low-level processor assignment; (4) Creating the framework for Portable Paralleled architecture for the machine independent Multi Instruction Multi Data (MIMD), Single Instruction Multi Data (SIMD), hybrid and distributed workstation type of computers; and (5) Market evaluation. The results of the Phase-2 effort provide a good basis for continuation and warrant a Phase-3 government and industry partnership

NASA Technical Reports Server

Application of HPC in eddy current electromagnetic problem solution

Author: Pozza Cristian
Publication date: 29/01/2014

As engineering problems are becoming more and more advanced, the size of an average model solved by partial differential equations is rapidly growing and, in order to keep simulation times within reasonable bounds, both faster computers and more efficient software implementations are needed. In the first part of this thesis, the full potential of simulation software has been exploited through high performance parallel computing techniques. In particular, the simulation of induction heating processes is accomplished within reasonable solution times by implementing different parallel direct solvers for large sparse linear systems in the solution process of a commercial software package. The performance of such a library on shared memory systems has been remarkably improved by implementing a multithreaded version of the MUMPS (MUltifrontal Massively Parallel Solver) library, which has been tested on benchmark matrices arising from typical induction heating process simulations. A new multithreading approach and a low rank approximation technique have been implemented and developed by the MUMPS team in Lyon and Toulouse. In the context of a collaboration between the MUMPS team and DII-University of Padova, a preliminary version of such functionalities could be tested on induction heating benchmark problems, and a substantial reduction of the computational cost and memory requirements could be achieved. In the second part of this thesis, some examples of design methodology by virtual prototyping have been described. Complex multiphysics simulations involving electromagnetic, circuital, thermal and mechanical problems have been performed by exploiting parallel solvers, as developed in the first part of this thesis. Finally, multiobjective stochastic optimization algorithms have been applied to multiphysics 3D model simulations in search of a set of improved induction heating device configurations

Archivio istituzionale della ricerca - Università di Padova

HPF-2 Support for Dynamic Sparse Computations

Author: Asenjo Plaza Rafael
Doallo Ramón
Plata Oscar
Touriño Juan
Zapata Emilio L.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/1998

This is a post-peer-review, pre-copyedit version of an article published in Lecture Notes in Computer Science. The final authenticated version is available online at: https://doi.org/10.1007/3-540-48319-5_15

[Abstract] There is a class of sparse matrix computations, such as direct solvers of systems of linear equations, that change the fill-in (nonzero entries) of the coefficient matrix, and involve row and column operations (pivoting). This paper addresses the problem of the parallelization of these sparse computations from the point of view of the parallel language and the compiler. Dynamic data structures for sparse matrix storage are analyzed, permitting efficient handling of fill-in and pivoting issues. Any of the data representations considered enforces the handling of indirections for data accesses, pointer referencing and dynamic data creation. All of these elements go beyond current data-parallel compilation technology. We propose a small set of new extensions to HPF-2 to parallelize these codes, supporting part of the new capabilities on a runtime library. This approach has been evaluated on a Cray T3E, implementing, in particular, the sparse LU factorization.

Ministerio de Educación y Ciencia; TIC96-1125-C03
Xunta de Galicia; XUGA20605B96
European Commission; BRITE-EURAM III BE95-1564
European Commission; ERB4050P192166

Repositorio da Universidade da Coruña

CiteSeerX

Models for Type I X-Ray Bursts Nucleosynthesis with Parallelisation and Improved Nuclear Physics

Author: Martin Rodriguez Jose-David
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2012

Type I XRBs are thermonuclear flashes on the surface of neutron stars (NS) associated with mass-accretion from a companion star. Models of type I XRBs and their associated nucleosynthesis are physically complicated and extremely demanding in terms of the computational power required to model the physical processes involved with the precision needed to be truly representative. Until recently, because of these computational limitations, studies of XRB nucleosynthesis have been performed using limited nuclear reaction networks. In the bid to overcome this hurdle, parallel computing has been raised as the main enabling factor for yet more precise and computationally intensive simulations, as it offers the potential to concentrate computational resources on intensive computational problems. In this work, we present a parallelisation of two different applications: a one-zone (i.e. parameterized) nucleosynthesis code, and a one-dimensional (spherically symmetric), hydrodynamic code, in Lagrangian formulation (hereafter SHIVA code), built originally to model classical nova outbursts (José 1996; José & Hernanz 1998). The codes have been parallelised using the MPICH2 implementation of the Message Passing Interface (MPI) specification for the design of parallel applications using clusters of distributed workstations. As an example, to execute a hydrodynamic simulation along 200k time-steps, the SHIVA code requires (in its sequential, single-node version) about 147 hours (6.1 days) to complete when using a reduced nuclear network with 324 isotopes and 1392 nuclear reactions, and 688 hours (28.6 days) when using a network with 606 nuclides and 3551 nuclear reactions for the same number of time-steps. The post-processing nucleosynthesis code is a time-step loosely synchronous application with a very small problem size (limited by the number of isotopes of the nuclear network).
As shown by the performance tests, this fact results in the worst possible scenario for parallelisation; results show that the performance of the parallel application is much worse than that of the sequential, 1-node version of the code. Our results show that it is therefore not possible to efficiently parallelise a post-processing nucleosynthesis code, and efforts in this regard should be avoided. In contrast, the parallelised version of the SHIVA code yields excellent performance results. A speed-up factor of 26 is achieved in a simulation with a reduced network consisting of 324 isotopes and 1392 nuclear reactions when 42 processors are used in parallel to execute the application along 200k time-steps. In addition, an excellent speed-up factor of 35 is accomplished in a simulation with a reaction network of up to 606 nuclides and 3551 nuclear reactions. Maximum speed-ups of ~41 and ~85 are predicted by the performance models when using 200 processors, for the reduced and extended simulations respectively. Our results will not only improve the quality of the simulations (and hence publications) in terms of better numerical approaches, finer approximations, and a considerably shorter time-to-publication, but will also allow taking advantage, if desired, of parallel supercomputing facilities like the Mare Nostrum at the Supercomputing Centre in Barcelona (BSC)
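The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the reported speed-ups are measured against the sequential runtimes quoted in the abstract, and that the 42-processor count also applies to the extended network, which the abstract does not state explicitly. The function names are hypothetical.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal (linear) speed-up actually obtained."""
    return speedup / processors


def parallel_runtime_hours(sequential_hours: float, speedup: float) -> float:
    """Wall-clock time implied by a sequential runtime and a speed-up factor."""
    return sequential_hours / speedup


# Values taken from the abstract; the processor count for the extended
# network is an ASSUMPTION (the abstract only gives it for the reduced one).
PROCESSORS = 42
runs = {
    "reduced network (324 isotopes)": (147.0, 26.0),
    "extended network (606 nuclides)": (688.0, 35.0),
}

for label, (sequential_hours, speedup) in runs.items():
    efficiency = parallel_efficiency(speedup, PROCESSORS)
    hours = parallel_runtime_hours(sequential_hours, speedup)
    print(f"{label}: efficiency {efficiency:.2f}, about {hours:.1f} h in parallel")
```

Under these assumptions the reduced network runs at roughly 62% parallel efficiency and would finish in under 6 hours instead of 147.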

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Using GPU to Accelerate Linear Computations in Power System Applications

Author: Li Xue
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2015

With the development of advanced power system controls, the industrial and research community is becoming more interested in simulating larger interconnected power grids. It is always critical to incorporate advanced computing technologies to accelerate these power system computations. Power flow, one of the most fundamental computations in power system analysis, converts the solution of non-linear systems to that of a set of linear systems via the Newton method or one of its variants. An efficient solution to these linear equations is the key to improving the performance of power flow computation, and hence to accelerating other power system applications based on power flow computation, such as optimal power flow, contingency analysis, etc. This dissertation focuses on the exploration of iterative linear solvers and applicable preconditioners, with graphics processing unit (GPU) implementations to achieve performance improvement on the linear computations in power flow computations. An iterative conjugate gradient solver with Chebyshev preconditioner is studied first, and then the preconditioner is extended to a two-step preconditioner. Finally, the conjugate gradient solver and the two-step preconditioner are integrated with MATPOWER to solve the practical fast decoupled load flow (FDPF), and an inexact linear solution method is proposed to further reduce the runtime of FDPF. Performance improvement is reported by applying these methods and GPU-implementation. The final complete GPU-based FDPF with inexact linear solving can achieve nearly 3x performance improvement over the MATPOWER implementation for a test system with 11,624 buses. A supporting study including a quick estimation of the largest eigenvalue of the linear system, which is required by the Chebyshev preconditioner, is presented as well. This dissertation demonstrates the potential of using GPU with scalable methods in power flow computation
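For readers unfamiliar with the iterative solver named in this abstract, the following is a minimal sketch of the plain conjugate gradient method for a small symmetric positive definite system. It is not the dissertation's implementation: the Chebyshev and two-step preconditioners and the GPU kernels are omitted, and the helper names are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A (unpreconditioned CG)."""
    x = [0.0] * len(b)          # initial guess x0 = 0
    r = list(b)                 # residual r0 = b - A x0 = b
    p = list(r)                 # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:  # stop once the residual norm is small
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                      # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old                           # direction update weight
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small illustrative system (not taken from the dissertation).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(conjugate_gradient(A, b))
```

A preconditioner would replace the residual `r` by an approximate solve before forming each search direction; it is left out here to keep the sketch short.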

University of Tennessee, Knoxville: Trace

Task-based multifrontal QR solver for heterogeneous architectures

Author: Lopez Florent
Publication date: 11/12/2015

To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. In this study we investigate the design, on top of runtime systems, of task-based sparse direct solvers, which constitute extremely irregular workloads, with tasks of different granularities and characteristics and with variable memory consumption. In the context of the qr_mumps solver, we demonstrate the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on a Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method, allowing for better scalability compared to the original approach used in qr_mumps. In addition we introduced a memory-aware algorithm to control the memory behaviour of our solver and show, in the context of multicore architectures, an important reduction of the memory footprint for the multifrontal QR factorization with a small impact on performance. Following this approach, we move to heterogeneous architectures where task granularity and scheduling strategies are critical to achieve performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources. Finally we present a study on the reproducibility of executions and the use of alternative programming models for the implementation of the multifrontal method. All the experimental results presented in this study are evaluated with a detailed performance analysis measuring the impact of several identified effects on the performance and scalability. Thanks to this original analysis, presented in the first part of this study, we are capable of fully understanding the results obtained with our solver
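The Sequential Task Flow idea mentioned in this abstract can be illustrated with a toy model: tasks are submitted in program order together with the data they read and write, and the dependency graph is inferred from those access modes. The sketch below is an assumption-laden illustration, not the interface of qr_mumps or of any real runtime system; all task and data names are hypothetical.

```python
from collections import defaultdict


def infer_dependencies(tasks):
    """tasks: list of (name, reads, writes) in submission order.
    Returns {task name: set of earlier tasks it must wait for}."""
    last_writer = {}                         # data -> last task that wrote it
    readers_since_write = defaultdict(list)  # data -> readers since that write
    deps = {}
    for name, reads, writes in tasks:
        waits_for = set()
        for data in reads:
            if data in last_writer:          # read-after-write
                waits_for.add(last_writer[data])
        for data in writes:
            if data in last_writer:          # write-after-write
                waits_for.add(last_writer[data])
            waits_for.update(readers_since_write[data])  # write-after-read
        deps[name] = waits_for
        for data in reads:
            readers_since_write[data].append(name)
        for data in writes:
            last_writer[data] = name
            readers_since_write[data] = []
    return deps


def schedule_levels(deps):
    """Level 0 tasks have no dependencies; tasks on one level may run in parallel."""
    level = {}

    def depth(name):
        if name not in level:
            level[name] = 1 + max((depth(d) for d in deps[name]), default=-1)
        return level[name]

    for name in deps:
        depth(name)
    return level


# Hypothetical elimination tree: two child fronts feeding one parent front.
tasks = [
    ("factor_f1", ["f1"], ["f1"]),
    ("factor_f2", ["f2"], ["f2"]),
    ("assemble_f3", ["f1", "f2"], ["f3"]),
    ("factor_f3", ["f3"], ["f3"]),
]
deps = infer_dependencies(tasks)
print(schedule_levels(deps))
```

In this toy example the two child factorizations have no mutual dependency, so a scheduler is free to run them concurrently before the parent is assembled.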

Thèses en ligne de l'Université Toulouse III - Paul Sabatier

The Czech Republic, 27. 11. -9

Author: M Tůma
Z Strakoš
Publication date: 01/01/1994

Abstract: Our goal is to show, through several examples, the great progress made in numerical analysis in the past decades together with the principal problems and relations to other disciplines. We restrict ourselves to numerical linear algebra, or, more specifically, to solving Ax = b where A is a real nonsingular n by n matrix and b a real n-dimensional vector, and to computing eigenvalues of a sparse matrix A. We discuss recent developments in both sparse direct and iterative solvers, as well as fundamental problems in computing eigenvalues. The effects of parallel architectures on the choice of the method and on the implementation of codes are stressed throughout the contribution

CiteSeerX