Parallel sparse LU factorization on different message passing platforms

Author: Amestoy
Amestoy
Amestoy
Ashcraft
Cosnard
Culler
Curtis
Davis
Davis
Demmel
Dongarra
Duff
Duff
Duff
Fu
Gallivan
Geist
George
George
Hunold
Kai Shen
Li
Malard
Marin
Rodriguez
Schenk
Shen
Shen
Snavely
Tomlin
van der Stappen
Whaley
Publication venue: 'Elsevier BV'

Crossref

Recommended from our members

Preparing sparse solvers for exascale computing.

Author: Anzt Hartwig
Boman Erik
Curfman McInnes Lois
Falgout Rob
Ghysels Pieter
Heroux Michael
Li Xiaoye
Meier Yang Ulrike
Rajamanickam Sivasankaran
Rupp Karl
Smith Barry
Tran Mills Richard
Yamazaki Ichitaro
Publication venue: eScholarship, University of California
Publication date: 01/03/2020

Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.

eScholarship - University of California

Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

Author: Bosilca George
Faverge Mathieu
Lacoste Xavier
Ramet Pierre
Thibault Samuel
Publication date: 06/01/2014

The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. Comment: Heterogeneity in Computing Workshop (2014)

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

Oskar Bordeaux

A Low Communication Condensation-based Linear System Solver Utilizing Cramer's Rule

Author: Habgood Kenneth C
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/08/2011

Systems of linear equations are central to many science and engineering application domains. Given the abundance of low-cost parallel processing fabrics, the study of fast and accurate parallel algorithms for solving such systems is receiving attention. Fast linear solvers generally use a form of LU factorization. These methods face challenges with workload distribution and communication overhead that hinder their application in a true broadcast communication environment. Presented is an efficient framework for solving large-scale linear systems by means of a novel utilization of Cramer's rule. While the latter is often perceived to be impractical when considered for large systems, it is shown that the algorithm proposed has an order N^3 complexity with pragmatic forward and backward stability. To the best of our knowledge, this is the first time that Cramer's rule has been demonstrated to be an order N^3 process. Empirical results are provided to substantiate the stated accuracy and computational complexity, clearly demonstrating the efficacy of the approach taken. The unique utilization of Cramer's rule and matrix condensation techniques yield an elegant process that can be applied to parallel computing architectures that support a broadcast communication infrastructure. The regularity of the communication patterns, and send-ahead ability, yields a viable framework for solving linear equations using conventional computing platforms. In addition, this dissertation demonstrates the algorithm's potential for solving large-scale sparse linear systems.
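
For readers unfamiliar with the two ingredients named in this abstract, the following is a minimal sketch of Cramer's rule combined with a textbook condensation step (Chio's pivotal condensation). It is not the dissertation's algorithm: this naive version recomputes a full determinant per unknown and therefore costs on the order of N^4 operations, not the order N^3 reported above. The function names `det_by_condensation` and `solve_cramer` are hypothetical, and exact rational arithmetic is an assumption made to keep the example easy to check.

```python
from fractions import Fraction


def det_by_condensation(matrix):
    """Determinant via Chio's pivotal condensation, in exact arithmetic.

    Each step shrinks an n x n matrix to (n-1) x (n-1); the identity used is
    det(A) = det(B) / pivot**(n-2), where B holds 2x2 minors through the pivot.
    """
    a = [[Fraction(x) for x in row] for row in matrix]
    sign = 1
    scale = Fraction(1)  # product of pivot powers to divide out at the end
    while len(a) > 1:
        n = len(a)
        # Bring a row with a nonzero first entry to the top (row swap flips sign).
        pivot_row = next((r for r in range(n) if a[r][0] != 0), None)
        if pivot_row is None:
            return Fraction(0)  # first column is all zeros: singular
        if pivot_row != 0:
            a[0], a[pivot_row] = a[pivot_row], a[0]
            sign = -sign
        p = a[0][0]
        # Condense: every new entry is a 2x2 determinant involving the pivot.
        a = [[p * a[i][j] - a[i][0] * a[0][j] for j in range(1, n)]
             for i in range(1, n)]
        scale *= p ** (n - 2)
    return sign * a[0][0] / scale


def solve_cramer(matrix, rhs):
    """Solve A x = b with Cramer's rule: x_k = det(A_k) / det(A)."""
    d = det_by_condensation(matrix)
    if d == 0:
        raise ValueError("matrix is singular")
    solution = []
    for k in range(len(matrix)):
        # A_k is A with column k replaced by the right-hand side.
        replaced = [list(row[:k]) + [rhs[i]] + list(row[k + 1:])
                    for i, row in enumerate(matrix)]
        solution.append(det_by_condensation(replaced) / d)
    return solution
```

For example, `solve_cramer([[2, 1], [1, 3]], [3, 5])` yields the fractions 4/5 and 7/5.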

University of Tennessee, Knoxville: Trace

A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices

Author: Kühne Thomas D.
Lass Michael
Mohr Stephan
Plessl Christian
Wiebeler Hendrik
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 12/04/2018

We present the submatrix method, a highly parallelizable method for the approximate calculation of inverse p-th roots of large sparse symmetric matrices, which are required in different scientific applications. We follow the idea of Approximate Computing, allowing imprecision in the final result in order to be able to utilize the sparsity of the input matrix and to allow massively parallel execution. For an n x n matrix, the proposed algorithm allows the calculations to be distributed over n nodes with only little communication overhead. The approximate result matrix exhibits the same sparsity pattern as the input matrix, allowing for efficient reuse of allocated data structures. We evaluate the algorithm with respect to the error that it introduces into calculated results, as well as its performance and scalability. We demonstrate that the error is relatively limited for well-conditioned matrices and that results are still valuable for error-resilient applications like preconditioning even for ill-conditioned matrices. We discuss the execution time and scaling of the algorithm on a theoretical level and present a distributed implementation of the algorithm using MPI and OpenMP. We demonstrate the scalability of this implementation by running it on a high-performance compute cluster comprised of 1024 CPU cores, showing a speedup of 665x compared to single-threaded execution.
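
The abstract does not spell out the construction, so the sketch below rests on an assumption about how a column-wise submatrix scheme of this kind can work: for each column, gather the rows where that column is nonzero, form the small dense principal submatrix on those indices, apply the dense matrix function to it, and copy back only the entries belonging to the original column. To stay dependency-free, the example uses p = 1 (a plain inverse) computed by Gauss-Jordan elimination; the names `dense_inverse` and `submatrix_method` are hypothetical and a serial loop stands in for the n independent tasks.

```python
from fractions import Fraction


def dense_inverse(m):
    """Invert a small dense matrix by Gauss-Jordan elimination (exact).

    Raises StopIteration if the matrix is singular (no pivot can be found).
    """
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv_p = 1 / aug[col][col]
        aug[col] = [v * inv_p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def submatrix_method(a, dense_func=dense_inverse):
    """Approximate dense_func(a) column by column; keeps the sparsity of a."""
    n = len(a)
    result = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):  # every column is an independent task
        # Indices of nonzeros in column i (the diagonal is always included).
        idx = [j for j in range(n) if a[j][i] != 0 or j == i]
        sub = [[a[r][c] for c in idx] for r in idx]
        f_sub = dense_func(sub)
        local = idx.index(i)  # position of column i inside the submatrix
        for k, r in enumerate(idx):
            result[r][i] = f_sub[k][local]
    return result
```

On a block-diagonal input each submatrix is a complete block, so the result is exact; on a tridiagonal input the result is only an approximation, and entries outside the original sparsity pattern remain zero.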

arXiv.org e-Print Archive

Crossref

UPCommons. Portal del coneixement obert de la UPC

A performance focused, development friendly and model aided parallelization strategy for scientific applications

Author: Joshi Anagha S.
Publication venue: Clemson University Libraries
Publication date: 01/12/2016

The amelioration of high performance computing platforms has provided unprecedented computing power with the evolution of multi-core CPUs, massively parallel architectures such as General Purpose Graphics Processing Units (GPGPUs) and Many Integrated Core (MIC) architectures such as Intel's Xeon Phi coprocessor. However, it is a great challenge to leverage the capabilities of such advanced supercomputing hardware, as it requires efficient and effective parallelization of scientific applications. This task is difficult mainly due to the complexity of scientific algorithms coupled with the variety of available hardware and disparate programming models. To address the aforementioned challenges, this thesis presents a parallelization strategy to accelerate scientific applications that maximizes the opportunities of achieving speedup while minimizing the development efforts. Parallelization is a three-step process: (1) choose a compatible combination of architecture and parallel programming language, (2) translate base code/algorithm to a parallel language and (3) optimize and tune the application. In this research, a quantitative comparison of run time for various implementations of the k-means algorithm is used to establish that native languages (OpenMP, MPI, CUDA) perform better on respective architectures as opposed to vendor-neutral languages such as OpenCL. A qualitative model is used to select an optimal architecture for a given application by aligning the capabilities of accelerators with characteristics of the application. Once the optimal architecture is chosen, the corresponding native language is employed. This approach provides the best performance with reasonable accuracy (78%) of predicting a fitting combination, while eliminating the need for exploring different architectures individually. It reduces the required development efforts considerably as the application need not be re-written in multiple languages.
The focus can be solely on optimization and tuning to achieve the best performance on available architectures with minimized investment in terms of cost and efforts. To verify the prediction accuracy of the qualitative model, the OpenDwarfs benchmark suite, which implements the Berkeley dwarfs in OpenCL, is used. A dwarf is an algorithmic method that captures a pattern of computation and communication. For the purpose of this research, the focus is on 9 applications from various algorithmic domains that cover the seven dwarfs of symbolic computation, which were identified by Phillip Colella as omnipresent in scientific and engineering applications. To validate the parallelization strategy collectively, a case study is undertaken. This case study involves parallelization of the Lower Upper Decomposition for the Gaussian Elimination algorithm from the linear algebra domain, using conventional trial and error methods as well as the proposed 'Architecture First, Language Later' strategy. The development efforts incurred are contrasted for both methods. The aforesaid proposed strategy is observed to reduce the development efforts by an average of 50%.
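
As background for the case study mentioned above, here is a minimal serial sketch of LU decomposition by Gaussian elimination (Doolittle form, unit lower-triangular L). It is a plain textbook baseline, not the thesis's parallel implementation; the function name `lu_decompose`, the lack of row pivoting and the use of exact rational arithmetic are all assumptions made for illustration.

```python
from fractions import Fraction


def lu_decompose(matrix):
    """Factor a square matrix as L * U without row pivoting.

    L is unit lower triangular and U is upper triangular.
    Raises ZeroDivisionError when a zero pivot is encountered.
    """
    n = len(matrix)
    u = [[Fraction(x) for x in row] for row in matrix]
    l = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        if u[k][k] == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no row pivoting")
        for i in range(k + 1, n):
            # The multiplier that eliminates u[i][k] is stored in L.
            factor = u[i][k] / u[k][k]
            l[i][k] = factor
            u[i] = [x - factor * y for x, y in zip(u[i], u[k])]
    return l, u
```

The inner loop over rows `i` has no dependencies between iterations for a fixed `k`, which is the part that parallel implementations typically distribute across threads or devices.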

Clemson University: TigerPrints

Some Experiments and Issues to Exploit Multicore Parallelism in a Distributed-Memory Parallel Sparse Direct Solver

Author: Chowdhury Indranil
L'Excellent Jean-Yves
Publication venue: HAL CCSD
Publication date: 01/10/2010

MUMPS is a parallel sparse direct solver, using message passing (MPI) for parallelism. In this report we experiment with how thread parallelism can help take advantage of recent multicore architectures. The work done consists of testing multithreaded BLAS libraries and inserting OpenMP directives in the routines revealed to be costly by profiling, with the objective of avoiding any deep restructuring or rewriting of the code. We report on various aspects of this work, present some of the benefits and difficulties, and show that 4 threads per MPI process is generally a good compromise. We then discuss various issues that appear to be critical in a mixed MPI-OpenMP environment.

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1

Energy-Aware High Performance Computing

Author: Dolz Manuel F.
Heuveline Vincent
I. Malossi A. Cristiano
Ludwig Thomas
Quintana-Orti Enrique S.
Reza Heidari M.
Wlotzka Martin
Publication venue: 'IntechOpen'
Publication date: 22/03/2017

High performance computing centres consume substantial amounts of energy to power large-scale supercomputers and the necessary building and cooling infrastructure. Recently, considerable performance gains resulted predominantly from developments in multi-core, many-core and accelerator technology. Computing centres rapidly adopted this hardware to serve the increasing demand for computational power. However, further performance increases in large-scale computing systems are limited by the aggregate energy budget required to operate them. Power consumption has become a major cost factor for computing centres. Furthermore, energy consumption results in carbon dioxide emissions, a hazard for the environment and public health; and heat, which reduces the reliability and lifetime of hardware components. Energy efficiency is therefore crucial in high performance computing

IntechOpen