Search CORE

85 research outputs found

Recommended from our members

Preparing sparse solvers for exascale computing.

Author: Anzt Hartwig
Boman Erik
Curfman McInnes Lois
Falgout Rob
Ghysels Pieter
Heroux Michael
Li Xiaoye
Meier Yang Ulrike
Rajamanickam Sivasankaran
Rupp Karl
Smith Barry
Tran Mills Richard
Yamazaki Ichitaro
Publication venue: eScholarship, University of California
Publication date: 01/03/2020
Field of study

Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'

eScholarship - University of California

Tools and Selected Applications

Author: Kramer Stephan Christoph
Publication venue: Niedersächsische Staats- und Universitätsbibliothek Göttingen
Publication date: 22/11/2012
Field of study

Georg-August-University Göttingen

Multilevel techniques for Reservoir Simulation

Author: Christensen Max la Cour
Publication venue: Technical University of Denmark
Publication date: 01/01/2017
Field of study

Online Research Database In Technology

Numerical Platon: a unified linear equation solver interface for industrial softwares

Author: Belliard Michel
Calvin Christophe
Sécher Bernard
Publication venue: 'Elsevier BV'
Publication date: 01/01/2009
Field of study

This paper describes a tool called Numerical Platon developed by the French Atomic Energy Commission (CEA). It provides an interface to a set of parallel linear equation solvers for high-performance computers that may be used in industrial software written in various programming languages. This tool was developed as part of considerable efforts by the CEA Nuclear Energy Division in the past years to promote massively parallel software and on-shelf parallel tools to help develop new generation simulation codes. After the presentation of the package architecture and the available algorithms, we show examples of how Numerical Platon is used in CEA codes

HAL Descartes

HAL-CEA

Performance Modeling and Prediction for the Scalable Solution of Partial Differential Equations on Unstructured Grids

Author: Kaushik Dinesh Kumar
Publication venue: ODU Digital Commons
Publication date: 01/01/2002
Field of study

This dissertation studies the sources of poor performance in scientific computing codes based on partial differential equations (PDEs), which typically perform at a computational rate well below other scientific simulations (e.g., those with dense linear algebra or N-body kernels) on modern architectures with deep memory hierarchies. We identify that the primary factors responsible for this relatively poor performance are: insufficient available memory bandwidth, low ratio of work to data size (good algorithmic efficiency), and nonscaling cost of synchronization and gather/scatter operations (for a fixed problem size scaling). This dissertation also illustrates how to reuse the legacy scientific and engineering software within a library framework. Specifically, a three-dimensional unstructured grid incompressible Euler code from NASA has been parallelized with the Portable Extensible Toolkit for Scientific Computing (PETSc) library for distributed memory architectures. Using this newly instrumented code (called PETSc-FUN3D) as an example of a typical PDE solver, we demonstrate some strategies that are effective in tolerating the latencies arising from the hierarchical memory system and the network. Even on a single processor from each of the major contemporary architectural families, the PETSc-FUN3D code runs from 2.5 to 7.5 times faster than the legacy code on a medium-sized data set (with approximately 105 degrees of freedom). The major source of performance improvement is the increased locality in data reference patterns achieved through blocking, interlacing, and edge reordering. To explain these performance gains, we provide simple performance models based on memory bandwidth and instruction issue rates. Experimental evidence, in terms of translation lookaside buffer (TLB) and data cache miss rates, achieved memory bandwidth, and graduated floating point instructions per memory reference, is provided through accurate measurements with hardware counters. The performance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per-node performance. We identify the bottlenecks to scalability (algorithmic as well as implementation) for a fixed-size problem when the number of processors grows to several thousands (the expected level of concurrency on terascale architectures). We also evaluate the hybrid programming model (mixed distributed/shared) from a performance standpoint

Old Dominion University

VBARMS: A variable block algebraic recursive multilevel solver for sparse linear systems

Author: Liao Jia
Publication venue: University of Groningen
Publication date: 01/01/2015
Field of study

Sparse matrices arising from the solution of systems of partial differential equations often exhibit a perfect block structure. It means that the nonzero blocks in the sparsity pattern are fully dense (and typically small), e.g., when several unknown quantities are associated with the same grid point. However, similar block orderings can be sometimes found also on general unstructured matrices by ordering consecutively rows and columns with a similar sparsity pattern. We also can treat some zero entries of the reordered matrix as nonzero elements to enlarge the blocks to improve the performance. The reordering results in linear systems with blocks of variable size in general. Our recently developed parallel package pVBARMS (parallel variable block algebraic recursive multilevel solver) for distributed memory computers takes advantage of these frequently occurring structures in the design of the multilevel incomplete LU factorization preconditioner. It maximizes computational efficiency and achieves increased throughput during the computation and improved reliability on realistic applications. The method detects automatically any existing block structure in the matrix without any users prior knowledge of the underlying problem, and exploits it to maximize computational efficiency. We proposed a study of performance comparison of pVBAMRS and other popular solvers on a set of general linear systems arising from different application field. We also report on the numerical and parallel scalability of the pVBARMS package for solving the turbulent, Reynolds-averaged, Navier-Stokes (RANS) equations

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

A Survey on Hardware-aware and Heterogeneous Computing on Multicore Processors and Accelerators

Author: Buchty Rainer
Heuveline Vincent
Karl Wolfgang
Weiß Jan-Philipp
Publication venue: Karlsruher Institut für Technologie
Publication date: 01/01/2009
Field of study

KITopen

Providing performance portable numerics for Intel GPUs

Author: Anzt Hartwig
Cojean Terry
Tsai Yu-Hsiang M.
Publication venue: John Wiley and Sons
Publication date: 08/11/2022
Field of study

With discrete Intel GPUs entering the high-performance computing landscape, there is an urgent need for production-ready software stacks for these platforms. In this article, we report how we enable the Ginkgo math library to execute on Intel GPUs by developing a kernel backed based on the DPC++ programming environment. We discuss conceptual differences between the CUDA and DPC++ programming models and describe workflows for simplified code conversion. We evaluate the performance of basic and advanced sparse linear algebra routines available in Ginkgo\u27s DPC++ backend in the hardware-specific performance bounds and compare against routines providing the same functionality that ship with Intel\u27s oneMKL vendor library

KITopen

Algorithms for Large Scale Problems in Eigenvalue and Svd Computations and in Big Data Applications

Author: Wu Lingfei
Publication venue: W&M ScholarWorks
Publication date: 01/10/2016
Field of study

As ”big data” has increasing influence on our daily life and research activities, it poses significant challenges on various research areas. Some applications often demand a fast solution of large, sparse eigenvalue and singular value problems; In other applications, extracting knowledge from large-scale data requires many techniques such as statistical calculations, data mining, and high performance computing. In this dissertation, we develop efficient and robust iterative methods and software for the computation of eigenvalue and singular values. We also develop practical numerical and data mining techniques to estimate the trace of a function of a large, sparse matrix and to detect in real-time blob-filaments in fusion plasma on extremely large parallel computers. In the first work, we propose a hybrid two stage SVD method for efficiently and accurately computing a few extreme singular triplets, especially the ones corresponding to the smallest singular values. The first stage achieves fast convergence while the second achieves the final accuracy. Furthermore, we develop a high-performance preconditioned SVD software based on the proposed method on top of the state-of-the-art eigensolver PRIMME. The method can be used with or without preconditioning, on parallel computers, and is superior to other state-of-the-art SVD methods in both efficiency and robustness. In the second study, we provide insights and develop practical algorithms to accomplish efficient and accurate computation of interior eigenpairs using refined projection techniques in non-Krylov iterative methods. By analyzing different implementations of the refined projection, we propose a new hybrid method to efficiently find interior eigenpairs without compromising accuracy. Our numerical experiments illustrate the efficiency and robustness of the proposed method. In the third work, we present a novel method to estimate the trace of matrix inverse that exploits the pattern correlation between the diagonal of the inverse of the matrix and that of some approximate inverse. We leverage various sampling and fitting techniques to fit the diagonal of the approximation to that of the inverse. Our method may serve as a standalone kernel for providing a fast trace estimate or as a variance reduction method for Monte Carlo in some cases. An extensive set of experiments demonstrate the potential of our method. In the fourth study, we provide first results on applying outlier detection techniques to effectively tackle the fusion blob detection problem on extremely large parallel machines. We present a real-time region outlier detection algorithm to efficiently find and track blobs in fusion experiments and simulations. Our experiments demonstrated we can achieve linear time speedup up to 1024 MPI processes and complete blob detection in two or three milliseconds

College of William & Mary: W&M Publish