    Lecture 03: Hierarchically Low Rank Methods and Applications

    As simulation and analytics enter the exascale era, numerical algorithms, particularly implicit solvers that couple vast numbers of degrees of freedom, must span a widening gap between ambitious applications and the austere architectures available to support them. We present fifteen universals for researchers in scalable solvers: imperatives from computer architecture that scalable solvers must respect, strategies for achieving them that are already well established, and additional strategies currently being developed for an effective and efficient exascale software ecosystem. We consider recent generalizations of what it means to “solve” a computational problem, which suggest that we have often been “oversolving” problems at the smaller scales of the past because we could afford to do so. We present innovations that allow us to approach lin-log complexity in storage and operation count in many important algorithmic kernels, and thus create an opportunity for full applications with optimal scalability.
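
    As a back-of-the-envelope reading of the “lin-log complexity” claim (our own sketch, not taken from the lecture), suppose the admissible off-diagonal blocks of an N x N operator are compressed to rank at most k; then the usual complexity counts are

        % assumption: blockwise rank bounded by k (illustrative, not from the lecture)
        \begin{align*}
          \text{dense matrix:} \quad & \mathcal{O}(N^{2}) \ \text{storage and matrix-vector cost}, \\
          \text{hierarchically low-rank matrix:} \quad & \mathcal{O}(k\,N\log N) \ \text{storage and matrix-vector cost}.
        \end{align*}

    Nested-basis variants such as H^2 or HSS formats can remove the log factor and reach O(kN).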

    A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters

    In this work, we consider the solution of boundary integral equations by means of a scalable hierarchical matrix approach on clusters equipped with graphics processing units (GPUs). To this end, we extend our existing single-GPU hierarchical matrix library hmglib so that it scales to many GPUs and can be coupled to arbitrary application codes. Using a model GPU implementation of a boundary element method (BEM) solver, we achieve more than 67 percent relative parallel speed-up going from 128 to 1024 GPUs, for a model geometry test case with 1.5 million unknowns and a real-world geometry test case with almost 1.2 million unknowns. On 1024 GPUs of the cluster Titan, it takes less than 6 minutes to solve the 1.5 million unknowns problem, with 5.7 minutes for the setup phase and 20 seconds for the iterative solver. To the best of the authors' knowledge, this is the first fully GPU-based, distributed-memory parallel, Open Source hierarchical matrix library using the traditional H-matrix format and adaptive cross approximation, with an application to BEM problems.
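
    The abstract names adaptive cross approximation (ACA) as the compression kernel. Below is a minimal NumPy sketch of partially pivoted ACA applied to an implicitly given matrix block; the function name, the stopping heuristic, and the toy kernel in the usage lines are our own illustrative assumptions and not part of the hmglib API.

        import numpy as np

        def aca_partial_pivot(get_row, get_col, m, n, tol=1e-6, max_rank=50):
            """Partially pivoted ACA: build factors U (m x k) and V (k x n) with
            block ~= U @ V, touching only a few rows and columns of the block."""
            U, V, used_rows = [], [], set()
            i = 0                                    # first pivot row
            for _ in range(min(max_rank, m, n)):
                r = get_row(i).astype(float)         # residual of row i
                for u, v in zip(U, V):
                    r -= u[i] * v
                used_rows.add(i)
                j = int(np.argmax(np.abs(r)))        # pivot column
                if abs(r[j]) < 1e-14:                # row is already well approximated
                    break
                v = r / r[j]
                c = get_col(j).astype(float)         # residual of column j
                for u, w in zip(U, V):
                    c -= w[j] * u
                U.append(c)
                V.append(v)
                if np.linalg.norm(c) * np.linalg.norm(v) < tol:
                    break                            # rank-1 update is negligible
                rest = [ii for ii in range(m) if ii not in used_rows]
                if not rest:
                    break
                i = rest[int(np.argmax(np.abs(c[rest])))]  # next pivot row
            if not U:
                return np.zeros((m, 0)), np.zeros((0, n))
            return np.array(U).T, np.array(V)

        # usage: smooth kernel between two well-separated 1D point clouds
        x, y = np.linspace(0.0, 1.0, 300), np.linspace(3.0, 4.0, 200)
        get_row = lambda i: 1.0 / (1.0 + np.abs(x[i] - y))
        get_col = lambda j: 1.0 / (1.0 + np.abs(x - y[j]))
        U, V = aca_partial_pivot(get_row, get_col, len(x), len(y))
        A = 1.0 / (1.0 + np.abs(x[:, None] - y[None, :]))
        print(U.shape[1], np.linalg.norm(A - U @ V) / np.linalg.norm(A))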

    Guest editorial: Special issue on parallel matrix algorithms and applications (PMAA’16)

    This special issue of Parallel Computing contains nine articles, selected after peer review from invited and contributed presentations made at the 8th International Workshop on Parallel Matrix Algorithms and Applications (PMAA'16), which took place at the University of Bordeaux, France, on July 6-8, 2016. The workshop attracted around 120 participants from all continents; about 25% were PhD students and around 10% came from industry. The workshop was co-chaired by Emmanuel Agullo, Peter Arbenz, Luc Giraud, and Olaf Schenk. The members of the program committee were: P. D'Ambra, H. [...] A total of twelve high-quality submissions were received, of which nine were eventually accepted and appear in this special issue. The nine papers address diverse aspects of linear algebra and high-performance computing.
    1. Jack Dongarra, Mark Gates, and Stanimire Tomov address accelerating the SVD two-stage reduction and divide-and-conquer using GPUs. The increasing gap between memory bandwidth and computation speed motivates the choice of algorithms that take full advantage of today's high-performance computers. For dense matrices, the classic algorithm for the SVD uses a one-stage reduction to bidiagonal form, which is limited in performance by the memory bandwidth. To overcome this limitation, a two-stage reduction to bidiagonal form has been gaining popularity. As accelerators, such as GPUs and co-processors, are becoming increasingly widespread in high-performance computing, the authors present an accelerated SVD employing a two-stage reduction to bidiagonal form as well as a parallelized and accelerated divide-and-conquer algorithm to solve the subsequent bidiagonal SVD. The new implementation provides a significant speedup compared to existing multi-core and GPU-based SVD implementations.
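
    As a small CPU-side illustration of the divide-and-conquer idea behind the first paper (using SciPy's LAPACK bindings, not the authors' GPU implementation), LAPACK exposes both the classic QR-iteration SVD driver and the divide-and-conquer driver, and the latter is typically much faster on large dense matrices:

        import numpy as np
        from scipy.linalg import svd
        from time import perf_counter

        A = np.random.default_rng(0).standard_normal((2000, 1500))

        t0 = perf_counter()
        U, s, Vt = svd(A, lapack_driver='gesdd')      # divide and conquer (xGESDD)
        t1 = perf_counter()
        U2, s2, Vt2 = svd(A, lapack_driver='gesvd')   # classic driver (xGESVD)
        t2 = perf_counter()

        # both drivers return the same singular values up to round-off
        print(np.allclose(s, s2), f"gesdd {t1 - t0:.2f}s vs gesvd {t2 - t1:.2f}s")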

    Algorithmic patterns for H-matrices on many-core processors

    In this work, we consider the reformulation of hierarchical (H) matrix algorithms for many-core processors, with a model implementation on graphics processing units (GPUs). H-matrices approximate specific dense matrices, e.g., from discretized integral equations or kernel ridge regression, leading to log-linear time complexity in dense matrix-vector products. The parallelization of H-matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on accelerating existing H-matrix CPU implementations with many-core processors, we here aim at relying entirely on that processor type. As the main contribution, we introduce the parallel algorithmic patterns necessary to map the full H-matrix construction and the fast matrix-vector product to many-core hardware. Crucial ingredients are space-filling curves, parallel tree traversal, and batching of linear algebra operations. The resulting model GPU implementation hmglib is, to the best of the authors' knowledge, the first entirely GPU-based Open Source H-matrix library of this kind. We conclude this work with an in-depth performance analysis and a comparative performance study against a standard H-matrix library, highlighting profound speedups of our many-core parallel approach.
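
    Of the three ingredients named above, the space-filling-curve step is the easiest to sketch in isolation. The snippet below (our own NumPy illustration, not hmglib code) computes Z-order (Morton) keys for 2D points so that sorting by key groups spatially nearby points into contiguous index ranges, which is what the cluster tree construction builds on.

        import numpy as np

        def morton_key_2d(ix, iy, bits=16):
            """Interleave the bits of integer grid coordinates into a Z-order key."""
            key = 0
            for b in range(bits):
                key |= ((ix >> b) & 1) << (2 * b)        # x bit -> even position
                key |= ((iy >> b) & 1) << (2 * b + 1)    # y bit -> odd position
            return key

        def zorder_permutation(points, bits=16):
            """Return a permutation that orders 2D points in [0,1)^2 along the Z-curve."""
            grid = np.minimum((points * (1 << bits)).astype(np.int64), (1 << bits) - 1)
            keys = [morton_key_2d(int(px), int(py), bits) for px, py in grid]
            return np.argsort(keys)

        pts = np.random.default_rng(1).random((1000, 2))
        perm = zorder_permutation(pts)
        pts_reordered = pts[perm]   # nearby points now occupy nearby index ranges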

    A Parallel Hierarchical Blocked Adaptive Cross Approximation Algorithm

    This paper presents a hierarchical low-rank decomposition algorithm that assumes any matrix element can be computed in O(1) time. The proposed algorithm computes rank-revealing decompositions of sub-matrices with a blocked adaptive cross approximation (BACA) algorithm, followed by a hierarchical merge operation via truncated singular value decompositions (H-BACA). The proposed algorithm significantly improves the convergence of the baseline ACA algorithm and achieves reduced computational complexity compared to full decompositions such as the rank-revealing QR decomposition. Numerical results demonstrate the efficiency, accuracy, and parallel scalability of the proposed algorithm.
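
    Below is a minimal NumPy sketch of the merge step described above, i.e. recompressing the sum of two low-rank sub-block factorizations with a truncated SVD; the function name and the tolerance handling are illustrative assumptions, not the paper's H-BACA implementation.

        import numpy as np

        def truncated_svd_merge(U1, V1, U2, V2, tol=1e-8):
            """Recompress U1 @ V1 + U2 @ V2 into a single low-rank pair (U, V)."""
            U = np.hstack([U1, U2])                   # m x (k1 + k2)
            V = np.vstack([V1, V2])                   # (k1 + k2) x n
            Qu, Ru = np.linalg.qr(U)                  # thin QR of the stacked factors
            Qv, Rv = np.linalg.qr(V.T)
            W, s, Zt = np.linalg.svd(Ru @ Rv.T)       # small (k1+k2) x (k1+k2) SVD
            k = max(1, int(np.sum(s > tol * s[0])))   # truncation rank
            Unew = (Qu @ W[:, :k]) * s[:k]            # absorb singular values on the left
            Vnew = (Qv @ Zt[:k].T).T
            return Unew, Vnew

        # consistency check on random low-rank data
        rng = np.random.default_rng(2)
        m, n, k1, k2 = 200, 150, 5, 7
        U1, V1 = rng.standard_normal((m, k1)), rng.standard_normal((k1, n))
        U2, V2 = rng.standard_normal((m, k2)), rng.standard_normal((k2, n))
        Un, Vn = truncated_svd_merge(U1, V1, U2, V2)
        print(Un.shape[1], np.allclose(Un @ Vn, U1 @ V1 + U2 @ V2))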

    Lecture 02: Tile Low-rank Methods and Applications (w/review)

    As simulation and analytics enter the exascale era, numerical algorithms, particularly implicit solvers that couple vast numbers of degrees of freedom, must span a widening gap between ambitious applications and the austere architectures available to support them. We present fifteen universals for researchers in scalable solvers: imperatives from computer architecture that scalable solvers must respect, strategies for achieving them that are already well established, and additional strategies currently being developed for an effective and efficient exascale software ecosystem. We consider recent generalizations of what it means to “solve” a computational problem, which suggest that we have often been “oversolving” problems at the smaller scales of the past because we could afford to do so. We present innovations that allow us to approach lin-log complexity in storage and operation count in many important algorithmic kernels, and thus create an opportunity for full applications with optimal scalability.
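
    Since the lecture title refers to tile low-rank (TLR) methods, here is a minimal NumPy sketch of the basic idea under an assumed tile size nb and tolerance tol: partition a dense matrix into tiles and replace each tile by a truncated-SVD factor pair whenever that saves memory. This is our own illustration of the general technique, not code from the lecture.

        import numpy as np

        def compress_tile(tile, tol):
            """Truncated SVD of one tile: returns (U, V) with tile ~= U @ V,
            or None if low rank would not save memory at this tolerance."""
            U, s, Vt = np.linalg.svd(tile, full_matrices=False)
            k = max(1, int(np.sum(s > tol * s[0])))
            if k * (tile.shape[0] + tile.shape[1]) >= tile.size:
                return None
            return U[:, :k] * s[:k], Vt[:k]

        def tlr_compress(A, nb=64, tol=1e-6):
            """Cut A into nb x nb tiles; keep each tile dense or as low-rank factors."""
            tiles = {}
            for i in range(0, A.shape[0], nb):
                for j in range(0, A.shape[1], nb):
                    tile = A[i:i + nb, j:j + nb]
                    lr = compress_tile(tile, tol)
                    tiles[(i, j)] = lr if lr is not None else tile.copy()
            return tiles

        # example: a smooth kernel matrix compresses well away from the diagonal
        x = np.linspace(0.0, 1.0, 512)
        A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))
        tiles = tlr_compress(A)
        n_lowrank = sum(isinstance(t, tuple) for t in tiles.values())
        print(n_lowrank, "of", len(tiles), "tiles stored in low-rank form")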