20 research outputs found

    A parallel butterfly algorithm

    Full text link
    The butterfly algorithm is a fast algorithm which approximately evaluates a discrete analogue of the integral transform \int K(x,y) g(y) dy at large numbers of target points when the kernel, K(x,y), is approximately low-rank when restricted to subdomains satisfying a certain simple geometric condition. In d dimensions with O(N^d) quasi-uniformly distributed source and target points, when each appropriate submatrix of K is approximately rank-r, the running time of the algorithm is at most O(r^2 N^d log N). A parallelization of the butterfly algorithm is introduced which, assuming a message latency of \alpha and per-process inverse bandwidth of \beta, executes in at most O(r^2 (N^d/p) log N + (\beta r N^d/p + \alpha) log p) time using p processes. This parallel algorithm was then instantiated in the form of the open-source DistButterfly library for the special case where K(x,y) = exp(i \Phi(x,y)), with \Phi(x,y) a black-box, sufficiently smooth, real-valued phase function. Experiments on Blue Gene/Q demonstrate impressive strong-scaling results for important classes of phase functions. Using quasi-uniform sources, hyperbolic Radon transforms and an analogue of a 3D generalized Radon transform were observed to strong-scale from 1 node/16 cores up to 1,024 nodes/16,384 cores with greater than 90% and 82% efficiency, respectively. Comment: To appear in SIAM Journal on Scientific Computing
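The key property the abstract above relies on can be illustrated in a few lines: when a smooth oscillatory kernel is restricted to a suitable pair of source and target subdomains, the resulting matrix block is numerically low-rank, so the block matrix-vector product can be applied through a rank-r factorization instead of a dense O(n^2) apply. The sketch below is a hypothetical 1D illustration (the kernel, subdomains, and tolerance are chosen for demonstration, not taken from the paper).

```python
import numpy as np

# Illustrative 1D kernel block: K(x, y) = exp(i * x * y) restricted to
# small source/target subdomains, where it is numerically low-rank.
rng = np.random.default_rng(0)
n = 200
x = np.linspace(2.0, 2.5, n)          # targets in a small subdomain
y = np.linspace(0.0, 0.5, n)          # sources in a small subdomain
K = np.exp(1j * np.outer(x, y))

# A truncated SVD exposes the numerical rank r and yields K ~= U @ V.
U_full, s, Vh = np.linalg.svd(K)
r = int(np.sum(s > 1e-10 * s[0]))     # numerical rank of the block
U, V = U_full[:, :r] * s[:r], Vh[:r, :]

g = rng.standard_normal(n)
direct = K @ g                        # O(n^2) dense apply
fast = U @ (V @ g)                    # O(r * n) apply via the factors
print(r, np.max(np.abs(direct - fast)))
```

The butterfly algorithm organizes many such low-rank blocks hierarchically; this sketch only demonstrates the single-block compression that makes the hierarchy worthwhile.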

    Elemental: A new framework for distributed memory dense matrix computations

    Get PDF
    Abstract Parallelizing dense matrix computations to distributed memory architectures is a well-studied subject and generally considered to be among the best understood domains of parallel computing. Two packages, developed in the mid 1990s, still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape of distributed memory architectures within a single processor, these packages must be revisited, since it will likely not be practical to use MPI-based implementations. Thus, this is a good time to review what lessons we have learned since the introduction of these two packages and to propose a simple yet effective alternative. Preliminary performance results show the new solution achieves considerably better performance than the previously developed libraries.

    Fast parallel solution of heterogeneous 3D time-harmonic wave equations

    Get PDF
    Several advancements related to the solution of 3D time-harmonic wave equations are presented, especially in the context of a parallel moving-PML sweeping preconditioner for problems without large-scale resonances. The main contribution of this dissertation is the introduction of an efficient parallel sweeping preconditioner and its subsequent application to several challenging velocity models. For instance, 3D seismic problems approaching a billion degrees of freedom have been solved in just a few minutes using several thousand processors. The setup and application costs of the sequential algorithm were also refined to O(γ^2 N^(4/3)) and O(γ N log N), respectively, where N denotes the total number of degrees of freedom in the 3D volume and γ(ω) denotes the modestly frequency-dependent number of grid points per Perfectly Matched Layer discretization. Furthermore, high-performance parallel algorithms are proposed for performing multifrontal triangular solves with many right-hand sides, and a custom compression scheme is introduced which builds upon the translation invariance of free-space Green’s functions in order to justify the replacement of each dense matrix within a certain modified multifrontal method with the sum of a small number of Kronecker products. For the sake of reproducibility, every algorithm exercised within this dissertation is made available as part of the open source packages Clique and Parallel Sweeping Preconditioner (PSP). Computational Science, Engineering, and Mathematics
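The compression idea mentioned in the abstract above, replacing a dense matrix by a short sum of Kronecker products, can be sketched with the classic Pitsianis-Van Loan rearrangement: reshuffling the matrix turns Kronecker structure into low-rank structure, which an SVD then exposes. This is one standard way to compute such approximations; the dissertation's custom scheme may differ in detail, and the block sizes below are illustrative.

```python
import numpy as np

def kron_approx(A, m1, n1, m2, n2, terms):
    """Approximate A (m1*m2 x n1*n2) by sum_k B_k kron C_k, B_k m1 x n1."""
    # Rearrange A so that Kronecker structure becomes rank structure:
    # for A = B kron C, the rearranged R is the rank-1 outer product of
    # the row-major vectorizations of B and C.
    R = (A.reshape(m1, m2, n1, n2)
          .transpose(0, 2, 1, 3)
          .reshape(m1 * n1, m2 * n2))
    U, s, Vh = np.linalg.svd(R, full_matrices=False)
    approx = np.zeros_like(A)
    for k in range(terms):
        B = (np.sqrt(s[k]) * U[:, k]).reshape(m1, n1)
        C = (np.sqrt(s[k]) * Vh[k, :]).reshape(m2, n2)
        approx += np.kron(B, C)
    return approx

# Sanity check: a matrix that is exactly one Kronecker product is
# recovered exactly by a single term.
rng = np.random.default_rng(2)
B1, C1 = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
A = np.kron(B1, C1)
err = np.max(np.abs(A - kron_approx(A, 3, 3, 4, 4, terms=1)))
print(err)
```

For frontal matrices that are only approximately translation-invariant, more terms are kept and the truncation level trades accuracy against memory.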

    Formalized parallel dense linear algebra and its application to the generalized eigenvalue problem

    No full text
    This thesis demonstrates an efficient parallel method of solving the generalized eigenvalue problem, KΦ = M ΦΛ, where K is symmetric and M is symmetric positive-definite, by first converting it to a standard eigenvalue problem, solving the standard eigenvalue problem, and back-transforming the results. An abstraction for parallel dense linear algebra is introduced along with a new algorithm for forming A := U⁻ᵀ K U⁻¹, where U is the Cholesky factor of M, that is up to twice as fast as the ScaLAPACK implementation. Additionally, large improvements over the PBLAS implementations of general matrix-matrix multiplication and triangular solves with many right-hand sides are shown. Significant performance gains are also demonstrated for Cholesky factorizations, and a case is made for using 2D-cyclic distributions with a distribution blocksize of one. Aerospace Engineering and Engineering Mechanics
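The reduction described above can be sketched serially in a few lines of NumPy (this is not the parallel library itself, just the underlying linear algebra): factor M = UᵀU, form A := U⁻ᵀ K U⁻¹, solve the standard symmetric eigenproblem for A, and back-transform the eigenvectors.

```python
import numpy as np

# Build a small symmetric K and symmetric positive-definite M.
rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
K = B + B.T
C = rng.standard_normal((n, n))
M = C @ C.T + n * np.eye(n)

U = np.linalg.cholesky(M).T        # upper-triangular U with M = U^T U
Uinv = np.linalg.inv(U)
A = Uinv.T @ K @ Uinv              # A := U^{-T} K U^{-1} (symmetric)

lam, X = np.linalg.eigh(A)         # standard symmetric eigenproblem
Phi = Uinv @ X                     # back-transform: Phi = U^{-1} X

# Verify the generalized eigenrelation K Phi = M Phi Lambda.
residual = np.max(np.abs(K @ Phi - M @ Phi @ np.diag(lam)))
print(residual)
```

In practice the explicit inverse of U would be replaced by triangular solves; the explicit form above is used only to keep the sketch short.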

    Scientific computing

    No full text