
    An overview of block Gram-Schmidt methods and their stability properties

    Block Gram-Schmidt algorithms serve as essential kernels in many scientific computing applications, but for many commonly used variants, a rigorous treatment of their stability properties remains open. This survey provides a comprehensive categorization of block Gram-Schmidt algorithms, particularly those used in Krylov subspace methods to build orthonormal bases one block vector at a time. All known stability results are assembled, and new results are summarized or conjectured for important communication-reducing variants. Additionally, new block versions of low-synchronization variants are derived, and their efficacy and stability are demonstrated for a wide range of challenging examples. Low-synchronization variants appear remarkably stable for s-step-like matrices built with Newton polynomials, pointing towards a new stable and efficient backbone for Krylov subspace methods. Numerical examples are computed with a versatile MATLAB package hosted at https://github.com/katlund/BlockStab, and scripts for reproducing all results in the paper are provided. Block Gram-Schmidt implementations in popular software packages are discussed, along with a number of open problems. An appendix containing all algorithms typeset in a uniform fashion is provided. Comment: 42 pages, 5 tables, 17 figures, 20 algorithms
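    To make the block structure concrete, the following is a minimal NumPy sketch of the block classical Gram-Schmidt (BCGS) skeleton that such surveys categorize: each new block of s columns is projected against all previously computed basis vectors and then orthonormalized internally. The function name bcgs, the block width s, and the use of numpy.linalg.qr as the intra-block orthogonalization are illustrative assumptions, not the BlockStab implementation or any particular variant from the paper.

    import numpy as np

    def bcgs(X, s):
        """Illustrative block classical Gram-Schmidt: orthonormalize the columns
        of X one block of s vectors at a time (a sketch, not the BlockStab code)."""
        m, n = X.shape
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for k in range(0, n, s):
            cols = slice(k, min(k + s, n))
            W = X[:, cols].copy()
            if k > 0:
                # Block projection against all previously computed basis vectors.
                R[:k, cols] = Q[:, :k].T @ W
                W -= Q[:, :k] @ R[:k, cols]
            # Intra-block orthonormalization; plain QR stands in for the "IO" step.
            Q[:, cols], R[cols, cols] = np.linalg.qr(W)
        return Q, R

    # Loss of orthogonality and the factorization residual give a quick stability check.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 24))
    Q, R = bcgs(X, s=4)
    print(np.linalg.norm(Q.T @ Q - np.eye(24)), np.linalg.norm(X - Q @ R))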

    Solving large sparse eigenvalue problems on supercomputers

    An important problem in scientific computing consists of finding a few eigenvalues and corresponding eigenvectors of a very large and sparse matrix. The most popular methods for solving these problems are based on projection techniques onto appropriate subspaces. The main attraction of these methods is that they only require the use of the matrix in the form of matrix-by-vector multiplications. The implementations on supercomputers of two such methods for symmetric matrices, namely Lanczos' method and Davidson's method, are compared. Since one of the most important operations in these two methods is the multiplication of vectors by the sparse matrix, methods for performing this operation efficiently are discussed. The advantages and disadvantages of each method are compared and implementation aspects are discussed. Numerical experiments on a one-processor CRAY 2 and a CRAY X-MP are reported. Possible parallel implementations are also discussed.
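    Since both methods access the matrix only through matrix-by-vector products, a minimal sketch of the plain Lanczos recurrence (symmetric case, no reorthogonalization) illustrates that pattern; the scipy.sparse test matrix and the absence of restarting or convergence tests are simplifications for illustration, not the implementations compared in the paper.

    import numpy as np
    import scipy.sparse as sp

    def lanczos(A, k, seed=0):
        """Plain k-step Lanczos: the only access to A is the sparse product A @ v.
        Returns the eigenvalues of the k-by-k tridiagonal matrix T, which
        approximate extremal eigenvalues of the symmetric matrix A (sketch only,
        no reorthogonalization)."""
        n = A.shape[0]
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        v_prev = np.zeros(n)
        alpha, beta = np.zeros(k), np.zeros(k - 1)
        b = 0.0
        for j in range(k):
            w = A @ v - b * v_prev          # sparse matrix-by-vector product
            alpha[j] = v @ w
            w -= alpha[j] * v
            b = np.linalg.norm(w)
            if j < k - 1:
                beta[j] = b
            v_prev, v = v, w / b
        T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
        return np.linalg.eigvalsh(T)

    A = sp.diags(np.arange(1.0, 501.0))     # symmetric 500 x 500 test matrix
    print(lanczos(A, 40)[-3:])              # Ritz values near the largest eigenvalues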

    An efficient implementation of the block Gram--Schmidt method

    The block Gram--Schmidt method computes the QR factorisation rapidly, but its performance depends on the block size m. We endeavor to determine the optimal m automatically during a single execution. Our algorithm determines m by observing the relationship between computation time and computational complexity. Numerical experiments show that the proposed algorithms compute approximately twice as fast as the block Gram--Schmidt method for some block sizes, and offer a viable option for computing the QR factorisation in a more stable and rapid manner.
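    The paper's selection rule is not reproduced in the abstract, but the idea of relating measured time to complexity can be sketched as follows: run a short pilot of block Gram-Schmidt steps for each candidate block size, divide the wall-clock time by a rough flop model, and keep the candidate with the lowest cost per flop. The candidate list, the flop model, and the pilot length are assumptions for illustration only.

    import time
    import numpy as np

    def pilot_time_per_flop(X, m, n_blocks=4):
        """Time a few block Gram-Schmidt steps with block size m and return
        seconds per modelled flop; the flop model (projection plus intra-block QR)
        is a rough estimate, not the paper's complexity analysis."""
        rows = X.shape[0]
        Q = np.empty((rows, 0))
        flops, start = 0.0, time.perf_counter()
        for k in range(n_blocks):
            W = X[:, k * m:(k + 1) * m].copy()
            if Q.shape[1] > 0:
                W -= Q @ (Q.T @ W)                 # block projection
                flops += 4.0 * rows * Q.shape[1] * m
            Qk, _ = np.linalg.qr(W)                # intra-block QR
            flops += 2.0 * rows * m * m
            Q = np.hstack([Q, Qk])
        return (time.perf_counter() - start) / flops

    rng = np.random.default_rng(1)
    X = rng.standard_normal((4000, 256))
    candidates = [4, 8, 16, 32, 64]
    print("chosen block size:", min(candidates, key=lambda m: pilot_time_per_flop(X, m)))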

    Accelerating data-intensive scientific visualization and computing through parallelization

    Many extreme-scale scientific applications generate colossal amounts of data that require an increasing number of processors for parallel processing. The research in this dissertation is focused on optimizing the performance of data-intensive parallel scientific visualization and computing.

    In parallel scientific visualization, there exist three well-known parallel architectures: sort-first, sort-middle, and sort-last. The research in this dissertation studies the composition stage of the sort-last architecture for scientific visualization and proposes a generalized method, namely Grouping More and Pairing Less (GMPL), for order-independent image composition workflow scheduling in sort-last parallel rendering. The technical merits of GMPL are two-fold: i) it takes a prime-factorization-based approach to processor grouping, which not only obviates the common restriction in existing methods on the total number of processors needed to fully utilize computing resources, but also breaks processors down to the lowest level with a minimum number of peers in each group to achieve high concurrency and save communication cost; ii) within each group, it employs an improved direct-send method to narrow down each processor's pairing scope, further reducing communication overhead and increasing composition efficiency. The performance superiority of GMPL over existing methods is evaluated through rigorous theoretical analysis and further verified by extensive experimental results on a high-performance visualization cluster. The dissertation also parallelizes the over operator, which is commonly used for α-blending in various visualization techniques. Compared with its predecessor, the fully generalized over operator is n-operator compatible. To demonstrate its advantages, the proposed operator is applied to the asynchronous and order-dependent image composition problem in parallel visualization. In addition, the dissertation proposes a very-high-speed pipeline-based architecture for parallel sort-last visualization of big data by developing and integrating three component techniques: i) a fully parallelized per-ray integration method that significantly reduces the number of iterations required for image rendering; ii) a real-time over operator that not only eliminates the restriction of pre-sorting and order-dependency, but also facilitates a high degree of parallelization for image composition.

    In parallel scientific computing, the research goal is to optimize QR decomposition, a primary algebraic decomposition procedure that plays an important role in scientific computing. QR decomposition produces orthogonal bases, i.e., "core" bases for a given matrix, and can often be leveraged to build a complete solution to many fundamental scientific computing problems, including the least squares problem, linear systems of equations, and eigenvalue problems. A new matrix decomposition method is proposed to improve the time efficiency of parallel computing, together with a rigorous proof of its numerical stability. The proposed solutions demonstrate significant performance improvement over existing methods for data-intensive parallel scientific visualization and computing. Considering the ever-increasing data volume in various science domains, the research in this dissertation has a great impact on the success of next-generation large-scale scientific applications.
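    The over operator that the dissertation parallelizes has a simple closed form on premultiplied-alpha pixels, and its associativity is exactly what lets partial composites from different processors be combined in any grouping. A minimal sketch follows; the premultiplied-RGBA convention and the array layout are assumptions, and the n-operator generalization proposed in the dissertation is not reproduced here.

    import numpy as np

    def over(front, back):
        """Standard 'over' operator on premultiplied-RGBA pixels.
        front, back: arrays of shape (..., 4) with channels (R, G, B, A) in [0, 1].
        Associativity means partial composites can be grouped arbitrarily, as long
        as the front-to-back order within each pair is respected."""
        a_front = front[..., 3:4]
        return front + (1.0 - a_front) * back

    # Three semi-transparent layers, front to back; grouping does not change the result.
    layers = [np.array([0.5, 0.0, 0.0, 0.5]),    # premultiplied red, 50% opaque
              np.array([0.0, 0.3, 0.0, 0.3]),
              np.array([0.0, 0.0, 0.8, 0.8])]
    left = over(over(layers[0], layers[1]), layers[2])
    right = over(layers[0], over(layers[1], layers[2]))
    print(np.allclose(left, right))              # True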

    Enlarged Krylov Subspace Conjugate Gradient Methods for Reducing Communication

    In this paper we introduce a new approach for reducing communication in Krylov subspace methods that consists of enlarging the Krylov subspace by a maximum of t vectors per iteration, based on the domain decomposition of the graph of A. The obtained enlarged Krylov subspace is a superset of the Krylov subspace, so it is possible to search for the solution of the system Ax=b in the enlarged Krylov subspace instead of the Krylov subspace. Moreover, we show in this paper that the enlarged Krylov projection subspace methods lead to faster convergence in terms of iterations and to parallelizable algorithms with less communication, with respect to Krylov methods. In this paper we focus on Conjugate Gradient (CG), a Krylov projection method for symmetric (Hermitian) positive definite matrices, and discuss two new versions of Conjugate Gradient. The first method, multiple search direction with orthogonalization CG (MSDO-CG), is an adapted version of MSD-CG with A-orthonormalization of the search directions, yielding a projection method that guarantees convergence at least as fast as CG. The second projection method that we propose here, long recurrence enlarged CG (LRE-CG), is similar to GMRES in that we build an orthonormal basis for the enlarged Krylov subspace rather than finding search directions, and then use the whole basis to update the solution and the residual. Both methods converge faster than CG in terms of iterations, but LRE-CG converges faster than MSDO-CG since it uses the whole basis to update the solution rather than only t search directions. The more subdomains are introduced, or the larger t is, the faster both methods converge with respect to CG in terms of iterations. For example, for t = 64 the MSDO-CG and LRE-CG methods require 75% to 98% fewer iterations than CG for the different test matrices. But increasing t also increases the memory requirements, so in practice t should be relatively small, depending on the available memory, the size of the matrix, and the number of iterations needed for convergence. We also present the parallel algorithms along with their expected performance based on estimated run times, and the preconditioned versions with their convergence behavior.
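    The enlargement step can be pictured by splitting the initial residual over the subdomains and generating a block Krylov basis from the resulting t vectors. The sketch below uses a contiguous partition of the unknowns in place of the graph-based domain decomposition and a dense QR for the orthonormalization; it illustrates the subspace construction only, not the MSDO-CG or LRE-CG iterations.

    import numpy as np

    def split_residual(r0, t):
        """Split r0 into t vectors, one per subdomain; a contiguous partition of
        the unknowns stands in for the domain decomposition of the graph of A."""
        n = r0.shape[0]
        R0 = np.zeros((n, t))
        for i, idx in enumerate(np.array_split(np.arange(n), t)):
            R0[idx, i] = r0[idx]                 # r0 restricted to subdomain i
        return R0                                # the columns sum back to r0

    def enlarged_krylov_basis(A, R0, k):
        """Orthonormal basis of span{R0, A R0, ..., A^{k-1} R0}, the enlarged
        Krylov subspace, which contains the ordinary Krylov subspace of r0."""
        blocks = [R0]
        for _ in range(k - 1):
            blocks.append(A @ blocks[-1])
        Q, _ = np.linalg.qr(np.hstack(blocks))
        return Q

    rng = np.random.default_rng(2)
    n, t, k = 100, 4, 3
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)                  # symmetric positive definite test matrix
    r0 = rng.standard_normal(n)                  # residual for x0 = 0
    print(enlarged_krylov_basis(A, split_residual(r0, t), k).shape)   # (100, t * k)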

    Enlarged Krylov Subspace Conjugate Gradient Methods for Reducing Communication

    In this paper we introduce a new approach for reducing communication in Krylov subspace methods that consists of enlarging the Krylov subspace by a maximum of t vectors per iteration, based on a domain decomposition of the graph of A. The obtained enlarged Krylov subspace K_{k,t}(A, r_0) is a superset of the Krylov subspace K_k(A, r_0), i.e., K_k(A, r_0) ⊂ K_{k,t}(A, r_0). Thus, we search for the solution of the system Ax = b in K_{k,t}(A, r_0) instead of K_k(A, r_0). Moreover, we show in this paper that the enlarged Krylov projection subspace methods lead to faster convergence in terms of iterations and parallelizable algorithms with less communication, with respect to Krylov methods.