10 research outputs found

    Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors

    [EN] In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix-vector multiplication kernel that transforms the linear systems' right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA's K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver. (C) 2018 Elsevier B.V. All rights reserved.
    This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Number DE-SC-0010042. H. Anzt was supported by the "Impuls and Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2014-53495-R of the MINECO-FEDER; and project OPRECOMP (http://oprecomp.eu) with the financial support of the Future and Emerging Technologies (FET) programme within the European Union's Horizon 2020 research and innovation programme, under grant agreement No 732631. The authors would also like to acknowledge the Swiss National Computing Centre (CSCS) for granting computing resources in the Small Development Project entitled "Energy-Efficient preconditioning for iterative linear solvers" (#d65).
    Anzt, H.; Dongarra, J.; Flegar, G.; Quintana Ortí, E. S. (2019). Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors. Parallel Computing. 81:131-146. https://doi.org/10.1016/j.parco.2017.12.006
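
    The computational pattern at the heart of this paper is easy to state in a few lines: extract each diagonal block, invert it with Gauss-Jordan elimination, and apply the inverses block by block. Below is a minimal NumPy sketch of that pattern, assuming a dense matrix A and caller-supplied block sizes; it uses explicit partial pivoting rather than the paper's implicit GPU pivoting, and all function names are illustrative, not the authors' CUDA kernels.

        import numpy as np

        def gauss_jordan_inverse(A):
            # Invert a small dense matrix via Gauss-Jordan elimination on the
            # augmented system [A | I]; explicit row swaps stand in for the
            # paper's implicit (data-stationary) pivoting strategy.
            n = A.shape[0]
            aug = np.hstack([A.astype(float), np.eye(n)])
            for k in range(n):
                p = k + np.argmax(np.abs(aug[k:, k]))   # partial pivot
                aug[[k, p]] = aug[[p, k]]
                aug[k] /= aug[k, k]
                for i in range(n):
                    if i != k:
                        aug[i] -= aug[i, k] * aug[k]    # eliminate column k
            return aug[:, n:]

        def block_jacobi_setup(A, block_sizes):
            # "Variable-size batch": each diagonal block is extracted and
            # inverted independently (on the GPU, one batch entry per block).
            inverses, start = [], 0
            for bs in block_sizes:
                inverses.append(gauss_jordan_inverse(A[start:start+bs, start:start+bs]))
                start += bs
            return inverses

        def apply_block_jacobi(inverses, block_sizes, r):
            # The batched matrix-vector step: z = M^{-1} r, block by block.
            z, start = np.empty_like(r, dtype=float), 0
            for inv, bs in zip(inverses, block_sizes):
                z[start:start+bs] = inv @ r[start:start+bs]
                start += bs
            return z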

    Variable-Size Batched Condition Number Calculation on GPUs


    Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

    © ACM, 2021. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, Volume 47, Issue 2, June 2021, http://doi.acm.org/10.1145/3441850
    [EN] The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, hopefully without impacting the algorithm output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
    H. Anzt and T. Cojean were supported by the "Impuls und Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 "OPRECOMP". This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors want to acknowledge the access to the Piz Daint supercomputer at the Swiss National Supercomputing Centre (CSCS) granted under the project #d100 and the Summit supercomputer at the Oak Ridge National Lab (ORNL).
    Flegar, G.; Anzt, H.; Cojean, T.; Quintana-Ortí, E. S. (2021). Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software. ACM Transactions on Mathematical Software. 47(2):1-28. https://doi.org/10.1145/3441850
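
    The key idea, selecting a storage format per block from the block's numerical properties, can be sketched compactly. The rule below, which compares each format's unit roundoff against the block's condition number with an illustrative tolerance tau, is a simplification of the paper's criterion; NumPy's float16/float32/float64 stand in for Ginkgo's richer set of IEEE and custom formats, and all names are made up for illustration.

        import numpy as np

        # (storage format, unit roundoff) pairs, cheapest first
        FORMATS = [(np.float16, 2.0**-11), (np.float32, 2.0**-24),
                   (np.float64, 2.0**-53)]

        def choose_format(block, tau=1e-3):
            # Pick the cheapest format whose unit roundoff u keeps
            # u * cond(block) below the tolerance tau (illustrative rule).
            cond = np.linalg.cond(block, 1)
            for dtype, u in FORMATS:
                if u * cond <= tau:
                    return dtype
            return np.float64

        def store_adaptive(block_inverses):
            # Down-cast each inverted diagonal block independently; the
            # preconditioner application up-casts back to float64 on the fly.
            return [inv.astype(choose_format(inv)) for inv in block_inverses]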

    Machine learning-aided numerical linear Algebra: Convolutional neural networks for the efficient preconditioner generation


    Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers

    This is the peer reviewed version of the following article: Anzt, H.; Dongarra, J.; Flegar, G.; Higham, N. J.; Quintana-Ortí, E. S. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency Computat Pract Exper. 2019; 31:e4460, which has been published in final form at https://doi.org/10.1002/cpe.4460. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.
    [EN] We propose an adaptive scheme to reduce communication overhead caused by data movement by selectively storing the diagonal blocks of a block-Jacobi preconditioner in different precision formats (half, single, or double). This specialized preconditioner can then be combined with any Krylov subspace method for the solution of sparse linear systems to perform all arithmetic in double precision. We assess the effects of the adaptive precision preconditioner on the iteration count and data transfer cost of a preconditioned conjugate gradient solver. A preconditioned conjugate gradient method is, in general, a memory bandwidth-bound algorithm, and therefore its execution time and energy consumption are largely dominated by the costs of accessing the problem's data in memory. Given this observation, we propose a model that quantifies the time and energy savings of our approach based on the assumption that these two costs depend linearly on the bit length of a floating point number. Furthermore, we use a number of test problems from the SuiteSparse matrix collection to estimate the potential benefits of the adaptive block-Jacobi preconditioning scheme.
    Impuls und Vernetzungsfond of the Helmholtz Association, Grant/Award Number: VH-NG-1241; MINECO and FEDER, Grant/Award Number: TIN2014-53495-R; H2020 EU FETHPC Project, Grant/Award Number: 732631; MathWorks; Engineering and Physical Sciences Research Council, Grant/Award Number: EP/P020720/1; Exascale Computing Project, Grant/Award Number: 17-SC-20-SC
    Anzt, H.; Dongarra, J.; Flegar, G.; Higham, N. J.; Quintana Ortí, E. S. (2019). Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency and Computation: Practice and Experience. 31(6):1-12. https://doi.org/10.1002/cpe.4460
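
    The savings model in the abstract assumes memory access time and energy grow linearly with the stored bit length, so the predicted benefit reduces to a bit-volume ratio. A tiny sketch of that calculation follows; the function name and the example block configuration are made up for illustration.

        # Sketch of the paper's cost model: access time/energy scale linearly
        # with bit length, so the adaptive preconditioner's savings are the
        # bit-volume ratio of its blocks versus an all-double baseline.
        def relative_traffic(block_sizes, block_bits, baseline_bits=64):
            # Return the adaptive/baseline data-volume ratio for the blocks;
            # e.g. 0.5 means half the transfer time/energy under the model.
            adaptive = sum(n * n * b for n, b in zip(block_sizes, block_bits))
            baseline = sum(n * n * baseline_bits for n in block_sizes)
            return adaptive / baseline

        # Example: blocks of order 4, 8, 4 stored in half/single/double
        print(relative_traffic([4, 8, 4], [16, 32, 64]))  # -> ~0.54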

    Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing

    © ACM, 2022. This is the author's version of the work "Anzt, H., Cojean, T., Flegar, G., Göbel, F., Grützmacher, T., Nayak, P., ... & Quintana-Ortí, E. S. (2022). Ginkgo: A modern linear operator algebra framework for high performance computing. ACM Transactions on Mathematical Software (TOMS), 48(1), 1-33". It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, Volume 48, Issue 1, March 2022, http://doi.acm.org/10.1145/3480935
    [EN] In this article, we present Ginkgo, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, Ginkgo's design principle abstracts all functionality as "linear operators," motivating the notation of a "linear operator algebra library." Ginkgo's current focus is oriented toward providing sparse linear algebra functionality for high performance graphics processing unit (GPU) architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific backends and provide details on extensibility and sustainability measures. We also demonstrate Ginkgo's usability by providing examples on how to use its functionality inside the MFEM and deal.II finite element ecosystems. Finally, we offer a practical demonstration of Ginkgo's high performance on state-of-the-art GPU architectures.
    This work was supported by the "Impuls und Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 "OPRECOMP". This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The experiments on the NVIDIA A100 GPU were performed on the HAICORE@KIT partition, funded by the "Impuls und Vernetzungsfond" of the Helmholtz Association. The experiments on the AMD MI100 GPU were performed on Tulip, an early-access platform hosted by HPE.
    Anzt, H.; Cojean, T.; Flegar, G.; Göbel, F.; Grützmacher, T.; Nayak, P.; Ribizel, T.... (2022). Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing. ACM Transactions on Mathematical Software. 48(1):1-33. https://doi.org/10.1145/3480935
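
    The "linear operator" design principle can be illustrated in a few lines: matrices, preconditioners, and complete solvers all expose the same apply() interface and therefore compose freely. Ginkgo itself is C++ with architecture-specific backends; the Python class and method names below are illustrative and do not mirror Ginkgo's actual API.

        import numpy as np

        class LinOp:
            # Everything is a linear operator: y = op(x)
            def apply(self, x):
                raise NotImplementedError

        class MatrixOp(LinOp):
            def __init__(self, A): self.A = np.asarray(A, dtype=float)
            def apply(self, x): return self.A @ x

        class JacobiOp(LinOp):
            # A (scalar) Jacobi preconditioner is just another operator.
            def __init__(self, A): self.dinv = 1.0 / np.diag(A)
            def apply(self, x): return self.dinv * x

        class RichardsonSolver(LinOp):
            # A whole solver is itself an operator: apply(b) ~= A^{-1} b.
            def __init__(self, A, M, iters=50):
                self.A, self.M, self.iters = A, M, iters
            def apply(self, b):
                x = np.zeros_like(b, dtype=float)
                for _ in range(self.iters):
                    x = x + self.M.apply(b - self.A.apply(x))
                return x

        A = np.array([[4.0, 1.0], [1.0, 3.0]])
        solver = RichardsonSolver(MatrixOp(A), JacobiOp(A))
        print(solver.apply(np.array([1.0, 2.0])))  # approximates A^{-1} b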

    Using Gauss - Jordan elimination method with The Application of Android for Solving Linear Equations

    Problems involving mathematical models appear in many scientific disciplines. Complex mathematical models sometimes cannot be solved analytically using standard algebraic formulas. Computers play a major role in the development of numerical methods because numerical calculation consists of arithmetic operations that are very numerous and repetitive, so manual calculation is tedious and error-prone. This study aims to develop software for solving systems of linear equations by implementing the Gauss-Jordan elimination (GJ-elimination) method. The software is built through five stages: (1) system modeling, (2) simplification of the model, (3) numerical methods and algorithms, (4) programming using Android Studio, and (5) program simulation. With respect to content, the result is software that students and lecturers can use to apply numerical methods, since it includes instructions for using the application and the steps for solving linear-equation problems with the GJ-elimination method.
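
    For reference, stage (3) of the pipeline, the numerical method itself, reduces to row operations on an augmented matrix. A minimal sketch of Gauss-Jordan elimination for a linear system follows; the study's application is built in Android Studio, so Python is used here only for brevity, and the function name is made up.

        import numpy as np

        def gauss_jordan_solve(A, b):
            # Reduce the augmented matrix [A | b] to reduced row echelon form;
            # the last column then holds the solution of A x = b.
            A = np.asarray(A, dtype=float)
            aug = np.hstack([A, np.asarray(b, dtype=float).reshape(-1, 1)])
            n = len(aug)
            for k in range(n):
                p = k + np.argmax(np.abs(aug[k:, k]))  # partial pivoting
                aug[[k, p]] = aug[[p, k]]
                aug[k] /= aug[k, k]
                for i in range(n):
                    if i != k:
                        aug[i] -= aug[i, k] * aug[k]
            return aug[:, -1]

        # 2x + y = 5, x + 3y = 10  ->  x = 1, y = 3
        print(gauss_jordan_solve([[2, 1], [1, 3]], [5, 10]))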

    Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs

    In this paper, we target the parallel solution of sparse linear systems via an iterative Krylov subspace method enhanced with a block-Jacobi preconditioner on a cluster of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the preconditioned conjugate gradient method that improve the interoperability between the message-passing interface and OmpSs programming models. Specifically, we progressively integrate several communication-reduction and iteration-fusing strategies into the initial code, obtaining more efficient versions of the method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 32 nodes with 24 cores each. The experimental analysis shows that the techniques described in the paper outperform the classical method by a margin that varies between 6% and 48%, depending on the evaluation.
    This research was partially supported by the H2020 EU FETHPC Project 671602 "INTERTWinE." The researchers from Universidad Jaume I were sponsored by Project TIN2017-82972-R of the Spanish Ministerio de Economía y Competitividad. Maria Barreda was supported by the POSDOC-A/2017/11 project from the Universitat Jaume I.
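
    For orientation, here is a plain sequential sketch of the preconditioned conjugate gradient iteration that the paper parallelizes; the MPI + OmpSs task decomposition, communication reduction, and iteration fusing are exactly what is not shown. The small block-Jacobi example at the end is made up for illustration.

        import numpy as np

        def pcg(A, b, apply_Minv, tol=1e-10, maxiter=1000):
            # Standard preconditioned conjugate gradient for SPD A.
            x = np.zeros_like(b, dtype=float)
            r = b - A @ x
            z = apply_Minv(r)
            p = z.copy()
            rz = r @ z
            for _ in range(maxiter):
                Ap = A @ p
                alpha = rz / (p @ Ap)
                x += alpha * p
                r -= alpha * Ap
                if np.linalg.norm(r) < tol:
                    break
                z = apply_Minv(r)
                rz_new = r @ z
                p = z + (rz_new / rz) * p
                rz = rz_new
            return x

        # Block-Jacobi M^{-1}: invert the 2x2 diagonal blocks of a small SPD matrix
        A = np.diag([4.0, 5.0, 6.0, 7.0]) + 0.5 * np.eye(4, k=1) + 0.5 * np.eye(4, k=-1)
        blocks = [np.linalg.inv(A[i:i+2, i:i+2]) for i in (0, 2)]
        Minv = lambda r: np.concatenate([blocks[0] @ r[:2], blocks[1] @ r[2:]])
        print(pcg(A, np.ones(4), Minv))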

    A Distributed Approach to solve Power Flow problems in new emerging scenarios

    Distributed computing is attracting growing interest in many applied fields of scientific research. Power system operation is becoming increasingly complex due to the integration of Distributed Energy Resources (DERs) at various voltage levels. In this context, automating grid operation is ever more fundamental in order to ensure adequate levels of reliability, flexibility, and cost effectiveness of power systems. This report is intended to support the understanding of the methodological aspects and principles behind solving the power flow equations through a distributed approach, in a context where multiple interacting entities share a portion of their grids and want to align their computations in an automated way. The aim is to give the reader a comprehensive overview of the software used for the implementation, the Portable, Extensible Toolkit for Scientific Computation (PETSc), and of the principles followed to build the Distributed Power Flow Solver, as well as the specific features that distinguish it from other distributed solvers available in the literature. Additionally, two frameworks are presented as potential applications for the model: the European transmission network level, in the context of capacity calculation, and the coupling of transmission and distribution networks. The report opens with a short literature review of both frameworks.
    JRC.C.3-Energy Security, Distribution and Market
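
    As a point of reference for the equations involved, the sketch below solves a DC (linearized) power flow on a three-bus toy network, where active-power balance reduces to the linear system B' θ = P. This is only a single-machine illustration with made-up susceptances and injections; the report's solver is a distributed solver of the full power flow problem built on PETSc.

        import numpy as np

        # 3-bus example: line susceptances between buses (1/reactance, p.u.)
        b12, b13, b23 = 10.0, 5.0, 8.0
        # Reduced susceptance matrix B' over the non-slack buses 2 and 3:
        # diagonal = sum of incident line susceptances, off-diagonal = -b_ij
        Bred = np.array([[b12 + b23, -b23],
                         [-b23, b13 + b23]])
        P = np.array([-0.6, -0.4])        # net injections at buses 2, 3 (p.u.)
        theta = np.linalg.solve(Bred, P)  # voltage angles (slack angle = 0)
        flow_12 = b12 * (0.0 - theta[0])  # active power on line 1-2
        print(theta, flow_12)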

    Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field

    Plane-wave decomposition by sparse recovery is a reliable and accurate technique that can be used for source localization, beamforming, and similar tasks. In this work, we introduce techniques to accelerate plane-wave decomposition by sparse recovery. The method consists of two main algorithms: the spherical Fourier transformation (SFT) and sparse recovery. Of the two, sparse recovery is the more computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform, so that the multithreaded platform can be fully dedicated to the sparse recovery. Implementing the SFT on an FPGA also helps to flexibly integrate the microphones and improves the portability of the microphone array. For the FPGA implementation of the SFT, we develop a scalable design model that enables the quick design of SFT architectures on FPGAs. The model takes the number of microphones, the number of SFT channels, and the cost of the FPGA into account, and outputs the design of a resource-optimized, cost-effective FPGA architecture. We then investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of the dictionary size on the computational performance and the accuracy of the sparse recovery algorithms, and introduce novel sparse-recovery techniques that use non-uniform dictionaries to improve the performance of sparse recovery on a parallel architecture.
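
    To make the sparse-recovery step concrete, the sketch below recovers two plane waves impinging on a uniform line array using ISTA (iterative soft thresholding) over a dictionary of candidate steering vectors. ISTA here merely stands in for whichever sparse solver the work accelerates, and the array geometry, dictionary size, and regularization weight are all made up.

        import numpy as np

        m, n = 16, 18                        # microphones, candidate directions
        angles = np.linspace(0.0, np.pi, n)  # candidate arrival angles
        pos = np.arange(m) * 0.5             # sensor positions (in wavelengths)
        # Dictionary of unit-norm plane-wave steering vectors
        D = np.exp(2j * np.pi * pos[:, None] * np.cos(angles)[None, :]) / np.sqrt(m)

        y = D[:, 4] * 1.0 + D[:, 12] * 0.7   # field: two plane waves (atoms 4, 12)

        def ista(D, y, lam=0.05, iters=500):
            # Iterative soft thresholding for min 0.5*||D x - y||^2 + lam*||x||_1
            L = np.linalg.norm(D, 2) ** 2    # Lipschitz constant of the gradient
            x = np.zeros(D.shape[1], dtype=complex)
            for _ in range(iters):
                g = x - (D.conj().T @ (D @ x - y)) / L          # gradient step
                mag = np.abs(g)
                shrink = np.maximum(1.0 - lam / (L * mag + 1e-30), 0.0)
                x = shrink * g                                  # complex soft threshold
            return x

        x = ista(D, y)
        # The dominant coefficients mark the estimated arrival directions
        print(np.argsort(np.abs(x))[-2:])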