9 research outputs found

    Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors

    [EN] In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix-vector multiplication kernel that transforms the linear systems' right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA's K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the cost of the preconditioner setup and application can be somewhat offset by the faster convergence of the iterative solver. (C) 2018 Elsevier B.V. All rights reserved.

    This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Number DE-SC-0010042. H. Anzt was supported by the "Impuls and Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2014-53495-R of the MINECO-FEDER; and project OPRECOMP (http://oprecomp.eu) with the financial support of the Future and Emerging Technologies (FET) programme within the European Union's Horizon 2020 research and innovation programme, under grant agreement No 732631. The authors would also like to acknowledge the Swiss National Supercomputing Centre (CSCS) for granting computing resources in the Small Development Project entitled "Energy-Efficient preconditioning for iterative linear solvers" (#d65).

    Anzt, H.; Dongarra, J.; Flegar, G.; Quintana-Ortí, ES. (2019). Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors. Parallel Computing. 81:131-146. https://doi.org/10.1016/j.parco.2017.12.006
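    The computational core of the entry above is Gauss-Jordan elimination (GJE) applied independently to each small diagonal block. The following is a minimal host-side C++ reference sketch of that building block only: the paper's GPU kernel instead keeps each block in registers, uses warp-local communication, and replaces the explicit row swaps shown here with implicit pivoting. The function name and data layout are ours, not the authors'.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Invert one n x n block (row-major in A) by Gauss-Jordan elimination on the
// augmented system [A | I] -> [I | A^{-1}]. Returns false for a singular block.
bool invert_block_gje(std::vector<double>& A, std::size_t n) {
    const std::size_t w = 2 * n;           // width of the augmented matrix
    std::vector<double> aug(n * w, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) aug[i * w + j] = A[i * n + j];
        aug[i * w + n + i] = 1.0;          // identity in the right half
    }
    for (std::size_t k = 0; k < n; ++k) {
        // Partial pivoting: pick the largest remaining entry in column k.
        // (The paper avoids these row swaps altogether via implicit pivoting.)
        std::size_t piv = k;
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(aug[i * w + k]) > std::fabs(aug[piv * w + k])) piv = i;
        if (aug[piv * w + k] == 0.0) return false;
        if (piv != k)
            for (std::size_t j = 0; j < w; ++j)
                std::swap(aug[k * w + j], aug[piv * w + j]);
        const double d = aug[k * w + k];
        for (std::size_t j = 0; j < w; ++j) aug[k * w + j] /= d;
        // Eliminate column k from *all* other rows: this is what distinguishes
        // Gauss-Jordan from Gaussian elimination and removes the need for a
        // separate backward substitution.
        for (std::size_t i = 0; i < n; ++i) {
            if (i == k) continue;
            const double f = aug[i * w + k];
            for (std::size_t j = 0; j < w; ++j) aug[i * w + j] -= f * aug[k * w + j];
        }
    }
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) A[i * n + j] = aug[i * w + n + j];
    return true;
}
```

    Because every pivot step updates all rows in lock-step, GJE maps naturally onto the data-parallel execution model of a GPU warp, which is one reason it is attractive here compared with LU-based inversion.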

    Machine learning-aided numerical linear algebra: Convolutional neural networks for the efficient preconditioner generation


    Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

    © ACM, 2021. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, Volume 47, Issue 2, June 2021, http://doi.acm.org/10.1145/3441850

    [EN] The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing the data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, hopefully without impacting the algorithm output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.

    H. Anzt and T. Cojean were supported by the "Impuls und Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 "OPRECOMP". This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors want to acknowledge the access to the Piz Daint supercomputer at the Swiss National Supercomputing Centre (CSCS) granted under the project #d100 and the Summit supercomputer at the Oak Ridge National Lab (ORNL).

    Flegar, G.; Anzt, H.; Cojean, T.; Quintana-Ortí, ES. (2021). Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software. ACM Transactions on Mathematical Software. 47(2):1-28. https://doi.org/10.1145/3441850
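    As a rough picture of the on-the-fly format selection described above, one can imagine a rule that stores each block in the cheapest format whose unit roundoff, scaled by the block's conditioning, stays below a threshold. The C++ sketch below is only an illustration of that idea: the threshold tau and the estimator feeding cond_estimate are placeholders, and the production rule in Ginkgo additionally covers the customized exponent/significand formats mentioned in the abstract.

```cpp
// Hypothetical per-block precision selector for an adaptive block-Jacobi
// preconditioner. Not Ginkgo's actual API or decision rule.
enum class StoragePrecision { kDouble, kSingle, kHalf };

StoragePrecision select_block_precision(double cond_estimate,
                                        double tau = 1e-2) {
    constexpr double u_half   = 4.88e-4;  // unit roundoff of IEEE fp16 (2^-11)
    constexpr double u_single = 5.96e-8;  // unit roundoff of IEEE fp32 (2^-24)
    if (cond_estimate * u_half <= tau) return StoragePrecision::kHalf;
    if (cond_estimate * u_single <= tau) return StoragePrecision::kSingle;
    return StoragePrecision::kDouble;     // keep the working precision
}
```

    The key point is that the decision is local: a well-conditioned block tolerates aggressive compression, while an ill-conditioned block keeps full precision, so the preconditioner quality degrades gracefully rather than uniformly.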

    Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers

    This is the peer reviewed version of the following article: Anzt, H, Dongarra, J, Flegar, G, Higham, NJ, Quintana-Ortí, ES. Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency Computat Pract Exper. 2019; 31:e4460, which has been published in final form at https://doi.org/10.1002/cpe.4460. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.

    [EN] We propose an adaptive scheme to reduce the communication overhead caused by data movement by selectively storing the diagonal blocks of a block-Jacobi preconditioner in different precision formats (half, single, or double). This specialized preconditioner can then be combined with any Krylov subspace method for the solution of sparse linear systems to perform all arithmetic in double precision. We assess the effects of the adaptive precision preconditioner on the iteration count and data transfer cost of a preconditioned conjugate gradient solver. A preconditioned conjugate gradient method is, in general, a memory bandwidth-bound algorithm, and therefore its execution time and energy consumption are largely dominated by the cost of accessing the problem's data in memory. Given this observation, we propose a model that quantifies the time and energy savings of our approach based on the assumption that these two costs depend linearly on the bit length of a floating point number. Furthermore, we use a number of test problems from the SuiteSparse matrix collection to estimate the potential benefits of the adaptive block-Jacobi preconditioning scheme.

    Impuls und Vernetzungsfond of the Helmholtz Association, Grant/Award Number: VH-NG-1241; MINECO and FEDER, Grant/Award Number: TIN2014-53495-R; H2020 EU FETHPC Project, Grant/Award Number: 732631; MathWorks; Engineering and Physical Sciences Research Council, Grant/Award Number: EP/P020720/1; Exascale Computing Project, Grant/Award Number: 17-SC-20-SC

    Anzt, H.; Dongarra, J.; Flegar, G.; Higham, NJ.; Quintana-Ortí, ES. (2019). Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency and Computation Practice and Experience. 31(6):1-12. https://doi.org/10.1002/cpe.4460
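    The time and energy model mentioned above assumes both costs are linear in the bit length of a floating point value. A hedged reading of that assumption as a worked formula follows; the symbols k(b) (iteration count), V (reduced-precision values transferred per iteration), and the per-bit constants c_t, c_e are illustrative rather than the paper's notation, and traffic whose precision is not reduced would add a constant term omitted here.

```latex
% Transfer time and energy as functions of the storage bit length b:
T(b) = k(b)\, V\, c_t\, b, \qquad E(b) = k(b)\, V\, c_e\, b,
% so the predicted saving relative to full double precision (b = 64) is
\frac{T(b)}{T(64)} = \frac{k(b)}{k(64)} \cdot \frac{b}{64}.
```

    Under this model, storing a block in half precision (b = 16) pays off whenever the induced growth in iteration count stays below a factor of four.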

    Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing

    © ACM, 2022. This is the author's version of the work "Anzt, H., Cojean, T., Flegar, G., Göbel, F., Grützmacher, T., Nayak, P., ... & Quintana-Ortí, E. S. (2022). Ginkgo: A modern linear operator algebra framework for high performance computing. ACM Transactions on Mathematical Software (TOMS), 48(1), 1-33". It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, Volume 48, Issue 1, March 2022, http://doi.acm.org/10.1145/3480935

    [EN] In this article, we present GINKGO, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, GINKGO's design principle abstracts all functionality as "linear operators," motivating the notion of a "linear operator algebra library." GINKGO's current focus is oriented toward providing sparse linear algebra functionality for high performance graphics processing unit (GPU) architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific backends and provide details on extensibility and sustainability measures. We also demonstrate GINKGO's usability by providing examples on how to use its functionality inside the MFEM and deal.II finite element ecosystems. Finally, we offer a practical demonstration of GINKGO's high performance on state-of-the-art GPU architectures.

    This work was supported by the "Impuls und Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 "OPRECOMP". This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The experiments on the NVIDIA A100 GPU were performed on the HAICORE@KIT partition, funded by the "Impuls und Vernetzungsfond" of the Helmholtz Association. The experiments on the AMD MI100 GPU were performed on Tulip, an early-access platform hosted by HPE.

    Anzt, H.; Cojean, T.; Flegar, G.; Göbel, F.; Grützmacher, T.; Nayak, P.; Ribizel, T.... (2022). Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing. ACM Transactions on Mathematical Software. 48(1):1-33. https://doi.org/10.1145/3480935
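    To make the "linear operator algebra" idea concrete, the mock-up below shows how a matrix and a solver can share one apply(b, x) interface, so a solver is itself an operator that can be composed with others or used as a preconditioner. All class and function names are invented for this sketch; GINKGO's actual hierarchy (its LinOp base class and executor model) is richer and, as described above, separates core algorithms from architecture-specific backends.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Everything is a linear operator: x = Op(b).
struct LinOp {
    virtual void apply(const Vec& b, Vec& x) const = 0;
    virtual ~LinOp() = default;
};

// A sparse matrix is a linear operator whose apply() is an SpMV.
struct CsrMatrix : LinOp {
    std::vector<int> row_ptr, col_idx;
    std::vector<double> val;
    void apply(const Vec& b, Vec& x) const override {
        for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i) {
            double s = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                s += val[k] * b[col_idx[k]];
            x[i] = s;
        }
    }
};

// A solver is also a linear operator: apply() maps a right-hand side b to an
// approximate solution of Ax = b, so solvers compose like any other operator.
struct RichardsonSolver : LinOp {
    const LinOp& A;
    double omega;
    int iters;
    RichardsonSolver(const LinOp& a, double w, int n) : A(a), omega(w), iters(n) {}
    void apply(const Vec& b, Vec& x) const override {
        Vec r(b.size());
        for (int it = 0; it < iters; ++it) {
            A.apply(x, r);                      // r = A x
            for (std::size_t i = 0; i < b.size(); ++i)
                x[i] += omega * (b[i] - r[i]);  // x += omega (b - A x)
        }
    }
};
```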

    Toward a modular precision ecosystem for high performance computing

    [EN] With the memory bandwidth of current computer architectures being significantly slower than the (floating point) arithmetic performance, many scientific computations only leverage a fraction of the computational power in today's high-performance architectures. At the same time, memory operations are the primary energy consumer of modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a "modular precision ecosystem" that allows for more flexibility in terms of customized data access. For memory-bounded scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm, two applications that are popular in their communities and at the same time characteristic representatives of the fields of numerical linear algebra and data analytics, respectively.

    The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Impuls und Vernetzungsfond of the Helmholtz Association under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 "OPRECOMP".

    Anzt, H.; Flegar, G.; Gruetzmacher, T.; Quintana-Ortí, ES. (2019). Toward a modular precision ecosystem for high performance computing. International Journal of High Performance Computing Applications. 33(6):1069-1078. https://doi.org/10.1177/1094342019846547
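    A minimal sketch of the storage/processing decoupling advocated above: operands live in memory in a compact storage format and are promoted to the working precision only in registers, during the computation. The IEEE fp32 storage and the dot-product kernel are stand-ins of our choosing; the article's ecosystem also targets non-IEEE and segmented storage formats.

```cpp
#include <cstddef>
#include <vector>

// Dot product with decoupled formats: fp32 storage, fp64 arithmetic. Each
// load converts the compact value to the work format, roughly halving the
// memory traffic of a bandwidth-bound kernel relative to fp64 storage.
double dot_reduced_storage(const std::vector<float>& x,
                           const std::vector<float>& y) {
    double sum = 0.0;  // all arithmetic in the full working precision
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += static_cast<double>(x[i]) * static_cast<double>(y[i]);
    return sum;
}
```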

    Memory-friendly fixed-point iteration method for nonlinear surface mode oscillations of acoustically driven bubbles: from the perspective of high-performance GPU programming

    A fixed-point iteration technique is presented to handle the implicit nature of the governing equations of nonlinear surface mode oscillations of acoustically excited microbubbles. The model is adopted from the theoretical work of Shaw [1], where the dynamics of the mean bubble radius and the surface modes are bi-directionally coupled via nonlinear terms. The model comprises a set of second-order ordinary differential equations. It extends the classic Keller–Miksis equation and the linearized dynamical equations for each surface mode. Only the implicit parts (containing the second derivatives) are reevaluated during the iteration process. The performance of the technique is tested at various parameter combinations. The majority of the test cases need only a single reevaluation to reach an error of 10^-9. Although its arithmetic operation count is higher than that of Gaussian elimination, its memory-friendly, matrix-free nature makes it a viable alternative for high-performance GPU computations in massive parameter studies.
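    As a sketch of the described technique under assumed notation: the governing equations deliver the second derivatives only implicitly, a'' = g(t, a, a', a''), and the fixed-point loop reevaluates only g until successive iterates agree. In C++, this might look as follows; g, the state layout, and the stopping rule are placeholders rather than the paper's exact formulation.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using State = std::vector<double>;
// Implicit right-hand side: returns a new estimate of the second derivatives
// given the current state (a, a') and the previous estimate of a''.
using ImplicitRhs = State (*)(double t, const State& a, const State& da,
                              const State& dda);

// Matrix-free fixed-point solve for the second derivatives: iterate
// dda <- g(t, a, da, dda) instead of assembling and solving a linear system
// by Gaussian elimination. No system matrix is ever formed or stored.
State solve_second_derivatives(ImplicitRhs g, double t, const State& a,
                               const State& da, State dda,
                               double tol = 1e-9, int max_iter = 20) {
    for (int k = 0; k < max_iter; ++k) {
        State next = g(t, a, da, dda);  // reevaluate only the implicit part
        double err = 0.0;
        for (std::size_t i = 0; i < next.size(); ++i)
            err = std::max(err, std::fabs(next[i] - dda[i]));
        dda = std::move(next);
        if (err < tol) break;  // per the paper, often one reevaluation suffices
    }
    return dda;
}
```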