26 research outputs found

    Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors

    [EN] In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE), along with a variable-size batched matrix-vector multiplication kernel that transforms the linear systems' right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication available in newer GPU architectures. Moreover, in the matrix inversion we employ an implicit pivoting strategy that migrates the workload (i.e., the operations) to where the data resides, instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA's K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the cost of preconditioner setup and application can be somewhat offset by the faster convergence of the iterative solver. (C) 2018 Elsevier B.V. All rights reserved.

    This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Number DE-SC-0010042. H. Anzt was supported by the "Impuls und Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2014-53495-R of the MINECO-FEDER, and by project OPRECOMP (http://oprecomp.eu) with the financial support of the Future and Emerging Technologies (FET) programme within the European Union's Horizon 2020 research and innovation programme, under grant agreement No. 732631. The authors also acknowledge the Swiss National Supercomputing Centre (CSCS) for granting computing resources in the Small Development Project "Energy-Efficient Preconditioning for Iterative Linear Solvers" (#d65).

    Anzt, H.; Dongarra, J.; Flegar, G.; Quintana-Ortí, E. S. (2019). Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors. Parallel Computing. 81:131-146. https://doi.org/10.1016/j.parco.2017.12.006
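    To make the per-block computation concrete, below is a minimal sequential sketch of in-place Gauss-Jordan inversion of one small diagonal block (pivoting omitted for brevity). It only illustrates the arithmetic pattern: the kernel described in the paper batches many such variable-size inversions, keeps each block in registers, and replaces explicit row swaps with the implicit pivoting strategy mentioned in the abstract.

```cpp
#include <vector>

// In-place Gauss-Jordan inversion of an n x n block stored row-major in A.
// Sketch only: no pivoting, so nonzero pivots are assumed; the paper's GPU
// kernel adds implicit pivoting and processes whole batches of blocks.
void gauss_jordan_invert(std::vector<double>& A, int n) {
    for (int k = 0; k < n; ++k) {
        const double piv = A[k * n + k];
        A[k * n + k] = 1.0;
        for (int j = 0; j < n; ++j) A[k * n + j] /= piv;  // scale pivot row
        for (int i = 0; i < n; ++i) {                     // eliminate column k
            if (i == k) continue;
            const double f = A[i * n + k];
            A[i * n + k] = 0.0;
            for (int j = 0; j < n; ++j) A[i * n + j] -= f * A[k * n + j];
        }
    }
}
```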

    Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

    © ACM, 2021. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, Volume 47, Issue 2, June 2021, http://doi.acm.org/10.1145/3441850

    [EN] The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts to carefully reduce the working precision in order to speed up computations. For algorithms whose performance is bound by memory bandwidth, the idea of compressing the data before (and after) memory accesses has received considerable attention. One approach is to store an approximate operator, such as a preconditioner, in lower than working precision, ideally without impacting the algorithm's output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats that tailor the lengths of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.

    H. Anzt and T. Cojean were supported by the "Impuls und Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 "OPRECOMP". This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors acknowledge access to the Piz Daint supercomputer at the Swiss National Supercomputing Centre (CSCS), granted under project #d100, and to the Summit supercomputer at Oak Ridge National Laboratory (ORNL).

    Flegar, G.; Anzt, H.; Cojean, T.; Quintana-Ortí, E. S. (2021). Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software. ACM Transactions on Mathematical Software. 47(2):1-28. https://doi.org/10.1145/3441850
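    The on-the-fly format selection can be illustrated with a toy decision rule: store each block in the most compact format whose unit roundoff, amplified by the block's conditioning, still meets the accuracy target. The sketch below is a deliberate simplification; the paper derives the actual criterion used in Ginkgo and also covers the custom non-IEEE formats mentioned in the abstract.

```cpp
#include <limits>

enum class storage_precision { fp64, fp32, fp16 };

// Toy selection rule (not Ginkgo's exact criterion): pick the cheapest
// storage format whose unit roundoff u, amplified by the block's condition
// number estimate, stays below the accuracy target tau.
storage_precision select_block_precision(double block_cond, double tau) {
    const double u16 = 4.883e-04;                                  // IEEE half
    const double u32 = std::numeric_limits<float>::epsilon() / 2;  // IEEE single
    if (u16 * block_cond < tau) return storage_precision::fp16;
    if (u32 * block_cond < tau) return storage_precision::fp32;
    return storage_precision::fp64;
}
```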

    Toward a modular precision ecosystem for high performance computing

    [EN] With the memory bandwidth of current computer architectures being significantly slower than the (floating point) arithmetic performance, many scientific computations only leverage a fraction of the computational power of today's high-performance architectures. At the same time, memory operations are the primary energy consumer in modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a "modular precision ecosystem" that allows for more flexibility in terms of customized data access. For memory-bound scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm, two applications that are popular in their respective communities and, at the same time, characteristic representatives of the fields of numerical linear algebra and data analytics.

    This work was supported by the "Impuls und Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 "OPRECOMP".

    Anzt, H.; Flegar, G.; Gruetzmacher, T.; Quintana-Ortí, E. S. (2019). Toward a modular precision ecosystem for high performance computing. International Journal of High Performance Computing Applications. 33(6):1069-1078. https://doi.org/10.1177/1094342019846547
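    As a minimal illustration of decoupling the storage format from the processing format, the sketch below holds vector data as 32-bit floats (halving the memory traffic of a dot product) while performing all arithmetic in 64-bit; a full modular precision ecosystem, as advocated above, would additionally adapt the storage format at runtime.

```cpp
#include <cstddef>
#include <vector>

// Storage format: float (what travels through the memory hierarchy).
// Processing format: double (what the arithmetic units operate on).
double dot_low_storage(const std::vector<float>& x, const std::vector<float>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += static_cast<double>(x[i]) * static_cast<double>(y[i]);
    return sum;
}
```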

    Acceleration of PageRank with customized precision based on mantissa segmentation

    [EN] We describe the application of a communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges. Our variable-precision strategy, using a customized precision format based on mantissa segmentation (CPMS), abandons the IEEE 754 single- and double-precision number representation formats employed in the standard implementation of PageRank and instead handles the data in memory using a customized floating-point format. The customized format enables fast data access at different accuracies, prevents overflow/underflow by preserving the IEEE 754 double-precision exponent, and efficiently avoids data duplication, since all bits of the original IEEE 754 double-precision mantissa are preserved in memory, merely re-organized for efficient reduced-precision access. With this approach, both the truncated values (omitting significand bits) and the original IEEE double-precision values can be retrieved without duplicating the data in different formats. Our numerical experiments on an NVIDIA V100 GPU (Volta architecture) and a server equipped with two Intel Xeon Platinum 8168 CPUs (48 cores in total) show that, compared with a standard IEEE double-precision implementation, the CPMS-based PageRank completes about 10% faster if high-accuracy output is needed, and about 30% faster if reduced output accuracy is acceptable.

    H. Anzt was supported by the "Impuls und Vernetzungsfond" of the Helmholtz Association under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER. This work was also supported by the EU H2020 project 732631 "OPRECOMP. Open Transprecision Computing" and the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Numbers DE-SC0016513 and DE-SC-0010042.

    Gruetzmacher, T.; Cojean, T.; Flegar, G.; Anzt, H.; Quintana-Ortí, E. S. (2020). Acceleration of PageRank with customized precision based on mantissa segmentation. ACM Transactions on Parallel Computing. 7(1):1-19. https://doi.org/10.1145/3380934
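    The mechanism behind CPMS can be sketched with a two-segment split of an IEEE 754 double: the head segment keeps the sign, the full 11-bit exponent, and the top 20 mantissa bits, while the tail holds the remaining 32 mantissa bits. Reading only the heads yields a valid truncated value with the exponent intact; reading head plus tail recovers the original double exactly, with no bit stored twice. (Illustration only; the paper's implementation stores the segments in separate arrays so that reduced-precision sweeps touch less memory.)

```cpp
#include <cstdint>
#include <cstring>

// Split one double into a 32-bit head (sign + exponent + top mantissa bits)
// and a 32-bit tail (remaining mantissa bits).
void cpms_split(double v, uint32_t& head, uint32_t& tail) {
    uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    head = static_cast<uint32_t>(bits >> 32);
    tail = static_cast<uint32_t>(bits & 0xFFFFFFFFu);
}

// Reduced-precision access: only the head is read; low mantissa bits are 0.
double cpms_read_truncated(uint32_t head) {
    uint64_t bits = static_cast<uint64_t>(head) << 32;
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// Full-precision access: head + tail reconstruct the original double exactly.
double cpms_read_full(uint32_t head, uint32_t tail) {
    uint64_t bits = (static_cast<uint64_t>(head) << 32) | tail;
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}
```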

    The neural correlates of intimate partner violence in women

    Objective: To examine hippocampal volume and white matter tracts in women with and without exposure to intimate partner violence (IPV). Method: Nineteen women with IPV exposure in the last year and 21 women without IPV exposure in the last year underwent structural magnetic resonance imaging (MRI), including diffusion tensor imaging (DTI) sequences. Additional data on alcohol use and the presence of psychiatric disorder were collected. Differences in fractional anisotropy (FA) between the two groups were examined using a statistical model that included demographic measures, alcohol use and psychiatric disorder. Results: IPV subjects did not demonstrate significantly different hippocampal volumes compared to subjects without recent IPV. FA was, however, significantly reduced in the body of the corpus callosum of IPV subjects. Adjusting for age, alcohol use, smoking and psychiatric diagnosis did not change the significance of the result. Conclusion: Data on hippocampal volume in IPV are inconsistent, perhaps reflecting the fact that multiple factors influence this measure. Reduced FA in the body of the corpus callosum in IPV suggests altered integrity of this white matter tract; additional work is needed to address the underlying mechanisms and clinical correlates of this finding.

    Key words: Corpus callosum; Hippocampal volume; Intimate partner violence; Neuroimaging

    Communication in task-parallel ILU-preconditioned CG solvers using MPI + OmpSs

    We target the parallel solution of sparse linear systems via iterative Krylov subspace-based methods enhanced with incomplete LU (ILU)-type preconditioners on clusters of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the classical iteration for the CG method, accelerated via ILUPACK and ILU(0) preconditioners, using MPI + OmpSs. In addition, we integrate several communication-avoiding strategies into the codes, including the butterfly communication scheme and Eijkhout's formulation of the CG method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 16 nodes with 16 cores each.
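    For reference, the kernels being parallelized and their synchronization points can be seen in a textbook preconditioned CG iteration. The sketch below is sequential; the implementations in the paper distribute these operations over MPI ranks, express the node-local work as OmpSs tasks, and mitigate the two global dot products per iteration via the communication-avoiding variants mentioned above.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<void(const Vec&, Vec&)>;  // y = F(x)

// Textbook preconditioned CG: A applies the matrix, M_inv the preconditioner.
void pcg(const Op& A, const Op& M_inv, const Vec& b, Vec& x,
         int max_it, double tol) {
    const std::size_t n = b.size();
    Vec r(n), z(n), p(n), q(n);
    A(x, r);                                        // r = b - A*x
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - r[i];
    M_inv(r, z);                                    // z = M^{-1} r
    p = z;
    double rho = 0.0;                               // rho = r . z
    for (std::size_t i = 0; i < n; ++i) rho += r[i] * z[i];
    for (int it = 0; it < max_it; ++it) {
        A(p, q);                                    // q = A*p
        double pq = 0.0;                            // global dot product #1
        for (std::size_t i = 0; i < n; ++i) pq += p[i] * q[i];
        const double alpha = rho / pq;
        double rnorm = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * q[i];
            rnorm += r[i] * r[i];
        }
        if (std::sqrt(rnorm) < tol) break;
        M_inv(r, z);
        double rho_new = 0.0;                       // global dot product #2
        for (std::size_t i = 0; i < n; ++i) rho_new += r[i] * z[i];
        const double beta = rho_new / rho;
        rho = rho_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
}
```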

    History of Chemistry and Chemical Engineering: 150 Years of the Croatian Academy of Sciences and Arts and 125 Years of the Presence of Chemists in the Academy

    The foundation of the Yugoslav Academy of Sciences and Arts in 1861, and the role of bishop Josip Juraj Strossmayer in it, are described. The first sixteen full members of the Academy and all of its presidents to date are listed; only one chemist, Gustav Janeček, has so far served as president of the Academy. The changes in the Academy's structure and name since its foundation are discussed. The present name, Croatian Academy of Sciences and Arts, introduced in 1991, had already been in use from 1941 to 1945. All chemists who have been full members of the Academy are listed. The scientific journal Rad, published by the Academy since 1867, is briefly presented. Four papers by Dragutin Čeh, published in Rad from 1878 to 1882, are discussed, since they are the first papers reporting original research in chemistry in Croatia. All papers by Gustav Janeček published in Rad are also presented, and all of Janeček's doctoral students are listed together with the titles of their theses and the dates of their doctoral degrees.

    FloatX: A C++ Library for Customized Floating-Point Arithmetic

    © ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, Volume 45, Issue 4, 2019, https://dl.acm.org/doi/10.1145/3368086

    [EN] We present FloatX (Float eXtended), a C++ framework to investigate the effect of leveraging customized floating-point formats in numerical applications. FloatX formats are based on binary IEEE 754, with smaller significand and exponent bit counts specified by the user. Among other properties, FloatX facilitates an incremental transformation of the code, relies on hardware-supported floating-point types as a back-end to preserve efficiency, and incurs no storage overhead. The article discusses in detail the design principles, programming interface, and datatype casting rules behind FloatX. Furthermore, it demonstrates FloatX's usage and benefits via several case studies drawn from well-known numerical dense linear algebra libraries, such as BLAS and LAPACK; the Ginkgo library for sparse linear systems; and two neural network applications related to image processing and text recognition.

    This work was supported by the CICYT projects TIN2014-53495-R and TIN2017-82972-R of the MINECO and FEDER, and the EU H2020 project 732631 "OPRECOMP. Open Transprecision Computing".

    Flegar, G.; Scheidegger, F.; Novakovic, V.; Mariani, G.; Tomás Domínguez, A. E.; Malossi, C.; Quintana-Ortí, E. S. (2019). FloatX: A C++ Library for Customized Floating-Point Arithmetic. ACM Transactions on Mathematical Software. 45(4):1-23. https://doi.org/10.1145/3368086

    Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing

    © ACM, 2022. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, Volume 48, Issue 1, March 2022, http://doi.acm.org/10.1145/3480935

    [EN] In this article, we present GINKGO, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, GINKGO's design principle abstracts all functionality as "linear operators", motivating the notation of a "linear operator algebra library". GINKGO's current focus is oriented toward providing sparse linear algebra functionality for high performance graphics processing unit (GPU) architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific backends, and provide details on extensibility and sustainability measures. We also demonstrate GINKGO's usability by providing examples of how to use its functionality inside the MFEM and deal.II finite element ecosystems. Finally, we offer a practical demonstration of GINKGO's high performance on state-of-the-art GPU architectures.

    This work was supported by the "Impuls und Vernetzungsfond of the Helmholtz Association" under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 "OPRECOMP". This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The experiments on the NVIDIA A100 GPU were performed on the HAICORE@KIT partition, funded by the "Impuls und Vernetzungsfond" of the Helmholtz Association. The experiments on the AMD MI100 GPU were performed on Tulip, an early-access platform hosted by HPE.

    Anzt, H.; Cojean, T.; Flegar, G.; Göbel, F.; Grützmacher, T.; Nayak, P.; Ribizel, T.... (2022). Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing. ACM Transactions on Mathematical Software. 48(1):1-33. https://doi.org/10.1145/3480935
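    A minimal solver sketch in the spirit of the library's public examples (file names are placeholders, and the builder calls follow the Ginkgo documentation from memory, so treat the details as an approximation rather than canonical API): matrices, preconditioners, and generated solvers are all linear operators exposing apply().

```cpp
#include <fstream>
#include <ginkgo/ginkgo.hpp>

int main() {
    // Executors decouple where data lives from what the algorithms do;
    // swapping ReferenceExecutor for CudaExecutor retargets the same code.
    auto exec = gko::ReferenceExecutor::create();

    auto A = gko::share(
        gko::read<gko::matrix::Csr<double>>(std::ifstream("A.mtx"), exec));
    auto b = gko::read<gko::matrix::Dense<double>>(std::ifstream("b.mtx"), exec);
    auto x = gko::read<gko::matrix::Dense<double>>(std::ifstream("x0.mtx"), exec);

    // Compose a CG solver with a block-Jacobi preconditioner; the generated
    // solver is itself a linear operator.
    auto solver =
        gko::solver::Cg<double>::build()
            .with_preconditioner(gko::preconditioner::Jacobi<double>::build().on(exec))
            .with_criteria(gko::stop::Iteration::build().with_max_iters(1000u).on(exec))
            .on(exec)
            ->generate(A);
    solver->apply(b.get(), x.get());  // x ~= A^{-1} b
    return 0;
}
```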