74 research outputs found

    Blocked algorithms for the reduction to Hessenberg-triangular form revisited

    Get PDF
    We present two variants of Moler and Stewart's algorithm for reducing a matrix pair to Hessenberg-triangular (HT) form with increased data locality in the access to the matrices. In one of these variants, a careful reorganization and accumulation of Givens rotations enables the use of efficient level 3 BLAS. Experimental results on four different architectures, representative of current high-performance processors, compare the performance of the new variants with that of the implementation of Moler and Stewart's algorithm in subroutine DGGHRD from LAPACK, Dackland and Kågström's two-stage algorithm for the HT form, and a modified version of the latter which requires considerably fewer flops.
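    The payoff of accumulating rotations is easy to see in a few lines. Below is an illustrative Python sketch (the function names and structure are ours, not the paper's): each Givens rotation applied individually is a level-1 BLAS operation, but multiplying a group of them into a small orthogonal factor first lets the update of the large matrix be performed as a single matrix-matrix product, i.e., level 3 BLAS.

```python
# Illustrative sketch only: accumulate a group of Givens rotations acting on
# the leading k rows into one k-by-k orthogonal factor, then apply it with a
# single matrix-matrix product instead of many vector-pair updates.
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

def apply_accumulated_rotations(A, rotations, k):
    """rotations: list of (i, j, c, s) with row indices i, j < k."""
    Q = np.eye(k)
    for i, j, c, s in rotations:
        G = np.eye(k)
        G[[i, i, j, j], [i, j, i, j]] = [c, s, -s, c]
        Q = G @ Q                   # accumulate on the small k-by-k factor only
    A[:k, :] = Q @ A[:k, :]         # one level-3 BLAS (GEMM) call on the big matrix
    return A
```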

    Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance

    Get PDF
    We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple sets of Givens rotations to the eigenvector/singular vector matrix. This algorithm is then implemented with optimizations that (1) leverage vector instruction units to increase floating-point throughput, and (2) fuse multiple rotations to decrease the total number of memory operations. We demonstrate the merits of these new QR algorithms for computing the Hermitian eigenvalue decomposition (EVD) and singular value decomposition (SVD) of dense matrices when all eigenvectors/singular vectors are computed. The approach yields vastly improved performance relative to the traditional QR algorithms for these problems and is competitive with two commonly used alternatives (Cuppen's Divide and Conquer algorithm and the Method of Multiple Relatively Robust Representations) while inheriting the more modest O(n) workspace requirements of the original QR algorithms. Since the computations performed by the restructured algorithms remain essentially identical to those performed by the original methods, robust numerical properties are preserved.
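    The second optimization, rotation fusion, can be sketched as follows (illustrative Python under our own naming, not the authors' kernel, which additionally vectorizes the inner work): applying two adjacent rotations in a single pass over the rows of the vector matrix loads each element once instead of twice, reducing memory traffic.

```python
# Hedged sketch of rotation fusion: apply rotation (c1, s1) to columns
# (j, j+1) and then (c2, s2) to columns (j+1, j+2) of V in one sweep over
# memory. Only the fusion idea is shown here.
import numpy as np

def apply_fused_pair(V, j, c1, s1, c2, s2):
    for row in V:                                   # each row loaded once
        a, b, d = row[j], row[j + 1], row[j + 2]
        a, b = c1 * a + s1 * b, -s1 * a + c1 * b    # first rotation
        b, d = c2 * b + s2 * d, -s2 * b + c2 * d    # second rotation, reusing b
        row[j], row[j + 1], row[j + 2] = a, b, d
    return V
```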

    Algorithm 782

    Full text link

    Programming matrix algorithms-by-blocks for thread-level parallelism

    Get PDF
    With the emergence of thread-level parallelism as the primary means for continued improvement of performance, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of blocks, facilitating algorithms-by-blocks. Transparent to the library implementor, operand descriptions are registered for a particular operation a priori. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and the closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest that high performance is abundantly achievable.
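    As a rough illustration of the algorithms-by-blocks idea, the toy sketch below (ours, not FLASH/SuperMatrix) shows storage by blocks and one task per result block; it deliberately omits the dependency analysis that SuperMatrix performs to order tasks that read and write the same block.

```python
# Toy sketch: a matrix stored as a dictionary of b-by-b blocks ("storage by
# blocks"), with each block of the result computed by an independent task
# on a thread pool.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

n, b = 8, 4
nb = n // b
A = {(i, j): np.random.rand(b, b) for i in range(nb) for j in range(nb)}

def blocked_matmul(A, B):
    """C = A @ B on block-stored operands; one task per C-block."""
    with ThreadPoolExecutor() as pool:
        futs = {(i, j): pool.submit(
                    lambda i=i, j=j: sum(A[i, k] @ B[k, j] for k in range(nb)))
                for i in range(nb) for j in range(nb)}
        return {ij: f.result() for ij, f in futs.items()}

C = blocked_matmul(A, A)
```

    In the real runtime, tasks in algorithms such as LU with incremental pivoting are not independent as they are here; the recorded data dependencies determine which can execute out of order.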

    Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing

    Get PDF
    In this article, we present GINKGO, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, GINKGO's design principle abstracts all functionality as "linear operators," motivating the notion of a "linear operator algebra library." GINKGO's current focus is oriented toward providing sparse linear algebra functionality for high performance graphics processing unit (GPU) architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific backends and provide details on extensibility and sustainability measures. We also demonstrate GINKGO's usability by providing examples on how to use its functionality inside the MFEM and deal.II finite element ecosystems. Finally, we offer a practical demonstration of GINKGO's high performance on state-of-the-art GPU architectures.
    Anzt, H., Cojean, T., Flegar, G., Göbel, F., Grützmacher, T., Nayak, P., Ribizel, T., ... & Quintana-Ortí, E. S. (2022). Ginkgo: A modern linear operator algebra framework for high performance computing. ACM Transactions on Mathematical Software, 48(1), 1-33. https://doi.org/10.1145/3480935
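    The linear-operator abstraction can be mimicked in a few lines. The sketch below is illustrative Python under our own naming (GINKGO itself is C++ and its actual class names and interfaces differ): matrices, preconditioners, and solvers all expose the same apply method, so algorithms are written against the abstraction rather than against a concrete matrix format.

```python
# Hedged sketch of a linear-operator-centric design; not GINKGO's API.
import numpy as np

class LinearOperator:
    def apply(self, x):
        raise NotImplementedError

class DenseMatrix(LinearOperator):
    def __init__(self, A):
        self.A = np.asarray(A)
    def apply(self, x):
        return self.A @ x                     # mat-vec behind the interface

class JacobiPreconditioner(LinearOperator):
    def __init__(self, A):
        self.inv_diag = 1.0 / np.diag(A)
    def apply(self, x):
        return self.inv_diag * x              # M^{-1} x

def richardson(op, precond, rhs, iters=100, omega=0.9):
    """A solver written purely against the LinearOperator interface."""
    x = np.zeros_like(rhs)
    for _ in range(iters):
        x += omega * precond.apply(rhs - op.apply(x))
    return x

A = np.diag([4.0, 3.0, 2.0])
x = richardson(DenseMatrix(A), JacobiPreconditioner(A), np.ones(3))
```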

    Out-of-core solution of eigenproblems for macromolecular simulations

    Get PDF
    We consider the solution of large-scale eigenvalue problems that appear in the motion simulation of complex macromolecules on desktop platforms. To tackle the dimension of the matrices involved in these problems, we formulate out-of-core (OOC) variants of the two selected eigensolvers, which essentially decouple the performance of the solver from the storage capacity. Furthermore, we contend with the high computational complexity of the solvers by off-loading the arithmetically intensive parts of the algorithms to a hardware graphics accelerator.
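    The essence of an out-of-core formulation is that only a panel of the matrix resides in memory at any time, so the memory footprint no longer scales with the problem dimension. A minimal sketch of that access pattern follows (ours, not the paper's eigensolver; the file path and panel size are placeholders):

```python
# Illustrative OOC building block: a matrix-vector product that streams
# row panels of a disk-resident matrix through memory.
import numpy as np

def ooc_matvec(path, n, x, panel_rows=1024):
    """y = A @ x, with A (n x n, float64, row-major) stored on disk."""
    A = np.memmap(path, dtype=np.float64, mode="r", shape=(n, n))
    y = np.empty(n)
    for r0 in range(0, n, panel_rows):        # one panel in core at a time
        r1 = min(r0 + panel_rows, n)
        y[r0:r1] = A[r0:r1, :] @ x            # in-core (or GPU) work per panel
    return y
```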

    Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments

    Get PDF
    The Preconditioned Conjugate Gradient method is often employed for the solution of linear systems of equations arising in numerical simulations of physical phenomena. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we propose two algorithmic solutions that originate from the ExBLAS project to enhance the accuracy of the solver as well as to ensure its reproducibility in a hybrid MPI + OpenMP tasks programming environment. One is based on ExBLAS and preserves every bit of information until the final rounding, while the other relies upon floating-point expansions and, hence, expands the intermediate precision. Instead of converting the entire solver into its ExBLAS-related implementation, we identify those parts that violate reproducibility/non-associativity, secure them, and combine this with the sequential executions. These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these approaches on two modern HPC systems: both versions deliver reproducible numbers of iterations, residuals, direct errors, and vector-solutions for an overhead of less than 37.7% on 768 cores.
    Iakymchuk, R., Barreda Vayá, M., Graillat, S., Aliaga, J. I., & Quintana-Ortí, E. S. (2020). Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments. International Journal of High Performance Computing Applications, 34(5), 502-518. https://doi.org/10.1177/1094342020932650
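    The floating-point-expansion route builds on error-free transformations such as Knuth's TwoSum, which recovers exactly the rounding error of each addition; carrying those errors along effectively doubles the working precision. Below is a small sketch of that building block (the full ExBLAS approach additionally uses a long accumulator to preserve every bit until the final rounding):

```python
# Hedged sketch: compensated summation via the TwoSum error-free transform.
def two_sum(a, b):
    """Knuth's TwoSum: s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bp = s - a
    e = (a - (s - bp)) + (b - bp)
    return s, e

def sum2(values):
    """Ogita-Rump-Oishi Sum2: summation in roughly twice working precision."""
    s, c = 0.0, 0.0
    for v in values:
        s, e = two_sum(s, v)
        c += e                     # accumulate the exact rounding errors
    return s + c

print(sum([1e16, 1.0, -1e16]))    # 0.0: the 1.0 is lost by naive summation
print(sum2([1e16, 1.0, -1e16]))   # 1.0: recovered from the error term
```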

    A pipeline structure for the block QR update in digital signal processing

    Get PDF
    There exist problems in the field of digital signal processing, such as the filtering of acoustic signals, that require processing a large amount of data in real time. The beamforming algorithm, for instance, is a process that can be modeled by a rectangular matrix built on the input signals of an acoustic system and, thus, changes in real time. To obtain the output signals, it is required to compute its QR factorization. In this paper, we propose to organize the concurrent computational resources of a given multicore computer in a pipeline structure to perform this factorization as fast as possible. The pipeline has been implemented using both the application programming interface OpenMP and GrPPI, a library interface to design parallel applications based on parallel patterns. We tackle not only the performance challenge but also the programmability of our idea using parallel programming frameworks.
    Dolz, M. F., Alventosa, F. J., Alonso-Jordá, P., & Vidal Maciá, A. M. (2019). A pipeline structure for the block QR update in digital signal processing. The Journal of Supercomputing, 75(3), 1470-1482. https://doi.org/10.1007/s11227-018-2666-1
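    The block QR update that the pipeline stages execute can be summarized as follows (an illustrative sketch under our own naming; a tuned implementation would exploit the triangular structure of R instead of calling a general QR): when a new block of sample rows W arrives, only the stacked matrix [R; W] needs to be re-triangularized.

```python
# Sketch of the row-update step: given R from A = Q R and a new block of
# sample rows W, the updated triangular factor R' satisfies [A; W] = Q' R'.
import numpy as np

def qr_update_rows(R, W):
    return np.linalg.qr(np.vstack([R, W]), mode="r")

# Toy streaming usage; a pipeline would overlap these updates across cores.
n, b = 6, 3
R = np.linalg.qr(np.random.rand(n, n), mode="r")
for _ in range(4):
    R = qr_update_rows(R, np.random.rand(b, n))
```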

    Singular Perturbation Approximation of Large, Dense Linear Systems

    No full text

    Computing Passive Reduced-Order Models for Circuit Simulation

    No full text