50 research outputs found

    Reproducibility strategies for parallel preconditioned Conjugate Gradient

    Get PDF
    [EN] The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we aim at a twofold goal: enhance the accuracy of the solver but also ensure its reproducibility in a message-passing implementation. We design and employ various strategies starting from the ExBLAS approach (through preserving every bit of information until final rounding) to its more lightweight performance-oriented variant (through expanding the intermediate precision). These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these strategies on modern HPC systems: both versions deliver reproducible number of iterations, residuals, direct errors, and vector-solutions for the overhead of only 29% (ExBLAS) and 4% (lightweight) on 768 processes.To begin with, we would like to thank the reviewers for their thorough reading of the article as well as their valuable comments and suggestions. This research was partially supported by the European Union's Horizon 2020 research, innovation programme under the Marie Sklodowska-Curie grant agreement via the Robust project No. 842528 as well as the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the H2020 EC RIA Programme; in particular, the author gratefully acknowledges the support of Vicenc Beltran and the computer resources and technical support provided by BSC. The researchers from Universitat Jaume I (UJI) and Universidad Politecnica de Valencia (UPV) were supported by MINECO, Spain project TIN2017-82972-R. Maria Barreda was also supported by the POSDOC-A/2017/11 project from the Universitat Jaume I, Spain.Iakymchuk, R.; Barreda, M.; Wiesenberger, M.; Aliaga, JI.; Quintana Ortí, ES. (2020). Reproducibility strategies for parallel preconditioned Conjugate Gradient. Journal of Computational and Applied Mathematics. 371:1-13. https://doi.org/10.1016/j.cam.2019.112697S113371Lawson, C. L., Hanson, R. J., Kincaid, D. R., & Krogh, F. T. (1979). Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software, 5(3), 308-323. doi:10.1145/355841.355847Dongarra, J. J., Du Croz, J., Hammarling, S., & Duff, I. S. (1990). A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1), 1-17. doi:10.1145/77626.79170Demmel, J., & Nguyen, H. D. (2015). Parallel Reproducible Summation. IEEE Transactions on Computers, 64(7), 2060-2070. doi:10.1109/tc.2014.2345391Iakymchuk, R., Graillat, S., Defour, D., & Quintana-Ortí, E. S. (2019). Hierarchical approach for deriving a reproducible unblocked LU factorization. The International Journal of High Performance Computing Applications, 33(5), 791-803. doi:10.1177/1094342019832968Iakymchuk, R., Defour, D., Collange, S., & Graillat, S. (2016). Reproducible and Accurate Matrix Multiplication. Lecture Notes in Computer Science, 126-137. doi:10.1007/978-3-319-31769-4_11Rump, S. M., Ogita, T., & Oishi, S. (2009). Accurate Floating-Point Summation Part II: Sign, K-Fold Faithful and Rounding to Nearest. SIAM Journal on Scientific Computing, 31(2), 1269-1302. doi:10.1137/07068816xBurgess, N., Goodyer, C., Hinds, C. N., & Lutz, D. R. (2019). High-Precision Anchored Accumulators for Reproducible Floating-Point Summation. IEEE Transactions on Computers, 68(7), 967-978. doi:10.1109/tc.2018.2855729D. Mukunoki, T. Ogita, K. Ozaki, Accurate and reproducible BLAS routines with Ozaki scheme for many-core architectures, in: Proc. International Conference on Parallel Processing and Applied Mathematics, PPAM2019, 2019, accepted.Ogita, T., Rump, S. M., & Oishi, S. (2005). Accurate Sum and Dot Product. SIAM Journal on Scientific Computing, 26(6), 1955-1988. doi:10.1137/030601818Kulisch, U., & Snyder, V. (2010). The exact dot product as basic tool for long interval arithmetic. Computing, 91(3), 307-313. doi:10.1007/s00607-010-0127-7Boldo, S., & Melquiond, G. (2008). Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd. IEEE Transactions on Computers, 57(4), 462-471. doi:10.1109/tc.2007.70819Wiesenberger, M., Einkemmer, L., Held, M., Gutierrez-Milla, A., Sáez, X., & Iakymchuk, R. (2019). Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures. Computer Physics Communications, 238, 145-156. doi:10.1016/j.cpc.2018.12.006Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., & Zimmermann, P. (2007). MPFR. ACM Transactions on Mathematical Software, 33(2), 13. doi:10.1145/1236463.1236468J. Demmel, H.D. Nguyen, Fast reproducible floating-point summation, in: Proceedings of ARITH-21, 2013, pp. 163–172.Ozaki, K., Ogita, T., Oishi, S., & Rump, S. M. (2011). Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numerical Algorithms, 59(1), 95-118. doi:10.1007/s11075-011-9478-1Carson, E., & Higham, N. J. (2018). Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions. SIAM Journal on Scientific Computing, 40(2), A817-A847. doi:10.1137/17m114081

    Reproducibility Strategies for Parallel Preconditioned Conjugate Gradient

    Get PDF
    The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we aim at a twofold goal: enhance the accuracy of the solver but also ensure its reproducibility in a message-passing implementation. We design and employ various strategies starting from the ExBLAS approach (through preserving every bit of information until final rounding) to its more lightweight performance-oriented variant (through expanding the intermediate precision). These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these strategies on modern HPC systems: both versions deliver reproducible number of iterations, residuals, direct errors, and vector-solutions for the overhead of only 29 % (ExBLAS) and 4 % (lightweight) on 768 processes

    Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments

    Get PDF
    [EN] The Preconditioned Conjugate Gradient method is often employed for the solution of linear systems of equations arising in numerical simulations of physical phenomena. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we propose two algorithmic solutions that originate from the ExBLAS project to enhance the accuracy of the solver as well as to ensure its reproducibility in a hybrid MPI + OpenMP tasks programming environment. One is based on ExBLAS and preserves every bit of information until the final rounding, while the other relies upon floating-point expansions and, hence, expands the intermediate precision. Instead of converting the entire solver into its ExBLAS-related implementation, we identify those parts that violate reproducibility/non-associativity, secure them, and combine this with the sequential executions. These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these approaches on two modern HPC systems: both versions deliver reproducible number of iterations, residuals, direct errors, and vector-solutions for the overhead of less than 37.7% on 768 cores.The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partially supported by the European Union's Horizon 2020 research, innovation program under the Marie Sklodowska-Curie grant agreement via the Robust project No. 842528 as well as the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the H2020 EC RIA Programme; in particular, the author gratefully acknowledges the support of Vicenc comma Beltran and the computer resources and technical support provided by BSC. The researchers from Universitat Jaume I (UJI) and Universitat Polit ' ecnica de Valencia (UPV) were supported by MINECO project TIN2017-82972-R. Maria Barreda was also supported by the POSDOC-A/2017/11 project from the Universitat Jaume I.Iakymchuk, R.; Barreda Vayá, M.; Graillat, S.; Aliaga, JI.; Quintana Ortí, ES. (2020). Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments. International Journal of High Performance Computing Applications. 34(5):502-518. https://doi.org/10.1177/1094342020932650S502518345Aliaga, J. I., Barreda, M., Flegar, G., Bollhöfer, M., & Quintana-Ortí, E. S. (2017). Communication in task-parallel ILU-preconditioned CG solvers using MPI + OmpSs. Concurrency and Computation: Practice and Experience, 29(21), e4280. doi:10.1002/cpe.4280Bailey, D. H. (2013). High-precision computation: Applications and challenges [Keynote I]. 2013 IEEE 21st Symposium on Computer Arithmetic. doi:10.1109/arith.2013.39Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., … van der Vorst, H. (1994). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. doi:10.1137/1.9781611971538Burgess, N., Goodyer, C., Hinds, C. N., & Lutz, D. R. (2019). High-Precision Anchored Accumulators for Reproducible Floating-Point Summation. IEEE Transactions on Computers, 68(7), 967-978. doi:10.1109/tc.2018.2855729Carson, E., & Higham, N. J. (2018). Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions. SIAM Journal on Scientific Computing, 40(2), A817-A847. doi:10.1137/17m1140819Collange, S., Defour, D., Graillat, S., & Iakymchuk, R. (2015). Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing, 49, 83-97. doi:10.1016/j.parco.2015.09.001Dekker, T. J. (1971). A floating-point technique for extending the available precision. Numerische Mathematik, 18(3), 224-242. doi:10.1007/bf01397083Demmel, J., & Hong Diep Nguyen. (2013). Fast Reproducible Floating-Point Summation. 2013 IEEE 21st Symposium on Computer Arithmetic. doi:10.1109/arith.2013.9Demmel, J., & Nguyen, H. D. (2015). Parallel Reproducible Summation. IEEE Transactions on Computers, 64(7), 2060-2070. doi:10.1109/tc.2014.2345391Dongarra, J. J., Du Croz, J., Hammarling, S., & Duff, I. S. (1990). A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1), 1-17. doi:10.1145/77626.79170Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., & Zimmermann, P. (2007). MPFR. ACM Transactions on Mathematical Software, 33(2), 13. doi:10.1145/1236463.1236468Hida, Y., Li, X. S., & Bailey, D. H. (s. f.). Algorithms for quad-double precision floating point arithmetic. Proceedings 15th IEEE Symposium on Computer Arithmetic. ARITH-15 2001. doi:10.1109/arith.2001.930115Hunold, S., & Carpen-Amarie, A. (2016). Reproducible MPI Benchmarking is Still Not as Easy as You Think. IEEE Transactions on Parallel and Distributed Systems, 27(12), 3617-3630. doi:10.1109/tpds.2016.2539167IEEE Computer Society (2008) IEEE Standard for Floating-Point Arithmetic. Piscataway: IEEE Standard, pp. 754–2008.Kulisch, U., & Snyder, V. (2010). The exact dot product as basic tool for long interval arithmetic. Computing, 91(3), 307-313. doi:10.1007/s00607-010-0127-7Kulisch, U. (2013). Computer Arithmetic and Validity. doi:10.1515/9783110301793Lawson, C. L., Hanson, R. J., Kincaid, D. R., & Krogh, F. T. (1979). Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software, 5(3), 308-323. doi:10.1145/355841.355847Lutz, D. R., & Hinds, C. N. (2017). High-Precision Anchored Accumulators for Reproducible Floating-Point Summation. 2017 IEEE 24th Symposium on Computer Arithmetic (ARITH). doi:10.1109/arith.2017.20Mukunoki, D., Ogita, T., & Ozaki, K. (2020). Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-Core Architectures. Lecture Notes in Computer Science, 516-527. doi:10.1007/978-3-030-43229-4_44Nguyen, H. D., & Demmel, J. (2015). Reproducible Tall-Skinny QR. 2015 IEEE 22nd Symposium on Computer Arithmetic. doi:10.1109/arith.2015.28Ogita, T., Rump, S. M., & Oishi, S. (2005). Accurate Sum and Dot Product. SIAM Journal on Scientific Computing, 26(6), 1955-1988. doi:10.1137/030601818Ozaki, K., Ogita, T., Oishi, S., & Rump, S. M. (2011). Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numerical Algorithms, 59(1), 95-118. doi:10.1007/s11075-011-9478-1Priest, D. M. (s. f.). Algorithms for arbitrary precision floating point arithmetic. [1991] Proceedings 10th IEEE Symposium on Computer Arithmetic. doi:10.1109/arith.1991.145549Rump, S. M., Ogita, T., & Oishi, S. (2008). Accurate Floating-Point Summation Part I: Faithful Rounding. SIAM Journal on Scientific Computing, 31(1), 189-224. doi:10.1137/050645671Rump, S. M., Ogita, T., & Oishi, S. (2009). Accurate Floating-Point Summation Part II: Sign, K-Fold Faithful and Rounding to Nearest. SIAM Journal on Scientific Computing, 31(2), 1269-1302. doi:10.1137/07068816xRump, S. M., Ogita, T., & Oishi, S. (2010). Fast high precision summation. Nonlinear Theory and Its Applications, IEICE, 1(1), 2-24. doi:10.1587/nolta.1.2Saad, Y. (2003). Iterative Methods for Sparse Linear Systems. doi:10.1137/1.9780898718003Wiesenberger, M., Einkemmer, L., Held, M., Gutierrez-Milla, A., Sáez, X., & Iakymchuk, R. (2019). Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures. Computer Physics Communications, 238, 145-156. doi:10.1016/j.cpc.2018.12.00

    Optimisations arithmétiques et synthèse de haut niveau

    Get PDF
    High-level synthesis (HLS) tools offer increased productivity regarding FPGA programming.However, due to their relatively young nature, they still lack many arithmetic optimizations.This thesis proposes safe arithmetic optimizations that should always be applied.These optimizations are simple operator specializations, following the C semantic.Other require to a lift the semantic embedded in high-level input program languages, which are inherited from software programming, for an improved accuracy/cost/performance ratio.To demonstrate this claim, the sum-of-product of floating-point numbers is used as a case study. The sum is performed on a fixed-point format, which is tailored to the application, according to the context in which the operator is instantiated.In some cases, there is not enough information about the input data to tailor the fixed-point accumulator.The fall-back strategy used in this thesis is to generate an accumulator covering the entire floating-point range.This thesis explores different strategies for implementing such a large accumulator, including new ones.The use of a 2's complement representation instead of a sign+magnitude is demonstrated to save resources and to reduce the accumulation loop delay.Based on a tapered precision scheme and an exact accumulator, the posit number systems claims to be a candidate to replace the IEEE floating-point format.A throughout analysis of posit operators is performed, using the same level of hardware optimization as state-of-the-art floating-point operators.Their cost remains much higher that their floating-point counterparts in terms of resource usage and performance. Finally, this thesis presents a compatibility layer for HLS tools that allows one code to be deployed on multiple tools.This library implements a strongly typed custom size integer type along side a set of optimized custom operators.À cause de la nature relativement jeune des outils de synthèse de haut-niveau (HLS), de nombreuses optimisations arithmétiques n'y sont pas encore implémentées. Cette thèse propose des optimisations arithmétiques se servant du contexte spécifique dans lequel les opérateurs sont instanciés.Certaines optimisations sont de simples spécialisations d'opérateurs, respectant la sémantique du C.D'autres nécéssitent de s'éloigner de cette sémantique pour améliorer le compromis précision/coût/performance.Cette proposition est démontré sur des sommes de produits de nombres flottants.La somme est réalisée dans un format en virgule-fixe défini par son contexte.Quand trop peu d’informations sont disponibles pour définir ce format en virgule-fixe, une stratégie est de générer un accumulateur couvrant l'intégralité du format flottant.Cette thèse explore plusieurs implémentations d'un tel accumulateur.L'utilisation d'une représentation en complément à deux permet de réduire le chemin critique de la boucle d'accumulation, ainsi que la quantité de ressources utilisées. Un format alternatif aux nombres flottants, appelé posit, propose d'utiliser un encodage à précision variable.De plus, ce format est augmenté par un accumulateur exact.Pour évaluer précisément le coût matériel de ce format, cette thèse présente des architectures d'opérateurs posits, implémentés avec le même degré d'optimisation que celui de l'état de l'art des opérateurs flottants.Une analyse détaillée montre que le coût des opérateurs posits est malgré tout bien plus élevé que celui de leurs équivalents flottants.Enfin, cette thèse présente une couche de compatibilité entre outils de HLS, permettant de viser plusieurs outils avec un seul code. Cette bibliothèque implémente un type d'entiers de taille variable, avec de plus une sémantique strictement typée, ainsi qu'un ensemble d'opérateurs ad-hoc optimisés

    Energy. A continuing bibliography with indexes, issue 36, January 1983

    Get PDF
    This bibliography lists 1297 reports, articles, and other documents introduced into the NASA scientific and technical information system from October 1, 1982 through December 31, 1982

    Cumulative index to NASA Tech Briefs, 1970-1975

    Get PDF
    Tech briefs of technology derived from the research and development activities of the National Aeronautics and Space Administration are presented. Abstracts and indexes of subject, personal author, originating center, and tech brief number for the 1970-1975 tech briefs are presented

    The 11th Space Simulation Conference

    Get PDF
    Subject areas range from specialized issues dealing with the space and entry environments to the environmental testing of systems and complete spacecraft of present-day vintage. Various papers consider: the test and development of several key systems of the orbiter vehicle; integrated tests of complete satellites; new and unique test facilities developed to meet the demanding requirements of high fidelity simulation of test environments; and contamination species, including the instrumentation for detection and measurement of such. Special topics include improved thermal protection methodologies and approaches, sophisticated sensor developments, and other related testing and development areas
    corecore