Reproducible Summation under HUB Format
Version different from the paper presented at the conference.
Floating-point reproducibility is a property
demanded by programmers and end users. Half-Unit-Biased
(HUB) is a new representation format in which rounding
to nearest is carried out by truncation, preventing any carry
propagation and saving time and area. In this paper we study
the reproducible summation of HUB numbers using an error-free
vector transformation technique, providing both a specific
architecture and the usage of combined HUB/standard floating-point
adders to achieve a reproducible result.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
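As a rough illustration of what an error-free vector transformation does, the sketch below uses ordinary IEEE binary64 arithmetic in Python. It is not the HUB format and not the architecture from the paper. The function names two_sum, vec_sum and exact_total are hypothetical and chosen only for this example, which assumes finite inputs and no overflow.

```python
from fractions import Fraction


def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and s + e == a + b exactly (no overflow)."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e


def vec_sum(values):
    """Error-free vector transformation: the output has the same exact sum as the
    input, and its last entry is the ordinary left-to-right floating-point sum."""
    out = list(values)
    for i in range(1, len(out)):
        # out[i] becomes the rounded running sum, out[i - 1] the rounding error.
        out[i], out[i - 1] = two_sum(out[i], out[i - 1])
    return out


def exact_total(values):
    """Exact rational sum, used only to check the transformation."""
    return sum((Fraction(x) for x in values), Fraction(0))


data = [1e16, 1.0, -1e16, 0.1]
transformed = vec_sum(data)
print(exact_total(transformed) == exact_total(data))
```

Because the transformed vector carries the rounding errors alongside the running sum, no information is lost, which is the property that reproducible summation schemes of this kind build on.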
Parallel Algorithms for Summing Floating-Point Numbers
The problem of exactly summing n floating-point numbers is a fundamental
problem that has many applications in large-scale simulations and computational
geometry. Unfortunately, due to the round-off error in standard floating-point
operations, this problem becomes very challenging. Moreover, all existing
solutions rely on sequential algorithms which cannot scale to the huge datasets
that need to be processed.
In this paper, we provide several efficient parallel algorithms for summing n
floating-point numbers, so as to produce a faithfully rounded floating-point
representation of the sum. We present algorithms in PRAM, external-memory, and
MapReduce models, and we also provide an experimental analysis of our MapReduce
algorithms, due to their simplicity and practical efficiency.
Comment: Conference version appears in SPAA 201
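The map-and-reduce structure described in this abstract can be sketched in Python as follows. This is not the paper's algorithm: exact rational arithmetic from the standard fractions module stands in for whatever exact intermediate representation an implementation would use, and the names map_chunk and exact_parallel_sum are hypothetical. The sketch assumes finite inputs.

```python
from fractions import Fraction
from functools import reduce


def map_chunk(chunk):
    """Map step: exact partial sum of one chunk (every finite float is a rational)."""
    return sum((Fraction(x) for x in chunk), Fraction(0))


def exact_parallel_sum(values, chunk_size=4):
    """Sum exactly per chunk, combine the exact partials, round once at the end."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [map_chunk(c) for c in chunks]  # each chunk could go to its own worker
    total = reduce(lambda a, b: a + b, partials, Fraction(0))  # exact, so order-free
    return float(total)  # the only rounding happens here


data = [1e100, 1.0, -1e100, 1e-100]
print(exact_parallel_sum(data, chunk_size=2))
```

Since the partial sums are exact, the result does not depend on how the input is split into chunks or on the order in which partials are combined.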
General Framework for Deriving Reproducible Krylov Subspace Algorithms: BiCGStab Case
Parallel implementations of Krylov subspace algorithms often help to accelerate the solution of a linear system. On the other hand, such parallelization, coupled with
asynchronous and out-of-order execution, often amplifies the effects of the non-associativity
of floating-point operations. This results in non-reproducibility on the
same or different settings. This paper proposes a general framework for
deriving reproducible and accurate variants of a Krylov subspace algorithm. The proposed algorithmic strategies are reinforced by programmability suggestions to assure deterministic and accurate executions. The
framework is illustrated on the preconditioned BiCGStab method for the
solution of non-symmetric linear systems with message-passing. Finally,
we verify the two reproducible variants of PBiCGStab on a set of matrices
from the SuiteSparse Matrix Collection and a 3D Poisson’s equation.
Accurate and Efficiently Vectorized Sums and Dot Products in Julia
Version submitted to the Correctness2019 workshop.
This paper presents an efficient, vectorized implementation of various summation and dot product algorithms in the Julia programming language. These implementations are available under an open source license in the AccurateArithmetic.jl Julia package.
Besides naive algorithms, compensated algorithms are implemented: the Kahan-Babuška-Neumaier summation algorithm, and the Ogita-Rump-Oishi simply compensated summation and dot product algorithms. These algorithms effectively double the working precision, producing much more accurate results while incurring little to no overhead, especially for large input vectors.
This paper also builds upon this example to make a case for a more widespread use of Julia in the HPC community. Although the vectorization of compensated algorithms is not a particularly simple task, Julia makes it relatively easy and straightforward. It also allows structuring the code in small, composable building blocks, closely matching textbook algorithms yet efficiently compiled.
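For readers unfamiliar with the Kahan-Babuška-Neumaier algorithm named above, a minimal scalar sketch in Python follows. It is the textbook form of the algorithm, not the vectorized Julia code from AccurateArithmetic.jl, and the function name neumaier_sum is chosen only for this example.

```python
def neumaier_sum(values):
    """Kahan-Babuska-Neumaier compensated summation (scalar textbook form)."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            # Low-order bits of x were lost in the addition.
            comp += (total - t) + x
        else:
            # Low-order bits of total were lost in the addition.
            comp += (x - t) + total
        total = t
    return total + comp


print(neumaier_sum([1.0, 1e100, 1.0, -1e100]))
```

On this input a plain left-to-right loop loses both small terms, whereas the compensated version recovers them from the accumulated correction.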
Reproducibility strategies for parallel preconditioned Conjugate Gradient
The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we aim at a twofold goal: enhance the accuracy of the solver but also ensure its reproducibility in a message-passing implementation. We design and employ various strategies starting from the ExBLAS approach (through preserving every bit of information until final rounding) to its more lightweight performance-oriented variant (through expanding the intermediate precision). These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these strategies on modern HPC systems: both versions deliver a reproducible number of iterations, residuals, direct errors, and vector-solutions for an overhead of only 29% (ExBLAS) and 4% (lightweight) on 768 processes.
To begin with, we would like to thank the reviewers for their thorough reading of the article as well as their valuable comments and suggestions. This research was partially supported by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement via the Robust project No. 842528 as well as the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the H2020 EC RIA Programme; in particular, the author gratefully acknowledges the support of Vicenc Beltran and the computer resources and technical support provided by BSC. The researchers from Universitat Jaume I (UJI) and Universidad Politecnica de Valencia (UPV) were supported by MINECO, Spain project TIN2017-82972-R. Maria Barreda was also supported by the POSDOC-A/2017/11 project from the Universitat Jaume I, Spain.
Iakymchuk, R.; Barreda, M.; Wiesenberger, M.; Aliaga, JI.; Quintana Ortí, ES. (2020). Reproducibility strategies for parallel preconditioned Conjugate Gradient. Journal of Computational and Applied Mathematics. 371:1-13. https://doi.org/10.1016/j.cam.2019.112697
Reproducibility Strategies for Parallel Preconditioned Conjugate Gradient
The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we aim at a twofold goal: enhance the accuracy of the solver but also ensure its reproducibility in a message-passing implementation. We design and employ various strategies starting from the ExBLAS approach (through preserving every bit of information until final rounding) to its more lightweight performance-oriented variant (through expanding the intermediate precision). These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these strategies on modern HPC systems: both versions deliver a reproducible number of iterations, residuals, direct errors, and vector-solutions for an overhead of only 29% (ExBLAS) and 4% (lightweight) on 768 processes.
Reproducible Floating-Point Aggregation in RDBMSs
Industry-grade database systems are expected to produce the same result if
the same query is repeatedly run on the same input. However, the numerous
sources of non-determinism in modern systems make reproducible results
difficult to achieve. This is particularly true if floating-point numbers are
involved, where the order of the operations affects the final result.
As part of a larger effort to extend database engines with data
representations more suitable for machine learning and scientific applications,
in this paper we explore the problem of making relational GroupBy over
floating-point formats bit-reproducible, i.e., ensuring any execution of the
operator produces the same result up to every single bit. To that aim, we first
propose a numeric data type that can be used as a drop-in replacement for other
number formats and is, unlike standard floating-point formats, associative.
We use this data type to make state-of-the-art GroupBy operators reproducible,
but this approach incurs a slowdown between 4x and 12x compared to the same
operator using conventional database number formats. We thus explore how to
modify existing GroupBy algorithms to make them bit-reproducible and efficient.
By using vectorized summation on batches and carefully balancing batch size,
cache footprint, and preprocessing costs, we are able to reduce the slowdown
due to reproducibility to a factor between 1.9x and 2.4x of aggregation in
isolation and to a mere 2.7% of end-to-end query performance even on
aggregation-intensive queries in MonetDB. We thereby provide a solid basis for
supporting more reproducible operations directly in relational engines.
This document is an extended version of an article currently in print for the
proceedings of ICDE'18 with the same title and by the same authors. The main
additions are more implementation details and experiments.
Comment: This document is the extended version of an article in the
Proceedings of the 34th IEEE International Conference on Data Engineering
(ICDE) 201
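The idea of an associative accumulator for GroupBy can be illustrated with a wide fixed-point sum held in a Python integer. This is only a sketch under stated assumptions: it is not the data type proposed in the paper and says nothing about its performance. The names SCALE_BITS, to_fixed and reproducible_group_sum are hypothetical, and inputs are assumed to be finite binary64 values.

```python
from collections import defaultdict
from fractions import Fraction

SCALE_BITS = 1074  # the smallest positive binary64 value is 2**-1074


def to_fixed(x):
    """Exactly convert a finite float to an integer multiple of 2**-1074."""
    n, d = x.as_integer_ratio()  # d is a power of two, at most 2**1074
    return n * ((1 << SCALE_BITS) // d)


def reproducible_group_sum(rows):
    """Sum values per key; the result is independent of the row order."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += to_fixed(value)  # integer addition is associative
    # Round each group total to a float exactly once, at the very end.
    return {k: float(Fraction(v, 1 << SCALE_BITS)) for k, v in acc.items()}


rows = [("a", 1e100), ("a", 1.0), ("b", 0.1), ("a", -1e100), ("b", 0.2)]
print(reproducible_group_sum(rows) == reproducible_group_sum(list(reversed(rows))))
```

Arbitrary-precision integers make the point simply but are far slower than a fixed-width format, which is consistent with the abstract's observation that reproducibility has a cost that careful batching then has to win back.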
Multi-Physics Bi-directional Evolutionary Topology Optimization on GPU-architecture
Topology optimization has proven to be viable for use in the preliminary phases of real world design problems. Ultimately, the restricting factor is the computational expense since a multitude of designs need to be considered. This is especially important in such fields as aerospace, automotive and biomedical, where the problems involve multiple physical models, typically fluids and structures, requiring extensive computation. One possible solution to this is to implement codes on massively parallel computer architectures, such as graphics processing units (GPUs). The present work investigates the feasibility of a GPU-implemented lattice Boltzmann method for multi-physics topology optimization for the first time. Noticeable differences between the GPU implementation and a central processing unit (CPU) version of the code are observed and the challenges associated with finding feasible solutions in a computationally efficient manner are discussed and solved here, for the first time on a multi-physics topology optimization problem. The main goal of this paper is to speed up the topology optimization process for multi-physics problems without restricting the design domain, or sacrificing considerable performance in the objectives. Examples are compared with both standard CPU and various levels of numerical precision GPU codes to better illustrate the advantages and disadvantages of this implementation. A structural and fluid objective topology optimization problem is solved to vary the dependence of the algorithm on the GPU, extending on the previous literature that has only considered structural objectives of non-design dependent load problems. The results of this work indicate some discrepancies between GPU and CPU implementations that have not been seen before in the literature and are imperative to the speed-up of multi-physics topology optimization algorithms using GPUs.
D. J. Munk thanks the Australian government for their financial support through the Endeavour Fellowship scheme. The authors would like to acknowledge the UK Consortium on Mesoscale Engineering Sciences (UKCOMES) EPSRC grant No EP/L00030X/1 for providing the HPC capabilities used in this article.