2,286 research outputs found
Parallel Algorithms for Summing Floating-Point Numbers
The problem of exactly summing n floating-point numbers is a fundamental
problem that has many applications in large-scale simulations and computational
geometry. Unfortunately, due to the round-off error in standard floating-point
operations, this problem becomes very challenging. Moreover, all existing
solutions rely on sequential algorithms which cannot scale to the huge datasets
that need to be processed.
In this paper, we provide several efficient parallel algorithms for summing n
floating point numbers, so as to produce a faithfully rounded floating-point
representation of the sum. We present algorithms in PRAM, external-memory, and
MapReduce models, and we also provide an experimental analysis of our MapReduce
algorithms, due to their simplicity and practical efficiency.Comment: Conference version appears in SPAA 201
Toward accurate polynomial evaluation in rounded arithmetic
Given a multivariate real (or complex) polynomial and a domain ,
we would like to decide whether an algorithm exists to evaluate
accurately for all using rounded real (or complex) arithmetic.
Here ``accurately'' means with relative error less than 1, i.e., with some
correct leading digits. The answer depends on the model of rounded arithmetic:
We assume that for any arithmetic operator , for example or , its computed value is , where is bounded by some constant where , but
is otherwise arbitrary. This model is the traditional one used to
analyze the accuracy of floating point algorithms.Our ultimate goal is to
establish a decision procedure that, for any and , either exhibits
an accurate algorithm or proves that none exists. In contrast to the case where
numbers are stored and manipulated as finite bit strings (e.g., as floating
point numbers or rational numbers) we show that some polynomials are
impossible to evaluate accurately. The existence of an accurate algorithm will
depend not just on and , but on which arithmetic operators and
which constants are are available and whether branching is permitted. Toward
this goal, we present necessary conditions on for it to be accurately
evaluable on open real or complex domains . We also give sufficient
conditions, and describe progress toward a complete decision procedure. We do
present a complete decision procedure for homogeneous polynomials with
integer coefficients, {\cal D} = \C^n, and using only the arithmetic
operations , and .Comment: 54 pages, 6 figures; refereed version; to appear in Foundations of
Computational Mathematics: Santander 2005, Cambridge University Press, March
200
Computing hypergeometric functions rigorously
We present an efficient implementation of hypergeometric functions in
arbitrary-precision interval arithmetic. The functions , ,
and (or the Kummer -function) are supported for
unrestricted complex parameters and argument, and by extension, we cover
exponential and trigonometric integrals, error functions, Fresnel integrals,
incomplete gamma and beta functions, Bessel functions, Airy functions, Legendre
functions, Jacobi polynomials, complete elliptic integrals, and other special
functions. The output can be used directly for interval computations or to
generate provably correct floating-point approximations in any format.
Performance is competitive with earlier arbitrary-precision software, and
sometimes orders of magnitude faster. We also partially cover the generalized
hypergeometric function and computation of high-order parameter
derivatives.Comment: v2: corrected example in section 3.1; corrected timing data for case
E-G in section 8.5 (table 6, figure 2); adjusted paper siz
When FPGAs are better at floating-point than microprocessors
It has been shown that FPGAs could outperform high-end microprocessors on floating-point computations thanks to massive parallelism. However, most previous studies re-implement in the FPGA the operators present in a processor. This is a safe and relatively straightforward approach, but it doesn't exploit the greater flexibility of the FPGA. This article is a survey of the many ways in which the FPGA implementation of a given floating-point computation can be not only faster, but also more accurate than its microprocessor counterpart. Techniques studied here include custom precision, specific accumulator design, dedicated architectures for coarser operators which have to be implemented in software in processors, and others. A real-world biomedical application illustrates these claims. This study also points to how current FPGA fabrics could be enhanced for better floating-point support
A Study of High Performance Multiple Precision Arithmetic on Graphics Processing Units
Multiple precision (MP) arithmetic is a core building block of a wide variety of algorithms in computational mathematics and computer science. In mathematics MP is used in computational number theory, geometric computation, experimental mathematics, and in some random matrix problems. In computer science, MP arithmetic is primarily used in cryptographic algorithms: securing communications, digital signatures, and code breaking. In most of these application areas, the factor that limits performance is the MP arithmetic. The focus of our research is to build and analyze highly optimized libraries that allow the MP operations to be offloaded from the CPU to the GPU. Our goal is to achieve an order of magnitude improvement over the CPU in three key metrics: operations per second per socket, operations per watt, and operation per second per dollar. What we find is that the SIMD design and balance of compute, cache, and bandwidth resources on the GPU is quite different from the CPU, so libraries such as GMP cannot simply be ported to the GPU. New approaches and algorithms are required to achieve high performance and high utilization of GPU resources. Further, we find that low-level ISA differences between GPU generations means that an approach that works well on one generation might not run well on the next.
Here we report on our progress towards MP arithmetic libraries on the GPU in four areas: (1) large integer addition, subtraction, and multiplication; (2) high performance modular multiplication and modular exponentiation (the key operations for cryptographic algorithms) across generations of GPUs; (3) high precision floating point addition, subtraction, multiplication, division, and square root; (4) parallel short division, which we prove is asymptotically optimal on EREW and CREW PRAMs
An efficient algorithm for accelerating the convergence of oscillatory series, useful for computing the polylogarithm and Hurwitz zeta functions
This paper sketches a technique for improving the rate of convergence of a
general oscillatory sequence, and then applies this series acceleration
algorithm to the polylogarithm and the Hurwitz zeta function. As such, it may
be taken as an extension of the techniques given by Borwein's "An efficient
algorithm for computing the Riemann zeta function", to more general series. The
algorithm provides a rapid means of evaluating Li_s(z) for general values of
complex s and the region of complex z values given by |z^2/(z-1)|<4.
Alternatively, the Hurwitz zeta can be very rapidly evaluated by means of an
Euler-Maclaurin series. The polylogarithm and the Hurwitz zeta are related, in
that two evaluations of the one can be used to obtain a value of the other;
thus, either algorithm can be used to evaluate either function. The
Euler-Maclaurin series is a clear performance winner for the Hurwitz zeta,
while the Borwein algorithm is superior for evaluating the polylogarithm in the
kidney-shaped region. Both algorithms are superior to the simple Taylor's
series or direct summation.
The primary, concrete result of this paper is an algorithm allows the
exploration of the Hurwitz zeta in the critical strip, where fast algorithms
are otherwise unavailable. A discussion of the monodromy group of the
polylogarithm is included.Comment: 37 pages, 6 graphs, 14 full-color phase plots. v3: Added discussion
of a fast Hurwitz algorithm; expanded development of the monodromy
v4:Correction and clarifiction of monodrom
Reproducibility strategies for parallel preconditioned Conjugate Gradient
[EN] The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we aim at a twofold goal: enhance the accuracy of the solver but also ensure its reproducibility in a message-passing implementation. We design and employ various strategies starting from the ExBLAS approach (through preserving every bit of information until final rounding) to its more lightweight performance-oriented variant (through expanding the intermediate precision). These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these strategies on modern HPC systems: both versions deliver reproducible number of iterations, residuals, direct errors, and vector-solutions for the overhead of only 29% (ExBLAS) and 4% (lightweight) on 768 processes.To begin with, we would like to thank the reviewers for their thorough reading of the article as well as their valuable comments and suggestions. This research was partially supported by the European Union's Horizon 2020 research, innovation programme under the Marie Sklodowska-Curie grant agreement via the Robust project No. 842528 as well as the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the H2020 EC RIA Programme; in particular, the author gratefully acknowledges the support of Vicenc Beltran and the computer resources and technical support provided by BSC. The researchers from Universitat Jaume I (UJI) and Universidad Politecnica de Valencia (UPV) were supported by MINECO, Spain project TIN2017-82972-R. Maria Barreda was also supported by the POSDOC-A/2017/11 project from the Universitat Jaume I, Spain.Iakymchuk, R.; Barreda, M.; Wiesenberger, M.; Aliaga, JI.; Quintana Ortí, ES. (2020). Reproducibility strategies for parallel preconditioned Conjugate Gradient. Journal of Computational and Applied Mathematics. 371:1-13. https://doi.org/10.1016/j.cam.2019.112697S113371Lawson, C. L., Hanson, R. J., Kincaid, D. R., & Krogh, F. T. (1979). Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software, 5(3), 308-323. doi:10.1145/355841.355847Dongarra, J. J., Du Croz, J., Hammarling, S., & Duff, I. S. (1990). A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1), 1-17. doi:10.1145/77626.79170Demmel, J., & Nguyen, H. D. (2015). Parallel Reproducible Summation. IEEE Transactions on Computers, 64(7), 2060-2070. doi:10.1109/tc.2014.2345391Iakymchuk, R., Graillat, S., Defour, D., & Quintana-Ortí, E. S. (2019). Hierarchical approach for deriving a reproducible unblocked LU factorization. The International Journal of High Performance Computing Applications, 33(5), 791-803. doi:10.1177/1094342019832968Iakymchuk, R., Defour, D., Collange, S., & Graillat, S. (2016). Reproducible and Accurate Matrix Multiplication. Lecture Notes in Computer Science, 126-137. doi:10.1007/978-3-319-31769-4_11Rump, S. M., Ogita, T., & Oishi, S. (2009). Accurate Floating-Point Summation Part II: Sign, K-Fold Faithful and Rounding to Nearest. SIAM Journal on Scientific Computing, 31(2), 1269-1302. doi:10.1137/07068816xBurgess, N., Goodyer, C., Hinds, C. N., & Lutz, D. R. (2019). High-Precision Anchored Accumulators for Reproducible Floating-Point Summation. IEEE Transactions on Computers, 68(7), 967-978. doi:10.1109/tc.2018.2855729D. Mukunoki, T. Ogita, K. Ozaki, Accurate and reproducible BLAS routines with Ozaki scheme for many-core architectures, in: Proc. International Conference on Parallel Processing and Applied Mathematics, PPAM2019, 2019, accepted.Ogita, T., Rump, S. M., & Oishi, S. (2005). Accurate Sum and Dot Product. SIAM Journal on Scientific Computing, 26(6), 1955-1988. doi:10.1137/030601818Kulisch, U., & Snyder, V. (2010). The exact dot product as basic tool for long interval arithmetic. Computing, 91(3), 307-313. doi:10.1007/s00607-010-0127-7Boldo, S., & Melquiond, G. (2008). Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd. IEEE Transactions on Computers, 57(4), 462-471. doi:10.1109/tc.2007.70819Wiesenberger, M., Einkemmer, L., Held, M., Gutierrez-Milla, A., Sáez, X., & Iakymchuk, R. (2019). Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures. Computer Physics Communications, 238, 145-156. doi:10.1016/j.cpc.2018.12.006Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., & Zimmermann, P. (2007). MPFR. ACM Transactions on Mathematical Software, 33(2), 13. doi:10.1145/1236463.1236468J. Demmel, H.D. Nguyen, Fast reproducible floating-point summation, in: Proceedings of ARITH-21, 2013, pp. 163–172.Ozaki, K., Ogita, T., Oishi, S., & Rump, S. M. (2011). Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numerical Algorithms, 59(1), 95-118. doi:10.1007/s11075-011-9478-1Carson, E., & Higham, N. J. (2018). Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions. SIAM Journal on Scientific Computing, 40(2), A817-A847. doi:10.1137/17m114081
- …