Abstract-GPUs are an important hardware development platform for problems where massive parallel computations are needed. Many of these problems require a higher precision than the standard double floating-point (FP) available. One common way of extending the precision is the multiple-component approach, in which real numbers are represented as the unevaluated sum of several standard machine precision FP numbers. This representation is called a FP expansion and it offers the simplicity of using directly available and highly optimized FP operations.
I. INTRODUCTION
In numerical computing, we sometimes encounter calculations that require more precision than the one offered by current processors. A way of handling such higher-precision calculations is to represent real numbers by floating-point (FP) expansions, i.e., by unevaluated sums of FP numbers. Several slightly different definitions of what a FP expansion is have been introduced in the literature, together with different arithmetic algorithms for manipulating them [1] , [2] , [3] . Choosing between these algorithms frequently depends on a compromise between accuracy, speed, and safety (because some of the more complex algorithms are just heuristic: they do not come with a proof). With Graphics Processing Units (GPUs) oriented implementation in mind, other important aspects are parallelism and locality. Should we try to parallelize the arithmetic algorithms that manipulate FP expansions (or, rather, build variants that are suitable for parallelization), or should we keep them sequential and try to parallelize at a higher level? In a previous study [4] , some of us have dealt with an "embarrassingly parallel" problem with compact intermediate data. However, many applications do not provide as much parallelism. Even for those that do, locality can be a problem. Increasing the precision of sequential arithmetic operations requires a corresponding increase in the amount of intermediate data to keep. Thus, parallel arithmetic algorithms are attractive not just by the extra parallelism they provide, but also by the locality improvements they enable. Here, with different numerical problems in mind, we parallelize the addition and multiplication of FP expansions on GPUs.
Section II recalls the basic notions related to FP expansions. In Sections III and IV we present data-parallel algorithms for addition and multiplication, respectively, of FP expansions. The implementation details are given in Section V followed by some performance assessment in Section VI. We finish by concluding our work in Section VII.
II. FLOATING-POINT EXPANSIONS
A normal binary precision-p floating-point (FP) number is a number of the form
ex−p+1 ,
The integer e x is called the exponent of x, and M x · 2 −p+1 is called the significand of x. Using Golberg's definition we denote ulp(x) = 2 ex−p+1 [5, Chap. 2] .
A natural generalization of the notion of double-double or quad-double is the notion of floating-point expansion.
Definition II.1. A FP expansion u with n terms is the unevaluated sum of n FP numbers u 0 , u 1 , . . . , u n−1 , in which all nonzero terms are ordered by magnitude (i.e., if v is the sequence obtained by removing all zeros in the sequence u, and if sequence v contains m terms, |v i | ≥ |v i+1 |, for all 0 ≤ i < m − 1).
Arithmetics on FP expansions have been introduced by Priest [1] , and later on by Shewchuk [2] .
To make sure that such an expansion carries significantly more information than only one FP number, it is required that the u i 's do not "overlap". The notion of (non-)overlapping varies depending on the authors. Intuitively, the stronger the sense of the nonoverlapping definition, the more difficult it is to obtain, implying extra manipulation of the FP expansions. In what follows we give the definition that we chose to use, referred to as ulp-nonoverlapping, in an attempt to allow for a more relaxed manipulation of the FP expansions. We specify first that even if a FP expansion may contain interleaving zeros, the definition that follows applies only to the non-zero terms of the expansion (i.e., the array v in Definition II.1).
Definition II.2. An expansion u 0 , u 1 , . . . , u n−1 is ulpnonoverlapping if for all 0 < i < n, we have |u i | ≤ ulp(u i−1 ).
In other words, the components are either P-nonoverlapping (that is, nonoverlapping according to Priest's definition [6] ) or they overlap by one bit, in which case the second component is a power of two.
Depending on the nonoverlapping type of an expansion, when using standard FP formats for representation, the exponent range forces a constraint on the number of terms. The largest expansion can be obtained when the largest term is close to overflow and the smallest is close to underflow. We remark that, when using ulp-nonoverlapping expansions, for the two most common FP formats, the constraints are:
• for double-precision (exponent range [−1022, 1023]) the maximum expansion size is 39; • for single-precision (exponent range [−126, 127]) the maximum is 12.
In this article we deal with so called parallel FP expansions, i.e., the expansion is stored on parallel execution threads, with one term/thread. This implies that the user has to launch as many threads as the expansion size.
The majority of algorithms performing arithmetic operations on expansions are based on the so-called error-free transforms (EFT) (such as the algorithms 2Sum, Fast2Sum, Dekker's product and 2MultFMA presented for instance in [5] ), that make it possible to compute both the result and the rounding error of a FP addition or multiplication. This implies that in general, each such EFT applied to two FP numbers, also returns two FP numbers. More precisely, the sum of two FP numbers can be represented exactly as a FP number which is the correct rounding of the sum, plus a second FP number corresponding to the rounding error. Under certain assumptions, this decomposition can be computed at a very low cost by a simple sequence of standard precision FP operations. For instance, assuming that a and b are two FP numbers, a simple algorithm (called 2Sum and due to Knuth [5] ) computes S and E, the decomposition of a + b using only 6 FP operations. Similarly, if a FMA operator 1 is available, 2ProdFMA [5] returns P and the error E (namely ab−P ) in 2 FP operations.
Algorithms like these can be extended to be used with arbitrary precision computations by chaining, resulting into the so-called distillation algorithms [7] , for summing several FP numbers. From these we make use of the VecSum algorithm [2] , [8] , which is simply a chain of 2Sum that performs an EFT on n FP numbers.
A potential problem appears when subsequent computations are done using these results; the size of the exact result is going to increase more and more. To avoid this, some "truncation" methods (both on-the-fly or a-posteriori) may be used to compute only an approximation of the exact result. Also, so-called (re-)normalization algorithms are used to render the result non-overlapping, which implies also a potential reduction in the number of components.
III. DATA-PARALLEL ADDITION ALGORITHM FOR FP EXPANSIONS
An algorithm that performs addition of two FP expansions x and y with n and m terms, respectively, will return a FP expansion with at most n + m terms as the exact result. This implies a continuous increase in the number of terms as subsequent computations are done using the obtained result. This is why, in practice, the results are "truncated" to obtain only an approximation of x + y.
Many variants of algorithms that compute the sum of two FP expansions have been presented in the literature [1] , [2] , [3] , [7] , each using different methods and having different complexities, but, from our knowledge, none of these algorithms are implemented in parallel and, even more, some of them are highly sequential, making them unsuitable for parallel architectures.
In what follows we will present a safe data-parallel algorithm for adding FP expansions that offers a tight error bound on the result and it allows to prove a clear constraint between the terms of the result. We also present a fast version with more relaxed error bounds based on the same scheme that may be used if computations are not cancellation-prone.
The safe data-parallel summation algorithm is presented in Fig. 1 and Algorithm 1. All arithmetic operations including EFTs like 2Sum are performed in parallel element-wise on R-element vectors. We assume vectors can be merged and elements inside a vector can be shuffled. These assumptions make the algorithms applicable to most SIMD units, including Intel SSE/AVX instruction set extensions [9] and recent Nvidia GPUs [10] .
Algorithm 1 Safe data-parallel algorithm for adding FP expansions.
Input: FP expansion vectors x = (x 0 , x 1 , . . . , x R−1 ) and y = (y 0 , y 1 , . . . , y R−1 ).
e ← (y i , e 0 , e 1 , . . . , e R−2 ) //Shift right & insert yi
8:
(s, e) ← 2Sum(s, e ) 9: end for 10: for i ← 1 to R − 2 do e ← (0, e 0 , e 1 , . . . , e R−2 )
12:
(s, e) ← 2Sum(s, e ) 13: end for 14: e ← (0, e 0 , e 1 , . . . , e R−2 )//Shift right 15: s ← s + e 16: return s.
For the sake of simplicity, we only present here the "input-R-output-R" version of the algorithm, even though the generalized version allows for different input sizes. The algorithm is based on a pipelined error propagation. We start by adding the first elements of each expansions, x 0 and y 0 on the first vector component. We continue to add the rest of the elements on the first component one by one and propagating the error upwards, to the other vector components. When we run out of elements to add we continue to propagate the errors for another R − 1 steps by injecting 0s on the first component. In the last step of the algorithm there is no need to use EFT, since we are not going to propagate the errors anymore; this is why we use only simple addition.
By using this scheme to add the two expansions we ensure that the most significant term of the output, s 0 , is the sum of the inputs rounded to nearest. Moreover, the terms of the output are arranged in terms of magnitude in decreasing order, with some constraints. Let us prove this in what follows.
It is easily seen that the parallel scheme presented in Fig. 1 can be reduced to the sequential one presented in Fig. 2 .
If there is a "Sterbenz relation" between x 0 and y 0 (i.e., if they are of opposite signs and x0 2 ≤ |y 0 | ≤ |2x 0 |) then e 01 = 0 and s 01 = x 0 + y 0 . This implies that s 12 = e 02 and e 12 = 0, and so on. In this case we end up propagating a 0, to the end of the result expansion, and we are left with the same scheme as we began with (illustrated in Fig. 3 ). This means that we can eliminate the case in which we have a "Sterbenz relation". Proposition III.1. Let us assume a = a 0 , a 1 , . . . a n−1 , an array of FP numbers that satisfy |a i | ≤ 2 −i(p−1)+δ |a 0 |, for some (presumably small) integer δ and |t| ≤ 2 −p+ |a 0 |, for some (presumably small) integer . If s = s 0 , s 1 , . . . , s n is the sum obtained by adding t to a as shown in Fig. 4 (from left to right, propagating the error), then all the terms in s satisfy:
Proof: From the proof of the algorithm 2Sum we know that
we get
It follows that
From which we deduce
We can continue, by noticing that |e 1 | is bounded by 2 −p |s 1 |, and bounding |a 2 | by 2
This gives a bound on |e 1 + a 2 |, and a bound on s 2 is obtained by multiplying that last bound by (1 + 2 −p ). An easy induction finally gives
with
One easily finds
Denote u = 2 −p . In all practical cases ≥ 2 and δ ≥ 0, so that H i ≤ G i , with
We have,
The only positive term (after the initial "1") in that sum is u(u+6) 1−4u , which is less than 7u for all pertinent values of u = 2 −p . Hence G i < 1 as soon as 2 i+1 ≤ 2 p /7, which occurs in all practical cases. This gives
This concludes our proof. Hence, in the array of Fig. 1 , δ is increased by 1 at each line. For instance, if we add two order-n expansions (i.e., we use 2n − 1 lines in the Array of Fig. 1, then the terms s 0 , s 1 , . . . of the result satisfy
and the expansion obtained by keeping the first n terms only represents the sum with an error less than
i.e., with a relative error bounded by a value slightly larger than 2 −np+3n−1 . We would like to also mention here a faster, relaxed version of the above algorithm, that requires at most R − 1 steps in order to compute the result (the last step using only simple addition). This algorithm (Fig. 5) offers a worse error bound and it does not ensure the correct result when cancellation occurs, if no re-normalization algorithm is applied on the result. It performs best when FP expansions of the same sign and close magnitudes are added. We recommend using it for fast computations that can be validated a-posteriori. 
IV. DATA-PARALLEL MULTIPLICATION ALGORITHM FOR FP EXPANSIONS
In general, an algorithm that performs the multiplication of two expansions x and y with n and m terms, respectively, will return a FP expansion with at most 2nm terms [1] , which, as in the case of addition, implies an increase in the number of terms.
The algorithm that we present (Algorithm 2) computes an approximation of x · y, where x and y are two parallel expansion. Here we also present just the "input-R-output-R" version.
We consider two parallel FP expansions x and y, each with R terms and we compute the R most significant FP components of the product π = x · y. We use the following intuition: let ε = 1 2 ulp(π 0 ), then roughly speaking, if π 0 is of order of O(Λ), then e 0 is of order O(εΛ). This means that for each product (p, e) = 2MultFMA(x i , y j ), p is of order O(ε i+j Λ) and e of order O(ε i+j+1 Λ). We truncate the result on-the-fly, by considering only the terms for which 0 ≤ i + j ≤ R − 1, since the smaller terms have an order of magnitude much smaller and usually they will not influence the result. (p, e) ← 2MultFMA(x, y )
6:
(s, e ) ← 2Sum(s, p)
7: while e = 0 do 10:
(s, e) ← 2Sum(s, e) The multiplication algorithm runs as follows: at each iteration i of the for loop (lines 3-16) we compute p+e = x·y; we add p to the result of the same order, using an EFT, which also generates an error, e . After that, using the two while loops (lines 8 − 11 and 12 − 15) we propagate the two generated errors, e and e to the lower order results. In the last step of the algorithm, we do not use any EFT, because the errors that are supposed to be computed are going to be of order O(ε R Λ), and we do not need to propagate them anymore. This algorithm has the same behavior as the sequential algorithm presented in Fig. 6 , but a graphical representation of the parallel one would be too difficult to read.
While the exact product xy of two FP expansions x and y is computed as R−1;R−1 i=0;j=0
k=0 i+j=k x i y j , in this algorithm we "truncate" it on-the-fly by computing and adding only the relevant part of the scalar products (the first R k=1 k individual products) and after that outputting the R most significant components in the result. In what follows we compute the error bound in two steps: first we compute the error generated by "truncating" the partial products, and second we compute the error given by the discarded errors.
Proposition IV.1. (Error bound on the truncated products) Let x and y be two ulp-nonoverlapping FP expansions, with R terms. If, when computing the product xy we "truncate" the operations by computing and adding only the most significant individual products, the first 
Proof:
The maximum error given by the discarded products satisfies:
We now consider the function φ(α) = ∞ k=0 (R − 1 − k)α k and we get:
Using φ(2 −(p−1) ), we obtain: (1−2 −(p−1) ) 2 is negative and very small we conclude our proof by:
Proposition IV.2. (Final error bound) Let x and y, two ulpnonoverlapping FP expansions, with R terms. When computing π, an approximation of xy to R terms, as shown in Algorithm 2 and Fig. 6 , the result satisfies:
Proof: From the definition we know that |x i | ≤ 2 −i(p−1) |x 0 | and |y j | ≤ 2 −j(p−1) |y 0 |, so we can deduce
For computing π 0 we use only 2MultFMA(x 0 , y 0 ), and we get |π 0 | ≤ |x 0 y 0 | (1 + 2 −p ). For computing π 1 we use VecSum#1, with 3 entries of order O(εΛ): the error from the previous step, and two partial products, which are less than 2 −(p−1) |x 0 y 0 | (1 + 2 −p ). It is easily seen that all these entries are bounded by the same value. We define the following notation:
3 and the outputted errors are less than
For computing π 2 we use VecSum#2 which is going to have 7 entries of order O(ε 2 Λ): 2 errors outputted by VecSum#1, bounded by Ω 2 ; 2 errors from the previous step's partial products, which are less than 2 −2p+1 |x 0 y 0 | (1 + 2 −p ); and 3 partial products, less than 2 −2(p−1) |x 0 y 0 | (1 + 2 −p ). We observe once more that all the entries are less than Ω 2 .
For the induction step we consider VecSum#i − 1. For computing π i−1 we have (i−1)
2 +i entries, which we assume are all less than Ω i−1 . It follows:
and also, the largest error term outputted and implicitly all the others are less than This implies that all error terms are less than:
(n 2 − n + 1);
and this last value will define Ω i . This implies that, since we use only simple summation for computing π R−1 , in the last step we neglect R 2 + R terms, all less than Ω R .
We also have to account for the errors that occur when computing the last partial products using only simple multiplication. This means R − 1 terms less than 2 −R(p−1)+p−2 |x 0 y 0 | (1 + 2 −p ) When adding all these errors with the one given in (5) we get the following bound:
And this concludes our proof Unfortunately, for the multiplication algorithm we are unable to prove any constraints on the terms of the result. Even though cancellation cannot happen when multiplying two FP numbers, it may happen during the summation process, in which case we can get |π i | < |π j |, with i < j. If this happens we can apply a re-normalization algorithm, like the one presented in [11] , in order to render the result nonoverlapping. Since this implies adding a sequential step at the end of Algorithm 2, slowing it's performance, we recommend using this algorithm only if computations are known not to be cancellation-prone or if the result can be verified a-posteriori.
V. WARP-SYNCHRONOUS GPU IMPLEMENTATION GPUs are highly multi-threaded SIMD architectures [12] . They are programmed in languages like CUDA and OpenCL, that expose a pure multi-thread programming model. Programmers describe compute kernels as a single program run by many fine-grained threads. The compiler and hardware group these threads into so-called warps, containing 32 threads on current Nvidia architectures. Threads inside a warp run in lockstep and share a single control flow, and their instructions are executed on SIMD units, with one thread per lane. This implicit SIMD model is equivalent to explicit SIMD: a GPU program can be also understood as computations on vectors from the point of view of a warp. This enables direct implementation of the data-parallel algorithms we propose. Recent additions to the available hardware primitives make this warp-synchronous programming style particularly efficient [10] . Our implementation targets GPUs with compute capability 3.0 or above, such as Kepler and Maxwell architectures, that support warp vote and shuffle instructions. The code was written in CUDA C, using double-precision numbers (i.e. p = 53).
Warp vote instructions perform boolean reductions across all threads within a warp. For instance, they can check whether a condition holds for all the threads, or for any of the threads of the warp. The __any function computes the logical OR of a warp-sized vector of predicates and broadcast it to all elements. We use it to implement the loop exit conditions of the FP expansion multiplication in Algorithm 2.
Warp shuffle instructions allow arbitrary communication between threads in a warp, without having to go through memory. They are analogous to shuffle or permute instruction in explicit SIMD instruction sets [9] . We use them to shift vector components to propagate the errors across expansion terms, and to insert and extract scalar values inside vectors.
• shfl_up(x, n, R) and shfl_down shift components respectively upward or downward by n positions within each R-element vector; • shfl reads an arbitrary vector component within each vector lane. As mentioned in Section III, similar shuffle and permute operations are available in explicit SIMD instruction sets such as Intel AVX. GPUs that do not support the shuffle instruction can also exchange data through the OpenCL local memory at a cost in performance.
We illustrate the warp-synchronous implementation of Algorithm 2 for FP expansion multiplication in Fig. 7 . The code appears from a single thread's perspective, but it runs in parallel and it takes decisions based on the vector lane within an expansion (i.e. threadIdx.x). Although the hardware only support shuffling 32-bit data, we implemented shuffle instructions on the double type by shuffling each half separately.
Although we present here a version of the code that is parameterized by only one parameter, R, our actual implementation uses different template parameters for inputs and output, meaning that we allow static generation of any inputoutput precision combinations.
We exploit both the parallelism that exist between expansion terms and across different expansions. To benefit from the SIMD execution and intra-warp communication primitives, all terms in a given expansion have to be computed by threads of the same warp. As warps have 32 threads on Nvidia architectures, the maximal supported expansion size is 32. Smaller expansions are packed together inside warps. Although this approach works with any expansion size R between 1 and 32 using appropriate padding, we recommend using power of two sizes, which allow filling the whole warp.
VI. COMPARISON AND DISCUSSION
In this section we present performance measurements obtained on a Tesla K40 GPU (Kepler architecture), using the CUDA 7.5 software architecture. The code also runs on the newer Maxwell architecture, although the performance of all algorithms becomes limited by the lower double-precision performance of current Maxwell-based GPUs. We measure throughput on embarrassingly-parallel computations. The values are given in million of operations per second (Mop/s). By one op we understand one operation using extended precision. The tests were done using random generated examples, running on 1024 blocks each with 1024, 512 or 256 execution threads, depending on the expansion size and the required resources to run the algorithms.
To analyze the effect of parallelism on memory footprint, we consider two different shared memory usage scenarios: one best case that assumes the application uses no intermediate data outside of the registers used for the computation, and one worst case where the application uses 256 bytes of CUDA shared memory for each term of the expansion.
For comparison, we report to Bailey's GQD library [3] and CAMPARY [11] , which offer extended precision using FP expansions on GPU. The arithmetic algorithms they employ are not by themselves parallel. Moreover, the GQD library is limited to double-double and quad-double precisions, and the algorithms used in the implementation are not provided with any correctness proof and there is no specific error bound given. CAMPARY is a recent software, that comes with correctness proofs and results guaranteed within a certain error bound.
In Table I we show the addition algorithm's performance for the best-case, no-memory configuration, followed by the performance obtained in the memory-constrained configuration, in Tables II. Even in the worst-case embarrassingly-parallel setup, the performance of data-parallel addition algorithm is competitive with the best known sequential algorithms for smaller expansions: the parallelism comes at little cost in number of operations per expansion.
For the multiplications algorithm we asses performance in Tables III and IV , for the best case setting and for the memory constrained one, respectively. From the data obtained in the best case scenario we observe that addition on larger expansions and multiplications suffer from parallelization overhead. The benefits of exploiting the parallelism available within each expansion are fully realized when parallelism is constrained by internal memory usage. The performance of data-parallel algorithms remains stable in this setup, while the performance of sequential algorithm decreases sharply with memory usage. Although the QD library remains faster on expansions of size 2 (double-double), the dataparallel algorithms significantly outperform their sequential counterparts for all larger expansions. The performance gap increases with the expansion size, eventually reaching an order of magnitude for 32-term expansions.
VII. CONCLUSION
We presented data-parallel algorithms for addition and multiplication of floating-point expansions. By taking advantage of data parallelism within FP expansions, they are suitable for SIMD architectures. We present fast addition and multiplication algorithms, as well as a safe addition algorithm with rigorous error bounds. A GPU implementation of the algorithms using warp-synchronous programming in CUDA is already competitive with state-of-the-art sequential algorithm in an idealistic embarrassingly parallel setup. However, dataparallel algorithms really shine in the more realistic case of applications that manipulate a sizable amount of intermediate data, significantly outperforming sequential algorithms for expansions of size 4 and greater.
