The IEEE 754-2008 standard recommends the correct rounding of some elementary functions. This requires solving the Table Maker's Dilemma (TMD), which implies a huge amount of CPU computation time. In this article, we consider accelerating such computations, namely the Lefèvre algorithm on graphics processing units (GPUs), which are massively parallel architectures with a partial single instruction, multiple data execution.
Fig. 1. Example of undetermined correct rounding for rounding to nearest, where the rounding breakpoints are the midpoints of floating-point numbers.
Furthermore, the IEEE 754-2008 standard recommends correct rounding of some elementary functions, such as log, exp, and the trigonometric functions. However, it is hard to decide which intermediate precision is required to guarantee a correctly rounded result for an elementary function: the rounded evaluation of the approximation must be equal to the rounded evaluation of the function with infinite precision. This problem is known as the Table Maker's Dilemma (TMD) [Muller et al. 2009].
State of the Art
There exist theoretical bounds on the intermediate precision required for correctly rounded functions [Muller et al. 2009 ], but these are not sharp enough for efficient floating-point implementations of elementary functions. For example, the Nesterenko and Waldschmidt [1996] bound for the exponential in double precision states that 7,290,678 bits of intermediate precision suffice to provide a correctly rounded result. However, according to probabilistic hypotheses, we expect this intermediate precision to be slightly more than twice the working precision (106 bits for double precision) [Muller et al. 2009 ]. Hence, ad hoc methods are needed to find a sharper bound for each function.
A first method, introduced by Ziv [1991], is to compute an approximation y of a function value f(x) with an error bound ε such that |y − f(x)| ≤ ε (covering mathematical and round-off errors). As rounding modes are monotonic, if y − ε and y + ε round to the same floating-point number, then f(x) does as well; otherwise, the correct rounding cannot be determined from y alone (Figure 1). Therefore, a correctly rounded result of f(x) can be obtained by refining the approximation (y, ε), decreasing ε until y − ε and y + ε round to the same floating-point number. For the most common elementary functions, such an ε exists according to the Lindemann-Weierstrass theorem, when the function is evaluated at almost all floating-point numbers [Galochkin 2011].
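For illustration, the stopping criterion of one Ziv iteration can be sketched as follows (an illustrative host-side sketch in which a double-precision approximation is rounded to single precision; names and precisions are ours, not Ziv's implementation):

```cpp
#include <cstdio>

// Returns true if the correct rounding of f(x) to float can already be
// decided from the approximation y with absolute error bound eps.
bool ziv_test(double y, double eps) {
    // Rounding is monotonic: if both ends of [y - eps, y + eps]
    // round to the same float, then f(x) rounds to it as well.
    return static_cast<float>(y - eps) == static_cast<float>(y + eps);
}

int main() {
    double y = 0.1000000001;             // approximation of some f(x)
    printf("%d\n", ziv_test(y, 1e-12));  // 1: rounding decided
    printf("%d\n", ziv_test(y, 1e-2));   // 0: refine (y, eps) and retry
}
```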
However, precomputing an ε that guarantees correct rounding of the evaluation of f at any floating-point argument allows one to guarantee that the Ziv method will stop after no more than a few iterations and to simplify the implementation by avoiding the need to write unneeded high-precision Ziv iterations. This can be done by finding the hardest-to-round arguments of the function, that is to say, the arguments requiring the highest precision for the evaluation of f(x) to be correctly rounded. The precision guaranteeing correct rounding for all arguments is called the hardness-to-round of the function. The hardest-to-round cases can be found by invoking the Ziv algorithm at every floating-point number in the domain of definition of the function, but this is prohibitive (some multiple of 2^p operations when considering precision-p floating-point numbers as arguments).
The first improvement was proposed by Lefèvre et al. [1998] (the Lefèvre algorithm). The main idea of their algorithm is to split the domain of definition into several domains D_i, to "filter" hard-to-round cases (HR-cases) for a fixed extended precision, and then to use the Ziv algorithm to find the hardest-to-round cases among them. This filtering is efficiently performed using local affine approximations of the targeted function over O(2^{2p/3}) domains D_i. Stehlé et al. [2005] extended this method in 2003 (the SLZ algorithm) to higher-degree approximations, using the Coppersmith method for finding small roots of a univariate modular equation over O(2^{p/2}) domains D_i.
Motivations and Contributions
Even if they are asymptotically and practically faster than an exhaustive search, the Lefèvre and SLZ algorithms remain very computationally intensive. For example, the Lefèvre algorithm required around 5 years of CPU time for the exponential function over all double-precision arguments, and the SLZ algorithm took around 9 years of CPU time for the function 2^x over extended double-precision arguments in the interval [1/2, 1] [Stehlé et al. 2003]. Moreover, even if the hardest-to-round cases of some functions in double precision are known [Muller et al. 2009], this is still not the case for about half of the univariate functions recommended by the IEEE standard. Furthermore, we still have no efficient way to find the hardest-to-round cases of any elementary function in double precision, and quadruple precision is out of reach. We will hence be interested in accelerating the search for hardest-to-round cases in double precision (binary64).
As both algorithms split the domain of definition of the targeted function into domains D i and search for HR-cases in them independently, these computations are embarrassingly and massively parallel. The purpose of this work is therefore to accelerate these computations on graphics processing units (GPUs), which theoretically perform one order of magnitude better than CPUs on suitable problems thanks to their massively parallel architectures.
We will focus here on the Lefèvre algorithm, which has been used to generate hardness-to-round values for all functions for which they are known, for double precision [Muller et al. 2009]. The Lefèvre algorithm is asymptotically less efficient than SLZ, as it considers more domains D_i (O(2^{2p/3}) against O(2^{p/2})). However, it performs fewer operations per domain D_i (O(log p) against O(poly(p))). Therefore, the Lefèvre algorithm is more efficient than SLZ in practice for finding the hardness-to-round of elementary functions for the double-precision format [Muller et al. 2009] and offers fine-grain parallelism, making it suitable for a GPU.
In Fortin et al. [2012], we discussed implementation techniques to deploy the original Lefèvre algorithm efficiently on GPUs, which led to an average speedup of 15.4× with respect to the reference CPU implementation on one CPU core. The major bottleneck of this GPU deployment was control-flow divergence, which is costly given the partial single instruction, multiple data (SIMD) execution of the GPU. General hardware [Brunie et al. 2012] and software [Frey et al. 2012; Han and Abdelrahman 2011] solutions have been proposed to address this problem on GPUs. However, these solutions are not efficient in our context, as we have a very fine computation grain for each GPU thread. Hence, we focus here on algorithmic solutions to directly tackle the origin of this divergence issue.
In this article, we thus redesign the Lefèvre algorithm with the continued fraction formalism, which enables us to get a better understanding of it and to propose a much more regular algorithm for searching HR-cases. More precisely, we strongly reduce two major sources of divergence in the Lefèvre algorithm: loop divergence and branch divergence. We also propose an efficient hybrid CPU-GPU deployment of the generation of the polynomial approximations over the domains D_i, using fixed-size multiprecision operations on the GPU. These contributions enable an overall speedup of 53.4× on a GPU over Lefèvre's original sequential CPU implementation and of 7.1× over six CPU cores (with two-way SMT). Finally, as we obtain in the end the same HR-cases as the experiments of Lefèvre, de Dinechin, and Muller [de Dinechin et al. 2011; Lefèvre and Muller 2001], we also strengthen the confidence in the generated HR-cases.
Outline
We first introduce some notions on GPU architecture and divergence in Section 2. Then in Section 3, we present some mathematical background on the TMD and properties of the set {a · x mod 1 | x < n}, with a fixed and x a positive integer. In Section 4, we detail the HR-case search step of the Lefèvre algorithm and of the new and more regular algorithm, along with their deployment on a GPU. In Section 5, we detail how to efficiently generate on a GPU the polynomial approximations over the domains D_i needed by the two HR-case searches. Finally, we present performance results in Section 6 and conclude in Section 7.
GPU COMPUTING
GPUs are many-core devices originally intended for graphics computations. However, since the mid-2000s, they have become increasingly used for high-performance scientific computing, since their massively parallel architectures theoretically perform one order of magnitude better than CPUs, and since general-purpose languages adapted to GPUs (e.g., CUDA [NVIDIA 2012a] and OpenCL [Khronos Group 2011]) have emerged. In this section, we briefly describe the architecture of the NVIDIA GPU used to test our deployments (the Fermi architecture), GPU programming in CUDA, and the divergence problems arising from the partial SIMD execution on the GPU. Here we use the CUDA nomenclature.
GPU Architecture and CUDA Programming
From a hardware point of view, a GPU is composed of several streaming multiprocessors (SMs; there are 14 SMs on Fermi C2070), each being an SIMD unit [NVIDIA 2011 ]. An SM is composed of multiple execution units or CUDA cores (32 on Fermi) sharing the same pipeline and many registers (32,768 on Fermi). GPU memory is organized into two levels: device memory, which can be accessed by any SM on the device, and shared memory, which is local to each SM. The device memory accesses are cached on the Fermi architecture.
From a software point of view, the developer writes in CUDA a scalar code for one function designed to be executed on the device, namely a kernel. At runtime, many threads are created by blocks and bundled into a grid to run the same kernel concurrently on the device. Each block is assigned to an SM. Within each block, threads are executed in groups of 32, called warps. The ratio of the number of resident warps (the number of warps that an SM can process at the same time for a given kernel) to the maximum number of resident warps per SM is called the occupancy. To increase the occupancy, the number of blocks and their sizes have to be tuned.
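As an illustration of this execution model, a minimal kernel and its launch configuration might look as follows (illustrative placeholder computation; block size and names are ours):

```cuda
#include <cstdio>

// Minimal kernel: each thread processes one array element.
__global__ void process(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n)
        out[i] = in[i] * 2.0;  // placeholder per-thread work
}

int main() {
    int n = 1 << 20;
    double *in, *out;
    cudaMalloc(&in, n * sizeof(double));
    cudaMalloc(&out, n * sizeof(double));
    int block = 256;                     // threads per block (8 warps)
    int grid = (n + block - 1) / block;  // enough blocks to cover n
    process<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
}
```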
Divergence
As threads are executed by warps on the GPU SIMD units, applications should have regular patterns for memory accesses and control flow.
The regularity of memory access patterns is important to achieve high memory throughput. As the threads within a warp load data from memory concurrently, the developer has to coalesce accesses to device memory and avoid bank conflicts in the shared memory (Chapter 6 in NVIDIA [2012a] ). This can be done by reorganizing data storage.
The regularity of control flow is important to achieve high instruction throughput and is obtained when all threads within a warp execute the same instruction concurrently (Chapter 9 in NVIDIA [2012a]). In fact, when the threads of the same warp diverge (i.e., they follow different execution paths), the different execution paths are serialized. For an if statement, the then and else branches are serially executed. For a loop, any thread exiting the loop has to wait until all threads of its warp exit the loop. In the following, we will distinguish branch divergence due to if statements and loop divergence due to loop statements.
The impact of branch divergence can be statically estimated by counting the number of instructions issued within the scope of the if statement. Let us consider the case where the then branch issues n_then instructions and the else branch issues n_else instructions. If the warp does not diverge, either n_then or n_else instructions are issued depending on the evaluation of the condition. If the warp diverges, n_then + n_else instructions are issued.
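For instance, in the following illustrative kernel, every warp diverges on the parity test, so each of its threads pays for both branches:

```cuda
// Sketch of branch divergence within a warp (illustrative kernel).
__global__ void divergent(int *data) {
    int i = threadIdx.x;
    // Adjacent threads take different paths: the two branches are
    // serialized, so n_then + n_else instructions are issued per warp.
    if (i % 2 == 0)
        data[i] = data[i] * 3;  // "then" branch
    else
        data[i] = data[i] + 7;  // "else" branch
}
```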
Unlike branch divergence, measuring the impact of loop divergence requires a dedicated indicator and profiling. In Fortin et al. [2012], we introduced the mean deviation from the maximum of a warp. This indicator is similar to the standard deviation, which is the mean deviation from the mean value. However, as the number of loop iterations issued for a warp is equal to the maximum number of loop iterations issued by any thread within the warp, it is relevant to consider the mean deviation from the maximum value. This gives the mean number of loop iterations a thread remains idle within its warp. More formally, we denote ℓ_i as the number of loop iterations of the thread i and number the threads within a warp from 1 to 32. If L = {ℓ_i, i ∈ ⟦1, 32⟧}, the mean deviation from the maximum (MDM) of a warp is defined as

MDM(L) = max(L) − (1/32) · Σ_{i=1}^{32} ℓ_i.

We can normalize the MDM by max(L) to compute the average percentage of loop iterations for which a thread remains idle within its warp. Hence, the normalized mean deviation from the maximum (NMDM) is

NMDM(L) = MDM(L) / max(L).
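Both indicators are straightforward to compute from a profile of the per-thread iteration counts (host-side sketch):

```cpp
#include <algorithm>
#include <numeric>

// MDM and NMDM of a warp from the per-thread loop iteration counts
// l[0..31], following the definitions above.
double mdm(const int l[32]) {
    int mx = *std::max_element(l, l + 32);
    double mean = std::accumulate(l, l + 32, 0.0) / 32.0;
    return mx - mean;  // mean number of idle iterations per thread
}

double nmdm(const int l[32]) {
    int mx = *std::max_element(l, l + 32);
    return mx ? mdm(l) / mx : 0.0;  // fraction of iterations spent idle
}
```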
MATHEMATICAL PRELIMINARIES
In this section, we give some definitions to introduce the TMD more formally. We also recall some known properties on the distribution of the elements of the set {a · x mod 1 | x < n} with a fixed and x a positive integer [Slater 1967] , as well as the corresponding continued fraction formalism.
The TMD
Before defining the TMD, we introduce some notations and definitions. We denote {X} or X mod 1 as the positive fractional part of X. We write X cmod 1 for the centered modulo, which is the real Y such that X − Y ∈ Z and Y ∈ (−1/2, 1/2] (Y equals X − ⌊X⌋ or X − ⌈X⌉, depending on which has the lowest absolute value). We also write F_p for the set of precision-p floating-point numbers and |E|_p for the number of precision-p floating-point numbers in the set E (namely, |E ∩ F_p|). We recall that we consider only binary floating-point formats in this article (even though all definitions can be generalized to any radix).
Definition 3.1. The significand m(x) and the exponent e(x) of a nonzero real number x are defined by |x| = m(x) · 2^{e(x)} with 1/2 ≤ m(x) < 1. We will also denote M_p(x) = 2^p · m(x) as the scaled significand.
Definition 3.2. The scaled distance between a real number x and the closest precision-p floating-point number is defined by dist_p(x) = |M_p(x) cmod 1|.

Definition 3.3. Given a precision p and ε > 0, a floating-point number x ∈ F_p is a (p, ε) hard-to-round case (HR-case) of f if dist_p(f(x)) < ε.

The given definition of an HR-case only applies to directed rounding. However, this definition can be extended to all IEEE 754 rounding modes, as rounding-to-nearest (p, ε) HR-cases are directed rounding (p + 1, 2ε) HR-cases. To simplify notations, we will then focus on directed rounding HR-cases. It has to be noticed that if x is a (p, ε) HR-case, it also satisfies (M_p(f(x)) + ε) mod 1 < 2ε. The latter inequality is used to test whether an argument is a (p, ε) HR-case, as it avoids the computation of absolute values and cmod.
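For illustration, these quantities and the HR-case test can be sketched in plain double precision as follows (illustrative only: the actual computations require multiprecision arithmetic, since M_p(x) carries p bits):

```cpp
#include <cmath>

double frac(double x) { return x - std::floor(x); }  // {X} in [0, 1)

double cmod1(double x) {                             // X cmod 1 in (-1/2, 1/2]
    double y = frac(x);
    return (y > 0.5) ? y - 1.0 : y;
}

// dist_p(x) = |M_p(x) cmod 1| with M_p(x) = 2^p * m(x) (Defs. 3.1-3.2).
double dist_p(double x, int p) {
    int e;
    double m = std::frexp(std::fabs(x), &e);  // m(x) in [1/2, 1)
    return std::fabs(cmod1(std::ldexp(m, p)));
}

// Directed-rounding HR-case test, in the mod-1 form used above:
// x is a (p, eps) HR-case of f iff {M_p(f(x)) + eps} < 2*eps.
bool is_hr_case(double fx, int p, double eps) {
    int e;
    double m = std::frexp(std::fabs(fx), &e);
    return frac(std::ldexp(m, p) + eps) < 2.0 * eps;
}
```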
Hence, a (p, 2^{−p'}) HR-case x is a precision-p floating-point number for which f(x) is at a scaled distance (as defined in Definition 3.2) less than 2^{−p'} from the closest precision-p floating-point number. In other words, more than p + p' bits of accuracy are necessary to correctly round f(x) at precision p.

Definition 3.4 (Table Maker's Dilemma). Given a real-valued function f defined over a domain D and a precision p, the TMD is defined as the problem of finding the smallest integer p' such that there are no (p, 2^{−p'}) HR-cases in D.
We call hardest-to-round cases the arguments x ∈ D minimizing dist_p(f(x)). Knowing the hardest-to-round cases gives us a lower bound on the distances between the function f and the rounding breakpoints (Figure 2) and therefore a solution to the TMD, as they are (p, 2^{−p'−1}) HR-cases. The general method to find the hardest-to-round cases of a function is the following:
(1) Fix a "convenient" ε using probabilistic assumptions [Muller et al. 2009].
(2) Find the (p, ε) HR-cases with ad hoc methods such as the Lefèvre or the SLZ algorithms.
(3) Find the hardest-to-round cases among the (p, ε) HR-cases using the Ziv [1991] method.
The most compute-intensive step in this method is to find the (p, ε) HR-cases. To this purpose, the Lefèvre and SLZ algorithms both rely on the following three major steps:
The generation of polynomial approximations: We approximate the function f(X), with X ∈ D_i, by polynomials P_i(x) with x ∈ ⟦0, |D_i|_p − 1⟧. We thus proceed to a change of variable, enabling us to test the floating-point arguments X ∈ D_i by testing the integers x ∈ ⟦0, |D_i|_p − 1⟧. Each polynomial P_i is first centered on the domain D_i by applying the change of variable φ_1 : X → X − X_i. Then, as the exponent is constant over each domain D_i, we consider integer arguments by applying the change of variable φ_2 : X → 2^{p−e(X_i)} · X.
The HR-case search: We find the (p, ε') HR-cases of P_i, with ε' = ε + ε_approx, which comprise all (p, ε) HR-cases of f in D_i.
In the HR-case search of both algorithms, a Boolean test is used to isolate HR-cases. It succeeds if there are no (p, ε') HR-cases for P_i in D_i and fails otherwise. It has to be noticed that ε' should be small enough that very few HR-cases exist in each domain to test (based on probabilistic assumptions [Muller et al. 2009]).
In this article, we focus on the Lefèvre algorithm, which truncates the polynomials P_i to degree 1 for the Boolean test. We denote Q_i(x) = P_i(x) mod x^2 as the truncation of P_i to degree 1. As the Boolean test requires a fixed scaled error bound over each domain to test against, the coefficients of Q_i and the error bound ε' are scaled so that Q_i(x) can be written b − a · x with a, b ∈ [0, 1). A consequence of this scaling is to create false HR-cases where a change of exponent occurs in the codomain; these false HR-cases can easily be eliminated subsequently. Hence, the Boolean test of the Lefèvre algorithm consists of testing whether the following inequality holds:

min{b − a · x mod 1 | x < |D_i|_p} < ε'.   (1)
More precisely, if the inequality (1) does not hold, then there are no (p, ε') HR-cases for Q_i in D_i, which implies that there is no (p, ε') HR-case for P_i in D_i. Therefore, the Boolean test returns Failure if the inequality (1) holds, and Success or Failure (false HR-case) if it does not hold. In the case of Failure, further tests are needed, and in the case of Success, one can deduce that there are no HR-cases in the domain. Moreover, we remark that computing the minimum of the set {b − a · x mod 1 | x < |D_i|_p} is similar to finding the multiple of a that is the closest to the left of b modulo 1 on the unit segment [0, 1).
Properties of the Set {a · x mod 1 | x < n}
Here we will detail some properties on the configurations of the points {a · x mod 1 | x < n} over the unit segment, with x a nonnegative integer. These properties are necessary to efficiently locate the closest point to b mod 1 in these configurations.
THEOREM 3.5 (THREE DISTANCE THEOREM [SLATER 1950] ). Let 0 < a < 1 be an irrational number. If we place on the unit segment [0, 1) the points {0}, {a}, {2a}, . . . , {(n− 1)a}, then these points partition the unit segment into n intervals having at most three lengths with one being the sum of the two others.
Actually, the lengths, and the distribution of these lengths, heavily rely on the continued fraction expansion of a [Slater 1967], which we will denote by

a = 1/(k_1 + 1/(k_2 + 1/(k_3 + ⋯))) = [0; k_1, k_2, k_3, …].

We denote by (θ_i)_{i∈N} the sequence of the remainders computed during the continued fraction expansion of a using the Euclidean algorithm, and by (p_i/q_i)_{i∈N} the sequence of the convergents of a, defined by the following recurrence relations:

p_i = k_i · p_{i−1} + p_{i−2},   q_i = k_i · q_{i−1} + q_{i−2},   θ_i = θ_{i−2} − k_i · θ_{i−1}.

The sequence (θ_i)_{i∈N} is a decreasing real-valued positive sequence, whereas (p_i)_{i∈N} and (q_i)_{i∈N} are increasing integer-valued positive sequences. We also define the intermediate lengths θ_{i,t} = θ_{i−1} − t · θ_i and counts q_{i−1,t} = q_{i−1} + t · q_i, for t ∈ ⟦0, k_{i+1}⟧.
The lengths obtained when adding multiples of a over the unit segment are therefore the elements of the sequence (θ_{i,t})_{i∈N, t∈⟦0,k_{i+1}⟧} [Slater 1967]. An example is provided in Figure 3 and Table I. All of the properties provided in this section are valid when a is irrational. However, they are also valid for a rational a as long as θ_i ≠ 0 (that is to say, until the last quotient of the continued fraction expansion is computed).
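These configurations can be enumerated with exact integer arithmetic; the following host-side sketch reproduces the (i, 0) configurations of the a = 14/45 example of Figure 3 (all lengths are scaled by the denominator 45):

```cpp
#include <cstdio>

// Enumerate the (i, 0) two-length configurations for a = 14/45, using
// the recurrences of Section 3.2. Scaling by 45 keeps the arithmetic
// exact; at each step, q*tp + qp*t == 45 (the whole unit segment).
int main() {
    int tp = 45, t = 14;  // theta_{i-1}, theta_i (scaled by 45)
    int qp = 0, q = 1;    // q_{i-1}, q_i
    for (int i = 0; t != 0; ++i) {
        // Configuration (i, 0): n = q_i + q_{i-1} points placed,
        // q_i intervals of length theta_{i-1}, q_{i-1} of theta_i.
        printf("(%d,0): n=%2d, %2d intervals of %2d, %2d of %2d\n",
               i, q + qp, q, tp, qp, t);
        int k = tp / t;       // next partial quotient k_{i+1}
        int tn = tp - k * t;  // theta_{i+1} = theta_{i-1} - k*theta_i
        int qn = k * q + qp;  // q_{i+1} = k*q_i + q_{i-1}
        tp = t; t = tn; qp = q; q = qn;  // advance to (i+1, 0)
    }
}
```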
In the following, we will use some properties of the configurations {a · x mod 1 | x < n} that contain intervals of only two different lengths. They are of algorithmic interest, as there are only O(log n) such configurations when n tends to infinity. Each label (i, t), with i ∈ N and t ∈ ⟦0, k_{i+1}⟧, denotes one two-length configuration, which satisfies the following equation:

n = q_i + q_{i−1,t} = q_{i−1} + (1 + t) · q_i.   (2)
Equation (2) gives details on the number of intervals of each length. After adding q_i + q_{i−1,t} multiples of a mod 1 over the unit segment, there are exactly two different lengths of intervals over the unit segment: q_i intervals of length θ_{i−1,t} and q_{i−1,t} intervals of length θ_i. A special and noticeable subset of the two-length configurations corresponds to the configurations produced using the division-based Euclidean algorithm. These are the (i, 0) configurations, satisfying

n = q_i + q_{i−1,0} = q_i + q_{i−1}.   (3)
Furthermore, we have a way to construct the two-length configurations, which follows.
PROPERTY 1 (TWO-LENGTH CONFIGURATIONS CONSTRUCTION [SLATER 1967]). Given the (i, t) two-length configuration, the next two-length configuration is obtained by adding q_i further multiples of a, which split each of the q_i intervals of length θ_{i−1,t} into one interval of length θ_i and one of length θ_{i−1,t+1}.
To simplify the notation, given the (i, t) configuration, we will write (i, t + 1) for the next two-length configuration and assimilate the configuration (i, k_{i+1}) to (i + 1, 0). Property 1 implies that for going from a two-length configuration to the next one, the intervals of length θ_{i−1,t} are split. The way intervals are split is described by the following directed reduction property and illustrated in Figure 3.

Fig. 3. Example of {a · x mod 1 | x < n} configurations generated by a = 14/45. The unit segment is scaled by a factor of 45 for clarity, so the lengths θ_i and θ_{i−1,t} appear as integers. Each two-length configuration is labeled by its index (i, t) on the right. Each segment is labeled by its length above, and each multiple of a is labeled by its index below.
PROPERTY 2 (DIRECTED REDUCTION [VAN RAVENSTEIN 1988]). Given the two-length configuration (i, t), when constructing the next two-length configuration, intervals of length θ_{i−1,t} are split into two intervals in this order from left to right:
- one of length θ_i and one of length θ_{i−1,t+1} if i is even,
- one of length θ_{i−1,t+1} and one of length θ_i if i is odd.

HR-CASE SEARCH ON THE GPU
In this section, we describe two algorithms for HR-case search: the Lefèvre [2005] HR-case search, as originally described, and our new HR-case search, which is more regular in the sense of reducing divergence. Both algorithms make use of Boolean tests that rely on the properties described in Section 3.2. Hence, we will describe both of them with continued fraction expansions, which give a uniform formalism to explain and compare their different behaviors. Then we will describe how they have been deployed on the GPU and the benefit on divergence provided by our new algorithm.

The Lefèvre HR-Case Search

Lefèvre [2005] presented an algorithm to search for (p, ε') HR-cases of a polynomial P_i(x). This algorithm relies on a Boolean test on Q_i(x) (the truncation of P_i(x) to degree 1) that computes a lower bound of the set {b − a · x mod 1 | x < |D_i|_p} and returns Failure if the inequality (1) holds, and Failure or Success otherwise.
In Section 3.2, we described some properties of the configurations {a · x mod 1 | x < n}. According to these properties, computing the interval lengths of a subset of the two-length configurations can be done efficiently in O(log |D_i|_p) arithmetic operations by computing the continued fraction expansion of a [Brent and Zimmermann 2010]. However, if we use the continued fraction expansion, we will place more points than |D_i|_p on the unit segment (at most 2 · |D_i|_p if we use the subtraction-based Euclidean algorithm). To take advantage of the efficient construction of the two-length configurations, the Lefèvre HR-case search computes the minimum of {b − a · x mod 1 | x < n} with n the number of multiples of a placed and n ≥ |D_i|_p. This gives a lower bound on min{b − a · x mod 1 | x < |D_i|_p}.

The cornerstone of the Lefèvre algorithm strategy is therefore the computation of the minimum of {b − a · x mod 1 | x < n}. In other words, it computes the distance between {b} and the closest point to the left of {b} in the configuration {a · x mod 1 | x < n}. We write N for the number of floating-point numbers in the considered subdomain (n ≥ N as we compute a lower bound). Depending on how we generate the two-length configurations (using the subtraction-based or the division-based Euclidean algorithm), we can derive from Property 2 two ways to compute this distance. The first one is the Lefèvre HR-case search, and the second one is the new HR-case search proposed in Section 4.2.
In the lower bound computation of the Lefèvre HR-case search, the way the two-length configurations are computed depends on the length of the interval containing {b}. When adding points in the interval containing {b} and in the direction of {b}, Lefèvre uses a subtraction-based Euclidean algorithm (he moves from the (i, t) configuration to (i, t + 1)). Otherwise, he uses a division-based Euclidean algorithm (he moves from the (i, t) configuration to (i + 1, 0)). Algorithm 1 describes the lower bound computation of the Lefèvre HR-case search and the corresponding test with respect to ε', which is the sum of all errors involved.
In this algorithm, the variables u and v count the numbers of intervals as in Equation (2), so as to exit when n = u + v ≥ N; they store the interval counts q_i and q_{i−1,t}. Hereafter, we detail the relations between the two-length configurations and the execution paths of Algorithm 1, Lefèvre's lower bound computation and test. This algorithm starts with the configuration (0, 1) and then considers the (i, t) configurations. Note that the condition at line 4 is false only if we have added one point directly to the left of {b} during the previous loop iteration, and it is true otherwise. Hence, it has to be interpreted as "does d need to be updated?" This interpretation is allowed by the fact that the value of d at line 4 corresponds to the previous configuration (the configuration (0, 0) at start) and that at least one point was already added (line 8 or 15). This condition enables us to handle four cases, illustrated in Figure 4. Note that the Lefèvre algorithm always reduces d by using subtractions at line 10, as points are added one by one at the left of {b}. In practice, Lefèvre adds specific instructions to partly compute these reductions with divisions to avoid computing large quotients with subtractions. We have omitted these instructions here for clarity, but they are present in our implementations of the Lefèvre algorithm. Furthermore, the algorithm computes divisions (lines 5 and 12). In practice, we can make use of different division implementations: we can apply a subtractive division, a division instruction, or a hybrid combination of both, as presented and analyzed in Lefèvre [2005] and Fortin et al. [2012].
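For illustration, the three ways of computing a quotient k = ⌊p/q⌋ while reducing p modulo q can be sketched as follows (host-side sketch with names of ours; the number of preliminary subtractions in the hybrid variant is a tunable detail):

```cpp
#include <cstdint>

// Subtractive division: cheap when the quotient is small.
uint64_t quot_sub(uint64_t &p, uint64_t q) {
    uint64_t k = 0;
    while (p >= q) { p -= q; ++k; }
    return k;
}

// Division instruction: constant cost, but high latency on GPUs.
uint64_t quot_div(uint64_t &p, uint64_t q) {
    uint64_t k = p / q;
    p -= k * q;
    return k;
}

// Hybrid: quotients follow the Gauss-Kuzmin distribution, so they are
// small with high probability; try a few subtractions first, and fall
// back to a hardware division for the rare large quotients.
uint64_t quot_hybrid(uint64_t &p, uint64_t q) {
    if (p < q) return 0;
    p -= q;
    if (p < q) return 1;
    p -= q;
    if (p < q) return 2;
    uint64_t k = p / q;  // rare large quotient
    p -= k * q;
    return k + 2;
}
```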
New Regular HR-Case Search
We now propose a new algorithm for the HR-case search where we use the same filtering and division strategy as in the Lefèvre algorithm, but we introduce a more regular algorithm-in the sense that it strongly reduces divergence on the GPU-to compute a lower bound on {b − a · x mod 1 | x < |D i | p }. Hereafter, we will refer to this new algorithm as the regular HR-case search.
The regular HR-case search is described in Algorithm 2. In it, we only consider configurations satisfying Equation (3) in order to use only the division-based Euclidean algorithm. The variables p and q respectively store the lengths θ_i and θ_{i−1} for i even, and the lengths θ_{i−1} and θ_i for i odd. The variables u and v respectively store q_i and q_{i−1} for i even, and q_{i−1} and q_i for i odd.
Thus, instead of testing if {b} went from a split interval to an unsplit one like in the Lefèvre HR-case search, we test here which length is reduced, as in the classical Euclidean algorithm, and then we reduce it and update d accordingly (as illustrated in Figure 4 ). In practice, the quotients are computed like in the Lefèvre HR-case search with a subtractive division, a division instruction, or the hybrid approach. Now we detail the execution of Algorithm 2. Let (i, 0) be a two-length configuration:
- If i is even, then the test p < q is true since θ_i < θ_{i−1}, so we go to the configuration (i + 1, 0) (lines 5 and 6), and:
  - If {b} was in an interval of length θ_i, no point was added in the interval containing {b}, and d is not updated, as d < θ_i and d = d mod θ_i (line 7).
  - If {b} was in an interval of length θ_{i−1}, points were potentially added to the left of {b}. Hence, the distance d is updated by reduction modulo θ_i (line 7), since intervals are split from the left under Property 2.
- If i is odd, then the test p < q (line 4) is false, so we go to the configuration (i + 1, 0) (lines 9 and 10), and:
  - If {b} was in an interval of length θ_i, no point was added in the interval containing {b}. However, we subtract θ_{i+1} from d if d ≥ θ_{i+1} (line 12), which is similar to the reduction performed in the even case.

Note that we do not test the divisors for division by zero in Algorithm 2, as it is very unlikely to occur: we only catch the division-by-zero exception and recover from it. The likelihood of a division by zero is directly correlated to the probability of encountering a large quotient (which follows the Gauss-Kuzmin distribution [Khinchin 1997]) at a given iteration. For example, while searching for (53, 2^{−32}) HR-cases, the probability of having a division by zero using 64-bit arithmetic is much smaller than 2^{−64}. Until now, we have never encountered a division by zero in all of our tests.
Deployment on the GPU
The exhaustive search algorithm perfectly takes advantage of the GPU's massive parallelism and of its (partial) SIMD execution. Hence, we will focus on the deployment of the lower bound computation. In this section, we present the GPU deployment of the Lefèvre HR-case search as detailed in Fortin et al. [2012] and the GPU deployment of the new regular HR-case search. We particularly study the divergence in both HR-case searches at three levels: the filtering strategy, the main loop, and the main conditional statement. For these deployments, we first changed the data layout to a "structure of arrays" to have coalesced memory accesses (Section 6.2.1 in NVIDIA [2012a]). We also avoided (as much as possible) consecutive dependent instructions to increase the instruction-level parallelism within each thread. In addition, we mention that the computation of the continued fraction expansion is done using the Lehmer double-digit GCD algorithm [Brent and Zimmermann 2010] (we reduce the remainders by 32-bit chunks using 64-bit integers).
Throughout this section, we will consider the example domain [1, 1 + 2 −13 ) in the binade [1, 2) for the exponential function in double precision, as this binade is considered by Lefèvre [2005] as the general case.
4.3.1. Filtering Strategy Divergence. As a consequence of the filtering strategy, we will have few threads executing phase 2 and fewer executing phase 3. Table II gives the numbers of subdomains involved in each phase for our example domain, split into subdomains of 2^15 floating-point numbers each. As we can see, very few subdomains are involved in the exhaustive search step (phase 3). Hence, executing one kernel computing the three phases leads to an important divergence, as we have fewer and fewer active threads within each warp from one phase to the next [Fortin et al. 2012]. To tackle this problem, we propose to use three kernels, one for each phase. This allows us to rebuild the grid of threads between each phase and to run the exact number of threads required by each phase. However, this implies two additional costs.
First, we have to write the failing subdomains of phases 1 and 2 into consecutive memory locations, as we prepare coalesced reads for the next phase. In Fortin et al. [2012], this was done with atomic operations on the GPU global memory, since we had few failing subdomains. For some specific binades, however, the number of failing subdomains can be much greater, and the numerous atomic operations result in lower performance. Hence, inspired by the algorithm of Sengupta et al. [2008], we have implemented optimized compact operations based on parallel prefix sums, which are specific to the binary values used. Our compact operations are as efficient as the atomic operations when we have few failing subdomains, and they are more efficient when there are many.
Second, between each two successive kernels, we have to transfer the number of failing subdomains back to the CPU to compute (on the CPU) a suitable CUDA grid size for the next kernel.
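A generic equivalent of our compact operation can be sketched with Thrust's prefix-sum-based copy_if (our deployed kernels are specialized for the binary flags, as noted above; the returned count is what is transferred back to the CPU):

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// flags[i] != 0 marks subdomain i as failing; its index is kept.
struct is_failing {
    const int *flags;
    __host__ __device__
    bool operator()(int i) const { return flags[i] != 0; }
};

// Compacts the indices of failing subdomains into a dense array and
// returns their number, used to size the grid of the next kernel.
int compact_failing(const thrust::device_vector<int> &flags,
                    thrust::device_vector<int> &failing) {
    int n = static_cast<int>(flags.size());
    is_failing pred{thrust::raw_pointer_cast(flags.data())};
    auto end = thrust::copy_if(thrust::counting_iterator<int>(0),
                               thrust::counting_iterator<int>(n),
                               failing.begin(), pred);
    return static_cast<int>(end - failing.begin());
}
```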
It can be noticed in Table II that the Lefèvre HR-case lower bound computation filters out a little more of the arguments than the new algorithm. The Lefèvre HR-case lower bound computation uses the subtraction-based Euclidean algorithm when splitting the interval containing {b}. This results in a number of considered arguments less than 2N. On the contrary, we always use the division-based Euclidean algorithm in the regular HR-case search. If we consider i such that q_i + q_{i−1} < N < q_{i+1} + q_i, then q_{i+1} + q_i = q_{i−1} + k_{i+1} · q_i + q_i < (k_{i+1} + 1) · N, by considering q_i < N. However, the geometric mean of the quotients k_i of the continued fraction of almost all real numbers equals Khinchin's constant (≈ 2.69) [Khinchin 1997]. Hence, using the regular HR-case search, we consider an average of less than 3.69 · N arguments.
4.3.2. Loop Divergence. The second source of divergence is the main unconditional loop (see line 3 in Algorithms 1 and 2). Figure 6 shows the NMDM of the number of loop iterations by warp, for the different HR-case searches, when testing 2^25 domains D_i containing 2^15 double-precision floating-point arguments each. Table III summarizes statistical information on the NMDM and the number of iterations for both the Lefèvre and regular HR-case searches.
For the Lefèvre HR-case search, the main unconditional loop is an important source of divergence with a mean NMDM of 25.6%. In other words, a thread remains idle on average 25.6% of the number of loop iterations executed by its warp. To the best of our knowledge, there is no a priori information on the number of loop iterations that would enable us to statically reorder the subdomains to decrease this divergence. We also tried to use software solutions to reduce the impact of the loop divergence [Fortin et al. 2012 ], although to no avail because the computation grain is very fine.
This divergence in the Lefèvre HR-case search is mainly due to the fact that the quotients are either entirely or partially computed at each iteration depending on the position of b, even with the specific instructions (see Section 4.1). Thanks to these specific instructions, the pathological cases are avoided (see Table III), but the mean NMDM is still around 25.6%. In the new regular HR-case search, the key point is that a quotient of the continued fraction expansion of a is entirely computed at each loop iteration, which is not the case in the Lefèvre HR-case search. Hence, the number of loop iterations only depends on the number of quotients of the continued fraction expansion of a computed to reach |D_i|_p points on the segment. As the number of quotients to compute is very close from one subdomain to the next, we reduce the mean NMDM by warp to 0.1%.

4.3.3. Branch Divergence. The third source of divergence is the main conditional statement (see line 4 in Algorithms 1 and 2). We aim at reducing the number of instructions controlled by the branch condition; if they are reduced enough, we benefit from the GPU branch predication (Section 9.2 in NVIDIA [2012a]). Branch predication enables the pipelines to be filled at best, for short sections of divergent code, by scheduling both then and else branches for all threads: thanks to a per-thread predicate, only the relevant results are actually computed and finally written.
As observed in Fortin et al. [2012] , both branches of the Lefèvre HR-case search contain the same instructions, except the variables p (respectively, u) and q (respectively, v) are interchanged, and p is subtracted from d in the else branch. We therefore swap the two values p and q (respectively, u and v) to remove the common instructions from the conditional scope. This is described in Algorithm 3. The swap implies a small extra cost, but we thus reduce the number of divergent instructions.
As far as the new regular HR-case search is concerned, there is as much branch divergence within the unconditional loop in Algorithm 2 as there is in Algorithm 1. However, the main conditional statements of the two algorithms are rather different. In the Lefèvre HR-case search, this test (line 4) depends on the position of the point b at each iteration. In the regular HR-case search, it depends on the length to reduce.

GENERATION OF POLYNOMIAL APPROXIMATIONS ON THE GPU

As described in Section 3.1, the generation of polynomial approximations enables testing of the floating-point arguments X ∈ D_i of f(X) by testing the integer arguments x ∈ ⟦0, |D_i|_p − 1⟧.
Although individual Taylor approximations could be used on each domain D_i, the number of domains makes this prohibitive. The principle here is therefore to consider the union of τ domains D_{tτ}, . . . , D_{(t+1)τ−1}, denoted by D_t, and to approximate the function f by a polynomial R_t of degree δ (e.g., with a Taylor approximation) such that |R_t(x) − f(X)| ≤ 2^{e(f(X))−p} for all X ∈ D_t, with x = 2^{p−e(X_t)} · (X − X_t). If τ is chosen such that e(x) = e(y) for all x, y ∈ D_t, then P_{tτ+i}(x) is defined as an approximation of f over the domain D_{tτ+i}, obtained by shifting R_t by iN. To account for the error added by these shifts,
we update ε_approx to ε_approx + ε_shift (see Section 3.1). We now consider how to compute these polynomials P_{tτ+i}. We first present the hierarchical method originally designed by Lefèvre [2000] to change one Taylor shift by N into several Taylor shifts by 1. Then, we present two existing Taylor shift algorithms:
- the tabulated difference shift, which, starting with P_{tτ}(x) = R_t(x), sequentially iterates a shift of the polynomial P_{tτ+i} to obtain P_{tτ+i+1} with only multiprecision additions [de Dinechin et al. 2011], and
- the straightforward shift, which computes the P_{tτ+i}'s from R_t in parallel but requires multiprecision multiplications and additions [von zur Gathen and Gerhard 1997].
Finally, we propose a hybrid CPU-GPU Taylor shift algorithm that efficiently combines these two shifts with the hierarchical method and requires only fixed-size multiprecision additions on GPUs. More details on these algorithms and their error propagation can be found in de Dinechin et al. [2011] .
Hierarchical Method
We first describe the hierarchical method originally described in Lefèvre [2000], which transforms one shift by N of a polynomial R_t(x) of degree δ into δ + 1 shifts by 1. This is of interest, as shifting by 1 can be done with only additions (see Section 5.2). This method requires the input polynomial to be interpolated in the binomial basis (C(x, j))_{0≤j≤δ}, where C(x, j) = x(x − 1) ⋯ (x − j + 1)/j!. Therefore, we define the forward difference operator and its application to interpolate a polynomial in the binomial basis.
Definition 5.1. The forward difference operator, denoted by Δ_h, is defined as Δ_h[R](x) = R(x + h) − R(x).

We write Δ_h^j for the composition of Δ_h applied j times, and define Δ = Δ_1.
Using this forward difference operator, one can efficiently interpolate the polynomial R_t of degree δ in the binomial basis [de Dinechin et al. 2011], given the values R_t(0), R_t(1), . . . , R_t(δ): the coefficient of C(x, j) is Δ^j[R_t](0). An example is shown in Figure 7. This interpolation is computed using the definition of Δ and with initial values Δ^0[R_t](x) = R_t(x). This algorithm is a particular case of the Newton interpolation, but with the forward difference operator used instead of the forward divided difference operator.
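For a concrete instance, the following host-side sketch interpolates R(x) = x^3 (an illustrative polynomial) in the binomial basis from its values at 0, . . . , 3, by in-place forward differences:

```cpp
#include <cstdio>

// Interpolation in the binomial basis by repeated forward differences:
// R(x) = sum_j Delta^j[R](0) * C(x, j), computed from R(0..3).
int main() {
    long long t[4] = {0, 1, 8, 27};  // R(0..3) for R(x) = x^3
    for (int j = 1; j <= 3; ++j)     // after pass j, t[i] = Delta^j[R](i-j)
        for (int i = 3; i >= j; --i)
            t[i] -= t[i - 1];
    // Now t[j] = Delta^j[R](0): the binomial-basis coefficients.
    for (int j = 0; j <= 3; ++j)
        printf("Delta^%d[R](0) = %lld\n", j, t[j]);
    // Prints 0, 1, 6, 6: x^3 = C(x,1) + 6*C(x,2) + 6*C(x,3).
}
```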
Now we describe the hierarchical method [Lefèvre 2000]. Given a polynomial R_t(x), we want to build a scheme to shift this polynomial by consecutive arguments following an arithmetic progression with common difference N. Let us consider the univariate polynomial R_t as a bivariate polynomial R_t(x + iN) in the variables x and i. By interpolation in the binomial basis with respect to the variable x, we obtain a polynomial in the variable x, with polynomial coefficients r_{t,j}(i) = Δ^j[R_t](iN) in the variable i, defined as follows:

R_t(x + iN) = Σ_{j=0}^{δ} r_{t,j}(i) · C(x, j).
Using the hierarchical method, we thus obtain the polynomials r_{t,j}. From these, one can compute the shifts P_{tτ+i}(x) of the polynomial R_t(x) by iN by evaluating r_{t,j}(i) with 0 ≤ j ≤ δ. If we consider the polynomials r_{t,j} in the binomial basis, these evaluations r_{t,j}(i) can be obtained with the consecutive Taylor shifts of r_{t,j} by 1, by taking the coefficients of degree 0 of these shifts, which are the Δ^0[r_{t,j}](i).
Taylor Shift Algorithms
Taylor shifts by 1 can be performed efficiently with the tabulated difference shift [de Dinechin et al. 2011; Lefèvre et al. 1998]. According to the forward difference operator definition, Δ^γ[r_{t,j}](i), where γ = deg r_{t,j}, is constant for any integer i ≥ 0, as it is the γth discrete derivative of r_{t,j} times γ!. Hence, the only operations needed to obtain the consecutive evaluations of the polynomials r_{t,j} are multiprecision additions of the coefficients. An example of this algorithm can be found in Figure 8.
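Continuing the degree-3 example above, the tabulated difference shift then recovers the consecutive evaluations with additions only (host-side sketch; the deployed version works on fixed-size multiprecision limbs):

```cpp
#include <cstdio>

// Tabulated difference shift: from d[j] = Delta^j[r](0), produce
// r(0), r(1), r(2), ... using only additions (d[3] stays constant).
int main() {
    long long d[4] = {0, 1, 6, 6};  // Delta^j[x^3](0) from Figure 7's scheme
    for (int i = 0; i < 6; ++i) {
        printf("r(%d) = %lld\n", i, d[0]);  // prints 0, 1, 8, 27, 64, 125
        // Delta^j[r](i+1) = Delta^j[r](i) + Delta^{j+1}[r](i)
        for (int j = 0; j < 3; ++j)
            d[j] += d[j + 1];
    }
}
```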
Obtaining the consecutive evaluations of r_{t,j} can also be performed with the straightforward shift. This algorithm multiplies the vector of the r_{t,j} polynomial coefficients by a matrix constructed using Newton's binomial theorem. If we consider the polynomials r_{t,j} expressed in the binomial basis, this multiplication exactly corresponds to applying the tabulated difference algorithm i times to the polynomial r_{t,j}. This matrix is upper triangular and Toeplitz, which can be used to speed up the matrix-vector multiplication for high degrees: its entry on the lth superdiagonal is the binomial coefficient C(i, l). Therefore, to construct this matrix, only the first binomial coefficients C(i, l) with 0 ≤ l < γ + 1 are needed to compute its coefficients.
Hybrid CPU-GPU Deployment
Now we propose a hybrid CPU-GPU deployment of the polynomial approximation generation step. The polynomials R_t are Taylor polynomials of "high" degree δ approximating the targeted function over τ = 2^25 domains D_i, as in Lefèvre et al. [1998]. We interpolate them in the binomial basis using the hierarchical method (Section 5.1) with N = 2^15, as we want to use the Boolean tests described in Section 4 on intervals containing 2^15 arguments. We choose these parameter values so that the error analysis of the Boolean test in Lefèvre [2000] applies. As this interpolation in the binomial basis is done only once, it is precomputed on the CPU. Then, we shift these polynomials and truncate them to degree 2 to make the domain D_i splitting at phase 2 of the HR-case search more efficient. More formally, we have

P_{tτ+i}(x) ≈ Σ_{j=0}^{2} r_{t,j}(i) · C(x, j).

Hence, to obtain all polynomials P_{tτ+i} for 0 ≤ i < τ, we have to deploy on the GPU the computation of the consecutive evaluations of r_{t,j}(i) for 0 ≤ i < τ and 0 ≤ j ≤ 2.
On the one hand, the tabulated difference shift is efficient, as it requires only multiprecision additions. This method is thus used in the reference CPU implementation [Lefèvre et al. 1998 ]. However, this is an intrinsically sequential algorithm, which prohibits its direct efficient deployment on the GPU. On the other hand, the straightforward shift is embarrassingly parallel, but it requires multiprecision multiplications and divisions to compute the binomial coefficients and multiprecision multiplications and additions to compute the matrix-vector products.
To benefit from the efficiency of the tabulated difference shift on the GPU, we therefore use a hybrid strategy that relies on both the CPU and the GPU: we compute the shifts r_{t,j,u}(i) = r_{t,j}(uν + i) to form μ packets of size ν such that μν = τ. We vary u from 0 to μ − 1 and construct the polynomials r_{t,j,u} sequentially on the CPU with the straightforward shift.
All multiprecision operations on the CPU are computed efficiently by using the GMP library [Granlund and the GMP Development Team 2010].
The μ polynomials r t, j,u are then transferred to the GPU. We run a CUDA kernel of μ threads wherein each thread of ID u processes the polynomial r t, j,u and computes the evaluations r t, j,u (i) with 0 ≤ i < ν using the tabulated difference shift.
Furthermore, as there are δ + 1 independent r t, j polynomials, we can run one kernel per r t, j polynomial and overlap the GPU tabulated difference shift for the polynomial r t, j with the CPU straightforward shift of the polynomial r t, j+1 . The only algorithm deployed on the GPU is therefore the tabulated difference shift, which is sequential within each GPU thread but performed concurrently by multiple threads on multiple polynomials r t, j,u .
As the coefficients of the considered polynomials are large, we need multiprecision additions on the GPU. Here, only fixed-size multiprecision additions are required, as bounds on the required precision, depending on the targeted function and the exponent of the targeted domain, can be computed before compile time [Lefèvre 2000; de Dinechin et al. 2011]. Multiprecision libraries for GPUs have been developed [Nakayama and Takahashi 2011; Lu et al. 2010]. However, we preferred to have our own implementation of this operation for two main reasons: to use PTX (NVIDIA assembly language) [NVIDIA 2012b] and the addc instruction to have an efficient carry propagation, and to benefit from the fixed size of the multiprecision words at compile time to unroll inner loops. As the addc instruction operates only on 32-bit words, multiprecision words are arrays of 32-bit chunks. The multiprecision addition function is implemented as a C++ template with the size of the multiprecision words given as a parameter, which enables automatic generation of addition functions for each size of fixed multiprecision word required by each union of domains D_t. As a consequence, the inner loop on the number of chunks can easily be unrolled, as the number of loop iterations is known at compile time. Furthermore, to have coalesced memory accesses, the word chunks are interleaved in global memory and loaded chunk by chunk into registers.
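A minimal version of this addition can be sketched as follows (the limb interleaving in global memory is omitted; the signature and names are ours):

```cuda
#include <cstdint>

// Device-side fixed-size multiprecision addition r = a + b on NLIMBS
// 32-bit limbs (little-endian), using the PTX add.cc/addc carry chain.
template <int NLIMBS>
__device__ void mp_add(uint32_t *r, const uint32_t *a, const uint32_t *b) {
    static_assert(NLIMBS >= 2, "at least two limbs");
    // First limb: plain add, sets the carry flag.
    asm("add.cc.u32 %0, %1, %2;" : "=r"(r[0]) : "r"(a[0]), "r"(b[0]));
    // Middle limbs: add with carry in and carry out; the loop bound is
    // known at compile time, so it is fully unrolled.
#pragma unroll
    for (int i = 1; i < NLIMBS - 1; ++i)
        asm("addc.cc.u32 %0, %1, %2;" : "=r"(r[i]) : "r"(a[i]), "r"(b[i]));
    // Last limb: consume the final carry (overflow is excluded by the
    // precision bounds computed before compile time).
    asm("addc.u32 %0, %1, %2;"
        : "=r"(r[NLIMBS - 1]) : "r"(a[NLIMBS - 1]), "r"(b[NLIMBS - 1]));
}
```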
Finally, note that this algorithm is completely regular: there is therefore no divergence issue among the GPU threads here.
PERFORMANCE RESULTS
In this section, we present the performance analysis of our different deployments. All results are obtained on a server composed of one Intel Xeon X5650 hex-core processor running at 2.67GHz, one NVIDIA Fermi C2070 GPU, and 48GB of DDR3 memory.
We compare three implementations. The first one is the sequential implementation (named Seq.), which is Lefèvre's reference optimized code. The second one is the parallel implementation on the CPU (referred to as MPI), which is the sequential implementation with an MPI layer added (OpenMPI version 1.6) to distribute the domains composing a binade equally among the available CPU cores. We made the choice of MPI over OpenMP because the processes do not communicate with each other (the computations over each domain are independent), and MPI favors data locality. In addition, we use a cyclic decomposition, which offers better load balancing than a block decomposition, and we run 12 MPI processes to take advantage of the two-way SMT (simultaneous multithreading or hyperthreading for Intel) of each core. The third implementation (named CPU-GPU) relies on the GPU and CPU-GPU deployments presented in this article. We searched for the optimal block sizes on the GPU and tried to increase the number of domains computed per thread in every GPU kernel to optimize occupancy and computation granularity. The implementations have been compiled with gcc-4.6.4 for the CPU code and nvcc (CUDA 5.0) for the GPU code. All of the following timings are obtained for the problem of searching (53, 2 −32 ) HR-cases of the exp function. In other words, binary64 FP arguments for which 32 extra bits of precision during the function evaluation do not suffice to guarantee correct rounding. The timings include all computations and data transfers between the GPU and the CPU.
Multicore Deployments
As a first step, in Table IV , we compare the sequential implementation with our multicore one. As sequential executions are very time consuming, we run this test only over the binade [1, 2). All MPI deployments scale very well on the multicore CPU, as we have a parallel speedup higher than the number of cores. This is mainly because they take advantage of the two-way SMT execution, which can partly offset the high latency operations. These high latencies are caused by the carry propagation during the multiprecision addition for the polynomial approximation generation and the division instructions for the HR-case search. We also note that the regular HR-case search is slightly faster than the Lefèvre one in both sequential (8%) and MPI (12%) executions. This is partly due to its regularity, as removing the main conditional statement avoids wrong branch predictions.
HR-Case Search on the GPU
As a second step, in Table V , we compare performance results of the HR-case search deployments on the GPU with the MPI ones. We first discuss the results for the exp function over the binades [1, 2) , [16, 32), and [128, 256) . The binade [1, 2) corresponds to the case where the exp function is well approximated by a polynomial of degree 1; the binade [128, 256) corresponds to the last entire binade before overflow, where the exp function is hard to approximate by a polynomial of degree 1; and the binade [16, 32) is an intermediate binade. We note here that to minimize the amount of exhaustive search and balance the time spent in phases 2 and 3, we suitably set the number of subdomains to consider in phase 2 for both HR-case search algorithms.
The deployment of the Lefèvre HR-case search on the GPU offers good speedups of about 2.1× over the hex-core CPU. The fact that this speedup is constant from one binade to the next is to be expected, as the algorithm is run with the same number of subdomains in phase 2. In fact, here we are limited on the GPU by the irregularity of the Lefèvre Boolean test. We cannot decrease the number of arguments to test in phase 3-which is well accelerated on the GPU-by increasing the number of subdomains, as it would lead to lower overall performance.
The deployment of the regular HR-case search on the GPU offers very good speedups over the MPI Lefèvre HR-case search: from 7.44× in the binade [1, 2) to 3.59× in the binade [128, 256) . Here the speedup is greater in the binade [1, 2) than in the binade [128, 256) . In the binade [1, 2), phase 1 is the most time consuming on a multicore CPU. Hence, we benefit from the regular behavior of the new HR-case Boolean test on the GPU, which results in a speedup of 7.44× over the MPI deployment.
In the binade [128, 256), phases 2 and 3 become the critical phases, as the Boolean test fails often. Moreover, the new regular HR-case search filters fewer arguments than the Lefèvre HR-case search, as stated in Section 4.3.1. This results in more arguments to test exhaustively, as detailed in Table VI(a). However, this can be partly offset by increasing the number of subdomains in phase 2 whose regular HR-case search is efficiently performed on the GPU. For example, we thus use 32 subdomains for the binade [128, 256) instead of 8 for the Lefèvre HR-case search. In the end, we therefore manage to obtain a shorter exhaustive search (see Table VI (b) ) and a good speedup of 3.59×.
We also tested our deployments on the sine function over the interval [0, π) , and the speedups were similar to those of the exp function. In Table V , we detail the timings over the binade [1/2, 1) and obtain speedups similar to those of the exp function over the same binade: the regular HR-case search on the GPU offers a speedup of 6.04× over the Lefèvre HR-case search on our hex-core CPU. These speedups thus seem to be related to the number of approximations and the approximation error. Therefore, we can expect very good speedups for functions that can be locally well approximated by degree 1 polynomials.
As a third step, in Table VII , we compare both GPU deployments on the positive domain of definition of the exp function. As the overflow occurs early in the binade [256, 512), we do not show any timings for this binade, as they are not significant. In addition, instead of testing many binades near 0, we test about one binade and a half ([1, 3.17) ) of the inverse function (the log function), as suggested by Lefèvre [2005] .
The regular Boolean test enables varying speedups, up to 3.44× over the Lefèvre one. This speedup decreases as we test arguments further away from 0. As already mentioned, this is due to the increasing truncation error on the tested degree 1 polynomials. The best approach here might be to consider a Boolean test that uses higher-degree polynomials, like those in the SLZ algorithm.
Finally, we finish with a qualitative remark: all tested implementations return the same HR-cases in the considered binades, and they are identical to those of Lefèvre [Muller et al. 2009 ] and de Dinechin et al. [2011] , which strengthens the confidence in the validity of the computed hardness-to-round.
Polynomial Approximation Generation on the GPU
Now we detail the performance of the polynomial approximation generation. First, we recall that the cost of generating the Taylor approximations is negligible, although they are generated with Maple [Maplesoft 2016 ]. In Table V , we show performance results of the polynomial approximation generation step over three chosen binades. We can observe that the hybrid CPU-GPU deployment greatly takes advantage of the GPU, as all threads perform independent computations and the control flow is perfectly regular among the GPU threads. It thus offers good speedups up to 55.15× over the one CPU core execution and up to 6.97× over the six-core execution.
We can also observe in Table VII that the timings of the CPU-GPU polynomial approximation generation increase more slowly than the HR-case search ones, even though we use longer multiprecision words in the large binades (the maximum coefficient sizes vary from 320 bits for the binade [1, 2) to 448 bits for the binade [128, 256)) and a higher polynomial degree (max_j deg r_{t,j} varies from 6 for the binade [1, 2) to 10 for the binade [128, 256)). This makes the relative amount of time spent in this step vary from 5.9% at least to 52.4% at most. Hence, when the function is well approximated by a degree 1 polynomial, the HR-case search is fast and requires about half of the global time. In the opposite situation, the HR-case search becomes by far the bottleneck, and the time spent in the polynomial approximation generation is negligible.
CONCLUSION AND FUTURE WORK
In this article, we have proposed a new algorithm based on continued fraction expansions for HR-case searches. This new algorithm improves the Lefèvre HR-case search algorithm by strongly reducing loop and branch divergence, which is a problem inherent to GPUs because of their partial SIMD architecture. We have also proposed an efficient GPU deployment of these two HR-case search algorithms and a hybrid CPU-GPU deployment of the generation of polynomial approximations.
When searching for HR-cases of the exp function in double precision, these deployments enable an overall speedup of up to 53.4× on one GPU over a sequential execution on one CPU core and a speedup of up to 7.1× on one GPU over one hex-core CPU.
In the future, we plan to investigate whether the regular HR-case search can benefit from other SIMD architectures like vector units (SSE, AVX, . . . ) on multicore CPU and Intel Xeon Phi architectures. This will require an OpenCL [Khronos Group 2011] implementation and an effective automatic vectorization by the OpenCL compiler.
We also plan to provide formal proofs of the deployed algorithms. This task is eased by the continued fraction expansion formalism and would enable a validated generation of the hardness-to-round, which is necessary to improve the confidence in the produced results. This is desirable before computing the hardness-to-round of all functions recommended by IEEE standard 754.
Finally, it is our hope to tackle the TMD for quadruple precision by deploying on the GPU the SLZ algorithm, which tests the existence of HR-cases with higher-degree polynomials. This algorithm heavily relies on the use of the LLL algorithm. The deployment of this algorithm on the GPU is therefore far from trivial if one wants to obtain good performance. Porting the LLL algorithm to the GPU will be the next step of this work.
