The Descartes method for polynomial real root isolation can be performed with respect to monomial bases and with respect to Bernstein bases. The first variant uses Taylor shift by 1 as its main subalgorithm, the second uses de Casteljau's algorithm. When applied to integer polynomials, the two variants have co-dominant, almost tight computing time bounds. Implementations of either variant can obtain speed-ups over previous state-of-the-art implementations by more than an order of magnitude if they use features of the processor architecture. We present an implementation of the Bernstein-bases variant of the Descartes method that automatically generates architecture-aware high-level code and leaves further optimizations to the compiler. We compare the performance of our implementation, algorithmically tuned implementations of the monomial and Bernstein variants, and architecture-unaware implementations of both variants on four different processor architectures and for three classes of input polynomials.
INTRODUCTION
Some years after Collins and Akritas [5] proposed an algorithm for polynomial real root isolation, Lane and Riesenfeld [24] presented a variant of the algorithm that uses Bernstein bases instead of monomial bases. Both methods proceed recursively and use the Descartes rule of signs as a termination criterion. For any input polynomial, the two methods compute the same isolating intervals since they generate the same recursion tree. An analysis of the recursion tree by Krandick and Mehlhorn [23] was improved by Eigenwillig et al. [10] who also gave a basis-free description of the Descartes method and provided explicit transformations between the two variants. When applied to integer polynomials, both variants have co-dominant, almost tight computing time bounds [10] ; the empirical computing times have never been compared [26] .
The performance of the monomial variant critically depends on the efficiency of its main subalgorithm, Taylor shift by 1. It has been pointed out that asymptotically fast algorithms are available for this operation [27, 10, 11] . Indeed, von zur Gathen and Gerhard implemented and compared three asymptotically fast methods [33, 13] , and Bostan et al. [4] proposed a fourth. However, these methods have not been empirically compared with implementations of classical Taylor shift by 1.
The main subalgorithm of the Bernstein-bases variant is de Casteljau's algorithm, a fundamental method in computeraided design [12] .
Since classical Taylor shift by 1 and de Casteljau's algorithm both consist of a sequence of integer additions, a high-performance version of these algorithms can be obtained by using architecture specific assembly language routines for integer addition. The implementation by Hanrot et al. [16] of the monomial Descartes method and the SYNAPS [25] implementation of the Bernstein-bases variant both build on the GMP library [14, 15] which provides such addition routines. Pursuing a different approach Johnson et al. [21] obtained high-performance Taylor shift by 1 on the UltraSPARC III processor architecture using high-level architecture-aware code and register tiling.
We show that the Taylor shift implementation of Johnson et al. achieves high performance on the Pentium EE and Opteron architectures, but not on the older Pentium 4.
Von zur Gathen's and Gerhard's asymptotically fast Taylor shift by 1 turns out to be slower, for a wide range of inputs, than the classical method of Johnson et al. [21] . The register tiling techniques carry over to de Casteljau's algorithm. We implement the Descartes method using de Casteljau's algorithm with register tiling. We compare the performance of our method, the method by Hanrot et al., the SYNAPS method, and two architecture-unaware implementations from SACLIB [6] . A comparison of the five methods on four different processor architectures, and for three classes of input polynomials shows that the best absolute computing times are obtained on an Opteron processor using the Bernstein-bases variant of the Descartes method with register tiling.
THE DESCARTES METHOD

Monomial vs. Bernstein bases
The Descartes method, independent of the basis used to represent polynomials, uses binary search to find isolating intervals and relies on the Descartes rule of signs to determine when an isolating interval has been found or when the search can stop since there are no roots in the given interval. Let A(x) = amx m + · · · + a1x + a0. The Descartes rule states that the number of coefficient sign variations, var(A), is greater than or equal to the number of positive roots of A, and that the difference is even. This provides an exact test when var(A) ∈ {0, 1}.
The following polynomial transformations are needed for the method and the mapping between the monomial basis and the Bernstein basis:
, and 3) Homothetic transformation: Ha(A(x)) = A(x/a). The method proceeds by using a root bound and a homothetic transformation to transform the input polynomial to a polynomial, A, whose roots in the interval (0, 1) correspond to the positive roots of the input polynomial. It can be advantageous to compute the negative roots separately using a separate root bound for the negative roots. When using the monomial basis, the Descartes rule is applied to the transformed polynomial A * = T−1R(A) to determine whether A has zero or one real roots in the interval (0, 1). Bisection is performed by computing the transformed polynomials A1 = H2(A) and A2 = T−1H2(A) whose roots in the interval (0, 1) correspond to the roots of A in the intervals (0, 1/2) and (1/2, 1) respectively. The Descartes rule is then applied to A * 1 = T−1R(A1) and A * 2 = T−1R(A2), and if more than one coefficient sign variation is obtained the algorithm proceeds recursively with the bisected polynomials.
Associated with this bisection process is a binary tree, where each node in the tree has an associated subinterval and polynomial. Each internal node requires the computation of three polynomial translations T−1, called Taylor shift by 1, to compute the bisection polynomial and the two applications of the Descartes rule, while leaf nodes only require the polynomial translations for the application of the Descartes rule. The bulk of the computing time for the method is devoted to Taylor shift by 1. Figure 1(a) shows the classical computation of A(x + 1) = P m h=0 a m−h,h x h . Note that it is possible to avoid the complete computation of the Taylor shift in the application of the Descartes rule by stopping as soon as more than one sign variation is detected.
Let
Bernstein basis, and write
and var(A * (x)) = var(b0, ..., bm). The Bernstein representation of the bisection polynomials, A1(x) and A2(x), can be obtained from the coefficients of the Bernstein representation of A(x) using de Casteljau's algorithm. In order to preserve integer coefficients a fraction-free variant is used. For 0 ≤ i ≤ m set b0,i = bi, and for 1 ≤ j ≤ m and 0 ≤ i ≤ m − j set bj,i = bj−1 + bj−1,i+1. As Figure 1(b) shows, this computation is similar to the computation of the Taylor shift by 1, except that the computation proceeds in the revserse direction. Eigenwillig et al. [10] remark that if
This establishes a one-to-one mapping between the nodes and the associated polynomials in the search trees for the monomial and Bernstein variants of the algorithm. Moreover, the cost of the computation at each node, assuming classical algorithms for de Casteljau and Taylor shift, for both variants is codominant and hence the total computing time is codominant. In contrast to the monomial basis, each internal node requires one application of de Casteljau's algorithm instead of three Taylor shifts by one and no transformations are required for leaf nodes. A similar approach, called the dual algorithm, which also reduces the number of Taylor shifts by computing A * 1 (x) and A * 2 (x) directly from A * (x) using monomial bases was suggested by Johnson [19] . 
The monomial SACLIB method IPRRID
The program IPRRID in the SACLIB library [6] processes the bisection tree in breadth-first order [22, 27] . IPRRID tries to avoid the complete computation of the Taylor shift in the application of the Descartes rule by stopping as soon as more than one sign variation is detected. Also, IPRRID checks whether var(A) = 0 before computing T−1R(A). The program IUPTR1 that implements Taylor shift by 1 operates on a single array containing the coefficients of the input polynomial. IUPTR1 avoids the overhead of calling integer addition routines and of normalizing after each integer addition [21] .
The Bernstein SACLIB method IPRRIDB
The program IPRRIDB in the SACLIB library [6] converts the input polynomial from its monomial representation into a fraction-free Bernstein-basis representation. IPRRIDB processes the bisection tree in the same way as the program IPRRID of Section 2.2. IPRRIDB uses a fraction-free version of de Casteljau's algorithm that avoids the overhead of calling integer addition routines and of normalizing after each integer addition-in the same way as the program IUPTR1 of Section 2.2.
The method by Hanrot et al.
Hanrot et al. [16] provide an efficient implementation of the monomial version of Descartes method that incorporates the memory saving technique of Rouillier and Zimmermann [27] . Their implementation uses GMP [15] for the integer additions required by Taylor shift operations. Additional algorithmic optimizations are included to reduce the time spent on Taylor shift. The complete execution of the Taylor shift used to compute T−1R prior to the application of the Descartes rule is not needed in many situations. If all of the input coefficients are of the same sign, then the transformed polynomial will have zero coefficient sign variations and the Taylor shift can be avoided. If all of the intermediate coefficients in a column of the Taylor shift computation have the same sign, then the remaining result coefficients will have the same sign, and the computation can be aborted. If exactly two sign variations are reported then there are either zero or two roots in the interval. If the signs of the polynomial evaluated at 0 and 1 are equal but different from the sign at 1/2, then two roots have been found and the algorithm can terminate avoiding the additional Taylor shifts needed for the Descartes test to report the termination. This test is efficient to apply since the polynomial evaluated at 1 is equal to the sum of the coefficients and the sum is known after computing the first column of intermediate coefficients in the Taylor shift computation. In practice, computation of a partial rather than a complete Taylor shift along with the early termination tests can save a substantial amount of time.
In a pre-processing step the method determines the greatest k such that the input polynomial A(x) is a polynomial in x k , and replaces A(x) by A(
If k is even, the method isolates only the positive roots.
The SYNAPS method
The SYNAPS [25] implementation IslBzInteger<QQ> [11] of the Descartes method uses GMP [15] for the integer additions required by the de Casteljau operations. Otherwise, the method is a straightforward implementation of the Bernstein-bases variant. A hardcoded limitation of the recursion depth to 96 prevents the method from isolating the roots of Mignotte polynomials of degrees greater than 80. Figure 2 shows that both, Taylor shift and de Casteljau's method, are amenable to register tiling, a method that reduces the number of memory accesses and exploits instruction-level parallelism [21] .
The register tile method
A tile can be described by three parameters: tile shape, tile size, and addition schedule. The shape of the tiles is determined by the computational structure of the Taylor shift and de Casteljau algorithms. Possible addition schedules are also constrained by the computation structure of the Taylor shift and de Casteljau algorithms. In general, more than one addition schedule is possible, and the number of possible schedules increases with the number of addition units in the target CPU. The optimal tile size depends on the number of registers in the target CPU and the quality with which the CPU schedules memory operations. We found that the best performing parameter values varied widely depending on the target CPU. In order to allow our tiled implementations to be used on multiple architectures, we have implemented an automatic code generator in Perl [34] . It produces portable ANSI C++ [18, 30] code for tiles of different sizes, compiles and executes the code, and searches through successively larger tile sizes until a tile size with peak performance is discovered. A single addition schedule is produced by the code generator for a CPU with two addition units. The purpose of this part of the code generator is not to produce an optimal addition schedule, but rather to produce a schedule that exposes the dependencies of the computation in such a way that the C++ compiler can then schedule the computation appropriately for the target architecture. All compilers we have used were unable to infer these dependencies when the computation is programmed in "for"-loops. The code generators for Taylor shift and de Casteljau's method both follow the same architecture and share a good deal of common code. This is the case because changes in index computations suffice to "reverse the arrows" when going from Taylor shift to de Casteljau's algorithm. Von zur Gathen and Gerhard experimentally compared six methods for computing Taylor shifts. They implemented all methods on top of the NTL library [29, 28] and identified a divide-and-conquer method as the fastest method.
Asymptotically fast Taylor shift
We performed experiments using their implementation and confirmed that, on the Pentium EE, the Opteron, and the UltraSPARC III, the divide-and-conquer method-for degrees ≥ 255 and "large" coefficients-is indeed faster than the other implementations of asymptotically fast Taylor shift.
Using the same NTL-based implementation we compared its speed to classical Taylor shift with register tiling. We note that NTL represents integers using just 32-bits of each word due to the way it implements integer multiplication while our Taylor shift uses the full 64-bit word. Figure 3 shows that our method is faster on a wide range of inputs. None of the Descartes methods we compare in this paper uses asymptotically fast Taylor shift.
EXPERIMENTAL PROCEDURES
Processor architecture
The tiled Bernstein method primarily achieves its speedup from delaying carries and using register tiling to improve locality of reference. The computation schedule for the register tiles allows multiple integer execution units to be used simultaneously in the processing of a register tile. When implementing the tiled Bernstein method for a given processor, the precision of the native integer arithmetic, the number of general purpose integer registers, and the number of integer execution units determines the maximum speedup that can be obtained by the method; the speedup will be larger with high native integer precision and a larger number of general purpose integer registers and integer execution units.
64-bit processors
Current processors such as the Pentium EE [9] , Opteron [2, 1] , and UltraSPARC III [17, 32] support native 64-bit integer operations, have at least 16 64-bit general purpose integer registers, and at least 2 integer execution units. These are the kind of processors for which the tiled Bernstein method was developed. Below, we briefly summarize the salient features of these three processors.
Pentium EE The Intel Pentium Extreme Edition dualcore processor supports both the 32-bit x86 and 64-bit EM64T instruction sets. Each core of the Pentium EE provides 16 64-bit general purpose integer registers and has an 8-way set-associative 16 kilobyte L1 data cache and an 8-way setassociative 1 megabyte L2 cache. [9] Opteron The AMD Opteron processor supports the 32-bit x86 and 64-bit AMD64 instruction sets. The Opteron provides 16 64-bit general purpose integer registers and has a 2-way set-associative 64 kilobyte L1 data cache and a 4-way set-associative 1 megabyte L2 cache. [2, 1] UltraSPARC III The Sun UltraSPARC III processor [17, 32] supports the SPARC V9 instruction set. The Ultra-SPARC III provides 32 64-bit general purpose integer registers and has a 64 kilobyte 4-way set-associative L1 data cache and an 8 megabyte 2-way set-associative L2 cache.
32-bit processors
The Pentium 4 [8] is included for comparison only and is not expected to perform well with the tiled Bernstein method due to the unavailability of native 64-bit integer arithmetic and the small number of general purpose integer registers.
Pentium 4 The Intel Pentium 4 processor supports the 32-bit x86 instruction set. The Pentium 4 provides 8 32-bit general purpose integer registers and has a 16 kilobyte 8-way set-associative L1 data cache and a 1 megabyte 8-way set-associative L2 cache. [8] 
Hardware configuration and timing
The getrusage system call is used on all platforms to obtain timings. The hardware platforms used in this study are configured as follows.
Pentium EE We use a Pentium Extreme Edition 840 Dual-Core CPU with a clock speed of 3.2 GHz and 1 GB of main memory. The Gentoo Linux distribution with the 2.6.14-gentoo-r2 kernel is installed. Hyper-Threading is disabled in the BIOS.
Opteron We use an Opteron 244 with a clock speed of 1.8 GHz and 2 GB of main memory. The Gentoo Linux distribution with the 2.6.14-gentoo-r2 kernel is installed.
UltraSPARC III We use a Sun Blade 2000 with two 900 MHz UltraSPARC III processors and 2 GB of main memory. The Solaris 9 operating system is installed.
Pentium 4 We use a Pentium 4 with a clock speed of 3.0 GHz and 1 GB of main memory. The Fedora Core 2 Linux distribution is installed.
Compilation protocol
SACLIB monomial and SACLIB Bernstein
The SAC-LIB [6] programs IPRRID and IPRRIDB are compiled using gcc 3.4.4 with the flags "-O3 -march=nocona -m64" on the Pentium EE, gcc 3.4.4 with the flags "-O3 -march=opteron -m64" on the Opteron, Sun Studio 9 compilers [31] with the flags "-xO3" on the UltraSPARC III, and gcc 3.3.3 with the flags "-O3 -march=pentium4" on the Pentium 4.
Tiled Bernstein The tiled Bernstein method is compiled with the same compilers and flags as the SACLIB methods. An additional step for the tiled Bernstein method is searching for the correct register tile size. We search over tiles of size n × n for n = 4, 6, 8, 10, 12, 14, and 16. Smaller tiles are too small to offer a speedup and larger tiles will result in register spilling that will negate the locality of reference advantages of the tiled Bernstein method. The outcome of the tile search is shown in Figure 4 . Based on the outcome of the search we use register tile sizes of 12 × 12 for the Pentium EE and Opteron, 8 × 8 on the UltraSPARC III, and 6 × 6 on the Pentium 4. The dip that appears in the graphs is due to increased execution time of the tiled Taylor shift at degree 4100 caused by increased cache miss rate that is likely caused by low set associativity organization of Ultra-SPARC and Opteron data L1 caches. The dip also appears when GMP adds integers that are 4100 words long.
NTL On the Pentium EE, Opteron, and Pentium 4, NTL 5.4 [28] was compiled with the same compilers as the SACLIB method. Compiler flags are the defaults set by NTL. On the UltraSPARC III, NTL 5.4 was compiled with the Sun Studio 9 compiler [31] with the flags "-xO3 -xarch=v9b". Because of the way it performs multiplication, NTL is limited to 32-bit integer arithmetic; however, for compatibility, NTL was compiled to use the 64-bit ABI.
GMP On the Pentium EE, Opteron, UltraSPARC III, and Pentium 4, GMP 4.2 [15] was compiled using the same compilers as the SACLIB method. Compiler flags are the defaults set by GMP.
SYNAPS On the Pentium EE, Opteron, and Pentium 4, SYNAPS 2.4 is compiled with the same compilers and flags as the SACLIB method. On the UltraSPARC III, SYNAPS 2.4 is compiled with the Sun Studio 9 C++ compiler with the flags "-x03 -xarch=v9b". SYNAPS required minor porting before it could be compiled with the Sun Studio 9 C++ compiler.
Hanrot et al. 
Input polynomials
In our experiments we isolate the real roots of three kinds of polynomials.
Random polynomials are polynomials with integer coefficients of absolute value less than 2 20 ; the coefficients are pseudo-randomly generated from a uniform distribution. For each degree we generate 50 random polynomials and report average computing times for degrees 100, 200, . . . , 1000. For random polynomials, the Descartes method produces recursion trees that typically have few nodes.
Chebyshev polynomials are the polynomials defined by the recurrence relation
The roots of Chebyshev polynomials are well-known values of the cosine function. We apply the Descartes method to Chebyshev polynomials in order to obtain wide recursion trees with many nodes. We report computing times for degrees 100, 200, . . . , 1000. Since all these degrees are even, the corresponding Chebyshev polynomials are polynomials in x 2 . Since, for even n, the method by Hanrot et al. reduces Tn(x) to Tn( √ x) we apply the same preprocessing step also to the other methods. We call the polynomials Tn( √ x) somewhat ambiguously "reduced Chebyshev polynomials of degree n"; of course, deg(Tn(
Mignotte polynomials are defined by x n − 2 (5x − 1) 2 . We are not aware of any applications that involve Mignotte polynomials; however, the Descartes method generates extremely deep recursion trees for Mignotte polynomials, and it requires computing times that are approximately proportional to its worst-case computing time function. We report computing times for degrees 100, 200, . . . , 600.
An analysis of the computing time needed to isolate the roots of polynomials from these classes was given by Johnson [19, 20] . We measured the executions times of the five implementations of the Descartes method on the four processor architectures for input polynomials of various degrees from the three classes of polynomials. The raw data are given in Figure 9 . Figures 5, 6 , 7 and 8 show, for input polynomials of various degrees, the speed-up that the various methods obtain with respect to the SACLIB routine IPRRID. Speed-ups by an order of magnitude are typical. The largest speed-up is by a factor of 24, and it is obtained by the Bernstein-based variant of the Descartes method with register tiling on an Opteron processor for the Chebyshev polynomial of degree 1000.
RESULTS AND DISCUSSION
Processor architectures
The data show that register tiling works best on architectures where the optimal tile size is large. According to Section 3.3 and Figure 4 the optimal register tile size on the Pentium EE and Opteron is 12 × 12, on the UltraSPARC III it is 8 × 8, and on the Pentium 4 it is 6 × 6. In order to have a large optimal tile size an architecture must provide a large word length and a large number of registers. By Section 3.2, the Pentium EE and Opteron provide both, the Pentium 4 provides neither. Hence, register tiling achieves large speedups on the Pentium EE and Opteron and small speed-ups on the Pentium 4.
Register tiling can also profit from the instruction level parallelism (ILP) that is afforded by the number of integer execution units since the rows of each tile can be processed concurrently. The UltraSPARC III can execute up to 2 integer instructions per cycle. The Opteron processor is based on Athlon's QuantiSpeed architecture [3, pp. 250-252 per cycle. Both Pentium 4 and Pentium EE are based on the NetBurst microarchitecture [7] which is capable of executing up to 4 simple integer instructions per cycle on 2 ALUs per core. The difference in the number of integer execution units might explain the observation that the Descartes method involving register tiling is the one that profits the most when ported from the UltraSPARC III to the Pentium EE and Opteron. Since the adaptation to the new architecture is obtained by optimizing the register tile size, and this change is performed automatically, no human effort is required to take advantage of the new architectures. The data show that high performance can be achieved using a number of algorithmic devices. node of the recursion tree in the monomial variant from 3 Taylor shifts to 1 de Casteljau transformation.
Algorithms
A comparison between the SACLIB methods IPRRID and IPRRIDB shows that this approach is successfuldespite the fact that the initial transformation of the input polynomial into the Bernstein-base representation can increase the coefficient length.
2. Hanrot et al. achieve a similar reduction in the number of n 3 -operations by partial execution of certain Taylor shifts. In addition, their early termination test avoids all n 3 -operations at certain leaf nodes. For reduced Chebyshev polynomials this device reduces the number of complete Taylor shifts by 40%.
3. The use of the assembly-language integer addition routines of GMP makes the SYNAPS method faster than the SACLIB method IPRRIDB for polynomials with long coefficients, and it contributes to making the method by Hanrot et al. faster than the SACLIB method IPRRID.
4. The use of register tiling is orthogonal to devices (1) and (2) . In fact, in an additional experiment we replaced the complete Taylor shift in the method by Hanrot et al. with our tiled Taylor shift and obtained an additional speed-up by a factor of about 1.33.
The three implementations of the Bernstein-bases variant might be further sped-up by incorporating the early termination test by Hanrot et al.
The data show that-with minor exceptions-for all classes of polynomials, the best absolute computing times are achieved on the Opteron processor using the Bernstein-bases variant of the Descartes method with register tiling.
ACKNOWLEDGEMENTS
The authors would like to thank Jürgen Gerhard, Guillaume Hanrot et al., Bernard Mourrain et al., and George Collins, who developed IPRRIDB, for making their code available. 
