This paper describes the first algorithm to compute the greatest common divisor (GCD) of two n-bit integers using a modular representation for intermediate values U , V and also for the result. It is based on a reduction step, similar to one used in the accelerated algorithm [T. Jebelean, A generalization of the binary GCD algorithm, in: ISSAC '93: International Symposium on Symbolic and Algebraic Computation, Kiev, Ukraine, 1993, pp. 111-116; K. Weber, The accelerated integer GCD algorithm, ACM Trans. Math. Softw. 21 (1995) 111-122] when U and V are close to the same size, that replaces U by (U − bV )/p, where p is one of the prime moduli and b is the unique integer in the interval (−p/2, p/2) such that b ≡ UV −1 (mod p). When the algorithm is executed on a bit common CRCW PRAM with O(n log n log log log n) processors, it takes O(n) time in the worst case. A heuristic model of the average case yields O(n/ log n) time on the same number of processors.  2004 Elsevier Inc. All rights reserved.
Introduction
By using a modular representation for integers [3, Section 4.3.3] we can efficiently perform certain arithmetic operations on parallel processors having a large number of processing elements. Each processor simultaneously performs the addition, subtraction or multiplication operation modulo a separate prime. However, since comparison and general division are expensive when the operands are in modular representation (a conversion of the operands to another representation must be performed), the modular representation has been unattractive for use in integer greatest common divisor (GCD) algorithms.
Chapter 6 of [4] describes an algorithm to compute the greatest common divisor of two n-bit integers using modular representation for intermediate values. As far as we know, it was the first integer GCD algorithm to use a modular representation. This algorithm avoids comparisons altogether, and relies only on special case integer divisions that can be performed with the operands still in modular representation.
The reduction step of this algorithm is the same as used in Sorenson's right-shift k-ary GCD algorithm [5] , and as the reduction used when the intermediate values are nearly the same size in the accelerated GCD algorithm, independently discovered by Jebelean and Weber [1, 2] . If U and V represent the intermediate values whose GCD is equal to that of the original input pair, then this reduction replaces U by (aU + bV )/m, where |a|, |b| < √ m and m is one of the moduli used to represent the intermediate values U and V . This reduces the size of the problem by roughly log 2 m/2 bits. However, spurious factors are introduced that must be eliminated after converting the result back to standard representation.
The new algorithm is based on the following reduction step. Suppose that P is a set of (odd) prime numbers relatively prime to V ; then the main loop of the new algorithm uses the reduction step U ← (U − bV )/p, where p ∈ P, and b ≡ U · V −1 (mod p) is chosen to lie in the interval (−p/2, p/2). The modulus p is chosen so that |b| is minimal among all moduli in P. This is very similar to the reduction used by Jebelean in the EDGCD algorithm [6] , and the reduction used by the accelerated algorithm when the difference in sizes of U and V is large, except that in these cases a single fixed modulus p t is used, and b is computed in the range [0, p t − 1] . Although this reduction is not as efficient as Sorenson's k-ary reduction is when U and V are nearly the same size, "cleanup" of spurious factors is not necessary. This allows the GCD operation to be used in combination with other arithmetic operations to perform a larger calculation totally in modular representation before it is necessary to convert back to standard representation.
An algebraic common CRCW PRAM, as described in [7] , is a parallel random access machine on which each processor can access w-bit integers in a global shared memory, and compute w-bit addition, subtraction, comparison, multiplication, quotient and remainder of integers, all in unit time. We define a w-bit modular common CRCW PRAM to be an algebraic common CRCW PRAM with the additional capability of each processor to compute w-bit modular inverses in unit time.
In the worst case, the new algorithm runs on a w-bit modular common CRCW PRAM in O(n + 2 w/2 ) time using O(n + 2 w/2 ) processors. At worst the main loop will never iterate more than n + 2 times; experimental data taken from execution of a sequential implementation of the algorithm suggests that in the average case the main loop will execute a little more than 2n/w times-as long as the moduli are w-bit primes, and there are at least n + 2 w/2 moduli and processors. A heuristic model of the distribution of the b values allows for an average case analysis that closely matches the experimental data. For enough moduli, this analysis gives an average case running time of O(n/ log n + 2 w/2 ) on O(n + 2 w/2 ) processors, which translates to O(n/ log n) time and O(n log n log log log n) processors on a bit common CRCW PRAM, when w = Θ(log n). Thus the heuristic average case analysis matches the fastest algorithms presently known-those of Chor and Goldreich [8] , Sorenson [5] , and Sedjelmaci [9] , whose worst-case analyses all give O(n/ log n) time on O(n 1+ε ) processors, where ε is any positive constant. We believe this new algorithm has the potential to be better suited for massively parallel machines than any of these algorithms, especially when embedded inside other computations using a modular representation for intermediate values.
Throughout the sequel we use a mod b, where a is any integer and b is a positive odd integer, to represent the symmetrical modular representation [10, Problem 10, p. 277] ; that is, the unique integer c ∈ (−b/2, b/2) such that c ≡ a (mod b). The expression u/v mod m is, of course, another way to write uv −1 mod m, where v −1 is the multiplicative inverse of v in Z/mZ. In certain parts of this paper, the base of the log function is significant; we use ln x = log e x and lg x = log 2 x, as in [10] , in order to eliminate ambiguity. An integer t is a w-bit integer iff 2 w−1 t < 2 w . Finally, we use π(x) to represent the number of primes less than or equal to the real number x, and p k to represent the kth prime, where 2 is counted as the first prime.
In the next section the basic algorithm is presented and proven to correctly compute the gcd. An upper bound for the number of iterations of the main loop is derived, and the average number of iterations is estimated via the heuristic model mentioned previously. The full algorithm is presented in detail in the section following. The two sections after that are devoted to an analysis of the complexity of the full algorithm: the first presents properties of the modular common CRCW PRAM, and the second uses those properties in the actual analysis of the algorithm. Results from experiments with a sequential implementation are then provided. We conclude with some miscellaneous remarks and ideas for future work.
Basic algorithm
A basic version of the algorithm, suppressing the modular representation for clarity, is displayed in Fig. 1 . A set of prime moduli is first selected according to criteria which guarantee that the main loop will execute no more than n + 2 times. Then the algorithm enters the main loop; in each iteration the reduction step (line 8) reduces the number of bits in U and V , using one of the U/V mod p values that is smallest in absolute value. The division by p in the reduction step is exact, and can be performed in modular arithmetic, as long as the divisor is relatively prime to all moduli; thus p must be eliminated from Q before the division occurs (line 7).
Our first concern is to show that the GCD of the original inputs is preserved throughout the main loop.
Input: Positive integers U and V , with U V Output: gcd(U, V ) 
We now show that the values of U and V get smaller and smaller, and determine how many iterations of the main loop are needed to make V = 0. We shall use the following notation throughout the remainder of the paper. Let Proof. First note that the upper bound on the number of iterations may be derived from the inequality, since it implies that |V i | = 1 after i = n + 1 iterations, forcing V i+1 = 0. To establish the inequality, one can prove by induction that
From this, and the fact that lg(1 + x) x lg e for nonnegative x, we get lg
for all 1 i n + 1, where the index p in the sums takes on only prime values. One can show that 3 lg eS i < 2 for all iterations of the loop and for all primes m 29 = p 10 by using this last inequality, together with the following three inequalities from Rosser and Schoenfeld [11, ineqs. 3.6, 3.13, 3.17 and 3.20]: p k < k(ln k + ln ln k) for k 6, π(x) < 1.25506x/ lnx for x > 1, and
The next lemma gives a lower bound on the size of P i . Note that it also shows that the choice of Q 0 guarantees that P i is never empty.
Lemma 3. There will be at least
The result follows from this and the facts that Q i−1 = Q 0 − i + 1 and, by the previous theorem, i n + 2 for any iteration of the loop. 2
The remainder of this section is dedicated to a heuristic analysis of the average number of iterations required by the main loop. Experience with the experimental implementation of the algorithm, discussed in Section 6, suggests that, when using w-bit moduli, as long as there are at least 2 w/2 + n moduli, and as long as U i−1 and V i−1 each has more than 2w bits, the values |U i−1 /V i−1 mod p| for p ∈ P i are fairly uniformly distributed over the interval [0, p/2). This suggests that we may treat them as uniformly distributed, mutually independent random variables in order to obtain a first approximation to the expected value E[
Based on this assumption, and the fact that the minimum of ν independent and identically distributed random samples from [0, 1) has expected value 1/(ν + 1) (see [12, p. 182] , for example), we get E[|b i |] 0.5M/( P i + 1) < 0.5M/ P i . We can now use this to provide an upper bound on the average number of iterations of the main loop, provided we make one more simplifying assumption: that b i and V i−1 are independent as well. We will refer to the complete set of assumptions as our heuristic model. 
Lemma 4. Based on our heuristic model, if the set of moduli Q is chosen so that
Q 0 n + 1 + 0.5m −1/2 M,
Modular algorithm
In Fig. 2 we give the full version of the algorithm, in which we finally include the details of the modular representation.
Step MGCD1 chooses a set of w-bit primes for the moduli. Note that the requirements in line 2 of Fig. 1 are met; we have Q = 2 w/2 + n n + 2 and π(m) > π(2 w−1 ) > π(2 w ) − π(2 w−1 ) max{n, 9}, since π(2x) < 2π(x) for x 3 (from [13] , quoting [14] ).
Step MGCD2 converts the input integers U and V into modular representation. The construct "For all i ∈ I do . . ." indicates parallel execution. Set notation is used for indexing here because it does not implicitly specify an order of execution.
Step MGCD3 is the actual reduction loop, resulting in a modular representation of the greatest common divisor of U and V .
Step MGCD4 computes a balanced mixed-radix representation [10, ex. 10 soln., p. 586], using modular arithmetic, and simultaneously produces a standard representation of the result from the mixed-radix representation.
There is a rather large supply of w-bit primes. Using the approximation π(x) ≈ x 2 dt/ ln t [10, pp. 366-367], we see that there are roughly 9.8 × 10 7 32-bit primes and 2.1 × 10 17 64-bit primes. In addition, w-bit primes and smaller may be used, if necessary, although the details for such a modification are left to the interested reader. Thus, for all practical purposes, the number of available primes seems sufficient to utilize even the large numbers of processors current research suggests will be available in future systems (see [15] , for example).
We close this section by showing that there will be enough primes left in Q to reconstruct the standard representation of the result. 
Proof. Note that
U N = V N−1 = 0. By Theorem 2, lg |V N−1 | < n − N + 3, so q∈Q N q 2 (w−1) 2 w/2 +n−N > 2 n−N+4 > 2|V N−1 |. 2
Modular common CRCW PWAM
The full algorithm presented in the previous section seems naturally suited for implementation on the algebraic common concurrent read, concurrent write, parallel random access machine (CRCW PRAM) used in [7] , augmented by a constant-time modular inverse operation. In this section we show that the minimum needed on line 3 of MGCD3 can be computed in O(1) time on such a computational model, using only a small number of processors. We also compute the additional costs for emulating on a bit common CRCW PRAM an algorithm analyzed for this model.
Define a w-bit modular common CRCW PRAM to be an algebraic common CRCW PRAM, in which each processor is capable of performing w-bit memory accesses and w-bit integer arithmetic in unit time, and is also able to compute the inverse of a w-bit integer modulo another w-bit integer in unit time. 
Lemma 7. The maximum and minimum of a set A of integers in the range

Performance analysis
In this section we provide an asymptotic analysis of the performance of the full algorithm. We first analyze the algorithm on the w-bit modular CRCW PRAM, then extend the analysis to the bit-common CRCW PRAM model, as used in [5] . First we compute the total cost of the algorithm on a w-bit modular common CRCW PRAM. Proof. The total worst-case cost is obtained by taking the maximum of the costs of the major steps, which are cataloged below.
Theorem 9. On a w-bit modular common CRCW PRAM the modular GCD algorithm takes O(n +
2
Step MGCD1 requires O(2 w/2 ) time and O(2 w/2 ) processors
This can be done by emulating a sieve algorithm for an EREW PRAM, described by Sorenson and Parberry [7] , with no increase in time or number of processors [16, Section 3] .
Step MGCD2 requires O(n/w) time and O(n + 2 w/2 ) processors A naive algorithm can be used to divide an n-bit integer by a w-bit integer in O(n/w) time, so this step can be computed by first calculating one half of the 2 2 w/2 + n divisions simultaneously, then the other half.
Step MGCD3 requires O(n) time and n + O(2 w/2 ) processors
Each line in the main loop takes O(1) time on 2 w/2 + n processors. We get the cost for line 3 by applying Lemma 7 twice; once to compute the minimum value, and once to choose a prime for which the minimum is achieved. Note also that the universal quantifier in line 7 can be calculated as the negation of an existential quantifier, which can be computed as an inclusive-OR operation in O(1) time on an 2 w/2 + n processor common CRCW PRAM [16, p. 896] . By Theorem 2, the number of iterations in the worst case in the main loop in Step MGCD3 is bounded by n + 2, thus giving the time and number of processors listed.
Step MGCD4 requires O(n/w) time and O(n + 2 w/2 ) processors Line 5 causes |G| U 0 to grow by no more than w bits each iteration of the loop, so the main loop can execute no more than n/w times. The multiplication on this line can be performed in O(1) time by O(n) processors, by adapting a technique for bit common CRCW PRAMs given in [8] . The remainder of the lines in the loop, as well as the lines before and after the loop, can be performed in constant time, as in Step MGCD3. Note particularly that the choice of an element of Q in line 3 may be done in O(1) time using O(n + 2 w/2 ) processors, according to Lemma 7.
The average cost is obtained by using Theorem 5 to get O(n/w) iterations of the main loop on average. 2 By combining Lemma 8 with the previous theorem, and noting that the requirements placed on w by Step MGCD1 are met when w = Θ(log n), we get the following.
Theorem 10. On a bit common CRCW PRAM, the modular integer GCD algorithm takes O(n+2 w/2 ) time and O((n+2 w/2 )w log log w) processors. The heuristic model for E[|b i |]
gives an average running time of O(n/w + 2 w/2 ) for the same number of processors. By choosing w = Θ(log n), the costs become O(n) for worst-case time, O(n/ log n) for average-case time, and O(n log n log log log n) for the number of processors.
Experimental implementation
The following tables and graphs summarize the results obtained from experimentation with a sequential implementation of the modular algorithm. The experimental implementation allows one to choose w and Q 0 without requiring that Q 0 n + 2 w/2 . The table in Fig. 3 registers the average number of iterations required as a function of input size n and the average size, in bits, of the minimum of the b values. For each size n the algorithm was run with ten different pairs of input values, pseudorandomly chosen in the range [0, 2 n ), using Q 0 = 2 17 consecutive prime moduli. The largest modulus was 2 32 − 5;
Input size n Average number of iterations Average b size 2 8 17.0 13.74 2 9 33.0 13.92 2 10 65.2 13.75 2 11 129.2 13.73 2 12 257.6 13.63 2 13 513.9 13.70 2 14 1026.2 13.68 2 15 2051.9 13.73 . This fits nicely with intuition, since one would expect each iteration to shorten one of U and V by w bits.
The table in Fig. 4 reveals the number of iterations the modular algorithm performs with n fixed at 2 12 and Q 0 at 2 17 . We plot lg(w) vs the number of iterations and lg(w) vs 2n/w in the graph in Fig. 4 . As w grows larger, the gap between the number of iterations and 2n/w also grows. This is because the critical relationship Q 0 n+2 w/2 is no longer maintained for Q 0 34. The table in Fig. 5 presents the number of iterations executed by the modular algorithm as a function of Q 0 , the number of moduli. In this and vary the number of moduli. It shows that the number of iterations asymptotically approaches 2n/w as Q 0 grows. Figure 6 shows the ratio of the number of iterations required by the accelerated algorithm [1, 2] to the number of iterations required by the modular algorithm for the tests summarized in Fig. 3 . This ratio is a useful measure of the relative performance of the modular algorithm, since the accelerated algorithm is very efficient in the number of iterations of the main loop; it uses the k-ary reduction [5] Fig. 3 , we examined the distribution, at each iteration, of the size of b, by counting the number of times the size is i bits. Table 1 shows the percentage of these values for i = 1, . . . , 17 for one particular (randomly chosen) iteration. Nearly 50% of the b values lie between 2 30 and 2 31 , 25% lie between 2 29 and 2 30 , 12.5% between 2 28 and 2 29 , and so on. This is as one would expect for a uniform distribution of the values. 
Remarks and future work
We close with some miscellaneous remarks and ideas for future work relating to the new algorithm.
The same techniques used to extend Sorenson's GCD algorithm to also compute A and B such that G = AU + BV (see [5, Section 7] ) can be used to extend the new modular algorithm, with G, A, and B all expressed in modular representation before conversion back to standard representation is done. More moduli, and hence, more processors, are needed to assure recovery of A and B, since they grow in size as the algorithm progresses. Note also, that although the algorithm presented above restricts moduli to primes, a set of relatively prime moduli should also suffice with only minor complications to the algorithm (such as the test to see whether a modulus is relatively prime to V ).
Attempts to implement a parallel version of the accelerated algorithm [17] on a shared memory multiprocessor have been disappointing, mainly because the algorithm requires fine-grain parallelism. For the same reason it seems likely that it would also be difficult to provide good implementations of the three algorithms mentioned above on a shared memory multiprocessor. The new modular algorithm would most likely suffer the same problem; however, it seems particularly well suited to large SIMD or data-parallel systems that can perform arithmetic on 32-bit words or larger, and better suited to such a system than the other three. Although SIMD architectures have suffered recently from a lack of popularity, they are still popular in certain problem domains, and at least one such architecture is commercially available at the present time [18] . Recent research in micro-architectures [15] indicates that future architectures may very well be descendants of today's SIMD designs.
The major task remaining is to provide a more sophisticated analysis of the worst-case and/or average case behavior; we believe that it can be proven that the algorithm can be shown to execute in O(n/w + 2 w/2 ) time in all cases by showing that the minimum |U/V mod p| is always small with respect to 2 w . Another possibility for future work is to compare the performance of the modular algorithm and the other three algorithms on a data-parallel system. Finally, it may be possible to apply modular techniques to Schön-hage's algorithm; since it takes O B (M(n) log n) time [19, Section 8.10] , where M(n) is the time it takes to compute n-bit integer multiplication, it may be possible to use modular representation to reduce the complexity contributed by multiplication to the overall problem's complexity.
