Abstract-This paper describes carry-less arithmetic operations modulo an integer 2 M −1 in the thousand-bit range, targeted at single instruction multiple data platforms and applications where overall throughput is the main performance criterion. Using an implementation on a cluster of PlayStation 3 game consoles a new record was set for the elliptic curve method for integer factorization.
I. INTRODUCTION
Numbers of a special form often allow faster modular arithmetic operations than generic moduli. This is exploited in a variety of applications and has led to a substantial body of literature on the subject of fast special arithmetic. Speeding up calculations using special moduli was already proposed in the mid-1960s by Merrill [1] in the setting of residue number systems (RNS) [2] . Other applications range from speeding up fast Fourier transform based multiplication [3] , enhancing the performance of digital signal processing [4] , [5] , [6] , to faster elliptic curve cryptography (ECC; [7] , [8] ), such as in [9] .
Another application area of special moduli is in factorization attempts of so-called Cunningham numbers, numbers of the form b n ± 1 for b = 2, 3, 5, 6, 7, 10, 11, 12 up to high powers. This long term factorization project, originally reported in the Cunningham tables [10] and still continuing in [11] , has a long and distinguished record of inspiring algorithmic developments and large-scale computational projects [12] , [13] , [14] , [15] , [16] , [17] . Factorizations from [11] with b = 2 are used in formal correctness proofs of floating point division methods [18] . Several of these developments [19] turned out to be applicable beyond special form moduli, and are relevant for security assessment of various common publickey cryptosystems.
This paper concerns efficient arithmetic modulo a Mersenne number, an integer of the form 2 M − 1. These numbers, and a larger family of numbers called generalized Mersenne numbers [20] , [21] , [22] , have found many arithmetic applications ranging from number theoretic transforms [23] to cryptography. In the latter they are used to run calculations concurrently using RNS [24] or to improve the speed of finite field arithmetic in ECC based schemes [20] , [25] . The great internet Mersenne prime search project [26] is based on an implementation of the Lucas-Lehmer primality test [27] , [28] for Mersenne numbers in the many-million-bit range. Hence, efficient arithmetic modulo a Mersenne number is a widely studied subject, not just of interest in its own right but with many applications.
Our interest in arithmetic modulo a Mersenne number was triggered by a potential (special) number field sieve (NFS) project [19] , for which we need a list of composites dividing 2 M − 1 for exponents M in the range from 1000 to 1200. The Cunningham tables contain over 20 composite Mersenne numbers (or composite factors thereof) in the desired range that have not been fully factored yet. It may be expected that some of these composites are not suitable candidates for our list because they can be factored faster using the elliptic curve method (ECM) for integer factorization [29] than by means of special NFS (SNFS). The only way to find out whether ECM is indeed preferable, is by subjecting each candidate to an extensive ECM effort (which, though it may be substantial, is small compared to the effort that would be required by SNFS): only candidates that ECM failed to factor should be included in the list.
The efficiency of ECM factoring attempts relies on the efficiency of integer arithmetic modulo the number being factored. Given the need to do extensive ECM pre-testing for over 20 composite Mersenne numbers, we developed arithmetic operations modulo a Mersenne number suitable for implementation of ECM on the platform that we intended to use for the calculations: the Cell processor as found in the Sony PlayStation 3 (PS3) game console. Because each ECM effort consists of a large number of independent attempts that can be executed in single instruction multiple data (SIMD) mode and because each core of the Cell processor can be interpreted as a 4-way SIMD environment, our arithmetic modulo a Mersenne number is geared towards SIMD implementation. It is described in Section III after a brief description of the Cell architecture in Section II. Although our implementations were written for the Cell processor, our methods apply to any type of SIMD platform, including graphics cards. Section IV sketches ECM, our Cell processor implementation, and lists some of our ECM results, including ECM record factorizations.
While the new ECM factorizations removed some of the easy cases from our list of candidate Mersenne numbers, the further practical implications of ECM records are limited to their consequence for two variants of the RSA cryp-tosystem [30] , namely RSA multiprime [30] and unbalanced RSA [31] . The former gains a speedup by a factor of r 2 or r 2 4 for the private operation in vanilla RSA or CRT-RSA, respectively, by selecting RSA moduli (of appropriate size to be out of reach of NFS) consisting of the product of r > 2 primes of about the same size. In unbalanced RSA, the RSA modulus has two factors as usual, but one is chosen much smaller than the other. In these variants, r and the smallest factor must be chosen in such a way that ECM has a sufficiently low probability to find the resulting relatively small prime factor(s). Our ECM findings affirm that 1024-bit RSA moduli with r ≥ 4 should be avoided [32] and may give practitioners of these variants some guidance how small the factors may be chosen.
II. THE CELL PROCESSOR AND ITS ARCHITECTURE
The Cell processor, the main processor of the PS3 and thus mainly targeted at the gaming market, is a powerful general purpose processor accessible on the first generation of these game consoles. This made the PS3 a relatively inexpensive and also flexible source of processing power, as witnessed by a variety of cryptanalytic projects: chosen prefix collisions for the cryptographic hash function MD5 [33] , [34] , the solution of a 112-bit prime field elliptic curve discrete logarithm problem [35] , and implementation of elliptic curve group arithmetic over a degree-130 binary extension field [36] .
The architecture of the Cell processor is quite different from that of regular server or desktop processors. Taking full advantage of it requires designing new software. It is worthwhile doing so, because architectures similar to the Cell's will soon be mainstream [37] . It not only helps us to take advantage of the Cell's inexpensive processing power, it also helps to prepare for future generations of processors. See Section IV-A for the rationale why the Cell processor was chosen as the platform for our ECM attempts.
The Cell has a Power Processing Element (PPE), a dualthreaded Power architecture-based 64-bit processor with access to a 128-bit AltiVec/VMX SIMD unit. Its main processing power, however, comes from eight Synergistic Processing Elements (SPEs). When running Linux, six SPEs can be used: one is disabled, and one is reserved by the hypervisor (but may become accessible too [38] ). Each SPE runs independently from the others at 3.192GHz, using its own 256 kilobyte of fast local memory for instructions and data. It has 128 registers of 128 bits each, allowing SIMD operations on sixteen 8-bit, eight 16-bit, or four 32-bit pairs of integers. An SPE has no 32 × 32 → 64-bit or 64 × 64 → 128-bit integer multiplier, but has several 4-way SIMD 16 × 16 → 32-bit integer multipliers including multiply-and-add instructions.
There are odd and even pipelines: in ideal circumstances an SPE can dispatch one odd and one even instruction per clock cycle. Most arithmetic instructions are even. Because the SPE lacks smart branch prediction, branching is best avoided (as usual in SIMD). Multiple SIMD processes may be interleaved, filling both pipelines to increase throughput, while possibly increasing per process latency. Here we took advantage of interleaving in another manner.
The Cell processor has also been made available to the supercomputing community by placing two Cell chips in a single blade server. They come with more memory than in the PS3 and on each Cell all eight SPEs are accessible. For high-performing blade servers a newer derivative of the Cell, the PowerXCell 8i, offers enhanced double-precision floating-point capabilities. Due to their significantly higher price, these compute nodes come at a price performance ratio quite different from the relatively inexpensive PS3.
III. ARITHMETIC MODULO 
In this section we describe the SPE-arithmetic that we developed for arithmetic modulo N = 2 M − 1, for M in the range from 1000 to 1200 (allowing larger values as well). Notice that the following description can easily be carried over to numbers of the form 2 M +1. Assume that M < 13·96−2 = 1246 (larger M -values can be accommodated by putting M < u·v −2 with v · (2 u−1 ) 2 < 2 31 ). Our approach aims to optimize overall throughput as opposed to minimize per process latency. Two variants are presented: a first approach where addition and subtraction are fast at the cost of a radix conversion before and after the multiplication, and an alternative approach where radix conversions are avoided at the cost of slower addition and subtraction. This second variant turns out to be faster for our ECM application. In applications with a different balance between the various operations the first approach could be preferable, so it is described as well. All our methods are particularly suited to SPE-implementation, but the approach may have broader applicability.
For k ∈ Z >0 a k-bit integer is an integer w with 0 ≤ w < 2 k . A signed k-bit integer is an integer w with −2 k−1 ≤ w < 2 k−1 . For r ∈ Z >1 a radix-r representation of an integer z with 0 ≤ z < r s is a sequence of s radix-r digits (w j )
such that z = s−1 j=0 w j r j and w j ∈ Z ≥0 . It is unique if 0 ≤ w j < r for 0 ≤ j < s. If 2 k ≥ r, a signed k-bit radixr representation of z is a sequence (w j ) s j=0 of signed k-bit integers such that z = s j=0 w j r j . We use signed radix-2 k representation for signed k-bit radix-2 k representation.
A. Related work
In [39] an SPE implementation is presented using arithmetic modulo the special prime 2 255 − 19 introduced in [9] . SPEarithmetic modulo a special prime is used in [35] to solve a 112-bit elliptic curve discrete logarithm problem on Cell processors. The SPE-performance of generic versus generalized Mersenne moduli is compared in [40] . SPE-arithmetic for moduli in the 200-bit range is presented in [41] , [42] ; on PS3s the former is more than twice faster than the latter. Different approaches to implement arithmetic over a binary extension field on SPEs are stated in [36] .
Our usage of a small radix to avoid carries (cf. below) is not new [3] , [43, Section 4.6] , [41] . In [41] Figure 1 : A 4-tuple (z1, z2, z3, z4) of 32s-bit integers in radix-2 32 representation using 128-bit registers r0, r1, . . . , rs−1.
modulo 195-bit moduli. Each addition done during a single schoolbook multiplication is carry-less, as for polynomial multiplication, requiring normalization to radix-2 13 representation only at the end of the big multiplication.
B. Representation of 4-tuples of integers modulo N
On the SPE it is advantageous to operate on four integers modulo N simultaneously, in 4-way SIMD fashion. Each 128-bit SPE register is interpreted as being partitioned into four 32-bit words. With s 128-bit registers thought to be stacked on top of each other, where 32s ≥ M , four different integers modulo N can be represented using four disjoint parallel columns, each consisting of s words: denoting the ith word of the jth register by w ij for i ∈ {1, 2, 3, 4} and j = 0, 1, . . . , s−1, the sequence (w ij ) s−1 j=0 is interpreted as the radix-2 32 representation of the 32s-bit integer
. This is illustrated in Fig. 1 . More generally, for any t ≤ 32 of one's choice, the sequence (w ij ) s−1 j=0 may represent the integer s−1 j=0 w ij 2 ti whose value depends on the interpretation of the words w ij : as an unnormalized radix-2 t representation if the w ij are interpreted as non-negative integers (normalized and unique if w ij < 2 t as well), and as a signed k-bit radix-2 t representation, for some k ≤ 32, if the w ij are interpreted as signed k-bit integers.
It should be understood that the integer operations described below are carried out in 4-way SIMD fashion on the SPE.
C. Addition and subtraction modulo N
Addition and subtraction in 4-way SIMD fashion on a pair of 4-tuples of integers modulo N in radix-2 t representation, with each 4-tuple represented by a stack of s registers of 128-bits (where ts ≥ M ), is done by applying s additions or subtractions to the matching pairs of registers (one from each stack), combined with a moderate number of carry propagations. Since N is Mersenne, the reduction modulo N (when needed) usually affects only two of the radix-2 t digits. More digits are affected with probability 2 −1−t−(M mod t) , in which case it causes a slight stall for the other three calculations in the 4-tuple.
For t = 32 the SPE's built-in carry generation instructions are used. For smaller t-values more work needs to be done. We describe the calculation of c = a + b mod N and d = a−b mod N (so-called addition-subtraction of a and b) given the signed radix-2 13 representations a = 95 j=0 a j 2 13j and
Step 5 in Section III-G). The following 5 steps are carried out:
3) Initialize the carry τ as 0. For j = 0 to 95 in succession first replace τ by τ + c j , next replace c j by τ mod 2 13 , and finally replace τ by τ /2 13 . The resulting τ is a carry corresponding to τ · 2 13·96 ; modulo N this carry is taken care of by adding τ ·2 α to c β (for γ = 13·96−M , β = γ/13 and α = γ − 13β ∈ [0, 12]) followed by a few more carry propagations. If there is still a carry, which occurs rarely, use a more expensive function. 4) Repeat the previous step with c replaced by d.
Steps 1, 2, and 5 allow arbitrary parallelization. Table I lists SPE clock cycle counts for the addition operations modulo 2 1193 −1: it can be seen that for signed radix-2 13 representation they are about twice as slow as for radix-2 32 representation.
D. Multiplication modulo N using radix conversions
Given a pair of 4-tuples of M -bit integers, the four pairwise products result in a 4-tuple of 2M -bit integers. The four reductions modulo N can in principle be done by means of a few of the above 4-tuple additions and subtractions modulo N . Here we present our first approach that uses two different radix representations, thereby making it possible to take advantage of the fast radix-2 32 addition and subtraction modulo N . In Section III-F another approach is described that is based on signed radix-2 13 representation. The multiplication modulo N of two M -bit integers a and b given by their radix-2 32 representations, each using 39 words of 32 bits, proceeds in three steps that are described in more detail in sections III-D1 through III-D3. The steps are: 1) conversion of inputs a and b to signed radix-2 13 representation; 2) carry-less calculation of the 2M -bit product a · b in signed 32-bit radix-2 13 representation; 3) reduction modulo N and conversion to radix-2
The following sections describe the steps in more detail.
1) Conversion of inputs to signed radix-2 13 representation: Given the radix-2 32 representation of the precomputed constant C 0 = 2 12 · 95 j=0 2 13j , first calculate the radix-2 32 representation of a + C 0 , in the usual way requiring carries. Next, using masks and shifts, extract the radix-2 13 representation (ã) 95 j=0 of a + C 0 , and finally subtract C 0 again by calculating a j =ã j − 2 12 , for j = 0, 1, . . . , 95 (because a 96 = 0 for our choice of M , it is dropped). The last two steps allow various straightforward parallelizations and run twice as fast (while requiring fewer registers) if two 13-bit chunks are packed into a single 32-bit word. Applying the same method to b, we find signed radix-2 13 representations of the inputs, below regarded as polynomials P a (X) = 95 j=0 a j X j ,
with P a (2 13 ) = a and P b (2 13 ) = b. 2) Carry-less calculation of the 2M -bit product in signed 32-bit radix-2 13 representation: The product polynomial j=0 of the product a · b = P (2 13 ). If M < 13 · w with w < 96, the degree of P (X) will be at most 2w − 2 < 190, which leads to savings here and in the description below.
The polynomial P (X) is calculated using three levels of Karatsuba multiplication [44] (but see Section III-F2 for the possibility to use more levels), resulting in 27 pairs of
. . , 27 (in the more general case where M < u · v − 2 we would use 16−u levels). This leads to 27 independent polynomial multiplications
, done using carry-less schoolbook multiplications. The polynomial P (X) is then obtained by carry-less additions and subtractions of the appropriate Q (k) (X)'s. 3) Reduction modulo N and conversion to radix-2 32 representation of the 2M -bit product: Given a signed 32-bit radix-
i=0 of the M -bit number c ≡ P (2 13 ) mod N is calculated. We use the following precomputed constants:
• Integers k j , l j and m j such that
Given these values, the following four steps are carried out, the correctness of which easily follows by inspection: 1) For 0 ≤ j < 191, computep j = p j + 2 31 (this allows arbitrary parallelization), so that 0 ≤p j < 2 32 . As a result (
For 0 ≤ j < 191, left shiftp j over k j bits and right shiftp j over 32 − k j bits, to obtain d j , e j such that
(this again allows arbitrary parallelization).
(where the indices j can be precomputed) and computẽ 
E. Optimizations
Swapping even for odd instructions. Modular arithmetic mostly relies on the SPE's arithmetic instructions, which are even pipeline instructions. Following the approach from [45] , [46] one may replace an even instruction by one or more odd ones with the same effect. Although this may increase the latency for the functionality of each replaced even instruction and the number of instructions, balancing the counts of even and odd instructions often increases the throughput. This method was used throughout our implementation. Examples are sketched below.
Modular squaring. When squaring polynomials of degree at most 11, half of the mixed products, i.e., = 66 multiplications, can be saved by doubling their resulting 21 sums (as the top elements are zero). Of these sums, the eleven for coefficients of odd degree can be doubled for free during the conversion to radix-2 32 , by using for odd j precomputed integersk j ,l j , andm j such that 13j + 1 =m j M + 32l j +k j with 0 ≤ 32l j +k j < M and 0 ≤k j < 32, instead of k j , l j , and m j , as defined earlier. The ten remaining sums need to be doubled before they are added to the corresponding squared input coefficient. Each doubling can be done by a single even pipeline addition. However, a doubling can also be performed by four odd pipeline instructions (or two doublings in six odd pipeline instructions). The ten remaining doublings could thus be squeezed in the odd pipeline, including all load and storage overheads (all 21 doublings would not have fit in the odd pipeline). As a result, all doublings required for squaring came for free.
Conversion to radix-2
32 . The computation of d j and e j requires shifts by k j and 32 − k j , respectively, for 0 ≤ j < 191, for a total of 382 even pipeline shift instructions. If k j ≡ 0 mod 8, each shift can be replaced by a single odd pipeline byte reordering instruction (or by no instruction if k j = 0). Shift counts bigger than 8 can be replaced by three odd pipeline instructions.
M -dependent optimization. For 0 ≤ j < 191 and most M we have j s.t. lj +1=i e j < 2 32 , since e j is obtained by a right 
Pa(X) and P (k) shift over 32−k j > 0 bits and the shift amounts usually differ. Thus, for such M the second summation in Eq. (1) does not generate carries.
We have written a program that generates SPE code for each value of M , with the applicable C 0 , C 1 , k j , l j , m j ,k j ,l j , andm j hard-coded and including all optimizations mentioned so far. The resulting code thus depends on the value of M used, with slightly varying performance between different Mvalues. Representative instruction and cycle counts for 4-way SIMD multiplication and squaring modulo 2 1193 −1 on a single SPE are given in Table I . Because 
F. Further speedups
Initial estimates indicated that the speed advantage of the radix-2 32 additions would outweigh the disadvantage of the conversion (in Section III-D1) to signed radix-2 13 representations required for the carry-less product calculation. Only after the code based on the methods described above had been used for about nine months (obtaining the results as reported in Section IV) and two further improvements had been developed, this issue was revisited. The two improvements, in sections III-F1 and III-F2 below, apply to the first approach as well. The alternative version of the method from Section III-D3 that normalizes (and reduces) the signed 32-bit radix-2 13 product to its signed radix-2 13 representation (as opposed to converting and reducing the product to radix-2 32 representation, as in Section III-D3) is presented in Section III-G.
1) Using C 1 ≡ 0 mod N in Section III-D3: Let γ = 13 · 191 + 18 − M , β = γ/13 and α = γ − 13β. To get nonnegativep j 's in the first step of Section III-D3, it suffices to putp 0 = p 0 + 2 31 31 + 2 α if β = 0). Thus C 1 in Section III-D3 is replaced by a value that is zero modulo N . This saves an addition (by C 1 ) in the final calculation of c in the fourth step of Section III-D3.
2) Karatsuba multiplication with multiply-and-add: A more substantial improvement is obtained by noting that for 26 out of the 27 k-values in Section III-D2 the coefficients of the polynomials P (k) a (X) and P (k) b (X) are signed 15-bit integers. Therefore, for these k another level of Karatsuba multiplication can be used for the calculation of Q (k) (X), while taking advantage of the SPE's multiply-and-add instructions. Some details are described below.
Let e, e , f, f be four polynomials of degree at most n − 1. To multiply the two polynomials e + e X n and f + f X n of degree at most 2n−1, calculate g = e−e and h = f −f (note the asymmetry). Defining ef = U + U X n , e f = V + V X n and gh = W + W X n , we have to calculate (e + e X
3n . This is done by calculating (using multiply-and-add when relevant) U and U in n 2 operations, next U + V and V using another n 2 operations, U + V + U (n additions) and U + V + V (n − 1 additions), and finally U + V + U + W and U + V + V + W using n 2 operations. In this way this final level of Karatsuba multiplication requires 3n 2 + 4n − 1 operations, which can be reduced to 3n 2 + 3n − 1 if g and h can be calculated twice as fast, as in our case. With n = 6 this becomes 125 operations for the calculation of each of the 26 Q (k) (X)'s to which this applies; the 27th one can be done in 144 operations, for a total of 3394 even instructions to calculate all Q (k) (X)'s. For n = 3 we get 3n 2 + 3n − 1 = 35 < 6 2 , but the remaining parts of the 12-to-6-Karatsuba step take more than 20 operations, so more than 3 × 35 + 20 = 125 operations per Q (k) (X).
Improving the method from Section III-D using sections III-F1 and III-F2 would lead to a speedup of slightly less than 10% for modular multiplication and a much smaller speedup for modular squaring. We have not used this improvement as it led to only a small speedup of the ECM application. Instead we combined the improvements with the method presented in Section III-G below as it was expected (and turned out) to lead to a more substantial speedup for the ECM application.
G. Multiplication modulo N using signed radix-2
13
Multiplication modulo N with inputs and output in signed radix-2 13 representation (and thus relatively slow addition operations) is obtained from the description in Section III-D by omitting the conversion in Section III-D1, keeping Section III-D2 in place (possibly improved as described in Section III-F2), and by replacing Section III-D3 by the reduction and normalization step described below.
Reduction modulo N and normalization to signed radix-2 13 representation of the 2M -bit product Given a signed 32-bit radix-2 13 representation (p j )
190
j=0 of the 2M -bit product a · b, regarded as the polynomial P (X) = 190 j=0 p j X j with P (2 13 ) = a · b, the signed radix-2 13 representation (c j )
12 . (All additions in steps 1 and 2 are combined at a total cost of 191 even addition instructions for steps 1 and 2.) 3) For 96 ≤ j < 191 let p j and p j be words such that p j = p j + p j 2 16 and 0 ≤ p j , p j < 2 16 , and replace p j by p j 2 k j and p j by p j 2 k j using odd instructions, where Steps 1, 2, 3, 4, and 6 allow arbitrary parallelization. The resulting SPE clock cycle counts are listed in Table I .
H. Comparison with other SPE implementations
Because an SPE runs at 3.192GHz and six are available per PS3, it follows from Table I that a single PS3 can perform 13.5 (17.8) million multiplications (squarings) modulo 2 1193 − 1 per second. This compares to 182 million and 138 million multiplications modulo 192-bit and 224-bit special moduli, respectively, as reported for a single PS3 in [40] , i.e., less than an 11-fold slowdown for 5-fold bigger special moduli.
For generic moduli the same carry-less Karatsuba-based multiplication applies. The basic approach to the more cumbersome reduction would reduce our performance by a factor of at most three, but we expect we can do much better. Compared to the roughly 102 million modular multiplications for generic moduli in the 200-bit range, as reported for a single PS3 in [41] , we would get at worst a 20-fold slowdown for 6-fold bigger generic moduli.
IV. APPLICATION TO ECM
A. Background on ECM ECM [29] attempts to factor a composite using a number of independent trials. The success probability per trial grows with the effort spent per trial, but decreases with the size of the smallest factor. Overall, the expected factorization effort for ECM (i.e., number of trials on the same composite times effort per trial) grows subexponentially with the size of the smallest factor. For (S)NFS the effort does not depend on the size of the factor(s) but just on the size of the number being factored and the quality of the polynomials selected for the factorization. For RSA moduli with two factors of about equal size, NFS is expected to be much faster than ECM. If there may be a relatively small factor (such as for composites of the form 2 M − 1), ECM may be more efficient than (S)NFS. Each ECM trial consists of two stages, stage one with bound B 1 , which is compute intensive but requires little memory, followed by a memory-hungry stage two with bound B 2 . Depending on the number of trials and the two bounds, the probability can be estimated that a factor up to a specific size, if present, will be found. To have probability at least e−1 e ≈ 0.632 to find a factor of up to 65 decimal digits (when present), 24 000 ECM trials with B 1 = 3 · 10 9 and B 2 ≈ 10 14 (the default B 2 of GMP-ECM) suffice [47] . For the same bounds and success probability, 110 000 trials suffice to find a 70-digit factor (when present). Before our work the largest prime factor ever found using ECM had 68 decimal digits [48] .
Using the GMP-ECM package [47] , [49] , with B 1 and B 2 as above, on a single core of a 2.2GHz Athlon 2427, stage one for an ECM trial for 2 M −1 with M around 1200 takes on the order of six hours, stage two takes about one hour requiring many GBytes of RAM (for generic composites of comparable size each stage takes about twice as long; more precise timings are presented in Table IV in Section IV-C below). For each composite of the form 2 M − 1 with 1000 ≤ M ≤ 1200 this implies up to 20 core years for an ECM attempt to find a 65-digit factor, and up to 90 core years for a 70-digit one. This should be compared to an SNFS effort ranging from on the order or 70 (M ≈ 1000) to several thousand (M ≈ 1200) core years. Thus, the larger M , the harder we should first try with ECM, commensurate with the expected SNFS effort and the probability that a candidate has a small factor.
Each ECM trial performs a particular sequence of additions, subtractions, and multiplications, all modulo the number being factored. The pattern in this sequence depends only on B 1 . Modular inversions can mostly be avoided. Stage one can easily be run in parallel in SIMD fashion for any number of trials. During a large scale ECM effort, overall throughput of trials is, within reason, a more important performance measure than latency per trial: for instance, being able to process four trials simultaneously in one day is better than processing (on the same platform) one trial every eight hours.
Rationale to use Cell processors for ECM on 2 M −1. Factoring numbers of the form 2 M − 1 is a "popular" activity [11] and hunting for relatively small factors is not hard given several freely available ECM packages. Nevertheless, given the efforts involved, we considered it likely that several of the unfactored composites 2 M − 1 with 1000 ≤ M ≤ 1200 have a factor that can be found more economically by ECM than by SNFS. Given our research interest in the ones that cannot (relatively) easily be factored by ECM, we decided on an ECM effort down our list of at least 20 candidates, aiming to find all factors of up to, roughly, 65 digits. Since it was meant to be a simple production run, we chose to use the off-the-shelf GMP-ECM package, because it is free, easy to use, has an excellent track-record, and can take advantage of the special form of the number 2 M −1. Other packages may be faster, but we were not familiar with them [50] . Notice, that if some small factors of 2 M −1 are known it is still faster to use the arithmetic modulo this Mersenne number than modulo the smaller composite.
The overall computation for these 20 candidates requires at least 20 × 20 = 400 core years and can in principle be done on regular server-clusters. But that would be a waste of resources, because about 6 7 th of the time is spent in stage one, which requires little memory thereby underutilizing the available RAM.
We also have access to a cluster of 215 PS3s, and thus to 215 Cell processors comprising a total of 1290 SPEs with little memory per SPE. It could therefore be more economical for us to use those SPEs to do all stage one calculations, and to do the relatively small stage two effort whenever servers with adequate RAM would otherwise be idle. To test this we ported stage one of GMP-ECM to the SPE, trying a variety of home-grown SPE-specific arithmetic packages (which were already known to outperform [51] ). In the course of these early experiments we stumbled upon a 63-digit prime factor (of 2 1187 − 1). This showed that conducting a thorough ECM search indeed makes sense, and stimulated development of the much faster SPE-arithmetic modulo 2 M − 1 described in Section III.
It was not our goal to improve the ECM package that we put on top of our enhanced arithmetic. It is likely that improvements reported over GMP-ECM that are based on different elliptic curve arithmetic or representations, such as, for instance, described and implemented in [52] , [50] , apply to our overall performance figures as well.
ECM on the Cell processor to support (S)NFS. Although ECM factorizations have little cryptographic significance, this does not imply that ECM performance is cryptographically irrelevant as well. In [53] , for instance, it is observed that high performance ECM implementations on relatively inexpensive devices (given their computational power, such as on graphics cards (GPUs)), may be helpful for future (S)NFS projects. A particularly memory-hungry step of (S)NFS, sieving, generates large quantities of fairly small (100-to 200-bit) composites that must be factored. That task requires little memory and is therefore best outsourced to cheap devices, so sieving is not interrupted and all resources are used in a cost-conscious fashion. This area has seen a flurry of recent activity: see [54] , [55] , [56] for implementations on reconfigurable hardware such as field-programmable gate arrays and [53] , [41] for GPUs. In [41] the Cell architecture is covered as well.
B. ECM on the Cell applied to 2 M − 1 Table II lists the numbers of modular arithmetic operations carried out by stage one of a single ECM trial with bound B 1 = 3 · 10 9 (cf. Section IV-A) when using GMP-ECM. When run on an SPE, four stage one trials are run simultaneously. With the operations from Section III, their cycle counts (cf . Table I) , and the SPE's 3.192GHz clock speed, this leads to an estimated time of less than 22 hours on a single SPE to complete four stage one ECM trials with bound B 1 = 3 · 10 9 using our first approach from Section III-D, and a more than 10% speedup when using the approach from Section III-G along with the improvements from Section III-F. The measured wall-clock times are slightly larger than the estimates. For applications where additions play a more important role, the method from Section III-D may outperform the method from Section III-G (where both methods are enhanced using Section III-F).
With six SPEs per Cell processor and 215 Cell processors in the PS3-cluster, 4 × 6 × 215 = 5160 stage one ECM trials can be processed in less than 20 hours. With 24 000 trials (cf. Section IV-A), stage one for a 65-digit search takes less than four days; stage one for the 110 000 trials for a 70-digit search takes two and a half weeks. Using our multi-core adaptation of stage two of GMP-ECM, the corresponding stage two calculations (with B 2 = 103 971 375 307 818) take the same time when using 4 cores per node on a 56-node cluster (with two hexcore processors per node): each trial takes 15 minutes on 4 cores, using at most 16 GBytes of RAM. Thus, the efforts of the two clusters involved in our calculations are well matched. for M = 1163, a 66-digit factor for M = 1073, a 63-digit factor for M = 1051, a 68-digit factor for M = 1139, and a 70-digit factor for M = 1237. The 241-bit, 73-digit prime factor of 2 1181 − 1 is the current ECM record, beating the previous record by 5 digits. The factor was found after somewhat more than 25 000 stage one trials at approximately the 8800th corresponding stage two trial, implying that we were quite lucky finding it. Less, but still considerable, luck was involved in finding the second 73-digit factor (a bit smaller at 240 bits): it was found after about 50 000 ECM trials. So far our example number 2 1193 − 1, with known factor 121687, stubbornly resisted all ECM efforts to be factored after running 142 162 ECM trials on it. For the numbers 2 M − 1 that we fail to factor using ECM, such as (so far) M = 1193, our efforts will result in a reasonable degree of confidence that they will not have a prime factor of 65 digits or less. Only for M = 1051 and M = 1237 did we find composite cofactors: for M = 1051 the attempt was continued and the 63-factor was indeed re-found where it could be predicted (once it had been found), but the c248 cofactor remained unfactored. Table III lists all results obtained using the slower approach from Section III-D, with ck and pk denoting a k-digit composite and prime, respectively. For exponents M ∈ [1000, 1140] (M ∈ [1141, 1200]) not stated in Table III roughly 50 000 (100 000) ECM trials have been completed with bounds as above without finding a factor.
Using the improved arithmetic we have so far found one factorization: for M = 961 we found that c254 = p61·p193 after 1190 curves with B 1 = 10 9 and B 2 = 25 427 965 563 016. The improved arithmetic is also being used for numbers of the form 2 M + 1 and several factors have already been found. To put this number into perspective, we did the same computation using GMP-ECM 6.3 powered by GMP 5.0.1 [57] (both the latest versions at the time of writing) using all cores on a variety of processors, with optimal multiplication parameters obtained using the tune-up script, and taking advantage of the special Mersenne-arithmetic available in GMP-ECM. Table IV lists the results. On a per-core basis, and accounting for the ratio in clock-speeds, our special 4-way SPE Mersenne arithmetic turns out to about 4 3 times more effective than the regular Mersenne arithmetic from GMP-ECM 6.3 when run on Intel processors, despite the fact that the SPE does not have 64-bit or 32-bit integer multiplications. The lack of such multipliers is clearly to the SPE's disadvantage when comparing it to the AMD processor with its much faster (than Intel) integer multiplication. The more recent generations of processors, like the 12-core AMD Opteron, are catching up with the performance of the PS3.
V. CONCLUSION
For integers M in the range from 1000 to 1200 we presented our Cell processor implementation of multiplication of Mbit integers, processing 24 such multiplications in parallel on a single PlayStation 3 game console, and used it to obtain efficient multiplication modulo 2 M − 1. The ideas underlying our implementation apply to many arithmetic contexts of cryptologic relevance. We focused on application to elliptic curve factoring, which led to the three largest ECM factors found so far. 
