Abstract. Bos, Kaihara, Kleinjung, Lenstra, and Montgomery recently showed that ECDLPs on the 112-bit secp112r1 curve can be solved in an expected time of 65 years on a PlayStation 3. This paper shows how to solve the same ECDLPs at almost twice the speed on the same hardware. The improvement comes primarily from a new variant of Pollard's rho method that fully exploits the negation map without branching, and secondarily from improved techniques for modular arithmetic.
Introduction
The purpose of this paper is to tell people something that is stated already in textbooks. ← Suppressed introduction [18] .
In July 2009, Bos, Kaihara, Kleinjung, Lenstra, and Montgomery announced breaking a discrete-logarithm problem on an elliptic curve over a 112-bit prime field using a cluster of 214 PlayStation 3 (PS3) game consoles. The initial announcement was [3] ; more details on the same computation were published in [5] , [4] , and [6] . The overall attack costs were estimated to be about 60 PS3 years. This computation was, and still is, the largest ECDLP for which a successful solution has been publicly announced.
Our main result is that discrete-logarithm problems on the same curve (or any other curve of the form y 2 = x 3 − 3x + b over the same field) can be solved at almost twice the speed on exactly the same hardware. We performed extensive computational experiments to verify our scalability and performance claims; details of these experiments appear in Section 6. This result combines several different optimizations. A large part of our work consists of faster algorithms for arithmetic modulo the prime (2 128 − 3)/76439 that defines this field; these algorithms use a different approach from the previous papers, and we describe this approach in detail. We also introduce several smaller refinements in rho computations, often co-designing our choice of iteration function with our lower-level optimizations. However, the largest part of our improvement, gaining a factor of almost √ 2, comes from careful use of the negation map.
The conventional wisdom is that the negation map has been known for ten years and trivially gains a factor of √ 2. However, [3] said "We did not use the common negation map since it requires branching and results in code that runs slower in a SIMD environment." SIMD (single instruction, multiple data) is a critical feature of modern CPU designs, including the Cell processor used in the PS3.
Similarly, [5] said that the benefit of negation was outweighed by "the conditional branches required to check for fruitless cycles." The paper [6] observed that most of the negating rho algorithms stated in the literature were nonfunctional (i.e., had negligible chance of succeeding in the claimed amount of time); considered a huge array of 126 different combinations of negation options; and concluded that the best option did save time for non-SIMD architectures, but with a speedup far below √ 2. The paper continued to dispute the possibility of a negation speedup for SIMD architectures:
If the Pollard rho method is parallelized in SIMD fashion, it is a challenge to achieve any speedup at all. . . . Dealing with cycles entails administrative overhead and branching, which cause a non-negligible slowdown when running multiple walks in SIMD-parallel fashion. . . . [This] is a major obstacle to the negation map in SIMD environments. This paper resolves this dispute by explaining how to use the negation map without branching and without significant overhead handling cycles. We demonstrate a speedup very close to √ 2 on the PS3. We comment that the speedup on non-SIMD architectures would be even closer to √ 2.
Review of Pollard's rho method
This section gives an overview of Pollard's rho method. In this paper we will use this method to compute discrete logarithms on elliptic curves but the first subsections apply to any finite cyclic group G = P and Q ∈ G. Computing the discrete logarithm of Q to the base P means computing an integer k such that Q = kP . The integer k is unique modulo where is the order of P ; we assume for simplicity that is an odd prime.
The generic rho method. Pollard's rho method [23] is a low-memory algorithm that finds a discrete logarithm by finding a collision in the map (a, b) → aP + bQ where a, b ∈ Z.
Finding a collision usually reveals the discrete logarithm k of Q to the base P : if aP + bQ = a P + b Q and b ≡ b (mod ) then k ≡ (a − a )/(b − b) (mod ). A generic way to find this collision is to iterate this function. Define maps a and b from P to Z and compute W i+1 = f (W i ) = a(W i )P + b(W i )Q, starting from some initial combination W 0 = a 0 P + b 0 Q. If any W i and W j collide then also W i+1 = W j+1 , W i+2 = W j+2 , etc. This means that the sequence enters a cycle. This can be detected efficiently using, e.g., Floyd's cycle-finding method.
If the functions a(R) and b(R) are random modulo then the iterations perform a random walk in P . The random walk can be modeled as drawing objects with repetition from an urn containing elements. A collision corresponds to drawing the same element twice. A standard birthday-paradox calculation implies that a collision occurs after approximately π /2 iterations on average.
Use of group automorphisms. If the functions a and b are chosen such that f (W i ) = f (−W i ) then the walk is actually defined on equivalence classes under ±. There are only /2 different classes. This reduces the average number of iterations by a factor of almost exactly √ 2. More generally, Pollard's rho method can be combined with any easily computed group automorphism σ of small order. One chooses a and b so that
The walk is then defined on equivalence classes under the automorphisms, reducing the average number of iterations accordingly. However, for most elliptic curves the only easily computed group automorphism of small order is the negation map.
The parallel rho method. To spread the computations in Pollard's rho method across multiple computers one replaces the long walk by a collection of short walks, as proposed by van Oorschot and Wiener in [22] . Some fixed subset of P is declared to be the set of distinguished points. Whenever a walk reaches a distinguished point it reports this to a central server and the server stores the distinguished point along with the values for a and b. If two walks reach the same distinguished point the server notices a collision.
This parallelization requires the server to receive, store, and sort all distinguished points. Tradeoffs are possible. If distinguished points are chosen to be rare then a small number of very long walks will be performed, reducing the number of distinguished points sent to the server but increasing the delay before a collision is recognized. If distinguished points are frequent then many shorter walks will be performed.
Additive walks. The generic rho method described above requires two scalar multiplications for each iteration. One can merge the two scalar multiplications into a 2-scalar multiplication, and further merge the 2-scalar multiplications across several parallel iterations, to reduce the number of group additions required for each iteration; but it is much simpler to use an additive walk, requiring only one addition for each iteration.
An additive walk is defined as
. Here h maps from P to {0, 1, . . . , r − 1}, and R 0 , R 1 , . . . , R r−1 are precomputed as known combinations of P and Q: say R j = c j P + d j Q for each j ∈ {0, 1, . . . , r − 1}.
One then also knows that W i = a i P + b i Q, where a and b are defined recursively as follows:
Additive walks have disadvantages. The walks are noticeably nonrandom and need more iterations than the generic rho method to find a collision. This effect disappears as r grows, but if r is large then the precomputed table R 0 , . . . , R r−1 does not fit into fast memory. Additive walks also have trouble with automorphisms; see the discussion of fruitless cycles below.
Pollard's original proposal was to use r = 3. Instead of 3 different precomputed points, Pollard mixed 2 points with a doubling W i+1 = 2W i ; note that Pollard was computing discrete logarithms in multiplicative groups, where a doubling (i.e., a squaring) is faster than a general addition (i.e., a multiplication). Experiments by Teske in [27] showed that larger values of r, such as r = 20, are much closer to random walks. A general heuristic due to Brent and Pollard implies that nonrandomness slows down this type of walk by a factor 1 − 1/r; for further discussion of such heuristics see, e.g., [1] .
Eliminating coefficients. In all of the discrete-logarithm computations described above, coefficients a i , b i are stored and used to compute the final discrete logarithm. In the parallel rho method these coefficients are communicated along with each distinguished point sent from a client to the server. Computing these coefficients in additive walks requires each client to implement arithmetic modulo or at least to allocate space for counters to keep track of how often each R j is used.
An alternative approach, introduced recently as part of the ECC2K-130 attack [1] , eliminates the coefficients a i and b i . Clients compute W i without keeping track of a i and b i . When a client encounters a distinguished point W i , it reports the point (or a hash of the point) along with a seed identifying the start of the walk, and then starts a new walk. This saves code in the clients, storage in the clients, communication between the clients and the server, and storage on the server. When the server encounters a collision, it recomputes the two walks involved in the collision, starting from the same seeds but now computing a i and b i . This is done only rarely, ideally just once.
Fruitless cycles. Fast negation on elliptic curves reduces the number of iterations in the generic rho method by a factor √ 2, as discussed above. However, a non-negating additive walk is faster than a negating generic walk: the extra speed of each iteration in the non-negating additive walk outweighs the smaller number of iterations in the negating generic walk.
One might think that a negating additive walk combines the advantages of a small number of iterations and a very fast iteration function. It is easy to make a canonical choice |W i | between W i and −W i ; to define h(W i ) as a function of |W i |, so that h(W i ) = h(−W i ); and to define f (W i ) = |W i | + R h(W i ) . Then f is a walk on equivalence classes under ±, and computing f takes only one addition.
The problem is that a negating additive walk does not behave randomly: it quickly enters short fruitless cycles. For example, if
One expects this to occur with probability 1/(2r) at each step, and if it does occur then the walk enters a 2-cycle. Subsequent iterations will not proceed to a distinguished point; subsequent computations will be wasted.
It is also possible to have fruitless cycles of larger lengths, although these are less frequent. Heuristics are given in [12] . Choosing a larger r reduces the chance of entering fruitless cycles, but for large-scale computations the cycles will occur and will occur frequently.
There are many different attempts in the literature to work around this problem. The first proposals were introduced independently by Harley and Singer in [16] , Wiener and Zuccherato in [29] , and Gallant, Lambert, and Vanstone in [14] ; further analyses appeared in [12] and much more recently in [6] . See Appendix A for a detailed review of these proposals. Some of the proposals are successful in eliminating fruitless cycles, but all of these proposals involve frequent conditional operations and, as stated in [6] , perform poorly in a SIMD environment.
How to use negation in Pollard's rho method
This section presents an efficient branchless negating rho algorithm to compute elliptic-curve discrete logarithms. For simplicity we restrict to curves of the form y 2 = x 3 − 3x + b over large prime fields F p . This algorithm uses, on average, (1 + o(1)) π /4 iterations for a group of prime order , assuming standard heuristics; here o(1) means something that converges to 0 as → ∞. Each iteration uses 5 multiplications mod p, 1 squaring mod p, and an asymptotically negligible amount of extra work.
We emphasize that we use a branchless sequence of iterations, always performing the same operations in the same order. This is of theoretical interest: any bounded-time algorithm can be made branchless by standard conversions, but these conversions usually lose efficiency. This is also of practical interest: the algorithm is well suited for modern SIMD CPUs such as the Cell CPU in the PS3, as discussed in subsequent sections of the paper.
We also emphasize that the number of iterations is (1 + o(1)) π /4, not (1 + o(1)) π /2. We use the fast elliptic-curve negation map to save a factor of √ 2; we do this without branching, and we do it with asymptotically zero compromise in iteration speed.
Eliminating fruitless cycles. We begin with the simplest type of negating additive walk stated in the literature. The walk starts at the point W 0 = |b 0 Q| where b 0 is chosen randomly, and then computes W 1 , W 2 , . . . by the rule W i+1 = |W i + R h(W i ) |. Here |(x, y)| means (x, y) for y ∈ {0, 2, 4, . . . , p − 1} or (x, −y) for y ∈ {1, 3, 5, . . . , p − 2}; the hash function h maps points to elements of {0, 1, 2, . . . , r − 1}; and the points R 0 , R 1 , . . . , R r−1 are known multiples of P .
We modify this walk by occasionally checking for fruitless cycles of length 2. Specifically, for a sparse pattern of indices i discussed below, we change the definition of W i as follows. After computing W i−1 , we check whether W i−1 = W i−3 . In the common case that W i−1 = W i−3 , we define W i = W i−1 . In the unusual case that W i−1 = W i−3 , we define W i = |2 min{W i−1 , W i−2 }|, where min means lexicographic minimum and 2 means doubling.
We further modify this walk by occasionally, with even lower frequency, checking for fruitless cycles of length 4. Specifically, for an even more sparse pattern of indices i discussed below, we redefine W i as W i−1 if W i−1 = W i−5 , and we redefine
We continue with analogous modifications for fruitless cycles of lengths 6, 8, etc., up to the smallest even length that exceeds (log )/(log r).
Eliminating branches. The sequence of iterations described above might seem to include branches: a branch to replace y by −y if y is odd, for example, and a branch to conditionally compute W i = |2 min{W i−1 , W i−2 }|. However, one can easily simulate all of these branches by a straight-line program with negligible loss of efficiency, as described in the following paragraphs.
First, |(x, y)| is the same as (x, (1 − 2 )y) where = y mod 2. The implicit reduction modulo p here is not an asymptotic bottleneck: it takes linear time (even without branching), while all known multiplication algorithms take superlinear time. We prefer to make the implicit reduction explicit, computing |(x, y)| as (x, y + (p − 2y)); the addition and subtraction take linear time.
Second, we amortize min computations such as min{W i−1 , W i−2 , W i−3 , W i−4 } across all relevant iterations: after computing W i−3 we initialize a running minimum W min as min{W i−4 , W i−3 }, then replace it with min{W min , W i−2 } after computing W i−2 , then replace it with min{W min , W i−1 } after computing W i−1 . These computations are performed for only a small fraction of all indices i, so the loss of efficiency is negligible. See below for a more detailed cost analysis.
Third, we compute doublings such as
min for all of the selected indices, whether or not the doublings will actually be used. We then compute W i without branches by selecting between W i−1 and |D i |, the same way that |(x, y)| selects between (x, y) and (x, −y). The selection bit is the output of a branch-free comparison between W i−1 and W i−5 , or in general between W i−1 and W i−1−c for detecting fruitless cycles of length c.
Note that each of these selections and comparisons takes linear time per iteration, and is therefore asymptotically negligible compared to a multiplication.
Eliminating inversions. The bottleneck in each iteration is now exactly one elliptic-curve operation: usually an elliptic-curve addition, but an elliptic-curve doubling for occasional iterations.
The standard formulas for elliptic-curve addition in affine coordinates are as follows. Let P = (x 1 , y 1 ), R = (x 2 , y 2 ) with x 1 = x 2 . Then P +R = (x 3 , y 3 ) where
, and y 3 = λ(x 1 − x 3 ) − y 1 . These formulas use 1 inversion, 2 multiplications, 1 squaring, and 6 subtractions. The formulas for a doubling, where R = P , are very similar; only the computation of λ = 3(x 2 1 − 1)/(2y 1 ) is different. We ignore, without further comment, the extraordinarily unlikely event that R = −P .
Inversions are the most expensive operations in finite fields. Standard practice in rho computations is to perform m independent walks in parallel, and to use Montgomery's trick [21] , which computes a batch of m inversions using just 1 inversion and 3(m − 1) multiplications.
We choose a branchless inversion method, specifically computing the (p−2)nd power using O(lg p) multiplications. We then choose m to grow asymptotically more quickly than lg p: in other words, lg p ∈ o(m). A batch of m elliptic-curve additions then costs 5m − 3 multiplications, m squarings, 6m subtractions, and 1 inversion. The subtractions and inversion are negligible, so each elliptic-curve addition costs 5 multiplications and 1 squaring.
A batch of m elliptic-curve doublings is slightly more expensive (costing m extra squarings), but occurs for only a small fraction of iterations, as discussed below. Each iteration therefore uses 5 multiplications and 1 squaring.
Analysis and optimization. Fruitless cycles of length 2 appear with probability approximately 1/(2r). These cycles persist after they appear, wasting subsequent iterations (in the sense that new points and new collision opportunities do not occur), until we check for them. If we check every w iterations then we expect a cycle to appear with probability approximately w/(2r), and for it to waste approximately w/2 iterations on average if it does appear.
This does not mean that w should be chosen as small as possible. If a cycle has not appeared then checking for it wastes an iteration. The overall loss is approximately 1 + w 2 /(4r) iterations out of w. To minimize the quotient 1/w + w/(4r) we take w ≈ 2 √ r. More generally, fruitless cycles of small length 2c appear with probability approximately proportional to 1/r c , so the optimal checking frequency is approximately proportional to 1/r c/2 . The loss here rapidly disappears as c increases. To summarize, fruitless cycles slow down this algorithm by a factor 1 + Θ(1/ √ r). This negation overhead Θ(1/ √ r) is on a larger scale than the overhead Θ(1/r) from the nonrandomness of r-adding walks, but both overheads become asymptotically negligible if r is chosen so that r → ∞ as p → ∞.
As an illustration of these optimizations, our PS3 software takes r = 2048, checks for 2-cycles every 48 iterations, and checks for larger cycles much less frequently. To simplify the software we unify the checks for 4-cycles and 6-cycles into a check for 12-cycles every 49152 iterations. If we had instead taken r = 512 then we would have checked for 2-cycles every 24 iterations. In general, the Θ(1/ √ r) asymptotic means that the negation overhead approximately doubles when the table size is reduced by a factor of 4.
Storage reduction. The storage overhead for detecting and escaping a fruitless cycle consists of storing W min and W i−1−c . For the latter it is enough to store one of the coordinates.
We further reduce storage by avoiding having all iterations check for cycles at the same time. For example, with a batch of 224 iterations running in parallel, we have just 14 iterations checking for 2-cycles and consuming extra space. All iterations perform 2 addition steps, and then these 14 iterations perform a masked doubling while the remaining 210 iterations perform another addition. We then rotate the batch so that the next 14 iterations check for 2-cycles.
Elements of the prime field F p , where p = (2 128 − 3)/76439, can be represented redundantly as elements of the ring Z/(2 128 − 3). Instead of reducing sums and products modulo p one reduces them modulo 2 128 − 3. This representation requires about 15% more space per field element, but the sparsity of 2 128 −3 makes reductions much faster.
The prime p was chosen with this small sparse multiple precisely to allow this speedup for cryptographic operations on the secp112r1 curve; see [10, page 3, bottom, and page 6, bottom]. The same type of redundant representation can of course also be used by the cryptanalyst attacking the ECDLP on secp112r1, as in [3] , [5] , [4] , and this paper. The critical problem is then to perform fast arithmetic modulo 2 128 − 3. This section explains how to efficiently decompose multiplications and squarings modulo 2 128 − 3 into operations on 16-bit integers and 32-bit integers. Here a b-bit integer is an integer in the interval [−2
The next section applies this decomposition to the Cell, obtaining faster arithmetic than [3] et al. 12 where a is a 32-bit integer; -32-bit right shift by 13 bits: a → a/2 13 where a is a 32-bit integer; -32-bit mask clearing 12 bits: a → 2 12 a/2 12 where a is a 32-bit integer; and -32-bit mask clearing 13 bits: a → 2 13 a/2 13 where a is a 32-bit integer.
We assign cost 1 to each of these operations, except that we assign cost 0.5 to the 16-bit addition operation. The next section explains how this cost model is related to PS3 speed. Note that these operations are not defined for all pairs of inputs. For example, 32-bit addition is not permitted to add 2 30 to 2 30 , because 2 31 is too large to be a 32-bit integer. One could of course define an extended 32-bit addition operation that handles all cases, working modulo 2 32 to handle overflows, but there are no overflows in the algorithms in this section. One could also define shifts (and masks) for distances other than 12 and 13, but the only distances used in this section are 12 and 13.
Representing integers modulo 2
128 − 3. We represent an element f of the ring Z/(2 128 − 3) as a sequence of 10 coefficients (f 0 , . . . , f 9 ) such that f = 
The factors of 2 arise from the uneven exponent spacing mentioned above: for example, the product of f 1 2 13 and g 4 2 52 is 2f 1 g 4 2 64 , contributing 2f 1 g 4 to r 5 2 64 . The factors of 3 arise from reducing 2 128 in Z/(2 128 − 3). If (f 0 , . . . , f 9 ) and (g 0 , . . . , g 9 ) are reduced then a sum of any 10 products of the form f i g j can be computed with cost 10: specifically, 1 multiplication followed by 9 multiply-add operations in any convenient order. The sums r 0 , r 1 , . . . , r 9 are slightly more expensive because of the extra factors of 2 and 3. We precompute 3g 1 , 3g 2 , . . . , 3g 9 and 2f 1 , 2f 2 , 2f 3 , 2f 4 , 2f 6 , 2f 7 , 2f 8 , 2f 9 (skipping 2f 5 ); recall that a 16-bit addition costs only 0.5, so these 26 additions cost only 13. Each r i is then a sum of 10 products of known 16-bit quantities, costing 100, for a total cost of 113.
Each r i , and each intermediate result in this multiplication algorithm, is bounded in absolute value by 10 · 3 · 2 · (1.01 · 2 12 ) 2 < 0.96 · 2 30 . The same algorithm also works if (f 0 , . . . , f 9 ) is a sum or difference of two reduced vectors while (g 0 , . . . , g 9 ) is reduced: then each result is bounded in absolute value by 0.96 · 2 31 , still safely below 2 31 .
Coefficient reduction. The product coefficients r 0 , r 1 , . . . , r 9 constructed above are usually not reduced. Some extra work is required to compute a reduced product suitable for use as input to subsequent multiplications.
For i ∈ {0, 1, 2, 3, 5, 6, 7, 8} we reduce r i by carrying from r i to r i+1 . This means changing (r i , r i+1 ) into (r i − 2 13 c, r i+1 + c) where c = (r i + 2 12 )/2 13 . This leaves the sum r i 2 i·12.8 + r i+1 2 (i+1)·12.8 unaffected, and increases the maximum possible r i+1 only slightly, while guaranteeing that the new r i is between −2 12 and 2 12 . This costs 5: an addition, a right shift, a mask, another addition, and a subtraction.
We
Tracing the bounds on each r i shows that the final r 2 is reduced, and therefore that (r 0 , r 1 , . . . , r 9 ) is reduced.
This carry chain costs 62. The complete multiplication algorithm, taking reduced representations of f and g as input and producing a reduced representation of f g as output, costs 175.
Squaring. Squaring in Z/(2 128 − 3) is very similar to multiplication except that several intermediate results are reused: for example, f 1 g 0 + f 0 g 1 becomes just 2f 0 f 1 . We begin with a cost-10 computation of 2f 0 , . . . , 2f 9 ; 3f 5 , . . . , 3f 9 ; 6f 5 , . . . , 6f 9 . We then obtain r 0 , . . . , r 9 straightforwardly at cost 55, and apply the same carry chain as for multiplication. The complete squaring algorithm costs 127.
Fast iterations on the PlayStation 3
This section analyzes and optimizes the performance of rho iterations modulo p = (2 128 − 3)/76439 on the PS3. This optimization makes some changes to the iteration function; we take advantage of the flexibility of co-designing our iteration function with our arithmetic algorithms.
The Cell SPEs. The CPU in the PS3 is the Cell Broadband Engine. The main computational power of the Cell is in 8 Synergistic Processor Elements (SPEs). These SPEs are arranged around a central 64-bit PowerPC core. The PS3 makes only 6 of these SPEs available for computations.
We report the performance that we achieve with 6 SPEs, leaving the central PowerPC core mostly idle. This does not make full use of the Cell-performing some iterations on the PowerPC core would noticeably reduce the overall computation time-but it simplifies comparisons: the speeds reported for the same ECDLP in [3] , [5] , and [4] also left the central PowerPC core mostly idle.
Our implementation runs independently on each SPE. Each SPE has 256 KB of fast local storage; this storage holds all code and data, including a batch of parallel walks and a table of precomputed points. Communication between the SPE and main memory is very small in this algorithm: a few words of data for every million iterations. Performance is therefore determined almost completely by how well we make use of the computational power of the SPE.
Arithmetic on the SPE. The following description summarizes only the SPE features that are relevant to our implementation. See [19] and [25] for more information about the SPE.
The SPE has a register file consisting of 128 128-bit vector registers. Typical arithmetic instructions are SIMD instructions operating on these registers as vectors of 4 independent 32-bit integers or vectors of 8 independent 16-bit integers. There is a multiplication instruction that multiplies 4 pairs of 16-bit integers in parallel, producing 4 32-bit integers. There is also a fused multiplyadd instruction that adds the results into another vector of 4 32-bit integers.
Each SPE cycle carries out at most one of these arithmetic instructions: i.e., 4 32-bit operations or 8 16-bit operations. An algorithm that costs n in the model of Section 4 therefore uses at least n/4 cycles on the SPE. There are, however, several reasons that the SPE can take many more cycles than this:
-In-order execution. An arithmetic instruction must wait until 1 cycle after the previous arithmetic instruction in the program. -Arithmetic latency. An instruction cannot begin until its results are ready.
The result of an arithmetic instruction is not ready until several cycles later. -Load latency. Loads are handled by a separate instruction pipeline but can still delay arithmetic instructions that use the load results.
There are also various function-call overheads, typically consuming 70 cycles per function call. One can eliminate these overheads by inlining and merging functions, but this also increases code size, putting pressure on the SPE's local storage.
Digitsliced multiplication on the SPE. We use 8-way vectorization of our iterations: we repeat our inputs, computations, and outputs 8 independent times. We store 8 independent elements of the ring Z/(2 128 − 3) in 10 128-bit vector registers r 0 , . . . , r 9 , where coefficient i of ring element j is in 16-bit component j of register r i . We convert each 16-bit operation into a 128-bit vector operation, and we convert each 32-bit operation into two 128-bit vector operations.
Scheduling instructions carefully then works around all arithmetic latency. The multiplication algorithm fits comfortably into the SPE registers: loads and stores are not a bottleneck. (Replacing 8-way vectorization with 16-way vectorization would remove the need for careful instruction scheduling but would put more pressure on registers and, more importantly, would cut in half the number of walks that fit easily into local storage without reshuffling.) The 32-bit results at the end of the algorithm are known to be reduced, so they fit into 16 bits; 10 extra instructions are required to shuffle the results into 10 vectors of 8 16-bit in-tegers, but these extra instructions are handled by the SPE's load/store pipeline and are effectively free when they are interleaved with arithmetic instructions.
Overall our software for vectorized multiplication of 8 pairs of ring elements takes exactly 350 SPE arithmetic instructions, i.e., 43.75 arithmetic instructions per multiplication (1/4 of the cost 175 in the previous section). Our software for vectorized squaring of 8 ring elements uses exactly 254 SPE arithmetic instructions, i.e., 31.75 arithmetic instructions per squaring. Each of our iterations performs 5 multiplications and 1 squaring, in total consuming 250.5 arithmetic instructions, at least 250.5 cycles.
Inversion. To invert modulo p = (2 128 −3)/76439 we simply compute a (p−2)nd power. Our addition chain for p − 2 uses 107 squarings and 32 multiplications. Essentially the same speed would also be achieved by an addition chain for the larger but sparser exponent 2 128 − 76443 = p − 2 + 76439(p − 1), using 126 squarings and just 18 multiplications.
Of course, we actually perform 8 independent inversions in parallel, using 8 · 107 squarings and 8 · 32 multiplications modulo 2 128 − 3. Each inversion uses 107 · 31.75 + 32 · 43.75 = 4797.25 arithmetic instructions, consuming at least 4797.25 cycles.
Our 8-inversion function actually uses 43293 cycles, i.e., 5411.625 cycles per inversion. The gap is almost entirely explained by the overhead of 64 function calls, half to multiplication and half to an n-squaring function that computes r ← r 2 n for variable n.
To reduce code size we rejected the possibility of more complicated Euclidtype inversion as used in [5] . In context (see below) inversion is already quite fast, only 6.6% of our final iteration cost.
Canonicalizing the y-coordinate. Redundant representations cause trouble for two parts of the algorithm stated in Section 3. First, because y ∈ F p has multiple representations, checking whether y ∈ {0, 2, 4, . . . , p − 1} is not a simple matter of inspecting the bottom bit of the representation of y. Second, because x ∈ F p has multiple representations, finding the hash of x is not a simple matter of extracting bits from the representation of x.
We address both problems by canonicalizing y. We use the canonicalized version of y to decide whether to negate y. Rather than canonicalizing and hashing x, we extract some bits from the canonicalized version of y as a table index. Note that there can be as many as 3 points having the same y-coordinate, but hashing all of those points to the same table index does not merge walks. We also use the canonicalized version of y to define distinguished points.
The most obvious way to canonicalize y is to replace it with y mod p; but reductions modulo p are expensive. We instead compute s = 76439y mod (2 128 − 3). One can think of this s as a unique representative y mod p, but represented as 76439(y mod p). An alternative is to use Montgomery reduction to compute y · 2 −16 (mod p) with precomputed 2 −16 (mod p), as in [4] . To compute s we multiply y by the cofactor 76439 and then perform a slightly longer reduction chain than the one we use after multiplication. The polynomialmultiplication step here, producing unreduced coefficients s 0 , s 1 , . . . , s 9 , uses only 5 arithmetic instructions (cost 20) instead of 25 (cost 100), because 76439 is represented in just two reduced coefficients: 76439 = 2711 + 9 · 2 13 . The usual precomputations of 3g 1 , 2f 1 , etc. also disappear: all we need are the constants 5422 = 2 · 2711 and 16266 = 6 · 2711.
To reduce the result we carry We do not claim that the resulting (s 0 , s 1 , . . . , s 9 ) is always completely determined by the image of y in F p . What matters is that the probability of nonuniqueness for random y's is so small that two colliding walks have negligible probability of diverging before they hit a distinguished point. We computerverified this as explained in Section 6.
Subtraction. The easy part of subtracting two ring elements is performing 10 subtractions of 16-bit integers. The problem is that the output is usually not reduced. Carries would reduce the output but would make subtraction much more expensive.
We use two standard techniques to avoid carries after subtractions. First, we skip unnecessary reductions before multiplications; recall from Section 4 that the multiplication algorithm can safely multiply f by g − g 2 , where f, g, g 2 are reduced. The elliptic-curve addition formulas involve three subtractions of this type: the denominator x − x 2 , which is multiplied by a reduced product inside Montgomery's batch-inversion method (except for one product at the beginning of a batch); the numerator y − y 2 , which is multiplied by the reduced reciprocal of x − x 2 ; and x 2 − x 3 , which is multiplied by the reduced quotient λ.
Second, we combine multiply-reduce-subtract-reduce into multiply-subtractreduce. In particular, to compute λ 2 − x − x 2 , we first add x to x 2 , and then subtract the sum from λ 2 before reducing the coefficients of λ 2 . Similarly, we subtract y 2 from λ(x 2 − x 3 ) before reducing the coefficients of λ(x 2 − x 3 ). This makes the subtractions more expensive (32-bit instead of 16-bit), but still much less expensive than extra carries.
Overall these 6 subtractions use 10 arithmetic instructions: 5 instructions for 40 subtractions of 16-bit integers, and another 5 instructions for 20 subtractions of 32-bit integers.
Table lookups. Our iteration function uses a precomputed table of 2048 multiples of P . Each multiple uses 16 bytes for an x-coordinate and 16 bytes for a y-coordinate, for a total of 64 KB of local storage. If we were concerned with defining an iteration function to perform well on many platforms simultaneously then we would use a smaller table, say 16 KB, to avoid L1 cache misses on typical CPUs. However, the analysis in Section 3 shows that this would slightly increase the number of iterations. In this paper our goal is purely to maximize Cell performance, so we keep r = 2048.
Normally a precomputed x-coordinate in reduced form (x 0 , x 1 , . . . , x 9 ) would occupy 20 bytes. We instead represent each precomputed x-coordinate in reduced form (x 0 , x 1 , . . . , x 7 , 0, 0), stored as a contiguous 16-byte vector (x 0 , x 1 , . . . , x 7 ).
Each y-coordinate is stored similarly. The final zeros mean that this representation is limited to approximately 2 103 different integers; it cannot handle all elements of Z/(2 128 − 3), or even all elements of F p . We find 2048 representable multiples of P by generating approximately 400 million random multiples of P ; this precomputation has negligible cost.
This table representation also allows lookups with very little arithmetic (but with many load/store instructions, which we carefully interleave with other computations), as explained in the following paragraph. We could further compress the table entries in several ways, but we are not aware of further compression techniques that avoid extra arithmetic. We comment that scaling the same type of precomputation to larger ECDLPs, squeezing the precomputed points as far as we can with a negligible precomputation, reduces the space for rho tables by asymptotically 25%.
The table entry for a point (x, y) has index s 0 mod 2048 where (s 0 , s 1 , . . . , s 9 ) is the canonicalization of y. We perform 8 table lookups in parallel as follows.
We perform one arithmetic instruction to obtain 8 parallel table indices; this instruction is a mask of the vector (2047, 2047, 2047, 2047, 2047, 2047, 2047, 2047) with a vector of canonicalizations. We shift the result left by 5 bits (since table entries have 32 bytes) to obtain 8 addresses in memory; this 128-bit shift is handled by the SPE's load/store pipeline. We then shuffle the result into 8 separate registers, perform 8 x-coordinate loads, and perform 8 y-coordinate loads. Finally we use 24 shuffle instructions to convert to digitsliced format. These add up to 306.08 arithmetic instructions per iteration, implying a lower bound of 306.08 cycles per iteration for our software. Our software actually takes 362 cycles per iteration, about 18% more than this lower bound.
Under 4% of the cycles per iteration are spent on operations that can be blamed on negation: specifically, negating s and y, detecting fruitless cycles, and resolving fruitless cycles by doubling. The rest of the gap between 306 and 362 is explained by detection of distinguished points, loop control, and function-call overhead.
There is, outside the iterations, an extra cost for communicating the occasional distinguished points that do appear (and in setting up replacement points). We made no effort to optimize this cost, since it is multiplied by the distinguishedpoint probability. With the rather large distinguished-point probability used in our experiments, namely 2 −20 , and with all 6 SPEs running in parallel and competing for communication resources, this cost effectively added 15 cycles to each iteration, slowing the computation down by about 4%. A smaller distinguishedpoint probability would reduce this penalty below 1%.
Comparison to previous work. Bos, Kaihara, Kleinjung, Lenstra, and Montgomery state in [4, Appendix A] that they use 456 SPE cycles per iteration for the same ECDLP, including 322 cycles for 6 multiplications, 30 cycles for 6 subtractions, 12 cycles for 1/400 inversions, 24 cycles for canonicalization (with Montgomery reduction), and 68 cycles for miscellaneous overhead. Bos, Kaihara, and Montgomery report 453 cycles per iteration in [5, Section 5] , with 318 cycles instead of 322 for the multiplications, and 69 cycles instead of 68 for the miscellaneous overhead.
Each of our multiplications is faster than the multiplications in [4] and [5] , by a factor of approximately 1.23. This speedup can be traced directly to our use of the non-integer radix 2 12.8 , while [4] et al. used the conventional radix 2 16 . Most of our other operations are also faster than the operations in [4] . We pay a slight penalty for negation but overall gain the same factor of approximately 1.23 in the number of cycles per iteration. Our overall speedup in solving the ECDLP is much larger, because we use far fewer iterations, as discussed in Section 6.
Experimental results and evaluation
We do not have access to the cluster of 1284 SPEs used for many months by the authors of [3] , [4] , [5] , and [6] . However, a few SPEs at the Jülich and Barcelona supercomputer centers were enough for us to perform some reasonably large discrete-logarithm experiments, demonstrating clearly that our code works and runs at the expected speed. This section presents the details of our experiments.
Scaling elliptic-curve challenges without changing the prime. Our software is dedicated to the prime p = (2 128 − 3)/76439, and is designed to break the ECDLP on the curve secp112r1 over F p . However, the same software works without modification for points P, Q on any curve of the form y 2 = x 3 − 3x + b over this F p .
By counting points on y 2 = x 3 − 3x + b for various b we found group orders having many different prime divisors. For example, there are points -of prime order 1195174242772417 ≈ 1.0615 · 2 50 on y 2 = x 3 − 3x + 238 2 ;
-of prime order 36817627222637377 ≈ 1.0219 · 2 55 on y 2 = x 3 − 3x + 372 2 ; and -of prime order 1186848158152955759 ≈ 1.0294 · 2 60 on y 2 = x 3 − 3x + 240 2 .
For each of these prime-order groups we generated a challenge P, Q and then repeatedly solved the challenge, collecting statistics on the distribution of the time needed to solve the challenge.
Distinguished points per second. We used the same distinguished-point property for all of the challenges, inspecting 20 bits of s 4 and s 9 . The probability of a point being distinguished is almost exactly 2 −20 . We predicted that we would need slightly more than 2 20 iterations on average to find a distinguished point, for two reasons. The first reason is that a small percentage of the iterations are wasted by fruitless cycles, as discussed in Section 3. This percentage is under 1% and is independent of the group order .
The second reason is that a walk can enter a long cycle that does not contain a distinguished point. We predicted that this would occur with probability roughly 2 40 / for each seed, i.e., roughly once for every /2 40 seeds: certainly not an issue for a single-shot experiment with ≈ 2 112 , but a serious concern for a careful statistical analysis studying many seeds with ≈ 2 50 . To prevent our software from running forever in case of long cycles, we added a few lines of code to abort each walk after approximately 47 · 2 18 iterations. A walk that is not in a long cycle has probability only about 2 −17 of surviving for so many iterations and of therefore being aborted. We could also have modified our software to extract discrete logarithms from the long cycles, but there would have been no cryptanalytic benefit from doing so, since long cycles disappear as grows.
We also reduced our batch size from 224 to 192 (watching 12 instead of 14) to make room in local storage for keeping track of various statistics. This made each iteration slightly slower, 366 cycles instead of 362 cycles. The extra cost of communicating distinguished points adds 15 cycles per iteration as discussed in the previous section, so we predicted that 6 SPEs running in parallel would produce 6 · 3.192 · 10 9 /(381 · 2 20 ) ≈ 47.94 distinguished points per second. We ran 6 SPEs on the original curve secp112r1 and found 48134 distinguished points in 1000 seconds, with no aborted walks. We also ran various experiments on our 50-bit, 55-bit, and 60-bit challenge curves, and in each case found distinguished points at the expected rate. We found 1 aborted walk for every 2 12.9 distinguished points for the 55-bit challenge curve y 2 = x 3 − 3x + 372 2 with ≈ 1.0219 · 2 55 .
The number of distinguished points needed for a discrete logarithm. We performed the following experiment for our 50-bit challenge. Take a seed, and find the corresponding distinguished point. Take the next seed, and find the corresponding distinguished point. Continue this process until finding two colliding distinguished points. Compute a discrete logarithm from this collision, and verify that it matches the secret scalar used to generate the challenge in the first place. Our software handles many seeds at once and produces distinguished points out of order, so we sorted the outputs back into the original order of seeds. Skipping this step would have introduced a bias into our experiment, favoring distinguished points that use relatively few iterations.
We then performed this experiment again, starting from the first seed that was not used in the first experiment. We continued in the same way through 32237 experiments, using disjoint seeds for each experiment. We did not verify the discrete logarithms for every experiment, but we verified it for a large random sample of experiments, and encountered no failures. The number of distinguished points used in the experiments was on average 31.526 ≈ Performance extrapolations. On the basis of these experiments we confidently predict that our software would solve the secp112r1 ECDLP in, on average, 37.3 years on a PS3, using 2 35.71 distinguished points, requiring under 1 terabyte of storage. Here 2 35.71 is calculated as π /4/2 20 with ≈ p ≈ 2 128 /76439, and 37.3 is calculated as π /4/(2 20 · 47.94 · 86400 · 365.25). The software runs in parallel on many PS3s without trouble, and will easily scale beyond the size of the cluster used in [5] . The computation time is inversely proportional to the number of machines, except for a few minutes at the end of the computation (by all machines while the final collision walks towards a One can trivially reduce the storage and communication requirements by, e.g., a factor of 16 by changing the definition of distinguished points to use 24 bits instead of 20. This increases the final few minutes by a factor of 16, but it also saves almost 15 cycles of communication cost for each iteration, as discussed in the previous section, reducing the total time to just 35.6 years on a PS3.
Comparison to previous work. Our speed is directly comparable to, and almost twice as fast as, the speed previously reported by Bos, Kaihara, Kleinjung, Lenstra, and Montgomery.
Specifically, [4, Appendix A.5] reports that the expected number of iterations "is πq/2 ≈ 8.4 · 10 16 , where q is the prime group order", to solve a secp112r1 ECDLP; each iteration consumes 456 cycles, totalling "about 60 PS3 years". This iteration count (also appearing in [5, Section 5.3] ) is slightly too optimistic: the additive walk in [3] This software hashed W i to some j ∈ {0, 4, 10, 11, 12} and then multiplied W i by 1 + σ j to obtain W i+1 . This was joint work by Harley and Ari Singer in January 1998, according to [17] .
In the meantime, an independent investigation by Gallant, Lambert, and Vanstone had prompted Certicom to suddenly reduce its published attack-time estimates for ECC2K-95 and other Koblitz curves by an order of magnitude. [9] to [7, Section 15] .
An accompanying web page [8] , dated 23 March 1998, claimed a √ 2d speedup for Koblitz curves from normalizing x-coordinates of distinguished points modulo powers of σ, and a √ 2 speedup for general curves from normalizing x-coordinates of distinguished points modulo negation. However, as pointed out in [28, Section 6] two weeks later, these normalizations do not actually produce any speedup unless the walk is also modified to be compatible with the action of σ.
According to [16] , the Certicom web page was subsequently updated to add a preprint of the paper [14] , presenting a suitable multiplicative walk for Koblitz curves. However, as discussed above, multiplicative walks do not produce a √ 2 speedup for general curves. Harley stated in [16] that a multiplicative walk "can be used with any curve but only rarely gives a speedup"; Certicom did not apply a √ 2 reduction to its published attack-time estimates for most curves.
Additive walks with 2-cycle detection and hash modification. In [28] Wiener and Zuccherato independently claimed a speedup of √ 2 for arbitrary curves, and √ 2d for Koblitz curves over F 2 d , using an additive walk. See also the more recent version [29] .
Wiener and Zuccherato pointed out that 2-cycles appear frequently in the simplest negating additive walk, W i+1 = |W i +R h(W i ) |. They suggested detecting and escaping these 2-cycles as follows. Check whether h(W i+1 ) matches h(W i ). If it does, then retroactively adjust h by incrementing h(W i ), and go back to the computation of W i+1 . If r is not very small then this adjustment will not have to be repeated more than a few times for each i.
Duursma, Gaudry, and Morain showed in a followup analysis [12, Appendix A] that 2-cycles occur with probability Θ(1/r), that 4-cycles occur with probability Θ(1/r 2 ), etc. Wiener and Zuccherato do not consider, detect, or escape 4-cycles, so a cycle can be expected to occur within Θ(r 2 ) iterations: e.g., millions of iterations for r = 2048.
We comment that this is not necessarily a disaster. Setting a reasonable upper limit on walk length in parallel collision search was standard practice at the time of [28] ; all walks that enter cycles, fruitless or otherwise, are aborted. If the distinguished-point frequency is much larger than 1/r 2 then almost all of the Wiener-Zuccherato walks terminate with a distinguished point and the time wasted by aborted walks is very small. Presumably this is what happened in the successful experiment reported in [29, Section 5, last four lines].
However, being forced to use a distinguished-point frequency much larger than 1/r 2 means that there will be many more than √ /r 2 distinguished points by the end of the computation. This poses increasingly severe communication problems and storage problems as grows, and those problems mean that for large the Wiener-Zuccherato method is unable to compete with non-negating walks.
Wiener and Zuccherato also suggested the possibility of "looking ahead two steps instead of just one" to further reduce the frequency of cycles. A detailed discussion of this possibility, without credit to [29] , appeared much later in [ 
. The Wiener-Zuccherato approach saves time when a 2-cycle is found, but loses time in the equally common case that
On the other hand, one could sort and compare hash values together with signs to robustly check for 2-cycles, 4-cycles, etc.
Escott used a mixed walk that includes doublings, and checked for cycles only back to the most recent doubling, using the heuristic that "No small useless cycle includes a double." (The statement "A cycle with at least one doubling is most likely not fruitless" appeared much later in [6] without credit to Escott.) This still leaves many comparisons: for example, if the 16 operations corresponding to 16 hash values are 15 additions of precomputed points and 1 doubling, then there will typically be 10 or 20 points after the most recent doubling, and thus 10 or 20 comparisons to those points. Eliminating branches would obviously make this computation even more expensive.
Escott escaped a cycle by walking to a new point computed from the points in the cycle. Escott emphasized the importance of rotational symmetry, having the new point be independent of where the cycle was entered, so that "no collisions are lost." It is not clear exactly which new point was used for the successful experiments reported in [13, page 27] . Previous comments in [13] seem to suggest simply adding the points in the cycle, but this is highly nonrandom; for example, adding W to −W − R j produces −R j , no matter what W is.
Delayed cycle detection. Another discussion of fruitless cycles appeared in the paper [14, by Gallant, Lambert, and Vanstone, submitted for publication in June 1998 and revised in October 1998.
To detect cycles in "labels" (e.g., cycles in x-coordinates), Gallant, Lambert, and Vanstone proposed to "intermittently save labels and detect repetitions by comparing new labels against these stored ones." They suggested that the "save interval" I be small, perhaps 10 or 20, since longer cycles are unlikely to occur. Saving the current label every I iterations, and comparing the new label to the saved label every iteration, detects all cycles of length at most I.
Note that the Gallant-Lambert-Vanstone method often needs more iterations than the Escott method to detect a cycle. For example, a length-2 cycle is detected after just 2 steps if it includes the saved label, but otherwise it waits for the next label, continuing to cycle for nearly I extra iterations. The compensating advantage is that each iteration is faster, involving only 1 comparison. The paper also suggests that one can store "different numbers of past values" at "different intervals" but does not elaborate.
Gallant, Lambert, and Vanstone recommended escaping a cycle by applying a "modified iteration" to the "lexicographically least label" in the cycle. The lexicographically least label has the obvious virtue of being easy to compute. The subsequent paper [6] says that the modified iteration in [14] consists of adding a precomputed point (from another pool of points, not the points used in unmodified iterations), and that this has a good chance of returning to the same cycle, rendering the method non-functional. However, [14] is actually much more vague than this; it does not specify a modified iteration. There exist modified iterations that work correctly, but [14] certainly does not make clear that the success of the method depends heavily on the choice of the iteration.
Statistical cycle detection and escape by doubling. Duursma, Gaudry, and Morain in [12, pages 109-110] proposed sorting the elements of a cycle, obtaining C 1 ≤ C 2 ≤ · · · ≤ C t , and then walking to 2C 1 + 5C 2 + · · · + (t t + 1)C t . Walking simply to 2C 1 was proposed much later in [6] , without credit to [12] . Neither method suffers from recurring cycles.
Combining the detection method from [14] with the escape method from [12] would have produced a reasonably efficient negating rho method. However, it still would not have produced a reasonably efficient branchless negating rho method; the possibility of having to escape a cycle at any moment causes trouble for SIMD architectures.
The same paper suggested detecting cycles by checking whether "the number of distinguished points" does not "evolve as prescribed by the theory." This suggestion, like the Wiener-Zuccherato algorithm, is not competitive for large : it cannot be efficient unless the distinguished-point frequency is large.
More possibilities. The much more recent paper [6] by Bos, Kleinjung, and Lenstra is the first paper on this topic that does not claim a √ 2 speedup. The paper reports a table of 126 different non-SIMD implementation results: specifically, 21 different combinations of cycle-detection methods, hash-modification methods, and cycle-escape methods for 6 different values of r.
The most efficient algorithm identified in [6] is the following: compare and modify hash values as in [29] to reduce the frequency of 2-cycles; "after α steps record a length β sequence of successive points and compare the next point to these β points" (miscredited to [14] ) to detect any remaining cycle; escape any remaining cycle by doubling the smallest element of the cycle.
The comparison and potential modification of hash values at each iteration makes this algorithm perform quite poorly on SIMD architectures. Eliminating those modifications, and relying solely on the α-β cycle-detection mechanism, would also perform poorly: 2-cycles would arise frequently, so α would have to be small to avoid excessive delays in detecting 2-cycles, while β would have to be at least 4 to catch 4-cycles, requiring frequent comparisons and conditional operations along with several stored labels for each walk.
Given the history surveyed above, it was not unreasonable of [6] to dispute the speedup claims made in the earlier literature, and to question the possibility of any speedup on SIMD architectures.
