Abstract. This paper sets new software speed records for high-security Diffie-Hellman computations, specifically 251-bit elliptic-curve variablebase-point scalar multiplication. In one second of computation on a $200 Core 2 Quad Q6600 CPU, this paper's software performs 30000 251-bit scalar multiplications on the binary Edwards curve d(x + x 2 + y + y 2 ) = (x + x 2 )(y + y 2 ) over the field F 2 [t]/(t 251 + t 7 + t 4 + t 2 + 1) where d = t 57 + t 54 + t 44 + 1. The paper's field-arithmetic techniques can be applied in much more generality but have a particularly efficient interaction with the completeness of addition formulas for binary Edwards curves.
Introduction
Which curves should one choose for elliptic-curve cryptography?
The first and most fundamental choice is between curves defined over binary (i.e., characteristic-2) finite fields and curves defined over non-binary fields. For example:
• NIST's standard K-283 curve is the subfield curve y 2 + xy = x 3 + 1 over the field F 2 [t]/(t 283 + t 12 + t 7 + t 5 + 1). NIST's standard B-283 curve is a particular non-subfield curve over the same binary field.
• NIST's standard P-256 curve is a particular curve over the prime field Z/(2 256 − 2 224 + 2 192 + 2 96 − 1).
Multiplication in the polynomial ring F 2 [t] is, at first glance, just like multiplication in Z but skips all the carries. Furthermore, squaring in F 2 [t] is simply a relabeling of exponents. One might therefore guess that curves over binary fields are considerably faster than curves over large-characteristic fields.
However, the software speed records for elliptic-curve cryptography areand for many years have been-held by large-characteristic fields. The fastest Diffie-Hellman speeds (i.e., speeds for variable-base-point scalar multiplication n, P → nP ) reported in ECRYPT's publicly verifiable benchmarks [15] on a single core of an Intel Core 2 Quad Q6600 (6fb, utrecht) are
• 321012 cycles for field size (2 127 − 1) 2 (using software from Galbraith, Lin, and Scott, combining Edwards curves with the idea of [33] ), • 386739 cycles for field size 2 255 −19 (using software from Gaudry and Thomé announced in [35] ), and • 855036 cycles for field size 2 251 (also using software from [35] ).
Similar comments apply to older processors: for example, Fong, Hankerson, López, and Menezes in [30, Table 6 ] report 1720000 cycles on a Pentium III for field size 2 233 , while [12] reports 832457 cycles on a Pentium III for field size 2 255 − 19. Subfield curves provide some speedups in the binary case-for example, Hankerson et al. in [38, Table 7 ] report 1740000 cycles on a Pentium II for field size 2 283 , and Hankerson, Karabina, and Menezes in [39, Table 5 ] report 758000 cycles on a Xeon 5460 (similar to a Core 2 Quad) for field size 2 254 -but binary fields seem to have no hope of catching up to large-characteristic fields.
Why are large-characteristic fields so much faster than binary fields? The conventional explanation is that today's popular CPUs include fast "integermultiplication" (and "floating-point multiplication") instructions that multiply medium-size elements of Z, but do not include instructions to multiply mediumsize elements of F 2 [t] . Of course, one can multiply in F 2 [t] by combining simpler CPU instructions, but the multiplication instructions for Z are much faster, outweighing any possible advantages of characteristic 2. This effect has been intensified by the transition from 32-bit processors to 64-bit processors: 64-bit multipliers are even more powerful than 32-bit multipliers.
Why have CPU designers decided to include a circuit for multiplication in Z and not a smaller, faster circuit for multiplication in F 2 [t] ? The conventional explanation is as follows. Most of the computer users who care about CPU speed are measuring performance of weather simulation, movie decompression, video games, etc. These applications rely heavily on multiplication in Z, rewarding CPUs that include integer multipliers, floating-point multipliers, etc. The same applications make very little use of multiplication in F 2 [t] .
New speed records. This paper introduces new software named BBE251 for scalar multiplication on a high-security binary elliptic curve, specifically the binary Edwards curve d(x + x 2 + y + y 2 ) = (x + x 2 )(y + y 2 ) over k, where k = F 2 251 = F 2 [t]/(t 251 + t 7 + t 4 + t 2 + 1) and d = t 57 + t 54 + t 44 + 1 ∈ k. This curve has group order 4 · prime and twist order 2 · prime, and it satisfies all the usual elliptic-curve security criteria; see Section 3.
BBE251 is so fast that it sets new speed records not just for binary elliptic curves, but for all elliptic curves. For example, a benchmark of a batch of 1024 independent scalar multiplications took 321866514 cycles on a single core of a Core 2 Quad Q6600 (6fb)-a cost of just 314323 Core 2 cycles per scalar multiplication, improving upon all previous results. The Sage computer-algebra system [63] was used to check a random sampling of outputs.
Readers and potential users are cautioned that BBE251 does not compute one scalar multiplication in 314323 cycles. The software is given a batch of curve points P 1 , P 2 , P 3 , . . . and an equal-length batch of integers n 1 , n 2 , n 3 , . . .; it produces a batch of multiples n 1 P 1 , n 2 P 2 , n 3 P 3 , . . .. The speed of BBE251 does not rely on any relationships between the inputs; in fact, for fixed-size batches, the software takes constant time, independent of the inputs. BBE251 nevertheless provides speed benefits from handling (n 1 , P 1 ) together with (n 2 , P 2 ) and (n 3 , P 3 ) and so on. BBE251 is faster than the software in [35] for the same field size once the batch size is above 50; for large batches it is more than twice as fast.
Real-world servers bottlenecked by typical elliptic-curve computations can gain speed by collecting the computations into batches, switching to the curve introduced in this paper, and switching to the software introduced in this paper. The batching increases latency by several milliseconds, but in most applications this is not a problem, whereas raw throughput is often critical. BBE251 completes 1048576 scalar multiplications in just 35 seconds using all four cores of a single 2.4GHz Core 2 Quad Q6600 CPU; interference among the cores is negligible. This is not the fastest single-chip scalar-multiplication measurement ever reported in the literature-Güneysu and Paar in [36] reported "more than 37000 point multiplications per second"-but a closer look shows that [36] achieved 37000 224-bit scalar multiplications per second on a $1000 Xilinx Virtex-4 SX55 containing hundreds of multipliers, while this paper achieves 30000 251-bit scalar multiplications per second on a $200 Core 2 Quad Q6600.
BBE251 provides several benefits beyond speed. It avoids all data-dependent array indices, all data-dependent branches, etc., and is therefore immune to cache-timing attacks, branch-prediction attacks, etc.; the same security feature was already present in state-of-the-art software for large-characteristic elliptic curves (see, e.g., [12] ) but is hard to find for binary curves. BBE251 has been posted (http://binary.cr.yp.to) to allow public verification of its accuracy and speed, and has been placed into the public domain to maximize reusability.
How these speeds were achieved: low level. [10] say that fast polynomial multiplication "yields a 10% speedup in the overall scalar multiplication time" for a particular curve. Brent, Gaudry, Thomé, and Zimmermann in [22] report speedups from several layers of Karatsuba and Toom recursions, but only beyond cryptographic sizes. In BBE251, a 16-bit xor is faster than a 32-bit xor, because two 16-bit inputs are packed into the same space as a 32-bit input and handled in parallel. Shift costs are trivially eliminated by the standard technique of "bitslicing":
, and then handled without shifts until the end of the computation. These vectors do not fit simultaneously into CPU registers, but BBE251 arranges operations so that most loads and stores are overlapped with computation.
Bitslicing has been used in cryptography before. Two record-setting examples are Biham's implementation [18] of DES, a standard hardware-friendly bitoriented cipher that had previously been viewed as very slow in software, and the Matsui-Nakajima implementation [50] of AES, taking just 9.2 cycles per byte on a Core 2. Aoki, Hoshino, and Kobayashi in [9] pointed out that bitsliced field arithmetic saves time for binary elliptic curves-but they were still unable to compete with non-binary elliptic curves. The Pentium III speeds reported in [9, Table 3 ] for a subfield curve over a field of size 2 163 are not as fast as the Pentium II speeds reported the same year by Aoki et al. in [8, Table 4 ] for a curve over a larger non-binary field. The fast multiplication circuits built into current CPUs perform a huge number of bit operations per cycle and are generally perceived as indispensable tools for fast public-key cryptography; it is surprising that bitsliced field arithmetic can set new Diffie-Hellman speed records, outperforming the multiplication circuits.
How these speeds were achieved: high level. There is a limitation to the power of bitslicing, at least in the pure form used in this paper: bitslicing requires all computations to be expressed as straight-line sequences of bit operations. Straight-line computations do not contain data-dependent branches (e.g., "if P = Q then . . . ") or data-dependent array indices (e.g., "load x[i], where i is the next bit of the scalar"). Computations that include data-dependent array indices can be simulated by straight-line computations that perform arithmetic on all array elements, and computations that include data-dependent branches can be simulated in an analogous way (recall that the "program counter" is just another array index), but these simulations are slow.
Fortunately, a recent line of research has shown how to carry out some elliptic-curve computations without data-dependent array indices, without datadependent branches, and without serious loss of speed. These techniques have been advertised as efficiently protecting against various types of software sidechannel attacks, such as cache-timing attacks, but the same techniques are also useful for efficient bitslicing. Specifically:
• Single-scalar multiplication, odd characteristic: [12, Theorem 2.1] says that the differential-addition formulas from [52] compute X 0 (nP ) from X 0 (P ) on any Montgomery curve having a unique 2-torsion point. Here X 0 (P ) means x if P = (x, y) and 0 if P = ∞. See [13, Section 5] for discussion of more general Montgomery curves.
• Arbitrary group operations, odd characteristic: [14, Theorem 3.3] says that various addition formulas are complete-i.e., have no exceptional cases-for any Edwards curve x 2 + y 2 = 1 + dx 2 y 2 having non-square d. Complete addition formulas allow complete single-scalar multiplication, complete doublescalar multiplication, etc.
• Arbitrary group operations, characteristic 2: [16, Theorem 4.1] says that various addition formulas in [16] are complete for any binary Edwards curve
The complete differential-addition formulas in [16, Section 7] are reviewed in Section 3 of this paper and are used in BBE251. Other approaches, such as mixing bitsliced computation with non-bitsliced computation, do not appear to provide noticeable benefits for the speed of elliptic-curve scalar multiplication and are not discussed in this paper. The unexpected speed of bitsliced binary-field multiplication can be viewed as motivation to consider bitslicing for a wide variety of higher-level cryptographic computations and for many other problems. This paper's results on binaryelliptic-curve scalar multiplication are one step along this path. The reader is cautioned, however, that bitslicing will provide smaller benefits for computations that rely more heavily on random access to memory.
What about PCLMULQDQ? Some CPU designers have begun to realize the potential importance of F 2 [t] for cryptographic applications. Intel announced in April 2008 that its processors will eventually include a "carry-less multiplication" instruction named PCLMULQDQ. Intel's white paper [40] says that "carryless multiplication"-i.e., multiplication in F 2 [t]-is particularly important for GCM, a standard for secret-key authenticated encryption relying heavily on multiplications in the binary field F 2 128 ; and that "accelerating carry-less multiplication can significantly contribute to achieving high speed secure computing and communication." Gueron and Kounavis estimate in [37] that scalar multiplication on the NIST B-233 elliptic curve would take 220000 cycles using a 9-cycle instruction for "carry-less multiplication," or 70000 cycles using a 3-cycle instruction for "carry-less multiplication." NIST B-233 has a somewhat lower security level than the 251-bit curve used in this paper.
Intel also announced in April 2008 that its processors will, starting in 2010, include 256-bit vector instructions. The initial instruction set will not include many integer instructions but will include 256-bit logical operations such as VXORPS ymm and VANDPS ymm; see [41] .
It is clear that PCLMULQDQ-based software will easily outperform most software for binary-elliptic-curve computations. However, it is not at all clear that PCLMULQDQ will outperform a 256-bit-vector implementation of the techniques introduced in this paper. It is not even clear that PCLMULQDQ will outperform the specific BBE251 software described in this paper! In any case, one can reasonably speculate that continued improvements in binary-elliptic-curve performance will encourage cryptographic users to shift to binary elliptic curves, increasing the importance of techniques to accelerate computations on those curves. Presumably Z-biased CPUs will continue being sold for years and will continue being used for years after that; the techniques in this paper will clearly remain important for Z-biased CPUs whether or not they are helpful for PCLMULQDQ CPUs.
Polynomial multiplication
This section explains how to multiply 251-bit polynomials using 33096 bit operations. More generally, this section presents small explicit upper bounds on M (n), the minimum number of bit operations needed to multiply an n-bit polynomial
Inputs and outputs are represented in the standard form:
. . , g n−1 ); and the output h 0 +
Comparison to previous work. The definition of polynomial multiplication immediately implies that M (n) ≤ Θ(n 2 ). Karatsuba showed in [45] that M (n) ≤ Θ(n lg 3 ). Toom showed in [64] that M (n) ≤ n 1+o (1) , and more precisely M (n) ≤ n2 Θ( √ lg n) . Schönhage showed in [61] that M (n) ≤ Θ(n lg n lg lg n), by adapting an integer-multiplication method by Schönhage and Strassen in [62] . No better asymptotic bounds are known; Fürer in [32] introduced an asymptotically faster multiplication method for integers, but it is not clear whether the method can be adapted to binary polynomials.
Of course, bounds involving Θ, O, etc. are not explicit. To draw conclusions about M (251), or any other specific M (n), one needs to carefully re-analyze the algorithms used to prove asymptotic bounds. To draw useful conclusions one often needs to rethink the algorithm design, looking for constant-factor (and sub-constant-factor) improvements that are not visible in the asymptotics.
Explicit upper bounds do appear in the literature on hardware multipliers. The bound M (n) ≤ 2n 2 − 2n + 1 is easy to find in hardware textbooks. Several cryptographic-hardware papers (see below) have presented explicit upper bounds on M (n) obtained with Karatsuba's method, have pointed out that these upper bounds are considerably below 2n 2 −2n+1 for cryptographically useful sizes of n, and have concluded that hardware multipliers should use Karatsuba's method. XOR and AND do not have identical hardware costs, but analyses weighted by the actual costs come to similar conclusions.
The explicit upper bounds on M (n) obtained in this section are better than anything that can be found in the hardware literature. Here are several examples of the improved bounds:
• M (128) ≤ 11486. For comparison, Peter and Langendörfer in [57, As these examples illustrate, switching from this paper's multiplication methods back to the multiplication methods in the hardware literature often increases the number of bit operations by more than 20%. On the other hand, one should not exaggerate the importance of multiplication refinements as a component of this paper's new speed records for binary-elliptic-curve cryptography. Bitsliced binary-Edwards-curve arithmetic, as described in Section 3 of this paper, would still have set new speed records even if this section's multiplication methods had been replaced by the methods in [67] .
Schoolbook recursion. One can multiply f 0 + f 1 t + · · · + f n t n by g 0 + g 1 t + · · · + g n t n as follows:
. This takes 2n + 1 bit multiplications and n bit additions.
• Add. This takes n − 1 bit additions for the coefficients of t n , . . . , t 2n−2 ; the other coefficients do not overlap.
Consequently M (n+1) ≤ M (n)+4n. This "schoolbook recursion" bound implies the "schoolbook multiplication" bound M (n) ≤ 2n 2 − 2n + 1, but schoolbook recursion is-in combination with other recursions-useful for much larger n's than schoolbook multiplication.
Three-way recursion. The well-known Karatsuba identity
shows how to multiply 2n-bit polynomials F 0 + F 1 t n , G 0 + G 1 t n using three multiplications of n-bit polynomials and a small amount of overhead. What actually appeared as "Theorem 2 (Karatsuba)" in the original Karatsuba-Ofman paper [45] was an integer analogue of the squaring case of the equivalent identity
Either identity easily leads to the bound M (2n) ≤ 3M (n) + 8n − 4 appearing in, e.g., [67, Section 2] . The 8n − 4 arises as a sum of three components: 2n for the additions in F 0 + F 1 and G 0 + G 1 , another 4n − 2 to subtract F 0 G 0 and
Factoring out 1 − x produces the slightly simpler formula
In particular, if H is the product of two generic linear polynomials
n to obtain the refined Karatsuba identity
For comparison, rewriting the projective Lagrange interpolation formula as H =
Here is a cost analysis of the refined Karatsuba identity as a method of computing the product of F 0 + F 1 t n and G 0 + G 1 t n :
Notice that this computation does not follow the traditional structure of first computing the coefficients H 0 , H 1 , H 2 and then computing H(t n ) = H 0 +H 1 t n + H 2 t 2n . In particular, the middle coefficient
I posted the M (2n) ≤ 3M (n) + 7n − 3 bound (and the resulting M (n) ≤ (103/18)n lg 3 − 7n + 3/2 bound for n ∈ {2, 4, 8, . . .}) in [11, page 7] in 2000, but I am formally announcing the idea here for the first time. The simplifiedLagrange-interpolation explanation has not appeared anywhere, even informally, and is reused below for improvements in five-way recursion.
Five-way recursion. A generic quartic polynomial H
= H 0 + H 1 x + H 2 x 2 + H 3 x 3 + H 4 x 4 over F 2 can
be reconstructed from the values H(0), H(1), H(t), H(t + 1), H(∞) by the projective Lagrange interpolation formula
Easy manual simplification produces the (perhaps not optimal) formula
where U = H(0) + (H(0) + H(1))x and V = H(t) + (H(t) + H(t + 1))(x + t).
This formula, in turn, leads to the following new algorithm to compute the product of F 0 + F 1 t n + F 2 t 2n and G 0 + G 1 t n + G 2 t 2n , where F 0 , F 1 , G 0 , G 1 are n-bit polynomials and F 2 , G 2 are k-bit polynomials:
• Cost 2n + 1: Compute H(t) + H(t + 1). The coefficients of t 2n+2 and t 2n+1 in H(t + 1) are the same as the coefficients of t 2n+2 and t 2n+1 in H(t), so this sum has degree at most 2n. (For the same reason, some work could have been saved in the computation of H(t + 1).)
• Cost 3n + 4, or 3n + 2 if k ≤ n − 1: Compute V = H(t) + (H(t) + H(t + 1))(t n + t). Note that deg V ≤ 3n.
• Cost 3n−2: Compute U = H(0)+(H(0)+H(1))t n . Note that deg U ≤ 3n−2.
• Cost 4k + 3n − 3, assuming n ≥ 2: Compute W = U + V + H(∞)(t 4 + t). Note that deg W ≤ 3n.
• Cost 3n − 2: Compute W/(t 2 + t). This division is exact: W/(t 2 + t) ∈ F 2 [t]. (For the same reason, some work could have been skipped in the computation of W .)
The previous state of the art, building on ideas by Zimmermann and Quercia, Weimerskirch and Paar [68] , and Montgomery [53] , was an algorithm by Bodrato in [19] to compute H 0 , H 1 , H 2 , H 3 , H 4 given H(0), H(1), H(t), H(t + 1), H(∞) using 9 additions, one multiplication by t 3 + 1, one division by t, one division by t+1, and one division by t 2 +t. The total overhead was about 38n operations: 10n to compute Two-level seven-way recursion. Consider the problem of multiplying two degree-3 polynomials. Apply the refined Karatsuba identity three times, factor out 1 − x, and substitute x = t n , to obtain the identity
Cost evaluation for polynomials with 3n + k coefficients, assuming k ≥ n/2: (1), . . . , M (1000) obtained in this way. Each upper bound is accompanied by straight-line multiplication code that has been computer-verified to multiply correctly and to use exactly the specified number of bit operations.
Often an input to one multiplication is reused in a subsequent multiplication; for example, w 1 in Section 3 participates in many multiplications. One can save time by caching evaluations of that input, such as F 0 + F 1 above. To properly optimize this reuse one should define, e.g., M 2 (n) as the cost of multiplying a single n-bit input by two n-bit inputs (serially), and then optimize M 2 (n) analogously to M (n).
The reader is cautioned that there are many more multiplication methods in the literature: for example, more Toom variants, FFTs, etc. Analyzing, refining, and combining these methods would improve the bounds on M (n) for many integers n, perhaps including n = 251. Most of the relevant methods are surveyed in [22] but have not yet been optimized for bit operations.
Elliptic-curve scalar multiplication
This section reviews binary Edwards curves; discusses this paper's selection of a particular binary Edwards curve; and analyzes the speed of computation of scalar multiples on that curve.
Review of binary Edwards curves.
A binary Edwards curve over a binary finite field k is a curve of the form
This curve shape was introduced by Bernstein, Lange, and Rezaeian Farashahi in [16] as a characteristic-2 analogue to the curve shape introduced by Edwards in [28] .
A binary Edwards curve is complete if d 2 has trace 1, i.e., if d 2 cannot be written as c 2 + c for any c ∈ k. The paper [16] proves that every ordinary elliptic curve over k is birationally equivalent over k to a complete binary Edwards curve if #k ≥ 8, and that various addition formulas on the complete binary Edwards curve have no exceptional cases. The case d 1 = d 2 allows various speedups presented in [16] . For example, it allows differential addition and doubling-a single step in a "Montgomery ladder"-using four squarings in k, two multiplications by d 1 , and five more multiplications in k. The general case would need four squarings in k, four multiplications by parameters, and six more multiplications in k.
The main task considered here is scalar multiplication n, P → nP in the group
, with neutral element (0, 0). Note that this group does not have any points at infinity. See [16] for further discussion of the group law.
Security issues in curve selection. This curve satisfies all of the standard criteria for high-security curves:
• The curve has near-prime order. Specifically, the curve has order 4p 1 where p 1 is the prime 2 249 + 17672450755679567125975931502191870417.
• The twist of the curve has near-prime order, specifically order 2p 2 where p 2 is the prime 2 250 − 35344901511359134251951863004383740833.
• The primes are large enough for high security: generic discrete-logarithm algorithms use approximately 2 124 group operations on average.
• Avoiding subfields: The j-invariant 1/d 8 generates the field k.
• Avoiding small discriminants: (2 251 + 1 − 4p 1 ) 2 − 2 253 is divisible by the large prime ((2 251 + 1 − 4p 1 ) 2 − 2 253 )/(−83531196553759) exactly once.
• Avoiding pairing attacks: The multiplicative order of 2 251 modulo p 1 is not small: in fact, it is (p 1 − 1)/2. The multiplicative order of 2 251 modulo p 2 is not small: in fact, it is (p 2 − 1)/2.
• Avoiding the GHS attack: The extension degree 251 is a prime, so the only nontrivial subfield of k is F 2 . GHS genera over F 2 cannot be small: in fact, they are at least 2 49 , since the multiplicative order of 2 modulo 251 is 50. See [51] .
The curve used in this paper was found by a search through various possibilities for d. Most choices of d fail the near-prime-order requirement, and most of the remaining choices of d fail the near-prime-twist-order requirement, but there are still many suitable possibilities. An easy computation with the Magma computer-algebra system [20] Some standards omit, or weaken, some of the criteria listed above, for several reasons:
• Some of the criteria do not have known benefits. For example, subfield curves produce only a small loss of security, which can be corrected by a small increase in field size. The small-discriminant criterion is even more difficult to defend; there is no known attack that exploits small discriminants, and there are reasons to guess that randomly chosen large-discriminant curves are more dangerous than small-discriminant curves. See [47, Sections 11.1-11.3].
• Sometimes the criteria have disadvantages. For example, some criteria have to be weakened by anyone who wants to allow curves with special algebraic structures, such as "pairing-friendly curves," "Koblitz curves," "GallantLambert-Vanstone curves," and the new "Galbraith-Lin-Scott curves." See, e.g., [34] , [33] , and [39] .
• One of the criteria, twist security, has protocol-level benefits that were not visible in the traditional study of the elliptic-curve discrete-logarithm problem. Twist security has, as a result, often been neglected even in situations where it has no disadvantages. For further discussion of twist security see This paper's selection of a curve meeting all the security criteria should not be interpreted as criticism of curves that meet fewer security criteria. One should expect some of those curves, when combined with the techniques in this paper, to achieve even better speeds than the speeds reported in this paper.
Speed issues in curve selection. Even within the restricted pool of curves meeting all of the security criteria discussed above, there are still considerable variations in speed. Standard practice is to focus on the highest-speed curves.
In particular, the fastest elliptic-curve-scalar-multiplication methods involve many multiplications by curve coefficients; it is standard practice to choose these coefficients to be "small." The exact definition of "small" varies but is aimed at speeding up multiplications by these coefficients. For example:
• Weierstrass curves: IEEE Standard P1363 chooses curves y 2 = x 3 − 3x + b to "provide the fastest arithmetic on elliptic curves"; see [2, Section A.9 ]. Chudnovsky and Chudnovsky had pointed out in [26] that choosing a small coefficient a in y 2 = x 3 + ax + b saves time in elliptic-curve scalar multiplication, and that the particular choice a = −3 saves even more time. NIST's standard curves were chosen by the recipe specified in IEEE Standard P1363; see [1, Appendix 6, Section 1.4].
• Montgomery curves: The curve "Curve25519" specified in [12] is the curve y 2 = x 3 + 486662x 2 + x modulo 2 255 − 19. Montgomery had pointed out in [52] that his fast differential-addition formulas for y 2 = x 3 + ax 2 + x involve multiplications by (a + 2)/4 and benefit from (a + 2)/4 being small.
• Binary Edwards curves: [16] makes the analogous suggestion to choose a small parameter d for the curve d(x + x 2 + y + y 2 ) = (x + x 2 )(y + y 2 ). Differential addition and doubling on binary Edwards curves. The following formulas are the "affine d 1 = d 2 " and "mixed d 1 = d 2 " formulas from [16, Section 7] , repeated here to keep this paper self-contained.
For each point P = (x, y) ∈ E(k) define w(P ) = x + y. Then w(2P
, and more generally w(Q + P ) + w(Q − P ) = 1 + d/(d + w(P )w(Q)(1 + w(P ))(1 + w(Q))). The denominators here are never zero.
The following formulas use four squarings, two multiplications by d, and five more multiplications to compute w(2P ), w(Q + P ) as fractions W 4 /Z 4 , W 5 /Z 5 , given w(P ), w(Q) as fractions W 2 /Z 2 , W 3 /Z 3 and given w(Q − P ) as an element w 1 ∈ k:
This operation is called mixed differential addition and doubling. The standard way to handle β = 1 is to first swap W 2 /Z 2 , W 3 /Z 3 , then proceed with the original computation, and finally swap W 4 /Z 4 , W 5 /Z 5 . The first swap exactly reverses the roles of P and Q, since w(P − Q) = w(Q − P ); the original computation therefore produces w(2Q), w(P + Q); and the final swap produces w(P + Q), w(2Q) as desired.
The standard way to swap W 2 and W 3 conditionally on β without branching is to replace (W 2 , W 3 ) by (W 2 +β(W 3 −W 2 ), W 3 −β(W 3 −W 2 )). Similar comments apply to (Z 2 , Z 3 ), (W 4 , W 5 ), and (Z 4 , Z 5 ).
Scalar multiplication. The last step in scalar multiplication computes w(nP ), w(nP + P ) as fractions, starting from w( n/2 P ), w( n/2 P + P ) as fractions and w(P ) as an element of k. This is an example of conditionally swapped mixed differential addition and doubling, where β is the bottom bit of n.
The previous step produces w( n/2 P ), w( n/2 P + P ) as fractions starting from w( n/4 P ), w( n/4 P + P ) as fractions and w(P ) as an element of k. This is the same computation, except that n is replaced by n/2 ; i.e., β is the second bit of n. The conditional swap at the end of this step is followed immediately by, and can be profitably merged with, the conditional swap at the beginning of the next step.
Similar comments apply to earlier steps. If the target scalar n is known to be between 0 and 2 b − 1 then one can use a sequence of b steps. The first step produces w( n/2 b−1 P ), w( n/2 b−1 P + P ) from w( n/2 b P ), w( n/2 b P + P ); i.e., from 0, w(P ).
To summarize: Given w(P ) ∈ k and a b-bit scalar n, this scalar-multiplication method uses b conditionally swapped mixed differential additions and doublings to produce (W, Z) such that W/Z = w(nP ). The 2b conditional swaps can be merged into just b + 1 conditional swaps.
Postprocessing. One can divide W by Z, obtaining w(nP ) ∈ k, by computing W Z A small extra computation shown in [16, Section 7] , using the x and y coordinates of P separately, would produce the x and y coordinates of nP separately; but the w coordinate is adequate for elliptic-curve Diffie-Hellman. One can also check directly that w corresponds to a curve point by checking that d/(w + w 2 ) has trace 0 and that w + w 2 times the half-trace of d/(w + w 2 ) has trace 0. These computations are much faster than scalar multiplication and are not discussed further in this paper.
Instead of inverting Z at the end of the computation one can use affine coordinates, eliminating some multiplications at the cost of an inversion in each step. Inversions are well known to benefit from batching, thanks to "Montgomery's trick" from [52, page 260] . However, each inversion still costs slightly more than 3 multiplications, wiping out most if not all of the gain. Montgomery in [52, page 261] compared batched affine coordinates to projective coordinates in the non-binary case and reported negligible performance differences. One should not expect affine coordinates to provide large savings in the binary case.
Performance: bit operations. A 251-bit single-scalar multiplication as described here involves 44679665 bit operations; this number has been computerverified. The main cost is 43011084 bit operations for 1266 field multiplications (1255 in the main loop and 11 in the final division). Each multiplication uses 33974 bit operations: 33096 bit operations for 251-bit multiplication in F 2 [t], and 878 bit operations to reduce the 501-bit product modulo t 251 + t 7 + t 4 + t 2 + 1. The other costs are as follows:
• 397518 bit operations for 1254 squarings (1004 in the main loop and 250 in the final division), each using 317 bit operations; • 449792 bit operations for 502 multiplications by d, each using 896 bit operations; • 315005 bit operations for 1255 additions, each using 251 bit operations; and • 506266 bit operations for 1004 conditional swaps, which at the cost of 250 bit operations are merged into 504 conditional swaps, each costing 1004 bit operations.
In some protocols the 251-bit scalar is always a multiple of 4, allowing slight speedups. In other protocols a 249-bit scalar is adequate, allowing slight further speedups.
Performance: cycles. The BBE251 software reads a batch of scalars n 1 , . . . , n 128 and a batch of curve points P 1 , . . . , P 128 and computes a batch of multiples n 1 P 1 , . . . , n 128 P 128 . Each scalar is represented as a 32-byte string in little-endian form. Each curve point is represented as a field element w as described in the previous section; w is, in turn, represented as a 32-byte string. One might think that writing fast software for this computation on a Core 2 CPU is a simple matter of generating code for the bit operations described in this paper, replacing XORs and ANDs with a C compiler's intrinsic _mm_xor_si128 and _mm_and_si128 operations on 128-bit vectors. A single core of the CPU can carry out three of the corresponding PXOR and PAND instructions per cycle, so 44 million bit operations should be completed in about 15 million cycles-under 120000 cycles per input. Transposing the n's and P 's into bitsliced form, and transposing the results out of bitsliced form, takes negligible time.
There is, however, a critical bottleneck in any straightforward implementation: namely, load throughput. In one cycle the CPU can carry out three operations on six vectors in registers; but loading those six vectors from memory into registers costs six cycles-the Core 2 performs only one load in each cycle. The results of the three operations are ready to be used for further operations in the next cycle, so one can imagine loading (e.g.) 56 input vectors for a 28-bit multiplication, carrying out all 956 bit operations for the multiplication, and then storing the final outputs; but the Core 2 has only 16 128-bit vector registers.
Recursive multiplication methods such as Karatsuba's method might seem to be ideal for reducing loads and stores, since they split larger multiplication problems into smaller multiplication problems that fit into registers. However, the first step in decomposing a 2n-bit multiplication into n-bit multiplications is to add 2n vectors to another 2n vectors-and the 2n/3 cycles for these additions are swamped by 4n cycles for loads. Similar comments apply to the recombination of (2n − 1)-bit products into a (4n − 1)-bit product.
Further contributing to the memory pressure is the fact that the Core 2's vector instructions are two-operand instructions such as "replace a with a + b," not three-operand instructions such as "replace c with a + b." Copying a to c, in situations where a and b need to be reused, takes away one of the three XOR/AND slots available in a cycle. Copying a to c via memory uses an extra load.
BBE251 takes several measures to reduce the number of loads and stores. Most importantly, it merges decompositions and recombinations across multiple layers of recursion, reusing sums while they are still in registers. As a simple example, adding (a 0 , . . . , a 2n−1 ) to (a 2n , . . . , a 4n−1 ) takes 4n loads and 2n additions; subsequently adding (a 0 , . . . , a n−1 ) to (a n , . . . , a 2n−1 ), adding (a 2n , . . . , a 3n−1 ) to (a 3n , . . . , a 4n−1 ), and adding (a 0 +a 2n , . . . , a n−1 +a 3n−1 ) to (a n +a 3n , . . . , a 2n−1 + a 4n−1 ) would take 6n loads and 3n additions; but performing all of these operations together reduces the 10n loads to 4n loads and 2n copies.
The current version of BBE251 merges most operations across two levels of recursion, and takes fewer than 44 million cycles, although still many more than the target of 15 million. It is not yet clear how close the correlations are between optimized bit-operation counts and optimized cycle counts, but it is clear that schoolbook multiplication could not have been competitive with BBE251. Largerscale load/store elimination is underway and can be expected to further improve BBE251's performance.
