Abstract. The availability of a new carry-less multiplication instruction in the latest Intel desktop processors significantly accelerates multiplication in binary fields and hence presents the opportunity for reevaluating algorithms for binary field arithmetic and scalar multiplication over elliptic curves. We describe how to best employ this instruction in field multiplication and the effect on performance of doubling and halving operations. Alternate strategies for implementing inversion and half-trace are examined to restore most of their competitiveness relative to the new multiplier. These improvements in field arithmetic are complemented by a study on serial and parallel approaches for Koblitz and random curves, where parallelization strategies are implemented and compared. The contributions are illustrated with experimental results improving the state-of-the-art performance of halving and doubling-based scalar multiplication on NIST curves at the 112-and 192-bit security levels, and a new speed record for side-channel resistant scalar multiplication in a random curve at the 128-bit security level.
Introduction
Improvements in the fabrication process of microprocessors allow the resulting higher transistor density to be converted into architectural features such as inclusion of new instructions or faster execution of the current instruction set. Limits on the conventional ways of increasing a processor's performance such as incrementing the clock rate, scaling the memory hierarchy [38] or improving support for instruction-level parallelism [37] have pushed manufacturers to embrace parallel processing as the mainstream computing paradigm and consequently amplify support for resources such as multiprocessing and vectorization. Examples of the latter are the recent inclusions of the SSE4 [22] , AES [19] and AVX [14] instruction sets in the latest Intel microarchitectures.
Since the dawn of elliptic curve cryptography in 1985, several field arithmetic assumptions have been made by researchers and designers regarding its efficient implementation in software platforms. Some analysis (supported by experiments) assumed that inversion to multiplication ratios (I/M ) were sufficiently small (e.g., I/M ≈ 3) that point operations would be done in affine coordinates, favoring certain techniques. However, the small ratios were a mix of old hardware designs, slower multiplication algorithms compared with [32] , and composite extension degree. It seems clear that sufficient progress was made in multiplication so there is incentive to use projective coordinates. Our interest in the face of much faster multiplication is at the other end-is I/M large enough to affect methods that commonly assumed this ratio is modest?
On the other hand, authors in [16] considered that the cost of a point halving computation was roughly equivalent to 2 field multiplications. The expensive computations in halving are a field multiplication, solving a quadratic z 2 + z = c, and finding a square root over F 2 m . However, quadratic solvers presented in [21] are multiplicationfree and hence, provided that a fast binary field multiplier is available, there would be concern that the ratio of point halving to multiplication may be much larger than 2. Having a particularly fast multiplier would also push for computing square roots in F 2 m as efficiently as possible. Similarly, the common software design assumption that field squaring is essentially free (relative to multiplication) may no longer be valid.
A prevalent assumption is that large-characteristic fields are faster than binary field counterparts for software implementations of elliptic curve cryptography. 5 In spite of simpler arithmetic, binary field realizations could not be faster than large-characteristic analogues mostly due to the absence of a native carry-less multiplier in contemporary high-performance processors. However, using a bit-slicing technique, Bernstein [6] was able to compute a batch of 251-bit scalar multiplications on a binary Edwards curve, employing 314,323 clock cycles per scalar multiplication, which, before the results presented in this work and to the best of our knowledge, was the fastest reported time for a software implementation of binary elliptic point multiplication.
In this work, we evaluate the impact of the recently introduced carry-less multiplication instruction [20] in the performance of binary field arithmetic and scalar multiplication over elliptic curves. We also consider parallel strategies in order to speed scalar multiplication when working on multi-core architectures. In contrast to parallelization applied to a batch of operations, the approach considered here applies to a single point multiplication. These approaches target different environments: batching makes sense when throughput is the measurement of interest, while the lower level parallelization is of interest when latency matters and the device is perhaps weak but has multiple processing units. Furthermore, throughout this paper we will assume that we are working in the unknown point scenario, i.e., where the elliptic curve point to be processed is not known in advance, thus precluding off-line precomputation. We will assume that there is sufficient memory space for storing a few multiples of the point to be processed and look-up tables for accelerating the computation of the underlying field arithmetic.
As the experimental results will show, our implementation of multiplication via this native support was significantly faster than previous timings reported in the literature. This motivated a study on alternative implementations of binary field arithmetic in hope of restoring the performance ratios among different operations in which the literature is traditionally based [21] . A direct consequence of this study is that performance analysis based on these conventional ratios [5] will remain valid in the new platform. Our main contributions are:
-A strategy to efficiently employ the native carry-less multiplier in binary field multiplication. -Branchless and/or vectorized approaches for implementing half-trace computation, integer recoding and inversion. These approaches allow the halving operation to become again competitive with doubling in the face of a significantly faster multiplier, and help to reduce the impact of integer recoding and inversion in the overall speed of scalar multiplication, even when projective coordinates are used. -Parallelization strategies for dual core execution of scalar multiplication algorithms in random and Koblitz binary elliptic curves.
We obtain a new state-of-the-art implementation of arithmetic in binary elliptic curves, including improved performance for NIST-standardized Koblitz curves and random curves suitable for halving and a new speed record for side-channel resistant point multiplication in a random curve at the 128-bit security level. The remainder of the paper progresses as follows. Section 2 elaborates on exploiting carry-less multiplication for high-performance field multiplication along with implementation strategies for half-trace and inversion. Sections 3 and 4 discuss serial and parallel approaches for scalar multiplication. Section 5 presents extensive experimental results and comparison with related work. Section 6 concludes the paper with perspectives on the interplay between the proposed implementation strategies and future enhancements in the architecture under consideration.
Binary field arithmetic
A binary extension field F 2 m can be constructed by means of a degree-m polynomial f irreducible over F 2 as F 2 m ∼ = F 2 [z]/ (f (z)). In the case of software implementations in modern desktop platforms, field elements a ∈ F 2 m can be represented as polynomials of degree at most m − 1 with binary coefficients a i packed in n 64 = m 64 64-bit processor words. In this context, the recently introduced carry-less multiplication instruction can play a significant role in order to efficiently implement a multiplier in F 2 m . Along with field multiplication, other relevant field arithmetic operations such as squaring, square root, and half-trace, will be discussed in the rest of this section.
Multiplication
Field multiplication is the performance-critical operation for implementing several cryptographic primitives relying on binary fields, including arithmetic over elliptic curves and the Galois Counter Mode of operation (GCM). For accelerating the latter when used in combination with the AES block cipher [19] , Intel introduced the carry-less multiplier in the Westmere microarchitecture as an instruction operating on 64-bit words stored in 128-bit vector registers with opcode pclmulqdq [20] . The instruction latency currently peaks at 15 cycles while reciprocal throughput ranks at 10 cycles. In other words, when operands are not in a dependency chain, effective latency is 10 cycles [15] .
The instruction certainly looks expensive when compared to the 3-cycle 64-bit integer multiplier present in the same platform, which raises speculation whether Intel aimed for an area/performance trade-off or simply balanced the latency to the point where the carry-less multiplier did not interfere with the throughput of the hardware AES implementation. Either way, the instruction features suggest the following empirical guidelines for organizing the field multiplication code: (i) as memory access by vector instructions continues to be expensive [6] , the maximum amount of work should be done in registers, for example through a Comba organization [12] ; (ii) as the number of registers employed in multiplication should be minimized for avoiding false dependencies and maximize throughput, the multiplier should have 128-bit granularity; (iii) as the instruction latency allows, each 128-bit multiplication should be implemented with three carry-less multiplications in a Karatsuba fashion [25] .
In fact, the overhead of Karatsuba multiplication is minimal in binary fields and the Karatsuba formula with the smaller number of multiplications for multiplying = 4. Particular approaches which led to lower performance in our experiments were organizations based on optimal Toom-Cook [10] due to the higher overhead brought by minor operations; and on a lower 64-bit granularity combined with alternative multiple-term Karatsuba formulas [33] due to register exhaustion to store all the intermediate values, causing a reduction in overall throughput.
Squaring, square-root and multi-squaring
Squaring and square-root are considered cheap operations in a binary field, especially when F 2 m is defined by a square-root friendly polynomial [3, 1] , because they require only linear manipulation of individual coefficients [21] . These operations are traditionally implemented with the help of large precomputed tables, but vectorized implementations are possible with simultaneous table lookups through byte shuffling instructions [2] . This approach is enough to keep square and square-root efficient relative to multiplication even with a dramatic acceleration of field multiplication. For illustration, [2] reports multiplication-to-squaring ratios as high as 34 without a native multi-plier, far from the conventional ratios of 5 [5] or 7 [21] and with a large room for future improvement.
Multi-squaring, or exponentiation to 2 k , can be efficiently implemented with a time-memory trade-off proposed as m-squaring in [1, 11] and here referred as multisquaring. For a fixed k, a table T of 16 m 4 field elements can be precomputed such that T [j,
. The threshold where multi-squaring became faster than simple consecutive squaring observed in our implementation was around k ≥ 6 for F 2 233 and k ≥ 10 for F 2 409 .
Inversion
Inversion modulo f (z) can be implemented via the polynomial version of the Extended Euclidean Algorithm (EEA), but the frequent branching and recurrent shifts by arbitrary amounts present a performance obstacle for vectorized implementations, which makes it difficult to write consistently fast EEA codes across different platforms. A branchless approach can be implemented through Itoh-Tsuji inversion [23] by computing
, as proposed in [18] . In contrast to the EEA method, the Itoh-Tsujii approach has the additional merit of being similarly fast (relative to multiplication) across common processors. The overall cost of the method is m − 1 squarings and a number of multiplications dictated by the length of an addition chain for m − 1. The cost of squarings can be reduced by computing each required 2 i -power as a multi-squaring [11] . The choice of an addition chain allows the implementer to control the amount of required multiplications and the precomputed storage for multi-squaring, since the number of 2 i -powers involved can be balanced.
Previous work obtained inversion-to-multiplication ratios between 22 and 41 by implementing EEA in 64-bit mode [2] , while the conventional ratios are between 5 and 10 [21, 5] . While we cannot reach the small ratios with Itoh-Tsujii for the parameters considered here, we can hope to do better than applying the method from [2] which will give significantly larger ratios with the carry-less multiplier. Hence the cost of squarings and multi-squarings should be minimized to the lowest possible allowed by storage capacity.
To summarize, we use addition chains of 10, 10 and 11 steps for computing field inversion over the fields F 2 233 , F 2 251 and F 2 409 , respectively. 6 We extensively used the multi-squaring approach described in the preceding section. For example, in the case of F 2 233 , we selected the addition chain 1→2→3→6→7→14→28→29→58→116→232, and used 3 pre-computed tables for computing the iterated squarings a . The rest of the field squaring operations were computed by executing consecutive squarings. We recall that each table stores a total of 16 m 4 field elements. 6 In the case of inversion over F 2 409 , the minimal length addition chain to reach m − 1 = 408 has 10 steps. However, we preferred to use an 11-step chain to save one look-up table.
Half-trace
Half-trace plays a central role in point halving and its performance is essential if halving is to be competitive against doubling. For an odd integer m, the half-trace function H :
and satisfies the equation λ 2 + λ = c + Tr(c) required for point halving. One efficient desktop-targeted implementation of the half-trace is described in [3] and presented as Algorithm 1, making extensive use of precomputations. This implementation is based on two main steps: the elimination of even power coefficients and the accumulation of half-trace precomputed values.
Step 5 in Algorithm 1, as shown in [21] , consists in reducing the number of non-zero coefficients of c by removing the coefficients of even powers i via H(
That will lead to memory and time savings during the last step of the half-trace computation, the accumulation (step 6). This is done by extraction of the odd and even bits and can benefit from vectorization in the same way as square-root in [2] . However, in the case of half-trace there is a bottleneck caused by data dependencies. For efficiency, the bank of 128-bit registers is used as much as possible, but at one point in the algorithm execution the number of available bits to process decreases. For 64-bit and 32-bit digits, the use of 128-bit registers is still beneficial, but for a smaller size, the conventional approach (not vectorized) becomes again competitive.
Once step 5 is completed, the direction taken in [21] remains in reducing memory needs. However another approach is followed in [3] which does not attempt to minimize memory requirements but rather it greedily strives to speed up the accumulation part (step 6). Precomputation is extended so as to reduce the number of accesses to the lookup table. The following values of the half-trace are stored:
The memory size in bytes taken by the precomputations follows the formula 16 × n 64 × 8 × 
While considering different organizations of the half-trace code, we made the following serendipitous observation: inserting as many xor operations as the data dependencies permitted from the accumulation stage (step 6) into step 5 gave a substantial speed-up of 20% to 25% compared with code written in the order as described in Algorithm 1. Plausible explanations are compiler optimization and processor pipelining characteristics. The result is a half-trace-to-multiplication ratio near 1, and this ratio can be reduced if memory can be consumed more aggressively.
Random binary elliptic curves
Given a finite field F q for q = 2 m , a non-supersingular elliptic curve E(F q ) is defined to be the set of points (x, y) ∈ F q × F q that satisfy the affine equation
where a and 0 = b ∈ F q , together with the point at infinity denoted by O. It is known that E(F q ) forms an additive Abelian group with respect to the elliptic point addition operation. Let k be a positive integer and P a point on an elliptic curve. Then elliptic curve scalar multiplication is the operation that computes the multiple Q = kP , defined as the point resulting of adding P to itself k − 1 times. One of the most basic methods for computing a scalar multiplication is based on a double-and-add variant of Horner's rule. As the name suggests, the two most prominent building blocks of this method are the point doubling and point addition primitives. By using the non-adjacent form (NAF) representation of the scalar k, the addition-subtraction method computes a scalar multiplication in about m doubles and m/3 additions [21] . The method can be extended to a width-ω NAF k = t−1 i=0 k i 2 i where k i ∈ {0, ±1, . . . , ±2 m − 1}, k t−1 = 0, and at most one of any ω consecutive digits is nonzero. The length t is at most one larger than the bitsize of k, and the density is approximately 1/(ω + 1); for ω = 2, this is the same as NAF.
Sequential algorithms for random binary curves
The traditional left-to-right double-and-add method is illustrated in Algorithm 2 where n = 0 (that is, the computation corresponds to the left column) and the width-ω NAF
i expression is computed from left to right, i.e., it starts processing k t−1 first, then k t−2 until it ends with the coefficient k 0 .
Step 1 computes 2 ω−2 − 1 multiples of the point P . Based on the Montgomery trick, authors in [13] suggested a method to precompute the affine points in large-characteristic fields F p , employing only one inversion. Exporting that approach to F 2 m , we obtained formulae that offer a saving of 4 multiplications and 15 squarings for ω = 4 when compared with a naive method that would make use of the Montgomery trick in a trivial way (see Table 1 for a summary of the computational effort associated to this phase).
For a given ω, the evaluation stage of the algorithm has approximately m/(ω + 1) point additions, and hence increasing ω has diminishing returns. For the curves given by NIST [34] and with on-line precomputation, ω ≤ 6 is optimal in the sense that total point additions are minimized. In many cases, the recoding in ωNAF(k) is performed on-line and can be considered as part of the precomputation step.
The most popular way to represent points in binary curves is López-Dahab projective coordinates that yield an effective cost for a mixed point addition and point doubling operation of about 8M + 5S ≈ 9M and 4M + 5S ≈ 5M , respectively (see Tables 2 and 3 ). Kim and Kim [26] report alternate formulas for point doubling requiring four multiplications and five squarings, but two of the four multiplications are by the constant b, and these have the same cost as general multiplication with the native Algorithm 2 Double-and-add, halve-and-add scalar multiplication: parallel Input: ω, scalar k, P ∈ E(F2m ) of odd order r, constant n (e.g., from Table 1 Q0 ← 2Q0 7:
else if k i < 0 then 10:
else if k i < 0 then 16:
carry-less multiplier. For mixed addition, Kim and Kim require eight multiplications but save two field reductions when compared with López-Dahab, giving their method the edge. Hence, in this work we use López-Dahab for point doubling and Kim and Kim for point addition.
Right-to-left halve-and-add Scalar multiplication based on point halving replaces point doubling by a potentially faster halving operation that produces Q from P with P = 2Q. The method was proposed independently by Knudsen [28] and Schroeppel [35] for curves y 2 + xy = x 3 + ax 2 + b over F 2 m . The method is simpler if the trace of a is 1, and this is the only case we consider. The expensive computations in halving are a field multiplication, solving a quadratic z 2 + z = c, and finding a square root. On the NIST random curves studied in this work, we found that the cost of halving is approximately 3M , where M denotes the cost of a field multiplication.
Let the base point P have odd order r, and let t be the number of bits to represent r. For 0 < n ≤ t, let t i=0 k i 2 i be given by the width-ω NAF of 2 n k mod r. Then k ≡ k /2 n ≡ t i=0 k i 2 i−n (mod r) and the scalar multiplication can be split as
When n = t, this gives the usual representation for point multiplication via halving, illustrated in Algorithm 2 (that is, the computation is essentially the right column). The cost for postcomputation appears in Table 1 .
Parallel scalar multiplication on random binary curves
For parallelization, choose n < t in (2) and process the first portion by a double-andadd method and the second portion by a method based on halve-and-add. Algorithm 2 illustrates a parallel approach suitable for two processors. Recommended values for n to balance cost between processors appear in Table 1 . 
Side-channel resistant multiplication on random binary curves
Another approach for scalar multiplication offering some resistance to side-channel attacks was proposed by López and Dahab [31] based on the Montgomery laddering technique. This approach requires 6M + 5S in F 2 m per iteration independently of the bit pattern in the scalar, and one of these multiplications is by the curve coefficient b.
The curve being lately used for benchmarking purposes [7] at the 128-bit security level is an Edwards curve (CURVE2251) corresponding to the Weierstraß curve y 2 + xy = x 3 +(z 13 +z 9 +z 8 +z 7 +z 2 +z+1). It is clear that this curve is especially tailored for this method due to the short length of b, reducing the cost of the algorithm to approximately 5.25M + 5S per iteration. At the same time, halving-based approaches are non-optimal for this curve due to the penalties introduced by the 4-cofactor [27] . Considering this and to partially satisfy the side-channel resistance offered by a bitsliced implementation such as [6] , we restricted the choices of scalar multiplication at this security level to the Montgomery laddering approach.
Koblitz elliptic curves
A Koblitz curve E a (F q ), also known as an Anomalous Binary Curve [29] , is a special case of (1) where b = 1 and a ∈ {0, 1}. In a binary field, the map taking x to x 2 is an automorphism known as the Frobenius map. Since Koblitz curves are defined over the binary field F 2 , the Frobenius map and its inverse naturally extend to automorphisms of the curve denoted τ and τ −1 , respectively, where τ (x, y) = (x 2 , y 2 ). Moreover, (x 4 , y 4 ) + 2(x, y) = µ(x 2 , y 2 ) for every (x, y) on E a , where µ = (−1) 1−a ; that is, τ satisfies τ 2 + 2 = µτ and we can associate τ with the complex number τ = µ+ √ −7 2 . Solinas [36] presents a τ -adic analogue of the usual NAF as follows. Since short representations are desirable, an element ρ ∈ Z[τ ] is found with ρ ≡ k (mod δ) of as small norm as possible, where δ = (τ m − 1)/(τ − 1). Then for the subgroup of interest, kP = ρP and a width-ω τ -adic NAF (ωτ NAF) for ρ is obtained in a fashion that parallels the usual ωNAF. As in [36] , define α i = i mod τ ω for i ∈ {1, 3, . . . , 2 ω−1 − 1}. A ωτ NAF of a nonzero element ρ is an expression ρ = l−1 i=0 u i τ i where each u i ∈ {0, ±α 1 , ±α 3 , . . . , ±α 2 ω−1 −1 }, u l−1 = 0, and at most one of any consecutive ω coefficients is nonzero. Scalar multiplication kP can be performed with the ωτ NAF expansion of ρ as
with l − 1 applications of τ and approximately l/(ω + 1) additions.
The length of the representation is at most m + a, and Solinas presents an efficient technique to find an estimate for ρ, denoted ρ = k partmod δ with ρ ≡ ρ (mod δ), having expansion of length at most m+a+3 [36, 9] . Under reasonable assumptions, the algorithm will usually produce an estimate giving length at most m + 1. For simplicity, we will assume that the recodings obtained have this as an upper bound on length; small adjustments are necessary to process longer representations. Under these assumptions and properties of τ , scalars may be written
Sequential algorithms for Koblitz curves
A traditional left-to-right τ -and-add method for (3) appears as [21, Alg 3.70] , and is essentially the left-hand portion of Algorithm 3. Precomputation consists of 2 ω−2 − 1 multiples of the point P , each at a cost of approximately one point addition (see Table 1 for a summary of the computational effort associated to this phase).
Alternatively, we can process bits right-to-left and obtain a variant we shall denote as [21, Alg 3.70] (an analogue of [21, Alg 3.91]). The multiple points of precomputation P u are exchanged for the same number of accumulators Q u along with postcomputation of form α u Q u . The cost of postcomputation is likely more than the precomputation of the left-to-right variant; see Table 1 for a summary in the case where postcomputation uses projective additions. However, if the accumulator in Algorithm 3 is in projective coordinates, then the right-to-left variant has a less expensive evaluation phase since τ is applied to points in affine coordinates.
Parallel algorithm for Koblitz curves
The basic strategy in our parallel algorithm is to reformulate the scalar multiplication in terms of both the τ and the
where 0 < n < m. Algorithm 3 illustrates a parallel approach suitable for two processors. Although similar in structure to Algorithm 2, a significant difference is the shared precomputation rather than the pre and postcomputation required in Algorithm 2.
The scalar representation is given by Solinas [36] and hence has an expected m/(ω+ 1) point additions in the evaluation-stage, and an extra point addition at the end. There are also approximately m applications of τ or its inverse. If the field representation is such that these operators have similar cost or are sufficiently inexpensive relative to field multiplication, then the evaluation stage can be a factor 2 faster than a corresponding non-parallel algorithm.
As discussed before, unlike the ordinary width-ω NAF, the τ -adic version requires a relatively expensive calculation to find a short ρ with ρ ≡ k (mod δ). Hence, (a portion of) the precomputation is "free" in the sense that it occurs during scalar recoding. This can encourage the use of a larger window size ω. The essential features exploited by Algorithm 3 are that the scalar can be efficiently represented in terms of the Frobenius map and that the map and its inverse can be efficiently computed, and hence the algorithm adapts to curves defined over small fields.
Algorithm 3 ωτ NAF scalar multiplication: parallel
Input: ω, k ∈ [1, r − 1], P ∈ Ea(F2m ) of order r, constant n (e.g., from Table 1(b)) Output: kP 1: ρ ← k partmod δ 2:
if ui = αj then 8:
Q0 ← Q0 + Pj 9:
else if ui = −αj then 10:
Q0 ← Q0 − Pj 11: Q1 ← O 12: for i = n + 1 to m do 13:
if ui = αj then 15:
Q1 ← Q1 + Pj 16:
else if ui = −αj then 17:
Algorithm 3 is attractive in the sense that two processors are directly supported without "extra" computations. However, if multiple applications of the "doubling step" are sufficiently inexpensive, then more processors and additional curves can be accommodated in a straightforward fashion without sacrificing the high-level parallelism of Algorithm 3. As an example for Koblitz curves, a variant on Algorithm 3 discards the applications of τ −1 (which may be more expensive than τ ) and finds kP = k
with traditional methods to calculate k i P . The application of τ j is low cost if there is storage for a per-field matrix as it was first discussed in [1] .
Experimental results
We consider example fields F 2 m for m ∈ {233, 251, 409}. These were chosen to address 112-bit and 192-bit security levels, according to the NIST recommendation, and the 251-bit binary Edwards elliptic curve presented in [6] . The field F 2 233 was also chosen as more likely to expose any overhead penalty in the parallelization compared with larger fields from NIST. Our C library coded all the algorithms using the GNU C 4.6 (GCC) and Intel 12 (ICC) compilers, and the timings were obtained on a 3.326 GHz 32nm Intel Westmere processor i5 660.
Obtaining times useful for comparison across similar systems can be problematic. Intel, for example, introduced "Pentium 4" processors that were fundamentally different than earlier designs with the same name. The common method via time stamp counter (TSC) requires care on recent processors having "turbo" modes that increase the clock (on perhaps 1 of 2 cores) over the nominal clock implicit in TSC, giving an underestimate of actual cycles consumed. Benchmarking guidelines on eBACS [7] , for example, recommend disabling such modes, and this is the method followed in this paper.
Timings for field arithmetic appear in Table 2 . The López-Dahab multiplier described in [2] was implemented as a baseline to quantify the speedup due to the native multiplier. For the most part, timings for GCC and ICC are similar, although López-Dahab multiplication is an exception. The difference in multiplication times between ) is in reduction. The relatively expensive square root in F 2 251 is due to the representation chosen; if square roots are of interest, then there are reduction polynomials giving faster square root and similar numbers for other operations. Inversion via exponentiation ( §2) gives I/M similar to that in [2] where an Euclidean algorithm variant was used with similar hardware but without the carry-less multiplier. Table 4 shows timings obtained for different variants of sequential and parallel scalar multiplication. We observe that for ωNAF recoding with ω = 3, 4, the halve-andadd algorithm is always faster than its double-and-add counterpart. This performance is a direct consequence of the timings reported in Table 3 , where the cost of one point doubling is roughly 5.5 and 4.8 multiplications whereas the cost of a point halving is of only 3.3 and 2.5 multiplications in the fields F 2 233 and F 2 409 , respectively. The parallel version that concurrently executes these algorithms in two threads computes one scalar multiplication with a latency that is roughly 37.7% and 37.0% smaller than that of the halve-and-add algorithm for the curves B-233 and B-409, respectively.
The bold entries for Koblitz curves identify fastest timings per category (i.e., considering the compiler, curve, and the specific value of ω used in the ωNAF recoding). For smaller ω, [21, Alg 3.70] has an edge over [21, Alg 3.70] because τ is applied to points in affine coordinates; this advantage diminishes with increasing ω due to post- computation cost. "(τ, τ )-and-add" denotes the parallel variant described in §4.2. There is a storage penalty for a linear map, but applications of τ −1 are eliminated (of interest when τ is significantly less expensive). Given the modest cost of the multi-squaring operation (with an equivalent cost of less than 1.44 field multiplications, see Table 2 ), the (τ, τ )-and-add parallel variant is usually faster than Algorithm 3. When using ω = 5, the parallel (τ, τ )-and-add algorithm computes one scalar multiplication with a latency that is roughly 35.5% and 40.5% smaller than that of the best sequential algorithm for the curves K-233 and K-409, respectively.
Per-field storage and coding techniques compute half-trace at cost comparable to field multiplication, and methods based on halving continue to be fastest for suitable random curves. However, the hardware multiplier and squaring (via shuffle) give a factor 2 advantage to Koblitz curves in the examples from NIST. This is larger than in [16, 21] , where a 32-bit processor in the same general family as the i5 has half-trace at approximately half the cost of a field multiplication for B-233 and a factor 1.7 advantage to K-163 over B-163 (and the factor would have been smaller for K-233 and B-233). It is worth remarking that the parallel scalar multiplications versions shown in Table 4 look best for bigger curves and larger ω.
Conclusion and future work
In this work we achieve the fastest timings reported in the open literature for software computation of scalar multiplication in NIST and Edwards binary elliptic curves defined at the 112-bit, 128-bit and 192-bit security levels. The fastest curve implemented, namely NIST K-233, can compute one scalar multiplication in less than 17.5µs, a result that is not only much faster than previous software implementations of that curve, but is also quite competitive with the computation time achieved by state-of-the-art hardware accelerators working on similar or smaller curves [24, 1] .
These fast timings were obtained through the usage of the native carry-less multiplier available in the newest Intel processors. At the same time, we strive to use the best algorithmic techniques, and the most efficient elliptic curve and finite field arithmetic formulae. Further, we proposed effective parallel formulations of scalar multiplication algorithms suitable for deployment in multi-core platforms.
The curves over binary fields permit relatively elegant parallelization with low synchronization cost, mainly due to the efficient halving or τ −1 operations. Parallelizing at lower levels in the arithmetic would be desirable, especially for curves over prime fields. Grabher et al. [17] apply parallelization for extension field multiplication, but times for a base field multiplication in a 256-bit prime field are relatively slow compared with Beuchat et al. [8] . On the other hand, a strategy that applies to all curves performs point doubles in one thread and point additions in another. The doubling thread stores intermediate values corresponding to nonzero digits of the NAF; the addition thread processes these points as they become available. Experimentally, synchronization cost is low, but so is the expected acceleration. Against the fastest times in Longa and Gebotys [30] for a curve over a 256-bit prime field, the technique would offer roughly 17% improvement, a disappointing return on processor investment.
The new native support for binary field multiplication allowed our implementation to improve by 10% the previous speed record for side-channel resistant scalar multiplication in random elliptic curves. It is hard to predict what will be the superior strategy between a conventional non-bitsliced or a bitsliced implementation on future revisions of the target platform: the latency of the carry-less multiplier instruction has clear room for improvement, while the new AVX instruction set has 256-bit registers. An issue with the current Sandy Bridge version of AVX is that xor throughput for operations with register operands was decreased significantly from 3 operations per cycle in SSE to 1 operation per cycle in AVX. The resulting performance of a bitsliced implementation will ultimately rely on the amount of work which can be scheduled to be done mostly in registers.
