A scalar multiplication over a binary elliptic curve consists in a sequence of hundreds of multiplications, squarings and additions. This sequence of field operations often involves a large amount of operations of type AB, AC and AB + CD. In this paper, we modify classical polynomial multiplication algorithms to obtain optimized algorithms which perform these particular operations AB, AC and AB + CD. We then present software implementation results of scalar multiplication over binary elliptic curve over two platforms: Intel Core 2 and Intel Core i5. These experimental results show some significant improvements in the timing of scalar multiplication due to the proposed optimizations.
Introduction
Finite field arithmetic is widely used in elliptic curve cryptography (ECC) [13, 11] and coding theory [4] . The main operation in ECC is the scalar multiplication which is computed as a sequence of multiplications and additions in the underlying field [6, 8] . Efficient implementations of these sequences of finite field operations are thus crucial to get efficient cryptographic protocols.
We focus here on the special case of software implementation of scalar multiplication on elliptic curve defined over an extended binary field F 2 m . An element in F 2 m is a binary polynomial of degree at most m − 1. In practice m is a prime integer in the interval [160, 600] . An addition and a multiplication of field elements consist in a regular binary polynomial addition and multiplication performed modulo the irreducible polynomial defining F 2 m . An addition and a reduction are in practice faster than a multiplication of size m polynomials. Specifically, an addition is a simple bitwise XOR of the coefficients: in software, this consists in computing several independent word bitwise XORs (WXOR). Concerning the reduction, when the irreducible polynomial which defines the field F 2 m is sparse, reducing a polynomial can be expressed as a number of word shifts and word XORs.
Until the end of 2009 the fastest algorithm for software implementation of polynomial multiplication was the Comb method of Lopez and Dahab [12] . This method essentially uses look-up tables, word shifts (Wshift), ANDs and XORs. One of the most recent implementation based on this method was done by Aranha et al. in [1] on an Intel Core 2. But, since the introduction by Intel of a new carry-less multiplication instruction on the new processors i3, i5 and i7, the authors in [16] have shown that the polynomial multiplication based on Karatsuba method [15] outperforms the former approaches based on Lopez-Dahab multiplication. In the sequel, we consider implementations on two platforms: processor without carry-less multiplication (Intel Core 2) and processor i5 which has such instruction.
Our contributions. In this paper, we investigate some optimizations of the operations AB, AC and AB + CD. The fact that we can optimize two multiplications AB, AC which have a common input A, is well known, it was for example noticed in [2] . Indeed, since there is a common input A, the computations depending only on A in AB and AC can be shared.
We also investigate a new optimization based on AB + CD. In this situation, we show that we can save in Lopez-Dahab polynomial multiplication algorithm 60N WShifts and 30N WXORs if the inputs are stored on N computer words. We also show that this approach can be adapted to the case of Karatsuba multiplication and we evaluate the resulting complexity.
We present implementation results of scalar multiplication which involve the previously mentioned optimizations. The reported results on an Intel Core 2 were obtained using Lopez-Dahab polynomial multiplication for field multiplication, and the reported results on an Intel Core i5 were obtained with Karatsuba multiplication.
Organization of the paper. In Section 2, we review the best known algorithms for software implementation of polynomial multiplication of size m ∈ [160, 600]. In Section 3, we then present optimized versions of these algorithms for the operations AB, AC and AB + CD. In Section 4, we describe how to use the proposed optimizations in a scalar multiplication and give implementation results obtained on an Intel Core 2 and on an Intel Core i5. Finally, in Section 5, we give some concluding remarks.
Review of multiplication algorithms
The problem considered in this section is to compute efficiently a multiplication in a binary field F 2 m . A field F 2 m can be defined as the set of binary polynomials modulo an irreducible polynomial f (x) ∈ F 2 [x] of degree m. Consequently, a multiplication in F 2 m consists in multiplying two polynomials of degree at most m − 1 and reducing the product modulo f (x). The fields considered here are described in Table 1 and are suitable for elliptic curve cryptography. The irreducible polynomials in Table 1 have a sparse form. This implies that the reduction can be expressed as a number of shifts and additions (the reader may refer for example to [8] for further details).
We then focus on efficient software implementation of binary polynomial multiplication: we review the best known algorithms for polynomial of cryptographic size. An element
is coded over N = m/64 computer words of size 64 bits A[0], . . . , A[N − 1]. In the sequel, we will often use a nibble decomposition A:
where deg A i < 4 and n = m/4 is the nibble size of A. In Table 1 we give the value of N and n for the field sizes m = 233 and m = 409 considered in this paper.
Comb Multiplication
One of the best known methods for software implementation of the multiplication of two polynomials A and B was proposed by Lopez and Dahab in [12] . This algorithm is generally referred as the leftto-right comb method with window size w. We present this method for the window size w = 4 since, based on our experiments and several other experimental results in the literature [1, 8] , this seems to be the best case for the platform considered here (Intel Core 2). This method first computes a table T containing all products u · A for u(x) of degree < 4. The second input B is decomposed into 64-bit words and nibbles as follows
Then the product A × B is expressed by expanding the above expression of B as follows
The above expression can be computed through a sequence of accumulations R ← R+T [B 16j+k ]x 64j , corresponding to the terms A · B 16j+k x 64i , followed by multiplications by x 4 . This leads to Algorithm 1 for a pseudo-code formulation and Algorithm 6 in the appendix for a C-like code formulation.
Algorithm 1 CombMul(A,B)
Require: Two binary polynomials A(x) and B(x) of degree < 64N − 4, and B(x) = N −1 i=0 15 k=0 B 16j+k x 4k+64j is decomposed in 64-bit words and nibbles. Ensure:
end for end for
Complexity. We evaluate the complexity of the corresponding C-like code (Algorithm 6) of the CombMul algorithm in terms of the number of 64-bit word operations (WXOR, WAND and WShift).
We do not count the operations performed for the loop variables k, j, . . .. Indeed, when all the loops are unrolled, these operations can be precomputed. We have separated the complexity evaluation of the CombMul algorithm into three parts: the computation of the table T , the accumulations R ← R + T [B 16j+k ]x 64j and the shifts R ← R · x 4 of R.
• Table computation . The loop on k is of length 7, and performs one WXOR and one WShift plus 2(N − 1) WXORs and 2(N − 1) WShifts in the inner loop on i.
• Shifts by 4. There are two nested loops: the one on k is of length 15 and the loop on i is of length 2N . The loop operations consist in two WShifts and one WXOR.
• Accumulations. The number of accumulations R ← R + T [B 16j+k ]x 64j is equal to n, the nibble length of B. This results in nN WXOR, n WAND and n − N WShift operations, since a single accumulation R ← R + T [B 16j+k ]x 64j requires N WXOR, one WAND and one WShift (except for k = 0). Table 2 , the total number of operations is equal to nN + 44N − 7 WXORs, n + 73N − 7 WShifts and n WANDs. Table T 14N
As stated in
− 7 14N − 7 0 R ← R + T [B 16j+k ]x 64j nN n − N n Shift R ← R << 4 30N 60N 0 Total nN + 44N − 7 n + 73N − 7 n
Karatsuba multiplication
We review the Karatsuba approach for binary polynomial multiplication. Let A and B be two binary polynomials of size 64N and assume that N is even. Then, we first split A and B in two halves A = A 0 + x 64N/2 A 1 and B = B 0 + x 64N/2 B 1 and then we re-express the product A × B in terms of three polynomial multiplications of half size:
The resulting recursive approach is given in KaratRec algorithm (i.e., Algorithm 2). In this case the inputs A and B are supposed to be of size 64N bits where N = 2 s and packed in an array of N computer words. The three products R 0 , R 1 and R 2 are computed recursively until we reach inputs of size one computer word. Then the word products are computed with a Mult64 operation. We further assume that this Mult64 operation is performed using a single processor instruction: this is the case of the Intel Cores i3, i5 and i7.
Complexity of KaratRec approach. We briefly compute the complexity of the KaratRec algorithm in terms of the number of WXOR and Mult64 operations. One single recursion of the Karatsuba formula with inputs of word size N requires N WXORs for the additions A 0 + A 1 and B 0 + B 1 , and 5N/2 WXORs for the reconstruction of R. We obtain the recursive complexity given in the left side
Algorithm 2 KaratRec(A,B,N)
Require: A and B on N = 2 s computer words.
of (2). We rewrite the complexity in the non-recursive form given in the right side of (2).
3 Optimization of the operations AB + CD and AB, AC
In this section, we present our main building blocks for the optimization of software implementation of elliptic curve scalar multiplication. The main idea is that the scalar multiplication involves operations of type AB+CD or AB, AC. In such operations AB+CD and AB, AC some computations can be saved resulting in a more efficient software implementation. This idea was previously mentionned for example in [2] for AB, AC for the CombMul algorithm. We extend this idea to the variants based on Karatsuba multiplication. We also study the optimization based on the operation AB + CD in the case of CombMul algorithm and in the case of the variants of Karatsuba multiplication.
Optimizations of AB + CD and AB, AC in the CombMul approach
Optimization AB, AC in the CombMul algorithm. The fact that we have to compute two multiplications with the same operand A, implies that the table T in the CombMul algorithm, which contains the products T [u] = u · A, can be computed only once for the two multiplications AB and AC. This saves 14N − 7 WXORS and 14N − 7 Shifts operations in the computation of AC. The resulting complexity of the CombMul ABAC algorithm is shown in Table 3 . Due to lack of space we do not provide the algorithm form of CombMul ABAC, but it can be easily deduced from Algorithm 1.
Optimization AB + CD in the CombMul algorithm. We optimize the operation AB + CD by performing the final addition (AB) + (CD) during the accumulation step of the CombMul algorithm. 
// Computation of the table T and S such that
The complexity of Algorithm 3 can be easily deduced from the complexity of the CombMul algorithm (Table 2 ):
• We have in the CombMul ABplusCD algorithm two table computations which contribute to twice the complexity of the table computation in Table 2 .
• The accumulations
])x 64j also contribute to twice the complexity of the accumulation step in Table 2 .
• We have the same amount of shifts R ← R · x 4 as in the CombMul algorithm.
The resulting complexity is given in Table 3 . 3.2 Optimizations AB + CD and AB, AC in the KaratRec approach
The optimization based on AB, AC can be extended to the KaratRec algorithm. Indeed the recursive splitting and the addition of the two halves A 0 + A 1 can be performed only once for the polynomial A. This approach is described in Algorithm 5.
We also adapt the optimization AB + CD as follows: the addition is performed before the reconstruction of the two products AB and AC, this means that we have only one recursive reconstruction instead of two. This approach is specified in Algorithm 4. 
Complexity of KaratRec ABAC. In the first recursion we have 3N/2 WXORs for A 0 + A 1 , B 0 + B 1 and C 0 + C 1 plus 5N WXORs for the reconstructions of R and S. This leads to the following complexity:
Complexity of KaratRec ABpCD. In the first recursion we have 2N WXORs for the computations A 0 + A 1 , B 0 + B 1 , C 0 + C 1 and D 0 + D 1 plus 5N/2 WXORs for the reconstruction of R. The complexity for N = 1 is equal to 2M ult64 plus one WXOR. Based on this, we derive the complexity for the KaratRec ABpCD algorithm:
Complexity comparison and implementation results
Using the complexity results determined in the former subsections, we can compute the complexities of the multiplication algorithms and their optimized AB, AC and AB + CD counter parts for the polynomial sizes m = 233 and m = 409. We implemented these algorithms on the platforms Intel Core 2 and Intel Core i5. Our implementation uses 128-bit registers and vector instructions available on these two processors. On the Core 2 we used the modified CombMul algorithm of [5, 1] which uses mostly shifts by multiple of 8; cheaper than an arbitrary shift for 128-bit data. On the Core i5 we implemented the KaratRec multiplication method with the PCLMUL instruction which performs carry-less multiplication of two 64 bit inputs contained in 128-bit registers.
The resulting complexities and timings are reported in Table 4 and Table 5 . Based on the results presented above, we notice that the optimization AB + CD has always a better complexity than the optimization AB, AC and better than two independent multiplications. Concerning the timings we note that:
• On the Core 2 the optimization ABplusCD is always slower than the optimization AB, AC.
Moreover, the optimizations ABplusCD and AB, AC are effective only for m = 233, since in this case they are faster than two independent multiplications. This seems to contradict the corresponding complexity results since the complexity differences appear quite large.
• On the Core i5 the timing results are more related to the complexity values: for the two considered degrees ABplusCD and AB, AC are faster than two independent multiplications and ABplusCD is always faster than AB, AC.
In the literature we can find some timing of the CombMul algorithm over a Core 2 in [1] . The authors in [1] Table 4 . However, our results on the Core i5 compares favorably with the results recently reported in [16] , which give 128 clock-cycles for m = 233 and 345 clock-cycles for m = 409, but these reported timings may include the reduction operation (this is not clearly specified in [16] ).
Implementations of scalar multiplication based on the optimizations AB, AC and AB + CD
In this section, we present our experimental results for scalar multiplication based on the optimizations AB, AC and AB + CD presented in the previous section. We first review best known elliptic curve point operation formulas, and describe how we use the optimizations AB, AC and AB + CD in these formulas. Then we describe the strategies we used for our implementations: scalar multiplication algorithms and implementations of field operations (squaring, inversion, ... ). Finally, we present the implementation results on an Intel Core 2 and an Intel Core i5.
Elliptic curve arithmetic
The considered curves are ordinary binary elliptic curve defined by the following Weierstrass equation
We will more specifically focused on the two NIST [14] curves B233 and B409.
Optimization AB, AC and ABplusCD in curve operation
We review Kim-Kim elliptic curve operations [10] in order to describe how the optimized operations AB, AC and AB + CD can be used in the curve operations. Kim and Kim in [10] use a specific projective coordinates P = (X : Y : Z : T ) which corresponds to the affine point (X/Z, Y /T ) where
In the following formulas we use the following notations: A · B is a non reduced polynomial multiplication, and [R] represents the reduction of the polynomial R modulo the irreducible polynomial defining the field F 2 m .
• Point doubling in Kim-Kim coordinates. We compute the doubling P 1 = (X 1 : Y 1 : Z 1 : T 1 ) = 2 · (X : Y : Z : T ) of a point P = (X : Y : Z : T ) by performing the following sequence of operations
and then:
• Point addition in Kim-Kim coordinates. We review the Kim-Kim formula for mixed point addition: we add one point P 1 = (X 1 : Y 1 : Z 1 : T 1 ) which has a regular Kim-Kim projective coordinates with a point P 2 = (X 2 : Y 2 : 1 : 1) which is in affine coordinates, i.e., Z 2 = T 2 = 1. The coordinates of P 3 = (X 3 : Y 3 : Z 3 : T 3 ) is then computed with the following sequence of operations:
and then deduce:
In the above formulas, we indicated the operations which can be performed with the optimization AB + CD and the operations which can be performed with the optimization AB, AC.
• Optimization AB, AC and ABplusCD in other curve operation formulas. We consider the following two cases: Lopez-Dahab formulas, which are variants of the Kim-Kim formulas, and Montgomery laddering. For the Lopez-Dahab formulas the optimizations AB, AC and ABplusCD can be Table 6 : Optimizations AB, AC and ABplusCD in Lopez-Dahab curve operations Point doubling: P 1 = 2 · P where
Point addition:
where 
applied in both doubling and mixed addition. For the Montgomery laddering we can just apply one optimization ABplusCD in the inner loop operation. These formulas along with the optimizations AB, AC and ABplusCD are given in Table 7 and Table 6 . Another interesting operation is the point halving, but, unfortunately, we could not apply any of the optimizations AB, AC or ABplusCD in the halving formula of [8] (Algorithm 3.81 [8] , p. 131]). Indeed, this point halving consists in one half-trace operation, followed by one multiplication, one trace computation and one square root, so no optimization based on combined multiplications can be applied.
Scalar multiplication algorithm
The scalar multiplication on the curve E(F 2 m ) consists in the computation of r · P for a given point P ∈ E(F 2 m ) and an -bit integer r where is the bit length of the order of P . We implemented the following methods for scalar multiplication:
• Double-and-add. This approach consists in a sequence of doublings and additions on the curve. The integer r is generally recoded with the N AF w algorithm [8] with window size w = 4 in order to reduce the number of additions performed during the double-and-add algorithm. The scalar multiplication then requires a table precomputation T [i] = i · P for the odd integers 0 < i < 2 w−1 . This approach is detailed in Algorithm 7 in the appendix. In our implementations we used the Kim-Kim (cf. Subsection 4.1.1) and the Lopez-Dahab (Table 6) doubling and addition formulas.
• Halve-and-add. This approach consists in a sequence of halvings and additions on the curve.
The integer r is first recoded in r = r · 2 −1 mod #< P > since in this case we have:
and we can then compute r · P as a sequence of halvings and additions. We use again the N AF w algorithm for w = 4 to recode r and the variant of the halve-and-add approach given Algorithm 8 to perform the scalar multiplication. The reader may refer to Section 3.6 in [8] for further details on point halving approaches.
• Parallel (Double-and-add, Halve-and-add). This approach, proposed in [16] , splits the computation of the scalar multiplication in two parts: one uses double-and-add approach and the other uses halve-and-add approach. This requires some recoding of the scalar r similar to the one used in halve-and-add approach. Details are given in Algorithm 9 in the appendix.
• Montgomery. The last approach we considered is the Montgomery laddering (cf. Algorithm 3.40, p.103 in [8] ): it is a variant of the double-and-add approach. The main difference is that two points are computed in the inner for loop of the algorithm: P 1 and P 2 which have a constant difference P 1 − P 2 = P . This approach has some nice properties as counter measure against side channel attacks.
Implementation aspects
We use the following strategies to implement the field operations required in scalar multiplication algorithms:
• Multiplication. The considered multiplication strategies have already been described in Subsection 3.3. Specifically, on the Intel Core 2 platform, we use the version of the CombMul algorithm of [5, 1] which uses 128-bit instruction sets. On the Intel Core i5 platform we use the Karatsuba algorithm along with vector instructions and more precisely the carry-less instruction which performs binary polynomial multiplication of size 64 bits.
• Squaring. For the squaring we use the strategy described in [1] . Specifically, we use a 128-bit word Sq which stores in each byte the squaring of a 4-bit polynomial. Then for each 128-bit word A[i] of A we separate odd and even nibbles with a masking and a shift and then apply mm shuffle epi8 intrinsinc function with left input value Sq and right input value the word containing even or odd nibbles of A[i]. The result is a 128-bit word containing the squaring of each nibble. The bytes are then reordered and repacked into two 128-bit words. The reader may refer to Algorithm 1 in [1] for further details.
• Square-root. The square root is based on the expression
Following [1] , we separate odd and even coefficients of A using the mm shuffle epi8 intrinsinc function and by reordering the resulting bytes. Then the multiplication by √ x is done through a number of shifts and additions since for m = 233 and m = 409, √ x has a sparse expression.
• Reduction. The reduction follows the strategy of [8] : the considered irreducible polynomials are sparse (cf. Table 1 ), this makes possible to perform a reduction with a short sequence of shifts and WXORs.
• Inversion. The inversion is computed using the Itoh-Tsujii algorithm [9] . This algorithm consists in a sequence of multiplications and multi-squarings. This sequence of multiplication and squaring reconstructs step by step the exponent of A −1 = A • Half-trace. In the halving curve operation, we have to compute half-trace (HT ) of an element:
Our implementation is again inspired from [16] and [7] and uses the mm shuffle epi8 function to compute the half-trace of the even bits of A and look-up table to compute the half-trace of the odd bits of A. For further details on this the reader may refer to [16] .
Lazy reductions. An optimization called lazy-reduction can be used to optimize curve operations (cf. [2, 3] ). This consists in removing unnecessary reduction operations performed during the sequence of multiplications and squarings in the curve operation formulas. Here we considered the following two lazy reduction optimizations:
• Lazy-reduction 1 (LR1). This optimization regroups reduction operations corresponding to distinct squarings or multiplications. For example in the sequence of operations A 2 + C · D we can perform the addition (addition of polynomial of degree 2m − 2) before performing the reduction. This reduces the total number of WXORs and WShifts. In the considered elliptic curve operation formulas (Kim-Kim, Lopez-Dahab and Montgomery) the bracket [·] specifies the reduction operations corresponding to this LR1 optimization (cf. Subsection 4.1, Table 6 and 7).
• Lazy-reduction 2 (LR2). In this case the reduction modulo the irreducible polynomial is partially done, this results in a polynomial with a degree larger than m − 1. We have applied this approach for m = 233: the polynomial is reduced to a degree 255 instead of 232. Since the KaratRec algorithm multiplies polynomials of size 256, we don't have to reduce the coefficients in the range [233, 256], so we can use a lazy reduction of this kind. Figure 1 illustrates this strategy: we can see in this figure that the LR2 approach saves the computations involved in the reduction of the word containing coefficients c 255 , . . . , c 233 .
We did not apply this strategy in the case of Intel Core 2 since the CombMul approach multiplies polynomials of degree 232 and not 256. For the case of degree 409, the LR2 approach does not provide any saving in the number of words which have to be reduced, so, again, we did not implement such LR2 optimization. generally more efficient than AB, AC. The only cases in which neither AB+CD nor AB, AC provide the best timing result is the parallel implementation for m = 233 and halve-and-add implementation for m = 409.
In Table 10 , we review the results obtained by Aranha et al. over an Intel Core i5 with a GCC compiler. We remark that, except for parallel implementation when m = 409, our results are competitive with the timings of Table 10 . This means that our implementations reach the level of performance of [16] and that the proposed optimized operations are efficient when included in the best known implementation strategies for Intel Core i5.
Conclusion
The goal of this paper was to study software optimizations of binary field operations AB, AC and AB + CD for scalar multiplication on binary elliptic curves. We have established several algorithms for these optimizations and have evaluated the complexity of the corresponding C-like codes of these algorithms. We have then presented implementation results for scalar multiplication on an Intel Core 
Algorithm 6 CombMul C Code
Require: A and B two N 64-bit words polynomials of nibble length n Ensure: Q ← Q + P ki . Q ← Q − P −ki .
11:
end if 12: end for 13: return(Q) Algorithm 8 Halve-and-add with N AF w Require: Window width w, P ∈ E(F 2 m ) and an integer k Ensure: Q = k · P .
1: k = k · 2 m mod #E(F 2 m ). Q k i ← Q k i + R. Q −k i ← Q −k i − R.
10:
end if
11:
R ← R/2. 12: end for 13: Q ← i∈{1,3,5,...,2 w−1 −1} i · Q i . 14: return(Q) Algorithm 9 Parallel (Double-and-add, Halve-and-add) with N AF w Require: Window width w, P ∈ E(F2m ) and an integer k Ensure:
//-------------Begin parallel execution ---------------Compute Pi = i · P for i ∈ {1, 3, 5, ..., 2 w−1 − 1} Set Qi = O for i ∈ {1, 3, 5, ..., 2 w−1 − 1} Q ← O R ← P for i from l downto t for i from t − 1 downto 0
end if end if R ← R/2 end for end for //-------------End parallel execution ---------------Q ← Q + i∈{1,3,5,...,2 w−1 −1} i · Qi return Q = Q + i∈{1,3,5,...,2 w−1 −1} i · Qi
