Abstract. In the underlying finite field arithmetic of an elliptic curve cryptosystem, field multiplication is the most computationally costly operation after field inversion. We present two novel algorithms for the efficient implementation of field multiplication and modular reduction, both used frequently in elliptic curve cryptosystems defined over GF(2^n). We provide a complexity study of the two algorithms and report their implementation performance over GF(2^167).
Introduction
In 1985, Neal Koblitz and Victor Miller independently proposed the elliptic curve cryptosystem, whose security rests on the discrete logarithm problem over the points on an elliptic curve. Elliptic curve cryptography can be used to provide both a digital signature scheme and an encryption scheme. With the apparent advantage of high cryptographic strength relative to key size, elliptic curve cryptosystems [9, 14] have gained much popularity in the implementation of discrete logarithm based public key protocols. The shorter key size generally leads to improved computational efficiency and smaller storage and bandwidth requirements. Although an elliptic curve cryptosystem can be based on a finite field of any characteristic, it is generally practical to implement it over a prime or binary finite field [9, 14].
Certain classes of elliptic curves, such as subfield curves, supersingular curves and anomalous binary curves, have been proposed because they allow more efficient implementations. However, the extra structure of these curves exposes them to attacks, which have been reviewed recently [17, 4]. We consider only non-supersingular and non-anomalous elliptic curves over non-composite fields in this paper. The algorithms presented are specifically for binary finite fields with standard basis representation.
Efficient implementation of elliptic curve cryptography can be addressed at two levels. At the elliptic curve group level, fast algorithms for multiplying a base point P of an elliptic curve by a scalar may be applied [5] [6] [7]. Multiplying a point P of an elliptic curve group by a large integer d is analogous to raising an element of a multiplicative group to the d-th power. The generally accepted algorithm for this computation is the "square-and-multiply" algorithm. Signed digit representation, k-SR representation, addition chains and sliding window methods are applied to scalar multiplication just as they are employed for exponentiation [1, 2].
For the underlying finite field arithmetic, more efficient algorithms for field multiplication and inversion may be introduced. Field multiplication is the most costly operation after field inversion. Various techniques, such as the transformation to projective coordinates, trade field inversions for field multiplications. Hence, it is desirable to provide fast and effective field multiplication and modular reduction.
The purpose of this paper is to present new approaches to the field multiplication and modular reduction commonly performed in elliptic curve cryptosystems defined over binary finite fields.
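The analogy with exponentiation can be made concrete with a minimal sketch of left-to-right "square-and-multiply" on integers modulo m. The function name and the use of plain modular integers are assumptions made for illustration. In the elliptic curve setting, squaring becomes point doubling and multiplication becomes point addition.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation: base**exponent mod modulus."""
    result = 1
    for bit in bin(exponent)[2:]:                    # exponent bits, most significant first
        result = (result * result) % modulus         # always square
        if bit == "1":
            result = (result * base) % modulus       # multiply only when the bit is set
    return result
```

The number of multiplications equals the Hamming weight of the exponent, which is why signed-digit and window methods, which lower that weight or amortise it, carry over directly to scalar multiplication.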
First, we review previous work in section 2. In section 3, we present a method to speed up the computation of field multiplication. This algorithm is applicable to the standard basis representation of elements in the Galois field GF(2^n). The algorithm is based on a modified classical "shift-and-add" method. By eliminating extensive shifting, our algorithm is suited to microprocessors that have a small word size and whose shift instructions move only one bit at a time. Such microprocessors are common in 8-bit or 16-bit microcontrollers and smartcards. While fast binary finite field multiplication using table lookup exists [10], such algorithms are generally not suitable for multiplying field elements of degree > 5 on a low-end microprocessor with limited memory.
In section 4, we present an efficient modular reduction method based on an optimization of Schroeppel's modular reduction technique [16]; our method is more efficient than Schroeppel's approach. A detailed analysis of the complexity and performance of our algorithms is also presented.
We conclude this article with a comparison of implementation results for field multiplication and reduction over GF(2^167). Relative to the classical "shift-and-add" method of multiplication, our implementation shows an approximately 12 percent reduction in computation time on a general purpose 32-bit microprocessor. For the field modular reduction, a 14 percent reduction in computation cost can be realised.
Previous Works
An elliptic curve defined over a field K = GF(2^n), where n is a prime, is the set of solution points (x, y) to an equation of the form:

$$y^2 + xy = x^3 + ax^2 + b, \qquad a, b \in K,\ b \neq 0.$$
The set of points on an elliptic curve, together with a special point called the point at infinity, can be equipped with an Abelian group structure by the following point addition operation:
And by following point doubling operation:
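As a point of reference, the affine group law usually quoted for a non-supersingular curve y^2 + xy = x^3 + ax^2 + b over GF(2^n) can be written as below. The symbols a, b, λ and the subscripted coordinates are notation assumed here, and the paper's own formulas may be arranged differently.

```latex
% Addition of P = (x_1, y_1) and Q = (x_2, y_2), with P \neq \pm Q:
\lambda = \frac{y_1 + y_2}{x_1 + x_2}, \qquad
x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad
y_3 = \lambda (x_1 + x_3) + x_3 + y_1 .

% Doubling of P = (x_1, y_1), with x_1 \neq 0:
\lambda = x_1 + \frac{y_1}{x_1}, \qquad
x_3 = \lambda^2 + \lambda + a, \qquad
y_3 = x_1^2 + (\lambda + 1)\, x_3 .
```

Each operation costs one field inversion, a small number of field multiplications and one squaring, which is why the cost of field multiplication dominates once inversions are traded away.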
Selection of an elliptic curve is a critical step before the implementation. The curve selected should not be a supersingular curve or anomalous curve.
For computation of field multiplication over GF(2^n), we note that word-level multiplication in GF(2) is usually not supported by general microprocessors. There are two common software techniques for achieving GF(2) multiplication:
- Table look-up method
- Emulation using the "shift-and-add" technique

In the table look-up method [10], the field multiplication results are first precomputed. A simple method is to use two tables to store the higher-order and lower-order parts of the multiplication result. The tables are addressed using the bits of the multiplier and multiplicand. Therefore an 8-bit word for GF(2) [10] to handle 16-bit GF(2) multiplication using an 8-bit look-up table, we noted that the overhead is not favourable for microprocessors without a special shift instruction and not practical for devices with extremely limited memory.
The simplest technique to compute multiplication in GF(2) is the "shift-and-add" method. Addition in GF(2) is simply the bitwise exclusive-or operation. As no arithmetic carry is involved, the "shift-and-add" method is neat and simple to implement.
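The classical method can be sketched in a few lines. This is a minimal illustration that holds each polynomial over GF(2) in a Python integer (bit i is the coefficient of x^i). The function name is an assumption, and no modular reduction is performed.

```python
def gf2_shift_and_add(a, b):
    """Carry-less product of two GF(2) polynomials stored as integers."""
    product = 0
    while b:
        if b & 1:
            product ^= a    # addition in GF(2) is exclusive-or, so no carries
        a <<= 1             # multiply the multiplicand by x
        b >>= 1             # move to the next multiplier bit
    return product
```

For example, (x^2 + x + 1)(x + 1) = x^3 + 1 over GF(2), so `gf2_shift_and_add(0b111, 0b11)` yields `0b1001`.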
A New Approach for Multiplication
To compute the product of two field elements A and B in standard basis over GF(2^n), the classical "shift-and-add" algorithm as described in [16] is commonly used. The classical method typically incurs the computational cost of shifting 2s(n − 1) bits of the intermediate results, where n is the number of bits of the field and s is the number of machine words needed to hold a field element.
Our algorithm attempts to eliminate the large number of shift operations, which account for much of the computational cost. Our method requires shifting 2s(w − 1) bits of the intermediate results, where w is the word size of the microprocessor and s = ⌈n/w⌉. This yields a greater performance improvement, particularly for microprocessors with a small word size. Note that the number of field additions remains the same as in the classical method, since it depends on the Hamming weight of the multiplier. The saving in shift operations is possible because addition in GF(2) is the exclusive-or operation and involves no carry.
Efficient Field Multiplication
be the bit-string representation of the multiplier B. B can be partitioned into s blocks, each of length w bits, where s = ⌈n/w⌉. Denote
The field multiplication result of
can be re-expressed as:
The efficient field multiplication is based on the following model of computing the intermediate results (t_0, t_w, ..., t_{(s−1)w}) first and progressively on the next
New-method-for-field multiplication(A, B)
Using the classical "shift-and-add" method, 2s(n − 1) shift operations are required to compute the field multiplication in GF(2^n). With the new approach, only 2s(w − 1) shift operations are incurred, without any need for precomputation. The relation between the number of bits n of the underlying field and the word size w determines whether the new algorithm is more efficient than the classical binary method. The new algorithm is more efficient when 2s(w − 1) < 2s(n − 1), or equivalently w < n.
The word size of a general microprocessor is usually 8, 16, 32 or 64 bits, and for elliptic curves over GF(2^n), n is usually chosen to be about 160 bits. As the field size of the elliptic curve increases, the new algorithm becomes more efficient relative to the classical "shift-and-add" field multiplication method. This is because the number of shift operations performed on element C remains unchanged for the new approach, whereas the number of shift operations in the "shift-and-add" method depends on the field size n. Table 2 compares the two methods for a typical 167-bit field for an elliptic curve cryptosystem, with s defined as the number of words and w as the word size of the microprocessor.
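One arrangement that is consistent with the 2s(w − 1) count is a left-to-right, word-aligned ("comb"-style) multiplication. The sketch below is an illustration under that assumption and is not a transcription of Algorithm 1. All function names are hypothetical, and words are stored little-endian.

```python
def words_from_int(value, s, w):
    """Split an integer into s little-endian words of w bits."""
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(s)]

def int_from_words(words, w):
    """Reassemble little-endian w-bit words into one integer."""
    return sum(word << (i * w) for i, word in enumerate(words))

def shift_left_one(words, w):
    """Shift a multi-word value left by one bit in place; return word shifts used."""
    mask = (1 << w) - 1
    carry = 0
    for i in range(len(words)):
        next_carry = words[i] >> (w - 1)
        words[i] = ((words[i] << 1) & mask) | carry
        carry = next_carry
    return len(words)

def comb_multiply(a, b, n, w):
    """Word-aligned GF(2) polynomial product; returns (product, word_shift_count)."""
    s = -(-n // w)                          # s = ceil(n / w)
    a_words = words_from_int(a, s, w)
    b_words = words_from_int(b, s, w)
    c = [0] * (2 * s)                       # unreduced product needs 2s words
    shifts = 0
    for k in range(w - 1, -1, -1):          # bit position inside each word, high to low
        for j in range(s):                  # same bit position in every block of B
            if (b_words[j] >> k) & 1:
                for i in range(s):          # add A at a word offset: no bit shifting
                    c[j + i] ^= a_words[i]
        if k != 0:
            shifts += shift_left_one(c, w)  # one single-bit pass over the 2s words
    return int_from_words(c, w), shifts
```

Because bit k of every block of B is handled before C is shifted, C is shifted only w − 1 times in total, and each pass touches 2s words. For n = 167 and w = 8 this sketch performs 2 · 21 · 7 = 294 single-bit word shifts.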
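Table 2. Number of shift operations for a 167-bit field

| w  | s  | "Shift-and-add" method | Our method | Percentage savings |
|----|----|------------------------|------------|--------------------|
| 8  | 21 | 7119                   | 441        | 94%                |
| 16 | 11 | 3817                   | 495        | 87%                |
| 32 | 6  | 2178                   | 558        | 74%                |
| 64 | 3  | 1185                   | 567        | 52%                |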
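The s and percentage columns can be re-derived from the listed shift counts with a short script. This is a minimal sketch: the shift counts are copied from the table, and the function name is an assumption.

```python
import math

N = 167
# (word size w, "shift-and-add" shift count, proposed shift count) as listed in the table
ROWS = [(8, 7119, 441), (16, 3817, 495), (32, 2178, 558), (64, 1185, 567)]

def summarize(n, rows):
    """Return (w, s, percentage saving) for each row, with s = ceil(n / w)."""
    out = []
    for w, classical, proposed in rows:
        s = math.ceil(n / w)
        saving = round(100 * (1 - proposed / classical))
        out.append((w, s, saving))
    return out
```

Calling `summarize(N, ROWS)` reproduces the s values 21, 11, 6 and 3 and the savings of 94, 87, 74 and 52 percent.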
Modular Reduction
The result of field multiplication requires a storage length of 2 × sw bits. Modular reduction can be done very efficiently with a sparse irreducible polynomial, such as a trinomial or pentanomial, using shifts and additions. The idea is to zero out the upper bits and add the representation of each original term right-shifted by some quantity. Schroeppel et al. describe a practical approach that works on one computer word at a time to perform the polynomial modular reduction systematically [16]. We consider a trinomial modulus of the form x^n + x^k + 1, where n = 167 and k = 6. After each field multiplication or squaring, the result must be reduced modulo F(x) = x^167 + x^6 + 1.
The product of two polynomials of degree 166 produces a polynomial of degree 332. Assume the polynomial to be reduced is:
Then the reduction modulo x^167 + x^6 + 1 proceeds by reducing each term modulo the trinomial and subtracting it from the result. We noted that:
Instead of working on one computer word at a time and lowering the degree of the polynomial by a word, proceeding from the high-order terms to the low, our approach is to work on s/2 words at a time, lowering the degree by s/2 words. For our approach to be effective, it is therefore desirable to choose a trinomial with a low middle degree k.
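The underlying congruence x^n ≡ x^k + 1 (mod F) can be illustrated with a minimal sketch that folds all high-order bits at once on a Python integer. This is neither the word-at-a-time method of [16] nor the word-blocked Algorithm 2, only the identity both rely on. The function name is an assumption.

```python
N, K = 167, 6    # F(x) = x^167 + x^6 + 1

def reduce_trinomial(a, n=N, k=K):
    """Reduce a GF(2) polynomial (stored as an integer) modulo x^n + x^k + 1."""
    mask = (1 << n) - 1
    while a >> n:
        high = a >> n                          # coefficients of x^n and above
        a = (a & mask) ^ high ^ (high << k)    # replace x^n by x^k + 1
    return a
```

Since k = 6 is small, a product of degree at most 332 is fully reduced after two folds. For instance, x^332 reduces to x^165 + x^10 + x^4.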
Algorithm 2 Let A be the result of field multiplication or squaring prior to modular reduction. A has degree at most 2n − 2. A can be partitioned into 2s blocks, each of length w bits. Let A_i denote the i-th block of the partition of field element A. CarryRightShift(Q, d, T) denotes right-shifting the memory location range in Q by d bits, making use of the word shift with carry instruction available in general microprocessors, with the carry bits stored in T. Temp1 and Temp2 are registers of word size w. The following algorithm performs the modular reduction on A using the trinomial modulus,
New-method-for-trinomial-modular-reduction(A)
A careful choice of reduction trinomial with small values of p = ((n − k) mod w) and u = (n mod w) will improve the efficiency of our new algorithm. In addition to eliminating extensive shifting, our approach also reduces the number of field additions by a factor of 2. This is achieved by aligning the degrees of the congruent terms with the microprocessor word size.
Performance of Implementation
The following table presents the performance benchmark of the improved field multiplication and modular reduction in an elliptic curve cryptosystem defined over GF(2^167). The computation is based on C source code compiled with Microsoft Visual C++ 5.0 without compiler optimization. An Intel Pentium II 32-bit microprocessor running at 333 MHz was used for the benchmarking.
Comparing our new approaches to field multiplication and modular reduction with the classical methods, the timing results show about a 12 percent improvement for multiplication and approximately a 14 percent improvement for modular reduction.
