Abstract: The authors propose balanced algorithms for elliptic curve cryptography (ECC). The authors make the point addition and doubling balanced; that is, they are implemented as identical sequences of operations. As an example the authors implement an ECC point multiplication algorithm, using the approach of Montgomery, for which a single power trace exposes neither the Hamming weight nor the bits of the secret key. Nevertheless, their field-programmable gate array implementation is also compact and efficient. The proposed multiplier for the finite field operations is digit serial and scalable to arbitrary bit-lengths. The method calculates the result by splitting the multiplication into two separate processes. The architecture presented compares favourably with designs presented in the literature. Furthermore, the power consumption graphs show the new implementation has an improved side-channel resistance.
Introduction
The best-known and most commonly used public-key cryptosystems (PKCs) are based on factoring (RSA) and on the discrete logarithm problem (Diffie-Hellman, ElGamal, Schnorr, DSA) [1]. They allow secure communications over insecure channels without prior agreement of a shared secret; they also enable efficient and compact digital signatures. An alternative for PKC is elliptic curve cryptography (ECC), which was proposed in the mid-1980s by Miller [2] and Koblitz [3]. For ECC, two types of finite fields are being considered, i.e. binary and prime fields. A field GF($2^n$) offers far more options as there are many choices for bases, irreducible polynomials, composite fields etc. There exist several ways to accelerate this curve-based arithmetic. Following a bottom-up approach these are speeding up the finite-field arithmetic (especially multiplication and inversion), choosing a 'good' representation (i.e. coordinates that are more efficient) and accelerating a scalar multiplication operation.
In this article we propose an algorithm for multiplication in binary fields and we describe an efficient systolic array architecture for the multiplication. The proposed method performs two parts of the multiplication (from LSB and from MSB) in parallel. Furthermore, we consider the approach of Montgomery [4] for scalar multiplication. This uses a representation where computations are performed on the x-coordinate only. According to López and Dahab [5], the Montgomery representation requires less memory and offers better protection against side-channel attacks. The same conclusion with respect to side-channel protection was drawn by Joye and Yen [6]. They also observed its benefit for a parallel computation. Menezes and Vanstone observed that the benefit in storage is at considerable expense of speed [7]. From an algorithmic point of view, Stam concluded that it is less efficient than other known methods and that in the binary case it can hardly be recommended [8]. However, our conclusion is just the opposite, at least for hardware implementations. This method can benefit from independent calculations for point operations that can therefore be performed fully in parallel by means of two multipliers. Furthermore, we have optimised the formulae for the point operations to have exactly the same number of field multiplications for point addition and doubling. The field multiplications are performed in corresponding steps in both point operations. Similar work has been done by Fischer et al. [9] and Izu and Takagi [10] in GF(p). However, their point operations are not fully balanced as in our case. We are convinced that our approach also offers an improved resistance against side-channel attacks compared with unbalanced methods. We provide some evidence for this in the form of power consumption graphs for what are considered to be 'side-channel vulnerable' operations.
The remainder of this paper is organised as follows. Section 2 provides the necessary mathematical background for ECC in GF($2^n$), the method of Montgomery for point multiplication and modular multiplication (MM) in binary fields. In Section 3 previous work is discussed and some relevant hardware implementations of ECC in GF($2^n$) are briefly reviewed. Section 4 gives algorithms and details of the new implementation. In Section 5 the results of our field-programmable gate array (FPGA) implementation of the ECC processor are presented, including a comparison with other relevant work. Section 6 addresses the security with respect to side-channel attacks. Section 7 concludes the paper and points to future work.
Elliptic curves over GF($2^n$)
ECC relies on a group structure induced on an elliptic curve. A set of points on an elliptic curve (with one special point added, the so-called point at infinity O) together with the so-called chord-and-tangent rule has the structure of an abelian group. Here we consider a finite field of characteristic 2, i.e. GF($2^n$). The point or scalar multiplication is the basic operation for cryptographic protocols; it is easily performed via repeated group operations. At the next (lower) level are the point operations, which are closely related to the coordinates used to represent the points. The lowest level consists of finite field operations such as addition, subtraction, multiplication and inversion required to perform the group operations (Fig. 1).
We introduce some notation. Let $P_4 = (x_4, y_4) = P_2 - P_1$ and $P_5 = (x_5, y_5) = 2P_1$, with $P_3 = P_1 + P_2$. The point $P_4$ is included because the method for point multiplication, as introduced by Montgomery, is defined by the fact that to add two points their difference should be known (whereas the y-coordinate is not needed). The formulae for point addition and doubling from [11] can be rewritten using the fact that $P_1 = (x_1, y_1) \in E$. For $P_1 \neq P_2$ we get
If $P_1 = P_2$,
For $P_4$ we get from Blake et al. [11, Lemma III.2]
We will use the observation that the x-coordinate of $P_5$ does not include the y-coordinate of $P_1$. Also the x-coordinate of the sum of $P_3$ and $P_4$ can be expressed with the x-coordinate only. More precisely, we have
The x-coordinates of the points $P_3$ = ($x_3$,
Proof: It follows directly from the formula for addition and the curve equation. □
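The displayed formulae did not survive in the text above, so the following is only a sketch of how the x-only relations are usually derived. It assumes the standard non-supersingular binary curve $E: y^2 + xy = x^3 + ax^2 + b$ (the curve equation and the parameters a and b are assumptions, not taken from this text) and uses $-P_1 = (x_1, x_1 + y_1)$; the slopes $\lambda$ and $\lambda'$ are names introduced here for illustration.

```latex
% Assumed curve: E : y^2 + xy = x^3 + a x^2 + b over GF(2^n), with b \neq 0.
% \lambda  is the slope of the chord through P_1 and P_2   (gives P_3 = P_1 + P_2),
% \lambda' is the slope of the chord through -P_1 and P_2  (gives P_4 = P_2 - P_1).
\begin{align*}
\lambda  &= \frac{y_1 + y_2}{x_1 + x_2},
  & x_3 &= \lambda^2 + \lambda + x_1 + x_2 + a, \\
\lambda' &= \lambda + \frac{x_1}{x_1 + x_2},
  & x_4 &= \lambda'^2 + \lambda' + x_1 + x_2 + a, \\
x_3 + x_4 &= (\lambda + \lambda')^2 + (\lambda + \lambda')
           = \left(\frac{x_1}{x_1 + x_2}\right)^2 + \frac{x_1}{x_1 + x_2}, \\
x_5 &= \left(x_1 + \frac{y_1}{x_1}\right)^2 + \left(x_1 + \frac{y_1}{x_1}\right) + a
     = x_1^2 + \frac{b}{x_1^2}.
\end{align*}
```

In the last line the curve equation is used to replace $y_1^2 + x_1 y_1 + x_1^3 + a x_1^2$ by b, which is why neither $x_3$ (given $x_4$) nor $x_5$ depends on a y-coordinate.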
Previous work on hardware implementations of ECC
This section lists some relevant previous work on ECC architectures for binary fields. There are many papers [12-16] dealing with this topic, but very few efficient hardware implementations present a completely generic solution that allows an arbitrary choice for all parameters: field size, digit-length, irreducible polynomial, elliptic curve parameters, coordinates etc. We opt for a completely generic implementation as security criteria are changing frequently. Nevertheless, we did not need any of the optimisations that boost performance by fixing some parameters, e.g. special curves, sparse polynomials etc.
As field multiplication is the most crucial aspect for efficient hardware implementations, the previous work on finite field multipliers should also be considered. The first bit-serial multiplier for finite fields was discussed by Beth and Gollmann [17]. This multiplier uses convolution and reduction modulo an irreducible polynomial and takes n clock cycles to compute a multiplication. Relevant algorithms and architectures for multiplication in GF($2^n$) have been proposed in [17-23]. In 2000 Orlando and Paar proposed a scalable elliptic curve processor (ECP) architecture which operates over finite fields GF($2^m$) [13]. The architecture is scalable, with a separated squarer (bit-parallel). Goodman and Chandrakasan proposed a cryptographic processor in [14] which performs a variety of algorithms for PKC applications. Multiplication is performed with a bit-serial multiplier using the Montgomery modular multiplication (MMM) [24]. Gura et al. [15] have introduced a programmable hardware accelerator for ECC over GF($2^n$) which can handle arbitrary field sizes up to 255. The multiplier they use is a digit-serial shift-and-add multiplier. For a detailed survey of finite field multipliers and processors for PKC see [25].
A new hardware implementation
In this Section we describe our new hardware implementation. We follow the top-down approach and for each step we elaborate our choice.
Montgomery method for point multiplication in GF($2^n$)
For the point multiplication we chose the method of Montgomery, which maintains the relationship $P_2 - P_1$ as an invariant [4]. Montgomery's idea was to speed up the calculation by computing only the x-coordinate of the result. More precisely, to add two points their difference is used as an input parameter while the y-coordinate is not used in the algorithm. This is justified because cryptographic applications rarely use the y-coordinate. The algorithm to be used (Algorithm 1) is a variant of the binary method and was considered by López and Dahab. They have also introduced an option for recovering the y-coordinate [5]. The advantage of this algorithm is that it calculates one point addition and one doubling in each step. In this way the loop operations do not depend on the exponent, which could offer an increased resistance against timing and other side-channel attacks. In addition, we noticed that the algorithm requires fewer registers than other hardware solutions. Nevertheless, the performance is not much affected. We discuss this in more detail in what follows.

Algorithm 1: Algorithm for point multiplication
Require: an integer k > 0 and a point P
Ensure:
3: for i from l − 2 downto 0 do
6: else
8: end for
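As a rough illustration of the ladder described here, the following Python sketch performs one addition and one doubling for every scanned key bit and keeps $P_2 - P_1 = P$ as an invariant. It is not the authors' listing: the function name `montgomery_ladder`, the callbacks and the toy group (integers modulo `TOY_MODULUS` under addition, where kP is simply k·P mod M) are assumptions chosen so that the example is self-contained.

```python
def montgomery_ladder(k, P, add, double):
    """Return k*P with one group addition and one doubling per scanned bit."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]                 # k_{l-1} ... k_0, where k_{l-1} = 1
    P1, P2 = P, double(P)             # invariant: P2 - P1 = P
    for bit in bits[1:]:              # i from l-2 down to 0
        if bit == "1":
            P1, P2 = add(P1, P2), double(P2)
        else:
            P1, P2 = double(P1), add(P1, P2)
    return P1


# Hypothetical toy group used only for illustration.
TOY_MODULUS = 1_000_003


def toy_add(a, b):
    return (a + b) % TOY_MODULUS


def toy_double(a):
    return (2 * a) % TOY_MODULUS


result = montgomery_ladder(13, 7, toy_add, toy_double)   # 13 * 7 in the toy group
```

Both branches execute the same pair of operations and differ only in which register receives which result, which is the property the text relies on.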
Point addition and doubling
In this part we move one level lower, i.e. to point operations. This is where our design is improved with respect to other proposals. Namely, point operations (add and double) are in principle different, which can be exploited from the viewpoint of side-channel analysis. Some authors have tried already to balance these two operations in order to improve side-channel resistance. Chevallier-Mames et al. [26] presented a balanced algorithm for ECC over binary fields in the case of affine coordinates. We mention here the work of Brier and Joye [27], who suggested two approaches to achieve uniformity of point operations. However, both approaches result in some penalty in speed. For the formulae of López and Dahab in GF($2^n$) the operation count is A : D = 5M : 6M. Here, A and D are the point operations and M is a field multiplication. We remind the reader that field addition in hardware for GF($2^n$) is just a simple bitwise XOR operation and therefore is not taken into account.
As already mentioned, we deal with projective coordinates to avoid expensive inversions in hardware. Let us consider the formulae for point operations in the case of simple projective coordinates, i.e.
The results of point doubling and point addition, i.e. $X_5 = X(P_5)$ and $X_3 = X(P_3) = X(P_1 + P_2)$, respectively, are calculated as
It is easy to see that point doubling and addition would require six and five multiplications, respectively. We slightly rewrote the formulae in order to have six multiplications for both point operations. We had to add one more multiplication in the point addition, so we used the following formula:
which follows from the Karatsuba-like approach. In the case of Karatsuba's algorithm a formula for multiplication reduces the problem of multiplying 2n-bit numbers to three multiplications of n-bit numbers. Here we need one extra multiplication so we compute $X_1 Z_2 + X_2 Z_1$ with three multiplications. The next property we want is to have balanced field multiplications in each step of the point operation algorithms (this is not the case for the algorithm of [5]). In this way the two multipliers will work fully in parallel while the exponent is scanned bit by bit. Then for each bit one
Each point operation requires exactly six multiplications which are also balanced with respect to the nine steps in Algorithm 2. In Step 5 of the point doubling a redundant operation is inserted to balance the field additions. The required number of intermediate n-bit registers is three for both cases. More precisely, the following lemma holds:
Lemma 4.1: When Algorithm 1 deploys Algorithm 2 we get the following number of operations in GF($2^n$):
Note that the 12 multiplications require only the time for 6 multiplications since 2 multiplications are performed in parallel in every iteration of the main loop. In this way we improved the formulae of López and Dahab in order to have fully balanced point operations to counteract simple side-channel attacks.
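The Karatsuba-like step can be illustrated with a short Python sketch. The exact formula used by the authors is not reproduced in this text, so what follows is one plausible reading: the sum $X_1 Z_2 + X_2 Z_1$ is obtained from the three products $X_1 Z_1$, $X_2 Z_2$ and $(X_1 + X_2)(Z_1 + Z_2)$, and the product of the cross terms is then available as $(X_1 Z_1)(X_2 Z_2)$. The field GF($2^{163}$) with the pentanomial below, and all function names, are assumptions made for the example.

```python
N = 163
# Assumed reduction polynomial: x^163 + x^7 + x^6 + x^3 + 1.
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Multiply two reduced field elements (ints used as bit vectors) modulo F."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> N) & 1:          # keep the running multiple of a reduced
            a ^= F
    return r


def cross_terms_direct(X1, Z1, X2, Z2):
    """Sum and product of the cross terms using 3 field multiplications."""
    t1 = gf_mul(X1, Z2)
    t2 = gf_mul(X2, Z1)
    return t1 ^ t2, gf_mul(t1, t2)


def cross_terms_balanced(X1, Z1, X2, Z2):
    """Same two values using 4 field multiplications (Karatsuba-like)."""
    u = gf_mul(X1, Z1)
    v = gf_mul(X2, Z2)
    s = gf_mul(X1 ^ X2, Z1 ^ Z2)  # = X1Z1 + X1Z2 + X2Z1 + X2Z2
    return s ^ u ^ v, gf_mul(u, v)
```

Under this reading the addition spends one multiplication more than the direct form, which is how a five-multiplication addition can be padded to six without computing a value that is simply thrown away.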
An algorithm for field multiplication
The standard way to compute the product $c(x) = a(x)b(x) \bmod f(x)$ is using convolution, which we refer to as the classical algorithm. Another possible way to calculate the product of two polynomials in GF($2^n$) is Montgomery's multiplication algorithm as proposed in [18]. Here, we define the MMM as $c(x) = a(x)b(x)r^{-1}(x) \bmod f(x)$. Before a sequence of operations can be started, all operands have to be converted to the form $a(x)r(x) \bmod f(x)$, the so-called M-residue of the operand, by multiplication with $r(x) = x^n$. Our circuit implements Algorithm 3. The combined MM algorithm includes two parts, classical and Montgomery, each of which is a systolic array. The parts look quite similar as their cells are performing similar operations, i.e. multiplication and XOR. The difference is that they shift in opposite directions and they start from opposite parts of the loop. While the classical multiplier starts the shift-and-add process from the MSB of one of the operands and shifts the cumulative result left, the Montgomery-based multiplier starts at the LSB and shifts the result right. Those two arrays process the operand a(x) from different sides and they stop after exactly $\lceil s/2 \rceil$ cycles (here s is the number of digits). The classical part still has to perform a shift over $\lceil s/2 \rceil$ digits, but this is taken care of by the conversion of the M-residue of the result. More precisely, the M-residue is of the form $a(x)b(x)x^{-\lceil s/2 \rceil \cdot w} \bmod f(x)$, where w is the digit length.

Algorithm 2: EC point addition and doubling
Require:
The idea of combining two algorithms together was mentioned for the first time in [28] and the schematic of the multiplier was presented in [29] .
A schematic of the multiplier is presented in Fig. 2 . A i , B i and F i are the coefficients of a(x), b(x) and f(x), respectively. The outputs c(x) become inputs to the systolic arrays in the next clock cycle. Finally, the result of the multiplication is obtained by XOR-ing the outputs of both systolic arrays.
In Figure 3 , one processing element (PE) of the array is depicted. There is the so-called regular PE, while the boundary PEs are called leftmost and rightmost PE. They have a slightly different structure which is shown in Fig. 3b (leftmost PE) and 3c (rightmost PE).
After conversion from the M-residue to the normal representation, we get the result. Namely, the following lemma holds.

Proof: Here $n = sw$, i.e. the bit representation can be written in s words of length w. After $\lceil s/2 \rceil$ steps in Algorithm 3, the left ($Res_C$) part calculated has been shifted to the right for $\lceil s/2 \rceil \cdot w$ bits. At the same time, the right ($Res_M$) part has been shifted to the left for the same number of bits due to the division with $x^w$. This results in a partial result of the form $Res_M = a(x)b(x)x^{-\lceil s/2 \rceil \cdot w}$. The $Res_C$ part still has to be shifted over the remaining $\lceil n/2 \rceil$ bits, which is adjusted due to the remaining conversion from Montgomery to normal representation. □

In Algorithm 3 $A_i(x)$ represents one digit of the polynomial a(x). The addition step is performed by multiplication of $A_i(x)$ with b(x). In order to be able to perform Step 7 in Algorithm 3 a multiple of f(x) has to be added which results in $Res_M(x)$ being divisible by $x^w$. So, for $Res_M(x) \neq 0 \bmod x^w$, we need to find
It is easy to see that $F_0^{-1}(x) \bmod x^w = F'_0(x)$, which follows from the relations for Montgomery's parameter r(x) [18]. Also, here $Res_{M_0}(x)$ and $F_0(x)$ are the least significant words of $Res_M(x)$ and f(x), respectively.
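The combined multiplication described in this section can be modelled in software as follows. This is a behavioural sketch under stated assumptions, not the authors' Algorithm 3 or their systolic array: polynomials are Python integers used as bit vectors, f(x) is assumed to have constant term 1, operands are assumed reduced, and all function names are hypothetical. The classical part consumes digits of a(x) from the most significant side, the Montgomery part from the least significant side, and both stop after $\lceil s/2 \rceil$ cycles.

```python
def clmul(a, b):
    """Carry-less (polynomial) product over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_mod(a, f):
    """Remainder of a(x) divided by f(x) over GF(2)."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a


def inv_mod_xw(f0, w):
    """Inverse of f0(x) modulo x^w; f0 must have constant term 1."""
    inv = 1
    for i in range(1, w):
        if (clmul(f0, inv) >> i) & 1:   # cancel coefficient i of f0 * inv
            inv |= 1 << i
    return inv


def combined_mul(a, b, f, w):
    """Return (res, h) with res = a*b*x^(-h*w) mod f and h = ceil(s/2)."""
    n = f.bit_length() - 1
    s = -(-n // w)                      # number of w-bit digits
    h = -(-s // 2)                      # cycles needed by each half
    mask = (1 << w) - 1
    f0_inv = inv_mod_xw(f & mask, w)    # plays the role of F'_0(x)
    res_c = res_m = 0
    for j in range(h):
        # Classical part: next digit from the MSB side, result shifted left.
        hi = (a >> ((2 * h - 1 - j) * w)) & mask
        res_c = poly_mod((res_c << w) ^ clmul(hi, b), f)
        # Montgomery part: next digit from the LSB side, result shifted right.
        lo = (a >> (j * w)) & mask
        res_m ^= clmul(lo, b)
        q = clmul(res_m & mask, f0_inv) & mask
        res_m = (res_m ^ clmul(q, f)) >> w   # exact division by x^w
    return res_c ^ res_m, h


def from_m_residue(res, h, w, f):
    """Undo the factor x^(-h*w) to return to the normal representation."""
    return poly_mod(clmul(res, 1 << (h * w)), f)
```

For example, `combined_mul(a, b, 0x11B, 2)` works in GF($2^8$) with four 2-bit digits, so each half runs for two cycles and `from_m_residue` multiplies the XOR of the two partial results by $x^4$.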
A prototype FPGA architecture
Our ECP is shown in Fig. 4 . The operation blocks on each level from top to bottom are as follows: The main control finite state machine (FSM) first commands the NtoM to convert all the inputs from normal to Montgomery representation and the AtoP to convert the coordinates from affine to projective. The NtoM and the AtoP both use the MM to perform these conversions. The AtoP also needs the MA to do some precalculations, see (2) . When all conversions are
Require: polynomials a(x), b(x) and f(x) finished the main controller orchestrates the PM block to start the point multiplication by invoking the PD and the PA in parallel. It writes the resulting X1 and Z1 to its output registers. The FSM inside the PM orchestrates these operations. Due to the parallelism of the point operations both the PD and the PA use an MA and an MM. The next step after point multiplication is conversion from projective to affine representation using the MI, which invokes the MM to perform the MI using Fermat's theorem. Finally, the affine coordinates are converted from Montgomery to normal representation using the MM. The flowchart of the FSM inside the point multiplication block is shown in Fig. 5 . When the START signal is set, the bits of k are evaluated from MSB to LSB resulting in the assignment of new values for P1 and P2. These values depend on the key-bit k i . When all bits have been evaluated, an internal counter gives an END signal. The result of the last P1 calculation is written to the output register and the VALID output is set.
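The projective-to-affine conversion relies on an inversion that reuses the multiplier via Fermat's theorem, i.e. $a^{-1} = a^{2^n - 2}$ in GF($2^n$). A minimal Python sketch of that idea is given below; it ignores the Montgomery representation used in the actual design, assumes f(x) is irreducible and the operands are reduced, and the names `gf_mul`, `gf_inv` and `to_affine_x` are hypothetical.

```python
def gf_mul(a, b, f):
    """Multiply two reduced elements of GF(2^n) modulo the polynomial f."""
    n = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= f
    return r


def gf_inv(a, f):
    """Invert a as a^(2^n - 2), using nothing but field multiplications."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    n = f.bit_length() - 1
    result = 1
    power = a
    for _ in range(n - 1):
        power = gf_mul(power, power, f)      # a^(2^i) for i = 1 .. n-1
        result = gf_mul(result, power, f)    # exponents sum to 2^n - 2
    return result


def to_affine_x(X, Z, f):
    """Recover the affine x-coordinate x = X / Z from projective (X, Z)."""
    return gf_mul(X, gf_inv(Z, f), f)
```

The loop performs 2(n − 1) multiplications, since squarings are carried out by the same multiplier, which matches the design choice of not adding a dedicated squarer.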
Results
The results of our design on a Xilinx Virtex XCV800 FPGA are given in Table 1 . The formula for the latency that is used in Table 1 is:
, where the three parts of the formula correspond to the calculations of the main conversion operations, the MI and the point multiplication, respectively.
One of the consequences of the scalability of the design is that the minimal clock period does not depend on the bit-length n; it depends only on the digit-length w. This can be observed in Table 1.

Table 2 presents a broader comparison with other architectures. We found that comparing designs is hard since these designs have been optimised for different goals, have been implemented on different platforms and have chosen different options for bases, coordinates, irreducible polynomials etc. Moreover, some of these solutions are not scalable: in these designs, some parameters are fixed (such as a special polynomial, special reduction etc.), which boosts the performance. Therefore, we have included only the solutions that are either scalable [14, 15] or believed to be the state of the art in ECC hardware implementations. The purpose of this table is mainly to reference the prior methodology in this area. We give this comparison as a proof that a scalable and side-channel-secure design can also lead to a solution that is competitive in performance.

Fig. 3 The regular PE of the systolic array, the leftmost PE of the systolic array, the rightmost PE of the systolic array
Side-channel security
Implementations of cryptographic algorithms should be resistant to side-channel attacks such as timing [30], power [31] and electromagnetic radiation [32, 33] analysis attacks. These attacks present a realistic threat for wireless applications and have been demonstrated to be very effective against smart cards without specific countermeasures. The latency of the proposed algorithm does not depend on the Hamming weight of the exponent, which makes it very suitable for defending against timing attacks. Indeed, timing attacks can exploit all steps with non-constant execution time; conditional instructions are a typical target. Also in that case power traces may reveal some secret information that would allow the attacker to perform simple power analysis, i.e. an SPA attack. For example, a typical double-and-add algorithm [1], which executes the point doubling and addition operations if the i-th bit of the exponent $k_i = 1$ and otherwise (for $k_i = 0$) performs only doubling, is not SPA resistant.
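The difference between the two scanning strategies can be made concrete by listing the sequence of point operations each one executes. The Python sketch below is an idealised model only (it records operation names, not power consumption), and the function names are hypothetical; "D" stands for a doubling and "A" for an addition.

```python
def double_and_add_trace(k):
    """Operations of left-to-right double-and-add: an 'A' appears only for 1-bits."""
    ops = []
    for bit in bin(k)[2:][1:]:        # skip the leading 1 of the key
        ops.append("D")
        if bit == "1":
            ops.append("A")
    return "".join(ops)


def ladder_trace(k):
    """Operations of the Montgomery ladder: one 'A' and one 'D' for every bit."""
    return "".join("AD" for _ in bin(k)[2:][1:])


# Two keys of equal length but different bit patterns.
k1, k2 = 0b1011, 0b1100
leaky = (double_and_add_trace(k1), double_and_add_trace(k2))
uniform = (ladder_trace(k1), ladder_trace(k2))
```

In this model the double-and-add sequences differ between the two keys and their lengths reveal the Hamming weight, whereas the ladder sequences depend only on the bit-length of the key. In the design discussed here the addition and doubling additionally run in parallel on two multipliers.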
Also, the computational difference between the point operations is a typical target of an attacker [27, 34]. To prevent that, cryptographic algorithms should be implemented as sequences of operations that are indistinguishable through simple side-channel analysis. Chevallier-Mames et al. define this property as side-channel atomicity [26]. In their view, SPA-resistant algorithms should consist of so-called side-channel atomic blocks which are algorithm specific. In our implementation, there exist side-channel atomic blocks on different levels of the ECC hierarchy. Following the top-down approach, these are point addition/doubling and multiplication/squaring. For that purpose we use the same multiplier for both multiplication and squaring, although there exist more efficient architectures for squaring. Here we propose a fully balanced point multiplication that performs the same operations for every loop of the algorithm. More precisely, Algorithm 2 for point operations executes exactly one field multiplication in corresponding steps. This is not the case in general, as doubling and addition are two different operations in most of the standard references on ECC [11, 35]. With our new Algorithm 2, the point operations result in two patterns that are hard to distinguish (Figs 6 and 7). In both figures a total of six field multiplications can be observed. Furthermore, both operations (double and add) are performed in parallel for each step in Algorithm 1.
The systolic array architecture also allows more parallelism of the multiplication process, which will make power analysis more difficult. These first results, although proving SPA and timing resistance, are just the first step towards a side-channel-resistant design. In future work this algorithm should be more carefully examined with respect to security against side-channel attacks; however, we are convinced that our design approach offers an increased resistance to these attacks. Moreover, for more advanced attacks such as (first- or higher-order) differential power analysis and differential electromagnetic analysis, other countermeasures are required as well.
Conclusions and future work
A complete ECC processor for binary fields has been presented. An FPGA implementation has been fully described including a bit/digit-serial multiplier that combines two previously known multiplication methods. The proposed architecture is a systolic array that allows for good performance in speed and more parallelism in operation, which is also beneficial for side-channel security. Furthermore, we have proposed a fully balanced point multiplication algorithm that performs the same operations for every loop of the algorithm. By using this approach we proved that this so-called side-channel-aware design is the first step towards side-channel resistance. However, it is clear that additional countermeasures might be required.
