Abstract
Introduction
The best-known and most commonly used public-key cryptosystems (PKC) are based on factoring (RSA) and on the discrete logarithm problem (Diffie-Hellman, DSA) [19]. They allow secure communication over insecure channels without prior agreement on a shared secret, and they also enable digital signatures. An alternative for PKC is Elliptic Curve Cryptography (ECC), which was proposed in the mid-1980s by Miller [21] and Koblitz [14].
In this article we propose an implementation of the approach of Montgomery [22] for scalar multiplication in binary fields. It uses a representation where computations are performed on the x-coordinate only. Menezes and Vanstone observed that the benefit in storage comes at a considerable expense of speed [20]. Also, from an algorithmic point of view, Stam concludes that it is less efficient than other known methods and that in the binary case it can hardly be recommended [26]. However, our conclusion is just the opposite, at least for hardware implementations. The method benefits from the independence of the two point operations, which can therefore be performed fully in parallel by means of two multipliers. Furthermore, we have optimized the formulae for the point operations so that point addition and point doubling require exactly the same number of field multiplications. The field multiplications are performed in corresponding steps in both point operations. We are convinced that this approach also offers improved resistance against side-channel attacks compared to other, unbalanced methods.
The remainder of this paper is organized as follows. Section 2 provides the necessary mathematical background. In Sect. 3 previous work is discussed. Section 4 gives algorithms and details of the new implementation. In Sect. 5 the results of our FPGA implementation of the ECC processor are presented, including a comparison with other relevant work. Section 6 addresses the security with respect to simple side-channel attacks. It also includes graphs of power consumption that illustrate the improved side-channel resistance. Section 7 concludes the paper.
Elliptic Curves over GF(2^n)
The point or scalar multiplication is the basic operation for cryptographic protocols; it is easily performed via repeated group operations. At the next (lower) level are the point operations, which are closely related to the coordinates used to represent the points. The lowest level consists of finite field operations such as addition, subtraction, multiplication and inversion required to perform the group operations.
We introduce some notation. Let P_4 = (x_4, y_4) = P_2 − P_1 and P_5 = (x_5, y_5) = 2P_1, with P_3 = P_1 + P_2. The point P_4 is included because the method for point multiplication, as introduced by Montgomery, relies on the fact that to add two points their difference must be known (while the y-coordinate is not needed). The formulae for point addition and doubling from [4] can be rewritten using that P_1 = (x_1, y_1) ∈ E. For P_1 = P_2 we get:
(1)
For P_4 we get from Blake et al. [4, Lemma III.2]:
We will use the observation that the x-coordinate of P_5 does not depend on the y-coordinate of P_1. Also, the x-coordinate of the sum P_3 can be expressed using only the x-coordinates of P_1, P_2 and their difference P_4. More precisely, we have:
The x-coordinates of the points P_3 = (x_3, y_3) = P_1 + P_2 and P_4 = (x_4, y_4) = P_2 − P_1 on an elliptic curve (1) satisfy:
Proof: It follows directly from the formula for addition and the curve equation.
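For orientation, the relations discussed in this section can be written out as follows. This is a sketch under the assumption that (1) is the usual non-supersingular binary curve with coefficients a and b (b ≠ 0); it restates the well-known x-only formulas in the notation of this section and is not a quotation of the displayed equations.

```latex
% Assumption: E is a non-supersingular curve over GF(2^n) with b \neq 0.
\begin{align*}
E:\quad & y^2 + xy = x^3 + a x^2 + b \\
% Doubling P_5 = 2P_1: no y-coordinate is involved.
x_5 &= x_1^2 + \frac{b}{x_1^2} \\
% Addition P_3 = P_1 + P_2 with known difference P_4 = P_2 - P_1.
x_3 &= x_4 + \frac{x_1}{x_1 + x_2} + \left(\frac{x_1}{x_1 + x_2}\right)^{2}
\end{align*}
```

The last expression is symmetric in x_1 and x_2, because x_1/(x_1 + x_2) and x_2/(x_1 + x_2) differ by 1 and t^2 + t is unchanged under t → t + 1 in characteristic 2.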
Previous Work on Hardware Implementations of ECC
This section lists some relevant previous work on ECC architectures for binary fields. There are many papers [23, 9, 10, 12] dealing with this topic, but very few efficient hardware implementations present a completely generic solution that allows an arbitrary choice for all parameters. Orlando and Paar [23] proposed a scalable elliptic curve processor architecture that operates over finite fields GF(2^m). Goodman and Chandrakasan proposed a cryptographic processor [9], which performs a variety of algorithms for PKC applications. Gura et al. [10] introduced a programmable hardware accelerator for ECC over GF(2^n). The first bit-serial multiplier was discussed by Beth and Gollmann [3]. This multiplier uses convolution and reduction modulo an irreducible polynomial and takes n clock cycles to compute a multiplication. Relevant algorithms and architectures for multiplication in GF(2^n) have been proposed in [6, 3]. For a detailed survey on finite field multipliers and processors for PKC see Batina et al. [2].
A New Hardware Implementation
In this section we describe our new hardware implementation. We follow the top-down approach and for each step we elaborate our choice.
Montgomery Method for Point Multiplication in GF(2^n)
For the point multiplication we chose the method of Montgomery [22]. The algorithm used (Algorithm 1) was considered by López and Dahab [18].
Algorithm 1 Algorithm for point multiplication
Require: an integer k > 0 and a point P Ensure:
Else 7:
The advantage of this algorithm is that it calculates one point addition and one point doubling in each step. Moreover, the algorithm requires fewer registers than other hardware solutions. This could be of interest for implementations in constrained environments.
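The structure described above, one addition and one doubling per key bit with the difference of the two running points fixed at P, can be sketched in Python. This is an illustration under stated assumptions, not the hardware algorithm itself: the toy field GF(2^4) with f(x) = x^4 + x + 1, the coefficient b = 1, and all function names are chosen here for the example, and the x-only projective formulas are the standard ones for binary curves.

```python
# Toy parameters (assumptions for illustration only).
M = 4               # field GF(2^4)
F_POLY = 0b10011    # f(x) = x^4 + x + 1, irreducible over GF(2)
B = 1               # curve coefficient b (must be nonzero)

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M) modulo F_POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F_POLY
    return r

def gf_inv(a):
    """Inverse of a nonzero element as a^(2^M - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = gf_mul(r, a)
    return r

def madd(X1, Z1, X2, Z2, x):
    """x-only addition; x is the affine x-coordinate of the difference P."""
    t = gf_mul(X1, Z2)
    u = gf_mul(X2, Z1)
    s = t ^ u
    Z3 = gf_mul(s, s)
    X3 = gf_mul(x, Z3) ^ gf_mul(t, u)
    return X3, Z3

def mdouble(X, Z, b):
    """x-only doubling: X' = X^4 + b*Z^4, Z' = X^2 * Z^2."""
    X2 = gf_mul(X, X)
    Z2 = gf_mul(Z, Z)
    return gf_mul(X2, X2) ^ gf_mul(b, gf_mul(Z2, Z2)), gf_mul(X2, Z2)

def ladder_x(k, x, b=B):
    """Return the affine x-coordinate of kP (None for the point at infinity)."""
    X1, Z1 = x, 1                    # P1 = P
    X2, Z2 = mdouble(x, 1, b)        # P2 = 2P
    for bit in bin(k)[3:]:           # key bits after the leading 1
        if bit == "1":
            (X1, Z1), (X2, Z2) = madd(X1, Z1, X2, Z2, x), mdouble(X2, Z2, b)
        else:
            (X2, Z2), (X1, Z1) = madd(X1, Z1, X2, Z2, x), mdouble(X1, Z1, b)
    if Z1 == 0:
        return None
    return gf_mul(X1, gf_inv(Z1))
```

Both branches execute exactly one `madd` and one `mdouble`; only the roles of the two running points are swapped. Because the two calls use only the old values, they are independent and could run on two multipliers at the same time.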
Point Addition and Doubling
At this level our design improves on other proposals. Point operations (addition and doubling) are in principle different, which can be exploited by side-channel analysis. Some authors have previously tried to balance these two operations in order to improve side-channel resistance. We mention here the work of Brier and Joye [5], who suggested two approaches to achieve uniformity of point operations. However, both approaches incur some penalty in speed.
In the formulae of López and Dahab in GF(2^n), point operations are almost balanced, as they have A : D = 5M : 6M. Here, A and D are the point operations and M is a field multiplication. Consider the formulae for point operations in the case of simple projective coordinates i.e.
The results of point doubling and point addition, i.e. X_5 = X(P_5) and X_3 = X(P_3) = X(P_1 + P_2) respectively, are calculated as:
It is easy to see that point doubling and addition would require 6 and 5 multiplications respectively. We slightly rewrote the formulae in order to have 6 multiplications for both point operations. We had to add one more multiplication in the point addition, so we used the following formula:
that follows from the Karatsuba-like approach [13] . The algorithms for point addition and doubling are given in Algorithm 2.
Algorithm 2 EC point addition and doubling
Each point operation requires exactly 6 multiplications, which are also balanced. In Step 5 of the point doubling a redundant operation is inserted to balance even the field additions. The required number of registers is 3 for both cases. More precisely, the following lemma holds:
Note that the 12 multiplications require only the time for 6 multiplications, since two multiplications are performed in parallel in every iteration of the main loop.
An Algorithm for Field Multiplication
Our circuit implements Algorithm 3; it includes two parts, classical and Montgomery, each of which is a systolic array. The two arrays process the operand a(x) from different sides, and they stop after exactly n/2 cycles for the bit-serial version and after s/2 cycles for this new digit-serial architecture. In short, let us denote by a_MSB(x) and a_LSB(x) the most significant and the least significant half of a(x), respectively. After exactly s/2 steps the classical and the Montgomery part have calculated
, respectively. So each part has evaluated half of the polynomial a(x), and XOR-ing them gives the M-residue of the multiplication result Res(x) with r(x) = x^((s/2)·w). After conversion from Montgomery to the normal representation, we get the result. Namely, the following lemma holds.
Lemma 4.2 The result of Algorithm 3 is in the Montgomery domain i.e. Res(x)
The idea of combining two algorithms together was mentioned in [24] and the schematic of the multiplier was presented in [1] .
In Algorithm 3, A_i(x) represents one digit of the polynomial a(x). Also, here Res_M0(x) and F_0(x) are the least significant words of Res_M(x) and f(x), respectively. On the other hand, Res_Cn−1(x) corresponds to the most significant word of Res_C(x).
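The combination of a classical and a Montgomery part can be sketched for the bit-serial case (digit size w = 1, so s = n). The following Python model is an assumption-laden illustration rather than Algorithm 3 itself: it uses GF(2^8) with f(x) = x^8 + x^4 + x^3 + x + 1 as a toy field with even n, runs the two halves one after the other instead of in parallel, and all function names are invented for the example.

```python
N = 8
F_POLY = 0x11B   # x^8 + x^4 + x^3 + x + 1 (toy field, assumption)

def gf_mul_ref(a, b):
    """Reference multiplication a(x)*b(x) mod f(x), used only for checking."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= F_POLY
    return r

def classical_part(a_msb, b, steps):
    """MSB-first multiplier: returns a_msb(x)*b(x) mod f(x)."""
    res = 0
    for i in reversed(range(steps)):
        res <<= 1                  # multiply the accumulator by x
        if res >> N:
            res ^= F_POLY          # reduce modulo f(x)
        if (a_msb >> i) & 1:
            res ^= b
    return res

def montgomery_part(a_lsb, b, steps):
    """LSB-first multiplier: returns a_lsb(x)*b(x)*x^(-steps) mod f(x)."""
    res = 0
    for i in range(steps):
        if (a_lsb >> i) & 1:
            res ^= b
        if res & 1:
            res ^= F_POLY          # make the accumulator divisible by x
        res >>= 1                  # exact division by x
    return res

def combined_mul(a, b):
    """Returns a(x)*b(x)*r(x)^(-1) mod f(x) with r(x) = x^(N/2)."""
    h = N // 2
    a_msb = a >> h
    a_lsb = a & ((1 << h) - 1)
    # In hardware both parts would run concurrently for h cycles.
    return classical_part(a_msb, b, h) ^ montgomery_part(a_lsb, b, h)

R = 1 << (N // 2)        # r(x) = x^(N/2)
R2 = gf_mul_ref(R, R)    # r(x)^2 mod f(x), a precomputed constant

def from_montgomery(res_m):
    """Remove the factor r(x)^(-1) by one more combined multiplication."""
    return combined_mul(res_m, R2)
```

Since a(x) = a_MSB(x)·x^(n/2) + a_LSB(x), the XOR of the two partial results equals a(x)·b(x)·x^(−n/2) mod f(x). Each part needs only n/2 iterations, which is where the halved cycle count comes from.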
A Prototype FPGA Architecture
Our Elliptic Curve Processor (ECP) is shown in Fig. 1. The operation blocks on each level from top to bottom are as follows: The main control Finite State Machine (FSM) first commands the NtoM to convert all the inputs from normal to Montgomery representation and the AtoP to convert the coordinates from affine to projective. The NtoM and the AtoP both use the MM to perform these conversions. When all conversions are finished, the Main Controller instructs the PM block to start the point multiplication, which invokes the PD and the PA in parallel. The FSM inside the PM orchestrates these operations, and the PM writes the resulting X1 and Z1 to its output registers. Due to the parallelism of the point operations, both the PD and the PA use an MA and an MM. The next step after point multiplication is the conversion from projective to affine representation using the MI, which invokes the MM to perform the modular inversion using Fermat's little theorem [15]. Finally, the affine coordinates are converted from Montgomery to normal representation using the MM.
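The inversion step can be illustrated as follows. Fermat's little theorem gives a^(−1) = a^(2^n − 2) for nonzero a in GF(2^n), so an inverter needs nothing beyond the multiplier. The Python sketch below assumes the toy field GF(2^8) with f(x) = x^8 + x^4 + x^3 + x + 1, works in the normal (non-Montgomery) representation for readability, and uses hypothetical function names.

```python
N = 8
F_POLY = 0x11B   # x^8 + x^4 + x^3 + x + 1 (toy field, assumption)

def gf_mul(a, b):
    """Multiplication in GF(2^N) modulo F_POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= F_POLY
    return r

def gf_inv_fermat(a):
    """Inverse as a^(2^N - 2), using only multiplications and squarings."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^N)")
    r = a                              # r = a^(2^1 - 1)
    for _ in range(N - 2):
        r = gf_mul(gf_mul(r, r), a)    # a^(2^k - 1) -> a^(2^(k+1) - 1)
    return gf_mul(r, r)                # (a^(2^(N-1) - 1))^2 = a^(2^N - 2)

def to_affine(X, Z):
    """Projective-to-affine conversion x = X / Z for Z != 0."""
    return gf_mul(X, gf_inv_fermat(Z))
```

This simple chain uses N − 1 squarings and N − 2 multiplications; shorter addition chains exist, and the sketch makes no claim about which one the MI block uses.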
Results
The results of our design on a Xilinx Virtex XCV800 FPGA are given in Table 1. Table 2 presents a broader comparison with other architectures. We have included only those FPGA solutions that are either scalable [9, 10] or that are believed to be the state of the art in ECC hardware implementations. We give this comparison as evidence that a scalable and side-channel secure design can also lead to a solution that is competitive in performance.
Simple Side-Channel Resistance
Implementations of cryptographic algorithms should be resistant to side-channel attacks such as timing [16], power [17] and electromagnetic radiation [25, 8] analysis attacks. These attacks present a realistic threat for wireless applications and have been demonstrated to be very effective against smart cards without specific countermeasures. Here we discuss the ability of our implementation to withstand simple side-channel attacks, such as Simple Power Analysis (SPA). In that case the attacker can obtain some information about the secret key by observing one or a few power consumption graphs. To prevent that, cryptographic algorithms should be implemented as sequences of operations that are indistinguishable through simple side-channel analysis. Chevallier-Mames et al. define this property as side-channel atomicity [7]. According to the authors, SPA-resistant algorithms should consist of so-called side-channel atomic blocks, which are algorithm specific. In our implementation, there exist side-channel atomic blocks at different levels of the ECC hierarchy. Following the top-down approach, those are point addition/doubling and multiplication/squaring. Figure 2 shows a pattern for point addition implemented as in most of the standard references on ECC [11, 4]. In this case the addition takes 14 multiplications, which is visible from the power trace. (The standard doubling takes 10 multiplications in total.) The same situation in the light of Algorithm 2 results in two patterns that are much harder to distinguish (Figures 3 and 4). In both figures a total of 6 field multiplications can be observed.
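Why a uniform operation sequence matters for SPA can be shown with an idealized model. The Python sketch below is an assumption made for illustration, not a model of the measured traces: it reduces a power trace to a string of point operations, "A" for addition and "D" for doubling, and compares a plain left-to-right double-and-add method with a ladder that performs one addition and one doubling per key bit. The function names are hypothetical.

```python
def double_and_add_trace(k):
    """Operation sequence of left-to-right double-and-add for scalar k >= 1."""
    ops = []
    for bit in bin(k)[3:]:        # key bits after the leading 1
        ops.append("D")           # a doubling for every bit
        if bit == "1":
            ops.append("A")       # an addition only for 1-bits
    return "".join(ops)

def ladder_trace(k):
    """Operation sequence of a ladder: one addition and one doubling per bit."""
    return "".join("AD" for _ in bin(k)[3:])

# Two scalars of equal bit length.
k1, k2 = 0b1011, 0b1000
print(double_and_add_trace(k1), double_and_add_trace(k2))
print(ladder_trace(k1), ladder_trace(k2))
```

In this model the double-and-add sequences differ between the two scalars, so an observer who can tell additions from doublings can read off the key bits, whereas the ladder sequences are identical. The model ignores any leakage that depends on operand values or register addressing.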
Conclusions
In this paper a complete ECC processor for binary fields is presented, and an FPGA implementation has been described. We proposed a fully balanced point multiplication algorithm that performs the same operations in every iteration of its main loop. Furthermore, a new algorithm for field multiplication is given, which performs two separate multiplications in parallel. We believe that this so-called side-channel aware design is a first step towards side-channel resistance. However, it is clear that additional countermeasures will be required to prevent more advanced attacks.
