Abstract: This paper presents the design of a high-speed coprocessor for Elliptic Curve Cryptography over binary Galois Field (ECC-GF(2^m)). The purpose of our coprocessor is to accelerate the scalar multiplication performed over elliptic curve points represented by affine coordinates in polynomial basis. Our method uses elliptic curve parameters over GF(2^163), in accordance with international security requirements, to implement a bit-parallel coprocessor on a field-programmable gate array (FPGA). Our coprocessor performs modular inversion by using a process based on Stein's algorithm. Results are presented and compared with those of related works. We conclude that our coprocessor is suitable for comparison with any other ECC-hardware proposal, since its speed is comparable to that of projective coordinate designs.
Introduction
Elliptic Curve Cryptography (ECC) is a well-known branch of cryptography that remains incompletely explored, even though it has been studied since 1985 [38], [31]. The earliest research in this branch presented several ECC-software designs [26], [35], [24], [54], [55]. ECC-software designs are easier to develop than ECC-hardware designs, but ECC-hardware designs are often faster. Thus, ECC-hardware designs came later to meet speed requirements [15]. Nowadays, although the literature describes a significant variety of ECC-hardware designs, researchers usually face the following choice: elliptic curve points are either represented by affine coordinates in polynomial basis or converted to projective coordinates in other bases [36]. On one hand, affine coordinates in polynomial basis are suitable for hardware implementation and storage. Nevertheless, they require a modular inversion (or modular division), which is the most complex and, consequently, the slowest of the important operations used in ECC algorithms [31]. On the other hand, projective coordinates in other bases allow replacing the slow modular inversion (or division) with a number of fast multiplications. Nevertheless, they need more temporary storage space. Therefore, accurate comparisons among ECC-hardware designs depend on finding descriptions of cryptosystems for both elliptic curve point representations.
Our priority was to investigate ECC over binary finite fields (binary Galois Field, GF(2^m) [37]). We found a wide range of works describing ECC-hardware designs for which elliptic curve points are represented by projective coordinates in a variety of bases, such as normal basis [40], [53], [43], [2], [3], [11], [41], optimal normal basis [40], [1], Gaussian normal basis [48], [3], [44], [13], reordered normal basis [42], redundant representation [58], [21], [42], and type II optimal normal basis [22], [56]. We also verified that, while papers continuously and widely describe projective coordinate designs, affine coordinate designs still need more descriptions to allow more accurate comparisons among ECC-hardware designs. Although we found several works describing ECC-hardware designs for which elliptic curve points are represented by affine coordinates in polynomial basis, there is still plenty of room for research in this area. For example, we did not find any work describing a coprocessor for ECC-GF(2^m) based on affine coordinates in polynomial basis that presents a speed comparable to other ECC-hardware designs based on projective coordinates in other bases. In other words, our research identified both advantages of affine coordinate designs and gaps in the literature.
Recently published papers have shown some advantages of affine coordinate designs. For example, researchers showed that affine coordinates provide more security than projective coordinates against side-channel attacks and simple power attacks [20]. Moreover, other researchers showed that affine coordinates no longer have significant disadvantages in comparison with projective coordinates when the design uses an efficient modular inversion algorithm [50]. By studying other recently published papers, we identified gaps in ECC-hardware designs. For example, we found papers describing crypto-processors, such as [28], [32], [23], but none of these papers presents a bit-parallel design. A bit-parallel design often accelerates a cryptosystem. Bit-parallel designs have been published, such as in [49], [46], but both papers only show designs of finite field multipliers, rather than processors or coprocessors. We found other descriptions of processors in [7], [12]. The processor in [7] uses finite field multiplications instead of modular inversion or division, despite the published papers describing efficient division/inversion algorithms [57], [14], [59], [50], [12]. The processor in [12] was implemented on non-reconfigurable technology instead of a field-programmable gate array (FPGA), but ECC-hardware designs are more flexible when implemented on FPGAs than on non-reconfigurable technologies.
Since the majority of previous works describe ECC-hardware designs based on elliptic curve points represented by projective coordinates in a variety of bases, many researchers tend to assume that these designs are much faster than ECC-hardware designs based on elliptic curve points represented by affine coordinates in polynomial basis. As a result, these researchers discourage the development of ECC-hardware designs such as a bit-parallel coprocessor for standard ECC-GF(2^m) on FPGA, for which elliptic curve points are represented by affine coordinates in polynomial basis. However, this paper proposes the design of a high-speed coprocessor for ECC-GF(2^m) based on affine coordinates in polynomial basis to show that this type of ECC-hardware design presents a speed comparable to projective coordinate designs. We chose elliptic curve parameters defined in standards such as NIST, IEEE P1363, IPSec, WAP, eCheck, ANSI X9.62 and ANSI X9.63 [9], since these standards are in accordance with international security requirements. Our coprocessor speeds up modular inversion by using an efficient algorithm based on Stein's algorithm [51]. It is a bit-parallel coprocessor whose speed is comparable to that of projective coordinate designs.
The remainder of this paper is organized as follows. Section 2 presents background on ECC. Section 3 describes the design and implementation of our bit-parallel coprocessor, and Subsection 3.1 discusses its behavior. Section 4 presents our results. Finally, Section 5 presents the discussion.
Elliptic Curve Cryptography Background
A binary finite field, also called binary Galois Field (GF(2^m)), is a set of 2^m elements, each one represented by m + 1 bits [34]. Finite field arithmetic operates over these elements. The method used to perform finite field operations depends on how these elements are interpreted, i.e., on the basis representation [29].
The usual representation is the polynomial basis. For the polynomial basis {1, t, t^2, ..., t^{m−1}} of GF(2^m), an element (a_{m−1} ... a_2 a_1 a_0) represents the polynomial a_{m−1} t^{m−1} + ... + a_2 t^2 + a_1 t + a_0 ∈ GF(2^m), where a_0, a_1, a_2, ..., a_{m−1} ∈ GF(2). For this basis, the finite field operations are followed by a reduction mod p(t), where p(t) is an irreducible polynomial. Now, let x and y be any pair of elements of GF(2^m). When x and y are presented as a pair of coordinates in the form (x, y), they represent a point in affine coordinates. In other words, we say that P is represented by affine coordinates whenever P is represented by a pair of coordinates in the form P = (P_x, P_y).
For ECC-GF(2^m), the elliptic curve E is the set of all solutions (x, y) to the equation:
$$y^2 + xy = x^3 + ax^2 + b$$
where x, y, a and b are elements of GF (2 m ), and b must be nonzero, see [27] , [8] .
ECC is based on the difficulty of solving the Elliptic Curve Discrete Logarithm Problem (ECDLP) [52]. In other words, finding k, given P and Q, where Q = kP, is a computationally intractable problem for large values of k in ECC, because the scalar multiplication (Q = kP) is a one-way operation [10].
Therefore, all algorithms based on ECC-GF (2 m ) compute the point Q = kP on the elliptic curve E, where k is an integer and P and Q are points on E [39] .
For example, we can compute Q = kP by using the Double-and-Add algorithm, which scans the binary expansion of k, doubling on each bit and adding on each bit equal to '1' (skipping the most significant '1') [25].
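The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the listing from the original paper: the function name `double_and_add` and the `add`/`double` callables are hypothetical, and the group operations are passed in so that the same loop works for elliptic curve points or, as in the usage note, for plain integers.

```python
def double_and_add(k, point, add, double):
    """Left-to-right Double-and-Add: returns k * point.

    `add(p, q)` and `double(p)` are the group operations (assumed to be
    supplied by the caller). The most significant '1' of k is skipped by
    starting the accumulator at `point`.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    q = point
    # bin(k) looks like '0b1101'; drop '0b' and the leading '1'.
    for bit in bin(k)[3:]:
        q = double(q)              # one doubling per remaining bit
        if bit == "1":
            q = add(q, point)      # one addition per remaining '1' bit
    return q
```

As a sanity check, using ordinary integers as the group (so that kP is just k times P), `double_and_add(13, 5, lambda a, b: a + b, lambda a: 2 * a)` follows the bit pattern 1101 through P, 3P, 6P, 13P.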
The result of each doubling or addition is a new point on E that, here, will be named either P′ or Q. Point doublings and point additions are based on finite field arithmetic. They are composed of modular operations such as addition, multiplication, squaring and inversion [34]. We perform these operations on the coordinates of elliptic curve points. To double P′ = (P′_x, P′_y) or add the points P = (P_x, P_y) and P′ = (P′_x, P′_y) of an elliptic curve to obtain a new point Q = (Q_x, Q_y), we use a single set of equations [16], as follows:
where: a defines an elliptic curve and p represents the irreducible polynomial; P′_x and P′_y represent the coordinates of the point that will be doubled or added; P_x and P_y represent the coordinates of a standardized point P, defined by [9]; Q_x and Q_y represent the coordinates of a new point Q; F = P_x, G = 0 and H = 0 for point doublings, while F = 0, G = P_y and H = P_x for point additions [16].
For Eqs. (3)-(5), we also consider: if P_x = 0 and P_y = 0, then Q = P′; if P′_x = 0 and P′_y = 0, then Q = P; if P = P′ = 0, then Q = 0, i.e., Q is the point at infinity. The point Q will be at infinity under two other conditions: if P_x = 0, for point doublings; if P_y = P′_y, for point additions.
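For readers who want to experiment with point doubling and addition in software, the following is a minimal Python sketch. It uses the textbook affine formulas for the curve y^2 + xy = x^3 + ax^2 + b over GF(2^m), not the unified F, G, H formulation of [16], whose equations are not reproduced here. Everything concrete is an assumption made for illustration: the toy field GF(2^4) with p(t) = t^4 + t + 1, the coefficients `A` and `B`, the function names, and the use of Fermat exponentiation for the inverse.

```python
# Field elements are ints whose bits are polynomial coefficients.
M = 4              # toy field degree (assumption; the paper uses m = 163)
P_POLY = 0b10011   # t^4 + t + 1, irreducible over GF(2)
A = 0b1000         # illustrative curve coefficient a (assumption)
B = 0b1001         # illustrative curve coefficient b, nonzero (assumption)
INF = None         # point at infinity


def fmul(x, y):
    """Shift-and-xor multiplication, reduced modulo P_POLY at each step."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if (x >> M) & 1:       # degree reached M: xor the modulus
            x ^= P_POLY
    return r


def finv(x):
    """Inverse as x^(2^M - 2) (Fermat); simple, not the paper's Algorithm 1."""
    r = 1
    for _ in range(M - 1):
        x = fmul(x, x)
        r = fmul(r, x)
    return r


def on_curve(p):
    if p is INF:
        return True
    x, y = p
    x2 = fmul(x, x)
    return fmul(y, y) ^ fmul(x, y) == fmul(x2, x) ^ fmul(A, x2) ^ B


def point_double(p):
    if p is INF or p[0] == 0:          # x = 0 means the point has order 2
        return INF
    x1, y1 = p
    lam = x1 ^ fmul(y1, finv(x1))      # lambda = x1 + y1 / x1
    x3 = fmul(lam, lam) ^ lam ^ A
    y3 = fmul(x1, x1) ^ fmul(lam ^ 1, x3)
    return (x3, y3)


def point_add(p, q):
    if p is INF:
        return q
    if q is INF:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2:                       # same point, or q = -p = (x1, x1 + y1)
        return point_double(p) if y1 == y2 else INF
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = fmul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)
```

Each doubling or addition costs one field inversion plus a few multiplications and xors, which is why the speed of the inversion circuit dominates an affine design.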
In our coprocessor, we implemented each operation of the finite field arithmetic used in Eqs. (3)-(5): addition, squaring, multiplication, modular reduction and modular inversion.
Addition is performed by an ordinary xor logic operation and is represented by the operator "+". We implemented addition as follows:
Square is represented by A^2 and uses a straightforward algorithm. We perform this operation by inserting a bit '0' between each pair of adjacent bits of A. We implemented square as follows: if
Multiplication is represented by "*" and uses a simple algorithm based on a loop, in which each iteration performs a shift left followed by an xor. We implemented multiplication as follows: if A = A(t), B = B(t) ∈ GF(2^m) are given by
and
then the multiplication A * B = A(t) * B(t) is given by
Therefore, we perform the multiplication as a sequence of additions over GF (2), as follows:
Consequently, considering A(t) = (a_{m−1}, a_{m−2}, ..., a_1, a_0) and defining
we obtain A(t) * B(t) = (r_{2m−1}, r_{2m−2}, ..., r_j, ..., r_1, r_0)
with r_j =
We used VHDL [47] to describe the aforementioned operations of Eqs. (3)-(5) for our coprocessor.
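The addition, squaring and multiplication described above, together with the reduction mod p(t), can be modelled in software as follows. This is a sketch for illustration only, not the VHDL of the coprocessor: the function names are hypothetical, field elements are stored as Python integers whose bit i is the coefficient a_i, and the worked values use the toy polynomial p(t) = t^4 + t + 1 rather than the 163-bit field.

```python
def gf2_add(a, b):
    """Addition in GF(2^m): coefficient-wise xor, no carries."""
    return a ^ b


def gf2_square_raw(a):
    """Unreduced square: insert a '0' between adjacent bits (bit i -> bit 2i)."""
    r = 0
    i = 0
    while a:
        if a & 1:
            r |= 1 << (2 * i)
        a >>= 1
        i += 1
    return r


def gf2_mul_raw(a, b):
    """Unreduced product: for each set bit of b, xor in a shifted copy of a."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1        # shift left
        b >>= 1
    return r


def gf2_mod(r, p):
    """Reduce r modulo p by xoring copies of p aligned to the top bit of r."""
    deg_p = p.bit_length() - 1
    while r.bit_length() - 1 >= deg_p:
        r ^= p << (r.bit_length() - 1 - deg_p)
    return r
```

For example, with p = 0b10011, A = t^2 + t (0b0110) and B = t^2 + 1 (0b0101), the unreduced product is t^4 + t^3 + t^2 + t (0b11110), and one xor with p reduces it to t^3 + t^2 + 1 (0b1101).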
Modular inversion is the most complex and, consequently, the slowest of all finite field arithmetic operations [31]. Our coprocessor performs modular inversion by using Algorithm 1. Algorithm 1 is based on Stein's algorithm [51] and is similar to the modular division algorithm described by Wu et al. in [57]. We chose Algorithm 1 for modular inversion, since this algorithm offers high performance and a straightforward hardware implementation.
Algorithm 1. Modular Inversion (MI)
while slice > 0 2 :
if
In Algorithm 1, x and p are, respectively, the value to be inverted and the irreducible polynomial. The three operations +, /2 and *2 represent, respectively, an xor, a shift right and a shift left. More details about Algorithm 1 and about the origin of the values assigned to the variables in the first line are found in [57].
Fig. 1 shows Algorithm 1 implemented as a bit-sliced circuit. In other words, the circuit for modular inversion is composed of smaller bit-width circuits, arranged side by side, to form a longer word-length circuit. It processes one bit-field or bit-slice at a time. The chained circuits are able to process the full word length required by the ECC-based software. In this work, a bit-slice is exactly equivalent to a single iteration of Algorithm 1. Since Algorithm 1 requires at most 2m − 1 iterations to perform a modular inversion, we need to link 2m − 1 bit-slices serially to build a combinatorial circuit that performs modular inversion. Therefore, the bit-sliced circuit for modular inversion is formed by 2m − 1 bit-width circuits (bit-slices), grouped side by side, to compose a circuit able to invert a P′_x of 2m − 1 bits. The outputs of the first bit-slice are connected to the inputs of the second bit-slice; the outputs of the second bit-slice are connected to the inputs of the third bit-slice, and so on. The inputs of bit-slice 1 start as follows: Ain = P′_x, Bin = p, Uin = 1, Vin = 0, DCCin = 2, FLAGin = 1, slice = 2m − 1 [57]. Bit-slice 2m − 1 presents the modular inverse of P′_x in Vout. We implemented Algorithm 1 as a bit-sliced circuit, since it is the easiest way to implement large circuits. The circuit for modular inversion is large, since we designed it to prioritize speed over area, in order to meet the proposed high-speed requirement.
Fig. 2 shows the schematic of each bit-slice of the circuit for modular inversion. In Fig. 2, (a) shows that each bit-slice has six inputs and six outputs [57]: Ain, Bin, Uin, Vin, DCCin, Aout, Bout, Uout, Vout and DCCout (all of them m + 1 bits wide); FLAGin and FLAGout (both one bit wide). Moreover, in Fig. 2, (b), (c), (d), (e), (f) and (g) represent the logic corresponding to a single iteration of Algorithm 1.
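To give a feel for how an inversion built only from xor and shift operations works, here is a minimal Python sketch of the well-known binary inversion algorithm for GF(2^m), which follows the same shift-and-xor idea as Stein's gcd. It is explicitly not Algorithm 1 of this paper (which follows [57] and uses the DCC and FLAG variables): it is a sequential software loop, the function names are hypothetical, and the 163-bit polynomial t^163 + t^7 + t^6 + t^3 + 1 used in the usage note is the NIST reduction polynomial, assumed here for illustration.

```python
def gf2m_inv(x, p):
    """Inverse of x modulo the irreducible polynomial p (binary algorithm).

    x must be nonzero and already reduced modulo p. Only xor (+), shift
    right (/2, i.e. division by t) and comparisons are used.
    """
    if x == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    u, v = x, p
    g1, g2 = 1, 0          # invariants: g1*x = u and g2*x = v (mod p)
    while u != 1 and v != 1:
        while (u & 1) == 0:                    # t divides u
            u >>= 1
            g1 = g1 >> 1 if (g1 & 1) == 0 else (g1 ^ p) >> 1
        while (v & 1) == 0:                    # t divides v
            v >>= 1
            g2 = g2 >> 1 if (g2 & 1) == 0 else (g2 ^ p) >> 1
        if u.bit_length() > v.bit_length():    # deg(u) > deg(v)
            u ^= v
            g1 ^= g2
        else:
            v ^= u
            g2 ^= g1
    return g1 if u == 1 else g2


def gf2m_mul(a, b, p):
    """Shift-and-xor multiplication modulo p, used here only to check results."""
    m = p.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= p
    return r
```

A quick check is that `gf2m_mul(x, gf2m_inv(x, p), p)` equals 1, for instance with `p = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1` and any nonzero x below 2^163. In hardware, unrolling such a loop into chained stages is what makes a combinatorial, bit-sliced realization possible.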
Design and Implementation of the Proposed Coprocessor
Our coprocessor is a hardware unit that helps ECC-based software speed up the computation of Q = kP. It computes Q = kP at high speed, because we implemented the aforementioned finite field operations as digital circuits. We designed our coprocessor to be implemented on FPGAs and to be used on a PC-board adapter. Fig. 3 presents the basic diagram of the PC-board adapter containing our coprocessor, showing the on-board elements and the PC components that communicate with the adapter. Our coprocessor is composed of independent circuits working together. These circuits were implemented on Altera's EP2S180F1020C4 and EP2S90F1508C3 FPGAs, due to high speed and density requirements [5], [6]. The former implements the modular inversion shown in Eq. (3); the latter implements the remaining operations shown in Eqs. (3)-(5), i.e., multiplication, modular reduction, squaring and addition. Moreover, the latter implements the Double-and-Add algorithm, a random number generator (RNG) [33], general purpose registers and the logic of the bus interface. Inputs P′_x and P′_y receive P′ = (P′_x, P′_y). Outputs Q_x and Q_y deliver Q = (Q_x, Q_y). The PC-board adapter receives data through the PC data bus (PCI), which is w bits wide (w = 32 or w = 64). The data is sent to the PC-board adapter, for example, by software executing the Diffie-Hellman key-exchange model on the CPU [27].
The data received by the PC-board adapter is a point in the form P′ = (P′_x, P′_y). Each point is 2(m + 1) bits wide, i.e., m + 1 bits per coordinate of P′, where m is the degree of the finite field GF(2^m).
Thus, the point P′ = (P′_x, P′_y) arrives at the PC-board adapter fragmented into 2(m + 1)/w parts, where w is the width (in bits) of the PC data bus. The point is rebuilt and stored in the input register, from the least significant w bits to the most significant ones.
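The fragmentation and reassembly of a point over a w-bit bus can be illustrated with a short Python sketch. The function names, the packing order (x coordinate in the low half of the word) and the rounding up of 2(m + 1)/w to a whole number of bus transfers are assumptions made for this example; the text above only fixes the widths and the least-significant-first order.

```python
def split_point(px, py, m, w):
    """Pack (px, py) into 2(m+1) bits and cut it into w-bit parts, LSB first."""
    width = m + 1                           # bits per coordinate
    value = (py << width) | px              # assumed layout: x in the low half
    n_parts = (2 * width + w - 1) // w      # ceil(2(m+1)/w), an assumption
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(n_parts)]


def rebuild_point(parts, m, w):
    """Reassemble the point from w-bit parts received LSB first."""
    width = m + 1
    value = 0
    for i, part in enumerate(parts):
        value |= part << (i * w)
    mask = (1 << width) - 1
    return value & mask, (value >> width) & mask
```

With m = 163, a point occupies 328 bits, so under these assumptions it takes 11 transfers on a 32-bit bus and 6 transfers on a 64-bit bus.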
Behavior of the Proposed Coprocessor
The behavior of the PC-board adapter and the flow of data through its components are presented in Fig. 4. Fig. 4 shows that the cryptographic algorithm (Double-and-Add) generates signals to control point doublings, point additions and the flow of data. We chose Double-and-Add as the cryptographic algorithm, since its implementation in hardware is straightforward. The Double-and-Add receives the (m + 1)-bit value k from a random number generator (RNG). It analyzes each bit of k to generate the control signals k′, select and enable. The Double-and-Add performs a point doubling for k′ = '0' and a point addition for k′ = '1'. For select = '1', the point P′ = (P′_x, P′_y) goes from the input register to the feedback register; for select = '0', the PC-board adapter performs a process of feedback. The enable signals enable or disable the input of data into the registers.
When the Double-and-Add starts working, the point P′ = (P′_x, P′_y) goes from the input register to the feedback register, passing through a multiplexer between these two registers. Otherwise, a partial point Q = (Q_x, Q_y) goes from the auxiliary register to the feedback register, as explained later in this section. In either case, any point in the feedback register is named P′ = (P′_x, P′_y).
After being stored in the feedback register, the coordinates P′_x and P′_y follow different paths.
First, the coordinate P′_x goes to a combinatorial circuit to be inverted. In other words, this circuit uses the x coordinate of P′ to compute I, where I = (P′_x)^{−1} mod p. The modular inversion is part of Eq. (3). Next, the x and y coordinates of P′ and the modular inverse of P′_x (the value I) go to the circuit responsible for performing all the other operations in Eqs. (3)-(5).
By passing P′ = (P′_x, P′_y) through these two circuits, the PC-board adapter calculates a point Q = (Q_x, Q_y) that is stored in an auxiliary register. This point Q represents either a final or a partial result, depending on the step reached while processing the cryptographic algorithm.
When Q = (Q_x, Q_y) is a partial result, it goes to the feedback register through the multiplexer "start?". This feedback process is repeated several times, following the logic of the Double-and-Add, until the final point Q = (Q_x, Q_y) is found. In other words, the Double-and-Add uses the value k to determine the number of doublings and additions required to find Q = kP.
At the end of the process, the final point Q = (Q_x, Q_y) goes to the output register. From this register, Q is fragmented into 2(m + 1)/w parts and goes to the PC data bus. Thus, the software executing on the CPU finally receives the point Q = (Q_x, Q_y).
In summary, software that needs to perform a scalar multiplication is executed by the CPU of a PC. This software needs a point Q that is found by computing Q = kP. To achieve better performance, the software calls our coprocessor to perform the Q = kP operation. When the adapter starts working, the software sends the point P′ to the adapter through the PC data bus. In the adapter, the integer k is generated by a random number generator (RNG). The circuits use the integer k and the point P′ to calculate Q = kP′. At the end of the process, the adapter sends the point Q back to the software through the PC data bus. Aided by our coprocessor, this software obtains the point Q = kP significantly faster.
Results
Since we have not found any previous work describing a bit-parallel coprocessor for standard ECC-GF(2^m) on FPGA, for which elliptic curve points are represented by affine coordinates in polynomial basis, comparisons between this specific type of design and other ECC-hardware designs have not been available until now. Therefore, to compare the speed of our coprocessor with other hardware designs: first, we implemented a prototype of our coprocessor on Altera's EP2S180F1020C4 and EP2S90F1508C3 FPGAs that run at 250 MHz [5], [6]; second, we determined the number of clock cycles required to calculate different operations in our circuit; next, we searched for the computation time required to perform Q = kP for different hardware designs. Our coprocessor used the following parameters recommended by [9] to define the elliptic curve points over GF(2^163):
where a is used to define an elliptic curve, p represents the irreducible polynomial, and x and y represent the coordinates of the point P. These parameters are a single set, randomly chosen for this work, among all sets of parameters recommended by [9] for GF(2^163).
The platform used to develop and simulate the circuits of our coprocessor was Altera's Quartus II v5.0 [4]. Table 1 presents the number of clock cycles required by our coprocessor to calculate, respectively: the modular inversion operation of Eq. (3); all the other operations of Eqs. (3)-(5); either a point doubling or a point addition; the Q = kP operation (to calculate Q = kP, our coprocessor needs to perform m operations (point doublings or point additions) on average, where m is the degree of the finite field). The number of clock cycles required by our coprocessor to calculate either a point doubling or a point addition is the sum of the values presented in Column 2 and Column 3. The number of clock cycles for Q = kP is calculated by multiplying the number of clock cycles required to calculate either a point doubling or a point addition by the average number of operations used to calculate Q = kP. To achieve these results, our coprocessor uses 329 pins and occupies an area of 216,288 ALUTs (adaptive look-up tables), i.e., 270,360 equivalent LEs (logic elements), on two FPGAs that run at 250 MHz [5], [6].
Table 2 shows the computation time required to perform Q = kP for different hardware designs. These designs use elliptic curve points represented either by projective or by affine coordinates. Note that our coprocessor presents the lowest computation time required to perform Q = kP among all ECC-hardware designs presented in Table 2. Hence, the results show that it is also possible to develop a high-speed coprocessor without converting to projective coordinates in normal basis or other bases. Moreover, the results show that projective coordinate designs are no longer faster than affine coordinate designs.
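The arithmetic that relates the columns of Table 1 to a computation time can be written compactly. The symbols C_inv, C_rest, C_op, C_kP, T_kP and f_clk are names introduced here for illustration only; they stand for the cycle counts of the modular inversion, of the remaining operations, of one point operation, of the full scalar multiplication, the resulting time, and the clock frequency.

```latex
\begin{align}
C_{op} &= C_{inv} + C_{rest}
  && \text{(one point doubling or addition)} \\
C_{kP} &\approx m \cdot C_{op}
  && \text{($m$ point operations on average, } m = 163\text{)} \\
T_{kP} &= \frac{C_{kP}}{f_{clk}}
  && \text{(with } f_{clk} = 250~\text{MHz)}
\end{align}
```

At 250 MHz one clock cycle lasts 4 ns, so the time for Q = kP is 4 ns times the cycle count in the last column of Table 1.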
Discussion
In this paper, we have presented a high-speed coprocessor for ECC-GF(2^m), for which we represent elliptic curve points by affine coordinates in polynomial basis. When our coprocessor is compared to other hardware designs [53], [45], [17], [7], [12], [41], including those for which elliptic curve points are represented by projective coordinates in other bases, the results show that our coprocessor performs the scalar multiplication (Q = kP) at high speed. By using our coprocessor to accelerate the scalar multiplication performed over elliptic curve points represented by affine coordinates in polynomial basis, we show that it is also possible to develop a high-speed coprocessor without converting to projective coordinates in normal basis or other bases. Therefore, this paper shows that projective coordinate designs are no longer better than affine coordinate designs. Some recently published papers present a similar view [19], [12]. Since we have not found any other paper describing a bit-parallel coprocessor for standard ECC-GF(2^m) on FPGA, for which elliptic curve points are represented by affine coordinates in polynomial basis, our main contribution is to offer a paper that allows comparing this specific type of design with other ECC-hardware designs.
Conclusions
We presented the design and implementation of a coprocessor that reaches a speed comparable to projective coordinate designs [53], [45], [41]. Our coprocessor significantly accelerates the scalar multiplication performed over elliptic curve points represented by affine coordinates in polynomial basis. Since the majority of previous works describing ECC-hardware designs are based on elliptic curve points represented by projective coordinates in a variety of bases, researchers tend to assume that these designs are much faster than ECC-hardware designs based on elliptic curve points represented by affine coordinates in polynomial basis. However, our results show that projective coordinate designs are no longer faster than affine coordinate designs. Therefore, our coprocessor is suitable for comparison with any other ECC-hardware design. The obvious drawback of our design is the large area of our circuits. However, this large area is not a significant limitation for our design, since mobile computing was outside the scope of this project. For mobile computing, we would certainly prefer a projective coordinate design.
