Elliptic curve cryptography (ECC) plays a vital role in securely exchanging information among wireless devices. This paper presents a fast, high-performance hardware implementation of an ECC processor over the binary field GF(2^m) using a polynomial basis. A high-performance elliptic curve point multiplier (ECPM), the key operation of an ECC processor, is designed using an efficient finite-field arithmetic unit in affine coordinates. It has been implemented using the National Institute of Standards and Technology (NIST) recommended curves over the field GF(2^163). The proposed design is described in VHDL and synthesized for field-programmable gate array (FPGA) technology. The delay of ECPM on a modern Xilinx Kintex-7 (28-nm) device is 1.06 ms at 306.48 MHz. The proposed ECC processor takes a small amount of resources on the FPGA: it needs only 2253 slices and uses no DSP slices. The proposed design provides nearly 50% better delay performance than recent implementations.
INTRODUCTION
With the swift growth of mobile devices and computer applications, cryptography has become a vital tool to ensure the security of data communications and network services. Secret-key cryptography and public-key cryptography (PKC) are the two main families of cryptography, used for different data-security purposes. ECC (Miller, 1986; Koblitz, 1987) and the RSA cryptosystem (Rivest et al., 1978) are the most popular PKCs. The elliptic curve system as applied to cryptography was first proposed in the mid-1980s by Koblitz and Miller. This cryptosystem became popular because it offers security equivalent to that of traditional RSA with significantly smaller keys. For instance, 163-bit ECC provides security equivalent to 1024-bit RSA (Koblitz et al., 2000; SEC2, 2000). This feature makes ECC very popular for resource-constrained environments such as pagers, PDAs, cellular phones, and smart cards (Sutter et al., 2013). The IEEE (IEEE, 2000) and the National Institute of Standards and Technology (NIST) (NIST, 2000) have standardized elliptic curve (EC) parameters for GF(p) and GF(2^m). Certicom has published the NIST-recommended EC domain parameters in its Standards for Efficient Cryptography, SEC 2 (SEC2, 2000).
Several efficient FPGA-based ECC hardware architectures and elliptic curve cryptographic processors have been presented in the literature (Sutter et al., 2013; Chelton and Benaissa, 2008; Reaz et al., 2012; Hassan and Benaissa, 2010; Machhout et al., 2010; Ghanmy et al., 2014; Shieh et al., 2009; Smyth et al., 2006; Park and Hwang, 2005). Ghanmy et al. (2014) proposed an ECC processor over GF(2^163) on an FPGA platform for wireless sensor networks (WSN). The design of Reaz et al. (2012) can perform ECC over GF(2^131) and GF(2^163) on Altera FPGAs. Hassan and Benaissa (2010) implemented their ECC processor using the µ-coding technique on Xilinx Spartan-3 FPGAs over GF(2^131), GF(2^163), GF(2^283) and GF(2^571). A coupled FPGA/ASIC implementation of an elliptic curve crypto-processor over GF(2^163) is presented in (Machhout et al., 2010), using Xilinx Virtex Pro FPGAs and ASIC CMOS 45 nm technology as hardware platforms. Shieh et al. (2009) and Park and Hwang (2005) also proposed ECC processors over a binary field using Xilinx FPGAs. An ASIC-based ECC processor over GF(2^m) is presented in (Smyth et al., 2006).
The optimization aim is generally to reduce the latency of an ECPM in terms of the number of required clock cycles. For this, we have concentrated on efficient algorithms and mathematical reformulations for improving the finite-field arithmetic operations required for ECPM (Sutter et al., 2013; Chelton and Benaissa, 2008; Kong and Phillips, 2009; Phillips et al., 2010). The arithmetic includes operations defined in finite (Galois) fields, namely GF(p) and GF(2^m) (Hankerson et al., 2003). To the best of the authors' knowledge, there have been few high-speed hardware implementations of an ECC processor in the literature. Thus an efficient design of an ECC processor is still needed for modern cryptographic applications.
In this paper an efficient ECC processor is developed in which ECPM operations are achieved with very low area (around 2.25K slices without using any DSP slices) and low latency (almost 50% less than recent implementations). For this, efficient algorithmic reformulations of the underlying binary finite-field arithmetic and architectural optimization schemes are explored to improve the operating speed. We propose a data-flow architecture for the elliptic curve point doubling (ECPD) and elliptic curve point addition (ECPA) operations that are required for the ECC processor. Efficient field inversion and multiplication algorithms over GF(2^m) are employed to implement high-performance ECPD and ECPA. Finally, an FPGA-based high-performance hardware implementation over GF(2^163) is proposed, which is the fastest implementation in an affine coordinate system. The rest of this paper is organized as follows. Section II introduces background on groups and fields, Galois finite fields (GF(p) and GF(2^m)), and the ECC theory related to this work. Section III describes efficient finite-field algorithms over GF(2^m), elliptic curve group operations (PD and PA) and hardware architectures. An elliptic curve point multiplication algorithm and the cryptographic processor are given in Section IV. FPGA implementation results and comparisons with related designs are given in Section V. Finally, this paper is summarized in Section VI.
BACKGROUND
In this section, a brief introduction to abstract algebra, field and group theories relevant to ECC designs used in our hardware implementation is presented.
Groups and Fields
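An abelian group (G, *) consists of a set of elements G together with a binary operation * that satisfies the following properties:
1. (Associativity) a * (b * c) = (a * b) * c for all a, b, c ∈ G.
2. (Identity) There is an element e ∈ G such that a * e = e * a = a for all a ∈ G.
3. (Inverse) For each a ∈ G there is an element a^{-1} ∈ G such that a * a^{-1} = a^{-1} * a = e.
4. (Commutativity) a * b = b * a for all a, b ∈ G.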
Elliptic Curve Cryptography (ECC)
ECC is a widely used public-key technique. To encrypt data in ECC, the data are represented as a point on an elliptic curve (EC) over a Galois field. A Galois field, normally denoted GF(q) with q = p^m, is called a binary field or characteristic-two finite field if q = 2^m. An elliptic curve defined over a Galois field provides a group structure that is used to implement cryptographic systems. The group operations are EC point addition (ECPA) and EC point doubling (ECPD). There are various coordinate systems for representing elliptic curve points; they differ in the number and type of field operations required to implement PA and PD. In our work, we implement all elliptic curve operations in an affine coordinate system. A non-supersingular elliptic curve E over GF(2^m) in affine coordinates is the set of solutions to the equation

$$y^2 + xy = x^3 + ax^2 + b, \qquad (1)$$
where x, y, a, b ∈ GF(2^m) and b ≠ 0. The coefficients a, b ∈ F_{2^m} specifying an elliptic curve E(F_{2^m}) are defined by the NIST standard, and the elliptic curve is then defined by (1). The number of points on an elliptic curve E is denoted #E(F_{2^m}). It can be written as nh, where n is a prime (the order of the base point) and h is an integer called the cofactor. If P = (x_1, y_1) ∈ E and Q = (x_2, y_2) ∈ E are points on the EC, then the sum R = P + Q = (x_3, y_3) is obtained for PA and PD respectively as

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \quad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \quad \text{where } \lambda = \frac{y_1 + y_2}{x_1 + x_2} \text{ and } P \neq Q; \qquad (2)$$

$$x_3 = \lambda^2 + \lambda + a, \quad y_3 = x_1^2 + (\lambda + 1)\,x_3, \quad \text{where } \lambda = x_1 + \frac{y_1}{x_1} \text{ and } P = Q; \qquad (3)$$

where R = O (the point at infinity) when x_1 = x_2 and y_2 ≠ y_1, or when x_1 = x_2 = 0. Hence, when P ≠ Q we have the PA operation in (2), and when P = Q we have the PD operation in (3). Using these operations, EC point multiplication kP is implemented with an ECC-based algorithm (Hankerson et al., 2003; Sutter et al., 2013; Miller, 1986; Koblitz, 1987; NIST, 2000).

Prime fields GF(p) and binary fields GF(2^m) of similar size are considered to provide almost the same level of security (Koblitz et al., 2000). Table 1 compares symmetric cipher key lengths with key lengths for PKC schemes such as RSA, Diffie-Hellman (DH), and ECC (both prime and binary fields). It demonstrates that smaller field sizes can be used in ECC than in RSA and DH systems at a given security level. At a given security level, ECC private-key operations (such as signature generation and decryption) are many times more efficient than those of RSA and DH, and ECC public-key operations (such as signature verification and encryption) are many times more efficient than those of DH (Hankerson et al., 2003). This makes ECC a promising branch of public-key cryptography (Hankerson et al., 2003).
HARDWARE IMPLEMENTATION FOR FINITE FIELD
This section presents the arithmetic algorithms and operations that are needed for a hardware implementation of ECC. All parameters for the NIST elliptic curves over GF(2^163) are listed in Table 2. The irreducible polynomial for the field GF(2^163) is f(x) = x^163 + x^7 + x^6 + x^3 + 1. A modern Xilinx Kintex-7 (XC7K325T-2FFG900) FPGA, described in VHDL (VHSIC Hardware Description Language), is used for this hardware implementation. The main components in this ECC design are: polynomial-basis modular addition (field addition), field multiplication, field squaring, field inversion, and the elliptic curve group operations (PD and PA).
Polynomial Basis Representation
A polynomial basis (PB), or standard basis, is a widely used basis of an extension field in which field elements are represented as polynomials. PB is used in our hardware design for the representation of field elements. In the PB representation, the elements of F_{2^m} are the binary polynomials of degree at most m − 1, i.e.

$$F_{2^m} = \{\, u_{m-1}x^{m-1} + \dots + u_1 x + u_0 \;:\; u_i \in \{0, 1\} \,\}.$$
For instance, x^3 + x + 1 is the polynomial-basis representation of the 4-bit number 1011_2. The reduction polynomial f(x) is an irreducible binary polynomial of degree m, written as

$$f(x) = x^m + \sum_{i=0}^{m-1} g_i\, x^i,$$
where g_i ∈ {0, 1} for i = 1, . . . , m − 1 and g_0 = 1 (Hankerson et al., 2003). For example, f(x) = x^4 + x + 1 = 10011_2 is an irreducible polynomial for the finite field GF(2^4).

Table 2: NIST-recommended elliptic curves over F_{2^163}.
n = 0x4 00000000 00000000 00020108 A2E0CC0D 99F8A5EF
x = 0x2 FE13C053 7BBC11AC AA07D793 DE4E6D5E 5C94EEE8
y = 0x2 89070FB0 5D38FF58 321F2E80 0536D538 CCDAA3D9
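The bit-vector encoding used by the polynomial-basis representation can be illustrated with a short Python sketch. The helper name `poly_to_str` and the constant `F163` are illustrative assumptions, not part of the original design; the sketch only shows that bit i of an integer stands for the coefficient of x^i.

```python
def poly_to_str(u: int) -> str:
    """Render integer u as the binary polynomial whose x^i coefficient is bit i of u."""
    if u == 0:
        return "0"
    terms = []
    for i in range(u.bit_length() - 1, -1, -1):
        if (u >> i) & 1:
            terms.append("1" if i == 0 else "x" if i == 1 else f"x^{i}")
    return " + ".join(terms)

# Reduction polynomial f(x) = x^163 + x^7 + x^6 + x^3 + 1 stored as an integer.
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

print(poly_to_str(0b1011))   # x^3 + x + 1
print(poly_to_str(0b10011))  # x^4 + x + 1
print(poly_to_str(F163))     # x^163 + x^7 + x^6 + x^3 + 1
```

A field element of GF(2^163) therefore fits in a 163-bit register, while the reduction polynomial needs one extra bit for its x^163 term.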
Addition in GF(2^m)
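Addition is the simplest operation in GF(2^m). It is a bit-wise exclusive-or (XOR, ⊕) in either hardware or software. Addition in F_{2^m} can be achieved as shown in (4) (Wolkerstorfer, 2002):

$$Z(x) = U(x) + V(x) = \sum_{i=0}^{m-1} (u_i \oplus v_i)\, x^i, \qquad (4)$$

where $U(x) = \sum_{i=0}^{m-1} u_i x^i$ and $V(x) = \sum_{i=0}^{m-1} v_i x^i$ with $u_i, v_i \in \{0, 1\}$.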
The subtraction operation in GF(2^m) is the same as addition because every element is its own additive inverse: U(x) + U(x) = 0. For example, if U = 1100_2 and V = 0110_2 over the finite field GF(2^4), then Z = U + V = U ⊕ V = 1100_2 ⊕ 0110_2 = 1010_2.
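The worked example above can be reproduced in a few lines of Python. This is a minimal software sketch, with the function name `gf2m_add` chosen here for illustration; in hardware the same operation is a row of XOR gates with no carry chain.

```python
def gf2m_add(u: int, v: int) -> int:
    """Add two GF(2^m) elements held as bit vectors: coefficient-wise XOR, no carries."""
    return u ^ v

u, v = 0b1100, 0b0110
z = gf2m_add(u, v)
print(format(z, "04b"))   # 1010
print(gf2m_add(u, u))     # 0, so subtraction coincides with addition
```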
Multiplication in GF(2^m)
Polynomial multiplication in GF(2^m) with interleaved modular reduction is a well-known algorithm for hardware implementation (Wolkerstorfer, 2002). It computes the product of two polynomials and applies modular reduction as the product is accumulated, so its operation differs from ordinary integer multiplication. Multiplication in F_{2^m} can be achieved as shown in (5):

$$Z(x) = U(x) \cdot V(x) \bmod f(x) = \left( \sum_{i=0}^{m-1} u_i\, x^i\, V(x) \right) \bmod f(x). \qquad (5)$$
Multiplication by x^i can easily be calculated by a binary left shift of i positions. In Algorithm 1, each iteration checks whether the intermediate result is still an element of GF(2^m), i.e. has degree < m. A modular reduction step is only necessary if the intermediate result Z_v has degree m or higher. This condition is checked by the test Z_v(m) = 1.
Algorithm 1: Mult. in GF(2^m) with interleaved modular reduction.
The result of polynomial multiplication, Z(x) = U(x) · V(x) mod f(x), is obtained after m iterations. For example, Algorithm 1 (Wolkerstorfer, 2002), named multiplication (Mult.) in GF(2^m) with interleaved modular reduction, takes just four iterations to multiply two polynomials over GF(2^4), where the result is reduced to degree < 4 by the irreducible polynomial f(x) = x^4 + x + 1.
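As a rough software illustration of the shift, test and reduce loop described above, the following Python sketch implements a standard MSB-first shift-and-add multiplier with interleaved reduction. It is not the authors' Algorithm 1 listing; the function name `gf2m_mul` and its argument order are assumptions made for this example.

```python
def gf2m_mul(u: int, v: int, f: int, m: int) -> int:
    """Multiply u and v (both of degree < m) modulo f, where f has degree m.

    One shift, one conditional reduction and one conditional addition per
    iteration, so the loop finishes after exactly m iterations.
    """
    z = 0
    for i in range(m - 1, -1, -1):   # scan the bits of u from MSB to LSB
        z <<= 1                      # multiply the partial result by x
        if (z >> m) & 1:             # degree reached m: reduce once
            z ^= f
        if (u >> i) & 1:             # add v when the current bit of u is 1
            z ^= v
    return z

F4 = 0b10011                                   # f(x) = x^4 + x + 1
print(bin(gf2m_mul(0b1000, 0b0010, F4, 4)))    # 0b11  (x^3 * x = x^4 = x + 1)
print(gf2m_mul(0b1011, 0b0101, F4, 4))         # 1     (these two elements are inverses)
```

Because the partial result is shifted by only one position per iteration, its degree never exceeds m, which is why a single test of bit m is enough to decide whether a reduction is needed.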
Squaring in GF(2^m)
Squaring in a polynomial basis is closely related to multiplication but simpler, because U(x)^2 mod f(x) is a linear operation over GF(2). It can be computed as shown in (6):

$$Z(x) = U(x)^2 \bmod f(x) = \left( \sum_{i=0}^{m-1} u_i\, x^{2i} \right) \bmod f(x). \qquad (6)$$
Before reduction, the square Z(x) = U(x)^2 is obtained by inserting a 0 bit between consecutive bits of the binary representation of U(x), as shown in Figure 1 (Hankerson et al., 2003; Wolkerstorfer, 2002); the expanded result is then reduced modulo f(x).
Figure 1: Squaring a binary polynomial U(x).
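The bit-spreading step of Figure 1 followed by a reduction can be sketched in Python as below. The helper names `spread_bits`, `gf2m_reduce` and `gf2m_sqr` are hypothetical, and the reduction is a plain bit-serial one chosen for readability rather than the optimized logic a hardware squarer would use.

```python
def spread_bits(u: int) -> int:
    """Insert a 0 between consecutive bits of u: bit i moves to position 2i."""
    z, i = 0, 0
    while u:
        z |= (u & 1) << (2 * i)
        u >>= 1
        i += 1
    return z

def gf2m_reduce(z: int, f: int, m: int) -> int:
    """Reduce z modulo f (degree m) by cancelling set bits from the top down."""
    for i in range(z.bit_length() - 1, m - 1, -1):
        if (z >> i) & 1:
            z ^= f << (i - m)
    return z

def gf2m_sqr(u: int, f: int, m: int) -> int:
    """Square u in GF(2^m): spread the bits, then reduce."""
    return gf2m_reduce(spread_bits(u), f, m)

F4 = 0b10011                              # f(x) = x^4 + x + 1
print(bin(spread_bits(0b1011)))           # 0b1000101  (x^6 + x^2 + 1)
print(bin(gf2m_sqr(0b1011, F4, 4)))       # 0b1001     (x^3 + 1)
```

Since no partial products are formed, squaring is linear: the square of a sum equals the sum of the squares.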
Inversion in GF(2^m)
Inversion in GF(2^m) is the most expensive operation for implementing ECC over a binary field. Algorithm 2 computes the field inversion of a non-zero field element U(x) ∈ F_{2^m} using the modified Euclidean algorithm (Guo and Wang, 1998). We used this inversion algorithm for our hardware implementation because it is easy to implement on an FPGA.
Algorithm 2: Inversion in GF(2^m) with Modified Euclidean Algorithm.
The result of field inversion, Z(x) = 1/U(x) mod f(x), i.e. the multiplicative inverse of U(x), is obtained after 2m iterations (i = 1 to 2m), and the value of cnt is always equal to zero at the end of the last iteration (Guo and Wang, 1998).
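For readers who want to experiment in software, the sketch below computes a field inverse with the textbook extended Euclidean algorithm for binary polynomials. This is a swapped-in technique: it is not the fixed 2m-iteration modified Euclidean algorithm of Guo and Wang used in the hardware, and it has a data-dependent iteration count. The names `gf2m_inv` and `gf2m_mul` are assumptions made for this example.

```python
def gf2m_mul(u: int, v: int, f: int, m: int) -> int:
    """Shift-and-add multiplication with interleaved reduction (used only for checking)."""
    z = 0
    for i in range(m - 1, -1, -1):
        z <<= 1
        if (z >> m) & 1:
            z ^= f
        if (u >> i) & 1:
            z ^= v
    return z

def gf2m_inv(a: int, f: int) -> int:
    """Inverse of a nonzero element a modulo the irreducible polynomial f.

    Invariants: g1 * a = u (mod f) and g2 * a = v (mod f); when u reaches 1,
    g1 is the inverse.
    """
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    u, v, g1, g2 = a, f, 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()
        if j < 0:                        # keep deg(u) >= deg(v)
            u, v, g1, g2, j = v, u, g2, g1, -j
        u ^= v << j                      # cancel the leading term of u
        g1 ^= g2 << j
    return g1

F4 = 0b10011                             # f(x) = x^4 + x + 1
inv = gf2m_inv(0b1011, F4)
print(bin(inv))                          # 0b101
print(gf2m_mul(0b1011, inv, F4, 4))      # 1
```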
Proposed EC Group Operations
The elliptic curve group operations in GF(2^m) are the PD and PA operations. They are built from finite-field arithmetic operations such as addition, multiplication, squaring and inversion. Figures 2 and 3 show the data-flow architecture of the proposed ECPD and ECPA operations, corresponding to (3) and (2) respectively. The ECPD operation in affine coordinates requires one field inversion, five field additions, two field multiplications, and two field squarings. Similarly, the ECPA operation in affine coordinates requires one field inversion, eight field additions, two field multiplications, and one field squaring.
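The affine PD and PA formulas can be checked in software with the following Python sketch. It is only a functional model under stated assumptions, not the proposed data-flow architecture: the curve coefficients a = b = 1 (the Koblitz curve K-163) are assumed because they are not listed in the text, the base point is taken from Table 2, squarings are done with the general multiplier, and all function names are hypothetical.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1     # f(x) = x^163 + x^7 + x^6 + x^3 + 1
A, B = 1, 1                                             # ASSUMPTION: K-163 coefficients

def from_hex(groups: str) -> int:
    return int(groups.replace(" ", ""), 16)

G = (from_hex("2 FE13C053 7BBC11AC AA07D793 DE4E6D5E 5C94EEE8"),   # x from Table 2
     from_hex("2 89070FB0 5D38FF58 321F2E80 0536D538 CCDAA3D9"))   # y from Table 2

def fmul(u, v):
    """Field multiplication with interleaved reduction."""
    z = 0
    for i in range(M - 1, -1, -1):
        z <<= 1
        if (z >> M) & 1:
            z ^= F
        if (u >> i) & 1:
            z ^= v
    return z

def finv(a):
    """Field inversion by the extended Euclidean algorithm (software stand-in)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    u, v, g1, g2 = a, F, 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()
        if j < 0:
            u, v, g1, g2, j = v, u, g2, g1, -j
        u ^= v << j
        g1 ^= g2 << j
    return g1

def point_double(P):
    """PD: 1 inversion, 2 multiplications, 2 squarings, 5 additions. None is infinity."""
    if P is None or P[0] == 0:
        return None
    x1, y1 = P
    lam = x1 ^ fmul(y1, finv(x1))
    x3 = fmul(lam, lam) ^ lam ^ A
    y3 = fmul(x1, x1) ^ fmul(lam ^ 1, x3)
    return (x3, y3)

def point_add(P, Q):
    """PA for P != Q; falls back to PD when P == Q and returns None when Q == -P."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        return point_double(P) if y1 == y2 else None
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = fmul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)

def on_curve(P):
    """Check y^2 + xy = x^3 + a x^2 + b."""
    if P is None:
        return True
    x, y = P
    x2 = fmul(x, x)
    return fmul(y, y) ^ fmul(x, y) == fmul(x2, x) ^ fmul(A, x2) ^ B

P2 = point_double(G)
P3 = point_add(P2, G)
print(on_curve(G), on_curve(P2), on_curve(P3))
```

Computing 4G once as (2G + G) + G and once as 2(2G) is a simple consistency check between the PA and PD formulas.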
PROPOSED ECPM
Elliptic curve point multiplication (ECPM) is the main operation of an ECC processor and also the most computationally expensive. We have therefore designed a high-performance ECPM using efficient group operations and FFMA units. An elliptic curve cryptosystem is organized in layers: ECC protocols such as ECDH (elliptic curve Diffie-Hellman) key exchange and ECDSA (elliptic curve digital signature algorithm) at the top level, point multiplication at the second level, group operations at the third level, and field arithmetic operations at the bottom level. The basic operation of ECPM is kP, where k is a positive integer and P is a point on the elliptic curve E defined over a field F_{2^m}. The proposed ECPM architecture over GF(2^m) is presented in Figure 4. Various methods exist for implementing ECPM, including the binary method, the non-adjacent form (NAF) method, and the Montgomery method. The simplest is the left-to-right binary method (Hankerson et al., 2003), which follows the "double-and-add" concept. We present ECPM using the binary method in Algorithm 3.

Algorithm 3: Binary method (left to right) for point multiplication.
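The left-to-right binary method can be sketched in Python as follows. This is a functional model only, under the same assumptions as before: a = b = 1 (K-163) is assumed, n and the base point come from Table 2, the inversion is a software extended-Euclid stand-in, and names such as `scalar_mult` are hypothetical. The loop is not constant-time and is meant for checking results, not for deployment.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
A = 1                                   # ASSUMPTION: K-163 coefficient a (b is not needed here)

def from_hex(groups: str) -> int:
    return int(groups.replace(" ", ""), 16)

N = from_hex("4 00000000 00000000 00020108 A2E0CC0D 99F8A5EF")     # n from Table 2
G = (from_hex("2 FE13C053 7BBC11AC AA07D793 DE4E6D5E 5C94EEE8"),
     from_hex("2 89070FB0 5D38FF58 321F2E80 0536D538 CCDAA3D9"))

def fmul(u, v):
    z = 0
    for i in range(M - 1, -1, -1):
        z <<= 1
        if (z >> M) & 1:
            z ^= F
        if (u >> i) & 1:
            z ^= v
    return z

def finv(a):
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    u, v, g1, g2 = a, F, 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()
        if j < 0:
            u, v, g1, g2, j = v, u, g2, g1, -j
        u ^= v << j
        g1 ^= g2 << j
    return g1

def point_double(P):
    if P is None or P[0] == 0:
        return None                     # None stands for the point at infinity
    x1, y1 = P
    lam = x1 ^ fmul(y1, finv(x1))
    x3 = fmul(lam, lam) ^ lam ^ A
    return (x3, fmul(x1, x1) ^ fmul(lam ^ 1, x3))

def point_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        return point_double(P) if y1 == y2 else None
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return (x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1)

def scalar_mult(k, P):
    """Left-to-right binary method: double for every bit of k, add P when the bit is 1."""
    Q = None
    for bit in bin(k)[2:]:              # most significant bit first
        Q = point_double(Q)
        if bit == "1":
            Q = point_add(Q, P)
    return Q

print(scalar_mult(5, G) == point_add(scalar_mult(2, G), scalar_mult(3, G)))
print(scalar_mult(N, G) is None)        # nG should be the point at infinity
```

For an m-bit scalar the loop performs m doublings and, on average, about m/2 additions, which is why the cost of PD and PA dominates the ECPM latency.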
FPGA IMPLEMENTATION RESULTS AND PERFORMANCE ANALYSIS
This section presents the hardware implementation results of this design. We have implemented and tested our design on a modern 28-nm Xilinx Kintex-7 (XC7K325T-2FFG900) FPGA. All VHDL modules were extensively simulated using both ISim and ModelSim, and synthesized using Xilinx ISE 14.7. Table 3 shows the synthesis results of the finite-field arithmetic operations, namely field multiplication/squaring and field inversion over GF(2^163). Multiplication and squaring over GF(2^163) take the same area (FFs and LUTs), the same number of clock cycles and the same computation time. The inversion-to-multiplication ratios of clock cycles, flip-flops (FFs), and look-up tables (LUTs) are about 2, 4.45, and 5.2 respectively. A multiplication or squaring (SQ) over GF(2^163) is performed on the Xilinx Kintex-7 in 419 ns, whereas an inversion takes 757 ns. These results confirm that field inversion is the most time-consuming operation over the binary field: an inversion takes the same number of clock cycles as two multiplications and requires considerably more area.
The hardware implementation results of the proposed elliptic curve group operations are presented in Table 4. The elliptic curve group operations (PD and PA) are built from addition, multiplication, squaring and inversion over the binary finite field GF(2^m). The PA operation occupies almost double the area of the PD operation, but the number of clock cycles and the computation time are identical for both operations. The ECPM results for the NIST-recommended field GF(2^163) are shown in Table 6. We achieve a point multiplication in 1.06 ms at a frequency of 306.48 MHz on a Xilinx Kintex-7 (XC7K325T-2FFG900) FPGA. Table 5 summarizes the estimated device utilization. The implemented design over the binary field F_{2^163} takes a small amount of resources on the FPGA. The synthesis report shows that our design is area-efficient, as it occupies only 2253 slices (4% utilization of total available resources).
The hardware implementation results and performance comparisons with related cryptographic processors are listed in Table 6, which gives the operating frequency, number of clock cycles, and computation time of each design to allow a fair performance comparison. An ECC processor over GF(2^163) for wireless sensor networks (WSN) is proposed in (Ghanmy et al., 2014), and it requires 2.26 ms to perform a point multiplication. The ECC processor proposed by Reaz et al. (2012) provides a result for the field GF(2^163), and their design takes 14.9 ms to compute a point multiplication. Our design is about 14 times faster than that of Reaz et al. (2012), although the two results were obtained on different platforms. Hassan and Benaissa (2010), Machhout et al. (2010), and Shieh et al. (2009) implemented ECC processors over GF(2^163), and their designs require 2.7 ms, 2.07 ms, and 2.55 ms respectively. Our design is about 2 to 2.5 times faster than those of Hassan, Machhout, and Shieh. Park and Hwang (2005) and Smyth et al. (2006) developed ECC processors over GF(2^163) on different platforms, but their cryptographic processors require more computation time than our design. Our ECC processor over GF(2^163) takes 1.06 ms to perform a point multiplication. We have also achieved a higher frequency than the other cryptographic processors. From the comparison and performance analysis in Table 6, our ECC processor over GF(2^163) provides better performance than the others.
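The speed-up figures quoted above follow directly from the reported computation times, as the short Python calculation below shows. The dictionary layout and variable names are illustrative; the latencies and the clock frequency are the values stated in the text, and the implied cycle count is only an estimate that assumes the 1.06 ms delay was measured at 306.48 MHz.

```python
OURS_MS = 1.06          # reported ECPM delay of the proposed design
CLOCK_MHZ = 306.48      # reported operating frequency

# Point-multiplication times over GF(2^163) as quoted in the text (milliseconds).
REPORTED_MS = {
    "Ghanmy et al., 2014": 2.26,
    "Reaz et al., 2012": 14.9,
    "Hassan and Benaissa, 2010": 2.7,
    "Machhout et al., 2010": 2.07,
    "Shieh et al., 2009": 2.55,
}

speedup = {name: t / OURS_MS for name, t in REPORTED_MS.items()}
for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")

# Clock cycles implied by delay * frequency (an estimate, not a value from Table 6).
implied_cycles = OURS_MS * 1e-3 * CLOCK_MHZ * 1e6
print(f"implied clock cycles: {implied_cycles:,.0f}")
```

The ratios range from about 1.95 (Machhout et al.) to about 14.06 (Reaz et al.), and a 1.06 ms delay against the 2.07 ms of the closest design corresponds to the "nearly 50%" delay reduction claimed for the proposed processor.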
CONCLUSIONS
A high-performance ECC processor over GF(2^163) has been implemented using FPGA technology. The binary (double-and-add) point-multiplication algorithm in an affine coordinate system was used for this hardware implementation. Efficient polynomial-basis multiplication and inversion algorithms were employed to perform the elliptic curve PD and PA operations and hence to build the ECC processor. The implemented design is optimized using techniques such as balancing the PD and PA architectures, parallelizing operations, and pre-computation to obtain higher performance on an FPGA than other designs. In GF(2^163), we achieve a point multiplication in 1.06 ms at 306.48 MHz on Kintex-7 (28-nm) devices, which, to the best of our knowledge, is the fastest hardware implementation result in an affine coordinate system. The proposed design provides nearly 50% better delay performance than recent implementations. Our implemented design is also area-efficient, as it occupies only 2253 slices without using any DSP slices. Based on the overall performance analysis and comparison of different ECC processors over the binary field F_{2^163}, it can be concluded that this design provides better performance than the others in terms of both area and timing.
