In this paper, a high-performance, area-efficient hardware design for Elliptic Curve Cryptography (ECC) is presented, targeting area-constrained, high-bandwidth embedded applications. The high-speed design is implemented using a pipelined architecture with an n-bit data path over the finite field $GF(2^n)$. For the finite field operations, the implementation uses the bit-parallel recursive Karatsuba-Ofman algorithm for multiplication and the Itoh-Tsujii algorithm for inversion. A modified, efficient Montgomery ladder algorithm is used for scalar point multiplication. The pipeline registers are inserted at locations that balance the execution paths among the computing components. A memory-less finite-state machine is developed to control the finite field operations efficiently. The design has been implemented on Xilinx Virtex, Kintex and Artix FPGA devices. Over $GF(2^{163})$, it performs a single scalar multiplication in 226 clock cycles, within 0.63 µs, using 2780 slices at a 360 MHz working frequency on Virtex-7. Over $GF(2^{233})$ and $GF(2^{571})$, a scalar multiplication is computed in 327 and 674 clock cycles, within 1.05 µs and 2.32 µs, respectively. Compared with previous works, our design requires fewer clock cycles and fewer FPGA resources while operating at competitive working frequencies. The proposed design is therefore well suited to resource-constrained real-time cryptosystems such as those in online banking services, wearable smart devices and network-attached storage.
INTRODUCTION
Elliptic curve cryptography (ECC) is a public-key cryptosystem that was first proposed by Neal Koblitz and Victor Miller in the 1980s (Kocher et al., 1999), (Miller, 1985). Since then, many studies have compared its security level against other public-key cryptosystems such as El-Gamal, RSA and the Digital Signature Algorithm (DSA) (ElGamal, 1985), (Rivest et al., 1978), which are based on either the integer factorization or the discrete logarithm problem (McGrew et al., 2011). Equivalent security levels with smaller key sizes, ease of implementation and resource savings make ECC very appealing and increasingly dominant among reconfigurable hardware implementations. Moreover, ECC is well suited to resource-constrained embedded systems, since it provides the same security levels as RSA using smaller keys. ECC has been standardized by IEEE and the National Institute of Standards and Technology (NIST) as a scheme in digital signature and key agreement protocols (for Standardization (ISO), 2000).
Generally, most cryptographic algorithms are implemented on software platforms. Running such an algorithm on a general-purpose processor (e.g. a CPU) consumes most of its resources, because the computations are intensive and operate on very large operands. Moreover, a CPU is not well suited to algorithms that are parallel in nature. These issues indicate that software implementations of encryption algorithms often do not provide the required performance. Because applications are diverse, a trade-off between area, speed and power is required. Some applications, such as RFID cards, wireless sensor network nodes and cell phones, need small area and low power. Other applications, such as web servers, high-bandwidth networks and satellite broadcast, require very high throughput. To address the limitations of software implementations and meet the trade-offs of these applications, hardware platforms are used to implement cryptographic algorithms, achieving high efficiency across different applications.
The Field Programmable Gate Array (FPGA) is one of the preferred reconfigurable hardware platforms (Xilinx, 2018a); it offers flexible and customizable methods for building and evaluating different hardware implementations. Because of this, and since most previous hardware implementations have used FPGAs to evaluate their performance, the ECC hardware implementation presented in this paper has been realized on Xilinx FPGA devices (Xilinx, 2018b). Scalar point multiplication (SPM) is the main point operation in ECC cryptosystems and protocols such as Elliptic Curve Diffie-Hellman (ECDH) (Diffie and Hellman, 1976) for key agreement and Elliptic Curve Digital Signature (ECDS) for digital signatures.
SPM can be implemented over many finite fields, either prime or polynomial. Finite fields are also called Galois Fields (GF), where $GF(p)$ is a prime field and $GF(2^n)$ is a polynomial field. SPM has two point operations, point doubling and point addition. Each operation consists of finite field operations such as squaring, addition and multiplication. Figure 1 presents the hierarchical implementation of the ECC protocol. Polynomial fields are better suited to, and more efficient on, a customizable platform such as an FPGA (Wenger and Hutter, 2011a). To gain high performance in today's heavily loaded communication networks, the use of hardware accelerators for physical security has created a great demand for efficient and high-speed implementations of ECC. As a result, many FPGA implementations of ECC have been published in the literature, achieving a wide range of latencies and clock cycle counts and targeting applications that require high or low throughput. Providing high performance while using area efficiently remains a challenge in FPGA implementations of ECC.
In this paper, a high-speed, area-efficient Xilinx FPGA implementation of ECC over $GF(2^n)$ using a pipelined architecture is proposed. The main goal of our work is a high-performance design for systems with constrained resources, such as wearable smart devices, processing engines in image steganographic systems (Dalal and Juneja, 2018), (Amirtharajan, 2014) and Internet of Things (IoT) network processors. This paper is organized as follows: section 2 describes the arithmetic operations of ECC; section 3 describes our high-performance hardware implementation core for ECC over $GF(2^n)$; section 4 shows the results and comparisons; and finally, section 5 concludes this paper.
ELLIPTIC CURVE CRYPTOGRAPHY (ECC)
Elliptic Curves (ECs) are described by the so-called Weierstrass equations, which can be defined using a normal or a polynomial basis. In this paper, we work with the polynomial basis in $GF(2^n)$ for its efficiency on hardware platforms (Wenger and Hutter, 2011a). Equation 1 represents the general form of the non-singular curve over $GF(2^n)$ (Hankerson et al., 2006).

$$y^2 + xy = x^3 + ax^2 + b \qquad (1)$$
where $a, b \in GF(2^n)$ and $b \neq 0$. The set of affine points $(x, y)$ satisfying the curve, together with an identity point (the point at infinity), forms a group (Hankerson et al., 2006). There are two fundamental elliptic curve operations, point doubling and point addition. Point doubling is denoted as $P_1 = 2P_0$, where $P_1$ is $(x_1, y_1)$ and $P_0$ is $(x_0, y_0)$, while point addition is denoted as $P_2 = P_0 + P_1$, where $P_2$ is $(x_2, y_2)$ and $P_1 \neq P_0$. All points on the selected curve are represented in affine coordinates. The ECC point operations involve finite field operations such as addition, squaring, multiplication and inversion. Working in affine coordinates requires a field inversion. Because inversion is complex, projective coordinates are used to avoid it by mapping points from affine $(x, y)$ to the form $(X, Y, Z)$.

Scalar Point Multiplication (SPM) is the most important operation and dominates ECC-based cryptosystems. SPM is the process of adding a point $P$ to itself $k$ times, where $k$ is a positive integer and $P$ is a point on the selected curve. SPM builds on the underlying point operations: a series of point doublings and additions is performed by scanning the binary representation of the scalar, $k = (k_n k_{n-1} \ldots k_1 k_0)_2$. There are several methods to implement $k.P$ efficiently, including binary add-and-double left-to-right, binary add-and-double right-to-left as in the work presented in (Harb and Jarrah, 2019), Non-Adjacent Form (NAF) (Moon, 2006) and the Montgomery ladder method (Montgomery, 1987). Some SPM methods represent curve points in the default affine coordinate system, while others use alternative projective coordinate systems. Projective systems are applied to avoid the time- and resource-consuming field inversion, at the cost of an increased number of field multiplications.
Generally, projective systems tend to yield efficient ECC designs in terms of area and latency.
López-Dahab (LD) (López and Dahab, 1998) is one of the most efficient projective coordinate systems. Points in the affine system are mapped to the LD projective system, where $P$ at $(x, y)$ corresponds to $(X = x, Y = y, Z = 1)$. In any SPM, the point multiplication algorithm for performing $k.P$ must be selected first. The next step is to define the projective coordinate system for representing the points. The last step is to select the algorithms used for the finite field operations, mainly field multiplication and inversion. Figure 2 illustrates the main steps of constructing the SPM in our ECC cryptosystem. In the presented work, the Montgomery ladder algorithm is selected as the point multiplication algorithm and the LD coordinate system is used to represent the points. For the field multiplication and inversion operations, the Karatsuba-Ofman algorithm and the Itoh-Tsujii algorithm are implemented, respectively.
Finite Fields $GF(2^n)$
ECC over $GF(2^n)$ is more suitable and efficient in hardware implementations because arithmetic in polynomial fields is carry-less. In $GF(p)$, finding points on an elliptic curve requires a square root algorithm (Harb and Jarrah, 2017), (Wenger and Hutter, 2011b), while the process is much easier over $GF(2^n)$, where the points can be found using generators for polynomials. There are $2^n$ elements in $GF(2^n)$ ($2^n - 1$ of them nonzero), which can be represented as binary polynomials. For example, an element of $GF(2^n)$ has the polynomial representation $a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + \cdots + a_1 x + a_0$, where $x^i$ marks the position of the $i$th term and $a_i \in \{0, 1\}$ is the coefficient of the $i$th term. Arithmetic operations on these elements include addition, multiplication, squaring and division (i.e. inversion followed by multiplication). Adding two elements, $C = A + B$, is done with a bitwise XOR. Squaring an element, $A^2$, is computed by inserting a 0 between each pair of adjacent bits of the element. Multiplying two elements, $C = A \cdot B$, is a much harder and slower operation than addition and squaring.
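As a minimal sketch of the two cheap operations described above, assume field elements are held as Python integers whose bit $i$ is the coefficient of $x^i$. The function names are illustrative and do not come from the paper. Addition is a single XOR, and unreduced squaring spreads the bits apart:

```python
def gf2_add(a: int, b: int) -> int:
    """Add two binary polynomials: carry-less, so a plain XOR."""
    return a ^ b


def gf2_square_unreduced(a: int) -> int:
    """Square a binary polynomial by inserting a 0 between adjacent bits.

    The coefficient of x^i moves to x^(2i); no modular reduction is applied.
    """
    result = 0
    i = 0
    while a:
        if a & 1:
            result |= 1 << (2 * i)
        a >>= 1
        i += 1
    return result


a = 0b1011  # x^3 + x + 1
b = 0b0110  # x^2 + x
print(bin(gf2_add(a, b)))            # 0b1101
print(bin(gf2_square_unreduced(a)))  # 0b1000101
```

In hardware both operations are pure wiring plus XOR gates, which is why they cost far less than a field multiplication.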
The result of squaring an element or multiplying two elements has degree up to $2n - 2$ and therefore falls outside the field. To bring it back, an irreducible polynomial $f(x)$ of degree $n$ is used in a reduction (modular) step, such as $C = A^2 \bmod f(x)$ or $C = (A \cdot B) \bmod f(x)$. Inversion is the most complex operation in terms of time and resources. Obtaining the inverse of an element $A$ means finding another element $B$ that satisfies $A \cdot B \equiv 1 \bmod f(x)$. Note that $f(x)$ has a major influence on the performance of these operations. In this paper, all curves and irreducible polynomials are chosen according to the NIST recommendations (Gallagher, 2013).

[...] in such a way that an efficient number of pipelined stages are inserted. The next subsections present more details about these algorithms.
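The reduction step and the inverse condition $A \cdot B \equiv 1 \bmod f(x)$ can be illustrated on a toy field. The sketch below assumes $GF(2^4)$ with $f(x) = x^4 + x + 1$, chosen only to keep the numbers small (the paper uses the NIST fields). It finds the inverse by exhaustive search, which is not how the proposed core computes it, and all function names are hypothetical.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) product over GF(2), without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_mod(c: int, f: int) -> int:
    """Reduce c modulo f by repeatedly cancelling the leading term."""
    deg_f = f.bit_length() - 1
    while c.bit_length() - 1 >= deg_f:
        c ^= f << (c.bit_length() - 1 - deg_f)
    return c


def field_mul(a: int, b: int, f: int) -> int:
    """Two-step field multiplication: multiply, then reduce modulo f."""
    return reduce_mod(clmul(a, b), f)


F = 0b10011  # f(x) = x^4 + x + 1 (toy field, an assumption of this sketch)
A = 0b0110   # x^2 + x

# Exhaustive search for B with A * B = 1 (mod f); feasible only in a toy field.
B = next(b for b in range(1, 16) if field_mul(A, b, F) == 1)
print(bin(B))  # 0b111, i.e. x^2 + x + 1
```

By hand: $(x^2 + x)(x^2 + x + 1) = x^4 + x$, and $x^4 \equiv x + 1$, so the product reduces to 1.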
Montgomery Ladder Scalar Point Multiplication Algorithm
At present, the Montgomery ladder is one of the most popular multiplication algorithms for computing $k.P$, where $k$ is an integer. It can be implemented in both affine and projective coordinate systems. Point doubling and point addition are computed efficiently for every bit in the sequence $k = (k_{t-1} k_{t-2} \ldots k_1 k_0)$. Algorithm 1 shows the projective coordinate version of the Montgomery ladder method (López and Dahab, 1998). As shown, the algorithm performs the point operations repeatedly using only the affine $x$ coordinate, which reduces the number of field multiplications. The $y$ coordinate is used only in the post-processing step, which recovers the affine coordinates from the LD coordinates. This algorithm has been used in many high-performance hardware implementations (Ansari and Hasan, 2008), (Roy et al., 2013), (Mahdizadeh and Masoumi, 2013) because of its speed, its capacity for parallelism, its suitability for resource-constrained systems and its resistance to power analysis.
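The x-only ladder in LD projective coordinates can be sketched in software as follows. This is an illustration under stated assumptions, not the authors' hardware design: it uses a toy field $GF(2^7)$ with $f(x) = x^7 + x + 1$, a hypothetical curve with $a = b = 1$, Fermat-style inversion instead of Itoh-Tsujii, and it omits the $y$-coordinate recovery. The ladder starts from the point at infinity $(X_1, Z_1) = (1, 0)$ and $P = (x, 1)$, and is cross-checked against plain affine double-and-add.

```python
M = 7                      # toy field size (assumption; the paper targets n = 163, 233, 571)
F = (1 << 7) | 0b11        # f(x) = x^7 + x + 1, irreducible over GF(2)
A_COEF, B_COEF = 1, 1      # hypothetical curve y^2 + xy = x^3 + x^2 + 1

def mul(a, b):
    """Multiply in GF(2^M) by shift-and-XOR with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:         # degree reached M: reduce by f(x)
            a ^= F
    return r

def sqr(a):
    return mul(a, a)

def inv(a):
    """a^(2^M - 2) by repeated squaring (plain Fermat inversion, not Itoh-Tsujii)."""
    r = 1
    for _ in range(M - 1):
        a = sqr(a)
        r = mul(r, a)
    return r

def ladder_x(k, x, b=B_COEF):
    """x-coordinate of k.P from x(P) only; returns None for the point at infinity."""
    X1, Z1, X2, Z2 = 1, 0, x, 1            # (infinity, P)
    for i in reversed(range(k.bit_length())):
        bit = (k >> i) & 1
        t1, t2 = mul(X1, Z2), mul(X2, Z1)
        Zs = sqr(t1 ^ t2)                  # Madd: Z = (X1.Z2 + X2.Z1)^2
        Xs = mul(x, Zs) ^ mul(t1, t2)      #       X = x.Z + (X1.Z2)(X2.Z1)
        Xd, Zd = (X2, Z2) if bit else (X1, Z1)
        # Mdouble: X = X^4 + b.Z^4, Z = X^2.Z^2 (right-hand side uses the old values)
        Xd, Zd = sqr(sqr(Xd)) ^ mul(b, sqr(sqr(Zd))), mul(sqr(Xd), sqr(Zd))
        if bit:
            X1, Z1, X2, Z2 = Xs, Zs, Xd, Zd
        else:
            X1, Z1, X2, Z2 = Xd, Zd, Xs, Zs
    return None if Z1 == 0 else mul(X1, inv(Z1))

def affine_add(P, Q, a=A_COEF):
    """Affine group law, used only as a cross-check; None is the identity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 ^ y2 == x1:                  # Q = -P (also covers doubling a 2-torsion point)
            return None
        lam = x1 ^ mul(y1, inv(x1))        # doubling slope
        x3 = sqr(lam) ^ lam ^ a
    else:
        lam = mul(y1 ^ y2, inv(x1 ^ x2))   # addition slope
        x3 = sqr(lam) ^ lam ^ x1 ^ x2 ^ a
    return (x3, mul(lam, x1 ^ x3) ^ x3 ^ y1)

def affine_mul(k, P):
    """Left-to-right binary double-and-add in affine coordinates."""
    R = None
    for i in reversed(range(k.bit_length())):
        R = affine_add(R, R)
        if (k >> i) & 1:
            R = affine_add(R, P)
    return R

def find_point():
    """Brute-force search for an affine point with x != 0 on the toy curve."""
    for x in range(1, 1 << M):
        rhs = mul(sqr(x), x) ^ mul(A_COEF, sqr(x)) ^ B_COEF
        for y in range(1 << M):
            if sqr(y) ^ mul(x, y) == rhs:
                return (x, y)

P = find_point()
for k in range(1, 20):
    Q = affine_mul(k, P)
    assert ladder_x(k, P[0]) == (None if Q is None else Q[0])
```

Every loop iteration performs one addition and one doubling regardless of the key bit; only the register assignment changes. This regular structure is what makes the ladder attractive for parallel hardware and for resistance to power analysis.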
Area-constrained and High-performance Tradeoff
High-performance ECC hardware implementations are achieved by optimizing the critical path, the required resources and the number of clock cycles needed for a single SPM. Architectural pipelining shortens a long critical path by breaking it into stages. The number of clock cycles can be reduced by performing field multiplications in parallel. All of these optimizations affect the resources required for the SPM. In pipelining, inserting registers to minimize the critical path delay increases both the number of clock cycles (latency) and the resources. An efficient pipelined design is obtained by choosing the number of stages carefully: more stages yield higher working frequencies but also higher latencies. This trade-off can be balanced by using an efficient field multiplier, exploiting the independence among point operations, and using finite-state machines that control these operations effectively.
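The trade-off can be made concrete with the usual relation between execution time, cycle count and clock frequency. The symbols $T_{\mathrm{SPM}}$, $N_{\mathrm{cycles}}$ and $f_{\mathrm{clk}}$ are introduced here for illustration only. With the figures reported in this paper for $GF(2^{163})$ on Virtex-7:

$$T_{\mathrm{SPM}} = \frac{N_{\mathrm{cycles}}}{f_{\mathrm{clk}}}, \qquad \frac{226}{360 \times 10^{6}\ \mathrm{Hz}} \approx 0.63\ \mu\mathrm{s}.$$

An extra pipeline stage pays off only if the relative gain in $f_{\mathrm{clk}}$ exceeds the relative increase in $N_{\mathrm{cycles}}$.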
Algorithm 1: Montgomery Ladder Scalar Multiplier (k.P).

Proposed High-speed Area-efficient ECC Core over $GF(2^n)$

Algorithm 1 states that a single point multiplication covers three stages: 1) map the affine coordinates to LD coordinates, 2) perform point doubling and point addition repeatedly, and 3) recover the point from LD to affine coordinates. Each iteration performs the same point operations, swapping the input and output registers (the X and Z registers) depending on whether the current bit $k_i$ is 1 or 0. From this regular structure, the authors in (Mahdizadeh and Masoumi, 2013) noticed that the initialization step of Algorithm 1 can be merged into the main loop of the algorithm. Before the main loop, the registers are set to $X_1 = 1$, $Z_1 = 0$, $X_2 = x$, $Z_2 = 1$. This eliminates the need to precompute values before starting the main loop. However, it requires extra clock cycles for the merged initialization step. The design in (Ansari and Hasan, 2008) [...]

Input: $k = (k_{t-1} k_{t-2} \ldots k_1 k_0)$, point $P(x, y)$ on the elliptic curve.
The proposed high-speed, area-efficient hardware design is shown in Figure 3. The general architecture has three units: an optimized finite-state machine control unit, a $GF(2^n)$ arithmetic unit containing the Karatsuba-Ofman multiplier and the squarer, and a control signal unit. Further details are given next for these main units of the high-speed ECC core.

[...] of the proposed high-speed core. As shown, the main loop of the merged-improved Montgomery ladder SPM starts at state 0 and ends at state 11. At state 11, the condition (the second if-statement in Algorithm 2) decides whether the registers must be swapped or control returns to state 0 for the next $k_i$. The swap is done in a routine that runs from state 53 to state 57. Once $i$ is equal to zero, the results are ready to be mapped back to affine coordinates. The Itoh-Tsujii inversion algorithm is used for this, running from state 12 to state 52. At state 52, a done signal is asserted to indicate that the mapping is complete and the affine coordinates are available.
Computation Schedule for Montgomery Ladder SPM
In this subsection, we introduce a new efficient schedule for performing a single scalar point multiplication based on the merged-improved version shown in Algorithm 2. A schedule free of idle cycles is achieved by performing point doubling and point addition with as few dependencies as possible. This is done by parallelizing the field operations of the point operations, such as squaring and multiplying the same or different operands simultaneously. Figure 5 illustrates our new schedule, where 8 registers are used to perform a single loop iteration of Algorithm 2. Registers A and B are the operands connected to the field multiplier, while register C is the operand of the field squarer. Register T is a temporary register that holds intermediate values. This zero-idle schedule performs a single SPM iteration in 14 clock cycles using the three-stage pipelined Karatsuba-Ofman multiplier, where 3 clock cycles are required to obtain a field multiplication result and one clock cycle is required for a squaring. Four consecutive field multiplications and squarings are performed independently and simultaneously from clock 1 to clock 4. Each squaring result is stored in register T.
At clock 5, the result of the first multiplication $(X_1 \cdot Z_2)$ is obtained; it is multiplied with the next multiplication result $(X_2 \cdot Z_1)$ at clock 6, and the two are added at clock 7. The result of the third multiplication $(T^2 \cdot Z_2^2)$ is stored in $Z_2$ at clock 7. The result of the fourth multiplication $(b \cdot Z_2^4)$ is obtained at clock 8 and added to register T, which contains $X_2^4$. The result of $(X_1 \cdot X_2 \cdot T \cdot Z_2)$ is obtained at clock 10 and is added to the result of $(x \cdot Z_1)$ at clock 13. The total number of clock cycles for a single SPM consists of the initialization step, the SPM process and the LD-to-affine routine. In our proposed ECC core, there are 9 clock cycles for the initialization step, $(\lfloor \log_2 k \rfloor + 1) \times 14$ clock cycles for the SPM process, and $(9 \mid 10 \mid 13) \times 3 + (162 \mid 232 \mid 570)$ clock cycles for the LD-to-affine routine over $GF(2^{163})$, $GF(2^{233})$ and $GF(2^{571})$, respectively.
For example, if the scalar $k$ is 10 and the field is $GF(2^{163})$, then the total number of clock cycles for a single SPM is $9 + (\lfloor \log_2 10 \rfloor + 1) \times 14 + 9 \times 3 + 162 = 254$ clock cycles. Note that the Itoh-Tsujii algorithm requires $n - 1$ squarings and $\lfloor \log_2(n-1) \rfloor + HW(n-1) - 1$ multiplications, where $HW$ is the Hamming weight of the integer $n - 1$. For example, in $GF(2^{233})$ it requires 232 squarings and $\lfloor \log_2 232 \rfloor + HW(232) - 1 = 7 + 4 - 1 = 10$ multiplications.
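The operation counts above can be checked with a short software sketch of Itoh-Tsujii inversion that counts its own squarings and multiplications. It assumes the NIST reduction polynomial $x^{163} + x^7 + x^6 + x^3 + 1$ for the 163-bit field and a simple shift-and-XOR multiplier. The function names and the test element are hypothetical, and the hardware core schedules these operations differently.

```python
N = 163
# NIST reduction polynomial for the 163-bit binary field: x^163 + x^7 + x^6 + x^3 + 1
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def mul(a, b):
    """Multiply in GF(2^N) by shift-and-XOR with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N:
            a ^= F
    return r

def itoh_tsujii(a, n=N):
    """Return (a^-1, squarings, multiplications) using the chain on n - 1."""
    squarings = mults = 0
    beta, k = a, 1                       # invariant: beta = a^(2^k - 1)
    for bit in bin(n - 1)[3:]:           # bits of n - 1 after the leading 1
        t = beta
        for _ in range(k):               # t = beta^(2^k)
            t = mul(t, t)
            squarings += 1
        beta = mul(t, beta)              # beta = a^(2^(2k) - 1)
        mults += 1
        k *= 2
        if bit == "1":
            beta = mul(mul(beta, beta), a)   # beta = a^(2^(k+1) - 1)
            squarings += 1
            mults += 1
            k += 1
    return mul(beta, beta), squarings + 1, mults   # final squaring gives a^(2^n - 2)

def it_mults(n):
    """floor(log2(n-1)) + HW(n-1) - 1 multiplications, as stated in the text."""
    return (n - 1).bit_length() - 1 + bin(n - 1).count("1") - 1

def total_cycles(k, n):
    """9 init cycles + 14 per scalar bit + 3 per inversion multiplication + (n-1) squarings."""
    return 9 + k.bit_length() * 14 + it_mults(n) * 3 + (n - 1)

a = 0x1234567                            # arbitrary nonzero field element
a_inv, n_sq, n_mul = itoh_tsujii(a)
print(mul(a, a_inv), n_sq, n_mul)        # 1 162 9
print(total_cycles(10, 163))             # 254
```

The counted values (162 squarings, 9 multiplications) match the $(9 \mid 10 \mid 13)$ and $(162 \mid 232 \mid 570)$ terms of the cycle formula for $n = 163$.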
Karatsuba-Ofman Field Multiplier
Field multiplication consists of two steps: first compute $C' = A \cdot B$, then reduce $C'$ with the modular operation $C = C' \bmod f(x)$. Multipliers of this kind are called two-step classic field multipliers. The interleaved field multiplier is one of the classical field multipliers (Großschädl, 2001); it applies the two steps as shift and add operations over several iterations. Interleaved multipliers use few resources, which makes them very attractive for resource-constrained systems. However, this type of multiplier has a very long critical path due to the dependency between iterations. The Karatsuba-Ofman field multiplication is a recursive algorithm that performs polynomial $GF(2^n)$ multiplications in large finite fields efficiently (Karatsuba and Ofman, 1962). It is defined as follows. Let $A$ and $B$ be two arbitrary elements of $GF(2^n)$. The result $C' = A \cdot B$ is a polynomial of degree at most $2n - 2$. Both $A$ and $B$ can be split into two parts:

$$A = x^{n/2}\left(x^{n/2-1} a_{n-1} + \cdots + a_{n/2}\right) + \left(x^{n/2-1} a_{n/2-1} + \cdots + a_0\right) = x^{n/2} A^H + A^L$$

$$B = x^{n/2}\left(x^{n/2-1} b_{n-1} + \cdots + b_{n/2}\right) + \left(x^{n/2-1} b_{n/2-1} + \cdots + b_0\right) = x^{n/2} B^H + B^L$$

The polynomial of the product $C'$ is:

$$C' = x^{n} A^H B^H + x^{n/2}\left(A^H B^L + A^L B^H\right) + A^L B^L$$

The sub-products are defined as auxiliary polynomials as follows:

$$D_0 = A^L B^L, \qquad D_1 = \left(A^L + A^H\right)\left(B^L + B^H\right), \qquad D_2 = A^H B^H$$

Then the product $C'$ can be obtained by:

$$C' = x^{n} D_2 + x^{n/2}\left(D_0 + D_1 + D_2\right) + D_0$$
This field multiplication becomes recursive if the auxiliary polynomials are split again, generating new auxiliaries. More recursion levels increase the delay of the Karatsuba-Ofman multiplier (Peter and Langendörfer, 2007). The recursion therefore ends after a threshold of $q$ splits, where a classical field multiplier takes over. The number of splits is optimal when it balances utilized area against delay. The work in (Zhou et al., 2010) discusses this trade-off in detail. The best splits for the $GF(2^{163})$, $GF(2^{233})$ and $GF(2^{571})$ fields are shown in Figure 6. The optimum split depends on the FPGA technology used, in terms of its lookup tables (LUTs) (Zhou et al., 2010).
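A minimal software sketch of the recursion with a threshold is shown below, assuming binary polynomials stored as Python integers. The `threshold` parameter and both function names are hypothetical. The default of 21 is chosen so that a 163-bit multiplication stops at 21-bit and 20-bit classic multipliers. The sketch produces the unreduced product $C'$ only; the reduction modulo $f(x)$ is a separate step.

```python
import random

def classic_mul(a, b):
    """Schoolbook carry-less product of two binary polynomials (no reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba(a, b, n, threshold=21):
    """Unreduced product of two n-bit binary polynomials via three half-size products."""
    if n <= threshold:
        return classic_mul(a, b)          # recursion ends with a classic multiplier
    h = (n + 1) // 2                      # low part gets ceil(n/2) bits
    mask = (1 << h) - 1
    a_lo, a_hi = a & mask, a >> h
    b_lo, b_hi = b & mask, b >> h
    d0 = karatsuba(a_lo, b_lo, h, threshold)                  # low  x low
    d2 = karatsuba(a_hi, b_hi, n - h, threshold)              # high x high
    d1 = karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, h, threshold)    # (low+high) x (low+high)
    # In characteristic 2 subtraction is XOR, so the middle term is d0 + d1 + d2.
    return (d2 << (2 * h)) ^ ((d0 ^ d1 ^ d2) << h) ^ d0

rng = random.Random(0)                    # fixed seed so the check is repeatable
a = rng.getrandbits(163)
b = rng.getrandbits(163)
assert karatsuba(a, b, 163) == classic_mul(a, b)
```

Each level replaces four half-size products with three, at the cost of extra XOR layers. Those extra layers are the source of the added delay discussed above.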
Two LUT technologies have been released: 4-input LUTs (older FPGA devices) and 6-input LUTs (newer FPGA devices) (Percey, 2007), (Specification, 2006). For $GF(2^{163})$, the first recursion yields three auxiliaries: two 82-bit multipliers and one 81-bit multiplier. The second recursion yields two 41-bit multipliers and one 40-bit multiplier for the 81-bit multiplier, and three 41-bit multipliers for each 82-bit multiplier. The third recursion yields two 21-bit multipliers and one 20-bit multiplier for each 41-bit multiplier, and three 20-bit multipliers for each 40-bit multiplier. The multiplier used after the recursive splits is a single-step (no modular operation) classic multiplier, which is used for all three fields. Figure 7 shows the logic gate implementation of the classic multiplier. The critical path of the Karatsuba-Ofman multiplier is long because of its recursive hierarchy. An architectural improvement such as inserting pipeline registers between recursive splits shortens the critical path and provides a higher working frequency.
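The sub-multiplier widths listed for $GF(2^{163})$ follow from splitting each operand into a low part of $\lceil n/2 \rceil$ bits and a high part of $\lfloor n/2 \rfloor$ bits. The small helper below reproduces them; the name `split` is hypothetical and the splitting rule is an assumption inferred from the widths quoted in the text.

```python
def split(n):
    """Operand widths of the three sub-multipliers produced by one split.

    Returns (low product, high product, middle product); the middle operand
    is the XOR of the low and high parts, so it is as wide as the low part.
    """
    lo = (n + 1) // 2
    hi = n - lo
    return (lo, hi, lo)

for width in (163, 82, 81, 41, 40):
    print(width, "->", split(width))
# 163 -> (82, 81, 82)
# 82 -> (41, 41, 41)
# 81 -> (41, 40, 41)
# 41 -> (21, 20, 21)
# 40 -> (20, 20, 20)
```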
An efficient pipelined Karatsuba-Ofman multiplier is achieved when the critical path is as short as possible. On an FPGA, the shortest critical path implies that the delay-to-area ratio is at its minimum in both time and utilized area. To find this point, different numbers of pipeline stages have been inserted in the Karatsuba-Ofman multiplier. As shown in Figure 8, the best balance in the delay-to-area ratio over $GF(2^{163})$ is achieved by inserting exactly three pipeline stages, where 2121 slices are used and the maximum delay between two registers is 3.4 ns. The first pipeline stage is inserted after the classic multiplier, while the second stage is located after combining all 40-bit, 41-bit, 81-bit and 82-bit recursive splits of the Karatsuba-Ofman multiplier. Note that the FPGA technology used in Figure 8 is based on 6-input LUTs.
FPGA IMPLEMENTATION: RESULTS AND COMPARISONS
The pipelined architecture is applied to the proposed ECC core to obtain higher speed with efficient area utilization, in terms of both working frequency and slices. Elliptic curve point doubling and point addition are performed using a merged-improved Montgomery ladder scalar point multiplication algorithm. The proposed ECC core does not require any precomputed values or memories for its calculations, which yields an efficient design with fewer slices and clock cycles. An effective FSM is developed to control the main ECC components, using a minimum number of states to perform the merged-improved Montgomery ladder SPM. To verify the performance of the proposed SPM, our high-speed ECC core is implemented over three finite fields, $GF(2^{163})$, $GF(2^{233})$ and $GF(2^{571})$, using several FPGA devices provided by Xilinx (Przybus, 2010). The Virtex-5, Virtex-7, Kintex-7 and Artix-7 FPGA families are used to implement the high-speed core.
The high-speed core has been synthesized, placed and routed using the Xilinx ISE 14.4 design suite (Xilinx, 2012). The optimization goal has been set to the balanced strategy. A timing constraint is applied for all results to achieve a better area-speed ratio with zero timing errors. Table 1 presents the place-and-route results for the proposed high-speed ECC core. Table 2 compares our design with previous ECC hardware implementations. The efficiency is defined as follows:
The proposed high-speed core achieves higher speed on both Virtex-7 and Kintex-7 FPGA devices, since they are fabricated and optimized in 28 nm technology (Przybus, 2010). The Artix-7 device consumes fewer resources, which makes it well suited to battery-powered cell phones, automotive systems, commercial digital cameras and IP cores of SoCs. Figure 9 compares the execution time of our proposed high-speed core with that of other designs. As seen in Figure 9, our design has the lowest execution time for performing a single successful SPM. As shown in Table 2, the efficiency of our proposed high-speed core exceeds that of the other ECC hardware implementations. The design in (Rashidi et al., 2016) reaches higher frequencies but consumes about twice the resources of the proposed core over $GF(2^{163})$ on the same FPGA device.
The area-efficient hardware implementation in (Khan and Benaissa, 2015) consumes less area than the proposed core but requires 95% more clock cycles. This large number of clock cycles comes from writing and reading the distributed RAM-based memory and from register shift operations. In (Li and Li, 2016), a pipelined architecture is applied to the SPM, which achieves high performance in terms of area and working frequency. However, it requires 84% more clock cycles than our proposed core on the same device and field. Our high-speed core achieves better efficiency than the design in (Sutter et al., 2013): it uses 88% fewer clock cycles, a 33% higher frequency and 44% fewer slices. On the Kintex-7 FPGA device, the design in (Hossain et al., 2015) uses 39% fewer slices than the proposed core over $GF(2^{233})$, although it requires a large number of clock cycles for a single SPM. In (Hossain et al., 2015), an iterative architecture is adopted for all main SPM operations: the binary (left-to-right) algorithm is used for scalar multiplication, an interleaved field multiplier is implemented for the multiplication operations, and a modified Extended Euclidean algorithm is applied for the inversion operation.
To sum up the comparisons, shift registers, segmented multipliers and memory-based implementations result in large latency (clock cycles), as in (Khan and Benaissa, 2015) and (Sutter et al., 2013). An iterative architecture is not an efficient way to achieve higher speeds, as the work in (Hossain et al., 2015) shows. A pipelined architecture is more practical for achieving higher performance while maintaining the balance between speed and area, as in (Rashidi et al., 2016) and (Li and Li, 2016).
The balance in the speed-area ratio is achieved when the optimal number of pipeline stages is inserted. Our high-performance ECC core uses few slices with small latencies and high working frequencies. This makes the high-speed, area-efficient ECC core very suitable for different kinds of real-time embedded systems such as cellphone banking services, health-care monitoring using smart watches, and remote access to office networks and storage devices while abroad.
CONCLUSIONS AND FUTURE WORK
In this paper, a high-speed, area-efficient ECC core over $GF(2^n)$ is proposed. Xilinx FPGA devices are used to implement the core, where a pipelined architecture is applied to achieve higher working frequencies. A merged-improved Montgomery ladder scalar point method is developed for performing scalar multiplications ($kP$). The Karatsuba-Ofman algorithm is used for the field multiplication operation. The Itoh-Tsujii method is applied for mapping the LD coordinates back to affine coordinates. Over $GF(2^{163})$, a single scalar multiplication can be done in 0.63 µs at a 360 MHz working frequency on Virtex-7 FPGA devices using 2780 slices, which is the fastest area-efficient result among the compared hardware implementations. The proposed ECC core was developed and evaluated using Xilinx ISE 14.4. Place-and-route results show that our ECC core provides the best performance in terms of latency and utilized area compared to the other existing designs. The proposed ECC core is suitable for platforms that require efficiency in terms of area and speed. Such platforms include public-key cryptosystems, for example key exchange agreements in ECDH and certificate signing in ECDS. In addition, our proposed ECC core can be integrated into applications that have a security layer, such as image steganographic engines. Designing a separable, secure image steganographic cryptosystem is our next step.
