Abstract-In this paper, a novel high-speed elliptic curve cryptography (ECC) processor implementation for point multiplication (PM) on field-programmable gate array (FPGA) is proposed. A new segmented pipelined full-precision multiplier is used to reduce the latency, and the Lopez-Dahab Montgomery PM algorithm is modified with careful scheduling to avoid data dependency, resulting in a drastic reduction in the number of clock cycles (CCs) required. The proposed ECC architecture has been implemented on the Xilinx Virtex4, Virtex5, and Virtex7 FPGA families. To the best of our knowledge, our single- and three-multiplier-based designs each show the fastest performance to date when compared with reported works. Our one-multiplier-based ECC processor also achieves the highest reported speed together with the best reported area-time performance on Virtex4 (5.32 µs at 210 MHz), on Virtex5 (4.91 µs at 228 MHz), and on the more advanced Virtex7 (3.18 µs at 352 MHz). Finally, the proposed three-multiplier-based ECC implementation reports the lowest number of CCs and the fastest ECC processor design on FPGA (450 CCs, giving 2.83 µs on Virtex7).
the reduction of latency [number of clock cycles (CCs)] of a point multiplication (PM). To achieve low latency for a PM, these works adopted either parallel multipliers or large-size multipliers at the expense of additional area complexity; pipelining stages are also often used to increase clock frequency at the expense of a few extra CCs and some area overhead [10], [12]. In addition, the pipelining stages in the multipliers create idle cycles at the PM level if there is data dependency between the instructions. As a result, careful scheduling is required to take full advantage of pipelining. Indeed, Khan and Benaissa [24], [25] have recently reported the highest throughput and highest speed ECC designs on FPGA using novel digit-serial and bit-parallel multipliers together with efficient scheduling and pipelining techniques.
In this paper, we extend [24] and [25] to make two important contributions to the state of the art. First, to the best of our knowledge, we present the fastest ECC design on FPGA to date, which, crucially, also has the best area-time metric. Second, we report an even faster ECC processor design with the lowest latency (in CCs) reported to date, which reaches the theoretical performance limit. These are achieved via a novel pipelining technique that enables high clock frequencies to be attained and via a thorough investigation of the different combinations of field multipliers to evaluate the performance limits for high-speed applications. The key contributions are listed in the following.
1) A full-precision $GF(2^m)$ multiplier with segmented pipelining to reduce both latency and area.
2) A one-multiplier-based architecture for the ECC processor design targeted at high performance but with low area (fastest ECC processor with best area and time complexities).
3) A three-multiplier-based architecture for the ECC processor design aimed at the highest possible speed.
4) A modified Montgomery PM algorithm to avoid extra latency due to our two-stage pipelining in the field multiplier, and the use of careful PM scheduling to reduce latency.
5) A pipelined Moore finite state machine (FSM)-based control unit designed to avoid data dependency in the arithmetic operations by introducing an extra cycle delay.
6) Data are tapped from different pipeline stages to localize some arithmetic operations and avoid memory input-output operations.
7) A repeated square-over-square circuit (capable of performing a four-square or quad-square operation in a single CC) to reduce latency for the multiplicative inversion operation based on the Itoh-Tsujii algorithm [9].
8) Finally, we use Xilinx ISE timing closure techniques to achieve the best possible high-performance results.

The rest of this paper is organized as follows. Section II presents the background of ECC and associated arithmetic operations over $GF(2^m)$; full-precision multiplication is also discussed in this section. Our proposed full-precision $GF(2^m)$ multiplier is presented in Section III. Sections IV and V cover the proposed ECC processor architectures. In Section VI, the implementation results are presented and compared with the state of the art. Finally, this paper is concluded in Section VII.
II. ECC BACKGROUND AND ITS ARITHMETIC OVER $GF(2^m)$

A. Scalar Point Multiplication
The main operation in ECC is scalar PM, $Q = kP$, where $k$ is a private key, $Q$ is a public key, and $P$ is a base point on an elliptic curve, $E$. The public key $Q$ is computed by adding $P$ to itself so that $k$ copies of $P$ are summed:

$$Q = kP = \underbrace{P + \cdots + P + P}_{k\ \text{times}}. \tag{1}$$
The private key $k$ is difficult to retrieve from knowledge of $Q$ and $P$.
An elliptic curve $E$ over $GF(2^m)$ can be defined as

$$y^2 + xy = x^3 + ax^2 + b \tag{2}$$

where $a, b \in GF(2^m)$, $b \neq 0$, and $\theta$ is the point at infinity such that $P_1 + \theta = P_1$, where $P_1 = (x_1, y_1)$ and $x_1, y_1 \in GF(2^m)$. The PM $kP$ is achieved using scalar PM algorithms that apply point addition and point doubling depending on the $i$th bit of $k$, $k_i$ [4]. Scalar PM can be based on affine coordinates or on projective coordinates. Because of the expensive inversion operation involved in affine coordinates-based algorithms, projective coordinates-based PM is a more common choice for ECC hardware implementation. In this paper, the Lopez-Dahab (LD) Montgomery PM is considered. This algorithm requires six field multiplications, five field squaring operations, and four addition operations as shown in Algorithm 1 [6]. The LD algorithm is generally faster to implement, and leads to improved parallelism and resistance to side-channel power attacks.
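As an illustration of the operation counts quoted above, the following is a minimal software sketch of an x-only LD Montgomery ladder over $GF(2^{163})$. It uses the standard LD addition and doubling formulas and is not a transcription of the paper's Algorithm 1, whose exact operation order is not reproduced here. The helper names (`mul`, `sqr`, `inv`, `madd`, `mdouble`, `ladder_x`), the choice $b = 1$, and the slow Fermat-style inversion are assumptions made for brevity. One `madd` plus one `mdouble` costs six multiplications and five squarings; this sketch needs only three XOR additions, so the four additions counted in the text presumably reflect a slightly different arrangement.

```python
M = 163
# NIST reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def mul(a, b):
    """Multiply two field elements (Python ints used as bit vectors)."""
    r = 0
    while b:                                   # carry-less schoolbook product
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for i in range(r.bit_length() - 1, M - 1, -1):   # reduce modulo POLY
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def sqr(a):
    return mul(a, a)

def inv(a):
    """Fermat inversion a^(2^M - 2); simple but slow (not Itoh-Tsujii)."""
    r = a
    for _ in range(M - 2):
        r = mul(sqr(r), a)
    return sqr(r)

def madd(x, X1, Z1, X2, Z2):
    """Projective point addition; x is the affine x-coordinate of P."""
    t1, t2 = mul(X1, Z2), mul(X2, Z1)
    Z3 = sqr(t1 ^ t2)
    return mul(x, Z3) ^ mul(t1, t2), Z3

def mdouble(b, X, Z):
    """Projective point doubling."""
    X2, Z2 = sqr(X), sqr(Z)
    return sqr(X2) ^ mul(b, sqr(Z2)), mul(X2, Z2)

def ladder_x(k, x, b=1):
    """Return the affine x-coordinate of kP for k >= 1, given x = x(P)."""
    X1, Z1 = x, 1                              # P
    X2, Z2 = sqr(sqr(x)) ^ b, sqr(x)           # 2P
    for bit in bin(k)[3:]:                     # key bits after the leading 1
        if bit == "1":
            X1, Z1 = madd(x, X1, Z1, X2, Z2)
            X2, Z2 = mdouble(b, X2, Z2)
        else:
            X2, Z2 = madd(x, X2, Z2, X1, Z1)
            X1, Z1 = mdouble(b, X1, Z1)
    return mul(X1, inv(Z1))
```

Because only x-coordinates are used, both branches of the loop perform the same sequence of field operations, which is the regularity the text associates with parallelism and side-channel resistance.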
B. Field Arithmetic Over GF(2 m )
Field multiplication, field squaring, field addition, and field inversion operations are involved in a point operation. Addition and subtraction are equivalent over $GF(2^m)$ and reduce to simple bitwise XOR operations.
Field inversion is very costly in terms of hardware and delay. In projective coordinates, an inversion operation is needed only for the projective-to-affine coordinate conversion, and it can be realized as a multiplicative inversion. The Itoh-Tsujii algorithm [9] is selected as it requires only on the order of $\log_2(m)$ multiplications and $(m - 1)$ repeated squaring operations. In projective coordinates-based implementations, the overall performance depends on the performance of the field multipliers.
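The multiplication and squaring counts of Itoh-Tsujii inversion can be checked with a short software sketch. The version below follows the usual addition-chain formulation driven by the binary expansion of $m - 1$; it is an assumption that this matches the chain used in the paper's hardware, and the function and variable names are hypothetical. For $m = 163$ it performs 9 multiplications and 162 squarings.

```python
M = 163
# NIST reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def mul(a, b):
    """Field multiplication: carry-less product followed by reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for i in range(r.bit_length() - 1, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def itoh_tsujii_inverse(a):
    """Return (a^-1, multiplications used, squarings used) for nonzero a."""
    n_mul = n_sqr = 0
    beta, k = a, 1                       # invariant: beta = a^(2^k - 1)
    for bit in bin(M - 1)[3:]:           # bits of m-1 after the leading 1
        t = beta
        for _ in range(k):               # t = beta^(2^k): k repeated squarings
            t = mul(t, t)
            n_sqr += 1
        beta = mul(t, beta)              # beta = a^(2^(2k) - 1)
        n_mul += 1
        k *= 2
        if bit == "1":                   # beta = a^(2^(k+1) - 1)
            beta = mul(mul(beta, beta), a)
            n_sqr += 1
            n_mul += 1
            k += 1
    # Here k = m - 1; one final squaring gives a^(2^m - 2) = a^-1.
    return mul(beta, beta), n_mul, n_sqr + 1
```

The long runs of repeated squarings in the inner loop are the part that a multi-square circuit can shorten.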
C. Full-Precision Multiplier for ECC Application
For high-speed ECC application, the field multiplier is the main part of the arithmetic unit compared with the field squaring and field addition circuits due to its high area and time complexities. The performance of the multiplier affects the overall performance and mainly depends on the size of the multiplier used. A larger size multiplier reduces latency to speed up the point operation; however, the critical path delay is increased. Thus, pipelining is often adopted to shorten the critical path delay. Moreover, some multiplication algorithms (such as Karatsuba) are used to improve area and time complexity [10] , [11] , [23] . For the high-speed end of the design space, large digit-serial multipliers or bit-parallel multipliers (such as schoolbook and Mastrovito) are often used. The bit-parallel multiplier takes one CC latency, which can be an attractive option to speed up the PM.
The field multiplication for ECC over GF(2 m ) is divided into two parts: the GF2 multiplication (GF2MUL) part and the reduction part. For a large-size multiplier, the GF2MUL part is costly compared with the reduction part [18] . Thus, the main optimization of a large multiplier is concentrated on the GF2MUL part. There are several high-performance bit-parallel multipliers in [11] , [19] , [20] , [26] , and [27] . The complexity of a bit-parallel multiplier can be quadratic or subquadratic [18] . A quadratic multiplier achieves higher speed by consuming higher area than that of a subquadratic multiplier. Subquadratic multipliers are mostly based on the Karatsuba algorithm to reduce the area complexity at the expense of a lower clock frequency. The performance of the Karatsuba-based bit-parallel multiplier is improved by adopting pipelining techniques [11] . In the next section, we present a novel high-performance full-precision GF(2 m ) multiplier with segmented pipelining.
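The split into a GF2MUL part and a reduction part can be illustrated in software. The sketch below is a behavioural model only, not a description of any of the cited hardware multipliers; the function names are hypothetical, and the field is fixed to $GF(2^{163})$ with the NIST pentanomial.

```python
M = 163
# NIST reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf2mul(a, b):
    """GF2MUL part: polynomial product over GF(2), up to 2m - 1 bits long."""
    r = 0
    while b:
        if b & 1:
            r ^= a            # addition over GF(2) is a bitwise XOR
        a <<= 1
        b >>= 1
    return r

def reduce_poly(c):
    """Reduction part: fold bits m .. 2m-2 back using the field polynomial."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= POLY << (i - M)
    return c

def field_mul(a, b):
    """Full field multiplication = GF2MUL followed by reduction."""
    return reduce_poly(gf2mul(a, b))
```

For two full-width operands the unreduced product has $2m - 1 = 325$ bits, which is the width the reduction unit has to handle.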
III. PROPOSED $GF(2^m)$ MULTIPLIER WITH SEGMENTED PIPELINING
The proposed full-precision $GF(2^m)$ field multiplier (including reduction) with segmented pipelining is shown in Fig. 1 and consists of two pipelining stages to improve clock frequency. The first pipelining stage is the proposed segmented pipelining, which breaks the critical path delay of the GF2MUL part and is similar to [7]. In the segmented pipelining, we divide the $m$-bit multiplier operand into $n$ digits of $w$ bits each. Then, we multiply the $m$-bit multiplicand by each of the $w$-bit digits. Each of these partial products is $m + w - 1$ bits long and is saved in its own $(m + w - 1)$-bit pipelining register, so the $n$ partial products occupy $n$ such registers. The register outputs are aligned by (logically) shifting each one $w$ bits relative to its neighbor and are then combined by XOR operations (addition). The result of the addition, which is $2m - 1$ bits long, is then reduced to $m$ bits in the reduction unit using a fast irreducible reduction polynomial [4], [5]. The output of the reduction unit is applied to the second-stage pipelining register. Thus, there are two pipelining stages, and hence, the proposed multiplier consumes only two CCs as an initial delay to perform a multiplication. The pipelining of the multiplier divides the total critical path delay into two parts: the critical path delay of GF2MUL, $T_A + (\log_2(m/n))T_X$, and the critical path delay of the reduction part using the fast NIST reduction polynomial ($r$-nomial), $(\log_2(n + 2r))T_X$, as shown in Table I [4], [7], where $T_A$ and $T_X$ denote the delays of a two-input AND gate and a two-input XOR gate, respectively. Both critical path delays depend on the segment size, $w$. Thus, either of the two can be the critical path of the multiplier. The critical path is minimized by choosing the optimum value of $w$, which can be determined by a trial-and-error method.
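The segmented data flow described above can be modelled behaviourally in a few lines. The following sketch is an illustration of the arithmetic only (it says nothing about timing or area); the function names are hypothetical, and $m = 163$ with $w = 14$ is used as an example segment size. It checks that each partial product fits in $m + w - 1$ bits and that aligning and XORing the $n$ partial products gives the same result as a direct multiplication.

```python
M, W = 163, 14
# NIST reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf2mul(a, b):
    """Polynomial product over GF(2) without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_poly(c):
    """Reduce a polynomial of up to 2m - 1 bits to m bits."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= POLY << (i - M)
    return c

def segmented_field_mul(a, b, w=W):
    """Model of the two-stage segmented multiplier for m-bit a and b."""
    n = -(-M // w)                       # number of w-bit digits, ceil(m / w)
    mask = (1 << w) - 1
    # Stage 1: n digit products, one per pipelining register.
    regs = [gf2mul(a, (b >> (i * w)) & mask) for i in range(n)]
    assert all(p.bit_length() <= M + w - 1 for p in regs)
    # Stage 2: align each register by i*w bits, XOR-accumulate, then reduce.
    acc = 0
    for i, p in enumerate(regs):
        acc ^= p << (i * w)
    return reduce_poly(acc)
```

Changing `w` in this model changes only how the work is split between the two stages, not the result, which is why the segment size can be tuned freely for timing.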
With one-stage pipelining (segmented pipelining only), the multiplier has a one-CC delay. In that case, the critical path delay of the multiplier is the combination of the GF2MUL and reduction parts, which is $T_A + (\log_2((m/n) + n + 2r))T_X$. Again, the critical path delay can be tuned by changing the segment size of the multiplier, and the optimum segment size can also be found by a trial-and-error method. In Table I, we present the space and time complexities of our proposed multipliers, and we compare these with the quadratic and subquadratic bit-parallel multipliers reported in [19], [20], [26], and [27]. In the theoretical analysis of the quadratic and subquadratic multipliers, the quadratic bit-parallel multiplier achieves twice the speed of the subquadratic one, but it consumes 2.56 times more area [19]. Moreover, Hasan et al. [19] compare the implementation results of the two bit-parallel multipliers and show that the ratio (quadratic/subquadratic) is 1.5 in terms of area and 0.625 in terms of delay. Their implementation results show that the quadratic bit-parallel multiplier can achieve higher speed, and the area-time product of the subquadratic multiplier outperforms the quadratic multiplier by only 6.65%. Therefore, a quadratic multiplier is considered a better option for high-speed ECC implementation when area is not a constraint; for example, the quadratic multiplier in [26] and its improved-speed version in [27], both based on a matrix-vector method (Mastrovito), can achieve improved speed over a subquadratic multiplier [19] but with larger area.
An analytical complexity analysis for the multipliers is shown in Table I. Our proposed multiplier consumes a similar area to the multipliers in [19], [20], [26], and [27] (m 2 ((n − 1)m + (r − 1)m). However, its regular structure makes it more suitable for pipelining, and hence offers more scope for higher speed performance. Our proposed multiplier has a very short critical path compared with the reported parallel multipliers; hence, it can show better area-time performance due to its high-speed advantage. For the area complexity, our proposed multiplier consumes the same XOR and AND gate resources as the quadratic bit-parallel multiplier and uses flip-flops (FFs) to reduce the critical path delay. For illustration, an approximate¹ area-time complexity analysis is quantified over $GF(2^{163})$ for the various multipliers and sketched in Fig. 2. The results show that the proposed multiplier outperforms the reported multipliers in [19], [20], [26], and [27] in terms of area-time performance.
IV. PROPOSED HIGH-PERFORMANCE ECC FOR POINT MULTIPLICATION
In this section, we present careful scheduling in the point addition and point doubling operations, a novel pipelined full-precision multiplier, and other supporting units to achieve high speed and low latency while optimizing area complexity.
A. Point Multiplication Without Pipelining Delay
In general, the Montgomery point addition and point doubling in projective coordinates have a latency equivalent to a total of six field multiplications, five field squarings, and four field additions if implemented serially according to Algorithm 1 [6]. If the field squaring and field addition operations can be performed concurrently with multiplication, then the latency of the point operations is equivalent to the latency of the six field multiplications. The six multiplications can, for example, be computed in two steps using three multipliers, in three steps using two multipliers, or in six steps serially using one multiplier [10], [13], [17]. The digit size also affects the performance of ECC; for example, a bit-serial implementation takes $m$ cycles, a digit-serial one ($w$ bits per digit) takes $m/w$ cycles, and a bit-parallel implementation takes a single CC [8], [11], [12]. For high-speed designs, digit-serial multipliers are considered to reduce latency. The disadvantage of large digit-serial multipliers is a lower clock frequency. Thus, pipelining stages are applied to improve the clock frequency [12]. The clock frequency can be improved by increasing the number of pipelining stages that break the critical path delay. The main disadvantages of increasing the number of pipelining stages at the high-speed end of the design space are the increase in the number of CCs per multiplication and the need to overcome data dependency [12]. To avoid pipelining delay, optimal scheduling of the field operations of the PM is necessary.

¹ Based on XOR gates only; this is also done in [26] and [27]. The AND gates' complexity is the same for all.
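The step and cycle counts mentioned in this paragraph follow from simple ceiling divisions. The sketch below is only a back-of-the-envelope model: the function names are hypothetical, and rounding $m/w$ up to a whole number of cycles is an assumption for digit sizes that do not divide $m$.

```python
import math

def multiplication_steps(num_multipliers, mults_per_loop=6):
    """Sequential multiplication steps per PM loop with parallel multipliers."""
    return math.ceil(mults_per_loop / num_multipliers)

def cycles_per_field_mult(m, w):
    """Cycles for one m-bit field multiplication with a w-bit digit size."""
    return math.ceil(m / w)

if __name__ == "__main__":
    for n_mult in (1, 2, 3):
        print(n_mult, "multiplier(s):", multiplication_steps(n_mult), "steps")
    for w in (1, 14, 163):
        print("w =", w, "->", cycles_per_field_mult(163, w), "cycle(s)")
```

This ignores pipelining latency and data dependency, which is precisely what the scheduling discussed next has to hide.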
Our first proposed ECC processor architecture is shown in Fig. 3. It comprises a full-precision $m$-bit multiplier with two pipelining stages, one squaring circuit, one quad-squaring circuit, and two addition circuits in order to accomplish the point operations (point addition and point doubling) within six CCs. To achieve six-CC point operations, we apply several strategies to the point operations of the Montgomery PM algorithm, as shown in Algorithm 2 [24]. In the proposed algorithm, we combine point addition and point doubling to avoid data dependency. In the PM, each loop overlaps with the next loop by two CCs due to the two-stage pipelining. Thus, state1 (st1) and state2 (st2) depend on the previous key bit, $k_{i+1}$. For example, if the previous bit $k_{i+1} = 1$, then the last output will be $X_1$; otherwise, it will be $X_2$. The last output of a loop decides the sequence of st1 and st2 in the next loop. The rest of the states depend on the current bit of $k$, $k_i$. To support a six-CC algorithm, we apply a square operation, a double-square (quad-square) operation, or both in parallel with the multiplication. One of the field adders is placed in the common data path to add on the fly. The second adder is used to add the two outputs of the multiplier as shown in Fig. 3. Both adder circuits can either add their two inputs or pass through one of them, as required. Moreover, we can save some intermediate results of field operations in the local registers ($R_1$, $R_2$, $M$, and the accumulator, $A$) to avoid loading from and storing to the main memory. As a result, we avoid idle CCs due to memory input-output operations. A data flow diagram is shown in Fig. 4 to demonstrate the proposed combined point operations. In this diagram, we explain the point operations for $k_{i+1} = 1$, $k_i = 1$, and $k_{i-1} = 1$, where $k_i$ is the current bit, $k_{i+1}$ is the previous bit, and $k_{i-1}$ is the next bit of the key $k$.
In this data flow diagram, we show the loop operation of the PM in projective coordinates. In our implementation, a multiplication takes three CCs due to the two-stage pipelining, and a square operation takes two CCs, of which one CC is used to load the accumulator ($A$) register. The addition operation is realized in the common data path and is accomplished in the same CCs. Because we use two-stage pipelining and there is a data dependency between two successive loops, we use careful scheduling, in which the present loop operation of the PM is overlapped with the next loop operation.
1) We see that the starting state, st1, of a particular loop depends on the value of the previous bit, $k_{i+1}$. If the previous bit $k_{i+1} = 1$, $X_1$ is not ready. Then, we start st1 with the multiplication between $X_2$ and $Z_1$ instead of $X_1$ and $Z_2$. In this case, st2 is the multiplication between $X_1$ and $Z_2$.
2) The $X_1$ operand of st2 is calculated by adding the two outputs of the multiplier (Mula_out and Mulb_out in Fig. 1), where one output (Mula_out) is tapped after the reduction unit (dotted arrow) and the other is taken from the multiplier output (Mulb_out). The other operand of st2 is $Z_2$, which is already saved in the memory in st1 for use in st2. Here, the delay of the memory operation (accessing $Z_2$) is utilized to calculate $X_1$; again, as $k_i = 1$, we need the square and quad square of $Z_2$. Thus, we save $Z_2$ in the memory and the accumulator simultaneously in st1 to perform the squaring operations on $Z_2$ in st2. The output of the square circuit ($A^2 = Z_2^2$) is saved in the memory, and the output of the quad square ($A^4 = Z_2^4$) is saved in the local register, $R_2$ (dotted box). We can use data from the local register (dotted box) immediately without any memory operations, which saves CCs.
3) Similarly, during st2, st3, and st4, the squaring operations on $X_2$ are realized by saving it in the accumulator through B_bus; in this case, the square output $A^2 = X_2^2$ is saved in the local register $R_1$, whereas the quad square output $A^4 = X_2^4$ is saved in the memory. In st3 and st4, one of the multiplication operands is taken from the memory and the other operand from the local registers.
4) In st4, $Z_1$ (the result of $X_2 \cdot Z_1$) is ready to be saved in the memory for use in st5. Also in st4, the available output $Z_1$ must be added to the multiplication result $X_1$ on the fly. At this time, we tap $X_1$ from the output of the reduction unit (dotted arrow, one cycle earlier than the normal output), add it to $Z_1$, and save the sum in the accumulator to perform the square operation that gives the new $Z_1$.
5) The new $Z_1$ is ready in st5 to be saved in the memory and is required in st6 and in the next loop. In st5, the old $Z_1$ (saved in st4) is multiplied by $X_1$, where $X_1$ is collected directly from the multiplier output and saved in the local register, $M$. We can make $X_1$ available immediately for multiplication by using the instruction delay (of the pipelined Moore machine-based control unit) incurred when accessing the old $Z_1$ from memory.
6) In st6, we add $X_2$ (from memory) on the fly to the multiplier output to get the new $X_2$, which is then saved in the memory. The multiplication in st6, between the base point coordinate $x$ and the new $Z_1$, is completed after two CCs. However, a new loop is started after st6. Thus, st1 of the new loop depends on the last coordinate of the previous loop, $X_1$ (in this case of $k_{i+1} = 1$, $k_i = 1$, and $k_{i-1} = 1$), which is calculated by adding the results of the multiplications started in st5 and st6.
In Fig. 5, we demonstrate the loop of the PM for $k_{i+1} = 0$, $k_i = 1$, and $k_{i-1} = 1$. The previous bit of $k$ is $k_{i+1} = 0$, which means that the coordinate $X_2$ of the last loop is not ready to start with.
1) In this case, the first state (st1) starts with the multiplication between $X_1$ and $Z_2$. In this state, the multiplier output ($Z_1$) started in st4 of the previous loop is saved in the memory for use in the next state (st2). In the same state, we need to start the squaring operation on $Z_2$. Thus, $Z_2$ is accessed from memory through the A_bus for multiplication and through the B_bus into the accumulator for squaring.
2) In st2, the multiplication is $X_2 \cdot Z_1$, where $X_2$ is calculated by adding two outputs of the multiplier and is then saved in the $M$ register for use in the next cycles to multiply with $Z_1$. At the same time, the calculated $X_2$ is saved in the accumulator for squaring, as $k_i = 1$. The rest of the states of Fig. 5 are similar to Fig. 4.
B. Multiplier With Segmented Pipelining for HPECC
We consider the two extreme field sizes in the NIST standard [5], i.e., $GF(2^{163})$ and $GF(2^{571})$, to evaluate the ECC performance. In the implementation over $GF(2^{163})$, we select $w = 14$ bits, which gives 12 digit-serial (14-bit) multiplication results. The results are then loaded into the 12 177-bit-long registers. Thus, the critical path of GF2MUL depends on one two-input AND gate and 13 layers of two-input XOR gates to achieve a $14 \times 163$ multiplication. The 12 pipelining register outputs are then shifted and XORed (for accumulation) to get the full-precision multiplication result ($2m - 1$ bits) without reduction. The result is then reduced to 163 bits in the reduction unit using the fast irreducible reduction polynomial [5]. The reduced result is saved in the second-stage pipelining register. Thus, the architecture works like 12 (14-bit) digit-serial multipliers operating in parallel followed by a full-precision reduction operation. The reduction unit consists of two parts: the accumulation part and the reduction part. The accumulation part has 11 layers of two-input XORs, and the reduction part has $2r$ ($r$-nomial irreducible polynomial) layers of two-input XORs. Thus, the critical path delay is balanced theoretically. For the ECC processor implementation over $GF(2^{571})$, we also use a segment size of 14 bits.
C. Square Circuit, Memory Unit, and Control Unit of HPECC
Our proposed high-speed ECC processor design operates using six CCs for each loop of the PM. To achieve the six-cycle PM loop, we need a quad-square (four-square) circuit to do a one clock quad-square operation. The quad squaring is used in the st2 and st3 along with field multiplication as shown in Algorithm 2. Again, the latency of the conversion step contributes a significant amount to the total latency of the proposed ECC processor as the latency of the loop operation is comparable with that of the conversion step. In the conversion step, the inversion operation consumes the major part of the latency. In our projective-based ECC processor implementation, a multiplicative inversion is applied for the projective to affine coordinate conversion. Several multiplications and m steps repeated squaring operations are required. Thus, we can utilize the quad-square circuit for speeding up the inversion by reducing the number of the repeated square operations. In our proposed architecture, we use a register (accumulator) in the arithmetic data path to achieve a repeated quad-square operation without loading to the main memory. Thus, we need one CC for a four-square, two CCs for an eight-square, and so on.
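The effect of the quad-square circuit on a run of repeated squarings can be sketched as follows. This is a simplified cycle model under stated assumptions: each pass through the quad-square circuit (two squarings) is counted as one CC, a leftover single squaring is counted as one CC, and the accumulator load cycle is ignored. The function names are hypothetical. Under this model the $m - 1 = 162$ squarings of an inversion over $GF(2^{163})$ take 81 passes.

```python
M = 163
# NIST reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def sqr(a):
    """Field squaring: spread the bits of a, then reduce."""
    r = 0
    for i in range(a.bit_length()):
        if (a >> i) & 1:
            r |= 1 << (2 * i)            # bit i moves to bit 2i
    for i in range(r.bit_length() - 1, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def quad_square(a):
    """Two cascaded squarings, modelling the quad-square circuit: a^4."""
    return sqr(sqr(a))

def repeated_square(a, s):
    """Apply s squarings; return (a^(2^s), modelled cycle count)."""
    cycles = 0
    while s >= 2:                        # use the quad-square circuit
        a = quad_square(a)
        s -= 2
        cycles += 1
    if s:                                # one squaring left over
        a = sqr(a)
        cycles += 1
    return a, cycles
```

Halving the number of passes in this way matters most in the inversion, where the repeated squarings dominate the conversion latency.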
We design a friendly memory unit that is developed in a single behavioral entity that comprises an accumulator and 8 × m register file. The register file is based on distributed RAM to give high performance and flexibility. There are five input-output buses in the memory unit. In particular, our register file consists of three output buses (A_bus, B_bus, and D_bus) and one input bus. Data through A_bus and B_bus take one more cycle delay than data through D_bus as shown in Fig. 3 . Data from D_bus are dedicated to the multiplier input through the M register. Hence, the two outputs of the memory through A_bus and B_bus and the output of M (through D_bus) are synchronized. The M register acts as a pipelining register between the input and the output of the multiplier and also saves local data for the multiplier. The memory unit offers flexibility to access any data from any location of the memory through each of the output buses independently. The memory unit takes one cycle for a write operation and one cycle for a read operation. The accumulator is designed in the same entity of the memory unit and utilizes unused resources (FFs) of the memory unit. Apart from our memory unit, we deploy local registers R 1 and R 2 ; R 1 and R 2 are used to save outputs of square and quad square, respectively. Thus, the local registers (R 1 and R 2 ) and M save outputs of concurrent operations to avoid the idle state that is due to the common input bus of the memory unit and also avoid the data dependency in the successive point operations' loop.
A pipelined Moore FSM-based control unit is developed in a single behavioral entity. The Moore machine takes one CC delay to address the memory unit. The advantage of this initial instruction delay is more flexible data control, which allows some intermediate operations to be carried out during this cycle delay with the help of the local registers. Again, the control unit needs very few states to complete a PM due to the full-precision multiplier and the concurrent operations. As a result, the control unit consumes very little area while helping to keep the speed very high.
D. Critical Path Delay and Clock Cycles of the HPECC
Our proposed high-speed ECC (HPECC) processor design uses a segmented pipelining-based full-precision multiplier to achieve six CCs for each loop of the PM. The critical path delay of the ECC processor mainly depends on the critical path of the multiplier. Again, the critical path delay of the proposed multiplier can be that of the GF2MUL part or of the reduction part, depending on the segment size. As the multiplier output (Mula_out) is tapped at the end of the reduction part and passed through the adder and multiplexer before being saved in the $M$ register, the critical path delay of the ECC processor can be the delay of the reduction part + adder + mux. The critical path delay of the ECC processor architecture is shown in Table II. The main focus of our proposed ECC processor is the reduction in the number of CCs. In particular, our design takes six CCs for each loop of the PM in the projective coordinates. The total number of CCs for a PM is the sum of three main parts: the initialization from affine to projective coordinates, the PM in the projective coordinates, and finally the conversion from projective to affine coordinates. As shown in Table III, the total number of CCs for a PM = 5 CCs (initialization) + $6 \times (m - 1)$ CCs (PM in the projective coordinates) + the CCs for the final coordinate conversion ($m/2$ CCs for squaring + $3 \times$ the number of multiplications for inversion + 3 CCs for inversion + 28 CCs for others) + 3 CCs for pipelining. The other CCs, which are independent of the curve size, cover ten multiplication, six addition, and one square operations. For example, the total number of CCs for a PM over $GF(2^{163})$ is $5 + (6 \times 162) + 139 + 3 = 1119$, where $139 = (81 + 27 + 3) + 28$. Similarly, the latency of the HPECC processor over $GF(2^{571})$ is 3783 CCs.
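The cycle-count formula above can be evaluated with a short script. Two details are assumptions inferred from the worked numbers rather than stated in the text: $m/2$ is rounded down (81 for $m = 163$), and the number of inversion multiplications is taken from the usual Itoh-Tsujii count $\lfloor \log_2(m-1) \rfloor + HW(m-1) - 1$, which gives 9 for $m = 163$ (matching the 27 CCs in the example). The function names are hypothetical.

```python
def itoh_tsujii_mults(m):
    """Assumed multiplication count: floor(log2(m-1)) + HW(m-1) - 1."""
    e = m - 1
    return (e.bit_length() - 1) + bin(e).count("1") - 1

def hpecc_cycles(m):
    """Total CCs for one PM with the one-multiplier architecture."""
    init = 5                                    # affine -> projective
    loop = 6 * (m - 1)                          # six CCs per key bit
    conversion = m // 2 + 3 * itoh_tsujii_mults(m) + 3 + 28
    pipeline = 3
    return init + loop + conversion + pipeline

def latency_us(cycles, f_mhz):
    """PM latency in microseconds at a clock frequency given in MHz."""
    return cycles / f_mhz

if __name__ == "__main__":
    for m in (163, 571):
        print(m, hpecc_cycles(m))
    print(round(latency_us(hpecc_cycles(163), 352), 2), "us at 352 MHz")
```

With these assumptions the script reproduces both totals quoted in the text, 1119 CCs for $GF(2^{163})$ and 3783 CCs for $GF(2^{571})$.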
V. PROPOSED LOW-LATENCY ECC PROCESSOR FOR POINT MULTIPLICATION
The speed of ECC can be improved for high-speed applications by reducing the latency of the PM. Parallel full-precision multipliers can reduce latency to speed up the point operations. We propose a high-speed ECC processor for PM that uses three full-precision multipliers to achieve the lowest latency, as shown in Fig. 6.
A. Low-Latency Montgomery Point Multiplication
Montgomery PM offers flexibility for parallel field operations; there are six field multiplications in the projective coordinates-based Montgomery PM, as shown in Algorithm 1, which can be carried out in parallel as far as data dependency allows. In addition, the Montgomery algorithm exhibits low data dependency as it employs only $x$ coordinates [4].
The six multiplications can be achieved in two steps using three full-precision multipliers as shown in Algorithm 3. To achieve the theoretical limit of the loop operation, an ECC processor architecture needs single-clocked field multipliers along with concurrent square and addition operations, all with careful scheduling. In our implementation, we target and achieve this limit, which, to the best of our knowledge, no previously reported implementation has achieved to date due to the hitherto restrictive performance of the field multiplier. We propose a modified Montgomery PM loop based on two steps using three full-precision multipliers [Mul1, Mul2 (highlighted), and Mul3] as shown in Algorithm 3. In each state of the proposed algorithm, the outputs of the three multiplications are concurrently used for additions, square, and square-over-square (four-square) operations to generate the required outputs for the next states as shown in Fig. 6. Mul1, Mul2, and Mul3 are the three multipliers that perform the three different multiplications involved in each step of Algorithm 3 in a single CC. Again, the adder and cascaded square circuits are in the same data path as the multiplier output to perform addition, square, and four-square operations using the multipliers' outputs.
For the initialization of Algorithm 3, we save the required variables to start the loop operation in local registers (R 1 − R 6 ). For a particular value of k, k i = 1, the multipliers Mul1, Mul2, and Mul3 as shown in Fig. 6 calculate
In the same step, a cascaded squaring of $X_2$ is performed to obtain the four-square operation ($R_2 \leftarrow X_2^4$), and the result is saved in the $R_2$ register. In step 2, one input of Mul1, $(X_1 + Z_1)^2$ (the other input being $x$ from the memory unit), is produced by adding the outputs of Mul1 and Mul2 using adder1 followed by squaring. The output of the squaring is also saved in the $R_1$ register as $Z_1$ for the next loop. The Mul1 and Mul2 outputs are added by adder1 to get $X_1$, an input of Mul2 in step 1 of the next loop. In step 2, the inputs of Mul2 are the outputs of Mul1 ($Z_1$) and Mul2 ($X_1$). The Mul3 output ($Z_2$) of step 1 is saved in the register $R_3$ in step 2 for use as an input of Mul2 in the next loop, and the Mul3 output $Z_2$ is squared ($Z_2^2$) and four-squared ($Z_2^4$) using the cascaded square circuits and then saved in the registers $R_4$ and $R_5$. Again, the inputs of Mul3 in step 2 are $b$ from the memory unit and $Z_2^4$ from the register $R_5$, and the multiplication output is added to the content of $R_2$ ($X_2^4$) using adder2 and then used as $X_2$, an input of Mul1 in the next loop. Thus, the proposed architecture supports the calculation of all of the new inputs for the next loop, namely $X_1$, $X_2$, $Z_1$, and $Z_2$, using the two steps of Algorithm 3. Apart from this, we use smart scheduling to avoid data dependency in the successive loops. We show data flow diagrams to illustrate the point operations for the different combinations of the previous, current, and next values of $k_i$ in Figs. 7 and 8.

Algorithm 3 Proposed Low-Latency Montgomery Point Multiplication (Two CCs-Based Loop Operation Is Shown)
The data flow diagram shown in Fig. 7 is for the values $k_{i+1} = 1$, $k_i = 1$, and $k_{i-1} = 1$. In this case, the point operations of the previous, current, and next loops are the same; hence, there is no transition of the point operations between successive loops. Each loop needs only two states (st1 and st2) to accomplish the field operations (i.e., multiplication, square, and addition) of a point multiplication loop operation. The field multiplication has a delay of one CC because of its one-stage pipelining; the field square and field adder, however, have only combinational circuit delay and can be performed in the same CC. In Fig. 7, the data flow diagram shows the use of the three full-precision multipliers Mul1, Mul2, and Mul3 in each state to accomplish three multiplications. As the multiplier, adder, and square circuits are cascaded, different field operations can be achieved in the same CC by tapping the respective results.
1) For example, in st1, the Mul1 and Mul2 outputs (i.e., $Z_1$ and $X_1$) are added and squared to get the new $Z_1$ on the fly. This $Z_1$ is immediately used in the next loop as an input to Mul1 and is also saved in register $R_1$ for use in the next loop. The output of Mul3 is $Z_2$, which is squared and four-squared in the same clock to get $Z_2^2$ and $Z_2^4$. The three outputs ($Z_2$, $Z_2^2$, and $Z_2^4$) are then saved in registers $R_3$, $R_4$, and $R_5$, respectively, for use in the next loop.
2) In state st2, we get the output $X_1$ by adding the outputs of Mul1 and Mul2, and we get $X_2$ by adding the output of Mul3 and the content of $R_2$ ($X_2^4$). $X_2$ and its square $X_2^2$ are applied directly as inputs of Mul1 and Mul3, respectively, in st1 of the next loop; $X_2$ is also four-squared to get $X_2^4$ in the same CC, which is saved in $R_2$ for the next operation. Thus, all inputs required to begin the next loop are ready.
The data flow diagram is the same for the combination of values $k_{i+1} = 0$, $k_i = 0$, and $k_{i-1} = 0$, except that the variables are changed as shown in Algorithm 3.
In Fig. 7, a data
1) In the loop, $Z_1^2$ is calculated and saved in $R_5$ in st2. The output $X_1$ of the loop will be squared and four-squared to get $X_1^2$ and $X_1^4$ in st1 of the next loop ($k_i = 0$).
2) In st1 of the loop of $k_i = 0$ (at CC 3), $X_1^2$ is used as a Mul3 input, and $X_1^4$ is saved in $R_2$. In the same state, the content of $R_5$ ($Z_1^2$) is squared to get $Z_1^4$, which is saved in $R_4$. Thus, the second loop for $k_i = 0$ can be started with the three multipliers' inputs $X_2 \cdot Z_1$, $X_1 \cdot Z_2$, and $Z_1^2 \cdot X_1^2$ after the previous loop ($k_i = 1$). In this case, the inputs of Mul1 and Mul2 in the loop ($k_i = 0$) are the same as those of the previous loop ($k_i = 1$) because the last output (the addition of $R_2$ and Mul3) of the previous loop is $X_2$; however, the outputs of the multipliers differ from those of the previous loop.
3) The final loop is for $k_i = 0$ (CCs 5 and 6), which is similar to Fig. 6 (no transition), except that the variables are changed as shown in Algorithm 3.
Thus, the loop of the point operations can be accomplished in only two CCs for any set of values of $k_{i+1}$, $k_i$, and $k_{i-1}$.
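The two-step loop described above can be illustrated with a short behavioural sketch in Python. It is not the authors' hardware description: register timing and pipelining are not modelled, the function names (`gf_mul`, `gf_sq`, `ladder_loop`, `ladder`) are hypothetical, and the step-1 Mul3 product for $k_i = 1$ ($X_2^2 \cdot Z_2^2$) is assumed from the standard Lopez-Dahab doubling formula and from the $k_i = 0$ inputs listed in the text. The field is assumed to be GF($2^{163}$) with the NIST reduction polynomial $x^{163} + x^7 + x^6 + x^3 + 1$.

```python
M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # x^163 + x^7 + x^6 + x^3 + 1

def gf_mul(a, b):
    """Field multiplication: carry-less product followed by reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for i in range(r.bit_length() - 1, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def gf_sq(a):
    """Field squaring (modelled here as a multiplication)."""
    return gf_mul(a, a)

def ladder_loop(X1, Z1, X2, Z2, x, b, ki):
    """One loop iteration as two steps of three multiplications each."""
    if ki == 0:  # for k_i = 0 the roles of the two points are exchanged
        X1, Z1, X2, Z2 = X2, Z2, X1, Z1
    # Step 1: three concurrent products; X2^4 is kept aside (R2 in the text).
    m1 = gf_mul(X2, Z1)                  # Mul1
    m2 = gf_mul(X1, Z2)                  # Mul2
    m3 = gf_mul(gf_sq(X2), gf_sq(Z2))    # Mul3 (assumed operands), new Z2
    r2 = gf_sq(gf_sq(X2))                # four-square of X2
    # Step 2: adder1 + square feed Mul1; Mul3 multiplies b by Z2^4.
    z1_new = gf_sq(m1 ^ m2)              # new Z1
    n1 = gf_mul(x, z1_new)               # Mul1
    n2 = gf_mul(m1, m2)                  # Mul2
    n3 = gf_mul(b, gf_sq(gf_sq(Z2)))     # Mul3
    X1, Z1, X2, Z2 = n1 ^ n2, z1_new, n3 ^ r2, m3
    if ki == 0:
        X1, Z1, X2, Z2 = X2, Z2, X1, Z1
    return X1, Z1, X2, Z2

def ladder(k, x, b):
    """Run the loop over the bits of k after the leading 1; count two CCs per loop."""
    state = (x, 1, gf_sq(gf_sq(x)) ^ b, gf_sq(x))  # projective P and 2P
    cycles = 0
    for bit in bin(k)[3:]:
        state = ladder_loop(*state, x, b, int(bit))
        cycles += 2
    return state, cycles
```

In this model every loop iteration costs exactly two steps regardless of the key bit, which mirrors the claim that the loop needs only two CCs for any bit pattern; the final conversion to affine coordinates is outside the sketch.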
B. Multiplier With Segmented Pipelining for LLECC
Parallel multipliers are used to reduce the latency of the PM in ECC processor implementations, and the majority of reported designs in the literature are based on digit-serial multipliers rather than bit-parallel multipliers [13]-[17]. Bit-parallel multipliers have a larger area and critical path delay because the multiplier must be large to match the large field sizes of the ECC curves [18]. A subquadratic bit-parallel multiplier can be suitable for a high-speed ECC design; however, pipelining is required to improve its speed [11]. The scope for pipelining in the proposed three-multiplier-based ECC processor is limited because the loop operation takes place within only two CCs. Thus, only one-stage pipelining can be adopted to improve the performance of the multiplier, provided that careful scheduling is devised to overcome the data dependency. This limit on pipelining is a serious bottleneck that prevents traditional bit-parallel and subquadratic multipliers from achieving significant performance. Our proposed segmented pipelining technique overcomes it by implementing n pipelines in parallel, achieving an overall single-stage pipelining as shown in Fig. 6. This makes the proposed full-precision multiplier suitable for the very low latency loop while still maintaining a high performance, which can allow high-security ECC curves to be deployed in more applications. In our proposed low-latency ECC (LLECC) processor architecture (as shown in Fig. 6), we consider an implementation over GF($2^{163}$) with three parallel multipliers, each of which is a 163-bit full-precision multiplier with 14-bit segmented pipelining.
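One way to picture a full-precision multiplier with 14-bit segments is the functional sketch below. It reflects our reading of the description rather than the actual circuit: one operand is cut into w-bit segments, each segment forms a partial product with the full other operand (these would be the values held in the parallel pipeline registers), and the partial products are then shifted, XORed, and reduced. The names `clmul`, `reduce_mod`, `num_segments`, and `segmented_mul` are hypothetical, and the NIST reduction polynomial for GF($2^{163}$) is assumed.

```python
M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # x^163 + x^7 + x^6 + x^3 + 1

def clmul(a, b):
    """Carry-less (polynomial) product of two bit-vectors over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_mod(r):
    """Reduce a polynomial modulo POLY."""
    for i in range(r.bit_length() - 1, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def num_segments(w=14):
    """Number of w-bit segments needed to cover an M-bit operand."""
    return -(-M // w)

def segmented_mul(a, b, w=14):
    """Multiply a by b segment by segment, then combine and reduce."""
    mask = (1 << w) - 1
    # Stage 1 (assumed pipeline registers): one partial product per segment of b.
    stage = [clmul(a, (b >> (j * w)) & mask) for j in range(num_segments(w))]
    # Stage 2: align each partial product, XOR them together, and reduce.
    acc = 0
    for j, p in enumerate(stage):
        acc ^= p << (j * w)
    return reduce_mod(acc)
```

Because all segments are independent, they can be evaluated side by side in hardware, which is consistent with the idea of n parallel pipelines giving an overall single-stage pipeline; the segment size w then trades critical path delay against area.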
C. Square Circuit, Memory Unit, and Control Unit of LLECC
Our proposed LLECC processor takes two CCs for each loop operation of the Montgomery PM. To achieve a two-CC loop operation, the multiplier output must be processed in the same CC by cascading the adder and square circuits. Thus, compared with Fig. 3, the architecture in Fig. 6 includes several extra adders, square circuits, and local registers to calculate some instructions of the point operation on the fly. The main memory adopts the same distributed memory architecture as in Fig. 3 to enhance the speed. The main memory saves the initial inputs and the final outputs; during a loop operation, it supplies the constant values (x, y, b), as most of the calculated outputs are saved in the local registers to reduce the memory access delay.
We also use a separate shift register (k register) to save the key of the ECC. The shift register shifts 1 bit in every two cycles to generate a new set of values for $k_{i+1}$, $k_i$, and $k_{i-1}$ used in the control unit as shown in Fig. 6. The control unit of the LLECC is also based on an FSM that controls the two CCs-based point operations and is simpler than the control unit of the HPECC as most of the operations are performed concurrently.
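The behaviour of the k register can be sketched as a generator that emits, for each loop, the first CC of that loop and the window ($k_{i+1}$, $k_i$, $k_{i-1}$). This is only an illustration under stated assumptions: the key is scanned from the bit after the leading 1 towards the least significant bit, CCs are numbered from 1, and a missing neighbour at the end of the key is represented by `None`. The name `key_windows` is hypothetical.

```python
def key_windows(k):
    """Yield (first_cc, (k_prev, k_cur, k_next)) for each two-CC loop."""
    bits = [int(c) for c in bin(k)[2:]]      # MSB first; bits[0] is the leading 1
    for j in range(1, len(bits)):
        k_next = bits[j + 1] if j + 1 < len(bits) else None
        first_cc = 2 * (j - 1) + 1           # the register shifts once every two CCs
        yield first_cc, (bits[j - 1], bits[j], k_next)

for cc, window in key_windows(0b1011):
    print(cc, window)
```

The control FSM would use such a window to decide whether the variables keep their roles (no transition) or are exchanged between successive loops.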
D. Critical Path Delay and Clock Cycles of the LLECC
In the proposed LLECC architecture, we perform several instructions in the same cycle by cascading the multiplier, adder, and square circuits as shown in Fig. 6. The critical path delay of the LLECC is the path delay of MULGF2 + the reduction part + adder + square + 3 × 1 mux, as shown in Table II. The critical path delay can be optimized by selecting the segment size w through a trial-and-error approach.
The total number of CCs of the ECC processor depends mainly on the latency of the loop operation of the PM. We achieve two CCs for each loop operation of the Montgomery PM in projective coordinates, which is the theoretical limit of the Montgomery PM algorithm under projective coordinates. The coordinate conversion circuit, however, includes the costly inversion operation. We adopt multiplicative inversion to reduce the area and time overheads [9]. Because the total latency of the PM in projective coordinates with two-CC loop operations is comparable with the latency of the final conversion operation, the number of CCs for the conversion must also be reduced. The inversion involved in the conversion step consumes most of these CCs and is thus the focus for optimization. We use a four-square circuit to speed up the multiplicative inversion operation.
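To show why a four-square circuit helps, the sketch below implements multiplicative inversion with an Itoh-Tsujii-style addition chain and records the runs of back-to-back squarings. The choice of this particular chain is our assumption (the text only states that multiplicative inversion is adopted), the names `gf_inv` and `quad_square_steps` are hypothetical, and the step count is a simple model, not the cycle count of the actual processor.

```python
M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # x^163 + x^7 + x^6 + x^3 + 1

def gf_mul(a, b):
    """Field multiplication: carry-less product followed by reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for i in range(r.bit_length() - 1, M - 1, -1):
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r

def gf_sq(a):
    return gf_mul(a, a)

def gf_inv(a, runs=None):
    """Compute a^(2^M - 2), the inverse of a nonzero a, via beta_k = a^(2^k - 1).

    If a list is passed as runs, the length of every run of consecutive
    squarings is appended to it.
    """
    if runs is None:
        runs = []
    beta, k = a, 1
    for bit in bin(M - 1)[3:]:
        t = beta
        for _ in range(k):                 # beta_k raised to 2^k: k squarings in a row
            t = gf_sq(t)
        runs.append(k)
        beta, k = gf_mul(t, beta), 2 * k   # beta_2k
        if bit == "1":
            beta, k = gf_mul(gf_sq(beta), a), k + 1   # beta_(k+1)
            runs.append(1)
    runs.append(1)                         # final squaring gives a^(2^M - 2)
    return gf_sq(beta)

def quad_square_steps(runs):
    """Steps needed if a four-square circuit performs two squarings per step."""
    return sum((r + 1) // 2 for r in runs)
```

In this model the chain needs M - 1 squarings in total, and pairing consecutive squarings in a four-square circuit roughly halves the number of squaring steps, which is the effect exploited to shorten the conversion phase.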
VI. IMPLEMENTATION RESULTS
The architectures have been implemented (placed and routed) on Xilinx Virtex4, Virtex5, and Virtex7 FPGA technologies to enable fair comparisons to relevant reported designs on the same technologies as well as provide achievable implementation results on more recent technologies. Where feasible, the designs have been implemented in each Virtex family. The FPGA size selected was the smallest in the family that could accommodate the design in terms of area and pin count.
Table IV shows the results, after place and route with the Xilinx ISE 14.5 tool, of our proposed high-speed ECC processor implementations: HPECC over GF($2^{163}$) on Virtex4 (XC4VLX60), Virtex5 (XC5VLX50), and Virtex7 (XC7V330T); LLECC over GF($2^{163}$) on Virtex5 (XC5VLX110) and Virtex7 (XC7V690T); and HPECC over GF($2^{571}$) on Virtex7 (XC7VX980T). The presented results are achieved with the use of high-speed timing closure techniques. We used repeated place and route for different timing constraints to achieve the best possible result. The high-performance ECC implementations over GF($2^{163}$) based on one multiplier (HPECC_1M) on Virtex4, Virtex5, and Virtex7 consume 12 964 slices, 4393 slices, and 4150 slices and can operate at maximum clock frequencies of 210, 228, and 352 MHz, respectively. The high frequency is due to the design of the high-performance field multiplier. Our LLECC processor based on three parallel multipliers (LLECC_3M) improves the speed by reducing the latency, at the cost of an area overhead. The proposed LLECC on Virtex7 reaches 159 MHz while consuming the same area as on Virtex5 (113 MHz and 11 777 slices). Table IV provides a detailed comparison with the state of the art using the same technology.
Our previous high-throughput design presented in [24] is the best reported implementation in terms of area-time metric; our HPECC implementation presented here over GF($2^{163}$) on Virtex7 achieves a better metric value (area-time metric of 13) even using a full-precision multiplier. Our previous high-speed ECC implementation presented in [25] is the fastest FPGA design to date on Virtex7. Our proposed design in this paper outperforms [25] in both speed and area-time metrics.
For Virtex4, the previous highest speed implementation is presented in [14] and consumed 20 807 slices to achieve 7.72 μs using three 82 bit-parallel multiplier cores. Our HPECC implementation on Virtex4 consumes 38% less area and shows a 31% speed improvement. Moreover, our work uses fewer arithmetic resources (one 163-bit multiplier) to gain a 2.33 times improvement in the area-time metric (slices × time × $10^{-3}$) compared with [14]. In [16], a high-speed design is presented that used 17 929 slices to attain 9.60 μs for the PM time; our proposed work on Virtex4 is 45% faster than that in [16] and consumes less area. The work presented in [15] uses three 55-bit multipliers that consumed two times the area to achieve 10 μs, whereas our design is about twice as fast. The most relevant work is presented in [11], where a 163-bit multiplier with four-stage pipelining is used to achieve a maximum clock frequency of 131 MHz. Our design is based on a 163-bit multiplier with two-stage pipelining that achieves a clock frequency of 210 MHz, i.e., a 60% improvement in clock frequency. Our ECC processor implementation is also twice as fast with only 60% more slices; this translates to a 21% improvement in the area-time metric over the efficient design reported in [11]. Our design shows an 18% better area-time metric than the previous best optimized design presented in [10]. The work presented in [12] uses pipelining to achieve high clock frequency; our proposed ECC processor uses two-stage pipelining to get a 36% improvement in clock frequency over [12]. The work in [21] is the previous version of [11]. The works in [22] and [23] are similar implementations to [11]; however, [11] is a lookup-table-optimized implementation. In comparison with [21]-[23], our work shows better results than the best results they presented.
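The Virtex4 comparison with [14] can be reproduced with a few lines of arithmetic. The sketch below uses only figures quoted in the text (12 964 slices and 5.32 μs for our design; 20 807 slices and 7.72 μs for [14]) and the metric definition slices × time × $10^{-3}$; the function names are hypothetical.

```python
def area_time(slices, time_us):
    """Area-time metric as defined in the text: slices x time (in microseconds) x 10^-3."""
    return slices * time_us * 1e-3

def reduction_percent(ours, reference):
    """Relative reduction of a quantity (area or time) with respect to a reference."""
    return 100.0 * (1.0 - ours / reference)

ours_slices, ours_time = 12964, 5.32       # HPECC on Virtex4
ref_slices, ref_time = 20807, 7.72         # design reported in [14]

at_gain = area_time(ref_slices, ref_time) / area_time(ours_slices, ours_time)
area_saving = reduction_percent(ours_slices, ref_slices)
time_saving = reduction_percent(ours_time, ref_time)
print(round(at_gain, 2), round(area_saving), round(time_saving))
```

A lower area-time value is better, so the ratio of the reference metric to ours expresses the improvement factor.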
For Virtex5, the best reported performance result over GF($2^{163}$) is 5.48 μs and is presented in [13] with 6150 slices. Our proposed ECC processor consumes only 4393 slices to compute a PM in 4.91 μs, which is better in both speed (10%) and area (29%) than that in [13]. Our design achieves double the speed of [11] but consumes only 25% more slices. The work presented in [17] consumes 6536 slices to get a PM time of 12.9 μs; our area-time metric is 3.81 times better than that in [17]. The proposed HPECC architecture over GF($2^{571}$) (the highest security NIST curve) is the first reported full-precision multiplier-based implementation and sets a new time record for PM (37.5 μs on Virtex7).
Our LLECC, which requires only two CCs per loop of the Montgomery PM, is the first implementation in the literature with such a schedule. The proposed LLECC design has the lowest latency figure [450 CCs for the curve over GF($2^{163}$)] reported to date while still achieving a high clock frequency, thanks to the novel pipelining technique in the field multiplier and the breaking of the long critical path delay by inserting local registers. Furthermore, the LLECC over GF($2^{163}$) implemented on Virtex7 shows the fastest ever figure for PM (2.83 μs) on FPGA at the theoretical limit of performance.
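The relation between latency, clock frequency, and PM time behind these figures is simply time = CCs / frequency. A minimal check, using the 450 CCs quoted above and the 159 MHz reported for the LLECC on Virtex7 (the function name `pm_time_us` is hypothetical):

```python
def pm_time_us(clock_cycles, freq_mhz):
    """PM time in microseconds for a given latency (CCs) and clock frequency (MHz)."""
    return clock_cycles / freq_mhz

llecc_v7 = pm_time_us(450, 159)
print(round(llecc_v7, 2))
```

The same relation shows why lowering the cycle count can outweigh a lower clock frequency: the three-multiplier design runs at a lower frequency than the one-multiplier design on Virtex7 yet completes the PM sooner.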
VII. CONCLUSION
This paper presented a very high speed ECC processor for PM on FPGA based on a novel two-stage pipelined full-precision multiplier in HPECC and a one-stage pipelined full-precision multiplier in LLECC with careful scheduling in both cases for the combined Montgomery PM algorithm.
Our proposed high-performance one-multiplier-based architecture takes six cycles for a loop of the Montgomery PM in the projective coordinates without any pipelining delay, whereas our LLECC (three-multiplier-based) processor takes only two CCs. The architectures have been implemented (placed and routed) on Xilinx Virtex4, Virtex5, and Virtex7 FPGA families, resulting in the fastest reported implementations to date to the best of the authors' knowledge. On Virtex4, our ECC PM over GF($2^{163}$) takes 5.32 μs with 13 418 slices, which is faster than the fastest previously reported Virtex4 design [14] and also faster than the fastest reported design to date (5.48 μs) that was on a Virtex5 [13]. On Virtex5, our design over GF($2^{163}$) is not only even faster at 4.91 μs but also smaller than that of [13]. Our implementation on the new Virtex7 FPGA technology achieves the best area-time performance with the highest speed to date; an ECC implementation takes only 3.18 μs using 4150 slices. To evaluate the scalability of our contributions, we also implemented the proposed one-multiplier-based architecture over GF($2^{571}$), the highest security curve in the NIST standard [5], on Virtex7; this is the first reported implementation, which can complete a PM in only 37.54 μs. Our parallel multipliers-based ECC design is the first reported full-precision parallel architecture that shows the highest speed (2.83 μs) for the PM over GF($2^{163}$) with the lowest latency (450 CCs) on FPGA.
The proposed ECC processor implementations would enable faster deployment of public key cryptography protocols, for example, in terms of key agreement (Elliptic Curve Diffie-Hellman) and digital signatures (Elliptic Curve Digital Signature Algorithm) across a range of platforms with improved efficiency in terms of area/power resource.
