Throughput/Area-efficient ECC Processor Using Montgomery Point Multiplication on FPGA
Zia-Uddin-Ahamed Khan, Student Member, IEEE, and Mohammed Benaissa, Senior Member, IEEE
Abstract: High throughput while maintaining low resource usage is a key issue for elliptic curve cryptography (ECC) hardware implementations in many applications. In this brief, an ECC processor architecture over Galois fields is presented, which achieves the best reported throughput/area performance on field-programmable gate array (FPGA) to date. A novel segmented pipelining digit serial multiplier is developed to speed up ECC point multiplication. To achieve low latency, a new combined algorithm is developed for point addition and point doubling with careful scheduling. A compact and flexible distributed-RAM-based memory unit design is developed to increase speed while keeping area low. Further optimizations were made via timing constraints and logic level modifications at the implementation level. The proposed architecture is implemented on Virtex4 (V4), Virtex5 (V5), and Virtex7 (V7) FPGA technologies and, respectively, achieved throughput/slice figures of 19.65, 65.30, and 64.48 ($10^6$/(seconds × slices)).
Index Terms: Efficiency, elliptic curve cryptography (ECC), field-programmable gate array (FPGA), point multiplication (PM), throughput per area (throughput/area).

I. INTRODUCTION
Public-key-based information security networks use cryptographic algorithms such as elliptic curve cryptography (ECC) and RSA. ECC has emerged recently as an attractive replacement for the established RSA due to its superior strength-per-bit and reduced cost for equivalent security [1].
High-speed ECC is required to meet real-time information security demands; however, in many applications, the hardware resource implications may be prohibitive, and the required high-speed performance must be achieved within a restricted resource budget.
Field-programmable gate array (FPGA)-based hardware acceleration of ECC has seen a surge of interest recently. There are several state-of-the-art FPGA implementations aimed at the high-speed end of the design space [7]-[13]. Most of these, however, use increased hardware resources to achieve the speed improvements, sacrificing overall efficiency in terms of the throughput per area (throughput/area) metric; such efficiency is desirable in many emerging low-resource applications, in particular in wireless communications. Area-optimized high-speed ECC design is challenging: it requires algorithmic optimization, careful scheduling to reduce clock cycles, and attention to the size of the multipliers, the critical delay of the logic, and pipelining issues [7], [9].
In ECC, scalar point multiplication (PM) is the main operation. The PM can be implemented over either prime fields $GF(p)$ or binary extension fields $GF(2^m)$, adopting either projective or affine coordinates. Binary extension fields, also called finite fields (FFs) here, are more suited to hardware implementation due to their lower complexity FF multipliers, simple FF adders, and single-clock-cycle FF squaring circuits. Projective coordinates are suited to throughput/area-efficient ECC designs because the costly inversion operation is avoided in the point operations, and the inversion required to convert projective into affine coordinates can be achieved by multiplicative inversion [2], [6].
ECC computations in the projective coordinates system are based on large operand FF operations, of which multiplication is the most frequently performed. The high-speed performance of ECC designs therefore depends mainly on the performance of the FF multipliers. Digit serial FF multipliers are often used to reduce latency; popular multipliers here include the direct-method-based multipliers and Karatsuba multipliers [7], [10]. If the field size is m and the digit size of a digit serial multiplier is w, then the number of clock cycles for each FF multiplication is $s + c$, where $s = \lceil m/w \rceil$ and c accounts for the clock cycles due to data READ/WRITE operations. Thus, large digit multipliers reduce clock cycles (latency) at the cost of increased area and critical path delay. The critical path delay can be reduced using pipelining at the cost of some extra latency [9].
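The cycle count per multiplication can be tabulated with a few lines of Python. This is only a rough sketch: the function name `mult_cycles` is hypothetical, the ceiling on m/w is an assumption for digit sizes that do not divide m, and the overhead c = 4 is borrowed from the 41-bit case discussed later in this brief and is assumed constant across digit sizes purely for illustration.

```python
from math import ceil


def mult_cycles(m: int, w: int, c: int) -> int:
    """Clock cycles for one digit-serial FF multiplication: s + c, s = ceil(m / w)."""
    s = ceil(m / w)
    return s + c


if __name__ == "__main__":
    m = 163  # field size
    c = 4    # assumed overhead cycles (illustrative, see lead-in)
    for w in (8, 16, 41, 82, 163):
        print(f"w = {w:3d}: s = {ceil(m / w):2d}, cycles = {mult_cycles(m, w, c)}")
```

Larger digits shrink s quickly at first, but the fixed overhead c then dominates, which is one reason why very large digits pay off less than the raw value of s suggests.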
In this brief, we present an area-time (throughput/slice)-efficient ECC processor over binary fields in projective coordinates on FPGA. We implement the Lopez-Dahab (LD) modified Montgomery algorithm for fast PM. We demonstrate a new "no idle cycle" [7] combined point operations (point addition and point doubling) algorithm to remove idle clock cycles between two successive point operations. We schedule point operations very carefully to avoid idle clock cycles due to data dependence, READ/WRITE operations, and pipelining. In addition, our efficient arithmetic circuit includes a digit serial multiplier, an adder, and a square circuit. The presented arithmetic unit can support on-the-fly addition and square operations while performing FF multiplication. Moreover, we present an improved most significant digit (MSD) serial multiplier utilizing segmented pipelining, similar to the least significant digit multiplier presented in [2] and [4]. We develop an optimized distributed-RAM-based memory unit for flexible data access to support reduced data dependence in the arithmetic operations. We adopt the Itoh-Tsujii inversion algorithm for inversion to save area [5], [6]. Finally, we use a dedicated finite-state machine-based control unit to speed up the control operations. The proposed architecture is implemented on different FPGA technologies.
The remainder of this brief is organized as follows. Section II discusses the preliminaries of PM and the LD modified Montgomery PM in projective coordinates. Section III reviews resource constraints in high-throughput ECC. Section IV shows the proposed design. Section V presents the results of the FPGA implementation and a comparison with recently published state-of-the-art designs on FPGAs, followed by conclusions in Section VI.
II. PRELIMINARIES
A. ECC Over $GF(2^m)$
ECC over the binary extension field $GF(2^m)$ is suitable for hardware implementation. The main operation of ECC is scalar PM $Q = k \cdot P$, where k is a scalar (integer), P is a point on the elliptic curve, and Q is the point on the curve resulting from $k \cdot P$ [2].
Let E be an elliptic curve over the binary extension field. E consists of a point at infinity ∞ together with the set of points (x, y) that satisfy

$$y^2 + xy = x^3 + ax^2 + b$$

where a and b are elements of the FF $GF(2^m)$ and $b \neq 0$ [2]. The PM ($k \cdot P$) is accomplished by point addition and point doubling depending on $k_i$, i.e., the ith bit of k. The LD modified Montgomery PM algorithm, as shown in Algorithm 1, has been adopted by many designs in the high-performance ECC design space [7]-[13] due to its speed, side-channel attack resistance, suitability for parallelization, and low resource requirements.
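For orientation, the x-only LD Montgomery ladder can be sketched in software as follows. This is the textbook form of the algorithm under stated assumptions, not a transcription of the authors' Algorithm 1: the function names are hypothetical, the field is $GF(2^{163})$ with the NIST reduction polynomial $x^{163} + x^7 + x^6 + x^3 + 1$, the curve coefficient b = 1 is chosen only for illustration, and the recovery of the y-coordinate is omitted.

```python
M = 163
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (NIST binary field, m = 163)
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Bit-serial multiplication in GF(2^M); ints encode polynomials over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:  # degree reached M, so reduce
            a ^= POLY
    return r


def gf_sqr(a):
    return gf_mul(a, a)


def gf_inv(a):
    """Inverse as a^(2^M - 2) by plain square-and-multiply (a must be nonzero)."""
    r = a
    for _ in range(M - 2):
        r = gf_mul(gf_sqr(r), a)  # r = a^(2^j - 1) for growing j
    return gf_sqr(r)


def madd(X1, Z1, X2, Z2, x):
    """Projective x-only point addition (four multiplications)."""
    t1 = gf_mul(X1, Z2)
    t2 = gf_mul(X2, Z1)
    Z3 = gf_sqr(t1 ^ t2)
    X3 = gf_mul(x, Z3) ^ gf_mul(t1, t2)
    return X3, Z3


def mdouble(X, Z, b):
    """Projective x-only point doubling (two multiplications)."""
    X2, Z2 = gf_sqr(X), gf_sqr(Z)
    return gf_sqr(X2) ^ gf_mul(b, gf_sqr(Z2)), gf_mul(X2, Z2)


def montgomery_x(k, x, b=1):
    """Affine x-coordinate of k*P from the affine x-coordinate of P (k >= 1)."""
    X1, Z1 = x, 1                          # P
    x2 = gf_sqr(x)
    X2, Z2 = gf_sqr(x2) ^ b, x2            # 2P
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            X1, Z1 = madd(X1, Z1, X2, Z2, x)
            X2, Z2 = mdouble(X2, Z2, b)
        else:
            X2, Z2 = madd(X1, Z1, X2, Z2, x)
            X1, Z1 = mdouble(X1, Z1, b)
    return gf_mul(X1, gf_inv(Z1))
```

Each loop iteration performs one addition and one doubling regardless of the key bit, which is the property behind the side-channel resistance mentioned above, and it costs six field multiplications in total.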
III. RESOURCE-CONSTRAINED HIGH-THROUGHPUT ECC
For a high-throughput ECC implementation in the low area end of the design space, the critical path of the logic, the area of the design, and the number of clock cycles (latency) for the PM must all be optimized. Throughput is usually improved via the adoption of large-digit-size multiplication and parallel operation of multiplications to decrease the latency. However, these steps result in increased area and critical path delay and therefore affect the throughput/area metric. The critical path delay can be minimized via pipelining [9] at the expense of an increase in area and in the number of clock cycles, both of which grow with the number of pipeline stages inserted in the design. Moreover, the pipeline stages can generate idle cycles in the data-dependent field operations [7]. The number of pipeline stages is an important consideration for area-optimized high-speed design, often requiring a latency versus clock frequency tradeoff. The latency due to pipelining can offset the merits of using a large-digit-size multiplier and of parallelizing multiplications. In general, the area complexity of a high-speed ECC design depends on the digit size of the multipliers used and the level of parallelism adopted, on the size and sophistication of the memory unit, and on the control unit.
IV. PROPOSED THROUGHPUT/AREA-EFFICIENT ECC PROCESSOR
Our proposed area-optimized high-throughput architecture is presented in Fig. 1. The design consists of an efficient arithmetic unit, an optimized memory unit, and a dedicated control unit.
A. Segmented-Pipelining-Based Digit Serial Multiplier
The arithmetic unit design consists of a novel MSD serial multiplier, a square circuit, and an adder circuit, as shown in Fig. 1.
The performance of ECC depends mainly on the performance of the digit serial multiplier, particularly the speed of the multiplier for a targeted level of latency. Digit serial multiplication at the high-speed end of ECC implementations has tended to be either in direct form (i.e., MSD serial multiplier) [10] or in bit parallel form (i.e., Karatsuba multiplier) [8], [9]. There are some advantages of Karatsuba multiplication over MSD multiplication. A Karatsuba FF multiplication takes $s - 1$ [2], [5], [9], and [10]. However, a pipelined Karatsuba-multiplier-based ECC implementation has been shown to achieve a lower clock frequency than a direct digit-serial-multiplier-based implementation [7], [8], [10].
For ECC based on a large-digit MSD serial multiplier, pipelining is required, which can affect the latency of the PM. In this brief, we apply segmented pipelining to improve performance in MSD multiplication. In the segmented pipelining approach, a $w \times m$ digit serial multiplication is broken into subdigit serial multiplications, called segmented multiplications, $w_1 \times m, w_2 \times m, \ldots, w_n \times m$, where $w = w_1 + w_2 + \cdots + w_n$. The segmented multiplication product is first saved in a register (Reg) before reduction into m bits using an interleaved reduction similar to that in the bit serial multiplier in [2]. The reduced m-bit output of the reduction unit is saved in another Reg for use in the next cycle's reduction or as the output. Thus, the proposed multiplication takes $s + 2$ clock cycles, where one extra clock cycle is due to the segmented pipelining and the other is due to the pipelining after the reduction unit. A new operand is loaded into the multiplier every s clock cycles, so a real-time reset is required every s cycles. We use multiplexers to select zero for the reset, which saves one clock cycle per FF multiplication. Finally, the segmented pipelined multiplier takes one clock cycle for n segmentations without increasing area (slices) on the FPGA: the otherwise unused flip-flops (FFs) in the combinational circuit of the multiplier are utilized for the pipelining [8].
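The arithmetic behind the segmentation can be checked with a small functional model. The sketch below is a software illustration only: it assumes $GF(2^{163})$ with the NIST reduction polynomial, the names `clmul`, `reduce_poly`, and `segmented_msd_mul` are hypothetical, and it models neither the pipeline registers nor the timing. It shows that splitting each w-bit digit into segments $w_1, \ldots, w_n$ and combining the segment products before reduction gives the same result as an unsegmented multiplication.

```python
from math import ceil

M = 163
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (assumed, NIST m = 163)
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2), no reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_poly(v):
    """Reduce a polynomial modulo POLY to fewer than M + 1 bits."""
    while v.bit_length() > M:
        v ^= POLY << (v.bit_length() - 1 - M)
    return v


def segmented_msd_mul(a, b, w, segments):
    """MSD-first digit-serial multiply; each w-bit digit of b is split into segments."""
    assert sum(segments) == w
    s = ceil(M / w)                       # number of digits of b
    acc = 0
    for i in range(s - 1, -1, -1):        # most significant digit first
        digit = (b >> (i * w)) & ((1 << w) - 1)
        partial, shift = 0, 0
        for wj in segments:               # one w_j x m product per segment
            seg = (digit >> shift) & ((1 << wj) - 1)
            partial ^= clmul(a, seg) << shift
            shift += wj
        acc = reduce_poly((acc << w) ^ partial)   # interleaved reduction
    return acc


if __name__ == "__main__":
    a = (1 << 162) | 0x1F2E3D4C5B6A7988
    b = (1 << 161) | 0xDEADBEEFCAFEBABE12345
    print(hex(segmented_msd_mul(a, b, 41, (14, 14, 13))))
```

In hardware, each segment product would sit in its own short combinational block feeding a register, which is where the shorter critical path comes from; the model above only confirms that the decomposition preserves the product.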
To evaluate our proposed segmented multiplier, an area and time complexity analysis is performed and presented in Table I, which also includes a comparison to state-of-the-art digit serial multipliers reported in [15]-[17]. For s = 4 or less, our proposed multiplier shows the same or better latency using similar or fewer resources. However, a key advantage of our proposed architecture is that we are able to achieve higher speed for the same (or less) area and the same (or less) latency; this is because our critical path delay can be modulated by the number of segmentations (n) with extra FFs. The value of n defines the critical path delay of the multiplier. The path delay is either $T_A + (\log_2(d/n))T_X$ for the GF2MUL (M) or $T_{MUX} + (\log_2(n + k))T_X$ for the reduction part (Rd). Thus, our critical path delay can be optimized (to achieve the desired high speed) by choosing an optimum number of segmentations (n). To generalize, from Table I, the best latency figure for a field multiplication in [15], [17] is $2\lceil\sqrt{m/d}\,\rceil$, whereas our multiplier's latency is $\lceil m/d \rceil$. As a rule of thumb, therefore, as long as $m < 4d$, our multiplier achieves a comparable or better latency figure. What is crucial, however, is that for comparable latency and the same digit size, our design can achieve an improved critical path delay of $T_A + (\log_2(d/n))T_X$ (due to GF2MUL), compared with $T_A + (\log_2 d)T_X$ in [15]-[17], by using an optimum segment size without increasing the latency of the multiplier. Thus, by utilizing similar area, our multiplier can achieve higher speed. At the extreme, the use of a full-precision multiplier (d = m) with an optimized segmentation would thus lead to the highest speed.
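The effect of n on the GF2MUL path can be illustrated numerically. The sketch below is an assumption-laden toy model: the function names are hypothetical, the logarithm is rounded up to whole XOR levels (the text gives the expression without a ceiling), and $T_A$ and $T_X$ are treated as unit delays rather than measured FPGA figures.

```python
def xor_levels(d: int, n: int) -> int:
    """XOR-tree depth ceil(log2(ceil(d / n))) for one segment of about d/n bits."""
    seg = -(-d // n)                 # ceil(d / n) in integer arithmetic
    return (seg - 1).bit_length()    # ceil(log2(seg)) for seg >= 1


def gf2mul_delay(d: int, n: int, t_and: float = 1.0, t_xor: float = 1.0) -> float:
    """Model of T_A + log2(d/n) * T_X with unit gate delays (illustrative)."""
    return t_and + xor_levels(d, n) * t_xor


if __name__ == "__main__":
    d = 41
    for n in (1, 2, 3):
        print(f"n = {n}: XOR levels = {xor_levels(d, n)}, delay = {gf2mul_delay(d, n)}")
```

For d = 41 the model drops one XOR level for each step from n = 1 to n = 3, which is in line with the qualitative claim that more segments shorten the critical path; on a real device the gain also depends on routing and on how the logic maps to LUTs.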
B. Optimized Memory Unit
A high-speed and flexible memory unit design can improve performance. We consider an optimized distributed-RAM-based memory unit. The unit contains an $8 \times m$ register file, one m-bit register (accumulator), and one shift register (Shiftreg). The $8 \times m$ register file has one m-bit input that can load data into any location of the register file and two m-bit output buses (A bus and B bus) that can access data from any location of the register file. The shift register can store data from any location of the register file and provides the w-bit digit ($b_i$) of the multiplier operand for the FF multiplication. The accumulator can save a result from the arithmetic unit or new data from the register file for a square operation. The accumulator and square circuit are connected such that repeated squaring can be done without saving to the register file. The repeated squaring improves the latency of multiplicative inversion as proposed in [6]. Any location of the memory unit can be written, read, or loaded into the shift register. This easy accessibility reduces the number of temporary registers needed for the PM. The memory unit consumes very little area while providing high-speed data access.
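To see why cheap repeated squaring matters for inversion, the Itoh-Tsujii method can be sketched in software. The code below is a generic illustration, not the authors' circuit: it assumes $GF(2^{163})$ with the NIST reduction polynomial, uses a plain binary addition chain on m - 1, and the function names are hypothetical. The inner loop of consecutive squarings corresponds to the kind of operation the accumulator and square circuit perform without touching the register file.

```python
M = 163
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (assumed, NIST m = 163)
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Bit-serial multiplication in GF(2^M)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def gf_sqr(a):
    return gf_mul(a, a)


def itoh_tsujii_inverse(a):
    """Return (a^-1, number of field multiplications) for nonzero a.

    Invariant: beta == a^(2^k - 1). The result is (a^(2^(M-1) - 1))^2.
    """
    beta, k, n_mul = a, 1, 0
    for bit in bin(M - 1)[3:]:           # bits of M - 1 after the leading 1
        t = beta
        for _ in range(k):               # k repeated squarings: beta^(2^k)
            t = gf_sqr(t)
        beta = gf_mul(t, beta)           # a^(2^(2k) - 1)
        k *= 2
        n_mul += 1
        if bit == "1":
            beta = gf_mul(gf_sqr(beta), a)   # a^(2^(k+1) - 1)
            k += 1
            n_mul += 1
    return gf_sqr(beta), n_mul


if __name__ == "__main__":
    a = 0x5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A
    inv, n_mul = itoh_tsujii_inverse(a)
    print(gf_mul(a, inv) == 1, n_mul)
```

With this chain the inversion needs only a handful of multiplications and m - 1 squarings, so the speed of back-to-back squaring largely determines the inversion latency.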
C. Scheduling for Point Operations
In this brief, we propose a new scheduling for the combined LD Montgomery PM, as shown in Algorithm 2. To schedule for no idle cycles, we combine the point addition and point doubling algorithms for the current value of $k_i = 1$, as shown in Algorithm 2. We observe that the product of the last multiplication is $X_1$ if $k_i = 1$ or $X_2$ if $k_i = 0$. Thus, the first multiplication of the loop should be independent of the last multiplication. For example, if the last product is $X_1$, then the next operands of multiplication are $X_2$ and $Z_1$. Otherwise, the next operands will be $X_1$ and $Z_2$. Thus, the first multiplication depends on the last $k_i$, which means the $k_{i+1}$ bit, as shown in Algorithm 2.
Fig. 2 illustrates the proposed no idle state schedule using a 41-bit digit size FF multiplier. The 41-bit digit size FF multiplier takes M = 4 cycles for the actual multiplication and c = 4 additional cycles, with two clock cycles for pipelining and two clock cycles for unloading from and loading to the memory unit. In a loop, the point operation in the projective coordinates system requires six multiplications. To ensure no idle state in the multiplication, a new multiplication is started every four clock cycles. Thus, two consecutive but independent multiplications overlap each other, as shown in Fig. 2 for $k_i = 1$ and
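The timing implied by this schedule can be worked through with a short script. This is a back-of-the-envelope sketch under stated assumptions: the names are hypothetical, the issue interval is taken equal to M, and the count of m - 1 = 162 ladder iterations for m = 163 is an assumption about the loop bound; initialization, coordinate conversion, and inversion are not included.

```python
M_CYCLES = 4        # cycles of actual multiplication (41-bit digit, m = 163)
C_CYCLES = 4        # 2 pipelining + 2 load/unload cycles
MULTS_PER_LOOP = 6  # multiplications per combined point operation


def schedule(n_mults: int = MULTS_PER_LOOP, issue_interval: int = M_CYCLES):
    """(start, finish) cycle of each multiplication when one is issued every interval."""
    return [(i * issue_interval, i * issue_interval + M_CYCLES + C_CYCLES)
            for i in range(n_mults)]


def loop_cycles() -> int:
    """Cycles per loop iteration when no issue slot is left idle."""
    return MULTS_PER_LOOP * M_CYCLES


if __name__ == "__main__":
    for i, (start, finish) in enumerate(schedule(), 1):
        print(f"mult {i}: cycles {start:2d}..{finish:2d}")
    iterations = 162  # assumed: m - 1 ladder steps for m = 163
    print("cycles per loop:", loop_cycles())
    print("cycles for all loops:", iterations * loop_cycles())
```

Because each multiplication finishes four cycles after the next one has started, the overhead c is hidden behind the following multiplication, and the loop cost depends only on the actual multiplication cycles.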
Again, the adder circuit placed in the common data path is capable of performing addition concurrently. The square operation takes three cycles: one cycle to save in the accumulator, one clock cycle for squaring, and one clock cycle for loading. Repeated squaring can be done without storing in the register file. Thus, double squaring takes four clock cycles. The total latency of the ECC is shown in Table II.
The proposed designs were implemented on different FPGA technologies, namely Virtex4 (LX25_12 for f163 and LX100_12 for f233 to f571), Virtex5 (XC5VLX50_3 for f163), and Virtex7 (Vx550T_3 for f163 and V585_T for f233 to f571), using Xilinx tools versions 13.2 and 14.5, respectively. The design was implemented on Virtex4 and Virtex5 technologies to allow for a fair comparison to the most relevant works, and on the Virtex7 to evaluate the performance on the newer technology. We present the implementation results after place and route in Table III. The Xilinx tools were used to set high-speed properties and apply subsequent timing constraints to improve the area-time product. Table IV also includes area-time performance and a comparison to the state of the art.
As shown in Table IV, the main contribution of the segmentation in the multiplier is an increase in the clock frequency while utilizing very few additional resources (FFs). The clock frequency of the three-segmented (3 Seg.) pipelined-multiplier-based ECC design is 290 MHz on the Virtex4, i.e., 38 MHz more than the respective implementation of the nonsegmented (No Seg.) multiplier-based ECC. The two-segmented (2 Seg.) pipelined-multiplier-based ECC implemented on Virtex5 shows the best throughput per slice (65.30), and the three-segmented-multiplier-based ECC on Virtex7 shows the highest speed (only 10.51 μs for an ECC PM). The optimum size of the segments is found by trial and error to achieve high throughput. Table IV shows comparisons with relevant high-performance ECC designs on FPGAs in terms of the efficiency metric throughput/area (($1 \times 10^6$/s)/slices) over $GF(2^{163})$ and $GF(2^{571})$. For $GF(2^{163})$, the previous best optimized work was reported in [7], where one 41-bit pseudopipelined Karatsuba multiplier was used with a so-called "no-idle cycles" PM approach to achieve a throughput/area figure of 11.92 on Virtex4. Our no-segment-based ECC design consumes less area (3623 slices) and achieves a higher clock frequency (252 MHz) than [7] (4080 slices, 197 MHz) and therefore has a 40% higher throughput/area efficiency. In particular, our three-segmented-based design shows 65% better efficiency than [7]. Our f571 design achieves a clock frequency of 180 MHz, whereas the work in [7] operates at a maximum of 107 MHz. One potential option for improving the area performance of [7] is to deploy an area-efficient Karatsuba multiplier [16]; however, this would be at the expense of increased critical path delay. Another optimized ECC in [8] used a full-length (164 bit) word serial Karatsuba multiplier with pipelining and was implemented on Virtex4 and Virtex5. The work in [8] uses a multiplier four times bigger than ours to achieve throughput/area figures of 11.55 and 29.96 on Virtex4 and Virtex5, respectively.
Our three-segmented 41-bit multiplier-based design on Virtex4 is 70% better, and our two-segmented 41-bit multiplier-based design on Virtex5 is 118% better, than the designs in [8]. In [10], the reported best throughput/area efficiency is based on three 330-bit multiplier-based ECC on Virtex5, which shows 9.86 in throughput/lookup tables (LUTs) (($1 \times 10^6$/s)/LUTs). Our two-segmented-multiplier-based ECC shows 17.9 in ($1 \times 10^6$/s)/LUTs, which is 82% better than the reported most efficient design in [10]. The hardware results presented in [11]-[14] use parallel multipliers to speed up their ECC designs but show poor throughput/area efficiency due to the large area consumed. Finally, our single-multiplier (41-bit)-based ECC implementation on Virtex7 takes 10.51 μs for a PM, which is faster than the reported high-speed works in [7], [9], [13], and [14] and the work on the Virtex4 reported in [8], and is comparable to the work in [12] while using far fewer resources.
VI. CONCLUSION
We have proposed a highly efficient FPGA ECC processor design for high-speed applications over $GF(2^m)$. Key contributions include a novel high-performance segmented pipelining MSD multiplication, a smart no-idle state scheduling that enables the clock cycles for loop operations in the PM to depend only on the actual clock cycles of the FF multiplications, and a highly optimized memory unit design.
To our knowledge, our design achieves the best throughput/area efficiency figure on FPGA reported to date. The best throughput/area design achieved a figure of 65.30 ($1 \times 10^6$/s)/slices, performing an ECC point multiplication in 14.06 μs while utilizing only 1089 slices. The fastest design achieved 10.51 μs for a point multiplication using only 1476 slices.
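The efficiency metric can be recomputed from the time and area figures quoted above. The following sketch assumes the metric is $10^6$ divided by the product of the PM time in microseconds and the slice count, which matches the unit ($10^6$/(seconds × slices)) used in this brief; the function name is hypothetical.

```python
def throughput_per_slice(time_us: float, slices: int) -> float:
    """Point multiplications per second per slice: 1e6 / (time in us * slices)."""
    return 1e6 / (time_us * slices)


if __name__ == "__main__":
    # Figures quoted in the conclusion
    print(f"best efficiency design: {throughput_per_slice(14.06, 1089):.2f}")
    print(f"fastest design:         {throughput_per_slice(10.51, 1476):.2f}")
```

The recomputed values agree with the reported 65.30 and 64.48 to within the rounding of the quoted times, which suggests the fastest design gives up very little efficiency for its lower latency.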
