A pipelined architecture is proposed in this work to speed up the point multiplication in elliptic curve cryptography (ECC). This is achieved, first, by pipelining the arithmetic unit to reduce the critical path delay and, second, by reducing the number of clock cycles (latency) through careful scheduling of the computations involved in point addition and point doubling. Together, these two factors reduce the time for one point multiplication computation. In addition, the small area overhead of this design gives a higher throughput/area ratio. The proposed architecture is synthesised on different FPGAs to compare with the state-of-the-art. The synthesis results over GF(2^m) show that the proposed design can work up to a frequency of 369, 357 and 337 MHz when implemented for m = 163, 233 and 283 bit key lengths, respectively, on Virtex-7 FPGA. The corresponding throughput/slice figures are 42.22, 12.37 and 9.45, which outperform existing implementations.
Introduction
Secure networks frequently use public key cryptographic algorithms such as Rivest-Shamir-Adleman (RSA) [1] and elliptic curve cryptography (ECC) [2]. However, ECC is becoming increasingly popular compared with RSA due to advantages such as shorter key lengths, lower hardware cost for an equivalent security level and lower power consumption [3] [4] [5] [6] [7]. These advantages make ECC usable in both high speed and low resource applications.
The typical hierarchy of ECC contains four layers, as shown in Fig. 1. At each layer, different operations need to be performed. These operations are arithmetic (addition, multiplication, squaring and inversion), point addition (PA) and point doubling (PD), point multiplication (PM) (also called scalar multiplication) and protocols. Arithmetic operations are performed at layer 1 while PA and PD operations are computed at layer 2. The PM is the core operation in ECC and is computed at layer 3. Finally, protocols (layer 4 operations) are the sets of rules used to govern data encryption and decryption.
To implement the PM operation, the National Institute of Standards and Technology (NIST) has proposed a number of standard elliptic curves over the prime field GF(p) and the binary extension field GF(2^m) [3]. However, the GF(2^m) field is commonly used for efficient hardware implementations [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]. Furthermore, field-programmable gate array (FPGA) based designs of ECC are gaining popularity due to their reconfigurability, availability (commonly available to everyone in the market) and shorter development time scales.
Related work
Several FPGA-based ECC architectures, either for high speed [4] [5] [6] [7] [8] [9] [10] [11] or low resource applications [12] [13] [14] [15] , are available in the literature.
High-speed applications:
In real-time applications, such as IP security (IPsec) and secure socket layer (SSL), high-speed implementation of an asymmetric cryptosystem is important [4] . The conventional practices for optimising high-speed ECC architectures involve (a) reduction of clock cycles (CCs) (latency) and (b) increasing the clock frequency for one PM computation.
To reduce the number of CCs, various techniques have been employed in [4] [5] [6] [7] [8] [9] [10] [11] . For example, the work in [4] presents three complex instructions while instruction level parallelism is used in [6, 8] to reduce the number of required CCs. Similarly, the effect of various digit sizes for digit serial finite field (FF) multipliers is explored in [7] . Moreover, the work presented in [7, 9, 11] duplicates multiple arithmetic blocks (such as adder, multiplier and squarer) to exploit the parallelism.
To optimise the operational frequency and to reduce the critical path delay, pipelining is frequently implemented. The architectures presented in [4] [5] [6] [7] [8] [9] [10] [11] require 3010, 1446, 1428, 2751, 1091, 3379, 780 and 450 CCs, respectively. The achieved frequencies (MHz) in [4] [5] [6] [7] [8] [9] [10] [11] are 154 (on Virtex 4), 143 (on Virtex 4), 185 (on Virtex 4), 250 (on Virtex 4), 121 (on Virtex 4), 262 (on Virtex 5), 153 (on Virtex 5) and 159 (LLECC_3M architecture on Virtex 7), respectively. The time for computing one PM is determined by dividing the number of CCs by the operational frequency. The time (in µs) required for one PM in [4] [5] [6] [7] [8] [9] [10] [11] is 19.5, 10, 7.7, 9.6, 9.0, 12.9, 5.1 and 2.83, respectively. Although the architectures reported in [4] [5] [6] [7] [8] [9] [10] [11] achieve higher speed (i.e. require less computational time for one PM), they utilise more hardware resources in terms of FPGA Slices, i.e. 16,209, 24,363, 20,807, 17,929, 10,417, 6536, 10,363 and 11,657, respectively. Such high resource utilisation is not suitable for constrained (low area) applications.
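The relation between latency, clock frequency and computation time stated above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `pm_time_us` is hypothetical, and the inputs are the CC and frequency figures quoted above for [4], [6] and [11].

```python
def pm_time_us(clock_cycles: int, freq_mhz: float) -> float:
    """Time for one PM in microseconds: clock cycles divided by frequency in MHz."""
    return clock_cycles / freq_mhz

# (reference, clock cycles, frequency in MHz) as quoted in the text
designs = [("[4]", 3010, 154), ("[6]", 1428, 185), ("[11]", 450, 159)]
for ref, ccs, mhz in designs:
    print(f"{ref}: {pm_time_us(ccs, mhz):.2f} us")
```

Dividing cycles by a frequency in MHz yields microseconds directly, which is why no explicit unit conversion appears.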
Hardware architectures for low area applications:
Low area implementations of asymmetric cryptosystems are important for embedded systems applications, such as connected vehicles, smart cards and smart cities [12] [13] [14] [15]. In [12], the critical path delay has been reduced by implementing a four-stage pipelined FF multiplier. In [13], a bit-serial multiplier is used to reduce the hardware complexity while compromising on the total number of CCs. Low-cost implementations of ECC using single adder, multiplier and squarer blocks are available in [14, 15]. The most relevant architectures, reported in [12] [13] [14] [15], require 1397, 52,012, 2,438,675 and 3426 CCs, respectively. The achieved frequencies (MHz) in [12] [13] [14] [15] are 147 (on Virtex 5), 550 (on Virtex 5), 12.5 (on Spartan 6) and 135 (on Virtex 7), respectively. Similarly, the time (in µs) required for one PM computation in [12] [13] [14] [15] is 9.5, 94.6, 195,094 and 25.3, respectively. The above discussion shows that the architectures reported in [12] [13] [14] [15] require higher computational time for one PM, but they consume fewer hardware resources (FPGA Slices), i.e. 3513, 4815, 1844 and 3657, respectively.
Importance of throughput/area
Section 1.1 reveals that state-of-the-art hardware architectures [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] are implemented, either for optimising throughput or area, without paying due attention to the overall performance in terms of throughput/area metric. The performance evaluation in terms of throughput/area is useful, where both the constraints (throughput and area) are required to be fulfilled at the same time. Furthermore, it has been argued in [11] that the main motivation behind the use of ECC is its suitability for high throughput and low area applications at the same time.
Therefore, the performance in terms of throughput/area is desirable in many real-time applications such as ambient intelligence and internet-based applications [16, 17], cloud computing [18], banking and other security applications (i.e. e-commerce, e-banking) [19]. For example, in ambient intelligence applications, a ubiquitous sensor network is deployed. The deployed sensors face constantly increasing computational demands to process data and provide various services to the end-users.
Similarly, high throughput/area implementations are also important for network-based applications such as SSL and IPsec protocols, which are commonly used today in over-the-web transactions [19]. Consequently, it is critical to achieve the desired throughput in a reasonable time with lower hardware resource utilisation. A comprehensive study of various design constraints in multiple applications is provided in [20].
Our contributions
Motivated by the discussion in Section 1.2, we propose a high throughput/area pipelined ECC architecture for the NIST curves [3] over GF(2^m) with m = 163, 233 and 283. Moreover, the data path of the proposed architecture depends upon the size of the underlying field (m). The proposed design is synthesised on different FPGA devices for performance estimation (Virtex 7) and for comparison with existing solutions (Virtex 4 and Virtex 5).
Previously, we have proposed a throughput/area processor for binary Huff curves [21]. In this paper, we target a pipelined architecture of ECC for those applications where the throughput/area is critical, such as internet-based applications [16, 17], cloud computing [18] and banking applications [19]. The contributions of this paper are listed below:
• A digit parallel multiplier is presented with the optimal digit size of 32 bits to reduce the latency as well as critical path delay.
• For PM computation in state-of-the-art architectures [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] , the required arithmetic operations are an adder, multiplier, squarer and inversion. Separate squarer and multiplier blocks are generally useful when multiple CCs are needed to perform one FF multiplication. However, our digit-parallel multiplier, mentioned in the previous point, is capable of producing the result of each FF multiplication in one CC. In other words, multiplier and squarer have the same computational cost in our proposed design. Therefore, squaring instructions can be performed by providing the same inputs to the multiplier block. It has allowed us to reduce the overall area of the design.
• To optimise the clock frequency and to reduce the critical path delay, pipeline registers have been used at the input of the arithmetic unit (AU). Moreover, considering the pipeline hazards, such as read after write (RAW), PA and PD instructions have been efficiently scheduled.
• Finally, a dedicated finite-state-machine (FSM) based control block has been used to speed up the control functionalities.
The remainder of this paper is organised as follows: In Section 2, preliminaries related to PM computation on ECC over GF(2^m) are presented. The proposed efficient throughput/area architecture for ECC is discussed in Section 3. Section 4 presents the synthesis and performance results of the proposed hardware architecture along with the comparison with the state-of-the-art. Finally, Section 5 concludes the paper.
Setting the stage
The introductory part of this paper mentions two different fields for the computation of PM in ECC: the prime field GF(p) and the binary extension field GF(2^m). For software implementations, the prime field is suitable, while the binary field is useful for hardware implementations [14]. Furthermore, each of these fields (prime field as well as binary extension field) may be used either with simple affine coordinates or with projective coordinates.
In a simple affine coordinate system, an FF inversion operation is required during each PA and PD computation [9] [10] [11]. For example, for an 'm' bit key length, 'm' inversion operations are required. However, the number of required inversion operations can be reduced by implementing the projective coordinate system, where only two inversion operations are required to compute PM [22]. In addition to the reduced number of required inversion operations, projective coordinates are well suited to achieve efficient throughput/area ECC designs as compared to affine coordinates [23].
For each coordinate system, two types of field representations are available, i.e. normal basis and polynomial basis. Normal basis representation is useful where frequent squaring operations are involved; however, for efficient FF multiplications, polynomial basis representation is used [4, 14, 15] .
Based on the above-mentioned scenario, we have used the binary extension field with projective coordinate (Lopez Dahab) systems. Lopez Dahab projective coordinate system requires a lower number of field multiplications for PM computation [22] . Moreover, for coordinate representation, we have selected the polynomial basis representation due to efficient FF multiplications.
Point multiplication on ECC over GF(2^m)
For GF(2^m), a projective (Lopez-Dahab) form of the elliptic curve is defined as the set of points P(X : Y : Z) satisfying the following equation:
$$Y^2 + XYZ = X^3 Z + aX^2 Z^2 + bZ^4 \qquad (1)$$
In (1), X, Y and Z are the projective coordinates of the point, while a and b are the curve constants with b ≠ 0. Consider a base point 'P' and a large integer 'k' of the size of the underlying field 'm'; then the PM is the addition of 'k' copies of point 'P', i.e. Q = k ⋅ P = P + P + ⋯ + P, where 'Q' is the new point on the defined elliptic curve. To compute PM, we have used the Montgomery algorithm [23], represented as Algorithm 1. The Montgomery algorithm takes a scalar multiplier 'k' along with the initial point 'P' with its coordinates (x_p, y_p) as input and produces the (x_q, y_q) coordinates of the final point 'Q' as output. It contains three steps:
• Step 1 is the initialisation, where affine to projective (Lopez Dahab) conversions are performed.
• Step 2 is to compute PM by performing point addition (P = P + Q) and point doubling (P = 2P) operations, based on the value of a scalar multiplier (k i ).
• Finally, step 3 is to perform projective to affine conversions (reconversion step).
It is important to mention here that this work handles the side channel and power attacks at the algorithmic level through the use of the Montgomery algorithm. The resistance against the side channel and power attacks is an inherent feature of the Montgomery algorithm (Algorithm 1). Therefore, we have used it for the computation of point multiplication (consisting of PA and PD).
In the Montgomery algorithm, the number of arithmetic operations required for the PA and PD steps is independent of the value of the ith key bit (i.e. k_i). In other words, the same number of arithmetic operations is required, irrespective of the value of the key bit (scalar multiplier). These arithmetic operations are six multiplications, five squarings and three additions, as shown in Algorithm 1.
Due to the same number of required arithmetic operations in PA and PD steps, the Montgomery algorithm provides resistance against simple power and side channel attacks. Moreover, in our design, we have ensured that the sequence of these arithmetic operations remains the same during the execution of PA and PD. Therefore, the inherent feature of the Montgomery algorithm (independence of the arithmetic operations from the value of the key bit) remains unaffected.
Proposed pipelined architecture
Fig. 2 shows the proposed pipelined hardware architecture which consists of (a) register file (RF), (b) routing networks (RNs), (c) an efficient AU, (d) pipeline registers and (e) a dedicated control unit (CU). The placement of pipeline registers is not shown in Fig. 2; however, it is discussed in Section 3.4. The initial curve parameters (x_p, y_p and b) for the proposed design have been selected from NIST [3].
Register file
The RF of the proposed design contains a register array of size 8 × 'm', as shown in Fig. 2. The value of 'm' specifies the width of each location and depends upon the size of the field (163, 233 and 283). The main purpose of the RF unit is to store the intermediate results (including T4) while implementing the PM algorithm (Montgomery in our case) for the corresponding ECC curve. Furthermore, it contains two multiplexers (Mux M1 and Mux M2), which are used to fetch the operands (OP1 and OP_2) from the RF unit, and a single demultiplexer (Dmux) to modify the RF contents (Mplex_out).
Routing networks
The proposed design comprises two RNs, Mux M3 and Mux M4, as shown in Fig. 2. The inputs to Mux M3 are the curve parameters and an operand from the RF (OP1). The output of Mux M3 is an operand (OP_1) to the AU. The inputs to Mux M4 are the output of the AU and the output of Mux M3 (OP_1), and its output goes to the input of the RF unit.
Arithmetic unit
The AU of the proposed crypto processor contains adder and multiplier blocks/units, as shown in Fig. 2. The adder is implemented through bitwise exclusive-OR gates. Polynomial squaring is implemented by providing the same inputs to the multiplier unit. For the multiplication of two 'm' bit polynomials (A(x) × B(x)), we have implemented a parallel Least Significant Digit (LSD) multiplier with a digit size of d = 32 bits. The polynomial B(x) is split into digits of d = 32 bits, and each 'd' bit digit is multiplied in parallel with the 'm' bit polynomial A(x) to generate the partial products. For further mathematical formulations and an algorithmic overview of digit level multipliers, interested readers can consult [24].
To compute FF multiplication operations over GF(2^163), a total of six digits (B1 to B6) are required (32 + 32 + 32 + 32 + 32 + 3). Out of these six digits, the size of five digits (B1 to B5) is 32 bits, whereas the size of the sixth digit (B6) is 3 bits only. Similarly, for GF(2^233), a total of eight digits (B1 to B8) are required (32 + 32 + 32 + 32 + 32 + 32 + 32 + 9). Out of these eight digits, the size of seven digits (B1 to B7) is 32 bits each, while the size of the eighth digit (B8) is 9 bits. Moreover, for GF(2^283), a total of nine digits are required (32 + 32 + 32 + 32 + 32 + 32 + 32 + 32 + 27), as shown in Fig. 2. Out of these nine digits, eight digits (B1 to B8) have a size of 32 bits and the size of the last digit (B9) is 27 bits. Parallel multiplication of each digit B1 to B9 with the 'm' bit polynomial A(x) results in polynomials of 'd + m − 1' bits, and these resultant polynomials are represented as C1 to C9 in Fig. 2. Once the multiplication of each 'd' bit digit with the 'm' bit polynomial is completed, the final resultant polynomial (D(x)) of size '2 × m − 1' bits is created by XOR and shift operations on C1 to C9.
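The digit partitioning and the XOR-and-shift accumulation described above can be modelled in software as follows. This is a behavioural sketch only: polynomials are stored as Python integers (bit i is the coefficient of x^i), the loop is sequential whereas the hardware computes all partial products in parallel, and the function names are hypothetical.

```python
D = 32  # digit size in bits, as used in the text

def digit_sizes(m, d=D):
    """Widths of the digits B1..Bn that an m-bit operand is split into."""
    full, rest = divmod(m, d)
    return [d] * full + ([rest] if rest else [])

def clmul(a, b):
    """Carry-less (GF(2)[x]) product of two polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def digit_mul(a, b, m, d=D):
    """Unreduced product D(x) = A(x) * B(x) built from per-digit partial products."""
    mask = (1 << d) - 1
    result = 0
    for i in range(len(digit_sizes(m, d))):
        b_i = (b >> (i * d)) & mask   # digit B(i+1), least significant first
        c_i = clmul(a, b_i)           # partial product C(i+1), at most d + m - 1 bits
        result ^= c_i << (i * d)      # shift into position and accumulate with XOR
    return result

for m in (163, 233, 283):
    print(m, digit_sizes(m))
```

The result still has up to 2 × m − 1 bits, so a separate reduction step is needed afterwards.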
To summarise, the multiplication of two 'm' bit polynomials produces a resultant polynomial of size '2 × m − 1' bits. Consequently, after each field multiplication, FF reduction is required. Reduction operations are performed by implementing the NIST reduction algorithms over GF(2^163), GF(2^233) and GF(2^283), as described in Algorithm 2.41, Algorithm 2.42 and Algorithm 2.43 of [22], respectively. To compute an inversion over the GF(2^m) field, the square version of the Itoh-Tsujii algorithm [25] has been implemented using the multiplier block.
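For illustration, reduction and inversion over GF(2^163) can be sketched in Python as below. Two assumptions apply: the reduction is a generic bit-level loop modulo the NIST polynomial x^163 + x^7 + x^6 + x^3 + 1 rather than the word-level Algorithm 2.41 of [22], and the inversion follows a textbook Itoh-Tsujii addition chain, which may differ from the exact variant implemented in the proposed design. Function names are hypothetical.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # NIST polynomial for m = 163

def reduce_poly(c, m=M, f=F):
    """Reduce polynomial c modulo f by cancelling one leading bit at a time."""
    while c.bit_length() > m:
        c ^= f << (c.bit_length() - 1 - m)
    return c

def gf_mul(a, b):
    """Field multiplication: carry-less product followed by reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return reduce_poly(r)

def gf_sqr(a):
    return gf_mul(a, a)   # squaring through the multiplier, as in the text

def gf_inv(a, m=M):
    """Itoh-Tsujii inversion: a^(-1) = (a^(2^(m-1) - 1))^2 for a != 0."""
    beta, k = a, 1                    # invariant: beta = a^(2^k - 1)
    for bit in bin(m - 1)[3:]:        # bits of m - 1 after the leading 1
        t = beta
        for _ in range(k):            # t = beta^(2^k)
            t = gf_sqr(t)
        beta, k = gf_mul(t, beta), 2 * k
        if bit == "1":
            beta, k = gf_mul(gf_sqr(beta), a), k + 1
    return gf_sqr(beta)

a = 0x123456789ABCDEF                 # arbitrary non-zero field element
print(hex(reduce_poly(1 << 163)), gf_mul(a, gf_inv(a)))
```

Only squarings and multiplications are used, which is consistent with reusing the multiplier block for inversion.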
Inclusion of pipeline registers
To achieve optimal throughput, the first step is to evaluate the available options for pipelining. Consequently, the circuit can be divided into three parts, the first of which (a) comprises M1, M2 and M3, used for the read operation.
The Montgomery algorithm, presented in Section 2 of this article, may cause read after write (RAW) hazards in the context of pipelining. Therefore, before developing the control section in Section 3.5, it is necessary to generate the instruction sequence of the Montgomery algorithm for the appropriate placement of pipeline registers. Consequently, the sequences of instructions and the corresponding actions performed in different pipeline stages are provided in Table 1 for three different cases: (1) no pipeline registers, (2) two-stage pipelining and (3) three-stage pipelining. The first column of Table 1 presents the CCs, whereas the second column shows the sequence of instructions with no pipeline registers. Placement of pipeline registers can cause different data hazards such as RAW, write after read (WAR) and write after write (WAW) [21]. The term hazard implies the prevention of the next instruction from execution until the read/write-back (WB) operation of the previous instruction is completed. Consequently, the third column shows the corresponding RAW hazards. Finally, the fourth and fifth columns present the proposed scheduling of PA and PD instructions with two-stage and three-stage pipelines, respectively.
As shown in Table 1, the sequence of instructions without pipelining requires a total of 14 CCs. For two-stage and three-stage pipelined instruction scheduling, a total of 17 and 20 CCs are needed, respectively. Furthermore, to compute the PA and PD operations of the Montgomery algorithm (presented in Algorithm 1) for an m bit key length, a total of 14 × m, 17 × m and 20 × m CCs are required with no pipeline, two-stage pipeline and three-stage pipeline architectures, respectively. Consequently, the addition of a third pipeline stage for WB is not efficient, as it adds more CCs (a total of 20 CCs) due to the RAW hazard, whereas the increase in frequency is not high enough to give an overall throughput higher than that of a two-stage pipelined architecture. Moreover, the addition of registers at the output of the AU further reduces the overall throughput/area performance. Therefore, in subsequent sections, only the information related to the two-stage pipelined architecture is described.
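The trade-off can be made concrete with a short calculation. The sketch below uses only the per-iteration CC counts quoted above (14, 17 and 20) and covers the PA/PD loop alone, ignoring the initialisation and reconversion steps; the names are hypothetical. It reports the clock-frequency ratio a deeper pipeline must exceed before its loop time becomes shorter.

```python
CYCLES_PER_BIT = {"no pipeline": 14, "two-stage": 17, "three-stage": 20}

def pa_pd_cycles(m, scheme):
    """Clock cycles spent in the PA/PD loop for an m-bit key."""
    return CYCLES_PER_BIT[scheme] * m

def breakeven_freq_ratio(deeper, shallower):
    """Frequency ratio the deeper pipeline must exceed to reduce the loop time."""
    return CYCLES_PER_BIT[deeper] / CYCLES_PER_BIT[shallower]

for m in (163, 233, 283):
    print(m, {s: pa_pd_cycles(m, s) for s in CYCLES_PER_BIT})
print(round(breakeven_freq_ratio("three-stage", "two-stage"), 3))
```

Under these assumptions, a three-stage design would need a clock roughly 18% faster than the two-stage design just to match its loop time.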
Dedicated CU
An FSM-based dedicated CU is designed to perform control functionalities. The CU generates the signals for the components of the RNs as well as the read and write addresses for the RF unit. The control signals are shown as red dotted lines in Fig. 2, whereas the corresponding FSM that generates these signals is shown in Fig. 3.
To implement the Montgomery algorithm for ECC, the FSM incorporates a total of 121 states for a two-stage pipelined architecture:
• St: 0 is an idle state, while during St: 1 to St: 6, control signals for affine to projective conversions are generated.
• The subsequent states generate the control signals for the proposed scheduling of PA and PD for the PM step of the Montgomery algorithm, as shown in Table 1.
The total number of CCs for the proposed architecture can be calculated by using (2). The CC information for the proposed architecture with different key lengths is further provided in Table 2. In (2), the term 'Initial' defines the initialisation part of Algorithm 1, 'm' defines the key length and 'Inv' defines the inversion operation required in the reconversion part of Algorithm 1. Similarly, in Table 2, the first column shows the key length, whereas the CCs required for the initialisation part of Algorithm 1 (Initial) are presented in the second column. The third column shows the CCs for the PA and PD computations of Algorithm 1. The CCs required for each inversion (Inv) and for the reconversion (Recon) part of Algorithm 1 are presented in the fourth and fifth columns, respectively. Finally, the total CCs for implementing Algorithm 1 are presented in the last column of Table 2.
Implementation and discussion of results
As shown in Section 3.4, a two-stage pipeline architecture provides higher performance in terms of throughput/area ratio than a three-stage pipeline architecture. Therefore, we have implemented and presented the results for a two-stage pipeline architecture only. However, to perform a fair comparison with the state-of-the-art, it is necessary to first define the performance metric. Consequently, Section 4.1 elaborates the target performance metric (throughput/area). The implementation results for the proposed two-stage pipeline architecture are given in Section 4.2. Finally, Section 4.3 provides a comprehensive comparison of the proposed architecture with existing solutions.
Performance metric
To analyse the performance of proposed PM architecture, with different key sizes on FPGAs, a throughput over area (in terms of throughput/slices) metric is considered in this work and is presented in (3). The simplified form of (3) is further presented in (4)
In (3) and (4), the term throughput is the number of PM computations (i.e. Q = k ⋅ P) per second, i.e. the reciprocal of the time required for one PM in s, which is calculated by using (5). Similarly, 'Q' is the final point on the elliptic curve, 'k' is the scalar multiplier, 'P' is the initial point on the elliptic curve and the term slices refers to the utilised area on the selected FPGA device. The term '10^6' in (4) simply simplifies (3) when the time for one PM is expressed in µs.
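As a worked check of the metric, the sketch below evaluates throughput/slices as 10^6 divided by the product of the PM time in µs and the slice count. This form is an assumption inferred from the figures reported for [4], [5], [6] and [12] elsewhere in this paper, which it reproduces; the function name is hypothetical.

```python
def throughput_per_slice(time_us: float, slices: int) -> float:
    """PM operations per second per slice, with the PM time given in microseconds."""
    return 1e6 / (time_us * slices)

# (reference, time for one PM in us, FPGA slices) as quoted in this paper
reported = [("[4]", 19.5, 16209), ("[5]", 10.0, 24363),
            ("[6]", 7.7, 20807), ("[12]", 9.5, 3513)]
for ref, t_us, slices in reported:
    print(f"{ref}: {throughput_per_slice(t_us, slices):.2f}")
```

The printed values agree with the throughput/slice figures of 3.16, 4.10, 6.24 and 29.96 quoted for these references.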
Implementation results
The proposed two-stage pipelined architecture for the NIST recommended binary elliptic curves over GF(2^163), GF(2^233) and GF(2^283) is implemented (synthesised) on Virtex-7 FPGA (V7-XC7VX690T) technology using the Xilinx ISE (14.2) design suite. The synthesis and performance (throughput/slice) results of the proposed design are given in Table 3.
The first column of Table 3 shows the different key sizes 'm' of 163, 233 and 283. The second, third and fourth columns present the FPGA area information in terms of Slices, LUTs and FFs, respectively. The fifth column provides the operational frequency (Freq. MHz). The time required for the computation of one point multiplication (in µs) is presented in the sixth column. Finally, the last column presents the achieved results in terms of the above performance metric (throughput/slices).
As shown in Table 3, the proposed design operates at up to 369, 357 and 337 MHz for m = 163, 233 and 283, respectively, with corresponding throughput/slice figures of 42.22, 12.37 and 9.45. Consequently, due to the achievement of high frequencies and lower hardware resource utilisation, the proposed architecture results in a high throughput/slices ratio.
Performance comparison with state-of-the-art
Section 4.2 provides the implementation results on Virtex-7 FPGA (V7-XC7VX690T). However, to perform a fair comparison with the most relevant existing works over GF(2^163), the proposed design is also implemented for Virtex-4 (V4-XC4VLX100) and Virtex-5 (V5-XC5VFX200T) devices. The comparison results are summarised in Table 4.
Our previous low-area implementations of ECC architectures, presented in [14, 15], are the best-reported implementations in terms of area optimisation on Spartan 6 (XC6SLX16) and Virtex 7 (XC7VX690T) devices, respectively. The highly optimised throughput/slice ECC processor presented in this paper over GF(2^163) on Virtex 7 achieves an 80% higher value of throughput/slices than our previous works in [14, 15].
On Virtex 4, the previous best-reported architecture in terms of throughput/slices (6.24) is presented in [6] and consumes 20,807 slices to compute one PM in 7.7 µs using three 82-bit parallel multiplier cores. The proposed implementation on Virtex 4 shows 19% higher throughput/slice figure (7.69) and consumes 64% lower FPGA slices (7519) as compared to work in [6] .
In [4], a seven-stage pipeline architecture is presented that uses 16,209 slices and results in a throughput/slice ratio of 3.16 on Virtex 4. Our two-stage pipelined architecture on Virtex 4 consumes 54% lower area and shows 12% speed improvement as compared to the work in [4]. Therefore, our work achieves a 59% higher throughput/slice ratio than the solution proposed in [4]. This is due to the placement of pipeline registers only at the input of the AU, while in [4] multiple registers have been used in the data path. The use of multiple registers in the data path further increases the hardware resources, so the overall throughput/area ratio is affected.
In [5], a high-speed design is presented which utilises 24,363 slices to compute one PM in 10 µs and, therefore, achieves a throughput/slice ratio of 4.10. Our proposed work on Virtex 4 consumes 70% lower slices and shows a 47% better throughput/slice figure than the work presented in [5]. In [7], multiple arithmetic blocks (i.e. three FF multipliers connected serially and four FF squarers connected in parallel) are used to achieve a high speed of 9.6 µs by consuming 17,929 slices. Our proposed work utilises only 7519 slices, which is 59% lower than [7]. Additionally, the proposed architecture provides a 25% higher throughput/slice ratio when compared with [7].
To achieve higher performance while utilising lower hardware resources, the work presented in [8] employs a Karatsuba multiplier with no idle CC. Other arithmetic instructions (i.e. addition and squaring) are performed in parallel with the Karatsuba multiplier in [8]. The proposed design in this article utilises 28% lower slices and shows a 19% higher throughput/slice ratio as compared to [8]. To achieve high performance in [9], parallelisation at the hardware level has been obtained by using two multiplier, two adder and two squarer blocks. Our proposed work utilises 42% lower hardware resources in terms of slices and achieves a 41% better throughput/slice ratio than the parallelised architecture in [9]. Pipelining in [8, 9] is achieved by the placement of registers inside the FF multiplier. This increases the number of CCs required to perform one multiplication. On the other hand, the two-stage pipelined architecture in this paper performs one FF multiplication in one CC. Moreover, the proposed architecture achieves 33, 39, 19, 48 and 15% improvement in clock frequency over [4-6, 8, 9], respectively.
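The percentages quoted for [6] and [4] can be reproduced by taking the difference relative to the larger of the two values, i.e. the proposed figure for throughput/slice and the reference figure for slices. This reading is inferred from the reported numbers rather than stated explicitly, and the helper name below is hypothetical.

```python
def relative_gain_pct(larger: float, smaller: float) -> float:
    """Difference between two figures as a percentage of the larger one."""
    return 100.0 * (larger - smaller) / larger

# Proposed design on Virtex 4: 7519 slices, throughput/slice 7.69
print(round(relative_gain_pct(7.69, 6.24)))     # throughput/slice versus [6]
print(round(relative_gain_pct(20807, 7519)))    # slices versus [6]
print(round(relative_gain_pct(7.69, 3.16)))     # throughput/slice versus [4]
print(round(relative_gain_pct(16209, 7519)))    # slices versus [4]
```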
On Virtex 5, the best-reported throughput/slice result over GF 2 163 is 29.96 which is achieved by implementing a four-stage pipelining in [12] . It consumes a total of 3513 FPGA slices and requires 1397 CCs. The proposed two-stage pipeline architecture requires only 2027 slices which are 43% lower than [12] and shows a 19% higher throughput/slice ratio. In comparison with [9, 10] , our architecture shows throughput/slice improvement of 69 and 49%, respectively.
The most recent solution, presented in [11], has implemented two different architectures, i.e. for high-performance ECC (HPECC) and for low latency ECC (LLECC). The proposed two-stage pipeline architecture outperforms the LLECC architecture presented in [11] on Virtex 5 as well as Virtex 7. On Virtex 5 technology, a 41% improvement in throughput/slice ratio has been observed, while the improvement on the Virtex 7 device is 29%. Finally, the proposed architecture achieves a 94% higher throughput/slice ratio as compared to the solution in [13] on Virtex 5 as well as Virtex 7 devices.
Conclusions
This paper presents a pipelined architecture for point multiplication on FPGA over GF(2^163) to GF(2^283), which outperforms other solutions in terms of throughput/area (area in FPGA slices) for high-performance applications. The key contributions include: (i) an efficient parallel LSD-based FF multiplier to perform field multiplication in a single CC, (ii) the placement of pipeline registers at the input of the AU to reduce the critical path and (iii) the efficient scheduling of point addition and point doubling operations to reduce the number of required CCs. The proposed design for GF(2^163) provides a throughput/slice figure of 42.22 on Virtex 7, which is higher than the relevant state-of-the-art solutions. Furthermore, our architectures outperform others in terms of FPGA area (slices) as well as operating frequency.
