We present a high-performance Elliptic Curve Cryptographic (ECC) processor that supports arbitrary prime field and curve parameters. A novel pipeline architecture of the Montgomery multiplication by using the inherent DSP blocks in modern FPGAs is proposed to speed up the point scalar multiplication. In addition, the improved operation scheduling is presented to optimize the operation cycles further. With Xilinx Virtex-5 FPGA devices, a 256-bit point scalar multiplication can be performed in 0.86 ms at 263 MHz, about 3.14 to 11.27 times faster than other designs with comparable functionalities. Our processor therefore outperforms others significantly in terms of throughput, area, and cost-effectiveness.
Introduction
Digital security has become an urgent need for many vital modern applications. Elliptic Curve Cryptography (ECC), independently proposed by Miller [1] and Koblitz [2], has been considered mature as compared with conventional public-key cryptosystems, e.g., RSA cryptography. As an alternative to RSA, ECC provides much higher levels of security strength with the same key length. Therefore, ECC has been widely adopted in modern security standards [3] [4] [5].
There are two distinct requirements for practical use of ECC. Ultra low-cost design (e.g., in terms of power, energy, and hardware area) is demanded for portable devices and ubiquitous applications, e.g., emerging RFID applications. At the other end, extremely high-speed and powerful cryptographic engines are required for server-side applications to deliver maximum performance. Both of these diverse criteria can be optimized with hardware accelerators to achieve the best cost-effectiveness.
Recently, many high-performance ECC designs on FPGA (Field-Programmable Gate Array) devices have been published (e.g., ECC over the binary field GF($2^m$) in [6, 7], the prime field GF(p) in [8] [9] [10] [11], and dual-field in [12]). FPGAs inherently provide fast operating frequency, high-performance embedded arithmetic components, and great reconfigurability with lower inventory cost, making them an attractive alternative to the ASIC approach for server-side security applications.
In [6], a two-stage-pipelined word-serial multiplier has been presented for a specific irreducible polynomial. Different levels of parallelism among design hierarchies for ECC have been explored in [7]. A 256-bit full-word Montgomery multiplier has been presented to reduce the operation cycles significantly in [8]. In addition, a dual-field ECC processor utilizing instruction-level parallelism with one to four Modular Arithmetic Logic Units (MALUs) has been proposed in [12].
Several approaches have also utilized the inherent DSP (Digital Signal Processing) blocks in FPGAs to optimize the overall throughput and cost [9] [10] [11]. By taking advantage of the high-speed DSP blocks, the design in [9] can improve the operating frequency to 490 MHz. In [10], the Hiasat multiplier with DSP blocks performed one 127-bit modular multiplication in one clock cycle. Furthermore, the GLV (Gallant, Lambert, and Vanstone) method [13] was applied to replace a larger point scalar multiplication with two smaller ones to reduce the overall complexity. However, both of these fast ECC designs were designed to support specific prime fields, i.e., P-224 and P-256 recommended by NIST [14] in [9] and the Mersenne prime of type $2^n - 1$ in [10], which limits their flexibility substantially. Recently, a design based on the residue number system has been shown in [11], which achieves a higher degree of parallelism with a large number of DSP blocks, resulting in a fast ECC implementation for general prime fields.
In this paper, we present a high-performance processor for ECC over GF(p) that uses the inherent DSP blocks for extremely fast Elliptic Curve (EC) arithmetic. Arbitrary prime fields and elliptic curves are supported. We propose a novel architecture of the Montgomery multiplier that cascades the DSP blocks to reduce the number of cycles and to eliminate extra routing delay. In addition, to further optimize the operation cycles, an improved operation scheduling for the finite field operations in EC arithmetic is presented that overlaps successive Montgomery multiplications, i.e., the modular reduction stage of the present Montgomery multiplication can be done in parallel with the multiplication stage of the next one. Successive iterations in the point scalar multiplication can be overlapped as well. Our ECC processor can perform a 256-bit point scalar multiplication in 0.86 ms at 263 MHz on Xilinx Virtex-5 FPGAs. The comparison with other ECC implementations justifies the performance, cost-effectiveness, and flexibility of our approach.
The Word-based Montgomery Multiplier
Our processor focuses on the ECs over GF(p) in the IEEE 1363 Standard Specification for Public-Key Cryptography [3]. The standardized EC over GF(p) is $y^2 = x^3 + \alpha x + \beta$, where $x, y \in$ GF(p) and $4\alpha^3 + 27\beta^2 \neq 0 \pmod{p}$. EC point scalar multiplication, the most important operation in ECC, can be decomposed into iterative point doubles and point additions, which involve finite field operations. We adopt the addition-and-subtraction method in [3] for the point scalar multiplication. In addition, the Jacobian projective coordinate $(x, y, z) \to (x/z^2, y/z^3)$ over GF(p) is used in our processor to effectively replace the field inversion with several field multiplications.
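The coordinate mapping above can be illustrated with a minimal Python sketch. It assumes the textbook Jacobian doubling formulas for a short Weierstrass curve; the function names, the toy curve $y^2 = x^3 + 2x + 3$ over GF(97), and the sample point are illustrative choices and are not taken from the processor.

```python
def jacobian_double(X, Y, Z, a, p):
    """Double a Jacobian point (X, Y, Z) on y^2 = x^3 + a*x + b over GF(p).

    Only field multiplications and additions are used; no inversion is needed.
    """
    if Y == 0 or Z == 0:
        return (1, 1, 0)  # point at infinity (convention assumed for this sketch)
    S = (4 * X * Y * Y) % p
    M = (3 * X * X + a * pow(Z, 4, p)) % p
    X2 = (M * M - 2 * S) % p
    Y2 = (M * (S - X2) - 8 * pow(Y, 4, p)) % p
    Z2 = (2 * Y * Z) % p
    return (X2, Y2, Z2)


def jacobian_to_affine(X, Y, Z, p):
    """Map (X, Y, Z) to (X / Z^2, Y / Z^3); the single inversion happens here."""
    z_inv = pow(Z, -1, p)  # modular inverse (Python 3.8+)
    z_inv2 = (z_inv * z_inv) % p
    return ((X * z_inv2) % p, (Y * z_inv2 * z_inv) % p)


# Toy example: the point (3, 6) lies on y^2 = x^3 + 2x + 3 over GF(97).
p, a = 97, 2
doubled = jacobian_to_affine(*jacobian_double(3, 6, 1, a, p), p)
print(doubled)  # (80, 10)
```

The inversion is deferred to the final conversion, so a whole scalar multiplication pays for it once rather than once per point operation.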
The Montgomery multiplication algorithm is a well-known fast modular multiplication method. In order to utilize the inherent high-performance DSP blocks in modern FPGAs to accelerate the computation, we propose the word-based Montgomery multiplication shown in Algorithm 1. Different from the Multiple-word High-Radix Montgomery Multiplication (MWR$2^k$MM) in [15] and the high-radix scheme in [16], we partition the operands according to the word width of the arithmetic components in the DSP blocks in order to maximize the utilization of the DSP blocks and to reduce the number of cycles. In addition, we apply the algorithm to a pipelined architecture by cascading the DSP blocks to improve the frequency.
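Algorithm 1 A word-based Montgomery multiplication.
Input: A, B, p, and p', where p is an n-bit prime, A, B $\in$ GF(p), and [...]
Output: $S = A \times B \times 2^{-n} \pmod{p}$.
[...]
5: $(c_j, s_j) = s_j + a_i \times b_j + c_{j-1}$
[...]
9: for j = 0 to e - 1 do // Inner Loop 2
10: $(c_j, s_j) = s_j + q \times p_j + c_{j-1}$
11: end for
12: $S = S/2^k$
[...]
In Algorithm 1, let an m × k-bit multiplier be used. We partition the multiplicand A into w words and the multiplier B into e words, where $w = \lceil n/k \rceil$, $e = \lceil (n+k)/m \rceil$, and n is the field size. Therefore, $A = \sum_{i=0}^{w-1} a_i 2^{ki}$ and $B = \sum_{j=0}^{e-1} b_j 2^{mj}$.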
As a result, an n × n-bit Montgomery multiplication is divided into multiple m × k-bit atomic multiplications. In addition, to effectively reduce the operation cycles, 24 × 16-bit multipliers are used for our 256-bit Montgomery multiplication (i.e., m = 24, k = 16, w = 16, and e = 12), where the inherent DSP block in the Xilinx Virtex-5 family comes with a 25 × 18-bit multiplier. Figure 1 shows the data dependency graph of Algorithm 1. Each node consists of one 24 × 16-bit multiplication and at most two 40-bit additions. In addition, nodes $A_{i,j}$ and $B_{i,j}$ represent the computation of $(c_j, s_j) = s_j + a_i \times b_j + c_{j-1}$ and $(c_j, s_j) = s_j + q \times p_j + c_{j-1}$, respectively (see lines 5 and 10 in Algorithm 1). Because the output of node $A_{i,j}$ is one of the inputs of node $B_{i,j}$, Inner Loop 1 and Inner Loop 2 in each iteration of the Outer Loop can be overlapped for $e - 1$ stages. Moreover, because the intermediate result S discards the least significant k bits at the end of every iteration of the Outer Loop (see line 12), node $A_{i,j}$ for $0 < i \le w - 1$ is fed by node $B_{i-1,j+1}$ instead of $B_{i-1,j}$, i.e., Inner Loop 2 and the successive Inner Loop 1 are overlapped for $e - 2$ stages.
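The word partitioning described for Algorithm 1 can be sketched in software as follows. This is a minimal functional model rather than the hardware design: it assumes the usual Montgomery constant $p' = -p^{-1} \bmod 2^k$ and $m \ge k$, performs the k-bit shift on the recombined integer instead of on the word registers, and uses hypothetical helper names (`split_words`, `word_montgomery`).

```python
def split_words(x, width, count):
    """Split integer x into `count` little-endian words of `width` bits."""
    mask = (1 << width) - 1
    return [(x >> (width * j)) & mask for j in range(count)]


def word_montgomery(A, B, p, n, m=24, k=16):
    """Return A * B * 2^(-k*w) mod p using m x k-bit atomic multiplications."""
    w = -(-n // k)          # ceil(n / k): number of k-bit words of A
    e = -(-(n + k) // m)    # ceil((n + k) / m): number of m-bit words of B and p
    mask_m, mask_k = (1 << m) - 1, (1 << k) - 1
    p_prime = (-pow(p, -1, 1 << k)) & mask_k  # assumed: p' = -p^(-1) mod 2^k
    a = split_words(A, k, w)
    b = split_words(B, m, e)
    pw = split_words(p, m, e)
    S = 0
    for i in range(w):                        # Outer Loop
        s = split_words(S, m, e)              # S < 2p always fits in e words
        c = 0
        for j in range(e):                    # Inner Loop 1: S += a_i * B
            t = s[j] + a[i] * b[j] + c
            s[j], c = t & mask_m, t >> m
        top = c
        q = (s[0] * p_prime) & mask_k         # makes the low k bits cancel
        c = 0
        for j in range(e):                    # Inner Loop 2: S += q * p
            t = s[j] + q * pw[j] + c
            s[j], c = t & mask_m, t >> m
        top += c
        total = sum(word << (m * j) for j, word in enumerate(s)) + (top << (m * e))
        S = total >> k                        # discard the k zero bits
    return S - p if S >= p else S             # final modular reduction


# Example with the NIST P-256 prime and arbitrary operands below p.
p256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
A, B = pow(3, 200, p256), pow(5, 300, p256)
result = word_montgomery(A, B, p256, 256)
print(result == (A * B * pow(2, -256, p256)) % p256)  # True
```

With n = 256, m = 24, and k = 16 the model uses w = 16 and e = 12, matching the parameters quoted above.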
The proposed Processing Elements (PEs) are shown in Fig. 2, i.e., PEA and PEB perform the computation of nodes $A_{i,j}$ and $B_{i,j}$ in Fig. 1, respectively. Both PEs utilize the high-performance DSP blocks in modern FPGAs as the kernel component, e.g., the DSP48E in the Xilinx Virtex-5 family. To optimize the operating frequency, the embedded registers in the DSP block are used to keep the intermediate results $s_j$. The additional routing delay introduced by using the registers in generic FPGA slices can therefore be eliminated. Note that PEB requires an additional 16-bit register to store the least significant word of S, i.e., q, for Inner Loop 2 in Algorithm 1. It also needs another 8-bit register for the most significant 8 bits of each word of S after the least significant 16 bits are discarded in each iteration of the Outer Loop.
Figure 3(a) shows the proposed pipeline architecture for the Montgomery multiplication with four pipeline stages. At most four nodes $A_{i,j}$ and four nodes $B_{i,j}$ can be performed simultaneously as long as no precedence violation exists. Each PE propagates $s_j$ to the next one and $c_j$ back to itself in every cycle. For simplicity, Fig. 4 shows an example of the operation cycles for the Outer Loop in Algorithm 1 using a 2-stage pipelined Montgomery multiplier when w = 5 and e = 7, i.e., there are two PEAs and two PEBs, and the operands A and B are divided into five and seven words, respectively. Let x denote the number of pipeline stages. There are at least three cycles between $PEA_i$ and $PEA_{i+1}$, and $(e - 3x)$ additional cycles are required to wait for available PEs between every x PEAs when x is less than $\lceil e/3 \rceil$. Finally, the last iteration of the Outer Loop in Algorithm 1 requires $(e + 1)$ cycles. Therefore, the number of cycles for the Outer Loop can be formulated as
[...] otherwise. (1)
[...] Fig. 3(b). The product S and the prime p are stored in two circular shift registers, X and Y, respectively. The least significant word of X is subtracted by $d \times p_j$ and $(d + 1) \times p_j$, and then stored into the most significant word of X and Y in every cycle. Therefore, the number of cycles for the modular reduction stage can be represented as $C_r = e$ (i.e., line 14 in Algorithm 1), and the overall Montgomery multiplication requires $C_m + C_r$ cycles.
Operation Scheduling
Our ECC processor performs point scalar multiplication by using the addition-and-subtraction method in [3] for arbitrary prime fields and elliptic curves. The point scalar multiplication is decomposed into iterations that perform either a point double (PDBL) or a point double with point addition/subtraction (PADDSUB). To optimize the performance, we implement the Montgomery multiplier (see Fig. 3) and the modular addition with the DSP blocks. An n-bit modular addition can be performed in $C_a = 2 \times \lceil n/r \rceil$ cycles, where r is the word width of the carry propagation adder in the DSP blocks. As a result, a 256-bit modular addition requires 12 cycles, since the DSP block in the Xilinx Virtex-5 family comes with a 48-bit carry propagation adder.
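The cycle count $C_a = 2 \times \lceil n/r \rceil$ can be reproduced with a small word-serial model. The sketch below assumes one plausible organization that the text does not spell out: a first pass that adds the operands r bits at a time and a second pass that subtracts p, with the final borrow selecting the result. The function name `mod_add_words` and the cycle counter are hypothetical.

```python
def mod_add_words(a, b, p, n, r=48):
    """Compute (a + b) mod p with r-bit word operations; return (result, cycles)."""
    words = -(-n // r)                 # ceil(n / r)
    mask = (1 << r) - 1
    cycles = 0

    # Pass 1 (assumed): T = a + b, one r-bit addition per cycle.
    t, carry = 0, 0
    for j in range(words):
        s = ((a >> (r * j)) & mask) + ((b >> (r * j)) & mask) + carry
        t |= (s & mask) << (r * j)
        carry = s >> r
        cycles += 1

    # Pass 2 (assumed): U = T - p, one r-bit subtraction per cycle.
    u, borrow = 0, 0
    for j in range(words):
        d = ((t >> (r * j)) & mask) - ((p >> (r * j)) & mask) - borrow
        u |= (d & mask) << (r * j)
        borrow = 1 if d < 0 else 0
        cycles += 1

    # T >= p exactly when the carry of pass 1 covers the final borrow of pass 2.
    result = u if carry - borrow >= 0 else t
    return result, cycles


p256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
total, cycles = mod_add_words(p256 - 1, p256 - 2, p256, 256)
print(total == p256 - 3, cycles)  # True 12
```

For n = 256 and r = 48 each pass takes six word operations, which gives the 12 cycles stated above.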
With Jacobian projective coordinates, we decompose the PDBL, i.e., $P_2(x_2, y_2, z_2) = 2P_1(x_1, y_1, z_1)$, into eleven atomic multiplications and eight additions. In addition, mixed-coordinate PADDSUB, i.e., $P_2(x_2, y_2, z_2) = P_0(x_0, y_0, z_0) + P_1(x_1, y_1, 1)$, is adopted to reduce the complexity, which can be implemented by twelve multiplications and eight additions. Three types of operation scheduling are proposed for PDBL and PADDSUB by using one modular multiplier and one modular adder (see Figs. 5, 6, and 7). First, the preliminary scheduling (Type-I) performs the modular multiplication and addition/subtraction simultaneously as long as no precedence violation occurs, and accomplishes the modular reduction after the multiplication, as shown in Fig. 5(a). $C_m + C_r + 2$ cycles are required in each stage, where the additional two cycles are introduced by moving data from and to the RAM. By using the addition-and-subtraction method in [3], approximately two-thirds of the iterations execute PDBL and PADDSUB when the scalar is in NAF (non-adjacent form) representation, and the rest of the iterations execute only PDBL. As a result, Type-I scheduling can perform a point scalar multiplication in $C_{psm}$ [...]
To reduce the operation cycles, Type-II scheduling performs the reduction stage of the present Montgomery multiplication (Stage t) at the same time as the next multiplication (Stage t + 1), as long as no data dependency exists between any successive stages [see Fig. 5(b)]. Figure 6 shows the detailed scheduling for point double and point addition. Although an additional bubble stage is required to fix the precedence violation, e.g., the stage PADD 2 in Fig. 6(b), the number of cycles for each stage can be reduced from $C_m + C_r + 2$ to $C_m + 2$. Consequently, a point scalar multiplication can be performed in [...]
To further optimize the performance, Type-III operation scheduling overlaps successive iterations under the following conditions: (1) As shown in Fig. 7(a), for (PDBL)-(PDBL), PDBL 12 of the present iteration can be overlapped with PDBL 1 of the next iteration; (2) For (PDBL-PADDSUB)-(PDBL), PDBL 11 can be inserted into PADD 2. In addition, PADD 14 of the current iteration can be overlapped with PDBL 1 of the next iteration. Furthermore, PADD 15 can be omitted since $2y_2$ can be fed into the next PDBL directly by skipping the addition operation in PDBL 2, as shown in Fig. 7(b). Therefore, one stage can be effectively removed for PDBL, and two stages can be removed for PDBL-PADDSUB. As a result, a point scalar multiplication can be performed in [...]
Implementation Results and Comparison
For a fair comparison with various FPGA devices, our ECC processor (with Type-III operation scheduling) has been synthesized on different FPGA families, namely, Xilinx Virtex-2 XC2VP30, Virtex-4 XC4VFX12, and Virtex-5 XC5VLX110. Table 1 compares different ECC designs on Xilinx Virtex-2 devices for point scalar multiplication. Note that because the word widths of the inherent arithmetic components in Virtex-2 and Virtex-4 FPGAs are smaller than those in Virtex-5 FPGAs, the number of cycles for point scalar multiplication on Virtex-5 FPGAs is less than that on Virtex-2 and Virtex-4 FPGAs (see Tables 1, 2, 3). In Table 1, our implementation is the fastest one in terms of the operation time. Note that [12] proposed a dual-field architecture. The approach in [8] reduces the operation cycles significantly by utilizing a full-word 256 × 256 Montgomery multiplier. However, it results in much larger hardware. To further compare the area, we normalize the inherent multiplier (MUL) in Virtex-2 FPGAs to 141 slices according to our synthesis experiment. The comparison shows that our word-based ECC processor provides the fastest operation time as well as the best cost-effectiveness, i.e., the AT product.
In Tables 2 and 3, we also summarize the comparison among the ECC implementations on modern FPGAs, i.e., [9] was on a Xilinx Virtex-4 FPGA, [10] was on a Xilinx Virtex-5, and [11] was on an Altera Stratix II FPGA. Although the FPGAs adopted in [9] and [11] in Table 2 are different, they are both fabricated in 90 nm CMOS technology. Also, one can normalize an Adaptive Logic Module (ALM) in Stratix II FPGAs to a slice in Virtex-4 FPGAs, as stated in [11]. In addition, we implemented functionally compatible modules for the DSP blocks in Virtex-4 and Virtex-5 FPGAs for the area comparison. Each DSP block in Virtex-4 can be mapped into 619 slices, and the one in Virtex-5 is normalized to 992 slices.
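The area normalization and the AT product used in the comparison can be written as a short helper. The conversion factors (141, 619, and 992 slices) are the ones quoted in the text; the function names and the sample design figures in the example are hypothetical and do not correspond to any entry of the tables.

```python
# Slices per embedded arithmetic block, as quoted in the text.
SLICES_PER_BLOCK = {
    "virtex2": 141,  # inherent multiplier (MUL)
    "virtex4": 619,  # DSP block
    "virtex5": 992,  # DSP block
}


def normalized_area(slices, blocks, family):
    """Total area in slices after converting embedded blocks to slices."""
    return slices + blocks * SLICES_PER_BLOCK[family]


def at_product(area_slices, time_ms):
    """Area-time product; a lower value means better cost-effectiveness."""
    return area_slices * time_ms


# Hypothetical design: 2000 slices and 10 DSP blocks on Virtex-5, 0.5 ms.
area = normalized_area(2000, 10, "virtex5")
print(area, at_product(area, 0.5))  # 11920 5960.0
```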
Both [9] and [10] proposed ECC designs supporting special prime fields, e.g., the specific primes recommended by NIST [14] in [9] and the Mersenne prime of the type $2^n - 1$ in [10]. Note that [10] also provided architectures for generic prime fields. With the special prime fields, the complexity of modular reduction can be minimized significantly. Therefore, [9] can further pipeline the DSP blocks to improve the operating frequency to 490 MHz. In addition, the Hiasat multiplier over GF($p^2$), which performs a 127-bit modular multiplication in one cycle, is used in [10] to reduce the operation cycles effectively. As a result, a 256-bit point scalar multiplication can be performed in 0.19 ms in [10], which is the fastest design to the best of our knowledge. However, architectures dedicated to specific prime fields limit their flexibility. Tables 2 and 3 show that our processor, which supports arbitrary prime fields and elliptic curves, provides cost-effectiveness comparable to that of the fast ECC designs for specific fields in [9] and [10]. In addition, compared with the designs supporting arbitrary prime fields in [8, 10, 12], our processor is 3.14 to 11.27 times faster in terms of the operation time. Although the design in [11] is 1.6 times faster than ours, our design is 3.7 times more cost-effective in AT product. The comparison shows that our processor outperforms the other designs for arbitrary primes in either speed or cost-effectiveness.
Conclusion
This paper has presented a high-performance ECC architecture featuring arbitrary prime fields and elliptic curves. The proposed word-based pipelined Montgomery multiplier operates at a high clock rate with the inherent DSP blocks on modern FPGA devices. The number of operation cycles is effectively reduced by the improved operation scheduling for the finite field operations in EC arithmetic. A 256-bit point scalar multiplication can be done in 0.86 ms at 263 MHz on Xilinx XC5VLX110. The comparison indicates that our ECC processor outperforms other FPGA designs significantly with the capability of supporting arbitrary prime fields and elliptic curves, which delivers the maximized performance and the flexibility for widespread security applications as well.
