Abstract-To enhance data security in network communications, this paper presents a dual-field elliptic curve cryptographic processor (DECP) supporting all finite field operations and elliptic curve (EC) functions. Based on the fast radix-4 unified division algorithm, the execution time can be reduced by a factor of three. By exploiting the hardware sharing and ladder selection techniques, the proposed 160-bit and 256-bit DECPs achieve competitive execution cycle counts with only 0.29 mm² and 0.45 mm² of silicon area in 90 nm CMOS technology. In addition, the operating frequency in dual fields can be increased by applying the data-path separation method and the degree checker. The proposed DECP is 2 to 6 times better in area-time product than related works.
I. INTRODUCTION
To ensure the data security of network communications, public-key encryption algorithms have been widely adopted. Elliptic curve cryptography (ECC) [1]-[3] can provide higher security than the RSA [4] algorithm at the same key size. The main operation of ECC is elliptic curve point scalar multiplication (ECSM), which is composed of a series of finite field operations, of which inversion is the most complicated.
Several ECC designs have been published over a specified finite field [5] [6], either GF(p) or GF(2^m). However, to enhance the security level, arbitrary key sizes and a variety of field operations are essential. Some dual-field ECC processors (DECP) that can support arbitrary key sizes have been proposed [7]-[11]. The coordinates in these designs are mapped to projective coordinates to avoid the inversion operation in the ECSM. Satoh and Takano [7] exploit r × r-bit multipliers to speed up the ECSM in projective coordinates, and Lai and Huang [8] [9] present a parallel architecture based on [7] to enhance the throughput. However, the operations in projective coordinates are more complicated than those in affine coordinates, and an inversion is still needed in the coordinate transformation after the ECSM in projective coordinates. To reduce the execution cycles of the ECSM and the coordinate transformation, the size of the multipliers or the number of parallel units is increased to speed up the field operations, which results in high hardware cost.
In this paper, we propose a fast radix-4 unified division (R4UD) algorithm supporting the Montgomery modular division (MMD) [11] and modular division (MD) [12] operations over dual fields to reduce the execution cycles. Moreover, the hardware sharing, data-path separation, degree checker, and ladder selection methods are proposed to reduce the hardware cost and shorten the critical path. Our DECP supports all EC functions with arbitrary curves and parameters over dual fields. The implementation results show that our DECP outperforms related works in terms of functionality, hardware efficiency, and power consumption. This paper is organized as follows. In Section II, a radix-4 unified division algorithm is introduced, and in Section III a dual-field ECC processor is presented. The implementation results are presented in Section IV, followed by the conclusion in Section V.
II. PROPOSED RADIX-4 UNIFIED DIVISION ALGORITHM
The proposed radix-4 unified division (R4UD) algorithm is shown in Algorithm 1 with the following invariant equivalences.
As shown in Algorithm 1, the operands U, V, R, and S are initialized to p, Y, 0, and X, respectively. The operations on U and V in Algorithm 1 are based on the binary greatest common divisor (GCD) algorithm [13] given in TABLE I. In each iteration, the value of U or V is divided by 2 or 4, depending on the conditions. The final values of U and V are 1 and 0, respectively, due to the property of the GCD operation. From equations (1) and (2), the corresponding R and S are X · Y^(-1) · 2^i (mod p) and 0. The final value of i depends on whether the operation is initially set to MMD or MD. During the first m cycles of an MMD operation, i in equations (1) and (2) is increased by 1 or 2 along with the operations on the operands R and S, so the domain values of R and S increase by 1 or 2. When the execution cycle count exceeds m, that is, i ≥ m, step 42 is executed to keep the operands R and S in the Montgomery domain. Note that the variable m is equal to the field length in use, and the variable t is used to fix the domain value of R and S. At the end of this algorithm, the value of R is equal to
On the other hand, if the operation is set to MD, step 42 is always executed to prevent the domain value i from increasing, so that R and S always stay in the integer domain. The output value is then equal to X · Y^(-1) (mod p). Compared with previous works, such as Fermat's little theorem [14], Kaliski's Montgomery modular inversion algorithm [13], and Takagi's modular division algorithm [12], our work reduces the execution cycles of the division operation without extra multiplication operations or pre-computed values, as shown in TABLE II.
[Algorithm 1 (R4UD), surviving fragment: "if i < m and operation is MMD, then:"]
The GFAU is controlled by the EC controller to accomplish the modular operations, such as modular addition (MA), modular subtraction (MS), Montgomery modular multiplication (MMM), MD, and MMD over dual fields; the detailed architecture is shown in Fig. 2. The multiplication and division operations are based on a radix-4 MMM and the proposed R4UD algorithm. Radix-4 is used as a compromise between area overhead and performance. In this figure, the U, V, R, and S data-path blocks are used to carry out all the modular operations. Moreover, four techniques, described below, are exploited to enhance the performance of the proposed GFAU.
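To make the roles of the operands U, V, R, and S more concrete, the following is a minimal radix-2 sketch for GF(p) only. It is not the radix-4 Algorithm 1 itself. It assumes an odd prime p and gcd(Y, p) = 1. The names unified_div_radix2 and half_mod, the invariants written in the comments, and the final correction loop that raises i to m are illustrative assumptions and are not taken from the paper.

```python
def half_mod(a, p):
    """Return a / 2 (mod p) for odd p and 0 <= a < p."""
    return a // 2 if a % 2 == 0 else (a + p) // 2


def unified_div_radix2(x, y, p, m=None):
    """Radix-2 sketch of a unified division over GF(p).

    m is None : modular division (MD), returns x * y^-1 mod p.
    m given   : Montgomery modular division (MMD), returns
                x * y^-1 * 2^m mod p.
    Invariants of this sketch (an assumption, mod p):
        y * r == u * x * 2^i   and   y * s == v * x * 2^i
    """
    u, v, r, s = p, y % p, 0, x % p
    if v == 0:
        raise ZeroDivisionError("y has no inverse modulo p")
    i = 0  # domain value: exponent of the accumulated factor 2^i
    while v != 0:
        if v % 2 == 0:
            v //= 2
            if m is not None and i < m:
                r = (2 * r) % p      # MMD: let the domain value grow
                i += 1
            else:
                s = half_mod(s, p)   # MD, or MMD once i has reached m
        elif u % 2 == 0:
            u //= 2
            if m is not None and i < m:
                s = (2 * s) % p
                i += 1
            else:
                r = half_mod(r, p)
        elif v >= u:                 # both odd: binary GCD subtraction
            v -= u
            s = (s - r) % p
        else:
            u -= v
            r = (r - s) % p
    # Here u == 1 and v == 0, so y * r == x * 2^i (mod p).
    if m is not None:
        while i < m:                 # illustrative final correction
            r = (2 * r) % p
            i += 1
    return r
```

For example, with p = 23, X = 5, and Y = 7, the MD mode returns 4 (since 7 · 4 ≡ 5 mod 23), and the MMD mode with m = 5 returns 4 · 2^5 mod 23 = 13. As in the description of Algorithm 1, U and V finish at 1 and 0 and S finishes at 0.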
A. Hardware Sharing
Since addition is the kernel arithmetic operation of the modular operations, most of the adders can be shared to reduce the cost. In addition, a swap logic is proposed to reduce the implementation complexity of the R4UD algorithm. In Algorithm 1, several equations have the same form but different operands. Consequently, with this approach the operation unit can be shared, and the hardware cost of selection is reduced by almost half. Furthermore, because the proposed R4UD has some conditions in common between the dual fields (e.g., i < m, c = 0, d = 0), the complexity of the controller can be further reduced.
B. Data-path Separation
As the critical path in the proposed R4UD runs from the UV data-path to the RS data-path, we propose a data-path separation method to cut it in half. Fig. 3 shows the detailed flow of the proposed method. The control signal from the UV data-path is stored and sent to the RS data-path in the next cycle. Although this approach adds one cycle, the critical path is reduced from two data-path cells to one.
C. Degree Checker
Originally, the degree checking operation in GF(2^m), needed for operations such as 2 · S (mod p) and 4 · S (mod p), is implemented with large multiplexers as shown in Fig. 4(a); however, this method results in a long critical path. Fig. 4(b) shows the proposed degree checker, which requires only n two-input AND gates and one n-input OR gate to perform the degree checking operation. With this approach, the degree of the input data D_in and the field length can be compared with a shorter critical path. Note that the m-th bit of the field length register is set to 1 and other
[Figure residue: UV data-path and RS data-path blocks, apparently from the data-path separation flow of Fig. 3]
D. Ladder Selection
In the GFAU, the selection in the data post-operation of the R and S data-paths is quite complicated because there are 21 operations on R and S in total in the proposed R4UD algorithm. To reduce the selection complexity, we propose the ladder selection architecture shown in Fig. 5. We arrange the inputs of the R and S data-paths so that the final selection is decided in a fixed order. For example, if the current operation is S = 4S (mod p) in step 11 of Algorithm 1, the operands are {S, P} = {4S, −p}. The correct value is then decided by checking, in order, from F S3 < 0 to F S1 < 0. Consequently, the output value is within the range [0, p − 1]. The hardware cost of the data selection in the post-data operation block is reduced by this approach.
Our design supports all EC functions, including point addition, point doubling, point scalar multiplication, domain transformation, and finite-field operations. The pre-computation required at the beginning of the ECSM operation is eliminated in our DECP. Furthermore, our design achieves competitive execution cycle counts compared with Satoh and Takano's [7] and Lai and Huang's [8] [9] works, which use one 64-bit and four 32-bit multipliers. The works in [8] [9] exploit a parallel architecture to reduce the execution cycles but substantially increase the hardware cost. Consequently, the area of our DECP is about 2 times smaller than theirs. Chen's work [10] uses a systolic array to achieve the smallest execution cycle count, but its area is about 40 times larger than that of our design. Compared with the 160-bit and 256-bit designs in [8], our DECP is about 4 and 2 times better in area-time product (AT), respectively. From the tables, our DECP outperforms other DECP designs in terms of functionality, hardware efficiency, execution time, and power consumption.
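Returning to the ladder selection of Section D, the fixed-order choice for S = 4S (mod p) can be illustrated with a short sketch. It assumes that the three ladder candidates are 4S − 3p, 4S − 2p, and 4S − p, checked in that order for a negative sign, and that 0 ≤ S ≤ p − 1. This mapping of the candidates to the signals F S3 to F S1 and the name ladder_select are assumptions made for illustration, not details confirmed by the paper.

```python
def ladder_select(value, p, steps=3):
    """Reduce value into [0, p - 1] by a fixed-order ladder.

    Candidates value - k*p are tested from k = steps down to k = 1;
    the first non-negative one is selected. If all are negative,
    value is already in range and is passed through unchanged.
    Assumes 0 <= value < (steps + 1) * p.
    """
    for k in range(steps, 0, -1):
        candidate = value - k * p    # add k copies of the operand -p
        if candidate >= 0:           # sign check, as in "F < 0" tests
            return candidate
    return value


def quadruple_mod(s, p):
    """Return 4 * S (mod p) for 0 <= S <= p - 1 via ladder selection."""
    return ladder_select(4 * s, p)
```

For example, with p = 23 and S = 22, the value 4S = 88 is reduced by the first ladder step to 88 − 69 = 19, while S = 5 gives 4S = 20, which passes through unchanged.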
V. CONCLUSION
In this paper, a high-performance ECC processor supporting ECSM and finite field operations over both the prime field and the binary field is presented. The proposed design can perform one 160-bit ECSM in 0.31 ms at 256.4 MHz over GF(p) and in 0.19 ms at 289.9 MHz over GF(2^m) with a core area of 0.29 mm². In addition, the design with a core area of 0.45 mm² can perform one 256-bit ECSM in 0.77 ms at 250.0 MHz over GF(p) and in 0.59 ms at 277.8 MHz over GF(2^m). With the proposed R4UD algorithm, hardware sharing method, data-path separation, degree checker, and ladder selection method, our DECP achieves better performance than related works.
