Abstract-Recently, several hardware implementations of elliptic curve cryptography have been proposed, but few of them consider dual-field functions, real-time requirements, hardware efficiency, and power analysis resistance as a whole. In this paper, a new unified division algorithm and a free pre-computation scheme are introduced to accelerate the GF(p)/GF(2^n) elliptic curve arithmetic functions. The overall hardware is optimized by a very compact, fully pipelined Galois field arithmetic unit. Moreover, a key-blinded technique with regular calculation is designed against power analysis attacks without degrading the clock speed. Fabricated in a 90nm 1P9M CMOS process, our ECC processor occupies 0.55 mm^2 and performs the scalar multiplication in 19.2ms over GF(p_521) and 8.2ms over GF(2^409).
I. INTRODUCTION
Nowadays, secure information exchange is an important issue for communication networks. Since the aging RSA algorithm has been challenged by fast factoring attacks, elliptic curve cryptography (ECC), which provides the same level of security with a shorter key size, has become more attractive. An ECC device is advantageous in terms of speed, cost, and power consumption.
For the ECC schemes specified in IEEE P1363 [1], the most time-critical function is the elliptic curve scalar multiplication (ECSM), which computes a multiple point kP from an n-bit private key k and a point P on an elliptic curve (EC). This is achieved by basic point calculations, namely point addition and point doubling, which involve finite field operations over either GF(p) or GF(2^n). With the points represented in affine coordinates, the formulas are given in Table I.
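The point addition, point doubling, and scalar multiplication described above can be sketched in a few lines. This is a minimal illustration that assumes the standard textbook affine formulas for a curve y^2 = x^3 + ax + b over GF(p) and a plain left-to-right double-and-add loop. The toy curve over GF(17), the names `ec_add` and `scalar_mult`, and the use of Python's built-in modular inverse are assumptions for illustration only, not the processor's data path.

```python
# Toy curve y^2 = x^3 + 2x + 2 over GF(17): a textbook example, not a curve from the paper.
P_MOD, A_COEF = 17, 2
G = (5, 1)  # a point on the toy curve


def ec_add(P, Q, a=A_COEF, p=P_MOD):
    """Affine point addition/doubling over GF(p); None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # P + (-P), which also covers doubling a point with y = 0
    if P == Q:
        # Point doubling: slope of the tangent, one field division
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        # Point addition: slope of the chord, one field division
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def scalar_mult(k, P):
    """Left-to-right double-and-add: scan the key bits from the most significant one."""
    Q = None
    for bit in bin(k)[2:]:
        Q = ec_add(Q, Q)
        if bit == "1":
            Q = ec_add(Q, P)
    return Q
```

In this sketch every point operation costs exactly one field division (the `pow(..., -1, p)` call), so the division latency directly bounds the ECSM time in affine coordinates.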
To achieve high speed, the carry-bypass adder [2] and fixed polynomials [3] have been used for processing GF(2^n) functions on specific ECs. However, the applications of IEEE P1363, including digital signature, are required to support dual-field functions on arbitrary ECs. Exploiting multiple function units is another common technique to accelerate the ECC functions, but it incurs a large hardware cost. In [4], an architecture of four dual-field Montgomery multipliers implemented with more than one hundred thousand gates supports only 160-bit operations. Such a short key size can only meet the security needs of low-end devices. Besides, data services require real-time operation. The well-known Montgomery algorithm [5] is usually adopted to avoid trial modular reduction over GF(p), but it requires pre-computed values for domain conversion. Because these values differ among users with distinct keys and ECs, pre-computations performed by the host CPU slow down the system.
Also, the key of an unprotected crypto chip can be revealed by power measurement, including both simple power analysis (SPA) and differential power analysis (DPA) [6]. For chip implementation, the countermeasure described in [7] uses a switched capacitor to prevent current from running through the encryption core. However, this approach halves the operating frequency because the charge must be replenished every cycle.
In this paper, the proposed ECC processor aims at providing a hardware-efficient and real-time solution that supports 521-bit GF(p)/GF(2^n) functions on arbitrary ECs with power analysis resistance. Since the field division dominates the computation time of point calculation, a new unified division algorithm is also proposed. It is based on the Montgomery inversion [8] and extends its functionality by directly implementing both the Montgomery and modular division with less latency. To improve hardware efficiency further, a highly integrated and pipelined Galois field arithmetic unit (GFAU) with circular shift registers is exploited. The pre-computations for domain conversion are eliminated altogether by performing several Montgomery divisions and multiplications on-the-fly. To prevent power analysis attacks without significant speed degradation, both the key-blinded technique and regular calculation with dummy operations are adopted. Compared with related works, our ECC processor shows advantages in hardware efficiency.
II. MONTGOMERY MODULAR ALGORITHM
For a given n-bit prime p, the Montgomery domain representation of an integer a is defined as A ≡ a · r (mod p) with the Montgomery constant r = 2^n (mod p). The Montgomery multiplication of two Montgomery domain integers A and B is given in Algorithm 1. To achieve the inversion in Montgomery domain, Kaliski [8]
Algorithm 3 shows our proposed unified division algorithm. From the observation in Algorithm 2, the derivation is based on the invariants at Step 2 as follows
By setting the initial values of (U, V, R, S) to be (p, B, 0, A) at Step 1, our approach modifies the invariants at Step 2 such that
Before the last iteration, both U and V are 1 because the initial values of U and V are relatively prime. Then after finishing Step 2, the values of
. Note that modular reductions are needed in every iteration to keep both operands R and S within the finite field set {0, 1, ..., p − 1}. The iterations at Step 3 repeatedly divide R by two to bring its factor 2^i down to 2^n (Montgomery division) or to the identity (modular division). Before returning the final result, the positive result is obtained by a simple subtraction p − R, which is not needed for GF(2^n). The output is selectable between the residue of the Montgomery division R ≡ a · b^{-1} · 2^n (mod p) and the residue of the modular division R ≡ a · b^{-1} (mod p). As a result, our proposed Montgomery division takes only n ∼ 3n iterations without any Montgomery multiplication. Compared with Kaliski's Montgomery inversion followed by one n-iteration Montgomery multiplication (Algorithm 1), our proposed Montgomery division reduces the cycle count by 62% on average. The modular division, which is essential for ECC schemes, can be achieved with n additional divide-by-two iterations and no extra circuit components. Table II gives examples that demonstrate the Montgomery division 2 ≡ 12 · 5^{-1} · 2^4 (mod 13) and the corresponding modular division.
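The flow of the unified division over GF(p) can be approximated in software as follows. This is only a sketch under stated assumptions: the Step 2 update rules are taken from Kaliski's binary extended Euclidean iteration [8] with the initial values (U, V, R, S) = (p, B, 0, A), and the invariants in the comments were derived for this sketch rather than copied from Algorithm 3. The function name `unified_div` and its `montgomery` flag are hypothetical.

```python
def unified_div(A, B, p, n, montgomery=True):
    """Sketch: return A * B^-1 * 2^n mod p (Montgomery division) or
    A * B^-1 mod p (modular division) for an odd prime p and B != 0 mod p."""
    U, V, R, S, i = p, B % p, 0, A % p, 0
    if V == 0:
        raise ZeroDivisionError("B is not invertible modulo p")

    # Step 2 (assumed Kaliski-style rules). Invariants modulo p:
    #   B * R == -U * A * 2^i   and   B * S == V * A * 2^i
    while V > 0:
        if U % 2 == 0:
            U, S = U // 2, 2 * S % p
        elif V % 2 == 0:
            V, R = V // 2, 2 * R % p
        elif U > V:
            U, R, S = (U - V) // 2, (R + S) % p, 2 * S % p
        else:
            V, S, R = (V - U) // 2, (S + R) % p, 2 * R % p
        i += 1
    # Now U == 1 and V == 0, so R == -A * B^-1 * 2^i (mod p).

    # Step 3: halve R modulo p until the power of two reaches the target.
    target = n if montgomery else 0
    while i > target:
        R = (R + p) // 2 if R % 2 else R // 2
        i -= 1
    while i < target:  # only reached if n exceeds the bit length of p
        R = 2 * R % p
        i += 1

    return (p - R) % p  # subtraction p - R yields the positive result
```

With the operands of the Table II example, `unified_div(12, 5, 13, 4)` returns 2, and setting `montgomery=False` runs n = 4 further halvings to return the plain quotient 12 · 5^{-1} mod 13.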
III. OUR PROPOSED ECC PROCESSOR
Fig. 1 shows the overall ECC architecture with a standard AMBA AHB bus interface. The ECSM with modular operations over dual fields, required by ECC schemes such as signature, authentication, and key exchange, is calculated by the Galois field arithmetic unit (GFAU). The inputs are the user public/private key, EC coordinates, EC parameters, and protocol instructions. To process these inputs in real time, the instruction decoder and the pre-/post-process stages are integrated in our processor. After instruction decoding, the pre-process stage converts the EC coordinates and parameters into the Montgomery domain and blinds the key value to resist power analysis attacks. Before the calculation results are returned, the EC coordinates are converted back into the integer domain at the post-process stage. All 521-bit data operands stored in registers and transmitted to the GFAU are managed by the EC control. To reduce the multiplexer complexity caused by the long register bit length, separate 32-bit circular shift registers are used.
A. Galois Field Arithmetic Unit
To obtain a hardware-efficient design that supports all Montgomery/modular operations, we use a bit-level architecture that combines the multiplier and the divider into the GFAU. To perform the multiple arithmetic operations of the iterative loop efficiently, namely addition/subtraction and shifting, a carry-save adder (CSA) with a carry-propagation adder (CPA) at the final stage is used. Another benefit is that the additive operation for GF(2^n) can be realized by hardware sharing, because the sum output of the CSA is simply two bitwise XOR operators applied to its inputs.
If these iterative operations are performed within one cycle, the critical path is the calculation of operands R or S in Algorithm 3, which consists of the UV comparative cases followed by a modular addition. The time-critical comparison of U and V, implemented as a subtraction, has nearly the same delay as an addition. Since the data path of operands U and V is independent of the operands R and S, a pipeline stage can be inserted between the UV data path and the RS data path to shorten the critical path. One additional cycle is needed after pipelining, but this is negligible because each operation takes hundreds or thousands of cycles. Fig. 2 shows the timing flow of the pipelined scheme. The comparative case is determined in the (m−1)th cycle; the mth cycle then sets the values of operands R and S and simultaneously determines the next case, until V = 0.
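The hardware-sharing remark can be illustrated with a word-level model of a carry-save adder. The sketch below assumes the usual 3:2 compressor definition (sum = XOR of the three inputs, carry = bitwise majority shifted left by one); the function names `csa` and `gf2n_add` are hypothetical and do not describe the GFAU netlist.

```python
def csa(a, b, c):
    """3:2 carry-save adder on non-negative integers: a + b + c == s + carry."""
    s = a ^ b ^ c                                # sum bits: two XOR operators per bit
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority of the three bits, weighted by 2
    return s, carry


def gf2n_add(a, b):
    """GF(2^n) addition of two polynomials in bit-vector form: reuse the CSA sum
    output with the third input tied to zero and ignore the carry output."""
    s, _ = csa(a, b, 0)
    return s
```

For GF(p), the two CSA outputs are resolved once by the final carry-propagation adder; for GF(2^n), the carry output is simply unused, so no separate XOR array is required.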
Fig. 3 shows the major arithmetic unit of our ECC processor. Registers and modular operations for operands U, S, R in Algorithm 3 are shared with operands A, B, C in Algorithm 1 to save circuit area, since the two algorithms are never processed at the same time. The select signal c_2 for the RS data path is fed from c_1, which is determined by the UV comparative case, with one cycle of delay due to pipelining.

B. Free Pre-Computation Scheme
Because the primary inputs, namely the base point coordinates (x, y) and the EC parameter a_p or a_b, are in the integer domain, the domain conversion can be performed by the proposed Montgomery division such that MonDiv(a, 1) ≡ a · 2^n (mod p).
On the other hand, to return the point coordinates in the integer domain, the Montgomery multiplication can be used as MonMul(a · 2^n, 1) ≡ a (mod p). For calculating one ECSM, the overheads of domain conversion are three Montgomery divisions and two Montgomery multiplications; both of them can be performed on-the-fly by the GFAU to avoid any pre-computation from the host system.
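The round trip between the integer domain and the Montgomery domain can be checked numerically. The sketch below assumes a textbook radix-2 (bit-serial) Montgomery multiplication in place of Algorithm 1, and a reference Montgomery division defined directly through Python's modular inverse instead of the hardware divider. All function names are hypothetical.

```python
def mon_mul(A, B, p, n):
    """Radix-2 Montgomery multiplication: A * B * 2^-n mod p for odd p and A, B < p < 2^n."""
    C = 0
    for i in range(n):
        C += ((A >> i) & 1) * B   # add B when bit i of A is set
        if C & 1:
            C += p                # make C even without changing it modulo p
        C >>= 1                   # exact division by two
    return C - p if C >= p else C


def mon_div_ref(a, b, p, n):
    """Reference Montgomery division a * b^-1 * 2^n mod p (software definition only)."""
    return a * pow(b, -1, p) * pow(2, n, p) % p


def to_montgomery(a, p, n):
    """Integer domain -> Montgomery domain via MonDiv(a, 1)."""
    return mon_div_ref(a, 1, p, n)


def from_montgomery(A, p, n):
    """Montgomery domain -> integer domain via MonMul(A, 1)."""
    return mon_mul(A, 1, p, n)
```

With p = 13 and n = 4, `to_montgomery(12, 13, 4)` gives 10 and `from_montgomery(10, 13, 4)` gives 12 again. Neither direction needs the constant 2^{2n} mod p that a multiplication-only conversion would have to obtain beforehand.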
C. Power Analysis Resistance
To hide the leaked information against both SPA and DPA, the double-and-add-always and key-blinded techniques are exploited. The former ensures that the operation sequence of the ECSM is uncorrelated with the key value by always performing a point doubling followed by a (possibly dummy) point addition. Because the multiplier of the EC point, which is also the key value, is periodic in the point order, the multiplier can be blinded by adding r · #E, where r is a 32-bit random integer and #E is the point order. The random integer is generated by a linear feedback shift register with the irreducible polynomial x^17 + x^15 + x^8 + x^5 + 1. These approaches to preventing power analysis attacks are shown in Fig. 4.
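Both countermeasures can be modeled at the algorithm level. The sketch below makes several assumptions that are not taken from the text: the LFSR is modeled in Fibonacci form with one common tap convention for the stated polynomial, the 32-bit integer r is formed by collecting 32 successive output bits, and the group is a toy additive group of integers modulo 19 that stands in for an elliptic curve group of order #E = 19. All names are hypothetical.

```python
def lfsr17_step(state):
    """One shift of a 17-bit Fibonacci LFSR for x^17 + x^15 + x^8 + x^5 + 1
    (tap positions 17, 15, 8, 5; the convention is an assumption)."""
    fb = ((state >> 16) ^ (state >> 14) ^ (state >> 7) ^ (state >> 4)) & 1
    return ((state << 1) | fb) & 0x1FFFF


def random32(seed):
    """Collect 32 LFSR output bits into r (assumed way of forming the 32-bit integer)."""
    state, r = (seed & 0x1FFFF) or 1, 0   # an all-zero state would lock up the LFSR
    for _ in range(32):
        state = lfsr17_step(state)
        r = (r << 1) | (state & 1)
    return r


def double_and_add_always(k, P, add, identity):
    """Every key bit triggers one doubling and one addition; the addition
    result is discarded (dummy operation) when the bit is 0."""
    Q, ops = identity, 0
    for bit in bin(k)[2:]:
        Q = add(Q, Q)                  # point doubling
        T = add(Q, P)                  # point addition, always executed
        Q = T if bit == "1" else Q     # keep or drop the addition result
        ops += 2
    return Q, ops


ORDER = 19                                  # stand-in for the point order #E
add_mod = lambda x, y: (x + y) % ORDER      # toy group law, not EC arithmetic
k = 13                                      # toy private key
k_blind = k + random32(seed=0x1ACE) * ORDER # blinded key k + r * #E
```

Running `double_and_add_always` with `k` and with `k_blind` on the same element gives the same result, while the processed bit pattern changes with every new r. The Python conditional is not constant-time, so the sketch only models the operation schedule and not the electrical behavior.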
IV. IMPLEMENTATION AND MEASUREMENT RESULTS
Our proposed 521-bit dual-field ECC processor is fabricated in 90nm 1P9M CMOS technology. Adopting the circular shift registers improves the hardware utilization from 60% to 88%, leading to a core area reduction of 0.26 mm^2. According to the measurement results, the chip performs the ECSM over GF(p_521) in 19.2ms with 58.5mW and over GF(2^409) in 8.2ms with 86.4mW.
Fig. 5 and Fig. 6 show the unprotected/protected chip micrograph and the experimental setup for power measurement. Fig. 7(a) shows the SPA of the unprotected ECC chip. The 1 and 0 key values can be disclosed by checking the peak differences of the power traces, where the positive and negative peak differences are 112mV and 57mV on average. Due to the regular calculation, the peak differences of our protected chip, shown in Fig. 7(b), are reduced to below the 20mV noise level. Fig. 8(a) illustrates the DPA of the unprotected chip, where spikes in the correlation coefficients emerge from 500 power traces. Note that smaller spikes appear at distinct key bits because the point doubling is always present. However, the spikes are not given in Fig. 8(b).
Table III gives the chip summary and comparison. Compared with the previous 160-bit dual-field design [4], our proposed ECC processor achieves 521-bit operations and meets the real-time requirement with a similar gate count. The GF(2^n) arithmetic is usually faster than the GF(p) arithmetic due to carry-free addition [2]. By using our proposed unified division without dummy operations, a binary-field-only ECC design can perform the GF(2^409) ECSM in 2.37ms with only 96K gates according to post-layout simulations. In contrast to [7], which uses a switched capacitor to prevent power analysis attacks, our approach is advantageous in system speed and power consumption.
V. CONCLUSION
This paper presents a hardware-efficient ECC processor that supports arbitrary 521-bit GF(p)/GF(2^n) operations with power analysis resistance. Both speed and efficiency are improved by introducing a new unified division algorithm and a highly integrated, fully pipelined GFAU. By processing the domain conversion on-the-fly and blinding the key with dummy operations, our ECC processor achieves real-time performance and protection against power analysis attacks without a large degradation of system speed. The chip has been fabricated in 90nm CMOS technology and measured to perform the ECSM in 19.2ms over GF(p) and 8.2ms over GF(2^n) with a core area of 0.55 mm^2.
