Abstract-Nowadays, differential power-analysis (DPA) attacks are a serious threat to cryptographic systems because of their inherent data-dependent power consumption. Hiding the power consumption of the encryption circuit or applying key-blinded techniques can increase the security against DPA attacks, but both result in a large overhead in hardware cost, execution time, and energy dissipation. In this brief, a new DPA countermeasure that performs all field operations in a randomized Montgomery domain is proposed to eliminate the correlation between target and reference power traces. Implemented in a 90-nm CMOS process, our protected 521-bit dual-field elliptic curve (EC) cryptographic processor can perform one EC scalar multiplication in 8.08 ms over GF($p_{521}$) and 4.65 ms over GF($2^{409}$), with 4.3% area and 5.2% power overhead. Experiments on a field-programmable gate array evaluation board demonstrate that the private key of the unprotected device is revealed within $10^3$ power traces, whereas the same attacks on our proposal cannot extract the key value even after $10^6$ measurements.
I. INTRODUCTION

ELLIPTIC curve (EC) cryptography (ECC), described in IEEE P1363 [1] and FIPS PUB 186-3 [2], has been widely applied to provide a confident scheme for information exchange. Over the past several years, many works [3]-[6] on ECC hardware implementation have aimed at performance improvement. However, even if the ECC is secure against cryptanalysis, the private data of an unprotected hardware device can be extracted by physical attacks due to side-channel leakage. The power-analysis attacks, initially presented by Kocher [7], can reveal the key value by analyzing the power information of a cryptographic implementation such as an application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or microprocessor.
During device processing, simple power-analysis (SPA) attacks can distinguish the key value through visual inspection because specific circuits are active under direct hardware scheduling. The double-and-add-always method [8], [9] is usually used to avoid the variation of power consumption over time. However, differential power-analysis (DPA) attacks [10], which compute the correlation between the target power traces and a power model, can still reveal the key value because key-dependent operations exist in every round of calculation.
The authors are with the Department of Electronics Engineering and the Institute of Electronics, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: jenweilee@gmail.com; hcchang@si2lab.org; cylee@si2lab.org).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSII.2012.2190857
Hiding techniques based on algorithm-independent dedicated circuits are a common approach to prevent attackers from collecting the key-dependent characteristics of power traces. In [11], a wave dynamic differential logic circuit with a regular routing algorithm is exploited to equalize the current between rising and falling transitions. However, more than 200% overhead in area, performance, and power consumption is added to the unprotected encryption engines. A switched capacitor [12] is able to isolate the encryption core from the external power supplies, but this approach results in a 50% speed loss for replenishing charge every cycle. To avoid the throughput degradation, a countermeasure circuit using digitally controlled ring oscillators [13] is placed outside the critical path. The concept is to generate random noise power that dominates the power consumption of the arithmetic unit, so that no correlation peak is found even when the correct key value is matched. However, this demands an extra 100% power overhead for the key-dependent processing element.
At the algorithm level, masking the processed data so that it is independent of the power consumption is another approach to avoid DPA attacks. For the ECC schemes, since the scalar of the point calculation is periodic with the point order #E, a key-blinded technique can be adopted to change the key value by adding r · #E for every calculation, where r is a random integer. However, with this method, a throughput overhead is inevitable because the key length is extended. In [9], the point calculation of a 521-bit key extended with a 32-bit random value needs 10% more execution time than the unprotected approach.
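The periodicity that key blinding relies on can be checked on a toy curve. The following Python sketch is illustrative only: the curve $y^2 = x^3 + 2x + 2$ over GF(17), the base point (5, 1), and all function names are assumptions chosen for readability and are not parameters from [9]. The order used for blinding is the order of the base point, found here by brute force.

```python
P_MOD, A = 17, 2      # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumed example)
BASE = (5, 1)         # a point on the toy curve; None stands for the point at infinity

def ec_add(p1, p2):
    """Affine point addition, including doubling and the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                    # P + (-P)
    if p1 == p2:
        slope = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        slope = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (slope * slope - x1 - x2) % P_MOD
    y3 = (slope * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def ec_mul(k, point):
    """Plain left-to-right double-and-add (no side-channel protection)."""
    acc = None
    for bit in bin(k)[2:]:
        acc = ec_add(acc, acc)
        if bit == "1":
            acc = ec_add(acc, point)
    return acc

def point_order(point):
    """Smallest n > 0 with n * point equal to the point at infinity."""
    n, acc = 1, point
    while acc is not None:
        acc = ec_add(acc, point)
        n += 1
    return n

def blinded_mul(k, point, order, r):
    """Scalar multiplication with the blinded key k + r * order."""
    return ec_mul(k + r * order, point)

order = point_order(BASE)
same = all(blinded_mul(k, BASE, order, r) == ec_mul(k, BASE)
           for k in range(1, order) for r in range(4))
```

In this sketch `same` evaluates to true because the term r · order contributes only the point at infinity. The blinded scalar is longer than the original key by roughly the bit length of r, which is where the extra execution time comes from.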
In this brief, we propose a new efficient countermeasure to overcome the DPA attacks by computing the overall ECC functions in a randomized Montgomery domain. The feature of our approach is to mask the intermediate values not only in the arithmetic but also in the temporary registers. Thus, it is unnecessary to extend the key length, customize the circuit, or modify the routing algorithm in the ASIC or FPGA design flow. Since our proposed design adopts a simple logic circuit to counteract DPA attacks, the hardware cost overhead is significantly reduced, and the maximum operating frequency of the protected design is the same as that of the unprotected design using the conventional Montgomery algorithm. In addition, by reducing the iteration time of the division, which dominates the other field operations in computation time, the speed can be improved further.
The remainder of this brief is outlined as follows. DPA attacks applied on the ECC device are introduced in Section II. The proposed countermeasure method and design architecture are given in Sections III and IV, respectively. Section V shows the FPGA power measurements and ASIC implementation results. Section VI concludes this work.
II. DPA ATTACKS ON ECC DEVICE
For SPA resistance, the double-and-add-always approach given in Algorithm 1 is adopted to regularly perform the EC scalar multiplication (ECSM) $KP = P + \cdots + P$, where K is the m-bit private key and P is a point on the EC. However, the intermediate values of EC point doubling (ECPD) in Steps 3 and 4 depend on the zero and nonzero bits of the key value. Hence, with a chosen point P, the key value can be distinguished by matching the power trace segment of the ECPD calculation. Fig. 1 shows the scenario of DPA attacks. Before the statistical analysis, the power model is characterized for the two possible key bit values by measuring a sample device; the analysis then computes the correlation between the measured target power traces and the power model. If the target key bit matches one of the chosen key bits, the correlation value will be larger than that of the others because the operation and the processed data are the same. Through this approach, the overall binary key can be extracted after m − 1 rounds in linear time.
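The correlation step can be illustrated with a toy simulation. The Python sketch below is not the measurement setup of Fig. 1: the Hamming-weight leakage, the Gaussian noise level, the random lookup tables standing in for key-dependent intermediate values, and all names are assumptions made for illustration. It only shows why the hypothesis that matches the processed data yields the larger correlation coefficient.

```python
import random

rng = random.Random(2012)   # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the key-dependent intermediate value of one round:
# each key-bit value maps the chosen input d to a different byte.
INTERMEDIATE = {bit: rng.sample(range(256), 256) for bit in (0, 1)}

def hamming_weight(v):
    return bin(v).count("1")

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def measure(secret_bit, d, sigma=1.0):
    """Simulated power sample: Hamming weight of the processed value plus noise."""
    return hamming_weight(INTERMEDIATE[secret_bit][d]) + rng.gauss(0.0, sigma)

def recover_bit(secret_bit, n_traces=2000):
    """Correlate the traces with the model of each hypothesis; pick the larger."""
    inputs = [rng.randrange(256) for _ in range(n_traces)]
    traces = [measure(secret_bit, d) for d in inputs]
    cc = {}
    for hyp in (0, 1):
        model = [hamming_weight(INTERMEDIATE[hyp][d]) for d in inputs]
        cc[hyp] = pearson(model, traces)
    return max(cc, key=cc.get), cc

guess, coefficients = recover_bit(secret_bit=1)
```

Repeating this decision once per key bit is what makes the attack linear in the key length.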
Algorithm 1 Double-and-add-always ECSM
Input: K and P
Output: KP
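A double-and-add-always ECSM can be sketched as follows. This is the common textbook form, in which one doubling and one addition are computed in every round and the key bit only selects which result is kept; the step numbering of Algorithm 1 may differ. The toy curve $y^2 = x^3 + 2x + 2$ over GF(17), the base point (5, 1), and the function names are assumptions for illustration.

```python
P_MOD, A = 17, 2      # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumed example)
BASE = (5, 1)         # None stands for the point at infinity

def ec_add(p1, p2):
    """Affine point addition, including doubling and the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        slope = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        slope = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (slope * slope - x1 - x2) % P_MOD
    y3 = (slope * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def ecsm_double_and_add_always(k, point):
    """Compute k * point for k > 0 with one doubling and one addition per key bit."""
    bits = bin(k)[2:]                 # most significant bit first, bits[0] == "1"
    acc = point
    for bit in bits[1:]:              # m - 1 rounds for an m-bit key
        acc = ec_add(acc, acc)        # doubling, executed in every round
        added = ec_add(acc, point)    # addition, executed in every round
        acc = added if bit == "1" else acc   # the key bit only selects the result
    return acc

def naive_mul(k, point):
    """Reference result by repeated addition."""
    acc = None
    for _ in range(k):
        acc = ec_add(acc, point)
    return acc
```

The operation sequence is the same for every key, which removes the SPA leakage, but the values being doubled still depend on the key bits, which is the leakage that DPA exploits.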
III. PROPOSED ALGORITHM AGAINST DPA ATTACKS
The fundamental concept of a DPA countermeasure is to break the dependence between intermediate values and power traces. To carry out the point calculation, the well-known Montgomery algorithm is usually adopted to perform the field arithmetic in a specific domain such that $A \equiv a \cdot 2^{m} \pmod{p}$, where a is in the integer domain and $2^{m}$ is the Montgomery constant with m-bit field length. In this brief, we introduce an approach that resists DPA attacks at the modular-arithmetic level by calculating the operands in a randomized Montgomery domain
$$A \equiv a \cdot 2^{\lambda} \pmod{p},$$
where the domain value λ equals the Hamming weight (HW) of an n-bit random value r. Note that n is the maximum field length, and the bit values of $\{r_{n-1}, r_{n-2}, \ldots, r_m\}$ are set to zero to prevent λ from exceeding m. Because the proposed method randomizes intermediate values in the basic modular operations, the double-and-add-always ECSM shown in Algorithm 1 against SPA attacks can still be applied, and there is no need for an external parameter such as the point order #E, which is not given in several protocols, including the Diffie-Hellman key exchange [1].
A. RMM
Algorithm 2 shows our proposed randomized Montgomery multiplication (RMM), which contains two operating steps in every iteration to change the intermediate domain value λ; these steps are selected by the ith bit of the random value r. If $r_i = 1$, the domain value of the output operand R decreases by one in Step 4; if $r_i = 0$, the domain value remains the same in Step 5. The functionality can be derived as follows:
$$R \equiv X \cdot Y \cdot 2^{-\mathrm{HW}(r)} \equiv X \cdot Y \cdot 2^{-\lambda} \pmod{p}.$$
Hence, the RMM can be performed in m iterations, the same as in conventional Montgomery multiplication [6].
Algorithm 2 Randomized Montgomery multiplication
Input: X, Y, p, and r
Output: R
5. else S = 2S (mod p)
6. Return R
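A bit-serial RMM consistent with the surviving steps of Algorithm 2 can be sketched in Python as follows. The initialization and the accumulation step are assumptions (they are not recoverable from the listing above), as are the function names and the representation of r as a list of m bits with the least significant bit first. In each iteration the accumulator R is halved when $r_i = 1$, and S is doubled otherwise, so that the result carries the factor $2^{-\lambda}$ with λ = HW(r).

```python
def half_mod(a, p):
    """Return a / 2 (mod p) for an odd modulus p."""
    return a // 2 if a % 2 == 0 else (a + p) // 2

def rmm_sketch(x, y, p, r_bits):
    """Return x * y * 2^(-lam) mod p, where lam = sum(r_bits).

    r_bits holds m bits of the random value r, least significant bit first,
    and y is assumed to fit in m bits.
    """
    r_acc, s = 0, x % p                    # assumed initialization
    for i, r_i in enumerate(r_bits):
        if (y >> i) & 1:                   # assumed accumulation of y_i * S
            r_acc = (r_acc + s) % p
        if r_i == 1:
            r_acc = half_mod(r_acc, p)     # domain value of R decreases by one
        else:
            s = (2 * s) % p                # domain value unchanged
    return r_acc

def to_domain(a, p, lam):
    """Map an integer a into the randomized Montgomery domain a * 2^lam mod p."""
    return (a << lam) % p

# Example with a small prime; the modulus and the bit pattern are arbitrary.
p = 8191                                   # 2^13 - 1, so m = 13
r_bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
lam = sum(r_bits)
product = rmm_sketch(to_domain(1234, p, lam), to_domain(567, p, lam), p, r_bits)
```

In this sketch `product` stays in the same randomized domain, that is, it equals `to_domain(1234 * 567 % p, p, lam)`. With all bits of r set to one, the routine reduces to a conventional bit-serial Montgomery multiplication with constant $2^m$.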
B. RMD
To achieve division in the Montgomery domain, Kaliski [14] first proposed an iterative algorithm that needs 2m iterations of successive reduction, m iterations for degree recovery (reducing the intermediate domain value λ to m when λ > m), and two additional Montgomery multiplications with a final modular reduction p − R. The algorithm presented in [14] is formulated from identical equations as follows:
Based on Kaliski's method, we derive a new randomized Montgomery division (RMD), which is described in Algorithm 3. To achieve the division directly, without the additional multiplications and the final modular reduction, our method modifies the initial values of (U, V, R, S) to (p, Y, 0, X) in Step 1 and uses modular subtraction in the RS data path in Steps 10, 11, 13, and 14. Then, the identities become
Similar to Algorithm 2, the RS data path between the Montgomery domain and the integer domain is determined by the ith bit value of r. The domain value of operands R and S increases by one when $r_i = 1$ and remains the same when $r_i = 0$.
To further reduce the degree recovery phase, the RS data path turns into dividing values by two in Steps 5, 8, 11, and 14 to keep the intermediate domain value at λ = HW(r) when i = m. Thus, the identities in Algorithm 3 are given as follows:
Before the last iteration, both U and V are one because the initial values of U and V are relatively prime. Then, after finishing the iterative operations in Step 2, the values of (U, V, R, S) become $(1, 0, X \cdot Y^{-1} \cdot 2^{\lambda} \bmod p, 0)$. As a result, the proposed randomized division algorithm requires only 2m iterations of successive reduction. Table I shows the expected operation time and the comparison with related works on modifying the Montgomery division algorithm. With randomization capability, Algorithm 3 will also benefit the hardware design owing to the low latency.
IV. DPA-RESISTANT DF-ECC PROCESSOR
Fig. 2 shows the block diagram of the proposed dual-field ECC (DF-ECC) processor with a standard advanced microcontroller bus architecture (AMBA) advanced high-performance bus (AHB) interface. For DPA resistance, all field operations over GF(p) and GF($2^m$) are performed by the Galois field arithmetic unit (GFAU) in a randomized Montgomery domain. The operating domain is determined by the value in the domain shift register, which is sourced from a 1-bit random number generator (RNG) and refreshed before the next ECSM calculation. For flexibility, we use an all-digital RNG utilizing the cycle-to-cycle time jitter in free-running oscillators with a synchronous feedback postprocessor [15]. The overall architecture of the DPA countermeasure circuit is shown in Fig. 3. To efficiently store the 521-bit operands, including the EC coefficients and points, a block memory is exploited as the register file. Moreover, to perform ECC schemes such as signature and key exchange in real time, the instruction decoder and the pre-/postprocessing of the domain conversion are combined in our DF-ECC processor.
Fig. 4 shows the detailed GFAU architecture. As the iterative operations in Algorithm 3 are performed in one cycle, the critical path is the calculation of R or S, which consists of the UV comparison followed by modular operations. The time-critical comparison U > V, achieved by a subtraction, has a delay nearly equal to that of an addition. Since the results of R and S do not affect the results of operands U or V, a fully pipelined stage can be inserted between the UV and RS data paths to shorten the critical path, where the critical path in Fig. 4 is path (1) + (2) before pipelining and path (2) after pipelining. Once the UV data path is determined, the next cycle sets the values of the operands R and S and simultaneously determines the next case until V = 0. Although one additional cycle is needed after pipelining, this is negligible because the operation takes hundreds or thousands of cycles. The timing flow of the pipelined scheme is shown in Fig. 5.
Moreover, to reduce the hardware cost, symmetric modular operations such as R = (R − S)/2 (mod p) and S = (S − R)/2 (mod p) in Algorithm 3 can be executed by the same computational unit with a swap logic circuit, which switches the input operands of the RS data path. In Algorithm 3, the RS data path can be classified into two groups: the first group includes Steps 4 and 5 and Steps 10 and 11, and the second one consists of Steps 7 and 8 and Steps 13 and 14. The data flows of R and S are switched when the processing group differs from the group in the previous cycle. Furthermore, since the point calculation is a series of field operations, both the temporary registers and the modular operations can be shared for the operands V, S, and R in Algorithm 2 and Algorithm 3. The modular operations, including addition/subtraction and shifting in the iterative loop, can be performed effectively by exploiting a carry-save adder (CSA) with a carry-propagation adder. Another benefit is that the additive operation for GF($2^m$) can be achieved by circuit integration because the sum output of a CSA equals two bitwise XOR operations.
Algorithm 3 Randomized Montgomery division
Input: X, Y, p, and r
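Since the steps of Algorithm 3 are not listed, the following Python sketch shows one RMD that is consistent with the description in Section III-B; it should be read as an assumption and not as the authors' exact listing. It starts from (U, V, R, S) = (p, Y, 0, X), follows the four cases of a Kaliski-type binary reduction, and in each case either lets the domain value grow ($r_i = 1$) or divides by two ($r_i = 0$, and always once the bits of r are exhausted). The sketch maintains the invariants $Y \cdot R \equiv U \cdot X \cdot 2^{k}$ and $Y \cdot S \equiv V \cdot X \cdot 2^{k} \pmod{p}$, which were derived for this sketch, with k the number of iterations in which $r_i = 1$. Function names and the bit-list representation of r are hypothetical.

```python
def half_mod(a, p):
    """Return a / 2 (mod p) for an odd modulus p."""
    return a // 2 if a % 2 == 0 else (a + p) // 2

def rmd_sketch(x, y, p, r_bits):
    """Return x * y^(-1) * 2^lam mod p, where lam = sum(r_bits).

    Assumes p is an odd prime, 0 < y < p, and len(r_bits) <= p.bit_length(),
    so that every bit of r is consumed before the loop ends.
    """
    u, v, r_acc, s = p, y, 0, x % p
    i = 0
    while v > 0:
        grow = i < len(r_bits) and r_bits[i] == 1   # domain value +1 this round
        if u % 2 == 0:
            u //= 2
            if grow:
                s = (2 * s) % p
            else:
                r_acc = half_mod(r_acc, p)
        elif v % 2 == 0:
            v //= 2
            if grow:
                r_acc = (2 * r_acc) % p
            else:
                s = half_mod(s, p)
        elif u > v:
            u = (u - v) // 2
            if grow:
                r_acc, s = (r_acc - s) % p, (2 * s) % p
            else:
                r_acc = half_mod((r_acc - s) % p, p)
        else:
            v = (v - u) // 2
            if grow:
                s, r_acc = (s - r_acc) % p, (2 * r_acc) % p
            else:
                s = half_mod((s - r_acc) % p, p)
        i += 1
    return r_acc

# Example with a small prime; the modulus and the bit pattern are arbitrary.
p = 8191                                   # 2^13 - 1, so m = 13
r_bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
lam = sum(r_bits)
quotient = rmd_sketch(1234, 567, p, r_bits)
```

With Y = 1 the same routine maps an integer a to $a \cdot 2^{\lambda} \bmod p$, which is how a division can double as the conversion into the randomized domain.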
Since the primary inputs of the EC coefficients and points are in the integer domain, the domain conversion can be performed by the proposed RMD such that $\mathrm{RMD}(a, 1) = a \cdot 2^{\lambda} \pmod{p}$. On the other hand, to return the point coordinates in the integer domain, the RMM can be exploited as $\mathrm{RMM}(a \cdot 2^{\lambda}, 1) = a \pmod{p}$. For calculating one ECSM in affine coordinates, the overhead of domain conversion is three RMD and two RMM operations; both of them can be performed by the GFAU to avoid any precomputation from the host system.
V. FPGA VERIFICATION AND IMPLEMENTATION RESULTS
Based on our proposed architecture, two different 256-bit DF-ECC processors are designed on an FPGA verification platform to evaluate the DPA resistance. The verification environment is shown in Fig. 6, and the performance results are given in Table II. Fig. 7(a) shows the DPA attacks on the unprotected Design I, which uses the conventional Montgomery algorithm, to reveal the first three bits of the key value "101." From $10^3$ measurements, the correlation coefficients of the correct hypothesis ($CC(R_{H_i}, T_{H_i})$) converge to about 0.6, while those of the incorrect hypothesis ($CC(R_{H_i}, T_{H_i})$) converge to values below 0.3, where H is the hypothesis of the binary key value. Hence, the key value can be distinguished from a difference of at least 0.3 in correlation coefficients. In contrast, after collecting $10^6$ power traces from the protected Design II, which uses the randomized Montgomery operations given in Algorithm 2 and Algorithm 3, the correlation coefficients of the correct and incorrect hypotheses shown in Fig. 7(b) are close to zero and cannot be distinguished. This means that the power model is uncorrelated with the target power traces, and the differences in correlation coefficients carry no biased information about the key value. Consequently, the statistical analysis of power measurements shows that the proposed countermeasure enhances the security against DPA attacks.
Our proposed 521-bit DF-ECC processor was implemented in UMC 90-nm CMOS technology. Moreover, to compare with related works, one 163-bit version of the DF-ECC processor and one 256-bit design over GF(p) were implemented on ASIC and FPGA, respectively. The layout and implementation results with comparisons are given in Fig. 8. In terms of area-time product, our DF-ECC processor outperforms other approaches. By reducing the division iteration time and randomizing the intermediate values in field arithmetic without increasing the key size, our work is at least 40% faster than the previous 521-bit design [9] with comparable hardware complexity. Compared with a four-multiplier-based ECC processor without power-analysis protection [5], our highly integrated GFAU architecture achieves competitive speed with 60% fewer gates. In [6], an unprotected design based on GF($2^{163}$) and a fixed polynomial is optimized for hardware speed. We design the ECC processor with dual-field support and apply the pipelining approach. The throughput achieved is two times higher than that reported in [6].
For DPA resistance, our approach masks the processed data so that it is uncorrelated with the power traces, without changing the logic family and without dominating the power consumption of key-dependent operations. From the comparison given in Table III, our proposed countermeasure is superior to the others not only in hardware cost but also in energy dissipation.
VI. CONCLUSION
In this brief, we have introduced a new randomized Montgomery algorithm which is suitable for ECC hardware implementation against DPA attacks. Without modifying the logic circuit, the relationship between the target power traces and the power model can be broken by performing field arithmetic in an unpredictable domain. A precomputation-free scheme has also been proposed to carry out the domain conversion immediately in support of real-time processing.
The proposed DPA countermeasure approach has been analyzed on an FPGA platform. Attacks on the unprotected designs reveal the private key within 1000 power traces, while the key value of the protected core cannot be extracted after one million power traces. Circuit overhead for randomly determining the operating domain can be integrated into the system without speed degradation. By using a UMC 90-nm technology, our protected 521-bit DF-ECC processor, with 4.3% area and 5.2% average power overhead, can perform one GF($p_{521}$) ECSM in 8.08 ms and one GF($2^{409}$) ECSM in 4.65 ms. We believe that both high performance and an efficient DPA countermeasure are achieved in our proposed DF-ECC processor.
