I. INTRODUCTION
RSA is a widely used public-key cryptosystem. RSA encryption is a one-way function that cannot feasibly be reversed without knowing the private key [1]. Custom hardware implementations are better suited to the RSA cryptosystem when efficiency in area and speed is required [2].
In this paper, a hardware architecture of the RSA cryptosystem is proposed and implemented on Xilinx FPGA families. The implementation uses Montgomery Modular Multiplication (MMM) [3] with Carry Save Adder (CSA) [4] based logic and number representation to speed up the calculations.
Side-channel attacks [5] are attacks based on information that is retrieved from the device but is neither the plaintext nor the ciphertext. Power Analysis (PA) attacks [5] are a type of passive side-channel attack. In these attacks, the power consumption of the circuit is measured while the device is performing an encryption or decryption. The private key, or information about the private key, is retrieved by analyzing the measurement data. There are two types of PA attacks: Simple Power Analysis (SPA) attacks and Differential Power Analysis (DPA) attacks [6].
There are hardware and algorithmic countermeasures against PA attacks. Itoh et al. proposed an algorithmic countermeasure against DPA attacks, the Randomized Table Window Method (RT-WM), in [7]. RT-WM has been implemented as a countermeasure in this work. This paper presents a differential power analysis resistant hardware implementation of the RSA cryptosystem. Section II and Section III explain the mathematical background, namely the fundamentals of RSA and of MMM architectures, respectively. Section IV presents the basics of side-channel attacks and gives details about power analysis attacks and the countermeasures against them. Section V explains the DPA resistant implementation of the RSA cryptosystem. Section VI reviews the paper and gives the conclusions.
II. THE RSA CRYPTOSYSTEM
The RSA cryptosystem was developed by Rivest, Shamir, and Adleman in 1977 [1]. RSA is a public-key cryptosystem that serves both for encryption-decryption and digital signature.
RSA encryption and decryption are performed by modular exponentiation as $C = M^E \bmod N$ and $M = C^D \bmod N$, respectively, where M is the plaintext, C is the ciphertext, N and E are the public keys, D is the private key and
A. The m-ary Method
The m-ary method reduces the number of multiplications processed in an exponentiation [8]. The exponent E is scanned r bits at a time, where $m = 2^r$ and $sr = k$, with k the bit length of E. A preprocessing step is necessary for the exponentiation process, in which the powers $M^w \bmod N$ for w from 2 to m − 1 are calculated [8], [2]. The m-ary method is given in Algorithm 1.
Algorithm 1 The m-ary Method (left to right)
Input: $N = (n_{k-1}, \ldots, n_1, n_0)_2$, $E = (e_{k-1}, \ldots, e_1, e_0)_2$, $M = (m_{k-1}, \ldots, m_1, m_0)_2$
Output: $C = M^E \bmod N$
1: Compute and store $M^w \bmod N$ for w = 2, 3, ..., m − 1.
2: Decompose E into r-bit words $F_i$ for i = 0, 1, ..., s − 1, where sr = k.
3: $C = M^{F_{s-1}} \bmod N$
4: for i from s − 2 downto 0
5:   $C = C^{2^r} \bmod N$
6:   if $F_i \neq 0$ then $C = C \cdot M^{F_i} \bmod N$
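The steps of Algorithm 1 can be sketched in software as follows. This is only an illustrative model, not the hardware design of this paper: the function name `m_ary_pow`, the default window size, and the toy RSA parameters (N = 3233, E = 17, D = 2753) are assumptions chosen for the example.

```python
def m_ary_pow(M, E, N, r=2):
    """Left-to-right m-ary modular exponentiation with m = 2**r (sketch)."""
    m = 1 << r
    # Step 1: precompute M^w mod N for w = 0 .. m-1 (w = 0, 1 are trivial).
    table = [1 % N, M % N]
    for w in range(2, m):
        table.append((table[-1] * M) % N)
    # Step 2: split E into r-bit words F_0 .. F_{s-1}, F_0 least significant.
    k = max(E.bit_length(), 1)
    s = -(-k // r)  # ceil(k / r); E is implicitly zero-padded to s*r bits
    F = [(E >> (r * i)) & (m - 1) for i in range(s)]
    # Step 3: start from the most significant word.
    C = table[F[s - 1]]
    # Steps 4-6: r squarings per word, then at most one table multiplication.
    for i in range(s - 2, -1, -1):
        for _ in range(r):
            C = (C * C) % N
        if F[i] != 0:
            C = (C * table[F[i]]) % N
    return C


# Toy RSA round trip with assumed textbook-sized parameters.
N, E, D = 3233, 17, 2753
ciphertext = m_ary_pow(65, E, N, r=2)
plaintext = m_ary_pow(ciphertext, D, N, r=3)
```

A larger r lowers the number of table multiplications in the main loop at the cost of a larger precomputed table.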
III. MONTGOMERY MODULAR MULTIPLICATION
In 1985 Montgomery introduced a new method for modular multiplication [3]. Montgomery's approach avoids the time-consuming trial division that is a bottleneck for most other algorithms. His method is very efficient and is the basis of many implementations of modular multiplication, both in software and hardware [9].
The Montgomery algorithm computes the result by replacing the division operation with k divisions by a power of 2, where a, b and N are k-bit binary numbers. Thus, not only the computation time but also the area is reduced in hardware implementations. MMM is defined as $R = a b r^{-1} \bmod N$, where $r = 2^k$. The real multiplicands a and b need to be transformed into their N-residues, such as $\bar{a} = a r \bmod N$. Hence, if two N-residue numbers are multiplied by MMM as $\bar{R} = (ar)(br) r^{-1} \bmod N = abr \bmod N$, the result is also an N-residue number. A post-processing step is needed, in which $\bar{R}$ and 1 are the multiplicands of the MMM: $R = (abr) \cdot 1 \cdot r^{-1} \bmod N = ab \bmod N$.
Walter proposed a slight modification of the original MMM algorithm that omits the final subtraction operation [10]. Our implementation uses this MMM algorithm with no final subtraction, as given in Algorithm 2.
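The idea of a Montgomery product without final subtraction, together with the conversion into and out of the N-residue domain, can be illustrated with a small bit-serial software model. This is a sketch under stated assumptions rather than the algorithm listing of this paper: the name `monpro_nfs` is hypothetical, the factor $r = 2^{k+2}$ is taken from the input specification of Algorithm 3, and the 12-bit modulus is a toy value.

```python
def monpro_nfs(x, y, n, k):
    """Bit-serial Montgomery product x*y*2^-(k+2) mod n (sketch).

    n must be odd. For x, y < 2n the result stays below 2n, so no
    final subtraction is performed.
    """
    t = 0
    for i in range(k + 2):
        t += ((x >> i) & 1) * y   # add y when bit i of x is set
        if t & 1:                 # make t even by adding the odd modulus
            t += n
        t >>= 1                   # exact division by 2
    return t


n = 3233                      # toy odd modulus (assumption for the example)
k = n.bit_length()            # 12 bits
r = 1 << (k + 2)              # Montgomery factor 2^(k+2)
a, b = 65, 1234

a_bar = (a * r) % n           # transform operands into N-residues
b_bar = (b * r) % n
prod_bar = monpro_nfs(a_bar, b_bar, n, k)   # N-residue of a*b, below 2n
result = monpro_nfs(prod_bar, 1, n, k)      # post-processing: multiply by 1
```

The last call corresponds to the post-processing step described above: multiplying the N-residue by 1 removes the factor r.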
Algorithm 2 MMM with No Final Subtraction (MonPro NFS)
Adders are necessary for MMM, namely for steps 4 and 6 of Algorithm 2. A CSA is especially suitable for large operands [4]. It is an appropriate way of reducing three k-bit operands to two k-bit operands. The result is in carry save representation (C, S).
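The three-to-two reduction performed by a CSA can be modelled bitwise as below. This is a minimal software sketch; the function name `carry_save_add` is an assumption for illustration and does not describe the hardware structure used in this work.

```python
def carry_save_add(a, b, c):
    """Reduce three operands to a (carry, sum) pair with carry + sum = a + b + c."""
    s = a ^ b ^ c                              # per-bit sum without carries
    cy = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, weighted by 2
    return cy, s


cy, s = carry_save_add(0b1011, 0b0110, 0b1101)
```

Because every bit position is computed independently, no carry propagates across the word, which is why the delay does not grow with the operand length.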
One final addition has to be performed to reduce the result from two k-bit operands to one k-bit operand, that is, to convert back to the normal number representation. In this work, a Carry Ripple Pipelined Adder (CRPA) has been used for this operation.
A CRPA processes k-bit operands word by word in k/w clock cycles using a w-bit carry ripple adder (CRA), as shown in Fig. 1.
The MMM block has been realized with the MonPro NFS CSA algorithm, which is given as Algorithm 3.
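The word-serial behaviour of such an adder can be sketched in software as follows. The function name `crpa_add` and the returned cycle counter are assumptions for illustration; the model adds one w-bit word per loop iteration and passes the carry on to the next word, mirroring the k/w clock cycles stated above.

```python
def crpa_add(a, b, k, w):
    """Add two k-bit operands one w-bit word at a time (sketch).

    Returns (sum mod 2^k, carry out, number of word additions).
    """
    assert k % w == 0, "k is assumed to be a multiple of w"
    mask = (1 << w) - 1
    carry, result, cycles = 0, 0, 0
    for j in range(k // w):
        wa = (a >> (j * w)) & mask
        wb = (b >> (j * w)) & mask
        total = wa + wb + carry          # one w-bit carry ripple addition
        result |= (total & mask) << (j * w)
        carry = total >> w               # carry into the next word
        cycles += 1
    return result, carry, cycles


total, carry_out, cycles = crpa_add((1 << 512) - 1, 1, k=512, w=16)
```

For k = 512 and w = 16 the model performs 32 word additions.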
Algorithm 3 MMM with No Final Subtraction using CSA Representation (MonPro NFS CSA)
Input: $N = (n_{k-1}, \ldots, n_1, n_0)_2$, $XC = (xc_{k+1}, \ldots, xc_1, xc_0)_2$, $XS = (xs_{k+1}, \ldots, xs_1, xs_0)_2$, $YC = (yc_{k+1}, \ldots, yc_1, yc_0)_2$, $YS = (ys_{k+1}, \ldots, ys_1, ys_0)_2$, $r = 2^{k+2} \bmod N$, $n_0 = 1$.
Output: $(TC, TS) = (XC, XS)\,(YC, YS)\, r^{-1} \bmod N$
1: $TC = 0$, $TS = 0$
2: for i from 0 to k + 1
3:   $x_i = xc_i + xs_i$, $(C1, S1) = TC + TS + x_i\, YC_0$, $(C2, S2) = C1 + S1 + x_i\, YS_0$
4:   if $s2_0 = 0$ then $(TC, TS) = (C2 + S2)/2$
5:   else $(TC, TS) = (C2 + S2 + N)/2$
A. Implementation Results
MonPro NFS CSA takes k + 2 clock cycles. The maximum frequency of the implementation on Xilinx XC2V2000E for k = 512 is 140.96 MHz, so one multiplication takes 3.65 μs, resulting in a throughput rate of 140.41 Mb/s. When implemented on Xilinx XC2V4000 for k = 1024, the maximum frequency achieved is 129.05 MHz, the total time is 7.95 μs, and the throughput rate is 128.80 Mb/s. As shown in Table I, the resulting throughput rates are higher than those of [11], [12], [13], and almost the same as that of [14], which are also architectures using CSAs to realize Montgomery multipliers.
Addition with the CRPA takes k/w clock cycles, where k is the key length and w is the word length of the CRPA. The word length w was chosen according to the optimum frequency in the synthesis results. In order not to make the exponentiation slower than the MMM block, w = 16 was chosen.
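The reported times and throughput rates follow directly from the cycle count k + 2 and the maximum clock frequency. The short derivation below reproduces them. The symbols $t_{MMM}$ and $f_{max}$ are notation introduced here for illustration, and the throughput is assumed to be k bits per multiplication time.

```latex
\begin{align*}
t_{MMM} &= \frac{k + 2}{f_{max}}, \qquad \text{throughput} = \frac{k}{t_{MMM}} \\[4pt]
k = 512:\quad  t_{MMM} &= \frac{514}{140.96\ \text{MHz}} \approx 3.65\ \mu\text{s},
  \qquad \frac{512\ \text{bit}}{3.646\ \mu\text{s}} \approx 140.4\ \text{Mb/s} \\[4pt]
k = 1024:\quad t_{MMM} &= \frac{1026}{129.05\ \text{MHz}} \approx 7.95\ \mu\text{s},
  \qquad \frac{1024\ \text{bit}}{7.950\ \mu\text{s}} \approx 128.8\ \text{Mb/s}
\end{align*}
```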
IV. SIDE-CHANNEL ATTACKS
In cryptography, an attack based on side-channel information is called a "side-channel attack". Side-channel information is information that can be retrieved from the cryptographic device and is neither the plaintext nor the ciphertext [5]. Active attacks, also referred to as tampering attacks, require access to the internal circuitry of the attacked device [15].
In passive attacks, the effects of the processing device are measured and used to retrieve the private key. There are four main types, according to the kind of output that is observed: Timing Analysis [16], Power Analysis [6], Electromagnetic Analysis [17], [18], and Acoustic Analysis [19]. All passive attacks can be either simple or differential. The difference is that in simple analysis attacks the attacker needs only one measurement, whereas in differential analysis attacks the attacker needs numerous measurements and a statistical evaluation of them.
A. Power Analysis Attacks
Power Analysis (PA) attacks are based on analyzing the power consumption of the cryptographic device while it performs encryption or decryption [6]. The physical basis of these attacks is that Complementary Metal Oxide Semiconductor (CMOS) technology is today the most commonly used technology for digital integrated circuit implementations. The power consumed during a transition of a CMOS gate is not the same for 0 → 1 transitions and 1 → 0 transitions: 0 → 1 transitions consume more power than 1 → 0 transitions.
V. RANDOMIZED TABLE WINDOW METHOD
In this work, the Randomized Table Window Method (RT-WM) [7] has been implemented as the countermeasure against DPA attacks. The main difference from the window method is that RT-WM uses randomized data inside the table instead of sequential powers of M.

Algorithm 4 Randomized Table Window Method (RT-WM)
The subtrahend containing the random number is shifted left by t bits (t < b) in every step, which creates an overlapping. Using the values in the table, the rest of the algorithm consists of t squarings followed by one multiplication with a table value, repeated until all windows have been scanned. A final multiplication is needed for the normalization. This algorithm adds preprocessing time and requires additional memory for the table. An extra subtraction module is not necessary if an adder is already being used within the RSA. Figure 2 shows the I/O ports, blocks, connections, and important registers inside the RSA implementation.
For the RT-WM algorithm given in Algorithm 4, which is applied as a countermeasure against DPA attacks in this work, the number of items in the ω[i] array is:
This gives the number, ω count, of comparisons and subtractions in preprocessing phase 1. One comparison takes one clock cycle and, since the existing CRPA is used for the subtractions, one subtraction costs w clock cycles. The 2nd phase of the preprocessing calculates M r mod N, M dm mod N and M. The table has $2^t$ k-bit items and it takes $(2^t - 1)$ MonPro calculations to fill the table. Since one MonPro calculation takes (k + 2) clock cycles in the proposed design, the total time spent in the preprocessing calculations becomes $(k - b)(w + 1) + (2 + b + 2^t - 2)(k + 2)$ clock cycles. The RT-WM algorithm, realized with a 512-bit key length, a 2-bit window length and a 3-bit random number, needs 404276 clock cycles in total, which is an overhead of 11.8% compared to the m-ary method.
A. Implementation Results
The RT-WM algorithm has been realized with a 512-bit key length, a 2-bit window length, and a 3-bit random number on Xilinx XCV2600E. A throughput of 18.43 Kb/s and an area of 22712 slices are achieved. The maximum clock frequency is 14.55 MHz. The total encryption process takes 27.79 ms.
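The exponentiation time and throughput can be cross-checked from the total cycle count of 404276 reported in Section V and the maximum clock frequency. The sketch below is only a consistency check: the function name `exponentiation_metrics` is hypothetical, and the throughput is assumed to be the key length divided by the exponentiation time.

```python
def exponentiation_metrics(cycles, f_hz, k):
    """Return (time in seconds, throughput in bit/s) for one exponentiation."""
    t = cycles / f_hz
    return t, k / t


# 512-bit key, 404276 clock cycles, 14.55 MHz maximum clock frequency.
t_s, throughput = exponentiation_metrics(404276, 14.55e6, 512)
time_ms = round(t_s * 1e3, 2)                 # about 27.79 ms
throughput_kbps = round(throughput / 1e3, 2)  # about 18.43 Kb/s
```

The same function can be evaluated for any other clock frequency, since the cycle count depends only on the algorithm parameters.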
Virtex-E family FPGAs incorporate large block SelectRAM memories, in which the data widths of the ports can be configured and the routing is optimized. The RT-WM algorithm needs 8 × 513 bits for the randomized table values with the chosen parameters (Section V), which were realized with registers. The carry and save pairs need to be placed in different RAM blocks in order to have read/write access to both in the same clock cycle. Therefore, two RAM blocks of 513-bit data length and 4 entries have been defined.
All implementation results on XCV1000E are given in Table II. The clock speed increased from 14.55 MHz to 66.66 MHz, which raises the average-case throughput from 18.43 Kb/s to 84.42 Kb/s. The total exponentiation time is reduced from 27.79 ms to 6.06 ms. Both the time and the area cost are reduced by using block SelectRAM.
VI. CONCLUSIONS
We have implemented the RSA cryptosystem on an FPGA in a way that is resistant to DPA attacks. This work is the first FPGA implementation of the RSA cryptosystem that is resistant to power analysis attacks. Modular exponentiation is realized with MMM.
The Montgomery modular multiplier has been realized with CSAs. A CSA is an appropriate way of reducing three k-bit operands to two k-bit operands. Hence, throughout the algorithm, each number is represented by a sum and carry pair. At the end of the square and multiply algorithm, the numbers in the resulting pair are added to form the result. We give comparisons with previous Montgomery multiplier architectures that also used CSAs.
The Randomized Table Window Method (RT-WM) has been applied in order to obtain a differential power analysis resistant hardware implementation. This is the first hardware realization of RT-WM. The protected implementation achieved a clock frequency of 66.66 MHz, a throughput of 84.42 Kb/s, and a total exponentiation time of 6.06 ms, and occupied an area of 10986 slices with the use of the built-in block SelectRAM structure inside the XCV1000E FPGA.
