Abstract: Power consumption limits the application of public key cryptosystems in portable devices. This paper proposes a low power design of 1,024-bit RSA. At the algorithm level, the Chinese Remainder Theorem (CRT) and an improved Montgomery algorithm are selected to reduce the computation of RSA. At the architecture and circuit level, the operand isolation technique is applied to avoid unnecessary switching of the combinational logic, and the clock gating technique is used to reduce the power dissipation of the registers. The proposed design is functionally verified on an Altera FPGA EP2C8Q208C8N device. With the SMIC 0.18μm CMOS process, the Synopsys synthesis result shows that the area and the critical path are 7.1k gates and 5.3ns respectively, while the power is 2.56mW and the throughput can reach 49 kbps. Thus the proposed design requires lower power than previous designs.
I. INTRODUCTION
As information technology develops, the security of information becomes more and more important. Public key cryptography is becoming the preferred solution for information security because of its advantages in the distribution and management of keys. Nowadays, the RSA cryptosystem [1] is the most widely used public key cryptosystem, and its security is based on the Integer Factorization Problem (IFP). The primary operation of the RSA signature is modular exponentiation, which can be realized with modular multiplication. The Montgomery algorithm [2] realizes modular multiplication with addition and shift, making the VLSI implementation of modular multiplication possible.
One of the most critical indices for portable devices is power consumption. The length of the operands of the RSA cryptosystem may be up to 1,024 bits or even 2,048 bits to achieve the required security level. However, the large area and high power consumption usually make the RSA cryptosystem impractical for battery-powered and passive devices. Therefore, low power design is the main challenge in applying the RSA public key cryptosystem to portable devices. The rest of this paper is organized as follows: Section II describes the algorithm-level design; Section III presents the architecture- and circuit-level design; Section IV compares the results of the proposed design with previous work; and Section V gives the summary and conclusion.
II. ALGORITHM SELECTED

A. RSA Modular Exponentiation
The major operation is modular exponentiation when the RSA cryptosystem performs encryption and decryption. That means the implementation of RSA cryptography is actually the implementation of modular exponentiation. There are several methods to realize modular exponentiation, for example the binary, m-ary and sliding window methods. For low power design, the binary method is preferred because of its simplicity, low memory requirement and low power consumption.
There are two ways to execute the binary modular exponentiation, namely the left-to-right method and the right-to-left method. They are shown in Algorithm 1, where E is the key; N = P×Q is the modulus (P and Q are the two primes); M is the message to be encrypted and C is the ciphertext. The conditional multiplication step for key bit E_i reads: if E_i = 1 then C = C×M (mod N). A reasonable assumption can be made that half of the bits of E are logic "1", so an n-bit modular exponentiation requires 1.5n modular multiplications on average.
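As an illustration, the two scanning orders can be sketched in Python as follows. This is a minimal sketch rather than the paper's Algorithm 1 listing: the function names and the `mults` counter are introduced here for illustration only, and the key is assumed to be a non-negative Python integer.

```python
def modexp_left_to_right(m, e, n):
    """Binary exponentiation scanning the key from its most significant bit.

    Returns (m**e mod n, number of modular multiplications performed).
    """
    c = 1
    mults = 0
    for i in range(e.bit_length() - 1, -1, -1):
        c = (c * c) % n          # one squaring per key bit
        mults += 1
        if (e >> i) & 1:         # conditional step: if E_i = 1 then C = C*M mod N
            c = (c * m) % n
            mults += 1
    return c, mults


def modexp_right_to_left(m, e, n):
    """Binary exponentiation scanning the key from its least significant bit."""
    c = 1
    s = m % n                    # running square M^(2^i) mod N
    mults = 0
    while e:
        if e & 1:                # multiply the running square into the result
            c = (c * s) % n
            mults += 1
        s = (s * s) % n          # one squaring per key bit
        mults += 1
        e >>= 1
    return c, mults
```

For a 4-bit key such as E = 0b1010, in which half of the bits are "1", both functions perform 6 = 1.5×4 modular multiplications, in line with the average-case estimate above.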
B. Chinese Remainder Theorem (CRT)
Chinese Remainder Theorem (CRT) [3] can realize a 2n-bit RSA modular exponentiation with two n-bit modular exponentiations, which can effectively reduce both time complexity and power consumption of RSA cryptosystem.
As mentioned above, the major operations of an n-bit modular exponentiation are about 1.5n modular multiplications on average. The most effective modular multiplication algorithm so far needs three n-bit multiplications to realize an n-bit modular multiplication (this will be discussed later). Thus 4.5n n-bit multiplications are required to realize an n-bit modular exponentiation, and 9n 2n-bit multiplications are required for a 2n-bit modular exponentiation.
Comparing the complexity of one 2n-bit modular exponentiation with that of two n-bit modular exponentiations, it is found that the numbers of multiplications are the same (9n in both cases), but the complexity of an n-bit multiplication is approximately a quarter of that of a 2n-bit multiplication. That means the complexity of an RSA modular exponentiation without CRT is about 4 times higher than that of an RSA modular exponentiation with CRT.
Low power consumption is the most critical requirement in portable device design. The Chinese Remainder Theorem can reduce the amount of computation for RSA modular exponentiation to about a quarter of that of the original RSA modular exponentiation, and this reduction significantly decreases the power consumption. CRT is therefore chosen for our design.
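The counting argument above can be written out as a short worked derivation. The symbol c(k), denoting the cost of one k-bit multiplication, is introduced here only for illustration, and the quadratic cost model c(2n) ≈ 4c(n) is the assumption stated in the text.

```latex
\begin{align*}
\text{without CRT:}\quad & 1.5\,(2n) \times 3 = 9n \ \text{multiplications of } 2n \text{ bits},\\
\text{with CRT:}\quad    & 2 \times (1.5\,n \times 3) = 9n \ \text{multiplications of } n \text{ bits},\\
\frac{\text{cost with CRT}}{\text{cost without CRT}}
  &\approx \frac{9n \cdot c(n)}{9n \cdot c(2n)}
   = \frac{c(n)}{c(2n)} \approx \frac{1}{4},
  \qquad \text{assuming } c(2n) \approx 4\,c(n).
\end{align*}
```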
Algorithm 2 describes the Chinese Remainder Theorem, where M is the message to be signed; P and Q are the two primes with P<Q, and N = P×Q; E is the private key; 0 < A, B < Q-1 with A×P ≡ 1 (mod Q) and B×Q ≡ 1 (mod P). The method used to execute the Chinese Remainder Theorem in Algorithm 2 is Single-Radix Conversion (SRC). To realize RSA modular exponentiation with SRC, four modular multiplications and one modular addition with modulus N must be performed. This is a drawback for hardware implementation, because one more modulus means not only one more input parameter, but also additional pre-computation. Moreover, two modular inverses must be computed in the pre-computation, and both results must be added to the list of input parameters.
The Mixed-Radix Conversion (MRC) was first proposed by H. L. Garner in 1958, and then improved by D. E. Knuth. To execute CRT with MRC, the modular operations with modulus N are unnecessary and only one modular inverse is required. Both of these improvements are good for hardware implementation.
Algorithm 3 describes the Chinese Remainder Theorem with Mixed-Radix Conversion, where M is the message to be signed; P and Q are the two primes with P<Q, and N = P×Q; E is the private key; 0 < A < Q-1 and A×P ≡ 1 (mod Q).
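The SRC recombination can be sketched in Python as follows. This is a hedged illustration, not the paper's Algorithm 2 listing: the function name is hypothetical, the exponents are reduced modulo P-1 and Q-1 as in the usual CRT formulation (an assumption, since the listing is not reproduced in the text), M is assumed to be coprime to N, and Python 3.8 or later is assumed for `pow(x, -1, m)`.

```python
def rsa_crt_src(m, p, q, e):
    """RSA exponentiation M^E mod N via CRT with Single-Radix Conversion."""
    n = p * q
    # Pre-computation: two modular inverses, both needed as parameters.
    a = pow(p, -1, q)            # A*P = 1 (mod Q)
    b = pow(q, -1, p)            # B*Q = 1 (mod P)
    # Two half-length modular exponentiations (assumed reduced exponents).
    mp = pow(m % p, e % (p - 1), p)
    mq = pow(m % q, e % (q - 1), q)
    # Recombination: four modular multiplications and one modular addition,
    # all with the full-length modulus N.
    t1 = (mp * b) % n
    t1 = (t1 * q) % n
    t2 = (mq * a) % n
    t2 = (t2 * p) % n
    return (t1 + t2) % n
```

The sketch makes the hardware cost visible: the recombination works on operands as long as N, and both A and B have to be supplied.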
Algorithm 3: Chinese Remainder Theorem (MRC)
Input: M, P, Q, E, A
Output:
The purpose of the extra addition of Q is to make sure that the intermediate result is positive. To realize CRT with MRC, only one modular multiplication with modulus P and one multiplication are required. More importantly, with MRC we do not need the parameters N and B of Algorithm 2. This improvement reduces the size of the required memory and eliminates the hardware for the modular operations with modulus N. Because N = P×Q, the length of N is normally twice the length of P and Q. Eliminating the modular operations with modulus N therefore means that all the modular operations have moduli of the same length, so they can be realized with the same hardware.
On the other hand, the performance of RSA modular exponentiation can be improved by Chinese Remainder Theorem as well. In high speed applications, step 3 and step 4 can be executed in parallel, reducing the time complexity of modular exponentiation to only a quarter of the original value.
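A Garner-style MRC recombination can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's Algorithm 3 listing: the function name is hypothetical, the exponents are reduced modulo P-1 and Q-1 as in the usual CRT formulation, M is assumed to be coprime to N, and the single modular multiplication is reduced modulo Q, which is what the stated condition A×P ≡ 1 (mod Q) implies.

```python
def rsa_crt_mrc(m, p, q, e, a):
    """RSA exponentiation M^E mod N via CRT with Mixed-Radix Conversion.

    Requires p < q and a*p = 1 (mod q). N and B are not needed.
    """
    # The two half-length exponentiations are independent of each other,
    # so hardware could run them in parallel.
    mp = pow(m % p, e % (p - 1), p)
    mq = pow(m % q, e % (q - 1), q)
    # Adding Q keeps the difference positive, because mp < p < q.
    h = ((mq - mp + q) * a) % q   # the only modular multiplication
    return mp + h * p             # one plain multiplication, result < p*q
```

For example, with P = 53, Q = 61 and A = `pow(53, -1, 61)`, the function returns the same value as `pow(m, e, 53 * 61)` while never reducing modulo N.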
C. Improved Montgomery Algorithm
The main operation of RSA encryption and decryption is modular exponentiation, and modular exponentiation consists of modular multiplications, so the actual implementation of RSA cryptography is the implementation of modular multiplication.
The general process of modular multiplication consists of two steps: computing T = A×B and reducing T to yield M = A×B (mod N). The traditional reduction, which is implemented with a division operation, is not easy to implement in VLSI because no good solution for the VLSI implementation of division has been proposed yet. Several modular multiplication algorithms without division have been proposed to replace the traditional reduction. Among them, the two most popular are the Montgomery algorithm and the Barrett algorithm.
The Montgomery algorithm is the most widely used as well as the most efficient modular multiplication algorithm in practice. It realizes the reduction with addition and shift operations, which are much easier for VLSI implementation. It computes $ABR^{-1}$ (mod N) instead of AB (mod N). When using it in practice, the constant $R^{-1}$ should be eliminated, which means that Algorithm 4 should be executed one more time.
The primary operation in the hardware implementation of the Montgomery algorithm is to divide T by 2 (because $TR^{-1} = T \cdot 2^{-n}$), and division by 2 can be realized by a right shift. It should be remembered that the shift operations are performed modulo N, so if T is odd, T = T + N should be performed before the shift operation. After n divisions by 2, the result t is obtained. The result t satisfies 0 ≤ t < 2N-1, but it is not the final result of the modular multiplication: it is $T/2^{n}$ mod N instead of T mod N, so the final result cannot be obtained until another conversion, a multiplication by $2^{n}$, is performed. The hardware implementation of the original Montgomery algorithm is shown in Algorithm 5, where A and B are the two n-bit binary operands and A can be represented as $A = (a_{n-1}, a_{n-2}, \ldots, a_1, a_0)$.
The original Montgomery modular multiplication is difficult to realize in VLSI because the operands in cryptosystem applications are usually very large. When the length of the integers is 1,024 bits or 2,048 bits, even the simplest operations, such as addition and shift, become impractical to implement directly. As a result, the big operands are usually divided into a series of small operands, so that operations on big operands can be executed.
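The add-and-shift reduction described above can be sketched in Python as a bit-serial (radix-2) Montgomery multiplication. This is a minimal sketch, not the paper's Algorithm 5 listing: the function names are hypothetical, N is assumed to be odd, A and B are assumed to be smaller than N, and the constant $R^2$ mod N used for the second pass is assumed to be available as a pre-computed input.

```python
def montgomery_radix2(a, b, n, nbits):
    """Return a*b*2^(-nbits) mod n using only additions and right shifts."""
    t = 0
    for i in range(nbits):
        if (a >> i) & 1:         # add B when the current bit of A is 1
            t += b
        if t & 1:                # make T even so the shift is exact modulo N
            t += n
        t >>= 1                  # division by 2
    if t >= n:                   # intermediate result is below 2N
        t -= n
    return t


def modmul_via_montgomery(a, b, n, nbits):
    """Plain a*b mod n obtained by running the Montgomery step a second time."""
    r2 = pow(1 << nbits, 2, n)   # R^2 mod N, assumed pre-computed
    t = montgomery_radix2(a, b, n, nbits)     # a*b*R^-1 mod n
    return montgomery_radix2(t, r2, n, nbits) # removes the factor R^-1
```

The second call illustrates why the algorithm has to be executed one more time: multiplying by $R^2$ mod N in the Montgomery domain cancels the unwanted factor $R^{-1}$.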
Apparently, the number of arithmetic units can be reduced when the big operands are divided into small operands, and many registers can also be saved because the storage of the big intermediate results will become unnecessary.
Various methods have been proposed to improve the Montgomery algorithm based on this idea [4]. With these modified algorithms, it is much easier to implement Montgomery modular multiplication, and higher performance can be achieved with lower power.
The FIPS algorithm [4] proposed by Koç is one of those modified algorithms suitable for VLSI implementation, especially for implementations with digital signal processors. The most important feature of this algorithm is that there is no need to obtain the full result of T = A×B. The less significant words of the temporary value and the more significant words of the intermediate result can be stored in the same memory units, and the temporary value and the result can share the same memory address, too. As a result, it reduces the size of the memory as well as the number of operations. Both the cost of hardware and the power consumption are therefore lowered.
The length of the accumulator can then be decided. For our design, the accumulator stores the largest temporary result when performing Step A of Algorithm 4. When i = s-1, the largest temporary result is the sum of (1), (2) and the result of S/r from iteration i = s-2. Equations (1) and (2) show that there will be 2s-1 32-bit numbers to be accumulated, which means that the sum will be less than $(2s-1) \times 2^{32} = 2^{38} - 2^{32}$ (that is, s = 32). After performing the shift S/r, the length of the initial value in the accumulation register is 32 bits, and hence the final result will be less than $2^{38}$. The length of the adder can then be chosen as 38 bits. Because the width of the SRAM used in this design is 16 bits, making the register length a multiple of 16 makes the shifting of operands much easier. As a result, the length of the accumulating register is 48 bits. The data path of the modular multiplier is shown in Fig. 1.
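To make the accumulator sizing concrete, a product-scanning Montgomery multiplication in the style of FIPS can be sketched in Python. This is an illustration under stated assumptions, not the paper's listing: the function name is hypothetical, a word size of w = 16 bits and s = 32 words (512-bit operands) are assumed from the figures above, N is assumed to be odd, and a single Python integer `acc` stands in for the hardware accumulator so that its peak width can be observed.

```python
def fips_montgomery(a, b, n, s=32, w=16):
    """Return (a*b*2^(-s*w) mod n, peak accumulator width in bits)."""
    r = 1 << w
    mask = r - 1
    n0_neg_inv = (-pow(n, -1, r)) % r          # -N^-1 mod r (pre-computed)
    aw = [(a >> (w * k)) & mask for k in range(s)]
    bw = [(b >> (w * k)) & mask for k in range(s)]
    nw = [(n >> (w * k)) & mask for k in range(s)]
    m = [0] * s                                # quotient words, later reused for the result
    acc = 0
    peak = 0
    for i in range(s):                         # low half: choose m[i] to clear each column
        for j in range(i):
            acc += aw[j] * bw[i - j] + m[j] * nw[i - j]
        acc += aw[i] * bw[0]
        m[i] = ((acc & mask) * n0_neg_inv) & mask
        acc += m[i] * nw[0]                    # low word of acc is now zero
        peak = max(peak, acc)
        acc >>= w                              # the shift S/r
    for i in range(s, 2 * s):                  # high half: columns yield the result words
        for j in range(i - s + 1, s):
            acc += aw[j] * bw[i - j] + m[j] * nw[i - j]
        peak = max(peak, acc)
        m[i - s] = acc & mask                  # overwrite a quotient word no longer needed
        acc >>= w
    result = sum(word << (w * k) for k, word in enumerate(m)) + (acc << (w * s))
    if result >= n:                            # final conditional subtraction
        result -= n
    return result, peak.bit_length()
```

With 16-bit words and s = 32, the reported peak width stays within 38 bits, consistent with the adder length chosen above. The product T = A×B is never stored in full, and the result words overwrite the array `m`, which mirrors the memory sharing described for FIPS.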
B. Operand Isolation
The power consumption can be divided into two parts: static and dynamic power consumption. The static power is determined mainly by the fabrication process, where the designers have little influence. The computations of the combinational logic and the toggling of the registers account for most of the dynamic power. As a result, the general idea of low power design is to eliminate the unnecessary computations of the combinational logic and the redundant toggling of the registers.
Combinational logic circuits are used to realize different logic and arithmetic functions. In general, no circuit is expected to be working all the time; ideally, it should work only when its result is needed. However, this is usually not the case, because a combinational logic circuit begins to compute as soon as its input signals change. Fig. 2 shows the structure of an ALU for a RISC CPU, where the arithmetic circuit, the logical circuit, the shift circuit and the comparison circuit are connected in parallel. When the input signals change, all four parts work together, but at most one of the four results will be used. That means the circuits consume power without generating any useful data. In a low power design, the circuits should therefore not work until their results are needed. The only way to stop a combinational circuit from working is to keep its inputs unchanged, and an effective method of doing so is operand isolation.
The idea of operand isolation is to keep the inputs of the combinational logic constant when it is not being used, so that the logic stays quiet and the power consumed during the idle period is reduced. Fig. 3 illustrates the operand isolation of the modular multiplier. The shaded blocks are the isolation modules. The inputs of both the multiplier and the adder stay constant when a shift is being performed. Only the inputs of the former stay constant when additions and subtractions are performed.
Fig. 3. The operand isolation of the modular multiplier
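The effect of operand isolation on input activity can be illustrated with a small behavioural model in Python. This is a toy sketch, not a model of the paper's circuit: the function name is hypothetical, and the isolation module is assumed to hold the last used operand (latch-style isolation), whereas the actual isolation cells in Fig. 3 are not specified in the text.

```python
def input_toggles(operands, enables, isolate):
    """Count bit toggles seen at the input of a combinational block.

    operands: value on the shared bus in each cycle
    enables:  1 when the block's result is actually used in that cycle
    isolate:  if True, the block input holds its last value while disabled
    """
    seen = 0                     # value currently at the block input
    toggles = 0
    for value, en in zip(operands, enables):
        nxt = value if (en or not isolate) else seen
        toggles += bin(seen ^ nxt).count("1")   # Hamming distance = toggled bits
        seen = nxt
    return toggles


bus = [0b1111, 0b0000, 0b1010, 0b0101]
used = [1, 0, 0, 1]
without_isolation = input_toggles(bus, used, isolate=False)   # 14 toggles
with_isolation = input_toggles(bus, used, isolate=True)       # 6 toggles
```

In this example the block is needed in only two of the four cycles, and isolating its inputs removes the toggles caused by the two cycles in which the bus carries data for other units.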
C. Clock Gating
In digital logic circuits, the clock signal has the maximum fan-out and the highest activity, so the power consumed by the clock signal is an important part of the total power consumption of the whole circuit. Fig. 4 is the functional schematic of a D-type flip-flop. It shows that the clock signal needs to drive four transfer gates in every D-type flip-flop. When the transfer gates are switched, the charging and discharging of their capacitance consumes dynamic power, regardless of whether the value of the flip-flop changes. There can be thousands of flip-flops in a large design, so the power consumed in all these flip-flops is a problem for low power design.
Fig. 5 shows the principle of clock gating. The upper part of Fig. 5 shows the equivalent schematic of a D-type flip-flop with a synchronous active-high enable signal. It can be seen that the enable signal acts as the control signal of a 2-to-1 multiplexer, but it has no influence on the clock signal, which is connected to the control inputs of the transfer gates. This means that the clock in this flip-flop consumes power regardless of whether the enable signal is high. Reducing the clock activity therefore effectively reduces the unnecessary power dissipation. One solution is to insert clock gating. When clock gating is inserted, the clock net of the flip-flop is kept at logic "0" while the flip-flop is disabled, as shown in the lower part of Fig. 5, so the power dissipated in switching the transfer gates is saved. The clock tree consumes a large share of the power in an ASIC design, so reducing clock activity notably reduces the total power.
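The difference between an enable multiplexer and a gated clock can be illustrated with a small behavioural model in Python. This is a toy sketch with hypothetical function names: it counts clock pulses delivered to flip-flop clock pins and does not model capacitance, glitches or the gating cell itself. The register width of 48 is taken from the accumulator described earlier; the enable pattern is made up for illustration.

```python
def clock_pulses_delivered(enables, num_flops, gated):
    """Clock pulses reaching the clock pins of a register of num_flops bits."""
    if gated:
        active_cycles = sum(1 for en in enables if en)   # clock held at 0 when disabled
    else:
        active_cycles = len(enables)                     # clock toggles every cycle
    return active_cycles * num_flops


def register_trace(data, enables):
    """Stored value per cycle; identical with or without clock gating."""
    q = 0
    trace = []
    for d, en in zip(data, enables):
        if en:
            q = d                # register updates only when enabled
        trace.append(q)
    return trace


enables = [1, 0, 0, 1, 0, 0, 0, 0]
ungated = clock_pulses_delivered(enables, num_flops=48, gated=False)  # 8 * 48
gated = clock_pulses_delivered(enables, num_flops=48, gated=True)     # 2 * 48
```

The stored values are the same in both cases; only the number of clock pulses that reach the flip-flops differs, which is where the saving comes from.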
IV. THE RESULT AND COMPARISON
The proposed design has been implemented in Verilog HDL, simulated with ModelSim 6.2b and synthesized with Synopsys Design Compiler using the SMIC 0.18μm process. The result shows that the critical path of the design is 5.3ns, so the clock frequency of the design can be up to 188MHz, and its area is 7.1k gates. It takes about 3.9M clock cycles to finish a 1,024-bit RSA signature, so the throughput of this design is about 49kbps at 188MHz. When working at 188MHz, it consumes 2.56mW. Implemented on an Altera FPGA EP2C8Q208C8N device, the proposed design uses 2,439 logic elements and can work at a frequency of 47.32MHz.
The performance and power consumption of the proposed design are compared with previous works in the comparison tables. From these tables, we can see that the proposed design has a lower power/throughput ratio, which means it requires less power than the previous designs when working at the same speed.
The proposed RSA processor is implemented with Cadence SoC Encounter 8.1 in the SMIC 0.18μm CMOS process and is integrated in a smartcard with anti-counterfeiting capability. The two SRAM blocks are generated by the SMIC SRAM Generator and the EEPROM IP is S018EE16KBS_LPI from SMIC. The design can execute 1,024-bit RSA digital signatures as well as user data read/write. The layout of the smartcard with the proposed RSA processor is shown in Fig. 6.
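The reported figures can be cross-checked with a few lines of arithmetic in Python. The input values are taken from the text; the variable names and the energy-per-signature figure are derived here for illustration and are not reported in the paper.

```python
critical_path_ns = 5.3          # synthesis result
cycles_per_signature = 3.9e6    # clock cycles for one 1,024-bit signature
bits_per_signature = 1024
clock_hz = 188e6                # operating frequency used in the text
power_mw = 2.56                 # power at 188 MHz

# Maximum clock frequency implied by the critical path, in MHz.
f_max_mhz = 1e3 / critical_path_ns

# Time for one signature and the resulting throughput.
time_per_signature_s = cycles_per_signature / clock_hz
throughput_kbps = bits_per_signature / time_per_signature_s / 1e3

# Derived quantity (not in the paper): energy spent per signature, in microjoules.
energy_per_signature_uj = power_mw * 1e-3 * time_per_signature_s * 1e6
```

The maximum frequency comes out slightly above 188 MHz and the throughput slightly above 49 kbps, which agrees with the rounded values quoted above; one signature takes roughly 21 ms.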
V. CONCLUSION
The proposed 1,024-bit RSA design achieves ultra low power by using the Chinese Remainder Theorem, an improved Montgomery algorithm and several low power techniques. The synthesis result shows that it has a performance of 49kbps at 188MHz while consuming only 2.56mW, and the area is only 7.1k gates. The low power and small area make it suitable for smartcards and portable devices.
