Abstract-In this paper, we present the design and implementation of a systolic RSA cryptosystem based on a modified Montgomery's algorithm and the Chinese Remainder Theorem (CRT) technique. The CRT technique improves the throughput rate up to 4 times in the best case. The processing unit of the systolic array has 100% utilization because of the proposed block interleaving technique for multiplication and square operations in the modular exponentiation algorithm. For 512-bit inputs, the number of clock cycles needed for a modular exponentiation is about 0.13M to 0.24M. The critical path delay is 6.13ns using a 0.6µm CMOS technology. With a 150 MHz clock, we can achieve an encryption/decryption rate of about 328 to 578 Kb/s.
I. INTRODUCTION
Electronic communication technology has advanced at a very fast pace during the past few decades, creating new applications and opportunities along the way. Today we can send a multimedia message to, or receive one from, virtually anyone around the world in seconds through the Internet. To protect the transmitted data from eavesdropping by anyone other than the intended receiver, we need to disguise the message before sending it into the insecure communication channel. This is achieved by a cryptosystem. In 1978, Rivest, Shamir, and Adleman invented a method to implement the public-key cryptosystem, which is known as the RSA cryptosystem [1]. It provides high security and is easy to implement, so it quickly became the most widely used public-key cryptosystem. However, developing an inexpensive hardware device for real-time RSA encryption and decryption is still a challenge. Finding an efficient hardware implementation for RSA is one of the important tasks that remain to be done.
In the RSA cryptosystem, both encryption and decryption are modular exponentiations, which can be done by a sequence of modular multiplications. Many modular multipliers have been proposed in the past [2]-[4]. One of the most important algorithms was proposed by Montgomery [5]. Montgomery's algorithm assumes an odd modulus and calculates the modular multiplication without performing division. It needs n iterations in each modular multiplication and two additions per iteration, where n is the word length. Cellular arrays based on Montgomery's algorithm can be found in [6]-[8]. Some improved algorithms were later proposed, either by reducing the number of iterations or by removing the final modular reduction step. A modified Montgomery's algorithm was reported in [9], where the multiplication and modular reduction steps in Montgomery's algorithm are separated such that only one addition is required in each iteration. However, the number of iterations in the modified algorithm is two times that of Montgomery's, hence the overall computation time is not reduced. In [10, 11], the algorithm was further modified to reduce the number of iterations, doubling the speed of modular multiplication. Another way to reduce the computation time is to use the Chinese Remainder Theorem (CRT) technique, since CRT is known to reduce the RSA computation by a divide-and-conquer method.
In this paper, we present the design and implementation of a systolic RSA cryptosystem based on a modified Montgomery's algorithm and the CRT technique. The systolic array has a 100% utilization ratio because of a novel interleaving approach. The throughput of the proposed design is much higher than that of previous ones. For example, for 512-bit inputs, the number of clock cycles needed for a modular exponentiation is about 0.13M to 0.24M. The layout and post-layout simulation of the 512-bit RSA chip have been completed, and the critical path delay is 6.13 ns using a 0.6 µm CMOS technology. Under a 150 MHz clock, the decryption rate is about 328 to 578 Kb/s.
II. MODULAR EXPONENTIATION ALGORITHM
To encrypt a message using the encryption key $(E, N)$, we first partition the message into a sequence of blocks and consider each block $M$ as an integer between 0 and $N-1$. Then, we encrypt the message by raising $M$ to the $E$th power modulo $N$, i.e., $C = M^E \bmod N$. Similarly, to decrypt the ciphertext $C$ using the decryption key $(D, N)$, we raise $C$ to the power of $D$ modulo $N$, i.e., $M = C^D \bmod N$. Clearly, modular exponentiation is the main operation of the RSA algorithm. For modular exponentiation, a sequence of modular multiplications can be performed instead. If we transform the exponent $E$ into its binary representation $(e_{k-1} e_{k-2} \cdots e_1 e_0)_2$, then
$$M^E \bmod N = M^{\sum_{i=0}^{k-1} e_i 2^i} \bmod N = \prod_{i=0}^{k-1} \left(M^{2^i}\right)^{e_i} \bmod N.$$
Clearly, this can be done by a sequence of squaring and multiplication operations, by scanning the bits of E either from high-order to low-order bits or in the opposite direction. These two methods are known as the H-Algorithm and L-Algorithm, respectively. Both of them require n iterations for an n-bit exponent. Each iteration contains two modular multiplications.
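The two scanning orders can be illustrated with a minimal Python sketch. The function names `modexp_h` and `modexp_l` are hypothetical, and ordinary `%` reduction stands in for the modular multiplier, so this shows only the control flow of the two methods, not the hardware algorithm.

```python
def modexp_h(m, e, n):
    """H-algorithm sketch: scan the exponent bits from high order to low order."""
    result = 1 % n
    for bit in bin(e)[2:]:                 # most significant bit first
        result = (result * result) % n     # square in every iteration
        if bit == "1":
            result = (result * m) % n      # multiply when the bit is set
    return result


def modexp_l(m, e, n):
    """L-algorithm sketch: scan the exponent bits from low order to high order."""
    p = 1 % n          # running product
    s = m % n          # running square, m^(2^i) in iteration i
    while e > 0:
        if e & 1:
            p = (p * s) % n                # uses the current square
        s = (s * s) % n                    # does not depend on p
        e >>= 1
    return p


if __name__ == "__main__":
    print(modexp_h(65, 17, 3233), modexp_l(65, 17, 3233))
```

In the L-algorithm the multiplication and the squaring of one iteration both read the same running square and do not depend on each other, which is the property that allows them to be processed side by side.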
A. Modular Multiplication
Our approach is based on the foundation established in [9]-[11]. Suppose $N = (n_{n-1} \cdots n_1 n_0)$ is an $n$-bit odd integer. Let $A$ and $B$ be two other $n$-bit integers, where $0 \le A, B < N$. We extend $A$ and $B$ to $(n+1)$ bits by appending a leading zero to the MSB, so $0 \le A, B < 2^{n+1}$. Let $X = A \times B$, where $X = (x_{2n+1} \cdots x_1 x_0)$.
The modular multiplication algorithm is as follows.
MM(A, B, N)
The advantage of Procedure MM() is that the partial products are always in the range $[0, 2^{n+1})$, which is the same as the input range. Hence, the modular reduction is removed. Based on MM() and the L-algorithm, we obtain the modular exponentiation algorithm MLA() as shown below. Note that by letting $R = 2^{n+2}$ and $C = R^2 \pmod N$, the final result $P_{k+1}$ will be equal to $M^E \pmod N$, i.e., we let $M_0 = M \times 2^{n+2} \pmod N$ and $P_0 = 2^{n+2}$ initially to guarantee that the final result is $M^E \pmod N$.
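The range property and the role of $R$ and $C$ can be checked with a small Python sketch. It is not the procedure listing of this paper: `mm` is a hypothetical bit-serial Montgomery-style multiplication that forms $X = A \times B$, reduces its low $n+2$ bits with one addition and one right shift per step, and then adds the high part of $X$; `mla` is a hypothetical L-algorithm wrapper. The initial value of P is taken as MM(1, C, N), which is congruent to $2^{n+2}$ modulo $N$; that choice is an assumption of the sketch.

```python
def mm(a, b, n_mod, nbits):
    """Sketch: returns a value congruent to a*b*2^-(nbits+2) mod n_mod (n_mod odd)."""
    m = nbits + 2
    x = a * b                          # multiplication step
    s = 0
    for i in range(m):                 # reduction step on the low m bits of x
        x_i = (x >> i) & 1
        q_i = (s & 1) ^ x_i            # makes s + x_i + q_i*n_mod even
        s = (s + x_i + q_i * n_mod) >> 1
    return s + (x >> m)                # add the high part; no final subtraction


def mla(msg, e, n_mod, nbits):
    """Sketch of L-algorithm exponentiation built on mm()."""
    r = 1 << (nbits + 2)               # R = 2^(n+2)
    c = (r * r) % n_mod                # C = R^2 mod N, precomputed
    m_i = mm(msg, c, n_mod, nbits)     # congruent to msg * R
    p_i = mm(1, c, n_mod, nbits)       # congruent to R (assumed initial value)
    while e > 0:
        if e & 1:
            p_i = mm(m_i, p_i, n_mod, nbits)
        m_i = mm(m_i, m_i, n_mod, nbits)
        e >>= 1
    # Post-process: strip the factor R; "% n_mod" is a safety reduction in this sketch.
    return mm(p_i, 1, n_mod, nbits) % n_mod


if __name__ == "__main__":
    print(mla(65, 17, 3233, 12) == pow(65, 17, 3233))
```

For inputs below $2^{n+1}$ the value returned by `mm` stays below $2^{n+1}$, because the reduced low part is at most $N$ and the high part of $X$ is below $2^n$, so outputs can be fed back as inputs without a correction step.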
//Post-process
return $P_{k+1}$;
B. RSA Computation Using CRT
The Chinese Remainder Theorem (CRT) can be stated as follows.
Theorem 1 (Chinese Remainder Theorem)
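Let $m_0, m_1, \ldots, m_{n-1}$ be pairwise relatively prime positive integers and let $x_0, x_1, \ldots, x_{n-1}$ be any integers. Then the linear congruence system in one variable given by
$$x \equiv x_i \pmod{m_i}, \quad 0 \le i \le n-1,$$
has a unique solution modulo $m_0 m_1 \cdots m_{n-1}$.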
The RSA decryption and signature operations can be sped up by using the CRT, where the factors of the modulus $N$ (i.e., $P$ and $Q$) are assumed to be known. By CRT, the computation of $M = C^D \pmod N$ can be partitioned into two parts:
$$M_P = C_P^{D_P} \pmod P, \qquad M_Q = C_Q^{D_Q} \pmod Q,$$
where
$$C_P = C \bmod P, \quad C_Q = C \bmod Q, \quad D_P = D \bmod (P-1), \quad D_Q = D \bmod (Q-1).$$
This reduces computation time since $D_P, D_Q < D$ and $C_P, C_Q < C$. In fact, their sizes are about half the original sizes. In the ideal case, we can have a speedup of about 4 times. Finally, we compute $M$ by CRT as follows:
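One common way to carry out this last step is Garner's recombination, $M = M_Q + Q \cdot \left((M_P - M_Q)\, Q^{-1} \bmod P\right)$. The Python sketch below uses that form as an assumption; it is not necessarily the formula intended here. The function name `rsa_crt_decrypt` is hypothetical, and Python's built-in `pow` stands in for the hardware exponentiation.

```python
def rsa_crt_decrypt(c, d, p, q):
    """Sketch of CRT-based RSA decryption with Garner's recombination (assumed form)."""
    d_p = d % (p - 1)                  # half-size exponents
    d_q = d % (q - 1)
    c_p = c % p                        # half-size bases
    c_q = c % q
    m_p = pow(c_p, d_p, p)             # the two exponentiations are independent
    m_q = pow(c_q, d_q, q)
    q_inv = pow(q, -1, p)              # Q^-1 mod P, fixed once the key is generated
    h = (q_inv * (m_p - m_q)) % p
    return m_q + h * q                 # lies in [0, P*Q)


if __name__ == "__main__":
    p, q, e, d = 61, 53, 17, 2753      # small textbook key, for illustration only
    c = pow(65, e, p * q)
    print(rsa_crt_decrypt(c, d, p, q))
```

The factor of about 4 can be read off this structure under a simple cost model: each half-size exponentiation has about half as many iterations, each iteration works on operands of about half the length, and the two exponentiations can run at the same time.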
III. HARDWARE IMPLEMENTATION
Our design consists of a modular multiplier and some logic to control the data interleaving and scheduling mechanism. We follow the standard systolic array design flow [12] to obtain the systolic RSA cryptosystem according to the MM() and MLA() algorithms. After that, we add a partitioning circuit to our systolic RSA architecture to get two individual modular multipliers, which can be used for $M_P = C_P^{D_P} \pmod P$ and $M_Q = C_Q^{D_Q} \pmod Q$.
A. Multiplier
The dependence graph (DG), the hyperplanes, and the schedule vector [12] of the multiplier are shown in Fig. 1.
[Fig. 1: schedule of the multiplier over time steps T0 to T15]
The utilization rate of this multiplier is only 50%. However, this can be doubled by a novel block interleaving technique [13], processing two sets of data in the multiplier during each iteration cycle, i.e., MM($M_i$, $M_i$, $N$) and MM($M_i$, $P_i$, $N$).
B. Modular Reduction Unit
In the modular reduction operation, the value $q_i$ is generated by XORing the LSB of $S_i$ and $x_i$. We use an n-bit addition and right shifts to accomplish the modular operation. Fig. 2 shows the DG, hyperplanes, schedule vector, cell structures, and SFG of the modular reduction unit [13]. Note that the Z cell is the combination of the X and Y cells, so we need extra control logic to realize the cell function. This circuit can also be interleaved with two sets of data in a similar way to the multiplier discussed above. We combine the systolic array multiplier and the modular reduction unit to form a modular multiplier, which is shown in Fig. 3(a).
… (mod $Q$) using two independent modular multipliers concurrently. This clearly reduces the computation time. Note that the partition is done on-line according to the lengths of $P$ and $Q$, so each cell in the array must be able to read the data from the primary inputs directly, and be able to propagate the data to the primary outputs directly to finish the final step of the modular multiplication. Therefore, we use some switches, MUXes, and a control signal to achieve these goals. The multiplier is called the composite modular multiplier (CMM), and is shown in Fig. 3(b). The signal split, which is shifted into the register formed by chaining the D-FFs in all blocks, indicates the partition location.
The partition circuit is shown in Fig. 4(a). We extend the prime number $Q$ to an n-bit integer with leading zeros and shift it into the n-bit shift register Q_Reg initially, where n is the length of the modulus $N$, i.e., $n = \lceil \log_2 N \rceil$. The clock signal CLK is disabled until the MSB of Q_Reg is high. The two shift registers, Q_Reg and Split_Reg, perform the left-shift function. Finally, the information in Split_Reg indicates the partition location. Note that there is only a single bit that is high in Split_Reg. Thus, we have two smaller modular multipliers that can operate concurrently to perform the original modular multiplication.
It reduces the execution time for the RSA decryption and signature operations. In a typical RSA cryptosystem, once the keys are generated, they are fixed until the new keys are generated, i.e., P and Q are fixed after key generation. Therefore, the partition of N into P and Q needs to be done only once, before the RSA computation starts. 
D. CRT-Based RSA Cryptosystem
We have presented the systolic-array modular multiplier based on the CRT technique. We can add some control logic to accomplish the modular exponentiation operation. The circuit for the entire RSA computation that incorporates the CRT technique is shown in Fig. 5. The details of the CON_REG block are shown in Fig. 4(b), where the partition scheme is the same as in the CMM. The control signals are supposed to come from the host controller, so they are considered as primary inputs in the core circuit design. The four inputs in1p, in2p, in1q, and in2q are used only for the initialization and post-process steps. The control logic is responsible for timing management, data interleaving, iteration handling, etc.
As depicted in Fig. 6, the time complexity of the CRT-based RSA computation is highly dependent on the lengths of $P$ and $Q$. The best case occurs when $|P| = |Q| = |N|/2$. Suppose $N$ is an n-bit integer, and the lengths of $P$ and $Q$ are $p$ bits and $(n - p + 1)$ bits, respectively. The numbers of clock cycles required to finish a modular exponentiation using the original and the CRT-based approaches are listed, respectively, in Table I. From the table, if $P$ and $Q$ have equal length, i.e., $n/2$, the number of clock cycles required will be the smallest. Although this is not likely to occur in practice (since $P$ and $Q$ are prime and $P \ne Q$), their lengths should be as close as possible to ensure security. In the worst case, it is assumed that the maximum length of $P$ or $Q$ is $2n/3$ (and the minimum is $n/3$).
The circuit has been implemented with a 0.6 µm CMOS standard-cell library. The post-layout simulation result shows that the critical path delay is about 6.13 ns, so we expect the core to operate at a 150 MHz clock rate. For a 512-bit RSA cryptosystem, for example, the best case needs 0.13M clock cycles, while the worst case requires 0.24M cycles to finish an RSA operation. Under a 150 MHz clock, the data rate is 578 Kb/s in the best case and 328 Kb/s in the worst case. Fig. 7 shows the layout of the 512-bit CRT-based RSA cryptosystem. The core area is about 7653 µm × 7486 µm with a gate count of about 109K.
IV. CONCLUSIONS
In this paper, we have presented a systolic RSA cryptosystem design, based on an improved Montgomery's modular multiplication algorithm [13] and the CRT technique. The design is suitable for decryption and digital signature. The implementation of a 512-bit RSA cryptosystem with a 0.6 µm CMOS standard-cell library has been completed. The comparison of our implementation with other designs is summarized in Table II, which shows a clear advantage of our approach in terms of speed. The number of clock cycles is 0.13M in the best case (when $|P| = |Q| = n/2$) and 0.24M in the worst case (when the length of $P$ or $Q$ is $n/3$). Therefore, the throughput of our design is the highest among all the designs, but with about 50% higher hardware cost. The circuit is easily extended to larger moduli due to the systolic-array design.
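As a rough cross-check of the quoted figures, the data rate follows from the block length, the clock frequency, and the cycle count per exponentiation. The Python sketch below uses the rounded cycle counts stated in the text; the helper name `throughput_kbps` is hypothetical.

```python
def throughput_kbps(block_bits, clock_hz, cycles_per_op):
    """Data rate in Kb/s when one block of block_bits is processed every cycles_per_op cycles."""
    ops_per_second = clock_hz / cycles_per_op
    return block_bits * ops_per_second / 1e3


# 512-bit blocks at 150 MHz with the rounded cycle counts 0.13M and 0.24M
best = throughput_kbps(512, 150e6, 0.13e6)
worst = throughput_kbps(512, 150e6, 0.24e6)

if __name__ == "__main__":
    print(f"best  case: {best:.0f} Kb/s")
    print(f"worst case: {worst:.0f} Kb/s")
```

With the rounded counts this gives roughly 591 Kb/s and 320 Kb/s, within a few percent of the reported 578 Kb/s and 328 Kb/s. The small gap is presumably due to the rounding of the cycle counts to two significant digits.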
