Abstract-RSA key generation is a major concern when implementing the RSA cryptosystem on embedded systems because of its long processing latency. In this paper, a novel architecture is presented that provides high-speed RSA key generation for embedded platforms with limited processing capacity. To exploit more data-level parallelism, the Residue Number System (RNS) is introduced to accelerate RSA key pair generation, since its independent residue elements can be processed simultaneously. A cipher processor based on Transport Triggered Architecture (TTA) is proposed to realize the parallelism at the architecture level. In addition, division is avoided in the proposed architecture, which remarkably reduces the cost of hardware implementation. The proposed design is implemented in Verilog HDL and synthesized in a 0.18 µm CMOS process. A rate of 3 key pairs per second is achieved for 1024-bit RSA key generation at a frequency of 100 MHz.
I. INTRODUCTION
Public key cryptography has gained great popularity in many applications such as smart cards and digital certificates. One of the most widely used public key algorithms is RSA [1]. However, the process of RSA key generation is complex and time-consuming. Some implementations [2] generate the RSA key pair on a desktop and upload the pair, or only the private key, into a smart card. Taking communication security and high-performance processing into consideration, the entire procedure of RSA key generation is preferably performed inside the cipher chip in order to guarantee the efficiency and the security of the applications. Nevertheless, the limited computational power of embedded systems cannot afford high-speed RSA key generation. This paper discusses how to provide high-speed RSA key generation on embedded platforms with limited processing capacity.
An on-card implementation is presented in [3], which takes up to 6.82 seconds to create a key pair. A scalable hardware architecture is proposed in [4]; the authors present an architecture for the multiple-word radix-2 Montgomery multiplication algorithm and design a processing element (PE). A new algorithm that correctly identifies every positive integer tested as being either prime or composite is considered in [5]. The work in [6] shows a simple way to substantially reduce the value of hidden constants to provide much more efficient prime generation algorithms. The authors apply their techniques to various contexts (DSA primes, safe primes, ANSI X9.31-compliant primes, strong primes, etc.) and show how to build fast implementations on appropriately equipped smart cards, thus allowing on-board key generation. A very efficient recursive algorithm for generating nearly random provable primes is presented in [7]. The expected time for generating a prime is only slightly greater than the expected time required for generating a pseudo-prime of the same size that passes the Miller-Rabin test for only one base. The realization of the Miller-Rabin big-number primality test in assembler on Texas Instruments digital signal processors of the TMS320C54x family is considered in [8]. Applying these modules, a considerably higher level of system security can be achieved compared with software-only security systems. The work in [9] moves on to a very simple new deterministic test and then discusses various ways of constructing so-called strong primes. However, these algorithms generally focus on improving computational complexity, and how to apply them on chip remains to be solved.
We propose an efficient solution that accelerates RSA key pair generation through both data-level and instruction-level parallelism, in which RNS and TTA are closely combined. The advantage of the RNS Montgomery multiplication algorithm [10] is that large-number multiplications can be divided into small elements, and the independent elements can be processed simultaneously. The advantage of Transport Triggered Architecture (TTA) [11] is that it can be used as an application-specific processor, especially as a coprocessor for different DSP applications in an SoC. Compared with a traditional ASIC, TTA is more flexible in application and consumes less silicon area. Compared with a traditional DSP processor, it offers higher efficiency and better performance. In this paper, a function unit (FU) called "MMAC" is designed to match the high parallelism of RNS Montgomery multiplication and to reduce redundant data transmission. An optimized architecture, called TTA-like, is thus proposed to meet the demand for high performance. To reduce the cost of hardware implementation, division is eliminated from the process through an appropriate selection of algorithms. This improvement makes it possible to carry out RSA key pair generation in a pure hardware design, which is quite different from prior methods that perform the division in software.
The rest of the paper is organized as follows. Section 2 describes RSA key pair generation and details how to reduce the time spent in the prime finding algorithm, which is the kernel of the whole process. Section 3 presents our architecture; the implementation results are compared with those given in [3] [4] in the same section. Finally, in Section 4, the conclusions and future work are discussed.
II. DESIGN FLOW AND ALGORITHM ANALYSIS
A common RSA key pair generator generally includes a stage of trial division in the sieve function procedure [2] [3]. In this section we investigate a way to avoid this by utilizing invertible numbers, which is well suited to hardware implementation. Some algebraic techniques are introduced to speed up modular exponentiation [10] [12], so that the primality test, which is the most time-consuming part in practice, can be greatly optimized. The private key generation algorithm [13] is also improved to suit the limited computation resources of our embedded processor. Figure 1 shows a common procedure of RSA key pair generation. First, a random number [14] is generated and passed into the "sieve function" procedure. This yields a so-called "prime candidate", which is not certainly prime but very likely to be [14] [16]. The primality of these candidates is confirmed via a primality test (here, the Miller-Rabin test is adopted). The rest of the procedure is simple: the public key and the private key are generated in succession.
A. Sieve Function
The purpose of the sieve function is to reduce the number of primality tests, which are the most time-consuming part of RSA key pair generation. Unlike ordinary sieve functions, which divide the candidate number by small primes, trial division is replaced by a single modular exponentiation [6], as shown in Algorithm 1. This algorithm starts from the following considerations [7]:
• Let $\Pi = \prod_i p_i$ be a product of small primes (each $p_i$ is a small prime).
• Then we can calculate k (k < Π) such that k is co-prime with Π.
• Choose a large integer L (usually L = νΠ, where ν is a small integer).
• Set q = k + L; then q is co-prime with Π or, in other words, q is co-prime with every small prime $p_i$.
So the purpose of the sieve function is achieved. To explain the details of our algorithm, two questions are considered:
1) How to find such a k which is co-prime with Π?
2) How to choose the large integer L to ensure that q = k + L is co-prime with Π, which is equivalent to the question: how to make sure that q is co-prime with every small prime $p_i$?
For question 1, we introduce a concept from number theory: the co-prime congruences mod n form a multiplicative group.
For example, mod 4 has 2 co-prime congruences: 1 and 3; mod 8 has 4 co-prime congruences: 1, 3, 5 and 7; mod 16 has 8 co-prime congruences: 1, 3, 5, 7, 9, 11, 13 and 15.
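The example above can be checked mechanically. The following is a minimal Python sketch; the helper name `coprime_residues` is hypothetical and is introduced only for illustration.

```python
from math import gcd

def coprime_residues(n):
    """Return the residues in [1, n) that are co-prime with n."""
    return [a for a in range(1, n) if gcd(a, n) == 1]

# List the co-prime congruences for the three moduli used in the text.
for n in (4, 8, 16):
    units = coprime_residues(n)
    print(n, len(units), units)
```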
For convenience, the set of co-prime congruences mod Π is denoted as $Z^*_\Pi$ and the greatest common divisor is denoted as gcd.
So, Theorem 2.1.1.
It means we just have to find $k \in Z^*_\Pi$, and an easier way to get it follows from Carmichael's theorem:
we denote the Carmichael function as λ(n), and $\forall a \in Z^*_n$, $a^{\lambda(n)} \equiv 1 \pmod n$.
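As a small numerical illustration of Carmichael's theorem, the sketch below assumes a square-free modulus built from a few distinct odd primes, for which λ(n) is the least common multiple of the values p − 1. The function name and the chosen primes are illustrative assumptions, not values from the paper.

```python
from math import gcd
from functools import reduce

def carmichael_squarefree(primes):
    """lambda(n) for n equal to a product of distinct primes: lcm of (p - 1)."""
    return reduce(lambda x, y: x * y // gcd(x, y), (p - 1 for p in primes), 1)

primes = [3, 5, 7, 11]                 # illustrative small primes
n = 3 * 5 * 7 * 11                     # 1155
lam = carmichael_squarefree(primes)    # lcm(2, 4, 6, 10) = 60

# Every residue co-prime with n satisfies a^lambda(n) = 1 (mod n) ...
units_ok = all(pow(a, lam, n) == 1 for a in range(1, n) if gcd(a, n) == 1)
# ... and, for this square-free n, no other residue does.
others_ok = all(pow(a, lam, n) != 1 for a in range(1, n) if gcd(a, n) != 1)
print(lam, units_ok, others_ok)
```

The second check is what makes a single modular exponentiation usable as a co-primality test when the modulus is a product of distinct primes.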
Take notice that [13] ,
Figure 2 provides a method to generate such k.
For question 2, L should be set as an integral multiple of Π, L = νΠ, so that gcd(k + L, Π) = gcd(k, Π) = 1,
which means that such L meets the requirement [15].
According to the architecture presented in this paper, modular exponentiation can be done very fast and no extra hardware consuming is needed. So the processing time of the sieve function can be trivial and circuit area will be saved as well.
Algorithm 1 Invertible Number Generation
Input:
Output: q co-prime with the smallest 74 odd primes
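The sieve idea described in this subsection can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the correction step k ← k + r·(1 − k^λ(Π)) mod Π, which is one known way to reach an invertible element with modular exponentiation alone, and it is not claimed to match Algorithm 1 step for step. The short prime list, the function names and the parameter `nu` are hypothetical; the paper uses the smallest 74 odd primes.

```python
import random
from math import gcd
from functools import reduce

SMALL_ODD_PRIMES = [3, 5, 7, 11, 13, 17, 19, 23]  # illustrative subset only

def lcm(x, y):
    return x * y // gcd(x, y)

def generate_unit(primes, rng):
    """Return (k, Pi) with k co-prime to Pi, using no trial division."""
    Pi = reduce(lambda x, y: x * y, primes, 1)
    lam = reduce(lcm, (p - 1 for p in primes), 1)   # Carmichael function of Pi
    k = rng.randrange(1, Pi)
    while True:
        u = (1 - pow(k, lam, Pi)) % Pi   # u == 0 exactly when k is a unit
        if u == 0:
            return k, Pi
        r = rng.randrange(1, Pi)
        k = (k + r * u) % Pi             # re-randomize only the bad residues

def prime_candidate(primes, nu, rng):
    """q = k + nu * Pi is co-prime with every prime in the list."""
    k, Pi = generate_unit(primes, rng)
    return k + nu * Pi

rng = random.Random(1)                   # fixed seed for a repeatable demo
q = prime_candidate(SMALL_ODD_PRIMES, 5, rng)
print(q, all(q % p != 0 for p in SMALL_ODD_PRIMES))
```

The sketch does not force q to be odd or to have a given bit length; a practical generator would handle both separately.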
B. Primality Test
The candidate must be tested for primality in order to be useful for the generation of an RSA key pair [12] [16]. In this section, the Miller-Rabin method [10] [13] is used for our primality test. The algorithm is described in Algorithm 2 below.
We consider another property of arithmetic modulo p for a prime number p that can be used as a certificate for compositeness [14] [17].
In the situation of this definition, the numbers 1 and n − 1 are always square roots of 1 modulo n (indeed, $(n-1)^2 \equiv 1 \pmod n$); they are called the trivial square roots of 1 modulo n. If n is a prime number, there are no other square roots of 1 modulo n.
Output: Confirm the primality of n
Find u and k so that $n - 1 = u \cdot 2^k$
Let a be randomly chosen from 2, 3, ..., n − 1
end if
end if
end for
print "n is composite"
Figure 3. Powers $a^{n-1} \bmod n$ calculated with intermediate steps.
If p is a prime number and 1 ≤ a < p and $a^2 \bmod p = 1$, then a = 1 or a = p − 1.
Thus, if we find some nontrivial square root of 1 modulo n, then n is certainly composite. Figure 3 gives us an example to show this principle.
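The principle can be reproduced numerically for a = 201 and n = 325. The Python sketch below computes the sequence of powers obtained by repeated squaring and looks for a nontrivial square root of 1; the function names are hypothetical and chosen only for this illustration.

```python
def squaring_sequence(a, n):
    """Return [b_0, ..., b_k] with b_i = a^(u * 2^i) mod n, where n - 1 = u * 2^k and u is odd."""
    u, k = n - 1, 0
    while u % 2 == 0:
        u //= 2
        k += 1
    b = pow(a, u, n)
    seq = [b]
    for _ in range(k):
        b = b * b % n
        seq.append(b)
    return seq

def nontrivial_sqrt_of_one(seq, n):
    """Return an element whose square is 1 but which is neither 1 nor n - 1, if any."""
    for prev, cur in zip(seq, seq[1:]):
        if cur == 1 and prev not in (1, n - 1):
            return prev
    return None

seq = squaring_sequence(201, 325)          # 324 = 81 * 2^2
print(seq)                                 # [226, 51, 1]
print(nontrivial_sqrt_of_one(seq, 325))    # 51, so 325 is composite
```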
In Figure 3, $201^{324} \bmod 325$ is calculated with two intermediate steps. Once an element of the sequence equals 1 or n − 1, every subsequent element is 1, because its square mod n is 1. Thus, in general, the sequence starts with zero or more elements $\notin \{1, n-1\}$ and ends with a sequence of zero or more 1s. The two parts may or may not be separated by an entry n − 1. All possible patterns are depicted in Figure 4, where "*" represents an arbitrary element $\notin \{1, n-1\}$. We distinguish four cases:
If n is composite and a is not an A-witness for n, then a is called an A-liar for n.
If a is an A-witness for n, then n is composite. Proof. If a is an A-witness for n, then Case 2 or Case 3 of the preceding discussion applies to the sequence $b_i = a^{u \cdot 2^i} \bmod n$, $0 \le i \le k$; hence n is composite.
Finally, we combine this observation with the idea of choosing some a from {2, ..., n − 2} at random into a strengthening of the Fermat test, called the Miller-Rabin test. The details of this algorithm are given in Algorithm 2.
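For reference, a textbook formulation of the Miller-Rabin test is sketched below in Python. It follows the description in this section (write n − 1 = u · 2^k, pick a random base, square repeatedly), but the function name, the default number of rounds and the handling of small inputs are assumptions of this sketch rather than details taken from Algorithm 2.

```python
import random

def is_probable_prime(n, rounds=20, rng=random):
    """Miller-Rabin test: False means n is certainly composite."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    # Write n - 1 = u * 2^k with u odd.
    u, k = n - 1, 0
    while u % 2 == 0:
        u //= 2
        k += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)      # random base in {2, ..., n - 2}
        b = pow(a, u, n)                 # the dominant modular exponentiation
        if b in (1, n - 1):
            continue                     # this base does not witness compositeness
        for _ in range(k - 1):
            b = b * b % n
            if b == n - 1:
                break
        else:
            return False                 # never reached n - 1: a is a witness
    return True

print(is_probable_prime(325), is_probable_prime(2**61 - 1))
```

The single call to `pow(a, u, n)` per round is the operation the paper targets with RNS Montgomery multiplication.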
Since the Miller-Rabin test dominates the processing time of RSA key pair generation, it is important to improve this part for the sake of high performance. Note that $b \leftarrow a^u \bmod n$ is the most time-consuming computation in Algorithm 2; we focus on this part and make use of RNS Montgomery multiplication to accelerate the Miller-Rabin test. The technical details are discussed in Section 2.4.
C. Calculation of Private Key
The private key is the modular inverse of the public key modulo Φ(n), as described in Algorithm 3 [13].
Algorithm 3 Improved Stein's method for modular inverse
Input: e (public key), Φ(n) (Φ(n) = (p − 1) × (q − 1), where p and q are two primes)
while
end if
end while
The particular method chosen for computing the modular inverse avoids trial division, which makes it easier to implement in hardware.
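To illustrate how a modular inverse can be computed without division, the Python sketch below uses a textbook binary extended GCD in the style of Stein's algorithm, relying only on shifts, additions, subtractions and comparisons. It is an assumption-laden stand-in and is not claimed to be the improved method of Algorithm 3; the function name is hypothetical, and the sketch assumes that e is odd and co-prime with Φ(n).

```python
def modinv_binary(e, phi):
    """Return d with e * d = 1 (mod phi), using shifts, adds and subtracts only."""
    if e & 1 == 0:
        raise ValueError("this sketch assumes an odd public exponent")
    u, v = e, phi
    A, B, C, D = 1, 0, 0, 1          # invariants: A*e + B*phi = u, C*e + D*phi = v
    while u != 0:
        while u & 1 == 0:            # halve u and keep the first invariant
            u >>= 1
            if (A | B) & 1 == 0:
                A, B = A >> 1, B >> 1
            else:
                A, B = (A + phi) >> 1, (B - e) >> 1
        while v & 1 == 0:            # halve v and keep the second invariant
            v >>= 1
            if (C | D) & 1 == 0:
                C, D = C >> 1, D >> 1
            else:
                C, D = (C + phi) >> 1, (D - e) >> 1
        if u >= v:
            u, A, B = u - v, A - C, B - D
        else:
            v, C, D = v - u, C - A, D - B
    if v != 1:
        raise ValueError("e is not invertible modulo phi")
    while C < 0:                     # bring the result into [0, phi)
        C += phi
    while C >= phi:
        C -= phi
    return C

print(modinv_binary(17, 3120))       # 2753, since 17 * 2753 = 15 * 3120 + 1
```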
D. RNS Montgomery Modular Multiplication
Modular multiplication is the kernel operation of RSA key pair generation; it is called a large number of times and takes up most of the time of the whole procedure. It is therefore important to analyze this part and select the best algorithm for the hardware presented in this paper.
The RNS Montgomery modular multiplication algorithm proposed in [10] is a fast parallel algorithm for modular multiplication with large operands, which is the basic operation of public key cryptosystems. The algorithm is rewritten in Algorithm 4.
Algorithm 4 RNS Montgomery Modular Multiplication
Two bases a and b are introduced, and subscripts i and j are used to indicate the elements related to base a and b respectively. For example, $a_i$ represents the ith element of base a, and $x_j$ represents the jth element of X represented in base b. In this paper, $[X]_{a \cup b}$ means that X is represented in RNS by bases a and b, and |A
In Algorithm 4, step 3 and step 7 are base transformations (BT) between different base representations, as shown in Algorithm 5. Here, step 3 transforms $q_i$, represented in base a, into $q_j$ in base b. Derived from the base transformation in [10], the BT algorithm is reformulated to suit the architecture designed in this paper. $|X|_{m_i}$ denotes X modulo $m_i$ in the following algorithm.
Algorithm 5 Base Extension
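As a software illustration of the ideas behind Algorithms 4 and 5, the Python sketch below performs one RNS Montgomery multiplication with two toy bases. Several points are assumptions of the sketch rather than details from the paper: the base transformation is done exactly through CRT reconstruction (a hardware design would use a cheaper base extension), the moduli are small primes instead of 32-bit values, the step order is a common textbook arrangement, and all function names are hypothetical. It needs Python 3.8 or later for `math.prod` and `pow(x, -1, m)`.

```python
from math import prod

def to_rns(x, base):
    return [x % m for m in base]

def from_rns(residues, base):
    """CRT reconstruction to the integer in [0, prod(base))."""
    M = prod(base)
    total = 0
    for r, m in zip(residues, base):
        Mi = M // m
        total += (r * pow(Mi, -1, m) % m) * Mi
    return total % M

def base_transform(residues, src, dst):
    """Exact base transformation (software stand-in for base extension)."""
    return to_rns(from_rns(residues, src), dst)

def rns_montgomery_mul(x, y, N, base_a, base_b):
    """Return residues of r = x*y*Ma^(-1) mod N (r < 2N) in base a and base b."""
    Ma = prod(base_a)
    xa, ya = to_rns(x, base_a), to_rns(y, base_a)
    xb, yb = to_rns(x, base_b), to_rns(y, base_b)
    # Element-wise in base a: q = -x*y*N^(-1); each element is independent.
    qa = [(-xi * yi * pow(N, -1, ai)) % ai for xi, yi, ai in zip(xa, ya, base_a)]
    qb = base_transform(qa, base_a, base_b)
    # Element-wise in base b: r = (x*y + q*N) * Ma^(-1).
    rb = [((xj * yj + qj * N) * pow(Ma, -1, bj)) % bj
          for xj, yj, qj, bj in zip(xb, yb, qb, base_b)]
    ra = base_transform(rb, base_b, base_a)
    return ra, rb

base_a = [13, 17, 19, 23]      # toy base a, product 96577 > N
base_b = [29, 31, 37, 41]      # toy base b, product 1363783 > 2N
N = 9991                       # 97 * 103, co-prime with both bases
x, y = 1234, 5678
ra, rb = rns_montgomery_mul(x, y, N, base_a, base_b)
r = from_rns(rb, base_b)
print(r % N == x * y * pow(prod(base_a), -1, N) % N)   # True
```

Each list comprehension works on one residue channel at a time, which is the data-level parallelism that the four MMAC units are meant to exploit.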
Another concern about RNS Montgomery multiplication is the proper selection of the RNS base. In this paper, an efficient RNS base is chosen in the form $2^n - c_i$, where n is the bit length of each RNS base element, which determines the data width of the hardware implementation. In this form, modular addition and modular multiplication are efficiently realized, as shown in Algorithm 6 and Algorithm 7. Because the complexity of multiplication decreases, the cost of the mod operation is considerably reduced in the design.
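The benefit of a modulus of the form 2^n − c can be seen in a short Python sketch: since 2^n ≡ c modulo 2^n − c, the high part of a product can be folded back with a multiplication by the small constant c instead of a division. This is a generic folding reduction under the assumption that c is small and positive; it is not claimed to reproduce Algorithm 6 or Algorithm 7, and the function names and the value c = 5 are illustrative.

```python
def reduce_pseudo_mersenne(x, n, c):
    """Reduce x >= 0 modulo m = 2^n - c without division (c small, c >= 1)."""
    m = (1 << n) - c
    mask = (1 << n) - 1
    while x >> n:                          # fold the high part: 2^n = c (mod m)
        x = (x >> n) * c + (x & mask)
    while x >= m:                          # final conditional subtraction
        x -= m
    return x

def mod_mul(a, b, n, c):
    return reduce_pseudo_mersenne(a * b, n, c)

def mod_add(a, b, n, c):
    """Assumes a and b are already reduced."""
    s = a + b
    m = (1 << n) - c
    return s - m if s >= m else s

n_bits, c = 32, 5                          # illustrative modulus 2^32 - 5
a, b = 0xFFFFFFF0, 0x12345678
print(mod_mul(a, b, n_bits, c) == a * b % (2**32 - 5))
```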
A. Transport Triggered Architecture
Transport Triggered Architecture (TTA) [11] is a statically programmed, modular ILP architecture similar to VLIW architecture. Instead of specifying operations and controlling the FUs directly, TTAs specify the required data transports. These transports may implicitly trigger operations as a side effect. The TTA modular template is illustrated in Figure 5. A TTA is organized as a set of functional units (FUs) and register files (RFs) including general-purpose registers. The input and output ports of these resources are connected together by an interconnection network composed of move buses and input/output sockets. The data transports encoded in each instruction slot are carried out on each move bus, so the number of buses determines the maximum number of parallel data moves that can be performed in each cycle. Input sockets contain multiplexers that feed data from the buses into the destination registers in FUs. Output sockets contain de-multiplexers that put FU results from source registers onto the buses. In VLIW architecture, the connectivity between RFs and FUs becomes more complex as the number of input/output ports increases, leading to larger area and critical path delay overhead. The interconnections of TTAs, however, are simpler. The interconnection network of a TTA can be fully connected, in which case all the ports in all the FUs are connected to all the buses. In practice, it is usually partially connected and optimized for a specific application according to the traffic on the move buses. The functional units follow the triggering strategy. Each FU triggers its operation when an operand is transported to the trigger register in the FU. Therefore, each FU must hold at least one trigger register to perform the corresponding operations. Additional operand registers may be required for multi-operand operations. An FU can hold more than one result register when multiple results are occasionally required.
Every FU can be pipelined to reduce the critical path delay and improve the processing throughput.
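The triggering strategy can be pictured with a toy Python model. Everything in it is a hypothetical simplification for illustration: a single adder FU with one operand register, one trigger register and one result register, a program written as lists of (source, destination) moves, and a limit of four moves per cycle to mirror four buses. It does not model the MMAC units, sockets or pipelining.

```python
class AddFU:
    """Toy function unit with an operand port, a trigger port and a result port."""
    def __init__(self):
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "operand":
            self.operand = value                  # plain store, nothing is computed
        elif port == "trigger":
            self.result = self.operand + value    # the move itself starts the operation
        else:
            raise KeyError(port)

def read(src, fus, regs):
    if isinstance(src, int):                      # immediate value
        return src
    if "." in src:                                # FU port such as "add.result"
        fu, port = src.split(".")
        return getattr(fus[fu], port)
    return regs[src]                              # register file entry

def write(dst, value, fus, regs):
    if "." in dst:
        fu, port = dst.split(".")
        fus[fu].write(port, value)
    else:
        regs[dst] = value

def run(program, fus, regs, num_buses=4):
    """Each instruction is a list of moves carried out in the same cycle."""
    for moves in program:
        assert len(moves) <= num_buses            # one move per bus
        values = [read(src, fus, regs) for src, _ in moves]
        for (_, dst), value in zip(moves, values):
            write(dst, value, fus, regs)

fus, regs = {"add": AddFU()}, {}
program = [
    [(7, "add.operand"), (5, "r0")],   # two parallel moves in one cycle
    [("r0", "add.trigger")],           # writing the trigger port fires the add
    [("add.result", "r1")],            # move the result back to a register
]
run(program, fus, regs)
print(regs["r1"])                      # 12
```

The program contains no "add" instruction: the addition happens only because a value is moved into the trigger port.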
B. Architecture Design
The proposed architecture is shown in Figure 6. It mainly consists of five parts: FUs, RFs, control logic, transport network and on-chip RAM/ROM. As in common processors, the control logic consists of instruction fetch, instruction decoder and PC control units. In this design, the JMP unit can modify the PC value to realize jump, branch and loop operations.
The FUs are the key factors that determine the performance of this processor. Depending on the application, various FUs can be designed and attached to the transport network. To implement the RNS Montgomery multiplication algorithm, modular multiplication and modular multiply-and-accumulate are the key operations at the n-bit level, where n is the base size. The MMAC units are therefore designed to accelerate these key operations. An MMAC unit can only perform 32-bit × 32-bit modular multiplication, but by combining the units with the RNS Montgomery multiplication algorithm, long-operand modular multiplication can be done very fast. In this work, four MMAC units are designed according to the maximum number of function units used when executing step 1 to step 3 of Algorithm 4. The remaining operations after TIME E also rely on these four MMAC units in order to increase the reusability of the architecture and save valuable hardware resources.
Two ALU units are included to implement the modular addition and modular subtraction operations. Some control operations, such as logical right shift, arithmetic right shift and case-select, are also needed. They are included in the ALU units to complete the additional operations in the process of RSA key pair generation.
There are three load-store units (LDST) and one lookup table unit (LUT). The LDSTs are connected to independent data RAM. The LUT is used to store the pre-computed data. There are direct data paths between the LDSTs and the MMAC units and also between the LUT and the group of MMACs, which are used to reduce redundant data transmission.
The transport network is used to transport data from source register to destination register. Four buses are adopted in the transport network to fully exploit the computation capacity of the four MMAC units in parallel. The width of each bus is 32 bits, which is determined by the selected base size.
Figure 7. The operational process of RSA key pair generation at the function unit level.
Figure 7 illustrates the operational process of RSA key pair generation at the function unit level. The whole process has three stages:
1) Sieve Function
2) Primality Test
3) Private Key Calculation
The arrows in this figure show the data flow of the procedure. In the Sieve Function stage, the MMAC units perform modular exponentiation, the ALUs handle ordinary arithmetic operations such as modular addition or addition, and JMP controls the program flow. Similarly, the same units carry out the calculation (mainly modular exponentiation) in the Primality Test stage. In the last stage, the private key is simple to generate with the ALUs, but jump instructions make up a large proportion of the work, so JMP is used here to handle them.
According to the algorithm analysis in Section 2, these function units are arranged to carry out RSA key pair generation: the MMAC units do the modular exponentiation; the ALU units perform addition, subtraction and some other logical operations; and the JMP unit handles condition evaluation and jump instructions.
C. Implementation Result
The design was implemented in Verilog HDL and synthesized with 0.18 µm CMOS technology. At 100 MHz, the processing time of 1024-bit RSA key pair generation is 306 ms on average, and the logic area is 131k gates. To compare the performance of our implementation with other works [3] [4], the processing time consumed in the primality test, prime finding and RSA key pair generation is analyzed. Because the architecture is specialized for modular multiplication, the time consumed in the primality test is greatly reduced, and the number of attempts in prime finding is 36, which is a satisfactory compromise between circuit area and processing time. As shown in Table 1, the proposed work requires fewer clock cycles than the other works.
IV. CONCLUSION AND FUTURE WORK
This paper presents a novel hardware implementation of RSA key pair generation based on TTA, in which Montgomery modular multiplication based on RNS is adopted in both the sieve function and the primality test to improve performance significantly. FUs suited to the algorithms are designed on TTA, and direct data paths are used to reduce redundant data transmission. In addition, pipelining and parallel processing are introduced to improve the computing speed. At a frequency of 100 MHz, 1024-bit RSA key pair generation needs 306 ms on average, and the logic area of the proposed architecture is 131k gates. This result shows that the proposed work achieves high performance and small area for RSA key pair generation.
On-going and future developments include: (1) The preparation of pre-computed data, especially for RNS Montgomery multiplication, can be optimized, since it significantly affects the rate of RSA key pair generation. (2) A scalable and reconfigurable architecture is to be introduced, in which not only 1024-bit but also 2048-bit and 4096-bit RSA key pairs can be generated on this platform.
