Abstract-This article considers how to protect the fast RSA signature and decryption computation based on the residue number system (the CRT-based approach) against hardware fault cryptanalysis in a highly reliable and efficient manner. The CRT-based speedup for RSA signatures has been widely adopted as an implementation standard in devices ranging from large servers to very small smart IC cards. However, given a single erroneous computation result, a hardware fault cryptanalysis can totally break the RSA system by factoring the public modulus. Some countermeasures using a simple verification function (e.g., raising a signature to the power of the public key) or fault detection (e.g., an expanded modulus approach) have been reported in the literature; however, it will be pointed out in this paper that very few of these existing solutions are both sound and efficient. In particular, these methods unreasonably assume that a comparison instruction is always fault-free. Research shows that the expanded modulus approach proposed by Shamir is superior to the approach of using a simple verification function when other physical cryptanalysis (e.g., timing cryptanalysis) is considered, so we intend to improve Shamir's method. In this paper, the new concepts of fault infective CRT computation and fault infective CRT recombination are proposed. Based on the new concepts, two novel protocols are developed with rigorous proof of security. Two possible parameter settings are provided for the protocols. One setting selects a small public key e, in which case the proposed protocols have performance comparable to Shamir's scheme. The other setting achieves better performance than Shamir's scheme (i.e., performance comparable to the conventional CRT speedup), but with a large public key.
Most importantly, we wish to emphasize the importance of developing and proving the security of physically secure protocols without relying on unreliable or unreasonable assumptions, e.g., always fault-free instructions. In this paper, related protocols are also considered and are carefully examined to point out possible weaknesses.
INTRODUCTION
In order to provide better support for data protection under strong cryptographic schemes (e.g., RSA [1] or ElGamal [2] systems), a variety of implementations based on tamper-proof devices (e.g., smart IC cards) have been proposed. The main reason for this trend is that smart IC cards are assumed to provide high reliability and security with more memory capacity and better performance characteristics than conventional plastic cards. The CPU in a smart IC card controls its data input and output and prevents unauthorized access to the card. Thanks to their computational ability, large memory capacity, and security, smart IC cards benefit a large variety of cryptographic applications. Due to this popular usage of tamper-resistance, much attention has recently been paid to the security of cryptosystems implemented on tamper-proof devices [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17] in the presence of hardware faults.
This line of research reemerged in September 1996 when a Bellcore press release [5] reported a new kind of attack, the so-called fault-based cryptanalysis. In the fault-based cryptanalysis model, it is assumed that, when an adversary has physical access to a tamper-proof device, he may purposely induce a certain type of fault into the device. Based on a set of incorrect responses or outputs from the device, due to the presence of faults, the adversary can then extract the secrets embedded in the tamper-proof device. These attacks exploit the presence of many kinds of transient or permanent faults. They are of a very general nature and remain valid for a large variety of cryptosystems, e.g., the LUC public key cryptosystem [18] and elliptic curve cryptography (ECC) [19].
In this paper, we focus our attention on public key cryptosystems whose computation can be sped up using the Chinese remainder theorem (CRT) [20], [21]. These cryptosystems may be vulnerable to a hardware fault cryptanalysis that reveals the secret key if the following three conditions are met: 1) the message to sign (or to decrypt) is known; 2) a random fault occurs during the computation of a residue number system; 3) the device outputs the faulty result. Our main objective is to emphasize the importance of a careful implementation of cryptosystems with CRT-based speedup. Suppose the context involves trusted third parties (e.g., banks) where thousands of signatures are being produced each day. Another similar, but more serious, example is a server located at the certification authority which will produce hundreds of thousands of certificates (each of which is a signature) each day.
If, for some reason, a single signature is faulty, then the security of the whole system may be compromised. Most people believe that leakage of the secret key can be avoided by making sure that none of the above three conditions will be met. However, it will be pointed out in the next paragraph that the above assumption is not convincing all the time in a real world implementation. Therefore, a denial of service attack is possible in which the attacker can just send an anonymous e-mail claiming that he holds a faulty signature made by a bank server or a CA server. This attack will definitely force the server to stop using the present signing key and, of course, stops the whole important system. Some existing countermeasures have been reported in the literature. However, it will be pointed out in this paper that very few of these solutions are both sound and efficient. All the previous solutions employ one or more checking procedures to detect the possible faults. However, in order to develop a highly reliable CRT-based speedup technique, no checking procedure shall be assumed because, in that situation, the checking procedure itself will become most vulnerable to the hardware fault cryptanalysis and all other parts of the countermeasure will be in vain. On the other hand, some existing solutions suffer a disadvantage that the probability of producing an undetectable error is high.
The main contribution of this paper is that the new concept of fault infective CRT computation and fault infective CRT recombination are proposed. Based on the new concept, two novel protocols are developed with rigorous proof of security. The two CRT-based speedup protocols do not depend upon any checking procedure which should be absolutely error-free and physical cryptanalysis immune and do not suffer the disadvantage of producing an undetectable error with large probability. A key point of developing a secure CRT-based computation protocol without using a checking procedure is the assurance of influencing the computation of one module within the residue number system or the overall computation when an error occurred in another module within the residue number system. Two possible parameter settings are provided for the proposed protocols. One setting is to select a small public key e and the proposed protocols can have comparable performance to Shamir's scheme. The other setting is to have better performance than Shamir's scheme (i.e., having comparable performance to conventional CRT speedup), but with a large public key.
Most importantly, in this paper, we wish to emphasize the importance of developing and proving the security of physically secure protocols without relying on unreliable or unreasonable assumptions, e.g., always fault-free instructions.
PRELIMINARY BACKGROUND OF CRT-BASED CRYPTANALYSIS

Residue Number System and Chinese Remainder Theorem
The residue number system, RNS (or often called the Chinese remainder theorem [21], CRT), states that, given a set of integers $n_1, n_2, \ldots, n_k$ that are pairwise relatively prime, the following system of simultaneous congruences:
$$s \equiv s_i \pmod{n_i}, \quad i = 1, 2, \ldots, k,$$
has a unique solution modulo $n = \prod_{i=1}^{k} n_i$. The integer solution $s$ to the above simultaneous congruences can be computed as
$$s = \Big(\sum_{i=1}^{k} s_i \cdot N_i \cdot y_i\Big) \bmod n, \qquad (1)$$
where $N_i = n / n_i$ and $y_i = N_i^{-1} \bmod n_i$. In this paper, the following notation:
$$s = \mathrm{CRT}(s_1, s_2, \ldots, s_k) \qquad (2)$$
will be used to represent the computation of $s$ by using the Chinese remainder theorem. This RNS (CRT-based) structure has been widely employed in computer engineering, in special parallel computer architectures and some digital signal processors, to speed up computation.
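As an illustration of the recombination formula, the following minimal Python sketch computes the CRT solution for pairwise relatively prime moduli. The function name `crt` and the sample residues are assumptions made for this example only; they are not part of the original text.

```python
from math import prod


def crt(residues, moduli):
    """Return the unique s in [0, n) with s = residues[i] (mod moduli[i]).

    The moduli are assumed to be pairwise relatively prime.
    """
    n = prod(moduli)
    s = 0
    for s_i, n_i in zip(residues, moduli):
        N_i = n // n_i                 # product of all other moduli
        y_i = pow(N_i, -1, n_i)        # inverse of N_i modulo n_i
        s += s_i * N_i * y_i
    return s % n


# Classic textbook instance: s = 2 (mod 3), s = 3 (mod 5), s = 2 (mod 7).
s = crt([2, 3, 2], [3, 5, 7])
print(s)  # 23
```

The modular inverse is obtained with the three-argument `pow` and a negative exponent, which requires Python 3.8 or later.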
The RNS structure was also proposed to speed up the RSA signature and decryption computation [20]. This RNS speedup for RSA computation has been widely adopted as an implementation standard, with performance about four times faster than a direct computation in terms of bit operations. The applications range from large servers to very small smart IC cards. Servers must perform a huge number of RSA computations very frequently. For smart IC cards, the card processors are often not powerful enough to perform complicated cryptographic computations, e.g., RSA signature and decryption, in real time.
The CRT-Based Cryptanalysis
Let $p$ and $q$ be two primes and $n = p \cdot q$. In the RSA cryptosystem, a message $m$ is signed with a secret exponent $d$ as $s = m^d \bmod n$. Using the residue number system approach, the value of $s$ can be computed more efficiently by computing $s_p = m^d \bmod p$ and $s_q = m^d \bmod q$ and then using the Chinese remainder theorem (CRT) to reconstruct $s$.
Suppose that an error (any random error) occurs during the computation of $s_p$ ($s'_p$ denotes the erroneous result), but the computation of $s_q$ is error-free. Applying the CRT on both $s'_p$ and $s_q$ will produce a faulty signature $s'$. The CRT-based hardware fault cryptanalysis [13] enables the factorization of $n$ by computing
$$q = \gcd((s'^{e} - m) \bmod n, n). \qquad (3)$$
Similarly, $p$ can be derived from a faulty signature computed by applying the CRT on a faulty $s'_q$ and a correct $s_p$.
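The attack can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the toy primes, the exponent `e = 5`, the message value, and the way the fault is modeled (adding 1 to the residue modulo p) are all chosen for illustration and do not come from the original text.

```python
from math import gcd

# Toy parameters (assumed for illustration only; far too small for real use).
p, q = 1009, 1013
n = p * q
e = 5
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent
m = 123456                            # known message, m < n


def crt(s_p, s_q):
    """Recombine residues modulo p and q into a value modulo n."""
    h = ((s_q - s_p) * pow(p, -1, q)) % q
    return s_p + h * p


s_p = pow(m, d, p)
s_q = pow(m, d, q)
s = crt(s_p, s_q)                     # correct signature

# Model a fault in the computation of s_p; s_q stays correct.
faulty_s_p = (s_p + 1) % p
faulty_s = crt(faulty_s_p, s_q)

# The faulty signature is still correct modulo q, so the gcd reveals q.
recovered_q = gcd((pow(faulty_s, e, n) - m) % n, n)
print(recovered_q == q)
```

Because $e$ is invertible modulo $p - 1$, any change of the residue modulo $p$ makes $s'^e \not\equiv m \pmod p$, while $s'^e \equiv m \pmod q$ still holds, which is why the gcd equals $q$ exactly.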
Previous Countermeasures
To counteract the above CRT-based cryptanalysis, any method of detecting computation errors within $s_p = m^d \bmod p$ and $s_q = m^d \bmod q$ works in some situations. Previous research results suggest either performing calculations twice or applying a verification function on the computed result to detect any fault. The first approach is very time-consuming and it cannot always provide a satisfactory solution since a permanent error (caused by a permanent hardware or software fault or implementation bug) may be undetectable, even if the function is computed more than once.
The second approach suggests verifying the correctness by comparing the inverse result with the input. In the RSA signature scenario, a computed signature $s = m^d \bmod n$ can be verified by raising $s$ to the $e$th power and checking whether $m \equiv s^e \pmod n$. Generally, this is not a satisfactory solution since the parameter $e$ could be a large integer, in which case this checking procedure becomes time-consuming.
Furthermore, both the above two approaches of employing a decision procedure to decide the correctness of computation suffer an intrinsic disadvantage which will be further studied after providing the review of Shamir's proposal.
Shamir's Countermeasure
Shamir presented a simple countermeasure in the rump session of Eurocrypt '97 [15] and applied for a patent [16]. In Shamir's method, a random integer $r$ is selected and the following two numbers are computed:
$$s_p = m^d \bmod (p \cdot r), \qquad s_q = m^d \bmod (q \cdot r).$$
If $s_p \equiv s_q \pmod r$, then the computation is defined to be error-free and $s = \mathrm{CRT}(s_p \bmod p, s_q \bmod q)$ is computed.
As described previously, a trivial approach of checking whether $m \equiv (\mathrm{CRT}(s_p \bmod p, s_q \bmod q))^e \pmod n$ can be used to detect any possible fault that occurred during the computation of $s_p$ and $s_q$ when the conventional CRT-based speedup is employed. However, Shamir's proposal provides the advantage that it can withstand the timing cryptanalysis on a CRT-based implementation that works through a binary search algorithmic approach [16], [22]. In the conventional CRT-based speedup, $m_p = m \bmod p$ has to be computed before the evaluation of $s_p = m^d \bmod p$, and the value of $m$ will affect the computation time of $m_p$ depending on whether $m$ is greater or less than $p$.
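The expanded modulus check can be sketched in Python as follows. This is a hedged illustration, not Shamir's reference implementation: the function names, the toy primes, the fixed value of `r` (which would be random in practice), and the `fault` parameter used to perturb the residue modulo p·r are all assumptions introduced for the example.

```python
from math import gcd


def crt(s_p, s_q, p, q):
    """Recombine residues modulo p and q into a value modulo p*q."""
    h = ((s_q - s_p) * pow(p, -1, q)) % q
    return s_p + h * p


def shamir_sign(m, d, p, q, r, fault=0):
    """Sign m with the expanded moduli p*r and q*r; return None on a detected fault."""
    s_p = pow(m, d, p * r)
    s_q = pow(m, d, q * r)
    s_p = (s_p + fault) % (p * r)     # hypothetical fault injection hook
    if s_p % r != s_q % r:            # the checking (decision) procedure
        return None
    return crt(s_p % p, s_q % q, p, q)


# Toy parameters (assumed; far too small for real use).
p, q, e = 1009, 1013, 5
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
m, r = 123456, 65537

good = shamir_sign(m, d, p, q, r)
detected = shamir_sign(m, d, p, q, r, fault=1)   # changes s_p modulo r
missed = shamir_sign(m, d, p, q, r, fault=r)     # leaves s_p modulo r unchanged
leaked = gcd((pow(missed, e, n) - m) % n, n)
print(good == pow(m, d, n), detected is None, leaked == q)
```

The third call shows the kind of error that escapes the check: a fault that happens to be a multiple of `r` passes the comparison, and the released signature then factors `n`.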
Remarks on Shamir's Countermeasure
There are at least three nontrivial drawbacks in Shamir's proposal for preventing CRT-based hardware cryptanalysis.
Theoretically, the probability of producing an undetectable error is equal to $1/r$. With a $k$-bit integer $r$, the secret key leaks with probability $1/2^k$. Small values of $r$ cannot provide satisfactory protection, while larger values of $r$ decrease the overall performance. The above probability depends neither on whether $r$ is changed for each computation nor on whether $r$ is a secret parameter.
Notice that any checking-based approach, including Shamir's method and any previous results, suffers an intrinsic disadvantage that not only the RNS computation but also the checking (decision) procedure will be vulnerable to the hardware fault attack. How to guarantee the feasibility of designing and implementing an error-free checking procedure becomes another big challenge.
A checking procedure is typically implemented by a "SUB a,b" instruction that subtracts b from a (or a "CMP a,b" instruction that compares a and b for equality), followed by a "JZ" (jump if zero) instruction that is performed conditionally depending on the status of the zero flag. The zero flag is a bit of the status register in a processor. So, if an attacker can induce a random fault into the status register, then the conditional jump instruction may behave incorrectly. This is a serious problem in the CRT-based hardware fault cryptanalysis since even a single successful attack will completely factorize $n$.
The above approach of randomly modifying a decision procedure is feasible in most cases. Some previously reported hardware fault cryptanalyses assume a specific single bit flip (i.e., one bit is converted into its complement) while all other parts of the processor remain unaffected during the fault induction. In contrast, in the above decision procedure modifying approach, the result (say $a - b$) is no longer required after the comparison. This is especially true when the instruction "CMP a,b" is used to compare a and b for equality because the result of $a - b$ is explicitly ignored. So, an attacker can easily induce random noisy signals into the processor (especially in the location of the central processing unit-CPU) in order to obtain a complementary status of the zero flag. The attacker does not have to worry about the corruption of the computation result $a - b$. Therefore, if an attacker can introduce a random error into the zero flag, he has a 50 percent probability of foiling the checking procedure.
Someone may argue that some techniques may be employed to protect the status register in a processor to avoid the above attack. However, it is well-known that any error detection or correction technique has its limitation, so a design of highly reliable CRT-based attack immune protocol cannot be guaranteed. Furthermore, it is not possible for all commercial processors to adopt such enhancement on the status register. It is also not proven that the enhancement of the status register will solve all the problems. A protocol-level design with rigorous security proof without the assumption on the reliability of the underlying processor is a much better solution to counteract hardware fault cryptanalysis, especially the CRT-based attack, since a single successful attack will totally break the system.
The above observation explains why the usage of a decision procedure violates the guarantee of developing a highly reliable hardware fault immune cryptographic protocol. In this paper, the idea of fault infective CRT computation is proposed so that security of a protocol does not rely on any decision procedure.
SOME INSECURE CRT-BASED SPEEDUP PROTOCOLS
In order to develop a highly reliable CRT-based speedup, no checking procedure will be assumed hereafter because, in that situation, the procedure will become most vulnerable to the CRT-based hardware fault cryptanalysis and all other parts of the countermeasure will be in vain.
Resistance by Using Masking Technique
The following masking technique [22], [23] has been adopted successfully to resist a timing attack [22] on an RSA implementation without using the CRT-based speedup. Before evaluating the signature $s = m^d \bmod n$, a random and secret integer $r$ is selected so that $r^{-1} \bmod n$ exists and the message $m$ is masked as $\tilde{m} = m \cdot r^e \bmod n$. The computation of $\hat{s} = \tilde{m}^d \bmod n$ is then sped up by using the residue number system (RNS). The RNS approach returns both $s_p = \tilde{m}^d \bmod p$ and $s_q = \tilde{m}^d \bmod q$, then the Chinese remainder theorem (CRT) is employed to reconstruct $\hat{s} = \mathrm{CRT}(s_p, s_q)$. Finally, the signature $s$ is obtained by computing $s = \hat{s} \cdot r^{-1} \bmod n = \mathrm{CRT}(s_p, s_q) \cdot r^{-1} \bmod n$, where $r^{-1}$ is the multiplicative inverse of $r$ modulo $n$. In this paper, the values $r^e \bmod n$ and $r^{-1} \bmod n$ are called the premasking and the postmasking, respectively. Notice that both $r^e \bmod n$ and $r^{-1} \bmod n$ can be precomputed in advance and stored in the smart card. For some security reasons, it may be better for $r$ to be different for each signature computation. Then, each $r_i$ can be set up to be $r_i = r_{i-1}^{a} \bmod n$, where $a$ is a small integer, e.g., two. It is similar for the case of r
Unfortunately, the following Lemma 1 shows that this masking technique does not work to resist the CRT-based hardware fault cryptanalysis. 
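The claim can be checked numerically with the sketch below. It does not reproduce the statement or proof of Lemma 1; it only illustrates, under assumed toy parameters, that a fault in one residue of the masked computation still lets the gcd computation recover a prime factor. The variable names and the fixed mask value are hypothetical (in practice the mask would be random and secret).

```python
from math import gcd

# Toy parameters (assumed; far too small for real use).
p, q, e = 1009, 1013, 5
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
m = 123456
r = 4242                                  # mask, coprime to n (fixed here for repeatability)


def crt(s_p, s_q):
    """Recombine residues modulo p and q into a value modulo n."""
    h = ((s_q - s_p) * pow(p, -1, q)) % q
    return s_p + h * p


pre = pow(r, e, n)                        # premasking value r^e mod n
post = pow(r, -1, n)                      # postmasking value r^-1 mod n

m_masked = (m * pre) % n
s_p = pow(m_masked, d, p)
s_q = pow(m_masked, d, q)
s = (crt(s_p, s_q) * post) % n            # correct unmasked signature

# Fault in the residue modulo p only; the postmasking is applied as usual.
faulty_s = (crt((s_p + 1) % p, s_q) * post) % n
leaked = gcd((pow(faulty_s, e, n) - m) % n, n)
print(s == pow(m, d, n), leaked == q)
```

The mask cancels modulo both primes, so the faulty output is still a correct signature modulo $q$ and an incorrect one modulo $p$, exactly as in the unmasked case.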
Other Insecure CRT-Based Speedup Protocols
In the following, a set of insecure protocols (although they are promising at first glance) will be analyzed. Analysis of these protocols helps understanding of motivation of the proposed secure protocols.
Protocol P-1
In this protocol, a random integer r is selected and the computation of 
Protocol P-3
Motivated by Shamir's idea, the following protocol will be considered. Let $x \in K_1$ (an RNS system) be denoted as $[x \bmod (p \cdot r_p), x \bmod (q \cdot r_q)]$ and $x \in K_2$ (another RNS system) be denoted as $[x \bmod p, x \bmod q]$, where $\gcd(p \cdot r_p, q \cdot r_q) = 1$. For an integer $x \in K_1$, given the representation of $x = [x_a, x_b]$, it is easy to obtain the representation of $x \bmod (p \cdot q) \in K_2$ as $[x_a \bmod p, x_b \bmod q]$.
In the following protocol, two random integers $r_p$ and $r_q$ are selected such that $\gcd(p \cdot r_p, q \cdot r_q) = 1$ and the computation of $s = m^d \bmod n$ is modified into a new form as $s = (m^d \bmod (n \cdot r_p \cdot r_q)) \bmod n$. The computation will be sped up using the CRT approach by evaluating
Based on the property of transformation from $K_1$ to $K_2$ and the derivation of Lemma 3, it follows that $q = \gcd((s'^{e} - m) \bmod n, n)$ still holds if a faulty $s'_p$ has been produced during the computation of $\hat{s}$ over $K_1$.
The idea of expanding the modulus to its present form does not help to resist the CRT-based hardware fault cryptanalysis if a similar checking procedure, as in Shamir's method, is used.
TWO SECURE CRT-BASED COMPUTATION PROTOCOLS BASED ON FAULT INFECTION
A key point of developing a secure CRT-based computation protocol without using a checking procedure is the assurance of influencing the computation of $s_q$ or the overall computation of $s$ when an error occurs in the computation of $s_p$. This property should also apply when a faulty $s'_q$ is produced. In this paper, the above concept is called the fault infective CRT computation. This makes (3) invalid and the CRT-based hardware fault cryptanalysis no longer workable.
After ruling out many possible countermeasures such as that based on the above-mentioned fundamental idea, the following CRT-based speedup protocol, CRT-1, is developed.
The First Protocol-CRT-1 Protocol
Let $n = p \cdot q$ as in a usual RSA system. The smart card also prepares another key pair $(e_r, d_r)$ such that $d_r = d - r$, where $r$ is a small integer selected so that $\gcd(d_r, \phi(n)) = 1$ and $e_r \equiv d_r^{-1} \pmod{\phi(n)}$ is a small integer. Notice that, in this setting, we don't assume that $e$ is also a small integer. On the other hand, another setting with a small $e$ but a large $e_r$ is also possible. Performance of the CRT-1 protocol with these two parameter settings can be found in a later section.
The following protocol suggests a way to speed up the RSA signature computation via the residue number system technique, while, at the same time, being immune against the hardware fault cryptanalysis. Furthermore, the protocol does not depend upon any checking procedure (e.g., double computation) which would have to be absolutely error-free and immune to physical cryptanalysis. Nor will the proposed protocol suffer the disadvantage of producing an undetectable error with large probability, as in Shamir's method. In Shamir's method, it is hard to choose the value of $|r|$ (the bit length of $r$) so as to balance both performance and security. There is a tradeoff between these two contradicting requirements.
Step-1: Compute both $k_p = \lfloor m/p \rfloor$ and $k_q = \lfloor m/q \rfloor$.
Step
Step-3: A conventional CRT operation and some extra manipulation are employed to compute the required signature as $s = \mathrm{CRT}(s_p, s_q) \cdot (\tilde{m}^{r}) \bmod n$,
The following derivation onm m andm m tells briefly that CRT-1 protocol works correctly when all the calculations are performed faultlessly. If no fault occurred in (6) The above Lemma 4 can be extended into the following general case which enables the famous RSA scheme to be a cryptosystem with one-to-one mapping between the message and the cipher.
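The fault-free behavior of CRT-1 can be sketched in Python. Step-2 and part of Step-3 are not fully legible in this text, so the sketch below is an assumption: it reconstructs the intermediate values from the expressions $(s_p^{e_r} \bmod p) + k_p \cdot p$ and $(s_q^{e_r} \bmod q) + k_q \cdot q$ that appear elsewhere in the paper. The function names, variable names, and toy parameters are hypothetical.

```python
from math import gcd


def crt(s_p, s_q, p, q):
    """Recombine residues modulo p and q into a value modulo p*q."""
    h = ((s_q - s_p) * pow(p, -1, q)) % q
    return s_p + h * p


def crt1_sign(m, p, q, d_r, e_r, r):
    """Assumed reconstruction of CRT-1: s = m^(d_r + r) mod n without any check."""
    n = p * q
    # Step-1
    k_p, k_q = m // p, m // q
    # Step-2 (assumed form): s_q is computed from a value derived from s_p,
    # so a fault in s_p also disturbs s_q.
    s_p = pow(m, d_r, p)
    m_from_p = (pow(s_p, e_r, p) + k_p * p) % q
    s_q = pow(m_from_p, d_r, q)
    # Step-3 (assumed form): the final factor is derived from s_q,
    # so a fault in s_q disturbs the whole result.
    m_from_q = pow(s_q, e_r, q) + k_q * q
    return (crt(s_p, s_q, p, q) * pow(m_from_q, r, n)) % n


# Toy parameters (assumed; far too small for real use).
p, q = 1009, 1013
n, phi = p * q, (p - 1) * (q - 1)
e_r = 5                                   # small auxiliary public exponent
d_r = pow(e_r, -1, phi)
r = next(c for c in range(1, 1000) if gcd(d_r + c, phi) == 1)
d = d_r + r                               # RSA private exponent
e = pow(d, -1, phi)                       # RSA public exponent (not small here)

m = 123456
s = crt1_sign(m, p, q, d_r, e_r, r)
print(s == pow(m, d, n))
```

When no fault occurs, both derived values equal $m$ (modulo $q$ in Step-2, exactly in Step-3), so the result is $m^{d_r} \cdot m^{r} = m^{d} \bmod n$.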
Lemma
On the other hand, for each $A$ such that $A \equiv x \pmod q$ for $x$ in $[(p \bmod q), q - 1]$, there are $\lfloor p/q \rfloor - 1$ integers $B$ which satisfy $A \bmod q = B \bmod q$. In total, there are $(q - (p \bmod q)) \lfloor p/q \rfloor$ such integers $A$.
Therefore, there are, in total,
integer pairs $(A, B)$ to be found. □
In the following description, the number of possible integer pairs $(A, B)$ in Lemma 6 will be denoted as $E$.
Therefore, the probability of generating a faulty s (6) and will produce an error-free $s_q$. □
Theorem 2 suggests that it is better to let $p < q$ when implementing the CRT-1 protocol. However, Theorem 1 makes sure that the hardware fault cryptanalysis is not feasible on the CRT-1 protocol, even if $p > q$, since $1 -$
$\tilde{m}'^{r} \bmod n$ (let the set of these $q - 1$ integers form a set $S$) based on Lemma 5 since $\gcd(r, \phi(n)) = 1$.
From the given faulty $s'$ and all these different $\tilde{m}'^{r} \bmod n$, there are $q - 1$ possible $\mathrm{CRT}(s_p, s'_q)$ which can be obtained and only one of these values can be employed to factorize $n$ by computing $p = \gcd((\mathrm{CRT}(s_p, s'_q))^{e_r} - m, n)$. However, the attacker cannot tell which value in $Z_n$ will be an element in the set $S$, even if the parameter $r$ is given, because of the unknown faulty $\tilde{m}'$. Therefore, all values in $Z_n$ could be the random integer $\tilde{m}'^{r} \bmod n$ and this concludes the proof. □
Theorem 1, Theorem 2, and Theorem 3 analyze the security issues of the CRT-1 protocol. They demonstrate that a random fault that occurred in one of the two RNS computation modules will not reveal the factorization of $n$. A brute force guessing of $O(n)$ complexity seems to be the only possible way to factorize $n$, even if a faulty $s'_q$ has been produced by the attacker. This does not require the integers $e_r$ and $r$ to be secret parameters.
The following protocol P-4 attempts to simplify the protocol CRT-1 in order to provide improved performance and with fewer parameter settings.
Protocol P-4
Step-1: Compute $k_p = \lfloor m/p \rfloor$.
Step-2: Compute $m^d \bmod n$ via a conventional residue number system as
where $\tilde{m} = ((s_p^{e} \bmod p) + k_p \cdot p) \bmod q$.
Step-3: A conventional CRT operation is employed to compute the required signature as $s = \mathrm{CRT}(s_p, s_q)$.
In the above protocol, a faulty $s'_p$ will not reveal the factorization of $n$ because a faulty $s'_q$ will also be produced (see Theorem 1 and Theorem 2). However, a faulty $s'_q$ will reveal the value of $p$ as usual.
The following protocol P-5 attempts to resolve the disadvantage of the protocol P-4.
Protocol P-5
Let r be a small prime.
Step-1: Compute $k_p = \lfloor m/p \rfloor$ and $k_q = \lfloor m/q \rfloor$.
Step
Step-3: A conventional CRT operation is employed to compute the required signature as $s = \hat{s} \bmod n = \mathrm{CRT}(s_p, s_q, s_r) \bmod n$.
In protocol P-5, a faulty $s'_p$ will not reveal the factorization of $n$ because it also produces a faulty s
Proof. This can be directly obtained from the computation of $\mathrm{CRT}(x_p, x_q, x_r)$. □
Lemma 8. Let the three moduli of an RNS system $K_1$ be $p$, $q$, and $r$. Also, let the two moduli of another RNS system $K_2$ be $p$ and $q$. Then, $\mathrm{CRT}(x_p, x_q, x_r) \bmod (p \cdot q) = \mathrm{CRT}(x_p, x_q)$.
Proof. Given the following: To prove $y_1 = y_2$, it only needs to be shown that both $y_1$ and $y_2$ have the same representation in the RNS system $K_2$. This is obviously true because $y_1 \bmod p = y_2 \bmod p = x_p$ and $y_1 \bmod q = y_2 \bmod q = x_q$. □
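Lemma 8 is easy to check numerically. The sketch below is only an illustration with assumed toy moduli and arbitrarily chosen residues; the helper names `crt` and `lemma8_holds` are hypothetical.

```python
from math import prod


def crt(residues, moduli):
    """Return the unique x in [0, n) matching every residue; moduli pairwise coprime."""
    n = prod(moduli)
    total = 0
    for x_i, n_i in zip(residues, moduli):
        N_i = n // n_i
        total += x_i * N_i * pow(N_i, -1, n_i)
    return total % n


def lemma8_holds(x_p, x_q, x_r, p, q, r):
    """Check CRT(x_p, x_q, x_r) mod (p*q) == CRT(x_p, x_q)."""
    y1 = crt([x_p, x_q, x_r], [p, q, r]) % (p * q)
    y2 = crt([x_p, x_q], [p, q])
    return y1 == y2


# Toy moduli (assumed); the residues are arbitrary and independent of each other.
p, q, r = 1009, 1013, 17
samples = [(3, 5, 7), (1008, 0, 16), (500, 900, 2)]
print(all(lemma8_holds(a, b, c, p, q, r) for a, b, c in samples))  # True
```

Reducing the three-modulus result modulo $p \cdot q$ discards only the information carried by $x_r$, which is why an error confined to the residue modulo $r$ disappears after the final reduction.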
Once again, the employment of an inappropriate expanding of the modulus in its present form does not help to resist the CRT-based hardware fault cryptanalysis.
The Second Protocol-CRT-2 Protocol
Let $n = p \cdot q$ as in a usual RSA system. The smart card also prepares another key pair $(e_r, d_r)$ such that $d_r = d - r$, where $r$ is a small integer selected so that $\gcd(d_r, \phi(n)) = 1$ and $e_r \equiv d_r^{-1} \pmod{\phi(n)}$ is a small integer. Note that we don't assume that $e$ is also a small integer.
Step-2: Compute $m^{d_r} \bmod n$ via a conventional residue number system as
Step 
Notice that, in the above CRT-2 protocol, given any faulty $s'_p$ or $s'_q$ (or both) with random faults, a random faulty $\hat{m}'$ in (11) will be generated. is also $O(n)$.
The following protocol P-6 intends to simplify the protocol CRT-2 in order to provide improved performance and with fewer parameter settings. It also tries to resolve the security problem of the protocol P-1 described previously.
Protocol P-6
Step-1: Compute $m^{d_r} \bmod n$ via a conventional residue number system as
SECURE IMPLEMENTATION OF CRT RECOMBINATION
Previous designs of secure CRT-based speedup focus on developing fault infective computations of both s p and s q . It assumes that a CRT recombination as described in (1) is fault-free. In the following, it will be pointed out that an inappropriate CRT recombination may leak the secret prime factors of n if a hardware fault is induced.
In the following analysis both $s_p$ and $s_q$ are assumed to be error-free since the faulty conditions have been discussed extensively in the previous section. As described in (1), given $s_p$ and $s_q$, one of the possible CRT recombinations computes
where both $X_p$ and $X_q$ can be precomputed and stored in advance. The above method is often called Gauss's algorithm [21, p. 68]. For Step-3 of the proposed CRT-1 protocol, it can be easily verified that, if either the value of $X_p$ or $X_q$ is incorrect due to a hardware fault (induced by an attacker or an EEPROM error), then it follows that
Therefore, the CRT-based cryptanalysis by computing $\gcd((s'^{e} - m) \bmod n, n)$ gives neither $p$ nor $q$. So, based on the CRT-based cryptanalysis, the composite integer $n$ can be factorized by computing $p = \gcd((s'^{e} - m) \bmod n, n)$.
In fact, the above hardware fault attack on the CRT recombination is generic and can be applied to the original CRT speedup, Shamir's proposal, and the CRT-2 protocol proposed in this paper. However, the original Garner's algorithm can be modified as follows in order to avoid the above-mentioned hardware fault cryptanalysis:
where $X = p \cdot (p^{-1} \bmod q)$ can be precomputed and stored in advance.
The above-suggested modification requires a little more computation than its original version. However, it can be easily verified that, if the value of X is incorrect due to a hardware fault, the CRT recombination is still secure against the above-mentioned attack.
With a similar idea of fault infective computations of both $s_p$ and $s_q$, Gauss's algorithm (see (13)) and the modified Garner's algorithm (see (17)) also exhibit a fault infection property such that both
if either the value of $X_p$ or $X_q$ is incorrect and the value of $X$ is incorrect, respectively. Therefore, no factorization of $n$ is possible by computing $\gcd((\mathrm{CRT}(s_p, s_q)^{e} - m) \bmod n, n)$ if both $s_p$ and $s_q$ are error-free (or both are incorrect). Evidently, a careful selection of a CRT recombination procedure helps substantially against the CRT-based hardware fault attack.
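The difference between the recombination procedures can be explored with the following sketch. The recombination formulas themselves are not legible in this text, so the three functions below are assumptions: `gauss_crt` uses the textbook form with $X_p = q \cdot (q^{-1} \bmod p)$ and $X_q = p \cdot (p^{-1} \bmod q)$, `garner_crt` uses the textbook Garner form with the stored constant $p^{-1} \bmod q$, and `modified_garner_crt` uses the stored constant $X = p \cdot (p^{-1} \bmod q)$. All names and toy parameters are hypothetical.

```python
from math import gcd


def gauss_crt(s_p, s_q, p, q, X_p, X_q):
    """Assumed Gauss form: s = (s_p*X_p + s_q*X_q) mod n."""
    return (s_p * X_p + s_q * X_q) % (p * q)


def garner_crt(s_p, s_q, p, q, Y):
    """Assumed original Garner form with Y = p^-1 mod q."""
    return s_p + (((s_q - s_p) * Y) % q) * p


def modified_garner_crt(s_p, s_q, p, q, X):
    """Assumed modified Garner form with X = p * (p^-1 mod q)."""
    return (s_p + (s_q - s_p) * X) % (p * q)


# Toy parameters (assumed; far too small for real use).
p, q, e = 1009, 1013, 5
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
m = 123456
s_p, s_q = pow(m, d, p), pow(m, d, q)      # both residues are error-free

Y = pow(p, -1, q)
X = p * Y
X_p, X_q = q * pow(q, -1, p), X

s_gauss = gauss_crt(s_p, s_q, p, q, X_p, X_q)
s_garner = garner_crt(s_p, s_q, p, q, Y)
s_modified = modified_garner_crt(s_p, s_q, p, q, X)

# A corrupted stored constant: the original Garner form stays correct modulo p,
# while the modified form is disturbed modulo both primes.
bad_garner = garner_crt(s_p, s_q, p, q, Y + 1)
bad_modified = modified_garner_crt(s_p, s_q, p, q, X + 1)
g_garner = gcd((pow(bad_garner, e, n) - m) % n, n)
g_modified = gcd((pow(bad_modified, e, n) - m) % n, n)
print(s_gauss == s_garner == s_modified, g_garner, g_modified)
```

In the original Garner form the faulty constant only enters the multiple of $p$ that is added to $s_p$, so the output remains correct modulo $p$ and the gcd exposes $p$; in the modified form the faulty constant perturbs the result modulo both primes at once.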
KEY PAIRS SELECTION AND PERFORMANCE ANALYSIS
The Process of Selecting Key Pairs
In order to have better performance, the parameter $e_r$ should be small enough to keep the overall overhead negligible. As described previously, we don't assume that the public key $e$ is a small integer. The generic assumption that $e$ can be any integer between $3$ and $\phi(n) - 1$ subject to $\gcd(e, \phi(n)) = 1$ is widely accepted in the design and analysis of any general purpose cryptosystem involving the RSA system. So, a small integer $e_r$ ($\ll p - 1$ and $\ll q - 1$) can be selected such that $\gcd(e_r, \phi(n)) = 1$ and $d_r \equiv e_r^{-1} \pmod{\phi(n)}$ is computed.
The integer $d = d_r + r$ (where $r$ is a small random integer) is tested for $\gcd(d, \phi(n)) = 1$ in order to guarantee the existence of $e \equiv d^{-1} \pmod{\phi(n)}$. If such $d$ and $e$ are found, they are taken as the RSA private and public keys. Certainly, if more than one key pair is found, then a pair with smaller $e$ can be selected subject to the private key $d$ still being a secure one.
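The selection process can be sketched as a short search. This is a simplified illustration under assumptions: the function name `select_key_pairs`, the bound `max_r`, and the toy primes are hypothetical, and candidates for $r$ are tried in increasing order here rather than drawn at random as the text describes.

```python
from math import gcd


def select_key_pairs(p, q, max_r=64):
    """Pick a small e_r, derive d_r, then search a small r with gcd(d_r + r, phi) = 1."""
    phi = (p - 1) * (q - 1)
    # Smallest odd e_r >= 3 that is invertible modulo phi(n).
    e_r = next(c for c in range(3, phi, 2) if gcd(c, phi) == 1)
    d_r = pow(e_r, -1, phi)
    for r in range(1, max_r + 1):
        d = d_r + r
        if gcd(d, phi) == 1:
            e = pow(d, -1, phi)           # RSA public exponent, generally large
            return {"e": e, "d": d, "e_r": e_r, "d_r": d_r, "r": r}
    raise ValueError("no suitable r found; retry with a larger max_r or another e_r")


# Toy primes (assumed; far too small for real use).
p, q = 1009, 1013
n, phi = p * q, (p - 1) * (q - 1)
keys = select_key_pairs(p, q)
m = 123456
print(pow(pow(m, keys["d"], n), keys["e"], n) == m)
```

Since $d_r$ is odd whenever it is invertible modulo the even number $\phi(n)$, only even values of $r$ can succeed, so the search skips at least every other candidate.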
Performance of Shamir's Protocol
Notice that, in the CRT-based attack, a single erroneous computation result can totally break the RSA system by factoring the public modulus. Therefore, for a practical and reliable implementation of Shamir's idea, $|r|$ (the bit length of $r$) cannot be negligibly small, and this slows down the RSA signature computation to some degree.
Let the typical time for evaluating a modular multiplication $a \cdot b \bmod p$ (or $a \cdot b \bmod q$) be $t_m$. Then, using the standard square-and-multiply exponentiation [21]
For the conventional CRT-based speedup approach, the total computation time will be $2T$, excluding the cost for one extra CRT recombination (i.e., (1) or in notation as (2)), which is considered to be negligible. In Shamir's fault immune CRT-based speedup scheme, the total computation time will be
$$2T \left(\frac{|pr|}{|p|}\right)^{3},$$
excluding the cost for one extra CRT recombination and two modulo $r$ operations which are considered to be negligible.
Recall that the computational complexity of a modular multiplication is $O(\ell^2)$ (where $\ell$ is the bit length of all operands) in terms of bit operations. A square-and-multiply exponentiation algorithm requires $O(\ell)$ modular multiplications. These lead to the above performance analysis of Shamir's proposal.
As described previously, the most serious aspect of the CRT-based hardware fault attack is that a single erroneous computation result can totally break the RSA system by factoring the public modulus. The selection of $|r| = 32$ (as suggested by Shamir) may not be enough for a highly reliable implementation. Therefore, the CRT-based approach may be slowed down substantially when a higher security level is required, e.g., with $|r| = 96$, so that $P_{miss} \approx 10^{-29}$, where $P_{miss}$ is the probability that an attack goes undetected. For example, let $|p| \approx |q| = 512$ (or $|n| = 1{,}024$) and $|r| = 96$; then the overhead is at least about 68 percent when compared with the conventional CRT-based speedup approach, even without considering the many implementation overheads or the inconvenience of computing modulo $pr$ and $qr$ rather than just modulo $p$ and $q$.
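The figures quoted above follow from a cubic cost model, which the short calculation below reproduces. The function name is hypothetical, and the model (cost proportional to the cube of the modulus bit length, with all lower-order costs ignored) is the simplifying assumption stated in the text.

```python
def shamir_overhead(p_bits, r_bits):
    """Relative slowdown of exponentiation modulo p*r versus modulo p (cubic model)."""
    return ((p_bits + r_bits) / p_bits) ** 3 - 1


overhead_96 = shamir_overhead(512, 96)    # |p| = 512, |r| = 96
overhead_32 = shamir_overhead(512, 32)    # |p| = 512, |r| = 32
p_miss_96 = 2.0 ** -96                    # probability 1/2^k of an undetected fault

print(f"{overhead_96:.4f} {overhead_32:.4f} {p_miss_96:.3e}")
```

With $|r| = 96$ the model gives roughly 0.67, in line with the "about 68 percent" quoted above, and $2^{-96}$ is on the order of $10^{-29}$.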
Implementation Details and Performance of the Two New Protocols
There are some slight differences between the CRT-1 and the CRT-2 protocols. Both $s_p$ and $s_q$ (see (10)) in Step-2 of the CRT-2 protocol can be computed in parallel if two independent processors are available. In the CRT-1 protocol, however, $s_p$ (see (5)) must be computed before $s_q$, since $\hat{m}$ depends on the value of $s_p$. On the other hand, when the actual implementation details of the two protocols are considered, the CRT-1 protocol needs less memory (register) space. This may sometimes be important for an implementation in a small device such as a smart IC card. In the following rough analysis, a full-length register and a half-length register are used to store values of $|p \times q|$ bits and $|p|$ (or $|q|$) bits, respectively. In the CRT-1 protocol, four half-length temporary registers are used to store $k_p$, $k_q$, $s_p$, and $s_q$. We ignore the storage for $e_r$ since it is assumed to be a small integer. Also, three full-length temporary registers are used to store $d_r$, $CRT(s_p, s_q)$, and, by reusing storage, both $(s_p^{e_r} \bmod p) + k_p \cdot p$ and $(s_q^{e_r} \bmod q) + k_q \cdot q$.
In the CRT-2 protocol, four half-length temporary registers are likewise used to store $k_p$, $k_q$, $s_p$, and $s_q$. However, four full-length temporary registers are needed to store $d_r$, $CRT(s_p, s_q)$, $(s_p^{e_r} \bmod p) + k_p \cdot p$, and $(s_q^{e_r} \bmod q) + k_q \cdot q$. Apart from these minor differences, the two proposed protocols perform almost equally well in computation time and implementation space.
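The register counts above translate directly into temporary storage. The following small calculation is only a sketch: it assumes a 1,024-bit modulus (so half-length registers hold 512 bits), and the helper name is hypothetical.

```python
def temp_storage_bits(n_bits: int, half_regs: int, full_regs: int) -> int:
    """Total temporary storage for the given numbers of half/full registers."""
    half_bits = n_bits // 2          # a half-length register holds |p| bits
    return half_regs * half_bits + full_regs * n_bits


# Register counts taken from the rough analysis in the text.
crt1_bits = temp_storage_bits(1024, half_regs=4, full_regs=3)
crt2_bits = temp_storage_bits(1024, half_regs=4, full_regs=4)

if __name__ == "__main__":
    print(crt1_bits, crt2_bits, crt2_bits - crt1_bits)
```

With these assumptions CRT-1 needs 5,120 bits and CRT-2 needs 6,144 bits, so the saving of CRT-1 is exactly one full-length register.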
The two proposed protocols CRT-1 and CRT-2 perform almost as well as the conventional CRT-based speedup approach, requiring only a few additional modular multiplications and ordinary multiplications, provided that small parameters $r$ and $e_r$ are used. The selection and generation of such parameters have already been suggested in this section.
It can be verified that, in the worst case, the proposed CRT-based protocols take about twice the effort (i.e., 100 percent overhead) of the original CRT-based approach. This is the case where the RSA public key $e$ is chosen first in order to guarantee that it is a very small integer. In this situation, it is difficult to guarantee (it is still an open problem) that a small $e_r$ can be found easily, even if a small $r$ can be easily selected. Notably, even in this situation, the proposed protocols are still twice as fast as a computation without any CRT-based approach. Recall that, in Shamir's protocol with $|p| \approx |q| = 512$ (or $|n| = 1{,}024$) and $|r| = 96$, the overhead is at least about 68 percent. Most importantly, the two proposed protocols do not assume any checking procedure (which may itself be vulnerable to hardware fault cryptanalysis) and do not suffer the disadvantage of undetectable faults.
Insecure Modification of CRT-1 and CRT-2 Protocols
One may try to develop an alternative implementation of protocol CRT-2 in order to eliminate the use of both $k_p$ and $k_q$. One such attempt is to compute $\hat{m}$ as $[CRT(s_p, s_q)]^{e_r} \bmod n$ after both $s_p$ and $s_q$ have been obtained. The signature $s$ can then be computed as $s = CRT(s_p, s_q) \cdot \hat{m}^{r} \bmod n = [CRT(s_p, s_q)]^{e_r \cdot r + 1} \bmod n$.
Similarly, protocol CRT-1 can also be modified in the same approach such that the usage of k q can be eliminated.
After $CRT(s_p, s_q)$ has been obtained, the signature $s$ can be computed as $s = [CRT(s_p, s_q)]^{e_r \cdot r + 1} \bmod n$. In the above modified protocols, one may assume that an enhanced CRT recombination implementation (as described in a previous section) is employed 
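A toy computation suggests why such a shortcut is dangerous. The sketch below is an illustration under stated assumptions, not the protocol of this paper: it uses tiny primes, assumes the exponent split $d = d_r + r$ with $e_r \cdot d_r \equiv 1 \pmod{\phi(n)}$ (which is what makes the identity above yield $m^d$), and models a fault as adding 1 to $s_p$. All names are hypothetical.

```python
from math import gcd

# Toy parameters (far too small for real use; chosen only for illustration).
p, q = 1009, 1013
n, phi = p * q, (p - 1) * (q - 1)

e_r = 5                          # small e_r, coprime to phi for these primes
d_r = pow(e_r, -1, phi)          # e_r * d_r = 1 (mod phi)
# Smallest even r that keeps d = d_r + r invertible modulo phi.
r = next(t for t in range(2, 1000, 2) if gcd(d_r + t, phi) == 1)
d = d_r + r                      # assumed split of the secret exponent
e = pow(d, -1, phi)              # matching public exponent


def crt(s_p: int, s_q: int) -> int:
    """Recombine residues modulo p and q into a value modulo n."""
    return s_q + q * (((s_p - s_q) * pow(q, -1, p)) % p)


def modified_sign(m: int, fault_in_sp: bool = False) -> int:
    """Modified protocol: s = CRT(s_p, s_q)^(e_r*r + 1) mod n."""
    s_p = pow(m, d_r, p)
    s_q = pow(m, d_r, q)
    if fault_in_sp:
        s_p = (s_p + 1) % p      # hypothetical single fault in the p-branch
    return pow(crt(s_p, s_q), e_r * r + 1, n)


m = 123456
good = modified_sign(m)                        # fault-free signature
bad = modified_sign(m, fault_in_sp=True)       # signature after one fault
recovered = gcd((pow(bad, e, n) - m) % n, n)   # classic gcd step on the fault
```

In this toy setting the fault-free output equals $m^d \bmod n$, while the faulty output remains correct modulo $q$ and wrong modulo $p$, so the gcd step returns the factor $q$. The fault propagates without "infecting" the untouched branch, which is precisely what $k_p$ and $k_q$ are meant to prevent.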
ENHANCEMENT FOR TIMING ATTACK RESISTANCE
As described previously, Shamir's method can withstand a timing cryptanalysis based on a binary search approach [16], [22]. This is a very important reason why Shamir's method is superior to the conventional simple verification method.
In the proposed CRT-1 and CRT-2 protocols, another binary search timing cryptanalysis is possible over the $k_p = \lfloor m/p \rfloor$ and $k_q = \lfloor m/q \rfloor$ operations. Suppose the operation $\lfloor a/q \rfloor$ takes significantly less time if $a < q$. An attacker can intentionally provide a larger $m_{\ell}$ and a smaller $m_s$ iteratively in order to approximate the value of $p$ by observing the timing statistics.
In the following, a novel idea is proposed to harden CRT-1 and CRT-2 against the above two timing cryptanalyses. To resist the binary search timing attack on the modulo $p$ (or modulo $q$) operation, a random integer $R = k \cdot p$ (where $k$ is a random integer) is selected such that $R \gg p$. Then, the original process to obtain the parameter $m_p$ (or $m_q$) for the CRT computation is modified as
$$m_p = (m + R) \bmod p,$$
where $R$ is used as a premasking but no postmasking is necessary since $R \bmod p = 0$. To compute $m_q$, the random integer $R$ is selected to be $k \cdot q$.
To resist the binary search timing attack on the $\lfloor m/p \rfloor$ (or $\lfloor m/q \rfloor$) operation, a random integer of the form $R = k \cdot p$ is also applicable. Then, the original process to compute $k_p$ (or $k_q$) is modified as
$$k_p = \lfloor (m + R)/p \rfloor - k,$$
where $R$ is used as a premasking and $k$ is used as a postmasking since $\lfloor R/p \rfloor = k$. To compute $k_q$, the random integer $R$ is selected to be $k \cdot q$.
If a random integer $R$ ($\gg p, q$) is added to the message $m$ prior to the modulo $p$ or $\lfloor m/p \rfloor$ operation, the timing characteristic of whether or not the modulo $p$ or $\lfloor m/p \rfloor$ operation has actually been performed (probed by intentionally providing a larger $m_{\ell}$ or a smaller $m_s$) is no longer accessible to an attacker.
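The two masking identities can be checked with a few lines of code. This is a minimal sketch of the arithmetic only: the function names and the 64-bit mask size are assumptions, and Python's big-integer operations are not constant-time, so the code illustrates correctness of the masking rather than a side-channel-safe implementation.

```python
import secrets


def masked_mod(m: int, p: int, mask_bits: int = 64) -> int:
    """m mod p computed on the premasked value m + R, with R = k*p."""
    k = secrets.randbits(mask_bits) + 1   # random multiplier k >= 1
    return (m + k * p) % p                # no postmasking: R mod p == 0


def masked_floor_div(m: int, p: int, mask_bits: int = 64) -> int:
    """floor(m/p) computed on m + R, then corrected by subtracting k."""
    k = secrets.randbits(mask_bits) + 1
    return (m + k * p) // p - k           # postmasking: floor(R/p) == k


if __name__ == "__main__":
    p = 1009
    for m in (5, 1008, 1009, 123456):
        print(m, masked_mod(m, p), masked_floor_div(m, p))
```

Because $m + R$ always exceeds the modulus regardless of whether $m$ is smaller or larger than $p$, the reduction and division are always carried out on a large operand, which is the property the premasking relies on.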
While offering the same property of foiling binary search timing cryptanalysis to discover $p$ (or $q$), the above technique incurs less implementation overhead than Shamir's method. Shamir's method requires an expanded modulus, which brings some implementation inconvenience or complexity on most existing processors. No expanded modulus is necessary in the above proposal.
A remark is given in the following on another recently reported binary search timing cryptanalysis, proposed by Schindler [24], against an RSA implementation with both CRT speedup and the Montgomery exponentiation algorithm. Schindler's research showed that if a conventional Montgomery multiplication with a possible final subtraction is employed, then a binary search timing attack can be efficiently performed to find $p$ (or $q$). Schindler suggested one possible countermeasure based on a conventional blinding technique [22], [23] (sometimes called a masking technique). Another suggestion is to employ one of two recently developed Montgomery exponentiations with no final subtraction [25], [26]. In the blinding approach, both values $R^{e} \bmod n$ and $R^{-1} \bmod n$ (where $R$ is a large random integer) are precomputed and are used as a premasking and a postmasking, respectively. In fact, the blinding technique can also be used to counteract the above two binary search timing attacks on either the modulo $p$ or the $\lfloor m/p \rfloor$ operation.
As described previously, there are two possible parameter settings for the two proposed protocols. One setting has a small public key $e$ and the other has a large $e$. Note especially that the blinding technique is applicable to both parameter settings, with either small or large $e$, since $R^{e} \bmod n$ is precomputed only once. Further updating does not need to raise a value to the $e$th power. However, it should be carefully noted that the blinding technique by itself cannot withstand the CRT-based hardware fault cryptanalysis (refer to Section 4.1).
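The blinding technique mentioned above can be sketched as follows. The code uses toy RSA parameters, and the update rule (squaring both stored values) is an assumption on our part: it is one commonly used way to refresh the pair without a new $e$th power, and the text does not specify which update is intended. All function names are hypothetical.

```python
import secrets
from math import gcd

# Toy RSA key (illustration only).
p, q = 1009, 1013
n, phi = p * q, (p - 1) * (q - 1)
e = 5
d = pow(e, -1, phi)


def new_blinding_pair():
    """Pick a random R coprime to n; return (R^e mod n, R^-1 mod n)."""
    while True:
        R = secrets.randbelow(n - 2) + 2
        if gcd(R, n) == 1:
            return pow(R, e, n), pow(R, -1, n)


def update_pair(pre: int, post: int):
    """Refresh by squaring: (R^2)^e = (R^e)^2, so no e-th power is needed."""
    return pre * pre % n, post * post % n


def blinded_sign(m: int, pre: int, post: int) -> int:
    """Compute m^d mod n as ((m * R^e)^d) * R^-1 mod n."""
    return pow(m * pre % n, d, n) * post % n


if __name__ == "__main__":
    pre, post = new_blinding_pair()
    print(blinded_sign(123456, pre, post) == pow(123456, d, n))
```

Since $(m \cdot R^{e})^{d} \equiv m^{d} \cdot R \pmod n$, multiplying by $R^{-1}$ removes the mask. The cost of the one-time $R^{e}$ precomputation is therefore independent of how often signatures are produced, which is why the technique fits both the small-$e$ and the large-$e$ setting.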
ANOTHER IMPORTANT REASON TO AVOID USING A DECISION PROCEDURE
In [27], it was pointed out that Shamir's countermeasure [15], [16] may unfortunately benefit the following safe-error attack when an adversary is assumed to have no access to the decrypted message but only to the possible response (indicating whether the decrypted message is valid) from the hardware. This is because, in Shamir's design, a faulty result will be detected (which provides a distinguishable response) and should not be made available to any attacker. The following safe-error attack on a CRT-based environment is a straightforward extension of the original work on the safe-error attack [17].
RSA exponentiation is usually implemented with the square-and-multiply technique (Fig. 1a), where multiplication and modular reduction are interleaved to fit the word size $2^{\omega}$ (Fig. 1b) (e.g., see [28]). Suppose an adversary wants to guess the value of the $i$th bit, $d_i$, of the secret exponent $d$. Suppose that $d_i = 1$; the interleaved multiplication $A \cdot B \bmod n$ is thus performed (Line a.3, Fig. 1). Suppose that one or several bits of random error are introduced into the more significant positions of register $A$ or, more precisely, into some words $A_j$ for $j > j_{\tau}$, where $j_{\tau}$ represents the current value of counter $j$ (Line b.2, Fig. 1) when the faulty bits are introduced. Since the words containing the errors are no longer required for the next iterations (i.e., for $j = j_{\tau}, j_{\tau} - 1, \ldots, 0$, Line b.2, Fig. 1), the computation of $R = A \cdot B \bmod n$ will be correct. Moreover, since $R$ is restored into register $A$, $A \leftarrow A \cdot B \bmod n$ (Line a.3, Fig. 1), the error located in register $A$ will be cleared and the final result $c^{d} \bmod n$ will be correct. On the other hand, if $d_i = 0$, then the interleaved multiplication (Line a.3, Fig. 1) is bypassed and the errors induced into register $A$ will not be cleared, resulting in an incorrect value for the final result $c^{d} \bmod n$.
Recall that an adversary is assumed to have access to the possible response (if the algorithm and the hardware provide it) indicating an error or an error-free computation. Therefore, by inducing faulty bits, an adversary can learn the value of bit $d_i$ from the response provided by the hardware. Notice that if Shamir's countermeasure were not implemented, then an adversary would not be able to guess the correct value of $d_i$ under the above assumption made in [27].
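The safe-error principle can be simulated at an abstract level. The sketch below is not the word-level algorithm of Fig. 1: it assumes a right-to-left square-and-multiply loop and models the fault as a single flipped high-order bit of register A that strikes after the multiplication of the targeted iteration has already consumed the affected words. The function names and the toy key are hypothetical.

```python
def faulty_exponentiation(c: int, d: int, n: int, target: int) -> int:
    """Right-to-left square-and-multiply with a fault aimed at bit `target`.

    Abstract fault model (an assumption): the fault hits words of A that
    the multiplication of iteration `target` has already consumed.
    """
    A, B = 1, c % n
    mask = 1 << (n.bit_length() - 2)      # flip one high-order bit of A
    for i in range(d.bit_length()):
        if (d >> i) & 1:
            R = A * B % n                 # product is computed correctly
            if i == target:
                A ^= mask                 # fault lands in A ...
            A = R                         # ... and is overwritten (safe error)
        elif i == target:
            A ^= mask                     # no multiplication: fault persists
        B = B * B % n
    return A % n


def recover_exponent(c: int, d: int, n: int) -> int:
    """Rebuild d from the error / error-free response for each bit position."""
    correct = pow(c, d, n)
    guess = 0
    for i in range(d.bit_length()):
        error_free = faulty_exponentiation(c, d, n, i) == correct
        if error_free:                    # error was "safe", hence d_i = 1
            guess |= 1 << i
    return guess


if __name__ == "__main__":
    n = 1009 * 1013
    d = pow(5, -1, 1008 * 1012)
    print(recover_exponent(123456, d, n) == d)
```

In this model the adversary only needs the one-bit response "result correct or not" for each targeted iteration, which is exactly the kind of response a checking procedure makes available.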
It is evident that the proposed CRT-1 and CRT-2 protocols are much stronger than Shamir's protocol under the above assumption. This is because no checking procedure is employed in the CRT-1 and CRT-2 protocols, so no explicit response is available to an adversary. However, in some scenarios an adversary has access to an implicit response (e.g., a recomputation performed by the hardware because an incorrect result was found in some application context) or to the computation result itself (e.g., the result is a digital signature and is delivered to the adversary even if it is incorrect); a suitable countermeasure against the safe-error attack is then necessary.
Although the safe-error attack is, in essence, very powerful, fortunately it is very easy to counteract by using a solution provided in [17]. Using the right-to-left square-and-multiply exponentiation (Fig. 1a) as an illustration, one possible solution is to let register $B$ play the role of register $A$ (i.e., to act as the multiplier) in the interleaved modular multiplication procedure.
CONCLUSIONS
In this paper, two novel CRT-based protocols are proposed to speed up the RSA signature or decryption computation via the residue number system technique while, at the same time, being immune to hardware fault cryptanalysis. Most importantly, the two protocols do not assume the existence of any checking procedure that is error-free and immune to physical cryptanalysis (e.g., double checking; a final verification function, say raising an RSA signature to the $e$th power; or some kind of redundant computation, say Shamir's expanded modulus approach), nor of any form of hardware checking module. Nor do the proposed protocols suffer the disadvantage of producing an undetectable error with large probability, as in Shamir's method. Therefore, if carefully implemented, Chinese remaindering-based cryptosystem implementations are not more vulnerable than usual cryptosystem implementations.
Seungjoo Kim received the BE, ME, and DrEng degrees from Sungkyunkwan University, Korea, in 1994, 1996, and 1999, respectively. Since 1998, Dr. Kim has worked at the KISA (Korea Information Security Agency). Also, since 2000, Dr. Kim has been working as a chairman of TC10/SG10.02 of TTA, which is an IT standards organization of Korea.
Seongan Lim received the BS degree in mathematics from the Dongguk University, Korea, in 1985. In 1987, she received the MS degree in mathematics (algebra) from the Seoul National University, Korea. In 1995, she received the PhD degree in mathematics (several complex variables) from Purdue University, West Lafayette, Indiana. She is a senior member of the technical staff of KISA (Korea Information Security Agency), Korea. Her research interests include cryptography, fast computer arithmetic, computer algorithms, and mathematics.
SangJae Moon received the BE (1972) and ME (1974) degrees in electronic engineering from Seoul National University, Korea, and the PhD .
