In this paper, we present several efficient fault attacks against implementations of RSA-CRT signatures that use modular exponentiation algorithms based on Montgomery multiplication. They apply to any padding function, including randomized paddings, and as such are the first fault attacks effective against RSA-PSS. The new attacks are based on the assumption that a small register can be forced to either zero, or a constant value, or a value with zero high-order bits. We show that these models are quite realistic, as such faults can be achieved against many proposed hardware designs for RSA signatures.
Introduction
The RSA signature scheme is one of the most used schemes nowadays. An RSA signature is computed by applying some encoding function to the message, and raising the result to d-th power modulo N , where d and N are the RSA private exponent and the RSA public modulus, respectively. The modular exponentiation is the costliest part of signature generation, so it is important to implement it efficiently. A very commonly used speed-up is RSA-CRT signature generation, where the exponentiation is carried out separately modulo the two factors of N , and the results are then recombined using the Chinese Remainder Theorem. However, when unprotected, RSA-CRT signatures are vulnerable to the so-called Bellcore attack first introduced by Boneh et al. [6] , and later refined in a number of subsequent publications [7, 2, 42] : an attacker who knows the padded message and is able to inject a fault in one of the two half-exponentiations can factor the public modulus using a faulty signature with a simple GCD computation.
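As an illustration of the Bellcore attack just described, here is a minimal Python sketch (Python 3.8+ for modular inverses via `pow`). The two primes, the exponent e = 65537, the message value and the single-bit fault are toy assumptions chosen for this example only; they do not come from the paper.

```python
from math import gcd

# Toy parameters, chosen only for illustration (not from the paper).
p = 2**127 - 1            # a Mersenne prime
q = 2**128 - 159          # largest prime below 2**128
N, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))

def sign_crt(m, fault_q=False):
    """RSA-CRT signature of an already padded message m."""
    s_p = pow(m, d % (p - 1), p)
    s_q = pow(m, d % (q - 1), q)
    if fault_q:
        s_q ^= 1          # hypothetical fault: one flipped bit in S_q
    t = (pow(q, -1, p) * (s_p - s_q)) % p
    return s_q + q * t    # CRT recombination

m = 0x0123456789ABCDEF0123456789ABCDEF
good = sign_crt(m)
bad = sign_crt(m, fault_q=True)

# The faulty signature is still correct modulo p but wrong modulo q,
# so bad^e - m is a multiple of p and not of q.
recovered = gcd((pow(bad, e, N) - m) % N, N)
```

Here `recovered` is the factor p, because the fault was injected in the half-exponentiation modulo q. The attacker needs the padded message m, which is why this basic form does not apply to randomized paddings.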
Many workarounds have been proposed to patch this vulnerability, including extra computations and sanity checks of intermediate and final results. A recent taxonomy of these countermeasures is given in [31] . The simplest countermeasure may be to verify the signature before releasing it. This is reasonably cheap if the public exponent e is small and available in the signing device. In some cases, however, e is not small, or even not given, e.g. the JavaCard API does not provide it [29] . Another approach is to use an extended modulus. Shamir's trick [33] was the first such technique to be proposed; later refinements were suggested that also protect CRT recombination when it is computed using Garner's formula [5, 12, 40, 14] . Finally, yet another way to protect RSA-CRT signatures against faults is to use redundant exponentiation algorithms, such as the Montgomery Ladder.
Papers including [20, 31] propose such countermeasures. Regardless of the approach, RSA-CRT fault countermeasures tend to be rather costly: for example, Rivain's countermeasure [31] has a stated overhead of 10 % compared to an unprotected implementation, and is purportedly more efficient than previous works including [20, 40] .
Relatedly, while Boneh et al.'s original fault attack does not apply to RSA signatures with probabilistic encoding functions, some extensions of it were proposed to attack randomized ad hoc padding schemes such as ISO 9796-2 and EMV [15, 17] . However, Coron and Mandal [16] were able to prove that Bellare and Rogaway's padding scheme RSA-PSS [3, 4, 23] is secure against random faults in the random oracle model. In other words, if injecting a fault on the half-exponentiation modulo the second factor q of N produces a result that can be modeled as uniformly distributed modulo q, then the result of such a fault cannot be used to break RSA-PSS signatures. It is tempting to conclude that using RSA-PSS should enable signers to dispense with costly RSA-CRT countermeasures. In this paper, we argue that this is not necessarily the case.
Our contributions
The RSA-CRT implementations targeted in this paper use the state-of-the-art modular multiplication algorithm due to Montgomery [27] , which avoids the need to compute actual divisions on large integers, replacing them with only multiplications and bit shifts. A typical implementation of the Montgomery multiplication algorithm will use small registers to store precomputed values or short integer variables throughout the computation. The size of these registers varies with the architecture, from a single bit in certain hardware implementations to 16 bits, 32 bits or more in software. This paper presents several fault attacks on these small registers during Montgomery multiplication, that cause the result of one of the half-exponentiations to be unusually small. The factorization of N can then be recovered using a GCD, or an approximate common divisor algorithm such as [10, 13, 21] .
We consider three models of faults on the small registers. In the first model, one register can be forced to zero. In that case, we show that causing such a fault in the inverse Montgomery transformation of the result of a half-exponentiation, or a few earlier consecutive Montgomery multiplications, yields a faulty signature which is a multiple of the corresponding factor q of N . Hence, we can factor N by taking a simple GCD. In the second model, another register can be forced to some (possibly unknown) constant value throughout the inverse Montgomery transformation of the result of a half-exponentiation, or a few earlier consecutive Montgomery multiplications. A faulty signature in this model is a close multiple of the corresponding factor q of N , and we can thus factor N using an approximate common divisor algorithm. Finally, the third model makes it possible to force some of the higher order bits of one register to zero. We show that, while injecting one such fault at the end of the inverse Montgomery transformation results in a faulty signature that is not usually close enough to a multiple of q to reveal the factorization of N on its own, a moderate number of faulty signatures (a dozen or so) obtained using that process are enough to factor N .
The RSA padding scheme used for signing, whether deterministic or probabilistic, is irrelevant in our attacks. In particular, RSA-PSS implementations are also vulnerable. Of course, this does not contradict the security result due to Coron and Mandal [16] , as the faults we consider are strongly non-random. Our results do suggest, however, that exponentiation algorithms based on Montgomery multiplication are quite sensitive to a very realistic type of fault attacks and that using RSA-CRT countermeasures is advisable even for RSA-PSS.
Organization of the paper
In Sect. 2, we recall some background material on the Montgomery multiplication algorithm, on modular exponentiation techniques, and on RSA-CRT signatures. Our new attacks are then described in Sects. 3-5, corresponding to three different fault models: null faults, constant faults, and zero high-order bits faults. Finally, in Sect. 6, we discuss the applicability of our fault models to concrete hardware implementations of RSA-CRT signatures, and find that many proposed designs are vulnerable.
Preliminaries

Montgomery multiplication
First proposed by Montgomery [27] , the Montgomery multiplication algorithm provides a fast method for computing modular multiplications and squarings. Indeed, the Montgomery multiplication algorithm only uses multiplications, additions and shifts, but no explicit division or modular reduction of big integers. Its cost is about twice that of a (nonmodular) multiplication (compared to 2.5 times for a multiplication and a Barrett reduction), without any constraint on the modulus. Usually, one of two different techniques is used to compute Montgomery multiplication: either separate operand scanning (SOS), or coarsely integrated operand scanning (CIOS). Consider a device whose processor or coprocessor architecture has r-bit registers (typically r = 1, 8, 16, 32 or 64 bits). Let $b = 2^r$, q be the (odd) modulus with respect to which multiplications are carried out, k the number of r-bit registers used to store q, and $R = b^k$, so that $q < R$ and $\gcd(q, R) = 1$. The SOS variant consists in using the Montgomery reduction after the multiplication: for an input A such that $A < Rq$, it computes $\mathrm{Mgt}(A) \equiv A R^{-1} \pmod q$, with $0 \le \mathrm{Mgt}(A) < q$. The CIOS variant mixes the reduction algorithm with the previous multiplication step: considering x and y with $xy < Rq$, it computes $\mathrm{CIOS}(x, y) = x y R^{-1} \bmod q$ with $\mathrm{CIOS}(x, y) < q$. Figure 1 presents the main steps of the CIOS variant, which will be used thereafter. However, replacing the CIOS by the SOS or any other variant proposed in [24] does not protect against any of our attacks.

Fig. 1 The returned value is $(xy \cdot b^{-k} \bmod q)$. Since $b = 2^r$, the division is a bit shift
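The CIOS computation described above can be sketched in Python as follows. This is a minimal sketch assuming the usual word-serial formulation (one word of x per iteration, a precomputed value q' = −q⁻¹ mod b, and a final conditional subtraction); the exact step layout of Fig. 1 is not reproduced, and the function name, variable names and test modulus are illustrative assumptions.

```python
def cios(x, y, q, r, k):
    """Montgomery product x*y*R^-1 mod q with R = 2**(r*k); needs x < R, y < q."""
    b = 1 << r
    q_prime = (-pow(q, -1, b)) % b          # precomputed q' = -q^-1 mod b
    a = 0
    for j in range(k):
        x_j = (x >> (r * j)) & (b - 1)      # j-th r-bit word of x
        u_j = ((a + x_j * y) * q_prime) % b # makes a + x_j*y + u_j*q divisible by b
        a = (a + x_j * y + u_j * q) >> r    # exact division by b (a bit shift)
    return a - q if a >= q else a           # final reduction

# Illustrative parameters (assumptions): a 128-bit prime stored in four 32-bit words.
q = 2**128 - 159
r, k = 32, 4
R = 1 << (r * k)
x = 0x0123456789ABCDEF0123456789ABCDEF
y = 3**70

product = cios(x, y, q, r, k)
x_mont = cios(x, R * R % q, q, r, k)        # to Montgomery representation: x*R mod q
x_back = cios(x_mont, 1, q, r, k)           # inverse Montgomery transformation
```

The last two lines show the conversions that surround an exponentiation: multiplying by R² mod q enters Montgomery representation, and CIOS(·, 1) leaves it.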
Among all the variants proposed for this algorithm, the optimization of [41] is well known: if Rq > x y, then the result of Algorithm 1 without the final reduction (Step 10) is between 0 and 2q. Therefore for an exponentiation algorithm, there is no need to carry out this final reduction if R > 4q. Besides its efficiency, this variant has the advantage of thwarting timing attacks [1, 9, 32] , which essentially rely on detecting whether the reduction is carried out or not. Nevertheless, these attacks do not easily work with randomized paddings, since the attacker needs to carefully choose the message. In contrast, our attacks work on any padding, with or without this reduction step.
Exponentiation algorithms using Montgomery multiplication
Montgomery reduction is especially interesting when used as part of a modular exponentiation algorithm. Many such exponentiation algorithms can be found in the literature, including the square-and-multiply algorithm from either the least or the most significant bit of the exponent, the Montgomery Ladder (which is used as a side-channel countermeasure against cache analysis, branch analysis, timing analysis and power analysis), the square-and-multiply k-ary algorithm (which boasts greater efficiency thanks to fewer multiplications) and the Sliding Window algorithm. These five exponentiation algorithms will be considered in this paper, and each of them (except the square-and-multiply MSB) is detailed in Fig. 2 . Note that using the Montgomery multiplications inside any exponentiation algorithm requires all variables to be in Montgomery representation ($\bar{x} = xR \bmod q$ is the Montgomery representation of x) before applying the exponentiation process. In Step 2 of each algorithm from 
RSA-CRT signature generation
Let N = pq be an n-bit RSA modulus. The public key is denoted by (N, e) and the associated private key by (p, q, d). For a message M to be signed, we denote by $S = m^d \bmod N$ the corresponding signature, where m is deduced from M by an encoding function, possibly randomized. A well-known optimization of this operation is RSA-CRT, which takes advantage of the decomposition of N into prime factors. By replacing a full exponentiation of size n by two exponentiations of size n/2, it divides the computational cost by a factor of about 4. As a result, almost all implementations of RSA signatures use RSA-CRT, including OpenSSL [39] and the JavaCard API [29] .
Recovering S from its reductions S p and S q modulo p and q can be done either by the usual CRT reconstruction formula (1) below, or using the recombination technique (2) due to Garner [19] :
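$$S = \big((S_q \cdot p^{-1} \bmod q) \cdot p + (S_p \cdot q^{-1} \bmod p) \cdot q\big) \bmod N \qquad (1)$$

$$S = S_q + q \cdot \big(q^{-1} \cdot (S_p - S_q) \bmod p\big) \qquad (2)$$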
Garner's formula (2) does not require a reduction modulo N , which is interesting for efficiency reasons and also because it prevents certain fault attacks [8] . On the other hand, it does require an inverse Montgomery transformation $S_q = \mathrm{CIOS}(\bar{S}_q, 1)$, whereas that step is not necessary for formula (1) , as it can be mixed with the multiplication with $p^{-1} \bmod q$. This is an important point, as some of our attacks specifically target the inverse Montgomery transformation. The main steps of the RSA-CRT signature generation with Garner's recombination are recalled in Fig. 3 .

Fig. 2 Four of the exponentiation algorithms considered in this paper. In each case, $e_0, \ldots, e_t$ are the bits (or in the k-ary case, the digits in base $2^k$) of the exponent e from the least to the most significant, and R is the Montgomery coefficient
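The two recombination techniques can be compared with a short Python sketch. It assumes the standard forms of the CRT formula, S = ((S_q · p⁻¹ mod q) · p + (S_p · q⁻¹ mod p) · q) mod N, and of Garner's formula, S = S_q + q · (q⁻¹ · (S_p − S_q) mod p); the primes, exponent and message are toy values picked for illustration.

```python
# Toy key (assumption for illustration only).
p = 2**127 - 1
q = 2**128 - 159
N, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))

m = 0x00C0FFEE00C0FFEE00C0FFEE00C0FFEE    # stands for an encoded message
s_p = pow(m, d % (p - 1), p)              # half-exponentiation modulo p
s_q = pow(m, d % (q - 1), q)              # half-exponentiation modulo q

# Usual CRT reconstruction: needs a final reduction modulo N.
s_crt = ((s_q * pow(p, -1, q) % q) * p + (s_p * pow(q, -1, p) % p) * q) % N

# Garner's recombination: no reduction modulo N, S_q is used as is.
s_garner = s_q + q * ((pow(q, -1, p) * (s_p - s_q)) % p)
```

Both values equal the plain signature m^d mod N. In Garner's form the q-part S_q enters the result directly, which is why a faulty S_q = 0 immediately gives a multiple of q.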
Null faults
We first consider a fault model in which the attacker can force the register containing the precomputed value $q' = (-q^{-1} \bmod b)$ to zero in certain calls to the CIOS algorithm during the computation of $S_q$.
Under suitable conditions, we will see that such faults can cause the q-part of the signature to be erroneously evaluated as $\tilde{S}_q = 0$, which makes it possible to retrieve the factor q of N from one such faulty signature $\tilde{S}$, as $q = \gcd(\tilde{S}, N)$.
Attacking CIOS(A, 1)
Suppose first that the fault attacker can force $q'$ to zero in the very last CIOS computation during the evaluation of $S_q$, i.e. the inverse Montgomery transformation CIOS(A, 1).
Theorem 1 A faulty signature $\tilde{S}$ generated in this fault model is a multiple of q (for any of the exponentiation algorithms considered herein and regardless of the encoding function involved, probabilistic or not).
Proof The faulty value $q' = 0$ causes all of the variables $u_j$ in the CIOS loop to vanish; indeed, for j = 0, . . . , k − 1, they evaluate to:
$$u_j = (a_0 + A_j \cdot 1) \cdot q' \bmod b = 0,$$
where $a_0$ denotes the low-order word of the accumulator a.
As a result, the value $\tilde{S}_q$ computed by this CIOS loop can be written as:
$$\tilde{S}_q = \left\lfloor \frac{\cdots \left\lfloor \dfrac{\lfloor A_0 / 2^r \rfloor + A_1}{2^r} \right\rfloor \cdots + A_{k-1}}{2^r} \right\rfloor.$$
Now, the values $A_j$ are r-bit words, i.e. $0 \le A_j \le 2^r - 1$. It follows that each of the integer divisions by $2^r$ evaluates to zero, and hence $\tilde{S}_q = 0$. As a result, the faulty signature $\tilde{S}$ is a multiple of q as stated.
It is thus easy to factor N with a single faulty signature $\tilde{S}$, by computing $\gcd(\tilde{S}, N)$. Note also that if this last CIOS step is computed as CIOS(1, A) instead of CIOS(A, 1), the formulas are slightly different but the result still holds.
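The effect of this null fault can be checked with a small Python simulation. It is a sketch under assumptions: the CIOS loop follows the usual word-serial formulation, the fault is modeled by passing `q_prime=0`, and the values `A` (Montgomery-form input) and `s_p` (correct half-signature modulo p) are arbitrary stand-ins rather than outputs of a real exponentiation.

```python
from math import gcd

def cios(x, y, q, r, k, q_prime=None):
    """Word-serial Montgomery product; q_prime=0 models the null fault."""
    b = 1 << r
    if q_prime is None:
        q_prime = (-pow(q, -1, b)) % b
    a = 0
    for j in range(k):
        x_j = (x >> (r * j)) & (b - 1)
        u_j = ((a + x_j * y) * q_prime) % b   # always 0 when q_prime == 0
        a = (a + x_j * y + u_j * q) >> r
    return a - q if a >= q else a

def garner(s_p, s_q, p, q):
    """Garner recombination S = S_q + q * (q^-1 * (S_p - S_q) mod p)."""
    return s_q + q * ((pow(q, -1, p) * (s_p - s_q)) % p)

# Toy primes and stand-in values (assumptions for illustration).
p, q = 2**127 - 1, 2**128 - 159
N, r, k = p * q, 32, 4
A = 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0
s_p = 0x1F2E3D4C5B6A79880123456789ABCDEF

s_q_faulty = cios(A, 1, q, r, k, q_prime=0)   # faulty inverse transformation
faulty_signature = garner(s_p, s_q_faulty, p, q)
recovered = gcd(faulty_signature, N)
```

With q' forced to zero, every word of A is shifted out without any multiple of q being added, so the q-part is 0 and the GCD returns q whatever the value of A.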
Attacking consecutive CIOS steps
If Garner recombination is not used or the computation of CIOS(A, 1) is somehow protected against faults, a similar result can be achieved by forcing $q'$ to zero in earlier calls to CIOS, provided that a certain number of successive CIOS executions are faulty.
We consider here the fault model where we force $q'$ to zero on $\ell$ consecutive CIOS steps. We will examine how this plays out in each of the five exponentiation algorithms in turn. For now, we let $\ell = \lceil \log_2 \lceil \log_2 q \rceil \rceil$.
In all cases, we also assume, heuristically, that the values $\bar{x}$, A in Montgomery representation involved in our computation are uniformly distributed modulo q before the first fault is injected; this means, in particular, that they are smaller than $2^{\lceil \log_2 q \rceil - 1}$ with probability at least 1/2.
Square-and-Multiply LSB
We first consider a fault model in which the attacker can force the precomputed value $q'$ to zero during $\ell$ consecutive calls of CIOS($\bar{x}$, $\bar{x}$) in the Square-and-Multiply LSB algorithm (Step 8 in Fig. 2a ), during the computation of $S_q$. Then, we claim that with probability at least 1/2, a faulty signature $\tilde{S}$ generated in this fault model will be a multiple of q, leading to the same key recovery as before.
Indeed, suppose faults are injected starting from iteration i = α in the loop of the exponentiation algorithm, and that before then, $|\bar{x}| \le \lceil \log_2 q \rceil - 1$, where $|\bar{x}|$ represents the bit length: this happens with probability at least 1/2. The fault $q' = 0$ has to occur in CIOS($\bar{x}$, $\bar{x}$) (the CIOS in Step 6 does not modify $\bar{x}$ and thus can be ignored in this case). Then, the output $\tilde{\bar{x}}$ of this faulty CIOS is, up to rounding errors:
$$\tilde{\bar{x}} \approx \frac{\bar{x}^2}{R} = \frac{\bar{x}^2}{2^{rk}}.$$
With our assumption on the size of $\bar{x}$, we obtain $|\tilde{\bar{x}}| \le \lceil \log_2 q \rceil - 2$. Therefore, for i = α + 1 and with $q' = 0$, the size of the output of CIOS($\tilde{\bar{x}}$, $\tilde{\bar{x}}$) will be reduced to at most $\lceil \log_2 q \rceil - 4$, and so forth. By induction, keeping the fault $q' = 0$ up to iteration $i = \alpha + \ell - 1$, i.e. through $\ell$ executions of CIOS, brings the value $\bar{x}$ down to 0. Clearly, the faulty half-exponentiation thus outputs $\tilde{S}_q = 0$, hence the stated result.
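The shrinking argument can be observed directly with a short Python sketch. It assumes that a Montgomery squaring with q' = 0 reduces to shifting out one word per iteration (all u_j vanish, so the modulus never enters the computation), and it uses an illustrative 128-bit prime with 32-bit words; the starting value is the largest one allowed by the size assumption.

```python
def faulty_square(x, r, k):
    """CIOS(x, x) with q' forced to 0: every u_j is 0, so q drops out.
    The final conditional subtraction is omitted since the result is below q here."""
    b = 1 << r
    a = 0
    for j in range(k):
        x_j = (x >> (r * j)) & (b - 1)
        a = (a + x_j * x) >> r
    return a

q = 2**128 - 159              # illustrative modulus (assumption)
r, k = 32, 4                  # R = 2**128
L = q.bit_length()            # ceil(log2 q), since q is not a power of two
ell = (L - 1).bit_length()    # ceil(log2 L)

x = (1 << (L - 1)) - 1        # worst case with bit length L - 1
sizes = []
for _ in range(ell):
    x = faulty_square(x, r, k)
    sizes.append(x.bit_length())
```

After each faulty squaring the bit length drops to at most L − 2, L − 4, L − 8 and so on, and `x` is 0 after `ell` iterations, in line with the induction above.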
Square-and-Multiply MSB
The case of the Square-and-Multiply MSB exponentiation algorithm is similar. We claim that if we can force $q'$ to zero during $\ell$ consecutive steps of the Square-and-Multiply MSB loop (hence up to $2\ell$ calls to CIOS), then with probability at least 1/2, a faulty signature $\tilde{S}$ generated in this model will be a multiple of q.
The idea is again to target the Montgomery squaring step, which in the MSB case is A ← CIOS(A, A). The main difference with the LSB case is that the other CIOS also affects the same value A, and hence it is important that $q'$ be kept to zero during this other call as well (typically, though, $q'$ is simply kept to zero throughout sufficiently many loops, so this does not really matter in practice). Otherwise, the analysis is the same as in the LSB case, and we find that $\ell$ consecutive faulty iterations are sufficient to have $\tilde{S}_q = 0$ with probability at least 1/2.
Remark 1 While a fault has to be injected during the execution of A ← CIOS(A, $\bar{x}$) as well, this faulty execution has some probability of reducing the size of A too, if the size of $\bar{x}$ is less than $\lceil \log_2 q \rceil$. Hence fewer iterations will be required in practice to reach A = 0. Moreover, if faults are injected throughout $\ell$ loops of the Square-and-Multiply MSB algorithm starting from the very beginning, we should get $\tilde{S}_q = 0$ with probability 1. Indeed, the initial value of A is equal to R mod q, and hence is shorter than $\lceil \log_2 q \rceil$ bits.
Montgomery Ladder
In the case of the Montgomery Ladder, we obtain a similar result when $q'$ is forced to zero in $2\ell - 1$ suitable consecutive iterations of the loop, which contain at least $\ell$ Montgomery squarings of $\bar{x}$ (Fig. 2b ) or $\ell$ Montgomery squarings of A (in Step 7, ibid.), depending on whether the corresponding string of bits of e contains more ones or zeros. The same analysis as in the Square-and-Multiply case shows that keeping $q'$ to zero throughout the iterations containing those $\ell$ Montgomery squarings will have a probability at least 1/2 of canceling $\bar{x}$ or A respectively, and hence yield $\tilde{S}_q = 0$.
Remark 2 The stated bound of $2\ell - 1$ iterations is quite pessimistic, and the attack performs significantly better in practice, as both $\bar{x}$ and A will tend to become shorter during faulty Montgomery squarings, and hence both contribute to bringing each other's size further down in the CIOS calls that are not Montgomery squarings (Steps 6 and 9). Additionally, if faults are injected starting from the beginning of the loop, then, like in the Square-and-Multiply MSB setting, the attack succeeds with probability 1 in view of the initial value of A.
Square-and-Multiply k-ary
For the Square-and-Multiply k-ary exponentiation, forcing $q'$ to zero during $\lceil \ell / k \rceil$ consecutive iterations of the loop is enough to achieve a probability of 1/2 of getting $\tilde{S}_q = 0$.
Indeed, the analysis is identical to the Square-and-Multiply MSB case of Sect. 3.2.2, except that each loop contains k Montgomery squarings instead of only one.
Sliding window
Finally, we consider the case where $q'$ can be forced to zero in $\ell$ consecutive iterations of the main loop (Steps 8-18 in Fig. 2d ) of the Sliding Window exponentiation algorithm, and claim that $\ell$ consecutive faulty iterations are enough to obtain $\tilde{S}_q = 0$ with probability 1/2.
Again, the situation is analogous to the Square-and-Multiply MSB setting of Sect. 3.2.2, in the sense that each iteration of the loop contains at least one Montgomery squaring of A, hence $\ell$ iterations are enough to reach A = 0.
Simulation results
We have carried out a simulation of null faults on consecutive CIOS steps for the first three exponentiation algorithms, with varying numbers of faulty iterations; for the Square-and-Multiply MSB and the Montgomery Ladder algorithms, two sets of experiments have been conducted for each parameter set: one with faults starting from the first iteration, and another one with faults starting from a random iteration somewhere in the exponentiation loop. Results are collected in Table 1 . As we can see, success rates are noticeably higher than those claimed above. This is because, with probability 1/2 at each faulted step of the exponentiation, the size of the targeted value can decrease by more than the theoretical minimum.
Constant faults
In this section, we consider a different fault model, in which the fault attacker can force the variables u j in the CIOS algorithm to some (possibly unknown) constant value u.
Just as with null faults, we consider two scenarios: one in which the last CIOS computation is attacked, and another in which several inner consecutive CIOS computations in the exponentiation algorithm are targeted.
Attacking CIOS(A, 1)
Faults on all iterations
Consider first the case when faults are injected in all iterations of the very last CIOS computation. In other words, the device will compute CIOS(A, 1), except that the variables u j , j = 0, . . . , k − 1, are replaced by a fixed, possibly unknown value u. In that case, we show that a single faulty signature is enough to factor N and recover the secret key. The key result is as follows: 
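Proof Up to the possible subtraction of q, which clearly does not affect our result, the value $\tilde{S}_q$ computed in the faulty execution of CIOS(A, 1) can be written as:
$$\tilde{S}_q = \left\lfloor \frac{\cdots \left\lfloor \dfrac{\left\lfloor \dfrac{A_0 + uq}{2^r} \right\rfloor + A_1 + uq}{2^r} \right\rfloor \cdots + A_{k-1} + uq}{2^r} \right\rfloor.$$
We claim that this value $\tilde{S}_q$ is close to the real number $u \cdot q / (2^r - 1)$. Indeed, on one hand, using the fact that $\lfloor x \rfloor \le x$ for all x and $A_j \le 2^r - 1$ for j = 0, . . . , k − 1, we obtain:
$$\tilde{S}_q \le (uq + 2^r - 1) \sum_{i=1}^{k} 2^{-ri} < \frac{uq}{2^r - 1} + 1.$$
On the other hand, since $\lfloor x \rfloor > x - 1$ and $A_j \ge 0$, we get:
$$\tilde{S}_q > uq \sum_{i=1}^{k} 2^{-ri} - \sum_{i=0}^{k-1} 2^{-ri} > \frac{uq}{2^r - 1} - \frac{uq}{2^{rk}(2^r - 1)} - 2 > \frac{uq}{2^r - 1} - 3,$$
as we have both $u \le 2^r - 1$ and $q < 2^{rk}$. As a result, we obtain:
$$-2 < \tilde{S}_q + 1 - \frac{uq}{2^r - 1} < 2,$$
and hence:
$$\left| (2^r - 1)(\tilde{S}_q + 1) - uq \right| < 2^{r+1}.$$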
Thus, a single faulty signature yields a value $V = (2^r - 1) \cdot (\tilde{S} + 1) \bmod N$ which is very close to a multiple of q. It is easy to use this value to recover q itself. Several methods are available:
- If r is small (say 8 or 16), it may be easiest to just use exhaustive search: q is found among the values $\gcd(V + X, N)$ for $|X| \le 2^{r+1}$, and hence can be retrieved using around $2^{r+2}$ GCD computations.
- A more sophisticated option, which may be interesting for r = 32, is the baby step, giant step-like algorithm by Chen and Nguyen [10] , which runs in time $O(2^{r/2})$.
- Alternatively, for any r up to half of the size of q, one can use Howgrave-Graham's algorithm [21] based on Coppersmith techniques. It is the fastest option unless r is very small (a simple implementation in Sage [36] runs in about 1.5 ms on our standard desktop PC with a 512-bit prime q for any r up to ≈ 160 bits, whereas exhaustive search already takes over one second for r = 16).
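The exhaustive-search option can be sketched in Python for r = 8. This is an illustration under assumptions: the faulty CIOS(A, 1) is modeled by replacing every u_j with the same constant u, Garner recombination is used, and the primes, the constant u and the stand-in values `A` and `s_p` are arbitrary choices, not data from the paper.

```python
from math import gcd

def cios_const_fault(x, y, q, r, k, u):
    """CIOS loop in which every u_j is replaced by the constant u."""
    b = 1 << r
    a = 0
    for j in range(k):
        x_j = (x >> (r * j)) & (b - 1)
        a = (a + x_j * y + u * q) >> r
    return a - q if a >= q else a

# Toy primes and stand-in values (assumptions for illustration).
p, q = 2**127 - 1, 2**128 - 159
N, r, k = p * q, 8, 16                    # sixteen 8-bit words, R = 2**128
A = 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0    # stand-in Montgomery-form input
s_p = 0x1F2E3D4C5B6A79880123456789ABCDEF  # stand-in correct S_p
u = 0xA7                                  # constant fault, unknown to the attacker

s_q_faulty = cios_const_fault(A, 1, q, r, k, u)
faulty_signature = s_q_faulty + q * ((pow(q, -1, p) * (s_p - s_q_faulty)) % p)

# Attacker side: only N, r and the faulty signature are used from here on.
V = ((2**r - 1) * (faulty_signature + 1)) % N
recovered = None
for X in range(-2**(r + 1), 2**(r + 1) + 1):
    g = gcd((V + X) % N, N)
    if 1 < g < N:                         # nontrivial factor found
        recovered = g
        break
```

The loop performs at most about 2^(r+2) GCD computations, which is instantaneous for r = 8; for larger registers the lattice-based options listed above are the practical choice.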
Faults on most iterations
In fact, Howgrave-Graham's algorithm is especially relevant if the constant faults do not start at the very first iteration in the CIOS loop. More precisely, suppose that the fault attacker can force the variables u j to a constant value u not for all j but for j = j 0 , j 0 + 1, . . . , k − 1 for some j 0 . Then, the same computation as in the proof of Theorem 2 yields the following bound on S q :
It follows that $(2^r - 1) \cdot \tilde{S}$ is a close multiple of q with error size $2^{r(j_0 + 1)}$. Now note that Howgrave-Graham's algorithm [21] will recover q given N and a close multiple with error size at most $q^{1/2 - \varepsilon}$. This means that one faulty signature $\tilde{S}$ is enough to factor N as long as $j_0 + 1 < k/2$, i.e. the constant faults start in the first half of the CIOS loop.
Moreover, if the faults start even later, a single signature will no longer suffice, but q can be recovered if several faults are available, using the generalization of Howgrave-Graham's algorithm due to Cohn and Heninger [13] . That algorithm is discussed in a different context in Sect. 5.
Attacking other CIOS steps
As in Sect. 3.2, if Garner recombination is not used or if CIOS(A, 1) is protected against faults, we can adapt the previous attack to target earlier calls to CIOS and still reveal the factorization of N . However, the attack requires two faulty signatures with the same constant fault u.
For now, we focus on the Square-and-Multiply LSB algorithm and assume that constant faults are injected in the evaluations of CIOS(x,x) and CIOS(A,x) during the exponentiation process computing S q . More precisely, suppose that u i = u (i = 0, . . . , k − 1) for these particular CIOS, and write:
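We claim that if the initial value of $\bar{x}$ is such that $\ell < \bar{x} < 2^{\lceil \log_2 q \rceil - 1}$, then the computed value A of the Square-and-Multiply LSB approaches $\ell$.
Indeed, one can see that the output $\tilde{\bar{x}}$ of each faulty CIOS($\bar{x}$, $\bar{x}$) is roughly $f(\bar{x})$ with $f(x) = x^2 / 2^{rk} + uq / (2^r - 1)$, so that the successive values of $\bar{x}$ follow a sequence $(v_n)$ with $v_{n+1} = f(v_n)$. Our assumption on the value of $v_0$ implies that $f(I) \subset I$. Referring to the graph below, it appears then that the sequence will tend to $\ell = \min(\ell_1, \ell_2)$, where $\ell_1$ and $\ell_2$ denote the two roots of $f(\ell) = \ell$.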
However, we want that this limit be reached before the end of the exponentiation process. Let us determine the convergence speed of this sequence:
Since $\ell$ and $v_0$ are integer values, we look for the condition $|f'(\ell)|^{n+1} \cdot |v_0 - \ell| \le 1$. Hence, the limit is reached for n such that:
$$n \ge -\frac{\log_2 |v_0 - \ell|}{\log_2 |f'(\ell)|} - 1.$$
For example, we search for a condition ensuring that CIOS($\bar{x}$, $\bar{x}$) = $\ell$ before the first half (|q|/2 iterations) of the exponentiation process is over. Since $\log_2(|v_0 - \ell|) \approx |q|$, this condition is roughly $|f'(\ell)| \le 1/4$. We see on this example that the success of the attack will depend on the ratio $q / 2^{\lceil \log_2 q \rceil}$ and on the ratio $u / (2^r - 1)$.
Looking at CIOS(A,x), the output A is roughly:
and the associated sequence is a little more complicated:
It is clear that if $v_n = \ell$, $w_n$ will tend to this limit too. We just have to verify that this sequence reaches $\ell$ before the end of the exponentiation process. In fact, both sequences are linked by the following relation: With our assumption, the sequence $(v_n)$ is decreasing, and we have:
Hence the range between the two sequences constantly decreases during the exponentiation process and if $(v_n)$ tends to $\ell$ before the end of the exponentiation process, then $(w_n)$ will reach this value too.
The attack consists of computing two signatures $\tilde{S}$, $\tilde{S}'$ by faulting them with the same fault u. In consequence, with a certain probability depending on the ratios $q / 2^{\lceil \log_2 q \rceil}$ and $u / (2^r - 1)$, these two signatures will be equal modulo q. Thus, we recover q as $\gcd(N, \tilde{S} - \tilde{S}')$. This attack works with the Square-and-Multiply LSB and Montgomery Ladder algorithms, but not with the three other exponentiations.
Remark 3 In Table 2 , the success rates are even better than expected. In fact, if < 0, the value $\bar{x}$ can enter a cycle of a few different values. As a consequence, with some probability, two messages can have the same value $\tilde{S}_q$. Graphically, that can be explained by the representation of the function $f \circ \cdots \circ f$, which is flatter and can intersect the line representing g(x) = x.
Simulation results are presented in Table 2 . For various 512-bit primes q, the attack has been carried out for 1,000 pairs of random messages, with a random constant fault u for each pair. It is successful if the two resulting faulty signatures $\tilde{S}$, $\tilde{S}'$ satisfy $\gcd(N, \tilde{S} - \tilde{S}') = q$.
Zero high-order bits faults
In this section, we consider yet another fault model, in which the fault attacker targets the very last iteration in the evaluation of CIOS(A, 1) during the computation of $S_q$. We assume that the attacker is able to force a certain number h of the highest order bits of $u_{k-1}$ to zero, possibly but not necessarily all of them (i.e. $1 \le h \le r$ and $u_{k-1} < 2^{r-h}$). Then, while a single faulty signature is typically not sufficient to factor the modulus, multiple such signatures will be enough if h is not small.
Theorem 3 Let $\tilde{S}$ be a faulty signature obtained in this fault model. Then, $\tilde{S}$ is a close multiple of q with error size at most $2^{-h} \cdot q + 1$, i.e. there exists an integer T such that
$$|\tilde{S} - qT| \le 2^{-h} \cdot q + 1.$$
Proof The iterations numbered 0, 1, . . . , k − 2 in the evaluation of CIOS(A, 1) are all computed correctly. Let $a_1, a_2, \ldots, a_{k-1}$ be the values of the variable a at the end of these respective iterations. We have, with $a_0 = 0$:
$$a_{j+1} = \frac{a_j + A_j + u_j q}{2^r}, \qquad 0 \le u_j \le 2^r - 1,$$
and it is then easy to see by induction that $0 \le a_{k-1} \le q + 1$. Then, the computation of the last iteration is attacked, with a value $\tilde{u}_{k-1}$ of u satisfying $0 \le \tilde{u}_{k-1} \le 2^{r-h} - 1$. Thus, the value of a after that iteration becomes:
$$a_k = \left\lfloor \frac{a_{k-1} + A_{k-1} + \tilde{u}_{k-1} q}{2^r} \right\rfloor \le \frac{(q + 1) + (2^r - 1) + (2^{r-h} - 1) q}{2^r} = 2^{-h} \cdot q + 1.$$
In particular, $a_k < q$, so that the q-part of the signature $\tilde{S}_q$ is $a_k$, and hence $|\tilde{S}_q| \le 2^{-h} \cdot q + 1$. Since $\tilde{S}_q = \tilde{S} - qT$ for some integer T by the recombination formula, this concludes the proof.
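The bound on the q-part can be checked numerically with a short Python sketch. It assumes the usual word-serial CIOS loop and models the fault by masking the top h bits of the last u value; the modulus, word size and sample inputs are illustrative choices.

```python
def cios_high_bits_fault(x, q, r, k, h):
    """CIOS(x, 1) in which the top h bits of u_{k-1} are forced to zero."""
    b = 1 << r
    q_prime = (-pow(q, -1, b)) % b
    a = 0
    for j in range(k):
        x_j = (x >> (r * j)) & (b - 1)
        u_j = ((a + x_j) * q_prime) % b
        if j == k - 1:
            u_j &= (1 << (r - h)) - 1     # fault on the last iteration only
        a = (a + x_j + u_j * q) >> r
    return a                              # already below q, no subtraction needed

q = 2**128 - 159                          # illustrative modulus (assumption)
r, k = 32, 4
samples = [1, 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0, q - 1, 2**128 - 1]

# Check the bound a_k <= 2^-h * q + 1 for several inputs and fault widths.
bounds_hold = all(
    cios_high_bits_fault(A, q, r, k, h) <= (q >> h) + 1
    for A in samples
    for h in (1, 8, 16, 32)
)
```

Each faulty output is roughly h bits shorter than q, which is the property the approximate common divisor algorithms exploit once several such signatures are collected.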
Note that exactly the same result still holds if, in addition to $u_{k-1}$, previous values of u are attacked in the same fashion as well, so there is no need to synchronize the attack extremely precisely so as to target only the last iteration. Now, recovering q from faulty signatures of the form $\tilde{S}$ is a partial approximate common divisor (PACD) problem, as we know one exact multiple of q, namely N , and several close multiples, namely the faulty signatures. Since the error size $\approx q / 2^h$ is rather large relative to q, the state-of-the-art algorithm to recover q in that case is the one proposed by Cohn and Heninger [13] using multivariate Coppersmith techniques.
The algorithm by Cohn and Heninger is likely to recover the common divisor $q \approx N^{1/2}$ given $\ell$ close multiples $\tilde{S}^{(1)}, \ldots, \tilde{S}^{(\ell)}$ provided that the error size is significantly less than $N^{(1/2)^{1 + 1/\ell}}$. Thus, we should have:
$$\log_2 \frac{q}{2^h} < \left(\frac{1}{2}\right)^{1 + 1/\ell} \log_2 N.$$
Hence, if the faults cancel the top h bits of $u_{k-1}$, we need $\ell$ of them to factor the modulus, where, with $n = \log_2 N$:
$$\ell > -\frac{1}{\log_2(1 - 2h/n)}. \qquad (3)$$
In practice, if a few more faults can be collected, it is probably preferable to simply use the linear case of the Cohn-Heninger attack (the case t = k = 1 in their paper [13] ), since it is much easier to implement (as it requires only linear algebra rather than Gröbner bases) and involves lattice reduction in a lattice of small dimension that is straightforward to construct. More precisely, reducing the lattice L generated by the rows of the following matrix:
where B = 2 kr−h , gives a lattice basis consisting of affine forms the first of which vanish on the vector of "error val-
if is large enough. More precisely, they vanish on this vector modulo q, and also do over the integers provided that their coefficients are much smaller than q. If we assume that L behaves like a random lattice, the length of vectors in the reduced basis should be roughly
. This gives the condition:
Hence, this method should recover the error vector and thus make it possible to factor N provided that:
which is always a worse bound than (3) but usually not by a very large margin. Table 3 gives the theoretical number of faulty signatures required to factor N for various values of h, both in the general attack by Cohn and Heninger and in the simplified linear case. We carried out a simulation of the linear version of the attack on a 1024-bit modulus N with various values of h, and found that it works very well in practice with a number of faulty signatures consistent with the theoretical minimum. The results are collected in Table 4 . The attack is also quite fast: a naïve implementation in Sage runs in a fraction of a second on a standard PC.

Table 3 Theoretical minimum number of zero higher-order h-bit faulty signatures required to factor a balanced 1024-bit RSA modulus N using the general Cohn-Heninger attack or the simplified linear one

Timings are given for our Sage implementation on a single core of a Core 2 CPU at 3 GHz
Fault models
In this section we discuss how realistic the setup of the attacks described above can be. In principle, all the RSA-CRT implementations using Montgomery multiplication may be vulnerable, but we have to note that the fault setup (and how realistic it is) depends heavily on implementation choices, since many variations around the algorithm from Fig. 1 have been proposed in recent literature. After a discussion of the characteristics of the tools needed to get the desired effects, we focus on several implementation proposals [11, 22, 25, 26, 28, 37, 38] , chosen for their relevance, and discuss whether our fault model is realistic in those settings.
Characteristics of the perturbation tool
First, all the perturbations needed to carry out our attacks need to be controlled and local to some gates of the chip. Therefore, the attacker needs to identify the location of the vulnerable gates and registers. The null fault attacks described in Sect. 3 need either a $q'$ value set to 0, or multiple consecutive faults in Step 6 of the main loop of CIOS(A, 1) or during multiple consecutive CIOS. The attacks described in Sect. 4 also need these multiple consecutive faults. Considering that state-of-the-art secure microcontrollers embed desynchronization countermeasures such as clock jitters and idle cycles, if the target of the perturbation is some logic shared with other operations (like in the ALU of a CPU), the fault must be accurately controlled in space and time, and the effects must be repeatable as well. Identifying the right cycles for injecting the perturbation may be a very difficult task, and our attacks then seem impractical. The only exception may be the null fault of Sect. 3, if the fault is injected when the $q'$ register is loaded.
Nevertheless, many secure microcontrollers embed an isolated modular arithmetic acceleration coprocessor. A large proportion of them specifically use the Montgomery multiplication CIOS algorithm (or one of its described variants [24] ). Therefore, if the $q'$ or the $u_j$ value is isolated in a specific small register, a single long-duration perturbation can be sufficient for our attack to succeed. The duration of the perturbation depends on the implementation choices and can vary from one cycle to $\lceil \log_2 q \rceil$ cycles, which does not exceed a hundred microseconds on actual chips. To get this kind of effect, laser diodes are the best-suited tool, since the duration of the spot is completely controlled by the attacker [34] .
Analysis of classical implementations of the Montgomery multiplication
The Montgomery coprocessors proposed in the literature can be divided into three different categories:
- the first category [22, 25, 38] contains variations on the Tenca and Koç Multiple Word Radix-2 Montgomery Multiplication algorithm (MWR2MM) [38], which can be seen as a CIOS algorithm with $r = 1$. These implementations use no multiplier, and are therefore well suited to constrained ASIC implementations.
- the second category [26, 37] covers high-radix architectures, with $1 < r < \log_2 q$.
- the third category [11, 28] proposes a version of CIOS/SOS with only one loop, implying that $r \geq \log_2 q$. The main difficulty of these implementation techniques is to deal with the very large multiplications they require (one $r \times r$ and two half $r \times r$ multiplications per CIOS). For that purpose they use interpolation techniques, like Karatsuba in [11] or the residue number system (RNS) in [28]. These implementations are designed to achieve the shortest latency, and are therefore area consuming.
Architectures based on MWR2MM: r = 1
In this kind of architecture, $q'$ cannot be manipulated, since it is always equal to 1, so no wire or register carries its value. On the other hand, the value of $u_j$ is computed at every iteration of the CIOS loop, and since it is only one bit, a simple shot on the logic driving the register during the final multiplication CIOS(A, 1) is sufficient to get an exploitable result ($u_j = 0$ corresponds to the null fault of Sect. 3, and $u_j = 1$ to the constant fault of Sect. 4).
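The effect of a stuck-at fault on the one-bit quotient can be illustrated with a small simulation. The following is a minimal sketch, assuming the usual bit-serial form of Montgomery multiplication (the algorithm of Fig. 1 with $r = 1$); the function name `mont_mul_radix2`, the `force_u` parameter and the toy modulus are hypothetical and chosen only for illustration.

```python
def mont_mul_radix2(x, y, q, n, force_u=None):
    """Bit-serial Montgomery product x*y*2^(-n) mod q, for odd q and x, y < q < 2^n.

    force_u models the fault (hypothetical parameter): None means no fault,
    0 or 1 means every quotient bit u_j is stuck at that value.
    """
    a = 0
    for j in range(n):
        x_j = (x >> j) & 1
        t = a + x_j * y
        # With r = 1, q' = -q^(-1) mod 2 = 1, so u_j is simply the low bit of t.
        u_j = (t & 1) if force_u is None else force_u
        # The shift is an exact division only when u_j has its correct value.
        a = (t + u_j * q) >> 1
    return a - q if a >= q else a


q, n, A = 1000003, 20, 123456             # toy odd modulus, q < 2^n
print(mont_mul_radix2(A, 1, q, n))             # correct value A * 2^(-n) mod q
print(mont_mul_radix2(A, 1, q, n, force_u=0))  # null fault: prints 0
```

With $u_j$ stuck at 0 and $y = 1$, the accumulator never exceeds one bit before each shift, so the output of CIOS(A, 1) collapses to 0 whatever the value of A.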
The first proposal [38] is a fully systolic array of processing elements (PE) executing line 6 of the CIOS algorithm in one cycle, and line 7 in $k$ cycles from LSB to MSB consecutively. Figure 4 gives an overview of the architecture. Each PE consists of a $w$-word carry save adder, able to compute a $w$-word addition and to keep the carry for the next cycle. In the figure, $T(j)$ stands for the $j$-th least significant $w$-word of $T$.
At each clock cycle, the PE presents the computed result $a_i(j)$ to the next one, and the value $u_i$ is kept in the PE for the computation of the next word $a_i(j + 1)$. The value of $u_i$ is computed before the word $a_i(0)$ is presented, and is then kept in a register in each PE during the whole computation of $a_i$. After a complete multiplication, the result $a_n$ is transformed from a carry save representation to binary by the CS to binary converter. This architecture has the great advantage of being completely scalable. Whatever the number of PEs and the size of $M$, this architecture can compute the expected result as long as the RAMs are correctly dimensioned.
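For readers unfamiliar with the carry save representation used by the PEs, the following minimal sketch shows the general principle: a value is held as a pair (sum word, carry word), and a new operand is absorbed without propagating carries. It is a generic illustration, not the adder design of [38]; the function name `carry_save_add` is hypothetical.

```python
def carry_save_add(s, c, z):
    """Add z to the value s + c, returning a new pair (s', c') with s' + c' = s + c + z.

    Each bit position acts as an independent full adder, so no carry
    ripples across the word; carries are simply stored one position higher.
    """
    new_s = s ^ c ^ z                              # bitwise sum without carries
    new_c = ((s & c) | (s & z) | (c & z)) << 1     # majority bits, shifted left
    return new_s, new_c


def cs_to_binary(s, c):
    """Final conversion: a single ordinary (carry-propagating) addition."""
    return s + c


s, c = carry_save_add(0b1011, 0b0110, 0b1101)
print(cs_to_binary(s, c) == 0b1011 + 0b0110 + 0b1101)  # True
```

The conversion back to binary is the only step that needs a full carry-propagating addition, which is why it is performed once, at the end of the multiplication.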
To achieve our attack, the register keeping $u_i$ can be targeted, but every PE must be targeted simultaneously to get the correct result. It is therefore more interesting to target the control logic responsible for sequencing the register loading, since all the PEs are connected to it.
In [25] , the authors manage to get rid of the CS to binary converter by redesigning the CS adder of every PE. The vulnerability to our attack is therefore the same, since the redesign does not affect the targeted area.
Huang et al. [22] proposed a new version of the data dependency in the MWR2MM algorithm and rearranged the architecture of [38] in a semi-systolic form. Figure 5 gives an overview of the architecture. In this architecture, the intermediate value $a_i$ is manipulated in carry save format ($a_i = c_i + s_i$). A specific PE, PE 0, is specialized in generating the $u_i$ values at each cycle, while the $j$-th PE is in charge of computing the sequence $a_i(j)$. The scalability is lost in exchange for a better time/area trade-off.
This architecture is very vulnerable to our attacks, since a simple $n$-cycle long shot on the right logic in PE 0 (see Fig. 5) is sufficient to get the expected result.
According to the authors, the design works at 100 MHz on their target platform (a Xilinx Virtex II FPGA). The duration of the perturbation is therefore at least 10 μs for a 1024-bit multiplication (2048-bit RSA) if the Garner recombination is used (using the attack from Sect. 3.1 or Sect. 4.1). If classical CRT reconstruction is used, according to Table 1, 200 μs will be enough for a null fault.
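The order of magnitude of these durations follows from simple arithmetic. The sketch below assumes one loop iteration per clock cycle, so that an $n$-cycle shot covers an $n$-bit multiplication; the helper names are hypothetical.

```python
def shot_duration_us(cycles, clock_hz):
    """Duration, in microseconds, of a perturbation lasting `cycles` clock cycles."""
    return cycles * 1_000_000 / clock_hz


def cycles_covered(duration_us, clock_hz):
    """Number of clock cycles spanned by a perturbation of `duration_us` microseconds."""
    return duration_us * clock_hz // 1_000_000


CLOCK_HZ = 100_000_000                  # 100 MHz, as reported for the design of [22]
print(shot_duration_us(1024, CLOCK_HZ)) # 10.24 (microseconds, one 1024-bit multiplication)
print(cycles_covered(200, CLOCK_HZ))    # 20000 (cycles spanned by a 200 microsecond shot)
```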
In conclusion, this kind of implementation is very vulnerable, since the setup of the attack is quite simple.
High radix architecture: $1 < r < \log_2 q$
In this type of implementation, the value $q' = -q^{-1} \bmod 2^r$ is computed in an $r$-bit register, unless the quotient pipelining approach [30] is used. In all these implementations, $q'$ is held in an $r$-bit register, which can be the target of the attack.
Fig. 5 Overview of the architecture from [22] and potential fault target
Fig. 6 Overview of the architecture from [26] and potential fault target
For example, the implementation of [26] is described in Fig. 6. It relies on the coordinated usage of multiplier blocks of the Xilinx Virtex II together with specifically designed carry save adders. The CIOS algorithm from Fig. 1 is completely respected in this implementation. The values $u_j$ can be the target of any fault described in this paper, but it may be easier to set the $q'$ register to 0 once and for all, which gives a 100% success rate for the attack if properly carried out. Another implementation, with a four-deep pipeline, is mentioned in [26], but it suffers from the same vulnerability.
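The consequence of zeroing the $q'$ register can be checked on a word-level model. The following is a minimal sketch, assuming the standard CIOS structure (quotient digit $u_j$ derived from $q'$, then an exact division by $2^r$); the function name `cios`, the `q_prime` parameter and the toy parameters are hypothetical, and the model ignores the multi-word scheduling of a real coprocessor.

```python
def cios(x, y, q, r, k, q_prime=None):
    """Montgomery product x*y*2^(-r*k) mod q with r-bit digits, for odd q < 2^(r*k).

    q_prime models the content of the r-bit register holding
    q' = -q^(-1) mod 2^r; pass 0 to simulate a register forced to zero.
    """
    b = 1 << r
    if q_prime is None:
        q_prime = (-pow(q, -1, b)) % b       # fault-free register content
    a = 0
    for j in range(k):
        x_j = (x >> (r * j)) & (b - 1)       # j-th r-bit digit of x
        t = a + x_j * y
        u_j = (t * q_prime) % b              # depends only on the low digit of t
        a = (t + u_j * q) >> r               # exact division when q' is correct
    return a - q if a >= q else a


q, r, k, A = 1000003, 8, 3, 654321           # toy odd modulus, q < 2^(r*k)
print(cios(A, 1, q, r, k))                   # correct value A * 2^(-r*k) mod q
print(cios(A, 1, q, r, k, q_prime=0))        # q' forced to 0: prints 0
```

With $q' = 0$, every $u_j$ vanishes regardless of the data, so a single long-lived fault on one small register is enough, which is what makes this target attractive compared with faulting each $u_j$ individually.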
On the other hand, the attack may be more difficult to achieve on the architecture described in [37, Figure 4]. First, it uses quotient determination [30], and therefore does not need to store $q'$ anywhere. Second, the multiplier in charge of computing $u_j$ is shared across the whole Montgomery computation. To carry out the attack of Sect. 4 on this architecture, the attacker has to determine the specific cycles where $u_j$ is computed in order to generate a perturbation. For that particular design, the attacks seem out of reach.
Full radix architecture: $r \geq \log_2 q$
In this kind of implementation, a single round is enough to compute the Montgomery algorithm. This implementation choice concentrates all the complexity in the design of a $\log_2 q \times \log_2 q$ multiplier, used once in full during the multiplication process and twice partially during the Montgomery reduction. To reduce the complexity of the big multiplication, interpolation techniques are used. In [11], a classical nested Karatsuba multiplication is used, whereas [28] relies on RNS. Both can be seen as derived from Lagrange interpolation, with different bases.
In these architectures, a specific laser shot must zero all the bits of $u_0$ or $q'$ at the same time to produce a null fault. To stand a chance of success, a better solution is to use non-invasive attacks (in the sense of [35]), such as power or clock glitches. Indeed, $u_0$ or $q'$ is fully manipulated in the same clock cycle (or in very few cycles), so it may be more practical to make the sequencer miss an instruction instead of aiming directly at the registers.
The zero high-order bits fault attack from Sect. 5 is also an option. In the architecture of [11], the most significant bits of $u_0$ can be set to 0 with a focused shot. On the other hand, the architecture of [28] is less vulnerable to this attack, since the RNS representation makes it impractical to modify the most significant bits of $u_0$ (see Fig. 7).
Fig. 7 Overview of the architecture from [28]. The values of $q'$ and $u_i$ are represented by $\log_2 q$ bits or more, but operations on them require a single clock cycle
Conclusion
In this paper, we have shown that specific realistic faults can defeat unprotected RSA-CRT signatures with any padding scheme, probabilistic or not. While it is not difficult to devise suitable countermeasures (for example, checking that $S_q$ is not too small before outputting a signature is enough to thwart all of our attacks), this underscores the fact that relying on probabilistic signature schemes does not, in itself, protect against faults.
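The size check mentioned above can be sketched as follows, together with the reason a null fault is fatal without it. This is only an illustration under stated assumptions: the recombination formula in `garner` is one common form of Garner's formula and is assumed here, the threshold `min_bits` is a hypothetical parameter (the text does not specify what "too small" means), and the primes are toy values.

```python
from math import gcd


def garner(s_p, s_q, p, q):
    """Assumed Garner recombination: S = S_q + q * ((S_p - S_q) * q^(-1) mod p)."""
    h = ((s_p - s_q) * pow(q, -1, p)) % p
    return s_q + q * h


def release_signature(s_p, s_q, p, q, min_bits):
    """Recombine only if S_q is not suspiciously small (hypothetical threshold)."""
    if s_q.bit_length() < min_bits:
        raise ValueError("S_q too small, possible fault")
    return garner(s_p, s_q, p, q)


p, q = 1000003, 998244353        # toy primes, far too small for real use
N = p * q
s_p = 424242                     # stands in for a correct half-signature mod p

# Unprotected device: a null fault gives S_q = 0, so S is a multiple of q.
faulty = garner(s_p, 0, p, q)
print(gcd(faulty, N) == q)       # True: the modulus is factored

# Protected device: the same fault is caught before anything is output.
try:
    release_signature(s_p, 0, p, q, min_bits=q.bit_length() - 16)
except ValueError as exc:
    print("rejected:", exc)
```

A correct $S_q$ behaves like a random value modulo $q$, so a threshold a few bits below the size of $q$ rejects genuine signatures only with small probability; the exact margin is a design choice not fixed by the text.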
