Differential Fault Analysis (DFA) is a real threat for elliptic curve cryptosystems. This paper describes an elliptic curve cryptoprocessor unit that is resistant to fault injection. This resistance is provided by the use of parity-preserving logic gates in the operating structure of the ECC unit, which is based on borrow-save adders. The proposed countermeasure provides high-coverage fault detection at an acceptable area overhead (+38 %).
Introduction
Since it offers the highest strength per bit of any public-key cryptography system known today, Elliptic Curve Cryptography (ECC) is a good alternative to RSA cryptosystems for ensuring secret exchange [22]. ECC usually brings many benefits: faster computations, lower power consumption, limited storage, and smaller keys and certificates. As a consequence, ECC is a good candidate for smart cards and embedded systems.
ECC is considered mathematically secure since it is based on the Elliptic Curve Discrete Logarithm Problem (ECDLP). Nevertheless, if an elliptic curve cryptosystem is implemented without caution, the secret key it processes can be retrieved using physical attacks. Among physical attacks, Simple Power Analysis (SPA [15]) can be an efficient method to extract the key. In such attacks, information about the secret key is deduced directly from the power trace of a single secret-key computation. Implementations of elliptic curve point multiplication algorithms are particularly vulnerable because the usual formulas for the two main operations of point multiplication, addition and doubling, are quite different; consequently, their power traces can be distinguished. Efficient countermeasures protecting elliptic curve cryptosystems against SPA have already been proposed in [7], [4] and [5].
A second type of attack, called "fault attacks", consists in forcing the device to perform erroneous computations by changing some bits of a parameter or an intermediate result [2], [6], [3]. Faults can be induced by various means, such as temperature variations, electromagnetic perturbations, X-ray and ion-beam injection, glitches on the supply voltage or the external clock, or light illumination [1]. To thwart this powerful kind of attack, standard hardware countermeasures such as detectors (temperature, supply voltage, frequency, light) can be implemented. If an embedded error detection scheme detects an error, an alarm can be raised and/or a random result can be sent to the output of the cryptosystem. Another obvious countermeasure is to check the output using an additional computation step.
Unfortunately for curve-based cryptography, and contrary to RSA, checking the result requires either performing the whole computation twice or duplicating the hardware (spatial redundancy), both of which are very costly. As a consequence, partial checking methods working in parallel with the main computation are preferred. This paper presents the incorporation of a fault detection method based on parity-preserving logic gates in some parts of an elliptic curve unit. In [18], the feasibility of this approach was demonstrated theoretically, but no synthesis results for parity-preserving circuits were reported. The main contribution of this paper is to demonstrate, through implementation results, that this method is acceptable in practice. A specific part of the unit, the borrow-save adders, becomes a high-level fault-tolerant structure.
This paper is organized as follows. Section 2 recalls some notations and mathematical background. Section 3 presents known fault attacks on ECC and previous countermeasures. Then, our fault-tolerant unit using parity-preserving logic gates is detailed in Section 4. The performance of this unit, the impact of the implemented countermeasure and the fault coverage are presented in Section 5.
2. for i = 0 to l − 1 do parallel loop
The BSA operator depicted in Figure 1 uses 2 rows of PPM cells [10]. A PPM is very close to a full-adder (just 1 extra inverter) and computes 2c … Figure 1 clearly shows that the BSA computation time does not depend on the operand size $l$ and is $T(\mathrm{BSA}(l)) = 2 \cdot T(\mathrm{PPM})$.
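To illustrate why the addition time is independent of the operand size, the following Python sketch models a borrow-save adder as two rows of PPM cells. The cell relation x + y − z = 2c − s, the wiring between the two rows and all function names are assumptions made for this sketch; they are not taken from Figure 1.

```python
def ppm(x, y, z):
    """Plus-plus-minus cell (assumed relation): x + y - z = 2*c - s, with c, s bits."""
    t = x + y - z              # t lies in {-1, 0, 1, 2}
    c = (t + 1) // 2
    s = 2 * c - t
    return c, s

def to_bits(v, l):
    """Little-endian list of the l low-order bits of v."""
    return [(v >> i) & 1 for i in range(l)]

def bs_value(pos, neg):
    """Value of a borrow-save number: sum of (pos[i] - neg[i]) * 2**i."""
    return sum((p - n) << i for i, (p, n) in enumerate(zip(pos, neg)))

def bsa(a_pos, a_neg, b_pos, b_neg):
    """Add two l-digit borrow-save numbers; returns an (l+1)-digit result."""
    l = len(a_pos)
    # Row 1: a+ + b+ - a- = 2*c[i+1] - s[i]; every cell works independently.
    c = [0] * (l + 1)
    s = [0] * (l + 1)
    for i in range(l):
        c[i + 1], s[i] = ppm(a_pos[i], b_pos[i], a_neg[i])
    # Row 2: s + b- - c = 2*d[i+1] - e[i]; result digit i is e[i] - d[i].
    b_neg_ext = list(b_neg) + [0]
    d = [0] * (l + 2)
    e = [0] * (l + 1)
    for i in range(l + 1):
        d[i + 1], e[i] = ppm(s[i], b_neg_ext[i], c[i])
    return e, d[:l + 1]        # d[l+1] is always 0 and can be dropped
```

Each output digit depends on a bounded number of neighbouring input digits, so no carry ripples through the word: in hardware the two loops correspond to two layers of cells operating in parallel.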
Figure 1. 4-Digit Borrow-Save Adder
| Known attacks on the ECDLP | Security constraints on parameters |
| --- | --- |
| Exhaustive search | $n$ sufficiently large (e.g., $n \geq 2^{80}$) |
| Pohlig-Hellman and Pollard's rho attack | For maximum resistance: $\#E(\mathbb{F}_p) = hn$ with $h \in \{1, 2, 3, 4\}$ and $n$ prime ($n > 2^{160}$) |
| Attack on prime-field anomalous curves | $\#E(\mathbb{F}_p) \neq p$ |
| Weil and Tate pairing attack | |
Known Fault Analysis on ECC
The security of elliptic curve cryptography schemes relies on the ECDLP: given points P and Q, no subexponential algorithm is known (in the general case) to find k such that Q = [k]P. For security reasons, the ECDLP should be intractable. As a consequence, the elliptic curve parameters (particularly p, a, b, P, and $n = \mathrm{ord}_E(P)$) should be carefully chosen in order to resist all the known attacks on the ECDLP. If this is the case, the elliptic curve is considered to be a cryptographically "strong" curve. Table 1 lists some attacks on the ECDLP and their consequences for the choice of the parameters of a strong curve.
One way for an attacker to apply DFA to ECC is to disturb the representation of P (P becomes $\hat{P}$), such that the cryptosystem applies its point multiplication algorithm to a value which is not a point on the given (or selected) curve but on another curve, expected to be cryptographically less secure (it is considered to be a "weak" curve). The result of this computation is the point $\hat{Q}$ on this new weak curve, which can be exploited to compute the secret key k. This idea was first described by Biehl et al. in [2].
Because parameter b is not used in point addition or doubling, an elliptic curve can be completely defined as:
If a "point" $\hat{P} = (\hat{x}, \hat{y}) \in \mathbb{F}_p \times \mathbb{F}_p$ but $\hat{P} \notin E$, then the computation of $\hat{Q} = [k]\hat{P}$ will take place on the curve:
1. $C$ must be large enough so that the DLP in $\mathbb{F}^*_{p^C}$ is considered intractable (if $n > 2^{160}$, $C = 20$ suffices).
Table 2. Attacks proposed by Ciet et al. in [6]
Using this fact, an attacker can carefully choose a "point" $\hat{P}_i$ such that $\mathrm{ord}(\hat{P}_i) = n_i$ is small. Then, the cryptosystem will compute $\hat{Q}_i = [k]\hat{P}_i$. Because $\hat{P}_i$ and $\hat{Q}_i$ are known, and $n_i$ is small, the attacker can compute the discrete logarithm to retrieve $k \pmod{n_i}$. The attacker can iterate this procedure with other input points and, using the Chinese Remainder Theorem, finally retrieve the correct value of k. A simple countermeasure is to check whether the points P and Q are on the strong curve E. As a consequence, the attack described above may not be easy in practice. Biehl et al. also proposed two other attacks based on the assumption that only a few bit errors (or exactly one) are inserted into the base point P. These attacks rely on a rather idealized fault model.
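The mechanism behind this attack, and the point-on-curve check that counters it, can be sketched in Python on a toy curve. The curve y² = x³ + x + 1 over F_23, the sample points and all function names are illustrative assumptions; the expression used for the implied parameter is the standard one obtained by solving the curve equation for b, not a formula quoted from this paper.

```python
def ec_add(P, Q, a, p):
    """Affine addition on y^2 = x^3 + a*x + b over F_p (None = point at infinity).
    The parameter b never appears in these formulas."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def ec_mul(k, P, a, p):
    """Left-to-right double-and-add scalar multiplication."""
    Q = None
    for bit in bin(k)[2:]:
        Q = ec_add(Q, Q, a, p)
        if bit == "1":
            Q = ec_add(Q, P, a, p)
    return Q

def on_curve(P, a, b, p):
    """Integrity check: does P satisfy y^2 = x^3 + a*x + b (mod p)?"""
    if P is None:
        return True
    x, y = P
    return (y * y - x * x * x - a * x - b) % p == 0

def implied_b(P, a, p):
    """Parameter b of the curve on which a (possibly faulty) point actually lies."""
    x, y = P
    return (y * y - x * x * x - a * x) % p

# Toy example: P is on the curve, P_hat is P with a corrupted y coordinate.
p, a, b = 23, 1, 1
P, P_hat = (3, 10), (3, 11)
b_hat = implied_b(P_hat, a, p)
Q_hat = ec_mul(13, P_hat, a, p)
print(on_curve(P_hat, a, b, p), on_curve(Q_hat, a, b_hat, p))
```

Because the group law ignores b, every multiple of the faulty point stays on the curve defined by the implied parameter, which is why checking the input and output points against the original b detects this class of faults.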
Ciet et al. in [6] refine the ideas of Biehl et al. by relaxing their fault model. They show how truly random errors (hence practical computation faults) injected in the coordinates of P, in the field representation, or in the curve parameters can allow the key k to be retrieved. In a cryptographic device, the system parameters are stored in non-volatile memory (e.g. EEPROM) and are transferred into working memory (e.g. RAM) for the computations. In a first scenario, Ciet et al. assume a permanent fault in an unknown position in any system parameter defining the elliptic curve. In a second scenario, they analyze the consequences of faults occurring during the transfer of the system parameters. Table 2 lists the attacks proposed in [6].
Blömer et al. showed in [3] how sign changes of points can be used to recover the value of k. While the previous attacks force the device to output points that are not on the original elliptic curve, the Sign Change Fault Attack (SCFA) uses a faulty point on the original curve. This paper only presents the SCFA on the standard double-and-add algorithm (Algorithm 1). Note that this attack can also be mounted against the Non-Adjacent-Form-based binary method and against the Montgomery ladder [4] when the y coordinate is used. The goal of the adversary is to change the sign of the y coordinate of the variable Q in line 5 of Algorithm 1 during some unknown loop iteration $0 \leq i \leq \ell - 1$, such that $Q \oplus P$ becomes $-Q \oplus P$. The adversary must be able to mount $c = (\ell/m)\log(2\ell)$ attacks on the same input (P, k, E(a, b)) to recover k with probability at least 1/2. Moreover, a correct result Q must be known. The secret key bits of k are retrieved in pieces of r bits, such that $1 \leq r \leq m$. The faulty value $\hat{Q}$ that results from an SCFA as defined above can be written:
is the only part which is unknown in (3): it is a multiple of P. If only a small number of bits of k are unknown, they can be guessed and verified using equation (3). The complete attack is divided into three phases. In the first phase, c outputs $\hat{Q}$ of Algorithm 1 are collected by inducing a sign-change fault in Q for random values of i. In the second phase, parts of k are guessed (and stored in a variable x) and a test candidate $T_x$ is computed. In the third phase, $T_x$ is tested against all the faulty results $\hat{Q}$ obtained in the first phase.
Blömer et al. proved that their algorithm recovers k with probability at least 1/2 and with complexity $O(c\,M\,3^m)$, where M is the maximal cost of a full scalar multiplication or of a scalar multiplication including the induction of a sign-change fault.
The fault analyses presented in [2] and [6] recover the key through the mathematical analysis of faulty results only, contrary to the attack described in [3], where both correct and faulty executions are needed.
Another fault attack, called the Safe-Error Attack (SEA), only checks whether the computation is performed correctly or not. This implies that the adversary does not need to know the faulty decryption value, but only whether the attack succeeded. There are two types of SEA: the CSEA and the MSEA. The CSEA consists in inducing any temporary random computational fault inside the ECC unit [24]. It can be applied to attack the value of key bits in a double-and-add-always scheme with a dummy addition: after a fault is injected during the ECC unit computation, if the final result is correct, the addition was a dummy one [12]. To perform an MSEA, a temporary memory fault must be induced inside a register or memory location [23]. The MSEA implies stronger requirements than the CSEA in terms of controllability of the fault location and timing. Thus, this attack appears rather hypothetical.
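The computational safe-error principle can be illustrated with a short Python sketch. To keep it self-contained, the additive group of integers modulo N stands in for the group of curve points (so "doubling" is 2Q and "addition" is Q + P); this substitution, the modulus, the fault model (adding 1 to the adder output) and the function names are all assumptions of the sketch, which only illustrates how a fault in a dummy addition leaves the result unchanged.

```python
N = 1_000_003  # odd modulus of the toy additive group (stand-in for a curve group)

def mul_always(k, P, nbits, fault_at=None):
    """Double-and-add-always; fault_at=i corrupts the addition result at iteration i."""
    Q = 0
    for i in reversed(range(nbits)):
        Q = (2 * Q) % N                 # doubling
        T = (Q + P) % N                 # addition, computed at every iteration
        if fault_at == i:
            T = (T + 1) % N             # injected computational fault
        if (k >> i) & 1:
            Q = T                       # real addition: the fault propagates
        # otherwise T is discarded (dummy addition) and the fault is harmless
    return Q

def recover_key(k_secret, P, nbits):
    """The attacker only observes whether each faulty run is still correct."""
    correct = mul_always(k_secret, P, nbits)
    k = 0
    for i in range(nbits):
        if mul_always(k_secret, P, nbits, fault_at=i) != correct:
            k |= 1 << i                 # result changed, so the addition was real
    return k
```

In this toy model one fault per key bit is enough, because a corrupted real addition always shifts the final result by a nonzero multiple of a power of two modulo the odd N.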
Previous Countermeasures
In order to thwart the fault attacks described in subsection 3.1, some countermeasures have already been proposed.
Check if points are on the initial curve. The attacks proposed in [2] and the attack on the base point P in [6] can be counteracted when the device checks whether P is on the original curve. This can be done using the curve parameter b, which then serves as an integrity check:
Note that the attack presented in [3] uses a faulty point on the original curve. As a consequence, this countermeasure is ineffective against that attack.
CRC checksums. In order to protect the curve parameters (particularly a and p) from the attacks in [6], a CRC can be computed over them. Before each use of these parameters, a new CRC is computed and compared to the stored one.
Randomization. The scalar k can be randomized by splitting it using a random value r. Different splitting methods can be implemented, from the simplest
Use of a combined curve. In [3], a countermeasure is proposed which generalizes Shamir's idea [19] for RSA to ECC. The modulus is extended in a first computation (p becomes $N = p_0\,p$) and then reduced (modulo $p_0$) in a second one. Instead of computing [k]P directly, one can compute
, where $P_N = P \pmod N$ and $P_{p_0} = P \pmod{p_0}$.
At the end, it must be checked whether $Q_{p_0} = Q_N \pmod{p_0}$. If this is not the case, some errors have been induced.
Montgomery Ladder without using y coordinates. Since the y coordinate is not used in this variant of the Montgomery ladder, it cannot be successfully attacked with an SCFA [3].
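Among the countermeasures listed above, the CRC protection of the curve parameters is the simplest to sketch. The following Python fragment is only an illustration: the use of CRC-32 from `zlib`, the 32-byte big-endian encoding, the toy parameter values and the function names are assumptions, and a real device would compute the checksum in hardware or firmware.

```python
import zlib

def params_to_bytes(params, size=32):
    """Serialize the parameters (e.g. a and p) as fixed-size big-endian integers."""
    return b"".join(v.to_bytes(size, "big") for v in params)

def crc_of(params):
    """CRC-32 over the serialized parameters."""
    return zlib.crc32(params_to_bytes(params))

def checked_params(params, stored_crc):
    """Recompute the CRC before use and compare it to the stored reference."""
    if crc_of(params) != stored_crc:
        raise ValueError("curve parameters corrupted")
    return params

# Reference CRC computed once, when the parameters are written to memory.
reference = crc_of((1, 23))            # toy values for (a, p)
a, p = checked_params((1, 23), reference)
```

The comparison in `checked_params` is a plain decisional test, used here only for readability.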
Proposed ECC Unit
We present in this section the chosen algorithms for computing modular operations and the global architecture of the proposed ECC unit.
Algorithms for Modular Operations
This ECC unit is built to compute the unified addition formulae described by Déchène et al. in [8]. As a consequence, this architecture is also SPA resistant.
Required Modular Operations. Three different types of modular operations are needed to compute the formulae of Déchène et al.: additions, multiplications and inversions (to express the final point in affine coordinates at the end of the point multiplication). All these arithmetic operations are done in borrow-save representation (see subsection 2.2). Modular additions are computed with an algorithm given in [21], and Modular Montgomery Multiplication (MMM) is used to compute modular multiplications [17]. Fermat's theorem is used to implement modular inversions.
Chosen Modulo through Point Multiplication. The initial version of the Montgomery multiplication described in [17] takes as inputs $X < p$ and $Y < p$, and computes $X \cdot Y \cdot R^{-1} \pmod p$, with $R = 2^l$ (see Algorithm 3). At step 5 of Algorithm 3, $0 < S < 2p$, so a final subtraction is needed to ensure $S < p$. Unfortunately, this final subtraction is a time- and area-consuming process. Moreover, an attack can be performed on it, especially when unified addition formulae are chosen to implement scalar multiplication [20]. As a consequence, a new MMM algorithm without final subtraction must be written (see Algorithm 4), and the result S is given modulo 2p: this is the chosen modulo for all the calculations.
Algorithm 3 (final steps): 7. S ← S − p 8. end if 9. return S
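The idea of dropping the final subtraction can be sketched in Python with a bit-serial Montgomery multiplication. This is not the paper's Algorithm 4: the choice R = 2^(l+2), the one-bit-per-iteration schedule and the function name are assumptions of the sketch, which uses ordinary integers instead of borrow-save operands.

```python
def mmm_no_final_sub(X, Y, p, l):
    """Bit-serial Montgomery product with R = 2**(l+2), for an odd p < 2**l.
    Inputs X, Y are in [0, 2p); the output is in [0, 2p) as well."""
    n = l + 2                       # number of iterations, so that R = 2**n > 4p
    S = 0
    for i in range(n):
        S += ((X >> i) & 1) * Y     # add x_i * Y
        if S & 1:                   # make S even by adding the odd modulus p
            S += p
        S >>= 1                     # exact division by 2
    return S                        # congruent to X*Y*R^-1 modulo p, and below 2p

# Usage with a toy 20-bit prime modulus
p, l = 1_000_003, 20
S = mmm_no_final_sub(1_500_000, 1_999_999, p, l)
```

After the n iterations S equals (XY + Qp)/R for some Q < R, so with X, Y < 2p and R > 4p both terms are below p and the result stays below 2p. Outputs can therefore be fed back as inputs without any conditional subtraction.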
Algorithm 4 MMM without final subtraction
Algorithm for Modular Addition. The algorithm initially described in [21] computes $X + Y \pmod p$. It must be modified to compute $X + Y \pmod{2p}$ instead.
Algorithm 6 Modular Addition
Algorithm for Modular Inversion. Using Fermat's theorem, the inverse of a value X modulo p can be computed by a modular exponentiation of X by p − 2:
$X^{-1} \equiv X^{p-2} \pmod p$, if $\gcd(X, p) = 1$. This algorithm can be used here because p is prime. A standard algorithm for computing modular exponentiation is the square-and-multiply algorithm.
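A minimal Python sketch of this inversion by left-to-right square-and-multiply is given below. The function name is hypothetical, and plain modular multiplication is used where the unit would use Montgomery multiplications.

```python
def mod_inverse_fermat(X, p):
    """Return X**(p-2) mod p, the inverse of X modulo a prime p not dividing X."""
    result = 1
    for bit in bin(p - 2)[2:]:          # exponent bits, most significant first
        result = (result * result) % p  # square at every bit
        if bit == "1":
            result = (result * X) % p   # multiply when the bit is set
    return result

# Usage with a toy prime modulus
inv = mod_inverse_fermat(123_456, 1_000_003)
```

The inversion therefore costs roughly one squaring per exponent bit plus one multiplication per set bit, all of which reuse the multiplier already present in the unit.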
Architecture
The architecture of our ECC unit is depicted in Figure 2.
Modular Addition. When a modular addition is performed ($X + Y \pmod{2p}$, with $X \neq Y$ or $X = Y$), MUX1 and MUX2 respectively select the couples $(X^+, X^-)$ and $(Y^+, Y^-)$, and a first addition is computed by BSA1. MUX4 selects $(T^+, T^-)$ and MUX3 chooses the value which must be added to $(T^+, T^-)$ according to the value of $4t_{l+2} + 2t_{l+1} + t_l$. BSA2 adds these two variables, and DEMUX sends the result to its output (DEMUX$^+$, DEMUX$^-$). SUBTRACTER modifies the operands before BSA1 in order to perform a subtraction:
Modular Multiplication. SHIFT1 and SHIFT2 are only used to implement the division by 2 in MMM. At the beginning of an MMM, the value of (Y + P) (denoted YP in Figure 2) is computed: MUX1 and MUX2 respectively select the couples $(Y^+, Y^-)$ and $(P^+, P^-)$, BSA1 performs the addition, MUX3 and MUX4 respectively select the couples $(T^+, T^-)$ and (0, 0), BSA2 adds $(T^+, T^-)$ and (0, 0), and DEMUX sends the result to its output named $(YN_s^+, YN_s^-)$. If X is denoted $X = (x_{l+1} x_l x_{l-1} \cdots x_2 x_1 x_0)_2$, SELECTOR1 and SELECTOR2 respectively compute $m_{i1}$ and $m_{i2}$ by processing the even and odd indexes of X. These values control MUX2 and MUX4 in order to choose the value which must be added to $(S^+, S^-)$. Thus, SELECTOR1, BSA1 and SHIFT1 compute the result $(T^+, T^-)$, which is then processed by SELECTOR2, BSA2 and SHIFT2. In this mode, MUX1, MUX3 and DEMUX respectively select the couples $(S^+, S^-)$, $(T^+, T^-)$ and $(S_s^+, S_s^-)$. At the last step of the MMM, DEMUX sends the result to its output (DEMUX$^+$, DEMUX$^-$).
Modular Inversion. Modular inversion can be implemented as a series of modular multiplications.
Parity-Preserving Logic Gates

General Properties
Parhami introduced in [18] a class of logic gates for which the parity of the outputs matches that of the inputs. For example, a parity-preserving logic gate (PPLG) which has 2 input bits (a, b) and 2 output bits (p, q) must satisfy the property:
$a \oplus b = p \oplus q$
These PPLGs are also reversible in the sense that they allow the circuit inputs to be reproduced from the observed outputs. In this paper, only the parity-preserving property of these gates is used. In [18], the author proved that the only 2-input (a, b), 2-output (p, q) reversible gate which is also a PPLG is the one that complements both inputs unconditionally. This result is not sufficiently interesting for building complex circuits. Consequently, the author searched for and found the only two 3-input (a, b, c), 3-output (p, q, r) reversible logic gates which are also PPLGs (with the condition p = a): the Fredkin gate (denoted FRG, [9]) and the Feynman double-gate (F2G).
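The two properties discussed here can be checked exhaustively on the two gates. The Python sketch below uses the usual definitions of the Feynman double-gate, (a, b, c) → (a, a ⊕ b, a ⊕ c), and of the Fredkin gate, which swaps b and c when a = 1; the function names are hypothetical.

```python
from itertools import product

def f2g(a, b, c):
    """Feynman double-gate: (a, b, c) -> (a, a xor b, a xor c)."""
    return a, a ^ b, a ^ c

def frg(a, b, c):
    """Fredkin gate: controlled swap of b and c, with a as the control."""
    return (a, c, b) if a else (a, b, c)

def parity(bits):
    """XOR of all bits in the sequence."""
    result = 0
    for bit in bits:
        result ^= bit
    return result

def is_parity_preserving(gate):
    """True if input parity equals output parity for all 8 input triples."""
    return all(parity(x) == parity(gate(*x)) for x in product((0, 1), repeat=3))

def is_reversible(gate):
    """True if the gate maps the 8 input triples onto 8 distinct output triples."""
    return len({gate(*x) for x in product((0, 1), repeat=3)}) == 8

for gate in (f2g, frg):
    print(gate.__name__, is_parity_preserving(gate), is_reversible(gate))
```

Both gates also satisfy the condition p = a, since the first output is always a copy of the first input.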
If designers want to optimize the performance of the circuit using fanouts or feedback, care must be taken that the parity of the global circuit is maintained. Consequently, making a reversible circuit robust or fault tolerant is much more difficult than for a conventional logic circuit.
Our approach
This subsection details a procedure for implementing parity-preserving circuits in the general case. As an illustration, this paper protects only the borrow-save adders (BSA1 and BSA2), which are the operating components of the initial ECC unit (see Figure 2). Future work will consist in protecting other components of the ECC unit, such as the control logic.
Choose the protected part of the circuit. Designers should investigate the tradeoff between the degree of fault tolerance of the resulting circuit (i.e., how many BSA output bits are protected, and at what protection level) and its performance (area and speed). Moreover, besides checking the validity of outputs, designers can add parity checks on intermediate variables; these additional checks should incur only a reasonable penalty on the circuit performance. This paper chooses to check the validity of two BSA output bits at a time (called s 
Finally, the results s 
Transform the logic equations in Galois field. F2G and FRG can compute the "not" logic function, and FRG can implement the "and" logic function. On the other hand, none of these gates can implement the "or" logic function. Thus, logic expressions must be expressed in the Galois field using the following property:
$a \lor b = a \oplus b \oplus ab$
In our application, the variables c c
This proof can also be applied to the variable $c_i^+$.
Implement these equations thanks to PPLGs. The protected circuit consists only of FRGs (white boxes in Figures 5 and 6) and F2Gs (grey boxes in Figures 5 and 6). Note that fanout is not advised. If a signal is needed by more than one cell, it is preferable to duplicate it with a parity-preserving logic gate. For example, if a signal x must be duplicated, an F2G cell can be used with the inputs (a = x, b = 0, c = 0): its outputs will be (p = x, q = x, r = x). If a PPLG output signal is not used, it must be sent to the output of the component in order to maintain its global parity. If two PPLG outputs have the same value and are not used by other PPLGs, they can be simplified or left unconnected, provided that the reversibility property of these PPLGs is not used: these bits are called "garbage bits". For example, if an F2G has the outputs (p = x, q = x, r = x), and if q and r are not used by other PPLGs, they can be left unconnected instead of being sent to the component output. PPM1 has eight garbage bits (see Figure 5) and PPM2 has six (see Figure 6).
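How the basic logic functions can be obtained from the two gates may be sketched in Python as follows. The gate definitions are the usual ones for the Feynman double-gate and the Fredkin gate; the particular constant inputs chosen for "not", "and" and "xor", as well as the function names, are assumptions of this sketch and do not reproduce the wiring of Figures 5 and 6.

```python
def f2g(a, b, c):
    """Feynman double-gate: (a, b, c) -> (a, a xor b, a xor c)."""
    return a, a ^ b, a ^ c

def frg(a, b, c):
    """Fredkin gate: controlled swap of b and c, with a as the control."""
    return (a, c, b) if a else (a, b, c)

def duplicate(x):
    """Fanout replacement: F2G with inputs (x, 0, 0) outputs (x, x, x)."""
    return f2g(x, 0, 0)

def not_gate(x):
    """F2G with inputs (1, x, 0) outputs (1, not x, 1)."""
    return f2g(1, x, 0)[1]

def and_gate(x, y):
    """FRG with inputs (x, y, 0) outputs (x, (not x) and y, x and y)."""
    return frg(x, y, 0)[2]

def xor_gate(x, y):
    """F2G with inputs (x, y, 0) outputs (x, x xor y, x)."""
    return f2g(x, y, 0)[1]

def or_gate(x, y):
    """'or' written over GF(2): x or y = x xor y xor (x and y)."""
    return xor_gate(xor_gate(x, y), and_gate(x, y))
```

In this software model x and y are simply reused by several functions. In a circuit, each reuse would be served by a copy produced with `duplicate`, and the outputs ignored here would either be routed to the component output or treated as garbage bits.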
The elementary cell of our fault-tolerant BSA (called FTBSA in the following) is depicted in Figure 7. It is made of two cells, called PPM1 (see Figure 5) and PPM2 (see Figure 6). PPM1 computes c 
Signal p o is then calculated:
Thus, if $p_o = 1$, at least one error has been detected.
Results
Cost of the proposed solution
All the architectures presented in this paper have been synthesized in C35 CORELIB technology using the Design Vision tool. Table 3 shows the impact of our countermeasure on the BSA alone. The overhead is large; the next step will be to implement the FTBSA with optimized PPLGs: for example, instead of implementing F2Gs with inputs (a = x, b = y, c = 0), only x ⊕ y will be implemented.
Overhead. The impact of the countermeasure implemented in components BSA1 and BSA2 is given in Table 4. The area overhead is acceptable (+38 %), but the main drawback of this countermeasure is the induced latency. As in the BSA case, some optimizations are possible, particularly the implementation of optimized PPLGs or pipelining.
Fault Coverage
The error detection capabilities of the proposed parity check were studied using simulated fault injections into the BSA. In these simulations, two random 164-bit elements were generated, addition was started and a fault was injected during the calculation. The corrupted data was finally checked against the parity bits. The faults (one or two faulty bits at once) could appear anywhere in the adder (164 × 35 bits).
Some faults do not affect the result because they are filtered during the computation (this can happen through an FRG, for example). Some faults are not detected even with only one faulty wire, due to the multiple use of some wires through the PPLGs when one of the outputs is filtered. The ratio of undetected faults rises from 5 % to 12 % when passing from one to two injected faults. This ratio should decrease for three faults thanks to the parity properties; this case will be evaluated in future work.
Additional Remarks
At present, the proposed countermeasure only computes parity bits. The processing of this information is critical since it must not add single points of failure to the circuit. For example, decisional tests must be avoided, because the flag bit which controls such a test can also be faulted. It is advisable to use the infective computation principle: if a fault is detected, the result is changed in a way that is unpredictable to an adversary (for example, [13] uses this technique). Thus, the following technique will not be implemented:
If parity(in) = parity(out), then return(out), else return(error)
It will be preferable to implement instead:
return((parity(in) ⊕ parity(out)).r ⊕ out)
where r is a random value.
The second remark concerns the use of borrow-save representation, which makes SCFA very difficult to achieve (provided that the state machine is secure).
Finally, the countermeasure described in this paper also aims at making SEA more difficult. This can be done by checking the parity during the ECC unit computations.
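The infective computation principle recommended in the first remark can be sketched in Python as follows. The function names, the word width `nbits` and the use of the `secrets` module as the random source are assumptions of this sketch; the optional argument `r` exists only so that the behaviour can be reproduced deterministically.

```python
import secrets

def parity(value):
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(value).count("1") & 1

def infective_output(out, parity_in, parity_out, nbits, r=None):
    """Return out unchanged when the parities agree, and out XOR r otherwise.
    No branch depends on the comparison, so no single flag bit decides the outcome."""
    if r is None:
        r = secrets.randbits(nbits)     # fresh random mask for each call
    return ((parity_in ^ parity_out) * r) ^ out

# Fault-free case: the parities match and the result is released as is.
print(hex(infective_output(0xABCD, 1, 1, 16)))
```

When the parities differ, the released value is masked by r and therefore carries no exploitable relation to the faulty result, provided r is unpredictable to the adversary.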
Conclusion
This paper presents a fault-tolerant elliptic curve cryptoprocessor unit. Its resistance is provided by the use of parity-preserving logic gates in the critical part of the ECC unit, which is the borrow-save adder. This architecture is protected with a high coverage rate against the computational safe-error attack, and the sign-change fault attack seems particularly difficult to perform since borrow-save representation is used. Moreover, the standard countermeasures presented in subsection 3.2 can be implemented in addition to our gate-level approach in order to protect against other possible fault attacks on ECC.
The proposed countermeasure provides a high level of fault detection, but at the expense of a significant latency overhead. This overhead can be decreased by using optimized parity-preserving logic gates and other reversible gates, such as Toffoli and Peres gates (see [18] for more details). Moreover, flip-flops will be inserted in the critical path of the circuit in order to shorten it. This new architecture is under development. Future work will also consist in generalizing the parity-preserving principle to other ECC unit components, such as the control logic.
