Abstract-Residue Number Systems (RNS) have been a topic of interest for years. Many previous works show that RNS is a good candidate for fast computations in asymmetric cryptography thanks to its intrinsic parallelism. A recent result demonstrates that redundant RNS and modular reduction fit together efficiently, yielding an RNS modular reduction algorithm with a single-fault detection capability. In this paper, we propose to generalize this approach by protecting the classical Cox-Rower architecture against multi-fault attacks. We prove that faults occurring at different places and at different times can be detected with a linear cost for the architecture and a constant time for the execution.
I. INTRODUCTION
Residue Number Systems (RNS) have proven to be a good candidate for achieving fast computation in finite fields [5], [6], [9], [15], [18], which is critical for limiting the latency of public-key cryptography. In practice, most competitive hardware implementations rely on the so-called Cox-Rower architecture, introduced by Kawamura et al. [15], which is designed to fit the natural properties of RNS.
When critical applications, e.g. related to banks and credit cards, are implemented in an embedded system, they must be protected against side-channel attacks, such as timing attacks [12] or simple/differential analysis [7], [13]. For this purpose, prior works showed that RNS naturally provides a Leak Resistant Arithmetic [4]. Besides, fault attacks can remain a real threat despite protections against leaks. Thus, fault detection is an important security feature which should be integrated into a crypto-chip. Again, RNS has a natural detection solution through the use of redundancy [17]. But, due to the particular structure of RNS modular reduction, this kind of approach seemed incompatible with finite field RNS arithmetic. Recently [2], this issue has been settled thanks to a new redundant RNS modular multiplication algorithm which includes a single-fault detection ability. Nevertheless, it does not guarantee the detection of multiple faults in time and/or space inside the Cox-Rower architecture, making such multi-fault attacks potentially invisible and successful.
This work has been supported in part by the European Union's H2020 Programme under grant agreement number ICT-644209 and ANR ARRAND 15-CE39-0002-01.
In this work, we propose a solution for detecting a multi-fault attack in space and in time on a cryptographic device. By using a stronger fault model, we show how to match redundant RNS with finite field arithmetic. The multi-fault model makes it possible to model both precisely targeted attacks and hardware failures which could impact several RNS channels. In order to provide a very practical fault resistant modular reduction algorithm, the fault model is adapted to hardware constraints.
The paper is organised as follows. Section II gives background on standard/redundant RNS, required in the rest of the paper. Section III is the core part, where Thm. III.3 and Thm. III.4 are the main novelties. It deals with the use of redundant RNS to detect multiple faults. In Section IV, practical considerations about the integration of the solution in a hardware implementation relying on a Cox-Rower architecture are suggested. In particular, the area overhead is analysed. Finally, some conclusions are drawn.
II. BACKGROUND OVERVIEW

A. Residue Number Systems
Residue number systems [8] are non-positional numeral systems based on the Chinese Remainder Theorem (CRT). The CRT states the existence of a ring isomorphism $\varphi_B : \mathbb{Z}_M \xrightarrow{\sim} \mathbb{Z}_{m_1} \times \dots \times \mathbb{Z}_{m_n}$ (where $M = \prod_{i=1}^{n} m_i$) as soon as the "base" $B = \{m_1, \dots, m_n\}$ is composed of pairwise coprime moduli. Consequently, the arithmetic in the interval $[0, M)$ can be replaced by independent computations in the channels $\mathbb{Z}_{m_i}$. In practice, these concurrent channels are implemented in parallel arithmetic units called Rowers. Given an integer $x$, we will denote by $x_i$ or $|x|_{m_i}$ its residue in $\mathbb{Z}_{m_i}$.
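As a minimal illustration of this channel-wise arithmetic, the following Python sketch converts integers to residues and multiplies them independently in each channel. The base and the helper names (`to_rns`, `rns_mul`, `is_pairwise_coprime`) are illustrative assumptions, not taken from any implementation discussed in this paper.

```python
from math import gcd, prod

# Illustrative base: small pairwise coprime moduli (hypothetical values).
BASE = [13, 17, 19, 23]
M = prod(BASE)


def to_rns(x, base=BASE):
    """Return the residues |x|_{m_i} of x, one per channel."""
    return [x % m for m in base]


def rns_mul(xs, ys, base=BASE):
    """Multiply two residue vectors channel by channel (no carry between channels)."""
    return [(a * b) % m for a, b, m in zip(xs, ys, base)]


def is_pairwise_coprime(base):
    """Check the CRT requirement on the base."""
    return all(gcd(a, b) == 1 for i, a in enumerate(base) for b in base[i + 1:])


x, y = 123, 45
product_residues = rns_mul(to_rns(x), to_rns(y))
```

Each channel only ever sees values smaller than its own modulus, which is what allows one Rower per channel to work in parallel.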
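The inverse $\varphi_B^{-1}(x_B) = x$ can be computed as follows:

$$x = \sum_{i=1}^{n} \xi_{i,x} M_i - \kappa_B(x_B)\, M, \quad \text{with } M_i = \frac{M}{m_i} \text{ and } \xi_{i,x} = \left| x_i M_i^{-1} \right|_{m_i}. \tag{1}$$

It is straightforward to notice that the integer $\kappa_B(x_B)$ belongs to $[0, n-1]$ and is equal to

$$\kappa_B(x_B) = \left\lfloor \sum_{i=1}^{n} \frac{\xi_{i,x}}{m_i} \right\rfloor. \tag{2}$$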
In the following, we may represent an integer by a vector of residues. Such vectors may be mixed with integers within the same formula to point out certain properties more easily.
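The CRT-based inverse can be sketched as follows. This is a minimal Python illustration that assumes the usual CRT reconstruction $x = \sum_i \xi_i M_i - \kappa M$ with $\xi_i = |x_i M_i^{-1}|_{m_i}$ and $M_i = M/m_i$; the function names and the toy base are hypothetical.

```python
from math import prod


def crt_coefficients(residues, base):
    """Return the xi_i = |x_i * M_i^{-1}|_{m_i} and the M_i = M / m_i."""
    M = prod(base)
    Mis = [M // m for m in base]
    xis = [(x_i * pow(M_i, -1, m)) % m for x_i, M_i, m in zip(residues, Mis, base)]
    return xis, Mis


def from_rns(residues, base):
    """Recover x in [0, M) and kappa such that x = sum(xi_i * M_i) - kappa * M."""
    M = prod(base)
    xis, Mis = crt_coefficients(residues, base)
    total = sum(xi * M_i for xi, M_i in zip(xis, Mis))
    kappa = total // M          # number of multiples of M to subtract
    return total - kappa * M, kappa


base = [13, 17, 19, 23]         # illustrative moduli (assumption)
x = 54321
value, kappa = from_rns([x % m for m in base], base)
```

Since every $\xi_i$ is smaller than $m_i$, the sum is smaller than $nM$, which is why `kappa` never exceeds $n-1$.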
B. Base conversions
While the Rowers are physical units implementing the basic concurrent RNS arithmetic, the Cox unit is dedicated to the computation of $\kappa_B(x_B)$ during a base conversion. To switch between two bases, two main approaches are used in practice: using (1) (similar to Lagrange's interpolation), or going through a positional mixed radix system (similar to Newton's interpolation). However, the latter is a real bottleneck because it is intrinsically sequential. In practice, the Cox-Rower architecture is designed to efficiently compute the first variant.
In CRT-based conversion, an approximation of the factor $\kappa_B(x_B)$ is computed by the Cox unit [11]. To be precise, when $2^{r-1} < m_i < 2^r$, the Cox unit computes:
where the function $\mathrm{trunc}_h$ returns the $h$ most significant bits, $h$ being an approximation parameter. If $h$ is well matched to the parameters $r$ and $n$, one obtains either $\hat{\kappa}_B(x_B) = \kappa_B(x_B)$ or $\hat{\kappa}_B(x_B) = \kappa_B(x_B) - 1$. The conversion $\mathrm{Bex}(B, x)$ derived from this approach returns an almost reduced value:
The approximation error can be bounded by a real number $0 \le \Delta < 1$ (cf. [11] for more details). For any $x$ in $[0, M)$, such a $\Delta$ satisfies the following:
This error can be corrected by adding an offset $\alpha \in [\Delta, 1)$ to the sum (3) (inside the floor) computed by the Cox. In the following discussions, this corrected conversion is denoted $\mathrm{Bex}_\alpha$; $\mathrm{Bex}_0$ is the previous uncorrected variant $\mathrm{Bex}$. Moreover, it will be useful to notice that, for $\mathrm{Bex}$ and any $x < M$, $0 \le \mathrm{Bex}(B, x) < (1+\Delta)M$.
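The behaviour of the approximate conversion can be explored with a small Python sketch. It assumes that the Cox sums the truncated coefficients $\mathrm{trunc}_h(\xi_i)$, divides by $2^r$ and takes the floor, with the offset $\alpha$ added inside the floor as a multiple of $2^{-r}$; the toy parameters ($r = 5$, $h = 4$, moduli 31, 29, 27) and all function names are assumptions chosen so that an exhaustive check stays cheap.

```python
from math import prod

R = 5                    # word size r (toy value); moduli satisfy 2**(R-1) < m < 2**R
H = 4                    # number of most significant bits kept by trunc_h
BASE = [31, 29, 27]      # illustrative pairwise coprime moduli
M = prod(BASE)


def trunc_h(word):
    """Keep the H most significant bits of an R-bit word, zeroing the others."""
    shift = R - H
    return (word >> shift) << shift


def xi_coefficients(x):
    """CRT coefficients xi_i = |x_i * M_i^{-1}|_{m_i} of x in BASE."""
    return [((x % m) * pow(M // m, -1, m)) % m for m in BASE]


def bex(x, alpha_num=0):
    """Approximate CRT conversion with offset alpha = alpha_num / 2**R."""
    xis = xi_coefficients(x)
    kappa_hat = (sum(trunc_h(xi) for xi in xis) + alpha_num) >> R
    return sum(xi * (M // m) for xi, m in zip(xis, BASE)) - kappa_hat * M


ALPHA_NUM = 20           # alpha = 20/32, above the error bound of this toy base
```

For these toy parameters the truncation error stays below $12/32$, so `bex(x)` returns either `x` or `x + M`, while `bex(x, ALPHA_NUM)` is exact for every `x` below $(1-\alpha)M$. A hardware conversion would output the same value reduced modulo each target modulus rather than as an integer.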
C. RNS arithmetic in $\mathbb{F}_p$ within a Cox-Rower architecture
RNS modular reduction relies on Montgomery's approach [14]. Different variants are possible, depending on the base conversion methods which are used [1], [11], [16]. All these techniques involve two coprime RNS bases $B$ and $B'$. When using the conversions described in the previous part, the principle of computation of a modular reduction within a Cox-Rower architecture is briefly summarised in Alg. 1 (cf. Fig. 2). [...] $\}$ with $k \le n$. The principle of fault detection in redundant RNS is based on a "consistency check" [17]. Some residues $(x_B, x_{B_R})$ are said to be consistent if they pass the following test:

$$x_{B_R} = \varphi_{B_R} \circ \varphi_B^{-1}(x_B).$$

The function $\varphi_{B_R} \circ \varphi_B^{-1}$ represents a complete conversion from $B$ to $B_R$. The dynamic range (i.e. the range used to represent the values modulo $p$) of this redundant RNS is by definition that of $B$, i.e. $[0, M)$. Hence, the residues of any $x$ in this range always pass the consistency check. In practice, adding a $k$-fault detection capability to a Cox-Rower architecture is simply achieved by adding $k$ Rowers working concurrently with the "regular" ones. The way to handle them during computations in $\mathbb{F}_p$ is the purpose of the next section.
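The consistency check can be illustrated with a short Python sketch that converts the residues in $B$ to $B_R$ exactly and compares them with the stored redundant residues. The bases, the fault-injection helper and the exhaustive double-fault loop are illustrative assumptions; the redundant moduli are deliberately chosen larger than those of $B$.

```python
from itertools import combinations
from math import prod

BASE = [13, 17, 19, 23]      # main base B (illustrative)
REDUNDANT = [29, 31]         # redundant base B_R, k = 2, larger than the moduli of B
M = prod(BASE)


def from_rns(residues, base):
    """Exact CRT reconstruction in [0, prod(base))."""
    Mb = prod(base)
    return sum(r * (Mb // m) * pow(Mb // m, -1, m) for r, m in zip(residues, base)) % Mb


def is_consistent(x_b, x_br):
    """Convert the residues in B to B_R and compare with the redundant residues."""
    x = from_rns(x_b, BASE)
    return all(x % m == r for m, r in zip(REDUNDANT, x_br))


def inject(residues, base, faults):
    """faults maps a channel index to a non-zero error added modulo that channel."""
    out = list(residues)
    for i, e in faults.items():
        out[i] = (out[i] + e) % base[i]
    return out


def all_double_faults_detected(x_b, x_br):
    """Try every pair of faulty channels of B with every pair of non-zero errors."""
    for i, j in combinations(range(len(BASE)), 2):
        for e_i in range(1, BASE[i]):
            for e_j in range(1, BASE[j]):
                if is_consistent(inject(x_b, BASE, {i: e_i, j: e_j}), x_br):
                    return False
    return True


x = 4242
x_b = [x % m for m in BASE]
x_br = [x % m for m in REDUNDANT]
```

In this toy setting, the error introduced by two faulty channels is a non-zero multiple of the product of the other channels of $B$, with a cofactor smaller than the product of the two redundant moduli, so at least one redundant channel disagrees.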
III. MULTI-FAULT RESISTANT RNS MODULAR REDUCTION
Without loss of generality, $B$ and $B'$ both contain $n$ moduli.
A. Preliminary remarks
Alg. 1 involves operations in $B$ and $B'$, a first conversion $\mathrm{Bex}$ that is possibly incomplete, and a second, complete one $\mathrm{Bex}_\alpha$. So, attacks may occur in different parts of the process, affect different Rowers, and modify residues in both $B$ and $B'$ (the consequences of faults on the Cox are discussed in Section III-C). However, because the Rowers are independent, an infected unit does not propagate a fault between conversions. In Fig. 1, we consider the following types of fault: types 1 and 3 in $B$, type 2 in $B'$. Fig. 2 illustrates the way a faulty Rower can propagate its error to other units during a conversion. We can notice that an error of type 1 will affect all the residues of the second base. The columns of the second base represent the evaluation of the summation in (1).
The goal is to find an easy way to detect whether some faults occur during the reduction process. We notice that the second conversion $\mathrm{Bex}_\alpha$ is complete, and can be used as is to perform a consistency check, as described in Thm. III.1.
Thus, if we consider Alg. 1, multiple faults in $B'$ (i.e. type 2) can be detected by using the second conversion for checking, because $\mathrm{Bex}_\alpha$ is a complete reduction.
Thwarting type 3 faults: In the next section, we show how to deal with type 1 faults, possibly associated with type 2 faults and with faults occurring during a conversion. We will not mention faults in $B$ after the second base conversion, i.e. type 3 faults. Either they can be reduced to type 1 (and then will be detected) when the output of the reduction is re-used as an input (e.g. as in a modular exponentiation), or the residues in $B'$ (whose integrity is checked by the detection process described afterwards) are sufficient to reconstruct the output of the reduction in binary representation.
B. Multi-fault injected during a modular reduction
In this part and as a first step, we consider "theoretical" faults, which affect values modulo some of the $m_i$. In other words, we do not yet consider faults at the binary representation level. For example, if $x_i$ is an $r$-bit word, a faulty residue $\bar{x}_i$ is assumed to belong to $[0, m_i)$. The other case ($\bar{x}_i \in [m_i, 2^r)$) is tackled in the next section. To simplify the study in the context of a Cox-Rower architecture, faults on the Cox are not considered for the moment. This issue will be considered later.
The detection capability $k$ determines the size of $B_R$. The $k$-fault resistant modular reduction is described by Alg. 2. Steps 1 to 5 are the same as for Alg. 1, but they additionally take $B_R$ into account. Next, steps 6 to 9 implement the consistency check. A multi-fault during a base conversion can be modeled by a multi-fault infecting pre- and/or post-conversion residues. Thus, we can restrict ourselves to the faults affecting $q_B$ due to perturbations at steps 1 and 2, and those affecting $s_{B'}$ and $s_{B_R}$ after faults at steps 2, 3, 4 and 5.
Fig. 3. Scheme of the consistency test.
The structure of a multi-fault injected during the execution of Alg. 2 is formally described in the following definition. 
    return $(s_B, s_{B'}, s_{B_R})$
8: else
9:     return "Corrupted data"
10: end if

Proof: Let us consider a $(u+v+w)$-fault as defined by (6). The $w$ faults in the redundant Rowers do not play any role. Indeed, the detection capability $k$ ensures that at least $u+v$ redundant channels are not corrupted between two consistency checks. Then, if $u = v = 0$, the faults only affect redundant residues, and the single-fault context of [2] is enough to ensure the detection (just consider the system with only one faulty redundant channel). From now on, we denote by $M_{R,Z_w}$ the product of the $k-w$ safe redundant channels. The proof is mainly based on the fact that it suffices to study the effect of the $(u+v)$ faults on the integer $t = x + \hat{q}p$ at line 3.
First, type 1 faults modify $q$ at line 1. We then denote it $\tilde{q} \in [0, M)$. After the first conversion, we denote $\tilde{t}^{(1)} = x + \hat{q}p$. After the second one, we denote by $\hat{s}_{B'} = \mathrm{Bex}_\alpha(s_{B'})$ the value sent by the conversion to $B$ to recover the output in the full base, and to $B_R$ for the consistency check. So the associated "$t$" is defined by $\tilde{t}^{(2)} = M\hat{s}_{B'}$. At this point, we know that, in particular, the consistency check will detect the multi-fault if $(\tilde{t}^{(1)}M^{-1} - \hat{s}_{B'}) \not\equiv 0 \bmod M_{R,Z_w}$. Due to the coprimality of $M$ and $M_R$, this holds iff $(\tilde{t}^{(1)} - \tilde{t}^{(2)}) \not\equiv 0 \bmod M_{R,Z_w}$. Thus, we will show that this last inequality holds.
Basically, the strategy is to prove that $|\tilde{t}^{(1)} - \tilde{t}^{(2)}|$ is an integer fully representable in $B \cup B'$ (i.e. it is $< MM'$), and has non-zero residues in, and only in, the channels of $B$ affected by the $u$ type 1 faults (indexed by $I_u$) and the channels of $B'$ affected by the $v$ type 2 faults (indexed by $J_v$). In other words, this means that we need to prove that
A first step is to establish that: (i) $\tilde{t}^{(1)} \in [0, (1-\alpha)MM')$.
Point (i) is due to the fact that, despite the type 1 faults on $q$, giving $\tilde{q} \in [0, M)$, the first conversion still ensures that $\hat{q} \in [0, (1+\Delta)M)$. Indeed, from the hypotheses on $M$ and $M'$ in Alg. 2, it follows that $x + \hat{q}p < 4p$ [...] $'$. Now, we highlight crucial properties about $s_{B'}$ and $\hat{s}_{B'}$: [...] $_{J_v}$. And because of (iv), this multiple verifies (v).
To end the proof, it just remains to notice that, by hypothesis on the number of faults and on the size of redundant moduli,
So, this and (v) imply that $|e_t|_{M_{R,Z_w}} \ne 0$, or in other words that $(\tilde{t}^{(1)} - \tilde{t}^{(2)}) \not\equiv 0 \bmod M_{R,Z_w}$: the consistency check detects the multi-fault, at least in all the uncorrupted channels of $B_R$.
C. Treating faults on the Cox
A fault on a Rower corresponds to a type 1 or type 2 fault in Fig. 1. We have to address the effects of faults on the Cox (not described in Fig. 1). The Cox evaluates the value $\kappa_B(x_B)$ in (1)
Finally, this kind of fault on a single Cox shared between all Rowers can easily be defeated, for instance by adding a small extra modulus to $B_R$.
However, due to the small area cost of a Cox, it may be more interesting to integrate a Cox inside each Rower. In the next parts, we select this variant.
D. Integrating binary representation within the fault model
The previous fault model takes into account the appearance of multiple faults in time and space in a Cox-Rower architecture, but it does not fit the way data are represented in hardware. A residue was always considered as an integer in $[0, m)$. But such a value needs to be handled through a binary representation. During a conversion, each rower computes and stores the $\xi_{i,q}$ or $\xi$ [...] The main consequence of the theorem is that the redundant moduli are not required to be greater than $2^r$ to ensure the detection of these specific overflowing multi-faults. [...] $\alpha \ge \Delta$).

Theorem III.4. Assume that any modulus $m \in B \cup B'$ satisfies $2^{r-1} < m < 2^r$. Let $\varepsilon = \min\{t \in \mathbb{Z} \mid \forall m \in B \cup B',\ 2^r - m < 2^t\}$. Let us consider Alg. 3 with the following hypotheses:
The following discussion aims to show that, under the conditions of Thm. III.4, the new context is actually identical to the one about "theoretical" faults. To do so, we show that the key points in the proof of Thm. III.3 are still relevant. Precisely, we prove that $|\tilde{t}^{(1)} - \tilde{t}^{(2)}| < MM'$ and that it has non-zero residues only in $u$ channels of $B$ and $v$ channels of $B'$. We also prove that the new conditions still imply a complete second base conversion when no fault occurs.
a) About first conversion: In the following, $q$ may denote either a correct $q$ or a $q$ affected by $u$ type 1 faults. The first conversion is not required to provide a full reduction modulo $M$. The crucial point is to guarantee that $\tilde{t}^{(1)} \in [0, (1-\alpha)MM')$. So, we need the converted value $\hat{q}$ to be positive, and $M'$ to be such that $\tilde{t}^{(1)} = x + \hat{q}p < (1-\alpha)MM'$. On the one hand, the sum $\sum_{i=1}^{n} \xi_{i,q} M_i$ can contain an extra term $\ell M$ with $\ell \le k$, where $\ell$ is exactly the number of erroneous $\xi_{i,q}$ which overflow their respective $m_i$. We denote by $(\tilde{\xi}_{i,q})_{i \in [1,n]}$ the set of all the $\xi_{i,q}$, but where $\tilde{\xi}_{i,q} = \xi_{i,q} - m_i$ for each $\xi_{i,q}$ overflowing $m_i$. In particular, the new coefficients $\tilde{\xi}_{i,q}$ are those of $q$ affected by a theoretical multi-fault, and represent an integer denoted $\tilde{q}$ in $[0, M)$. Consequently,
So, the conversion outputs $\tilde{q}$ plus a multiple of $M$. On the other hand, we check whether the value (3) computed by the Cox can cancel $\ell M$. This computation involves the terms $\mathrm{trunc}_h(\xi_{i,q})$. In particular, we have the following quantities (where $\delta_i \in \{0, 1\}$):
Under the conditions of Thm. III.4, the terms $\mathrm{trunc}_h(m_i) + \delta_i$ cancel at least $(\ell - 1)M$. Indeed, we can write (for brevity's sake, details are omitted), with $\Delta =$
Since $\alpha < 1$ (ii), the Cox outputs:
Finally, the first conversion outputs $\hat{q} = \tilde{q} - \beta M$. As previously explained, although an incomplete reduction is allowed, $\hat{q}$ has to be non-negative. To prevent the case $\beta = 1$, the value returned by the Cox must be decreased by 1. This justifies the term $\hat{q} + M$ at line 3 of Alg. 3, and the new definition of $t$ [...] $2M\}$, and we can deduce from (7) that $\hat{q} = \tilde{q} + 2M$ only if $\tilde{q} < \left(1 + \Delta + \frac{\ell}{2^{r-\varepsilon}} + \frac{\ell}{2^h}\right)M \le (1+\alpha)M$. Hence,
Besides, condition (iii) about $M$ states that $x < 9p^2 < (1-\alpha)Mp$. This, together with (8) and condition (iv) about $M'$, implies:
The analysis is similar for the conversion of $s_{B'}$. We again consider $\ell$ overflows. Then, the offset $\alpha$ (ii) makes it possible to correct the error $\Delta + \frac{\ell}{2^{r-\varepsilon}} + \frac{\ell}{2^h} \le \frac{n+k}{2^{h-1}} = \alpha$. Hence, the version of the second inequality in (7) for this case is greater than $\kappa_B(s_{B'}) + \ell$. Since $s$ denotes an integer in $[0, M')$, and since, by hypothesis, $\alpha + \frac{k}{2^h} < 1$ (ii), the first upper bound in (7) becomes, in this case:
(10)
Finally, the Cox outputs either $\kappa_B(s_{B'}) + \ell$ or $\kappa_B(s_{B'}) + \ell + 1$. In particular, the conversion outputs $s_{B'} - \delta M'$ with $\delta \in \{0, 1\}$. And, from (10), we deduce that $\delta = 1$ only if $s_{B'} \ge (1 - \alpha - k$ [...]
Consequently, we obtain the following bounds:
By gathering (9) and (11), and the definition $\tilde{t}^{(2)} = \mathrm{Bex}_\alpha(s_{B'})M$, we obtain the key point $|\tilde{t}^{(1)} - \tilde{t}^{(2)}| < MM'$. The second key point in the proof of Thm. III.3, i.e. $|\tilde{t}^{(1)} - \tilde{t}^{(2)}|$ having $u$ non-zero residues in $B$ and $v$ in $B'$, is obviously verified. Indeed, the only difference with Thm. III.3 is due to the possible presence of multiples of $MM'$. Thus, due to the faulty residues indexed by $I$ on $q$ in $B$ and by $J$ on $s$ in $B'$, we still have that $|\tilde{t}$ [...]
This guarantees the detection. To end the proof, when no faults occur in $B$ and $B'$ (except maybe overflows $\xi_{i,q} = m_i + \tilde{\xi}_{i,q}$ and $\xi$ [...]
It means that the offset $\alpha$ guarantees an exact conversion: $\mathrm{Bex}_\alpha(s_{B'}) = s$.
And the output in $B \cup B' \cup B_R$ is nothing but $s \in [0, 3p)$ with $s = xyM^{-1} \bmod p$.
E. Example of parameters
To show that the conditions of Thm. III.4 are realistically reachable despite their apparent restrictiveness, we consider the parameters calibrated for 521-bit ECC with $n = 31$ and $r = 17$. For instance, we set the detection capability to $k = 6$.
The set $[2^{17} - 2^9, 2^{17}]$ contains 68 prime numbers (i.e. enough for $B$, $B'$ and $B_R$), so $\varepsilon = 9$. Consequently, $h \le 8$ and it has to be set such that $\alpha + \frac{k}{2^h} = \frac{2n+3k}{2^h} < 1$ holds. Setting $h = 7$ is then sufficient. These parameters are such that conditions (iii) and (iv) about $M$ and $M'$ are satisfied (we recall that $\log_2(p) = 521$). Indeed, we have in this case that $\log_2(M), \log_2(M') > \log_2\left((2^{17} - 2^9)^n\right) = 526.8$ and $\log_2($ [...]

In this section, we provide some practical guidelines for re-using the redundant Rowers and for hardening the parts of the Cox-Rower architecture which could not be covered by the detection approach. In III-C, we discussed the advantage of using one Cox per Rower. This is the choice we make.
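The arithmetic behind the parameter choice of Section III-E ($n = 31$, $r = 17$, $k = 6$, $\varepsilon = 9$) can be checked with a few lines of Python. The sketch only re-evaluates the two numerical conditions quoted in that example; the bound $h \le r - \varepsilon$ and the helper name `smallest_h` are assumptions made for the illustration.

```python
from math import log2

n, r, k = 31, 17, 6      # parameters quoted in the example
epsilon = 9              # moduli taken above 2**r - 2**epsilon


def smallest_h(n, k):
    """Smallest h such that (2n + 3k) / 2**h < 1."""
    h = 0
    while (2 * n + 3 * k) >= 2 ** h:
        h += 1
    return h


h = smallest_h(n, k)
h_max = r - epsilon                            # assumed origin of the bound h <= 8
min_bits = n * log2(2 ** r - 2 ** epsilon)     # lower bound on log2(M) and log2(M')
```

With these values, `h` evaluates to 7, which respects `h_max`, and `min_bits` is close to 526.8, a few bits above $\log_2(p) = 521$.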
A. Cox-Rower architecture
The Cox-Rower architecture has been intensively used for implementing RNS cryptoprocessors [3], [6], [9], [15]. It is composed of a Sequencer unit, a Cox unit and $n$ Rower units. It is illustrated by the black parts in Fig. 4.
The Cox unit implements the $\mathrm{trunc}_h$ function so as to return the $\hat{\kappa}$ coefficient (3). Each Rower unit is dedicated to computations like $\left(\sum_{i=1}^{n} a_i b_i\right) \bmod m$, where $m$ is a modulus of $B$ or $B'$. This architecture shows a good tradeoff between area and performance, and can take advantage of the DSP blocks available in FPGAs [3].
The $n$ registers below the rowers are in charge of sending the $\xi_i$ coefficients (cf. (1)) from each rower to all the others during a base conversion. To avoid a burdensome architecture, the sending can be done through shifts between these registers.
In the previous section, redundant residues were used to detect multi-fault attacks during Montgomery reduction computations and did not play any role in the Montgomery reduction itself. The modifications made to the Cox-Rower architecture to include the detection capability are shown in red in Fig. 4.
B. Redundant Rower unit
Although the architecture of a redundant Rower unit is nearly identical to a regular one, the depth of the memories containing the precomputed values for the reduction level is divided by two. The extra feature is that it implements the consistency check. This can be achieved at the accumulator/reduction level by comparing the result with zero.
Because redundant residues are not involved as part of the pre-conversion base during a modular reduction, they are not subject to the Cox-Rower conditions. In order to keep the range of moduli under Cox-Rower conditions available for the two bases $B$ and $B'$, we could choose to use $m_R \ge 2^r$ for all $m_R \in B_R$. However, we have shown (Thm. III.4) that such a condition is not necessary for detecting specific hardware overflowing faults. Thus redundant rowers could contain strictly identical arithmetic and logic units (i.e. over $r$ bits). Those modifications are highlighted in red in Fig. 5.

In previous works [9], [15], several RNS to binary radix conversions have been introduced so as to achieve a final conversion from RNS to binary radix. The principle is to recover the coefficients of the result $s$ in base $2^r$, from the least significant digit to the most significant one, by using the formula $\lfloor s / 2^r \rfloor =$ [...]. Consequently, we propose here to share the functionality of one redundant Rower using the modulus $m = 2^r$ in order both to detect a single fault and to compute the RNS to binary radix conversion. Furthermore, this redundant Rower is obviously smaller than the other ones because computing modulo $2^r$ just amounts to truncating binary representations. On FPGA, this solution is quite easy to implement, as DSP blocks embed multiply-and-accumulate features as well as a test-value feature dedicated to zero-value detection, which can be used for consistency checks.
Last, because one standard rower is usually dedicated to two moduli, one of $B$ and one of $B'$, it is worth noticing that the multi-fault model makes it possible to cover hardware failures, which could then affect residues two by two. In this case, $k$ is advantageously chosen to be even.
C. Hardening Cox unit and registers
In Section III, it is shown that a multi-fault attack can be detected when it affects computations done inside the Rower units. Faults affecting the Cox unit or the registers used during base conversions are not covered by the hypotheses of Alg. 3 and Thm. III.4. In order to prevent and detect a fault in the Cox unit, we suggest hardening it using a lockstep fault-tolerant methodology. Indeed, the Cox is only composed of an $(h + \lceil \log_2(n) \rceil)$-bit adder/accumulator together with an $(h + \lceil \log_2(n) \rceil)$-bit register. Thus it is quite small compared to the Rower units. In the previous example (III-E), the size of the Cox unit implemented in FPGA would be only 13 LUTs and registers.
In the case of the Register units dedicated to the distribution of the $\xi_i$ coefficients during a base conversion, the problem is that a permanent fault could corrupt many more than $k$ of these coefficients because of the shifting. Thus the fault model would not fit anymore. For instance, a fault which permanently affects the first register could modify the whole set of $\xi_i$ passing through this register, and may be responsible for the appearance of more than $k$ faults.
To prevent this phenomenon, each $\xi_i$ can be memorised in its associated rower before it is sent into the output shift registers. When $\xi_i$ has been distributed to all rowers, the $i$-th rower can compare it with the previously memorised value so as to check that the distribution step has been fair. The extra cost is then $nr$ registers. The modified rower unit is presented in Fig. 6, using red for the modifications.
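The memorise-and-compare idea can be sketched with a toy Python model. It assumes that the $n$ output registers form a ring shifted $n$ times, so that each coefficient returns to its originating rower, and it models a permanent fault as bits stuck at 1 in one register; the function `distribute` and its parameters are hypothetical and do not describe the actual hardware.

```python
def distribute(xis, stuck_register=None, stuck_mask=0):
    """Shift the xi_i around a ring of registers and check them on return.

    Returns (received, alarms): the values seen by each rower, and one flag
    per rower telling whether its own xi_i came back modified.
    """
    n = len(xis)
    memorised = list(xis)            # copy kept inside each rower before sending
    registers = list(xis)
    received = [[] for _ in range(n)]
    for _ in range(n):
        # One shift: register i takes the content of register i - 1.
        registers = [registers[-1]] + registers[:-1]
        if stuck_register is not None:
            # Permanent fault: some bits of this register are stuck at 1.
            registers[stuck_register] |= stuck_mask
        for i in range(n):
            received[i].append(registers[i])
    # After n shifts, every value is back in the register of its own rower.
    alarms = [registers[i] != memorised[i] for i in range(n)]
    return received, alarms


clean_received, clean_alarms = distribute([3, 10, 5, 12])
_, faulty_alarms = distribute([3, 10, 5, 12], stuck_register=1, stuck_mask=0b0001)
```

In this model, an alarm is raised by exactly those rowers whose coefficient was altered while passing through the faulty register; coefficients whose bits already match the stuck value travel unchanged and raise no alarm.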
D. Extra cost of detection capability
The detection procedure has the advantage of not increasing the execution time of the modular reduction, provided that enough extra surface can be used to implement the whole detection process. Indeed, redundant residues are never involved on the emitting side during a base conversion. Furthermore, any computation concerning the consistency check or the protection of the output registers can be done independently of the main computation flow.
Thus, the only cost is in terms of area and power consumption. Theorem III.4 proves that redundant moduli just have to be greater than standard ones. Thus, a redundant rower has the same structure as a classical one, plus a negligible area overhead due to the consistency check. For instance, Table I provides the area of a Rower design from [3], which is usable for implementing a 521-bit ECC (as considered in [3]). Moreover, Section IV-C has also emphasized that protecting the Cox through a lockstep fault-tolerant methodology remains reasonably cheap because of the simple and small structure of this unit.
To summarise, a relevant estimation of the area overhead is $k \times$ (Rower area), with $k$ the detection capability. Because rowers represent the main part of the architecture, a first estimation is given by $+\frac{k}{n}$ % of the initial area, where $n$ is the number of standard rowers. [...] of area overhead in the case of $n = 31$ standard rowers for implementing a 521-bit ECC [3]. Those values are also compared with P&R results for area overhead (given in slices) as well as power overhead at 200 MHz, which are less than the first estimation. The reason for this gap is mainly due to the fact that a redundant rower is smaller than a modified rower.
To finish, we also notice that the total number of slices in Tab. II for $k = 0$ is a bit greater than in [3] (2565 slices for the same parameters). This is because we use a deeper pipeline within the present modified rowers.
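The first-order estimate can be evaluated directly. The sketch below reads the estimate as the ratio $k/n$ expressed in percent, which is an interpretation, and the values of $k$ are arbitrary sample points rather than figures taken from the tables of this paper.

```python
def area_overhead_percent(k, n):
    """First-order estimate: k redundant rowers added to n standard ones."""
    return 100.0 * k / n


# Sample detection capabilities for n = 31 standard rowers (illustrative choice).
estimates = {k: round(area_overhead_percent(k, 31), 1) for k in (1, 2, 4, 6)}
```

For $n = 31$ and $k = 6$, this gives an overhead of about 19.4 % under that reading.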
a) Validation of the methodology of fault detection:
We have validated our fault detection design using a hardware simulation tool. To inject faults into our design, we forced erroneous values (types 1 and 2 as described above) during the computation in the simulator. Transient faults as well as permanent faults were simulated in our modified rowers. The fault detection mechanism was always triggered when a multi-fault simulation was running. We also compared the faulted design with its golden model (simulation without fault) to observe the differences in the results.
V. CONCLUSION
In [2], an RNS modular reduction algorithm defeating single-fault attacks was proposed. Since fault attacks can be devastating, it really matters to know whether this approach scales efficiently so as to defeat multiple attacks, which could cause several RNS units to be corrupted. Moreover, a multi-fault model also makes it possible to model hardware failures. Indeed, in a Cox-Rower architecture, a single Rower is generally dedicated to at least one channel of $B$ and one of $B'$. For that purpose, an RNS modular reduction algorithm with multi-fault detection capability has been presented. It has been shown that the approach in [2] is extendable to any degree of detection, as proven by Thm. III.3 and Thm. III.4. Beyond pure theoretical analysis, a hardware multi-fault model adapted to the Cox-Rower architecture has also been considered. Despite some specific constraints due to the binary representation of RNS data, it has been proven that the redundant moduli do not need any extra specific features in the hardware context compared to the theoretical model. So, the detection process is flexible, since any rower can be dedicated either to a standard RNS channel or to a redundant one without significant changes.
As for single-fault detection, defeating multiple attacks can be done without any time overhead. Furthermore, the area overhead is mainly limited to that of the redundant rowers, and is roughly estimated to be $+\frac{k}{n}$ % of the initial area, with $k$ being the detection capability and $n$ the size of both bases $B$ and $B'$. As a consequence, for a given $k$, the smaller the regular moduli, the smaller the area overhead.
