Abstract. This paper analyzes the robustness of RSA countermeasures against electromagnetic analysis and collision attacks. The proposed RSA cryptosystem uses residue number systems (RNS) for fast modular arithmetic on large numbers. The parallel architecture is protected at the arithmetic and algorithmic levels by the Montgomery Ladder and the Leak Resistant Arithmetic countermeasures. Because the architecture can leak information through control and memory operations, the hardware RNS-RSA also relies on the randomization of RAM accesses. Experimental results, obtained with and without randomization of the RNS moduli sets, suggest that the RNS-based RSA with base randomization and secured RAM accesses is protected against these attacks.
Introduction
Side-channel attacks (SCA) are a serious threat to public-key cryptosystems, notably RSA [1]. These attacks aim at recovering a secret manipulated by cryptographic algorithms by analyzing various sources of side-channel leakage (time, power consumption, electromagnetic (EM) radiation, etc.) during their execution on a hardware device.
Countermeasures against simple (SPA) and differential (DPA) power analysis on RSA can be categorized into algorithmic and hardware countermeasures. The Square-and-Multiply Always [2] and the Montgomery Ladder [3] algorithms ensure that the binary method executes a constant sequence of operations, in order to prevent SPA-like attacks. To deal with DPA attacks, algorithmic countermeasures randomize the message or the exponent (private key) processed during a modular exponentiation. However, most of these countermeasures do not provide sufficient protection against high-order DPA attacks or sophisticated SPA attacks [4] [21].
The Residue Number System (RNS), coupled with SPA-protected methods, is an interesting alternative to increase the robustness at the arithmetic level. RNS provides a natural way of masking the data and the internal computations because all intermediate values can be represented in different RNS bases. However, differential, correlation and collision EM attacks [5] [6] [7] [8] remain fully effective if the RNS bases are not randomized to mask sensitive computations. This idea is the foundation of the Leak Resistant Arithmetic (LRA) concept proposed in [9].
The RSA hardware approach proposed in this work implements several countermeasures. To protect against correlation analyses and collision attacks, the design randomizes the moduli between two sets of RNS bases at the arithmetic level, which in turn requires the on-the-fly computation of the constants that would otherwise be pre-computed. For the modular exponentiation, the Montgomery Ladder algorithm is considered, even if other algorithms can be executed by our co-processor. The successive modular multiplications are computed with the RNS Montgomery algorithm [10], which needs two sets of k moduli because of the base extension step. For this crucial operation in the Montgomery multiplication, we use the fast approximation method [11], which is derived from the Chinese Remainder Theorem. Moreover, a hardware countermeasure randomizes the RAM addresses used by the reading and writing operations.
The rest of the paper is organized as follows. Section 2 gives a brief state of the art on the use of RNS for implementing public-key algorithms. Section 3 describes the hardware module we have designed and mapped onto an FPGA. Section 4 gives experimental results on the robustness of the RNS-RSA implemented on our crypto-module. Finally, a conclusion is drawn in Section 5.
Preliminaries

Residue Number System
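In the Residue Number System [12], an integer X is represented according to a base B = (b_1, b_2, ..., b_k) of pairwise coprime moduli by its residues $x_i = X \bmod b_i$, for i = 1..k.
To recover the original number X (modulo B, where B also denotes the product of the moduli), given the residues $x_i$, one may apply the Chinese Remainder Theorem (CRT):
$$X = \left| \sum_{i=1}^{k} \left| x_i \, |B_i^{-1}|_{b_i} \right|_{b_i} B_i \right|_{B}, \qquad B_i = B / b_i$$
The forward conversion is a key step before starting any computation in RNS. From the radix-$2^w$ representation $X = \sum_{j=0}^{n-1} X_j 2^{wj}$, the residues $x_i$ are obtained, for all $b_i \in B$, by:
$$x_i = \left| \sum_{j=0}^{n-1} X_j \, |2^{wj}|_{b_i} \right|_{b_i}$$
where the constants $|2^{wj}|_{b_i}$ are pre-computed for all i, j to speed up the forward conversion in RNS hardware modules by computing all residues in parallel.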
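The two conversions above can be illustrated with a minimal Python sketch. The moduli, the digit width `w` and the function names `to_rns` and `from_rns` are illustrative assumptions, not values or names taken from the hardware design.

```python
from math import prod

def to_rns(X, base, w, n):
    """Forward conversion: residues of X from its n radix-2^w digits."""
    digits = [(X >> (w * j)) & ((1 << w) - 1) for j in range(n)]
    residues = []
    for b in base:
        # |2^(w*j)|_b: pre-computed table entries in a hardware module
        consts = [pow(2, w * j, b) for j in range(n)]
        residues.append(sum(d * c for d, c in zip(digits, consts)) % b)
    return residues

def from_rns(residues, base):
    """CRT reconstruction of X modulo the product of the moduli."""
    B = prod(base)
    total = 0
    for x_i, b_i in zip(residues, base):
        B_i = B // b_i
        total += ((x_i * pow(B_i, -1, b_i)) % b_i) * B_i
    return total % B

base = [251, 241, 239, 233]   # hypothetical pairwise coprime 8-bit moduli
X = 123456789
res = to_rns(X, base, w=8, n=4)
```

Each residue is computed independently of the others, which is what allows the channels to run in parallel.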
RNS Montgomery Exponentiation
The core of any RSA implementation is the modular exponentiation $x^e \bmod N$, where e is the private exponent; this is the operation to be protected. To deal with timing and SPA attacks, we adopted the RNS version of the Montgomery Ladder exponentiation, as given in Algorithm 1. The computations are performed over two RNS bases A and B.
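As a reminder of the ladder structure, here is a minimal Python sketch on ordinary integers. It is not the RNS version of Algorithm 1: the Montgomery multiplications are replaced by plain modular products, and the Python branch is of course not constant-time. It only shows that each exponent bit triggers exactly one multiplication and one squaring.

```python
def montgomery_ladder(x, e, N):
    """Left-to-right Montgomery ladder computing x^e mod N."""
    a0, a1 = 1, x % N
    for i in reversed(range(e.bit_length())):
        if (e >> i) & 1:
            # bit 1: A0 = A0 * A1, then A1 = A1^2
            a0, a1 = (a0 * a1) % N, (a1 * a1) % N
        else:
            # bit 0: A1 = A0 * A1, then A0 = A0^2
            a0, a1 = (a0 * a0) % N, (a0 * a1) % N
    return a0
```

The invariant A1 = A0 * x (mod N) holds after every iteration, whatever the bit value.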
The pre-computed terms for the modular exponentiation are $B \bmod N$ and $B^2 \bmod N$ in bases A and B. The operation MM(x, y, N, B, A) returns the RNS Montgomery multiplication result $xyB^{-1} \bmod N$ in the two RNS bases A and B. For this crucial operation, the recent improvement proposed in [13] was adopted; it accelerates the original method [11] by 18% by rearranging the computations within the so-called base extensions (BE). In [13], two different strategies are proposed for that operation; in our approach, we adopted the fast approximation method, also called the Posch-Posch method [14]. Given the elements $x_i$ of X in base B, where $x_i = X \bmod b_i$ for i = 1..k, the fast approximation method ensures the existence of an integer λ < k, a CRT-correction coefficient, such that:
$$X = \sum_{i=1}^{k} \left| x_i \, |B_i^{-1}|_{b_i} \right|_{b_i} B_i - \lambda B, \qquad B_i = B / b_i$$
and λ can be calculated by:
$$\lambda = \left\lfloor \sum_{i=1}^{k} \frac{\left| x_i \, |B_i^{-1}|_{b_i} \right|_{b_i}}{b_i} \right\rfloor$$
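The role of the correction coefficient λ in a base extension can be sketched as follows. This Python example computes λ exactly with big integers, whereas the fast approximation method estimates it from a few most significant bits; the moduli and the name `base_extend` are assumptions made for illustration.

```python
from math import prod

def base_extend(residues, base_b, base_a):
    """Extend X from base_b to base_a using the CRT correction lambda."""
    B = prod(base_b)
    # |x_i * |B_i^-1|_{b_i}|_{b_i} for every channel of base_b
    xi = [(x * pow(B // b, -1, b)) % b for x, b in zip(residues, base_b)]
    # Exact lambda = floor(sum(xi_i / b_i)); hardware only approximates it
    lam = sum(x * (B // b) for x, b in zip(xi, base_b)) // B
    out = []
    for a in base_a:
        s = sum(x * ((B // b) % a) for x, b in zip(xi, base_b))
        out.append((s - lam * (B % a)) % a)
    return out, lam

base_b = [251, 241, 239]      # hypothetical base B
base_a = [233, 229, 227]      # hypothetical base A
X = 1234567                   # smaller than the product of base_b
ext, lam = base_extend([X % b for b in base_b], base_b, base_a)
```

Since every term of the sum is smaller than 1, λ is always smaller than the number of moduli k.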
RNS Bases Randomization -The LRA Countermeasure
DPA attacks exploit the relation between the power consumption and the internal variables to recover the bits of the private key. The leak resistant arithmetic (LRA) countermeasure [9] provides a way to completely mask the internal computations and thus protects against differential or correlation power (or EM) analysis at the arithmetic level. Before each modular exponentiation, the two bases A and B (each of size k) are randomly selected among a set of 2k moduli. In this way, an integer w (an intermediate result of the modular exponentiation represented in the Montgomery domain) has $\binom{2k}{k} \approx 2^{2k} / \sqrt{\pi k}$ different RNS representations in bases A or B, which offers a high level of randomization. These randomly selected RNS bases are then used during the entire computation. The authors of [9] also suggested reinforcing the robustness by selecting new bases during the exponentiation, possibly before each MM. However, this second approach may become much slower: it implies two additional MM each time new RNS bases are chosen, or even four extra MM if the Montgomery Ladder is used for the exponentiation. Here, the base randomization is performed once before each exponentiation, using the Montgomery Powering Ladder as depicted in Algorithm 3. With the LRA countermeasure, the Montgomery constants $B \bmod N$ and $B^2 \bmod N$ are obtained on the fly by using the pre-computed term $AB \bmod N$ (in A ∪ B) in the first two Montgomery multiplications. Note the order of A and B in these first two calls of MM in Algorithm 3.
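The following Python sketch illustrates the random split and why the single term AB mod N is enough to enter the Montgomery domain of the selected base. It works on integers rather than residues, and it assumes that the first two calls take the form MM(1, AB mod N, N, A, B) and MM(x, AB mod N, N, A, B), which is one reading consistent with the text; the moduli, the toy modulus N and the helper names are hypothetical.

```python
import random
from math import comb, pi, prod, sqrt

def split_bases(moduli, rng):
    """Randomly split 2k moduli into two bases A and B of size k."""
    shuffled = moduli[:]
    rng.shuffle(shuffled)
    k = len(shuffled) // 2
    return shuffled[:k], shuffled[k:]

def mm(x, y, N, R):
    """Integer model of a Montgomery multiplication: x*y*R^-1 mod N."""
    return (x * y * pow(R, -1, N)) % N

k = 4
moduli = [251, 241, 239, 233, 229, 227, 223, 211]   # hypothetical 2k moduli
A_set, B_set = split_bases(moduli, random.Random(1))
A, B = prod(A_set), prod(B_set)
N = 3233                                            # toy modulus, 53 * 61

# A*B is the product of all 2k moduli, so it does not depend on the split
AB_mod_N = (A * B) % N

# With the roles of A and B swapped, MM divides by A and leaves B
one_mont = mm(1, AB_mod_N, N, A)    # equals B mod N
x = 42
x_mont = mm(x, AB_mod_N, N, A)      # equals x*B mod N

choices = comb(2 * k, k)                       # number of possible splits
approx = 2 ** (2 * k) / sqrt(pi * k)           # approximation from the text
```

Because AB mod N is independent of the random split, it can be stored once, while B mod N itself changes with every selection.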
The RNS Montgomery Multiplication needs pre-computed constants related to the random choice of RNS bases A and B. These pre-computed constants must be obtained on-the-fly before each modular exponentiation. The LRA precomputations necessary for the Montgomery multiplication are:
Algorithm 3: RNS Montgomery Powering Ladder with LRA [9]
And then, we obtain:
Then, all constants |b
A∪B and the RNS base sets A and B should be pre-computed.
After the modular exponentiation, the result must be converted back to radix. With the LRA countermeasure, a CRT-based reverse conversion would need the on-the-fly computation of the values $B_i$ and B in radix-$2^w$ form, which is very costly. Instead, the Mixed-Radix System (MRS) [12] is adopted for the RNS-to-radix conversion. The mixed-radix system is a weighted representation of an RNS number. The conversion is computed in two steps. First, the MRS representation of $x_i$ (the RNS representation of X in B) is obtained using the optimized Garner's Algorithm [17]; its pre-computed values, the inverses $|b_i^{-1}|_{b_j}$, do not depend on the RNS base randomization. Second, the MRS result is converted to radix by applying Horner's scheme, as also presented in [17]. The reverse conversion implies carry-based arithmetic. However, the time spent on these operations is negligible compared to the modular exponentiation.
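The two-step reverse conversion can be sketched in Python as follows. This is the textbook form of Garner's algorithm followed by Horner's scheme, not the optimized variant of [17]; the moduli and function names are illustrative assumptions.

```python
def rns_to_mrs(residues, base):
    """Mixed-radix digits v with X = v0 + v1*b0 + v2*b0*b1 + ..."""
    k = len(base)
    v = list(residues)
    for i in range(k):
        for j in range(i + 1, k):
            # |b_i^-1|_{b_j} depends only on the pair of moduli,
            # not on which base (A or B) they were assigned to
            inv = pow(base[i], -1, base[j])
            v[j] = ((v[j] - v[i]) * inv) % base[j]
    return v

def mrs_to_int(v, base):
    """Horner evaluation of the mixed-radix digits (carry-based step)."""
    X = 0
    for digit, b in zip(reversed(v), reversed(base)):
        X = X * b + digit
    return X

base = [251, 241, 239]        # hypothetical moduli
X = 1234567
digits = rns_to_mrs([X % b for b in base], base)
```

Only pairwise inverses are needed, so a table indexed by the 2k moduli covers every possible random split.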
Proposed and Developed Hardware
The proposed hardware computes the forward conversion (radix to RNS), the LRA pre-computations, the modular exponentiation and the reverse conversion (RNS to radix), using the same set of independent data-paths, called RNS Units, depicted in Fig. 1. The implementation follows a scheme similar to the one proposed in [11] and improved in [16], called the cox-rower architecture.
As described in Section 2, the LRA pre-computations, which derive the constants required by the Montgomery multiplication in the RNS bases A and B, need a set of pre-computed values. To store them, the RNS Units contain dual-port RAM memories, and each RNS Unit holds all pre-computed elements of all moduli of A and B. This causes a memory overhead but speeds up the on-the-fly pre-computations.
The core of each RNS Unit is the arithmetic logic unit (ALU), which computes the modular additions/subtractions, the modular products and the carry-based arithmetic operations of the reverse conversion (CRT or MRS). To accelerate the modular reductions, we adopted the method proposed in [15]. This solution uses pseudo-Mersenne numbers of the form $b_i = 2^w - c_i$, where $c_i < 2^{w/2}$, for the chosen set of RNS moduli. Then, to compute $x \bmod b_i$ one first performs the following step twice:
$$x \leftarrow (x \bmod 2^w) + c_i \left\lfloor x / 2^w \right\rfloor$$
Then x will be in the range $[0, 2^{w+1}]$ and a final conditional subtraction of $b_i$ returns the residual value. The coefficient $c_i$ is also an input of the ALU block. As the RNS base randomization (LRA) makes the RNS Units operate with different moduli, all $c_i$ (for i = 1..2k) are stored in a ROM memory. Each RNS Unit performs operations for one RNS channel of A and one of B; the selection of these channels, and of the respective coefficient $c_i$, is defined by the random index input from the control unit.
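The folding step can be checked with a short Python sketch. It assumes the usual pseudo-Mersenne identity 2^w ≡ c (mod b); the function name and the example values of `w` and `c` are hypothetical. The final subtraction is written as a loop so that the sketch stays correct for every input, even if a single conditional subtraction is the common case.

```python
def reduce_pseudo_mersenne(x, w, c):
    """Reduce x < 2^(2w) modulo b = 2^w - c, with c < 2^(w/2)."""
    b = (1 << w) - c
    mask = (1 << w) - 1
    for _ in range(2):
        # Fold the high part back in, since 2^w is congruent to c modulo b
        x = (x & mask) + c * (x >> w)
    while x >= b:
        x -= b
    return x

w, c = 16, 15                 # hypothetical modulus b = 2^16 - 15
b = (1 << w) - c
product = (b - 1) * (b - 2)   # a product of two residues
r = reduce_pseudo_mersenne(product, w, c)
```

The reduction only needs a small multiplication by c, additions and shifts, which is why this family of moduli suits the ALU.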
The architecture also contains an adder block, called f block, for computing the f values in the two base extensions. This block basically sums up all input values (q B in the first base extension and q in the second base extension) and returns the k most significant bits of this sum, named f .
The hardware countermeasure also relies on RAM access protection. According to Algorithm 3, four registers (A0 in A, A0 in B, A1 in A and A1 in B) store the intermediate values resulting from the modular multiplications and squarings in the binary loop of the Montgomery Ladder. So, for example, if a modular multiplication
$$A_0 = MM(A_0, A_1, N, B, A)$$
is executed when the exponent bit is 1, the reading and writing operations will be:
read: A0 in A, A0 in B, A1 in A, A1 in B; write: A0 in A, A0 in B.
On the other hand, if a modular multiplication
$$A_1 = MM(A_0, A_1, N, B, A)$$
is executed when the exponent bit is 0, the reading and writing operations will be:
read: A0 in A, A0 in B, A1 in A, A1 in B; write: A1 in A, A1 in B.
Note that the same registers are read but different registers are written. EM analyses based on localized EM radiation [18] or on the control and RAM leakages [20] show that, if the RAM accesses are unprotected, the private key bits can be recovered using sophisticated SEMA or location-based EM attacks. In order to randomize the position of the registers, and consequently the addresses, where the intermediate results A0 and A1 (in A and B) are stored, we propose the scheme depicted in Fig. 2 in all RNS Units.
Consider the first modular multiplication $A_0 = MM(A_0, A_1, N, B, A)$. The control reads the registers A0 and A1 (in A and B) from the RAM addresses 0h+j, 1h+j, 2h+j and 3h+j (indicated by 'r') and, instead of storing the modular multiplication result A0 (in A and B) in the same positions (0h+j and 1h+j), stores it in the random positions 5h+j and 6h+j (indicated by 'w'). Since the exponent bit is $e_i = 1$, the next operation is a modular squaring $A_1 = MM(A_1, A_1, N, B, A)$. The control reads the registers A1 (in A and B) from addresses 2h+j and 3h+j and, instead of storing the result in the same positions, places it at the random addresses 4h+j and 7h+j. In the next modular multiplication, the registers A0 and A1 will be read from these random positions. With this hardware countermeasure, the storage position of the intermediate values changes during the modular exponentiation, blurring the EM emanations. The side-channel leakage due to RAM addressing is then suppressed, because the results are always stored at different addresses. The next section shows practical EM attacks on both unprotected and secured RAM.
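A toy software model of this address randomization is sketched below in Python. It assumes four logical registers mapped onto eight physical slots, as in the example above, and picks the destination uniformly among the free slots; the class name and the selection policy are assumptions, not a description of the actual control logic of Fig. 2.

```python
import random

class RandomizedRegisterFile:
    """Toy model: 4 logical registers mapped onto 8 physical RAM slots."""
    NAMES = ("A0_A", "A0_B", "A1_A", "A1_B")

    def __init__(self, rng, slots=8):
        self.rng = rng
        self.ram = [None] * slots
        # Initial placement at slots 0..3, as before the first write
        self.where = {name: i for i, name in enumerate(self.NAMES)}

    def read(self, name):
        return self.ram[self.where[name]]

    def write(self, name, value):
        """Store value in a random free slot and remap the register."""
        used = set(self.where.values())
        free = [s for s in range(len(self.ram)) if s not in used]
        slot = self.rng.choice(free)
        self.ram[slot] = value
        self.where[name] = slot   # the old slot becomes free again
        return slot

rf = RandomizedRegisterFile(random.Random(0))
for name in RandomizedRegisterFile.NAMES:
    rf.write(name, 0)
old_slot = rf.where["A0_A"]
new_slot = rf.write("A0_A", 1234)
```

A result is never written back to the slot it was read from, so the write address no longer identifies which logical register was updated.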
With k the number of RNS moduli in each of the bases A and B, the total number of clock cycles for a Montgomery multiplication is 2k + 37. The LRA countermeasure needs 64k + 36 clock cycles for the pre-computations. Table 1 summarizes the number of clock cycles for the 512-bit RSA, which can also compute a 1024-bit CRT-RSA, and the synthesis results for the FPGA implementation (low-cost Spartan 3E family), including the number of kilobytes occupied by the pre-computed terms stored before the exponentiation and the memory space needed during the exponentiation. The results are provided for the two RSA-RNS implementations. As indicated, the time overhead due to the LRA countermeasure is only 1%. The memory (kilobytes) and area (LUTs and Slices) overheads due to the countermeasures are 92% and 3%, respectively.
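The two cycle counts can be combined into a rough estimate of the relative cost of the LRA pre-computations. The Python sketch below assumes two Montgomery multiplications per exponent bit and ignores the conversions and the initial multiplications, so it is not expected to reproduce Table 1; the values of k and of the exponent length are placeholders.

```python
def mm_cycles(k):
    """Clock cycles of one RNS Montgomery multiplication."""
    return 2 * k + 37

def lra_cycles(k):
    """Clock cycles of the LRA pre-computations."""
    return 64 * k + 36

def ladder_cycles(bits, k):
    """Ladder cost, assuming one multiplication and one squaring per bit."""
    return 2 * bits * mm_cycles(k)

def lra_overhead(bits, k):
    """LRA pre-computation cost relative to the ladder alone."""
    return lra_cycles(k) / ladder_cycles(bits, k)

k, bits = 16, 512             # hypothetical parameters
ratio = lra_overhead(bits, k)
```

Because the pre-computations run once per exponentiation while the ladder cost grows with the exponent length, the relative overhead shrinks for longer exponents.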
Robustness to EM Analyses
Collision attacks, or chosen-message pair attacks, threaten modular exponentiations by exploiting the existence of identical computations. Correlation electromagnetic analysis (CEMA) seeks to recover the secret information by computing the correlation between the EM traces and some guessed intermediate values that are manipulated, or not, by the device depending on the exponent bits.
To evaluate the relevance of the LRA and hardware countermeasures, we first applied these attacks to an unprotected hardware design, i.e. an RNS-RSA with fixed bases, to set a robustness reference level. Then, we re-applied these attacks to our protected implementation in order to quantify the robustness enhancements. To generalize the notation of the acquired EM traces, we define the following:
$$EM(T_E, x, e) = \{ EM(T_M, x, e_{n-1}), EM(T_S, x, e_{n-1}), \ldots, EM(T_M, x, e_0), EM(T_S, x, e_0) \}$$
where $EM(T_E, x, e)$ is the set of all multiplication and squaring intervals during a modular exponentiation with the exponent $e = \{e_{n-1}, e_{n-2}, \ldots, e_1, e_0\}$, input message x and:
1. $EM(T_M, x, e_i)$ = EM trace of a modular multiplication (M) performed during the time window $T_M$ with the exponent bit $e_i$;
2. $EM(T_S, x, e_i)$ = EM trace of a modular squaring (S) performed during the time window $T_S$ with the exponent bit $e_i$;
3. $T_E$ = time window of a full modular exponentiation.
We also define V em (t, x) as being the variation of the EM field at the time t of a modular exponentiation having x as input message.
The EM traces were collected with a measurement platform composed of: an oscilloscope (bandwidth: 2.5 GHz; sampling rate: 40 GS/s), an amplifier with a bandwidth of 200 MHz, a 200 µm probe, a motorized stage, an FPGA Spartan-3 XC3S1600 board and a PC to control the whole measurement setup.
EM Collision Attacks
Collision attacks are SPA-like attacks based on the choice of pairs of messages. Basically, an adversary has to measure the power consumption or the EM emanations while the cryptosystem processes these two chosen messages. Then, he applies a sliding procedure to the two collected traces to detect, by subtraction, the occurrence of an identical computation. Such collisions typically appear during the squaring operations of modular exponentiations. Several collision attacks have been proposed in the literature. In the Doubling Attack (DA) [6] and in Yen et al.'s attack [7], collisions are observed in squaring operations; both apply to left-to-right exponentiation algorithms. Homma et al.'s attack [8] also applies to right-to-left exponentiations, contrary to the DA and Yen et al.'s attack. As explained in [8], it is based on a different choice of the input messages to provoke collisions in both right-to-left and left-to-right exponentiation algorithms.
Because the Montgomery powering ladder is a left-to-right algorithm, we considered the Doubling Attack. Following the DA procedure, we truncated, re-aligned and subtracted the EM traces, and we confirmed the occurrence of the same intermediate modular squaring results. Figures 3(a) and (b) show how to select and align the traces related to the chosen messages in order to have a reference and a target frame.
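The arithmetic behind these collisions can be reproduced on integers with a short Python sketch. It records the operand squared at each ladder step for the inputs x and x^2 and looks for matches between a step of the x^2 run and the following step of the x run. The toy modulus, the exponent and the function names are assumptions; in this integer model a match appears whenever two consecutive exponent bits are equal, which includes the consecutive-zero case discussed in the text.

```python
def squaring_operands(x, e, N):
    """Operand squared at each ladder step, most significant bit first."""
    a0, a1 = 1, x % N
    ops = []
    for i in reversed(range(e.bit_length())):
        if (e >> i) & 1:
            ops.append(a1)                          # A1 is squared
            a0, a1 = (a0 * a1) % N, (a1 * a1) % N
        else:
            ops.append(a0)                          # A0 is squared
            a0, a1 = (a0 * a0) % N, (a0 * a1) % N
    return ops

def doubling_collisions(x, e, N):
    """Steps t where the squaring for x^2 at step t matches x at step t+1."""
    ref = squaring_operands(x, e, N)
    tgt = squaring_operands((x * x) % N, e, N)
    return [t for t in range(len(ref) - 1) if tgt[t] == ref[t + 1]]

e = 0b1001101                 # toy exponent; steps 1 and 2 are both zero bits
hits = doubling_collisions(2, e, 3233)
```

With fixed RNS bases, identical operands lead to identical residues in every channel, which is what the subtraction of the two traces detects.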
The first experiment was done on the unprotected RNS-RSA design, in which the RNS bases are always fixed. One averaged EM trace (20 trials) per chosen message was needed to identify the occurrence of collisions with our EM platform. Fig. 3(c) shows the result of a collision analysis on the target RSA-RNS hardware implemented without countermeasures. Note that the amplitude of the differential trace is close to zero where redundant computations are performed (depicted as 'region of interest'). To illustrate the effect of our countermeasures, Fig. 3(d) shows the differential traces when the DA is applied to the RNS-RSA with randomization of the RNS bases. As expected, collisions cannot be detected visually when the countermeasures are activated, despite the use of the averaging mode of the oscilloscope (20 trials). To demonstrate the efficiency of the DA and quantify the effects of our countermeasures, we define a collision detection criterion by plotting the evolution of the Signal-to-Noise Ratio (SNR) with the number of trials used for the averaging. According to the DA, if the exponent has consecutive zero bits at $e_i$ and $e_{i-1}$, the EM traces $EM(T_S, x, e_{i-1})$ and $EM(T_S, x^2, e_i)$ represent redundant squarings (collision). The SNR was computed according to:
$$SNR = \frac{\sigma^2\left(EM(T_S, x, e_{i-1})\right)}{\sigma^2\left(EM(T_S, x, e_{i-1}) - EM(T_S, x^2, e_i)\right)} \qquad (5)$$
where $\sigma^2(EM(T_S, x, e_{i-1}))$ is the variance of the samples over the time window $T_S$ corresponding to a squaring operation and $\sigma^2(EM(T_S, x, e_{i-1}) - EM(T_S, x^2, e_i))$ is the variance of the differential trace samples over the time window $T_S$. We defined SNR1 when $EM(T_S, x, e_{i-1}) = EM(T_S, x^2, e_i)$ (collision) and SNR2 when $EM(T_S, x, e_{i-1}) \neq EM(T_S, x^2, e_i)$ (no collision). As shown in Fig. 4(a), if a collision occurs, SNR1 is significantly bigger than SNR2 because the denominator of Eq. 5 is almost 0 (suppression of the signal by the collision; only the noise remains), even with no averaging. As shown in Fig. 4(b), collisions cannot be detected when the randomization of the RNS bases is activated, even when averaging the two signals over 1000 trials.
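The behaviour of this criterion can be illustrated on synthetic data with the Python sketch below. The traces are simulated as a data-dependent signal plus Gaussian noise; the noise level, the trace length and the helper names are arbitrary assumptions and do not model the measurement platform.

```python
import random
from statistics import pvariance

def collision_snr(ref, target):
    """Variance of the reference over the variance of the differential trace."""
    diff = [r - t for r, t in zip(ref, target)]
    return pvariance(ref) / pvariance(diff)

rng = random.Random(0)
n = 500
signal_a = [rng.gauss(0, 1) for _ in range(n)]   # squaring of one operand
signal_b = [rng.gauss(0, 1) for _ in range(n)]   # squaring of another operand

def measure(signal, noise=0.1):
    """One simulated acquisition: the signal plus measurement noise."""
    return [s + rng.gauss(0, noise) for s in signal]

snr1 = collision_snr(measure(signal_a), measure(signal_a))   # collision
snr2 = collision_snr(measure(signal_a), measure(signal_b))   # no collision
```

When the two windows carry the same signal, only the noise survives the subtraction and the ratio becomes large; otherwise the denominator contains both signals and the ratio stays small.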
CEMA
Correlation EM Analysis (CEMA) aims at revealing the secret key K manipulated by a circuit by analyzing the correlation between its EM emanations and guesses on the secret key. The higher the correlation, the more likely the guess. To apply a CEMA to an RSA implementation, the adversary should be able to randomly generate the input data x of the implementation under attack, or to observe ciphertexts. At the same time, he has to measure the variations of the EM field $V_{em}(t, x)$ at time t. He then starts the CEMA procedure by choosing a selection function.
Key Guess and Selection Function: in our case, the adversary, knowing that the considered algorithm is the Montgomery Powering Ladder, may generate 8-bit guesses on the secret key, starting from the MSB. In this way, he has a manageable set of sub-key guesses. Once these sub-key guesses are generated, the adversary computes, for each guess k, the corresponding variations of the power consumption at a chosen time in the course of the algorithm, using the Hamming weight model W(x, k). This time typically corresponds to the computation of an intermediate value that depends on the sub-key. For RNS-RSA, the adversary must know the bases A and B. If this is not the case, he first has to perform a long and tedious CEMA on the forward conversion (radix to RNS) to recover them. In this case, the guesses used in the selection function are the values of the RNS moduli themselves, instead of the private key bits used in the classic CEMA. Assuming these RNS bases are known, the adversary may now predict the power consumption variations (and therefore the EM field variations) with the manipulated data x for each key guess k. As the Montgomery multiplication results are obtained in parallel, he has to choose one RNS channel on which to compute the Hamming weight. Assuming n is the register width, the selection function follows the linear model d(x, k) = W(x, k) − n/2. This is done for each guess of the 8-bit sub-key. The CEMA is expected to return an estimate of the key by identifying the guess k leading to the highest correlation value during the course of the algorithm. The correlation is computed between d(x, k) and the EM traces $V_{em}(t, x)$ of single measurements as a function of time t:
$$c(t, k) = \frac{\mathrm{cov}\left(V_{em}(t, x), d(x, k)\right)}{\sigma_{V_{em}(t, x)} \, \sigma_{d(x, k)}}$$
To illustrate the effect of the RSA countermeasures against CEMA, we evaluated the relation between the number of EM traces and the peak margin observed for the correct guess of the sub-key relative to the incorrect ones. Figure 5(a) shows the evolution of the peak of the correlation index c(t, k) with the number of EM traces when the architecture performs modular exponentiations with fixed RNS moduli. It is possible to identify the correct hypothesis after processing 500 EM traces when the RSA has no countermeasures. With the LRA countermeasure and secured RAM accesses, the correlation curve associated with the secret key remains buried among the other correlation curves even after processing 10k traces.
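The key-guess loop of a CEMA with known, fixed bases can be sketched in Python on simulated leakage. The sketch assumes a toy modulus, a single hypothetical RNS channel, an 8-bit register and a leakage equal to the Hamming weight plus Gaussian noise at one time sample; the guessed intermediate value is modeled as x^k mod N reduced on that channel, which is a simplification of the Montgomery-domain values handled by the hardware.

```python
import random
from statistics import mean

N, CHANNEL, WIDTH = 3233, 251, 8   # toy modulus, one RNS modulus, register width

def hamming_weight(v):
    return bin(v).count("1")

def selection(x, k):
    """d(x, k) = W(x, k) - n/2 on one channel of the guessed intermediate."""
    return hamming_weight(pow(x, k, N) % CHANNEL) - WIDTH / 2

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when one series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

rng = random.Random(0)
true_k = 0xB5                      # hypothetical 8-bit sub-key
messages = [rng.randrange(2, N) for _ in range(1000)]
# Simulated leakage at one time sample for each message
traces = [selection(x, true_k) + rng.gauss(0, 1.0) for x in messages]

scores = {k: pearson([selection(x, k) for x in messages], traces)
          for k in range(256)}
best = max(scores, key=scores.get)
```

If the channel modulus changed randomly from one exponentiation to the next, the predictions d(x, k) would no longer match the values actually processed, which is the effect the LRA countermeasure relies on.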
RAM Memory Randomization
The LRA countermeasure offers a high level of randomization for the internal variables. Collision and CEMA attacks are defeated because the Hamming weight of an internal variable cannot be estimated to find the secret. Considering that an RSA hardware design is usually composed of an arithmetic block (ALU), a control unit (CPU), buses and memories (RAM, ROM), other sources of leakage may be found. The control and the memories also perform operations that depend on the exponent bits, mainly through the values of the memory addresses. In the case of the Montgomery Ladder, the RAM leakages are generated by the different addresses used for reading and writing the multiplication or squaring results. Simple EM analysis, template attacks [22] or attacks based on a single execution (SE) of an exponentiation [21] [4] [19] may then exploit the leakage caused by RAM addressing in the Montgomery Ladder and in other SPA-protected exponentiation algorithms. SE attacks on exponentiation are also a threat to classical algorithmic countermeasures such as message or exponent blinding; however, they depend on the quality of the measured traces. If the SNR is very low, meaning that the trace contains a large amount of noise, the probability of recovering leaking information from a single trace is quite low. The analyses developed here illustrate the design vulnerabilities related to RAM accesses when the hardware countermeasure based on address randomization is not used.
Initially, an adversary can proceed as follows, assuming the exponentiation is always performed with a fixed exponent. He sends random messages x to the device and collects an averaged EM trace representing the multiplication when the exponent bit is 1, $EM(T_M, x, 1)$, and another representing the multiplication when the exponent bit is 0, $EM(T_M, x, 0)$. The adversary may then obtain the differential trace
$$EM(T_M, x, 1) - EM(T_M, x, 0)$$
which may reveal the leakages of the control and RAM accesses, as illustrated in Fig. 6(a). The leakage is indicated by higher amplitudes during the RAM reading (r) and writing (w) operations. The procedure adopted by the adversary is the one just described: average the traces for each value of the exponent bit, then subtract them. The same procedure can be verified in Fig. 6(b) by subtracting the EM traces of modular squarings when the RAM addressing is not randomized. Following the notations of Alg. 2, the amplitudes at the first samples of the differential traces represent the multiplications s = x.y in the two RNS bases A and B, during which the RAM is accessed in order to read the values x and y. The modular multiplication results $w_A$ and $w_B$ must also be stored in the RAM, and this activity is indicated in the differential trace by the higher amplitudes representing the RAM writing. Fig. 6(c) and (d) show the differential EM traces obtained after randomizing the RAM addresses. As we can see, these leakages were suppressed.
Now, if the exponent is randomized ($e_r = e + r \cdot \varphi(N)$), the attack must process single traces. Template and SE attacks assume that for each multiplication $EM(T_M, x, 0)$ or $EM(T_M, x, 1)$ there is at least one sampled point in time $t_i$ at which the amplitude of the EM emanations follows a normal distribution $\mathcal{N}(\mu_{M0}, \sigma_{M0})$ for $EM(T_M, x, 0)$ and $\mathcal{N}(\mu_{M1}, \sigma_{M1})$ for $EM(T_M, x, 1)$. In a favorable scenario, the point $t_i$ coincides with the EM emanation of a RAM access. To justify this model, we acquired 10000 EM traces from the RSA design mapped on the FPGA, with a known private key. Fig. 7(a) shows the histogram of the amplitude (in mV) at a fixed point where the architecture performs a memory access by writing the multiplication results to the RAM. The sample points $t_i$ during memory accesses follow normal distributions with different means $\mu_{M0}$, $\mu_{M1}$ and standard deviations $\sigma_{M0}$, $\sigma_{M1}$. Likewise, Fig. 7(b) shows the histogram at the fixed point $t_i$ where the architecture performs a RAM write after the squarings. With RAM addressing randomization, the same points $t_i$ for $EM(T_M, x, 0)$ and $EM(T_M, x, 1)$ present similar distributions, meaning that the SNR is reduced and SE attacks become more difficult. Fig. 7(c) and (d) show the normal distributions for multiplication and squaring, respectively. Note that the means and standard deviations are very close even for different exponent bits.
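The Gaussian model used by template and SE attacks can be sketched in Python as a maximum-likelihood decision on a single sample point. The profiling amplitudes below are made-up numbers in mV, and the helper names are assumptions; the sketch only shows how separated distributions allow a per-bit decision.

```python
from math import log, pi
from statistics import mean, stdev

def fit_template(samples):
    """Estimate (mu, sigma) of the amplitude at the point of interest."""
    return mean(samples), stdev(samples)

def log_likelihood(v, mu, sigma):
    """Log-density of a normal distribution N(mu, sigma) at v."""
    return -0.5 * log(2 * pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)

def classify_bit(v, template0, template1):
    """Return the exponent bit whose template explains the sample best."""
    l0 = log_likelihood(v, *template0)
    l1 = log_likelihood(v, *template1)
    return 1 if l1 > l0 else 0

# Hypothetical profiling amplitudes (mV) at the RAM write, per exponent bit
profiling0 = [9.8, 10.1, 10.4, 9.9, 10.0, 10.3]
profiling1 = [13.7, 14.2, 14.0, 13.9, 14.4, 14.1]
t0, t1 = fit_template(profiling0), fit_template(profiling1)
```

When the two templates have nearly identical means and standard deviations, as reported with randomized addresses, the two log-likelihoods are almost equal and the decision is close to a random guess.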
Conclusion
In this paper, a performance and robustness evaluation of an RSA crypto-core implemented with RNS was presented. We evaluated countermeasures at the algorithmic, arithmetic and hardware levels in order to provide protection against side-channel analysis. The Montgomery Powering Ladder exponentiation is adopted to protect against simple side-channel analysis. We show that collision-based attacks remain effective against an RSA-RNS with fixed bases. To defeat sophisticated SPA and collision attacks, we implemented countermeasures at the arithmetic and hardware levels, by randomizing the RNS bases and the RAM addresses, respectively. The time overhead due to the countermeasures is about 1%.
