Abstract-The hardware implementation of modular exponentiation for very large integers is a well-known topic in digital arithmetic. An effective approach for obtaining parallel and carry-free implementations consists in using the Montgomery exponentiation algorithm and executing the necessary operations in the Residue Number System (RNS). Two efficient methods for performing the RNS Montgomery exponentiation have been proposed by Kawamura et al. and by Bajard and Imbert. These approaches mainly differ in the algorithm used for implementing the base extension. This paper presents a modified RNS Montgomery exponentiation algorithm, where several multiplications are moved outside the main execution loop and replaced by an effective pre-processing stage, producing a significant saving on the overall delay with respect to state-of-the-art approaches. Since the proposed modification can be applied to both of the above algorithms, two versions are specifically discussed. The overall comparison shows that with the proposed approach, an 18.5% speedup can be achieved for an implementation over 1024 bits, without any significant area overhead.
I. INTRODUCTION
The computation of the Montgomery exponentiation (ME) in the Residue Number System (RNS) [1] allows limiting the delay due to carry propagation and reaching a high degree of parallelism [2] . This approach mainly requires the execution of a set of Montgomery multiplications (MMs) [3] . However, in RNS, some operations (e.g. division, comparison, modulo) are natively difficult to execute. Hence, several approaches have been proposed in order to fully exploit the potential of RNS for modular exponentiation, by minimizing the impact of related drawbacks. A key element of these approaches is the Base Extension (BE), which calculates a number on a different RNS base.
In [4], Kawamura et al. proposed an RNS ME technique applied to RSA, and a new BE. The BE is characterized by a summation that provides a result modulo a small multiple of the base, which is corrected after the sum of each element. The proposed approach has been detailed in [5], where an architecture has also been presented, showing that it is faster than non-RNS approaches.
In [6] , Bajard et al. proposed an implementation of the MM based on both the RNS and the Mixed Radix Number System (MRS), where the MRS corresponds to a weighted system associated with the RNS. Then, in [7] , the same authors proposed a Montgomery multiplication method fully implemented in RNS. This approach employs an approximated BE and the algorithm proposed in [8] , where the result is approximated and corrected by using an extra modulo. Finally in [9] , Bajard and Imbert detailed the application of the previous ME approach in the context of RSA.
The RNS MM has also been studied in contexts other than exponentiation. In [10], an implementation for elliptic curve cryptography on FPGA is presented. This implementation employs the algorithm proposed in [4] with some pre-computations suitable to the particular architecture presented in the paper. In [2], an implementation of RNS MM on GPU is presented. Experimental results achieved in the above context show that the algorithm described in [7] provides the best performance.
This paper proposes a modified ME algorithm for RNS. The novelties of the algorithm are in the pre-processing stage, which is used to reduce the number of multiplications performed in the loop of the exponentiation algorithm. It is worth observing that, even though the ME is considered in this specific case, the basic approach used here is more general and can be easily applied in other contexts. In particular, the analysis of the designed pre-processing method for generic modular multiplication is presented in [11].
The differences with respect to previous approaches consist in the values that are pre-computed, and in the new ME and MM algorithms. The modifications discussed in the current work are applicable to both the state-of-the-art MM algorithms [4] , [7] , which mainly differ for the BE correction methods. Hence, in the following, the proposed modifications will be presented in two versions specifically tailored to the characteristics of these methods.
A detailed algorithmic analysis has been carried out, comparing the proposed approach with the common part of the state-of-the-art algorithms [4], [9]. An analysis at the architectural level is also carried out, in order to fully exploit the effects of the two versions of the proposed approach. An architecture based on the work presented in [5] is designed and analyzed. Then, a new architecture suitable to the algorithm proposed in [9] is proposed, and an overall comparison is presented, showing that with the proposed approach an 18.5% speedup can be achieved for an implementation over 1024 bits, without any significant area overhead.
The remainder of the paper is organized as follows: in Section II, the proposed RNS ME algorithm is presented, while in Section III, it is analyzed and compared to related works. In Section IV, the implementation of the arithmetic cells is analyzed, and the effects of the proposed approach are discussed. In Section V, some conclusions are drawn.
II. PROPOSED ALGORITHM
This section illustrates the proposed technique, by analyzing the modifications introduced with respect to previous approaches. Table I provides a description of the main symbols used in the discussion.
A. RNS
In RNS, a number is represented according to a base $A = (a_1, a_2, \ldots, a_k)$, which is made up of k pairwise relatively prime numbers, where k is called the base size. Therefore, any number x, where $0 \le x < A = \prod_{i=1}^{k} a_i$, is uniquely represented by a sequence of non-negative integers $(x_1, x_2, \ldots, x_k)$, where $x_i = x \bmod a_i$.
It is worth observing that, because of the independence of all the elements, in RNS the multiplication, addition, and subtraction operations can be carried out independently and in parallel for each element.
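Using the Chinese Remainder Theorem (CRT) [12], it is possible to convert a value x from an RNS base to a radix system, achieving a high parallelism. The reconstruction expression is:
$$x = \left( \sum_{i=1}^{k} \left| x_i A_i^{-1} \right|_{a_i} A_i \right) \bmod A \qquad (1)$$
where $A_i = A / a_i$ and $A_i^{-1}$ is the multiplicative inverse of $A_i$ modulo $a_i$.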
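As an illustration of the RNS representation and of the CRT reconstruction, the following minimal Python sketch converts two integers to RNS, multiplies them element by element, and converts the result back. The base (13, 17, 19) and the helper names `to_rns`, `rns_mul`, and `from_rns` are illustrative assumptions and do not come from the referenced works.

```python
from math import prod

def to_rns(x, base):
    """Return the residues of x for each modulus of the base."""
    return [x % a for a in base]

def rns_mul(xs, ys, base):
    """Element-wise modular multiplication; each channel is independent."""
    return [(x * y) % a for x, y, a in zip(xs, ys, base)]

def from_rns(residues, base):
    """CRT reconstruction: x = (sum_i |x_i * A_i^-1|_{a_i} * A_i) mod A."""
    A = prod(base)
    total = 0
    for x_i, a_i in zip(residues, base):
        A_i = A // a_i
        total += ((x_i * pow(A_i, -1, a_i)) % a_i) * A_i
    return total % A

base = [13, 17, 19]   # illustrative pairwise coprime moduli, A = 4199
x, y = 57, 61         # x * y = 3477 < A, so the product is representable
product = from_rns(rns_mul(to_rns(x, base), to_rns(y, base), base), base)
```

Since no carry crosses from one residue channel to another, the k channel multiplications can be executed by k independent units.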
B. Montgomery Exponentiation (ME)
The ME is based on the MM, where the MM of x and y modulo N gives $w = xyR^{-1} \bmod N$. ME computes $x^e \bmod N$ at the average cost of $\frac{3}{2}\log_2 e + 2$ MMs. Let us denote $\bar{x}$ and $\bar{y}$, such that $\bar{x} = xR \bmod N$ and $\bar{y} = yR \bmod N$; then $z = \bar{x}\bar{y}R^{-1} \bmod N = xyR \bmod N$. Therefore, the exponentiation can be executed by iterating MM on $\bar{x}$.
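The structure of the ME can be sketched on ordinary integers, before any RNS mapping. The Python fragment below is only a reference model under stated assumptions: R is any integer larger than N and coprime to it, and the function names `mont_mul` and `mont_exp` are hypothetical. It performs one MM to enter the Montgomery domain, one squaring per exponent bit, one extra MM per nonzero bit, and one final MM, which agrees with the average cost quoted above.

```python
def mont_mul(x, y, N, R):
    """Montgomery multiplication: returns x * y * R^-1 mod N for x, y < N < R."""
    n_prime = (-pow(N, -1, R)) % R      # -N^-1 mod R (precomputable)
    t = x * y
    q = (t * n_prime) % R               # makes t + q * N divisible by R
    w = (t + q * N) // R                # exact division, w < 2N
    return w - N if w >= N else w

def mont_exp(x, e, N, R):
    """Left-to-right square-and-multiply built only on mont_mul."""
    x_bar = mont_mul(x, (R * R) % N, N, R)   # x * R mod N
    z = R % N                                 # Montgomery form of 1
    for bit in bin(e)[2:]:
        z = mont_mul(z, z, N, R)              # squaring at every bit
        if bit == "1":
            z = mont_mul(z, x_bar, N, R)      # multiplication on set bits
    return mont_mul(z, 1, N, R)               # leave the Montgomery domain
```

For example, `mont_exp(7, 65537, 1000003, 1 << 20)` is expected to agree with Python's built-in `pow(7, 65537, 1000003)`.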
The comparison between the state-of-the-art RNS ME algorithm (the main part of the ME algorithm is common to both [4] and [9]) and the proposed algorithm is shown in Fig. 1. Since R = B, $B \bmod N$ and $B^2 \bmod N$ must be precomputed.
Step 1 in Algorithm 1 [4], [9] calculates $\bar{x}$, as previously described. Steps 2 and 3 initialize the exponentiation process, which is executed in the loop from step 4 to step 9. The proposed RNS ME algorithm (Algorithm 2) executes three more multiplication steps than Algorithm 1 outside the loop, in order to execute fewer multiplication steps in the MM algorithm. Before executing the loop, all the values on base A are multiplied by $A_j^{-1}$; the values multiplied by $A_j^{-1}$ in A are represented with a hat accent. In order to reach the correct result of the exponentiation, another multiplication by $A_j$ is required after the loop of RNS MMs (step 12).
C. RNS Montgomery Multiplication
In general, in RNS the MM [4], [7] is performed on two RNS bases, $A = (a_1, \ldots, a_k)$ and $B = (b_1, \ldots, b_k)$.
[Figure 1: Algorithm 1, state-of-the-art RNS Montgomery exponentiation [4], [9] (input: x in $A \cup B \cup a_r$ and the exponent e), compared with the proposed algorithm.]
[Figure 2: Comparison between the proposed RNS Montgomery multiplication and the state-of-the-art algorithm [4], [7] (input: x, y, and N in $A \cup B \cup a_r$).]
The relevant characteristics of the MM implementation in RNS (Fig. 2) are described in the following.
Step 3 is only performed on B, so that the modular reduction by B does not require additional operations. After the modular reduction, the BE to A of step 4 is required, since a subsequent step requires a different base in order to perform a division by B, and gcd(A, B) = 1. The multiplication in step 5 and the addition in step 6 are only performed on A, since the result of the multiplication in B is equal to the additive inverse on B of the result of step 1, and so the result of the addition in B is 0. The multiplication by $B^{-1}$ in step 7 is only performed on A, since B has no multiplicative inverse on base B. The last operation is the BE to B, so that the result can be used as input for other MMs.
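The sequence of operations described above can be sketched in Python as follows. This is only a functional model: both base extensions are replaced by an exact CRT conversion (an assumption made for brevity, not the approximate BEs of [4] or [7]), the small bases are illustrative, and the names `from_rns` and `rns_mont_mul` are hypothetical.

```python
from math import prod

def from_rns(residues, base):
    """Exact CRT reconstruction, used here in place of a hardware BE."""
    M = prod(base)
    return sum(((r * pow(M // m, -1, m)) % m) * (M // m)
               for r, m in zip(residues, base)) % M

def rns_mont_mul(xa, xb, ya, yb, N, base_a, base_b):
    """Return x * y * B^-1 mod N (possibly plus N) on both bases."""
    B = prod(base_b)
    # s = x * y on both bases
    s_a = [(x * y) % a for x, y, a in zip(xa, ya, base_a)]
    s_b = [(x * y) % b for x, y, b in zip(xb, yb, base_b)]
    # q = s * (-N^-1), computed on base B only
    q_b = [(-s * pow(N, -1, b)) % b for s, b in zip(s_b, base_b)]
    # BE of q from B to A (exact in this sketch)
    q = from_rns(q_b, base_b)
    q_a = [q % a for a in base_a]
    # (s + q * N) * B^-1 on base A only: the sum is 0 on every channel of B
    w_a = [((s + qi * N) * pow(B, -1, a)) % a
           for s, qi, a in zip(s_a, q_a, base_a)]
    # BE of the result from A to B, so that it can feed another MM
    w = from_rns(w_a, base_a)
    w_b = [w % b for b in base_b]
    return w_a, w_b

base_a, base_b, N = [13, 17, 19], [7, 11, 23], 101   # A = 4199 > 2N, B = 1771 > N
x, y = 55, 77
w_a, w_b = rns_mont_mul([x % a for a in base_a], [x % b for b in base_b],
                        [y % a for a in base_a], [y % b for b in base_b],
                        N, base_a, base_b)
w = from_rns(w_a, base_a)
```

With the exact conversion the result satisfies w < 2N, so it is congruent to $xyB^{-1}$ modulo N but is not necessarily fully reduced.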
In [7] and [4], the respective authors have proposed two different techniques to perform the BE. Both techniques are based on (1), but avoid the last modular reduction by A in order to save computational effort. In [4], the final modular reductions of both BEs required by the RNS MM are replaced by an approximation and a correction. In [7], the final modular reduction of the first BE is not executed, so λB is added to the correct result. However, the second BE gives the exact value due to a correction executed at the end of the summation of multiplications. The adopted correction technique has been presented in [8], and it requires that all the values calculated on A are also calculated on an additional base element $a_r$.
This paper presents a new RNS MM, Algorithm 4, which requires fewer modular multiplication steps. In order to reach this result, the algorithm is modified: must be pre-computed in order to avoid a multiplication step. 2) The multiplication by N in A of Algorithm 3 step 5 can be moved into the previous BE, and merged with the summation of multiplications by $A_j$ in A of (1). Therefore, $A_j N$ must be pre-computed in order to avoid a multiplication step. A A j must be precomputed.
The proposed approach is presented in two versions, which employ different BE algorithms. The only difference in the MM algorithm is the presence of the RNS base element $a_r$, since this special correction base is required only by the version based on the approach proposed in [7], which uses the BE of [8] for the second one. According to this version, Algorithms 1, 2, 3, and 4 include the base element $a_r$.
In the following, the steps of the algorithms in j . In Algorithm 4, steps 3, 5, 6, and 7 of Algorithm 3 are moved into the first BE. Moreover, the multiplication by $A_j$, which is required to correct the input, is also moved into the first BE. Fig. 3 shows the first BE proposed by Kawamura et al. in [4] (Algorithm 5), and the proposed BE based on the approach by Kawamura et al. (KBE1) (Algorithm 6). KBE1 includes all the operations of Algorithm 5 as well as the operations of Algorithm 3 steps 3, 5, 6, and 7. Therefore, it not only extends a value but also calculates $\hat{w}$ of step 3, Algorithm 4, from $s$ and $\hat{s}$ of steps 1 and 2.
D. BE based on the approach by Kawamura et al. (KBE)
The modular reductions by A in the first BE and by B in the second BE would require a great computational effort, so they are replaced by an approximation and a correction, which are expected to reduce the delay. The result of the summation in the BE of x from A to B is x + λA, where λ ∈ Z and 0 ≤ λ < k. Instead of performing the modular reduction in order to reach x, the algorithm proposed by Kawamura et (2) , and ∧ means a bitwise AND operation. Kawamura et al. introduced a further variable (α), which represents the starting value of the parameter that is used to correct the error introduced by the approximation. In [4], two theorems prove that with correct values of α, with a low , and by selecting the base elements so that $2^r$ is close to $b_i$, the approximation does not introduce errors. During the first BE α = 0 and the input is unknown; thus, according to Theorem 2 in [4], the result of the BE is an approximate value lower than 2B. During the second BE α = 0.5 and the input is lower than 2N; hence, according to Theorem 1 in [4], the result of the BE is correct. With r = 32, k = 33, and max(2
This approach requires that A ≥ 2N and B ≥ 4N . The two BEs have the same algorithm with exchanged bases, but according to the theorems presented in [4] , the first BE produces an approximate result, which is corrected by the second BE.
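The role of α and of the bitwise AND can be illustrated with the following Python sketch of a truncation-based approximate BE in the style commonly attributed to [4]. It is an assumption-laden model rather than a transcription of Algorithm 5: the word size r = 16, the number q of retained bits, the moduli, and the function name `approx_be` are all chosen for illustration only. The accumulator `sigma` estimates the multiple of A that the summation exceeds x by, and one multiple of A is subtracted every time it crosses an integer.

```python
from math import prod

def approx_be(x_a, base_a, base_b, r, q, alpha):
    """Extend x from base A to base B, estimating the overflow from q bits."""
    A = prod(base_a)
    mask = ((1 << q) - 1) << (r - q)   # keeps the q most significant of r bits
    sigma = int(alpha * (1 << r))      # fixed-point accumulator, scaled by 2^r
    y = [0] * len(base_b)
    for x_i, a_i in zip(x_a, base_a):
        A_i = A // a_i
        xi = (x_i * pow(A_i, -1, a_i)) % a_i
        sigma += xi & mask             # truncated estimate of xi / a_i
        carry = sigma >> r             # 1 when the running sum crosses an integer
        sigma &= (1 << r) - 1
        y = [(y_j + xi * A_i - carry * A) % b_j for y_j, b_j in zip(y, base_b)]
    return y

base_a = [65536, 65535, 65533]   # pairwise coprime, all close to 2^16
base_b = [65531, 65527, 65521]   # pairwise coprime with each other and with base A
A = prod(base_a)
x = 123456789012                 # well below A / 2
x_a = [x % a for a in base_a]
corrected = approx_be(x_a, base_a, base_b, r=16, q=8, alpha=0.5)
uncorrected = approx_be(x_a, base_a, base_b, r=16, q=8, alpha=0.0)
```

In this model, α = 0.5 yields the exact residues because the accumulated estimation error stays below α and x is below (1 − α)A, whereas α = 0 may leave one extra multiple of A in the result.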
Step 1 of Algorithm 5 is the multiplication of each $u_{b_i}$ by $B_i^{-1}$, modulo $b_i$. In KBE1, this multiplication is merged with the multiplication by $-N^{-1}$ of Algorithm 3 step 3. By the associative property, it can be easily demonstrated that the result of this operation is the same as the result of the corresponding operation in the BE by Kawamura et al.
Step 3 of Algorithm 5 is a simple initialization. In KBE1, step 3 executes the multiplication in step 7 of Algorithm 3 merged with the multiplication by A j , used to correct the A It can also be easily demonstrated that the result of step 8 of KBE1 is the same result of the corresponding operations in Algorithm 5, for the associative property and for the distributive property.
The second BE proposed in [4], Algorithm 7 in Fig. 4, uses the same algorithm as the first BE (Algorithm 5), but with the bases switched. The proposed algorithm (KBE2), Algorithm 8, is the same, but it does not perform step 1 of Algorithm 7, since the input is already multiplied by $A_j^{-1}$.
E. BE based on the approach by Bajard et al. (BBE)
[Figure 4: Comparison between the second BE presented in [4] and the proposed BE (KBE2).]
In [7], Bajard et al. propose an MM algorithm requiring two different BEs; the former, Algorithm 9 in Fig. 5, trades approximation for speed, whereas the latter, Algorithm 11 in Fig. 6, originally proposed by Shenoy and Kumaresan [8], corrects the result. The result of the approximate BE of x is x + λB, and no correction steps are performed in order to reach the correct result. Further details are presented in [9], where the algorithm is applied to the ME in the context of RSA. The approximation does not affect the final result of the MM, provided that overflows after the BE are avoided through the use of larger bases guaranteeing:
The first BE proposed in this work, Algorithm 10 in In BBE1, the multiplication of Algorithm 9 step 1 is merged with the multiplication by $-N^{-1}$ of Algorithm 3 step 3. It can be easily demonstrated that the result of this operation is the same as the corresponding operation in Algorithm 9.
In step 2 of BBE1, the summation is initialized with the multiplication of Algorithm 3 step 7 merged with the multiplication by A j , used to correct the A The approximation correction is only performed after the second BE, and it requires an additional RNS base element a r , such that gcd(a r , A) = 1 and gcd(a r , B) = 1. All the values in A are also calculated in a r , according to [8] .
The second BE employed in [7] , Algorithm 11, requires a correction in order to avoid the approximation.
[Figure 6: Comparison between the second BE used in [7] and the proposed BE (BBE2).]
Step 2 calculates the difference between the correct value of x in $a_r$ and the result of the approximate BE on $a_r$, which corresponds to:
The second proposed BE based on the approach by Bajard et al. (BBE2), Algorithm 12 in Fig. 6 , corresponds to Algorithm 11 without step 1; this step can be avoided since the input of the BE is already multiplied by A −1 j .
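The two BEs of this approach can be contrasted with a short Python sketch. It models the uncorrected summation, which returns x plus a multiple λ of the source base, and the redundant-modulus correction in the style of Shenoy and Kumaresan [8], where λ is recovered on the extra element $a_r$. The bases, the choice $a_r = 8$, and the function name `redundant_be` are illustrative assumptions; a single big-integer sum is used for readability, whereas a hardware implementation would accumulate each channel modulo its own element.

```python
from math import prod

def redundant_be(x_a, x_r, base_a, a_r, base_b):
    """Return (uncorrected residues, corrected residues, lambda) on base B."""
    A = prod(base_a)
    s = 0
    for x_i, a_i in zip(x_a, base_a):
        A_i = A // a_i
        s += ((x_i * pow(A_i, -1, a_i)) % a_i) * A_i   # s = x + lambda * A
    # The difference between the summation and the true residue on a_r
    # equals lambda * A modulo a_r, with 0 <= lambda < k <= a_r.
    lam = ((s - x_r) * pow(A, -1, a_r)) % a_r
    uncorrected = [s % b for b in base_b]
    corrected = [(s - lam * A) % b for b in base_b]
    return uncorrected, corrected, lam

base_a, base_b, a_r = [13, 17, 19], [7, 11, 23], 8   # a_r coprime to A, a_r >= k
A = prod(base_a)
x = 3000
uncorrected, corrected, lam = redundant_be(
    [x % a for a in base_a], x % a_r, base_a, a_r, base_b)
```

Choosing $a_r$ as a power of two, as in this example, makes the reduction on the redundant channel a simple bit selection.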
III. ALGORITHM ANALYSIS
In this section, the proposed approach is evaluated and compared with the state-of-the-art algorithms. The analysis is focused on the MM, which requires the majority of the total computational time.
A. Number of modular multiplications
Both the approaches in [4] and [9] have been evaluated by the respective authors according to the number of modular multiplications required. Table II reports the comparison of the proposed RNS MM algorithm with the previous ones. It can be easily seen that the algorithm presented in [9] achieves a reduction of k modular multiplications with respect to [4] , whereas the proposed algorithm allows a further saving of 3k modular multiplications.
B. Analysis and classification of the required multiplications
As shown in Table II, the described algorithms require $2k^2$ and between 5k and 9k modular multiplications. All the necessary multiplications (not considering the BE correction) are listed and classified in Table III. Since each operation is performed on k base elements and only one multiplication is performed on a larger base, up to k cells can perform the required operations in parallel. Thus, the multiplications shown in Table III are organized in 2k + 8 multiplication steps, each composed of k parallel multiplications. The multiplication steps are classified according to the opportunity of parallelization and pipelining. These aspects are of paramount importance, since different multiplication steps can require a different number of execution steps. The identified types of multiplication steps are:
• full, where the beginning of the operation must wait for the completion of the previous operation that calculates an input value; the number of required execution steps is p, where p corresponds to the number of pipeline stages;
• parallelizable, where a group of operations can be executed in parallel, or the first operation can be executed as full, and the subsequent ones can be pipelined;
• fully parallelizable, where an operation can be executed in parallel to the previous and/or to the subsequent one, requiring 0 execution steps.
C. Remarks
Considering that k cells can perform k modular multiplications in one multiplication step, without considering the BE correction, the RNS MM involves: 6 full multiplication steps (IDs 1, 3, 4, k+6, k+7, and k+8) , two groups of k parallelizable multiplication steps (IDs from 6 to k+5 and from k+9 to 2k+8), and 2 fully parallelizable multiplication steps (IDs 2 and 5).
[Table II: columns [4], [7], Proposed (with KBE), Proposed (with BBE); row "Steps 1, 3, and 4 of MM": 5k, 5k, 2k, 2k; row "First BE without correction".]
[Table III legend: one marker denotes "The multiplication is executed by the algorithm"; the marker • denotes "The multiplication is not executed".]
In the RNS MM used in [4], [9], by considering M as the number of parallel multipliers per cell, the fully parallelizable step does not need the result of the previous operation. Hence, it can be fully parallelized by any pipelined architecture and it does not affect the overall delay. This step requires $1/(p+M-1)$ execution steps. When executed by an architecture with M ≤ k, the two groups of consecutive parallelizable multiplications require p execution steps for the first multiplication step, and 1/M execution steps for each other multiplication step, corresponding to $2k/M - 2 + 2p$ execution steps. Each full multiplication step requires p execution steps, which correspond to 6p execution steps in total.
Considering an architecture with k cells, the number of execution steps required by the RNS MM used in [4], [9], without the BE correction, is $2k/M - 2 + 1/(p+M-1) + 8p$.
As shown in Table III, the proposed algorithm allows achieving a reduction of 4p steps (IDs 4, k+6, k+7, and k+8), and requires $1/(p+M-1)$ additional steps (ID 5). Therefore, the improvement due to the proposed modification is directly related to the number of pipeline stages and of parallel multipliers. Considering p = 3, M = 1 and k = 33 as in [4], without the error correction a delay reduction of 13.63% is obtained. With a higher degree of pipelining a larger reduction is achieved, e.g. 16.66% with p = 4 and M = 1, 19.23% with p = 5 and M = 1, etc.
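The quoted percentages can be checked numerically with the short Python sketch below. It assumes that the fractional term $1/(p+M-1)$ of the fully parallelizable steps is neglected in both the baseline and the saving, which appears to be the convention that reproduces the reported figures; the function names are hypothetical.

```python
def baseline_steps(k, p, M=1):
    """Execution steps of the state-of-the-art RNS MM without BE correction,
    neglecting the fractional term of the fully parallelizable steps."""
    return 2 * k / M - 2 + 8 * p

def delay_reduction_percent(k, p, M=1):
    """Relative saving obtained by removing four full multiplication steps."""
    return 100 * 4 * p / baseline_steps(k, p, M)

for p in (3, 4, 5):
    print(p, baseline_steps(33, p), round(delay_reduction_percent(33, p), 2))
```

With k = 33, p = 3, and M = 1 the baseline evaluates to 88 steps, so that the saving of 12 steps corresponds to about 13.6%.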
The proposed exponentiation algorithm requires 2p + 1 additional multiplication steps, but their impact on the total delay is negligible since it is equal to (2p 8p) ), e.g. < 0.01% with p = 3, M = 1, and iteration > 1024.
IV. IMPLEMENTATION AND RESULTS
In this section the state-of-the-art architectures exploited in [4] and [9] are described and analyzed. Kawamura et al. presented some details about their architecture in [5]. This implementation is composed of a set of identical cells, where each cell is matched to one base element for each RNS base, or to a set of elements for each base. The cells are made up of a Modular Multiplier and Accumulator Unit (MMAU) with three pipeline stages, a Cox Unit for the correction of the BE, and some memory. The details regarding the implementation based on the algorithm in [7] have not been presented, so in this section a new architecture tailored to this particular case is proposed. Except for the correction unit, the architecture adopted by Kawamura et al. is also suitable for the approach proposed by Bajard et al., which nonetheless requires a separate cell for the redundant base element, instead of the Cox Unit.
In order to reach an efficient implementation, the multiplications are performed through reduction trees of Carry Save Adders (CSAs), whereas the addition is achieved by means of Carry Look Ahead Adders.
A. Modular reduction
The modular reduction approach is the same used in [5] . In [13] , the authors showed that the modulo reduction of 
where $x < 2^z$, $z > r$, and $c_i < 2^h$. Thus, it is $y < \max(2^{r+1}, 2^{z-r+h+1})$ and each iteration of this method can reach a reduction of $r - h - 1$ bits. In order to reach a larger reduction per step with $2^{2r} < x < 2^{4r-2h-1}$, it is possible to calculate:
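The bounds above are consistent with a folding reduction for base elements of the form $2^r - c_i$, in which the part of x above bit r is multiplied by $c_i$ and added to the lower r bits. The Python sketch below illustrates this idea under that assumption; the modulus form, the example constant c = 5, and the function name `fold_reduce` are not taken from the source.

```python
def fold_reduce(x, r, c):
    """Reduce x modulo m = 2**r - c by repeated folding (assumes 0 <= c < 2**(r-1))."""
    m = (1 << r) - c
    mask = (1 << r) - 1
    while x >> r:                        # while x does not fit in r bits
        x = (x & mask) + c * (x >> r)    # uses 2**r = c (mod m)
    while x >= m:                        # x < 2**r < 2m: at most one subtraction
        x -= m
    return x

r, c = 32, 5
samples = [0, (1 << 32) - 5, (1 << 64) - 1, 123456789012345678901234567890]
results = [fold_reduce(x, r, c) for x in samples]
```

Each folding iteration shortens the operand because $c_i$ is much smaller than $2^r$, which is why a small h gives a large reduction per step.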
B. Cell architecture without BE correction
Without the error correction, a cell basically corresponds to a suitably controlled MMAU. The starting point for the design is represented by the architecture proposed in [5], which is shown in Fig. 7 (p = 3 and M = 1). The MMAU is divided into three pipelined units:
C. Analysis of the Cell without BE Correction
In order to evaluate the area and delay of the analyzed architecture, the number of gates that compose the arithmetic cells, and that represent the critical path, have been counted and converted into the equivalent inverter delay, and into the equivalent number of transistors, respectively. For the conversion, the metric in [14], which is summarized in Table IV, was selected. Tables V and VI show the area and delay characteristics of the described cells considering r = 32, k = 33, and h = 11, as in [5], [9]. The delay of the described architecture corresponds to the delay of the longest critical path multiplied by the number of steps required. With the previous algorithm, the considered architecture needs 2k + 22 steps, while with the proposed RNS MM algorithm, it requires only 2k + 10 steps. Considering k = 33, the proposed algorithm reaches a time saving of 13.63% compared with the previous ones, due to the smaller number of steps.
D. Error correction with the approach by Kawamura et al.
[Table note: the values between curly brackets are obtained using merged reduction trees.]
The algorithm proposed in [4] can be implemented by adding to each cell a Cox Unit. This unit, which is illustrated in Fig. 8, is composed of an adder, a register, and a set of AND gates. The delay of this unit corresponds to an adder and one AND gate. According to [5], a suitable value for , which represents the size of the adder in the Cox Unit, is 9, with r = 32, h = 11, and k = 33; in this case, the delay required by the unit can be estimated at 22.6 inverters. Kawamura et al. use an architecture similar to Fig. 8, and they place the Cox Unit in parallel to the reduction tree. The area of the reduction tree is r FA larger. The area and the delay of the units with the additional input line are reported in Table VII.
E. Error correction with the approach by Bajard et al.
The algorithm employed in [9] can be implemented by using the architectures previously described, but it requires an additional cell matched to a redundant base element. The aim of this cell is to calculate the BE correction, which is used by the other cells as a standard input value. Therefore, the only other difference is the sequence of operations. The architecture of the redundant cell is shown in Fig. 9. It is composed of the Multiplier Unit (MU) and the Adder Unit (AU). The number of represented bits is smaller than in the other cells, according to the requirements, so two multiplications can be processed in parallel and added in one step. Moreover, $a_r$ can be a power of 2, so no reduction is required. The area overhead is similar to that of the approach proposed in [4]. The BE correction requires an additional step. However, as suggested in [9], it is possible to avoid a multiplication using tables, but the result from the table should be summed by adding an input line.
F. Overall comparison and concluding remarks
Table VIII summarizes the results of the comparison among the considered approaches. It is possible to observe that the BE correction does not noticeably affect the area required by the cell. The BE correction proposed by Bajard et al. requires an additional step (unless tables are used for the correction multiplication), but the Kawamura et al. correction increases the delay of the MMAU, which represents the critical path of the cell. Therefore, the approach proposed by Bajard et al. provides a 5.8% time saving.
The proposed algorithm provides a delay reduction linked to the BE correction algorithm, and it does not affect the area. Compared to the Kawamura et al. correction, the delay reduction is 13.6%, while compared to the Bajard et al. correction it is 13.4%. The most efficient cell is obtained by mixing the BE approach used in [9] with the proposed algorithm, since it does not require additional area and it provides an 18.5% delay reduction with respect to [4] .
V. CONCLUSION
In this paper a novel RNS Montgomery exponentiation algorithm is proposed. The algorithm is presented in two versions, targeted to the BE approaches adopted in [4] and in [9] , respectively. The architecture proposed in [5] , that is compliant with the approach in [4] , has been used as a reference point for the design of a new architecture suitable for the method adopted in [9] .
An algorithmic analysis has shown that the proposed approach is capable of providing a reduction of 4p− 1/(p+ M − 1) steps over the 2 k/M − 2 + 1/(p + M − 1) + 8p required by each RNS MM (without considering the BE correction). Then, an architectural analysis has shown that, with the BE proposed in [4] , the total number of steps and the reduction are the same, whereas with the BE approach used in [9] , the reduction is the same but the MM requires one additional step. According to the algorithmic characteristics described in [4] , [9] , and to the architectural features described in [5] , the delay reduction is equal to 13.6% or 13.4% depending on whether the BE adopted in [4] or in [9] is used, respectively.
