This paper reports constant-time CPU and GPU software implementations of the RSA exponentiation, using algorithms that offer a first-line defense against timing and cache attacks. In the case of GPU platforms, the modular arithmetic layer was implemented using the Residue Number System (RNS) representation. We also present a CPU implementation of an RNS-based arithmetic that takes advantage of the parallelism provided by the Advanced Vector Extensions 2 (AVX2) instructions. Moreover, we carefully analyze the performance of two popular RNS modular reduction algorithms when implemented on many- and multi-core platforms. In the case of CPU platforms, we also report that a combination of the schoolbook and Karatsuba algorithms for integer multiplication, along with Montgomery reduction, yields our fastest modular multiplication procedure. In comparison with previous literature, our software library achieves faster timings for the computation of the RSA exponentiation using 1024-, 2048- and 3072-bit private keys.
I. INTRODUCTION
Public key cryptosystems play an important role in communication systems that require the exchange of sensitive information. Proposed by Rivest, Shamir and Adleman in 1978 [1], RSA has become the most deployed public key cryptosystem in practical applications. The signing/verification of digital certificates is a heavily used application of RSA, as an important fraction of commercial digital certificates have been created using RSA as their cryptographic engine. However, due to its relatively high latency, RSA must be carefully implemented to achieve reasonable timing performance and memory and code footprints. Moreover, the computation of RSA's main primitives, especially the modular exponentiation, must run in constant time. This feature presents a first line of defense against timing and cache attacks [2].
The RSA instantiation using 1024-bit keys (a.k.a. RSA-1024) has been widely used for computers on networks and traffic handling across the Internet. For applications required to achieve the 112- and 128-bit security levels, the National Institute of Standards and Technology (NIST)
recommended in [3] the usage of RSA-2048 and RSA-3072, respectively. This recommendation should be contrasted with the newest analysis of state-of-the-art integer factorization algorithms [4], [5], which estimates that RSA-1024 and RSA-2048 can barely achieve the 76- and 106-bit security levels, respectively.
The RSA algorithm produces a public/private key pair by first constructing a per-user 2k-bit modulus N = p · q, where p, q are two k-bit prime numbers. The RSA public key is the tuple composed of the modulus N and a public exponent e, which is generally chosen as e = 2^16 + 1. The RSA private exponent is defined as d = e^{-1} mod φ(N), where φ(·) stands for Euler's totient function. Given the RSA private key (d, N) and a message m, the Full Domain Hash (FDH) signature s of m is computed as s = H(m)^d mod N, where H(·) represents a hash function that maps m to Z_N. It has been shown that the FDH RSA signature is provably secure [6]. A standard trick based on the Chinese Remainder Theorem trades the 2k-bit RSA exponentiation s = H(m)^d mod N for the computation of two independent k-bit modular exponentiations of the form s_1 = h^{d mod (p−1)} mod p and s_2 = h^{d mod (q−1)} mod q, where h = H(m) and where s_1, s_2 can be calculated concurrently.
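For illustration, the following C sketch applies this CRT trick using the GMP library; the function and variable names are ours, and mpz_powm is used for brevity even though a hardened implementation would use a constant-time exponentiation (e.g., GMP's mpz_powm_sec):

```c
/* Sketch (not the paper's library): RSA signing with the CRT trick.
 * h = H(m) mod N, dp = d mod (p-1), dq = d mod (q-1) and
 * qinv = q^-1 mod p are assumed to be precomputed. */
#include <gmp.h>

void rsa_sign_crt(mpz_t s, const mpz_t h, const mpz_t p, const mpz_t q,
                  const mpz_t dp, const mpz_t dq, const mpz_t qinv)
{
    mpz_t s1, s2, t;
    mpz_inits(s1, s2, t, NULL);

    /* the two independent k-bit exponentiations (parallelizable) */
    mpz_powm(s1, h, dp, p);          /* s1 = h^dp mod p */
    mpz_powm(s2, h, dq, q);          /* s2 = h^dq mod q */

    /* Garner recombination: s = s2 + q * ((s1 - s2) * qinv mod p) */
    mpz_sub(t, s1, s2);
    mpz_mul(t, t, qinv);
    mpz_mod(t, t, p);
    mpz_mul(t, t, q);
    mpz_add(s, s2, t);

    mpz_clears(s1, s2, t, NULL);
}
```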
In this work, we focus our attention on the efficient parallel computation of s 1 and s 2 in GPU and CPU software implementations.
Most Internet transactions are executed using desktop computers, laptops and smartphones that are powered by multicore micro-architectures based on general purpose Central Processing Units (CPUs). On the other hand, taking advantage of their massive parallelism, Graphics Processing Unit (GPU) platforms have become an interesting option to speed up highly demanding computational tasks, such as the computation of several public key cryptographic primitives.
OUR CONTRIBUTIONS: In this work, two constant-time RSA software implementations for 1024-, 2048-, and 3072-bit RSA keys are presented.
Our CPU software implementation of RSA uses a combination of integer arithmetic algorithms and Montgomery reduction that helped us to exploit the fine-grained parallelism present in the latest Intel micro-architectures. We also took advantage of the multi-core architecture of modern Intel CPU processors to concurrently compute the two RSA exponentiations. Further, we explored the usage of the Advanced Vector Extensions 2 (AVX2) for achieving an efficient Residue Number System (RNS) arithmetic, as an alternative approach for the parallel computation of the RSA signature. Likewise, our RSA GPU implementation also employs RNS arithmetic, which permits taking better advantage of the massive parallelism available on this many-core architecture.
Our software implementation targeted four platforms, namely, a GeForce GTX TITAN GPU based on the Kepler architecture running at 876 MHz, a GeForce GTX 1080 GPU based on the Pascal architecture running at 1.81 GHz, and Intel Core i7 CPUs equipped with the Haswell and Skylake micro-architectures running at 2.6 GHz, 4 GHz and 1.8 GHz, respectively.
Our experimental results show that computing one RSA signature takes far less time when calculated using our CPU software library. However, our GPU software scales better for larger RSA key lengths, and offers better performance when not one but many RSA signatures must be computed at once.
The experimental results presented in this work outperform previously reported GPU RSA implementations [7]-[10] by factors of 1.24, 1.27 and 2.98 for RSA-1024, RSA-2048 and RSA-3072, respectively. Regarding our CPU implementation of RSA, our results outperform previous CPU software implementations [11], [12] by factors of 1.84, 1.15 and 1.19 for RSA-1024, RSA-2048 and RSA-3072, respectively.
The remainder of this paper is organized as follows. In §II a review of the main modular arithmetic algorithms used in this work is given. Then, in §III and §IV we present a detailed description of the CPU and GPU implementations of the RSA exponentiation, respectively. Finally, we draw some concluding remarks in §V.
II. ARITHMETIC BACKGROUND
One of the main objectives of this work is to perform a fast and constant-time modular exponentiation, which is required by the RSA signature algorithm. Hence, we start this section by describing in §II-A the regular-recoding exponentiation algorithm used for performing the RSA private operation. Furthermore, throughout this work we use two different approaches for computing the underlying modular arithmetic. The first approach adopts the Montgomery representation discussed in §II-B, whereas the second one uses an arithmetic layer based on the Residue Number System (RNS) representation as explained in §II-C. In this section, we present an overview of these two arithmetic representations.
NOTATION: Let N be a 2k-bit RSA modulus of the form N = p · q, where p, q are two k-bit prime numbers. Since an RSA exponentiation modulo N can be traded for two k-bit exponentiations modulo p and q, in this work we focus our attention on the computation of the operation y = x^e mod p, where it is assumed that all the operands have a bitlength of k bits. A k-bit integer can be accommodated in n = ⌈k/w⌉ words, where each word has a size of w bits. Throughout this work, word sizes of w = 32 and w = 64 bits will be assumed. An element a ∈ Z_p is represented in radix r as the array a = Σ_{i=0}^{n−1} a_i · r^i, where r = 2^w and 0 ≤ a_i < r. We say that the operand a has a word-length of n words. For the sake of simplicity, we will only consider operands with an even word-length. In particular, we are interested in the cases n = 8, 16, 24, required for computing RSA-1024, RSA-2048 and RSA-3072 signatures, respectively.
A. CONSTANT-TIME MODULAR EXPONENTIATION
We adopted a variant of the fixed-window exponentiation method, which starts by producing a regular recoding of the exponent. To this aim, we use the procedure proposed by Joye and Tunstall in [13], shown in Algorithm 1. Given a k-bit exponent, Algorithm 1 provides an encoding of length η = ⌈k/ω⌉ + 1, whose digits belong to the set {1, 2, …, 2^ω}, where ω is the prescribed window size.
Algorithm 1 Unsigned Exponent Regular Recoding [13]
Require: A k-bit exponent e, window size ω.
Ensure: Regular recoding f = (f_{η−1}, …, f_0) of e, with digits f_i ∈ {1, 2, …, 2^ω}.
1: i ← 0
2: while e > 2^ω do
3:   d ← e mod 2^ω
4:   if d = 0 then d ← 2^ω end if
5:   f_i ← d
6:   e ← e − d
7:   e ← e/2^ω
8:   i ← i + 1
9: end while
10: f_i ← e
11: return f

Algorithm 2 allows us to perform a constant-time modular exponentiation, which helps us to thwart basic timing side-channel attacks. Since this windowed exponentiation algorithm requires looking up a pre-computed table, it is necessary to implement a mechanism that protects the corresponding queries. Hence, whenever the pre-computed table is queried in Step 6, a linear-pass memory access is performed as a protective countermeasure [14]. This technique consists of traversing the entire pre-computed table every time that a certain position is accessed. In this way we ensure that all memory queries have the same running time. The protected modular exponentiation that computes y = x^e mod p is shown in Algorithm 2. This algorithm has a cost of exactly ⌈k/ω⌉ modular multiplications and k − 1 modular squaring operations.
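A minimal sketch of such a linear-pass lookup is shown below (entry size and names are illustrative, not the library's code): every table entry is read on every query, and a branch-free mask keeps only the requested one.

```c
/* Sketch of the linear-pass countermeasure used to protect Step 6 of
 * Algorithm 2: the whole table is traversed on every access, so the
 * memory trace is independent of the secret index. */
#include <stddef.h>
#include <stdint.h>

#define NWORDS 16   /* e.g., n = 16 words for an RSA-2048 operand */

void ct_table_lookup(uint64_t out[NWORDS],
                     const uint64_t (*table)[NWORDS],
                     size_t table_size, size_t secret_index)
{
    for (size_t j = 0; j < NWORDS; j++)
        out[j] = 0;

    for (size_t i = 0; i < table_size; i++) {
        /* all-ones when i == secret_index, all-zeros otherwise */
        uint64_t mask = (uint64_t)0 - (uint64_t)(i == secret_index);
        for (size_t j = 0; j < NWORDS; j++)
            out[j] |= table[i][j] & mask;
    }
}
```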
B. MONTGOMERY MODULAR MULTIPLICATION
In 1985, Montgomery proposed a novel method to compute modular multiplications without using trial divisions [15]. Montgomery suggested changing the operand representation to the so-called Montgomery domain. Let us define the Montgomery parameter R as R = r^n, where as before n represents the number of words necessary to represent the prime modulus p in radix r = 2^w; hence, r^{n−1} < p < r^n. The Montgomery representation ã of an element a ∈ Z_p is computed as ã = a · R mod p.
Let us assume that the elements a, b ∈ Z_p have Montgomery representations ã and b̃, respectively, and let d = ã · b̃. Then, the Montgomery product of ã and b̃ is defined as c̃ = ã · b̃ · R^{−1} mod p, which can be readily computed as

c̃ = (d + (µ · d mod R) · p) / R,   (1)
where the parameter µ, given as µ = −p^{−1} mod R, can be pre-computed off-line. Moreover, the reduction and division by R can be efficiently performed using fast right/left n-word shift operations. It can be shown that when 0 ≤ d < p^2, the result c̃ of Equation (1) is an integer in the interval [0, 2p[. Hence, at most a single conditional subtraction is needed to obtain 0 ≤ c̃ < p. This conditional subtraction must be performed in a constant-time fashion.
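For concreteness, the sketch below instantiates Equation (1) for a single-word modulus (n = 1, R = 2^64, assuming p < 2^63), using the gcc/clang __uint128_t extension; the n-word routines of Section III follow the same pattern:

```c
/* Minimal sketch of Equation (1) for a one-word modulus. mu is the
 * precomputed constant -p^-1 mod 2^64; the caller guarantees
 * d = a*b < p^2 and p < 2^63 so that c fits in one word. */
#include <stdint.h>

uint64_t mont_mul64(uint64_t a, uint64_t b, uint64_t p, uint64_t mu)
{
    __uint128_t d  = (__uint128_t)a * b;      /* d = a*b < p^2       */
    uint64_t    m  = (uint64_t)d * mu;        /* m = mu*d mod R      */
    __uint128_t mp = (__uint128_t)m * p;

    /* the low 64 bits of d + m*p vanish by construction, so only the
     * high halves (plus the carry out of the low halves) are added  */
    uint64_t carry = ((uint64_t)d != 0);
    uint64_t c = (uint64_t)(d >> 64) + (uint64_t)(mp >> 64) + carry;

    /* c < 2p: one constant-time conditional subtraction normalizes  */
    uint64_t mask = (uint64_t)0 - (uint64_t)(c >= p);
    return c - (p & mask);
}
```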
C. RNS MODULAR ARITHMETIC
In the nineties of the last century, several authors proposed the usage of the Residue Number System (RNS) as an alternative for computing modular arithmetic over large integer operands [16]-[19]. Taking advantage of the ancient Chinese Remainder Theorem (CRT), the main attraction of RNS is that it represents a large integer by means of a set of smaller independent numbers. In this way, one trades the computational cost of a single arithmetic operation over two large operands for the calculation of several independent smaller modular operations that may be computed in parallel. The RNS representation is defined as follows.
Let B be an RNS-basis consisting of a set of ℓ pairwise co-prime integer moduli B = {m_1, m_2, …, m_ℓ}, and let M = Π_{i=1}^{ℓ} m_i. Then, an integer a ∈ [0, M − 1] can be uniquely represented by the ℓ-tuple a = (a_1, a_2, …, a_ℓ), where each a_i is the residue of a modulo m_i. In the remainder of this paper this reduction operation will be written as a_i = |a|_{m_i}. From its RNS representation, the corresponding binary representation of a can be obtained using the following recovery formula,

a = | Σ_{i=1}^{ℓ} |a_i · M_i^{−1}|_{m_i} · M_i |_M ,  where M_i = M/m_i.   (2)
Let a and b be two large k-bit integers with a, b < M, represented as RNS tuples a = (a_1, a_2, …, a_ℓ) and b = (b_1, b_2, …, b_ℓ). Then, RNS addition ⊕ and RNS multiplication ⊗ are performed coefficient-wise as,

a ⊕ b = (|a_1 + b_1|_{m_1}, …, |a_ℓ + b_ℓ|_{m_ℓ}),
a ⊗ b = (|a_1 · b_1|_{m_1}, …, |a_ℓ · b_ℓ|_{m_ℓ}).   (3)
If the target platform is equipped with ℓ processing units, then the computational cost associated to any of the two RNS arithmetic operations of Eq. (3) is approximately the same as performing one single coefficient multiplication.
Remark 1: Let a, b be two k-bit integers. Then, the integer product d = a ⊗ b of Eq. (3) can be uniquely recovered from its RNS representation if and only if d < M. Since in general the integer product d is a 2k-bit number, it follows that an RNS representation of a, b and d requires an RNS-basis composed of ℓ w-bit moduli, with ℓ ≥ 2⌈k/w⌉ = 2n.
Remark 2: For the sake of efficiency, the integer moduli m_i are usually selected as m_i = 2^w − µ_i, where the µ_i coefficients are chosen as small as possible. If µ_i < 2^{w/2}, then the coefficients d_i for i = 1, …, ℓ of Eq. (3) can be efficiently computed by repeating at most twice the operation

t_i ← (t_i mod 2^w) + µ_i · ⌊t_i/2^w⌋,  with t_i initialized as t_i = a_i · b_i.
Thereafter it is guaranteed that t_i ∈ [0, 2^w[. Since 2^w > m_i, one may still need to compute a final reduction d_i = |a_i · b_i|_{m_i} = t_i mod m_i, which can be achieved at the cost of at most one subtraction. In order to assure a constant-time implementation, this reduction is carried out by executing exactly two reductions of the form t_i ← (t_i mod 2^w) + µ_i · ⌊t_i/2^w⌋, followed by exactly one conditional subtraction by m_i.
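The following C sketch illustrates Remark 2 for a single modulus, using w = 64 for readability (our AVX2 code of Section III works with w = 32); it assumes µ < 2^{w/2}, and the function name is illustrative:

```c
/* Sketch of Remark 2 for m = 2^w - mu, w = 64, mu < 2^32. The two
 * folding passes always execute, and one masked subtraction
 * normalizes the result into [0, m). */
#include <stdint.h>

uint64_t mulmod_pm(uint64_t a, uint64_t b, uint64_t m, uint64_t mu)
{
    __uint128_t t = (__uint128_t)a * b;      /* 2w-bit product      */

    /* t <- (t mod 2^64) + mu * floor(t / 2^64), applied twice      */
    t = (uint64_t)t + (__uint128_t)mu * (uint64_t)(t >> 64);
    t = (uint64_t)t + (__uint128_t)mu * (uint64_t)(t >> 64);

    /* here t < 2^64 + mu^2 < 2m; subtract m at most once, branch-free */
    __uint128_t mask = (__uint128_t)0 - (__uint128_t)(t >= m);
    return (uint64_t)(t - ((__uint128_t)m & mask));
}
```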
1) RNS MODULAR REDUCTION
A modular multiplication is often performed by first computing the integer multiplication d = a ⊗ b, followed by a reduction modulo p so that the resulting value lies in the range [0, p − 1]. In this work, we adopted the reduction approach proposed in [16], [18], and its adaptation to GPU platforms presented by Jeljeli in [21] (see also [22]). Crucially, this approach allows performing the modular reduction d mod p without leaving the RNS domain, as explained next.
Let d be a large integer represented in RNS using a basis composed of ℓ one-word moduli. Let us assume that d must be reduced modulo an n-word prime number p. Then, a strategy to perform the modular reduction d mod p can be obtained from a direct application of the RNS recovery formula of Eq. (2) as

d = | Σ_{i=1}^{ℓ} γ_i · M_i |_M ,  where γ_i = |d_i · M_i^{−1}|_{m_i}.   (4)
Notice that the coefficients γ_i for i = 1, …, ℓ as defined in Eq. (4) can be seen as a single RNS vector γ with ℓ coordinates. Notice also that d can be written as

d = Σ_{i=1}^{ℓ} γ_i · M_i − α · M,   (5)
where α is some positive integer and, by construction, 0 ≤ d/M < 1. From Eq. (5), dividing by M shows that the parameter α is given as α = ⌊Σ_{i=1}^{ℓ} γ_i/m_i⌋. Then the RNS vector z defined as

z = Σ_{i=1}^{ℓ} γ_i · |M_i|_p − |α · M|_p   (6)

is congruent to d mod p, but in general z ≥ d. In order to obtain a good approximation of α, which at the same time can be computed efficiently, one uses the fact that m_i ≈ 2^w. Hence, the ratio γ_i/m_i can be approximated by considering only the σ most significant bits of the γ_i coefficients, namely,

α̂ = ⌊ Σ_{i=1}^{ℓ} ⌊γ_i/2^{w−σ}⌋ / 2^σ + ε ⌋,   (7)
where σ is an integer in the range [1, w] and 0 < ε < 1 is an error-correcting parameter.
Remark 3: The integer part of the summation in Eq. (7) can be efficiently computed by considering the output carry c produced by the addition of the σ most significant bits of the coefficients γ_i with i = 1, …, ℓ. Notice that the output carry c is an integer in the range [0, ℓ[.
Algorithm 3 computes the RNS vector z ≡ d mod p as defined in Eq. (6). In Steps 4-6, ℓ processing units concurrently compute ℓ copies of the RNS vector γ given in Eq. (4). Although the computational cost of these steps is of ℓ RNS multiplications, their associated latency is very close to the latency associated to one RNS multiplication. The second loop of Algorithm 3 (Steps 7-13) completes the computation of the RNS vector z.
Step 9 performs ℓ and ℓ − 1 RNS multiplications and additions, respectively. As before, all these RNS multiplications can be computed in parallel, but their results must be sequentially added using a binary tree adder.
Step 10 computes α̂ at the cost of adding ℓ σ-bit integers (cf. Remark 3). In Step 11, the RNS vector z is finally obtained by performing one RNS multiplication and one RNS subtraction.
Summarizing, the latency associated to Algorithm 3 is the combined latency of three RNS multiplications, plus one RNS subtraction, plus the addition of ℓ σ-bit numbers. Algorithm 3 does not calculate d mod p exactly, but instead produces an RNS value congruent to it, which is bounded by 2^w · ℓ · p [21], [22]. In practice this implies that the RNS vector z must be accommodated using at least two extra safeguard moduli. Consequently, we increased the cardinality of the RNS basis B from ℓ to ℓ + 3 moduli. By taking this precaution, one guarantees that accumulating the thousands of modular multiplications required in the computation of a typical RSA exponentiation will not exceed the RNS bound M.
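As a scalar reference for Step 10 (Eq. (7) and Remark 3), the estimate α̂ can be accumulated in a single small register; the names and the fixed-point encoding of ε below are our illustrative choices:

```c
/* Scalar sketch of Eq. (7) with w = 32, as in the AVX2 code of
 * Section III. ERR is the fixed-point encoding floor(eps * 2^sigma)
 * of the error-correction term. */
#include <stddef.h>
#include <stdint.h>

uint32_t alpha_hat(const uint32_t *gamma, size_t ell,
                   unsigned sigma, uint32_t ERR)
{
    uint32_t acc = ERR;                   /* sigma fractional bits   */
    for (size_t i = 0; i < ell; i++)
        acc += gamma[i] >> (32 - sigma);  /* sigma MSBs of gamma_i   */
    return acc >> sigma;                  /* integer part: alpha-hat */
}
```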
2) RNS MONTGOMERY MODULAR REDUCTION
Alternatively, the modular reduction by a k-bit prime number p can be performed by adapting the Montgomery reduction given in Equation (1), to the RNS representation setting. This approach was first introduced by Posch and Posch [23] , and several refinements were proposed in [24] and in a myriad of subsequent papers [20] .
The adaptation of the k-bit Montgomery reduction to RNS arithmetic requires handling two distinct RNS bases B = {m_1, …, m_ℓ} and B′ = {m′_1, …, m′_ℓ}, whose dynamic ranges M = Π_{i=1}^{ℓ} m_i and M′ = Π_{i=1}^{ℓ} m′_i are co-prime.
In addition, the Montgomery parameters of Equation (1) must be represented using the two RNS bases B and B′. It is customary to choose R = M.

Algorithm 3 RNS Modular Reduction Optimized for Multi/Many-Core Platforms [21]
Require: The integer d given in ℓ-moduli RNS representation, the ℓ-moduli RNS-basis B, and parameters r, σ, and ε.
Ensure: RNS vector z, such that its integer representation is z ≡ d mod p.
Precomputation:
1: the coefficients |M_i^{−1}|_{m_i}, for i ∈ {1, …, ℓ}
2: the RNS vectors |M_i|_p, for i ∈ {1, …, ℓ}
3: the RNS vectors |α · M|_p, for α ∈ {1, …, ℓ − 1}
Computation:
4: for each processing unit j do
5:   γ_j ← |d_j · |M_j^{−1}|_{m_j}|_{m_j}
6: end for
7: for each processing unit j do
8:   for each processing unit i do
9:     t_j ← | Σ_{i=1}^{ℓ} γ_i · |M_i|_p |_{m_j}   ▷ ℓ RNS mults and ℓ − 1 RNS adds
10:    α̂ ← ⌊ Σ_{i=1}^{ℓ} ⌊γ_i/2^{w−σ}⌋/2^σ + ε ⌋   ▷ addition of ℓ σ-bit terms
11:    z_j ← | t_j − |α̂ · M|_p |_{m_j}
12:   end for
13: end for
14: return z
As discussed in [24], [25], the RNS version of the Montgomery reduction can avoid a conditional subtraction by adopting Walter's approach of [26]. Indeed, a redundant representation of the elements in the Montgomery domain can be achieved by choosing a Montgomery radix such that 4p < R and an RNS-basis B such that 2p < M. At the end of the whole computation the result can be normalized at the cost of a single constant-time subtraction.
The procedure to compute an RNS Montgomery modular reduction is presented in Algorithm 4. It is noted that the multiplication of d_B by µ in Step 5 is carried out in base B. Due to the design choice R = M, the reduction modulo R is implicitly applied in this computation. Thus, the product performed in this step is equivalent to computing µ · d mod R of Equation (1). The numerator of Equation (1) is then computed and exactly divided by R = M in base B′ (Steps 9-13), where M is invertible; a final base extension (Steps 15-18) brings the result back to base B.

Algorithm 4 RNS Montgomery Modular Reduction [24]
Require: The integer d represented in the two ℓ-moduli RNS bases B and B′, with R = M; parameters σ and ε.
Ensure: RNS vector z ≡ d · M^{−1} mod p, represented in bases B and B′.
Precomputation:
1: |µ|_{m_i} with µ = −p^{−1} mod M, and |M_i^{−1}|_{m_i}, for i ∈ {1, …, ℓ}
2: the tables |M_i|_{m′_j} and |M′_i|_{m_j}, for i, j ∈ {1, …, ℓ}
3: |p|_{m′_i}, |M^{−1}|_{m′_i} and |M′_i^{−1}|_{m′_i}, for i ∈ {1, …, ℓ}
Computation:
4: for each processing unit i do
5:   q_i ← |d_i · µ|_{m_i}   ▷ q = µ · d mod R, in base B
6:   γ_i ← |q_i · M_i^{−1}|_{m_i}
7: end for
8: α̂ ← ⌊ Σ_{j=1}^{ℓ} ⌊γ_j/2^{w−σ}⌋/2^σ + ε ⌋   ▷ Addition of ℓ σ-bit terms
9: for each processing unit i do
10:   q′_i ← | Σ_{j=1}^{ℓ} γ_j · M_j − α̂ · M |_{m′_i}   ▷ ℓ RNS products and ℓ RNS additions
11:   t′_i ← |d′_i + q′_i · p|_{m′_i}
12:   z′_i ← |t′_i · M^{−1}|_{m′_i}
13:   γ′_i ← |z′_i · M′_i^{−1}|_{m′_i}
14: end for
15: α̂′ ← ⌊ Σ_{j=1}^{ℓ} ⌊γ′_j/2^{w−σ}⌋/2^σ + ε ⌋   ▷ Addition of ℓ σ-bit terms
16: for each processing unit i do
17:   z_i ← | Σ_{j=1}^{ℓ} γ′_j · M′_j − α̂′ · M′ |_{m_i}   ▷ ℓ RNS products and ℓ RNS additions
18: end for
19: return z (in base B) and z′ (in base B′)
III. EFFICIENT IMPLEMENTATION OF RSA ON CPU PLATFORMS
In this section, efficient CPU software implementations of the RSA exponentiation are described in detail. First, the implementation of elementary arithmetic operations such as multiplication, squaring and Montgomery reduction is described, along with a brief explanation of how this modular arithmetic layer was used for the concurrent computation of the two exponentiations associated to the RSA signature. Thereafter, a description of the RNS-based arithmetic implementation is presented. The implementation of this arithmetic heavily relies on the Advanced Vector Extensions 2 (AVX2) instruction set. Furthermore, a comparison between the RNS reduction procedures described in Algorithm 3 and Algorithm 4 shows that, for our CPU implementation setting, the latter is faster than the former.
All the experimental results presented in this section were obtained on an Intel Core i7-4770 processor supporting the Haswell micro-architecture and on an Intel Core i7-6700 processor that supports the Skylake micro-architecture, both equipped with 16 GB of RAM and running the Ubuntu 16.04.6 LTS operating system. To guarantee the reproducibility of our measurements, the Intel Hyper-Threading and Intel Turbo Boost technologies were disabled. Our source code was compiled using the GNU C Compiler (gcc) v6.1.0 with the -O3 optimization flag, using the options -mbmi2 -fwrapv -fomit-frame-pointer and -mbmi2 -madx -fwrapv -fomit-frame-pointer for the Haswell and Skylake micro-architectures, respectively. (The source code of our software library is available at https://github.com/luinxz/RSA.)
A. MONTGOMERY BASED ARITHMETIC ON CPU PLATFORMS 1) INTEL INTEGER ARITHMETIC INSTRUCTIONS
Aiming to achieve efficient integer multiplication/squaring and Montgomery reduction, we took advantage of the MULX instruction and of the multi-precision add-carry instruction extensions (ADX) [27]. First introduced in the Bit Manipulation instruction set (BMI2) of the Haswell micro-architecture, the assembly instruction MULX is an extension of the 64-bit multiplication instruction MUL. MULX uses a three-operand encoding and computes an unsigned multiplication without modifying the arithmetic flags. The advantage of the three-operand encoding is that it saves MOV instructions by allowing the programmer to choose the output registers receiving the highest and lowest parts of the two-word product. On the other hand, the ADX instruction set, first introduced in the Broadwell micro-architecture, includes the instructions ADCX and ADOX, which were designed to handle two independent carry chains. These instructions compute unsigned 64-bit additions with input carry, storing the resulting output carry in the carry flag (CF) and the overflow flag (OF), respectively. Since both instructions deal with two independent carry chains, they can be executed in parallel. An important advantage of these instructions is that they allow combining MULX, ADC, ADCX and ADOX instructions without corrupting the carry chains. This feature has a noticeable impact on the efficiency of the Montgomery-based arithmetic, as discussed next.
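To illustrate the pattern, the following C sketch accumulates a 2-word-by-1-word product into a 3-word window using the _mulx_u64 and _addcarryx_u64 intrinsics (compiled with -mbmi2 -madx, as above). It is a minimal illustration rather than the library's hand-tuned assembly; since ADCX and ADOX touch distinct flags, the compiler may interleave the two independent chains.

```c
/* Illustrative MULX/ADCX/ADOX usage: acc[0..2] += a[0..1] * b,
 * with low and high product halves forming two separate carry
 * chains. Not the library's code. */
#include <immintrin.h>

void mac_2x1(unsigned long long acc[3],
             const unsigned long long a[2], unsigned long long b)
{
    unsigned long long lo0, lo1, hi0, hi1;
    unsigned char c;

    lo0 = _mulx_u64(a[0], b, &hi0);   /* MULX: flags left untouched */
    lo1 = _mulx_u64(a[1], b, &hi1);

    /* first chain: add the low halves */
    c = _addcarryx_u64(0, acc[0], lo0, &acc[0]);
    c = _addcarryx_u64(c, acc[1], lo1, &acc[1]);
    (void)_addcarryx_u64(c, acc[2], 0, &acc[2]);

    /* second, independent chain: add the high halves */
    c = _addcarryx_u64(0, acc[1], hi0, &acc[1]);
    (void)_addcarryx_u64(c, acc[2], hi1, &acc[2]);
}
```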
2) INTEGER MULTIPLICATION
The two most popular approaches for computing integer multiplication in a software implementation are the schoolbook method with its associated quadratic complexity on the number of word multiplications, and the Karatsuba and Ofman [28] approach that enjoys a sub-quadratic complexity on the number of word multiplications at the price of increasing the number of required word additions. One can also opt for using a combination of these two methods, which was the strategy adopted in this work.
The efficiency of the schoolbook method mainly depends on how the partial products are computed and the way that they are added. We used the operand-scanning strategy, where the multiplicand operand is multiplied by each word of the multiplier. This strategy allows us to take full advantage of the MULX, ADCX and ADOX instructions.
However, this approach is limited by the available number of general purpose registers. Because of this, the schoolbook method tends to be efficient only when the operands have a small word-length. Indeed, as shown in Table 1, our implementation of a pure schoolbook integer multiplication outperformed Karatsuba only when the operands had a word-length in the range 0 < n ≤ 8.
Therefore, n-word multiplications with n > 8 were performed using a combination of the Karatsuba multiplication method [28] and the schoolbook approach. Two n-word integers a and b can be written as a = a_0 + a_1 · r^{n/2} and b = b_0 + b_1 · r^{n/2}, where as before r = 2^w. Using the Karatsuba approach, one first computes the values c_L = a_0 · b_0, c_M = (a_0 + a_1)(b_0 + b_1) and c_H = a_1 · b_1. Then the integer multiplication c = a · b is obtained as

c = c_L + (c_M − c_L − c_H) · r^{n/2} + c_H · r^n.

This computation costs two additions and three multiplications of n/2-word integers, plus one addition and two subtractions of n-word operands.
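A minimal sketch of this strategy is given below: one Karatsuba level over an operand-scanning schoolbook base case, with r = 2^64. The helper routines and names are ours, and the sketch favors readability over the register-level optimizations described above.

```c
#include <stdint.h>
#include <string.h>

/* c[0..2n) = a[0..n) * b[0..n): operand-scanning schoolbook */
static void schoolbook(uint64_t *c, const uint64_t *a,
                       const uint64_t *b, int n)
{
    memset(c, 0, 2 * (size_t)n * sizeof *c);
    for (int i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < n; j++) {
            __uint128_t t = (__uint128_t)a[i] * b[j] + c[i + j] + carry;
            c[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        c[i + n] = carry;
    }
}

/* x[0..xlen) += y[0..len), len <= xlen, full carry propagation */
static void add_words(uint64_t *x, const uint64_t *y, int len, int xlen)
{
    uint64_t carry = 0;
    for (int i = 0; i < xlen; i++) {
        __uint128_t t = (__uint128_t)x[i] + (i < len ? y[i] : 0) + carry;
        x[i] = (uint64_t)t;
        carry = (uint64_t)(t >> 64);
    }
}

/* x[0..xlen) -= y[0..len), len <= xlen, full borrow propagation */
static void sub_words(uint64_t *x, const uint64_t *y, int len, int xlen)
{
    uint64_t borrow = 0;
    for (int i = 0; i < xlen; i++) {
        __uint128_t t = (__uint128_t)x[i] - (i < len ? y[i] : 0) - borrow;
        x[i] = (uint64_t)t;
        borrow = (uint64_t)(t >> 64) & 1;
    }
}

/* c[0..2n) = a[0..n) * b[0..n): one Karatsuba level, n even */
void karatsuba_mul(uint64_t *c, const uint64_t *a,
                   const uint64_t *b, int n)
{
    int h = n / 2;
    uint64_t sa[h + 1], sb[h + 1], cm[2 * (h + 1)];

    schoolbook(c, a, b, h);              /* c_L in c[0..n)       */
    schoolbook(c + n, a + h, b + h, h);  /* c_H in c[n..2n)      */

    /* sa = a0 + a1 and sb = b0 + b1 (h+1 words, top = carry)    */
    memcpy(sa, a, (size_t)h * sizeof *sa); sa[h] = 0;
    add_words(sa, a + h, h, h + 1);
    memcpy(sb, b, (size_t)h * sizeof *sb); sb[h] = 0;
    add_words(sb, b + h, h, h + 1);

    schoolbook(cm, sa, sb, h + 1);       /* c_M = (a0+a1)(b0+b1) */
    sub_words(cm, c, n, 2 * (h + 1));    /* c_M - c_L            */
    sub_words(cm, c + n, n, 2 * (h + 1));/* c_M - c_L - c_H      */

    /* c += (c_M - c_L - c_H) * r^(n/2); the middle term fits in
     * n+1 words, so cm's top word is zero at this point.        */
    add_words(c + h, cm, n + 1, 2 * n - h);
}
```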
For 16-word multiplications, we applied one Karatsuba level that took us from 16-to 8-word multiplications. In the case of 24-word multiplications, we utilized two Karatsuba levels that took us from 24-to 12-word multiplications, and then to 6-word multiplications. The results obtained from this strategy are presented in Table 1 .
COMPARISON AGAINST SCOTT'S KARATSUBA VARIANT
In [29], Scott proposed a Karatsuba variant based on the Arbitrary Degree Karatsuba (ADK) approach previously suggested in [30]. Scott implemented the ADK approach in the reduced-radix setting, where a number is represented using a word size smaller than the one offered by the target processor. The advantage of this strategy is that partial products can be accumulated without worrying about the output carries.
We implemented Scott's strategy using a radix of r = 2^62 as proposed in [29]. A comparison of our own implementation of Scott's proposal against the combination of the Karatsuba and schoolbook methods is presented in Table 2. It can be observed that the combination of the Karatsuba and schoolbook approaches outperforms the one reported by Scott for the range of word-lengths relevant to this work.
3) INTEGER SQUARING OPERATION
For operands with a word-length n ≥ 6, we chose a variant of the Karatsuba method that takes advantage of the repeated partial products that show up in the squaring operation. Consider an n-word integer a written as a = a_0 + a_1 · r^{n/2}. First compute the values c_L = a_0^2, c_M = 2(a_0 · a_1) and c_H = a_1^2. Then the squaring c = a^2 can be computed as c = c_L + c_M · r^{n/2} + c_H · r^n. This can be obtained at the cost of two n/2-word squarings, one multiplication of n/2-word operands, and two n-word additions.
The implementation of an n-word squaring for n ∈ {24, 16, 8} was conducted using up to three Karatsuba levels. For n = 24, three levels (going from 24- to 12-, then to 6-word, and finally to 3-word multiplications/squarings) require eighteen squarings and nine multiplications of 3-word operands. For n = 16, two Karatsuba levels (going from 16- to 8- and then to 4-word multiplications/squarings) require six squarings and three multiplications of 4-word operands. For n = 8, one Karatsuba level suffices (going from 8- to 4-word multiplications/squarings), using two squarings and one multiplication of 4-word operands. According to our experiments, the squaring of operands of up to 4 words performs better with a pure schoolbook approach, whereas for operands with a word-length n ≥ 6 the best strategy was to combine the Karatsuba and schoolbook methods. A summary of the results obtained is reported in Table 3.
4) MONTGOMERY MODULAR REDUCTION
The Montgomery modular reduction of Equation (1) requires computing two n-word multiplications whose results are either divided by or reduced modulo R = r^n. A straightforward optimization can be applied by observing that for the multiplication µ · d mod R, only the least significant half of the result must be computed. Likewise, in the case of the (µ · d mod R) · p computation, only the most significant half of the product is required (due to the subsequent division by R). For the cases when n ≤ 8, these operations were performed using the schoolbook multiplication method. Thus, an n-word multiplication divided by R is computed using n(n + 1)/2 + n word multiplications, and an n-word multiplication modulo R is computed using n(n + 1)/2 word multiplications. On the other hand, for the cases when n > 8 we employed up to two levels of the Karatsuba method. At each level it was necessary to compute one full n/2-word multiplication and two half n/2-word multiplications, as depicted in Figure 1. Table 4 reports the timings measured for the Montgomery modular reduction, modular multiplication and squaring operations, for operands with a word-length of n ∈ {8, 16, 24}.
5) MONTGOMERY-BASED RSA SIGNATURE
The RSA signature described in the introduction section was performed using the exponentiation procedure presented in Algorithm 2, which was implemented with an underlying Montgomery-based arithmetic layer. Furthermore, the two RSA signature exponentiations were computed concurrently using two cores and the OpenMP library for synchronization. The 8-word modular exponentiations for RSA-1024 were performed using a window size ω = 4. In the case of the 16- and 24-word modular exponentiations (required for computing RSA-2048 and RSA-3072), a window size ω = 5 was chosen.
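A skeleton of this arrangement is sketched below (compiled with -fopenmp); mod_exp stands in for the protected Montgomery-based exponentiation of Algorithm 2, and the bignum type is an illustrative placeholder:

```c
/* Skeleton with illustrative names: the two CRT exponentiations of
 * an RSA signature run concurrently on two cores. */
#include <stdint.h>

typedef struct { uint64_t w[24]; } bignum;  /* up to n = 24 words */

void mod_exp(bignum *r, const bignum *x, const bignum *e,
             const bignum *m);              /* Algorithm 2 */

void rsa_sign_parallel(bignum *s1, bignum *s2, const bignum *h,
                       const bignum *dp, const bignum *dq,
                       const bignum *p, const bignum *q)
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        mod_exp(s1, h, dp, p);   /* s1 = h^(d mod (p-1)) mod p */
        #pragma omp section
        mod_exp(s2, h, dq, q);   /* s2 = h^(d mod (q-1)) mod q */
    }
}
```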
In Table 5, the latency achieved by our library when computing RSA signatures for 1024-, 2048-, and 3072-bit private keys is reported, together with a comparison of our results against previously reported RSA implementations. Bos et al. in [11] computed a Montgomery multiplication by splitting the Montgomery algorithm into two parts, which can be executed in parallel using Single Instruction Multiple Data (SIMD) instructions; the authors of [11] presented RSA signature timings corresponding to a serial implementation. Table 5 also shows the work by Gueron and Krasnov [12], where the authors reported an RSA implementation that profited from a redundant integer representation that avoids carry propagation by organizing operands in 29-bit words. To the best of our knowledge, none of these two works included countermeasures to protect their RSA implementations against basic side-channel attacks.
B. CPU IMPLEMENTATION OF RNS-BASED ARITHMETIC
In this section, our implementation of RNS-based arithmetic using the set of Advanced Vector Extensions 2 AVX2 is presented. A performance comparison between the RNS reduction procedures given in Algorithm 3 and Algorithm 4 is also reported.
1) VECTOR INSTRUCTIONS
In order to achieve an efficient implementation of the RNS-based arithmetic described in Section §II-C, we took advantage of the AVX2 instruction set introduced in the Intel Haswell micro-architecture [31]. AVX2 is an extension of its ancestor AVX, which allows computing Single Instruction Multiple Data (SIMD) operations using 256-bit vector registers. AVX2 provides operations supporting integer arithmetic that are able to compute up to four concurrent 64-bit operations over the values stored in the vector registers. In terms of performance, one would expect a speedup factor of four coming from the simultaneous execution of 64-bit operations. Nevertheless, this acceleration can be attained only by some instructions, due to overhead factors such as the execution latency and throughput, and the number of execution units available in the target micro-architecture. In fact, the size of the AVX2 multiplier is expected to be a limiting factor. We mainly made use of the following AVX2 instructions:
• mm256_mul_epu32: Computes four products of 32 × 32 bits, storing the four 64-bit results on a 256-bit vector register;
• mm256_add_epi32, mm256_sub_epi32: Computes eight concurrent 32-bit additions/subtractions, without handling the input/output carry and borrow, respectively;
• mm256_slli_epi32, mm256_srli_epi32: Computes eight 32-bit logical shifts using the same fixed shift displacement for every word stored in the vector register;
• mm256_shuffle_epi32: Shuffles 32-bit values of the source vector in the destination vector at the locations selected by a control operand;
• mm256_xor_si256, mm256_and_si256: Computes the XOR/AND of two 256-bit vector registers;
• mm256_cmpgt_epi32: Returns a vector with the values 2^32 − 1 or zero, depending on whether the comparison of the 32-bit integers in the vector register is true or not.
Since the AVX2 multiplier works on 32-bit input operands, a word size w = 32 must be assumed. This implies that the operations required for RSA signatures with 1024-, 2048- and 3072-bit keys must be computed using integers with a word-length n ∈ {16, 32, 48}.
RNS addition, subtraction, and multiplication of two integers a and b are performed component-wise as shown in Equation (3) of Section §II-C. Considering our target micro-architectures, one can compute up to eight operations modulo m_i simultaneously. Therefore, all the computations described below must be performed for each vector storing moduli of the RNS-basis B, i.e., a total of n/8 vectors. The implementation of the main RNS arithmetic operations taking advantage of the AVX2 instructions is presented in the remainder of this section.
2) VECTORIAL RNS ADDITION/SUBTRACTION
Addition and subtraction can be straightforwardly implemented using the vector operations included in the AVX2 instruction set. Initially, the integer addition or subtraction is computed with the mm256_add_epi32 or mm256_sub_epi32 instruction, and the result is stored in a vector C. As shown in Figure 2, the modular reduction by each modulus m_i belonging to the RNS-basis B (cf. Section §II-C) can then be computed in constant time as discussed next.
By means of the mm256_cmpgt_epi32 instruction, one can catch the carry or borrow produced by the integer addition or subtraction, which is stored in a vector CB. Thereafter, the instruction mm256_and_si256 is used to compute the logical AND of CB and the vector M of moduli m_i, whose result is stored in a vector D. Then, the vector D is subtracted from or added to the value obtained from the above addition or subtraction, respectively. The RNS computation of C = A ⊕ B and C = A ⊖ B is depicted in Figure 2.
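The intrinsics sketch below shows a simplified variant of this flow for eight residues per vector. It assumes moduli m_i < 2^30 so that lane sums stay positive and the signed comparison _mm256_cmpgt_epi32 is safe; the library's code instead detects the 32-bit carry, as described above.

```c
/* Simplified, illustrative variant of the Figure 2 addition flow:
 * eight lane-wise RNS additions with a branch-free conditional
 * reduction. Assumes a_i, b_i < m_i < 2^30. */
#include <immintrin.h>

__m256i rns_add8(__m256i A, __m256i B, __m256i M)   /* M holds m_i */
{
    __m256i C  = _mm256_add_epi32(A, B);            /* a_i + b_i   */
    /* CB = all-ones in the lanes where c_i >= m_i                 */
    __m256i CB = _mm256_cmpgt_epi32(C,
                     _mm256_sub_epi32(M, _mm256_set1_epi32(1)));
    __m256i D  = _mm256_and_si256(CB, M);           /* m_i or 0    */
    return _mm256_sub_epi32(C, D);                  /* c_i mod m_i */
}
```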
3) VECTORIAL RNS MULTIPLICATION
Multiplication and squaring in RNS are somewhat more involved operations than addition and subtraction. This is because the AVX2 instruction mm256_mul_epu32 only computes four 32 × 32-bit multiplications. Hence, in order to compute a component-wise integer multiplication of two RNS vectors A and B, we first use the mm256_mul_epu32 instruction, which calculates the products of the odd-indexed lanes and stores them in a vector D_0. Then, the shuffle instruction mm256_shuffle_epi32 is used to reorder the 32-bit values of the A and B vector registers, which permits computing the products of the even-indexed lanes, stored in the vector D_1 as shown in Figure 3.
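A sketch of this lane bookkeeping follows (names are ours): _mm256_mul_epu32 multiplies the 32-bit values held in the low half of each 64-bit element, so one shuffle exposes the other four residues for the second multiplication.

```c
/* Illustrative lane pairing of Figure 3: two multiplications yield
 * the eight 64-bit products of two residue vectors A and B. */
#include <immintrin.h>

void rns_mul8(__m256i A, __m256i B, __m256i *D0, __m256i *D1)
{
    /* four 64-bit products from 32-bit lanes 0, 2, 4, 6 */
    *D0 = _mm256_mul_epu32(A, B);

    /* swap adjacent 32-bit lanes to expose lanes 1, 3, 5, 7 */
    __m256i A1 = _mm256_shuffle_epi32(A, _MM_SHUFFLE(2, 3, 0, 1));
    __m256i B1 = _mm256_shuffle_epi32(B, _MM_SHUFFLE(2, 3, 0, 1));
    *D1 = _mm256_mul_epu32(A1, B1);
}
```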
RNS INDIVIDUAL MODULAR REDUCTION
Modular reduction by each modulus m_i = 2^w − µ_i in B is computed as described in Remark 2 of Section §II-C. Let D_0 and D_1 be the two output vectors of the computation shown in Figure 3, and let M be a vector composed of the small values µ_i described in Remark 2.
First, the mm256_shuffle_epi32 instruction is executed to reorder the 32-bit values in the D_0, D_1 and M vectors. Thereafter, the instruction mm256_mul_epu32 computes the values µ_i · ⌊t_i/2^w⌋ with t_i = a_i · b_i, which are stored in the vectors E_0 and E_1, respectively. Then, the execution of the mm256_srli_epi32 instruction on E_0 and E_1 using an offset of 32 produces the vectors F_0 and F_1, which are added using mm256_add_epi64 to D_0 and D_1. This yields the values d_i = (t_i mod 2^w) + µ_i · ⌊t_i/2^w⌋. After two iterations of the above procedure, the d_i values stored in D_0 and D_1 correspond to t_i mod m_i. Thus, it becomes necessary to combine the final vectors D_0 and D_1 to get the vector D that stores the values of A ⊗ B. This procedure is depicted in Figure 4.
4) VECTORIAL RNS MODULAR REDUCTION
Modular reduction was performed using Algorithm 3 as presented by Jeljeli in [21], and Algorithm 4 as described by Kawamura in [24]. For both of these reduction algorithms it becomes necessary to find the approximation α̂ given in Equation (7), which was computed as described in Section §II-C, Remark 3. When working with the RNS reduction Algorithm 3, one computes α̂ for each vector storing γ = (γ_1, …, γ_ℓ). This calculation is performed by invoking the mm256_srli_epi32 instruction with offsets 5 for RSA-1024, and 25 for RSA-2048 and RSA-3072. For the reduction Algorithm 4, offsets of 18 for RSA-1024 and RSA-2048, and 16 for RSA-3072 were employed.
Thereafter, the mm256_slli_epi32 instruction is applied to each vector using offsets that guarantee constraining the subsequent additions to the interval [0, 2^32 − 1]. For example, when the reduction Algorithm 3 is used, the offsets are 19 for RSA-1024, and 17 for RSA-2048 and RSA-3072. In the case of the reduction Algorithm 4, the offsets are 14, 10 and 8 for RSA-1024, RSA-2048 and RSA-3072, respectively. As a final step, all vectors are added using mm256_add_epi32 instructions, and the values of the resulting output vector are also added in order to obtain a 32-bit value, which is shifted to the right by an offset of 24.
The matrix-vector multiplications needed in both algorithms can be performed using ℓ RNS multiplications followed by ℓ − 1 RNS additions, as shown in Section §II-C. The matrix multiplication in Step 9 of Algorithm 3 can be done straightforwardly. However, the matrix multiplications in Steps 10 and 17 of Algorithm 4 require transposing the matrices |M_i|_{m′_j} and |M′_i|_{m_j}. The experimental results obtained for both reduction algorithms are presented in Table 6. One can observe that the RNS Montgomery reduction of Algorithm 4 is twice as fast as the RNS reduction Algorithm 3. This is mainly due to the fact that for the RNS Montgomery reduction the bases used to represent the numbers are of size ℓ = n, whereas the basis used in Algorithm 3 has a size of ℓ = 2n + 3. Moreover, all the operations in Algorithm 4 process vectors with a size of roughly half of the ones required in Algorithm 3.
5) RNS-BASED RSA SIGNATURE
As in §III-A, we concurrently computed the two RSA modular exponentiations using two cores running the protected exponentiation method described in Algorithm 2. Once again, the synchronization of these two tasks was achieved through the usage of the OpenMP library. Modular exponentiations for RSA-1024 were performed using a window size ω = 4, whereas ω = 5 was adopted for the modular exponentiations corresponding to both RSA-2048 and RSA-3072. Table 7 presents the latency achieved by our library for the RSA-1024, RSA-2048, and RSA-3072 signature computations. One can observe that the RSA signature based on the RNS Montgomery arithmetic (Algorithm 4) is two times faster than the one based on the RNS reduction Algorithm 3. It is worth noting that the best results of Table 7 are slower by factors of 3.1x and 3.8x than the best results reported in Table 5 for RSA-2048 and RSA-3072. On the other hand, the best RSA-1024 signature result in Table 7 is 2.5 times faster than the one reported in Table 5.
IV. EFFICIENT IMPLEMENTATION OF RSA ON GPU PLATFORMS
In this section, the implementation of the RSA exponentiation on a GPU architecture is described. The material presented here is partially based on a previous work presented in [32] .
A. PARALLEL COMPUTATIONS ON GPU ARCHITECTURES
Graphics Processing Units (GPUs) are optimized hardware blocks originally designed for performing graphics operations [33]. Nowadays, GPU platforms are widely considered general purpose processors. In 2006, NVIDIA introduced a parallel computing framework named CUDA, which was especially designed for GPU environments. CUDA defines three important features: a threading model, a set of conventions for calling native GPU functions, and a hierarchical memory infrastructure. In a GPU architecture the basic computational and resource-allocation units are threads. Threads can be grouped into blocks, which in turn can be grouped into a grid. Threads in a block are partitioned into warps. For all GPU architectures a warp is composed of 32 threads that run concurrently.
A GPU architecture utilizes the Single Instruction Multiple Thread (SIMT) programming model paradigm, where all threads inside a warp can execute the same instruction at the same time. The general programming model consists of code sequences called kernels. A kernel execution can be synchronous or asynchronous. This allows programmers to manage concurrent execution through the completion of command sequences called streams.
GPU architectures support several types of memory, such as global memory, constant memory and shared memory, among others. The shared memory is a small, low-latency cache memory attached to each Streaming Multiprocessor (SM). Shared memory can be accessed by all the threads in a block. During kernel invocation, a programmer can configure the amount of shared memory available per block (common GPU architectures such as Fermi, Kepler, Maxwell, Pascal, and Volta each provide different shared memory sizes). For example, in a Kepler architecture a valid configuration can allocate 48 KB and 16 KB for the software and the hardware data cache, respectively. Moreover, PTX (Parallel Thread eXecution) is a low-level parallel thread execution virtual machine that provides a stable programming model and an instruction set for general purpose parallel programming [34]. It is often used to gain control over arithmetic operations, trying to avoid thread divergence during a program execution (a thread divergence occurs when several threads do not execute the same instruction at the same time, which prevents fully exploiting parallelism). In this work we made extensive use of the following assembly instructions:
• addc: Adds two 32/64-bit values taking into account the carry-in bit, producing a carry-out bit;
• subc: Performs a 32/64-bit subtraction with input borrow, producing a borrow-out bit;
• mul.lo: Multiplies two 32/64-bit values and returns x_i × y_i mod 2^r, where x_i and y_i are both non-negative integers, and r is typically selected to be the GPU word size;
• mul.hi: Multiplies two 32/64-bit values and returns ⌊x_i × y_i / 2^r⌋, where x_i and y_i are both non-negative integers;
• mad.(hi,lo).cc: Multiplies two 32/64-bit values, extracts the higher or lower half of the result, and adds a third 32/64-bit value with carry-out.
Especially because of their ability to handle an implicit carry/borrow operation, the aforementioned instructions helped our implementation achieve a better performance. Also, we extensively used the data type uint2, which is a two-element vector that stores the two halves of a 64-bit integer, as the most and least significant halves. This data structure permits an efficient access to data stored in registers and the shared memory.
B. MAIN OPERATIONS IN RNS REPRESENTATION
In the following, we describe how the RSA exponentiation was carried out on the GPU platform. As a pre-computation step, the set of pairwise co-prime numbers composing the RNS-basis B was chosen in the CPU server. Then, all the RSA operands and moduli were converted to their RNS representation and these values were sent to the GPU platform. After that, the exponentiation computation on the GPU platform was ready to start.
RNS INTEGER MULTIPLICATION
Integer multiplication can be performed in parallel by launching blocks of ℓ threads, which concurrently compute up to ℓ independent RNS multiplications of the form C = A ⊗ B, where A and B are in RNS representation in base B = {m_1, …, m_ℓ} (or in bases B = {m_1, …, m_ℓ} and B′ = {m′_1, …, m′_ℓ} if the RNS Montgomery reduction Algorithm 4 is used). This arrangement is depicted in Figure 5(a), where it is shown that each thread is in charge of processing the modular product of a pair of RNS coordinates |a_i · b_i|_{m_i} (or |a_i · b_i|_{m_i} and |a′_i · b′_i|_{m′_i}). Since each warp executes the same instruction, this arrangement avoids thread divergence. Moreover, the multiplications carried out concurrently do not need to be synchronized. Also, the threads can efficiently access each coordinate of the RNS vectors, as these values are allocated on contiguous segments of memory. Each thread stores the output of its modular multiplication on a register, thus avoiding global memory accesses that would be much more costly.
After all threads have completed the integer multiplication step, a modular reduction by the modulus p must be applied, either using the reduction Algorithm 3 or the RNS Montgomery reduction Algorithm 4. For the sake of brevity, we only explain in the following our GPU implementation of the RNS reduction Algorithm 3. The corresponding implementation of the RNS Montgomery reduction Algorithm 4 follows a similar design flow.
C. RNS MODULAR REDUCTION USING ALGORITHM 3
Modular reduction carried out by Algorithm 3 is illustrated in Figures 5(b), (c), (d), and (e). The reduction process requires the pre-computation of several values (Steps 1-3), which are processed in the hosting CPU and sent to the GPU before the main computation starts. The coefficients |M_i^{−1}|_{m_i} and the RNS table |M_i|_p of Steps 1-2 are both stored in the shared memory so that they are available to all the threads. The third precomputed value is the table containing the RNS vectors |α · M|_p, for α = 1, …, ℓ − 1. This table is mapped to the texture memory because only a few threads have to query it.
The multiplication operations required in Step 5 are computed in a redundant fashion as previously described and illustrated in Fig. 5b. Then in Step 9, the most expensive task of the reduction algorithm is performed, requiring the computation of ℓ and ℓ − 1 RNS multiplications and additions, respectively. This calculation is performed in parallel by launching ℓ blocks with ℓ threads each (illustrated in Fig. 5c). If there are more than 32 active threads, then an explicit barrier must be placed in order to synchronize all threads of each block, and one must wait until all the threads have completed their execution. Once all the partial results have been obtained by each block, each thread stores its result in the shared memory. Next, all the partial results so obtained must be added, which can be done by using a binary addition tree strategy [35]. Step 10 of Algorithm 3 calculates ℓ copies of α̂ using ℓ blocks, as shown in Figure 5(d). The ℓ − 1 additions are computed collaboratively as previously mentioned. Finally, in Step 11 of Algorithm 3, a single thread of each one of the ℓ blocks performs an RNS coordinate subtraction, saving the final result of the modular reduction into the global memory (see Figure 5e). This prevents the threads from competing with each other to write into the same memory address.
D. GPU RESULTS
The experimental hardware setup used for the results reported in this section is the following: the GeForce GTX 1080 described in §I, with CUDA toolkit version 9.1, 20 streaming multiprocessors with 128 cores each running at 1.81 GHz, and a global memory of 8 GB.
In Table 8, the latency achieved by our GPU library for the RSA private operation is reported for key lengths of 1024, 2048 and 3072 bits. Table 8 also shows a comparison against related works previously published for parallel RSA implementations on GPU platforms. It can be seen that our implementation achieves better latency for RSA-1024, RSA-2048 and RSA-3072 than previously published results. Besides, our implementation offers a first line of protection against timing attacks.
V. CONCLUSION
In this paper, an optimized parallel implementation of RSA signatures using some of the most efficient and effective arithmetic algorithms for both CPU and GPU high-end architectures was presented. As shown in Tables 5 and 8, our RSA CPU and GPU libraries, instantiated with the most popular private key lengths of 1024, 2048 and 3072 bits, achieve faster timings than those previously reported in the literature.
From an algorithmic point of view, it is also interesting to compare the performance achieved by the RNS reduction procedures described in Algorithms 3-4 when implemented on CPU and GPU architectures. In the case of our RSA CPU implementation, a close inspection of Table 7 reveals that, due to the higher number of about 4n^2 word multiplications required by Algorithm 3 compared with about 2n^2 word multiplications required by Algorithm 4, the latter reduction algorithm is faster than the former by a factor close to two. Strikingly, our RSA GPU implementation exhibits the opposite situation. Indeed, since Algorithm 3 is more amenable to the massive parallelism provided by the GPU many-core architecture, it yields a better performance than Algorithm 4. This computational behavior is reported in Table 8.
In spite of their massive parallelism, we observe that GPU implementations of RSA are considerably slower than their CPU counterparts. Nevertheless, notice that our RSA GPU implementation enjoys a sub-quadratic growth in the cost of the RSA exponentiation with respect to the size of its key.
For example, from our GPU timings shown in Table 8, one can see that the computational cost of RSA-2048 and RSA-3072 is just 2.42 and 4.23 times higher than that of RSA-1024, respectively. On the contrary, from our CPU timings shown in Table 5, one can see that the cost of RSA-2048 and RSA-3072 is 6.60 and 20.08 times higher than that of RSA-1024, respectively; these timing increments closely follow a quadratic complexity. Hence, we believe that for cryptographic applications where extremely large operands are required (such as the ones proposed in several homomorphic encryption schemes), our RNS arithmetic library could be of interest. We leave the study of this potential application as future work.
NARELI CRUZ-CORTÉS received the Ph.D. degree from CINVESTAV, Mexico, in 2004. She is currently a Researcher Professor with the Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City, Mexico. Her main research interests are cybersecurity, machine learning, and their applications.
FRANCISCO RODRÍGUEZ-HENRÍQUEZ is currently a Professor with the Department of Computer Science, CINVESTAV-IPN, Mexico City, where he joined in 2002. He is a coauthor of Cryptographic Algorithms on Reconfigurable Hardware. His major research interests are in cryptography and finite field arithmetic.
