Abstract The significant cost of RSA computations affects the efficiency and responsiveness of SSL/TLS servers, and therefore software implementations of RSA are an important target for optimization. To this end, we study here efficient software implementations of modular exponentiation, which are also protected against software side channel analyses. We target superior performance for the ubiquitous x86_64 architectures, used in most server platforms. The paper proposes optimizations in several directions: the Montgomery multiplication primitives, the w-ary modular exponentiation flow, and reduced cost of side channel mitigation. For a comparison baseline, we use the current OpenSSL version, 1.0.0e. Our implementation, called "RSAZ", is more than 1.6 times faster than OpenSSL for both 1,024-bit and 2,048-bit keys, on the previous generation 2010 Intel® Core™ processors and on the 2nd generation Intel® Core™ processors. The RSAZ code was contributed to OpenSSL as a patch, and improvements proposed in an earlier version of this paper have already been incorporated into the future OpenSSL version.
SSL/TLS connections. As a result, the cryptographic algorithms that support such secure communications become a critical computational load for servers, and therefore an important target for optimization (see [15]). The performance of RSA is an important case, because RSA is a part of the handshake of TLS/SSL sessions.
Practically all RSA usages have 1,024-bit or 2,048-bit keys. RSA1024 usages are currently the majority, and are the main optimization focus. However, the fraction of usages employing RSA2048 is growing sharply. Coupled with NIST's recommendation for key lengths [4], RSA2048 becomes an important optimization target as well. The performance of RSA1024 translates directly to the performance of 512-bit modular exponentiations, and RSA2048 computations translate to 1,024-bit modular exponentiations. These computations are our focus.
Our goal is to study efficient software implementations of modular exponentiation on general purpose x86_64 architectures (x64 for short). Since cryptographic codes today are also required to be resistant to threats stemming from different types of side channel analyses, we address this requirement throughout the study. We start the investigation by trying to optimize the relevant "primitives" of modular exponentiations, namely modular multiplications (or equivalents), and continue through optimizing the exponentiation flow itself. At the same time, we also focus on making sure that the implementation resists side channel analyses in a broad way, while reducing the cost of the necessary mitigations.
To assess our results, we compare the performance of our implementation against the public implementation OpenSSL [18] and the publications [7, 27], as follows. The OpenSSL library is a very widely used (open source) software implementation. The availability of its source code makes it easy to study, tweak and measure, and it is therefore an important comparison baseline. Furthermore, optimizations that can be integrated and adopted by the OpenSSL library have potential benefit to many platforms worldwide. We refer to the latest OpenSSL version at the time of writing, 1.0.0e.
The second comparison baseline is the recent publication [7]. It proposes a new approach, resulting in what is claimed there to be "the world's fastest modular exponentiation implementation on IA processors", thus setting a bar for performance. Shortly after [7] was published, the related software implementation was publicly posted [27], as an OpenSSL patch, in the form of an OpenSSL "engine" called "RSAX". This RSAX submission is a highly optimized (assembler) code for modular exponentiation. It is an updated version whose performance improves over the reported results of [7]. RSAX is wrapped as an engine, and not integrated into the main OpenSSL tree, because it is based on an algorithm that requires different and additional pre-computed constants (from what OpenSSL uses). Finally, RSAX implements only 512-bit modular exponentiation, hence supports only RSA1024.
We note that a secure modular exponentiation implementation consists of many ingredients that can be optimized in orthogonal directions. We provide here a step by step description of the various considerations in building the optimization, and explain the contribution of the different ingredients.
The result is a software implementation of modular exponentiation, which we call "RSAZ" (short for RSA ZARIZ, Hebrew for "quick"). RSAZ is compatible with the OpenSSL interface, and can therefore be integrated into its main path. For RSA1024, it outperforms RSAX. Compared to OpenSSL 1.0.0e, RSAZ1024 and RSAZ2048 show, respectively, a speedup factor of 1.72 and 1.61 on the previous generation 2010 Intel® Core™ processor, and a speedup factor of 1.62 and 1.64 on the 2nd generation Intel® Core™ processor. Per request from the OpenSSL Team, RSAZ has been released as a patch, and can be found in [8].
Preliminaries
We discuss the RSA cryptosystem [16] with a 2n-bit modulus $N = P \times Q$, where $P$ and $Q$ are n-bit primes. We denote the 2n-bit private exponent by $d$. Decryption of a (2n-bit) ciphertext $C$ requires the 2n-bit modular exponentiation $C^d \bmod N$. To use the Chinese Remainder Theorem (CRT), $d_1 = d \bmod (P-1)$, $d_2 = d \bmod (Q-1)$, and $Q_{inv} = Q^{-1} \bmod P$ are pre-computed. Two n-bit modular exponentiations, $M_1 = C^{d_1} \bmod P$ and $M_2 = C^{d_2} \bmod Q$, are then computed and recombined (using $Q_{inv}$) to obtain $C^d \bmod N$.
Thus, the computational cost of 2n-bit RSA decryption is well approximated by the cost of two n-bit modular exponentiations. In our context, we can assume that by construction (of the RSA keys), $2^{n-1} < P, Q < 2^n$.
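The CRT flow described above can be illustrated with a minimal Python sketch. The function name `rsa_crt_decrypt`, the toy primes, and the specific recombination formula (Garner's formula) are assumptions made for illustration; they are not taken from the RSAZ code.

```python
def rsa_crt_decrypt(C, P, Q, d):
    """Toy CRT-based RSA decryption: two half-size exponentiations plus a recombination.

    Illustrative only: real implementations pre-compute d1, d2 and Qinv once per key,
    and must be side channel protected (this sketch is not).
    """
    d1 = d % (P - 1)
    d2 = d % (Q - 1)
    Qinv = pow(Q, -1, P)              # Q^{-1} mod P (Python 3.8+)
    M1 = pow(C % P, d1, P)            # n-bit modular exponentiation modulo P
    M2 = pow(C % Q, d2, Q)            # n-bit modular exponentiation modulo Q
    h = (Qinv * (M1 - M2)) % P        # recombination (Garner's formula, assumed here)
    return M2 + h * Q


# Toy key: both primes are 6-bit numbers, so 2^(n-1) < P, Q < 2^n holds with n = 6.
P, Q, e = 61, 53, 17
N = P * Q
d = pow(e, -1, (P - 1) * (Q - 1))
C = pow(1234, e, N)
M = rsa_crt_decrypt(C, P, Q, d)       # recovers the message 1234
```

The two calls to `pow` with the half-size moduli are the part that dominates the cost, which is why the rest of the discussion concentrates on n-bit modular exponentiation.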
Protecting software implementations against software side channel analysis
Software side channel analysis is a class of techniques that can be used for attacking cryptographic applications that run in a multi-tasking environment. Recent publications showed that when an unprivileged process ("Spy") runs in parallel to another ("Crypto") process, and some processor resources are shared (explicitly or implicitly), then Spy can extract information about the execution flow of Crypto. Two examples are the memory access patterns [20] and the branches taken during execution [1, 2, 6]. Depending on the way that the crypto code operates, this side channel information can compromise the secrets (keys) of Crypto. In our context of modular exponentiation, both the modulus and the exponent are secret. When the w-ary exponentiation [16] and variants of Montgomery multiplications are used (details in the subsequent sections), the knowledge of the following information may compromise the secrets:
1. Which of the Montgomery multiplications, computed during the exponentiation, require an "end reduction" step (for example, see [3, 6, 22, 25]).
2. Which of the $2^w$ entries of the computed table are accessed during the exponentiation (for example, see [20]).
As a general concept, we say that a piece of code is "inherently protected against software side channel analysis" ("inherently protected" hereafter) in a given environment, if for any chosen input, volunteering the full details of the following items does not leak any sensitive information: (a) the addresses (at the granularity of a cache line) that were accessed (read/write); (b) the resolutions (taken/not taken) of the executed branches; (c) the executed instructions. We make our modular exponentiation implementation inherently protected.
Montgomery multiplications and almost Montgomery multiplications
Modular multiplication (or an equivalent) is a critical building block for modular exponentiations. This section explains the selection of our preferred algorithm.
Montgomery multiplications basics
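The Montgomery multiplication [17] (MM hereafter) is a well-known efficient technique for computing modular exponentiation. Since it is already a classical algorithm, we explain it only briefly (more details can be found, e.g., in [5, 13, 14, 16]). For an odd modulus $m$, two integers $0 \le a, b < m$, and a positive integer $t$ with $m < 2^t$, the MM of $a$ by $b$ is $\mathrm{MM}(a, b) = a \times b \times 2^{-t} \bmod m$.
We say that $2^t$ is the Montgomery parameter.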
The use of MMs for modular exponentiation is based on two observations: (1) for any two integers 0 ≤ a, b < m,
Observation 1 ("stability") allows for using the output of one MM as an input to a subsequent MM. Note that modular exponentiation algorithms consist of sequences of modular multiplications, and that for a given modulus, the constant $c_2 = 2^{2t} \bmod m$ can be pre-computed. Therefore, $a^x \bmod m$ (for $0 \le a < m$ and some integer $x$) can be computed by: (a) mapping the base ($a$) to the Montgomery domain, $a' = \mathrm{MM}(a, c_2)$; (b) using an exponentiation algorithm while replacing modular multiplications with MMs; (c) mapping the result back to the residues domain, $u = \mathrm{MM}(u', 1)$.
MM's computations can use two steps: (1) $T = a \times b$ ($< m^2$); (2) a "Montgomery Reduction" to obtain $T \times 2^{-t} \bmod m$. This can leverage the fact that computing a square is faster than general multiplication, and speed up MMs with $a = b$, a case that occurs often in our context. We call such computations "Montgomery Squaring" (MSQR).
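As a rough illustration of the three-phase flow (map in, exponentiate with MMs, map out), the following Python sketch uses a one-shot Montgomery reduction and a plain square-and-multiply loop. The helper names `mont_reduce`, `mm` and `mont_pow` are hypothetical, and the exponent-dependent branch means this sketch is not side channel protected.

```python
def mont_reduce(T, m, t):
    """Return T * 2^(-t) mod m, for odd m < 2^t and 0 <= T < m * 2^t."""
    R = 1 << t
    k = (-pow(m, -1, R)) % R          # -m^{-1} mod 2^t (pre-computable per modulus)
    Y = ((T % R) * k) % R
    F = (T + Y * m) >> t              # T + Y*m is divisible by 2^t, so the shift is exact
    return F - m if F >= m else F     # F < 2m, so one conditional subtraction suffices


def mm(a, b, m, t):
    """Montgomery multiplication in two steps: multiply, then Montgomery-reduce."""
    return mont_reduce(a * b, m, t)


def mont_pow(a, x, m, t):
    """a^x mod m computed entirely with MMs (left-to-right square-and-multiply)."""
    c2 = pow(2, 2 * t, m)             # pre-computed constant 2^(2t) mod m
    a_m = mm(a, c2, m, t)             # (a) map the base to the Montgomery domain
    u_m = mm(1, c2, m, t)             # Montgomery form of 1
    for bit in bin(x)[2:]:            # (b) exponentiate, replacing multiplications by MMs
        u_m = mm(u_m, u_m, m, t)      # the a = b case, i.e. an MSQR
        if bit == "1":                # NOTE: exponent-dependent branch, illustration only
            u_m = mm(u_m, a_m, m, t)
    return mm(u_m, 1, m, t)           # (c) map the result back to the residues domain
```

For example, `mont_pow(3, 45, 101, 7)` agrees with Python's built-in `pow(3, 45, 101)`.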
A Montgomery reduction lemma
We will use the following lemma to discuss several variants of Montgomery multiplications and equivalent constructions. 
Then, $F \bmod m = T \times 2^{-s} \bmod m$, and
It is convenient to view Lemma 1 as an algorithm as in Fig. 1 .
Proof By taking Eq. 1 modulo $2^s$, and modulo $m$, we get $F \bmod m = T \times 2^{-s} \bmod m$, and $F < T/2^s + m$. The inequalities follow immediately.
Proof of correctness (WW-MM) Start with $T = a \times b < m^2 < m \times 2^n$. Apply Lemma 1, $k$ times, using $r = (n - i \times s)$ in iteration $i$, for $i = 1, 2, \ldots, k$, and the updated value of $T$. After $k$ iterations we get $T < 2m$.
Step 7 reduces T modulo m.
Computational efficiency (WW-MM)
Step 1 requires an n-bit multiplication.
Step 3 requires the low half of an s-bit multiplication.
Step 4 requires an n-bit by s-bit multiplication. In iteration $i$, Step 5 adds an $(n + s)$-bit number to a $(2n - (i - 1)s)$-bit number.
Step 7 requires a (conditional) subtraction of an n-bit integer from an (n + 1)-bit integer. For side channel protection, the subtraction (of either m or 0) is always performed.
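The word-by-word flow whose cost is itemized above can be sketched in Python as follows. This is a minimal model reconstructed from the step descriptions in the text, not a transcription of the algorithm in Fig. 2; the name `ww_mm` and the variable names `k0` and `Y` are assumptions, and Python big integers stand in for the multi-precision arithmetic.

```python
def ww_mm(a, b, m, n, s=64):
    """Word-by-word Montgomery multiplication: a * b * 2^(-n) mod m.

    Assumes an odd modulus m < 2^n, 0 <= a, b < m, and n a multiple of s.
    """
    assert n % s == 0 and m % 2 == 1 and m < (1 << n)
    assert 0 <= a < m and 0 <= b < m
    k = n // s                        # number of s-bit words
    W = 1 << s
    k0 = (-pow(m, -1, W)) % W         # pre-computed s-bit constant: -m^{-1} mod 2^s
    T = a * b                         # one n-bit multiplication
    for _ in range(k):
        Y = ((T % W) * k0) % W        # low half of an s-bit multiplication
        T = (T + Y * m) >> s          # n-bit by s-bit multiply, add, drop the zero low word
    X = T - m                         # end reduction: here T < 2m
    return X if X >= 0 else T
```

Each loop iteration is one application of the reduction lemma with word size s, so after k iterations the accumulated factor is $2^{-ks} = 2^{-n}$.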
Remark 1
The WW-MM algorithm is well suited for architectures with an s-bit multiplier and adder (i.e., an s-bit ALU). Specifically, s = 64 is a natural choice for the 64-bit architectures (x86-64) that we study.
Remark 2
In the WW-MM algorithm (by Lemma 1), the value of
Thus, T only satisfies the condition T < 2m, which means that it is not necessarily reduced modulo m (there are examples where T > m). The conditional subtraction in Step 7 reduces T modulo m, and is called the end reduction (ER) step. This step is known to be problematic from the side channel perspective (see details below).
Almost Montgomery multiplications
For our implementation, we use almost Montgomery multiplication (AMM), which is a variant of MM (this variant is mentioned in [26] as "incomplete" MM). We prove the correctness of the algorithm below. The proof also shows that AMMs have a "stability" property (like MMs), and can therefore be used in a similar way, for modular exponentiation.
Proof of correctness (WW-AMM) We use $B = 2^n$ here. Start with $T = a \times b < 2^{2n}$. Apply Lemma 1, $k$ times, for $i = 1, 2, \ldots, k$, using $r = (n - i \times s)$ and the updated value of $T$ in iteration $i$. After $k$ iterations, $T < 2^n + m$.
Step 7 guarantees $X < 2^n$.
Remark 3 AMM and MM are almost identical, where the only difference is in the ER step (Step 7 in Fig. 2, both panels). For MM, the condition (whether the result is $T$ or $X = T - m$) can be determined only by knowing the sign of $X = T - m$. Thus, $X$ needs to be computed first (at least up to a point where its sign is evident). By contrast, AMM enjoys a simpler condition check, based just on the carry-out bit of the last addition in Step 6, with the further advantage that the value of this carry bit already determines whether the output is $T$ or $T - m$. This can simplify the computations, although side channel resistance requires that the subtraction is always performed. We point out that in an environment where side channel resistance is not required, AMM has the performance advantage of allowing the subtraction to be skipped when appropriate.
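To make the difference in the ER step concrete, here is a minimal Python sketch of a word-by-word AMM, written from the description in the text rather than from Fig. 2. The name `ww_amm` is hypothetical. Inputs and outputs are only required to be below $2^n$, and the end reduction is decided by the carry-out bit (bit n of T) alone.

```python
def ww_amm(a, b, m, n, s=64):
    """Almost Montgomery multiplication (sketch).

    Returns X < 2^n with X congruent to a * b * 2^(-n) modulo m.
    Assumes an odd modulus m < 2^n, 0 <= a, b < 2^n, and n a multiple of s.
    """
    assert n % s == 0 and m % 2 == 1 and m < (1 << n)
    assert 0 <= a < (1 << n) and 0 <= b < (1 << n)
    W = 1 << s
    k0 = (-pow(m, -1, W)) % W         # -m^{-1} mod 2^s
    T = a * b                         # T < 2^(2n)
    for _ in range(n // s):
        Y = ((T % W) * k0) % W
        T = (T + Y * m) >> s
    carry = T >> n                    # 0 or 1, because T < 2^n + m < 2^(n+1)
    return T - carry * m              # subtract m only when the carry bit is set
```

Because the output is again below $2^n$, it can be fed directly into the next AMM, which is the "stability" property that the exponentiation relies on.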
Remark 4
If we assume $2^{n-1} < m < 2^n$, then the output ($X$) of AMM can be fully reduced modulo $m$ by a single (conditional) subtraction of $m$ (because $2^n - m < m$). Obviously, such subtraction should be done once, after the exponentiation flow. From the proof of Lemma 1, we see that the computation $h = \mathrm{AMM}(h, 1)$ (Step 9 of Algorithm 6 in Fig. 4) outputs a value which is smaller than $m + 1$. Since in our context the modulus $m$ is a prime, and we assume that the exponentiation base is nonzero, it follows that the result ($h$) is already reduced modulo $m$. So, in fact, the final subtraction step (Step 10) can be ignored.
The two-step folding AMM (from [7])
Reference [7] describes an algorithm for computing a 512-bit AMM, extending the Montgomery-Svoboda method [5]. In the particular parameters setting of [7] the algorithm reduces the 1,024-bit integer $T = a \times b$ to 768 bits ("folding"), then to 640 bits (second folding), and follows with an almost Montgomery reduction to 512 bits. Proper corrections compensate for "carry" bits (overflows) chopped away during the reductions. Figure 3 generalizes the algorithm of [7] to a general parameters setting, followed by a proof of correctness (missing from [7]).
Proof of correctness (TSF-AMM) Steps 3-5 reduce $X$ from $8s$ bits to $6s$, possibly modifying it modulo $m$ (by $2^{6s} \bmod m$), and accumulate cf1 to correct the final result, modulo $m$. Similarly, Steps 6-8 reduce $X$ from $6s$ bits to $5s$, possibly modifying it modulo $m$ (by $2^{5s} \bmod m$), and accumulate cf2 to correct the final result modulo $m$. For Steps 9-13 we apply Lemma 1 with $r = 0$ and $T < 2^{5s} = 2^{n+s}$, remaining with a number bounded by $2^n + m$. Step 14 compensates for chopping off carry bits in Steps 5.2, 8.2, and 12.2, to obtain a number which is congruent, modulo $m$, to $a \times b \times 2^{-s}$, and bounded by $X < 2^n + 2m$. Steps 15 and 16 (assuming $2^{n-1} < m < 2^n$) assure that the output is smaller than $2^n$, satisfying the required post conditions.
Computational efficiency (TSF-AMM)
Step 1 requires a single multiplication of n-bit integers.
Step 4 requires the multiplication of 4s-bit by 2s-bit integer and an addition of two 6s-bit integers.
Step 7 requires the multiplication of 4s-bit by s-bit integer and an addition of two 5s-bit integers.
Step 10 requires the low half of the product of two s-bit integers.
Step 11 requires multiplication of 4s-bit by s-bit integer and an addition of two 5s-bit integers.
Step 14 requires the addition of two 4s-bit numbers. Steps 15, 16 require the (conditional) subtraction of two 4s-bit integers.
Remark 5 Reference [7] asserts that the TSF-AMM computes $a \times b \times 2^{-128} \bmod m$ ($n = 512$, $s = 128$), but this is incorrect. One counterexample is $m = 2^{511} + 111$, $a = 2^{511} + 110$, $b = 2^{510} + 53$, where the output exceeds $m$. Figure 3 gives the correct bound.
Remark 6 TSF-AMM implicitly assumes $2^{n-1} < m < 2^n$ (Steps 15-16). In this case, a single conditional subtraction of $m$ suffices to reduce the output modulo $m$. Reference [7] uses two successive conditional subtractions after the exponentiation sequence, but the second one is redundant. By Remark 4, the first subtraction is also redundant.
Comparing the WW-AMM and the TSF-AMM algorithms
Our first step in optimizing modular exponentiation is determining the preferred alternative between TSF-AMM and WW-AMM (or WW-MM). We start with a few general observations.
Remark 7
The "Almost" MM algorithm assumes a lower bound on the modulus (in our case, $2^{n-1} < m < 2^n$). The classical MM algorithm does not require such an assumption.
Remark 8 Both TSF-AMM and WW-AMM are almost MMs, but their outputs are not equal (because they use a different Montgomery parameter).
Remark 9
By construction, the TSF-AMM algorithm is inherently an AMM. On the other hand, WW-AMM can be easily turned into WW-MM (and vice versa).
Remark 10
The step (a × b) is common to both algorithms.
Remark 11 WW-AMM and TSF-AMM require a different number of pre-computed values (to be resident in the cache for a real implementation). For n-bit operands, WW-AMM requires only one n-bit and one s-bit (s = 64 in our context) pre-computed values. The TSF-AMM algorithm requires a table (table in Fig. 3 ) with eight n-bit pre-computed values, two n-bit constants M1, M2, and an (n/4)-bit value k1.
Remark 12 If code size and simplicity are considerations, then WW-AMM is obviously preferable over TSF-AMM. Furthermore, note that WW-AMM can be modified to use $s = n$ and $k = 1$ (i.e., one big "word" and then only one iteration). In this case, an implementation can use only one function, for computing an n-bit multiplication, which would be called three times. We found that such an implementation yields simple code, but is slower than our optimized WW-AMM.
Since we are interested in performance on x64 architectures, it is convenient to view the operands as "multi-precision" numbers with 64-bit digits and count single precision operations (additions and multiplications). Table 1 lists the steps for both algorithms, for 512-bit operands (for WW-AMM, we averaged the count for addition of variable length operands). The straightforward count shows that TSF-AMM involves 131 single precision multiplications, whereas WW-AMM has 136. TSF-AMM involves about 417 single precision additions while WW-AMM involves around 392 (we approximate 3 single precision addition operations per single precision multiplication).
However, this naïve count does not necessarily (at least not directly) reflect the resulting performance (for example, because fetching constants from the table (table in Fig. 3) in an inherently protected way is an additional overhead associated with TSF-AMM). In practice, the performance depends heavily on the code's efficiency and optimization, and also on the architecture that it runs on. For example, software optimization can fuse the multiplication and the reduction steps into an efficient sequence of multiply-add operations, thus reducing the overall number of addition operations (for both algorithms). Therefore, a true comparison should be based on measuring the performance of fully optimized codes, on a given architecture. These results are reported in Sect. 8, and indicate that the WW-AMM is the faster alternative. Therefore, we selected the WW-AMM as the preferred algorithm.
Protecting AMMs and MMs against software side channel analyses
To make an implementation of WW-MM (or WW-AMM) inherently protected, we focus on the end reduction (ER) step (Step 7 in Fig. 2 ) of the algorithm. There are three considerations to guarantee. The obvious one is making the execution time independent of the ER step, and this implies that the subtraction (X = T − m) must always take place. The second consideration is making the implementation branch-free. Finally, it is also required to guarantee that the memory access patterns of the implementation leak no information about the ER step. We show here that achieving inherent protection can be subtle, by explaining why OpenSSL's WW-MM implementation is not inherently protected.
OpenSSL's WW-MM implementation is not inherently protected
OpenSSL uses the WW-MM algorithm, where the relevant function is bn_mul_mont (in crypto/bn/asm/x86_64-mont.pl).
The side channel-protected implementation variant is selected, by default, via the "BN_FLG_CONSTTIME" flag. We review only the ER step. The following code snippet from bn_mul_mont shows how the borrow bit from the subtraction (X = T − m) is used for delivering the appropriate output of the algorithm.
In the ".Lcopy" loop (lines 8-13 above), the borrow bit is used for loading either a Qword ($T[i]$) of $T$ or a Qword ($X[i]$) of $X$ into register rax (line 9). Then, the content of rax is written into the location of $X[i]$, and zero is written into the location of $T[i]$ (lines 10-11). This either refreshes $X$ and zeros $T$, or copies $T$ onto $X$ and zeros $T$.
This flow computes X and T unconditionally, and is branch free. However, this implementation is not inherently protected because knowledge of the read access pattern (i.e., the code reads either from T or from X ) exactly reveals the information that needs to be concealed.
We clarify that this observation does not imply that there is a practical vulnerability, or that the implementation is unprotected. Observe that the code writes, unconditionally, both X and T (it zeros T ) inside the loop. Therefore, it can be argued that in order to exploit the borrow-dependent reads, the Spy needs to instrument itself with resolution less than the ".Lcopy" loop turnaround, and the latency of RDTSC is too high for that [21] . However, the advantage of an inherently protected implementation is that such argumentation is not even needed. We therefore show here an inherently protected and efficient implementation of WW-MM/WW-AMM.
An inherently protected implementation of MM
For the ER step, we determine if the output is $T$ or $X = T - m$, according to the sign of $X$. We hold the value (fixed in our context) of $(-m)$ in a memory location L1. When the additions (Step 6; Fig. 2, left panel) finish, we store $T$ in memory location L2, while also adding $T$ (unconditionally) to memory location L1. This places $T$ in L2 and $X = T - m$ in L1, while the carry-out bit of the last addition indicates which location holds the desired output. We then load Qwords from L1 and L2 into registers reg1 and reg2, and use the conditional move instruction CMOVE reg1, reg2 to place the desired result in memory location L1. This is repeated for all the Qwords in locations L1 and L2. The following code snippet shows the proposed implementation (for 512-bit operands).
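The data flow of this ER step can be modeled in Python as below. This is a sketch of the access pattern only, under several assumptions: the function and helper names are hypothetical, 64-bit limbs are stored in Python lists, and the conditional move is emulated with a bit mask. Python itself gives no timing guarantees, so the sketch illustrates the logic, not a protected implementation.

```python
MASK64 = (1 << 64) - 1


def limbs(x, count):
    """Split a non-negative integer into `count` 64-bit limbs, least significant first."""
    return [(x >> (64 * i)) & MASK64 for i in range(count)]


def end_reduction_mm(T, m, k):
    """Model of the MM end reduction for 0 <= T < 2m and m < 2^(64k); returns T mod m."""
    count = k + 1                                 # T may need one bit more than m
    L1 = limbs((1 << (64 * count)) - m, count)    # location L1 initially holds -m
    L2 = limbs(T, count)                          # location L2 holds T
    carry = 0
    for i in range(count):                        # unconditional addition: L1 <- L1 + T
        total = L1[i] + L2[i] + carry
        L1[i] = total & MASK64
        carry = total >> 64                       # final carry is 1 exactly when T >= m
    mask = (-carry) & MASK64                      # all ones: keep X = T - m; zero: keep T
    for i in range(count):
        reg1, reg2 = L1[i], L2[i]                 # both locations are always read
        L1[i] = (reg1 & mask) | (reg2 & ~mask)    # stands in for the conditional move
    return sum(word << (64 * i) for i, word in enumerate(L1))
```

The point of the construction is that the sequence of loads and stores is the same whichever of the two values is selected.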
An inherently protected implementation of AMM
We use the following property of AMM: the carry out bit from the last addition operation of Step 6 determines whether the output should be $T$ or $T - m$. We first set register reg0 to the value "$2^{64}$ minus carry out bit". Then, we load a Qword of $T$ into register reg1, and a Qword of $m$ into register reg2, write [reg2 AND reg0] into reg2, subtract reg2 from reg1, and move reg1 to the memory location where we want the result. This subtracts, correctly, either a Qword of $m$ or a zero, from $T$. No branches are used, and the memory access pattern is independent of the carry out bit, thus achieving an inherently protected implementation. The following snippet demonstrates the implementation.
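A Python model of this masked subtraction is sketched below. The function and helper names are hypothetical, and the mask is formed as the two's complement negation of the carry bit (all ones when the carry is 1, zero when it is 0), which is one possible reading of the description above. As with any Python model, it illustrates the branch-free logic and uniform access pattern, not actual timing behavior.

```python
MASK64 = (1 << 64) - 1


def to_limbs(x, count):
    """Split a non-negative integer into `count` 64-bit limbs, least significant first."""
    return [(x >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(words):
    return sum(word << (64 * i) for i, word in enumerate(words))


def end_reduction_amm(t_limbs, carry_out, m_limbs):
    """Subtract either m or 0 from T, selected by the carry-out bit, without branching.

    t_limbs holds the low n bits of T; carry_out is bit n of T (0 or 1).
    """
    reg0 = (-carry_out) & MASK64          # mask: 0 if carry_out == 0, all ones otherwise
    out = []
    borrow = 0
    for t_word, m_word in zip(t_limbs, m_limbs):
        reg1 = t_word
        reg2 = m_word & reg0              # a Qword of m, or zero
        diff = reg1 - reg2 - borrow
        borrow = (diff >> 64) & 1         # 1 if the subtraction wrapped around
        out.append(diff & MASK64)
    return out                            # a final borrow cancels against the carry-out
```

For instance, with n = 128, m = 2^128 - 159 and T = 2^128 + 5, the low limbs encode 5, the carry-out is 1, and the result is T - m = 164.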
Improved w-ary exponentiation algorithm using AMMs
The w-ary modular exponentiation using MMs and AMMs
We use the w-ary exponentiation algorithm (see e.g., [16] ). The advantage of this algorithm (e.g., compared to a sliding window exponentiation) is that the flow is branch free, and is independent of the secret exponent bits (although, as we discuss below, other security considerations are required). Figure 4 (left panel) shows the w-ary modular exponentiation using MMs, as it is implemented by OpenSSL (1.0.0e). The right panel shows a flow that uses AMMs. The computational cost of w-ary exponentiation with window size w, is
OpenSSL's w-ary modular exponentiation
where the relation between n, k, and w, is the following:
We use a "Store/Retrieve" notation to describe the cost of accessing the table (A in Fig. 4) , and relate to this cost below. Special Store and Retrieve are required, at additional cost, in order to make the implementation inherently protected.
For $n = 512$, the choice $w = 5$ is considered optimal. It requires a In total, the modified algorithm saves $w$ AMSQRs, 2 AMMs and replaces $2^{w-1}$ AMMs by the faster AMSQRs. Obviously, the same optimizations can be applied to a w-ary exponentiation based on MMs. Figure 5 shows the revised algorithm.
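For orientation, a generic fixed-window (w-ary) exponentiation built on AMMs can be sketched in Python as follows. This is not the algorithm of Fig. 4 or Fig. 5: the function names, the way the table is built (even entries by squaring), and the initialization from the top window are assumptions chosen to mirror the savings mentioned in the text. The table is indexed directly here, so this sketch omits the protected Store/Retrieve.

```python
def amm(a, b, m, n, s=64):
    """Almost Montgomery multiplication: returns X < 2^n, X = a*b*2^(-n) (mod m)."""
    W = 1 << s
    k0 = (-pow(m, -1, W)) % W
    T = a * b
    for _ in range(n // s):
        T = (T + (((T % W) * k0) % W) * m) >> s
    return T - (T >> n) * m               # carry-based end reduction


def wary_exp(a, x, m, n, w=4):
    """a^x mod m for odd m with 2^(n-1) < m < 2^n, 0 <= a < m and 0 <= x < 2^n."""
    c2 = pow(2, 2 * n, m)                 # pre-computed 2^(2n) mod m
    table = [0] * (1 << w)
    table[0] = amm(c2, 1, m, n)           # Montgomery form of 1
    table[1] = amm(a, c2, m, n)           # Montgomery form of the base
    for i in range(2, 1 << w):
        if i % 2 == 0:                    # even entries: one squaring (AMSQR)
            table[i] = amm(table[i // 2], table[i // 2], m, n)
        else:                             # odd entries: one multiplication (AMM)
            table[i] = amm(table[i - 1], table[1], m, n)
    windows = -(-n // w)                  # ceil(n / w) windows of w bits
    mask = (1 << w) - 1
    h = table[(x >> (w * (windows - 1))) & mask]   # start from the top window
    for j in reversed(range(windows - 1)):
        for _ in range(w):
            h = amm(h, h, m, n)           # w squarings per window
        h = amm(h, table[(x >> (w * j)) & mask], m, n)
    h = amm(h, 1, m, n)                   # leave the Montgomery domain; h <= m here
    return h - m if h >= m else h         # kept for generality (see Remark 4)
```

The same sequence of squarings and multiplications is executed for every exponent of a given length, which is the property that makes the fixed-window flow attractive for side channel protection.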
Inherently protected w-ary exponentiation with reduced cost
Cache based side channel attacks are a recent threat to software implementations of cryptographic algorithms. Due to such vulnerabilities (e.g., [20]), modular exponentiation code needs to be written in a way that its memory access patterns (at the granularity of a cache line) do not leak secret information. This requires a special method for storing (in cache) and retrieving values from Table A (see Fig. 4).
For $n = 512$ and $w = 5$, the table holds $2^w = 32$ values, each of 512 bits. OpenSSL tackles the problem by scattering the bytes of each value at addresses spaced by $2^w$ bytes. Gathering a 512-bit value from the scattered table involves 64 move operations (but since all cache lines are accessed, implicit dependency on the exponent bits is avoided). For platforms where the cache lines consist of 64 bytes (the more common case), this implementation supports window sizes of up to $w = 6$ (if the cache lines consist of 32 bytes, the implementation supports window sizes of up to $w = 5$).
Reference [7] proposes a useful optimization that is tailored to platforms with cache lines of 64 bytes and the choice $w = 5$. The 32 values of the table are split into 16-bit "words", which are stored at addresses spaced by $2^w$ words (i.e., $2 \times 2^w$ bytes). Based on measuring the high cost of the side channel store/retrieve protection, we optimized this method further. We choose a window size of $w = 4$, obtaining a table of only $2^w = 16$ 512-bit values. This allows for scattering the 16 values in 32-bit "dwords" with spacing of $2^w$ dwords (i.e., $4 \times 2^w$ bytes). The choice $w = 4$ requires 144 AMMs (9 more than with $w = 5$). On the other hand, retrieving a value from the table requires only 16 move operations, which is half the number of moves involved with the method of [7] and a quarter of the number of moves used by the OpenSSL implementation. In addition, the reduced table size with $w = 4$ saves 1,024 bytes (16 cache lines) in the first level cache, compared to $w = 5$. The description (code snippet) of our proposed Store/Retrieve method is illustrated in Fig. 6.
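The dword-scattered layout for w = 4 can be modeled in Python as below. This is a sketch of the indexing scheme only, not the code of Fig. 6: the names `store`, `retrieve` and `cache_lines_touched` are hypothetical, the table is a flat list of 32-bit entries, and it is assumed to start at a 64-byte aligned address.

```python
W_BITS = 4
ENTRIES = 1 << W_BITS            # 16 table values
DWORDS = 512 // 32               # each 512-bit value is split into 16 dwords
MASK32 = (1 << 32) - 1


def store(table, index, value):
    """Scatter a 512-bit value: dword i of entry `index` goes to slot i * 16 + index."""
    for i in range(DWORDS):
        table[i * ENTRIES + index] = (value >> (32 * i)) & MASK32


def retrieve(table, index):
    """Gather a 512-bit value with 16 dword reads, one from each 64-byte cache line."""
    value = 0
    for i in range(DWORDS):
        value |= table[i * ENTRIES + index] << (32 * i)
    return value


def cache_lines_touched(index, line_bytes=64):
    """Cache line numbers read by retrieve(), assuming a 64-byte aligned table."""
    return sorted({(4 * (i * ENTRIES + index)) // line_bytes for i in range(DWORDS)})


table = [0] * (ENTRIES * DWORDS)                 # 1,024 bytes of storage in total
values = [(j + 1) * (2**500 + 12345) for j in range(ENTRIES)]
for j, v in enumerate(values):
    store(table, j, v)
```

With this layout, cache line i holds dword i of all 16 values, so every retrieval reads all 16 lines exactly once, regardless of which index is requested.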
Results
This section provides the performance results of our study.
Cycles count: measurements methodology
The experiments were carried out on two processors: the previous generation 2010 Intel ® Core™ processors (specifically, Intel Core ® i5-750) and the latest 2nd generation Intel ® Core™ processor (specifically, Intel Core ® i5-2500).
Runs were carried out on a system where the Intel ® Turbo Boost Technology, the Intel ® Hyper-Threading Technology, and the Enhanced Intel Speedstep ® Technology were disabled. Each measured function was isolated, run 25,000 times (warm-up), followed by 100,000 iterations that were clocked (using the RDTSC instruction) and averaged. To minimize the effect of background tasks running on the system, each such experiment was repeated five times, and the minimum result was recorded. All reported cycles count performance numbers were obtained with the same measurement methodology.
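The structure of this measurement procedure (warm-up, timed batch, average, minimum over repetitions) can be sketched in Python as follows. The function name `measure` is hypothetical, and `time.perf_counter_ns` is used as a stand-in for the RDTSC instruction, so the result is in nanoseconds per call rather than in cycles.

```python
import time


def measure(func, warmup=25_000, iterations=100_000, repeats=5):
    """Return the minimum, over `repeats` experiments, of the average time per call (ns)."""
    best = None
    for _ in range(repeats):
        for _ in range(warmup):           # warm-up runs are not timed
            func()
        start = time.perf_counter_ns()
        for _ in range(iterations):       # timed batch
            func()
        average = (time.perf_counter_ns() - start) / iterations
        if best is None or average < best:
            best = average                # keep the minimum to suppress background noise
    return best


# Example: time a 512-bit modular exponentiation using Python's built-in pow.
modulus = 2**512 - 569                    # an arbitrary odd 512-bit modulus
base, exponent = 2**511 + 12345, 2**511 + 6789
ns_per_call = measure(lambda: pow(base, exponent, modulus),
                      warmup=10, iterations=50, repeats=3)
```

Taking the minimum of the averaged runs, rather than their mean, reduces the influence of occasional interference from other tasks on the system.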
Cycles count for 512-bit AMM/AMSQR and 512-bit modular exponentiation
AMM and AMSQRs For the performance of WW-AMM/AMSQR we measured our new optimized implementation. For the performance of TSF-AMM we isolated the mont_mul_a3b and sqr_reduce functions from [27]. For comparison to OpenSSL we isolated its WW-MM implementation.
512-bit modular exponentiation
We report the performance of 512-bit modular exponentiation, comparing four implementations. For OpenSSL, we measured the BN_mod_exp_mont function, with and without the "constant time" mitigation (with the constant time flag, the function calls the OpenSSL function BN_mod_exp_mont_consttime). In both cases, OpenSSL uses a WW-MM based w-ary exponentiation with the (default) window size $w = 5$. We also measured the mod_exp_512 function of [27], which is a hand crafted optimized implementation using TSF-AMS and a w-ary exponentiation with $w = 5$. For our implementation, we report our optimized implementations of the WW-AMM algorithm, using the w-ary exponentiation with $w = 4$. Table 2 summarizes the results for 512-bit operands. It demonstrates that our AMM/AMSQR implementation is consistently the fastest alternative. For example, on the previous generation 2010 Intel® Core™ processor it is 1.45 times faster than the OpenSSL implementation, and 1.11 times faster than the TSF-AMM of [27]. Obviously, AMSQRs are faster than AMMs. On the 2nd generation Intel® Core™ processor, all three implementations improve by more than 30 %.
The effect of the Store/Retrieve optimization The results in Table 3 demonstrate the benefit of our Store/Retrieve optimization. To isolate its effect, we compared the performance of 512-bit modular exponentiation with OpenSSL 1.0.0e, and with two variants obtained by applying the simple changes needed to modify the Store/Retrieve implementation (changes were made in the file bn_exp.c). Our optimization achieves the fastest results. The results of the unprotected modular exponentiation show the cost of the mitigations (∼5-7 % degradation). Note that a change in the Store/Retrieve operations, introduced straightforwardly into OpenSSL, achieves a noticeable performance gain.
Table 4 summarizes the results for 1,024-bit operands. Since a 1,024-bit TSF-AMM implementation is not publicly available, we compare our implementation only to that of OpenSSL 1.0.0e. As for the case of 512-bit operands, our implementation is significantly faster on both processors.
The OpenSSL distribution has a built-in test utility, which includes speed tests for various functions. This speed test performs RSA signing (=RSA decryption) operations for 10 seconds and reports the average number of "RSA signs per second". Unlike the cycles count results reported above, these performance numbers also depend on the processor's frequency. These are practically "bottom line results" because they account for the two exponentiations, the CRT recombination, base blinding, and other overheads. They represent the number of new SSL/TLS connections per second that a server can accept. To measure the results, we integrated our RSAZ implementation into OpenSSL, obtained the faster version, and measured the performance using "openssl speed rsa1024" and "openssl speed rsa2048".
The results are shown in Table 5 . They demonstrate a significant speedup factor achieved by our implementation, and also illustrate the relative speedup across the processor generations.
Another result which is worth mentioning is the following. We prepared and measured a version of RSAZ that uses the standard MM/MSQR instead of AMM/AMSQR, and the performance difference is less than 1 %.
In addition, we generated a version of OpenSSL that integrated only the AMM/AMSQR functions (called from OpenSSL's mont_mul() function), changed the window size parameter to $w = 4$, and added optimized gather/scatter functions. These partial optimizations were sufficient to achieve a significant portion of the overall speedup. For example, for RSA1024, we obtained a speedup factor of 1.46, compared to OpenSSL 1.0.0e (counting 5,320 signs/s).
Discussion and analysis of the results
Explaining the performance differences between the different processors The reported results demonstrate that RSA computations are faster on the 2nd generation Intel® Core™ processor than on the previous generation Intel® Core™ processor. This speedup is mainly due to the improved performance of the 64-bit multiplication (MUL) and add-with-carry (ADC) instructions, as follows: the latency of MUL, in the 2nd generation Intel® Core™ processor, is reduced from 9 cycles (on the previous generation processor) to 4 cycles, and the latency of ADC (with immediate = 0) is reduced from 2 cycles to 1 cycle, while its throughput doubled.
In addition, the 2nd generation Intel ® Core™ processor has a new feature, called the "Decoded Instruction Cache" (see [12] ), which allows it to cache decoded instructions and execute them faster when they are re-invoked. An algorithm can be optimized by making its code stay resident in this cache.
Explaining why OpenSSL has a slower implementation The reported results show that OpenSSL (1.0.0e) is significantly slower than the optimized RSAZ, although both implementations use, basically, similar primitives (WW-MM and WW-AMM). To explain these performance differences, we suggest the following main reasons: (a) OpenSSL has no optimization for squaring, because its WW-MM function interleaves the multiplication and the reduction steps. This leads to a less efficient WW-MM implementation; (b) the gathering/scattering strategy (for the "constant time" w-ary exponentiation) is not efficient (perhaps because it is designed to capture a general window size and modulus size, and is not optimized for $n = 512$); (c) the big-number "multiply-add" implementation and the w-ary exponentiation flow are more efficient in RSAZ (see details on big-number squaring in [9]); (d) the RSAZ implementation is optimized for two specific (important) key sizes, namely $n = 512$ and $n = 1,024$, while OpenSSL's code is more general. In addition, we also point out that OpenSSL's implementation optimizes not only performance, but also code simplicity, readability, maintainability, portability, and generality. These incur some overheads, which an optimized implementation can avoid.
The future OpenSSL version We mention here that OpenSSL already has a "development branch" [19]. This version has integrated several improvements from RSAZ (based on an earlier version of this paper [10], and personal discussions with their development team [21]), while adhering to the generality requirements, constraints, and the coding style of this library. This version will soon become official. This implementation gives up some portion of the obtainable speedup of RSAZ for the sake of generality and maintainability, as shown in Fig. 7.
Fig. 7 The performance of OpenSSL's development branch [19], versus OpenSSL 1.0.0e and RSAZ. The performance was measured using the 'openssl speed' utility on a 3.4 GHz Intel® Core™ i7-2600K CPU and reported in sign/sec
Finally, we point out that under some conditions the ER step in Montgomery multiplications can be simply skipped (see [11, 23, 25] for details). To use this property, the modulus needs to be shortened by two bits. In our context, this requires a change in the RSA primes generation (which OpenSSL generates), to satisfy 2 n−1 < P, Q < 2 n . With such shorter primes, removing the ER step makes the 512-bit modular exponentiation ∼ 4 % faster.
At the level of the "primitives", we identified the WW-AMM (and also WW-MM) method as the preferred algorithm for modular multiplication. At the exponentiation level, we reduced the cost of the associated side channel protection, and proposed a few improvements to the w-ary exponentiation. The combination of these, and an optimized code led to the new RSAZ implementation, offering a speedup factor of more than 1.6x over the current OpenSSL version (1.0.0e).
The RSAZ implementation can be seamlessly integrated into the OpenSSL library, as demonstrated in [8]. As per a request from the OpenSSL Development Team, the RSAZ code was contributed as a patch [8], for integration as a whole or in parts, for the benefit of the open source community. Some of the improvements, proposed in an earlier version of this paper, have already been incorporated into the next version of OpenSSL [19], achieving a gain factor of 1.54×/1.49× for RSA1024/RSA2048, respectively, on the previous generation Intel® Core™ processors, and 1.28×/1.47× on the 2nd generation Intel® Core™ processors.
