To increase computation throughput, general purpose Graphics Processing Units (GPUs) have been leveraged to accelerate computationally intensive workloads. GPUs have been used as cryptographic engines, improving encryption/decryption throughput and leveraging the GPU's Single Instruction Multiple Thread (SIMT) model. RSA is a widely used public-key cipher and has been ported onto GPUs for signing and decrypting large files. Although performance has been significantly improved, the security of RSA on GPUs is vulnerable to side-channel timing attacks and is an exposure overlooked in previous studies.
INTRODUCTION
The RSA algorithm [27] is a public-key cipher widely used in digital signatures and Internet protocols, including the Security Socket Layer (SSL) and Transport Layer Security (TLS). RSA entails 32:2 C. Luo et al. excessive computational complexity compared with symmetric ciphers. For scenarios where an Internet domain is handling a large number of SSL connections and generating digital signatures for a large number of files, the amount of RSA computation becomes a major performance bottleneck. With the advent of general-purpose GPUs, the performance of RSA has been improved significantly by exploiting parallel computing on a GPU [11, 22, 29, 32] , leveraging the Single Instruction Multiple Thread (SIMT) model.
However, the security of RSA on a GPU has not received adequate attention. Side-channel vulnerability (i.e., private key retrieval through side-channel analysis) is a major concern on cryptosystems. Meanwhile, various side channels, including power, electromagnetic emanation, and timing, have been exploited for implementations of RSA on other computing platforms, such as CPUs, FPGAs, ASICs, and MCUs. Among these attacks, timing attacks have become a realistic cyber threat given that they are non-invasive and can be launched remotely. Many timing attacks of RSA deployed on CPUs are micro-architectural (cache-based) attacks [10, 25, 33, 34] . For example, in access-driven attacks such as PRIME+PROBE [25] and FLUSH+RELOAD [33] , a spy process first sets the cache into a known state, the victim RSA process executes, and during the run the spy measures the timing of its memory accesses to infer the cache access pattern of the victim and recovers the secret key. We anticipate that applying such cache attacks on a GPU could be very challenging, if not impossible, due to the high degree of noise resulting from the massive parallel execution model of a GPU. These devices launch thousands of concurrent threads, so the victim would not be a single process, but a kernel that runs across threads on Streaming Multiprocessors (SMXes). The cache access pattern would be a complex mix of many copies of the same kernel, and therefore the spy cannot easily attribute a particular cache access to a specific victim thread. Each SMX has its own sets of caches, so the only cache leakage possible is when the spy kernel and victim kernel are on the same SMX. However, the victim kernels may be thread-switched out of the SMX by the hardware scheduler, which attempts to maintain high utilization. This results in a significant amount of noise in the cache access pattern that the spy kernel observes.
An alternative timing attack is time-driven-exploiting key-dependent variations in computation time and treating variations in memory accesses as key-independent noise. Such timing attacks heavily depend on the RSA algorithm and associated optimizations. There are two common optimizations used in RSA algorithms: a sliding window exponentiation and the Chinese Remainder Theorem (CRT). A Montgomery multiplication [21] is the basic operation that is common in many RSA implementations. Kocher [19] first outlined the theoretical foundation for possible timing attacks in several cryptosystems. Dhem et al. [7] proposed a practical timing attack of RSA on a smart card in a bit-by-bit fashion. Their attack exploits the variability present when using a Montgomery multiplication as the timing leakage, which is also adopted in this work. However, massive concurrent RSA executions on a GPU make this particular timing side channel much more obscure, presenting a challenge for our attack. Our GPU RSA implementation also employs sliding-window exponentiation, which was not used in prior work [7] , making the attack based on recovering the exponent, window-by-window. Tóth proposed another timing attack under the assumption that the total number of extra reductions is known to the adversary [30] , which is not a practical attack scenario on a GPU. RSA encryptions using the Chinese Remainder Theorem (CRT), and protocols based on CRT-RSA, have also been attacked using timing leakage [1] [2] [3] [4] [5] 28 ]. Our preliminary analysis shows that there may be no difference in the attack strategy for CRT-RSA on a GPU versus a CPU. Therefore, this article targets an RSA implementation that uses sliding-window exponentiation.
In this article, we extend our previous work [20] including two contributions. First, we construct a new attack method with sliding window of variable length nonzero window optimization. Previously, we only targeted constant length nonzero window optimization, which is a simpler case to attack. Second, we build a success rate model to predict the effectiveness of the attack relative to the quality of the side-channel information.
The main contributions of this work include: (1) a hierarchical timing model for an RSA implementation on a GPU with Montgomery multiplication and sliding-window exponentiation, explicitly capturing various complex interactions in a massively parallel computing platform; (2) a successful correlation timing analysis attack based on the proposed timing model to extract the private key; (3) an effective error correction algorithm designed to detect and correct attack errors to improve the success rate; (4) a success rate analysis for success rate prediction with the quality of side-channel information obtained from the GPU.
The rest of the article is organized as follows: First, we discuss related work on GPU security in Section 2. In Section 3, we briefly introduce the RSA cipher, the optimizations, and a parallel implementation of RSA developed in CUDA and run on an NVIDIA GPU [11, 12] . In Section 4, we construct the timing model and propose a timing attack based on the model. We introduce the success rate analysis in Section 5 and present both empirical and theoretical success rates of attacks with different key sizes. In Section 6, we address the issues of attack error detection and correction. The experimental results are presented in Section 7. In Section 8, we propose effective countermeasures to protect RSA on GPUs. Conclusions are drawn in Section 9.
RELATED WORK
Naghibijouybari et al. [23] demonstrated the feasibility of constructing covert channels on a GPU by creating contentions on shared resources such as caches, functional units, and memory. The corunning trojan and spy can communicate covertly on the covert channel, forecasting the possibility of a side-channel attack. Because a covert channel relies on shared hardware resources (as well as co-running trojan and spy processes), such a side-channel attack would be classified as an accessdriven type attack. In our timing-driven attack, we do not need to introduce additional spy load on GPU hardware resources, which makes our attack easier to launch and harder to detect.
Later work by Naghibijouybari et al. [24] builds on their earlier results [23] , launching sidechannel attacks on a GPU. Sensitive information, including user input to an OpenGL graphic application and internal parameters from a neural network application, was extracted from the GPU. In contrast to our work, this prior work requires a co-running spy process, where we do not. We also have a very different objective, recovering the secret key from a cryptographic algorithm.
In prior work by Jiang et al. [13, 14] , two GPU-based timing-driven side-channel attacks are described. Our proposed work is distinctively different from them in two ways: First, in terms of timing leakage, our work exploits variations in computation time, while prior work exploited timing differences related to memory access, one due to the on-chip coalescing unit [13] , and the other due to bank conflicts on an on-chip shared memory [14] . Second, our work targets a different cipher, RSA, while the prior work targets AES implementations. For AES (a block cipher), the data memory address pattern leaks key information. RSA is computationally intensive, and we anticipate such memory-access-related timing channels are very weak. Instead, we utilize a computational timing side channel for RSA on a GPU.
BACKGROUND: RSA AND GPU IMPLEMENTATION
An RSA cipher uses a pair of keys, one public and one private. The public key (n, e) is used for encrypting the plaintext message M, and the private key, (n, d ) is used for decrypting the ciphertext C:
where n is the modulus, and e and d are the public and private exponents, respectively. The relationship between n, d, and e is that they have to satisfy constraints so the encryption and decryption are reverse operations to each other [27] . The key length (the bit length of C, M, d, and n) can be 512, 1,024, 2,048, or longer, providing different levels of security. Usually the public key exponent e is chosen to be a relatively small number with a small Hamming weight, while the secret key exponent d is much larger and requires many more bits. Therefore, the decryption process using the private key is 20 to 60 times slower than the encryption process, demanding acceleration for higher throughput [11] .
Sliding-window Exponentiation
To perform exponentiation of a large number, a simple binary method [18] performs squaring and conditional multiplications based on each bit of the exponent. One optimization method is to perform sliding-window exponentiation, where the exponent d is decomposed into a series of zero and nonzero windows, F i of length L(F i ), and the exponentiation is processed window-by-window.
Using sliding-window exponentiation, we can achieve a 21% improvement in performance on average [17] .
There are two ways of constructing a sliding window: applying a Constant Length Nonzero Window (CLNW) [16] or a Variable Length Nonzero Window (VLNW), without incurring significant performance overhead [26] . For CLNW, the nonzero window length is fixed, while the zero window can have different lengths. For VLNW, both nonzero windows and zero windows have variable lengths.
Assuming Jang et al.'s implementation [11] , CLNW has nonzero window lengths of 5, 6, and 7 for RSA-512, RSA-1024, and RSA-2048, respectively. The following example illustrates the partitioning of an exponent, with L = 4, from right to left: 1001 1101 00000 1011 0111 000 0001 0 0111 000 1011 Note that the exponent starts from most-significant non-zero window, and the value of the nonzero windows must be odd, while the length of the zero windows vary, and any zero window occurs between two constant-length nonzero windows.
For VLNW, the partitioning between zero and nonzero windows is slightly more complex. The nonzero window length has an upper bound, L max . It switches to a zero window when the next q bits are all zeros. The algorithm proceeds from the least significant bit to the most significant bit, followed by the status of ZW, NW, indicating a zero window and a nonzero window, respectively. ZW: Check the next bit. If it is zero, stay in ZW; else go to NW. NW: Next, check min(q, L max − L cur ) bits. L cur is the current window length. If all are zero, switch to ZW. If not, include the bits into the current window. If L cur = L max after including the new bits, check the incoming single bit: if zero, go to ZW; else go to a new NW.
The following example illustrates the partitioning of an exponent, with L max = 5, q = 2: 101 11101 00 101 10111 000000 1 00 111 000 1011 Note that any zero window occurs between two nonzero windows, and there may be multiple nonzero windows in a sequence. The nonzero windows will only be at odd value (not even). In Jang et al.'s implementation [11] , L max = 4, 5, 6, and q = 3, 2, 2, respectively, for RSA-512, RSA-1024, and RSA-2048. The left-most window will be zero-padded if its length is smaller than L.
The sliding-window exponentiation method for decryption is described in Algorithm 1, regardless of how the windows are partitioned. There are h windows of exponent d. To begin processing, the odd powers C w of the method have to be pre-calculated once (Lines 1-4). The computation loop iterates over the most significant window. During each window (iteration), there are a number of squares computed (Line 7), which is equal to the bit length of the window, followed by conditional multiplication (only if the window is non-zero) with the multiplicand determined by the window value, F i (Line 9).
ALGORITHM 1: Sliding-window Exponentiation
// squaring 8: if F i 0 then 9:
M ← M · C F i mod n // multiplication 10: end if 11: end for 12: return M
Montgomery Multiplication
Montgomery's algorithm [21] is an efficient algorithm for modular multiplication, the basic operation used in the RSA cipher. The multiplier and multiplicand are first transformed into the Montgomerized form before they go through the multiplication. The product is in Montgomerized form as well: a · b = MM(a, b) = a · b · r −1 mod n, where r is a power of 2, and co-prime with both n and n < r .
For decryption, the ciphertext comes in a Montgomerized form. The decryption goes through a series of Montgomery multiplications using the sliding-window algorithm discussed in Section 3.1. Inside a Montgomery multiplication, there is a conditional reduction operation dependent on the values of a and b. Because the extra reduction will take more execution time, this data dependency creates a timing side-channel attack surface. From prior work [28] , if a and b are equal and random, which makes it a squaring operation, the probability of the extra reduction is:
Prob(extra reduction in MM(a, a)) = n/3r ,
which is a constant, and determined by n and r only. If a is random, and b is a constant, the probability becomes:
i.e., dependent on values of b and r .
GPU Parallelization of RSA
NVIDIA GPU hardware features many Streaming Multiprocessors (SMXs), with each consisting of homogeneous cores and shared resources such as caches. In GPU programming, the workload is partitioned into a grid of blocks, with each block consisting of many threads. Threads from one block will only be executed on the same SMX. On CUDA cores, threads of one block are grouped into warps of 32 threads, executing instructions in a synchronized fashion. Whenever the execution path of threads in a warp diverges, only one path is executed at a time and the others have to wait. Control divergence negatively impacts program performance (i.e., the program runs slower), resulting in data-dependent time variations. The execution time is dependent on the message contents and the private key. Warps on the same SMX are competing for the same limited hardware resources, such as CUDA cores, warp schedulers, and shared memory. For warps on different SMXes, they run independently. We adopt a parallel RSA implementation developed in CUDA, as described by Jang et al. [11, 12] . Note that the for loop iterations in Algorithm 1 (Lines 6-11), when used for processing the exponent windows, are executed sequentially. This is because each iteration depends on the result of the previous iteration. Any parallelization is accomplished between messages and within messages. When processing a batch of ciphertext messages on a GPU, the messages are independent and decrypted in parallel while using the same secret key exponent. In addition, the Montgomery multiplications associated with one message's computations are also parallelized. Each message is represented by an array of s 64-bit words, and the multiplications are carried out by s threads, executed in cooperation. For example, to decrypt a single message in RSA-512, the s value is 8, i.e., it requires 8 threads to decrypt. A batch of 32 messages requires 256 threads in total, which are divided into 2 blocks, with a typical block size of 128 threads.
THE TIMING MODELS OF RSA ON GPUS AND A TIMING ATTACK
Given the parallel execution environment on a GPU, it becomes more challenging to predict the execution time versus a single-threaded execution on a CPU. Program execution includes both concurrent computation and blocking execution (e.g., for divergence in a warp). The execution time is not simply equal to the addition of all instructions anymore.
In this section, we introduce a hierarchical timing model for RSA decryption on a GPU, which accounts for various interactions during execution and is the basis for enabling our side-channel timing attack. Note that we choose the extra reductions in the Montgomery multiplication as the timing channel. The number of reductions depends on both the input message and each window of the secret exponent d. We start with a general linear timing model as follows:
where R is the number of extra reductions in the Montgomery multiplications, v is the unit execution time of one reduction, and O represents other key-independent execution time factors that follow a distribution when the input message varies. For a sequential implementation, R will simply be the total number of extra reductions performed in each window. With multiple messages being decrypted in parallel by many threads on the GPU, there are complex interactions between the threads through competition for shared resources and the specific GPU programming model. We build a hierarchical model of R according to the interaction at different levels of parallelism on a GPU. We consider fine-grained thread (message) level interaction as well as interactions at a warp level, an SM level, and finally across the entire GPU.
GPU Timing Model
We begin by considering only a single message being decrypted by multiple threads (e.g., 16 for RSA-1024, which are within the same warp). Since threads in a warp are synchronized, we can still view the decryption for one message as sequential execution. For each window of d, we calculate the reduction number, as shown in Lines 7-10 in Algorithm 1. The first for loop (Lines 2-4) is ignored here, because it is independent of the secret key d.
For the ith window
is always L when using the CLNW method. When F i is zero, its length can vary from 1 to the key length. Under the assumption that each bit of d is independent and uniformly distributed between 0 and 1, the longer the length of F i , the smaller the probability will be. We assume the length of the nonzero window is no longer than 32 bits (probability of 2.33 × 10 −10 ). We use r msд (i) to record the number of reductions of the squares for window i. The number of reduction numbers performed for the entire decryption of message, R msд , is the sum of r msд (i) over i.
Note, we do not include the extra reductions of the multiplication operation (Line 9 of Algorithm 1). Given that the probability of executing a reduction depends on the value of C F i , as expressed in Equation (3), there may be other windows with the same value of F i . This kind of dependency on an extra reduction in multiple windows will bias the attack result [7] , and an effective attack should not use the reductions associated with the multiplication.
Next, we consider running a warp using multiple messages in parallel. The number of messages in a warp depends on the key size. There are 4 messages in a warp for RSA-512, 2 for RSA-1024, and 1 for RSA-2048. Because threads in one warp are synchronized, the operations across different messages are also synchronized, (i.e., all the messages are processing the same window of size d at the same time). For one squaring operation in a window, if one or more messages have an extra reduction, the whole warp has to wait for it to finish, and the warp is considered to have an extra reduction. Thus, the number of reductions in a warp for a window r warp (i) is not simply the sum of the reductions used for squaring for all the messages.
To calculate r warp (i), we use a binary number, b msд (i, j), to record the extra reductions during execution, for the jth message in the ith exponent window. The length of b msд (i, j) is L(F i ), because in each window there are L(F i ) squaring operations, and each bit of b msд (i, j) is for the corresponding squaring. If the squaring of a message results in an extra reduction, this bit is 1; otherwise, it is 0. For each squaring operation, all the messages in a warp are synchronized (i.e., as long as there is one message resulting in an extra reduction, the warp (all the messages) encounters a delay). Therefore, all b msд (i, j) should have bit-wise OR operations over j, and the number of reductions in the warp, r warp (i) is the Hamming weight of the bit-wise OR result. The number of reductions across the warp for the entire decryption R warp is also the sum of r warp (i) over i.
For one SMX with multiple warps running, there is no synchronization between warps. When the total number of threads (number of warps × 32) is equal to or smaller than the number of CUDA cores on the SMX, these warps will run mostly in parallel. However, the reduction operation involves shared memory accesses, which will be processed in a serialized manner, and therefore all the warps are processed in a blocking way to some extent. In scenarios where the total thread number is much larger than the number of SMX CUDA cores, the warps have to be scheduled serially. Based on these considerations, we let the reduction number of one SMX, R smx , to be approximately equal to the sum of its warps' reductions.
For the complete decryption with multiple SMXes, we assume the SMXes are running independently, and the time of one decryption is determined by the slowest SMX. The number of reductions associated with one decryption on the GPU is R дpu , which is the maximum of all the SMX reductions, R smx .
Timing Model Verification
To verify our timing model, we run multiple experiments of RSA-1024 on an Nvidia K40 GPU. We set the block size to 128, such that each block will have 4 warps. We run experiments and vary the message number for one decryption from 1, 2, 8, to 64 with CLNW. We will have 1 message, 1 warp, 1 block (on one SMX), and 8 blocks (on 8 SMXes), respectively. For each experiment, we run the decryption 10K times with random messages, record the execution time of each decryption using the CPU's clock, and calculate the reduction number for the workload according to the timing model. Our results are shown in Figure 1 , with the execution time normalized and correlation coefficients (ρ) of execution time and reduction number shown at the top. The figure shows we obtain a higher linear correlation between the measured execution time and calculated reduction number for (a) and (b) when we have one message and one warp running, respectively. The points in (c) and (d) spread out more for multiple warps and multiple SMXes due to the approximation of the reduction number for one SMX. We also characterize the timing channel for the VLNW implementation, and the correlation results are similar.
Correlation Timing Attack
In this section, we propose an attack method for the RSA GPU implementation based on the timing models we built in Section 4. We assume the ciphertext message is random but known to the attacker. For each decryption run, the adversary only records the total execution time.
The secret key d is extracted one window after another, starting from the most significant one, following the same order as decryption, as shown in Algorithm 1. For the ith target window, the previous attack results are used to calculate the value of the intermediate variable M at the beginning of ith iteration. To avoid confusion with M in other steps in the algorithm, we denote it as M temp . For the first window, M temp is 1.
In the attack, we make a guess of the target window. Based on the guess, we calculate the corresponding number of reductions. We compute the Pearson correlation coefficient between the number of reductions and the observed timing information. To determine the correct value for a nonzero window or length for a zero window, we iterate over all possible guesses for the target window and pick the one with the largest correlation based on our attack results.
Attack CLNW.
We first describe the details of the attack of CLNW. To calculate the number of reductions for the ith window for one message, we assume there are at least two windows after the target window. Because the window could be zero or nonzero, we divide our guess into Side-channel Timing Attack of RSA on a GPU 32:9 two groups, the length of zero windows and values of non-zero windows, which differs from previous work [7] , which only works bit-by-bit. If the window is zero, its length, L x , is varying, and we tally the number of L x squarings i of M temp first. For the sliding-window algorithm, a zero window is always followed by a nonzero window. Therefore, the next window must be nonzero of length L, resulting in another L squaring in the next iteration, but without any multiplication in between due to the zero window. From another perspective, the operations can also be considered as L squaring operations, followed by L x squaring operations.
If the window is a nonzero window, we guess its value as V x (V x is an odd number). The following operations on M temp are L squaring operations and one multiplication by C V x . Beyond the multiplication, we can still infer the operations by checking the next one or two windows. If the next window is nonzero, the operation beyond the multiplication is another L squarings, because the length of the next nonzero window is L. If the next window is zero, the window after the next window must be nonzero, and we have more than L squaring operations after the multiplication, consisting of the squarings of the next zero window and L squarings of the next nonzero window.
In conclusion, the first group of operations on M temp are L squaring operations. After that, if the window is zero, the subsequent operations involve L x squarings; if the window is nonzero with a value of V x , the subsequent operations are multiplications with C V x and L squarings, as shown in Figure 2 .
When calculating the number of reductions, we exclude the first L squaring operations, because they are common to all guesses. For a zero window, we record the extra reductions of the next L x squarings in r zer o (L x , i, j) for each message, where i is the window index, and j is the message index. For a nonzero window, we record the extra reductions for the next L squaring operations in r nonzero (V x , i, j), where the value V x is used to compute the multiplications C V x .
We then subsequently calculate the number of reductions for the warps, SMXes, and entire GPU for the target window, using the timing model described in Section 4. Note that in our attack model, we are attacking window-by-window, and the calculated number of reductions is only for the current window. While in Section 4.2, the linear correlations are for the sum of the number of reductions across all windows. In our attack, we retrieve the private key window-by-window, where each window will have a lower correlation coefficient than the correlation across all h windows, as shown in Figure 1 . In our experiments, the average Pearson correlation of attacking one window is about 0.02 when using 8 messages in one RSA-1024 decryption running on an NVIDIA Tesla K40 GPU, for 100K traces. The correlation value is much lower than that in Figure 1 . However, it suffices to extract the correct key.
Attack VLNW.
We extend our previous work [20] by adding the analysis of RSA with VLNW. Attacking VLNW is more challenging than attacking CLNW, because the length of the nonzero window is no longer constant. To accommodate for this issue, we do not separate attacking zero and nonzero windows anymore. Instead, we combine the nonzero window with its preceding zero window as one attack unit as shown in Figure 3 . If a nonzero window is preceded with another nonzero window, itself is an attack unit. In the attack, we first target the length of one attack unit, L x , which equals to the length of the nonzero window, adding the length of the preceding zero window, if it exists. Then we target the value of the attack unit, V x , which is also the value of the nonzero window. By knowing the length and value of each attack unit, we can reconstruct the private key.
Here we have another observation about the length of an attack unit: it is at least q + 1 bits long (q is defined in Section 3.1). The (q + 1)-bit unit only happens when q zeros are followed by a single one bit. For other cases, the attack unit length is always longer than q + 1. During window partitioning (from right to left), if an NW switches to a ZW, there must have been q zeros, or the NW already has L max bits. And we assume, L max > q + 1.
To attack the ith unit, we assume the previous (i − 1) th units are already known. The intermediate value of M temp can be calculated at the beginning of the processing of the ith unit. We first guess the length of the unit L x by squaring M temp L x times and calculating the number of reductions for each message, the same as was done for CLNW. The value of L x is determined by the correlation attack, similar to the zero window attack for CLNW. Next, we guess the value of the unit V x , by multiplying the M L x temp by C V x and then squaring the result q + 1 times and calculating the number of reductions. Because the next attack unit has to be least q + 1 bits, we assume the computation would be correct if we have the correct guess of V x . By knowing L x and V x of the unit, we can recover the zero and nonzero windows and the private key.
The effort to launch a successful attack on VLNW is more than a successful attack on CLNW for two reasons: First, for each attack unit, we have to decide its length, which is prone to error, as we discuss in Section 6. Second, to decide the value of an attack unit, we calculate the reduction number from the q + 1 squarings, which is usually a small number. Therefore, the correlation when using the correct and wrong values is harder to differentiate.
SUCCESS RATE ANALYSIS
The ability to estimate the success rate of an attack would be useful when assessing the security of a system and making decisions on countermeasure implementation. In this section, based on the timing model and the correlation attack method presented in Sections 4 and 4.3, respectively, we build a probability model to predict the success rate of the attack. This success rate analysis extends our previous work [20] that the number of traces needed for a successful attack can estimated. Since the attack methods of CLNW and VLNW are similar, we only present an analysis for CLNW. We target our model for the case of single-warp execution: there are m messages being encrypted in a single warp. Single message execution is just a special case of one warp execution with m = 1. For the multiple warps and the multiple SMX cases, the prediction will be less accurate, considering the low correlation shown in Figures 1(c) and (d) . As a result, we only present the success rate analysis for single warp execution here, and we only analyze the attack of the CLNW shown in Section 4.3.1.
We rewrite the timing model of Equation (4)
where R is the number of reductions for the target window, v is the execution time of one reduction, C is a constant, and I is the noise, which follows a Gaussian distribution N (0, σ ) .
The attack methods for the nonzero and zero windows are different. We derive the success rates of both methods. To simplify our analysis, we assume we know whether the window under attack is zero or nonzero. This is a valid assumption, because when the proceeding window is zero, the current window under attack is nonzero.
We first consider attacking a nonzero window by guessing the value of V x . Assuming the correct window value is V x = h, the actual reduction number calculated from the timing model is R = R h . For other incorrect values of V x , we index h i , with 0 ≤ i < N k − 1, and N k is the number of guesses for the window, which is 2 L−1 , because V x is an odd number. The number of reductions, corresponding to the wrong guesses, is R i .
From the statistical success rate model proposed by Fei et al. [8] , the success rate of attacking one window is:
where Φ N k −1 is the cumulative distribution function of a N k − 1 dimensional standard Gaussian distribution; N is the number of traces; v and σ are the same parameters as in Equation (5). K is a three-way confusion matrix, whose element at index (i, j) is defined as:
κ is a confusion vector of size N k − 1, whose ith element is defined as:
To calculate K and κ, we need to know the distribution of R. With a nonzero window length of L, there will be L squarings, each of which can possibly generate one extra reduction. Based on the analysis of the timing model for single-warp execution, for one square computation to generate an extra reduction, it requires that any of the m messages' square computations has an extra reduction. We assume the message values are randomized, which means the probability of an extra reduction is n/3r , as indicated by Equation (2) . With n ≈ r , the probability is close to 1/3. If we assume messages are independent, then the probability of an extra reduction in the warp for one squaring is 1 − (1 − n/3r ) m . There are L squares computed, so the distribution of R is a binomial distribution, with parameters n = L and p = 1 − (1 − n/3r ) m : For different guesses of V x , the intermediate results are random. We assume R h , R i , and R j are independent. From probability theory,
As a result, we have K as a diagonal matrix of E[(R h − Ri) 2 ] × I , and κ has N k − 1 identical elements
From our experiments, we have found that v/σ = 0.061, 0.043, and 0.025 for RSA-512, RSA-1024, and RSA-2048, respectively. The success rates of attacking a nonzero window for RSA-512, RSA-1024, and RSA-2048 when running one warp are shown in Figure 4(a) . Our experiment results show that, as the key length increases, the success rate decreases. The trend is because as the key size increases, there will be more windows. Our attack targets one window at a time, while using the total execution time. Therefore the signal-to-noise ratio drops with larger key size. The predicted success rates track the empirical success rates well, with a little over-estimation of the effectiveness of the attacks.
When attacking a zero window, the window length, L x , is variable from 1 to the key length. The window length L x is also a random number of geometric distributions, with parameter p = 0.5.
Because the probability of the occurrence of a zero window that is longer than 32 is close to zero, we only make guesses between 1 and 32 during the attack. We assume the correct window length is L h , with the value between 1 and 32, and the corresponding number of reductions is R h . We index the other 31 guesses as L i , 0 ≤ i ≤ 30, and the corresponding number of reductions is R i . There is a difference from attacking the nonzero window. When a a wrong guess of V x is made, the calculated number of reductions is independent of the correct one. However, with a wrong guess of the L x , the calculated number of reductions is still partially correct. If L i < L h , all the computation of the reductions is correct but does not include the reductions beyond the L i squarings. 
The success rate is dependent on L h . The final success rate of attacking zero windows is a weighted average of the attacking different window lengths:
We plug in the distribution of R h − R i into Equations (7) and (8), K i, j and κ i can be calculated for window length L h , and the success rate SR L h using Equation (6) . Note the value of K and κ are dependent on the correct window length L h . The success rates for attacking zero windows are plotted in Figure 4(b) . Results show that the success rate of attacking zero windows is much lower than nonzero windows with the same number of traces, indicating that our ability to distinguish zero windows is much lower than that of nonzero windows. In the attack of a nonzero window, the wrong guess of a window value will yield a totally incorrect number of reductions. For a zero window, if the window-length guess is close to the correct window length, the reduction number calculation is also very close. As a result, it is much harder to differentiate the correct window length from those close to it.
ERROR CORRECTION
The inherent drawback of the proposed attack, due to the iterative processing shown in Section 4.3, is that the attack on the target window is highly dependent on attack results on the previous windows. If any of the previous guesses were wrong, the current attack may not succeed because of the calculation M temp and the reduction number will be wrong. As a result, the correlation coefficients after the first error will be much lower than those before, as can be observed in Figure 5 . Each point in Figure 5 is the maximum correlation coefficient of one window attack when we iterate over all possible values. This non-robust feature can be utilized to detect errors if there exists N consecutive window attacks with low correlation coefficients.
In our attack, the length of a zero window is much harder to attack. Because the number of extra reductions is roughly linear with the bit length of the zero window, the ability to differentiate provided by the correlation attack is fairly small. For example, when guessing the length, the correlations of the wrong length guesses is very close to that of correct ones, as shown in Figure 6 for the case of attacking a zero window, where the length of the window is 4. If a wrong guess ends up with the highest correlation, the error can only be detected in later windows. We define a parameter distinguishability D to indicate how likely an error can happen. The value of D for one window is defined as the normalized difference between the largest correlation and the second-largest correlation value. The lower D is, the more likely an error will occur. Based on these observations, we propose two error-correction methods: forward and backward methods.
Forward error correction is triggered when the distinguishability D of a target window is below a threshold D th . We need to reconsider those values that were guessed and did not produce the highest correlation. We proceed with tentative attacks on the next window multiple times by setting the target window's value to other possible values, one after another. For each of those tentative attacks on the next window, we obtain a maximum correlation coefficient. To pick the right value for the target window, we choose the one resulting in the largest maximum correlation in the next window.
Backward error correction is triggered when there are N consecutive window attacks with the correlation coefficients below C th . The attack rolls back to the first window of the N consecutive windows. This time, we choose the value for this window with the next highest correlation and then proceed. If this still results in an error and the window is attacked again, the next ranked guess should be chosen next. If all possible values for the window have been attempted and there is still an error, we move further backward to the previous window and reconsider other guesses.
Parameter selection: In our error-detection method, there are three parameters, D th , N , and C th . Their values affect the effectiveness and convergence speed of the attack. For D th , if it is set too high, the forward error correction will be launched too often (false positive) and the attack will be impacted. If it is too low, the chance of overlooking an error is high (false negative). The error is only detected by the backward error correction, which is much slower than the forward one. So, a suitable D th should balance the speed of the attack and the error-detection rate. Similarly both error-detection rate and attack speed will be affected by the N value. The choice of C th has a similar impact as N , but in the opposite direction. This means a larger C th is equal to choosing a smaller N . In our experiments, these values are chosen to maximize the success rate, with the attack speed in an acceptable range. 
EXPERIMENTAL RESULTS
We implement the RSA cipher on an Nvidia K40 GPU, which has 15 SMXes and 192 CUDA cores on each SMX, hosted on a workstation running Ubuntu 14.04.05 LTS with an Intel E5-1603 processor. We record the timing information using the CPU clock that runs at 2.8GHz. We use the CPU instead of the more accurate GPU clock, because normally it is not accessible to the attacker, and we want to launch a realistic attack scenario.
We run experiments with different key lengths of 512, 1,024, and 2,048. The number of threads launched in one decryption varies from 32 to 1,024, varying GPU utilization from a single warp, a single block, to multiple SMXes. For each experiment, 100K decryption runs are carried out with random input messages. The error-correction parameters D th , N , and C th are set to 0.2, 8, and 0.01, respectively. The time used for one attack ranges from around 15mins for RAS-512 with 32 threads, to 8h for RSA-2048 with 1,024 threads. The larger the key size and the number of threads, the longer the attack time due to the increased computational complexity, and a larger number of error corrections needed. Table 1 shows the number of errors corrected by forward and backward error correction, and how many windows (least significant/right-most) are incorrect after the attack. In the results, we see the number of errors increases as the number of threads and the key size increase. The forward error correction works best with fewer threads and a smaller key size. The backward error correction works best when using a larger number of threads and for a longer key. The errors left in our experiment are very small (0-3 windows, 18 bits at most), and the full key can be easily recovered using a lattice attack [6] or brute-force attack. For the largest error in our experiment, the brute-force attack space is 2 18 . It takes about 15mins to attack RSA-512 with 32 threads, and about 8h to attack RSA-2048 with 1,024 threads, due to the increase in key size and error corrections.
COUNTERMEASURES
For countermeasures against GPU timing attacks, Kadam et al. [15] introduced randomized coalescing techniques to mitigate the GPU coalescing attack [13] . Due to the degree of indeterminism introduced, the correlation between the execution time and memory accesses is reduced or eliminated. However, the timing leakage we target in our attack does not relay on coalescing memory accesses, which renders these countermeasures non-applicable here. However, there have been several countermeasures presented that protect a CPU from a side-channel timing attack while running RSA. They attempt to eliminate data-dependent variations associated with the reductions [9, 31] or mask the messages and exponent with random numbers [19] . Such countermeasures are general and are also applicable to GPU implementations. However, they will introduce a significant performance degradation, which is a major concern for GPU acceleration. We evaluate an always-reduce Montgomery multiplication countermeasure on a GPU, which increases the execution time by 4.5%. We also compare the Pearson correlation coefficient between the number of reductions and the execution time for one message and show the result in Figure 7 . The correlation is significantly reduced from 0.79, as shown in Figure 1(a) , to 0.06, rendering the time leakage very small and hard to exploit.
Special countermeasures targeting GPU implementations can also be devised. For example, we can randomize the assignment of messages to threads and blocks such that the execution time cannot be correctly predicted by the attacker. With a randomized assignment countermeasure, the correlation of one block reduces from 0.54, as shown in Figure 1(c) , to −0.0073, as shown in Figure 8 . There is little computational overhead introduced here, except for a random permutation of a sequence. The permutation incurs randomized memory accesses, which impact the data spacial locality adversely. Security and performance can be balanced by varying the assignment granularity for randomization. We can also add more noise to the timing measurements with dummy random messages, at the cost of performance and throughput. If we insert one dummy random message for every three messages, the correlation between the number of reductions and the execution time of one warp is also reduced from 0.66 to 0.12, which also introduces, at most, 25% overhead. This approach is suitable in scenarios where the GPU is lightly loaded, so additional dummy messages can be computed on the unused resources.
CONCLUSIONS
In this work, we evaluate the vulnerability of an RSA GPU implementation to side-channel timing analysis. We build timing models for a sliding-window RSA implementation using Montgomery multiplications. We consider a parallel implementation and evaluate it on an NVIDIA GPU. We characterize the execution behavior, capturing the execution time due to the complex interactions among threads. Our proposed attack methods are designed based on the timing model. We obtain attack results on an NVIDIA K40 hardware for different key sizes and work across different levels of parallelism. Our results show that an RSA implementation on a GPU is vulnerable to side-channel timing attacks, calling for both general countermeasures and GPU-specific countermeasures.
