Abstract. Modern Graphics Processing Units (GPU) have reached a dimension with respect to performance and gate count exceeding conventional Central Processing Units (CPU) by far. Many modern computer systems include -beside a CPU -such a powerful GPU which runs idle most of the time and might be used as cheap and instantly available co-processor for general purpose applications.
Introduction
For the last twenty years graphics hardware manufacturers have focused on producing fast Graphics Processing Units (GPUs), specifically for the gaming community. This has more recently led to devices which outperform general purpose Central Processing Units (CPUs) for specific applications, particularly when comparing the MIPS (million instructions per second) benchmarks. Hence, a research community has been established to use the immense power of GPUs for general purpose computations (GPGPU). In the last two years, prior limitations of the graphics application programming interfaces (API) have been removed by GPU manufacturers by introducing unified processing units in graphics cards. They support a general purpose instruction set by a native driver interface and framework.
In the field of asymmetric cryptography, the security of all practical cryptosystems rely on hard computational problems strongly dependant on the choice of parameters. But with rising parameter sizes (often in the range of 1024-4096 bits), however, computations become more and more challenging for the underlying processor. For modern hardware, the computation of a single cryptographic operation is not critical, however in a many-to-one communication scenario, like a central server in a company's data processing centre, it may be confronted with hundreds or thousands of simultaneous connections and corresponding cryptographic operations. As a result, the most common current solution are cryptographic accelerator cards. Due to the limited market, their price tags are often in the range of several thousands euros or US dollars. The question at hand is whether commodity GPUs can be used as high-performance public-key accelerators.
In this work, we will present novel implementations of cryptosystems based on modular exponentiations and elliptic curve operations on recent graphics hardware. To the best of our knowledge, this is the first publication making use of the CUDA framework for GPGPU processing of asymmetric cryptosystems. We will start with implementing the extremely wide-spread Rivest Shamir Adleman (RSA) cryptosystem [30] . The same implementation based on modular exponentiation for large integers can be used to implement the Digital Signature Algorithm (DSA), which has been published by the US National Institute of Standards and Technology (NIST) [25] . Recently, DSA has been adopted to elliptic curve groups in the ANSI X9.62 standard [2] . The implementation of this variant, called ECDSA, is the second major goal of this work.
Previous Work
Lately, the research community has started to explore techniques to accelerate cryptographic algorithms using the GPU. For example, various authors looked at the feasibility of the current industry standard for symmetric cryptography, the Advanced Encryption Standard (AES) [21, 31, 18, 9] . Only two groups, namely Moss et al. and Fleissner, have aimed for the efficient implementation of modular exponentiation on the GPU [24, 14] . Their results were not promising, as they were limited by the legacy GPU architecture and interface (cf. the next section). To the best of our knowledge there are neither publications about the implementation of these systems on modern, GPGPU-capable hardware nor on the implementation of elliptic curve based systems.
We aim to fill this gap by implementing the core operations for both systems efficiently on modern graphics hardware, creating the foundation for the use of GPUs as accelerators for public key cryptography. We will use Nvidia's current flagship GPU series, the G80 generation, together with its new GPGPU interface CUDA.
Using GPUs for General-Purpose Applications
The following section will give an overview over traditional GPU computing, followed by a more in-depth introduction to Nvidia's general purpose interface CUDA.
Traditional GPU Computing
Roughly, the graphics pipeline consist of the stages transform & light, assemble primitives, rasterise and shade. First GPUs had all functions needed to implement the graphics pipeline hardwired, but over time more and more stages became programmable by introducing specialised processors, e.g. vertex and fragment processors that made the transform & light and shading stages, respectively, more flexible.
When processing power increased massively while prices kept falling, the research community thought of ways to use these resources for computationally intense tasks. However, as the processors' capabilities were very limited and the API of the graphics driver was specifically built to implement the graphics pipeline, a lot of overhead needed to be taken into account. For example, all data had to be encoded in textures which are two dimensional arrays of pixels storing colour values for red, green, blue and an additional alpha channel used for transparency. Additionally, textures are read-only objects, which forced the programmers to compute one step of an algorithm, store the result in the frame buffer, and start the next step using a texture reference to the newly produced pixels. This technique is known as ping-ponging. Most GPUs did only provide instructions to manipulate floating point numbers, forcing GPGPU programmers to map integers onto the available mantissa and find ways to emulate bit-logical functions, e.g., by using look-up tables.
These limitations have been the main motivation for the key GPU manufacturers ATI/AMD and Nvidia to create APIs specifically for the GPGPU community and modify their hardware for better support: ATI's solution is called Close To the Metal (CTM) [1] , while Nvidia presented the Compute Unified Device Architecture (CUDA), a radically new design that makes GPU programming and GPGPU switch places: The underlying hardware of the G80 series is an accumulation of scalar common purpose processing units ("unified" design) and quite a bit of "glue" hardware to efficiently map the graphics pipeline to this new design. GPGPU applications however directly map to the target hardware and thus graphics hardware can be programmed without any graphics API whatsoever.
Programming GPUs Using Nvidia's CUDA Framework
In general, the GPU's immense computation power mainly relies on its inherent parallel architecture. For this, the CUDA framework introduces the thread as smallest unit of parallelism, i.e., a small piece of concurrent code with associated state. However, when compared to threads on microprocessors, GPU threads have much lower resource usage and lower creation and switching cost. Note that GPUs are only effective when running a high number of such threads. A group of threads that is executed physically in parallel is called warp. All threads in one warp are executed in a single instruction multiple data (SIMD) fashion. If one or more thread(s) in the same warp need to execute different instructions, e.g., in case of a data-dependent jump, their execution will be serialised and the threads are called divergent. As the next level of parallelism, a (thread) block is a group of threads that can communicate with each other and synchronise their execution. The maximum number of threads per block is limited by the hardware. Finally, a group of blocks that have same dimensionality and execute the same CUDA program logically in parallel is called grid.
To allow optimal performance for different access patterns, CUDA implements a hierarchical memory model, contrasting the flat model normally assumed on computers. Host (PC) and device (GPU) have their own memory areas, called host memory and device memory, respectively. CUDA supplies optimised functions to transfer data between these separate spaces.
Each thread possesses its own register file, which can be read and written. Additionally, it can access its own copy of so-called local memory. All threads in the same grid can access the same on-chip read-and writable shared memory region. To prevent hazards resulting from concurrent execution of threads synchronisation mechanisms must be used. Shared memory is organised in groups called banks that can be accessed in parallel. All threads can access a read-and writable memory space called global memory and read-only regions called constant memory and texture memory. The second last is optimised for one-dimensional locality of accesses, while the last is most effective when being used with two-dimensional arrays (matrices). Note that the texture and constant memories are the only regions that are cached. Thus, all accesses to the off-chip regions global and local memory have a high access latency, resulting in penalties when being used too frequently.
The hardware consists of a number of so-called multiprocessors that are build from SIMD processors, on-chip memory and caches. Clearly, one processor executes a particular thread, the same warp being run on the multiprocessor at the same time. One or more blocks are mapped to each multiprocessor, sharing its resources (registers and shared memory) and get executed on a time-sliced basis. When a particular block has finished its execution, the scheduler starts the next block of the grid until all blocks have been run. Design Criteria for GPU Implementations. To achieve optimal performance using CUDA, algorithms must be designed to run in a multitude of parallel threads and take advantage of the presented hierarchical memory model. In the following, we enumerate the key criteria necessary for gaining the most out of the GPU by loosely following the CUDA programming guide [27] and a talk given by Mark Harris of Nvidia [17] .
A. Maximise use of available processing power A1. Maximise independent parallelism in the algorithm to enable easy partitioning in threads and blocks. A2. Keep resource usage low to allow concurrent execution of as many threads as possible, i.e., use only a small number of registers per thread and shared memory per block. A3. Maximise arithmetic intensity, i.e., match the arithmetic to bandwidth ratio to the GPU design philosophy: GPUs spend their transistors 
Modular Arithmetic on GPUs
In the following section we will give different ways do realise modular arithmetic on a GPU efficiently, keeping the aforementioned criteria in mind. For the RSA cryptosystem we need to implement arithmetic modulo N , where N is the product of two large primes p and q: N = p · q. The arithmetic of both DSA systems, however, is based on the prime field GF (p) as the lowest-level building block. Note that the DSA systems both use a fixed -in terms of sessions or key generations -prime p, thus allowing to choose special primes at build time that have advantageous properties when reducing modulo p. For example, the US National Institute of Standards and Technology (NIST) proposes a set of generalised Mersenne primes in the Digital Signature Standard (DSS) [25, Appendix 6] . As the RSA modulus N is the product of the two secret primes p and q that will be chosen secretly for each new key pair, we cannot optimise for the modulus in this case. Modular Multiplication. Multi-precision modular multiplication r ≡ a·b mod m is usually the most critical operation in common asymmetric cryptosystems. In a straightforward approach to compute r, we derive a double-sized product r = ab first and reduce afterwards by multi-precision division. Besides the quadratic complexity of standard multiplication, division is known to be very costly and should be avoided whenever possible. Thus, we will discuss several multiplication strategies to identify an optimal method for implementation on GPUs.
Modular Multiplication Using Montgomery's Technique
In 1985 Peter L. Montgomery proposed an algorithm [23] to remove the costly division operation from the modular reduction. Koç et al. [6] give a survey of different implementation options. As all multi-precision Montgomery multiplication algorithms feature no inherent parallelism except the possibility to pipeline, we do not consider them optimal for our platform and implement the method with the lowest temporary space requirement of n + 2 words, coarsely integrated operand scanning (CIOS), as a reference solution only (cf. to Algorithm 1).
Modular Multiplication in Residue Number Systems (RNS)
As an alternative approach to conventional base-2 w arithmetic, we can represent integers based on the idea of the Chinese Remainder Theorem, by encoding an integer x as a tuple formed from its residues x i modulo n relatively prime w-bit moduli m i , where |x| mi denotes x mod m i :
Here, the ordered set of relatively prime moduli (m 0 , m 1 , . . . , m n−1 ), gcd (m i , m j ) = 1 for all i = j, is called base and denoted by A. The product of all moduli, A = n−1 i=0 m i is called dynamic range of A, i.e., the number of values that can be uniquely represented in A. In other words, all numbers in A get implicitly reduced modulo A. Such a representation in RNS has the advantage that addition, subtraction and multiplication can be computed independently for all residues: 
for j from 1 to n − 1 do 11:
end for 13:
Tn ← Tn+1 + c 15: end for 16:
which allows carry-free computations 2 and multiplication without partial products. However, some information involving the whole number x cannot be easily computed. For instance, sign and overflow detection and comparison of magnitude are hard, resulting from the fact that residue number systems are no weighted representation. Furthermore, division and as a result reduction modulo an arbitrary modulus M = A is not as easy as in other representations.
But similar to the basic idea of Montgomery multiplication, one can create a modular multiplication method for input values in RNS representation as shown in Algorithm 2, which involves a second base B = ( m 0 , m 1 , . . . , m n−1 ) with corresponding dynamic range B. It computes a value v = XY + f M that is equivalent to 0 mod A and XY mod M . Thus, we can safely divide by A, i.e., multiply by its inverse modulo B, to compute the output XY A −1 (mod M ). Note that the needed reduction modulo A to compute f is free in A.
All steps of the algorithm can be efficiently computed in parallel. However, a method to convert between both bases, a base extension mechanism, is needed. We take three different options into account: the method based on a Mixed Radix System (MRS) according to Szabó and Tanaka [37] , as well as CRT-based methods due to Shenoy and Kumaresan [33] , Kawamura et al. [20] and Bajard et al. [3] . We present a brief introduction of these methods, but for more detailed information about base extensions, please see the recent survey at [5] . 
Base Extension Using a Mixed Radix System (MRS)
The classical way to compute base extensions is due to Szabó and Tanaka [37] . Let (m 0 , . . . , m n−1 ) be the MRS base associated to A. Then, each integer x can be represented in a mixed radix system as
The MRS digits x i can be derived from the residues x i by a recursive strategy: where m −1 (i,j) are the pre-computed inverses of m j modulo m i . To convert x from
(mod m n−1 ) this representation to a target RNS base, we could reduce Equation (3) . But instead of creating a table for all c k , a recursive approach is more efficient in our situation, eliminating the need for table-lookups [4] , and allowing to compute all residues in the target base in parallel:
Base Extension Using the Chinese Remainder Theorem (CRT)
Recall the definition of the CRT and adopt it to the source base A with dynamic range A:
whereÂ k = A/m k and α is an integer s.t. 0 ≤ x < A. Note that α is strictly upper-bounded by n. When reducing this equation with an arbitrary target modulus, say m i , we yield
where
and |A| mi are pre-computed constants. Note that the δ k do not depend on the target modulus and can thus be reused in the computation of a different target residue. This is an efficient way to compute all residues modulo the target base, provided we know the value of α. While involving a couple of look-ups for the constants as well, the instruction flow is highly uniform (cf. Criterion A4) and fits to our SIMD architecture, i.e., we can use n threads to compute the n residues of x in the target base in parallel (cf. Criterion A1).
The first technique to compute such an α is due to Shenoy and Kumaresan [33] and requires a redundant modulus m r ≥ n that is relatively prime to all other moduli m j and m i , i.e., gcd(A, m r ) = gcd(B, m r ) = 1. Consider Equation 7 , set m i = m r and rearrange it to the following:
Since α < n ≤ m r it holds that α = |α| mr and thus Equation 8 computes the exact value of α, involving the additional constant |A −1 | mr . Kawamura et al. propose a different technique that approximates α using fixed-point computations [20] . Consider Equation 7, rearrange it and divide by A:
Next, they approximate α by using trunc r (δ k ) as numerator and 2 w as denominator and adding a properly chosen offset σ, where trunc r (δ k ) sets the last w − r bits of δ k to zero:
Thus, the approximate value α can be computed in fixed-point arithmetic as integer part of the sum of the r most-significant bits of all δ k . Provided σ is chosen correctly, Equation 10 will compute α = α, and the resulting base extension will be exact. Finally, Bajard et al. follow the most radical approach possible [3] : they allow an offset of αA ≤ (n − 1)A to occur in Equation 7 and thus do not need to compute α at all. After the first base extension we have f = f + αA and thus w = w + αM , i.e., the result w will contain a maximum offset of (n − 1)M , and thus be equivalent to w mod M . However, this technique needs additional measures of precaution in the multiplication algorithm, which predominantly condense in the higher dynamic ranges needed.
Multiplication Modulo Generalised Mersenne Primes
For some cryptosystems like DSA, arithmetic in an underlying prime field is required. Taking advantage of the special structure of Mersenne primes, the reduction modulo p after a multiplication can be carried out very efficiently. Using such a method, we can compute r using a standard multi-precision multiplication method first, followed by a reduction algorithm that is specific for the given prime. In this work, we will use an algorithm to efficiently compute multiplications modulo P-224, where P-224 is the 224 bit prime proposed by NIST [25] . Algorithm 3 performs the complete reduction for this prime with only two additions and two subtractions of 224 bit integers and a subsequent correction step to determine the correct value of r ≡ r mod p, since −2p ≤ r < 3p must be considered. Note that this final correction step additionally needs the same amount of computations, as we have to avoid data-dependant branches (cf. Criterion A4). 
Implementation
In this section we will describe the implementation of two primitive operations for a variety of cryptosystems: first, we realise modular exponentiation on the GPU for use with RSA, DSA and similar systems. Second, for ECC-based cryptosystems we present an efficient point multiplication method which is the fundamental operation, e.g., for ECDSA or ECDH [16] .
Modular Exponentiation Using the CIOS Method
We implemented the CIOS Method as introduced in Algorithm 1 for sequential execution since it does not include any inherent parallelism. Fan et al. describe efficient ways to pipeline such an algorithm for the use on multi-core systems [13] .
This would however need fairly complex coordination and memory techniques and thus will not be considered further for our implementation, cf. Criteria A4 and B4-B6. As all modular exponentiations are independent, we let each thread compute exactly one modular exponentiation in parallel with all others. Resulting from that, this solution only profits from coarse-grained parallelism. We assume the computation of distinct exponentiations, each having the same exponent tfor example RSA signatures using the same key -and thus need to transfer only the messages P i for each exponentiation to the device and the result P t i (mod N ) back to the host. As a result, every thread executes the same control flow, fulfilling Criterion A4. To accelerate memory transfers between host and device, we use page-locked host memory and pad each message to a fixed length that forces the starting address of each message to values that are eligible for global memory coalescing (cf. Criteria B1 and B4).
For modular exponentiation based on Algorithm 1, we applied the straightforward binary right-to-left method [35] . During exponentiation, each thread needs three temporary values of (n+2) words each that get used as input and output of Algorithm 1 in a round-robin fashion by pointer arithmetic. Thus, 3(n+2) words are required. This leads to 408 bytes and 792 bytes for 1024 bits and 2048 bit parameters, respectively. Each multiprocessor features 16384 bytes of shared memory, resulting in a maximum number of 16386/408 = 40 and 16386/792 = 20 threads per multiprocessor for 1024 and 2048 bits, respectively, if we use shared memory for temporary values. Clearly, both solutions are inefficient when considering that each multiprocessor is able to execute 768 threads per block in principle (i.e., we favour Criterion A2 over B2).
Thus, we chose to store the temporary values in global memory. We have to store the values interleaved so that memory accesses of one word by all threads in a warp can be combined to one global memory access. Hence, for a given set of values (A, B, C, . . .) consisting each of n + 2 words X = (x 0 , x 1 , . . . , x n+1 ), we store all first words (a 0 , b 0 , c 0 , . . .) for all threads in the same block, then all second words (a 1 , b 1 , c 1 , . . .) , and so on (cf. Criterion B4).
Moreover, we have to use nailing techniques, as CUDA does not yet include add-with-carry instructions. Roughly speaking, nailing reserves one or more of the high-order bits of each word for the carry that can occur when adding two numbers. To save register and memory space, however, we store the full word of w bits per register and use bit shifts and and-masking to extract two nibbles, each providing sufficient bits for the carry (cf. Criterion A3). This can be thought of decomposing a 32 bit addition in two 16 bit additions plus the overhead for carry handling.
Modular Exponentiation Using Residue Number Systems
Computations in residue number systems yield the advantage of being inherently parallel. According to Algorithm 2 all steps are computed in one base only, except for the first multiplication. Thus, the optimal mapping of computations to threads is as follows: each thread determines values for one modulus in the two bases. As a result, we have coarse-grained (different exponentiations) and finegrained parallelism (base size), fulfilling Criterion A1. We call n the number of residues that can be computed in parallel, i.e., the number of threads per encryption. The base extension by Shenoy et al. needs a redundant residue starting from the first base extension to be able to compute the second base extension. To reflect this fact, we use two RNS bases A and B, having n moduli each, and an additional residue m r resulting in n = n + 1. For all other cases, it holds that n = n.
Considering the optimal number of bits per modulus, we are faced with w = 32 bit integer registers on the target hardware. Thus, to avoid multi-precision techniques, we can use moduli that are smaller than 2 w . The hardware can compute 24 bit multiplications faster than full 32 bit multiplications. However, CUDA does not expose an intrinsic to compute the most-significant 16 bits of the result. Using 16 bit moduli would waste registers and memory and increase the number of memory accesses as well. Thus, we prefer full 32 bit moduli to save storage resources at the expense of higher computational cost (cf. Criteria A2 and A3).
For Algorithm 1 to work, the dynamic ranges A and B and the modulus M have to be related according to B > A > 2 2 M , or B > A > (2+n) 2 M when using Bajard's method. For performance reasons, we consider full warps of 32 threads only, resulting in a slightly reduced size of M . The figures for all possible combinations can be found in Table 6 in the Appendix. For input and output values, we assume that all initial values will have been already converted to both bases (and possibly the redundant modulus m r ) and that output values will be returned in the same encoding. Note that it would be sufficient to transfer values in one base only and do a base extension for all input values (cf. Criterion B1, transferring values in both bases results in a more compact kernel together with a slightly higher latency). Different from the CIOS method, temporary values can be kept local for each thread, i.e., every thread stores its assigned residues in registers. Principally all operations can be performed in parallel on different residues and -as a result -the plain multiplication algorithm does not need any synchronisations. However, both properties do not hold for the base extension algorithms.
Mixed Radix Conversion.
Recall that the mixed radix conversion computes the mixed radix representation from all residues in the source base first and uses this value to compute the target residues. The second step involves the computation of n residues and can be executed in parallel, i.e., each thread computes the residue for 'its' modulus. As a result, we have to store the n MRS digits in shared memory to make them accessible to all threads (cf. Criteria A1 and B2). The first step however is the main caveat of this algorithm due to its highly divergent nature as each MRS digit is derived from the residue of a temporary variable in a different modulus (and thus thread) and depends on all previously computed digits, clearly breaking Criterion A4 and resulting in serialisation of executions. Additionally, note that threads having already computed an MRS digit do not generate any useful output anymore.
CRT-Based Conversion. The first step for all CRT-based techniques is to compute the δ k for each source modulus and can be carried out by one thread for each value. Second, all n threads compute a weighted sum involving δ k and a modulusdependent constant. Note that all threads need to access all δ k and thus δ k have to be stored in shared memory (cf. Criterion B2). Third, α has to be derived, whose computation is the main difference in the distinguished techniques. α is needed by all threads later and thus needs to be stored in shared memory as well. After computing α all threads can proceed with their independent computations.
Bajard's method does not compute α and consequently needs no further operations. For Shenoy's method, the second step above is needed for the redundant modulus m r as well, which can be done in parallel with all other moduli. Then, a single thread computes α and writes it to shared memory. The redundant residue m r comes at the price of an additional thread, however the divergent part needed to compute α does only contain one addition and one multiplication modulo m r . Kawamura's method needs to compute the sum of the r most significant bits of all δ k . While the right-shift of each δ k can be done using all threads, the sum over all shifted values and the offset has to be computed using a single thread. A final right-shift results in the integer part of the sum, namely α.
Comparison and Selection. Clearly, Bajard's method is the fastest since it involves no computation of α. Shenoy's method only involves a small divergent part. However, we pay the price of an additional thread for the redundant modulus, or equivalently decrease the size of M . Kawamura's technique consists of a slightly larger divergent part, however it does neither include look-ups nor further reduces the size of M .
Not all base extension mechanisms can be used for both directions required for Algorithm 2. For Bajard's method, consider the consequence of an offset in the second base extension: we would compute some w in base A that is not equal to the w in B. As a result, neither w A nor w B could be computed leading to an invalid input for a subsequent execution of Algorithm 2. Thus, their method is only available for A → B conversions. Shenoy's method can only be used for the second base extension as there is no efficient way to carry the redundant residue through the computation of f modulo A. The technique by Kawamura et al. would in principle be available for both conversions. However, the sizes of both bases would be different to allow proper reduction in the A → B case, thus we exclude this option from our consideration. Table 1 shows the available and the practical combinations.
Table 1. Base Extension Algorithm Combinations

A → B MRC (M) Shenoy (S) Kawamura (K) Bajard (B)
B → A MRC (M) • • • • Shenoy (S) • • • • Kawamura (K) • • • • Bajard (B) • • • •
Point Multiplication Using Generalised Mersenne Primes
For realising the elliptic curve group operation, we chose mixed affine-Jacobian coordinates [8] to avoid costly inversions in the underlying field and thus concentrated on efficient implementation of modular multiplication, the remaining time critical operation. For this, we used a straightforward schoolbook-type multiplication combined with the efficient reduction technique for the generalised Mersenne prime presented in Algorithm 3.
As for the CIOS method, there is no intrinsic parallelism except pipelining in this approach (cf. Criterion A1). Thus, we use one thread per point multiplication. We assume the use of the same base point P per point multiplication kP and varying scalars k. Thus, the only input that has to be transferred are the scalars. Secondly, we transfer the result in projective Jacobian coordinates back to the host. For efficiency reasons, we encode all coordinates interleaved for each threads in a block again.
We used shared memory to store all temporary values, nailed to 28 bits to allow schoolbook multiplication without carry propagation. Thus, we need 8 words per coordinate. Point addition and doubling algorithms were inspired by libseccure [29] . With this approach shared memory turns out to be the limiting factor. Precisely, we require 111 words per point multiplication to store 7 temporary coordinates for point addition and modulo arithmetic, two points and each scalar. This results in 444 bytes of shared memory and a maximum of 16384/444 = 36 threads per multiprocessor. This leaves still room for improvements as Criterion A1 is not fulfilled. However, due to internal errors in the toolchain, we were not (yet) able to compile a solution that uses global memory for temporary values instead. Note that the left-to-right binary method for point multiplication demands only one temporary point. However, for the sake of a homogeneous flow of instructions we compute both possible solutions per scalar bit and use a small divergent section to decide which of them is the desired result (cf. Criterion A4).
Conclusion
With the previously discussed implementations on GPUs at hand, we finally need to identify the candidate providing the best performance for modular exponentiation.
Results and Applications
Before presenting the benchmarking results of the best algorithm combinations we show our results regarding the different base extension options for the RNS method. The benchmarking scheme was the following: first, we did an exhaustive search for the number of registers per thread that can principally be generated by the toolchain. Then, we benchmarked all available execution configurations for these numbers of registers. To make the base extension algorithms comparable, we would have to repeat this for all possible combinations, as shown in Table 1 . However to reduce the complexity of benchmarking, it suffices to measure all possible combinations in the first row and all possible combinations in the second column to gain figures for all available combinations. The results for the particular best configuration can be found in Table 2 . Clearly, the mixed radix based approach also used in [24] cannot compete with CRT-based solutions. Kawamura et al. is slower than the method of Shenoy et al. , but performs only slightly worse for the 2048 bit range. Figure 1 shows the time over the number of encryptions for the four cases and the 1024 bit and 2048 bit ranges, respectively.
Both graphs show the characteristic behaviour: Depending on the number of blocks that are started on the GPU and the respective execution configuration we get stair-like graphs. Only multiples of the number of warps per multiprocessor and the number of multiprocessors result in optimal configurations that fully utilise the GPU. However, depending on the number of registers per thread and the amount of shared memory used other configurations are possible and lead to smaller steps in between.
Optimised Implementations. Beside the reference implementation based on the CIOS algorithm, we selected as best choice the CRT-RNS method based on a combination of Bajard's and Shenoy's methods to compute the first and second base extension of Algorithm 2, respectively.
The selection of the implementation was primarily motivated to achieve high throughput rather than a small latency. Hence, due to the latency, not all implementations might be suitable for all practical applications. To reflect this, we present figures for data throughput as well as the initial latency t min required at the beginning of a computation. Note that our results consider optimal configurations of warps per block and blocks per grid only. Table 3 shows the figures for modular exponentiation with 1024 and 2048 bit moduli and elliptic curve point multiplication using NIST's P-224 curve.
The throughput is determined from the number of encryptions divided by the elapsed time. Note that this includes the initial latency t min at the beginning of the computations. The corresponding graphs are depicted in Figure 2 . Note the relatively long plateau when using the CIOS technique. It is a direct result from having coarse-grained parallelism only: the smallest number of encryptions that can be processed is 128 times higher than for the RNS method. Its high offset is due to storing temporary values in global memory: memory access latency is hidden by scheduling independent computations, however the time needed to fetch/store the first value in each group cannot be hidden. Clearly, the CIOS method delivers the highest throughput at the price of a high initial latency. For interactive applications such as online banking using TLS this will be a major obstacle. However, non-interactive applications like a certificate authority (CA) might benefit from the raw throughput 3 . Note that both applications will share the same secret key for all digital signatures when using RSA. In case of ECC (ECDSA) however, different exponents were taken into account.
The residue number system based approach does only feature roughly half of the throughput but provides a more immediate data response. Thus, this method seems to be suitable even in interactive applications. Last but not least elliptic curve cryptography clearly outperforms modular exponentiation based techniques not only due to the much smaller parameters. With respect to other hardware and software implementations compared against our results in the next section, we present an ECC solution which outperforms most hardware devices and comes close the the performance of recent dual-core microprocessors. 
Comparison with Previous Implementations
Due to the novelty of general purpose computations on GPUs and since directly comparable results are rare, we will take reference to recent hardware and software implementations in literature as well. To give a feeling for the different GPU generations we include Table 4 . [24] , using the same RNS approach but picking different base extension mechanisms. The authors present the maximum throughput only that has been achieved at the cost of an unspecified but high latency. Fleissner's recent analysis on modular exponentiation for GPUs is based on 192 bit moduli but relates the GPU performance solely to the CPU of his host system.
Costigan and Scott implemented modular exponentiation on IBM's Cell platform, i.e., a Sony Playstation 3 and an IBM MPM blade server, both running at 3.2 GHz [10] . We only quote the best figures for the Playstation 3 as they call the results for the MPM blade preliminary. The Playstation features one PowerPC core (PPU) and 6 Synergistic Processing Elements (SPUs). Software results have been attained from ECRYPT's eBATS project [11] . Here, we picked a recent Intel Core2 Duo with 2.13 GHz clock frequency. Since mostly all figures for software relate to cycles, we assumed that repeated computations can be performed without interruption on all available cores so that no further cycles are spent, e.g., on scheduling or other administrative tasks. Note that this is a very optimistic assumption possibly overrating the performance of microprocessors with respect to actual applications. We also compare our work to the very fast software implementation by [15] on an Intel Core2 system at 2.66 GHz but which uses the special Montgomery and non-standard curve over F 2 255 −19 .
To the best of our knowledge, Mentens published the best results for public key cryptography on reconfigurable hardwareso far [22] . She used a Field Programmable Gate Array (FPGA) of Xilinx' Virtex-II Pro family, namely the xc2vp30-7FF1152. Schinianakis et al. implemented elliptic curve cryptography on the same family of FPGAs but using RNS arithmetic for the underlying field [32] . Suzuki implemented the modular exponentiation on FPGAs taking advantage of the included digital signal processors (DSPs) on a board from Xilinx' Virtex 4 FX family [36] .
Nozaki et al. designed an RSA circuit in 0.25 μm CMOS technology, that needs 221k gate equivalents (GE) [26] and uses RNS arithmetic with Kawamura's base extension mechanism.
Further Work
Elliptic curves in Hessian form feature highly homogeneous formulae to compute all three projective coordinates in point additions [19, 34] . However, the curves standardised by ANSI and NIST cannot be transformed to Hessian form. Furthermore, point doublings can be converted to point additions by simple coordinate rotations. Thus, it is possible to compute point doublings and additions for all three coordinates in parallel. A future study will show the applicability to graphics hardware.
