Secrets from the GPU by Chauvet, Jean-Marie & Mahé, Eric
arXiv:1305.3699v1 [cs.CR] 16 May 2013
Secrets from the GPU
Jean-Marie Chauvet and Eric Mahé
MassiveRand, Inc.
{jmc,eric.mahe}@massiverand.com
http://www.massiverand.com
62, ave. Pierre Grenier, 92100 Boulogne-Billancourt, France
Abstract. Acceleration of cryptographic applications on massively par-
allel computing platforms, such as Graphics Processing Units (GPUs),
becomes a real challenge as their decreasing cost and mass production
make practical implementations attractive. We propose a layered
trusted architecture integrating random bits generation and parallelized
RSA cryptographic computations on such platforms. The GPU-resident,
three-tier MR architecture consists of an RBG, using the GPU as a deep
entropy pool; a bignum modular arithmetic library using the Residue
Number System; and GPU APIs for RSA key generation, encryption and
decryption. Evaluation results of an experimental OpenCL implementation
show a 32-40 GB/s throughput of random integers, and encryptions
with up to 16,128-bit long exponents on commercial mid-range GPUs.
This suggests a ubiquitous solution for autonomous trusted architectures
combining low cost and high throughput.
Keywords: Cryptography, Residue Number System, GPU, Random Bit
Generator, RSA
1 Introduction
Recent systemic studies of trust with asymmetric cryptography usage led to
questioning the assumption that sufficient randomness is available each time public
keys are generated [14]. The issue of the quality of random bits used in key
generation and the depth of their associated entropy pools is of course the prominent
one raised in these studies. The extent to which public keys are in fact shared
among unrelated parties also points to the need for fast, low-cost, readily avail-
able, autonomous key generation and encryption-decryption capability.
In this contribution, we focus on the efficient realization of the computation-
ally expensive operations in asymmetric cryptosystems on off-the-shelf GPUs.
More precisely, we present improved and novel implementations employing GPUs
both as entropy pool and accelerator for RSA and DSA cryptosystems in a com-
plete three-tiered architecture. The MassiveRand (MR) Architecture consists of
three functional layers: (i) MR-TRNG, a non-deterministic random bits genera-
tor (TRBG); (ii) MR-MOD, a parallelized efficient implementation of modular
arithmetic on big integers; and (iii) MR-RSA, an application layer offering a
simple API for RSA key generation and message encryption/decryption.
The novelty of the MR Architecture resides in the dual use of mass-marketed,
low-cost hardware, namely GPUs – as now found in a large range of comput-
ing devices ranging from high-end servers to smartphones and tablets – as a
deep entropy pool for an innovative TRBG tightly integrated with an on-board,
efficiently parallelized cryptographic library for a variety of applications. The
architecture delivers high-bandwidth, high-quality secrets on a broad set of
computing devices.
2 GPU Hardware as a Ubiquitous, Deep Entropy Pool
A typical GPU architecture [17] consists of several general purpose scalar proces-
sors grouped in multiprocessor cores with a hierarchy of global, local and resident
(cache) memory allowing several levels of data parallelism and optimization of
parallel computations. It is becoming increasingly common to use a GPU as a
modified form of stream processor. This idea turns the massive computational
power of a modern graphics accelerator’s shader pipeline into general-purpose
computing power, as opposed to being hard-wired solely to do graphical opera-
tions.
On the other hand, the low cost and mass production of GPUs make them
ubiquitous, not only on PCs, but on widely released devices such as laptops,
smartphones and tablets, as well as in special architectures for high-performance
computing, with an increasing performance-price ratio every year.
The first layer (MR-TRNG) of the cryptography architecture proposed in
this paper leverages the GPU as a specific random bit generator (RBG) hard-
ware. In accordance with standards, e.g. FIPS 140-2 [5,4], the GPU hardware is
used here as a non-deterministic RBG, producing an output that is dependent
on an unpredictable (hardware) source that is “outside human control”. The un-
predictable stream of bits is seeded as an unguessable input key to an approved
deterministic RBG (DRBG or pseudo-RNG). The collection of entropy and the
DRBG are both GPU-bound programs, known as kernels.
In contrast to the current view that the task of a DRBG is simply to distill
out sufficient entropy for all outputs and queue it up for use–shutting down
completely or generating predictable pseudo-random output when it does not
see enough entropy–MR-TRNG’s innovative use of GPU hardware ensures that
sufficient entropy is continuously collected for the production of unguessable
keys.
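The seeding scheme described above can be sketched as follows. This is a minimal illustration, not the MR-TRNG kernels: the generator follows the general shape of an HMAC-based DRBG with SHA-256, the class name `HmacDrbgSketch` is hypothetical, and `collect_entropy` uses `os.urandom` purely as a stand-in for the GPU entropy source.

```python
import hashlib
import hmac
import os


class HmacDrbgSketch:
    """Deterministic generator in the style of an HMAC-based DRBG (illustration only)."""

    def __init__(self, seed: bytes):
        self.key = b"\x00" * 32
        self.value = b"\x01" * 32
        self._update(seed)

    @staticmethod
    def _hmac(key: bytes, data: bytes) -> bytes:
        return hmac.new(key, data, hashlib.sha256).digest()

    def _update(self, data: bytes = b"") -> None:
        # Mix optional fresh data into the internal (key, value) state.
        self.key = self._hmac(self.key, self.value + b"\x00" + data)
        self.value = self._hmac(self.key, self.value)
        if data:
            self.key = self._hmac(self.key, self.value + b"\x01" + data)
            self.value = self._hmac(self.key, self.value)

    def reseed(self, fresh_entropy: bytes) -> None:
        self._update(fresh_entropy)

    def generate(self, n_bytes: int) -> bytes:
        out = b""
        while len(out) < n_bytes:
            self.value = self._hmac(self.key, self.value)
            out += self.value
        self._update()  # advance the state so earlier outputs cannot be replayed
        return out[:n_bytes]


def collect_entropy(n_bytes: int) -> bytes:
    # Assumption: stand-in for the unpredictable GPU source described in the text.
    return os.urandom(n_bytes)


drbg = HmacDrbgSketch(collect_entropy(48))
key_material = drbg.generate(32)
drbg.reseed(collect_entropy(48))  # continuous collection: reseed as new entropy arrives
```

The point of the sketch is the division of labour: the non-deterministic source only supplies seeds, and the deterministic generator expands them, being reseeded whenever fresh entropy is available.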
3 Cryptographic GPU Computing
For trust, RSA cryptographic computations use large key sizes [6], usually 1024-,
2048- and 3072-bit long integers, so-called bignums. In addition to the difficulty
of computing modular arithmetic operations on bignums, the algorithms tra-
ditionally used in RSA cryptography [16] do not easily translate to efficient
implementation on highly parallel architectures. Multiplication and exponentia-
tion by bignums, in particular, are the most demanding. These basic operations
are conducted stepwise, in several bignum parsing phases, inducing complex
and heavy data dependencies between steps. In order to overcome these data
dependency obstacles, the proposed architecture relies on the Residue Number
System (RNS) representation of bignums which reduces bignum operations to
easily parallelized small modulus (32-bit or 64-bit) arithmetic computations.
3.1 Modular Arithmetic Operations on GPU
The main principle of the RNS is to consider several moduli $m_1, m_2, \ldots, m_n$
that are pairwise coprime (no two share a common factor) and to work
indirectly with the residues of a bignum modulo each $m_i$ rather than directly with the
bignum itself. Hence, in the RNS, a bignum $X$ is represented by $(x_1, x_2, \ldots, x_n)$,
the list of its residues $x_i = X \bmod m_i$.
The fact that this representation is unique, provided $X < M = \prod_{i=1}^{n} m_i$, is the
Chinese Remainder Theorem [12]. In this convenient representation addition,
subtraction and multiplication on bignums can be performed in parallel modulo
each mi. These independent streams of computation, or channels, are directly
mapped onto threads on the GPU computing hardware.
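As a small illustration of channel-wise arithmetic, the sketch below uses four tiny pairwise coprime moduli instead of the 32-bit moduli of the actual implementation. The names `MODULI`, `to_rns`, `rns_add` and `rns_mul` are hypothetical; each tuple position plays the role of one channel, which on the GPU would be one thread.

```python
from math import gcd, prod

# Toy base: pairwise coprime moduli (251 is prime, 253 = 11*23, 255 = 3*5*17, 256 = 2**8).
MODULI = (251, 253, 255, 256)
M = prod(MODULI)  # dynamic range: values below M have a unique representation

assert all(gcd(a, b) == 1 for i, a in enumerate(MODULI) for b in MODULI[i + 1:])


def to_rns(x: int) -> tuple:
    """Binary-to-RNS conversion: one small reduction per channel."""
    return tuple(x % m for m in MODULI)


def rns_add(a: tuple, b: tuple) -> tuple:
    """Channel-wise addition; no carries propagate between channels."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))


def rns_mul(a: tuple, b: tuple) -> tuple:
    """Channel-wise multiplication; every channel is independent."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))


x, y = 12345, 6789
product_rns = rns_mul(to_rns(x), to_rns(y))
sum_rns = rns_add(to_rns(x), to_rns(y))
```

Since `x * y` is below `M` here, `product_rns` is the unique RNS representation of the exact product.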
Conversions from binary to the RNS are executed by computing the residues
xi on each channel which requires only small integer operations. The oppo-
site conversion, from the RNS to binary, however, is more involved. Two major
methods are available: translating first to an intermediary Mixed Radix System
(MRS), which although sequential in nature may be computed separately on
each channel [12]; or using the Chinese Remainder Theorem, which lends itself
to proper parallelization provided an extra modulus is available for intermediate
calculations.
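Both conversion routes can be sketched on a toy base. This is an illustration under simplifying assumptions: the Chinese Remainder route below reduces modulo M with ordinary big-integer arithmetic rather than using the extra modulus mentioned in the text, and all function names are hypothetical. The three-argument `pow` with exponent -1 requires Python 3.8 or later.

```python
from math import prod

MODULI = (251, 253, 255, 256)  # toy pairwise coprime base
M = prod(MODULI)


def to_rns(x: int) -> tuple:
    return tuple(x % m for m in MODULI)


def crt_to_int(residues: tuple) -> int:
    """RNS-to-binary via the Chinese Remainder Theorem.

    Each term depends on a single channel, so the terms can be computed in
    parallel; only the final sum and reduction modulo M combine them.
    """
    total = 0
    for x_i, m_i in zip(residues, MODULI):
        M_i = M // m_i
        total += x_i * M_i * pow(M_i, -1, m_i)
    return total % M


def mixed_radix_to_int(residues: tuple) -> int:
    """RNS-to-binary via mixed radix digits: X = d0 + d1*m0 + d2*m0*m1 + ..."""
    digits = []
    for i, (x_i, m_i) in enumerate(zip(residues, MODULI)):
        value = x_i
        for j in range(i):  # sequential dependency on earlier digits
            value = (value - digits[j]) * pow(MODULI[j], -1, m_i) % m_i
        digits.append(value)
    result, weight = 0, 1
    for d, m in zip(digits, MODULI):
        result += d * weight
        weight *= m
    return result


recovered_crt = crt_to_int(to_rns(123456789))
recovered_mrs = mixed_radix_to_int(to_rns(123456789))
```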
RSA arithmetic furthermore requires modular operations, namely multiplication
and exponentiation, to be performed efficiently. MR-MOD uses Montgomery
modular multiplication [1]. This method simultaneously performs multiplication
and reduction by a (bignum) modulus $N$, in a so-called Montgomery domain
characterized by a large integer $R$. Bignum operands $X$, $Y$ are replaced by
$\tilde{X} = XR$, $\tilde{Y} = YR$ and the Montgomery modular multiplication computes
$\tilde{Z} = \tilde{X}\tilde{Y}R^{-1} \bmod N$, provided again that $N < R$. By choosing $R = M = \prod_{i=1}^{n} m_i$,
single steps are performed independently on each RNS channel, consuming RNS
representations of $X$, $Y$, and $N$. Because translating bignums to and from the
Montgomery domain is costly, MR-MOD performs as many operations as pos-
sible in the Montgomery domain. Such is the case for exponentiation, imple-
mented with traditional square-and-multiply algorithms [16], efficiently chaining
Montgomery modular multiplications. In addition, pre-computations help move
many multiplications outside the main exponentiation loop [7], thus improving
overall performance.
Modular inversion is required to compute the private exponent d of the RSA
key. Inversion is also used in the constructive prime generation methods used in
the upper RSA layer of the proposed architecture. The current implementation
uses the Arazi inversion formula which, given coprime positive integers $e$ and $f$,
yields $d = e^{-1} \bmod f$ as $d = \frac{1 + f\,(-f^{-1} \bmod e)}{e}$. The simplification in this
formula relies on e being a small integer. The modular inverse of bignum f is
computed by first converting f to the RNS and then inverting by the Arazi
formula, in parallel on each channel. A careful choice of coprime bases for the
RNS leads to additional performance increases [3].
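The inversion formula can be checked on small numbers. The sketch below uses plain integers rather than RNS channels, and the function name `arazi_inverse` is hypothetical. The only modular inversion it performs is modulo the small integer e; the rest is one multiplication and one exact division.

```python
def arazi_inverse(e: int, f: int) -> int:
    """Return d = e^{-1} mod f, for coprime e and f, via an inversion modulo e only."""
    k = (-pow(f, -1, e)) % e   # k = -f^{-1} mod e, so that 1 + f*k is a multiple of e
    return (1 + f * k) // e    # exact division; the result lies in [1, f)


# Toy RSA parameters: p = 61, q = 53, so f = (p - 1)(q - 1) = 3120.
d = arazi_inverse(17, 3120)    # d == 2753
```

Here 3120 mod 17 = 9, whose inverse modulo 17 is 2, so k = 15 and d = (1 + 3120 * 15) / 17 = 2753.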
Once the RNS bases are chosen, both base extension and intermediate calcu-
lations in the Montgomery modular multiplication use values which are constant
throughout RNS operations. In order to expedite GPU computations, these constants
are pre-computed and installed permanently in GPU memory at initialization
time. In the current implementation, each base counts 128 32-bit prime
integers, sufficient for RSA keys up to 3,968 bits long, slightly over the 3072-bit
maximum length in the NIST recommendation [4]. (On modern GPUs, blocks
may run up to 256 threads so that doubling the base size and switching to 64-bit
coprimes would give us capability for RSA keys up to 16,128 bits long, without
much computation overhead.)
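The two key-size limits quoted above are consistent with each RNS channel contributing one bit less than its word size, that is, 31 usable bits per 32-bit modulus and 63 per 64-bit modulus. This reading is an inference from the figures and is not stated in the text:

```latex
128 \times (32 - 1) = 3968
\qquad\text{and}\qquad
256 \times (64 - 1) = 16128 .
```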
3.2 Primality
When efficiency is not a primary concern, the usual way to generate a random
prime number is to select a random number p and test it for primality. This,
in essence, describes the operations of the lower-tier of the third and last layer
of the proposed architecture, MR-RSA. Random numbers issued by GPU-based
MR-TRNG are tested for primality using the MR-MOD modular arithmetic
kernels, both generation and computations being executed in parallel on the
GPU. Prime generation algorithms rely on primality, or compositeness, testing
[18]. In order to keep generated prime numbers on board the GPU in the trust
management context, without transferring any other integer than the public
RSA key and exponent to the host CPU, primality testing in the Montgomery
domain has been implemented as a dedicated GPU kernel. In accordance with
standards and recommendations [6], the kernel executes a Miller-Rabin test with
a user-parameterized number of iterations.
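A Miller-Rabin test with a caller-chosen number of iterations can be sketched as follows. This version works on ordinary integers on the CPU, not in the Montgomery domain, and draws its bases from Python's `secrets` module instead of the GPU generator; the function name and the default of 40 rounds are illustrative choices.

```python
import secrets


def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin compositeness test with `rounds` random bases."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):      # quick exit for small factors
        if n % p == 0:
            return n == p
    d, s = n - 1, 0                      # write n - 1 = d * 2**s with d odd
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = 2 + secrets.randbelow(n - 3)  # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                  # a is a witness: n is composite
    return True                           # no witness found in `rounds` tries


mersenne_61_is_prime = is_probable_prime(2**61 - 1)
```

Each round that fails to find a witness reduces the probability that a composite slips through by a factor of at least 4, which is why the number of iterations is the natural tuning parameter.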
3.3 RSA Key Generation, Encryption and Decryption
Finally the upper-tier of the MR-RSA layer simply provides a basic RSA API
for use in cryptographic applications. Note again that the GPU acts both as the
entropy pool for random bits generation and specific computing hardware for key
generation, encryption and decryption. Of course, the GPU entropy pool alone
may also be used by CPU-bound implementations of cryptographic applications;
conversely, CPU-bound RBGs and DRBGs can feed random bits into the proposed
architecture for on-GPU RSA computations in mixed computing environments.
Proper RSA key generation imposes well-known additional constraints on the
prime numbers to be used. These RSA-specific primes are further formalized by,
e.g., the standards and recommendations published by regulatory organizations.
RSA key generation, namely the private primes p, q, composing the public key
N = pq and the private exponent d, modular inverse of the public exponent e
modulo (p − 1)(q − 1), is completely performed on the GPU with only e–when
not chosen by the user–and N being transferred back to the CPU host when
needed. The depth of the GPU entropy pool and the high-throughput of the
MR-TRNG random bits generator make for a very efficient implementation of
RSA key generation.
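The key generation steps named above (private primes p and q, public modulus N = pq, private exponent d as the inverse of e modulo (p − 1)(q − 1)) can be sketched on the CPU as follows. This is a simplified illustration: it omits the additional constraints on RSA primes required by the standards, uses Python's `secrets` module in place of MR-TRNG, and all function names are hypothetical.

```python
import secrets
from math import gcd


def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Compact Miller-Rabin test (see Section 3.2 for the role it plays)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for _ in range(rounds):
        x = pow(2 + secrets.randbelow(n - 3), d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True


def random_prime(bits: int) -> int:
    """Draw random odd candidates with the two top bits set until one is prime."""
    while True:
        candidate = secrets.randbits(bits) | (3 << (bits - 2)) | 1
        if is_probable_prime(candidate):
            return candidate


def generate_rsa_key(bits: int = 1024, e: int = 65537):
    """Return ((N, e), d); p and q never leave this function."""
    while True:
        p, q = random_prime(bits // 2), random_prime(bits // 2)
        phi = (p - 1) * (q - 1)
        if p != q and gcd(e, phi) == 1:
            return (p * q, e), pow(e, -1, phi)


(N, E), D = generate_rsa_key(512)  # small size so the example runs quickly
```

Setting the two top bits of each prime guarantees that N has exactly the requested bit length. Only `(N, E)` would be returned to the host in the architecture described here; `D` is returned by the sketch solely so that the round trip can be checked.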
The API provides two GPU kernels respectively for encryption and decryp-
tion of messages, applying modular exponentiation. Messages in binary format
are split in blocks the length of which matches the number of threads on the
GPU used in the proposed architecture. Considering, for instance, 3072-bit long
RSA keys, messages are thus split into 3072-bit blocks, each one encrypted and
decrypted in parallel by 128 threads within each block. Provided GPUs can
process several blocks in parallel, several hundred, even thousands, of message
blocks can be efficiently encrypted and decrypted in parallel.
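The block-splitting scheme can be sketched sequentially as below. Several assumptions are made for the sake of a runnable example: the key is a toy modulus built from two known Mersenne primes, blocks are cut slightly shorter than the modulus so that every block value stays below N, and no padding is applied, which would not be acceptable in practice. The function names are hypothetical.

```python
# Toy key built from two Mersenne primes (for illustration only).
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))

BLOCK_BYTES = (N.bit_length() - 1) // 8  # every block, read as an integer, is < N


def split_blocks(message: bytes) -> list:
    return [message[i:i + BLOCK_BYTES] for i in range(0, len(message), BLOCK_BYTES)]


def encrypt_blocks(message: bytes) -> list:
    # Each block is independent of the others, which is what lets a GPU
    # process one block per work-group in parallel.
    return [pow(int.from_bytes(b, "big"), E, N) for b in split_blocks(message)]


def decrypt_blocks(cipher_blocks: list, message_len: int) -> bytes:
    out = bytearray()
    for c in cipher_blocks:
        size = min(BLOCK_BYTES, message_len - len(out))  # last block may be short
        out += pow(c, D, N).to_bytes(size, "big")
    return bytes(out)


message = b"Secrets from the GPU" * 5
cipher = encrypt_blocks(message)
roundtrip = decrypt_blocks(cipher, len(message))
```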
4 Evaluation
While the architecture proposed in this paper is now provided as a commercial
product, the evaluation setup reported here reflects the basic implementation in
the original, R&D experimental work. Namely, we only report on the OpenCL
implementation of the MR-TRNG, MR-MOD and MR-RSA kernels. Several
hardware devices deliberately chosen in the performance mid-range were tar-
geted: (i) GPU and (ii) multi-core CPU. Table 1 describes the details of the
target hardware.
Table 1. Experimental hardware setup summary.
Implementation   Technical Specs.
Multicore CPU    i5 2450M; 2 cores, 4 threads, 2.5 GHz
GPU              AMD Radeon HD 7670M (middle-class GPU for laptops); 600-MHz core speed, 900-MHz mem. speed
GPU              AMD Radeon HD 6870 (middle-class GPU); 900-MHz core speed, 1 GB mem.
4.1 Related Work
Comparison with related art is not straightforward since different GPU plat-
forms are employed, with different architectural characteristics and performance
capabilities. In addition, the rapid pace of progress of manufacturers may render
experimental results obsolete quite quickly.
In [11] a mixed CUDA/PTX assembler implementation on the high-range
Nvidia’s GeForce GTX 580 is evaluated against a highly optimized MPIR 2.5.0
bignum library on an Intel i7 processor. The evaluation shows 105 to 250 modular
exponentiations per second when the exponent size varies from 512 bits to 4096
bits, a speed-up of 13x to 3x over an OpenMP CPU implementation.
In [19] different approaches are proposed and compared to compute asym-
metric cryptography (RSA and Elliptic Curve) on Nvidia’s GeForce 8800 GTS.
The authors show 194 to 419 1024-bit modular exponentiations per second, and
28 to 56 2048-bit modular exponentiations per second.
The PACE library on Nvidia’s GeForce 9800 GX2–with 2 on-board
GPUs–is evaluated in [8]. Experimental results show 45 160-bit modular
multiplications per millisecond down to 4 384-bit modular multiplications per
millisecond.
In [9] the authors devote a sizeable portion of their paper to modular mul-
tiplication. They implement Montgomery multiplication in radix representation
with the pencil-and-paper algorithm. It is configured so that a single exponent
encompasses a CUDA warp (32 threads) in order to maximize thread utilization.
Messages are split between two warps dealing with modulus p and q respectively.
Certain algorithmic optimizations, however, such as squaring and modulo mul-
tiplication could not be implemented in this parallel implementation.
Also [2] contains a comparative study of innovative variant implementations
of RNS-based modular arithmetic in view of RSA encryption and decryption.
Our experimental OpenCL implementation provides results on mid-range hardware
along the lines of this previous body of work on high-range devices. As
summarized in Figure 1, the MR-MOD layer in the proposed architecture handles up
to 3072-bit integer modular multiplications, at a performance level comparable to
the high-range GPUs previously reported. Doubling the number of RNS channels
up to the maximum thread-per-block limit afforded by the selected hardware
extends the same performance level to up to 8192-bit integer modular
multiplication with the same kernels.
These results point to the feasibility of a complete RSA cryptography trust
architecture based on GPU parallel computation. Our additional results, detailed
in the next sections, suggest further that the GPU may act both as a deep entropy
pool for random bit generation and as a consumer of that randomness for
on-board RSA key generation, message encryption and decryption with excellent
performance at a relatively low price point.
[Figure 1. Main plot: time (ms) versus number of modular multiplications, for 3072-bit modular multiplication on GPU and on CPU. Inset: throughput (mmul/ms) versus number of blocks, for GPU and CPU.]
Fig. 1. Throughput of the OpenCL implementation for GPU and CPU devices. Execution
time for modular multiplication of 3072-bit integers over 128 32-bit RNS channels.
Inset: throughput in modular multiplications per millisecond according to number of
OpenCL blocks; peak is reached on GPU for the nominal maximum of blocks handled
by the AMD card, i.e. 512.
4.2 Uniformly Distributed Random Numbers Generation
TestU01 [13] is generally recognized as providing the most complete battery of
statistical tests for random number generators. We submitted MR-TRNG to
the so-called “Big Crush” battery, a suite of 106 very stringent statistical tests,
in the 1.2.1 version of TestU01. As comparison points we reproduce results of
McCullough’s review of TestU01 [15] alongside ours in Table 2. The test file was
generated on Nvidia’s GeForce GTX 275 GPU. In addition, a self-validating kernel
was developed to further qualify MR-TRNG. The kernel streamlines FIPS [5] basic tests
right after the generation on the GPU as a form of simplified health test of the
generator. As GPUs provide a very high degree of parallelism, MR-TRNG delivers
very high-throughput unpredictable streams, in the range of 32-40 GB/s of
seed bits, depending on the GPU board. This high performance level also lets
MR-TRNG double as a complementary deep entropy pool source for traditional
CPU-based DRBGs used in popular cryptographic libraries such as OpenSSL or
PolarSSL.
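A simplified health test of the kind mentioned in Section 4.2 for the self-validating kernel can be sketched as below. The sketch implements two of the basic statistical tests on a 20,000-bit sample, the monobit test and the poker test, with acceptance bounds taken from the 2001 text of FIPS 140-2; the bounds should be checked against the standard before any real use. Function names are hypothetical and `os.urandom` stands in for the GPU generator.

```python
import os

SAMPLE_BYTES = 2500  # 20,000 bits per tested sample


def monobit_ok(sample: bytes) -> bool:
    """The number of one bits must fall strictly between the two bounds."""
    ones = sum(bin(byte).count("1") for byte in sample)
    return 9725 < ones < 10275


def poker_ok(sample: bytes) -> bool:
    """Chi-square style check on the frequencies of the 5,000 4-bit nibbles."""
    counts = [0] * 16
    for byte in sample:
        counts[byte >> 4] += 1
        counts[byte & 0x0F] += 1
    x = (16 / 5000) * sum(c * c for c in counts) - 5000
    return 2.16 < x < 46.17


def health_check(sample: bytes) -> bool:
    if len(sample) != SAMPLE_BYTES:
        raise ValueError("expected exactly 20,000 bits")
    return monobit_ok(sample) and poker_ok(sample)


# Assumption: os.urandom replaces the GPU source; a good source passes almost always.
sample_passed = health_check(os.urandom(SAMPLE_BYTES))
```

Such tests only catch gross failures, for example a stuck or heavily biased source; they complement, and do not replace, a full battery such as Big Crush.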
4.3 Generation, Encryption and Decryption
As public key cryptography depends on very large integer multiplications, special
attention has been brought to the orchestration of the modular arithmetic, pro-
vided by the MR-MOD layer in the proposed architecture, for RSA operations.
RSA key generation leverages a parallel implementation of small primes testing
(up to the first 10,000 primes) combined with Miller-Rabin compositeness tests–
in numbers of iterations as prescribed by the FIPS documentation. Private keys
are generated and stay on GPU during all RSA-related calculations.
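The small-primes filter applied before the Miller-Rabin rounds can be sketched as follows. The sieve and the sequential loop are CPU stand-ins for what the text describes as a parallel implementation, and the names `first_primes` and `passes_trial_division` are hypothetical.

```python
from math import isqrt


def first_primes(count: int) -> list:
    """Return the first `count` primes using a sieve of Eratosthenes."""
    limit = 1024
    while True:
        sieve = bytearray([1]) * (limit + 1)
        sieve[0] = sieve[1] = 0
        for i in range(2, isqrt(limit) + 1):
            if sieve[i]:
                # Strike out every multiple of i starting at i*i.
                sieve[i * i :: i] = bytes(len(range(i * i, limit + 1, i)))
        primes = [i for i, is_prime in enumerate(sieve) if is_prime]
        if len(primes) >= count:
            return primes[:count]
        limit *= 2  # not enough primes yet: enlarge the sieve and retry


SMALL_PRIMES = first_primes(10_000)


def passes_trial_division(candidate: int) -> bool:
    """Cheap filter: reject candidates divisible by any of the small primes."""
    for p in SMALL_PRIMES:
        if candidate % p == 0:
            return candidate == p
    return True  # survivors still need the Miller-Rabin rounds
```

Each division is independent of the others, so on a GPU the loop body could be spread over threads; most random candidates are discarded here at a fraction of the cost of one modular exponentiation.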
This flexibility allows for several approaches to parallelizing RSA encryption
and decryption of messages. In the experimental evaluation results presented
here, the design choice is to break the original message into 3072-bit binary blocks
and feed each one of them to an OpenCL block running all the required RNS
channel threads for modular exponentiation in parallel. All blocks are submitted
for GPU-based RSA encryption and decryption in one single pass. Figure 2
summarizes the results.
Table 2. Big Crush tests applied to several RNGs
RNG                 Failures (CPU Time)
Mersenne Twister    0 (15:58:25)
Knuth TAOCP         0 (14:20:26)
Knuth TAOCP 2002    0 (14:29:36)
MR-TRNG             0 (05:17:44)
[Figure 2. Main plot: encryptions per second versus number of RSA encryptions, for 1024-bit, 2048-bit and 3072-bit RSA keys. Inset: GPU/CPU speed-up.]
Fig. 2. Throughput of the OpenCL implementation for RSA encryptions on GPU.
Encryptions of 3072-bit messages, over 128 32-bit RNS channels, for various key lengths.
Inset: 0.01x to 16x speed-up improvements over CPU-bound openssl/bn optimized
implementation for 3072-bit message and 3072-bit RSA exponent.
5 Conclusions
Experimental results of basic implementations of the proposed GPU-bound MR
randomness and cryptography functionality suggest that it is suitable for the
dissemination of trust architectures on low-cost, readily available devices. Au-
tonomy at low cost for massively distributed devices, such as routers, laptops,
smartphones and tablets, in the required cryptographic key generation and ap-
plications may provide a partial response to the challenge brought forward by
previously mentioned systemic studies [14,10]. Rather than relying on keys built
in at distribution time or delivered in bulk by remote servers, such personal de-
vices would generate high-quality cryptographic keys on demand, leveraging the
on-board GPU physical randomness pool.
The MR Architecture also significantly lowers the cost of key generation and
encryption/decryption on personal devices. This in turn may expand usage of
cryptographic systems to personal communications, or to many-to-many group
communications in dynamic networks.
Finally, further research work towards assessing the physical characterization
of the MR-TRNG non-deterministic random bits generator should conclude on
the appropriateness of GPU hardware for the design and deployment of Phys-
ically Unclonable Function (PUF) systems, a promising solution to additional
security issues.
Acknowledgements
The authors wish to thank Pr. Jean-Jacques Quisquater, Université catholique
de Louvain, for sharing numerous insights and recommendations during the de-
velopment of this research.
References
1. Jean-Claude Bajard, Laurent-Stéphane Didier, and Peter Kornerup. Modular mul-
tiplication and base extensions in residue number systems. In IEEE Symposium
on Computer Arithmetic, pages 59–65. IEEE Computer Society, 2001.
2. Jean-Claude Bajard and Laurent Imbert. A full RNS implementation of RSA.
IEEE Trans. Computers, 53(6):769–774, 2004.
3. Jean Claude Bajard, Marcelo Kaihara, and Thomas Plantard. Selected RNS Bases
for Modular Multiplication. In 19th IEEE International Symposium on Computer
Arithmetic, pages 25–35, Portland, Oregon, US, 2009. IEEE Computer Society.
4. Elaine Barker and John Kelsey. Draft NIST Special Publication 800-90C: Recommendation
for random bit generator (RBG) constructions, August 2012.
5. FIPS. Security Requirements for Cryptographic Modules. National Institute for
Standards and Technology, Gaithersburg, MD, USA, May 2001. Annex A: Ap-
proved Security Functions (19 May 2005); Annex B: Approved Protection Profiles
(04 November 2004); Annex C: Approved Random Number Generators (31 Jan-
uary 2005); Annex D: Approved Key Establishment Techniques (30 June 2005).
Supersedes FIPS PUB 140-1, 1994 January 11.
6. Patrick Gallagher and Cita Furlani. FIPS Pub 186-3 Federal Information Process-
ing Standards publication digital signature standard (DSS), 2009.
7. Filippo Gandino, Fabrizio Lamberti, Paolo Montuschi, and Jean-Claude Bajard.
A general approach for improving RNS Montgomery exponentiation using pre-
processing. In Elisardo Antelo, David Hough, and Paolo Ienne, editors, IEEE Sym-
posium on Computer Arithmetic, pages 195–204. IEEE Computer Society, 2011.
8. Pascal Giorgi, Thomas Izard, and Arnaud Tisserand. Comparison of Modular
Arithmetic Algorithms on GPUs. In ParCo’09: International Conference on Par-
allel Computing, page N/A, France, November 2009.
9. Owen Harrison and John Waldron. Efficient acceleration of asymmetric cryptogra-
phy on graphics hardware. In Proceedings of the 2nd International Conference on
Cryptology in Africa: Progress in Cryptology, AFRICACRYPT ’09, pages 350–367,
Berlin, Heidelberg, 2009. Springer-Verlag.
10. Nadia Heninger, Zakir Durumeric, Eric Wustrow, and J. Alex Halderman. Min-
ing your Ps and Qs: Detection of widespread weak keys in network devices. In
Proceedings of the 21st USENIX Security Symposium, August 2012.
11. Tobias Jeske and Felix Kurth. Big number modulo exponentiations for Zero-
Knowledge protocols on GPUs. GPU Technology Conference, San Jose, 14 - 17
May 2012, 2012.
12. Donald E. Knuth. The Art of Computer Programming, Volume II: Seminumerical
Algorithms, 2nd Edition. Addison-Wesley, 1981.
13. Pierre L’Ecuyer and Richard Simard. TestU01: A C library for empirical testing
of random number generators. ACM Trans. Math. Softw., 33(4), August 2007.
14. Arjen K. Lenstra, James P. Hughes, Maxime Augier, Joppe W. Bos, Thorsten
Kleinjung, and Christophe Wachter. Ron was wrong, Whit is right. IACR Cryp-
tology ePrint Archive, 2012:64, 2012.
15. B D McCullough. A review of TESTU01. Journal of Applied Econometrics,
21(5):677–682, 2006.
16. Alfred J. Menezes, Scott A. Vanstone, and Paul C. Van Oorschot. Handbook of
Applied Cryptography. CRC Press, Inc., Boca Raton, FL, USA, 1st edition, 1996.
17. J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone, and J.C. Phillips. GPU
Computing. Proceedings of the IEEE, 96(5):879–899, May 2008.
18. René Schoof. Four primality testing algorithms. arXiv submission, 2008.
19. Robert Szerwinski and Tim Güneysu. Exploiting the power of GPUs for asym-
metric cryptography. In Elisabeth Oswald and Pankaj Rohatgi, editors, CHES,
volume 5154 of Lecture Notes in Computer Science, pages 79–99. Springer, 2008.
