Low-Latency Elliptic Curve Scalar Multiplication by Bos, Joppe
Int J Parallel Prog (2012) 40:532–550
DOI 10.1007/s10766-012-0198-5
Low-Latency Elliptic Curve Scalar Multiplication
Joppe W. Bos
Received: 11 July 2011 / Accepted: 8 May 2012 / Published online: 26 May 2012
© Springer Science+Business Media, LLC 2012
Abstract This paper presents a low-latency algorithm designed for parallel com-
puter architectures to compute the scalar multiplication of elliptic curve points based
on approaches from cryptographic side-channel analysis. A graphics processing unit
implementation using a standardized elliptic curve over a 224-bit prime field, comply-
ing with the new 112-bit security level, computes the scalar multiplication in 1.9 ms on
the NVIDIA GTX 500 architecture family. The presented methods and implementation
considerations can be applied to any parallel 32-bit architecture.
Keywords Elliptic curve cryptography · Elliptic curve scalar multiplication ·
Parallel computing · Low-latency algorithm
1 Introduction
Elliptic curve cryptography (ECC) [30,37] is an approach to public-key cryptography
which enjoys increasing popularity since its invention in the mid 1980s. The attractive-
ness of small key-sizes [32] has placed this public-key cryptosystem as the preferred
alternative to the widely used RSA public-key cryptosystem [48]. This is emphasized
by the current migration away from 80-bit to 128-bit security where, for instance, the
United States’ National Security Agency restricts the use of public key cryptography
in “Suite B” [40] to ECC. The National Institute of Standards and Technology (NIST)
is more flexible and settles for 112-bit security at the lowest level from the year 2011
on [52].
J. W. Bos (B)
Laboratory for Cryptologic Algorithms, École Polytechnique Fédérale de Lausanne (EPFL),
Station 14, 1015 Lausanne, Switzerland
e-mail: joppe.bos@epfl.ch
123
Int J Parallel Prog (2012) 40:532–550 533
At the same time there is a shift in architecture design towards many-core proces-
sors [47]. Following both these trends, we evaluate the performance of a standardized
elliptic curve over a 224-bit prime field [55], which complies with the new 112-bit
security level, on a number of different graphics processing unit (GPU) architecture
families. In order to assess the possibility to use the GPU as a cryptographic accelera-
tor we present algorithms to compute the elliptic curve scalar multiplication (ECSM),
the core building block in ECC, which are designed from a low-latency point of view
for parallel computer architectures.
Our approach differs from the previous reports implementing ECC schemes on
GPUs [2,3,5,53] in that we divide the elliptic curve arithmetic over multiple threads
instead of dividing a single finite field operation over the available resources. The
presented algorithms are based on methods originating in cryptographic side-chan-
nel analysis [31] and are designed for a parallel computer architecture with a 32-bit
instruction set. This makes the new generation of NVIDIA GPUs, the GTX 400/500
series known as Fermi, an ideal target platform. Despite the fact that our algorithms are
not particularly optimized for the older generation GPUs, we show that this approach
outperforms, in terms of low-latency, the results reported in the literature on these
previous generations GPUs while it at the same time sustains a high throughput. On
the newer Fermi architecture the ECSM can be computed in less than 1.9 ms with the
additional advantage that the implementation can be made to run in constant time; i.e.
resistant against timing attacks.
There is an active area in the cryptologic community which reports how to adopt fre-
quently used cryptographic and cryptanalytic algorithms efficiently to the GPU archi-
tecture. These papers report performance benchmark results to demonstrate that the
GPU architecture can already be used as a cryptologic workhorse. Previous published
GPU implementations cover asymmetric cryptography, such as RSA [24,39,53], and
ECC [2,53], and symmetric cryptography [12,22,23,35,45,56]. The GPU has also
been considered to enhance the performance of cryptanalytic computations in the set-
ting of finding hash collisions [8], integer factorization [3,5] and computing elliptic
curve discrete logarithms [4].
Parts of this work were pre-published as [10] where the Cell broadband engine
architecture is considered. This paper is organized as follows. Section 2 recalls the
required information related to elliptic curves and presents the standard elliptic curve
used throughout the paper. Section 3 introduces the GPU platform and explains which
programming paradigm we are using. Section 4 explains the rationale behind the
design choices and presents the parallel algorithm to compute the ECSM. Section 5
shows and compares our benchmarks results with other GPU implementations in the
literature. Section 6 concludes the paper.
2 Elliptic Curves
Let p > 3 be a prime, then any a, b ∈ Fp such that 4a3 + 27b2 = 0 define an elliptic
curve Ea,b over the finite field Fp. The zero point 0, the so-called point at infin-
ity, together with the set of points (x, y) ∈ Fp × Fp which satisfy the short affine
Weierstrass equation
123
534 Int J Parallel Prog (2012) 40:532–550
Algorithm 1 The radix-r schoolbook multiplication method.
Input:
{
A = ∑n−1
i=0 airi , B =
∑n−1
i=0 biri with 0 ≤ ai , bi < r
Output:
{
C = A · B = ∑2n−1
i=0 ci ri with 0 ≤ ci < r
1. C ← A · b0
2. for i = 1 to n − 1 do
3. C ← C + ri (A · bi )
4. end for
5. return C
y2 = x3 + ax + b, (1)
form an abelian group Ea,b(Fp) [50]. For a ∈ Ea,b(Fp), the additively written group
law is defined as follows. Define a + 0 = 0 + a = a. For non-zero a = (x1, y1), b =
(x2, y2) ∈ Ea,b(Fp) define a + b = 0 iff x1 = x2 and y1 = −y2. Otherwise a + b =
(x3, y3) with
x3 = λ2 − x1 − x2
y3 = λ(x1 − x3) − y1 , λ =
⎧⎪⎨
⎪⎩
3x21 + a
2y1
if x1 = x2
y1 − y2
x1 − x2 otherwise.
(2)
Currently, the fastest known elliptic curves, in terms of multiplications in Fp, are
the family of curves originating from a normal form for elliptic curves introduced
by Edwards in 2007 [16] and generalized and proposed for usage in cryptology by
Bernstein and Lange [6]. The most efficient curve arithmetic on these type of curves
is due to Hisil et al. [25].
2.1 Standard Elliptic Curves
One way to speed-up elliptic curve arithmetic is to enhance the performance of the
finite field arithmetic by using a prime of a special form. The structure of such a prime
is exploited by constructing a fast reduction method, applicable to this prime only.
Typically, multiplication and reduction are performed in two sequential phases. For
the multiplication phase we consider the so-called schoolbook, or textbook, multipli-
cation technique (see Algorithm 1 for a high-level description). Other asymptotically
faster approaches, such as Karatsuba multiplication [28], are not considered because
they are not expected to give a speed-up for the size of integers we target.
In the FIPS 186-3 standard [55], NIST recommends the use of five prime fields
when using the elliptic curve digital signature algorithm. The sizes of the five recom-
mended primes by NIST are 192, 224, 256, 384 and 521 bits. In this work we consider
the 224-bit prime
p224 = 2224 − 296 + 1.
The prime p224, together with the provided curve parameters from the FIPS 186-3,
allows one to use 224-bit ECC which provides a 112-bit security level. This is the
123
Int J Parallel Prog (2012) 40:532–550 535
Algorithm 2 Fast reduction modulo p224 = 2224 − 296 + 1.
Input: Integer c = (c13, . . . , c1, c0), each ci is a 32-bit word, and 0 ≤ c < p2224.
Output: Integer d ≡ c mod p224.
Define 224-bit integers:
s1 ← ( c6, c5, c4, c3, c2, c1, c0),
s2 ← ( c10, c9, c8, c7, 0, 0, 0),
s3 ← ( 0, c13, c12, c11, 0, 0, 0),
s4 ← ( c13, c12, c11, c10, c9, c8, c7),
s5 ← ( 0, 0, 0, 0, c13, c12, c11)
return (d = s1 + s2 + s3 − s4 − s5);
lowest strength for asymmetric cryptographic systems allowed by NIST’s “SP 800-57
(1)” [52] standard from the year 2011 on (cf. [11] for a discussion about the migration
to these new standards).
Reduction modulo p224 can be done efficiently: for x ∈ Z with 0 ≤ x < (2224)2
and x = xL + 2224xH for xL, xH ∈ Z, 0 ≤ xL, xH < 2224, define
R(x) = xL + xH (296 − 1).
It follows that R(x) ≡ x mod p224 and R(x) ≤ 2320 − 296. Algorithm 2 shows
the application of R(R(x)) for a machine word (limb) size of 32 bits, based on the
work by Solinas [51]. Note that the resulting value R(R(x)) ≡ x mod p224 with
−(2224 + 296) < R(R(x)) < 2225 + 2192.
The NIST curves over prime fields all have prime order. In order to translate a Wei-
erstrass curve into a suitable Edwards curve over the same prime field, to use the faster
elliptic curve arithmetic, the curve needs to have a group element of order four [6]
(which is not the case with the prime order NIST curves). To comply with the NIST
standard we chose not to use Edwards but Weierstrass curves.
An extensive study of a software implementation of the NIST-recommended elliptic
curves over prime fields on the x86 architecture is given by Brown et al. [14].
2.2 Elliptic Curve Scalar Multiplication
Repeated elliptic curve point addition is called elliptic curve scalar multiplication:
nP := P + P + · · · + P︸ ︷︷ ︸
n times
for n ∈ Z>0 and is the main operation in many elliptic curve based cryptographic
systems. Prime examples are the elliptic curve digital signature algorithm as specified
in [55] and elliptic curve Diffie–Hellman for key agreement (see [54]). Improving the
performance of the ECSM translates directly in faster ECC schemes in practice. A
survey of the different techniques and approaches together with the cost to compute the
ECSM is given by Cohen et al. [15]. An updated overview from 2008, incorporating
the latest curve representations and algorithms, is given by Bernstein and Lange [7].
123
536 Int J Parallel Prog (2012) 40:532–550
Algorithm 3 The double-and-add algorithm (left) and the Montgomery ladder (right).
Input:
⎧⎨
⎩
G ∈ Ea,b(Fp),
s ∈ Z>0, 2k−1 ≤ s < 2k,
s = ∑k−1
i=0 si2i , with 0 ≤ si < 2
Output: P = sG ∈ Ea,b(Fp)
1. P ← G
2. for i = k − 2 down to 0 do
3. P ← 2P
4. if si = 1 then
5. P ← P + G
6. end if
7. end for
Input:
⎧⎨
⎩
G ∈ Ea,b(Fp),
n = ∑k−1
i=0 ni2i , n ∈ Z>0,
2k−1 ≤ n < 2k
Output: P = nG ∈ Ea,b(Fp)
1. P ← G,Q ← G
2. for i = k − 2 down to 0 do
3. if ni = 1 then
4. (P,Q) ← (P + Q, 2Q)
5. else
6. (P,Q) ← (2P, P + Q)
7. end if
8. end for
Let us recall one particular approach, known as the Montgomery ladder: this tech-
nique was introduced by Montgomery in [38] in the setting of integer factorization
using the elliptic curve method (ECM) [33]. We give the higher level description
from [27]. Let L0 = s = ∑t−1i=0 ki2i , define Lj =
∑t−1
i=j ki2i−j and Hj = Lj + 1.
Then,
Lj = 2Lj+1 + kj = Lj+1 + Hj+1 + kj − 1 = 2Hj+1 + kj − 2.
One can update these two values using
(Lj ,Hj ) =
{
(2Lj+1, Lj+1 + Hj+1) if kj = 0,
(Lj+1 + Hj+1, 2Hj+1) if kj = 1.
An overview of this approach is given in the right part of Algorithm 3. This approach
is slower compared to, for instance, the double-and-add technique (the left part of
Algorithm 3) since a duplication and addition are always performed per bit. This
disadvantage is actually used as a feature in environments which are exposed to side-
channel attacks and where the algorithms need to process the exact same sequence
of steps independent of the input parameters. It is not difficult to alter Algorithm 3
such that it becomes branch-free (using the bit ni to select which point to double).
In ECM the elliptic curve scalar multiplication is calculated using the Montgomery
form which avoids computing on the y-coordinate. This is achieved as follows: given
the x- and z-coordinate of the points P,Q and P − Q one can compute the x- and
z-coordinates of P + Q (and similarly 2P or 2Q). Avoiding computations on one of
the coordinates results in a speedup in practice (see [38] for all the details).
3 Compute Unified Device Architecture
GPUs have mainly been game- and video-centric devices. Due to the increasing com-
putational requirements of graphics-processing applications, GPUs have become very
powerful parallel processors and this, moreover, incited research interest in comput-
ing outside the graphics-community. Until recently, programming GPUs was limited
123
Int J Parallel Prog (2012) 40:532–550 537
to graphics libraries such as OpenGL [49] and Direct3D [9], and for many applica-
tions, especially those based on integer-arithmetic, the performance improvements
over CPUs was minimal, sometimes even degrading. The release of NVIDIA’s G80
series and ATI’s HD2000 series GPUs (which implemented the unified shader architec-
ture), along with the companies’ release of higher-level language support with compute
unified device architecture (CUDA), close to metal (CTM) [46] and the more recent
open computing language (OpenCL) [21] facilitate the development of massively-
parallel general purpose applications for GPUs [1,43]. These general purpose GPUs
have become a common target for numerically-intensive applications given their ease
of programming (relative to previous generation GPUs), and ability to outperform
CPUs in data-parallel applications, commonly by orders of magnitude.
We focus on NVIDIA’s GPU architecture with CUDA, more specifically the third
generation GPU family known under the code name Fermi [42]. After the first gener-
ation G80 architecture, the first GPU to support the C-programming language, and the
second generation GT200 architecture, the Fermi architecture was released in 2010.
One of the main features for our setting is the support of 32×32 → 32-bit multiplica-
tion instructions, for both the least- and most-significant 32-bit of the multiplication
result. The previous NVIDIA architecture families have native 24 × 24 → 32-bit
multiplication instructions (for the least significant 32-bit of the result).
We briefly recall some of the basic components of NVIDIA GPUs. More detailed
information about the specification of CUDA as well as experiences using this parallel
computer architecture can be found in [18,34,41–43]. Each GPU contains a number
of streaming multiprocessors (SMs) and each SM consists of multiple scalar processor
cores (SP); these number vary per graphics card. Typically, on the Fermi architecture,
there are 32 SPs per SM and around 16 SMs per GPU. C for CUDA is an extension
to the C language that employs the massively parallel programming model called sin-
gle-instruction multiple-thread. The programmer defines kernel functions, which are
compiled for and executed on the SPs of each SM, in parallel: each light-weight thread
executes the same code, operating on different data. A number of threads are grouped
into a thread block which is scheduled on a single SM, the threads of which time-share
the SPs. Hence, typically the number of launched threads is a multiple of the available
SPs (cores) of the GPU. This hierarchy provides for threads within the same block
to communicate using the on-chip shared memory and to synchronize their execution
using barriers (a synchronization method which causes threads to wait until all threads
reach a certain point).
On a lower level, threads inside each thread block are executed in groups of 32
called warps. On the Fermi architecture each SM has two warp schedulers and two
instruction dispatch units. This means that two instructions, from separate warps, can
be scheduled and dispatched at the same time. By switching between the different
warps, trying to fill the pipeline as much as possible, a high throughput rate can be
sustained. When the code executed on the SP contains a conditional data-dependent
branch all possibilities, taken by the threads inside this warp, are serially executed
(threads which do not follow a certain branch are disabled). After executing these
possibilities the threads within this warp continue with the same code execution. For
optimal performance it is recommended to avoid multiple execution paths within a
single warp.
123
538 Int J Parallel Prog (2012) 40:532–550
The GPU has a large amount of global memory. Global memory is shared among
all threads running on the GPU (on all SMs). Communication between threads inside a
single thread block can be performed using the faster shared memory. If a warp requests
data from global memory, the request is split into two separate memory requests, one
for each half-warp (16 threads), each of which is issued independently. If the word size
of the memory requested is 4, 8, or 16 bytes, the data requested by all threads lie in the
same segment and are accessed in sequence (the kth thread in the half-warp fetches
the kth word) then the global memory request is coalesced. In practice this means
that a 64-byte memory transaction, a 128-byte memory transaction, or two 128-byte
memory transactions are issued if the size of the words accessed by the threads is 4,
8, or 16 bytes, respectively. When this transfer is not coalesced, 16 separate 32-byte
memory transactions are performed. More advanced rules might apply to decide if a
global memory request can be coalesced or not depending on the architecture used,
see [43] for the specific details.
4 Parallel Algorithms
In this section we present low-latency algorithms to compute the ECSM designed for
architectures capable of running multiple computational streams in parallel together
with implementation considerations to speed up the modular arithmetic. Our algo-
rithms are especially designed for platforms which have a 32-bit instruction set,
e.g. addition, subtraction and multiplication instructions with 32-bit in- and output.
The (modular) multiplication operations in this chapter are designed to operate on rel-
atively small (224-bit) integers. On the widely available x86 and x86-64 architectures
the threshold for switching from schoolbook multiplication to methods with a lower
asymptotic run-time complexity (e.g. Karatsuba multiplication) is >800 bits [20] (but
this threshold depends on the word-size of the architecture and the various latencies
and throughput of the addition and multiplication instructions). In our 224-bit setting
we have investigated schoolbook multiplication only.
A common approach to obtain low-latency arithmetic is to compute the modular
multiplications with multiple threads using a residue number system (RNS) [19,36].
This might be one of the few available options to lower the latency for schemes which
perform a sequence of data-dependent modular multiplications, such as in RSA where
the main operation is modular exponentiation, but different approaches can be tried in
the setting of elliptic curve arithmetic. We follow the ideas from [25,27] and choose,
in contrast to for instance [2], to let a single thread compute a single modular mul-
tiplication. The parallelism is exploited at the elliptic curve arithmetic level where
multiple instances of the finite field arithmetic are computed in parallel to implement
the elliptic curve group operation. On a highly parallel architecture, like the GPU,
multiple instances of our parallel ECSM algorithm are dispatched in order to take full
advantage of all the available cores.
4.1 Finite Field Arithmetic
In order to speed up the modular calculations we represent the integers using a redun-
dant representation. Instead of fully reducing to the range [0, p224〉 we use the slightly
123
Int J Parallel Prog (2012) 40:532–550 539
larger interval [0, 2224〉. This redundant representation saves a multi-limb comparison
to detect if we need to perform an additional subtraction after a number has been
reduced to [0, 2224〉. Using this representation, reduction can be done more efficiently,
as outlined in this section, while it does not require more 32-bit limbs (or registers)
to store the integers. Various operations need to be adapted in order to handle the
boundary cases, this is outlined in this section.
4.1.1 Modular Addition and Subtraction
After an addition of a and b, with 0 ≤ a, b < 2224, resulting in a + b = c =
cH 2224 + cL, with 0 ≤ cL < 2224 and 0 ≤ cH ≤ 1, there are different strategies to
compute the modular reduction of c. One can subtract p224 once (cH = 1) or the
result is already in the proper interval (cH = 0) and subsequently continue using the
224 least significant bits of the outcome. In order to prevent divergent code on parallel
computer architectures the value cHp224 (either 0 or p224) could be pre-computed and
subtracted after the addition of a and b. Note that an additional subtraction might be
required in the unlikely event that cH = 1 and a + b − p224 ≥ 2224.
A faster approach is to use the special form of the prime p224. Using
c = cH 2224 + cL ≡ cL + cH (296 − 1) mod p224 (3)
requires the computation of one addition of a and b and one addition with the pre-
computed constant cH (296 − 1), for cH ∈ {0, 1}. Again, in the unlikely event that
cH = 1 and cL +296 −1 ≥ 2224 an additional subtraction is required. The probability
to obtain a carry after adding the fourth 32-bit limb of 296 − 1 to cL is so small that an
early-abort strategy can be applied; i.e. all concurrent threads within the same warp
assume that no carry is produced and continue executing the subsequent code. In the
unlikely event that one or more of the threads produce a carry this results in divergent
code and the other threads in this warp remain idle until the computation has been
completed. This strategy decreases the number of instructions required to implement
the modular addition and makes this approach preferable in practice.
For modular subtraction the same two approaches can be applied. In the first
approach the modulus p224 is added to c = a − b if there is a borrow: i.e. b > a. An
additional addition of p224 might be required since 0 > p224 − 2224 < a − b + p224.
In the second approach 2p224 + a is computed before subtracting b, to ensure that the
result is positive. Next, we proceed as in the addition scenario with the only difference
that cH ∈ {0, 1, 2}.
4.1.2 Multiplication
Algorithm 4 depicts schoolbook multiplication designed to run on parallel architec-
tures and is optimized for architectures with a native multiply-and-add instruction.
After trivially unrolling the for-loops the algorithm is branch-free. Algorithm 4 splits
the operands in r-bit words and takes advantage of the r-bit multiplier assumed to
be available on the target platform. We use r = 32 for the GPU architecture but
this can be modified to work with any other word size on different architectures.
123
540 Int J Parallel Prog (2012) 40:532–550
Algorithm 4 Radix-2r schoolbook multiplication algorithm for architectures which
have a multiply-and-add instruction. We use r = 32 for the GPU architecture.
Input: Integers a = ∑n−1
i=0 ai2ri , b =
∑n−1
i=0 bi2ri , with 0 ≤ ai , bi < 2r .
Output: Integer c = a · b = ∑2n−1
i=0 ci2ri , with 0 ≤ ci < 2r .
1. di ← 0, i ∈ [0, n − 1]
2. for j = 0 to n − 1 do
3. (e,Dj ) ← split(a0 · bj + d0)
4. for i = 1 to n − 1 do
5. (e, di−1) ← split(ai · bj + e + di )
6. end for
7. dn−1 ← e
8. end for
9. return (c ← (dn−1, dn−2, . . . , d0,Dn−1,Dn−2, . . . , D0))
After the multiply-and-add, and a possible extra addition of one r-bit word, the
2r-bit result z is split into the r most and r least significant bits. This is denoted
by (
 z2r , z mod 2r ) ← split(z). Note that the computation ai · bj + e + di fits in
two 32-bit limbs since (232 − 1)2 + 2(232 − 1) < 264.
Algorithm 4 requires the computation of split(a0 · bj + d0) for n values of j
and split(ai · bj + e + di) for n(n − 1) pairs (i, j), where n = 7 for the 224-bit
multiplication. On the GTX 400/500 family of GPUs, where there are 32 × 32 →
32-bit multiplication instructions to get the lower and higher 32-bits and 32-bit addi-
tions with carry in and out, the former can be implemented using four and the latter
using six instructions. A direct implementation of the schoolbook algorithm as pre-
sented in Algorithm 1 might result in a slightly lower instruction count, using the
addition with carry in- and out, but has the disadvantage that it requires more storage
(registers) compared to Algorithm 4. We implemented both methods (Algorithms 1
and 4) and found that, when benchmarking the performance on different GPU fami-
lies, the more memory efficient method from Algorithm 4 is faster in practice. This is
the approach we use in our final implementation.
4.1.3 Fast Reduction
The output from the fast reduction routine as outlined in Algorithm 2, and denoted
by Red, is not in the preferred range [0, p224〉 nor in the range for the redundant rep-
resentation [0, 2224〉; instead, −(2224 + 296) < Red(a · b) < 2225 + 2192. In order
to avoid working with negative (signed) numbers we modify the algorithm slightly
such that it returns d = s1 + s2 + s3 − s4 − s5 + 2p224 ≡ c = a · b mod p224 where
2224 − 296 < d < 2226 + 2192.
In order to fully reduce multiple integers simultaneously using SIMT instructions,
several approaches can be applied. Obviously the reduction algorithm can be applied
again. A most likely faster approach is to subtract p224 repeatedly until the result is
in the desired range [0, p224〉. Since the arithmetic is executed on parallel architec-
tures, the repeated subtraction is calculated by masking the value appropriately before
subtracting. This process needs to be performed four times, when avoiding branches
123
Int J Parallel Prog (2012) 40:532–550 541
Table 1 The values of the 32-bit unsigned limbs ci of t · p224 =
∑7
i=0 ci232i
t t · p224 = {c7, . . . , c0}
c7 c6 c5 c4 c3 c2 c1 c0
0 0 0 0 0 0 0 0 0
1 0 232 − 1 232 − 1 232 − 1 232 − 1 0 0 1
2 1 232 − 1 232 − 1 232 − 1 232 − 2 0 0 2
3 2 232 − 1 232 − 1 232 − 1 232 − 3 0 0 3
4 3 232 − 1 232 − 1 232 − 1 232 − 4 0 0 4
(i.e. divergent code), since the largest positive integer t which results in a positive
(2226 + 2192 − 1) − t · p224 is t = 4.
An additional performance gain is possible at the expense of some storage. Select
the desired multiple of the modulus p224 which needs to be subtracted from a look-up
table, and perform a single subtraction. Using a redundant representation in [0, 2224〉
the most significant word, containing the possible carry, has to be inspected only to
determine the multiple of p224 to subtract. Note that an extra single subtraction might
be needed in the unlikely situation that the result after the subtraction is >2224. The
partially reduced numbers can be used as input to the same modular multiplication
routines and if reduction to [0, p224〉 is required this can be achieved at the cost of a
single conditional multi-limb subtraction.
A refinement, in terms of storage, of the previous approaches to reduce the result-
ing value is by generating the desired values efficiently on-the-fly. We distinguish two
cases (just as when doing the modular addition in Sect. 4.1.1); either subtract multiples
of p224 or 296 − 1. Selecting the correct multiple of p224 is illustrated in Table 1. The
32-bit unsigned limbs ci of t · p224 = ∑7i=0 ci232i for 0 ≤ t < 5 can be computed
as c0 = t, c1 = c2 = 0, c3 = 0 − t . The values for c4, c5, c6, c7 can be efficiently
constructed using masks depending on t = 0 or t > 0. When subtracting multiples
t (296 − 1) = ∑3i=0 ci232i of 296 − 1 for 0 ≤ t < 5, the constants can be computed as
c0 = 0 − t
c1 = c2 =
{
0, if t = 0,
232 − 1, if t > 0.
c3 =
{
0, if t = 0,
t − 1, if t > 0.
The conditional statements can be converted to straight line (non-divergent) code to
make the algorithms more suitable for parallel computer architectures.
4.2 Elliptic Curve Arithmetic
A common approach to avoid the relatively costly modular inversion operation in
the elliptic curve group operation [see Eq. (2)] is to use projective coordinates. The
projective equation of the short affine Weierstrass form from Eq. (1) is
123
542 Int J Parallel Prog (2012) 40:532–550
Y 2Z = X3 + aXZ2 + bZ3. (4)
A point (x, y) is represented as (X : Y : Z) such that (x, y) = (X/Z, Y/Z) for
non-zero Z and the neutral element is represented as (0 : 1 : 0).
In our setting we are interested, given a parallel computer architecture capable of
launching a number of threads Ti , to lower the latency of the longest running thread
Tmax = maxi Ti as opposed to the total time of all resources combined Tsum = ∑i Ti
where Ti is the time corresponding to thread Ti . Since high-throughput and low-
latency may be conflicting goals, one may not be able to achieve both at the same
time. Our approach is designed at the elliptic curve arithmetic level for low-latency
while not sacrificing the throughput too much. To accomplish this we chose to aim for a
high-throughput (and longer latency) design at the finite field arithmetic level: a single
thread computes a single multiplication. If one is willing to sacrifice the throughput
even more one could compute the finite field arithmetic using multiple threads as well,
reducing the throughput even further. The elliptic curve point addition and duplica-
tion are processed simultaneously, significantly reducing the latency at the expense of
potentially lowering the throughput.
Another desirable property of an implementation of a parallel algorithm is that all
threads follow the exact same procedure since this reduces the amount of divergent
code. An active research area where such algorithms have been studied is in the context
of cryptographic side-channel attacks. Side channel attacks [31] are attacks which use
information gained from the physical implementation of a certain scheme to break its
security; e.g. the elapsed time or power consumption. In order to avoid these types of
attacks the algorithms must perform the same actions independent of the input to avoid
leaking information. The approach we use is based on the Montgomery ladder applied
to projective Weierstrass coordinates [13,17,26] instead of Montgomery coordinates.
Even though the y-coordinate can be recovered, this is not necessary in most of the
elliptic curve based cryptographic schemes because they only use the x-coordinate to
compute the final result.
In particular, we adopt the formulas from [17]. Recall from Algorithm 3 that every
iteration processes a single bit of the scalar at the cost of computing an elliptic curve
addition and doubling. Computation on the Y -coordinate is omitted as follows [17]
(P + Q, 2Q) = (P˜ , Q˜) = ((P˜x, P˜z), (Q˜x, Q˜z))
=
⎧⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎩
P˜x = 2(PxQz + QxPz)(PxQx + aPzQz)
+4bP 2z Q2z − Gx(PxQz − QxPz)2
P˜z = (PxQz − QxPz)2
Q˜x = (Q2x − aQ2z)2 − 8bQxQ3z
Q˜z = 4(QxQz(Q2x + aQ2z) + bQ4z).
(5)
Note that Gx is the x-coordinate of the input point to the elliptic curve scalar multipli-
cation algorithm. Using that the NIST standard defines a = −3, and slightly rewriting
Eq. (5), results in the set of instructions presented in Table 2. Every row of the table
is executed concurrently by the different threads, an empty slot means that the thread
either remains idle or works on fake data. The bold entries in Table 2 are pre-computed
123
Int J Parallel Prog (2012) 40:532–550 543
Ta
bl
e
2
In
st
ru
ct
io
n
o
v
er
v
ie
w
to
co
m
pu
te
(P
+
Q
,
2Q
)
=
(P˜
,
Q˜
)
=
((
P˜
x
,
P˜
z
),
(Q˜
x
,
Q˜
z
))
u
sin
g
se
v
en
th
re
ad
s.
Th
e
bo
ld
en
tr
ie
sa
re
pr
e-
co
m
pu
te
d
22
4-
bi
ti
n
te
ge
rs
,G
x
is
th
e
x
-
co
o
rd
in
at
e
o
ft
he
in
pu
tp
oi
nt
to
th
e
M
o
n
tg
om
er
y
la
dd
er
O
pe
ra
tio
n
Th
re
ad
1
Th
re
ad
2
Th
re
ad
3
Th
re
ad
4
Th
re
ad
5
Th
re
ad
6
Th
re
ad
7
(1)
m
u
l
t 0
=
P
x
Q
z
t 1
=
Q
x
P
z
t 2
=
P
x
Q
x
t 3
=
P
z
Q
z
t 4
=
Q
2 x
t 5
=
Q
2 z
t 6
=
Q
x
Q
z
(2)
tr
ip
le
t 7
=
3t
3
t 8
=
3t
5
(3)
a
dd
t 9
=
t 0
+
t 1
t 1
0
=
t 4
+
t 8
(4)
su
b
t 2
=
t 2
−
t 7
t 0
=
t 0
−
t 1
t 4
=
t 4
−
t 8
(5)
m
u
l
t 9
=
t 9
t 2
t 3
=
t2 3
P˜
z
=
t2 0
t 1
0
=
t2 1
0
t 1
1
=
t 6
t 5
t 6
=
t 6
t 4
t 5
=
t2 5
(6)
m
u
l
t 9
=
2t
9
t 3
=
4b
t 3
t 0
=
G
x
P˜
Z
t 1
1
=
8b
t 1
1
t 5
=
4b
t 5
t 6
=
4t
6
(7)
a
dd
t 9
=
t 9
+
t 3
Q˜
z
=
t 5
+
t 6
(8)
su
b
P˜
x
=
t 9
−
t 0
Q˜
x
=
t 1
0
−
t 1
1
123
544 Int J Parallel Prog (2012) 40:532–550
at the initialization phase of the algorithm. The b value is one of the parameters which
define the elliptic curve (together with a and p224) and is provided in the standard.
The b is invariant for the different concurrent elliptic curve scalar multiplications.
Depending on the thread identifier, the pre-computed value is copied to the correct
shared memory position which is used in operation number 6.
Using the instruction flow from Table 2 seven threads can compute a single ellip-
tic curve scalar multiplication (ECSM) using the Montgomery ladder algorithm. The
time Tmax is three multiplications, two additions, two subtractions and a single triple
operation in Fp224 to compute a single elliptic curve addition and duplication. This is
in contrast with Tsum which consist of 18 multiplications, two triplings, four additions,
five subtractions and two multiplications by a power of two in Fp224 . Although Tsum
is significantly higher, compared to the cost to process one bit of the scalar using dif-
ferent coordinate representations and different ECSM algorithms, the latency Tmax is
roughly three multiplications to compute both an elliptic curve addition and doubling
using seven threads. Running seven threads in parallel is the best, e.g. the highest num-
ber of concurrent running threads, we could achieve and it should be noted that this is
suboptimal from different perspectives. First of all, a GPU platform using the CUDA
paradigm typically dispatches threads in block sizes which are a small multiple of
32. Hence, in each subgroup of eight threads one thread remains inactive: decreasing
the overall throughput. Secondly, 20 multiplications are used in Table 2 which results
in one idle thread in the third multiplication (thread 7 in operation 6). On different
parallel platforms, where i threads are processed concurrently, with 2 ≤ i ≤ 7, the
approach outlined in Table 2 can be computed using
⌈ 20
i
⌉
multiplications.
This approach is not limited to arithmetic modulo p224 but applies to any modulus.
Given the (estimated) performance of a single modular multiplication, either using a
special or a generic modulus, the approach from Table 2 can be applied such that the
overall latency to multiply an elliptic curve point with a k-bit scalar is approximately
the time to compute k · ⌈ 20
i
⌉
single thread multiplications.
5 Results
Table 3 states the performance results, using the approach from Table 2, when using
our GPU implementation running on a variety of GPUs from both the older and newer
generations of CUDA architectures. Our benchmark results include transferring the
in- and output and allows to use different multipliers and elliptic curve points in the
same batch. To be compatible with as many settings used in practice as possible, it
is not assumed that the initial elliptic curve point is given in affine coordinates but
instead in projective coordinates. This has a performance disadvantage: the amount
of data which needs to be transferred from the host to the GPU is increased. While
primarily designed for the GTX 400 (and newer) family, the bold entries GTX 465,
480 and 580 in Table 3, the performance on the older GTX 200 series in terms of
latency and throughput are remarkably good.
Our fastest result is obtained on the GTX 580 GPU when computing a single 224-
bit elliptic curve scalar multiplication and requires 1.94 ms when dispatching eight
threads. This is an order of magnitude faster, in term of response time, compared
123
Int J Parallel Prog (2012) 40:532–550 545
Ta
bl
e
3
Pe
rfo
rm
an
ce
co
m
pa
ris
on
o
f2
24
-b
it
el
lip
tic
cu
rv
e
sc
al
ar
m
u
lti
pl
ic
at
io
n
o
n
di
ffe
re
nt
G
PU
pl
at
fo
rm
s.
A
bo
ld
pl
at
fo
rm
n
am
e
in
di
ca
te
st
ha
tt
hi
sp
la
tfo
rm
ha
sc
o
m
pu
te
ca
pa
bi
lit
y
2.
0
(F
er
m
i)
o
r
hi
gh
er
(th
e
la
te
st
(th
ird
)G
PU
-a
rc
hi
te
ct
ur
e
fa
m
ily
).R
es
ul
ts
w
he
n
u
til
iz
in
g
th
e
en
tir
e
G
PU
ar
e
ex
pr
es
se
d
in
o
pe
ra
tio
ns
(22
4-
bi
te
lli
pt
ic
cu
rv
e
sc
al
ar
m
u
lti
pl
ic
at
io
ns
)p
er
se
co
n
d
(op
/s)
R
ef
.
Pl
at
fo
rm
#G
PU
s
CU
DA
co
re
s
Pr
o
ce
ss
o
r
cl
oc
k
(M
Hz
)
M
o
du
lu
s
M
in
im
um
la
te
nc
y
(m
s)
M
ax
im
um
th
ro
ug
hp
ut
(op
/s)
pe
rG
PU
Ty
pe
B
it-
siz
e
[5
](
sc
al
ed
)
88
00
G
TS
1
96
12
,0
0
G
en
er
ic
28
0
–
3,
01
8
G
TX
28
0
1
24
0
1,
29
6
G
en
er
ic
28
0
–
11
,4
17
G
TX
29
5
2
24
0
1,
24
2
G
en
er
ic
28
0
–
21
,1
03
[3
]
G
TX
29
5
2
24
0
1,
24
2
G
en
er
ic
21
0
–
25
9,
53
4
[5
3]
88
00
G
TS
1
96
1,
20
0
Sp
ec
ia
l
22
4
30
5.
0
1,
41
3
[2
]
88
00
G
TS
1
96
1,
20
0
Sp
ec
ia
l
22
4
30
.3
3,
13
8
G
TX
28
5
1
24
0
1,
47
6
Sp
ec
ia
l
22
4
24
.3
9,
99
0
N
ew
G
TX
29
5
2
24
0
1,
24
2
Sp
ec
ia
l
22
4
10
.6
79
,1
98
G
TX
46
5
1
35
2
1,
21
5
Sp
ec
ia
l
22
4
2.
6
15
2,
02
3
G
TX
48
0
1
48
0
1,
40
1
Sp
ec
ia
l
22
4
2.
3
23
7,
41
5
G
TX
58
0
1
51
2
1,
54
4
Sp
ec
ia
l
22
4
1.
9
29
0,
53
5
O
pe
nS
SL
[4
4]
In
te
l
Co
re
i7
-2
60
0K
3,
40
0
Sp
ec
ia
l
22
4
0.
27
14
,7
40
O
pe
nS
SL
u
sin
g
[2
9]
R
u
n
n
in
g
o
n
4
co
re
s
Sp
ec
ia
l
22
4
0.
09
46
,1
76
123
546 Int J Parallel Prog (2012) 40:532–550
Fig. 1 Latency results when varying the amount of dispatched threads for blocksize equal to 32 (red, top
line), 64 (green, bottom line) and 96 (blue, middle line) on the GTX 580 GPU (Color figure online)
to the previous fastest low-latency implementation [2]. Figure 1 shows the latencies
when varying the amount of threads, eight threads are scheduled to work on a single
ECSM, for different block-sizes on the GTX 580. There is a clear trade-off: increas-
ing the block-size allows to hide the various latencies by performing a context switch
and calculating on a different group of threads within the same block. On the other
hand, every eight threads require their own memory and registers for the intermediate
values as outlined in Table 2. Increasing the block size too much results in a per-
formance degradation because the required memory for the entire block does not fit
in the shared memory any more. As can be observed from Fig. 1 a block-size of 64
results in the optimal practical performance when processing larger batches of ECSM
computations.
To illustrate the computational power of the GPU even further let us consider the
throughput when fixing the latency to 5 ms. As can be seen from Fig. 1, the GTX 580
can compute 916, 1,024, or 960,224-bit elliptic curve scalar multiplications within this
time limit when using a block-size of 32, 64, or 96 threads respectively. The best of
these short runs already achieves a throughput of over 246,000 scalar multiplications
per second, when using a blocksize of 64, which is already 0.85 of the maximum
observed throughput obtained when processing much larger batches.
5.0.1 Performance Comparison
In [5] and the follow-up work [3] fast elliptic curve scalar multiplication is imple-
mented using Edwards curves in a cryptanalytic setting. The GPU implementations
optimize for high-throughput and implement generic modular arithmetic. The setting
considered in [3,5] requires to perform an ECSM with a 11,797-bit scalar; in order to
compare results we scale their figures by a factor 11,797224 . Comparing to these imple-
mentations is difficult because both the finite field and elliptic curve arithmetic differ
from the approaches considered in this paper where the faster arithmetic on Edwards
123
Int J Parallel Prog (2012) 40:532–550 547
curves cannot be used. On the GTX 295 architecture, for which our algorithms are not
designed, the throughput reported in [3] is 3.3 times higher. The associated latency
times are not reported. Since the approach from [3] is designed for high-throughput,
e.g. one thread per ECSM resulting in no communication overhead, we expect that this
approach has a higher throughput (and higher latency) on the Fermi cards compared
to the new results from this work.
The GPU implementations discussed in [2,53] target the same special modulus
as discussed in this paper. In [53] one thread per multiplication is used (optimizing
for high-throughput) and multiplies the same elliptic curve point in all threads. This
reduces the amount of data which needs to be transferred from the host machine to the
GPU. The authors of [2] implement a low-latency algorithm by parallelizing the finite
field arithmetic using RNS (see Sect. 4.1). Their performance data do not include the
transfer time of the input (output) from the host (GPU) to the GPU (host). Both the
GTX 285 and 295 belong to the same GPU family, the former is clocked 1.2 faster
than the latter while the GTX 295 consists of two of these slower GPUs. Compared
to [2] our minimum latency is more than twice lower while the maximum throughput
on a single slower GPU of the GTX 295 is almost quadrupled.
To put these results in perspective we have ran the latest benchmark suite in Open
SSL [44] of the Elliptic curve Diffie–Hellman protocol [54] which incorporates the
optimizations reported in [29]. The target machine has an Intel Core i7-2600K CPU
(clocked at 3.40 GHz). The OpenSSL benchmark tool measures a throughput of 11,544
Diffie–Hellman operations per second per core (using NIST’s p224 prime)operations
per second per core1. Even though a single Diffie–Hellman operation is essentially
an elliptic curve scalar multiplication there is some additional overhead in OpenSSL’s
benchmark suite so these results are only a lowerbound on the performance. Taken into
account that this machine has four CPUs the throughput of our GPU implementation
is more than six times higher (on the GTX 580). The latency on the CPU is lower:
around 0.09 ms compared to 1.9 ms on the CPU and GPU respectively. This is no
surprise as it was the main motivation to investigate low-latency implementations in
this paper. These results show that the current parallel implementations are coming
close to deliver low-latency while still significantly outperforming the CPU in terms
of throughput.
6 Conclusion
We presented an algorithm which is particularly well-suited for parallel computer
architectures to compute the scalar multiplication of an elliptic curve point with as
main goal to reduce latency. When applied to a 224-bit standardized elliptic curve
used in cryptography and running on a GTX 580 GPU the minimum time required is
1.9 ms; improving on previous low-latency results by an order of magnitude.
1 To enable the faster implementation using the techniques described in [29] OpenSSL 1.0.1 needs to be
configured using “./Configure enable-ec_nistp_64_gcc_128”, otherwise the throughput is 3,685 Diffie–
Hellman operations per second.
123
548 Int J Parallel Prog (2012) 40:532–550
Acknowledgments This work was supported by the Swiss National Science Foundation under grant num-
bers 200020-132160 and 200021-119776. We gratefully acknowledge Deian Stefan, for granting us access
to the NVIDIA GTX 480 to benchmark our programs and the insightful comments by the International
Journal of Parallel Programming reviewers.
References
1. AMD: ATI CTM reference guide. Technical reference manual (2006)
2. Antao, S., Bajard, J.C., Sousa, L.: Elliptic curve point multiplication on GPUs. In: 21st IEEE Interna-
tional Conference on Application-Specific Systems Architectures and Processors (ASAP), 2010, pp.
192–199 (2010)
3. Bernstein, D.J., Chen, H.C., Chen, M.S., Cheng, C.M., Hsiao, C.H., Lange, T., Lin, Z.C., Yang, B.Y.:
The billion-mulmod-per-second PC. In: Special-Purpose Hardware for Attacking Cryptographic Sys-
tems—SHARCS 2009, pp. 131–144 (2009)
4. Bernstein, D.J., Chen, H.C., Cheng, C.M., Lange, T., Niederhagen, R., Schwabe, P., Yang, B.Y.:
ECC2K-130 on NVIDIA GPUs. In: Gong, G., Gupta, K.C. (eds.) Indocrypt 2010, Lecture Notes in
Computer Science, vol. 6498, pp. 328–346. Springer, Berlin, Heidelberg (2010)
5. Bernstein, D.J., Chen, T.R., Cheng, C.M., Lange, T., Yang, B.Y.: ECM on graphics cards. In: Joux, A.
(ed.) Eurocrypt 2009, Lecture Notes in Computer Science, vol. 5479, pp. 483–501. Springer, Heidel-
berg (2009)
6. Bernstein, D.J., Lange, T.: Faster addition and doubling on elliptic curves. In: Kurosawa, K. (ed.)
Asiacrypt, Lecture Notes in Computer Science, vol. 4833, pp. 29–50. Springer, Heidelberg (2007)
7. Bernstein, D.J., Lange, T. : Analysis and optimization of elliptic-curve single-scalar multiplica-
tion. In: Mullen, G.L., Panario, D., Shparlinski, I.E. (eds.) Finite Fields and Applications, Contemporary
Mathematics Series, vol. 461, pp. 1–119. American Mathematical Society, Providence, RI (2008)
8. Bevand, M.: MD5 Chosen-Prefix Collisions on GPUs. Whitepaper, Black Hat (2009)
9. Blythe, D.: The Direct3D 10 system. ACM Trans. Graph. 25(3), 724–734 (2006)
10. Bos, J.W.: High-performance modular multiplication on the cell processor. In: Hasan, M.A., Helleseth,
T. (eds.) Arithmetic of Finite Fields—WAIFI 2010, Lecture Notes in Computer Science, vol. 6087, pp.
7–24. Springer, Heidelberg (2010)
11. Bos, J.W., Kaihara, M.E., Kleinjung, T., Lenstra, A.K., Montgomery, P.L.: On the security of 1024-bit
RSA and 160-bit elliptic curve cryptography. Cryptology ePrint archive, report 2009/389. http://eprint.
iacr.org/ (2009)
12. Bos, J.W., Stefan, D.: Performance analysis of the SHA-3 candidates on exotic multi-core architec-
tures. In: Mangard, S., Standaert, F.X. (eds.) Cryptographic Hardware and Embedded Systems—CHES
2010, Lecture Notes in Computer Science, vol. 6225, pp. 279–293. Springer, Heidelberg (2010)
13. Brier, E., Joye, M.: Weierstraß elliptic curves and side-channel attacks. In: Naccache, D., Paillier,
P. (eds.) Public Key Cryptography—PKC 2002, Lecture Notes in Computer Science, vol. 2274, pp.
335–345. Springer, Heidelberg (2002)
14. Brown, M., Hankerson, D., López, J., Menezes, A.: Software implementation of the NIST elliptic
curves over prime fields. In: Naccache, D. (ed.) CT-RSA, Lecture Notes in Computer Science, vol.
2020, pp. 250–265. Springer, Heidelberg (2001)
15. Cohen, H., Miyaji, A., Ono, T.: Efficient elliptic curve exponentiation using mixed coordinates. In:
Ohta, K., Pei, D. (eds.) Asiacrypt 1998, Lecture Notes in Computer Science, vol. 1514, pp. 51–65.
Springer, Heidelberg (1998)
16. Edwards, H.M.: A normal form for elliptic curves. Bull. Am. Math. Soc. 44, 393–422 (2007)
17. Fischer, W., Giraud, C., Knudsen, E.W., Seifert, J.P.: Parallel scalar multiplication on general elliptic
curves over Fp hedged against non-differential side-channel attacks. Cryptology ePrint archive, report
2002/007. http://eprint.iacr.org/ (2002)
18. Garland, M., Grand, S.L., Nickolls, J., Anderson, J., Hardwick, J., Morton, S., Phillips, E., Zhang, Y.,
Volkov, V.: Parallel computing experiences with CUDA. IEEE Micro 28(4), 13–27 (2008)
19. Garner, H.L.: The residue number system. IRE Trans. Electron. Comput. 8, 140–147 (1959)
20. Granlund, T.: GMP small operands optimization. In: Software Performance Enhancement for Encryp-
tion and Decryption—SPEED 2007 (2007)
21. Group, K.: OpenCL—the open standard for parallel programming of heterogeneous systems. http://
www.khronos.org/opencl/
123
Int J Parallel Prog (2012) 40:532–550 549
22. Harrison, O., Waldron, J.: AES encryption implementation and analysis on commodity graphics pro-
cessing units. In: Paillier, P., Verbauwhede, I. (eds.) Cryptographic Hardware and Embedded Systems—
CHES 2007, Lecture Notes in Computer Science, vol. 4727, pp. 209–226. Springer, Heidelberg (2007)
23. Harrison, O., Waldron, J.: Practical symmetric key cryptography on modern graphics hardware. In:
Proceedings of the 17th Conference on Security Symposium, pp. 195–209. USENIX Association
(2008)
24. Harrison, O., Waldron, J.: Efficient acceleration of asymmetric cryptography on graphics hardware.
In: Preneel, B. (ed.) Africacrypt 2009, Lecture Notes in Computer Science, vol. 5580, pp. 350–367.
Springer, Heidelberg (2009)
25. Hisil, H., Wong, K.K.H., Carter, G., Dawson, E.: Twisted Edwards curves revisited. In: Pieprzyk, J.
(ed.) Asiacrypt 2008, Lecture Notes in Computer Science, vol. 5350, pp. 326–343. Springer, Heidelberg
(2008)
26. Izu, T., Takagi, T.: A fast parallel elliptic curve multiplication resistant against side channel attacks.
In: Naccache, D., Paillier, P. (eds.) Public Key Cryptography—PKC 2002, Lecture Notes in Computer
Science, vol. 2274, pp. 371–374. Springer, Heidelberg (2002)
27. Joye, M., Yen, S.M.: The Montgomery powering ladder. In: Kaliski, B.S. Jr., Koç, Ç.K., Paar, C. (eds.)
Cryptographic Hardware and Embedded Systems—CHES 2002, Lecture Notes in Computer Science,
vol. 2523, pp. 1–11. Springer, Heidelberg (2003)
28. Karatsuba, A.A., Ofman, Y.: Multiplication of many-digital numbers by automatic computers. In:
Proceedings of the USSR Academy of Science, vol. 145, pp. 293–294 (1962)
29. Käsper, E.: Fast elliptic curve cryptography in OpenSSL. In: Danezis, G., Dietrich, S., Sako, K.
(eds.) The 2nd Workshop on Real-Life Cryptographic Protocols and Standardization, Lecture Notes
in Computer Science, vol. 7126. Springer. http://research.google.com/pubs/archive/37376.pdf (2012,
to appear)
30. Koblitz, N.: Elliptic curve cryptosystems. Math. Comput. 48(177), 203–209 (1987)
31. Kocher, P.C.: Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems. In:
Koblitz, N. (ed.) Crypto 1996, Lecture Notes in Computer Science, vol. 1109, pp. 104–113. Springer,
Heidelberg (1996)
32. Lenstra, A.K., Verheul, E.R.: Selecting cryptographic key sizes. J. Cryptol. 14(4), 255–293 (2001)
33. Lenstra, H.W. Jr.: Factoring integers with elliptic curves. Ann. Math. 126(3), 649–673 (1987)
34. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA tesla: a unified graphics and computing
architecture. IEEE Micro 28(2), 39–55 (2008)
35. Manavski, S.: CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In:
IEEE International Conference on Signal Processing and Communications, 2007. ICSPC 2007, pp.
65–68 (2007)
36. Merrill, R.D.: Improving digital computer performance using residue number theory. IEEE Trans.
Electron. Comput. EC-13(2), 93–101 (1964)
37. Miller, V.S.: Use of elliptic curves in cryptography. In: Williams, H.C. (ed.) Crypto 1985, Lecture
Notes in Computer Science, vol. 218, pp. 417–426. Springer, Heidelberg (1986)
38. Montgomery, P.L.: Speeding the Pollard and elliptic curve methods of factorization. Math. Com-
put. 48(177), 243–264 (1987)
39. Moss, A., Page, D., Smart, N.P.: Toward acceleration of RSA using 3D graphics hardware. In: Galbra-
ith, S.D. (ed.) Proceedings of the 11th IMA International Conference on Cryptography and Coding,
Cryptography and Coding 2007, pp. 364–383. Springer (2007)
40. National Security Agency: Fact sheet NSA suite B cryptography. http://www.nsa.gov/ia/programs/
suiteb_cryptography/index.shtml (2009)
41. Nickolls, J., Dally, W.J.: The GPU computing era. IEEE Micro 30(2), 56–69 (2010)
42. NVIDIA: NVIDIA’s next generation CUDA compute architecture: Fermi (2009)
43. NVIDIA: NVIDIA CUDA programming guide 3.2 (2010)
44. OpenSSL: The open source toolkit for SSL/TLS. http://www.openssl.org/ (2012)
45. Osvik, D.A., Bos, J.W., Stefan, D., Canright, D.: Fast software AES encryption. In: Hong, S., Iwata, T.
(eds.) Fast software encryption—FSE 2010, Lecture Notes in Computer Science, vol. 6147, pp. 75–93.
Springer, Heidelberg (2010)
46. Owens, J.: GPU architecture overview. In: Special Interest Group on Computer Graphics and Interac-
tive Techniques—SIGGRAPH 2007, p. 2. ACM (2007)
47. Patterson, D.A., Hennessy, J.L.: Computer Organization and Design: The Hardware/Software Inter-
face, 4th edn. Morgan Kaufmann, San Francisco, CA (2009)
123
550 Int J Parallel Prog (2012) 40:532–550
48. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryp-
tosystems. Commun. ACM 21, 120–126 (1978)
49. Segal, M., Akeley, K.: The OpenGL Graphics System: A Specification (Version 2.0). Silicon Graph-
ics, Mountain View, CA (2004)
50. Silverman, J.H.: The Arithmetic of Elliptic Curves, Gradute Texts in Mathematics, vol. 106.
Springer, Berlin (1986)
51. Solinas, J.A.: Generalized Mersenne numbers. Technical report CORR 99-39, Centre for Applied
Cryptographic Research, University of Waterloo (1999)
52. National Institute of Standards and Technology: Special publication 800-57: recommendation
for key management part 1: general (revised). http://csrc.nist.gov/publications/nistpubs/800-57/
sp800-57-Part1-revised2_Mar08-2007.pdf
53. Szerwinski, R., Güneysu, T.: Exploiting the power of GPUs for asymmetric cryptography. In: Oswald,
E., Rohatgi, P. (eds.) Cryptographic Hardware and Embedded Systems—CHES 2008, Lecture Notes
in Computer Science, vol. 5154, pp. 79–99. Springer, Heidelberg (2008)
54. US Department of Commerce and National Institute of Standards and Technology: Recommendation
for pair-wise key establishment schemes using discrete logarithm cryptography. See http://csrc.nist.
gov/publications/nistpubs/800-56A/SP800-56A_Revision_Mar08-2007.pdf (2007)
55. US Department of Commerce/National Institute of Standards and Technology: Digital signature stan-
dard (DSS). FIPS-186-3. http://csrc.nist.gov/publications/fips/fips186-3/fips_186-3.pdf (2009)
56. Yang, J., Goodman, J.: Symmetric key cryptography on modern graphics hardware. In: Kurosawa, K.
(ed.) Asiacrypt, Lecture Notes in Computer Science, vol. 4833, pp. 249–264. Springer, Heidelberg
(2007)
123
