High-Performance Modular Multiplication on the Cell Processor by Bos, Joppe Willem
High-Performance Modular Multiplication on
the Cell Processor
Joppe W. Bos
Laboratory for Cryptologic Algorithms
EPFL, Station 14, CH-1015 Lausanne, Switzerland
Abstract. This paper presents software implementation speed records
for modular multiplication arithmetic on the synergistic processing ele-
ments of the Cell broadband engine (Cell) architecture. The focus is on
moduli which are of special interest in elliptic curve cryptography, that
is, moduli of bit-lengths ranging from 192- to 521-bit. Finite ﬁeld arith-
metic using primes which allow particularly fast reduction is compared
to Montgomery multiplication. The special primes considered are the
ﬁve recommended NIST primes, as speciﬁed in the FIPS 186-3 standard,
and the prime used in the elliptic curve curve25519. While presented
and benchmarked on the Cell architecture, the proposed techniques to
eﬃciently implement the modular multiplication algorithms are suited to
run on any architecture which is able to compute multiple computations
concurrently; e.g. graphics processing units.
Keywords: Cell Broadband Engine, Curve25519, Elliptic Curve Cryp-
tography (ECC), Montgomery Multiplication, NIST primes.
1 Introduction
Elliptic curve cryptography (ECC) [20,24] is an approach to public-key crypto-
graphy which enjoys increasing popularity since its invention in the mid 1980s.
The attractiveness of small key-sizes [22] has placed this public-key cryptosystem
as the preferred alternative to the widely used RSA public-key cryptosystem [30].
This is emphasized by the current migration away from 80-bit to 112-bit security
where, for instance, the United States’ National Security Agency restricts the
use of public key cryptography in “Suite B” [27] to ECC.
In this paper we present performance results for one of the key operations in
ECC: modular multiplication. The performance results are obtained when run-
ning on the heterogeneous, multi-core, single instruction, multiple data (SIMD)
Cell broadband engine (Cell) architecture. As far as we know, our performance
results set new speed records for generic moduli, using interleaved Montgomery
multiplication [25], and special modular multiplication for moduli ranging from
192 to 521 bits. This range covers the current standardized parameters for ECC
cryptosystems as speciﬁed by National Institute of Standards (NIST) [34].
The special primes considered in this work are the recommended primes of
special form by NIST [34] and the prime used in curve25519 as proposed by
M.A. Hasan and T. Helleseth (Eds.): WAIFI 2010, LNCS 6087, pp. 7–24, 2010.
c© Springer-Verlag Berlin Heidelberg 2010
8 J.W. Bos
Bernstein [2]. These special primes are used to enhance the performance of
ECC-based schemes in practice by exploiting the special form of the primes
to construct a fast reduction step. Typically, the multiplication and special re-
duction are performed sequentially. For the separated multiplication step we
consider schoolbook and Karatsuba multiplication [18] techniques. We use the
straight-forward methods to implement the fast reduction for the NIST recom-
mended primes (see [32]). For the special prime in curve25519 we use a diﬀerent
approach in order to compare with the proposed fast reduction from [2].
The performance results are obtained by using the features of SIMD archi-
tectures. The implementations are optimized for the Cell and take both the
advantages (e.g., the rich instruction set and large register ﬁle) and disadvan-
tages (e.g., the “small” 16 × 16 → 32-bit multiplier) of this architecture into
account. Furthermore, multiple streams of computations are interleaved to in-
crease throughput. Multi-stream modular multiplication computations are use-
ful in both a cryptanalytic and cryptographic setting. For instance, one could
use multi-stream modular multiplication routines, either the generic or special
variant, to speedup batch decryption for ECC-based schemes. Additionally, this
work shows the practical beneﬁt of using the special over generic prime moduli
on the Cell.
The paper is organized as follows. Section 2 introduces the Cell broadband
engine architecture. Section 3 recalls some basic facts about elliptic curves,
Montgomery multiplication and discusses the special primes used in this work.
Section 4 describes the cryptographic and cryptanalytic applications where
multi-stream modular multiplications can be used. Section 5 describes how the
diﬀerent modular multiplication methods can be combined into a multi-stream
high-performance implementation on the Cell. Section 6 presents and discusses
our performance results and compares them to implementations by others on
the Cell. Section 7 concludes the paper.
2 The Cell Broadband Engine
The Cell architecture [15], jointly developed by Sony, Toshiba, and IBM, is
equipped with one dual-threaded, 64-bit in-order “Power Processing Element”
(PPE), which can oﬄoad work to the eight “Synergistic Processing Elements”
(SPEs) [33]. The SPEs are the workhorses of the Cell processor which can be
found in the PlayStation 3 (PS3) game console. Each SPE, running at 3.2 GHz in
the PS3, consists of a Synergistic Processing Unit (SPU), 256 kilobyte of private
memory called Local Store (LS) and a Memory Flow Controller.
Most SPU instructions are 128-bit wide single instruction, multiple data
(SIMD) operations performing sixteen 8-bit, eight 16-bit, four 32-bit, or two
64-bit computations in parallel. Each SPU is equipped with a large register ﬁle
containing 128 registers of 128 bits each, providing space for unrolling and soft-
ware pipelining of loops, hiding the relatively long latencies of its instructions.
Unlike the processor in the PPE, the SPUs are asymmetric processors, having
two pipelines (denoted by the odd and the even pipeline) which are designed to
High-Performance Modular Multiplication on the Cell Processor 9
execute two disjoint sets of instructions (denoted by odd and even instructions).
In the ideal case, two instructions (one odd and one even) can be dispatched per
cycle. The SPUs are in-order processors and have no hardware branch-prediction.
Instead, the programmer (or compiler) can tell the instruction fetch unit in ad-
vance where a (single) branch instruction will jump to.
Each SPE has access to a rich instruction set which operates simultaneously
on 8-, 16- or 32-bit words. Instructions of particular interest are shuffle (odd
instruction) and select (even instruction). The d = shuffle(a, b, c) instruction
uses the pattern given in c to shuﬄe 16 of the 32 bytes of a and b to the output d,
in such a way that the jth byte of c determines the jth byte of d, either as a
copy of a byte of a or b or as one of the constants {0x00, 0xFF, 0x80}, and
where duplicate copies are allowed. The d = select(a, b, p) instruction acts as
a 2-way multiplexer; depending on the input pattern p the corresponding bit
from either a or b is selected as the output bit in d. The SPEs are equipped
with a 4-way SIMD multiplier (even instruction) which can compute four 16-
bit integer multiplications simultaneously per clock cycle. In addition, an even
4-way SIMD multiply-and-add instruction, which performs a 16 × 16 → 32-bit
unsigned multiplication and an addition of a 32-bit unsigned operand to the
32-bit product, is available and has the same latency as a multiplication without
the addition. Note that carries are not generated for this instruction.
3 Preliminaries
In this section the required background about elliptic curves, the various (mod-
ular) multiplication techniques and the special primes are recalled. We want to
compute the product C ≡ A · B mod M , by either ﬁrst applying schoolbook or
Karatsuba multiplication and next a fast reduction, or C ≡ A · B · r−n mod M
using Montgomery multiplication, with A,B,C, r, n,M ∈ Z. Here, M is an n-
word, odd modulus such that rn−1 ≤ M < rn. In practice r = 2w with w the
bit-length of a word, for the algorithms implemented for the SPE we either use
w = 32 or w = 16 (cf. Section 5).
Elliptic Curves. Let p > 3 be a prime, then any a, b ∈ Fp such that
4a3 + 27b2 = 0 deﬁne an elliptic curve Ea,b over Fp. The zero point, the so-
called point at inﬁnity, together with the set of points (x, y) ∈ Fp × Fp which
satisfy the shortened aﬃne Weierstrass equation y2 = x3 + ax + b, form an
Abelian group Ea,b(Fp) [31] (usually written additively). Repeated point addi-
tion is called scalar multiplication and a single instance of point addition can be
computed using multiple operations in Fp. Besides the aﬃne Weierstrass repre-
sentation one can use a whole range of diﬀerent representations. An overview of
the costs, expressed in arithmetic operations in the underlying ﬁeld, is given by
Bernstein and Lange in [5].
Montgomery Multiplication. The Montgomery modular multiplication
method is introduced in [25] and can be used to replace the conventional modular
multiplication. In order to be used, the operands need to be converted: given an
10 J.W. Bos
Algorithm 1. Schoolbook (left), Karatsuba (middle) and interleaved Montgomery
(right) multiplication algorithms.
Input:{
A =
∑n−1
i=0 air
i,
B =
∑n−1
i=0 bir
i
Output:{
C = A · B
=
∑ 2n−1
i=0 cir
i
1. C = A · b0
2. for i = 1 to n−1 do
3. C = C + ri(A · bi)
4. return C
Input:
⎧⎪⎪⎪⎨
⎪⎪⎪⎩
A =
∑n−1
i=0 air
i,
B =
∑n−1
i=0 bir
i,
T : some threshold for
switching to schoolbook
multiplication.
Let r˜ = rn/2.
Output:
{
C = A · B
=
∑ 2n−1
i=0 cir
i
1. if n < T then
2. return C = schoolbook(A, B)
3. A = A0 + A1r˜, 0 ≤ A0, A1 < r˜
4. B = B0 + B1 r˜, 0 ≤ B0, B1 < r˜
5. T0 = Karatsuba(A0, B0)
6. T1 = Karatsuba(A1, B1)
7. T2 = Karatsuba(A0+A1, B0+B1)−
T0 − T1
8. return
C = (T0 + T2 · r˜ + T1 · r˜2)
Input:⎧⎪⎪⎪⎨
⎪⎪⎪⎩
A =
∑n−1
i=0 air
i, B,
M, µ such that
0 ≤ A, B < rn,
rn−1 ≤ M < rn,
2  M, gcd(r, M) = 1
µ = −M−1 mod r,
Output:{
C ≡ A · B · r−n mod M
such that 0 ≤ C < rn
1. C = 0
2. for i = 0 to n− 1 do
3. C = C + ai · B
4. q = µ · C mod r
5. C = (C + q ·M)/r
6. if C ≥ rn then
7. C = C −M
8. return C
integer X , the Montgomery residue of this integer is deﬁned as X˜ = X ·rn mod M
with rn−1 ≤ M < rn. The constant rn is the Montgomery radix such that
gcd(rn,M) = 1. The Montgomery product is deﬁned as X˜ · Y˜ · r−n mod M , ad-
dition and subtraction remain unchanged. Since converting to and from Mont-
gomery form requires computational eﬀort, the Montgomery multiplication is
mostly used in settings where the computation of a sequence of modular opera-
tions is required. See Algorithm 1 for a high-level description of the interleaved
Montgomery multiplication method.
Fast Reduction. One way to speed up elliptic curve arithmetic is to enhance the
performance of the ﬁnite ﬁeld arithmetic by using a prime of a special form. The
structure of such a prime is exploited by constructing a fast reduction method,
applicable to this prime only. Typically, the multiplication and reduction are
in two sequential phases. For the multiplication phase we consider the so-called
schoolbook, or textbook, multiplication and the asymptotically faster Karatsuba
multiplication techniques (see Algorithm 1 for a high-level description).
NIST Primes. In the FIPS 186-3 standard [34] NIST recommends the use of
ﬁve prime ﬁelds when using the elliptic curve digital signature algorithm. These
primes allow fast reduction, see Appendix A for the algorithms optimized for a
machine word (limb) size of 32 bits, based on the work by Solinas [32]. The ﬁve
recommended primes are
p192 = 2192 − 264 − 1, p224 = 2224 − 296 + 1,
p256 = 2256 − 2224 + 2192 + 296 − 1, p384 = 2384 − 2128 − 296 + 232 − 1,
p521 = 2521 − 1.
An extensive study of a software implementation of the NIST-recommended
elliptic curves over prime ﬁelds on the x86 architecture is given by Brown
et al. [8].
Curve25519. The elliptic curve curve25519 is proposed by Bernstein in [2].
Besides oﬀering high-speed arithmetic, a list of other advantages can be found
High-Performance Modular Multiplication on the Cell Processor 11
in the original article [2]. This curve is over Fp255 with p255 = 2255 − 19, an ele-
ment x ∈ Fp255 can be represented as x =
∑9
i=0 xi2
25.5i. Bernstein proposes to
implement the arithmetic using ﬂoating point instructions and therefore repre-
sentation inside a CPU is achieved by using ﬂoating-point registers. The original
article gives performance data obtained on a Pentium M.
4 Applications
To increase throughput the 4-way SIMD instructions of the SPE are used to
implement a modular multiplication routine which operates on 4 streams, or a
small multiple of 4 by interleaving these streams, in parallel. When a sequence
of multiplications has to be computed, for instance in elliptic curve scalar mul-
tiplication, the algorithm performs the same operations in SIMD-mode on all
inputs. When the scalar multipliers are diﬀerent, a square-and-multiply algo-
rithm needs to perform a diﬀerent sequence of point additions and doublings,
since this depends on the binary expansion of the scalar multiplier. Performing
the same computations on multiple streams concurrently, when multiplying with
diﬀerent scalars, in a SIMD fashion might be suboptimal since all streams which
are being processed in parallel need to perform the same computations. In this
section we present some applications in cryptography and cryptanalysis where
SIMD modular multiplication algorithms can be beneﬁcial; i.e., where the same
multiplier is used in multiple independent instances.
Cryptography. Cryptographic schemes often need to perform exponentiations
with a randomly selected exponent, or scalar multiplications when using the
additive group law as in the elliptic curve setting. If this exponent is used several
times, in independent calculations, these operations can be performed in parallel
in a SIMD fashion. For instance, in elliptic curve public-key schemes the ability
to process multiple streams of modular multiplication computations can be used
to speedup batch decryption. Examples of such schemes are the elliptic curve
integrated encryption scheme (ECIES), proposed by Bellare and Rogaway [1] and
standardized in [9], and the provably secure encryption curve scheme (PSEC),
based on the work by Fujisaki and Okamoto [13] and standardized in [17]. The
decryption of a message consist of multiplying an elliptic curve point, as speciﬁed
by the ciphertext, by the private key d in PSEC or by h ·d in the case of ECIES,
where h is the cofactor of the group order and is constant for a given private key.
When many messages need to be decrypted, using the same private key, SIMD
algorithms as described in this article can be used to speedup computations.
In other settings, where the bitsize of the modulus is usually larger compared
to the ECC setting, multi-stream modular multiplication computations can be
useful as well. ElGamal encryption schemes [12] require two exponentiations with
the same random exponent. Other related methods perform more exponentia-
tions with the same exponent. The double base variant of ElGamal by Damg˚ard,
often referred to as Damg˚ard ElGamal [11], performs three exponentiations. The
“double” hybrid Damg˚ard ElGamal, as proposed by Kiltz et al. [19], requires four
exponentiations with the same exponent in every encryption.
12 J.W. Bos
Cryptanalysis. In cryptanalysis, multi-stream modular multiplication compu-
tations, for moduli sizes as considered in this article, can be used to enhance the
performance of the Pollard rho discrete logarithm algorithm [29], a method to
solve the elliptic curve discrete logarithm problem (ECDLP) which is essential
to the security of ECC. In practice, modular inversions in the Pollard rho al-
gorithm are traded for modular multiplications, to increase speed, by using the
Montgomery simultaneous inversion technique [26]. This technique allows one to
trade, when running N computations in parallel, N inversions for roughly 3N
modular multiplications and one inversion. For example, this technique is used
in [7] to solve a 112-bit ECDLP on the SPE architecture by working concur-
rently on 400 computations. Here, 70 percent of the total run-time is spent on
the computation of modular multiplications [7].
Another cryptanalytic application is factoring integers. The integer factoriza-
tion problem is essential to cryptographic algorithms as RSA. The fastest known
method to factor integers is the number ﬁeld sieve [28,21]. This method can use
the elliptic curve factorization method (ECM) [23] in a co-factorization phase.
Performing elliptic curve arithmetic on multiple points allows the use of multi-
stream modular multiplication methods. Related work by Bernstein et al. [4]
gives performance details of a high-performance multi-stream implementation of
modular arithmetic in the ECM on graphics cards.
5 Multiplication on the SPE Architecture
The (modular) multiplication operations in this work are designed to operate on
relatively small (≤ 521 bits) integers. On the widely available x86 and x86-64
architectures the threshold for switching from schoolbook multiplication to meth-
ods with a lower asymptotic run-time complexity (e.g. Karatsuba multiplication)
is > 800 bits [14]. On these architectures the size of the operands on which the
multiplication and addition instructions work is typically the same (either 32 or
64 bits).
On the Cell “only” a 16× 16 → 32 bits multiplication instruction is available,
performing four multiplications in parallel, while the size of the 4-way SIMD
operands to the addition instruction is 32 bits. Unlike the x86 architecture an
integer multiply-and-add instruction is available. This allows the addition of
two extra 16-bit values to a result of a 16-bit multiplication without generating
a carry, since if 0 ≤ a, b, c, d < 216, then a · b+ c+ d < 232. We consider both the
schoolbook and Karatsuba multiplication for the special modular multiplication
routines.
Integer Representation on the Cell. For a high-performance implementa-
tion of arithmetic algorithms on the Cell, vectorization techniques are applied
and data are represented using the 4-way SIMD organization of the SPEs. Using
m 128-bit registers x[0], x[1], . . . , x[m− 1] a four-tuple (x1, x2, x3, x4) of integers
is represented. Each xi is a wm-bit integer, where w is either 16 or 32 depend-
ing on the setting; typically we use w = 16 for multiplication and w = 32 for
High-Performance Modular Multiplication on the Cell Processor 13
x[0] =
128-bit wide register︷ ︸︸ ︷
︸ ︷︷ ︸
the 32 (or 16) least significant bits of x2 are located in
this 32-bit word (or in its 16 least significant bits)
...
...
x[j] = 16-bit︸ ︷︷ ︸
high
order
16-bit︸ ︷︷ ︸
low
order
...
...
x[m− 1] = ︸ ︷︷ ︸
↑
(x1,
︸ ︷︷ ︸
↑
x2,
︸ ︷︷ ︸
↑
x3,
︸ ︷︷ ︸
↑
x4)
Fig. 1. A four-tuple (x1, x2, x3, x4) of 32m-bit (or 16m-bit) integers arranged in m
128-bit registers
addition and subtraction to match the bit-lengths of the corresponding 4-way
SIMD instructions. Every element of the four-tuple is represented in a radix-2w
system:
xi =
m−1∑
j=0
x[j]i2wj,
for i = 1, 2, 3, 4. The four 32-bit words of the 128-bit register x[j] are denoted
by x[j]i. The representation of such a four-tuple (x1, x2, x3, x4) is depicted in
Figure 1.
Multiplication. Algorithm 2 depicts schoolbook multiplication designed to
run on SIMD architectures and is optimized for architectures with a native
multiply-and-add instruction. After trivially unrolling the for-loops the algo-
rithm is branch-free. Algorithm 2 splits the operands in 16-bit words, to take
advantage of the 16-bit multiplier on the Cell, but this can be modiﬁed to work
with any other word size on diﬀerent architectures. Hence, on the SPE, Algo-
rithm 2 operates on four-tuples of inputs simultaneously using the data repre-
sentation from Fig. 1.
After the multiply-and-add, and a possible extra addition of one 16-bit word,
the 32-bit result z is split into the 16 most and 16 least signiﬁcant bits, x and y
respectively. This is denoted by split(z) = ( z216 , z mod 216). On the SPE this
splitting can be implemented in diﬀerent ways, i.e. by using two odd shuffle
instructions, or one even and and one odd shuffle instruction, or two even
and instructions. The appropriate splitting implementation is chosen to bal-
ance the number of odd and even instructions, reducing the total number of
required cycles. Note that when i = 1 the extra addition of di+1 can be omitted.
Hence, Algorithm 2 requires n2× split, n2× muladd and n(n− 2)× add (when
multiplying two 16n-bit integers); this can be computed in 2n(n − 34 ) cycles,
14 J.W. Bos
Algorithm 2. Radix-216 schoolbook multiplication algorithm.
Input:
{
Integer a = (an−1, . . . , a1, a0), each ai is a 16-bit word.
Integer b = (bn−1, . . . , b1, b0), each bi is a 16-bit word.
Output: Integer c = (c2n−1, . . . , c1, c0) = a · b, each ci is a 16-bit word.
1. di = 0, i ∈ [1, n]
2. for j = 0 to n− 1 do
3. (e0,Dj) = split(a0 · bj + d1)
4. for i = 1 to n− 1 do
5. (ei, di) = split(ai · bj + ei−1 + di+1)
6. dn = en−1
7. return (c = (dn, dn−1, . . . , d1,Dn−1,Dn−2, . . . , D0))
optimistically assuming all odd and even pairs can be dispatched simultaneously.
Furthermore, this approximation ignores the function-call overhead and loading
and storing the in- and output from the local store. This leads to an optimistic
approximation for the computation of a single 16n× 16n → 32n-bit schoolbook
multiplication in n2
(
n− 34
)
cycles (when processing 4 streams in parallel).
A branch-free (when unrolled) Karatsuba multiplication algorithm optimized
for vector architectures is given in Algorithm 3. This algorithm works on 32-bit
words, which is the word size of the even 4-way SIMD addition and subtraction
instructions on the SPE. Just as with the schoolbook multiplication this word
size can trivially be modiﬁed. Algorithm 3 assumes that the bitsize of the input
values is a multiple of 64 to split the operands evenly in two 32-bit multiples.
These parts are multiplied using another multiplication routine mul, which is
either a schoolbook or Karatsuba multiplication, which operates on inputs of
half the size.
The 2m-bit multiplication is split into two m×m-bit and one (m+1)×(m+1)-
bit multiplications (see Alg. 1). In order to avoid the use of a probably more
expensive multiplication by an extra limb, three m ×m-bit multiplications are
used. The correct result, for the (m+1)×(m+1)-bit multiplication, is computed
by creating select-masks from the most signiﬁcant bit of each of the two operands.
These are used to select the appropriate value (one of the inputs) or zero, which
is added to the result of the m×m-bit multiplication. Note that the initial borrow
values, in line 21, are (counterintuitively) set to one. An extra subtraction of one
is performed when the borrow is zero and no subtraction is performed when the
borrow is one on the SPE.
Special Reduction. The special reduction algorithms, see Appendix A, do not
fully reduce the input to the range [0, p〉 but to [0, t · p〉, where p is the prime
modulus used and t a small positive integer. In order to reduce a four-tuple of
integers simultaneously using SIMD instructions, diﬀerent approaches can be
applied. Obviously the reduction algorithm can be applied again. A most likely
faster approach, when t is suﬃciently small, is to subtract p repeatedly until the
result is in the desired range. The repeated subtracting is done by masking the
value appropriately before subtracting, which needs to be performed up to t− 1
times since multiple integer values are processed in parallel.
High-Performance Modular Multiplication on the Cell Processor 15
Algorithm 3. Radix-232 Karatsuba multiplication algorithm for architectures
which support vector instructions, n is even.
Input:
{
Integer X = (xn−1, . . . , x0), each xi is a 32-bit word.
Integer Y = (yn−1, . . . , y0), each yi is a 32-bit word.
Output: Integer Z = (z2n−1, . . . , z0) = X · Y , each zi is a 32-bit word.
1. (Bn−1, . . . , B0) = mul((xn−1, . . . , xn/2), (yn−1, yn/2))
2. (Cn−1, . . . , C0) = mul((xn/2−1, . . . , x0), (yn/2−1, . . . , y0))
3. zero = carry1 = carry2 = {0}
4. for i = 0 to n/2− 1 do
5. Xi = add extended(xn/2+i, xi, carry1)
6. Yi = add extended(yn/2+i, yi, carry2)
7. carry1 = gen carry extended(xn/2+i, xi, carry1)
8. carry2 = gen carry extended(yn/2+i, yi, carry2)
9. mask1 = cmpgt(carry1, 0), mask2 = cmpgt(carry2, 0)
10. for i = 0 to n/2− 1 do
11. si = select(zero, Yi,mask1), ti = select(zero,Xi,mask2)
12. c1 = select(zero, carry1,mask2)
13. (zn−1, . . . , zn/2, An/2−1, . . . , A0) = mul((Xn/2−1, . . . ,X0), (Yn/2−1, . . . , Y0))
14. carry1 = carry2{0}
15. for i = n/2 to n− 1 do
16. T = add extended(zi, si−n/2, carry1)
17. Ai = add extended(T, ti−n/2, carry2)
18. carry1 = gen carry extended(zi, si−n/2, carry1)
19. carry2 = gen carry extended(T, ti−n/2, carry2)
20. An = add extended(carry1, carry2, c1)
21. borrow1 = borrow2 = {1}
22. for i = 0 to n− 1 do
23. T = sub extended(Ai, Bi, borrow1)
24. Ei = sub extended(T, Ci, borrow2)
25. borrow1 = gen borrow extended(Ai, Bi, borrow1)
26. borrow2 = gen borrow extended(T, Ci, borrow2)
27. En = sub(An, zero, borrow1), En = sub(An, zero, borrow2)
28. carry1 = 0
29. for i = n/2 to n− 1 do
30. Zi = add extended(Ci, Ei−n/2, carry1)
31. carry1 = gen carry extended(Ci, Ei−n/2, carry1)
32. for i = n to n + n/2 − 1 do
33. Zi = add extended(Bi−n, Ei−n/2, carry1)
34. carry1 = gen carry extended(Bi−n, Ei−n/2, carry1)
35. Zn+n/2 = add extended(Bn/2, En, carry1)
36. carry1 = gen carry extended(Bn/2, En, carry1)
37. for i = n + n/2 + 1 to 2n− 1 do
38. Zi = add(Bi−n, carry1)
39. carry1 = gen carry(Bi−n, carry1)
40. return Z = (Z2n−1, . . . , Zn/2, Cn/2−1, . . . , C0)
16 J.W. Bos
Table 1. The values of the 32-bit unsigned limbs of t · p224, c7 and c0 are the most
and least signiﬁcant limb respectively. In order to avoid using a look-up table the value
t · p224 can be computed eﬃciently. Given t, c0 = t, c1 = c2 = 0, c3 = 0− t, the values
for c4, c5, c6, c7 can be constructed (using the select instruction) depending on t.
t t · p224 = {c7, . . . , c0}
c7 c6 c5 c4 c3 c2 c1 c0
0 0 0 0 0 0 0 0 0
1 0 232 − 1 232 − 1 232 − 1 232 − 1 0 0 1
2 1 232 − 1 232 − 1 232 − 1 232 − 2 0 0 2
3 2 232 − 1 232 − 1 232 − 1 232 − 3 0 0 3
4 3 232 − 1 232 − 1 232 − 1 232 − 4 0 0 4
An additional performance gain is possible when the modulus is constant.
Select the desired multiple of p, which needs to be subtracted, from a look-up
table and perform a single subtraction. This can be achieved, when operating on
multiple integer values in parallel, using the select instruction. If reduction to
[0, 2m〉, for an m-bit modulus p, is allowed, the most signiﬁcant word, containing
the possible carry, has to be inspected only to determine the multiple of p to
subtract. Note that an extra single subtraction might be needed in the unlikely
situation that the result after the subtraction is > 2m. This rare case is imple-
mented by a branch which is hinted to be false to reduce the branch-overhead.
The partially reduced numbers can be used as input to the same modular mul-
tiplication routines and if reduction to [0, p〉 is required this can be achieved at
the cost of a single multi-limb comparison and subtraction.
For the moduli of special form more instructions can be saved. For example
consider the modulus p224 = 2224 − 296 + 1. As described in Algorithm 6, in
Appendix A, the algorithm returns with (s1 + s2+ s3− s4− s5), where all the si
are 224-bit integers. At the implementation level we work with unsigned integers
and prefer not to work with negative numbers. This is achieved by subtracting
s4 + s5 from 2p224. We can bound the return value d by d = s1 + s2 + s3 +
(2p224 − s4 − s5) < 5p224. To reduce d to [0, 2224〉 the value t · p224, for some
t ∈ [0, 5〉, must be subtracted for four possibly diﬀerent values of t in parallel after
inspection of the most signiﬁcant word. As can be seen from the representation
in Table 1, when using a 232 radix system, selecting the correct value for the
diﬀerent limbs is computationally easy. This allows the computation of t · p224
on-the-ﬂy without the need to use and load from a look-up table. The reductions
for the other special NIST primes can be done in a similar fashion.
We propose a diﬀerent approach for calculating the reduction step for the spe-
cial prime p255 = 2255−19 compared to the ﬂoating point approach from [2] (see
Section 3). This approach is similar to the special reduction technique applied to
the 112-bit prime modulus in [6, Appendix A]. A redundant representation mod-
ulo P˜255 = 2 ·p255 = 2256−38 is used. Let R = rh ·2256+ rl be the 512-bit result
after multiplication. Next, the ﬁrst reduction step is performed by computing
S = rl +38 · rh ≡ R mod P˜255; note that S < 2262. Next, the same computation
High-Performance Modular Multiplication on the Cell Processor 17
Algorithm 4. Radix-216 Montgomery Multiplication Algorithm.
Input:
⎧⎪⎨
⎪⎪⎩
Integer a = (an−1, . . . , a1, a0), each ai is a 16-bit word.
Integer b = (bn−1, . . . , b1, b0), each bi is a 16-bit word.
Integer M = (Mn−1, . . . ,M1,M0), each Mi is a 16-bit word and M is odd.
An 16-bit integer m˜ = −M−1 mod 216.
Output: Integer c = (cn−1, . . . , c1, c0) ≡ a · b · 2−16n mod M .
1. di = 0, i ∈ [0, n]
2. for i = 0 to n− 1 do
3. (e0, d0) = split(a0 · bi + d0)
4. for j = 1 to n− 1 do
5. (ej , dj) = split(aj · bi + dj + ej−1)
6. dn = dn + en−1
7. (∗, q) = split(d0 · m˜)
8. (e0, d0) = split(M0 · q + d0)
9. for j = 1 to n− 1 do
10. (ej , dj−1) = split(Mj · q + dj + ej−1)
11. (dn, dn−1) = split(dn + en−1)
12. if dn > 0 then
13. (dn−1, . . . , d1, d0) = (dn, dn−1, . . . , d1, d0)− (Mn−1, . . . ,M1,M0)
14. return (c = (dn−1, . . . , d1, d0))
is repeated on S = sh · 2256 + sl: T = sl + 38 · sh ≡ S ≡ R mod P˜255. This is
computationally faster since sh < 26, note that the resulting T < 2257. Similar
techniques as described for the NIST primes are used to reduce the result to
[0, 2256〉.
Montgomery Multiplication. The interleaved Montgomery multiplication,
optimized for the use on vector architectures, is given in Algorithm 4. As pre-
sented, it uses 16-bit limbs and on the Cell four-tuples of inputs are processed
concurrently (but Alg. 4 can trivially be modiﬁed to operate on any radix size). A
conditional subtraction step is needed at the end of the algorithm to ensure that
the result is < 216n, for 16n-bit inputs. This conditional subtraction is replaced
by a comparison which creates a select mask, using this mask the value zero or
the value of the modulus is selected and subtracted. This eliminates a branch
which is to be avoided when processing multiple integer values in a SIMD fash-
ion. For eﬃciency, the integer representation is switched to a 232 radix system
when doing the ﬁnal masking and subtraction.
The same notation for the split function is used as in Section 5. Hence, Al-
gorithm 4 requires 2n(n + 1)× split, 2n(n + 1) × muladd (when counting the
multiplication in line 8 as an multiply-and-add) and 2n(n − 1) × add since the
addition of dj in line 5 when j = 1 can be omitted. For the conditional sub-
traction we ﬁrst convert the integer representation to a 232 radix system using

n2  shuffle instructions. Next we compare the carry (one cmpgt instruction)
and mask the value which we are going to subtract using 
n2  and instructions.
The subtraction requires 
n2  (extended) subtraction instructions and 
n2  − 1
(extended) generate borrow instructions.
18 J.W. Bos
Counting the number of instructions required in Algorithm 4 gives 4n2+3
n2 
even and 
n2  odd instructions plus 2n(n + 1) times the split function. Hence,
an optimistic estimate of the number of cycles, ignoring overhead and assuming
perfect scheduling, for a single computation of Montgomery multiplication on
16n-bit inputs, when computing four computations in parallel, using Algorithm 4
on a single SPE is n2 + 9n16 cycles.
6 Results
We implemented the proposed generic and special modular multiplication algo-
rithms using the C-programming language for the SPEs on the Cell architec-
ture. Four, or a small multiple of four, computations are processed in parallel.
The performance benchmarks are performed on a single SPE in the PlaySta-
tion 3 game console. We summarize these results, together with other (single
and multi-stream computation) modular multiplication results,7 obtained from
the literature, in Table 2. The metric of our performance results is the num-
ber of cycles for a single modular multiplication computation. Our performance
results are obtained by averaging over long sequences, hundreds of millions, of
diﬀerent modular multiplications and include the timing benchmark overhead,
the function call overhead, loading and storing the in- and output from the lo-
cal store and possibly converting the in- and output from the diﬀerent integer
representations (from radix-232 to radix-216 and vice-versa).
Performance Comparison. Performance results obtained with the Multi-
Precision Math (MPM) Library [16], provided by IBM in the example API for
the Cell, are given in Table 2 for diﬀerent bit-sizes. The MPM library imple-
ments a single-stream Montgomery multiplication computation. In order to ob-
tain a faster implementation for speciﬁc bit-lengths (to make a fair comparison)
we unrolled the various loops inside the MPM library. These unrolled versions
are signiﬁcantly faster compared to the standard MPM implementation; e.g.,
the unrolled 256-bit Montgomery multiplication is 1.4 times faster compared to
the unmodiﬁed MPM implementation. Our multi-stream implementations have
a higher latency compared to the unrolled MPM library but process multiple
streams resulting in fewer cycles per single multiplication. For instance, in the
setting of 256-bit moduli the unrolled MPM requires 877 cycles for a single
multiplication while our implementation requires 1188 cycles to compute four
multiplications in parallel. This is a speedup of almost a factor of three per
single multiplication.
In [10] Costigan and Schwabe implement elliptic curve arithmetic aimed at
curve25519 on the SPE architecture. The representation used diﬀers slightly,
but is based on, the one proposed in [2]; an element x ∈ Fp255 is represented as
x =
∑19
i=0 xi2
12.75i. A multi-stream version working on four streams in parallel
is implemented and hand-optimized in assembly and “perfectly” scheduled with
the surrounding code in a larger function implementing elliptic curve arithmetic.
This multi-stream implementation is estimated to compute a single modular
multiplication in around 168 cycles [10], this does not include any overhead for
High-Performance Modular Multiplication on the Cell Processor 19
Table 2. Performance results of Montgomery multiplication or modular multiplication
modulo the special prime pi. The latter uses a separate multiplication (schoolbook (S)
or Karatsuba (K)) and a fast reduction phase. The benchmarks are performed on a
single SPE on a Cell in the PlayStation 3 game console. The stated number of cycles are
for a single modular multiplication (when processing the reported number of streams
in parallel) and the optimistic estimates are from the formulas from Section 5 and do
not include the special reduction cost.
From
Bitsize of
Method Streams
Performance Estimate
the modulus (cycles) (cycles)
This article 192 p192 (K) 8 105
This article 192 p192 (S) 8 126 68
This article 192 Montgomery 8 176 151
Bernstein et al. [3] 195 Montgomery 6 189
This article 224 p224 (K) 8 139
This article 224 p224 (S) 8 143 93
This article 224 Montgomery 4 234 204
Costigan and
255 p255 (S) 4 168
1
Schwabe [10]
This article 255 p255 (K) 8 175
This article 255 p255 (S) 8 182 122
This article 256 p256 (S) 8 192 122
This article 256 p256 (K) 4 193
This article 256 Montgomery 4 297 265
MPM unrolled [16] 256 Montgomery 1 877
MPM [16] 256 Montgomery 1 1188
This article 384 p384 (K) 4 389
This article 384 p384 (S) 4 391 279
This article 384 Montgomery 4 665 590
MPM unrolled [16] 384 Montgomery 1 1610
MPM [16] 384 Montgomery 1 2092
This article 521 p521 (S) 4 622 500
This article 521 p521 (K) 4 723
This article 512 Montgomery 4 1393 1042
MPM unrolled [16] 512 Montgomery 1 2700
MPM [16] 512 Montgomery 1 3275
saving and storing the in- and output registers to and from the local store,
function call overhead and overhead due to benchmarking. In comparison, our
implementation requires 175 cycles for a single modular multiplication using a
1 This is the required number of cycles for an in-register implementation, no load-
ing of input and storing of output from the local store is performed, and excludes
benchmark and function call overhead.
20 J.W. Bos
diﬀerent approach for the special reduction (see Section 5). This includes loading
and storing the in- and output, function call and benchmarking overhead and
additional latencies because not all code can be scheduled perfectly (especially
at the beginning and end of the function where stalls occur). Comparing the
performance of the two diﬀerent approaches for the reduction step is diﬃcult
since the reported performance results of two versions are in diﬀerent settings;
ours is a stand-alone multiplication function while the implementation from [10]
is an inline version working on registers only. In [10] it is estimated that the
time to load and store the in- and output requires 56 cycles in the setting of a
single modular multiplication. When considering this cost our approach using
the redundant representation looks preferable (since 175 < 168+ 56), especially
since we did not use any ﬁne-tuned assembly code to achieve these results.
Improved multi-stream modular multiplication computations results, com-
pared to [4], are given by Bernstein et al. in [3]. Here, not only results for GPUs
are reported but also for the Cell architecture as used in the PlayStation 3. In
this setting Montgomery multiplication is implemented and optimized for one bit
size: a 195-bit generic modulus. A radix-213 system is used to represent 195-bit
integers using 15 limbs, this has the advantage of accumulating multiple carries
before an overﬂow occurs (on the SPE architecture) compared to a radix-216
system but requires more limbs to represent the integers. When quadratically
scaling our 192-bit performance result, in a similar fashion as done in [3], this
leads to an estimate of 176 · (195192 )2 = 182 cycles; this is slightly faster compared
to the 189 required cycles reported in [3].
Discussion. The performance data from Table 2 show that the modular mul-
tiplication using the special primes are in almost all cases, with the exception
of p256 and p521, roughly 1.7 times faster compared to the Montgomery multi-
plication implementations targeting the same bit-lengths. Our results show that
p256 is 1.55 times faster than 256-bit Montgomery multiplication while p521 is
2.2 times faster compared to 512-bit Montgomery multiplication. This can be
partially explained by the relatively complicated and easy structure of p192 and
p521 respectively.
For p192 the version using Karatsuba multiplication is signiﬁcantly (20 per-
cent) faster compared to the version using schoolbook multiplication. For p224,
p255, p256 and p384 the performance is roughly the same while for p521 school-
book multiplication is 16 percent faster. These diﬀerences can be explained due
to extra load and store operations from and to the local store. For the smaller
bitsizes almost all operations can be performed, after the initial loading from
the inputs, on registers. For the larger values the available 128 registers are not
suﬃcient and extra load and store instructions, leading to more instructions and
possibly extra stalls, are required. This also explains why processing four streams
instead of eight gives a higher performance for p384 and p521.
The number of cycles required for the Montgomery multiplication is 12 to 17
percent higher compared to the estimations for all special primes except p521.
This overhead is mainly caused by extra load and stores and due to the fact
High-Performance Modular Multiplication on the Cell Processor 21
that the estimates are too optimistic (not every cycle a pair of instructions can
be dispatched due to instruction dependencies). For the special prime p521 more
than 33 percent of the estimated number of cycles is needed. After compiling our
code to assembly inspection shows that the signiﬁcant overhead is as expected
due to the extra loads and stores. Note that loading the two input values in
registers (after conversion to radix-216) requires 66 registers which is already
more than half of the available register space.
7 Conclusions
In this paper we presented techniques to eﬃciently implement modular mul-
tiplication algorithms to SIMD architectures (such as the Cell or GPUs). We
considered Montgomery multiplication and various special reduction routines
which are of interest for elliptic curve cryptography. The modular multiplica-
tion implementations, which use these faster reduction schemes, are at least 1.5
times faster compared to general purpose Montgomery multiplication for the
same bitsize. The performance results of our multi-stream modular multiplica-
tion implementations for the synergistic processing elements of the Cell broad-
band engine architecture set new performance records for moduli of bit-length in
the range [192, 521] on this platform. These high-performing modular multipli-
cation, generic or special, implementations can be used to speed up public-key
cryptography; e.g. in batch elliptic curve decryption.
References
1. Bellare, M., Rogaway, P.: Minimizing the use of random oracles in authenticated
encryption schemes. In: Han, Y., Quing, S. (eds.) ICICS 1997. LNCS, vol. 1334,
pp. 1–16. Springer, Heidelberg (1997)
2. Bernstein, D.J.: Curve25519: New Diﬃe-Hellman speed records. In: Yung, M.,
Dodis, Y., Kiayias, A., Malkin, T.G. (eds.) PKC 2006. LNCS, vol. 3958,
pp. 207–228. Springer, Heidelberg (2006)
3. Bernstein, D.-J., Chen, H.-C., Chen, M.-S., Cheng, C.-M., Hsiao, C.-H., Lange, T.,
Lin, Z.-C., Yang, B.-Y.: The Billion-Mulmod-Per-Second PC. In: SHARCS 2009,
pp. 131–144 (2009)
4. Bernstein, D.J., Chen, T.-R., Cheng, C.-M., Lange, T., Yang, B.-Y.: ECM
on graphics cards. In: Joux, A. (ed.) EUROCRYPT 2009. LNCS, vol. 5479,
pp. 483–501. Springer, Heidelberg (2010)
5. Bernstein, D.J., Lange, T.: Analysis and optimization of elliptic-curve single-scalar
multiplication. In: Finite Fields and Applications. Contemporary Mathematics Se-
ries, vol. 461, pp. 1–19 (2008)
6. Bos, J.W., Kaihara, M.E., Kleinjung, T., Lenstra, A.K., Montgomery, P.L.: On
the Security of 1024-bit RSA and 160-bit Elliptic Curve Cryptography. Cryptology
ePrint Archive, Report 2009/389 (2009), http://eprint.iacr.org/
7. Bos, J.W., Kaihara, M.E., Montgomery, P.L.: Pollard rho on the PlayStation. In:
SHARCS 2009, vol. 3, pp. 35–50 (2009)
22 J.W. Bos
8. Brown, M., Hankerson, D., Lo´pez, J., Menezes, A.: Software implementation of the
NIST elliptic curves over prime ﬁelds. In: Naccache, D. (ed.) CT-RSA 2001. LNCS,
vol. 2020, pp. 250–265. Springer, Heidelberg (2001)
9. Certicom Research: Standards for Eﬃcient Cryptography 1: Elliptic Curve Cryp-
tography. Standard SEC1, Certicom (2000)
10. Costigan, N., Schwabe, P.: Fast elliptic-curve cryptography on the Cell broadband
engine. In: Preneel, B. (ed.) AFRICACRYPT 2009. LNCS, vol. 5580, pp. 368–385.
Springer, Heidelberg (2009)
11. Damg˚ard, I.: Towards practical public key systems secure against chosen ciphertext
attacks. In: Feigenbaum, J. (ed.) CRYPTO 1991. LNCS, vol. 576, pp. 445–456.
Springer, Heidelberg (1992)
12. El Gamal, T.: A public key cryptosystem and a signature scheme based on discrete
logarithms. In: Blakely, G.R., Chaum, D. (eds.) CRYPTO 1984. LNCS, vol. 196,
pp. 10–18. Springer, Heidelberg (1985)
13. Fujisaki, E., Okamoto, T.: Secure integration of asymmetric and symmetric encryp-
tion schemes. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 537–554.
Springer, Heidelberg (1999)
14. Granlund, T.: GMP small operands optimization. In: SPEED 2007 (2007)
15. Hofstee, H.P.: Power eﬃcient processor architecture and the Cell processor. In:
HPCA 2005, pp. 258–262 (2005)
16. IBM: Multi-precision math library, Example Library API Reference,
https://www.ibm.com/developerworks/power/cell/documents.html
17. ISO/IEC 18033-2: Information technology – Security techniques – Encryption al-
gorithms – Part 2: Asymmetric ciphers (2006)
18. Karatsuba, A., Ofman, Y.: Multiplication of many-digital numbers by automatic
computers. In: Proceedings of the USSR Academy of Science, vol. 145, pp. 293–294
(1962)
19. Kiltz, E., Pietrzak, K., Stam, M., Yung, M.: A new randomness extraction
paradigm for hybrid encryption. In: Joux, A. (ed.) EUROCRYPT 2009. LNCS,
vol. 5479, pp. 590–609. Springer, Heidelberg (2010)
20. Koblitz, N.: Elliptic curve cryptosystems. Mathematics of Computation 48,
203–209 (1987)
21. Lenstra, A.K., Lenstra Jr., H.W.: The Development of the Number Field Sieve.
Lecture Notes in Mathematics, vol. 1554. Springer, Heidelberg (1993)
22. Lenstra, A.K., Verheul, E.R.: Selecting cryptographic key sizes. Journal of Cryp-
tology 14(4), 255–293 (2001)
23. Lenstra Jr., H.W.: Factoring integers with elliptic curves. Annals of Mathemat-
ics 126, 649–673 (1987)
24. Miller, V.S.: Use of elliptic curves in cryptography. In: Williams, H.C. (ed.)
CRYPTO 1985. LNCS, vol. 218, pp. 417–426. Springer, Heidelberg (1986)
25. Montgomery, P.L.: Modular multiplication without trial division. Mathematics of
Computation 44(170), 519–521 (1985)
26. Montgomery, P.L.: Speeding the Pollard and elliptic curve methods of factorization.
Mathematics of Computation 48, 243–264 (1987)
27. National Security Agency: Fact sheet NSA Suite B Cryptography (2009),
http://www.nsa.gov/ia/programs/suiteb_cryptography/index.shtml
28. Pollard, J.M.: Factoring with cubic integers. In: [21], pp. 4–10
29. Pollard, J.M.: Monte Carlo methods for index computation (mod p). Mathematics
of Computation 32, 918–924 (1978)
30. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures
and public-key cryptosystems. Communications of the ACM 21, 120–126 (1978)
High-Performance Modular Multiplication on the Cell Processor 23
31. Silverman, J.H.: The Arithmetic of Elliptic Curves. In: Gradute Texts in Mathe-
matics. Springer, Heidelberg (1986)
32. Solinas, J.A.: Generalized Mersenne numbers. Technical Report CORR 99-39, Cen-
tre for Applied Cryptographic Research, University of Waterloo (1999)
33. Takahashi, O., Cook, R., Cottier, S., Dhong, S.H., Flachs, B., Hirairi, K.,
Kawasumi, A., Murakami, H., Noro, H., Oh, H., Onish, S., Pille, J.,
Silberman, J.: The circuit design of the synergistic processor element of a Cell
processor. In: ICCAD 2005, pp. 111–117. IEEE Computer Society, Los Alamitos
(2005)
34. U.S. Department of Commerce and National Institute of Standards and Technol-
ogy: Digital Signature Standard (DSS) (2009),
http://csrc.nist.gov/publications/fips/fips186-3/fips_186-3.pdf
A NIST Reduction
Algorithm 5. Fast reduction modulo p192 = 2192 − 264 − 1.
Input: Integer c = (c11, . . . , c1, c0), each ci is a 32-bit word, and 0 ≤ c < p2192.
Output: Integer d ≡ c mod p192.
Deﬁne 192-bit integers:
s1 = (c5, c4, c3, c2, c1, c0), s2 = (0, 0, c7, c6, c7, c6),
s3 = (c9, c8, c9, c8, 0, 0), s4 = (c11, c10, c11, c10, c11, c10);
return (d = s1 + s2 + s3 + s4);
Algorithm 6. Fast reduction modulo p224 = 2224 − 296 + 1.
Input: Integer c = (c13, . . . , c1, c0), each ci is a 32-bit word, and 0 ≤ c < p2224.
Output: Integer d ≡ c mod p224.
Deﬁne 224-bit integers:
s1 = ( c6, c5, c4, c3, c2, c1, c0), s2 = ( c10, c9, c8, c7, 0, 0, 0),
s3 = ( 0, c13, c12, c11, 0, 0, 0), s4 = ( c13, c12, c11, c10, c9, c8, c7)
s5 = ( 0, 0, 0, 0, c13, c12, c11);
return (d = s1 + s2 + s3 − s4 − s5);
24 J.W. Bos
Algorithm 7. Fast reduction modulo p256 = 2256 − 2224 + 2192 + 296 − 1.
Input: Integer c = (c15, . . . , c1, c0), each ci is a 32-bit word, and 0 ≤ c < p2256.
Output: Integer d ≡ c mod p256.
Deﬁne 256-bit integers:
s1 = ( c7 c6, c5, c4, c3, c2, c1, c0), s2 = ( c15 c14, c13, c12, c11, 0, 0, 0),
s3 = ( 0, c15, c14, c13, c12, 0, 0, 0), s4 = ( c15, c14, 0, 0, 0, c10, c9, c8),
s5 = ( c8, c13, c15, c14, c13, c11, c10, c9), s6 = ( c10, c8, 0, 0, 0, c13, c12, c11),
s7 = ( c11, c9, 0, 0, c15, c14, c13, c12), s8 = ( c12, 0, c10, c9, c8, c15, c14, c13),
s9 = ( c13, 0, c11, c10, c9, 0, c15, c14);
return (d = s1 + 2s2 + 2s3 + s4 + s5 − s6 − s7 − s8 − s9);
Algorithm 8. Fast reduction modulo p384 = 2384 − 2128 − 296 + 232 − 1.
Input: Integer c = (c23, . . . , c1, c0), each ci is a 32-bit word, and 0 ≤ c < p2384.
Output: Integer d ≡ c mod p384.
Deﬁne 384-bit integers:
s1 = ( c11, c10, c9, c8, c7, c6, c5, c4, c3, c2, c1, c0),
s2 = ( 0, 0, 0, 0, 0, c23, c22, c21, 0, 0, 0, 0),
s3 = ( c23, c22, c21, c20, c19, c18, c17, c16, c15, c14, c13, c12),
s4 = ( c20, c19, c18, c17, c16, c15, c14, c13, c12, c23, c22, c21),
s5 = ( c19, c18, c17, c16, c15, c14, c13, c12, c20, 0, c23, 0),
s6 = ( 0, 0, 0, 0, c23, c22, c21, c20, 0, 0, 0, 0),
s7 = ( 0, 0, 0, 0, 0, 0, c23, c22, c21, 0, 0, c20),
s8 = ( c22, c21, c20, c19, c18, c17, c16, c15, c14, c13, c12, c23),
s9 = ( 0, 0, 0, 0, 0, 0, 0, c23, c22, c21, c20, 0),
s10 = ( 0, 0, 0, 0, 0, 0, 0, c23, c23, 0, 0, 0);
return (d = s1 + 2s2 + s3 + s4 + s5 + s6 + s7 − s8 − s9 − s10);
Algorithm 9. Fast reduction modulo p521 = 2521 − 1.
Input: Integer c = (c33, . . . , c1, c0), each ci is a 32-bit word, and 0 ≤ c < p2521.
Output: Integer d ≡ c mod p521.
Deﬁne 521-bit integers:
s1 = (c16, . . . , c1, c0), s2 = (c32, . . . , c17, c16);
return (d = s1 mod 2
521 + s2
223
);
