Fast massively parallel modular arithmetic by Disxon, B. & Lenstra, Arjen K.
Fast Massively Parallel Modular Arithmetic
B. Dixon*j A. K. Lenstra
July 10, 1996
Abstract
High-performance implementations on massively parallel architectures of many
number theoretic algorithms and cryptographic schemes require efficient methods to
perform modular arithmetic. This paper shows how this can be achieved on a SIMD
array of processors. The goal of our algorithms is to trade parallelism for time, which
we are able to do in an optimal fashion. For the purposes of performing modular arith
metic, we can effectively reconfigure a 16K processor massively parallel SIMD computer
into a machine having 16K/k processors that are k times faster. This is particularly
useful for our high-performance application, where a high degree of slow parallelism is
much less efficient than a lower degree of faster parallelism.
1 Introduction
Fast modular multiplication is crucial for an efficient implementation of many number theo
retic algorithms or cryptographic schemes. In particular, the elliptic curve integer factoring
algorithm [3] spends most of its time multiplying numbers mod n, where n is the number
that one attempts to factor [1]. In our first implementation of the elliptic curve algorithm
on the 16K processor MasPar1 computer we ran 16K completely independent factoring at
tempts, on the same n. From a high level, the algorithm was simple since the attempts could
*Department of Computer Science, Princeton University, Princeton, NJ 08544.
Research partially supported by a National Science Foundation Graduate Fellowship and by DIMACS
(Center for Discrete Mathematics and Theoretical Computer Science), a National Science Foundation Science
and Technology Center, Grant No. NSF-STC88-09648.
Bellcore, Room 2Q-334, 445 South Street, Morristown, NJ 07960, U. S. A.
‘It is the policy of Belicore to avoid any statements of comparative analysis or evaluation of products or
vendors. Any mention of products or vendors in this presentation or accompanying printed materials is done
1
easily work in a data parallel manner. The process was organized in a such way that the
16K independent factoring attempts did not overlap with each other, so that we achieved a
16K-fold speed-up compared to a single attempt running on one processor.
This is as good as one can hope to achieve, except that the elliptic curve method has the
peculiarity that its optimal degree of parallelism depends on and grows with the size of the
factor p one tries to find. Similarly, the optimal effort to spend per trial grows with the size of
p. An optimal application of the elliptic curve method with 16K attempts in parallel would
aim for much larger p than is currently believed to be in the realistic range, and the run-
time per attempt would be exceedingly long, especially using the slow MasPar processors.
So, although our first implementation might turn out to be very useful on later generations
of massively parallel machines that have much faster processors, a better solution for the
current technology would be to have fewer but faster attempts. This requires distributing
modular arithmetic among multiple processors, which motivates the research presented here.
In this paper we describe how modular multiplication can be carried out efficiently and in
parallel on large ‘Single Instruction Multiple Data’ (SIMD) arrays of processors. Using the
algorithm presented here, a modular multiplication using k processors requires time about
1/k times the amount of time to perform a modular multiplication using one processor,
which is the optimal speed-up one can achieve using k processors. This should be compared
to the straightforward SIMD implementation, which would require 1.5/k times the time for
the one processor version, due to an inherent unparallelizable inefficiency in the algorithm.
We have been able to overcome this efficiency, thus gaining a speed-up of a factor 1.5.
In addition to applications to cryptanalysis, our method can be used for a ‘central facility’
that has to sign or verify public key certificates, or decrypt RSA messages [6], for many
users simultaneously. There too, as in the elliptic curve implementation, a low degree of fast
parallelism is more desirable than a high degree of slow parallelism. Such applications are
not uncommon in the telecommunication industry. In the following sections, we describe the
multiplication algorithms. Section 2 describes the sequential version, and section 3 begins
with the description of SIMD addition and subtraction algorithms and finishes with the
SIMD multiplication algorithm.
where necessary for the sake of scientific accuracy and precision, or to provide an example of a technology
for illustrative purposes, and should not be construed as either a positive or negative commentary on that
product or vendor. Neither the inclusion of a product or a vendor in this presentation or accompanying
printed materials, nor the omission of a product or a vendor, should be interpreted as indicating a position
or opinion of that product or vendor on the part of the presenter or Bellcore.
2
2 Modular arithmetic
It is well known that if the modulus remains unchanged throughout the computation, as
is always the case in our applications, a considerable speed-up can be gained by using the
so-called Montgomery representation [5]. This speed-up comes from the fact that divisions
are replaced by shifts to return a result in the required range. The remainder of the paper
discusses numbers in Montgomery form.
Throughout this paper we fix ii as an odd positive integer that will serve as the modulus.
For an integer x we define the residue of x modulo n, denoted x mod n, as the smallest
non-negative integer that is congruent to x modulo n. We will measure the run-time of the
algorithms to be presented in number of elementary multiplications, where one elementary
multiplication computes the 2b-bit product of two b-bit integers, for some positive integer
b that depends on the word size of the machine/processor we are using. We will assume
that b 2. The case b = 1 is trivial, and can be found in [5]. We are concerned with
arithmetic on numbers which are much too large to be stored in one computer word. Thus
we represent each number x mod ri as a sequence XrlXr_2.. .x210 of r non—negative b bit
words, which means that x x2 mod n and 0 < x < 2b for 0 < i < r, where r is
the smallest integer with 2l’ > n. The Montgomery representation of an integer x is the
integer (x. R) mod n, where R = 2. Notice that because n is fixed r and R are also fixed.
Clearly, if s = (x + y) mod n, then is either + or + — n, so that addition
(or subtraction) of numbers in Montgomery representation is not different from ordinary
modular addition (or subtraction). Multiplication, however, becomes simpler than ordinary
multiplication modulo n. If z = (x
.
y) mod n, then .
.
y W mod n, and therefore
= (x y R) mod n
= ( . R—’) mod n. It follows that the Montgomery product of two
numbers in Montgomery representation can be computed using one ordinary multiplication
of and followed by a division by R modulo n. To be able to easily divide by R, the
Montgomery multiplication algorithm adds a multiple of n to . which is computed so that
the low order br bits of the result are zero. This allows division by R as a simple shift. The
result of the shift has only br bits and thus can easily be reduced to the required residue
mod n. The details of the computation follow.
2.1 Montgomery multiplication. Let d be such that dn —1 mod 2’ and 0 < d <
this d is well-defined because n is odd. First, compute the ordinary product u = . Let
U = U2r_2U2r_3. . .u21n0 be the radix 2b representation of u. Next, we iterate the following
from i = 0 to r — 1: compute t
=
((ui . d) mod 2b) . ri and note that t0 —nj mod 2b Thus,
3
we add 2bj t to u to “zero” u. We increment i and continue the ioop. Notice that after
iteration i, the b bit words u, u_1, .. ., uo are all zero and the new u is congruent to the old
u modulo n. Thus when the loop terminates, we replace u by u/R by shifting u over br bits
to the right. We now have that either = u or = — n; this follows from the fact that the
original u is (n — 1)2, that at most >j2bi( — 1)n = (R — 1)n was added to n in the
course of the computation, and that ((n — 1)2 + (R — 1)n)/R < 2n. This implies that the
final result follows after a few subtractions and comparisons.
This multiplication algorithm can be carried out in 2r + r elementary multiplications,
plus some additions: r2 multiplications for the computation of . , plus 1 multiplication for
( . d) mod 2’ and r multiplications for ((ui . d) mod 2b) n for each of the r iterations. Here
we assume that d is computed once and for all at the initialization of the modular arithmetic
for n using, for instance, the extended Euclidean algorithm. Notice that no divisions are
needed because division with remainder can be carried out by shifting and masking
operations.
2.2 Reducing the number of multiplications. If we precompute d n, then the
product . (d. n) can be computed using only r multiplications since the low order block
of (d. n) is known to be —1. This means that the total number of multiplications could be
reduced to 2r, if we would replace u by u+2bi.u.(d.n) instead of u+2bi. ((ui . d) mod 2b).n,
for i = 0, 1,. . . , r — 1 in succession. The disadvantage of this approach is, however, that
cannot be derived easily from the final u, because the final u can be as large as n.
2.3 Remark. There are various other ways to avoid divisions in modular multiplication.
For instance, if a table containing 2bj mod n for i = r, r + 1,. . . , 2r has been precomputed,
the product of two residues modulo n can be reduced modulo n using r2 elementary multi
plications. This is less than the r2 + r elementary multiplications needed in the reduction
part of Montgomery multiplication, and has the (small) advantage that one can work with
the ‘ordinary’ representation. In practice, however, it often turns out to be slower, due to
the frequent memory fetches to retrieve the elements of the table; another disadvantage that
might be serious for some implementations is the fact that the method requires storage for
the table.
In the next section we discuss a version of Montgomery multiplication that uses r + 1
processors, and that takes time 2r + 1, plus some additions.
4
3 SIMD Montgomery arithmetic
Let F0, P1,
..., Pr be a processor array, with r as above, that operates in ‘SIMD mode,’ i.e.,
the F, carry out the same instruction at the same time, but each P has its own data, and
not every P has to participate. The instruction stream for the processor array is provided
by a master processor M, which might simultaneously provide several different processor
arrays with (identical) instructions. We assume that P can send data to P+i (for all i < r
simultaneously) or to P_1 (for all i > 0 simultaneously), and that both P0 and P can
broadcast data to all other P. Furthermore, we assume that each P has a special flag
register that can be set either to true or false, such that it takes M a small constant amount
of time (independent of r and independent of the number of processor arrays governed by M)
to decide if at least one of the flags is set to true. Finally, we assume that each P can carry
out elementary multiplications, for some appropriately chosen b > 1, as well as additions,
subtractions, and shifting and masking operations on local 2b-bit integer variables. These
last assumptions are meant to allow division with remainder by 2b without actually dividing.
This set-up is not unrealistic. For instance, a 128 x 128 processor MasPar computer [4],
can be thought to consist of 128 [128/(r+ 1)] disjoint processor arrays that satisfy the above
conditions, for any r < 128 and b < 32.
In this section we describe how Montgomery arithmetic can be performed on numbers
represented by the processor array F0, F1, . .
.,
P, where the processor array represents a
non-negative integer a < 2b(r+1) if each P contains a non-negative b-bit integer a such that
a
= Z aj2t)’. The a’s are said to represent a. Let x and y be two integers, and let § and
be their Montgomery representations.
3.1 SIMD Montgomery addition
Let j and j be non-negative b-bit integers stored at P, for i = 0, 1,. . . , r, such that the
1j’5 represent and the th’s represent . Notice that both ir = 0 and = 0. We describe
how to compute non-negative b-bit integers on P, for i = 0, 1,.. . , r, that represent ,
where s = x + y. We assume that P has local variables c, e, n, t, and u, where the ni’s
represent the modulus n, and that these variables can contain (b + 2)-bit integers.
Rough description: The algorithm first computes t = + (steps 1 through 7), next
= t — n (steps 8 through 14), and finally sets = t if u < 0 or = u if u > 0 (Steps 15
and 16).
The algorithm. Perform the steps below sequentially on all P simultaneously, unless mdi
5
cated otherwise.
tz = +
e = t/2”; c = 0;
while some e = 1 do
for all P with i > 0 do
= e_;
t = t mod 2b + c;
— + /b.
— Ut/ ,
= t — n;
e = 0;
for all P with i <r do
if u <0 then e = 1, u = u + 2b
10 while some e = 1 do
11 forallPwithi>0do
e = 0;
for all P with i < r do
if u < 0 then e =
{The addition}
{Compute carry bit e}
{Carry propagation ioop using special fiag}
{ Send the carry to the next processor}
{Add the carry}
{ Compute the new carry}
{Compute t — n in case t might be > n}
{ Zero the carry bit and compute carry}
{ on all processors except the leftmost}
{Carry propagation ioop using special fiag}
In steps 5 and 12 we use our assumption that processors can send data to neighboring
processors; in Step 15 the broadcast feature is used. It is easy to find examples where the
carry propagation loops (steps 4 through 7, and steps 11 through 14), have to be repeated
r times. A log r depth carry propagation tree would give a better worst case performance
(if the processor array would allow implementation of such a structure). In our experience,
however, our variant gives a much better average performance, because the carry propagation
loops are hardly ever executed more than once. This is the case even if M governs several
thousand processor arrays simultaneously. In that case the speed’ is determined by the
processor array that needs most iterations through steps 3 or 10; notice that the other
processor arrays carry out the same instructions, but that the instructions have no effect on
the values stored.
1
2
3
4
5
6
7
8
9
12 c = e_i; {Send the carry to the next processor}
13 u = u — c; {Subtract the carry}
14 {Zero the carry bit and compute carry}
{ on all processors except the leftmost}
1, U = U + 2
15 C = Ur {Broadcast Ur to all P}
16 ifc, <0 then =t1 else =ij; {Selecttort—n}
6
We subtract n from t = + without first checking if the subtraction is actually needed;
this is because it is as easy to compute u = t — n as it is to test whether t > ii, and given
t and u it is easy to decide which of the two is the final answer. This is faster than doing
a comparison to see if the subtraction has to be carried out, in particular if M governs
more processor arrays. Clearly, this SIMD addition can also be used for ordinary (i.e.,
non-Montgomery) addition of residues modulo ii.
3.2 SIMD Montgomery subtraction
Under the same assumptions as in (3.1), we describe how to compute non-negative b-bit
integers h on P, for i 0, 1,. . . , r, that represent Ii, where h = x
—
y.
Rough description. The algorithm first computes u = (steps 1 through 7) and terminates
if ‘u 0, setting ii = u (Step 8), otherwise it computes t = u + n (steps 9 through 15) and
sets ii = t (Step 16).
The algorithm. Perform the steps below sequentially on all P simultaneously, unless indi
cated otherwise.
1
2 e=0;
u = u —
e 0;
for all P with i < r do
if u <0 then e =
8 ej=ur;
if e = 0 then h = u else
9 t=u+n2;
10 e = tj/2b;
11 while some e = 1 do
12 forallPwithi>0do
13 c = e_;
{The subtraction}
{ Zero the carry bit and compute carry}
{ on all processors except the leftmost}
{ Send the carry to the next processor}
{Add the carry}
{ Zero the carry bit and compute carry}
{ on all processors except the leftmost}
1, u = u + 2b
{Broadcast Ur to all P so that}
{the result can be decided}
{The addition}
{Compute carry bit e}
{Carry propagation ioop using special fiag}
for all P with i <r do
if u < 0 then e = 1, U = u + 2b
while some e = 1 do
for all F, with i > 0 do
= e_1;
3
4
5
6
7
{Carry propagation loop using special flag}
{ Send the carry to the next processor}
7
14 t1 t2 mod 2b + c; {Add the carry}
15 e = tj/2”; {Compute the new carry}
16 = {Store the result}
If M governs more processor arrays, it might be better to perform steps 9 through 15
irrespective of the value of ur, and to choose between u and t2 at the very end. With only one
processor array, however, the version presented above is faster. ‘We refer for the discussion
after Algorithm (3.1) for additional comments.
3.3 SIMD Montgomery multiplication
The straightforward SIMD implementation of Montgomery multiplication as it appears in
(2.1) would involve 3r multiplications per processor using r processors. In each of r iterations,
the product x yj is first computed (for i = 0, 1 , r — 1 in parallel, i.e., r multiplications in
the time of a single one), and then n0 . d mod 2b must be computed (a single multiplication).
The final product in each iteration is computing (uo . d mod 2b) n (again, r multiplications
in the time of a single one) using the result of the previous product. Since the second result
is needed at all processors in the final step, the second product can either be computed on a
single processor and the result broadcast to all other processors or it can be computed at all
processors. Since we are dealing with SIMD algorithms, there is no advantage to computing
the product on a single processor; all other processors would simply be idle during this time.
In any case, this straightforward method clearly leads to a major inefficiency caused by the
inherent sequential nature of the process, which allows us to parallelize only 2r of the 2r H— 1
multiplications per iteration; one multiplication remains unparallelizable and its execution
costs a full multiplication per iteration.
Our goal is to reduce this inefficiency in the SIMD version and to get rid of this single
unparallelizable multiplication per iteration. The observation in (2.2) helps lead to an effi
cient algorithm. By precomputing the product d• n and storing one word of the result at
each processor in the array, the Montgomery multiplication will only take 2 multiplications
per iteration. The drawback to this approach, as mentioned in (2.2), is that the result can
be rather large and cannot be easily reduced to a residue mod n. The fix to this problem
is to use the precomputed value only for the first r — 1 iterations of the algorithm and use
the longer 3 multiplication form for the last iteration. Using this approach the result can be
shown to be less than 3n (cf. our analysis below), so that at most two subtractions suffice
8
to do the final reduction modulo n.
More precisely, in the notation of (2.1), the algorithm works by first replacing zi
by u + 2 u (d n) for i = 0, 1, . . . , r — 2 in succession, after which u is replaced by
u +
. ((.1 d) mod 2b) . n, where the computation of is merged with the divisions
by 2b modulo n. So, during iteration i for i <r — 1, we add a multiple of n to u which is less
than 2(b — 1)2n, and during the last iteration we add a multiple of n which is less than
2b(r_1)(
— 1)n because the last iteration is identical to ordinary Montgomery multiplication
presented in (2.1). This implies that at most
(2bi(
— 1)2n) + — 1)n
is added to u in the course of the computation. Since the original u is bounded by (n — 1)2,
the final u is bounded by
((n — 1)2 + (2b —1)(2b(r_1) — 1)n + Rn — 2b_1)fl)/R < 3n.
The final result is therefore u, u — n, or u — 2n.
Given and representing and , we describe how to compute non-negative b-bit
integers on P, for i 0, 1,. . . , r, that represent , where z = (x y) mod n. We assume
that P has local variables c, e, m, n, t, u, and w, where the ni’s represent the modulus
n and the mi’s represent (d• n)/2b, rhere d as in (2.1) is assumed to exist on all P,’s. The
variable w can contain 2b-bit integers, for the others b + 2 bits suffice.
Rough description. The algorithm consists of r iterations. During the jth iteration, with
o <j <r, it computes w
=
(steps 3 and 4) and replaces the accumulator u by u + w
(steps 5 through 9), where u is initially set to 0 (step 1). Not all carries are propagated
in this addition, but u0 is normalized (steps 8 and 9). Next u’s last b bits are made zero
by adding an appropriate multiple of n to u. For this purpose, u0 is first broadcast to all
processors and stored in t there (Step 10), after which u is shifted to the right (in Step 11,
where is also shifted to the right, to facilitate the broadcast of in Step 3), and a multiple
of n is added to the shifted u. In the first r — 1 iterations this is done in steps 14 through 20
by adding w = t. ((ci. n)/2b) to the already shifted u (steps 15 through 17, where the carry
of the part of u that was shifted out is taken care of by c0 on Fo); after this addition, carries
in u are propagated once (steps 18 through 20). In the last iteration (steps 21 through 30) a
different w, namely ((t . d) mod 2b) (n/2”), which takes one more elementary multiplication
to compute, is added (steps 22 through 26), and in this last iteration, all carries in u are
9
propagated (steps 27 through 30). In steps 3]. through 37 the correct result (u, u — n, or
u — 2n) is chosen.
The algorithm. Perform the steps below sequentially on all P simultaneously, unless indi
cated otherwise.
{ Zero the accumulator and carry bit}
{One iteration for each block of y}
and add the result to n}
{Broadcast the next block of y}
{Do the multiplication and compute the carry}
{ Propagate carries once for w}
{ Add new result to accumulator}
{ Compute the new carry}
{ Propagate carries once for u}
{ Add the carry and normalize u}
correct multiple of n}
{Broadcast the low—order bits of U}
{Shift out the “zeroed” bits of U}
{Move the next block of y to P0}
12 = 0; {Shift a zero into the high bits of U}
13 if j < r — 1 then {If this is not the last iteration}
14 if t0 > 0 then {No correction necessary if u0 = 0}
15 = t m; e = wi/2b; {w = to. ((d. n)/2b) and compute carry}
16 c efori>0,andc=1fori=0only;
{ The above multiplication would have produced a carry into u0}
17 = u + (‘wi mod 2b) + c; {Add the multiple of ri to u}
18 e Ui/2b; {Propagate carries once}
19 c=e_1fori>0,andc0=Ofori=Oonly;
20 = (u mod 2b) + c;
21 else if t0 > 0 then
22 e = n for i <r, and er = 0
23 W = ((ti . d) mod 2b) . e;
24 e = wj/2b;
25 c=e_1fori>0;
1 u=0;c=0;
2 forj =Oto r—1 do
{Multiplication step: Compute
3 t=
4 W t e =
5 forPwithi>0setc=e2_1;
6 U = u + (W mod 2b) + c;
7 ej = Uj/2b;
8 forPwithi>0setc=e1
9 u = (U mod 2b) + c;
{Mod step: Zero U0 by adding the
10
11 forallP2withi<rdo
U =
Yi = Yi+i;
{ Last iteration, check for U0 = 0}
for i = 0 only;
{W = ((t0 . d) mod 2b) . (n/2b) }
{ Propagate carries once}
10
26 = zt + (w mod 2”) + c; {Add the proper multiple of n and the carry}
27 e = uj/2b; {Compute the carry and}
28 while some e = 1 do {propagate completely using special fiag}
29 c=e_ifori>0;
30 u = (U mod 2b) + c; e =
{ Choose the proper value: U, U — n, or u — 2n}
31 t = U — n; {Subtract n and propagate carries}
32 ej=0;iftj<Oandi<rthenej=1,tj=ti+2b;
33 while some e = 1 do
34 c = e_1 for i > 0;
35 t=t—c;
36 e =0; ift <0 andi <rthen e = 1, t =tj+2b;
37 jf tr < 0 then j = u {If t < 0 then U < n is the proper result}
else u = t and repeat the subtraction in lines 31—37;
There are two elementary multiplications per P in the first r — 1 iterations (in steps 4
and 15), and three in the last iteration (one in Step 4 and two in Step 23). This implies
that the algorithm takes time 2r + 1, plus some additions. In the computation of t . d in
Step 23, the value of t is identical on all P; computing this product on only one processor,
and broadcasting the result to the others, does not save any time in our SIMD architecture.
Notice that the full 2b-bit product t . d is not needed in Step 23, but that its least significant
b bits suffice. This might make this single extra multiplication slightly faster.
4 Conclusion
This paper shows how to avoid a direct but inefficient parallel implementation of a well known
algorithm through a reformulation of the algorithm and its invariants. In our particular
case, an efficient solution can be attained by relaxing the invariant for the first part of
the algorithm and returning to a tighter invariant for a final stage in order to produce a
correct result. Although we have only discussed the application of this general technique
to modular multiplication, we suspect that similar ideas may be used to enhance the SIMD
parallelizability of other algorithms.
The modular multiplication algorithm itself has applications to cryptanalyis and parallel
RSA systems as mentioned earlier, and we have applied the Montgomery arithmetic from
11
(3.1), (3.2), and (3.3) in our massively parallel implementation of the elliptic curve factoring
algorithm on a 16K processor MasPar computer. The efficiency of the implementation on
this machine is helped by the fact that all relevant values can be kept in registers, which is
not true for the single processor variants of these operations. The result is that we achieve a
direct speed up; with k processors, our algorithm runs k times faster than the single processor
algorithm.
In the elliptic curve method the odd composite number to be factored is used as the
modulus throughout the algorithm, which makes our SIMD algorithms applicable. We used
b = 30, so that for n in the range [2300, 2330), for instance, we got r = 11 and therefore
128 [128/(11 + 1)] = 1280 processor arrays. This implies that we can run 1280 elliptic
curve trials in parallel for such n. Our elliptic curve implementation has broken factoring
records by being the first to find a 40 digit factor using elliptic curves, as reported in [1].
This implementation is probably the fastest known method for factoring numbers that have
factors of less than 40 digits. For a number that is the product of two roughly equal size
primes, then the parallel implementation of the quadratic sieve given in [2] is efficient for
numbers in the 80—120 digit range.
References
[1] B. Dixon, A. K. Lenstra, Massively parallel elliptic curve factoring, Advances in Cryp
tology, Eurocrypt ‘92,Lecture Notes in Comput. Sci. 658 (1993), 183—193.
[2] B. Dixon, A. K. Lenstra, Factoring integers using SIMD sieves, Advances in Cryptology,
Eurocrypt ‘93,Lecture Notes in Comput. Sci. To appear.
[3] H. W. Lenstra, Jr., Factoring integers with elliptic curves, Ann. of Math. 126 (1987),
649—673.
[4] MasPar MP-1 principles of operation, MasPar Computer Corporation, Sunnyvale, CA,
1989.
[5] P. L. Montgomery, Modular multiplication without trial division, Math. Corrip 44
(1985), 519—521.
[6] R. L. Rivest, A. Shamir, L. Adleman, A method for obtaining digital signitures and
public—key cryptosystems, Communications of the ACM, 21,2 (Feb. 1978).
12
