Systolic Montgomery multiplication by Lenstra, Arjen K. & Dixon, B.
Systolic Montgomery multiplication version 19920225
Systolic Montgomery multiplication
B. Dixon, A.K. Lenstra
Abstract. Montgomery multiplication is a division-free and memory-efficient algorithm for modular multipli
cation. We describe a fast systolic version of Montgomery multiplication. Our version is particularly useful to
turn a short sequence of moderately fast processing elements, that operate in SIMD mode, into a single much
faster modular multiplier. We have successfully implemented our method on a 16K processor massively parallel
SIMD computer, which effectively reconfigured the machine into a machine that is k times faster, but only has
16K/k processors, where k is a small integer that depends linearly on the logarithm of the modulus.
1. Montgomery arithmetic
Fast modular multiplication is crucial for an efficient implementation of many number
theoretic algorithms or cryptographic schemes. It is well known that if the modulus remains
unchanged throughout the computation, as is often the case, a considerable speed-up can
be gained by using the so-called Montgomery representation [3]. In this paper we describe
how multiplication of numbers in Montgomery representation can be carried out efficiently
and in parallel on large ‘Single Instruction Multiple Data’ (SIMD) arrays of processors.
In addition to applications to cryptanalysis [1], our method could prove to be useful for
a ‘central facility’ that has to sign public key certificates, or decrypt RSA messages, for
many users simultaneously.
Throughout this paper we fix ii as an odd positive integer that will serve as the
modulus. For an integer x we define the residue of x modulo n, denoted x mod ii, as
the smallest non-negative integer that is congruent to x modulo n. We will measure the
run-time of the algorithms to be presented in number of elementary multiplications, where
one elementary multiplication computes the 2b-bit product of two b-bit integers, for some
positive integer b that depends on the type of machine/processor we are using. We will
assume that b 2. The case b = 1 is trivial, and can be found in [3]. Let r be the smallest
integer with 2”’ > ii, and let R = 2’. The Montgomery representation § of an integer x
is the integer (x. R) mod n.
Clearly, if s = (x + y) mod n, then is either + or § + — n, so that addition
(or subtraction) of numbers in Montgomery representation is not different from ordinary
1
modular addition (or subtraction). Multiplication, however, becomes simpler than ordi
nary multiplication modulo n. If z = (x y) mod n, then
= ( R—’) mod n. This
can be computed efficiently as follows.
(1.1) Montgomery multiplication. First, compute the ordinary product u =
Let = with 0< u < 2b for i = 0,1,...,2r, and let d be such that dn
—1 mod 2b and 0 < d < 2’; this d is well-defined because n is odd. Next, successively
replace u by u + 2bj . ((ui . d) mod 2b) n, for i = 0,1,. . . , r — 1, where the u for i > 0
are the radix 2b digits of the u that was computed in the previous iteration. Notice
that after iteration i the new u is zero, and that the new U is congruent to the old u
modulo n. Consequently, the resulting u0 through are all zero. Replace u by u/R,
simply by shifting u over b . r positions to the right. We now have that either = u
or = U — n; this follows from the fact that the original ‘u is < (n — 1)2, that at most
2’(2b — 1)n (R — 1)n was added to u in the course of the computation, and
that ((n — 1)2 + (R — 1)n)/R < 2n. This implies that the final result follows after a few
subtractions and comparisons.
This multiplication algorithm can be carried out in 2r + r elementary multiplications,
plus some additions: r2 multiplications for the computation of . , plus 1 multiplication
for (u . d) mod 2b and r multiplications for n ((Ui . d) mod 2’) for each of the iterations.
Here we assume that d is computed once and for all at the initialization of the modular
arithmetic for n. Notice that no divisions are needed because division with remainder by
can be carried out by shifting and masking operations.
(1.2) Remark. Only 2r multiplications per Montgomery multiplication would suf
fice if d• n is precomputed, and ‘u is replaced by U + uj (d. n) for i = 0, 1,. . . , r — 1 in
succession. The disadvantage of this approach is, however, that cannot be derived easily
from the final ‘u, because the final u can be rather large.
(1.3) Remark. There are various other ways to avoid divisions in modular multi
plication. For instance, if a table containing 2bj mod ii for i = r, r + 1, . . . , 2r has been
precomputed, the product of two residues modulo n can be reduced modulo n using r2
elementary multiplications. This is less than the r2 + r elementary multiplications needed
in the reduction part of Montgomery multiplication, and has the (small) advantage that
2
one can work with the ‘ordinary’ representation. In practice, however, it often turns out
to be slower, due to the frequent memory fetches to retrieve the elements of the table;
another disadvantage that might be serious for some implementations is the fact that the
method requires storage for the table.
In the next section we discuss a version of Montgomery multiplication that uses r + 1
processors, and that takes time 2r + 1, plus some additions.
2. Systolic Montgomery arithmetic
Let F0, Fi,
..., Fr be a processor array, with r as above, that operates in ‘SIMD mode,’
i.e., the P carry out the same instruction at the same time, but each P has its own data,
and not every P has to participate. The instruction stream for the processor array is
provided by a master processor M, which might simultaneously provide several different
processor arrays with (identical) instructions. We assume that P can send data to P+’
(for all i < r simultaneously) or to P_1 (for all i > 0 simultaneously), and that both
Po and Pr can broadcast data to all other P. Furthermore, we assume that each P
has a special flag register f that can be set either to true or false, such that it takes
M a small constant amount of time (independent of r and independent of the number of
processor arrays governed by M) to decide if at least one of the f’s is set to true. Finally,
we assume that each P can carry out elementary multiplications, for some appropriately
chosen b > 1, as well as additions, subtractions, and shifting and masking operations on
local 2b-bit integer variables. These last assumptions are meant to allow division with
remainder by 2b without actually dividing.
(2.1) Remark. This set-up is not unrealistic. For instance, a 128 x 128 processor
MasPar computer [2], can be thought to exist of 128• [128/(r + 1)] disjoint processor arrays
that satisfy the above conditions, for any r < 128 and b < 32. Another place where
processor arrays occur is in VLSI, cf. [4: Section 5].
In this section we describe how Montgomery arithmetic can be performed on numbers
represented by the processor array Fo, Fi, . .
., Fr, where the processor array represents a
non-negative integer a < 2b(r+1) if each P contains a non-negative b-bit integer a such
that a
=
a2’. The a’s are said to represent a. Let x and y be two integers, and let
3
and be their Montgomery representations.
(2.2) Systolic Montgomery addition. Let i and th be non-negative b-bit integers
on Pi, for i = 0, 1,. . . , r, such that the j’s represent and the th’s represent . Notice
that both i = 0 and r = 0. We describe how to compute non-negative b-bit integers
on P, for i = 0, 1,.. . , r, that represent , where s = x + y. We assume that P has local
variables c, e, rag, t, and uj, where the ni’s represent the modulus n. These variables can
contain (b + 2)-bit integers.
Rough description. The algorithm first computes t = + (steps 1 through 5), next
u = t — n (steps 6 through 11), and finally sets . = t if u < 0 or . = u if u> 0 (Step 12).
The algorithm. Perform the steps below sequentially on all P simultaneously, unless
indicated otherwise.
(1)
(2) (carry computation) Set c = 0 and e = tj/2b;
(3) (carry propagation) Set c = e_1, on all P with i > 0 simultaneously;
(4) (carry addition, set flag) Replace t by (t mod 2b) + c; Set the special flag register f
to true if t 2b and to false otherwise;
(5) (repeat carry computation?) Let the master processor M check if there is an f that
is set to true; go back to Step 2 if that is the case, proceed to Step 6 if not;
(6) (initialize flag, u = t — n) Set f to false; Set u = t —
(7) (carry computation) Set c = 0, and set e = 1 if ‘u <0 and e = 0 otherwise;
(8) (carry propagation) Set c = e_1, on all P with i > 0 simultaneously;
(9) (carry addition) Replace u by u + (e . 2b) — c;
(10) (set flag) Set f to true if u < 0 and to false otherwise, on all P with i < r simulta
neously;
(11) (repeat carry computation?) Let M check if there is an f that is set to true; go back
to Step 7 if that is the case, proceed to Step 12 if not;
(12) (select between t and u) Set c = Ur (broadcast from Pr); If c < 0 set § = t, and set
= u otherwise.
In steps 3 and 8 we use our assumption that processors can send data to neighboring
processors; in Step 12 the broadcast feature is used. It is easy to find examples where
4
steps 2 through 5, or steps 7 through 11, have to be repeated r times. A logr depth carry
propagation tree would give a better worst case performance (if the processor array would
allow implementation of such a structure). Our variant, however, gives a much better
average performance, even if M governs several thousand processor arrays simultaneously
(in which case the ‘speed’ is determined by the processor array that needs most repeats).
Notice that we subtract n from t = without first checking if the subtraction is actually
needed; this is because it is as easy to compute u = t — n as it is to test whether t > n,
and given t and ‘u it is easy to decide which of the two is the final answer. This is faster
than doing a comparison to see if the subtraction has to be carried out, in particular if M
governs more processor arrays. Clearly, this systolic addition can also be used for ordinary
(i.e., non-Montgomery) addition of residues modulo n.
(2.3) Systolic Montgomery subtraction. Under the same assumptions as in (2.2),
we describe how to compute non-negative b-bit integers h on P, for i = 0, 1,.. . ,r, that
represent h, where h = x
—
y.
Rough description. The algorithm first computes u = — (steps 1 through 6), next sets
Ii = ii and terminates if u 0 (Step 7), and otherwise it computes t = tt + n (steps 7a
through 7e) and sets Ii = t (Step 7f).
The algorithm. Perform the steps below sequentially on all P simultaneously, unless
indicated otherwise.
(1) (initialize flag, u
— ) Set f to false; Set u = — th;
(2) (carry computation) Set c = 0, and set e = 1 if u <0 and e = 0 otherwise;
(3) (carry propagation) Set c = ej_1, on all P with i > 0 simultaneously;
(4) (carry addition) Replace by u + (e . 2b) — c;
(5) (set flag) Set f to true if u < 0 and to false otherwise, on all P with i < r simulta
neously;
(6) (repeat carry computation?) Let M check if there is an f that is set to true; go back
to Step 2 if that is the case, proceed to Step 7 if not;
(7) (check sign U) Let M check if u = 0; if that is the case, set h2 = u and terminate,
otherwise (i.e., Ur < 0), perform steps 7a through 7f.
(7a) (t = U + n) Set t = u + n;
5
(7b) (carry computation) Set c 0 and e = t/2’;
(7c) (carry propagation) Set c2 = e_1, on all P with i > 0 simultaneously;
(7d) (carry addition, set flag) Replace t by (t mod 2”) + c; Set f to true if t2 2b and
to false otherwise;
(7e) (repeat carry computation?) Let M check if there is an f that is set to true; go back
to Step 7b if that is the case, proceed to Step 7f if not;
(7f) (return result) Set h1 = t.
If M governs more processor arrays, it might be better to perform steps 7a through 7f
irrespective of the value of and to choose between u and t at the very end. With only
one processor array, however, the version presented above is faster.
(2.4) Systolic Montgomery multiplication. Given i and th representing and
we describe how to compute non-negative b-bit integers j on P., for i = 0, 1,. . . , r, that
represent , where z = (x
.
y) mod n. We assume that P has local variables c, e, m, n,
t, u, and w, where the ni’s represent the modulus n and the mi’s represent (d. n)/21’,
where d as in (1.1) is assumed to exist on all P2’s. The variable w can contain 2b-bit
integers, for the others b + 2 bits suffice.
Rough description. The algorithm consists of r iterations. During the jth iteration, with
o < j < r, it computes w
=
(steps 2 through 4) and replaces the accumulator u
by u + w (steps 5 through 7), where u is initially set to 0 (step 1). Not all carries are
propagated in this addition, but u0 is normalized (steps 6 and 7). Next u’s last b bits
are made zero by adding an appropriate multiple of n to u. For this purpose, u0 is first
broadcast to all processors (Step 8), after which u is shifted to the right (in Step 9, where
is also shifted to the right, to facilitate the broadcast of in Step 2), and a multiple of
n is added to the shifted u. In the first r — 1 iterations this is done in steps 8 through 18
by adding w = ‘old uo’.((d.n)/2”) to the already shifted u (steps 12 through 14, where the
carry of the part of u that was shifted out is taken care of by c0 on Fo); after this addition
all carries in u are propagated (steps 15 through 17). In the last iteration a different w,
namely ((‘old uo’.d) mod 2”) . (n/2’), which takes one more elementary multiplication to
compute, is added (steps 21 through 23); for the rest the last iteration (steps 19 through
27) is identical to the earlier ones.
6
It follows that, in the notation of (1.1), the algorithm works by first replacing u
by u + 2bj u (d. n) for i = 0, 1, . . . , r — 2 in succession, after which u is replaced by
u+2(T_l). ((ur_i .d) mod 2b) •n, where the computation of• is merged with the divisions
by 2b modulo n. So, the first r — 1 iterations are identical to those in Remark (1.2), the
last iteration is the same as the last iteration of the ordinary Montgomery multiplication
(1.1). This implies that at most
(:2bi( — 1)2n) +
2b(r—1) (2b
— 1)n
is added to u in the course of the computation. Since the original n is bounded by (n — 1)2,
the final u is bounded by
((n — 1)2 + (2’ —1)(2b(r_1) — 1)n + Rn — 2l(T_i)n)/R < 3n.
The final result is therefore u, u — n, or u — 2n. In steps 28 through 35 the correct result
is chosen.
The algorithm. Perform the steps below sequentially on all P simultaneously, unless
indicated otherwise.
(1) (initialize counter j, initialize u) Set j 0 (on M), and u = 0 (on the P);
(2) (distribute j) Set t = o (broadcast from Po);
(3) (w
=
, w-carry computation) Set w = t j; Set c = 0 and e = wj/2b;
(4) (w-carry propagation) Set c = e1_, on all P with i > 0 simultaneously;
(5) (u = n + i, single carry computation) Replace u by n + c + (w mod 2b); Set
c = 0 and e = uj/2b;
(6) (single carry propagation) Set c = e_1, on all P with i> 0 simultaneously;
(7) (single carry addition) Replace u by (u mod 2b) + c;
(8) (distribute u0) Set t = n0 (broadcast from Fo);
(9) ( =/2b, u = u/2b) Set th = , and u u1, on all P2 with i <r simultaneously,
and set r = 0 on P only;
(10) (handle last iteration separately) If j = r — 1, go to Step 19; otherwise proceed to
Step 11;
7
(11) (check if u has to be corrected) Let M check if to = 0; go to Step 15 if that is the
case, proceed to Step 12 if not;
(12)(w = to.((d.n)/2”), w-carry computation) Set w = t•m; Set c1 = 0 and e2 = w/2”;
(13) (w-carry propagation) Set c = e_1, on all P with i > 0 simultaneously, and set
c0 = 1 on Po only;
(14) (correct u) Replace u by u + c2 + (w mod 2”);
(15) (single carry computation) Set c = 0 and e = uj/2b;
(16) (single carry propagation) Set c = e_1, on all P with i > 0 simultaneously;
(17) (single carry addition) Replace u by (uj mod 2”) + c;
(18) (increment counter) Replace j by j + 1 and go back to Step 2;
(19) (last iteration, check if u has to be corrected) Let M check if to = 0; go to Step 24 if
that is the case, proceed to Step 20 if not;
(20) (e = n/2”) Set e
= n+i on all P with i < r simultaneously, and set er = 0 on Pr
only;
(21) (w = ((to d) mod 2’j (n/2”), w-carry computation) Set w = ((ti d) mod 2”) e; Set
= 0 and e =
(22) (w-carry propagation) Set c = e_1, on all P with i > 0 simultaneously, and set
c0 = 1 on Po only;
(23) (correct u) Replace u by u + c + (w mod 2”);
(24) (carry computation) Set c = 0 and e =
(25) (carry propagation) Set c = e_1, on all P with i > 0 simultaneously;
(26) (carry addition, set flag) Replace u by (u mod 2”) + c; Set f to true if u 2’ and
to false otherwise;
(27) (repeat carry computation?) Let M check if there is an f that is set to true; go back
to Step 24 if that is the case, proceed to Step 28 if not;
(28) (set correction counter) Set j = 0;
(29) (initialize flag, t = u — n) Set f to false; Set t = u — n;
(30) (carry computation) Set c = 0, and set e = 1 if t <0 and e = 0 otherwise;
(31) (carry propagation) Set c = e_1, on all P with i > 0 simultaneously;
(32) (carry addition) Replace t by t + (e 2b) — c;
8
(33) (set flag) Set f to true if t2 < 0 and to false otherwise, on all P2 with i < r simulta
neously;
(34) (repeat carry computation?) Let M check if there is an f2 that is set to true; go back
to Step 30 if that is the case, proceed to Step 35 if not;
(35) (return result) Let M check if tr < 0; if that is the case set ij = u and terminate;
otherwise, if j = 1 set ij = t2 and terminate; otherwise, set j = 1, set u2 = t2 and go
back to Step 29.
There are two elementary multiplications per P2 in the first r — 1 iterations (in steps
3 and 12), and three in the last iteration (one in Step 3 and two in Step 21). This implies
that the algorithm takes time 2r + 1, plus some additions. In the computation of t2 d
in Step 21, the value of t2 is identical on all P; computing this product on only one
processor, and broadcasting the result to the others, does not save any time in our SIMD
architecture. Notice that the full 2b-bit product t2 d is not needed in Step 21, but that
its least significant b bits suffice. This might make this single extra multiplication slightly
faster.
A systolic version of (1.1) would require r iterations of 3 elementary multiplications
each, for a total time of 3r. This is considerably slower than the approach taken in (2.4),
but has the advantage that at most one subtraction at the end suffices. The second
subtraction in (2.4), however, is hardly ever needed.
We have applied the Montgomery arithmetic from (2.2), (2.3), and (2.4) in our mas
sively parallel implementation of the elliptic curve factoring algorithm on a 16K processor
MasPar computer; on this machine all relevant values in (2.4) can be kept in registers,
which makes the Montgomery approach faster than the one mentioned in Remark (1.3).
In the elliptic curve method the odd composite number to be factored is used as the mod
ulus throughout the algorithm, which, combined with Remark (2.1), makes our systolic
algorithms applicable. We used b = 30, so that for n in the range [2°°, 2330), for instance,
we got r = 11 and therefore 128• [128/(11 + 1)] = 1280 processor arrays. This implies that
we can run 1280 elliptic curve trials in parallel for such n. For an explanation why this
is better than running 16K much slower elliptic curve trials in parallel and for the other
details we refer to [1].
9
References
1. B. Dixon, A. K. Lenstra, Massively parallel elliptic curve factoring, in preparation.
2. MasPar MP-1 principles of operation, MasPar Computer Corporation, Sunnyvale, CA,
1989.
3. P. L. Montgomery, Modular multiplication without trial division, Math. Comp 44
(1985), 519—521.
4. J. D. Uliman, Computational aspects of VLSI, Computer Science Press, Rockville,
1984.
Princeton University, Department of Computer Science, 35 Olden Street, Princeton, NJ
08544-2087, U. S. A.
Beilcore, 445 South Street, Morristown, NJ 07960, U. S. A.
10
