A very simple multiplier cell is developed for use in a linear, purely systolic array forming a digit-serial multiplier for unsigned or 2's complement operands. Each cell produces two digit-product terms and accumulates these into a previous sum of the same weight, developing the product least significant digit first. With two terms grouped per cell, the ratio of active elements to latches is low, and only $\lceil n/2 \rceil$ cells are needed for a full n by n multiply. A modulo-multiplier is then developed by incorporating a Montgomery type of modulo-reduction. Two such multipliers interconnect to form a purely systolic modulo exponentiator, capable of performing RSA encryption at very high clock frequencies, but with a low gate count and small area. It is also shown how the multiplier, with some simple back-end connections, can compute modular inverses and perform modular division for a power of two as modulus.
1 Introduction
Motivated by the wish for a simple systolic structure to be used in a modulo multiplier for cryptosystems, a serial multiplier is first developed. Usually the digit product terms in multipliers are computed either along rows or along columns of the total array of terms, but here we have chosen to compute the terms along slanted lines, moving across the array. Computing two terms from each column in each cycle, only $\lceil n/2 \rceil$ cells are needed. By gradually right-shifting, a linear array of cells produces the result in 2n cycles, each cell having only nearest-neighbor communication; thus it is suitable for the very large multipliers (500-1000 bits) needed in cryptosystems. The array is developed as a parallel-by-serial multiplier; it can be modified to a serial-by-serial structure, but then the multiplicand digits will have to be delivered at twice the speed of the multiplier digits.
Several other serial multipliers have been developed in the past. The one closest to the design developed here seems to be the classical design by Atrubin [Atr65], which is of the serial-by-serial type and also has a linear, systolic structure, but with a much higher cell complexity.
This work has been supported by the Danish Research Councils, Grant No. 5.21.08.02.
Although the multiplier is developed for unsigned operands and in any radix, it turns out that it is very simple to modify the binary version to operate on 2's complement numbers. Hence it could be useful in standard applications for serial multipliers, e.g. in digital signal processing. However, the intended applications are in number-theoretic computation, primarily for cryptosystems, but other applications are possible and are demonstrated. The intended application of modulo multiplication is supported by adapting the cell to generate and accumulate two additional terms, corresponding to partial products of the modulus to be added (subtracted). By suitably selecting factors of the modulus (quotient digits), the steps of modular reduction are thus interleaved with the multiplication. The result is a linear array modular multiplier much simpler than the 2-dimensional systolic structure proposed in [Wal93].
This paper is organized as follows. In Section 2 the cell and the basic multiplier structure are developed from the dependency graph. After some comments on its operation, it is shown how it can be modified to accept 2's complement operands.
In Section 3 the cells are modified to allow the interleaving of multiplication and modulo reduction. The reduction is based on the idea by Montgomery [Mon85], where quotient digits can be determined from the least significant digits of the partial products. Since the reduction step is a shift-and-add step very similar to the basic steps of a multiplication, it is then simple to extend the cell to combine the two operations. To perform the modular exponentiations needed in cryptosystems, two linear arrays of such cells are then combined into one structure, capable of computing two modulo-multiplications in parallel, sharing a common modulus. By related computations of $yz2^{-n} \bmod m$ and $zz2^{-n} \bmod m$, the exponentiation $x^e \bmod m$ can be performed.
As an illustration of other applications of the basic shift-and-add structure of the multiplier array, in Section 4 two simple algorithms are shown to compute $a^{-1} \bmod 2^k$ and $d a^{-1} \bmod 2^k$, respectively. Both algorithms can then be implemented in the array by some simple modifications to the right-most cell. For the particular case of the modulus being a power of 2, these simple algorithms can substitute for the Extended Euclidean Algorithm which would otherwise be needed in the case of a more general modulus. Very recently Jebelean [Jeb93] has described similar algorithms for these problems, but without specific systolic designs.
Finally, in Section 5 some general comments and conclusions are provided.
2 A Bit-Serial, Purely Systolic Multiplier Structure
Looking for a way in which a purely systolic, linear array of cells could be organized for a bit-serial multiplier, let us initially consider the array of partial product terms of the product AB in base $\beta$, where $A = \sum_{i=0}^{n-1} a_i \beta^i$ and $B = \sum_{i=0}^{n-1} b_i \beta^i$, as illustrated in Figure 1 for the case of n = 5.
In usual linear-time, parallel multipliers, the terms $a_i b_j$ are constructed row by row, and suitably accumulated in a linear array of adders, requiring the broadcast of $b_j$ to n cells. Dadda types of serial multipliers (e.g. [Dad83]) operate columnwise, starting at the least significant end. Only one new digit of the multiplier and of the multiplicand is introduced in each step, but it seems difficult to organize the generation and accumulation of the terms in a purely systolic structure.
However, if in each step terms along a slanted line as indicated in Figure 1 are constructed, it is straightforward to accumulate the terms in an array of cells containing previous terms of the same weight. And again here only one new digit from multiplier and multiplicand is introduced in each step. If one term $a_i b_j$ from each column is computed and accumulated per step, then m steps are necessary to generate all digit product terms. However, 2n steps are needed to generate all result digits serially, so to further decrease the number of cells and steps, it then seems more advantageous to combine a low number of terms with a higher radix, rather than generating three or more terms from each column.
We will thus concentrate on the case of using two terms from each column, and restrict the discussion to radix 2 since modification to a higher radix is straightforward. In Figure 2 a part of an array of partial products is shown, with pairs of terms enclosed in dashed boxes. The terms in boxes connected by the slanted lines are the terms to be computed and accumulated in a particular time step. Using a suitable redundant representation for the accumulation it is possible to start generating and accumulating terms from the most significant end (e.g. for an on-line multiplier), but here we will choose to start at the least significant end. Note that in Figure 2, in each cycle one new multiplier digit plus two new multiplicand digits are needed, and one digit of the result can be produced.
To derive recursive equations for the computation we start by noting that at time step t=0 $S(0, t)\,2^t$, but note that we still have to consider how to handle the problem of carries. It could be handled by modification of the recursion, but we shall see that we can easily incorporate the carries later.
We can then draw the dependency graph as shown in Figure 3 for the case n = 4. Initially all latches are assumed reset. During the first n cycles $b_0, b_1, \ldots, b_{n-1}$ are delivered, and the first n (least significant) bits $s_0, s_1, \ldots, s_{n-1}$ are produced. During the next n cycles for an unsigned multiply, zeroes have to be supplied for $b_n, b_{n+1}, \ldots, b_{2n-1}$, while the remaining results $s_n, s_{n+1}, \ldots, s_{2n-1}$ are output. In the case of a signed 2's complement multiplier, the value of $b_{n-1}$ just has to be supplied as the sign-extension during the last n cycles. A 2's complement signed multiplicand can be accepted by a modification of the leftmost cell, as shown in Figure 5 for the case of n even, n = 2k.
The leftmost carry-latch is to be initialized by the value of the most significant bit of the multiplicand, $a_{2k-1}$. Since $b_i = 0$ for i < 0, the latch effectively remembers the value of $a_{2k-1}$ until the first non-zero $b_i$ arrives, when it is then added in. As the terms added into this position are $a_{2k-1} b_{t-k}$, a Baugh and Wooley scheme is thus realized.
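The unsigned operation described above can be sketched as a small cycle-level simulation. This is only an illustration under stated assumptions, not the cell design of the figures: the function name `serial_multiply` and the latch names are hypothetical, cell i is assumed to hold the bit pair $(a_{2i}, a_{2i+1})$, and at cycle t it is assumed to add $a_{2i} b_{t-i} + a_{2i+1} b_{t-i-1}$ to the sum bit arriving from its left neighbor and to its own retained carry.

```python
def serial_multiply(a, b, n):
    """Simulate an unsigned n-by-n multiply on ceil(n/2) two-term cells.

    Assumed model (hypothetical names): cell i stores a[2i], a[2i+1], one
    latched multiplier bit, one latched sum bit and a local carry (0..2).
    Multiplier bits travel left, sum bits travel right, carries stay put.
    """
    cells = (n + 1) // 2
    a_bits = [(a >> j) & 1 for j in range(2 * cells)]
    b_latch = [0] * cells      # multiplier bit seen by each cell one cycle ago
    s_latch = [0] * cells      # sum bit each cell produced one cycle ago
    carry = [0] * cells        # carry retained inside each cell
    result = 0
    for t in range(2 * n):
        # Zeroes are supplied as multiplier bits during the last n cycles.
        b_new = (b >> t) & 1 if t < n else 0
        b_in = [b_new] + b_latch[:-1]      # b moves one cell to the left
        s_in = s_latch[1:] + [0]           # sums move one cell to the right
        new_s = []
        for i in range(cells):
            total = ((a_bits[2 * i] & b_in[i])
                     + (a_bits[2 * i + 1] & b_latch[i])
                     + s_in[i] + carry[i])
            new_s.append(total & 1)
            carry[i] = total >> 1          # at most 2 in radix 2
        b_latch, s_latch = b_in, new_s
        result |= s_latch[0] << t          # cell 0 emits result bit s_t
    return result
```

All terms handled by cell i in cycle t have weight $2^{t+i}$, which is why a sum bit can simply move one cell to the right per cycle while the carry remains in the cell for the next cycle.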
It is easy to see that the same structure can be used at a higher radix also. Assuming a digit set $\{d \mid 0 \le d \le \beta - 1\}$ is used for radix $\beta$, then the carry that has to be retained is at most $2(\beta - 1)$.
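The carry bound can be checked by a short calculation. The sketch assumes, as in the radix-2 cell, that each cell adds two digit products, one incoming sum digit and its own retained carry, and writes $\beta$ for the radix. If the retained carry is at most $2(\beta-1)$, the largest value formed in a cell and the carry it leaves behind are:

```latex
\begin{align*}
  2(\beta-1)^2 + (\beta-1) + 2(\beta-1)
    &= (\beta-1)(2\beta+1) = 2\beta^2 - \beta - 1, \\
  \left\lfloor \frac{2\beta^2 - \beta - 1}{\beta} \right\rfloor
    &= 2\beta - 2 = 2(\beta-1).
\end{align*}
```

The new carry therefore again satisfies the same bound, so by induction from reset latches it never exceeds $2(\beta-1)$; for $\beta = 2$ this gives a carry of at most 2, i.e. two carry bits per cell.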
The time for a complete n by n multiplication is 2n cycles, but due to the very local communication it can operate at a high frequency. Note that 2n cycles are needed although ...
This multiplier shows some similarity with the Atrubin serial multiplier [Atr65] in forming two partial product terms per cell. However, the Atrubin multiplier cell number i stores two pairs of operand bits $(a_{2i}, b_{2i}), (a_{2i+1}, b_{2i+1})$ plus a transient pair $(a_{t-i}, b_{t-i})$ together with sum and carry information. Thus the amount of state information in each cell is much larger than in this design (about 11 or 13 bits vs. 6).
3 Systolic Modular Exponentiation
In many cryptographic applications, like RSA and other two-key systems, modulo-multiplication is a fundamental operation for the implementation of the compute intensive modular exponentiations needed. Due to the extremely long operands (500-1000 bits), systolic structures with pure nearest-neighbor communication will be advantageous if very high clock frequencies are wanted to achieve speed.
The modular reductions needed for modulo-multiplication can be arranged to fit the basic shift-and-add structure of the above systolic multiplier. Based on the idea of P. Montgomery [Mon85], the decision on what multiple of the modulus to subtract can be based on the least significant digits of the product. Hence his method can easily be adapted to a serial multiplier as described above.
The idea is to change to a different residue system, using residues of the form $r = u\beta^k \bmod m$, where m is the very large odd integer modulus, and $\beta$ is the base of the arithmetic with $\gcd(m, \beta) = 1$ and $\beta^k > m$. We will here restrict the discussion to the case $\beta = 2$, which is particularly simple to describe and implement. Choosing higher values of $\beta$ is possible, corresponding to the choice of radix for the multiplication. In particular, choosing $\beta = 4$ is also sufficiently simple to implement, allowing a selection of the reduction factor to be made at a speed matching the speed of the cells.
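A short worked equation may help to see why this residue system is convenient. Writing $\beta$ for the base, and using $r_u$, $r_v$ as illustrative names for the residues of $u$ and $v$, the product of two residues followed by a division by $\beta^k$ is again a residue of the same form:

```latex
\begin{align*}
  r_u = u\beta^k \bmod m, \quad r_v = v\beta^k \bmod m
  \;\Longrightarrow\;
  r_u\, r_v\, \beta^{-k} \equiv (uv)\,\beta^k \equiv r_{uv} \pmod{m}.
\end{align*}
% Small numeric check with \beta = 2, k = 4, m = 11, u = 3, v = 5:
%   r_u = 48 mod 11 = 4,  r_v = 80 mod 11 = 3,  2^{-4} mod 11 = 9,
%   r_u r_v 2^{-4} = 12 * 9 = 108 = 9 (mod 11),
%   r_{uv} = 15 * 16 = 240 = 9 (mod 11).
```

The division by $\beta^k$ is what the right-shifting multiplier provides for free, one digit per step.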
We shall not go into details and formal properties of the algorithm here, but refer the reader to other descriptions, e.g. ... return y, where $e = \sum_{i=0}^{n-1} e_i 2^i$ and the modulo reductions are substituted by Montgomery reductions, assuming that operands initially are converted into Montgomery residues and that the result at the end is converted back into an ordinary residue. These conversions are then amortized over the total exponentiation cost, and can actually be performed as similar modular multiplications by constants.
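The following sketch shows one way the exponentiation described here could look in software. It is an assumption-laden illustration, not the text of Algorithm ME: it uses a right-to-left binary method, forms both products $yz2^{-n}$ and $zz2^{-n}$ in every step, and performs the conversions to and from Montgomery residues as Montgomery products by constants. The names `mod_exp` and `mont` are hypothetical, and the word-level `mont` helper stands in for the systolic multiplier (it relies on Python 3.8 or later for `pow` with a negative exponent).

```python
def mod_exp(x, e, m, n):
    """Return x**e mod m for odd m < 2**n and 0 <= e < 2**n (a sketch)."""
    if m % 2 == 0 or m >= (1 << n):
        raise ValueError("m must be odd and fit in n bits")
    if not 0 <= e < (1 << n):
        raise ValueError("e must fit in n bits")
    r_inv = pow(2, -n, m)                 # 2^(-n) mod m, exists since m is odd

    def mont(u, v):
        # Stand-in for one pass of the modular multiplier: u*v*2^(-n) mod m.
        return u * v * r_inv % m

    r2 = pow(2, 2 * n, m)                 # constant used for the conversions
    z = mont(x % m, r2)                   # x * 2^n mod m (Montgomery residue)
    y = mont(1, r2)                       # 1 * 2^n mod m
    for i in range(n):                    # exponent bits, least significant first
        yz = mont(y, z)                   # both products are always formed,
        zz = mont(z, z)                   # as two parallel multipliers would do
        if (e >> i) & 1:
            y = yz                        # conditional substitution of y
        z = zz
    return mont(y, 1)                     # convert back to an ordinary residue
```

For example, `mod_exp(x, e, m, m.bit_length())` agrees with Python's built-in `pow(x, e, m)` for odd `m > 1` and exponents that fit in `m.bit_length()` bits.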
Since potentially two multiplications are needed in each step of Algorithm ME, they may as well both be performed in parallel, with the new value for y only conditionally substituted for the old value. By combining a systolic modular multiplier for yz with one for zz, operating in parallel, we obtain a systolic exponentiator as shown in Figure 7. The f box at the right is a mode flip-flop, of value 0 throughout the first n+1 cycles of a multiplication and value 1 during the last n+1 cycles. It is used to control the determination of $q_y, S_y$ and $q_z, S_z$, respectively, taking place in the black circles:
$q_y = S_y^{in} \wedge \bar{f}$, $S_y^{out} = S_y^{in} \wedge f$,
$q_z = S_z^{in} \wedge \bar{f}$, $S_z^{out} = S_z^{in} \wedge f$.
The ... is furthermore gradually shifted out of the right end to be used as the multiplier for both multiplications. Note that the zeroes in z 0 are used during the second phase, as extension of the multiplier.
Due to the cyclic nature of the structure, the number of cells has to match the number of steps of the algorithm. But since $a_{n-1} = a_n = 0$ and $m_{n-1} = m_n = 0$, the leftmost cell is actually not needed, since it can never contribute to the computation. It is possible to eliminate one of the two additional steps by noting that in Algorithm MM, the reduction is one step behind the introduction of the new term $b_i A$. By changing L1 into $q_i := (S + b_i a_0) \bmod 2$ and L2 into $S := (S + q_i m + b_i A) \text{ div } 2$, only n cycles of the loop are needed. However, this will require some modification in the right-most cell and thus destroy the regularity of the systolic structure.
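The modified loop body quoted above (the changed L1 and L2) can be written out as a short bit-serial routine. This is a minimal sketch of that variant only; the function name `montgomery_multiply` is hypothetical and the loop framing (n iterations, S starting at 0) is an assumption, since the full statement of Algorithm MM is not reproduced here.

```python
def montgomery_multiply(a, b, m, n):
    """Return S with S = a*b*2^(-n) (mod m) and 0 <= S < 2m.

    Assumes m odd and 0 <= a, b < m < 2**n. No final subtraction of m
    is performed, so S may still exceed m by less than m.
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(n):
        b_i = (b >> i) & 1
        # Modified L1: the quotient bit already accounts for the new term b_i*A.
        q_i = (s + b_i * (a & 1)) & 1
        # Modified L2: the sum is even by construction, so the shift is exact.
        s = (s + q_i * m + b_i * a) >> 1
    return s
```

Because m is odd, adding $q_i m$ flips the least significant bit exactly when needed, so the bit shifted out in every step is zero and no information is lost by the right shift.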
The total system requires n systolic modular multiplier cells and 5n latches/flip-flops. The total execution time for an n-bit modular exponentiation is $2n^2$ cycles when the exponent e also has n bits, which amounts to 2n cycles per bit of the result. With a 100 MHz clock and n = 500 this corresponds to a processing speed of 100,000 bits per second.
For RSA encryption it is possible to use an exponent with much fewer bits, and for decryption the factorization of m is known, allowing this process to take place as two parallel modular exponentiations of half the size, with the results later combined by the Chinese Remainder Theorem. A speed-up of close to a factor of 4 is thus possible, using about the same amount of hardware [SV93].
By using radix 4 for the multiplications and reductions, the number of cycles can be cut in half. What is needed is then suitable logic which, based on the least significant two bits of S, determines a reduction factor $q_i \in \{0, 1, 2, 3\}$ such that $(S + q_i m) \bmod 4 = 0$, together with appropriate modifications of the cells to multiply in radix 4. For any radix $\beta$ a similar adder structure can be obtained, since the value of the carry to be retained is limited to $4(\beta - 1)$ when digits are in the set $\{0, 1, \ldots, \beta - 1\}$. Since the complexity of the cell does not increase that much, it is feasible to use radix 4 and gain almost a factor of two in speed.
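The quoted throughput follows directly from the cycle count; the arithmetic, using only the figures stated above, is:

```latex
\begin{align*}
  2n^2 = 2 \cdot 500^2 &= 5 \cdot 10^5 \ \text{cycles per exponentiation}, \\
  \frac{5 \cdot 10^5 \ \text{cycles}}{10^8 \ \text{cycles/s}} &= 5 \ \text{ms}, \\
  \frac{500 \ \text{bits}}{5 \ \text{ms}} &= 10^5 \ \text{bits/s}.
\end{align*}
```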
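The selection logic for the radix-4 reduction factor can be sketched as follows. This is one possible formulation under stated assumptions, not the paper's circuit: the function name `reduction_digit_radix4` is hypothetical, and it uses the fact that an odd m is its own inverse modulo 4, so that the required factor is $(-S \cdot m) \bmod 4$.

```python
def reduction_digit_radix4(s, m):
    """Return q in {0, 1, 2, 3} such that (s + q*m) % 4 == 0, for odd m."""
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    # For odd m, m*m = 1 (mod 4), hence m^(-1) = m (mod 4) and
    # q = -s * m^(-1) = -s * m (mod 4). Only two bits of s and m are used.
    return (-(s % 4) * (m % 4)) % 4
```

Only the two least significant bits of S and of m enter the computation, so in hardware this amounts to a very small lookup on four input bits.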
4 Other Shift-and-Add Algorithms
The multiplier structure developed in Section 2 can be used for other types of algorithms where the "multiplier" digits $b_i$ are determined "on-the-fly", based on the least significant digit $s_i$ produced during the algorithm. As a first example, let us consider the determination of the reduction factors $q_i$ in the Montgomery multiplication above. In the general case of operating in radix $2^k$, $q_i$ can be determined as q_i = ((S mod 2 ...
The correctness of the algorithm follows from the fact that it actually performs the multiplication of a by b, where $b_0$ is chosen such that $(ab) \bmod 2 = 1$, and the subsequent $b_i$ are chosen precisely such that all bits of S of higher weight become zeroes (the odd a is added when the previous S is odd, producing a zero which is shifted out). Since the previous algorithm is derived from "inverting" a multiplication $ab \equiv 1 \pmod{2^k}$, we can generalize the algorithm to the case of $ab \equiv d \pmod{2^k}$ corresponding to 2-adic ... Figure 8 we then obtain the cell in Figure 9, to be used as the rightmost cell in the systolic array. Observe that if it is known that a divides d then Algorithm MD and the systolic array in Figure 9 perform the exact division of d by a, where the digits of the quotient are determined least significant digit first.
Note also that for fixed k, what is computed is the base 2, k'th order Hensel code [KRS75], or for unbounded k the infinite, but periodic, 2-adic representation of the rational $d/a$. For use in symbolic computations, Jebelean recently [Jeb93] presented algorithms similar to Algorithms MI and MD; however, his algorithms are subtractive, based on division, whereas MI and MD are additive and based on multiplication. Jebelean's algorithms are presented for higher radix and to be implemented in a systolic fashion on a multiprocessor array for use in unbounded precision arithmetic. Algorithms MI and MD can easily be extended to a higher radix also and are very simple to use in a pipelined, LSB-first digit serial computation.
One particular application is to use the systolic multipliers and dividers in Figures 4 and 8 together with standard serial adders and subtracters, as a set of building blocks for the kind of "exact arithmetic" based on Hensel codes which is described in [GK84].
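The additive, least-significant-bit-first idea described here can be sketched in a few lines. This is a reconstruction under stated assumptions, not necessarily the exact formulation of Algorithms MI and MD: the function names `mod_div_pow2` and `mod_inv_pow2` are hypothetical, and the rule assumed for the digits is that $b_i$ is chosen so that bit i of the running product $ab$ equals bit i of d (for d = 1 this reduces to adding the odd a whenever the previous S is odd).

```python
def mod_div_pow2(d, a, k):
    """Return b in [0, 2**k) with a*b = d (mod 2**k), for odd a (a sketch)."""
    if a % 2 == 0:
        raise ValueError("a must be odd to be invertible modulo 2**k")
    s = 0                                 # running partial product, right-shifted
    b = 0
    for i in range(k):
        d_i = (d >> i) & 1
        # Adding the odd a flips the low bit of s, so choose b_i such that
        # the low bit of s + b_i*a equals the wanted product bit d_i.
        b_i = (s & 1) ^ d_i
        s = (s + b_i * a) >> 1            # the matched bit is shifted out
        b |= b_i << i
    return b


def mod_inv_pow2(a, k):
    """Return the inverse of odd a modulo 2**k, as the special case d = 1."""
    return mod_div_pow2(1, a, k)
```

If a divides d and the quotient fits in k bits, `mod_div_pow2(d, a, k)` returns the exact quotient, for instance 2 for d = 6, a = 3, k = 4; in general the k bits returned are the leading digits of the 2-adic expansion of d/a.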
5 Conclusions
A serial-by-parallel multiplier array based on a simple and purely systolic cell has been developed. The linear array of cells performs the multiplication by a sweep across the total array of product terms at a slanted angle, such that each cell forms and accumulates two terms from each column. The motivation has been to find a simple and area-efficient way of implementing modulo exponentiation as needed in cryptosystems. The cell and the array are easily extended to accommodate the interleaving of multiplication and modulo reduction. Two such arrays of cells then allow two multiplications to be performed in parallel, as needed for modulo exponentiation, thus providing a simple systolic system which can operate at a high clock frequency due to the purely nearest-neighbor communication.
By some simple modifications at the front-end or back-end cells, the basic multiplier structure has been shown to implement other algorithms, including 2's complement multiplication, modular inversion and exact division. Although the example designs are all shown in radix 2, it is simple to modify the cells to a higher radix of the form $\beta = 2^k$.
