A design for expandable modular multiplication hardware is proposed. This design allows for cascading the hardware if larger moduli are required. The proposed design uses Montgomery's modular multiplication algorithm [5].
Introduction
Several public-key cryptographic systems [1] make heavy use of modular multiplication. The security of these systems depends on the size of the encryption/decryption key: larger key sizes provide better security. The most popular such algorithm is the RSA [2] algorithm. Reported RSA hardware implementations, e.g. [3, 4, 5, 6, 7, 8, 9, 11, 12], are designed for fixed key sizes. If larger key sizes are needed to improve system security, the hardware must be redesigned.
The goal of this paper is to develop an expandable Montgomery modular multiplication hardware implementation in which duplication of well-defined bit-sliced hardware adapts the system to larger numbers. To incorporate such flexibility, hardware and performance overheads are expected. The proposed expandable design depends mainly on a systolic multiplier used by Sauerbrey [6]. This model has been redesigned and modified for expandability.
In the following, the basic systolic multiplier is described, the Montgomery product algorithm and its implementation are reviewed, and the modifications for expandability are detailed.
The systolic multiplier consists of a set of cascaded identical cells, as shown in Figure 1. This multiplier performs the operation p = x·y + q in a word-serial manner, where x, y and q have l words of b bits each. In other words, each of x, y and q is a number of l digits in base 2^b. The time required for a complete operation is 2l clock cycles [6]. The control input z is used to indicate the beginning of the operation.
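The arithmetic performed by the multiplier can be sketched in software as follows. This is only a behavioural model, not a description of the cell-level dataflow in [6]: the function names (to_digits, from_digits, word_serial_mac) are hypothetical, digits are stored least significant first, and one output digit is produced per loop step, so the 2l steps mirror the 2l clock cycles quoted above.

```python
def to_digits(value, l, b):
    """Split a non-negative integer into l base-2**b digits, least significant first."""
    mask = (1 << b) - 1
    return [(value >> (b * i)) & mask for i in range(l)]


def from_digits(digits, b):
    """Reassemble an integer from base-2**b digits, least significant first."""
    return sum(d << (b * i) for i, d in enumerate(digits))


def word_serial_mac(x, y, q, b):
    """Return the 2l digits of p = x*y + q, one digit per step (behavioural model)."""
    l = len(x)
    mask = (1 << b) - 1
    carry = 0
    p = []
    for k in range(2 * l):                 # 2l steps, one output digit each
        acc = carry + (q[k] if k < l else 0)
        for i in range(l):                 # partial products contributing to digit k
            j = k - i
            if 0 <= j < l:
                acc += x[i] * y[j]
        p.append(acc & mask)               # emit the low b bits as the next digit
        carry = acc >> b                   # everything above b bits carries forward
    return p


if __name__ == "__main__":
    b, l = 8, 4
    x, y, q = 0x12345678, 0x9ABCDEF0, 0x0F0F0F0F
    p = word_serial_mac(to_digits(x, l, b), to_digits(y, l, b), to_digits(q, l, b), b)
    print(from_digits(p, b) == x * y + q)
```

The result always fits in 2l digits because x·y + q is at most (2^(bl) - 1)·2^(bl) when all three operands have l digits.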
The Systolic Multiplier
To multiply two operands of l words, the required number of cascaded cells is ⌈l/2⌉ + 1 [6]. This systolic multiplier is chosen because it can be easily expanded. Figure 1 shows a multiplier for l-digit numbers. If the numbers to be multiplied are increased in size to 2l digits, the only required modification to the design is to add another identical systolic multiplier in cascade, as shown in Figure 2. Clarification and modeling of each cell in the systolic multiplier is given in the following subsection.
The Basic Cell
The basic cell of the systolic multiplier is designed to perform the algorithm shown in 
Since R is a power of 2 (R = 2^k), the mod R and division by R operations can be computed inexpensively. However, this method is suitable only if many MP(x',y') operations are required to offset the initial cost of converting x and y to their Montgomery representations x' and y'. This is the case with the modular exponentiations needed by the RSA algorithm. To accommodate large moduli, computing MP(x',y') is organized in a word-serial manner, as shown in Figure 5. This algorithm is well suited to a systolic multiplier implementation similar to the one described in the previous section.
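To illustrate why a power-of-2 R makes the reductions cheap, the following is a minimal full-width sketch of the standard Montgomery product, in which mod R becomes a bit mask and division by R becomes a right shift. It is not the word-serial algorithm of Figure 5. The helper names (montgomery_setup, to_mont, mp) are hypothetical, N is assumed odd, and the modular inverse uses Python's three-argument pow, which accepts a negative exponent from Python 3.8 onward.

```python
def montgomery_setup(n):
    """Pick R = 2**k with k the bit length of odd n, and N' with N*N' = -1 (mod R)."""
    k = n.bit_length()
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return k, r, n_prime


def to_mont(x, n, r):
    """Convert x to its Montgomery representation x' = x*R mod N (the one-off cost)."""
    return (x * r) % n


def mp(a, b, n, k, n_prime):
    """Montgomery product a*b*R^-1 mod N for 0 <= a, b < N, without trial division."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask      # "mod R" is a mask
    u = (t + m * n) >> k                   # "divide by R" is a shift; low k bits are zero
    return u - n if u >= n else u          # single conditional subtraction


if __name__ == "__main__":
    n = 101
    k, r, n_prime = montgomery_setup(n)
    x, y = 45, 76
    z_mont = mp(to_mont(x, n, r), to_mont(y, n, r), n, k, n_prime)
    z = mp(z_mont, 1, n, k, n_prime)       # multiplying by 1 converts back
    print(z == (x * y) % n)
```

The conversions into and out of Montgomery form are the overhead mentioned above; in a modular exponentiation they are paid once while mp is called many times.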
Choose R > N such that R = 2^k, where k is the number of bits in N; accordingly, R is relatively prime to N when N is odd, as RSA moduli are. Two types of processors are required: a parallel multiplier and a systolic multiplier. The output starts to appear after 2l clock cycles. However, the first l digits of the output are discarded to account for the division by R, so the MP result is serially available after 3l clock cycles.
The hardware implementation derived from the signal flow graph (for l = 4 words) is shown in Figure 7. The full MP result will be available after 4l clock cycles. Two types of registers are used for proper data synchronization: T and 2T, which delay data by one and two clock cycles, respectively. The overall number of registers required for an l-word design is (6l - 3). The number of parallel multipliers used is l, while the number of systolic multipliers used is l + 1. The extra systolic multiplier is not shown in Figure 7, but it is necessary to compute p(0).
Expandability of the Parallel MP Implementation
To expand the design shown in Figure 7, not only must the number of systolic multipliers be increased, but the size of each systolic multiplier must also increase. This is because the number of cascaded stages in each systolic multiplier depends on the number of words/digits (l). Thus, such a design does not allow for regular linear expandability.
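The growth can be made concrete with a small tally of the parallel design's resources. This is a sketch under stated assumptions: it takes the counts given earlier as 6l - 3 registers, l parallel multipliers and l + 1 systolic multipliers, and it assumes ⌈l/2⌉ + 1 cells per systolic multiplier. The function name parallel_design_resources is hypothetical.

```python
def parallel_design_resources(l):
    """Tally the resources of the parallel MP design for l-word operands."""
    cells_per_multiplier = (l + 1) // 2 + 1      # assumed: ceil(l/2) + 1 cells each
    systolic_multipliers = l + 1                 # includes the one that computes p(0)
    return {
        "registers": 6 * l - 3,
        "parallel_multipliers": l,
        "systolic_multipliers": systolic_multipliers,
        "systolic_cells": systolic_multipliers * cells_per_multiplier,
    }


if __name__ == "__main__":
    for l in (4, 8, 16):
        print(l, parallel_design_resources(l))
```

Under these assumptions, doubling l from 4 to 8 raises the total cell count from 15 to 45, so the cell count grows roughly quadratically in l rather than linearly.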
The Expandable MP Design
For regular linear expandability of the MP implementation, the design must be reorganized for serial rather than parallel processing. This is achieved by projecting all systolic multipliers into one and all parallel multipliers into one, i.e. using only one parallel multiplier and one systolic multiplier, as shown in Figure 9. Consider the case where l = 4 words and p(0) is precomputed. Let the words of p(0) be fed in a word-serial manner to the projected processor and the outputs be saved in an output register, as shown in Figure 10. After 2l clock cycles, all words of p(1) will be available in the output register. Then p(1) is serially fed to the processor with the other inputs N and N'_0, properly synchronized. After an additional 2l clock cycles, all digits of p(2) will be available in the output register. Likewise, p(3) and p(4) are computed using the same procedure, allowing 2l clock cycles for each.
For example, if the parallel design of Figure 7 needs to be expanded to handle 2l words, the expansion is performed in both the horizontal and the vertical directions. The horizontal expansion increases the number of systolic multipliers to 2l, while the vertical expansion increases the number of cells in each systolic multiplier to accommodate the 2l words. Thus, linear expandability is not possible for such an architecture (Figure 8).
In the serial design, the time required for each p(j+1) to be available at the output register is 2l + 1 clock cycles, including an extra cycle for proper synchronization. The digits of N must be synchronized to those of p(j) and delayed by two more clock cycles. The single word V (Figure 5) is calculated using a parallel multiplier to multiply N'_0 with p_0(j) and discarding the most significant word of the product.
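The serial schedule can be sketched in software as one pass per word. The following is the standard word-serial Montgomery reduction, offered as an illustration rather than a transcription of Figure 5. It assumes R = 2^(b·l), an odd N below R, operands below N, and N'_0 = -N^(-1) mod 2^b. The names word_serial_montgomery and serial_cycles are hypothetical, and the cycle estimate assumes that p(0) is already available and that the l passes do not overlap.

```python
def word_serial_montgomery(x, y, n, l, b):
    """Return x*y*R^-1 mod N with R = 2**(b*l), reducing one b-bit word per pass."""
    base = 1 << b
    mask = base - 1
    n0_prime = (-pow(n, -1, base)) % base     # N'_0, a single b-bit word
    p = x * y                                 # p(0), assumed precomputed in hardware
    for _ in range(l):                        # pass j computes p(j+1) from p(j)
        v = ((p & mask) * n0_prime) & mask    # low word of p(j) times N'_0, high word dropped
        p = (p + v * n) >> b                  # low word is now zero, so the shift is exact
    return p - n if p >= n else p             # result is below 2N before this step


def serial_cycles(l):
    """Cycle estimate: l passes of 2l + 1 cycles each (assumes no overlap between passes)."""
    return l * (2 * l + 1)


if __name__ == "__main__":
    b, l = 8, 4
    n = 0xF123456B                            # arbitrary odd modulus below 2**32
    x, y = 0x12345678, 0x0ABCDEF1
    r = 1 << (b * l)
    print(word_serial_montgomery(x, y, n, l, b) == (x * y * pow(r, -1, n)) % n)
    print(serial_cycles(l))
```

Only one word of the inverse (N'_0) and one single-word product per pass are needed, which is why a single parallel multiplier suffices in the serial design.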
To allow for expandability, the controller is built using shift registers and multiplexors since shift registers can be easily expanded, as shown in Figure 11 . The systolic multiplier is made expandable by passing its cascading signals as external interface signals as shown in Figure 2 .
An expandable modular multiplier system will consist of a basic processor chip, which can operate on l-digit operands, and a number of expansion chips, each of which allows processing of an additional l digits. The basic MP processor for a key size of l digits consists of a systolic multiplier, 9 b-bit registers, a
Conclusion
In this paper, a new hardware model of an expandable modular multiplication system is proposed. The new hardware depends mainly on a systolic multiplier reported by Sauerbrey [6]. Targeting the maximum possible speed, Montgomery's modular multiplication algorithm has been adopted. This method computes modulo multiplication without trial division, which is the most time-consuming operation in modulo multiplications.
