We propose an algorithm for modular multiplication 
Introduction
With the proliferation of Internet usage, PCs and mobile devices such as PDAs increasingly need to handle several security protocols. Since public-key cryptosystems require a huge amount of computation, there is a growing demand for dedicated hardware to accelerate them.
In this paper, we propose a VLSI algorithm for modular multiplication/division with a large modulus. Modular multiplication with a large modulus is the basic operation in modular exponentiation, which is used in public-key cryptosystems such as RSA [4]. One efficient method for calculating modular multiplication is Montgomery's multiplication algorithm [3]. Several implementations of the algorithm have been proposed [1]. Modular division with a large modulus, on the other hand, is used in the decryption of public-key cryptosystems such as ElGamal [2]. It can be calculated by the extended Binary GCD algorithm, which is well suited to binary arithmetic [5].
Since PCs and mobile devices do not seem to process more than one cryptosystem simultaneously, we combine the multiplier and the divider so that a large part of the circuit is shared by the two operations, which reduces the hardware requirement.
In the proposed VLSI algorithm, multiplication is based on Montgomery's algorithm and division is based on the extended Binary GCD algorithm. The algorithm is accelerated by introducing a redundant representation in all additions/subtractions so that they are carried out in constant time independent of the length of the operands. Almost all the components in the VLSI algorithm are shared, which considerably reduces the hardware requirements.
A modular multiplier/divider based on the algorithm has a linear array structure with a bit-slice feature and is suitable for VLSI implementation. The amount of hardware of an n-bit modular multiplier/divider is proportional to n. It performs an n-bit modular multiplication in at most $2(n+2)/3 + 3$ clock cycles and an n-bit modular division in at most $2n + 5$ clock cycles, where the length of the clock cycle is a constant independent of n.
In the next section, we explain the extended Binary GCD algorithm and Montgomery's multiplication algorithm. In Section 3, we propose a VLSI algorithm for modular multiplication/division. In Section 4, we discuss several aspects of the implementation. In Section 5, we present concluding remarks.
We show the algorithm below. Note that A and B are integers and are allowed to be negative. δ represents α − β, where α and β are values such that $2^\alpha$ and $2^\beta$ indicate the minimums of the upper bounds of |A| and |B|, respectively.
To calculate U/2 mod M, the algorithm examines the least significant bit of U to determine whether it is even or odd. If it is even, the algorithm performs U/2; otherwise it performs (U + M)/2. In this way, modular reduction is accomplished by a simple shift operation.
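The halving step can be illustrated in software. The following Python sketch is a simplified textbook variant of modular division by the extended binary GCD method, not a transcription of the algorithm in this paper: it works on non-negative integers and compares A and B directly, so it omits the δ bookkeeping and the signed operands described above. The function names `halve_mod` and `mod_div` are hypothetical, and it is assumed that M is odd and gcd(Y, M) = 1.

```python
def halve_mod(u, m):
    """Return u/2 mod m for odd m and 0 <= u < m, using one add and one shift."""
    return u >> 1 if u % 2 == 0 else (u + m) >> 1


def mod_div(x, y, m):
    """Return z with z*y = x (mod m); assumes m is odd and gcd(y, m) = 1."""
    a, b = y % m, m
    u, v = x % m, 0
    # Invariants: u*y = a*x (mod m) and v*y = b*x (mod m).
    while a != 0:
        while a % 2 == 0:
            a >>= 1                 # A := A/2
            u = halve_mod(u, m)     # U := U/2 mod M
        if a < b:
            a, b = b, a             # keep the larger odd value in A
            u, v = v, u
        a -= b                      # odd - odd, so A becomes even (or zero)
        u = (u - v) % m
    # Here a = 0 and b = gcd(y, m) = 1, so v*y = x (mod m).
    return v
```

For example, `mod_div(1, 3, 7)` computes the inverse of 3 modulo 7. The only modular reduction inside the inner loop is the conditional addition of M followed by a shift.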
Montgomery's Modular Multiplication Algorithm
Montgomery introduced an efficient algorithm for calculating modular multiplication [3]. Consider the residue class ring of integers with an odd modulus M. Let X and Y be elements of the ring. Montgomery's modular multiplication algorithm calculates Z (< M) such that $Z \equiv XYr^{-1} \pmod{M}$, where r is an arbitrary constant relatively prime to M. The value of r is usually set to $2^n$ when the calculations are performed in radix 2 with an n-bit modulus M.
The radix-2 Montgomery multiplication algorithm is described below. We use the same notation as in the extended Binary GCD algorithm to emphasize the similarity of the two algorithms. Note that U is always bounded by 2M throughout all iterations. Therefore, the last correction step ensures that the output is correctly reduced modulo M.
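As an illustration, the radix-2 iteration can be sketched in Python on ordinary integers. This is a minimal sketch under the assumptions that 0 ≤ X, Y < M, that M is odd and n bits long, and that r = 2^n; the function name `montgomery_mul` is hypothetical and the redundant representation introduced later is not modelled.

```python
def montgomery_mul(x, y, m, n):
    """Return z < m with z = x*y*2^(-n) (mod m); assumes m odd, 0 <= x, y < m < 2^n."""
    a, u, v = y, 0, x          # A := Y, U := 0, V := X
    for _ in range(n):
        if a & 1:              # least significant bit of A selects V
            u += v
        if u & 1:              # U/2 mod M: make U even by adding M, then shift
            u += m
        u >>= 1
        a >>= 1                # shift A down one position
    if u >= m:                 # U < 2M holds here, so one subtraction suffices
        u -= m
    return u
```

With M = 13 and n = 4, `montgomery_mul(7, 5, 13, 4)` returns 3, and indeed 3 · 2^4 = 48 ≡ 35 = 7 · 5 (mod 13).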
A VLSI Algorithm for Montgomery's Modular Multiplication and Modular Division
We propose a VLSI algorithm that performs Montgomery's modular multiplication and modular division, which is efficient in execution time and hardware requirements.
Use of a Redundant Representation
We assume that the input modulus M is an n-bit binary odd number that satisfies the condition $2^{n-1} < M < 2^n$. We also assume that the input operands X and Y and the [...]
The SD2 representation uses the digit set $\{\bar{1}, 0, 1\}$, where $\bar{1}$ denotes −1. Addition of two SD2 numbers can be performed without carry propagation. We use the addition rules for SD2 numbers shown in Table 1 [6]. The addition is accomplished by first calculating the interim sum $u_i$ and the carry digit $c_i$ and then performing the final sum $s_i = u_i + c_{i-1}$.
To calculate $s_i$, we only have to check the digits $a_i$, $b_i$ and the digits preceding them. All the digits of the result can be computed in parallel. The negation of an SD2 number can be done simply by changing the signs of all its nonzero digits. Subtraction can be performed through negation and addition in one step. A carry-propagate addition is required to convert an SD2 number to the binary representation. We represent the internal variables A, B, U and V in n-digit SD2 representation so that all basic operations are carried out by a combinational circuit in constant time independent of the lengths of the operands.
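The carry-free property can be checked with a small software model. Table 1 is not reproduced in this text, so the Python sketch below uses a commonly cited rule set for SD2 addition, which is assumed (not confirmed) to correspond to it: the interim sum and carry at position i are chosen by looking at the signs of the digits at position i − 1. Digits are stored least significant first, and the names `sd2_add` and `sd2_value` are hypothetical.

```python
def sd2_value(digits):
    """Integer value of an SD2 digit list (least significant digit first)."""
    return sum(d << i for i, d in enumerate(digits))


def sd2_add(a, b):
    """Add two equal-length SD2 digit lists; returns len(a) + 1 digits in {-1, 0, 1}."""
    n = len(a)
    u = [0] * n   # interim sums
    c = [0] * n   # carry digits, with a[i] + b[i] = 2*c[i] + u[i]
    for i in range(n):  # every position is independent, so hardware does this in parallel
        t = a[i] + b[i]
        lower_nonneg = i == 0 or (a[i - 1] >= 0 and b[i - 1] >= 0)
        if t == 2:
            c[i], u[i] = 1, 0
        elif t == -2:
            c[i], u[i] = -1, 0
        elif t == 1:
            c[i], u[i] = (1, -1) if lower_nonneg else (0, 1)
        elif t == -1:
            c[i], u[i] = (0, -1) if lower_nonneg else (-1, 1)
    # Final sum s_i = u_i + c_(i-1); the rules above keep it inside {-1, 0, 1}.
    return [u[0]] + [u[i] + c[i - 1] for i in range(1, n)] + [c[n - 1]]
```

Each output digit depends only on the operand digits at positions i, i − 1 and i − 2, which is why the delay of the adder does not grow with n.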
In applications such as exponentiation, chained multiplications are required. To remove the time-consuming SD2-to-binary conversion in each multiplication, we allow the input operands X and Y as well as the output result Z to be expressed in the same redundant representation, so that the output can be fed directly into the inputs. Note that the operands X and Y can still be given in ordinary binary representation.
Division Mode
We follow the structure of the VLSI algorithm for modular division based on the Binary GCD algorithm [5] and further accelerate it.
This algorithm [5] performs all basic operations in constant time independent of n by a combinational circuit. It implements the 'while' loop by introducing P, a binary number of n + 2 bits that indicates the minimum of the upper bounds of |A| and |B|, i.e., $\min(2^\alpha, 2^\beta)$. Note that P has exactly one bit set to 1 and the rest set to 0. In this way, the termination condition check A = 0, which may require inspecting all the bits of A, is replaced by a check of P = 1, which can be carried out by looking only at the least significant bit of P, i.e., $p_0$. A binary number D and a flag s (∈ {0, 1}) are introduced to implement δ. D is n bits long and has the value $D = 2^{(-1)^s \cdot \delta}$. Note that D also has exactly one bit set to 1 and the rest set to 0. In this way, the decrement of δ, δ := δ − 1, which may require a long borrow propagation, is replaced by a one-bit shift of D.
The calculation of T/2 modulo M is implemented by the operation MHLV(T, M). It is carried out by performing T/2 or (T + M)/2 according to whether T is even or odd. Note that only the least significant digit of T has to be checked to determine whether it is even or odd.
The calculation of T/4 modulo M is implemented by the operation MQRTR(T, M). It is carried out by performing the following calculations: If [...]
Since M is an ordinary binary number, the addition of M, −M or 2M in MHLV and MQRTR is simpler than the ordinary SD2 addition. For the details of the simpler SD2 addition, see, e.g., [7].
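The case list for the quartering operation is cut off in this text. The Python sketch below shows one reading that is consistent with the remark that only M, −M or 2M is ever added: since M is odd, exactly one of these multiples (or none) makes T divisible by 4. It operates on ordinary signed integers rather than SD2 digits, the lowercase names `mhlv` and `mqrtr` are hypothetical, and the case order is an assumption.

```python
def mhlv(t, m):
    """T/2 modulo M for odd M: add M only when T is odd, then halve exactly."""
    return t // 2 if t % 2 == 0 else (t + m) // 2


def mqrtr(t, m):
    """T/4 modulo M for odd M: add 0, 2M, M or -M so that the sum is divisible by 4."""
    r = t % 4                     # Python's % is non-negative for a positive modulus
    if r == 0:
        return t // 4
    if r == 2:
        return (t + 2 * m) // 4   # 2M = 2 (mod 4) because M is odd
    if (t + m) % 4 == 0:
        return (t + m) // 4
    return (t - m) // 4           # T odd: exactly one of T + M, T - M is divisible by 4
```

If |T| < M, both results also satisfy |result| < M, so the operand range is preserved from one iteration to the next.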
The operation U/2 modulo M, which accompanies the operation A := A/2 in Algorithm 1 when A is divisible by 2, is implemented by MHLV(U, M). To accelerate the calculation when A is divisible by 4, instead of performing A/2 and U/2 modulo M in two separate steps, we modify the algorithm by grouping two of each operation into the calculations of A/4 and U/4 modulo M. We perform the latter calculation by using MQRTR(U, M).
Multiplication Mode
We implement the while loop by using the same P as in the division case.
In Algorithm 2, A and V are initialized with the values of Y and X. U is used to store the partial products and it is initialized with the value 0. The algorithm examines the least significant bit of A to determine whether V has to be added. Then it performs a division of U by 2 modulo M and A is shifted down one position.
To accelerate the calculation, we modify this algorithm so that it processes two digits at a time. We examine the two least significant digits of A, i.e. [...] In this way, all the operations can be accomplished with shifts, MHLV and MQRTR, and all the results are always bounded in magnitude by M. Montgomery's constant r is now $2^{n+2}$. To make use of the same decision rule as in division, we initialize B with its least significant digit set to $\bar{1}$. In this way, when the least significant digit of A has the value 1, A + B = 0 mod 4. The correction of adding −4 or 4 can be done by introducing the digit $\bar{1}$ in the third least significant digit of B, i.e., $b_2$. The conversions and corrections are performed in the algorithm by rewriting $(a_2, a_1, b_2)$ [...]
The VLSI Algorithm
The VLSI algorithm is presented here. In the following, {C1, C2} means that two calculations, C1 and C2, are performed in parallel. 
Modular Division
In division mode, i.e., mode = 1, when A mod 4 = 0, A is shifted down two digits and MQRTR(U, M) is performed. Note that when P = 2 and $a_0$ = 0, an extra 0 digit is processed together. However, since these operations only update the values of A and U, this calculation neither affects the final result nor increases the number of iterations needed. No special consideration has to be given to the termination condition.
Note also that in the algorithm, δ is represented by the values of D and s. By convention, δ = 0 is represented by D = 1 and s = 1.
In Step 3, B is 1 when B mod 4 = 1 and it is −1 otherwise, i.e., when B mod 4 = 3. When B = −1, V is negated in the SD2 system. In multiplication mode, i.e., mode = 0, the flag s is set to 1 and it keeps this value until the end of Step 2.
In the case that P = 2 and the corresponding operation to be performed involves a two-digit shift, we shift P by only one position to mark the end of the loop and reset the flag s to 0. At this point, n + 2 digits of A have been processed, so no extra calculations are needed. In the case that P = 2 and the corresponding operation involves only a one-digit shift, P is shifted one position and the loop finishes, leaving one digit of A unprocessed. This is the same case as having P = 4 with operations involving two-digit shifts. The flag s is left at the value 1, indicating that an extra operation is needed in Step 3. It can be shown that this unprocessed digit is always 0, so we only need to perform MHLV(U, M) at the end. In this way, all the n + 2 digits of A are always processed and Montgomery's constant has the value $r = 2^{n+2}$.
Proposition 1: Let Y be expressed in SD2 representation with a length of n digits such that −M < Y < M, and let M be an n-bit binary number that satisfies the condition $2^{n-1} < M < 2^n$. If Algorithm 3 is used with this input and Step 2 finishes leaving the most significant digit of A unprocessed, this digit is always 0.
Proof: At initialization time, the value of Y is copied into A. Suppose that A is positive and $a_{n-1} = 1$. This digit can be transformed into $[1\,0]$ or into $[1\,\bar{1}]$ when A + B or A − B is performed following the addition rules for SD2 numbers described in Table 1. In the former case, the digits $[1\,0]$ can in turn be transformed into $[1\,\bar{1}\,0]$. Further expansion does not occur when the most significant digit is followed by $\bar{1}$. Now, consider the case that n − 1 digits of A have been processed and we are about to process the next two of the remaining three digits. A can have its digits $[a_2, a_1, a_0] = [1\,\bar{1}\,0]$ or $[1\,\bar{1}\,\bar{1}]$. No other possibilities are left because of the restriction |A| < M. In the former case, A is shifted by only one position, leaving the other two digits to be processed in the next iteration. In fact, these digits $[1\,\bar{1}]$ are recoded into $[0\,1]$ and they are processed together in the next iteration. No extra calculation is needed. In the latter case, the two least significant digits $[\bar{1}\,\bar{1}]$ of A are recoded into $[0\,1]$ and processed together. The generated carry digit $\bar{1}$ is subtracted from A so that the most significant digit of A that has been left is cancelled and reset to 0. Similarly, when A is negative and $a_{n-1} = \bar{1}$, this digit can be transformed into $[\bar{1}\,1]$ and no further expansion occurs.
Discussions

Chained Multiplications and Exponentiation
In applications such as exponentiation, chained multiplications are performed in Montgomery's representation. Observing that the result Z of the modular multiplication satisfies |Z| < M, it is possible to reuse the result as an input operand of another modular multiplication. Note that r is an arbitrary constant relatively prime to M. In our proposed algorithm, r has the value $2^{n+2}$. Only one carry-propagate addition is needed at the end of the whole calculation to convert the result from SD2 representation into a binary number. In the case that Z < 0, we need to add M as a final correction step. The same correction step is applied in division mode.
Furthermore, modular multiplication/division can also be used to accelerate the calculation of modular exponentiations. Consider the operation $x^b \bmod M$ and let b be expressed in SD2 representation. The modular exponentiation can be calculated by examining each digit of the exponent from the most significant position and performing a modular squaring for each digit 0, a modular squaring and a modular multiplication for each digit 1, and a modular squaring and a modular division for each digit $\bar{1}$. Since b can be recoded to reduce the number of nonzero digits, the overall number of operations can be considerably reduced.
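The digit-by-digit procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the exponent is recoded into non-adjacent form (NAF), which is one possible signed-digit recoding and is not named in this paper; plain modular arithmetic is used instead of Montgomery's representation; and the modular division by x is modelled by multiplying with `pow(x, -1, m)`, which requires Python 3.8 or later and gcd(x, M) = 1. The names `to_naf` and `sd_pow` are hypothetical.

```python
def to_naf(b):
    """Non-adjacent form of b >= 0 as digits in {-1, 0, 1}, least significant first."""
    digits = []
    while b > 0:
        if b & 1:
            d = 2 - (b % 4)    # 1 when b = 1 (mod 4), -1 when b = 3 (mod 4)
            b -= d
        else:
            d = 0
        digits.append(d)
        b >>= 1
    return digits


def sd_pow(x, b, m):
    """Compute x^b mod m by scanning signed digits of b from the most significant end."""
    x_inv = pow(x, -1, m)      # stands in for a modular division by x
    z = 1
    for d in reversed(to_naf(b)):
        z = z * z % m          # squaring for every digit
        if d == 1:
            z = z * x % m      # digit 1: extra multiplication
        elif d == -1:
            z = z * x_inv % m  # digit -1: extra division
    return z
```

For b = 7, the binary form 111 needs two multiplications after the leading digit, while the recoded form needs a single division: `to_naf(7)` gives `[-1, 0, 0, 1]`, i.e., 8 − 1.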
Hardware Implementation
We assume that one pass of the computations in the 'while' loop of Step 2, i.e., one row in Fig. 1/Fig. 2, is performed in one clock cycle.
A modular multiplier/divider based on Algorithm 3 mainly consists of 7 registers for storing A, B, P, D, U, V and M, three SD2 adders, one of which is simpler, selectors, and a small control circuit. Fig. 3 shows a block diagram of the multiplier/divider.
In multiplication mode, D is not used. Therefore, D can be disconnected during this mode to reduce power consumption. The circuit has a linear array structure with a bit-slice feature. The amount of hardware of the modular multiplier/divider is proportional to n. Since the depth of the combinational circuit part is constant, the length of the clock cycle is a constant independent of n.
Use of Two Level 1-hot Counters or Binary Counters
We can reduce the amount of hardware for keeping P and D by replacing the 1-hot counters with two-level 1-hot counters. Let $n_h$ and $n_l$ be integers such that $n + 2 \le n_h \cdot n_l$ is satisfied and $n_h + n_l$ is minimized, namely, $n_h \approx n_l \approx \sqrt{n}$. We replace P with an $n_h$-bit and an $n_l$-bit 1-hot counter, $P_h$ and $P_l$, which keep $p_h$ and $p_l$, respectively, such that $p_h \cdot n_l + p_l = P$. We replace D with $D_h$ and $D_l$ in the same way.
When we use $P_h$ and $P_l$ instead of P and $D_h$ and $D_l$ instead of D, we modify the algorithm as follows. $P_h$ and $P_l$ are initialized so that $p_h = \lfloor (n+1)/n_l \rfloor$ and $p_l = (n+1) \bmod n_l$ are satisfied. $D_h$ and $D_l$ are initialized so that $d_h = 0$ and $d_l = 1$. The operation P := P >> 1 is realized as:
1. If the rightmost bit of $P_l$ is 1, then perform a 1-bit right shift of $P_h$;
2. Perform a 1-bit cyclic right shift of $P_l$.
Similarly, the operation P >> 2 can be accomplished by looking at the rightmost two bits of $P_l$. Shift operations of D can be realized in similar ways. The check of $p_0 = 1$ can be replaced by a check that the rightmost bits of both $P_h$ and $P_l$ are 1.
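The two shift rules and the termination check can be modelled in a few lines of Python. This is a behavioural sketch only: the two 1-hot registers are modelled as Python integers with a single bit set, the class name `TwoLevelOneHot` and the helper `steps_to_one` are hypothetical, and only the one-position shift is shown.

```python
class TwoLevelOneHot:
    """Two 1-hot registers that together encode position = p_h * n_l + p_l."""

    def __init__(self, position, n_l):
        self.n_l = n_l
        self.hi = 1 << (position // n_l)   # 1-hot register for p_h
        self.lo = 1 << (position % n_l)    # 1-hot register for p_l

    def shift_right(self):
        wrap = self.lo & 1
        if wrap:                           # rule 1: low register is at 0, so borrow from high
            self.hi >>= 1
        # rule 2: 1-bit cyclic right shift of the low register
        self.lo = (self.lo >> 1) | (wrap << (self.n_l - 1))

    def at_one(self):
        """Equivalent of the check p_0 = 1: both rightmost bits are set."""
        return bool(self.hi & 1 and self.lo & 1)


def steps_to_one(position, n_l):
    """Count one-position shifts until the counter signals P = 1."""
    counter = TwoLevelOneHot(position, n_l)
    steps = 0
    while not counter.at_one():
        counter.shift_right()
        steps += 1
    return steps
```

For n = 8, choosing n_h = 3 and n_l = 4 covers the n + 2 = 10 positions with 7 flip-flops instead of 10, and `steps_to_one(9, 4)` counts down through all nine shifts before the termination check fires.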
When we use a 1-hot counter for each counter, it requires n + 2 flip-flops. When we use a two-level 1-hot counter, it requires about $2\sqrt{n}$ flip-flops. We can further reduce the amount of hardware for counters by using binary counters, each of which requires about $\log_2 n$ flip-flops. Although the depth of the binary counter is not a constant, it is proportional to log log n and is very small even when n is several hundred. Therefore, in practice, it may be efficient to use binary counters.
When we employ binary counters, we should introduce a zero flag and perform zero detection of the counter in the previous step, i.e., one step earlier than in Algorithm 3, in order to avoid an increase of the clock period.
Concluding Remarks
We have proposed a VLSI algorithm for modular multiplication/division. We have modified the extended Binary GCD algorithm and Montgomery's modular multiplication and have accelerated them by the use of a redundant representation for internal computation.
A modular multiplier/divider based on the algorithm has a linear array structure with a bit-slice feature, and is suitable for VLSI implementation. The amount of hardware of an n-bit modular multiplier/divider is proportional to n. It performs an n-bit modular multiplication in at most 
