A hardware algorithm for modular multiplication/division which performs modular division, Montgomery multiplication, and ordinary modular multiplication is proposed. The modular division in our algorithm is based on the extended Euclidean algorithm. We employ our newly proposed computation method that consists of processing the multiplier from the most significant digit first to calculate Montgomery multiplication. Finally, the ordinary modular multiplication is based on shift-and-add multiplication. Each of these three operations is carried out through the iteration of simple operations such as shifts and additions/subtractions. To avoid carry propagation in all additions and subtractions, the radix-2 signed-digit representation is employed. A modular multiplier/divider based on the algorithm has a linear array structure with a bit-slice feature and carries out n-bit modular multiplication/division in O(n) clock cycles, where the length of the clock cycle is constant and independent of n. This multiplier/divider can be implemented using a hardware amount only slightly larger than that of the modular divider.
Introduction
Modular multiplication and modular division are basic operations in abstract algebra and play important roles in many public-key cryptosystems. For example, modular multiplication is used in the RSA cryptosystem [9] and in the Diffie-Hellman key exchange protocol [3]. Modular multiplication and modular division are required in the ElGamal cryptosystem [4], in the DSA digital signature scheme [1], and in computing point operations in elliptic curve cryptosystems with curves defined over GF(p) [7].
In applications where long chained multiplications are required, such as in the deciphering process of RSA, the Montgomery method [8] has an outstanding performance. On the other hand, in applications where few modular multiplications are required, as in the enciphering process of RSA, performing the calculations using the ordinary modular multiplication may be faster.
Personal computers and mobile devices must handle several security protocols, and there is strong pressure to shrink hardware in order to reduce fabrication costs. It is therefore important to develop an algorithm for modular multiplication/division that can be implemented in compact hardware. In a previous publication [5], we proposed a hardware algorithm for modular multiplication/division that calculates modular division and Montgomery multiplication, where the modular division is based on the extended Binary GCD algorithm. In this paper, we propose a hardware algorithm for modular multiplication/division that calculates modular division, Montgomery multiplication, and also ordinary modular multiplication, with hardware resources similar to those needed to calculate modular division alone. To the best of our knowledge, no other work in the literature combines these three operations.
In the hardware algorithm to be proposed, modular division is based on the extended Euclidean algorithm. We improve the hardware algorithm for modular division originally presented in [10] to reduce hardware requirements by sharing the modular reduction hardware component. Montgomery multiplication is based on our newly proposed computation method that consists of processing the multiplier from the most significant digit first. This method makes it possible to compute Montgomery multiplication using almost the same hardware required for computing the modular division. The ordinary modular multiplication is based on the classical shift-and-add multiplication. We modify this algorithm in order to use almost the same hardware. Hence, the three operations share almost all the hardware components reducing the required hardware resources considerably. Each of the three operations is carried out through the iteration of shifts and additions/subtractions. In order to avoid carry propagation in all additions and subtractions, the radix-2 signed-digit (SD2) representation is employed.
A modular multiplier/divider based on our algorithm has a linear array structure with a bit-slice feature and is suitable for VLSI implementation. The hardware amount of an n-bit modular multiplier/divider is proportional to n and is slightly larger than that of the modular divider. It performs an n-bit modular multiplication/division in O(n) clock cycles where the length of a clock cycle is constant independent of n.
In the next section, we will explain the extended Euclidean algorithm, the Montgomery multiplication algorithm and the ordinary modular multiplication algorithm. We also explain basic operations in the SD2 system. In Sect. 3, we propose a hardware algorithm for modular multiplication/division. In Sect. 4, we discuss hardware implementation.
Preliminaries

Extended Euclidean Algorithm for Modular Division
One of the well known methods for calculating modular division is the Extended Euclidean Algorithm [6] .
Consider the residue class field of integers with an odd prime modulus $M$. Let $X$ and $Y$ $(\neq 0)$ be elements of the field. The algorithm calculates $Z$ $(< M)$ such that $Z \equiv X/Y \pmod{M}$.
The algorithm performs modular division by intertwining a procedure to find the modular quotient with that for calculating gcd(Y, M).
$A$ ($A'$) and $B$ are involved in the calculation of the GCD and are allowed to be negative. $U$ ($U'$) and $V$ are used for calculating the quotient and are also allowed to be negative.
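The intertwining of the GCD computation with the quotient computation can be sketched in software as follows. This is a minimal integer-level illustration, not the listing of [Algorithm 1]; the function name `mod_div` and the use of Python's floor division for the integer quotient are assumptions made for the example.

```python
def mod_div(x: int, y: int, m: int) -> int:
    """Return z in [0, m) with z * y ≡ x (mod m), for prime m and y not divisible by m.

    Hypothetical helper: (a, b) follow the gcd(Y, M) computation while
    (u, v) accumulate the quotient, as described in the text.
    """
    a, b = y % m, m   # gcd pair
    u, v = x % m, 0   # invariants: u*Y ≡ a*X and v*Y ≡ b*X (mod m)
    while b != 0:
        q = a // b                  # integer quotient of the current division
        a, b = b, a - q * b         # divisor and remainder become the next pair
        u, v = v, u - q * v         # same update applied to the quotient pair
    if a != 1:
        raise ValueError("y has no inverse modulo m")
    return u % m                    # a == 1, hence u*Y ≡ X (mod m)
```

For instance, `mod_div(3, 5, 7)` solves 5·z ≡ 3 (mod 7).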
Montgomery Multiplication
Montgomery introduced an efficient algorithm for calculating modular multiplication [8]. Consider the residue class ring of integers with an odd modulus $M$, and let $X$ and $Y$ $(= [y_{n-1} y_{n-2} \cdots y_0])$ be elements of the ring. The Montgomery multiplication algorithm calculates $Z$ such that
$$Z \equiv X \cdot Y \cdot R^{-1} \pmod{M},$$
where $R$ is an arbitrary constant satisfying $R > M$ and relatively prime to $M$. This constant $R$ usually takes the value $2^n$ when the calculations are performed in radix 2 with an $n$-bit modulus $M$.
The radix-2 Montgomery multiplication algorithm is described below.
[Algorithm 2] (Montgomery Multiplication)
Inputs: $M$ : $2^{n-1} < M < 2^n$ and odd
        $X, Y$ : $0 \le X, Y < M$
Output: $Z = X \cdot Y \cdot 2^{-n} \bmod M$
Algorithm:
  U := 0;
  for i := 0 to n − 1 do
    U := (U + $y_i$ · X)/2 mod M;
  endfor
  if U ≥ M then Z := U − M else Z := U;
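As an illustration of [Algorithm 2], the following sketch evaluates the loop on ordinary integers. The halving modulo M is realized here by adding M whenever the intermediate sum is odd, which is a common software realization and an assumption of this example; the function name is hypothetical.

```python
def montgomery_mul(x: int, y: int, m: int, n: int) -> int:
    """Return X * Y * 2^(-n) mod M for odd M with 2^(n-1) < M < 2^n."""
    assert m % 2 == 1 and (1 << (n - 1)) < m < (1 << n)
    assert 0 <= x < m and 0 <= y < m
    u = 0
    for i in range(n):                # scan y_0, y_1, ..., y_{n-1}
        u += ((y >> i) & 1) * x       # U + y_i * X
        if u & 1:                     # odd: adding the odd modulus makes it even
            u += m
        u >>= 1                       # exact halving, i.e. division by 2 modulo M
    # u stays below 2*M throughout, so one conditional subtraction suffices
    return u - m if u >= m else u
```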
Ordinary Modular Multiplication
The usual method for calculating the ordinary modular multiplication, i.e. the calculation of Z such that Z ≡ X × Y (mod M), is known as shift-and-add multiplication and it is shown below. In this method, the bits of the multiplier are scanned from the most significant position first.
[Algorithm 3] (Ordinary Modular Multiplication)
Inputs: $M$ : $2^{n-1} < M < 2^n$
        $X, Y$ : $0 \le X, Y < M$
Output: $Z = X \cdot Y \bmod M$
Algorithm:
  U := 0;
  for i := n − 1 downto 0 do
    U := (2 · U + $y_i$ · X) mod M;
  endfor
  Z := U;
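[Algorithm 3] translates almost line by line into software. The sketch below uses Python's `%` operator for the reduction, which is an assumption of the example and not how the hardware performs it; the function name is hypothetical.

```python
def shift_add_mul(x: int, y: int, m: int, n: int) -> int:
    """Return X * Y mod M by scanning the n bits of Y from the most significant."""
    assert 0 <= x < m and 0 <= y < m < (1 << n)
    u = 0
    for i in range(n - 1, -1, -1):    # i = n-1 downto 0
        y_i = (y >> i) & 1
        u = (2 * u + y_i * x) % m     # double the partial product, add y_i * X
    return u
```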
Basic Operations in SD2 System
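In order to perform additions and subtractions without carry propagation, we represent the internal variables as $(n+1)$-digit radix-2 signed-digit (SD2) numbers [2], where $n$ is the bit length of the modulus $M$. The SD2 representation has a fixed radix 2 and a digit set $\{\bar{1}, 0, 1\}$, where $\bar{1}$ denotes $-1$.
An $(n+1)$-digit SD2 integer $A = [a_n a_{n-1} \cdots a_0]$ ($a_i \in \{\bar{1}, 0, 1\}$) has the value $\sum_{i=0}^{n} a_i \cdot 2^i$. The algorithm requires a doubling procedure for an SD2 integer without overflow [10]. To illustrate how this procedure is implemented, let us take two $(n+1)$-digit SD2 integers and call them $A$ and $S$. The doubling procedure A := DBL(S), i.e., the calculation of $A$ so that $A = 2 \cdot S$, is only necessary when $S$ satisfies $s_n = 0$ or $s_{n-1} = -s_n$. In both cases $2 \cdot S$ fits in $n+1$ digits: when $s_n = 0$, $S$ is simply shifted one position to the left; when $s_{n-1} = -s_n$, the pair $[s_n s_{n-1}]$ has the same value as $[0\, s_n]$, so the same shift applies after this rewriting.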
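The doubling condition can be checked with a small software model of SD2 numbers. In this sketch an SD2 integer is a Python list of digits in {-1, 0, 1}, most significant digit first; the helper names `sd2_value` and `dbl` are hypothetical, and the treatment of the case s_{n-1} = -s_n follows from the identity [1, -1] = [0, 1] and is not a transcription of [10].

```python
from typing import List

def sd2_value(digits: List[int]) -> int:
    """Value of an SD2 number, most significant digit first."""
    value = 0
    for d in digits:
        assert d in (-1, 0, 1)
        value = 2 * value + d
    return value

def dbl(s: List[int]) -> List[int]:
    """Return an SD2 number of the same length with value 2 * S."""
    if s[0] == 0:
        return s[1:] + [0]               # plain left shift
    if s[1] == -s[0]:
        # [s_n, -s_n] has the same value as [0, s_n]; shift after rewriting
        return [s[0]] + s[2:] + [0]
    raise OverflowError("2*S does not fit in the same number of digits")
```

For example, `[1, -1, 0, 1]` has the value 5 and `dbl` returns `[1, 0, 1, 0]`, which has the value 10.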
Procedures for addition, doubling and halving in modulo M in the SD2 system are also required and are described below. See also [10] , [11] .
Let the modulus $M$ be an $n$-bit odd integer satisfying $2^{n-1} < M < 2^n$, and let $U$ and $V$ be $(n+1)$-digit SD2 integers satisfying $-M < U, V < M$. Modular addition T := MADD(U, V, M), i.e., the calculation of $T$ so that $T \equiv U + V \pmod{M}$, is performed in two steps. In the first step, we calculate W := U + V in the SD2 system, where $W$ results in an $(n+2)$-digit SD2 number. In the second step, if the value of the number formed by the three most significant digits of $W$, i.e. the value of $[w_{n+1} w_n w_{n-1}]$, is negative, zero or positive, we add $M$, 0 or $\bar{M}$ to $W$, respectively. $\bar{M} = [\bar{1}\,0\,\bar{m}_{n-2} \cdots \bar{m}_1\,1]$ is an $(n+1)$-digit SD2 number, where $\bar{m}_i$ is 1 or 0 according as $m_i$ is 0 or 1, and has the value $-M$. This addition is also performed in the SD2 system. Since all the digits of the addend are non-negative except the most significant one, the addition in this step is simpler than the addition of two SD2 numbers. For the details of the modular addition procedure, see [11].
Modular doubling T := MDBL(U, M), i.e., the calculation of T so that T ≡ 2 · U (mod M), can be performed by applying the second step of the modular addition to 2 · U, which is obtained by shifting U one position to the left.
Modular halving T := MHLV(V, M), i.e., the calculation of $T$ so that $T \equiv V/2 \pmod{M}$, is performed in two steps. In the first step, we add $M$ to $V$ when $V$ is odd, i.e. when $v_0 \neq 0$. No action is required when $V$ is even. In the second step, we shift the result of the first step one position to the right, throwing away the least significant digit, which is 0. (Recall that $M$ is odd.)
Procedures MADD, MDBL and MHLV can be performed in a constant time independent of n by means of combinational circuits.
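The behaviour of the three procedures can be modelled at the level of integer values. The sketch below is not a digit-level SD2 implementation: it uses the exact sign of W where the hardware inspects only the three most significant digits, and all function names (`reduce_step`, `madd`, `mdbl`, `mhlv`) are hypothetical. It is meant only to show that the results stay in the range (−M, M) and are congruent to the intended values.

```python
def reduce_step(w: int, m: int) -> int:
    """Second step of modular addition: add M, 0 or -M according to the sign of W."""
    # Assumption: the exact sign of w replaces the estimate taken from
    # the three most significant SD2 digits.
    if w < 0:
        return w + m
    if w > 0:
        return w - m
    return w

def madd(u: int, v: int, m: int) -> int:
    """T ≡ U + V (mod M) with -M < T < M, given -M < U, V < M."""
    return reduce_step(u + v, m)

def mdbl(u: int, m: int) -> int:
    """T ≡ 2*U (mod M): the reduction step applied to the shifted operand."""
    return reduce_step(2 * u, m)

def mhlv(v: int, m: int) -> int:
    """T ≡ V/2 (mod M): add the odd modulus when V is odd, then shift right."""
    if v % 2 != 0:
        v += m
    return v // 2   # exact, since v is even here
```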
A Hardware Algorithm for Modular Multiplication/Division
We propose a hardware algorithm that performs modular division, Montgomery multiplication and ordinary modular multiplication, which is efficient in execution time and hardware requirements. We present first our improved modular division algorithm and then our newly proposed algorithm for computing Montgomery multiplication, and finally, the combined modular multiplication/division algorithm.
Improved Hardware Algorithm for Modular Division
We introduce two modifications to improve the hardware algorithm for modular division based on the extended Euclidean algorithm proposed in [10]. In the hardware algorithm proposed in [10], variables A and B of [Algorithm 1] are represented by using two n-digit SD2 integers, A and B, and two n-digit binary integers of the form $2^i$, P and D, so that A = A/(P/D) and B = B/P, where the left-hand sides denote the variables of [Algorithm 1] and the right-hand sides use their SD2 representations in the hardware algorithm. Note that P and D have only one bit of value 1. U and V are $(n+1)$-digit SD2 integers satisfying $-M < U, V < M$. The calculation of the GCD is performed as a series of integer divisions, where the divisor and the remainder of each division are taken as the new dividend and the new divisor of the next integer division, respectively. To simplify the quotient digit determination, the variable B, which stores the divisor, is strongly normalized. This means that B is rewritten to change its representation in the SD2 system without modifying its numerical value, so that $b_{n-1} \neq 0$ and $[b_{n-1} b_{n-2}]$ is $[1\,0]$ or $[\bar{1}\,0]$ [10]. Then, integer division is performed as a series of subtractions A − q · B in the SD2 system.
The first modification that we introduce is related to the length and the representation of the input operands. In the original algorithm [10], after finishing the calculations in the SD2 system, the $(n+1)$-digit result is converted into the binary representation and then reduced to the range $[0, M-1]$. This step involves a carry-propagation addition, a full-length comparison and another possible carry-propagation addition. All these operations are time consuming. In order to enable the feedback of the output represented in SD2 directly into the inputs and to avoid this time-consuming step, we represent the variables A and B with the same digit length as U and V and in the same SD2 representation. Note that operands X and Y can still be given in the ordinary binary representation, since every binary number is also a valid SD2 number. Increasing the length of the operands and allowing them to be represented in the SD2 system do not alter the structure of the algorithm. The only correction required in the algorithm is a displacement by one of the subindex positions at which the digits are examined to control the different operations. As the input operands are now represented as $(n+1)$-digit numbers and $|Y| < M$, $b_{n-1}$ can never take the value 1 when $b_n$ is equal to 1 at initialization time. Therefore, Step 2-0 of the original hardware algorithm [10] is eliminated.
The second modification that we introduce is related to the order of the steps. The structure of the hardware algorithm proposed in [10] follows the structure of [Algorithm 1]. Integer division is performed between A and B, which contain the dividend and the divisor, respectively. After each division, the divisor B and the remainder A are taken as the new dividend and the new divisor, respectively, for the next integer division. To accomplish this, a swap operation between the variables A and B is performed. Normalization is then applied to variable B. To reduce hardware requirements, we change the order of the steps. After finishing the integer division, variable A, which contains the remainder, is normalized. Then, a swap operation is performed so that integer division is carried out between A and B in the same way as in the original hardware algorithm. As a consequence of this reordering, the variables A and B need to be initialized with their values interchanged. Equivalent modifications are required for the variables U and V. As a result, the operation DBL is applied only to variable A, and the modular reduction hardware component can be shared between the operations MDBL and MADD, since these two operations are now applied to the same variable U. This means that the proposed algorithm is more efficient in terms of hardware requirements than the one proposed in [10].
The improved hardware algorithm is described below:
[Hardware Algorithm 1]
Step 1: ...
Step 2-1: [Normalization of A]
  while $p_n = 0$ and $[a_n a_{n-1}] \neq [1\,0]$ and $[a_n a_{n-1}] \neq [\bar{1}\,0]$ do
    if $[a_n a_{n-1} a_{n-2}] = [1\,\bar{1}\,1]$ or $[a_n a_{n-1} a_{n-2}] = [0\,1\,1]$ then
      $[a_n a_{n-1} a_{n-2}] := [1\,0\,\bar{1}]$;
    elseif $[a_n a_{n-1} a_{n-2}] = [\bar{1}\,1\,\bar{1}]$ or $[a_n a_{n-1} a_{n-2}] = [0\,\bar{1}\,\bar{1}]$ then
      $[a_n a_{n-1} a_{n-2}] := [\bar{1}\,0\,1]$;
    ...
  /* Main Stage */
  while $d_0 = 0$ do
    if $a_n = 0$ or $a_{n-1} = -a_n$ then
      S := A;
    else
      q := $a_n \cdot b_n$;
      ...
    ... := MHLV(V, M);
    else
      ...
    A := S;
    endif
  endwhile
  /* Termination Stage */
  r := sgn($[a_n a_{n-1}]$);
  while sgn($[a_n a_{n-1}]$) = r and (abs($[a_n a_{n-1} a_{n-2}]$) ≥ 3 or ($b_{n-2} = -b_n$ and abs($[a_n a_{n-1} a_{n-2}]$) = 2)) do
    q := r · $b_n$;
    A := A − q · B;
    U := MADD(U, −q · V, M);
  endwhile
  goto Step 2-1;
Step 3: ...
Step 4: ...
In Step 2-3, we perform an integer division by means of subtractions A − q · B in the SD2 system. In these subtractions, we use the special addition rule detailed in [10] at the two most significant positions. We can show that in the main stage, no two successive SD2 additions are performed without doubling A (DBL(S)). The termination stage, at the end of the integer division, avoids the situation where the final |A| is very near to |B|, which would make the convergence of the whole computation very slow. For the details, see [10].
A numerical example of a modular division performed by [Hardware Algorithm 1] is given in Fig. 1.
New Algorithm for Computing Montgomery Multiplication
In order to implement Montgomery multiplication using the same hardware components needed for modular division, we modify [Algorithm 2] so that the multiplier is processed from the most significant digit. Here, instead of halving the intermediate result, we halve the multiplicand modulo $M$. We present the modified Montgomery algorithm. Here, $y_n = 0$.
[Algorithm 4] (Modified Montgomery Multiplication)
U := 0; V := X; 
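One way to realize the most-significant-digit-first scheme described above is sketched below on ordinary integers. The loop bounds, the order of the accumulate and halve operations, and the function names are assumptions of this example; only the initialization U := 0, V := X and the convention $y_n = 0$ are taken from the text. Each digit $y_i$ is paired with the multiplicand halved $n - i$ times, so the sum equals $X \cdot Y \cdot 2^{-n}$ modulo $M$.

```python
def halve_mod(v: int, m: int) -> int:
    """Return v/2 mod m for odd m and 0 <= v < m (hypothetical helper)."""
    return (v + m) // 2 if v & 1 else v // 2

def montgomery_mul_msb_first(x: int, y: int, m: int, n: int) -> int:
    """Return X * Y * 2^(-n) mod M, scanning Y from digit y_n down to y_0."""
    assert m % 2 == 1 and 0 <= x < m and 0 <= y < m < (1 << n)
    u, v = 0, x
    for i in range(n, -1, -1):        # n + 1 iterations; y_n is always 0
        y_i = (y >> i) & 1
        u = (u + y_i * v) % m         # accumulate the current multiple of X
        v = halve_mod(v, m)           # multiplicand becomes X * 2^-(n-i+1)
    return u
```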
A Hardware Algorithm for Modular Multiplication/Division
In this subsection, the new hardware algorithm is presented. The algorithm has three modes of operation. In mode = 0, the algorithm performs modular division. In mode = 1, it performs Montgomery multiplication, and in mode = 2, it performs ordinary modular multiplication. In order to remove the time-consuming SD2-to-binary conversion in each multiplication/division, the input operands X and Y, as well as the output result Z, are expressed as $(n+1)$-digit SD2 integers. Variable D is used in division mode to indicate the number of digits by which the divisor is shifted in order to align its most significant digit with that of the dividend. When performing Montgomery multiplication and ordinary modular multiplication, variable D is used to implement the 'for' loop. In this case, D is required to be an $(n+2)$-bit variable, which is initialized to the value $2^{n+1}$ and shifted to the right at each iteration step until it is equal to 1. The variables A, U and V are used in both multiplication modes, i.e. mode = 1 and mode = 2, to store the multiplier, the partial product and the multiplicand, respectively.
[Hardware Algorithm 2] (Modular Mult./Div.)
Inputs: mode ∈ {0, 1, 2}
        $M$ : $2^{n-1} < M < 2^n$ and odd (prime when mode = 0)
        ...
Step 1: ...
Step 2-1: [Normalization of A]
  while $p_n = 0$ and $[a_n a_{n-1}] \neq [1\,0]$ and $[a_n a_{n-1}] \neq [\bar{1}\,0]$ do
    if $[a_n a_{n-1} a_{n-2}] = [1\,\bar{1}\,1]$ or $[a_n a_{n-1} a_{n-2}] = [0\,1\,1]$ then
      $[a_n a_{n-1} a_{n-2}] := [1\,0\,\bar{1}]$;
    elseif $[a_n a_{n-1} a_{n-2}] = [\bar{1}\,1\,\bar{1}]$ or $[a_n a_{n-1} a_{n-2}] = [0\,\bar{1}\,\bar{1}]$ then
      $[a_n a_{n-1} a_{n-2}] := [\bar{1}\,0\,1]$;
    else
      ...

Since U, which stores the partial product, is initialized to 0, instead of doubling U and then adding the multiplicand, we proceed in reverse order. For the case $a_n = 0$ or $[a_n a_{n-1}] = [1\,\bar{1}]$ or $[\bar{1}\,1]$, temporary variable S is set to A in order to double A by means of DBL(S) and U by MDBL(U, M). For the cases $[a_n a_{n-1}] = [1\,1]$, $[1\,0]$, $[\bar{1}\,\bar{1}]$ or $[\bar{1}\,0]$, we need to perform the operations U := MADD(U, −q · V, M) and U := MDBL(U, M). However, since the modular reduction hardware component is shared by both operations to reduce hardware resources, these operations can never be executed in the same iteration. Therefore, we split the calculation of MDBL(U, M) and MADD(U, −q · V, M) into two different iteration steps using simple control signals. For this, we set the temporary variable S to the value of A and perform U := MADD(U, −q · V, M) depending on the value of $a_n$. Note that in this case $s_n \neq 0$ and $[s_n s_{n-1}]$ is neither $[1\,\bar{1}]$ nor $[\bar{1}\,1]$; therefore no action is performed on A, D and V. At the end of the loop we set the most significant position of A to 0. In this way, in the next iteration, A is shifted to the left by DBL(A) and U is doubled modulo M by MDBL(U, M). Since we reverse the order of doubling the partial product and adding the multiplicand, MDBL(U, M) is not performed in the last iteration.
A Multiplier/Divider for Modular Arithmetic
An n-bit modular multiplier/divider based on [Hardware Algorithm 2] consists of seven registers for storing A, B, P, D, U, V and M, and a combinational circuit part. A block diagram of the modular multiplier/divider is given in Fig. 2 .
We assume that each iteration of the 'while' loops is performed in one clock cycle. Step 2-2, the correction step (Step 3), and the selection of the output (Step 4) take one clock cycle each.
The combinational circuit part of the multiplier/divider (for Steps 2 and 3) mainly consists of an SD2 adder (with an operand negator), a modular adder (with an operand negator), a modular halving circuit, an SD2 negator and selectors. The modular adder consists of two SD2 adders where one of these is simpler. The modular doubling circuit and the modular halving circuit consist of simpler SD2 adders.
The depth of the combinational circuit part is a constant independent of n, and therefore, the length of the clock cycle is constant independent of n. The modular multiplier/divider has a bit-slice structure and is suitable for VLSI implementation.
The amount of hardware of the modular multiplier/divider is proportional to n. The hardware components used in this architecture are practically the same as those needed in the construction of the individual modular divider. Only a slight increase of the hardware is required to enable the computation of Montgomery multiplication and the ordinary modular multiplication. This little extra hardware is due to the selectors of Step 1 and Step 4 and the additional logic gates to select the different operations in Step 2-3.
In division mode, we can show from the discussion in [10] that in Step 2-3, no two successive clock cycles are executed without doubling A (DBL(S )) in the main stage, and no more than three cycles are executed in the termination stage. We can also show that if DBL(A) is not performed in an execution of Step 2-1 (normalization of A), DBL(A) must be performed in the next execution of Step 2-1. Hence, the number of clock cycles executed in Step 2 is at least 2n + 1 and at most about 3n depending on the operands.
Montgomery multiplication is performed in Step 2-3 in exactly n + 1 clock cycles. Ordinary modular multiplication is performed in at least n + 1 and at most 2n + 1 clock cycles. This varies with the multiplier.
Concluding Remarks
We have proposed a hardware algorithm for modular multiplication/division. We first improved the hardware algorithm for modular division proposed in [10] to reduce hardware requirements. Then, we modified the Montgomery multiplication algorithm and the ordinary modular multiplication algorithm to accelerate them by using the SD2 redundant representation for internal computation, and enabled these operations to share almost all the hardware components with those required for computing a single modular division.
Although two operations are available for calculating modular multiplication, Montgomery multiplication has an outstanding performance when long chained multiplications are required. Therefore, in applications such as modular exponentiation, multiplications can be performed in the Montgomery representation to accelerate the calculations. The inclusion of the ordinary modular multiplication in the same hardware and the expansion of the inputs by one digit position allow not only for performing a few single modular multiplications but also for transforming the operands into the Montgomery domain by multiplying the operand by the value $2^n$, without the need for any precomputed constants.
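The interplay between the two multiplications can be illustrated with a value-level model. In this sketch `mont_mul` stands in for the Montgomery multiplication mode and is computed directly with Python's modular inverse, while `to_montgomery` uses an ordinary modular multiplication by $2^n$; all function names are hypothetical and the left-to-right square-and-multiply loop is an assumption chosen for the example.

```python
def to_montgomery(x: int, m: int, n: int) -> int:
    """Ordinary modular multiplication by 2^n: X -> X * 2^n mod M."""
    return (x << n) % m

def mont_mul(a: int, b: int, m: int, n: int) -> int:
    """Value-level model of Montgomery multiplication: A * B * 2^(-n) mod M."""
    return (a * b * pow(2, -n, m)) % m

def from_montgomery(a: int, m: int, n: int) -> int:
    """Montgomery multiplication by 1 removes the factor 2^n."""
    return mont_mul(a, 1, m, n)

def mod_exp(x: int, e: int, m: int, n: int) -> int:
    """X^e mod M with every chained multiplication done in the Montgomery domain."""
    acc = to_montgomery(1, m, n)
    base = to_montgomery(x, m, n)
    for bit in bin(e)[2:]:            # exponent bits, most significant first
        acc = mont_mul(acc, acc, m, n)
        if bit == "1":
            acc = mont_mul(acc, base, m, n)
    return from_montgomery(acc, m, n)
```

Only the entry and exit conversions use operations other than Montgomery multiplication, and no constant such as $2^{2n} \bmod M$ has to be precomputed.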
Modular divisions can be used jointly with Montgomery multiplications to accelerate modular exponentiation. The classical methods decompose the calculation of modular exponentiation into a series of modular multiplications. By representing the exponent as an SD2 number, it is possible to decompose the calculation into either a series of modular multiplications or a mixture of multiplications and divisions. As the number of multiplications and divisions is proportional to the weight of the exponent, acceleration can be accomplished by representing the exponent as a minimum-weight SD2 number. Finally, it is worth noting that these operations can still be computed using the Montgomery representation as described in [5].
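A possible software illustration of this idea uses the non-adjacent form (NAF), a well-known minimum-weight SD2 recoding, as the exponent representation. The choice of NAF, the function names, and the use of Python's `pow(x, -1, m)` to stand in for a modular division by x are assumptions of this sketch; the text does not prescribe a particular recoding.

```python
from typing import List

def naf(e: int) -> List[int]:
    """Non-adjacent form of e >= 0, least significant digit first."""
    digits = []
    while e > 0:
        if e & 1:
            d = 2 - (e % 4)     # 1 if e ≡ 1 (mod 4), -1 if e ≡ 3 (mod 4)
            e -= d
        else:
            d = 0
        digits.append(d)
        e >>= 1
    return digits

def mod_exp_sd2(x: int, e: int, m: int) -> int:
    """x^e mod m for prime m and x not divisible by m.

    A digit 1 costs a multiplication by x, a digit -1 a division by x.
    """
    x_inv = pow(x, -1, m)       # models the effect of dividing by x modulo m
    acc = 1
    for d in reversed(naf(e)):  # most significant digit first
        acc = acc * acc % m
        if d == 1:
            acc = acc * x % m
        elif d == -1:
            acc = acc * x_inv % m
    return acc
```

For e = 7 the binary expansion has weight 3, whereas the recoding `naf(7)` has only two nonzero digits.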
