Abstract. We present a novel technique which allows a virtual increase of the bitlength of a crypto-coprocessor in an efficient and elegant way. The proposed algorithms assume that the coprocessor is equipped with a special modular multiplication instruction. This instruction, called MultModDiv(A, B, N), computes both A * B mod N and $\lfloor (A * B)/N \rfloor$. In addition to the doubling algorithm, we also present two conceivable, economical implementations of the MultModDiv instruction: one hardware and one software realization. The hardware realization of the MultModDiv instruction has the same performance as the modular multiplication presented in the paper. The software realization requires two calls of the modular multiplication instruction. Our most efficient algorithm needs only six calls to an n-bit MultModDiv instruction to compute a modular 2n-bit multiplication. Special variants of our algorithm, e.g., squaring, require fewer calls.
Introduction
Fast modular multiplication algorithms have been extensively studied [Ba, DQ, HP1, HP2, Knu, Mo, Om, Pai, Q, Sed, WQ, Wa]. This is due to the fact that large integer arithmetic is essential for public-key cryptography. Recently, we have seen some progress in integer factorization [C+], which demands larger RSA bit lengths. On the other hand, low-cost and low-power devices (e.g., smartcards, PDAs, cellular phones, etc.) have to use hardware which does not provide sufficient bitlengths.
Unfortunately, these two requirements place a burden on the system issuer, e.g., the card industry. The source of this burden is the fact that, say, 2048-bit RSA cannot be handled efficiently on a 1024-bit device. Only with some workaround does this problem become a manageable task. Namely, as is now commonly known, one can use the Chinese Remainder Theorem for the RSA signature, see [CQ]. To keep the RSA verification also relatively simple, most often the fourth Fermat number is used as public exponent. Only recently was it shown how to efficiently reduce such modular 2048-bit multiplications to 1024-bit modular multiplications, see [HP1, HP2, Pai]. Paillier [Pai] initiated this doubling research topic and formulated the following research problem:
Problem. Find an nk-bit modular multiplication algorithm using a minimal number of n-bit modular operations.
His algorithm needs nine modular multiplications for the case k = 2, see [Pai]. In this paper we will provide an answer to his question by presenting a novel doubling algorithm. Our most general and most efficient algorithm needs only six n-bit modular operations to compute a 2n-bit modular multiplication. The idea for our family of algorithms is based on the assumption that the coprocessor is equipped with a special modular multiplication instruction. This instruction is called MultModDiv and is defined in the next section. An optimal realization is clearly achieved using an enhanced hardware modular multiplication instruction. Nevertheless, an efficient software realization of this instruction is possible. The software realization requires two calls to the modular multiplication instruction. Both realizations of this MultModDiv instruction will be presented.
The present paper is organized as follows: The next section gives the necessary definitions of MultModDiv. Section 3 explains our basic doubling algorithm, an enhanced version and also our special purpose variants. In section 4 we introduce the simple software emulation of the MultModDiv instruction. Finally, in section 5 we show how to realize the MultModDiv instruction in hardware.
Preliminaries

The Instructions MultMod and MultModInit n
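The following definition is the usual modular multiplication.
Definition 1. For numbers A, B and N, N > 0, the MultMod instruction is defined as $R = \mathrm{MultMod}(A, B, N)$ with $R := (A * B) - \lfloor (A * B)/N \rfloor * N$.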
The following extension of the modular multiplication is already a feature of today's existing crypto coprocessors.
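Definition 2. For a fixed integer n and numbers A, B, C and N, N > 0, the MultModInit n instruction is defined as $R = \mathrm{MultModInit}_n(A, B, C, N)$ with $R := (A * B + C * 2^n) - \lfloor (A * B + C * 2^n)/N \rfloor * N$.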
The Instructions MultModDiv and MultModDivInit n
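The following definition is a natural extension of the usual modular multiplication.
Definition 3. For numbers A, B and N, N > 0, the MultModDiv instruction is defined as $(Q, R) = \mathrm{MultModDiv}(A, B, N)$ with $Q := \lfloor (A * B)/N \rfloor$ and $R := (A * B) - Q * N$.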
Definition 4. For a fixed integer n and numbers A, B, C and N , N > 0, the MultModDivInit n instruction is defined as
$(Q, R) = \mathrm{MultModDivInit}_n(A, B, C, N)$ with $Q := \lfloor (A * B + C * 2^n)/N \rfloor$ and $R := (A * B + C * 2^n) - Q * N$.
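The four instructions of this section can be modelled in a few lines of Python. This is only a reference model under stated assumptions: the function names and the explicit `nbits` parameter (standing for the fixed integer n) are hypothetical, and Python's arbitrary-precision `divmod` stands in for the coprocessor.

```python
def mult_mod(a, b, n):
    """MultMod: remainder of a*b on division by n (n > 0)."""
    return (a * b) % n

def mult_mod_init(a, b, c, n, nbits):
    """MultModInit_n: remainder of a*b + c*2^nbits on division by n."""
    return (a * b + (c << nbits)) % n

def mult_mod_div(a, b, n):
    """MultModDiv: floor quotient and remainder of a*b on division by n."""
    return divmod(a * b, n)

def mult_mod_div_init(a, b, c, n, nbits):
    """MultModDivInit_n: floor quotient and remainder of a*b + c*2^nbits."""
    return divmod(a * b + (c << nbits), n)

# 7 * 9 = 63 = 6 * 10 + 3
print(mult_mod_div(7, 9, 10))             # (6, 3)
# 7 * 9 + 2 * 2^4 = 95 = 9 * 10 + 5
print(mult_mod_div_init(7, 9, 2, 10, 4))  # (9, 5)
```

In every case the outputs satisfy Q * N + R = A * B (+ C * 2^n) with 0 ≤ R < N, which is the property the doubling algorithms rely on.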
The Doubling Algorithm
Modular Multiplication without Initialization
We will start with the easiest of our algorithms, which needs 7 MultModDiv instructions on an n-bit processor.
Theorem 1. There exists an algorithm to compute A * B mod N using seven MultModDiv instructions of length n, provided that $2^{2n-1} \le N < 2^{2n}$ and $0 \le A, B < N$.
Proof. We will first present the algorithm.
The result of the algorithm is indeed congruent to A * B modulo N. This can easily be seen from the following, where we use $Z = 2^n$ as abbreviation.
The two congruences above are based on the fact that $N_t Z \equiv -N_b \bmod N$. Apart from the fact that this result still has to be reduced modulo N, this completes the proof.
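As an illustration, the following Python sketch shows one seven-call decomposition built only on the congruence $N_t Z \equiv -N_b \bmod N$, with the operands split as $A = A_t Z + A_b$, $B = B_t Z + B_b$ and $N = N_t Z + N_b$. The step order and all names (`double_mult_mod`, `mult_mod_div`, `nbits`) are assumptions of this sketch and not necessarily the authors' exact listing. MultModDiv is modelled with Python's `divmod`, which floors and therefore copes with negative operands, and the final reduction is done directly on the wide value.

```python
def mult_mod_div(a, b, n):
    # Reference model of MultModDiv: floor quotient and remainder of a*b by n.
    return divmod(a * b, n)

def double_mult_mod(a, b, n, nbits):
    """A * B mod N for 2n-bit values from seven MultModDiv calls (sketch)."""
    z = 1 << nbits
    a_t, a_b = divmod(a, z)
    b_t, b_b = divmod(b, z)
    n_t, n_b = divmod(n, z)
    q1, r1 = mult_mod_div(b_t, z, n_t)              # B_t*Z = Q1*N_t + R1
    q2, r2 = mult_mod_div(q1, n_b, z)               # Q1*N_b = Q2*Z + R2
    q3, r3 = mult_mod_div(a_t, r1 - q2 + b_b, n_t)  # operand may be negative
    q4, r4 = mult_mod_div(a_b, b_t, n_t)
    q5, r5 = mult_mod_div(q3 + q4, n_b, z)          # operand may be negative
    q6, r6 = mult_mod_div(a_t, r2, z)
    q7, r7 = mult_mod_div(a_b, b_b, z)
    q = r3 + r4 - q5 - q6 + q7
    r = r7 - r6 - r5
    # Q*Z + R is congruent to A*B mod N; reduce it for comparison.
    return (q * z + r) % n

# Example with n = 8 bits, so N must lie in [2^15, 2^16).
a, b, n = 12345, 43210, 50013
assert double_mult_mod(a, b, n, 8) == (a * b) % n
```

Each replacement of $N_t Z$ by $-N_b$ costs one extra call with modulus $2^n$, which is where the count of seven comes from in this decomposition.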
Practical Implementation Issues
1. Observe that in steps three and five negative numbers may occur. This can be resolved by the fact that for positive numbers A, B and N the equation
2. It is possible that the intermediary output (Q, R) is not reduced, i.e., $0 \le R < 2^n$ and $0 \le Q < N_t$ is not fulfilled. In this case one has to do a final reduction: first, do (Q, R) n ) until R is reduced modulo $2^n$.
3. Using two parallel n-bit processors one only needs the time of four MultModDiv instructions.
4. If the given modulus N has an odd bitlength, then one has to compute with 2 * N.
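For the negative operands mentioned in item 1, one identity that serves the purpose can be sketched in Python: if (Q, R) = MultModDiv(A, B, N) for non-negative A and B, then A * (−B) = (−Q − 1) * N + (N − R) whenever R > 0. The helper names below are hypothetical, and the identity is offered as a plausible reading of the remark rather than the authors' exact equation.

```python
def mult_mod_div_pos(a, b, n):
    """Model of an instruction that only accepts non-negative operands."""
    assert a >= 0 and b >= 0 and n > 0
    return divmod(a * b, n)

def mult_mod_div_neg(a, b, n):
    """(Q, R) with Q*n + R == a * (-b) and 0 <= R < n, from one positive call."""
    q, r = mult_mod_div_pos(a, b, n)
    if r == 0:
        return -q, 0
    return -q - 1, n - r

# 7 * 9 = 63 = 6 * 10 + 3, hence -63 = -7 * 10 + 7
print(mult_mod_div_neg(7, 9, 10))  # (-7, 7)
```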
Modular Multiplication with Initialization
By using a MultModDivInit n instruction we can reduce the number of steps to six.
Theorem 2. There exists an algorithm to compute A * B mod N using five MultModDiv and one MultModDivInit n instruction of length n, provided that $2^{2n-1} \le N < 2^{2n}$ and $0 \le A, B < N$.
Proof. We first present the algorithm.
Enhanced Basic Doubling Algorithm:
The result of the algorithm is indeed congruent to A * B modulo N. This can be seen from the following, where we use $Z = 2^n$ as abbreviation.
The three congruences above are based on the fact that $N_t Z \equiv -N_b \bmod N$. Apart from the fact that this result still has to be reduced modulo N, this completes the proof.
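A six-call decomposition of this kind can be sketched in Python as follows. The step order and the names are assumptions of this sketch, not necessarily the authors' exact listing; it uses only the congruence $N_t Z \equiv -N_b \bmod N$, and the single MultModDivInit call folds $R_1 Z - Q_1 N_b$ into one division by $N_t$. Both instructions are modelled with Python's flooring `divmod`.

```python
def mult_mod_div(a, b, n):
    # Reference model: floor quotient and remainder of a*b by n.
    return divmod(a * b, n)

def mult_mod_div_init(a, b, c, n, nbits):
    # Reference model: floor quotient and remainder of a*b + c*2^nbits by n.
    return divmod(a * b + (c << nbits), n)

def double_mult_mod_init(a, b, n, nbits):
    """A * B mod N from five MultModDiv and one MultModDivInit call (sketch)."""
    z = 1 << nbits
    a_t, a_b = divmod(a, z)
    b_t, b_b = divmod(b, z)
    n_t, n_b = divmod(n, z)
    q1, r1 = mult_mod_div(a_t, b_t, n_t)
    q2, r2 = mult_mod_div_init(n_b, -q1, r1, n_t, nbits)  # negative operand
    q3, r3 = mult_mod_div(a_t, b_b, n_t)
    q4, r4 = mult_mod_div(a_b, b_t, n_t)
    q5, r5 = mult_mod_div(a_b, b_b, z)
    q6, r6 = mult_mod_div(q2 + q3 + q4, n_b, z)           # may be negative
    q = r2 + r3 + r4 + q5 - q6
    r = r5 - r6
    # Q*Z + R is congruent to A*B mod N; reduce it for comparison.
    return (q * z + r) % n

a, b, n = 12345, 43210, 50013   # n = 8 bits, N in [2^15, 2^16)
assert double_mult_mod_init(a, b, n, 8) == (a * b) % n
```

In this ordering, steps one, three, four and five are mutually independent, which fits the remark that two parallel processors need the time of three instructions.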
Practical Implementation Issues
1. Observe that in steps two and six negative numbers may occur. This can be resolved as shown above.
2. It is possible that the intermediary output (Q, R) is not reduced. This can be resolved as shown above.
3. Using two parallel n-bit processors one only needs the time of three MultModDiv instructions.
4. Again, if the given modulus N has an odd bitlength, then one has to compute with 2 * N.
Optimized Special Purpose Variants
Now the basic strategy of our algorithms should be clear. Therefore, we will present the results for special purpose variants.
Squaring
Theorem 3. There exists an algorithm to compute $A^2 \bmod N$ using six MultModDiv instructions of length n, provided that $2^{2n-1} \le N < 2^{2n}$ and $0 \le A < N$.
If we consider the algorithm of section 3.2 for the case A = B, we see that steps three and four are identical. Therefore, we get the following result:
Theorem 4. There exists an algorithm to compute $A^2 \bmod N$ using four MultModDiv and one MultModDivInit n instruction of length n, provided that $2^{2n-1} \le N < 2^{2n}$ and $0 \le A < N$.
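The saving for A = B can be sketched in Python. The sketch assumes a step order in which steps three and four compute the cross products $A_t B_b$ and $A_b B_t$; for a square they coincide, so one call is made and its outputs are counted twice. All names are hypothetical and the instructions are modelled with `divmod`.

```python
def mult_mod_div(a, b, n):
    # Reference model: floor quotient and remainder of a*b by n.
    return divmod(a * b, n)

def mult_mod_div_init(a, b, c, n, nbits):
    # Reference model: floor quotient and remainder of a*b + c*2^nbits by n.
    return divmod(a * b + (c << nbits), n)

def double_square_mod(a, n, nbits):
    """A^2 mod N from four MultModDiv and one MultModDivInit call (sketch)."""
    z = 1 << nbits
    a_t, a_b = divmod(a, z)
    n_t, n_b = divmod(n, z)
    q1, r1 = mult_mod_div(a_t, a_t, n_t)
    q2, r2 = mult_mod_div_init(n_b, -q1, r1, n_t, nbits)
    q3, r3 = mult_mod_div(a_t, a_b, n_t)       # shared cross product
    q5, r5 = mult_mod_div(a_b, a_b, z)
    q6, r6 = mult_mod_div(q2 + 2 * q3, n_b, z)
    q = r2 + 2 * r3 + q5 - q6
    r = r5 - r6
    return (q * z + r) % n

a, n = 43210, 50013   # n = 8 bits, N in [2^15, 2^16)
assert double_square_mod(a, n, 8) == (a * a) % n
```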
Precomputation
If the factor B is known in advance (e.g., square and multiply for exponentiation), then the first and second computation of the algorithm of section 3.1 can be carried out in advance. Therefore, the multiplication can be done in five steps.
Software Realization of the MultModDiv and MultModDivInit n Instructions
Theorem 5. There exists an algorithm to compute MultModDiv(A, B, N) using two MultMod instructions, for operands A and B reduced modulo N.
Proof. We present the simple algorithm.
Simulation of MultModDiv
This completes the proof.
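A two-call emulation of MultModDiv can be sketched in Python as follows. The choice of the second modulus N + 1 is an assumption of this sketch, made because it meets the stated count of two MultMod calls; it is not necessarily the authors' exact listing. Since N ≡ −1 mod (N + 1), writing A * B = Q * N + R gives A * B ≡ R − Q mod (N + 1), so Q ≡ R − R' mod (N + 1), where R' is the second remainder. For reduced operands, Q < N, so this residue determines Q uniquely.

```python
def mult_mod(a, b, n):
    """Model of the plain MultMod instruction."""
    return (a * b) % n

def mult_mod_div_sim(a, b, n):
    """Emulate MultModDiv(a, b, n) with two MultMod calls (0 <= a, b < n)."""
    assert 0 <= a < n and 0 <= b < n
    r = mult_mod(a, b, n)            # remainder modulo N
    r_alt = mult_mod(a, b, n + 1)    # remainder modulo N + 1
    q = (r - r_alt) % (n + 1)        # add N + 1 once if the difference is negative
    return q, r

# 7 * 9 = 63 = 6 * 10 + 3
print(mult_mod_div_sim(7, 9, 10))  # (6, 3)
```

The bound Q < N is what makes the residue modulo N + 1 sufficient; for non-reduced operands the quotient can exceed N and the sketch no longer applies as written.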
In a similar way the MultModDivInit n instruction is emulated by the MultModInit n instruction. Proof. We present the algorithm.
Simulation of MultModDivInit
The proof is derived from the former one; we leave the modifications to the reader. However, we note that bounding the size of the quotient Q is the crucial point.
Both algorithms can be extended to algorithms also working for non-reduced A, B and C. This is necessary for our doubling algorithms.
Hardware Realization of the MultModDiv and MultModDivInit n Instructions
We will now sketch how an algorithm for the MultMod instruction can be extended into an algorithm for the MultModDiv instruction. We first consider the textbook MultMod implementation.
In the paper's full version we will actually show how the former algorithm can be simply integrated into the modular multiplication algorithm due to H. Sedlak [Sed] .
The MultModDivInit n and MultModInit n are derived from the former ones essentially by exchanging the step Z := 0 with Z := C.
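The idea of this section can be sketched in Python with a bit-serial, most-significant-bit-first multiplier. This is a generic textbook shift-and-add loop under assumed names, not the authors' listing and not Sedlak's algorithm. The accumulator Z is reduced after every step, and the number of subtracted multiples of N is shifted into Q, so the quotient comes out alongside the remainder. Starting from Z := C instead of Z := 0 gives the Init variant.

```python
def mult_mod_div_bitserial(a, b, n, nbits, c=0):
    """Return (Q, R) with Q*n + R == a*b + c*2^nbits and 0 <= R < n.

    a is scanned from its most significant bit; it must fit in nbits bits.
    """
    assert 0 <= a < (1 << nbits) and n > 0
    z = c   # Z := 0 for MultModDiv, Z := C for MultModDivInit
    q = 0
    for i in reversed(range(nbits)):
        z = 2 * z + ((a >> i) & 1) * b
        # Hardware would subtract n repeatedly; with z < n and b < n
        # at most two subtractions are needed per step.
        k, z = divmod(z, n)
        q = 2 * q + k
    return q, z

# 13 * 11 = 143 = 20 * 7 + 3
print(mult_mod_div_bitserial(13, 11, 7, 4))       # (20, 3)
# 13 * 11 + 2 * 2^4 = 175 = 25 * 7
print(mult_mod_div_bitserial(13, 11, 7, 4, c=2))  # (25, 0)
```

The loop invariant is that the partial product processed so far equals Q * N + Z; doubling it and adding the next multiple of B doubles Q and adds the new reduction count, which is why the quotient costs no extra iterations.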
Conclusion
In this paper we have introduced new efficient algorithms to compute 2n-bit modular multiplications using only n-bit modular multiplications. Using the MultModDiv and MultModDivInit n instructions we were able to improve the results presented by Paillier [Pai]. The question of the minimal number of multiplications is still open, as we currently have no proof of the optimality of our algorithm.
