Abstract. We present a novel approach for computing 2n-bit Montgomery multiplications with n-bit hardware Montgomery multipliers. Smartcards are usually equipped with such hardware Montgomery multipliers; however, due to progress in factoring algorithms, the recommended bit length of public-key schemes such as RSA is steadily increasing, making the hardware quickly obsolete. Thanks to our double-size technique, one can re-use the existing hardware while keeping pace with the latest security requirements. Unlike other double-size techniques, which rely on classical n-bit modular multipliers, our idea is tailored to take advantage of n-bit Montgomery multipliers. Thus, our technique extends the lifetime of existing products without compromising security.
Introduction
The algorithm proposed by Montgomery to calculate modular multiplications [6], usually referred to as the "Montgomery multiplication" technique, is extensively used in practical implementations of public-key cryptosystems such as RSA [10]. In particular, Montgomery multiplications are not affected by the delays that carries commonly introduce in other strategies for computing modular multiplications. As a consequence, Montgomery's approach is very effective for high-performance hardware implementations of modular multiplications. Low-end devices such as smartcards can benefit from crypto-coprocessors implementing Montgomery multiplications [7], which can drastically reduce the time necessary to encrypt and decrypt data, or to sign and verify signatures. But such hardware accelerators suffer from an important restriction: their operand size is limited [8]. Because of progress in integer factorization [11], official security institutions are slowly but surely moving their recommendation from 1024-bit to 2048-bit key sizes for RSA; unfortunately, the latter bit length is not supported by many crypto-coprocessors. This problem has motivated many studies on double-size modular multiplication techniques using single-size hardware multipliers. On the one hand, thanks to the Chinese Remainder Theorem, private computations (decryption or signature generation) require only single-size multiplications for computing a double-size decryption or signature generation [9]. On the other hand, in the case of public computations, the Chinese Remainder Theorem is of no help, and double-size modular multiplications are needed.
Paillier initiated the work on double-size multiplications [8], and showed how to efficiently compute a kn-bit classical modular multiplication with n-bit classical modular multiplication units. Later, Fischer et al. optimized Paillier's scheme for the 2n-bit case [2]. Finally, Chevallier-Mames et al., also concentrating on the case of 2n-bit multiplications, showed further improvements in the general case and when the modulus has a special form [1]. We note a recurring problem in these techniques: they are based on classical modular multipliers, and as such do not exploit the high-performance Montgomery multipliers with which smartcards are usually equipped. This is a serious limitation of these techniques, which cannot fully take advantage of the available hardware.
In this paper, we propose a technique for computing 2n-bit Montgomery multiplications with n-bit Montgomery multiplication units. Firstly, we define the notion of quotients of n-bit Montgomery multiplications; indeed, such quotients are necessary to calculate 2n-bit Montgomery remainders. We consider two types of settings and, in each case, propose efficient solutions to compute the quotients. In the first setting, we assume that we have to re-use an existing n-bit Montgomery multiplier, and that we cannot modify it. In this case, we show how to emulate the calculation of the quotient in software with two calls to the n-bit Montgomery multiplier. In the second setting, the modification of the hardware Montgomery multiplier is allowed, but still restricted to n-bit operands. We explain how to modify the circuitry with minimal changes in order to calculate the quotients along with their remainders. In addition to Montgomery quotients, our double-size technique also requires a new representation of 2n-bit integers, tailored for a better use of Montgomery multipliers. Indeed, to satisfy the requirements of the Montgomery multipliers, the moduli must be odd and greater than $2^{n-1}$. Our representation not only achieves this, but also allows an efficient conversion from/to the standard binary representation of integers.
As a result, thanks to our double-size technique, one can compute 2n-bit Montgomery multiplications with available n-bit hardware Montgomery multipliers, allowing the current generation of crypto-coprocessors to survive the shift towards higher security and longer key lengths.
The rest of this paper is organized as follows. In Section 2, we review previous double-size techniques based on classical modular multiplications [2, 1]. In Section 3, we introduce Montgomery multiplications and highlight their limitations. In Section 4, we explain our idea for computing 2n-bit Montgomery multiplications with n-bit Montgomery multipliers. Since our technique requires the n-bit quotients of the Montgomery multiplications, in Section 5 we introduce two approaches to calculate these quotients; the first is well-suited for software implementations, and the second is for hardware implementations. Section 6 shows experimental results and discusses some practical issues. Finally, we conclude in Section 7.
Notation
We let n denote the operand size of the Montgomery/classical modular multiplication units. Capital letters such as A, B, N and M denote 2n-bit integers, and small letters denote the other values, such as n-bit integers.
Known Double Size Techniques
Low-end devices such as smartcards are equipped with crypto-coprocessors to calculate modular multiplications; however, such hardware accelerators have a strict restriction: their operand size is limited. Recently, because of progress in integer factorization [11], official security institutions have been changing their recommended key length for RSA from 1024 bits to 2048 bits. This problem has motivated many studies on double-size modular multiplication using single-size hardware multipliers and only public information [3].
Paillier initiated the work on double-size multiplications [8], and showed how to compute a kn-bit classical modular multiplication with n-bit classical modular multiplication units and public information. Later, Fischer et al. [2] optimized Paillier's scheme for the 2n-bit case. Finally, Chevallier-Mames et al. [1] showed further improvements, also for the case of 2n-bit multiplications.
This section presents the scheme of Fischer et al. rather than that of Chevallier-Mames et al., because the double-size technique we propose in Section 4 is a modification of Fischer et al.'s scheme.
Fischer et al.'s Schemes
In this subsection, we introduce the work of Fischer et al. [2]: how to compute a double-size classical modular multiplication with single-size classical modular multiplication units. Their double-size technique requires not only the remainders but also the quotients of n-bit multiplications to build a 2n-bit remainder.
The equation $xy - q_c w = r_c$ shows the relation between the product of two integers x and y, the modulus w, the quotient $q_c$ and the remainder $r_c$ in the case of classical modular multiplications. The basic idea of classical modular multiplications is to subtract the modulus w from the most significant bits of the product xy, $q_c$ times in total, until the product becomes less than the modulus w; thus, digits of the product are eliminated from left to right.
Fischer et al. assumed that the instruction MultModDiv is available, where MultModDiv computes the remainder and the quotient of n-bit classical modular multiplications. For n-bit positive integers x, y and w, MultModDiv(x, y, w) = $(q_c, r_c)$ with $q_c = \lfloor (xy)/w \rfloor$ and $r_c = xy \pmod{w}$.
The MultModDiv instruction is a natural extension of the classical modular multiplication. If the classical modular multiplication is implemented in hardware, the MultModDiv instruction can be emulated with two calls to the multiplier, or by changing the hardware of the multiplier only slightly.
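As an illustration of the definition above, the following is a minimal reference model of MultModDiv written with Python big integers. The function name `mult_mod_div` and the 8-bit operands are assumptions made for this sketch; it models only the input/output behaviour, not how a coprocessor would compute it.

```python
from typing import Tuple


def mult_mod_div(x: int, y: int, w: int) -> Tuple[int, int]:
    """Reference model: return (q_c, r_c) with x*y = q_c*w + r_c, 0 <= r_c < w."""
    if w <= 0:
        raise ValueError("the modulus must be positive")
    # divmod performs the floor division and the reduction in one step.
    return divmod(x * y, w)


# Worked example with 8-bit operands (n = 8): 200 * 123 = 98 * 251 + 2.
q_c, r_c = mult_mod_div(200, 123, 251)
assert (q_c, r_c) == (98, 2)
assert 200 * 123 - q_c * 251 == r_c
```

The second assertion is exactly the relation $xy - q_c w = r_c$ stated above.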
Algorithm 1. Fischer et al.'s algorithm
Output: AB (mod N );
We introduce an algorithm proposed by Fischer et al. to compute a 2n-bit classical modular multiplication (AB mod N), given 2n-bit integers A, B, N with 0 ≤ A, B < N. First, every 2n-bit integer is divided into n-bit integers that can be handled by the MultModDiv instruction, e.g. $N = n_1 2^n + n_0$.
Their proposed algorithm derives from the congruence $n_1 2^n \equiv -n_0 \pmod{N}$, which holds thanks to the above decomposition, and is described in Algorithm 1.
Montgomery Multiplications
Montgomery multiplications, which are based on a technique to calculate modular multiplication proposed by Montgomery [6] , are widely used in practical implementations of public-key cryptosystems, such as RSA. Montgomery multiplications are very suitable for high-performance hardware implementations of modular multiplications, which are typically one of the most expensive operations in public-key cryptosystems. Most low-end devices such as smart cards have crypto-coprocessors implementing Montgomery multiplications to encrypt and decrypt data, or sign and verify signatures.
In this section, we introduce Montgomery multiplications and describe the problems that occur when one tries to extend the known double-size techniques to n-bit Montgomery multipliers.
Montgomery Multiplication Algorithm
The basic idea of Montgomery multiplications is to replace expensive divisions by cheaper multiplications and additions. Thus the Montgomery algorithm is faster than classical modular multiplication. Figure 1 illustrates the principle of Montgomery multiplications. Unlike classical modular multiplication techniques, Montgomery multiplications add multiples of the modulus w to the product xy starting from the least significant bit (that is, from right to left), and keep the remainder r in the most significant part.
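The right-to-left principle can be sketched with the textbook bit-serial form of Montgomery multiplication. This is an illustrative Python sketch, not a transcription of the paper's algorithm; the name `mult_mon` and the explicit bit-length parameter `n` are assumptions of the sketch, and it presumes 0 ≤ x < 2^n, 0 ≤ y < w and an odd modulus w.

```python
def mult_mon(x: int, y: int, w: int, n: int) -> int:
    """Bit-serial Montgomery multiplication: returns x*y*2^(-n) mod w."""
    if w % 2 == 0:
        raise ValueError("the modulus must be odd")
    r = 0
    for i in range(n):
        r += ((x >> i) & 1) * y  # add the partial product for bit i of x
        if r & 1:                # clear the least significant bit by adding w
            r += w
        r >>= 1                  # exact division by 2, no trial division needed
    # With y < w the loop keeps r below 2w, so one subtraction suffices.
    return r - w if r >= w else r


# Check against the definition x*y*m^(-1) mod w with m = 2^16.
w = 0xC001
assert mult_mon(12345, 40000, w, 16) == (12345 * 40000 * pow(1 << 16, -1, w)) % w
```

Each iteration only adds and shifts, which is why no division by the modulus is ever performed.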
Problems of Previous Techniques
The schemes proposed by Fischer et al. [2] and Chevallier-Mames et al. [1] show how to compute a 2n-bit classical modular multiplication with n-bit classical modular multiplication units. But their schemes cannot take advantage of high-performance Montgomery multipliers for the following two reasons.
Problem 1: Quotient
Their double-size techniques require not only n-bit remainders but also n-bit quotients of multiplications to construct 2n-bit remainders. However, there is no notion of quotient of Montgomery multiplications.
Problem 2: Modulus
The moduli of Montgomery multipliers must be odd because they are restricted to be coprime to the Montgomery constants. However, their double-size techniques allow even moduli. For example, in Algorithm 1, they set the upper n-bit value $n_1$ as a modulus, but $n_1$ can be even.
New Double Size Techniques
We propose a new scheme for computing 2n-bit Montgomery multiplications with the existing coprocessors. On the one hand, Montgomery multiplications are widely implemented on coprocessors for public-key cryptosystems, and produce high-performance hardware modular multiplications for encrypting and decrypting data, or signing and verifying signatures. On the other hand, there is no scheme to compute a double-size Montgomery multiplication efficiently with such coprocessors.
Instruction for Remainders
First, we define the instruction for computing the remainder of Montgomery multiplications in Definition 1.
Instruction for Quotients
Just as the schemes of Fischer et al. [2] and Chevallier-Mames et al. [1] require quotients of n-bit classical modular multiplications, our scheme requires quotients of n-bit Montgomery multiplications to construct a 2n-bit remainder. However, there is no definition of quotients of Montgomery multiplications. To solve this problem, we extend the notion of quotients to the case of Montgomery multiplications. The remainder calculated by Montgomery multiplications is different from the remainder calculated by classical modular multiplications. Indeed, from Definition 1, the following congruence holds: $xy \equiv rm \pmod{w}$, where $m = 2^n$. This congruence means that there is some integer q satisfying $xy - qw = rm$. We call this integer q the quotient of the Montgomery multiplication. Now, we define the instruction to calculate the quotient and the remainder of the Montgomery multiplication in Definition 2.
Definition 2. For numbers $0 \le x, y < w$, $2^{n-1} < w < 2^n$ and $m = 2^n$ with $\gcd(m, w) = 1$, the MultMonDiv instruction is defined as (q, r) = MultMonDiv(x, y, w) with $r := xym^{-1} \pmod{w}$ and q satisfying the equation $xy = qw + rm$.
In Section 5, we will show an algorithm to implement the MultMonDiv instruction and establish its correctness.
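To make the notion of a Montgomery quotient concrete, here is a small reference model of MultMonDiv that follows the defining relations directly, using Python big integers. The name `mult_mon_div_ref` and the explicit parameter `n` are assumptions of this sketch; it uses a full-width modular inverse and division, so it serves as a test oracle rather than as something a coprocessor would execute.

```python
from typing import Tuple


def mult_mon_div_ref(x: int, y: int, w: int, n: int) -> Tuple[int, int]:
    """Reference model: r = x*y*m^(-1) mod w and q with x*y = q*w + r*m, m = 2^n."""
    m = 1 << n
    if w % 2 == 0:
        raise ValueError("w must be odd so that gcd(w, m) = 1")
    r = (x * y * pow(m, -1, w)) % w      # Montgomery remainder
    q, rest = divmod(x * y - r * m, w)   # the division is exact by construction
    assert rest == 0
    return q, r


# Example with n = 4 (m = 16) and w = 9: 5 * 7 = (-5) * 9 + 5 * 16.
assert mult_mon_div_ref(5, 7, 9, 4) == (-5, 5)
```

Note that the quotient can be negative, unlike the quotient of a classical modular multiplication.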
Algorithm 3. Modified Fischer et al.'s Algorithm
Input: $A = a_1 z + a_0 m$, $B = b_1 z + b_0 m$, $N = n_1 z + n_0 m$, with $M = 2^{2n}$;
Representation of 2n-Bit Integers
We define a new representation to divide a 2n-bit integer into two n-bit integers for n-bit Montgomery multipliers.
Definition 3. For a 2n-bit integer X and numbers $m = 2^n$, $2^{n-1} \le z < 2^n$ with $\gcd(m, z) = 1$, the representation is defined as $X = x_1 z + x_0 m$.
The product $n_0 m$ is always even. Therefore z and $n_1$ must be odd whenever N is odd.
Modified Fischer et al.'s Algorithm
Thanks to Definition 3, we extend the scheme of Fischer et al. to the case of Montgomery multiplication in Algorithm 3, which only uses the odd moduli $n_1$ and z. Since n-bit Montgomery multiplications output the n-bit remainder $xym^{-1} \pmod{w}$, where $0 \le x, y < w$, $2^{n-1} < w < 2^n$ and $m = 2^n$, our algorithm outputs the 2n-bit remainder of the Montgomery multiplication $ABM^{-1} \pmod{N}$, where $0 \le A, B < N$, $2^{2n-1} < N < 2^{2n}$ and $M = 2^{2n}$.
Theorem 1. Algorithm 3 computes $ABM^{-1} \pmod{N}$ by calling the n-bit MultMonDiv instruction, provided that $0 \le A, B < N < 2^{2n}$ and $M = 2^{2n}$.
We show the proof of Theorem 1 in Appendix A.
Implementations for Quotients
In Section 4, we defined the quotient of Montgomery multiplications; this quotient is necessary to compute 2n-bit Montgomery multiplications. We consider two types of settings and, in each case, show efficient algorithms to calculate the quotients. In the first setting, we assume that we have to re-use an existing n-bit Montgomery multiplier in software, and cannot modify it. Thus, we assume a pure software implementation of our double-size technique. Section 5.1 shows how to emulate the calculation of the quotient with two calls to the n-bit Montgomery multiplier. In the second setting, modifications of the hardware Montgomery multiplier are allowed, but still restricted to n-bit operands. Section 5.2 explains how to modify the circuitry with minimal changes.
Algorithm 4. MultMonDiv instruction calling the MultMon instruction
Input: x, y, w with $0 \le x, y < w$, $2^{n-1} < w < 2^n$, $m = 2^n$ and $\gcd(w, m) = 1$;
Output: q, r;
Software Approach: Calling Montgomery Multipliers
In this subsection, we introduce Algorithm 4, which emulates the calculation of quotients with two calls to the n-bit Montgomery multiplier. Appendix B shows the proof of Theorem 2.
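The two-call idea can be illustrated as follows. This is a sketch derived only from the defining identity $xy = qw + rm = q'(w + 2^n) + r'm$, not a transcription of Algorithm 4. It assumes that both remainders are fully reduced, that the multiplier accepts the modulus $w + 2^n$ in the second call, and that 0 ≤ x, y < w; under these assumptions $q = r - r' + k(w + 2^n)$ with k equal to 0 or 1, and the range $-2^n < q < w$ selects k. The function names, and the use of `pow` as a stand-in for the hardware multiplier, are hypothetical.

```python
from typing import Tuple


def mult_mon(x: int, y: int, w: int, n: int) -> int:
    """Stand-in for the hardware instruction: x*y*2^(-n) mod w, fully reduced."""
    return (x * y * pow(1 << n, -1, w)) % w


def mult_mon_div_two_calls(x: int, y: int, w: int, n: int) -> Tuple[int, int]:
    """Sketch: recover the Montgomery quotient from two remainder-only calls."""
    m = 1 << n
    r = mult_mon(x, y, w, n)        # remainder for the modulus w
    r2 = mult_mon(x, y, w + m, n)   # remainder for the modulus w + 2^n
    d = r - r2
    # q = d or q = d + (w + m); only one candidate lies in (-m, w).
    q = d if d > -m else d + w + m
    return q, r


q, r = mult_mon_div_two_calls(3, 5, 15, 4)
assert 3 * 5 == q * 15 + r * 16
```

The quotient is obtained with two multiplier calls plus a subtraction, a comparison and at most one addition; no division is involved.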
Hardware Approach: Changing Montgomery Multipliers
This subsection shows that the implementation of the MultMon instruction can be changed in order to directly compute the MultMonDiv instruction. In fact, since the MultMon instruction already has the information about the quotient, Algorithm 5 differs only slightly from the MultMon instruction, which is the standard technique proposed by Montgomery [6]. We just insert Step 2.(c) to calculate the quotient, and output the quotient along with the remainder in Step 4.
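The observation that the multiplier already holds the quotient information can be sketched on the textbook bit-serial loop: the bits deciding whether w is added in each iteration, taken together and negated, form the quotient. The code below is an illustrative sketch under that assumption, not a transcription of Algorithm 5; the name `mult_mon_div_serial` is hypothetical, and it presumes 0 ≤ x < 2^n, 0 ≤ y < w and odd w.

```python
from typing import Tuple


def mult_mon_div_serial(x: int, y: int, w: int, n: int) -> Tuple[int, int]:
    """Bit-serial Montgomery multiplication that also records the quotient."""
    m = 1 << n
    r = 0
    t = 0                        # collects the decisions "w was added at bit i"
    for i in range(n):
        r += ((x >> i) & 1) * y
        if r & 1:
            r += w
            t |= 1 << i          # the one extra step compared with plain MultMon
        r >>= 1
    # After the loop r * 2^n = x*y + t*w, hence x*y = (-t)*w + r*m.
    q = -t
    if r >= w:                   # final reduction changes the quotient by m
        q, r = q + m, r - w
    return q, r


q, r = mult_mon_div_serial(5, 7, 9, 4)
assert 5 * 7 == q * 9 + r * 16
```

Recording `t` costs one bit of storage per iteration, which is consistent with the claim that the hardware change is small.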
Experimental Results

Validation
We implemented 2048-bit Montgomery multiplications and exponentiations on an emulator for smartcards using our proposed technique. Its coprocessor can only handle 1024-bit operands for Montgomery multiplications. Therefore, compared with the assumption in Theorem 2, we make a stricter assumption for getting the quotients: the bit length of the modulus is exactly twice the operand size of the coprocessor. In Appendix C, we show another algorithm for implementing the MultMonDiv instruction under this stricter condition.
Practical Implementation Issues
Representations of 2n-bit Integers
We implemented our technique with w set to $2^n - 1$ for two reasons. First, w should be odd because of the requirements of Montgomery multiplications, and must satisfy $2^{n-1} < w < 2^n$ because of the assumptions in Theorem 2. Second, the conversion to such a representation is easy. We show how to get the representation of 2n-bit moduli for Montgomery multiplications: $N = n_1 2^n + n_0 = n_1'(2^n - 1) + n_0' 2^n$. The values $n_1'$ and $n_0'$ used for Montgomery multiplications are then easy to calculate: $n_1' := 2^n - n_0$ and $n_0' := n_1 - n_1' + 1$.
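The conversion between the binary representation and the representation with the constant $2^n - 1$ can be sketched as follows. The helper names `to_split` and `from_split` are hypothetical. For a nonzero low half the code reproduces the two assignments given above; reducing the first component modulo $2^n$ is an addition of this sketch so that a zero low half is also handled.

```python
from typing import Tuple


def to_split(N: int, n: int) -> Tuple[int, int]:
    """Return (c1, c0) with N = c1*(2^n - 1) + c0*2^n."""
    m = 1 << n
    hi, lo = divmod(N, m)     # binary halves: N = hi*2^n + lo
    c1 = (-lo) % m            # equals 2^n - lo whenever lo != 0
    c0 = hi + lo - c1         # equals hi - c1 + 1 whenever lo != 0
    return c1, c0


def from_split(c1: int, c0: int, n: int) -> int:
    """Inverse conversion back to an ordinary integer."""
    m = 1 << n
    return c1 * (m - 1) + c0 * m


# Example with n = 8 and the odd modulus N = 0xC35B.
c1, c0 = to_split(0xC35B, 8)
assert (c1, c0) == (165, 31)
assert from_split(c1, c0, 8) == 0xC35B
```

Both directions need only a subtraction and an addition on n-bit halves, which is why the conversion is cheap. For an odd N the first component is odd, as Montgomery multipliers require.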
Condition on the Modulus
Unfortunately, it is not easy to satisfy the assumption of Theorem 2, namely that the modulus must be greater than $2^{n-1}$. For example, if w is $2^n - 1$, then $0 < n_1 < 2^n$, so $n_1$ can be smaller than $2^{n-1}$. One option is to modify Algorithm 3 slightly. When $0 < n_1 < 2^{n-1}$, the value $(m - n_1)$ can be used as the modulus in Algorithm 3 instead of $n_1$ to satisfy the assumption. The quotient and the remainder for the modulus $n_1$ can then be computed from those for the modulus $(m - n_1)$ with the following equation: $xy = q(m - w) + rm = (-q)w + (q + r)m$.
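The substitution of the modulus can be checked numerically. In the sketch below, `mult_mon_div_ref` is a hypothetical big-integer reference model of MultMonDiv and `mult_mon_div_small_modulus` applies the identity $xy = q(m - w) + rm = (-q)w + (q + r)m$. Note that the second component $q + r$ satisfies the defining equation for w but is not necessarily reduced to the range [0, w).

```python
from typing import Tuple


def mult_mon_div_ref(x: int, y: int, w: int, n: int) -> Tuple[int, int]:
    """Reference model: (q, r) with x*y = q*w + r*2^n and 0 <= r < w."""
    m = 1 << n
    r = (x * y * pow(m, -1, w)) % w
    return (x * y - r * m) // w, r


def mult_mon_div_small_modulus(x: int, y: int, w: int, n: int) -> Tuple[int, int]:
    """Handle an odd modulus 0 < w < 2^(n-1) through the modulus m - w."""
    m = 1 << n
    q, r = mult_mon_div_ref(x, y, m - w, n)  # m - w is odd and exceeds 2^(n-1)
    return -q, q + r                         # x*y = (-q)*w + (q + r)*m


q, r = mult_mon_div_small_modulus(6, 4, 5, 4)
assert 6 * 4 == q * 5 + r * 16
```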
Handling of Input Value
The input values in Algorithm 3 may violate the assumption ($0 < x, y < 2^n$) of Theorem 2. Since only the Montgomery remainder is affected by this problem, it can be solved using the following fact: if $xy \cdot m^{-1} \equiv r \pmod{w}$, then $(x + im)(y + jm) \cdot m^{-1} \equiv r + jx + iy + ijm \pmod{w}$ holds, where i and j are small integers. Compute (q, r) ← (q + m, r − z), then return (q, r).
Reduction of the intermediate output
In fact, q and r always lie in a fixed range, $-3 \cdot 2^n < q < 5 \cdot 2^n$ and $-2 \cdot 2^n < r < 2^n$, because of the equations in Algorithm 3: $q = r_3 + r_4 - q_5 - q_6 + q_7$ and $r = -r_5 - r_6 + r_7$ with $-2^n < q_i < 2^n$ and $0 \le r_i < 2^n$. Therefore, it might happen that q and r are outside the area defined by cases A to D. However, we can easily extend our technique to such cases.
Conclusion
We proposed a novel technique for 2n-bit Montgomery multiplications, provided that n-bit Montgomery multiplications are available. We defined the quotient of Montgomery multiplications, which is needed for 2n-bit Montgomery multiplications. Since Montgomery multiplications have already been implemented on many platforms, we proposed one technique to emulate the calculation of Montgomery quotients with the available Montgomery multiplication unit. In addition, we proposed another approach where the implementation of the Montgomery multiplication unit is changed in order to directly calculate the quotient. The approach of calling available units takes two instructions, and the approach of changing the units has roughly the same cost as one instruction. As a result, our proposed techniques calculate 2n-bit Montgomery multiplications by calling only the crypto-coprocessor implementing n-bit Montgomery multiplications, or the instruction for computing remainders and quotients of n-bit Montgomery multiplications. This paper concentrates on the way to compute double-size Montgomery multiplications with a single-size crypto-coprocessor. Therefore, although the scheme of Chevallier-Mames et al. requires fewer calls to the multiplier than that of Fischer et al., we extended the scheme of Fischer et al., which keeps our scheme simple: our proposed representation of 2n-bit integers avoids using even moduli for Montgomery multipliers. A further direction of this research is to optimize the computational cost of 2n-bit Montgomery multiplications, for example by using Chevallier-Mames et al.'s technique.
A Modified Fischer et al.'s Algorithm
We prove Theorem 1 from Section 4, that is, that the output of Algorithm 3 is indeed congruent to $ABM^{-1}$ modulo N, where $0 \le A, B < N$, $2^{2n-1} < N < 2^{2n}$ and $M = 2^{2n}$.
Proof. Firstly, the 2n-bit integers A, B, N are decomposed as $A = a_1 z + a_0 m$, $B = b_1 z + b_0 m$ and $N = n_1 z + n_0 m$,
where z is odd, $2^{n-1} \le z < 2^n$ and $m = 2^n$. Then, we continue as follows.
B Proof of Theorem 2
To prove Theorem 2, which states the correctness of Algorithm 4, we prove the following Lemmas 1, 2 and 3 step by step. Lemma 1 states the range of the quotient. MultMon(x, y, w) and q satisfies the equation $xy = qw + rm$, then $-2^n < q < 2^{n+1}$ holds.
Proof. If w is in the range $2^{n-1} < w < 2^{n+1}$, then we have $qw = xy - rm$ and $-w 2^n < xy - rm < 2^{2n}$. Since w is greater than $2^{n-1}$, Lemma 1 holds.
Lemma 2 states the range of the difference between two different quotients of Montgomery multiplications. Whenever Lemma 1 holds, Lemma 2 also holds. If $(xy - r(w + 2^n) + r'w) \bmod 2^2 < 2$, then $q - q' = \{(xy - rw + r'(w + 2^n)) \bmod 2^2\} \times 2^n$.
Otherwise, we have: $q - q' = \{((xy - rw + r'(w + 2^n)) \bmod 2^2) - 2^2\} \times 2^n$.
Finally, based on Lemma 2, Lemma 3 shows the equation for getting the quotient of Montgomery multiplications.
Lemma 3.
If $0 \le x, y < w$, $2^{n-1} < w < 2^n$, $m = 2^n$, $\gcd(w, m) = 1$, r is provided by MultMon(x, y, w), r' is provided by MultMon(x, y, w + 2^n), q satisfies the equation $xy = qw + rm$ and q' satisfies the equation $xy = q'(w + 2^n) + r'm$, then, setting $\delta = (xy + r'w - r(w + 2^n)) \bmod 2^2$, either $q = \delta(w + 2^n) + r - r'$ or $q = (\delta - 2^2)(w + 2^n) + r - r'$ holds.
Proof. By hypothesis, $qw + rm = q'(w + 2^n) + r'm$. Therefore $q 2^n = (q - q')(w + 2^n) + (r - r')m$ holds. From Lemma 2, we have:
either $q 2^n = \delta 2^n (w + 2^n) + (r - r')m$, or $q 2^n = (\delta - 2^2) 2^n (w + 2^n) + (r - r')m$. Dividing both sides by $2^n = m$ gives the claim.
The proof of Theorem 2 follows from Lemma 3.
C Approach for Quotients of Montgomery Multiplications Based on limited Memories of a Coprocessor
Lemma 2 and Lemma 3 assume $r' = \mathrm{MultMon}(x, y, w + 2^n)$ and a quotient q' that satisfies $xy = q'(w + 2^n) + r'm$. But it is possible that the MultMon instruction cannot handle the modulus $(w + 2^n)$ when it is implemented in hardware with just n-bit memory, or when there are restrictions on the size of the modulus. In this case, we use $(w \pm 2^{n-2})$ rather than $(w + 2^n)$, and one can prove lemmas and theorems similar to those of Section 5. We show algorithms that calculate the quotient of Montgomery multiplications in place of Algorithm 4 when coprocessors have limited memories whose bit lengths are just n bits. Our proposed algorithms are divided into Algorithm 6 and Algorithm 7, because we treat $2^{n-1} < w < 2^{n-1} + 2^{n-2}$ and $2^{n-1} + 2^{n-2} < w < 2^n$ separately. Firstly, we show Algorithm 6, where $2^{n-1} < w < 2^{n-1} + 2^{n-2}$. Secondly, we show Algorithm 7, where $2^{n-1} + 2^{n-2} < w < 2^n$.
