A large set of moduli, for which the speed of bipartite modular multiplication considerably increases, is proposed in this work. By considering state of the art attacks on public-key cryptosystems, we show that the proposed set is safe to use in practice for both elliptic curve cryptography and RSA cryptosystems. We propose a hardware architecture for the modular multiplier that is based on our method. The results show that, concerning the speed, our proposed architecture outperforms the modular multiplier based on standard bipartite modular multiplication. Additionally, our design consumes less area compared to the standard solutions.
Introduction
Public-key cryptography (PKC), a concept introduced by Diffie and Hellman [9] in the mid-1970s, has gained popularity together with the rapid evolution of today's digital communication systems. The best-known public-key cryptosystems are based on factoring, i.e. RSA [20], and on the discrete logarithm problem in a large prime field (Diffie-Hellman, ElGamal, Schnorr, DSA) [14] or on an elliptic curve (ECC/HECC) [2]. Because of the hardness of the underlying mathematical problem, PKC usually deals with large numbers ranging from a few hundred to a few thousand bits in size. Consequently, efficient implementation of the PKC primitives has always been a challenge.
An efficient implementation of the mentioned cryptosystems highly depends on the efficient implementation of modular arithmetic. Namely, modular multiplication forms the basis of modular exponentiation which is the core operation of the RSA cryptosystem. It is also present in many other cryptographic algorithms including those based on ECC and HECC. In particular, if one uses projective coordinates for ECC/HECC, modular multiplication remains the most time consuming operation for ECC. Hence, an efficient implementation of PKC relies on efficient modular multiplication.
Two algorithms for modular multiplication, namely Barrett [3] and Montgomery [15] algorithms are widely used today. Both algorithms avoid multiple-precision divisions, the operation that is considered to be expensive, especially in hardware. The classical modular multiplication algorithm, based on Barrett's reduction, uses single-precision multiplications with a precomputed modulus reciprocal instead of expensive divisions [8] . An algorithm that efficiently combines classical and Montgomery multiplications, both in finite fields of characteristic 2, was first proposed by Potgieter [17] in 2002. Published in 2005, a bipartite modular multiplication (BMM) by Kaihara and Takagi [11] extended this approach to the ring of integers.
In this work, we propose a large set of moduli for which the intermediate quotient evaluation in both the Barrett and Montgomery algorithms basically comes for free. Therefore, the speed of bipartite modular multiplication, where the Barrett and Montgomery algorithms are the main ingredients, increases significantly. By considering state-of-the-art attacks on public-key cryptosystems, we show that the proposed set is safe to use in practice for both ECC/HECC and RSA cryptosystems. We propose a hardware architecture for the modular multiplier that outperforms the multiplier based on the standard BMM method.
The remainder of the paper is structured as follows. Section 2 introduces preliminaries. Section 3 describes the proposed method. In Sect. 4 we give the results of hardware implementations and Sect. 5 discusses the security issues. Section 6 concludes.
Preliminaries
In this paper we use the following notation. A multiple-precision $n$-bit integer $A$ is represented in radix-$r$ representation as $A = (A_{n_w-1} \ldots A_0)_r$, where $r = 2^w$; $n_w$ is the number of digits and is equal to $\lceil n/w \rceil$, where $w$ is the digit size; $A_i$ is called a digit and $A_i \in [0, r-1]$.
Classical and Montgomery Modular Multiplication Methods
Given a modulus $M$ and two elements $X, Y \in Z_M$, where $Z_M$ is the ring of integers modulo $M$, the ordinary modular multiplication is defined as:

$$X \times Y \triangleq XY \bmod M.$$
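Let the modulus $M$ be an $n_w$-digit integer, where the radix of each digit is $r = 2^w$. The classical modular multiplication algorithm computes $XY \bmod M$ by interleaving the multiplication and modular reduction phases, as shown in Algorithm 1. The intermediate quotient $q_C$ at step 4 of the algorithm is calculated by integer division, which is considered an expensive operation, especially in hardware. The idea of using the precomputed reciprocal of the modulus $M$ together with simple shift and multiplication operations instead of division originally comes from Barrett [3]. To explain the basic idea, we rewrite the intermediate quotient $q_C$, with $Z$ denoting the intermediate result, as:

$$q_C = \left\lfloor \frac{Z}{M} \right\rfloor = \left\lfloor \frac{\frac{Z}{2^{n+\beta}} \cdot \frac{2^{n+\alpha}}{M}}{2^{\alpha-\beta}} \right\rfloor \ge \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor}{2^{\alpha-\beta}} \right\rfloor = \left\lfloor \frac{\left\lfloor \frac{Z}{2^{n+\beta}} \right\rfloor \mu}{2^{\alpha-\beta}} \right\rfloor = \hat{q}.$$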
The value $\hat{q}$ represents an estimate of the intermediate quotient $q_C$. In most cryptographic applications, the modulus $M$ is fixed over many modular multiplications, and hence the value $\mu = \lfloor 2^{n+\alpha}/M \rfloor$ can be precomputed and reused multiple times. Furthermore, an integer division by a power of 2 is a simple shift operation in hardware. Since $\hat{q}$ is only an estimate, some correction steps have to be performed at the end of the modular multiplication algorithm. In his thesis, Dhem [8] determines the values of $\alpha$ and $\beta$ for which the classical modular multiplication, based on the Barrett reduction algorithm, needs at most one subtraction at the end of the algorithm. The improved Barrett algorithm [8] uses the following parameters: $\alpha = w + 3$ and $\beta = -2$.

Algorithm 1. Classical modular multiplication algorithm.
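The interleaved classical multiplication with a Barrett-style quotient estimate can be sketched in software as follows. This is a minimal illustration, not the hardware algorithm itself: the function name `barrett_modmul` is hypothetical, the operands are assumed to satisfy $0 \le X, Y < M$, and the final correction is written as a loop so that the sketch stays correct without relying on the bound of at most one subtraction.

```python
def barrett_modmul(x, y, m, w=16):
    """Return x*y mod m, scanning y digit by digit from the most significant digit.

    Assumes 0 <= x, y < m and m >= 4. Uses alpha = w + 3 and beta = -2.
    """
    n = m.bit_length()                    # m is an n-bit modulus
    alpha, beta = w + 3, -2
    mu = (1 << (n + alpha)) // m          # precomputed reciprocal floor(2^(n+alpha)/m)
    digit_mask = (1 << w) - 1
    n_w = -(-n // w)                      # number of w-bit digits, ceil(n/w)

    z = 0
    for i in reversed(range(n_w)):
        y_i = (y >> (w * i)) & digit_mask
        z = (z << w) + x * y_i            # multiply-accumulate one digit of y
        # Quotient estimate: only shifts and one multiplication by mu.
        q_hat = ((z >> (n + beta)) * mu) >> (alpha - beta)
        z -= q_hat * m                    # q_hat never exceeds floor(z/m), so z >= 0
    while z >= m:                         # final correction step(s)
        z -= m
    return z
```

Because the estimate never exceeds the true quotient, the running value stays non-negative and only the final comparison against $M$ is needed.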
Montgomery's algorithm [15] is the most commonly used modular multiplication algorithm today. In contrast to classical modular multiplication, it uses right-to-left divisions. Given an $n_w$-digit odd modulus $M$ and an integer $U \in Z_M$, the image, or Montgomery residue, of $U$ is defined as $X = UR \bmod M$, where $R$, the Montgomery radix, is a constant relatively prime to $M$. If $X$ and $Y$ are, respectively, the images of $U$ and $V$, the Montgomery multiplication of these two images is defined as:

$$X * Y \triangleq XYR^{-1} \bmod M.$$
The result is the image of $UV \bmod M$ and needs to be converted back at the end of the process. For the sake of efficient implementation, one usually uses $R = r^{n_w}$, where $r = 2^w$ is the radix of each digit. Similar to the Barrett multiplication, this algorithm uses a precomputed value

$$M' = -M^{-1} \bmod r = -M_0^{-1} \bmod r.$$
The algorithm is shown in Algorithm 2.
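A digit-serial Montgomery multiplication along these lines can be sketched as below. The function name `montgomery_modmul` is hypothetical, $M$ is assumed odd with $0 \le X, Y < M$, and the modular inverse is obtained with Python's three-argument `pow` (Python 3.8 or later).

```python
def montgomery_modmul(x, y, m, w=16):
    """Return x*y*R^(-1) mod m with R = 2^(w*n_w), scanning y from the least significant digit.

    Assumes m is odd and 0 <= x, y < m.
    """
    r = 1 << w
    digit_mask = r - 1
    n_w = -(-m.bit_length() // w)         # number of w-bit digits, ceil(n/w)
    m_prime = (-pow(m, -1, r)) % r        # precomputed M' = -M^(-1) mod r

    z = 0
    for i in range(n_w):
        y_i = (y >> (w * i)) & digit_mask
        z += x * y_i                      # multiply-accumulate one digit of y
        q_m = ((z & digit_mask) * m_prime) & digit_mask
        z = (z + q_m * m) >> w            # z + q_m*m is divisible by r, so the shift is exact
    while z >= m:                         # final correction step
        z -= m
    return z
```

Multiplying two Montgomery residues $X = UR \bmod M$ and $Y = VR \bmod M$ with this routine returns $UVR \bmod M$, the residue of the product.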
Bipartite Modular Multiplication Method
An algorithm that efficiently combines classical and Montgomery multiplications, both in finite fields of characteristic 2, was first proposed by Potgieter [17] in 2002. Extending this approach to the ring of integers, the bipartite modular multiplication (BMM) was introduced by Kaihara and Takagi in [11] . The method efficiently combines a classical modular multiplication method with Montgomery's modular multiplication algorithm. It splits the operand multiplier into two parts that can be processed separately in parallel, increasing the calculation speed. The calculation is performed using Montgomery residues defined by a modulus M and a Montgomery radix R, R < M. Next, we outline the main idea of the BMM method.
Algorithm 2. Montgomery modular multiplication algorithm.
Let the modulus $M$ be an $n_w$-digit integer, where the radix of each digit is $r = 2^w$, and let $R = r^k$ where $0 < k < n_w$. Consider the multiplier $Y$ to be split into two parts, $Y_H$ and $Y_L$, so that $Y = Y_H r^k + Y_L$.
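Then, the Montgomery multiplication modulo $M$ of the integers $X$ and $Y$ can be computed as follows:

$$XYR^{-1} \bmod M = X(Y_H r^k + Y_L)r^{-k} \bmod M = \left( (XY_H \bmod M) + (XY_L r^{-k} \bmod M) \right) \bmod M.$$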
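The split can be illustrated with a short software sketch in which the upper part of $Y$ is processed with Barrett-style reduction and the lower part with Montgomery reduction. The function name `bmm`, the default split point $k = \lfloor n_w/2 \rfloor$ and the final `%` reduction are choices made for this illustration; in hardware the two loops would run in parallel and the last step would be a conditional subtraction.

```python
def bmm(x, y, m, w=16, k=None):
    """Return x*y*r^(-k) mod m for odd m, with r = 2^w and 0 <= x, y < m."""
    digit_mask = (1 << w) - 1
    n = m.bit_length()
    n_w = -(-n // w)                      # number of w-bit digits of m
    if k is None:
        k = n_w // 2                      # split point (assumed default)
    y_h = y >> (w * k)                    # upper n_w - k digits of y
    y_l = y & ((1 << (w * k)) - 1)        # lower k digits of y

    # Barrett part: x * y_h mod m, most significant digit first.
    alpha, beta = w + 3, -2
    mu = (1 << (n + alpha)) // m
    z_h = 0
    for i in reversed(range(n_w - k)):
        z_h = (z_h << w) + x * ((y_h >> (w * i)) & digit_mask)
        q_c = ((z_h >> (n + beta)) * mu) >> (alpha - beta)
        z_h -= q_c * m

    # Montgomery part: x * y_l * r^(-k) mod m, least significant digit first.
    m_prime = (-pow(m, -1, 1 << w)) & digit_mask
    z_l = 0
    for i in range(k):
        z_l += x * ((y_l >> (w * i)) & digit_mask)
        q_m = ((z_l & digit_mask) * m_prime) & digit_mask
        z_l = (z_l + q_m * m) >> w

    # Final addition and correction.
    return (z_h + z_l) % m
```

The two loops share only the operand $X$ and the modulus, which is what allows the two halves to be computed independently.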
Related Work
Before introducing related work, we note that for the moduli used in all common ECC cryptosystems, modular reduction can be done much faster than with the methods proposed by Barrett or Montgomery, even without any multiplication. This is the reason behind standardizing generalized Mersenne prime moduli (sums/differences of a few powers of 2) [16, 1, 21]. The idea of speeding up a modular multiplication by simplifying an intermediate quotient was first presented by Quisquater [18] at the rump session of Eurocrypt '90. The method is similar to the one of Barrett except that the modulus is preprocessed before the modular multiplication in such a way that the evaluation of the intermediate quotient basically comes for free. Preprocessing requires some extra memory and computational time, but the latter is negligible when many modular multiplications are performed using the same modulus.
In [12] , Lenstra proposes several ways to generate RSA moduli with any number of predetermined leading (trailing) bits, with the fraction of specified bits only limited by security considerations. He points out that choosing such moduli is beneficial both for storage and computational requirements. Furthermore, Lenstra discusses security issues and concludes that the resulting moduli do not seem to offer less security than regular RSA moduli. Joye [10] enhances the method for generating RSA moduli with a predetermined portion proposed in [12] .
Speeding Up the Bipartite Modular Multiplication
In both Barrett and Montgomery modular multiplication algorithms, the precomputed values of either modulus reciprocal or modulus inverse are used in order to avoid multiple-precision divisions. However, single-precision multiplications still need to be performed (step 4 of the algorithms above). This especially concerns the hardware implementations, as the multiplication with the precomputed values often occurs within the critical path of the whole design. Section 4 discusses this issue in more detail.
Since the BMM method utilizes both Barrett and Montgomery multiplication algorithms, one needs to precompute both $\mu = \lfloor 2^{n+\alpha}/M \rfloor$ and $M' = -M_0^{-1} \bmod r$. Let us, for now, assume that the precomputed values are both of the form $2^\gamma$, where $\gamma \in \mathbb{Z}$. By tuning $\mu$ and $M'$ to be of this special form, we transform a single-precision multiplication with these values into a simple shift operation in hardware. We therefore look for a set of moduli for which the precomputed values are both of the form $2^\gamma$. A lemma that defines this set is given below:
Proof. To prove the lemma, we first rewrite $2^{n+\alpha}$ as:
Now, the reciprocal $\mu$ of the modulus $M$ can be written as:
For $\mu = 2^\alpha$ to hold, the inequality $0 \le \lambda < M$ must be satisfied. Solving the left part of the inequality ($\lambda \ge 0$), we get:
Similarly, for the right part of the inequality ($\lambda < M$) we get:
From the condition $M' = -M_0^{-1} \bmod r = 1$ it follows that $M \equiv -1 \pmod r$. This is true for all $\Delta \in \mathbb{Z}$. Finally, the condition that the modulus $M$ is an $n$-bit integer ($2^{n-1} \le M < 2^n$) gives the last condition for $\Delta$:
Now, from the inequalities (1), (2), (3) and the fact that $\Delta \in \mathbb{Z}$, the final condition for $\Delta$ follows:
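As a worked version of this argument, suppose the moduli are parametrized as $M = 2^n - \Delta r - 1$ with $r = 2^w$. This parametrization is an assumption made for illustration, chosen so that $M \equiv -1 \pmod r$ holds for every integer $\Delta$; under it, the steps of the proof can be written out as:

```latex
\begin{align*}
2^{n+\alpha} &= 2^{\alpha} M + \lambda, \qquad \lambda = 2^{\alpha}(\Delta r + 1),\\
\mu &= \left\lfloor \frac{2^{n+\alpha}}{M} \right\rfloor
     = 2^{\alpha} + \left\lfloor \frac{\lambda}{M} \right\rfloor,\\
\lambda \ge 0 &\iff \Delta \ge -\frac{1}{r}, \tag{1}\\
\lambda < M &\iff \Delta < \frac{2^{n} - 2^{\alpha} - 1}{r\,(2^{\alpha}+1)}, \tag{2}\\
2^{n-1} \le M < 2^{n} &\iff -\frac{1}{r} < \Delta \le \frac{2^{n-1}-1}{r}, \tag{3}\\
\Delta \in \mathbb{Z} &\implies
  0 \le \Delta \le \left\lfloor \frac{2^{n} - 2^{\alpha} - 1}{r\,(2^{\alpha}+1)} \right\rfloor .
\end{align*}
```

For $\alpha \ge 1$ the bound in (2) is stricter than the upper bound in (3), and its right-hand side is never an integer (odd numerator, even denominator), which is why the strict inequality turns into a floor in the last line.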
The previous lemma defines a set of moduli for which both conditions $\mu = 2^\alpha$ and $M' = 1$ hold. As mentioned earlier, to minimize the number of correction steps in the improved Barrett algorithm [8], we choose $\alpha = w + 3$. Finally, the proposed set is defined as:
Figure 1 further illustrates the properties of the proposed set. As can be seen, the w least significant bits and the w + 3 most significant bits are fixed to be all 1's while the other n − 2w − 3 bits can be randomly chosen. 
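Whether a given modulus has the two desired properties can be checked directly from the definitions of $\mu$ and $M'$. The sketch below does this; the helper names are hypothetical, and `modulus_from_delta` assumes the parametrization $M = 2^n - \Delta \cdot 2^w - 1$, which is an illustrative assumption rather than a quoted definition.

```python
def precomputed_constants(m, w):
    """Return (mu, m_prime) for an odd modulus m and digit size w, with alpha = w + 3."""
    n = m.bit_length()
    alpha = w + 3
    mu = (1 << (n + alpha)) // m                  # Barrett reciprocal
    m_prime = (-pow(m, -1, 1 << w)) % (1 << w)    # Montgomery constant
    return mu, m_prime


def in_proposed_set(m, w):
    """True if mu = 2^(w+3) and M' = 1 hold for the modulus m."""
    if m % 2 == 0:
        return False                              # M' is undefined for even moduli
    mu, m_prime = precomputed_constants(m, w)
    return mu == 1 << (w + 3) and m_prime == 1


def modulus_from_delta(n, w, delta):
    """Hypothetical parametrization: n-bit modulus with the low w bits all ones."""
    return (1 << n) - (delta << w) - 1


if __name__ == "__main__":
    m = modulus_from_delta(192, 16, 12345)
    print(hex(m), in_proposed_set(m, 16))
```

For a modulus that passes the check, the low $w$ bits and the top $w + 3$ bits are all ones, matching the structure shown in Figure 1.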
For β ≤ 0, the previous equation becomes simplified and equivalent to:
Since $M' = 1$, the intermediate quotient for the Montgomery multiplication also simplifies: $q_M = Z \bmod r$. Finally, the bipartite modular multiplication for the proposed set of moduli is given in Algorithm 3. After the final addition is performed, one more correction step might be necessary since 
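A software sketch of the resulting multiplication, with both quotients reduced to bit selections, is given below. It is an illustration rather than a transcription of Algorithm 3: the name `bmm_special` is hypothetical, and the Barrett quotient is taken as $\lfloor Z/2^n \rfloor$, which is what the general estimate reduces to when $\mu = 2^\alpha$ and $\beta \le 0$ (stated here as an assumption).

```python
def bmm_special(x, y, m, w=16):
    """Return x*y*r^(-k) mod m, r = 2^w, k = n_w // 2, for a modulus with mu = 2^(w+3), M' = 1.

    Assumes 0 <= x, y < m. No multiplication by a precomputed constant is needed.
    """
    r = 1 << w
    digit_mask = r - 1
    n = m.bit_length()
    if m & digit_mask != digit_mask or (1 << (n + w + 3)) // m != 1 << (w + 3):
        raise ValueError("modulus does not have mu = 2^(w+3) and M' = 1")
    n_w = -(-n // w)
    k = n_w // 2
    y_h = y >> (w * k)
    y_l = y & ((1 << (w * k)) - 1)

    # Barrett part: the quotient is just the bits of z_h above position n.
    z_h = 0
    for i in reversed(range(n_w - k)):
        z_h = (z_h << w) + x * ((y_h >> (w * i)) & digit_mask)
        z_h -= (z_h >> n) * m

    # Montgomery part: the quotient is just the lowest digit of z_l.
    z_l = 0
    for i in range(k):
        z_l += x * ((y_l >> (w * i)) & digit_mask)
        z_l = (z_l + (z_l & digit_mask) * m) >> w

    return (z_h + z_l) % m                # final addition and correction
```

Compared with the general case, the two quotient computations no longer involve $\mu$ or $M'$ at all, which is the property the hardware architecture exploits.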
Hardware Implementation
To verify our approach in practice, we implement a set of multipliers that are based on our proposal and compare them with multipliers that support the original BMM algorithm. The purpose of the BMM algorithm is to exploit parallel computation and hence increase the speed of modular multiplication. Therefore, in order to compare different designs with the same input size, we define the relative throughput as

$$T_r = \frac{1}{t_{cp} \cdot N},$$

where $t_{cp}$ is the critical path delay and $N$ is the number of clock cycles. The total throughput is then obtained as $T = B\,T_r$, where $B$ is the number of bits processed in $1/T_r$ time.
To maximize the throughput, one needs to decrease both $N$ and $t_{cp}$. Typically, there are plenty of trade-offs to explore in order to obtain an optimal (in this case, the fastest) design. To make an objective comparison, we distinguish between designs that aim at the shortest critical path and those that achieve the minimum number of clock cycles. We address each of them separately in the coming subsections.
Optimization Goal: Shortest Critical Path
A modular multiplier based on the BMM algorithm, depicted in Fig. 2, consists of four multiple-precision multipliers ($\pi_{H1}$, $\pi_{H2}$, $\pi_{L1}$, $\pi_{L2}$). Apart from the multipliers, the architecture contains some additional adders ($\Sigma_L$, $\Sigma_H$ and $\Sigma$). The multiple-precision multipliers are implemented in a digit-serial manner, which typically provides a good trade-off between area and speed. The multipliers $\pi_{H1}$ and $\pi_{H2}$ together form the Barrett modular multiplier that processes the most significant half of $Y$ (that is, $Y_H$). Similarly, the multipliers $\pi_{L1}$ and $\pi_{L2}$ form the Montgomery modular multiplier that processes the least significant half of $Y$ (that is, $Y_L$). The results of both multipliers are finally added together, resulting in $Z = XYr^{-k} \bmod M$. The parameters $k$ and $\alpha$ are chosen such that the execution speed is maximized and the number of correction steps is minimized: $k = n_w/2$ and $\alpha = w + 3$.
The choice of this specific architecture is based on the following criteria. Two levels of parallelism are exploited so that the number of clock cycles needed for one modular multiplication is minimized. First, the BMM algorithm itself is constructed such that the Barrett part and the Montgomery part of the multiplier work independently, in parallel. Second, the multiple-precision multipliers $\pi_{H1}$ and $\pi_{H2}$ in the Barrett part, and $\pi_{L1}$ and $\pi_{L2}$ in the Montgomery part, operate on independent data, so they run in parallel and speed up the whole multiplication process. The critical path is minimized and consists of one multiplexer, a single-precision multiplier and an adder (bold line, Fig. 2). To avoid any ambiguity, Fig. 3 provides a graph showing the exact timing schedule of the separate blocks inside the multiplier. With $i$ ($0 \le i < k$) we denote the current iteration of the algorithm. Each iteration consists of $n_w + 3$ clock cycles, except the first iteration, which lasts for $n_w + 1$ cycles.
Optimization Goal: Minimum Number of Clock Cycles
In order to minimize the number of clock cycles needed for one modular multiplication, the architecture from Fig. 2 is modified as depicted in Fig. 4. Two single-precision multipliers ($\pi_{H3}$ and $\pi_{L3}$), consisting purely of combinational logic, are added; they require no additional clock cycles to calculate their products. Fig. 5 again provides a graph showing the timing schedule of the multiplier. Each iteration now consists of $n_w + 2$ clock cycles, except the first, which lasts for $n_w + 1$ cycles.
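The per-iteration cycle counts of the two design points translate into a rough total as sketched below. The sketch assumes that one multiplication takes exactly $k = n_w/2$ iterations and ignores any cycles for the final addition and correction; it also takes the relative throughput as $1/(t_{cp} \cdot N)$. The function names are hypothetical, and any critical-path value passed in is a placeholder, not a measured result.

```python
def digits(n, w):
    """Number of w-bit digits of an n-bit operand."""
    return -(-n // w)


def cycles_shortest_path(n, w):
    """Cycle estimate: first iteration n_w + 1 cycles, remaining k - 1 iterations n_w + 3 each."""
    n_w = digits(n, w)
    k = n_w // 2
    return (n_w + 1) + (k - 1) * (n_w + 3)


def cycles_min_cycles(n, w):
    """Cycle estimate: first iteration n_w + 1 cycles, remaining k - 1 iterations n_w + 2 each."""
    n_w = digits(n, w)
    k = n_w // 2
    return (n_w + 1) + (k - 1) * (n_w + 2)


def relative_throughput(t_cp, n_cycles):
    """Relative throughput 1 / (t_cp * N); t_cp in seconds gives multiplications per second."""
    return 1.0 / (t_cp * n_cycles)


if __name__ == "__main__":
    for n, w in [(192, 16), (512, 32), (1024, 64)]:
        print(n, w, cycles_shortest_path(n, w), cycles_min_cycles(n, w))
```

With these counts, a design with fewer cycles only wins overall if its critical path does not grow by a larger factor than the cycle count shrinks.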
The critical path of the whole design runs from the output of the register $Z_H$ to the input of the temporary register in $\pi_{H1}$, passing through two single-precision multipliers and one adder (bold line).
Proposed Multiplier
An architecture of the modular multiplier based on the BMM method with the moduli from the proposed set (see Algorithm 3) is shown in Fig. 6. The most important difference is that there are no multiplications with the precomputed values and hence the critical path contains one single-precision multiplier and one adder only. A full timing schedule of the multiplier is given in Fig. 7. The number of cycles remains the same as in the architecture from Fig. 4, while the critical path is shorter.

Fig. 6. Datapath of the modular multiplier based on BMM method with the modulus from the proposed set
Results
To show this in practice, we have synthesized 192-bit, 512-bit and 1024-bit multipliers, each with digit sizes of 16, 32 and 64 bits. The designs were synthesized using UMC … Table 1.
Observing the implementation results, we conclude that our proposed design outperforms the standard BMM design with the shortest critical path by up to 18%. A design based on standard BMM with the minimum number of clock cycles is outperformed by up to 67%. Furthermore, our design consumes less area than all its counterparts.
On the Security of the Proposed Set
In this section we analyze the security implications of choosing primes in the proposed set for use in ECC/HECC and in RSA.
In the current state of the art, the security of ECC/HECC over finite fields GF($p^m$) depends only on the extension degree $m$ of the field [2]. Therefore, the security does not depend on the precise structure of the prime $p$. This is illustrated by the particular choices for $p$ that have been made in several standards such as SEC [21], NIST [16]. It is easy to verify that for $w \le 28$ all primes except $p_{224}$ are in $S$. In conclusion: choosing a prime of prescribed structure has no influence on the security of ECC/HECC.
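The membership claim can be checked numerically for the five NIST prime-field moduli. The sketch below uses the well-known values of these primes and tests the two defining properties, $\mu = 2^{w+3}$ and $M' = 1$, directly; the function name `in_proposed_set` and the dictionary layout are choices made for this illustration.

```python
NIST_PRIMES = {
    "p192": 2**192 - 2**64 - 1,
    "p224": 2**224 - 2**96 + 1,
    "p256": 2**256 - 2**224 + 2**192 + 2**96 - 1,
    "p384": 2**384 - 2**128 - 2**96 + 2**32 - 1,
    "p521": 2**521 - 1,
}


def in_proposed_set(p, w):
    """True if floor(2^(n+w+3)/p) = 2^(w+3) and -p^(-1) mod 2^w = 1 for the n-bit prime p."""
    n = p.bit_length()
    mu = (1 << (n + w + 3)) // p
    m_prime = (-pow(p, -1, 1 << w)) % (1 << w)
    return mu == 1 << (w + 3) and m_prime == 1


def membership(w):
    """Map each NIST prime name to whether it has both properties for digit size w."""
    return {name: in_proposed_set(p, w) for name, p in NIST_PRIMES.items()}


if __name__ == "__main__":
    for w in (16, 28):
        print(w, membership(w))
```

The prime $p_{224}$ fails because its least significant bits are $0\ldots01$ rather than all ones, so $M' \ne 1$ for any digit size $w \ge 2$.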
The case of RSA requires a more detailed analysis than ECC/HECC. First, we assume that the modulus N is chosen from the proposed set. This is a special case of the security analysis given in [12] followed by the conclusion that the resulting moduli do not seem to offer less security than regular RSA moduli.
Next, we assume that the primes p and q that constitute the modulus N = pq are both chosen in the set S. To analyze the security implications of the restricted choice of p and q, we first make a trivial observation. The number of n-bit primes in the set S for n > 259 + 2w is large enough such that the exhaustive listing of these sets is impossible, since a maximum of 2w + 3 bits are fixed. The security analysis then corresponds to attacks on RSA with partially known factorization. This problem has been analyzed extensively in the literature and the first results come from Rivest and Shamir [19] in 1985. They describe an algorithm that factors N in polynomial time if 2/3 of the bits of p or q are known. In 1995, Coppersmith [5] improves this bound to 3/5.
Today's best attacks all rely on variants of Coppersmith's method published in 1996 [7, 6]. A good overview of these algorithms is given in [13]. The best results in this area are as follows. Let $N$ be an $n$-bit number that is a product of two $n/2$-bit primes. If half of the bits of either $p$ or $q$ (or both) are known, then $N$ can be factored in polynomial time. If fewer than half of the bits are known, say $n/4 - \varepsilon$ bits, then the best algorithm simply guesses $\varepsilon$ bits and then applies the polynomial-time algorithm, leading to a running time exponential in $\varepsilon$. In practice, the values of $w$ (typically $w \le 64$) and $n$ ($n \ge 1024$) are always chosen such that our proposed moduli remain secure against Coppersmith's factorization algorithm, since at most $2w + 3$ bits of $p$ and $q$ are known.
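The margin implied by this argument is simple arithmetic: half of the bits of an $n/2$-bit prime is $n/4$, and at most $2w + 3$ bits are fixed. The sketch below computes the number of bits $\varepsilon$ that would still have to be guessed; it is a rough count under the figures quoted above, with a hypothetical function name, and it ignores polynomial factors of the attack.

```python
def guess_bits_rsa(n, w):
    """Bits an attacker must still guess for N = p*q with n/2-bit primes from the proposed set."""
    prime_bits = n // 2
    needed = prime_bits // 2          # half of the bits of p, i.e. n/4
    known = 2 * w + 3                 # fixed low w bits plus fixed top w + 3 bits
    return needed - known


if __name__ == "__main__":
    for n, w in [(1024, 16), (1024, 64), (2048, 64)]:
        print(n, w, guess_bits_rsa(n, w))
```

For $n = 1024$ and $w = 64$ this leaves 125 bits to guess, so the exhaustive part of the attack costs on the order of $2^{125}$ trials.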
Finally, we consider a similar approach extended to moduli of the form $N = p^r q$, where $p$ and $q$ have the same bit size. This extension was proposed by Boneh, Durfee and Howgrave-Graham [4]. Assuming that $p$ and $q$ are of the same bit size, one needs a $1/(r+1)$ fraction of the most significant bits of $p$ in order to factor $N$ in polynomial time. In other words, for the case $r = 1$ we need half of the bits, whereas for, e.g., $r = 2$ we need only a third of the most significant bits of $p$. These results show that primes $p, q \in S$ forming an RSA modulus of the form $N = p^r q$ should be used with care. This is especially true when $r$ is large. Note that if $r \approx \log p$, the latter factoring method factors $N$ in polynomial time for any general primes $p, q \in \mathbb{N}$.
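Why care is needed for larger $r$ can be seen from a rough bit count. The sketch below assumes that an $n$-bit modulus $N = p^r q$ has primes of about $n/(r+1)$ bits and that only the $w + 3$ fixed most significant bits of $p$ count toward the $1/(r+1)$ fraction; both are simplifying assumptions, and the function name is hypothetical.

```python
def msb_margin(n, w, r):
    """Most significant bits of p still unknown relative to the 1/(r+1) bound, for N = p^r * q."""
    prime_bits = n // (r + 1)                     # approximate size of p and q
    needed = -(-prime_bits // (r + 1))            # ceil(prime_bits / (r + 1))
    known_msb = w + 3                             # fixed top bits of a prime from the set
    return needed - known_msb


if __name__ == "__main__":
    for r in (1, 2, 3):
        print(r, msb_margin(1024, 64, r))
```

Under this count the margin shrinks quickly as $r$ grows; a value that is zero or negative indicates that the fixed bits alone would already reach the stated fraction, so such parameter choices should be avoided.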
Conclusion
A set of moduli for which the performance of bipartite modular multiplication considerably increases is proposed in this work. The size of the set is determined by the digit size and the length of the modulus. Since the security level of ECC/HECC does not depend on the precise structure of the prime $p$, our proposed set is safe to use for constructing the underlying fields in elliptic curve cryptography. The case of RSA is also discussed, with the conclusion that, if used with care ($w \le 64$ and $n \ge 1024$), our proposed set does not decrease the security of RSA.
Additionally, we propose an architecture for a modular multiplier that is based on our method. The results show that, concerning the speed, our proposed architecture outperforms the modular multiplier based on standard BMM method with no additional area overhead.
