Abstract. In this study we show how modular multiplication with Barrett and Montgomery reductions over certain finite fields of characteristic 2 can be implemented efficiently without using a pre-computational phase. We extend the set of moduli that is recommended by Standards for Efficient Cryptography (SEC) by defining two distinct sets for which either Barrett or Montgomery reduction is applicable. As the proposed algorithm is very suitable for a fast modular multiplication, we propose an architecture for the fast modular multiplier that can efficiently be used without pre-computing the inverse of the modulus.
Introduction
Modular multiplication is at the heart of many Public-Key Cryptosystems (PKC), e.g. RSA [10], Diffie-Hellman key agreement [4], the ElGamal scheme [5, 6] and Elliptic Curve Cryptography (ECC) [7, 8]. Because these cryptographic primitives operate on very long numbers, efficient hardware and software implementation of modular multiplication has always been a challenge. The algorithms most commonly used to avoid computationally intensive multi-precision divisions are Barrett reduction [1] and Montgomery reduction [9]. Both algorithms share one property, namely a pre-computational step in which the inverse of the modulus is calculated and stored together with the value of the modulus. As the modular inverse operation is computationally more expensive than the modular multiplication itself, one usually fixes the value of the modulus and uses the pre-computed value of the inverse. This reduces the flexibility as well as the performance of the implementation and increases the area needed to store the inverse of the modulus.
In [11] the authors outline a so-called unbalanced exponent modular reduction for a special type of moduli that can efficiently be used as a replacement for the existing Barrett and Montgomery algorithms. The original idea comes from Standards for Efficient Cryptography (SEC) [12], which recommends the use of moduli of type $M(x) = x^n + T(x)$, where the degree of $T(x)$ is far smaller than $n$. This is very suitable for software implementations and makes reduction very efficient due to the special type of modulus. However, hardware implementation still requires more effort, especially when one needs to support more than one modulus.
In this paper we show how Barrett and Montgomery reduction algorithms over binary fields can be performed without using a pre-computational step. We extend the set of moduli that is recommended by SEC by defining two distinct sets for which either Barrett or Montgomery reduction is applicable. Due to the similarity between the Barrett and Montgomery algorithms, we propose a hardware architecture for the fast modular multiplier that can efficiently be used for the two defined sets of moduli. The multiplier performs modular multiplications within a single clock cycle.
The remainder of this paper is structured as follows. Section 2 describes the Barrett and Montgomery modular multiplication algorithms as the two most commonly used reduction methods. In Sect. 3, we show how pre-computation can be omitted in Barrett and Montgomery algorithms over binary fields. Hardware implementation and comparison with related work is given in Sect. 4. Section 5 concludes the paper and gives some guidelines for the future work.
Related Work

Modular Multiplication with Barrett Reduction
The Barrett reduction algorithm was introduced by P. D. Barrett in 1986 [1]. This algorithm computes $r \equiv a \bmod m$ for an input $a$ and a modulus $m$ and is given in Alg. 1. The algorithm uses $\mu$, a pre-computed reciprocal of $m$, to avoid the computationally expensive divisions that are necessary to compute the quotient $q$ such that $a = qm + r$.
Modular multiplication of two n-bit inputs using Barrett reduction is done by providing the result of n × n-bit multiplication (input a in Alg. 1).
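The role of the pre-computed reciprocal can be illustrated with a minimal Python sketch of Barrett reduction for integers. It is not a transcription of Alg. 1: the base is fixed to 2, and the function names `barrett_setup` and `barrett_reduce`, the bit length `k` and the example modulus are assumptions made for illustration.

```python
def barrett_setup(m):
    """Pre-compute k (bit length of m) and mu = floor(2^(2k) / m)."""
    k = m.bit_length()
    mu = (1 << (2 * k)) // m  # the only true division; done once per modulus
    return k, mu


def barrett_reduce(a, m, k, mu):
    """Return a mod m for 0 <= a < 2^(2k) without dividing by m."""
    q = ((a >> (k - 1)) * mu) >> (k + 1)  # estimate of a // m, never too large
    r = a - q * m
    while r >= m:  # correction step: the estimate may be slightly too small
        r -= m
    return r


# Hypothetical example values, chosen only for illustration.
m = 1000003
k, mu = barrett_setup(m)
x, y = 123456, 654321
product_mod_m = barrett_reduce(x * y, m, k, mu)
```

The sketch makes the cost structure visible: `barrett_setup` must be repeated whenever the modulus changes, while `barrett_reduce` uses only shifts, multiplications and subtractions.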
Modular Multiplication with Montgomery Reduction
The Montgomery algorithm is one of the most commonly used reduction algorithms. In contrast to Barrett reduction, it utilizes right-to-left divisions, which make the implementation simpler, with no correction steps necessary. The result of the reduction has the form $aR^{-1} \bmod m$, where $R$ is a power of the base $b$. Similar to Barrett reduction, this algorithm uses a pre-computed value $\beta \equiv -m^{-1} \bmod R$. Algorithm 2 shows Montgomery reduction in short.
Similar to Barrett reduction, modular multiplication of two n-bit inputs can also be done based on Alg. 2. 
Algorithm 2. Montgomery reduction for integers
Require: positive integers $a = (a_{2n-1}, \ldots, a_0)_b$, $m = (m_{n-1}, \ldots, m_1, m_0)_b$ and $\beta \equiv -m^{-1} \bmod R$, where $R = b^n$ and $\gcd(b, m) = 1$.
Ensure: $t \equiv aR^{-1} \bmod m$.
$s_1 \leftarrow a \bmod R$, $s_2 \leftarrow \beta s_1 \bmod R$, $s_3 \leftarrow m s_2$.
$t \leftarrow (a + s_3)/R$.
Final reduction: if $t \geq m$ then $t \leftarrow t - m$ end if.
return $t$.
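The steps of Alg. 2 can be followed in a short Python sketch. The base is fixed to $b = 2$, so $R = 2^n$ and the modulus must be odd; the function names and the example values are assumptions made for illustration, and the modular inverse relies on the three-argument `pow` with a negative exponent (Python 3.8 or later).

```python
def montgomery_setup(m, n):
    """Pre-compute beta = -m^(-1) mod R for R = 2^n (m must be odd)."""
    R = 1 << n
    return (-pow(m, -1, R)) % R  # modular inverse: the costly pre-computation


def montgomery_reduce(a, m, n, beta):
    """Return a * R^(-1) mod m for 0 <= a < m * R, following Alg. 2."""
    mask = (1 << n) - 1
    s1 = a & mask              # s1 = a mod R
    s2 = (beta * s1) & mask    # s2 = beta * s1 mod R
    s3 = m * s2
    t = (a + s3) >> n          # a + s3 is divisible by R, so the shift is exact
    if t >= m:                 # final reduction
        t -= m
    return t


# Hypothetical example values, chosen only for illustration.
m, n = 1000003, 20
beta = montgomery_setup(m, n)
a = 123456 * 654321
t = montgomery_reduce(a, m, n, beta)
```

Multiplying `t` by $R$ modulo $m$ recovers `a % m`, which is the defining property $t \equiv aR^{-1} \bmod m$.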
Shortcomings of the Existing Algorithms
Both algorithms described above, Barrett and Montgomery, have one property in common: in order to perform modular reduction, they need a pre-computed value of the reciprocal/inverse of the modulus. This reduces the flexibility of the system, forcing us to use a fixed modulus and its pre-computed reciprocal/inverse. From the implementation point of view, this requires extra computational time and memory space to store the pre-computed value.
In the next section we show how these shortcomings can be overcome using a special set of moduli over GF($2^n$).
The Proposed Modular Reduction Method
Here, we first show how the original Barrett reduction can be adapted for modular multiplication over GF($2^n$). Second, we provide a special set of moduli for which the pre-computational step in the Barrett algorithm can be omitted. Finally, we show how Montgomery reduction, using a complementary set of moduli, can also be performed without pre-computing the inverse. Since neither of them requires a pre-computational step, these algorithms are especially suitable for both hardware and software implementations.
Before describing the actual algorithms, we need to give some mathematical background on finite-field arithmetic. Thus, in the next subsection we first give two lemmata and one definition that are necessary for further explanation of the algorithm.
Mathematical Background
Starting with the basic idea of the proposed Barrett algorithm over GF($2^n$), we give Lemma 1 as follows:
over GF(2), where $l = n/2$. Then it holds:
Proof. In order to prove that Eq. (1) holds we need to find polynomial B(x) of degree n − 1 or less that satisfies the following equation:
Indeed, if we write $x^{2n}$ as
This concludes the proof.
(2), where $l = n/2$. Then it holds:
Proof. In order to prove Eq. (2) we need to show that
This concludes the proof.
Definition 1. Let $P(x)$ and $Q(x)$ denote arbitrary polynomials of degree $p$ and $q$, respectively. We define $\Delta(n) = \frac{P(x)}{Q(x)}$ such that $n = p - q$ and $n \in \mathbb{Z}$. In other words, with $\Delta(n)$ we denote an arbitrary element from the set of all rational functions of degree $n$.
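The mechanism that lets a modulus stand in for its own pre-computed constant can be sketched with a short characteristic-2 calculation. The two modulus shapes used here, $M(x) = x^n + T(x)$ with $2\deg T(x) < n$ and $M(x) = 1 + x^l U(x)$ with $2l \geq n$, as well as the symbols $T(x)$, $U(x)$ and $M'(x)$, are assumptions made for this illustration; they are chosen so that every step below holds and may be narrower than the sets the lemmata define.

```latex
\begin{align*}
% Barrett-type modulus: squaring has no cross terms in characteristic 2
M(x)^2 &= \bigl(x^n + T(x)\bigr)^2 = x^{2n} + T(x)^2, \\
x^{2n} &= M(x) \cdot M(x) + T(x)^2, \qquad \deg T(x)^2 = 2\deg T(x) < n, \\
&\Rightarrow\; x^{2n} \;\mathrm{div}\; M(x) = M(x). \\[4pt]
% Montgomery-type modulus: the square collapses to 1 modulo x^n
M(x)^2 &= \bigl(1 + x^l U(x)\bigr)^2 = 1 + x^{2l} U(x)^2 \equiv 1 \pmod{x^n}, \\
&\Rightarrow\; M'(x) = M(x)^{-1} \bmod x^n = M(x).
\end{align*}
```

In both cases the quantity that would normally be pre-computed coincides with the modulus, which is why no separate storage or inversion is needed.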
Barrett Reduction without Pre-computation
The Barrett modular reduction algorithm for integers is given in Sect. 2. In [3] the author shows how the original Barrett reduction can be adapted for the finite fields of characteristic q. Here, we outline the Barrett reduction over a binary field and additionally, we propose a special set of moduli for which the pre-computational step is not needed. First, we provide Alg. 3 and then we give a proof.
Algorithm 3. Barrett reduction over GF($2^n$)
Require: polynomial-basis inputs
Proof. Using notation from Def. 1 and starting from the original Barrett reduction algorithm we can write
Similarly, we can express $Q_1(x)$ as
Using the previous equations, $Q_2(x)$ and $Q_3(x)$ can be written as
Finally, we can evaluate
This concludes the proof.
Now, according to Lemma 1, we can define a set of moduli for which the Barrett reduction described in Alg. 3 does not require a pre-computational step. This set is of type
$m_i x^i$, where $l = n/2$, and the algorithm is shown in Alg. 4. It is interesting to note that, for this special case, the irreducible polynomial can be chosen from a set that contains $2^{n/2}$ different polynomials. As we already know, only irreducible polynomials can be used to construct the field.
Algorithm 4. Barrett reduction over GF($2^n$) without pre-computation
and $a_i, m_i \in \{0, 1\}$.
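The idea behind Alg. 3 and Alg. 4 can be tried out with a minimal Python sketch in which a polynomial over GF(2) is stored as an integer whose bit $i$ is the coefficient of $x^i$. The step layout below (names `q1`, `q2`, `q3`, `r1`, `r2`) is an assumption that follows the usual Barrett structure, and the example modulus $x^7 + x + 1$ is chosen so that $2\deg T(x) < n$ holds for $M(x) = x^n + T(x)$; irreducibility is not needed for the reduction itself.

```python
def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_divmod(a, m):
    """Schoolbook division in GF(2)[x]; returns (quotient, remainder)."""
    q, deg_m = 0, m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        shift = a.bit_length() - 1 - deg_m
        q ^= 1 << shift
        a ^= m << shift
    return q, a


def barrett_reduce_gf2(a, m, mu, n):
    """Reduce a (degree <= 2n-2) modulo m (degree n), given mu = x^(2n) div m."""
    mask = (1 << n) - 1
    q1 = a >> n                 # A(x) div x^n
    q2 = clmul(mu, q1)
    q3 = q2 >> n                # equals A(x) div M(x)
    r1 = a & mask               # A(x) mod x^n
    r2 = clmul(m, q3) & mask    # M(x) * Q3(x) mod x^n
    return r1 ^ r2              # addition in GF(2)[x] is XOR; no correction step


# Hypothetical example: M(x) = x^7 + x + 1.
n = 7
m = (1 << 7) | 0b11
mu_classical, _ = poly_divmod(1 << (2 * n), m)  # what pre-computation would yield
a, b = 0b1011001, 0b0110111
r = barrett_reduce_gf2(clmul(a, b), m, m, n)    # modulus reused in place of mu
```

For this modulus `mu_classical` equals `m`, so passing the modulus as `mu` gives the same result as the classical algorithm while skipping the division.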
Montgomery Reduction without Pre-computation
Since there is no correction step in the original Montgomery algorithm (see Alg. 2), this method can easily be applied to modular multiplication over GF($2^n$) (see Alg. 5). This algorithm was proposed in [2]. Here we outline the algorithm and then, to make the paper more consistent, we also give a proof.
Algorithm 5. Montgomery reduction over GF($2^n$)
, where $R(x) = x^n$ and $a_i, m_i \in \{0, 1\}$.
Proof. Polynomial A(x) can be written as
$n$. There exists a polynomial $S(x)$ of degree $n - 1$ such that
In other words, S(x) can be expressed as
At the same time it holds
Finally, we have
Using notations from Alg. 5 it is obvious that:
This concludes the proof.
According to Lemma 2, we can easily find a set of moduli for which the pre-computational step in Montgomery reduction can be omitted. Instead of pre-computing and using $\beta(x)$, we use the modulus itself. This algorithm is shown in Alg. 6. The proposed set is of type
Similar to the set defined for Barrett reduction, this set also contains $2^{n/2}$ different polynomials.
Algorithm 6. Montgomery reduction over GF($2^n$) without pre-computation
, $R(x) = x^n$ and $a_i, m_i \in \{0, 1\}$.
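The Montgomery variant can be sketched in the same integer-as-polynomial representation. The step names `s1`, `s2`, `s3` mirror Alg. 2 with addition replaced by XOR; the function names and the example modulus $x^7 + x^4 + 1 = 1 + x^4(x^3 + 1)$ are assumptions made for illustration, the modulus being chosen so that $M(x)^2 \equiv 1 \pmod{x^n}$ and $M(x)$ can serve as its own inverse modulo $x^n$.

```python
def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(a, m):
    """Remainder of a modulo m in GF(2)[x] (used only to check results)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a


def montgomery_reduce_gf2(a, m, m_inv, n):
    """Return a * x^(-n) mod m for deg(a) <= 2n-2, given m_inv = m^(-1) mod x^n."""
    mask = (1 << n) - 1
    s1 = a & mask                   # A(x) mod x^n
    s2 = clmul(m_inv, s1) & mask    # M'(x) * S1(x) mod x^n
    s3 = clmul(m, s2)               # low n coefficients equal those of A(x)
    return (a ^ s3) >> n            # exact division by x^n; degree <= n-1


# Hypothetical example: M(x) = x^7 + x^4 + 1.
n = 7
m = (1 << 7) | (1 << 4) | 1
a, b = 0b1011001, 0b0110111
t = montgomery_reduce_gf2(clmul(a, b), m, m, n)  # modulus reused as its inverse
```

Multiplying `t` by $x^n$ and reducing modulo `m` with `poly_mod` gives the same remainder as the plain product, which confirms that the result is $A(x)\,x^{-n} \bmod M(x)$.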
Hardware Implementation of the Proposed Algorithm
To verify our algorithm in practice, we synthesize the proposed solution for GF($2^{192}$) using a 0.13 μm CMOS standard cell library. Here, we aim only for the fast version of the multiplier. Additionally, we synthesize the standard Barrett and Montgomery reduction algorithms and compare them with our results. To make a fair comparison, we use the same 192 × 192-bit multipliers in every implementation.
An architecture for the straightforward implementation of the Barrett or Montgomery modular multiplication algorithm over GF($2^{192}$) is shown in Fig. 1(a). For both algorithms we need five 192-bit registers, three 192 × 192-bit multipliers and one 192-bit adder. Since there is no carry propagation in binary fields, addition is equivalent to an XOR operation.
A block diagram of the proposed solution is shown in Fig. 1(b). Here we can see that instead of five we use only four 192-bit registers. Similarly to the Barrett and Montgomery architecture shown in Fig. 1(a), we use three 192 × 192-bit multipliers and one 192-bit adder. Additionally, we use four multiplexers that are driven by the register ind. This 1-bit register indicates which of the two proposed complementary sets of moduli is used. The architecture of selectors 1, 2 and 3 is given in Fig. 2. For ind = 0 our architecture executes Barrett modular multiplication, while for ind = 1 it performs Montgomery modular multiplication.
Synthesis results are given in Table 1. They include registers and combinational logic. For both the Barrett and Montgomery algorithms, we assume that the pre-computation has already been performed and that the values of $\mu(x)$ and $\beta(x)$ (see Fig. 1(a)) are known. Skipping the pre-computation step is highly beneficial for both area and computational cost and is the main advantage of our algorithm.
Observing the results in Table 1, we can conclude that our architecture has almost the same size as the separate Barrett and Montgomery multipliers. This gives practical value to the theoretical work described above. The proposed implementation can be used with much more flexibility, since different moduli can be chosen without a pre-computational phase.
Conclusions and Future Work
In this paper we have defined two distinct sets of moduli for which the pre-computational step in the modular multiplication algorithm can be excluded. At the cost of more control logic, a similar multiplier that supports moduli of different degrees can be introduced. This would further increase the flexibility of the proposed fast modular multiplier.
As a part of our future research we would like to explore the impact of the similar special sets of moduli on building a compact version of the modular multiplier.
