Abstract-This paper describes a new method to speed up IF p -arithmetic in hardware for pairing-friendly curves, such as the wellknown Barreto-Naehrig (BN) curves. We explore the characteristics of the modulus defined by these curves and choose curve parameters such that IF p multiplication becomes more efficient. The proposed algorithm uses Montgomery reduction in a polynomial ring combined with a coefficient reduction phase using a pseudo-Mersenne number. As an application, we show that the performance of pairings on BN curves in hardware can be significantly improved, resulting in a factor 2.5 speedup compared with state-of-the-art hardware implementations.
INTRODUCTION
A bilinear pairing is a map G 1 Â G 2 ! G T where G 1 and G 2 are typically additive groups and G T is a multiplicative group and the map is linear in each component. Many pairings used in cryptography, such as the Tate pairing [4] , ate pairing [23] , R-ate pairing [26] , and optimal pairings [22] , [33] , choose G 1 and G 2 to be specific cyclic subgroups of EðIF p k Þ, and G T to be a subgroup of IF Ã p k . The selection of parameters has a substantial impact on the security and performance of a pairing. For example, the underlying field, the type of curve, the orders of G 1 , G 2 , and G T should be carefully chosen such that it offers sufficient security, but is still efficient to compute. In this paper, we propose a new modular multiplication algorithm for finite fields used in parametrized families of pairingfriendly elliptic curves. The characteristic p of such fields is given by the evaluation of a polynomial fðzÞ 2 Z Z½z at an integer " z 2 Z Z. One of the most important examples of such families is the Barreto-Naehrig (BN) curves [32] where one has p ¼ 36" z 4 þ 36" z 3 þ 24" z 2 þ 6" z þ 1 for some " z 2 Z Z such that p is prime. We show that when choosing " z ¼ 2 þ s, where s is a reasonably small number, the modular multiplication in IF p can be substantially improved. Existing techniques to speed up arithmetic in extension fields (see [13] , [14] for fast arithmetic in IF p 2 , IF p 6 , and IF p 12 ) can be used together with our construction. As an application, we show how to choose parameters for BN curves and obtain a significant improvement on the performance in hardware of the ate and optimal ate pairing.
The remainder of the paper is organized as follows: in Section 2, we review cryptographic pairings, pairing-friendly curves, and modular arithmetic. In Section 3, we present our new modular multiplication algorithm and compare its complexity with known algorithms. Section 4 provides parameters for several families of pairing-friendly curves. As an application, Section 5 gives in detail an architecture of an IF p multiplier for BN curves, and discusses the results obtained by our hardware implementation. Finally, Section 6 concludes the paper.
BACKGROUND 2.1 Bilinear Pairings
Let IF p be a finite field and let E be an elliptic curve defined over IF p . Let r be a large prime dividing #EðIF p Þ and k the embedding degree of EðIF p Þ with respect to r, namely, the smallest positive integer k such that rjp k À 1. For any finite extension field IK of IF p , denote with EðIKÞ½r the IK-rational r-torsion group of the curve. For P 2 EðIKÞ and an integer s, let f s;P be a IK-rational function with divisor ðf s;P Þ ¼ sðP Þ À ð½sP Þ À ðs À 1ÞðOÞ;
where O is the point at infinity. This function is also known as a Miller function [27] , [28] .
Let G 1 ¼ EðIF p Þ½r, G 2 ¼ EðIF p k Þ=rEðIF p k Þ, and G 3 ¼ r & IF Frobenius; then the ate pairing is also a well-defined, nondegenerate bilinear pairing, and can be computed as aðQ; P Þ ¼ ðf tÀ1;Q ðP ÞÞ ðp k À1Þ=r :
The R-ate pairing [26] is a generalization of the ate pairing and can be seen as an instantiation of optimal pairings [22] , [33] . Since the definition of the optimal ate pairing really depends on the particular elliptic curve one is using, we only provide the definition in the case of BN curves: using the same G 1 and G 2 as for the ate pairing, the optimal ate pairing on BN curves is defined as
where a ¼ 6" z þ 2, f ¼ f a;Q ðP Þ, and l A;B denotes the line through points A and B.
Due to limited space, we only describe the algorithm to compute the optimal ate pairing for BN curves. The algorithms for Tate and ate pairings are similar, and can be found in [14] .
Algorithm 1.
Computing the optimal ate pairing on BN curves [14] Input: P 2 EðIF p Þ½r, Q 2 EðIF p k Þ½r T Kerð p À ½pÞ and Return f.
Pairing-Friendly Curves
An elliptic curve E over IF p is called pairing friendly whenever there exists a large prime r j #EðIF p Þ with r ! ffiffi ffi p p and the embedding degree k is small enough, e.g., k log 2 ðrÞ=8. Furthermore, if #EðIF p Þ ¼ p þ 1 À t with t the trace of Frobenius, then t 2 À 4p should have a very small squarefree part (e.g., less than 10 10 ) to be able to find the equation of the curve using the complex multiplication (CM) method [3] . These restrictions imply that pairing-friendly curves are hard to find and several specialized constructions have been proposed (see [18] for an excellent overview).
Many construction methods result in a parameterized family of elliptic curves, i.e., r and p are given by the evaluation of polynomials rðzÞ and fðzÞ at an integer value " z. To illustrate this type of curve, we provide several important examples of complete families of curves. 
define a family of pairing-friendly curves with k ¼ 8 and CM by À1, which admits quartic twists. Furthermore, finding the equation of the curve is straightforward due to CM by À1. Given a " z-value such that rð" zÞ is a 224-bit prime, this family provides 112-bit security.
Example 2. 128-bit security level. The Barreto-Naehrig curves [32] are by far the most important family of pairing-friendly curves for the 128-bit security level. These curves have k ¼ 12 and are defined by
Furthermore, since these curves have CM by À3, the equation for such a curve is easy to find. Finally, they also admit sextic twists which speeds up all types of pairings. Given a " z value such that rð" zÞ is a 256-bit prime, this family provides 128-bit security.
Example 3. 256-bit security level. For very high security levels, the BN curves are somewhat ill-adapted and it is better to use the following cyclotomic family with k ¼ 24 [9] . Recall that the 24th cyclotomic polynomial is
Again, this family has CM by À3 so sextic twists are possible. Given a " z-value such that rð" zÞ is a 512-bit prime, this family provides 256-bit security.
Multiplication in IF p
We briefly recall the techniques for integer multiplication and reduction. Given a modulus p < 2 n and an integer c < 2 2n , the following algorithms can be used to compute c mod p.
Barrett reduction. The Barrett reduction algorithm [5] uses a precomputed value ¼ b 2 2n p c to help estimate c p ; thus, integer division is avoided. Dhem [15] proposed an improved Barrett modular multiplication algorithm which has a simplified final correction.
Montgomery reduction. The Montgomery reduction method [29] precomputes p 0 ¼ Àp À1 mod R, where R normally is a power of 2. Given c and p, it generates q such that c þ qp is a multiple of R. As a result, the division of ðc þ qpÞ by R is exact and can be performed by a shift operation.
Chung-Hasan reduction. In [11] and [12] , Chung and Hasan proposed an efficient reduction method for lowweight polynomial form moduli p ¼ fð" zÞ ¼ "
z þ f 0 , where jf i j 1. The resulting modular multi plication is given in Algorithm 2. The polynomial reduction phase is rather efficient since fðzÞ is monic, making the polynomial long division (Steps 3-5) simple. Barrett reduction is used to perform divisions required in Phase III. According to the implementation results [11] , the performance of the Chung-Hasan algorithm is more efficient than the traditional Barrett or Montgomery reduction algorithms when the moduli are large (See Fig. 5 in [11] for details). In [12] , this algorithm is further extended to monic polynomials with jf i j s where s ( " z. Note that the polynomial reduction phase is efficient only when fðzÞ is monic.
Algorithm 2. Chung-Hasan multiplication algorithm [11] . 
HYBRID MODULAR MULTIPLICATION

Polynomial Montgomery Reduction
In this section, we introduce a new reduction algorithm for polynomial form moduli that can be seen as the dual algorithm to Chung-Hasan. Whereas Chung and Hasan require the polynomial defining the modulus to be monic, our algorithm requires the constant coefficient to be AE1. As will be shown below, this requirement is a very natural one.
Let fðzÞ ¼ f nÀ1 z nÀ1 þ Á Á Á þ f 1 z þ f 0 2 Z Z½z with f 0 6 ¼ 0 and assume the modulus is given by p ¼ fð" zÞ for some " z 2 Z Z. For an integer c ¼ P 2nÀ2 i¼0 c i " z i , the first step in the algorithm applies Montgomery reduction in the polynomial representation cðzÞ ¼ P 2nÀ2 i¼0 c i z i . Montgomery reduction avoids expensive divisions by computing cðzÞz Àn mod fðzÞ, instead of cðzÞ mod fðzÞ itself. The main idea is to solve for the polynomial qðzÞ such that cðzÞ À qðzÞfðzÞ 0 mod z n ;
which shows that qðzÞ ¼ cðzÞgðzÞ mod z n where gðzÞ fðzÞ À1 mod z n . Since gcdðfðzÞ; zÞ ¼ 1, we thus obtain that ðcðzÞ À qðzÞfðzÞÞ=z n cðzÞz Àn mod fðzÞ;
where the division by z n is exact and can be computed by a simple right shift of the coefficients. Note that f 0 6 ¼ 0 implies that gðzÞ 2 Q½z exists, but gðzÞ does not necessarily have integer coefficients. The following lemma shows that the condition f 0 ¼ AE1 is a natural one.
Z Z½z with f 0 6 ¼ 0, and define g k ðzÞ ¼ fðzÞ À1 mod z k ; then a necessary and sufficient condition for g k to have integer coefficients is f 0 ¼ AE1.
Proof. The argument is classical and is a special case of Newton lifting over the rational function field QðzÞ.
Indeed, g k ðzÞ fðzÞ À1 mod z k is the solution to the linear equation (in the indeterminate X) fðzÞX À 1 0 mod z k ;
which can be solved using the Newton iteration: g 1 ðzÞ 1=f 0 mod z and then g k ðzÞ g kÀ1 ðzÞ À ðfðzÞg kÀ1 ðzÞ À 1Þ=fðzÞ 2g kÀ1 ðzÞ À g 2 kÀ1 ðzÞfðzÞ mod z k :
As such, f 0 ¼ AE1 is clearly sufficient for g k ðzÞ to be defined over Z Z. However, it is also necessary, since
Parallel Version
Algorithm 3 describes our modular multiplication algorithm for polynomial form moduli. The algorithm is composed of four phases, i.e., polynomial multiplication, a partial coefficient reduction, polynomial reduction, and coefficient reduction. The polynomial reduction uses the Montgomery reduction, while the coefficient reduction uses division. We call this algorithm Hybrid Modular Multiplication (HMM). For Algorithm 3 to be useful in practice, we need to show that given a bounded input, the algorithm returns an output which is similarly bounded. To this end, we require two short lemmata. The first lemma analyzes the coefficient reduction in Phase IV, while the second lemma gives bounds on the input, such that the output satisfies the same bounds. 
Proof. Using induction it is easy to see that in step i, the size of c i div " z is bounded by
The size of c m in step m À 1, therefore, becomes
which shows that in step m, we obtain the bound
Using Lemma 2, we are now ready to show that given a bounded input, the result of Algorithm 3 is again bounded.
Recall that for a polynomial h ¼ P i h i z i 2 Z Z½z, the infinity norm is defined as khk 1 ¼ max i jh i j. 
ja nÀ1 j; jb nÀ1 j 2 =2þ ; for some ! 1, then rðzÞ computed by Algorithm 3 satisfies
This shows that if ! 2ðlog 2 ð4n 2 ð2 þ þ 2 2 ÞÞ À Þ, then rðzÞ satisfies the same bounds as the input.
Proof. After step 2, the coefficients of c i are clearly bounded by n2 2þ2 for i ¼ 0; . . . ; 2n À 3 and jc 2nÀ2 j 2 þ2 . Lemma 2 shows that after the first coefficient reduction we have jc i j " z for i ¼ 0; . . . ; n À 1 and jc n j n2 2þ2þ1 . The coefficients of qðzÞ are easily seen to be bounded by jq i j n2 þþ1 so after step 9, we obtain
Applying Lemma 2 again with the above B and D shows that
Given a polynomial fðzÞ, it is easy to compute the inverse gðzÞ fðzÞ À1 mod z n and, thus, to determine the exact values for and in Lemma 3. For each polynomial, we, therefore, obtain a very explicit lower bound on " z such that Algorithm 3 can be applied repeatedly.
Digit-Serial Version
Algorithm 4 presents the digit-serial version of Algorithm 3 by interleaving polynomial multiplication and reduction. While the parallel version uses the complete polynomial gðzÞ f À1 ðzÞ mod z n , the digit-serial reduction involves only g 0 ¼ AE1. On the other hand, the parallel version could possibly use Karatsuba's algorithm [25] in steps 2, 8, and 9 to reduce the number of subword multiplications. The coefficient reduction phase is the same as in Algorithm 3.
polynomial representation of rð" zÞ að" zÞbð" zÞ" z Àn mod fð" zÞ.
c iþ1 c iþ1 þ ðc i div " zÞ, c i c i mod " z. 10: end for Return rðzÞ cðzÞ.
The following lemma is the analog of Lemma 3, but for the digit-serial version.
Lemma 4. Let ¼ blog 2 ð" z À 1Þc and ¼ dlog 2 kfk 1 e; then if n 2 < " z and the input að" zÞ; bð" zÞ satisfies
ja nÀ1 j; jb nÀ1 j 2 =2þ ; for some ! 1, then rðzÞ computed by Algorithm 4 satisfies
This shows that if ! 2ðlog 2 ð2ðn þ 2Þð2 2 þ 2 ÞÞ À Þ, then rðzÞ satisfies the same bounds as the input.
Proof. The idea of the proof is similar to the proof of Lemma 3, i.e., before the final coefficient reduction we want to show that the coefficient c nÀ2 is small enough to ensure that after reduction, c nÀ1 is small. Note that at the beginning of each iteration in step 2, the degree of cðzÞ is at most n À 2, which shows that after step 7 we have c nÀ2 ¼ a nÀ1 b nÀ1 À nÀ1 f nÀ1 f 0 . Here, i denotes the computed in step 4 in the ith iteration. Since j i j j" zj, this shows that jc nÀ2 j 2 þ2 þ 2 þþ1 . Assume that the coefficients c i for i ¼ 0; . . . ; n À 3 are bounded by B (to be determined below) and let
; then again we can apply Lemma 2 resulting in
Obtaining the bound B is easy too: denote with B i the bound on kcðzÞk 1 at the start of the ith iteration; then we have B 0 ¼ 0 and for i ! 1:
Let
Assuming that n 2 " z, we can easily bound the sum above by ðe À 1Þ=" z, so n À 1 ðn þ e À 1Þ=" z. So, we finally obtain
As a result, the bound on c nÀ1 becomes
It is important to point out that both Algorithms 3 and 4 may involve negative numbers as intermediate or final results. As a result, a direct use of these algorithms requires the handling of negative numbers. In hardware implementations, one can avoid signed multiplication by adding a multiple of fð" zÞ to rð" zÞ. We describe this method in an example in Section 5.
Faster Coefficient Reduction
Algorithms 3 and 4 require division by " z. As in ChungHasan's algorithm, division can be performed with Barrett's reduction algorithm [11] . However, the complexity of the division step can be reduced if " z is a pseudo-Mersenne number. Algorithm 5 transfers division by " z to multiplication by s for " z ¼ 2 þ s where s is small.
The following lemma gives the maximum number of iterations for Algorithm 5 to finish. Proof
To bound the right-hand side, we need to consider the two cases:
. ð2 À þ 1Þ ; then jj 2 Àþ , so in each iteration the bit length is decreased by À . . ð2 À þ 1Þ > ; then jj < 2 þ1 , so either is reduced or one last step is needed jj ¼ j À sj ð2 À 1Þ þ s < " z:
In conclusion, d
Àð2Àþ1Þ À e iterations are needed to reduce to the second case and then a maximum of two iterations is needed to finish reduction in the second case. In total, the algorithm, thus, requires d
When using Algorithm 5 to perform steps 4 and 9 in Algorithm 4, at most three iterations are required. This is guaranteed if jc i j 2 3À2þ1 for 0 i n À 1. Recall that c i satisfies the following bounds:
2þ2þ1 þ 2 þþ1 Þ for 0 i < n À 2, and
Since B > D, it is sufficient to ensure B 2 3À2þ1 . Since we typically have ,
By choosing ! 2ð þ Þ þ log 2 ðn þ e À 1Þ þ 1, we ensure log 2 B 3 À 2 þ 1:
Complexity
We compare the complexity of the proposed algorithm with the algorithms by Barrett and Montgomery. Here, , are defined as in Lemma 3, namely, ¼ blog 2 ð" z À 1Þc, ¼ dlog 2 kfk 1 e, " z ¼ 2 þ s where 0 < s 2 À1 . The complexity of each step in Algorithm 4 is as follows:
.
Step 3. For i n À 2, computing aðzÞb i involves one ðd
2 e þ Þ and ðn À 1Þ times ðd 2 e þ Þ Â ð þ Þ multiplications. . Step 5. Computing fðzÞðÀf 0 Þ needs ðn À 1Þ times Â multiplications. . Steps 4 and 9. As discussed above, three iterations are required at most. We have jj log 2 ðn þ e À 1Þ þ þ 2 þ 2 and jj log 2 ðn þ e À 1Þ þ þ 2 þ 3 in the first iteration and second iteration, respectively. For the third iteration, jj 1 is guaranteed. The cost of step 4 or 9 is one Â ð þ dlog 2 ðn þ e À 1Þe þ 2 þ 2Þ multiplication and one Â ðdlog 2 ðn þ e À 1Þe þ þ 2 þ 3Þ multiplication. Table 1 compares the complexity of the different multiplication algorithms. Note that and are chosen to be small, so that multiplication with fðzÞ and s can be efficiently performed.
The complexity of the HMM algorithm highly depends on the chosen curve parameters since they determine f; ; n; ; g. Each family of pairing-friendly curves is defined by a fixed polynomial fðzÞ, so n and are constant for a given family. The size of is set by the desired security level. To achieve lower complexity, it is, thus, desirable to choose small and . According to Lemma 4, can be as small as 1 and the size of is determined by s where " z ¼ 2 þ s. Computing ABR À1 mod M using conventional Montgomery multiplication requires 2w 2 þ w subword multiplications (see [10] for details), where w is the number of digits in A and B. Note that w ¼ n À 1 here since we need one more digit to represent AðzÞ or BðzÞ. Barrett multiplication has approximately the same complexity as Montgomery's algorithm [11] .
Compared to , the parameters and are much smaller. This makes the polynomial reduction and coefficient reduction very efficient. We present some parameters for the examples in Section 2.2 and compare the complexities of different multiplication algorithms.
PARAMETER SELECTION FOR PAIRING-FRIENDLY CURVES
In this section, we present values for " z for each of the three examples given in Section 2.2. Note that for the first and third families, fðzÞ does not have integral coefficients, but 3fðzÞ or 4fðzÞ does, so we simply work modulo these polynomials. Table 2 contains values " z that lead to efficient instantiations of the HMM algorithm. Furthermore, the bit length of " z is chosen to reflect the ideal security level at which the respective families of curves should be used. Table 3 compares the complexity of HMM and Montgomery's algorithm using parameter sets of Table 2 . In hardware designs, one can customize the multipliers for specific operand sizes. HMM takes advantage of this freedom to reduce the complexity of the multiplier. For example, a 6 Â 63-bit multiplier is much smaller and faster than a 64 Â 64-bit multiplier. Compared with Montgomery multiplication, the HMM has a significantly lower complexity in the context of hardware implementation.
Note that the complexity of HMM can be reduced further by exploring the characteristics of the coefficients of fðzÞ. For example, BN curves have fðzÞ ¼ 36z 4 þ 36z 3 þ 24z 2 þ 6z þ 1, and fðzÞðÀf 0 Þ can be computed as follows:
Step 2. 24 ¼ 2 2 ð6Þ; and .
Step 3. 36 ¼ 24 þ 2ð6Þ. Instead of performing four 63 Â 6 multiplications in each iteration, four shift and two addition operations are performed. Example 3 has ¼ 1, resulting in even larger savings in the polynomial reduction step, since no multiplications are required as reflected in Table 3 by the 1 Â 64 cell.
APPLICATION TO BN CURVES
In this section, we propose an architecture for HMM in hardware. Note that Fan et al. described the first digit-serial architecture for the HMM algorithm [17] for BN curves, where polynomial multiplication and reduction are interleaved. In the architecture proposed in this paper, polynomial multiplication and reduction are separated as in Algorithm 3. We show that this architecture is flexible and can achieve higher throughput than the one from [17] . We choose the same parameters as [17] , namely, "
With these parameters, the implementation achieves 128-bit security.
Algorithm 6 shows the HMM algorithm for BN curves. Since fðzÞ ¼ 36z 4 þ 36z 3 þ 24z 2 þ 6z þ 1, we have gðzÞ Àf À1 ðzÞ 324z 4 À 36z 3 À 12z 2 þ 6z À 1 mod z 5 . Note that both fðzÞ and gðzÞ have relatively small coefficients. As a result, the polynomial reduction phase can be efficiently implemented.
Algorithm 6. Parallel hybrid modular multiplication algorithm for BN curves. 
HMM Multiplier
Fig . 1 shows the architecture of the multiplier. It consists of a row of multipliers to carry out polynomial multiplication, five "Mod-1" blocks to perform the first coefficient reduction, a module to accumulate the partial products, a module to perform polynomial reduction, and four "Mod-2" blocks to perform the second coefficient reduction. Fig. 1 also gives the bit length of input/output of each block. Phase I. Five integer multipliers, including four 65 Â 65 and one 65 Â 32 multipliers, are used to carry out polynomial multiplication. Clearly, given enough area, the polynomial multiplication stage can be fully parallelized (using 13 multipliers [30] ). The architecture used here is a trade-off between area and throughput. Using the schoolbook method, it requires five cycles to finish Phase I. Each 65 Â 65 multiplier is implemented using a two-level Karatsuba's method, and each multiplication has a delay of seven cycles.
Phase II. The partial products (a i b j ; 0 i; j < n) are reduced immediately after they are generated. Fig. 1d shows the structure of the "Mod-1" block. Since
four additions. The output of Phase II is then accumulated and shifted in the accumulator, shown in Fig. 1f . Note that the output buffers of the adders, except the one on the most right side, should be set to zero for each IF p multiplication. The "Mod-1" block only performs one round of reduction; thus, it does not give fully reduced results. It is easy to see that the output of "Mod-1" is always less than 2 77 . We shall see this is good enough for this architecture. Phase III. Once the partially reduced results (c 8 ; ::; c 0 ) are ready, Phase III is applied. The polynomial reduction is performed with only addition and shift operations, e.g.,
Phase IV. The "Mod-2" block is implemented in the same way as "Mod-1". However, the input of "Mod-2" is only 93 bit. Thus, the output of the proposed multiplier has the following bounds: jr i j < 2 63 þ 2 41 , 0 i 3, and jr 4 j < 2 30 . Note that the resulting polynomial, rðzÞ may have negative coefficients. For future multiplications, negative coefficients as input are not desirable. We can ensure positive coefficients by adding to rðzÞ the following polynomial:
where # ¼ 2 25 . One can verify that lð" zÞ ¼ 2 
Implementation Results for Pairing Coprocessor
We used Xilinx FPGA as the design platform. In order to achieve a high clock frequency, a 16-stage pipeline is used in the HMM multiplier. Note that the polynomial multiplication takes five iterations. One multiplication on the proposed multiplier has a delay of 20 cycles. On the other hand, the multiplier finishes one multiplication every five cycles, and thus has a throughput of 1/5. Using the multiplier described above, we built a pairing processor. It consists of a multiplier, an adder, a 74 Kbit data memory and an instruction ROM. The data memory and instruction ROM are essentially implemented with Block RAMs on the FPGA. The data memory has one read port and one write port.
In order to keep the HMM multiplier busy, we explore the parallelism within the pairing computation. Consider IF p 2 ¼ IF p ½u=ðu 2 þ 2Þ and two elements in IF p 2 : ða þ buÞ and ðc þ duÞ, a; b; c; d 2 IF; then ða þ buÞðc þ duÞ can be computed as follows:
The IF p multiplications can be performed one after another. We use a C++ program to schedule each high-level function, e.g., subroutines of the Miller loop and the final exponentiation. The scheduling is then transferred into microinstructions stored in the instruction ROM. Table 4 gives the cycle counts of each function.
On a Xilinx Virtex-6 FPGA (XC6VLX240), the design uses 4,014 Slices, 42 DSP48E1s, and 5 Block RAMs (RAMB36E1s). The design achieves a maximum frequency of 210 MHz. Table 5 compares the result with the state-of-the-art implementations.
Kammler et al. [24] reported a hardware implementation of cryptographic pairings using an application-specific instruction-set processor (ASIP). They chose " z ¼ 0x6000000000001F2D to generate a 256-bit BN curve. Montgomery's algorithm is used for IF p multiplication. Fan et al. [17] reported an ASIC implementation using the HMM algorithm. They choose " z ¼ 2 63 þ 857. It achieves a factor 5.4 speedup compared with the one in [24] . On the other hand, the ASIP design [24] offers higher flexibility than the ASIC implementation [17] that is optimized for a dedicated parameter set. Ghosh et al. [19] reported the first FPGA implementation of pairings based on BN curves. They also chose " z ¼ 0x6000000000001F2D to achieve 128-bit security. The IF p multiplier used in [19] is based on Blakley's algorithm [8] , and it does not make use of the DSP slices on FPGA.
The design reported in this paper achieves a factor 2.5 speedup compared with the one in [17] . The speedup comes mainly from the high throughput multiplier. The multiplier used in [17] requires 23 cycles for each multiplication, while the multiplier described in this paper finishes one multiplication every five cycles. The Montgomery multiplier used in [24] requires 68 cycles for one multiplication. On the other hand, the speedup for pairing computations is less than the speedup of the multiplier. This is mainly due to the read-after-write (RAW) dependency that introduces pipeline bubbles. Table 5 also includes the state-of-the-art software implementations [6] , [20] , [21] , [31] . Software implementations try to make use of available features (fast multipliers, vector registers, etc.) on the target processor. The speed records have been updated shortly after they are set, mainly due to new implementation techniques and the evolution of processors. The current speed record [1] of optimal ate pairing using BN curves is achieved by Aranha et al. on an AMD Phenom II processor, who use " z ¼ Àð2 62 þ 2 55 þ 1Þ. This choice not only reduces the complexity of both the Miller loop and the final exponentiation, but more importantly admits for extensive use of lazy reduction techniques. This implementation achieves so far the best performance.
An interesting attempt to use hybrid multiplication algorithm in software has been made [31] . Naehrig et al. adapted the hybrid multiplication algorithm and proposed a software-oriented algorithm for fast modular multiplication. An element in IF p is represented as a polynomial of degree 11. As such, the coefficients are short enough to fit in the vector registers and overflows in multiplication are avoided. Polynomial reduction can be efficiently performed since pðxÞ is made monic. This implementation shows a significant speedup compared to previous software implementations [20] , [21] . On the other hand, on CPUs where 64-bit multiplication is not much slower than addition, the traditional Montgomery algorithm seems to achieve a higher performance [1] , [6] . Table 5 also includes the state-of-the-art implementations of pairings over binary or ternary fields achieving 128-bit security. The results of these implementations are comparable with pairings using BN curves in software. To the best of our knowledge, Estibals reported so far the only hardware implementation of 128-bit Tate pairing over supersingular curves in small characteristic. On a Virtex-4 FPGA, it requires only 2.2 ms for the pairing computation. It is, however, difficult to give a fair comparison due to the use of different platforms.
CONCLUSIONS
In this paper, we introduced a new modular multiplication algorithm for polynomial form moduli called Hybrid Modular Multiplication. The algorithm combines Montgomery's algorithm in the polynomial representation with a fast coefficient reduction based on pseudo-Mersenne numbers.
The algorithm results in a substantial speedup for hardware implementation of finite field arithmetic appearing in pairing-based cryptography. To illustrate this efficiency, we implemented pairings on Barreto-Naehrig curves at the 128-bit security level and obtained a factor 2.5 speedup compared with the state-of-the-art hardware implementations. Ingrid Verbauwhede received the Electrical Engineering degree and the PhD degree from the Katholieke Universiteit Leuven (KU Leuven), Belgium, in 1991. From 1992 to 1994, she was a postdoctoral researcher and a visiting lecturer at the Electrical Engineering and Computer Sciences Department, University of California, Berkeley. From 1994 to 1998, she worked for TCSI and ATMEL in Berkeley, California. In 1998, she joined the faculty of University of California, Los Angeles (UCLA). She is currently a professor at the KU Leuven and an adjunct professor at UCLA. At KU Leuven, she is a codirector of the Computer Security and Industrial Cryptography (COSIC) Laboratory. Her research interests include circuits, processor architectures and design methodologies for real-time embedded systems for security, cryptography, digital signal processing, and wireless communications. This includes the influence of new technologies and new circuit solutions on the design of next-generation systems on chip. She was the program chair of the Ninth International Workshop on Cryptographic Hardware and Embedded Systems (CHES 07), the 19th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 08), and the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED 02). She was also the general chair of ISLPED 2003. She was a member of the executive committee of the 42nd and 43rd Design Automation Conference (DAC) as the design community chair. She is a senior member of the IEEE.
. For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
