Efficient Hardware Implementation of Fp-arithmetic for Pairing-Friendly Curves by Fan, Junfeng et al.
TRANSACTION ON COMPUTERS 1
Efficient Hardware Implementation of
Fp-arithmetic for Pairing-Friendly Curves
Junfeng Fan, Student Member, IEEE, Frederik Vercauteren,
and Ingrid Verbauwhede, Senior Member, IEEE
Abstract—This paper describes a new method to speed up Fp-arithmetic in hardware for pairing-friendly curves, such as the well-known
Barreto-Naehrig (BN) curves. We explore the characteristics of the modulus defined by these curves and choose curve
parameters such that Fp multiplication becomes more efficient. The proposed algorithm uses Montgomery reduction in a polynomial
ring combined with a coefficient reduction phase using a pseudo-Mersenne number. As an application we show that the performance
of pairings on BN curves in hardware can be significantly improved, resulting in a factor 2.5 speed-up compared with state-of-the-art
hardware implementations.
Index Terms—Pairing-Friendly Curves, Modular reduction
1 INTRODUCTION
A bilinear pairing is a map G1 × G2 → GT where
G1 and G2 are typically additive groups and GT is
a multiplicative group and the map is linear in each
component. Many pairings used in cryptography such
as the Tate pairing [4], ate pairing [23], R-ate pairing [26]
and optimal pairings [22], [33], choose G1 and G2 to
be specific cyclic subgroups of $E(\mathbb{F}_{p^k})$, and GT to be a
subgroup of $\mathbb{F}_{p^k}^*$.
The selection of parameters has a substantial impact
on the security and performance of a pairing. For exam-
ple, the underlying field, the type of curve, the order of
G1, G2 and GT should be carefully chosen such that it
offers sufficient security, but is still efficient to compute.
In this paper, we propose a new modular multiplication
algorithm for finite fields used in parametrized families
of pairing-friendly elliptic curves. The characteristic p of
such fields is given by the evaluation of a polynomial
$f(z) \in \mathbb{Z}[z]$ at an integer $\bar z \in \mathbb{Z}$. One of the most
important examples of such families is the Barreto-Naehrig (BN) curves [32],
where one has $p = 36\bar z^4 + 36\bar z^3 + 24\bar z^2 + 6\bar z + 1$
for some $\bar z \in \mathbb{Z}$ such that p is prime.
We show that when choosing $\bar z = 2^\tau + s$, where s is a
reasonably small number, the modular multiplication in
Fp can be substantially improved. Existing techniques to
speed up arithmetic in extension fields (see [13], [14] for
fast arithmetic in Fp2 , Fp6 and Fp12 ) can be used together
with our construction. As an application we show how to
choose parameters for BN curves and obtain a significant
improvement on the performance in hardware of the ate
and optimal ate pairing.
• The authors are with the Department of Electrical Engineering, Katholieke
Universiteit Leuven and IBBT, ESAT/SCD-COSIC, Kasteelpark Arenberg 10,
B-3001 Leuven-Heverlee, Belgium. E-mail: {jfan, fvercaut, iverbauw}@esat.kuleuven.be.
The remainder of the paper is organized as follows.
In Section 2 we review cryptographic pairings, pairing-friendly
curves and modular arithmetic. In Section 3 we
present our new modular multiplication algorithm and
compare its complexity with known algorithms. Section 4
provides parameters for several families of pairing-friendly
curves. As an application, Section 5 describes in
detail an architecture of an Fp-multiplier for BN curves
and discusses the results obtained by our hardware
implementation. Finally, Section 6 concludes the paper.
2 BACKGROUND
2.1 Bilinear Pairings
Let Fp be a finite field and let E be an elliptic curve
defined over Fp. Let r be a large prime dividing #E(Fp)
and k the embedding degree of E(Fp) with respect
to r, namely, the smallest positive integer k such that
$r \mid p^k - 1$. For any finite extension field K of Fp, denote
by E(K)[r] the K-rational r-torsion group of the curve.
For P ∈ E(K) and an integer s, let $f_{s,P}$ be a K-rational
function with divisor
\[ (f_{s,P}) = s(P) - ([s]P) - (s-1)(O), \]
where O is the point at infinity. This function is also
known as a Miller function [27], [28].
Let $G_1 = E(\mathbb{F}_p)[r]$, $G_2 = E(\mathbb{F}_{p^k})/rE(\mathbb{F}_{p^k})$ and
$G_3 = \mu_r \subset \mathbb{F}_{p^k}^*$ (the r-th roots of unity), then the reduced Tate
pairing [4] is a well-defined, non-degenerate, bilinear
pairing. Let P ∈ G1 and Q ∈ G2, then the reduced Tate
pairing of P, Q is computed as
\[ e(P,Q) = (f_{r,P}(Q))^{(p^k-1)/r}. \]
The ate pairing [23] is similar but with different G1 and
G2. Here we define $G_1 = E(\mathbb{F}_p)[r]$ and
$G_2 = E(\mathbb{F}_{p^k})[r] \cap \mathrm{Ker}(\pi_p - [p])$, where $\pi_p$ is the p-th power Frobenius
endomorphism, i.e. $\pi_p : E \to E : (x, y) \mapsto (x^p, y^p)$. Let
P ∈ G1, Q ∈ G2 and let $t := p + 1 - \#E(\mathbb{F}_p)$ be the trace
of Frobenius, then the ate pairing is also a well-defined,
non-degenerate bilinear pairing, and can be computed
as
\[ a(Q,P) = (f_{t-1,Q}(P))^{(p^k-1)/r}. \]
Algorithm 1 Computing the optimal ate pairing on BN curves [14]
Input: $P \in E(\mathbb{F}_p)[r]$, $Q \in E(\mathbb{F}_{p^k})[r] \cap \mathrm{Ker}(\pi_p - [p])$ and $a = 6\bar z + 2$.
Output: $R_a(Q,P)$.
1: $a = \sum_{i=0}^{L-1} a_i 2^i$.
2: $T \leftarrow Q$, $f \leftarrow 1$.
3: for i = L − 2 downto 0 do
4:   $T \leftarrow 2T$.
5:   $f \leftarrow f^2 \cdot l_{T,T}(P)$.
6:   if $a_i = 1$ then
7:     $T \leftarrow T + Q$.
8:     $f \leftarrow f \cdot l_{T,Q}(P)$.
9:   end if
10: end for
11: $f \leftarrow \left(f \cdot (f \cdot l_{aQ,Q}(P))^p \cdot l_{\pi(aQ+Q),aQ}(P)\right)^{(p^k-1)/r}$.
Return f.
The R-ate pairing [26] is a generalization of the ate
pairing and can be seen as an instantiation of optimal
pairings [22], [33]. Since the definition of the optimal ate
pairing really depends on the particular elliptic curve
one is using, we only provide the definition in the case
of BN curves: using the same G1 and G2 as for the ate
pairing, the optimal ate pairing on BN curves is defined
as
\[ R_a(Q,P) = \left(f \cdot (f \cdot l_{aQ,Q}(P))^p \cdot l_{\pi(aQ+Q),aQ}(P)\right)^{(p^k-1)/r}, \]
where $a = 6\bar z + 2$, $f = f_{a,Q}(P)$ and $l_{A,B}$ denotes the line
through points A and B.
Due to limited space, we only describe the algorithm
to compute the optimal ate pairing for BN curves. The
algorithms for Tate and ate pairings are similar, and can
be found in [14].
2.2 Pairing-Friendly curves
An elliptic curve E over Fp is called pairing-friendly
whenever there exists a large prime $r \mid \#E(\mathbb{F}_p)$ with
$r \ge \sqrt{p}$ and the embedding degree k is small enough,
e.g. $k \le \log_2(r)/8$. Furthermore, if $\#E(\mathbb{F}_p) = p + 1 - t$
with t the trace of Frobenius, then $t^2 - 4p$ should have
a very small squarefree part (e.g. less than $10^{10}$) to be
able to find the equation of the curve using the complex
multiplication (CM) method [3]. These restrictions imply
that pairing-friendly curves are hard to find and several
specialized constructions have been proposed (see [18]
for an excellent overview).
Many construction methods result in a parametrized
family of elliptic curves, i.e. r and p are given by the
evaluation of polynomials r(z) and f(z) at an integer
value z¯. To illustrate this type of curve, we provide sev-
eral important examples of complete families of curves.
Example 1: 112-bit security level Consider the following
family with k = 8 given in [18, Construction 6.10]:
The polynomials
\[ f(z) = \tfrac{1}{4}(81z^6 + 54z^5 + 45z^4 + 12z^3 + 13z^2 + 6z + 1) \]
\[ r(z) = 9z^4 + 12z^3 + 8z^2 + 4z + 1 \]
define a family of pairing-friendly curves with k = 8 and
CM by −1, which admits quartic twists. Furthermore,
finding the equation of the curve is straightforward due
to CM by −1. Given a z¯-value such that r(z¯) is a 224-bit
prime, this family provides 112-bit security.
Example 2: 128-bit security level The Barreto-Naehrig
curves [32] are by far the most important family of
pairing-friendly curves for the 128-bit security level.
These curves have k = 12 and are defined by
\[ f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1 \]
\[ r(z) = 36z^4 + 36z^3 + 18z^2 + 6z + 1. \]
Furthermore, since these curves have CM by −3, the
equation for such a curve is easy to find. Finally, they
also admit sextic twists which speeds up all types of
pairings. Given a z¯-value such that r(z¯) is a 256-bit
prime, this family provides 128-bit security.
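As a concrete illustration of such a parametrized family, the following Python sketch evaluates the two BN polynomials at an integer. The value 2^63 + 857 is the one listed for Example 2 in Table 2; the function names are illustrative and do not come from the paper, and the primality of p and r is not checked here.

```python
def bn_p(z):
    """Field characteristic p = f(z) of the BN family."""
    return 36 * z**4 + 36 * z**3 + 24 * z**2 + 6 * z + 1

def bn_r(z):
    """Group order r(z) of the BN family."""
    return 36 * z**4 + 36 * z**3 + 18 * z**2 + 6 * z + 1

z_bar = 2**63 + 857          # parameter from Table 2 (Example 2)
p, r = bn_p(z_bar), bn_r(z_bar)
t = p + 1 - r                # trace of Frobenius; equals 6*z^2 + 1 for this family
print(p.bit_length(), r.bit_length())
```

Both polynomials share the leading term 36z^4, which is why p and r have the same bit length in this family.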
Example 3: 256-bit security level For very high security
levels, the BN curves are somewhat ill-adapted and
it is better to use the following cyclotomic family with
k = 24 [9]. Recall that the 24-th cyclotomic polynomial
is $\Phi_{24}(z) = z^8 - z^4 + 1$, then
\[ f(z) = \tfrac{1}{3}(z-1)^2\,\Phi_{24}(z) + z, \qquad r(z) = \Phi_{24}(z). \]
Again, this family has CM by −3 so sextic twists are
possible. Given a z¯-value such that r(z¯) is a 512-bit
prime, this family provides 256-bit security.
2.3 Multiplication in Fp
We briefly recall the techniques for integer multiplication
and reduction. Given a modulus $p < 2^n$ and an integer
$c < 2^{2n}$, the following algorithms can be used to compute
c mod p.
Barrett reduction The Barrett reduction algorithm [5]
uses a precomputed value $\mu = \lfloor 2^{2n}/p \rfloor$ to help estimate
$c/p$, thus integer division is avoided. Dhem [15] proposed
an improved Barrett modular multiplication algorithm
which has a simplified final correction.
Montgomery reduction The Montgomery reduction
method [29] precomputes $p' = -p^{-1} \bmod R$, where R
normally is a power of two. Given c and p, it generates
q such that c + qp is a multiple of R. As a result, the
division of (c + qp) by R is exact and can be performed
by a shift operation.
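The Montgomery step described above can be sketched in a few lines of Python. This is only an illustration on plain integers with hypothetical function names; it assumes an odd modulus p, R = 2^n_bits, an input 0 ≤ c < pR, and Python 3.8 or later for the modular inverse via `pow`.

```python
def montgomery_setup(p, n_bits):
    """Precompute R = 2^n_bits and p' = -p^{-1} mod R (p must be odd)."""
    R = 1 << n_bits
    p_prime = (-pow(p, -1, R)) % R
    return R, p_prime

def montgomery_reduce(c, p, R, p_prime):
    """Return c * R^{-1} mod p for 0 <= c < p * R."""
    q = ((c % R) * p_prime) % R          # chosen so that c + q*p is divisible by R
    t = (c + q * p) >> (R.bit_length() - 1)   # exact division by R, done as a shift
    return t - p if t >= p else t        # t < 2p, so one subtraction suffices
```

Note that the result carries the extra factor R^{-1}, which is why operands are usually kept in Montgomery form throughout a longer computation.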
Chung-Hasan reduction In [11], [12], Chung and
Hasan proposed an efficient reduction method for low-weight
polynomial form moduli $p = f(\bar z) = \bar z^n + f_{n-1}\bar z^{n-1} + \cdots + f_1\bar z + f_0$,
where $|f_i| \le 1$. The resulting
modular multiplication is given in Algorithm 2. The
polynomial reduction phase is rather efficient since f(z)
is monic, making the polynomial long division (Steps 3-5)
simple. Barrett reduction is used to perform the divisions
required in Phase III. According to the implementation
results in [11], the Chung-Hasan algorithm is more efficient
than the traditional Barrett or
Montgomery reduction algorithms when the moduli are
large (see Fig. 5 in [11] for details). In [12], this algorithm
is further extended to monic polynomials with $|f_i| \le s$
where $s \ll \bar z$. Note that the polynomial reduction phase
is efficient only when f(z) is monic.
Algorithm 2 Chung-Hasan multiplication algorithm [11].
Input: positive integers $a = \sum_{i=0}^{n-1} a_i \bar z^i$, $b = \sum_{i=0}^{n-1} b_i \bar z^i$,
modulus $p = f(\bar z) = \bar z^n + f_{n-1}\bar z^{n-1} + \cdots + f_1\bar z + f_0$.
Output: polynomial representation of $c(\bar z) = a(\bar z)b(\bar z) \bmod p$.
1: Phase I: Polynomial Multiplication
2: $c(z) \leftarrow a(z)b(z) = \sum_{i=0}^{2n-2} c_i z^i$.
Phase II: Polynomial Reduction
3: for i = 2n − 2 down to n do
4:   $c(z) \leftarrow c(z) - c_i f(z) z^{i-n}$.
5: end for
Phase III: Coefficient Reduction
6: $c_n \leftarrow \lfloor c_{n-1}/\bar z \rceil$, $c_{n-1} \leftarrow c_{n-1} - c_n \bar z$.
7: $c(z) \leftarrow c(z) - c_n f(z)$.
8: for i = 0 to n − 1 do
9:   $q_i \leftarrow \lfloor c_i/\bar z \rceil$, $r_i \leftarrow c_i - q_i \bar z$.
10:  $c_{i+1} \leftarrow c_{i+1} + q_i$, $c_i \leftarrow r_i$.
11: end for
12: $c(z) \leftarrow c(z) - q_{n-1} f(z) z$.
Return c(z).
3 HYBRID MODULAR MULTIPLICATION
3.1 Polynomial Montgomery Reduction
In this section we introduce a new reduction algorithm
for polynomial form moduli that can be seen as the
dual algorithm to Chung-Hasan. Whereas Chung and
Hasan require the polynomial defining the modulus to
be monic, our algorithm requires the constant coefficient
to be ±1. As will be shown below this requirement is a
very natural one.
Let $f(z) = f_{n-1}z^{n-1} + \cdots + f_1 z + f_0 \in \mathbb{Z}[z]$ with
$f_0 \ne 0$ and assume the modulus is given by $p = f(\bar z)$
for some $\bar z \in \mathbb{Z}$. For an integer $c = \sum_{i=0}^{2n-2} c_i \bar z^i$, the
first step in the algorithm applies Montgomery reduction
in the polynomial representation $c(z) = \sum_{i=0}^{2n-2} c_i z^i$.
Montgomery reduction avoids expensive divisions by
computing $c(z)z^{-n} \bmod f(z)$, instead of $c(z) \bmod f(z)$
itself. The main idea is to solve for the polynomial q(z)
such that
\[ c(z) - q(z)f(z) \equiv 0 \bmod z^n, \]
which shows that $q(z) = c(z)g(z) \bmod z^n$ where
$g(z) \equiv f(z)^{-1} \bmod z^n$. Since gcd(f(z), z) = 1, we thus obtain
that
\[ (c(z) - q(z)f(z))/z^n \equiv c(z)z^{-n} \bmod f(z), \]
where the division by $z^n$ is exact and can be computed
by a simple right shift of the coefficients. Note that
$f_0 \ne 0$ implies that $g(z) \in \mathbb{Q}[z]$ exists, but g(z) does
not necessarily have integer coefficients. The following
lemma shows that the condition $f_0 = \pm 1$ is a natural
one.
Lemma 1: Let $f(z) = f_{n-1}z^{n-1} + \cdots + f_1 z + f_0 \in \mathbb{Z}[z]$
with $f_0 \ne 0$, and define $g_k(z) = f(z)^{-1} \bmod z^k$, then
a necessary and sufficient condition for the $g_k$ to have
integer coefficients is $f_0 = \pm 1$.
PROOF: The argument is classical and is a special case
of Newton lifting over the rational function field Q(z).
Indeed, $g_k(z) \equiv f(z)^{-1} \bmod z^k$ is the solution to the
linear equation (in the indeterminate X)
\[ f(z)X - 1 \equiv 0 \bmod z^k, \]
which can be solved using the Newton iteration:
$g_1(z) \equiv 1/f_0 \bmod z$ and then
\[ g_k(z) \equiv g_{k-1}(z) - (f(z)g_{k-1}(z) - 1)/f(z) \equiv 2g_{k-1}(z) - g_{k-1}^2(z)f(z) \bmod z^k. \]
As such $f_0 = \pm 1$ is clearly sufficient for $g_k(z)$ to be
defined over Z. However, it is also necessary, since
$g_k(z) \equiv f_0^{-1} \bmod z$. □
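Lemma 1 can be checked numerically. The sketch below computes the truncated inverse over the rationals; note that it uses a simple term-by-term recurrence rather than the Newton iteration of the proof, and the function name is hypothetical. The first input is the BN polynomial from Section 2.2; the second is an artificial variant with f0 = 2, chosen only to show the failure case.

```python
from fractions import Fraction

def inverse_mod_zk(f, k):
    """Coefficients of f(z)^{-1} mod z^k, with f given as [f0, f1, ...].

    Uses g_0 = 1/f_0 and g_j = -(1/f_0) * sum_{i=1..j} f_i * g_{j-i}, over Q.
    """
    f0_inv = Fraction(1, f[0])
    g = [f0_inv]
    for j in range(1, k):
        top = min(j, len(f) - 1)
        s = sum(Fraction(f[i]) * g[j - i] for i in range(1, top + 1))
        g.append(-f0_inv * s)
    return g

bn_f = [1, 6, 24, 36, 36]                      # f0 = 1: inverse stays in Z
g = inverse_mod_zk(bn_f, 5)
print([int(x) for x in g])

bad = inverse_mod_zk([2, 6, 24, 36, 36], 5)    # f0 = 2: inverse leaves Z
print(all(x.denominator == 1 for x in bad))
```

The first result is the inverse of the BN polynomial modulo z^5, whose negation is the g(z) used later in Section 5.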
3.2 Parallel version
Algorithm 3 describes our modular multiplication al-
gorithm for polynomial form moduli. The algorithm is
composed of four phases, i.e. polynomial multiplication,
a partial coefficient reduction, polynomial reduction and
coefficient reduction. The polynomial reduction uses the
Montgomery reduction, while the coefficient reduction
uses division. We call this algorithm Hybrid Modular
Multiplication (HMM).
For Algorithm 3 to be useful in practice, we need to
show that given a bounded input, the algorithm returns
an output which is similarly bounded. To this end we
require two short lemmata. The first lemma analyzes
the coefficient reduction in Phase IV, while the second
lemma gives bounds on the input, such that the output
satisfies the same bounds.
Lemma 2: Let $c(z) = \sum_{i=0}^{m} c_i z^i \in \mathbb{Z}[z]$ be a polynomial
of degree m. Assume that $|c_i| < B$ for i = 0, . . . , m − 1
and $|c_m| < D$ for bounds B and D, then the coefficient
reduction
  for i = 0 to m do
    $c_{i+1} \leftarrow c_{i+1} + (c_i \ \mathrm{div}\ \bar z)$, $c_i \leftarrow c_i \bmod \bar z$
  end for
results in $0 \le c_i < \bar z$ for i = 0, . . . , m and
\[ |c_{m+1}| < \frac{D}{\bar z} + \frac{B}{\bar z(\bar z - 1)}. \]
Algorithm 3 Parallel hybrid modular multiplication algorithm.
Input: positive integers $a = \sum_{i=0}^{n-1} a_i \bar z^i$, $b = \sum_{i=0}^{n-1} b_i \bar z^i$,
modulus $p = f(\bar z) = f_{n-1}\bar z^{n-1} + \cdots + f_1\bar z + f_0$ with
$f_0 = \pm 1$ and polynomial inverse $g(z) \equiv f(z)^{-1} \bmod z^n$.
Output: polynomial representation of $r(\bar z) \equiv a(\bar z)b(\bar z)\bar z^{-n} \bmod p$.
1: Phase I: Polynomial Multiplication
2: $c(z) = \sum_{i=0}^{2n-2} c_i z^i \leftarrow a(z)b(z)$.
3: Phase II: Coefficient Reduction
4: for i = 0 to n − 1 do
5:   $c_{i+1} \leftarrow c_{i+1} + (c_i \ \mathrm{div}\ \bar z)$, $c_i \leftarrow c_i \bmod \bar z$.
6: end for
7: Phase III: Polynomial Reduction
8: $q(z) \leftarrow (c(z) \bmod z^n)\,g(z) \bmod z^n$.
9: $c(z) \leftarrow (c(z) - q(z)f(z))/z^n$.
10: Phase IV: Coefficient Reduction
11: for i = 0 to n − 2 do
12:   $c_{i+1} \leftarrow c_{i+1} + (c_i \ \mathrm{div}\ \bar z)$, $c_i \leftarrow c_i \bmod \bar z$.
13: end for
Return $r(z) \leftarrow c(z)$.
PROOF: Using induction it is easy to see that in step i,
the size of $c_i \ \mathrm{div}\ \bar z$ is bounded by
\[ |c_i \ \mathrm{div}\ \bar z| < \sum_{k=1}^{i+1} |c_{i+1-k}|/\bar z^k < \sum_{k=1}^{i+1} B/\bar z^k < \frac{B}{\bar z - 1}. \]
The size of $c_m$ in step m − 1 therefore becomes
\[ |c_m| < D + \frac{B}{\bar z - 1}, \]
which shows that in step m, we obtain the bound
\[ |c_{m+1}| < \frac{D}{\bar z} + \frac{B}{\bar z(\bar z - 1)}. \]
□
Using Lemma 2 we are now ready to show that given
a bounded input, the result of Algorithm 3 is again
bounded. Recall that for a polynomial $h = \sum_i h_i z^i \in \mathbb{Z}[z]$,
the infinity norm is defined as $\|h\|_\infty = \max_i |h_i|$.
Lemma 3: Let $\tau = \lfloor \log_2(\bar z - 1) \rfloor$, $\phi = \lceil \log_2 \|f\|_\infty \rceil$ and
$\zeta = \lceil \log_2 \|g\|_\infty \rceil$, then if the input a(z), b(z) satisfies
\[ |a_i|, |b_i| \le 2^{\tau+\kappa} \quad \text{for } i = 0, \ldots, n-2, \]
\[ |a_{n-1}|, |b_{n-1}| \le 2^{\tau/2+\kappa}, \]
for some κ ≥ 1, then r(z) computed by Algorithm 3
satisfies
\[ |r_i| \le 2^{\tau+1} \quad \text{for } i = 0, \ldots, n-2, \]
\[ |r_{n-1}| \le 4n^2(2^{\zeta+\phi} + 2^{2\kappa}). \]
This shows that if $\tau \ge 2(\log_2(4n^2(2^{\zeta+\phi} + 2^{2\kappa})) - \kappa)$, then
r(z) satisfies the same bounds as the input.
PROOF: After step 2, the coefficients $c_i$ are clearly
bounded by $n2^{2\tau+2\kappa}$ for i = 0, . . . , 2n − 3 and
$|c_{2n-2}| \le 2^{\tau+2\kappa}$. Lemma 2 shows that after the first coefficient
reduction we have $|c_i| \le \bar z$ for i = 0, . . . , n − 1 and
$|c_n| \le n2^{2\tau+2\kappa+1}$.
The coefficients of q(z) are easily seen to be bounded
by $|q_i| \le n2^{\tau+\zeta+1}$ so after step 9, we obtain
\[ |c_i| \le n2^{2\tau+2\kappa+1} + n^2 2^{\tau+\zeta+\phi+1} =: B, \]
\[ |c_{n-2}| \le 2^{\tau+2\kappa} + n^2 2^{\tau+\zeta+\phi+1} =: D. \]
Applying Lemma 2 again with the above B and D
shows that
\[ |c_{n-1}| \le \frac{D}{\bar z} + \frac{B}{\bar z(\bar z-1)} \le n^2 2^{\zeta+\phi+2} + n2^{2(\kappa+1)} \le 4n^2(2^{\zeta+\phi} + 2^{2\kappa}). \]
□
Given a polynomial f(z) it is easy to compute the
inverse g(z) ≡ f(z)−1 mod zn and thus to determine the
exact values for φ and ζ in Lemma 3. For each polyno-
mial, we therefore obtain a very explicit lower bound on
z¯ such that Algorithm 3 can be applied repeatedly.
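To make the lower bound of Lemma 3 tangible, the following sketch evaluates it for the BN polynomial. The coefficient list of g is the inverse of f modulo z^5 (it can be recomputed from f); the function name and the choice κ = 1 are assumptions made for this illustration.

```python
import math

def lemma3_tau_bound(f, g, kappa=1):
    """Return (phi, zeta, lower bound on tau) following Lemma 3."""
    n = len(f)                                   # f has n coefficients f0..f_{n-1}
    phi = math.ceil(math.log2(max(abs(x) for x in f)))
    zeta = math.ceil(math.log2(max(abs(x) for x in g)))
    top = 4 * n**2 * (2**(zeta + phi) + 2**(2 * kappa))   # bound on |r_{n-1}|
    return phi, zeta, 2 * (math.log2(top) - kappa)

bn_f = [1, 6, 24, 36, 36]
bn_g = [1, -6, 12, 36, -324]      # f^{-1} mod z^5
phi, zeta, tau_min = lemma3_tau_bound(bn_f, bn_g)
print(phi, zeta, tau_min)
```

Under these assumptions the bound comes out well below τ = 63, the value used for BN curves in Table 2.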
3.3 Digit-serial version
Algorithm 4 presents the digit-serial version of Algorithm 3
by interleaving polynomial multiplication and
reduction. While the parallel version uses the complete
polynomial $g(z) \equiv f^{-1}(z) \bmod z^n$, the digit-serial reduction
involves only $g_0 = \pm 1$. On the other hand, the parallel
version could possibly use Karatsuba's algorithm [25]
in steps 2, 8 and 9 to reduce the number of sub-word
multiplications. The coefficient reduction phase is the
same as in Algorithm 3.
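Algorithm 4 Digit-serial hybrid modular multiplication algorithm
Input: $a(\bar z) = \sum_{i=0}^{n-1} a_i \bar z^i$, $b(\bar z) = \sum_{i=0}^{n-1} b_i \bar z^i$, and modulus
$p = f(\bar z) = \sum_{i=0}^{n-1} f_i \bar z^i$ with $f_0 = \pm 1$.
Output: polynomial representation of $r(\bar z) \equiv a(\bar z)b(\bar z)\bar z^{-n} \bmod f(\bar z)$.
1: $c(z) = \sum_{i=0}^{n-1} c_i z^i \leftarrow 0$.
2: for i = 0 to n − 1 do
3:   $c(z) \leftarrow c(z) + a(z)b_i$.
4:   $\mu \leftarrow c_0 \ \mathrm{div}\ \bar z$, $\gamma \leftarrow c_0 \bmod \bar z$.
5:   $h(z) \leftarrow (f_{n-1}z^{n-1} + \cdots + f_1 z + f_0)(-f_0\gamma)$.
6:   $c(z) \leftarrow (c(z) + h(z))/z + \mu$.
7: end for
8: for i = 0 to n − 2 do
9:   $c_{i+1} \leftarrow c_{i+1} + (c_i \ \mathrm{div}\ \bar z)$, $c_i \leftarrow c_i \bmod \bar z$.
10: end for
Return $r(z) \leftarrow c(z)$.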
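A software model of Algorithm 4 may help when checking the steps. The sketch below follows the listing on Python integers; the function name and the list-based polynomial representation are choices made here, not part of the paper. Step 6 is implemented by noting that c0 − γ = µ·z̄, so after the division by z the quotient µ is simply added to the new constant coefficient.

```python
def hmm_digit_serial(a, b, f, z_bar):
    """Digit-serial HMM on coefficient lists [x0, ..., x_{n-1}].

    Returns r with r(z_bar) = a(z_bar) * b(z_bar) * z_bar^(-n) mod f(z_bar).
    """
    n = len(f)
    assert f[0] in (1, -1)
    c = [0] * n
    for i in range(n):
        c = [cj + aj * b[i] for cj, aj in zip(c, a)]   # step 3
        mu, gamma = divmod(c[0], z_bar)                # step 4
        h = [fj * (-f[0] * gamma) for fj in f]         # step 5
        # step 6: constant term of c + h equals mu * z_bar; shift and carry mu
        c = [c[j] + h[j] for j in range(1, n)] + [0]
        c[0] += mu
    for i in range(n - 1):                             # steps 8-10
        q, c[i] = divmod(c[i], z_bar)
        c[i + 1] += q
    return c
```

Python's `divmod` returns a non-negative remainder for a positive divisor, so γ lies in [0, z̄) here; a hardware design may handle signs differently.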
The following lemma is the analogue of Lemma 3, but
for the digit-serial version.
Lemma 4: Let $\tau = \lfloor \log_2(\bar z - 1) \rfloor$ and $\phi = \lceil \log_2 \|f\|_\infty \rceil$,
then if $n^2 < \bar z$ and the input $a(\bar z)$, $b(\bar z)$ satisfies
\[ |a_i|, |b_i| \le 2^{\tau+\kappa} \quad \text{for } i = 0, \ldots, n-2, \]
\[ |a_{n-1}|, |b_{n-1}| \le 2^{\tau/2+\kappa}, \]
for some κ ≥ 1, then r(z) computed by Algorithm 4
satisfies
\[ |r_i| \le 2^{\tau+1} \quad \text{for } i = 0, \ldots, n-2, \]
\[ |r_{n-1}| \le 2(n+2)(2^{2\kappa} + 2^{\phi}). \]
This shows that if $\tau \ge 2(\log_2(2(n+2)(2^{2\kappa} + 2^{\phi})) - \kappa)$,
then r(z) satisfies the same bounds as the input.
PROOF: The idea of the proof is similar to the proof of
Lemma 3, i.e. before the final coefficient reduction we
want to show that the coefficient $c_{n-2}$ is small enough
to ensure that after reduction, $c_{n-1}$ is small. Note that at
the beginning of each iteration in step 2, the degree of
c(z) is at most n − 2, which shows that after step 7 we
have $c_{n-2} = a_{n-1}b_{n-1} - \gamma_{n-1}f_{n-1}f_0$. Here $\gamma_i$ denotes the
γ computed in step 4 in the i-th iteration. Since $|\gamma_i| \le |\bar z|$,
this shows that $|c_{n-2}| \le 2^{\tau+2\kappa} + 2^{\tau+\phi+1}$.
Assume that the coefficients $c_i$ for i = 0, . . . , n − 3
are bounded by B (to be determined below) and let
$D = 2^{\tau+2\kappa} + 2^{\tau+\phi+1}$, then again we can apply Lemma 2
resulting in
\[ |c_{n-1}| \le \frac{D}{\bar z} + \frac{B}{\bar z(\bar z-1)} \le 2^{2\kappa} + 2^{\phi+1} + \frac{B}{\bar z(\bar z-1)}. \]
Obtaining the bound B is easy too: denote with $B_i$ the
bound on $\|c(z)\|_\infty$ at the start of the i-th iteration, then
we have $B_0 = 0$ and for i ≥ 1:
\[ B_i \le B_{i-1}\left(1 + \frac{1}{\bar z}\right) + 2^{2\tau+2\kappa+1} + 2^{\tau+\phi+1}. \]
Let $\alpha = (1 + 1/\bar z)$ and $\beta = 2^{2\tau+2\kappa+1} + 2^{\tau+\phi+1}$, then we
need to solve the recursion $B_i \le \alpha B_{i-1} + \beta$ with $B_0 = 0$.
This results in the bound $B_n \le \beta(\alpha^n - 1)/(\alpha - 1)$ and
since $\alpha = 1 + 1/\bar z$ this becomes $B_n \le \beta\bar z(\alpha^n - 1)$. Writing
out $\alpha^n - 1$ explicitly gives
\[ \alpha^n - 1 = \sum_{k=1}^{n} \binom{n}{k}\frac{1}{\bar z^k} < \frac{n}{\bar z} + \sum_{j=2}^{\infty} \frac{n^j}{j!\,\bar z^j}. \]
Assuming that $n^2 \le \bar z$, we can easily bound the sum
above by $(e-1)/\bar z$, so $\alpha^n - 1 \le (n + e - 1)/\bar z$. So we
finally obtain
\[ B = B_n \le (n+e-1)(2^{2\tau+2\kappa+1} + 2^{\tau+\phi+1}). \]
As a result, the bound on $c_{n-1}$ becomes
\[ |c_{n-1}| \le 2^{2\kappa} + 2^{\phi+1} + (n+e-1)(2^{2\kappa+1} + 2^{\phi-\tau+1}) \le (n+2)(2^{2\kappa+1} + 2^{\phi+1}). \]
□
It is important to point out that both Algorithm 3
and Algorithm 4 may involve negative numbers as
intermediate or final results. As a result, a direct use
of these algorithms requires the handling of negative
numbers. In hardware implementations, one can avoid
signed multiplication by adding a multiple of $f(\bar z)$ to
$r(\bar z)$. We describe this method in an example in Section 5.
3.4 Faster coefficient reduction
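Algorithms 3 and 4 require division by $\bar z$. As in Chung-Hasan's
algorithm, division can be performed with Barrett's
reduction algorithm [11]. However, the complexity
of the division step can be reduced if $\bar z$ is a pseudo-Mersenne
number. Algorithm 5 replaces division by $\bar z$
with multiplication by s for $\bar z = 2^\tau + s$ where s is small.
Algorithm 5 Division by $\bar z = 2^\tau + s$
Input: a, $\bar z = 2^\tau + s$ with $0 < s < 2^{\lfloor \tau/2 \rfloor}$.
Output: $(\mu, \epsilon)$ with $a = \mu\bar z + \epsilon$, $|\epsilon| < \bar z$.
1: $\mu \leftarrow 0$, $\epsilon \leftarrow a$.
2: while $|\epsilon| \ge \bar z$ do
3:   $\rho \leftarrow \epsilon \ \mathrm{div}\ 2^\tau$, $\epsilon \leftarrow \epsilon \bmod 2^\tau$.
4:   $\mu \leftarrow \mu + \rho$, $\epsilon \leftarrow \epsilon - s\rho$.
5: end while
Return $\mu$, $\epsilon$.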
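Algorithm 5 translates almost line by line into software. The sketch below is one possible reading, assuming floor semantics for div and a non-negative result for mod (which is what Python's shift and mask give on signed integers); the function name is hypothetical. Each round uses the identity ρ·2^τ = ρ·z̄ − s·ρ, so the invariant a = µ·z̄ + ε is preserved.

```python
def divide_by_pseudo_mersenne(a, tau, s):
    """Return (mu, eps) with a == mu * (2^tau + s) + eps and |eps| < 2^tau + s.

    Only shifts, masks and multiplications by the small constant s are used.
    """
    z_bar = (1 << tau) + s
    mask = (1 << tau) - 1
    mu, eps = 0, a
    while abs(eps) >= z_bar:
        rho, low = eps >> tau, eps & mask   # eps = rho * 2^tau + low
        mu += rho
        eps = low - s * rho                 # rho * 2^tau = rho * z_bar - s * rho
    return mu, eps
```

As in the algorithm, the returned remainder may be negative; only its absolute value is guaranteed to be below z̄.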
The following lemma gives the maximum number of
iterations for Algorithm 5 to finish.
Lemma 5: Define π such that $|a| \le 2^{\pi}$ and ν such
that $0 < s \le 2^{\nu-1}$, then Algorithm 5 requires at most
$\lceil (\pi-\tau-1)/(\tau-\nu) \rceil + 1$ iterations.
PROOF: After one iteration we have
\[ |\epsilon| = |\epsilon - s\rho| \le |\epsilon| + s|\rho| \le (2^\tau - 1) + 2^{\pi+\nu-1-\tau}. \]
To bound the right hand side, we need to consider the
two cases:
• $(2\tau - \nu + 1) \le \pi$, then $|\epsilon| \le 2^{\pi-\tau+\nu}$, so in each
iteration the bit length is decreased by τ − ν.
• $(2\tau - \nu + 1) > \pi$, then $|\epsilon| < 2^{\tau+1}$, so either ε is reduced
or one last step is needed
\[ |\epsilon| = |\epsilon - s\rho| \le (2^\tau - 1) + s < \bar z. \]
In conclusion: $\lceil (\pi - (2\tau-\nu+1))/(\tau-\nu) \rceil$ iterations are needed to
reduce to the second case and then a maximum of two
iterations is needed to finish reduction in the second
case. In total, the algorithm thus requires $\lceil (\pi-\tau-1)/(\tau-\nu) \rceil + 1$
iterations. □
When using Algorithm 5 to perform steps 4 and 9 in
Algorithm 4, at most three iterations are required. This
is guaranteed if $|c_i| \le 2^{3\tau-2\nu+1}$ for 0 ≤ i ≤ n − 1. Recall
that $c_i$ satisfies the following bounds:
\[ |c_i| \le B = (n+e-1)(2^{2\tau+2\kappa+1} + 2^{\tau+\phi+1}), \]
for 0 ≤ i < n − 2, and
\[ |c_{n-2}| \le D = 2^{\tau+2\kappa} + 2^{\tau+\phi+1}. \]
Since B > D, it is sufficient to ensure $B \le 2^{3\tau-2\nu+1}$.
Since we typically have φ ≤ τ,
\[ \log_2 B \le \log_2(n+e-1) + 2\tau + 2\kappa + 2. \]
By choosing $\tau \ge 2(\kappa+\nu) + \log_2(n+e-1) + 1$, we ensure
\[ \log_2 B \le 3\tau - 2\nu + 1. \]
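The sufficient condition above is easy to check for a concrete parameter set. The following sketch evaluates the bound B of Lemma 4 in the log domain and compares it with 3τ − 2ν + 1; the function name is hypothetical, and the sample values (n = 5, τ = 63, ν = 11, φ = 6, κ = 1) are those given for BN curves in Tables 2 and 3.

```python
import math

def three_iterations_suffice(n, tau, nu, phi, kappa=1):
    """Check B <= 2^(3*tau - 2*nu + 1) for B = (n+e-1)(2^(2tau+2kappa+1) + 2^(tau+phi+1))."""
    log2_B = math.log2(n + math.e - 1) + math.log2(
        2**(2 * tau + 2 * kappa + 1) + 2**(tau + phi + 1))
    return log2_B <= 3 * tau - 2 * nu + 1

print(three_iterations_suffice(n=5, tau=63, nu=11, phi=6))
```

For small τ the condition fails, which reflects the requirement that τ be large relative to κ and ν.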
3.5 Complexity
We compare the complexity of the proposed algorithm
with the algorithms by Barrett and Montgomery. Here τ,
φ are defined as in Lemma 3, namely, $\tau = \lfloor \log_2(\bar z - 1) \rfloor$,
$\phi = \lceil \log_2 \|f\|_\infty \rceil$, and $\bar z = 2^\tau + s$ where $0 < s \le 2^{\nu-1}$. The
complexity of each step in Algorithm 4 is as follows.
• Step 3: for i ≤ n − 2, computing $a(z)b_i$ involves one
$(\lceil \tau/2 \rceil + \kappa) \times (\tau + \kappa)$ and (n − 1) times
$(\tau + \kappa) \times (\tau + \kappa)$ multiplications. For i = n − 1, computing $a(z)b_i$
involves one $(\lceil \tau/2 \rceil + \kappa) \times (\lceil \tau/2 \rceil + \kappa)$ and (n − 1) times
$(\lceil \tau/2 \rceil + \kappa) \times (\tau + \kappa)$ multiplications.
• Step 5: computing $f(z)(-f_0\gamma)$ needs (n − 1) times
τ × φ multiplications.
• Steps 4 and 9: as discussed above, three iterations are
required at most. Writing |ρ| for the bit length of ρ, we have
$|\rho| \le \log_2(n+e-1) + \tau + 2\kappa + 2$ and
$|\rho| \le \log_2(n+e-1) + \nu + 2\kappa + 3$ in the
first iteration and second iteration, respectively. For
the third iteration, |ρ| ≤ 1 is guaranteed. The cost of
step 4 or 9 is one $\nu \times (\tau + \lceil \log_2(n+e-1) \rceil + 2\kappa + 2)$
multiplication and one $\nu \times (\lceil \log_2(n+e-1) \rceil + \nu + 2\kappa + 3)$
multiplication.
Table 1 compares the complexity of the different mul-
tiplication algorithms. Note that φ and ν are chosen to be
small, so multiplication with f(z) and s can be efficiently
performed.
The complexity of the HMM algorithm highly de-
pends on the chosen curve parameters since they de-
termine {τ, ν, n, φ, κ}. Each family of pairing-friendly
curves is defined by a fixed polynomial f(z), so n and
φ are constant for a given family. The size of τ is set by
the desired security level. To achieve lower complexity,
it is thus desirable to choose small ν and κ. According
to Lemma 4, κ can be as small as 1 and the size of ν is
determined by s where z¯ = 2τ + s.
Computing $ABR^{-1} \bmod M$ using conventional Montgomery
multiplication requires $2w^2 + w$ sub-word multiplications
(see [10] for details), where w is the number
of digits in A and B. Note that w = n − 1 here since we
need one more digit to represent A(z) or B(z). Barrett
multiplication has approximately the same complexity
as Montgomery's algorithm [11].
Compared to τ , the parameters φ and ν are much
smaller. This makes the polynomial reduction and co-
efficient reduction very efficient. We present some pa-
rameters for the examples in Section 2.2 and compare
the complexities of different multiplication algorithms.
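The operation counts discussed in this section can be tabulated with a short script. The sketch below evaluates the Montgomery count 2w² + w with w = n − 1 and the six HMM counts as they appear in Table 1, for the values of n used in the three example families (7, 5 and 11); the function names are illustrative only.

```python
def montgomery_count(n):
    """Sub-word multiplications 2w^2 + w with w = n - 1 digits."""
    w = n - 1
    return 2 * w * w + w

def hmm_counts(n):
    """Multiplication counts of Algorithm 4, ordered as the columns of Table 1."""
    return (1, 2 * (n - 1), (n - 1) ** 2, n * (n - 1), 2 * n - 1, 2 * n - 1)

for name, n in [("Example 1", 7), ("Example 2", 5), ("Example 3", 11)]:
    print(name, montgomery_count(n), hmm_counts(n))
```

The HMM counts refer to multipliers of different operand sizes, so the totals are not directly comparable with the single Montgomery figure without taking those sizes into account.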
4 PARAMETER SELECTION FOR PAIRING-
FRIENDLY CURVES
In this section, we present values for z¯ for each of the
three examples given in Section 2.2. Note that for the first
and third family, f(z) does not have integral coefficients,
but 3f(z) or 4f(z) does, so we simply work modulo
these polynomials.
Table 2 contains values z¯ that lead to efficient instantia-
tions of the HMM algorithm. Furthermore, the bit-length
of z¯ is chosen to reflect the ideal security level at which
the respective families of curves should be used.
Table 3 compares the complexity of HMM and Mont-
gomery’s algorithm using parameter sets of Table 2. In
hardware designs, one can customize the multipliers for
specific operand sizes. HMM takes advantage of this
freedom to reduce the complexity of the multiplier. For
example, a 6 × 63-bit multiplier is much smaller and
faster than a 64×64-bit multiplier. Compared with Mont-
gomery multiplication, the HMM has a significantly
lower complexity in the context of hardware implemen-
tation.
Note that the complexity of HMM can be reduced
further by exploiting the characteristics of the coefficients
of f(z). For example, BN curves have $f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1$,
and $f(z)(-f_0\gamma)$ can be computed
as follows:
• Step 1: $6\gamma = 2^2\gamma + 2\gamma$;
• Step 2: $24\gamma = 2^2(6\gamma)$;
• Step 3: $36\gamma = 24\gamma + 2(6\gamma)$;
Instead of performing four 63 × 6 multiplications in
each iteration, four shift and two addition operations
are performed. Example 3 has φ = 1, resulting in even
larger savings in the polynomial reduction step, since no
multiplications are required as reflected in Table 3 by the
1× 64 cell.
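The shift-and-add evaluation for BN curves described in this section can be written out as follows. This is a minimal sketch with hypothetical function names; it assumes f0 = 1, as in the BN polynomial, and models shifts with Python's `<<` operator.

```python
def bn_scaled_gamma(gamma):
    """Return (6g, 24g, 36g) using four shifts and two additions."""
    g6 = (gamma << 2) + (gamma << 1)   # 6g  = 2^2 * g + 2 * g
    g24 = g6 << 2                      # 24g = 2^2 * (6g)
    g36 = g24 + (g6 << 1)              # 36g = 24g + 2 * (6g)
    return g6, g24, g36

def bn_h(gamma):
    """Coefficients [h0..h4] of h(z) = f(z) * (-f0 * gamma) for the BN polynomial."""
    g6, g24, g36 = bn_scaled_gamma(gamma)
    return [-gamma, -g6, -g24, -g36, -g36]
```

The value 36γ is needed twice (for z^3 and z^4) but is computed only once.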
5 APPLICATION TO BN CURVES
In this section, we propose an architecture for HMM in
hardware. Note that Fan et al. described the first digit-
serial architecture for the HMM algorithm [17] for BN
curves, where polynomial multiplication and reduction
are interleaved. In the architecture proposed in this
paper, polynomial multiplication and reduction are sep-
arated as in Algorithm 3. We show that this architecture
is flexible and can achieve higher throughput than the
one from [17]. We choose the same parameters as [17],
namely, $\bar z = 2^{63} + s$ and $s = 2^9 + 2^8 + 2^6 + 2^4 + 2^3 + 1$. With
these parameters, the implementation achieves 128-bit
security.
Algorithm 6 shows the HMM algorithm for BN curves.
Since $f(z) = 36z^4 + 36z^3 + 24z^2 + 6z + 1$, we have
$g(z) \equiv -f^{-1}(z) \equiv 324z^4 - 36z^3 - 12z^2 + 6z - 1 \bmod z^5$. Note that
both f(z) and g(z) have relatively small coefficients. As a
result, the polynomial reduction phase can be efficiently
implemented.
5.1 HMM Multiplier
Figure 1 shows the architecture of the multiplier. It
consists of a row of multipliers to carry out polynomial
multiplication, five “Mod-1” blocks to perform the first
coefficient reduction, a module to accumulate the partial
products, a module to perform polynomial reduction
and four “Mod-2” blocks to perform the second coef-
ficient reduction. Figure 1 also gives the bit-length of
input/output of each block.
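TABLE 1
Complexity comparison of different modular multiplication algorithms
($\bar z = 2^\tau + s$, $0 < s \le 2^{\nu-1}$, $\phi = \lceil \log_2 \|f\|_\infty \rceil$, $\delta_1 = \tau + \lceil \log_2(n+e-1) \rceil + 2\kappa + 2$, $\delta_2 = \lceil \log_2(n+e-1) \rceil + \nu + 2\kappa + 3$)
Algorithm | $(\lceil\tau/2\rceil+\kappa) \times (\lceil\tau/2\rceil+\kappa)$ | $(\lceil\tau/2\rceil+\kappa) \times (\tau+\kappa)$ | $(\tau+\kappa) \times (\tau+\kappa)$ | $\phi \times \tau$ | $\delta_1 \times \nu$ | $\delta_2 \times \nu$
Barrett | - | - | $2(n-1)^2 + n - 1$ | - | - | -
Montgomery | - | - | $2(n-1)^2 + n - 1$ | - | - | -
HMM (Alg. 4) | 1 | $2(n-1)$ | $(n-1)^2$ | $n(n-1)$ | $2n-1$ | $2n-1$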
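TABLE 2
Selection of $\bar z$ for curves ($\bar z = 2^\tau + s$, $0 < s \le 2^{\nu-1}$, $\phi = \lceil \log_2 \|f\|_\infty \rceil$)
Family | k | $\bar z$ | φ | ν | $\lceil \log_2 p \rceil$ | $\lceil \log_2 r \rceil$
Example 1 | 8 | $2^{39} + 1175$ | 7 | 12 | 239 | 159
Example 2 | 12 | $2^{63} + 857$ | 6 | 11 | 258 | 258
Example 3 | 24 | $2^{64} + 11757$ | 1 | 15 | 639 | 513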
TABLE 3
Multiplication complexity for each set of parameters
Example 1: 80-bit security (n = 7, τ = 39, ν = 12, φ = 7, κ = 1, δ1 = 47, δ2 = 21)
Algorithm | 21×21 | 21×40 | 40×40 | 7×39 | 47×12 | 21×12
Montgomery | - | - | 78 | - | - | -
HMM (Alg. 4) | 1 | 12 | 36 | 42 | 13 | 13
Example 2: 128-bit security (n = 5, τ = 63, ν = 11, φ = 6, κ = 1, δ1 = 70, δ2 = 19)
Algorithm | 33×33 | 33×64 | 64×64 | 6×63 | 70×11 | 19×11
Montgomery | - | - | 36 | - | - | -
HMM (Alg. 4) | 1 | 8 | 16 | 20 | 9 | 9
Example 3: 256-bit security (n = 11, τ = 64, ν = 15, φ = 1, κ = 1, δ1 = 72, δ2 = 24)
Algorithm | 33×33 | 33×65 | 65×65 | 1×64 | 72×15 | 24×15
Montgomery∗ | - | - | 210 | - | - | -
HMM (Alg. 4) | 1 | 20 | 100 | 110 | 21 | 21
∗ 64 bits are used for each digit (10×64 = 640), thus a 64×64-bit multiplier is used.
Phase I. Five integer multipliers, including four 65×65
and one 65× 32 multipliers, are used to carry out poly-
nomial multiplication. Clearly, given enough area, the
polynomial multiplication stage can be fully parallelized
(using 13 multipliers [30]). The architecture used here
is a tradeoff between area and throughput. Using the
schoolbook method, it requires 5 cycles to finish Phase I.
Each 65×65 multiplier is implemented using a two-level
Karatsuba’s method, and each multiplication has a delay
of 7 cycles.
Phase II. The partial products ($a_i b_j$, 0 ≤ i, j < n) are
reduced immediately after they are generated. Figure
1(d) shows the structure of the “Mod-1” block. Since
$s = 2^5 \cdot (2^4 + 2^3) + 2^6 + (2^4 + 2^3) + 1$, multiplication by s is
implemented with four additions. The output of Phase II
is then accumulated and shifted in the accumulator,
shown in Figure 1(f). Note that the output buffers of the
adders, except the rightmost one, should be
set to zero for each Fp multiplication.
The “Mod-1” block only performs one round of reduction,
thus it does not give fully reduced results. It is easy
to see that the output of “Mod-1” is always less than $2^{77}$.
We shall see this is good enough for this architecture.
Phase III. Once the partially reduced results ($c_8, \ldots, c_0$)
are ready, Phase III is applied. The polynomial reduction
is performed with only addition and shift operations, e.g.
$6\alpha = 2^2\alpha + 2\alpha$, $9\alpha = 2^3\alpha + \alpha$ and $36\alpha = 2^5\alpha + 2^2\alpha$.
Since the output of “Mod-1” is less than $2^{77}$, we have
$|c_i| < (i+1) \cdot 2^{77}$ for 0 ≤ i ≤ 4. In the computation of
q(z) in Algorithm 6, $q_4 = -c_4 + 6(c_3 - 2c_2 - 6(c_1 - 9c_0))$,
resulting in $|q_4| < (5 + 6 \cdot (4 + 2 \cdot 3 + 6 \cdot (2 + 9))) \cdot 2^{77} = 461 \cdot 2^{77}$,
and hence $|36q_4| < 16596 \cdot 2^{77}$ for the leading coefficient of h(z).
Likewise, we can compute the size of $h_i$ and
$v_i$ for 0 ≤ i ≤ 3. One can verify that v(z) has coefficients
$|v_i| < 2^{92}$ for 0 ≤ i ≤ 3.
Phase IV. The “Mod-2” block is implemented in the
same way as “Mod-1”. However, the input of “Mod-2” is
only 93 bits wide. Thus, the output of the proposed multiplier
has the following bounds: $|r_i| < 2^{63} + 2^{41}$, 0 ≤ i ≤ 3, and
$|r_4| < 2^{30}$.
Note that the resulting polynomial r(z) may have
negative coefficients. For future multiplications, negative
coefficients as input are not desirable. We can ensure
positive coefficients by adding to r(z) the following
polynomial:
\[ l(z) = (36\vartheta - 2)z^4 + (36\vartheta + 2\bar z - 2)z^3 + (24\vartheta + 2\bar z - 2)z^2 + (6\vartheta + 2\bar z - 2)z + (\vartheta + 2\bar z), \]
where $\vartheta = 2^{25}$. One can verify that $l(\bar z) = 2^{25} f(\bar z)$. Let
$r'(z) = \sum_{i=0}^{4} r'_i z^i = r(z) + l(z)$, then we have $0 \le r'_4 < 2^{32}$.
For 0 ≤ i ≤ 3, $2\bar z < l_i$ and $|r_i| < 2\bar z$, thus we have $r'_i > 0$.
On the other hand, $r_i + l_i < 2(2^{63} + s) + 36\vartheta - 2 + 2^{63} + 2^{41} < 2^{65}$.
Thus, r'(z) has only positive coefficients and satisfies
the input bounds.
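The claim that l(z̄) is a multiple of the modulus can be confirmed directly. The following sketch evaluates l(z) and f(z) at z̄ = 2^63 + 857 with ϑ = 2^25, as given in the text; the helper name `evaluate` is introduced here for illustration.

```python
z_bar = 2**63 + 857
theta = 2**25
f = [1, 6, 24, 36, 36]                  # f0..f4 of the BN polynomial
l = [theta + 2 * z_bar,                 # l0
     6 * theta + 2 * z_bar - 2,         # l1
     24 * theta + 2 * z_bar - 2,        # l2
     36 * theta + 2 * z_bar - 2,        # l3
     36 * theta - 2]                    # l4

def evaluate(coeffs, z):
    """Evaluate a polynomial given by its coefficient list [c0, c1, ...] at z."""
    return sum(c * z**i for i, c in enumerate(coeffs))

p = evaluate(f, z_bar)
print(evaluate(l, z_bar) == theta * p)
```

The terms in 2z̄ telescope between neighbouring coefficients, which is why adding l(z) changes the represented integer only by ϑ·p.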
Fig. 1. Fp multiplier using algorithm HMMB.
Algorithm 6 Parallel hybrid modular multiplication algorithm for BN curves.
Input: positive integers $a = \sum_{i=0}^{4} a_i \bar z^i$, $b = \sum_{i=0}^{4} b_i \bar z^i$,
modulus $p = f(\bar z) = 36\bar z^4 + 36\bar z^3 + 24\bar z^2 + 6\bar z + 1$.
Output: polynomial representation of $r(\bar z) \equiv a(\bar z)b(\bar z)\bar z^{-5} \bmod p$.
1: Phase I: Polynomial Multiplication
2: $c(z) = \sum_{i=0}^{8} c_i z^i \leftarrow a(z)b(z)$.
3: Phase II: Coefficient Reduction
4: for i = 0 to 4 do
5:   $c_{i+1} \leftarrow c_{i+1} + (c_i \ \mathrm{div}\ \bar z)$, $c_i \leftarrow c_i \bmod \bar z$.
6: end for
7: Phase III: Polynomial Reduction
8: $q(z) = \sum_{i=1}^{4} q_i z^i \leftarrow (-c_4 + 6(c_3 - 2c_2 - 6(c_1 - 9c_0)))z^4$
   $+ (-c_3 + 6(c_2 - 2c_1 - 6c_0))z^3$
   $+ (-c_2 + 6(c_1 - 2c_0))z^2$
   $+ (-c_1 + 6c_0)z$.
9: $h(z) = \sum_{i=0}^{3} h_i z^i \leftarrow (36q_4)z^3$
   $+ 36(q_4 + q_3)z^2$
   $+ 12(2q_4 + 3(q_3 + q_2))z$
   $+ 6(q_4 + 4q_3 + 6(q_2 + q_1))$.
10: $v(z) \leftarrow c(z)/z^5 + h(z)$;
11: Phase IV: Coefficient Reduction
12: for i = 0 to 3 do
13:   $v_{i+1} \leftarrow v_{i+1} + (v_i \ \mathrm{div}\ \bar z)$, $v_i \leftarrow v_i \bmod \bar z$.
14: end for
Return $r(z) \leftarrow v(z)$.
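Algorithm 6 can be modelled in software to check the formulas for q(z) and h(z). The sketch below is a plain integer model, not a description of the hardware data path: it assumes z̄ = 2^63 + 857, reads c(z)/z^5 in step 10 as the upper coefficients c5, ..., c8, and uses hypothetical names (`Z_BAR`, `carry`, `hmm_bn`).

```python
Z_BAR = 2**63 + 857
F = [1, 6, 24, 36, 36]     # BN polynomial coefficients f0..f4

def carry(c, upto):
    """Coefficient reduction: c[i+1] += c[i] div z, c[i] = c[i] mod z for i < upto."""
    for i in range(upto):
        q, c[i] = divmod(c[i], Z_BAR)
        c[i + 1] += q

def hmm_bn(a, b):
    """Return [r0..r4] with r(z) = a(z) * b(z) * z^(-5) mod p at z = Z_BAR."""
    c = [0] * 9
    for i in range(5):                           # Phase I: schoolbook product
        for j in range(5):
            c[i + j] += a[i] * b[j]
    carry(c, 5)                                  # Phase II
    c0, c1, c2, c3, c4 = c[:5]                   # Phase III
    q4 = -c4 + 6 * (c3 - 2 * c2 - 6 * (c1 - 9 * c0))
    q3 = -c3 + 6 * (c2 - 2 * c1 - 6 * c0)
    q2 = -c2 + 6 * (c1 - 2 * c0)
    q1 = -c1 + 6 * c0
    h = [6 * (q4 + 4 * q3 + 6 * (q2 + q1)),      # h0
         12 * (2 * q4 + 3 * (q3 + q2)),          # h1
         36 * (q4 + q3),                         # h2
         36 * q4]                                # h3
    v = [c[5 + i] + h[i] for i in range(4)] + [0]
    carry(v, 4)                                  # Phase IV
    return v
```

The coefficient q0 never has to be formed, because q0·f(z) has degree 4 and therefore does not contribute to the part of q(z)f(z) that survives the division by z^5.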
5.2 Implementation Results for Pairing Co-processor
We used a Xilinx FPGA as the design platform. In order
to achieve a high clock frequency, a 16-stage pipeline is
used in the HMM multiplier. Note that the polynomial
multiplication takes five iterations. One multiplication
on the proposed multiplier has a delay of 20 cycles. On
the other hand, the multiplier finishes one multiplication
every 5 cycles, thus has a throughput of 1/5.
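The difference between delay and throughput can be illustrated with an idealized timing model. The sketch below assumes that all multiplications are independent and issued back to back, so no pipeline stalls occur. The function name and this stall-free assumption are illustrative and not taken from the design itself.

```python
def completion_cycles(n_mults, latency=20, interval=5):
    """Cycle at which each of n independent, back-to-back multiplications
    completes, for a multiplier with the given latency and issue interval."""
    return [latency + interval * k for k in range(n_mults)]


# Three independent multiplications: results appear at cycles 20, 25 and 30,
# rather than at 20, 40 and 60 as in a non-pipelined unit.
print(completion_cycles(3))  # [20, 25, 30]
```

Under these assumptions a long sequence of independent multiplications approaches one result every 5 cycles, whereas a chain of dependent multiplications pays the full 20-cycle delay each time.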
Using the multiplier described above, we built a
pairing processor. It consists of a multiplier, an adder,
a 74Kbit data memory and an instruction ROM. The
data memory and instruction ROM are essentially im-
plemented with Block RAMs on the FPGA. The data
memory has one read port and one write port.
In order to keep the HMM multiplier busy, we exploit
the parallelism within the pairing computation. Consider
$\mathbb{F}_{p^2} = \mathbb{F}_p[u]/(u^2 + 2)$ and two elements in $\mathbb{F}_{p^2}$: $(a + bu)$
and $(c + du)$, with $a, b, c, d \in \mathbb{F}_p$. Then $(a + bu)(c + du)$ can be
computed as follows:
$$(a + bu)(c + du) = (ac - 2bd) + ((a + b)(c + d) - ac - bd)u.$$
The three $\mathbb{F}_p$ multiplications $ac$, $bd$ and $(a + b)(c + d)$ are
mutually independent, so they can be issued to the multiplier one after
another. We use a C++ program to schedule each high-level
function, e.g. sub-routines of the Miller loop and the final
exponentiation. The schedule is then translated into
micro-instructions stored in the instruction ROM. Table 4
gives the cycle counts of each function.
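The three-multiplication formula for $\mathbb{F}_{p^2}$ can be checked against the direct four-multiplication expansion. The sketch below is only a functional model with hypothetical function names. It uses the small prime 7 for an exhaustive check, which is an assumption made for illustration and not a parameter of the design.

```python
def fp2_mul(x, y, p):
    """(a + bu)(c + du) in F_p[u]/(u^2 + 2) using three F_p multiplications."""
    a, b = x
    c, d = y
    ac = (a * c) % p                   # multiplication 1
    bd = (b * d) % p                   # multiplication 2
    cross = ((a + b) * (c + d)) % p    # multiplication 3
    return ((ac - 2 * bd) % p, (cross - ac - bd) % p)


def fp2_mul_direct(x, y, p):
    """Reference result: expand the product and substitute u^2 = -2."""
    a, b = x
    c, d = y
    return ((a * c - 2 * b * d) % p, (a * d + b * c) % p)


# Exhaustive comparison over all pairs of elements for a toy prime.
P_TOY = 7
for a in range(P_TOY):
    for b in range(P_TOY):
        for c in range(P_TOY):
            for d in range(P_TOY):
                assert fp2_mul((a, b), (c, d), P_TOY) == \
                    fp2_mul_direct((a, b), (c, d), P_TOY)
```

Since `ac`, `bd` and `cross` do not depend on each other, they are natural candidates for consecutive pipeline slots. The two final subtractions depend on all three products and have to wait for them.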
On a Xilinx Virtex-6 FPGA (XC6VLX240), the de-
sign uses 4,014 Slices, 42 DSP48E1s and 5 Block RAMs
(RAMB36E1s). The design achieves a maximum fre-
quency of 210 MHz. Table 5 compares the result with
the state-of-the-art implementations.
TABLE 4
Number of clock cycles required by different subroutines

| | 2T | T+Q | $l_{T,T}(P)$ | $l_{T,Q}(P)$ | $f^2$ | $f \cdot l$ | $f^{(p^k-1)/r}$ | ate | optimal ate |
|---|---|---|---|---|---|---|---|---|---|
| #Cycles | 220 | 342 | 196 | 120 | 540 | 432 | 138,302 | 336,366 | 245,430 |

TABLE 5
Performance comparison of software and hardware implementations of pairings

| Design | Pairing | Security [bit] | Platform | Algorithm | Area | Freq. [MHz] | Cycle | Delay [ms] |
|---|---|---|---|---|---|---|---|---|
| This design | ate | 128 | Xilinx FPGA (Virtex-6) | HMM (Parallel) | 4,014 Slices, 42 DSP48E1s | 210 | 336,366 | 1.60 |
| | optimal ate | | | | | | 245,430 | 1.17 |
| [17] | ate | 128 | ASIC (130 nm) | HMM (Digit-serial) | 183 kGates | 204 | 861,724 | 4.22 |
| | optimal ate | | | | | | 592,976 | 2.91 |
| [19] | Tate | 128 | Xilinx FPGA (Virtex-4) | Blakley [8] | 52k Slices | 50 | 1,730,000 | 34.6 |
| | ate | | | | | | 1,207,000 | 24.2 |
| | optimal ate | | | | | | 821,000 | 16.4 |
| [24] | Tate | 128 | ASIC (130 nm) | Montgomery | 97 kGates | 338 | 11,627,200∗ | 34.4 |
| | ate | | | | | | 7,706,400∗ | 22.8 |
| | optimal ate | | | | | | 5,340,400∗ | 15.8 |
| [16] | Tate over $\mathbb{F}_{3^{5 \cdot 97}}$ | 128 | Xilinx FPGA (Virtex-4) | - | 4755 Slices, 7 BRAMs | 192 | 428,853 | 2.23 |
| [21] | ate | 128 | 64-bit Core2 | Montgomery | - | 2400 | 15,000,000 | 6.25 |
| | optimal ate | | | | | | 10,000,000 | 4.17 |
| [20] | ate | 128 | 64-bit Core2 | Montgomery | | 2400 | 14,429,439 | 6.01 |
| [31] | optimal ate | 128 | Core2 Quad | Hybrid Mult. | - | 2394 | 4,470,408 | 1.86 |
| [6] | optimal ate | 126 | Core i7 | Montgomery | - | 2800 | 2,330,000 | 0.83 |
| [1] | optimal ate | 127 | Phenom II | Montgomery | - | 3000 † | 1,562,000 | 0.52 |
| [2] | $\eta_T$ over $\mathbb{F}_{2^{1223}}$ | 128 | Xeon (8 cores) | - | - | 2000 | 3,020,000 | 1.51 |
| [7] | $\eta_T$ over $\mathbb{F}_{3^{509}}$ | 128 | Core i7 (8 cores) | - | - | 2900 | 5,423,000 | 1.87 |

∗ Estimated by the authors.
† Processor frequency is not mentioned in the original paper. We take 3.0 GHz (typical frequency) for delay estimation.

Kammler et al. [24] reported a hardware implementation
of cryptographic pairings using an application-specific
instruction-set processor (ASIP). They chose
$\bar{z}$ = 0x6000000000001F2D to generate a 256-bit BN curve.
Montgomery’s algorithm is used for Fp multiplication.
Fan et al. [17] reported an ASIC implementation using
the HMM algorithm. They chose $\bar{z} = 2^{63} + 857$. It achieves
a factor 5.4 speed-up compared with the design of [24].
On the other hand, the ASIP design [24] offers higher
flexibility than the ASIC implementation [17], which is
optimized for a dedicated parameter set. Ghosh et al. [19]
reported the first FPGA implementation of pairings based
on BN curves. They also chose $\bar{z}$ = 0x6000000000001F2D
to achieve 128-bit security. The Fp multiplier used in [19]
is based on Blakley’s algorithm [8], and it does not make
use of the DSP slices on the FPGA.
The design reported in this paper achieves a factor
2.5 speed-up compared with the one in [17]. The speed-
up comes mainly from the high throughput multiplier.
The multiplier used in [17] requires 23 cycles for each
multiplication, while the multiplier described in this
paper finishes one multiplication every 5 cycles. The
Montgomery multiplier used in [24] requires 68 cycles
for one multiplication. On the other hand, the speed-up
for pairing computations is less than the speed-up of
the multiplier. This is mainly due to the read-after-write
(RAW) dependency that introduces pipeline bubbles.
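The delay and speed-up figures follow directly from the cycle counts and clock frequencies listed in Table 5. The sketch below recomputes them as a worked example. The function name and dictionary layout are illustrative, and only the numbers for this design and for [17] are used.

```python
def delay_ms(cycles, freq_mhz):
    """Execution time in milliseconds for a cycle count at a clock in MHz."""
    return cycles / (freq_mhz * 1e3)


# (cycles, frequency in MHz) as listed in Table 5.
this_design = {"ate": (336_366, 210), "optimal ate": (245_430, 210)}
design_17 = {"ate": (861_724, 204), "optimal ate": (592_976, 204)}

for pairing in ("ate", "optimal ate"):
    t_new = delay_ms(*this_design[pairing])
    t_old = delay_ms(*design_17[pairing])
    print(f"{pairing}: {t_new:.2f} ms vs {t_old:.2f} ms, "
          f"speed-up {t_old / t_new:.2f}")

# Ratio of multiplier issue intervals in cycles (23 for [17], 5 here).
print(f"multiplier cycle ratio: {23 / 5:.1f}")
```

The pairing-level ratios come out at roughly 2.5 to 2.6, below the 4.6 ratio of the multiplier issue intervals. This matches the observation that pipeline bubbles limit the overall gain.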
Table 5 also includes the state-of-the-art software
implementations [6], [20], [21], [31]. Software implementations
try to make use of available features (fast multipliers,
vector registers and so on) on the target processor.
Speed records have typically been broken shortly
after being set, mainly due to new implementation
techniques and the evolution of processors. The current
speed record [1] for the optimal ate pairing on BN curves
was achieved by Aranha et al. on an AMD Phenom II
processor, using $\bar{z} = -(2^{62} + 2^{55} + 1)$. This choice not
only reduces the complexity of both the Miller loop and
the final exponentiation but, more importantly, allows
extensive use of lazy reduction techniques. To date, this
implementation achieves the best performance.
An interesting attempt to use the hybrid multiplication
algorithm in software has been made in [31]. Naehrig
et al. adapted the hybrid multiplication algorithm and
proposed a software-oriented algorithm for fast modular
multiplication. An element in Fp is represented as a
polynomial of degree 11. As such, the coefficients are
short enough to fit in the vector registers and overflows
in multiplication are avoided. Polynomial reduction can
be efficiently performed since p(x) is made monic. This
implementation shows a significant speedup compared
to previous software implementations [20], [21]. On the
other hand, on CPUs where 64-bit multiplication is not
much slower than addition, the traditional Montgomery
algorithm seems to achieve a higher performance [1], [6].
Table 5 also includes the state-of-the-art implementa-
tions of pairings over binary or ternary fields achieving
128-bit security. The results of these implementations are
comparable with pairings using BN curves in software.
To the best of our knowledge, Estibals [16] has so far reported the
only hardware implementation of the 128-bit Tate pairing
over supersingular curves in small characteristic. On a
Virtex-4 FPGA, it requires only 2.2 ms for the pairing
computation. It is however difficult to give a fair com-
parison due to the use of different platforms.
6 CONCLUSIONS
In this paper we introduced a new modular multi-
plication algorithm for polynomial form moduli called
Hybrid Modular Multiplication. The algorithm combines
Montgomery’s algorithm in the polynomial representa-
tion with a fast coefficient reduction based on pseudo-
Mersenne numbers.
The algorithm results in a substantial speed-up for
hardware implementation of finite field arithmetic ap-
pearing in pairing based cryptography. To illustrate this
efficiency, we implemented pairings on Barreto-Naehrig
curves at the 128-bit security level and obtained a factor
2.5 speed-up compared with the state-of-the-art hard-
ware implementations.
ACKNOWLEDGMENTS
This work was supported in part by K.U. Leuven-BOF
(OT/06/40), by the IAP Programme P6/26 BCRYPT
of the Belgian State (Belgian Science Policy), by the
Research Council K.U.Leuven: GOA TENSE, by FWO
project G.0300.07 and by IBBT.
REFERENCES
[1] D.F. Aranha, K. Karabina, P. Longa, C.H. Gebotys, and J. López.
Faster Explicit Formulas for Computing Pairings over Ordinary
Curves. In Eurocrypt 2011. To appear.
[2] D.F. Aranha, J. López, and D. Hankerson. High-Speed Parallel
Software Implementation of the ηT Pairing. In CT-RSA 2010,
volume 5985 of Lecture Notes in Computer Science, pages 89–105.
Springer, 2010.
[3] R.M. Avanzi, H. Cohen, C. Doche, G. Frey, T. Lange, K. Nguyen,
and F. Vercauteren. Handbook of Elliptic and Hyperelliptic Curve
Cryptography. CRC Press, 2005.
[4] P.S.L.M. Barreto, H.Y. Kim, B. Lynn, and M. Scott. Efficient
Algorithms for Pairing-Based Cryptosystems. In CRYPTO 2002,
volume 2442 of Lecture Notes in Computer Science, pages 354–368.
Springer, 2002.
[5] P. Barrett. Implementing the Rivest Shamir and Adleman Public
Key Encryption Algorithm on a Standard Digital Signal Processor.
In CRYPTO 1986, volume 263 of Lecture Notes in Computer Science,
pages 311–323. Springer, 1986.
[6] J.-L. Beuchat, J.E. González-Díaz, S. Mitsunari, E. Okamoto,
F. Rodríguez-Henríquez, and T. Teruya. High-Speed Software
Implementation of the Optimal Ate Pairing over Barreto-Naehrig
Curves. In Pairing 2010, volume 6487 of Lecture Notes in Computer
Science, pages 21–39. Springer, 2010.
[7] J.-L. Beuchat, E. López-Trejo, L. Martínez-Ramos, S. Mitsunari, and
F. Rodríguez-Henríquez. Multi-core Implementation of the Tate
Pairing over Supersingular Elliptic Curves. In CANS 2009, volume
5888 of Lecture Notes in Computer Science, pages 413–432. Springer,
2009.
[8] G.R. Blakley. A Computer Algorithm for Calculating the Product
AB Modulo M. IEEE Trans. Comput., 32(5):497–500, 1983.
[9] F. Brezing and A. Weng. Elliptic Curves Suitable for Pairing Based
Cryptography. Designs, Codes and Cryptography, 37:133–141, 2003.
[10] Ç. K. Koç, T. Acar, and B. S. Kaliski. Analyzing and Comparing
Montgomery Multiplication Algorithms. IEEE Micro, 16:26–33,
1996.
[11] J. Chung and M.A. Hasan. Low-Weight Polynomial Form Integers
for Efficient Modular Multiplication. IEEE Trans. Comput., 56(1):44–
57, 2007.
[12] J. Chung and M.A. Hasan. Montgomery Reduction Algorithm
for Modular Multiplication Using Low-Weight Polynomial Form
Integers. In ARITH ’07: Proceedings of the 18th IEEE Symposium on
Computer Arithmetic, pages 230–239, Washington, DC, USA, 2007.
IEEE Computer Society.
[13] A. Devegili, C. Ó hÉigeartaigh, M. Scott, and R. Dahab.
Multiplication and Squaring on Pairing-Friendly Fields. Cryptology ePrint
Archive, Report 2006/471. Available from http://eprint.iacr.org.
[14] A. Devegili, M. Scott, and R. Dahab. Implementing Cryptographic
Pairings over Barreto-Naehrig Curves. In Pairing 2007, volume
4575 of Lecture Notes in Computer Science, pages 197–207. Springer,
2007.
[15] J.-F. Dhem. Design of an efficient public-key cryptographic library
for RISC-based smart cards. PhD thesis, Université Catholique de
Louvain, Louvain-la-Neuve, Belgium, 1998.
[16] N. Estibals. Compact Hardware for Computing the Tate Pair-
ing over 128-Bit-Security Supersingular Curves. In Pairing 2010,
volume 6487 of Lecture Notes in Computer Science, pages 397–416.
Springer, 2010.
[17] J. Fan, F. Vercauteren, and I. Verbauwhede. Faster Fp-Arithmetic
for Cryptographic Pairings on Barreto-Naehrig Curves. In CHES
2009, volume 5747 of Lecture Notes in Computer Science, pages 240–
253. Springer, 2009.
[18] D. Freeman, M. Scott, and E. Teske. A Taxonomy of Pairing-
Friendly Elliptic Curves. Journal of Cryptology, 23(2):224–280, 2010.
[19] S. Ghosh, D. Mukhopadhyay, and D.R. Chowdhury. High Speed
Flexible Pairing Cryptoprocessor on FPGA Platform. In Pairing
2010, volume 6487 of Lecture Notes in Computer Science, pages 450–
466, 2010.
[20] P. Grabher, J. Großschädl, and D. Page. On Software Parallel
Implementation of Cryptographic Pairings. In SAC 2008, volume
5381 of Lecture Notes in Computer Science, pages 34–49. Springer,
2008.
[21] D. Hankerson, A. Menezes, and M. Scott. Software Implementa-
tion of Pairings. In M. Joye and G. Neven, editors, Identity-Based
Cryptography, 2008.
[22] F. Hess. Pairing Lattices. In Pairing 2008, volume 5209 of Lecture
Notes in Computer Science, pages 18–38. Springer, 2008.
[23] F. Hess, N.P. Smart, and F. Vercauteren. The Eta Pairing Revisited.
Information Theory, IEEE Transactions on, 52(10):4595–4602, Oct.
2006.
[24] D. Kammler, D. Zhang, P. Schwabe, H. Scharwaechter, M. Lan-
genberg, D. Auras, G. Ascheid, R. Leupers, R. Mathar, and
H. Meyr. Designing an ASIP for Cryptographic Pairings over
Barreto-Naehrig Curves. In CHES 2009, volume 5747 of Lecture
Notes in Computer Science, pages 254–271. Springer, 2009.
[25] A. Karatsuba and Y. Ofman. Multiplication of Multidigit Numbers
on Automata. Doklady Akademii Nauk SSSR, 145(2):293–294, 1962.
[26] E. Lee, H.-S. Lee, and C.-M. Park. Efficient and Generalized
Pairing Computation on Abelian Varieties. Cryptology ePrint
Archive, Report 2009/040. Available from http://eprint.iacr.org/.
[27] V.S. Miller. Short Programs for Functions on Curves, 1986.
Unpublished manuscript, Available at http://crypto.stanford.edu/
miller/miller.pdf.
[28] V.S. Miller. The Weil Pairing, and Its Efficient Calculation. Journal
of Cryptology, 17(4):235–261, 2004.
[29] P.L. Montgomery. Modular Multiplication without Trial Division.
Mathematics of Computation, 44(170):519–521, 1985.
[30] P.L. Montgomery. Five, Six, and Seven-Term Karatsuba-Like
Formulae. IEEE Trans. Comput., 54(3):362–369, 2005.
[31] M. Naehrig, R. Niederhagen, and P. Schwabe. New Software
Speed Records for Cryptographic Pairings. In LATINCRYPT 2010,
volume 6212 of Lecture Notes in Computer Science, pages 109–123.
Springer, 2010.
[32] P.S.L.M. Barreto and M. Naehrig. Pairing-friendly elliptic curves
of prime order. In SAC 2005, volume 3897 of Lecture Notes in
Computer Science, pages 319–331. Springer, 2006.
[33] F. Vercauteren. Optimal Pairings. IEEE Transactions on Information
Theory, 56(1):455–461, 2010.
Junfeng Fan received the BS and MS degrees
in electrical engineering from Zhejiang Univer-
sity, China, in 2003 and 2006, respectively. Since
2006, he has been a PhD student in the Electri-
cal Engineering Department (ESAT), Katholieke
Universiteit Leuven (KU Leuven), Belgium. His
research interests include computer arithmetic,
with special focus on efficient implementations
for Public Key Cryptography (PKC). He is also
interested in physical security of embedded sys-
tems and secure design methodologies.
Frederik Vercauteren received the M.Sc. de-
gree in computer science, the M.Sc. degree
in pure mathematics, and the Ph.D. degree in
electrical engineering from the Katholieke Uni-
versiteit Leuven, Belgium. He is currently a Post-
doctoral Fellow of the Research Foundation-
Flanders (FWO Vlaanderen) at the Department
of Electrical Engineering, Katholieke Universiteit
Leuven, Belgium. Previously, he held a lecturer
position at the Department of Computer Sci-
ence, University of Bristol, U.K. His current re-
search interests include applications of computational number theory
and arithmetic geometry in cryptography.
Ingrid Verbauwhede received the electrical en-
gineering degree and PhD degree from the
Katholieke Universiteit Leuven (KU Leuven),
Belgium, in 1991. From 1992 to 1994, she
was a postdoctoral researcher and visiting lec-
turer at the Electrical Engineering and Computer
Sciences Department, University of California,
Berkeley. From 1994 to 1998, she worked for
TCSI and ATMEL in Berkeley, California. In
1998, she joined the faculty of University of
California, Los Angeles (UCLA). She is currently
a professor at the KU Leuven and an adjunct professor at UCLA. At
KU Leuven, she is a codirector of the Computer Security and Industrial
Cryptography (COSIC) Laboratory. Her research interests include cir-
cuits, processor architectures and design methodologies for real-time
embedded systems for security, cryptography, digital signal process-
ing, and wireless communications. This includes the influence of new
technologies and new circuit solutions on the design of next-generation
systems on chip. She was the program chair of the Ninth Interna-
tional Workshop on Cryptographic Hardware and Embedded Systems
(CHES 07), the 19th IEEE International Conference on Application-
specific Systems, Architectures and Processors (ASAP 08), and the
ACM/IEEE International Symposium on Low Power Electronics and
Design (ISLPED 02). She was also the general chair of ISLPED 2003.
She was a member of the executive committee of the 42nd and 43rd
Design Automation Conference (DAC) as the design community chair.
She is a senior member of the IEEE.
