An Algorithm for the ηT Pairing Calculation in Characteristic Three and its
Hardware Implementation
Jean-Luc Beuchat (1), Masaaki Shirase (2), Tsuyoshi Takagi (2), and Eiji Okamoto (1)
(1) University of Tsukuba, Japan
(2) Future University-Hakodate, Japan
Abstract
In this paper, we propose a modified ηT pairing algo-
rithm in characteristic three which does not need any cube
root extraction. We also discuss its implementation on a
low cost platform which hosts an Altera Cyclone II FPGA
device. Our pairing accelerator is ten times faster than previously
known FPGA implementations in characteristic three.
Keywords: Tate pairing, ηT pairing, characteristic three,
elliptic curve, hardware accelerator, FPGA.
1. Introduction
Since the introduction of pairings over (hyper)elliptic
curves in constructive cryptographic applications, an ever
increasing number of protocols based on Weil or Tate pair-
ings have appeared in the literature: identity-based encryp-
tion [8], short signature [10], and efficient broadcast en-
cryption [9] to mention but a few. Nowadays pairing-based
cryptosystems have become a central research topic in cryp-
tography.
Miller's algorithm [19] was the only way to compute the Tate pairing until 2002, when significant improvements were independently proposed by Barreto et al. [5] and Galbraith et al. [13]. One year later, Duursma and Lee gave a closed formula in the case of characteristic three [11]. They described an iterative scheme involving additions, multiplications, cubing operations, and cube root extractions over $\mathbb{F}_{3^m}$. This work was then extended by Kwon, who proposed a closed formula for the Tate pairing computation for supersingular elliptic curves over $\mathbb{F}_{2^m}$ with odd dimension $m$ [18]. Furthermore, he proved that both his algorithm and the Duursma-Lee algorithm can be modified so that no inverse Frobenius map (i.e. square root in characteristic two or cube root in characteristic three) is required.
Fong et al. showed that extracting a square root in $\mathbb{F}_{2^m}$ requires approximately the time of a field multiplication and proposed an improved scheme for trinomials [12]. Barreto extended this approach to cube roots in characteristic three [3]: if $\mathbb{F}_{3^m}$ admits an irreducible trinomial $x^m + a x^k + b$ ($a, b \in \{-1, 1\}$) with the property $k \equiv m \pmod{3}$, then five shifts and five additions suffice to implement this operation. However, these algorithms restrict the choice of curves, so it is worthwhile to design pairing algorithms without inverse Frobenius maps. Hardware implementations also benefit from such pairing algorithms: removing the inverse Frobenius maps leads to simpler arithmetic and logic units.
By introducing the ηT pairing, Barreto et al. reduced the
number of iterations of the Duursma-Lee algorithm by half [4].
However, this algorithm reintroduces inverse Frobenius
maps. Recently, Shu et al. described how to get rid of
square roots in characteristic two [22]. In this paper, we
introduce a modified ηT pairing algorithm in characteris-
tic three which does not require any cube root (Section 2).
Then, we discuss its hardware implementation on a low cost
Field Programmable Gate Array (FPGA) board hosting Al-
tera Cyclone II technology (Section 3) and we compare this
pairing accelerator against several software and hardware
architectures reported in the literature (Section 4).
2. An Algorithm for the ηT Pairing Calculation
Let $E$ be an elliptic curve over $\mathbb{F}_q$, where $q$ is a power of a prime number. A formal symbol $(P)$ is defined for each point $P$ of the curve. A divisor $D$ on $E$ is then a finite linear combination of such symbols with integer coefficients: $D = \sum_j a_j (P_j)$, $a_j \in \mathbb{Z}$. The degree of a divisor is defined by $\deg\bigl(\sum_j a_j (P_j)\bigr) = \sum_j a_j \in \mathbb{Z}$. For an introduction to divisors, we refer the reader to [23]. Let $l > 0$ be an integer relatively prime to $q$. The least positive integer $k$ satisfying $q^k \equiv 1 \pmod{l}$ is called the embedding degree or security multiplier. Let $E(\mathbb{F}_q)[l]$ be the set of points $P \in E(\mathbb{F}_q)$ such that $lP = \mathcal{O}$, where $\mathcal{O}$ is the point at infinity. Consider $P \in E(\mathbb{F}_q)[l]$ and $Q \in E(\mathbb{F}_{q^k})[l]$. The reduced Tate pairing
18th IEEE Symposium on Computer Arithmetic(ARITH'07)
0-7695-2854-6/07 $20.00  © 2007
is the map $e_l : E(\mathbb{F}_q)[l] \times E(\mathbb{F}_{q^k})[l] \to \mathbb{F}_{q^k}^*$, given by
$$e_l(P,Q) = f_{l,P}(D_Q)^{\frac{q^k-1}{l}}, \qquad (1)$$
where $f_{l,P}$ is a rational function on $E$ whose divisor is equivalent to $l(P) - l(\mathcal{O})$, and $D_Q$ is a divisor of degree 0 equivalent to $(Q) - (\mathcal{O})$. $f_{l,P}$ and $D_Q$ have disjoint supports. The computation of the $(q^k - 1)/l$-th power is referred to as final exponentiation. The reduced Tate pairing satisfies the following properties:
• Bilinearity: let $a$ be an integer; then $e_l(aP,Q) = e_l(P,aQ) = e_l(P,Q)^a$, for all $P \in E(\mathbb{F}_q)[l]$ and $Q \in E(\mathbb{F}_{q^k})[l]$.
• Non-degeneracy: if $e_l(P,Q) = 1$ for all $Q \in E(\mathbb{F}_{q^k})[l]$, then $P = \mathcal{O}$.
Equation (1) was initially computed according to an al-
gorithm introduced by Miller in the context of Weil pair-
ing [19]. Several improvements have been proposed
since 2002 (see for example [5, 13, 11, 18]). Barreto et
al. [5] proved that the reduced pairing can be computed as $e_l(P,Q) = f_{l,P}(Q)^{\frac{q^k-1}{l}}$, where $f_{l,P}$ is evaluated on a point
rather than on a divisor. In the same paper, the authors ex-
ploited a distortion map to further enhance Miller’s algo-
rithm.
This work is devoted to the computation of pairings in characteristic three (i.e. $q = 3^m$, where $m$ is odd). Let $E_b$ be a supersingular elliptic curve over $\mathbb{F}_{3^m}$:
$$E_b : y^2 = x^3 - x + b, \quad \text{with } b \in \{-1, 1\}.$$
The distortion map $\psi : E_b(\mathbb{F}_{3^m}) \to E_b(\mathbb{F}_{3^{6m}})$ is then defined as follows: $\psi(Q) = \psi(x_q, y_q) = (-x_q + \rho, y_q \sigma)$, where $\sigma$ and $\rho$ belong to $\mathbb{F}_{3^{6m}}$ and respectively satisfy $\sigma^2 = -1$ and $\rho^3 = \rho + b$. The modified Tate pairing $\hat{e}(P,Q)$ is then given by $\hat{e}(P,Q) = e_l(P, \psi(Q))$. Note that $\{1, \sigma, \rho, \sigma\rho, \rho^2, \sigma\rho^2\}$ is a basis of $\mathbb{F}_{3^{6m}}$ over $\mathbb{F}_{3^m}$. We will therefore represent an element $A \in \mathbb{F}_{3^{6m}}$ as $A = (a_0, a_1, a_2, a_3, a_4, a_5) = a_0 + a_1\sigma + a_2\rho + a_3\sigma\rho + a_4\rho^2 + a_5\sigma\rho^2$, where the $a_i$'s belong to $\mathbb{F}_{3^m}$. This representation is equivalent to a tower extension of $\mathbb{F}_{3^m}$ (see for instance [17]): $\mathbb{F}_{3^{2m}} = \mathbb{F}_{3^m}[y]/(y^2 + 1)$ and $\mathbb{F}_{3^{6m}} = \mathbb{F}_{3^{2m}}[z]/(z^3 - z - b)$, where $y^2 + 1$ and $z^3 - z - b$ are irreducible polynomials over $\mathbb{F}_{3^m}$ and $\mathbb{F}_{3^{2m}}$, respectively. This tower field representation allows one to replace arithmetic over $\mathbb{F}_{3^{6m}}$ by arithmetic over $\mathbb{F}_{3^m}$.
Barreto et al. defined the $\eta_T$ pairing as $\eta_T(P,Q) = f_{T,P}(\psi(Q))$, for some $T \in \mathbb{Z}$ [4]. This formula does not always give a non-degenerate, bilinear pairing. However, Barreto et al. described some cases where $\eta_T(P,Q)^W$ is a non-degenerate and bilinear map (a final exponentiation is therefore required for pairing-based cryptosystems). In such cases, this approach reduces the number of iterations by half (Algorithm 1). In characteristic three, the relationship between the $\eta_T$ pairing and the modified Tate pairing is given by:
$$\left(\eta_T(P,Q)^W\right)^{3T^2} = \hat{e}(P,Q)^Z, \qquad (2)$$
where $T = -b \cdot 3^{\frac{m+1}{2}} - 1$, $Z = -b \cdot 3^{\frac{m+3}{2}}$, and $W = (3^{3m} - 1)(3^m + 1)(3^m - b \cdot 3^{\frac{m+1}{2}} + 1)$. Let $v = \eta_T(P,Q)^W$. The modified Tate pairing can be computed as follows:
$$\hat{e}(P,Q) = v^{-2} \cdot \left( v^{3^{(m+1)/2}} \cdot \sqrt[3^m]{v^{3^{(m-1)/2}}} \right)^{-b}.$$
This method is more efficient than the one proposed by Barreto et al. in [4]. $\eta_T(P,Q)$ can be calculated according to Algorithm 1. As mentioned in Section 1, this scheme involves two cube root extractions at each iteration.
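Algorithm 1 Computation of the $\eta_T$ pairing in characteristic three [4].
Input: $\tilde{P} = (\tilde{x}_p, \tilde{y}_p)$ and $\tilde{Q} = (\tilde{x}_q, \tilde{y}_q) \in E_b(\mathbb{F}_{3^m})[l]$. The algorithm requires $\tilde{R}_0$ and $\tilde{R}_1 \in \mathbb{F}_{3^{6m}}$, as well as $\tilde{r}_0 \in \mathbb{F}_{3^m}$ for intermediate computations.
Output: $\eta_T(\tilde{P}, \tilde{Q})$
1: if $b = 1$ then
2:   $\tilde{y}_p \leftarrow -\tilde{y}_p$;
3: end if
4: $\tilde{r}_0 \leftarrow \tilde{x}_p + \tilde{x}_q + b$;
5: $\tilde{R}_0 \leftarrow -\tilde{y}_p \tilde{r}_0 + \tilde{y}_q \sigma + \tilde{y}_p \rho$;
6: for $i = 0$ to $(m-1)/2$ do
7:   $\tilde{r}_0 \leftarrow \tilde{x}_p + \tilde{x}_q + b$;
8:   $\tilde{R}_1 \leftarrow -\tilde{r}_0^2 + \tilde{y}_p \tilde{y}_q \sigma - \tilde{r}_0 \rho - \rho^2$;
9:   $\tilde{R}_0 \leftarrow \tilde{R}_0 \tilde{R}_1$;
10:  $\tilde{x}_p \leftarrow \tilde{x}_p^{1/3}$; $\tilde{y}_p \leftarrow \tilde{y}_p^{1/3}$; $\tilde{x}_q \leftarrow \tilde{x}_q^3$; $\tilde{y}_q \leftarrow \tilde{y}_q^3$;
11: end for
12: Return $\tilde{R}_0$;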
We propose here a modified $\eta_T$ pairing algorithm in characteristic three which computes $R_0 = \tilde{R}_0^{3^{i+1}}$ at step $i$ (Algorithm 2). This trick allows one to get rid of cube roots, and our algorithm returns $\eta_T(P,Q)^{3^{(m+1)/2}}$. A proof of correctness of this new scheme is provided in an extended version of this paper [7]. Let us now describe how to implement the original $\eta_T(P,Q)$ pairing with our algorithm. Recall that tripling a point requires only four cubing operations in characteristic three for supersingular elliptic curves (see for instance [15]): $3(x_p, y_p) = (x_p^9 - b, -y_p^9)$. Therefore, we suggest computing $3^{\frac{m-1}{2}} P$ by means of $2(m-1)$ cubings and taking advantage of the bilinearity of $\eta_T(P,Q)^W$:
$$\left( \eta_T\left( 3^{\frac{m-1}{2}} P, Q \right)^{3^{\frac{m+1}{2}}} \right)^W = \left( \eta_T(P,Q)^W \right)^{3^m}. \qquad (3)$$
Note that cubing over $\mathbb{F}_{3^m}$ is efficiently performed in hardware (Section 3.2). A postprocessing step involving a $3^m$-th root is further required. However, this operation is carried out by means of six additions (or subtractions) and a negation over $\mathbb{F}_{3^m}$. Assume that $b = 1$. Raising $\eta_T(P,Q)^{3^{(m+1)/2}}$ to the $W$-th power is based on the following observation:
$$W = 3^{5m} + 2 \cdot 3^{4m} + 3^{3m} + 3^{m+(m+1)/2} + 3^{(m+1)/2} - \left( 3^{4m+(m+1)/2} + 3^{3m+(m+1)/2} + 3^{2m} + 2 \cdot 3^m + 1 \right).$$
This operation requires 11 multiplications and a single inversion over $\mathbb{F}_{3^{6m}}$, as well as additions over $\mathbb{F}_{3^m}$.
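The expansion of $W$ for $b = 1$ can be checked numerically with arbitrary-precision integers. The following is only a small sanity-check sketch; the function names `w_product` and `w_expansion` are illustrative and do not come from the paper.

```python
def w_product(m: int) -> int:
    """W = (3^{3m} - 1)(3^m + 1)(3^m - 3^{(m+1)/2} + 1), i.e. the case b = 1."""
    half = (m + 1) // 2
    return (3 ** (3 * m) - 1) * (3 ** m + 1) * (3 ** m - 3 ** half + 1)


def w_expansion(m: int) -> int:
    """Signed sum of powers of three given in the text for b = 1."""
    half = (m + 1) // 2
    positive = (3 ** (5 * m) + 2 * 3 ** (4 * m) + 3 ** (3 * m)
                + 3 ** (m + half) + 3 ** half)
    negative = (3 ** (4 * m + half) + 3 ** (3 * m + half)
                + 3 ** (2 * m) + 2 * 3 ** m + 1)
    return positive - negative


# The identity is polynomial in 3^m and 3^{(m+1)/2}, so it holds for every odd m.
for m in (1, 3, 5, 97):
    assert w_product(m) == w_expansion(m)
```

Each power of three in the expansion corresponds to a chain of cubings (Frobenius maps), which is why the exponentiation reduces to a handful of multiplications and one inversion.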
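Algorithm 2 Proposed computation of $\eta_T(P,Q)^{3^{(m+1)/2}}$.
Input: $P = (x_p, y_p)$ and $Q = (x_q, y_q) \in E_b(\mathbb{F}_{3^m})[l]$. The algorithm requires $R_0$ and $R_1 \in \mathbb{F}_{3^{6m}}$, as well as $r_0 \in \mathbb{F}_{3^m}$ and $d \in \mathbb{F}_3$ for intermediate computations.
Output: $\eta_T(P,Q)^{3^{(m+1)/2}}$
1: if $b = 1$ then
2:   $y_p \leftarrow -y_p$;
3: end if
4: $r_0 \leftarrow x_p + x_q + b$;
5: $d \leftarrow b$;
6: $R_0 \leftarrow -y_p r_0 + y_q \sigma + y_p \rho$;
7: for $i = 0$ to $(m-1)/2$ do
8:   $r_0 \leftarrow x_p + x_q + d$;
9:   $R_1 \leftarrow -r_0^2 + y_p y_q \sigma - r_0 \rho - \rho^2$;
10:  $R_0 \leftarrow (R_0 R_1)^3$;
11:  $y_p \leftarrow -y_p$;
12:  $x_q \leftarrow x_q^9$; $y_q \leftarrow y_q^9$;
13:  $d \leftarrow (d - b) \bmod 3$;
14: end for
15: Return $R_0$;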
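The preprocessing step described before Algorithm 2 relies on the tripling formula $3(x_p, y_p) = (x_p^9 - b, -y_p^9)$. As a minimal sketch, the formula can be checked by brute force on a toy instance: here we assume $m = 1$ (so the base field is simply $\mathbb{F}_3$) and $b = 1$, and compare against the generic chord-and-tangent group law. The helper names are hypothetical; this is an illustration, not the arithmetic used in the accelerator.

```python
P_MOD = 3  # toy base field F_3, i.e. m = 1 (assumption made for illustration)
B = 1      # curve E_b : y^2 = x^3 - x + b with b = 1


def inv(a):
    """Inverse of a nonzero element of F_3 (Fermat: a^(p-2))."""
    return pow(a, P_MOD - 2, P_MOD)


def on_curve(point):
    x, y = point
    return (y * y - (x ** 3 - x + B)) % P_MOD == 0


def add(p, q):
    """Chord-and-tangent addition on y^2 = x^3 - x + b; None is the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p == q:
        slope = (3 * x1 * x1 - 1) * inv(2 * y1 % P_MOD) % P_MOD  # a4 = -1
    else:
        slope = (y2 - y1) * inv((x2 - x1) % P_MOD) % P_MOD
    x3 = (slope * slope - x1 - x2) % P_MOD
    y3 = (slope * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def triple_by_formula(point):
    """Tripling via four cubings: 3(x, y) = (x^9 - b, -y^9)."""
    x, y = point
    return ((x ** 9 - B) % P_MOD, (-(y ** 9)) % P_MOD)


points = [(x, y) for x in range(3) for y in range(3) if on_curve((x, y))]
assert all(add(add(pt, pt), pt) == triple_by_formula(pt) for pt in points)
```

The toy curve has only six affine points, so the check is exhaustive; for $m = 97$ the same formula is applied with cubings over $\mathbb{F}_{3^{97}}$.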
3. Hardware Implementation
This section describes the hardware implementation of Algorithm 2 for the field $\mathbb{F}_3[x]/(x^{97} + x^{12} + 2)$ and the curve $y^2 = x^3 - x + 1$ (i.e. $b = 1$). This choice of parameters allows us to easily compare our work against the many pairing accelerators for $m = 97$ described in the open literature. A first approach consists in designing an architecture able to compute both the pairing and the final exponentiation. However, such a unified architecture cannot take advantage of the constant coefficients of $R_1$ (see Algorithms 1 and 2) to optimize the multiplication over $\mathbb{F}_{3^{6m}}$. Therefore, we suggest designing a pairing accelerator evaluating $\eta_T(P,Q)^{3^{(m+1)/2}}$ and a coprocessor responsible for the final exponentiation, both working in parallel. In this paper, we will only focus on the computation of the modified $\eta_T$ pairing. Algorithm 2 and the final exponentiation require respectively $(m-1)/2 + 1 = 49$ and 11 multiplications over $\mathbb{F}_{3^{6m}}$. The inversion over $\mathbb{F}_{3^{6m}}$ can be replaced by a few multiplications and additions over $\mathbb{F}_{3^m}$ and a single inversion over $\mathbb{F}_{3^m}$ [17]. Consequently, the final exponentiation requires fewer operations (and thus less hardware) than the computation of the $\eta_T$ pairing.
In order to compare our architecture against software
implementations, we decided to choose a design board
whose price is comparable to that of an entry level desk-
top computer. We selected a DE2 development and edu-
cation board [2] which costs $495 and hosts an Altera Cy-
clone II EP2C35F672C6 FPGA. Note that Altera provides
free simulation and design tools for the Cyclone II family.
The smallest unit of logic in a Cyclone II is called Logic
Element (LE). Each LE includes a 4-input Look-Up Table
(LUT), carry logic, and a programmable register. A Cy-
clone II EP2C35F672C6 device contains for instance 33216
LEs. Readers who are not familiar with Cyclone II de-
vices should refer to [1] for further details. Since we leave
the study of final exponentiation for further work, our pair-
ing accelerator should not utilize all resources of our target
FPGA. Thus, we impose a size constraint: our design must
require less than 50% of the available configurable logic.
3.1. Addition and Subtraction over F3m
Since they are performed component-wise, addition and subtraction over $\mathbb{F}_{3^m}$ are rather straightforward operations. Each element $a_i$ of $\mathbb{F}_3$ is encoded by two bits $a_i^L$ and $a_i^H$ such that [14]: $a_i^L = a_i \bmod 2$ and $a_i^H = a_i \operatorname{div} 2$. Thus, the addition of $a_i$ and $b_i$ on a Cyclone II FPGA requires two 4-input LUTs. Our processor includes an operator which adds or subtracts up to three elements of $\mathbb{F}_{3^m}$ and stores the result in a register (Figure 1a).
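The two-bit encoding can be modelled in software to see why one $\mathbb{F}_3$ addition maps onto two 4-input LUTs: each output bit is a Boolean function of the four input bits. The sketch below is a behavioural model only; the table layout and function names are assumptions made for illustration, and the unused code `(1, 1)` is treated as a don't-care value.

```python
def encode(a):
    """Map a in {0, 1, 2} to the bit pair (a_H, a_L) = (a div 2, a mod 2)."""
    return (a // 2, a % 2)


def decode(bits):
    high, low = bits
    return 2 * high + low


def build_adder_luts():
    """Build two 16-entry tables (one per output bit) indexed by (a_H, a_L, b_H, b_L)."""
    lut_high = [0] * 16
    lut_low = [0] * 16
    for a in range(3):
        for b in range(3):
            (ah, al), (bh, bl) = encode(a), encode(b)
            index = (ah << 3) | (al << 2) | (bh << 1) | bl
            lut_high[index], lut_low[index] = encode((a + b) % 3)
    return lut_high, lut_low


LUT_HIGH, LUT_LOW = build_adder_luts()


def add_f3(x, y):
    """Add two encoded elements of F_3 with two table look-ups."""
    index = (x[0] << 3) | (x[1] << 2) | (y[0] << 1) | y[1]
    return (LUT_HIGH[index], LUT_LOW[index])


def add_f3m(xs, ys):
    """Component-wise addition of two elements of F_{3^m} (lists of encoded coefficients)."""
    return [add_f3(x, y) for x, y in zip(xs, ys)]
```

For example, `add_f3m([encode(1), encode(2)], [encode(2), encode(2)])` yields the encodings of 0 and 1, since $1 + 2 \equiv 0$ and $2 + 2 \equiv 1 \pmod{3}$.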
3.2. Cubing over F3m and F36m
Cubing is also a fairly simple arithmetic operation. Since $\mathbb{F}_{3^{6m}}$ is constructed as an extension field of $\mathbb{F}_{3^m}$, the computation of $R_0^3$ involved in Algorithm 2 is replaced by six cubings, six additions (or subtractions), and a negation over $\mathbb{F}_{3^m}$. Indeed, by noting that $\sigma^3 = -\sigma$, $\rho^3 = \rho + 1$, $(\rho^2)^3 = \rho^2 - \rho + 1$, $(\sigma\rho)^3 = -\sigma\rho - \sigma$, and $(\sigma\rho^2)^3 = -\sigma\rho^2 + \sigma\rho - \sigma$, we obtain:
$$C^3 = (c_0^3 + c_2^3 + c_4^3) + (-c_1^3 - c_3^3 - c_5^3)\sigma + (c_2^3 - c_4^3)\rho + (-c_3^3 + c_5^3)\sigma\rho + c_4^3\rho^2 + (-c_5^3)\sigma\rho^2,$$
where $C = (c_0, c_1, c_2, c_3, c_4, c_5)$ belongs to $\mathbb{F}_{3^{6m}}$. Let us now consider the computation of $b(x) = a(x)^3$ over $\mathbb{F}_{3^m}$. We have:
$$b(x) = a(x)^3 = \left( \sum_{i=0}^{m-1} a_i x^{3i} \right) \bmod f(x),$$
where $f(x)$ is a degree $m$ irreducible polynomial over $\mathbb{F}_3$.
Since we set $f(x) = x^{97} + x^{12} + 2$, a simple Maple or Pari program provides us with a closed formula for cubing over $\mathbb{F}_{3^m}$:
$$b_0 = a_{93} + a_{89} + a_0, \quad b_1 = a_{65} - a_{61}, \quad b_2 = a_{33}, \quad b_3 = a_{94} + a_{90} + a_1, \quad \ldots, \quad b_{96} = a_{32}.$$
The most complex operation involved in cubing is the addition of three elements of $\mathbb{F}_3$. Therefore, the critical path includes only two LUTs. Our pairing accelerator embeds a single cubing unit (Figure 1b) which computes either $a(x)^3$ or $(-a(x))^3$ according to a control bit. In order to guarantee a short critical path, the operator includes two pipeline stages. It is worth noticing that the only degree 97 irreducible trinomial over $\mathbb{F}_3$ allowing a simple cube root extraction [3] has a more complex closed formula for cubing. Thus, Algorithm 2 offers additional flexibility to select parameters leading to the smallest hardware operators.

Figure 1. (a) Addition over $\mathbb{F}_{3^m}$. (b) Cubing over $\mathbb{F}_{3^m}$.
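The closed formula for cubing can be regenerated with a few lines of code instead of Maple or Pari. The sketch below assumes coefficients stored as integers in {0, 1, 2} (so $-1$ appears as 2) and uses the reduction rule $x^{97} \equiv -x^{12} + 1$ implied by $f(x) = x^{97} + x^{12} + 2$; the function names are illustrative only.

```python
M, K = 97, 12  # f(x) = x^97 + x^12 + 2 over F_3


def cube(a):
    """Return a(x)^3 mod f(x); a is a list of M coefficients in {0, 1, 2}."""
    # In characteristic three, (sum a_i x^i)^3 = sum a_i x^{3i} since a_i^3 = a_i.
    c = [0] * (3 * (M - 1) + 1)
    for i, ai in enumerate(a):
        c[3 * i] = ai % 3
    # Reduce from the top degree down using x^97 = -x^12 - 2 = -x^12 + 1 (mod 3).
    for d in range(len(c) - 1, M - 1, -1):
        t = c[d]
        if t:
            c[d] = 0
            c[d - M + K] = (c[d - M + K] - t) % 3
            c[d - M] = (c[d - M] + t) % 3
    return c[:M]


def closed_formula():
    """formula[k] maps i to the coefficient of a_i in b_k (cubing is F_3-linear)."""
    formula = [dict() for _ in range(M)]
    for i in range(M):
        basis = [0] * M
        basis[i] = 1
        for k, ck in enumerate(cube(basis)):
            if ck:
                formula[k][i] = ck
    return formula


FORMULA = closed_formula()
# A coefficient of 2 stands for -1, so the line below reads b_1 = a_65 - a_61.
assert FORMULA[1] == {65: 1, 61: 2}
```

Because cubing is linear over $\mathbb{F}_3$, cubing each basis monomial $x^i$ is enough to recover every $b_k$ as a signed sum of input coefficients, which is exactly what the hardware operator wires up.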
3.3. Multiplication over F3m
We designed a Most Significant Element (MSE) first multiplier over $\mathbb{F}_{3^m}$ based on a paper by Song and Parhi [24] to compute $a(x)b(x) \bmod f(x)$. At step $i$ we compute a degree $(m + D - 2)$ polynomial $t(x)$ which is the sum of $D$ partial products: $t(x) = \sum_{j=0}^{D-1} a_{Di+j} x^j b(x)$. A degree $(m + D - 1)$ polynomial $s(x)$, updated according to Horner's rule, accumulates the partial products:
$$s(x) \leftarrow t(x) + x^D \cdot (s(x) \bmod f(x)).$$
Thus, after $\lceil m/D \rceil$ steps, this algorithm returns a degree $(m + D - 1)$ polynomial $s(x)$, which is congruent to $a(x)b(x)$ modulo $f(x)$. The circuit described by Song and Parhi requires dedicated hardware to compute $p(x) = s(x) \bmod f(x)$ [24]. We suggest achieving this final modulo $f(x)$ reduction by performing an additional iteration with $a_{-j} = 0$, $1 \le j \le D$. Since $t(x)$ is now equal to zero, we have: $s(x) = x^D \cdot (a(x)b(x) \bmod f(x))$. Therefore, it suffices to consider the $m$ most significant coefficients of $s(x)$ to get the result: $p(x) = s(x)/x^D$. Algorithm 3 summarizes this multiplication scheme. Synthesis results indicate that for $D = 3$ and $D = 4$, such a multiplier requires respectively 1170 and 1560 LEs. According to our size constraint, up to ten multipliers can be included in our pairing accelerator.
Algorithm 3 MSE multiplication over $\mathbb{F}_{3^m}$.
Input: A degree $m$ monic polynomial $f(x) = x^m + f_{m-1}x^{m-1} + \ldots + f_1 x + f_0$, and two degree $(m-1)$ polynomials $a(x)$ and $b(x)$. We assume that $a_{-j} = 0$, $1 \le j \le D$. The algorithm requires a degree $(m + D - 1)$ polynomial $s(x)$ as well as a degree $(m + D - 2)$ polynomial $t(x)$ for intermediate computations.
Output: $p(x) = a(x)b(x) \bmod f(x)$
1: $s(x) \leftarrow 0$;
2: for $i$ in $\lceil m/D \rceil - 1$ downto $-1$ do
3:   $t(x) \leftarrow \sum_{j=0}^{D-1} a_{Di+j} x^j b(x)$;
4:   $s(x) \leftarrow t(x) + x^D \cdot (s(x) \bmod f(x))$;
5: end for
6: $p(x) \leftarrow s(x)/x^D$;
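Algorithm 3 can be modelled in software to check the extra-iteration trick that replaces the final reduction. The sketch below is a behavioural model under stated assumptions: polynomials are lists of coefficients in {0, 1, 2} with the lowest degree first, coefficients of $a(x)$ outside $0 \ldots m-1$ are taken as zero, and the number of regular steps is $\lceil m/D \rceil$. The helper names are hypothetical.

```python
def poly_mod(s, f):
    """Reduce s(x) modulo the monic polynomial f(x) over F_3."""
    m = len(f) - 1
    s = [c % 3 for c in s]
    for d in range(len(s) - 1, m - 1, -1):
        t = s[d]
        if t:
            for k in range(m + 1):
                s[d - m + k] = (s[d - m + k] - t * f[k]) % 3
    r = s[:m]
    return r + [0] * (m - len(r))


def mse_multiply(a, b, f, D):
    """MSE-first multiplication of a(x) and b(x) modulo f(x), D coefficients per step."""
    m = len(f) - 1
    steps = -(-m // D)  # ceil(m / D)
    s = [0] * (m + D)
    for i in range(steps - 1, -2, -1):  # the iteration i = -1 performs the final reduction
        # t(x) = sum_{j=0}^{D-1} a_{Di+j} x^j b(x), with a_idx = 0 outside 0..m-1
        t = [0] * (m + D - 1)
        for j in range(D):
            idx = D * i + j
            aj = a[idx] if 0 <= idx < m else 0
            for k, bk in enumerate(b):
                t[j + k] = (t[j + k] + aj * bk) % 3
        # s(x) <- t(x) + x^D * (s(x) mod f(x))
        reduced = poly_mod(s, f)
        s = [0] * D + reduced
        for k, tk in enumerate(t):
            s[k] = (s[k] + tk) % 3
    return s[D:]  # p(x) = s(x) / x^D


def schoolbook_mod(a, b, f):
    """Reference result: full product followed by a single reduction."""
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % 3
    return poly_mod(prod, f)
```

With `f` set to the coefficient list of $x^{97} + x^{12} + 2$, `mse_multiply(a, b, f, 3)` and `mse_multiply(a, b, f, 4)` should agree with `schoolbook_mod(a, b, f)` for any inputs of 97 coefficients.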
3.4. Multiplication over F36m
The cost of Algorithm 2 is dominated by the multiplication of $R_0$ by $R_1$ over $\mathbb{F}_{3^{6m}}$. By applying Karatsuba-Ofman's algorithm (see for instance [25]) and taking advantage of the constant coefficients of $R_1$, the product $R_0 R_1$ could be computed in parallel by means of 13 multiplications and 50 additions (or subtractions) over $\mathbb{F}_{3^m}$ [6]. Two further multiplications are needed to compute $y_p y_q$ as well as $r_0^2$ (a straightforward modification of the scheduling of Algorithm 2 makes it possible to compute $r_0^2$, $y_p y_q$, and $R_0 R_1$ in parallel). However, according to our size constraints, it is impossible to implement 15 multipliers on our target FPGA. Furthermore, our processor embeds only three adders over $\mathbb{F}_{3^m}$ and scheduling 50 additions could be a complex task.
We propose here an algorithm which offers a better trade-off between the number of additions and multiplications. Let $A = a_0 + a_1\sigma + a_2\rho + a_3\sigma\rho + a_4\rho^2 + a_5\sigma\rho^2$ and $C = c_0 + c_1\sigma + c_2\rho + c_3\sigma\rho + c_4\rho^2 + c_5\sigma\rho^2$ be two elements of $\mathbb{F}_{3^{6m}}$. We write each coefficient $c_i$ as a sum of two elements $c_i^{(0)}$ and $c_i^{(1)} \in \mathbb{F}_{3^m}$. Thanks to this notation we define the product $C = A \cdot (-r_0^2 + y_p y_q \sigma - r_0 \rho - \rho^2)$ as follows:
$$c_0^{(0)} = -a_4 r_0 - a_2, \qquad c_0^{(1)} = -a_0 r_0^2 - a_1 y_p y_q,$$
$$c_1^{(0)} = -a_5 r_0 - a_3, \qquad c_1^{(1)} = a_0 y_p y_q - a_1 r_0^2,$$
$$c_2^{(0)} = -a_0 r_0 - a_4 + c_0^{(0)}, \qquad c_2^{(1)} = -a_2 r_0^2 - a_3 y_p y_q,$$
$$c_3^{(0)} = -a_1 r_0 - a_5 + c_1^{(0)}, \qquad c_3^{(1)} = a_2 y_p y_q - a_3 r_0^2,$$
$$c_4^{(0)} = -a_2 r_0 - a_0 - a_4, \qquad c_4^{(1)} = -a_4 r_0^2 - a_5 y_p y_q,$$
$$c_5^{(0)} = -a_3 r_0 - a_1 - a_5, \qquad c_5^{(1)} = a_4 y_p y_q - a_5 r_0^2.$$
Note that computation of the $c_i^{(0)}$'s, $0 \le i \le 5$, requires six multiplications over $\mathbb{F}_{3^m}$ and depends neither on $r_0^2$ nor on $y_p y_q$. Thus, we can perform eight multiplications over $\mathbb{F}_{3^m}$ in parallel ($r_0^2$, $y_p y_q$, and $a_i r_0$, $0 \le i \le 5$). Consider now $c_0^{(1)}$ and $c_1^{(1)}$ and assume that $(a_0 + a_1)$, as well as $(y_p y_q - r_0^2)$, are stored in registers. Karatsuba-Ofman's algorithm allows one to compute $c_0^{(1)}$ and $c_1^{(1)}$ by means of three multiplications and three additions over $\mathbb{F}_{3^m}$:
$$c_0^{(1)} = -a_0 r_0^2 - a_1 y_p y_q, \qquad (4)$$
$$c_1^{(1)} = (a_0 + a_1)(y_p y_q - r_0^2) + a_0 r_0^2 - a_1 y_p y_q. \qquad (5)$$
Therefore, the computation of the $c_i^{(1)}$'s involves nine multiplications over $\mathbb{F}_{3^m}$, which can be carried out in parallel.
Algorithm 4 summarizes this multiplication scheme involving 17 multiplications and 29 additions (or subtractions) over $\mathbb{F}_{3^m}$. Since at most nine multiplications can be performed in parallel, our pairing accelerator hosts nine multipliers over $\mathbb{F}_{3^m}$ and the computation of $R_0 R_1$ involves two multiplication cycles. Careful scheduling makes it possible to share operands between up to three operators, thus saving hardware resources (Table 1): during the first multiplication cycle, M0, M1, and M2 respectively compute $a_0 r_0$, $a_2 r_0$, and $a_4 r_0$. The MSE multiplier described in Section 3.3 stores its first operand in a shift register, and its second operand in a standard register. Since a shift register is more complex (an operand is loaded in parallel, and then shifted), we load the common operand $r_0$ in this component. At the end of the first cycle, the three standard registers still contain $a_0$, $a_2$, and $a_4$. Therefore it suffices to load $r_0^2$ in the shift register before starting the second multiplication cycle. Figure 2a describes the operator we designed. This component is connected to the addition/subtraction operator described in Section 3.1 (Figure 2c). Note that the same architecture can be used to compute $a_1 r_0$, $a_3 r_0$, $a_5 r_0$, $a_1 y_p y_q$, $a_3 y_p y_q$, and $a_5 y_p y_q$. The five remaining multiplications involve a slightly more complex component (Figure 2b). Two shift registers are required to compute $r_0^2$ and $y_p y_q$ since there is no common operand. At the end of the first multiplication cycle, a dedicated subtracter computes $y_p y_q - r_0^2$ and stores the result in the shift registers. Three clock cycles are required to load $(a_0 + a_1)$, $(a_2 + a_3)$, and $(a_4 + a_5)$, which have been computed during the first multiplication cycle (see Algorithm 4). This approach could also be adopted to implement the multiplication of $\tilde{R}_0$ by $\tilde{R}_1$ in Algorithm 1.
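Table 1. Multiplication over $\mathbb{F}_{3^m}$: scheduling.

| Multiplier | First cycle | Second cycle |
|---|---|---|
| M0 | $a_0 \cdot r_0$ | $a_0 \cdot r_0^2$ |
| M1 | $a_2 \cdot r_0$ | $a_2 \cdot r_0^2$ |
| M2 | $a_4 \cdot r_0$ | $a_4 \cdot r_0^2$ |
| M3 | $a_1 \cdot r_0$ | $a_1 \cdot y_p y_q$ |
| M4 | $a_3 \cdot r_0$ | $a_3 \cdot y_p y_q$ |
| M5 | $a_5 \cdot r_0$ | $a_5 \cdot y_p y_q$ |
| M6 | $r_0 \cdot r_0$ | $(a_0 + a_1) \cdot (y_p y_q - r_0^2)$ |
| M7 | $y_p \cdot y_q$ | $(a_2 + a_3) \cdot (y_p y_q - r_0^2)$ |
| M8 | – | $(a_4 + a_5) \cdot (y_p y_q - r_0^2)$ |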
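The four-step schedule of Algorithm 4 can be checked against a generic product in the basis $\{1, \sigma, \rho, \sigma\rho, \rho^2, \sigma\rho^2\}$. The sketch below assumes $b = 1$ and, for the sake of an exhaustive test, uses the toy base field $\mathbb{F}_3$ (that is, $m = 1$); since only the relations $\sigma^2 = -1$ and $\rho^3 = \rho + 1$ are used, the same identities carry over to any $\mathbb{F}_{3^m}$. Function names are illustrative.

```python
from itertools import product

# Reduction of rho^t for t = 0..4 using rho^3 = rho + 1 (b = 1): rho^4 = rho^2 + rho.
RHO_RED = {0: {0: 1}, 1: {1: 1}, 2: {2: 1}, 3: {0: 1, 1: 1}, 4: {1: 1, 2: 1}}


def mul_f36(A, B):
    """Generic product; index 2*t + s holds the coefficient of sigma^s * rho^t."""
    C = [0] * 6
    for i, ai in enumerate(A):
        for j, bj in enumerate(B):
            s, t = (i % 2) + (j % 2), (i // 2) + (j // 2)
            sign = -1 if s == 2 else 1  # sigma^2 = -1
            for t_red, coef in RHO_RED[t].items():
                C[2 * t_red + (s % 2)] += sign * coef * ai * bj
    return [c % 3 for c in C]


def algorithm4(A, r0, yp, yq):
    """Steps 1 to 4 of Algorithm 4: 17 multiplications over the base field."""
    a0, a1, a2, a3, a4, a5 = A
    # Step 1: eight multiplications and three additions.
    p0, p1, p2, p3, p4, p5 = (ai * r0 for ai in A)
    p6, p7 = r0 * r0, yp * yq
    s0, s1, s2 = a0 + a1, a2 + a3, a4 + a5
    # Step 2: seven additions.
    s4 = p7 - p6
    c2, c4, c0 = a4 + p0, a0 + p2, a2 + p4
    c3, c5, c1 = a5 + p1, a1 + p3, a3 + p5
    # Step 3: nine multiplications and four additions.
    p8, p9, p10 = a0 * p6, a1 * p7, s0 * s4
    p11, p12, p13 = a2 * p6, a3 * p7, s1 * s4
    p14, p15, p16 = a4 * p6, a5 * p7, s2 * s4
    c2, c3, c4, c5 = c2 + c0, c3 + c1, c4 + a4, c5 + a5
    # Step 4: fifteen additions (or subtractions).
    c0, c1 = -c0 - p8 - p9, -c1 + p10 + p8 - p9
    c2, c3 = -c2 - p11 - p12, -c3 + p13 + p11 - p12
    c4, c5 = -c4 - p14 - p15, -c5 + p16 + p14 - p15
    return [c % 3 for c in (c0, c1, c2, c3, c4, c5)]


def sparse_operand(r0, yp, yq):
    """Coefficients of -r0^2 + yp*yq*sigma - r0*rho - rho^2."""
    return [(-r0 * r0) % 3, (yp * yq) % 3, (-r0) % 3, 0, 2, 0]


assert all(
    algorithm4(A, r0, yp, yq) == mul_f36(list(A), sparse_operand(r0, yp, yq))
    for A in product(range(3), repeat=6)
    for r0, yp, yq in product(range(3), repeat=3)
)
```

The step comments mirror the operation counts quoted in the text (8 + 9 multiplications and 3 + 7 + 4 + 15 additions), which makes it easy to see where the two multiplication cycles of Table 1 come from.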
3.5. Architecture of the Pairing Accelerator
Figure 2c shows the architecture of our hardware accelerator. Inputs and outputs, as well as intermediate results, are stored in registers implemented using embedded memory blocks available in the FPGA. The control unit mainly consists of a ROM containing the microcode of Algorithm 2 and a program counter. The size of the microcode depends on $D$, the number of coefficients processed at each clock cycle by a multiplier over $\mathbb{F}_{3^m}$. For $D = 3$, the initialization step of Algorithm 2 (copy of inputs in registers of multipliers and computation of $r_0$, $d$, and $R_0$) and the main loop respectively require 47 and 98 clock cycles. Since $m = 97$ and the main loop is executed $(m+1)/2 = 49$ times, a pairing is completed after $47 + 98 \cdot (m+1)/2 = 47 + 98 \cdot 49 = 4849$ clock cycles. For $D = 4$, the initialization and the main loop respectively involve 39 and 80 microinstructions. Thus, the computation of a pairing requires $39 + 80 \cdot 49 = 3959$ clock cycles.
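The cycle counts above, together with the clock frequencies reported in Table 2, give the calculation times directly. The following arithmetic sketch uses hypothetical helper names and assumes one loop iteration for each value of $i$ from 0 to $(m-1)/2$ inclusive.

```python
def pairing_cycles(m, init_cycles, loop_cycles):
    """Total clock cycles: initialization plus one loop body per iteration of Algorithm 2."""
    iterations = (m - 1) // 2 + 1  # i runs from 0 to (m - 1)/2 inclusive
    return init_cycles + loop_cycles * iterations


def time_us(cycles, freq_mhz):
    """Calculation time in microseconds for a clock frequency given in MHz."""
    return cycles / freq_mhz


cycles_d3 = pairing_cycles(97, 47, 98)  # D = 3
cycles_d4 = pairing_cycles(97, 39, 80)  # D = 4
print(cycles_d3, round(time_us(cycles_d3, 149), 1))
print(cycles_d4, round(time_us(cycles_d4, 147), 1))
```

Rounded to the nearest microsecond, these values correspond to the 33 µs and 27 µs entries of Table 2.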
4. Results and Comparisons
The proposed architecture was captured in the VHDL
language and prototyped on an Altera Cyclone II
EP2C35F672C6 device. Both synthesis and place-and-
route steps were performed with Quartus II 6.0 Web Edi-
tion. VHDL simulations and experiments with a DE2 board
were carried out to extensively test our design. The area and the calculation time depend on $D$, the number of coefficients of a multiplier processed at each clock cycle (Section 3.3). The two rightmost columns of Table 2 summarize our results for $D = 3$ and $D = 4$. When $D = 3$, the pairing accelerator occupies 45% of the LEs, thus meeting our size constraint (Section 3). However, choosing $D = 4$ leads to an architecture which requires 56% of the configurable logic. Several researchers described implementations of pairing algorithms on Xilinx Virtex-II Pro FPGAs and reported the area in terms of slices. Each slice features two 4-input LUTs, carry logic, wide function multiplexers, and two storage elements. Let us assume that Xilinx design tools try to utilize both LUTs of a slice as often as possible (i.e. area optimization). Under this hypothesis, we consider that a slice is roughly equivalent to two LEs in our comparisons.

Algorithm 4 Multiplication over $\mathbb{F}_{3^{6m}}$.
Input: $A = a_0 + a_1\sigma + a_2\rho + a_3\sigma\rho + a_4\rho^2 + a_5\sigma\rho^2 \in \mathbb{F}_{3^{6m}}$. $r_0$, $y_p$, and $y_q \in \mathbb{F}_{3^m}$.
Output: $C = A \cdot (-r_0^2 + y_p y_q \sigma - r_0 \rho - \rho^2)$
1: Compute in parallel (8 multiplications and 3 additions over $\mathbb{F}_{3^m}$):
   $p_i \leftarrow a_i r_0$, $0 \le i \le 5$; $p_6 \leftarrow r_0 r_0$; $p_7 \leftarrow y_p y_q$;
   $s_0 \leftarrow a_0 + a_1$; $s_1 \leftarrow a_2 + a_3$; $s_2 \leftarrow a_4 + a_5$;
2: Compute in parallel (7 additions over $\mathbb{F}_{3^m}$):
   $s_4 \leftarrow p_7 - p_6$; // $y_p y_q - r_0^2$
   $c_0 \leftarrow a_2 + p_4$; // $a_2 + a_4 r_0$
   $c_1 \leftarrow a_3 + p_5$; // $a_3 + a_5 r_0$
   $c_2 \leftarrow a_4 + p_0$; // $a_4 + a_0 r_0$
   $c_3 \leftarrow a_5 + p_1$; // $a_5 + a_1 r_0$
   $c_4 \leftarrow a_0 + p_2$; // $a_0 + a_2 r_0$
   $c_5 \leftarrow a_1 + p_3$; // $a_1 + a_3 r_0$
3: Compute in parallel (9 multiplications and 4 additions over $\mathbb{F}_{3^m}$):
   $p_8 \leftarrow a_0 p_6$; // $a_0 r_0^2$
   $p_9 \leftarrow a_1 p_7$; // $a_1 y_p y_q$
   $p_{10} \leftarrow s_0 s_4$; // $(a_0 + a_1)(y_p y_q - r_0^2)$
   $p_{11} \leftarrow a_2 p_6$; // $a_2 r_0^2$
   $p_{12} \leftarrow a_3 p_7$; // $a_3 y_p y_q$
   $p_{13} \leftarrow s_1 s_4$; // $(a_2 + a_3)(y_p y_q - r_0^2)$
   $p_{14} \leftarrow a_4 p_6$; // $a_4 r_0^2$
   $p_{15} \leftarrow a_5 p_7$; // $a_5 y_p y_q$
   $p_{16} \leftarrow s_2 s_4$; // $(a_4 + a_5)(y_p y_q - r_0^2)$
   $c_2 \leftarrow c_2 + c_0$; $c_3 \leftarrow c_3 + c_1$; $c_4 \leftarrow c_4 + a_4$; $c_5 \leftarrow c_5 + a_5$;
4: Compute in parallel (15 additions over $\mathbb{F}_{3^m}$):
   $c_0 \leftarrow -c_0 - p_8 - p_9$; $c_1 \leftarrow -c_1 + p_{10} + p_8 - p_9$;
   $c_2 \leftarrow -c_2 - p_{11} - p_{12}$; $c_3 \leftarrow -c_3 + p_{13} + p_{11} - p_{12}$;
   $c_4 \leftarrow -c_4 - p_{14} - p_{15}$; $c_5 \leftarrow -c_5 + p_{16} + p_{14} - p_{15}$;
To the best of our knowledge, the FPGA-based pairing accelerator described by Shu et al. in [22] is the fastest to date. It computes the Tate pairing over $\mathbb{F}_{2^{239}}$ in 34 µs on a Virtex-II Pro 100 device (25287 slices). Ronan et al. designed an embedded processor to compute the $\eta_T$ pairing on genus 2 hyperelliptic curves [20]. This architecture requires 43986 slices on a Virtex-II Pro 125 device and computes a pairing in 749 µs. Kerins et al. proposed an implementation of the modified Duursma-Lee algorithm on a Xilinx Virtex-II Pro 125 FPGA [17]. Multiplication over $\mathbb{F}_{3^{6m}}$ is performed according to Karatsuba-Ofman's algorithm. However, since the authors do not take advantage of the constant terms of $R_1$, this operation requires 18 multiplications over $\mathbb{F}_{3^m}$. Thus, the hardware architecture consists of 18 multipliers and 6 cubing circuits over $\mathbb{F}_{3^{97}}$, along with "a suitable amount of simpler $\mathbb{F}_{3^m}$ arithmetic circuits for performing addition, subtraction, and negation" [17]. The authors claim that roughly 100% of available resources are required to implement their pairing accelerator. We can therefore estimate the cost at 55616 slices [22]. Remember that our target FPGA embeds 33216 LEs. Consequently, even if the final exponentiation unit we left for future work requires 50% of the device, our processor is smaller than the aforementioned solutions. Furthermore, our approach requires a less expensive FPGA technology for which free simulation and design tools are available.
Grabher and Page designed a coprocessor dealing with $\mathbb{F}_{3^m}$ arithmetic, which is controlled by a general purpose processor [14]. Their hardware accelerator embeds a single multiplier over $\mathbb{F}_{3^m}$. Our architecture requires roughly twice as many LEs, while performing up to nine multiplications in parallel.
Several researchers studied the software implementation of pairings on smartcards or mobile phones (see for instance [16] and [21]). For comparison purposes, they often provide the reader with timings on desktop computers. Table 3 summarizes such results, which indicate that our FPGA architecture achieves a speedup of more than 100 (0.033 ms against 3.7 ms for the fastest software implementation listed).
Figure 2. (a) and (b) Building blocks for multiplication over $\mathbb{F}_{3^{6m}}$. (c) Architecture of the $\eta_T$ pairing accelerator.
Table 3. Comparisons with software implementations on desktop computers.

| | Kawahara et al. [16] | Scott et al. [21] | Proposed architecture |
|---|---|---|---|
| Algorithm | $\eta_T$ pairing | $\eta_T$ pairing | Algorithm 2 |
| Processor | Pentium M | Pentium 4 | FPGA |
| Clock frequency | 1.73 GHz | 3 GHz | 0.149 GHz |
| Calculation time | 10.15 ms | 3.7 ms | 0.033 ms |
5. Conclusions
We have proposed a modified $\eta_T$ pairing algorithm on supersingular elliptic curves over $\mathbb{F}_{3^m}$ which does not need any cube root. We have then described a pairing accelerator based on a low cost platform hosting an Altera Cyclone II FPGA. Since VHDL simulation and FPGA configuration are performed with free design tools, the price of our system is comparable to that of an entry level desktop computer. Our results demonstrate a one hundred-fold improvement over software implementations, and a ten-fold improvement over the best known FPGA implementation in characteristic three. We achieve the same calculation time as the fastest published accelerator in characteristic two, while requiring fewer hardware resources. Further work will include the design of a small processing unit responsible for final exponentiation.
Acknowledgement
The authors would like to thank the anonymous reviewers for their useful comments. This work was supported by the New Energy and Industrial Technology Development Organization (NEDO), Japan.
References
[1] Altera. Cyclone II Device Handbook, 2006. Available from
Altera’s web site (http://altera.com).
[2] Altera. DE2 Development and Education Board – User
Manual, 2006. Available from Altera’s web site (http:
//altera.com).
[3] P. S. L. M. Barreto. A note on efficient computation of cube
roots in characteristic 3. Cryptology ePrint Archive, Report
2004/305, 2004.
[4] P. S. L. M. Barreto, S. Galbraith, C. Ó hÉigeartaigh, and M. Scott. Efficient pairing computation on supersingular Abelian varieties. Cryptology ePrint Archive, Report 2004/375, 2004.
[5] P. S. L. M. Barreto, H. Y. Kim, B. Lynn, and M. Scott.
Efficient algorithms for pairing-based cryptosystems. In
M. Yung, editor, Advances in Cryptology – CRYPTO 2002,
number 2442 in Lecture Notes in Computer Science, pages
354–368. Springer, 2002.
[6] G. Bertoni, L. Breveglieri, P. Fragneto, and G. Pelosi. Par-
allel hardware architectures for the cryptographic Tate pair-
ing. In Proceedings of the Third International Conference
on Information Technology: New Generations (ITNG’06).
IEEE Computer Society, 2006.
[7] J.-L. Beuchat, M. Shirase, T. Takagi, and E. Okamoto. An
algorithm for the ηT pairing calculation in characteristic
three and its hardware implementation. Cryptology ePrint
Archive, Report 2006/327, 2006.
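Table 2. Comparison against previous FPGA implementations. The parameter $D$ refers to the number of coefficients processed at each clock cycle by a multiplier.

| | Shu, Kwon, and Gaj [22] | Ronan et al. [20] | Grabher and Page [14] | Kerins et al. [17] | Proposed architecture ($D = 3$) | Proposed architecture ($D = 4$) |
|---|---|---|---|---|---|---|
| Algorithm | $\eta_T$ pairing | $\eta_T$ pairing | Duursma-Lee | Duursma-Lee | Algorithm 2 | Algorithm 2 |
| Underlying field | $\mathbb{F}_{2^{239}}$ | $\mathbb{F}_{2^{103}}$ | $\mathbb{F}_{3^{97}}$ | $\mathbb{F}_{3^{97}}$ | $\mathbb{F}_{3^{97}}$ | $\mathbb{F}_{3^{97}}$ |
| Curve | Elliptic | Hyperelliptic | Elliptic | Elliptic | Elliptic | Elliptic |
| FPGA | Virtex-II Pro 100 | Virtex-II Pro 125 | Virtex-II Pro 4 | Virtex-II Pro 125 | Cyclone II EP2C35 | Cyclone II EP2C35 |
| Free design tools | No | No | Yes | No | Yes | Yes |
| Controller | Hardwired logic | Hardwired logic | Microprocessor | Hardwired logic | Hardwired logic | Hardwired logic |
| Multiplier(s) | 6 (over $\mathbb{F}_{2^{239}}$) | 12 (over $\mathbb{F}_{2^{103}}$) | 1 (over $\mathbb{F}_{3^{97}}$) | 18 (over $\mathbb{F}_{3^{97}}$) | 9 (over $\mathbb{F}_{3^{97}}$) | 9 (over $\mathbb{F}_{3^{97}}$) |
| Area | 25287 slices | 43986 slices | 4481 slices | 55616 slices | 14895 LEs | 18553 LEs |
| Clock cycles | – | – | – | 12866 | 4849 | 3959 |
| Clock frequency | 84 MHz | 32.3 MHz | 150 MHz | 15 MHz | 149 MHz | 147 MHz |
| Calculation time | 34 µs | 749 µs | 399.4 µs | 850 µs | 33 µs | 27 µs |
| Final exponentiation | Yes | Yes | No | Yes | No | No |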
[8] D. Boneh and M. Franklin. Identity-based encryption from
the Weil pairing. In J. Kilian, editor, Advances in Cryp-
tology – CRYPTO 2001, number 2139 in Lecture Notes in
Computer Science, pages 213–229. Springer, 2001.
[9] D. Boneh, C. Gentry, and B. Waters. Collusion resistant
broadcast encryption with short ciphertexts and private keys.
In V. Shoup, editor, Advances in Cryptology – CRYPTO
2005, number 3621 in Lecture Notes in Computer Science,
pages 258–275. Springer, 2005.
[10] D. Boneh, B. Lynn, and H. Shacham. Short signatures from
the Weil pairing. In C. Boyd, editor, Advances in Cryptol-
ogy – ASIACRYPT 2001, number 2248 in Lecture Notes in
Computer Science, pages 514–532. Springer, 2001.
[11] I. Duursma and H. S. Lee. Tate pairing implementation for hyperelliptic curves $y^2 = x^p - x + d$. In C. S. Laih, editor, Advances in Cryptology – ASIACRYPT 2003, number 2894 in Lecture Notes in Computer Science, pages 111–123. Springer, 2003.
[12] K. Fong, D. Hankerson, J. López, and A. Menezes. Field inversion and point halving revisited. IEEE Transactions on Computers, 53(8):1047–1059, Aug. 2004.
[13] S. D. Galbraith, K. Harrison, and D. Soldera. Implementing
the Tate pairing. In C. Fieker and D. Kohel, editors, Algo-
rithmic Number Theory – ANTS V, number 2369 in Lecture
Notes in Computer Science, pages 324–337. Springer, 2002.
[14] P. Grabher and D. Page. Hardware acceleration of the Tate
Pairing in characteristic three. In J. R. Rao and B. Sunar,
editors, Cryptographic Hardware and Embedded Systems –
CHES 2005, number 3659 in Lecture Notes in Computer
Science, pages 398–411. Springer, 2005.
[15] K. Harrison, D. Page, and N. P. Smart. Software imple-
mentation of finite fields of characteristic three, for use in
pairing-based cryptosystems. LMS Journal of Computation
and Mathematics, 5:181–193, Nov. 2002.
[16] Y. Kawahara, T. Takagi, and E. Okamoto. Efficient imple-
mentation of Tate pairing on a mobile phone using Java.
Cryptology ePrint Archive, Report 2006/299, 2006.
[17] T. Kerins, W. P. Marnane, E. M. Popovici, and P. Barreto.
Efficient hardware for the Tate Pairing calculation in char-
acteristic three. In J. R. Rao and B. Sunar, editors, Cryp-
tographic Hardware and Embedded Systems – CHES 2005,
number 3659 in Lecture Notes in Computer Science, pages
412–426. Springer, 2005.
[18] S. Kwon. Efficient Tate pairing computation for supersin-
gular elliptic curves over binary fields. Cryptology ePrint
Archive, Report 2004/303, 2004.
[19] V. S. Miller. Short programs for functions on
curves. Unpublished manuscript available at
http://crypto.stanford.edu/miller/miller.pdf, 1986.
[20] R. Ronan, C. Ó hÉigeartaigh, C. Murphy, M. Scott, T. Kerins, and W. Marnane. An embedded processor for a pairing-based cryptosystem. In Proceedings of the Third International Conference on Information Technology: New Generations (ITNG'06). IEEE Computer Society, 2006.
[21] M. Scott, N. Costigan, and W. Abdulwahab. Implement-
ing cryptographic pairings on smartcards. Cryptology ePrint
Archive, Report 2006/144, 2006.
[22] C. Shu, S. Kwon, and K. Gaj. FPGA accelerated Tate pair-
ing based cryptosystem over binary fields. Cryptology ePrint
Archive, Report 2006/179, 2006.
[23] J. H. Silverman. The Arithmetic of Elliptic Curves. Num-
ber 106 in Graduate Texts in Mathematics. Springer-Verlag,
1986.
[24] L. Song and K. K. Parhi. Low energy digit-serial/parallel
finite field multipliers. Journal of VLSI Signal Processing,
19(2):149–166, July 1998.
[25] D. Zuras. More on squaring and multiplying large inte-
gers. IEEE Transactions on Computers, 43(8):899–908,
Aug. 1994.
