In this paper, we propose a modified $\eta_T$ pairing algorithm in characteristic three which does not need any cube root extraction. We also discuss its implementation on a low-cost platform which hosts an Altera Cyclone II FPGA device. Our pairing accelerator is ten times faster than previously known FPGA implementations in characteristic three.
Introduction
Since the introduction of pairings over (hyper)elliptic curves in constructive cryptographic applications, an ever increasing number of protocols based on Weil or Tate pairings have appeared in the literature: identity-based encryption [8], short signature [10], and efficient broadcast encryption [9], to mention but a few. Nowadays pairing-based cryptosystems have become a central research topic in cryptography.
Miller's algorithm [19] was the only way to compute the Tate pairing until 2002, when significant improvements were independently proposed by Barreto et al. [5] and Galbraith et al. [13]. One year later, Duursma and Lee gave a closed formula in the case of characteristic three [11]. They described an iterative scheme involving additions, multiplications, cubing operations, and cube root extractions over $\mathbb{F}_{3^m}$. This work was then extended by Kwon, who proposed a closed formula for the Tate pairing computation for supersingular elliptic curves over $\mathbb{F}_{2^m}$ with odd dimension $m$ [18]. Furthermore, he proved that both his algorithm and the Duursma-Lee algorithm can be modified so that no inverse Frobenius map (i.e. square root in characteristic two or cube root in characteristic three) is required. Fong et al. showed that extracting a square root in $\mathbb{F}_{2^m}$ requires approximately the time of a field multiplication and proposed an improved scheme for trinomials [12].
Barreto extended this approach to cube roots in characteristic three [3]: if $\mathbb{F}_{3^m}$ admits an irreducible trinomial $x^m + ax^k + b$ ($a, b \in \{-1, 1\}$) with the property $k \equiv m \pmod 3$, then five shifts and five additions suffice to implement this operation. However, these algorithms restrict the choice of curves, so it is worthwhile to design pairing algorithms without inverse Frobenius maps. Hardware implementations also benefit from such pairing algorithms: removing the inverse Frobenius maps leads to simpler arithmetic and logic units.
By introducing the $\eta_T$ pairing, Barreto et al. reduced the number of iterations of the Duursma-Lee algorithm by half [4]. However, this algorithm reintroduces inverse Frobenius maps. Recently, Shu et al. described how to get rid of square roots in characteristic two [22]. In this paper, we introduce a modified $\eta_T$ pairing algorithm in characteristic three which does not require any cube root (Section 2). Then, we discuss its hardware implementation on a low-cost Field Programmable Gate Array (FPGA) board hosting Altera Cyclone II technology (Section 3) and we compare this pairing accelerator against several software and hardware architectures reported in the literature (Section 4).
An Algorithm for the $\eta_T$ Pairing Calculation
Let $E$ be an elliptic curve over $\mathbb{F}_q$, where $q$ is a power of a prime number. A formal symbol $(P)$ is defined for each point $P$ of the curve. A divisor $D$ on $E$ is then a finite linear combination of such symbols with integer coefficients: $D = \sum_j a_j (P_j)$, $a_j \in \mathbb{Z}$. The degree of a divisor is defined by $\deg(\sum_j a_j (P_j)) = \sum_j a_j \in \mathbb{Z}$. For an introduction to divisors, we refer the reader to [23]. Let $l > 0$ be an integer relatively prime to $q$. The least positive integer $k$ satisfying $q^k \equiv 1 \pmod l$ is called the embedding degree or security multiplier. Let $E(\mathbb{F}_q)[l]$ be the set of points $P \in E(\mathbb{F}_q)$ such that $lP = O$, where $O$ is the point at infinity. Consider $P \in E(\mathbb{F}_q)[l]$ and $Q \in E(\mathbb{F}_{q^k})[l]$. The reduced Tate pairing is given by
$$e_l(P, Q) = f_{l,P}(D_Q)^{(q^k-1)/l}, \qquad (1)$$
where $f_{l,P}$ is a rational function on $E$ whose divisor is equivalent to $l(P) - l(O)$, and $D_Q$ is a divisor of degree 0 equivalent to $(Q) - (O)$. $f_{l,P}$ and $D_Q$ have disjoint supports. The computation of the $(q^k - 1)/l$-th power is referred to as final exponentiation. The reduced Tate pairing satisfies the following properties:
• Bilinearity: let $a$ be an integer; then $e_l(aP, Q) = e_l(P, aQ) = e_l(P, Q)^a$.
• Non-degeneracy: if $e_l(P, Q) = 1$ for all $Q \in E(\mathbb{F}_{q^k})[l]$, then $P = O$.
Equation (1) [5, 13, 11, 18] ). Barreto et al. [5] proved that the reduced pairing can be computed as
$e_l(P, Q) = f_{l,P}(Q)^{(q^k-1)/l}$, where $f_{l,P}$ is evaluated on a point rather than on a divisor. In the same paper, the authors exploited a distortion map to further enhance Miller's algorithm.
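The embedding degree defined earlier in this section can be illustrated with a short Python sketch that searches for the least $k$ with $q^k \equiv 1 \pmod l$. The function name `embedding_degree`, the search bound `max_k`, and the toy parameters ($q = 3^5$, $l = 271 = 3^5 + 1 + 3^3$) are assumptions chosen for this example; they are not taken from the paper.

```python
import math


def embedding_degree(q, l, max_k=1000):
    """Return the least k >= 1 with q**k == 1 (mod l), or None if none is found.

    q and l must be coprime, otherwise no such k exists.
    max_k is an arbitrary search bound (an assumption of this sketch).
    """
    if math.gcd(q, l) != 1:
        raise ValueError("q and l must be relatively prime")
    acc = 1
    for k in range(1, max_k + 1):
        acc = (acc * q) % l  # acc == q**k mod l
        if acc == 1:
            return k
    return None


# Toy parameters (hypothetical, for illustration only):
# q = 3^5 and l = 3^5 + 1 + 3^3 = 271, which is prime.
k = embedding_degree(3**5, 271)
print(k)  # prints 6
```

The value 6 is consistent with the arithmetic over $\mathbb{F}_{3^{6m}}$ used throughout the paper for supersingular curves in characteristic three.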
This work is devoted to the computation of pairings in characteristic three (i.e. $q = 3^m$, where $m$ is odd). Let $E_b$ be a supersingular elliptic curve over $\mathbb{F}_{3^m}$:
where $\sigma$ and $\rho$ belong to $\mathbb{F}_{3^{6m}}$ and respectively satisfy $\sigma^2 = -1$ and $\rho^3 = \rho + b$. The modified Tate pairing $\hat{e}(P, Q)$ is then given by $\hat{e}(P, Q) = e_l(P, \psi(Q))$.
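where the $a_i$'s belong to $\mathbb{F}_{3^m}$. This representation is equivalent to a tower extension of $\mathbb{F}_{3^m}$ (see for instance [17]):
$\mathbb{F}_{3^{2m}} \cong \mathbb{F}_{3^m}[y]/(y^2 + 1)$ and $\mathbb{F}_{3^{6m}} \cong \mathbb{F}_{3^{2m}}[z]/(z^3 - z - b)$, where $y^2 + 1$ and $z^3 - z - b$ are respectively irreducible polynomials over $\mathbb{F}_{3^m}$ and $\mathbb{F}_{3^{2m}}$. This tower field representation allows one to replace arithmetic over $\mathbb{F}_{3^{6m}}$ by arithmetic over $\mathbb{F}_{3^m}$.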
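To make the tower representation concrete, here is a minimal Python sketch of multiplication in the basis $(1, \sigma, \rho, \sigma\rho, \rho^2, \sigma\rho^2)$ with $\sigma^2 = -1$ and $\rho^3 = \rho + b$. It assumes the toy case $m = 1$, so that the base field is simply $\mathbb{F}_3$ (integers modulo 3) and the extension is $\mathbb{F}_{3^6}$; the function names and the choice $b = 1$ are hypothetical and serve only as an illustration.

```python
P = 3   # characteristic
B = 1   # curve parameter b in {1, -1} (assumed value for this sketch)


def add_f9(u, v):
    """Add two elements u0 + u1*sigma of F_3[sigma]/(sigma^2 + 1)."""
    return ((u[0] + v[0]) % P, (u[1] + v[1]) % P)


def mul_f9(u, v):
    """Multiply two elements of F_3[sigma]/(sigma^2 + 1), using sigma^2 = -1."""
    return ((u[0] * v[0] - u[1] * v[1]) % P, (u[0] * v[1] + u[1] * v[0]) % P)


def mul_f729(a, c, b=B):
    """Multiply two elements given by six coefficients in the basis
    (1, sigma, rho, sigma*rho, rho^2, sigma*rho^2)."""
    # View each operand as a degree-2 polynomial in rho with coefficients in F_9.
    A = [(a[0], a[1]), (a[2], a[3]), (a[4], a[5])]
    C = [(c[0], c[1]), (c[2], c[3]), (c[4], c[5])]
    prod = [(0, 0)] * 5
    for i in range(3):
        for j in range(3):
            prod[i + j] = add_f9(prod[i + j], mul_f9(A[i], C[j]))
    # Reduce with rho^3 = rho + b and rho^4 = rho^2 + b*rho.
    bb = (b % P, 0)
    r0 = add_f9(prod[0], mul_f9(bb, prod[3]))
    r1 = add_f9(add_f9(prod[1], prod[3]), mul_f9(bb, prod[4]))
    r2 = add_f9(prod[2], prod[4])
    return [r0[0], r0[1], r1[0], r1[1], r2[0], r2[1]]


sigma = [0, 1, 0, 0, 0, 0]
rho = [0, 0, 1, 0, 0, 0]
print(mul_f729(sigma, sigma))               # sigma^2 = -1, i.e. [2, 0, 0, 0, 0, 0]
print(mul_f729(mul_f729(rho, rho), rho))    # rho^3 = rho + 1, i.e. [1, 0, 1, 0, 0, 0]
```

For a real design, the integer operations modulo 3 would be replaced by additions and multiplications over $\mathbb{F}_{3^m}$; the structure of the reduction stays the same.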
Barreto et al. defined the $\eta_T$ pairing as $\eta_T(P, Q) = f_{T,P}(\psi(Q))$, for some $T \in \mathbb{Z}$ [4]. This formula does not always give a non-degenerate, bilinear pairing. However, Barreto et al. described some cases where $\eta_T(P, Q)^W$ is a non-degenerate and bilinear map (a final exponentiation is therefore required for pairing-based cryptosystems). In such cases, this approach reduces the number of iterations by half (Algorithm 1). In characteristic three, the relationship between the $\eta_T$ pairing and the modified Tate pairing is given by:
where T = −b3
The modified Tate pairing can be computed as follows:
This method is more efficient than the one proposed by Barreto et al. in [4]. $\eta_T(P, Q)$ can be calculated according to Algorithm 1. As mentioned in Section 1, this scheme involves two cube root extractions at each iteration.
Algorithm 1
Computation of η T pairing in characteristic three [4] . We propose here a modified η T pairing algorithm in characteristic three which computes R 0 =R
. This trick allows one to get rid of cube roots and our algorithm returns η T (P, Q)
. A proof of correctness of this new scheme is provided in an extended version of this paper [7] . Let us describe now how to implement the original η T (P, Q) pairing with our algorithm. Recall that tripling a point requires only four cubing operations in characteristic three for supersingular elliptic curves (see for instance [15] 
. Therefore, we suggest computing $3^{\frac{m-1}{2}} P$ by means of $2(m-1)$ cubings and taking advantage of the bilinearity of $\eta_T(P, Q)^W$:
Note that cubing over F 3 m is efficiently performed in hardware (Section 3.2). A postprocessing step involving a 3 m -th root is further required. However, this operation is carried out by means of six additions (or subtractions) and a negation over
to the W -th power is based on the following observation:
This operation requires 11 multiplications and a single inversion over $\mathbb{F}_{3^{6m}}$, as well as additions over $\mathbb{F}_{3^m}$.
Algorithm 2 Proposed computation of $\eta_T(P, Q)$
. The algorithm requires $R_0$ and $R_1 \in \mathbb{F}_{3^{6m}}$, as well as $r_0 \in \mathbb{F}_{3^m}$ and $d \in \mathbb{F}_3$ for intermediate computations.
Hardware Implementation
This section describes the hardware implementation of Algorithm 2 for the field
. This choice of parameters allows us to easily compare our work against the many pairing accelerators for $m = 97$ described in the open literature. A first approach consists in designing an architecture able to compute both pairing and final exponentiation. However, it does not allow us to take advantage of the constant coefficients of $R_1$ (see Algorithms 1 and 2) to optimize the multiplication over $\mathbb{F}_{3^{6m}}$. Therefore, we suggest designing a pairing accelerator evaluating $\eta_T(P, Q)$
and a coprocessor responsible for final exponentiation working in parallel. In this paper, we will only focus on the computation of the modified $\eta_T$ pairing. Algorithm 2 and final exponentiation require respectively $(m - 1)/2 + 1 = 49$ and 11 multiplications over $\mathbb{F}_{3^{6m}}$. The inversion over $\mathbb{F}_{3^{6m}}$ can be replaced by a few multiplications and additions over $\mathbb{F}_{3^m}$ and a single inversion over $\mathbb{F}_{3^m}$ [17]. Consequently, the final exponentiation requires fewer operations (and thus less hardware) than the computation of the $\eta_T$ pairing.
In order to compare our architecture against software implementations, we chose a design board whose price is comparable to that of an entry-level desktop computer. We selected a DE2 development and education board [2], which costs $495 and hosts an Altera Cyclone II EP2C35F672C6 FPGA. Note that Altera provides free simulation and design tools for the Cyclone II family. The smallest unit of logic in a Cyclone II is called a Logic Element (LE). Each LE includes a 4-input Look-Up Table (LUT), carry logic, and a programmable register. A Cyclone II EP2C35F672C6 device contains 33216 LEs. Readers who are not familiar with Cyclone II devices should refer to [1] for further details. Since we leave the study of final exponentiation for further work, our pairing accelerator should not utilize all resources of our target FPGA. Thus, we impose a size constraint: our design must require less than 50% of the available configurable logic.
Addition and Subtraction over $\mathbb{F}_{3^m}$
Since they are performed component-wise, addition and subtraction over $\mathbb{F}_{3^m}$ are rather straightforward operations. Each element $a_i$ of $\mathbb{F}_3$ is encoded by two bits.
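The component-wise nature of these operations can be sketched in a few lines of Python. Elements of $\mathbb{F}_{3^m}$ are modelled here as lists of $m$ coefficients in $\{0, 1, 2\}$; this list representation and the function names are assumptions of the sketch, which does not model the two-bit hardware encoding of each coefficient.

```python
def add_f3m(a, b):
    """Component-wise addition of two coefficient lists over F_3."""
    return [(x + y) % 3 for x, y in zip(a, b)]


def sub_f3m(a, b):
    """Component-wise subtraction of two coefficient lists over F_3."""
    return [(x - y) % 3 for x, y in zip(a, b)]


def neg_f3m(a):
    """Component-wise negation over F_3."""
    return [(-x) % 3 for x in a]


a = [0, 1, 2, 1]
b = [2, 2, 1, 0]
print(add_f3m(a, b))  # [2, 0, 0, 1]
```

No carry propagates between coefficients, which is why each coefficient can be handled by an independent small circuit.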
Cubing over $\mathbb{F}_{3^m}$ and $\mathbb{F}_{3^{6m}}$
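Cubing is also a pretty simple arithmetic operation. Since $\mathbb{F}_{3^{6m}}$ is constructed as an extension field of $\mathbb{F}_{3^m}$, the computation of $R_0^3$ involved in Algorithm 2 is replaced by six cubings, six additions (or subtractions), and a negation over $\mathbb{F}_{3^m}$. Indeed, since $\sigma^3 = -\sigma$ and $\rho^3 = \rho + b$, cubing an element of $\mathbb{F}_{3^{6m}}$ reduces to cubing its six coefficients over $\mathbb{F}_{3^m}$ and recombining them. We have:
$$a(x)^3 \bmod f(x) = \sum_{i=0}^{m-1} a_i x^{3i} \bmod f(x),$$
where $f(x)$ is a degree-$m$ irreducible polynomial over $\mathbb{F}_3$. A Pari program provides us with a closed formula for cubing over $\mathbb{F}_{3^m}$: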
The most complex operation involved in cubing is the addition of three elements of $\mathbb{F}_3$. Therefore, the critical path includes only two LUTs. Our pairing accelerator embeds a single cubing unit (Figure 1b) which computes either $a(x)^3$ or $(-a(x))^3$ according to a control bit. In order to guarantee a short critical path, the operator includes two pipeline stages. It is worth noticing that the only degree-97 irreducible trinomial over $\mathbb{F}_3$ allowing a simple cube root extraction [3] has a more complex closed formula for cubing. Thus, Algorithm 2 offers additional flexibility to select parameters leading to the smallest hardware operators.
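The reason cubing is cheap in characteristic three is that it is linear over $\mathbb{F}_3$: the coefficient $a_i$ simply moves from $x^i$ to $x^{3i}$ before reduction modulo $f(x)$. The Python sketch below checks this against two generic multiplications. It uses the toy polynomial $f(x) = x^3 - x - 1$ (so $m = 3$), which is an assumption made for illustration and not the polynomial used in the paper; all function names are hypothetical.

```python
from itertools import product

P = 3
F = [2, 2, 0, 1]  # toy f(x) = x^3 + 2x + 2 = x^3 - x - 1, lowest degree first


def poly_mod(a, f=F):
    """Reduce polynomial a modulo the monic polynomial f, coefficients in F_3."""
    a = [c % P for c in a]
    m = len(f) - 1
    for d in range(len(a) - 1, m - 1, -1):
        c = a[d]
        if c:
            for i in range(m + 1):
                a[d - m + i] = (a[d - m + i] - c * f[i]) % P
    return a[:m] + [0] * (m - len(a))


def mul(a, b, f=F):
    """Schoolbook multiplication followed by reduction modulo f."""
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % P
    return poly_mod(prod, f)


def cube(a, f=F):
    """Cubing via the Frobenius map: move a_i to position 3i, then reduce."""
    spread = [0] * (3 * (len(a) - 1) + 1)
    for i, ai in enumerate(a):
        spread[3 * i] = ai % P  # a_i^3 = a_i for a_i in F_3
    return poly_mod(spread, f)


# Exhaustive check over the 27 elements of the toy field.
ok = all(cube(list(a)) == mul(mul(list(a), list(a)), list(a))
         for a in product(range(P), repeat=3))
print(ok)  # True
```

Because the reduction of each $x^{3i}$ is fixed once $f(x)$ is chosen, every output coefficient becomes a fixed sum of a few input coefficients, which matches the closed-formula approach described above.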
Multiplication over $\mathbb{F}_{3^m}$
We designed a Most Significant Element (MSE) first multiplier over $\mathbb{F}_{3^m}$ based on a paper by Song and Parhi [24] to compute $a(x)b(x) \bmod f(x)$. At step $i$ we compute a degree-$(m + D - 2)$ polynomial $t(x)$ which is the sum of $D$ partial products:
$$t(x) = \sum_{j=0}^{D-1} a_{Di+j} x^j b(x).$$
A degree-$(m + D - 1)$ polynomial $s(x)$, updated according to the celebrated Horner's rule, accumulates the partial products:
$$s(x) \leftarrow t(x) + x^D \cdot (s(x) \bmod f(x)).$$
Thus, after $\lceil m/D \rceil$ steps, this algorithm returns a degree-$(m + D - 1)$ polynomial $s(x)$, which is congruent with $a(x)b(x)$ modulo $f(x)$. The circuit described by Song and Parhi requires dedicated hardware to compute $p(x) = s(x) \bmod f(x)$ [24]. We suggest achieving this final modulo $f(x)$ reduction by performing an additional iteration with $a_{-j} = 0$, $1 \le j \le D$. Since $t(x)$ is now equal to zero, we have:
$$s(x) \leftarrow x^D \cdot (s(x) \bmod f(x)).$$
Therefore, it suffices to consider the $m$ most significant coefficients of $s(x)$ to get the result: $p(x) = s(x)/x^D$. Algorithm 3 summarizes this multiplication scheme. Synthesis results indicate that for $D = 3$ and $D = 4$, such a multiplier requires respectively 1170 and 1560 LEs. According to our size constraint, up to ten multipliers can be included in our pairing accelerator.
Algorithm 3 MSE multiplication over $\mathbb{F}_{3^m}$.
Input: A degree-$m$ monic polynomial $f(x) = x^m + f_{m-1}x^{m-1} + \ldots + f_1 x + f_0$ and two degree-$(m - 1)$ polynomials $a(x)$ and $b(x)$. We assume that $a_{-j} = 0$, $1 \le j \le D$. The algorithm requires a degree-$(m + D - 1)$ polynomial $s(x)$ as well as a degree-$(m + D - 2)$ polynomial $t(x)$ for intermediate computations.
Output: $p(x) = a(x)b(x) \bmod f(x)$
1: $s(x) \leftarrow 0$;
2: for $i$ in $\lceil m/D \rceil - 1$ downto $-1$ do
3: $t(x) \leftarrow \sum_{j=0}^{D-1} a_{Di+j} x^j b(x)$;
4: $s(x) \leftarrow t(x) + x^D \cdot (s(x) \bmod f(x))$;
5: end for
6: $p(x) \leftarrow s(x)/x^D$;
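Algorithm 3 can be modelled in software to check the extra iteration that replaces the final reduction. The following Python sketch is a behavioural model only, not a description of the hardware. It assumes polynomials stored as coefficient lists (lowest degree first), a toy modulus $f(x) = x^3 - x - 1$ for the self-check, and the loop bound $\lceil m/D \rceil - 1$; the function names are hypothetical.

```python
from itertools import product, zip_longest

P = 3


def poly_mod(a, f):
    """Reduce polynomial a modulo the monic polynomial f, coefficients in F_3."""
    a = [c % P for c in a]
    m = len(f) - 1
    for d in range(len(a) - 1, m - 1, -1):
        c = a[d]
        if c:
            for i in range(m + 1):
                a[d - m + i] = (a[d - m + i] - c * f[i]) % P
    return a[:m] + [0] * (m - len(a))


def mse_multiply(a, b, f, D):
    """Behavioural model of Algorithm 3: D coefficients of a per iteration."""
    m = len(f) - 1
    n = -(-m // D)                        # ceil(m / D)
    a = list(a) + [0] * (n * D - len(a))  # zero-pad the high-order coefficients
    b = list(b) + [0] * (m - len(b))
    s = [0]
    for i in range(n - 1, -2, -1):        # i = n-1 downto -1
        # Step 3: t(x) = sum_j a_{Di+j} x^j b(x), with a_k = 0 for k < 0.
        t = [0] * (m + D - 1)
        for j in range(D):
            k = D * i + j
            coeff = a[k] if k >= 0 else 0
            for e, be in enumerate(b):
                t[j + e] = (t[j + e] + coeff * be) % P
        # Step 4: s(x) = t(x) + x^D * (s(x) mod f(x)).
        shifted = [0] * D + poly_mod(s, f)
        s = [(u + v) % P for u, v in zip_longest(shifted, t, fillvalue=0)]
    # Step 6: p(x) = s(x) / x^D (the D low-order coefficients are zero here).
    return s[D:D + m]


def reference_multiply(a, b, f):
    """Schoolbook product followed by reduction, used as a reference."""
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % P
    return poly_mod(prod, f)


F = [2, 2, 0, 1]  # toy f(x) = x^3 - x - 1 (assumption for this example)
ok = all(mse_multiply(list(a), list(b), F, D) == reference_multiply(list(a), list(b), F)
         for a in product(range(P), repeat=3)
         for b in product(range(P), repeat=3)
         for D in (1, 2, 3, 4))
print(ok)  # True
```

The last pass through the loop (with $i = -1$) contributes $t(x) = 0$, so it only performs the pending reduction and shift; this is what removes the need for a dedicated reduction circuit.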
Multiplication over $\mathbb{F}_{3^{6m}}$
The cost of Algorithm 2 is dominated by the multiplication of $R_0$ by $R_1$ over $\mathbb{F}_{3^{6m}}$. By applying Karatsuba-Ofman's algorithm (see for instance [25]) and taking advantage of the constant coefficients of $R_1$, the product $R_0 R_1$ could be computed in parallel by means of 13 multiplications and 50 additions (or subtractions) over $\mathbb{F}_{3^m}$ [6]. Two further multiplications are needed to compute $y_p y_q$ as well as $r_0^2$ (a straightforward modification of the scheduling of Algorithm 2 allows us to compute $r_0^2$, $y_p y_q$, and $R_0 R_1$ in parallel). However, according to our size constraints, it is impossible to implement 15 multipliers on our target FPGA. Furthermore, our processor embeds only three adders over $\mathbb{F}_{3^m}$ and scheduling 50 additions could be a complex task. We propose here an algorithm which offers a better tradeoff between the number of additions and multiplications.
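The tradeoff between multiplications and additions mentioned above is easiest to see on the smallest case of Karatsuba-Ofman's method: a product in the quadratic extension defined by $\sigma^2 = -1$. The Python sketch below compares the schoolbook product (four base-field multiplications) with the Karatsuba form (three multiplications, more additions). As a simplifying assumption, the base field is taken to be $\mathbb{F}_3$ (integers modulo 3) rather than $\mathbb{F}_{3^m}$, and the function names are hypothetical.

```python
from itertools import product

P = 3


def mul_schoolbook(a, c):
    """(a0 + a1*sigma)(c0 + c1*sigma) with sigma^2 = -1: four multiplications."""
    a0, a1 = a
    c0, c1 = c
    return ((a0 * c0 - a1 * c1) % P, (a0 * c1 + a1 * c0) % P)


def mul_karatsuba(a, c):
    """Same product with three multiplications and five additions/subtractions."""
    a0, a1 = a
    c0, c1 = c
    low = a0 * c0
    high = a1 * c1
    mid = (a0 + a1) * (c0 + c1)  # equals low + high + (a0*c1 + a1*c0)
    return ((low - high) % P, (mid - low - high) % P)


# Exhaustive comparison over all pairs of elements of the 9-element field.
elements = list(product(range(P), repeat=2))
ok = all(mul_schoolbook(a, c) == mul_karatsuba(a, c)
         for a in elements for c in elements)
print(ok)  # True
```

Each multiplication saved this way costs extra additions such as $a_0 + a_1$, which is why the number of available adders matters when choosing how far to push the technique.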
Let $A = a_0 + a_1\sigma + a_2\rho + a_3\sigma\rho + a_4\rho^2 + a_5\sigma\rho^2$ and $C = c_0 + c_1\sigma + c_2\rho + c_3\sigma\rho + c_4\rho^2 + c_5\sigma\rho^2$ be two elements of
Therefore, the computation of the $c^{(1)}_i$'s involves nine multiplications over $\mathbb{F}_{3^m}$, which can be carried out in parallel.
Algorithm 4 summarizes this multiplication scheme involving 17 multiplications and 29 additions (or subtractions) over $\mathbb{F}_{3^m}$. Since at most nine multiplications can be performed in parallel, our pairing accelerator hosts nine multipliers over $\mathbb{F}_{3^m}$ and the computation of $R_0 R_1$ involves two multiplication cycles. A careful scheduling makes it possible to share operands between up to three operators, thus saving hardware resources (Table 1): during the first multiplication cycle, $M_0$, $M_1$, and $M_2$ respectively compute $a_0 r_0$, $a_2 r_0$, and $a_4 r_0$. The MSE multiplier described in Section 3.3 stores its first operand in a shift register, and its second operand in a standard register. Since a shift register is more complex (an operand is loaded in parallel, and then shifted), we load the common operand $r_0$ in this component. At the end of the first cycle, the three standard registers still contain $a_0$, $a_2$, and $a_4$. Therefore it suffices to load $r_0^2$ in the shift register before starting the second multiplication cycle. Figure 2a describes the operator we designed. This component is connected to the addition/subtraction operator described in Section 3.1 (Figure 2c). Note that the same architecture can also compute $a_1 r_0$, $a_3 r_0$, $a_5 r_0$, $a_1 y_p y_q$, $a_3 y_p y_q$, and $a_5 y_p y_q$. The five remaining multiplications involve a slightly more complex component (Figure 2b). Two shift registers are required to compute $r_0^2$ and $y_p y_q$ since there is no common operand. At the end of the first multiplication cycle, a dedicated subtracter computes $y_p y_q - r_0^2$ and stores the result in the shift registers. Three clock cycles are required to load $(a_0 + a_1)$, $(a_2 + a_3)$, and $(a_4 + a_5)$, which have been computed during the first multiplication cycle (see Algorithm 4). This approach could also be adopted to implement the multiplication of $R_0$ by $R_1$ in Algorithm 1.
Architecture of the Pairing Accelerator

Results and Comparisons
The proposed architecture was captured in the VHDL language and prototyped on an Altera Cyclone II EP2C35F672C6 device. Both synthesis and place-and-route steps were performed with Quartus II 6.0 Web Edition. VHDL simulations and experiments with a DE2 board were carried out to extensively test our design. The area and the calculation time depend on $D$, the number of coefficients of a multiplier processed at each clock cycle (Section 3.3). The two rightmost columns of Table 2 lead to an architecture which requires 56% of the configurable logic. Several researchers described implementations of pairing algorithms on Xilinx Virtex-II Pro FPGAs and reported the area in terms of slices. Each slice features two 4-input LUTs, carry logic, wide function multiplexers, and two storage elements. Let us assume that Xilinx design tools try to utilize both LUTs of a slice as often as possible (i.e. area optimization). Under this hypothesis, we consider that a slice is roughly equivalent to two LEs in our comparisons.
To the best of our knowledge, the FPGA-based pairing accelerator described by Shu et al. in [22] is the fastest to date. It computes the Tate pairing over $\mathbb{F}_{2^{239}}$ in 34 µs on a Virtex-II Pro 100 device (25287 slices). Ronan et al. designed an embedded processor to compute the $\eta_T$ pairing on genus 2 hyperelliptic curves [20]. This architecture requires 43986 slices on a Virtex-II Pro 125 device and computes a pairing in 749 µs. Kerins et al. proposed an implementation of the modified Duursma-Lee algorithm on a Xilinx Virtex-II Pro 125 FPGA [17]. Multiplication over $\mathbb{F}_{3^{6m}}$ is performed according to Karatsuba-Ofman's algorithm. However, since the authors do not take advantage of the constant terms of $R_1$, this operation requires 18 multiplications over $\mathbb{F}_{3^m}$. Thus, the hardware architecture consists of 18 multipliers and 6 cubing circuits over $\mathbb{F}_{3^{97}}$, along with "a suitable amount of simpler $\mathbb{F}_{3^m}$ arithmetic circuits for performing addition, subtraction, and negation" [17]. The authors claim that roughly 100% of available resources are required to implement their pairing accelerator. We can therefore estimate the cost at 55616 slices [22]. Remember that our target FPGA embeds 33216 LEs. Consequently, even if the final exponentiation unit we left for future work requires 50% of the device, our processor is smaller than the aforementioned solutions. Furthermore, our approach requires a less expensive FPGA technology for which free simulation and design tools are available.
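The area comparison above rests on the rough equivalence of one slice to two LEs. The short Python sketch below applies that conversion to the slice counts quoted in this section and compares them with the 33216 LEs of the target device. The dictionary layout and names are assumptions of the sketch, and the factor of two is the approximation stated in the text, not a measured value.

```python
LES_PER_SLICE = 2       # rough equivalence assumed in the comparison
CYCLONE_II_LES = 33216  # LEs available on the EP2C35F672C6 device

# Slice counts quoted in the text for the Virtex-II Pro designs.
slices = {
    "Shu et al. [22]": 25287,
    "Ronan et al. [20]": 43986,
    "Kerins et al. [17]": 55616,
}

# Convert each slice count to an approximate LE count.
equivalent_les = {name: count * LES_PER_SLICE for name, count in slices.items()}

for name, les in equivalent_les.items():
    ratio = les / CYCLONE_II_LES
    print(f"{name}: about {les} LEs, {ratio:.2f}x the whole Cyclone II device")
```

Under this approximation, each of the three designs exceeds the full capacity of the Cyclone II device, which is the basis for the size claim made above.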
Grabher and Page designed a coprocessor dealing with $\mathbb{F}_{3^m}$ arithmetic, which is controlled by a general purpose processor [14]. Their hardware accelerator embeds a single multiplier over $\mathbb{F}_{3^m}$. Our architecture requires roughly twice as many LEs, while performing up to nine multiplications in parallel.
Several researchers studied the software implementation of pairings on smartcards or mobile phones (see for instance [16] and [21]). For comparison purposes, they often provide the reader with timings on desktop computers. Table 3 summarizes such results, which indicate that our FPGA architecture achieves a speedup of 100.
Conclusions
We have proposed a modified $\eta_T$ pairing algorithm on supersingular elliptic curves over $\mathbb{F}_{3^m}$ which does not need any cube root. We have then described a pairing accelerator based on a low-cost platform hosting an Altera Cyclone II FPGA. Since VHDL simulation and FPGA configuration are performed with free design tools, the price of our system is comparable to that of an entry-level desktop computer. Our results demonstrate a one hundred-fold improvement over software implementations, and a ten-fold improvement over the best known FPGA implementation in characteristic three. We achieve the same calculation time as the fastest published accelerator in characteristic two, while requiring fewer hardware resources. Further work will include the design of a small processing unit responsible for final exponentiation.
