Abstract: Since their introduction in constructive cryptographic applications, pairings over (hyper)elliptic curves have been at the heart of an ever increasing number of protocols. With software implementations being rather slow, the study of hardware architectures became an active research area. In this paper, we discuss several algorithms to compute the $\eta_T$ pairing in characteristic three and suggest further improvements. These algorithms involve addition, multiplication, cubing, inversion, and sometimes cube root extraction over $\mathbb{F}_{3^m}$. We propose a hardware accelerator based on a unified arithmetic operator able to perform the operations required by a given algorithm. We describe the implementation of a compact coprocessor for the field $\mathbb{F}_{3^{97}}$ given by $\mathbb{F}_3[x]/(x^{97} + x^{12} + 2)$, which compares favorably with other solutions described in the open literature.

INTRODUCTION
In 2001, Boneh et al. [1] proposed the BLS scheme, a remarkable short signature scheme whose principle is the following. They consider an additive group $G_1 = \langle P \rangle$ of prime order $q$ and a map-to-point hash function $H : \{0,1\}^* \to G_1$. The secret key is an element $x$ of $\{1, 2, \ldots, q-1\}$ and the public key of the signer is $xP \in G_1$. Given a message $m \in \{0,1\}^*$, the signature is $xH(m)$. Verification relies on a map called a bilinear pairing, which we now define. Let $G_1 = \langle P \rangle$ be an additive group and $G_2$ a multiplicative group with identity 1. We assume that the discrete logarithm problem is hard in both $G_1$ and $G_2$. A bilinear pairing on $(G_1, G_2)$ is a map $e : G_1 \times G_1 \to G_2$ that satisfies the following conditions:
1. Bilinearity. For all $Q, R, S \in G_1$, $e(Q + R, S) = e(Q, S)\,e(R, S)$ and $e(Q, R + S) = e(Q, R)\,e(Q, S)$.
2. Nondegeneracy. $e(P, P) \neq 1$.
3. Computability. $e$ can be efficiently computed.
Modifications of the Weil and Tate pairings provide such maps.
The verification in the BLS scheme is done by checking whether the values $e(P, xH(m))$ and $e(xP, H(m))$ coincide. Indeed, if $x' \in \{1, 2, \ldots, q-1\}$ satisfies $e(xP, H(m)) = e(P, x'H(m))$, then we obtain $e(P, H(m))^{x} = e(P, H(m))^{x'}$ by the bilinearity of the pairing. From the nondegeneracy of the pairing, we know that this equality implies $x = x'$. The total cost is one hashing operation, one modular exponentiation, and two pairing computations, and the signature is half as long as a DSA signature for a similar level of security.
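The BLS signing and verification steps can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not a secure construction: it replaces the elliptic-curve group by the integers modulo a small prime `Q` (so that `P = 1`), and uses the map `e(a, b) = G^(a*b) mod P_MOD`, which is bilinear and nondegenerate but offers no security because discrete logarithms in this $G_1$ are trivial. The constants `Q`, `P_MOD`, `G` and all function names are hypothetical.

```python
import hashlib

# Toy parameters (assumptions for illustration only, NOT secure):
Q = 1019        # prime order of both toy groups
P_MOD = 2039    # prime with P_MOD = 2*Q + 1
G = 4           # element of order Q in the multiplicative group mod P_MOD


def pairing(a: int, b: int) -> int:
    """Toy bilinear map e: Z_Q x Z_Q -> <G>, e(a, b) = G^(a*b)."""
    return pow(G, (a * b) % Q, P_MOD)


def hash_to_group(msg: bytes) -> int:
    """Stand-in for the map-to-point hash H; returns a nonzero element of Z_Q."""
    h = int.from_bytes(hashlib.sha256(msg).digest(), "big")
    return h % (Q - 1) + 1


def sign(secret_x: int, msg: bytes) -> int:
    """Signature x*H(m) in the additive group Z_Q."""
    return (secret_x * hash_to_group(msg)) % Q


def verify(public_xp: int, msg: bytes, sig: int) -> bool:
    """Check e(P, sig) == e(xP, H(m)) with generator P = 1."""
    return pairing(1, sig) == pairing(public_xp, hash_to_group(msg))


if __name__ == "__main__":
    x = 123                 # secret key
    xp = x % Q              # public key x*P with P = 1
    s = sign(x, b"hello")
    print(verify(xp, b"hello", s))            # valid signature
    print(verify(xp, b"hello", (s + 1) % Q))  # altered signature
```

The check in `verify` is exactly the comparison of the two pairing values described above; only the underlying groups are simplified.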
Pairings in Cryptology
Pairings were first introduced in cryptology by Menezes et al. [2] and Frey and Rück [3] for codebreaking purposes. Mitsunari et al. [4] and Sakai et al. [5] seem to be the first to have discovered their constructive properties. Since the foundational work of Joux [6], an already large and ever increasing number of pairing-based protocols has been found. Most of them are described in the survey by Dutta et al. [7]. As noticed in that survey, such protocols rely critically on efficient algorithms and implementations of pairing primitives.
According to [8], [9], when dealing with general curves providing common levels of security, the Tate pairing seems to be more efficient to compute than the Weil pairing, and we now describe it.
Let $E$ be a supersingular elliptic curve over $\mathbb{F}_{p^m}$, where $p$ is a prime and $m$ is a positive integer, and let $E(\mathbb{F}_{p^m})$ denote the group of its points. Let $\ell > 0$ be an integer relatively prime to $p$. The embedding degree (or security multiplier) is the least positive integer $k$ satisfying $p^{km} \equiv 1 \pmod{\ell}$. Let $E(\mathbb{F}_{p^m})[\ell]$ denote the $\ell$-torsion subgroup of $E(\mathbb{F}_{p^m})$, i.e., the set of elements $P$ of $E(\mathbb{F}_{p^m})$ that satisfy $[\ell]P = \mathcal{O}$, where $\mathcal{O}$ is the point at infinity of the elliptic curve. Let $P \in E(\mathbb{F}_{p^m})[\ell]$ and $Q \in E(\mathbb{F}_{p^{km}})[\ell]$, and let $f_{\ell,P}$ be a rational function on the curve with divisor $\ell(P) - \ell(\mathcal{O})$ (see [10] for an account of divisors). There exists a divisor $D_Q$ equivalent to $(Q) - (\mathcal{O})$, with a support disjoint from the support of $f_{\ell,P}$. Then, the Tate pairing of order $\ell$ is given by $e(P, Q) = f_{\ell,P}(D_Q)^{(p^{km}-1)/\ell}$, which can be reduced to $e(P, Q) = f_{\ell,P}(Q)^{(p^{km}-1)/\ell}$, where $f_{\ell,P}$ is evaluated on a point rather than on a divisor. Thanks to a distortion map $\psi : E(\mathbb{F}_{p^m})[\ell] \to E(\mathbb{F}_{p^{km}})[\ell]$ (the concept of a distortion map was introduced in [12]), one can define the modified Tate pairing $\hat{e}$ by $\hat{e}(P, Q) = e(P, \psi(Q))$ for all $P, Q \in E(\mathbb{F}_{p^m})[\ell]$.
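The definition of the embedding degree translates directly into a short search. The following is a minimal sketch; the function name, the bound `k_max`, and the sample moduli are assumptions made for illustration (in particular, $3^{97} + 3^{49} + 1$ is used here only as a convenient test modulus and is not taken from the text above).

```python
def embedding_degree(p: int, m: int, ell: int, k_max: int = 1000) -> int:
    """Least k >= 1 with p^(k*m) == 1 (mod ell), searched up to k_max.

    Assumes gcd(p, ell) == 1; otherwise no such k exists and an error is raised.
    """
    q = pow(p, m, ell)      # p^m reduced modulo ell
    acc = 1
    for k in range(1, k_max + 1):
        acc = (acc * q) % ell
        if acc == 1:
            return k
    raise ValueError("no embedding degree found up to k_max")


if __name__ == "__main__":
    # Small sanity check: the multiplicative order of 3 modulo 7.
    print(embedding_degree(3, 1, 7))
    # Larger illustrative modulus (an assumption, see the lead-in).
    print(embedding_degree(3, 97, 3**97 + 3**49 + 1))
```

Since Python integers have arbitrary precision, the search runs unchanged on moduli of cryptographic size.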
Miller [13], [14] proposed in 1986 the first algorithm for computing Weil and Tate pairings. Different ways of computing the Tate pairing can be found in [11], [15], [16], and [17]. In [18], Barreto et al. introduced the $\eta_T$ pairing, which extended and improved the Duursma-Lee techniques [16] and makes it possible to compute the Tate pairing efficiently. The $\eta_T$ pairing is presented in Section 2, where we recall its relation to the modified Tate pairing.
Implementation Challenges
With the software implementations of these successive algorithmic improvements being rather slow, there is a strong need for efficient hardware implementations. This is a critical issue for making pairings popular and commonly used in cryptography, in particular in view of a successful industrial transfer. The papers [19], [20], [21], [22], [23], [24], [25], [26], and [27] address that problem.
In this paper, we deal with the characteristic three case. Given a positive integer $m$ coprime to 6, we consider $E$, a supersingular elliptic curve over $\mathbb{F}_{3^m}$, defined by the equation $y^2 = x^3 - x + b$, with $b \in \{-1, 1\}$.
Following the discussion at the beginning of [18, Section 5], there is no loss of generality in considering this case, since these curves offer the same level of security for pairing applications as any supersingular elliptic curve over $\mathbb{F}_{3^m}$. The considered curve has an embedding degree of 6, which is the maximum value possible for supersingular elliptic curves and, hence, seems to be an attractive choice for pairing implementation.
Our Contribution
The algorithm given in [18] for computing the $\eta_T$ pairing halves the number of iterations used in the approach by Duursma and Lee [16] but has the drawback of using inverse Frobenius maps. In [25], Beuchat et al. proposed a modified $\eta_T$ pairing algorithm in characteristic three that does not require any inverse Frobenius map. Moreover, they designed a novel arithmetic operator implementing addition, cubing, and multiplication over $\mathbb{F}_{3^{97}}$, which performs the final exponentiation step in a fast and cheap way [26]. Then, they extended this approach in [27] to the computation of the reduced $\eta_T$ pairing (i.e., the combination of the $\eta_T$ pairing and the final exponentiation).
In this paper, we present a synthesis and an improvement of the results in [25], [26], and [27]. The outline of this paper is as follows: In Section 2, we define the $\eta_T$ pairing and its reduced form, we give different algorithms to compute them, and we provide exact cost evaluations for these algorithms. Section 3 is dedicated to the presentation of a reduced $\eta_T$ pairing coprocessor that is based on a unified arithmetic operator implementing the various required elementary operations over $\mathbb{F}_{3^m}$. We want to mention that all the material (i.e., algorithms and architectures) presented in this section can be easily adapted to work on any field $\mathbb{F}_p[x]/(f(x))$ for any prime $p$ and any polynomial $f$ irreducible over $\mathbb{F}_p$. We implemented our coprocessor on several Field-Programmable Gate Array (FPGA) families for the field $\mathbb{F}_{3^{97}}$ given by $\mathbb{F}_3[x]/(x^{97} + x^{12} + 2)$. We provide the reader with a comprehensive comparison against state-of-the-art $\eta_T$ pairing accelerators in Section 4 and conclude this paper in Section 5.
The appendices mentioned in the rest of the paper can be found in the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109.TC.2008.103.
COMPUTATION OF THE $\eta_T$ PAIRING IN CHARACTERISTIC THREE
Preliminary Definitions
We use here the definition of the $\eta_T$ pairing as introduced by Barreto et al. [18]. The interested reader will find in that paper all the details related to the mathematical construction of the pairing, which we deliberately omit here for clarity's sake. Let $E$ be the supersingular elliptic curve defined by the equation $E : y^2 = x^3 - x + b$,
where $b \in \{-1, 1\}$. Considering a positive integer $m$ coprime to 6, the number of rational points of $E$ over the finite field $\mathbb{F}_{3^m}$ is given by
In the following, we will refer to this additional step as final exponentiation.
One should also note that, in characteristic 3, we have the following relation between the reduced $\eta_T$ and modified Tate pairings:
with L = -b3 [16] to simplify the computation of $f_{n,P}$ in Miller's algorithm, we obtain
where
, is the rational function introduced by Duursma and Lee [16], defined over $E(\mathbb{F}_{3^{6m}})[\ell]$ and having divisor $(g_V) = 3(V) + ([-3]V) - 4(\mathcal{O})$. For all $(x, y) \in E(\mathbb{F}_{3^{6m}})[\ell]$, we have
The remaining part of this section presents and discusses various algorithms that can be used to effectively compute the reduced $\eta_T$ pairing. The next three sections focus on the computation of $\eta_T(P, Q)$ only, the details of the final exponentiation being given in Section 2.5. Finally, cost evaluations and comparisons are presented in Section 2.6.
Direct Approaches
Direct Algorithm
From the expression of $f_{T',P'}$, noting $\tilde{Q} = \psi(Q)$, we can write
Â Ã P 0 ðQÞ Á l P 0 ðQÞ: 
(6C, 6A) 11.
x P x 9 P À b; y P Ày
A few remarks concerning this algorithm:
• The multiplication by $-b$ on line 1 is for free.
Indeed 
Simplification Using Cube Roots
Cubing the intermediate result $R \in \mathbb{F}^*_{3^{6m}}$ at each iteration of Algorithm 1 is quite expensive. However, one can use the fact that, due to the bilinearity of the reduced $\eta_T$ pairing,
to compute instead
withQ ¼ ð½3 
This naturally gives another iterative method to compute $\eta_T(P, Q)$, presented in Algorithm 2. Here, the cubings over $\mathbb{F}_{3^{6m}}$ are traded for cube roots (denoted $\sqrt[3]{\cdot}$) over $\mathbb{F}_{3^m}$, which can be efficiently computed by means of a specific operator (see Section 3.5 for further details).
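To make the notion of a cube root over $\mathbb{F}_{3^m}$ concrete, here is a minimal software sketch for the field $\mathbb{F}_3[x]/(x^{97} + x^{12} + 2)$. It does not model the dedicated hardware operator of Section 3.5: it simply uses the fact that cubing (the Frobenius map) has order $m$ on $\mathbb{F}_{3^m}$, so that $m - 1$ successive cubings invert a single cubing. Function names and the coefficient-tuple representation (lowest degree first) are assumptions.

```python
M = 97
F_LOW = {0: 2, 12: 1}   # f(x) = x^97 + x^12 + 2 (monic leading term implied)


def reduce_poly(c):
    """Reduce a coefficient list (lowest degree first) modulo 3 and f(x)."""
    c = [v % 3 for v in c] + [0] * max(0, M - len(c))
    for i in range(len(c) - 1, M - 1, -1):
        t, c[i] = c[i], 0
        if t:
            for e, fe in F_LOW.items():       # x^M = -(x^12 + 2)
                c[i - M + e] = (c[i - M + e] - t * fe) % 3
    return tuple(c[:M])


def cube(a):
    """Frobenius map: (sum a_i x^i)^3 = sum a_i x^(3i) in characteristic 3."""
    c = [0] * (3 * M - 2)
    for i, ai in enumerate(a):
        c[3 * i] = ai
    return reduce_poly(c)


def cube_root(a):
    """Inverse Frobenius, computed naively as M - 1 successive cubings."""
    for _ in range(M - 1):
        a = cube(a)
    return a


if __name__ == "__main__":
    a = tuple((i * i + 1) % 3 for i in range(M))
    print(cube(cube_root(a)) == a)
```

This naive approach costs $m - 1$ cubings per cube root, which is precisely why a specific operator is attractive in hardware.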
Algorithm 2 Simplified algorithm for computing the $\eta_T$ pairing, with cube roots.
x P x 3 P ; y P y
Tabulating the Cube Roots
Even if cube roots can be computed with only a slight hardware overhead, it is sometimes advisable to restrict the hardware complexity of the arithmetic unit in order to achieve higher clock frequencies. The previous algorithm can easily be adapted to cube-root-free coprocessors by simply noticing that, as $x_Q$ and $y_Q \in \mathbb{F}_{3^m}$, x This idea, originally suggested by Barreto et al. [18], was for instance applied by Ronan et al. [23] in the case $m \equiv 1 \pmod{12}$, although they curiously do not compute the actual $\eta_T$ pairing, but the value
Reversed-Loop Approaches
In [18], Barreto et al. suggest reversing the loop to compute the $\eta_T$ pairing. To that purpose, they introduce a new index $j = \frac{m-1}{2} - i$ for the loop. Taking $\tilde{Q} = \psi(Q)$, we find
Reversed-Loop Algorithm
Directly injecting the expression of ½3 mÀ1 2 Àj P 0 ¼ ðx
Þ into the formulas, we obtain
Following this expression, a third iterative scheme for computing the $\eta_T$ pairing can be directly devised, as detailed in Algorithm 3. In the case $m \equiv 1 \pmod{12}$, this is exactly the algorithm described by Barreto et al. [18].
Algorithm 3 Reversed-loop algorithm for computing the $\eta_T$ pairing, with cube roots.
Note that, given the expression of its operands, the multiplication on line 4 is computed by means of only six multiplications, one cubing, and six additions over $\mathbb{F}_{3^m}$, as described in Appendix F.4.
Like Algorithm 2, Algorithm 3 requires the computation of cube roots. A similar technique of precomputation and tabulation of the cube roots, based on successive cubings of $x_P$ and $y_P$, can also be used, although we will not detail it here.
Eliminating the Cube Roots
The apparent duality between Algorithms 2 and 3 can be exploited to find another cube-root-free algorithm, still based on the reversed loop but similar to Algorithm 1.
For that purpose, we once again compute the reduced $\eta_T$ pairing of $P$ and $Q$ as
2 QÞ, the reversed loop becomes
$\cdot\, h_{\frac{m-1}{2},P'}(\tilde{Q})$, with the rational function $h_{j,P'}(\tilde{Q})$ defined as
We then compute the explicit expressions of $l_P(\tilde{Q})$ and $h_{j,P'}(\tilde{Q})$:
Algorithm 4 is a direct implementation of the previous computation of $\eta_T(P, Q)$. Similarly to Algorithm 1, it uses cubings over $\mathbb{F}_{3^{6m}}$ in order to avoid the cube roots of Algorithm 3. In the case $m \equiv 1 \pmod{12}$, this algorithm corresponds to the $\eta_T$ pairing computation described by Beuchat et al. [25].
Algorithm 4 Cube-root-free reversed-loop algorithm for computing the $\eta_T$ pairing.
Loop Unrolling
Granger et al. [28] proposed a loop unrolling technique for the Duursma-Lee algorithm. They exploit the sparsity of $g_V$ in order to reduce the number of multiplications over $\mathbb{F}_{3^m}$, exactly in the same way as we reduced the first two iterations of Algorithms 1 and 2.
By noting that $h_{j,P'}(\tilde{Q})^3$ is as sparse as $h_{j,P'}(\tilde{Q})$ (for details, see Appendix E.2), we can apply the same approach to Algorithm 4.
In two successive iterations $2j' - 1$ and $2j'$ of the loop, for
The values of $h_{2j'-1,P'}(\tilde{Q})$ and $h_{2j',P'}(\tilde{Q})$, computed at iterations $2j' - 1$ and $2j'$, respectively, are both of the form is even.
It is to be noted that one could also straightforwardly apply a similar loop unrolling technique to Algorithm 1. However, we will not detail this point any further, for it is rigorously identical to the previous case.
Final Exponentiation
As already stated in Section 2.1, the $\eta_T$ pairing has to be reduced in order to be uniquely defined and not only up to $\ell$th powers. This reduction is achieved by means of a final exponentiation, in which $\eta_T(P, Q)$ is raised to the $M$th power, with
For this particular exponentiation, we use the scheme presented by Shirase et al. [29] . 
we obtain the following expression for $U^{3^{3m}-1}$:
This computation is directly implemented in 
One can then remark that
where [28] . This is a crucial point here, since arithmetic on the torus T 2 ðIF 3 3m Þ is much simpler than arithmetic on IF Input: is obtained by a second call to Algorithm 7. The value to be computed is then
, this is just a dummy operation, but it is an actual inversion when $b = 1$. However, as $W \in T_2(\mathbb{F}_{3^{3m}})$, writing
Inversion over $T_2(\mathbb{F}_{3^{3m}})$ is therefore completely free, as it suffices to propagate the sign corrections in the final product $V \cdot W'$, implemented as a full multiplication over $\mathbb{F}^*_{3^{6m}}$.
Algorithm 8 Final exponentiation of the reduced $\eta_T$ pairing [29].
Overall Cost Evaluations and Comparisons
The costs of all the previously detailed algorithms are summarized in Table 1, in terms of additions (or subtractions), multiplications, cubings, cube roots, and inversions over $\mathbb{F}_{3^m}$. From this table, we can see that the additional cost for cube-root-free algorithms is approximately $4m$ extra cubings and $7m/2$ extra additions, when compared to the equivalent algorithms with cube roots. The choice of one type of algorithm over the other will therefore depend on the practicality of computing cube roots in the given finite field $\mathbb{F}_{3^m}$ (see the discussion in Section 3.5).
This table also shows a slight superiority of reversed-loop algorithms over direct-loop approaches. This is the reason why we chose to apply the loop unrolling technique to Algorithm 4.
The advantage of such loop unrolling also becomes clearer when looking at Table 1. From Algorithm 4 to Algorithm 5, we trade approximately $27m/4$ additions and $3m/4$ multiplications for $m/2$ cubings over $\mathbb{F}_{3^m}$.
The costs of these algorithms for $m = 97$, on which we focus more closely in this paper, are given in Table 2. As detailed in Section 3.2, we can compute the inversion over $\mathbb{F}_{3^{97}}$ according to Fermat's little theorem in nine multiplications and 96 cubings, which allows us to express these costs in terms of additions, multiplications, cubings, and cube roots only. The total number of operations for the complete computation of the reduced $\eta_T$ pairing, using Algorithm 5 for the $\eta_T$ pairing and Algorithm 8 for the final exponentiation, is also given.
A COPROCESSOR FOR ARITHMETIC OVER $\mathbb{F}_{3^m}$
The $\eta_T$ pairing calculation in characteristic three requires addition, multiplication, cubing, inversion, and sometimes 
Several researchers reported implementations of the Tate and $\eta_T$ pairings on a supersingular curve defined over the field $\mathbb{F}_{3^{97}}$. Therefore, we discuss the implementation of Algorithm 5 for the field $\mathbb{F}_3[x]/(x^{97} + x^{12} + 2)$ and the curve $y^2 = x^3 - x + 1$ (i.e., $b = 1$) on our coprocessor. It is nonetheless important to note that the architectures and algorithms presented here can be easily adapted to different parameters: for instance, a different irreducible polynomial $f(x)$, a different field extension degree $m$, or even a different characteristic $p$ (cubing and cube root extraction, being respectively the Frobenius and inverse Frobenius maps in characteristic three, are then replaced by raising to the $p$th power and $p$th root extraction).
Multiplication over $\mathbb{F}_{3^m}$
Three families of algorithms allow one to compute $d0(x) \cdot d1(x) \bmod f(x)$ (see, for instance, [30], [31], and [32] for an account of modular multiplication). In parallel-serial schemes, a single coefficient of the multiplier $d0(x)$ is processed at each step. This leads to small operators performing a multiplication in $m$ clock cycles. Parallel multipliers compute a degree-$(2m - 2)$ polynomial and carry out a final modular reduction. They achieve a higher throughput at the price of a larger circuit area. By processing $D$ coefficients of an operand at each clock cycle, array multipliers, introduced by Song and Parhi [33], offer a good trade-off between computation time and circuit area and are at the heart of several pairing coprocessors (see, for instance, [19], [20], [22], [23], [25], and [34]).
Depending on the order in which coefficients of d0ðxÞ are processed, array multipliers can be implemented according to two schemes: most significant element (MSE) first and least significant element (LSE) first. Algorithm 9 summarizes the MSE-first scheme proposed by Shu et al. [22] . Fig. 1a 
Multiplication over $\mathbb{F}_3 = \{0, 1, 2\}$ is then defined as follows:
and can be implemented by means of two 4-input Lookup Tables (LUTs). Since $d0_i$ multiplies all coefficients of $d1$, the fan-out of our array multiplier is equal to $2m$. However, a careful encoding of the elements of $\mathbb{F}_3$ can reduce the fan-out of the operator [35]. Since $2 \equiv -1 \pmod{3}$, we take advantage of the borrow-save system [36] in order to represent the elements of $\mathbb{F}_3 = \{0, 1, -1\}$: $d0_i$ is encoded by a positive bit $d0_i^+$ and a negative bit $d0_i^-$
and requires two 3-input LUTs: the first one depends on $d0_i^+$, and the second one on $d0_i^-$. Thus, the fan-out of the array multiplier is now equal to $m$. Since it is performed component-wise, addition over $\mathbb{F}_{3^m}$ is also a rather straightforward operation. If elements of $\mathbb{F}_3$ are represented by 2 bits, addition modulo 3 is, for instance, carried out by means of two 4-input LUTs.
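The behavior of an MSE-first array multiplier processing $D$ coefficients per step can be mimicked in software. The sketch below is a functional model only, under stated assumptions: it uses $D = 3$ and $f(x) = x^{97} + x^{12} + 2$, represents elements as coefficient tuples (lowest degree first), and does not model the LUT-level encoding discussed above. All function names are hypothetical; `schoolbook_multiply` is included only as a reference to check the result.

```python
from math import ceil

M, D = 97, 3
F_LOW = {0: 2, 12: 1}   # f(x) = x^97 + x^12 + 2 (monic leading term implied)


def reduce_poly(c):
    """Reduce a coefficient list (lowest degree first) modulo 3 and f(x)."""
    c = [v % 3 for v in c] + [0] * max(0, M - len(c))
    for i in range(len(c) - 1, M - 1, -1):
        t, c[i] = c[i], 0
        if t:
            for e, fe in F_LOW.items():       # x^M = -(x^12 + 2)
                c[i - M + e] = (c[i - M + e] - t * fe) % 3
    return tuple(c[:M])


def schoolbook_multiply(a, b):
    """Reference product: full polynomial product, then one reduction."""
    c = [0] * (2 * M - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return reduce_poly(c)


def mse_first_multiply(d0, d1):
    """Process D coefficients of d0 per step, most significant chunk first.

    Each step computes p <- x^D * p + (sum_j d0[D*i+j] * x^j) * d1 mod f(x).
    Returns the product and the number of steps (one per modeled clock cycle).
    """
    steps = ceil(M / D)
    d0 = list(d0) + [0] * (steps * D - M)     # pad to a multiple of D
    p = (0,) * M
    for i in range(steps - 1, -1, -1):
        acc = [0] * D + list(p)               # x^D * p(x)
        for j in range(D):                    # D partial products
            coeff = d0[D * i + j]
            if coeff:
                for k, b in enumerate(d1):
                    acc[j + k] += coeff * b
        p = reduce_poly(acc)
    return p, steps


if __name__ == "__main__":
    a = tuple((i * i + 1) % 3 for i in range(M))
    b = tuple((2 * i + 1) % 3 for i in range(M))
    product, steps = mse_first_multiply(a, b)
    print(steps, product == schoolbook_multiply(a, b))
```

With $D = 3$ and $m = 97$, the model takes $\lceil 97/3 \rceil = 33$ steps per multiplication.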
Inversion over $\mathbb{F}_{3^m}$
The final exponentiation of the $\eta_T$ pairing involves a single inversion over $\mathbb{F}_{3^m}$. Instead of designing a specific operator based on the Extended Euclidean Algorithm (EEA), we suggest keeping the circuit area as small as possible by performing this inversion according to Fermat's little 
Starting with an element $d$ of $\mathbb{F}_{3^m}$, $d \neq 0$, we first raise it to the power of the base-3 repunit $(3^{m-1} - 1)/2$ to obtain $r$. This particular powering can be achieved using only $m - 2$ cubings over $\mathbb{F}_{3^m}$ and a few multiplications over $\mathbb{F}_{3^m}$ as detailed below. By cubing $r$ and then multiplying the result by $d$, we successively obtain
A final product gives us the result
Since v 6 ¼ 0 and
and this operation could be performed in a single clock cycle at the price of a modification of our MSE-first multiplier: adding an extra control bit and a multiplexer allows one to select the value of the coefficient $d0_{3i}$ between its normal value (the $D$ most significant coefficients of the multiplier) and the $D$ least significant coefficients of the multiplier. Indeed, as $v \in \mathbb{F}_3$, its coefficients $v_i$ are zero for all $i \neq 0$. Therefore, we only need $v_0$ to compute the final multiplication $u \cdot v = u \cdot v_0$. As our multiplier operates in a most-significant-coefficient-first fashion, instead of performing the full multiplication over $\mathbb{F}_{3^m}$, this multiplexer would allow us to bypass the whole shift register mechanism and compute the product $u \cdot v$ in a single iteration of the multiplier. Since we consider $m = 97$ for our implementation, this trick would allow us to save only $\lceil m/D \rceil - 1 = \lceil 97/3 \rceil - 1 = 32$ clock cycles at the price of a longer critical path and a larger control word. Thus, we do not include this modification in our coprocessor.
As already shown in [38] and [39], addition chains can prove to be perfectly suited to raising elements of $\mathbb{F}_{3^m}$ to particular powers, such as the radix-3 repunit $(3^{m-1} - 1)/2$ required by our inversion algorithm. In the following, we will restrict ourselves to Brauer-type addition chains, whose definition follows.
A Brauer-type addition chain $C$ of length $l$ is a sequence of $l$ integers $S = (j_1, \ldots, j_l)$ such that $0 \le j_i < i$ for all $1 \le i \le l$. We can then construct another sequence $(n_0, \ldots, n_l)$ satisfying
Moreover, we can see that we have, for n n 0 ,
Consequently, given a Brauer-type addition chain $C$ of length $l$ for $m - 1$, we can compute the required d
as shown in Algorithm 11. This algorithm simply ensures that, for each iteration i, we have z i ¼ d . . . ; j l Þ for m À 1, and the integer sequence ðn 0 ; . . . ; n l Þ associated with C. 
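The inversion scheme described in this section can be sketched in software as follows. This is a hedged functional model, not the coprocessor's microcode. It assumes the usual Brauer-chain recurrence $n_0 = 1$, $n_i = n_{i-1} + n_{j_i}$ and the update $z_i = z_{i-1}^{3^{n_{j_i}}} \cdot z_{j_i}$, which keeps $z_i = d^{(3^{n_i}-1)/2}$; the chain $(0, 0, 2, 3, 4, 5, 6)$ for $m - 1 = 96$, the function names, and the operation counter `OPS` are illustrative choices.

```python
M = 97
F_LOW = {0: 2, 12: 1}            # f(x) = x^97 + x^12 + 2 (monic term implied)
OPS = {"mul": 0, "cube": 0}      # operation counters, for cost checking only


def reduce_poly(c):
    """Reduce a coefficient list (lowest degree first) modulo 3 and f(x)."""
    c = [v % 3 for v in c] + [0] * max(0, M - len(c))
    for i in range(len(c) - 1, M - 1, -1):
        t, c[i] = c[i], 0
        if t:
            for e, fe in F_LOW.items():       # x^M = -(x^12 + 2)
                c[i - M + e] = (c[i - M + e] - t * fe) % 3
    return tuple(c[:M])


def mul(a, b):
    OPS["mul"] += 1
    c = [0] * (2 * M - 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                c[i + j] += ai * bj
    return reduce_poly(c)


def cube(a):
    OPS["cube"] += 1
    c = [0] * (3 * M - 2)
    for i, ai in enumerate(a):
        c[3 * i] = ai                          # a_i^3 = a_i in F_3
    return reduce_poly(c)


def repunit_power(d, chain):
    """Return d^((3^n_l - 1)/2) and n_l for the Brauer chain (j_1, ..., j_l)."""
    z, n = [d], [1]
    for i, j in enumerate(chain, start=1):
        t = z[i - 1]
        for _ in range(n[j]):                  # raise z_{i-1} to the power 3^{n_j}
            t = cube(t)
        z.append(mul(t, z[j]))
        n.append(n[i - 1] + n[j])
    return z[-1], n[-1]


def invert(d, chain):
    """Inverse of a nonzero d; also returns v = d^((3^m - 1)/2), which is +1 or -1."""
    r, n_last = repunit_power(d, chain)
    assert n_last == M - 1, "chain must end at m - 1"
    u = cube(r)                                # d^((3^m - 3)/2)
    v = mul(u, d)                              # d^((3^m - 1)/2)
    return mul(u, v), v                        # u * v = d^(3^m - 2) = d^(-1)


if __name__ == "__main__":
    d = tuple((i * i + 1) % 3 for i in range(M))
    inv, v = invert(d, (0, 0, 2, 3, 4, 5, 6))
    print(OPS, mul(d, inv)[:3], v[0])
```

With this chain, the model uses seven multiplications and 95 cubings for the repunit powering, plus one cubing and two multiplications to finish, which matches the nine multiplications and 96 cubings quoted for $m = 97$.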
where $\nu_i(x) \in \mathbb{F}_{3^{97}}$, $0 \le i \le 2$, and
Recall that our inversion algorithm involves successive cubings. Since storing intermediate results in memory would be too time consuming, our cubing unit should include a feedback mechanism to efficiently implement Algorithm 11. Furthermore, cubing over $\mathbb{F}_{3^{6m}}$ requires the computation of $-u$ .
• The feedback loop responsible for the accumulation of partial products must be deactivated while cubing. An array of $m$ AND gates performs this task and allows one to carry out the initialization step of the modular multiplication (instruction $p(x) \leftarrow 0$ in Algorithm 9).
• Multiplexers select the input of the multioperand adders between modulo $f(x)$ reduced partial products and the $\nu_i(x)$'s.
• The shift register of the multiplier and the PPGs allow for the control of cubing operations. If we store a control word in register R0 such that $d0_{3i} = d0_{3i+1} = d0_{3i+2} = -1$, the operator returns $-d1(x)^3$.
Addition over $\mathbb{F}_{3^m}$
The reduced $\eta_T$ pairing algorithms discussed in this paper involve additions, subtractions, and accumulations over $\mathbb{F}_{3^m}$. Fig. 1c describes an operator implementing these functionalities. Again, a closer look at the reduced $\eta_T$ pairing algorithms as well as at the algorithms for arithmetic over $\mathbb{F}_{3^{3m}}$ and $\mathbb{F}_{3^{6m}}$ indicates that there is almost no parallelism between additions and multiplications over $\mathbb{F}_{3^m}$. We suggest further modifying our array multiplier to include addition, subtraction, and accumulation (Fig. 3):
• An additional register is needed to store the second operand of an addition. Again, the shift register stores a control word to control additions. Assume for instance that we have to compute $-d2(x) + d1(x)$. We load $d2(x)$ and $d1(x)$ in registers R2 and R1, respectively, and define a control word stored in R0 so that $d0_{3i} = 1$, $d0_{3i+1} = 2$, and $d0_{3i+2} = 0$. We will thus compute $(d1(x) + 2 \cdot d2(x) + 0 \cdot d1(x)) \bmod f(x) = (d1(x) - d2(x)) \bmod f(x)$. Since the reduced $\eta_T$ pairing algorithm involves successive additions and cubings, each control word loaded in the shift register manages a sequence of operations. Note that
  - while performing a multiplication or a cubing, registers R1 and R2 must store the same value;
  - $d0_{3i+2}$ is always equal to zero in the case of addition.
• A multiplexer in the accumulation loop allows one to select between the content of register R3 (accumulation) or the content of R3 shifted and reduced modulo $f(x)$ (multiplication).
• An additional multiplexer is required to select the second input of the multioperand adder: $d2(x)$ (addition), $(d2(x) \cdot d0_{3i+1} \cdot x) \bmod f(x)$ (multiplication), or $\nu_1(x)$ (cubing).
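The way a control word turns the same datapath into an adder or a subtractor can be illustrated with a small functional sketch. It only models the coefficient-wise arithmetic $c_0 \cdot d1(x) + c_1 \cdot d2(x)$ over $\mathbb{F}_3$, with $(c_0, c_1)$ standing for the control digits $(d0_{3i}, d0_{3i+1})$; the function name and the tuple representation are assumptions, and registers, multiplexers, and accumulation are not modeled.

```python
def unified_add(d1, d2, c0, c1):
    """Coefficient-wise c0*d1 + c1*d2 over F_3 (no reduction modulo f needed,
    since the degree does not grow)."""
    return tuple((c0 * a + c1 * b) % 3 for a, b in zip(d1, d2))


if __name__ == "__main__":
    d1 = (1, 2, 0, 1)
    d2 = (2, 2, 1, 0)
    print(unified_add(d1, d2, 1, 1))   # addition:    d1 + d2
    print(unified_add(d1, d2, 1, 2))   # subtraction: d1 - d2, since 2 = -1 mod 3
```

The control digits $(1, 2)$ reproduce the subtraction example given in the list above.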
Cube Root over $\mathbb{F}_{3^m}$
Some of the $\eta_T$ pairing algorithms in characteristic three described in Section 2 involve cube roots over $\mathbb{F}_{3^m}$. This function is computed exactly in the same way as cubing: first, the normal form of $\sqrt[3]{d(x)}$
Architecture of the Coprocessor
Fig. 4 describes the architecture of our $\eta_T$ pairing coprocessor. It consists of a single processing element (unified operator for addition, multiplication, and cubing), registers implemented by means of a dual-port RAM (six Virtex-II Pro SelectRAM+ blocks or 13 Cyclone II M4K memory blocks), and a control unit that consists of a Finite State Machine (FSM) and an instruction memory (ROM). Each instruction consists of four fields: an 11-bit word that specifies the functionality of the processing element, address and write enable signal for port B of the dual-port RAM, address for port A of the dual-port RAM, and a 6-bit control word that manages jump instructions and indicates how many times an instruction must be repeated. This approach makes it possible for instance to execute the consecutive steps appearing in the multiplication over $\mathbb{F}_{3^m}$ with a single instruction.
The architecture described in Fig. 4 was captured in the VHDL language and prototyped on several Altera and Xilinx FPGAs. We selected the following parameters: $m = 97$, $b = 1$, and $f(x) = x^{97} + x^{12} + 2$. Both synthesis and place-and-route steps were performed with Quartus II 7.1 Web Edition and ISE WebPACK 9.2i. The implementation on this coprocessor of the reduced $\eta_T$ pairing (using Algorithm 5 for the $\eta_T$ pairing and Algorithm 8 for the final exponentiation) takes 900 instructions, which are executed in 27,800 clock cycles. Table 3 summarizes the area (in slices on Xilinx FPGAs and Logic Elements (LEs) on the Altera device) and the calculation time.
It is worth noticing that an operator for inversion over $\mathbb{F}_{3^{97}}$ based on the EEA occupies 3,422 LEs on a Cyclone-II device [42] and 2,210 slices on a Virtex-II FPGA [43]. The implementation of the algorithm based on Itoh and Tsujii's work requires 394 clock cycles on our coprocessor for $m = 97$. The EEA needs $2m = 194$ clock cycles to return the inverse. Therefore, introducing specific hardware for inversion would double the circuit area while reducing the calculation time by less than 1 percent. We also described a naive coprocessor embedding the multiplier, the cubing unit, and the adder depicted in Fig. 1. The outputs of these operators are connected to the register file by means of a three-input multiplexer controlled by two additional bits. Place-and-route results indicate that such a coprocessor (without control unit) occupies 2,199 slices on a Spartan-3 FPGA and 3,345 LEs on a Cyclone-II device. Furthermore, we need 17 bits to control this ALU. Thus, our unified operator reduces both the area of the coprocessor and the width of the control words.
In order to guarantee the security of pairing-based cryptosystems in the near future, larger extension degrees will probably have to be considered, thus raising the question of designing such a unified operator for other extension fields. For this purpose, we wrote a C++ program that automatically generates a synthesizable VHDL description of a unified operator according to the characteristic and the irreducible polynomial $f(x)$.
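The claim that a dedicated EEA-based inverter would save less than 1 percent of the calculation time follows from the cycle counts quoted above. A short check, using only those figures (the function name is a hypothetical label):

```python
def inversion_speedup_fraction(total_cycles=27_800, fermat_cycles=394, m=97):
    """Fraction of the total cycle count saved by replacing the Fermat-based
    inversion (394 cycles) with an EEA-based inverter (2*m cycles)."""
    eea_cycles = 2 * m
    saved = fermat_cycles - eea_cycles
    return saved, saved / total_cycles


if __name__ == "__main__":
    saved, fraction = inversion_speedup_fraction()
    print(saved, f"{100 * fraction:.2f}%")
```

The saving is 200 cycles out of 27,800, i.e., roughly 0.7 percent.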
COMPARISONS
Grabher and Page designed a coprocessor dealing with arithmetic over $\mathbb{F}_{3^m}$, which is controlled by a general purpose processor [19]. The ALU embeds an adder, a subtracter, a multiplier (with $D = 4$), a cubing unit, and a cube root operator based on the method highlighted by Barreto [41]. This architecture occupies 4,481 slices and allows one to perform the Duursma-Lee algorithm and its final exponentiation in 432.3 µs. The main advantage is that the control can be compiled using a retargeted GCC tool chain and other algorithms should easily be implemented on this architecture. Our approach leads however to a much simpler control unit and allows us to divide the number of slices by 2.4.
Another implementation of the Duursma-Lee algorithm was proposed by Kerins et al. [20]. It features a parallel multiplier over $\mathbb{F}_{3^{6m}}$ based on Karatsuba-Ofman's scheme. Since the final exponentiation requires a general multiplication over $\mathbb{F}_{3^{6m}}$, the authors cannot take advantage of the optimizations described in this paper and in [21] for the pairing calculation. Therefore, the hardware architecture consists of 18 multipliers and six cubing circuits over $\mathbb{F}_{3^{97}}$, along with, quoting [20], "a suitable amount of simpler $\mathbb{F}_{3^m}$ arithmetic circuits for performing addition, subtraction, and negation." Since the authors claim that roughly 100 percent of available resources are required to implement their pairing accelerator, the cost can be estimated as 55,616 slices [22]. The approach proposed in this paper reduces the area and the computation time by factors of 30 and 4.4, respectively. Note that a multiplier over $\mathbb{F}_{3^{6m}}$ based on the fast Fourier transform [44] would save three multipliers over $\mathbb{F}_{3^m}$. Since all multiplications over
Beuchat et al. described a fast architecture for the computation of the $\eta_T$ pairing [25]. The authors introduced a novel multiplication algorithm over $\mathbb{F}_{3^{6m}}$, which takes advantage of the constant coefficients of S. Thus, this design must be supplemented with a coprocessor for final exponentiation and the full pairing accelerator requires around 18,000 LEs on a Cyclone II FPGA [26]. The computation of the pairing and the final exponentiation require 4,849 and 4,082 clock cycles, respectively. Since both steps are pipelined, we can consider that a new result is returned after 4,849 clock cycles if we perform a sufficient number of consecutive full $\eta_T$ pairings. In order to compare our accelerator against this architecture, we implemented it on an Altera Cyclone II FPGA with Quartus II 7.1 Web Edition. Our design occupies 3,216 LEs and the maximal clock frequency of 152 MHz allows one to compute a pairing in 183 µs.
The architecture proposed in this paper is therefore 6 times slower but 5.6 times smaller.
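The latency and area figures quoted for the Cyclone II implementation can be cross-checked with a few lines of arithmetic. The sketch below only combines numbers stated in the text (27,800 clock cycles, 152 MHz, 3,216 LEs, and around 18,000 LEs for the design of Beuchat et al.); the function names are hypothetical, and the speed ratio is not recomputed because the clock frequency of the other design is not available here.

```python
def pairing_time_us(cycles=27_800, clock_mhz=152):
    """Time of one reduced pairing in microseconds (cycles / MHz)."""
    return cycles / clock_mhz


def area_ratio(other_les=18_000, our_les=3_216):
    """How many times larger the other design is, in Logic Elements."""
    return other_les / our_les


if __name__ == "__main__":
    print(f"{pairing_time_us():.1f} us")
    print(f"{area_ratio():.1f}x")
```

Both values agree with the 183 µs and the factor of 5.6 reported above.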
In order to study the trade-off between circuit area and calculation time of the $\eta_T$ pairing, Ronan et al. wrote a C program that automatically generates a VHDL description of a coprocessor and its control unit according to the number of multipliers over $\mathbb{F}_{3^m}$ to be included and the parameter $D$ [23]. An architecture embedding five multipliers processing $D = 4$ coefficients at each clock cycle computes for instance a full pairing in 187 µs. Though of comparable speed, this design requires five times the number of slices of our pairing accelerator. Our approach offers a better compromise between area and calculation time (Table 4).
To the best of our knowledge, the fastest $\eta_T$ pairing processor described in the open literature was designed by Jiang [24]. Unfortunately, Jiang does not give any detail about his architecture. Since a pairing is computed in 1,627 clock cycles and multiplication over $\mathbb{F}_{3^m}$ is based on an LSE array multiplier processing $D = 7$ coefficients at each clock cycle, we can however guess that the design includes a hardwired multiplier over $\mathbb{F}_{3^{6m}}$. Though 6.5 times faster than the coprocessor based on our unified arithmetic operator, the design by Jiang requires 40 times more slices.
CONCLUSION
We have discussed several algorithms to compute the $\eta_T$ pairing and its final exponentiation in characteristic three. We proposed a compact implementation of the reduced $\eta_T$ pairing in characteristic three over $\mathbb{F}_3[x]/(x^{97} + x^{12} + 2)$. Our architecture is based on a unified arithmetic operator that leads to the smallest circuit proposed in the open literature while demonstrating competitive performance.
Future work should include studies of the $\eta_T$ pairing in characteristic two, where the wired multipliers embedded in most current FPGAs should allow for cheaper and faster array multipliers, and even fully parallel multipliers, over $\mathbb{F}_{2^m}$. Such more efficient architectures would then allow us to investigate the $\eta_T$ pairing over hyperelliptic curves.
The study of the Ate pairing [45] would also be of interest, for it presents a large speedup when compared to the Tate pairing and also supports nonsupersingular curves.
The parameter $D$ refers to the number of coefficients processed at each clock cycle by a multiplier.
