Abstract-This paper is devoted to the design of fast parallel accelerators for the cryptographic $\eta_T$ pairing on supersingular elliptic curves over finite fields of characteristics two and three. We propose here a novel hardware implementation of Miller's algorithm based on a parallel pipelined Karatsuba multiplier. After a short description of the strategies that we considered to design our multiplier, we point out the intrinsic parallelism of Miller's loop and outline the architecture of coprocessors for the $\eta_T$ pairing over $\mathbb{F}_{2^m}$ and $\mathbb{F}_{3^m}$. Thanks to a careful choice of algorithms for the tower field arithmetic associated with the $\eta_T$ pairing, we manage to keep the pipelined multiplier at the heart of each coprocessor busy. A final exponentiation is still required to obtain a unique value, which is desirable in most cryptographic protocols. We supplement our pairing accelerators with a coprocessor responsible for this task. An improved exponentiation algorithm allows us to save hardware resources. According to our place-and-route results on Xilinx FPGAs, our designs improve both the computation time and the area-time trade-off compared to previously published coprocessors.

INTRODUCTION
In 2000, Mitsunari et al. [36], Sakai et al. [41], and Joux [25] independently showed how to use bilinear pairings defined over algebraic curves to solve long-standing cryptographic problems. This discovery sparked intensive research that has since produced an impressive number of pairing-based cryptographic protocols [13]. Practice has shown that one of the most efficient options to compute bilinear pairings is to resort to the Tate pairing operating on supersingular elliptic curves of low embedding degrees.
Back in 1986, Miller [33], [34] presented an iterative algorithm that can be adapted to compute the Tate pairing with linear complexity with respect to the size of the input. Since then, significant improvements of Miller's algorithm were independently proposed in 2002 by Barreto et al. [4] and Galbraith et al. [17]. One year later, Duursma and Lee presented a radix-3 variant of Miller's algorithm especially targeted at the case of characteristic three [14]. In 2004, Barreto et al. [3] introduced the $\eta_T$ approach, which further shortens the loop of Miller's algorithm. More recently, Hess et al. generalized these results to ordinary curves [22], [23], [46].
We extend here the work presented in [10] and propose novel hardware architectures for computing the $\eta_T$ pairing over binary and ternary fields based on parallel pipelined Karatsuba multipliers and enhanced unified arithmetic operators. We stress that the modified Tate pairing can be directly computed from the reduced $\eta_T$ pairing at almost no extra cost [7]. Our hardware accelerators are able to compute the $\eta_T$ pairing operating on supersingular elliptic curves defined over $\mathbb{F}_{2^{691}}$ and $\mathbb{F}_{3^{313}}$ in just 18.8 and 16.9 µs, respectively (Table 5). We note that these field sizes offer a level of security equivalent to that of 105- and 109-bit symmetric-key cryptosystems, respectively (Table 4).
The main strategies considered to design our parallel pipelined multiplier are described in Section 2. They are included in a VHDL code generator that allows us to experiment on a wide range of operators as well as a variety of design parameters. Thanks to a judicious choice of algorithms for performing tower field arithmetic and a careful analysis of the scheduling, we managed to keep our pipelined units always busy. This allows us to compute one iteration of Miller's algorithm over ternary and binary fields in only 17 and 7 clock cycles, respectively (Sections 3 and 4). We summarize the results obtained from our FPGA implementation and provide the reader with a thorough comparison against previously published coprocessors in Section 5.
For the sake of concision, we are forced to skip the description of many important concepts of elliptic curve theory. We refer the interested reader to [44], [47] for an in-depth coverage of this topic.
PARALLEL KARATSUBA MULTIPLIERS
Before delving into the specifics of our pairing coprocessor architectures, we first detail here the Karatsuba multipliers on which they extensively rely.
We define the $p$-ary extension field $\mathbb{F}_{p^m}$ as $\mathbb{F}_p[x]/(f(x))$, where $f$ is an irreducible degree-$m$ monic polynomial over $\mathbb{F}_p$. The product of two arbitrary elements of $\mathbb{F}_{p^m}$, represented as $p$-ary polynomials of degree at most $m-1$, is computed as the polynomial multiplication of the two elements modulo $f$. Carefully selecting an irreducible polynomial with low Hamming weight (i.e., trinomial, tetranomial, etc.) and low subdegree allows for a simple modular reduction step.
In this work, due to its subquadratic space complexity, we opted for a variant of the classical Karatsuba multiplier to implement the polynomial product, while a few extra adders and subtracters over $\mathbb{F}_p$ are dedicated to performing the final reduction modulo $f$.
Variations on the Karatsuba Algorithm
The Karatsuba multiplication [27] is based on the observation that the polynomial product $c = a \cdot b$, for $a$ and $b \in \mathbb{F}_{p^m}$, can be computed as
$$c = a^H b^H x^{2n} + \left[(a^H + a^L)(b^H + b^L) - a^H b^H - a^L b^L\right] x^n + a^L b^L,$$
where $n = \lceil m/2 \rceil$, $a = a^L + x^n a^H$, and $b = b^L + x^n b^H$. Note that since we are working with polynomials, there is no carry propagation. This allows one to split the operands in a slightly different way: for instance, Hanrot and Zimmermann [21] suggested splitting them into odd- and even-degree parts. This idea was adapted to multiplication over $\mathbb{F}_{2^m}$ by Fan et al. [15]. Since there is no overlap between the odd and even parts at the reconstruction step, this different method of splitting saves approximately $m$ additions over $\mathbb{F}_p$ during the reconstruction of the product.
Another natural way to generalize the Karatsuba multiplication is to split the operands into three or more parts, in a classical way (i.e., splitting each operand into contiguous parts from the lowest to the highest powers of x) or using a generalized odd/even split (i.e., according to the degree modulo the number of split parts). By applying this strategy recursively, in each iteration, each polynomial multiplication is transformed into three or more subproducts of smaller degree, until all the polynomial operands are reduced to single coefficients. Nevertheless, practice has shown that it is better to prune the recursion earlier, performing the lowest-level multiplications using alternative techniques that are more compact and/or faster for low-degree operands, such as the so-called schoolbook method with quadratic complexity, which has been selected for this work.
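The recursion with an early switch to the schoolbook method can be sketched in software as follows. This is only an illustrative Python model under our own assumptions, not the VHDL operator of this work: polynomials are lists of coefficients in $\mathbb{F}_p$ (lowest degree first), the names `karatsuba`, `schoolbook`, and `threshold` are hypothetical, the classical two-way split is used, and the final reduction modulo $f$ is left out.

```python
def schoolbook(a, b, p):
    """Quadratic-complexity product of two coefficient lists over F_p."""
    if not a or not b:
        return []
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % p
    return c


def _pad(a, n):
    """Extend a coefficient list with zeros up to length n."""
    return a + [0] * (n - len(a))


def add(a, b, p):
    n = max(len(a), len(b))
    return [(x + y) % p for x, y in zip(_pad(a, n), _pad(b, n))]


def sub(a, b, p):
    n = max(len(a), len(b))
    return [(x - y) % p for x, y in zip(_pad(a, n), _pad(b, n))]


def karatsuba(a, b, p, threshold=4):
    """Two-way Karatsuba product over F_p; recursion is pruned at `threshold`."""
    m = max(len(a), len(b))
    a, b = _pad(a, m), _pad(b, m)
    if m <= threshold:
        return schoolbook(a, b, p)
    n = (m + 1) // 2                      # n = ceil(m / 2)
    a_lo, a_hi = a[:n], a[n:]
    b_lo, b_hi = b[:n], b[n:]
    low = karatsuba(a_lo, b_lo, p, threshold)
    high = karatsuba(a_hi, b_hi, p, threshold)
    mid = karatsuba(add(a_lo, a_hi, p), add(b_lo, b_hi, p), p, threshold)
    mid = sub(sub(mid, low, p), high, p)  # a_lo*b_hi + a_hi*b_lo
    c = [0] * (2 * m - 1)
    for i, v in enumerate(low):
        c[i] = (c[i] + v) % p
    for i, v in enumerate(mid):
        if v:                             # top entries of mid are zero padding
            c[i + n] = (c[i + n] + v) % p
    for i, v in enumerate(high):
        c[i + 2 * n] = (c[i + 2 * n] + v) % p
    return c
```

Each level replaces one product by three half-size products; raising `threshold` trades recursion depth for larger schoolbook blocks, which mirrors the pruning discussed above.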
A Pipelined Architecture for the Karatsuba Multiplier
We pipelined our multiplier architecture by means of optional registers inserted between the computations of the required subproducts, where the depth of the pipeline can be adjusted according to the complexity of the application at hand. This approach allows us to split the critical path of the whole multiplier structure and, therefore, increase its operating frequency. In order to study a wide range of implementation strategies, we wrote a VHDL code generator, which automatically produces the description of different variants of Karatsuba multipliers according to several parameters (field extension degree, irreducible polynomial, splitting method, etc.). Our automatic tool was extremely useful for selecting the operator that showed the highest clock frequency, the smallest area, or a good trade-off between them.
REDUCED $\eta_T$ PAIRING IN CHARACTERISTIC THREE
In the following, we consider the computation of the reduced $\eta_T$ pairing in characteristic three. Table 1 summarizes the parameters of the algorithm and of the supersingular curves. We refer the reader to [3], [8] for more details about the computation of the $\eta_T$ pairing. Recall that a final exponentiation is required to obtain a unique value, which is desirable in the context of cryptographic protocols. As pointed out by Beuchat et al. [9], the computations of the nonreduced pairing (i.e., Miller's algorithm) and of the final exponentiation do not share the same datapath, and it seems judicious to pipeline these two tasks using two distinct coprocessors in order to reduce the computation time and increase the throughput.
Computation of Miller's Algorithm
We rewrote in Algorithm 1 the reversed-loop algorithm in characteristic three described in [8], denoting each iteration with parenthesized indices in superscript in order to emphasize the intrinsic parallelism of the $\eta_T$ pairing. At each iteration of Miller's algorithm, two tasks are performed in parallel, namely, a sparse multiplication over $\mathbb{F}_{3^{6m}}$ (lines 6 and 7) and the computation of the coefficients for the next sparse operation (lines 8-10). We say that an operand in $\mathbb{F}_{3^{6m}}$ is sparse when some of its coefficients are trivial (i.e., either zero, one, or minus one).
Sparse Multiplication over $\mathbb{F}_{3^{6m}}$
The intermediate result $R^{(i-1)}$ is multiplied by the sparse operand $S^{(i)}$ (Algorithm 1, lines 6 and 7). This operation is easier than a standard multiplication over $\mathbb{F}_{3^{6m}}$, but the choice of the sparse multiplication algorithm requires careful attention. Bertoni et al. [6] and Gorla et al. [18] took advantage of Karatsuba multiplication and Lagrange interpolation, respectively, to reduce the number of multiplications over $\mathbb{F}_{3^m}$ at the expense of several additions. (Note that Gorla et al. study standard multiplication over $\mathbb{F}_{3^{6m}}$ in [18], but extending their approach to sparse multiplication is straightforward.) In order to keep the pipeline of a Karatsuba multiplier busy with these schemes, we would have to embed in our processor a large multioperand adder (up to 12 operands for the scheme proposed by Gorla et al.) and several multiplexers to deal with the irregular datapath. This would negatively impact the area and the clock frequency. We therefore prefer the algorithm discussed by Beuchat et al. in [11], which gives a better trade-off between the number of multiplications and additions over the underlying field (Algorithm 2): it involves 17 multiplications and 29 additions over $\mathbb{F}_{3^m}$ to compute $S^{(i)}$ and $R^{(i-1)} \cdot S^{(i)}$.
We propose to take advantage of a parallel Karatsuba multiplier with seven pipeline stages to implement Miller's algorithm. Since the algorithm we selected for sparse multiplication over $\mathbb{F}_{3^{6m}}$ requires at most the addition of four elements of $\mathbb{F}_{3^m}$, it suffices to complement the multiplier with a four-operand adder to compute the required sums. We managed to find a scheduling that allows us to start a new multiplication over $\mathbb{F}_{3^m}$ at each clock cycle, thus keeping the pipeline busy and computing an iteration of Miller's algorithm in 17 clock cycles, as depicted in Fig. 2. It is worth noticing that the cost of additions over $\mathbb{F}_{3^m}$ is hidden and the number of clock cycles depends only on the number of multiplications over $\mathbb{F}_{3^m}$.
We easily identify five datapaths (denoted by the numerals 1 to 5 in Figs. 1 and 2) between the output of the four-operand adder and the inputs of the parallel multiplier. Specific attention is needed to design the register file storing the coefficients of $R^{(i)}$ and the intermediate variables $a_j^{(i)}$, $0 \le j \le 6$, of the sparse multiplication algorithm. According to our scheduling scheme, we have to read simultaneously up to three variables from the register file. Thus, we decided to implement it by means of two blocks of Dual-Ported RAM (DPRAM):
• The first one is connected to input $M_0$ of the parallel multiplier and input $A_0$ of the four-operand adder, and stores the coefficients of $R^{(i)}$.
• According to our scheduling (Fig. 2), the second DPRAM block provides the four-operand adder with its fourth input, namely, $a_j^{(i)}$, $0 \le j \le 5$, and $r_j^{(i-1)}$, $0 \le j \le 2$.
Computation of the Sparse Operand
The second task consists in computing the coefficients of the sparse operand $S^{(i+1)}$ required for the next iteration of Miller's algorithm (Algorithm 1, lines 8-10). Two cubings and an addition over $\mathbb{F}_{3^m}$ allow us to update the coordinates of point $P$ and to determine the coefficient $t^{(i+1)}$ of the sparse operand $S^{(i+1)}$, respectively. Recall that the $\eta_T$ pairing over $\mathbb{F}_{3^m}$ comes in two flavors: the original one involves a cubing over $\mathbb{F}_{3^{6m}}$ after each sparse multiplication. Barreto et al. [3] explained how to get rid of that cubing at the price of two cube roots over $\mathbb{F}_{3^m}$ to update the coordinates of point $Q$. It is essential to consider such an algorithm here, as an extra cubing over $\mathbb{F}_{3^{6m}}$ would put even more strain on the first task (which is already the most expensive one). According to our results, the critical path of the circuit is never located in a cube root operator when pairing-friendly irreducible trinomials or pentanomials [2], [20] are used to define $\mathbb{F}_{3^m}$. If such polynomials are not available for the considered extension of $\mathbb{F}_3$ and the critical path is in the cube root, it is always possible to pipeline this operation. Therefore, the cost of cube roots is hidden by the first task.
The hardware implementation is rather straightforward (Fig. 1): four registers, a cubing operator, and a cube root operator allow us to store and update the coordinates of points $P$ and $Q$. Then, a two-operand adder computes the sum of $x_P^{(i)}$ and $x_Q^{(i)}$, and the result $t^{(i)}$ is stored in a fifth register. Multiplexers select the inputs of the parallel multiplier according to our scheduling.
Initialization
The initialization step of the $\eta_T$ pairing (Algorithm 1, lines 1 and 2) involves a small amount of specific hardware in order to compute $x_P^{(0)}$ and the other initial values. Loading the coordinates of points $P$ and $Q$ and performing the initialization step involve 17 clock cycles (i.e., exactly the same number of clock cycles as an iteration of Miller's algorithm). Therefore, our coprocessor returns $R^{((m-1)/2)}$ after $17 \cdot (m+3)/2$ clock cycles.
Final Exponentiation
The second and last stage in the computation of the $\eta_T$ pairing is the final exponentiation, where the result of Miller's algorithm $R^{((m-1)/2)} = \eta_T(P, Q)$ is raised to the $M$th power (Algorithm 1, line 12). This exponentiation is necessary since the nonreduced pairing $\eta_T(P, Q)$ is only defined up to $N$th powers in $\mathbb{F}_{3^{6m}}^*$.
Improved Algorithm
In order to compute this final exponentiation, we use the algorithm presented by Beuchat et al. [8]. This method exploits the special form of the exponent $M$ (see Table 1) to achieve better performance than a classical square-and-multiply algorithm. Among other computations, this final exponentiation involves raising an element of $\mathbb{F}_{3^{6m}}^*$ to the power of $3^{(m+1)/2}$, which Beuchat et al. [8] perform by computing $(m+1)/2$ successive cubings over $\mathbb{F}_{3^{6m}}^*$. Since each of these cubings requires six cubings and six additions over $\mathbb{F}_{3^m}$, the total cost of this step is $3m+3$ cubings and $3m+3$ additions.
We present here a new method for computing $U^{3^{(m+1)/2}}$ by exploiting the linearity of the Frobenius map (i.e., cubing in characteristic three) to reduce the number of additions. Indeed, expressing $U$ through its six coefficients over $\mathbb{F}_{3^m}$, we obtain a formula for $U^{3^i}$ that depends on the value of $i$.
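To illustrate why linearity helps, the following worked derivation shows what such a formula can look like. It is a sketch under an assumption that this section does not state: we take the usual tower $\mathbb{F}_{3^{6m}} = \mathbb{F}_{3^m}[\sigma, \rho]/(\sigma^2 + 1, \rho^3 - \rho - b)$ with $b = \pm 1$, and the coefficient names $u_0, \ldots, u_5$ are ours.

```latex
% Assumed tower basis (hypothetical here): sigma^2 = -1, rho^3 = rho + b, b = +-1.
\begin{align*}
U &= u_0 + u_1\sigma + u_2\rho + u_3\sigma\rho + u_4\rho^2 + u_5\sigma\rho^2, \\
\sigma^{3^i} &= (-1)^i \sigma, \qquad
\rho^{3^i} = \rho + ib, \qquad
(\rho^2)^{3^i} = \rho^2 - ib\rho + i^2, \\
U^{3^i} &= \bigl(u_0^{3^i} + ib\,u_2^{3^i} + i^2 u_4^{3^i}\bigr)
  + (-1)^i \bigl(u_1^{3^i} + ib\,u_3^{3^i} + i^2 u_5^{3^i}\bigr)\sigma \\
 &\quad + \bigl(u_2^{3^i} - ib\,u_4^{3^i}\bigr)\rho
  + (-1)^i \bigl(u_3^{3^i} - ib\,u_5^{3^i}\bigr)\sigma\rho
  + u_4^{3^i}\rho^2 + (-1)^i u_5^{3^i}\sigma\rho^2.
\end{align*}
```

Under this assumption, the scalars $ib$ and $i^2$ are reduced modulo 3 and the sign depends on the parity of $i$, so the shape of the formula depends only on $i \bmod 6$. The six coefficients still require $i$ cubings each, but at most six additions or subtractions over $\mathbb{F}_{3^m}$ are needed in total, regardless of $i$.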
Hardware Implementation
Our first attempt at computing the final exponentiation was to use the unified arithmetic operator introduced by Beuchat et al. [8] . Unfortunately, due to the sequential scheduling inherent to this operator, it turned out that the final exponentiation algorithm required more clock cycles than the computation of Miller's algorithm by our coprocessor. We, therefore, had to consider a slightly more parallel architecture.
Noticing that the critical operations in the final exponentiation algorithm were multiplications and long sequences of cubings over $\mathbb{F}_{3^m}$, we designed the coprocessor for arithmetic over $\mathbb{F}_{3^m}$ depicted in Fig. 3. Besides a register file implemented by means of DPRAM, our coprocessor embeds a parallel-serial multiplier [45] processing $D$ coefficients of an operand at each clock cycle (typically, $D = 13$ or 14), along with a novel unified operator supporting addition, subtraction, accumulation, Frobenius map (i.e., cubing), and double Frobenius map (i.e., raising to the ninth power). This architecture allowed us to efficiently implement the final exponentiation algorithm described, for instance, in [8], while taking advantage of the improvement proposed above.
Using Inverse Frobenius Maps
We adapt here the idea behind the square-root-and-multiply algorithm for exponentiation over binary finite fields given by Rodríguez-Henríquez et al. in [37] .
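From the final exponentiation algorithm given in [8], it can be noticed that Frobenius maps over $\mathbb{F}_{3^m}$ (i.e., cubings) are needed only to perform an inversion over $\mathbb{F}_{3^m}$ by means of Fermat's little theorem and to raise an element of $\mathbb{F}_{3^{6m}}^*$ to the power of $3^{(m+1)/2}$. For the inversion, the intermediate powers $w_n$ of the element to be inverted can be expressed with cube roots instead of cubings, and, for any two integers $n$ and $n'$, $w_{n+n'}$ can be derived from $w_n$ and $w_{n'}$. It follows that, given an addition chain of length $l$ for $m-1$, we can compute $u = w_{m-1}$ in $l$ multiplications and $m-1$ cube roots over $\mathbb{F}_{3^m}$. This has to be compared to the $l$ multiplications and $m-1$ cubings required in [8, Algorithm 10] to obtain the same $u$.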
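The operation counts above can be checked with a small exponent-tracking model. The sketch below is not taken from the original algorithm: it assumes the definition $w_n = d^{(3^m - 3^{m-n})/2}$ for the element $d$ to be inverted and the recurrence $w_{n+n'} = w_n \cdot (w_{n'})^{3^{-n}}$, which are one choice consistent with the stated counts. Every value is represented only by its exponent modulo $3^m - 1$, and the function name and the chain format are hypothetical.

```python
def inverse_by_cube_roots(m, chain):
    """Model inversion in F_{3^m}^* with cube roots and an addition chain.

    Each field value d**e is tracked through its exponent e modulo 3^m - 1.
    `chain` is an increasing addition chain from 1 to m - 1 in which every
    element is the previous element plus an earlier one.
    Returns (exponent of the result, multiplications, cube roots) for u = w_{m-1},
    where the returned exponent is that of d * u^2.
    """
    if chain[0] != 1 or chain[-1] != m - 1:
        raise ValueError("chain must start at 1 and end at m - 1")
    order = 3**m - 1                 # order of the multiplicative group
    cube_root = 3**(m - 1)           # 3 * 3^(m-1) = 3^m = 1 (mod order)
    w = {1: cube_root % order}       # assumed w_1: the cube root of d
    mults, roots = 0, 1
    for prev, cur in zip(chain, chain[1:]):
        small = cur - prev
        if small not in w:
            raise ValueError("not a valid addition chain")
        # Apply `small` cube roots to w_prev, then multiply by w_small.
        shifted = w[prev] * pow(cube_root, small, order) % order
        w[cur] = (w[small] + shifted) % order
        mults += 1
        roots += small
    u = w[chain[-1]]
    # Assumed last step: d^(-1) = d * u^2 (two further multiplications).
    return (2 * u + 1) % order, mults, roots
```

For instance, with $m = 97$ and the chain 1, 2, 3, 6, 12, 24, 48, 96 of length $l = 7$, the model uses 7 multiplications and 96 cube roots to reach $u$, and $d \cdot u^2$ has exponent $3^{97} - 2 \equiv -1 \pmod{3^{97} - 1}$.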
As for the raising of an element of $\mathbb{F}_{3^{6m}}^*$ to the power of $3^{(m+1)/2}$, also part of the final exponentiation algorithm, we simply apply Fermat's little theorem once more to see that
$$z^{3^{(m+1)/2}} = z^{3^{-(m-1)/2}}$$
for all $z \in \mathbb{F}_{3^m}$. Thus, we can directly trade the $3m+3$ required cubings (as explained in the analysis given in Section 3.2.1) for $3m-3$ cube roots over $\mathbb{F}_{3^m}$.
Hence, from the previous considerations, it is possible to replace all Frobenius maps (cubings) by inverse Frobenius maps (cube roots) in the final exponentiation. This is particularly interesting since the irreducible polynomial used to represent $\mathbb{F}_{3^m}$ was carefully chosen to allow for low-complexity cubings and cube roots, as both are required for the computation of the nonreduced $\eta_T$ pairing. Furthermore, it appears that for the considered irreducible trinomials, the complexity of the cube root is always lower than that of the cubing. This is shown in Table 2, where the third column reports the total number of required additions/subtractions over $\mathbb{F}_3$, and the fourth column indicates the largest number of elements of $\mathbb{F}_3$ that need to be added/subtracted to one another to compute a coefficient of the result (for instance, cubing over $\mathbb{F}_{3^{97}}$ requires summing at most four elements of $\mathbb{F}_3$ at a time).
In order to assess the impact of replacing the Frobenius and double-Frobenius operators by inverse-Frobenius (cube root) and double-inverse-Frobenius (ninth root) operators in the architecture presented in Fig. 3 , we implemented the different variants on Xilinx Virtex-4 LX FPGAs (xc4vlx40-11). The place-and-route results, reported in the last two columns of Table 2 , show that the use of cube roots usually shortens the critical path, even though the circuits are then slightly larger, as the cubing formulas generally involve more common subexpressions which can then share the same logic and decrease the total resource usage. All in all, it appears that relying on inverse Frobenius maps to compute the final exponentiation is by and large an effective optimization.
REDUCED $\eta_T$ PAIRING IN CHARACTERISTIC TWO
An approach similar to that of characteristic three allowed us to design a parallel coprocessor for the reduced $\eta_T$ pairing in characteristic two, as depicted in Fig. 4. The supersingular curves and the parameters of the algorithm are summarized in Table 1.
Computation of Miller's Algorithm
Applying the strategy that we used for characteristic three to the case of characteristic two, we adopted the reversed-loop algorithm described in [7], which we recall here in Algorithm 3. However, the scheduling turns out to be slightly more difficult than in characteristic three since we have to perform three tasks in parallel at each iteration of Miller's algorithm.
Sparse Multiplication over $\mathbb{F}_{2^{4m}}$
The intermediate result $F^{(i-1)}$ is multiplied by a sparse operand at each iteration of Miller's algorithm (Algorithm 3). It is worth mentioning that the datapath between the output of the four-operand adder and the parallel multiplier is much simpler than in characteristic three: it suffices to delay $f_j^{(i)}$, $0 \le j \le 3$, by one clock cycle and there is, therefore, no need for a memory block to store the operands of the multiplier. Dealing with inputs $A_0$ and $A_1$ of the four-operand adder is unfortunately more difficult because of data dependencies between the coefficients of $F$. Instead of including a DPRAM block in our design, we propose a solution based on two small FIFOs (see Fig. 6 for details). An advantage of characteristic two over characteristic three is that the register file is smaller in terms of circuit area and requires fewer control bits.
Computation of g
Choosing the Adequate Karatsuba Multiplier
In these settings, a Karatsuba multiplier with five pipeline stages can be kept busy during the computation of the main loop, as shown in Fig. 5. Since we have to carry out seven multiplications over $\mathbb{F}_{2^m}$ at each iteration, the calculation time for the full loop is equal to $7 \cdot (m+1)/2$ clock cycles. It is again crucial to consider an algorithm with inverse Frobenius maps (i.e., square roots) in order to avoid squaring $F^{(i)}$ at each iteration of Miller's algorithm (see, for instance, [7] for a survey of algorithms for the Tate pairing over supersingular curves in characteristic two). Such an operation would lengthen the computation time and pipeline bubbles would be inserted in the multiplier.
Initialization
The initialization step requires specific attention. In order to start multiplying $u^{(0)}$ by $v^{(0)}$ as soon as possible (Algorithm 3, line 5), we load the coordinates of points $P$ and $Q$ in the following order: $x_P$, $x_Q$, $y_P$, and $y_Q$. Thus, $u^{(0)}$ and $v^{(0)}$ are available after two clock cycles. Thanks to this scheduling, we complete the initialization step in 15 clock cycles.
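The cycle counts quoted for Miller's algorithm in both characteristics can be tabulated with a few lines of Python. This is a simple sketch with hypothetical function names; in characteristic two, it assumes that the 15 initialization cycles are simply added to the $7 \cdot (m+1)/2$ cycles of the main loop, with no overlap, which the text does not state explicitly.

```python
def miller_cycles_char3(m):
    """17 cycles per iteration, (m + 1)/2 iterations, plus 17 cycles for
    loading and initialization: 17 * (m + 3) / 2 cycles in total (m odd)."""
    return 17 * (m + 3) // 2


def miller_cycles_char2(m):
    """7 cycles per iteration, (m + 1)/2 iterations, plus an assumed
    non-overlapping initialization of 15 cycles (m odd)."""
    return 7 * (m + 1) // 2 + 15


if __name__ == "__main__":
    for m in (97, 193, 313):
        print("characteristic three, m =", m, ":", miller_cycles_char3(m), "cycles")
    for m in (557, 613, 691):
        print("characteristic two,   m =", m, ":", miller_cycles_char2(m), "cycles")
```

Dividing these counts by the clock frequency of a given implementation gives the latency of the nonreduced pairing; the frequencies themselves come from place-and-route results and are not modeled here.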
Irreducible Pentanomials Suitable for Low-Complexity Square Root Computation
Although irreducible trinomials allowing for simple computations of squarings and square roots exist in some binary finite fields, as detailed in [37], this was not the case for several fields considered in this work (see Table 3). To tackle this issue, we present here a novel family of irreducible square-root-friendly pentanomials that, to the best of our knowledge, has not been proposed before in the literature. Let $m$ and $d$ be two odd positive integers with $d < m/2$ such that the degree-$m$ monic pentanomial $f$ of this family, which is parameterized by $d$, is irreducible over $\mathbb{F}_2$. We then represent the binary extension field $\mathbb{F}_{2^m}$ as $\mathbb{F}_2[x]/(f(x))$. Reducing modulo $f$ yields an identity whose left-hand side involves only even exponents, from which an expression for $\sqrt{x}$ follows. Using this expression for $\sqrt{x}$, we can compute the square root of an element $a \in \mathbb{F}_{2^m}$ as [16]
$$\sqrt{a} = \sum_{i=0}^{(m-1)/2} a_{2i} x^i + \sqrt{x} \sum_{i=0}^{(m-3)/2} a_{2i+1} x^i \bmod f(x).$$
Furthermore, one can show that if $(2m-1)/7 \le d \le (2m+1)/5$, then the complete computation of the square root (i.e., including the reduction modulo $f$) will require the addition of at most three elements of $\mathbb{F}_2$ at a time for each coefficient of the result. Finally, choosing $d \ge (m-1)/6$ ensures that a squaring will involve only additions of at most four operands.
As reported in Table 3, three pentanomials of this family have been selected to represent the finite fields $\mathbb{F}_{2^{557}}$, $\mathbb{F}_{2^{613}}$, and $\mathbb{F}_{2^{691}}$, with $d = 197$, 185, and 243, respectively.
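The even/odd square-root formula of [16] can be exercised on a toy field. The sketch below deliberately does not use one of the pentanomials of this work: as an assumption made for brevity, it works in $\mathbb{F}_{2^7}$ defined by the irreducible trinomial $x^7 + x + 1$, stores polynomials as Python integers (bit $i$ is the coefficient of $x^i$), and obtains $\sqrt{x}$ generically as $x^{2^{m-1}}$. All names are hypothetical.

```python
M = 7
F = (1 << 7) | (1 << 1) | 1          # x^7 + x + 1, irreducible over F_2


def mulmod(a, b):
    """Carry-less product of a and b, reduced modulo F on the fly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:             # degree reached M: subtract (xor) F
            a ^= F
    return r


def compute_sqrt_x():
    """sqrt(x) = x^(2^(M-1)), computed with M - 1 successive squarings."""
    s = 0b10                         # the polynomial x
    for _ in range(M - 1):
        s = mulmod(s, s)
    return s


SQRT_X = compute_sqrt_x()


def split_even_odd(a):
    """Return (sum of a_{2i} x^i, sum of a_{2i+1} x^i)."""
    even = odd = 0
    for i in range(M):
        if (a >> i) & 1:
            if i % 2 == 0:
                even |= 1 << (i // 2)
            else:
                odd |= 1 << (i // 2)
    return even, odd


def field_sqrt(a):
    """sqrt(a) = even part + sqrt(x) * odd part (mod F)."""
    even, odd = split_even_odd(a)
    return even ^ mulmod(SQRT_X, odd)
```

Here $\sqrt{x} = x^4 + x$, since $x = x^8 + x^2$ modulo $x^7 + x + 1$ and both exponents on the right are even. In hardware, a low-weight $\sqrt{x}$ is what keeps the multiplication by $\sqrt{x}$ down to a few additions per coefficient.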
Final Exponentiation
The final exponentiation (Algorithm 3, line 22) is carried out according to the algorithms proposed in [7] and [42], [43] for the parameter values $1$ and $-1$, respectively. We took advantage of the algorithm introduced in [12] when raising to the $(2^m + 1)$st power over $\mathbb{F}_{2^{4m}}$. Here, again, the linearity of the Frobenius map allows us to reduce the number of additions when computing powers of the form $U^{2^i}$.
As in characteristic three, a hybrid arithmetic operator similar to the one shown in Fig. 3 allows us to perform the final exponentiation in slightly fewer clock cycles than Miller's algorithm without impacting the resource usage too much. The architecture is very similar to that of characteristic three, except that we removed the multiplications by $1$ and $-1$, which are useless in characteristic two, and replaced the double cubing by a triple-squaring operator to accommodate the longer chains of successive Frobenius maps in the final exponentiation algorithm. The parallel-serial multiplier processes here between $D = 15$ and $D = 17$ coefficients of its second operand per clock cycle.
It is worth noting that the trick of trading Frobenius for inverse Frobenius maps used for the final exponentiation in characteristic three can also be applied to the case of characteristic two. Indeed, as reported in Table 3 , the complexity of the square root is always lower than that of the squaring over the considered finite fields.
However, putting this optimization into practice happens to be more complex than in characteristic three. Apart from the Frobenius maps required for the inversion over $\mathbb{F}_{2^m}$, the final exponentiation involves a few other squarings. These few squarings could be replaced by actual multiplications, which would then slightly increase the number of clock cycles required to compute the final exponentiation. Alternatively, we could take the fourth root of the result, which by linearity of the square root would then cancel all the extra Frobenius maps, but we would then end up computing a fixed power of the $\eta_T$ pairing and not the $\eta_T$ pairing itself.
However, observing that in characteristic two, the critical path of the whole pairing accelerator lies in the nonreduced pairing coprocessor and not in the final exponentiation one, this is actually a moot point as there is no use trying to shorten further the critical path in the final exponentiation. We, therefore, decided against using this optimization altogether in the case of characteristic two.
RESULTS AND COMPARISONS
Comparison with Previous Works
Thanks to our automatic VHDL code generator, we designed several versions of the proposed architectures and prototyped our coprocessors on Xilinx Virtex-II Pro and Virtex-4 LX FPGAs with average speedgrade. Table 4 details the specifics of the considered supersingular curves, while Table 5 provides the reader with a comparison between our work and accelerators for the Tate and $\eta_T$ pairings over supersingular (hyper)elliptic curves published in the open literature. (Note that our comparison remains fair since the Tate pairing can be computed from the $\eta_T$ pairing at no extra cost [7].) Finally, these results are summarized in Fig. 7, where post-place-and-route computation time and area-time product estimations are plotted against the achieved level of security.
In the presented benchmarks, the logic resource usage is given in terms of slices, which is the usual metric on Xilinx FPGAs. Each slice comprises two four-input lookup tables and two 1-bit flip-flops. Furthermore, it is worth noting that even though our coprocessors also make use of some embedded memory blocks as register files, they are by far not a critical resource and are, therefore, not reported in the benchmarks.
Our architectures are also much faster than software implementations. Mitsunari wrote a very careful multithreaded implementation of the $\eta_T$ pairing over $\mathbb{F}_{3^{97}}$ and $\mathbb{F}_{3^{193}}$ [35]. He reported a computation time of 92 and 553 µs, respectively, on an Intel Core 2 Duo processor (2.66 GHz).
TABLE 3 Frobenius versus Inverse Frobenius Maps in Characteristic Two
Interestingly enough, his software library outperforms several hardware architectures proposed by other researchers for low levels of security. When we compare his results with our work, we note that the gap between software and hardware increases when considering larger values of $m$. The computation of the $\eta_T$ pairing over $\mathbb{F}_{3^{193}}$ on a Virtex-4 LX FPGA with a medium speedgrade is, for instance, roughly 50 times faster than software. This speedup justifies the use of large FPGAs, which are now available in servers and supercomputers such as the SGI Altix 4700 platform.
Kammler et al. [26] reported the first hardware implementation of the Optimal Ate pairing [46] over a Barreto-Naehrig (BN) curve [5], that is, an ordinary curve defined over a prime field $\mathbb{F}_p$ with embedding degree $k = 12$. The proposed design is implemented with a 130 nm standard cell library and computes a pairing in 15.8 ms over a 256-bit BN curve. It is, however, difficult to make a fair comparison between our respective works since the level of security and the target technology are not the same.
Characteristic Two versus Characteristic Three
It is worth noting that, in order to achieve the same level of security for the $\eta_T$ pairing over supersingular curves in characteristics two and three, the extension degree $m$ of $\mathbb{F}_{2^m}$ has to be larger than that of $\mathbb{F}_{3^{m'}}$. More precisely, we have the ratio
$$\frac{m}{m'} = \frac{3 \log 3}{2 \log 2} \approx 2.4,$$
since the embedding degree is six in characteristic three, against four in characteristic two. This ratio also applies asymptotically to the number of iterations in Miller's algorithm, which is $(m+1)/2$ and $(m'+1)/2$, respectively. However, the arithmetic over $\mathbb{F}_{2^{4m}}$ required for the computation of the pairing in characteristic two is much simpler than the arithmetic over $\mathbb{F}_{3^{6m'}}$: one iteration of Miller's algorithm requires only 7 multiplications over $\mathbb{F}_{2^m}$, against 17 multiplications over $\mathbb{F}_{3^{m'}}$ in the case of characteristic three. Coincidentally, the ratio between the two is also $17/7 \approx 2.4$.
Thus, although necessitating 2.4 times as many iterations as in characteristic three, the $\eta_T$ pairing over $\mathbb{F}_{2^m}$ requires almost exactly as many products over the base field as the $\eta_T$ pairing over $\mathbb{F}_{3^{m'}}$. Furthermore, a smaller extension degree $m'$ compensates for the arithmetic over $\mathbb{F}_3$ being more expensive than that over $\mathbb{F}_2$.
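The two ratios in this comparison are easy to reproduce numerically. The short sketch below uses hypothetical function names; it reads "same level of security" as equal sizes of $\mathbb{F}_{2^{4m}}$ and $\mathbb{F}_{3^{6m'}}$, i.e., $4m \log 2 = 6m' \log 3$, which is our interpretation of the embedding-degree argument.

```python
import math


def extension_degree_ratio():
    """m / m' obtained from 4 * m * log(2) = 6 * m' * log(3)."""
    return (3 * math.log(3)) / (2 * math.log(2))


def base_field_products(m, mults_per_iteration):
    """Multiplications over the base field in Miller's loop:
    (m + 1)/2 iterations times the per-iteration count (m odd)."""
    return mults_per_iteration * (m + 1) // 2


if __name__ == "__main__":
    print("m / m' =", round(extension_degree_ratio(), 3))
    print("17 / 7 =", round(17 / 7, 3))
    print("products, F_2^691:", base_field_products(691, 7))
    print("products, F_3^313:", base_field_products(313, 17))
```

Note that $m = 691$ and $m' = 313$ do not sit at exactly the same level of security (see Table 4), so the two product counts are close but not identical.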
That close similarity in terms of performances between characteristics two and three at a constant level of security, as hinted at by this short analysis, can actually be observed in the place-and-route results of our coprocessors (Fig. 7) , even though characteristic two appears to have a slight advantage for low security.
CONCLUSION
We proposed novel architectures based on a parallel pipelined Karatsuba multiplier for the $\eta_T$ pairing in characteristics two and three. The main design challenge we faced was to keep the pipeline continuously busy. Accordingly, we modified the scheduling of Miller's algorithm in order to introduce more parallelism in the pairing computation. We also presented a faster way to perform the final exponentiation by exploiting the linearity of the Frobenius map and/or taking advantage of a simpler inverse Frobenius map in certain cases. Both software and hardware implementations can benefit from these techniques.
To our knowledge, the implementation of our designs on several Xilinx FPGA devices improved both the computation time and the area-time trade-off compared to previously published coprocessors [1], [7], [11], [19], [24], [28], [29], [30], [31], [38], [39], [40], [42], [43].
However, as of today, the design of pairing accelerators providing a level of security equivalent to that of AES-128 remains a problem of major interest. Although Kammler et al. [26] proposed a first solution over a Barreto-Naehrig curve, several questions remain open. For instance, is it possible to achieve such a level of security in hardware with supersingular (hyper)elliptic curves at a reasonable cost in terms of computation time and circuit area? Since several protocols rely on such curves, it seems crucial to us to address this topic in the near future.
Another interesting direction for further work is to investigate the use of the hybrid operator (Fig. 3) to compute the complete Tate pairing and not only the final exponentiation. From our experiments, this operator should offer a competitive balance between the area-efficient unified operators of [7] and the latency-oriented architectures presented here. 
