Abstract. This paper presents a novel method for designing compact yet efficient hardware implementations of the Tate pairing over supersingular curves in small characteristic. Since such curves are usually restricted to lower levels of security because of their bounded embedding degree, aiming for the recommended security of 128 bits implies considering them over very large finite fields. We however manage to mitigate this effect by considering curves over field extensions of moderately-composite degree, hence taking advantage of a much easier tower field arithmetic. This technique of course lowers the security on the curves, which are then vulnerable to Weil descent attacks, but a careful analysis allows us to maintain their security above the 128-bit threshold. As a proof of concept of the proposed method, we detail an FPGA accelerator for computing the Tate pairing on a supersingular curve over $\mathbb{F}_{3^{5 \cdot 97}}$, which satisfies the 128-bit security target. On a mid-range Xilinx Virtex-4 FPGA, this accelerator computes the pairing in 2.2 ms while requiring no more than 4755 slices.
Introduction
Pairings were first introduced in cryptography in 1993 by Menezes, Okamoto, & Vanstone [36] and Frey & Rück [24] as an attack against the elliptic curve discrete logarithm problem (ECDLP) for some families of curves over finite fields. Since then, constructive properties of pairings have also been discovered and exploited in several cryptographic protocols: starting independently in 2000 with Joux's one-round tripartite Diffie-Hellman key agreement [31] and the Sakai-Ohgishi-Kasahara cryptosystem [46], many others have followed, such as the Mitsunari-Sakai-Kasahara broadcast encryption scheme [39], Boneh-Franklin identity-based encryption [12] or the Boneh-Lynn-Shacham short signature [13]. Pairings nowadays being the cornerstone of various protocols, their efficient implementation on a wide range of targets has become a major challenge, especially in low-resource environments.
Although many FPGA implementations of pairing accelerators have been proposed [2, 6, 7, 9, 30, 34, 43, 47], none of them reaches the AES-128 security level. However, recent ASIC implementations of pairings over Barreto-Naehrig (BN) [4] curves with 128 bits of security have been published [22, 33]. The main difficulty in computing a pairing at the 128-bit security level is to implement efficient arithmetic over a quite large finite field.
In contrast with the ASIC implementations, we chose to implement pairings over supersingular elliptic curves over small-characteristic finite fields so as to benefit from the many optimizations available in the literature. As a drawback, since supersingular curves are restricted to low embedding degrees, this implies considering unbalanced settings, where the curve offers potentially much more security than the required 128 bits. Nonetheless we took advantage of this excess of security and defined our curves over finite fields of composite extension degree: on the one hand, the curves might be weaker because of, for instance, the Gaudry-Hess-Smart attack [17, 26, 27]; on the other hand, the arithmetic algorithms can really benefit from this tower field structure. This article is devoted to the demonstration that this compromise is very effective in the context of a low-resource hardware implementation.
After a reminder on the Tate pairing and its security in a general context (Section 2), we present the consequences on security of defining an elliptic curve over a composite-extension field (Section 3). We then detail the algorithms for computing the Tate pairing over such curves in Section 4 and present a low-area FPGA accelerator implementing these algorithms for a test-case curve in Section 5. Finally we report our performance results and compare them against other implementations from the literature (Section 6) and conclude in Section 7.
Definition and security of the Tate pairing
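Given an elliptic curve $E$ defined over a finite field $\mathbb{F}_q$, take a prime number $\ell$ dividing the cardinal of the curve $\#E(\mathbb{F}_q)$. The embedding degree $k$ of $E$ is then defined as the smallest integer such that $\ell \mid q^k - 1$, that is to say such that the group of $\ell$-th roots of unity $\mu_\ell = \{x \in \overline{\mathbb{F}}_q \mid x^\ell = 1\}$ is in $\mathbb{F}_{q^k}^*$. Assuming further that $k > 1$ and that there are no points of order $\ell^2$ in $E(\mathbb{F}_{q^k})$, we can then define the Tate pairing over $E$ as the map:

$$e : E(\mathbb{F}_q)[\ell] \times E(\mathbb{F}_{q^k})[\ell] \to \mu_\ell \subset \mathbb{F}_{q^k}^*.$$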
The embedding degree $k$, also called security multiplier in this context, acts as a cursor to adjust the size of the multiplicative group $\mathbb{F}_{q^k}^*$ with respect to that of $\mathbb{F}_q$, which directly constrains $\#E(\mathbb{F}_q)$ through Hasse's bounds, therefore limiting the achievable values of $\ell$. Given that the discrete logarithm problem (DLP) is exponential in the subgroup $E(\mathbb{F}_q)[\ell]$ but subexponential in the finite field $\mathbb{F}_{q^k}^*$, one has to choose a security multiplier $k$ that balances the security on both the input and the output of the Tate pairing.
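As an illustration only, the definition of the embedding degree can be checked numerically on toy parameters. In the sketch below, the helper name `embedding_degree`, the bound `k_max` and the two tiny examples (a 5-point curve over $\mathbb{F}_2$ and a 7-point curve over $\mathbb{F}_3$) are assumptions made for the example, not parameters used in this work.

```python
def embedding_degree(q: int, ell: int, k_max: int = 64) -> int:
    """Smallest k >= 1 such that ell divides q**k - 1 (hypothetical helper)."""
    power = 1
    for k in range(1, k_max + 1):
        power = (power * q) % ell      # power = q^k mod ell
        if power == 1 % ell:
            return k
    raise ValueError("no embedding degree found up to k_max")


# Toy supersingular examples over the prime fields F_2 and F_3.
print(embedding_degree(2, 5))  # 4
print(embedding_degree(3, 7))  # 6
```

The two values match the embedding degrees 4 and 6 of supersingular curves in characteristic 2 and 3.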
As we are targeting the AES-128 security level, elliptic curves with an embedding degree between 12 and 15 seem to be a good choice. Barreto-Naehrig (BN) curves are a family of such curves with prime cardinal $\ell = \#E(\mathbb{F}_q)$ and embedding degree $k = 12$ [4]; as a result BN curves perfectly balance the security between the $\ell$-torsion and $\mu_\ell$ at the 128-bit level. However, since BN curves are defined over prime fields, computing a pairing over them requires expensive modular arithmetic, which is far less suited to hardware implementation than arithmetic over small-characteristic finite fields. Last but not least, BN curves are ordinary curves: point doubling and tripling formulae are not as efficient as in the supersingular case in characteristic 2 and 3 respectively.
As a consequence, we chose to consider supersingular elliptic curves even if their embedding degree is bounded by 6 [3]. Due to this bound, the security on the curve will be too high with respect to the security on $\mu_\ell$. We however decided to take advantage of this: using finite fields with composite extension degree will decrease the security on the curves but make the field arithmetic better suited to low-resource hardware implementations. Those points will be detailed and quantified in the next two sections.
We now detail the definition, security and computation of the Tate pairing over the considered supersingular elliptic curves.
Pairing over supersingular elliptic curves
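Our study focuses on pairings on supersingular curves over finite fields $\mathbb{F}_q$ with $q = p^m$ and $p = 2$ or $3$. We thus define the two following families [3]:

$$E_{2,b} : y^2 + y = x^3 + x + b,$$

where $b \in \{0, 1\}$; and

$$E_{3,b} : y^2 = x^3 - x + b, \quad \text{where } b \in \{-1, 1\}.$$

When $m$ is coprime to 2 and 6 in characteristic 2 and 3 respectively, the cardinal of those curves reaches the Hasse bounds:

$$\#E_{2,b}(\mathbb{F}_{2^m}) = 2^m + 1 \pm 2^{(m+1)/2} \quad \text{and} \quad \#E_{3,b}(\mathbb{F}_{3^m}) = 3^m + 1 \pm 3^{(m+1)/2}.$$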
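Moreover, their embedding degree is 4 and 6 in characteristic 2 and 3, respectively. Thanks to their supersingularity, there exists a distortion map $\psi$ over those elliptic curves, mapping the $\mathbb{F}_q$-rational $\ell$-torsion group to another subgroup of $E(\mathbb{F}_{q^k})[\ell]$, which is used to define the modified Tate pairing as:

$$\hat{e} : E(\mathbb{F}_q)[\ell] \times E(\mathbb{F}_q)[\ell] \to \mu_\ell \subset \mathbb{F}_{q^k}^*, \quad (P, Q) \mapsto e(P, \psi(Q)),$$

where $e$ denotes the Tate pairing.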
inria-00539926, version 1 -25 Nov 2010
One can furthermore show that $\hat{e}$ is not degenerate. We refer the reader to [3] and [9, Table I] for the mathematical details of pairing construction over supersingular curves.
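The cardinalities of the two families can be checked by brute force in the smallest case $m = 1$. The sketch below assumes the usual equations of the families from [3], namely $y^2 + y = x^3 + x + b$ in characteristic 2 and $y^2 = x^3 - x + b$ in characteristic 3; the function names are hypothetical and the exhaustive search is only meant for these toy prime fields.

```python
def count_points_char2(b: int) -> int:
    """Points of y^2 + y = x^3 + x + b over F_2, including the point at infinity."""
    affine = sum(
        1
        for x in range(2)
        for y in range(2)
        if (y * y + y - (x**3 + x + b)) % 2 == 0
    )
    return affine + 1


def count_points_char3(b: int) -> int:
    """Points of y^2 = x^3 - x + b over F_3, including the point at infinity."""
    affine = sum(
        1
        for x in range(3)
        for y in range(3)
        if (y * y - (x**3 - x + b)) % 3 == 0
    )
    return affine + 1


# For m = 1 the expected cardinalities are p + 1 +/- p.
print([count_points_char2(b) for b in (0, 1)])   # [5, 1]
print([count_points_char3(b) for b in (1, -1)])  # [7, 1]
```

For realistic sizes of $m$ such an enumeration is of course out of reach; the closed-form cardinalities are used instead.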
Attacks against pairings over supersingular curves
The security of the pairing is determined by the difficulty of the discrete logarithm problem (DLP) on the input curve and on the output multiplicative group.
Since $\ell$ is a prime, the best known algorithm to attack the DLP on the $\ell$-torsion is Pollard's ρ method [42], which requires an average of $\sqrt{\pi \ell / 2}$ group operations. As Duursma et al. showed in [19, 25], we should take into account the group of automorphisms on the curve, which has order 24 and 12 in characteristic 2 and 3, respectively.

Additionally, one may attack the DLP on $\mu_\ell \subset \mathbb{F}_{q^k}^*$; this is the fundamental idea behind the attacks of Menezes, Okamoto, & Vanstone [36] and Frey & Rück [24]. Since the $\ell$-th roots of unity are defined in the multiplicative group of a finite field, the DLP may be attacked by sieving algorithms. In our case, where the characteristic $p$ is 2 or 3, one can use the function field sieve (FFS) [1]; the complexity of this attack is subexponential:

$$\exp\left(\left(\sqrt[3]{32/9} + o(1)\right) \cdot (\ln q^k)^{1/3} \cdot (\ln \ln q^k)^{2/3}\right).$$
If we consider our 128-bit security level target, we need to take m between 1100 and 1200 in characteristic 2 and around 500 in characteristic 3.
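The orders of magnitude quoted above can be reproduced with a rough estimate. The following sketch is not the exact cost model used for our tables: it assumes the textbook average of $\sqrt{\pi \ell / 2}$ group operations for Pollard's ρ (automorphisms ignored) and the usual FFS complexity $L[1/3, (32/9)^{1/3}]$ with the $o(1)$ term dropped. The function names are hypothetical.

```python
import math


def pollard_rho_bits(log2_ell: float) -> float:
    """log2 of sqrt(pi * ell / 2), given log2(ell); automorphisms are ignored."""
    return 0.5 * (math.log2(math.pi / 2) + log2_ell)


def ffs_bits(p: int, extension_degree: int) -> float:
    """log2 of exp(c * (ln Q)^(1/3) * (ln ln Q)^(2/3)) with Q = p^extension_degree.

    Assumes c = (32/9)^(1/3) and drops the o(1) term.
    """
    ln_q = extension_degree * math.log(p)
    c = (32 / 9) ** (1 / 3)
    return c * ln_q ** (1 / 3) * math.log(ln_q) ** (2 / 3) / math.log(2)


# Output groups F_{2^(4m)} and F_{3^(6m)} for extension degrees in the quoted ranges.
for p, k, m in [(2, 4, 1150), (3, 6, 500)]:
    print(p, m, round(ffs_bits(p, k * m), 1))
```

Under these assumptions both settings land close to the 128-bit target, which is consistent with the ranges of $m$ given above.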
Elliptic curves over composite-extension fields
We examine, in this section, the consequences on security of defining supersingular elliptic curves over a finite field of the form $\mathbb{F}_{q^n}$, where $q = p^m$, $n$ is a small integer and $m$ a prime. This corresponds to substituting $q^n$ for $q$ and $m \cdot n$ for $m$ in the previous section.
It is important to remark that such elliptic curves defined over composite-extension fields have already been described for cryptographic use under the name Trace-Zero Variety (TZV) [23]. Applying the Weil descent to $E(\mathbb{F}_{q^n})$, we obtain an isomorphic variety $W_E(\mathbb{F}_q)$ which is also isomorphic to the product $E(\mathbb{F}_q) \times B(\mathbb{F}_q)$ where $B(\mathbb{F}_q)$ is the TZV. It is a variety defined over the base field $\mathbb{F}_q$ which might also be represented as the quotient $E(\mathbb{F}_{q^n})/E(\mathbb{F}_q)$. As we consider in this work an $\ell$-torsion subgroup of $E(\mathbb{F}_{q^n})$ which is not contained in $E(\mathbb{F}_q)$, this $\ell$-torsion is a subgroup of the corresponding TZV. In the context of pairings, TZVs have also been studied, chiefly for point compression [16, 44, 45].
The Gaudry-Hess-Smart attack
As soon as one defines a curve over a field of composite extension degree, one should also consider other attacks: the Weil descent can indeed be applied to those curves and has some "destructive facets." The Weil descent allows one to map an elliptic curve defined over $\mathbb{F}_{q^n}$ to the Jacobian of a curve of genus at least $n$ over $\mathbb{F}_q$.
Thus the discrete logarithm problem on the elliptic curve defined over $\mathbb{F}_{q^n}$ might be transported to the DLP on the Jacobian of a genus-$n$ curve over $\mathbb{F}_q$. This last DLP can then be solved using an index calculus algorithm. Gaudry, Hess, & Smart have shown that this attack (GHS) runs in $\tilde{O}(q^{2 - 2/n})$ in some cases (Weil restrictions) [27]. More generally Gaudry [26] and Diem [17] showed that this also holds in the general case, but with a very bad dependency on $n$ (hidden in the big-O notation).
The static Diffie-Hellman problem
Recent studies [28, 32] showed that defining a curve over a finite field of composite extension degree makes it weaker regarding the static Diffie-Hellman problem (SDH). The SDH problem on a curve consists in: given two points P, [...]

The cryptographic consequence of solving the SDH problem is breaking the Diffie-Hellman key exchange protocol when one participant never changes his private key, as occurs in the ElGamal encryption scheme for instance [20].
Granger discovered the best known algorithm that solves the SDH problem on elliptic curves defined over a field of composite extension degree $\mathbb{F}_{q^n}$ with $O(q^{1 - \frac{1}{n+1}})$ calls to the oracle and in $\tilde{O}(q^{1 - \frac{1}{n+1}})$ time [28]. One should notice that the attacker needs not only great computational power but also a great number of calls to the oracle: a simple but efficient protection against this attack is revoking a key after a certain amount of use.
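To give an idea of the magnitudes involved, the two complexities above can be evaluated by keeping only the exponent of $q$. This is a rough sketch that ignores all constants and logarithmic factors hidden in the $O$ and $\tilde{O}$ notations; the function names are hypothetical, and the example parameters $q = 3^{97}$, $n = 5$ are those of the test-case field $\mathbb{F}_{3^{5 \cdot 97}}$.

```python
import math


def ghs_bits(p: int, m: int, n: int) -> float:
    """log2 of q^(2 - 2/n) with q = p^m (hidden constants ignored)."""
    return (2 - 2 / n) * m * math.log2(p)


def sdh_bits(p: int, m: int, n: int) -> float:
    """log2 of q^(1 - 1/(n+1)) with q = p^m (hidden constants ignored)."""
    return (1 - 1 / (n + 1)) * m * math.log2(p)


# Curve over F_{3^(5*97)}: q = 3^97 and n = 5.
print(round(ghs_bits(3, 97, 5), 1))  # 246.0
print(round(sdh_bits(3, 97, 5), 1))  # 128.1
```

With these crude estimates, both attacks stay above the 128-bit threshold for this choice of $q$ and $n$.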
Finding curves with 128-bit security level
To the best of our knowledge, the literature does not mention any other attack on curves over fields of composite extension degree.
In order to find suitable curves for our method, we enumerated all the supersingular curves of characteristic 2 and 3 on fields with moderately-composite extension degrees m · n (n < 15) large enough for the 128-bit security level. We then evaluated an approximation (constants hidden in big-O are not taken into account) of the computation time of each of the attacks mentioned in the paper: Pollard's ρ, FFS, GHS and SDH. A selection of curves reaching the 128-bit level of security is given in Table 1 ; since that is not necessarily a security issue for all protocols, we also present curves that are not resistant to Granger's SDH attack.
[Table 1: columns $q$, $n$, $b$, $\log_2 \ell$ and cost of the attacks in bits (Pollard's ρ, FFS, GHS, SDH)]

The main difficulty in computing Table 1 is to factor the cardinals of the different curves because they contain more than 350 digits in characteristic 2 and 240 in characteristic 3. Luckily those cardinals are the Aurifeuillean factors of Cunningham numbers and many of them are referenced in the factor tables maintained by Wagstaff [49] and Leyland [35].
The security estimations given in Table 1 confirm the intuition: the more composite the extension degree of the field of definition, the more effective the attacks using Weil descent, until they become the best attack on the curves.
As a proof of concept, we finally chose to implement the pairing over the supersingular curve $E_{3,-1}$ over $\mathbb{F}_{3^{5 \cdot 97}}$, as this curve has an embedding degree equal to 6 and is resistant to all the attacks, even for the SDH problem.
4 Computation of the Tate pairing over composite-extension fields
As we have identified some curves that allow us to reach the 128-bit level of security, we now focus on the algorithms for computing the pairing over such curves.
Algorithms for computing the Tate pairing
The computation of the Tate pairing is split into two parts: Miller's loop [37, 38] and a final exponentiation in the multiplicative group $\mathbb{F}_{q^{k \cdot n}}^*$. Many improvements of Miller's algorithm have been published since its discovery. Duursma & Lee adapted it to exploit the simple point-tripling formulae in characteristic 3 by turning the double-and-add into a triple-and-add algorithm [18]. Furthermore Barreto et al. put forward the $\eta_T$ approach which halves the length of the loop by exploiting the action of the Verschiebung on the $\ell$-torsion [5].
Those improvements and a careful implementation of the arithmetic of the extension over $\mathbb{F}_{q^{k \cdot n}}$ lead to the algorithms presented by Beuchat et al. in [6, 8].
To implement the pairing of our test case, we chose the unrolled loop algorithm in [8, Algorithm 5] because it minimizes the number of multiplications over the field of definition $\mathbb{F}_{q^n}$, which represent the major cost on a field large enough to reach the AES-128 security level. Moreover this algorithm requires only additions, multiplications and cubings over $\mathbb{F}_{q^n}$ but no cube root; therefore it represents a substantial saving in hardware resources.
We have now determined the sequence of operations in $\mathbb{F}_{q^n}$ to compute the $\eta_T$ pairing over $\mathbb{F}_{q^n}$. Nonetheless we want to design compact hardware to execute them: the datapath of a circuit directly handling elements of $\mathbb{F}_{q^n}$ would be very large. Therefore we take advantage of the composite extension degree of our field of definition and implement the pairing as a sequence of operations over $\mathbb{F}_q$: the datapath of a coprocessor dealing with elements of $\mathbb{F}_q$ only will be much smaller. Thus we have to express the arithmetic of $\mathbb{F}_{q^n}$ in terms of operations over $\mathbb{F}_q$ in an efficient way.
Representation and computation over the extension
Pairing computation requires a large number of multiplications. Using a normal basis would thus be very harmful. As a consequence $\mathbb{F}_{q^n}$ is represented using a polynomial basis:

$$\mathbb{F}_{q^n} \cong \mathbb{F}_q[X] / (f(X)),$$

where $f$ is a degree-$n$ irreducible polynomial over $\mathbb{F}_q$. Hence an element of $\mathbb{F}_{q^n}$ is represented as a polynomial of degree at most $n - 1$ over $\mathbb{F}_q$, and operations over $\mathbb{F}_{q^n}$ are mapped to operations over $\mathbb{F}_q[X]$ followed if necessary by a reduction modulo $f$.
The irreducible polynomial $f$ could be taken among all irreducible polynomials of degree $n$ over $\mathbb{F}_q$ but we restricted this choice to polynomials over $\mathbb{F}_p$ in order to avoid multiplications over $\mathbb{F}_q$ during the different reductions modulo $f$. This is possible because $n$ is coprime to $m$. We also chose $f$ to have a low Hamming weight, i.e. a trinomial or a pentanomial, so as to further reduce the cost of the reductions.
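The benefit of a sparse $f$ with coefficients in $\mathbb{F}_p$ can be seen on a toy model. The sketch below is an illustration under simplifying assumptions: the coefficients live in $\mathbb{F}_3$ instead of a large field $\mathbb{F}_{3^m}$, and the trinomial $f(X) = X^5 - X + 1$ (irreducible over $\mathbb{F}_3$) is chosen for the example. All function names are hypothetical.

```python
P = 3   # characteristic; toy coefficient field F_3 standing in for F_q
N = 5   # extension degree n; f(X) = X^5 - X + 1 is an assumption of this example


def poly_mul(a, b):
    """Schoolbook product of two coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % P
    return out


def reduce_mod_f(c):
    """Reduce modulo f using X^5 = X - 1: only additions and subtractions."""
    c = list(c) + [0] * max(0, N - len(c))
    for d in range(len(c) - 1, N - 1, -1):
        top = c[d]
        c[d] = 0
        c[d - N + 1] = (c[d - N + 1] + top) % P   # + top * X^(d-4)
        c[d - N] = (c[d - N] - top) % P           # - top * X^(d-5)
    return c[:N]


def field_mul(a, b):
    """Product in F_3[X] / (f): polynomial product followed by reduction."""
    return reduce_mod_f(poly_mul(a, b))


a = [1, 2, 0, 1, 2]
b = [2, 0, 1, 1, 0]
print(field_mul(a, b) == field_mul(b, a))  # True
```

Because $f$ is a trinomial over $\mathbb{F}_3$, each reduction step costs one addition and one subtraction of coefficients, and no coefficient multiplication.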
Frobenius automorphism over $\mathbb{F}_{q^n}$. During the pairing computation, many iterated applications of the Frobenius, i.e. $p^i$-th powering, are required. By linearity of this operation, we have:

$$\left(\sum_{j=0}^{n-1} a_j X^j\right)^{p^i} = \sum_{j=0}^{n-1} a_j^{p^i} X^{j p^i}.$$

Moreover we have that $X^{p^n} \equiv X \pmod{f}$ because $f$ is defined over $\mathbb{F}_p$. Therefore computing the $i$-th iterated Frobenius over $\mathbb{F}_{q^n}$ is tantamount to computing the $i$-th iterated Frobenius over all coefficients and then applying a linear combination on them that only depends on the value of $i \bmod n$.
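This decomposition of the Frobenius can be illustrated on a toy field. The sketch below assumes coefficients in $\mathbb{F}_3$ (so that the Frobenius on each coefficient is trivial, whereas coefficients in $\mathbb{F}_{3^m}$ would also need $i$ cubings each) and the example trinomial $f(X) = X^5 - X + 1$; the helper names are hypothetical.

```python
P, N = 3, 5  # toy setting: F_{3^5} = F_3[X] / (X^5 - X + 1), an assumed example


def field_mul(a, b):
    """Product in F_3[X] / (X^5 - X + 1), lowest-degree coefficient first."""
    c = [0] * (2 * N - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % P
    for d in range(2 * N - 2, N - 1, -1):   # X^5 = X - 1
        top, c[d] = c[d], 0
        c[d - N + 1] = (c[d - N + 1] + top) % P
        c[d - N] = (c[d - N] - top) % P
    return c[:N]


def cube(a):
    """Reference cubing computed with two field multiplications."""
    return field_mul(field_mul(a, a), a)


def frobenius_table(i):
    """Rows X^(j * 3^i) mod f for j = 0..N-1; depends only on i mod N."""
    rows = []
    for j in range(N):
        row = [1 if t == j else 0 for t in range(N)]   # the monomial X^j
        for _ in range(i % N):
            row = cube(row)
        rows.append(row)
    return rows


def frobenius(a, i):
    """a^(3^i) as a linear combination of the precomputed rows."""
    table = frobenius_table(i)
    out = [0] * N
    for j, aj in enumerate(a):          # a_j^(3^i) = a_j since a_j is in F_3
        for t in range(N):
            out[t] = (out[t] + aj * table[j][t]) % P
    return out


a = [1, 2, 0, 1, 2]
print(frobenius(a, 1) == cube(a))  # True
```

In hardware, the table would be hard-wired for each value of $i \bmod n$, so that no multiplication is needed.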
Multiplications over $\mathbb{F}_{q^n}$. Multiplication is the most expensive operation and it can be greatly optimized by using subquadratic multiplication schemes. Choosing the best algorithm to compute the product of two degree-$(n - 1)$ polynomials depends on many criteria and we studied how different solutions fit our case. Many subquadratic multiplication algorithms can be used: Karatsuba, Montgomery's Karatsuba-like formulae [21, 40], or CRT-based algorithms [14, 15]. The common point between those algorithms is that they can all be expressed as the linear combination of a set of products of linear combinations of the coefficients of the operands.
The Toom-Cook algorithm and its variants cannot be used easily in the case of polynomials over low-characteristic fields, as they are based on an evaluate-interpolate scheme. To be efficient, evaluation points, their inverses, and their successive powers should have a small representation. However, we cannot find enough "simple elements" in low-characteristic fields: taking interpolation points in $\mathbb{F}_q$ instead of $\mathbb{F}_p$ will increase the number of multiplications and defeat the whole point of the method.
Furthermore, as we will see in Section 5.1, additions do not have a negligible cost when compared to multiplications as it is often assumed in estimations of multiplication complexity. Thus we have to express the formulae given by the different algorithms and count the total number of operations of each type.
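The trade-off between sub-products and additions can be made concrete by counting coefficient multiplications. The sketch below is only an illustration: it implements the schoolbook method and a plain recursive Karatsuba on a toy coefficient field $\mathbb{F}_3$, not the Montgomery or CRT-based formulae discussed above, and the function names and the `counter` argument are hypothetical.

```python
P = 3  # toy coefficient field F_3 standing in for F_q


def schoolbook(a, b, counter):
    """Quadratic product; counter[0] accumulates coefficient multiplications."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            counter[0] += 1
            out[i + j] = (out[i + j] + ai * bj) % P
    return out


def karatsuba(a, b, counter):
    """Recursive Karatsuba for two coefficient lists of equal length."""
    n = len(a)
    if n == 1:
        counter[0] += 1
        return [(a[0] * b[0]) % P]
    h = (n + 1) // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    pad = [0] * (h - len(a1))
    sa = [(x + y) % P for x, y in zip(a0, a1 + pad)]
    sb = [(x + y) % P for x, y in zip(b0, b1 + pad)]
    z0 = karatsuba(a0, b0, counter)      # low * low
    z2 = karatsuba(a1, b1, counter)      # high * high
    zm = karatsuba(sa, sb, counter)      # (low + high) * (low + high)
    out = [0] * (2 * n - 1)
    for i, v in enumerate(z0):
        out[i] = (out[i] + v) % P
        out[i + h] = (out[i + h] - v) % P
    for i, v in enumerate(z2):
        out[i + 2 * h] = (out[i + 2 * h] + v) % P
        out[i + h] = (out[i + h] - v) % P
    for i, v in enumerate(zm):
        out[i + h] = (out[i + h] + v) % P
    return out


a, b = [1, 2, 0, 1, 2], [2, 0, 1, 1, 0]
quad, sub = [0], [0]
print(schoolbook(a, b, quad) == karatsuba(a, b, sub))  # True
print(quad[0], sub[0])                                 # 25 17
```

The Karatsuba variant saves multiplications at the price of extra additions and subtractions, which is exactly the cost that has to be weighed against the architecture.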
Inversion over $\mathbb{F}_{q^n}$. During the final exponentiation step of the pairing computation, an inversion over $\mathbb{F}_{q^n}$ has to be carried out. Because there is only one inversion in the whole pairing computation, there is no gain in dedicating specific hardware resources to speed up its computation. However, thanks to the Itoh-Tsujii algorithm [29], which consists in applying Fermat's little theorem, the inversion over $\mathbb{F}_{q^n}$ is computed with $(n - 1) \cdot m$ applications of the Frobenius in $\mathbb{F}_{q^n}$, some multiplications over $\mathbb{F}_{q^n}$ and one inversion over $\mathbb{F}_q$. We also used the Itoh-Tsujii algorithm to compute this last inversion over $\mathbb{F}_q$, and thus do not need any other inversion, since inversion over $\mathbb{F}_p$ is the identity when $p = 2$ or $3$.
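The principle of the Itoh-Tsujii inversion can be sketched on a toy field. The example below assumes $q = 3$, $n = 5$ and the trinomial $f(X) = X^5 - X + 1$, so that the base-field inversion is trivial; it uses a straightforward chain of Frobenius applications rather than an optimized addition chain, and the function names are hypothetical.

```python
P, N = 3, 5  # toy setting: q = 3, n = 5, f(X) = X^5 - X + 1 (assumed example)


def field_mul(a, b):
    """Product in F_3[X] / (X^5 - X + 1), lowest-degree coefficient first."""
    c = [0] * (2 * N - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % P
    for d in range(2 * N - 2, N - 1, -1):   # X^5 = X - 1
        top, c[d] = c[d], 0
        c[d - N + 1] = (c[d - N + 1] + top) % P
        c[d - N] = (c[d - N] - top) % P
    return c[:N]


def frobenius_q(a):
    """q-th power; here q = 3, so this is a single cubing."""
    return field_mul(field_mul(a, a), a)


def itoh_tsujii_inverse(a):
    """a^(-1) = a^(r-1) * (a^r)^(-1) with r = 1 + q + ... + q^(n-1)."""
    conj = a
    prod = [1] + [0] * (N - 1)
    for _ in range(N - 1):       # a^(r-1) = a^q * a^(q^2) * ... * a^(q^(n-1))
        conj = frobenius_q(conj)
        prod = field_mul(prod, conj)
    norm = field_mul(a, prod)    # a^r lies in the base field F_q
    if norm[0] == 0 or any(norm[1:]):
        raise ZeroDivisionError("element is not invertible")
    norm_inv = norm[0]           # in F_3, x^(-1) = x for x = 1, 2
    return [(norm_inv * c) % P for c in prod]


a = [1, 2, 0, 1, 2]
print(field_mul(a, itoh_tsujii_inverse(a)))  # [1, 0, 0, 0, 0]
```

The only true inversion is the one of the norm $a^r$ in the base field, which is why a dedicated inversion unit is unnecessary.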
Our test case: $\mathbb{F}_{3^{5 \cdot 97}}$
We chose to construct the extension for our test case as [...]
- the quadratic, so-called schoolbook method;
- one-level Karatsuba, where the sub-products are computed using the schoolbook method;
- recursive Karatsuba, where the sub-products are also computed thanks to the Karatsuba algorithm;
- Montgomery's Karatsuba-like formulae [40];
- the algorithm based on the Chinese Remainder Theorem (CRT) by Cenk & Özbudak [14] (cf. Section A for the detailed algorithm).
Since $n = 5$ is odd, Montgomery's trick [40, Section 2.3] for applying the Karatsuba formulae can be used and saves one extra sub-product. As we have now expressed a variety of algorithms for multiplication over $\mathbb{F}_{3^{5 \cdot 97}}$, choosing one of them is a matter of algorithm-architecture co-design. Indeed, timing for each algorithm heavily depends on:
- the cost of multiplication over $\mathbb{F}_{3^{97}}$ compared to the addition,
- the data dependencies, and
- the scheduling of the operations with regard to the memory architecture.
Finally, it turned out that the algorithm by Cenk & Özbudak [14] best fitted our arithmetic coprocessor (cf. Section 5). In conclusion, the overall cost of the arithmetic over the extension field $\mathbb{F}_{3^{5 \cdot 97}}$ is presented in Table 3. Table 4 summarizes the number of operations over the field $\mathbb{F}_{3^{97}}$ and its extension $\mathbb{F}_{3^{5 \cdot 97}}$ needed to perform Miller's loop and the final exponentiation from [8].
As we have now reduced the pairing computation to a sequence of operations over $\mathbb{F}_q$ with $q = p^m$, we need a coprocessor able to perform additions, multiplications and Frobenius applications (squarings and cubings) over this field. To this end, we chose the coprocessor that Beuchat et al. developed for the final exponentiation in [10].
The architecture of this coprocessor is reproduced in Fig. 1 and is composed of three units running in parallel: a register file implemented by means of a dual-ported RAM, a unit performing additions and Frobenius applications, and a parallel-serial multiplier. Several direct feedback paths exist between the inputs and outputs of the units, for instance allowing a product to be used in an [...]

[...] Table 4) but long sequences of iterated squarings or cubings occur several times. The coprocessor is designed to fit this observation: the addition unit shares most of its datapath with a Frobenius unit which can carry out both single and double applications of the Frobenius in one clock cycle. One should also notice that there is a direct feedback loop from its output to one of its inputs so as to further speed up sequences of Frobenius applications.
Products are processed in a parallel-serial fashion: at each cycle the first operand is multiplied by $D$ coefficients of the second operand. The complete multiplication over $\mathbb{F}_{p^m}$ is then computed in $\lceil m/D \rceil$ clock cycles. $D$ is a parameter of the processor and is chosen as a trade-off between the computation time of the multiplication and the operating frequency (a large value of $D$ lengthens the critical path and thus degrades the frequency).
In our case of computing the Tate pairing over $\mathbb{F}_{3^{5 \cdot 97}}$, we chose $D = 14$. The product over $\mathbb{F}_{3^{97}}$ then takes 7 clock cycles, i.e. 7 times longer than an addition. Given this cost ratio between multiplications and additions, the multiplication algorithm over $\mathbb{F}_{3^{5 \cdot 97}}$ by Cenk & Özbudak fits the coprocessor best, that is to say we managed to find a scheduling of the algorithm that hides all the additions behind the 12 multiplications over $\mathbb{F}_{3^{97}}$. A multiplication algorithm with fewer sub-products and more additions would not yield a better execution time since the bottleneck would be in the memory accesses. Indeed the memory ports are nearly saturated in our scheduling of Cenk & Özbudak's algorithm.
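The cycle counts above follow directly from the parameters. The short sketch below recomputes them; the function name is hypothetical, and the last figure is only a lower bound obtained by assuming that the single multiplier is kept busy and that all additions are hidden behind the products.

```python
def multiplication_cycles(m: int, d: int) -> int:
    """Cycles of a parallel-serial multiplier handling d coefficients per cycle."""
    return (m + d - 1) // d   # integer ceiling of m / d


# Test case: m = 97 and D = 14.
print(multiplication_cycles(97, 14))       # 7
# One product over the degree-5 extension using 12 sub-products over F_{3^97}.
print(12 * multiplication_cycles(97, 14))  # 84
```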
Micro-and macrocode
Considering the total number of multiplications over $\mathbb{F}_q$ (cf. Table 4) and their cost, the pairing needs a minimum of 260 000 clock cycles to be calculated.
During those cycles, the 36 control bits (the $c_i$'s in Fig. 1) should be set: this represents a total amount of 10 Mbit of memory for the pairing program. Thus we cannot store those control bits directly in an instruction memory: it would use up much more resources than the coprocessor itself.
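The order of magnitude of this memory requirement can be recomputed as follows. This is a back-of-the-envelope sketch, assuming one 36-bit control word per cycle, the lower bound of 260 000 cycles, and 1 Mbit = $10^6$ bits; the function name is hypothetical.

```python
def raw_program_bits(cycles: int, control_bits: int) -> int:
    """Bits needed to store one control word per clock cycle."""
    return cycles * control_bits


bits = raw_program_bits(260_000, 36)
print(bits)                   # 9360000
print(round(bits / 1e6, 2))   # 9.36
```

This is in line with the figure of roughly 10 Mbit given above.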
In order to reduce instruction memory requirements, we implemented two levels of code. In the lower one, the microcode, we implemented the arithmetic over the extension $\mathbb{F}_{3^{5 \cdot 97}}$. These operations are called in a macro-program that computes the actual pairing. Given that the non-reduced pairing is computed thanks to Miller's loop, we also constructed a loop mechanism in the macrocode.
Finally the implementation of the Tate pairing over $E(\mathbb{F}_{3^{5 \cdot 97}})$ is a sequence of 464 macro-operations which takes 428 853 clock cycles to be executed. Although microcoding implies a loss of parallelism, it allows us to drastically reduce the size of the instruction memory, which now fits in 24 kbit.
The register file is split into two parts: the first one contains 32 macrovariables (elements of $\mathbb{F}_{3^{5 \cdot 97}}$) and the second serves as a scratch space of 16 temporary variables (elements of $\mathbb{F}_{3^{97}}$) for use inside the microcode. Macrovariables are blocks of 5 consecutive addresses in the register file that are accessed in the microcode thanks to a windowed address mechanism. Since each element of $\mathbb{F}_3$ is represented by 2 bits, the total amount of RAM used is 33 kbit.
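The size of the register file follows from these parameters, as the small computation below shows. The function name and the convention 1 kbit = 1024 bits are assumptions of this sketch.

```python
def register_file_bits(macro_vars=32, n=5, temporaries=16, m=97, bits_per_trit=2):
    """Size of a register file holding elements of F_{3^m} (2 bits per F_3 element)."""
    words = macro_vars * n + temporaries   # number of F_{3^m} elements stored
    return words * m * bits_per_trit


bits = register_file_bits()
print(bits)                    # 34144
print(round(bits / 1024, 1))   # 33.3
```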
Results and comparisons
We prototyped and synthesized our design on a Xilinx mid-range Virtex-4 and also on a Spartan-3, which is more suited to embedded systems. Place-and-route results show that the coprocessor uses 4755 slices and seven 18 kbit RAM blocks of a Virtex-4 (xc4vlx25-11) clocked at 192 MHz, finally computing our test-case pairing in no more than 2.11 ms. Performance on the low-end FPGA is more modest but still interesting: on a Spartan-3 (xc3s1000-5) running at 104 MHz, this pairing can be computed in 4.1 ms using 4713 slices.
To the best of our knowledge, this design is the first FPGA implementation of a pairing reaching 128 bits of security; thus we compared our design to FPGA implementations of less secure pairings (Table 5) , along with ASIC (Table 6) and software (Table 7) implementations of 128-bit security pairings.
The literature about pairing computation on FPGAs only focuses on low-security pairings because they already reach the limit of the available FPGA resources. Indeed the designs presented in [2, 6, 7, 9, 47] have a datapath that handles the field of definition of their respective curves and thus increasing the security means increasing the designs' area. In contrast our approach allows us to "split" elements of the field of definition into smaller parts and thus achieve a smaller area: the coprocessor is very compact compared to the other published architectures. However we have to pay the price of security in terms of computation time: computing a pairing over $E(\mathbb{F}_{3^{5 \cdot 97}})$ (128 bits of security) with our processor is 130 times slower than computing one over $E(\mathbb{F}_{3^{313}})$ (109 bits) with Beuchat et al.'s hardware [9]. It is however 20 times smaller.

The first ASIC implementations of pairings with 128 bits of security were presented in [22, 33]. The two implementations use BN curves so as to exploit their optimal embedding degree $k = 12$ while targeting 128 bits of security. Although we did not synthesize our design on ASIC, a very rough and pessimistic estimation places our coprocessor around the 100-kGate mark, not counting the register file, that is to say roughly the same area as required by the two accelerators presented in Table 6. We also use 33 kbit of dual-ported RAM: a bit more than Fan et al. and half of the amount used by Kammler et al. As a result, our architecture seems to be very comparable with the ones from [22, 33] in terms of area, and its performance is also very close to that of the ASICs.

Finally we compared our results against single-core software implementations of 128-bit pairings over supersingular curves [11] and BN curves [41].
Even though we targeted our implementation to embedded systems and low-resource hardware, our timings are very comparable to those of the software implementations: specific hardware for small-characteristic finite field arithmetic proves to be very efficient when compared to software implementations.
Conclusion
We presented a compact hardware implementation of a pairing reaching 128 bits of security, which is well suited for embedded systems. To this end, we showed that the Tate pairing on supersingular curves over composite-extension fields is a pertinent solution, even though their embedding degree $k$ could be deemed too small at first glance. This also demonstrates that the efficiency of the underlying arithmetic plays a key role in pairing computation, and should be taken into account, right along with the size of the base field and the embedding degree, when designing pairing-based cryptosystems.
Furthermore, the idea of using curves defined over finite fields $\mathbb{F}_{q^n}$ of moderately-composite extension degree might be exploited in other areas of cryptography. While targeting the AES-128 level of security, the attacks based on Weil descent do not introduce extra weaknesses on the curve as long as $n$ is kept small enough. This is an interesting result in itself: expanding the fauna of pairing-friendly curves suited to the 128-bit security level is indeed very relevant for cryptography. Moreover, computations on such curves can be carried out in a more efficient and parallel way, which yields better overall performance.
An interesting development of this work is to implement this idea in characteristic 2. Indeed, arithmetic over binary fields is simpler than in characteristic 3; as a consequence, characteristic 2 might also be a good choice, even though the embedding degree is even lower. We are planning to explore this direction in the near future.
Implementing the pairing on all the supersingular elliptic curves shown in Table 1 would also give a better coverage of the area-time trade-off for computing pairings with 128 bits of security: the more composite the extension degree, the smaller the base field $\mathbb{F}_q$ and thus the coprocessor. Additionally, in our approach, products over $\mathbb{F}_q$ are performed thanks to a quadratic scheme but the algorithms used for multiplications over $\mathbb{F}_{q^n}$ are subquadratic; therefore using a larger $n$ for the same size of the field $\mathbb{F}_{q^n}$ might lead to a more efficient multiplication.
Furthermore, Cesena has noticed that the extra structure in curves defined over a composite-degree extension field (or TZVs) leads to a natural parallelization of Miller's algorithm [16]. It might be of interest to design a more parallel accelerator exploiting this fact. Such a circuit might achieve a lower latency for computing the Tate pairing with 128 bits of security at the cost of a larger silicon footprint.
Last but not least, the method presented in this article might scale to higher levels of security. For instance, the curve $E_{3,1}(\mathbb{F}_{3^{17 \cdot 67}})$ reaches 192 bits of security, while keeping the hardware requirements to a minimum. Finding other such curves and comparing them against higher-embedding-degree ordinary curves might help find the crossover point between the two and assess the actual relevance of supersingular elliptic curves in the context of low-resource pairing-based cryptography.
B $\ell$-torsion of presented curves
We provide in this appendix the largest factor of the cardinals of the curves used in Table 1 and the one with 192-bit security level mentioned in the conclusion. 
