Although identity based cryptography offers many functional advantages over conventional public key alternatives, the computational costs are significantly greater. The core computational task is evaluation of a bilinear map, or pairing, over elliptic curves. In this paper we prototype and evaluate polynomial and normal basis field arithmetic on an FPGA device and use it to construct a hardware accelerator for pairings over fields of characteristic three. The performance of our prototype improves roughly ten-fold on previously known hardware implementations and by orders of magnitude on the fastest known software implementation. As a result we reason that even on constrained devices one can usefully evaluate the pairing, a fact that gives credence to the idea that identity based cryptography is an ideal partner for identity aware smart-cards.
Introduction
The notion of identity based cryptography was first proposed by Shamir [25] in 1984. Essentially it allows a user identity, an arbitrary string, to play the role of a public key rather than have the key derived from a relationship with private information as would be the case in traditional schemes such as RSA. This can vastly reduce the amount of certification infrastructure required and generally presents a rich set of functional and security characteristics that are difficult or impossible to realise with other solutions. The first efficient Identity Based Encryption (IBE) scheme was presented by Boneh and Franklin [8] who followed the idea of Sakai, Ohgishi and Kasahara [23] in basing their scheme on bilinear maps, or pairings, over elliptic curves.
Although pairing and identity based cryptography have sparked a wealth of research into cryptographic schemes [7, 11] and proof techniques, it has remained an ongoing task to reduce the computational cost that underpins such work. Theorists have generally worked under the gross assumption that a pairing takes around ten times as long to compute as the major computational task in elliptic curve cryptography (ECC), the point multiplication. Although in reality this ratio is significantly lower, the cost of pairing evaluation still constitutes a major hurdle. This is particularly true in constrained environments such as smart-cards which, due to their use as identity-aware tokens, seem a natural partner for identity based cryptography.
Recently, Gemplus announced that it had developed a smart-card hosted IBE implementation in partnership with the market leaders Voltage Security [27]. Although details are scarce, it seems probable that they use an existing core for $\mathbb{F}_p$ arithmetic to accelerate a software implementation of the BKLS algorithm [4]. This seems the natural decision given the increasing flexibility in parameterisation [3, 5, 19] and the expertise in implementing arithmetic in $\mathbb{F}_p$ accumulated from building conventional ECC and RSA based systems. However, in the short term at least it is attractive to consider working over fields of characteristic three since, when parameterised using suitable supersingular elliptic curves, the resulting system boasts a higher security multiplier [12], given by the MOV embedding degree [20]. Additionally, there are some specialised, high-performance algorithms for computing pairings in this context: the Duursma-Lee algorithm [10], recently improved upon by Kwon [18] and Barreto et al. [2], uses a closed formula for the pairing which is efficient as long as the underlying field arithmetic in $\mathbb{F}_{3^m}$ is also efficient. To this end, previous work has considered the possibility of using polynomial [6, 22] and normal bases [13] to implement said arithmetic. However, such work has focused mainly on arithmetic performance rather than placing the designs in context to actually compute IBE related functions, the exception being Kerins, Popovici and Marnane [17] who quote estimated timings for FPGA hosted pairing hardware using a BKLS style algorithm.
In this paper, our main aims are three-fold: to evaluate the performance and cost of constructing hardware polynomial and normal basis arithmetic in $\mathbb{F}_{3^m}$; to investigate the possibility of constructing a hardware accelerator that is small enough for use in constrained environments; and to prove that pairings over $\mathbb{F}_{3^m}$ using the closed form family of algorithms are a viable alternative to the use of $\mathbb{F}_p$ and BKLS. We prototype our work on an FPGA device and present experimental results comparing performance and cost with previous work in this area. We organise our work as follows: in Section 2 we give an overview of pairings before using Section 3 to present details of arithmetic in $\mathbb{F}_{3^m}$. We then discuss the details of our accelerator architecture and present experimental results in Section 4 before concluding in Section 5.
Algorithm 1: The Duursma-Lee algorithm [10] for calculating the Tate pairing in characteristic three. [Algorithm listing missing; only the input point P and the returned value f survive.]
An Introduction to Pairings
To provide a concrete case for discussion, we use the example of pairings where the base field is of characteristic three, i.e. $\mathbb{F}_q$ where $q = 3^m$. To allow investigation of both polynomial and normal bases we consider the cases $m = 97$ and $m = 89$ respectively. Let $E$ be an elliptic curve over a finite field $\mathbb{F}_q$, and let $\mathcal{O}$ denote the identity element of the associated group of rational points $E(\mathbb{F}_q)$. For a positive integer $l \mid \#E(\mathbb{F}_q)$ coprime to $q$, let $\mathbb{F}_{q^k}$ be the smallest extension field of $\mathbb{F}_q$ which contains the $l$-th roots of unity in $\overline{\mathbb{F}}_q$. Also, let $E(\mathbb{F}_q)[l]$ denote the subgroup of $E(\mathbb{F}_q)$ of all points of order dividing $l$, and similarly for the degree $k$ extension of $\mathbb{F}_q$. Setting $k = 6$, we parameterise $\mathbb{F}_{q^6}$ as a quadratic extension of the cubic extension $\mathbb{F}_{q^3}$ of $\mathbb{F}_q$. For efficient arithmetic in these fields, we refer to the work of Granger et al. [14].
Our choice of prime values for $m$ is motivated by well known security considerations; both our choices offer a security level which is roughly equivalent to 800 to 900-bit RSA. Using a polynomial basis with $m = 97$ provides us with a curve which is well known in the literature and hence a good reference against which to compare our results. However, one can only construct a type-two normal basis where $2m + 1$ is also prime: the most efficient type-one basis is never available. This limits our choices significantly. We settled on $m = 89$ since it is the closest choice to $m = 97$ which affords a suitable parameterisation. For both our choices of $m$, we use the curve $E : Y^2 = X^3 - X + 1$. In the case of $m = 89$ this has an unattractively large cofactor [13]: this parameterisation problem alone might be viewed as a reason not to use a normal basis representation; we stress that our aim in selecting these parameters is performance and cost comparison only.
The Reduced Tate Pairing
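For a thorough treatment of the following, we refer the reader to [4] and also [12], and to [24] for an introduction to divisors. The reduced Tate pairing of order $l$ is the bilinear map
$$e_l : E(\mathbb{F}_q)[l] \times E(\mathbb{F}_{q^k})[l] \to \mathbb{F}_{q^k}^* / (\mathbb{F}_{q^k}^*)^l$$
given by $e_l(P, Q) = f_{P,l}(D)$. Here $f_{P,l}$ is a function on $E$ whose divisor is equivalent to $l(P) - l(\mathcal{O})$, $D$ is a divisor equivalent to $(Q) - (\mathcal{O})$, whose support is disjoint from the support of $f_{P,l}$, and $f_{P,l}(D) = \prod_i f_{P,l}(P_i)^{a_i}$ where $D = \sum_i a_i (P_i)$. It satisfies the following properties:
• For any integer $n$, $e_l([n]P, Q) = e_l(P, [n]Q) = e_l(P, Q)^n$ for all $P \in E(\mathbb{F}_q)[l]$ and $Q \in E(\mathbb{F}_{q^k})[l]$.
• Let $L = hl$. Then $e_l(P, Q)^{(q^k - 1)/l} = e_L(P, Q)^{(q^k - 1)/L}$.
• It is efficiently computable.
The non-degeneracy condition requires that $Q$ is not a multiple of $P$, i.e. that $Q$ is in some order $l$ subgroup of $E(\mathbb{F}_{q^k})$ not contained in $E(\mathbb{F}_q)$. When one computes $f_{P,l}(D)$, the value obtained belongs to the quotient group $\mathbb{F}_{q^k}^* / (\mathbb{F}_{q^k}^*)^l$, and not $\mathbb{F}_{q^k}^*$. One therefore raises this value to the power $(q^k - 1)/l$ to obtain a unique representative of the class; the denominator elimination given by this exponentiation makes it possible to compute $f_{P,l}(Q)$ rather than $f_{P,l}(D)$.
Algorithm 2: The Kwon-BGOS algorithm [18] for calculating the Tate pairing in characteristic three. [Algorithm listing missing.]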
The Modified Tate Pairing
Duursma and Lee introduced their algorithm [10] in the context of pairings on a family of supersingular hyperelliptic curves. The performance of their method was improved upon by Kwon [18] and Barreto et al. [2] who also provide similar algorithms for other characteristics. Let $q = 3^m$ and $E(\mathbb{F}_q) : Y^2 = X^3 - X + b$, with $b = \pm 1$, and let $P = (x_1, y_1)$ and $Q = (x_2, y_2)$ be points of order $l$. Let $\mathbb{F}_{q^3} = \mathbb{F}_q[\rho]/(\rho^3 - \rho - b)$, with $b = \pm 1$ depending on the curve equation, and let $\mathbb{F}_{q^6} = \mathbb{F}_{q^3}[\sigma]/(\sigma^2 + 1)$. Then the modified Tate pairing on $E$ is the mapping $f_P(\phi(Q))$ where $\phi : E(\mathbb{F}_q) \to E(\mathbb{F}_{q^6})$ is the distortion map $\phi(x_2, y_2) = (\rho - x_2, \sigma y_2)$. The Duursma-Lee and Kwon-BGOS methods for computing it are shown in Algorithm 1 and Algorithm 2 respectively. Note that the final result is powered by $q^3 - 1$ to form a result compatible with the BKLS algorithm [4].
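As a quick consistency check on the distortion map, the following derivation verifies that $\phi(Q)$ lies on the curve whenever $Q$ does. It is a sketch that assumes the extension tower is defined by the relations $\rho^3 - \rho - b = 0$ and $\sigma^2 + 1 = 0$, and it uses $(u - v)^3 = u^3 - v^3$ and $2b = -b$ in characteristic three.

```latex
\begin{align*}
(\sigma y_2)^2 &= -y_2^2 = -(x_2^3 - x_2 + b), \\
(\rho - x_2)^3 - (\rho - x_2) + b
  &= (\rho^3 - \rho) - (x_2^3 - x_2) + b \\
  &= b - (x_2^3 - x_2) + b \\
  &= -(x_2^3 - x_2 + b).
\end{align*}
```

Both sides agree, so $(\rho - x_2, \sigma y_2)$ satisfies $Y^2 = X^3 - X + b$ over $\mathbb{F}_{q^6}$.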
Arithmetic in $\mathbb{F}_{3^m}$
The finite field $\mathbb{F}_{3^m}$ is isomorphic to $\mathbb{F}_3[X]/(p)$ and $\mathbb{F}_3(\alpha)$ where $p$ is an irreducible polynomial of degree $m$ in $\mathbb{F}_3[X]$ and $\alpha$ is a root of $p$. We will identify these three fields, but our notation will be tailored toward $\mathbb{F}_3(\alpha)$. In a polynomial basis $\mathbb{F}_3(\alpha)$ is regarded as an $m$-dimensional vector space over $\mathbb{F}_3$ with basis
$$\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}.$$
For an element $\hat{a} \in \mathbb{F}_3(\alpha)$ we will simply write the elements in a polynomial, or standard, basis as
$$\hat{a} = \sum_{i=0}^{m-1} a_i \alpha^i, \qquad a_i \in \mathbb{F}_3.$$
Arithmetic in a polynomial basis is fairly straightforward when based on conventional polynomial arithmetic. When discussing implementation of such arithmetic, it is often useful to denote elements as a vector of coefficients such as
$$\hat{a} = \langle a_0, a_1, \ldots, a_{m-1} \rangle$$
so that physical operations such as shifting and rotation of coefficients are more naturally expressed. We use the notation $\hat{a}^{(i)}$ to denote the (left) rotation of the coefficients in such a vector by distance $i$. That is, we write
$$\hat{a}^{(i)} = \langle a_i, a_{i+1}, \ldots, a_{i+m-1} \rangle$$
where in all cases, coefficient indices are reduced modulo $m$. Using this notation, $a^{(i)}_j$ represents the $j$-th coefficient of the rotated element $\hat{a}^{(i)}$.
In a normal basis, things are slightly more involved. Given an irreducible polynomial $p$ of degree $m$ and with root $\alpha$, the full set of roots of $p$ in $\mathbb{F}_3(\alpha)$ is
$$B = \{\alpha, \alpha^3, \alpha^{3^2}, \ldots, \alpha^{3^{m-1}}\}.$$
If the elements of $B$ are linearly independent then the set of roots forms a basis of $\mathbb{F}_3(\alpha)$ over $\mathbb{F}_3$ and this basis, $p$ and $\alpha$ are all called normal. To construct such a basis, and the matrix $M$ which determines how the multiplication operation works, we use the techniques of Granger et al. [13] based on work by Nöcker [21].
For an element $\bar{a} \in \mathbb{F}_3(\alpha)$ we write
$$\bar{a} = \sum_{i=0}^{m-1} a_i \alpha^{3^i}$$
but again, for brevity, we often denote a normal basis field element using the coefficient vector and rotated coefficient vector notation as described above. When using both polynomial and normal basis representations, we hold an element with $m$ coefficients over $\mathbb{F}_3$ as a vector of $2m$ bits. Two sequential bits are used to hold each coefficient so that $a_i = (a_i^H, a_i^L)$.
For concreteness, we set the defining polynomial for our polynomial basis to $\alpha^{97} + \alpha^{16} + 2$.
Addition and Subtraction
The most basic operations on field elements are addition and subtraction. These are reasonably straightforward because they can be performed componentwise with no interaction between coefficients. Given that our coefficients are held using two bits, we can construct cells for the required arithmetic using simple logical operations. Following Harrison et al. [15], the addition $r_i = a_i + b_i$ of two coefficients $a_i$ and $b_i$ can be specified using
$$r_i^H = (a_i^L \vee b_i^L) \oplus t, \qquad r_i^L = (a_i^H \vee b_i^H) \oplus t, \qquad t = (a_i^L \vee b_i^H) \oplus (a_i^H \vee b_i^L).$$
Subtraction, and hence multiplication by two, are equally efficient since the negation of an element $a$ simply swaps the bits $a_i^H$ and $a_i^L$ over; subtraction can therefore be implemented by the same function as addition with the bits of the second operand swapped.
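The two-bit coefficient cells can be modelled in software to check their behaviour. The sketch below is illustrative only: it assumes the encoding 0 = (0, 0), 1 = (0, 1), 2 = (1, 0) for the bit pair (H, L), holds a whole element as two Python integers used as bit vectors, and uses one known seven-gate realisation of the adder. The function names are hypothetical.

```python
# Bit-sliced arithmetic on vectors of F_3 coefficients (illustrative sketch).
# Assumed encoding of a coefficient as the bit pair (H, L):
#   0 -> (0, 0), 1 -> (0, 1), 2 -> (1, 0), so negation swaps H and L.

def encode(coeffs):
    """Pack a list of coefficients into two integers (hi, lo), bit i = coefficient i."""
    hi = lo = 0
    for i, c in enumerate(coeffs):
        c %= 3
        hi |= (c >> 1) << i
        lo |= (c & 1) << i
    return hi, lo

def decode(v, m):
    """Unpack (hi, lo) into a list of m coefficients in {0, 1, 2}."""
    hi, lo = v
    return [2 * ((hi >> i) & 1) + ((lo >> i) & 1) for i in range(m)]

def f3_add(a, b):
    """Coefficient-wise addition modulo 3 using only OR and XOR."""
    a_hi, a_lo = a
    b_hi, b_lo = b
    t = (a_lo | b_hi) ^ (a_hi | b_lo)
    return (a_lo | b_lo) ^ t, (a_hi | b_hi) ^ t

def f3_neg(a):
    """Negation is a swap of the two bit vectors."""
    a_hi, a_lo = a
    return a_lo, a_hi

def f3_sub(a, b):
    """Subtraction reuses the adder with the second operand negated."""
    return f3_add(a, f3_neg(b))
```

Because every operation is bitwise, the same function adds all coefficients of an element at once, which mirrors the componentwise hardware cells.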
Cubing and Cube Roots
When working in characteristic three, cubing is an important operation since curve and pairing arithmetic is often manipulated to utilise cubing rather than a more costly multiplication. In addition, the cube root operation is important in the Duursma-Lee algorithm if pre-computation is avoided.
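When using a normal basis, the cube and cube root operations are very efficient in characteristic three: both can be achieved by cyclically shifting the coefficients of an element, so that for an element $\bar{a} = \sum_{i=0}^{m-1} a_i \alpha^{3^i}$ we have
$$\bar{a}^3 = \sum_{i=0}^{m-1} a_{i-1} \alpha^{3^i}, \qquad \sqrt[3]{\bar{a}} = \sum_{i=0}^{m-1} a_{i+1} \alpha^{3^i},$$
with coefficient indices reduced modulo $m$. Clearly these rotations can be easily implemented in a hardware circuit, where they reduce to a wired permutation of bits with no actual computational overhead.
In a polynomial basis, cubing is a linear operation in the same way squaring is linear in characteristic two [6, 22]. That is, we have
$$\hat{a}^3 = \left( \sum_{i=0}^{m-1} a_i \alpha^i \right)^3 = \sum_{i=0}^{m-1} a_i \alpha^{3i}.$$
Therefore, we can implement it by simply thinning the coefficients, i.e. padding them with zeros, before performing a reduction. Cube root is somewhat more involved but since our chosen field is of the right form, we can utilise the method highlighted by Barreto [1]. Specifically, since our defining polynomial for $m = 97$ is $\alpha^{97} + \alpha^{16} + 2$ we have that $97 = 3u + 1$ and $16 = 3v + 1$ so that $u = 32$ and $v = 5$. Hence, for an element $\hat{a}$ we have $\sqrt[3]{\hat{a}} = t_0 + t_1 + t_2$ where
$$t_0 = c_0, \qquad t_1 = c_1^{\ll 2u+1} - c_1^{\ll u+v+1} + c_1^{\ll 2v+1}, \qquad t_2 = c_2^{\ll u+1} + c_2^{\ll v+1}, \qquad c_j = \sum_i a_{3i+j} \alpha^i,$$
given that for $t \in \mathbb{F}_{3^m}$, $t^{\ll n}$ denotes $t\alpha^n$, the value $t$ shifted left by $n$ coefficients and suitably reduced.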
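The polynomial basis cube and cube root can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware design: elements are plain Python lists of 97 coefficients, the field is defined by $\alpha^{97} + \alpha^{16} + 2$, and the shift amounts 65, 38, 11, 33 and 6 are derived from $u = 32$ and $v = 5$ via the identities $\alpha^{1/3} = \alpha^{2u+1} - \alpha^{u+v+1} + \alpha^{2v+1}$ and $\alpha^{2/3} = \alpha^{u+1} + \alpha^{v+1}$, which hold for this trinomial. All function names are hypothetical.

```python
M, K = 97, 16   # defining polynomial x^97 + x^16 + 2 over F_3
U, V = 32, 5    # 97 = 3*U + 1 and 16 = 3*V + 1

def reduce(c):
    """Reduce a coefficient list modulo x^97 + x^16 + 2, using x^97 = 2*x^16 + 1."""
    c = list(c) + [0] * max(0, M - len(c))
    for i in range(len(c) - 1, M - 1, -1):
        t, c[i] = c[i], 0
        c[i - M + K] += 2 * t
        c[i - M] += t
    return [x % 3 for x in c[:M]]

def add(a, b):
    return [(x + y) % 3 for x, y in zip(a, b)]

def sub(a, b):
    return [(x - y) % 3 for x, y in zip(a, b)]

def shift(t, n):
    """Multiply t by x^n and reduce."""
    return reduce([0] * n + list(t))

def cube(a):
    """Cubing: spread coefficient i to position 3*i (thinning), then reduce."""
    thin = [0] * (3 * (M - 1) + 1)
    thin[::3] = a
    return reduce(thin)

def cube_root(a):
    """Cube root: split coefficients by index mod 3, then apply fixed shifts."""
    c0, c1, c2 = [(list(a[j::3]) + [0] * M)[:M] for j in range(3)]
    t1 = add(sub(shift(c1, 2 * U + 1), shift(c1, U + V + 1)), shift(c1, 2 * V + 1))
    t2 = add(shift(c2, U + 1), shift(c2, V + 1))
    return add(add(c0, t1), t2)
```

Since the three parts have degree at most 32 and the largest shift is 65, none of the shifted terms in `cube_root` actually needs reduction for these parameters, which is what makes the operation cheap in hardware.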
Multiplication
In addition to component-wise addition and subtraction, for normal basis multiplication we also require a component-wise multiplication of the form $r_i = a_i \cdot b_i$. This can be performed using similarly inexpensive logical operations.
Armed with a function to perform this operation, we construct a general multiplication result of the form $\bar{c} = \bar{a} \cdot \bar{b}$ using
$$c_k = \sum_{i=0}^{m-1} a_{k+i} \cdot \left( \sum_{j=0}^{m-1} M_{i,j} \cdot b_{k+j} \right)$$
where in all cases, coefficient indices are reduced modulo $m$. The sparse matrix $M$ in this description is constructed from the normal polynomial $p$ and essentially dictates how reduction behaves for the field. We developed a compiler that takes $M$ and automatically produces circuitry to implement the three phases of the above formula: an addition phase to compute the sums of the terms $M_{i,j} \cdot b_{k+j}$, keeping in mind that $M_{i,j} \in \{0, 1, 2\}$; a multiplication phase to multiply $a_{k+i}$ by the summed terms; and an accumulation phase to sum all the multiplied terms and form $c_k$. Such circuitry generates a single coefficient per clock cycle and hence requires $m$ clock cycles to complete a multiply; we can place several such circuits working in parallel to accelerate the multiplication [13].
There has already been plenty of previous work dedicated to hardware polynomial basis multiplication methods in characteristic three [6, 17, 22] . We follow the approach of Bertoni et al. [6] in employing a digit-serial approach. In a similar way that a normal basis is scalable since we can utilise D parallel coefficient calculation circuits, a digit-serial multiplier allows us to scale the digit-size D in order to find a suitable balance between size and speed.
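To make the digit-serial idea concrete, the sketch below models a most-significant-digit-first multiplier in software and checks it against a schoolbook product. It is an illustration under assumptions: the field polynomial is taken to be $\alpha^{97} + \alpha^{16} + 2$, one digit of $D$ coefficients is consumed per iteration, and the returned iteration count is a property of this model rather than a figure taken from Table 1. Function names are hypothetical.

```python
M, K = 97, 16   # defining polynomial x^97 + x^16 + 2 over F_3

def reduce(c):
    """Reduce a coefficient list modulo x^97 + x^16 + 2, using x^97 = 2*x^16 + 1."""
    c = list(c) + [0] * max(0, M - len(c))
    for i in range(len(c) - 1, M - 1, -1):
        t, c[i] = c[i], 0
        c[i - M + K] += 2 * t
        c[i - M] += t
    return [x % 3 for x in c[:M]]

def mul_schoolbook(a, b):
    """Reference product: full polynomial multiplication followed by reduction."""
    prod = [0] * (2 * M - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] += ai * bj
    return reduce(prod)

def mul_digit_serial(a, b, D=4):
    """Process D coefficients of b per iteration, most significant digit first.

    Returns the product and the number of iterations (one per digit).
    """
    n_digits = -(-M // D)                       # ceil(M / D)
    acc = [0] * M
    for d in reversed(range(n_digits)):
        digit = b[d * D:(d + 1) * D]            # up to D coefficients of b
        partial = [0] * D + acc                 # acc * x^D
        for j, bj in enumerate(digit):
            for i, ai in enumerate(a):
                partial[i + j] += ai * bj       # add a * digit(x)
        acc = reduce(partial)
    return acc, n_digits
```

Increasing `D` reduces the iteration count at the price of a wider partial-product stage, which is the size versus speed balance the digit-size parameter controls.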
Inversion
Inversion is generally the most expensive operation when dealing with finite field arithmetic, so much so that in systems like ECC every effort is made to construct higher level operations so that inversion is not required. Due to the cost of constructing dedicated hardware for limited return, we implement inversion in software using our hardware for other operations in $\mathbb{F}_{3^m}$. To avoid the extra hardware cost described by Kerins et al. [17], we implement inversion using the relationship
$$\hat{a}^{-1} = \hat{a}^{3^m - 2}$$
evaluated with a ternary expansion of the exponent since cubing operations are so inexpensive. In a polynomial basis this could be improved upon incrementally by using a translation of the standard binary Euclidean algorithm [15]. Since we only require inversion once in the final powering, we leave this issue for further work.
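Inversion by powering can be sketched as a cube-and-multiply loop. The example assumes the exponent $3^m - 2$, whose ternary digits are $2, 2, \ldots, 2, 1$ from most to least significant, and the polynomial basis defined by $\alpha^{97} + \alpha^{16} + 2$; the helper names are hypothetical and the arithmetic is a plain software model rather than the co-processor instruction sequence.

```python
M, K = 97, 16   # defining polynomial x^97 + x^16 + 2 over F_3

def reduce(c):
    """Reduce a coefficient list modulo x^97 + x^16 + 2, using x^97 = 2*x^16 + 1."""
    c = list(c) + [0] * max(0, M - len(c))
    for i in range(len(c) - 1, M - 1, -1):
        t, c[i] = c[i], 0
        c[i - M + K] += 2 * t
        c[i - M] += t
    return [x % 3 for x in c[:M]]

def mul(a, b):
    """Schoolbook multiplication followed by reduction."""
    prod = [0] * (2 * M - 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                prod[i + j] += ai * bj
    return reduce(prod)

def cube(a):
    """Cubing by thinning the coefficients and reducing."""
    thin = [0] * (3 * (M - 1) + 1)
    thin[::3] = a
    return reduce(thin)

def inverse(a):
    """Compute a^(3^M - 2) by scanning the ternary digits 2, 2, ..., 2, 1."""
    a2 = mul(a, a)
    r = a2                          # leading digit 2
    for _ in range(M - 2):          # middle digits, all equal to 2
        r = mul(cube(r), a2)
    return mul(cube(r), a)          # final digit 1
```

The loop costs one cubing and one multiplication per ternary digit, so roughly $m$ multiplications in total, which is acceptable when only a single inversion is needed.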
Exponentiation
Generally, we avoid exponentiation of pairing values by arbitrary exponents since one can use the bilinearity property to push the operation inside the pairing as a point multiplication which is more efficient; see the work of Granger et al. [14] for efficient methods in this area. However, we do need to consider the final powering of the pairing output by $q^3 - 1$ in order to yield a value compatible with BKLS. To power the pairing output $f$ by the required exponent, we decompose the operation into
$$f^{q^3 - 1} = f^{q^3} \cdot f^{-1}$$
the first term of which is simply three applications of the $q$-Frobenius and the second is an inversion. Thanks to our field arithmetic, the inversion is reasonably efficient essentially because it can be done directly [14] rather than using an iterative method.
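The decomposition can be made more explicit with a short derivation. It is a sketch under assumptions that are not spelled out here: $\mathbb{F}_{q^6}$ is taken to be $\mathbb{F}_{q^3}[\sigma]/(\sigma^2 + 1)$, the pairing value is written $f = g + h\sigma$ with $g, h \in \mathbb{F}_{q^3}$, and $m$ is odd (as it is for $m = 97$ and $m = 89$).

```latex
\begin{align*}
\sigma^{q^3} &= \sigma \cdot (\sigma^2)^{(q^3 - 1)/2} = \sigma \cdot (-1)^{(q^3 - 1)/2} = -\sigma
  && \text{since } q^3 = 3^{3m} \equiv 3 \pmod{4} \text{ for odd } m, \\
f^{q^3} &= g^{q^3} + h^{q^3} \sigma^{q^3} = g - h\sigma
  && \text{since } g, h \in \mathbb{F}_{q^3}, \\
f^{q^3 - 1} &= f^{q^3} \cdot f^{-1} = \frac{g - h\sigma}{g + h\sigma} = \frac{(g - h\sigma)^2}{g^2 + h^2}.
\end{align*}
```

Under these assumptions the only inversion required is that of $g^2 + h^2$, an element of the subfield $\mathbb{F}_{q^3}$.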
Architecture and Results

Architecture
Our design was realised using VHDL synthesised with a combination of Xilinx EDK 7.1 and ISE 7.1. Our experimental platform was a Xilinx ML300 prototyping board which hosts a Virtex-II PRO FPGA (XC2VP4FF672-6) device with 4928 slices. Our philosophy with this design was to treat the $\mathbb{F}_{3^m}$ arithmetic as a kind of co-processor, which is controlled by a more general purpose processor rather than hardwiring logic to directly compute the pairing. By swapping the co-processor we can provide arithmetic in either polynomial or normal bases; the FPGA size prevented making both available in one design. Since the instructions that are issued to the co-processor are executed synchronously, one might view this as a kind of instruction set extension. With this approach, we can easily implement other higher level operations based on the same field arithmetic, such as the ECC point multiplication over $E(\mathbb{F}_{3^m})$ which is also required in most pairing based schemes. As such, we combine our arithmetic in $\mathbb{F}_{3^m}$ with a register file, backed by BlockRAM, of 32 registers each able to store an element of $\mathbb{F}_{3^m}$, which total under 1 kilobyte for our choices of $m$. We control this combined data-path with a Xilinx MicroBlaze soft-core, a 32-bit, 3-stage pipelined RISC processor which interfaces to the logic using the Fast Simplex Link (FSL) interface. The MicroBlaze code to control the co-processor was compiled using a re-targeted GCC tool-chain; we were able to achieve fast development times as a result. In short, the FPGA of our prototyping board is filled, as described by Figure 1, with what could be considered an embedded processor with a co-processor for arithmetic in $\mathbb{F}_{3^m}$. The obvious real-world analogy of this type of architecture is a smart-card with an associated co-processor.
Table 1: Cost and performance characteristics of hardware based field, point and pairing arithmetic using polynomial and normal bases, clocked at low and maximum frequencies.
Results
Having selected our fields for polynomial and normal bases so that they were as close as possible in size, we took the approach of utilising as equal an amount of the FPGA as possible to make comparison easier. Since our multiplier architecture in both cases allows for scalability by altering the digit-size $D$, we parameterised the polynomial basis multiplier with $D = 4$ and the normal basis multiplier with $D = 2$, choices that resulted in roughly the same area cost. Table 1 shows the performance of our arithmetic and higher level functions at a modest clock speed that could be useful in a constrained environment and at the fastest possible speed permitted by our synthesis results. A given arithmetic operation essentially requires $n + 2$ cycles: 1 cycle for the instruction fetch and decode, $n$ for the execution and 1 to write back the result into the register file. As well as cycle and wall-clock timings, we quote the number of instructions issued by the MicroBlaze core to the ALU. The area costs are inclusive of all system elements bar the instruction memory and register file, which are backed by BlockRAM. The MicroBlaze core, FSL interface and debugging unit consume roughly 1300 slices; the finite state machine to control the ALU consumes roughly 500 slices; the ALU logic consumes roughly 1700 slices depending on which elements are included. Note that our upper clock speed was bounded by 150 MHz since this was the maximum permitted by use of the MicroBlaze.
In terms of field arithmetic, we find that the polynomial basis representation is generally faster since, although the cube and cube root circuits are more complex, the dominant feature was the multiplier. The critical path of the normal basis multiplier was far longer, forcing a lower clock speed, and the design much larger, meaning the polynomial multiplier could employ a larger, more efficient digit-size. Using these results and by simply looking at the algorithms, it is clear that the Duursma-Lee algorithm will be faster than that of Kwon-BGOS since although the latter removes the need for a cube root in $\mathbb{F}_q$, it requires a cubing in $\mathbb{F}_{q^k}$. Thanks to the single-cycle cube root implementations, the cube in $\mathbb{F}_{q^k}$ will inevitably be slower. Table 1 confirms this by quoting results for evaluating the pairing and for the final powering: one should view a pairing as being the combination of these two if the goal is compatibility with other algorithms.
Note that although the Kwon-BGOS algorithm is marginally slower it offers an attractive trade-off since we can omit the cube root logic from our design and save the associated slices. Also note that because of the fast cube root method of Barreto [1] , the perceived advantage of a normal basis in being able to perform fast cube root operations is eliminated: the multiplier is the dominant cost as a result.
Analysis
In characteristic three, given our constrained setting, an efficient way to perform point multiplication using minimal pre-computation is to use the generalised non-adjacent form (GNAF) [9, 26] to construct a signed ternary expansion of the exponent $d \pmod{l}$. Such a representation is easy to compute and reduces the average density of non-zero trits from two thirds to one half. Using $A$ to denote point addition and $T$ to denote point tripling, the cost of an average point multiplication is approximately
$$\frac{\log(d)}{\log(3)} \cdot \left( T + \frac{1}{2} A \right).$$
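A signed ternary expansion and the triple-and-add loop it drives can be sketched as below. Several things here are assumptions made for illustration: the digits are produced by subtracting the radix-3 digits of $d$ from those of $4d$, a construction from the literature on generalised non-adjacent forms that may differ from the exact recoding intended above; digits of magnitude two are served by a precomputed $2P$; and ordinary integers stand in for curve points so that the example runs without an elliptic curve library.

```python
def signed_ternary(d):
    """Signed radix-3 digits of d, least significant first, each in {-2, ..., 2}.

    Digit i is (digit i+1 of 4d) minus (digit i+1 of d), so the digits sum to
    floor(4d/3) - floor(d/3) = d when weighted by powers of three.
    """
    digits = []
    n, n4 = d // 3, (4 * d) // 3
    while n4:
        digits.append(n4 % 3 - n % 3)
        n //= 3
        n4 //= 3
    return digits

def triple_and_add(d, P, add, neg, triple):
    """Compute d*P for d > 0 with one tripling per digit and one addition per non-zero digit."""
    P2 = add(P, P)
    table = {1: P, 2: P2, -1: neg(P), -2: neg(P2)}
    R = None                                   # None stands for the identity element
    for c in reversed(signed_ternary(d)):
        if R is not None:
            R = triple(R)
        if c != 0:
            R = table[c] if R is None else add(R, table[c])
    return R

def density(d):
    """Fraction of non-zero digits in the signed expansion of d."""
    digits = signed_ternary(d)
    return sum(1 for c in digits if c) / len(digits)

# Stand-in group for demonstration: the integers under addition.
def int_add(x, y): return x + y
def int_neg(x): return -x
def int_triple(x): return 3 * x
```

For example, `triple_and_add(5, P, int_add, int_neg, int_triple)` walks the digits of 5 (namely 2 then -1) and returns `5 * P`; with real curve arithmetic the three callbacks would be replaced by point addition, negation and tripling.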
The Boneh-Franklin IBE scheme [8] is perhaps the most definitive example of the use of pairings within a concrete scheme. The trust authority, or TA, has a public key $P_{TA} = s \cdot P$ for a master secret $s$. A user's public key is calculated from the string $ID$ using a hash function as $P_{ID} = H_1(ID)$. The corresponding secret key is calculated by the TA as $S_{ID} = s \cdot P_{ID}$. To encrypt the message $M$, one selects a random $r$ and computes the tuple
$$C = (U, V) = \left( r \cdot P, \; M \oplus H_2\!\left( e(P_{ID}, P_{TA})^r \right) \right);$$
to decrypt $C = (U, V)$, one computes the result
$$M = V \oplus H_2\!\left( e(S_{ID}, U) \right).$$
Considering our faster implementation using a polynomial basis and the Duursma-Lee algorithm with a modest clock speed of 16 MHz, we use $P$ to denote the combination of pairing and final powering, $M$ a point multiplication and $E$ a field exponentiation. Using this notation we see that encryption costs $2M + P$ (the exponentiation by $r$ being pushed inside the pairing as a point multiplication) while decryption costs $P$. Although we do not consider it as an option, given some extra storage the pairing required for encryption can be pre-computed, which results in the cost being $M + E$. Using these costs and our timings from Table 1, we find that using our architecture we can perform Boneh-Franklin encryption in ≈ 7 ms and decryption in ≈ 4 ms. This performance is easily enough for practical applications since a given scheme will typically try to minimise the number of pairings executed. Thus, one can consider making a trade-off between performance and cost to reduce the device size. For example, we can remove the cube root logic as described above and utilise the Kwon-BGOS algorithm. Additional optimisations in this direction include: reduction of the digit-size in our multiplication units; sharing a group of addition cells between the addition and multiplication operations, since at the moment we place individual copies for each; improving the register allocation strategy or spilling values to the main memory so as to reduce the size of our register file containing $\mathbb{F}_q$ elements; and further tuning of the MicroBlaze to eliminate the debug and RS232 logic used for development purposes only.
Conclusions
We have presented an accelerator for arithmetic in $\mathbb{F}_{3^m}$ and used it to implement the Tate pairing, a primitive which is of increasing importance in cryptographic schemes. Unlike previous work, we investigate both polynomial and normal basis representations of field elements and both the Duursma-Lee and Kwon-BGOS algorithms to compute the pairing. Our results demonstrate roughly a ten-fold improvement on the only other known hardware implementation [17] and performance that is orders of magnitude better than the fastest known software implementations.
The issue of size is slightly harder to quantify due to the use of an FPGA as a target. Although our design is clearly still unrealistically large to place on a smart-card, for example, we have demonstrated that our performance margin is so great that trade-offs which significantly reduce the area are viable. We leave the realisation of such optimisations for further work, which might also include other marginal issues: acceleration of inversion in $\mathbb{F}_{3^m}$ using Euclidean techniques rather than by powering, perhaps by using extra hardware [17]; and some comparison with existing, proprietary smart-card hosted implementations of the Tate pairing [27].
