Abstract. In this paper the benefits of implementing the Tate pairing computation in dedicated hardware are discussed. The main observation is that arithmetic architectures in the extension field $GF(3^{6m})$ are good candidates for parallelization, leading to a calculation time in hardware similar to that of operations over the base field $GF(3^m)$. Using this approach, an architecture for the hardware implementation of the Tate pairing calculation based on a modified Duursma-Lee algorithm is proposed.
Introduction
In recent years an ever increasing number of pairing based cryptosystems have appeared in the literature, see [1]. In turn this has driven research into efficient algorithms for the implementation of bilinear pairings on elliptic curves. To date the Tate pairing (originally introduced to cryptography by Frey and Rück in [2]) has attracted attention as the most efficiently computable bilinear pairing on elliptic curves, and over supersingular elliptic curves it achieves its maximum security in characteristic three.

Until 2002 the best method of Tate pairing computation on elliptic curves was the algorithm of Miller [3]. It is an extension of the well known double-and-add method of performing point scalar multiplication on elliptic curves. This involves performing point additions and point doublings, as well as evaluating the intermediate line functions to obtain elements of the underlying field. These are then accumulated to give the pairing value. In 2002 the work of Galbraith et al. and Barreto et al. furthered this development so that the Tate pairing became easier to compute in practice [4] [5]. As described in the BKLS/GHS algorithms, prudent choice of points, by use of a distortion map of the type discussed in [6], as well as a triple-and-add algorithm in characteristic three, greatly simplifies the pairing calculation. The utilization of so called tower fields of $GF(3^m)$ for arithmetic in $GF(3^{6m})$ was originally proposed by Galbraith et al. [4]. In 2003 further improvements in the implementation of the Tate pairing were described by Duursma and Lee in [7], leading to the DL algorithm for Tate pairing computation. Here the pairing computation was extended to more general hyperelliptic curves. Also the distortion map was incorporated into the operation of the algorithm itself, and the loop of the BKLS/GHS algorithms was modified, to yield a more efficient implementation.
Further enhancements to the DL algorithm for supersingular elliptic curves over fields of characteristic three were described in [9], [10] and [11]. As will be described in this paper, the modified DL (MDL) algorithm described in [9] is an excellent candidate for implementation on dedicated hardware. Further work on even more efficient general pairing algorithms, of which the MDL algorithm is a special case, appeared recently in [12].

Despite the large body of work on improving the algorithmic efficiency of the Tate pairing computation, the hardware implementation of such algorithms, particularly over characteristic three, has to date received scant attention in the literature. This is somewhat surprising given the well known speed and security advantages of dedicated cryptographic hardware [13]. The main contribution of this paper is a description of how the modified DL algorithm in characteristic three can be efficiently implemented in hardware; a number of conclusions are then derived about the expected calculation time of such an architecture.

This paper is organized as follows. Section 2 describes related work on the hardware implementation of the Tate pairing and arithmetic circuitry in characteristic three. Section 3 describes the MDL algorithm for computation of the Tate pairing and issues related to its efficient hardware implementation. Section 4 discusses feasibility and calculation time on dedicated hardware. The conclusions of this paper are presented in Section 5.
Related Work
Hardware architectures for polynomial basis arithmetic in characteristic three have appeared in [14] [15] [16] [17], while architectures for normal basis arithmetic have appeared in [18]. In hardware, and indeed in software, the basis representation is a significant design choice. For this paper the polynomial basis representation $GF(3^m) \cong GF(3)[x]/f(x)$ was chosen, where $f(x)$ is a degree $m$ irreducible polynomial over $GF(3)$. In polynomial basis, multiplication in $GF(3^m)$ is possible in $d = \lceil m/D \rceil$ clock cycles for some digit size $D$, following the architectures outlined by Bertoni et al. in [15]. The coefficient serial multiplier discussed in [16] and [17] is a special case of this. As will be described in Section 3, the primary required operations over $GF(3^m)$ for the MDL algorithm are addition, subtraction, multiplication and cubing. It has been outlined in [15] [16] [17] [18] that addition and subtraction (and also negation as a special case) can be efficiently performed in hardware by small combinational gate circuits using various two-bit binary encodings of $GF(3)$ elements, and that the gate delay for these addition and subtraction architectures is low. This implies that additive operations in $GF(3^m)$ arithmetic hardware can be performed almost for free and will not significantly contribute to a processor's calculation time. In hardware, elements of $GF(3^m)$ can be represented in $2m$ bits.

In [15] a digit serial multiplier over $GF(3^m)$ is described. This considers multiplication over $GF(3^m)$ as a series of matrix-vector multiplications with coefficients in $GF(3)$. This can be implemented efficiently in hardware assuming a low weight irreducible polynomial $f(x) \in GF(3)[x]$ (trinomial or pentanomial) has been used to define arithmetic in $GF(3^m)$. Under this assumption cubing circuitry in $GF(3^m)$ can also be implemented in much less hardware than general multiplication, and cubing can be performed in a single clock cycle.
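To make the cost difference between cubing and general multiplication concrete, the following is a minimal software sketch of polynomial basis arithmetic in $GF(3^m)$. It is an illustration only and not the digit serial architecture of [15]: elements are plain Python lists of $m$ coefficients in {0, 1, 2}, the trinomial $x^{97} + x^{16} + 2$ is the one used for the prototype in Section 4, and all function names are hypothetical.

```python
M = 97                      # extension degree (assumed: the prototype field)
# f(x) = x^97 + x^16 + 2 over GF(3); low-order terms as {exponent: coefficient}.
F_LOW = {16: 1, 0: 2}

def add(a, b):
    """Coefficient-wise addition; there are no carries between coefficients."""
    return [(x + y) % 3 for x, y in zip(a, b)]

def sub(a, b):
    """Coefficient-wise subtraction."""
    return [(x - y) % 3 for x, y in zip(a, b)]

def reduce_mod_f(c):
    """Reduce a coefficient list of any length modulo f(x)."""
    c = list(c) + [0] * max(0, M - len(c))
    for i in range(len(c) - 1, M - 1, -1):
        coef = c[i]
        if coef:
            c[i] = 0
            for e, fc in F_LOW.items():
                # x^i = x^(i-M) * x^M and x^M = -(low-order terms of f).
                c[i - M + e] = (c[i - M + e] - coef * fc) % 3
    return c[:M]

def mul(a, b):
    """Schoolbook product followed by reduction (about m^2 coefficient products)."""
    prod = [0] * (2 * M - 1)
    for i, x in enumerate(a):
        if x:
            for j, y in enumerate(b):
                prod[i + j] = (prod[i + j] + x * y) % 3
    return reduce_mod_f(prod)

def cube(a):
    """In characteristic three (sum a_i x^i)^3 = sum a_i x^(3i)."""
    spread = [0] * (3 * (M - 1) + 1)
    for i, x in enumerate(a):
        spread[3 * i] = x           # a_i^3 = a_i for a_i in GF(3)
    return reduce_mod_f(spread)

# Usage: cubing needs no coefficient products, yet agrees with a * a * a.
a = [(i * i + 1) % 3 for i in range(M)]
assert cube(a) == mul(mul(a, a), a)
```

Because cubing only repositions coefficients before the reduction, and the reduction for a low weight $f(x)$ is a fixed pattern of additions, the whole operation maps to a small network of $GF(3)$ adders. This is consistent with the single clock cycle figure quoted above.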
An efficient algorithm and hardware architecture for inversion in $GF(3^m)$ in $2m$ clock cycles based on the extended Euclidean algorithm appeared recently in [16] and [17].
Few full hardware processor architectures for Tate pairing calculation in characteristic three have appeared in the literature. However, the authors are aware of an FPGA implementation of a pairing based cryptosystem coprocessor architecture based on the binary BKLS/GHS algorithm in [19].
Tate Pairing Calculation by Modified Duursma-Lee Algorithm
This section presents an outline of the modified Duursma-Lee algorithm along with some observations regarding its efficient calculation in hardware.
The Tate Pairing
Following from [5], [8] and [12] the modified Tate pairing is defined on the supersingular elliptic curve $E^{\pm}$ in affine coordinates defined over a Galois field $GF(3^m)$, where in practice $m$ is generally prime

$$E^{\pm} : y^2 = x^3 - x \pm 1 \qquad (1)$$
The set of points on $E^{\pm}$ along with the point at infinity $O$ form a group of order $\#E$ under the well known chord-tangent law of composition [20]. The curve (1) is chosen so that it contains a large cyclic subgroup of prime order $l$, i.e. $l = \#E/n$, where $n$ is small. Also $l^2$ does not divide $\#E$, but $l$ divides $3^{6m} - 1$ and does not divide $3^{jm} - 1$ for any $j < 6$. In order to resist discrete logarithm attacks it is recommended that the binary representation of $l$ is at least 150 bits long [21]. Now $E^{\pm}(GF(3^m))$ contains an $l$-torsion group $E^{\pm}[l](GF(3^m))$ and similarly $E^{\pm}(GF(3^{6m}))$ contains an $l$-torsion group $E^{\pm}[l](GF(3^{6m}))$. Following [5], for our purposes the Tate pairing of order $l$ is defined as a bilinear map between
It is only defined up to $l$-th powers of unity; to obtain a unique value in $GF(3^{6m})$ suitable for cryptographic applications it is necessary to raise it to the power $\epsilon$ = (
The pairing is efficiently computed in practice by considering the point
where φ is a distortion map of the type introduced in [6] . The distortion map φ is defined as
where
) and $\sigma^2 + 1 = 0$. Following [7] [8] [9] the modified Tate pairing is now defined on points
The calculation of (4) is performed in two stages:
* Calculation of an intermediate value $t \in GF(3^{6m})$. This is performed by the modified Duursma-Lee algorithm illustrated as Algorithm 1.
* Raising the resulting element $t \in GF(3^{6m})$ to the Tate power $\epsilon_1$, i.e. $\tau = t^{\epsilon_1}$. The Tate power is $\epsilon_1 = 3^{3m} - 1$, as the DL algorithm benefits from the equivalence property of the Tate pairing.
After the calculation of t = e 3 3m −1 (P, φ(R))) ∈ GF (3 6m ) by Algorithm 1 this Galois field element must then be raised to the power ǫ 1 . This can be efficiently performed by representing GF (3 6m ) as an extension field of GF (3 m ) as illustrated in Section 3.2.
3.2 A Tower Field Representation for $GF(3^{6m})$
As discussed in Section 2, efficient hardware architectures exist for addition, subtraction, cubing and multiplication in the base field $GF(3^m)$. However, as seen from Algorithm 1, the principal complexity in performing the modified Tate pairing (4) lies in the implementation of efficient arithmetic in $GF(3^{6m})$ as well as $GF(3^m)$. The suggestion of constructing the field $GF(3^{6m})$ as an extension field of $GF(3^m)$ originally appeared in [4] and [5] and is prudent for hardware implementation. In [11] much of the arithmetic developed in this section is explicitly described.
The choice of basis for the construction of $GF(3^{6m})$ from $GF(3^m)$ is motivated by a desire to simplify as much as possible the $GF(3^{6m})$ elements $\rho$ and $\sigma$ used in the distortion map $\phi$ (3) appearing in step 05 of Algorithm 1. Elements $a \in GF(3^{6m})$ are represented by six coefficients in $GF(3^m)$ with respect to a basis built from $\sigma$ and $\rho$, where $\sigma$ and $\rho$ are zeros of $\sigma^2 + 1 = 0$ and $\rho^3 - \rho \mp 1 = 0$ as defined by the distortion map, i.e.

$$GF(3^{2m}) \cong GF(3^m)[y]/g(y) \qquad (5)$$

where $g(y) = y^2 + 1$ is an irreducible polynomial over $GF(3^m)$ (provided that $m$ and 2 are coprime) and

$$GF(3^{6m}) \cong GF(3^{2m})[z]/h^{\pm}(z) \qquad (6)$$

where $h^{\pm}(z) = z^3 - z \mp 1$ is an irreducible polynomial over $GF(3^{2m})$. Polynomial $h^+(z) = z^3 - z - 1$ is used for $E^+$ and $h^-(z) = z^3 - z + 1$ for $E^-$ (1), provided that $m$ and 3 are coprime.
In this basis the $GF(3^{6m})$ elements $\sigma$ and $\rho$ required by the distortion map, so that $\sigma^2 + 1 = 0 \in GF(3^{6m})$ and
and
Implementation of multiplication by $\sigma$ and $\rho$ in step 05 of Algorithm 1 now becomes much simpler in hardware. Consider calculation of $\gamma \in GF(3^{6m})$
Calculation of $\gamma$ now involves only two multiplications, $\mu^2$ and $\beta y$, in the $GF(3^m)$ subfield, which can be carried out in parallel. The $GF(3^m)$ negation operation does not need to be clocked and can be carried out by a small amount of combinational gate circuitry. Calculation of $\mu$ in step 04 of Algorithm 1 requires only addition over $GF(3^m)$, which can also be carried out un-clocked using a small amount of combinational logic. Multiplication of the respective $GF(3^m)$ elements by $\zeta$ in (7) can be performed by a simple rewiring in hardware. As elements of $GF(3^m)$ are represented by $2m$ bits in hardware, elements of $GF(3^{6m})$ are represented in $12m$ bits. A further advantage of using this representation from a hardware perspective is that cubing and full multiplication in $GF(3^{6m})$ (steps 06 and 07 of Algorithm 1) can also be performed using only simpler cubing and multiplication operations respectively over the base field $GF(3^m)$, and all of these simpler operations can be carried out in parallel.
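The un-clocked additive logic can be illustrated with a bit-sliced software model. The text only states that two-bit encodings of $GF(3)$ are used; the particular encoding below (0 as 00, 1 as 01, 2 as 10) and the gate equations are one possible choice, assumed here for illustration, and the function names are hypothetical. An element of $GF(3^m)$ is held as two $m$-bit planes, giving $2m$ bits in total.

```python
def encode(coeffs):
    """Pack GF(3) coefficients into (hi, lo) bit planes: 0 -> 00, 1 -> 01, 2 -> 10."""
    hi = lo = 0
    for i, c in enumerate(coeffs):
        if c == 1:
            lo |= 1 << i
        elif c == 2:
            hi |= 1 << i
    return hi, lo

def decode(elem, m):
    """Unpack the two bit planes back into a list of m coefficients."""
    hi, lo = elem
    return [2 * ((hi >> i) & 1) + ((lo >> i) & 1) for i in range(m)]

def gf3m_add(a, b):
    """Add all m coefficients at once using only OR and XOR on the bit planes."""
    (ah, al), (bh, bl) = a, b
    t = (al | bh) ^ (ah | bl)
    return ((al | bl) ^ t, (ah | bh) ^ t)      # returned as (hi, lo)

def gf3m_neg(a):
    """Negation swaps the planes (1 <-> 2, 0 fixed): pure rewiring, no gates."""
    hi, lo = a
    return (lo, hi)

def gf3m_sub(a, b):
    """Subtraction is addition of the negated operand."""
    return gf3m_add(a, gf3m_neg(b))

# Usage on two 97-coefficient elements.
m = 97
xs = [i % 3 for i in range(m)]
ys = [(2 * i + i // 3) % 3 for i in range(m)]
total = decode(gf3m_add(encode(xs), encode(ys)), m)
```

Each output bit depends on four input bits through a few gates, with no carry chain across coefficients, which is why the gate delay stays low regardless of $m$.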
Multiplication
In this representation multiplication of two $GF(3^{6m})$ elements $a$ and $b$ is performed by Karatsuba multiplication [22] of $a$ and $b$ over $GF(3^{2m})$ to form a degree 4 polynomial $d$ (8). Polynomial $d$ from (8) is then reduced modulo the irreducible polynomial $h^{\pm}(z)$ (6) over $GF(3^{2m})$ to form $c = \sum_{i=0}^{2} \tilde{c}_i \rho^i$, as illustrated in (10) for $h^+(z)$ and $h^-(z)$.
As seen from (8), the composition stage of multiplication in $GF(3^{6m})$ is performed in six multiplications, seven additions and six subtractions in $GF(3^{2m})$, while the reduction stage is performed in either five additions for $h^+(z)$ (10) or three additions and two subtractions for $h^-(z)$ in $GF(3^{2m})$. Addition and subtraction in $GF(3^{2m})$ are performed coefficient-wise, so they are easy and cheap to perform in hardware using the arrays of simple gate circuits previously discussed. The hardware complexity in $GF(3^{6m})$ multiplication lies in the required six multiplications in $GF(3^{2m})$. From the dataflow diagram for (8) illustrated as Figure 1 it is seen that the six required $GF(3^{2m})$ multiplications can be carried out in parallel.
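The composition and reduction stages can be sketched in software as follows. This is an illustrative reconstruction from the operation counts quoted above and from $h^{\pm}(z) = z^3 - z \mp 1$, not a transcription of the paper's equations (8) to (10); in particular the grouping of additions in the reduction stage may differ from the paper's. The coefficient operations are passed in as functions, and for brevity the demonstration uses $GF(3)$ coefficients where the hardware would use $GF(3^{2m})$ elements. All names are hypothetical.

```python
def gf3_add(x, y):
    return (x + y) % 3

def gf3_sub(x, y):
    return (x - y) % 3

def gf3_mul(x, y):
    return (x * y) % 3

def karatsuba3(a, b, add=gf3_add, sub=gf3_sub, mul=gf3_mul):
    """Product of a0 + a1 z + a2 z^2 and b0 + b1 z + b2 z^2 as [d0, ..., d4]."""
    a0, a1, a2 = a
    b0, b1, b2 = b
    # Six mutually independent multiplications (parallel in hardware).
    p0 = mul(a0, b0)
    p1 = mul(a1, b1)
    p2 = mul(a2, b2)
    p01 = mul(add(a0, a1), add(b0, b1))
    p02 = mul(add(a0, a2), add(b0, b2))
    p12 = mul(add(a1, a2), add(b1, b2))
    # Seven additions and six subtractions in total, counting the six above.
    d1 = sub(sub(p01, p0), p1)
    d2 = add(sub(sub(p02, p0), p2), p1)
    d3 = sub(sub(p12, p1), p2)
    return [p0, d1, d2, d3, p2]

def reduce_h(d, plus=True, add=gf3_add, sub=gf3_sub):
    """Reduce a degree 4 polynomial modulo h+(z) (plus=True) or h-(z)."""
    d0, d1, d2, d3, d4 = d
    if plus:    # z^3 = z + 1 and z^4 = z^2 + z
        return [add(d0, d3), add(add(d1, d3), d4), add(d2, d4)]
    # z^3 = z - 1 and z^4 = z^2 - z
    return [sub(d0, d3), sub(add(d1, d3), d4), add(d2, d4)]

# Usage: rho^2 * rho reduces to rho + 1 for h+ and to rho - 1 for h-.
rho, rho_sq = [0, 1, 0], [0, 0, 1]
rho_cubed_plus = reduce_h(karatsuba3(rho_sq, rho), plus=True)     # [1, 1, 0]
rho_cubed_minus = reduce_h(karatsuba3(rho_sq, rho), plus=False)   # [2, 1, 0]
```

None of the six products depends on another, so with six $GF(3^{2m})$ multipliers the latency of the composition stage is that of a single multiplication plus the additive logic.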
Fig. 1. Dataflow for Karatsuba composition stage of multiplication in
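Multiplication $\tilde{c} = \tilde{a}\tilde{b} \in GF(3^{2m})$ (5) of two elements $\tilde{a} = a_0 + \sigma a_1$ and $\tilde{b} = b_0 + \sigma b_1$ is performed by Karatsuba multiplication in three multiplications, two additions and three subtractions in $GF(3^m)$ as illustrated in (11)

$$\tilde{c} = (a_0 b_0 - a_1 b_1) + \sigma\left((a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1\right) \qquad (11)$$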
Here both the polynomial composition and reduction steps are performed simultaneously by the observation that $\sigma^2 = -1 \in GF(3^{2m})$ from $g(y)$ (5). Again, additive operations in $GF(3^m)$ are easily performed by simple gate circuits and multiplication in $GF(3^m)$ can be performed as discussed in Section 2. As illustrated in Figure 2, the three required $GF(3^m)$ multiplications can be carried out in parallel.
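A minimal sketch of this three-multiplication scheme is given below. It follows the operation count stated in the text (three multiplications, two additions, three subtractions) and folds the reduction by $\sigma^2 = -1$ into the composition. The base field operations are parameters; the demonstration instantiates them with $GF(3)$, i.e. the toy case $m = 1$, and the function names are hypothetical.

```python
def gf3_add(x, y):
    return (x + y) % 3

def gf3_sub(x, y):
    return (x - y) % 3

def gf3_mul(x, y):
    return (x * y) % 3

def mul_quadratic(a, b, add=gf3_add, sub=gf3_sub, mul=gf3_mul):
    """(a0 + sigma a1)(b0 + sigma b1) with sigma^2 = -1, via Karatsuba."""
    a0, a1 = a
    b0, b1 = b
    # Three independent base field multiplications.
    p0 = mul(a0, b0)
    p1 = mul(a1, b1)
    p2 = mul(add(a0, a1), add(b0, b1))      # the only two additions
    c0 = sub(p0, p1)                        # a0 b0 + sigma^2 a1 b1
    c1 = sub(sub(p2, p0), p1)               # a0 b1 + a1 b0
    return (c0, c1)

def mul_schoolbook(a, b):
    """Four-multiplication reference over GF(3), used only for comparison."""
    return ((a[0] * b[0] - a[1] * b[1]) % 3,
            (a[0] * b[1] + a[1] * b[0]) % 3)

# Usage: sigma * sigma = -1, which is 2 in GF(3).
sigma_squared = mul_quadratic((0, 1), (0, 1))   # (2, 0)
```

Trading one multiplication for extra additive operations is a good exchange in this setting, since the multipliers dominate the area while the additive logic is small combinational circuitry.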
This implies that by this method multiplication in $GF(3^{6m})$ requires eighteen multiplications in the base field $GF(3^m)$ (six $GF(3^{2m})$ multiplications of three $GF(3^m)$ multiplications each) plus a number of additive operations. The advantage of implementing this operation in dedicated hardware over serial general purpose processors lies in the fact that all eighteen $GF(3^m)$ multiplications can be carried out in parallel. By parallelizing this operation the calculation time for multiplication in $GF(3^{6m})$ can be made very close to that for multiplication in $GF(3^m)$. Due to the large number of $GF(3^m)$ additions/subtractions required (124 in total) it is impractical to implement these as pure combinational logic. In practice it is more prudent to implement a smaller number of additive gate circuits and schedule the required operations through these in a few extra clock cycles. So using the digit serial multiplier of Bertoni et al. [15], multiplication in $GF(3^{6m})$ can be performed in hardware in $\lceil m/D \rceil + n_m$ clock cycles, where $n_m$ is the relatively small number of extra clock cycles required for scheduling the additions/subtractions and register read/write operations.
Cubing
Cubing of an element of $GF(3^{6m})$ is performed by (12) for $GF(3^{6m})$ generated by polynomial $h^+(z)$ and by (13) for $GF(3^{6m})$ generated by polynomial $h^-(z)$.
Each involves three cubing operations, two additions and a subtraction in $GF(3^{2m})$. As illustrated in Figures 3 and 4, in both cases the three $GF(3^{2m})$ cubing operations can be carried out in parallel. From (12) and (13) the main complexity of cubing in $GF(3^{6m}) \cong GF(3^{2m})[z]/h^{\pm}(z)$ lies in performing the cubing operation in the field $GF(3^{2m}) \cong GF(3^m)[y]/g(y)$ (5). Consider an element $\tilde{a} = a_0 + \sigma a_1 \in GF(3^{2m})$ generated by $g(y) = y^2 + 1$, where $a_1, a_0 \in GF(3^m)$. Now $\tilde{c} = c_0 + \sigma c_1 = \tilde{a}^3 \in GF(3^{2m})$ is calculated by

$$\tilde{c} = \tilde{a}^3 = a_0^3 + \sigma^3 a_1^3 = a_0^3 - \sigma a_1^3$$

which involves two cubing operations in $GF(3^m)$, which again can be performed in parallel. So the cubing operation in $GF(3^{6m})$ can be efficiently calculated in hardware by performing six $GF(3^m)$ cubing operations in parallel, as well as three $GF(3^m)$ negation operations and six addition/subtraction operations. Following from [15], $GF(3^m)$ cubing can be performed efficiently in a single clock cycle and the additive operations can be performed by simple combinational gate circuits. Using this type of parallel cubing architecture with six $GF(3^m)$ cubing circuits, $GF(3^{6m})$ cubing is performed in a single clock cycle and the six additive operations are performed by the simple un-clocked gate circuits previously discussed.
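The cubing formulas can be checked on a toy instance of the tower. The sketch below is an illustrative reconstruction, not a transcription of the paper's (12) and (13): the expressions are derived here from $\sigma^2 = -1$ and $\rho^3 = \rho \pm 1$, and they happen to match the stated counts of three cubings, two additions and one subtraction in $GF(3^{2m})$. It takes $m = 1$, so that $GF(3^m)$ is $GF(3)$ and the tower has 729 elements; all names are hypothetical.

```python
# Toy instance with m = 1: the base field GF(3^m) is GF(3) itself.
def cube1(x):
    return pow(x, 3, 3)          # stands in for one GF(3^m) cubing circuit

def add2(a, b):
    return ((a[0] + b[0]) % 3, (a[1] + b[1]) % 3)

def sub2(a, b):
    return ((a[0] - b[0]) % 3, (a[1] - b[1]) % 3)

def mul2(a, b):
    """Schoolbook product in GF(3^2m) with sigma^2 = -1 (reference only)."""
    return ((a[0] * b[0] - a[1] * b[1]) % 3,
            (a[0] * b[1] + a[1] * b[0]) % 3)

def cube2(a):
    """(a0 + sigma a1)^3 = a0^3 - sigma a1^3: two cubings and one negation."""
    return (cube1(a[0]), (-cube1(a[1])) % 3)

def cube6(a, plus=True):
    """Cube a0 + a1 rho + a2 rho^2 with coefficients in GF(3^2m)."""
    t0, t1, t2 = cube2(a[0]), cube2(a[1]), cube2(a[2])   # parallel cubings
    if plus:    # h+: rho^3 = rho + 1 and rho^6 = rho^2 - rho + 1
        return (add2(add2(t0, t1), t2), sub2(t1, t2), t2)
    # h-: rho^3 = rho - 1 and rho^6 = rho^2 + rho + 1
    return (add2(sub2(t0, t1), t2), add2(t1, t2), t2)

def mul6(a, b, plus=True):
    """Reference schoolbook multiplication, used only to check cube6."""
    d = [(0, 0)] * 5
    for i in range(3):
        for j in range(3):
            d[i + j] = add2(d[i + j], mul2(a[i], b[j]))
    if plus:    # z^3 = z + 1 and z^4 = z^2 + z
        return (add2(d[0], d[3]), add2(add2(d[1], d[3]), d[4]), add2(d[2], d[4]))
    return (sub2(d[0], d[3]), sub2(add2(d[1], d[3]), d[4]), add2(d[2], d[4]))

# Usage: cube the element rho (coefficients (0,0), (1,0), (0,0)) under h+.
rho = ((0, 0), (1, 0), (0, 0))
rho_cubed = cube6(rho, plus=True)       # rho + 1
```

In total `cube6` invokes `cube1` six times and performs three negations, which agrees with the six parallel $GF(3^m)$ cubing circuits described above.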
Raising to Tate Power
The basis $\{\zeta_i\}$ of $GF(3^{6m})$ over $GF(3^m)$ described by the distortion map as previously discussed is converted to the other basis $\{\xi_0, \xi_1, \xi_2, \xi_3, \xi_4, \xi_5\} = (1, \rho, \rho^2, \sigma, \sigma\rho, \sigma\rho^2)$ described by the distortion map by a simple rewiring in hardware, as illustrated in Figure 5. This is analogous to the tower field representation
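$$GF(3^{3m}) \cong GF(3^m)[y]/h^{\pm}(y) \qquad (15)$$

where $h^{\pm}(y) = y^3 - y \mp 1$ is an irreducible polynomial over $GF(3^m)$ and

$$GF(3^{6m}) \cong GF(3^{3m})[z]/g(z) \qquad (16)$$

where $g(z) = z^2 + 1$ is an irreducible polynomial over $GF(3^{3m})$. In this basis $a \in GF(3^{6m})$ is represented as a pair of elements $\check{a}_0, \check{a}_1 \in GF(3^{3m})$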
As described in [11], raising $a = \sum_{i=0}^{5} a_i \xi_i \in GF(3^{6m})$ to the Tate power $\epsilon_1 = 3^{3m} - 1$ in this basis can be performed in a much more efficient manner than typical multiply-and-accumulate methods of exponentiation by the observation that for $m$ odd a
as $\sigma^2 = -1 \in GF(3^{3m})$. Thus (17) implies that $c = a^{\epsilon_1} \in GF(3^{6m})$ is calculated by
Thus raising to the Tate power $\epsilon_1$ involves five multiplications, three additions, a subtraction and an inversion in $GF(3^{3m})$. Multiplication in the field $GF(3^{3m})$ (15) is carried out in a similar manner to that outlined in (8), (9) and (10), except that in this case the base field is $GF(3^m)$. The six required $GF(3^m)$ multiplications can be carried out in parallel and the additive operations are carried out by the gate circuits previously discussed. The calculation time for multiplication in $GF(3^{3m})$ is given as $\lceil m/D \rceil + n_m$. Inversion in $GF(3^{3m})$ is carried out by arithmetic in $GF(3^m)$ as illustrated in Appendix A. As this operation is performed only once it does not need to be heavily parallelized.
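The shortcut behind this operation count can be sketched as a short derivation. It is a reconstruction under the stated assumptions ($a = \check{a}_0 + \sigma\check{a}_1$ with $\check{a}_0, \check{a}_1 \in GF(3^{3m})$, $\sigma^2 = -1$, $m$ odd) and is not copied from the paper's (17) and (18), whose exact grouping of terms may differ.

```latex
% Assumptions: a = \check{a}_0 + \sigma\check{a}_1, \check{a}_i \in GF(3^{3m}),
% \sigma^2 = -1, m odd. Since \sigma^3 = -\sigma, \sigma^{3^k} = -\sigma for odd k,
% and elements of GF(3^{3m}) are fixed by x \mapsto x^{3^{3m}}.
\begin{align*}
a^{3^{3m}} &= \check{a}_0^{3^{3m}} + \sigma^{3^{3m}}\,\check{a}_1^{3^{3m}}
            = \check{a}_0 - \sigma\check{a}_1, \\
a^{\epsilon_1} = a^{3^{3m}-1}
           &= \frac{\check{a}_0 - \sigma\check{a}_1}{\check{a}_0 + \sigma\check{a}_1}
            = \frac{(\check{a}_0 - \sigma\check{a}_1)^2}{\check{a}_0^2 + \check{a}_1^2} \\
           &= \frac{(\check{a}_0^2 - \check{a}_1^2) + \sigma\,\check{a}_0\check{a}_1}
                   {\check{a}_0^2 + \check{a}_1^2}
            \qquad (\text{using } -2 \equiv 1 \bmod 3).
\end{align*}
```

Read this way, the exponentiation needs the three products $\check{a}_0^2$, $\check{a}_1^2$ and $\check{a}_0\check{a}_1$, one inversion of $\check{a}_0^2 + \check{a}_1^2$, and two multiplications by that inverse. This gives five multiplications and one inversion in $GF(3^{3m})$, in agreement with the count above; the number of additive operations depends on how the terms are grouped.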
A Hardware Architecture for Tate Pairing Calculation based on Duursma-Lee Algorithm
This section considers a prospective hardware implementation for the Tate pairing calculation $\hat{e}(P, R) = \tau$ (4) over the elliptic curves (1) based on Algorithm 1, considering the observations from Section 3.2 on the efficient calculation time achievable by parallelizing $GF(3^{6m})$ arithmetic.
Observations on the Modified Tate Pairing Calculation
It is interesting to consider the number of clock cycles required for the main iteration loop of Algorithm 1 on an architecture where eighteen $GF(3^m)$ multipliers and six $GF(3^m)$ cubing circuits are available in parallel, along with a suitable number of simpler $GF(3^m)$ arithmetic circuits for performing addition, subtraction and negation. Also required on such an architecture are $2m$ bit registers for storage of elements of $GF(3^m)$ and $12m$ bit bus lines for elements of $GF(3^{6m})$. The calculation time for an iteration of Algorithm 1 using this type of architecture is illustrated in Table 1. An extra two clock cycles are added to the calculation time of each operation for register read/write operations.
From Table 1, the modified Duursma-Lee algorithm (Algorithm 1) can be performed on the type of dedicated hardware discussed in Section 3 in $\theta_{DL} = m(2\lceil m/D \rceil + 17 + n_m)$ clock cycles.
After $t \in GF(3^{6m})$ has been computed by Algorithm 1, it is then necessary to raise this $GF(3^{6m})$ element to the Tate power $\epsilon_1$ by (18) to generate the required unique result $\tau = t^{\epsilon_1} \in GF(3^{6m})$. This operation can be efficiently performed on much of the same underlying hardware as required for Algorithm 1. The only operations required are multiplications, additive operations and a single inversion in the base field $GF(3^m)$. Performing the $GF(3^m)$ multiplications in parallel as required implies that (18) can be performed in $\theta_{TP} = 9(\lceil m/D \rceil + n_m) + 2m$ clock cycles.
Assuming a worst case situation where the register read/write operations and scheduling through the simple gate circuits take the same number of clock cycles as a multiplication operation (i.e. $n_m \approx \lceil m/D \rceil$), this implies that using this type of hardware architecture the number of clock cycles for calculation of (4) is given by
Implementation Aspects
The question remains: how practical is the parallel architecture discussed in Section 4.1? In order to gauge the feasibility of the architecture, the main arithmetic units required, the $GF(3^m)$ multiplier and cubing cores, were captured in the VHDL hardware description language and prototyped on the Xilinx Virtex2Pro125 device [23] for the field $GF(3^{97}) \cong GF(3)[x]/(x^{97} + x^{16} + 2)$. The FPGA resource usage (in FPGA slices; total on device 55616) and the post place-and-route (PPR) clock frequency of the $GF(3^{97})$ digit serial multiplier are illustrated for digit sizes $D = 1, 4, 8, 12$ in Table 2. The fast $GF(3^{97})$ cubing circuitry was also implemented on this target technology and occupied 514 slices (0.5%) and had a post place-and-route clock frequency of 118 MHz. The $GF(3^m)$ inverter architecture achieved a clock frequency of 62 MHz and occupied 2210 FPGA slices (4% of the device).
These preliminary results indicate that eighteen $GF(3^{97})$ multipliers with a digit size of $D = 4$ can be implemented on approximately 60% of the target device, and the six $GF(3^{97})$ cubing circuits and the inversion circuit on approximately 7% of the device. This leaves the remaining 33% of the target device for storage registers, the control data-path and arrays of gate circuits for the simple $GF(3^m)$ addition and subtraction logic.
Using the pessimistic estimate (19) for the required number of clock cycles, this implies that calculation of (4) on $E^+$ from (1) over $GF(3^{97})$ with a digit size of $D = 4$ could be performed in 12,866 clock cycles. A conservative 10 MHz clock frequency on the target technology translates this into a calculation time of approximately 1.3 ms. This represents at least a threefold improvement over the calculation times of 4.05 ms and 4.33 ms reported recently for optimized software implementations of the same calculation on serial general purpose processors [11] [24].
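The conversion from clock cycles to calculation time, and the resulting comparison with the software figures, is simple arithmetic that can be reproduced as follows. The cycle count, clock frequency and software timings are the values quoted in the text; the helper name is hypothetical.

```python
def cycles_to_ms(cycles, clock_hz):
    """Convert a clock cycle count at a given clock frequency to milliseconds."""
    return cycles / clock_hz * 1e3

# Figures quoted in the text: 12,866 cycles at a conservative 10 MHz clock.
hw_ms = cycles_to_ms(12_866, 10e6)

# Software calculation times reported in [11] and [24], in milliseconds.
software_ms = (4.05, 4.33)
speedups = [sw / hw_ms for sw in software_ms]

print(f"hardware estimate: {hw_ms:.2f} ms")
print("speedup over software:", ", ".join(f"{s:.2f}x" for s in speedups))
```

Both ratios come out slightly above three, which is the basis for the threefold claim; a clock frequency above the conservative 10 MHz assumed here would scale the estimate down proportionally.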
Conclusions
In this paper the suitability of the modified Duursma-Lee (MDL) algorithm for implementation in dedicated hardware has been illustrated. Prudent choice of basis construction for the field $GF(3^{6m})$ allows the efficient implementation of multiplication and cubing operations, and only arithmetic in the $GF(3^m)$ subfield is required. Multiplication in $GF(3^{6m})$ can be performed by eighteen $GF(3^m)$ multipliers in parallel along with some combinational logic, and cubing in $GF(3^{6m})$ can be performed by six $GF(3^m)$ cubing circuits in parallel along with some combinational logic. This leads to a low number of clock cycles for arithmetic in $GF(3^{6m})$ compared to those required on serial processors. Modern FPGA devices such as the Virtex2Pro currently have enough resources to contain an implementation of this type of parallel hardware for calculation of the MDL algorithm. Assuming pessimistic operating parameters, this dedicated hardware is projected to reduce the calculation time to at most a third of that currently possible using optimized software implementations.
(21) then involves a further six $GF(3^m)$ multiplication operations. In hardware this operation can be partly parallelized by performing three multiplication operations in parallel. This implies that inversion in $GF(3^{3m})$ can be performed in $4(\lceil m/D \rceil + n_m) + 2m$ clock cycles.
