Discrete-logarithm (DL) systems with bilinear structure have recently become an important basis for cryptographic protocols such as identity-based encryption (IBE). Since the main computational task is the evaluation of bilinear pairings over elliptic curves, which is known to be prohibitively expensive, efficient implementations are required to make these protocols applicable in real-life scenarios. We present an efficient accelerator for computing the Tate pairing in characteristic 3 using the modified Duursma-Lee algorithm. Our accelerator shows that it is possible to improve the area-time product by a factor of 12 on FPGA, compared to estimated values from one of the best known hardware architectures [6] implemented on the same type of FPGA. The computation time is also improved by up to 16 times compared to the software implementations reported in [17]. In addition, we present the results of an ASIC implementation of the algorithm, which is the first to date.
Introduction
Identity-based encryption (IBE) is a public key cryptosystem that allows an arbitrary string, such as the recipient's email address, to be used as a public key. This greatly reduces the work needed to set up an online lookup for public keys and offers novel functionality that is especially useful in access control systems, while maintaining privacy and anonymity. Shamir introduced the concept of identity-based cryptography in 1984 [2]. However, the concept became practical only with the work of Boneh and Franklin in 2003 [3]. The Tate pairing, developed by Frey and Rück [5], became popular since it is efficiently computable and achieves its maximum security in characteristic three over supersingular elliptic curves [6]. Later, in [9, 16], the tower of fields from GF(3^m) to GF(3^{6m}) was proposed. Duursma and Lee [11] further improved the implementation of the Tate pairing and proposed the Duursma-Lee algorithm. The algorithm used here first appeared in [12]; it adds further improvements and eliminates the cube root operation at the expense of two extra cubing operations. However, software implementations of pairing operations fall short of the speed requirements of many pairing-based cryptographic applications, especially in embedded systems. Although dedicated hardware architectures have therefore gained significant importance, there is not much work on this subject in the literature. The modified Duursma-Lee algorithm was previously implemented, and only partially, as dedicated hardware in [6] on FPGA. Our aim is to design an accelerator that reduces the computation time and area of Tate pairing computation in characteristic three, to realize it on FPGAs, and to build the first ASIC implementation of the Tate pairing.
Related Work
One of the earliest works is by Page and Smart [13], which described GF(3^m) arithmetic architectures for cryptographic applications. They later implemented the Tate pairing with the Duursma-Lee algorithm using an accelerator for arithmetic in GF(3^m) [1]. The work by Kerins et al. proposes a Tate pairing implementation [6] based on the modified Duursma-Lee algorithm. With their approach, it is possible to multiply two polynomials in GF(3^{6m}) in the same number of clock cycles as multiplying two GF(3^m) polynomials, at the expense of area overhead and reduced clock frequency. In our accelerator we use the parallel hardware architecture and optimize it in terms of area and speed, working especially on the sub-blocks. We optimize the cubing and multiplication units for the specific irreducible polynomials used in the construction of ternary extension fields, which reduces the total area significantly. Additionally, we try to find an optimum algorithm and architecture for a Tate pairing accelerator suited to relatively constrained settings. Our contribution is three-fold: i) we present, for the first time, a full realization of an accelerator for the Duursma-Lee algorithm on both FPGA and ASIC; ii) we demonstrate that the sub-blocks in the accelerator can be improved in terms of both area and time complexity by applying good design techniques; and iii) we show that our actual implementation of the Duursma-Lee algorithm is in fact faster and smaller than the estimated values given in the previous work [6].
Tate Pairing Calculation and Modified Duursma-Lee Algorithm
The modified Tate pairing is basically a transformation that takes two points on an elliptic curve E^± : y^2 = x^3 - x ± 1 defined over GF(3^m) and outputs a nonzero element of the tower extension field GF(3^{6m}). The arithmetic operations required to implement the modified Duursma-Lee algorithm are addition, subtraction, cubing and multiplication in GF(3^m), and cubing and multiplication in GF(3^{6m}).
Algorithm 1: The Modified Duursma-Lee Algorithm (char 3) [6]
Constructing the ternary extension field GF(3^{6m}) on the base field GF(3^m) is suggested in [16, 9] and described explicitly in [6]. The use of extension fields simplifies the arithmetic operations and allows parallelization of the cubing and multiplication operations.
Arithmetic in Characteristic Three and Our Sub-Blocks
In this section, we present hardware architectures for addition, subtraction, multiplication and cubing in GF(3^m). Characteristic three arithmetic is slightly more complicated than characteristic two arithmetic since coefficients can take three values: 0, 1 and 2. Hence two bits are needed to represent each digit in GF(3), using the encoding 0 → 00, 1 → 01, 2 → 10. Negation in this representation is performed by swapping the most and the least significant bits (which is almost free in hardware implementations), since 2 ≡ -1 mod 3. Since the negation operation is used very often, especially in GF(3^{6m}) multiplication, this particular representation is very useful in our case. For arithmetic operations, elements with m digits are expressed as 2m-bit arrays, with two bits per coefficient.
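The encoding and the bit-swap negation described above can be illustrated with a minimal Python sketch. The names `ENCODE`, `DECODE`, `negate`, `pack` and `negate_element` are hypothetical, and the interleaved (high bit, low bit) layout of the 2m-bit array is an assumption, since the exact array layout is not reproduced in the text.

```python
# Hypothetical helper names; only the encoding 0 -> 00, 1 -> 01, 2 -> 10
# comes from the text. The bit order inside the 2m-bit array is assumed.
ENCODE = {0: (0, 0), 1: (0, 1), 2: (1, 0)}        # digit -> (high bit, low bit)
DECODE = {bits: digit for digit, bits in ENCODE.items()}


def negate(bits):
    """Negate one GF(3) digit by swapping its high and low bits."""
    hi, lo = bits
    return (lo, hi)


def pack(coeffs):
    """Pack m GF(3) coefficients into a flat list of 2m bits."""
    out = []
    for c in coeffs:
        out.extend(ENCODE[c % 3])
    return out


def negate_element(bits):
    """Negate a packed GF(3^m) element: swap the two bits of every digit."""
    out = []
    for i in range(0, len(bits), 2):
        out.extend((bits[i + 1], bits[i]))
    return out
```

In hardware the swap is pure wiring, which is why negation costs no logic; in this software model it is an explicit copy.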
Addition and Subtraction
Addition and subtraction are performed digit-wise using the Boolean expression given in [1], which is built from logical OR (∨) and XOR (⊕) operations. In the representation used, negation and multiplication of a GF(3) element by two are equivalent operations, performed by swapping the most and least significant bits of the digit representing the element. Therefore, subtraction in GF(3^m) is as efficient as addition in the same field, and the same adder block is used for both operations. If subtraction is needed, the bits in each digit of the subtrahend are swapped before being connected to the adder block. Since this is achieved by wiring alone, no additional hardware resource is used. When implemented on FPGAs, two 4-input look-up tables (LUTs) are used for each GF(3) addition. Since one slice is composed of two LUTs, an addition of two m-digit GF(3^m) elements uses m slices. This result is almost the same in all papers implementing characteristic three addition, such as [10]. The delay of the addition operation is 5.061 ns on a Xilinx Virtex2p 100 device.
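As a sketch of how such a digit-wise adder can work, the Python fragment below uses one OR/XOR formulation that is valid for the encoding 0 → 00, 1 → 01, 2 → 10. It is an assumption that this matches the expression in [1], which is not reproduced here; the function names are hypothetical. Each output bit depends on exactly four input bits, which is consistent with one 4-input LUT per output bit.

```python
# Digits are (high bit, low bit) pairs: 0 -> (0, 0), 1 -> (0, 1), 2 -> (1, 0).
ENC = {0: (0, 0), 1: (0, 1), 2: (1, 0)}
DEC = {bits: digit for digit, bits in ENC.items()}


def gf3_add(a, b):
    """Add two GF(3) digits using only OR and XOR (assumed formulation)."""
    a_hi, a_lo = a
    b_hi, b_lo = b
    t = (a_lo | b_hi) ^ (a_hi | b_lo)
    c_hi = (a_lo | b_lo) ^ t
    c_lo = (a_hi | b_hi) ^ t
    return (c_hi, c_lo)


def gf3_sub(a, b):
    """Subtract by swapping the bits of the subtrahend, then adding."""
    b_hi, b_lo = b
    return gf3_add(a, (b_lo, b_hi))


def add_elements(x, y):
    """Digit-wise addition of two GF(3^m) elements given as digit lists."""
    return [gf3_add(a, b) for a, b in zip(x, y)]
```

The subtraction reuses the adder unchanged, mirroring the shared adder block described above.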
Cubing
The modified Duursma-Lee algorithm requires a cubing operation in GF(3^{6m}), and it is possible to build a parallel architecture from GF(3^m) cubing blocks, as explained in the next section. Cubing is a linear operation in characteristic three and we adopt the technique presented in [15]. For characteristic three, the Frobenius map of an element a(x) = Σ_{i=0}^{m-1} a_i x^i is a(x)^3 = Σ_{i=0}^{m-1} a_i x^{3i} mod p(x).
This sum can be split into three terms, T + U + V. Here the degrees of the second term U and the third term V are larger than m and need to be reduced. For p(x) = x^m + p_t x^t + p_0 with t < m/3, the reduced terms take the form shown in [15]. We optimize the reduction for the well known polynomial p(x) = x^97 + x^16 + 2 and calculate the terms to be added so that the reduction is achieved in the same clock cycle. Optimizing for a specific polynomial results in a very efficient implementation. We use 111 GF(3) adders to complete the cubing operation, and the critical path consists of three serially connected GF(3) adders. As seen from Table 1, our implementation is 2.5 times more efficient than the implementations in [6] and [15]. Although the implementation details of the cubing circuits are not clear in [6] and [15], the improvement in the slice and LUT counts is most likely due to the register-free design and to performing the reduction for a fixed polynomial.
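The linearity of cubing can be checked with a small software model. The sketch below spreads each coefficient a_i to degree 3i and then folds the high-degree terms back using x^97 = 2x^16 + 1, which follows from p(x) = x^97 + x^16 + 2 over GF(3). It is a sequential illustration only, not the combinational circuit with 111 adders; the function names are hypothetical, and `mul` is a schoolbook reference used just for checking.

```python
M, T = 97, 16    # p(x) = x^97 + x^16 + 2, the polynomial named in the text


def reduce_mod_p(coeffs):
    """Reduce a coefficient list (index = degree) modulo p(x) over GF(3)."""
    c = [v % 3 for v in coeffs]
    # x^97 = -x^16 - 2 = 2*x^16 + 1 in GF(3); fold from the top degree down.
    for d in range(len(c) - 1, M - 1, -1):
        v = c[d]
        if v:
            c[d] = 0
            c[d - M + T] = (c[d - M + T] + 2 * v) % 3
            c[d - M] = (c[d - M] + v) % 3
    return (c + [0] * M)[:M]


def cube(a):
    """Cube an element of GF(3^97): move a_i to degree 3i, then reduce."""
    spread = [0] * (3 * (M - 1) + 1)
    for i, v in enumerate(a):
        spread[3 * i] = v % 3
    return reduce_mod_p(spread)


def mul(a, b):
    """Schoolbook reference multiplication, used only to check cube()."""
    prod = [0] * (2 * M - 1)
    for i, x in enumerate(a):
        if x:
            for j, y in enumerate(b):
                prod[i + j] = (prod[i + j] + x * y) % 3
    return reduce_mod_p(prod)
```

Because no coefficient products appear in `cube`, the hardware equivalent needs only adders, which is what makes a single-cycle, register-free cubing block feasible.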
Multiplication
Multiplication is the most important operation for pairing implementations due to its complexity. The modified Duursma-Lee algorithm requires GF(3^{6m}) multiplication, which is built from GF(3^m) multipliers as explained in the next section.
Finally, digit multipliers are very similar to serial multipliers, but they process w coefficients of the multiplier in each clock cycle rather than a single coefficient. Consequently, the operation is completed in m/w cycles. The area consumption is higher than that of serial multipliers and increases with w. Since the critical path delay also increases with w, choosing w is an important decision influenced by area and time constraints. We prefer to use serial multipliers in our implementation; they incur more clock cycles but provide a better solution in terms of area and frequency. Serial multipliers fall into two classes: i) least-significant-element-first (LSE) and ii) most-significant-element-first (MSE). Although there is not much difference between the two types, we implement the LSE multiplier. As illustrated in Algorithm 2, the reduction is performed in interleaved fashion, using x^m ≡ -p_{m-1} x^{m-1} - ... - p_1 x - p_0.
Two LSE multipliers are designed to examine the effects of fixed versus generic polynomials on time and space complexity. The advantage of the generic design is its flexibility: it can be used with any irreducible polynomial in characteristic three. In the case of a fixed polynomial, the coefficients of the polynomial can be hard-coded into the multiplier unit, which reduces design complexity. For the fixed irreducible polynomial x^97 + x^16 + 2, used in many IBE implementations in the literature, only two GF(3) additions are needed in each iteration of the interleaved reduction. As illustrated in Table 2, the multiplier with the hard-coded irreducible polynomial is 30% smaller than the generic multiplier. The proposed GF(3^m) LSE multiplier architecture is shown in Figure 1. The proposed multiplier is implemented for m = 97 on a Virtex2p-100 for comparison purposes, since it is the same Xilinx device used in [6].
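A software model of an LSE serial multiplier with interleaved reduction may help to make the cycle count and the "two additions per iteration" claim concrete. Algorithm 2 is not reproduced in the text, so the sketch below follows the standard LSE scheme as an assumption: in each of the m iterations the accumulator receives b_i times the current multiplicand, and the multiplicand is multiplied by x and reduced with the hard-coded polynomial x^97 + x^16 + 2. The names `times_x` and `lse_multiply` are hypothetical.

```python
M, T = 97, 16    # fixed polynomial p(x) = x^97 + x^16 + 2 from the text


def times_x(a):
    """Multiply a by x and reduce: one interleaved reduction step."""
    top = a[M - 1]                       # digit that overflows to degree 97
    shifted = [0] + a[:M - 1]
    # x^97 = 2*x^16 + 1 in GF(3): exactly two GF(3) additions per step.
    shifted[T] = (shifted[T] + 2 * top) % 3
    shifted[0] = (shifted[0] + top) % 3
    return shifted


def lse_multiply(a, b):
    """Least-significant-element-first multiplication in GF(3^97)."""
    a = [v % 3 for v in a]
    acc = [0] * M
    for i in range(M):                   # one iteration models one clock cycle
        b_i = b[i] % 3
        if b_i:
            acc = [(c + b_i * v) % 3 for c, v in zip(acc, a)]
        a = times_x(a)
    return acc
```

A generic-polynomial variant would replace the two hard-coded updates in `times_x` with a loop over all coefficients of p(x), which is the extra logic the fixed design avoids.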
As observed in Table 2, the fixed multiplier is nearly 2.5 times smaller than the architecture in [6], and the generic multiplier consumes around 60% of the area of the same architecture. As a result, to the best of our knowledge, our architecture is better than the architectures in the literature.
[Figure 1: proposed GF(3^m) LSE multiplier architecture; recoverable labels: A register, output register.]
Table 4 is for D = 4, which consumes 1821 slices.
Hardware Implementation of Tate Pairing Based on Modified Duursma Lee
GF(3^{6m}) can be considered as an extension field over GF(3^{2m}) with irreducible polynomial z^3 - z ± 1 [6]. As also suggested in [6], multiplication in GF(3^{6m}) can be done in two steps: i) Karatsuba multiplication of polynomials with coefficients from GF(3^{2m}), and ii) reduction with the irreducible polynomial z^3 - z ± 1. The reader may refer to [6] for further details.

Figure 2: GF(3^{6m}) multiplier unit from [6]

Figure 2 illustrates the GF(3^{6m}) Karatsuba multiplier unit proposed in [6], where nodes represent GF(3^{2m}) adders, subtracters, and multipliers. Similarly, GF(3^{2m}) can be seen as an extension field over GF(3^m) with irreducible polynomial y^2 + 1. Since the adder/subtracter units operate on the corresponding coefficients of the operand polynomials, their structure is the same as that of GF(3^m) adders. The GF(3^{2m}) multiplier, on the other hand, consists of GF(3^m) adders, subtracters, and multipliers, as seen in Figure 3.

Figure 3: GF(3^{2m}) multiplier unit from [6]

As seen in Figure 2, the GF(3^{6m}) Karatsuba multiplier has five GF(3^{2m}) elements as output.
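The two-step structure (Karatsuba, then reduction) can be sketched in Python as follows. To keep the example self-contained, the base field is the toy case m = 1, so base elements are integers mod 3; in the accelerator each base operation would be a GF(3^m) block. The sign choice z^3 = z + 1 (polynomial z^3 - z - 1), the function names, and the counter `COUNT` are assumptions made for illustration, and the adder arrangement is not claimed to match Figure 2.

```python
COUNT = {"muls": 0}          # counts base-field multiplications


def bmul(a, b):
    COUNT["muls"] += 1
    return (a * b) % 3


def add2(a, b):
    return ((a[0] + b[0]) % 3, (a[1] + b[1]) % 3)


def sub2(a, b):
    return ((a[0] - b[0]) % 3, (a[1] - b[1]) % 3)


def mul2(a, b):
    """(a0 + a1*y)(b0 + b1*y) with y^2 = -1, using three base multiplications."""
    p0 = bmul(a[0], b[0])
    p1 = bmul(a[1], b[1])
    pm = bmul((a[0] + a[1]) % 3, (b[0] + b[1]) % 3)
    return ((p0 - p1) % 3, (pm - p0 - p1) % 3)


def mul6(a, b):
    """Multiply a0 + a1*z + a2*z^2 by b0 + b1*z + b2*z^2, assuming z^3 = z + 1."""
    d0, d1, d2 = mul2(a[0], b[0]), mul2(a[1], b[1]), mul2(a[2], b[2])
    d01 = mul2(add2(a[0], a[1]), add2(b[0], b[1]))
    d02 = mul2(add2(a[0], a[2]), add2(b[0], b[2]))
    d12 = mul2(add2(a[1], a[2]), add2(b[1], b[2]))
    # Step i: five Karatsuba output coefficients for z^0 .. z^4.
    e0 = d0
    e1 = sub2(sub2(d01, d0), d1)
    e2 = add2(sub2(sub2(d02, d0), d2), d1)
    e3 = sub2(sub2(d12, d1), d2)
    e4 = d2
    # Step ii: reduce with z^3 = z + 1 and z^4 = z^2 + z.
    return (add2(e0, e3), add2(add2(e1, e3), e4), add2(e2, e4))
```

Six GF(3^{2m}) products of three base multiplications each give 18 base multiplications per call, and since none of them depends on another they can run in parallel.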
To summarize, 18 GF(3^m) multipliers and 52 GF(3^m) adders are used in one GF(3^{6m}) multiplier. The advantage of the proposed architecture is that a GF(3^{6m}) multiplication is completed within m clock cycles, the same as a GF(3^m) multiplication (see footnote 3). In order to explore reduction strategies, we develop two implementations: i) all the blocks operate in parallel, and ii) the number of adders after the multipliers is limited to four and the operations are scheduled. The second approach increases the number of clock cycles by five (2.5% of all operations), but significantly reduces the area consumed by adders. Similarly, we tried a scheduling approach to decrease the number of multipliers. However, scheduling did not give successful results on FPGA; it in fact increased the number of slices by around 5%. We leave the scheduling approach for ASIC implementations as future work, since it may save chip area there. Lastly, for additions and subtractions we use the same adder block, simply rewiring the inputs to swap the bits of the subtrahend, since this negates GF(3) elements in the employed representation.

The second GF(3^{6m}) block performs the cubing operation and, as in the case of the multiplier, it is constructed from arithmetic units of the base field GF(3^{2m}), as proposed in [6]. As shown in Figure 4, the GF(3^{6m}) cubing circuitry includes three adder/subtracters and three cubing blocks in GF(3^{2m}), while the GF(3^{2m}) cubing circuit includes two GF(3^m) cubing circuits with no additional overhead other than negation.

Figure 4: GF(3^{6m}) cubing unit from [6]
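The block counts for cubing follow from the tower structure, as the following Python sketch suggests. It again uses the toy base field m = 1 (where cubing a GF(3) element is the identity) and assumes the sign choice z^3 = z + 1, so that z^6 = z^2 - z + 1; all names are hypothetical and the arrangement is not claimed to match Figure 4 exactly.

```python
def cube_base(a):
    """Stand-in for a GF(3^m) cubing block; for m = 1, a^3 = a in GF(3)."""
    return a % 3


def add2(a, b):
    return ((a[0] + b[0]) % 3, (a[1] + b[1]) % 3)


def sub2(a, b):
    return ((a[0] - b[0]) % 3, (a[1] - b[1]) % 3)


def cube2(a):
    """(a0 + a1*y)^3 = a0^3 - a1^3*y, since y^2 = -1 gives y^3 = -y."""
    return (cube_base(a[0]), (-cube_base(a[1])) % 3)


def cube6(c):
    """(c0 + c1*z + c2*z^2)^3 with z^3 = z + 1 and z^6 = z^2 - z + 1."""
    c0, c1, c2 = cube2(c[0]), cube2(c[1]), cube2(c[2])
    r0 = add2(add2(c0, c1), c2)     # two adders
    r1 = sub2(c1, c2)               # one subtracter
    r2 = c2
    return (r0, r1, r2)
```

The sketch uses three GF(3^{2m}) cubing blocks and three adder/subtracters, and each `cube2` needs two base cubings plus one negation, in line with the counts stated above.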
Proposed Coprocessor Architecture
After building the efficient blocks needed for our accelerator, we design a control unit and a datapath for the Tate pairing operation. The operation can be divided into two main phases: initialization and loop. The operations are described in detail in Table 3. When the initialization is completed, the accelerator starts operating in a loop. For the entire operation, we use only one GF(3^{6m}) multiplier for step 10, one GF(3^{6m}) cubing circuit for step 9, two GF(3^m) cubing circuits for steps 5 and 6, two GF(3^m) multipliers for step 8, and a number of adders. The main advantage of our accelerator is that most of the operations are completed in a single clock cycle. If the adder and cubing circuits were implemented with registers, the clock count would increase by around 400 and the registers would increase the area of the accelerator.

Footnote 3: The adders before and after the 18 GF(3^m) multipliers are, in fact, in the critical path; therefore they do not add to the cycle count.

FPGA Implementation: In this work we mapped our blocks and the whole accelerator to a Xilinx Virtex2Pro100 device, since the previous works on the subject used the same device. Two different versions of our hardware pairing accelerator are synthesized. As seen from Table 4, our implementation of the modified Duursma-Lee algorithm is almost three times (2.93) better than the previous implementation in the literature in terms of execution time and consumes nearly one-fourth of the area estimated for other implementations (namely [6]). In terms of area-time product, our Tate pairing accelerator with the fixed multiplier is 12 times better than the one in [6], and the one with the generic multiplier is 9 times better than the same implementation. In addition, our hardware implementation shortens the calculation time by nearly sixteen times compared to the software implementation reported in [17].

ASIC Implementation: We also synthesized the VHDL code of the algorithm for 0.25 µm CMOS technology. The total cell area is 4.3 mm^2, excluding the buffers needed to satisfy the clock tree and static timing analysis specifications. The implementation consumes around 10 mm^2 of chip area after routing with a 5-metal technology.
Our ASIC implementation has reached the frequency of 78 MHz and completes the pairing in 250 µs. We should note here that Virtex-2 devices are based on 90 nm CMOS technology with 9 metal routing layers.
Implementation Aspects

Conclusion
We first developed the sub-blocks for arithmetic in GF(3^m) needed for operations in characteristic three. After achieving good results in the sub-blocks, we implemented GF(3^{6m}) arithmetic using them in a parallel manner. Finally, we developed our accelerator, which computes the Tate pairing in 250 µs on a commonly available FPGA. The implementation results show that the final area-time product of the design is 12 times better than the estimated values given for a previous implementation of the same algorithm in the literature, and that the total computation time is 16 times better than that of software implementations. Another advantage of our accelerator is its suitability for ASIC implementation, since no flip-flops are used in the cubing and addition blocks and their use in the multiplier blocks is limited. On this basis we built the first ASIC implementation of the modified Duursma-Lee algorithm using 0.25 µm CMOS technology and presented the area and operating frequency of the circuit. For future work, we plan to build a crypto coprocessor based on our Tate pairing implementation.
