Abstract-Pairings on hyperelliptic curves have been applied to many cryptographic schemes, and it is important to exploit methods that increase the speed of various pairings and their curves. Additionally, multiple pairings should be performed efficiently in some cryptographic applications, such as attribute-based encryption or functional encryption. We propose an efficient extension field construction method that defines a curve and its pairing. We also implemented parallel arithmetic on extension fields and multiple pairings in parallel, and we report experimental timing results. We achieved timings of 12.7 ms and 52.0 ms per pairing when computing 1248 pairings on an NVIDIA Tesla K20c GPU. We took the extension degree of the base field $m = 487$, which is greater than the parameters $m = 367, 439$ that were considered appropriate for the pairing at the 128-bit security level. By normalizing the experimental results, we achieved a certain level of speedup of the pairing compared to the state-of-the-art CPU implementation. In addition, we achieved scalability with respect to the extension degree of the base field in our parallel implementation by performing Karatsuba multiplications between multiple elements of the extension field in parallel.
I. INTRODUCTION
Koblitz [1] suggested a hyperelliptic cryptosystem using Jacobians of hyperelliptic curves as arithmetic generalizations of groups of elliptic curves. Arithmetic on Jacobians of hyperelliptic curves is more complex than on elliptic curve groups. In exchange, we can use smaller finite fields; i.e., we can employ smaller keys by using higher genus curves to achieve the same level of security.
Pairings on elliptic curves or higher genus curves have attracted significant attention and have been applied to many cryptographic schemes, such as ID-based cryptography. Generally, calculation methods for pairings are complex and the cost of pairings is considerably higher than that of arithmetic on curves. In addition, the cost is significantly higher when using algebraic curves of higher genus.
Modern general-purpose graphics processing unit (GPU) technology has advanced significantly, and its use in high-level cryptography implementations has increased rapidly. There has been much research on increasing the speed of multiple-precision arithmetic or arithmetic on finite fields using GPUs, which is explored further in Section II.

Manuscript received December 30, 2013; revised February 27, 2014.
M. Ishii is with Nara Institute of Science and Technology, Nara, Japan (e-mail: masahiro-i@is.naist.jp).
A. Inomata is with Initiative Center, Nara Institute of Science and Technology, Nara, Japan (e-mail: atsuo@itc.naist.jp).
K. Fujikawa is with Information Initiative Center, Nara Institute of Science and Technology, Nara, Japan (e-mail: fujikawa@itc.naist.jp).
In this study, we consider the parallelization of arithmetic on extension fields. The pairing algorithm is well suited to our parallel algorithm. As there are many parameters for pairings, implementing pairings on a GPU lets us exploit parallelization methods and extend the program code flexibly. If the characteristic of the field defining the curve and pairing is large, we can compute elements of the field in parallel as modular arithmetic on prime fields using a GPU [2]-[5]. On the other hand, if the characteristic is small, we can implement arithmetic on the field efficiently using a GPU and polynomial bases. Y. Katoh, Y. Huang, C. Cheng, and T. Takagi [6] implemented arithmetic on $\mathbb{F}_{3^m}$ and the $\eta_T$ pairing defined over $\mathbb{F}_{3^m}$ using a GPU. They succeeded in accelerating the process significantly by computing multiple pairings in a bit-sliced fashion. As the field characteristic is small, the degrees of the polynomials calculated as elements of the extension field are large; therefore, we can use the power of a GPU effectively within the context of parallelization.
Having focused on a parallel implementation of arithmetic on (extension) fields using polynomial bases, we have developed a practical and efficient method for parallelizing arithmetic on extension fields and a method for constructing extension fields that is suitable for our parallel algorithm. Indeed, we implemented a parallel pairing on a supersingular genus-two curve. We then used basis conversion when computing the pairing to change the extension field construction, making it suitable for parallel arithmetic on fields. We also achieved a speedup of the pairing using a different extension field construction method based on [7].
The remainder of this paper is organized as follows. We describe work related to the state-of-the-art software implementation of pairings and a GPU implementation for parallel modular arithmetic and pairings in Section II. In Section III, we recall pairing on a genus-two curve over a binary field and its algorithm. We then describe recent research on the discrete logarithm algorithm in a finite field of small characteristic and the security level for the pairing over binary fields. Section IV presents the detailed methodology of our parallel algorithm for arithmetic on extension fields and pairing. We then report experimental timing results of the pairing implementation on a GPU in Section V. Finally, we present our conclusions and suggestions for future work in Section VI.
II. RELATED WORK
Here, we summarize state-of-the-art work related to a software implementation of pairings. First, we describe some work concerning the efficient implementation of pairings on a CPU. We then describe some research on implementing multiple-precision arithmetic on finite fields, modular arithmetic, and pairings, or other cryptographic applications using GPUs.
J. Beuchat, E. López-Trejo, L. Martínez-Ramos, S. Mitsunari, and F. Rodríguez-Henríquez [8] implemented a reduced modified Tate pairing ($\eta_T$ pairing) on supersingular elliptic curves and designed a fast multi-core library using single-instruction multiple-data instructions. They reported a calculation time of just 1.87 ms on Intel Core i7 architectures for the Tate pairing at the 128-bit security level.
In [9], Beuchat et al. described the design of a fast software library for the computation of the optimal Ate pairing on a Barreto-Naehrig elliptic curve. They reported that the optimal Ate pairing at the 126-bit security level took 2.33 million clock cycles on a single core of an Intel Core i7 2.8 GHz processor.
D. F. Aranha, J. López, and D. Hankerson [10] implemented the $\eta_T$ pairing over binary supersingular curves at the 128-bit security level in parallel (using two types of parallelism: vector instructions and multiprocessing). They reported parallel timings 66% faster than the result of [8]. S. Chatterjee, D. Hankerson, and A. Menezes [11] reported preliminary timings for Type 1 (symmetric) pairings on supersingular genus-2 curves of characteristic 2 at the 128-bit security level. The pairing over $\mathbb{F}_{2^{439}}$ took 16.4 million clock cycles on an Intel Core 2 processor.
In [12], D. F. Aranha, K. Karabina, P. Longa, C. H. Gebotys, and J. López described efficient formulas for computing optimal Ate pairings on ordinary elliptic curves over prime fields. Their techniques allowed, for the first time, a pairing to be computed in under 2 million cycles on a 64-bit processor, improving the result of [9] by 28%-34%.
D. F. Aranha, J. Beuchat, J. Detrey, and N. Estibals [13] presented a novel optimal Eta pairing algorithm on supersingular genus-2 binary hyperelliptic curves. According to their experimental results from a software implementation, an optimal Eta pairing on a genus-2 curve over $\mathbb{F}_{2^{367}}$ took 4.44 and 2.75 million clock cycles on an Intel Core 2 processor and a Core i5 32 nm (Nehalem microarchitecture) processor, respectively.
Recently, Mitsunari [14] reported an efficient implementation of an optimal Ate pairing at the 126-bit security level (as in [9], [12]) on an Intel Haswell processor. The mulx instruction supported by the Haswell processor was used to achieve the pairing in only 1.17 million clock cycles.
Efforts to speed up modular arithmetic using the power of GPUs are reported in [2], [5]. In [5], the authors implemented three modular arithmetic operations on a GPU: addition, subtraction, and multiplication. They used Montgomery's method for multiplication to avoid division. They also implemented arithmetic on elliptic curves using modular multiplication on a GPU.
Radix representation or a residue number system (RNS) can be used for modular multiplication in large finite fields. According to [3], [4], using a radix representation is superior to an RNS with regard to parallel implementation of modular multiplication on GPUs.
Katoh et al. [6] implemented arithmetic on $\mathbb{F}_{3^m}$ and an $\eta_T$ pairing defined over $\mathbb{F}_{3^m}$ using a GPU. They succeeded in achieving significant speedups by computing multiple pairings in a bit-sliced fashion. Essentially, they parallelized modular arithmetic represented by a polynomial basis by performing arithmetic on the base field $\mathbb{F}_3$ in each thread. They implemented modular multiplication on a GPU by parallelizing the comb method on a finite field [6, §3.2, Implementation II]. In addition, they implemented and evaluated parallel multiplication by computing 32 operations in a bit-sliced fashion. They reported that the pairing over $\mathbb{F}_{3^{509}}$ took 3.01 ms on an NVIDIA GTX 480 and concluded that their GPU implementation for larger fields was slower than multi-core CPU implementations [8] owing to the limited fast on-die memory on the GPU.
Y. Zhang, C. Jason-Xue, D. S. Wong, N. Mamoulis, and S. M. Yiu [15] were the first to present an evaluation of bilinear pairings over composite-order groups on graphics card hardware. They implemented parallelized base field operations via an RNS and performed multiple pairings to occupy the hardware resources. According to their experimental results, in 1024-bit base fields, the NVIDIA GTX 480 achieved a running time of 8.7 ms per pairing, which is 19.6 times faster than the state-of-the-art CPU implementation.
III. PAIRING AND SECURITY LEVEL
A. Algorithm of $\eta_T$ Pairing
Here, we describe the $\eta_T$ pairing [7] and discuss some of its properties. Barreto et al. introduced the $\eta_T$ pairing in [7] for supersingular curves as a generalization of the Duursma-Lee technique [16].
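Suppose that $C/\mathbb{F}_q$ is a supersingular curve with embedding degree $k > 1$ and that a distortion map $\psi : C(\mathbb{F}_q) \to C(\mathbb{F}_{q^k})$ is present. This allows denominator elimination, i.e., for $Q \in C(\mathbb{F}_q)$, $\psi(Q) \in C(\mathbb{F}_{q^k})$ has an x-coordinate in $\mathbb{F}_{q^{k/2}}$. Then, for $T \in \mathbb{Z}$, the Eta pairing ($\eta_T$ pairing) is given by
$$\eta_T(P, Q) = f_{T,P}(\psi(Q)),$$
where $f_{T,P}$ is the Miller function associated with $T$ and $P$.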
Barreto et al. generalized the Duursma-Lee techniques, including effective calculation of divisors and the use of a Frobenius map, directly in Miller's algorithm. They succeeded in generalizing a loop shortening idea to many other cases. They described the $\eta_T$ pairing on a supersingular genus-two curve in detail in [7, §7]. We use this curve and consider the $\eta_T$ pairing under the same conditions. We consider the supersingular curve
$$C : y^2 + y = x^5 + x^3 + d, \quad d \in \{0, 1\},$$
over $\mathbb{F}_{2^m}$, which has an embedding degree of 12; therefore, we have to perform arithmetic on the extension field $\mathbb{F}_{2^{12m}}$. In this study, we implemented the pairing using two methods to construct an extension field.
In the first method, we construct $\mathbb{F}_{2^{12m}}$ according to [7, §7.1], starting with a sixth-degree extension. We then construct a quadratic extension as follows:

We refer to this polynomial basis as the -basis. In this paper, we examine a specific algorithm but do not describe the functions and the algorithm for the pairing in detail in each case. According to [7, §7], when we use the $s_0w$-basis, we can compute the pairing as in [7, §7.3, Algorithm 4]. We also present the pairing algorithm using the -basis as Algorithm 1, using the same notation as in [7, §7.3, Algorithm 4].
As shown in Algorithm 1, we can calculate pairings in the same manner irrespective of which polynomial basis is used. However, using the -basis reduces the cost of the multiplication in line 20, so it can be implemented efficiently in parallel. With the $s_0w$-basis, we need to perform 13 multiplications on the base field by using the Karatsuba method to compute this product. We can reduce the number of multiplications to 9 by constructing the 12th-degree extension field of $\mathbb{F}_{2^m}$ with the -basis.
B. Security Level for $\eta_T$ Pairing
Security parameters for the pairings were chosen assuming Coppersmith's algorithm [17] or a generalization thereof [18] with heuristic complexity $L_Q[\alpha, c]$, where
$$L_Q[\alpha, c] = \exp\left((c + o(1)) (\log Q)^{\alpha} (\log\log Q)^{1-\alpha}\right).$$
Therefore, the pairing over $C/\mathbb{F}_{2^m}$ at the 128-bit security level was implemented by choosing $m = 367, 439$, since $\mathbb{F}_{2^{12 \cdot 367}}$ and $\mathbb{F}_{2^{12 \cdot 439}}$ were assumed to offer 128-bit security against Coppersmith attacks.
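The heuristic complexity above can be evaluated numerically. The following is a minimal sketch, assuming the usual L-notation $\exp(c (\log Q)^{\alpha} (\log\log Q)^{1-\alpha})$ with the $o(1)$ term dropped, and assuming $\alpha = 1/3$ and $c = (32/9)^{1/3}$ for Coppersmith's algorithm. These constants and the function name `l_notation_bits` are illustrative assumptions and are not taken from the text.

```python
import math

def l_notation_bits(field_bits, alpha, c):
    """Return log2 of L_Q[alpha, c] for Q = 2**field_bits, ignoring the o(1) term."""
    ln_q = field_bits * math.log(2)
    ln_l = c * ln_q ** alpha * math.log(ln_q) ** (1 - alpha)
    return ln_l / math.log(2)

# Assumed constants for Coppersmith's algorithm (not stated in the text).
COPPERSMITH_ALPHA = 1 / 3
COPPERSMITH_C = (32 / 9) ** (1 / 3)

# Extension degrees m discussed in the paper; the full field is GF(2^(12*m)).
for m in (367, 439, 487, 967):
    bits = l_notation_bits(12 * m, COPPERSMITH_ALPHA, COPPERSMITH_C)
    print(f"m = {m}: roughly 2^{bits:.1f} operations")
```

Under these assumptions the estimate for $m = 367$ comes out near $2^{128}$, which is in line with the parameter choice described above. The sketch says nothing about the newer algorithms discussed next.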
Recently, theoretical and practical advances have been made in efficient algorithms for the discrete logarithm problem (DLP) in fields of small characteristic [19]-[23]. G. Adj, A. Menezes, T. Oliveira, and F. Rodríguez-Henríquez [24] explained how the new algorithms by Joux [21] and R. Barbulescu, P. Gaudry, A. Joux, and E. Thomé [23] could be combined to solve the DLP in $\mathbb{F}_{3^{6 \cdot 509}}$ faster than the Joux-Lercier algorithm [25]. They estimated the complexity (number of multiplications) to solve the DLP in 3
In Fig. 1, we compare the heuristic complexity of solving the DLP in $\mathbb{F}_{2^{12m}}$ with the Coppersmith algorithm, the Joux-Lercier algorithm, and Joux's pinpointing method. For the implementation of the pairing over $C/\mathbb{F}_{2^m}$ at the 128-bit security level, we take two extension degrees, $m = 487, 967$.
International Journal of Computer and Communication Engineering, Vol. 3, No. 3, May 2014
Since Adj et al. estimated the cost of the DLP algorithm using the Joux-Lercier pinpointing method in $\mathbb{F}_{2^{12 \cdot 367}}$ at $2^{91.6}$, we should take $m$ greater than at least $m = 439$. Although a concrete analysis of the complexity of the DLP algorithm in $\mathbb{F}_{2^{12m}}$ is needed, as shown in [24, Appendix A], we evaluate a parallel implementation of the pairing over $C/\mathbb{F}_{2^{487}}$. In the second case, we take the extension degree $m = 967$, which seems an adequate parameter at the 128-bit security level, and perform an experimental simulation of the pairing computation using a GPU.
IV. METHODOLOGY
In this section, we present the proposed method for parallelizing arithmetic on base fields and extension fields. Katoh et al. [6] implemented the $\eta_T$ pairing on a GPU. Similarly, we parallelized arithmetic on extension fields constructed by irreducible polynomials. We parallelized the arithmetic in a straightforward manner to allow flexible computation of extension field elements. In addition, we propose an approach that combines a parallelized Karatsuba method with parallel arithmetic on extension fields.
A. Parallel Computation of Multiplication on Base Fields and Extension Fields
First, we describe how we implement operations in the base field $\mathbb{F}_{2^m}$. We represented elements of the base field as polynomials stored in uint64_t arrays. We adopted the left-to-right comb window method [27] and, based on experimental results, chose a word size of 64 bits and a window size of 4.
Katoh et al. implemented the $\eta_T$ pairing on a GPU by parallelizing the comb method on finite fields [6, §3.2, Implementation III] in a bit-sliced fashion. We implemented a straightforward parallelization of polynomial arithmetic as arithmetic on the base field. Indeed, for elements of the base field
$$a(x) = \sum_{i=0}^{m-1} a_i x^i, \quad b(x) = \sum_{i=0}^{m-1} b_i x^i,$$
we compute the coefficients of $a(x) + b(x)$ by using the XOR operation $a_i \oplus b_i$ in parallel, and compute each $c_i$, where
$$c(x) = a(x) \cdot b(x) = \sum_{i} c_i x^i,$$
in parallel with the comb window method. In the comb window method, we add polynomials as elements of the base field to $c(x)$ in $m$ threads, and we perform the other calculations serially.
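The base field operations described above can be sketched serially as follows. This is a minimal illustration, assuming little-endian lists of 64-bit words in place of uint64_t arrays and a window size of 4; the function names are hypothetical, reduction modulo the field polynomial is omitted because the text does not give it, and the per-word XOR loops mark the work that independent threads could share.

```python
WORD_BITS = 64
WINDOW = 4
MASK = (1 << WORD_BITS) - 1

def poly_add(a, b):
    """Add two GF(2)[x] polynomials: a word-wise XOR, each word independent."""
    return [x ^ y for x, y in zip(a, b)]

def shift_left(words, bits):
    """Multiply a word array by x**bits (0 < bits < 64), keeping its length."""
    out, carry = [], 0
    for word in words:
        out.append(((word << bits) & MASK) | carry)
        carry = word >> (WORD_BITS - bits)
    return out

def comb_window_mul(a, b):
    """Left-to-right comb multiplication with a 4-bit window (no reduction).

    a and b have the same number of words t; the result has 2*t words.
    """
    t = len(a)
    # Precompute u(x) * b(x) for every polynomial u of degree < WINDOW.
    # Each entry gets one spare word for the overflow above b's top word.
    table = [[0] * (t + 1)]
    for u in range(1, 1 << WINDOW):
        if u & 1:
            table.append(poly_add(table[u - 1], b + [0]))
        else:
            table.append(shift_left(table[u >> 1], 1))
    c = [0] * (2 * t)
    for k in range(WORD_BITS // WINDOW - 1, -1, -1):
        for j in range(t):
            u = (a[j] >> (WINDOW * k)) & ((1 << WINDOW) - 1)
            # These per-word XORs are independent of one another.
            for i, word in enumerate(table[u]):
                c[j + i] ^= word
        if k:
            c = shift_left(c, WINDOW)
    return c

# Small usage example: (x + 1) * (x^2 + 1) with one-word operands.
product = comb_window_mul([0b11], [0b101])
```

The table lookup replaces four single-bit steps with one accumulation, which is the reason for using a window. Only the accumulation into `c` is naturally parallel across words, while the shifts carry bits between neighbouring words.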
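In this study, we implemented arithmetic on extension fields using only the Karatsuba method. The Karatsuba method can be generalized for polynomials of arbitrary degree [28, §3.2, Algorithm 2]. We consider two polynomials of degree $d$,
$$A(x) = \sum_{i=0}^{d} a_i x^i, \quad B(x) = \sum_{i=0}^{d} b_i x^i.$$
For each $i = 0, \ldots, d$, we compute
$$D_i := a_i b_i, \quad (1)$$
and for $0 \le s < t \le d$,
$$D_{s,t} := (a_s + a_t)(b_s + b_t). \quad (2)$$
We can then compute $C(x) = A(x) B(x) = \sum_{i=0}^{2d} c_i x^i$, where
$$c_i = \begin{cases} \sum_{s+t=i,\ 0 \le s < t \le d} (D_{s,t} - D_s - D_t) & \text{for odd } i, \\ \sum_{s+t=i,\ 0 \le s < t \le d} (D_{s,t} - D_s - D_t) + D_{i/2} & \text{for even } i. \end{cases} \quad (3)$$
Thus, we can use the Karatsuba method for multiplication in the extension field and reduce the multiplication to 3, 6, 21, or 78 base field multiplications for extension degree 2, 3, 6, or 12, respectively, that is, $(d+1)(d+2)/2$ multiplications for polynomials of degree $d$. In addition to the above parallel method for operations in the base field, we consider a Karatsuba parallelization method for arithmetic on extension fields. We parallelize the precomputation phase of the Karatsuba method as follows. First, we compute the products $D_i$ of (1) in parallel. At the same time, we compute the sums $a_s + a_t$ and $b_s + b_t$ and the products $D_{s,t}$ of (2) in parallel. After that, we compute (3) serially.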
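The generalized Karatsuba steps can be sketched as follows. This is a minimal serial illustration: the function name `karatsuba_general` and its callback parameters are hypothetical, and integer coefficients are used by default only to keep the example short. Here $D_i = a_i b_i$ and $D_{s,t} = (a_s + a_t)(b_s + b_t)$ for $s < t$; in characteristic 2, both `add` and `sub` would be XOR.

```python
import operator
from itertools import combinations

def karatsuba_general(a, b, mul=operator.mul, add=operator.add, sub=operator.sub):
    """Multiply coefficient lists a and b (same length d + 1).

    Returns (coefficients of the product, number of coefficient multiplications).
    """
    n = len(a)
    # Step (1): D_i = a_i * b_i. All n products are mutually independent.
    d_single = [mul(a[i], b[i]) for i in range(n)]
    # Step (2): D_{s,t} = (a_s + a_t) * (b_s + b_t) for s < t, also independent.
    d_pair = {
        (s, t): mul(add(a[s], a[t]), add(b[s], b[t]))
        for s, t in combinations(range(n), 2)
    }
    # Step (3): combine the partial products into c_0, ..., c_{2d}.
    c = []
    for i in range(2 * n - 1):
        acc = d_single[i // 2] if i % 2 == 0 else None
        for s in range(max(0, i - n + 1), (i + 1) // 2):
            t = i - s  # the range above guarantees s < t <= d
            term = sub(sub(d_pair[(s, t)], d_single[s]), d_single[t])
            acc = term if acc is None else add(acc, term)
        c.append(acc)
    return c, len(d_single) + len(d_pair)

# Count the multiplications for extension degrees 2, 3, 6, and 12.
for n in (2, 3, 6, 12):
    _, count = karatsuba_general(list(range(1, n + 1)), list(range(n, 0, -1)))
    print(n, count)
```

Steps (1) and (2) have no data dependencies between their products, which is what makes them candidates for parallel execution, whereas step (3) only adds and subtracts the precomputed values. The multiplication counts reported by the loop are 3, 6, 21, and 78.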
B. Implementation of $\eta_T$ Pairing on GPU
We implemented our parallel algorithm on a GPU, the NVIDIA Tesla K20c, and used the Compute Unified Device Architecture (CUDA) programming model [29]. The experimental environment is presented in Table I. The extension degree directly affects the time for parallel arithmetic computations on the extension field. We implemented and evaluated the pairing for extension degrees $m = 487$ and $m = 967$, as described in the previous section on the security level. Since elements of the base field are represented as uint64_t arrays, we need arrays of length 8 and 16 for the extension degrees $m = 487$ and $m = 967$, respectively. Therefore, in our implementation, 8 or 16 threads handle each operation in the base field using the CUDA programming model. In addition, we computed multiple pairings in order to use the GPU resources effectively. We implemented arithmetic between multiple elements in base fields and extension fields in parallel by using blocks in CUDA programming. Each block handles calculations with independent field elements in parallel using multiple threads; therefore, we could compute multiple pairings independently.
Basically, we implemented operations in fields by first copying data to global memory, which is an off-chip memory that can be accessed by any thread and whose access time is higher than that of the other memory types. We then stored intermediate variables in shared memory and, after the calculation on the GPU finished, copied the results to main memory. Indeed, we used shared memory to store the precomputation table of the window method, and the limit of the window size when $m = 967$ was 8, since the size of shared memory on the Tesla K20c was 48 KB. We implemented the parallel Karatsuba method of degree $d = 2, 5$ for extension degrees 3 and 6, respectively, and combined the Karatsuba methods in order to perform multiplication on the 12th-degree extension field for each construction using the $s_0w$-basis and the -basis. We computed additions on fields on the CPU, except those performed within the Karatsuba method, since the CPU is fast enough to perform additions between multiple field elements.
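The array lengths and the window-size limit quoted above can be checked with a short calculation. The following is a minimal sketch, assuming one precomputation table per block with $2^w$ entries, each entry as long as one field element, and no other use of shared memory; the helper names are hypothetical.

```python
import math

WORD_BITS = 64
SHARED_MEMORY_BYTES = 48 * 1024  # shared memory size quoted in the text

def words_needed(m):
    """Number of 64-bit words holding a polynomial of degree < m over GF(2)."""
    return math.ceil(m / WORD_BITS)

def table_bytes(m, window):
    """Bytes used by a table of 2**window entries (assumed layout, see above)."""
    return (1 << window) * words_needed(m) * (WORD_BITS // 8)

def max_window(m, budget=SHARED_MEMORY_BYTES):
    """Largest window size whose table still fits in the given budget."""
    window = 1
    while table_bytes(m, window + 1) <= budget:
        window += 1
    return window

for m in (487, 967):
    print(m, words_needed(m), max_window(m))
```

Under these assumptions, a table for $m = 967$ with window size 8 occupies 32 KB, and window size 9 would need 64 KB, which exceeds the 48 KB of shared memory.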
V. EVALUATION
In this section, we show the experimental results of the implementation of the pairing on a GPU. First, we compare the total time to compute multiple pairings on the CPU and the GPU. In Fig. 2 and Fig. 3, we show the total time of the CPU and GPU implementations of multiple pairings over $\mathbb{F}_{2^{487}}$ and $\mathbb{F}_{2^{967}}$, respectively, with the $s_0w$-basis and the -basis. The CPU implementation computes multiple pairings serially with the same parallel algorithm as the GPU implementation. As shown in Fig. 2 and Fig. 3, the GPU implementation is slower than the CPU implementation for a small number of pairings, since the GPU resources are not used sufficiently. The total time in the case of $\mathbb{F}_{2^{967}}$ appears to benefit more from parallelization than in the case of $\mathbb{F}_{2^{487}}$. The construction of the 12th-degree extension field using the -basis has an impact on the timing of the CPU implementation over $\mathbb{F}_{2^{967}}$, since the cost of multiplication is larger than in the case of $m = 487$. We then report the timing per pairing on the GPU in Fig. 4. In this experiment, we achieved timings of 12.7 ms and 52.0 ms per pairing when computing 1248 pairings. The timing per pairing decreases as the number of pairings increases. In addition, although the order of growth of the pairing algorithm is roughly $O(m^3)$, where $m$ is the extension degree, the timing per pairing over $\mathbb{F}_{2^{967}}$ is less than the estimate shown by the uppermost line in Fig. 4.
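The comparison with cubic growth can be reproduced roughly as follows. This is a minimal sketch, assuming that 12.7 ms is the timing for $m = 487$, that 52.0 ms is the timing for $m = 967$, and that the estimate is obtained by scaling the former by $(967/487)^3$; how the estimate line in Fig. 4 was actually computed is not stated in the text.

```python
def cubic_estimate(time_ms, m_from, m_to):
    """Scale a timing from degree m_from to m_to assuming O(m^3) growth."""
    return time_ms * (m_to / m_from) ** 3

measured_487_ms = 12.7  # assumed to be the timing for m = 487
measured_967_ms = 52.0  # assumed to be the timing for m = 967

estimate_967_ms = cubic_estimate(measured_487_ms, 487, 967)
print(f"cubic estimate for m = 967: {estimate_967_ms:.1f} ms")
print(f"measured / estimate: {measured_967_ms / estimate_967_ms:.2f}")
```

Under these assumptions the cubic estimate is a little under 100 ms, so the measured 52.0 ms is roughly half of what cubic growth alone would predict.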
We achieved a timing of 12.7 ms per pairing using a Tesla K20c GPU, whose core clock is 706 MHz. With regard to the security level under the new DLP algorithms, we took the extension degree of the base field $m = 487$, which is greater than $m = 367, 439$. As future work, we will tackle effective management and use of memory, in particular registers, on the GPU. We believe that we can achieve a significant speedup compared to state-of-the-art CPU or GPU implementations by optimizing our approach.
VI. CONCLUSIONS
In this study, we implemented parallel arithmetic on extension fields and multiple pairings in parallel. In addition, we used an effective construction of the extension field so that we could reduce the cost of the pairing. We achieved timings of 12.7 ms and 52.0 ms per pairing when computing 1248 pairings on a Tesla K20c GPU. We took the extension degree of the base field $m = 487$, which is greater than the parameters $m = 367, 439$ that were considered appropriate for the pairing at the 128-bit security level. By normalizing the experimental results, we achieved a certain level of speedup of the pairing compared to the state-of-the-art CPU implementation.
As discussed in Section II and Section III, it is reasonable to consider that the extension degree $m = 487$ is not appropriate for the pairing at the 128-bit security level in fields of characteristic 2. However, the parallelization method for Karatsuba multiplication on extension fields that we proposed can in principle be applied to modular multiplication in the case of large characteristic. In addition, we achieved scalability with respect to the extension degree of the base field in our parallel implementation, and this also holds in the case of large characteristic.
