Pairing-based cryptography provides many novel cryptographic applications, such as ID-based cryptosystems and efficient broadcast encryption. The security problems in ubiquitous sensor networks have been discussed in many papers, and pairing-based cryptography is a crucial technique for solving them. Due to the limited resources of current sensor nodes, it is challenging to optimize the implementation of pairings on them. In this paper we present an efficient implementation of pairing on MICAz, which is widely used as a sensor node for ubiquitous sensor networks. We improved the speed of the η_T pairing by using a new efficient multiplication specialized for ATmega128L, called the block comb method, and several optimization techniques that reduce the number of data load/store operations. The η_T pairing over GF(2^239) takes about 1.93 sec, which is the fastest implementation of pairing on MICAz to the best of our knowledge. With this dramatic improvement, it is now much more feasible to make pairing-based cryptography practical for ubiquitous sensor networks.
Introduction
The technology of wireless sensor networks (WSNs) has been deployed in practical applications of the ubiquitous society. In general, a WSN node has little physical protection and no secure memory for cryptographic keys. It is thus important to develop security solutions for these networks.
Initially, symmetric cryptosystems were used to build secure WSNs. However, they face the key distribution problem. For this reason, conventional public-key cryptosystems such as RSA and elliptic curve cryptosystems (ECC) are considered as alternatives. Public key authentication is typically achieved by means of a public key infrastructure (PKI), which issues certificates and requires the exchange and storage of large keys. These operations, however, cause [...] spotlight because it is an exception in which known information that uniquely identifies a user, such as an email address, can be used as a public key, and thus a PKI is unnecessary. Recently, Oliveira et al. argued that IBE is ideal for WSNs and vice versa [17], [18]. They discussed the synergy between the two systems and described how WSNs can take advantage of IBE.
To make pairing-based cryptography (PBC), including IBE, truly practical in WSNs, it is necessary to optimize the performance of pairings, which are the most expensive operation. In [17], Oliveira et al. implemented the Tate pairing on a supersingular elliptic curve over GF(q) with a 256-bit q (q^2 ≈ 512 bits) on MICAz, which takes about 30 sec. Moreover, Ishiguro et al. implemented an η_T pairing over GF(3^m) in 5.8 sec for m = 97 [12]. TinyPBC [18] by Oliveira et al. and NanoECC [25] by Szczechowiak et al. are implementations of the η_T pairing over GF(2^m) for m = 271. Current pairing implementations (TinyPBC, TinyTate) are slower than RSA (TinyPK [27]) and ECC (TinyECC [14], TinyECCK [23]). Therefore, optimizing the implementation of pairing-based cryptosystems on resource-constrained sensor nodes is a worthwhile research challenge.
In this paper, we propose an efficient implementation of the η_T pairing over GF(2^239) on MICAz with the ATmega128L processor, and compare it with previous works [12], [17], [18]. The target η_T pairing consists of addition, multiplication, reduction, squaring, and inversion, which account for 2.3%, 75%, 6.3%, 4.7%, and 4.5% of the total computation cost, respectively. Namely, multiplication is the dominant operation in the η_T pairing. Thus, we first propose a fast multiplication, called the block comb method, which dramatically reduces the number of load/store operations; it is well known that data loads and stores are very time-consuming operations on a sensor node. With the proposed block comb method, the timing of a multiplication in GF(2^239) is improved to 1.28 msec from 4.6 msec. In addition, we improve the squaring, reduction, and inversion. The shift operation is optimized for ATmega128L, which makes the reduction and squaring faster. The degree function in the inversion is also improved. Consequently, we can compute an η_T pairing over GF(2^239) on MICAz with the ATmega128L processor in about 1.93 sec, which is, to the best of our knowledge, the most efficient implementation of PBC primitives for MICAz with the ATmega128L processor.
The remainder of this paper is organized as follows. In Sects. 2 and 3, we discuss η T pairings and its implementa- 
η_T Pairing and Its Implementation on ATmega128L
In this section, we explain the η_T pairing proposed by Barreto et al. and how to compute it. The η_T pairing is a fast pairing, and there has been much research on it [5], [6], [16], [22], [24], [25].
η_T Pairing over Binary Field
In this section, we explain the η_T pairing over a binary field. Let GF(2^m) be a binary field with extension degree m, where m is an odd prime. Let E be a supersingular elliptic curve defined over GF(2^m),
Let l be the largest odd prime with l | #E(GF(2^m)). Then, the η_T pairing proposed by Barreto et al. [2] is the mapping,
with the bilinearity, namely η_T(sP, tQ) = η_T(P, Q)^{st}, satisfied for any P, Q ∈ E(GF(2^m)) and any integers s, t. The extension degree 4 of GF(2^{4m}) over GF(2^m) is the smallest positive integer k such that l divides (2^{km} − 1). Such an integer is called the embedding degree.
An element of the degree-4 extension GF(2^{4m}) is represented as a_0 + a_1 s + a_2 t + a_3 st for a_0, a_1, a_2, a_3 ∈ GF(2^m), where s^2 + s + 1 = 0 and t^2 + t + s = 0. The η_T pairing is computed by Algorithm 1, modified by Shu et al. [25], where M of Step 13 is defined as follows.
Note that A and C in Algorithm 1 belong to GF(2^{4m}), where the set {1, s, t, st} is a basis of GF(2^{4m}) over GF(2^m) with
Step 13 is called the final exponentiation, and is computed using multiplications, squarings, and an inversion. The final exponentiation is efficiently computed by the following equations.
where A = a_0 + a_1 s + a_2 t + a_3 st ∈ GF(2^{4m}). In this paper, we fix the parameters (m = 239, b = 1) of the η_T pairing. Let l be the largest odd prime with l | #E(GF(2^239)). The sizes of l and GF(2^{4·239}) are important values for security. Let E' be an elliptic curve used for the η_T pairing over GF(3^97), and let l' be the largest odd prime with l' | #E'(GF(3^97)). We know that l ≈ l' and 2^{4·239} ≈ 3^{6·97}, which means that the security of our implementation is equivalent to that of the implementations in Refs. [5], [6], [9], [12], [13].
Algorithm 1 Computing the η_T pairing [25]
It is easy to see that Algorithm 1 takes 885 multiplications, 144 squarings, 1 division, and 3226 additions when m = 239.
Previous Implementations
We report previously known implementations of pairing on ATmega128L.
Oliveira et al. implemented the Tate pairing [17]. The implementation is named TinyTate. TinyTate uses a finite field GF(p) (p is a 256-bit prime) and a curve y^2 = x^3 + x with embedding degree 2. It occupies 18,384 bytes of ROM for the program and 1,831 bytes of memory. The timing is about 30 sec.
Oliveira et al. implemented the η_T pairing [18]. The implementation is named TinyPBC. TinyPBC uses a finite field GF(2^271) and a curve y^2 + y = x^3 + x^2 with embedding degree 4, and occupies 47,948 bytes of ROM for the program and 3,235 bytes of memory (2,867 bytes are used as stack). A multiplication in GF(2^271) is computed by a look-up table and the Karatsuba method, and takes about 4.0 msec. The timing of the pairing is about 5.5 sec.
Ishiguro et al. implemented the η_T pairing [12]. Their implementation uses a finite field GF(3^97) and a curve y^2 = x^3 − x + 1 with embedding degree 6, and occupies 17,284 bytes of ROM for the program and 628 bytes of memory. A multiplication in GF(3^97) is computed by the comb method and takes about 6.2 msec. The timing of the pairing is about 5.8 sec.
On the other hand, TinyECCK by Seo et al. [23] is the fastest implementation of ECC on ATmega128L. A multiplication in GF(2^163) is computed by the comb method and the window method with width 4, and takes 2.9 msec. We use this multiplication method as a point of comparison for our proposed method because this paper also uses a binary field.
Our Initial Implementation
We initially implemented the η_T pairing over GF(2^239) on ATmega128L by using Algorithm 1. The timing is 5.4 sec, which is faster than the implementation of [12].
In this implementation, multiplication is implemented by the comb method with a window of width 3, squaring is implemented by table look-up, and inversion is implemented by the extended Euclidean algorithm. The timings of an addition, multiplication, squaring, reduction, and inversion are 0.039 msec, 4.60 msec, 0.176 msec, 0.384 msec, and 246.2 msec, respectively. We found that multiplications, reductions, squarings, inversion, and additions account for 75%, 6.3%, 4.7%, 4.5%, and 2.3% of the whole pairing time, respectively. Therefore, reducing the multiplication time is the most important step toward a fast implementation of the η_T pairing.
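As a rough cross-check of these shares, the operation counts given in Sect. 2.1 can be multiplied by the per-operation timings above. The sketch below does this for multiplication, addition, and inversion only. The variable names are illustrative, and two assumptions are made here: the 5.4 sec total is treated as exactly 5400 msec, and the single division counted in Sect. 2.1 is treated as one inversion.

```python
# Cross-check of the reported cost shares (illustrative names, not from the paper).
TOTAL_MS = 5400.0  # initial pairing timing of 5.4 sec, assumed exact

# operation -> (count for m = 239, time per operation in msec)
ops = {
    "multiplication": (885, 4.60),
    "addition": (3226, 0.039),
    "inversion": (1, 246.2),   # assumption: the one division is one inversion
}

# Share of the total pairing time, in percent.
shares = {name: 100.0 * count * ms / TOTAL_MS for name, (count, ms) in ops.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of the pairing time")
```

Under these assumptions the products land close to the reported 75%, 2.3%, and 4.5%.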
Target Platform: MICAz ATmega128L
In this section, we explain ATmega128L, which is a processor for MICAz.
Architecture of ATmega128L
ATmega128L is an 8-bit processor, and its clock frequency is 7.38 MHz. ATmega128L consists of an arithmetic logic unit (ALU), 32 8-bit general purpose registers (R_0 to R_31) for intermediate results, a data memory of 64 Kbyte for general data, and a program ROM (flash memory) of 64 K locations.
TinyOS is an operating system for WSNs, and in particular for MICAz [15], which is often used as a platform for sensor network research. NesC is an extension of the C language for sensor nodes. We can implement applications on ATmega128L by using NesC on TinyOS.
Operation on ATmega128L
In this section, we explain how the ALU executes instructions.
The ALU operates on general purpose registers as
R_d ← R_d op R_r
in one cycle, where op is an operation and 0 ≤ d, r ≤ 31. The ALU takes 2 cycles to store one word of data from a register to memory, and 2 cycles to load one word of data from memory to a register. Thus, computing C = A op B takes 7 cycles for data in memory.
In other words, one operation takes 6 cycles for memory access (load/store operations). In general, the compiler for ATmega128L generates such code. However, if the number of registers in ATmega128L were not limited, we could write code in which a multiplication takes only 240 cycles for memory access, as follows:
In this method, 5,160 cycles are saved per multiplication in GF(2^239). Of course, ATmega128L has only 32 registers, and thus we need more than 240 cycles for a multiplication in GF(2^239). The main purpose of this paper is to propose a multiplication method for GF(2^239) that reduces memory access while using only 32 registers. Reducing the amount of memory access per multiplication is effective because a pairing computation needs many hundreds of multiplications.
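These figures can be reproduced with a short calculation. The sketch below assumes that the straightforward code performs one word-level operation for each of the 30 × 30 word pairs, and that the register-unlimited code loads each word of A and B once and stores each of the 60 result words once. Both readings are assumptions made here, chosen because they reproduce the 240 and 5,160 cycle figures in the text.

```python
WORDS = 30          # 8-bit words per element of GF(2^239)
LOAD_CYCLES = 2     # cycles to load one word from memory
STORE_CYCLES = 2    # cycles to store one word to memory

# Straightforward code: each word-level operation loads two operands
# and stores one result (6 cycles of memory access per operation).
per_op = 2 * LOAD_CYCLES + STORE_CYCLES
naive = WORDS * WORDS * per_op

# Unlimited registers: load A and B once, store the 2*WORDS-word product once.
ideal = 2 * WORDS * LOAD_CYCLES + 2 * WORDS * STORE_CYCLES

saved = naive - ideal
print(naive, ideal, saved)
```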
The Proposed Block Comb Method
In this section, we propose a block comb method for computing the multiplications in the η_T pairing efficiently. In this method, the multiplier and multiplicand are divided into blocks, and the partial product of each pair of blocks is computed without memory access. In this way, a multiplication is computed efficiently.
The Representation of Finite Field
This section explains how we represent GF (2 239 ) suitably for ATmega128L.
GF(2^239) is represented as GF(2)[x]/(f(x)), with the irreducible polynomial f(x) = x^239 + x^36 + 1. Then, a basis of GF(2^239) over GF(2) is
{x^238, ..., x, 1},
and each element A in GF(2^239) is
A = a_238 x^238 + ... + a_1 x + a_0, a_i ∈ GF(2). (1)
For simplicity, we represent the right side of Eq. (1) as
(a_238, ..., a_1, a_0)
in this paper. A can be represented as
A = (A_29, ..., A_1, A_0)
by 30 words, where each word is 8-bit. Moreover, we group s words into one block. Then A is represented as
A = (A^s_{(t−1)s}, ..., A^s_s, A^s_0), where A^s_{is} = (A_{is+s−1}, ..., A_{is}) and t is the number of blocks.
Block Multiplication
In this section, we propose the block multiplication in GF(2 239 ) on ATmega128L.
In order to compute a multiplication AB for A, B ∈ GF(2^239), we first compute AB as a polynomial product, and then reduce AB modulo f(x), where f(x) = x^239 + x^36 + 1. The polynomial multiplication is the dominant time-consuming operation in computing the η_T pairing. We focus on the polynomial multiplication because the reduction is already fast.
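The two-stage structure, polynomial multiplication followed by reduction modulo f(x), can be illustrated with Python integers used as bit vectors (bit i holds the coefficient of x^i). This is a minimal sketch for checking results, not the word-level routine used on ATmega128L; the function names are hypothetical.

```python
M = 239
F = (1 << 239) | (1 << 36) | 1   # f(x) = x^239 + x^36 + 1

def poly_mul(a, b):
    """Carry-less (GF(2)[x]) product of two polynomials stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_mod_f(c):
    """Reduce c modulo f(x), using x^239 = x^36 + 1."""
    while c.bit_length() > M:
        high = c >> M                 # coefficients of x^239 and above
        c &= (1 << M) - 1             # keep the low 239 coefficients
        c ^= high ^ (high << 36)      # fold the high part back in
    return c

def field_mul(a, b):
    """Multiplication in GF(2^239): polynomial product, then reduction."""
    return reduce_mod_f(poly_mul(a, b))
```

For example, `field_mul(1 << 238, 2)` multiplies x^238 by x, and the reduction replaces x^239 with x^36 + 1.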
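Let s be the block size and t the number of blocks, so that A and B are each split into t blocks of s words, where A^s_{is} denotes the block of s words (A_{is+s−1}, ..., A_{is}). The product AB is then assembled from the partial products A^s_{is} B^s_{js}, each of which occupies 2s words and contributes to the result starting at block C^s_{(i+j)s}. Table 1 shows the block multiplication of size 6.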
Note that there are many possible orders in which to compute the partial products. In the order used here, → denotes a transition that needs no store operation, ⇒ denotes one that needs a 6-word store operation, and a 12-word store operation is needed after the last product. If we choose any different order, the number of ⇒ operations increases. Therefore, the above order has the least number of store operations for a multiplication AB in GF(2^239). In this case, the memory accesses needed to compute AB are 8 6-word store operations, one 12-word store operation, and 16 12-word load operations. All memory accesses together take 720 cycles, because loading or storing one word takes 2 cycles on ATmega128L.
Next we consider the case of general s. We assume that there are enough registers in the processor to compute each partial product A^s_{is} B^s_{js}. Suppose the partial product for (i', j') is computed right after the one for (i, j). If i' + j' = i + j, then the store operation is omitted. If (i', j') = (i + 1, j) or (i, j + 1), then only the low s registers are stored to the memory words corresponding to C^s_{(i+j)s}. From the above observation, the least number of memory accesses needed to compute AB is (2t − 2) s-word store operations, one 2s-word store operation, and (t − 1)^2 2s-word load operations. Then, all memory accesses together take (4st^2 − 4st + 4s) cycles to compute a multiplication AB in GF(2^239).
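The cycle count can be checked by adding up the store and load operations directly. The sketch below takes the numbers of operations from the text and compares the sum with the closed form; the function names are hypothetical. It also evaluates, purely for comparison, the case in which every one of the t^2 partial products reloads both of its blocks, which is an assumption introduced here rather than a count stated in the text.

```python
CYCLES_PER_WORD = 2  # one load or store of an 8-bit word on ATmega128L

def access_cycles(s, t, block_loads):
    """Cycles for (2t-2) s-word stores, one 2s-word store,
    and block_loads loads of 2s words each."""
    stored_words = (2 * t - 2) * s + 2 * s
    loaded_words = block_loads * 2 * s
    return CYCLES_PER_WORD * (stored_words + loaded_words)

def closed_form(s, t):
    """The expression 4st^2 - 4st + 4s from the text."""
    return 4 * s * t * t - 4 * s * t + 4 * s

s, t = 6, 5  # 30 words split into 5 blocks of 6 words
print(access_cycles(s, t, (t - 1) ** 2), closed_form(s, t))
print(access_cycles(s, t, t * t))  # assumption: all t^2 products reload both blocks
```

With (t − 1)^2 loads the direct sum agrees with the closed form for any s and t, since 2[(2t − 2)s + 2s + 2s(t − 1)^2] = 4st^2 − 4st + 4s.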
The Choice of Block Size
In this section, we provide the best block size s that is an answer of the important issue (2 
The Proposed Block Comb Method
Algorithm 2 presents the proposed block comb method for computing C = AB for A, B ∈ GF (2 239 ). Notations in Algorithm 2 denote the following:
load: transfer of data in memory to a register
store: transfer of data in a register to memory
move: transfer of data in a register to another register
≪ 1: left-shift operation
⊕: bitwise exclusive-or
First, we allocate registers (R_0 to R_31) on ATmega128L to perform the block comb method as follows: the 12 registers R_0, ..., R_11 are used for the result C = AB, register R_12 is used as a temporary register of the comb method, the 6 registers R_13, ..., R_18 are used for the multiplier A, and the 6 registers R_19, ..., R_24 are used for the multiplicand B.
Next, we explain each step in Algorithm 2. Steps 1 to 7 initialize the 12 registers corresponding to C. Steps 8 to 30 correspond to the comb method to compute
In Step 12, (A_{6j+5}, ..., A_{6j}) in memory are loaded to R_13, ..., R_18, and (B_{6k+5}, ..., B_{6k}) in memory are loaded to R_19, ..., R_24.
Algorithm 2 Proposed block comb method
  ...
  5: for j = 6 to 11 do
  6:   R_j ← 0
  7: end for
  8: for j = 0 to 4 do
  ...
 10:   if 0 ≤ k and k ≤ 4 then
 11:     for l = 0 to 5 do
 12:       load R_{13+l} ← A_{6j+l}
 13:       load R_{19+l} ← B_{6k+l}
 14:     end for
  ...
 18:     for m = 6 downto 1 do
 19:       for n = 0 to 5 do
 20:         if (the l-th bit of R_{18+m}) = 1 then
  ...
 31:   for k = 0 to 5 do
 32:     store C_{6i+k} ← R_k
 33:     R_k ← R_{k+6}
 34:   end for
 35: end for
 36: for i = 0 to 6 do
 37:   store C_{54+i} ← R_{6+i}
 38: end for
Steps 16 to 25 are the body of the comb method. Note that the result of Eq. (2) is held in (R_12, ..., R_1) (not (R_11, ..., R_0)) due to the shift operations at Step 17. Then, we need Steps 26 to 28 to make R_11 the most significant word and R_0 the least significant word †. In Step 32, the lowest 6 words (R_5, ..., R_0) of the result of the partial product are stored. Recall that the highest 6 words (R_11, ..., R_6) can be reused in the next iteration, as explained in Sect. 4.2. Last, the highest 6 words are also stored (Steps 36 to 38).
† We may omit Steps 22 to 24 in Algorithm 2 if we allow the movement of the most/least significant word of (R_12, ..., R_0) and modify Algorithm 2.
Table 2 notation. r: Source register; k: Constant address; K: Constant data; b: Bit in the register; X, Y: Indirect Address Register (X = R_27:R_26 and Y = R_29:R_28); q: Displacement for direct addressing (6-bit); C: Carry flag; PC: Program Counter.
In the following we estimate the efficiency of the proposed block comb method. A multiplication by the block comb method with block size 6 takes 504 cycles for memory access. Recall that a straightforward multiplication, which computes each partial product A_i B_j individually, takes 5,400 cycles for memory access, as described in Sect. 3.2. Therefore, in terms of memory access, the proposed block comb method for a multiplication AB in GF(2^239) is about 10 times faster than the multiplication with the slowest memory access. In a real implementation, a compiler does not usually produce multiplication code with the slowest memory access, and thus the improvement from the proposed scheme becomes smaller; it strongly depends on the underlying compiler. We will demonstrate the effectiveness of the block comb method applied to ATmega128L in Sect. 5.1.
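The data flow of the block comb method can be modeled functionally as follows. This is a sketch under stated assumptions, not the authors' assembly code: Python integers stand in for the register file, the partial products are visited diagonal by diagonal (i = j + k) so that the low 6 words are stored once per diagonal, and all function names are hypothetical. The result is checked against a plain carry-less multiplication.

```python
import random

WORD_BITS = 8
S = 6   # block size in words
T = 5   # number of blocks (30 words in total)

def words_to_int(words):
    """Little-endian list of 8-bit words -> integer bit vector."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

def clmul(a, b):
    """Reference carry-less product, used only for checking."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def comb_partial(a_block, b_block):
    """Left-to-right comb product of two S-word blocks (fits in 2S words)."""
    a = words_to_int(a_block)
    acc = 0
    for bit in range(WORD_BITS - 1, -1, -1):
        acc <<= 1                             # block left-shift by one bit
        for n, w in enumerate(b_block):
            if (w >> bit) & 1:                # bit test on a word of B
                acc ^= a << (WORD_BITS * n)   # XOR A in at word offset n
    return acc

def block_comb_mul(A, B):
    C = [0] * (2 * S * T)
    regs = 0      # models the 2S-word accumulator kept in registers
    stored = 0    # number of words written to memory
    for i in range(2 * T - 1):                # diagonal index i = j + k
        for j in range(T):
            k = i - j
            if 0 <= k < T:
                regs ^= comb_partial(A[S*j:S*j+S], B[S*k:S*k+S])
        for w in range(S):                    # store the low S words
            C[S * i + w] = (regs >> (WORD_BITS * w)) & 0xFF
        regs >>= WORD_BITS * S                # move the high S words down
        stored += S
    for w in range(S):                        # store the remaining high words
        C[S * (2 * T - 1) + w] = (regs >> (WORD_BITS * w)) & 0xFF
    stored += S
    return C, stored

random.seed(1)
A = [random.randrange(256) for _ in range(S * T)]
B = [random.randrange(256) for _ in range(S * T)]
C, stored = block_comb_mul(A, B)
print(words_to_int(C) == clmul(words_to_int(A), words_to_int(B)), stored)
```

Every result word is written to memory exactly once in this model, which is the property the block ordering is designed to achieve.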
Our Implementation
In this section, we explain the details of our implementation of the η_T pairing over GF(2^239) on ATmega128L. Refer to Table 2 for details of each ATmega128L instruction used in this section.
Implementation of Proposed Block Comb Method
We implemented the proposed block comb method (Algorithm 2) in assembly. In the following we explain the three main improvements: memory access (Step 8, Step 31, Step 36), the if-statement (Step 20), and the left-shift (Step 17), which together account for more than 80% of the time of a multiplication in GF(2^239). The proposed block comb method with the three improvements below computes a multiplication in GF(2^239) in 1.29 msec. The comb method of width 3 in Sect. 2.3 requires 4.60 msec, and thus our implementation of the proposed block comb method is about 3.6 times faster.
Memory Access
Note that 16-bit registers are required to hold memory addresses in ATmega128L, because the size of the memory is 2^16 bytes. Six registers can be used as three 16-bit registers: the X-register (R_26 and R_27), the Y-register (R_28 and R_29), and the Z-register (R_30 and R_31). We use registers X, Y, and Z for the pointers to C, A, and B, respectively. We can write load operations with "ldd" instructions. For example, we can implement the load instruction at Step 8 as follows.
ldd R_{13+l}, Y + (6j + l)
Note that Y + (6j + l) denotes the address of A_{6j+l}. The ldd instruction takes 2 cycles.
We can write store operations with "st" instructions. Note that store operations are performed sequentially on C_0, C_1, ..., C_59 at Steps 31 and 36. We can implement a store operation as follows †.
st X+, R_{13+k}
† An std instruction corresponding to ldd is available in ATmega128L. However, the std instruction does not support the X register. Therefore, we must use the st instruction for store operations in Algorithm 2.
We can implement the if-statement at Step 20 by using "sbrs" and "rjmp" instructions as follows.
If the l-th bit of R_{19+j} is 1, rjmp is not executed and the eor (exclusive or) operations are executed; in this case the sbrs instruction takes two cycles. If the l-th bit of R_{19+j} is 0, rjmp is executed and the eor operations are skipped; in this case the sbrs instruction takes one cycle and the rjmp instruction takes two cycles. The if-statement therefore takes 2.5 cycles on average, assuming the tested bit is 1 with probability 0.5.
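The 2.5-cycle average follows from weighting the two paths by the probability of the tested bit. A minimal sketch of that arithmetic, with illustrative variable names:

```python
# Cycle counts for the two paths of the sbrs/rjmp test, as described in the text.
bit_is_one = 2        # sbrs skips the rjmp: 2 cycles
bit_is_zero = 1 + 2   # sbrs (1 cycle) followed by rjmp (2 cycles)
p_one = 0.5           # assumed probability that the tested bit is 1

average = p_one * bit_is_one + (1 - p_one) * bit_is_zero
print(average)
```

If the operand bits were biased, `p_one` would change and the average would move between 2 and 3 cycles accordingly.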
Block Left-Shift
We can implement the block left-shift at Step 17 by using "lsl" and "rol" instructions as follows.
. . .
rol R 12
Instructions lsl and rol each take one cycle. Thus, the block left-shift takes only 12 cycles in the implementation.
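The carry chain formed by one lsl followed by a run of rol instructions can be emulated as below. This is an illustrative model with a hypothetical function name; it assumes 12 registers take part in the shift, which matches the 12-cycle figure, and it does not claim which physical registers those are.

```python
def block_left_shift(regs):
    """Shift a little-endian list of 8-bit registers left by one bit.

    The first register behaves like lsl (carry in = 0); each later register
    behaves like rol (rotate left through the carry). Returns the new
    register values, the final carry, and the number of one-cycle
    instructions used.
    """
    carry = 0
    out = []
    for r in regs:
        out.append(((r << 1) | carry) & 0xFF)
        carry = (r >> 7) & 1
    return out, carry, len(regs)

regs = [0x80, 0x01] + [0x00] * 10   # 12 registers, least significant first
shifted, carry, cycles = block_left_shift(regs)
print([hex(r) for r in shifted[:3]], carry, cycles)
```

The top bit of the first register moves into the bottom bit of the second one, which is exactly what the carry flag transports between consecutive rol instructions.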
Other Improvements
The shift operation is often used in the squaring, reduction, and inversion. We implemented a shift operation optimized for ATmega128L.
Let A_0 = (a_7, a_6, a_5, a_4, a_3, a_2, a_1, a_0) be an 8-bit register. The 4-bit left shift of A_0 is (a_3, a_2, a_1, a_0, 0, 0, 0, 0), which can be obtained from the nibble-swapped value (a_3, a_2, a_1, a_0, a_7, a_6, a_5, a_4) [...]. This reduces the cost from 10 clocks to 6 clocks. Similarly, we can apply the same improvement to left i-bit shifts and right i-bit shifts for i = 1 to 7.
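The bit patterns above suggest a 4-bit shift built from a nibble exchange followed by a mask. The sketch below emulates that idea on single bytes. That the implementation uses the AVR swap instruction with a mask is an assumption made here from those bit patterns, and the function names are hypothetical.

```python
def swap_nibbles(r):
    """Exchange the high and low 4 bits of an 8-bit value."""
    return ((r << 4) | (r >> 4)) & 0xFF

def shl4_via_swap(r):
    """4-bit left shift: nibble swap, then clear the low nibble."""
    return swap_nibbles(r) & 0xF0

def shr4_via_swap(r):
    """4-bit right shift: nibble swap, then clear the high nibble."""
    return swap_nibbles(r) & 0x0F

for r in (0x00, 0x5A, 0xFF, 0x81):
    assert shl4_via_swap(r) == (r << 4) & 0xFF
    assert shr4_via_swap(r) == r >> 4
```

The point of the trick is that two instructions replace four single-bit shifts per register.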
As a result, the timings of squaring, reduction, and inversion are improved to 0.129 msec, 0.0304 msec, and 166.8 msec from 0.176 msec, 0.0384 msec, and 246.2 msec of our initial implementation in Sect. 2.3, respectively.
Comparison with Other Works
In Table 3, we compare our implementation of pairing on ATmega128L with previous works. Ours is 3 times faster than the implementation of the η_T pairing over GF(3^97), which has the same security level (see Sect. 2.1), but we use twice as much ROM due to the assembly code. TinyPBC has a larger parameter size, and the weighted speed of their implementation over GF(2^239) is 5.45 × (239/271)^2 ≈ 4.24 sec, which is still about twice as slow as ours. The ROM size of TinyPBC is larger than ours because they use a look-up table for Karatsuba multiplication in GF(2^271). TinyTate uses a finite field GF(p) with a large prime characteristic, and it is currently much slower than the other implementations of pairing on ATmega128L.
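The weighted speed is a simple rescaling of the TinyPBC timing by the square of the ratio of field sizes. The sketch below reproduces that arithmetic; the quadratic scaling is the one used in the text, and the variable names are illustrative.

```python
tinypbc_sec = 5.45          # TinyPBC timing over GF(2^271), as used in the text
scale = (239 / 271) ** 2    # quadratic scaling in the extension degree
weighted = tinypbc_sec * scale
ours_sec = 1.93             # timing of the proposed implementation

print(round(weighted, 2), round(weighted / ours_sec, 1))
```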
Remark 1:
The timing of an implementation written only in NesC (without assembly) can also be improved by the proposed block comb method. The assembly-level improvement merely aims at achieving the best pairing timing on ATmega128L. For example, the block comb method can speed up the implementations of Ishiguro et al. [12] and TinyECCK [23], because they use the comb method.
Conclusion
In this paper we presented an efficient implementation of pairing on a sensor node. We implemented the η_T pairing over GF(2^239) on the MICAz platform with ATmega128L. In order to accelerate the pairing, we proposed the block comb method, which is specifically optimized for ATmega128L. Combined with other optimizations for squaring, reduction, and inversion, the timing of the η_T pairing becomes 1.93 sec. This is currently the fastest timing among the known implementations of pairing on sensor nodes. Ubiquitous sensor networks can now use pairing-based cryptography in a reasonable time.
