Recently, the results of TinyECCK (Tiny Elliptic Curve Cryptosystem with Koblitz curve) showed that both field multiplication and reduction over GF(2^m) involve a large number of duplicated memory accesses, and that reducing these duplications noticeably improves the performance of elliptic curve operations such as scalar multiplication, signing and verification. However, when the underlying word size is extended from 8 bits to 16 or 32 bits, the techniques proposed in TinyECCK become less effective because the number of memory accesses needed to load or store an element of GF(2^m) is significantly reduced. Therefore, in this paper, we propose a technique that makes the left-to-right (ltr) comb method, which is widely used as an efficient multiplication algorithm over GF(2^m), suitable for extended word sizes. We also present TinyECCK16 (Tiny Elliptic Curve Cryptosystem with Koblitz curve on 16-bit word), which is implemented with the proposed multiplication algorithm on the 16-bit Tmote Sky mote. Over GF(2^163), the proposed algorithm is faster than the typical ltr comb method by 15.06% and faster than the 16-bit version of the algorithm proposed in TinyECCK by 5.12%.
key words: wireless sensor network, elliptic curve cryptosystem, efficient implementation
Introduction
Wireless sensor networks (WSNs) consist of hundreds or thousands of small, resource-limited sensor nodes. Since they are deployed in harsh, unattended environments, some combination of authentication, integrity, and confidentiality is required for reliable and secure network communication. Because of the inherent characteristics of WSNs, such as the absence of supervisors and limited resources, conventional protocols do not fit and new security protocols that take these issues into account are required. Although some symmetric key-based security protocols have been proposed with the limited resources of motes in mind, they lack functionality in the pairwise key setup and broadcast authentication phases. Thus, many researchers have tried to apply public key cryptosystems to WSNs to provide functions such as key distribution and authentication. In particular, they have used the elliptic curve cryptosystem (ECC) because its key size is much smaller than that of other public key systems such as RSA and DSA [1], [4], [7]-[10], [13]-[15]. They have implemented ECC on several types of motes and reported the running time and memory consumption in order to demonstrate the feasibility of ECC on WSNs. Until now, implementations over GF(p) have given better performance than those over GF(2^m). However, the recent result of TinyECCK [14] shows that the low performance of field operations over GF(2^m) is caused by a large number of memory accesses, and that reducing these redundant accesses improves the overall performance of elliptic curve operations over GF(2^m). Furthermore, TinyECCK is the most efficient among the ECC software over both GF(p) and GF(2^m) running on the ATmega128L processor [24]. Field multiplication and reduction are the most frequent operations in elliptic curve operations. TinyECCK proposed techniques that significantly reduce the number of memory accesses in both operations, and thereby achieved its performance improvement.
However, the techniques proposed in TinyECCK become less effective as the word size of the target platform increases, because the required number of memory accesses decreases as well. In fact, we obtained only a 5.78-7.69% improvement when the TinyECCK techniques were implemented in a 16-bit word environment, compared with 15-19% on the 8-bit platform. This result means that a new field multiplication algorithm that takes extended word sizes such as 16 and 32 bits into account is required.
In this paper, we propose a technique that makes the left-to-right comb method, which is widely used as an efficient multiplication algorithm over GF(2^m), suitable for extended word sizes. We also present TinyECCK16, which is implemented with the proposed multiplication algorithm on the 16-bit Tmote Sky mote [22] using the MSP430 processor [25]. Over GF(2^163), the proposed multiplication algorithm is faster than the typical ltr comb method (resp. the improved ltr comb method proposed in TinyECCK) by 15.06% (resp. 5.12%), and its application to TinyECCK16 saves 8.4-11.8% (resp. 2.82-6.93%) of the running time of elliptic curve operations. Furthermore, TinyECCK16 is superior to existing ECC software implemented on the Tmote Sky sensor mote with regard to running time and memory requirements. TinyECCK16 with 5TNAF computes a scalar multiplication in 0.64 secs, and it generates (resp. verifies) a signature within 0.77 secs (resp. 1.27 secs) with 14,422 bytes of ROM and 1,750 bytes of RAM. Since the efficiency of the proposed algorithm increases as the word size of the target platform is extended, our proposal seems to be a promising technique for upcoming, more capable sensor platforms.
Related Work
Until now, there have been several implementations of ECC over both GF (2 m ) and GF(p) on sensor motes using 8-bit or 16-bit CPU. They have tried to prove the feasibility of ECC for WSNs.
Existing Implementations on 16-bit Sensor Motes
There are several ECC software implementations over both GF(p) and GF(2^m) on TelosB or Tmote Sky motes using the 16-bit MSP430 processor. TinyECC, which is tuned for the 16-bit Tmote Sky, can generate a signature in 1.58 secs and verify it in 2.02 secs at the expense of 13,520 bytes of ROM and 1,504 bytes of RAM [1]. Wang et al. implemented ECC software over GF(p) on a TelosB mote running at 8 MHz and applied it to their proposed access control protocol [9]. Their implementation takes 3.35 and 6.78 secs for signing and verification, respectively, at the cost of 17,823 bytes of ROM and 1,638 bytes of RAM. After this early work, they significantly improved the performance of their code [10]. The updated code takes 1.60 and 3.32 secs for signing and verification, respectively. This version requires 19,251 bytes of ROM and 1,392 bytes of RAM, excluding the SHA-1 code, which takes more than 30 Kbytes. Arazi et al. tuned EccM for the 16-bit TelosB mote. The modified implementation takes 32.5 secs for a scalar multiplication with 20 Kbytes of ROM and 1,500 bytes of RAM [13].
NanoECC
NanoECC [15] is based on the MIRACL library [19] and supports ECC operations and pairing-based cryptographic operations on both MICAz and Tmote Sky motes over both GF(p) and GF(2^m). For efficient elliptic curve operations over GF(2^m), NanoECC is based on a Koblitz curve and uses the Karatsuba-Ofman multiplication algorithm and a fast reduction algorithm using a pentanomial. NanoECC over GF(p) makes use of a hybrid multiplication algorithm and an optimized reduction algorithm using a Mersenne prime. It applies the fixed-base comb algorithm with window size 4 for fast scalar multiplications. On a MICAz (resp. Tmote Sky) mote, NanoECC takes 2.16 (resp. 1.04) and 1.27 (resp. 0.72) secs to compute a scalar multiplication over GF(2^m) and GF(p), respectively.
Table 1 gives an extensive description of existing ECC implementations on sensor motes with 8-bit or 16-bit CPUs. Since NanoECC is a multi-platform implementation, we do not include it in the table.
Implementation Details
We have implemented TinyECCK16 on a 16-bit Tmote Sky mote [22] using the MSP430 processor [25]. We use the domain parameters (sect163k1) recommended by [3] and a polynomial basis to represent elements in GF(2^m). TinyECCK16 has been developed in the nesC language [27] in order to run on TinyOS [21]. We adapted the original field arithmetic algorithms, which are presented for a 32-bit word size in Guide to Elliptic Curve Cryptography [5], [6], into forms suitable for a 16-bit word environment. For efficiency, TinyECCK16 makes use of recoding algorithms such as wNAF and wTNAF [16] and selects the mixed coordinate system [17] rather than affine coordinates.
Sensor Platform Description

Preliminaries
Let the word size W be 16 bits, since the MSP430 uses a 16-bit data bus. The following notation is used in the rest of this paper to describe algorithms. We assume that A (= a(z)) and B (= b(z)) are elements in GF(2^m).
• A ⊕ B: bitwise exclusive-or.
• A & B: bitwise AND.
• A >> i: right shift of A by i positions, padding the i upper bits with 0.
• A << i: left shift of A by i positions, padding the i lower bits with 0.
• W: a 16-bit word.
• A[j] denotes the j-th word of the polynomial A.
• t = ⌈m/W⌉ is the number of words required to store A in memory.
• The left-to-right comb method using window w processes the bits of a(z) from left to right direction as follows:
+1.
• Each of A, B, C, and D means a block in Fig. 1 
for j ← 0 to t − 1 increments j by 1 do 8:
for i ← 0 to t increments i by 1 do 10:
end for 12: end for 13:
masking ← masking 4.
16:
end if 17: end for 18: Return (c) ble instead of computing it. Namely, this method can save the number of XOR operations and memory accesses compared with naive shift-and-xor multiplication algorithm at the expense of more memory consumption. Considering the tradeoff between the overhead of precomputation and its advantage during computing partial multiplications, the proper window size of ltr comb method is known as 4. Thus, the field multiplication algorithms discussed in this paper are all based on window size 4.
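The typical ltr comb method with window size 4 discussed above can be sketched in C as follows. This is a minimal illustration based on the textbook form of the method with W = 16 and m = 163, not the TinyECCK16 source; the names `comb_mul_ltr`, `shl_window`, `T_WORDS` and the table layout are assumptions made for this example. It shows the three parts that the later cost analysis distinguishes: precomputation of u(z) · b(z), the partial multiplications, and the shift of the 2t-word running result C (three times per multiplication on 16-bit words).

```c
#include <stdint.h>
#include <string.h>

#define M_BITS  163                                  /* field degree m (sect163k1) */
#define W_BITS  16                                   /* word size W */
#define WIN     4                                    /* window width w */
#define T_WORDS ((M_BITS + W_BITS - 1) / W_BITS)     /* t = ceil(m/W) = 11 */

/* Shift an n-word polynomial (word 0 is least significant) left by WIN bits. */
void shl_window(uint16_t *c, int n)
{
    for (int i = n - 1; i > 0; i--)
        c[i] = (uint16_t)((c[i] << WIN) | (c[i - 1] >> (W_BITS - WIN)));
    c[0] = (uint16_t)(c[0] << WIN);
}

/* c (2t words) = a(z) * b(z) over GF(2)[z], without modular reduction. */
void comb_mul_ltr(uint16_t c[2 * T_WORDS],
                  const uint16_t a[T_WORDS], const uint16_t b[T_WORDS])
{
    uint16_t tbl[1 << WIN][T_WORDS + 1];             /* tbl[u] = u(z) * b(z) */

    /* Precomputation: odd u from u - 1 by one XOR, even u from u / 2 by a 1-bit shift. */
    memset(tbl, 0, sizeof tbl);
    memcpy(tbl[1], b, T_WORDS * sizeof(uint16_t));
    for (int u = 2; u < (1 << WIN); u++) {
        if (u & 1) {
            for (int i = 0; i <= T_WORDS; i++)
                tbl[u][i] = tbl[u - 1][i] ^ tbl[1][i];
        } else {
            uint16_t carry = 0;
            for (int i = 0; i <= T_WORDS; i++) {
                uint16_t v = tbl[u / 2][i];
                tbl[u][i] = (uint16_t)((v << 1) | carry);
                carry = (uint16_t)(v >> 15);
            }
        }
    }

    /* Process the 4-bit windows of every word of a from the top window down. */
    memset(c, 0, 2 * T_WORDS * sizeof(uint16_t));
    for (int k = W_BITS / WIN - 1; k >= 0; k--) {
        for (int j = 0; j < T_WORDS; j++) {
            unsigned u = (a[j] >> (WIN * k)) & 0xFu;
            for (int i = 0; i <= T_WORDS; i++)       /* partial multiplication */
                c[j + i] ^= tbl[u][i];
        }
        if (k != 0)
            shl_window(c, 2 * T_WORDS);              /* C <- C * z^w */
    }
}
```

The inner loop loads and stores every word of C that it touches, which is the duplicated memory traffic that the TinyECCK variant reduces, and `shl_window` is the 2t-word shift whose count the proposed method lowers.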
Algorithm 2 16-bit version of left-to-right comb multiplication method proposed in TinyECCK [14] (window width w = 4)
for j ← 0 to t − 1 increments j by 2 do 8:
for i ← 1 to t increments i by 1 do 12:
end for 14:
end for 15:
18:
end if 19: end for 20: Return (c)
Algorithm 2 is the 16-bit version of the ltr comb method proposed in [14]. This algorithm is an improved version of Algorithm 1. In fact, steps 7-12 in Algorithm 1 (the partial multiplication) require many duplicated memory accesses. Thus, Algorithm 2 reduces this overhead by combining two instances of these steps into one (refer to [14] for details). This saves a number of duplicated LOAD and STORE instructions. The procedure of Algorithm 2 is as follows.
Step 3 builds a precomputation table about u(z) · b(z)
(Since the used window size is 4, it contains the result . Thus, 2t words should be left-shifted three times (Step 15-18 in Algorithm 2). Actually, 8-bit version of Algorithm 2 requires shifting 2t words only once; this overhead is relatively small compared with the overhead due to the redundant memory accesses. However, on 16-bit word environment, while the number of memory accesses during a field multiplication is reduced in half, the number of shifting C is increased from once to three times. Thus, the overhead from shifting C occupies larger portion during a field multiplication than 8-bit environment. For solving this problem, we present a promising 
12) ⊕ (T 3 13). 8: 
Field Reduction
The results of both field multiplication and field squaring over GF(2^m) must be reduced modulo the irreducible polynomial f(z). Sparse reduction polynomials such as the trinomials and pentanomials recommended by NIST [3] in FIPS 186-2 are often used for efficient reduction. TinyECCK16 uses f(z) = z^163 + z^7 + z^6 + z^3 + 1. Algorithm 3 is the 16-bit version of the fast reduction algorithm proposed in [14].
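To make the role of the pentanomial concrete, the following C sketch reduces a 2t-word product modulo f(z) = z^163 + z^7 + z^6 + z^3 + 1 one 16-bit word at a time. It is a plain word-level folding derived from z^163 ≡ z^7 + z^6 + z^3 + 1, not a transcription of Algorithm 3: it does not merge memory accesses the way the algorithm of [14] does, and its statements do not correspond to the step numbers of Algorithm 3. The names `reduce163_w16` and `T_WORDS` are assumptions for this example.

```c
#include <stdint.h>

#define T_WORDS 11   /* t = ceil(163 / 16) */

/* Reduce the 2t-word polynomial c modulo f(z) = z^163 + z^7 + z^6 + z^3 + 1.
 * On return c[0..10] holds the result (degree < 163) and c[11..21] is zero. */
void reduce163_w16(uint16_t c[2 * T_WORDS])
{
    for (int i = 2 * T_WORDS - 1; i >= T_WORDS; i--) {
        uint16_t w = c[i];
        /* z^(16 i) = z^(16 (i - 11) + 13) * z^163, so word i is folded in at
         * bit offsets 13, 16, 19 and 20 relative to word i - 11. */
        c[i - 11] ^= (uint16_t)(w << 13);
        c[i - 10] ^= (uint16_t)((w >> 3) ^ w ^ (w << 3) ^ (w << 4));
        c[i - 9]  ^= (uint16_t)((w >> 13) ^ (w >> 12));
        c[i] = 0;
    }
    /* Postprocessing: bits 163..175 are the upper 13 bits of word 10. */
    uint16_t w = (uint16_t)(c[10] >> 3);
    c[0]  ^= (uint16_t)(w ^ (w << 3) ^ (w << 6) ^ (w << 7));
    c[1]  ^= (uint16_t)((w >> 10) ^ (w >> 9));
    c[10] &= 0x0007;
}
```

For instance, an input that holds only z^163 (bit 3 of word 10) reduces to 0x00C9 in word 0, which is z^7 + z^6 + z^3 + 1.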
Steps 3-11 reduce C of 2t words in such a way that redundant memory accesses are avoided. Steps 12-17 are a postprocessing part for the remaining words, which is more complex than in the 8-bit version.
Selection of Coordinate System
Table 3 shows the ratio of inversion to multiplication and squaring over GF(2^163) on the MSP430 processor. Since the ratio of inversion to multiplication is 23.57, it is desirable to eliminate inversions during scalar multiplications (for inversion, the binary extended Euclidean algorithm [6] is used) †. Therefore, we choose the López-Dahab coordinates rather than affine coordinates [17]. TinyECCK16 uses mixed coordinates for elliptic curve point addition (ECADD), since the addition of two points represented in different coordinate systems is more efficient than that of two points using the same representation [5], [6]. Hence, we build a precomputation table of points represented in affine coordinates.
Proposed Efficient Field Multiplication on 16-bit Environment
With Algorithms 2 and 3, TinyECCK16 saves only 5.78-7.69% of the running time of elliptic curve operations. This contrasts with the result that applying the 8-bit versions of these algorithms on the ATmega128 processor saves 15-19% of TinyECCK's running time. The main reason is twofold: on the one hand, the extended word size (16-bit word) reduces the number of memory accesses during the multiplication and reduction algorithms; on the other hand, the number of shifts of the 2t-word C increases from one to three. This section describes a proposal that lets Algorithm 2 use only one shift of the 2t-word C.
Proposed Left-to-Right Comb Method on 16-bit Word
The number of shifting C can be reduced from three to one by rearranging the sequence of processing blocks in Fig. 1 . 
Thus, the partial products of 
for j ← 0 to t − 1 increments j by 2 do 10:
. 14:
for i ← 1 to t increments i by 1 do 15:
19: PTR_C ← &C. // assigning C's original address
20: end for
21: C ← C · z^w.
22: PTR_C ← &C + 1. // assigning 1-byte incremented address of C
23: masking ← 0x0f00.
24: // Processing B, D blocks
25: for k ← 2 to 0 decrements k by 2 do
26: for j ← 0 to t − 1 increments j by 2 do
27:
31:
for i ← 1 to t increments i by 1 do 32:
To implement our proposal, the results of the A and B blocks should be stored at an address incremented by 8 bits from where the results of the C and D blocks are stored. This is possible with the following code written in C.
BYTE16 C[DOUBLENUMWORDS];               // array storing the running result C
BYTE8* ptrC8 = (BYTE8*)C;               // base address of C held in an 8-bit pointer, ptrC8
BYTE16* ptrC16 = (BYTE16*)(ptrC8 + 1);  // address of ptrC8 incremented by 8 bits and stored in a 16-bit pointer, ptrC16
The results of the A and B blocks are stored from the ptrC16 address. In this manner, the results of the partial multiplications of both the A and B blocks are equivalent to being left-shifted by 8 bits.
† Even if the running time of an inversion is significantly reduced by using a 16-bit word, it is still much larger than that of a field multiplication.
Algorithm 4 improves Algorithm 2 by reducing the number of shifts of the 2t-word running result C from three to one in a 16-bit environment.
Steps 5 and 22 of Algorithm 4 increment the address of the running result C and store it in PTR_C, a 16-bit pointer. In this manner, steps 12-16 and steps 29-33 are equivalent to shifting the results of the partial multiplications in the A and B blocks 8 bits to the left relative to the results of the C and D blocks before storing them. The base address is restored at steps 19 and 36 because the results of the C and D blocks should be stored at the original address of C. Compared with Algorithm 2, the proposed algorithm requires only one shifting operation on 2t words.
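The equivalence between storing at a byte-incremented address and shifting by 8 bits can be checked with a small portable C sketch. It assumes a little-endian byte layout, as on the MSP430, and uses byte-level accesses only, so the odd offset is safe on any platform. The helper names `xor16_at_byte` and `xor16_shifted` are hypothetical and not part of TinyECCK16.

```c
#include <stdint.h>

/* XOR the 16-bit value v into the little-endian byte array acc, starting at
 * byte offset off. Only byte accesses are used, so off may be odd. */
void xor16_at_byte(uint8_t *acc, unsigned off, uint16_t v)
{
    acc[off]     ^= (uint8_t)(v & 0xFF);
    acc[off + 1] ^= (uint8_t)(v >> 8);
}

/* Reference: XOR (v << shift_bits) into acc at 16-bit word index j,
 * for shift_bits in 0..15, spilling into word j + 1. */
void xor16_shifted(uint8_t *acc, unsigned j, uint16_t v, unsigned shift_bits)
{
    uint32_t wide = (uint32_t)v << shift_bits;
    xor16_at_byte(acc, 2 * j,     (uint16_t)(wide & 0xFFFF));
    xor16_at_byte(acc, 2 * j + 2, (uint16_t)(wide >> 16));
}
```

Accumulating a word at byte offset 2j + 1 gives the same bytes as accumulating the word shifted left by 8 bits at word index j, which is why the results of the A and B blocks need no explicit 8-bit shift of C.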
Theoretical Analysis
We can count how many instructions are saved with the proposed method. Proof. The computation of ltr comb method using window can be divided into three parts; precomputation, computing partial multiplications and shifting the intermediate result C. Thus, the cost of Algorithm 1 can be computed as follows.
• Precomputation
The precomputation for u(z) · b(z) with window size 4 requires seven XOR additions and seven 1-bit left shifting operations on arrays of t words. We optimize this precomputation process by reducing the number of loops. Namely, we combine XOR addition and 1-bit left shift operation. For example, z · b(z) and (z + 1) · b(z), z 2 · b(z) and (z 2 + 1) · b(z) can be consecutively calculated. In our implementation, the cost of combined XOR addition and 1-bit shifting is
The following is the pseudo code for shifting the intermediate result C as 4-bit to the left direction. 
Thus, total cost of Algorithm 1 is [(8t
by summing the cost of each part.
Theorem 2. Algorithm 4 requires [(6t
Proof. Since the precomputation step of Algorithm 4 is same as Algorithm 1, the cost is identical. However, the costs for computing partial multiplications and shifting the C are reduced in Algorithm 4. The costs for two parts are described as follows.
• Computing partial multiplications
According to the above analysis, the total cost of Algorithm 4 are [(6t
Since the difference between Algorithm 2 and Algorithm 4 is the number of shifts of C, we can easily compute the cost of Algorithm 2; [(6t
Because Algorithm 1 and Algorithm 2 require (1536L + 872S + 605X + 326SH) and (1294L + 630S + 605X + 326SH) instructions, respectively, Algorithm 4 saves 21.26% (resp. 7.91%) compared with Algorithm 1 (resp. Algorithm 2) †††.
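As a consistency check on these percentages, the stated instruction counts can be summed under the equal-cost assumption of footnote †††. The total for Algorithm 4, written N_4 here, is not given explicitly in the surviving text, so the value below is inferred from the stated 21.26% saving and should be read as an estimate.

```latex
\begin{align*}
N_1 &= 1536 + 872 + 605 + 326 = 3339, \\
N_2 &= 1294 + 630 + 605 + 326 = 2855, \\
N_4 &\approx (1 - 0.2126)\,N_1 \approx 2629, \\
\frac{N_2 - N_4}{N_2} &\approx \frac{2855 - 2629}{2855} \approx 7.9\%.
\end{align*}
```

The second ratio agrees with the stated saving over Algorithm 2, so the two percentages are mutually consistent.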
Application to More Extended Word Size
The proposed technique can be applied to larger word sizes such as a 32-bit word. In fact, Imote2 [23], a state-of-the-art sensor mote, uses the 32-bit PXA271 processor [26].
How to Apply
Typical ltr comb multiplication algorithm on 32-bit word
† In Algorithm 1, step 8 and steps 9-11 require (1L + 1S + 1SH) and (t + 1)(2L + 1S + 1X), respectively.
†† Steps 10-13 and steps 14-16 in Algorithm 4, the combined partial multiplication, require (6L + 4S + 2X + 2SH) and t(3L + 1S + 2X), respectively.
††† Because one instruction requires a different number of cycles depending on the addressing mode, we assume that LOAD, STORE, XOR and SHIFT operations use the same number of clock cycles.
Thus, Algorithm 1 and Algorithm 2 using window size 4 on a 32-bit word require seven shifts of C. However, this overhead can be reduced to only one shifting operation by rearranging the sequence of processing blocks and the positions where the partial multiplications are stored. The sequence of processing blocks are
Details are as follows.
1. The results of partial multiplications on the A block are stored from the address incremented by 3 bytes from the base address of C.
2. The results of partial multiplications on the C block are stored from the address incremented by 2 bytes from the base address of C.
When a field multiplication algorithm over GF(2^m) (m is fixed) is implemented on a 32-bit word, the number of memory accesses needed to load and store an element is further reduced compared with an 8-bit or 16-bit word. Thus, the techniques proposed in [14] for reducing memory accesses yield a smaller performance gain in a 32-bit environment. However, the proposed method becomes more promising, since it reduces the number of shifts of 2t words from seven to one.
Estimation of Performance Gain
We can count the saved number of instructions on 32-bit word when the proposed algorithm is used instead of Algorithm 1 and Algorithm 2. 
,W = 32.
Table 4 Comparison of the running times among three field multiplication algorithms over GF(2^163). The improvement ratio in the second row is relative to Algorithm 1 and the ratio in the third row is relative to Algorithm 2 (all times are in seconds).
Proof. The cost of Algorithm 1 can be easily computed from Theorem 1.
The total number of instructions required by Algorithm 1 is [(16t^2 + 80t)L + (8t^2 + 51t + 7)S + (8t^2 + 15t)X + (50t)SH], including the cost of precomputation.
Theorem 4. Algorithm 4 on 32-bit word requires
Proof. The cost of Algorithm 4 can be derived from Theorem 2 as follows.
By summing the cost of each part in Algorithm 4, it requires [(12t^2 + 56t − 6)L + (4t^2 + 39t + 1)S + (8t^2 + 15t)X + (26t − 6)SH] instructions for computing a field multiplication.
By the same manner, the cost of Algorithm 2 can be computed as [ 
On the basis of Theorem 3 and Theorem 4, Algorithm 4 saves [(4t^2 + 24t + 6)L + (4t^2 + 12t + 6)S + (24t + 6)SH] and [(24t + 6)L + (12t + 6)S + (24t + 6)SH] instructions compared with Algorithm 1 and Algorithm 2, respectively. Since a 32-bit word size and GF(2^163) are used, t is equal to 6 (= ⌈163/32⌉). Therefore, Algorithm 4 saves (294L + 222S + 150SH) (resp. 150L + 78S + 150SH) instructions compared with Algorithm 1 (resp. Algorithm 2). These savings correspond to a 28.53% (resp. 18.47%) reduction in running time in comparison with Algorithm 1 (resp. Algorithm 2).
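The counts above can be reproduced by evaluating the two cost expressions at t = 6. The following C sketch encodes only the formulas stated for Algorithm 1 and Algorithm 4 on a 32-bit word; the type `cost_t` and the function names are assumptions for this example, and every instruction class is weighted equally, as in footnote †††.

```c
/* Instruction counts per class: LOAD, STORE, XOR, SHIFT. */
typedef struct {
    long load, store, xor_ops, shift;
} cost_t;

/* Cost of Algorithm 1 on a 32-bit word, as a function of t. */
cost_t cost_alg1_w32(long t)
{
    cost_t c = { 16 * t * t + 80 * t, 8 * t * t + 51 * t + 7,
                 8 * t * t + 15 * t, 50 * t };
    return c;
}

/* Cost of Algorithm 4 on a 32-bit word, as a function of t. */
cost_t cost_alg4_w32(long t)
{
    cost_t c = { 12 * t * t + 56 * t - 6, 4 * t * t + 39 * t + 1,
                 8 * t * t + 15 * t, 26 * t - 6 };
    return c;
}

/* Total instruction count under the equal-cost assumption. */
long cost_total(cost_t c)
{
    return c.load + c.store + c.xor_ops + c.shift;
}
```

At t = 6 the totals are 2335 and 1669 instructions, a difference of 666 = 294 + 222 + 150, which is roughly 28.5% of the Algorithm 1 total.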
On the grounds of this counting, we expect the proposed method to become more promising as the word size increases. Thus, it can be used efficiently on more powerful state-of-the-art sensor motes.
of running times compared with Algorithm 1 and Algorithm 2. In Section 4.1.2, we estimated that the proposed Algorithm 4 contributes around 21.26% (resp. 7.91%) of improvement compared with Algorithm 1 (resp. Algorithm 2). However, we measured a saving of 15.06% (resp. 5.12%). There are some gaps between the high-level estimate and the actual implementation. For example, we did not count the loop counter and function call overhead. In addition, each instruction may consume a different number of clock cycles depending on the addressing mode, the number of operands involved, etc. Above all, there is one important reason, which originates from a limitation of the Tmote Sky sensor mote. Namely, the MSP430 processor does not support word-level instructions at odd addresses. It supports word-level instructions only for data at even addresses (for data at odd addresses, only byte-level instructions can be used). For this reason, we cannot implement the proposed algorithm solely in C. In other words, after the address of C is incremented by 8 bits, word-level instructions do not operate properly. In our experience, when a word-level instruction is applied to an odd address, the address is automatically converted into an even address. Therefore, we use inline assembly code to implement the partial multiplications. The partial multiplications in the A and B blocks are implemented with byte-level instructions, because the address of C is incremented when processing these blocks, while those in the C and D blocks are implemented with word-level instructions (this fine-grained control is possible only with inline assembly code).
With this limitation of the MSP430 processor, the efficiency of the proposed technique is attenuated. For these reasons, we obtained an actual performance gain of 15.06% (resp. 5.12%) instead of the 21.26% (resp. 7.91%) from the theoretical analysis. Furthermore, we think that the comparison between Algorithm 4, partly implemented with inline assembly code, and Algorithms 1 and 2, implemented solely in C, is fair because of the limited address manipulation of the MSP430 processor.
Experimental Results and Analysis
Analysis of Field Operations
Performance Confirmation
We can confirm the data in Table 5 from the running times of multiplication and squaring. Let us assume that the window size is 4, namely that 4TNAF is used †. Since TinyECCK16 uses the mixed coordinate system for ECADD, (8M + 3S) operations are required for one ECADD (M and S denote a field multiplication and a field squaring, respectively; the elliptic curve point doubling (ECDBL) operation is replaced with 3 squarings in the wTNAF representation). Thus, the running time of a scalar multiplication using 4TNAF is computed as (264 · 0.00178422 + 585 · 0.00023538) = 0.6087 secs. The estimate is about 0.1 secs less than the measured value (0.7141 secs); however, this is acceptable considering the overhead of function calls and loop counting.
Table 5 shows the running time of TinyECCK16 and its improvement ratio when Algorithm 4 is used instead of Algorithm 1 and Algorithm 2. Algorithm 4 saves 2.82-6.93% (resp. 8.44-11.76%) of the running time of elliptic curve operations compared with Algorithm 2 (resp. Algorithm 1). Because elliptic curve operations using a TNAF representation with a larger window require fewer ECADD and ECDBL operations, namely fewer field multiplications, the efficiency of the proposed algorithm is attenuated.
† Here, the window size means the window used for computing a scalar multiplication with wNAF or wTNAF. It is different from the window used in the ltr comb method.
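The 4TNAF estimate can be written out step by step. The first three lines use only the figures given in the text; the decomposition of the operation counts in the last line (33 ECADDs of 8M + 3S each, plus 162 doubling replacements of 3 squarings each) is an inference that happens to match 264 and 585 and is not stated in the text.

```latex
\begin{align*}
T_{\mathrm{est}} &= 264\,T_M + 585\,T_S
  = 264 \times 0.00178422 + 585 \times 0.00023538 \\
 &\approx 0.47103 + 0.13770 = 0.60873\ \text{s}, \\
T_{\mathrm{meas}} - T_{\mathrm{est}} &\approx 0.7141 - 0.6087 = 0.1054\ \text{s}, \\
264 &= 33 \times 8, \qquad 585 = 33 \times 3 + 162 \times 3 .
\end{align*}
```

Under this reading, field multiplications account for roughly three quarters of the estimated time, which is why a faster multiplication translates directly into faster scalar multiplication.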
Performance Comparisons
The code size of software is crucial because the RAM and ROM sizes of sensor motes are very limited (see Table 2). Table 6 compares the existing ECC software with regard to execution time and memory requirements. TinyECC and [10] are based on GF(p). NanoECC provides implementations of ECC over both GF(p) and GF(2^m). TinyECC, [10] and NanoECC over GF(p) utilize many optimization techniques such as the hybrid multiplication algorithm, pseudo-Mersenne primes, Jacobian coordinates, and sliding window scalar multiplication (the fixed-base comb method in the case of NanoECC). NanoECC over GF(2^m) is based on a Koblitz curve and uses the Karatsuba-Ofman multiplication algorithm and a fast reduction algorithm using a pentanomial. Because NanoECC uses MIRACL [19] for cryptographic operations, it is not optimized for sensor platforms with respect to code size. In the case of [10], since the code size of SHA-1 is 30 Kbytes, it requires a much larger code size than TinyECCK16 and TinyECC. TinyECCK16 is implemented with the limited memory of sensor motes in mind. Thus, it requires a smaller code size than the other implementations. Since TinyECCK16 uses wTNAF-based scalar multiplication, it requires 982 bytes and 1,750 bytes of RAM when 4TNAF and 5TNAF are applied, respectively. Because TinyECCK16 stores additional negative points such as {−P, −3P, ..., −(2^(w−1) − 1)P} for efficient scalar multiplication, it can be further optimized by storing only positive points at the expense of a small performance degradation.
TinyECCK16 is the fastest implementation of ECC on the Tmote Sky sensor mote compared with other implementations over both GF(p) and GF(2^m). It takes 0.64 secs to compute a scalar multiplication when 5TNAF is applied. NanoECC over GF(p) provides comparable performance; however, it requires much more memory. The performance in [10] was measured with a 4 MHz clock. Since the Tmote Sky mote can run at both 4 MHz and 8 MHz, we can expect the running time of [10] at 8 MHz to be half of the time measured at 4 MHz. However, TinyECCK16 is more efficient than [10] with respect to running time and memory consumption even if [10] runs at 8 MHz.
Concluding Remarks
In this paper, we showed that the technique proposed in [14] becomes less efficient on extended word sizes such as 16 and 32 bits, and described how to improve the performance of the ltr comb method in such environments. The proposed method reduces the number of shifts of the 2t-word intermediate result C to one. It saves two and six shifts of C when a 16-bit and a 32-bit word is used, respectively. The proposed technique saves 15.06% (resp. 5.12%) of the running time compared with the typical ltr comb method (resp. the algorithm proposed in TinyECCK). This improvement saves 8.4-11.8% (resp. 2.82-6.93%) of the running time of elliptic curve operations such as scalar multiplication, signing and verification compared with the typical ltr comb method (resp. TinyECCK's multiplication algorithm).
Through theoretical analysis, we have shown that the proposed algorithm becomes more promising when it is implemented on a more extended word size such as 32 bits. TinyECCK16 using 5TNAF computes a scalar multiplication within 0.64 secs, and it generates (resp. verifies) a signature within 0.77 secs (resp. 1.26 secs) with 14,422 bytes of ROM and 1,750 bytes of RAM. This result is better with regard to execution time and memory requirements than existing ECC software implemented on the 16-bit Tmote Sky sensor mote.
