Abstract. In this paper, we revisit a generally accepted opinion: that implementing an Elliptic Curve Cryptosystem (ECC) over GF(2^m) on sensor motes with a small word size is not appropriate, because XOR multiplication over GF(2^m) is not efficiently supported by current low-powered microprocessors. Although there are some implementations over GF(2^m) on sensor motes, their performance is not satisfactory enough for wireless sensor networks (WSNs). We have found that field multiplication over GF(2^m) involves a number of redundant memory accesses, and that its inefficiency originates from this problem. Moreover, the field reduction process also requires many redundant memory accesses. Therefore, we propose techniques for reducing unnecessary memory accesses. With the proposed strategies, the running times of field multiplication and reduction over GF(2^163) are decreased by 21.1% and 24.7%, respectively. These savings decrease the execution times of Elliptic Curve Digital Signature Algorithm (ECDSA) operations (signing and verification) by around 15% to 19%. We present TinyECCK (Tiny Elliptic Curve Cryptosystem with Koblitz curve, a TinyOS package supporting elliptic curve operations), which is, as far as we know, the fastest ECC implementation over GF(2^m) on 8-bit sensor motes using the ATmega128L. Through comparisons with existing software implementations of ECC on sensor motes, built in C or in a hybrid of C and inline assembly, we show that TinyECCK outperforms them in terms of running time, code size and supporting services. Furthermore, by comparing TinyECCK with TinyECC, a well-known ECC implementation over GF(p), we show that field multiplication over GF(2^m) can be faster than that over GF(p) on the 8-bit ATmega128L processor. TinyECCK with sect163k1 can compute a scalar multiplication within 1.14 secs on a MICAz mote at the expense of 5,592 bytes of ROM and 618 bytes of RAM. Furthermore, it can generate a signature and verify it in 1.37 and 2.32 secs, respectively, with 13,748 bytes of ROM and 1,004 bytes of RAM.
Introduction
Many researchers have tried to apply public-key cryptosystems, especially ECC, to wireless sensor networks in order to overcome the limitations of symmetric-key based protocols in the pairwise key setup and broadcast authentication phases. They concluded that employing ECC is viable in WSNs: their implementations have shown reasonable performance in running time and code size [1, 8, 10, 15]. Until now, the implementations giving relatively satisfactory performance are all based on GF(p) [1, 8, 15]. On the other hand, the implementations over GF(2^m) result in disappointing performance [4, 7, 9, 10]. Some of the literature [1, 8, 10, 15] attributes this poor performance to the insufficient support for field arithmetic over GF(2^m), especially field multiplication, in current low-powered microprocessors that work with a small word size, and therefore concludes that implementing ECC over GF(2^m) leads to lower performance. This paper revisits this opinion and shows that field multiplication in GF(2^m) can be faster than that in GF(p). There are the following misunderstandings about the implementation of ECC over GF(2^m) on sensor motes:
Inefficient field multiplication: Field multiplication, which is one of the most frequent operations in elliptic curve arithmetic, is regarded as being less efficient in GF(2^m) than in GF(p) on low-powered devices with a small word size, since it requires partial XOR multiplications which are not efficiently supported by current microprocessors at the instruction level.
Heavy memory requirement for ECDSA: ECDSA implementations over GF(2^m) require not only field arithmetic over GF(2^m) but also field arithmetic over GF(p) for generating and verifying digital signatures. Thus, it may be thought that the code size of ECDSA over GF(2^m) is larger than that over GF(p). Indeed, most existing works over GF(2^m) implement only the Elliptic Curve Diffie-Hellman (ECDH) protocol on their motes. However, the code size of an optimized implementation of ECDSA over GF(2^m) is comparable to that over GF(p). Our implementation, TinyECCK, achieves an optimized code size for ECDSA and outperforms TinyECC, known as the most efficient software implementation of ECDSA over GF(p) on sensor motes.
The contributions of this paper are described as follows.
1. Showing that field multiplication over GF(2^m) can be faster than that over GF(p): We have found that field multiplication and reduction over GF(2^m) involve many redundant memory accesses. In fact, most of the intermediate results of consecutive XOR multiplications during a field multiplication over GF(2^m) are stored at the same memory destination, and the same values are loaded several times. We present techniques to eliminate many of the redundant memory accesses in the field multiplication and reduction phases. As a result of applying the proposed techniques, the execution times of field multiplication and reduction over GF(2^163) are decreased by 21.1% and 24.7%, respectively.
Related Work
There have been several implementations of ECC over both GF (2 m ) and GF (p) on sensor motes. They have tried to prove the feasibility of ECC for WSNs.
Existing Implementations over GF (2 m )
Malan et al. implemented EccM, which was the first implementation of ECC over GF(2^m) on an 8-bit sensor mote [4]. They used ECC to provide a key distribution mechanism for UC Berkeley's TinySec [2] module. EccM takes 34 secs to generate a public key and requires 34,342 bytes of ROM for code. In [9], Yan and Shi indicated that software implementations of ECC over GF(2^m) were still slow on small computing devices such as sensor nodes. They implemented 163-bit ECC using fast modular reduction on an 8-bit ATmega128L processor. Their implementation requires 11,592 bytes of code and takes 13.9 secs for a scalar multiplication. Eberle et al. pointed out that field arithmetic over GF(2^m), especially field multiplication, is prohibitively slow since general-purpose microprocessors do not support arithmetic in that field [10]. They claimed that an ECC implementation over GF(2^m) can be faster than one over GF(p) given an architectural extension in the form of an instruction set extension. Indeed, their implementation using the architectural extension took only 0.29 secs for a 163-bit ECC point multiplication over GF(2^m), while their assembly implementation took 4.14 secs. This result supports the view that ECC over GF(2^m) is more suitable for hardware implementation than for software implementation. Blaß and Zitterbart implemented ECDH, ECDSA and El-Gamal over GF(2^113) and compared their performance with that of EccM [7]. Their implementation took 6.88 and 24.17 secs for signature generation and verification, respectively. Their code occupies 75,088 bytes of ROM.
Existing Implementations over GF (p)
To prove the feasibility of public-key cryptography on WSNs, Gura et al. implemented RSA and ECC over GF(p) with assembly code and an instruction set extension on an 8-bit ATmega128 processor and compared the performance of RSA and ECC [15]. Their ECC implementation took only 0.81 secs for a scalar multiplication, which supports the assertion that the use of public-key cryptography, especially ECC, is viable for WSNs. They also presented a hybrid multiplication algorithm that exploits the advantages of the operand-scanning and product-scanning multiplication algorithms to reduce the number of memory accesses. TinyECC [1] is a software package providing ECC operations, such as scalar multiplication and ECDSA services, over GF(p) on TinyOS [16]. TinyECC adopts several optimization techniques to achieve computational efficiency: optimized modular reduction using a pseudo-Mersenne prime, the sliding window method, Jacobian coordinates, inline assembly and hybrid multiplication. TinyECC with its major operations (such as field multiplication and modular reduction) built in inline assembly can generate a signature and verify it within 2.00 and 2.43 secs, respectively, at the cost of a larger code size of 19,308 bytes of ROM. On the other hand, TinyECC built solely in C takes 6.26 and 7.92 secs for signature generation and verification, with a smaller code size of 15,872 bytes of ROM. Until now, the performance of ECC implementations over GF(p) has surpassed that of implementations over GF(2^m) with 8-bit words. From these observations, it appears that software implementations of ECC over GF(p) outperform those over GF(2^m) on small devices, and that ECC over GF(2^m) is suitable only for hardware implementation. However, in this paper we show that the performance of an optimized software implementation of ECC over GF(2^m) (TinyECCK) can surpass that of an optimized implementation over GF(p) (TinyECC) on 8-bit sensor motes.
Overview of Elliptic Curve Cryptosystem
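The set of solutions of the following Weierstrass equation over a field F, together with the point at infinity O, forms an abelian group with O as its identity:

$$E : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6.$$

When the characteristic of F is 2, the equation of a non-supersingular curve simplifies to:

$$E : y^2 + xy = x^3 + ax^2 + b.$$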
By the group law, the point P_3 that results from adding two points P_1 and P_2 on the curve is also on the curve. Adding two different points is called elliptic curve point addition (ECADD), and adding a point to itself is called elliptic curve point doubling (ECDBL). Let P_1 = (x_1, y_1) and P_2 = (x_2, y_2) be two arbitrary points in E(GF(2^m)) with P_1 ≠ −P_2. Then the coordinates of P_3 = (x_3, y_3) = P_1 + P_2 can be computed in affine coordinates as follows. For ECADD (P_1 ≠ P_2):

$$\lambda = \frac{y_1 + y_2}{x_1 + x_2}, \qquad x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad y_3 = \lambda(x_1 + x_3) + x_3 + y_1.$$

For ECDBL (P_1 = P_2):

$$\lambda = x_1 + \frac{y_1}{x_1}, \qquad x_3 = \lambda^2 + \lambda + a, \qquad y_3 = x_1^2 + (\lambda + 1)x_3.$$

Both ECADD and ECDBL in affine coordinates require 1 field inversion and 2 field multiplications. Projective coordinates become advantageous when field inversion is more expensive than field multiplication. For example, López-Dahab (LD) projective coordinates require 14 field multiplications for ECADD and 4 field multiplications for ECDBL.
Therefore, LD coordinates are expected to be more suitable than affine coordinates when I > 7M (I = field inversion, M = field multiplication) on the target platform. Adding a point P to itself k times is called scalar multiplication; it is expressed as Q = kP, where k is an integer and P ∈ E(GF(2^m)). Scalar multiplication is the dominant operation in ECC protocols such as ECDH and ECDSA.
Implementation Details
We have implemented TinyECCK on a MICAz [14] sensor mote including the 8-bit ATmega128L processor. We use the domain parameter (sect163k1) recommended by [3] and polynomial basis to represent elements in GF (2 m ). We modified the original field arithmetic algorithms using 32-bit word size which are presented in Guide to Elliptic Curve Cryptography [5, 6] into the forms suitable for 8-bit word environment. For efficiency, TinyECCK makes use of recoding algorithms such as wNAF and wTNAF and selects mixed coordinate system rather than affine coordinate.
Preliminaries
We assume a word size of 8 bits, since the ATmega128L processor operates on 8-bit words. The following notation is used in the rest of this paper. Let A and B be elements of GF(2^m). [5, 6] ). The upper 4 bits and lower 4 bits of each word are expanded to 8-bit words by inserting a 0 bit into each odd position.
Field Arithmetic over GF(2^m)
Algorithm 1 Polynomial squaring
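The nibble expansion described above can be sketched as follows. This is a minimal illustration in Python rather than the TinyECCK code itself; the names `SQUARE_TABLE` and `poly_square` are hypothetical, and polynomials are assumed to be stored as little-endian lists of 8-bit words. Reduction modulo f(z) is not included.

```python
# Expand a 4-bit nibble to 8 bits by inserting a 0 bit above each input bit:
# (b3 b2 b1 b0) -> (0 b3 0 b2 0 b1 0 b0). SQUARE_TABLE is a hypothetical name.
SQUARE_TABLE = [
    sum(((u >> bit) & 1) << (2 * bit) for bit in range(4)) for u in range(16)
]


def poly_square(a):
    """Square a binary polynomial given as little-endian 8-bit words (unreduced)."""
    c = [0] * (2 * len(a))
    for i, word in enumerate(a):
        c[2 * i] = SQUARE_TABLE[word & 0x0F]    # lower nibble fills the even word
        c[2 * i + 1] = SQUARE_TABLE[word >> 4]  # upper nibble fills the odd word
    return c
```

Squaring in GF(2^m) is linear, so each input word maps to two output words with two table lookups and no XORs; this is why squaring is so much cheaper than multiplication in a binary field.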
Field Multiplication. Because field multiplication is one of the most frequent operations during a scalar multiplication, it should be implemented efficiently. Even though the shift-and-add method is the most straightforward, it is not desirable for software implementations due to its large number of memory accesses and word shifts. In our experiments, we found that the left-to-right comb method using a window is more efficient than the shift-and-add and right-to-left comb methods; the optimal window size on the 8-bit ATmega128L processor is 4.
Even though the table for window size 4 requires the computation of 15 elements (excluding the zero element), the main computation is considerably accelerated at the expense of this small overhead. Alg. 2 describes the left-to-right comb method using a window (w = 4) with an 8-bit word length (Alg. 2 is the 8-bit version of the left-to-right comb method described in [5, 6]). Since the word length is 8 bits and the window size is 4,
, respectively. In fact, the partial XOR multiplication C{j} ⊕ T_u in step 6 and step 10 of Alg. 2 is itself carried out in a for-loop. In other words, the real code of
Algorithm 2 Left-to-right comb method using window width w = 4
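As an illustration of the method named in the caption above, the following Python sketch implements a left-to-right comb multiplication with w = 4 on 8-bit words. It is a sketch under stated assumptions, not the TinyECCK source: operands are little-endian lists of t bytes, the step numbering of Alg. 2 is not reproduced, and all function names (`to_words`, `from_words`, `build_table`, `comb_multiply`, `clmul_reference`) are hypothetical. The result is the unreduced product in 2t words.

```python
W = 8        # word size in bits
WINDOW = 4   # window width w


def to_words(value, t):
    """Split an integer-encoded binary polynomial into t little-endian bytes."""
    return [(value >> (W * i)) & 0xFF for i in range(t)]


def from_words(words):
    """Inverse of to_words."""
    return sum(word << (W * i) for i, word in enumerate(words))


def build_table(b, t):
    """T[u] = u(z) * b(z) for every polynomial u of degree < 4, in t + 1 words."""
    b_int = from_words(b)
    table = []
    for u in range(1 << WINDOW):
        prod = 0
        for bit in range(WINDOW):
            if (u >> bit) & 1:
                prod ^= b_int << bit
        table.append(to_words(prod, t + 1))
    return table


def comb_multiply(a, b):
    """Unreduced product a(z) * b(z) as 2t little-endian bytes."""
    t = len(a)
    table = build_table(b, t)
    c = [0] * (2 * t)
    # First loop: upper nibbles U(a[j]). Each C{j} ^= T_u is a loop over t + 1 words.
    for j in range(t):
        row = table[a[j] >> 4]
        for i in range(t + 1):
            c[j + i] ^= row[i]
    # Multiply the intermediate result by z^4 (shift left by the window size).
    carry = 0
    for i in range(2 * t):
        shifted = (c[i] << WINDOW) | carry
        carry = c[i] >> (W - WINDOW)
        c[i] = shifted & 0xFF
    # Second loop: lower nibbles L(a[j]).
    for j in range(t):
        row = table[a[j] & 0x0F]
        for i in range(t + 1):
            c[j + i] ^= row[i]
    return c


def clmul_reference(x, y):
    """Bit-serial carry-less multiplication, used only to check the sketch."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        y >>= 1
    return result
```

Every `c[j + i] ^= row[i]` in the inner loops stands for two loads, one XOR and one store on the target processor, which is the memory traffic that the later sections set out to reduce.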
Modular Reduction. The result of multiplying two elements or squaring an element in GF(2^m) must be reduced modulo the irreducible polynomial f. NIST recommends several reduction polynomials for fast reduction in the FIPS 186-2 standard [3]. Since these polynomials are either pentanomials or trinomials, reduction of c(z) modulo f(z) can be performed efficiently one word at a time. Alg. 3 reduces the result of a field multiplication or field squaring to an element of GF(2^163) (Alg. 3 is the 8-bit version of the fast reduction presented in [5, 6]). Like the field multiplication algorithm above, Alg. 3 also involves a large number of memory accesses, since the word size (W = 8) is small.
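A word-at-a-time reduction of this kind can be sketched as below. The sketch assumes the NIST pentanomial f(z) = z^163 + z^7 + z^6 + z^3 + 1 and a little-endian list of up to 42 bytes as input; the shift amounts are derived here from z^168 ≡ z^12 + z^11 + z^8 + z^5 (mod f) and are not copied from Alg. 3, whose exact steps may differ. The names `reduce_163` and `mod_reference` are hypothetical.

```python
T_WORDS = 21  # ceil(163 / 8) bytes per field element


def reduce_163(c):
    """Reduce modulo f(z) = z^163 + z^7 + z^6 + z^3 + 1, one 8-bit word at a time."""
    c = list(c)
    for i in range(len(c) - 1, T_WORDS - 1, -1):
        t = c[i]
        # Word i holds z^(8i) * t(z), and z^168 = z^12 + z^11 + z^8 + z^5 (mod f),
        # so t is folded into words i-21, i-20 and i-19.
        c[i - 21] ^= (t << 5) & 0xFF
        c[i - 20] ^= (t >> 3) ^ t ^ ((t << 3) & 0xFF) ^ ((t << 4) & 0xFF)
        c[i - 19] ^= (t >> 5) ^ (t >> 4)
        c[i] = 0
    # Word 20 holds bits 160..167; bits 163..167 still have to be folded down.
    t = c[20] >> 3
    c[0] ^= ((t << 7) ^ (t << 6) ^ (t << 3) ^ t) & 0xFF
    c[1] ^= (t >> 1) ^ (t >> 2)
    c[20] &= 0x07
    return c[:T_WORDS]


def mod_reference(value):
    """Bit-serial reduction of an integer-encoded polynomial, used only as a check."""
    f = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
    while value.bit_length() > 163:
        value ^= f << (value.bit_length() - 164)
    return value
```

Each loop iteration reads and rewrites three neighbouring words, and consecutive iterations touch overlapping words, so the same memory locations are loaded and stored repeatedly.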
Selection of Coordinate System
The ratio of inversion to multiplication over GF(2^163) on the ATmega128L is 24.99 (i.e., M : I = 1 : 24.99). Thus, eliminating the inversion operations during scalar multiplication improves performance. This is why we select the López-Dahab coordinate system rather than the affine coordinate system. Table 1 supports the selection of this coordinate system for TinyECCK. We use mixed coordinates for ECADD, since adding two points represented in different coordinate systems is more efficient than adding two points in the same representation [12, 6]. Hence, we build the precomputed table of points in the affine coordinate system.
Width-w NAF
The inverse of P = (x, y) over GF(2^m) is −P = (x, x + y). Thus the inverse of an element of E(GF(2^m)) can be calculated at negligible cost, and point subtraction is as efficient as point addition. This motivates the use of a signed-digit representation of the scalar k. The non-adjacent form (NAF) provides the optimal nonzero density (1/3) among such representations. If some extra memory is available, the execution time of scalar multiplication can be decreased further by the sliding window method, which processes w digits of k at a time. A width-w NAF (wNAF) provides a nonzero density of 1/(w + 1) at the expense of a precomputation table containing (2^(w−2) − 1) precomputed points in addition to the original point. Thus, a scalar multiplication using wNAF costs l · ECDBL + l/(w + 1) · ECADD, where l is the bit length of k. Since 128 Kbytes of ROM are available on a MICAz sensor mote, we applied the wNAF recoding algorithm to scalar multiplication.
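The recoding step can be illustrated with a short Python sketch of the standard width-w NAF computation. This is a generic textbook formulation, not the TinyECCK routine; the function name `wnaf` and the least-significant-first digit order are choices made for this example.

```python
def wnaf(k, w):
    """Width-w NAF digits of a positive integer k, least significant digit first.

    Every nonzero digit is odd with absolute value below 2^(w-1), and any
    w consecutive digits contain at most one nonzero digit.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)          # k mod 2^w, in [0, 2^w)
            if d >= (1 << (w - 1)):   # map to the signed residue
                d -= 1 << w
            k -= d                    # now k is divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits
```

For instance, with w = 2 the integer 7 is recoded as 8 − 1, so a left-to-right scalar multiplication needs one point addition and one point subtraction instead of the two additions that the binary expansion 111 would require.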
Koblitz Curves and Width-w TNAF
Koblitz curves are binary elliptic curves defined over a binary field GF(2^m) by the equation E/GF(2^m) : y^2 + xy = x^3 + ax^2 + 1, where a ∈ {0, 1}. The main advantage of Koblitz curves is that the point doublings in a scalar multiplication can be replaced by the efficiently computable Frobenius map τ(x, y) = (x^2, y^2), τ(∞) = ∞, so that scalar multiplication algorithms can be developed without using any point doublings. Because (τ^2 + 2)P = µτ(P) is known to hold for all points on the curve, where µ = (−1)^(1−a), the Frobenius map can be regarded as the complex number τ = (µ + √−7)/2.
To decrease the number of point additions in a scalar multiplication, the τ -adic representation for k should be sparse and short. This can be achieved by applying τ -adic NAF (TNAF), which can be viewed as a τ -adic analogue of the ordinary NAF.
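The TNAF of an element r0 + r1·τ can be computed by repeated division by τ, following the digit-by-digit recoding given by Solinas [11]. The Python sketch below is an illustration of that recoding only: the function names `tnaf` and `evaluate` are hypothetical, the partial reduction of k is not included, and `mu` is the value µ = (−1)^(1−a) of the curve (µ = 1 for sect163k1, for which a = 1).

```python
def tnaf(r0, r1, mu):
    """tau-adic NAF digits (least significant first) of r0 + r1*tau.

    Uses tau^2 = mu*tau - 2, so dividing an element with even r0 by tau
    gives (r1 + mu*r0/2) + (-r0/2)*tau.
    """
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:
            # Choose u in {1, -1} so that the next digit is forced to 0.
            u = 2 - ((r0 - 2 * r1) % 4)
            r0 -= u
        else:
            u = 0
        digits.append(u)
        half = r0 // 2
        r0, r1 = r1 + mu * half, -half
    return digits


def evaluate(digits, mu):
    """Horner evaluation of sum(u_i * tau^i); returns (x, y) meaning x + y*tau."""
    x, y = 0, 0
    for u in reversed(digits):
        # (x + y*tau) * tau = -2*y + (x + mu*y)*tau, then add the digit u.
        x, y = -2 * y + u, x + mu * y
    return x, y
```

In a scalar multiplication, each zero digit costs only an application of the Frobenius map (two field squarings in affine coordinates), and each nonzero digit costs one point addition or subtraction.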
The running time of TNAF-based scalar multiplication can be decreased by applying a window method to the TNAF representation: the width-w TNAF (wTNAF) processes w digits at a time at the expense of extra memory. Since the digits of the wTNAF belong to the set {±1, ±3, . . . , ±(2^(w−1) − 1)}, it requires (2^(w−2) − 1) precomputed points, the same number as for wNAF. In [11], Solinas proposed efficient algorithms for TNAF and wTNAF recoding. TinyECCK provides an implementation of the width-w τ-adic non-adjacent form (wTNAF) [11], since it is based on sect163k1 [3]. Therefore, a scalar multiplication using wTNAF can be computed with only l/(w + 1) · ECADD. However, the implementation of wTNAF requires more code than wNAF, because it needs an additional partial reduction function and a rounding procedure [11]. TinyECCK takes 10,870 bytes and 13,748 bytes of ROM when using wNAF and wTNAF, respectively, which are only 8.3% and 10.5% of the total ROM size (128 Kbytes). From our experiments, we found that the optimal window size on the 8-bit MICAz mote is 4. TinyECCK mainly uses the wTNAF recoding algorithm rather than wNAF, since scalar multiplication with wTNAF is faster than with wNAF for the same number of precomputed points.
Efficient implementation of partial reduction modulo in wTNAF recoding
The length of TNAF(k) and wTNAF(k) is approximately 2 log_2(k), which is twice the length of NAF(k). To handle the problem of a long TNAF, we need to find a shorter equivalent representation of k. In other words, we must find an element ρ′ ∈ Z[τ] of as small a norm as possible with ρ′ ≡ k (mod δ), where δ = (τ^m − 1)/(τ − 1), and then apply TNAF or wTNAF recoding to ρ′ instead of k [6, 11]. Alg. 4 is responsible for finding such a ρ′. It is the
, and V m = −4845466632539410776804317, in case of sect163k1). s 0 and s 1 are calculated with Lucas sequence [11] and TinyECCK sets the value of C to be 16 for providing high probability of reduction. The purpose of the Round function in step 9 is to find appropriate integers q 0 and q 1 such that q 0 + q 1 τ is close to complex number λ 0 + λ 1 τ [6, 11] . Instead of using long floating point numbers, we can obtain fractional part of the λ 0 and λ 1 of step 7 by using only some floating point variables. The bits lower than C become fractional part as the result of division by 2 C . For example, let us assume that (11111111) 2 is divided by (10000) 2 . Then the integer part is (1111) 2 and the fractional part is (.1111) 2 . Thus, the value of the fractional part is computed by summing these results (
TinyECCK efficiently computes the wTNAF representation of scalar k with these techniques.
Interleave Method for the Verification Procedure in ECDSA
Computing a common secret key in ECDH and generating a signature in ECDSA each involve one scalar multiplication. On the other hand, the signature verification step requires the sum of two scalar multiplications, uP + vQ, where u, v are scalars and P, Q are points on the curve. If the verification step is implemented without care, its execution time will be almost twice that of the signing step. Thus, for the verification step of ECDSA in TinyECCK we apply the interleave method [18], which is a multi-scalar multiplication algorithm. The interleave method allows a different recoding algorithm with a different window size to be applied to each scalar, which makes it appropriate for memory-constrained devices such as sensor motes.
Fig. 1. Process of field multiplication using Alg. 2.
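The idea of the interleave method for uP + vQ can be sketched in Python as follows. To keep the example self-contained, the additive group of integers modulo `modulus` stands in for the elliptic curve group, so "doubling" is multiplication by 2 and "point addition" is modular addition; this substitution, the function names, and the use of wNAF for both scalars are assumptions of the sketch rather than details taken from TinyECCK.

```python
def wnaf(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        d = 0
        if k & 1:
            d = k % (1 << w)
            if d >= (1 << (w - 1)):
                d -= 1 << w
            k -= d
        digits.append(d)
        k >>= 1
    return digits


def interleaved_multiply(u, p, wu, v, q, wv, modulus):
    """Compute u*p + v*q in Z_modulus with one shared doubling chain.

    Each scalar gets its own window width and its own table of odd multiples.
    """
    du, dv = wnaf(u, wu), wnaf(v, wv)
    table_p = {d: (d * p) % modulus for d in range(1, 1 << (wu - 1), 2)}
    table_q = {d: (d * q) % modulus for d in range(1, 1 << (wv - 1), 2)}
    acc = 0
    for i in range(max(len(du), len(dv)) - 1, -1, -1):
        acc = (2 * acc) % modulus  # one doubling serves both scalars
        for digits, table in ((du, table_p), (dv, table_q)):
            d = digits[i] if i < len(digits) else 0
            if d > 0:
                acc = (acc + table[d]) % modulus
            elif d < 0:
                acc = (acc - table[-d]) % modulus
    return acc
```

Because the doublings are shared, the cost is roughly one doubling chain plus the additions of both scalars, instead of two full scalar multiplications; choosing a small window for one scalar keeps its table small when memory is tight.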
Proposed Techniques for Further Improvement
The performance of the field multiplication and reduction algorithms presented in Sect 4.2 can be improved by eliminating redundant memory accesses. Both algorithms involve a large number of memory access operations, and these operations account for a large portion of the whole execution time.
Reducing Redundant Memory Accesses in Field Multiplication
Field multiplication over GF(2^m) involves many redundant memory accesses. This is why the performance of field multiplication over GF(2^m) is typically inferior to that over GF(p). Fig. 1 describes the process of field multiplication using Alg. 2. In Fig. 1, odd rows are the intermediate results of the second for-loop and even rows are those of the first for-loop of Alg. 2. All rows are then XORed with each other at corresponding positions to generate the final result C (partial XOR multiplication). According to Alg. 2, the result of the first for-loop is shifted to the left by the window size (in our case, w = 4); this is why the even rows are shifted to the left in Fig. 1. In each for-loop of Alg. 2, L(a[j]) or U(a[j]) is evaluated to index the precomputed table for b(z).
Afterwards, the corresponding element of the table is loaded and XORed with C from word j to word (j + N) (N = t). Observing the process of multiplication in Fig. 1 in detail, we find that Alg. 2 involves redundant memory accesses. The following example shows this observation (for simplicity, we consider only the second for-loop).
[Alg. 5 listing and worked example: only fragments survive in the source, namely inner loops of the form "for i ← 1 to N", a final "Return (c)", and the annotation "C1: 2 LOADs, T0: 1 LOAD, T1: 1 LOAD, STORE: 2".]
The calculations of C_1, C_2 and C_3 involve two, three, and four STORE operations, respectively, which are redundant. In addition, C_i is loaded i + 1 times. We formulate this strategy into Alg. 5. In fact, more partial XOR multiplications C{j} ⊕ T_u can be integrated at the expense of a larger code size. In our case, we combine two XOR multiplications into one as a trade-off between code size and performance; thus, the counter j of the for-loops is incremented by 2. However, the final XOR multiplications, steps 9 and 15 of Alg. 5, must be computed outside the for-loops, since t is an odd number.
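The combination of two partial XOR multiplications can be sketched in Python as below. This is an interpretation of the strategy described above, not a transcription of Alg. 5: two consecutive table rows are XORed into C in a single pass, so each overlapping word of C is loaded and stored once instead of twice, and a leftover single row is handled outside the paired loop when t is odd. All names are hypothetical, and operands are little-endian lists of t bytes.

```python
def to_words(value, t):
    return [(value >> (8 * i)) & 0xFF for i in range(t)]


def from_words(words):
    return sum(word << (8 * i) for i, word in enumerate(words))


def build_table(b, t):
    """T[u] = u(z) * b(z) for all u of degree < 4, each stored in t + 1 bytes."""
    b_int = from_words(b)
    table = []
    for u in range(16):
        prod = 0
        for bit in range(4):
            if (u >> bit) & 1:
                prod ^= b_int << bit
        table.append(to_words(prod, t + 1))
    return table


def comb_multiply_merged(a, b):
    """Left-to-right comb (w = 4) with pairs of partial XOR multiplications merged."""
    t = len(a)
    table = build_table(b, t)
    c = [0] * (2 * t)

    def xor_rows(select):
        j = 0
        while j + 1 < t:
            row0 = table[select(a[j])]      # aligned at word j
            row1 = table[select(a[j + 1])]  # aligned at word j + 1
            c[j] ^= row0[0]
            for i in range(1, t + 1):
                # One load and one store of c[j + i] serve both rows.
                c[j + i] ^= row0[i] ^ row1[i - 1]
            c[j + t + 1] ^= row1[t]
            j += 2
        if j < t:  # t odd: the last row is processed on its own
            row = table[select(a[j])]
            for i in range(t + 1):
                c[j + i] ^= row[i]

    xor_rows(lambda word: word >> 4)        # upper nibbles
    carry = 0
    for i in range(2 * t):                  # multiply by z^4
        shifted = (c[i] << 4) | carry
        carry = c[i] >> 4
        c[i] = shifted & 0xFF
    xor_rows(lambda word: word & 0x0F)      # lower nibbles
    return c


def clmul_reference(x, y):
    """Bit-serial carry-less multiplication, used only to check the sketch."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        y >>= 1
    return result
```

In the paired loop, each of the t middle words costs three loads, one store and two XORs, and the two end words cost two loads, one store and one XOR each, which corresponds to the term t(3L + S + 2X) + (4L + 2S + 2X) used in the analysis that follows.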
Theoretical Analysis
We can calculate the number of STOREs and LOADs saved by the proposed strategy. In the original algorithm, the counter j of each for-loop runs from 0 to (t − 1), and one partial XOR multiplication C{j} ⊕ T_u consists of (t + 1)(2L + S + X) operations (L = LOAD, S = STORE, X = XOR) plus an overhead of (L + S) per partial multiplication. Since the XOR multiplication is computed 2t times in the original algorithm, the total number of operations in its for-loops is 2t[(t + 1)(2L + S + X) + (L + S)] = (4t^2 + 6t)L + (2t^2 + 4t)S + (2t^2 + 2t)X. On the other hand, in the proposed algorithm, one combined XOR multiplication requires [t(3L + S + 2X) + (4L + 2S + 2X)]. Since the combined XOR multiplication is processed (t − 1) times and two single XOR multiplications remain, the total number of operations in the for-loops is (t − 1)[t(3L + S + 2X) + (4L + 2S + 2X)] + 2[(t + 1)(2L + S + X) + (L + S)] = (3t^2 + 5t + 2)L + (t^2 + 3t + 2)S + (2t^2 + 2t)X. Therefore, (t^2 + t − 2) STOREs and (t^2 + t − 2) LOADs are saved. Table 2 shows that replacing Alg. 2 with Alg. 5 decreases the execution time by 21.1% and saves (460S + 460L) when t = 21.
Reducing Redundant Memory Accesses in Modular Reduction
The fast reduction (Alg. 3) also involves many redundant memory accesses. Let us consider an example in which the counter i decreases from 30 to 27 in the process of Alg. 3. As the counter decreases (i.e., i = 30, 29, 28, 27), the execution steps are as follows:
Table 4. The ratio of contribution between Alg. 5 and Alg. 6. The ratio is computed as (saved time from Alg. 5 or Alg. 6)/(total saved time using Alg. 5 and Alg. 6).
Alg. 3 uses 16 STOREs and 16 LOADs to compute C[30], C[29], C[28], and C[27].
However, we can use the following strategy to reduce the redundant STOREs and LOADs.
In this case, the number of STOREs and LOADs is reduced from 16 to 10. Alg. 3 requires 20*4 = 80 STOREs and LOADs in the for-loop. However, the proposed method requires only 5*10 = 50 STOREs and LOADs, which results in a saving of 30 STOREs and 30 LOADs. Alg. 6 is the formulation of the proposed strategy. As Table 2 indicates, replacing Alg. 3 with Alg. 6 decreases the execution time of modular reduction by 24.7%. The degree of combination can be extended further; however, the deeper the combination, the more code is required. Therefore, it is necessary to find the optimal degree. Through experiments, we found that a degree of 4 is more appropriate than other degrees. We apply the two aforementioned strategies in the implementation of TinyECCK. Table 3 shows the improved performance when TinyECCK is equipped with Alg. 5 and 6 instead of Alg. 2 and 3. When the proposed strategies are applied, TinyECCK achieves a saving of around 15% to 19% in execution time. The improvement is lower when TinyECCK uses wTNAF than when it uses wNAF, since with wTNAF the ECDBL operation is replaced by a few inexpensive squarings.
Remarks
Algorithm 2 of [17], loop-unrolled reduction with a 32-bit word size, is very similar to the proposed Alg. 6. However, the focus of [17] is to show that changing the reduction polynomial can improve the performance of the reduction algorithm, rather than to verify that unrolling techniques can reduce redundant memory accesses. Therefore, the purposes and contributions of [17] differ from ours, which aim at showing that unrolling techniques can be used in both field multiplication and reduction to reduce the number of redundant memory accesses. Furthermore, the improvement from Alg. 5 is bigger than that from Alg. 6. Thus, the main contribution of this paper is Alg. 5. Table 4 shows that Alg. 5 accounts for around 70% of the improvement, while Alg. 6 accounts for only 20% to 30%.
Experimental Results and Analysis
This section analyzes the performance of TinyECCK in terms of running time, memory occupancy, and supporting services, and compares it with the performance of existing ECC software implementations.
Analysis of Field Operations
We compare TinyECCK with TinyECC [1] with respect to the running time of field operations, to show that field multiplication over GF(2^m) can be faster than that over GF(p) on sensor motes. TinyECC applies hybrid multiplication/squaring, which uses additional registers to reduce unnecessary memory accesses, and optimized modular reduction using a pseudo-Mersenne prime. Thus it is fair to compare TinyECCK with TinyECC, since TinyECCK also uses optimized algorithms: the left-to-right comb method using a window (Alg. 5) and fast reduction (Alg. 6). Table 5 shows that the multiplication of TinyECCK is faster than that of TinyECC (the running times of multiplication and squaring include the reduction time). In fact, the field multiplication of TinyECCK is slower than that of TinyECC when using Alg. 2. However, with Alg. 5, TinyECCK's field multiplication becomes faster than TinyECC's. Apart from the advantage of field inversion in TinyECCK, field squaring in TinyECCK is much more efficient than in TinyECC.
Consideration of Code Size
Even though TinyECCK implements field arithmetic over both GF(2^m) and GF(p) to provide ECDSA services, it requires less code than TinyECC. Table 6 compares TinyECCK with TinyECC in terms of running time and code size when they perform the same operations. TinyECC, which uses inline assembly in critical parts such as multiplication, squaring and reduction, achieves improved performance at the expense of a larger code size, yet it is still inferior to TinyECCK. TinyECC implements hybrid multiplication/squaring algorithms, which reduce the number of memory accesses by using additional registers through inline assembly code; as a result, the code size of TinyECC increases considerably.
The code size of the scalar multiplication module in TinyECCK is only 5,592 bytes, since field multiplication and squaring can be implemented simply. However, the ECDSA module of TinyECCK requires more code, because signature generation and verification need additional field operations over GF(p). The code for field arithmetic over GF(p) in TinyECCK is smaller than that of TinyECC. This is because TinyECCK applies only a few optimization techniques to field arithmetic over GF(p), while TinyECC uses all known optimization algorithms.
As shown in Table 6 , TinyECCK is more faster and memory-efficient than TinyECC. The main reason that the code size of TinyECCK can be smaller than that of TinyECC is that TinyECCK presents better performance than implementations using inline assembly code even though it is built in only C code. TinyECCK does not need the use of inline assembly codes in that it presents good performance without applying them 8 . Application of the proposed algorithms and wTNAF-based scalar multiplications contributes this performance achievement. Through this result, we can verify our assertion: the code size of the optimized ECDSA implementation over GF (2 m ) can be smaller than that over GF (p). Fig. 2 analyzes the existing software implementations of ECC on sensor motes and compares the performance of TinyECCK with them in respect to various aspects such as running time, code size, supporting protocols, and so on. The performances of existing implementations of ECC over GF (2 m ) [4, 9, 7, 10] are relatively low compared with [8, 1] . Even if ECC in [10] is implemented with assembly code, its performance is still inferior to [1] which is implemented with C and partially inline assembly. The implementations in [4, 9] could not exploit the advantages of Koblitz curve since they did not implement wTNAF recoding algorithm even if they used the sect163k1 as a domain parameter. The critical reason why TinyECCK can be the fastest among software implementations of ECC over GF (2 m ) is that TinyECCK implements the wTNAF-based scalar multiplication and applies the proposed algorithms for field multiplication and reduction (Without applying signed recoding algorithms, the running time of TinyECCK is almost same as that of [10] ). TinyECCK provides the improved performance in view of running time, used ROM and RAM size compared with existing implementations. 
The scalar multiplication modules in TinyECCK require only 5,592 bytes of ROM; they occupy 330 bytes of RAM with 2TNAF and 618 bytes with 4TNAF. Their running time is also better than that of the existing software implementations built in C or in a hybrid of C and inline assembly. Although the ECDSA modules of TinyECCK need more memory for signature generation and verification (13,748 bytes of ROM and 1,004 bytes of RAM when 4TNAF is applied), this is still less than TinyECC requires (19,308 bytes of ROM and 1,510 bytes of RAM). TinyECCK also has a shorter initialization time for building precomputed tables and initializing domain parameters: it takes 0.2515 secs to compute a precomputed table when 4TNAF is applied, while TinyECC takes 1.83 secs to build a precomputed table with the 4-ary window method. With TinyECCK, two sensor nodes can compute a common pairwise key in about 1.14 secs, and a sensor node can generate a signature and verify it in 1.37 and 2.32 secs, respectively. In terms of supported protocols, TinyECCK provides modules for all elliptic curve operations over GF(2^m), from point addition, doubling, and scalar multiplication to ECDSA services.
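To make the role of TNAF recoding more concrete, the following C sketch recodes a small scalar into its width-2 TNAF for a Koblitz curve with mu = 1 (the case a = 1, which includes sect163k1). On such a curve the Frobenius map tau, which satisfies tau^2 = mu*tau - 2, takes the place of point doubling, so a sparse signed tau-adic expansion needs fewer point additions. This is the textbook recoding without partial reduction or windowing, not the TinyECCK code; the names tnaf_recode, tnaf_eval, and TNAF_MAX_DIGITS, and the restriction to 32-bit scalars, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TNAF_MAX_DIGITS 80  /* assumed bound, ample for a 32-bit scalar */

/* Remainder in {0,1,2,3}; C's % operator may return a negative value. */
static int mod4(int64_t x) { return (int)(((x % 4) + 4) % 4); }

/*
 * Width-2 TNAF of k for mu = 1.
 * Writes digits u[0..len-1] in {-1, 0, 1} such that
 * k = sum u[i] * tau^i with tau^2 = tau - 2, and returns len.
 */
int tnaf_recode(uint32_t k, int8_t u[TNAF_MAX_DIGITS])
{
    int64_t r0 = k, r1 = 0;          /* remaining value r0 + r1*tau */
    int len = 0;

    while (r0 != 0 || r1 != 0) {
        assert(len < TNAF_MAX_DIGITS);
        int8_t d = 0;
        if (r0 % 2 != 0) {
            /* choose d = +1 or -1 so that the next digit is zero */
            d = (int8_t)(2 - mod4(r0 - 2 * r1));
            r0 -= d;
        }
        u[len++] = d;
        /* divide r0 + r1*tau by tau; r0 is even at this point */
        int64_t t = r0 / 2;
        r0 = r1 + t;
        r1 = -t;
    }
    return len;
}

/* Evaluate sum u[i] * tau^i by Horner's rule; the result is *a + *b * tau. */
void tnaf_eval(const int8_t *u, int len, int64_t *a, int64_t *b)
{
    int64_t x = 0, y = 0;
    for (int i = len - 1; i >= 0; i--) {
        /* (x + y*tau) * tau = -2y + (x + y)*tau, then add the digit */
        int64_t nx = -2 * y + u[i];
        y = x + y;
        x = nx;
    }
    *a = x;
    *b = y;
}
```

For k = 9 the sketch yields the digits 1, 0, 0, -1, 0, 1 (least significant first), that is, 9 = tau^5 - tau^3 + 1. A scalar multiplication would then scan the digits, applying the Frobenius map at each step and adding or subtracting the base point at the nonzero positions.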
Performance Comparisons

Remarks
After finishing our work, we became aware of NanoECC [19]. NanoECC provides implementations of ECC and pairing-based cryptography (PBC) over GF(p) and GF(2^m) on the widely used MICA2 and Tmote Sky motes. NanoECC is based on MIRACL (Multiprecision Integer and Rational Arithmetic C/C++ Library), which provides all the necessary primitives and functions for symmetric-key and public-key cryptography. For ECC over GF(2^m), NanoECC uses sect163k1, the same curve that TinyECCK uses. For optimized field arithmetic, NanoECC uses Karatsuba-Ofman multiplication and a fast reduction algorithm based on f(x) = x^163 + x^7 + x^6 + x^3 + 1. For efficient big integer arithmetic over GF(p), NanoECC implements the hybrid multiplication algorithm and a fast reduction algorithm using a Solinas prime (p = 2^160 − 2^112 + 2^64 + 1). Elliptic curve points in NanoECC are represented in projective coordinates, and the fixed-base comb method with w = 4 and 16 precomputed points is applied for efficient scalar multiplication. Table 7 compares TinyECCK with NanoECC. Although NanoECC computes a scalar multiplication relatively fast compared with the existing ECC implementations on the ATmega128L processor, it requires a large amount of ROM and RAM. We think that this heavy memory requirement comes from MIRACL, which was originally intended for efficient big number arithmetic on typical computer systems. TinyECCK, on the other hand, has been developed with the memory-constrained and computation-constrained environment of sensor motes in mind. Therefore, TinyECCK using 4TNAF or 5TNAF provides better performance than NanoECC with regard to both computation time and memory requirements. Nevertheless, the development of NanoECC is significant because it implements not only ECC but also pairing-based cryptography on two widely used sensor motes (MICA2 and Tmote Sky). For PBC, NanoECC provides the fastest pairing computations.
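As an illustration of fast reduction modulo the pentanomial f(x) = x^163 + x^7 + x^6 + x^3 + 1 mentioned above, the following C sketch folds the upper words of an unreduced product back into the lower words one word at a time. It is the standard word-wise method, written here for 32-bit words for brevity; it is neither the NanoECC nor the TinyECCK routine, both of which target an 8-bit processor. The names gf2_163_reduce and PROD_WORDS are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PROD_WORDS 11  /* a product of two 163-bit polynomials has at most 325 bits */

/*
 * Reduce c modulo f(x) = x^163 + x^7 + x^6 + x^3 + 1, in place.
 * Bit j of c[i] is the coefficient of x^(32*i + j).
 * On return, c[0..5] holds the result (bits 0..162) and c[6..10] are zero.
 */
void gf2_163_reduce(uint32_t c[PROD_WORDS])
{
    assert((c[PROD_WORDS - 1] >> 5) == 0);   /* degree must be <= 324 */

    /* x^(32*i + j) = x^(32*i + j - 163) * (x^7 + x^6 + x^3 + 1) mod f,
     * and 32*i - 163 = 32*(i - 5) - 3, which gives the shifts below. */
    for (int i = PROD_WORDS - 1; i >= 6; i--) {
        uint32_t t = c[i];
        c[i] = 0;
        c[i - 6] ^= t << 29;
        c[i - 5] ^= (t << 4) ^ (t << 3) ^ t ^ (t >> 3);
        c[i - 4] ^= (t >> 28) ^ (t >> 29);
    }

    /* Fold bits 163..191, which sit in the upper 29 bits of c[5]. */
    uint32_t t = c[5] >> 3;
    c[0] ^= (t << 7) ^ (t << 6) ^ (t << 3) ^ t;
    c[1] ^= (t >> 25) ^ (t >> 26);
    c[5] &= 0x7;
}
```

For example, an input holding only x^163 (c[5] = 8) reduces to x^7 + x^6 + x^3 + 1, so c[0] becomes 0xC9 and all other words are zero.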
Conclusion
In this paper, we have shown that the inefficiency of field multiplication and reduction over GF(2^m) is caused by a large number of redundant memory accesses, and we have proposed techniques to reduce these unnecessary accesses. With the proposed techniques, the running times of field multiplication and reduction over GF(2^163) are reduced by 21.1% and 24.7%, respectively. These savings decrease the running time of ECDSA operations by about 15% to 19%. The proposed multiplication algorithm is approximately 7.4% faster than hybrid field multiplication over GF(p).
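To show where such memory accesses arise, the following C sketch implements a plain right-to-left comb multiplication over GF(2)[x] with 8-bit words, as on the ATmega128L. It is a generic textbook baseline, not the proposed algorithm: every XOR in the innermost loop loads and stores one byte of the accumulator c, and the same bytes are touched again for later bit positions. The names gf2_mul_comb and NW are assumptions made for illustration, and the product is left unreduced.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NW 21  /* bytes per field element: ceil(163 / 8) */

/*
 * c = a * b over GF(2)[x], unreduced.
 * a and b hold polynomials of degree <= 162 (upper 5 bits of byte 20 clear);
 * c receives the 42-byte product.
 */
void gf2_mul_comb(uint8_t c[2 * NW], const uint8_t a[NW], const uint8_t b[NW])
{
    assert((a[NW - 1] >> 3) == 0 && (b[NW - 1] >> 3) == 0);

    uint8_t bs[NW + 1];             /* b * x^k for the current bit index k */
    memset(c, 0, 2 * NW);
    memcpy(bs, b, NW);
    bs[NW] = 0;

    for (int k = 0; k < 8; k++) {
        for (int j = 0; j < NW; j++) {
            if ((a[j] >> k) & 1) {
                /* add b * x^(8*j + k): one load and one store of c per byte */
                for (int i = 0; i <= NW; i++)
                    c[i + j] ^= bs[i];
            }
        }
        /* bs <- bs * x: shift left by one bit across all bytes */
        uint8_t carry = 0;
        for (int i = 0; i <= NW; i++) {
            uint8_t next = (uint8_t)(bs[i] >> 7);
            bs[i] = (uint8_t)((bs[i] << 1) | carry);
            carry = next;
        }
    }
}
```

As a small check, multiplying x + 1 by itself (a[0] = b[0] = 3) gives x^2 + 1, so c[0] is 5: the cross terms cancel because addition in GF(2) is XOR.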
We have implemented TinyECCK with the proposed techniques on a MICAz sensor mote and compared it with the existing implementations built in C or in a hybrid of C and inline assembly. The comparisons show that TinyECCK outperforms the existing implementations in running time, code size, and supported services.
From the experimental results and comparisons, we draw two conclusions. First, a software implementation of ECC over GF(2^m) is more suitable for sensor motes with a small word size than one over GF(p). Note that this runs contrary to the prevailing opinion. In particular, field multiplication over GF(2^m) can be faster than that over GF(p) when implemented carefully. Second, ECC, and TinyECCK in particular, is applicable to securing sensor networks.
