In this work, elliptic curve cryptography (ECC) is used to make a fast and very low-power software implementation of a public-key cryptography algorithm on the ARM Cortex-M0+. An optimization of the López-Dahab field multiplication method is proposed, which aims to reduce the number of memory accesses, as this is a slow operation on the target platform. A mixed C and assembly implementation was made; a random point multiplication requires 34.16 µJ, whereas our fixed point multiplication requires 20.63 µJ. Our implementation's energy consumption beats all other software implementations, on any platform, by a factor of at least 3.3.
INTRODUCTION
The ARM Cortex-M0+ [2] is a low-cost, ultra low-power microcontroller (MCU) with a 32-bit architecture and a small but powerful instruction set. The authors are convinced that this platform is appropriate for wireless sensor networks (WSNs) not only because of its specifications, but also because the first integrations of this MCU have already been announced [9]. As this processor has only been available since 2012, we do not know of any other public-key cryptography (PKC) implementations optimized for this architecture.
We now present the new state of the art in low-power software implementations of ECC on the ARM Cortex-M0+. In particular, we propose an optimization of the López-Dahab (LD) [16] field multiplication, which aims to reduce the number of memory accesses, as this is a slow operation on the target platform. The results are compared to the existing solutions in the ultra low-power domain, as well as to a standard library compiled for this MCU.
The rest of the paper is organized as follows. First, we discuss related work in the low-power domain in section 2. Next, in section 3, we discuss the methods that were used to perform the parameter and algorithmic selection for our implementation. We then describe our results in section 4 and compare them with the results from implementations found in the literature and in software libraries. Subsequently, we discuss some ideas for future work in section 5, and finally we provide a general conclusion in section 6.
RELATED WORK
Here we discuss related work on low-power software implementations of ECC. Because both algorithms and hardware have evolved over time, the overview follows a chronological order, with a focus on the López-Dahab (LD) [16] field multiplication method, as our implementation is based on it. The LD method and the window parameter w will be discussed in more detail later. A number of low-power implementations exist in the literature; however, in the past much of the focus has gone towards software implementations on existing WSN platforms such as the 8-bit MICA2 and MICAz (which both contain the ATmega128L) and the 16-bit TelosB (which contains the MSP430). Only a small number of implementations were found in the literature for ARM MCUs like the IMote2 (which contains the ARMv5TE-based PXA271) and the ARM7TDMI.
Gura et al. [11] showed that significant gains in performance can be achieved by optimizing the number of memory accesses, which are often the most expensive operations on an MCU.
Szczechowiak et al. [23] 
METHODS
In this section, we present the methods that were used to perform the parameter and algorithmic selection that is necessary to make an efficient and low-power ECC implementation. First, we will discuss the model that was used to make a curve selection. Next, we will discuss some of the algorithmic choices that were made.
1 Matching a curve to the architecture
In order to make an efficient and low-power implementation it is necessary to select the appropriate curve for the target architecture. (1) Binary Koblitz curves will lead to a slightly faster implementation. (2) Binary curves require less power than prime curves, because binary curve arithmetic consists largely of XOR and shift instructions, whereas prime curve arithmetic consists mostly of multiply and add instructions.
The energy profiling results of the target platform (section 4.1) show that both the shift and XOR instructions require less energy than either the multiply or the ADD instruction.
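To make the instruction mix of binary field arithmetic concrete, the following minimal sketch shows addition and a one-bit shift of binary polynomials. It assumes field elements stored as eight little-endian 32-bit words (matching n = 8 used later in the text); the names `gf2m_add` and `poly_shl1` are hypothetical and are not taken from the implementation described in this paper.

```python
MASK32 = 0xFFFFFFFF
N_WORDS = 8  # assumption: 8 x 32-bit words per field element


def gf2m_add(a, b):
    """Add two binary-field elements.

    Coefficients live in GF(2), so addition is a word-wise XOR
    with no carries between words.
    """
    assert len(a) == len(b) == N_WORDS
    return [x ^ y for x, y in zip(a, b)]


def poly_shl1(a):
    """Multiply a binary polynomial by x (shift left by one bit).

    The top bit of each word moves into the bottom bit of the next
    word, so the result needs one extra word.
    """
    out = []
    carry = 0
    for word in a:
        out.append(((word << 1) & MASK32) | carry)
        carry = word >> 31
    out.append(carry)
    return out
```

Both routines use only XOR, OR, and shift operations, which is the property the curve selection argument relies on.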
Field arithmetic algorithms
Here we will discuss some of the different field arithmetic algorithms that were used during analysis and implementation.
Multiplication
where n is the number of words needed for the field parameter. While generating the lookup table, y is left-shifted by a maximum amount of w - 1 bits, which may cause a large polynomial to overflow into another word. bits. This is repeated 8 times, but in the final iteration the shift is not required.
In order to reduce the number of memory operations the field multiplication algorithm can be interleaved with the reduction algorithm.
Algorithm 1 López-Dahab with fixed registers multiplication in F_{2^m} for n = 8.
Note: v denotes the internal state vector composed of n - 1 memory addresses and n + 1 registers.
for k ← 0 to n - 1 do
for l ← 0 to n - 1 do
Reduction
Since the curve we are using has a sparse reduction polynomial, the reduction can be efficiently computed one word at a time.
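As an illustration of word-at-a-time reduction, the sketch below reduces a 16-word product modulo the trinomial x^233 + x^74 + 1, which is the reduction polynomial of the sect233k1 curve. It assumes 32-bit little-endian words and follows the standard textbook approach for sparse polynomials; it is not the authors' assembly routine, and the name `reduce_233` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def reduce_233(c):
    """Reduce a 16-word binary polynomial modulo x^233 + x^74 + 1.

    Each high word c[i] (i = 15..8) is folded into lower words using
    x^(32i + j) = x^(32i + j - 159) + x^(32i + j - 233).
    """
    c = list(c)
    assert len(c) == 16
    for i in range(15, 7, -1):
        t = c[i]
        # 32*i - 233 = 32*(i - 8) + 23
        c[i - 8] ^= (t << 23) & MASK32
        c[i - 7] ^= t >> 9
        # 32*i - 159 = 32*(i - 5) + 1
        c[i - 5] ^= (t << 1) & MASK32
        c[i - 4] ^= t >> 31
    # Fold the bits of word 7 that sit at or above x^233 (bit 9 upward).
    t = c[7] >> 9
    c[0] ^= t
    c[2] ^= (t << 10) & MASK32
    c[3] ^= t >> 22
    c[7] &= 0x1FF
    return c[:8]
```

Because the polynomial has only three terms, each high word touches at most four lower words, so the whole reduction is a short sequence of shifts and XORs.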
Inversion

3 Comparison of multiplication algorithms
The field multiplication routine is the most dominant in terms of execution time in an ECC system. We will now compare our proposed optimization of the LD algorithm to the previous best method (the LD with rotating registers), as well as the original LD algorithm.
For all three methods a window size of w = 4 is used, where a single precomputation table of 16n words (4 kB) is required. This is valid under the assumption that the scalar y is short. For the analysis of both the LD with rotating registers and the LD with fixed registers methods, we assume that n + 1 registers are available for storing the partial products. The total number of operations and the cycle estimates are shown in Table 1 and Table 2 respectively. The cycle estimate assumes that a memory operation takes 2 cycles and that all other operations take only 1 cycle to complete. The main loop is executed 8 times for w = 4. The input parameter y is split up into sections of w bits, which are used as an index for the LUT.
When comparing the LD method to the LD with rotating registers method, we see a drastic reduction in the number of memory operations due to the rotating register scheme, which minimizes the storing of intermediate values in memory. When comparing the LD with fixed registers method to the LD with rotating registers, we see a further reduction of memory accesses due to the more efficient usage of registers. The LD with fixed registers method has a performance increase of 15% over the LD with rotating registers method, and a performance increase of 40% over the standard LD method.
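For orientation, the windowed multiplication being compared can be sketched as follows. This is the standard textbook LD method with w = 4 on 32-bit words, with every intermediate value kept in a plain list; it does not model the rotating or fixed register schemes, and the names `build_table` and `ld_multiply`, as well as the operand names `a` and `b`, are hypothetical.

```python
MASK32 = 0xFFFFFFFF
W = 32       # word size in bits
WINDOW = 4   # window size w
N = 8        # words per operand (n = 8)


def build_table(b):
    """Return table[u] = u(x) * b(x) for all 16 polynomials u of degree < 4.

    Entries have N + 1 words because b is shifted left by up to
    WINDOW - 1 bits and may overflow into one extra word.
    """
    table = [[0] * (N + 1) for _ in range(1 << WINDOW)]
    for u in range(1, 1 << WINDOW):
        for bit in range(WINDOW):
            if (u >> bit) & 1:
                for j in range(N + 1):
                    lo = (b[j] << bit) & MASK32 if j < N else 0
                    hi = b[j - 1] >> (W - bit) if j > 0 else 0
                    table[u][j] ^= lo | hi
    return table


def ld_multiply(a, b):
    """Multiply two N-word binary polynomials, returning 2N unreduced words."""
    table = build_table(b)
    c = [0] * (2 * N)
    # Main loop: W / WINDOW = 8 iterations, most significant window first.
    for k in range(W // WINDOW - 1, -1, -1):
        for j in range(N):
            u = (a[j] >> (WINDOW * k)) & 0xF  # w-bit section used as LUT index
            entry = table[u]
            for l in range(N + 1):
                c[j + l] ^= entry[l]
        if k != 0:
            # Shift the accumulator left by WINDOW bits (skipped in the last pass).
            for i in range(2 * N - 1, 0, -1):
                c[i] = ((c[i] << WINDOW) & MASK32) | (c[i - 1] >> (W - WINDOW))
            c[0] = (c[0] << WINDOW) & MASK32
    return c
```

In this plain form every XOR into `c` is a memory access, which is the cost that the register-based variants aim to reduce.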
RESULTS
This section presents our key results. First, we discuss the measurement setup and the results of our measurements. Next, we present two implementations and compare them to the state-of-the-art low-power implementations found in the literature, as well as with software libraries.
1 Measurement setup and results
In order to determine the energy usage of different instructions as well as cryptographic software implementations, a system was designed to measure the power consumption of the target platform.
The power consumption of a number of different instructions was measured in order to investigate the effect of different field arithmetic algorithms on the overall power consumption. Table 3 shows the results of the energy measurements for instructions which are relevant to prime and binary field arithmetic. A variation in energy consumption of up to 22.5% was observed between different instructions. The ADD instruction was found to be the most energy hungry, requiring 6.9% more energy than any other measured instruction. This is important for the choice of the underlying field because binary field arithmetic requires a large number of shift (LSL and LSR) and XOR instructions, whereas prime field arithmetic requires a large number of MUL and ADD instructions.
[Table 3: energy per instruction (columns: Instruction, Energy)]
Table 4 and Table 5 show our proposed implementation compared with low-power implementations found in the literature, as well as with software libraries. In the cases where the energy consumption was not provided in the author's [...]
It is a C library with some field arithmetic in assembly for many of its supported platforms. Some timings for this library can be found in [3] and are also listed in Table 4. The RELIC toolkit [1] is an open-source cryptographic library that supports many different architectures.
Comparison with other libraries
Micro ECC [17] is a small, C-based, open-source library of ECDH and ECDSA for 32-bit microcontrollers.
In the following text we will present two implementations. First, we will present an implementation that relies exclusively on the RELIC toolkit to make all its computations. Next, we present an implementation that was largely developed in C and assembly, but also makes use of the RELIC toolkit to perform some calculations. The curve and algorithmic parameters for both implementations were chosen to match each other as closely as possible.
Table footnotes: P = a custom prime curve is used; T = random point multiplication.
RELIC implementation
The RELIC toolkit was used to make an implementation [...] Table 4. Even though the fixed point multiplication uses more power than the random point multiplication, it still uses less energy because it is faster.
Proposed implementation
An implementation was made using C and assembly that uses the binary Koblitz sect233k1 curve. The left-to-right wTNAF method was used for point multiplication; the parameter w was set to 4 for random point multiplication (kP), and to 6 for fixed point multiplication (kG). Point additions are done in mixed LD-affine coordinates. The RELIC toolkit was used to perform the TNAF precomputation and the TNAF transformation of the scalar k. The LD with fixed registers method was used for field multiplication, reduction was done one word at a time, inversion was done with the Extended Euclidean algorithm, and squaring was done with the table-based method. The field arithmetic routines were written in C and assembly.
The average execution time and energy usage of this implementation are compared to others in Table 4 and Table 5.
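The table-based squaring mentioned above can be sketched as follows. Squaring a binary polynomial inserts a zero between every pair of adjacent coefficient bits, so a small lookup table can expand the input a few bits at a time. The sketch assumes a 256-entry table indexed by one byte and 32-bit little-endian words; the table size used in the actual implementation is not stated here, and the names `SQR_TABLE` and `poly_square` are hypothetical.

```python
# SQR_TABLE[byte] spreads the 8 bits of `byte` over 16 bits,
# placing bit i of the input at bit 2*i of the output.
SQR_TABLE = [
    sum(((byte >> i) & 1) << (2 * i) for i in range(8))
    for byte in range(256)
]


def poly_square(a):
    """Square a binary polynomial given as 32-bit words.

    Returns twice as many words; the result still has to be reduced
    modulo the field polynomial.
    """
    out = []
    for word in a:
        lo = SQR_TABLE[word & 0xFF] | (SQR_TABLE[(word >> 8) & 0xFF] << 16)
        hi = SQR_TABLE[(word >> 16) & 0xFF] | (SQR_TABLE[(word >> 24) & 0xFF] << 16)
        out.extend([lo, hi])
    return out
```

Squaring therefore costs only table lookups, shifts, and ORs per word, which is why it is much cheaper than a general field multiplication.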
Our proposed implementation was measured to have an average power consumption of 577.2 µW for a random point multiplication, and 519.6 µW for a fixed point multiplication. On average our random point multiplication implementation requires 2814827 cycles and 36.6 µJ, whereas our fixed point multiplication requires 1864470 cycles and 24.6 µJ.
When compared to RELIC, our random point implementation is 1.99 times faster, and our fixed point implementation is 2.98 times faster. The field arithmetic cycle times are shown in Table 6, and the accumulated execution times for different operations are shown in Table 7 for both a random point multiplication (kP) and a fixed point multiplication (kG).
