With the deployment of the Internet of Things, embedded systems built on microcontrollers are now exposed to threats from the network, and security measures must be incorporated into these systems. Unfortunately, microcontrollers are not powerful enough to run standard security software, so they need lightweight, fast and secure cryptographic libraries. In this paper, we port the NaCl cryptographic library to the ARM Cortex-M0 (M0+) microcontroller, with particular effort on a fast and secure implementation. Our evaluation shows that the implementation is about 3 times faster than AVR NaCl and about half its code size.
Introduction
The Internet of Things (IoT) is one of the most rapidly emerging technologies, driven by the spread of smartphones and low-price wireless modules. Some types of IoT systems are composed of embedded systems built on microcontrollers. Such systems are connected through the Internet, so security measures against threats from the network are unavoidable. Unfortunately, microcontrollers are not powerful enough to run standard security software, so they need lightweight, fast and secure cryptographic libraries.
A microcontroller is a one-chip controller that contains a CPU, RAM, program memory and peripherals. It has limited computing power, but it is far cheaper than a personal computer, and it has been embedded in products that perform simple control. Its RAM and program memory are often very small; for example, typical Cortex-M0 (M0+) memory sizes range from about 8 KB to 128 KB. A program for a microcontroller is therefore required to have a small code size and to use little RAM. If memory is left over, the maker can fit more applications onto a single microcontroller and so reduce cost. The cryptographic library is therefore expected to be small. In addition, the program should run fast, both to save energy and to shorten the response time.
A typical microcontroller has an 8-, 16- or 32-bit architecture, and one of these is selected depending on the purpose. The bit length of the architecture determines the length of data that can be processed at a time. For example, a 32-bit microcontroller can process a 32-bit value at a time, so in principle it is 4 times faster than an 8-bit microcontroller. It is therefore preferable to use fast 32-bit microcontrollers to handle networking and security.
The ARM Cortex-M0 (and M0+) is a 32-bit microcontroller characterized by low cost and low power consumption, and it is expected to be used for the communication of sensor nodes. In addition, the advent of the mbed [1] platform makes rapid prototyping with ARM microcontrollers easier. ARM microcontrollers will presumably continue to spread. We should therefore support the mbed platform when writing programs, so that programs for ARM microcontrollers are widely adopted.
The Networking and Cryptography library NaCl (pronounced "salt") [2] is a cryptographic library for securing Internet communication, developed by Daniel J. Bernstein. The library is designed for ease of use, high security and high speed. NaCl has been redesigned as µNaCl for microcontrollers and developed into AVR NaCl [3] for the AVR microcontroller. Unfortunately, AVR microcontrollers are not powerful enough to handle both encryption and communication for sensors.
Later, M0NaCl [4], a µNaCl implementation for the ARM Cortex-M0 microcontroller, was developed. However, only a high-speed Curve25519 was implemented, so the full functionality of NaCl is not available.
In this paper, we port the full functionality of the NaCl cryptographic library to the ARM Cortex-M0 (M0+) microcontroller, with particular effort on a fast and secure implementation, and we evaluate the result. This paper is organized as follows. In Sect. 2, we explain NaCl and the ARM Cortex-M0 (M0+) microcontroller. Section 3 describes the implementation of NaCl. Benchmark results and a comparison with previous work are given in Sect. 4, and concluding remarks and discussion in Sect. 5.
Background

Networking and Cryptography Library(NaCl) and µNaCl
The Networking and Cryptography library NaCl (pronounced "salt") [2] is a cryptographic library for securing Internet communication, developed by Daniel J. Bernstein. The library is designed for ease of use, high security and high speed.
Copyright © 2016 The Institute of Electronics, Information and Communication Engineers
NaCl provides the primitives shown in Table 1. In addition, NaCl provides APIs that allow one to perform encryption and decryption using the NaCl primitives.
NaCl and µNaCl are designed to protect against the following known vulnerabilities [2, Section 3] [3, Section 3.1].
(1) Secret load addresses. General-purpose CPUs, unlike many embedded ones, have cache memory, and data in the cache can be accessed faster than data in main memory. Attackers exploit this time difference for timing attacks. We should therefore avoid secret-key-dependent load addresses, which are called secret load addresses in [3]. Some microcontrollers, such as AVR and ARM Cortex-M0, have no CPU cache and are free from this attack.
(2) Secret branch conditions. Attackers can mount a timing attack by measuring the difference in execution time between branches. Such attacks are possible because of secret-key-dependent branch conditions [9] and because of the success or failure of branch prediction [10]. We should therefore avoid secret-key-dependent branch conditions, which are called secret branch conditions in [3].
To this end, NaCl incorporates the following measures against these attacks.
• Remove conditional branches that depend on secret information.
• Make loop counts deterministic.
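The two measures above can be illustrated with a minimal C sketch. It is not taken from the library described here: the function names `ct_cswap` and `ct_compare` are hypothetical, and the comparison follows the masking idiom commonly used for constant-time verification. Both functions run a loop whose count depends only on the public length `n`, and neither branches on secret data.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the n-word arrays a and b when bit == 1 and leave them unchanged
 * when bit == 0, without branching on the (possibly secret) bit. */
static void ct_cswap(uint32_t *a, uint32_t *b, size_t n, uint32_t bit)
{
    uint32_t mask = (uint32_t)0 - (bit & 1u); /* 0x00000000 or 0xFFFFFFFF */
    size_t i;
    for (i = 0; i < n; i++) {                 /* loop count is public */
        uint32_t t = mask & (a[i] ^ b[i]);
        a[i] ^= t;
        b[i] ^= t;
    }
}

/* Return 0 if x and y are equal over n bytes and -1 otherwise.
 * Every byte is always inspected, so the running time does not
 * reveal the position of the first difference. */
static int ct_compare(const uint8_t *x, const uint8_t *y, size_t n)
{
    uint32_t diff = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        diff |= (uint32_t)(x[i] ^ y[i]);
    }
    /* diff is in 0..255; (diff - 1) >> 8 has its low bit set only if diff == 0 */
    return (int)(1u & ((diff - 1u) >> 8)) - 1;
}
```

A conditional swap of this kind is what a scalar-multiplication ladder typically uses in place of `if (key_bit) swap(...)`.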
AVR NaCl also adopts these measures. AVR and ARM Cortex-M0 microcontrollers have no branch predictor and are therefore secure against branch-prediction attacks.
In 2008, it was discovered that OpenSSL generated predictable random numbers (CVE-2008-0166) [11]. The cause was an incorrect patch applied to the OpenSSL random number generation code. To avoid this kind of problem, NaCl centralizes random number generation in the OS random number generator [2, Section 3, Centralizing randomness]. In addition, NaCl avoids unnecessary use of randomness in order to reduce the problems arising from random numbers [2, Section 3, Avoiding unnecessary randomness].
ARM Cortex-M0 and M0+ Microcontroller
The ARM Cortex-M series consists of 32-bit RISC microcontrollers developed by ARM. The Cortex-M0 and M0+ are characterized by low cost and low power consumption, and they are expected to be used for the communication of sensor nodes. Hereafter, we simply write Cortex-M0 for both of them as long as this does not cause confusion.
The Cortex-M0 architecture has thirteen 32-bit general-purpose registers (R0-R12) and three special registers (R13-R15). R0-R7 are called "low registers" and R8-R12 are called "high registers". Among the special registers, R13 is the stack pointer (SP), R14 is the link register (LR) and R15 is the program counter (PC).
The Cortex-M0 uses the Thumb instruction set, whose instructions have a fixed length of 16 bits (a half-word). The Thumb instruction set has some limitations; for example, many instructions cannot use the high registers. Nevertheless, highly efficient and small code can be written with it.
In addition, the Cortex-M0 has a 3-stage pipeline (the M0+ has a 2-stage pipeline). If the pipeline is used effectively, execution can be sped up.
The Cortex-M0 hardware multiplier, although optional, computes a 32 × 32-bit multiplication and outputs the lower 32 bits of the result. The cycle count of the multiplication instruction (MUL) depends on the implementation of the multiplier unit. One of two types of multiplier unit, "fast" or "small", is chosen when the processor is manufactured. The "fast" implementation performs a multiplication in 1 cycle. The "small" implementation requires 32 cycles per multiplication, but its hardware size is smaller.
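Because MUL returns only the lower 32 bits, a full 64-bit product has to be assembled from smaller products. The following C sketch shows one way to do this under stated assumptions: each operand is split into 16-bit halves so that every partial product fits in 32 bits, and the function name `mul32x32_64` is hypothetical. It illustrates the arithmetic only and is not the assembly code of the library.

```c
#include <stdint.h>

/* Compute the 64-bit product of a and b as (hi:lo) using only
 * multiplications whose exact result fits in 32 bits. */
static void mul32x32_64(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFFu, b1 = b >> 16;

    /* Four 16 x 16 -> 32-bit partial products. */
    uint32_t p00 = a0 * b0;
    uint32_t p01 = a0 * b1;
    uint32_t p10 = a1 * b0;
    uint32_t p11 = a1 * b1;

    /* Bits 16..31 of the result plus the carry into bit 32;
     * the sum of three 16-bit values cannot overflow 32 bits. */
    uint32_t mid = (p00 >> 16) + (p01 & 0xFFFFu) + (p10 & 0xFFFFu);

    *lo = (p00 & 0xFFFFu) | (mid << 16);
    *hi = p11 + (p01 >> 16) + (p10 >> 16) + (mid >> 16);
}
```

For example, multiplying 0xFFFFFFFF by itself yields `hi = 0xFFFFFFFE` and `lo = 0x00000001`.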
Implementation
The core part of AVR NaCl is written in about 6,000 lines of AVR assembly language for optimization and improved safety. We rewrite this AVR assembly code in ARM Thumb assembly so that µNaCl works on the ARM Cortex-M0. In addition, we make small security fixes and optimizations.
The main points of our implementation are as follows.
• Adoption of a pseudo-random number generation method whose entropy source is the initial value of SRAM.
• Import of M0NaCl's 256-bit multiplication code, balancing speed against code size.
Another point of the implementation is the treatment of branch conditions. We remove branch instructions as far as possible to improve speed. Their removal also helps prevent unknown security breaches caused by branch instructions that are indirectly related to the secret key.
Pseudo-Random Number Generation Method
The two random number generation methods proposed in the AVR NaCl paper [3, Section 3.1, Randomness generation] have the following problems. The first method uses an external random number generator, which increases cost because of the additional external device. The second method uses the jitter of an RC oscillator [12], against which a frequency injection attack [13] has been discovered.
We adopt a pseudo-random number generation method that uses Salsa20 as the encryption scheme and the initial value of SRAM [14] as the random seed. The generation process is based on that of "arc4random" in OpenBSD, with the encryption scheme ChaCha20 replaced by Salsa20 and the random seed replaced by the initial value of SRAM. The advantages of this method are that it reuses the existing Salsa20 code, whose size has already been reduced, and that the random seed is cheap to generate and has not yet been attacked.
The procedure of the pseudo-random number generation method is as follows.
1. Get a 12,800-bit initial value of SRAM.
2. Input the initial value to SHA2-512 to create a random seed.
3. Set the random seed as an internal state of Salsa20.
4. Compute Salsa20 to output a 256-bit pseudo-random number.
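Steps 3 and 4 of the procedure above can be sketched in C as follows. This is a sketch under several assumptions, not the implementation described here. It assumes that the 64-byte SHA2-512 digest of the SRAM contents is already available (steps 1 and 2 are omitted), that the first 32 bytes of the digest serve as the Salsa20 key and the next 8 bytes as the nonce, and that the first 256 bits of each Salsa20 block are output. The names `prng_state`, `prng_seed` and `prng_next256` are hypothetical, and the rekeying performed by arc4random is left out.

```c
#include <stdint.h>
#include <string.h>

static uint32_t rotl32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32u - n));
}

/* Salsa20 quarter-round on four state words. */
static void quarterround(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *b ^= rotl32(*a + *d, 7);
    *c ^= rotl32(*b + *a, 9);
    *d ^= rotl32(*c + *b, 13);
    *a ^= rotl32(*d + *c, 18);
}

/* Salsa20 core: 10 double-rounds followed by a feed-forward addition. */
static void salsa20_core(uint32_t out[16], const uint32_t in[16])
{
    uint32_t x[16];
    int i;
    memcpy(x, in, sizeof x);
    for (i = 0; i < 10; i++) {
        /* column round */
        quarterround(&x[0],  &x[4],  &x[8],  &x[12]);
        quarterround(&x[5],  &x[9],  &x[13], &x[1]);
        quarterround(&x[10], &x[14], &x[2],  &x[6]);
        quarterround(&x[15], &x[3],  &x[7],  &x[11]);
        /* row round */
        quarterround(&x[0],  &x[1],  &x[2],  &x[3]);
        quarterround(&x[5],  &x[6],  &x[7],  &x[4]);
        quarterround(&x[10], &x[11], &x[8],  &x[9]);
        quarterround(&x[15], &x[12], &x[13], &x[14]);
    }
    for (i = 0; i < 16; i++) {
        out[i] = x[i] + in[i];
    }
}

static uint32_t load32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical generator state: one 16-word Salsa20 input block. */
typedef struct { uint32_t block[16]; } prng_state;

/* Step 3. Assumption: seed[0..31] is the key, seed[32..39] the nonce. */
static void prng_seed(prng_state *s, const uint8_t seed[64])
{
    int i;
    s->block[0]  = 0x61707865u;  /* "expand 32-byte k" constants */
    s->block[5]  = 0x3320646eu;
    s->block[10] = 0x79622d32u;
    s->block[15] = 0x6b206574u;
    for (i = 0; i < 4; i++) {
        s->block[1 + i]  = load32_le(seed + 4 * i);
        s->block[11 + i] = load32_le(seed + 16 + 4 * i);
    }
    s->block[6] = load32_le(seed + 32);
    s->block[7] = load32_le(seed + 36);
    s->block[8] = 0;             /* 64-bit block counter */
    s->block[9] = 0;
}

/* Step 4. Output 256 bits and advance the block counter. */
static void prng_next256(prng_state *s, uint32_t out[8])
{
    uint32_t ks[16];
    salsa20_core(ks, s->block);
    memcpy(out, ks, 8 * sizeof(uint32_t));
    s->block[8] += 1u;
    s->block[9] += (uint32_t)(s->block[8] == 0u);
}
```

A caller would hash the uninitialized SRAM region once at boot, pass the digest to `prng_seed`, and then call `prng_next256` each time a 32-byte key is needed.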
Security of Random Number
In µNaCl, random numbers are required only for 32-byte key generation. We discuss the entropy required for these random numbers. As recommended in NIST SP800-90 [15], the entropy of the random number seed should be more than 1.5 times the size of the shared key. Therefore, the entropy of the random number seed should be more than 384 bits. The initial value of SRAM contains about 3% entropy per bit, i.e., about 0.03 bits of entropy per SRAM bit [16]. Therefore, 12,800 bits (1,600 bytes) should be extracted from SRAM as the initial value. A 20,000-bit random sequence generated with this method on an MKL25Z128VLK4 (Cortex-M0+) passed the tests of FIPS 140-2.
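The sizing argument above can be written out as a short calculation, using the 256-bit (32-byte) key, the factor 1.5, and the entropy rate of 0.03 bits per SRAM bit given in the text. The symbols $H_{\text{seed}}$ and $n_{\text{SRAM}}$ are introduced here only for illustration.

```latex
\begin{align*}
H_{\text{seed}} &\ge 1.5 \times 256\ \text{bits} = 384\ \text{bits},\\
n_{\text{SRAM}} &\ge \frac{384\ \text{bits}}{0.03} = 12{,}800\ \text{bits} = 1{,}600\ \text{bytes}.
\end{align*}
```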
Importing M0NaCl's 256-bit Multiplication Code with the Balance of Speed and Code Size
The Curve25519 code of M0NaCl is very fast, but its code size is large. According to [4], the Curve25519 code occupies 7,900 bytes, whereas the program flash of many Cortex-M0 microcontrollers is less than 32 KB. If we used the Curve25519 code of M0NaCl as it is, about 25% of the program flash would be occupied, making it difficult to fit the other cryptographic primitives and the user application code. Therefore, we import the Curve25519 code of M0NaCl with the following changes to save code size.
• Using 256-bit multiplication instead of 256-bit squaring.
• Reimplementation of 256-bit multiplication using 128-bit multiplication.
Reimplementation of 256-bit Multiplication Using 128-bit Multiplication
The 256-bit multiplication code of M0NaCl treats the Cortex-M0 multiplication instruction as a 16-bit multiplication instruction and uses a three-level subtractive Karatsuba method [4, Section 5.2, Multiplication]. It consists of 128-bit and 64-bit subtractive Karatsuba multiplications and a 32-bit schoolbook multiplication. The code is large because the 256-bit multiplication of M0NaCl does not reuse any long-multiplication code; each instance is written out separately. Therefore, we reuse the 128-bit multiplication inside the 256-bit multiplication code in order to save code size.
We call the original 256-bit multiplication code the "fast" code and the improved one, whose size is reduced by reusing the 128-bit multiplication, the "small" code. The user can select "small" or "fast" according to the purpose. Benchmark results for both are given in Sect. 4.1.
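To make the subtractive Karatsuba step concrete, the following C sketch performs a single level of it on 64-bit operands, using three 32 × 32-bit multiplications instead of four. It is a sketch only: the code discussed here is Thumb assembly built on 16-bit products, whereas this example uses the 64-bit integer type of C for brevity, and the function name `karatsuba64` is hypothetical. The identity used is a0·b1 + a1·b0 = a0·b0 + a1·b1 − (a0 − a1)(b0 − b1).

```c
#include <stdint.h>

/* r (4 limbs) = a (2 limbs) * b (2 limbs), 32-bit little-endian limbs. */
static void karatsuba64(uint32_t r[4], const uint32_t a[2], const uint32_t b[2])
{
    const uint64_t LOW = 0xFFFFFFFFu;
    uint64_t L = (uint64_t)a[0] * b[0];            /* low product  */
    uint64_t H = (uint64_t)a[1] * b[1];            /* high product */

    /* |a0 - a1| and |b0 - b1| with all-ones masks for negative differences. */
    uint64_t da = (uint64_t)a[0] - a[1];
    uint64_t db = (uint64_t)b[0] - b[1];
    uint32_t ma = (uint32_t)0 - (uint32_t)(da >> 63);
    uint32_t mb = (uint32_t)0 - (uint32_t)(db >> 63);
    uint32_t xa = ((uint32_t)da ^ ma) - ma;
    uint32_t xb = ((uint32_t)db ^ mb) - mb;
    uint64_t M = (uint64_t)xa * xb;                /* third product */

    /* neg is all ones when (a0 - a1)(b0 - b1) is negative. */
    uint64_t neg = (uint64_t)0 - (uint64_t)((ma ^ mb) & 1u);

    /* Middle term L + H -/+ M as a 65-bit value (carry:mid). */
    uint64_t sum = L + H;
    uint64_t carry = (uint64_t)(sum < L);
    uint64_t add = sum + M;
    uint64_t sub = sum - M;
    uint64_t mid = (add & neg) | (sub & ~neg);
    carry += (uint64_t)(add < sum) & neg & 1u;
    carry -= (uint64_t)(sum < M) & ~neg & 1u;

    /* r = L + mid * 2^32 + H * 2^64, accumulated limb by limb. */
    uint64_t acc;
    r[0] = (uint32_t)L;
    acc = (L >> 32) + (mid & LOW);
    r[1] = (uint32_t)acc;
    acc = (acc >> 32) + (mid >> 32) + (H & LOW);
    r[2] = (uint32_t)acc;
    acc = (acc >> 32) + carry + (H >> 32);
    r[3] = (uint32_t)acc;
}
```

Applying the same step with a shared half-size routine at each level is what allows a larger multiplication to reuse smaller multiplication code rather than duplicate it.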
Benchmark Results
In this section, we show the benchmark results of our implementation described in Sect. 3.
256-bit Multiplication Benchmark Result
The "fast" and "small" 256-bit multiplication codes differ as follows.
Table 2 shows that the "small" code roughly halves the code size. Table 2 and Table 3 show that the "small" multiplication code is about 20% slower than the "fast" code.
AVR NaCl versus Cortex-M0/M0+
AVR NaCl comes in a fast version and a small version. Benchmark results for the fast version of AVR NaCl and for our work are shown in Table 5. Note that the benchmark results were obtained in the environments shown in Table 4.
Table 5 shows that our implementation is about 3 times faster than the AVR result and about half its code size. Our code is smaller even than the "small" version of AVR NaCl, whose size is 18,328 bytes. Theoretically, the Cortex-M0 can run 4 times faster than the AVR microcontroller with instructions that are only twice as long. The results above therefore mean that our implementation achieves about 70% of this theoretical efficiency. In Table 5, the programs related to Curve25519 score highly because they use the code of M0NaCl. Table 5 also shows that the stack size used by this library is less than 2,000 bytes. We expect the footprint of the library to be small enough for practical use.
Concluding Remarks
In this paper, we implemented all primitives of µNaCl on the Cortex-M0 microcontroller, whereas M0NaCl [4] had so far implemented only Curve25519, and we showed that the implementation is about 3 times faster than AVR NaCl. We have not yet succeeded in optimizing all of the code, so an even faster µNaCl can be expected after further optimization.
The entropy contained in SRAM was estimated from results for an STM32 microcontroller [16]. Since there are many ARM microcontrollers, e.g., from NXP, Freescale and Atmel, the amount of entropy contained in the SRAM of those microcontrollers should also be examined.
If this library can be ported to the mbed environment, the security of many products using mbed can be improved and many people can benefit from the library. However, the program flash of many Cortex-M0 microcontrollers supported by mbed is less than 32 KB, which is not large enough to install our implementation together with basic libraries such as libgcc and libstdc++. The balance between speed and size is therefore quite important in the implementation. A library with an appropriate balance can be widely used in mbed and other Cortex-M0 microcontrollers.
Concerning preemptive multitasking, we expect it to be possible as long as the scheduler saves and restores the registers correctly. NaCl basically uses only the stack. A small amount of global memory may be used, but only to store read-only values, e.g., the constants of SHA2-512. Therefore, even if another process executes a NaCl encryption operation while the main process is interrupted, that operation does not affect the main process. Since we have not yet fully checked the behavior under preemptive multitasking in our environment, this remains future work.
