Cryptographic primitives are the key component in the security protocols to support the authentication, key management and secure communication establishment. For that reason, this work presents the optimization of the Elliptic Curve Cryptography through the usage of Shifting Primes for constrained devices. Specifically, this presents the optimization for the chipsets JN51XX from NXP/Jennic, which are based on OpenRISC architecture and offer a class-2 constrained device. In details, Shifting Primes features have allowed to optimize the multiplication and squaring through a double accumulator and shifting reduction. This work is ancillary to the previous works about optimization of Shifting Primes for class-1 constrained devices. The optimization of the Elliptic Curve Cryptography for the class-2 constrained devices brings several opportunities for realistic scenarios, where the security interoperability between a gateway (class-2 device) and end-nodes (class 1 devices) is a major requirement.
Introduction
The Internet of Things (IoT) [Atzori et al., 2010 , Zhang et al., 2012 is one of the main drivers for the evolution of the Internet towards the Future Internet.
Nowadays, sensors, actuators and devices (so-called things), are connected to the Internet through gateways and platforms such as Supervisory Control and Data Acquisition platforms (SCADAs), panels, and brokers. These gateways and platforms break the end-to-end connection with the Internet. For that reason, this initial approach is defined as an Intranet of Things [Zorzi et al., 2010] .
The Intranet of Things is being extended to smart things, such as lights, watches, clinical devices, proximity sensors, etc. [Kortuem et al., 2010] with a higher scalability, pervasiveness, and integration into the Internet Core. This extension is leading to reach a real IoT, where things are first class citizens in the Internet, and they do not need to relay any more on a gateway, middleware, proxy, or broker.
IoT requires both an architecture and products that allow for the extension of Internet technologies, in order to reach a Future Internet of Things, Services and People.
IoT drives towards integrating everything into the Internet Core. This integration is motivated by the market wish to have all processes remotely accessible through an uniform medium while at the same time understanding that re-engineering an infrastructure to allow this for each application independently would be prohibitively costly and time-consuming. Moreover, the current evolution from uniform mass markets, to personalized ones [Chen and Prokopi, 2013] , where the customization and user-specified adaptation is a requirement, makes the sort of uniform infrastructure found in the Internet, imperative. This allows many components to be re-used, and services to be shared, with correspondingly huge economies of scale and shortened implementation times.
IoT fills the gap between the needs arising from the evolution of the market, information, users, and things, by moving all of them to a common framework, the Internet. This is different from the current approach in such applications, where they are based on stand-alone and monolithic solutions designed for a narrow or -stovepiped-application domain. Users now require more flexibility and freedom. Offering a common framework allows choice among the available manufacturers, suppliers, service providers, delivery options, and payment services. While this obviates the need for standalone or proprietary solutions, it also requires a high level of integration.
IoT allows communication among very heterogeneous devices connected via a very wide range of networks through the Internet infrastructure. IoT devices and resources are any kind of device connected to Internet, from existing devices, such as servers, laptops, and personal computers, to emerging devices such as smart phones [Costa et al., 2010] , smart meters, sensors, identification readers, and appliances.
In addition to the physical devices, IoT is also enriched with the cybernetic resources and Web-based technologies [Pedrinaci and Domingue, 2010] . For that purpose, IoT is enabled with interfaces based on Web Services such as RESTFul architecture and the novel protocol for Constrained devices Applications Protocol (CoAP) . These interfaces enable the seamless integration of the IoT resources with information systems, management systems, and the humans. Reaching thereby a universal and ubiquitous integration among human networks (i.e., society), appliance networks, sensor networks, machine networks, and, in definitive, everything networks.
IoT offers several advantages and new capabilities for a wide range of application areas. For example, nowadays IoT is finding applications for the development of Ambient Intelligent [Kofod-Petersen and Cassens, 2010] , Smart Cities, starting with Smart Grid [Ling et al., 2012] , Smart Lighting and transport with new services such as Smart Parking and the Bicycle Sharing System from Barcelona (Spain) [FroehlichJon et al., 2008] for building sustainable and efficiently smart ecosystems.
The application of the IoT is not limited to high scale deployments such as the locations in Smart Cities, elsewhere it can also be considered for epayment [Zoran et al., 2007] , consumer electronics, vehicular communications, industrial control, building automation, logistic, retail, marketing, and healthcare [Glascock and Kutzik, 2006] . Other applications can include voice and video [Vaidya et al., 2011] , [Zhou et al., 2010] , [Denko et al., 2009] .
But, the exposition of constrained resources to Internet, through a direct connectivity presents a challenge in terms of security.
The majority of the IoT applications need to take into considerations the support of mechanisms to carry out the authentication, authorization, access control, and key management. In addition, due to the reduced capabilities from the constrained devices enabled with Internet connectivity, a higher protection of the edge networks needs to be considered with respect to the global network and the implementation problems should be considered as a major issue, see [Keoh et al., 2013] .
This work presents an optimization of the cryptographic primitives based on Elliptic Curve Cryptography (ECC) for the JN51XX chipset family.
Security for the Internet of Things
Security is a wide concept which covers everything from authenticity (ensuring that the end-user is who is claimed to be), authority (ensuring that the enduser is allowed to perform the requested action), integrity (the data received is exactly the same data transmitted), and confidentiality (communication is not understandable for intermediary users, even when an intruder is in the network). These concepts are satisfied through a set of protocols, algorithms and cryptographic primitives [Chen et al., 2008] .
The IoT security has been one of the most discussed and yet pending issues, even after of the existence of protocols for IPv6 network security such as IPSec, and for datagrams (i.e., UDP or CoAP) such as DTLS. Security for the IoT is not excessively extended and deployed because of the difficulties in configuring (IPSec) for end users and the lack of scalable certificate management for DTLS. Consequently, the majority of the Internet traffic continues being transmitted in plain text, i.e., unprotected.
For that reason, one of the initial actions in order to carry out an effective deployment of autonomous and unassisted IoT deployments that satisfies the scalability and self-management requirements from the IoT is the development of protocols for authentication and key management.
Specifically, on the one hand, the protocol for the authentication and key management at the network layer such as the Protocol for Carrying Authentication for Network Access (PANA) [Marin-Lopez et al., 2012] is being considered by the research institutions and also industrial alliances, such as the ZigBee Alliance for their ZigBee IP stack [Sturek, 2009] .
On the other hand, the IPSec set of protocols (i.e., Internet Key Exchange (IKE) and Encapsulation Security Protocol (ESP)), and another protocols at the medium access layer such as 802.1x, are also being considered. All of these share the usage of the Extensible Authentication Protocol (EAP) to transport the security credentials.
Therefore, the challenge is not limited to the protocol, else the EAP scheme needs to be optimized in terms of a proper support of the required cryptographic primitives by the constrained device, i.e., symmetric cryptography algorithm to protect the packet, hash function to ensure the integrity and authenticate of the packet [Zia and Zomaya, 2011, Lee et al., 2012] , and finally asymmetric cryptographic algorithm to carry out the key exchange and initial authentication .
Some initial works for the IoT have been proposed for IPSec such as IPSec for Contiki OS [Raza et al., 2012a] , where several pending problems have been found, since for example a low version of the symmetric cryptography with 32-bit keys is used, such as AES-CBC-32, which are very weak. In addition, this relies on pre-shared keys for IPsec, which is not very scalable. Therefore, it does not solve the scalability and self-management requirements.
In order to satisfy these requirements, a Key Management Protocol (KMP) can be considered, that allows keys to be refreshed periodically (therefore maintaining acceptable security levels). Specifically, an automatic key exchange mech-anism is required; thereby, each node can keep track of the security associations (SA) that specify how a particular IP flow should be treated in terms of security.
The most extended KMP is IKE. A very simple approach of IKE has been defined in [Raza et al., 2012b] , which does not satisfy all the requirements and functionality for a full SA establishment.
Other issues from IPSec is that the overhead caused by a IPsec packet (the extra bytes on the IP header) can force the packet to be fragmented (the link layer payload that includes the extra IPsec bytes becomes bigger than the maximum size of a 802.15.4 packet), thus an extra packet must be sent to the link layer and the energy/network overhead will become bigger. In addition, this overhead problem is worse with the ESP mode of IPSec, since the internal headers of IPv6 and UDP are encrypted and consequently cannot be compressed.
In addition to IPSec, the majority of works from the CORE Working Group in IETF are focused on the integration of security through the transport layer security solutions such as DTLS for CoAP. DTLS is the default security for CoAP.
A pre-shared key mode (PSK) is also considered by CoAP, with the aforementioned problems regarding the lack of scalability for this pre-establishment of the security credentials.
CoAP also offers a very interesting approach based on RawPublicKey, i.e., a solution based on the use of an asymmetric key pair, but without an X.509 certificate metadata. This approach is highly relevant since it can manage the identity issues mentioned in the introduction section, in order to verify the authenticity of the device and its link with the manufacturer. For example, the Certification Authority (CA) of the public key can also indicate the list of identities of the nodes, with which it can communicate. It can thereby indicate the entities which are trustworthy in the initial verification and bootstrapping phase.
CoAP also considers certificates, i.e., X.509 certificate that binds it to its Authority Name and is signed by some common trust root, e.g., the manufacturer.
In order to optimize DTLS for smart objects, DTLS 1.2 offers the schemes to re-use the cryptographic hardware support by the majority of the IEEE 802.15.4 transceivers [Rescorla and Modadugu, 2012] . In particular, it is based on AES CCM for the symmetric cryptography. In addition, considers the usage of Elliptic Curve Cryptography (ECC) for the asymmetric cryptography. Thereby, making it more suitable for these constrained devices.
Nowadays, DTLS is being considered by the Smart Energy profile for ZigBee alliance (SE 2.0), and it is also being considered as an adaptation of DTLS 1.2 in the IPSO Alliance based on the subset allowed by RFC6347.
In addition to the solutions presented, there is security support over the current Internet architecture based on IPv6 in the network layer and UDP/TCP for the transport layer, where the security is based on IPSec for IPv6 and DTLS/TLS for UDP/TCP respectively. Also two solutions from the IETF to support the ID/Locator split have been defined. The first, HIP, has been developed by the Host Identity Protocol (HIP) Working Group, a group mainly focused on improving security of the Future Internet, and the HIP Diet EXchange (HIP DEX) [Nie et al., 2011] , which has been optimized for constrained environments such as the Internet of Things. HIP offers in a single mechanism the capabilities for authentication and establishment of the communication.
Therefore, the goals to solve for the security support are, first, to optimize cryptographic primitives for the described protocols. Specially, ECC for the asymmetric cryptography.
Thereby in the future works it will be feasible to analyze and evaluate the impact of IP security protocol (IPSec) for constrained devices.
Asymmetric cryptography for constrained devices
Asymmetric cryptography is applied to the different methods and mechanisms developed by the community such as the mentioned IPSec IKE. Asymmetric cryptography has been considered mandatory in order to satisfy scalability of security and the goal to build highly scalable and autonomous solutions. Specifically, the optimization of the cryptographic primitives for constrained devices such as the 16-bits microprocessor MSP430 from Texas Instrument (commonly used in IoT devices such as 6LoWPAN, active RFID and DASH7) presented in [Marin et al., 2011 , Marin et al., 2013b .
That work solved the mathematical optimization of cryptographic primitives for asymmetric cryptography based on a special pseudo-Mersenne primes, which we have denominated Shifting Primes. These primes can be used for ECC primitives with 160-bit keys in a highly optimal way. Specifically, this allows us to carry out ECC scalar multiplication within 5.42 million clock cycles over MSP430 devices without a hardware multiplier. This result reduces, for a microprocessor working at 8Mhz, the time required for the basic ECC protocols to the boundary of 1s (the key generation and the Diffie-Hellman protocol is basically only one scalar multiplication). This is even less time than offered by the TinyECC implementation for devices with fast multiplication hardware instruction [Liu and Ning, 2008] .
This work presents the usage of the Shifting Primes for the 32-bits microprocessor JN5139 from NXP/Jennic based on OpenRISC architecture, such as described in [Beuran et al., 2012] . This microprocessor presents higher capabilities and consequently offers higher performance for this operation. This microprocessor has a multiplication instruction that can be used for the implementation, and this produces big differences between JN5139 and MSP430.
The following sections present how the properties of the Shifting Primes can be exploited to optimize the multiplication and squaring. Specifically, Section 4 presents how a double accumulator and shifting reduction has been defined thanks to the properties of the Shifting Primes to simplify the multiplication and squaring. The details of both optimizations are described in details in the Section 5 and Section 6 respectively.
The Double Accumulator and Shifting Reduction
The key point in the following algorithms is the fact that the Jennic/NXP JN51XX family of chipsets, based on the OpenRISC architecture, support 32-bit registers. This OpenRISC architecture supports an instruction to multiply two registers, although it is limited to store the result in a single 32-bit register. This means that only are stored the 32 least significant bits of the result.
In order to implement big number multiplication, it is required to split the operand in 16-bit operands and use the multiplication in the microprocessor to get 32-bit results. These partial results will be added to obtain the final result.
The set of registers that will be used to add the partial results will be called the accumulator. This accumulator will be used also to make the reductions modulo p whenever necessary.
Before going any further, let introduce some notation:
, they are the 10 pointers for the shifting of 16 bits inside a 160 bits number.
2. B i = 2 32i for i ∈ {0, · · · , 9}, they are the 10 pointers for the shifting of 32 bits inside a 320 bits number.
3. D i = 2 32i+16 for i ∈ {0, · · · , 9}, they are the 10 pointers for the shifting of 32 bits inside a 320 bits number for not aligned operations.
4. p is the shifting prime p = 2uC 9 − 1. Specifically, the examples presented in this work will be based on the shifting prime with value equal to u = 0x6400 = 51200.
5. Given a 32-bit number W , It will be defined as hi k (W ) the k most significant bits of W and lo k (W ) the k lowest significant bits. This notations satisfies the equality W = hi k (W )2 k + lo 32−k (W ). If it is not specified any subscript, then k = 16. Then it is written W = hi(W )2 16 + lo(W ) where hi(W ) and lo(W ) are 16-bit numbers.
Following this notation, a 160-bit number x can be written
16 − 1}. Since, it is required to multiply two 16-bit numbers a and b in different position, for example aC i and bC j . The result ab (which is a 32-bit number) will be in position C i+j , but this can be not 32-bit aligned. It will be aligned only in case i + j is an even number. This produces a problem of misalignment of the half of the multiplications. For that reason, it is required to use the B i for the aligned results and the D i for the not aligned ones. All these assumptions are summarized in the following proposition (with some useful properties of the symbols introduced):
Proposition 1. These values satisfy the following properties for i, j ∈ {0, · · · , 9}
and any 32-bit number W : 
2. Since, a shifting prime i used, in particular p = 2u2 32·4+16 − 1, this has the
3. Using the same equivalence,
In order to make the implementation more efficient, the accumulator is stored in a set of registers (to avoid operations with memory). The multiplication or the square of 160-bit numbers is a 320-bit number, therefore this requires 10 registers. This is also required that at least one operand is stored in registers, since they need to be stored as 16-bits half words, then this requires another 10 registers; although it will be only used the least significant half of the registers.
The problem with the accumulator is that half of the partial results are not aligned, therefore it needs to be shifted 16-bits before adding with the align partial results. This continuous shifting requires a lot of computations, and instead of it, this work proposed an optimization solution based on the use of a double accumulator, one that accumulates the aligned partial results, and another one for the other partial results.
The problem is that this requires 10 extra registers, and this makes a total requirement of 30 registers, that is too much for the Jennic/NXP JN51XX Open-RISC processor, since this is limited to 32 registers, and some of them cannot be used such as the stack pointer, the constant register r 0 , and the link register. Finally, it is also required some temporal registers for the partial results before addition.
It is required to compute modular multiplications, therefore it can use the shifting prime to have modular equivalent results that can be written using less registers. This is also relevant because in the Montgomery representation, a 160-bit shifting of the final result is required, therefore it needs to store all the information in the most significant words.
The trick is given by properties (2) and (3) in Proposition 1. The transfer of the content from the least significant words to he highest ones multiplying by 2u, since
, and consequently the final result is also equivalent. To exploit this property, the prime p must be a shifting prime, since the accumulator information can be concentrated in the registers of any of both accumulators. Therefore, let reuse the other registers. The unique pending problem is that multiply by 2u increases the size in 16 bits, therefore the corrections have to be made in both accumulators. For example, let W be a 32-bit number, then
To sum up, this technique let implement the double accumulator with much less than 20 registers. Thereby, in total is required the at least 10 for the accumulator, 10 for the parameter and some others for partial computations.
Multiplication
The multiplication algorithm is explained in detail in [Marin et al., 2013a] . This section explains it briefly with the notation introduced in the previous section. This will allow to compare the multiplication algorithm with the squaring that will be presented in the following section.
Let x = 9 i=0 x i C i and y = 9 j=0 y j C j be the operands, it needs to be computed the Montgomery multiplication of x and y, this is xy2 −160 , or what it is equivalent xy written on 10 words, but concentrated in the most significant words using the relations induced by the shifting prime that has been described in the previous section.
Since, a double accumulator is used, it is just required to merge them and carry out the required shifting at the end.
The accumulator A starts with the value A 0 = 0, and it is increased in several steps. Therefore, the accumulator is evolving in the following way A 1 , A 2 , etc. At the end, the accumulator will be xy. The value increased is
Two different situations need to be managed, On the one hand, when i is even, the value is:
On the other hand, when i is odd, the value is:
The odd and even cases follows an almost similar calculus, the main difference is the interchange of the roles for the D's and B's variables. This calculus has been designed in order to simplify the implementation and make feasible the re-use of code in order to optimize memory footprint.
The values in the least significant words are moved to the most significant words in every step using the following equations:
This equations let store the accumulator with only 10 registers. Since, in the last step of the algorithm, both parts of the accumulator are combined. The information about the performance and the detailed implementation is presented in [Marin et al., 2013a] .
Squaring
The optimization of ECC for constrained devices is focused on the multiplication and squaring operations. The previous section has explained how to optimize the registers usage for the multiplication, this section explains how to use the shifting primes to implement x 2 (p) for p = 2uC 9 − 1 and x = In order to implement the squaring, it could be re-used the standard multiplication with x = y. But, in that case, it is requiring all the computation time of a standard multiplication, when the squaring can be optimized and reached a lower computation time. Therefore, as the majority of the optimization techniques for the implementation of cryptographic primitives is a trade-off between memory footprint and computational time required for the operation set.
In the case of the squaring, the extra code is highly worth, when it is required to maximize the speed for the ECC protocols. In addition, the usage of the shifting primes offer a set of properties that allows even to optimize much more the squaring implementation.
When x 2 is computed, it presents repeated multiplications x i x j and x j x i , these multiplications produce 2x i x j when i = j, i.e., the same result. Therefore, it can be reduced the number of multiplications, and it can be considered an accumulator that stores the values x i x j and multiply by 2 the value at the end. The problem appears with the values x 2 i . These values are not multiplied by 2 and therefore, they require a special treatment. The shifting prime solves this following the next equation:
The constant 2 that multiplies u in the Proposition 1(2) let add
the accumulator without a division by 2. This operation is done for the values
will be added at the end, after to carry out the multiplication of the accumulator by 2.
The temporal value for the accumulator is defined as:
for i = 0, 1, 2, 3, 4, 5. This temporal value is almost the accumulator.
It is required to compute these values for i = 1, 2, 3, 4, 5, note that for T 0 = 0
The multiplications ulo 17 (x 2 i ) are always below 2 32 because u < 2 15 , but it can present a carry in the addition x 0 x 9 + ulo 17 (x 2 0 ), therefore in order to avoid that carry, the followed technique is to delay the addition of x 0 x 9 .
Regarding the computation of
, it is required to include the following terms:
x 5 x 7 B 6 + x 5 x 9 B 7 + x 6 x 8 B 7 + x 7 x 9 B 8 +
These values can be included with the other ones that have been used to compute the T i .
Then. the new terms are underlined and removed the x 0 x 9 D 4 from the first addition. These terms will be included in Δ 8 such as follows: The final computation is
The order in the additions is important, because this allows to free the registers that have the values x i when they are already used. The order in which these registers are made available (i.e., free) is: register freed after adding
The registers that have the values x 5 , · · · , x 9 can be used to store the final result, since they can be re-used to store their square value, and for adding on it two times the accumulated value given by the Δ's (after adding the D-values over the B-values) .
During the accumulation process, it is required to move the least significant words to the highest significant ones, in order to have everything stored and calculated in terms of B 5 , ..., B 9 and D 5 , ..., D 9 (the value on D 9 is used only for carriers).
Therefore, it is required five registers for D's and other five for B's. The most significant register for D's will be used only for carriers. For each step, it is moved a register to the most significant part as follows:
First, it is already calculated the accumulated value αD i , then:
can be added with the next x r x s B i+1 because x r x s + hi(α) < 2 32 . This process requires only one extra multiplication by u instead of two, that was required in the multiplication process explained in more details in the work presented in [Marin et al., 2013a] .
Evaluation
The evaluation of the presented optimizations is carried out following a bottomup approach. Specifically, on the one hand, the squaring optimization, and on the other hand, this is evaluated from the scalar multiplication, to the modular multiplication. Note, that the modular multiplication is the required operation for the cryptographic process of key generation, encryption, and decryption. Regarding the squaring optimization, the computation cost in terms of number of scalar multiplications for a modular product xy is 119. In the computation of the squaring, i.e., x 2 is 70. Therefore, this presents a reduction of 59% in the number of single scalar multiplications. The time reduction is equal to the 65%, this time reduction is also influenced by the other operations involved, this reduction is limited by the other operations that are carried, which are not using the square operation. In order to extrapolate the impact of the squaring optimization for the modular multiplication; let M, which denotes the time required for a modular multiplication, and S the time required for squaring. In the literature, it is usual to consider that S 0.8M. But, the proposed algorithm offers a S 0.65M. This reduction has a high impact in the time required for the modular multiplication.
The time required for the modular multiplication also depends on the representation used. In particular, this work has used the Jacobian coordinates on Weierstrass curves, point addition (with Z2 = 1) can be done with 7M + 4S, this means a difference 0.78. The doubling algorithm is used more often than point addition, then it is reached a final time using the special square that is 80% the time required using the same function for multiplication and squaring.
The results depend on the technique used for the scalar multiplication. There is a wide variety of formulas for different point representations and also different curves that can be used for cryptography. See for example the information found in 1 for a detailed compilation.
We can compare these results with other implementations given in the literature, for example [Chatzigiannakis et al., 2011] we can see an implementation that requires 11.121 sec for JN5139 for key generation. Even without this special square function, our implementation requires 140.37 ms (see [Marin et al., 2013a] ). With this extra reduction, the time is reduced another 20 percent. The authors in [Chatzigiannakis et al., 2011] claim that their implementations are not optimized for speed, this can explain these big differences.
Conclusions and Future work
This work has presented the optimization cryptographic primitives for asymmetric key cryptography based on Elliptic Curve Cryptography, this has demonstrated that asymmetric key cryptography is feasible for constrained devices.
The ongoing work is focused on the definition of hybrid scenarios with the defined optimization based on Shifting Primes for scenarios with chipsets such as the MSP430 for the end-devices, and for chipsets such as the presented JN51XX for the gateway and border routers.
These hybrid scenarios will make use of high level algorithms such as IPSec and DTLS, where Elliptic Curve Cryptography is used for the establishment of the security association or session.
The future work is focused on solutions to carry out IoT/M2M trust verification, through a mechanism such as capabilities-based access control. Consequently, novel scenarios based on temporal access to resources can be defined. For example, a house proprietor with an access control solution (e.g. a smart door lock) is able to offer temporal access to his neighbor so as to go everyday at anytime from 15:00 to 18:00 in order to feed the pets and irrigate the plants.
The mechanisms required to offer secure solutions that make usage of the IoT capabilities feasible during usual human activities and behaviors, where devices and physical resources are involved, needs to be enhanced. As a result, these new mechanisms and solutions will facilitate the introduction of the IoT as part of the Internet-powered society.
FPU program (AP2009-3981), the "Fundación Séneca" and the Spanish ministry for Science and Technology.
