Abstract-As networking has become a major innovation driver for the Internet of Things as well as Networks on Chips, the need for effective cryptography in hardware is on a steep rise. Both cost and overall system security are the main challenges in many application scenarios, rather than high throughput. In this work we present area-optimized implementations of the lightweight block cipher SIMON. All presented cores are protected against side channel attacks using threshold implementation, which applies secret sharing of different orders to prevent exploitable leakages. Implementation results show that, on FPGAs, the higher-order protected SIMON core can be smaller than an unprotected AES core at the same security level against classic cryptanalysis. Also, the proposed secure cores consume less than 30 percent of the power of any unprotected AES. Security of the proposed cores is validated by provable arguments as well as practical t-test based leakage detection methods. In fact, we show that the first-order protected SIMON core does not have first-order leakage and is secure up to 10 million observations against higher-order attacks. The second-order secure implementation could not be exploited at all with up to 100 million observations.

INTRODUCTION
THE increasing interconnectedness of embedded systems pushes cryptographic cores into an ever-increasing number of devices. Many of these new applications are cost-constrained, so cheap and simple, yet effective security solutions are in dire need. One approach to make cryptography affordable for area-constrained embedded devices is constraint-adjusted cryptographic design, also known as Lightweight Cryptography. One recent proposal is the SIMON block cipher, proposed by the NSA [2]. SIMON is size-optimized and has a range of block and key size options, allowing an adjustment to cost-constrained applications in hardware and embedded software.
However, in embedded applications, a solid cipher is not always sufficient to provide the expected security level. Since adversaries have access to the devices, protection against physical attacks such as side channel and fault attacks is also important. Fault attacks have been reasonably discussed in [3] , [4] , and countermeasures for SIMON have been discussed in [5] . In this paper we focus on side channel attacks and propose new hardware implementations to prevent them.
A fairly elegant solution to protect against power and electromagnetic side-channel analysis was proposed by Nikova et al. in [6] . The idea is based on secret sharing and it is called Threshold Implementation (TI). Unlike many other countermeasures, it is effective and secure in the presence of glitches, making it directly applicable to hardware implementations. It is also applicable at the algorithmic level and less error-prone than many alternative proposals.
There are several works published based on the idea of threshold implementation. Kutzner et al. in [7] show the implementation of 4-bit S-Boxes using three shares. Another work by Moradi et al. [8] implements the well-known cipher AES in a small area. It was shown that the threshold version of AES can be implemented using approximately 11,000 GE. Bilgin et al. in [9] improved the results further and implemented a threshold implementation of AES using approximately 8,200 GE.
Recent works focus on higher-order threshold implementation, which provides improved resistance to higher-order attacks. For example, Bilgin et al. in [10] discussed the theory of higher-order threshold implementation as well as its practical realization. They also demonstrated the resistance of their core by analyzing 300 million traces and showed that there is no univariate leakage in those traces.
In this work, we show how the threshold implementation countermeasure can be applied to the SIMON block cipher. We show that the resulting implementation still yields an amazingly small crypto-core while providing a high level of side channel protection. To prove the resistance to univariate first and second order attacks we use variants of the popular t-test based leakage detection methodology proposed by Goodwill et al. in [11]. One of the interesting features of combining a simple lightweight cryptographic design such as SIMON with the threshold implementation technique is that randomness is only needed for the initial sharing of the state and key. There is no need for fresh randomness during the computation of the cipher. This is in contrast to most other countermeasures, and even to many applications of TI to more complex ciphers such as AES (see [8] and [9]).
The remainder of the work is structured as follows. In Section 2 the block cipher SIMON and its area-minimal implementation are explained and the threshold implementation masking technique is introduced. In Section 3 the equations for 1st and 2nd-order secure SIMON are presented. In Section 4 the hardware architecture of the implementation is introduced and the results of our implementation are compared to other lightweight ciphers. Section 5 presents the leakage detection test based on actual power measurements. The paper is concluded in Section 6.
BACKGROUND
SIMON
SIMON implements a Feistel structure that splits the state into two words of 16, 24, 32, 48, or 64 bits each. The key is split into two to four n-bit words, depending on the plaintext size. These first key words are used in the first rounds of SIMON. The key scheduling algorithm is used to generate the following round keys. The number of rounds in SIMON ranges from 32 to 72 rounds. Hence, one particular instance of SIMON is denoted as SIMON2n/mn, where m is 2, 3, or 4 and n is the word size. Defined SIMON parameters and the corresponding number of rounds are shown in Table 1. For example, SIMON64/128 accepts 64 bits of plaintext at a word size of 32 and 128 bits of key (four words). It generates a ciphertext after 44 rounds.
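A schematic description of SIMON is given in Fig. 1. Assuming that the input words of round i are $l_i$ and $r_i$, the output words are

$$l_{i+1} = r_i \oplus k_i \oplus l_i^{2} \oplus \left(l_i^{1} \wedge l_i^{8}\right), \qquad r_{i+1} = l_i. \tag{1}$$

The upper index $X^{s}$ indicates a left circular shift of X by s bits. This function is highlighted in Fig. 1a. This can be expressed in GF(2), where the XOR operation becomes addition and the AND operation becomes multiplication, as

$$l_{i+1} = r_i + k_i + l_i^{2} + \left(l_i^{1} \times l_i^{8}\right). \tag{2}$$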
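As a minimal sketch of the round function described above, the following Python models one SIMON round on n-bit words. The names `rotl`, `simon_round`, and `simon_round_inverse` are illustrative and not taken from the paper; bit 0 is assumed to be the least significant bit, so that a left circular shift by s moves bit j - s (modulo n) to position j.

```python
from typing import Tuple


def rotl(x: int, s: int, n: int = 64) -> int:
    """Left circular shift of an n-bit word x by s positions."""
    s %= n
    return ((x << s) | (x >> (n - s))) & ((1 << n) - 1)


def simon_round(l: int, r: int, k: int, n: int = 64) -> Tuple[int, int]:
    """One SIMON round: XOR is addition and AND is multiplication in GF(2)."""
    new_l = r ^ k ^ rotl(l, 2, n) ^ (rotl(l, 1, n) & rotl(l, 8, n))
    return new_l, l  # the old left word becomes the new right word


def simon_round_inverse(l_next: int, r_next: int, k: int, n: int = 64) -> Tuple[int, int]:
    """Undo one round; the Feistel structure makes this the same computation."""
    l = r_next
    r = l_next ^ k ^ rotl(l, 2, n) ^ (rotl(l, 1, n) & rotl(l, 8, n))
    return l, r
```

Because the round is a Feistel step, applying `simon_round_inverse` to the output of `simon_round` with the same round key returns the original pair of words.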
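Also, assuming that the input words of the key, which are also the first round keys, are $k_0$ and $k_1$ (and possibly $k_2$ and $k_3$, depending on the key size), the next round key is computed following the key expansion function as

$$k_{i+m} = \begin{cases} c_i + k_i + k_{i+m-1}^{-3} + k_{i+m-1}^{-4}, & \text{two and three words } (m \in \{2, 3\}), \\ c_i + k_i + k_{i+1} + k_{i+1}^{-1} + k_{i+3}^{-3} + k_{i+3}^{-4}, & \text{four words } (m = 4), \end{cases} \tag{3}$$

where $c_i$ is a round constant and a negative upper index $X^{-s}$ denotes a right circular shift by s bits. The key expansion for a two-word secret key, m = 2, is shown in Fig. 1b.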
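The key expansion can be sketched in Python as follows, following the published SIMON specification. The function names are illustrative, and the round constant `c` is passed in by the caller because the constant sequences are not listed in this text. The sketch also makes the linearity of the key schedule easy to check: expanding the XOR of two key states equals the XOR of the two expansions when the constant is applied only once.

```python
def rotr(x: int, s: int, n: int = 64) -> int:
    """Right circular shift of an n-bit word x by s positions."""
    s %= n
    return ((x >> s) | (x << (n - s))) & ((1 << n) - 1)


def expand_key_word(words, c: int, n: int = 64) -> int:
    """Next round key from the m most recent round keys.

    `words` lists k_i, ..., k_{i+m-1} (oldest first) with m in {2, 3, 4};
    `c` is the round constant c_i, supplied by the caller (assumption).
    """
    m = len(words)
    if m not in (2, 3, 4):
        raise ValueError("SIMON uses two, three, or four key words")
    tmp = rotr(words[-1], 3, n)
    if m == 4:
        tmp ^= words[1]
    tmp ^= rotr(tmp, 1, n)
    return c ^ words[0] ^ tmp
```

Since every operation is an XOR or a rotation, each share of a masked key can be expanded independently, with the constant added to one share only.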
Bit-Serialized Implementation
Aysu et al. in [12] proposed a bit-serialized implementation of SIMON where only one bit of the internal state is processed in each clock cycle. Hence, a single round of SIMON completes after n cycles, where n is the size of input word. Moreover, two shift registers are used to store the internal states to simplify the control of sequentially processing and storing individual bits. In fact, the left share of the internal state is passed over as-is to the right share, hence only one shift register of the same size as the input block is actually needed. Here, SIMON is implemented as a special class of non-linear feedback shift register, where the output of the feedback function changes the state only after completing the round function. Since the feedback function requires only four bits of the state, namely r i , l 
Threshold Implementation
The Threshold Implementation countermeasure was proposed by Nikova et al. in [6] . TI applies XOR secret-sharing to achieve provable resistance against first-order side channel attacks.
Let $z = f(p, k)$ be the function that should be performed in each round of SIMON, where p represents the data and k represents the key. Let $f_1, f_2, \ldots, f_n$ denote a set of functions that are selected to fulfill the following properties:
1) Non-completeness: The equation used to evaluate any output share must be missing at least one input share,

$$z_i = f_i(p_1, \ldots, p_{i-1}, p_{i+1}, \ldots, p_n,\; k_1, \ldots, k_{i-1}, k_{i+1}, \ldots, k_n),$$

where $f_i$ is missing $p_i$ and $k_i$.
2) Uniformity: Any bias in the joint distribution of the output shares ($p_i$, $k_i$) is due to biases in the joint distribution of the input (p, k). In other words, if the input shares are uniformly distributed, the output shares must also be uniformly distributed.
3) Correctness: Combining the outputs of the different shares retrieves the original output,

$$z = z_1 \oplus z_2 \oplus \cdots \oplus z_n = f(p, k).$$
These requirements enforce that the information required to compute the secret value (all the shares) is not present in the leakage at any time instant. Hence, any vulnerability in the underlying implementation (e.g., glitches) cannot leak the secret key. It was proven in the appendix of [6] that if the three requirements are met, all the intermediate results will be independent of the input p; k and the output z. Also, the power consumption and any other side-channel output of functions f i are independent of p; k, and z. In the literature, an implementation that follows these properties is denoted as a first-order resistant threshold implementation.
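The three properties can be checked exhaustively on a small example. The sketch below uses the well-known direct three-share sharing of a single AND gate, which is not one of the equations of this paper, and the function names are illustrative. It confirms correctness, makes non-completeness visible in the code (output share i never reads input share i), and shows that this sharing of a bare AND is not uniform on its own.

```python
from collections import Counter
from itertools import product


def shared_and(x, y):
    """Direct three-share sharing of z = x AND y (bits)."""
    x1, x2, x3 = x
    y1, y2, y3 = y
    z1 = (x2 & y2) ^ (x2 & y3) ^ (x3 & y2)  # reads shares 2 and 3 only
    z2 = (x3 & y3) ^ (x1 & y3) ^ (x3 & y1)  # reads shares 1 and 3 only
    z3 = (x1 & y1) ^ (x1 & y2) ^ (x2 & y1)  # reads shares 1 and 2 only
    return z1, z2, z3


def sharings(value):
    """All three-bit XOR sharings of one bit."""
    return [s for s in product((0, 1), repeat=3) if s[0] ^ s[1] ^ s[2] == value]


def is_correct():
    """Recombining the output shares must give x AND y for every sharing."""
    return all(
        (z[0] ^ z[1] ^ z[2]) == (xv & yv)
        for xv, yv in product((0, 1), repeat=2)
        for x in sharings(xv)
        for y in sharings(yv)
        for z in [shared_and(x, y)]
    )


def output_histogram(xv, yv):
    """How often each output sharing occurs over all input sharings of (xv, yv)."""
    return Counter(shared_and(x, y) for x in sharings(xv) for y in sharings(yv))
```

For x = y = 0 the histogram is lopsided (one output sharing occurs 7 times, the other three occur 3 times each), whereas a uniform sharing would produce each of the four valid output sharings 4 times.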
First-order secure threshold implementation of block ciphers have been published for AES [8] , [9] , PRESENT [7] , as well as for KECCAK [13] . It has even been applied successfully to public key cryptography in [14] . In all of these cases, security has also been practically verified through power analysis on FPGAs.
The concept of TI was later expanded in order to achieve higher-order side channel resistance [10]. The conditions for an implementation to achieve higher-order resistance are more complex. The authors propose to replace the non-completeness property with a new dth-order non-completeness property.
1) dth-Order Non-completeness: Any combination of up to d outputs of the functions $f_i$ must be independent of at least one input share.
Bilgin et al. [10] showed that the univariate dth statistical moment of the power consumption of a device which satisfies the above property is independent of the unmasked input even in the occurrence of glitches. However, Reparaz et al. [15] pointed out that the shown dth-order resistance might not hold across subsequent function calls, implying that multivariate attacks may succeed. In those cases, adding new entropy to the potentially vulnerable intermediate states prevents the leakage. There has been recent work on first and second-order resistant implementations of the S-Boxes of PRESENT [16] and AES [17] with practical leakage assessment.
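The dth-order non-completeness condition is purely structural, so it can be checked from the list of input-share indices each output share reads. The following sketch assumes that representation; the function name and the two example dependency lists are illustrative. The second list assigns one output share to every pair of five input shares, which gives ten output shares and is not the nine-share construction referred to in this paper.

```python
from itertools import combinations


def is_dth_order_non_complete(deps, num_input_shares, d):
    """deps[j] is the set of input-share indices read by output share j.

    Returns True if every group of up to d output shares misses
    at least one input share.
    """
    all_shares = set(range(1, num_input_shares + 1))
    for size in range(1, d + 1):
        for group in combinations(deps, size):
            if set().union(*group) == all_shares:
                return False
    return True


# Direct three-share sharing: output share i avoids input share i.
first_order_three_share = [{2, 3}, {1, 3}, {1, 2}]

# Illustrative five-share layout: one output share per pair of input shares.
pairwise_five_share = [set(p) for p in combinations(range(1, 6), 2)]
```

With three shares, two output shares together already see all inputs, so the first list passes only for d = 1. With five shares and pairwise dependencies, any two output shares read at most four of the five input shares, so the second list passes for d = 2.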
THRESHOLD IMPLEMENTATIONS OF SIMON
The simple algebraic structure of the SIMON rounds makes SIMON a suitable candidate for efficient threshold implementation. Although a three-share implementation is required to overcome the effect of glitches in hardware modules, we start with a two-share implementation as a preliminary step. Then, we propose the first-order three-share TI of SIMON, and we also propose a 2nd-order TI of SIMON resisting univariate 2nd-order side channel attacks.
SIMON with Two Shares
In order to process SIMON in two shares, we use the following equations. We denote the random mask that affects the input plaintext as $m[p]$. We denote the random masks that affect the right and left words as $m_r[p]$ and $m_l[p]$, respectively. Also, we denote the different shares by indexes 1, 2, .... For example, $(a_j)_i$ is the value of the jth share of a in round i.
The input words are given as
Then, the round functions can be expressed as
where k 1 and k 2 are the two shares of the round key. We use a different mask to process the key schedule, denoted by m½k. The size of the mask should be equal to the size of the key. Equations for splitting the key schedule into two shares are straightforward, being an entirely linear operation.
This masking scheme is correct because XORing the two shares at any time will cancel out the effect of the input randomness ($m_r[p]$ and $m_l[p]$) (see Eq. (2)).
Although the system of equations in the data-path (every term in the equations aside from the key) is not invertible i.e., its mapping is not guaranteed to be one-to-one, which suggests non-uniformity, uniformity is guaranteed by the randomness brought by the key shares (k 1 and k 2 ). The key shares are uniformly distributed as the system of equations to generate them is linear and invertible (assuming that the input random masks are uniform). Then, it is easy to prove that the result of addition in GF(2) between an arbitrary variable that is not necessarily uniform (the data-path) and a uniformly distributed random variable (the key shares), is uniformly distributed. This implies that the above system of equations is uniform. Although the random variable used in one round depends on the random variables used in the previous rounds, this does not result in any vulnerability for univariate attacks that harvest information from a single point in the trace.
However, it is not non-complete because the two input shares are required to process any output share. This masking scheme can work in software implementations if we enforce the order of processing the equation to be from left to right. Hence, we ensure that the compiler does not generate any intermediate variable that is free from the random mask. However, this masking scheme is not provable in hardware implementations where glitches can leak the relation between the two shares. In order for the secret-sharing scheme to provably work in hardware implementations, we need to enforce the requirement of non-completeness. Hence, we propose the three-sharing scheme in the next section.
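A two-share round of this kind can be sketched as below. The concrete split of the AND term is an assumption made for illustration and is not necessarily the split used in the paper's equations; the names `share2` and `round2` are hypothetical. The sketch only demonstrates correctness and the fact that each output share reads both input shares, which is exactly why the scheme is not non-complete.

```python
def rotl(x, s, n=64):
    """Left circular shift of an n-bit word."""
    s %= n
    return ((x << s) | (x >> (n - s))) & ((1 << n) - 1)


def plain_round(l, r, k, n=64):
    """Unshared SIMON round, used only as a reference."""
    return r ^ k ^ rotl(l, 2, n) ^ (rotl(l, 1, n) & rotl(l, 8, n)), l


def share2(value, mask):
    """Split a word into two XOR shares using a caller-supplied random mask."""
    return value ^ mask, mask


def round2(l, r, k, n=64):
    """Hypothetical two-share round; l, r, and k are pairs of shares."""
    (l1, l2), (r1, r2), (k1, k2) = l, r, k
    # Each expression is written to be evaluated left to right,
    # starting from a masked word (r1 or r2).
    new1 = (r1 ^ k1 ^ rotl(l1, 2, n)
            ^ (rotl(l1, 1, n) & rotl(l1, 8, n))
            ^ (rotl(l1, 1, n) & rotl(l2, 8, n)))  # needs both l1 and l2
    new2 = (r2 ^ k2 ^ rotl(l2, 2, n)
            ^ (rotl(l2, 1, n) & rotl(l2, 8, n))
            ^ (rotl(l2, 1, n) & rotl(l1, 8, n)))  # needs both l2 and l1
    return (new1, new2), (l1, l2)
```

Recombining the two output shares reproduces the unshared round result for any choice of masks. Whether an evaluation order keeps every intermediate value masked depends on the actual equations and on the compiler, and the sketch makes no claim about that.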
1st-Order Secure SIMON
In order to enforce non-completeness, we process SIMON in three shares using the following equations. We need two randomly generated masks $m_{\{r \text{ or } l\},i}$ in order to mask the plaintext (similar to Eq. (8)).
The same method is used to mask the key, with fresh masks. This generates three shares of each plaintext and key. The equations used to process the r part are straightforward and hence omitted (see Eq. (9)). The equations to process the l part are as follows:
This masking scheme is correct, uniform, and non-complete. It is correct and uniform by the same arguments given for the two-share SIMON. It is non-complete because the equation used to process any output share (e.g., $l_1$) does not include at least one input share ($l_1$ and $r_1$).
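The three-share round can be sketched in Python as follows. The distribution of the cross terms over the shares is one possible choice based on the standard direct sharing of an AND gate and is an assumption, not a transcription of the paper's equations; all names are illustrative. Non-completeness is visible in the call signatures: the function that produces output share 1 is only handed shares 2 and 3, and so on.

```python
def rotl(x, s, n=64):
    """Left circular shift of an n-bit word."""
    s %= n
    return ((x << s) | (x >> (n - s))) & ((1 << n) - 1)


def plain_round(l, r, k, n=64):
    """Unshared SIMON round, used only as a reference."""
    return r ^ k ^ rotl(l, 2, n) ^ (rotl(l, 1, n) & rotl(l, 8, n)), l


def share3(value, m1, m2):
    """Split a word into three XOR shares using two random masks."""
    return value ^ m1 ^ m2, m1, m2


def f_share(la, lb, ra, ka, n=64):
    """One output share, computed from two of the three input shares."""
    a1, a8 = rotl(la, 1, n), rotl(la, 8, n)
    b1, b8 = rotl(lb, 1, n), rotl(lb, 8, n)
    return ra ^ ka ^ rotl(la, 2, n) ^ (a1 & a8) ^ (a1 & b8) ^ (b1 & a8)


def round3(l, r, k, n=64):
    """Hypothetical three-share round; l, r, and k are triples of shares."""
    l1, l2, l3 = l
    r1, r2, r3 = r
    k1, k2, k3 = k
    new_l = (
        f_share(l2, l3, r2, k2, n),  # output share 1 never sees share 1
        f_share(l3, l1, r3, k3, n),  # output share 2 never sees share 2
        f_share(l1, l2, r1, k1, n),  # output share 3 never sees share 3
    )
    return new_l, (l1, l2, l3)
```

The nine products of the shared AND are split so that each appears exactly once, which is what makes the recombined output equal to the unshared round.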
Univariate 2nd-Order Secure SIMON
To achieve a glitch-free 2nd-order resistance, at least five shares are needed for implementing a secure version of the SIMON round function, since it has an algebraic degree of 2. The key schedule, however, is linear, i.e., 2nd-order non-completeness is already achieved with the three-share implementation of the previous design. In order to achieve a more area-efficient implementation, our design varies the number of shares throughout the implementation, as also done in [9], with a three-share key schedule and a five-share state with one intermediate nine-share state to compute the shared AND term. In particular, Bilgin et al. showed in [10] that implementing a shared AND with only five shares in a 2nd-order non-complete fashion seems difficult. Instead, they proposed to compute a function of the form $y = a + b \times c$ in two steps, where the intermediate step expands to nine shares. The sharing of the variables a, b, and c will be denoted by $a_j$, $b_j$, and $c_j$, respectively. One possible set of equations to satisfy the above property is as follows [10]:
We use this set of equations to compute the non-linear part of the proposed design. Here, we need to compute $y = r + l^{1} \times l^{8}$, which is part of the SIMON round function (see Eq. (2)). Essentially, we replace a, b, and c in Eq. (12) by r, $l^{1}$, and $l^{8}$, respectively. The set of equations for processing the non-linear part in round i can hence be written as follows:
Note that, any combination of up to 2 outputs is missing at least one input share, as required by the definition of 2nd-order non-completeness discussed in Section 2.3. For example, the equations used to compute y 1 and y 9 are missing the input shares l . In these equations, the number of input shares is 5 while the number of output shares is 9. By keeping the sharing of y j , the design will get bigger as more non-linear functions are to be computed. Hence, there is a need for decreasing the number of shares. It was shown in [10] that the following construction has 2nd-order noncompleteness
These equations follow the definition of 2nd-order noncompleteness over the y shares. However, they are not 2nd-order non-complete over the original inputs a, b, and c. Hence, the output of the first set of equations must be stored in registers before being accessed by the second set of equations in order to prevent glitches from propagating between the two sets. Separating the two sets by registers will make the entire system 2nd-order non-complete [10] .
In order to incorporate this reduction of variables in our design, we use the following set of equations
Note that, by adding the five shares of $l^{2}$ and the three shares of the key k, we are computing the entire round function $l_{i+1} = (y + l^{2} + k)_i$, as required by Eq. (2).
The set of equations in Eqs. (15) and (13) represent a 2nd-order threshold implementation of SIMON. This implementation is correct, uniform, and 2nd-order non-complete. It is correct as the non-linear part is processed by Eq. (13) and the linear part is added in Eq. (15). It is uniform due to the fresh randomness brought by the key shares. It is also 2nd-order non-complete as any two subequations in Eqs. (12) and (13) are missing at least one input share.
HARDWARE IMPLEMENTATION
The structure of the base unprotected SIMON128/128 core is shown in Fig. 2 . At first, the input is loaded into the Shift Register Up (SRU), FIFO1 and FIFO2. During the first eight cycles (phase 1), the look-up table (LUT) processes three bits from the SRU, a key bit, and the output of FIFO2. The result is stored in the Shift Register Down (SRD). During this phase, SRD stores the new values, while SRU stores the old ones for further processing. Once the SRD is full and before overflowing occurs, instead of SRU, SRD will be connected to FIFO1, where the new values will be stored (phase 2). SRU will still work as the old register for storing old bit values from FIFO1 output. This phase continues for 56 cycles until the round is completed. In the next round, the functionality of SRU and SRD will be flipped, representing phases 3 and 4 as shown in Fig. 2 . A detailed description of the underlying design rationale of this implementation is given in [12] and [18] .
In order to design a threshold implementation for SIMON there are two choices concerning the order in which the shares are processed. The shares can either be processed in parallel or in a serial fashion. We focus on the parallel implementation of SIMON, since the parallel implementation has similar security properties at a much smaller overhead. For a more in-depth discussion of the serial implementation, please refer to [1] .
Threshold Implementations of SIMON
Three-Share Implementation. This implementation uses three copies of the data-path and key schedule units, i.e., one for each share. Note that the three data path units and key schedule units need only one instance of the control unit.
As can be seen in Fig. 3 , the input to each function block comes from at most two shares (denoted by old) along with one bit from the key based on Eq. (11). The output is written into one share (denoted by new). The function block is implemented using LUTs. The old share is stored in the SRU (or SRD) and the new share is stored in the SRD (or SRU
In order to ensure that each output share is independent of at least one input share, the "Keep Hierarchy" property of the synthesis tool is enabled. The keep hierarchy property ensures that parallel LUTs are synthesized so that they never share a slice. The resistance analysis presented in Section 5 shows that this level of separation is sufficient for security.
Five-Share Implementation. This implementation uses five copies of the data-path and three copies of the key schedule units. As can be seen in Fig. 4, the input to each LUT comes from two or three shares (denoted by old) based on Eq. (14). As mentioned, the number of shares grows to nine at one stage. After one clock cycle the intermediate results are combined according to Eq. (15) to bring the number of shares back down to five.
Implementation Results
The proposed designs were implemented in Verilog HDL and synthesized for Spartan-3 xc3s50 using ISE 14.7. Table 2 summarizes the results and provides a comparison to previous implementations on the same platform. Our proposed parallel implementation needs 96 slices when synthesized by setting the optimization goal to area. The occupied area is slightly less than three times the unprotected design, since the control logic is not replicated for the parallel design. We also synthesized the parallel design by choosing speed as the main optimization goal, letting the synthesis tool pick slices. As highlighted in Table 2, our implementation is more compact than some unprotected ciphers, namely AES and PRESENT, though these implementations are not protected against SCA. Bit serializing the implementation of SIMON affected only the throughput as compared to the unprotected AES and PRESENT.
Fig. 2. Data-path of the SIMON cipher. In each round, after the first eight cycles the input of FIFO1 will change. Based on the round, the SRU and SRD will function as input or output of the LUT block.
Fig. 3. Three-share implementation of SIMON. All three shares are processed at the same time.
Fig. 4. Five-share implementation of SIMON. All five shares are processed at the same time.
We also implemented the higher-order SIMON. As can be seen, the design is larger than the first-order resistant SIMON. It can also be seen that, because of the more complex equations of the higher-order version, the number of LUTs utilized in the design is significantly higher than in the other two designs.
We synthesized the design for ASIC using Synopsys Design Compiler targeting the TSMC 90 nm cell library. The results are shown in Table 3 . The reported areas are divided by 2.8 (the size of a NAND gate in this library) to give the Gate Equivalent (GE) number. The table also shows the power consumption at 100 kHz frequency.
As can be seen for the case of SIMON128/128, the threshold implementation core is roughly four times bigger than the unprotected version of the same core. The higher-order implementation core is roughly five times bigger than the unprotected core. We also compared the performance results with AES and PRESENT. The results are shown in Table 3. It can be seen that the higher-order implementation of SIMON128/128 is smaller than any first-order threshold implementation of AES that is available in the literature. The other small cipher is KATAN, which accepts plaintexts of 32, 48, and 64 bits, with a key size of 80 bits for all of them. The smaller key and state size explain some of the area savings. While there is a higher-order TI version of the AES S-Box [21], no results for the full cipher are available at the moment.
With respect to the power consumption at 100 kHz, our designs show superior low power requirements. The TI-SIMON128/128 works at 6 percent of the power required for TI-AES from [8], while the two cores support the same level of mathematical and SCA security. In fact, our TI-SIMON128/128 works at 22 percent of the power requirement of the unprotected AES [8], and 30 percent of the power requirement of the unprotected PRESENT [22]. The higher-order threshold implementation of SIMON128/128 works at less than 40 percent of the power required for the unprotected AES or PRESENT.
Mask Generation
While ordinary cryptographic cores only take the plaintext and a key as input, masked implementations additionally need random values to be used as masks. Most masked implementations reported in the literature focus only on the masked encryption core and assume inputs to be already masked [8], [9], [10]. Often, additional randomness needed during encryption also has to be provided externally [9]. In order to provide a fair comparison, Tables 2 and 3 compare the masked encryption cores of SIMON without any mask generation. However, to operate a masked encryption core, randomness for the masks has to be provided. In fact, implementations differ significantly in the amount of randomness consumed per encryption. Any d-share masked implementation requires d − 1 times the plaintext size (and sometimes the key size) in randomness for the initial sharing of the secret state (and the key). Some implementations also require additional fresh randomness during the encryption. For example, the AES implementations presented in [8] and [9] consume an additional 48 and 44 bits of randomness, respectively, per S-Box lookup. That means the total amount of consumed randomness per encryption for AES-128 is more than 7,000 bits [9]. The presented three-share TI-SIMON128/128 only needs 512 bits for the initial sharing. The five-share 2nd-order resistant HO-TI-SIMON needs 768 bits. The proposed cores do not require any additional random bits during operation.
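The randomness figures quoted above can be reproduced with a short calculation. The helper name is illustrative. The AES line assumes that only the 160 data-path S-Box lookups of AES-128 (16 per round over 10 rounds) are counted, which is an assumption made here to show how a figure above 7,000 bits arises.

```python
def initial_sharing_bits(block_bits, key_bits, state_shares, key_shares):
    """Random bits needed to split the state and the key into XOR shares.

    A d-share sharing of a w-bit value consumes (d - 1) * w random bits.
    """
    return (state_shares - 1) * block_bits + (key_shares - 1) * key_bits


# Three-share TI-SIMON128/128: state and key both in three shares.
ti_simon_bits = initial_sharing_bits(128, 128, state_shares=3, key_shares=3)

# 2nd-order core: five-share state, three-share key schedule.
ho_ti_simon_bits = initial_sharing_bits(128, 128, state_shares=5, key_shares=3)

# Fresh randomness of the AES design in [9]: 44 bits per S-Box lookup,
# assuming 16 lookups per round over 10 rounds (data path only).
aes_fresh_bits = 44 * 16 * 10

print(ti_simon_bits, ho_ti_simon_bits, aes_fresh_bits)
```

The first two values are 512 and 768 bits, matching the text, and the AES estimate under the stated assumption is 7,040 bits.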
Generating randomness in hardware is a well-studied problem. A comprehensive introduction is given in [23]. Masks are usually generated using a deterministic Random Number Generator (RNG), which takes a short random seed and outputs a random-looking sequence of the desired length. The requirements on randomness for masks are much lower than for other cryptographic purposes, as the adversary can never observe the values directly, only via a noisy side channel. As such it is customary to use Linear Feedback Shift Registers (LFSRs), which feature good statistical properties and high throughput at a low area overhead [10], [14], [24]. For our TI-SIMON cores, it is sufficient to initialize the RNG and operate it to produce two mask bits per plaintext bit and key bit moved into the encryption core during initialization. Hence, the design does not create a performance overhead. As an example, we used the LFSR with the characteristic polynomial ($x^{22} + x + 1$) to generate the masks. The LFSR is very efficient in hardware and can be configured to output many bits per cycle, e.g., the 4 or 6 bits needed for the two presented TI-SIMON cores, respectively. The additional area overhead incurred by the LFSR is given in Table 4 and is at most seven slices (or 127 GEs) for the higher-order TI-SIMON. Thus, adding a mask generator to the implementation increases the area of the cryptographic engine by at most 5 percent.
Table note: The numbers reported for AES and PRESENT are for their unprotected versions.
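A software model of such a mask generator might look as follows. This is a sketch that assumes a Fibonacci-style LFSR, where the characteristic polynomial $x^{22} + x + 1$ corresponds to the recurrence s[t+22] = s[t+1] XOR s[t]; the class name and interface are hypothetical, and the paper does not state which LFSR arrangement was used. A hardware version would unroll the update to deliver 4 or 6 bits per clock cycle, which `next_bits` only imitates by stepping repeatedly.

```python
class MaskLfsr:
    """22-bit Fibonacci LFSR with recurrence s[t+22] = s[t+1] ^ s[t]."""

    DEGREE = 22

    def __init__(self, seed: int):
        seed &= (1 << self.DEGREE) - 1
        if seed == 0:
            raise ValueError("the all-zero state never leaves zero")
        self.state = seed  # bit j of the state holds s[t + j]

    def next_bit(self) -> int:
        out = self.state & 1                              # s[t]
        feedback = (self.state ^ (self.state >> 1)) & 1   # s[t] ^ s[t+1]
        self.state = (self.state >> 1) | (feedback << (self.DEGREE - 1))
        return out

    def next_bits(self, count: int):
        """Return `count` consecutive output bits as a list."""
        return [self.next_bit() for _ in range(count)]
```

For example, `MaskLfsr(seed).next_bits(4)` would supply the two mask bits for one plaintext bit and the two mask bits for one key bit of the three-share core.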
RESISTANCE ANALYSIS
In this section, we study performance of the proposed cores against side channel analysis. First, we show a practical attack against the unprotected core of SIMON128/128 as proposed in [12] . Then, we study the 1st-order secure SIMON as proposed in Section 3.2 with two experiments. One experiment supports the claim of being secure against first-order attacks, while the other experiment shows possible higherorder leaks. Thereafter, we apply similar experiments to study performance of the 2nd-order secure SIMON as proposed in Section 3.3.
The practical test setup consists of a SASEBO-GII board to develop the hardware design, a Tektronix DPO-5104 oscilloscope running in FastFrame mode to collect the power traces at 100 MS/s and a ZFL-1000LN amplifier to improve resolution of the collected traces. All the studied SIMON cores are bit-serialized versions of SIMON128/128 requiring 64 clock cycles per round. The studied secured versions utilize parallel cores, hence requiring the same number of clock cycles. Given that the board is clocked at 3 MHz, one encryption round can be recorded in 2,133 samples. Although the underlying core computes the entire encryption in each execution which is necessary to assure correct outputs, the oscilloscope captures only the first 9,000 samples of each power trace which covers slightly more than the first four rounds. This range is sufficient for side channel analysis where the first couple of rounds are the typical target, while keeping the dedicated computing resources in a manageable range. The number of traces in one of the experiments exceeded 100 million traces requiring 1 TB of storage.
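The sample counts quoted above follow directly from the sampling rate, the clock frequency, and the 64 clock cycles per round:

$$\frac{100\ \text{MS/s}}{3\ \text{MHz}} \approx 33.3\ \text{samples per cycle}, \qquad 64 \times 33.3 \approx 2{,}133\ \text{samples per round}, \qquad \frac{9{,}000}{2{,}133} \approx 4.2\ \text{rounds}.$$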
We implemented the masked designs assuming that the random masks are applied from an external source. Hence, the input to the cores is already in a masked form.
Test Methodologies
In general, there are two different goals aimed for in side channel analysis of cryptographic cores. The first goal is to prove possibility of attacking a certain core. This goal is achieved by demonstrating a successful side channel attack. Here, we use correlation power analysis, which is one of the most powerful unprofiled DPA attacks [25] . The concept of this attack is to estimate the change in power consumption at one certain point in the trace using a computer model on the known input and a key hypothesis. The correct key hypothesis will result in a maximum correlation between the modeled traces and the actual ones.
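The principle of correlation power analysis can be illustrated on simulated data. The sketch below is a toy, not the attack carried out in this work: the leakage model (the AND of two key-whitened plaintext bits), the noise level, and all names are assumptions chosen to keep the example small. For every key hypothesis, the modeled values are correlated with the simulated traces, and the hypothesis with the highest correlation is returned.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def model(p, guess):
    """Hypothetical one-bit leakage: AND of two key-whitened plaintext bits."""
    hi = ((p >> 1) ^ (guess >> 1)) & 1
    lo = (p ^ guess) & 1
    return hi & lo


def simulate_traces(key, count, noise_sigma, rng):
    """One noisy sample per trace; stands in for a point of a power trace."""
    plaintexts = [rng.randrange(4) for _ in range(count)]
    traces = [model(p, key) + rng.gauss(0.0, noise_sigma) for p in plaintexts]
    return plaintexts, traces


def cpa(plaintexts, traces, guesses=range(4)):
    """Return the best key guess and the correlation for every guess."""
    scores = {g: pearson([model(p, g) for p in plaintexts], traces) for g in guesses}
    return max(scores, key=scores.get), scores
```

A non-linear target is used on purpose: with a purely linear one-bit target, a wrong guess would only flip the sign of the correlation, which makes hypotheses harder to tell apart.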
The other goal is to prove side channel resistance of a certain core. A secure core (with respect to side channel analysis) has the property that the power trace of processing any selected input cannot be distinguished from the power trace of processing a random input. In other words, the power trace does not give any advantage to the adversary. Although distinguishability may or may not lead to a full break, it indicates some form of information leakage. Unfortunately, security under this definition cannot be practically proven. However, statistical tools can show that there is no basis to support the opposite claim. In statistical terms, we denote the set of power traces collected at any selected, fixed input as F and the set of power traces collected at random inputs as R. We set the null hypothesis as ($\mu_F = \mu_R$), i.e., the two sets have the same mean. Then, Welch's t-test is used to find any evidence to reject the null hypothesis. Here, the t metric is

$$t = \frac{\bar{F} - \bar{R}}{\sqrt{\dfrac{s_F}{N_F} + \dfrac{s_R}{N_R}}},$$

where $\bar{F}$ and $\bar{R}$ are the sample means, $s_F$ and $s_R$ are the sample variances, and $N_F$ and $N_R$ are the number of traces in each set. The null hypothesis is rejected if the t metric exceeds a pre-specified critical value. This metric focuses only on first-order leakages; however, it can easily be extended to higher-order leakages as follows. Let us denote the random variable of the power traces by X. We use $\bar{X}$ and $s_X$ as the sample mean and sample standard deviation, respectively. In order to analyze second-order leakage we use $(X - \bar{X})^2$, and for orders d greater than 2 we use $\left(\frac{X - \bar{X}}{s_X}\right)^{d}$.
The natural way of computing the higher-order analysis is by processing the traces twice: the first time to compute the mean and the second time to calculate the mean-free traces. This model of analysis can be quite time consuming since all the traces have to be processed even if the device fails for a rather small number of traces. Schneider and Moradi in [26] introduced a way to compute the mean-free traces in a one-pass manner so that an early exit from the analysis is possible if leakage is found in the early stages. Using Welch's t-test as a non-parametric leakage detection test was proposed by Goodwill et al. [11]. We used this test to analyze the 1st-order secure SIMON, where the number of collected traces was moderate. However, in order to analyze the 2nd-order secure SIMON we collected more than 100 million traces over several days, during which environmental variables started to show significant effects on the test results.
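The two-pass procedure can be sketched for a single sample point as follows. The function names are illustrative, each set is assumed to be centered with its own sample mean, and the preprocessing for orders above two uses the standardized form that is common in the leakage detection literature. This is the straightforward two-pass variant, not the one-pass method of [26].

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def welch_t(fixed, rand):
    """Welch's t statistic for one sample point of two trace sets."""
    return (mean(fixed) - mean(rand)) / sqrt(
        sample_variance(fixed) / len(fixed) + sample_variance(rand) / len(rand)
    )


def preprocess(samples, order):
    """Second pass: turn raw samples into order-d test inputs."""
    if order == 1:
        return list(samples)
    m = mean(samples)  # first pass over the set
    centered = [x - m for x in samples]
    if order == 2:
        return [c ** 2 for c in centered]
    s = sqrt(sample_variance(samples))
    return [(c / s) ** order for c in centered]


def leaks(fixed, rand, order=1, threshold=4.5):
    """True if the order-d t statistic exceeds the critical value."""
    t = welch_t(preprocess(fixed, order), preprocess(rand, order))
    return abs(t) > threshold
```

Two sets with identical means but different spreads show the point of the higher-order test: `leaks(..., order=1)` stays below the threshold while `leaks(..., order=2)` flags the difference in variance.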
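Recently, Ding et al. recommended the use of the paired t-test instead of Welch's t-test [27]. Here, we compute a new set $\mathcal{D}$, where each trace in $\mathcal{D}$ is the difference between one trace in $F$ and one trace in $R$,

$$D_i = F_i - R_i,$$

where $D_i$ is one trace from the set $\mathcal{D}$. The two traces of each pair should be captured very close in time to each other, so that any environmental variations cancel out. In this case, the null hypothesis becomes $\mu_D = 0$ and the t metric is

$$t = \frac{\bar{D}}{\sqrt{s_D / N_D}},$$

where $\bar{D}$ and $s_D$ are the sample mean and sample variance of $\mathcal{D}$, and $N_D$ is the number of trace pairs.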
We use the paired t-test to analyze the 2nd-order secure SIMON. For both tests, we used a critical value of 4.5, as in many previous papers [11], [26], [28]. In other words, the null hypothesis is rejected if $|t| > 4.5$. This critical value is equivalent to a confidence level greater than 99.999 percent.
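The benefit of pairing traces captured close in time can be seen in a small synthetic example. The sketch below is only illustrative: the drift and noise values are invented test data, and the function names are hypothetical. A slow drift common to both sets inflates the variances seen by Welch's test, while it cancels in the pairwise differences.

```python
import math
from statistics import fmean, variance

CRITICAL = 4.5  # critical value used for both tests in the text


def welch_t(fixed, rand):
    return (fmean(fixed) - fmean(rand)) / math.sqrt(
        variance(fixed) / len(fixed) + variance(rand) / len(rand))


def paired_t(fixed, rand):
    """Paired t metric; fixed[i] and rand[i] are assumed to be captured back to back."""
    diffs = [f - r for f, r in zip(fixed, rand)]
    return fmean(diffs) / math.sqrt(variance(diffs) / len(diffs))


def rejects_null(t, critical=CRITICAL):
    return abs(t) > critical


# Synthetic data (assumption): a slow environmental drift shared by both sets,
# a small fixed-vs-random offset of 0.2, and small deterministic "noise".
n = 50
drift = [0.5 * i for i in range(n)]
fixed = [drift[i] + 0.2 + 0.1 * ((2 * i) % 5 - 2) for i in range(n)]
rand = [drift[i] + 0.1 * ((3 * i) % 5 - 2) for i in range(n)]

t_welch = welch_t(fixed, rand)    # offset is buried under the drift variance
t_paired = paired_t(fixed, rand)  # drift cancels in the differences
```

In this toy setting only the paired metric crosses the critical value, which mirrors the motivation for switching tests once the measurement campaign spans several days.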
Breaking the Unprotected Core
We use $x(a)_b$ to denote bit number $b \in [0:63]$ of the word $x$ (either $l$ or $r$) in round number $a \in [1:68]$; $x$ can also denote the key $k$. We highlight that the previous SCA attacks against SIMON proposed in [29] and [30] were developed against the full-state implementation and cannot be used against the bit-serialized version that is our focus.
The first step in DPA is to identify a sensitive intermediate variable, which depends on both the input data and the secret key in a non-linear equation with as low confusion as possible. Linear equations can also work (as used in [29] ), but the attack in this case will need more traces to distinguish between the correct key and close-by ones (with small Hamming Distance). Low confusion means that the non-linear operation processes a small number of the key-bits. This is recommended to break the complexity of the secret key into smaller portions (divide-and-conquer).
Hence, we focus on attacking the output of the non-linear operation (the AND gate) in the second round of SIMON, where the first key word $k(1)$ becomes part of $l(2)$ to compute $l(3)$. We do this analysis bit by bit, following the bit-serialized implementation. The equation for the first bit of $l(3)$ is

$$l(3)_0 = r(2)_0 + k(2)_0 + l(2)_{62} + \left(l(2)_{63} * l(2)_{56}\right),$$

where $r(2)_0 = l(1)_0$, and

$$l(2)_i = k(1)_i + r(1)_i + l(1)_{i-2} + \left(l(1)_{i-1} * l(1)_{i-8}\right),$$

where $i \in \{62, 63, 56\}$ for this particular bit and the subtraction in the indices is done modulo 64. A similar equation can be written for all the bits of the internal state. In short, one bit of the left word in round three (e.g., $l(3)_0$) depends non-linearly on two key bits ($k(1)_{63}$ and $k(1)_{56}$) and linearly on another two key bits ($k(2)_0$ and $k(1)_{62}$), along with some input data.

The second step of a DPA attack is to select an accurate power model, which is a function that converts the sensitive intermediate variable into relative power consumption. In this work, we use the Hamming Distance (HD) power model, which is suitable for hardware modules. The HD represents the number of bit flips between two clock cycles. For example, we focus on the activity of the first register of the left word, representing the operation of overwriting bit $l(3)_0$ by bit $l(3)_1$ between cycles 65 and 66. However, we first need to consider an equation for the system power consumption.
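The bit-level equations for $l(3)_0$ can be cross-checked against a word-level round computation. The sketch below assumes that "+" denotes XOR, "*" denotes AND, bit 0 is the least significant bit, and the round uses left rotations by 1, 8 and 2; the function names are hypothetical.

```python
import random

MASK64 = (1 << 64) - 1


def rotl(word, s):
    """Rotate a 64-bit word left by s bits."""
    return ((word << s) | (word >> (64 - s))) & MASK64


def simon_round(l, r, k):
    """Word-level round: returns the next (l, r) pair."""
    return (r ^ k ^ rotl(l, 2) ^ (rotl(l, 1) & rotl(l, 8))) & MASK64, l


def bit(word, j):
    """Bit j of a 64-bit word, with the index reduced modulo 64."""
    return (word >> (j % 64)) & 1


def l2_bit(i, l1, r1, k1):
    """Bit i of l(2) from the round-1 state and the first key word."""
    return bit(k1, i) ^ bit(r1, i) ^ bit(l1, i - 2) ^ (bit(l1, i - 1) & bit(l1, i - 8))


def l3_bit0(l1, r1, k1, k2):
    """Bit 0 of l(3): non-linear in k(1)_63 and k(1)_56, linear in k(2)_0 and k(1)_62."""
    r2_0 = bit(l1, 0)
    return (r2_0 ^ bit(k2, 0) ^ l2_bit(62, l1, r1, k1)
            ^ (l2_bit(63, l1, r1, k1) & l2_bit(56, l1, r1, k1)))


def self_check(trials=200, seed=0):
    """Compare the bit-level formula with two word-level rounds on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        l1, r1, k1, k2 = (rng.getrandbits(64) for _ in range(4))
        l2, r2 = simon_round(l1, r1, k1)
        l3, _ = simon_round(l2, r2, k2)
        if l3_bit0(l1, r1, k1, k2) != bit(l3, 0):
            return False
    return True
```

Calling `self_check()` confirms that, under the stated bit-ordering assumptions, only four key bits enter the computation of this state bit, which is what makes a divide-and-conquer attack possible.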
The system power equation of the unprotected structure (only one share) is

$$P = P_{SRU} + P_{SRD} + P_{FIFO1} + P_{FIFO2} + N,$$
where $P_{SRU}$, $P_{SRD}$, $P_{FIFO1}$ and $P_{FIFO2}$ represent the power consumption of the SRU, the SRD and the two FIFO registers, respectively. $N$ is a noise component that represents the measurement noise along with all on-board activities that do not depend on the input data, including the key-schedule circuit. We did not write a separate term for the LUT, as its effect can be included in its output register, which is the first register of the SRU or the SRD depending on the clock cycle (the SRU in our example). During the update of cycle 65/66 and following the HD model, the power consumption of each component is written using the following notation: HW is the Hamming weight function (the number of set bits), $\|$ represents bit concatenation, $X \gg s$ is a circular shift right by $s$ bits and $|x|_{64}$ denotes trimming $x$ to the first 64 bits. $P_{SRD} + P_{FIFO1}$ and $P_{FIFO2}$ depend linearly on the plaintexts and the bits of $k(1)$. $P_{SRU}$ is the only component in the system power consumption that depends non-linearly on key bits.

Fig. 5 gives the results of attacking the studied SIMON cores with Correlation Power Analysis (CPA) [25]. In this attack, we used a 4-bit key hypothesis to represent the non-linear key bits involved in the computation of $l(3)_0$ and $l(3)_1$. Figs. 5a and 5b show the results for attacking the unprotected core. Fig. 5a shows the correlation coefficient as a function of time, focusing on 4,000 time samples to highlight the leakage. This result was generated after processing 25,000 traces. Fig. 5b shows the correlation associated with the correct key against those of the incorrect keys as the number of analyzed traces increases. Although the results highlight the success rate of recovering only four bits of the secret key, the remaining key bits could also be recovered by selecting other points in the algorithm using the same number of traces. These results show that the unprotected core can be broken with fewer than 1,200 traces.
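The structure of such a CPA with a 4-bit key hypothesis can be sketched on simulated data. Everything below is an assumption made for illustration: the simulated leakage is only the flip of one register bit from $l(3)_0$ to $l(3)_1$ plus Gaussian noise (the other power components are ignored), "+" is taken as XOR and "*" as AND with bit 0 as the least significant bit, and the unknown linear key bits are set to zero in the prediction because they can only flip the sign of the correlation.

```python
import math
import random
from itertools import product
from statistics import fmean


def bit(word, j):
    """Bit j of a 64-bit word, with the index reduced modulo 64."""
    return (word >> (j % 64)) & 1


def data_part(i, l1, r1):
    """Key-independent part of l(2)_i, so that l(2)_i = k(1)_i XOR data_part."""
    return bit(r1, i) ^ bit(l1, i - 2) ^ (bit(l1, i - 1) & bit(l1, i - 8))


def predict_hd(l1, r1, guess):
    """Predicted flip l(3)_0 -> l(3)_1 for guess = (k1_63, k1_56, k1_0, k1_57)."""
    g63, g56, g0, g57 = guess
    a63 = g63 ^ data_part(63, l1, r1)
    a56 = g56 ^ data_part(56, l1, r1)
    a0 = g0 ^ data_part(0, l1, r1)
    a57 = g57 ^ data_part(57, l1, r1)
    l3_0 = bit(l1, 0) ^ data_part(62, l1, r1) ^ (a63 & a56)
    l3_1 = bit(l1, 1) ^ a63 ^ (a0 & a57)
    return l3_0 ^ l3_1


def true_hd(l1, r1, k1, k2):
    """Simulated register flip using the full (secret) key words."""
    def l2(i):
        return bit(k1, i) ^ data_part(i, l1, r1)

    def l3(j):
        return bit(l1, j) ^ bit(k2, j) ^ l2(j - 2) ^ (l2(j - 1) & l2(j - 8))

    return l3(0) ^ l3(1)


def pearson(xs, ys):
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def cpa(inputs, traces):
    """Rank all 16 hypotheses by the absolute correlation with the traces."""
    scores = {}
    for guess in product((0, 1), repeat=4):
        preds = [predict_hd(l1, r1, guess) for l1, r1 in inputs]
        scores[guess] = abs(pearson(preds, traces))
    return max(scores, key=scores.get), scores


def demo(num_traces=2000, noise_sigma=0.5, seed=1):
    rng = random.Random(seed)
    k1, k2 = rng.getrandbits(64), rng.getrandbits(64)
    inputs = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(num_traces)]
    traces = [true_hd(l1, r1, k1, k2) + rng.gauss(0.0, noise_sigma) for l1, r1 in inputs]
    best, scores = cpa(inputs, traces)
    truth = (bit(k1, 63), bit(k1, 56), bit(k1, 0), bit(k1, 57))
    return best, truth, scores
```

Running `demo()` returns the best-ranked hypothesis next to the true key bits. A measured trace set would replace the simulated `traces`, and the correlation would be evaluated at every time sample rather than at a single point.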
Analysis of the Protected SIMON Cores
First of all, we applied the same correlation-based DPA attack against the 1st-order secure implementation of SIMON proposed in Section 3.2. Fig. 5c shows the results of the attack after processing 500,000 traces, while Fig. 5d shows the progress of the correlation coefficient over the number of traces. In this experiment, we collected traces from the three-share version synthesized with speed optimization. The attack fails to recover any part of the secret key, which supports our claim of security.
As detailed above, the results of the DPA attack show that our protected core can withstand this specific type of attack; however, they cannot prove its security against other side channel attacks. To support our claim of security, we applied the leakage quantification tests proposed in [11] and [27]. Fig. 6 shows the results of applying the non-parametric Welch's t-test to the unprotected core. The huge t values give a strong indication of leakage for this design, as expected. This experiment serves as a reference while also confirming the soundness of the test setup. Fig. 7 shows the results for the first- and second-order leakages of the 1st-order secure SIMON core after processing 20 million traces. The figures clearly show that the protected core is secure against first-order attacks. The second-order analysis reveals information leakage. This is expected, as the three-share implementation can only prevent first-order attacks. Fig. 8 shows how the t value progresses with the number of considered traces at the most leaking point of each order. It shows that second-order leakage becomes observable after processing around 10 million traces. Hence, even though second-order leakage exists, it is still very costly to exploit, as several million traces are needed.
Similarly, we analyzed the five-share SIMON core, which should provide 2nd-order security. In this case, we collected 100 million traces and used the paired t-test [27] for better efficiency. Fig. 9 shows the results of analyzing the first five orders of leakage. The figure shows that 100 million traces are not sufficient to extract any leakage from the proposed core (up to the fifth order). Fig. 10 shows how the metric progresses with the number of traces at the most leaking point of each order. Selecting a different analysis point for each order (rather than one fixed point) reflects the observation that higher-order leakages do not necessarily occur at the same time instance as the first-order leakage. These figures support our claim of security against second-order attacks.
Although the 2nd-order secure SIMON core may leak information at the third order leakage (or higher), we believe that 100 million traces are sufficient to prove security for all practical applications.
CONCLUSION
This work presents threshold implementations of the SIMON block cipher. These implementations can be used in a wide range of hardware applications where physical attacks may be a security issue. Depending on the need for side channel resistance, we presented two SIMON cores: a first-order resistant core and a second-order resistant core. The cores have very small area footprints of less than 100 slices and less than 170 slices, respectively. On a 90 nm ASIC, they require less than 5,700 and 7,600 GE, respectively, at less than 2 mW of power. Random mask generation can be included in these engines with minimal overhead. Both engines can thus serve as a drop-in replacement for AES, while providing area savings and significant power savings compared to an unprotected AES implementation. The t-test based leakage detection showed that both cores achieve the promised level of resistance, making first-order and second-order attacks, respectively, infeasible. We further showed that even higher-order attacks require millions of measurements for the first-order protected core, while we could not detect any leakage with up to 100 million measurements for the second-order protected core.
