Abstract. Glitches, caused by unwanted switching of CMOS gates, have been shown to leak information even when side-channel countermeasures are applied to hardware cryptosystems. The polynomial masking scheme presented at CHES 2011 by Roche et al. is a method that offers provable security against side-channel analysis at any order even in the presence of glitches. The method is based on Shamir's secret sharing and its computations rely on a secure multi-party computation protocol. At CHES 2013, Moradi et al. presented a first-order glitch resistant implementation of the AES S-box based on this method. Their work showed that the area and speed overheads resulting from the polynomial masking are high. In this paper, we present a first-order glitch resistant implementation of the PRESENT S-box, which is designed for lightweight applications, and show that it requires less area and randomness. Moreover, we provide a second-order glitch resistant implementation of this S-box and observe the increase in implementation requirements.
Introduction
Radio frequency identification (RFID) systems, wireless sensor networks, smart cards and other compact mobile applications have become prevalent in everyday life. Their widespread deployment in applications ranging from supply chains to intelligent homes and even electronic body implants has made their security a pressing issue. While block ciphers provide sufficient security against cryptanalysis for these applications, their hardware implementations are susceptible to side-channel leakage. By exploiting these leaks through side-channel analysis (SCA), a cryptosystem can be compromised more easily than promised by the cryptanalytic security. A common side-channel analysis is Differential Power Analysis (DPA) [14]. DPA exploits dependencies between the instantaneous power consumption of a device and the intermediate values arising in the computation of a cryptographic operation.
Several countermeasures have been proposed to cope with these side-channels. Secure logic styles that balance the power consumption of different data values [24] can be used, or noise can be increased in the form of random delays, random execution orders or by inserting dummy operations [25]. Even though an analysis becomes harder as this noise increases, these techniques do not provide provable security. A popular countermeasure that does provide provable security under certain assumptions is masking [5, 10]. This method conceals sensitive information, such as key and plaintext related information, using random values. Compared to a naive implementation, a well implemented masked implementation typically offers more resistance against power analysis attacks, and makes the attack much more expensive as the order d of the masking increases. This masking order d in turn defines the order d + 1 of the attack needed to retrieve the sensitive information. This attack order sets the number of shares that are jointly exploited by either analyzing the (d + 1)th-order statistical moment of the leakage at one point in time or by nonlinearly combining leakages from d + 1 points in time. Such an attack is known as a (d + 1)th-order DPA attack. A dth-order secure implementation can consequently always be broken by a (d + 1)th-order attack. When the attack order is larger than one, this is known as a higher-order DPA (HO-DPA) attack [5, 17].
Masking is, however, weakened by the switching behaviour of CMOS transistors, the so-called glitching effect [15, 16]. Two masking schemes that show provable security against DPA in the presence of glitches, or glitch resistance for short, are the polynomial masking scheme [22] and threshold implementations [19]. While, at the time of writing, the latter achieves glitch resistance at the first order only, the former provides this security also for higher orders. Therefore, we consider the polynomial masking scheme in this paper.
Masking introduces an overhead on the area and throughput. To avoid overly large and slow implementations, we will focus on lightweight, i.e. compact and power-efficient, block ciphers. A popular lightweight block cipher is PRESENT [3] which, as of 2012, is part of the ISO/IEC 29192-2 standard [13], making its side-channel resistance relevant. Besides PRESENT, its S-box is also used in other lightweight cryptographic algorithms, including the LED block cipher [12], the GOST revisited block cipher [20] and the PHOTON lightweight hash function [11]. In this paper, we focus on glitch resistant implementations of the nonlinear part of PRESENT, the S-box, since this is typically the most challenging part of a masked implementation.
Related work. An algorithmic description of a first-order glitch resistant Advanced Encryption Standard (AES) implementation using the polynomial masking scheme is given in [22]. In [18], this description is used to implement a first-order glitch resistant AES S-box on an FPGA. The PRESENT S-box has, to our knowledge, not yet been implemented using polynomial masking.
Contribution. In this paper, we present a first- and a second-order polynomially masked implementation of the 4-bit PRESENT S-box. To our knowledge, this is the first second-order PRESENT S-box implementation showing resistance against second-order DPA in the presence of glitches. The implementations are based on the guidelines for the first-order glitch resistant AES implementation proposed in [18]. We also present experimental confirmation showing that the implementations indeed achieve their claimed security. To this end, we applied univariate and bivariate leakage detection based on Welch's t-test.
Organization. Section 2 introduces the necessary background regarding the polynomial masking scheme and the PRESENT S-box. The design decisions, hardware implementations and their costs are presented in Section 3. The SCA results are shown in Section 4. Finally, the conclusion is drawn in Section 5.
Preliminaries

PRESENT Block Cipher
The PRESENT block cipher [3] is a symmetric key encryption algorithm designed considering the heavy constraints on performance, area and timing requirements of lightweight hardware applications. Its block length equals 64 bits. Key lengths of 80 and 128 bits are supported, which are referred to as PRESENT-80 and PRESENT-128 respectively. For lightweight applications, PRESENT-80 is recommended. The PRESENT cipher performs 31 rounds followed by a final key whitening stage. Each round consists of a binary addition with the round key and a substitution-permutation network. The permutation layer is bit oriented and can easily be implemented by wiring, making it very hardware friendly. The substitution layer applies 16 identical 4-bit S-boxes governed by the substitution given in Table 1.
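As a point of reference, the substitution layer can be sketched in a few lines of Python. The lookup table is the S-box from the published PRESENT specification [3]; the function name `sbox_layer` and the convention that nibble 0 is the least significant one are assumptions made for this illustration only.

```python
# PRESENT 4-bit S-box as published in the cipher specification [3].
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def sbox_layer(state):
    """Apply the same S-box to each of the 16 nibbles of a 64-bit state."""
    out = 0
    for i in range(16):
        nibble = (state >> (4 * i)) & 0xF   # extract nibble i
        out |= SBOX[nibble] << (4 * i)      # substitute it in place
    return out
```

Because all 16 S-boxes are identical, a masked design only needs one protected S-box that is reused across the state.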
Polynomial Masking Scheme
Side-channel resistance in the presence of glitches can be achieved at any order by the polynomial masking scheme [22]. Sensitive variables are masked using Shamir's secret sharing scheme [23] and computations on the resulting shares are performed using the BGW secure multi-party computation protocol [2]. In Shamir's scheme, a secret $Z \in K \equiv \mathbb{F}_{2^m}$ is shared among $n < 2^m$ players such that $d + 1$ players are needed to reconstruct $Z$. To this end, a dealer generates a degree-$d$ polynomial $P_Z(X) \in K[X]$ with constant term $Z$ and secret, random coefficients $a_i$:

$$P_Z(X) = Z + \sum_{i=1}^{d} a_i X^i$$
When working in the field $K$, we will denote binary addition and field multiplication by $+$ and $\cdot$ respectively. This polynomial is then evaluated in $n$ distinct, non-zero elements $\alpha_1, \ldots, \alpha_n \in K$, which are called the public coefficients and are available to all players. Lastly, each resulting value $Z_i = P_Z(\alpha_i)$ is distributed to its corresponding player $i$. The secret $Z$ can be reconstructed using the first row $(\lambda_1, \ldots, \lambda_n)$ of the inverse of the $(n \times n)$ Vandermonde matrix $(\alpha_i^j)_{1 \le i,j \le n}$ as:

$$Z = \sum_{i=1}^{n} \lambda_i \cdot Z_i$$
This is exemplified for the second-order in Appendix B.
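The sharing and reconstruction steps can be tried out with a minimal Python sketch for three players and a first-order polynomial. It assumes the field GF(2^4) built with x^4 + x + 1, the illustrative public coefficients 1, 2 and 3, and interpolation coefficients computed by Lagrange interpolation at zero; the function names are hypothetical and not taken from the original design.

```python
import secrets

ALPHAS = [1, 2, 3]  # assumed public coefficients: distinct and non-zero

def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def gf16_inv(a):
    """Invert a non-zero element as a^14, since a^15 = 1."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r

def share(z, alphas=ALPHAS, d=1):
    """Evaluate Z + a_1 X + ... + a_d X^d at every public coefficient."""
    coeffs = [z] + [secrets.randbelow(16) for _ in range(d)]
    shares = []
    for alpha in alphas:
        acc = 0
        for c in reversed(coeffs):          # Horner's rule
            acc = gf16_mul(acc, alpha) ^ c
        shares.append(acc)
    return shares

def lambdas(alphas):
    """Interpolation coefficients for evaluating the polynomial at zero."""
    out = []
    for i, ai in enumerate(alphas):
        lam = 1
        for j, aj in enumerate(alphas):
            if i != j:
                lam = gf16_mul(lam, gf16_mul(aj, gf16_inv(ai ^ aj)))
        out.append(lam)
    return out

def reconstruct(shares, alphas=ALPHAS):
    """Recombine shares into the secret: the sum of lambda_i * Z_i."""
    z = 0
    for lam, s in zip(lambdas(alphas), shares):
        z ^= gf16_mul(lam, s)
    return z
```

With a degree-1 polynomial, any two of the three shares already determine the secret, which is why `reconstruct(shares[:2], ALPHAS[:2])` returns the same value as the full reconstruction.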
BGW's protocol defines how to securely operate on the shares. We can distinguish between operations that can be processed by all players independently and operations that need communication between the players. Multiplication of a share and a constant, addition of a share and a constant and addition of two shares can be processed by each player independently. As a result, these operations can be implemented straightforwardly. Multiplication of two shares, which is referred to as shared multiplication, requires the players to exchange information, which complicates its secure execution. This operation has to be performed in three steps [22] :
1. Each player multiplies its shares, resulting in a 2d-degree polynomial.
2. Each player masks the result of the previous multiplication and sends these shares to all other players.
3. Each player reconstructs the result by interpolation and evaluation in the public coefficients.
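The three steps above can be sketched as follows for the first-order case with three players. This is a functional model only: it says nothing about glitches or timing, and the public coefficients 1, 2, 3 as well as all function names are assumptions chosen for the example.

```python
import secrets

ALPHAS = [1, 2, 3]  # assumed public coefficients
D = 1               # masking order, so n = 2*D + 1 = 3 players

def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def gf16_inv(a):
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r

def share(z, alphas=ALPHAS, d=D):
    coeffs = [z] + [secrets.randbelow(16) for _ in range(d)]
    out = []
    for alpha in alphas:
        acc = 0
        for c in reversed(coeffs):
            acc = gf16_mul(acc, alpha) ^ c
        out.append(acc)
    return out

def lambdas(alphas):
    out = []
    for i, ai in enumerate(alphas):
        lam = 1
        for j, aj in enumerate(alphas):
            if i != j:
                lam = gf16_mul(lam, gf16_mul(aj, gf16_inv(ai ^ aj)))
        out.append(lam)
    return out

def reconstruct(shares, alphas=ALPHAS):
    z = 0
    for lam, s in zip(lambdas(alphas), shares):
        z ^= gf16_mul(lam, s)
    return z

def shared_multiplication(xs, ys):
    # Step 1: each player multiplies its own two shares (degree-2d sharing).
    t = [gf16_mul(x, y) for x, y in zip(xs, ys)]
    # Step 2: player i re-shares t[i] with fresh randomness; q[i][j] goes to player j.
    q = [share(ti) for ti in t]
    # Step 3: player j interpolates the values it received from all players.
    lam = lambdas(ALPHAS)
    n = len(ALPHAS)
    z = []
    for j in range(n):
        acc = 0
        for i in range(n):
            acc ^= gf16_mul(lam[i], q[i][j])
        z.append(acc)
    return z
```

The output is again a degree-d sharing, so it can be fed into the next operation without the polynomial degree growing.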
When the square of a share is desired, the shared multiplication can be omitted when the following conditions are imposed [22]:
- The public coefficients $\alpha_i$ are distinct and non-zero.
- The public coefficients $\alpha_i$ are stable over the Frobenius automorphism: for every $\alpha_i$, there exists an $\alpha_j$ such that $\alpha_i^2 = \alpha_j$.
Each player can then independently perform the squaring on its own share, but a reordering of the shares is needed between player $i$ and player $j$ when $i \neq j$ to keep the right public coefficient linked to its corresponding player. To achieve glitch resistance with this masking scheme, two conditions need to be fulfilled. Firstly, the number of players has to exceed twice the degree of the polynomial, i.e. $n > 2d$. Secondly, each player has to leak independently of all other players.
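The squaring shortcut can be checked with a small sketch. It assumes GF(2^4) built with x^4 + x + 1 and the public coefficients 1, 6 and 7, which form a set closed under squaring in that representation (6^2 = 7 and 7^2 = 6). This choice is an illustration and not necessarily the set used in the hardware design; the function names are hypothetical.

```python
import secrets

ALPHAS = [1, 6, 7]  # assumed Frobenius-stable set: 1^2 = 1, 6^2 = 7, 7^2 = 6

def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def gf16_inv(a):
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r

def share(z, alphas=ALPHAS):
    """First-order sharing: Z_i = Z + a_1 * alpha_i."""
    a1 = secrets.randbelow(16)
    return [z ^ gf16_mul(a1, alpha) for alpha in alphas]

def reconstruct(shares, alphas=ALPHAS):
    z = 0
    for i, ai in enumerate(alphas):
        lam = 1
        for j, aj in enumerate(alphas):
            if i != j:
                lam = gf16_mul(lam, gf16_mul(aj, gf16_inv(ai ^ aj)))
        z ^= gf16_mul(lam, shares[i])
    return z

def local_square(shares, alphas=ALPHAS):
    """Square every share locally, then hand it to the player owning alpha_i^2."""
    out = [0] * len(shares)
    for i, alpha in enumerate(alphas):
        j = alphas.index(gf16_mul(alpha, alpha))  # player whose point is alpha_i^2
        out[j] = gf16_mul(shares[i], shares[i])
    return out
```

No fresh randomness is drawn in `local_square`, which is the saving a dedicated squaring circuit would bring.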
Cyclotomic Classes
The masking complexity of an S-box is defined in [4] as the minimal number of nonlinear multiplications required to evaluate its polynomial. These nonlinear multiplications correspond to shared multiplications.
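When calculating a power $x^\alpha$ from another power $x^\beta$, a nonlinear multiplication can be omitted if and only if $\alpha$ and $\beta$ lie in the same cyclotomic class. A cyclotomic class is defined as follows. Let $m$ be the number of input bits and $\alpha \in [0, 2^m - 2]$. The cyclotomic class of $\alpha$ with respect to $m$ is $C_\alpha = \{\alpha \cdot 2^i \bmod (2^m - 1) : i \in [0, m - 1]\}$.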
For the PRESENT S-box, we work in the field $\mathbb{F}_{2^4}$. Its corresponding cyclotomic classes are:

$$C_0 = \{0\}, \quad C_1 = \{1, 2, 4, 8\}, \quad C_3 = \{3, 6, 12, 9\}, \quad C_5 = \{5, 10\}, \quad C_7 = \{7, 14, 13, 11\} \qquad (1)$$

An important property is that we can cycle through the elements of a cyclotomic class by squaring, which can be performed independently by all players when the conditions listed in Section 2.2 are fulfilled. As squaring is linear in $\mathbb{F}_{2^4}$, the S-box complexity equals the number of different transitions between these classes required to evaluate the S-box substitution function.
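The classes can be enumerated mechanically by repeatedly doubling an exponent modulo 2^m - 1, which mirrors repeated squaring of the corresponding power of x. The helper names below are illustrative.

```python
def cyclotomic_class(alpha, m):
    """Exponents reachable from alpha by repeated doubling modulo 2^m - 1."""
    return sorted({(alpha * 2**i) % (2**m - 1) for i in range(m)})

def all_classes(m):
    """Map the smallest exponent of each class to the sorted class."""
    classes = {}
    seen = set()
    for alpha in range(2**m - 1):
        if alpha not in seen:
            members = cyclotomic_class(alpha, m)
            classes[alpha] = members
            seen.update(members)
    return classes
```

For m = 4, `all_classes(4)` yields the five classes with representatives 0, 1, 3, 5 and 7.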
Hardware Implementation
In this section, the hardware implementations of the first-order and second-order glitch resistant PRESENT S-box are explained. First, the polynomial of the S-box and its evaluation order are established. Then the detailed first-order glitch resistant implementation is discussed. Afterwards, the modifications required to achieve second-order glitch resistance are given. This section is concluded with an overview of the implementation requirements.
Evaluation Order
The substitution of any 4-bit S-box can be expressed as a unique polynomial over $\mathbb{F}_{2^4}$ with a degree of at most $2^4 - 1 = 15$. This polynomial can be obtained by expanding the following expression [7]:

$$S(x) = \sum_{a \in \mathbb{F}_{2^4}} S(a) \left(1 + (x + a)^{15}\right)$$
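Using the Mattson-Solomon polynomial, the coefficients $c_i$ of $S(x) = \sum_{i=0}^{15} c_i x^i$ can directly be computed by:

$$c_i = \sum_{k=0}^{14} S(\alpha^k) \, \alpha^{-ki}, \quad 1 \le i \le 14,$$

with $c_0 = S(0)$ and $c_{15} = \sum_{a \in \mathbb{F}_{2^4}} S(a)$, where $\alpha$ is a primitive element in $\mathbb{F}_{2^4}$. If we use $x^4 + x + 1$ as irreducible polynomial for the construction of $\mathbb{F}_{2^4}$, we get the following polynomial for the PRESENT S-box given in Table 1.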
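The coefficient computation can be reproduced with a short script. This sketch assumes that a 4-bit value is read as a polynomial-basis element of GF(2^4) modulo x^4 + x + 1, that the element 2 (the class of x) is used as primitive element, and that the S-box table is the one from the PRESENT specification [3]. The boundary cases c_0 = S(0) and c_15 = sum of all S(a) are handled separately. Rather than quoting coefficient values, the script verifies that the resulting polynomial agrees with the table on all 16 inputs.

```python
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def gf16_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf16_mul(r, a)
    return r

def sbox_coefficients(alpha=2):
    """Coefficients c_0..c_15 of the S-box polynomial over GF(2^4)."""
    coeffs = [0] * 16
    coeffs[0] = SBOX[0]
    for i in range(1, 15):
        acc = 0
        for k in range(15):
            point = gf16_pow(alpha, k)               # alpha^k
            weight = gf16_pow(alpha, (-k * i) % 15)  # alpha^(-k*i)
            acc ^= gf16_mul(SBOX[point], weight)
        coeffs[i] = acc
    for a in range(16):                              # c_15 is the sum of all outputs
        coeffs[15] ^= SBOX[a]
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the polynomial at x with Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf16_mul(acc, x) ^ c
    return acc
```

Since the S-box is a bijection, the sum of all outputs is zero, so the degree-15 coefficient vanishes and only powers up to x^14 need to be evaluated.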
The evaluation order of this polynomial is an adaptation of the proposal by Carlet et al. in [4] to reduce the required memory and area by processing sequentially instead of in parallel. The block diagram of the PRESENT S-box evaluation is depicted in Figure 1. The gray multipliers symbolize a field multiplication with a constant, while the black multipliers represent a shared multiplication. Starting from input $x$, squaring is consecutively carried out until all elements of the cyclotomic class $C_1$ from Equation (1) are covered. The last element of that class is then multiplied with $x$ to access cyclotomic class $C_3$, where all elements are again obtained by squaring. After a multiplication with $x$, squaring is performed again to reach all elements in $C_7$. To access the final cyclotomic class $C_5$, a multiplication with $x^{11}$ is chosen, as multiplying our last obtained power with $x$ would lead back to class $C_1$. The value $x^{11}$ therefore needs to be stored separately. From this discussion it is apparent that a shared multiplier cannot be omitted. As our primary design goal is low area, we choose to implement only a shared multiplier to handle all shared multiplications. However, by evaluating the polynomial this way, the designs can easily be extended with a dedicated squaring circuit and benefit from a significant reduction of required randomness. This extension is left as future work.
Figure 1. Block diagram of the evaluation for the PRESENT S-box.
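The evaluation order described in this section can be traced on the exponents alone, working modulo 15 because x^15 = 1 for every non-zero x. The sketch below follows the textual description (three squarings, multiply by x, three squarings, multiply by x, three squarings, multiply by x^11, one squaring); the function name and the tuple layout are illustrative.

```python
def evaluation_schedule():
    """Return the sequence of (operation, exponent of x) pairs, exponents mod 15."""
    schedule = [("input", 1)]
    exp = 1
    # (exponent of the multiplier used to enter a class, squarings inside that class)
    plan = [(None, 3), (1, 3), (1, 3), (11, 1)]
    for mult, squarings in plan:
        if mult is not None:
            exp = (exp + mult) % 15        # nonlinear multiplication
            schedule.append((f"multiply by x^{mult}", exp))
        for _ in range(squarings):
            exp = (2 * exp) % 15           # squaring stays inside the class
            schedule.append(("square", exp))
    return schedule
```

Every power from x^1 to x^14 appears exactly once, using three nonlinear multiplications and ten squarings, so a design that routes all thirteen steps through the shared multiplier pays for randomness thirteen times.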
First-Order Glitch Resistant PRESENT S-box
To achieve first-order glitch resistance, both conditions in Section 2.2 have to be fulfilled. Namely, our sensitive variables need to be masked by a first-order polynomial and need to be shared between three players with independent side-channel leakage. In order to achieve this independent leakage, we choose to temporally separate the players' operations. After each operation, the intermediate results are stored and left unaltered while another player is active. The design is shown in Figure 2 and is similar to the AES S-box implementation from [18]. This design is compatible with all combinational finite field multipliers. The one used in our implementations is given in Appendix A.
Shared multiplier. As pointed out in Section 2.2, the shared multiplication differs from the other operations in that it needs communication between the players. To achieve this, the computations are divided in two parts and the communicated intermediate values are stored in registers.
Step 1 and Step 2 (Section 2.2) of the shared multiplication are performed in the mult_el1_i blocks. With every shared multiplication, each player receives a new random coefficient $a_i$ to remask the multiplication of its input shares $t_i$. The reconstruction in Step 3 is handled by the mult_el2_i blocks once all intermediate results are available.
The detailed working principle is described as a series of clock cycles. Such a series consists of six clock cycles and is related to the control signals $em_{1 \le i \le 6}$, which can be seen in Figure 2. During each series, one shared multiplication is realized.
- The first clock cycle of a series enables signal $em_1$. The two required inputs for the shared multiplier are selected by $selm_1$. At the same time, a new random number $a_1$ is fed to the mult_el1_1 block. Together with this random number, the fixed public coefficients $\alpha_1$, $\alpha_2$ and $\alpha_3$ are used to remask the multiplied input shares $t_1$.
- The same procedure is repeated on the second clock cycle using signal $em_2$ in block mult_el1_2 and on the third clock cycle using $em_3$ in block mult_el1_3. After the third clock cycle, all intermediate results are available.
- In the fourth clock cycle, by activating signal $em_4$, the intermediate results related to the first public coefficient $\alpha_1$ are stored in the registers $q_{1,1}$, $q_{2,1}$ and $q_{3,1}$. The combinatorial logic in block mult_el2_1 then performs the reconstruction using the interpolation coefficients $\lambda_1$, $\lambda_2$ and $\lambda_3$. This outputs the first share of the shared multiplication. The result is not saved in this clock cycle; this is done at the start of the next series, with the activation of the enable signal $em_1$.
- In the fifth and sixth clock cycles, the same principles as in the fourth clock cycle apply. The enable signal $em_5$ handles the reconstruction related to the second public coefficient $\alpha_2$ in block mult_el2_2 and $em_6$ serves the reconstruction related to the third public coefficient $\alpha_3$ in block mult_el2_3.
Note that, except for the registers, the shared multiplier is entirely combinatorial. Therefore, the mult_el1_i and mult_el2_i blocks are only active when a new value is assigned to their input registers. After one clock cycle, the intermediate values reach their stable states and the blocks stay idle until their input registers are changed again. By temporally separating the $em_i$ signals with a carefully designed control unit, we achieve the required temporal separation.
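The six-cycle series can be written down as a tiny behavioural model that lists which enable signal and which block are active in each cycle. This is only a bookkeeping sketch of the schedule described above, not a model of the hardware; the underscore spelling of the block names and the function name are assumptions.

```python
def multiplier_series():
    """Yield (cycle, enable signal, active block) for one shared multiplication."""
    for cycle in range(1, 7):
        player = (cycle - 1) % 3 + 1
        # Cycles 1-3 remask the local products, cycles 4-6 reconstruct the shares.
        stage = "mult_el1" if cycle <= 3 else "mult_el2"
        yield cycle, f"em_{cycle}", f"{stage}_{player}"

for cycle, signal, block in multiplier_series():
    print(cycle, signal, block)
```

Exactly one block is triggered per cycle, which is the property the control unit has to guarantee for the players to leak independently.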
Input selection. The right inputs for the shared multiplier are selected by the multiplexers in the ctrl_el_i blocks. A glitch on the select signal of a multiplexer can temporarily change the inputs of the shared multiplier and induce processing in a player that is supposed to be idle. This would result in an overlap of leakages of different players and would eradicate the temporal separation. To avoid this, these $selm_i$ signals are synchronised. As was noted in Section 3.1, we need to store one extra intermediate value $x^{11}$. When the shares of this value are output at the mult_el2_i blocks, the $es_1$, $es_2$ and $es_3$ signals follow the levels of the $em_1$, $em_2$ and $em_3$ signals to store the shares of $x^{11}$ in separate registers.
Addition and accumulation. To calculate the polynomial, the powers of $x$ need to be multiplied with a constant and accumulated with the previously obtained results. This is handled by the add_acc_el_i blocks. When the shares of a desired power of $x$ are ready at the outputs of the shared multiplier, the $ea_i$ signals activate with the corresponding $em_i$ signals. With the activation of an $ea_i$ signal, a new coefficient chosen by $selcoeff$ is fed to an input of the add_acc_el_i multiplier, resulting in the right multiplication of a constant and its corresponding power of $x$. In the first series of clock cycles, the constant value C of the polynomial is added to an empty register using the $sela_i$ signal, which activates with its corresponding $em_i$ signal. In all following series of clock cycles, the register output is chosen to accumulate the results. The $eo_i$ signal enables the output share of player $i$ when the register holds the final value. This signal also activates with its corresponding $em_i$ signal.
Second-Order Glitch Resistant PRESENT S-box
We will now discuss how to extend our first-order design to the second order. Again, both conditions in Section 2.2 need to be fulfilled. To provide second-order glitch resistance, our sensitive variables are now masked using a second-order polynomial and shared among five players. We again choose temporal separation to decouple the leakages of the different players. The operations in this (5,2)-sharing scheme are detailed in Appendix B. Figure 5 in Appendix C shows the resulting architecture diagram.
Shared multiplier. The mult_el1_i blocks now require two random coefficients instead of one to mask the multiplication of the inputs. Furthermore, the evaluation of the polynomial is done in five public coefficients and their squared value is needed. When hardcoding the public coefficients and their squares, we additionally require seven multiplications, seven additions and one register. Each player now requires five instead of three registers to share the intermediate results. The mult_el2_i blocks need two extra multiplications and two extra additions to perform the reconstruction in the (5,2)-sharing scheme.
The control schedule is changed to incorporate five players. The same principles from Section 3.2 apply, but we need 10 $em_i$ signals, the first five to control the mult_el1_i blocks and the last five to store the intermediate values in the registers.
Input selection, addition and accumulation. The only change made in these operations is the extension from three ctrl_el (resp. add_acc_el) blocks to five.
The security against second-order DPA in the presence of glitches of this implementation can theoretically be explained as follows. As a second-order polynomial is used to divide the shares among five players, the shares of at least three players are required to interpolate the masked secret. Mixing up to two observations of intermediate variables will therefore not lead to enough information to reveal the secret variable. Furthermore, as the computations of each player are temporally separated, the information leaked by glitches is confined to the share of that player only and is not influenced by the shares of other players. This theoretical proof is valid for all orders when appropriate changes to the players are considered.
Implementation Requirements
The total area in NAND gate equivalents (GEs) is 3594 GE and 8338 GE for the first- and second-order glitch resistant implementations respectively. The largest contributions come from the shared multiplier (37.8% and 59.6%) and the control unit (41.8% and 25.7%), both given for the first- and second-order implementations respectively. The detailed area requirements of the different blocks are given in Table 3 in Appendix D. The results are obtained from Synopsys 2010.03 using the NanGate 45nm Open Cell Library [1].
The first-order implementation requires 89 clock cycles from the activation of the request signal to the output of all shares. For the second-order implementation, this number becomes 149 clock cycles. The secure evaluation of the first-order PRESENT S-box requires 156 bits of randomness. If a squaring module is used, this randomness can drop to 36 bits at the cost of additional area. For secure evaluation of the second-order PRESENT S-box, the required randomness changes to 520 bits (resp. 120 bits when a squaring module is used). As all public coefficients should be distinct and non-zero, up to 15 players can be accommodated. By imposing the condition that n > 2d, this leads to a maximum of a seventh-order glitch resistant implementation for the PRESENT S-box. For all possible orders d of glitch resistance, the required amount of randomness and number of clock cycles are summarized in Table 2.
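The randomness figures quoted above can be reproduced by a simple count. The formula below is an inference that matches the four stated numbers, not a formula taken from the original text: it assumes n = 2d + 1 players, d fresh 4-bit coefficients per player for every pass through the shared multiplier, 13 passes when squarings also use the shared multiplier, and 3 passes when a dedicated squaring module is available.

```python
FIELD_BITS = 4          # elements of GF(2^4)
ALL_STEPS = 13          # 3 nonlinear multiplications + 10 squarings
NONLINEAR_STEPS = 3     # what remains with a dedicated squaring module

def randomness_bits(d, multiplications):
    """Random bits consumed, assuming n*d fresh field elements per shared multiplication."""
    n = 2 * d + 1       # smallest player count satisfying n > 2d
    return n * d * FIELD_BITS * multiplications

def max_order(field_size=16):
    """Largest d for which 2d + 1 distinct non-zero public coefficients exist."""
    return (field_size - 1 - 1) // 2

for d in range(1, max_order() + 1):
    print(d, randomness_bits(d, ALL_STEPS), randomness_bits(d, NONLINEAR_STEPS))
```

Under these assumptions the cost grows quadratically in d, which is consistent with the jump from 156 to 520 bits between the first and second order.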
SCA evaluation
In this section we provide experimental evidence that our implementations provide a reasonable guarantee against typical power analysis attacks. We perform leakage detection tests on the PRESENT S-box, implemented on a SASEBO-G board [21]. The board is externally clocked with a stable, relatively low-frequency clock source of 3.072 MHz. All the randomness required for the computations is generated by an AES-based PRNG on the control FPGA. All the tests were performed with 1M traces unless explicitly stated otherwise.
For our evaluation, we use the non-specific fixed-vs-random methodology of [6, 9] . In a nutshell, the leakage detection test assesses whether the means of power consumption traces, conditioned on any intermediate, are equal or not. In the context of first-order masking, this means whether the masking is sound or not. We stress that by using a non-specific test, we are targeting all intermediates appearing during the computation of an S-box. This allows us to test the implementation against a wide range of leakages, without assuming how the implementation may leak.
The original methodology starts by taking two sets of measurements corresponding to fixed plaintext and random plaintext. Then, a hypothesis test is applied sample by sample to test whether the means of the two populations are the same or not. Normally, a Student's t-test is applied. Having set a significance level beforehand, the result of the test is directly interpretable in terms of probability. In our case, a value of the t-test statistic beyond 4.5 means that there is leakage with high probability. For details on the test, we refer to Appendix E. For our purposes of testing the higher-order security, we adapt the methodology to analyze higher-order moments in univariate and bivariate distributions (two time samples jointly analyzed). This is achieved by preprocessing the power traces through a suitable combination function. In our case, we use the centered product.
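A minimal sketch of the per-sample test is given below. It uses Welch's form of the t statistic and the threshold of 4.5 mentioned in the text; the function names, the plain-list trace layout and the absence of any trace alignment or filtering are simplifying assumptions.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(fixed, random_):
    """Welch's t statistic for two groups of measurements at one time sample."""
    return (mean(fixed) - mean(random_)) / sqrt(
        variance(fixed) / len(fixed) + variance(random_) / len(random_))

def leakage_points(fixed_traces, random_traces, threshold=4.5):
    """Return the time indices where |t| exceeds the threshold."""
    leaking = []
    for t in range(len(fixed_traces[0])):
        a = [trace[t] for trace in fixed_traces]    # fixed-plaintext group
        b = [trace[t] for trace in random_traces]   # random-plaintext group
        if abs(welch_t(a, b)) > threshold:
            leaking.append(t)
    return leaking
```

In practice the two groups contain many thousands of traces each, and a time sample is only reported as leaking if the threshold is also crossed on an independent set of measurements.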
We begin with a univariate analysis of the first-order protected implementation. As a first sanity check of our experimental setup, we performed a univariate first-order test with the PRNG switched off, thus deliberately disabling the masking. The result of the t-test on the unmasked first-order implementation is given in Figure 6 in Appendix E. This clearly shows that the implementation is leaking since the t-test statistic trace exceeds the confidence threshold C = ±4.5 in several clock cycles, which is expected as the masking is inactive. If we repeat the experiment with the PRNG enabled, the t-test statistic never exceeds the predefined threshold as the top left corner of Figure 3 indicates.
We repeated the test on centered and squared traces. This is equivalent to testing whether there is information leakage on the variances. Note that the first-order protected implementation is expected to leak in the second moment, as Figure 3 indicates. This only provides us with the evidence that we indeed have enough traces to show that the first-order attack is more expensive in terms of traces than higher-order ones, and thus our goal of first-order security is attained.
We proceeded with a univariate analysis of the second-order glitch resistant implementation. The process follows the lines of the first-order protected implementations and the results are again shown in Figure 3. We can see that the implementation is indeed first- and second-order univariate secure up to 1M traces. The implementation leaks in the third order but this poses no problem to the security claims. We also performed a preliminary bivariate analysis. To this end, we preprocess each trace by first centering around a mean and then multiplying all possible pairs of time samples within a trace. This means that an $m$-sample trace is expanded into an $m^2$-sample trace, which results in a substantial increase in the computational and memory requirements. Then, a leakage test is performed on the preprocessed traces. To speed up the bivariate analysis, we opted for compressing the traces by a factor of 100.
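The preprocessing step can be illustrated with a short function. It assumes that the mean is taken per time sample over all traces passed in and that no compression is applied; the function name is hypothetical.

```python
def centered_product(traces):
    """Expand each m-sample trace into m*m products of centered samples."""
    m = len(traces[0])
    # Mean of every time sample, taken over all traces.
    means = [sum(trace[t] for trace in traces) / len(traces) for t in range(m)]
    expanded = []
    for trace in traces:
        centered = [trace[t] - means[t] for t in range(m)]
        expanded.append([centered[i] * centered[j]
                         for i in range(m) for j in range(m)])
    return expanded
```

The entries with i = j are the centered and squared samples used in the univariate second-order test, so the bivariate expansion contains the univariate case on its diagonal. The quadratic growth in trace length is what motivates compressing the traces first.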
As in the univariate case, we first carry out a sanity check to verify the soundness of this approach by performing a bivariate second-order analysis on the first-order secure implementation. This is expected to leak, and the results of Figure 4 confirm this. We obtained t-test statistic values within the region of interest larger than 20, clearly indicating second-order bivariate leakage. These leakages are close to the diagonal, meaning that leakage occurs by combining samples from adjacent clock cycles. The leakage is visible with 200k traces.
We repeated the same experiment with the second-order secure implementation and found no value exceeding our confidence threshold of 4.5 with 1M traces. This provides some evidence that the second-order implementation may indeed be secure. However, we feel we cannot provide a definite answer unless we exhaustively cover all possible pairs of time samples (without compression), something that is out of our current computational reach.
Conclusions
We implemented a first- and second-order glitch resistant PRESENT S-box using the polynomial masking scheme presented in [22]. We verified these implementations with both univariate and bivariate attacks and confirmed the claimed SCA resistance. Our implementations resulted in 3594 GE for the first-order and in 8338 GE for the second-order implementation.

Appendix B

First, five distinct non-zero elements in $\mathbb{F}_{2^4}$ need to be chosen. These are referred to as the public coefficients $\alpha_{1 \le i \le 5}$. Together with these points, the first row $(\lambda_1, \ldots, \lambda_5)$ of the inverse Vandermonde matrix $(\alpha_i^j)_{1 \le i,j \le 5}$ is needed. These interpolation coefficients can be calculated as:

$$\lambda_i = \prod_{j=1, j \neq i}^{5} \alpha_j \cdot (\alpha_i + \alpha_j)^{-1}$$

Here, the multiplicative inverse in our field is represented by $\cdot^{-1}$. Elements $\alpha_{1 \le i \le 5}$ and $\lambda_{1 \le i \le 5}$ are publicly available to all five players.
Sharing a value $X$ requires two secret and random coefficients $a_1$, $a_2$ and the public coefficients $\alpha_{1 \le i \le 5}$. The resulting shares $X_{1 \le i \le 5}$ are calculated as:

$$X_i = X + a_1 \cdot \alpha_i + a_2 \cdot \alpha_i^2$$
Each player receives exactly one share X i and has no access to any other share.
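Reconstruction of the secret value $X$ requires the interpolation coefficients $\lambda_{1 \le i \le 5}$:

$$X = \sum_{i=1}^{5} \lambda_i \cdot X_i$$

Multiplication with a constant $c$ is performed by each player on its own share:

$$Z_i = X_i \cdot c = Xc + (a_1 c) \cdot \alpha_i + (a_2 c) \cdot \alpha_i^2$$

Considering $(a_1 c)$ and $(a_2 c)$ as the new coefficients of the second-order polynomial, the shares $Z_{1 \le i \le 5}$ represent the desired output $Z = Xc$. Note that the reconstruction of the masked secret variable does not depend on the polynomial coefficients $a_1$, $a_2$, but on the interpolation coefficients $\lambda_{1 \le i \le 5}$, which only depend on the public coefficients $\alpha_{1 \le i \le 5}$.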
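The (5,2)-sharing operations described so far can be exercised with the following sketch. It assumes GF(2^4) built with x^4 + x + 1 and the illustrative public coefficients 1 to 5; the function names are hypothetical. It checks that multiplying every share by a public constant yields a valid sharing of the product, and that any three shares suffice for reconstruction.

```python
import secrets

ALPHAS = [1, 2, 3, 4, 5]  # assumed public coefficients: distinct and non-zero

def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def gf16_inv(a):
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r

def share(x, alphas=ALPHAS):
    """X_i = X + a_1 * alpha_i + a_2 * alpha_i^2 with fresh random a_1, a_2."""
    a1, a2 = secrets.randbelow(16), secrets.randbelow(16)
    return [x ^ gf16_mul(a1, al) ^ gf16_mul(a2, gf16_mul(al, al)) for al in alphas]

def lambdas(alphas):
    """lambda_i = product over j != i of alpha_j * (alpha_i + alpha_j)^-1."""
    out = []
    for i, ai in enumerate(alphas):
        lam = 1
        for j, aj in enumerate(alphas):
            if i != j:
                lam = gf16_mul(lam, gf16_mul(aj, gf16_inv(ai ^ aj)))
        out.append(lam)
    return out

def reconstruct(shares, alphas=ALPHAS):
    x = 0
    for lam, s in zip(lambdas(alphas), shares):
        x ^= gf16_mul(lam, s)
    return x

def mul_const(shares, c):
    """Each player multiplies its own share by the public constant c."""
    return [gf16_mul(s, c) for s in shares]
```

No communication between players is needed in `mul_const`, which is why constant multiplications in the accumulation stage are cheap compared with the shared multiplier.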
Addition of two shared secrets is executed in the following way: [...]

[...] the null hypothesis holds, indicating that there is no relation between the processed intermediate value and the instantaneous power consumption. In case the threshold is crossed, another t-test is performed on an independent set of traces. When the t-test statistic exceeds ±C at the same points in time, the null hypothesis can be rejected with a significance level related to C. In that case, the alternate hypothesis holds, indicating that the power consumption and the intermediate values are related in a statistically significant way, making the device potentially vulnerable to SCA attacks. Figure 6 shows the resulting t-test statistic in case the alternate hypothesis holds.
