Abstract In this paper we present a threshold implementation of the Advanced Encryption Standard's S-box which is secure against first- and second-order power analysis attacks. This security guarantee holds even in the presence of glitches, and includes resistance against bivariate attacks. The design requires an area of 7849 Gate Equivalents and 126 bits of randomness per S-box execution. The implementation is tested on an FPGA platform and its security claim is supported by practical leakage detection tests.
Introduction
Side-Channel Analysis (SCA) and more specifically Differential Power Analysis (DPA) [14] are considered to be powerful methods which can be used to extract secrets, e.g. keys or passwords, from cryptographic implementations running on embedded devices. The wide usage of these devices demands strong yet practical methods to mitigate this problem. One sound and popular method is masking [8, 13]. Masking works by splitting every intermediate variable that depends on the secret into several shares such that knowledge of any share does not provide any information about the intermediate variable. This splitting breaks the dependency between the average instantaneous power consumption and the sensitive intermediates handled by the implementation, and thus thwarts first-order DPA attacks.
In theory, however, a masked implementation can always be broken by a higher-order attack. Higher-order attacks consider information from several shares simultaneously and are increasingly difficult to mount as the order increases, both in terms of number of traces [8] and computational complexity. Nonetheless, second-order attacks have been shown to be practical to mount [17, 23, 24, 28, 31, 32], and hence protection against them is important. Higher-order masking schemes provide security guarantees against higher-order DPA attacks under specific assumptions, and up to a certain order.
When implemented in hardware, masking can lead to insecure designs due to glitches. Standard CMOS gates can glitch, and these glitches can cause the power consumption to depend on unmasked variables. This behavior degrades the security claims. For instance, Mangard et al. [16] present first-order attacks against masked implementations in hardware exploiting this idea.
Threshold Implementation (TI) [21, 22] is a specific masking approach that provides security even in the presence of glitches in the hardware. First-order TIs of the Advanced Encryption Standard (AES) [2] have been shown to be practically feasible as well as secure [4, 6, 20]. The theory of TI has recently been extended to provide higher-order security by Bilgin et al. [5].
Prouff and Roche's Higher-Order Glitches Free Implementation (HOGFI) [25] provides an alternative approach to TI. A first-order secure HOGFI of the AES S-box has been presented [18]. However, a higher-order extension has not yet been put into practice for AES. To our knowledge, the only higher-order implementation of this method is applied to PRESENT [11].
Contribution. We provide the first higher-order threshold implementation of the AES S-box. Our design shows up to second-order security (including bivariate attacks) in the presence of glitches. This paper is, to our knowledge, the first one to show this security in practice within the context of TI. Additionally, we discuss several trade-offs between randomness and area that can be considered.
Organization. Section 2 introduces our notation and the necessary background on higher-order TI and on Canright's decomposition of the AES S-box, on which we base our implementation. In Section 3, we present our hardware design, whose implementation costs are given in Section 4. The same section discusses these results by comparing them with other glitch-resistant implementations of the AES S-box and by investigating trade-offs in area and randomness through different design decisions. We detail our measurement setup and the results of the side-channel analysis in Section 5. Finally, we conclude in Section 6.
Preliminaries
In this section, we first introduce our notation, then provide a brief description of the threshold implementation technique to produce higher-order masked hardware implementations, and finally we end with the description of a compact (unmasked) implementation of the AES S-box that will serve as a basis of our masked implementation.
Notation
We use lower-case characters to denote elements in GF(2^n). A function f is defined from GF(2^n) to GF(2^m) and can be considered as an m-tuple of Boolean functions (f_1(x), ..., f_m(x)), where x ∈ GF(2^n). Similarly, x ∈ GF(2^n) can be denoted as (x_1, ..., x_n), where x_i ∈ GF(2). We use ⊕ for XOR and ⊗ for multiplication in a given field. If the multiplication is bit-wise, we drop ⊗.
In order to perform a masked computation, a secret variable x should be split into s_x shares x_i. In this paper, we consider Boolean masking for this initial split, which is described as follows: without loss of generality, the shares x_1, ..., x_{s_x−1} are drawn from independent and uniform random distributions and the share x_{s_x} is calculated such that x = ⊕_i x_i holds. A shared vector (sharing), e.g. (x_1, ..., x_{s_x}), is denoted by bold characters (e.g. x). A sharing is a uniform masking if, for each value x, the corresponding vectors with masked values occur with the same probability.
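The Boolean split described above can be sketched in a few lines of Python. This is only an illustration of the definition: the function names `share` and `unshare` are ours, and the standard-library `secrets` module stands in for whatever randomness source an actual design would use.

```python
import secrets
from functools import reduce

def share(x, num_shares, nbits):
    """Split x into num_shares Boolean shares of nbits bits each.

    The first num_shares - 1 shares are uniform and independent; the last
    one is chosen so that the XOR of all shares equals x.
    """
    shares = [secrets.randbits(nbits) for _ in range(num_shares - 1)]
    last = reduce(lambda a, b: a ^ b, shares, x)
    return shares + [last]

def unshare(shares):
    """Recombine a sharing by XORing all of its shares."""
    return reduce(lambda a, b: a ^ b, shares, 0)

# Example: a 6-share sharing of an 8-bit value.
x = 0xA7
xs = share(x, 6, 8)
assert unshare(xs) == x
```

Any proper subset of the shares is uniformly distributed and therefore carries no information about x; only the XOR of all shares does.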
In order to perform operations in the masked domain, the function f is also split into shares f_i, which are called component functions. The sharing of f is denoted by the bold character f.
Threshold Implementations
Threshold implementation (TI) is a masking method which provides security against higher-order DPA (hence the name higher-order TI). It differs from many other masking schemes in that it provides security even when non-ideal, glitchy cells are used, given the following property. d-th order non-completeness: any combination of up to d component functions f_i of f must be independent of at least one input share x_i. This property ensures that the combined leakage from the calculation of up to d component functions is independent of the sensitive variable x, given a uniform sharing x. We refer the reader to [5] for details. In addition, it has been shown that there always exists a d-th order non-complete sharing of a degree-t function f with s_in ≥ td + 1 input shares [5]. This implies that the number of shares required for a given security order increases with the degree of the function.
In [5], a method is provided for generating the component functions with s_in = td + 1 input shares and s_out = C(s_in, t) output shares, where C(s_in, t) is the binomial coefficient "s_in choose t". Hereafter, we refer to a sharing f with s_in input and s_out output shares as an (s_in, s_out)-sharing.
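To make the share counts concrete, the following Python sketch builds one possible second-order non-complete sharing of the bit product f(x, y) = xy with s_in = td + 1 = 5 input shares and s_out = C(5, 2) = 10 output shares, and checks the property exhaustively. The term assignment used here (one output share per index pair, with each term x_i y_i placed in the pair {i, i+1 mod 5}) is our own illustrative choice and is not necessarily the sharing given in [5].

```python
from itertools import combinations, product

T, D = 2, 2                                  # algebraic degree and security order
S_IN = T * D + 1                             # 5 input shares
PAIRS = list(combinations(range(S_IN), T))   # one output share per index pair

def build_terms():
    """Assign every product term x_i * y_j to exactly one output share."""
    terms = {p: [] for p in PAIRS}
    for i, j in product(range(S_IN), repeat=2):
        k = j if i != j else (i + 1) % S_IN  # partner index for x_i * y_i
        terms[(min(i, k), max(i, k))].append((i, j))
    return terms

TERMS = build_terms()

def shared_and(xs, ys):
    """Evaluate the 10 component functions on 5-share inputs xs and ys."""
    out = []
    for p in PAIRS:
        bit = 0
        for i, j in TERMS[p]:
            bit ^= xs[i] & ys[j]
        out.append(bit)
    return out

def is_noncomplete(order):
    """True if every set of `order` component functions misses some share index."""
    for combo in combinations(PAIRS, order):
        used = {idx for p in combo for term in TERMS[p] for idx in term}
        if len(used) == S_IN:
            return False
    return True

assert len(PAIRS) == 10
assert is_noncomplete(2)        # second-order non-complete
assert not is_noncomplete(3)    # but not third-order non-complete
```

Each component function touches only two share indices, so any two of them together touch at most four of the five indices. This is the counting argument behind the bound s_in ≥ td + 1.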
Uniform sharings vs. refreshing. As stated in Section 2.2, the computation of a sharing f requires the input x to be uniform. However, the fact that x is uniform does not automatically ensure that the output y is uniform. The lack of uniformity of y poses a problem if this variable is plugged into another sharing g(y), since the input y should also be uniform for g to be secure. By careful selection of the shared function f, it is possible to guarantee the uniformity of y given a uniform input x. We refer to such a sharing f as a uniform sharing. This uniformity allows an elegant composition mechanism: uniform sharings can be composed freely throughout the whole circuit [22]. If one cannot find uniform sharings, it is always possible to resort to refreshing the sharing y prior to applying g. This refreshing produces a uniform output at the cost of additional randomness.
The situation in higher-order threshold implementations is more subtle. It has been shown in [27] that the composition of uniform sharings without refreshing is not necessarily higher-order secure; the composition can be made higher-order secure by introducing a refreshing block. Thus, in our design, we refresh the output of each non-linear function to provide higher-order security.
Canright's Very Compact AES S-box
The AES S-box is an 8-bit permutation composed of a multiplicative inversion in GF(2^8) followed by a GF(2)-affine transformation [10]. Side-channel resistant implementations of this S-box are commonly based on subfield arithmetic as proposed by Rijmen [29] and explored by Canright [7]. This approach typically produces low-area circuits. It takes its name from the recursive decomposition of the S-box into computations in smaller fields. Namely, the GF(2^8) inversion is first decomposed into arithmetic operations in GF(2^4); and in turn the nonlinear operations are performed in the subfield GF(2^2). The resulting computation is composed of a GF(2)-linear map (LM) and an inverse GF(2)-linear map (ILM), several GF(2^2) multiplications, bitwise XORs and multiple instantiations of linear operations in GF(2^2) (l_i). For a detailed description of the individual operations, we refer to the original work [7].
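As a functional reference for the S-box itself, the sketch below computes it directly as an inversion in GF(2^8) followed by the affine transformation. It deliberately does not use the tower-field decomposition of [7]; it is an unmasked, straightforward model (with function names of our choosing) against which a decomposed or shared implementation could be checked.

```python
def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf256_inv(a):
    """Inversion as a^254 (square-and-multiply); 0 is mapped to 0."""
    result, base, e = 1, a, 254
    while e:
        if e & 1:
            result = gf256_mul(result, base)
        base = gf256_mul(base, base)
        e >>= 1
    return result

def rotl8(v, k):
    """Rotate an 8-bit value left by k positions."""
    return ((v << k) | (v >> (8 - k))) & 0xFF

def aes_sbox(a):
    """Inversion in GF(2^8) followed by the GF(2)-affine transformation."""
    inv = gf256_inv(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

assert aes_sbox(0x00) == 0x63
assert aes_sbox(0x53) == 0xED
```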
Hardware Implementation
This section gives an overview of the choices we made during the design of the second-order threshold implementations of the AES S-box.
Redefining the S-box decomposition
Converting the Canright S-box to a threshold implementation can be achieved on several levels. Each individual block can be composed with the neighbouring blocks or decomposed into smaller sub-blocks to attain different trade-offs between area, speed and randomness. We acknowledge the fact that randomness requirements also (indirectly) affect the area requirements. Hence, we strive for a compact, low-area implementation, and at the same time we try to keep the randomness requirements as low as possible.
For the discussion of our shared AES S-box, we rely on Figure 1. We choose to implement the square scale and multiplication operations in GF(2^4) as done by Bilgin et al. [4]. This adaptation requires less randomness and fewer clock cycles than sharing their subfield functions in GF(2^2), since some of the refreshing and registering that must follow the nonlinear operation is avoided. The inversion in GF(2^4) is of algebraic degree three. Since no small second-order non-complete sharing of such a function has yet been proposed, we share its subfield decomposition, which consists of a linear operation and three multiplications in GF(2^2). Although this increases the number of clock cycles by one, it keeps the area and randomness contained.
We decompose the calculation of the S-box into 6 pipeline stages. All stages are separated by registers, indicated by the vertical blue lines in Figure 1, in order to satisfy the non-completeness property within pipeline stages as explained in Section 2.2. Note that the register after the linear map and the register preceding the inverse linear map can be merged with the AES state or key registers.
Sharing the Nonlinear Operations
The GF(2^4) and GF(2^2) multipliers are the only nonlinear operations of our S-box. For the second-order threshold implementation we need a sharing with second-order non-completeness. There are two known sharings with s_in ≥ td + 1 input shares that fulfil the non-completeness condition for a function of the form f(x, y) = xy, with x, y ∈ GF(2) [3, 5]. One uses s_in = 5 input shares and results in s_out = 10 output shares; the other accepts s_in = 6 input shares and outputs s_out = 7 shares. We choose the (6, 7)-sharing for the multiplications with the following observations in mind:
- To achieve higher-order security, all nonlinear sharings need refreshing of all their output shares as noted in Section 2.2. Hence, we remask the outputs of all the multiplications. The details of how this refreshing is done are described in Section 3.4. The lower s_out, the lower the consumed randomness. By using only 7 output shares, the required randomness can be decreased by 30% in comparison to using 10 output shares.
With our choice of the (6, 7)-sharing over the (5, 10)-sharing we pay a slightly larger area for a substantial reduction of the required randomness. We use the following (6, 7)-sharing for all the bit-wise multiplications a = f(x, y) = xy that occur in the Boolean function description of the field multipliers.
For convenience, we provide the Boolean function descriptions of the field multipliers, for which we denote each element with its most significant bit on the left-hand side. GF(2
Sharing the Linear Operations
Linear operations are well-known to be easy to mask. The linear computation is performed on each share independently. This works for the following functions:
- Square scale in GF(2^4)
- l_1 and l_3 in GF(2^2)
- Linear map and inverse linear map
In our implementation, we chose to instantiate s_in = 6 copies of each of these functions.
The affine operations are performed in parallel with the multiplication in Stages 2 and 3. The output of the affine operation is added to the output of the multiplication; the novelty here is that we can add them before storing them in the register, and thus only store the result of the addition. In this way we use fewer registers and hence less area. This addition has to be performed carefully: we have to ensure that the output of the addition still satisfies non-completeness. For instance, one wire can carry the value a_1 ⊕ A(x_2, y_2) instead of a_1, where A symbolizes an affine operation. This process can alternatively be seen as the sharing of a single function f = x ⊗ y ⊕ A(x, y).
Mask Refreshing and Compression
Apart from the initial sharing, extra randomness is required in the refreshing blocks to attain second-order security. We use a ring structure for this refreshing as proposed in [27] . An advantage of this method is that the sum of the fresh masks does not need to be saved in an extra register. Since we operate on six input shares in each stage, we need to compress the seven preceding output shares into six shares. Figure 2 shows how the mask refreshing and the compression are performed over register boundaries after each nonlinear operation. The points where these refreshing and compression layers occur are depicted in Figure 1 by red circles. 
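A software model of the refreshing and compression step might look as follows. It is a sketch under two assumptions of ours, since Figure 2 is not reproduced here: each share is XORed with its own fresh mask and with the mask of its ring neighbour, and compression folds the seventh share into the sixth. The model also cannot express the register boundary, which in hardware must separate the refreshing from the compression.

```python
import secrets
from functools import reduce

def xor_all(values):
    """XOR of all values, i.e. the unshared value of a Boolean sharing."""
    return reduce(lambda a, b: a ^ b, values, 0)

def ring_refresh(shares, nbits, masks=None):
    """Remask n shares with n fresh masks arranged in a ring.

    Share i is XORed with masks[i] and masks[(i + 1) % n]. Every mask is
    used exactly twice, so the masks cancel in the XOR of all shares and
    their sum never has to be stored separately.
    """
    n = len(shares)
    if masks is None:
        masks = [secrets.randbits(nbits) for _ in range(n)]
    return [shares[i] ^ masks[i] ^ masks[(i + 1) % n] for i in range(n)]

def compress(shares, target):
    """Fold surplus shares into the last kept share (assumed wiring)."""
    kept = list(shares[:target])
    for extra in shares[target:]:
        kept[-1] ^= extra
    return kept

# Seven 4-bit output shares of a multiplier, refreshed and compressed to six.
out7 = [secrets.randbits(4) for _ in range(7)]
in6 = compress(ring_refresh(out7, 4), 6)
assert len(in6) == 6 and xor_all(in6) == xor_all(out7)
```

With this structure a refresh of seven w-bit shares consumes 7w fresh random bits.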
Implementation Cost and Trade-Offs
In this section, we elaborate on the implementation cost of our design and the impact of possible trade-offs. We use Xilinx ISE version 12.2 to verify the functionality of our design and Synopsys 2010.03 with the NanGate 45nm Open Cell Library [1] when providing area estimations. Table 1 shows the summary of the implementation cost for our S-box design. This is compared to previous first-order secure TIs of the AES S-box. We now briefly discuss these figures.
Area Requirements. From Table 1 we can see that our design uses ×1.84 more area than the first-order TI from [20], or ×3.53 more area than the more compact first-order secure design of [6]. Figures for both synthesis options, "compile" and "compile ultra", are provided. This is the price we pay in area to go from first-order security to second-order security. Table 2 lists the area contribution of the individual components of a single S-box. Note that the l_3 operation, which is the inversion in GF(2^2), is merely a swapping of wires and therefore does not contribute to the area. Also listed are the numbers for the second-order TI of the full AES based on [20]. As in [6], we use d + 1 shares in the key and state array. The registers that hold the randomness are not included in the figures. Note that the full AES was not tested in practice. In addition, Table 2 also lists the results of the synthesis for a Virtex 5 FPGA. We provide these figures for future comparison, following the HOGFI design from [18]. Currently, it is hard to discuss the impact of scaling from first-order to second-order security for HOGFI. An extrapolation factor of ×1.667 could be used, but may be too optimistic; e.g. for the PRESENT S-box, the factor of area increase was shown to be ×2.3 in [11]. Furthermore, the area of the second-order PRESENT S-box implementation amounts to 8338 GE, which is larger than our second-order design.
Randomness Requirements.
Our design requires four ring refreshings on seven shares: two on 4-bit shares, one on 2-bit shares and one on 8-bit shares. This results in 126 bits of randomness per S-box execution. For a full AES execution we require 3.814 kB; this includes the initial sharing and the randomness needed to increase the three shares of the state and key arrays to the six input shares of the S-box. When compared to other TIs that provide first-order security, our implementation requires ×2.6 more randomness than [20] or ×3.9 more than [6].
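The per-S-box figure can be reproduced with a short calculation, assuming (as a ring refresh suggests) that each refresh consumes one fresh mask per output share. The same calculation with 10 output shares gives the cost of the (5, 10)-sharing alternative, and the ratio between the two reproduces the 30% saving quoted for the (6, 7)-sharing.

```python
# Widths, in bits, of the shares refreshed by the four ring refreshings.
REFRESH_WIDTHS_BITS = [4, 4, 2, 8]

def refresh_bits(s_out, widths=REFRESH_WIDTHS_BITS):
    """Fresh random bits per S-box, assuming one mask per output share."""
    return s_out * sum(widths)

bits_6_7 = refresh_bits(7)     # 7 * (4 + 4 + 2 + 8) = 126
bits_5_10 = refresh_bits(10)   # 10 * 18 = 180
saving_percent = 100 * (bits_5_10 - bits_6_7) // bits_5_10   # 30

assert (bits_6_7, bits_5_10, saving_percent) == (126, 180, 30)
```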
Implementation Cost
For the first-order HOGFI implementation [18], 432 bits are consumed per S-box execution. This increases to 1440 bits for the second-order HOGFI of the AES S-box. Thus, our implementation consumes ×11.4 less randomness than a second-order HOGFI S-box.
Clock Cycle Requirements. Our S-box currently evaluates in six clock cycles. This is one more than [20] and three clock cycles more than the fastest TI [6] . In Section 4.2, we show that our S-box can easily be modified to achieve an evaluation speed of only four clock cycles. For the S-box implemented with HOGFI, 132 clock cycles are required for the first-order implementation. This scales to 220 cycles for the second-order implementation. This is substantially larger than our design.
Directions for Optimizations
We now list several trade-offs we can make to reduce the area and their effect on the randomness.
- We can reduce the area by changing the number of instances of the linear functions to d + 1 = 3. This was done in the nimble version of [6]. The required randomness is not changed by this modification.
- The registers for the output of the linear map and the input of the inverse linear map can be bypassed. This was shown in [4, 6, 20]. This bypass will not lead to a reduction in the randomness cost. The execution speed will however be improved by two clock cycles, making a whole AES encryption with this S-box as fast as the implementation of [20]. Note that in a full AES implementation, these two stages do not add an area overhead, as these registers can be merged with the State and Key Arrays.
- As previously mentioned, the (5,10)-sharing of [5] can be used to save area, but this will increase the required randomness to 180 bits.
Side-Channel Analysis Evaluation
For the purpose of evaluating the security claims of our approach, we implemented the design on a Virtex-II xc2vp7-fg456-5 FPGA on a SASEBO-G board. We generate the required randomness prior to the S-box execution on the control FPGA on board. We silence all activity on the board except the S-box lookup during the lookup itself. The design is clocked at 3.072 MHz and the instantaneous power consumption is acquired with a Tektronix DPO 7254C oscilloscope at 500 MS/s. The platform is very low noise.
Methodology. We pursue the following steps to test the soundness of our setup and masking. First, we switch off the randomness source. This effectively disables the masking countermeasure, thus the design is expected to be vulnerable. Nevertheless, the analysis is first performed in this setting to show that the setup is sound. Then, we repeat the analysis with the randomness source switched on. Any gain in resistance shown in the analysis is then exclusively due to masking.
Security claims. We claim security against first- and second-order power analysis attacks. That is, the adversary is bounded in the statistical order that he can use in the attacks. Note that since higher statistical moments are increasingly difficult to estimate in the presence of noise, higher-order attacks beyond our security claims require considerably more traces and are thus deemed less practical.
There is an orthogonal classification dividing attacks into uni- or multivariate ones according to the number of different time samples considered jointly in the analysis. This distinction is relevant in practice since multivariate attacks can be more cumbersome to mount. Here, our second-order claims also include the bivariate case, as the variables considered can be evaluated at different times.
Leakage detection. We use leakage detection to test our security claims [9, 12, 30] . A basic leakage detection to test the univariate first-order security claim is as follows. Two sets of measurements are acquired corresponding to a lookup of either a fixed value or a random value. A statistical test is applied to test the null hypothesis "the means of the two trace distributions are the same". We use Student's t-statistic and compare it against a threshold of ±5 corresponding to a confidence level > 99.999%. If the statistic surpasses the threshold, there is a statistically significant difference in the means, and thus the means carry some information on the handled value. The test is failed in this case and passed otherwise.
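The fixed-versus-random test can be sketched as follows using only the Python standard library. Two points are our assumptions: the statistic is computed in its unequal-variance (Welch) form, which is a common choice in leakage detection, whereas the text only names Student's t-statistic; and the traces are synthetic Gaussian noise with an optional mean shift, standing in for real measurements.

```python
import math
import random
from statistics import fmean, variance

THRESHOLD = 5.0  # decision threshold on |t| used in the text

def welch_t(a, b):
    """t-statistic for two samples with possibly unequal variances."""
    return (fmean(a) - fmean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b))

def t_trace(fixed, rand):
    """Per-time-sample t-statistic between two sets of equal-length traces."""
    return [welch_t([tr[k] for tr in fixed], [tr[k] for tr in rand])
            for k in range(len(fixed[0]))]

def passes(fixed, rand):
    """The test is passed if |t| stays within the threshold at every sample."""
    return all(abs(t) <= THRESHOLD for t in t_trace(fixed, rand))

def simulate(n_traces, n_samples, rng, leak_at=None, offset=0.0):
    """Hypothetical traces: unit Gaussian noise, optional mean shift at one sample."""
    traces = []
    for _ in range(n_traces):
        tr = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        if leak_at is not None:
            tr[leak_at] += offset
        traces.append(tr)
    return traces

rng = random.Random(2015)
random_set = simulate(1000, 16, rng)
leaky_fixed = simulate(1000, 16, rng, leak_at=5, offset=0.5)
clean_fixed = simulate(1000, 16, rng)
print("leaky set passes:", passes(leaky_fixed, random_set))
print("clean set passes:", passes(clean_fixed, random_set))
```

In the simulated leaky set, the mean at one time sample depends on which class the trace belongs to, which is exactly the situation the null hypothesis is meant to rule out.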
Higher statistical orders are tested in a similar fashion by first pre-processing the traces with an appropriate function. In our case, we use the centered product [8] . If no observable leakage is detected, classical key-recovery attacks will not succeed.
Univariate analyses
Masks off. In Figure 3 , we plot the t-statistic when the masks are switched off. We used 500 000 traces to produce plots with clear leakage which was more than the required amount of traces for detecting leakage. We can see clear excursions of the statistic beyond the specified threshold of ±5. The same behavior can be observed for the second and third moment. Hence, all three tests fail. This is expected and means that the measurement setup is sound.
Masks on. When the randomness is switched on, the implementation passes the leakage detection tests for first and second order using up to 100 million traces. This supports the claim of univariate first- and second-order security of the implementation.
Bivariate analysis
The previous analysis can be straightforwardly generalized to multivariate statistics. Each time sample is centered before being combined with all other time samples. The combination function is the centered product. To alleviate the computational effort, we sample at 200 MS/s to get ×2.5 shorter traces.
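The bivariate pre-processing can be illustrated with a small simulation. The sketch assumes that each trace set is centered with its own per-sample mean and again uses the Welch form of the t-statistic; the two-sample traces are hypothetical and model a first-order masked bit whose two shares leak at different times, so they are expected to pass the first-order test but fail the bivariate second-order one.

```python
import math
import random
from statistics import fmean, variance

def welch_t(a, b):
    """t-statistic for two samples with possibly unequal variances."""
    return (fmean(a) - fmean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b))

def center(traces):
    """Subtract the per-time-sample mean of the set from every trace."""
    n = len(traces[0])
    means = [fmean([tr[k] for tr in traces]) for k in range(n)]
    return [[tr[k] - means[k] for k in range(n)] for tr in traces]

def bivariate_t(fixed, rand):
    """t-statistic of the centered product for every sample pair (i, j), i <= j."""
    cf, cr = center(fixed), center(rand)
    n = len(cf[0])
    return {(i, j): welch_t([c[i] * c[j] for c in cf],
                            [c[i] * c[j] for c in cr])
            for i in range(n) for j in range(i, n)}

def two_share_traces(n_traces, rng, fixed_value=None, noise=0.5):
    """Hypothetical traces: sample 0 leaks a mask bit m, sample 1 leaks x XOR m."""
    traces = []
    for _ in range(n_traces):
        x = rng.getrandbits(1) if fixed_value is None else fixed_value
        m = rng.getrandbits(1)
        traces.append([m + rng.gauss(0.0, noise),
                       (x ^ m) + rng.gauss(0.0, noise)])
    return traces

rng = random.Random(7)
fixed = two_share_traces(2000, rng, fixed_value=0)
rand = two_share_traces(2000, rng)
t2 = bivariate_t(fixed, rand)
print({pair: round(t, 1) for pair, t in t2.items()})
```

The diagonal entries (i, i) correspond to the univariate second-order test (squaring), and the off-diagonal entries to the product of two different time samples. The number of pairs grows quadratically with the trace length, which is why shorter traces reduce the computational effort.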
Masks off. Figure 4 shows the result of the bivariate second-order test when the masks are off using 100 000 traces. Clear excursions beyond ±5 are seen, both on the diagonal (squaring) and off the diagonal (centered product of two different time samples). This serves as confirmation that our approach is sound; in particular, the change in the sampling rate did not have a significant effect on the magnitude of the t-statistic.
Masks on. We repeated the same test with the randomness switched on. The implementation shows resistance when using up to 70 million measurements. No excursions beyond the threshold were detected, and thus the test is passed.
Conclusion
In this work, we presented the first higher-order threshold implementation of the AES S-box with second-order univariate and bivariate security in the presence of glitches. We directed our attention to the S-box specifically as it is the hardest part to secure due to its nonlinearity. Our design was implemented on an FPGA and was shown to achieve our claimed security by resisting leakage detection tests with 70 million traces. Our design occupies a total area of 7849 Gate Equivalents. One S-box execution consumes 126 bits of randomness, which is largely required to achieve the bivariate security. We discussed the scaling of area, speed and randomness when increasing the order of security. Exactly how these costs scale for TI is our first direction for future research, while optimizing the randomness cost provides a second direction.
