The threat posed by side channels requires ciphers that can be efficiently protected in both software and hardware against such attacks. In this paper, we proposed a novel Sbox construction based on iterations of shift-invariant quadratic permutations and linear diffusions. Owing to the selected quadratic permutations, all of our Sboxes enable uniform 3-share threshold implementations, which provide first order SCA protections without any fresh randomness. More importantly, because of the "shift-invariant" property, there are ample implementation trade-offs available, in software as well as hardware. We provide implementation results (software and hardware) for a four-bit and an eight-bit Sbox, which confirm that our constructions are competitive and can be easily adapted to various platforms as claimed. We have successfully verified their resistance to first order attacks based on real acquisitions. Because there are very few studies focusing on software-based threshold implementations, our software implementations might be of independent interest in this regard.
Introduction
In the past decade, side channel analysis (SCA) has become a serious threat to various cryptographic devices. In this adversarial model, an attacker may observe information leakage from a device processing key-related information. For cryptographic engineers, efficiently implementing a good cipher is then no longer enough. They must also mitigate the threat of such leakage and integrate a proper countermeasure, which is often a non-trivial task.
Since they were proposed, Threshold Implementations (TI) [1, 2] have become a recognised countermeasure against power analysis [3-7] when hardware implementations are considered. Unlike Boolean masking schemes [8, 9], TI requires more shares, but the "non-completeness" property of TI ensures that in each computation logic gate, at least one of the (input) shares is missing. As a consequence, even in the presence of hardware glitches, this missing share guarantees that the observed leakage will not give out information about any secret intermediate value [2] and thus robustly protects against so-called first-order attacks. In this paper, we only consider threshold implementations that provide first order protections. (This is the full version of the paper accepted at CT-RSA 2019 with the same title.)
One obstacle in threshold implementations is that there are no trivial efficient constructions for arbitrary cryptographic components. Take 3-share TI schemes for instance: in theory, any arbitrary quadratic function can be re-written in a TI-shared form with 3 shares. In practice, however, considering the requirement of "uniformity", a uniform 3-share TI scheme may not exist [2]. For smaller components (e.g. Sboxes), this issue has been extensively studied up to affine equivalence [3, 10-12]. For larger components, there is no generic construction available. On the other hand, solutions for uniform TIs may exist with higher implementation costs, such as increasing the number of shares or adding fresh randomness. Recently, De Meyer, Moradi and Wegener proposed a bit-serialized implementation of the Sbox of AES [13]: although their implementation of AES can be easily deployed in many applications, it comes at the price of adding fresh randomness. Joan Daemen proposed a technique called "Changing of the Guards", which significantly eases the dilemma between uniformity and fresh randomness [14]. As the "Changing of the Guards" technique borrows randomness from the shares of other concurrent components, engineers no longer need to ensure uniformity for their TI schemes, as long as there are a few extra random bits available at the beginning of the encryption. Considering that the overhead of TI is already high, it is imperative to keep any extra cost as low as possible. Therefore, in this paper, we would like to avoid any fresh randomness and minimize the number of shares.
Instead of searching for efficient TI representations for existing Sboxes, we can also construct new Sboxes that are intrinsically suitable for TI protections. The TI forms of all 4×4 Sboxes were described in [3]. Boss et al. constructed several 8-bit Sboxes with round-based balanced Feistel, MISTY and SPN structures, where the core building blocks are 4-bit Sboxes with easier TI protections [15]. The main focus of their paper was finding Sboxes with efficient hardware TI implementations. However, the authors claimed their approach also "enables an efficient and low-cost implementation in software" (their "software implementation" refers to masked bitslice implementations, rather than TI-based software implementations). De Meyer and Varici further extended this approach to several new constructions (such as Generalized Feistel, Lai-Massey, Asymmetric SPN etc.) and provided implementation costs in terms of ASIC logic area [16].
It is not surprising that very few papers actually consider using TI-based software implementations. To the best of our knowledge, the only available TI constructions in software are TI-based PRESENT on an 8-bit micro-controller [17] and TI-based ARX ciphers [18]. The reason behind this is straightforward: the main concern that TI addresses (glitches) does not exist in software. The overhead of using TI-based countermeasures is usually much higher than using (bitsliced) masking. Thus in theory, there is little point in applying TI to software implementations. In practice however, it has been observed that d-order bitsliced maskings sometimes fail to provide d-order SCA protections. This is because the internal architecture of micro-processors is not publicly accessible. Even if a cryptographic engineer carefully writes their code in assembly, some implicit operations/registers may still mix different shares and produce exploitable leakage, as demonstrated in [19, 20]. In the worst case, as Balasch et al. suggested, a d-order masking may only achieve $\frac{d}{2}$-order security in practice [21].
Our contribution. In this paper, we aim to find several Sboxes that come with easier first order TI protections, on both software and hardware platforms. In contrast to Boss et al.'s work [15], we use shift-invariant quadratic permutations instead of smaller Sboxes [22]. Similar to the χ 2 function [23], any coefficient Boolean function of these permutations is simply a "rotated" version of another. In other words, the bit-width of the elementary computation logic, which we call "granularity" in this paper, can be 1. Combined with the idea of serial threshold implementations [24, 25], the granularity of a first order TI implementation can then be 1. Finer granularity brings more flexibility for cryptographic engineers, giving them more fine-grained trade-off options between execution time, logic area and power consumption.
Specifically, the benefits of such protected Sboxes include:
-No fresh randomness.
-Easier software implementations. Since the shared version of our TI function preserves the "shift-invariant" property to some extent, bit-slicing such a protected Sbox becomes easier.
-Flexible hardware implementations. As the granularity of such TI Sboxes is 1, in hardware, it is possible to implement only 1 computation unit, then get all other shared bits by shifting. Such a strategy can lead to a very compact footprint, at the price of taking more cycles to execute.
-Full implementations/security evaluations. Although all the implementations in this paper follow exactly the same rules as standard TIs, we have verified these implementations with real-world acquisitions.
Outline. In Section 2 we explain a few essential concepts, including the cryptographic properties for Sboxes, the principle of threshold implementations and our Sbox searching strategies. Section 3 first introduces the concept of shift-invariance, then presents a search for quadratic TI-uniform shift-invariant permutations. Based on the results of this search, we further construct Sboxes with an SPN network. Sections 4 and 5 discuss the possible implementation trade-offs on software and hardware platforms, respectively. Section 6 presents TVLA-based security evaluation results on both an ARM M0 core and a Kintex 7 FPGA.
Preliminaries

Cryptanalytic properties for Sboxes
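In a block cipher, the Sboxes provide the desired non-linear properties. A newly constructed Sbox must be evaluated for cryptographic properties, e.g. differential uniformity and linearity, to thwart differential and linear attacks. Let $F : \mathbb{F}_2^n \to \mathbb{F}_2^n$ be a function.
The differential uniformity of $F$ is defined as
$$\delta_F = \max_{a \neq 0,\; b} |D_F(a \to b)|, \qquad D_F(a \to b) = \{x \in \mathbb{F}_2^n : F(x) \oplus F(x \oplus a) = b\},$$
where $|D_F(a \to b)|$ denotes the cardinality of the set $D_F(a \to b)$ and is determined by the entry at the position $(a, b)$ in the difference distribution table of $F$.
The Walsh transformation of the function $F$ is defined as $W_F : \mathbb{F}_2^n \times \mathbb{F}_2^n \to \mathbb{Z}$ and is given as
$$W_F(a, b) = \sum_{x \in \mathbb{F}_2^n} (-1)^{a \cdot x \,\oplus\, b \cdot F(x)}.$$
The linearity of an Sbox gives a measure of its best linear approximation. The linearity of $F$ is defined as follows,
$$\mathrm{Lin}(F) = \max_{a,\; b \neq 0} |W_F(a, b)|.$$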
Besides, an Sbox should not have any algebraic properties, e.g. a low polynomial degree, which may be exploited by an adversary to mount an attack. It is known that the maximum algebraic degree of an m-bit permutation Sbox is m − 1.
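The two measures above can be computed directly from a lookup table. The following is a minimal sketch, assuming the standard definitions (maximum difference distribution table entry over non-zero input differences, and maximum absolute Walsh coefficient over non-zero output masks); the function names are ours, and the PRESENT Sbox [29] is used only as a familiar test input.

```python
def differential_uniformity(sbox):
    """Largest DDT entry over all non-zero input differences a."""
    size = len(sbox)
    best = 0
    for a in range(1, size):
        row = [0] * size
        for x in range(size):
            row[sbox[x] ^ sbox[x ^ a]] += 1
        best = max(best, max(row))
    return best


def parity(v):
    """Parity (XOR of all bits) of the integer v."""
    return bin(v).count("1") & 1


def linearity(sbox):
    """Largest absolute Walsh coefficient over all a and non-zero b."""
    size = len(sbox)
    best = 0
    for a in range(size):
        for b in range(1, size):
            # a.x XOR b.F(x) is the parity of (a & x) ^ (b & F(x))
            w = sum(1 - 2 * parity((a & x) ^ (b & sbox[x])) for x in range(size))
            best = max(best, abs(w))
    return best


# PRESENT Sbox, used here only as a well-known 4-bit example.
PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

print(differential_uniformity(PRESENT), linearity(PRESENT))
```

Both loops are exhaustive, which is affordable for 4-bit and 8-bit tables; larger Sboxes would call for a fast Walsh transform instead.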
Threshold Implementation
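In side channel research, threshold implementation (TI) usually refers to a countermeasure that is based on secret sharing. For an m × n vectorial Boolean function f where each input x is shared as an s-length vector $x = (x^{(1)}, \ldots, x^{(s)})$, TI implements a few shared functions $f^{(j)}$ that satisfy:
-Correctness: the XOR of all output shares $f^{(j)}$ equals f(x).
-Non-completeness: each $f^{(j)}$ is independent of at least one input share.
-Uniformity: if the input sharing is uniformly distributed, so is the output sharing.
To ensure uniformity for permutations (m = n), we can simply check whether the shared version of f is an $ms$-bit permutation [3] (or prove it is invertible [14]).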
Constructing TI Sboxes
To ensure non-completeness, threshold implementations need more shares for Boolean functions with higher degrees. As the implementation cost increases with the number of shares, the cheapest protected non-linear functions are quadratic (deg = 2) Boolean functions. For Sbox constructions, it is favourable to use permutations rather than arbitrary quadratic vectorial Boolean functions. Previous studies have successfully found uniform TI schemes for many quadratic permutations, including 3 × 3 and 4 × 4 Sboxes [3] and 5-bit permutations [27], and have made a few observations on 6-bit quadratic permutations [28].
All the results above serve as building blocks for larger Sboxes: although directly applying TI is difficult, we can always use smaller Sboxes/quadratic permutations with known TIs to build large Sboxes. Boss et al. started searching for 8-bit Sboxes with Feistel (Figure 1(a)), SPN (Figure 1(b)), and MISTY structures, using 4-bit TI Sboxes as building blocks [15]. De Meyer and Varici extended this search to other constructions, such as Double Misty, Asymmetric SPN and Generalized Feistel structures [16]. Since the building blocks are smaller Sboxes/permutations, such constructions give much more compact 8-bit Sboxes in hardware [15, 16]. Generally speaking, for an n-bit Sbox, its 3-share TI form would be a 3n-bit permutation. Although each share can be computed with only 2 input shares (2n bits), in hardware, increasing the number of inputs usually boosts the area cost. Using smaller TI-Sboxes as building blocks significantly reduces the overall implementation cost, but it is unclear whether such constructions can provide flexibility when considering other platforms. Neither of these papers discusses the possibilities of serial TI, an extra trade-off proposed back in 2013 [24]. Boss et al.'s work did mention software implementations, yet their argument is that fewer AND gates lead to more efficient bit-sliced masking in software, rather than any TI protection [15]. Neither of these papers presents security evaluations of their final implementations.
The notion of granularity
Irrespective of considering hardware or software implementations, constructions that feature multiple identical computation tasks usually give the cryptographic engineer more flexibility for the speed/cost trade-off. Taking hardware implementations for instance, all 4 bits in a PRESENT Sbox must be implemented with combinational logic, because all 4 bits are based on different Boolean functions [29] . Meanwhile, for the Keccak 5-bit χ 2 function, it is possible to implement only the circuit to do a 1 bit computation, as other 4 output bits can be computed through rotating the inputs [23] using the same circuit.
In this paper, we denote the output size of the smallest "gadget" to compute an Sbox as the "granularity". Clearly, the granularity for an unprotected PRESENT Sbox is 4, whereas for an unprotected 5-bit χ 2 function it is 1. A finer granularity gives crypto engineers more opportunities for trade-offs: for instance, they can opt for a serial (slower) implementation, or a parallel (faster) implementation in hardware. Granularity also plays a critical role in software implementations. As most processors have intrinsic bit-widths (8, 32 or 64), when performing bitwise operations, most of the bit-width will be wasted unless all the bits require the same operation. In order to take full advantage of the bit-width, a bit-slice implementation usually "slices" the same bits from multiple Sboxes into one register. As the CPU processes multiple Sboxes simultaneously, the overall throughput increases. Implementations with finer granularity provide intrinsic parallelism, which may make the most of the bit-width of our processors without manually "slicing" from a lot of concurrent data blocks (e.g. Sboxes).
Constructing TI-Sboxes with better granularity
In this section, we present our TI-Sboxes search strategy. To achieve better implementation flexibility, we choose a different type of building blocks: instead of using 4 bit Sboxes with known TIs, our search utilizes the "Shift-invariant" [22] permutations. Such constructions usually lead to finer granularity (for each elemental operation) and give better implementation trade-offs for not only the Sbox itself, but also its TI-protection.
Shift-invariant: concept and previous works
Technically, an n × n vectorial Boolean function F is shift-invariant if for any rotated shift τ and any state x, F(τ(x)) = τ(F(x)) [22]. As stated in Daemen's thesis [22], "shift-invariant transformations can be implemented as an interconnected array of identical 1-bit output 'processors'" (granularity 1). Daemen further studied both linear and non-linear shift-invariant transformations, exploring their invertibility, local propagation and correlation properties [22]. As shift-invariance is closely linked to the concept of cellular automaton, Mariot, Picek, Leporati and Jakobovic searched up to 7 × 7 Sboxes from a cellular automaton perspective [30]. The most well-known outcome of this direction is the χ 2 function in Keccak. However, it is worth mentioning that without any other trick, χ 2 itself does not have a uniform 3-share TI.
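The definition F(τ(x)) = τ(F(x)) can be checked exhaustively for small n. The sketch below does so for a 5-bit instance of the Keccak χ map, y_i = x_i ⊕ (¬x_{i+1} ∧ x_{i+2}) with indices taken modulo 5. The bit ordering and the helper names are assumptions made for illustration only.

```python
N = 5
MASK = (1 << N) - 1


def rotl(x, r):
    """Rotate the N-bit word x left by r positions."""
    r %= N
    return ((x << r) | (x >> (N - r))) & MASK


def chi(x):
    """5-bit chi: every output bit uses the same rule on rotated inputs."""
    y = 0
    for i in range(N):
        x0 = (x >> i) & 1
        x1 = (x >> ((i + 1) % N)) & 1
        x2 = (x >> ((i + 2) % N)) & 1
        y |= (x0 ^ ((x1 ^ 1) & x2)) << i
    return y


def is_shift_invariant(F):
    """Check F(rot(x)) == rot(F(x)) for every state and every rotation."""
    return all(F(rotl(x, r)) == rotl(F(x), r)
               for x in range(1 << N) for r in range(N))


print(is_shift_invariant(chi))
```

A map such as x ↦ x ⊕ 1, which treats bit 0 differently from the other bits, fails the same check.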
Quadratic shift-invariant permutation with uniform TI
For an unprotected Sbox, shift-invariance ensures its granularity is equal to 1. However, considering the requirements of first order TI, its granularity also grows with the number of shares. Further reducing the granularity requires not only shift-invariance, but also its TI property: for any Boolean function f , if its direct shared form (i.e. Section 4.2 in [3] ) is uniform, its granularity can be reduced to 1, using a serial TI implementation [24, 25] . Thus, for granularity, our best option would be using quadratic shift-invariant permutations with a uniform direct sharing threshold implementation.
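As an illustration of a direct 3-share form and of the uniformity check, the sketch below shares the 5-bit χ map in the usual way: each output share is computed from two input shares only, and every cross product of a quadratic term is assigned to exactly one output share. The choice of χ as the example, the index convention and all function names are assumptions for illustration; this is not one of the permutations selected in this paper.

```python
N = 5  # width of the example map (assumption for illustration)


def bit(x, i):
    """Bit i of x, with the index taken modulo N."""
    return (x >> (i % N)) & 1


def chi(x):
    """Unshared reference: y_i = x_i ^ x_{i+2} ^ x_{i+1} x_{i+2}."""
    y = 0
    for i in range(N):
        y |= (bit(x, i) ^ bit(x, i + 2) ^ (bit(x, i + 1) & bit(x, i + 2))) << i
    return y


def share_fn(a, b):
    """One output share; it only ever sees the two input shares a and b."""
    y = 0
    for i in range(N):
        v = bit(a, i) ^ bit(a, i + 2)
        v ^= bit(a, i + 1) & bit(a, i + 2)
        v ^= bit(a, i + 1) & bit(b, i + 2)
        v ^= bit(b, i + 1) & bit(a, i + 2)
        y |= v << i
    return y


def direct_ti(x1, x2, x3):
    """Direct sharing: output share j never touches input share j."""
    return share_fn(x2, x3), share_fn(x3, x1), share_fn(x1, x2)


def is_uniform(shared):
    """For a permutation, uniformity holds iff the 3N-bit shared map is a permutation."""
    seen = set()
    for x1 in range(1 << N):
        for x2 in range(1 << N):
            for x3 in range(1 << N):
                seen.add(shared(x1, x2, x3))
    return len(seen) == 1 << (3 * N)


print(is_uniform(direct_ti))
```

Correctness (the XOR of the three output shares equals χ of the XOR of the input shares) holds by construction. The printed value reports whether this particular sharing is uniform, which is the property discussed above for χ 2.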
Our main building blocks for Sbox constructions are therefore quadratic shift-invariant permutations with uniform 3-share TIs. Although Daemen's thesis gave many useful results, it did not cover all possible nonlinear shift-invariant transformations. Fortunately, the search space for common Sbox sizes (n = 4 or n = 8) is small enough. For n × n shift-invariant transformations, the number of all possible quadratic transformations is equal to the number of n-bit quadratic Boolean functions, $2^{\sum_{i=0}^{2} \binom{n}{i}}$. The search space for 4-bit building blocks is $2^{1+4+6} = 2^{11}$, whereas for the 8-bit case it is $2^{1+8+28} = 2^{37}$. Among these transformations, we are interested in those that satisfy:
-The transformation itself is an n-bit permutation.
-Its direct 3-share TI is uniform.
Both properties are easy to check: for TI uniformity we simply check whether the shared form is still a 3n×3n permutation. For early abortion in this permutation check we first examine whether the coefficient Boolean function f is balanced. If it is not balanced, the transformation it derived cannot be a permutation. Additionally, we further limit our search to functions that satisfy:
-For bit $y_0$, its Boolean function always contains bit $x_0$. If not, we can always find a shift transformation τ such that $F' = F \circ \tau$ satisfies this condition (F is the shift-invariant transformation derived from f). For a shift-invariant F, τ and F commute. This means that for a small number of rounds (1 or 2) of the SPN network, τ can be integrated into the initial/final linear transformation, which does not affect the cryptographic properties.
-f does not have a constant term. For a shift-invariant transformation, the constant term can be either all-0 or all-1. As an all-1 constant has little impact on the cryptographic properties of F, we simply discard these choices.
For 4-bit quadratic functions, we found that 952 out of 2048 functions contain $x_0$ and have 0 as their constant term. 392 of them are balanced, whereas only 24 f lead to a 4 × 4 permutation F. Fortunately, all of the direct 3-share schemes are actually 12 × 12 permutations (i.e. they satisfy uniformity).
On the other hand, for 8-bit permutations, the search space of f is $2^{37}$. Almost half of the f-s contain $x_0$ and have constant term c = 0, while only a quarter of the f-s are balanced. 520,128 ($\approx 2^{19}$) of them generate an 8-bit shift-invariant permutation F: interestingly, all of these permutations have a uniform direct 3-share TI.
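The first stage of the 4-bit search can be sketched as follows: enumerate the quadratic coefficient functions f without a constant term, abort early when f is not balanced, derive the shift-invariant F, and keep those F that are permutations. The monomial encoding, the rotation direction and the names are assumptions for illustration. The additional filter on $x_0$ and the 12-bit uniformity check are not included, so the number printed by this sketch is not expected to match the counts reported above.

```python
from itertools import combinations

N = 4
# Linear monomials (i,) followed by quadratic monomials (i, j); no constant term.
MONOMIALS = [(i,) for i in range(N)] + list(combinations(range(N), 2))


def eval_f(mask, x, shift):
    """Evaluate the coefficient function selected by mask on x rotated by shift."""
    v = 0
    for m, mono in enumerate(MONOMIALS):
        if (mask >> m) & 1:
            term = 1
            for j in mono:
                term &= (x >> ((j + shift) % N)) & 1
            v ^= term
    return v


def build_F(mask):
    """Lookup table of the shift-invariant F derived from f: y_i = f(x rotated by i)."""
    return [sum(eval_f(mask, x, i) << i for i in range(N)) for x in range(1 << N)]


def is_balanced(mask):
    """Early abort: an unbalanced f cannot yield a permutation."""
    return sum(eval_f(mask, x, 0) for x in range(1 << N)) == 1 << (N - 1)


def is_permutation(table):
    return len(set(table)) == len(table)


candidates = [m for m in range(1, 1 << len(MONOMIALS))
              if is_balanced(m) and is_permutation(build_F(m))]
print(len(candidates))
```

The same loop structure applies to n = 8, although a search space of $2^{37}$ would call for a compiled implementation rather than Python.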
Constructing Sboxes
In this section, we further construct cryptographically good 4/8-bit Sboxes with these quadratic permutations. The Sbox search follows exactly the same strategy as previous works [15, 16] , although the granularity further complicates the situation here.
Design Architectures. As shift-invariance ensures each bit can be computed in the same way, generally speaking, we would like to avoid additional branches. Take the two-branch balanced Feistel structure for instance: although the round function may still have granularity 1, the other branch also contributes to the granularity of the whole Sbox. To this end, we perform our Sbox search with a full range Substitution-Permutation Network (SPN) (Figure 1(c)).
Permutation Layer. As the substitution layer is chosen from those quadratic TI permutations, the only decision left to make is the permutation layer. Clearly, the most efficient construction would be using a shift-invariant linear permutation or nothing at all. Although shift-invariance is a good property for software/hardware implementations, considering the threat of rotational cryptanalysis [31], we prefer not to preserve it in the final Sbox. Thus, our linear transformation here needs to stop the propagation of shift-invariance. In general, the cheapest option would be using non-shift bit-permutations. However, a bit-permutation usually has a larger granularity (as each bit has to be implemented separately), which leads to a penalty on its software performance. Instead, in this paper, we consider a linear transformation that is similar to AES's "xtime". More specifically, we search for invertible matrices that satisfy:
Let $a_1 = \{a_{1,1}, a_{1,2}, \ldots, a_{n-1,1}, 1\}$. If A is indeed invertible, in software, it can be implemented with a shift and a conditional XOR.
As the conditional branch is prone to cache attack, most implementations tend to use a multiplication instruction to achieve a constant control flow
As the n-bit state x is operated as a word, the granularity is determined by this 1-bit multiplication: since this equation only holds 1 bit values, the overall granularity gets coarser. Nonetheless, from an implementation perspective, it is still much better than arbitrary binary matrix multiplication. To achieve a better diffusion property, in our Sbox search, we use two layers of A (A 2 ) as our permutation layer.
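The "shift and conditional XOR" and its branch-free variant can be sketched as below. This assumes that A acts like AES's xtime: the state is shifted left by one bit and a fixed n-bit constant is XORed in when the bit shifted out is 1. The constant 0x1B (the AES reduction constant) is used only as a familiar test value, not as the matrix selected in this paper, and the function names are ours.

```python
N = 8
MASK = (1 << N) - 1


def step_branch(x, a):
    """Shift plus conditional XOR; the branch depends on a data bit."""
    top = (x >> (N - 1)) & 1
    y = (x << 1) & MASK
    if top:
        y ^= a
    return y


def step_mul(x, a):
    """Same map with the branch replaced by a 1-bit multiplication."""
    top = (x >> (N - 1)) & 1
    return ((x << 1) & MASK) ^ (top * a)


def layer(x, a):
    """Two applications of the step, i.e. A^2, as used for the permutation layer."""
    return step_mul(step_mul(x, a), a)


print(hex(step_mul(0x57, 0x1B)), hex(layer(0x57, 0x1B)))
```

Python integers give no timing guarantees, so the sketch only illustrates the data flow; on a real target the multiplication would be a single instruction operating on the extracted bit.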
Selection criteria. In order to achieve a balance between the implementation cost and the cryptographic properties, we have defined selection criteria for the candidate Sboxes. Specifically, for 4-bit Sboxes, the differential uniformity is ≤ 4 and the linearity is ≤ 8. For 8-bit Sboxes, the differential uniformity is ≤ 8 and the linearity is ≤ 72.
Besides, the algebraic degree and the degree of the interpolation polynomial should be large enough to resist algebraic attacks and interpolation attacks, respectively.
Results
4-bit case. For 4-bit Sboxes, these selection criteria only accept optimal Sboxes (differential uniformity = 4, linearity = 8) [32]. By enumerating all possible choices of A and quadratic permutations, we can find 16 such 4-bit Sboxes within 2 rounds. One such Sbox is presented as follows. The algebraic degree of this Sbox is 3, whereas the degree of the interpolation polynomial is 15.
8-bit case. For n = 8, the overall search space is around $2^{26}$, which is quite feasible for most PCs. 6 Sboxes appear within 3 rounds: all of them have differential uniformity 8, whereas their linearity varies from 64 to 72. Due to the space limit, we present the best one (differential uniformity = 8, linearity = 64) in the Appendix. The algebraic degree of the presented Sbox is 6 and the degree of its interpolation polynomial is 252.
Software Implementation
The major benefit of an Sbox with small granularity is that it can be efficiently implemented on both software and hardware platforms. Although software-based TIs tend to have higher overhead, in terms of security, they might have their own advantages [5]. In this section, we implement our selected Sboxes with first order TI protections in software and discuss a few possible trade-off options.
Target Platform
For software implementations, the most common platforms are smart cards or high-end processors (ARM/AMD/Intel). Although different processors may have different instruction sets, for bit-slice computations, most required bit-wise instructions can be found easily in all instruction sets. The major difference lies in the bit-width of the target processor, which determines how many bits can be computed in parallel. In this paper, our implementation chooses the most common bit-width-32. Implementations for 8-bit and 64-bit follow exactly the same rule. Because our target chip is an NXP ARM M0 core, we wrote our Sbox implementations using the Thumb instruction set [33] . In order to demonstrate the difference between Thumb and ARM instruction sets [34], we also show how those Sboxes can be computed on a more advanced core like the ARM M3.
Implementation Trade-offs
No optimization. It is worth mentioning that finer granularity only provides a possibility for further implementation trade-off: when such trade-off is not necessary, engineers can always do a TI implementation with 3n variables. Such an implementation achieves its best performance when there are 32 concurrent data blocks (Sboxes) available. As the available bit-width is already fully occupied, the shift-invariant property will not provide any benefit in this case.
Size-based optimization. As each bit can be computed in the same way, with shift-invariant transformations, we can pack all n bits into one register. Take an 8-bit Sbox for instance: if there are 4 concurrent Sbox computations for $x^{[1]}$, $x^{[2]}$, $x^{[3]}$ and $x^{[4]}$, a 32-bit register can be filled with
$$x^{[1]}_1, x^{[2]}_1, x^{[3]}_1, x^{[4]}_1, \ldots, x^{[1]}_8, x^{[2]}_8, x^{[3]}_8, x^{[4]}_8$$
where $x_i$ is the i-th bit of x. Correspondingly, each computation will be adjusted to ensure it takes the right input bit. Note that the rotated shift is still available in this form: instead of rotating 1 bit, now we are rotating 4 bits. Readers can verify that the transformation can still be computed correctly in this form, while the number of required concurrent data blocks shrinks from 32 to 4. Similar to the unprotected Sbox, the TI protection can be computed in exactly the same way. If all three shares are computed separately, such an optimization does not contradict any TI requirement.
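The packing described above can be checked with a small model. The sketch below interleaves four 8-bit values into one 32-bit word (bit i of value k at position 4i + k, counting bits from 0) and confirms that rotating the packed word by 4 bits rotates every packed value by 1 bit. The exact bit order inside the register and the helper names are assumptions for illustration.

```python
def pack(xs):
    """Interleave four 8-bit values: bit i of xs[k] goes to position 4*i + k."""
    w = 0
    for i in range(8):
        for k in range(4):
            w |= ((xs[k] >> i) & 1) << (4 * i + k)
    return w


def unpack(w):
    """Inverse of pack."""
    xs = [0, 0, 0, 0]
    for i in range(8):
        for k in range(4):
            xs[k] |= ((w >> (4 * i + k)) & 1) << i
    return xs


def rotl32(w, r):
    r %= 32
    return ((w << r) | (w >> (32 - r))) & 0xFFFFFFFF


def rotl8(x, r):
    r %= 8
    return ((x << r) | (x >> (8 - r))) & 0xFF


xs = [0x01, 0x80, 0xA5, 0x3C]
rotated = unpack(rotl32(pack(xs), 4))
print(rotated == [rotl8(x, 1) for x in xs])
```

Bitwise AND and XOR act position by position, so they carry over to the packed word unchanged; only the rotation amounts need to be scaled by the number of packed Sboxes.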
Extreme optimization. In theory, since the granularity of the TI protection is still 1, packing all 3 shares into one register is possible. Whether this contradicts TI's security requirement (i.e. non-completeness) is debatable: ideally, if the leakage of bit-wise instructions can be regarded as a sum of the leakages of all candidate bits (i.e. no "bit-interactions"), such an implementation should be as secure as a hardware-based TI. However, current results seem to suggest this may not always be the case: Sasdrich et al.'s work shows that for lookup tables (i.e. the LDR instruction) on smart cards, bit-interaction clearly exists [5]. Our experiments with ARM M0 processors also show that the shift instructions (LSL, LSR, ROR) have the same issue. Moreover, as different bits and shares both get placed in one register, shifting becomes trickier. Only one of the shifts, whether of bits or of shares, can be performed with rotated shift instructions. The other one must be done manually with a few shifts and data masks. Considering the security loss and the limited potential performance gain, we believe this is not a reasonable option.
Implementation on ARM M0/M3
Throughout this section, our evaluation is based on the size-based optimization. For the quadratic permutation S, we simply computed the TI-protected permutation according to its Algebraic Normal Form (ANF). Further customized optimizations may be possible but are out of the scope of this paper. To limit the usage of registers or memories, we compute all shifted results online, even if some of them appear repeatedly in the computation. Although this sounds far from ideal, as most commodity processors have a limited number of general purpose registers, such a compromise is inevitable in practice. For the linear transformation P , as the multiplication operation can only handle 1 bit at a time, all n-bit data shares must be executed one by one.
Despite the fact that our Sbox is computed online (rather than using precomputed lookup tables), architecturally, its computation procedure is not that different from Sasdrich et al.'s implementation of PRESENT's Sbox [17]. Depending on the context, leakage might still show up when the CPU switches from one TI-shared function to another. Nonetheless, as the number of shared functions in TI is quite limited (compared with the number of ANDs in masking), implementing TI correctly requires much less effort than implementing bit-slice Boolean masking.
Table 5 illustrates the software implementation costs of our selected Sboxes, along with a few other well-known protected Sboxes, such as AES and PRESENT. It is not hard to see that there is a significant performance difference between the Thumb [33] and ARM [34] instruction sets. The major difference lies in rotation: as Thumb's ROR only rotates by a register rather than a constant, rotating r1 by n and storing the result in r2 has to be implemented as a short sequence of instructions that first moves the rotation amount into a register.
As the results in Table 5 are most likely parallel implementations for multiple Sboxes, we have listed the number of parallel Sboxes with the operation cycles. For our S4, Table 5 suggests it takes 870 cycles to compute 8 Sboxes simultaneously. For 4-bit Sboxes, our shift-invariant Sbox has similar performance to the PRESENT Sbox based on quadratic decomposition (654 vs. 686). With bitslice masking, the PRESENT Sbox can be much more efficient [36]. On the other hand, for the 8-bit case, both the KHL and bit-sliced masking are quite efficient, running twice as fast as our shift-invariant Sbox. However, we would like to stress that the comparison in Table 5 is not as trivial as comparing the numbers of cycles. First of all, our implementation does not take any fresh randomness. As we can see in Table 5, all other Sboxes use quite a lot of random bits, even if they do not use any mask refreshing.
Considering the cost of producing (pseudo)random numbers, it is clearly desirable to avoid fresh randomness. On the other hand, although all Sboxes in Table 5 claim first order security, a TI scheme has 3 shares whereas a bit-slice masking only has 2. Since the authors did not give any SCA evaluation based on real traces [36], it is hard to argue whether these bit-sliced masking schemes provide the same security level as our threshold implementations. If we simply believe in the order-reduction theorem [21], a fair comparison would use second order bit-slice masking (3 shares), which degrades their performance to the same level as ours [36]. Last but not least, enormous effort has been invested in optimizing the implementations of both AES's Sbox and PRESENT's Sbox. In fact, the advantage of bit-slice masking is mainly inherited from the circuit optimization of the unprotected Sbox. In contrast, we simply implemented the ANF of our shift-invariant Sboxes: further optimizations may be possible but they are out of the scope of this paper.
Another interesting observation concerns our granularity gains. Technically, granularity determines how many concurrent Sbox computations we need to achieve the best possible throughput. For PRESENT and AES in Table 5, granularity does not cause an issue: both ciphers use SPN networks with many identical Sboxes as their confusion layers. However, if the cipher uses smaller round functions with fewer concurrent Sboxes or a confusion layer with different Sboxes, it would be difficult to find enough data to "slice" within one plaintext block. Thanks to the fine granularity of our new Sboxes, for short encryption requests, our construction has a better chance of reaching its maximal throughput.
Hardware Implementation
Implementation Trade-off
Unlike on software platforms, TI on hardware has been extensively studied for years. The only difference our Sboxes bring is a "double-rotating" feature: not only can the 3 shares be generated by rotating the inputs with the same circuit (i.e. serial TI [24]), all n output bits can also be generated by rotating inputs. Note that these two rotations are different operations: one is rotating bits, the other is rotating shares. On software platforms, since there is only one rotation instruction, implementing both efficiently is not trivial. On hardware, double-rotation can be simply implemented with multiplexers. Thanks to the fine granularity, we can now implement only a 1-bit Boolean function and compute the other 3n − 1 bits through rotations. As all other implementations are relatively trivial, in this section, our evaluation only uses this 1-bit serial implementation. Note that this implementation is by no means our "reference" design. The point of having a granularity 1 Sbox is that engineers have the flexibility to choose the right trade-off. Although this 1-bit implementation leads to a very compact logic footprint, it trades area for execution cycles: it takes $3n$ cycles to finish a 3-share n-bit Sbox computation. Besides, the multiple data paths cause the control logic to grow, which may offset some of the footprint gain. Depending on the specific application, engineers can also use a "single rotation" version, where only the shares or only the bits are generated by rotations.
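The double-rotating schedule can be modelled in a few lines. The sketch below uses a single 1-bit function, feeds it rotated bits and rotated shares, and counts one cycle per evaluation. The 5-bit χ map serves as a stand-in shift-invariant quadratic function, and the names, the bit order and the absence of a pre-charge stage are all assumptions for illustration; this is not a description of the Verilog design evaluated in this paper.

```python
N = 5
MASK = (1 << N) - 1


def rotr(x, r):
    """Rotate the N-bit word x right by r, so that bit r of x lands at bit 0."""
    r %= N
    return ((x >> r) | (x << (N - r))) & MASK


def chi(x):
    """Unshared reference: y_i = x_i ^ x_{i+2} ^ x_{i+1} x_{i+2}."""
    y = 0
    for i in range(N):
        b0, b1, b2 = (rotr(x, i) >> 0) & 1, (rotr(x, i) >> 1) & 1, (rotr(x, i) >> 2) & 1
        y |= (b0 ^ b2 ^ (b1 & b2)) << i
    return y


def unit(a, b):
    """The only combinational block: bit 0 of one output share, from two input shares."""
    a0, a1, a2 = a & 1, (a >> 1) & 1, (a >> 2) & 1
    b1, b2 = (b >> 1) & 1, (b >> 2) & 1
    return a0 ^ a2 ^ (a1 & a2) ^ (a1 & b2) ^ (b1 & a2)


def serial_ti(x1, x2, x3):
    """Compute all 3N shared output bits with the single unit, one bit per cycle."""
    shares = [x1, x2, x3]
    out = [0, 0, 0]
    cycles = 0
    for s in range(3):                      # rotate shares
        a, b = shares[(s + 1) % 3], shares[(s + 2) % 3]
        for i in range(N):                  # rotate bits
            out[s] |= unit(rotr(a, i), rotr(b, i)) << i
            cycles += 1
    return out, cycles


out, cycles = serial_ti(0b10110, 0b01101, 0b11100)
print(out[0] ^ out[1] ^ out[2] == chi(0b10110 ^ 0b01101 ^ 0b11100), cycles)
```

A "single rotation" variant would instantiate either N copies of the unit (rotating shares only) or 3 copies (rotating bits only), trading area for fewer cycles.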
Pre-charge Issue
A well-known issue for serial threshold implementations is that some first-order leakage might appear during the "shift-shares" procedure [37]. The reason is that the leakage of combinational logic during an input transition depends not only on the current state, but also on the previous state. The solution is simply to eliminate any transition of input shares in the combinational logic, i.e. to add a pre-charge stage which charges the combinational logic with all zeros between these two states. Obviously, this pre-charge stage penalizes the overall performance by one extra cycle. Interestingly, as our double-rotating design takes more cycles to proceed, the percentage of pre-charge time becomes smaller. Note that a pre-charge stage is only required when we are switching between different shares, not between different bits.
Implementation on ASIC
In order to evaluate their performance on hardware, we have implemented our Sboxes with first order TI protections in Verilog. For synthesis, we used Synopsys Design Compiler with the TSMC 180nm standard cell library. The area requirements as well as the clock cycles are presented in Table 6. Note that only the combinational part is documented in Table 6: as most previous works excluded the multiplexers and registers as "required extra logic", we cannot further compare the whole design 5. For clarity, Table 6

Since most results in Table 6 are uniform first order threshold implementations, we do not present their fresh randomness requirements. Only the AES Sbox uses 32 random bits; none of the others requires fresh randomness. Thanks to their fine granularity, our protected Sboxes can be implemented with 1-bit combinational logic, which leads to very compact implementations (Table 6). However, this is nothing more than a trade-off: the number of cycles clearly shows the price to pay. Besides, for larger n, shift-invariant constructions lose most of their charm. Table 6 shows that the area gain for the 8-bit Sbox is negligible (if any, considering that a serial implementation uses more MUXes) compared with Boss et al.'s construction. The reason is rooted in the philosophy of shift invariance: it saves area by reducing the number of outputs of a logic circuit, but not the number of inputs. Our 1-bit implementation is still a 2n-variate Boolean function. Boss et al.'s construction uses smaller Sboxes, which reduces the input size of the protected circuit. Technically, for an arbitrary vectorial Boolean function, the implementation cost grows linearly with the number of output bits, but exponentially with the number of input bits. Having said that, the main advantage of our construction is that it provides flexible implementation trade-offs on both software and hardware platforms. Although Boss et al.'s paper also mentions software efficiency, their prediction is actually based on the number of ANDs.
We believe that software performance evaluation should use actual assembly code: due to the limited resources available (e.g. instructions, registers, buses, etc.), high-level estimations could be misleading.
Security Evaluation
Software: ARM M0
In order to evaluate our protected Sboxes in practice, we have implemented them on both software and hardware platforms. For the software implementation, our target chip is an NXP LPC1114 (ARM Cortex-M0) processor. The measurement point connects to a 100 Ohm resistor on the VCC end. Power traces were captured with a PicoScope 2206B running at a sampling rate of 125 MSa/s. The clock speed of the target core was set to 8 MHz. For leakage detection, we use the non-specific fixed-vs-random T-test [39]. In order to increase the detection power, we force all parallel Sboxes to use the same input shares (i.e. all the concurrent Sbox computations are exactly the same). Figure 3 shows the evaluation results for our 4-bit Sbox with 1 million traces. Considering that the Sbox computation includes 25000 time points, we increase the T-test threshold to 5 [40]. With 1 million traces, the first order T-test does not find any significant leakage. As we have only implemented a first order TI protection, second order attacks are still feasible. In theory, the most efficient second order attacks should be multivariate attacks which combine 2 independent samples on the trace. In practice, significant leakage can already be detected by simply performing the same T-test on the second moment (Figure 3). Therefore, we did not enumerate all possible second order sample combinations on the trace. The 8-bit case is quite similar: due to limited space, we present the results for S 8 in the Appendix.
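The detection procedure can be sketched in a few lines. The example below computes Welch's t statistic per time point for the first moment and, after centering and squaring each sample, for the second moment. It does not use our measurements: the data come from a toy simulation that assumes a 1-bit value split into two Boolean shares leaking their Hamming weights with Gaussian noise, and the function names, the seed and the noise level are illustrative choices.

```python
import math
import random

THRESHOLD = 5.0  # detection threshold used in the text above


def welch_t(a, b):
    """Welch's t statistic between two sequences of samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)


def centered_square(samples):
    """Pre-processing for the univariate second order test."""
    m = sum(samples) / len(samples)
    return [(x - m) ** 2 for x in samples]


def ttest_orders(fixed, rnd):
    """fixed, rnd: lists of traces (each a list of samples).
    Returns the first and second order t statistics per time point."""
    t1, t2 = [], []
    for f_col, r_col in zip(zip(*fixed), zip(*rnd)):
        t1.append(welch_t(f_col, r_col))
        t2.append(welch_t(centered_square(f_col), centered_square(r_col)))
    return t1, t2


def toy_trace(v, rng, sigma=0.5):
    """Toy model (assumption): one sample leaking HW(v ^ m) + HW(m) + noise."""
    m = rng.getrandbits(1)
    return [(v ^ m) + m + rng.gauss(0.0, sigma)]


if __name__ == "__main__":
    rng = random.Random(2019)
    n_traces = 5000
    fixed = [toy_trace(1, rng) for _ in range(n_traces)]
    rnd = [toy_trace(rng.getrandbits(1), rng) for _ in range(n_traces)]
    t1, t2 = ttest_orders(fixed, rnd)
    print("first order leak detected:", abs(t1[0]) > THRESHOLD)
    print("second order leak detected:", abs(t2[0]) > THRESHOLD)
```

In this toy model the two classes have the same mean but different variances, so only the second order statistic is expected to cross the threshold. This mirrors the behaviour reported above for a first order protected implementation.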
Hardware: SAKURA-X FPGA
For hardware implementations, we have tested our Sboxes on the SAKURA-X board with a Xilinx Kintex-7 FPGA. In order to increase the signal-to-noise ratio, an Agilent 25 dB amplifier is connected to the measured signal. Moreover, considering that our all-serial implementation has very limited power consumption, we extended a 3n-bit protected Sbox to a 384-bit design: for the 4-bit case, this means there are 32 parallel Sboxes implemented on the board. For 8-bit Sboxes, there are 16 parallel Sboxes. As in the software implementations, all the implemented Sboxes were given the same input shares. Our FPGA design runs at 3 MHz, while our Lecroy Waverunner 700 Zi scope captures traces at 500 MSa/s. Obvious outliers were removed before the T-test. Figure 4 shows the leakage detection results for our 4-bit Sbox after 5 million traces. No first order leakage is detected, which supports the first order security of our protected design. Since our implementation is a serial one, the second order detection should technically use a multivariate T-test. However, it is not hard to see that the second moment already shows some clear leakage. As in the software case, we present the 8-bit results in the Appendix.
Conclusion
In this paper, we propose a novel Sbox construction using quadratic shift-invariant transformations. Thanks to the shift-invariant property, our Sbox constructions have a fine "granularity", which contributes to more flexible implementation trade-offs. Both software and hardware implementations have been discussed and evaluated (on ARM processors and an FPGA). The strong point of our Sboxes is that their first order protection can be efficiently tuned to the needs of different applications without using any fresh randomness. Experiments suggest that our TI protection has effectively eliminated the first order leakage. Meanwhile, to the best of our knowledge, this is the first computation-based TI Sbox implementation in software (rather than the table-based TI implementation in [5]). Considering that masked software implementations do not always live up to their security claims (e.g. [17]), utilizing threshold implementations in software is of independent interest.
, where the diffusion layer uses two layers of A:
The overall Sbox is

   0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0 00 6d f1 8f 3d 80 b4 31 50 82 3f 2e 51 0f 1c c1
1 a0 c4 25
   0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0 00 b5 6b 17 d6 e4 2e d5 ad 9a c9 37 5c ec ab d2
1 5b 77 35 d0 93 38 6e 0c b8 16 d9 be 57 7e a5 45
2 b6 53 ee c2 6a 08 a1 0a 27 40 70 de dc 3c 18 31
3 71 0d 2c 99 b3 48 7d 4f ae 50 fc cb 4b 32 8a 3a
4 6d 39 a6 3b dd 0e 85 9f d4 02 10 0f 43 12 14 8c
5 4e 83 80 84 e0 aa bd 3e b9 f6 78 fe 30 f8 62 63
6 e2 e6 1a d7 58 db 33 79 67 e1 90 df fa fb 9e 56
7 5d c0 a0 f4 f9 e3 97 44 96 89 64 b2 15 8d 74 25
8 da 8b 72 ea 4d 9b 76 69 bb 68 1c 06 0b 5f 3f a2
9 a9 61 04 05 20 6f 1e 98 86 cc 24 a7 28 e5 19 1d
a 9c 9d 07 cf 01 87 09 46 c1 42 55 1f 7b 7f 7c b1
b 73 eb ed bc f0 ef fd 2b 60 7a f1 22 c4 59 c6 92
c c5 75 cd b4 34 03 af 51 b0 82 b7 4c 66 d3 f2 8e
d ce e7 c3 23 21 8f bf d8 f5 5e f7 95 3d 11 ac 49
e ba 5a 81 a8 41 26 e9 47 f3 91 c7 6c 2f ca 88 a4
f 2d 54 13 a3 c8 36 65 52 2a d1 1b 29 e8 94 4a ff

Table 10. The quadratic shift-invariant permutation S

   0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0 00 4a 94 8e 29 4e 1d 2a 52 56 9c c8 3a 13 54 2d
1 a4 77 ac 2f 39 c7 91 3f 74 e9 26 eb a8 18 5a ba
2 49 ca ee 3d 59 f7 5e a0 72 bf 8f 12 23 c3 7e ce
3 e8 f2 d3 99 4c 7b d7 b0 51 05 30 34 b4 cd 75 5c
4 92 93 95 c4 dd f1 7a 06 b2 fd ef f0 bc de 41 73
5 e4 7c 7f b7 1f aa 24 c1 46 90 87 01 fc 07 9d 36
6 d1 19 e5 7d a7 42 33 86 98 1e f6 20 af 04 61 9a
7 a2 f3 0a 0b 60 1c 68 44 69 76 9b d4 ea d8 b8 da
8 25 47 27 15 2b 64 89 96 bb 97 e3 9f f4 f5 0c 5d
9 65 9e fb 50 df 09 e1 67 79 cc bd 58 82 1a e6 2e
a c9 62 f8 03 fe 78 6f b9 3e db 55 e0 48 80 83 1b
b 8c be 21 43 0f 10 02 4d f9 85 0e 22 3b 6a 6c 6d
c a3 8a 32 4b cb cf fa ae 4f 28 84 b3 66 2c 0d 17
d 31 81 3c dc ed 70 40 8d 5f a1 08 a6 c2 11 35 b6
e 45 a5 e7 57 14 d9 16 8b c0 6e 38 c6 d0 53 88 5b
f d2 ab ec c5 37 63 a9 ad d5 e2 b1 d6 71 6b b5 ff

Table 11. Table 13. The quadratic shift-invariant permutation S
