Abstract. Masking is a popular countermeasure against differential power analysis (DPA) and other side-channel attacks. When designing integrated circuits to resist DPA, masking at the logic gate level has the benefit that it can be implemented without consideration of the high-level function of the circuit. However, the phenomena of glitches and early propagation reduce the effectiveness of many gate-level masking schemes. In this paper we present a new technique for gate-level masking that is free of glitches and early propagation, yet requires only cell-level "don't touch" constraints. Our technique, which we call LUT-Masked Dual-rail with Precharge Logic (LMDPL), can therefore be implemented in a typical FPGA or standard cell ASIC design flow. LMDPL does not require routing constraints, nor sequencing of the evaluation of individual gates with enables, registers, or latches. We verify our techniques with an AES implementation on an FPGA. Our implementation shows no significant leaks in evaluations using up to 200 million traces.
Introduction
Many devices leak information through side channels such as power consumption or radiated electromagnetic energy. Side channel analysis techniques such as differential power analysis (DPA) [11] can recover information about secrets manipulated in a cryptographic device. Given enough measurements, these techniques may enable an attacker to recover a portion, or the entirety, of a secret key intended to be kept secure within a cryptographic device.
Masking countermeasures [4] seek to prevent DPA attacks by making the electrical activity in a device independent of secret values being operated upon. This is done by dividing the secret into multiple shares. The shares can be combined to recover the original secret, but each share is random when considered individually. Thus, operations may be performed on the shares without leaking information about the secret. For example, given a secret value k in some group G, a first-order additive masking uses a mask m chosen randomly from G, and divides k into the shares m and m + k. Each of these shares is, when considered individually, independent of k. 
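As a minimal sketch of the two-share split described above, assume that G is the group of 8-bit strings under bitwise XOR, so that "+" is XOR. The function names and the share width are illustrative and are not taken from the paper.

```python
import secrets

WIDTH = 8  # assumed share width in bits (illustrative)


def split(k: int) -> tuple[int, int]:
    """Return the shares (m, m + k), with + taken as XOR and m drawn uniformly."""
    m = secrets.randbits(WIDTH)
    return m, m ^ k


def combine(shares: tuple[int, int]) -> int:
    """Recombine the two shares to recover the secret."""
    m, masked = shares
    return m ^ masked


def masked_share_values(k: int) -> set[int]:
    """All values the share m + k takes as m ranges over the whole group."""
    return {m ^ k for m in range(1 << WIDTH)}
```

For any fixed k, `masked_share_values(k)` is the entire group, with each element reached by exactly one mask. The share m + k is therefore uniformly distributed regardless of k, which is the independence property stated above.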
The masked function g uses two independent mask bits for the inputs, and produces an output masked with a third mask that is independent of the input masks. The function h uses a common mask bit that is reused for both of the inputs and the output. Suppose we can construct a masked gate that can compute Boolean functions without leaking the unmasked values of the secret data a, b, and q. Then, more complex functions can be constructed from those masked gates, ideally in the same manner that any circuit can be constructed out of standard logic gates. Alternatively, given an implementation of some cryptographic circuit constructed using standard Boolean gates, the masked gates could perhaps be swapped for the standard gates to yield a masked implementation of the circuit. Our goal is to create such a masked gate.
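The two masked functions can be written down as functional specifications. The sketch below assumes single-bit Boolean (XOR) masking and assumes the forms g(a_m, b_m, m_a, m_b, m_q) = f(a_m XOR m_a, b_m XOR m_b) XOR m_q and h(a_m, b_m, m) = f(a_m XOR m, b_m XOR m) XOR m, which match the description above. It specifies input/output behavior only. It is not a leak-free implementation, because it unmasks the operands internally.

```python
from itertools import product


def f_and(a: int, b: int) -> int:
    """Example unmasked Boolean function f."""
    return a & b


def g(f, a_m, b_m, m_a, m_b, m_q):
    """Assumed spec: independent input masks m_a, m_b and a fresh output mask m_q."""
    return f(a_m ^ m_a, b_m ^ m_b) ^ m_q


def h(f, a_m, b_m, m):
    """Assumed spec: one mask bit m shared by both inputs and the output."""
    return f(a_m ^ m, b_m ^ m) ^ m


def check_specs(f) -> bool:
    """Exhaustively confirm that unmasking each output recovers f(a, b)."""
    for a, b, m_a, m_b, m_q in product((0, 1), repeat=5):
        if g(f, a ^ m_a, b ^ m_b, m_a, m_b, m_q) ^ m_q != f(a, b):
            return False
        if h(f, a ^ m_a, b ^ m_a, m_a) ^ m_a != f(a, b):
            return False
    return True
```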
Previous Work
Masking countermeasures have been studied extensively. We focus here on techniques that are most relevant to hardware implementations. Trichina et al. made an early proposal for a masked AND gate using four ANDs and four XORs [28] . Subsequent study found that direct implementation of masked operations in hardware may leak information through extraneous signal transitions known as glitches, due to the multitude of paths through the circuit [7, 13] .
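For concreteness, one commonly presented arrangement of the four ANDs and four XORs is sketched below. The variable names are ours, and the evaluation order shown (folding the output mask in first) is an assumption about the usual presentation rather than a quotation of [28]. The sketch checks functional correctness only and says nothing about glitch behavior.

```python
from itertools import product


def trichina_and(a_m: int, b_m: int, m_a: int, m_b: int, m_q: int) -> int:
    """Masked AND built from four ANDs and four XORs; returns (a AND b) XOR m_q."""
    acc = m_q ^ (m_a & m_b)   # AND 1, XOR 1
    acc ^= m_a & b_m          # AND 2, XOR 2
    acc ^= m_b & a_m          # AND 3, XOR 3
    acc ^= a_m & b_m          # AND 4, XOR 4
    return acc


def trichina_is_correct() -> bool:
    """Exhaustive check over all data and mask bits."""
    return all(
        trichina_and(a ^ m_a, b ^ m_b, m_a, m_b, m_q) == (a & b) ^ m_q
        for a, b, m_a, m_b, m_q in product((0, 1), repeat=5)
    )
```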
This led to the proposal of masked dual-rail with precharge logic (MDPL) [22]. MDPL avoids glitches through the use of precharged, monotonic, dual-rail logic, with each signal x encoded as a complementary pair $(x, \bar{x})$. The authors observe that the h version of a masked AND gate can be implemented as:
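The majority-gate form usually given for the MDPL AND gate in the literature can be checked with a short sketch. It assumes that the true rail is MAJ(a_m, b_m, m) and that the complementary rail is the majority of the complemented inputs. The function names are illustrative.

```python
from itertools import product


def maj(x: int, y: int, z: int) -> int:
    """Three-input majority, a monotonic function."""
    return (x & y) | (x & z) | (y & z)


def mdpl_and(a_m: int, b_m: int, m: int) -> tuple[int, int]:
    """Assumed MDPL-style masked AND on evaluated (non-precharged) dual-rail inputs."""
    q = maj(a_m, b_m, m)
    q_n = maj(a_m ^ 1, b_m ^ 1, m ^ 1)
    return q, q_n


def mdpl_and_is_correct() -> bool:
    """The true rail must equal (a AND b) XOR m and the rails must be complementary."""
    for a, b, m in product((0, 1), repeat=3):
        q, q_n = mdpl_and(a ^ m, b ^ m, m)
        if q != (a & b) ^ m or q_n != q ^ 1:
            return False
    return True
```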
However, it was later shown that MDPL circuits exhibited significant first-order leakage due to early propagation [12, 21, 26] . Improved MDPL (iMDPL) addresses early propagation, but requires use of latches to control the moment at which gates evaluate [21] . In addition to the above issues, MDPL and other maskings of the form h described in Eq. 2 may not provide adequate resistance against attacks that examine leakage distributions [6, 8, 24, 29] . Another technique that can be used to attack protected implementations is the collision-correlation attack [17] . This is a powerful technique for exploiting complex leakages such as those arising from incompletely masked combinational logic [14, 15] .
The maskings shown in Eqs. 1 and 2 divide a secret into two shares. It is also possible to utilize more than two shares. Techniques from the field of multiparty computation may be used to perform computations without ever operating on all the shares simultaneously, thus ensuring immunity from glitch-related leakage [23] . However, a memory effect was identified, in which leakages from a computation can persist in a circuit for a period of time after the computation occurs. This phenomenon can impact the security of schemes thought to be immune to univariate attacks [16] .
The technique of Prouff and Roche [23] performs shared multiplications in $\mathrm{GF}(2^8)$. In contrast, threshold implementations instead use bitwise shares. The product of the values $x = x_1 + x_2 + x_3$ and $y = y_1 + y_2 + y_3$ is a collection of $x_i y_j$ terms, which can be allocated to output shares such that no single output share contains sufficient information to leak the secret. Thus, threshold implementations also address the problem of glitches [2, 18, 19, 20].
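One possible allocation of the nine $x_i y_j$ terms to three output shares is sketched below for single bits, with addition in GF(2) taken as XOR. The particular allocation is a textbook-style example chosen for illustration and is not claimed to be the one used in [2, 18, 19, 20]. Output share i never touches input share i, so no single output share sees all shares of either operand.

```python
from itertools import product


def ti_and(x: tuple[int, int, int], y: tuple[int, int, int]) -> tuple[int, int, int]:
    """Three-share product: each output share omits one input share index."""
    x1, x2, x3 = x
    y1, y2, y3 = y
    z1 = (x2 & y2) ^ (x2 & y3) ^ (x3 & y2)  # independent of x1, y1
    z2 = (x3 & y3) ^ (x1 & y3) ^ (x3 & y1)  # independent of x2, y2
    z3 = (x1 & y1) ^ (x1 & y2) ^ (x2 & y1)  # independent of x3, y3
    return z1, z2, z3


def ti_and_is_correct() -> bool:
    """The XOR of the output shares must equal the product of the unshared values."""
    for bits in product((0, 1), repeat=6):
        x, y = bits[:3], bits[3:]
        z1, z2, z3 = ti_and(x, y)
        if z1 ^ z2 ^ z3 != (x[0] ^ x[1] ^ x[2]) & (y[0] ^ y[1] ^ y[2]):
            return False
    return True
```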
Of the foregoing techniques, threshold implementations offer the greatest promise for strong masking of arbitrary circuits, but doing so still requires insertion of additional registers in some cases. In this work, we offer a strategy for general gate-level masking that does not require additional registers.
Roadmap
The paper is organized as follows. First, we briefly describe the idea of path-based leakage assessment. Next, we introduce LUT-Masked Dual-rail with Precharge Logic (LMDPL), a masking technique that is leak-free under a path-based leakage metric. Finally, we present some experimental results obtained from FPGA implementations of an LMDPL AES core.
Path-Based Leakage Assessment
Many previous countermeasures have been justified with arguments that the settled final values of each circuit node in each clock cycle are independent of secret data. However, such analyses cannot identify ways in which the transient electrical behavior may correlate with secret data. In practice, designs constructed without consideration of transient electrical behavior have remained vulnerable to side-channel attacks.
Most contemporary semiconductor devices are implemented using complementary metal-oxide-semiconductor (CMOS) or closely related technology. In CMOS technology, when a logic gate changes state, the parasitic capacitance at the inputs of downstream gates must be either charged or discharged. Ignoring quasi-static operating conditions such as supply voltage and temperature, the time it takes to (dis)charge the inputs of downstream gates still depends on many factors. The factors can include the number of inputs of a gate that are switching, the transition time (slew rate) at the switching inputs, and the logic state (voltage) present at non-switching inputs. When considering whether the electrical activity is independent of a secret, these effects should be considered cumulatively for the entire propagation path. For example, in a two-share scheme, if the output transition of an early gate exhibits a slight delay depending on the value of one share, and this output propagates to a gate at which the activity depends on the other share, the combination of these two effects may make the electrical activity at the downstream gate correlated with the unmasked secret. To investigate whether masked logic styles leak due to this type of electrical effect, we have developed a technique that we call activity image analysis. Due to space constraints we include only a brief description of the technique here. Activity image analysis determines whether electrical activity at upstream and downstream gates can combine to leak a secret by considering the switching events at adjacent gates jointly, rather than separately. The idea is illustrated in Fig. 1 . A circuit is leak-free under an activity image metric if, for each activity image, observation of that image does not correlate with any secret value. This is a significantly stronger condition than balancing the distribution of final gate output values.
Activity image analysis is more comprehensive than toggle simulation analysis, which analyzes a single extracted model of propagation time through gates and wires, and applies to a single combination of operating conditions. Similar to structural clock domain crossing checks, activity image analysis examines the logical structure of the circuit and provides an assurance that is robust to timing variation. We also believe activity images can be helpful in detecting early propagation, but have no formal proof.
Appendix A shows an activity image analysis of iMDPL. Residual leakage in an iMDPL implementation due to circuit effects was also examined in [14] . Based on the results we have obtained from activity image analysis, we question whether mapping a single-rail circuit to a dual-rail circuit (as done e.g. in [5] ) is an effective technique for producing first-order masked implementations.
LUT-Masked Dual Rail Logic
In this section, we introduce LMDPL, explain its usage, and then describe how we implemented AES using LMDPL. 
The LMDPL Non-linear Gate

Fig. 2. LMDPL Non-linear Gate
It is well-known that linear functions are amenable to being computed on a shared representation of their argument, while non-linear functions pose substantial difficulty. Consequently, our efforts focused on identifying a way of computing non-linear functions in masked logic while satisfying the activity image leakage metric. We arrived at the dual-rail table lookup structure shown in Fig. 2 . In our schematics, wires shown crossing at a right angle are not electrically joined, whereas wires shown meeting at a tee are electrically joined.
The LMDPL non-linear gate is intended to be used with a masking in the form of Eq. 1. Secret inputs a and b are converted to masked representation by obtaining two random mask values $m_a$ and $m_b$, and computing $a_m = a \oplus m_a$ and $b_m = b \oplus m_b$.
The structure shown in Fig. 2 is important. If EDA tools are permitted to freely restructure the logic, the gate will no longer pass a path-based leakage assessment. Fortunately, it is not difficult to instruct common EDA tools to preserve certain cell instantiations with a mechanism known as a don't touch constraint. Limited restructuring of the gate is acceptable. For example, ASIC implementations may prefer the NAND/NAND structure obtained by applying De Morgan's Law. We suggest some strategies for implementing LMDPL with common tools in Appendix B.
Between each evaluation, the circuit must be precharged by driving both signals in each masked data pair to zero. Zeros on the four masked data inputs will propagate to the outputs, hence a precharge applied at the masked data inputs of a collection of LMDPL gates will propagate to the final outputs. During the evaluation of the gate, a transition away from zero on an output requires a non-zero value to have arrived on one of the component signals of each dual-rail input pair. Thus, the LMDPL gate does not exhibit early propagation.
LMDPL avoids glitches through the use of monotonic gates, in the same manner as the original MDPL. In the course of any evaluation, each of the $q_m$ and $\bar{q}_m$ outputs will transition at most once.
On any evaluation, exactly one of the AND gates in the LMDPL non-linear gate will produce a rising transition at the output. Even after fixing any or all of the unmasked data values, each of the eight AND gates has an equal probability of being the active gate upon each evaluation, depending upon the mask values. This effect is similar to Baddam et al.'s path switching countermeasure [1] .
Implementing LMDPL
A simple circuit constructed using LMDPL is illustrated in Fig. 3. The circuit has three inputs, x, y, and z, and one output, w. The top portion of the figure operates on the mask share, and the bottom portion of the figure operates on the masked data share. The lookup tables $t_i$ for the LMDPL gates are passed from the mask share to the masked data share through registers. There are two non-linear gates, so the mask share takes two fresh mask bits from the RNG. Each of the mask share logic and masked data share logic is constructed by making modifications to the original circuit. The mask share retains linear elements, ties the output of each non-linear element to an RNG bit, and instantiates a "Table Gen" component for each non-linear element. The masked data share replaces the linear elements with corresponding dual-rail versions, and replaces the non-linear elements with instances of the LMDPL non-linear gate. Let $m = (m_b, m_a)$ and $i = (i_1, i_0)$ with $i_0, i_1 \in \{0, 1\}$. Then,
The non-linear function implemented by the LMDPL gate will commonly be a logical AND: f (a, b) = a · b. The operation of the table generation logic for this case is shown in Table 1 .
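The interaction between table generation and the dual-rail lookup can be sketched in a few lines. This is a behavioral model under stated assumptions, not a reproduction of Fig. 2 or Table 1: it assumes the table entry is $t_i = f(i_0 \oplus m_a, i_1 \oplus m_b) \oplus m_q$, that the complementary half of the table is the bitwise inverse, and that each of the eight AND gates combines one rail of each input pair with one table bit. All names are illustrative.

```python
from itertools import product


def table_gen(f, m_a: int, m_b: int, m_q: int) -> list[int]:
    """Assumed mask-share table: t[2*i1 + i0] = f(i0 ^ m_a, i1 ^ m_b) ^ m_q."""
    return [f(i0 ^ m_a, i1 ^ m_b) ^ m_q for i1, i0 in product((0, 1), repeat=2)]


def lmdpl_gate(a_rails, b_rails, t):
    """Behavioral model of the lookup; rails are (true, complement) pairs.

    Returns (q_m, q_m_n, and_outputs), where and_outputs lists the eight
    AND-gate outputs so that the number of active gates can be inspected.
    """
    a_t, a_c = a_rails
    b_t, b_c = b_rails
    and_outputs = []
    for i1, i0 in product((0, 1), repeat=2):
        select = (a_t if i0 else a_c) & (b_t if i1 else b_c)
        t_i = t[2 * i1 + i0]
        and_outputs.append(select & t_i)        # feeds the OR driving q_m
        and_outputs.append(select & (t_i ^ 1))  # feeds the OR driving its complement
    q_m = int(any(and_outputs[0::2]))
    q_m_n = int(any(and_outputs[1::2]))
    return q_m, q_m_n, and_outputs


def rails(v: int) -> tuple[int, int]:
    """Dual-rail encoding of an evaluated bit."""
    return v, v ^ 1
```

In this model, driving all four rails to zero forces both outputs to zero (precharge), an output can rise only after one rail of each input pair has risen (no early propagation), and exactly one AND gate is active per evaluation, in line with the properties claimed for the gate.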
Implementing Linear Functions
Circuits typically also include gates that are linear (or affine) under Boolean masking. When implementing linear gates, it is not necessary to consider the masking, so LMDPL is compatible with the linear gates from non-masked dual-rail logic styles such as WDDL [27]. We review briefly how to implement NOT and XOR gates. A NOT gate can be implemented without any transistors, simply by swapping the complementary dual-rail signals. That is, q = NOT(a) is implemented by $(q, \bar{q}) = (\bar{a}, a)$.
XOR gates should be implemented as monotonic logic (i.e., constructed out of AND and OR gates) to ensure the logic remains glitch-free and to correctly propagate the precharge state. An XOR gate q = XOR(a, b) can be implemented as follows:
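A minimal behavioral sketch of the two linear gates follows. The NOT gate is the rail swap described above. The XOR uses the standard sum-of-products form built only from AND and OR, which is an assumption about the exact expression intended here. Signal pairs are written (true rail, complement rail), and the names are illustrative.

```python
from itertools import product


def dr_not(a: tuple[int, int]) -> tuple[int, int]:
    """Dual-rail NOT: swap the two rails; uses no logic at all."""
    a_t, a_c = a
    return a_c, a_t


def dr_xor(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Dual-rail XOR from AND/OR only, so every output is monotonic in the rails."""
    a_t, a_c = a
    b_t, b_c = b
    q_t = (a_t & b_c) | (a_c & b_t)
    q_c = (a_t & b_t) | (a_c & b_c)
    return q_t, q_c


def encode(v: int) -> tuple[int, int]:
    """Dual-rail encoding of an evaluated bit."""
    return v, v ^ 1
```

With both input pairs precharged to (0, 0), `dr_xor` returns (0, 0), so the precharge state propagates. With only one input pair evaluated, the output also stays at (0, 0).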
AES Implementation
To test the effectiveness of LMDPL, we developed an implementation of AES. The overall architecture of the AES implementation is shown in Fig. 4. The design computes a complete round transformation in parallel, and thus has 16 S-boxes. We favor simplicity and use a clock-based approach for the precharge, driving inputs to the LMDPL logic to zero in alternate cycles. Recall that sophisticated masking techniques are required only for non-linear operations, and the only non-linear operation in AES is the $\mathrm{GF}(2^8)$ inversion within SubBytes. We implement only the inversion in LMDPL, and implement the remainder of the round transformation (including the linear portions of SubBytes) in ordinary logic. The sequence of operation is:
0. Initially, the LMDPL inversion logic is precharged.
1. In cycle 1 of a cipher operation, ordinary logic performs AddRoundKey and converts bytewise to the subfield basis used for inversion. The LMDPL logic is still precharged.
2. In cycle 2, the LMDPL logic computes bytewise inversion in $\mathrm{GF}(2^8)$.
3. In cycle 3, ordinary logic converts bytewise to the standard AES basis, applies the SubBytes affine transformation, performs ShiftRows, MixColumns, and AddRoundKey, and then converts bytewise back to the subfield basis. Also in cycle 3, the LMDPL logic is precharged.
4. In subsequent even cycles, the LMDPL logic is active.
5. In subsequent odd cycles, the ordinary logic is active, and the LMDPL logic is precharged.
The $\mathrm{GF}(2^8)$ inversion uses the $\mathrm{GF}(((2^2)^2)^2)$ normal basis identified in [3]. This implementation requires 36 bit-multiplications in $\mathrm{GF}(2)$. Some additional detail on our implementation of the inversion is provided in Appendix B.
The mask share (the $t_i$) would ideally be kept in the Hamming-weight-balanced 8-bit encoding to minimize leakages usable by second-order attacks. However, this is quite expensive. At some cost in resistance to second-order attacks, we generate and register only half of the table. The complementary half is obtained by inversion. In some cases, registers with complementary outputs could be used.
For purposes of comparison, we synthesized an ASIC version of our LMDPL AES core using Synopsys Design Compiler 2013.03-SP2. Table 2 compares our implementation with several others reported in the literature. Note that the threshold implementations [2, 18] have the advantage that the S-box can be pipelined, meaning the overall throughput is one S-box evaluation per clock rather than 1/latency. However, this benefit disappears in fully parallel implementations, as it is necessary to obtain the previous round's SubBytes output and apply the remaining transformations of an AES round before the next SubBytes input is ready. Also, note that although it requires fewer random bits per S-box, the parallel AES presented in this work requires more random bits in per-clock terms (576/2) compared to the threshold implementations with 8-bit datapaths (44/3 and 48/5). As was the case for the threshold implementations, we have provided ASIC area figures for comparison, while presenting evaluation results from an FPGA.
Experimental Results
This section presents assessments of DPA resistance on two designs incorporating LMDPL. Each design is described in Verilog, and implemented for a Xilinx Virtex-5 FPGA using Synplify Pro 2009.03 and Xilinx ISE 13.2.
Evaluation Methodology
To evaluate the information leakage in different designs, we used the test vector leakage assessment (TVLA) methodology proposed by Goodwill et al. [9]. The TVLA methodology is designed to measure information leakage and provide an objective score. It specifies test vectors and uses Welch's t-test to measure the significance in the difference of means of two distributions. One of the tests in the methodology is known as the "fixed versus random" (FVR) test. In this test, the measurements are collected as the device operates repeatedly using fixed input data and randomly varying input data. (The fixed and random input vectors are randomly interleaved.) Welch's t-test is then used to score the differences between the two sets of measurements. We follow [9] and use |t| < 4.5 as the criterion for a passing result. The fixed versus random evaluation technique does not target specific leaks. Rather, it measures aggregate information leakage at each point in time during the cryptographic operation. It is extremely powerful and can often find potential vulnerabilities with fewer traces than needed to identify specific leakages. In particular, for designs where the parallelism exceeds the portion of the key that can be guessed by a DPA attack, a leak identified by the FVR test is stronger than that which would actually be available to an attacker who cannot guess the entire key at once. Nevertheless, a failure of the FVR test does represent some correlation with secret intermediates, and the goal of masking is to eliminate such correlations.
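The sample-wise fixed-versus-random score can be sketched with the Python standard library. The sketch assumes that traces are equal-length sequences of numeric samples that have already been separated into the fixed and random groups. The function names are illustrative and do not come from [9].

```python
import math
from statistics import mean, variance

THRESHOLD = 4.5  # passing criterion |t| < 4.5, following the text above


def welch_t(x, y) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))


def fvr_t(fixed_traces, random_traces) -> list[float]:
    """Compute one t value per time sample across the two groups of traces."""
    fixed_columns = zip(*fixed_traces)    # transpose: one tuple per time sample
    random_columns = zip(*random_traces)
    return [welch_t(f, r) for f, r in zip(fixed_columns, random_columns)]


def passes(t_values) -> bool:
    """True if no time sample reaches the threshold."""
    return all(abs(t) < THRESHOLD for t in t_values)
```

A typical use would be `passes(fvr_t(fixed, random))`. In practice the statistic is usually accumulated incrementally because hundreds of millions of traces do not fit in memory, which this sketch does not attempt.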
Another characteristic of the FVR test is that false positives may arise due to the plaintext and ciphertext being fixed. The dilemma is similar to the need in conventional DPA attacks to select an intermediate separated from the plaintext or ciphertext by a non-linear function. We avoided the problem of input and output leaks by splitting the input into separate mask and masked data shares prior to transfer to the device under test (DUT), and likewise retrieving mask and masked data shares from the DUT before combining. We refer to this scheme as externally applied masking and the more conventional scheme where the DUT divides the data into shares as internally applied masking.
We also perform a variant of a collision correlation attack [17] . Our simulated collision correlation (SCC) attack operates by dividing the pool of traces into two equal-size groups and computing for each group the 256 means corresponding to the possible values of the S-box input. Then, for each of 256 possible "guesses" of a linear key byte distance, the means in one group are permuted according to the guess, and the correlation computed between the two sets of means. The unpermuted case represents the "correct" guess. To select points for this attack, we used one-way analysis of variance (ANOVA) to identify points with the strongest dependency on the S-box input value.
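The scoring step of the attack described above can be sketched as follows. The sketch assumes that each observation is a pair (S-box input byte, leakage at the selected sample), that the "linear key byte distance" acts as an XOR offset on the S-box input, and that every input value occurs at least once in each group. Point selection by ANOVA is not shown, and all names are illustrative.

```python
import math
from collections import defaultdict
from statistics import mean


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def class_means(observations) -> list[float]:
    """Mean leakage for each of the 256 S-box input values."""
    buckets = defaultdict(list)
    for value, leakage in observations:
        buckets[value].append(leakage)
    # statistics.mean raises StatisticsError if some value never occurred.
    return [mean(buckets[v]) for v in range(256)]


def scc_scores(group_a, group_b) -> list[float]:
    """Correlation between the two sets of means for each guessed XOR distance."""
    means_a = class_means(group_a)
    means_b = class_means(group_b)
    return [
        pearson(means_a, [means_b[v ^ guess] for v in range(256)])
        for guess in range(256)
    ]
```

When both groups come from the same key byte, guess 0 (the unpermuted case) should dominate if the selected sample leaks the S-box input.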
Our evaluation setup uses a Sasebo-GII board and a Signatec PX14400A PCI-E card for data acquisition. The signal is taken from the 1 Ω supply-side sense resistor on the Sasebo-GII and connected through a Mini-Circuits BLK-89-S+ DC blocker, a Mini-Circuits BLP-150+ LPF, and a Mini-Circuits ZFL-1000+ amplifier before driving the input of the Signatec card, which has a sample rate of 400 MS/s and 14 bits of resolution.
The design operates at 24 MHz. Our evaluation harness performs 2,000 consecutive AES operations with data obtained from and stored to buffers on the FPGA. The design provides a trigger signal concurrent with the start of the first AES operation. This signal is used as an external trigger for the Signatec card. To ensure that the 400 MS/s sample rate does not impact the alignment quality when analyzing our traces, we use a technique similar to that of [10] to achieve sub-sample alignment resolution.
Results from a Single S-box Design
Prior to presenting results from the full AES implementation, we present results from a simplified design. The simplified design maintains the 128-bit parallel datapath of the full AES implementation, but replaces 15 of the 16 S-boxes with passthroughs. We chose this approach, rather than a true 8-bit datapath AES, to focus on leakage from the LMDPL S-box as opposed to leakage from registers.
We first disabled the mask generator and collected waveforms from 10,000 encryptions. For each encryption, we chose with even probability between the fixed plaintext and a random plaintext. Fig. 5 shows several analyses of these traces. A sample-wise plot of the t statistic (a) immediately indicates that the design is leaking. We selected sample 71, the point in the first round with the greatest |t| value, for further analysis. For this design, slightly over 300 traces are needed before the |t| > 4.5 threshold is reached for sample 71 (b). We then performed an SCC attack at sample 71. For this evaluation, between 1,000 and 1,500 traces are needed before the correct guess becomes dominant (c).
We next enabled the mask generator and collected 100,000,000 traces, again choosing evenly between a fixed plaintext or a random plaintext for each encryption. Fig. 6 presents analysis of these traces. With the masking enabled, the t statistic does not exceed the |t| > 4.5 threshold with 100M traces (b), demonstrating that the first-order masking is effective. However, the design exhibits second-order leakage, as can be seen by using the t statistic to compare the squared residuals between the two groups (c).
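The second-order comparison mentioned above can be sketched at a single time sample. The sketch assumes that the residuals are taken about each group's own mean, which is one reasonable reading of "squared residuals" but is not spelled out in the text. The names are illustrative.

```python
import math
from statistics import mean, variance


def welch_t(x, y) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))


def squared_residuals(x) -> list[float]:
    """Squared deviations from the group's own mean (assumed centering)."""
    mu = mean(x)
    return [(v - mu) ** 2 for v in x]


def second_order_t(x, y) -> float:
    """t statistic comparing the variance-like behavior of the two groups."""
    return welch_t(squared_residuals(x), squared_residuals(y))
```

Two groups with identical means but different spreads give a first-order t of zero while the second-order statistic is large, which is the situation reported for the masked single S-box design.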
We used the 50,000,000 random traces out of the same data set to develop an attack exploiting the second-order leakage. We sorted the traces based on the output from each of the 36 non-linear gates in the S-box. The difference in variance due to the value of a single bit is smaller than the difference that arises when the entire plaintext is fixed, but it is still detectable. We examined all 36 candidates (d) and selected for the attack the bit and time sample combinations with the largest |t|. The first candidate, bit 25 at sample 92, does not result in selection of the correct key guess with 50 million traces (e). The second candidate, bit 25 at sample 99, does result in the selection of the correct key guess with 50 million traces (f). First- and second-order versions of our SCC attack on this design were not successful (g,h).
Results from a Parallel Design
One possible strategy to improve upon the resistance of the single S-box implementation would be to incorporate higher-order masking. However, in low-noise environments, the security benefit of higher-order masking is limited [25]. With this in mind, we explored the resistance of an AES-256 implementation performing SubBytes on the entire round state in parallel. We again measured the design with masking disabled as a baseline. For the parallel design we collected 100,000 traces. For each trace, we chose randomly between the fixed plaintext or a random plaintext. Fig. 7 shows our analysis of these traces. Fig. 7(a) is a plot of the FVR t statistic versus the sample index, and as with the corresponding plot for the serial implementation, provides immediate evidence that the design is leaking. Fig. 7(b) plots the t statistic between the fixed and random traces at sample 441 (the sample with the largest absolute t value), and shows that fewer than 50 traces are needed before the |t| > 4.5 threshold is reached. Fig. 7(c) shows the results of our SCC attack at sample 71. The correct key guess becomes dominant after about 10,000 traces.
Finally, we enabled the mask generator in our parallel design and collected 200,000,000 traces. Fig. 8 shows our results. The first-order FVR t has only marginally exceeded the |t| > 4.5 threshold with 200,000,000 traces. In contrast with the serial implementation, where the second-order t statistic reached significantly larger values than the first-order t, the second-order t for the parallel implementation reaches only slightly larger values than the first-order t. The spike at the end of the second-order analysis in Fig. 8(b) is due to the final masked output, and mask, being manipulated at the end of the calculation, and does not represent a leak of sensitive information. We performed the SCC attack on this design, and it was not successful (c,d).
Fig. 8(e) is shown to demonstrate a technique that we use to investigate the behavior of our designs and to verify that our data collection is correct. The masked implementation used for evaluation allows re-seeding of the PRNG with an externally-supplied per-encryption seed. This allows us to compute the values of circuit intermediates that are a function of the mask, which would normally not be predictable by an attacker. The figure shows the correlation between the current measurement and the Hamming distance between the round one and round two masks. Because the mask values for successive rounds overwrite each other in the mask share logic, a strong correlation is expected and is indeed present. We additionally note that the memory effect [16] is clearly visible here. The register update occurs at the time of the initial downward spike around sample 83. A strong correlation exists for around 50 samples (3 clock cycles) after the register update, and a weak correlation persists throughout the encryption.
Conclusion
In this work, we propose the use of a path-based model for the leakage from combinational circuits. Unlike traditional methods that focus on the settled values of circuit nodes, activity image analysis considers ways that data-dependent behavior can accumulate as transitions propagate through combinational logic.
We also present LMDPL, a new technique for gate-level masking. LMDPL compares competitively or favorably with previous techniques on multiple metrics. Furthermore, LMDPL does not require routing constraints, and does not require that sequential elements or enable signals be used to delay the propagation of signals through the circuit.
A Activity Image Analysis Example
Table 3 presents an activity image analysis of iMDPL in tabular form. Each row describes one activity image. Columns to the left of the double line represent states observed at the output of each gate in the circuit. The columns to the right of the double line are labeled with a value $x^*$ of the secret $x = x_m \oplus m$, and entries in those columns report the count of observations of that row's activity image when x takes the value $x^*$. In this example there are eight possible inputs to the circuit, corresponding to the two possible values for each of $x_m$, $y_m$, and $m$. Evaluation of a circuit for a given input may exhibit multiple activity images.
We define a circuit to be balanced under the activity image metric if, for any value $x^*$ that the secret x may take, the conditional probability that $x = x^*$, given that some particular activity image was observed, is the same as the unconditional probability that $x = x^*$. In other words, the observation counts in each row of the table must have the same proportion as the global probabilities of the associated $x^*$ values. In the case of iMDPL, Pr{x = 0} = Pr{x = 1} = 0.5, so the requirement is that the counts in each row be equal. The iMDPL circuit is not balanced, as the observation counts are different in 6 of the 10 rows. The same analysis for the LMDPL circuit (omitted for space reasons) shows that it is balanced under this metric.
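The balance condition lends itself to a mechanical check. The sketch below assumes that the analysis has already produced, for each activity image, a tuple of observation counts ordered by secret value, together with the prior probability of each value. The table contents in the usage note are hypothetical and are not the entries of Table 3.

```python
from fractions import Fraction


def is_balanced(table: dict, priors: tuple) -> bool:
    """Check that each row's counts are proportional to the prior probabilities.

    table maps an activity-image label to a tuple of counts, one per secret value.
    priors gives Pr{x = x*} for each secret value, in the same order.
    """
    for counts in table.values():
        total = sum(counts)
        if total == 0:
            continue  # an image that was never observed imposes no constraint
        for count, prior in zip(counts, priors):
            if Fraction(count, total) != prior:
                return False
    return True


def unbalanced_rows(table: dict, priors: tuple) -> list:
    """Labels of the rows that violate the balance condition."""
    return [label for label, counts in table.items()
            if not is_balanced({label: counts}, priors)]
```

With uniform priors `(Fraction(1, 2), Fraction(1, 2))`, a hypothetical table `{"img0": (2, 2), "img1": (3, 1)}` is reported as unbalanced because of row `img1`.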
B Details on implementing AES with LMDPL
As discussed in Section 3.1, LMDPL requires the structure of the non-linear gate be preserved with don't touch constraints (sometimes called "keep" constraints).
For an ASIC design, library cells implementing the elemental functions may be instantiated in the HDL description, and a don't touch attribute applied to the instantiations. A common and simple way to do this is to use a distinguishing prefix in the instance names, and use a wildcard pattern to identify for the tool the cells not to touch. For either an ASIC or an FPGA, the elemental functions (AND/OR/NAND) used in the gate may be placed in a dedicated module, and a hierarchy-preserving attribute or directive applied to that module. For the Virtex-5 FPGA, hierarchy preservation attributes limited the amount of packing the place and route tools would perform. We obtained better results by applying net preservation directives to the interface of the modules implementing the elemental functions, or to the interface of the module implementing the LMDPL gate. For example, preserving the interface of the dual-rail XOR (a 4-input, 2-output function) allows it to be packed in a single dual-output LUT. Similarly, appropriate constraints enable the eight AND gates of the non-linear gate to be packed pairwise into four dual-output LUTs.
Our AES implementation incorporates an optimized inversion circuit, which uses functions other than AND for some of the 36 non-linear gates. We created a Liberty-format library description containing cells of unit area implementing the XOR and XNOR functions, and cells of ten units area implementing each of the non-linear two-input Boolean functions. We then used Synopsys Design Compiler to map the normal-basis $\mathrm{GF}(2^8)$ inversion onto this library. The netlist from Design Compiler contained 37 non-linear gates rather than the expected 36; however, inspection revealed that two of the non-linear gates could be combined with minor rearrangement of neighboring XORs to achieve a 36-gate implementation. This optimized circuit was used as the basis for translation to LMDPL mask and masked data share implementations.
