Abstract: The latest evaluation of the state-of-the-art iMDPL logic style has shown little information leakage compared to its predecessor MDPL. Concurrently, new advanced power analysis attacks specifically targeting iMDPL have been proposed. Up to now, these attacks have been purely theoretical and have not been applied to an implementation. We present a comprehensive analysis of iMDPL, backed by real measurements collected from a 180 nm iMDPL prototype chip. We thoroughly study the extent of the remaining information leakage of iMDPL by applying all relevant attacks. Our investigation shows the vulnerability of the target device, a standalone AES core, to several of the advanced attack methods. In comparison to conventional power analysis attacks, the advanced attacks need fewer power measurements to obtain meaningful results. With the help of logic level simulations, routing imbalances between complementary mask trees are identified as a major source of leakage.
even trusted platform modules (TPMs) [31]. Many different approaches to building countermeasures against PA have been proposed. A recent overview is given in [17]. Most concepts apply either masking, where intermediate states are processed only after addition of a random mask, or hiding, which aims at reducing information leakage to make its extraction harder. Many of the more recent proposals combine both concepts at the same time. A combination of different methods is usually desirable, as no single countermeasure has been found that can completely prevent PA. However, effective countermeasures usually have a strong impact on the implementation cost. They either significantly decrease throughput or increase the area of the protected cryptographic implementation [17].
The effectiveness of countermeasures is evaluated by determining the remaining information leaked by the power consumption. Since the discovery of PA in 1999 by Kocher et al. [14] , a lot of effort has been put into improving analysis methods. The most popular approaches include correlation power analysis attacks (CPA) [1] , template attacks [4] , and mutual information analysis (MIA) [9] . Recent results show that especially the latter and stronger approaches uncover more or less the same leakage [32] . This work presents a thorough analysis of DPA-resistant logic styles, which belong to the strongest methods of preventing PA attacks.
History of DPA-Resistant Logic Styles: During the last decade, several DPA-resistant logic styles have been proposed. The common goal of these logic styles is obvious: to complicate or even to prevent DPA attacks on cryptographic hardware implementations. Basically, we distinguish between two approaches to implement such logic styles. In the first approach specially designed logic cells have to be built, which requires an immense effort that has to be repeated for every process technology. Examples for this approach are SABL [27] , RSL [26] and its successor DRSL [5] , and TDPL [2] . In the second approach the DPA-resistant logic styles are based on available standard cell libraries instead of newly developed logic cells. The effort to achieve invulnerability against DPA attacks is shifted from the transistor level towards the cell level. This approach involves cell-level masking schemes presented in [30] and [10] , as well as dual-rail logic styles like WDDL [29] or MDPL [23] . The security analysis of the masking approaches in [18] and [26] revealed that the occurrence of glitches significantly reduces the resistance against PA attacks.
The logic styles WDDL and MDPL are based on the dual-rail precharge (DRP) principle, which prevents the occurrence of glitches. In a DRP logic style each signal (i.e., single-bit data line) is represented by two complementary wires in the circuit. In the precharge phase (first half of a clock cycle) the levels of both wires of every signal are LOW; in the evaluation phase the level of only one of the two wires is HIGH, depending on the value of the corresponding signal. A pure DRP logic style like WDDL depends on balanced complementary wires, i.e., the electrical characteristics (resistance, capacitance, inductance) of two related wires have to be identical in order to draw exactly the same amount of energy in every clock cycle. In theory, the power consumption of a WDDL circuit would then be independent of the data values processed, and hence PA attacks would be infeasible. In practice it is very hard, if not impossible, to achieve an exact balancing of two wires. Even if all possibilities of the most powerful design tool are utilized, small process variations in the fabrication of a chip will again introduce differences in the complementary wires.
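The DRP principle described above can be illustrated with a toy model. The following is only a sketch under simplifying assumptions: the function names and the capacitance values are hypothetical, and energy is approximated by the capacitance of the wire that is charged in the evaluation phase.

```python
# Toy model of a dual-rail precharge (DRP) signal. All names and the
# capacitance values are illustrative assumptions, not taken from the chip.

def drp_cycle(bit):
    """Return wire levels (true_rail, false_rail) for both clock phases."""
    precharge = (0, 0)            # both wires are LOW in the precharge phase
    evaluation = (bit, 1 - bit)   # exactly one wire is HIGH in the evaluation phase
    return precharge, evaluation

def rising_edges(bit):
    """Count LOW-to-HIGH transitions from precharge to evaluation."""
    precharge, evaluation = drp_cycle(bit)
    return sum(1 for p, e in zip(precharge, evaluation) if p == 0 and e == 1)

def energy(bit, cap_true=1.0, cap_false=1.0):
    """Energy proxy: capacitance of whichever wire is charged."""
    return cap_true if bit else cap_false

# One rising edge per cycle, regardless of the data value.
transitions = {bit: rising_edges(bit) for bit in (0, 1)}
# Balanced wires: identical energy for both data values.
balanced = {bit: energy(bit) for bit in (0, 1)}
# Slightly unbalanced wires: the energy now depends on the data value.
unbalanced = {bit: energy(bit, cap_true=1.0, cap_false=1.1) for bit in (0, 1)}
```

With balanced wires the energy proxy is the same for both data values. Any mismatch between the two capacitances makes it data dependent, which is the practical weakness of a pure DRP style.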
The MDPL style follows a different approach. It combines the properties of cell-level masking and the strength of DRP logic styles (i.e., the nonexistence of glitches). Every operation within a device only processes masked data, i.e., the power consumption of the device only depends on the masked data. Hence, an attacker cannot discover information about unmasked data (e.g., a secret key). Furthermore, the masking approach bypasses the need of perfectly balanced wires in the DRP logic style. Considering these characteristics, the masking technique is safe from glitches, and the complementary wires in the DRP circuit do not have to be balanced since all data values are masked in every clock cycle with a random mask value. In theory, this combination should give MDPL high resistance against DPA attacks.
Further research in [15] and [25] indicates that MDPL might suffer from an effect called early propagation. Similar to glitches, early propagation significantly reduces the DPA resistance of masked logic styles, as confirmed by a detailed investigation of early propagation of MDPL in [22] . The study shows that early propagation causes a temporary data dependent switching behavior of MDPL gates.
The authors extended the MDPL gates with an evaluation-precharge detection unit (EPDU) that tries to prevent the occurrence of early propagation and introduced an improved version called iMDPL. The EPDU generates an enable/disable signal that: 1) prevents a gate from evaluation until all input signals at the gate are valid and 2) forces a gate to precharge when the first input signal leaves its valid state. An iMDPL-AND cell is depicted in Fig. 1. The detection unit generates the enable/disable signal and the SR-latch stage controls the beginning of the evaluation/precharge phases.
Related Work: An iMDPL implementation has been thoroughly investigated in [13]. It has been shown that the impact of early propagation is significantly reduced and that the DPA resistance of iMDPL is increased compared to unprotected CMOS. In recent years, masked dual-rail logic styles such as MDPL and iMDPL have drawn much interest from the research community. A specialized way to recover leakages is probability density function (pdf) analysis, originally proposed in [28] and later adapted to target MDPL by means of a folding technique using pdf in [24]. The approach exploits the leakage caused by capacitance imbalances of dual-rail gates, which lead to different probability distributions of the power values when a masked dual-rail combinational circuit processes the input with different masks. Simulation results of [24] suggest a symmetry of the two distributions, which allows an attack on the measurements folded around the mean. Further attacks focused on analyzing the leakage of storage elements, e.g., flip-flops, of the masked dual-rail logic styles [19], [21]. In [21], Moradi et al. show how to perform classical DPA attacks on the masked flip-flops when the masks have been recovered by means of an SPA attack. A modified power model based on folded measurements, i.e., difference of Hamming weights (DHW), is proposed in [19]. Both of the latter attacks were, like the aforementioned pdf analysis and the folding attack using pdf, performed using simulated power traces.
Recently, an enhanced collision PA attack has been presented in [20] which is capable of revealing the linear difference between key bytes by comparing the leakages of a combinational circuit in two instances of time. While this attack is not specific for iMDPL, it has been found to ignore a certain type of masks in the original publication.
Contribution: In spite of the thorough recent analysis of iMDPL, the reason for its remaining side-channel leakage is not yet fully understood. Recently proposed iMDPL-specific attacks have not yet been practically verified. This work analyzes the leakage of iMDPL by applying all relevant and recently proposed analysis methods to an actual iMDPL prototype chip. The results remove any doubts left on the remaining leakage in the iMDPL logic style. We identify the unbalanced capacitive load of the mask tree as a source of leakage and show how to efficiently exploit this leakage. Our practical investigations are initiated by studying the classical PA attacks on the iMDPL prototype chip, including CPA and MIA. We apply different power models in different levels of abstraction. Results imply that iMDPL circuits are not only vulnerable to those PA attacks which use sophisticated power models, e.g., toggle count. Finally, new attack concepts that help overcoming the protection offered by iMDPL are presented and verified. We demonstrate that generic attacks such as the enhanced collision attack are able to recover all possible linear relations between key bytes using as few as 200 000 actual measurements.
Organization: The details of the target chip are presented in Section II. In Section III we track down the source of the remaining data leakage and show that some parts of the circuit are highly vulnerable to straightforward PA attacks. We then explore the leakage of the mask bit itself in Section IV. Our study of the newly proposed attacks specific to MDPL/iMDPL suggests their impracticality. We provide theoretical reasoning in addition to practical evidence on the impracticality of the DHW/DHD attack in Section V. The pdf analysis and the folding attack using pdf are investigated in Section VI. We discuss the efficiency of correlation-enhanced collision attacks in Section VII. This paper concludes in Section VIII.
II. HARDWARE ARCHITECTURE
Our investigations have been performed on the GRANDESCA chip, which is an outcome of the FIT-IT project GRANDESCA: "Generating RANDom values for Encryption in the presence of Side Channel and other Attacks" [8] from 2008. The purpose of the GRANDESCA chip was the investigation of masked logic styles in practice to improve the resistance against power analysis attacks. The chip has been produced in a 180-nm CMOS process technology from UMC.
The GRANDESCA chip contains two types of cores implemented in two different logic styles. Fig. 2 shows the architecture of the chip, which consists of the following parts: two PCores, two AESCores, control logic to select an active core, a pseudo-random number generator (PRNG) that provides the mask value for the iMDPL cores, access to a program memory (PROM) and to an external memory (XRAM), as well as a serial interface and an 8-bit parallel I/O port for the communication with the chip.
Each PCore contains an implementation of an 8051-compatible microcontroller and an AES crypto module that is used as a coprocessor. The microcontroller features 128 bytes of internal memory, a serial interface, and an 8-bit parallel I/O port. The AES crypto module supports AES-128 encryption/decryption in ECB mode, and the implementation follows the standard version of the AES hardware architecture presented by Mangard et al. in [16] .
Each AESCore contains a stand-alone implementation of the same AES crypto module and a controller unit communicating with the crypto module over an AMBA APB bus and providing a serial interface for the communication with a PC. Furthermore, the controller unit provides an 8-bit parallel output port. The floorplan of the GRANDESCA chip is shown in Fig. 3 . The two unprotected implementations in plain CMOS are placed top right, the stand-alone AES processor implemented in iMDPL is placed top left, and the 8051-microcontroller including the AES coprocessor implemented in iMDPL takes approximately two thirds of the whole area from the bottom. The PRNG and some glue logic are placed in between the four cores.
We describe the AESCore in more detail, as most of our investigations have been performed on the stand-alone AESCore implemented in iMDPL. Fig. 4 shows the datapath of the AES module. The AES-128 architecture implements the logical 4 × 4 state layout of AES in hardware. As mentioned, the implementation of the AES module follows the standard version of the AES hardware architecture presented in [16]. The main difference to the standard AES version of [16] is the single MixColumns module attached to the left-most column of the 4 × 4 AES state. A barrel shifter implements the ShiftRows operation and the four S-boxes are combinational and one-stage pipelined implementations as described by Wolkerstorfer et al. in [33].
Initially, the AES state values are loaded column-wise from the right within 4 clock cycles and each consecutive AES round takes 9 clock cycles. In a first step, each of the 16 AES state cells applies the AddRoundKey operation. In the following 5 clock cycles the AES state rows are shifted down through the S-boxes and the ShiftRows operation. Afterwards, the AES state columns are shifted to the left through the MixColumns module for 4 clock cycles. During the MixColumns operation also the next AES round key is calculated. Within the last MixColumns cycle also the next AddRoundKey operation is applied. When all AES rounds are finished, the final AES state values are read out to the left.
Measurement Setup: In order to evaluate the prototype chip, a custom-designed evaluation board has been developed, which implements all the recommendations given in [17] for building a suitable measurement setup for power analysis attacks. The power consumption traces are measured using a LeCroy WP715Zi 1.5 GHz oscilloscope and a suitable differential probe capturing the voltage drop over a 39 Ω resistor in the VDD core (2.5 V) supply of the chip. Unless stated otherwise, the measurements are performed at a sampling rate of 2.5 GS/s and the clock signal is provided by a 1.8432 MHz oscillator.
III. CLASSICAL POWER ANALYSIS ATTACKS
In the following we present a detailed investigation of our iMDPL prototype chip. First, we provide a complete exploration of the remaining information leakage of an iMDPL circuit based on logic level simulations in combination with real measurements. In the second part of this section we perform conventional PA attacks exploiting the information leakage of an iMDPL AES core. We further combine these PA attacks with a simple integral preprocessing scheme and briefly outline the results.
A. Exploring the Leakage of iMDPL
Security evaluations of iMDPL [13] have shown that the vulnerability against PA attacks caused by early propagation has been significantly reduced compared to MDPL. However, the results also show that there is still a side-channel leakage in iMDPL and that an iMDPL implementation can be broken by CPA attacks if an attacker uses a sufficient number of power traces. We investigated iMDPL in more detail by means of logic level simulations including back-annotated delay information to discover the reason for the remaining side-channel leakage. Furthermore, we verified our investigations by comparing PA attacks on real measurements with PA attacks on simulated power traces derived from toggle counting [12]. The simulated power traces have been obtained by analyzing the value change dump (VCD) file of a logic level simulation with Cadence NCSim [3]. In the analysis of the VCD file, the transitions of all signals have been summed up for each point in time, whereby each signal transition (0 → 1 or 1 → 0) causes a power consumption of 1 and each signal keeping its logical state (0 → 0 or 1 → 1) causes a power consumption of 0.
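The toggle-count model described above can be sketched in a few lines. This is a simplified illustration, not the VCD analysis tool used for the paper: it assumes the signal values have already been extracted into equally long per-signal lists (one entry per time step), and the function and signal names are hypothetical.

```python
def toggle_count_trace(signals):
    """Build a simulated power trace by toggle counting.

    signals: dict mapping a signal name to its list of logic values at
    successive time steps (an assumed, pre-parsed stand-in for VCD data).
    Returns a list where entry t-1 is the number of signals whose value
    changes between time step t-1 and time step t.
    """
    lengths = {len(values) for values in signals.values()}
    if len(lengths) != 1:
        raise ValueError("all signals must cover the same time steps")
    n_steps = lengths.pop()
    trace = [0] * (n_steps - 1)
    for values in signals.values():
        for t in range(1, n_steps):
            # A transition (0 -> 1 or 1 -> 0) costs 1, a stable signal costs 0.
            if values[t] != values[t - 1]:
                trace[t - 1] += 1
    return trace

# Three hypothetical signals observed over four time steps.
example = {
    "a": [0, 1, 1, 0],
    "b": [0, 1, 1, 1],
    "c": [1, 1, 1, 1],
}
power = toggle_count_trace(example)  # [2, 0, 1]
```

Restricting the `signals` dictionary to the nets of one submodule gives the per-module toggle count that is useful for localizing a leakage source.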
Due to the complexity and the long simulation time of several encryption runs in the AES module of the GRANDESCA design, we decided to perform simulations and analysis on MOV instructions in the 8051-compatible microcontroller. In this scenario, a known random byte value is stored in a register in the internal memory of the microcontroller. Such an operation is quite comparable to the common attack scenario on the AES module: the storage of the SubBytes transformation output into a register. We started our investigation of iMDPL by comparing a CPA attack based on power and EM measurements with a CPA attack based on simulated power traces. Fig. 6 presents the results of the CPA attacks predicting the HW of the stored byte value. Fig. 6(a) shows the results of CPA attacks on 30 000 measured power traces (plotted in gray) and on 30 000 measured EM traces (plotted in black). At the top of the figure the trigger signal (TRG) and the clock signal (CLK) are indicated in gray. It can be seen that the correlation result obtained from the power traces is highly blurred over time compared to the result obtained from the EM traces. This is caused by several factors such as the power supply grid of the chip (which behaves like a higher-order R-L-C network) and noise induced in the power lines of the chip as well as in the measurement resistor connected in series to the chip. The correlation result obtained from the EM traces shows a significantly better resolution over time, and thus the EM result matches very well with the result obtained from the CPA attack on the 256 toggle-count traces shown in Fig. 6(b). The two correlation peaks in the sixth clock cycle after the trigger signal goes HIGH (around 4.5 µs) can clearly be seen in both figures. The reason for the peaks in the other clock cycles in the simulated result is that only 256 simulated power traces have been used for the CPA attack (cf. [17, ch. 6.4.1]).
Based on the good comparability of the simulations and the real measurements we started to analyze the simulation results in more detail. Our VCD analysis technique allows us to focus only on specific submodules within the whole design during the process of toggle counting. As a consequence we found three modules in which information leakage occurs: the memory module containing the registers, the arithmetic logic unit (ALU) of the 8051-microcontroller, and the memory controller containing major data multiplexers. For further investigations we focused on a multiplexer in the memory controller handling the data flow to and from the internal memory. We discovered that the information leakage in iMDPL is caused by imbalances between the complementary rails in the logic style (i.e., different wire lengths and different electrical characteristics of the wires). Besides causing slight differences in the power consumption, the imbalances also influence the signal delays of the complementary wires. This effect causes different arrival times of signals at the complementary iMDPL gates, leading to a data-dependent switching behavior of the gates. But, contrary to MDPL and early propagation, the differing signal delays of the complementary wires are inconsistent. Looking at the data signals on two complementary wires, the delay time between them can be positive, negative, or even very close to zero when the data value or the mask value changes. The signal delays in an 8-bit memory multiplexer in the 8051-compatible microcontroller in different cases are shown in Table I. According to Table I, Fig. 7 shows a symbolic iMDPL multiplexer (MUX) and depicts the timing relationships between the masked input signals, the masked select signal, and the masked output signal.
In this figure we can see the following: the masked data inputs, originating from flip-flops, arrive at the MUX shortly after the falling clock edge (the beginning of the evaluation phase), possibly at slightly different times due to differences in their paths. The select signal, originating from a rather large combinational circuit, arrives considerably later at the MUX and shows a significant timing dependency on the mask value. According to the structure of the iMDPL gates, the MUX only evaluates on arrival of the select signal, which consequently affects the timing of the data signals. Fig. 7 also depicts the original source of the mask-dependent switching behavior of the MUX: as mentioned, the select signals originate from a combinational circuit (bottom left in Fig. 7). The signals feeding this combinational circuit originate from flip-flops and start evaluating shortly after the falling clock edge. The mask signals, however, arrive some time later, because the mask generator in iMDPL provides a new mask with the falling clock edge and the mask signals require some time to propagate through the rather large mask trees. More importantly, as it is not a requirement when implementing iMDPL, the timing of the two complementary mask trees is not balanced, and hence the arrival time of the mask signals at their destinations is closely related to the mask value itself: it takes one value in case the mask bit is 0 and a different value in case it is 1.
In summary, it can be said that in early combinational stages the timing difference between the mask signals is transferred to data and/or control signals. In our example a select signal of a MUX is affected and the timing difference is transferred further to the datapath carrying sensitive data signals. This scenario finally results in a side-channel leakage. Furthermore, the differences in wire length and electrical characteristics are not consistent and may deviate from case to case, i.e., the signal delays may differ from each other in terms of the algebraic sign and/or the value. This is the reason why the leakage in iMDPL in a typical case is smaller compared to the leakage in MDPL.
B. Exploiting the Leakage of iMDPL
An important issue when performing PA attacks is how to choose an appropriate selection function. The selection function is a part of the intermediate value of the computation performed by the target device, i.e., the iMDPL AES core in this case. The selection function should be related to a small part of the secret key to allow examining all possible hypotheses for the secret key part in a feasible time. Moreover, the degree of nonlinearity of the relation between the selection function and the secret key part affects the efficiency of the PA attacks. The more nonlinear the relation is, the fewer ghost peaks may appear in the result of the PA attacks [1]. In the case of the AES encryption scheme, the S-box output is a straightforward and common intermediate value for PA attacks. The S-box output has a highly nonlinear relation with the corresponding byte of the secret key. It depends only on 8 bits of the secret key, allowing the PA attacks to search in a space of 2^8 = 256 key hypotheses for each secret key part. On the other hand, the HW of the intermediate values is a very common model to predict the instantaneous power consumption of hardware. Therefore, we consider the HW of the S-box output in most of the PA attacks which are presented.
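To make the role of the selection function concrete, the following sketch runs a CPA with the HW of the S-box output as power model on simulated single-sample leakage. It is a minimal illustration under stated assumptions and not the evaluation code used for the measurements: the leakage is synthetic (HW plus Gaussian noise), and `SECRET`, the noise level, and the trace count are hypothetical values chosen for the example. The S-box is computed from its definition (inversion in GF(2^8) followed by the affine transformation).

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result

def rotl8(x, shift):
    """Rotate a byte left by `shift` bits."""
    return ((x << shift) | (x >> (8 - shift))) & 0xFF

def sbox_entry(x):
    """AES S-box: multiplicative inverse followed by the affine map."""
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
    out = inv
    for shift in range(1, 5):
        out ^= rotl8(inv, shift)
    return out ^ 0x63

SBOX = [sbox_entry(x) for x in range(256)]
HW = [bin(v).count("1") for v in range(256)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def cpa(plaintexts, leakage):
    """Correlate the HW(S-box) prediction of every key guess with the leakage."""
    scores = []
    for guess in range(256):
        model = [HW[SBOX[p ^ guess]] for p in plaintexts]
        scores.append(abs(pearson(model, leakage)))
    best = max(range(256), key=scores.__getitem__)
    return best, scores

# Synthetic experiment: all values below are assumptions for illustration.
rng = random.Random(1)
SECRET = 0x3C
plaintexts = [rng.randrange(256) for _ in range(1000)]
leakage = [HW[SBOX[p ^ SECRET]] + rng.gauss(0, 2.0) for p in plaintexts]
best_guess, scores = cpa(plaintexts, leakage)
```

On real traces the same correlation would be computed for every sample point of the trace, and the key guess with the most pronounced peak would be selected.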
We have investigated the vulnerability of the standalone iMDPL AES core of our chip to classical PA attacks. In a first step, CPA attacks using a power model predicting the HW of an S-box output have been performed on 200 000 traces. The results of these attacks on bytes 2 and 12 are shown in Fig. 8. As can be seen in Fig. 8(a), a sharp peak corresponding to the correct key byte guess appears at around 5 µs, while the correct hypothesis for key byte 12, in Fig. 8(b), is not distinguishable amongst the others. It should be noted that performing the same attacks using single-bit power models, which are similar to a classical DPA attack [14], revealed that the sharp peak corresponds mainly to the leakage of one bit of the S-box output, here the MSB. Interestingly, the attacks on bytes 6, 10, 11, 14, and 16 succeed in a similar way as the attack shown in Fig. 8(a), and the correct key byte hypothesis becomes distinguishable by a sharp peak. Inspecting the hardware architecture of the target core revealed that the corresponding S-box outputs of these key bytes all appear at the state byte 2 indicated in Fig. 4. According to the findings by means of the simulations of the 8051-microcontroller core (see Table I), it can be concluded that state byte 2 suffers from very strong imbalances in the datapath of the AES circuit and thus causes stronger information leakage than the other state bytes.
Furthermore, results of MIA attacks using different hypothetical models, e.g., HW of the S-box output, on the same measured traces have been investigated. In fact, the results are approximately the same as the results of the CPA, i.e., key bytes 2, 6, 10, 11, 14, and 16 are recoverable. Fig. 9 shows the result of MIA attacks on key bytes 2 and 12 using a HW model.
Since the combinational logic realizing the S-boxes consists of several iMDPL gates, propagating the signals through these gates during either evaluation or precharge phase takes more time than in a similar CMOS circuit. In order to get the integral of power values in a clock cycle one can define a window to sum up adjacent points of the measured power traces and perform the attacks after the integration. We have used a window with a length of 176 ns to get the integration, and have repeated the above mentioned attacks on the preprocessed traces. The results of CPA attacks, which are shown in Fig. 10 , confirm that integration in this case helps increasing the distinguishability of the correct hypothesis. Although key byte 12 remains difficult to detect by a CPA attack as shown in Fig. 10(b) , performing MIA attacks on the same preprocessed traces, which are shown in Fig. 11 , led to a significant improvement in detectability of the correct guess. The same attacks have been separately performed targeting all key bytes. In short, MIA attacks could recover all correct key bytes.
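The integration preprocessing can be sketched as follows. Whether the window is applied as a sliding sum or in non-overlapping blocks is not specified in the text, so the sliding-window variant below is an assumption, as are the function names. The window width in samples follows from the 176 ns window and the 2.5 GS/s sampling rate of the measurement setup.

```python
def window_samples(window_seconds, sample_rate_hz):
    """Number of samples covered by a time window at a given sampling rate."""
    return round(window_seconds * sample_rate_hz)

def integrate(trace, width):
    """Sliding-window sum: out[i] = trace[i] + ... + trace[i + width - 1]."""
    if width <= 0 or width > len(trace):
        raise ValueError("window width must be between 1 and len(trace)")
    acc = sum(trace[:width])
    out = [acc]
    for i in range(width, len(trace)):
        # Add the sample entering the window, drop the one leaving it.
        acc += trace[i] - trace[i - width]
        out.append(acc)
    return out

# 176 ns at 2.5 GS/s corresponds to 440 samples.
WIDTH = window_samples(176e-9, 2.5e9)

# Small example with a window of two samples.
demo = integrate([1, 2, 3, 4, 5], 2)  # [3, 5, 7, 9]
```

The attacks are then run on the integrated traces exactly as on the raw ones.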
IV. MASK DETECTION
Since a single mask bit is used for the entire iMDPL circuit, the mask signal and its complement must be routed through all parts of the circuit. Hence, the capacitances of their wires are higher than those of the other signals, including the clock signal. If the designer does not specifically consider the effect of different routing of these two signals, their capacitances will differ, and their corresponding power consumption when switching from LOW to HIGH will consequently be distinguishable. This may give the attacker a way to recover the mask values. We investigated this potential leakage using simulation results and observed that the capacitances of the two mask trees are considerably different, as no constraints or special attention were applied to the routing of the two mask signals during the design of our prototype chip. Simulation of the back-annotated netlist of the prototype chip shows a distinct power consumption difference depending on which of the two mask trees switches from LOW to HIGH at the start of the evaluation phase. Since our prototype chip is able to turn off the mask generator and fix the mask bit, we have measured and compared the power consumption for the two fixed mask bit values. Two average power traces, each generated from 200 000 traces with the same mask bit value, are shown in Fig. 12. As expected, the slopes of the average traces at the start of the evaluation phase are quite different. In order to detect the mask value at each clock cycle, we have built the histogram of the average power consumption within a certain window, which is marked in Fig. 12. To find a suitable window length and window position we have checked most of the possible cases around the start of the evaluation phase, looking for two clearly distinguishable hills in the histogram when the mask generator is working in normal mode. Fig. 13 shows a histogram of all 200 000 clock cycles using the aforementioned window with a length of around 3 ns. It shows that the mask bit value of each clock cycle can be recovered.
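The classification of clock cycles by mask value can be sketched as a one-dimensional two-cluster split of the windowed average power. The data below is synthetic and the clustering routine is a generic two-means procedure chosen for illustration; the text does not state how the threshold between the two hills of the histogram was determined, and all names and numeric values here are assumptions.

```python
import random

def two_means_1d(values, iterations=20):
    """Split scalar values into two clusters with a simple 1-D two-means loop.

    Returns one label (0 or 1) per value; label 1 marks the higher cluster.
    """
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        threshold = (lo + hi) / 2
        low = [v for v in values if v <= threshold]
        high = [v for v in values if v > threshold]
        if not low or not high:
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    threshold = (lo + hi) / 2
    return [int(v > threshold) for v in values]

# Synthetic windowed averages: two hills, one per mask value (assumed shape).
rng = random.Random(7)
true_mask = [rng.randrange(2) for _ in range(2000)]
window_mean = [(1.0 if m else 0.0) + rng.gauss(0, 0.15) for m in true_mask]

labels = two_means_1d(window_mean)
agree = sum(l == m for l, m in zip(labels, true_mask)) / len(true_mask)
# The attacker cannot tell which cluster belongs to which mask value,
# so the grouping is judged up to a swap of the two labels.
accuracy = max(agree, 1 - agree)
```

The labels only group cycles that share a mask value. As noted in the text, which group corresponds to which mask value remains unknown to the adversary, and the subsequent attacks do not need that information.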
In a real attack scenario, the adversary can only observe the histogram, but cannot identify which part of the histogram corresponds to which mask value. However, an adversary can classify the clock cycles in which the same mask value has been used. In other words, the adversary can make two groups of power traces, each of which has the same mask bit value in the desired clock cycles. But how can this information help the adversary to recover the secret key? We have performed the same attacks illustrated in Section III on the classified traces. The results of a CPA attack targeting key byte 2 on two classified groups of power traces are presented in Fig. 14. The sharp peak that appeared without classification in Fig. 8(a) exists in both attack results of Fig. 14. However, a clear correlation which can be seen at around 3.5 µs for one mask bit value does not appear for the other mask bit value. Repeating the same scenario for key byte 12 led to the results shown in Fig. 15, in which a negative peak appeared for the correct guess for one of the two mask bit values. Performing the MIA attacks on the classified traces did not show an improvement when compared to the corresponding CPA results here.

Fig. 14. Results of CPA attacks predicting HW of the S-box output byte 2 on classified traces based on the detected mask bit using around 100 000 traces each.

Fig. 15. Results of CPA attacks predicting HW of the S-box output byte 12 on classified traces based on the detected mask bit using around 100 000 traces each.
These attacks can also be performed on traces that are preprocessed by means of integration after classifying them by the recovered mask bits. The results of CPA and MIA attacks both targeting key byte 12 on the preprocessed classified traces are shown in Fig. 16 and Fig. 17, respectively. Both attacks show an improvement for one of the two mask bit values in comparison to the case without classification, i.e., Fig. 10(b) and Fig. 11(b).
V. DHW/DHD
Many logic styles, including MDPL and iMDPL, mask the registers with the same single-bit mask. The difference of Hamming weight (DHW) and difference of Hamming distance (DHD) attacks [19] were designed to exploit the HW or HD leakages of n-bit registers that are masked with the same single-bit mask. The attacks are based on the observation that masking with a common bit either does not change the leakage (for a mask bit of 0) or inverts it (for a mask bit of 1). Hence, the leakage of the masked register is either the HW of the unmasked value or n minus that HW. Independently of the mask, the distance from the average leakage is unchanged.

An equivalent statement can be derived for the HD [19]. To exploit this mask-independent leakage, the DHW/DHD attacks estimate the above relation by approximating the average leakage by the average over all measurements and subtracting it from every individual power trace. The resulting folding of the power leakage is depicted in Fig. 18. The result should correlate sufficiently well with the above statement, enabling a straightforward DPA attack. While [19] shows that the DHW/DHD attack works well on simulated power traces, our results suggest that some assumptions implicitly made by the attack are usually not met by practical implementations.

Fig. 16. Results of CPA attacks predicting HW of the S-box output byte 12 on classified traces based on the detected mask bit using each around 100 000 preprocessed traces by means of the integration over a 176 ns window.

Fig. 17. Results of MIA attacks predicting HW of the S-box output byte 12 on classified traces based on the detected mask bit using each around 100 000 preprocessed traces by means of the integration over a 176 ns window.
In practice, the power consumption does not only depend on the target registers, but also on other registers, which might add a considerable amount of noise to the measurements. While the authors of [19] already mentioned the increased susceptibility to noise, an additional effect can hinder the attack even for strongly increased numbers of measurements. The above attack model makes two assumptions which will often not be met in practice. The overall mean power consumption does not necessarily coincide with the mean for a mask bit of 0 (which in turn does not have to coincide with the mean for a mask bit of 1). In fact, for our prototype chip we showed that the overall power consumption is higher for one of the two mask values, resulting in differing means. Hence, the overall mean does not coincide with either of the two mask-specific means, making the folding at the mean an unpredictable mapping where the folded leakages for the two mask values no longer coincide. An example of such a case is given in Fig. 19. In this generalized scenario the DHW/DHD model no longer generates useful results, as it no longer correlates with the power consumption. Causes for mismatching averages are not restricted to unbalanced mask trees.

Fig. 19. Distributions of the HW of an 8-bit value masked by a single mask bit, and the result of folding when there is a difference between the means of the two distributions. The DHW no longer correlates well with the power consumption.
In fact, when a DHW/DHD attack is performed in a real-world scenario, the mask-dependent leakage caused by sources which are not included in the selection function of the attack destroys the symmetry of the leakage distribution around the overall mean. For example, when attacking the iMDPL microcontroller core of our prototype chip, if the target byte of the attack is a register operand of the MOV instruction, all the control logic, including the status register, the program counter, and many further registers which should keep their previous values (since the mask changes every clock cycle), affects the difference between the mean power consumptions for the two mask values. Hence, the theory behind such an attack matches practice only if all masked parts, e.g., all masked registers, are included in the model, which is not feasible in the general case. We should mention that none of the DHW/DHD attacks we performed on the measured traces of our prototype chip were successful. We conclude that the practical impact of DHW/DHD attacks is limited due to the tight restrictions on their applicability.
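The effect of mismatching means on the DHW model can be illustrated with a small simulation. This is a sketch under strong assumptions: the leakage is a noise-free Hamming weight of an 8-bit value, and the mask-dependent contribution of all other circuit parts is modeled as a constant, hypothetical `mask_offset` added whenever the mask bit is 1. It does not reproduce the measured behavior of the prototype chip.

```python
import numpy as np

N_BITS = 8
N_TRACES = 20000
rng = np.random.default_rng(0)


def hamming_weight(values):
    """Hamming weight of each 8-bit value in a 1-D integer array."""
    bits = np.unpackbits(values.astype(np.uint8)[:, None], axis=1)
    return bits.sum(axis=1).astype(np.int64)


def dhw_correlation(mask_offset):
    """Correlation between folded simulated power and the DHW model."""
    x = rng.integers(0, 1 << N_BITS, N_TRACES)  # unmasked register values
    m = rng.integers(0, 2, N_TRACES)            # single mask bit per trace
    hw = hamming_weight(x)
    # Masked HW leakage plus a hypothetical mask-dependent offset.
    power = np.where(m == 1, N_BITS - hw, hw) + mask_offset * m
    folded = np.abs(power - power.mean())       # fold around the overall mean
    model = np.abs(hw - N_BITS / 2)             # DHW prediction
    return float(np.corrcoef(folded, model)[0, 1])


balanced = dhw_correlation(0.0)    # both mask values share the same mean
unbalanced = dhw_correlation(4.0)  # means differ by a constant offset
print(round(balanced, 3), round(unbalanced, 3))
```

In this toy model the correlation with the DHW prediction is close to 1 when the means coincide and drops markedly once they differ, which is the qualitative behavior described above.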
VI. FOLDING ATTACK
Folding attacks based on the probability distribution of the power consumption of combinational circuits were introduced in [28] and [24]. In theory, the attacks can be successfully applied to MDPL and iMDPL circuits. The scheme is based on the difference between the capacitances of dual-rail combinational circuits. Since the rail capacitances in a dual-rail gate are not exactly the same, the corresponding power consumptions also differ during switching, e.g., from LOW to HIGH. As mentioned before, using a single mask bit to randomly switch the rails was the fundamental idea behind MDPL. Due to the difference between the capacitances of an MDPL combinational circuit when the mask bit is "0" or "1", the probability distribution of the power consumption differs between the two cases. Fig. 20 shows the histogram of the simulated power consumption of a round-based implementation of AES obtained using a weighted toggle count model, as presented by Schaumont and Tiri in [24]. The histogram reveals distinguishable probability density functions of the power values of the combinational circuit for differing masks if the load capacitances of the dual-rail gates differ. Yet, allowing unconstrained placement and routing, which usually results in such imbalances, is one of the design goals of MDPL. Consequently, an adversary can classify the power traces according to the mask value used. Similar to what has been shown in Section IV, the adversary can thereby eliminate the effect of the randomness provided by the single mask bit and perform the attacks on traces which share the same mask value. Since the histogram shows that the shapes of the probability distributions are approximately symmetric, it is also proposed in [24] to fold the power traces around the mean and perform the attacks on the preprocessed traces.
Due to the promising results of the attacks on simulated power consumption, we tried to verify them in practice. As mentioned before, a toggle count model has been used to simulate the power consumption of a combinational MDPL circuit. Therefore, we computed the average of the measured power values within a specific window to obtain an estimate of the power consumption of the entire combinational circuit. Many different values for the window position and the window length were tested. We varied the window length from 1 ns up to 250 ns in steps of 1 ns, and moved the window from the start of the evaluation phase until the end of the next precharge phase. The best shapes, in which some distributions are distinguishable as shown in Fig. 21(a) and (b), are achieved by a window with a size of 176 ns placed 90 ns after the start of the evaluation phase, and 85 ns after the start of the precharge phase.
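The window search described above can be sketched as follows. This is a minimal illustration, assuming the traces are stored as a 2-D NumPy array with one row per trace and that window positions and lengths are given in samples rather than nanoseconds; the function names `window_mean` and `sweep_windows` are hypothetical.

```python
import numpy as np


def window_mean(traces, start, length):
    """Average every trace over the samples [start, start + length)."""
    return traces[:, start:start + length].mean(axis=1)


def sweep_windows(traces, lengths, starts):
    """Yield (start, length, per-trace means) for every window that fits."""
    n_samples = traces.shape[1]
    for length in lengths:
        for start in starts:
            if start + length <= n_samples:
                yield start, length, window_mean(traces, start, length)


# Toy data: two traces with six samples each.
traces = np.arange(12, dtype=float).reshape(2, 6)
results = list(sweep_windows(traces, lengths=[2, 3], starts=range(6)))
print(len(results))
```

For each window, a histogram of the returned per-trace means (for example with `np.histogram`) could then be inspected for distinguishable distributions.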
We saw in Section IV that the leakage of the mask tree rails is detectable at the start of the evaluation phase. Although this information is not exploitable by an adversary who has no control over the PRNG, we are able to use it to examine the relation between the mask value and the distributions in Fig. 21(a) and (b). The two distributions for the different mask bits are not distinguishable in the evaluation phase, as indicated in Fig. 21(c). Yet, Fig. 21(d) reveals that during the precharge phase a part of the histogram always belongs to the set of traces with one particular mask value. As a result, the adversary can collect a group of traces with a common mask bit, and subsequently mount the attacks addressed in Section IV. In other words, even if the mask tree rails are balanced and the mask bit does not leak in the way shown in Section IV, the adversary can still obtain more or less the same result by studying the probability distribution of the power values of the combinational circuit in the evaluation phase.
Although some probability distributions are distinguishable in the histograms, none of the shapes looks similar to the simulation results shown in Fig. 20. However, the results of the analysis of folded preprocessed traces are striking. The preprocessing step, as before, integrates the trace over a window of fixed size, i.e., 176 ns. Note that the preprocessing step is performed before folding around the mean. The results of both CPA and MIA attacks targeting key byte 12 using a HW model are shown in Fig. 22. Comparing these two figures with those shown in Fig. 10(b) and Fig. 11(b) confirms that folding around the mean after integration helps to detect the correct hypothesis, whereas the same attacks using the same traces without folding are not successful. It should be noted that without integration none of our attempts to mount a CPA or an MIA attack on folded traces was successful. We also repeated the same attack using smaller window sizes; the attack works with a window as small as 3 ns, the same window size used to generate the histograms in Fig. 13. Interestingly, the best result is achieved when the window is placed directly behind the 3 ns window shown in Fig. 12 in the evaluation phase. This result suggests that the unbalanced mask trees, which caused the leakage of the mask bit mentioned in Section IV, also influence the efficiency of the folding attack. In other words, it is not obvious whether the leakage of the target combinational circuit, regardless of the difference between the mask tree capacitances, is symmetric around the mean. Hence, it is not obvious whether folding around the mean is still effective if the mask trees are balanced during the design phase and the leakage presented in Section IV does not appear.
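The order of operations described above (integrate over a window, fold around the mean, then run a CPA with a HW model) can be sketched on synthetic traces. The simulation is a toy under several assumptions that are not taken from the measurements: a random permutation stands in for the AES S-box, the leakage is a masked Hamming weight, and a constant, hypothetical `offset` for mask bit 1 stands in for mask-dependent consumption such as that of unbalanced mask trees.

```python
import numpy as np

rng = np.random.default_rng(1)
SBOX = rng.permutation(256)  # stand-in for the AES S-box (assumption)
HW = np.array([bin(v).count("1") for v in range(256)])


def simulate(key, n_traces=5000, n_samples=20, offset=8.0, noise=1.0):
    """Toy traces: masked HW leakage plus a hypothetical mask-dependent offset."""
    plain = rng.integers(0, 256, n_traces)
    mask = rng.integers(0, 2, n_traces)
    hw = HW[SBOX[plain ^ key]]
    leak = np.where(mask == 1, 8 - hw, hw) + offset * mask
    traces = rng.normal(0.0, noise, (n_traces, n_samples))
    traces[:, 5:15] += leak[:, None]  # leakage spread over ten samples
    return plain, traces


def integrate(traces, start, length):
    """Average each trace over a fixed window."""
    return traces[:, start:start + length].mean(axis=1)


def fold(values):
    """Fold the values around their overall mean."""
    return np.abs(values - values.mean())


def cpa_scores(plain, feature):
    """Absolute correlation of the feature with the HW model for every key guess."""
    return np.array(
        [abs(np.corrcoef(feature, HW[SBOX[plain ^ k]])[0, 1]) for k in range(256)]
    )


TRUE_KEY = 0x3C
plain, traces = simulate(TRUE_KEY)
integrated = integrate(traces, 5, 10)
unfolded_scores = cpa_scores(plain, integrated)      # integration only
folded_scores = cpa_scores(plain, fold(integrated))  # integration, then folding
best_guess = int(np.argmax(folded_scores))
```

In this toy setting the correct key guess stands out only after folding. Whether the same holds for balanced mask trees is exactly the open question raised in the text.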
VII. ENHANCED COLLISION ATTACK
The correlation-enhanced collision attack was introduced in [20]. Unlike other collision attacks, it works well against certain masked implementations and, unlike template attacks, does not need an offline profiling phase. The attack is a classical known-plaintext or known-ciphertext attack, i.e., the power consumption for known random inputs or outputs is observed and then analyzed.
The analysis, which is referred to as the detection phase in [20], aims at recovering the XOR-difference between key bytes. The attack assumes that the power consumption of the S-box computation for two different byte positions $i$ and $j$ has the same leakage if the same values are processed. For such a collision, the difference of the two involved key bytes $k_i$ and $k_j$ is $\Delta_{i,j} = k_i \oplus k_j = p_i \oplus p_j$, where $p_i$ and $p_j$ are the corresponding input bytes. Several of these differences finally allow for an easy key recovery.
Unlike classical side-channel collision attacks, this attack aims at finding several collisions at once. The power traces are sorted based on the input byte value such that all traces where $p_i$ takes a certain value $\alpha$ are averaged to obtain the average power consumption $M_i^{\alpha}$ for that byte. For a known key difference $\Delta_{i,j}$, the S-box inputs are the same if the input bytes $p_i$ and $p_j$ show the same difference as the keys, $p_i \oplus p_j = \Delta_{i,j}$, resulting in highly similar average power consumptions $M_i^{\alpha}$ and $M_j^{\alpha \oplus \Delta_{i,j}}$. Such a similarity can be detected by computing the correlation between all possible $(M_i^{\alpha}, M_j^{\alpha \oplus \Delta})$-pairs for all possible key differences $\Delta$. The correlation for the correct key difference is very high, as each pair is a direct collision, while for false $\Delta$'s the correlation is low, as unrelated computations are correlated. Classical PA attacks, such as the CPA described in Section III, correlate a simple model of the power consumption with the sampled power consumption. Typically, these models are not very precise. In contrast, the above attack correlates two different instances of the same observed power consumption, resulting in a much higher correlation. In essence, the approach is similar to template attacks but, unlike those, does not require a training phase. The complexity of the attack is slightly higher than, but comparable to, that of a CPA. Computing the averages takes little additional time, but speeds up the actual correlation step. The actual key recovery is also slightly more complex, as the attack returns key differences instead of values of key bytes.
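The detection step can be sketched as follows. This is an illustrative simulation rather than the procedure of [20] verbatim: a random permutation stands in for the AES S-box, both byte positions are assumed to be processed by the same circuit with an unmasked Hamming-weight leakage at one sample, and the traces are assumed to be already aligned. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
SBOX = rng.permutation(256)  # stand-in for the AES S-box (assumption)
HW = np.array([bin(v).count("1") for v in range(256)])


def simulate(key_byte, n_traces=20000, n_samples=8, noise=1.0):
    """Toy aligned traces leaking HW(S(p xor k)) at sample index 3."""
    plain = rng.integers(0, 256, n_traces)
    traces = rng.normal(0.0, noise, (n_traces, n_samples))
    traces[:, 3] += HW[SBOX[plain ^ key_byte]]
    return plain, traces


def mean_traces(plain, traces):
    """Row a holds the average of all traces whose input byte equals a."""
    return np.stack([traces[plain == a].mean(axis=0) for a in range(256)])


def recover_delta(means_i, means_j):
    """Return the key difference whose paired mean traces correlate best."""
    alphas = np.arange(256)
    a = means_i - means_i.mean(axis=0)
    scores = np.empty(256)
    for delta in range(256):
        # Pair the mean for input a at position i with a xor delta at position j.
        b = means_j[alphas ^ delta]
        b = b - b.mean(axis=0)
        corr = (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
        scores[delta] = corr.max()  # best sample point for this delta
    return int(np.argmax(scores)), scores


K_I, K_J = 0x2B, 0x7E
plain_i, traces_i = simulate(K_I)
plain_j, traces_j = simulate(K_J)
best_delta, scores = recover_delta(mean_traces(plain_i, traces_i),
                                   mean_traces(plain_j, traces_j))
```

Averaging first reduces the correlation step to 256 comparisons of 256 mean traces each, independent of the number of measured traces.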
The attack compares the power consumption characteristics of two combinational circuits. The best results are achieved by comparing the power consumption of the same instance of the target combinational circuit, e.g., an S-box, during different clock cycles. There are four instances of the S-box in our target chip (cf. Section II). Hence, the computation of the S-boxes of the first round is spread over four clock cycles. We can compare the power consumption of each S-box instance in several clock cycles. As mentioned above, the attack procedure involves computing averages based on the (unmasked) values that are processed by the target combinational circuit. Since the power consumption profile of a combinational circuit depends on two consecutive input values, the best profiling can be done by computing averages based on two consecutive S-box inputs. However, the precharge phase in every clock cycle of iMDPL circuits always sets the first input of the S-box to zero. In fact, the precharging mechanism simplifies the averaging process because the initial state of the circuit is always known and the same.
We computed the average traces for each plaintext byte value and position using a total of 200 000 traces. According to the architecture of our target chip, state bytes 4, 3, 2, and 1, respectively, are processed by the same instance of the S-box (the left-most S-box in Fig. 4), but in subsequent clock cycles. To align the leakages, we shifted the average traces corresponding to plaintext bytes 3, 2, and 1 by one, two, and three clock cycles to the left, respectively. By computing the correlations between the aligned average traces of three pairs of these byte positions, we recover the differences between key bytes 1 through 4. The three differences found yield three linear equations in these key bytes, which we call equation set 1.
The results of these attacks, as shown in Fig. 23(a)-(c), indicate that the attack is indeed able to detect the similarity between the average traces. Repeating the same attacks on the average traces of plaintext bytes 8, 7, 6, and 5 leads to the recovery of the XOR-differences between key bytes 5 through 8, which we call equation set 2.
The results of the attacks are highly similar to those shown in S-box blocks when the state register is shifted to the left. The results of the S-box blocks are ignored during the MixColumns operations, but the state bytes are nevertheless applied to the circuit. Therefore, the left-most S-box block sees the aforementioned state bytes, i.e., the S-box results of key-XORed plaintext bytes 16, 4, 8, and 12. Each of these belongs to a different set of equations introduced above. We successfully repeated the same attack on the corresponding mean traces, as shown in Fig. 23(d)-(f). Therefore, we reveal three additional linear equations of the key bytes which link the prior four sets of equations. We finally have 15 independent linear equations of the key bytes, leading to $2^8$ candidates for the correct 128-bit AES key. Hence, the correlation-enhanced collision attack successfully breaks iMDPL, implying a remaining first-order leakage.
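Why 15 independent XOR equations leave $2^8$ key candidates can be made concrete with a short enumeration. The sketch assumes that the recovered equations have already been reduced to the differences between the first key byte and each of the other 15 bytes, which is always possible for 15 independent pairwise differences; the function name and the example key are hypothetical.

```python
def key_candidates(deltas_to_first):
    """List all 16-byte keys consistent with the 15 differences k_1 xor k_j."""
    if len(deltas_to_first) != 15:
        raise ValueError("expected 15 key-byte differences")
    # Guessing the first byte fixes all remaining bytes.
    return [
        bytes([k1] + [k1 ^ d for d in deltas_to_first])
        for k1 in range(256)
    ]


# Arbitrary example key, used only to derive consistent differences.
true_key = bytes(range(0x10, 0x20))
deltas = [true_key[0] ^ b for b in true_key[1:]]
candidates = key_candidates(deltas)
print(len(candidates))
```

Each of the 256 candidates could then be tested against a single known plaintext/ciphertext pair to identify the correct key.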
VIII. SUMMARY AND CONCLUSION
In this paper, we presented an extensive evaluation of the iMDPL logic style based on analyses of the GRANDESCA prototype chip. The chip features a standalone AES implementation and an 8051 microcontroller implemented in iMDPL. Our investigation consolidates recent research results and shows that the exploitable leakage in iMDPL is caused by routing imbalances between complementary mask trees. The investigation covers the most recent general attack methods as well as special methods targeting masked logic styles.
An overview of all evaluated preprocessing schemes in combination with the attack methods that have contributed to our practical evaluation results is given in Table II. In short, several combinations of the attack methods and preprocessing schemes can recover all key bytes using 200 000 traces. If the adversary can handle the mask detection phase, performing straightforward CPA attacks can be considered the easiest way to reveal the full secret key. Otherwise, either performing CPA on the traces preprocessed by means of folding and integration or mounting a correlation-enhanced collision attack leads to full-key recovery. Moreover, in a few cases where CPA failed, MIA could reveal the secret when the measurements were preprocessed by integration. Our results suggest that unbalanced mask trees are a major source of leakage, which is exploited by the above attacks. We expect that balancing the mask trees will impede or even completely prevent these attacks. However, balancing the mask trees would violate the original design goal of iMDPL, which was intended to make such routing requirements unnecessary. Nevertheless, a basic balancing of the two mask trees in terms of timing could easily be implemented by treating the mask trees like a clock tree. This strategy would minimize the timing skew between the two trees. Such a basic balancing approach would significantly reduce both the detectability of the mask value and the mask-dependent switching behavior of the iMDPL circuit, which can result in a considerably higher side-channel resistance of iMDPL.
