Abstract: When complex functions, for example, substitution boxes of block ciphers, are realized in hardware, the timing attributes of the underlying combinational circuit depend on the input/output changes of the function. These characteristics can be exploited with the help of a relatively new scheme called fault sensitivity analysis. This paper demonstrates a collision timing attack that exploits the data-dependent timing characteristics of combinational circuits. The attack is based on a recently published correlation collision attack, which avoids the need for a hypothetical timing model of the underlying combinational circuit to recover the secret materials. The target platforms of our proposed attack are 14 AES ASIC cores of the SASEBO LSI chips in three different process technologies: 130 nm, 90 nm, and 65 nm. Successfully breaking all cores, including the DPA-protected and fault-attack-protected cores, indicates the strength of the attack.

INTRODUCTION
While modern cryptographic algorithms can mostly be assumed secure from the mathematical point of view, their implementations can in general easily be broken by means of so-called side-channel attacks if no special countermeasures are applied. Sensitive information about the internal secret of the cryptographic device, for example, the encryption key used in a symmetric cipher, leaks through side channels. The side channels known so far include execution time [14], power consumption [15], and electromagnetic radiation [10], [30]. Faulty outputs resulting from intentionally injected faults have also been used in the active branch of noninvasive physical attacks, for example, differential fault analysis [5].
At CHES 2010 a new fault attack called fault sensitivity analysis (FSA) [18] was proposed, which uses the fact that the critical path of an AES S-box implementation is data dependent because of the nature of the underlying gates. Using clock glitches and a simple hypothetical (timing) model for the S-box, which can be obtained by simulation knowing the design details or by a profiling phase, the authors showed a complete break of an AES ASIC implementation using the critical path details of 50 ciphertexts.
Another contribution to CHES 2010 was a collision attack enhanced by correlation [21]. Compared to classical power analysis attacks, its main feature is that it does not rely on the knowledge of an underlying (hypothetical) power model. Instead, it directly correlates power traces to each other and, by finding colliding S-box computations, is able to recover the relation between key parts. Using such an attack, a complete break of a masked FPGA implementation of AES has been shown.
The combination and improvement of these two ideas is the main contribution of this paper. We present an attack that exploits the timing characteristics of AES S-boxes, but to recover the secret it does not need to know what exactly these characteristics are or how they relate to the inputs. Despite the similar name, the attack presented here is different from [8], where a collision timing attack on cache misses of an embedded system is presented. The aim of our proposed attack is to exploit the timing characteristics of combinational functions, for example, S-boxes implemented in ASICs, thereby recovering the relation between key parts and restricting the key search space in such a way that the secret key can be revealed knowing a single plaintext-ciphertext pair.
Similarly to [18] , we have chosen the SASEBO-R board [2] as the evaluation platform. The board can hold different ASICs, and we have analyzed the SASEBO LSI2 [3] , both in 130 nm and 90 nm technology, as well as the SASEBO LSI3 [4] in 65 nm technology. Each of them contains the same 14 different implementations of AES, therefore the only difference is the process technology, which has a big influence on the timing characteristics. The implementations themselves differ in the style of the S-box realization and in side-channel countermeasures. One of the AES implementations is also specifically designed to withstand attacks using fault injection, which in theory should make the implementation resistant against attacks like FSA.
Using the attack presented here we are able to break all 14 AES implementations of each LSI regardless of the process technology. It is noteworthy that the effort required to mount the attack on different cores varies because of their different architectures and, mostly, their different S-box designs. Nevertheless, the proposed attack can not only extract the desired secret from unprotected implementations, but is also able to overcome all DPA and fault attack countermeasures which are applied in these ASIC designs. In other words, randomizing the computation of combinational circuits by means of different masking schemes does not remove the relation between their timing characteristics and the processed data. Furthermore, preventing faulty ciphertexts by means of one of the most efficient fault detection schemes cannot stop an attacker who injects faults in order to measure the timing characteristics of the circuit.
The remainder of this paper is organized as follows. The prerequisites, including a review of fault sensitivity analysis and the correlation collision attack, are given in Section 2. Our proposed attack, namely the collision timing attack, is described in Section 3, and the practical evaluation results on all 130 nm, 90 nm, and 65 nm SASEBO LSI2 and LSI3 cores are presented in Section 4. Finally, Section 5 concludes our work.
This paper extends the authors' part of a joint CHES 2011 paper [22] by the authors and a team from the Department of Informatics, The University of Electro-Communications, Tokyo, Japan. The independent research of both teams was merged at the request of the CHES 2011 Program Committee because of some overlap in the techniques used. For example, both utilize the ideas of fault sensitivity analysis [18] and correlation collision attacks [21], although their final attacks are quite different. Compared to the conference paper [22], this publication not only covers the attacks on some unprotected or DPA-protected cores but, in particular, shows how the proposed fault attack can be applied to a very sophisticated fault protection scheme and is finally able to break all 42 cores of all currently available SASEBO-R LSIs. The attack descriptions for the unprotected and DPA-protected cores are extended as well, giving more detail, for example, by explaining the necessary adaptations to attack a core using the counter mode of operation and by giving insights into the applicability of the scheme to decryption circuits. We also highlight the difficulties we faced during the 6 months of practical evaluations to allow the reader to better understand the attack and evaluate our results (see Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TC.2012.154). During the review phase of this publication another journal paper was published by the aforementioned team from Tokyo [17], giving more details on FSA and on how to extend it in a self-template mode to also successfully attack one of the fault-protected SASEBO-R LSIs. We want to stress that both works have been performed completely independently of each other.
PRELIMINARIES
This section summarizes the underlying attacks, namely the fault sensitivity analysis [18] and the correlation collision attack [21], which are the basis for the proposed collision timing attack. We also explain how the timing data required for the attack can be captured, and finally give some definitions which are used in the following sections.
Fault Sensitivity Analysis
Unlike differential fault attacks (DFA) [5], an FSA attack requires no faulty ciphertexts. Instead, the attack works by increasing the fault intensity until a distinguishable characteristic can be observed, for example, the first appearance of a faulty output. It was demonstrated that the attack is able to completely break the AES_PPRM1 core of the SASEBO LSI2 (130 nm) using 200 faulty operations for each of 50 randomly selected plaintexts. It could also reveal three key bytes of the AES_WDDL implementation of the same ASIC using its fault sensitivity leakage obtained from 1,200 plaintexts.
The method presented in [18] for increasing the fault intensity is the shortening of clock glitches, i.e., two normal clock cycles are replaced by a short and a longer one, whereby the length of the short one can be gradually decreased until a faulty output occurs or the fault becomes stable. Since the critical path of some gates, for example, AND and OR gates, is data dependent, knowing the underlying model for this data dependence helps reveal the secret. For example, it could be ascertained by simulation that the timing delay of a PPRM S-box correlates with the Hamming weight (HW) of its input.
Since no faulty ciphertexts are required, the attack might also be applicable to implementations which apply DFA countermeasures. While AES_WDDL should in theory be immune against setup time violation attacks, it was shown by creating templates with a known key that at least some bits correlate with the timing delay, which led to the aforementioned recovery of three key bytes.
Another difference from DFA attacks is that the fault does not need to be restricted to a small subspace. On the contrary, when attacking, for example, the last round of the AES_PPRM1 implementation, each faulty output byte can be observed independently, and therefore the same complete faulty output can be used to attack all key bytes simultaneously. On the other hand, as stated in [18], while countermeasures like masking are only of limited use against DFA attacks, they may have a great impact on FSA attacks since the critical path is affected by the random mask bits.
Correlation Collision Attack
The correlation collision attack was introduced in [21]. While it retains a certain sensitivity to noise, its major advantage compared to classical power analysis attacks is that it neither relies on a hypothetical power model nor requires a profiling phase. By enhancing linear collision attacks [7] with the methods of correlation-based DPAs, it is able to overcome side-channel countermeasures as long as a minimal first-order leakage remains.
A linear collision occurs if two instances of a combinational circuit, or one instance at two different points in time, process the same value, i.e., for AES the 8-bit output and thereby the input of the S-box must be the same. In this case, it is possible to deduce the relation between the attacked key bytes, since rearranging $S(i_1 \oplus k_1) = S(i_2 \oplus k_2)$ gives $k_1 \oplus k_2 = i_1 \oplus i_2$.

The correlation collision attack on AES works similarly, but starts by creating sets of mean traces for each possible input byte if the attack is performed on the first round. To do this for two input bytes, namely $i_1$ and $i_2$, all traces are sorted based on the corresponding input byte value, and traces with the same value are averaged, thereby creating 256 different mean traces $M_1(i_1)$ and $M_2(i_2)$ for each of the two input bytes. Computing the variances for each set of mean traces reveals the point in time where the corresponding bytes are processed by the S-box, which is necessary to align the traces for the attack.
If the previous assumption holds true, i.e., the power consumption of two S-box computations is highly similar, comparing pairs of mean sets also shows a high similarity between certain mean traces. Therefore, when $\Delta = k_1 \oplus k_2$ and attacking the input bytes $i_1$ and $i_2$, then $M_1(i_1) \approx M_2(i_1 \oplus \Delta)$ for all $i_1$.
The correct $\Delta$ can be found by computing the correlation between the two sets of mean traces for each of the 256 candidates of $\Delta$. This yields a very high correlation coefficient since no hypothetical model is applied; instead, the averaged real power consumptions are used in the correlations.
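The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration on simulated data, not the implementation used in [21]: the function names, the random leakage table `leak`, the noise level, and the trace length are all assumptions made for the example. The sketch averages traces per input byte value, locates the leaking sample by the variance of the mean traces, and then tests all 256 candidates for the key difference.

```python
import numpy as np

def mean_traces(traces, byte_values):
    """Average all traces that share the same input byte value (0..255)."""
    means = np.zeros((256, traces.shape[1]))
    for v in range(256):
        means[v] = traces[byte_values == v].mean(axis=0)
    return means

def recover_delta(m1, m2):
    """Return the candidate d that maximizes corr(M1(i), M2(i xor d))."""
    # The sample with the highest variance across the 256 mean traces is
    # taken as the point in time where the byte is processed by the S-box.
    s1 = m1.var(axis=0).argmax()
    s2 = m2.var(axis=0).argmax()
    idx = np.arange(256)
    scores = [np.corrcoef(m1[idx, s1], m2[idx ^ d, s2])[0, 1]
              for d in range(256)]
    return int(np.argmax(scores))

def simulate(rng, key_byte, leak, leak_sample, n_rep=40, n_samples=5):
    """Simulated traces (assumption): noise everywhere, plus a value-dependent
    leakage of the S-box input (input xor key) at one sample."""
    inputs = rng.permutation(np.tile(np.arange(256), n_rep))
    traces = rng.normal(0.0, 1.0, size=(inputs.size, n_samples))
    traces[:, leak_sample] += leak[inputs ^ key_byte]
    return traces, inputs

rng = np.random.default_rng(0)
leak = rng.normal(0.0, 1.0, size=256)   # unknown to the attacker
k1, k2 = 0x3C, 0xA7                      # hypothetical secret key bytes
t1, i1 = simulate(rng, k1, leak, leak_sample=1)
t2, i2 = simulate(rng, k2, leak, leak_sample=3)
delta = recover_delta(mean_traces(t1, i1), mean_traces(t2, i2))
print(hex(delta), hex(k1 ^ k2))
```

Note that `recover_delta` never looks at `leak`; it only compares measured averages with each other, which is the property that makes the attack independent of a power model.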
How to Measure the Timing
The focus of this work is to analyze the timing characteristics of combinational circuits like S-boxes. As shown in Fig. 1, when the input of a combinational circuit changes, its output stops toggling after a certain time (so-called $\Delta t$). The maximum value of $\Delta t$ for different inputs is known as the longest critical path of the circuit, and defines the maximum frequency of the clock signal which triggers the flip-flops providing the input and storing the output of the considered combinational circuit. Timing characteristics of a circuit are therefore defined as a set of $\Delta t$ values ($\mathcal{T} = \{\Delta t_1, \Delta t_2, \ldots, \Delta t_n\}$), where $\Delta t_i$ is the minimum $\Delta t$ for the given input $i$.
Let us suppose that the target combinational circuit is a part of a bigger circuit, for example, a coprocessor, which provides some I/O signals for communication. If the output signals of the target combinational circuit are accessible from the outside (which is quite unlikely), one can easily probe the output signals for the given input $i$ and measure the time when the output signals become stable. Otherwise, if the output of the target combinational circuit is stored in registers which are triggered by a clock signal that can be controlled from the outside, as shown in Fig. 2, one can steadily shorten the time interval between the input transition and the output storage (known as setup time) until an incorrect value is stored in the registers while input $i$ is repeatedly given to the combinational circuit. The minimum time interval at which the considered register still stores the correct value can be taken as $\Delta t_i$. Note that this procedure is similar to clock glitches, which are mostly used in DFA to intentionally inject faults or to skip the execution of an instruction and analyze the faulty outputs based on the target algorithm [5], [6], [27].
However, measuring $\Delta t$ in this case does not deal with the faulty outputs; once a faulty output is detected, $\Delta t_i$ can be concluded.
It should be noted that, because of the environmental noise, it might be required to repeat the same procedure and shorten the clock glitch period until the probability of detecting a faulty output gets higher than a threshold. Also, if the target combinational circuit is not a single-bit function and it is possible to detect which output bit is faulty, one can measure $\Delta t$ for each output bit independently.
Therefore, we define the adversary model, i.e., the capabilities the adversary needs in order to measure $\Delta t_i$ of the target combinational circuit for the given input $i$:
• The adversary can access and control the clock signal which triggers the registers providing the input and saving the output of the target combinational circuit.
• He knows in which clock cycle the target combinational circuit processes the desired data, for example, known or guessed input or output.
• He can control the target device in a way that the same input value $i$ is repeatedly processed by the target combinational circuit during shortening the time interval of the clock glitch.
• He is equipped with appropriate instruments to shorten the duration of the clock glitch with suitable accuracy.
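The measurement procedure described in this section can be sketched as a simple sweep. The following Python fragment is a hedged illustration on a simulated device, not the measurement setup used in this work: `make_device`, `measure_delta_t`, the Gaussian jitter model, and all numeric values are assumptions. The sweep shortens the glitch width step by step, repeats each width several times, and stops once the observed fault probability exceeds a threshold; the last width that still gave (mostly) correct outputs is reported.

```python
import random

def make_device(delays_ps, jitter_ps, seed=0):
    """Simulated target (assumption): a fault occurs whenever the glitch
    width is shorter than the input-dependent delay plus Gaussian jitter."""
    rng = random.Random(seed)

    def device(input_value, glitch_width_ps):
        delay = delays_ps[input_value] + rng.gauss(0.0, jitter_ps)
        return glitch_width_ps < delay   # True means "faulty output"

    return device

def measure_delta_t(device, input_value, start_ps, stop_ps, step_ps,
                    repeats=50, p_th=0.5):
    """Shorten the glitch width until the fault probability exceeds p_th.

    Returns the smallest tested width whose fault rate stayed at or below
    p_th, or None if even the starting width was faulty."""
    width = start_ps
    last_good = None
    while width >= stop_ps:
        faults = sum(device(input_value, width) for _ in range(repeats))
        if faults / repeats > p_th:
            break
        last_good = width
        width -= step_ps
    return last_good

# Hypothetical delays for two inputs, swept in steps of 25 ps.
device = make_device({0: 3000.0, 1: 3400.0}, jitter_ps=10.0)
dt0 = measure_delta_t(device, 0, start_ps=4000, stop_ps=2000, step_ps=25)
dt1 = measure_delta_t(device, 1, start_ps=4000, stop_ps=2000, step_ps=25)
print(dt0, dt1)
```

Repeating the sweep for every input (or output) value yields one measured value per entry, i.e., a set of timing characteristics in the sense defined above.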
Definitions
Bitwise capture. Let us define BitCap [...] when input $i$ is given, where $p_{TH}$ is a threshold for the probability and is defined based on physical characteristics of the target circuit and also on the maximum probability achieved by shortening $\Delta t$. Accordingly, the time required to complete the computation of all bits when processing input $i$ is defined as [...] $\Delta t$ [...] $p_{TH}$.

(Figure caption: The gray bars denote a stable output signal after the S-box input has changed from 0xaa to 0x00, to 0x55, or to 0xff.)

Depending on the target device, its architecture, and the role of the target combinational circuit inside the target device, it might not be possible to know the input $i$ processed. However, if the output of the target combinational circuit is accessible, one can define all the above terms based on the fault-free output $o$, i.e., [...]
COLLISION TIMING ATTACK
For simplicity, let us suppose that the target combinational circuit is an S-box of the last round of an AES encryption, i.e., $o = S(i) \oplus k$, where $o$ is the corresponding output ciphertext byte and $k$ the target key byte. If the (bitwise) timing characteristics of an S-box implementation, i.e., $\mathcal{T}^{o}$ ($\mathcal{T}^{o}_{b}$), show a diversity of $\Delta t$ depending on the output $o$, one can perform an attack and recover the secret knowing how the secret $k$ contributes to $\mathcal{T}^{o}$ ($\mathcal{T}^{o}_{b}$). In other words, if the timing characteristics of the S-box itself, regardless of $k$ and the later key addition ($\oplus$), are known as extra information or are obtained by profiling using a circuit similar to the target, one can build a hypothetical function to estimate the timing characteristics and examine its similarity to $\mathcal{T}^{o}$ ($\mathcal{T}^{o}_{b}$) for each key guess. A similar approach has been presented in [18], where the timing characteristics of an AES S-box implementation were profiled and an attack similar to a correlation power analysis using an HW model was successfully performed. In fact, a set of $\mathrm{Cap}^{o}_{\Delta t}$ for a specific $\Delta t$ is used in [18] to mount the attack at the last round of the AES encryption.
One may also try using information theoretic tools, for example, mutual information analysis [11], to overcome the uncertainty of the leakage model. However, it is necessary to use a suitable leakage model that cannot be selected without extra knowledge about the (timing) characteristics of the target combinational function, or several different models must be examined to find a suitable one [39]. It is noteworthy that the leakages ($\mathrm{Cap}^{o}_{\Delta t}$) consist of only two values ("fail" and "success"). This causes probability distributions (used in, e.g., mutual information analysis) to be represented by only two bins in a histogram, and using other schemes to estimate the probability distributions, for example, kernel density estimation, in this case leads to increasing the noise. Here, when using histograms, mutual information will also be close to the variance of means.
In contrast to a correlation attack or a mutual information analysis using a leakage model, we apply a correlation collision attack [21] [...] and $o = S(i) \oplus k_2$, respectively. As stated in [21], the aim of a correlation collision attack is to find the linear difference between $k_1$ and $k_2$, i.e., $\Delta = k_1 \oplus k_2$. This will be done by comparing [...]
This can also be adapted to attack the first round of the AES encryption. That is, suppose that ${}^{1}\mathcal{T}^{i}$ and ${}^{2}\mathcal{T}^{i}$ are the timing characteristics of the key addition followed by the S-box when calculating $S(i \oplus k_1)$ and $S(i \oplus k_2)$. This is possible when the MixColumns transformation of the first round does not follow the SubBytes in the same clock cycle. Then, the correlation collision attack can, exactly as in the previous case, recover $\Delta = k_1 \oplus k_2$.
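To make the idea concrete, the last-round case can be sketched on simulated data. The following Python fragment is only an illustration under stated assumptions and not the evaluation code of this work: the per-value delay table `delay`, the noise level modeling instance-to-instance variation, and the helper names are hypothetical. Since the delay of an S-box instance depends on $S(i) = o \oplus k$, the characteristics of two instances indexed by the ciphertext byte $o$ are shifted against each other by $k_1 \oplus k_2$, so correlating one set with the other for all 256 shifts reveals the key difference. Recovering the 15 differences relative to the first key byte leaves 256 key candidates, which can be tested with a single plaintext-ciphertext pair.

```python
import numpy as np

def recover_delta(t1, t2):
    """Find d maximizing the correlation between t1[o] and t2[o xor d]."""
    o = np.arange(256)
    scores = [np.corrcoef(t1[o], t2[o ^ d])[0, 1] for d in range(256)]
    return int(np.argmax(scores))

def key_candidates(deltas):
    """Enumerate the 256 keys consistent with deltas[j] = k_0 xor k_j."""
    return [bytes(g ^ d for d in deltas) for g in range(256)]

rng = np.random.default_rng(1)

# Assumption: one unknown delay (in ps) per S-box output value S(i).
delay = rng.normal(3000.0, 100.0, size=256)
key = rng.integers(0, 256, size=16)          # hypothetical last round key

# Timing characteristics per S-box instance, indexed by ciphertext byte o.
# The added noise stands in for placement and routing differences.
o = np.arange(256)
chars = [delay[o ^ k] + rng.normal(0.0, 20.0, size=256) for k in key]

deltas = [0] + [recover_delta(chars[0], chars[j]) for j in range(1, 16)]
candidates = key_candidates(deltas)
print(len(candidates), bytes(int(k) for k in key) in candidates)
```

The attacker in this sketch never uses `delay` directly; only the measured characteristics of different instances are compared with each other.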
PRACTICAL EVALUATION
We attack the AES implementations of three ASIC chips built for the SASEBO-R board, namely the SASEBO LSI2 (130 nm), LSI2 (90 nm), and LSI3 (65 nm). Each chip contains the same 14 different cores, which we group into
• unprotected (mainly varying in the style of the S-box implementation),
• DPA protected (from masked-AND gates [37] and a kind of threshold implementation [25] to MDPL [28], WDDL [36], and Pseudo-RSL [31]), and
• fault attack protected [33].
As stated previously, we use an approach for fault injection similar to that of [18]. We first tried to generate the glitchy clock inside the control FPGA without using an external function generator, but the width of the glitchy clock could only be adjusted in large steps (e.g., of around 170 ps [9]), which were not small enough to meet our desired condition. Therefore, we had to use a programmable digital function generator to externally provide the precise clock frequencies. The external clock is fed into the SASEBO-R control FPGA, where it is multiplied by a factor of 32 using a digital clock management (DCM) unit. This fast clock signal is then used together with some logic to shape the glitchy clock signal. An internal circuit controls the clock signal of the LSI to inject the glitchy clock at the preferred instant of time, synchronized to the AES computation of the target core. The block diagram of the setup and the timing diagram of the signals are presented in Fig. 3.
As presented in the following, we change the width of the glitchy clock in steps of 25 ps to 5 ps. The multiplication of the clock frequency is necessary because of the limitation of the function generator we have used (maximum frequency of 15 MHz), while the frequencies necessary to inject a fault in the combinational circuit range up to 300 MHz. In addition, the DCMs inside the Virtex-II control FPGA of the SASEBO-R can, when fed with a low-frequency input signal, only generate output frequencies up to 210 MHz. Since some of the cores, especially those of the 65 nm LSI3, require a higher frequency for fault injection, for these cores it was necessary to daisy chain two DCMs: one for generating a high-frequency signal out of the function generator output, and another one to reach the maximum supported output frequency, which can only be generated by the DCM using a high-frequency input [40]. We should mention that we kept the core voltage of all LSIs at 1.1 V during the experiments shown here. The maximum clock frequencies drop significantly when decreasing the core voltage, for example, to 0.6 V. Therefore, the problem mentioned above due to the necessity of two daisy-chained DCMs can be avoided, and the width of the glitchy clock can be modified in larger steps, if the adversary is able to modify the core voltage in the target setup.
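The frequency figures above can be checked with a little arithmetic. The sketch below is only a worked illustration: it assumes a single multiplication stage with the stated factor of 32 (the daisy-chained configuration may differ), and the function names are hypothetical.

```python
def mhz_to_period_ps(freq_mhz: float) -> float:
    """Convert a clock frequency in MHz to its period in picoseconds."""
    return 1e6 / freq_mhz

def period_ps_to_mhz(period_ps: float) -> float:
    """Convert a clock period in picoseconds to a frequency in MHz."""
    return 1e6 / period_ps

def generator_mhz(fast_clock_mhz: float, multiplier: int = 32) -> float:
    """Function generator frequency needed for a given multiplied clock."""
    return fast_clock_mhz / multiplier

fast_mhz = 300.0                       # upper range quoted in the text
gen_mhz = generator_mhz(fast_mhz)      # 9.375 MHz, below the 15 MHz limit
period_ps = mhz_to_period_ps(fast_mhz) # about 3333 ps per fast clock cycle
print(gen_mhz, round(period_ps, 1))
```

Under this assumption, a 300 MHz clock needs only a 9.375 MHz generator output, which explains why a generator limited to 15 MHz is sufficient once the multiplication is in place.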
In all cores of the LSIs, 16 instances of the S-box are implemented to perform the complete SubBytes operation in a single clock cycle (see Fig. 4 for the general architecture of the encryption datapath). All cores, except the one supporting a counter mode and the fault-protected one, realize a round-based architecture, i.e., S-boxes and MixColumns are performed consecutively during each clock cycle except for the last round, where MixColumns is absent [3], [4]. Therefore, extracting the timing characteristics of the S-boxes in the first nine rounds is not easily possible, and one needs to inject faults by shortening the width of the clock glitches in the last round, when the target cores only compute the ShiftRows and SubBytes operations followed by the final key addition and the result is stored in registers (a scheme similar to the one used in [18]). In addition, one can see from the design architecture of the cores (see Fig. 4) that the round key of the last round is already computed in the previous round and is stored in a register. The glitchy clock at the last round, hence, does not affect the key scheduling computations. Note that all the AES cores available in our targeted LSIs support encryption, while only two cores support both encryption and decryption. The results of the attack on the encryption scheme of all cores are presented first; then, we discuss the feasibility of the attacks on the decryption modules at the end.
In the following, the results of the attacks on different cores and different LSIs are presented. Because of the high number of broken cores, only a subset of the performed attacks is presented in detail, with additional information given where required about how the cores not discussed differ.
All 14 AES cores of each LSI have been successfully broken. These 14 cores could be broken on each of the three LSIs regardless of the process technology, giving a total of 42 successfully broken ASIC implementations using our proposed collision timing attack.
Attacking the Unprotected Cores
We start by showing the results of the attack on the first AES core of the 130 nm chip, namely AES_Comp, whose S-boxes have been made using a composite field approach [32]. As stated before, 16 separate S-box instances have been implemented which are active at the same time. Therefore, it is not possible to compare the timing characteristics of one S-box instance when processing, for example, two values with different key bytes, which would be an ideal case for a collision timing attack. Instead, the timing characteristics of different S-box instances must be compared, which may vary slightly because of different placement and routing even when they are based on the same netlist.
Since changing the glitchy clock width in our setup requires resetting the DCM(s), we have collected BitCap [...]

Careful study of the timing characteristics shown in Fig. 6 reveals that $\Delta t$ is much smaller than in the other cases when the S-box input is zero, which is a known issue since the zero-input power model was defined [12] to mount CPA attacks on AES S-box leakages. In fact, the collision timing attack is not needed in this case, and the key bytes can be recovered by observing $\mathcal{T}^{o}_{b}$ of each S-box instance separately. However, as shown later, this property does not hold for the other cores realized by different S-boxes, and mounting our proposed attack is essential to reveal the secrets.
The same S-box architecture is used in two other cores. The AES_Comp_ENC_top core is exactly the same as the examined one, but it does not provide the decryption module. The AES_PKG core utilizes the same architecture as AES_Comp_ENC_top, but precalculates the round keys and stores them in dedicated registers. Two other cores, i.e., AES_PPRM1 and AES_PPRM3, have the same architecture and datapath. That is, only encryption is supported, and the round keys are computed on the fly. However, the S-boxes have been realized by a low-power approach called positive polarity Reed-Muller (PPRM) [24] in 1 and 3 stages, respectively, in the AES_PPRM1 and AES_PPRM3 cores; the latter is supposed to consume less power.
To perform the attack on these cores, the same procedures as explained above have been repeated. Table 1 (in the Appendix, which is available in the online supplemental material) provides a reference for the timing characteristics and the number of captures required to successfully mount the attack on the different cores in the different LSIs. Attacking the AES_TBL core, where the S-boxes have been realized by look-up tables (case statements), is different from attacking the aforementioned cores. We illustrate this case when explaining how to mount the attack on the WDDL and MDPL cores.
The AES_CTR core, which enables the counter mode, employs a four-stage pipeline architecture to speed up the computations and is depicted in Fig. 8. To utilize the pipeline, the initialization vector (IV) and four plaintexts are sent to the core in the beginning. The AES circuit then encrypts the full 128-bit IV to generate pseudorandom numbers, which are XORed with the first plaintext to encrypt it, and increments the stored IV for each of the following plaintexts. As the attacker, we exploit the fact that we have full control over the IV, i.e., we can completely control which input is encrypted by the AES circuit. By 1) setting the plaintext to all zeros, 2) focusing only on the first resulting ciphertext, where the input IV has not been incremented yet, and 3) resetting the core after each obtained ciphertext to load a new IV, we basically create the same attack scenario as for the other cores. The only difference is that in this case we use the IV as our known plaintext.
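The effect of the three steps above can be illustrated with a toy counter mode. The Python sketch below is only an illustration: `toy_block_encrypt` is a hash-based stand-in and explicitly not AES, and the function names and the big-endian counter increment are assumptions rather than details of the AES_CTR core. It shows that with an all-zero plaintext the first ciphertext block equals the block cipher output for the IV, so the IV plays the role of a known plaintext.

```python
import hashlib

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a 128-bit block cipher (NOT AES, illustration only)."""
    return hashlib.sha256(key + block).digest()[:16]

def ctr_encrypt(key: bytes, iv: bytes, plaintext_blocks):
    """Counter mode: encrypt the (incrementing) IV and XOR with plaintext."""
    counter = int.from_bytes(iv, "big")
    ciphertexts = []
    for block in plaintext_blocks:
        keystream = toy_block_encrypt(key, counter.to_bytes(16, "big"))
        ciphertexts.append(bytes(p ^ s for p, s in zip(block, keystream)))
        counter = (counter + 1) % (1 << 128)
    return ciphertexts

key = bytes(range(16))                 # hypothetical secret key
iv = bytes.fromhex("00112233445566778899aabbccddeeff")
zero_blocks = [bytes(16)] * 4          # four all-zero plaintexts

first_ct = ctr_encrypt(key, iv, zero_blocks)[0]
print(first_ct == toy_block_encrypt(key, iv))
```

Only the first block is used because later blocks correspond to incremented counter values, which would change the cipher input that the attacker wants to control.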
The S-box itself is divided into three parts (see Fig. 8), where the part with the longest critical path is the inversion; therefore, we have selected this part to exploit the timing characteristics and mount the attacks during the computation of the last round. According to [3, Tables 4.7 and 4.15], the AES_CTR core has a longer critical path and is slower compared to the AES_Comp core. This is caused by the integer addition module used to increment the IV. However, the encryption core itself is quite fast, and because of its pipeline architecture we expected a shorter critical path for the inversion unit. Our practical experiments confirmed our assumption: 4,200 ps was the minimum $\Delta t$ for fault-free operation of the 130 nm chip, showing the operation frequency to be around 240 MHz (excluding the integer addition unit). Although we initially faced several problems producing faulty results in the AES_CTR core of the 90 nm and 65 nm chips because of the very short critical path, the mounted attacks could successfully reveal the secrets, similarly to the results shown for the AES_Comp core.
Attacking the DPA-Protected Cores
The first DPA-protected core is AES_MAO, in which a gate-level masking scheme (masked-AND gate [37]) is used to realize the masked S-boxes. The capturing scenario and the attack scheme are exactly the same as those of the unprotected cores, for example, AES_Comp. Fig. 9 shows the timing characteristics of two S-box instances of the AES_MAO (65 nm) core. It can be seen that the timing characteristics for different (unmasked) outputs still differ even though the S-box inputs are masked by means of random values. Consequently, it is possible to extract the relation between the key bytes, as depicted in Fig. 10, where the results after obtaining 10,000 captures when shortening $\Delta t$ in steps of 25 ps are presented. Comparing the number of captures required when attacking the AES_Comp and AES_MAO cores, one can conclude that the attack on the masked-AND gates is harder. This is because the randomness weakens the dependence of the timing characteristics of an S-box on its unmasked inputs.
The threshold implementation scheme, which is a high-order masking countermeasure against power analysis attacks, is based on a combination of Boolean masking and multiparty computation [26]. In contrast to other masking schemes when implemented in hardware, for example, [20], [21], it is expected to prevent first-order leakage caused by signal glitches [23], [29]. This scheme is an algorithmic-level countermeasure which needs to fulfill certain properties including correctness, noncompleteness, and uniformity [26]. Although one of our target cores is called AES_TI [3], [4], it has been made without considering the latter two properties. This core has been realized by modifying the datapath of the AES_Comp_ENC_top core. The nonlinear gates are provided by only two-input AND gates, every signal is represented by four shares, and finally the AND gates are replaced with the four-share threshold implementation of a two-input AND gate which is available in [23], [25]. This indeed makes the core a kind of gate-level masked implementation, which can be verified by examining the source code of this core available at [1]. Extracting the timing characteristics and mounting the same attack as before on this core could also break the implementation and recover the key relations (see Fig. 11 for an example). However, as stated in Table 1 (in the Appendix, which is available in the online supplemental material), many more captures are required compared to the AES_MAO core. To the best of our knowledge, this is due to the amount of randomness used by the AES_TI core, where each single-bit signal is represented by four shares, i.e., using three random mask bits, while only one mask value is used in masked-AND gates (the AES_MAO core).
DPA-resistant logic styles are used in four other DPA-protected cores. Random switching logic (RSL) [34] is a precharge logic style combined with a logic-level masking scheme (first theoretically proposed in [13] and later used in [38]). Based on the design methodology presented in [31], pseudo RSL, which makes use of a standard CMOS library, is used in the realization of both the AES_PR and AES_WO cores. The design methodology is based on the fact that consecutive nonlinear parts of the circuit which are realized by RSL should be enabled one after another. This means that the S-box is divided into several parts which, to prevent the propagation of glitches, are enabled sequentially by means of a control unit. Therefore, the timing characteristics of only the last logic step of the S-box circuit can be extracted and attacked. Repeating the illustrated attack could reveal the desired secret, as shown by the results of the attack on the AES_PR core in Fig. 12. The other pseudo-RSL core, i.e., AES_WO, is made for evaluation purposes. Though its design is not obvious to us, comparing the corresponding $\Delta t$ range of these two cores (see Table 1, which is available in the online supplemental material) reveals that the enable signals in the AES_WO core are missing or that the appropriate timing constraints have not been considered for them. Interestingly, the AES_WO core can be attacked in the same way as an unprotected core; even the number of captures required to reveal the key relations is comparable to that of the AES_Comp core. The result of the attacks on this core is depicted in Fig. 13.
Wave dynamic differential logic (WDDL) [36] is a logic style that was proposed to realize sense amplifier-based logic (SABL) [35] using a standard CMOS library. These logic styles aim at flattening the power consumption, a kind of hiding at the transistor/gate level [19]. There are two types of WDDL sequential circuits, and the version that needs one clock cycle for each precharge or evaluation phase is used to implement the AES_WDDL core. The reason for this constraint is the structure of the master-slave WDDL flip-flops [36]. Each round of the encryption function is therefore executed in two clock cycles, while the datapath and the architecture of this core are otherwise the same as those of the AES_Comp_ENC_top core. Exactly the same specifications and features hold for the AES_MDPL core, which is realized using the masked dual-rail precharge logic (MDPL) style [28]. MDPL works similarly to WDDL, with the addition of a single mask bit which randomly swaps the rails of the complete circuit at every clock cycle.
In contrast to the other cores, a fault injected by a clock glitch during the evaluation phase of both WDDL and MDPL can only lead to a bit flip from 1 to 0, not vice versa, because of the predischarge phase in which all signals are set to 0. Therefore, the bitwise timing characteristics $T_o^b$ do not provide any information for those output values $o$ in which bit $b$ is zero. This issue has also been addressed in [16], where a successful attack is performed on an AES_WDDL core. Our solution is to avoid using bitwise characteristics and to apply the attack on the timing characteristics $T_o$ (not the bitwise ones). This way the bitwise timing characteristics of all bits are combined, and $T_o$ provides information for all output values. The results of the attacks on the AES_WDDL (130 nm) and the AES_MDPL (65 nm) cores are shown in Figs. 14 and 15, respectively. Interestingly, the number of captures required to successfully mount the attacks is smaller than in the unprotected core cases. We have seen, for reasons unknown to us, the same behavior (no information is provided by $T_o^b$ for those $o$ where bit $b$ is zero) when attacking the AES_TBL core, where the S-boxes have been realized by look-up tables. The same approach that we used to attack the AES_WDDL and AES_MDPL cores (using $T_o$) was helpful when attacking the AES_TBL core. It should be noted that, to attack the AES_MDPL and AES_WDDL cores, we only used 10,000 captures for each $\Delta t$ in steps of 5 ps. On the other hand, a successful attack on the AES_TBL core required around 1,000,000 captures, which might be because of marginal differences between the critical paths of the circuit realizing the look-up table.
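The one-directional fault behavior described above can be made concrete with a small Python sketch. It is a deliberately simplified model and an assumption on our part: every output bit is 0 after predischarge, and a bit shows its correct value only if its evaluation has settled before the shortened clock edge. The function names and the `settled_mask` parameter are hypothetical.

```python
def sampled_output(correct, settled_mask):
    """Value captured after a glitch: unsettled bits are still at their
    predischarge value 0, so faults can only turn a 1 into a 0."""
    return correct & settled_mask


def bit_fault_visible(correct, settled_mask, b):
    """True if bit b of the captured value differs from the correct value."""
    diff = sampled_output(correct, settled_mask) ^ correct
    return bool((diff >> b) & 1)


def byte_fault_visible(correct, settled_mask):
    """True if any bit of the captured byte is faulty (combined view)."""
    return sampled_output(correct, settled_mask) != correct
```

In this model, `bit_fault_visible(0b10000001, 0b01111110, 1)` is `False` because bit 1 of the correct output is zero and therefore cannot be observed as faulty, while `byte_fault_visible(0b10000001, 0b01111110)` is `True` because the combined view still sees the lost 1-bits.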
Attacking the Fault-Protected Core
The scheme presented in [33] has been used to implement the fault-protected core, i.e., AES_FA, of our three targeted LSIs. The main idea of this scheme is to split the round function into two parts where both parts can be active at the same time. One half is used to compute the next state of the half-round while the other half is used to verify the previous half-round computation.
An overview of this architecture is given by Fig. 16 . The first half-round circuit is comprised of ShiftRows and SubBytes while the second half includes MixColumns and AddRoundkey, all also featuring their inverse counterparts. This ability to perform both the encryption and decryption operations in each half is utilized for the fault detection. More precisely, after every half-round computation, in the next clock cycle the inverse operation is performed, and the result is compared to the original state to validate that the intermediate state has indeed been computed correctly.
Fig. 12. Result of the attack on the last round of AES_PR (65 nm) recovering $\Delta k$ between key bytes (9,13) (left) using 10,000 captures and (right) over the number of captures.
Fig. 13. Result of the attack on the last round of AES_WO (90 nm) recovering $\Delta k$ between key bytes (7,8) (left) using 10,000 captures and (right) over the number of captures.
To clarify the order of operations an exemplary encryption is depicted in Fig. 17 and is in the following [...] Fig. 16). Since an encryption is performed in this example, the first roundkey K0 is XORed with the plaintext and the result is stored as D0 in the state register (DregX in Fig. 16). This state is propagated to both the comparison register (DregY in Fig. 16) and through the ShiftRows and SubBytes functions so that the result of the first half round (D1X) is available at the input of the state register.
At the next clock edge (see Fig. 17b), the initial input D0 is stored in the comparison register and the intermediate result D1X is stored in the state register, from where it is propagated through three different datapaths at the same time. First, D1X goes through InverseShiftRows and InverseSubBytes (thereby recomputing the initial input D0) to the comparison circuit, which triggers a fault flag if the computed value differs from the D0 stored in the comparison register. If this flag is raised, the circuit resets and no ciphertext output is generated. Second, D1X goes through MixColumns and AddRoundkey, making the complete first round result D1 available at the input of the state register. Third, D1X is also propagated to the comparison register input since it will be required to validate the next computation step.
Similar to the previous clock cycle at the next clock edge (see Fig. 17c ) D1X is stored in the comparison register and the intermediate result D1 is stored in the state register. Following that D1 is again propagated through three different datapaths. To check if D1 has been computed correctly it is sent through InverseMixColumns and InverseAddRoundkey to the comparison function which checks for differences to the stored original D1X value. Simultaneously the result of the first half of the second round D2X is computed by going through ShiftRows and SubBytes and again the original value D1 is also propagated to the comparison register.
These steps, basically (b) and (c) in Fig. 17, are repeated until the intermediate result of the last round D10X has been computed. In Fig. 17d, the final result D10 is computed, omitting MixColumns as usual, but to check whether an error occurred in the key scheduling, the computed last roundkey is also compared to the one stored in the DecKreg register. Since the final computation of D10 has to be checked as well, an additional clock cycle is needed before the ciphertext can be issued, as depicted by Fig. 17e.
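The check-by-inverse schedule described above can be sketched as a toy Python model. This is only an illustration under stated assumptions: the state is a single byte, the two half-round functions are hypothetical stand-ins (an affine byte map and an XOR with a constant) rather than the real AES operations, and a fault is modeled as a single flipped bit in the value written to the state register. The variable names mirror DregX and DregY only loosely.

```python
def run_with_checks(half_steps, d0, fault_at=None):
    """Run half-steps with one-cycle-delayed inverse checks.

    half_steps: list of (forward, inverse) function pairs.
    Returns the final state, or None if the fault flag is raised.
    """
    state = d0      # plays the role of the state register (DregX)
    compare = None  # plays the role of the comparison register (DregY)
    undo = None     # inverse of the half-step computed in the previous cycle
    for index, (forward, inverse) in enumerate(half_steps):
        # Verify the previous half-step while the next one is computed.
        if undo is not None and undo(state) != compare:
            return None
        result = forward(state)
        if index == fault_at:
            result ^= 0x01  # modeled glitch: one wrong bit is stored
        compare, undo, state = state, inverse, result
    # Additional cycle: the last computation has to be checked as well.
    if undo is not None and undo(state) != compare:
        return None
    return state


# Hypothetical invertible stand-ins for the two half-round circuits.
def sub_fwd(x):
    return (7 * x + 3) % 256


def sub_inv(x):
    return (183 * (x - 3)) % 256  # 7 * 183 = 1281 = 1 (mod 256)


KEY = 0x5A


def mix_fwd(x):
    return x ^ KEY


mix_inv = mix_fwd  # XOR with a constant is its own inverse

HALF_STEPS = [(sub_fwd, sub_inv), (mix_fwd, mix_inv)] * 3
```

In this model a fault injected in any half-step is caught in the following cycle, so `run_with_checks(HALF_STEPS, 0x21, fault_at=2)` returns `None` instead of an output.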
Since each computation is checked by its inverse counterpart, not only dynamic but also static faults can be detected. In theory the performance penalty of this countermeasure is low since, as stated in [33], while twice as many clock cycles are necessary for a complete AES computation, the critical path is at the same time shortened by splitting up the round function. However, as explained later, we have found that the comparison circuit, which compares two 128-bit values, makes the critical path of the circuit quite long and the circuit considerably slower than the other cores (as can also be seen in [3, Tables 4.7 and 4.15]).
In addition, since MixColumns does not immediately follow SubBytes, and they are separated by a register, in contrast to the other cores one can exploit the timing characteristics of the S-boxes in the first round and also extend the attack to the later rounds. To mount the attack on the first round there are two options:
1. If the glitchy clock appears in the first clock cycle of the first round (see Fig. 17a), the faulty result D1X is saved into the state register, which is detected in the next clock cycle by the comparison circuit when the inverse computation leads to the wrong D0 (see Fig. 17b). When we have tried this on different LSIs, 6,600, 5,500, and 4,800 ps were the minimum width of the clock glitch with which, respectively, the 130, 90, and 65 nm chips still worked fault free.
2. In the second clock cycle (see [33, Fig. 5b]) the critical path of InverseShiftRows and InverseSubBytes followed by the comparison unit is longer than MixColumns and AddRoundkey. Therefore, if the glitchy clock is injected in the second clock cycle, the fault detection bit (output of the comparison circuit) is evaluated, i.e., is stored in the target register, before the computation of the InverseSubBytes or the comparison itself has been finished. Examining this in practice showed that the chip reports faulty computation when the width of the clock glitch is less than 10,700, 6,700, and 6,800 ps for the 130, 90, and 65 nm chips, respectively. In fact, it shows that the increase of the critical path caused by the comparison circuit strongly affects the performance (throughput) of the design, as stated previously.
It is noteworthy that faults in the key scheduling unit are not detected until the described comparison in the last round (see Fig. 17e). While the glitchy clocks may in theory affect the round key computations, the fact that the critical path of that computation is actually split in half by a pipelining register in the S-boxes of the key scheduling unit [33] causes the faults to be triggered in the normal datapath way earlier, since there a full S-box computation has to be performed among other steps. Therefore, by selecting the second clock cycle to shorten the width of the clock glitch and mount the attack where the critical path is maximal because of the included InverseShiftRows, InverseSubBytes and the comparison function, we are also making sure not to affect the key scheduling which has a significantly shorter critical path.
Fig. 16. Architecture of the fault-protected core (taken from [33]).
Footnote 2. Note that as it is explained later, the last roundkey K10 must be precomputed prior to the encryption or decryption operation when setting a key.
Contrary to the other cores, here the fault detection unit provides only one bit indicating the appearance of a fault, and from the adversary's point of view it is not possible to distinguish where (in which S-box instances) the fault occurred. Therefore, not only is bitwise capturing infeasible, but the fault detection unit also makes the capturing a very imprecise estimation process. Two timing characteristics of two S-box instances of the AES_FA (65 nm) core obtained with $p_{TH} = 0.5$ and two attack results are shown in Figs. 18 and 19, respectively. We have used 500,000 captures for each clock glitch width when shortening $\Delta t$ in steps of 10 ps.
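As a rough illustration of how a threshold probability can turn fault observations into a timing value, the following Python sketch shows one plausible reading, not necessarily the exact procedure used for Figs. 18 and 19. The assumptions are ours: the only observable is a single flag that is set if any S-box instance is faulty, the fault rate is estimated per glitch width for one class of inputs, and the timing value is taken as the largest glitch width at which the estimated rate reaches $p_{TH}$. All function and parameter names are hypothetical.

```python
def fault_flag(per_sbox_faulty):
    """The core exposes one bit: set if any S-box instance is faulty."""
    return any(per_sbox_faulty)


def fault_probability(flags):
    """Estimated fault rate from a list of observed fault flags."""
    return sum(flags) / len(flags)


def timing_characteristic(rate_by_width, p_th=0.5):
    """Largest glitch width (ps) whose fault rate is at least p_th.

    rate_by_width maps a glitch width to the estimated fault rate for one
    class of inputs. Returns None if the threshold is never reached.
    """
    crossing = [width for width, rate in rate_by_width.items() if rate >= p_th]
    return max(crossing) if crossing else None
```

Because `fault_flag` is an OR over all instances, an instance that is always faulty at some width pushes the observed rate to 1 for every input class at that width, which is one way to see why the estimate is imprecise for this core.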
As stated before, the capturing process here is very inaccurate because of the effect of fault occurrences in other S-box instances when considering a specific S-box instance. In other words, when an S-box instance is always faulty at a specific $\Delta t$, it prevents extracting any information about the behavior of the other S-box instances at that $\Delta t$. Therefore, we could only recover the relation between some key bytes, for example, between bytes (0,2), (0,4), (0,6), (0,7), and (0,10). We should emphasize that this behavior is not the same for all LSIs; different relations between key bytes are recovered when attacking the first round. Our solution, to completely break the core and recover all the key bytes, is to extend the attack to the second round. With the glitchy clock injected in the second clock cycle of the second round, i.e., the fourth clock cycle of the encryption, we collected the results of 50,000 trials with random plaintexts for a specific $\Delta t$ at which the probability of a fault occurrence is about 0.5. The attack process is divided into four parts, targeting each 4-byte MixColumns output independently. We start with the first column, where the key bytes (0,5,10,15) of the first round key are required to compute the first MixColumns output. Since the relation between some of these key bytes, for example, (0,10), has been revealed by the last attack shown, the search space for the key bytes (0,5,10,15) is reduced to $2^{24}$. Therefore, for each key guess the output of the MixColumns is computed, and based on each MixColumns output byte the four captures and four probabilities are generated using the collected 50,000 trials. Interestingly, it is not required to mount the collision attack. Instead, a variance check approach as stated in [21] can find the correct key guess. The idea behind it is that the probabilities $p_i$ are different for different input values.
For a wrong key guess the trials will be randomly distributed into the captures, and the probabilities have a small variance compared to the case in which the captures have been made with the correct key guess. The result of this approach is shown in Fig. 20a, where the correct key guess is clearly distinguishable. Computing all variance values for the $2^{24}$ key space takes roughly one hour on a 16-core PC. Since there are four captures and four probabilities $p_i$, each of which corresponds to one MixColumns output byte, the variance check over the probabilities can be done on each of these four probabilities independently. So, if the variances of the probabilities of one output byte do not show any distinguishable guess, the other output bytes can be examined. This process is repeated on the other columns using the recovered key bytes and the recovered relations, so that the search space sometimes gets smaller than $2^{24}$; the results are shown in Fig. 20. Although different relations between key bytes are recovered when attacking different LSIs, the variance check approach at the second round could completely reveal the secret in all LSIs, searching in a space of up to $2^{24}$. We should emphasize that the key recovery attack is possible even when omitting the attack on the first round. In this case, when no information about the key bytes is available, each part of the attack on the second round (exactly as explained above) must search in a space of $2^{32}$, which might be a time-consuming task using PCs but can be greatly accelerated by means of graphics processing units (GPUs).
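The variance check can be illustrated on a toy scale with the following Python sketch. It is a scaled-down model under our own assumptions and not the attack code: two 4-bit key nibbles instead of four key bytes, an arbitrary 4-bit permutation `SBOX`, a `mix` function that merely XORs two S-box outputs as a stand-in for MixColumns, and a deterministic fault model in which a trial is faulty exactly when the mixed value lies in a hypothetical set `SLOW`. For every key guess the trials are grouped by the predicted mixed value, the fault probability of each group is computed, and the variance of these probabilities is used as the score.

```python
from fractions import Fraction
from itertools import product

# Arbitrary 4-bit permutation used as a toy S-box (an assumption).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
# Hypothetical set of mixed values that cause a fault at the chosen glitch width.
SLOW = {1, 2, 4, 7, 8, 11, 13, 14}


def mix(p0, p1, k0, k1):
    """Toy stand-in for one MixColumns output value."""
    return SBOX[p0 ^ k0] ^ SBOX[p1 ^ k1]


def collect_trials(true_key):
    """One trial per plaintext pair: (plaintext, fault observed?)."""
    k0, k1 = true_key
    return [((p0, p1), mix(p0, p1, k0, k1) in SLOW)
            for p0, p1 in product(range(16), repeat=2)]


def variance_of_probabilities(trials, guess):
    """Group trials by the value predicted under `guess`, then return the
    variance of the per-group fault probabilities."""
    faults, totals = [0] * 16, [0] * 16
    for (p0, p1), faulty in trials:
        group = mix(p0, p1, *guess)
        totals[group] += 1
        faults[group] += faulty
    probs = [Fraction(f, t) for f, t in zip(faults, totals) if t]
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / len(probs)


def variance_check(trials):
    """Score every key guess in the 2^8 toy key space."""
    return {guess: variance_of_probabilities(trials, guess)
            for guess in product(range(16), repeat=2)}
```

In this toy model the correct guess sorts the trials into groups that are either always faulty or never faulty, so its variance reaches the maximum of 1/4, while a guess that scatters the trials over the groups yields probabilities closer to the average and a smaller variance. Ties are not ruled out in such a small model.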
Attacking the Decryption Modules
Only two cores, AES_Comp and AES_FA, include the essential module for decryption, while the AES_CTR core also supports decryption by swapping the plaintext and ciphertext, since AES in counter mode can be seen as a stream cipher. Attacking the decryption module of the AES_FA core is exactly the same as the attack on its encryption module. That is, to attack the first round of decryption the glitchy clock is injected in the second clock cycle, when the critical path of SubBytes and ShiftRows followed by the comparison unit is affected. Though it can be ignored, the attack reveals the relation between some key bytes. Extending the attack to the second clock cycle of the second round of decryption by means of the variance check approach and searching in a space of up to $2^{32}$ (in case the attack on the first round is ignored), all the round keys are revealed. We omit presenting the result of the attack because of its similarity to the results already presented in Figs. 19 and 20.
The decryption module of the AES_Comp core is a dedicated circuit which is depicted in Fig. 21. The first clock cycle is the only clock cycle where InverseShiftRows and InverseSubBytes without InverseMixColumns are in the critical path. The glitchy clock, therefore, must be injected in this clock cycle to extract the timing characteristics of the S-boxes. However, the output of the first decryption round cannot be viewed by the attacker; he can only check whether the whole decryption output (plaintext) is correct or not (similar to the result that the AES_FA core provides). Therefore, the timing characteristics of some S-boxes cannot be extracted and the attack scheme is similar to that of the AES_FA core. Again the result of the attack is omitted, while information about the critical path and the number of required captures is given in Table 1 (in the Appendix, which is available in the online supplemental material). We should emphasize that, contrary to the AES_FA core, the attack cannot be extended to the second decryption round since InverseMixColumns affects the critical path. Therefore, only the relation between some key bytes can be recovered, causing the search space for the full key to shrink significantly. In general, the architecture used for the decryption module (see Fig. 21) makes our proposed attack harder compared to the corresponding encryption module of the same core.
CONCLUSIONS
We have presented a collision attack which efficiently utilizes the data-dependent timing characteristics of combinational circuits to reveal the secrets. While the attack is based on the idea of [18], it is significantly more powerful since it does not require any knowledge about the characteristics of the target combinational circuit.
We demonstrated the power of the attack by successfully breaking all 14 AES implementations of the SASEBO LSI2 and LSI3 in all currently available process technologies. It is indicated in [18] that while masking does not prevent DFA attacks, it may actually provide security against FSA-based attacks because of the randomized inputs of the combinational functions. However, by breaking all DPA-protected cores of the mentioned ASICs we have shown that randomizing countermeasures by themselves cannot prevent data-dependent timing characteristics of the combinational circuit, and they therefore remain vulnerable to the attack introduced here.
Finally, our introduced attack is even capable of extracting the secret key out of an implementation applying a fault detection scheme by extending the attack to the second round of the cipher and a few hours of computation. In short, the results shown in this work imply the need for a special unit in designs, especially side-channel protected ones, to detect clock glitches and thwart this kind of attack.
Oliver Mischke received the diploma degree in IT-security from Ruhr-University Bochum, Germany, in 2010. After working for ESCRYPT - Embedded Security, he is now a research assistant at the Hardware Security Group, Chair of Embedded Security, Horst Görtz Institute for IT-Security, Ruhr-University Bochum. His research interests include the area of hardware security, particularly in efficient implementations of cryptographic schemes with a focus on side-channel resistance.
Christof Paar received the PhD degree in electrical engineering from the Institute for Experimental Mathematics, University of Essen. He holds the Chair of Embedded Security in the Electrical and Computer Engineering Department, Horst Görtz Institute for IT-Security, Ruhr-University Bochum. His research interests include physical security, cryptanalytical hardware, security in real-world systems, and efficient software and hardware implementations of cryptographic algorithms. He is a fellow of the IEEE, the ACM, and the International Association for Cryptologic Research (IACR).
