Abstract. Because of the isomorphisms in $\mathrm{GF}(2^8)$ there exist 240 different non-trivial dual ciphers of AES. While the inputs and outputs of a dual cipher are equal to those of the original AES, all intermediate values and operations can differ from those of the original. A comprehensive list of these dual ciphers is given in an article presented at ASIACRYPT 2002, which mentions that they might be used as a kind of side-channel attack countermeasure if the dual cipher is randomly selected. Later, a couple of works reported performance figures and overhead penalties of hardware implementations of this scheme. However, the suitability of randomly selected dual ciphers as a power analysis countermeasure has never been thoroughly evaluated in practice. In this work we address the pitfalls and flaws of this scheme when used as a side-channel countermeasure. As evidence of our claims, we provide practical evaluation results based on a Virtex-5 FPGA platform. We realized a design which randomly selects one of the 240 different dual ciphers at each AES computation. We also examined the side-channel leakage of the design under an information theoretic metric as well as its vulnerability to different attack models. As a result, we show that the protection provided by the scheme is negligible considering the increased cost in terms of area and the lower throughput.
Introduction
From a mathematical point of view, embedded systems can easily be protected by modern ciphers which are secure in a black-box scenario. However, since the late 90s the security of a cryptographic device relies not only on the use of a secure cryptographic algorithm but also on how this algorithm is implemented. Since sensitive information like the encryption keys of an unprotected implementation can be recovered by observing so-called side channels, the need for secure implementations of cryptographic primitives like AES is at an all-time high.
Many different kinds of countermeasures have been proposed for the protection of software and/or hardware platforms (see [18] for instance). Masking of sensitive values is one of the most considered solutions, and the community has shown a huge interest in different aspects of masking countermeasures, e.g., [2, 5, 8, 10, 11, 15, 22-24, 26, 28]. Because of the sequential nature of the platform, masking in software is usually straightforward and effective. However, realizing masking schemes in hardware is intricate since glitches in the circuit can cause otherwise theoretically secure schemes to leak [19-21].
Back in the early 2000s, there existed only a few attempts to better understand the algebraic specification of AES-Rijndael. One is about how to make dual ciphers which are equivalent to the original Rijndael in all aspects [3]. By replacing all the constants in Rijndael, including the irreducible polynomial, the coefficients of MixColumns, the affine transformation in the S-box, etc., the idea is to build other ciphers which generate the same ciphertext as the original Rijndael for the given plaintext and key. As explained in [3], there exist 240 non-trivial Rijndael dual ciphers, and a comprehensive list of the matrices and coefficients is given in [4]. Later, in [27], it was shown that one can include field mappings from $\mathrm{GF}(2^8)$ to $\mathrm{GF}(2)^8$ as well as intermediate isomorphic mappings to $\mathrm{GF}(2^2)$ and $\mathrm{GF}(2^4)$ to build 61 200 similar Rijndael dual ciphers.
This idea was taken up by the authors of [31], who investigated by means of the gate count which of those 240 dual ciphers can be implemented in hardware using a smaller area, and which ones can speed up the implementation. Since the intermediate values of the dual ciphers during encryption are different from Rijndael's, it is mentioned in [3] that one can randomly change the constants of the cipher, thereby realizing different dual ciphers, to provide security against power analysis attacks. This led to other contributions. For instance, a hardware-software co-design of a system based on an Altera FPGA, where the content of the lookup tables is dynamically changed according to the randomly chosen parameters, is presented in [16, 17]. Moreover, the authors of [12] and [13] presented a hardware implementation which can realize any selected dual cipher amongst those 240. They reported the performance and area loss incurred when the scheme is realized in order to increase the security against side-channel attacks.
In this work we examine this scheme, i.e., random selection of constants to choose a dual cipher out of 240, from a side-channel point of view. We address its flaws and weaknesses, which can lead to easily breaking the corresponding implementation. In order to examine our findings in practice, we implemented the scheme on a Virtex-5 FPGA by means of precomputed matrices and constants and, in contrast to [17], without using any lookup table. Our practical side-channel evaluations confirm our claims, indicating that the protection provided by the scheme is negligible while it incurs high area and performance overheads. We show that the implementation can be easily broken when the adversary chooses a suitable attack model.
The next section restates the concept of Rijndael dual ciphers with respect to the original work [3]. Our design of the scheme for our targeted FPGA platform, together with its performance and area overhead figures, is presented in Section 3. Our discussion of the side-channel resistance of the scheme and our practical investigations are given in Section 4, while Section 5 concludes our research.
Two ciphers $E$ and $E'$ are called dual ciphers if they are isomorphic, i.e., if there exist invertible transformations $f(\cdot)$, $g(\cdot)$ and $h(\cdot)$ such that
$$\forall P, K: \quad f\left(E_K(P)\right) = E'_{g(K)}\left(h(P)\right),$$
where plaintext and key are denoted by $P$ and $K$ respectively. The concept of dual ciphers for AES-Rijndael was first published in 2002 [3]. The authors demonstrate how to build a square dual cipher of the original AES and show that it is possible to iterate this process multiple times, creating more square dual ciphers. This way 8 dual ciphers for each possible irreducible polynomial in $\mathrm{GF}(2^8)$ can be derived. Since it is also shown how to create dual ciphers by porting the cipher to use one of the other 30 irreducible polynomials in $\mathrm{GF}(2^8)$, a total of 240 non-trivial dual ciphers for AES exist. Here non-trivial means that we only consider those dual ciphers which actually change the inner core of AES and do not merely consist of invertible transformations of the input and output of the cipher.
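The count of 240 can be checked with a few lines of Python. The following is a minimal sketch, not taken from the source: it enumerates the irreducible polynomials of degree 8 over GF(2) by trial division and multiplies their number by the 8 square dual ciphers per polynomial. Polynomials are encoded as integers whose bit i is the coefficient of x^i; the helper names `poly_mod` and `is_irreducible_deg8` are illustrative.

```python
def poly_mod(a, b):
    """Remainder of a divided by b, both polynomials over GF(2) stored as ints."""
    db = b.bit_length()
    while a.bit_length() >= db:
        a ^= b << (a.bit_length() - db)
    return a


def is_irreducible_deg8(p):
    """A degree-8 polynomial is irreducible iff no polynomial of degree 1..4 divides it."""
    # The integers 2..31 encode exactly the polynomials of degree 1 to 4.
    return all(poly_mod(p, d) != 0 for d in range(2, 32))


# 0x100..0x1FF are all polynomials of degree exactly 8.
irreducible = [p for p in range(0x100, 0x200) if is_irreducible_deg8(p)]
squares_per_polynomial = 8  # E, E^2, E^4, ..., E^128

print(len(irreducible), "irreducible polynomials")
print(len(irreducible) * squares_per_polynomial, "dual ciphers")
```

The AES polynomial 0x11b is one of the polynomials found, so the product counts the original representation together with its duals.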
As an example, closely following the explanation in [3], let us consider a square dual cipher of the original AES-Rijndael. In order to create this dual cipher one first has to multiply all AES constants by a matrix which performs the squaring operation under the original AES-Rijndael polynomial 0x11b. These constants include the round constants of the key schedule, the coefficients of the MixColumns transformation, as well as the input data and the key. In this special example the matrix is generated by taking a generator $a$, in this case the polynomial $x^2$ in $\mathrm{GF}(2^8)$, and building a matrix of the form
$$R = \left( a^0 \;\; a^1 \;\; a^2 \;\; a^3 \;\; a^4 \;\; a^5 \;\; a^6 \;\; a^7 \right),$$
where each of these elements represents a column of the matrix and the result of the exponentiation is reduced by the original AES reduction polynomial. Writing each column as a byte in hexadecimal notation, the resulting matrix is
$$R = \left( 01_x \;\; 04_x \;\; 10_x \;\; 40_x \;\; 1b_x \;\; 6c_x \;\; ab_x \;\; 9a_x \right).$$
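As a cross-check of this construction, the following minimal Python sketch builds the matrix column by column for the generator $a = x^2$ and confirms that multiplying by it squares every field element. The representation is an assumption made for illustration: a matrix is stored as a list of eight column bytes, bit i of a byte is the coefficient of x^i, and the helper names `gf_mul`, `build_matrix` and `mat_vec` are not taken from the source.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_mul(a, b, poly=AES_POLY):
    """Multiply two bytes as polynomials over GF(2), reduced modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def build_matrix(a, poly=AES_POLY):
    """Return the columns a^0, a^1, ..., a^7 as bytes."""
    cols, v = [], 1
    for _ in range(8):
        cols.append(v)
        v = gf_mul(v, a, poly)
    return cols


def mat_vec(cols, v):
    """Multiply the binary matrix (given by its columns) with the bit vector v over GF(2)."""
    r = 0
    for i in range(8):
        if (v >> i) & 1:
            r ^= cols[i]
    return r


R = build_matrix(0x04)  # generator a = x^2
print([f"{c:02x}" for c in R])

# Multiplying by R is the same as squaring in GF(2^8) under 0x11b.
squares_ok = all(mat_vec(R, v) == gf_mul(v, v) for v in range(256))
print("R squares every element:", squares_ok)
```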
Furthermore, we also need to make changes to the SubBytes transformation. If we consider SubBytes to be a pure table look-up of constants $S(x)$, we can compute a new look-up table $S^2$ by applying the matrix $R$ and its inverse $R^{-1}$ as follows: $S^2(x) = R\,S(R^{-1}x)$. If we consider the SubBytes transformation as an inversion in $\mathrm{GF}(2^8)$ followed by a multiplication by the affine matrix $A$ and the addition of the constant $b$, then the inversion stays unchanged while the new affine matrix $A^2$ is computed as $A^2 = RAR^{-1}$ and the new constant $b^2$ is computed (similar to the other constants, i.e., those of MixColumns and the key schedule) by multiplying it with the transformation matrix $R$: $b^2 = Rb$. Note that in the case of $S^2$ or $A^2$ no actual squaring is taking place. If we consider the original cipher as $E$ and the square dual cipher described above as $E^2$, then by applying the same squaring routines again we can create a total of 8 square dual ciphers (up to $E^{128}$, since $E^{256}$ is equal to $E$ in $\mathrm{GF}(2^8)$). These square dual ciphers all use different constants and SubBytes transformations. According to the dual cipher concept, if all input data bytes and key bytes are multiplied by the matrix $R$ and the result is transformed back by multiplying each output byte with the inverse matrix $R^{-1}$, the results of all ciphers are equal when given the same input data and key. The differences in the internal structure, like the different S-box in SubBytes or the different coefficients in MixColumns, also translate into, e.g., different power consumption and EM emanations of a circuit implementing this technique. As noted in [3], these differences in the internal structure of the dual ciphers might be usable as a kind of side-channel countermeasure. If the dual cipher in use is randomly chosen, this could be comparable to a normal masking countermeasure.
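The two views of the transformed SubBytes can be compared numerically. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it rebuilds the AES S-box from inversion plus affine transformation, uses squaring as the matrix $R$, and checks that the table form $R\,S(R^{-1}x)$ agrees with the form that keeps the inversion and swaps in $RAR^{-1}$ and $Rb$. All helper names (`gf_mul`, `affine_linear`, `mat_vec`, `S2_table`, `S2_affine`) are hypothetical.

```python
AES_POLY = 0x11B


def gf_mul(a, b, poly=AES_POLY):
    """Multiply two bytes as polynomials over GF(2), reduced modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_inv(x):
    """x^254: the inverse for x != 0, and 0 for x = 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, x)
    return r


def rotl8(v, n):
    return ((v << n) | (v >> (8 - n))) & 0xFF


def affine_linear(v):
    """Linear part A of the AES affine transformation."""
    return v ^ rotl8(v, 1) ^ rotl8(v, 2) ^ rotl8(v, 3) ^ rotl8(v, 4)


def mat_vec(cols, v):
    """Multiply a binary matrix (list of column bytes) with the bit vector v."""
    r = 0
    for i in range(8):
        if (v >> i) & 1:
            r ^= cols[i]
    return r


INV = [gf_inv(x) for x in range(256)]
SBOX = [affine_linear(INV[x]) ^ 0x63 for x in range(256)]

# R is the squaring map; its inverse is obtained by table look-up.
R = [gf_mul(v, v) for v in range(256)]
R_INV = [0] * 256
for v in range(256):
    R_INV[R[v]] = v

# Transformed affine matrix (column i is R A R^-1 applied to the unit vector e_i)
# and transformed affine constant.
A2 = [R[affine_linear(R_INV[1 << i])] for i in range(8)]
b2 = R[0x63]

S2_table = [R[SBOX[R_INV[x]]] for x in range(256)]       # R S(R^-1 x)
S2_affine = [mat_vec(A2, INV[x]) ^ b2 for x in range(256)]  # unchanged inversion

print("table and affine forms agree:", S2_table == S2_affine)
print("dual property S2(Rx) = R S(x):",
      all(S2_table[R[x]] == R[SBOX[x]] for x in range(256)))
```

Applying the squaring map eight times returns every byte to itself, which is the reason why the chain of square dual ciphers closes after $E^{128}$.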
Besides using square dual ciphers of the original AES-Rijndael, one can use the same transformation techniques as above to change all constants by using different generators $a$ and reducing the powers $a^i$ by the new irreducible polynomial. If the SubBytes transformation is not implemented as a table look-up but as inversion plus affine transformation, the inversion as well as all field multiplications, as in MixColumns, are then also performed using the new irreducible polynomial instead of the original one. This works for all 30 irreducible polynomials in $\mathrm{GF}(2^8)$. Since there exist 8 generators for each irreducible polynomial, representing the 8 square dual ciphers, a total of 240 different non-trivial dual ciphers in $\mathrm{GF}(2^8)$ exist, as stated previously. All generators, polynomials and constants of each of the 240 dual ciphers can be found in [4]. Note that we consider only dual ciphers using mappings in $\mathrm{GF}(2^8)$, not those where other composite field representations are utilized, e.g., those presented in [27].
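To illustrate the change of polynomial, the following minimal Python sketch searches for the generators belonging to one alternative irreducible polynomial and checks that each resulting matrix preserves field multiplication. The choice of 0x11d as the example polynomial and the search by testing which elements are roots of the AES polynomial are assumptions made for illustration; the source takes its generators from the list in [4].

```python
AES_POLY = 0x11B   # x^8 + x^4 + x^3 + x + 1
NEW_POLY = 0x11D   # x^8 + x^4 + x^3 + x^2 + 1, chosen here only as an example


def gf_mul(a, b, poly):
    """Multiply two bytes as polynomials over GF(2), reduced modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def powers(a, poly):
    """Return [a^0, a^1, ..., a^8] computed modulo poly."""
    out = [1]
    for _ in range(8):
        out.append(gf_mul(out[-1], a, poly))
    return out


def generators(poly):
    """Elements a of the new field with a^8 + a^4 + a^3 + a + 1 = 0."""
    found = []
    for a in range(256):
        p = powers(a, poly)
        if p[8] ^ p[4] ^ p[3] ^ p[1] ^ p[0] == 0:
            found.append(a)
    return found


def mat_vec(cols, v):
    """Multiply a binary matrix (list of column bytes) with the bit vector v."""
    r = 0
    for i in range(8):
        if (v >> i) & 1:
            r ^= cols[i]
    return r


gens = generators(NEW_POLY)
matrices = [powers(a, NEW_POLY)[:8] for a in gens]
print(len(gens), "generators for polynomial", hex(NEW_POLY))

# Each matrix R must map products under 0x11b to products under the new polynomial.
is_isomorphism = all(
    mat_vec(R, gf_mul(u, v, AES_POLY)) == gf_mul(mat_vec(R, u), mat_vec(R, v), NEW_POLY)
    for R in matrices
    for u in range(256)
    for v in range(0, 256, 7)
)
print("all matrices preserve multiplication:", is_isomorphism)
```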
Our Design
The first design decision one has to make is whether to implement the SubBytes transformation purely based on look-up tables or to use a general inversion circuit together with the affine matrix multiplication and constant addition. Since the area overhead of storing 240 different complete S-boxes is massive, we opted, similar to [12] and [13], to implement a general inversion circuit. Since we want to analyze the side-channel resistance of the original dual cipher proposal [3], the inversion must be implemented in $\mathrm{GF}(2^8)$ without the option to save resources by utilizing inversions in composite fields or a tower field approach [25, 29]. In other words, the inversion circuit must be general and valid for all 30 irreducible polynomials mentioned in Section 2.
In order to prevent leakage through the timing channel during the inversion, it is important to make the circuit time invariant. To achieve this one can make use of the fact that in $\mathrm{GF}(2^8)$ $x^{256}$ is equal to $x$, which leads to $x^{-1} \equiv x^{254}$. Using addition chains, this exponentiation can be implemented by a low number of modular multipliers and squaring circuits, as depicted in Fig. 1. Note that the squaring step itself is free in $\mathrm{GF}(2^8)$ and only requires hardware resources for the modular reduction.
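The exponentiation can be sketched in Python as follows. This is a minimal illustration and not a description of Fig. 1: the particular addition chain (four multiplications and seven squarings) is an assumption, since the source does not spell out the chain it uses, and the names `gf_sqr` and `gf_inv_chain` are hypothetical. The reduction polynomial is a parameter so that the same routine serves every dual cipher.

```python
def gf_mul(a, b, poly):
    """Multiply two bytes as polynomials over GF(2), reduced modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_sqr(a, poly):
    return gf_mul(a, a, poly)


def gf_inv_chain(x, poly):
    """Compute x^254 with a fixed sequence of operations, independent of x."""
    x2 = gf_sqr(x, poly)            # x^2
    x3 = gf_mul(x2, x, poly)        # x^3
    x6 = gf_sqr(x3, poly)           # x^6
    x12 = gf_sqr(x6, poly)          # x^12
    x15 = gf_mul(x12, x3, poly)     # x^15
    x240 = x15
    for _ in range(4):              # x^30, x^60, x^120, x^240
        x240 = gf_sqr(x240, poly)
    x252 = gf_mul(x240, x12, poly)  # x^252
    return gf_mul(x252, x2, poly)   # x^254


for poly in (0x11B, 0x11D):
    ok = all(gf_mul(x, gf_inv_chain(x, poly), poly) == 1 for x in range(1, 256))
    print(hex(poly), "inverse correct for all non-zero x:", ok)
```

Because the same operations run for every input, including zero, the number of steps does not depend on the data, which is the property the time-invariant circuit relies on.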
For each possible dual cipher one needs to store the following parameters:
1. Initial transformation matrix $R$ (64 bits), which is required to transform the original input data and key to the dual cipher representation.
2. Inverse transformation matrix $R^{-1}$ (64 bits), required to transform the output of the AES computation from the dual cipher representation back to the original AES representation, precomputed as a normal matrix inversion of $R$ in $\mathrm{GF}(2)$.
3. Modular reduction polynomial $\hat{p}$ (8 bits), to be used during all field multiplications (MixColumns) and the inversion steps (SubBytes).
4. MixColumns coefficients $\widehat{mc}$ (2 × 8 bits). While the MixColumns coefficients originally are 8-bit elements of a 4 × 4 matrix, the coefficients of each row are only a rotated variant of the first row and only two of them are not $01_x$ (in $\mathrm{GF}(2^8)$). It is therefore sufficient to store only these two transformed coefficients $(R(02_x), R(03_x))$.
5. Affine matrix of SubBytes $\hat{A}$ (64 bits), to apply the affine matrix multiplication step of the affine transformation. The matrix is computed as $\hat{A} = RAR^{-1}$, where $A$ is the original affine matrix of the AES.
6. Affine constant $\hat{b}$ of SubBytes (8 bits), for the final addition step of the affine transformation. As for every other constant transformation, this can be computed as $\hat{b} = Rb$, where $b$ is the original affine constant, i.e., $63_x$.
7. Round constants (rcon) of the key schedule $\widehat{rc}$ (10 × 8 bits). The rcons are constructed as $\widehat{rc}(r) = (R\,02_x)^r \bmod \hat{p}$, with $r$ starting from 1 for the first round, $\hat{p}$ being the used irreducible polynomial, $02_x$ the initial element, and $R$ the transformation matrix. The rcons could also be computed on the fly, which would only require the storage of the transformed $\widehat{rc}_{init} = R\,02_x$ (8 bits). Since this would require another modular multiplier, we have opted to store all the precomputed rcons for each of the 240 dual ciphers.
The overall architecture of our evaluation circuit is depicted in Fig. 2. The initial transformations of the input data and key are performed prior to the general AES/dual cipher computation. After the full encryption is complete, the inverse transformation moves the result back to the original AES representation as described previously. The AES/dual cipher computation itself is implemented as a round-based design, i.e., every round of AES requires one clock cycle and the round keys are computed on the fly. We chose to implement a round-based design because this is very common in real-world implementations when a hardware platform is targeted. The on-the-fly key scheduling seems to be the most suitable option since the round keys, which are different for each dual cipher, would otherwise require 41 kBytes of storage.
We have implemented the whole design on a Virtex-5 LX50 FPGA mounted on a SASEBO-GII [1] (Side-channel Attack Standard Evaluation Board). In our implementation all the aforementioned parameters and constants are stored in block RAMs and are preloaded before every complete AES computation. The resource utilization is shown in Table 1. Compared to an unprotected design utilizing a more common S-box implementation based on look-up tables, we require significantly more LUT resources. This is due to the 20 large general inversion circuits implemented in parallel (16 for the round function and 4 for the key schedule), which are required to perform the inversion in every selectable dual cipher representation. The number of LUTs could be heavily reduced by using a composite field or tower field approach in the S-box design which, as stated previously, we have not implemented at this point in order to enable a side-channel evaluation of the original dual cipher proposal.
We should also highlight the very low maximum operating frequency of our design. It is due to the very long critical path of the inversion unit. Since this unit has to be general for any given irreducible polynomial, it could not be optimized with respect to either delay or area.
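The storage figures quoted in this section can be re-derived with simple arithmetic. The following Python sketch adds up the parameter sizes listed above and compares them with the alternative of storing precomputed round keys. The assumption of AES-128 with 11 round keys of 16 bytes is inferred from the 10 stored round constants and is not stated explicitly in the source; the variable names are illustrative.

```python
# Bits stored per dual cipher, following the parameter list of this section.
parameter_bits = {
    "R": 64,
    "R_inverse": 64,
    "reduction_polynomial": 8,
    "mixcolumns_coefficients": 2 * 8,
    "affine_matrix": 64,
    "affine_constant": 8,
    "round_constants": 10 * 8,
}

NUM_DUAL_CIPHERS = 240

bits_per_cipher = sum(parameter_bits.values())
total_parameter_bytes = bits_per_cipher * NUM_DUAL_CIPHERS // 8

# Alternative: precomputed round keys (assumption: AES-128, 11 keys of 16 bytes).
round_key_bytes = NUM_DUAL_CIPHERS * 11 * 16

print("bits per dual cipher:", bits_per_cipher)
print("parameter storage in bytes:", total_parameter_bytes)
print("round-key storage in bytes:", round_key_bytes,
      f"(about {round_key_bytes / 1024:.1f} KiB)")
```

Under these assumptions the parameters occupy roughly 9 kBytes in total, while precomputed round keys alone would need the 41 kBytes mentioned above, which supports the choice of an on-the-fly key schedule.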
Evaluation
According to the explanation and the architecture figure given in the previous section, at the start of each encryption process the PRNG randomly selects one of the 240 dual ciphers and holds the outputs of the BRAM, i.e., constants and coefficients, until the whole encryption process is finished. If the selected dual cipher, whose index is denoted here by $1 \le i \le 240$, is unknown to the adversary, the intermediate values cannot be predicted. Therefore, the scheme can be seen as a kind of masking scheme against which certain side-channel attacks are supposed to be infeasible. However, below we address a few issues which significantly affect the robustness of the scheme.
Mask Reuse
All intermediate values and inputs are transformed to a new domain by means of the selected transformation matrix $R_i$. This means that all 16 plaintext bytes are transformed using the same transformation, which is similar to the mask reuse issue in masking schemes. In the case of, e.g., Boolean masking, when two S-boxes get inputs masked by the same mask value, a classical linear collision attack [6] might be able to recover the corresponding key byte difference [9]. The same holds for the dual cipher case: all S-boxes compute the inversion using the same parameters, and their inputs have been transformed using the same matrix. Once a collision between two S-boxes is detected with the help of side-channel leakages, the key byte difference follows from
$$R_i\left(x^{(1)} \oplus k^{(1)}\right) = R_i\left(x^{(2)} \oplus k^{(2)}\right) \;\Rightarrow\; k^{(1)} \oplus k^{(2)} = x^{(1)} \oplus x^{(2)},$$
where $x^{(j)}$ and $k^{(j)}$ denote the $j$-th byte of the given plaintext and key. However, in the case of our design, which realizes a round-based architecture, this issue can be ignored since the side-channel leakages of different S-boxes in a round cannot be separated, making the collision detection infeasible.
Concurrent Processing of Mask and the Masked Data
In contrast to software implementations of masking, preventing univariate leakages is a challenging task when the target platform is hardware. This is due to the glitches of the circuit, e.g., of a masked S-box, when it processes both the mask and the masked data at the same time. This issue has been seen in many different realizations of masking in hardware (see [19-21]). Our implementation of dual ciphers suffers from this problem as well. The SubBytes unit gets the transformed key-whitened input as well as the irreducible polynomial $\hat{p}$, the affine matrix coefficients $\hat{A}$, etc. All these parameters are independent of the transformed input. Therefore, the side-channel leakage of, e.g., the S-box circuit due to its glitches is not independent of its original (untransformed) input. Hence it is expected that a univariate attack, e.g., a CPA [7] with an appropriate power model or an MIA [14], will be able to recover the relation between the leakages and the secret materials.
Fig. 3. Distributions of the S-box output for (top) $11_x$ and (bottom) $44_x$ as original input over all 240 dual ciphers
Unbalance
Having the lemmas and properties given in [22, 23] in mind, we explain this issue as follows. For the sake of simplicity, suppose a masking scheme which maps an input value $x$ into its masked representation $x_m$ for a given mask $m$ as $x_m = x * m$. In order to guarantee the balance of the distributions, the conditional probability $\Pr(x_m = X_M \mid x)$ must be constant for all $x$ and the given $X_M$, by which we mean a realization of $x_m$. In other words, if $f_x(x_m)$ represents the probability density function of $x_m$ for the given $x$, each pair of probability distributions $f_{x=X_1}(x_m)$ and $f_{x=X_2}(x_m)$ must be equal, where $X_1$ and $X_2$ are two realizations of $x$. Otherwise, when two distributions are different, their corresponding side-channel leakages can be distinguished from each other, which may make it possible to detect whether $X_1$ or $X_2$ is processed. This property should hold for all intermediate values at all steps of the scheme. However, it does not hold in the case of dual ciphers. For example, we considered the S-box output and computed the probability distributions for the two original input values $11_x$ and $44_x$ over all 240 cases. The two different resulting histograms are shown in Fig. 3, clearly indicating the unbalance of intermediate values. Therefore, it is expected that a univariate side-channel attack can be successfully mounted.
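The kind of histogram described here can be recomputed with a short script. The following Python sketch is a minimal illustration under stated assumptions, not the authors' evaluation code: it regenerates the 240 transformation matrices by pairing every irreducible polynomial of degree 8 with the 8 roots of the AES polynomial in the corresponding field, and it models the S-box output of dual cipher $i$ for an original input $x$ as $R_i\,S(x)$. All helper names are hypothetical.

```python
from collections import Counter

AES_POLY = 0x11B


def gf_mul(a, b, poly):
    """Multiply two bytes as polynomials over GF(2), reduced modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def poly_mod(a, b):
    """Remainder of a divided by b, both polynomials over GF(2) stored as ints."""
    db = b.bit_length()
    while a.bit_length() >= db:
        a ^= b << (a.bit_length() - db)
    return a


def all_matrices():
    """One column list (a^0..a^7) per pair (polynomial q, root a of the AES polynomial)."""
    mats = []
    for q in range(0x100, 0x200):
        if not all(poly_mod(q, d) for d in range(2, 32)):
            continue  # q is reducible
        for a in range(256):
            p = [1]
            for _ in range(8):
                p.append(gf_mul(p[-1], a, q))
            if p[8] ^ p[4] ^ p[3] ^ p[1] ^ p[0] == 0:
                mats.append(p[:8])
    return mats


def mat_vec(cols, v):
    """Multiply a binary matrix (list of column bytes) with the bit vector v."""
    r = 0
    for i in range(8):
        if (v >> i) & 1:
            r ^= cols[i]
    return r


def sbox(x):
    """Original AES S-box: x^254 followed by the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x, AES_POLY)
    s = inv
    for n in range(1, 5):
        s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
    return s ^ 0x63


MATRICES = all_matrices()


def output_histogram(x):
    """Distribution of the transformed S-box output over all dual ciphers."""
    s = sbox(x)
    return Counter(mat_vec(R, s) for R in MATRICES)


h11 = output_histogram(0x11)
h44 = output_histogram(0x44)
print("number of dual ciphers:", len(MATRICES))
print("distinct output values for 11x and 44x:", len(h11), len(h44))
print("histograms identical:", h11 == h44)
```

A balanced scheme would give the same histogram for every original input; comparing `h11` and `h44` shows whether this holds for the two inputs used in Fig. 3.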
Zero Value
There is a general problem in multiplicative masking schemes, namely masking the zero value: regardless of the mask $m$, the input value $x = 0$ never gets masked. Therefore, a CPA attack using the zero-value power model [15] can easily overcome the protection. The same problem holds for the dual cipher approach. Because of the linearity of the transformation, i.e., multiplication by the matrix $R$ in $\mathrm{GF}(2)$, the zero input is always transformed to itself in all 240 cases. It is indeed a special case of the unbalanced distributions. The distribution for the zero value $f_{x=0}(x_m)$ shows the certainty of $x_m$ and is very different from all other distributions with $x \neq 0$. Therefore, a zero-value CPA attack targeting the S-box input should break the implementation.
Before moving toward practical results, we would like to comment on these issues for the case where not only the 240 AES dual ciphers are considered but more are taken from the 61 200 cases of [27]. Leaving aside whether it exists and how difficult it is to find, one may find a set of 255 dual ciphers which satisfies the balance property. Note that because of the zero-value issue, a set of 256 dual ciphers can never satisfy the property. On the other hand, if the desired condition is fulfilled considering, e.g., the S-box input, the balance property cannot be guaranteed for the S-box output because each dual cipher employs a different S-box. Therefore, it seems that the balance property can never be fully satisfied. In short, for any selection of dual ciphers all the aforementioned problems remain valid in theory, making the scheme vulnerable to certain attacks.
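The zero-value property can be made concrete with a noiseless toy model. The sketch below is an illustration under loudly stated assumptions and is no substitute for the measurements reported later: it regenerates the 240 transformation matrices as pairs of an irreducible polynomial and a root of the AES polynomial, and it assumes that the leakage of the S-box input is its Hamming weight. All helper names are hypothetical.

```python
AES_POLY = 0x11B


def gf_mul(a, b, poly):
    """Multiply two bytes as polynomials over GF(2), reduced modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def poly_mod(a, b):
    """Remainder of a divided by b, both polynomials over GF(2) stored as ints."""
    db = b.bit_length()
    while a.bit_length() >= db:
        a ^= b << (a.bit_length() - db)
    return a


def all_matrices():
    """One column list (a^0..a^7) per pair (polynomial q, root a of the AES polynomial)."""
    mats = []
    for q in range(0x100, 0x200):
        if not all(poly_mod(q, d) for d in range(2, 32)):
            continue  # q is reducible
        for a in range(256):
            p = [1]
            for _ in range(8):
                p.append(gf_mul(p[-1], a, q))
            if p[8] ^ p[4] ^ p[3] ^ p[1] ^ p[0] == 0:
                mats.append(p[:8])
    return mats


def mat_vec(cols, v):
    """Multiply a binary matrix (list of column bytes) with the bit vector v."""
    r = 0
    for i in range(8):
        if (v >> i) & 1:
            r ^= cols[i]
    return r


MATRICES = all_matrices()

# Transformed S-box input for every original value and every dual cipher.
images = {v: [mat_vec(R, v) for R in MATRICES] for v in range(256)}

# Assumed leakage model: Hamming weight of the transformed value, averaged
# over a uniformly random choice of the dual cipher.
mean_hw = {v: sum(bin(y).count("1") for y in ys) / len(ys)
           for v, ys in images.items()}

print("images of the zero value:", set(images[0]))
print("mean Hamming weight for zero:", mean_hw[0])
print("smallest mean Hamming weight for a non-zero value:",
      min(mean_hw[v] for v in range(1, 256)))
```

In this model the original value zero is the only one with an average leakage of zero, so the plaintext byte that cancels the key byte stands out regardless of which dual cipher is selected.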
Practical Investigations
As stated before, our practical experiments are based on a SASEBO-GII [1] platform. The design was implemented on the crypto FPGA of the board, a Virtex-5 LX50. The crypto core receives the plaintext and performs the whole encryption operation by means of the stored key, while the dual cipher index $i$ is internally generated by a PRNG. A LeCroy HRO66Zi 600 MHz digital oscilloscope at a sampling rate of 1 GS/s was used to measure the power consumption of the crypto core over a 1 Ω resistor in the VDD path (oscilloscope in AC mode). In order to reduce the electronic noise, the bandwidth of the oscilloscope was set to 20 MHz, and the crypto core was running at a clock frequency of 1.5 MHz during all of our practical experiments.
A sample power trace clearly indicating the round computations is shown in Fig. 4. The first peak between 0 and 1 µs is due to the selection of the dual cipher. At this clock cycle the corresponding parameters of the selected dual cipher appear at the BRAMs' outputs, causing glitches and activity in the whole circuit. We should also highlight the very high power consumption of the design, resulting in power traces of more than 300 mV peak-to-peak. Following the information theoretic metric of [30] to examine the amount of information available through the power traces, we computed the mutual information based on the output of an S-box module in the first round. In order to examine the level of protection provided by the scheme, we also collected power consumption traces of the design with the PRNG switched off, thereby selecting the original AES parameters. In both cases we used 2 000 000 traces to produce the mutual information curves shown in Fig. 5(a). The figure indeed shows that, compared to the unprotected case and considering the same number of traces, the scheme can reduce the information available to be recovered by an adversary. Figure 5(b) compares the mutual information of these two cases in the presence of noise. In order to produce this figure we artificially added Gaussian noise to the collected traces.
The figures shown confirm our theoretical discussion of the available univariate leakages, which can be used by different attacks. In order to check the feasibility of a successful attack we mounted a correlation collision attack [21] making use of the first-order moments. We targeted two S-boxes of the first round and tried to recover their corresponding key difference. The result of the attacks is depicted in Fig. 6. As expected, the attack is successful, though compared to the unprotected case the number of required traces increased from fewer than 5 000 to 100 000.
We also examined the feasibility of a zero-value attack, whose results are presented in Fig. 7. The results confirm our theoretical claim that a zero-value attack exploits one of the weakest points of the scheme, since as few as 10 000 traces suffice to overcome the provided protection.
Conclusions
In this work we have taken an in-depth look at the AES-Rijndael dual cipher concept from a side-channel point of view. We have implemented an evaluation circuit which is able to perform AES computations using randomly chosen dual ciphers. The inversion part of the circuit operates in $\mathrm{GF}(2^8)$, as in the original dual cipher contribution [3], giving a total choice of 240 different internal computations with correspondingly different side-channel leakage characteristics.
Besides providing practical evidence of the vulnerability of this original dual cipher implementation to several side-channel attacks, we have also described some of the general flaws of the scheme when considered as a side-channel countermeasure. These include the mask reuse, the concurrent operations on both the mask and the masked data, the violation of the balance property, and the inability to mask the zero value. Because of these properties, the vulnerability of dual cipher implementations is not limited to those which are restricted to a small number of possible transformations by focusing on mappings in $\mathrm{GF}(2^8)$. Even if one were able to select between several thousand dual ciphers using composite fields, as given in [27], the described weaknesses would still exist and would enable an attacker to successfully extract the secret key. In conclusion, even when ignoring the large area overhead of the circuit in comparison to other, lighter masking schemes, AES-Rijndael dual ciphers are unsuitable as a side-channel countermeasure and can be broken using modest effort and simple attack models.
