Abstract
processing of sensitive data. Designing such cores for on-line testability prevents structural failures from causing loss of service and compromising security.
Fault detection and tolerance schemes for various implementations of cryptographic algorithms have recently been considered. Two main approaches have been developed, based either on information redundancy (e.g., the use of codes [1] [2] [3]) or on functional redundancy ([3] [4] [5]).
All the techniques based on codes add some bits to the original data word in order to check its validity. The main issue in these approaches is the prediction of the value of the code on an output, given the input value and the executed operation.
For instance, the prediction of a parity bit is almost straightforward for the ShiftRows, MixColumns and AddRoundKey operations performed in the AES [7], because these transformations are either linear or merely perform bit permutations (see Section 2 for a detailed description of the AES). Conversely, the prediction of the parity bit is not trivial for the SubBytes operation performed by the so-called S-Boxes. As a consequence, the parity prediction requires larger circuitry. Solutions based on parity codes ([1] [2]) lead to an overhead of about 20% and high single-error detection. However, they are not effective in the case of multiple faults or of single faults that lead to an even number of errors. Other solutions based on more complex codes such as CRC [1] or systematic nonlinear robust codes [3] lead to higher fault coverage, but at the expense of a significant area overhead (> 60%).
Alternatively, the techniques presented in [3], [4], and [5] are based on functional redundancy. They can be used whenever encryption and decryption modules are implemented on the same circuit. Each encoding phase is followed by a decode-and-compare phase that checks whether the resulting decoded text matches the initial plaintext. A similar procedure is employed when the circuit is used for decoding a ciphertext.
In contrast to most of the previously proposed approaches, which focus on the S-Boxes only (the dominant component, accounting for up to 75% of the circuit area), we propose a low-cost self-test architecture for detecting single and multiple faults in most of the AES hardware. Testing is accomplished by duplication and comparison. The main idea is to implement the datapath in such a way that several identical blocks can be defined. With one additional block, on-line pairwise comparisons of blocks are implemented to check the functionality of the AES hardware. Efficiency and low area overhead are achieved by exploiting the spatial duplication inherent in the parallel implementation of the algorithm.
lirmm-00423026, version 1 -13 Oct 2009
Moreover, since any structural modification of the hardware implementation may jeopardize the digital security, the proposed architecture is also checked with respect to one of the most common attacks, based on power analysis [6].
The paper is organized as follows. Section 2 introduces the basic concepts and the characteristics of the Advanced Encryption Standard algorithm. Section 3 presents the proposed on-line self-test approach, while Section 4 discusses the results in terms of area overhead and fault detection capability. Section 5 introduces the problem of side-channel attacks based on power analysis, and presents experimental results showing the resistance of the proposed architecture to such an attack. Finally, Section 6 concludes the paper.
Advanced Encryption Standard
AES [7] is a block cipher adopted as an encryption standard by the U.S. government. AES immediately began to replace the Data Encryption Standard (DES, used since 1976) because it offers better long-term security thanks to, among other things, larger key sizes (128, 192, or 256 key bits). For the sake of simplicity, we focus on the 128-bit key in the remainder of the paper. Another major advantage of AES is its efficient implementation on various platforms. It is suitable for small 8-bit microprocessor platforms, common 32-bit processors, and dedicated hardware implementations that can reach throughput rates in the gigabit range. Several hardware implementations are presented in [8].
The AES algorithm's internal operations are performed on a two-dimensional array of bytes called the State. The State consists of 4 rows of 4 bytes. Each byte is denoted by $S_{i,j}$ ($0 \le i < 4$, $0 \le j < 4$). The four bytes in each column of the State array form a 32-bit word, with the row number as the index for the four bytes in each word. The initial plaintext is a 128-bit block that can be expressed as 16 bytes: $in_0, in_1, in_2, \ldots, in_{15}$. Encryption and decryption processes are performed on the State, at the end of which the final value is mapped to the output byte array $out_0, out_1, out_2, \ldots, out_{15}$.
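The mapping between the 16 input bytes and the State can be sketched in Python as follows. This is only an illustration: it assumes the column-major convention of [7] (input byte $in_{r+4c}$ is stored at row $r$, column $c$) and that row 0 is the most significant byte of each column word. The function names are hypothetical and do not come from the paper.

```python
def bytes_to_state(block):
    """Copy a 16-byte block into a 4x4 State, column by column."""
    if len(block) != 16:
        raise ValueError("AES operates on 128-bit (16-byte) blocks")
    # state[r][c] holds input byte r + 4*c (convention assumed from [7])
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state):
    """Map the final State back to the 16 output bytes."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


def column_word(state, c):
    """Return column c as a 32-bit word (row 0 assumed most significant)."""
    return int.from_bytes(bytes(state[r][c] for r in range(4)), "big")


block = bytes(range(16))
state = bytes_to_state(block)
```

With this convention, `state[1][0]` holds input byte 1 and `state[0][1]` holds input byte 4, so each column groups four consecutive input bytes.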
The AES is an iterative process composed of 10 rounds. The plaintext to be encrypted is first copied to the State array. After the initial secret key addition (roundkey(0)), the first 9 rounds are identical, with a small difference in the 10th round. As illustrated in Figure 1, each of the first 9 rounds consists of 4 transformations: SubBytes, ShiftRows, MixColumns and AddRoundKey. The final round excludes the MixColumns transformation. The encryption scheme in Figure 1 can be inverted to get a straightforward structure for decryption.
SubBytes Transformation
The SubBytes transformation is a non-linear byte substitution that operates independently on each byte of the State using a substitution table (S-Box). This S-Box is constructed by composing two transformations:
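1. Take the multiplicative inverse in the finite field $GF(2^8)$; the element $(00000000)_2$ is mapped to itself;
2. Apply the following affine transformation (over $GF(2)$), where $b_i$ is bit $i$ of the byte and $c = (01100011)_2$, as specified in [7]:
$$b'_i = b_i \oplus b_{(i+4) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+6) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus c_i, \qquad 0 \le i < 8$$
AddRoundKey Transformation
In the AddRoundKey transformation, a roundkey is added to the State array by a bitwise XOR operation. Each roundkey consists of 16 bytes generated by the Key Expansion operation described below.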
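The two-step S-Box construction and the AddRoundKey XOR described above can be sketched in Python as follows. This is a software illustration, not the hardware implementation discussed in this paper. It assumes the reduction polynomial $x^8 + x^4 + x^3 + x + 1$ and the affine constant 0x63 given in [7]; the helper names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8+x^4+x^3+x+1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 is mapped to itself."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):          # a^254 equals a^-1 in GF(2^8)
        result = gf_mul(result, a)
    return result


def affine(x):
    """Affine transformation over GF(2) with constant 0x63 (from [7])."""
    out = 0
    for i in range(8):
        bit = (x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
        bit ^= (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)
        out |= (bit & 1) << i
    return out


# Step 1 (inversion) followed by step 2 (affine transformation)
SBOX = [affine(gf_inv(x)) for x in range(256)]


def add_round_key(state_bytes, round_key):
    """Bitwise XOR of the 16 State bytes with the 16 roundkey bytes."""
    return [s ^ k for s, k in zip(state_bytes, round_key)]
```

Since XOR is its own inverse, applying `add_round_key` twice with the same roundkey returns the original State, which is why the same operation serves both encryption and decryption.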
Key Expansion
The key expansion routine, as part of the overall AES algorithm, takes the input secret key of 128 bits and outputs an expanded key of 11*128 bits composed of the input secret key and 10 roundkeys, one for each round. Details of the algorithm for determining the value of each roundkey are given in [7] .
Functional redundancy for on-line fault detection
The technique we propose in this paper is designed for all AES cores (encryption and decryption) that use 16 S-Box instances. We do not consider low-area implementations, in which there is only one S-Box at the cost of several clock cycles for completing one encryption/decryption round. Our goal is to identify a partitioning of the circuit that allows a repetition of identical sub-blocks. These sub-blocks are compared pairwise for on-line fault detection thanks to the implementation of an extra sub-block. In the classical architecture depicted in Figure 2, ShiftRows unfortunately prevents such a partitioning since it operates on all the 128 bits.
However, by inspecting the AES algorithm, it can be seen that the SubBytes and ShiftRows functions can be switched. We thus propose to perform ShiftRows before SubBytes, and even before loading the registers. The same procedure can be applied to the whole circuit and, as a consequence, the datapath can be divided into 4 identical slices that operate on 32 bits each, which we call RSMA (a 32-bit Register, 4 S-Boxes, 1 MixColumns and 32 XORs for the AddRoundKey operation).
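The reason the two functions can be switched is that SubBytes acts on each byte independently while ShiftRows only moves bytes around. The short Python check below illustrates this. It is a sketch under assumptions: the substitution table is an arbitrary stand-in permutation rather than the AES S-Box (the property holds for any byte-wise substitution), ShiftRows rotates row $r$ left by $r$ positions as defined in [7], and all names are hypothetical.

```python
import random


def shift_rows(state):
    """Rotate row r of the 4x4 State left by r byte positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def sub_bytes(state, table):
    """Substitute every byte independently through the given table."""
    return [[table[b] for b in row] for row in state]


def slices(state):
    """Split the State into its 4 columns, one 32-bit slice per RSMA block."""
    return [[state[r][c] for r in range(4)] for c in range(4)]


rng = random.Random(0)
table = list(range(256))
rng.shuffle(table)                      # stand-in table, NOT the AES S-Box
state = [[rng.randrange(256) for _ in range(4)] for _ in range(4)]

original_order = shift_rows(sub_bytes(state, table))
swapped_order = sub_bytes(shift_rows(state), table)
assert original_order == swapped_order  # both orders give the same State

# Once ShiftRows is applied first, each column can be processed on its own.
rsma_inputs = slices(shift_rows(state))
```

After the swap, every remaining operation of the round (SubBytes, MixColumns, AddRoundKey) works on one column at a time, which is what makes the four 32-bit slices identical.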
The main idea of the proposed approach is to use one additional RSMA block, and to compare a pair of RSMA blocks at each clock cycle. In particular, at each clock cycle two blocks are fed by the same inputs and the related outputs are compared in order to detect possible faults. Figure 4 details the behavior of a part of the circuit where one extra RSMA block has been added. In this figure, LMux(2), LMux(3) and LMux(4) are multiplexers with an additional output that is asserted whenever the two inputs are equal (i.e., a multiplexer with a comparator).
[Figure 4: part of the circuit with one extra RSMA block. Labels: signals from the Control Unit um(1) to um(4), lm(0), lm(1); check(0) to check(4); multiplexers UMux and LMux.]
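The rotating comparison can be sketched in Python as follows. The schedule shown here is a hypothetical round-robin chosen only for illustration: it does not reproduce the multiplexer wiring of Figure 4, and the block and slice numbering is an assumption. It only preserves the two properties stated in the text: at each clock cycle two blocks receive the same inputs, and all four 32-bit slices are still processed.

```python
NUM_BLOCKS = 5   # four functional RSMA blocks plus the extra one
NUM_SLICES = 4   # four 32-bit slices of the 128-bit State


def configuration(cycle):
    """Return (assignment, compared_pair) for one clock cycle.

    assignment[b] is the index of the slice processed by physical block b.
    The two blocks of compared_pair process the same slice.
    """
    first = cycle % NUM_BLOCKS
    second = (first + 1) % NUM_BLOCKS
    assignment = {}
    next_slice = 0
    for offset in range(NUM_BLOCKS):
        block = (first + offset) % NUM_BLOCKS
        if block == second:
            assignment[block] = assignment[first]   # duplicated computation
        else:
            assignment[block] = next_slice
            next_slice += 1
    return assignment, (first, second)


def mismatch(outputs, pair):
    """Comparator: True when the two compared blocks disagree."""
    return outputs[pair[0]] != outputs[pair[1]]


# Count how often each block takes part in a comparison over 10 clock cycles.
comparisons = {b: 0 for b in range(NUM_BLOCKS)}
for cycle in range(10):
    _, pair = configuration(cycle)
    for b in pair:
        comparisons[b] += 1
```

With this schedule each block takes part in 4 comparisons over the 10 clock cycles of an encryption, i.e., twice every 5 clock cycles.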
Results and Fault Detection Analysis
This section provides results related to the area overhead and the fault detection analysis of the proposed approach. The proposed architecture has been described in VHDL and synthesized using Synopsys Design Compiler.
A classical Dual Modular Redundancy (DMR) architecture detects all faults (single and multiple) that lead to an error (i.e., a difference at the output of one of the duplicated modules). From the moment the fault appears, the fault latency depends only on the inputs applied to the circuit. In other words, the fault is detected as soon as the input vector sensitizes the fault and propagates it up to the output of the module (i.e., the input of the comparator between the two modules). In any case, a system based on the classical DMR scheme does not deliver faulty responses without noticing it (except in the case of equivalent faults in the two modules).
Our technique is able to detect any single or multiple fault leading to a wrong RSMA output value (as for the classical DMR), but only when the affected RSMA is compared with another one. Unlike DMR, the dynamic reconfiguration of the modules leads to a comparison of each module twice every 5 clock cycles. Therefore, the system may produce erroneous responses without noticing it, even in the presence of a single stuck-at fault.
We now examine the probability of getting an error on the AES output without detecting it. This probability can be analyzed by computing the probability $P_{err}(f)$ of not detecting an error on the circuit's outputs while a given fault $f$ affects the circuit. Here, only non-redundant faults are of interest, i.e., we focus on testable faults. For this analysis we consider single stuck-at faults only, because the number of multiple faults is too high to be analyzed. However, except for the extremely improbable case of a multiple fault composed of 5 equivalent faults in the 5 RSMA modules, all multiple faults are covered by our technique.
With regard to the proposed architecture, $P_{err}(f)$ is the probability that the fault $f$ is activated (i.e., sensitized and propagated in such a way that it leads to an error) during at least one of the 6 clock cycles in which the faulty RSMA is not compared, and is not activated during the 4 clock cycles in which the RSMA is compared.
Let $p_f$ denote the probability of activation of a fault $f$ in an RSMA module, i.e., the probability that, for a random input pattern, the fault is sensitized and the error is propagated to the module output. Under the hypothesis of several distinct functional inputs, we can consider that the device is fed by a random source. In addition, as demonstrated in [11], the inherent properties of the AES are such that the sequence of input values applied to consecutive rounds of the same encryption can be considered random. Therefore, the probability $p_f$ is equal to the ratio of input patterns that activate the fault to the total number of input patterns.
For the proposed architecture, the probability that $f$ is not activated during the 4 clock cycles of comparison is equal to
$$(1 - p_f)^4 \qquad (1)$$
while the probability that $f$ is activated during at least one of the 6 clock cycles without comparison is equal to
$$1 - (1 - p_f)^6 \qquad (2)$$
so that
$$P_{err}(f) = (1 - p_f)^4 \left[ 1 - (1 - p_f)^6 \right] \qquad (3)$$
In order to calculate the overall error probability, we simulated all the faults in the S-Boxes to determine the distribution of the activation probabilities of the faults. A total of 3860 stuck-at faults are present in our implementation. Basically, we counted how many faults are activated by one test pattern ($p_f = 1/256$), how many faults are activated by 2 patterns ($p_f = 2/256$), and so on. Figure 6 summarizes, for each probability $p_f$, the number $FD(p_f)$ of faults with that activation probability. Assuming that each fault has the same probability of appearing in the circuit, the overall error probability $P_{ERR\text{-}Sbox}$ of the S-Boxes is the average of $P_{err}$ over the fault distribution:
$$P_{ERR\text{-}Sbox} = \frac{1}{\#Faults_{Sbox}} \sum_{p_f} FD(p_f) \, P_{err}(p_f)$$
Concerning MixColumns and AddRoundKey, there are $\#Faults_{MCAK} = 750$ faults, all of them with $p_f = 50\%$. Therefore, the error probability $P_{ERR\text{-}MCAK}$ of MixColumns and AddRoundKey is calculated based on equation (3).
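The per-fault error probability defined above can be evaluated numerically with a few lines of Python. The sketch assumes that activations at different clock cycles are independent events with probability $p_f$, so that a fault escapes the 4 compared cycles with probability $(1 - p_f)^4$ and is activated in at least one of the 6 uncompared cycles with probability $1 - (1 - p_f)^6$. The function names and the toy fault distribution are hypothetical; the real distribution is the one of Figure 6.

```python
def p_err(p_f, compared=4, not_compared=6):
    """Probability that fault f corrupts an output without being detected."""
    missed = (1.0 - p_f) ** compared                 # silent when compared
    activated = 1.0 - (1.0 - p_f) ** not_compared    # active when unobserved
    return missed * activated


def overall_p_err(fault_distribution):
    """Average p_err over a {p_f: number_of_faults} distribution,
    assuming every fault is equally likely to occur."""
    total = sum(fault_distribution.values())
    weighted = sum(n * p_err(p) for p, n in fault_distribution.items())
    return weighted / total


# MixColumns and AddRoundKey: 750 faults, all with p_f = 50%
p_err_mcak = overall_p_err({0.5: 750})

# Made-up distribution, for illustration only (not the data of Figure 6)
toy_distribution = {1 / 256: 10, 64 / 256: 20, 128 / 256: 30}
p_err_toy = overall_p_err(toy_distribution)
```

Under these assumptions, `p_err(0.5)` evaluates to $0.5^4 \times (1 - 0.5^6) \approx 0.0615$. A fault that is never activated ($p_f = 0$) or always activated ($p_f = 1$) gives a zero error probability: the former never produces an error and the latter is always caught during a compared cycle.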
It follows that the overall error probability $P_{ERR}$ of the RSMA is 9.99%. The architecture thus has a probability of 90.01% of detecting any fault in the RSMA during a single encryption (10 clock cycles).
Let us now analyze how this probability evolves with the number of encryptions. When we perform $E$ encryptions, an RSMA block is compared during $4E$ clock cycles, while it is not compared during $6E$ clock cycles.
The error probability can therefore be rewritten as follows:
$$P_{err}(f) = (1 - p_f)^{4E} \left[ 1 - (1 - p_f)^{6E} \right]$$
Considering the fault distribution $FD$ given in Figure 6 and the probability of error detection in MixColumns and AddRoundKey, we can recalculate the overall error probability $P_{ERR}$ of the RSMA block as a function of the number of encryptions (Figure 7). As can be seen, the error probability increases slightly, up to 14% for 5 encryptions, while for higher numbers of encryptions it tends to 0.
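The rise-then-fall behavior can be reproduced for a single fault with the short Python sketch below. It assumes independent activations at each clock cycle, with $4E$ compared and $6E$ uncompared cycles after $E$ encryptions, as described above. It evaluates individual faults only, not the averaged curve of Figure 7, and the function names are hypothetical.

```python
def p_err_after(p_f, encryptions):
    """Undetected-error probability of one fault after E encryptions."""
    compared = 4 * encryptions        # cycles in which the block is compared
    not_compared = 6 * encryptions    # cycles in which it is not
    silent = (1.0 - p_f) ** compared
    active = 1.0 - (1.0 - p_f) ** not_compared
    return silent * active


def worst_case_encryptions(p_f, max_encryptions=1000):
    """Number of encryptions at which the error probability peaks."""
    return max(range(1, max_encryptions + 1),
               key=lambda e: p_err_after(p_f, e))


hard_fault = 1 / 256    # activated by a single S-Box input pattern
easy_fault = 0.5        # activated by half of the input patterns

peak_hard = worst_case_encryptions(hard_fault)
peak_easy = worst_case_encryptions(easy_fault)
```

In this model a hard-to-activate fault reaches its peak error probability after several tens of encryptions, whereas an easily activated fault peaks at the first encryption and is then quickly detected. The overall curve combines these individual behaviors, weighted by the fault distribution.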
The increase in the error probability from 0 to 5 encryptions can be explained by the fact that, at the beginning, the probability of exercising the faulty module with random test patterns increases more quickly than the probability of comparing the faulty module with a good one while exciting the fault. Since we focus on permanent faults, after a while (i.e., 5 encryptions) the probability of detecting the fault by comparison becomes predominant. For instance, for 300 encryptions, the fault detection probability is equal to 99.9%. [...] is a higher number of executed rounds. Therefore, we expect a lower number of encryptions to achieve the same error probability.
Differential Power Analysis
An important issue when dealing with cryptographic cores is the sensitivity of the architecture implementation to side-channel attacks, in particular to Differential Power Analysis (DPA). We focused on this attack because, among all the known attacks, it is one of the cheapest and easiest to perform.
Basically, the DPA attack is a statistical technique relying on the correlation that exists between the current consumed by the device and the processed data.
We now introduce some theoretical issues that allow the reader to understand the principle underlying the DPA attack.
Let's consider the output of a gate whose state depends on both the plain text under ciphering (primary inputs) and the secret key. It is called the target node.
Let's now consider a sequence of input patterns $P_0, P_1, \ldots, P_n$ that generate the transitions $T_1(P_0 \to P_1), T_2(P_1 \to P_2), \ldots, T_n(P_{n-1} \to P_n)$ on the circuit primary inputs.
A logic simulation of the circuit while monitoring the target node allows these input transitions to be classified into two sets, according to a guess on the key:
- PA, composed of the transitions that make the target node switch from 0 to 1 and therefore make the target gate consume power;
- PB, composed of the transitions that do not lead the target gate to contribute to the power consumed by the circuit (i.e., transitions from 0 to 0, 1 to 1, and 1 to 0 on the target node).
In other words, since the two sets are classified in such a way that set PA always leads to a component of power consumption that is not present in set PB, the difference between the mean powers computed from set PA and set PB must be noticeable. [...] Figure 9, where $K_x$ is assumed to be the correct key, the one actually used during ciphering.
Concerning the proposed architecture, we performed DPA on the base AES architecture and on the proposed redundancy-based solution using an in-house DPA simulator [12]. We found that the DPA attack is slightly more difficult to [...] correlation between the power consumption and the processed data; therefore, it does not make attacks based on Differential Power Analysis easier.
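The PA/PB classification principle described in this section can be illustrated with a small Python simulation. This is a toy model built on stated assumptions, not the in-house simulator [12]: the target node is bit 0 of a stand-in substitution table (not the AES S-Box) applied to the input XORed with an 8-bit key, the power samples are synthetic (one unit for each 0-to-1 transition of the target node plus Gaussian noise), and all names and parameter values are hypothetical.

```python
import random


def rises(table, key, previous, current, bit=0):
    """True if the modelled target node switches from 0 to 1."""
    before = (table[previous ^ key] >> bit) & 1
    after = (table[current ^ key] >> bit) & 1
    return before == 0 and after == 1


def dpa_difference(table, guess, patterns, power):
    """Mean power of set PA minus mean power of set PB for one key guess."""
    pa, pb = [], []
    for i in range(1, len(patterns)):
        predicted = rises(table, guess, patterns[i - 1], patterns[i])
        (pa if predicted else pb).append(power[i])
    if not pa or not pb:
        return 0.0
    return sum(pa) / len(pa) - sum(pb) / len(pb)


rng = random.Random(1)
table = list(range(256))
rng.shuffle(table)                 # stand-in non-linear table, NOT the S-Box
secret_key = 0x3C                  # key actually used by the simulated device
patterns = [rng.randrange(256) for _ in range(2001)]

# Synthetic power samples: power[i] belongs to the transition i-1 -> i.
power = [0.0]
for i in range(1, len(patterns)):
    consumed = 1.0 if rises(table, secret_key, patterns[i - 1], patterns[i]) else 0.0
    power.append(consumed + rng.gauss(0.0, 0.1))

# One difference-of-means value per key guess; the largest one wins.
scores = {guess: dpa_difference(table, guess, patterns, power)
          for guess in range(256)}
best_guess = max(scores, key=scores.get)
```

In this model only the correct guess sorts the transitions exactly as the device does, so its difference of means stays close to one unit of power while wrong guesses give much smaller values. Real measurements contain the consumption of the whole circuit, so the contrast is far less pronounced in practice.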
