Naturally occurring and maliciously injected faults reduce the reliability of the Advanced Encryption Standard (AES) and may leak confidential information. We developed an invariance-based concurrent error detection (CED) scheme that is independent of the implementation of AES encryption/decryption. Additionally, we improve the security of our scheme with Randomized CED Round Insertion and adaptive checking. Experimental results show that the invariance-based CED scheme detects all single-bit faults, all single-byte faults, and 99.99999997% of burst faults. The area and delay overheads of this scheme are compared with those of previously reported CED schemes on two Xilinx Virtex FPGAs. The hardware overhead is in the 13.2-27.3% range and the throughput is in the 1.8-42.2 Gbps range, depending on the AES architecture, FPGA family, and detection latency. Our scheme can be implemented in many ways; designers can trade off performance, reliability, and security according to the available resources.
INTRODUCTION
The Advanced Encryption Standard (AES) is used in a variety of applications, including smart cards, mobile phones, WWW servers, automated teller machines, and digital recorders. The decreasing cost of VLSI chips and increasing user throughput requirements make hardware implementation of AES necessary.
Faults that occur in VLSI chips are classified into two categories: transient faults that eventually die away and permanent faults. The origin of these faults could be internal phenomena in the system, such as threshold changes, shorts, opens, etc., or external influences, such as electromagnetic radiation. These faults affect the
Related Work
Previous work on CED can be classified into four types of redundancy: hardware, time, information, and combined redundancy.
Hardware redundancy: The function is duplicated, and faults are detected by comparing the outputs of the two copies.
Time redundancy: The function is computed twice on the same input and the results are compared. A variation of time redundancy is presented in [6], where the function is computed on both clock edges. This speeds up the computation. Under some conditions, this scheme allows the encryption to be computed twice without affecting the global throughput. However, this scheme is complex and delicate to implement as technology scales.
Information redundancy: Many fault detection schemes are based on error detecting codes. A few check bits are generated from the input message; they then propagate along with the input message and are finally validated when the output message is generated. The basic parity scheme is proposed in [3]. In this scheme, each predicted parity bit is generated from an input byte. Then, the predicted parity bits and the actual parity bits of the output are compared to detect faults. This scheme incurs a large hardware overhead. [7] proposes a scheme in which only a single parity bit is used for the entire 128-bit output, and the parity bit is checked once for the entire round. However, these two schemes only apply to lookup table (LUT)-based substitution-box (S-box) implementations. In [8], parity is obtained for S-box implementations that use logic gates. In [9], a general parity-based scheme is proposed to protect the S-box regardless of its implementation. All these parity schemes share the same limitation: if an even number of faults occur in the same byte, none of these schemes can detect them.
Combined redundancy: In [10], the authors consider CED at the operation, round, and algorithm levels for AES. In these schemes, an operation, a round, or the entire encryption and decryption are followed by their inverses, and the results are compared with the original input to detect faults. Although these schemes detect most faults, they require both encryption and decryption to be on chip and can suffer from more than 100% throughput overhead.
Contributions
We propose a low-overhead, implementation-independent, invariance-based CED scheme that uses combined redundancy to obtain a reliable AES implementation. The scheme achieves a fault detection capability close to that of [10] and hardware redundancy, the state-of-the-art countermeasures, but at much lower cost. Our contributions are as follows:
• We present the invariance-based CED with Randomized CED Round Insertion and adaptive checking for AES to improve security.
• We prove that the invariance-based CED scheme detects all single-bit and single-byte faults.
• The invariance-based CED achieves a fault miss rate that is lower by a factor of about $10^5$ than that of the best parity schemes for multiple burst and multiple random faults.
This paper is organized as follows: In Section 2, we introduce the AES algorithm. In Section 3, we propose the key idea of our invariance-based CED scheme, and we show the fault coverage, hardware overhead, and detailed analyses of the scheme.
ADVANCED ENCRYPTION STANDARD
In this paper, we consider 128-bit AES as specified by the National Institute of Standards and Technology (NIST). AES encrypts a 128-bit input plaintext into a 128-bit output ciphertext with a user key using 10 nearly identical rounds plus an initial special round (round 0). One AES encryption round consists of SubBytes, ShiftRows, MixColumns, and AddRoundKey, denoted by B, S, M, and A, respectively, as shown in Figure 1. In round 0, only AddRoundKey is performed, and in round 10, MixColumns is not performed. Each operation in every round acts on a 128-bit input state, where each state element is a byte in $GF(2^8)$. In this paper, each byte is denoted by $s_{r,c}$ ($0 \le r, c \le 3$), which indicates that this byte is in row $r$ and column $c$ of the state matrix:
$$\begin{bmatrix} s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\ s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\ s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\ s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3} \end{bmatrix} \tag{1}$$
In SubBytes, all the bytes are processed separately by 16 S-boxes (SBs). Each S-box performs a nonlinear transformation of the input byte. Let X be the input to the SBs. The resulting output is:
In ShiftRows, the rows of the state are shifted cyclically bytewise using a different offset for each row. Row 0 is not shifted, while rows 1, 2, and 3 are cyclically shifted to the left by 1 byte, 2 bytes, and 3 bytes, respectively. The resulting output is:
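$$Z = \begin{bmatrix} y_{0,0} & y_{0,1} & y_{0,2} & y_{0,3} \\ y_{1,1} & y_{1,2} & y_{1,3} & y_{1,0} \\ y_{2,2} & y_{2,3} & y_{2,0} & y_{2,1} \\ y_{3,3} & y_{3,0} & y_{3,1} & y_{3,2} \end{bmatrix}$$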
Let $f : \{0,1\}^{128} \rightarrow \{0,1\}^{128}$ denote an operation on the state space of Rijndael which operates completely on the Galois field $GF(2^8)$. A property $P \subseteq \{0,1\}^{128}$, $P \neq \emptyset$, is called an invariance of $f$ if $P$ is preserved by $f$, i.e., for every $x \in P$ it follows that $f(x) \in P$ [11].
INVARIANCES OF AES
In [11], the authors showed that AES exhibits various mapping invariance properties. They investigated these as a possible source of weakness in AES. In contrast, we use these invariances for CED to protect AES against random faults and malicious attacks. We analyze three round-level invariances of AES. We give a formal proof of the invariance property $P_1$, which is the most effective invariance according to our experimental results. We analyze the effectiveness of the remaining invariances in Section 4.
THEOREM 1. An AES round can be represented as
where $X$ is the 128-bit input to the round. There exists a byte permutation $P$ such that the following holds:
where $P^{-1}$ denotes the inverse function of $P$. One such byte permutation is:
PROOF. Let us start from the right-hand side of the equation. First, we apply the permuted input $X' = P_1(X)$ to SubBytes, and from (2) and (7), we get:
Given $Y'$ as the input to ShiftRows, we get:
$$Z' = \begin{bmatrix} y_{0,3} & y_{0,0} & y_{0,1} & y_{0,2} \\ y_{1,0} & y_{1,1} & y_{1,2} & y_{1,3} \\ y_{2,1} & y_{2,2} & y_{2,3} & y_{2,0} \\ y_{3,2} & y_{3,3} & y_{3,0} & y_{3,1} \end{bmatrix}$$
Applying $Z'$ to MixColumns, we get:
Then we apply the permuted round key matrix $K' = P_1(K)$, resulting in:
Finally, we apply the inverse permutation $P_1^{-1}$ to the output. We get:
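The invariance argued in this proof can be checked numerically. The sketch below is a minimal, unoptimized Python model of one AES round, not the hardware design evaluated in this paper. It assumes that $P_1$ is the cyclic column rotation $x'_{r,c} = x_{r,(c+3) \bmod 4}$, which is the reading suggested by the matrix $Z'$ in the proof and by the relation $v'_{i,j} = v_{i,(j+3) \bmod 4}$ used in the fault analysis. All function names are illustrative.

```python
import random

def xtime(a):
    """Multiply a GF(2^8) element by x, reducing by x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a

def gmul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def build_sbox():
    """Build the AES S-box: field inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gmul(inv, x)
        s = inv
        for k in range(1, 5):             # XOR with four left rotations
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

# The state is a 4x4 list of bytes indexed as state[row][column].
def sub_bytes(s):
    return [[SBOX[b] for b in row] for row in s]

def shift_rows(s):
    # Row r is rotated left by r bytes.
    return [row[r:] + row[:r] for r, row in enumerate(s)]

def mix_columns(s):
    out = [[0] * 4 for _ in range(4)]
    for c in range(4):
        col = [s[r][c] for r in range(4)]
        for r in range(4):
            out[r][c] = (gmul(2, col[r]) ^ gmul(3, col[(r + 1) % 4])
                         ^ col[(r + 2) % 4] ^ col[(r + 3) % 4])
    return out

def add_round_key(s, k):
    return [[a ^ b for a, b in zip(rs, rk)] for rs, rk in zip(s, k)]

def aes_round(x, k):
    return add_round_key(mix_columns(shift_rows(sub_bytes(x))), k)

def p1(s):
    """Assumed P1: new column c takes old column (c + 3) mod 4."""
    return [[row[(c + 3) % 4] for c in range(4)] for row in s]

def p1_inv(s):
    """Inverse of the assumed P1."""
    return [[row[(c + 1) % 4] for c in range(4)] for row in s]

rng = random.Random(2024)
X = [[rng.randrange(256) for _ in range(4)] for _ in range(4)]
K = [[rng.randrange(256) for _ in range(4)] for _ in range(4)]

v_c1 = aes_round(X, K)                       # normal cycle (C1)
v_c2 = p1_inv(aes_round(p1(X), p1(K)))       # permuted cycle (C2), un-permuted
print("C1 and C2 agree:", v_c1 == v_c2)
```

The check succeeds for any input because SubBytes and AddRoundKey act bytewise, MixColumns acts columnwise, and ShiftRows only rotates within rows, so each step commutes with a rotation of whole columns.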
3.1 Invariance-based CED Architecture
We design two invariance-based CED architectures for AES: Fully pipelined and iterative.
Fully pipelined: There are 11 stages in our pipelined architecture. For each pipeline stage in Figure 2(a), we add two muxes ($mux_x$ and $mux_k$) and a comparator (cmp). $P$ is a permutation of wires based on the invariance property, and $P^{-1}$ is the inverse permutation. We need two encryption cycles to detect faults; we call them C1 and C2. In C1, let the input and the key of round 1 be $X_1$ and $K_1$. We run the encryption with $mux_x$ and $mux_k$ selecting $X_1$ and $K_1$ as the inputs. The round result $V_1$ is stored in the data register. In C2, we run the encryption with $mux_x$ and $mux_k$ selecting the permuted inputs $X_1'$ and $K_1'$, respectively. At the end of C2, we inverse permute the output $V_1'$ and compare it with the value $V_1$ stored in the data register. If the results are equal, no fault is detected. Otherwise, the comparator asserts the fault indication flag. The comparator does not add delay to the critical path because the comparison can be performed while the next round input is being processed. C1 can be any normal encryption cycle, and C2 is the corresponding extra cycle, which selects the permuted inputs and is performed after C1. One can add a C2 after every R rounds; we call R the checking ratio. R can be changed based on the tradeoff between performance, reliability, and security specified by the designer. For a detailed analysis, please see Section 3.3.
Iterative: As shown in Figure 2(b), we add $mux_x$, $mux_k$, and a comparator. The iterative implementation has one security benefit. Each ciphertext takes 10 cycles to generate in an iterative architecture. If the designer chooses to add a C2 every ten cycles, a faulty ciphertext will not be sent to the output because the fault is detected before the ciphertext is generated. This further prevents an attacker from stealing the secret key. In the pipelined architecture, a ciphertext is generated every cycle. Therefore, if faults are injected before the comparison, the attacker can obtain faulty outputs.
Fault Analysis
Invariance-based CED detects all single-bit and single-byte faults. Our simulations show that this CED scheme detects 99.99999997% of multiple burst faults and close to 100% of multiple random faults. Fault coverage (FC) is calculated as:
where FMR is the fault miss rate, calculated as:
where $T_{undetected}$ is the number of tests in which faults are excited but not detected, $T_{total}$ is the total number of tests applied, and $T_{correct}$ is the number of tests in which the faults are not excited.
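The two defining equations are not reproduced in this copy of the text. One reading that is consistent with the three quantities defined above, offered here only as an assumption and not as the authors' exact formulas, is that the miss rate is taken over the tests in which a fault was actually excited:

```latex
\mathrm{FMR} = \frac{T_{\mathrm{undetected}}}{T_{\mathrm{total}} - T_{\mathrm{correct}}},
\qquad
\mathrm{FC} = (1 - \mathrm{FMR}) \times 100\%
```

Under this reading, tests in which the fault is never excited are excluded from the denominator, so they neither raise nor lower the reported coverage. Whether the original expresses FMR as a fraction or as a percentage cannot be determined from the surviving text.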
100% Fault Coverage of Single-Bit Faults
THEOREM 2. If a single-bit fault in any of the steps in a round affects the outputs of the final result of that round, our scheme will detect it.
PROOF. Case 1: A single-bit fault in S-box (SB). In Figure 1, let $SB_{i,j}$ ($0 \le i, j \le 3$) have a single-bit stuck-at fault. If the SBs are implemented using ROMs, the considered fault corresponds to an address fault of the ROM, a fault in a memory location, or a fault in the output data lines. If the SBs are implemented using combinational logic, the considered fault can appear in any gate of the implementation.
In C1, $SB_{i,j}$ generates the faulty output $y_{i,j}$. After ShiftRows, the outputs are $[z_{r,c}]$. Therefore, the faulty column in C1 is $(j - i) \bmod 4$, but the faulty column in C2 corresponds to column $(((j - i) \bmod 4) + 3) \bmod 4$ in C1; note that $0 \le i, j \le 3$ and $(j - i) \bmod 4 \neq (((j - i) \bmod 4) + 3) \bmod 4$.
Since the faulty $SB_{i,j}$ affects different columns in C1 and C2, we always compare a faulty column with a nonfaulty column, and our scheme detects the fault as long as it affects the output. For a concrete example, let $SB_{1,2}$ be faulty; thus, $y_{1,2}$ is the faulty output byte. After ShiftRows, $z_{1,1} = y_{1,2}$. After MixColumns, the faulty state elements are shown as $[u_{r,1}]$. The faulty columns in C1 and C2 are different, and we detect the fault by comparing the outputs.
Figure 3: Simulation results show that the fault miss rate of the invariance-based CED is superior to that of the parity scheme in [9]. (Vertical axis: fault miss rate. Series: parity [7] for MixColumns and AddRoundKey; parity [7] for SubBytes and ShiftRows; invariance-based CED ($P_1$).)
Case 2: A single-bit fault in ShiftRows. A fault in ShiftRows is equivalent to a fault at the input of MixColumns, and thus, we prove this in Case 3.
Case 3: A single-bit fault in MixColumns. Since MixColumns is mainly implemented with XOR and a few other basic gates, we consider a fault in MixColumns in three scenarios: the input, the internal logic gates, and the output. If there is a stuck-at fault in the input, the fault will propagate to all four bytes in the same column. Assuming that column $[u_{r,j}]$ is faulty, from (12) and (14) we know that $j \neq (j+3) \bmod 4$. Because the faulty columns of the two outputs are different, we detect the fault.
Case 4: A single-bit fault in AddRoundKey. AddRoundKey is mainly implemented as bitwise XOR gates. We consider a fault in AddRoundKey as a stuck-at fault in the input or output. A fault at the input is equivalent to a fault in the output of MixColumns. Let us prove the theorem for a single-bit fault at the output of AddRoundKey. Let the faulty bit affect byte $v_{i,j}$ in C1 and $v'_{i,j}$ in C2. From (14), we know that $v'_{i,j} = v_{i,(j+3) \bmod 4}$. Again, the faulty columns in C1 and C2 are different, and thus we detect the fault.
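The column argument that recurs in all four cases can be checked exhaustively over the 16 S-box positions. The short sketch below only restates the index arithmetic from Case 1 (faulty column $(j - i) \bmod 4$ in C1 versus $(((j - i) \bmod 4) + 3) \bmod 4$ for C2); the function names are illustrative and nothing here models the actual gates.

```python
def faulty_column_c1(i, j):
    """Column hit in C1 by a fault in the S-box at row i, column j."""
    return (j - i) % 4

def faulty_column_c2(i, j):
    """Column of the C1 result that the same fault corresponds to in C2."""
    return (faulty_column_c1(i, j) + 3) % 4

# Collect every S-box position for which both cycles would hit the same column.
collisions = [(i, j) for i in range(4) for j in range(4)
              if faulty_column_c1(i, j) == faulty_column_c2(i, j)]
print(collisions)  # prints []
```

An empty list means that the comparison always pairs a faulty column with a fault-free one, which is the property the proof relies on: adding 3 modulo 4 never maps a column index to itself.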
100% Fault Coverage for Single-Byte Faults
Recent experiments have shown that high-energy ions can energize two or more adjacent memory cells in a circuit [12]. Because an attacker can use light [13] or electromagnetic radiation [14] to inject faults, this is a realistic attack model. We define a single-byte fault as a fault that affects at most a single byte, i.e., it can affect from one bit up to eight bits of the byte.
THEOREM 3. If multiple faults in any of the processing steps in a round affect a byte quantity, the invariance scheme will detect it.
PROOF. We prove this theorem for a single-byte fault in either S-box, ShiftRows, MixColumns, or AddRoundKey.
Case 1: A single-byte fault in S-box (SB). Let multiple faults in an S-box affect one output byte $y_{i,j}$ in C1. From the proof of our Theorem 1, after AddRoundKey, the faulty state elements are shown as
In order to have a 100% single-byte fault detection rate, one needs hardware redundancy or combined redundancy [10], both of which have more than 100% hardware overhead. Our fault simulation confirms that the invariance-based CED detects all single-bit and single-byte faults.
Fault Coverage for Multiple Faults
We simulated multiple faults for the invariance scheme and compared it with the one proposed in [9] . These models cover both natural faults and fault attacks [15] .
Due to technology constraints, an attacker may not be able to inject a single-bit fault [15]. Multiple faults are actually injected in the process. We use burst and random fault models [15]. We use a Fibonacci Linear Feedback Shift Register (LFSR) with 128-bit output taps to inject faults. The maximum sequence length polynomial for the LFSR is selected as $L(X) = X^{128} + X^{29} + X^{27} + X^{2} + 1$. Burst faults occur at the output of only one operation at a time, i.e., the faults are injected at the 128-bit output of one operation in the AES encryption. This includes both stuck-at-zero and stuck-at-one faults. The size, location, and type of the burst are randomly generated.
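The fault injector described here can be modeled in software as follows. This is a sketch under stated assumptions, not the authors' test bench: the tap convention (stage $t$ held in bit $t-1$, feedback shifted into stage 1), the way LFSR bits are mapped to burst size, position, and type, and the reading of a burst as a contiguous run of stuck-at bits are all choices made for illustration. Only the polynomial $L(X)$ comes from the text.

```python
MASK128 = (1 << 128) - 1
TAPS = (128, 29, 27, 2)   # exponents of L(X) = X^128 + X^29 + X^27 + X^2 + 1

def lfsr_step(state):
    """Advance the 128-bit Fibonacci LFSR by one bit (assumed tap convention)."""
    feedback = 0
    for t in TAPS:
        feedback ^= (state >> (t - 1)) & 1
    return ((state << 1) | feedback) & MASK128

def lfsr_bits(state, n):
    """Draw n pseudo-random bits; return (value, new_state)."""
    value = 0
    for _ in range(n):
        state = lfsr_step(state)
        value = (value << 1) | (state & 1)
    return value, state

def inject_burst(word, state):
    """Force a contiguous run of bits in a 128-bit word to 0 or to 1.

    The size (1..128), start position, and stuck-at type are all drawn
    from the LFSR. Returns (faulty_word, new_state).
    """
    size, state = lfsr_bits(state, 7)
    size = size % 128 + 1
    pos, state = lfsr_bits(state, 7)
    pos %= 128 - size + 1
    stuck_at_one, state = lfsr_bits(state, 1)
    mask = ((1 << size) - 1) << pos
    faulty = (word | mask) if stuck_at_one else (word & ~mask)
    return faulty & MASK128, state

seed = 0x0123456789ABCDEF0123456789ABCDEF   # any nonzero seed works
operation_output = 0x00112233445566778899AABBCCDDEEFF
faulty_output, seed = inject_burst(operation_output, seed)
print(hex(operation_output ^ faulty_output))  # bits changed by the burst
```

A stuck-at fault only changes bits that did not already hold the forced value, so a burst can leave the word unchanged. Such a test counts as one in which the fault is not excited.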
The results of our simulations for the burst faults in the AES encryption are shown in Figure 3. We compare our miss rate with that of [9]. The dot-dash line represents the fault miss rate of SubBytes and ShiftRows in the AES encryption in [9]. The dashed line represents the fault miss rate of MixColumns and AddRoundKey in the AES encryption in [9]. The solid line represents the fault miss rate of the invariance-based CED. We injected up to 700,000 burst faults at the operation outputs and monitored the faults that were detected by the fault indication flag. The fault miss rate for [9] is between $10^{-2}$ and $10^{-3}$. The miss rate of our scheme is between $10^{-7}$ and $10^{-8}$, a reduction by a factor of about $10^5$. Random faults are injected at random locations, i.e., at the four 128-bit outputs of the operations. In another simulation, we saw a 0% fault miss rate after injecting up to 700,000 random faults.
Comparison of Fault Coverage
As shown in Table 1, for single-bit and single-byte faults, the invariance-based CED provides 100% fault coverage, the same as [10] and hardware redundancy. Note that [10] requires both encryption and decryption to be on chip to achieve such fault coverage. While most of the parity schemes achieve 100% fault coverage for single-bit faults, they can only provide 50% fault coverage for single-byte faults [8]. For multiple faults, [10] and hardware redundancy provide 100% fault coverage. The invariance-based CED provides 99.99999997% fault coverage, which is much higher than that of the parity-based schemes. The tradeoff between performance and detection latency can be explored by varying the checking ratio R, which is the ratio of the number of results computed without invariance to the number of results computed with invariance-based CED.
Security Analysis
In this section, we propose two policies that designers can employ to further secure their design.
Table 2: Comparison of implementations of CED schemes on two Xilinx FPGAs. We use the metrics, FPGA platform, and results from [9]. Our pipeline implementations are shown in bold, and we also implement the iterative architectures.
a. The latency is 2x that of the original AES encryption/decryption.
b. Using two (256 × 9) distributed memories for CED of each S-box or inverse S-box.
c. Using (256 × 9) distributed memories for CED of each S-box or inverse S-box.
d. Checking ratio is 11.
e. Checking ratio is 10.
f. Checking ratio is 1.
Randomized control: The invariance-based CED technique cannot detect faults that are not excited in the C1 and C2 rounds. If an attacker determines the architecture of the AES with the proposed CED implementation, this feature can be exploited as a weakness: faults can be inserted in such a way that they do not exist during the CED rounds but only during the normal rounds. To prevent this, we propose using Randomized CED Round Insertion (RCRI). In this method, the positions of the CED rounds C1 and C2 are randomized during the 11-round AES encryption process for the pipelined architecture. This can be implemented as shown in the state diagram of Figure 4. A random number can be obtained using the randomness property of the AES algorithm. For example, a Rand register can be incorporated into the circuit with some random number stored in it at manufacture time, and for every subsequent encryption performed, the resulting ciphertext is XORed with Rand to get a Temp number. When an encryption is performed, the algorithm enters the normal execution state. Normal encryption rounds are performed until the value of Temp modulo 11 equals the round number. Once this condition is satisfied, the CED round C1 is performed. Depending on whether 11 normal rounds have been performed, either C2 or the remaining normal rounds are performed. The encryption process is complete when 11 normal rounds and the randomly inserted C1 and C2 CED rounds are complete. For the mod operation, we can take the last four bits of Temp and apply them to a lookup table that contains the modulo 11 results for inputs 0 to 15. For the iterative architecture, since the encryption takes 10 rounds, we use Temp modulo 10.
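The position selection in RCRI reduces to an XOR followed by a 16-entry table lookup. The sketch below illustrates that step only. The names `MOD11_LUT`, `MOD10_LUT`, and `ced_round_position` are hypothetical, the register values are made-up examples, and the state machine of Figure 4 is not modeled.

```python
# Lookup tables indexed by the last four bits of Temp (inputs 0..15).
MOD11_LUT = [v % 11 for v in range(16)]   # pipelined: 11 rounds (0..10)
MOD10_LUT = [v % 10 for v in range(16)]   # iterative: 10 rounds

def ced_round_position(ciphertext, rand, iterative=False):
    """Return the round at which the CED round C1 is inserted.

    ciphertext: 128-bit result of an earlier encryption (as an int)
    rand:       value of the Rand register (as an int)
    """
    temp = ciphertext ^ rand              # Temp = ciphertext XOR Rand
    lut = MOD10_LUT if iterative else MOD11_LUT
    return lut[temp & 0xF]                # only the last four bits are used

rand_register = 0x5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A   # example value only
previous_ciphertext = 0x3925841D02DC09FBDC118597196A0B32
print(ced_round_position(previous_ciphertext, rand_register))
```

One property of this shortcut is worth noting: sixteen table inputs map onto eleven outputs, so positions 0 to 4 are selected twice as often as positions 5 to 10 when the four Temp bits are uniform. A designer who needs a uniform position would have to use more bits or a different reduction.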
Adaptive checking: If the designer chooses R > 1, some of the transient faults injected by the attacker may go undetected. However, an attacker needs to obtain a large number of faulty outputs to steal the secret key. We suggest that the designer deploy an adaptive approach. When a fault is detected, the chip changes its checking ratio from the R specified by the designer to 1, and thus almost all faults will be detected. If faults are detected in a number of consecutive checks, the chip can stop functioning or self-destruct.
If R = 1, all results are checked. The fault miss rate remains the same for permanent and transient faults. If R > 1, every R-th result is checked. Let us assume that the transient faults appear for N cycles. When $R \le N$, the fault coverage remains the same, because the results of C1 and C2 will be checked before the faults disappear. When $R > N$, the probability of detecting a single-bit or single-byte fault is $\frac{N}{R} \times 100\%$ and that of detecting multiple burst faults is $\frac{N}{R} \times 99.99999997\%$.
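The two regimes above ($R \le N$ and $R > N$) collapse into a single expression, which the following sketch evaluates. The function name and its `coverage` parameter are illustrative; the coverage values are the ones quoted in the text.

```python
def detection_probability(r, n, coverage=1.0):
    """Probability of detecting a transient fault that lasts n cycles.

    r:        checking ratio (one check every r results), r >= 1
    n:        number of cycles for which the transient fault is present
    coverage: coverage when the fault is present during a check, e.g.
              1.0 for single-bit and single-byte faults or 0.9999999997
              for multiple burst faults
    """
    if r < 1 or n < 0:
        raise ValueError("require r >= 1 and n >= 0")
    # If r <= n a check always falls inside the fault window; otherwise
    # only a fraction n / r of the windows contain a check.
    return coverage * min(1.0, n / r)

print(detection_probability(r=1, n=3))
print(detection_probability(r=11, n=3))
print(detection_probability(r=11, n=3, coverage=0.9999999997))
```

This makes the motivation for adaptive checking concrete: with a large R, a short-lived fault is likely to slip past any single check, so dropping R to 1 after the first detection closes the window for the attacker's subsequent injections.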
Implementation and Comparison
The implementation results shown in Table 2 are all post-place-and-route. We implement fully pipelined and iterative architectures. We use pipelined distributed memories for S-boxes and inverse S-boxes, similar to [9]. Hardware redundancy, information redundancy [3, 7, 9], combined redundancy [10], and invariance-based CED are compared. The metrics include (1) slice utilization (the number of occupied slices), (2) slice overhead (the ratio of the number of slices for the CED scheme to the number of slices for AES), (3) maximum clock frequency, (4) throughput, and (5) efficiency (the ratio of (4) to (1)).
For the pipeline architecture, the hardware overhead of the invariance-based CED is much lower than that of hardware redundancy for both encryption and decryption. The invariance-based CED provides flexible throughput. If the designer checks all rounds, the checking ratio R is 1 and the throughput overhead is 100%. If the designer performs one check per ciphertext, the checking ratio R is 11 for the pipeline architecture and 10 for the iterative architecture; the throughput overhead is then 1/11 for the pipeline architecture and 1/10 for the iterative architecture. If R is large, the throughput is essentially unaffected. The invariance-based CED has higher efficiency than hardware redundancy when the checking ratio is 11. Because the scheme in [7] uses a 1-bit signature for each 128-bit block of data, it has lower hardware overhead and higher efficiency than the invariance-based scheme. However, from Table 1, the fault coverage of this scheme is the lowest. [3] and [9] use 16 bits for each 128-bit block, which leads to much higher fault coverage. The invariance-based CED has much smaller hardware overhead and higher efficiency than [3], and it also provides higher fault coverage. Another limitation of [3] and [7] is that they are only applicable to S-box implementations using LUTs. On Virtex-5, the invariance-based CED has higher efficiency than most of the CED schemes except [9]. Although the invariance-based CED has approximately the same hardware overhead as [9], it detects all single-byte faults and lowers the fault miss rate of multiple burst faults by a factor of about $10^5$. The schemes in [10] are only applicable when both encryption and decryption are on the same chip. Therefore, if only encryption or decryption is on chip, the hardware overhead of [10] is in the 49.7-108.7% range [9], e.g., 108.7% for AES encryption on the Virtex-4 FPGA. If both encryption and decryption are on the same chip, the hardware overhead of [10], which comes from the comparator, is very low.
For Virtex-5, the overhead of this scheme is -14.8%, because the slice utilization of this scheme is smaller than the total slice utilization of encryption and decryption. However, the efficiency of the invariance-based CED is higher than that of [10] . Most importantly, the invariance-based CED and all other CED schemes do not require both encryption and decryption to be on chip.
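The throughput figures quoted for the different checking ratios in this comparison follow from simple cycle counting, sketched below. The helper names are hypothetical, and the effective-throughput formula is an assumption: it takes "throughput overhead" to mean the extra cycles spent on C2 relative to normal cycles, so that one C2 is added per R normal results.

```python
from fractions import Fraction

def throughput_overhead(r):
    """Extra cycles per normal cycle when one C2 follows every r results."""
    if r < 1:
        raise ValueError("checking ratio must be at least 1")
    return Fraction(1, r)

def effective_throughput(base_gbps, r):
    """Assumed throughput after inserting one C2 cycle per r results."""
    return base_gbps * r / (r + 1)

for r in (1, 10, 11):
    print(r, throughput_overhead(r), effective_throughput(10.0, r))
```

With R = 1 every result costs two cycles, so the sketch halves the base throughput, which matches the 100% overhead stated for checking all rounds. As R grows, the factor R/(R+1) approaches 1, in line with the remark that a large R leaves the throughput unaffected.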
For the iterative architecture, since round 0 is performed in the same clock cycle as round 1, an extra delay is added to the critical path. The hardware overhead of the invariance-based CED, at 16.7-27.3%, is slightly higher than that of the pipeline architecture on Virtex-5.
CONCLUDING REMARKS
There are several other mapping invariances that can be used for CED [11]. Most of them restrict the pattern of the inputs and thus are not effective when realistic random inputs are provided. However, there are two other invariances that allow us to perform CED on any input: $P_2$ swaps the first and third columns and also the second and fourth columns of the state matrix. $P_3$ is the same as $P_1$ except that the initial input $X$ is permuted by $P_{pre}$ before being applied to the input. The fault miss rates for CED using invariances $P_1$, $P_2$, and $P_3$ are compared in Figure 5. The fault miss rate dropped sharply to below $10^{-7}$ after we injected 11 faults using $P_1$ and $P_3$, and 15 faults using $P_2$. After we injected 15 faults, the fault miss rate was very low and is thus not shown in Figure 5. The fault miss rates of invariances $P_1$ and $P_3$ are lower than the fault miss rate of invariance $P_2$ for any number of faults injected. Compared with $P_2$, the area and performance overheads of $P_1$ are the same. For $P_3$, one first needs to apply $P_{pre}(X)$ as the input to run C1 and store the result $V_{pre}$. Then one uses $P_1(P_{pre}(X))$ as the input to run C2 and applies $P_1^{-1}$ to the result $V'_{pre}$ to compare it with $V_{pre}$. Both C1 and C2 are then extra performance overhead. In contrast, C2 is the only performance overhead for $P_1$. Therefore, $P_1$ is the most effective invariance.
ACKNOWLEDGMENTS
This material is based upon work supported by the NSF CNS program under grant 0831349.
