Abstract. Because of the rapidly shrinking dimensions in VLSI, transient and permanent faults arise and will continue to occur in the near future in increasing numbers. Since cryptographic chips are a consumer product produced in large quantities, cheap solutions for concurrent checking are needed. Concurrent Error Detection (CED) for cryptographic chips also has a great potential for detecting (deliberate) fault injection attacks where faults are injected into a cryptographic chip to break the key. In this paper we propose a low cost, low latency, time redundancy based CED technique for a class of symmetric block ciphers whose round functions are involutions. This CED technique can detect both permanent and transient faults with almost no time overhead. A function F is an involution if F(F(x))=x. The proposed CED architecture (i) exploits the involution property of the ciphers and checks if x=F(F(x)) for each of the involutional round functions to detect transient and permanent faults and (ii) uses the idle cycles in the design to achieve close to a 0% time overhead. Our preliminary ASIC synthesis experiment with the involutional cipher KHAZAD resulted in an area overhead of 23.8% and a throughput degradation of 8%. A fault injection based simulation shows that the proposed architecture detects all single-bit faults.
Introduction
Because of the rapidly shrinking dimensions in VLSI, faults arise and will continue to occur in the near future in increasing numbers. Faults can broadly be classified in to two categories: Transient faults that die away after sometime and permanent faults that do not die away with time but remain until they are repaired or the faulty component is replaced. The origin of these faults could be due to the internal phenomena in the system such as threshold change, shorts, opens etc. or due to external influences like electromagnetic radiation. The faults could also be deliberately injected by attackers in order to extract sensitive information stored in the system. These faults affect the memory as well as the combinational parts of a circuit and can only be detected using Concurrent Error Detection (CED). This is especially true for sensitive devices such as cryptographic chips. Hence, CED for cryptographic chips is growing in importance. Since cryptographic chips are a consumer product produced in large quantities, cheap solutions for concurrent checking are needed. CED for cryptographic chips also has a great potential for detecting (deliberate) fault injection attacks where faults are injected into a cryptographic chip to break the key [1 2 3 4] . A Differential Fault Attack technique against AES and KHAZAD was discussed in [5] . It exploited the faults in bytes. It was shown that if the fault location could be chosen by the attacker, it required only 2 ciphertexts for a successful attack. CED techniques could help prevent such kind of attacks.
Until now some of the following CED methods for cryptographic algorithms are known. In [6] a CED approach for AES and other symmetric block ciphers that exploits the inverse relationship between the encryption and decryption at the algorithm level, round level and individual operation level was developed. This technique had an area overhead of 21% and a time overhead up to 61.15%. In [7] this inverserelationship based technique was extended to AES round key generation. A drawback of this approach is that it assumes that the cipher device operates in a half-duplex mode (i.e. either encryption or decryption but not both are simultaneously active). In [8] a parity-based method of CED for encryption algorithms was presented. The technique adds one additional parity bit per byte resulting in 16 additional bits for the 128-bit data stream. Each of the sixteen 8-bit×8-bit AES S-Boxes is modified into 8-bit×9-bit S-Boxes. In addition, this technique adds an extra parity bit per byte to the outputs of the Mix-Column operation because Mix-Column does not preserve parity of its inputs at the byte-level.
In this paper, we propose a low cost, low latency CED technique for involution-based symmetric block ciphers. Any function F is an involution if F(F(x))=x. An involutional symmetric block cipher is one in which each round operation of the cipher is an involution. Usually, in symmetric block ciphers, the decryption operation differs significantly from encryption. Although it is possible to implement decryption in such a way that it has the same sequence of operations as encryption, round level operations such as S-Box etc are different. With involutional ciphers, all the operations are involutions. So, it becomes possible to implement the decryption in such a way that only the round keys used are different from that for encryption. Besides the implementation benefit, an involutional structure also implies equivalent security for both encryption and decryption [9] .
The CED technique we propose exploits the involution property of the round operations of this class of symmetric block ciphers and checks if x=F(F(x)) for each of the involutional round functions to detect the faults in the system. It offers optimum trade-off between area and time overhead, performance and fault detection latency. Further, it requires minimal modification to the encryption device and is easily applicable to all involution-based symmetric block ciphers. Traditionally, time redundancy based CED schemes cannot detect permanent faults but usually entail >100% time overhead. Although the CED scheme we propose is time redundancy based, it entails almost zero time overhead.
The paper is organized as follows. In section 2 we will recapitulate Involutionbased symmetric block ciphers. In section 3 we will describe the involution based CED architecture and the error detection capability of the proposed method. To obtain the area overhead and the additional time delay, we modeled the method using VHDL and synthesized using cadence ASIC synthesis tool PKS. The results of this implementation are presented in section 4. The error detection capabilities of the proposed methods are discussed in section 5 and finally, Conclusions are reported in section 6.
Involutional Block Ciphers
A substitution permutation network (SPN) symmetric block cipher is composed of several rounds and each round consists of a non-linear substitution layer (S-Box), a linear diffusion layer and a layer to perform exclusive-or operation with the round key. The linear diffusion layer ensures that after a few rounds all the output bits depend on all the input bits. The nonlinear layer ensures that this dependency is of a complex and nonlinear nature. exclusive-or with the round key introduces the key material. Recently, several SPN symmetric block ciphers that are composed of involution functions have been proposed and analyzed [10, 11, 12, 13, 14] . In an involutional SPN cipher, the non-linear S-Box layer and the linear diffusion layer are involutions. An advantage of using involutional components is that the encryption data path is identical to the decryption data path. In an involutional SPN cipher, the round key generation algorithm for decryption is the inverse of the round key generation algorithm for encryption.
In this paper we will consider the 64-bit involutional SPN cipher KHAZAD [10] shown in Figure 1 as a running example. The involutional SPN cipher KHAZAD uses seven identical encryption/decryption rounds (?) with each encryption/decryption round composed of an involutional non-linear byte substitution ? (i.e. ?(?(x)) = x), an involutional linear diffusion layer ? (i.e. ?(?(x)) = x) and exclusive-or with key s which is an involution as well (i.e. s(s(x)) = x). 
The non-linear substitution function γ
The involutional non-linear substitution layer ? of KHAZAD uses eight 8x8 involutional S-Boxes, each of which is constructed from three layers of 4x4 involutional PBoxes and Q-Boxes as shown in Figure 2 . The truth tables for the P-Box and the Q-Box are given in Table 1 . Observe that the P-Box and the Q-box are also involutions. We implemented the P-Box and Q-box as ROMs and the ? layer uses a total of 24 P-Boxes and 24 Q-Boxes. Table 1 . 4x4 involutional P-Box and Q-Box
The linear diffusion layer ?
The diffusion layer ? is a linear mapping based on the [16, 8, 9 ] MDS code with generator matrix GH = [I H], and H = had(01x, 03x, 04x, 05x, 06x, 08x, 0bx, 07x) i.e. ? (a) = b ó b = a × H, where 03x  04x  05x  06x  08x  0Bx  07x   03x  01x  05x  04x  08x  06x  07x  0Bx   04x  05x  01x  03x  0Bx  07x  06x  08x   05x  04x  03x  01x  07x  0Bx  08x  06x   06x  08x  0Bx  07x  01x  03x  04x  05x   08x  06x  07x  0Bx  03x  01x  05x  04x   0Bx  07x  06x  08x  04x  05x  01x  03x   07x  0Bx  08x  06x  05x  04x  03x  01x H = The H matrix is symmetric and unitary and therefore ? is an involution. For efficient hardware implementation the ? layer was described in [15] as follows: 
where a 7 -a 0 and b 7 -b 0 are the eight input bytes and the eight output bytes of the diffusion layer ?. Assuming that a(7),a(6),a(5),a(4),a(3),a(2),a(1) and a(0) are the eight bits in an input byte a, the three byte level functions X(a), X 2 (a)=X(X(a)) and X 3 (a)=X(X(X(a))) can be defined as follows: 
It can be seen that if bit a (7) is "0" then function X(a) reduces to a single bit left rotate. Similarly, the function X 2 (a) reduces to a 2-bit left rotate when a(7) and a(6) are "0" and finally, the function X 3 (a) reduces to a 3-bit left rotate when a(7), a(6) and a(5) are "0".
The exclusive -or function σ
The key addition s layer consists of bitwise exclusive-or of the 64-bit round-key K r with the input to the module. Hence, s is also an involution.
We will first describe a simple CED scheme that exploits the involution property of any involution function and then extend it to the involutional SPN cipher KHAZAD. If a hardware module implements a function F that satisfies the involution property, faults in this module can be detected by checking if x=F (F(x) ). At the beginning of operation, the input x is buffered. The output of the hardware module implementing F is fed back to the input of the module F. The result is compared to the original input x and a mismatch indicates an error. Figure 3 shows this basic idea behind CED scheme for an involutional function.
Fig. 3. CED scheme for an involution function
We will show how this CED scheme can be incorporate into a non-pipelined architecture for KHAZAD. Since symmetric block ciphers are frequently used in one of three feedback modes: Cipher Block Chaining (CBC), Output FeedBack (OFB) or Cipher Feedback (CFB), such a non-pipelined architecture is indeed reasonable. The proposed scheme works even in a non-feedback Electronic Code Book (ECB) mode to facilitate pipelining, with an appropriately modified architecture. If a pipelined architecture is implemented, the scheme can result in increased throughput at the cost of a little extra hardware compared to the non-pipelined version owing to a more complex controller. Also, ECB mode is not popularly implemented. Hence, we will consider a non-pipelined version. In the non-pipelined KHAZAD architecture, each KHAZAD round ? takes three clock cycles to finish with round operation ?, round operation ? and round operation s completing in clock cycles 1, 2 and 3 respectively. 
CED Scheme 2
As a first modification to the above CED scheme, we propose to integrate involution based CED shown in Figure 4 into the non-pipelined KHAZAD (Scheme 2) as follows. Round operation ? is performed in clock cycle 1 on input x followed by the corresponding CED operation ?(?(x)) in clock cycle 2. If x= ?(?(x)) (i.e. no fault is detected in module ?), round operation ? is performed in clock cycle 3 on ?(x) followed by the corresponding CED operation ?(?(?(x))) in clock cycle 4. If ?(x)= ?(?(?(x))) (i.e. no fault is detected in module ?) then round operation s is performed in clock cycle 5 on ?(?(x)) followed by the corresponding CED operation s(s(?(?(x)))) in clock cycle 6. If ?(?(x))= s(s(?(?(x)))) (i.e. no fault is detected in module s) one KHAZAD round ? is successful. This modification can detect permanent faults in addition to the transient faults. This is because, although the same mo dule is used twice, the data that it is operating on is different in each case. This was possible due to the involution property of the modules. The time overhead of this modified time redundancy based CED is still >100%.
CED Scheme 3
During a complete KHAZAD encryption/decryption, round operation ? is busy in clock cycles 1, 4, 7 … and idles in clock cycles 2, 3, 5, 6, 8, 9… Similarly, round operation ? is busy in clock cycles 2, 5, 8 ... and idles in clock cycles 1, 3, 4, 6, 7… Finally, round operation s is busy in clock cycles 3, 6, 9 … and idles in clock cycles 1, 2, 4, 5, 7, 8 … The involution based CED scheme in Figure 4 can be adapted to the nonpipelined KHAZAD architecture to exploit these idle clock cycles as follows (Scheme 3): Round operation ?(x) is performed in clock cycle 1. The corresponding CED operation for ?(x) i.e., ?(?(x)) is performed in clock cycle 2 concurrent with the round operation ?(?(x)). If x= ?(?(x)) then no fault was detected in module ? and hence no errors are reported. The corresponding CED operation for ?(?(x)) i.e., ?(?(?(x))) is performed in clock cycle 3 concurrent with round operation s(?(?(x))). If ?(x)= ?(?(?(x))) then no fault was detected in module ?. At this point, one KHAZAD round ? is completed only in 3 clock cycles in contrast to the 6 cycles consumed by the two other schemes described above. Now, in clock cycle 4, the corresponding CED operation for s(? (?(x))) i.e., s(s (?(?(x))) ) is performed concurrent with the round operation ?(y) where y is the input to the second round of the KHAZAD encryption/decryption given by y= s (?(?(x) )). If s(s (?(?(x) )))= ?(?(x)) then no fault was detected in module s. The comparisons between the 3 schemes are presented in Table 2 . Table 2 . Comparison between the three CED schemes during the first six clock cycles
As explained above, Scheme 3 uses idle clock cycles to re-compute the round operations on the corresponding round outputs by feeding back the output as the input. The result obtained is compared to the original input value stored in the buffer. These two values should be equal since every round operation is an involution. If they are not equal, an error is reported. As seen from Table 2 , this time redundancy based CED method (Scheme 3) entails almost no time overhead because one round operation is completed per clock cycle. Another inherent advantage of the proposed CED method is that we can detect permanent faults in the system even though the faults might not affect the output, i.e. are not activated by current inputs. Consider a situation where a faulty bit is stuck-at-1 and the output at that bit was supposed to be logic '1'. Now, although the output is correct, the fault in the system will be detected because the involution will not yield the correct result. This enhances the security of the implementation since any attempts to clandestinely attack the algorithm by an external agent can be detected. This also improves the overall fault coverage as well as the error figure 4 is the requirement of a second multiplexer for the CED of the s operation. This is due to the fact that a direct involutional operation on the σ layer yields in a fault-coverage of only 50%. The σ function is a 64-bit exclusive-or operation, and applying an involution operation on an XOR module will not enable us to detect all faults. i.e, if an exclusive-or function has a stuck-at fault at one of its output bits and a faulty output is obtained because of this fault, the result of involution would in fact, be the correct input that was applied. For example, if the input to the σ layer is 0x12345678 and the round key value is 0xABCDEF01, the normal output would be 0xB9F9B979. If there is a fault in the system such that the 2 nd LSB (Least Significant Bit) of the exclusive-or output is stuck-at-1, instead of obtaining the correct result, we obtain 0xB9F9B97B which is a faulty output. But when we perform the exclusive-or operation again on this faulty output, we get back 0x12345678. In such cases, ordinary involution based CED fails. In such cases, ordinary involution based CED fails. To solve this problem, we propose the following. The operands for all the exclusive-or operators are exchanged. So, the 64-bit exclusive-or operation is now divided into two parts, the left and the right with each part consisting of 32-bit exclusive-or operations. Similarly the 64-bit exclusive-or operator is also divided into two parts, the left and the right with each part consis ting of 32-bit exclusive-or operators. During the normal computation, the left part of exclusive-or operation is allocated to the left part of exclusive-or operator while the right part of exclusive-or operation is allocated to the right part of exclusive-or operator. But for the involution based CED, we interchange the operators. i.e., the right part is allocated to the left and vice-versa. Fault simulation shows the single-bit fault coverage of this scheme is 100%.
Implementation based validation
We implemented KHAZAD with involution based CED using IBM 0.13 micron library. The modeling of the architecture was done using VHDL, and Cadence Buildgates PKS was used to do the synthesis and place route. KHAZAD without CED datapath was also implemented using the same library and design flow. Table 3 shows the details of the overhead. The second row shows the area used by both designs. An inverter of this library takes 32 units area. The area overhead of the CED design is 23.8%. The third row shows the minimum clock period of synthesized designs. Due to the multiplexers inserted in the datapath, the clock period of CED design is 3.3% more than the normal design. The fourth row shows that the CED design takes one more clock cycle than the normal design. This is because in the CED design, the re-computation of a round operation lags one clock cycle to the normal computation. This means that if we ignore the CED only for the σ layer in the last round, the normal architecture and the involution based CED architecture in fact take the same number of clock cycles to complete i.e., no time overhead. Finally the throughput comparison is shown in the fifth row. The throughput is calculated as the number of bits encrypted per second, i.e. the # of text / (the # of clock cycles × clock period).
Normal
Involution based CED (scheme 3 above) 
Single-bit Faults
To check the error detection capability, we modeled our implementation using C. A stuck-at fault at a function output was injected by adding a multiplexer at the output of the function as shown in Figure 5 . By setting the fault injection control to 1, a stuck-at fault (either 0 or 1) is injected at the output of the function. Similarly, a stuck-at fault can be injected at the input of a function. Therefore, the number of connections between gates/functions gives the number of possible single-bit faults. Note that in this simulation we treat Function F as a black box and only consider the faults at inputs and outputs. If we break down the Function F into smaller components and consider the inputs and outputs of these smaller components, the number of single-bit faults is increased. In our simulation we consider the P-Box and Q-box of an S-Box, and the exclusive-or for the functions θ and σ as the black box operations. The number of single-bit faults is shown in Table 4 . For example, as shown in figure 2 , since an S-Box consists of three 4x4 P-Box and three 4x4 Q-Boxes, the total number of connections of an S-Box is 4 × 3 (for P-Box) + 4 × 3 (for Q-Box) + 8 (the number of inputs to S-Box) = 32. Since the γ function of KHAZAD consists of 8 S-Boxes, the total number of connections and hence the total number of single-bit faults is 256. We ran simulations for random 1.5 million inputs, and for every input we simulated all the possible single-bit faults, i.e. only one bit is stuck at 1 or 0. Table 4 shows the fault coverage. The lowest level of fault injection in the design was performed at the bit-wise exclusive-or level. As seen from the Table 4 . Fault Coverage of the Implementation
Multiple-bit Faults
The injection of random multiple bit faults into the system yielded an overall fault coverage of approximately 99% over a random 1.5 million input test simulation run. The reason for not achieving 100% fault coverage with multiple-bit faults is because in some exceptionally rare cases, the fault in the system gets nullified in the case of the ? and s layers. Consider a single S-Box component of the non-linear ? layer. If we have an input of 0xD5 to the S-box, the output obtained is 0x11. After involution, we get back 0xD5, which was the original input applied. Hence, the system will not report an error. Now, if a fault occurs in the system such that the two LSBs of the P-box are stuck at logic 1, then after 0xD5 is passed through the S-Box, 0x71 is obtained, which is a faulty output. Interestingly, the involution output obtained in this case is also 0xD5. Since this value is equal to the original input, the system fails to report an error. This implies that in extremely rare cases as the one explained above, the CED method that we propose does not yield accurate results. Problems like the one described above do not affect the ? layer because ? is a diffusion layer and every bit in the output is dependent on every bit in the input. Hence, on performing involution, all multiple-bit errors are also detected, giving 100% fault coverage.
Conclusion
We proposed a low cost CED technique for involutional ciphers which exploits the involution property of the cipher. For the involutional cipher the proposed technique entails an additional 23.8% silicon area and degrades the throughput by less than 10%. This technique entails a 4.5% time overhead (which can be reduced to 0% if the CED is ignored only for the s layer in the last round of encryption/decryption). The fault injection based simulation shows the proposed CED technique detects all single-bit faults and around 99% of all multiple-bit faults. . Since round key generation uses the KHAZAD round function ?, the CED method proposed in this paper can be applied to detect all single-bit faults. .
