Cryptographic systems are vulnerable to random errors and injected faults. Soft errors can occur inadvertently in critical cryptographic modules, and attackers can inject faults into systems to retrieve the embedded secret. Different schemes have been developed to improve the security and reliability of cryptographic systems. As the new SHA-3 standard, the Keccak algorithm will be widely used in various cryptographic applications, and its implementations should be protected against random errors and injected faults. In this paper, we devise different parity checking methods to protect the operations of Keccak. Results show that our schemes can be easily implemented and can effectively protect Keccak systems against random errors and fault attacks.
INTRODUCTION
Keccak is a new algorithm that can be used for hashing, stream encryption, pseudo-random sequence generation, message authentication codes (MAC), etc., and has recently been selected as the SHA-3 standard [1]. As SHA-3 will be widely used in cryptographic applications, the reliability and security of its implementations will be of vital importance to SHA-3 based security engines.
Cryptographic systems are sensitive to random errors caused by aging and by the ambient environment, such as temperature and X-ray radiation [9, 19]. For cryptographic systems used for encryption, authentication, integrity checking, etc., random errors cause incorrect results and make the systems unreliable. Attackers can also inject transient faults into the system to retrieve the secret key or state. Analyzing the correct output together with the faulty output is called differential fault analysis (DFA), which has been shown to be very effective in cracking block ciphers including the Data Encryption Standard (DES) and the Advanced Encryption Standard (AES) [6, 17]. A recent work [3] shows that DFA can also be used to attack SHA-3 implementations to recover the internal states.
To protect cryptographic systems against random errors and injected faults, different error detection methods have been adopted [8, 9, 10, 11, 13, 14, 19]. Many schemes are based on redundancy, with another copy of the original implementation used for output comparison so as to detect errors occurring in either copy. To further improve the reliability, the second copy can have a different implementation from the first copy [8]. To reduce the resource overhead, error detection coding is also widely used. Parity checking codes have been used in the protection of AES [19] because of their efficient design and high error coverage. Besides parity checking, non-linear codes have also been introduced to further improve system reliability with higher fault coverage [10].
Previous works on Keccak mainly focus on side-channel analysis and collision attacks [7, 12, 16, 18], while leaving its reliable design against injected faults and random errors almost blank. To the best of our knowledge, only one work about reliable Keccak design has been published [5]. In [5], the authors find a property of the Keccak algorithm: each lane of the Keccak state can be rotated by a random number before each round operation, and then shifted back after the Keccak operations without changing the results. Based on this property, they implement another copy of Keccak with this kind of rotation for comparison. Any error can cause an output mismatch between the two copies and thus can be detected. With different implementations in the two copies, it is unlikely that the same errors will appear in both copies. However, this method has high resource overhead due to the extra copy.
In this paper, we propose a simple and efficient parity checking based error detection method for Keccak. Optimized designs are also proposed to further improve the efficiency of the proposed scheme. We implement the proposed scheme in VHDL and simulate fault injection at gate level to obtain the fault coverage of the proposed scheme. Results show that our scheme has a high fault coverage for hardware implementations with very small resource overhead.
The rest of this paper is organized as follows. In Section 2, some basic knowledge of Keccak and preliminaries of error detection are introduced. In Section 3, the mathematical property of Keccak is analyzed and our parity checking scheme for Keccak operations is proposed. In Section 4, implementation and fault injection simulation results of the proposed schemes are given. Finally, the paper is concluded in Section 5.
PRELIMINARIES

Preliminaries of Keccak Hash Function
Keccak can work in different modes and have variable length [1]. In this paper, we use Keccak-1600 as an example to demonstrate our scheme. All of the 1600 state bits are organized in a 3-D array, as shown in Figure 1. Each bit is addressed by three coordinates, denoted as $S(x, y, z)$, $x, y \in \{0, 1, \ldots, 4\}$, $z \in \{0, 1, \ldots, 63\}$. 2-D entities (plane, sheet and slice) and 1-D entities (lane, column and row) are also defined in Keccak and shown in Figure 1. Keccak relies on a sponge architecture to iteratively absorb message inputs and squeeze out outputs using a permutation function f. The f function consists of 24 rounds, where each round has five sequential steps:

$$S_{i+1} = \iota \circ \chi \circ \pi \circ \rho \circ \theta (S_i), \quad i \in \{0, 1, \ldots, 23\}, \tag{1}$$

in which $S_0$ is the initial input. Details of each step are described below:
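θ is a linear operation in which each output bit depends on 11 input bits. Each output state bit is the XOR of the input state bit and two intermediate bits produced by its two neighbor columns. We denote the input to the θ operation as $\theta_i$ and the output as $\theta_o$, and the operation is given as follows (x and y indices are taken modulo 5, z indices modulo 64):

$$\theta_o(x, y, z) = \theta_i(x, y, z) \oplus \Big(\bigoplus_{y'=0}^{4} \theta_i(x-1, y', z)\Big) \oplus \Big(\bigoplus_{y'=0}^{4} \theta_i(x+1, y', z-1)\Big). \tag{2}$$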
ρ is a permutation over the bits of the state along the z-axis (in lanes), and the number of shifted bits depends on the (x, y) coordinates.
π is a permutation over the bits of the state within slices. Only the center bit (x = 0, y = 0) of the slice does not move. All other bits are permuted to other positions depending on their original coordinates.
χ is a non-linear step that contains mixed binary operations. Every bit of the output state is the result of an XOR between the corresponding input state bit and a combination of its two neighboring bits along the x-axis (in a row):

$$\chi_o(x, y, z) = \chi_i(x, y, z) \oplus \big(\overline{\chi_i(x+1, y, z)} \cdot \chi_i(x+2, y, z)\big).$$

ι is a binary XOR with a round constant which is publicly known.
Further details of Keccak can be found in [1] .
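The five steps described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design evaluated in this paper: the state is assumed to be stored as a 5 × 5 array `A[x][y]` of 64-bit lanes, and the names `rotl`, `rho_offsets`, `round_constants`, `keccak_round` and `keccak_f` are chosen here for illustration. The rotation offsets and round constants are generated with the recurrences given in the Keccak specification [1].

```python
MASK = (1 << 64) - 1  # each lane is modelled as a 64-bit word


def rotl(v, n):
    """Rotate a 64-bit lane left by n positions (bit z moves to z + n)."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK


def rho_offsets():
    """Rotation offsets r[x][y] of the rho step (specification recurrence)."""
    r = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return r


def round_constants():
    """The 24 iota constants, produced by the 8-bit LFSR of the specification."""
    rcs, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            if lfsr & 1:
                rc ^= 1 << ((1 << j) - 1)
            lfsr <<= 1
            if lfsr & 0x100:
                lfsr ^= 0x171  # x^8 + x^6 + x^5 + x^4 + 1
        rcs.append(rc)
    return rcs


def keccak_round(A, rc, r):
    """One round: theta, rho, pi, chi, iota applied in this order."""
    # theta: XOR each bit with the parities of two neighbouring columns
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rotl(C[(x + 1) % 5], 1) for x in range(5)]
    A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    # rho (rotate every lane) followed by pi (move lanes within the slice)
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rotl(A[x][y], r[x][y])
    # chi: the only non-linear step, applied along each row
    A = [[B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y])
          for y in range(5)] for x in range(5)]
    # iota: XOR a public round constant into lane (0, 0)
    A[0][0] ^= rc
    return A


def keccak_f(A):
    """Apply all 24 rounds to a 5 x 5 state of 64-bit lanes."""
    r = rho_offsets()
    for rc in round_constants():
        A = keccak_round(A, rc, r)
    return A
```

One way to gain confidence in such a model is to absorb a padded SHA3-256 block into the state, apply `keccak_f`, and compare the first four lanes with the digest returned by a trusted library.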
Reliable Design with Error Detection
Reliable cryptographic modules rely on error detection to detect random errors and injected faults in systems. The basic structure of error detection is shown in Figure 2 , in which the cryptographic module is protected by introducing another module, Protector. Protector is composed of three modules, which are Predictor, Compressor and Comparator. 
PARITY CHECKING OF KECCAK
In this section, we propose our parity checking based concurrent error detection scheme for Keccak. We treat each operation of Keccak as a cryptographic module as shown in Figure 2, and the goal is to design an efficient and effective predictor and compressor suitable for each operation.
Analysis of the Protection of Keccak
The sponge function involves many simple bitwise operations such as XOR, AND and NOT, rather than complex non-linear operations such as the S-box in block ciphers. Such bitwise operations can be combined according to Boolean algebra for efficient predictor and compressor design. For each 3-D input and output state of a cryptographic operation, parity checking can be implemented at different granularities. For example, it can be implemented in 1-D entities (row, column or lane) or in 2-D entities (slice, sheet or plane). This results in different compression ratios, yielding different error coverage and resource overhead.
For concurrent error detection, some steps can usually be combined to achieve higher efficiency. For example, the error detection of ShiftRows, MixColumns and AddRoundKey of AES can be combined to achieve lower overhead and higher efficiency [10]. Instead of protecting each operation of Keccak separately, we propose to combine the protections of some operations to improve the efficiency and save resources. For example, ι is a binary XOR operation with a constant, and thus the protection of ι can be efficiently combined with the protection of χ. In this section, we show how to make use of the properties of Keccak and combine the parity checking of some steps to achieve higher efficiency.
The overall protection scheme is shown in Figure 3. The protections of χ and ι, and of ρ and π, are combined respectively for higher efficiency, with the combined operations denoted as χ′ and ρ′. Note that, depending on the implementation of Keccak, the ρ′ protector may be optional. This is because both ρ and π are permutation operations and only change the positions of bits without changing their values. If no storage (state register) is used in these operations, they will be synthesized as wires in an FPGA or ASIC. Any errors (stuck-at-zero or stuck-at-one) will be detected by the preceding θ protector or the following χ′ protector. Nevertheless, we present the protection of ρ and π in this section at the algorithm level, and leave the implementation to the next section.
Parity Checking of θ
Parity Checking of θ in 1-D Entities
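As shown in formula (2), each θ operation works on one input bit of $\theta_i$ and two nearby columns (five bits each). Operations on the five bits in the same column of $\theta_i$ involve the same two nearby columns. Thus we propose to implement parity checking of θ along the y-axis (in column). For each column of $\theta_o$, the parity checking is as follows:

$$\bigoplus_{y=0}^{4} \theta_o(x, y, z) = \bigoplus_{y=0}^{4} \theta_i(x, y, z) \oplus \bigoplus_{y=0}^{4} \theta_i(x-1, y, z) \oplus \bigoplus_{y=0}^{4} \theta_i(x+1, y, z-1), \tag{3}$$

where the parity of each column of $\theta_o$ (the left-hand side) is the XOR of the parities of three $\theta_i$ columns, located at $(x, z)$, $(x-1, z)$ and $(x+1, z-1)$. The two neighbor-column terms survive because each of them is added five times, an odd number, when the column is compressed.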
We denote the parity checking result of $\theta_i$ in each column as $P[\theta_i](x, z)$, which means compressing $\theta_i$ along the y direction. The parity of the input state is therefore a plane ($X = [0:4]$, $Z = [0:63]$). Similarly, we denote the parity checking result of $\theta_o$ in each column as $P[\theta_o](x, z)$. We can represent (3) as follows:

$$P[\theta_o](x, z) = P[\theta_i](x, z) \oplus P[\theta_i](x-1, z) \oplus P[\theta_i](x+1, z-1). \tag{4}$$
The parity checking can also be done along the row. Similarly, we have:

$$P[\theta_o](y, z) = P[\theta_i](y, z) \oplus P[\theta_i](z) \oplus P[\theta_i](z-1), \tag{5}$$

in which $P[\theta_i](z)$ stands for the compression of each slice of $\theta_i$, so that the input state is compressed to a parity lane, and $P[\theta_i](y, z)$ is the parity of a row. Compared to (4), it involves multiple parity generations (both in row and in slice).
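The column-based and row-based predictors can be checked against a software model of θ. The sketch below assumes the state is held as a 5 × 5 array `A[x][y]` of 64-bit lanes, so that each parity plane becomes five 64-bit words; all function names are illustrative and not taken from the paper's VHDL design.

```python
import random

MASK = (1 << 64) - 1


def rotl(v, n):
    """Rotate a 64-bit lane left by n, so bit z of the result is bit z - n."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK


def theta(A):
    """Software model of the theta step on a 5 x 5 array of 64-bit lanes."""
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rotl(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]


def column_parity(A):
    """P[s](x, z): compress along y, giving one 64-bit word per x."""
    return [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]


def row_parity(A):
    """P[s](y, z): compress along x, giving one 64-bit word per y."""
    return [A[0][y] ^ A[1][y] ^ A[2][y] ^ A[3][y] ^ A[4][y] for y in range(5)]


def slice_parity(A):
    """P[s](z): compress every 25-bit slice into one bit (a parity lane)."""
    p = 0
    for x in range(5):
        for y in range(5):
            p ^= A[x][y]
    return p


def predict_column_parity(A_in):
    """Column-based predictor: three input column parities per output column."""
    P = column_parity(A_in)
    return [P[x] ^ P[(x - 1) % 5] ^ rotl(P[(x + 1) % 5], 1) for x in range(5)]


def predict_row_parity(A_in):
    """Row-based predictor: row parity plus the parities of two slices."""
    R = row_parity(A_in)
    S = slice_parity(A_in)
    return [R[y] ^ S ^ rotl(S, 1) for y in range(5)]


if __name__ == "__main__":
    rng = random.Random(2016)
    state = [[rng.getrandbits(64) for _ in range(5)] for _ in range(5)]
    out = theta(state)
    print(predict_column_parity(state) == column_parity(out))
    print(predict_row_parity(state) == row_parity(out))
```

In both cases the predictor only touches parities of the input state, which is why it can run in parallel with the θ module itself.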
Parity Checking of θ in Slice
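Equation (4) shows that parity checking in each column involves nearby columns (with one on the same slice). Equation (5) shows that parity checking in each row already involves two slices. We examine parity checking in 2-D entities too. The parity of each slice of $\theta_o$ is:

$$\bigoplus_{x, y} \theta_o(x, y, z) = \bigoplus_{x, y} \Big(\theta_i(x, y, z) \oplus \bigoplus_{y'} \theta_i(x-1, y', z) \oplus \bigoplus_{y'} \theta_i(x+1, y', z-1)\Big). \tag{6}$$

Due to the property of the XOR operation, we find a unique property for slice-based parity checking of the θ operation:

$$\bigoplus_{x, y} \theta_o(x, y, z) = \bigoplus_{x, y} \theta_i(x+1, y, z-1). \tag{7}$$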
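This means that the parity of each slice of $\theta_o$ is the parity of a nearby slice of $\theta_i$. For the 1,600-bit state of Keccak, there are 64 slice-based parity bits for $\theta_i$ and $\theta_o$ respectively. Define $P[\theta_o](Z)$ and $P[\theta_i](Z)$ as follows:

$$P[\theta_o](z) = \bigoplus_{x, y} \theta_o(x, y, z), \quad P[\theta_i](z) = \bigoplus_{x, y} \theta_i(x, y, z), \quad z \in Z = [0:63],$$

then (7) can be represented as:

$$P[\theta_o](z) = P[\theta_i](z-1), \tag{8}$$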
which means that the slice-based parity lane of $\theta_o$ is a cyclic shift, by one position, of the parity lane of $\theta_i$. This makes the slice-based parity checking for θ very efficient.

For the three parity generation and checking schemes introduced in Section 3.2.1 and Section 3.2.2, we compare their resource overhead in terms of XOR2 gates. Taking parity checking in slice as an example, the predictor compresses every 25-bit slice into one bit, which requires 24 XOR gates, so the predictor needs 1536 (24 × 64) XOR gates in total. The XOR gate consumption of the different protection schemes of θ is listed in Table 1.

Table 1: XOR gate consumption of different protection schemes of θ

Scheme      Predictor   Compressor   Comparator   Total
Column      1920        1280         639          3839
Row         2176        1280         639          4095
Slice       1536        1536         127          3199
Duplicate   3520        0            3199         6719

The schemes with parity generation in row and column protect θ for every five bits, while the slice-based scheme protects the θ operation for every 25 bits. Thus the row and column based schemes have higher error coverage than the slice-based scheme. Table 1 shows that the slice-based parity checking scheme has lower area overhead than the other two schemes. A balance should be struck between error coverage and resource overhead during design. Meanwhile, compared with the proposed parity checking schemes, duplication based error detection requires much higher resource overhead. In this paper, we use slice-based parity generation to implement the parity checking of the θ operation.
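The slice-based check of θ reduces to one 64-bit rotation on the predictor side, which the following sketch illustrates. It is a bit-level software model under the assumption that the state is a 5 × 5 array `A[x][y]` of 64-bit lanes; `theta_alarm` is a hypothetical name for the comparator output.

```python
import random

MASK = (1 << 64) - 1


def rotl(v, n):
    """Rotate a 64-bit lane left by n, so bit z of the result is bit z - n."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK


def theta(A):
    """Software model of the theta step on a 5 x 5 array of 64-bit lanes."""
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rotl(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]


def slice_parity(A):
    """Compress every 25-bit slice into one bit: a 64-bit parity lane."""
    p = 0
    for x in range(5):
        for y in range(5):
            p ^= A[x][y]
    return p


def predict_slice_parity(A_in):
    """Predictor: the output parity lane is the input parity lane rotated by 1."""
    return rotl(slice_parity(A_in), 1)


def theta_alarm(A_in, A_out):
    """Comparator: True when predicted and observed parity lanes differ."""
    return predict_slice_parity(A_in) != slice_parity(A_out)


if __name__ == "__main__":
    rng = random.Random(7)
    state = [[rng.getrandbits(64) for _ in range(5)] for _ in range(5)]
    good = theta(state)
    print("fault-free alarm:", theta_alarm(state, good))
    faulty = [lane[:] for lane in good]
    faulty[2][3] ^= 1 << 17  # a single flipped output bit
    print("single-bit fault alarm:", theta_alarm(state, faulty))
```

The model works on bits rather than gates, so it illustrates the parity relation only and says nothing about the gate counts in Table 1.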
Parity Checking of ρ and π
As discussed in Section 3.1, ρ and π can be left unprotected if they are implemented using wires in the circuit. For implementations in which ρ and π use registers (a pipelined design, for example), the protections of ρ and π can be implemented either separately or combined.
For ρ, the permutation is along the z-axis, thus we propose to compress the ρ operation along each lane, and the protection function is as follows:

$$\bigoplus_{z=0}^{63} \rho_i(x, y, z) = \bigoplus_{z=0}^{63} \rho_o(x, y, z). \tag{9}$$
In (9), the left side is the predictor while the right side is the compressor design. Similarly, since π permutes the state bits inside each slice, we propose to compress the bits inside each slice:

$$\bigoplus_{x, y} \pi_i(x, y, z) = \bigoplus_{x, y} \pi_o(x, y, z). \tag{10}$$
The protection of ρ and π can both be implemented efficiently because of their simple operations. To further improve the implementation efficiency, we propose to combine ρ and π as a new operation ρ′ and protect it instead. For the protection of ρ′, parity checking can be efficiently implemented along the z-axis. The protector of ρ′ can be designed as follows:

$$\pi\big(P[\rho'_i](x, y)\big) = P[\rho'_o](x, y), \quad P[s](x, y) = \bigoplus_{z=0}^{63} s(x, y, z), \tag{11}$$
in which the left-hand side is the predictor and the right-hand side is the compressor design. For the predictor design, the 1,600-bit state is first compressed to a slice $P[\rho'_i](x, y)$, and this slice is then permuted according to the π operation. For this protection scheme, the predictor and compressor both require 1,575 XOR gates, and the comparator needs 49 XOR gates. Although θ can also be combined with ρ′, the efficiency is not improved significantly, thus we protect them separately in this paper.
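The combined ρ′ check can be sketched as follows: each lane is compressed to one bit, and the resulting 25-bit slice is permuted in the same way π moves lanes. The sketch assumes a 5 × 5 array `A[x][y]` of 64-bit lanes and uses the rotation-offset recurrence and the π lane mapping from the Keccak specification [1]; the function names are illustrative. Since a rotation never changes the parity of a lane, the offsets do not appear in the predictor at all.

```python
import random

MASK = (1 << 64) - 1


def rotl(v, n):
    """Rotate a 64-bit lane left by n positions."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK


def rho_offsets():
    """Rotation offsets r[x][y] of the rho step (specification recurrence)."""
    r = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return r


def rho_pi(A, r):
    """rho' = pi after rho: rotate each lane, then move it within the slice."""
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rotl(A[x][y], r[x][y])
    return B


def lane_parity(A):
    """Compress along z: one parity bit per lane, i.e. a 25-bit slice."""
    return [[bin(A[x][y]).count("1") & 1 for y in range(5)] for x in range(5)]


def predict_rho_pi_parity(A_in):
    """Predictor: compress the input state, then permute the slice like pi."""
    P = lane_parity(A_in)
    Q = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            Q[y][(2 * x + 3 * y) % 5] = P[x][y]
    return Q


def rho_pi_alarm(A_in, A_out):
    """Comparator: True when predicted and observed parity slices differ."""
    return predict_rho_pi_parity(A_in) != lane_parity(A_out)


if __name__ == "__main__":
    rng = random.Random(11)
    state = [[rng.getrandbits(64) for _ in range(5)] for _ in range(5)]
    out = rho_pi(state, rho_offsets())
    print("fault-free alarm:", rho_pi_alarm(state, out))
    out[4][1] ^= 1 << 5  # a single flipped bit, e.g. in a pipeline register
    print("single-bit fault alarm:", rho_pi_alarm(state, out))
```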
Parity Checking of χ and ι
What Should We Avoid in Protection of χ?
χ is the only non-linear step in Keccak and it involves both NOT and AND operations. This makes the parity checking of the χ step different from that of the previous operations. In this section, we first show a pitfall in the protection of χ which should be avoided in practical design.
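We take one row of $\chi_i$ as an example and denote the five bits of this row as {a, b, c, d, e}. Then the five bits of the corresponding $\chi_o$ output row can be denoted as $a \oplus (\bar{b} \cdot c)$, $b \oplus (\bar{c} \cdot d)$, $c \oplus (\bar{d} \cdot e)$, $d \oplus (\bar{e} \cdot a)$ and $e \oplus (\bar{a} \cdot b)$. The parity bit of this single row is:

$$P = a \oplus b \oplus c \oplus d \oplus e \oplus (\bar{b} \cdot c) \oplus (\bar{c} \cdot d) \oplus (\bar{d} \cdot e) \oplus (\bar{e} \cdot a) \oplus (\bar{a} \cdot b). \tag{12}$$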
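For equation (12), if one of the five bits in this row is flipped, the final result of equation (12) may not change. Take bit a as an example: besides appearing directly in (12), a affects both $(\bar{e} \cdot a)$ and $(\bar{a} \cdot b)$. Simplifying the terms of (12) that depend on a gives:

$$a \oplus (\bar{e} \cdot a) \oplus (\bar{a} \cdot b) = \big(a \cdot (b \oplus e)\big) \oplus b.$$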
When a flips, if $b \oplus e$ is already 0, the above result does not change, and then the result of (12) will not change either. Assuming all bits are independent, the probability of $b \oplus e = 0$ is 50%, and therefore the fault coverage of this scheme is 50% for single-bit errors. So parity checking in each row of χ should be avoided because of the non-linearity of the χ operation, and thus parity checking in each slice is not applicable to the χ operation either.
Parity Checking of χ in Each Lane
In this section, we show how to build a practical error detection module for the χ operation. As x-axis compression of χ is not a good choice, we consider compressing the results along either the z-axis or the y-axis. For parity generation along the z-axis, the 64 bits in each lane are compressed to one bit and thus the 1,600 bits are compressed to one slice. The parity generation can be denoted as follows:

$$P[\chi_o](x, y) = \bigoplus_{z=0}^{63} \chi_o(x, y, z).$$
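According to the definition of the χ operation, we have:

$$P[\chi_o](x, y) = \bigoplus_{z=0}^{63} \Big(\chi_i(x, y, z) \oplus \big(\overline{\chi_i(x+1, y, z)} \cdot \chi_i(x+2, y, z)\big)\Big) = P[\chi_i](x, y) \oplus P[\chi_{and}](x, y),$$

in which $\chi_{and}(x, y, z) = \overline{\chi_i(x+1, y, z)} \cdot \chi_i(x+2, y, z)$. Thus the predictor design of parity checking for χ along the z-axis is also easy to implement. It performs the AND operations first, and then compresses the data along the z-axis to generate the parity for checking.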
Meanwhile, the parity of χ can also be generated in each column, along the y-axis direction. For this scheme, the five bits in each column are compressed to one bit, and the 1,600-bit state is compressed to one plane. The compressor works as follows:

$$P[\chi_o](x, z) = \bigoplus_{y=0}^{4} \chi_o(x, y, z) = P[\chi_i](x, z) \oplus P[\chi_{and}](x, z).$$

In this scheme, the overhead is higher than that of the z-axis compression scheme due to its lower compression ratio, while it has higher fault coverage. Thus designers can choose the best scheme according to the system requirements. In this paper, we implement the protection of the χ operation along the z-axis.
Combination of χ and ι
Since ι only adds a constant to the χ result, it can be easily combined with χ as discussed in the previous section. If χ is checked along the z-axis, the combined parity checking is

$$P[\chi'_o](x, y) = P[\chi_i](x, y) \oplus P[\chi_{and}](x, y) \oplus P[\iota_c](x, y),$$

and parity checking of χ combined with ι in other directions is similar. Note here that $P[\iota_c](x, y)$ is computed at the design stage to avoid computing it for each run. This protection scheme requires 3,200 XOR gates and 1,600 AND gates for the predictor, 1,575 XOR gates for the compressor, and 49 XOR gates for the comparator.
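A bit-level sketch of the combined χ′ check along the z-axis is given below. It assumes a 5 × 5 array `A[x][y]` of 64-bit lanes and the usual Keccak convention that the round constant is XORed into lane (0, 0) only, so the constant contributes to a single predicted parity bit. The constant `0x8082` is used purely as an example value, and all function names are illustrative.

```python
import random


def parity64(v):
    """Parity of a non-negative integer (XOR of all its bits)."""
    return bin(v).count("1") & 1


def chi(A):
    """Software model of chi on a 5 x 5 array of 64-bit lanes."""
    return [[A[x][y] ^ (~A[(x + 1) % 5][y] & A[(x + 2) % 5][y])
             for y in range(5)] for x in range(5)]


def chi_iota(A, rc):
    """chi' = iota after chi: the constant rc is XORed into lane (0, 0)."""
    B = chi(A)
    B[0][0] ^= rc
    return B


def predict_chi_iota_parity(A_in, rc):
    """Predictor: P[chi_i] XOR P[chi_and] XOR P[iota_c], one bit per lane."""
    P = [[parity64(A_in[x][y])
          ^ parity64(~A_in[(x + 1) % 5][y] & A_in[(x + 2) % 5][y])
          for y in range(5)] for x in range(5)]
    P[0][0] ^= parity64(rc)  # can be precomputed at design time
    return P


def compress(A_out):
    """Compressor: parity of every output lane (compression along z)."""
    return [[parity64(A_out[x][y]) for y in range(5)] for x in range(5)]


def chi_iota_alarm(A_in, A_out, rc):
    """Comparator: True when predicted and observed parity slices differ."""
    return predict_chi_iota_parity(A_in, rc) != compress(A_out)


if __name__ == "__main__":
    rng = random.Random(23)
    rc = 0x8082  # example constant only
    state = [[rng.getrandbits(64) for _ in range(5)] for _ in range(5)]
    out = chi_iota(state, rc)
    print("fault-free alarm:", chi_iota_alarm(state, out, rc))
    out[1][2] ^= 1 << 33  # a single flipped output bit
    print("single-bit fault alarm:", chi_iota_alarm(state, out, rc))
```

Because the AND terms are computed from the input state before compression, the non-linearity of χ does not weaken this check in the way it weakens row-based compression.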
IMPLEMENTATION AND FAULT INJECTION RESULTS
Implementation Results
To evaluate the proposed scheme, we implement the unprotected Keccak (referring to the official implementation provided online [2]) and the proposed scheme in Figure 3. We implement three variants of the proposed scheme:
• Proposed combines the protection of χ and ι as described in Section 3.4.3, and leaves ρ and π unprotected;
• Design 2 protects χ and ι separately, and leaves ρ and π unprotected;
• Design 3 combines the protection of χ and ι as described in Section 3.4.3, and protects ρ and π together referring to (11).
For the above three designs, we implement the protection of θ using the scheme in Section 3.2.2, which implements parity checking in each slice. Each protector is composed of three parts, as described in Figure 2. All the designs perform the five steps of each round within one clock cycle using combinational circuits. For the proposed schemes, we implement the original circuit and the protection circuits in parallel, and they work simultaneously.
For integrated circuit resource evaluation, the implementations (with and without protections) are modeled in VHDL and synthesized in Cadence Encounter RTL Compiler with a 45nm Opencell library (NanGate FreePDK45 v1 3 v2009 07). The designs are placed and routed using Cadence Encounter. The power and area overhead of the protection schemes are estimated using the Concurrent Current Source (CCS) model under typical operating conditions, assuming a supply voltage of 1.1 V and a temperature of 25 °C. The results, including area, timing delay and power consumption, are shown in Table 2. Table 2 shows that Proposed has much higher performance and lower area and power consumption than Design 2 and Design 3.
Compared with the original implementation, Proposed has about 27.05% area overhead. Meanwhile, our proposed scheme maintains high performance because its protection modules work concurrently with the Keccak module. Moreover, the proposed scheme combines ρ and π, and χ and ι, respectively, and thus has a small timing delay overhead.
Compared with Proposed, Design 2 does not optimize the protections of χ and ι and protects these two modules separately. The results show that Design 2 has much higher area overhead than the proposed design because of this separate protection. Meanwhile, it also has larger timing delay and power consumption than the proposed design. Design 3 protects ρ and π together as described in Section 3.3; results show that it has much higher area overhead than the proposed scheme, as well as larger timing delay and power consumption. Thus the proposed optimization methods help to save resources in the protection of Keccak.
Fault Coverage Analysis
Theoretically, for parity checking schemes, if an odd number of the bits covered by a parity bit are flipped, the error is detected with 100% probability, while if an even number of bits are flipped, the error always goes undetected. For real hardware systems, it is very difficult for attackers to precisely control the number and positions of faulty gates in the circuit [4]. Moreover, the errors in the circuit propagate randomly and cause different numbers of faulty bits in the output [15]. Thus, fault injection simulation results at gate level are required for error coverage evaluation.
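The odd/even behaviour can be made concrete with a small bit-level model. The sketch below assumes slice-based parity, as used for θ, and represents a fault simply as a set of flipped output bit positions `(x, y, z)`; the function name is hypothetical. It deliberately ignores gate-level fault propagation, which is why it cannot replace the simulation described in this section.

```python
def slice_parity_alarm(flips):
    """Return True if any of the 64 slice parities is violated.

    flips is an iterable of (x, y, z) positions of flipped output bits.
    A slice raises the alarm only when it contains an odd number of flips.
    """
    violated = [0] * 64
    for (_x, _y, z) in set(flips):
        violated[z] ^= 1
    return any(violated)


if __name__ == "__main__":
    # One flipped bit: odd count in slice 5, detected.
    print(slice_parity_alarm([(0, 0, 5)]))
    # Two flips in the same slice: even count, the parity check is blind.
    print(slice_parity_alarm([(0, 0, 5), (1, 2, 5)]))
    # Two flips in different slices: each slice sees one flip, detected.
    print(slice_parity_alarm([(0, 0, 5), (1, 2, 6)]))
```

The third case shows why multi-bit faults are not automatically missed: an even total number of flips is still detected whenever some parity group receives an odd share of them.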
In this paper, we randomly inject one to ten stuck-at-0 and stuck-at-1 faults into the Keccak circuit for error coverage simulation. To obtain the fault coverage, we give a random plaintext input for one round of Keccak, then randomly inject one to ten stuck-at-0 or stuck-at-1 faults into the system. We check the results of the protected implementation and the alarm signals to see whether any errors are missed. For each design, we run about $10^8$ fault injection trials, and the error coverage results are shown in Table 2. Table 2 shows that the proposed scheme has an error coverage of about 83.60%. Meanwhile, the results show that the separate protections of χ and ι in Design 2 do not increase the error coverage. This is because the ι module is very small and occupies only a very small fraction of the gates in the design. The probability of errors occurring in ι is therefore very small, and the error coverage of Design 2 is almost the same as that of the proposed design.
In Design 3, a large part of the gates is used for the protection of ρ and π, so errors occur in the ρ′ module with high probability. In this case, we can assume that a large part of the errors are injected into the protection module of ρ′, and these errors will be detected with high probability. Results show that the error coverage increases to 89.89% for Design 3. Thus, for pipelined designs which use registers to store the results of ρ and π, the protection method proposed in Section 3.3 should be implemented for higher error coverage.
In conclusion, according to the synthesis results in Section 4.1 and the fault injection simulation results in Section 4.2, the proposed scheme has a small resource overhead and high performance, and it can detect the injected faults with a high probability. Thus our proposed scheme strikes a good balance between resource overhead and error coverage, and can be efficiently implemented for the protection of Keccak implementations.
CONCLUSION
In this paper, we look into the parity checking of Keccak to protect it against random errors and injected faults. We make use of the mathematical properties of Keccak to implement parity checking based error detection. We find that combining the protections of some steps reduces the area overhead, timing delay and power consumption significantly without sacrificing the error coverage. Results show that our scheme has small resource overhead, while the timing delay and power consumption overheads are also very small. Under a multiple-bit random error model, fault injection simulation results show that our method can detect 83.60% of the injected faults. Future work will address more efficient protection methods for Keccak against random errors and injected faults, and protections of Keccak against other kinds of attacks.
