Reliable and High-Performance Hardware Architectures for the Advanced Encryption Standard/Galois Counter Mode by Mozaffari-Kermani, Mehran
Western University 
Scholarship@Western 
Electronic Thesis and Dissertation Repository 
6-27-2011 12:00 AM 
Reliable and High-Performance Hardware Architectures for the 
Advanced Encryption Standard/Galois Counter Mode 
Mehran Mozaffari-Kermani 
The University of Western Ontario 
Supervisor 
Dr. Arash Reyhani-Masoleh 
The University of Western Ontario 
Graduate Program in Electrical and Computer Engineering 
A thesis submitted in partial fulfillment of the requirements for the degree in Doctor of 
Philosophy 
© Mehran Mozaffari-Kermani 2011 
Recommended Citation 
Mozaffari-Kermani, Mehran, "Reliable and High-Performance Hardware Architectures for the Advanced 
Encryption Standard/Galois Counter Mode" (2011). Electronic Thesis and Dissertation Repository. 180. 
https://ir.lib.uwo.ca/etd/180 
RELIABLE AND HIGH-PERFORMANCE HARDWARE
ARCHITECTURES FOR THE ADVANCED ENCRYPTION
STANDARD/GALOIS COUNTER MODE
(Spine Title: Reliable and High-Performance Architectures for the
AES/GCM)
(Thesis Format: Monograph)
by
Mehran Mozaffari Kermani
Graduate Program in Electrical and Computer Engineering
Submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
School of Graduate and Postdoctoral Studies
The University of Western Ontario
London, Ontario, Canada
June 2011
© Mehran Mozaffari Kermani 2011
THE UNIVERSITY OF WESTERN ONTARIO
School of Graduate and Postdoctoral Studies
CERTIFICATE OF EXAMINATION
Supervisor:
. . . . . . . . . . . . . . . . . . . . .
Dr. Arash Reyhani-Masoleh
Examination Board:
. . . . . . . . . . . . . . . . . . . . .
Dr. Amr M. Youssef
. . . . . . . . . . . . . . . . . . . . .
Dr. Anestis Dounavis
. . . . . . . . . . . . . . . . . . . . .
Dr. Hanan Lutfiyya
. . . . . . . . . . . . . . . . . . . . .
Dr. Xianbin Wang
The thesis by
Mehran Mozaffari Kermani
entitled:
Reliable and High-Performance Hardware Architectures for the Advanced
Encryption Standard/Galois Counter Mode
is accepted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
. . . . . . . . . . . . . . .
Date
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Chair of the Thesis Examination Board
Abstract
The high level of security and the fast hardware and software implementations of the
Advanced Encryption Standard (AES) have made it the first choice for many critical
applications. Since its acceptance as the adopted symmetric-key algorithm, the AES has
been utilized in various security-constrained applications, many of which are power and
resource constrained and require reliable and efficient hardware implementations.
In this thesis, first, we investigate the AES algorithm from the concurrent fault
detection point of view. We note that, in addition to meeting its efficiency requirements,
the AES must be reliable against transient and permanent internal faults and against
malicious faults aimed at revealing the secret key. This reliability analysis and the
proposal of efficient and effective fault detection schemes are essential because fault
attacks have become a serious concern in cryptographic applications. Therefore, we
propose, design, and implement various novel concurrent fault detection schemes for
different AES hardware architectures. These include different structure-dependent and
structure-independent approaches for detecting single and multiple stuck-at faults using
single-bit and multi-bit signatures.
The recently standardized authentication mode of the AES, i.e., Galois/Counter Mode
(GCM), is also considered in this thesis. We propose efficient architectures for the
AES-GCM algorithm. In this regard, we investigate the AES algorithm and propose
low-complexity and low-power hardware implementations for it, with emphasis on its
nonlinear transformation, i.e., SubBytes (S-boxes). We present new formulations for this
transformation and, through exhaustive hardware implementations, we show that the
proposed architectures outperform their counterparts in terms of efficiency. Moreover, we
present new parallel, high-performance schemes for the hardware implementations of the
GCM to improve its throughput and reduce its latency.
The performance of the proposed efficient architectures for the AES-GCM and of their
fault detection approaches is benchmarked using application-specific integrated circuit
(ASIC) and field-programmable gate array (FPGA) hardware platforms. Our comparison
results show that the proposed hardware architectures outperform their existing
counterparts in terms of efficiency and fault detection capability.
Keywords: Advanced Encryption Standard, finite field, Galois/Counter Mode, high
performance, concurrent fault detection.
Dedication
To my parents for their love, inspiration, and guidance.
Acknowledgements
I would like to express my sincere appreciation and gratitude to Dr. Arash Reyhani-
Masoleh for supervising my research during my Ph.D. studies. I would like to also
thank my lab-mates, Dr. Arash Hariri, Christopher Kennedy, and Reza Azarderakhsh
for sharing their experience and knowledge with me.
Contents
Certificate of Examination
Abstract
Dedication
Acknowledgements
List of Figures
List of Tables
List of Abbreviations
1 Introduction and Preliminaries
1.1 Advanced Encryption Standard
1.1.1 LUT-Based Architectures
1.1.2 Composite Field Architectures
1.2 The Galois/Counter Mode
1.3 Motivation
1.4 Thesis Outline
2 Literature Review
2.1 Fault Detection Schemes
2.2 AES-GCM Architectures
3 Performance Evaluations and Comparisons of the AES S-boxes
3.1 Logic-gate Optimizations
3.2 Area and Delay Evaluations
3.3 Power Consumptions and Comparisons
3.3.1 Power Derivation Method
3.3.2 Analysis and Comparison
4 A Lightweight Fault Detection Scheme for the (Inverse) S-box Using Composite Fields
4.1 Some Notes on Polynomial and Normal Bases
4.2 Fault Detection Scheme
4.2.1 The S-box and the Inverse S-box Using Polynomial Basis
4.2.2 The S-box and the Inverse S-box Using Normal Basis
4.3 Error Simulations
4.4 ASIC and FPGA Implementations and Comparisons
5 A High-Performance Concurrent Fault Detection Approach for the Composite Field (Inverse) S-box
5.1 S-box and Inverse S-box Arithmetic Used in This Chapter
5.2 Proposed Fault Detection Approach
5.2.1 S-box
5.2.2 Inverse S-box
5.2.3 Merged S-box and Inverse S-box
5.2.4 Complexity Analysis
5.3 Simulation Results
5.4 ASIC Implementations and Comparisons
5.5 Formulations for Mixed Bases
5.5.1 Other Variants
6 Concurrent Structure-Independent Fault Detection Schemes for the AES
6.1 Notations Used in This Chapter
6.1.1 AES Encryption
6.1.2 AES Decryption
6.2 A New Fault Detection Scheme for the S-box and the Inverse S-box
6.2.1 The Systematic Scheme for the Multiplicative Inversion
6.2.2 The Proposed Scheme for the S-box and the Inverse S-box
6.3 Proposed Fault Detection Schemes for the AES
6.3.1 AES Encryption
SubBytes and ShiftRows
MixColumns and AddRoundKey
Further Improvements
6.3.2 AES Decryption
InvShiftRows and InvSubBytes
AddRoundKey and InvMixColumns
Further Improvements
6.4 Error Simulations
6.5 AES FPGA Implementations and Comparisons
7 Efficient and High-Performance Parallel Hardware Architectures for the AES-GCM
7.1 High-Performance GCM Parallel Architecture
7.1.1 High-Performance $\mathrm{GHASH}_H$ Function
7.1.2 High-Speed Structures for Hash Subkey Powers
7.1.3 $GF(2^{128})$ Multipliers for the GCM
7.2 AES-GCM Performance Comparisons
8 Summary and Future Work
8.1 Thesis Summary
8.2 Future Work
Bibliography
Curriculum Vitae
List of Figures
1.1 The AES encryption round transformations [1].
1.2 The composite field S-box architecture using polynomial basis [20] and normal basis [23].
1.3 The GCM authenticated encryption data flow [5].
2.1 Redundant unit fault detection structure for S-box (inverse S-box) [34], [42].
2.2 Parity-based fault detection structure of the $i$th round in the AES-128 encryption.
2.3 The multiplication-based scheme for the fault detection of the multiplicative inversion [38].
2.4 Low-power S-box (resp. inverse S-box) architecture using composite fields and polynomial basis [13].
2.5 The sequential method used in [64], [65], and [66] for the hardware implementation of the $\mathrm{GHASH}_H$.
4.1 The S-box (the inverse S-box) using composite fields and polynomial basis [20] and their fault detection blocks.
4.2 The S-box (the inverse S-box) using composite fields and normal basis [23] and their fault detection blocks.
5.1 The architecture of the S-box (resp. the inverse S-box) using composite field and polynomial basis [20].
5.2 The proposed parity-based fault detection scheme for the S-box (resp. inverse S-box).
5.3 Merged S-box (SB) and inverse S-box (ISB) and the corresponding predicted parities for different blocks.
5.4 The areas, critical path delays, and power consumptions of the original [22] and the proposed fault detection S-box and inverse S-box.
5.5 The area, delay, and power consumption overheads of the proposed schemes for the S-box and the inverse S-box.
5.6 The presented fault detection structure for the mixed bases S-box [62].
6.1 The proposed structure-independent fault detection scheme of the S-box.
6.2 The proposed fault detection scheme for the $i$th round of the AES encryption.
6.3 The proposed low-complexity fault detection scheme for the $i$th round of the AES encryption utilizing subexpression sharing.
6.4 The proposed fault detection scheme for the $i$th round of the AES decryption.
6.5 The proposed low-complexity fault detection scheme for the $i$th round of the AES decryption utilizing subexpression sharing.
6.6 Simulation results for the error coverages of the proposed fault detection schemes.
7.1 The hardware architecture of the proposed high-performance GCM $\mathrm{GHASH}_H$ function.
7.2 The derivation of $H^4$ of the GCM hash subkey.
7.3 (a) Cascade, (b) parallel, and (c) hybrid realization methods for the hash subkey exponentiations.
7.4 The AES-128 structure for (a) simple loop, (b) unrolled pipelined, and (c) unrolled sub-pipelined architectures (MixColumns is bypassed in the last round).
7.5 The proposed AES-GCM high-performance architecture for $q = 8$.
7.6 Comparison of the efficiencies of nine different AES-GCM architectures for $n_1 = 2^{32} - 2$ and $n_2 = 2^{10}$.
List of Tables
3.1 Evaluation of the performance metrics of the S-boxes on ASIC using the STM 65-nm CMOS standard technology.
3.2 Evaluation of the power consumptions of the S-boxes on ASIC using the STM 65-nm CMOS standard technology and the Synopsys® PrimeTime® PX [73].
4.1 Area/delay complexities of blocks 1 and 5 of the S-box and their predicted parities for possible values of  0s and 0s.
4.2 Parity predictions and complexities of block 2 of the normal basis S-box for possible values of  0 and 0.
4.3 Error simulation results of the optimum S-box and inverse S-box after injecting 500,000 errors.
4.4 ASIC implementations of the fault detection schemes for the S-box (SB) and the inverse S-box using 0.18-$\mu$m CMOS technology.
4.5 Xilinx® Virtex™-II Pro FPGA implementations (xc2vp2-7) of the fault detection schemes for the S-box (SB) and the inverse S-box.
4.6 ASIC implementations of the fault detection schemes of the AES encryption using 0.18-$\mu$m CMOS technology.
4.7 Xilinx® Virtex™-II Pro FPGA implementations of the fault detection schemes of the AES encryption.
5.1 The timing details of the proposed concurrent scheme for the S-box and the inverse S-box.
5.2 Fault detection capabilities of the proposed schemes after injecting 1,000,000 random multiple faults.
5.3 Comparing the areas, critical path delays, power consumptions, and fault detection capabilities of the proposed and previously presented fault detection schemes for the S-box using the 65-nm CMOS standard technology.
5.4 Comparing the areas, critical path delays, power consumptions, and fault detection capabilities of the proposed and previously presented fault detection schemes for the inverse S-box using the 65-nm CMOS standard technology.
6.1 Comparisons of the implementations of the fault detection schemes of the AES using LUT S-boxes and inverse S-boxes on Xilinx® FPGAs.
6.2 Implementation comparisons of the fault detection schemes of the AES encryption using composite field S-boxes on Xilinx® FPGAs.
7.1 Performance analysis and comparison of $\mathrm{GHASH}_H$ within the GCM for $n$ blocks and $q$ parallel structures.
7.2 Complexities of the realizations of the hash subkey exponentiations for $q = 8$ parallel architectures for $\mathrm{GHASH}_H$.
7.3 Hardware and timing complexities analysis of the utilized bit-parallel multipliers for the GCM.
7.4 The proposed architecture for the AES-GCM.
7.5 ASIC synthesis comparisons of the AES-GCM using the STM 65-nm CMOS technology.
List of Abbreviations
3DES Triple Data Encryption Standard
AES Advanced Encryption Standard
ASIC Application-Specific Integrated Circuit
DES Data Encryption Standard
FPGA Field-Programmable Gate Array
GCM Galois/Counter Mode
GF Galois Field
LUT Look-up Table
NIST National Institute of Standards and Technology
VHDL Very-high-speed integrated circuit Hardware Description
Language
VLSI Very Large Scale Integration
Chapter 1
Introduction and Preliminaries
In this chapter, we present an introduction to this thesis. We also provide the
preliminaries, motivations, and thesis outline.
Symmetric-key cryptography uses a key shared between the sender and the receiver for
encryption and decryption in secure communications. Because of the drawbacks of the
previous symmetric-key cryptographic standards, such as the DES and the 3DES, these
standards have been replaced by the Advanced Encryption Standard (AES) [1]. In
particular, the AES has overcome the drawbacks of the previous standards in terms of
vulnerability to brute-force attacks and slow software implementations. The AES was
accepted by the National Institute of Standards and Technology (NIST) in 2001 and, since
its acceptance, it has been utilized in a variety of security-constrained applications.
For instance, it has been included in the wireless standards Wi-Fi [2] and WiMAX [3] and
in many other applications, ranging from the security of smart cards to the bitstream
security mechanisms in FPGAs [4].
The Advanced Encryption Standard-Galois/Counter Mode (AES-GCM) provides authentication
and confidentiality for sensitive data simultaneously. In the AES-GCM, data
confidentiality is provided by the AES [1]. The authentication of the AES-GCM is
provided by the Galois/Counter Mode (GCM) [5] using a universal hash function. The
AES-GCM has been used in a number of applications, such as the LAN security standard
IEEE 802.1AE (MACsec) [6] and the Fibre Channel Security Protocols (FC-SP) [7].
Moreover, it has been utilized in a number of cores from industry; see, for example,
[8], [9], and [10]. In addition, two software-based AES-GCM implementations have been
presented in [11] and [12].
In what follows, we present the details on the AES and the GCM algorithms.
1.1 Advanced Encryption Standard
In the AES encryption, the input and the output blocks are limited to 128 bits. However,
based on the security requirements, the key size can be chosen as 128 bits (AES-128),
192 bits (AES-192), or 256 bits (AES-256). For each of these three variants, a different
number of rounds, corresponding to a different level of security, is used: 10 rounds for
the AES-128, 12 rounds for the AES-192, and 14 rounds for the AES-256 [1]. Every round
of the AES encryption consists of four transformations, except for the last round, which
has three.
The AES encryption transformations for the typical round $j$ are depicted in Fig. 1.1.
The 128 bits of the input and output of each transformation are considered as four by four
matrices, called states (shown in dotted rectangles in Fig. 1.1), whose entries are eight
bits. The first transformation in each round is SubBytes (S-boxes), which is implemented
by 16 S-boxes. In the S-box, each byte of the input state ($X_i$, $0 \le i \le 15$, in
Fig. 1.1) is substituted by a new byte ($Y_i$, $0 \le i \le 15$, in Fig. 1.1). ShiftRows
is the second transformation, in which the first row of the state remains intact and the
four bytes of each of the last three rows of the input state are cyclically shifted. The
third transformation is MixColumns, in which each column is modified individually. As
shown in Fig. 1.1, the columns are considered as polynomials over $GF(2^8)$ and are
multiplied by a fixed polynomial. The final transformation is AddRoundKey, which performs
the modulo-2 addition of the input state and the key of the corresponding round, i.e.,
$k_j$, $1 \le j \le 10$, 12, or 14 for the AES-128, 192, or 256, respectively [1]. In the
AES decryption, the reverse procedure of the AES encryption is performed [1].
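The three linear round transformations described above can be illustrated with a minimal Python sketch. This is a software illustration only, not one of the hardware architectures of this thesis. The function names and the representation of the state as a list of four rows are assumptions made for this example, and SubBytes is left out.

```python
# Number of rounds for each AES key size in bits.
ROUNDS = {128: 10, 192: 12, 256: 14}

def xtime(b):
    """Multiply a byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def shift_rows(state):
    """state[r][c] is the byte in row r and column c; row r rotates left by r."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def mix_single_column(col):
    """Multiply one 4-byte column by the fixed MixColumns matrix (entries 1, 2, 3)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,   # 2*a0 + 3*a1 + a2 + a3
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,   # a0 + 2*a1 + 3*a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),   # a0 + a1 + 2*a2 + 3*a3
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),   # 3*a0 + a1 + a2 + 2*a3
    ]

def mix_columns(state):
    """Apply mix_single_column to each of the four columns of the state."""
    cols = [mix_single_column([state[r][c] for r in range(4)]) for c in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]

def add_round_key(state, round_key):
    """Modulo-2 (XOR) addition of the state and the round key, byte by byte."""
    return [[s ^ k for s, k in zip(srow, krow)]
            for srow, krow in zip(state, round_key)]
```

Since AddRoundKey is a bitwise XOR, applying it twice with the same round key returns the original state.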
For realizing the S-box, the irreducible polynomial $P(x) = x^8 + x^4 + x^3 + x + 1$
is used to construct the binary field $GF(2^8)$. Let $X_i \in GF(2^8)$ and
$Y_i \in GF(2^8)$ be the input and the output of the S-box. Then, the S-box consists of
finding the multiplicative inversion, i.e., $X_i^{-1} \in GF(2^8)$, with the exception of
mapping the zero input to the zero output, followed by the affine transformation in
$GF(2^8)$ [1]. For the inverse S-boxes of the AES decryption, the inverse affine
transformation precedes the multiplicative inversion [1].
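As a software illustration of this definition, the following Python sketch computes the S-box as a multiplicative inversion in $GF(2^8)$ followed by the affine transformation, and the inverse S-box in the reverse order. The function names are assumptions for this example, and the inversion by exponentiation is chosen for brevity; it is not one of the hardware realizations discussed in this thesis.

```python
def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo P(x) = x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf256_inv(a):
    """Multiplicative inverse computed as a^254; the zero input maps to zero."""
    result, base, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf256_mul(result, base)
        base = gf256_mul(base, base)
        exp >>= 1
    return result

def affine(b):
    """Affine transformation of the S-box (constant 0x63)."""
    out = 0
    for i in range(8):
        bit = ((b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8))
               ^ (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out

def inv_affine(y):
    """Inverse affine transformation of the inverse S-box (constant 0x05)."""
    out = 0
    for i in range(8):
        bit = ((y >> ((i + 2) % 8)) ^ (y >> ((i + 5) % 8))
               ^ (y >> ((i + 7) % 8)) ^ (0x05 >> i)) & 1
        out |= bit << i
    return out

def sbox(x):
    """S-box: inversion followed by the affine transformation."""
    return affine(gf256_inv(x))

def inv_sbox(y):
    """Inverse S-box: inverse affine transformation followed by inversion."""
    return gf256_inv(inv_affine(y))
```

Evaluating `sbox` for all 256 inputs yields the table that a LUT-based implementation would store.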
Among the four different transformations in the AES, only the S-box and the inverse
S-box are non-linear. Additionally, all the S-boxes (resp. the inverse S-boxes) occupy
much of the total AES encryption (resp. decryption) area and their power consumption
is around three fourths of that of the entire AES [13]. In what follows, we present the
preliminaries regarding the hardware implementations of the S-boxes and the inverse
S-boxes within the AES using look-up tables (LUTs) and composite fields.
Figure 1.1: The AES encryption round transformations [1].
1.1.1 LUT-Based Architectures
The AES S-boxes and inverse S-boxes can be implemented using LUTs. For this purpose,
$256 \times 8$ memory cells are used to store the 256 possible 8-bit outputs of each
S-box/inverse S-box. The LUT-based implementation is suitable for the field-programmable
gate array (FPGA) platforms in which block memories are available; see, for example,
[14], [15], and [16]. However, although this implementation reaches high-speed
architectures, it is not suitable for applications requiring low-complexity AES
application-specific integrated circuit (ASIC) implementations [17].
The S-box and the inverse S-box are nonlinear operations which take 8-bit inputs and
generate 8-bit outputs. In the S-box, the irreducible polynomial
$P(x) = x^8 + x^4 + x^3 + x + 1$ is used to construct the binary field $GF(2^8)$. The
usage of arithmetic in composite fields reduces the space complexity of the S-box.
Moreover, it allows us to use pipelining and therefore the effective speed of the AES is
increased while processing independent messages. Consequently, the S-boxes and inverse
S-boxes implemented using composite fields can lead to area-efficient and
high-performance structures [17]. In the following, the preliminaries on composite field
realizations are presented.
1.1.2 Composite Field Architectures
In this section, we describe the composite field arithmetic used to calculate the
multiplicative inversion over $GF(2^8)$. This approach has received much attention in the
literature; see, for example, [13], [17], [18], [19], [20], [21], [22], [23], [24], and
[25]. Moreover, there have been low-power implementations for the S-boxes (resp. the
inverse S-boxes), such as the ones in [13] and [26]. It is noted that the low-power S-box
(resp. inverse S-box) presented in [13] uses composite fields.
The composite fields can be represented using normal basis [23] or polynomial basis
[18], [20], [21], [22]. The composite field realizations of the S-box using polynomial
and normal bases are presented in Fig. 1.2. As seen in this figure, a transformation
matrix transforms a field element $X \in GF(2^8)$ to the corresponding representation in
the composite field $GF(16^2)$, i.e., $\eta$. We consider the irreducible polynomial
$u^2 + u + \nu$, where $\nu$ is chosen over $GF(16)$ depending on the composite fields.
Then, the multiplicative inversion generates the inverse as $\theta = \eta^{-1}$.
Finally, as seen in Fig. 1.2, the inverse transformation matrix transforms the composite
field element to the one in the binary field, i.e., $Y \in GF(2^8)$.
Using polynomial basis constructed by the irreducible polynomial $u^2 + u + \nu$,
one can obtain the coordinates of $\theta$ as
$\theta_h = \eta_h (\eta_h^2 \nu + \eta_h \eta_l + \eta_l^2)^{-1}$ and
$\theta_l = (\eta_l + \eta_h)(\eta_h^2 \nu + \eta_h \eta_l + \eta_l^2)^{-1}$ [20]. This
multiplicative inversion in composite fields using polynomial basis is shown in the top
part of Fig. 1.2 by a dotted rectangle. Similarly, for normal basis, the coordinates of
$\theta$ are obtained as
$\theta_h = (\eta_h \eta_l + (\eta_h^2 + \eta_l^2)\nu)^{-1} \eta_l$ and
$\theta_l = (\eta_h \eta_l + (\eta_h^2 + \eta_l^2)\nu)^{-1} \eta_h$ [23], shown in the
dotted rectangle in the bottom of Fig. 1.2. One can refer to [20] and [23] for more
details on the composite field S-box architectures. As seen in Fig. 1.2, the above
multiplicative inversions consist of composite field multiplications, additions, and an
inversion in the sub-field $GF(16)$. In this figure, the sub-field multiplications are
shown by crossed circles. Moreover, the circle with a plus inside represents the
$GF(2^4)$ addition using 4 XOR gates.
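The two inversion formulas above can be checked numerically with a short Python sketch. Several choices below are assumptions made only for this illustration and are not taken from [20] or [23]: the sub-field $GF(2^4)$ is built with $x^4 + x + 1$, the constant of the irreducible polynomial $u^2 + u + c$ (stored as `NU`) is found by a search, the polynomial basis is $\{u, 1\}$, and the normal basis consists of the two roots of that polynomial. Elements are written as pairs (high coordinate, low coordinate).

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed sub-field polynomial)."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return result

def gf16_inv(a):
    """Inverse in GF(2^4) by exhaustive search; the zero input maps to zero."""
    return next((b for b in range(1, 16) if gf16_mul(a, b) == 1), 0)

def find_nu():
    """Smallest constant for which u^2 + u + constant has no root in GF(2^4)."""
    for nu in range(1, 16):
        if all(gf16_mul(t, t) ^ t ^ nu for t in range(16)):
            return nu

NU = find_nu()

def pb_mul(a, b):
    """Polynomial basis product of ah*u + al and bh*u + bl, with u^2 = u + NU."""
    (ah, al), (bh, bl) = a, b
    hh = gf16_mul(ah, bh)
    high = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    low = gf16_mul(hh, NU) ^ gf16_mul(al, bl)
    return high, low

def pb_inv(eta):
    """Polynomial basis inversion following the formula attributed to [20]."""
    eh, el = eta
    d = gf16_inv(gf16_mul(gf16_mul(eh, eh), NU) ^ gf16_mul(eh, el) ^ gf16_mul(el, el))
    return gf16_mul(eh, d), gf16_mul(el ^ eh, d)

def nb_mul(a, b):
    """Normal basis product; the identity element is (1, 1) in this basis."""
    (ah, al), (bh, bl) = a, b
    cross = gf16_mul(gf16_mul(ah ^ al, bh ^ bl), NU)
    return gf16_mul(ah, bh) ^ cross, gf16_mul(al, bl) ^ cross

def nb_inv(eta):
    """Normal basis inversion following the formula attributed to [23]."""
    eh, el = eta
    d = gf16_inv(gf16_mul(eh, el) ^ gf16_mul(gf16_mul(eh, eh) ^ gf16_mul(el, el), NU))
    return gf16_mul(d, el), gf16_mul(d, eh)
```

Multiplying any nonzero element by the value returned from `pb_inv` (resp. `nb_inv`) gives the identity of the corresponding basis, which is (0, 1) for the polynomial basis and (1, 1) for the normal basis, and the only inversion that is needed takes place in the 16-element sub-field.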
1.2 The Galois/Counter Mode
Authenticated encryption and decryption are the two functions within the GCM. The
authenticated encryption performs two tasks: encrypting the confidential data and
computing an authentication tag. The authenticated decryption function decrypts the
confidential data and verifies the tag [5]. The data flow of the authenticated encryption
is shown in Fig. 1.3. As seen in this figure, the mechanism for the confidentiality of
data is a variation of the block cipher counter mode of operation, denoted by
$\mathrm{GCTR}_K$ (Galois Counter with the key $K$) [5]. For the AES-GCM, the block
cipher encryption with the specific key $K$ is shown by $\mathrm{AES}_K$ in Fig. 1.3.
Then, the function $\mathrm{GCTR}_K$ performs the block cipher counter mode with the
Initial Counter Block (ICB) and its increments ($CB_2$ to $CB_i$) and the plaintext
blocks ($P_1$ to $P_i$) as the inputs.
Figure 1.2: The composite field S-box architecture using polynomial basis [20] and normal
basis [23].
As shown in Fig. 1.3, the Galois Hash ($\mathrm{GHASH}_H$) function within the GCM
provides the authentication for the confidential data. This function is constructed by
$GF(2^{128})$ multiplications with a fixed parameter, called the hash subkey ($H$). The
$\mathrm{GHASH}_H$ function calculates
$$\sum_{j=1}^{n} X_j H^{n-j+1} = X_1 \cdot H^{n} \oplus X_2 \cdot H^{n-1} \oplus \cdots \oplus X_n \cdot H, \qquad (1.1)$$
where $X_1$ to $X_n$ are the $n$ 128-bit blocks of the input [5]. It is noted that the
hash subkey is generated by applying the AES to the zero block, i.e.,
$\mathbf{0} = (0, 0, \ldots, 0) \in GF(2^{128})$. Then, the $\mathrm{GHASH}_H$ function
calculates (1.1) [5]. All the arithmetic operations in (1.1), i.e., additions, GF
multiplications, and exponentiations, are performed over $GF(2^{128})$ constructed by the
irreducible polynomial $P(x) = x^{128} + x^7 + x^2 + x + 1$. As seen in Fig. 1.3, the
total number of input blocks to $\mathrm{GHASH}_H$ is $n = m + i + 1$, where $m$ and $i$
are the number of blocks for the additional authenticated data ($A_1$ to $A_m$) and the
output of $\mathrm{GCTR}_K$, respectively. Eventually, the authentication tag $T$ with
length of $t$ bits is derived. In the authenticated decryption, the same
$\mathrm{GHASH}_H$ procedure is performed on the authenticated data and ciphertext blocks
to verify the tag. For the entire description of the GCM, one can refer to [5] and
Algorithms 1 and 2.
Figure 1.3: The GCM authenticated encryption data flow [5].
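The sum in (1.1) can be evaluated either sequentially, with one $GF(2^{128})$ multiplication per input block, or directly as a sum of products with powers of $H$. The following Python sketch, offered as an illustration rather than as one of the architectures of this thesis, implements both and can be used to confirm that they agree. Blocks are held as Python integers read from 16 bytes in big-endian order, the bit ordering follows the GCM convention in which the leftmost bit is the coefficient of $x^0$, and all function names are assumptions for this example.

```python
# Reduction constant for x^128 + x^7 + x^2 + x + 1 in the GCM bit ordering.
R = 0xE1 << 120

def gf128_mul(x, y):
    """Bit-serial multiplication in GF(2^128) as used by the GCM."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:      # scan the bits of x from left to right
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1   # multiply v by x and reduce
    return z

def gf128_pow(h, e):
    """h raised to the power e by repeated multiplication (e >= 0)."""
    result = 1 << 127                 # multiplicative identity in this ordering
    for _ in range(e):
        result = gf128_mul(result, h)
    return result

def ghash_sequential(h, blocks):
    """Sequential evaluation of (1.1): one multiplication by H per block."""
    y = 0
    for x in blocks:
        y = gf128_mul(y ^ x, h)
    return y

def ghash_power_sum(h, blocks):
    """Direct evaluation of (1.1): XOR of X_j * H^(n - j + 1) for j = 1..n."""
    n = len(blocks)
    acc = 0
    for j, x in enumerate(blocks, start=1):
        acc ^= gf128_mul(x, gf128_pow(h, n - j + 1))
    return acc
```

The sequential form has a data dependency from each block to the next, whereas the terms of the power-sum form are independent of one another once the powers of $H$ are available.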
Algorithm 1 shows the GCM authenticated encryption [5]. In this algorithm, $IV$ is the
Initialization Vector, $P$ is the Plaintext, $A$ is the Additional Authenticated Data, and
$K$ is the Key. It is noted that the authentication in the GCM is performed based on the
hash function $\mathrm{GHASH}_H$. After deriving $J_0$, $\mathrm{GHASH}_H$ is applied to
$(A \| 0^{v} \| C \| 0^{u} \| [\mathrm{len}(A)]_{64} \| [\mathrm{len}(C)]_{64})$ to obtain
block $S$, from which the authentication tag $T$ with length of $t$ is derived.
Algorithm 2 depicts the GCM authenticated decryption [5]. This algorithm uses the
same functions as Algorithm 1, in which another authentication tag ($T'$) is derived.
Algorithm 2 performs the verification of authenticity by checking if the sent
authentication tag $T$ is the same as $T'$.
Algorithm 1 The Authenticated Encryption GCM-AE$_K$($IV$, $P$, $A$)
1: Let $H = \mathrm{CIPH}_K(0^{128})$.
2: Define a block, $J_0$, as follows:
   If $\mathrm{len}(IV) = 96$, then let $J_0 = IV \| 0^{31} \| 1$.
   If $\mathrm{len}(IV) \neq 96$, then let $s = 128 \lceil \mathrm{len}(IV)/128 \rceil - \mathrm{len}(IV)$, and let
   $J_0 = \mathrm{GHASH}_H(IV \| 0^{s+64} \| [\mathrm{len}(IV)]_{64})$.
3: Let $C = \mathrm{GCTR}_K(\mathrm{inc}_{32}(J_0), P)$.
4: Let $u = 128 \lceil \mathrm{len}(C)/128 \rceil - \mathrm{len}(C)$ and $v = 128 \lceil \mathrm{len}(A)/128 \rceil - \mathrm{len}(A)$.
5: Define a block, $S$, as follows:
   $S = \mathrm{GHASH}_H(A \| 0^{v} \| C \| 0^{u} \| [\mathrm{len}(A)]_{64} \| [\mathrm{len}(C)]_{64})$.
6: Let $T = \mathrm{MSB}_t(\mathrm{GCTR}_K(J_0, S))$.
7: Return $(C, T)$.
Algorithm 2 The Authenticated Decryption GCM-AD$_K$($IV$, $C$, $A$, $T$)
1: If the bit lengths of $IV$, $A$, or $C$ are not supported, or if $\mathrm{len}(T) \neq t$, then return
   FAIL.
2: Let $H = \mathrm{CIPH}_K(0^{128})$.
3: Define a block, $J_0$, as follows:
   If $\mathrm{len}(IV) = 96$, then let $J_0 = IV \| 0^{31} \| 1$.
   If $\mathrm{len}(IV) \neq 96$, then let $s = 128 \lceil \mathrm{len}(IV)/128 \rceil - \mathrm{len}(IV)$, and let
   $J_0 = \mathrm{GHASH}_H(IV \| 0^{s+64} \| [\mathrm{len}(IV)]_{64})$.
4: Let $P = \mathrm{GCTR}_K(\mathrm{inc}_{32}(J_0), C)$.
5: Let $u = 128 \lceil \mathrm{len}(C)/128 \rceil - \mathrm{len}(C)$ and $v = 128 \lceil \mathrm{len}(A)/128 \rceil - \mathrm{len}(A)$.
6: Define a block, $S$, as follows:
   $S = \mathrm{GHASH}_H(A \| 0^{v} \| C \| 0^{u} \| [\mathrm{len}(A)]_{64} \| [\mathrm{len}(C)]_{64})$.
7: Let $T' = \mathrm{MSB}_t(\mathrm{GCTR}_K(J_0, S))$.
8: If $T = T'$, return $P$; otherwise, return FAIL.
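The data formatting steps shared by Algorithms 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: inputs are byte-aligned, only the 96-bit $IV$ case of $J_0$ is covered, the block cipher and the hash function themselves are omitted, and the function names are hypothetical.

```python
def inc32(block):
    """Increment the rightmost 32 bits of a 16-byte block modulo 2^32."""
    prefix, counter = block[:12], int.from_bytes(block[12:], "big")
    return prefix + ((counter + 1) % (1 << 32)).to_bytes(4, "big")

def j0_from_96bit_iv(iv):
    """Pre-counter block J0 = IV || 0^31 || 1 for the len(IV) = 96 case."""
    if len(iv) != 12:
        raise ValueError("this sketch only covers 96-bit IVs")
    return iv + b"\x00\x00\x00\x01"

def pad_bits(bit_len):
    """Zero bits needed to reach a multiple of 128: 128*ceil(len/128) - len."""
    return 128 * -(-bit_len // 128) - bit_len

def ghash_input(aad, ciphertext):
    """Assemble A || 0^v || C || 0^u || [len(A)]_64 || [len(C)]_64 as bytes."""
    v = pad_bits(8 * len(aad))
    u = pad_bits(8 * len(ciphertext))
    return (aad + bytes(v // 8)
            + ciphertext + bytes(u // 8)
            + (8 * len(aad)).to_bytes(8, "big")
            + (8 * len(ciphertext)).to_bytes(8, "big"))
```

The length of the assembled input is always a multiple of 16 bytes, so it splits into whole 128-bit blocks, the last of which holds the two 64-bit length fields.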
1.3 Motivation
Using the AES, the sender and the receiver of sensitive data share a secret key to
ensure the confidentiality of the information. Nonetheless, a malicious attacker can
obtain the secret key and compromise the security of the standard. One of the methods for
extracting side-channel information is the fault attack, for which several approaches
have been introduced; see, for instance, [27], [28], [29], [30], [31], [32], and [33]. It
is noted that internal hardware failures may also result in malfunctioning of the AES
encryption/decryption. This has been the motivation for the first contribution of this
thesis, which is to develop high-performance and low-overhead fault detection schemes for
the AES.
Dierent GCM architectures have been presented in the literature, the details of
which are provided throughout this thesis. These methods of realization mostly need
many clock cycles to execute, reducing the performance of the GCM architectures and
resulting in low throughput. This has been a motivation for the second contribution of
this thesis to propose high-performance parallel methods for obtaining the GCM and
developing ecient structures for the AES-GCM. The proposed methods are suitable for
high-performance applications.
1.4 Thesis Outline
The rest of this thesis is organized as follows. In Chapter 2, we review some of the
existing works in the literature. In Chapter 3, different AES S-boxes are evaluated
and benchmarked in terms of area, delay, and simulation-based power consumption. In
Chapter 4, a high-performance fault detection scheme for the composite field S-boxes
and inverse S-boxes of the AES is presented and benchmarked. Chapter 5 presents a
concurrent low-overhead fault detection method for the AES with emphasis on burst
fault detection. In Chapter 6, a structure-independent fault detection approach for the
entire AES encryption and decryption is presented. Chapter 7 covers the proposed
high-performance parallel hardware architecture for the AES-GCM. Finally, in Chapter 8, we
summarize our contributions.
Chapter 2
Literature Review
This chapter presents some previous works on both fault detection and hardware
implementations of the AES-GCM.
2.1 Fault Detection Schemes
Several fault detection schemes have been proposed to date to counteract the fault attacks
and detect the natural faults in cryptographic algorithms and the AES, see, for example,
[34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], and
[51].
For fault detection of the encryption or decryption in the AES, one may use redundant
units [34], [42], where algorithm-level, round-level, and operation-level concurrent error
detection for the AES are used. At the algorithm level, the plaintext is compared with
the output of a decryption performed after the encryption. The round-level error
detection uses a similar idea within the rounds, where the output of a round in encryption is
applied to a round in decryption and is compared with the input. The operation-level (or
transformation-level) error detection uses the inversion of a transformation in each round
and compares the output with the input. Fig. 2.1 shows the operation-level concurrent
error detection for the S-box and the inverse S-box presented in [34]. In this figure, the 8-bit
input X of the S-box (8-bit input Y of the inverse S-box) is compared with the output
of two consecutive transformations, S-box and inverse S-box (inverse S-box and S-box),
using an 8-bit comparator to generate the error indication flag.
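The operation-level check of Fig. 2.1 can be modeled in software as follows. This is a minimal behavioral sketch, not the circuit of [34]: the S-box is generated from its standard definition (inversion in GF(2^8) followed by the affine transformation), and the `fault_mask` argument is a hypothetical way of injecting an error at the S-box output for illustration.

```python
def gf256_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p


def gf256_inv(a: int) -> int:
    """Multiplicative inverse via a^254; 0 is mapped to 0 as in the AES."""
    r = 1
    for _ in range(254):
        r = gf256_mul(r, a)
    return r if a else 0


def rotl8(x: int, s: int) -> int:
    return ((x << s) | (x >> (8 - s))) & 0xFF


def sbox_value(x: int) -> int:
    b = gf256_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


SBOX = [sbox_value(x) for x in range(256)]
INV_SBOX = [0] * 256
for _x, _y in enumerate(SBOX):
    INV_SBOX[_y] = _x


def checked_sbox(x: int, fault_mask: int = 0):
    """Return (S-box output, error flag); the flag compares INV_SBOX(output) with x."""
    y = SBOX[x] ^ fault_mask          # hypothetical fault injected at the S-box output
    return y, INV_SBOX[y] != x        # 8-bit comparator of Fig. 2.1
```

Because the inverse S-box is a bijection, any nonzero `fault_mask` at the S-box output changes the recomputed input and raises the flag in this model; faults inside the redundant inverse S-box itself are not modeled here.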
Figure 2.1: Redundant unit fault detection structure for S-box (inverse S-box) [34], [42].
There exist a number of fault detection schemes based on error detecting codes,
see, for example, [35], [36], [37], [38], [39], [40], and [41]. Using one parity bit for each
byte of a transformation, one can obtain the structure shown in Fig. 2.2 for round
$i$, $1 \le i \le 9$, of the encryption of the AES-128 (128-bit key) to achieve a parity-based
fault detection scheme. A similar structure can be obtained for the AES-128 decryption.
The AES-128 encryption/decryption has 10 consecutive rounds which are similar except
for the last one, in which one of the transformations is not used. As seen in Fig. 2.2,
the output parity bits of each transformation in every round of the AES encryption are
predicted from the inputs using the prediction boxes denoted by $\hat{P}$. Then, the
comparisons between the predicted parities (shown by a matrix with 16-bit entries) and
the actual parities (obtained using the actual parity block) in Fig. 2.2 can be scheduled
so that the desired fault detection capability is obtained.
Parity predictions of ShiftRows, InvShiftRows, and AddRoundKey are straightforward,
and those of MixColumns and InvMixColumns can be done using the equations
given in [35], [36], [40], and [41]. It is noted that the parity predictions of the S-box and
the inverse S-box proposed in [36] are based on LUT implementations in which $512 \times 9$
memory cells are used to generate the predicted parity bit as well as the 8-bit output.
In Fig. 2.2, let $k_0$ be the 128-bit input key to the key expander. Then, all the modified
keys, i.e., $k'_i$, $0 \le i \le 10$, consist of the 128-bit expanded key $k_i$ and 16-bit parities, if
one parity bit is used for each byte.
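To illustrate why the parity predictions of ShiftRows and AddRoundKey are called straightforward, the following sketch predicts the 16 output parity bits of these two transformations from the input parities alone. It is an assumption-laden software model (column-major state of 16 bytes, one even-parity bit per byte, hypothetical function names) and does not cover the S-box or MixColumns predictions of [35] and [36].

```python
def parity(byte: int) -> int:
    """Parity bit (XOR of all bits) of one byte."""
    return bin(byte).count("1") & 1


def add_round_key(state, round_key):
    return [s ^ k for s, k in zip(state, round_key)]


def predict_add_round_key(state_parity, key_parity):
    # The parity of a XOR is the XOR of the parities.
    return [p ^ q for p, q in zip(state_parity, key_parity)]


def shift_rows(state):
    # Column-major state: byte i sits in row i % 4; row r is rotated left by r columns.
    return [state[(i + 4 * (i % 4)) % 16] for i in range(16)]


def predict_shift_rows(state_parity):
    # ShiftRows only permutes bytes, so the parity bits follow the same permutation.
    return shift_rows(state_parity)


def error_flag(predicted_parity, output_state) -> bool:
    """Compare the predicted parities with the actual parities of the output bytes."""
    return any(p != parity(b) for p, b in zip(predicted_parity, output_state))
```

As with any single-parity code, an odd number of flipped bits within a byte is flagged by this model, whereas an even number of flipped bits in the same byte goes undetected.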
Figure 2.2: Parity-based fault detection structure of the ith round in the AES-128
encryption.
The parity-based scheme proposed in [35] is one of the first fault detection schemes and
has received attention in the literature. Although the approach in [35] is a good scheme
in terms of the fault detection capability, it has two drawbacks. First, this approach
is based on using the expanded S-boxes and inverse S-boxes for parity predictions, i.e.,
two blocks of $256 \times 9$ memory cells. Not only does this restrict the AES encryption
and decryption implementations to LUT-based S-boxes and inverse S-boxes, but it also has
an area overhead of greater than 100% for either the S-box or the inverse S-box. The
second drawback of the approach in [35] is the relatively high area complexity of the
parity predictions of MixColumns in the AES encryption. For the AES decryption, the
area complexity of the predicted parities of InvMixColumns is even higher [36].
In [37] and [39], instead of using one parity bit for each byte (or two signatures for each
byte, as in the scheme presented in [38]), one parity bit is used for the 128-bit data using the
LUT S-boxes. The multiplication-based fault detection scheme [38] for the multiplicative
inversion of the S-box is shown in Fig. 2.3. In this scheme, the 8-bit input of the
multiplicative inversion is multiplied by the 8-bit output and the $n$-bit result, $1 \le n \le 8$,
of the multiplication is compared with the $n$-bit actual result, i.e., $1 \in GF(2^8)$ if $X \neq 0$
and $0 \in GF(2^8)$ if $X = 0$. Because the multiplicative inversion is also used in the inverse
S-box, the same scheme can be used for the inverse S-box.
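The multiplication-based check can be sketched as follows. This is a hedged software illustration rather than the circuit of [38]: it assumes that the $n$ compared bits are the $n$ least significant bits of the product, and `check_inversion` and its `fault_mask` argument are hypothetical names introduced only for this example.

```python
def gf256_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p


def gf256_inv(a: int) -> int:
    """Multiplicative inverse via a^254; 0 is mapped to 0."""
    r = 1
    for _ in range(254):
        r = gf256_mul(r, a)
    return r if a else 0


def check_inversion(x: int, n: int = 8, fault_mask: int = 0) -> bool:
    """Return True if the n-bit comparison flags an error for input x."""
    x_inv = gf256_inv(x) ^ fault_mask       # hypothetical fault at the inverter output
    expected = 1 if x else 0                # X * X^-1 = 1 for X != 0, and 0 for X = 0
    mask = (1 << n) - 1                     # assumption: the n low-order bits are compared
    return (gf256_mul(x, x_inv) & mask) != (expected & mask)
```

With $n = 8$ and $X \neq 0$, every nonzero output error changes the product and is flagged in this model; for smaller $n$, some errors fall entirely in the discarded bits and escape, which is the trade-off between overhead and coverage that the parameter $n$ controls.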
Figure 2.3: The multiplication-based scheme for the fault detection of the multiplicative
inversion [38].
The schemes presented in [34] and [42] use the redundant unit fault detection
approach. It is noted that this results in area, power, and delay overheads of
approximately 100%. In addition, the scheme in [43] proposes using the transformations in an
AES round twice for the same data to detect transient errors. In [44], a concurrent
fault detection scheme based on the merged S-box and inverse S-box is proposed. It is
also noted that the scheme presented in [49] uses double-data-rate computation for
counteracting the fault attacks. Additionally, a fault detection scheme based on the Hamming
and Reed-Solomon codes for protecting the storage elements within the AES is proposed
in [50]. Furthermore, for the logic elements, the scheme in [36] and the use of the partial
duplication of the most vulnerable elements are proposed in [50]. Moreover, the approach
in [51] is based on implementing functional redundancy in the AES.
There exist a number of fault detection approaches which are specific to composite
field S-boxes and inverse S-boxes, see, for example, [52], [53], [54], [55], and [56]. In
the scheme of [52], the fault detection of the multiplicative inversion of the S-box is
considered for two specific composite fields. The transformation and affine matrices are
excluded in this approach. Moreover, in [53], predicted parities have been used for the
multiplicative inversion of a specific S-box using composite field and polynomial basis.
Furthermore, the transformation matrices are also considered. In [54], [55], and [56], the
composite field S-boxes and inverse S-boxes (using polynomial basis) have been divided
into sub-blocks and parity predictions are used for their fault detection. Moreover, FPGA
implementations have been performed in [56] to benchmark the presented method. It is
noted that the approaches in [55] and [56] (for single fault detection) have been extended
in the work in [57]. This work presents new structures for the S-box and the inverse S-box
with higher complexities compared to the original structures for detecting 100% of single
faults. We note that unlike the schemes presented in [55] and [56], this work focuses on
the stuck-at faults injected not only at the outputs but at any net in the circuit. The
results in [57] have been benchmarked using an ASIC platform.
Figure 2.4: Low-power S-box (resp. inverse S-box) architecture using composite fields
and polynomial basis [13].
2.2 AES-GCM Architectures
As mentioned in the previous chapter, the S-boxes and the inverse S-boxes are the only
nonlinear transformations in the AES, whose hardware implementations affect that of
the AES significantly. A low-power implementation of the S-box (resp. inverse S-box)
has been presented in [13] which uses the composite field in [20]. For reaching a
low-power architecture with acceptable hardware complexity, it is suggested in [13] that the
structures are partitioned into three blocks (see Fig. 2.4). Then, the logic gates within
each of these blocks are implemented using two-level logic consisting of arrays of
ANDs and XORs. Although this method increases the area of the composite field
implementation, it reduces the power consumption significantly [13].
The AND-XOR structure of each block shown in Fig. 2.4 results in a low number
of transitions and thus low power consumption. This is because the AND array has 50%
propagation probability of signal transitions. In [13], similar to many other publications
such as [17], [18], [20], and [22], the irreducible polynomials $u^2 + u + \nu$ and $v^2 + v + \Phi$,
where $\nu = \{1100\}_2$ and $\Phi = \{10\}_2$, are used for the composite fields. As seen in
Fig. 2.4, for block 1, a field element $X$ for the S-box ($Y$ for the inverse S-box) in the
binary field $GF(2^8)$ is converted to the corresponding representation in the composite
field $GF(2^8)/GF(((2^2)^2)^2)$. The output of block 1 is then obtained as $\gamma \in GF(2^4)$
($\gamma' \in GF(2^4)$ for the inverse S-box). As seen in Fig. 2.4, $\theta \in GF(2^4)$ ($\theta' \in GF(2^4)$ for the
inverse S-box) is then derived as the output of block 2. Eventually, using the irreducible
polynomials $u^2 + u + \nu$ and $v^2 + v + \Phi$, the output of the S-box, i.e., $Y$ ($X$ for the inverse
S-box), is obtained after conversion from the composite field $GF(2^8)/GF(((2^2)^2)^2)$ to the
binary field $GF(2^8)$.
Figure 2.5: The sequential method used in [64], [65], and [66] for the hardware
implementation of the $\mathrm{GHASH}_H$.
In some previous works such as [13], [20], [22], [27], and [58], one specific S-box and
in [59], three reported S-boxes have been synthesized on ASIC. However, an exhaustive
search has not been performed for all suitable composite fields to evaluate their
performance metrics using the same technology. It is also noted that in some other works,
see, for instance, [23], [24], [30], [60], [61], and [62], the hardware and timing
complexities of different composite field S-boxes have been evaluated in terms of logic gates (in
[63], software implementations have been performed). However, benchmarking the
performance (including power consumption through simulation-based approaches) of the
S-box implementations on hardware platforms has not been performed in these works.
Different GCM architectures have been presented in the literature. In [64], [65], and
[66], the sequential method for the hardware implementation of the GCM function is
adopted. The sequential method is shown in Fig. 2.5, where one GF multiplier, a set
of 128 XOR gates, and a 128-bit register (R) are utilized to perform the operation. Let
the register R in Fig. 2.5 be cleared initially. Let $n$ be the number of input blocks to
the $\mathrm{GHASH}_H$ function, i.e., $X_i$, $1 \le i \le n$. Then, after $n$ clock cycles, register R in Fig.
2.5 contains the result. Although this method of realization is area-efficient, it needs
many clock cycles (equal to the number of input blocks), reducing the performance of
the architecture.
Because of the low throughput of the sequential method, a parallel method is proposed
in [67] which uses two $GF(2^{128})$ multipliers to perform this operation in parallel. This
parallel implementation has been generalized in [68] and [69] so that $q$, $q \ge 2$, parallel
$GF(2^{128})$ multiplications are performed concurrently. In the most efficient method in [68]
and [69], for the case of $q = 4$ and $n = 8$, the operation in the GCM is realized according
to the following calculation steps:
\[
\begin{aligned}
&\overbrace{\underbrace{(\overbrace{X_1 H^4}^{j=1} \oplus X_5) H^4}_{j=2} \times 1 \times 1}^{j=4}
\;\oplus\;
\overbrace{\underbrace{(\overbrace{X_2 H^4}^{j=1} \oplus X_6) H}_{j=2} \times H \times H}^{j=4} \\
&\oplus\;
\overbrace{\underbrace{(\overbrace{X_3 H^4}^{j=1} \oplus X_7) H}_{j=2} \times H \times 1}^{j=4}
\;\oplus\;
\overbrace{\underbrace{(\overbrace{X_4 H^4}^{j=1} \oplus X_8) H}_{j=2} \times 1 \times 1}^{j=4},
\end{aligned}
\tag{2.1}
\]
where all operations are performed over $GF(2^{128})$ constructed by the irreducible
polynomial $P(x) = x^{128} + x^7 + x^2 + x + 1$ and $\oplus$ comprises 128 XOR gates. Consecutive GF
multiplications with $H$ are performed for deriving the powers of the hash subkey used.
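The equivalence between the sequential method of Fig. 2.5 and the four-lane computation in (2.1) can be checked with a short software model. The sketch below is only illustrative: it assumes GCM's bit ordering for the field multiplication, multiplies each lane directly by the required power of $H$ (rather than through the step-by-step multiplications by $H$ or 1), and uses hypothetical function names.

```python
R = 0xE1 << 120  # x^128 + x^7 + x^2 + x + 1 in GCM's reflected bit order


def gf128_mul(x: int, y: int) -> int:
    """Multiply two 128-bit blocks in GF(2^128) using GCM's bit ordering."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash_sequential(h: int, blocks) -> int:
    """One multiplier and one register: R <- (R xor X_i) * H, repeated n times."""
    r = 0
    for x in blocks:
        r = gf128_mul(r ^ x, h)
    return r


def ghash_parallel4(h: int, blocks) -> int:
    """Four concurrent lanes for n = 8 blocks, combined as in (2.1)."""
    if len(blocks) != 8:
        raise ValueError("this sketch covers the q = 4, n = 8 case only")
    h2 = gf128_mul(h, h)
    h3 = gf128_mul(h2, h)
    h4 = gf128_mul(h2, h2)
    result = 0
    # Lane k handles X_{k+1} and X_{k+5}; its final factor is H^4, H^3, H^2, or H.
    for k, final_power in enumerate((h4, h3, h2, h)):
        lane = gf128_mul(blocks[k], h4) ^ blocks[k + 4]
        result ^= gf128_mul(lane, final_power)
    return result
```

Expanding the sequential recurrence gives $X_1H^8 \oplus X_2H^7 \oplus \cdots \oplus X_8H$, and each lane of the parallel form contributes exactly two of these eight terms, which is why the two functions agree.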
Recently, a high-performance approach for computing the $\mathrm{GHASH}_H$ function for long
messages has been proposed in [70]. However, this scheme increases the hardware
complexity. Therefore, a high-performance parallel method for obtaining the GCM
by relying on low-complexity powers of the hash subkey is needed, so that compact
realizations of these exponents are obtained and implemented without pre-computing
the hash subkey exponents. This results in high-throughput and low-latency GCM
hardware architectures, suitable for high-performance applications.
Chapter 3
Performance Evaluations and
Comparisons of the AES S-boxes
In this chapter, different ASIC architectures of the building blocks of the AES S-boxes,
the only nonlinear AES transformation, are evaluated and optimized to identify
high-performance and low-power architectures. We evaluate the performance of more than 40
S-boxes utilizing a fixed benchmark platform in 65-nm CMOS technology. To obtain the
least-complexity S-box, the formulations for the Galois Field (GF) sub-field inversions in
$GF(2^4)$ are optimized. By conducting exhaustive simulations for the input transitions,
we analyze the average and peak power consumptions of the AES S-boxes considering
the switching activities, gate-level netlists, and parasitic information.
In this chapter, we logic-gate optimize and perform comprehensive ASIC syntheses
of more than 40 different S-boxes for deriving their performance metrics. This
benchmarking, which is done on the same platform, gives a clear picture of the
performance metrics of different designs. We synthesize the structures of different AES
S-boxes using the Synopsys® Design Vision® (which is the graphical user interface to
Synopsys® Design Compiler®) [73] in STM 65-nm CMOS standard technology [74].
Then, the areas and delays of these hardware architectures are derived and compared.
To achieve the least dynamic power-consuming AES S-box, we obtain the average and
peak power consumptions of the S-boxes through exhaustive searches considering the
possible input transitions. These derivations are based on a timing simulation-based
analysis using the switching activities of internal nodes with Synopsys® PrimeTime®
PX [73] and ModelSim® [75].
The implementation complexities of the S-boxes using composite fields are dependent
on the choice of the coefficients $\nu \in GF(2^4)$ and $\Phi \in GF(2^2)$ in the irreducible
polynomials $u^2 + u + \nu$ and $v^2 + v + \Phi$ used for the composite fields, respectively. The composite fields
$GF(((2^2)^2)^2)$ in polynomial basis use iterations to construct the S-box. For these
composite fields, the constants $\nu \in GF(2^4)$ and $\Phi \in GF(2^2)$ are over $GF((2^2)^2)/v^2 + v + \Phi$ and
$GF(2^2)/x^2 + x + 1$, respectively. According to [24], after an exhaustive search for finding the
possible choices for $\nu \in GF(2^4)$ and $\Phi \in GF(2^2)$, the following 16 combinations are
obtained: $\Phi \in \{\{10\}_2, \{11\}_2\}$ and $\nu \in \{\{1000\}_2, \{1001\}_2, \{1010\}_2, \{1011\}_2, \{1100\}_2, \{1101\}_2,
\{1110\}_2, \{1111\}_2\}$. Similarly, for normal basis, it can be derived that the only two
acceptable values for $\Phi$ are $\Phi = \{10\}_2$ and $\Phi = \{01\}_2$. The following 8 values of $\nu$ are
acceptable: $\nu \in \{\{0100\}_2, \{0001\}_2, \{1000\}_2, \{0010\}_2, \{0111\}_2, \{1101\}_2, \{1011\}_2, \{1110\}_2\}$.
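The 16 polynomial-basis combinations can be reproduced by a small exhaustive search: a monic quadratic of the form $t^2 + t + c$ is irreducible exactly when it has no root in the field. The sketch below is an illustrative model under stated assumptions, namely that a 2-bit value encodes an element of $GF(2^2)$ modulo $x^2 + x + 1$, that a 4-bit value holds the coefficient of $v$ in its two upper bits, and that the helper names are hypothetical; the normal-basis case is not covered.

```python
def gf4_mul(a: int, b: int) -> int:
    """Multiply in GF(2^2) = GF(2)[x]/(x^2 + x + 1), elements encoded as 2-bit integers."""
    p = 0
    for i in range(2):
        if (b >> i) & 1:
            p ^= a << i
    if p & 0b100:
        p ^= 0b111          # reduce x^2 to x + 1
    return p


def gf16_mul(a: int, b: int, phi: int) -> int:
    """Multiply in GF((2^2)^2) = GF(2^2)[v]/(v^2 + v + phi), elements as 4-bit integers."""
    ah, al, bh, bl = a >> 2, a & 3, b >> 2, b & 3
    hh = gf4_mul(ah, bh)                          # coefficient of v^2, with v^2 = v + phi
    high = hh ^ gf4_mul(ah, bl) ^ gf4_mul(al, bh)
    low = gf4_mul(hh, phi) ^ gf4_mul(al, bl)
    return (high << 2) | low


def irreducible_phis():
    """All phi for which v^2 + v + phi has no root in GF(2^2)."""
    return [phi for phi in range(4)
            if all(gf4_mul(t, t) ^ t ^ phi for t in range(4))]


def irreducible_nus(phi: int):
    """All nu for which u^2 + u + nu has no root in GF((2^2)^2) built with phi."""
    return [nu for nu in range(16)
            if all(gf16_mul(t, t, phi) ^ t ^ nu for t in range(16))]


combinations = [(phi, nu) for phi in irreducible_phis() for nu in irreducible_nus(phi)]
```

Under these encoding assumptions, `combinations` contains the two values of $\Phi$ paired with the eight 4-bit values of $\nu$ whose most significant bit is 1, in agreement with the list quoted from [24].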
Based on the possible values of $\nu$ and $\Phi$ in polynomial basis representation, the
(inverse) transformation matrices can be constructed using the algorithm presented in [22].
In this algorithm, using an exhaustive search, the transformation matrix is constructed
using eight base elements in $GF(((2^2)^2)^2)$, i.e., $1, \beta, \beta^2, \ldots, \beta^7$, to which eight base
elements of $GF(2^8)$ are mapped. We note that for each combination of $\nu$ and $\Phi$, there
exist eight possible (inverse) transformation matrices. These are constructed according
to the base element $\beta$ and the conjugates of this base element, i.e., $\beta^{2^i}$, $i = 1, 2, \ldots, 7$.
In what follows, for each combination of $\nu$ and $\Phi$, one of these possible matrices is
considered. As suggested in [22], we have also used subexpression sharing for obtaining
low-complexity implementations for these matrices. We note that different (inverse)
transformation matrices in normal basis are derived simply by reordering the columns.
The organization of this chapter is as follows. In Section 3.1, logic-gate optimizations
for the inversions in $GF(2^4)$ within the S-boxes are presented. In Section 3.2, we present
the results of our syntheses for different S-boxes. Power consumption derivations and
comparisons of the S-boxes through a simulation-based method are presented in Section
3.3. The results presented in this chapter can also be found in [71] and [72].
3.1 Logic-gate Optimizations
In this section, we first present the architecture of the low-complexity S-box using normal
basis. The previously presented low-complexity S-box using normal basis [23] is improved
and the hardware complexity of the inversion in $GF(2^4)$ is reduced.
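Let $\gamma = (\gamma_3, \gamma_2, \gamma_1, \gamma_0)$ be the input and $\theta = (\theta_3, \theta_2, \theta_1, \theta_0)$ be the output of an inverter
in $GF(2^4)$ using normal basis. Then, the formulations for the inversion in $GF(2^4)$ using
the low-complexity normal basis ($\Phi = \{10\}_2$) presented in [23] are obtained as follows
\[
\begin{aligned}
\theta_3 &= \gamma_2\gamma_1\gamma_0 + \gamma_3\gamma_1 + \gamma_2\gamma_1 + \gamma_1 + \gamma_0, \\
\theta_2 &= \gamma_3\gamma_1\gamma_0 + \gamma_3\gamma_1 + \gamma_2\gamma_1 + \gamma_2\gamma_0 + \gamma_0, \\
\theta_1 &= \gamma_3\gamma_2\gamma_0 + \gamma_3\gamma_1 + \gamma_3\gamma_0 + \gamma_3 + \gamma_2, \\
\theta_0 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_1 + \gamma_3\gamma_0 + \gamma_2\gamma_0 + \gamma_2,
\end{aligned}
\tag{3.1}
\]
where "+" represents the modulo-2 addition, which uses an XOR gate in hardware.
Considering the formulations above, we present the following lemma for reaching a
low-complexity architecture of an inverter in $GF(2^4)$.
Lemma 3.1 The low-complexity formulations for the inversion in $GF(2^4)$ using normal
basis can be written as follows.
\[
\begin{aligned}
\theta_3 &= (\gamma_2\gamma_1 \vee \gamma_0) + \bar{\gamma}_3\gamma_1, \\
\theta_2 &= \gamma_0(\gamma_1 \vee \bar{\gamma}_2) \vee \gamma_1(\gamma_2 + \gamma_3), \\
\theta_1 &= (\gamma_3\gamma_0 \vee \gamma_2) + \gamma_3\bar{\gamma}_1, \\
\theta_0 &= \gamma_2(\gamma_3 \vee \bar{\gamma}_0) \vee \gamma_3(\gamma_1 + \gamma_0),
\end{aligned}
\tag{3.2}
\]
where "+" and "$\vee$" represent the XOR and OR operations, respectively.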
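Proof For having low-complexity structures for $\theta_3$ and $\theta_1$, we use the fact that for two
Boolean variables $x$ and $y$, we have
\[
x + y + xy = x \vee y. \tag{3.3}
\]
Then, using $\theta_3$ in (3.1) and considering $x = \gamma_2\gamma_1$ and $y = \gamma_0$ in (3.3), one can find
\[
\begin{aligned}
\theta_3 &= \gamma_2\gamma_1\gamma_0 + \gamma_3\gamma_1 + \gamma_2\gamma_1 + \gamma_1 + \gamma_0 \\
&= (\gamma_2\gamma_1 \vee \gamma_0) + \gamma_1(\gamma_3 + 1) \\
&= (\gamma_2\gamma_1 \vee \gamma_0) + \bar{\gamma}_3\gamma_1.
\end{aligned}
\tag{3.4}
\]
Similarly, one can consider $x = \gamma_3\gamma_0$ and $y = \gamma_2$ in (3.3) for $\theta_1$ in (3.1) to obtain
\[
\begin{aligned}
\theta_1 &= \gamma_3\gamma_2\gamma_0 + \gamma_3\gamma_1 + \gamma_3\gamma_0 + \gamma_3 + \gamma_2 \\
&= (\gamma_3\gamma_0 \vee \gamma_2) + \gamma_3(\gamma_1 + 1) \\
&= (\gamma_3\gamma_0 \vee \gamma_2) + \gamma_3\bar{\gamma}_1.
\end{aligned}
\tag{3.5}
\]
We now prove the formulations for $\theta_2$ and $\theta_0$. According to (3.1) and noting that $\gamma_i + 1 =
\bar{\gamma}_i$, we obtain
\[
\begin{aligned}
\theta_2 &= \gamma_3\gamma_1\gamma_0 + \gamma_3\gamma_1 + \gamma_2\gamma_1 + \gamma_2\gamma_0 + \gamma_0 \\
&= \gamma_1(\gamma_2 + \gamma_3(\gamma_0 + 1)) + \gamma_0(\gamma_2 + 1) \\
&= \gamma_1(\gamma_2 + \gamma_3\bar{\gamma}_0) + \gamma_0\bar{\gamma}_2.
\end{aligned}
\tag{3.6}
\]
By the definition of the XOR we have $\gamma_2 + \gamma_3\bar{\gamma}_0 = \bar{\gamma}_2\gamma_3\bar{\gamma}_0 \vee \gamma_2(\bar{\gamma}_3 \vee \gamma_0)$. Then, (3.6) can
be written as
\[
\begin{aligned}
\theta_2 &= \gamma_1(\bar{\gamma}_2\gamma_3\bar{\gamma}_0 \vee \gamma_2(\bar{\gamma}_3 \vee \gamma_0)) + \gamma_0\bar{\gamma}_2 \\
&= (\gamma_1\gamma_2\bar{\gamma}_3 \vee \gamma_2\gamma_1\gamma_0 \vee \gamma_3\bar{\gamma}_2\gamma_1\bar{\gamma}_0) + \gamma_0\bar{\gamma}_2.
\end{aligned}
\tag{3.7}
\]
It is noted that for having a low-complexity structure for $\theta_2$, we use the fact that for two
Boolean variables $x$ and $y$, one can prove that
\[
x + \bar{x}y = x \vee y. \tag{3.8}
\]
Then, by distributing the XOR in (3.7) and using (3.8), the following terms are obtained
\[
\gamma_3\bar{\gamma}_2\gamma_1\bar{\gamma}_0 + \gamma_0\bar{\gamma}_2 = \bar{\gamma}_2(\gamma_3\gamma_1\bar{\gamma}_0 + \gamma_0) = \bar{\gamma}_2(\gamma_3\gamma_1 \vee \gamma_0), \tag{3.9}
\]
\[
\gamma_2\gamma_1\gamma_0 + \gamma_0\bar{\gamma}_2 = \gamma_0(\gamma_2\gamma_1 + \bar{\gamma}_2) = \gamma_0(\bar{\gamma}_2 \vee \gamma_1), \tag{3.10}
\]
\[
\bar{\gamma}_3\gamma_2\gamma_1 + \gamma_0\bar{\gamma}_2 = \bar{\gamma}_3\gamma_2\gamma_1 \vee \gamma_0\bar{\gamma}_2. \tag{3.11}
\]
Then, according to (3.7), by ORing (3.9)-(3.11) and noting that $\bar{\gamma}_3\gamma_2\gamma_1 \vee \gamma_3\bar{\gamma}_2\gamma_1 = \gamma_1(\gamma_2 +
\gamma_3)$, it is straightforward to obtain $\theta_2$ in (3.2).
We obtain the following for $\theta_0$
\[
\begin{aligned}
\theta_0 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_1 + \gamma_3\gamma_0 + \gamma_2\gamma_0 + \gamma_2 \\
&= \gamma_3(\gamma_0 + \gamma_1(\gamma_2 + 1)) + \gamma_2(\gamma_0 + 1) \\
&= \gamma_3(\gamma_0 + \gamma_1\bar{\gamma}_2) + \gamma_2\bar{\gamma}_0.
\end{aligned}
\tag{3.12}
\]
By the definition of the XOR for $\gamma_0 + \gamma_1\bar{\gamma}_2$, (3.12) can be written as
\[
\begin{aligned}
\theta_0 &= \gamma_3(\bar{\gamma}_2\gamma_1\bar{\gamma}_0 \vee \gamma_0(\bar{\gamma}_1 \vee \gamma_2)) + \gamma_2\bar{\gamma}_0 \\
&= (\gamma_3\gamma_0\bar{\gamma}_1 \vee \gamma_3\gamma_2\gamma_0 \vee \gamma_3\bar{\gamma}_2\gamma_1\bar{\gamma}_0) + \gamma_2\bar{\gamma}_0.
\end{aligned}
\tag{3.13}
\]
Then, by distributing the XOR in (3.13) and using (3.8), the following terms are obtained
\[
\gamma_3\bar{\gamma}_2\gamma_1\bar{\gamma}_0 + \gamma_2\bar{\gamma}_0 = \bar{\gamma}_0(\gamma_3\gamma_1\bar{\gamma}_2 + \gamma_2) = \bar{\gamma}_0(\gamma_3\gamma_1 \vee \gamma_2), \tag{3.14}
\]
\[
\gamma_3\gamma_2\gamma_0 + \gamma_2\bar{\gamma}_0 = \gamma_2(\gamma_3\gamma_0 + \bar{\gamma}_0) = \gamma_2(\bar{\gamma}_0 \vee \gamma_3), \tag{3.15}
\]
\[
\bar{\gamma}_1\gamma_3\gamma_0 + \gamma_2\bar{\gamma}_0 = \bar{\gamma}_1\gamma_3\gamma_0 \vee \gamma_2\bar{\gamma}_0. \tag{3.16}
\]
Then, according to (3.13), by ORing (3.14)-(3.16) and noting that $\gamma_0\gamma_3\bar{\gamma}_1 \vee \gamma_3\gamma_1\bar{\gamma}_0 =
\gamma_3(\gamma_1 + \gamma_0)$, one can obtain $\theta_0$ in (3.2).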
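It is noted that for reaching a low-complexity architecture, the formulations in (3.2) can
be implemented using only NOR, NAND, and XOR gates as follows
\[
\begin{aligned}
\theta_3 &= \mathrm{XOR}(\mathrm{NOR}(\mathrm{NOR}(\bar{\gamma}_2, \bar{\gamma}_1), \gamma_0), \mathrm{NAND}(\bar{\gamma}_3, \gamma_1)), \\
\theta_2 &= \mathrm{NAND}(\mathrm{NAND}(\gamma_0, \mathrm{NAND}(\gamma_2, \bar{\gamma}_1)), \mathrm{NAND}(\gamma_1, \mathrm{XOR}(\gamma_2, \gamma_3))), \\
\theta_1 &= \mathrm{XOR}(\mathrm{NOR}(\mathrm{NOR}(\bar{\gamma}_3, \bar{\gamma}_0), \gamma_2), \mathrm{NAND}(\bar{\gamma}_1, \gamma_3)), \\
\theta_0 &= \mathrm{NAND}(\mathrm{NAND}(\gamma_2, \mathrm{NAND}(\gamma_0, \bar{\gamma}_3)), \mathrm{NAND}(\gamma_3, \mathrm{XOR}(\gamma_1, \gamma_0))).
\end{aligned}
\tag{3.17}
\]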
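The equivalence of the XOR-sum form, the low-complexity form of Lemma 3.1, and its NAND/NOR/XOR realization can be confirmed by checking all 16 inputs. The sketch below is a hedged illustration: the lost complement bars are placed as implied by the steps of the proof (for example, $\gamma_1(\gamma_3 + 1) = \bar{\gamma}_3\gamma_1$), and the function names are hypothetical.

```python
from itertools import product


def n(b: int) -> int:
    """Complement of a single bit."""
    return b ^ 1


def inv_xor_form(g3, g2, g1, g0):
    """XOR-sum formulations of (3.1)."""
    t3 = (g2 & g1 & g0) ^ (g3 & g1) ^ (g2 & g1) ^ g1 ^ g0
    t2 = (g3 & g1 & g0) ^ (g3 & g1) ^ (g2 & g1) ^ (g2 & g0) ^ g0
    t1 = (g3 & g2 & g0) ^ (g3 & g1) ^ (g3 & g0) ^ g3 ^ g2
    t0 = (g3 & g2 & g1) ^ (g3 & g1) ^ (g3 & g0) ^ (g2 & g0) ^ g2
    return t3, t2, t1, t0


def inv_low_complexity(g3, g2, g1, g0):
    """Low-complexity formulations of (3.2)."""
    t3 = ((g2 & g1) | g0) ^ (n(g3) & g1)
    t2 = (g0 & (g1 | n(g2))) | (g1 & (g2 ^ g3))
    t1 = ((g3 & g0) | g2) ^ (g3 & n(g1))
    t0 = (g2 & (g3 | n(g0))) | (g3 & (g1 ^ g0))
    return t3, t2, t1, t0


def inv_gate_level(g3, g2, g1, g0):
    """NOR/NAND/XOR realization of (3.17)."""
    nand = lambda a, b: (a & b) ^ 1
    nor = lambda a, b: (a | b) ^ 1
    xor = lambda a, b: a ^ b
    t3 = xor(nor(nor(n(g2), n(g1)), g0), nand(n(g3), g1))
    t2 = nand(nand(g0, nand(g2, n(g1))), nand(g1, xor(g2, g3)))
    t1 = xor(nor(nor(n(g3), n(g0)), g2), nand(n(g1), g3))
    t0 = nand(nand(g2, nand(g0, n(g3))), nand(g3, xor(g1, g0)))
    return t3, t2, t1, t0


all_agree = all(
    inv_xor_form(*g) == inv_low_complexity(*g) == inv_gate_level(*g)
    for g in product((0, 1), repeat=4)
)
```

The check only establishes that the three forms compute the same Boolean functions; it does not by itself construct the normal basis or verify that (3.1) is the field inversion.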
In what is presented above, the field inversion in $GF(2^4)$ of the most compact composite
field in [23] has been modified to decrease its hardware complexity. This field uses normal
basis with $\Phi = \{10\}_2$ and $\nu = \{0001\}_2$. Now, we consider polynomial basis to further
optimize the S-boxes using polynomial basis. We present the following lemma through
which the hardware complexity of the composite field inversion in $GF(2^4)$ is decreased.
This is performed by presenting low-complexity formulations for the inversion in $GF(2^4)$
through logic-gate minimization. Moreover, these formulations are implemented using
NAND, NOR, and XOR gates for reducing the complexity.
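Lemma 3.2 Let $\gamma = (\gamma_3, \gamma_2, \gamma_1, \gamma_0)$ be the input and $\theta = (\theta_3, \theta_2, \theta_1, \theta_0)$ be the output
of an inverter in $GF(2^4)$. Then, the formulations for the low-complexity inversion in
$GF(2^4)$ using polynomial basis with $\Phi = \{11\}_2$ are as follows:
\[
\begin{aligned}
\theta_3 &= \gamma_2\overline{\gamma_3\gamma_1} + \gamma_3\gamma_0, \\
\theta_2 &= \gamma_3\bar{\gamma}_0 \vee \gamma_2(\gamma_3 \vee \gamma_1), \\
\theta_1 &= \gamma_2\bar{\gamma}_0 \vee \gamma_3\bar{\gamma}_1\gamma_0 \vee \bar{\gamma}_3\gamma_1\bar{\gamma}_2, \\
\theta_0 &= (\gamma_3 \vee \gamma_1\bar{\gamma}_0 \vee \bar{\gamma}_2\gamma_0\bar{\gamma}_1) + \gamma_1(\gamma_2 \vee \gamma_3\bar{\gamma}_0).
\end{aligned}
\tag{3.18}
\]
Moreover, for $\Phi = \{10\}_2$, one reaches the following:
\[
\begin{aligned}
\theta_3 &= \gamma_2\overline{\gamma_3\gamma_1} + \gamma_3\bar{\gamma}_0, \\
\theta_2 &= \gamma_2\bar{\gamma}_1 \vee \gamma_3(\gamma_2 \vee \gamma_0), \\
\theta_1 &= (\gamma_3\gamma_1(\gamma_2 \vee \gamma_0) \vee \gamma_2\gamma_0) + \gamma_3 + \gamma_2 + \gamma_1, \\
\theta_0 &= (\bar{\gamma}_0 \vee \bar{\gamma}_2\gamma_3 \vee \gamma_1\bar{\gamma}_3\gamma_2) + \bar{\gamma}_2(\bar{\gamma}_1 \vee \bar{\gamma}_0\gamma_3),
\end{aligned}
\tag{3.19}
\]
where "+" and "$\vee$" represent the XOR and OR operations, respectively.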
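Proof For $\gamma = (\gamma_3, \gamma_2, \gamma_1, \gamma_0)$ as the input and $\theta = (\theta_3, \theta_2, \theta_1, \theta_0)$ as the output of an
inverter in $GF(2^4)$, the formulations for the inversion in $GF(2^4)$ using the polynomial
basis with $\Phi = \{11\}_2$ and $\Phi = \{10\}_2$ are obtained as follows, respectively, [24], [22]:
\[
\begin{aligned}
\theta_3 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_0 + \gamma_2, \\
\theta_2 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_2\gamma_0 + \gamma_3\gamma_0 + \gamma_2\gamma_1 + \gamma_3, \\
\theta_1 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_1\gamma_0 + \gamma_3\gamma_0 + \gamma_3\gamma_1 + \gamma_2\gamma_0 + \gamma_2\gamma_1 + \gamma_2 + \gamma_1, \\
\theta_0 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_2\gamma_0 + \gamma_3\gamma_1\gamma_0 + \gamma_2\gamma_1\gamma_0 + \gamma_2\gamma_0 + \gamma_3\gamma_0 + \gamma_2\gamma_1 + \gamma_3 + \gamma_1 + \gamma_0,
\end{aligned}
\tag{3.20}
\]
\[
\begin{aligned}
\theta_3 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_0 + \gamma_3 + \gamma_2, \\
\theta_2 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_2\gamma_0 + \gamma_3\gamma_0 + \gamma_2\gamma_1 + \gamma_2, \\
\theta_1 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_1\gamma_0 + \gamma_2\gamma_0 + \gamma_3 + \gamma_2 + \gamma_1, \\
\theta_0 &= \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_2\gamma_0 + \gamma_3\gamma_1\gamma_0 + \gamma_2\gamma_1\gamma_0 + \gamma_3\gamma_1 + \gamma_3\gamma_0 + \gamma_2\gamma_1 + \gamma_2 + \gamma_1 + \gamma_0.
\end{aligned}
\tag{3.21}
\]
One can obtain $\theta_3$-$\theta_0$ in (3.18) and (3.19) from those of (3.20) and (3.21), respectively.
For performing this, we note that $\gamma_i + 1 = \bar{\gamma}_i$ and $\gamma_i + \gamma_j + \gamma_i\gamma_j = \gamma_i \vee \gamma_j$. For instance, one can
obtain $\theta_3$ in (3.18) from that of (3.20) as $\theta_3 = \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_0 + \gamma_2 = \gamma_2(\gamma_3\gamma_1 + 1) + \gamma_3\gamma_0 =
\gamma_2\overline{\gamma_3\gamma_1} + \gamma_3\gamma_0$. Using similar methods, one can obtain (3.18). As another example,
one can obtain $\theta_3$ in (3.19) from that of (3.21) as $\theta_3 = \gamma_3\gamma_2\gamma_1 + \gamma_3\gamma_0 + \gamma_3 + \gamma_2 =
\gamma_2(\gamma_3\gamma_1 + 1) + \gamma_3(\gamma_0 + 1) = \gamma_2\overline{\gamma_3\gamma_1} + \gamma_3\bar{\gamma}_0$. By verifying the 16 combinations of the input
$\gamma$, the same results are obtained for (3.18) and (3.20) ((3.19) and (3.21)).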
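The exhaustive check over the 16 input combinations mentioned at the end of the proof can be sketched as follows. This is an illustrative script under explicit assumptions: the complement bars of the low-complexity forms are placed so that each output agrees with its XOR-sum counterpart (the printed bars may be arranged differently), a 4-bit value holds the coefficient of $v$ in its two upper bits, and all function names are hypothetical. The script additionally checks that the XOR-sum forms are inversions in $GF((2^2)^2)$ built with $v^2 + v + \Phi$.

```python
from itertools import product

n = lambda b: b ^ 1  # complement of one bit


def anf_phi11(g3, g2, g1, g0):      # XOR-sum form for Phi = {11}_2, as in (3.20)
    return ((g3 & g2 & g1) ^ (g3 & g0) ^ g2,
            (g3 & g2 & g1) ^ (g3 & g2 & g0) ^ (g3 & g0) ^ (g2 & g1) ^ g3,
            (g3 & g2 & g1) ^ (g3 & g1 & g0) ^ (g3 & g0) ^ (g3 & g1) ^ (g2 & g0)
            ^ (g2 & g1) ^ g2 ^ g1,
            (g3 & g2 & g1) ^ (g3 & g2 & g0) ^ (g3 & g1 & g0) ^ (g2 & g1 & g0)
            ^ (g2 & g0) ^ (g3 & g0) ^ (g2 & g1) ^ g3 ^ g1 ^ g0)


def anf_phi10(g3, g2, g1, g0):      # XOR-sum form for Phi = {10}_2, as in (3.21)
    return ((g3 & g2 & g1) ^ (g3 & g0) ^ g3 ^ g2,
            (g3 & g2 & g1) ^ (g3 & g2 & g0) ^ (g3 & g0) ^ (g2 & g1) ^ g2,
            (g3 & g2 & g1) ^ (g3 & g1 & g0) ^ (g2 & g0) ^ g3 ^ g2 ^ g1,
            (g3 & g2 & g1) ^ (g3 & g2 & g0) ^ (g3 & g1 & g0) ^ (g2 & g1 & g0)
            ^ (g3 & g1) ^ (g3 & g0) ^ (g2 & g1) ^ g2 ^ g1 ^ g0)


def low_phi11(g3, g2, g1, g0):      # low-complexity form, as in (3.18)
    return ((g2 & n(g3 & g1)) ^ (g3 & g0),
            (g3 & n(g0)) | (g2 & (g3 | g1)),
            (g2 & n(g0)) | (g3 & n(g1) & g0) | (n(g3) & g1 & n(g2)),
            (g3 | (g1 & n(g0)) | (n(g2) & g0 & n(g1))) ^ (g1 & (g2 | (g3 & n(g0)))))


def low_phi10(g3, g2, g1, g0):      # low-complexity form, as in (3.19)
    return ((g2 & n(g3 & g1)) ^ (g3 & n(g0)),
            (g2 & n(g1)) | (g3 & (g2 | g0)),
            ((g3 & g1 & (g2 | g0)) | (g2 & g0)) ^ g3 ^ g2 ^ g1,
            (n(g0) | (n(g2) & g3) | (g1 & n(g3) & g2)) ^ (n(g2) & (n(g1) | (n(g0) & g3))))


def gf4_mul(a, b):                  # GF(2^2) modulo x^2 + x + 1
    p = (a if b & 1 else 0) ^ ((a << 1) if b & 2 else 0)
    return p ^ 0b111 if p & 0b100 else p


def gf16_mul(a, b, phi):            # GF((2^2)^2) modulo v^2 + v + phi
    ah, al, bh, bl = a >> 2, a & 3, b >> 2, b & 3
    hh = gf4_mul(ah, bh)
    return ((hh ^ gf4_mul(ah, bl) ^ gf4_mul(al, bh)) << 2) | (gf4_mul(hh, phi) ^ gf4_mul(al, bl))


def pack(bits):
    return (bits[0] << 3) | (bits[1] << 2) | (bits[2] << 1) | bits[3]


inputs = list(product((0, 1), repeat=4))
forms_agree = all(low_phi11(*g) == anf_phi11(*g) and low_phi10(*g) == anf_phi10(*g)
                  for g in inputs)
is_inversion = all(gf16_mul(pack(g), pack(f(*g)), phi) == 1
                   for f, phi in ((anf_phi11, 0b11), (anf_phi10, 0b10))
                   for g in inputs if any(g))
```

Both flags evaluate to true under these assumptions, so the low-complexity forms and the XOR-sum forms describe the same inverter for each value of $\Phi$.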
Table 3.1: Evaluation of the performance metrics of the S-boxes on ASIC using the STM
65-nm CMOS standard technology.
Structure | Φ | ν (specification) | Area (µm²) | GE^a | Delay (ns) | Freq. (MHz) | Thro'put (Gbps) | Effic. (Mbps/µm²)
Polynomial basis | 10 | 1000 [30], [24]^b | 525.2 | 252.5 | 1.31 | 763 | 6.1 | 11.6
Polynomial basis | 10 | 1000 (proposed, (3.19)) | 518.9 | 249.4 | 1.15 | 869 | 7.0 | 13.5^3
Polynomial basis | 10 | 1001 | 537.2 | 258.2 | 1.43 | 699 | 5.6 | 10.4
Polynomial basis | 10 | 1010 | 535.6 | 257.5 | 1.36 | 735 | 5.9 | 11.0
Polynomial basis | 10 | 1011 | 540.8 | 260.0 | 1.43 | 699 | 5.6 | 10.3
Polynomial basis | 10 | 1100 [13], [20], [22]^c | 540.3 | 259.7 | 1.37 | 730 | 5.8 | 10.8
Polynomial basis | 10 | 1101 | 548.6 | 263.7 | 1.34 | 746 | 6.0 | 10.9
Polynomial basis | 10 | 1110 | 524.7 | 252.3 | 1.40 | 714 | 5.7 | 10.9
Polynomial basis | 10 | 1110 (proposed, (3.19)) | 510.2^3 | 245.2 | 1.25 | 800 | 6.4 | 12.5
Polynomial basis | 10 | 1111 | 535.6 | 257.5 | 1.40 | 714 | 5.7 | 10.6
Polynomial basis | 11 | 1000 | 528.3 | 253.9 | 1.39 | 719 | 5.8 | 10.9
Polynomial basis | 11 | 1000 (proposed, (3.18)) | 516.3 | 248.2 | 1.11 | 900^3 | 7.2^3 | 13.9^2
Polynomial basis | 11 | 1001 | 534.6 | 257.0 | 1.56 | 641 | 5.1 | 9.6
Polynomial basis | 11 | 1010 [24]^b | 519.0 | 249.5 | 1.42 | 704 | 5.6 | 10.9
Polynomial basis | 11 | 1010 (proposed, (3.18)) | 498.4^2 | 239.6 | 1.11 | 900^3 | 7.2^3 | 14.4^1
Polynomial basis | 11 | 1011 | 531.0 | 255.2 | 1.45 | 690 | 5.5 | 10.4
Polynomial basis | 11 | 1100 | 548.1 | 263.5 | 1.49 | 671 | 5.4 | 9.8
Polynomial basis | 11 | 1101 | 546.5 | 262.7 | 1.42 | 704 | 5.6 | 9.8
Polynomial basis | 11 | 1110 | 542.8 | 260.9 | 1.52 | 657 | 5.3 | 9.7
Polynomial basis | 11 | 1111 | 542.9 | 261.0 | 1.52 | 657 | 5.3 | 9.7
Normal basis | 10 | 0001 [23]^b | 569.4 | 273.7 | 1.59 | 629 | 5.0 | 8.8
Normal basis | 10 | 0001 [23] (3.17)^d | 511.7 | 246.0 | 1.45 | 690 | 5.5 | 10.7
Normal basis | 10 | 0010 | 564.2 | 271.2 | 1.42 | 704 | 5.6 | 10.0
Normal basis | 10 | 0100 | 576.7 | 277.2 | 1.57 | 637 | 5.1 | 8.8
Normal basis | 10 | 1000 | 579.3 | 278.5 | 1.46 | 685 | 5.5 | 9.4
Normal basis | 10 | 0111 | 575.1 | 276.5 | 1.56 | 641 | 5.1 | 8.9
Normal basis | 10 | 1011 | 589.2 | 283.2 | 1.55 | 645 | 5.2 | 8.7
Normal basis | 10 | 1101 | 572.0 | 275.0 | 1.60 | 625 | 5.0 | 8.7
Normal basis | 10 | 1110 | 588.6 | 282.0 | 1.57 | 637 | 5.1 | 8.6
Normal basis | 01 | 0001 | 564.2 | 272.2 | 1.58 | 633 | 5.1 | 9.0
Normal basis | 01 | 0010 | 577.2 | 277.5 | 1.51 | 662 | 5.3 | 9.2
Normal basis | 01 | 0100 | 564.2 | 271.2 | 1.59 | 629 | 5.0 | 8.9
Normal basis | 01 | 1000 | 570.9 | 274.4 | 1.58 | 633 | 5.1 | 8.9
Normal basis | 01 | 0111 | 583.9 | 280.7 | 1.49 | 671 | 5.4 | 9.2
Normal basis | 01 | 1011 | 572.5 | 275.2 | 1.58 | 633 | 5.1 | 8.9
Normal basis | 01 | 1101 | 583.9 | 280.7 | 1.48 | 676 | 5.4 | 9.3
Normal basis | 01 | 1110 | 571.9 | 274.9 | 1.59 | 629 | 5.0 | 8.8
Normal basis | - | 0001 [60]^e | 403.2^1 | 193.8 | 1.64 | 610 | 4.9 | 12.1
Polynomial basis | - | 1110 [27]^f | 554.3 | 266.5 | 1.26 | 794 | 6.4 | 11.5
Mixed basis [62]^g | - | - | 571.1 | 274.5 | 1.30 | 769 | 6.2 | 10.9
LUT/ROM [77], [59]^h | - | - | 1,407.1 | 676.4 | 0.60 | 1666^1 | 13.3^1 | 9.5
LUT/ROM-MI [59]^i | - | - | 1,434.6 | 689.4 | 0.68 | 1470^2 | 11.8^2 | 8.2
Superscripts 1, 2, and 3 mark the best cases for each performance metric.
a. Gate equivalent in terms of two-input NAND.
b. Among all fields considered, the presented composite field has the least hardware complexities in
terms of logic-gate counts.
c. These are some works in which this composite field is used.
d. The hardware complexity of this composite field has been improved by (3.17).
e. This implementation is based on a minimization method at the expense of more timing complexity.
f. This architecture is based on the composite field $GF((2^4)^2)$.
g. Has been presented very recently based on mixed polynomial and normal bases and only focuses on
decreasing the critical path delay.
h. Using synthesized ROM-based LUTs.
i. LUTs for the multiplicative inversion (MI) and logic gates for the affine transformation.
In what follows, we evaluate and compare the performance metrics of different
S-boxes. The presented results confirm the efficiency increase using (3.18) and (3.19).
3.2 Area and Delay Evaluations
In the following, we evaluate and compare the areas, delays, throughputs, and efficiencies
of different S-boxes, including the ones presented in [13], [20], [22], [23], [24], [27], [30],
[60], and [62]. It is noted that the implementation in [61] is for the inversion in $GF(2^4)$
and does not provide the entire S-box architecture.
Using MATLAB® [76], we have derived the low-complexity transformation and mixed
inverse and affine transformation matrices for the syntheses. We have used the STM
65-nm CMOS standard technology and the CORE65LPSVT standard cell library [74]. This
library is optimized for use in low-power applications. The nominal junction
temperature is 25 °C, and VHDL has been used as the design entry to the Synopsys® Design
Vision® [73]. We note that the presented results are post-synthesis and do not consider
the post-layout routing.
The results of our syntheses are presented in Table 3.1. As seen in this table, for
different S-boxes, the areas (in terms of µm²), critical path delays (in terms of ns),
maximum working frequencies (in terms of MHz), throughputs (in terms of Gbps), and
efficiencies (in terms of Mbps/µm²) have been obtained. According to the STM 65-nm standard
cell library information, the lowest and nominal drive strength for the cells is two. It
is noted that the area of a NAND gate in the utilized STM 65-nm CMOS library for
the drive strength of two is 2.08 µm². Then, using this area, we have also provided the
gate equivalent (GE) measure for different S-boxes in the table. Note that if we increase
the area effort, lower areas are usually achieved, mostly at the expense of more delay
overhead.
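The relations between the columns of Table 3.1 can be sketched numerically. The following is a hedged illustration only: the GE measure follows from the stated NAND area of 2.08 µm², while the frequency, throughput, and efficiency formulas (frequency as the reciprocal of the critical path delay, 8 bits processed per cycle, throughput divided by area) are assumptions that happen to reproduce the tabulated values to within rounding; the function name `sbox_metrics` is hypothetical.

```python
NAND2_AREA_UM2 = 2.08  # area of a two-input NAND gate (drive strength two), from the text


def sbox_metrics(area_um2: float, delay_ns: float, bits_per_cycle: int = 8) -> dict:
    """Derive GE, frequency, throughput, and efficiency from area and critical path delay."""
    freq_mhz = 1000.0 / delay_ns                          # assumption: f = 1 / delay
    throughput_gbps = bits_per_cycle * freq_mhz / 1000.0  # assumption: 8 bits per clock cycle
    return {
        "ge": area_um2 / NAND2_AREA_UM2,
        "freq_mhz": freq_mhz,
        "throughput_gbps": throughput_gbps,
        "efficiency_mbps_per_um2": throughput_gbps * 1000.0 / area_um2,
    }


# First row of Table 3.1: area 525.2 um^2 and delay 1.31 ns.
row = sbox_metrics(525.2, 1.31)
```

Small deviations from the table are expected because the tabulated throughputs and efficiencies appear to be computed from already rounded intermediate values.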
Memory macros tend to be expensive in hardware for implementing the S-boxes,
resulting in high hardware complexity and power consumption. Therefore, this
implementation is not considered in this chapter. We have considered two different methods
of realization of the LUT S-boxes. In these methods, read-only LUTs are used for
implementing the S-box, see, for instance, [77] and the hw-lut/hybrid-lut architectures in
[59]. This allows us to logic-optimize the S-box architecture by synthesis of hardware
description languages, leading to low-area implementations. In the first method (denoted
by LUT/ROM), the entire S-box is implemented using LUTs. Moreover, we consider the
S-boxes in which only the multiplicative inversion (MI) in $GF(2^8)$ is implemented using
LUTs and the affine transformation is implemented separately (denoted by LUT/ROM-MI).
This enables the designers to share the multiplicative inversion in $GF(2^8)$ for the
S-box and the inverse S-box in the merged structures.
In some of the previous works such as [20], [23], [27], and [30], the area of the
S-box has been presented in terms of GE. For instance, in [20] and [30], the areas of the
implemented S-boxes have been provided as 294 GE and 272 GE using 0.11 µm and
0.18 µm technologies, respectively. Based on the information of the cell library in a
0.18 µm technology, the gate count of the S-box has been converted to gate equivalent as
180 GE in [23]. We note that the result in [23] (unlike those in [20], [27], [30], and this
chapter) is the direct conversion of the gate count (without synthesis) to GE. In addition,
the conversion factor of 1.75 has been used in [23] for obtaining the GE for XOR/XNOR
and MUX21. However, in the cell library used in this chapter, these conversion factors
are 2.25 and 2, respectively. Another factor causing the reported areas in terms of GE in
different works to vary is the type of the synthesis tools used and the map effort specified.
Using (3.18) and (3.19) of Lemma 3.1, we have also presented the results of the logic-gate optimized S-boxes in Table 3.1. Specifically, we have used (3.18) and (3.19) for the two most compact S-boxes using polynomial basis for $\Phi = \{11\}_2$ and $\Phi = \{10\}_2$ in Table 3.1. It is also noted that for each of the evaluated performance metrics, the three best cases among the different results for the S-boxes have been marked with superscripts 1, 2, and 3. As shown in Table 3.1, the areas for the composite field S-boxes range from 403.2 to 589.2 µm² (difference of 46.1%), the working frequencies from 625 to 900 MHz (difference of 44.0%), the throughputs from 5.0 to 7.2 Gbps (difference of 44.0%), and the efficiencies from 8.6 to 14.4 Mbps/µm² (difference of 67.4%).
As seen in Table 3.1, the S-boxes using LUTs (last two rows) are the fastest S-boxes.
However, their efficiencies are not the highest among the S-boxes in Table 3.1. Among the composite field S-boxes, the one using normal basis presented in [60] is the most compact one (see the area column in Table 3.1). However, it has the worst working frequency and throughput. The S-boxes using polynomial basis (optimized using (3.18)) have the highest frequency and throughput among the composite field S-boxes. Finally, the highest efficiency (see the last column in Table 3.1) is obtained for the one using polynomial basis with $\Phi = \{11\}_2$ and $\nu = \{1010\}_2$ (optimized using (3.18)).
3.3 Power Consumptions and Comparisons
In the following, the power consumption results for different S-boxes are presented. We have derived the power consumptions of the S-boxes within the AES through a simulation-based analysis method. In what follows, we present the power derivation method as well as the results of our analysis and comparison.
3.3.1 Power Derivation Method
We use VHDL as the design entry to the Synopsys® Design Vision®. After obtaining the gate-level netlists of the S-boxes, timing simulations are performed using ModelSim® SE 6.2d [75]. The testbench used for timing simulations covers all the 256 × 255 = 65,280 possible transitions for the 8-bit input of the S-box. This exhaustive input pattern assertion includes the transitions between every ordered pair of distinct values among the 256 possible inputs. Then, for each S-box, the switching activities of all internal nodes are logged in VCD (Value Change Dump) files. We have set the resolution of the timing simulations to high so that the VCD files contain the switching activities of glitches (dynamic hazards) occurring in the logic gates. As the final step, the power consumption of the circuit is computed from the VCD logs, gate-level netlists, cell information, and parasitics of the target ASIC library. We have utilized the Synopsys® PrimeTime® PX [73] to obtain the average power (including net switching power, cell internal power, and cell leakage power) as well as the peak and instantaneous power consumption details. It is noteworthy that the power consumption results are for the working frequency of 50 MHz and for the high resolutions for both timing and power consumption.
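As an illustration of the exhaustive stimulus set described above, the following sketch enumerates the ordered input pairs in Python. It is not the VHDL testbench used in this work; the function name `input_transitions` is hypothetical, and the order in which a real testbench applies the pairs is left open.

```python
from itertools import permutations


def input_transitions(width=8):
    """Return every ordered pair (a, b) of distinct width-bit input values."""
    return list(permutations(range(1 << width), 2))


if __name__ == "__main__":
    transitions = input_transitions()
    # 256 choices for the first value times 255 for the second.
    print(len(transitions))
```

Each pair (a, b) stands for applying input a followed by input b, so that the a-to-b transition is exercised at least once.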
3.3.2 Analysis and Comparison
The results of our simulation-based power computations are presented in Table 3.2. As depicted in this table, for different S-boxes, we have derived the average power (in µW), the peak power (in mW), and the input pattern transition for which the peak power occurs. As shown in Table 3.2, the average powers for the composite field S-boxes range from 44.39 to 58.96 µW (difference of 32.8%) and the peak powers from 1.013 to 1.324 mW (difference of 30.7%). We have also marked (with superscripts 1, 2, and 3) the three cases for which the lowest power consumptions are achieved.
Comparing the results in Tables 3.1 and 3.2 shows that generally, and with few exceptions, the S-boxes with higher hardware complexities consume more power. As seen in Table 3.2, the highest and lowest average power consumptions are obtained for the LUT-based (using memories) S-box and the normal basis S-box presented in [60], respectively. Based on our results in Table 3.1, these two S-boxes have the highest and lowest hardware complexities, respectively. On the other hand, according to the results of Table 3.1, the normal basis S-box presented in [60] has the highest timing complexity among the composite field S-boxes.
The transitions of the inputs of the S-boxes when the peak powers occur have been
also shown in Table 3.2. As shown in this table, most of the peak powers occur when the
S-box input changes to the all-zero input.
Table 3.2: Evaluation of the power consumptions of the S-boxes on ASIC using the STM 65-nm CMOS standard technology and the Synopsys® PrimeTime® PX [73].

Structure           $\Phi$  $\nu$  Specification                  Average^a (µW)  Peak^b (mW)  Transition
Polynomial basis    10      1000   [30], [24]^c                   54.99           1.283        78 → 00
                    10      1001                                  55.77           1.184        1C → 00
                    10      1010                                  54.91           1.165        F4 → 58
                    10      1011                                  56.45           1.262        79 → 00
                    10      1100   [13], [20], [22]^d             55.98           1.283        C0 → 00
                    10      1101                                  56.63           1.324        D5 → 01
                    10      1110                                  55.28           1.313        B5 → 00
                    10      1111                                  55.87           1.161        C3 → 07
                    11      1000                                  54.69           1.134^2      27 → 00
                    11      1000   (proposed, using (3.18))       54.15           1.188        B8 → 00
                    11      1001                                  55.44           1.214        54 → 01
                    11      1010   [30]^c                         54.12           1.145^3      71 → 00
                    11      1010   (proposed, using (3.18))       53.78^2         1.229        C9 → 00
                    11      1011                                  54.80           1.218        55 → 00
                    11      1100                                  55.13           1.178        D7 → 00
                    11      1101                                  56.51           1.244        BA → 00
                    11      1110                                  55.91           1.239        3B → 00
                    11      1111                                  55.40           1.185        9C → 00
Normal basis        10      0001   [23]^c                         58.02           1.268        46 → 00
                    10      0001   [23] (3.17)^e                  54.03^3         1.189        46 → 0A
                    10      0010                                  58.51           1.291        41 → 00
                    10      0100                                  58.03           1.283        46 → 00
                    10      1000                                  58.15           1.290        41 → 00
                    10      0111                                  58.79           1.299        F2 → 00
                    10      1011                                  58.35           1.247        91 → 00
                    10      1101                                  58.81           1.300        73 → 00
                    10      1110                                  58.96           1.309        91 → 00
                    01      0001                                  58.19           1.323        68 → 00
                    01      0010                                  58.07           1.284        92 → 00
                    01      0100                                  57.90           1.292        68 → 00
                    01      1000                                  58.17           1.292        92 → 00
                    01      0111                                  58.54           1.291        43 → 00
                    01      1011                                  57.88           1.231        E7 → 00
                    01      1101                                  58.70           1.282        42 → 00
                    01      1110                                  58.23           1.246        5B → 00
Normal basis        -       0001   [60]^f                         44.39^1         1.013^1      27 → 00
Polynomial basis    -       1110   [27]^g                         55.48           1.208        22 → 00
Mixed basis [62]    -       -                                     58.06           1.242        46 → 00
LUT/ROM [77], [59]  -       -                                     63.18           1.337        91 → 00
LUT/ROM-MI [59]     -       -                                     66.20           1.344        91 → 00

1, 2, and 3 are the best cases for each performance metric.
a Includes net switching, cell internal, and cell leakage power.
b Obtained from the instantaneous power values for each case.
c Among all fields considered, the presented composite field has the least hardware complexities in terms of logic-gate counts.
d These are some works in which this composite field is used.
e The power consumption of this composite field has been improved using (3.17).
f The lowest-power yet the slowest composite field S-box.
g This architecture is based on the composite field $GF((2^4)^2)$.
Chapter 4
A Lightweight Fault Detection
Scheme for the (Inverse) S-box
Using Composite Fields
In this chapter, we present a lightweight concurrent fault detection scheme for the AES. In the proposed approach, the composite field S-box and inverse S-box are divided into blocks and the predicted parities of these blocks are obtained. Through exhaustive searches among all available composite fields, we find the optimum solutions for the least overhead parity-based fault detection structures. Moreover, through our error injection simulations for one S-box (resp. inverse S-box), we show that a total error coverage of 99.998% for 16 S-boxes (resp. inverse S-boxes) can be achieved. Finally, it is shown that both the ASIC and FPGA implementations of the fault detection structures using the obtained optimum composite fields have better hardware and time complexities compared to their counterparts.
We present a low-cost parity-based fault detection scheme for the S-box and the inverse S-box using composite fields. In the presented approach, for increasing the error coverage, the predicted parities of the five blocks of the S-box and the inverse S-box are obtained (three predicted parities for the multiplicative inversion and two for the transformation and affine matrices). It is interesting to note that the cost of our multi-bit parity prediction approach is lower than that of its counterparts which use single-bit parity. It also has higher error coverage than the approaches using single-bit parities. We implement both the proposed fault detection S-box and inverse S-box and their counterparts. Both our ASIC and FPGA implementation results show that, compared to the approaches presented in [52] and [53], the complexities of the proposed fault detection scheme are lower. Through exhaustive searches, we obtain the least area and delay overhead fault detection structures for the optimum composite fields using both polynomial basis and normal basis. The proposed fault detection scheme is simulated and we show that an error coverage of 99.998% for 16 S-boxes (resp. inverse S-boxes) can be obtained. Finally, we have implemented the fault detection hardware structures of the AES using both a 0.18 µm CMOS technology and a Xilinx® Virtex™-II Pro FPGA [80]. It is shown that the fault detection schemes using the optimum polynomial and normal bases have lower complexities than those using other composite fields, both with and without fault detection capability.
The organization of this chapter is as follows. In Section 4.1, some preliminaries related to the composite fields are presented. The proposed fault detection approach for the S-box and the inverse S-box is presented in Section 4.2. Furthermore, the time and hardware complexity analysis is performed in this section. In Section 4.3, the results of the simulations of the proposed approach are presented, through which the fault detection capabilities are derived. In Section 4.4, through FPGA and ASIC implementations, the performance metrics of the proposed fault detection scheme and the previously reported ones are compared. The results presented in this chapter can also be found in [78] and [79].
4.1 Some Notes on Polynomial and Normal Bases
The composite fields can be represented using normal basis [23] or polynomial basis [18], [20], [21], [22]. The S-box and inverse S-box for the polynomial and normal bases are shown in Figs. 4.1 and 4.2, respectively. As shown in these figures, for the S-box using polynomial basis (resp. normal basis), the transformation matrix $\Psi$ (resp. $\Psi'$, see footnote 1) transforms a field element $X$ in the binary field $GF(2^8)$ to the corresponding representation in the composite fields $GF(2^8)/GF(2^4)$. It is noted that the result of $X = \eta_h u + \eta_l$ in Fig. 4.1 (resp. $X = \eta'_h u^{16} + \eta'_l u$ in Fig. 4.2) is obtained using the irreducible polynomial $u^2 + \tau u + \nu$ (resp. $u^2 + \tau' u + \nu'$).
The multiplicative inversion in Fig. 4.1 consists of composite-field multiplications, additions, and an inversion in the sub-field $GF(2^4)$ over $GF(2)/x^4 + x + 1$ [21]. The
Footnote 1: We use prime notation for the composite fields using normal basis.
[Figure 4.1 (block diagram): the input $X$ ($Y$) passes through Blocks 1 to 5 with outputs $b_1 = \eta_h + \eta_l$, $b_2 = \gamma$, $b_3 = \theta$, $b_4$, and $b_5$, each with its predicted parity $\hat{P}_{b_1}$ to $\hat{P}_{b_5}$.]
Figure 4.1: The S-box (the inverse S-box) using composite fields and polynomial basis [20] and their fault detection blocks.
decomposition can be further applied to represent $GF(2^4)$ as a linear polynomial over $GF(2^2)$ and then $GF(2)$ using the irreducible polynomials $v^2 + \Omega v + \Phi$ and $w^2 + w + 1$, respectively. As a result, the multiplicative inversion can be implemented using the field represented by $GF((2^4)^2)$, see for example [18] and [21], or the field represented by $GF(((2^2)^2)^2)$, which has also been used in the literature, see for example [20] and [22]. Finally, as seen in Fig. 4.2 for normal basis, the decomposition is performed using the irreducible polynomials $v^2 + \Omega' v + \Phi'$ and $w^2 + w + 1$.
For calculating the multiplicative inversion, the most efficient choice is to let $\Omega = \tau = 1$ ($\Omega' = \tau' = 1$) in the above irreducible polynomials [23]. Then, we have the following for the multiplicative inversion of the S-box using polynomial basis (Fig. 4.1) and normal
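basis (Fig. 4.2), respectively [20], [23]:
$$(\eta_h u + \eta_l)^{-1} = [((\eta_h + \eta_l)\eta_l + \eta_h^2 \nu)^{-1}\eta_h]u + ((\eta_h + \eta_l)\eta_l + \eta_h^2 \nu)^{-1}(\eta_h + \eta_l), \qquad (4.1)$$
$$(\eta'_h u^{16} + \eta'_l u)^{-1} = [(\eta'_h \eta'_l + (\eta'^2_h + \eta'^2_l)\nu')^{-1}\eta'_l]u^{16} + [(\eta'_h \eta'_l + (\eta'^2_h + \eta'^2_l)\nu')^{-1}\eta'_h]u. \qquad (4.2)$$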
It is noted that one can replace $\eta$ ($\eta'$) with $\sigma$ ($\sigma'$) to obtain (4.1) and (4.2) for the inverse S-box. In the next section, we propose the low-cost fault detection scheme for the S-box and the inverse S-box.
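The polynomial-basis inversion formula can be checked numerically. The sketch below assumes that $GF(2^4)$ elements are stored as 4-bit integers reduced modulo $x^4 + x + 1$ (the sub-field polynomial named in the text) and that an element $\eta_h u + \eta_l$ is stored as the pair `(h, l)`; all function names are hypothetical. It multiplies every non-zero element by the inverse given by the formula, for every constant that makes $u^2 + u + \nu$ irreducible, and confirms that the product is 1.

```python
def gf16_mul(a, b):
    """Multiply two GF(2^4) elements (4-bit ints) modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13  # reduce by x^4 + x + 1
    return r


def gf16_inv(a):
    """Inverse in GF(2^4) computed as a^14 (returns 0 for a = 0)."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r


def is_irreducible(nu):
    """u^2 + u + nu is irreducible over GF(2^4) iff it has no root there."""
    return all(gf16_mul(t, t) ^ t ^ nu != 0 for t in range(16))


def mul_composite(x, y, nu):
    """(xh*u + xl) * (yh*u + yl), reduced with u^2 = u + nu."""
    (xh, xl), (yh, yl) = x, y
    hh = gf16_mul(xh, yh)
    high = hh ^ gf16_mul(xh, yl) ^ gf16_mul(xl, yh)
    low = gf16_mul(hh, nu) ^ gf16_mul(xl, yl)
    return high, low


def inv_composite(x, nu):
    """Inverse of h*u + l using the polynomial-basis formula (4.1)."""
    h, l = x
    delta = gf16_mul(h ^ l, l) ^ gf16_mul(gf16_mul(h, h), nu)
    d = gf16_inv(delta)
    return gf16_mul(d, h), gf16_mul(d, h ^ l)


def check_inversion_formula():
    """True if x * x^-1 == 1 for all non-zero x and all irreducible nu."""
    for nu in filter(is_irreducible, range(16)):
        for h in range(16):
            for l in range(16):
                if (h, l) == (0, 0):
                    continue
                x = (h, l)
                if mul_composite(x, inv_composite(x, nu), nu) != (0, 1):
                    return False
    return True
```

Only one sub-field inversion is needed per composite-field inversion, which is the reason this decomposition leads to compact S-boxes.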
[Figure 4.2 (block diagram): the input $X$ ($Y$) passes through Blocks 1 to 5 with outputs $b_1$, $b_2 = \gamma'$, $b_3 = \theta'$, $b_4$, and $b_5$, each with its predicted parity $\hat{P}_{b_1}$ to $\hat{P}_{b_5}$.]
Figure 4.2: The S-box (the inverse S-box) using composite fields and normal basis [23] and their fault detection blocks.
4.2 Fault Detection Scheme
To obtain low-overhead parity prediction, we have divided the S-box and the inverse S-box into 5 blocks as shown in Figs. 4.1 and 4.2. In these figures, the modulo-2 additions, each consisting of 4 XOR gates, are shown by two concentric circles with a plus inside. Furthermore, the multiplications in $GF(2^4)$ are shown by rectangles with crosses inside. Let $b_i$ be the output of block $i$, denoted by dots in Figs. 4.1 and 4.2 for the S-box. As seen in Fig. 4.1, $b_1 = \eta_h + \eta_l$, $b_2 = \gamma$, $b_3 = \theta$, $b_4 = \sigma$, and $b_5 = Y$. Similarly, from Fig. 4.2, $b_1 = \eta'_h + \eta'_l$, $b_2 = \gamma'$, $b_3 = \theta'$, $b_4 = \sigma'$, and $b_5 = Y$. One can replace $\eta$ ($\eta'$) with $\sigma$ ($\sigma'$) and $X$ with $Y$ for the inverse S-box. In the following, we have exhaustively searched for the least overhead parity predictions of these blocks, denoted by $\hat{P}_{b_1}$ to $\hat{P}_{b_5}$ in Figs. 4.1 and 4.2.
4.2.1 The S-box and the Inverse S-box Using Polynomial Basis
The implementation complexities of different blocks of the S-box and the inverse S-box and those for their predicted parities are dependent on the choice of the coefficients $\nu \in GF(2^4)$ and $\Phi \in GF(2^2)$ in the irreducible polynomials $u^2 + u + \nu$ and $v^2 + v + \Phi$ used for the composite fields. Our goal in the following is to find $\nu \in GF(2^4)$ and $\Phi \in GF(2^2)$ for the composite fields $GF(((2^2)^2)^2)$ and $\nu \in GF(2^4)$ for the composite fields $GF((2^4)^2)$ so that the area complexity of the entire fault detection implementation becomes optimum. According to [24], 16 possible combinations of $\nu \in GF(2^4)$ and $\Phi \in GF(2^2)$ exist. Moreover, for the composite fields $GF((2^4)^2)$, we have exhaustively searched for and found the possible choices of $\nu$ making the polynomial $x^2 + x + \nu$ irreducible. These parameters determine the complexities of some blocks as explained below.
Blocks 1 and 5: Based on the possible values of $\nu$ and $\Phi$ in $GF(((2^2)^2)^2)$ ($\nu$ in $GF((2^4)^2)$), the transformation matrices in blocks 1 and 5 of the S-box and the inverse S-box in Fig. 4.1 can be constructed using the algorithm presented in [24]. Using an exhaustive search, eight base elements in $GF(((2^2)^2)^2)$ (or $GF((2^4)^2)$), to which eight base elements of $GF(2^8)$ are mapped, are found to construct the transformation matrix.
In [81], the Hamming weights, i.e., the number of non-zero elements, of the transformation matrices for the case $\Phi = \{10\}_2$ and different values of $\nu$ in $GF(((2^2)^2)^2)$ are obtained. It is noted that in [24], instead of considering the Hamming weights, subexpression sharing is suggested for obtaining low-complexity implementations of the S-box. Here, we have also considered these transformation matrices for $\Phi = \{11\}_2$ as well as the transformation matrices for the inverse S-box for different values of $\nu$ and $\Phi$ and have derived their area and delay complexities. Moreover, the gate count and the critical path delay for blocks 1 and 5 and their predicted parities, i.e., $\hat{P}_{b_1}$ and $\hat{P}_{b_5}$, of the S-box and the inverse S-box in $GF((2^4)^2)$ have been obtained.
Blocks 2 and 4: As shown in Fig. 4.1, block 2 of the S-box and the inverse S-box consists of a multiplication, an addition, a squaring, and a multiplication by the constant $\nu$ in $GF((2^2)^2)$. We present the following lemma for deriving the predicted parity of the multiplication in $GF((2^2)^2)$, using which the predicted parities of blocks 2 and 4 in Fig. 4.1 are obtained.
Lemma 4.1 Let $\lambda = (\lambda_3, \lambda_2, \lambda_1, \lambda_0)$ and $\delta = (\delta_3, \delta_2, \delta_1, \delta_0)$ be the inputs of a multiplier in $GF((2^2)^2)$. The predicted parities of the result of the multiplication of $\lambda$ and $\delta$ in $GF((2^2)^2)$ for $\Phi = \{10\}_2$ and $\Phi = \{11\}_2$ are as follows, respectively:
$$\hat{P}_{\lambda\delta} = \lambda_3(\delta_3 + \delta_2 + \delta_0) + \lambda_2(\delta_3 + \delta_1 + \delta_0) + \lambda_1(\delta_2 + \delta_0) + \lambda_0(\delta_3 + \delta_2 + \delta_1 + \delta_0) \quad \text{if } \Phi = \{10\}_2, \qquad (4.3)$$
$$\hat{P}_{\lambda\delta} = \lambda_3(\delta_3 + \delta_0) + \lambda_2(\delta_2 + \delta_1 + \delta_0) + \lambda_1(\delta_2 + \delta_0) + \lambda_0(\delta_3 + \delta_2 + \delta_1 + \delta_0) \quad \text{if } \Phi = \{11\}_2. \qquad (4.4)$$
Proof One can perform modulo-2 addition of the coordinates of the result of the multiplication over $GF((2^2)^2)$ [20]. Then, by reordering and factoring the result for $\Phi = \{10\}_2$ and $\Phi = \{11\}_2$, the predicted parities in (4.3) and (4.4) are obtained.
The predicted parity of block 2 of the S-box and the inverse S-box, i.e., $\hat{P}_{b_2} = \hat{P}_{\eta_h^2 \nu} + \hat{P}_{(\eta_h + \eta_l)\eta_l}$ in Fig. 4.1, depends on the choice of the coefficients $\nu \in GF((2^2)^2)$ and $\Phi \in GF(2^2)$. Using Lemma 4.1, we have derived the complexity of the predicted parity of block 2 for these coefficients. Furthermore, for block 4 in Fig. 4.1, which consists of two multiplications in $GF((2^2)^2)$, one can also use Lemma 4.1 to derive the predicted parity. For block 2 of the S-box (resp. the inverse S-box) over $GF((2^4)^2)$ in Fig. 4.1, only the multiplication by the constant $\nu$ is affected by the value of $\nu$. For this block, we have exhaustively searched for and obtained the optimum implementation for the different values of $\nu$. Moreover, block 4 in Fig. 4.1 is independent of the value of $\nu$. Therefore, the complexity of the predicted parity for this block is the same for all possible values of $\nu$.
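Block 3: We present the following theorem for block 3 of the S-box and the inverse S-box over $GF((2^2)^2)$ in Fig. 4.1.
Theorem 4.1 Let $\gamma = (\gamma_3, \gamma_2, \gamma_1, \gamma_0)$ be the input and $\theta = (\theta_3, \theta_2, \theta_1, \theta_0)$ be the output of an inverter in $GF((2^2)^2)$. The predicted parities of the result of the inversion in $GF((2^2)^2)$, i.e., $\hat{P}_{b_3}$, for $\Phi = \{10\}_2$ and $\Phi = \{11\}_2$ are as follows, respectively:
$$\hat{P}_{\theta} = (\bar{\gamma}_2 \vee \gamma_1)\gamma_0 + (\gamma_1 + \gamma_0)\gamma_3 \quad \text{if } \Phi = \{10\}_2, \qquad (4.5)$$
$$\hat{P}_{\theta} = (\gamma_2\gamma_1 \vee \gamma_0) + \gamma_3\gamma_1 \quad \text{if } \Phi = \{11\}_2, \qquad (4.6)$$
where $\vee$ represents an OR operation and the overbar denotes complement.
Proof By modulo-2 addition of the coordinates of the result of the inversion in $GF((2^2)^2)$ for $\Phi = \{10\}_2$ in [20], one can obtain the predicted parity of $\theta$ as $\hat{P}_{\theta} = \gamma_2\gamma_0 + \gamma_2\gamma_1\gamma_0 + \gamma_3\gamma_1 + \gamma_0 + \gamma_3\gamma_0 = \gamma_0(\gamma_2(\gamma_1 + 1) + 1) + \gamma_3(\gamma_1 + \gamma_0)$. By noting that $\gamma_1 + 1 = \bar{\gamma}_1$ and $\overline{\gamma_2\bar{\gamma}_1} = \bar{\gamma}_2 \vee \gamma_1$, one can reach (4.5). Moreover, by XORing the coordinates of the result for $\Phi = \{11\}_2$, $\hat{P}_{\theta}$ is obtained as $\hat{P}_{\theta} = \gamma_3\gamma_1 + \gamma_2\gamma_1\gamma_0 + \gamma_2\gamma_1 + \gamma_0$. Noting that $\gamma_2\gamma_1\gamma_0 + \gamma_2\gamma_1 + \gamma_0 = \gamma_2\gamma_1 \vee \gamma_0$, one can simplify $\hat{P}_{\theta}$ to reach (4.6) and the proof is complete.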
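The two simplification steps in the proof are Boolean identities and can be confirmed by exhaustive evaluation. The sketch below reads the two parity formulas as $(\bar{\gamma}_2 \vee \gamma_1)\gamma_0 + (\gamma_1 + \gamma_0)\gamma_3$ and $(\gamma_2\gamma_1 \vee \gamma_0) + \gamma_3\gamma_1$ (the placement of the complement is an assumption drawn from the proof's own steps) and compares them with the expanded XOR-of-products forms. Function names are hypothetical.

```python
from itertools import product


def parity_expanded_phi10(g3, g2, g1, g0):
    """Expanded form from the proof for Phi = {10}_2."""
    return (g2 & g0) ^ (g2 & g1 & g0) ^ (g3 & g1) ^ g0 ^ (g3 & g0)


def parity_4_5(g3, g2, g1, g0):
    """(NOT g2 OR g1) g0 + (g1 + g0) g3."""
    return (((g2 ^ 1) | g1) & g0) ^ ((g1 ^ g0) & g3)


def parity_expanded_phi11(g3, g2, g1, g0):
    """Expanded form from the proof for Phi = {11}_2."""
    return (g3 & g1) ^ (g2 & g1 & g0) ^ (g2 & g1) ^ g0


def parity_4_6(g3, g2, g1, g0):
    """(g2 g1 OR g0) + g3 g1."""
    return ((g2 & g1) | g0) ^ (g3 & g1)


def forms_agree():
    """True if the simplified and expanded forms match on all 16 inputs."""
    return all(
        parity_expanded_phi10(*g) == parity_4_5(*g)
        and parity_expanded_phi11(*g) == parity_4_6(*g)
        for g in product((0, 1), repeat=4)
    )
```

The check covers only the algebraic simplification; it does not re-derive the inverter coordinates taken from [20].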
It is noted that the inversion in $GF(2^4)$ in Fig. 4.1 is independent of the value of $\nu$. Therefore, the complexity of the predicted parity for this block is the same for any possible value of $\nu$.
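For the $GF((2^4)^2)$ case, the predicted parities of the sub-field inversion and multiplication can be compared directly with field arithmetic. The sketch below assumes $GF(2^4)$ is represented modulo $x^4 + x + 1$ with 4-bit integers, and it reads the block 3 and block 4 parities stated for the optimum S-box (PB2) in Proposition 4.1 as $\bar{\gamma}_3\gamma_2\bar{\gamma}_0 + \gamma_0(\bar{\gamma}_1 \vee \overline{(\gamma_2 + \gamma_3)})$ and $\eta_3\theta_0 + \eta_2(\theta_1 + \theta_0) + \eta_1(P_\theta + \theta_3) + \eta_0 P_\theta$. The complements in the first expression and all function names are assumptions of this sketch.

```python
def gf16_mul(a, b):
    """Multiply two GF(2^4) elements (4-bit ints) modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13  # reduce by x^4 + x + 1
    return r


def gf16_inv(a):
    """Inverse in GF(2^4) computed as a^14 (returns 0 for a = 0)."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r


def parity(v):
    """Modulo-2 sum of the bits of v."""
    return bin(v).count("1") & 1


def bits(v):
    """Return (v3, v2, v1, v0) for a 4-bit value."""
    return tuple((v >> i) & 1 for i in (3, 2, 1, 0))


def predicted_inv_parity(gamma):
    """Predicted parity of the GF(2^4) inverse, from the input bits only."""
    g3, g2, g1, g0 = bits(gamma)
    return ((g3 ^ 1) & g2 & (g0 ^ 1)) ^ (g0 & ((g1 ^ 1) | ((g2 ^ g3) ^ 1)))


def predicted_mul_parity(eta, theta):
    """Predicted parity of the GF(2^4) product, from the input bits only."""
    e3, e2, e1, e0 = bits(eta)
    t3, t2, t1, t0 = bits(theta)
    p_theta = t3 ^ t2 ^ t1 ^ t0
    return (e3 & t0) ^ (e2 & (t1 ^ t0)) ^ (e1 & (p_theta ^ t3)) ^ (e0 & p_theta)


def predictions_match():
    """True if both predictions equal the actual output parity everywhere."""
    inv_ok = all(predicted_inv_parity(g) == parity(gf16_inv(g)) for g in range(16))
    mul_ok = all(
        predicted_mul_parity(a, b) == parity(gf16_mul(a, b))
        for a in range(16)
        for b in range(16)
    )
    return inv_ok and mul_ok
```

Because neither expression involves the constant of the quadratic extension, the same predictor serves every admissible choice of that constant.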
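Considering the discussions presented in this section for different combinations of $\nu$ and $\Phi$ for polynomial basis, we present the following for the optimum parity predictions.
Proposition 4.1 The fault detection S-box using composite fields $GF(((2^2)^2)^2)$ has the least area complexity for $\Phi = \{11\}_2$ and $\nu = \{1010\}_2$. For this optimum S-box (PB1), we have the following predicted parities for the 5 blocks in Fig. 4.1:
$\hat{P}_{b_1} = x_0$;
$\hat{P}_{b_2} = \eta_3(\eta_7 + \eta_4) + \eta_2(\eta_7 + P_{\eta_h}) + \eta_1(\eta_6 + \eta_4) + \eta_0 P_{\eta_h} + \eta_6 + \eta_7$;
$\hat{P}_{b_3} = (\gamma_2\gamma_1 \vee \gamma_0) + \gamma_1\gamma_3$;
$\hat{P}_{b_4} = \eta_3(\theta_3 + \theta_0) + \eta_2(P_{\theta} + \theta_3) + \eta_1(\theta_2 + \theta_0) + \eta_0 P_{\theta}$;
$\hat{P}_{b_5} = \sigma_7 + \sigma_5 + \sigma_3 + \sigma_2 + \sigma_0$;
where $P_{\eta_h} = \eta_7 + \eta_6 + \eta_5 + \eta_4$ and $P_{\theta} = \theta_3 + \theta_2 + \theta_1 + \theta_0$. Additionally, among all the possible values for $\nu$ using composite fields $GF((2^4)^2)$, $\nu = \{1010\}_2$ yields the least-complexity architecture for the optimum S-box (PB2). Then, for the S-box we have:
$\hat{P}_{b_1} = x_7 + x_0$;
$\hat{P}_{b_2} = \eta_3\eta_4 + \eta_2(\eta_5 + \eta_4) + \eta_1(P_{\eta_h} + \eta_7) + \eta_0 P_{\eta_h} + P_{\eta_h} + \eta_4$;
$\hat{P}_{b_3} = \bar{\gamma}_3\gamma_2\bar{\gamma}_0 + \gamma_0(\bar{\gamma}_1 \vee \overline{(\gamma_2 + \gamma_3)})$;
$\hat{P}_{b_4} = \eta_3\theta_0 + \eta_2(\theta_1 + \theta_0) + \eta_1(P_{\theta} + \theta_3) + \eta_0 P_{\theta}$;
$\hat{P}_{b_5} = \sigma_4 + \sigma_3 + \sigma_2 + \sigma_1 + \sigma_0$.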
Furthermore, we have the following for the inverse S-box.
Proposition 4.2 For the inverse S-box using composite field $GF(((2^2)^2)^2)$, choosing $\Phi = \{11\}_2$ and $\nu = \{1011\}_2$, and for the one using composite field $GF((2^4)^2)$, choosing $\nu = \{1001\}_2$, yields the lowest area complexity architecture. It is noted that blocks 3 and 4 have the same predicted parities as the S-box by swapping $\eta$ and $\sigma$. For the other blocks of the optimum inverse S-box (PB1) we have:
$\hat{P}_{b_1} = y_7 + y_6 + y_5 + y_2$;
$\hat{P}_{b_2} = \sigma_3(\sigma_7 + \sigma_4) + \sigma_2(\sigma_7 + P_{\sigma_h}) + \sigma_1(\sigma_6 + \sigma_4) + \sigma_0 P_{\sigma_h} + \sigma_4$;
$\hat{P}_{b_5} = \eta_7 + \eta_6 + \eta_3 + \eta_2 + \eta_0$.
Additionally, for the optimum inverse S-box (PB2) we have:
$\hat{P}_{b_1} = y_7 + y_6 + y_3$;
$\hat{P}_{b_2} = \sigma_3\sigma_4 + \sigma_2(\sigma_5 + \sigma_4) + \sigma_1(P_{\sigma_h} + \sigma_7) + \sigma_0 P_{\sigma_h} + \sigma_7$;
$\hat{P}_{b_5} = \eta_0$.
4.2.2 The S-box and the Inverse S-box Using Normal Basis
Based on the possible values of $\nu'$ and $\Phi'$, the transformation matrices in blocks 1 and 5 of the S-box, denoted as $\Psi'$ and $\Psi'^{-1}$/affine, can be constructed using the algorithm presented in [24] with a slight modification for normal basis. One possible way to find the least complex transformation matrices is to calculate the Hamming weights, i.e., the number of non-zero elements, of the matrices $\Psi'$ and $\Psi'^{-1}$/affine. It is noted in [23] that instead of considering the Hamming weights, subexpression sharing is used for obtaining the low complexity implementations. We have exhaustively searched for the least overhead transformation matrices and their parity predictions combined, the results of which are presented in Table 4.1. In this table, for every possible combination of $\nu'$ and $\Phi'$, the Hamming weights of $\Psi'$ and $\Psi'^{-1}$/affine for the least complex cases are tabulated in column 3. Also, the number of gates needed for the low complexity implementation of blocks 1 and 5 is presented in column 4 of the table. Furthermore, the total number of XOR gates needed for the predicted parities of blocks 1 and 5 of the S-box, i.e., $\hat{P}_{b_1}$ and $\hat{P}_{b_5}$, and the delays associated with them are also shown in the table (see Fig. 4.2).

Table 4.1: Area/delay complexities of blocks 1 and 5 of the S-box and their predicted parities for possible values of $\nu'$ and $\Phi'$.

$\Phi'$  $\nu'$  $H(\Psi') + H(\Psi'^{-1}$/affine$)$  Total area of blocks 1 and 5  Total area of $\hat{P}_{b_1}$ and $\hat{P}_{b_5}$
10       0001    57                                   28X                           5X
10       0010    57                                   32X                           5X
10       0100    57                                   34X                           5X
10       1000    57                                   30X                           5X
10       0111    67                                   34X                           3X
10       1011    65                                   30X                           5X
10       1101    67                                   34X                           3X
10       1110    65                                   31X                           5X
01       0001    57                                   32X                           5X
01       0010    57                                   32X                           5X
01       0100    57                                   29X                           5X
01       1000    57                                   34X                           5X
01       0111    65                                   34X                           5X
01       1011    67                                   37X                           3X
01       1101    65                                   34X                           5X
01       1110    67                                   32X                           3X

Total delay of blocks 1 and 5: $7T_X$ (all rows). Total delay of $\hat{P}_{b_1}$ and $\hat{P}_{b_5}$: $4T_X$ (all rows).
X = XOR, $T_X$ = delay of an XOR.
Block 2: As shown in Fig. 4.2, block 2 of the S-box consists of a multiplication, an addition, a squaring, and a multiplication by the constant $\nu'$ in $GF(2^4)$. The multiplication in $GF(2^4)$ consists of three multiplications, additions, and a multiplication by the constant $\Phi'$ in $GF(2^2)$. The following lemmas are used for deriving the predicted parity of the multiplication in $GF(2^4)$ and block 2, respectively.
Lemma 4.2 Let $\lambda' = (\lambda'_3, \lambda'_2, \lambda'_1, \lambda'_0)$ and $\delta' = (\delta'_3, \delta'_2, \delta'_1, \delta'_0)$ be the inputs of a multiplier in $GF(2^4)$. The predicted parity of the result of the multiplication of $\lambda'$ and $\delta'$ in $GF(2^4)$ is independent of $\Phi'$ and can be derived as
$$\hat{P}_{\lambda'\delta'} = \lambda'_3\delta'_3 + \lambda'_2\delta'_2 + \lambda'_1\delta'_1 + \lambda'_0\delta'_0. \qquad (4.7)$$
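Proof For the inputs $\lambda' = (\lambda'_1, \lambda'_0)$ and $\delta' = (\delta'_1, \delta'_0)$, the two-bit result of the multiplication in $GF(2^2)$, $\kappa' = (\kappa'_1, \kappa'_0)$, can be derived as $\kappa'_1 = \lambda'_1\delta'_0 + \lambda'_0\delta'_1 + \lambda'_0\delta'_0$ and $\kappa'_0 = \lambda'_1\delta'_0 + \lambda'_0\delta'_1 + \lambda'_1\delta'_1$. Furthermore, multiplication by the two possible values of $\Phi'$, i.e., $\Phi' = w^2 = \{10\}_2$ and $\Phi' = w = \{01\}_2$, can be obtained by putting $\delta' = \Phi'$. Then, we have $\kappa'_1 = \lambda'_0$ and $\kappa'_0 = \lambda'_1 + \lambda'_0$ for $\Phi' = w^2 = \{10\}_2$, and $\kappa'_1 = \lambda'_1 + \lambda'_0$ and $\kappa'_0 = \lambda'_1$ for $\Phi' = w = \{01\}_2$. Consequently, one can derive the coordinates of the product $\kappa' = (\kappa'_3, \kappa'_2, \kappa'_1, \kappa'_0)$ in $GF(2^4)$. For $\Phi' = w^2 = \{10\}_2$ we have
$$\begin{aligned}
\kappa'_3 &= \lambda'_3(\delta'_3 + \delta'_1 + \delta'_0) + \lambda'_2(\delta'_1 + \delta'_2) + \lambda'_1(\delta'_3 + \delta'_2 + \delta'_1 + \delta'_0) + \lambda'_0(\delta'_3 + \delta'_1), \\
\kappa'_2 &= \lambda'_3(\delta'_2 + \delta'_1) + \lambda'_2(\delta'_3 + \delta'_2 + \delta'_0) + \lambda'_1(\delta'_3 + \delta'_1) + \lambda'_0(\delta'_2 + \delta'_0), \\
\kappa'_1 &= \lambda'_3(\delta'_3 + \delta'_2 + \delta'_1 + \delta'_0) + \lambda'_2(\delta'_3 + \delta'_1) + \lambda'_1(\delta'_3 + \delta'_2 + \delta'_1) + \lambda'_0(\delta'_3 + \delta'_0), \\
\kappa'_0 &= \lambda'_3(\delta'_3 + \delta'_1) + \lambda'_2(\delta'_2 + \delta'_0) + \lambda'_1(\delta'_3 + \delta'_0) + \lambda'_0(\delta'_2 + \delta'_1 + \delta'_0).
\end{aligned} \qquad (4.8)$$
Also, for $\Phi' = w = \{01\}_2$ we have the result as
$$\begin{aligned}
\kappa'_3 &= \lambda'_3(\delta'_3 + \delta'_2 + \delta'_1) + \lambda'_2(\delta'_3 + \delta'_0) + \lambda'_1(\delta'_3 + \delta'_1) + \lambda'_0(\delta'_2 + \delta'_0), \\
\kappa'_2 &= \lambda'_3(\delta'_3 + \delta'_0) + \lambda'_2(\delta'_2 + \delta'_1 + \delta'_0) + \lambda'_1(\delta'_2 + \delta'_0) + \lambda'_0(\delta'_3 + \delta'_2 + \delta'_1 + \delta'_0), \\
\kappa'_1 &= \lambda'_3(\delta'_3 + \delta'_1) + \lambda'_2(\delta'_2 + \delta'_0) + \lambda'_1(\delta'_3 + \delta'_1 + \delta'_0) + \lambda'_0(\delta'_2 + \delta'_1), \\
\kappa'_0 &= \lambda'_3(\delta'_2 + \delta'_0) + \lambda'_2(\delta'_3 + \delta'_2 + \delta'_1 + \delta'_0) + \lambda'_1(\delta'_2 + \delta'_1) + \lambda'_0(\delta'_3 + \delta'_2 + \delta'_0).
\end{aligned} \qquad (4.9)$$
Modulo-2 adding the coordinates of (4.8) or (4.9) gives (4.7) and the proof is complete.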
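The last step of the proof can be checked mechanically. In the sketch below, each coordinate of (4.8) and (4.9) is transcribed as a table that maps a bit index of the first input to the bit indices of the second input it multiplies; the tables `EQ_4_8` and `EQ_4_9` and the function names are hypothetical, and the transcription itself is an assumption of this sketch. The code confirms that the modulo-2 sum of the four output coordinates equals the bitwise inner product of the two inputs, which is the content of (4.7).

```python
# Output coordinate k -> {index i of first input: indices j of second input}.
EQ_4_8 = {
    3: {3: (3, 1, 0), 2: (1, 2), 1: (3, 2, 1, 0), 0: (3, 1)},
    2: {3: (2, 1), 2: (3, 2, 0), 1: (3, 1), 0: (2, 0)},
    1: {3: (3, 2, 1, 0), 2: (3, 1), 1: (3, 2, 1), 0: (3, 0)},
    0: {3: (3, 1), 2: (2, 0), 1: (3, 0), 0: (2, 1, 0)},
}
EQ_4_9 = {
    3: {3: (3, 2, 1), 2: (3, 0), 1: (3, 1), 0: (2, 0)},
    2: {3: (3, 0), 2: (2, 1, 0), 1: (2, 0), 0: (3, 2, 1, 0)},
    1: {3: (3, 1), 2: (2, 0), 1: (3, 1, 0), 0: (2, 1)},
    0: {3: (2, 0), 2: (3, 2, 1, 0), 1: (2, 1), 0: (3, 2, 0)},
}


def parity(v):
    """Modulo-2 sum of the bits of v."""
    return bin(v).count("1") & 1


def multiply(eq, a, b):
    """Evaluate the four product coordinates of eq for 4-bit inputs a and b."""
    out = 0
    for k, terms in eq.items():
        bit = 0
        for i, js in terms.items():
            if (a >> i) & 1:
                for j in js:
                    bit ^= (b >> j) & 1
        out |= bit << k
    return out


def predicted_parity(a, b):
    """(4.7): sum over i of a_i * b_i."""
    return parity(a & b)


def lemma_holds(eq):
    """True if the output parity equals (4.7) for all 256 input pairs."""
    return all(
        parity(multiply(eq, a, b)) == predicted_parity(a, b)
        for a in range(16)
        for b in range(16)
    )
```

The predicted parity therefore costs 4 AND gates and 3 XOR gates for either choice of the sub-field constant.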
Lemma 4.3 The predicted parity of block 2, i.e., $\hat{P}_{b_2}$, depends on the choice of the coefficients $\nu' \in GF(2^4)$ and $\Phi' \in GF(2^2)$ in the irreducible polynomials $u^2 + u + \nu'$ and $v^2 + v + \Phi'$ used for the composite field.
Proof Considering the fact that $\hat{P}_{b_2} = \hat{P}_{(\eta'_h + \eta'_l)^2\nu'} + \hat{P}_{\eta'_h\eta'_l}$, one can use Lemma 4.2 to obtain $\hat{P}_{\eta'_h\eta'_l}$ independent of the values of $\nu'$ and $\Phi'$. However, $\hat{P}_{(\eta'_h + \eta'_l)^2\nu'}$ depends on the elements $\nu'$ and $\Phi'$. This is because of having a squaring in $GF(2^4)$, i.e., $(\eta'_h + \eta'_l)^2$, and also a multiplication by $\nu'$ to obtain $\hat{P}_{(\eta'_h + \eta'_l)^2\nu'}$. Therefore, the predicted parity of block 2 is also dependent on these values and the proof is complete.
Using these lemmas, we can state the following to predict the parity of block 2.
Lemma 4.4 The predicted parity of block 2, i.e., $\hat{P}_{b_2}$, can be derived as shown in Table 4.2.
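Table 4.2: Parity predictions and complexities of block 2 of the normal basis S-box for possible values of $\nu'$ and $\Phi'$.

$\Phi'$  $\nu'$  Area of block 2  Predicted parity ($\hat{P}_{b_2}$)  Area of $\hat{P}_{b_2}$
10       0001    28X+9A           $(\eta'_7 \vee \eta'_3) + (\eta'_6 \vee \eta'_2) + (\eta'_4 \vee \eta'_0) + \eta'_5\eta'_1$  3X+3O+1A
10       0010    29X+9A           $(\eta'_7 \vee \eta'_3) + (\eta'_5 \vee \eta'_1) + (\eta'_4 \vee \eta'_0) + \eta'_6\eta'_2$  3X+3O+1A
10       0100    28X+9A           $(\eta'_6 \vee \eta'_2) + (\eta'_5 \vee \eta'_1) + (\eta'_4 \vee \eta'_0) + \eta'_7\eta'_3$  3X+3O+1A
10       1000    29X+9A           $(\eta'_7 \vee \eta'_3) + (\eta'_6 \vee \eta'_2) + (\eta'_5 \vee \eta'_1) + \eta'_4\eta'_0$  3X+3O+1A
10       0111    28X+9A           $(\eta'_4 \vee \eta'_0) + \eta'_7\eta'_3 + \eta'_6\eta'_2 + \eta'_5\eta'_1$  3X+3A+1O
10       1011    29X+9A           $(\eta'_7 \vee \eta'_3) + \eta'_6\eta'_2 + \eta'_5\eta'_1 + \eta'_4\eta'_0$  3X+3A+1O
10       1101    28X+9A           $(\eta'_6 \vee \eta'_2) + \eta'_7\eta'_3 + \eta'_5\eta'_1 + \eta'_4\eta'_0$  3X+3A+1O
10       1110    29X+9A           $(\eta'_5 \vee \eta'_1) + \eta'_7\eta'_3 + \eta'_6\eta'_2 + \eta'_4\eta'_0$  3X+3A+1O
01       0001    29X+9A           $(\eta'_6 \vee \eta'_2) + (\eta'_5 \vee \eta'_1) + (\eta'_4 \vee \eta'_0) + \eta'_7\eta'_3$  3X+3O+1A
01       0010    28X+9A           $(\eta'_7 \vee \eta'_3) + (\eta'_6 \vee \eta'_2) + (\eta'_5 \vee \eta'_1) + \eta'_4\eta'_0$  3X+3O+1A
01       0100    29X+9A           $(\eta'_7 \vee \eta'_3) + (\eta'_6 \vee \eta'_2) + (\eta'_4 \vee \eta'_0) + \eta'_5\eta'_1$  3X+3O+1A
01       1000    28X+9A           $(\eta'_7 \vee \eta'_3) + (\eta'_5 \vee \eta'_1) + (\eta'_4 \vee \eta'_0) + \eta'_6\eta'_2$  3X+3O+1A
01       0111    29X+9A           $(\eta'_6 \vee \eta'_2) + \eta'_7\eta'_3 + \eta'_5\eta'_1 + \eta'_4\eta'_0$  3X+3A+1O
01       1011    28X+9A           $(\eta'_5 \vee \eta'_1) + \eta'_7\eta'_3 + \eta'_6\eta'_2 + \eta'_4\eta'_0$  3X+3A+1O
01       1101    29X+9A           $(\eta'_4 \vee \eta'_0) + \eta'_7\eta'_3 + \eta'_6\eta'_2 + \eta'_5\eta'_1$  3X+3A+1O
01       1110    28X+9A           $(\eta'_7 \vee \eta'_3) + \eta'_6\eta'_2 + \eta'_5\eta'_1 + \eta'_4\eta'_0$  3X+3A+1O

Delay of block 2: $6T_X + 1T_A$ (all rows). Delay of $\hat{P}_{b_2}$: $2T_X + 1T_A$ (all rows).
A = AND; {+, X} = XOR; {$\vee$, O} = OR.
$T_X$ = delay of an XOR, $T_A$ = delay of an AND = delay of an OR.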
Proof One can use Lemma 4.2 to obtain $\hat{P}_{(\eta'_h + \eta'_l)^2\nu'}$ and $\hat{P}_{\eta'_h\eta'_l}$ in $\hat{P}_{b_2} = \hat{P}_{(\eta'_h + \eta'_l)^2\nu'} + \hat{P}_{\eta'_h\eta'_l}$. $\hat{P}_{\eta'_h\eta'_l}$ can be easily found using Lemma 4.2. Furthermore, using Lemma 4.2 with the inputs being $\lambda' = (\eta'_h + \eta'_l)^2$ and $\delta' = \nu'$, one can obtain $\hat{P}_{(\eta'_h + \eta'_l)^2\nu'}$. Noting that the possible values for $\Phi'$ are $\Phi' = w^2 = \{10\}_2$ and $\Phi' = w = \{01\}_2$, one can find the corresponding possible $(\eta'_h + \eta'_l)^2$ using (4.8) and (4.9). This is achieved by putting both inputs in (4.8) or (4.9) as $\eta'_h + \eta'_l$. Then, for $\Phi' = w^2 = \{10\}_2$ we have
$$\begin{aligned}
(\eta'_h + \eta'_l)^2 = (&\eta'_7 + \eta'_6 + \eta'_5 + \eta'_3 + \eta'_2 + \eta'_1,\ \eta'_6 + \eta'_5 + \eta'_4 + \eta'_2 + \eta'_1 + \eta'_0, \\
&\eta'_7 + \eta'_5 + \eta'_4 + \eta'_3 + \eta'_1 + \eta'_0,\ \eta'_7 + \eta'_6 + \eta'_4 + \eta'_3 + \eta'_2 + \eta'_0),
\end{aligned} \qquad (4.10)$$
and for $\Phi' = w = \{01\}_2$ we have
$$\begin{aligned}
(\eta'_h + \eta'_l)^2 = (&\eta'_7 + \eta'_5 + \eta'_4 + \eta'_3 + \eta'_1 + \eta'_0,\ \eta'_7 + \eta'_6 + \eta'_4 + \eta'_3 + \eta'_2 + \eta'_0, \\
&\eta'_7 + \eta'_6 + \eta'_5 + \eta'_3 + \eta'_2 + \eta'_1,\ \eta'_6 + \eta'_5 + \eta'_4 + \eta'_2 + \eta'_1 + \eta'_0).
\end{aligned} \qquad (4.11)$$
One can obtain the predicted parities of block 2, i.e., $\hat{P}_{b_2} = \hat{P}_{(\eta'_h + \eta'_l)^2\nu'} + \hat{P}_{\eta'_h\eta'_l}$, for all the possible combinations of $\nu'$ and $\Phi'$. The results are presented in Table 4.2.
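The entries of Table 4.2 follow from (4.7), (4.10), and (4.11) by a short computation, sketched below. Writing $s_i = \eta'_{i+4} + \eta'_i$, each coordinate of the square is a sum of three of the $s_i$; multiplying by the constant and taking the parity via (4.7) selects the coordinates where the constant has a 1. Adding the product parity $\sum_i \eta'_{i+4}\eta'_i$ then turns every selected pair into an OR, since $a + b + ab = a \vee b$. The table `SQUARE`, the bit-string encoding of the constants, and the function names are assumptions of this sketch.

```python
# Coordinates 3..0 of (eta'_h + eta'_l)^2 as sets of indices i of
# s_i = eta'_{i+4} + eta'_i, read from (4.10) and (4.11).
SQUARE = {
    "10": [{3, 2, 1}, {2, 1, 0}, {3, 1, 0}, {3, 2, 0}],
    "01": [{3, 1, 0}, {3, 2, 0}, {3, 2, 1}, {2, 1, 0}],
}


def or_pairs(phi, nu):
    """Indices i whose pair (eta'_{i+4}, eta'_i) appears as an OR term.

    phi is "10" or "01"; nu is a 4-character bit string, MSB first.
    """
    linear = set()
    for coord, nu_bit in zip(SQUARE[phi], nu):
        if nu_bit == "1":
            linear ^= coord  # modulo-2 accumulation of the linear terms
    return linear


def predicted_parity_b2(phi, nu):
    """Render the block 2 predicted parity as a readable expression."""
    ors = or_pairs(phi, nu)
    terms = []
    for i in (3, 2, 1, 0):
        hi, lo = i + 4, i
        terms.append(f"(e{hi} | e{lo})" if i in ors else f"e{hi}e{lo}")
    return " + ".join(terms)


if __name__ == "__main__":
    for phi in ("10", "01"):
        for nu in ("0001", "0010", "0100", "1000", "0111", "1011", "1101", "1110"):
            print(phi, nu, predicted_parity_b2(phi, nu))
```

Rows with three OR terms correspond to the 3X+3O+1A entries of the table, and rows with one OR term to the 3X+3A+1O entries.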
Table 4.2 shows the predicted parities for different combinations of $\nu'$ and $\Phi'$ and their area/delay complexities. Moreover, the complexities for block 2 are shown in this table. As seen in Table 4.2, the delay overhead for both the original block and its parity prediction is the same for all the cases, whereas the area in terms of the number of gates differs for different values of $\nu'$ and $\Phi'$.
Block 3: Block 3 in Fig. 4.2 consists of an inversion in $GF(2^4)$. The inversion in $GF(2^4)$ is dependent on the two possible choices of $\Phi'$ and is the same for different values of $\nu'$. Therefore, depending on the choice of $\Phi'$, there are two possible choices for this block and its parity prediction. It is noted that for both of these implementations, the area and the critical path delay are the same. The following theorem is used for obtaining the predicted parity of block 3, i.e., $\hat{P}_{b_3}$.
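Theorem 4.2 Let $\gamma' = (\gamma'_3, \gamma'_2, \gamma'_1, \gamma'_0)$ be the input and $\theta' = (\theta'_3, \theta'_2, \theta'_1, \theta'_0)$ be the output of an inverter in $GF(2^4)$. Then, for $\Phi' = w^2 = \{10\}_2$, the predicted parity of block 3, i.e., $\hat{P}_{b_3}$, can be found as
$$\hat{P}_{b_3} = \hat{P}_{\theta'} = \overline{\gamma'_2\gamma'_0}(\gamma'_3 + \gamma'_1) + \gamma'_3\gamma'_1(\gamma'_2 + \gamma'_0). \qquad (4.12)$$
Also, for $\Phi' = w = \{01\}_2$ we have
$$\hat{P}_{b_3} = \hat{P}_{\theta'} = \overline{\gamma'_3\gamma'_1}(\gamma'_2 + \gamma'_0) + \gamma'_2\gamma'_0(\gamma'_3 + \gamma'_1). \qquad (4.13)$$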
Proof According to [23], we have $\hat{P}_{\theta'} = \hat{P}_{\Delta^{-1}\gamma'_h} + \hat{P}_{\Delta^{-1}\gamma'_l} = \hat{P}_{\Delta^{-1}(\gamma'_h + \gamma'_l)}$. Then, according to the predicted parity of the multiplication in $GF(2^2)$ in the proof of Lemma 4.2, we have $\hat{P}_{\Delta^{-1}(\gamma'_h + \gamma'_l)} = \Delta^{-1}_1(\gamma'_3 + \gamma'_1) + \Delta^{-1}_0(\gamma'_2 + \gamma'_0)$. Moreover, considering the fact that the inversion in $GF(2^2)$ is free, i.e., $\Delta^{-1} = (\Delta_0, \Delta_1)$, we reach $\hat{P}_{\theta'} = \Delta_0(\gamma'_3 + \gamma'_1) + \Delta_1(\gamma'_2 + \gamma'_0)$. Then, according to the formulations for the multiplication in $GF(2^2)$ and knowing that the squaring in $GF(2^2)$ is free, finding the coordinates of $\Delta$ for the two values of $\Phi'$ is straightforward and the proof is complete.
Block 4: Block 4 of the S-box consists of two multiplications in $GF(2^4)$. According to Lemma 4.2, the area/delay overhead of the multiplications in $GF(2^4)$ and that of their predicted parity are the same for both $\Phi' = w = \{01\}_2$ and $\Phi' = w^2 = \{10\}_2$. Moreover, we have $\hat{P}_{b_4} = \hat{P}_{\eta'_h\theta'} + \hat{P}_{\eta'_l\theta'} = \hat{P}_{(\eta'_h + \eta'_l)\theta'}$. Then, according to (4.7) in Lemma 4.2 with the inputs $\eta'_h + \eta'_l$ and $\theta'$, one can find $\hat{P}_{b_4}$ as
$$\hat{P}_{b_4} = (\eta'_7 + \eta'_3)\theta'_3 + (\eta'_6 + \eta'_2)\theta'_2 + (\eta'_5 + \eta'_1)\theta'_1 + (\eta'_4 + \eta'_0)\theta'_0. \qquad (4.14)$$
It is noted that for the implementation of $\hat{P}_{b_4}$, the modulo-2 additions $\eta'_7 + \eta'_3$, $\eta'_6 + \eta'_2$, $\eta'_5 + \eta'_1$, and $\eta'_4 + \eta'_0$ are already available at the input of block 2. Therefore, this implementation only needs 3 XORs and 4 ANDs.
Above, the optimum fault detection S-box using normal basis in Fig. 4.2 has been derived. In the following, we have also performed an exhaustive search for finding the optimum predicted parities based on the choice of the coefficients $\nu' \in GF(2^4)$ and $\Phi' \in GF(2^2)$ for the five blocks of the inverse S-box using normal basis. We have exhaustively searched for the least overhead transformation matrices and their parity predictions combined for the inverse S-box and have derived the total complexity for the predicted parities of blocks 1 and 5, i.e., $\hat{P}_{b_1}$ and $\hat{P}_{b_5}$, and the delays associated with them. These are used to obtain the optimum inverse S-box and its parity predictions in this section. It is also noted that, as shown in Fig. 4.2, blocks 2, 3, and 4 of the S-box and the inverse S-box are the same. Therefore, the predicted parities of these blocks for the inverse S-box can be obtained from those of the S-box. Using the discussions presented in this section, we present the following for the optimum parity predictions.
Proposition 4.3 Among the different combinations of $\nu'$ and $\Phi'$ for normal basis, for the S-box and the inverse S-box, $\Phi' = \{10\}_2$ and $\nu' = \{0001\}_2$ have the least area for the operations and their fault detection circuits combined. The following are the predicted parities for the S-box:
$\hat{P}_{b_1} = x_7 + x_5$;
$\hat{P}_{b_2} = (\eta'_7 \vee \eta'_3) + (\eta'_6 \vee \eta'_2) + (\eta'_4 \vee \eta'_0) + \eta'_5\eta'_1$;
$\hat{P}_{b_3} = \overline{\gamma'_2\gamma'_0}(\gamma'_3 + \gamma'_1) + \gamma'_3\gamma'_1(\gamma'_2 + \gamma'_0)$;
$\hat{P}_{b_4} = (\eta'_7 + \eta'_3)\theta'_3 + (\eta'_6 + \eta'_2)\theta'_2 + (\eta'_5 + \eta'_1)\theta'_1 + (\eta'_4 + \eta'_0)\theta'_0$;
$\hat{P}_{b_5} = \sigma'_7 + \sigma'_5 + \sigma'_4 + \sigma'_3 + \sigma'_2$.
Moreover, for the inverse S-box, $\hat{P}_{b_2}$ to $\hat{P}_{b_4}$ are the same as those for the S-box by swapping $\eta'$ and $\sigma'$. For the other blocks, we have $\hat{P}_{b_1} = y_7 + y_6 + y_2 + y_1$ and $\hat{P}_{b_5} = \eta'_7 + \eta'_5 + \eta'_4 + \eta'_3 + \eta'_2$.
It is noted that the area overhead of the proposed scheme for the optimum structures consists of that of the optimum parity predictions. In addition, 23 XORs for the actual parities (3 XORs for adding the coordinates of each of $\sigma'_h + \sigma'_l$, $\gamma'$, and $\theta'$ and 7 XORs each for those of $\eta'$ and $Y$) are utilized. Moreover, the delay overhead of the predicted parities of the 5 blocks can overlap the delays for the implementations of the 5 blocks in Figs. 4.1 and 4.2. The only delay overhead for this scheme is the delay of the actual parity of block 5, which is $3T_X$, where $T_X$ is the delay of an XOR gate.
Table 4.3: Error simulation results of the optimum S-box and inverse S-box after injecting 500,000 errors. Values in parentheses are for the inverse S-box.

Operations | Field | Errors covered | Error coverage
S-box (inverse S-box) | PB1 | 485,008 (485,106) | 97.002% (97.021%)
S-box (inverse S-box) | PB2 | 485,039 (485,015) | 97.008% (97.002%)
S-box (inverse S-box) | NB | 485,015 (485,174) | 97.003% (97.035%)
4.3 Error Simulations
If exactly one bit error appears at the output of the S-box (resp. inverse S-box), the presented fault detection scheme is able to detect it and the error coverage is about 99.998%. This is because in this case, the error indication flag of the corresponding block signals the error. However, due to technological constraints, injecting a single stuck-at error may not be feasible for an attacker seeking to gain more information [82]. Instead, multiple bits will actually be flipped, and hence multiple stuck-at errors are also considered in this chapter, covering both natural faults and fault attacks [82].
For the calculation of the error coverage for multiple errors, we define $p_i$ as the probability of error detection in block $i$, $1 \le i \le 5$, in Figs. 4.1 and 4.2. Then, the probability of not detecting the errors in block $i$ is $(1 - p_i)$. For randomly distributed errors in the S-box (resp. inverse S-box), this probability for each block is independent of those of the other blocks. Therefore, one can derive the equation for the error coverage of the randomly distributed errors as
\[
EC\% = 100 \times \Big(1 - \prod_{i \in S} (1 - p_i)\Big)\%, \tag{4.15}
\]
where $S$ is the set of the block numbers where the faults are injected. For randomly distributed errors, the error coverage for each block is $p_i \approx \frac{1}{2}$. Then, (4.15) can be simplified as $EC\% = 100 \times (1 - (\frac{1}{2})^n)\%$, where $n$ is the number of blocks. Therefore, if multiple errors are randomly distributed in all blocks, the error coverage reaches about 97% (96.875%) using $n = 5$ error indication flags.
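The simplification of (4.15) can be checked numerically. The following is a minimal sketch; the function name is hypothetical, and the value $p_i = 1/2$ is the approximation stated above rather than a measured detection probability.

```python
from math import prod


def error_coverage(p):
    """Error coverage in percent according to (4.15).

    p is a sequence with one detection probability per block in which
    faults are injected (the set S in the text).
    """
    return 100.0 * (1.0 - prod(1.0 - pi for pi in p))


# Coverage as a function of the number of blocks, with p_i = 1/2 each.
for n in range(1, 6):
    print(n, error_coverage([0.5] * n))
```

With five blocks the value is $100 \times (1 - 2^{-5}) = 96.875$, which rounds to the 97% quoted in the text.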
We have performed error simulations for the S-boxes and the inverse S-boxes using the optimum composite field obtained in the previous section to confirm the above theoretical computation. In our simulations, we use the stuck-at error model at the outputs of the five blocks, forcing one or multiple nodes to be stuck at logic one (for stuck-at one) or zero (for stuck-at zero) independent of the error-free values. We use the Fibonacci implementation of LFSRs for injecting random multiple errors, where the numbers, the locations, and the types of the errors are randomly chosen. In this regard, the maximum sequence length polynomial for the feedback is selected. The injected errors are transient, i.e., they last for one clock cycle. However, the results would be the same if permanent errors are considered.
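The injection mechanism described above can be modeled in a few lines. This is a sketch under assumptions, not the simulation setup used here: the 16-bit register width, the feedback polynomial $x^{16} + x^{14} + x^{13} + x^{11} + 1$ (a commonly used maximum-length choice), the seed, and the way the mask and the fault type are drawn are all illustrative.

```python
def lfsr16_step(state: int) -> int:
    """One step of a 16-bit Fibonacci LFSR (assumed taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_period(seed: int = 0xACE1) -> int:
    """Number of steps until the register returns to the seed."""
    state, steps = seed, 0
    while True:
        state = lfsr16_step(state)
        steps += 1
        if state == seed:
            return steps


def inject_stuck_at(value: int, width: int, state: int):
    """Force a pseudo-random subset of nodes to stuck-at one or zero.

    Returns the faulty word and the new LFSR state. The fault is applied
    regardless of the error-free value, as in the stuck-at model.
    """
    state = lfsr16_step(state)
    mask = state & ((1 << width) - 1)      # which nodes are faulty
    state = lfsr16_step(state)
    stuck_at_one = state & 1               # fault type
    faulty = (value | mask) if stuck_at_one else (value & ~mask)
    return faulty & ((1 << width) - 1), state
```

A transient fault corresponds to calling `inject_stuck_at` for a single evaluation of the block; a permanent fault would reuse the same mask and type on every evaluation.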
The results of the error simulations using the Xilinx® ISE™ version 9.1i Simulator (ISim) [80] are presented in Table 4.3. As seen in this table, up to 500,000 random errors are injected for both the S-box and the inverse S-box. It is noted that in this table, the optimum polynomial bases, $GF(((2^2)^2)^2)$ denoted by PB1 and $GF((2^4)^2)$ denoted by PB2, and the normal basis (NB) are presented. As shown in the table, using 5 parity bits of the 5 blocks, the error coverage for random faults reaches 97%, which is the same as our theoretical computation in this section. This error coverage will be increased if the outputs of more than one S-box (resp. inverse S-box) of the AES implementation are erroneous. In this case, the errors detected in any of the 16 S-boxes (resp. inverse S-boxes) contribute to the total error coverage. Thus, an error coverage very close to 100% (99.998%) is achieved.
4.4 ASIC and FPGA Implementations and Comparisons
In this section, we compare the areas and the delays of the presented scheme with those of the previously reported ones in both ASIC and FPGA implementations. We have implemented the S-boxes using memories and the ones presented in [21], [22] (the hardware optimization of [20]), and [81], which use polynomial basis representation in composite fields. We have also implemented the fault detection schemes proposed in [34], [36], and [42] (both united and parity-based), which are based on the ROM-based implementation of the S-box. The results of the implementations for both the original structures and the fault detection scheme (FDS) in terms of delay and area have been tabulated in Tables 4.4 and 4.5. As seen in these tables, the original structures are not divided into blocks, and full optimization of the original entire architecture as a single block is performed in both ASIC and FPGA. This allows us to find the actual overhead of the presented fault detection scheme as compared to the original structures which are not divided into five blocks. We have used 0.18 μm CMOS technology for the ASIC implementations. These architectures have been coded in VHDL as the design entry to the Synopsys Design Analyzer. The results are tabulated in Table 4.4. Moreover, for the FPGA implementations in Table 4.5, the Xilinx® Virtex™-II Pro FPGA (xc2vp2-7) [80] is utilized in the Xilinx® ISE™ version 9.1i. Furthermore, the synthesis is performed using XST™.

Table 4.4: ASIC implementations of the fault detection schemes for the S-box (SB) and the inverse S-box using 0.18 μm CMOS technology. Each entry gives area (μm²), delay (ns).

Operation | Structure | FDS | Original | With FDS
S-box | ROM SB | United S-box [34], [42] | 169 × 10³, 5.4 | 344 × 10³, 7.7
S-box | ROM SB | Two 256 × 9 ROMs [36] | 169 × 10³, 5.4 | 378 × 10³, 5.8
S-box | ROM (mult. inv.) | Parity-based SB [42] | 185 × 10³, 5.8 | 191 × 10³, 5.9
S-box | PB [22] | [52] (mult. inv.) | 5315, 12.0 | 6869, 12.8
S-box | PB [22] | [53] | 5315, 12.0 | 7047, 14.1
S-box | PB [22] | [57] for the original SB | 5315, 12.0 | 6763, 14.1
S-box | PB [21] | Proposed scheme applied | 5642, 11.3 | 7113, 13.0
S-box | PB [81] | Proposed scheme applied | 5547, 13.2 | 7034, 13.8
S-box | NB | [78] | 5179, 12.9 | 6712, 14.7
S-box | PB1 | [79] | 5217, 10.6 | 6723, 12.5
S-box | PB2 | [79] | 5290, 9.2 | 6739, 11.5
Inverse S-box | NB | [79] | 5187, 13.2 | 6480, 14.5
Inverse S-box | PB1 | [79] | 5225, 10.9 | 6537, 13.0
Inverse S-box | PB2 | [79] | 5274, 9.4 | 6619, 11.3
As seen in Tables 4.4 and 4.5, we have implemented the fault detection scheme presented in [34] and [42] based on using redundant units for the S-box (united S-box). Furthermore, the fault detection scheme proposed in [36] is implemented. This scheme uses 512 × 9 memory cells to generate the predicted parity bit and the 8-bit output of the S-box [36]. One can obtain from Tables 4.4 and 4.5 that for both of these schemes, the area overhead is more than 100%. As mentioned in the introduction, the approach in [50] utilizes the scheme in [36] for protecting the combinational logic elements, whose implementation results are also shown in Tables 4.4 and 4.5. Additionally, for certain AES implementations containing storage elements, one can use the error correcting code-based approach presented in [50] in addition to the proposed scheme in this chapter to make a more reliable AES implementation. Moreover, the parity-based scheme in [42], which only realizes the multiplicative inversion (mult. inv.) using memories, is implemented. As seen in these tables, we have also implemented the schemes in [52] and [53]. It is noted that the scheme in [52] is for the multiplicative inversion and does not present the parity predictions for the transformation matrices. Moreover, we have applied the presented fault detection scheme to the S-boxes in [21] and [81]. As seen in Tables 4.4 and 4.5, with the error coverage of 99.998%, the presented low-complexity fault detection S-boxes are the most compact among the S-boxes considered. The optimum S-box and inverse S-box using normal basis have the least hardware complexity with the fault detection scheme. Moreover, as seen in the tables, the optimum structures using composite fields and polynomial basis (PB1 and PB2) have the least post place and route timing overhead among the schemes. It is noted that using sub-pipelining for the presented fault detection scheme in this chapter, one can reach much faster hardware implementations of the composite field fault detection structures.

Table 4.5: Xilinx® Virtex™-II Pro FPGA implementations (xc2vp2-7) of the fault detection schemes for the S-box (SB) and the inverse S-box. Each entry gives slices, delay (ns).

Operation | Structure | FDS | Original | With FDS
S-box | ROM (SB) | United SB [34], [42] | 69, 3.826 | 150, 5.398
S-box | ROM (SB) | Two 256 × 9 ROMs [36] | 69, 3.826 | 159, 4.287
S-box | ROM (mult. inv.) | Parity-based SB [42] | 88, 5.734 | 100, 6.370
S-box | PB [22] | [52] (mult. inv.) | 33, 9.375 | 44, 9.869
S-box | PB [22] | [53] | 33, 9.375 | 47, 9.996
S-box | PB [22] | [57] for the original SB | 33, 9.375 | 42, 10.317
S-box | PB [21] | Proposed scheme applied | 38, 8.285 | 50, 9.582
S-box | PB [81] | Proposed scheme applied | 37, 9.986 | 47, 10.832
S-box | NB | [78] | 31, 9.339 | 39, 10.026
S-box | PB1 | [79] | 31, 7.284 | 40, 7.465
S-box | PB2 | [79] | 32, 7.356 | 41, 8.150
Inverse S-box | NB | [79] | 31, 7.736 | 38, 7.964
Inverse S-box | PB1 | [79] | 32, 6.992 | 42, 7.423
Inverse S-box | PB2 | [79] | 32, 7.550 | 44, 8.181
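The overhead statements above follow directly from the tabulated areas. The short sketch below recomputes a few of them from the S-box rows of Table 4.4; the dictionary layout and the function name are hypothetical, and only the numbers are taken from the table.

```python
# (original area, area with fault detection) in square micrometres,
# copied from the S-box rows of Table 4.4.
AREAS = {
    "United S-box [34], [42]": (169e3, 344e3),
    "Two 256x9 ROMs [36]": (169e3, 378e3),
    "NB": (5179, 6712),
    "PB1": (5217, 6723),
    "PB2": (5290, 6739),
}


def overhead_percent(original: float, with_fds: float) -> float:
    """Area overhead of the fault detection scheme in percent."""
    return 100.0 * (with_fds - original) / original


for name, (orig, fds) in AREAS.items():
    print(f"{name}: {overhead_percent(orig, fds):.1f}%")
```

The two ROM-based schemes come out above 100%, while the composite field structures with the presented scheme stay near 30%, which is consistent with the comparison in the text.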
We have also implemented the AES encryption using the presented optimum S-boxes, excluding the key expansion. Then, we have added the proposed scheme for SubBytes and ShiftRows, considering that ShiftRows is a rewiring of the output of SubBytes. The results are presented in Tables 4.6 and 4.7. As one can notice, the S-boxes occupy more than three fourths of the AES encryption. As shown in these tables, the most compact AES encryption with and without the fault detection scheme is for normal basis. Furthermore, the frequency degradation is negligible. Moreover, the original AES encryption for PB2 and the ones with fault detection for PB1 and PB2 have the highest working frequencies. In addition, as seen in the tables, we have applied the presented scheme to SubBytes and ShiftRows and used the scheme in [36] for the other transformations.

Table 4.6: ASIC implementations of the fault detection schemes of the AES encryption using 0.18 μm CMOS technology.

AES encryption | Optimum S-box | Area (μm²), S-boxes | Area (μm²), all | Freq. (MHz)
Original without fault detection | PB1 | 692781 (80%) | 859471 | 79.4
Original without fault detection | PB2 | 704490 (80%) | 871180 | 91.8
Original without fault detection | NB | 680590 (80%) | 845426 | 73.5
Presented scheme for SubBytes (ShiftRows) | PB1 | 956233 | - | 78.8
Presented scheme for SubBytes (ShiftRows) | PB2 | 972217 | - | 89.2
Presented scheme for SubBytes (ShiftRows) | NB | 946476 | - | 69.5
Presented scheme for SubBytes (ShiftRows), scheme in [36] for others | PB1 | - | 1268520 | 68.2
Presented scheme for SubBytes (ShiftRows), scheme in [36] for others | PB2 | - | 1280412 | 70.1
Presented scheme for SubBytes (ShiftRows), scheme in [36] for others | NB | - | 1256812 | 60.3

Table 4.7: Xilinx® Virtex™-II Pro FPGA implementations of the fault detection schemes of the AES encryption.

AES encryption | Optimum S-box | Slices, S-boxes | Slices, all | Freq. (MHz)
Original without fault detection | PB1 | 5248 (77%) | 6760 | 81.1
Original without fault detection | PB2 | 5417 (78%) | 6913 | 89.8
Original without fault detection | NB | 5112 (78%) | 6579 | 75.8
Presented scheme for SubBytes (ShiftRows) | PB1 | 6896 | - | 79.3
Presented scheme for SubBytes (ShiftRows) | PB2 | 6958 | - | 84.0
Presented scheme for SubBytes (ShiftRows) | NB | 6342 | - | 73.2
Presented scheme for SubBytes (ShiftRows), scheme in [36] for others | PB1 | - | 9881 | 65.8
Presented scheme for SubBytes (ShiftRows), scheme in [36] for others | PB2 | - | 9921 | 64.8
Presented scheme for SubBytes (ShiftRows), scheme in [36] for others | NB | - | 9405 | 60.8
In this chapter, we have presented a high-performance parity-based concurrent fault detection scheme for the AES using the S-box and the inverse S-box in composite fields. Using exhaustive searches, we have found the least complexity S-boxes and inverse S-boxes as well as their fault detection circuits. Our error simulation results show that very high error coverage is obtained for the presented scheme. Moreover, a number of fault detection schemes from the literature have been implemented on ASIC and FPGA and compared with the ones presented here. Our implementations show that the optimum S-boxes and the inverse S-boxes using normal basis are more compact than the ones using polynomial basis. However, the ones using polynomial basis result in the fastest implementations. We have also implemented the AES encryption using the proposed fault detection scheme. The results of the ASIC and FPGA mapping show that the costs of the presented scheme are reasonable with acceptable post place and route delays.
Chapter 5
A High-Performance Concurrent
Fault Detection Approach for the
Composite Field (Inverse) S-box
In the previous chapter, an exhaustive search-based fault detection scheme for the AES S-boxes and inverse S-boxes was presented. In this chapter, we also present a concurrent fault detection scheme for the S-box and the inverse S-box based on the low-cost composite field implementations of the S-box and the inverse S-box. However, we divide the structures of these operations into three blocks and find the predicted parities of these blocks. Our simulations show that except for the redundant units approach, which has hardware and time overheads of close to 100%, the fault detection capabilities of the proposed scheme for the burst and random multiple faults are higher than the previously reported ones. Finally, through ASIC implementations, it is shown that for the maximum target frequency, the proposed fault detection S-box and inverse S-box in this chapter have the least areas, critical path delays, and power consumptions compared to their counterparts with similar fault detection capabilities.
In this chapter, we present a low-power and high-performance parity-based fault detection approach for the S-box, the inverse S-box, and the merged S-box/inverse S-box within the AES using composite fields. We obtain new formulations for the five predicted parities for three blocks of the S-box and the inverse S-box. To reach high multiple and burst fault detection capabilities, multiple-bit signatures are obtained within the blocks that occupy more area in the structures of the S-box and the inverse S-box. Our simulation results show higher burst fault detection capability for the proposed scheme compared to the previously presented schemes with comparable overheads. This can be used as an effective countermeasure against fault attacks, noting that in realistic fault attacks, multiple adjacent bits are actually flipped [82]. Moreover, using the proposed scheme, for multiple random faults, the entire SubBytes and inverse SubBytes are capable of detecting 99.998% of the injected faults. Through ASIC implementations, it is shown that for the maximum target frequency, the timing, power, and area of the proposed scheme are the least compared to the schemes with similar fault detection capabilities. It is noted that the fault detection scheme proposed in this chapter can also be applied to both the low-area S-box and inverse S-box presented in [17], [18], [20], [22], and the low-power one proposed in [13].
The organization of this chapter is as follows. In Section 5.1, preliminaries related to the S-box and the inverse S-box arithmetic used in this chapter are presented. The proposed fault detection approach for the S-box, the inverse S-box, and the merged structures is presented in Section 5.2. Furthermore, the time and hardware complexity analysis is performed in this section. In Section 5.3, the results of the simulations of the proposed approach are presented, through which the fault detection capabilities are derived. In Section 5.4, through ASIC implementations, the areas, power consumptions, and critical path delays of the proposed fault detection scheme and the previously reported ones are compared. We also present the formulations for the mixed S-box in Section 5.5. The results presented in this chapter can also be found in [83] and [84].
5.1 S-box and Inverse S-box Arithmetic Used in This
Chapter
The structures of the S-box and the inverse S-box using composite field and polynomial basis are shown in Fig. 5.1. As seen in Fig. 5.1, for the S-box, the transformation matrix $\Psi$ transforms a field element $X = \sum_{i=0}^{7} x_i\alpha^i$ in the binary field $GF(2^8)$ to the corresponding representation in the composite field $GF(2^8)/GF(((2^2)^2)^2)$ for performing the multiplicative inversion. Then, using the inverse transformation matrix $\Psi^{-1}$, the result of the multiplicative inversion, i.e., $X^{-1}$, is obtained. This is performed using the irreducible polynomial $u^2 + u + \nu$. It is noted that the decomposition can be further applied to represent $GF((2^2)^2)$ as a linear polynomial over $GF(2^2)$ and then $GF(2)$ using the irreducible polynomials $v^2 + v + \Phi$ and $w^2 + w + 1$, respectively. Eventually, as seen in Fig. 5.1, using the affine transformation, the 8-bit output of the S-box, i.e., $Y$, is derived. Furthermore, as seen in Fig. 5.1, for the inverse S-box, the reverse procedure is performed to obtain the output $X$ from the input $Y$. It is noted that in Fig. 5.1, the notations for the inverse S-box are presented in parentheses.

Figure 5.1: The architecture of the S-box (resp. the inverse S-box) using composite field and polynomial basis [20].
All arithmetic operations including the multiplications, the inversion, and the squaring in Fig. 5.1 are over $GF((2^2)^2)$. In Fig. 5.1, the two concentric circles with a plus inside represent 4 XOR gates which perform the modulo-2 addition. Moreover, the three finite field multiplications and the inversion in $GF((2^2)^2)$ are shown by crossed rectangles and $(\cdot)^{-1}$, respectively. Furthermore, the multiplication by the constant $\nu$ and the squaring $(\cdot)^2$ in $GF((2^2)^2)$ are shown in this figure. As seen in Fig. 5.1 for the S-box, for the output of the multiplicative inversion $\sigma_h x + \sigma_l = (\eta_h x + \eta_l)^{-1}$ we have the following [20]
\[
\begin{aligned}
\sigma_h &= ((\eta_h + \eta_l)\eta_l + \nu\eta_h^2)^{-1}\eta_h, \\
\sigma_l &= ((\eta_h + \eta_l)\eta_l + \nu\eta_h^2)^{-1}(\eta_h + \eta_l).
\end{aligned} \tag{5.1}
\]
Moreover, for the inverse S-box in Fig. 5.1, one can swap $\eta$ and $\sigma$ to derive the relation for the multiplicative inversion.
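The relation in (5.1) holds for any quadratic extension defined by an irreducible polynomial of the form $u^2 + u + \nu$, so it can be checked exhaustively in software. The sketch below does this under explicit assumptions: $GF(2^4)$ is represented modulo $x^4 + x + 1$ instead of the tower representation of Fig. 5.1, and the constant `NU` is found by search rather than taken from the design. All function names are hypothetical.

```python
def gf16_mul(a: int, b: int) -> int:
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed representation)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def gf16_inv(a: int) -> int:
    """a^14 equals the inverse of a nonzero a in GF(2^4); 0 maps to 0."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r


# u^2 + u + nu is irreducible iff no t in GF(2^4) satisfies t^2 + t = nu.
REDUCIBLE = {gf16_mul(t, t) ^ t for t in range(16)}
NU = min(v for v in range(16) if v not in REDUCIBLE)


def invert(eta_h: int, eta_l: int):
    """Inverse of eta_h*u + eta_l following the form of (5.1)."""
    delta = gf16_mul(eta_h ^ eta_l, eta_l) ^ gf16_mul(NU, gf16_mul(eta_h, eta_h))
    d_inv = gf16_inv(delta)
    return gf16_mul(d_inv, eta_h), gf16_mul(d_inv, eta_h ^ eta_l)


def mul_ext(a, b):
    """(ah*u + al)(bh*u + bl) reduced with u^2 = u + NU."""
    (ah, al), (bh, bl) = a, b
    hh = gf16_mul(ah, bh)
    high = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    low = gf16_mul(NU, hh) ^ gf16_mul(al, bl)
    return high, low


# Every nonzero element times its computed inverse gives 1 = (0, 1).
all_ok = all(
    mul_ext((h, l), invert(h, l)) == (0, 1)
    for h in range(16) for l in range(16) if (h, l) != (0, 0)
)
print(all_ok)
```

The structure mirrors the datapath: one shared inversion of the common factor in $GF(2^4)$ followed by two multiplications that produce $\sigma_h$ and $\sigma_l$.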
5.2 Proposed Fault Detection Approach
The parity-based fault detection scheme has received much attention in the literature; see, for example, [85], [86], [87], [88], [89], and [90]. In such schemes, the parity of a block is predicted and compared with the actual parity of the block. The result is the error indication flag of the corresponding block, which signals the detected faults. The predicted parity of the output of the block under test, $\hat{P}$, is obtained from the input of the block, and the actual parity, $P$, is obtained from its output. The comparison between the actual and predicted parities is implemented by an XOR gate to generate the error indication flag $e = \hat{P} + P$.
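The principle can be illustrated on a toy block. In the sketch below, the block is an arbitrary 4-bit linear map chosen only for illustration (it is not one of the blocks of the S-box), and all names are hypothetical. The predicted parity is computed from the input alone, the actual parity from the possibly faulty output, and the flag is their XOR.

```python
def parity(v: int) -> int:
    """XOR of all bits of v."""
    return bin(v).count("1") & 1


# Hypothetical 4x4 binary matrix; ROWS[i] selects the inputs of output bit i.
ROWS = [0b1011, 0b0110, 0b1100, 0b0001]

# The output parity is the XOR of all rows applied to the input, so the
# prediction needs only the inputs selected by the XOR of the rows.
PRED_MASK = 0
for row in ROWS:
    PRED_MASK ^= row


def block(x: int) -> int:
    """Toy block under test: y = M x over GF(2)."""
    y = 0
    for i, row in enumerate(ROWS):
        y |= parity(row & x) << i
    return y


def error_flag(x: int, error_mask: int) -> int:
    """e = predicted parity + actual parity, with output bits flipped."""
    predicted = parity(PRED_MASK & x)
    actual = parity(block(x) ^ error_mask)
    return predicted ^ actual


# The flag is raised exactly when an odd number of output bits is flipped.
detected = sum(error_flag(5, m) for m in range(1, 16))
print(detected, "of 15 nonzero error patterns detected")
```

A single parity bit therefore catches every odd-weight error pattern and misses the even-weight ones, which is the reason a block protected by one parity bit is assigned a detection probability of about one half.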
In the presented parity-based fault detection scheme, we divide the structures of the S-box and the inverse S-box using polynomial basis into 3 blocks as shown in Fig. 5.1 so that the scheme can also be used for the low-power structures presented in [13] (see Fig. 2.4). One can obtain that for the S-box and inverse S-box presented in Fig. 5.1 [20], blocks 1 and 3 occupy around 86% of the area of the entire operations. Therefore, these two blocks are more susceptible to internal faults and more prone to fault attacks. Consequently, we propose using two-bit predicted parities for each of these two blocks. Furthermore, one predicted parity is used for block 2. The details of the proposed schemes are presented below.
5.2.1 S-box
In the proposed scheme, five predicted parities are derived for the 3 blocks of the S-box. Then, by comparing these with the five actual parities, five error indication flags are obtained. All five flags should be zero for error-free computations. The proposed fault detection scheme for the S-box is shown in Fig. 5.2. As seen in this figure, for block 1, two predicted parities, i.e., $\hat{P}^1_{b1}$ and $\hat{P}^2_{b1}$, are obtained using the parity prediction unit (PP1). As seen from Fig. 5.2, the predicted parity of the second block, $\hat{P}_{b2}$, is obtained by the parity prediction unit (PP2). Furthermore, for block 3, two predicted parities, i.e., $\hat{P}^1_{b3}$ and $\hat{P}^2_{b3}$, are derived using the parity prediction unit (PP3).
The derivations of the actual parities are also shown in Fig. 5.2. As seen from Fig. 5.2, two actual parities for the two most and least significant bits of $\gamma$, i.e., $P^1_{b1} = \sum_{i=2}^{3}\gamma_i$ and $P^2_{b1} = \sum_{i=0}^{1}\gamma_i$, have been derived from the output of block 1 using two trees of XOR gates. Similarly, as shown in Fig. 5.2, the two actual parities for block 3 are obtained from the output of block 3 for the four most and least significant bits of $Y$, i.e., $P^1_{b3} = \sum_{i=4}^{7} y_i$ and $P^2_{b3} = \sum_{i=0}^{3} y_i$. In addition, one actual parity is obtained for block 2 as $P_{b2} = \sum_{i=0}^{3}\theta_i$. Then, as shown in Fig. 5.2, by comparing the predicted and actual parities, the five error indication flags of the three blocks, i.e., $e_1$-$e_5$, are obtained.
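The five actual parities and flags can be modeled as follows. This is a sketch with hypothetical names in which the fault-free parities stand in for the outputs of PP1 to PP3; the demo values are arbitrary. It also shows why splitting the 8-bit output of block 3 into two 4-bit signatures helps against bursts: an error flipping the adjacent bits $y_4$ and $y_3$ leaves the overall parity of $Y$ unchanged but flips both half parities.

```python
def parity(v: int) -> int:
    """XOR of all bits of v."""
    return bin(v).count("1") & 1


def actual_parities(gamma: int, theta: int, y: int):
    """Return (P1_b1, P2_b1, P_b2, P1_b3, P2_b3) from the block outputs."""
    return (
        parity((gamma >> 2) & 0b11),   # two most significant bits of gamma
        parity(gamma & 0b11),          # two least significant bits of gamma
        parity(theta & 0xF),           # four bits of theta
        parity((y >> 4) & 0xF),        # four most significant bits of Y
        parity(y & 0xF),               # four least significant bits of Y
    )


def error_flags(predicted, gamma: int, theta: int, y: int):
    """Flags e1..e5: predicted parity XOR actual parity, per signature."""
    actual = actual_parities(gamma, theta, y)
    return tuple(p ^ a for p, a in zip(predicted, actual))


# Arbitrary fault-free values; the prediction is modeled by the
# fault-free actual parities (an assumption of this sketch).
gamma, theta, y = 0b1011, 0b0110, 0x3C
predicted = actual_parities(gamma, theta, y)

burst = 0b00011000                      # y4 and y3 flipped together
flags = error_flags(predicted, gamma, theta, y ^ burst)
single_parity_flag = parity(y) ^ parity(y ^ burst)
print(flags, single_parity_flag)
```

In this example the two block-3 flags are raised while a single 8-bit parity would have stayed silent.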
Figure 5.2: The proposed parity-based fault detection scheme for the S-box (resp. inverse S-box).
The following lemma from [20] is used for the multiplication in $GF((2^2)^2)$ in blocks 1 and 3. Then, using this lemma, the predicted parities for the S-box in Fig. 5.2 are derived.
Lemma 5.1 [20] Let $U = (u_3, u_2, u_1, u_0)$ and $V = (v_3, v_2, v_1, v_0)$ be the inputs of a multiplier in $GF((2^2)^2)$. Then, the result of the multiplication, i.e., $Z = UV$, is
\[
\begin{aligned}
z_3 &= u_3(v_3 + v_2 + v_1 + v_0) + u_2(v_3 + v_1) + u_1(v_3 + v_2) + u_0v_3, \\
z_2 &= u_3(v_3 + v_1) + u_2(v_2 + v_0) + u_1v_3 + u_0v_2, \\
z_1 &= u_3v_2 + u_2(v_3 + v_2) + u_1(v_1 + v_0) + u_0v_1, \\
z_0 &= u_3(v_3 + v_2) + u_2v_3 + u_1v_1 + u_0v_0.
\end{aligned} \tag{5.2}
\]
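The formulas in (5.2) can be transcribed directly and checked against the properties expected of a field multiplication. The sketch below is such a transcription with hypothetical function names; it also evaluates the parities of the two most and two least significant output bits, which is the quantity needed when a multiplier output is split into two parity signatures.

```python
from itertools import product

ELEMS = list(product((0, 1), repeat=4))   # all (u3, u2, u1, u0)
ONE = (0, 0, 0, 1)
ZERO = (0, 0, 0, 0)


def gf16_mul(u, v):
    """Z = UV in GF((2^2)^2), transcribed from (5.2)."""
    u3, u2, u1, u0 = u
    v3, v2, v1, v0 = v
    z3 = (u3 & (v3 ^ v2 ^ v1 ^ v0)) ^ (u2 & (v3 ^ v1)) ^ (u1 & (v3 ^ v2)) ^ (u0 & v3)
    z2 = (u3 & (v3 ^ v1)) ^ (u2 & (v2 ^ v0)) ^ (u1 & v3) ^ (u0 & v2)
    z1 = (u3 & v2) ^ (u2 & (v3 ^ v2)) ^ (u1 & (v1 ^ v0)) ^ (u0 & v1)
    z0 = (u3 & (v3 ^ v2)) ^ (u2 & v3) ^ (u1 & v1) ^ (u0 & v0)
    return (z3, z2, z1, z0)


def half_parities(u, v):
    """Parities of (z3, z2) and (z1, z0) computed from the inputs only."""
    u3, u2, u1, u0 = u
    v3, v2, v1, v0 = v
    high = (u3 & (v2 ^ v0)) ^ (u2 & (v3 ^ v2 ^ v1 ^ v0)) ^ (u1 & v2) ^ (u0 & (v3 ^ v2))
    low = (u3 & v3) ^ (u2 & v2) ^ (u1 & v0) ^ (u0 & (v1 ^ v0))
    return high, low


def has_unique_inverse(u):
    """True if exactly one element multiplies u to ONE."""
    return sum(gf16_mul(u, v) == ONE for v in ELEMS) == 1
```

The expressions in `half_parities` follow from adding $z_3 + z_2$ and $z_1 + z_0$ in (5.2) and cancelling the repeated terms.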
Using Lemma 5.1, we present the formulations for these five predicted parities in the following theorem.
Theorem 5.1 Let $X \in GF(2^8)$ be the input of the S-box. Then, the five predicted parities of the three blocks of the S-box in Fig. 5.2, i.e., $\hat{P}^1_{b1}$, $\hat{P}^2_{b1}$, $\hat{P}_{b2}$, $\hat{P}^1_{b3}$, and $\hat{P}^2_{b3}$, are obtained as follows:
\[
\hat{P}^1_{b1} = x_7(D + x_5) + x_4B + x_3(B + x_4) + x_0D + x_1x_2, \tag{5.3}
\]
\[
\hat{P}^2_{b1} = x_7(G + x_6) + x_4I + x_1(C + E) + x_2 \vee x_5 + P_X, \tag{5.4}
\]
\[
\hat{P}_{b2} = (\gamma_2 \vee \gamma_1)\gamma_0 + P_{\gamma_l}\gamma_3, \tag{5.5}
\]
\[
\hat{P}^1_{b3} = \theta_3H + \theta_2(G + x_7) + \theta_1(J + C) + \theta_0J, \tag{5.6}
\]
\[
\hat{P}^2_{b3} = \theta_3(C + x_0) + \theta_2(H + x_3) + \theta_1(I + x_7) + \theta_0(A + x_2), \tag{5.7}
\]
where $x_1 + x_6 = A$, $x_5 + A = B$, $x_3 + x_2 = C$, $P_X + H = D$, $x_0 + x_6 = E$, $x_2 + x_5 = F$, $F + x_4 = G$, $x_0 + x_7 = H$, $B + C = I$, and $E + F = J$. Furthermore, "+" and $\vee$ represent the modulo-2 addition using an XOR gate and the OR operation, respectively. Moreover, $P_X = \sum_{i=0}^{7} x_i$ and $P_{\gamma_l} = \gamma_1 + \gamma_0$.
Proof First, we obtain the two predicted parities of block 1, i.e., $\hat{P}^1_{b1} = \hat{P}_{\gamma_h}$ and $\hat{P}^2_{b1} = \hat{P}_{\gamma_l}$ in (5.3) and (5.4). As seen from Fig. 5.1, block 1 consists of the transformation matrix $\Psi$, a field multiplication, modulo-2 additions, and squaring followed by the multiplication by the constant $\nu$. From [20], one can obtain that for the input $\eta_h = (\eta_7, \eta_6, \eta_5, \eta_4)$, the result of the squarer-$\nu$ is
\[
\nu\eta_h^2 = (\eta_7 + \eta_4,\ \eta_7 + \eta_6 + \eta_5,\ \eta_4,\ \eta_5). \tag{5.8}
\]
Moreover, using (5.2) with the inputs $u = \eta_l$ and $v = \eta_h + \eta_l$, one can obtain the result of the field multiplication in this block. By modulo-2 adding the coordinates of $\gamma_h = (\gamma_3, \gamma_2)$ and $\gamma_l = (\gamma_1, \gamma_0)$, i.e., the two most and least significant bits of (5.8) and those of the result of the multiplication, respectively, one can obtain
\[
\hat{P}^1_{b1} = \eta_3(\eta_6 + \eta_4) + \eta_2(\eta_7 + \eta_6 + \eta_5 + \eta_4) + \eta_1\eta_6 + \eta_0(\eta_7 + \eta_6) + \eta_7 + \eta_6 + \eta_5 + \eta_2, \tag{5.9}
\]
\[
\hat{P}^2_{b1} = \eta_3\eta_7 + \eta_2\eta_6 + \eta_1\eta_4 + \eta_0(\eta_5 + \eta_4) + \eta_6 + \eta_2 + \eta_0. \tag{5.10}
\]
By substituting the coordinates of $\eta$ with those of $X$ and reordering the results in (5.9) and (5.10), one reaches the following:
\[
\hat{P}^1_{b1} = x_7(x_6 + x_4 + x_3 + x_2 + x_1) + x_4(x_6 + x_5 + x_1) + x_3(x_6 + x_5 + x_4 + x_1) + x_0(x_6 + x_5 + x_4 + x_3 + x_2 + x_1) + x_1x_2, \tag{5.11}
\]
\[
\hat{P}^2_{b1} = x_7(x_6 + x_5 + x_4 + x_2) + x_4(x_6 + x_5 + x_3 + x_2 + x_1) + x_1(x_6 + x_3 + x_2 + x_0) + x_2 \vee x_5 + P_X. \tag{5.12}
\]
Using subexpression sharing, it is straightforward to obtain (5.3) and (5.4) from (5.11) and (5.12), respectively. It is also noted that the predicted parity of block 2 in (5.5) is derived from that of block 3 in the scheme in [57], noting that $P_{\gamma_l} = \gamma_1 + \gamma_0$.
Now, we derive the two predicted parities of block 3, i.e., $\hat{P}^1_{b3} = \hat{P}_{Y_h}$ and $\hat{P}^2_{b3} = \hat{P}_{Y_l}$. As seen from Fig. 5.1, block 3 consists of the mixed inverse and affine transformation matrices and two field multiplications. It is straightforward to obtain the formulation for these mixed transformation matrices as follows:
\[
y = A\Psi^{-1}\sigma + b =
\begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}
\sigma +
\begin{pmatrix}
1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0
\end{pmatrix}. \tag{5.13}
\]
Eventually, $\hat{P}_{Y_h}$ and $\hat{P}_{Y_l}$, i.e., the two predicted parities of block 3 in Fig. 5.2, are obtained as $\hat{P}_{Y_h} = \sigma_6 + \sigma_5 + \sigma_3 + \sigma_1 + \sigma_0$ and $\hat{P}_{Y_l} = \sigma_5 + \sigma_4 + \sigma_3 + \sigma_2$. Then, by multiplying $u = \theta$ and $v = \eta_h + \eta_l$ and also $u = \theta$ and $v = \eta_h$ using (5.2), one can obtain the coordinates of $\sigma$. Substituting these in the above, the following is obtained for the two predicted parities of block 3 of the S-box in Fig. 5.2:
\[
\hat{P}^1_{b3} = \theta_3(x_7 + x_0) + \theta_2(x_7 + x_5 + x_4 + x_2) + \theta_1(x_6 + x_5 + x_3 + x_0) + \theta_0(x_6 + x_5 + x_2 + x_0), \tag{5.14}
\]
\[
\hat{P}^2_{b3} = \theta_3(x_3 + x_2 + x_0) + \theta_2(x_7 + x_3 + x_0) + \theta_1(x_7 + x_6 + x_5 + x_3 + x_2 + x_1) + \theta_0(x_6 + x_2 + x_1). \tag{5.15}
\]
Then, using subexpression sharing for (5.14) and (5.15), one can obtain (5.6) and (5.7) and the proof is complete.
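Two steps of the proof lend themselves to an exhaustive software check: the subexpression sharing that turns (5.11), (5.12), (5.14), and (5.15) into (5.3), (5.4), (5.6), and (5.7), and the reading of $\hat{P}_{Y_h}$ and $\hat{P}_{Y_l}$ from the matrix in (5.13). The sketch below uses hypothetical names and two assumptions: the rows of the matrix are taken as $y_0$ to $y_7$ from top to bottom with columns $\sigma_0$ to $\sigma_7$ from left to right, and the OR term is used exactly as printed in both forms.

```python
from itertools import product


def shared(x, t):
    """(5.3), (5.4), (5.6), (5.7) with the shared subexpressions A..J."""
    x0, x1, x2, x3, x4, x5, x6, x7 = x
    t0, t1, t2, t3 = t
    PX = x0 ^ x1 ^ x2 ^ x3 ^ x4 ^ x5 ^ x6 ^ x7
    A = x1 ^ x6; B = x5 ^ A; C = x3 ^ x2; H = x0 ^ x7; D = PX ^ H
    E = x0 ^ x6; F = x2 ^ x5; G = F ^ x4; I = B ^ C; J = E ^ F
    return (
        (x7 & (D ^ x5)) ^ (x4 & B) ^ (x3 & (B ^ x4)) ^ (x0 & D) ^ (x1 & x2),
        (x7 & (G ^ x6)) ^ (x4 & I) ^ (x1 & (C ^ E)) ^ (x2 | x5) ^ PX,
        (t3 & H) ^ (t2 & (G ^ x7)) ^ (t1 & (J ^ C)) ^ (t0 & J),
        (t3 & (C ^ x0)) ^ (t2 & (H ^ x3)) ^ (t1 & (I ^ x7)) ^ (t0 & (A ^ x2)),
    )


def expanded(x, t):
    """(5.11), (5.12), (5.14), (5.15) written out in full."""
    x0, x1, x2, x3, x4, x5, x6, x7 = x
    t0, t1, t2, t3 = t
    PX = x0 ^ x1 ^ x2 ^ x3 ^ x4 ^ x5 ^ x6 ^ x7
    return (
        (x7 & (x6 ^ x4 ^ x3 ^ x2 ^ x1)) ^ (x4 & (x6 ^ x5 ^ x1))
        ^ (x3 & (x6 ^ x5 ^ x4 ^ x1)) ^ (x0 & (x6 ^ x5 ^ x4 ^ x3 ^ x2 ^ x1)) ^ (x1 & x2),
        (x7 & (x6 ^ x5 ^ x4 ^ x2)) ^ (x4 & (x6 ^ x5 ^ x3 ^ x2 ^ x1))
        ^ (x1 & (x6 ^ x3 ^ x2 ^ x0)) ^ (x2 | x5) ^ PX,
        (t3 & (x7 ^ x0)) ^ (t2 & (x7 ^ x5 ^ x4 ^ x2))
        ^ (t1 & (x6 ^ x5 ^ x3 ^ x0)) ^ (t0 & (x6 ^ x5 ^ x2 ^ x0)),
        (t3 & (x3 ^ x2 ^ x0)) ^ (t2 & (x7 ^ x3 ^ x0))
        ^ (t1 & (x7 ^ x6 ^ x5 ^ x3 ^ x2 ^ x1)) ^ (t0 & (x6 ^ x2 ^ x1)),
    )


sharing_ok = all(
    shared(x, t) == expanded(x, t)
    for x in product((0, 1), repeat=8) for t in product((0, 1), repeat=4)
)

# Matrix of (5.13); assumed order: rows y0..y7, columns sigma0..sigma7.
M = ["11100011", "10000001", "10111110", "11100000",
     "11001001", "00100001", "00001111", "00110001"]


def parity_support(rows):
    """Indices j for which sigma_j appears in the XOR of the given rows."""
    return {j for j in range(8) if sum(int(r[j]) for r in rows) % 2}


print(sharing_ok, parity_support(M[4:]), parity_support(M[:4]))
```

Under the assumed ordering, the column parities of the lower and upper four rows select $\{\sigma_6, \sigma_5, \sigma_3, \sigma_1, \sigma_0\}$ and $\{\sigma_5, \sigma_4, \sigma_3, \sigma_2\}$, matching $\hat{P}_{Y_h}$ and $\hat{P}_{Y_l}$ as stated in the proof.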
5.2.2 Inverse S-box
As seen in Fig. 5.2, similar to the S-box, for blocks 1-3 of the inverse S-box, five predicted parities are derived using the parity prediction units. This is also depicted in Fig. 5.2. It is noted that the notations for the inverse S-box are denoted by parentheses to distinguish them from those for the S-box. Additionally, similar to the S-box, the actual parities of the three blocks for the inverse S-box are derived using XOR trees. It is noted that for blocks 1 and 3, the actual parities are obtained as $P^1_{b1} = \sum_{i=2}^{3}\gamma'_i$ and $P^2_{b1} = \sum_{i=0}^{1}\gamma'_i$ for block 1 and $P^1_{b3} = \sum_{i=4}^{7} x_i$ and $P^2_{b3} = \sum_{i=0}^{3} x_i$ for block 3. Then, as seen in Fig. 5.2, by comparing the predicted and actual parities, five error indication flags of the three blocks, i.e., $e'_1$-$e'_5$, are obtained.
Using Lemma 5.1 and considering Theorem 5.1, we present the formulations for the five predicted parities of the inverse S-box for the 3 blocks shown in Fig. 5.2 in the following theorem.
Theorem 5.2 Let $Y \in GF(2^8)$ be the input of the inverse S-box. The five predicted parities of the three blocks of the inverse S-box in Fig. 5.2 are obtained as follows:
\[
\hat{P}^1_{b1} = y_0e + y_5(y_4 + y_3 + a) + y_2b + y_7y_4 + b, \tag{5.16}
\]
\[
\hat{P}^2_{b1} = y_1(y_7 + y_5 + h) + y_2a + y_3(y_5 + y_4) + y_5h + y_0 + e, \tag{5.17}
\]
\[
\hat{P}_{b2} = (\gamma'_2 \vee \gamma'_1)\gamma'_0 + P_{\gamma'_l}\gamma'_3, \tag{5.18}
\]
\[
\hat{P}^1_{b3} = \theta'_3(f + 1) + \theta'_2(P_Y + d + y_7 + 1) + \theta'_1(c + y_7 + y_4 + 1) + \theta'_0(a + y_4 + y_2 + 1), \tag{5.19}
\]
\[
\hat{P}^2_{b3} = \theta'_3(y_1 + d) + \theta'_2(y_0 + g + 1) + \theta'_1(y_6 + g + 1) + \theta'_0(y_1 + f), \tag{5.20}
\]
where $y_6 + y_7 = a$, $y_1 + a = b$, $y_1 + y_2 = c$, $y_3 + y_6 = d$, $c + d = e$, $P_Y + y_4 + y_6 = f$, $P_Y + y_2 = g$, and $y_4 + y_0 = h$. Furthermore, "+" and $\vee$ represent the modulo-2 addition using an XOR gate and the OR operation, respectively. Moreover, $P_Y = \sum_{i=0}^{7} y_i$ and $P_{\gamma'_l} = \gamma'_1 + \gamma'_0$.
Proof As seen in Fig. 5.2, the S-box and the inverse S-box share block 2. Therefore, the predicted parity of this block is the same for them.
Now, we obtain the two predicted parities of block 1, i.e., $\hat{P}^1_{b1}$ and $\hat{P}^2_{b1}$ in (5.16) and (5.17). As seen from Fig. 5.1, block 1 consists of the transformation matrix $\Psi$ preceded by the inverse affine transformation. Moreover, as seen in Fig. 5.1, similar to the S-box, a field multiplication, modulo-2 additions, and squaring followed by the multiplication by the constant $\nu$ are utilized in this block. Similar to the S-box, using (5.2) with the inputs $u = \sigma_l$ and $v = \sigma_h + \sigma_l$, one can obtain the result of the field multiplication in this block. Moreover, one can obtain that the result of the squarer-$\nu$ in Fig. 5.1 is
\[
\nu\sigma_h^2 = (\sigma_7 + \sigma_4,\ \sigma_7 + \sigma_6 + \sigma_5,\ \sigma_4,\ \sigma_5). \tag{5.21}
\]
By modulo-2 adding the two most and least significant bits of the result of the squarer-$\nu$ in (5.21) and those of the result of the multiplication, respectively, one can obtain
\[
\hat{P}^1_{b1} = \sigma_3(\sigma_6 + \sigma_4) + \sigma_2(\sigma_7 + \sigma_6 + \sigma_5 + \sigma_4) + \sigma_1\sigma_6 + \sigma_0(\sigma_7 + \sigma_6) + \sigma_7 + \sigma_6 + \sigma_5 + \sigma_2, \tag{5.22}
\]
\[
\hat{P}^2_{b1} = \sigma_3\sigma_7 + \sigma_2\sigma_6 + \sigma_1\sigma_4 + \sigma_0(\sigma_5 + \sigma_4) + \sigma_6 + \sigma_2 + \sigma_0. \tag{5.23}
\]
One can substitute the coordinates of $\sigma$ with those of $Y$. This is performed by utilizing the following as the result of mixing the inverse affine and transformation matrices:
\[
\sigma = \Psi A^{-1}y + \Psi A^{-1}b =
\begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 0 & 0 & 1 & 1
\end{pmatrix}
y +
\begin{pmatrix}
1 \\ 0 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 0
\end{pmatrix}. \tag{5.24}
\]
Then, by reordering the results in (5.22) and (5.23), the following is derived:
\[
\hat{P}^1_{b1} = y_0(y_6 + y_3 + y_2 + y_1) + y_5(y_7 + y_6 + y_4 + y_3) + y_2(y_7 + y_6 + y_1) + y_7y_4 + y_7 + y_6 + y_1, \tag{5.25}
\]
\[
\hat{P}^2_{b1} = y_1(y_7 + y_5 + y_4 + y_0) + y_2(y_7 + y_6) + y_3(y_5 + y_4) + y_5(y_4 + y_0) + y_6 + y_3 + y_2 + y_1 + y_0. \tag{5.26}
\]
Using subexpression sharing, it is straightforward to obtain (5.16) and (5.17) from (5.25) and (5.26), respectively.
Now, we derive the two predicted parities of block 3 of the inverse S-box in Fig. 5.2, i.e., $\hat{P}^1_{b3} = \hat{P}_{X_h}$ and $\hat{P}^2_{b3} = \hat{P}_{X_l}$. As seen from Fig. 5.1, block 3 consists of the inverse transformation and two field multiplications. Considering the inverse transformation matrix, it is straightforward to obtain $\hat{P}_{X_h}$ and $\hat{P}_{X_l}$ as $\hat{P}_{X_h} = \eta_7 + \eta_5 + \eta_4 + \eta_1$ and $\hat{P}_{X_l} = \eta_7 + \eta_6 + \eta_5 + \eta_2 + \eta_0$. Then, by multiplying $u = \theta'$ and $v = \sigma_h + \sigma_l$ and also $u = \theta'$ and $v = \sigma_h$ using (5.2), the coordinates of $\eta$ are obtained. Substituting these in the above, the following is derived:
\[
\hat{P}^1_{b3} = \theta'_3(P_Y + y_6 + y_4 + 1) + \theta'_2(P_Y + y_7 + y_6 + y_3 + 1) + \theta'_1(y_7 + y_4 + y_2 + y_1 + 1) + \theta'_0(y_7 + y_6 + y_4 + y_2 + 1), \tag{5.27}
\]
\[
\hat{P}^2_{b3} = \theta'_3(y_6 + y_3 + y_1) + \theta'_2(P_Y + y_2 + y_0 + 1) + \theta'_1(P_Y + y_6 + y_2 + 1) + \theta'_0(P_Y + y_6 + y_4 + y_1). \tag{5.28}
\]
Then, using subexpression sharing for (5.27) and (5.28), one can obtain (5.19) and (5.20) and the proof is complete.
5.2.3 Merged S-box and Inverse S-box
In some low-complexity implementations that perform either encryption or decryption at a
time, the multiplicative inversions of the S-box and the inverse S-box are shared (see, for
example, the joint encrypter/decrypter in [20] and [22] and the merged encryption and
decryption S-boxes/inverse S-boxes in [21]). The multiplicative inversion in the finite field
GF(2^8) is needed for both the S-box and the inverse S-box. Therefore, one can merge them in
order to reuse the multiplicative inversion and its parity predictions. It is noted that
when there is no need to utilize both the S-box and the inverse S-box at the same time,
this merged structure leads to a low-area design. Fig. 5.3 shows the merged S-box (SB)
and inverse S-box (ISB) and their corresponding predicted parities for the three blocks.
As seen in this figure, the multiplicative inversion in Fig. 5.1 is used for both the S-box
and the inverse S-box. On the other hand, as seen in Fig. 5.3, two multiplexers are used
for choosing the transformation matrix and the inverse and affine transformations (for the
S-box with the select input SB = 1) and the inverse affine and transformation matrices
and the inverse transformation (for the inverse S-box with the select input ISB = 1). The
parity prediction unit is also shown in Fig. 5.3. As seen in this figure, these multiplexers
also choose between the predicted parities of blocks 1 and 3 for the S-box and the inverse
S-box. As a result, a parity-based fault detection merged structure is obtained.
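The sharing of the multiplicative inversion described above can be illustrated with a small
behavioral sketch. It is not the composite field datapath of Fig. 5.3: as a simplifying
assumption, the inversion is computed directly in GF(2^8) with the AES polynomial, and the
names gf_mul, gf_inv, and merged_sbox are hypothetical helpers introduced only for this
illustration. The select argument plays the role of the multiplexer select input.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES field polynomial


def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return r


def gf_inv(a):
    """Multiplicative inversion as a^254 (0 is mapped to 0, as in the AES)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def rotl8(x, n):
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x):
    """AES affine transformation (S-box output stage)."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def inv_affine(y):
    """AES inverse affine transformation (inverse S-box input stage)."""
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05


def merged_sbox(x, sb):
    """sb=True selects the S-box path, sb=False the inverse S-box path."""
    pre = x if sb else inv_affine(x)   # input multiplexer
    inv = gf_inv(pre)                  # the single shared inversion
    return affine(inv) if sb else inv  # output multiplexer
```

For instance, merged_sbox(0x53, True) evaluates the S-box and merged_sbox(y, False) undoes it,
while only one inversion routine is ever instantiated.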
Figure 5.3: Merged S-box (SB) and inverse S-box (ISB) and the corresponding predicted
parities for different blocks.
5.2.4 Complexity Analysis
In what follows, we obtain the hardware and time complexities of the proposed schemes
for the S-box and the inverse S-box. We use two-input gates in the implementation of
the predicted parities of the proposed schemes. We have obtained the number of gates
needed for implementing the predicted parities of the S-box in (5.3)-(5.7) as 33 XORs,
19 NANDs, 2 XNORs, and one NOR gate. Moreover, for the inverse S-box, one needs 40
XORs and 19 NANDs to implement (5.16)-(5.22). Furthermore, for obtaining the actual
parities of blocks 1-3, 2 XORs (one XOR for each of $P^1_{b1}$ and $P^2_{b1}$), 3 XORs (for
$P_{b2}$), and 6 XORs (three XORs for each of $P^1_{b3}$ and $P^2_{b3}$) are needed, respectively.
Moreover, 5 XOR gates are used for comparing the five predicted and actual parities to obtain
the indication flags. Later in this chapter, through ASIC implementations, we derive the
chip area of the proposed schemes for the S-box and the inverse S-box. Furthermore, the
area, critical path delay, and power consumption overheads are derived.
The timing overhead of the proposed scheme can be overlapped by the time needed
for performing the operations in blocks 1-3. In other words, as seen in Fig. 5.2, the
predicted parities are obtained concurrently with the time needed for the blocks. Table
5.1 presents the details of the timings of the three blocks for the S-box and the inverse
S-box (presented in Fig. 5.1) as well as those for obtaining the predicted parities of
these blocks. As seen in this table, for all the blocks, the times needed for deriving the
predicted parities are less than those of the operations. Therefore, no overhead exists for
obtaining these predicted parities. It is also noted that the actual parities are obtained in
the time allotted to the next block. Therefore, the only timing overhead is for obtaining
the actual parity of block 3 and comparing it with the corresponding predicted parity
(see Fig. 5.2). These are equal to $2T_X$ and $1T_X$, respectively. Therefore, the total timing
overhead is $3T_X$ for both operations.
Table 5.1: The timing details of the proposed concurrent scheme for the S-box and the
inverse S-box.
              | Block 1                          | Block 2                          | Block 3
Operation     | original       predicted parity  | original       predicted parity  | original       predicted parity
S-box         | 10T_X + 1T_A   5T_X + 1T_A       | 3T_X + 2T_A    2T_X + 1T_A       | 8T_X + 1T_A    6T_X + 1T_A
Inverse S-box | 10T_X + 1T_A   5T_X + 1T_A       | 3T_X + 2T_A    2T_X + 1T_A       | 7T_X + 1T_A    4T_X + 1T_A
The implementations of the S-box and the inverse S-box using composite fields are
area efficient in comparison with those using LUTs. Moreover, the critical path delay
can be reduced using sub-pipelining. In [17], sub-pipelining of the S-box and the inverse
S-box is done by placing one, two, and three-stage registers between the blocks.
Although the sub-pipelining techniques used in [17] are based on the implementations
of the S-box and the inverse S-box over GF((2^4)^2), similar pipelining techniques can be
used for the composite field GF(((2^2)^2)^2) (see, for example, [22]). The proposed fault
detection scheme can take advantage of sub-pipelining without adding delay to the original
pipelined structure. In the pipelined fault detection scheme, we use the parity prediction
units of each pipelined block and obtain the error indication flag. According to Table
5.1, one can observe that the critical path delay of the predicted parity bits of each
block of the S-box and the inverse S-box is less than the critical path delay of that block.
Therefore, we can use the parity prediction schemes in the pipelined structures of the
blocks without affecting the frequency of the clock signal; the predicted parity bits of the
blocks are obtained in the same clock cycle as the outputs of the blocks are calculated.
Calculating the actual parity and comparing it with the predicted parity to obtain the error
indication flag can be done in the next clock cycle. Using the above-mentioned pipelined
structure, one can see that the time overhead will be only one extra clock cycle which may
be overlapped with other computations in the pipelined fault detection implementation
of AES.
5.3 Simulation Results
In the following, we evaluate the proposed fault detection scheme for single stuck-at
errors, burst faults, and multiple random faults to model both natural faults and fault
attacks.
The single stuck-at errors are at the output of the S-box (the inverse S-box). Such
errors are covered 100% by the proposed scheme, which matches the coverage of the schemes
in [57] and [78]. However, due to technological constraints, injecting single stuck-at
errors may not be applicable in practice [82]. Therefore, we rely on simulations to
consider both the burst and the multiple permanent and transient faults, the details of
which are presented in the following.
Burst Faults
Although the fault attacker gains more information through injecting single faults, due
to the technological constraints, injecting single stuck-at faults may not be applicable in
the practical fault attacks [82]. Therefore, in realistic fault attacks, multiple adjacent
bits are actually flipped. Moreover, natural failures can be of the correlated type causing
neighboring faults [82]. Consequently, in what follows, we consider the fault detection
capability of the proposed scheme for neighboring faults referred to as burst faults.
Because of the nonlinear structure of the S-box (resp. the inverse S-box), the burst
faults in a block of the S-box (resp. the inverse S-box) appear as random multiple errors
at the output of that block. Moreover, the burst faults that occur in two adjacent blocks
appear as multiple random errors at the outputs of the adjacent blocks. For deriving
the burst fault detection capability of the proposed scheme, we have performed error
simulations for blocks 1-3 of the S-box and the inverse S-box in Fig. 5.2; the details of
which are presented in the following.
Linear Feedback Shift Registers (LFSRs) are used for injecting the errors at the output
of one block or two adjacent blocks for modeling the burst faults. The stuck-at error
model used forces multiple output bits to be stuck at logic one (for stuck-at one) or zero
(for stuck-at zero) independent of the error-free values. We use the Fibonacci implementation
of the LFSR with 4 (for the outputs of blocks 1 and 2) or 8 (for the random input and
output of block 3) output taps for injecting the errors, where the numbers, locations,
and types of the errors are randomly chosen. In this regard, according to the maximum
sequence length taps presented in [91], the maximum sequence length polynomials for the
feedback are selected as $L_1(X) = X^4 + X$ and $L_2(X) = X^8 + X^4 + X^3 + X^2$ for the 4 and
8 output taps, respectively. Moreover, for our simulations, we use ModelSim SE
6.2d [75]. We have injected 100,000 burst faults at the outputs of the blocks for 100,000
random 8-bit inputs of the S-box and the inverse S-box. Then, we have used the five
error indication flags at the outputs of the three blocks of the S-box and the inverse S-box to
detect the burst faults. The results of our simulations show that for the S-box and the
inverse S-box 71,257 and 72,321 of the faults are detected, respectively. This yields
71.3% and 72.3% burst fault detection capabilities for these two structures, respectively.
It is noted that these are higher compared to the scheme in [57] for the original S-box
and the one in [78], which have the burst fault detection capability of close to 50%. The
complete comparison of the fault detection capabilities of the proposed schemes and the
previous ones is presented in the next section.
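As a rough illustration of the Fibonacci LFSRs used for the error injection, the following
sketch steps a software LFSR and measures its sequence length. It assumes that the feedback
polynomials are read as tap positions, e.g., $X^4 + X$ as taps 4 and 1; the functions lfsr_step
and lfsr_period are hypothetical names for this illustration and do not reproduce the
simulation testbench.

```python
def lfsr_step(state, taps, width):
    """Advance a Fibonacci LFSR by one step.

    taps are 1-indexed bit positions whose XOR forms the feedback bit;
    the feedback is shifted into the least significant position.
    """
    feedback = 0
    for t in taps:
        feedback ^= (state >> (t - 1)) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)


def lfsr_period(seed, taps, width):
    """Number of steps until a nonzero seed value reappears."""
    state = lfsr_step(seed, taps, width)
    steps = 1
    while state != seed:
        state = lfsr_step(state, taps, width)
        steps += 1
    return steps


L1_TAPS = (4, 1)            # assumed reading of X^4 + X
L2_TAPS = (8, 4, 3, 2)      # assumed reading of X^8 + X^4 + X^3 + X^2

period_4 = lfsr_period(0x1, L1_TAPS, 4)
period_8 = lfsr_period(0x9D, L2_TAPS, 8)
```

Under this reading, both registers run through every nonzero state before repeating, which is
the maximum sequence length property the taps are selected for.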
Multiple Faults
The fault detection capability of the presented scheme depends on the number of the
S-box and the inverse S-box blocks and the number of the predicted parities used for
them. Two predicted parities have been used for blocks 1 and 3 of the S-box and the
inverse S-box which constitute much of the area. Because at least one predicted parity
is used for each block of the S-box and the inverse S-box, all odd numbers of errors in
each of the three blocks can be detected using the error indication flags. The error indication
flags of blocks 1 and 3 can also detect certain even numbers of errors comprising two odd
numbers of errors in the two partitions of these blocks. In the remainder of this section, it is
shown that for the entire SubBytes, the error coverage is very close to 100% (99.998%).
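The detection property stated above can be sketched in a few lines. The example below is an
illustration under assumptions, not the hardware of Fig. 5.2: it takes a 4-bit block output,
assumes the two parities cover the two most and the two least significant bits, and uses the
hypothetical names parity, single_flag, and partition_flags.

```python
def parity(value):
    """Modulo-2 sum (XOR) of the bits of a non-negative integer."""
    return bin(value).count("1") & 1


def single_flag(predicted, actual4):
    """One error indication flag for the whole 4-bit output."""
    return predicted ^ parity(actual4)


def partition_flags(predicted_hi, predicted_lo, actual4):
    """Two flags, one per 2-bit partition of the 4-bit output."""
    hi = (actual4 >> 2) & 0b11
    lo = actual4 & 0b11
    return (predicted_hi ^ parity(hi), predicted_lo ^ parity(lo))


good = 0b1001
p_all = parity(good)                 # predicted parity of the whole output
p_hi, p_lo = parity(good >> 2), parity(good & 0b11)

one_error = good ^ 0b0100            # odd number of errors
split_pair = good ^ 0b0101           # one error in each partition
same_pair = good ^ 0b0011            # two errors in the same partition
```

Here single_flag raises on one_error but misses both double errors, whereas partition_flags
also catches split_pair; same_pair escapes both, which is why the coverage stays below 100%.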
For the randomly distributed multiple faults in the entire S-box and inverse S-box,
the fault detection capabilities can be obtained. It is noted that in our simulations,
we use a transient stuck-at error model. Nonetheless, the simulation results are also
the same for the permanent errors, including the permanent internal failures and the
malicious fault attacks aiming at destroying the chip. Similar to the burst faults, we
use LFSRs for injecting the errors. This is performed using a 16-output tap LFSR for
injecting the random multiple errors at the outputs of the three blocks utilizing
$L_3(X) = X^{16} + X^{12} + X^3 + X$ and an 8-bit LFSR for applying the random input of the
S-box or the inverse S-box using $L_2(X) = X^8 + X^4 + X^3 + X^2$ [91].
Table 5.2: Fault detection capabilities of the proposed schemes after injecting 1,000,000
random multiple faults.
Operation     | Initial values               | Detected | Fault coverage (%)
S-box         | L2 = {9D}h, L3 = {AFA2}h     | 966,324  | ≈ 97
S-box         | L2 = {B0}h, L3 = {3DA9}h     | 972,198  | ≈ 97
S-box         | L2 = {73}h, L3 = {2BBF}h     | 968,775  | ≈ 97
Inverse S-box | L2 = {9D}h, L3 = {AFA2}h     | 977,760  | ≈ 98
Inverse S-box | L2 = {B0}h, L3 = {3DA9}h     | 969,139  | ≈ 97
Inverse S-box | L2 = {73}h, L3 = {2BBF}h     | 971,815  | ≈ 97
The results of our simulations for three different initial values of the LFSRs with the L2 and
L3 polynomials are presented in Table 5.2. As seen in this table, after injecting 1,000,000
random multiple faults, the fault detection capabilities for one S-box or inverse S-box are
close to 97%. It is interesting to note that for the entire SubBytes or inverse SubBytes,
i.e., 16 S-boxes or inverse S-boxes, respectively, injecting this number of multiple faults
resulted in a fault detection of very close to 100% (99.998%). As a matter of fact,
in this case, the faults are detected by the 5 × 16 = 80 flags for the entire SubBytes
or inverse SubBytes transformations, yielding approximately complete fault detection
capabilities, i.e., approximately $100(1 - 2^{-80})$%.
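The expression above can be evaluated for other flag counts with a short sketch. It relies on
the idealized assumption, which is only a model and not a simulation result, that each
parity flag independently misses a random multiple fault with probability 1/2; the function
name random_fault_coverage is hypothetical.

```python
from fractions import Fraction


def random_fault_coverage(num_flags):
    """Idealized coverage 1 - 2^(-k) for k independent parity flags."""
    return Fraction(1) - Fraction(1, 2 ** num_flags)


one_sbox = random_fault_coverage(5)       # five flags of one S-box
sub_bytes = random_fault_coverage(5 * 16)  # 80 flags of the entire SubBytes

one_sbox_percent = float(one_sbox) * 100
sub_bytes_percent = float(sub_bytes) * 100
```

Under this model, five flags give 96.875%, which is in line with the coverage of close to 97%
observed in Table 5.2 for a single S-box or inverse S-box.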
5.4 ASIC Implementations and Comparisons
In this section, we present the results of the syntheses we have performed for the proposed
and previously presented fault detection schemes of the S-box and the inverse S-box. We
have used the STM 65-nm CMOS standard technology [74] for the syntheses. Moreover,
VHDL has been used as the design entry to the Synopsys Design Vision [73]. We have
set the target frequency as 500 MHz, 1 GHz, and 1.1 GHz corresponding to the delays of
2 ns, 1 ns, and 0.91 ns, respectively. Using Synopsys Design Vision, we have obtained the
maximum target frequency in which our fault detection structure can operate without
violating the timing constraints. This maximum target frequency has been obtained as
1.1 GHz in the 65-nm technology. The proposed fault detection schemes and the ones
presented in [34], [36], [38], [39], [42], [57], [78], [79], and [92] have been synthesized and
their areas, delays and power consumptions are derived. The results for different target
frequencies are shown in Table 5.3 (for the S-box) and Table 5.4 (for the inverse S-box).
As seen in these tables, areas (μm^2), critical path delays (ns), total power consumptions
(μW), and fault coverages (%) are shown. In the following, the syntheses details of the
structures are explained.
As seen in Table 5.3 for the S-box, the first three schemes, i.e., the schemes presented
in [34], [42], [39], and [36], use the LUT S-box in their structures. The schemes in [34]
and [42] use the S-box followed by the inverse S-box. These can be implemented using
two 256×8 LUTs. Then, the result is compared with the input to detect the faults in the
structure of the S-box or the inverse S-box. It is noted that although its fault detection
capability reaches 100%, this method has the critical path delay and the area overheads
of close to 100%. Furthermore, as seen in Table 5.3, because of the use of the LUT S-box,
areas and power consumptions are higher than the schemes using composite fields.
Table 5.3: Comparing the areas, critical path delays, power consumptions, and fault detection
capabilities of the proposed and previously presented fault detection schemes for the S-box
using the 65-nm CMOS standard technology. Areas are in μm^2, delays in ns, and total powers
in μW.
                                                     | Target frequency: 500 MHz       | Target frequency: 1 GHz         | Target frequency: 1.1 GHz       | Fault coverage (%)
Fault detection scheme                               | Area       Delay  Total power   | Area       Delay  Total power   | Area       Delay  Total power   | Burst faults      Multiple faults
Redun. units [34], United S-box [42] (LUTs)          | 52.3×10^3  1.23   7.2×10^3      | 54.2×10^3  0.95   15.4×10^3     | 54.7×10^3  0.87   16.9×10^3     | 100               100
Parity-based scheme in [39] (256×9 LUT)              | 29.5×10^3  0.59   4.3×10^3      | 29.5×10^3  0.59   8.4×10^3      | 29.5×10^3  0.59   9.5×10^3      | ≈50 (SubB.)       ≈50 (SubB.)
Parity-based scheme in [36] (512×9 LUT)              | 57.1×10^3  0.68   7.8×10^3      | 57.1×10^3  0.68   15.6×10^3     | 57.1×10^3  0.68   17.1×10^3     | ≈50               ≈50
Multiplication approach in [38] (polynomial basis)   | 876        1.88   630.3         | 1829       0.96   3000.7        | 2121       0.88   3600.1        | ≈75 (mult. inv.)  ≈75 (mult. inv.)
Struc.-independent scheme in [92] (polynomial basis) | 754        1.90   574.9         | 1459       0.97   2263.8        | 1763       0.87   2902.5        | ≈50               ≈50
Scheme in [57] for original S-box (polynomial basis) | 881        1.92   607.7         | 1748       0.96   2709.4        | Target is not achieved          | ≈50               ≈97
Parity-based scheme in [79] (polynomial basis)       | 865        1.82   616.2         | 1645       0.96   2507.8        | 1742       0.88   2921.8        | ≈50               ≈97
Scheme in [78] (normal basis)                        | 858        1.90   620.0         | 1755       1.0    2672.9        | Target is not achieved          | ≈50               ≈97
This Chapter (polynomial basis)                      | 953        1.80   712.3         | 1683       0.95   2600.2        | 1730       0.87   2912.2        | 71.3              ≈97
Additionally, the schemes in [39] and [36] use the error detecting codes (parity) for the
LUT S-box, where the S-box is expanded. Similar to the schemes in [34] and [42], using the
LUT S-box increases the areas and power consumptions of these schemes considerably.
In the low-cost scheme presented in [39], the modulo-2 addition of the predicted parities
of the input and output of the S-box along with the S-box itself are stored in a 256×9
LUT. Then, a comparison with the actual parities is performed for deriving the error
indication flags. As seen in Table 5.3, the burst and multiple fault detection capabilities
of this scheme for the entire SubBytes (not each S-box) are around 50%. The parity-
based scheme presented in [36] utilizes a 512×9 LUT to store the predicted parities as
well as the output of the S-box. This results in reaching the burst and multiple fault
detection capability of approximately 50% for each S-box at the cost of more area and
power consumption and slightly more delay compared to the scheme in [39].
As presented in Table 5.3, the last six fault detection schemes use the S-box using
composite fields, represented either in polynomial basis or normal basis. It is noteworthy
that sub-pipelining of these fault detection S-boxes has not been performed and these
syntheses are only intended to compare the different presented schemes. The scheme in
[38] uses two flags for the fault detection of the non-linear part of the S-box, i.e., the
multiplicative inversion. This is performed by comparing the result of multiplying the
input and the output of the multiplicative inversion with the actual result, i.e., {01}_2. As
seen in Table 5.3, this yields a fault detection capability of approximately 75%. The
structure-independent scheme in [92] uses one-bit parity in the multiplication scheme for
obtaining the fault detection capability of around 50% for the S-box. Although the fault
detection capability is less than that of [38], as seen in Table 5.3, better area and power
consumption results are obtained.
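The idea behind the multiplication approach, i.e., checking that the input of the inversion
multiplied by its output gives the identity, can be sketched as follows. This is only a
behavioral illustration under assumptions: it works directly in GF(2^8) with the AES
polynomial instead of the composite field and two-flag structure of [38], it treats the zero
input as a separate case, and the names gf_mul, gf_inv, and inversion_check are hypothetical.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES field polynomial


def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return r


def gf_inv(a):
    """Multiplicative inversion as a^254 (0 is mapped to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def inversion_check(x, claimed_inverse):
    """Return True when the claimed inverse passes the x * x^-1 = {01} check."""
    if x == 0:
        # Assumption for this sketch: the zero input must map to zero.
        return claimed_inverse == 0
    return gf_mul(x, claimed_inverse) == 0x01


fault_free = inversion_check(0x53, gf_inv(0x53))
faulty = inversion_check(0x53, gf_inv(0x53) ^ 0x04)  # one flipped output bit
```

Since the inverse of a nonzero element is unique, any erroneous inversion output fails this
check in the sketch; faults outside the inversion are not covered by it.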
The results for the proposed scheme in this chapter are shown in bold face in Table
5.3. As depicted in the table, for the target frequency of 1.1 GHz, the proposed scheme in
this chapter for the S-box has the least area, power consumption, and critical path delay
among the schemes that have similar or slightly more fault detection capabilities, i.e., the
schemes presented in [34], [42], [57], [79] and [78]. Specifically, compared to the schemes
presented in [57], [78], and [79], for the low frequency of 500 MHz, the presented scheme
in this chapter is faster at the expense of more area. Nonetheless, as seen from the table,
the maximum target frequency of 1.1 GHz cannot be achieved for the schemes of [57]
and [78]. Nevertheless, in higher frequencies, e.g., 1.1 GHz in Table 5.3, the presented
scheme outperforms the one proposed in [79] in terms of area, power consumption and
delay. It is also noted that the schemes proposed in [57], [78], and [79] yield a fault
detection capability of around 50% for the burst faults which is less compared to the
presented scheme in this chapter.
It is also noted that compared to the schemes with lower fault detection capability
in Table 5.3, for this maximum target frequency, the proposed scheme is more compact.
Moreover, it has less power consumption except for the scheme presented in [92].
Nonetheless, the fault detection capabilities of the structure-independent scheme in [92]
for burst and multiple faults are around 50%, i.e., approximately half of that of the
proposed scheme for the multiple faults and less for burst faults. Finally, using sub-
pipelining, the critical path delay of the proposed scheme can be considerably reduced.
This can result in even better critical path delays compared to the schemes using LUTs
at the expense of more hardware utilizations for the pipelining registers. It is noted
that the sub-pipelined composite field structures are still much more compact than the
schemes taking advantage of LUTs.
We have also implemented the proposed scheme for the inverse S-box for the three
target frequencies; the results of which are presented in Table 5.4 in bold face. As seen
in this table, in addition, the schemes for the inverse S-box presented in [34], [42], [36],
[39], [92], [57], [79] and [78] have been synthesized and their areas, delays and power
consumptions are derived. As seen from Table 5.4, similar to the S-box, for the low
frequency of 500 MHz, the presented scheme for the inverse S-box is the fastest compared
to [57], [78], and [79]. Additionally, for the maximum target frequency of 1.1 GHz, it has
the lowest area, delay and power consumption compared to those of [57], [78], and [79].
It is also noted that as presented in Table 5.4, the target frequency of 1.1 GHz cannot
be achieved by the scheme in [79].
Table 5.4: Comparing the areas, critical path delays, power consumptions, and fault detection
capabilities of the proposed and previously presented fault detection schemes for the inverse
S-box using the 65-nm CMOS standard technology. Areas are in μm^2, delays in ns, and total
powers in μW.
                                                     | Target frequency: 500 MHz       | Target frequency: 1 GHz         | Target frequency: 1.1 GHz       | Fault coverage (%)
Fault detection scheme                               | Area       Delay  Total power   | Area       Delay  Total power   | Area       Delay  Total power   | Burst faults      Multiple faults
Redun. units [34], United S-box [42] (LUTs)          | 52.3×10^3  1.23   7.2×10^3      | 54.2×10^3  0.95   15.4×10^3     | 54.7×10^3  0.87   16.9×10^3     | 100               100
Parity-based scheme in [39] (256×9 LUT)              | 29.5×10^3  0.59   4.3×10^3      | 29.5×10^3  0.59   8.4×10^3      | 29.5×10^3  0.59   9.5×10^3      | ≈50 (Inv. SubB.)  ≈50 (Inv. SubB.)
Parity-based scheme in [36] (512×9 LUT)              | 57.1×10^3  0.68   7.8×10^3      | 57.1×10^3  0.68   15.6×10^3     | 57.1×10^3  0.68   17.1×10^3     | ≈50               ≈50
Struc.-independent scheme in [92] (polynomial basis) | 783        1.72   581.3         | 1450       0.97   2262.6        | 1683       0.89   2893.4        | ≈50               ≈50
Scheme in [57] for original S-box (polynomial basis) | 886        1.85   629.4         | 1689       0.97   2711.1        | 1993       0.88   3612.6        | ≈50               ≈97
Parity-based scheme in [79] (polynomial basis)       | 865        1.85   623.6         | 1667       0.96   2692.3        | 1964       0.88   3528.5        | ≈50               ≈97
Parity-based scheme in [79] (normal basis)           | 855        1.85   574.0         | 1578       1.0    2374.4        | Target is not achieved          | ≈50               ≈97
This Chapter (polynomial basis)                      | 916        1.68   636.4         | 1481       0.96   2200.5        | 1709       0.88   2812.8        | 72.3              ≈97
As depicted in Table 5.4, for the highest frequency to achieve, i.e., 1.1 GHz, the
proposed scheme in this chapter is the most compact scheme with the lowest power
consumption compared to the schemes presented in [34], [42], [36], [39], [57], [79] and
[78]. It is also noted that similar to the S-box, the fault detection structure of the inverse
S-box can be sub-pipelined so that with a reasonable hardware overhead, the critical
path delay is highly reduced. The proposed scheme in this chapter has more area and
less power consumption compared to the one in [92]. As mentioned previously, however,
the fault detection capability of the scheme in [92] for the burst and multiple faults is
around 50%. This is less than the fault detection capabilities of 97% and 72.3% for the
proposed scheme for the multiple and burst faults, respectively.
Furthermore, we have compared the areas, critical path delays, and power consumptions
of the proposed schemes for the S-box and the inverse S-box with those for the
original ones presented in [22]. For this purpose, we have implemented both the original
and the fault detection S-box and inverse S-box for several target frequencies ranging
from 500 MHz to 1.1 GHz. The results are shown in Fig. 5.4. As seen in Fig. 5.4a
and Fig. 5.4d for the S-box and the inverse S-box, respectively, the areas of both the
original structures (solid lines) and the fault detection ones (dotted lines
with + marks) for different target frequencies are depicted. As seen in these figures, as
the target frequency increases, it is reached at the cost of increased area. This yields
areas ranging from 698 μm^2 to 1338 μm^2 and from 662 μm^2 to 1334 μm^2 for the
original S-box and inverse S-box, respectively. Moreover, for the fault detection S-box
and inverse S-box presented in this chapter, the areas of 953 μm^2 to 1730 μm^2 and 916
μm^2 to 1709 μm^2 are achieved, respectively.
Moreover, the results of our implementations for the power consumptions of the original
and the fault detection S-box and inverse S-box are depicted in Fig. 5.4b and Fig.
5.4e, respectively. As seen from these figures, for the low target frequencies, the power
consumptions of the original structures and the fault detection ones are close to each
other. Nonetheless, as seen in Fig. 5.4b and Fig. 5.4e, these differences increase after
applying tighter critical path delay constraints. As an example, for the target frequency
of 1.1 GHz, the power consumption for the original S-box (resp. inverse S-box) becomes
2.2 mW (2.3 mW). Moreover, for the fault detection S-box (resp. inverse S-box) it
reaches 2.9 mW (2.8 mW). Finally, the critical path delays of the original structures
and those for the proposed scheme in this chapter for the S-box and the inverse S-box are
presented in Fig. 5.4c and Fig. 5.4f. As seen in these figures, for the target frequency of
500 MHz, the critical path delays of the original and the fault detection S-box are 1.54
ns (working frequency of 649 MHz) and 1.80 ns (working frequency of 555 MHz), respectively.
Furthermore, for the inverse S-box, the critical path delays of 1.44 ns (working
frequency of 694 MHz) and 1.68 ns (working frequency of 595 MHz) are obtained for
the original and fault detection structures, respectively. It is also noted that, for the
maximum target frequency to achieve, the original and fault detection S-box (inverse
S-box) reaches the critical path delay of 0.87 ns (0.88 ns), i.e., the working frequency of
1.15 GHz (1.14 GHz). As seen in Fig. 5.4, this comes at the cost of increased areas and
power consumptions for the structures.
Figure 5.4: The areas, critical path delays, and power consumptions of the original [22]
and the proposed fault detection S-box and inverse S-box. Panels: (a) Area (S-box),
(b) Power (S-box), (c) Delay (S-box), (d) Area (Inverse S-box), (e) Power (Inverse S-box),
(f) Delay (Inverse S-box). Each panel plots the Original and the Proposed FD structures
against the target frequency (MHz), with the area in μm^2, the power in μW, and the delay
in ns.
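The working frequencies quoted above follow from the reciprocal of the critical path delay.
A minimal sketch of this conversion is given below; the function name working_frequency_mhz
is a hypothetical helper, and truncating to whole MHz is an assumption made to mirror the
quoted values.

```python
def working_frequency_mhz(critical_path_ns):
    """Maximum working frequency in MHz for a critical path delay in ns."""
    return 1000.0 / critical_path_ns


sbox_original = int(working_frequency_mhz(1.54))      # original S-box
sbox_fd = int(working_frequency_mhz(1.80))            # fault detection S-box
inv_sbox_original = int(working_frequency_mhz(1.44))  # original inverse S-box
inv_sbox_fd = int(working_frequency_mhz(1.68))        # fault detection inverse S-box
```

The same relation gives roughly 1.15 GHz and 1.14 GHz for the delays of 0.87 ns and 0.88 ns.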
We conclude this section by deriving the area, delay, and power consumption overheads
of the proposed scheme for the S-box and the inverse S-box. To this end, we
have considered the areas, delays, and power consumptions of the original operations
presented in [22] and the fault detection structures shown in Fig. 5.4. Then, we have
obtained the overheads, the results of which are presented in Fig. 5.5. The results in this
figure show that for the low frequency of 500 MHz for the S-box (see the dotted lines)
and the inverse S-box (see the solid lines with + marks), the area overheads are
approximately 36% and 38%, respectively (see Fig. 5.5a). Moreover, in this frequency,
the overheads for the critical path delays and the power consumptions for the S-box are
16% and 40%, respectively. Additionally, for the inverse S-box, for the target frequency
of 500 MHz, the critical path delay and the power consumption overheads of 16% and
25% are obtained, respectively. However, as we increase the target frequency, the critical
path delay overhead decreases (see Fig. 5.5c). It is noted that as seen in Fig. 5.5c, no
timing overhead is observed for the target frequencies higher than 1 GHz. Finally, as
presented earlier in this chapter, with the mentioned overheads, the fault detection
scheme proposed in this chapter achieves high fault coverages. This makes the presented
fault detection S-box and inverse S-box suitable choices in counteracting the fault attacks
and detecting the internal failures.
Figure 5.5: The area, delay, and power consumption overheads of the proposed schemes
for the S-box and the inverse S-box. Panels: (a) Area overhead, (b) Power overhead,
(c) Delay overhead. Each panel plots the overhead (%) of the S-box and the inverse S-box
against the target frequency (MHz).
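The overheads are relative increases over the original structures. As a small sketch, the
helper below (overhead_percent is a hypothetical name) reproduces the 500 MHz area and delay
overheads from the areas and critical path delays reported earlier in this section.

```python
def overhead_percent(original, fault_detecting):
    """Relative increase of the fault detection structure over the original, in %."""
    return 100.0 * (fault_detecting - original) / original


# Areas (um^2) and critical path delays (ns) at the 500 MHz target frequency.
sbox_area_overhead = overhead_percent(698, 953)
inv_sbox_area_overhead = overhead_percent(662, 916)
sbox_delay_overhead = overhead_percent(1.54, 1.80)
inv_sbox_delay_overhead = overhead_percent(1.44, 1.68)
```

These evaluate to roughly 36.5%, 38.4%, 16.9%, and 16.7%, in line with the approximate
values read from Fig. 5.5.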
5.5 Formulations for Mixed Bases
The hardware implementation of the S-box using mixed bases has been presented recently
in [62]. In this S-box, in contrast to the conventional works, polynomial and normal bases
are used together (mixed bases). The theoretical analysis in [62] shows lower timing
complexity for this S-box compared to the ones using polynomial and normal bases presented
in [20] and [23]. In what follows, we present a multi-bit parity-based fault diagnosis
approach for the S-box using mixed bases. We derive formulations for these multi-bit
parities and optimize them to reach low complexity. Based on the presented formulations,
we present two different flavors for our fault detection scheme considering the compromise
between the fault detection capability and the resources needed. Moreover, we evaluate the
fault detection capabilities of the proposed reliable architectures.
We present the formulations for the five predicted parities of the mixed bases S-box
of Fig. 5.6 in the following theorem.
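Theorem 5.3 For $X \in GF(2^8)$ as the input of the S-box, the predicted parities of block
1 ($\hat{P}_1$-$\hat{P}_2$), block 2 ($\hat{P}_3$), and block 3 ($\hat{P}_4$-$\hat{P}_5$) of the S-box in Fig. 5.6 are obtained as
follows
\begin{align}
\hat{P}_1 &= x_6(x_7 + x_5 + x_0) + x_5 Z_3 + x_4(Z_7 + Z_1 + x_2) \nonumber \\
&\quad + x_2(x_5 + x_3) + x_1 Z_1 + (x_7 \vee x_4) + (x_5 \vee x_1); \tag{5.29} \\
\hat{P}_2 &= x_6(Z_4 + x_1 + x_0) + x_4 Z_6 + x_3 x_7 + x_0 Z_2 \nonumber \\
&\quad + (x_7 \vee x_5) + (x_2 \vee x_1); \tag{5.30}
\end{align}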
P^3 = (31 _ 2) + 320; (5.31)
P^4 = 3(Z4 + Z3 + x6) + 2(Z8 + x4) + 1(Z6
+ x6) + 0(Z5 + Z2 + x2); (5.32)
P^5 = 3(Z8 + Z4) + 2(Z7 + x7) + 1(Z9 + Z5)+
0(Z5 + x7 + x1); (5.33)
where $Z_1 = x_3 + x_0$, $Z_2 = x_5 + x_1$, $Z_3 = Z_2 + Z_1$, $Z_4 = x_7 + x_2$, $Z_5 = x_6 + x_3$,
$Z_6 = Z_1 + x_5$, $Z_7 = x_6 + x_1$, $Z_8 = Z_7 + x_0$, and $Z_9 = Z_5 + Z_2$. Moreover, "+" and
$\vee$ represent modulo-2 addition using an XOR gate and the OR operation, respectively.
Figure 5.6: The presented fault detection structure for the mixed bases S-box [62]. The
structure is divided into Block 1 (transformation matrix, multiplication, and squaring),
Block 2 (inversion), and Block 3 (multiplications followed by the inverse and affine
transformation).
Proof The two predicted parities of block 1, i.e., $\hat{P}_1$ and $\hat{P}_2$ in (5.29) and (5.30), are
obtained according to Fig. 5.6. As seen in Fig. 5.6, block 1 consists of a transformation
matrix ($T$) which transforms the coordinates of $X$ in the binary field to its representation
in the composite field and is defined as [62]:
\[
T = \begin{pmatrix}
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 0
\end{pmatrix}. \tag{5.34}
\]
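Moreover, considering [62] and Fig. 5.6, for the input $(\eta_7 + \eta_3, \eta_6 + \eta_2, \eta_5 + \eta_1, \eta_4 + \eta_0)$, we obtain the output of the merged squarer and constant multiplier in Fig. 5.6 as $(\eta_6 + \eta_2, \eta_7 + \eta_3, \eta_7 + \eta_5 + \eta_3 + \eta_1, \eta_7 + \eta_6 + \eta_5 + \eta_4 + \eta_3 + \eta_2 + \eta_1 + \eta_0)$. In addition, according to [62] and for inputs $\eta_l$ and $\eta_h$, one can obtain the result of the field multiplication ($\hat{M}_4$) in this block as
\[ \begin{aligned}
c'_0 &= \eta_7(\eta_3 + \eta_1 + \eta_0) + \eta_6(\eta_2 + \eta_1) + \eta_5(\eta_3 + \eta_2 + \eta_1 + \eta_0) + \eta_4(\eta_3 + \eta_1), \\
c'_1 &= \eta_7(\eta_3 + \eta_2 + \eta_0) + \eta_6(\eta_3 + \eta_1 + \eta_0) + \eta_5(\eta_2 + \eta_0) + \eta_4(\eta_3 + \eta_2 + \eta_1 + \eta_0), \\
c'_2 &= \eta_7\eta_2 + \eta_6(\eta_3 + \eta_2) + \eta_5(\eta_1 + \eta_0) + \eta_4\eta_1, \\
c'_3 &= \eta_7\eta_3 + \eta_6\eta_2 + \eta_5\eta_0 + \eta_4(\eta_1 + \eta_0).
\end{aligned} \tag{5.35} \]
Therefore, by modulo-2 adding the coordinates (3, 2) and (1, 0), i.e., the two most and the two least significant bits, of the output of the merged squarer and constant multiplier in Fig. 5.6 and of the result of the multiplication in (5.35), one can obtain
\[ \hat{P}_1 = \eta_7(\eta_3 + \eta_2) + \eta_6\eta_3 + \eta_5\eta_1 + \eta_4\eta_0 + \eta_7 + \eta_6 + \eta_3 + \eta_2, \tag{5.36} \]
\[ \hat{P}_2 = \eta_7(\eta_2 + \eta_1) + \eta_6(\eta_3 + \eta_2 + \eta_0) + \eta_5(\eta_3 + \eta_1) + \eta_4(\eta_2 + \eta_0) + \eta_6 + \eta_4 + \eta_2 + \eta_0. \tag{5.37} \]
By substituting the coordinates of $\eta$ with those of $X$ using (5.34), reordering, and using subexpression sharing, it is straightforward to obtain (5.29) and (5.30) from (5.36) and (5.37), respectively.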
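Block 2 of the S-box in Fig. 5.6 consists of an inversion in $GF((2^2)^2)$. According to [62], we have obtained the formulations for the inversion of the input $\gamma \in GF((2^2)^2)$ as $\theta = \gamma^{-1} \in GF((2^2)^2)$ according to the following:
\[ \begin{aligned}
\theta_0 &= (\gamma_2 + \gamma_0)I_1 + (\gamma_3 + \gamma_1)(I_1 + I_0), \\
\theta_1 &= (\gamma_2 + \gamma_0)(I_1 + I_0) + (\gamma_3 + \gamma_1)I_0, \\
\theta_2 &= \gamma_0 I_1 + \gamma_1(I_1 + I_0), \\
\theta_3 &= \gamma_0(I_1 + I_0) + \gamma_1 I_0.
\end{aligned} \tag{5.38} \]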
where I1 = 3(1 + 0) + 21 + 0 and I0 = 30 + 2 _ 1 + 02. Then, one can obtain
from (5.38) that P^3 = 2I0+3I1 which leads to P^3 = 2+31+321+320. Noting
that 2 + 31 + 321 = 2 _ 31, one can obtain (5.31).
Now, we derive the two predicted parities of block 3, i.e., $\hat{P}_4$ and $\hat{P}_5$. Let $U = (u_3, u_2, u_1, u_0)$ and $V = (v_3, v_2, v_1, v_0)$ be the inputs of a multiplier in $GF((2^2)^2)$ in block 3 ($M_4$). Then, the result of multiplication is
\[ \begin{aligned}
c_0 &= u_3 v_2 + u_2(v_3 + v_2) + u_1(v_1 + v_0) + u_0 v_1, \\
c_1 &= u_3 v_3 + u_2 v_2 + u_1 v_0 + u_0(v_1 + v_0), \\
c_2 &= u_3(v_3 + v_2 + v_1 + v_0) + u_2(v_3 + v_1) + u_1(v_3 + v_2) + u_0 v_3, \\
c_3 &= u_3(v_2 + v_0) + u_2(v_3 + v_2 + v_1 + v_0) + u_1 v_2 + u_0(v_3 + v_2).
\end{aligned} \tag{5.39} \]
Moreover, as seen from Fig. 5.6, block 3 consists of the mixed inverse ($T^{-1}$) and affine ($A$) transformation matrices. The formulation for this mixed transformation (which is added later with the constant $\{63\}_h$) is as follows [62]:
\[ A\,T^{-1} = \begin{pmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}. \tag{5.40} \]
Finally, according to (5.40), the two predicted parities of block 3 in Fig. 5.6 are obtained by adding the multiplication output coordinates 3 and 4 (for $\hat{P}_4$) and 1, 3, 5, 6, and 7 (for $\hat{P}_5$). Then, according to (5.39), by multiplying $U = \theta$ and $V = \eta_h + \eta_l$ and also $U = \theta$ and $V = \eta_l$, one can obtain these coordinates. The following is obtained for the two predicted parities of block 3 of the S-box in Fig. 5.6:
\[ \hat{P}_4 = \theta_3(\eta_6 + \eta_0) + \theta_2(\eta_7 + \eta_6 + \eta_1 + \eta_0) + \theta_1(\eta_5 + \eta_4 + \eta_2 + \eta_1 + \eta_0) + \theta_0(\eta_5 + \eta_3 + \eta_2 + \eta_1), \tag{5.41} \]
\[ \hat{P}_5 = \theta_3(\eta_5 + \eta_3 + \eta_2 + \eta_1 + \eta_0) + \theta_2(\eta_4 + \eta_3 + \eta_1) + \theta_1(\eta_7 + \eta_4 + \eta_3 + \eta_2) + \theta_0(\eta_6 + \eta_5 + \eta_4 + \eta_3). \tag{5.42} \]
Then, using subexpression sharing and by substituting the coordinates of $\eta$ with those of $X$ in (5.41) and (5.42) using (5.34), one can obtain (5.32) and (5.33) and the proof is complete.
The critical path delay of the structure presented in Fig. 5.6 is determined by that of the S-box and the fault detection scheme. Because of the concurrency of the scheme in Fig. 5.6, the predicted parities of all three blocks and the actual parities of blocks 1 and 2 are obtained during the computations of the S-box itself. Thus, the only delay that the scheme in Fig. 5.6 adds to the architecture is the delay of computing the actual parities of block 3 and their corresponding comparisons with the predicted parities for obtaining the error indication flags $e_4$ and $e_5$. It is interesting to note that having two predicted parities for the last block (instead of one) reduces the critical path delay overhead. Based on these observations, the critical path delay overhead of the presented fault detection structure in Fig. 5.6 is just $3T_X$ ($2T_X$ for computing the actual parities of block 3 and $1T_X$ for obtaining their error indication flags).
5.5.1 Other Variants
An advantage of the scheme proposed in Theorem 5.3 is that, based on the reliability requirements and the available resources, one may use a different number of predicted parities for different blocks. For instance, for applications which have tight resource constraints, one may use one predicted parity for each block, i.e., three predicted parities in total for the entire S-box, to reduce the overheads of the performance metrics at the expense of reducing the error coverage. This can be performed by modulo-2 adding (5.29) and (5.30) to obtain one predicted parity for block 1 and (5.32) and (5.33) to obtain the one for block 3 (see also Fig. 5.6). In other words, we can use $\hat{P}_{1+2} = \hat{P}_1 + \hat{P}_2$ (using (5.36) and (5.37)), $\hat{P}_{4+5} = \hat{P}_4 + \hat{P}_5$ (using (5.41) and (5.42)), and $\hat{P}_3$ as three predicted parities for the S-box in Fig. 5.6. For $\hat{P}_{1+2}$ and $\hat{P}_{4+5}$ we have
\[ \hat{P}_{1+2} = \eta_7(\eta_3 + \eta_1) + \eta_6(\eta_2 + \eta_0) + \eta_5\eta_3 + \eta_4\eta_2 + \eta_7 + \eta_4 + \eta_3 + \eta_0, \tag{5.43} \]
\[ \hat{P}_{4+5} = \theta_3(\eta_6 + \eta_5 + \eta_3 + \eta_2 + \eta_1) + \theta_2(\eta_7 + \eta_6 + \eta_4 + \eta_3 + \eta_0) + \theta_1(\eta_7 + \eta_5 + \eta_3 + \eta_1 + \eta_0) + \theta_0(\eta_6 + \eta_4 + \eta_2 + \eta_1). \tag{5.44} \]
Our simulations (if the entire SubBytes is considered) show a multiple random error coverage very close to 100% (99.998%) for the mixed bases S-box. In addition, we have performed ASIC syntheses using a 65-nm CMOS standard technology for the proposed concurrent fault detection architectures and some of the previous ones. Compared to the approaches with similar error coverage, the proposed approach in this section is the most efficient one, reaching the efficiency of 5.02 Mbps/µm² while maintaining the throughput of 5 Gbps. Based on the error coverage needed and the performance requirements, one may use the proposed high-speed concurrent fault detection approach to reach the desired coverage/performance goals.
Chapter 6
Concurrent Structure-Independent
Fault Detection Schemes for the
AES
In the previous two chapters, we proposed two methods for fault detection of the S-boxes and inverse S-boxes using composite fields. In this chapter, we propose a structure-independent fault detection scheme for the entire AES encryption and decryption. Specifically, we obtain new formulations for the fault detection of SubBytes and inverse SubBytes using the relation between the input and the output of the S-box and the inverse S-box. The proposed schemes are independent of the way the S-box and the inverse S-box are constructed. Therefore, they can be used both for the S-boxes and the inverse S-boxes using look-up tables and for those utilizing logic gates based on composite fields. Our simulation results show an error coverage of greater than 99% for the proposed schemes. Moreover, the proposed and the previously reported fault detection schemes have been implemented on the most recent Xilinx® Virtex™ FPGAs. Their area and delay overheads have been compared and it is shown that the proposed schemes outperform the previously reported ones.
As presented before (see Fig. 2.3), a multiplication-based scheme is proposed in [38]. In this scheme, the result of the multiplication of the input and the output of the multiplicative inversion is compared with the predicted result of unity. However, this scheme is not suitable for the S-boxes and inverse S-boxes implemented using look-up tables (LUTs). This is because the output (the input) of the multiplicative inversion in the S-box (the inverse S-box) may not be accessible in the LUT-based implementations. Therefore, the fault detection scheme presented in [38] is not applicable to these implementations.
In this chapter, we present structure-independent fault detection schemes for obtaining a reliable AES implementation. We present a systematic method for obtaining the fault detection signatures for the multiplicative inversion of the S-boxes (inverse S-boxes). We propose new formulations resulting in novel fault detection schemes for checking SubBytes, inverse SubBytes, and the other transformations in the encryption and the decryption of the AES. The proposed schemes are independent of the method by which the S-box (resp. the inverse S-box) is implemented. Thus, they can be applied to both the LUT and composite fields implementations. Moreover, we simulate the proposed fault detection structures for the AES encryption and decryption. Through our simulations after injecting up to 700,000 random stuck-at errors, we show that the proposed low-cost schemes reach an error coverage of greater than 99%. Finally, our proposed fault detection schemes and almost all of the previously reported ones are implemented on the recent Xilinx® Virtex™ FPGAs and their area and delay overheads have been derived and compared. The FPGA implementation results show the low area and delay overheads of the proposed fault detection schemes.
The organization of this chapter is as follows: In Section 6.1, we present some brief
preliminaries regarding the AES algorithm. The proposed structure-independent schemes
for the fault detection of the S-boxes and the inverse S-boxes are presented in Section
6.2. Then, the fault detection schemes for the entire AES encryption and decryption
are considered in Section 6.3. In Section 6.4, the results of the simulations of the pro-
posed schemes are presented and their error coverages are obtained. In Section 6.5, the
presented fault detection schemes and the previously reported ones are implemented on
FPGAs and they are compared in terms of time and space complexities. The results
presented in this chapter can also be found in [92] and [93].
6.1 Notations Used in This Chapter
In this section, we briefly present the notations and preliminaries used throughout this chapter for the four transformations of each round of the encryption and the decryption in the AES.
Each transformation in every round acts on its 128-bit input denoted as the state. The states are considered as four by four matrices whose entries are eight bits. For example, the input state $S$ with its 8-bit entries, i.e., $s_{r,c}$, $0 \le r, c \le 3$, is represented as follows:
\[ S = [s_{r,c}]_{r,c=0}^{3}. \tag{6.1} \]
6.1.1 AES Encryption
Considering (6.1) as the input state of an encryption round, the transformations in each
round of encryption (except for the last round) are as follows [1]:
• SubBytes: The first transformation in each round is the bytes substitution (SubBytes) implemented by 16 S-boxes. Let $s_{r,c} \in GF(2^8)$ and $s'_{r,c} \in GF(2^8)$ be the 8-bit input and output of each S-box, respectively. Then, the S-box consists of a multiplicative inversion, i.e., $s^{-1}_{r,c} \in GF(2^8)$, followed by an affine transformation consisting of a constant binary matrix and a constant vector to generate the output as
\[ s'_{r,c} = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{pmatrix} s^{-1}_{r,c} + \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{pmatrix}. \tag{6.2} \]
The 8-bit outputs of 16 S-boxes are used to obtain the output state of the SubBytes transformation as
\[ S' = [s'_{r,c}]_{r,c=0}^{3}. \tag{6.3} \]
• ShiftRows: In the second transformation, ShiftRows, the four bytes of each row of the input state are cyclically shifted to the left (row $r$ by $r$ positions, so that the first row is left unchanged) to obtain the output state, i.e., $SR(S')$, as
\[ SR(S') = \begin{pmatrix}
s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\
s'_{1,1} & s'_{1,2} & s'_{1,3} & s'_{1,0} \\
s'_{2,2} & s'_{2,3} & s'_{2,0} & s'_{2,1} \\
s'_{3,3} & s'_{3,0} & s'_{3,1} & s'_{3,2}
\end{pmatrix} = [s'_{r,(r+c) \bmod 4}]_{r,c=0}^{3}. \tag{6.4} \]
• MixColumns: In the third transformation, MixColumns, a constant matrix is multiplied with the output state of ShiftRows, $SR(S')$ in (6.4), to obtain the output state of MixColumns, i.e., the matrix $S''$, as
\[ S'' = [s''_{r,c}]_{r,c=0}^{3} = \begin{pmatrix}
\{2\}_h & \{3\}_h & \{1\}_h & \{1\}_h \\
\{1\}_h & \{2\}_h & \{3\}_h & \{1\}_h \\
\{1\}_h & \{1\}_h & \{2\}_h & \{3\}_h \\
\{3\}_h & \{1\}_h & \{1\}_h & \{2\}_h
\end{pmatrix} SR(S'). \tag{6.5} \]
• AddRoundKey: The final transformation is AddRoundKey in which the input state is added (modulo-2) with the key of the round. Considering the round-key input state as the matrix $K = [k_{r,c}]_{r,c=0}^{3}$, with entries $k_{r,c}$, $0 \le r, c \le 3$, the output state of the AddRoundKey transformation, i.e., $O$, is obtained as
\[ O = [o_{r,c}]_{r,c=0}^{3} = S'' + K. \tag{6.6} \]
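The four transformations in (6.2)-(6.6) can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware architecture discussed in this thesis; the function names and the `state[r][c]` list layout are assumptions made for the example, and the inversion is computed naively as $s^{254}$.

```python
# Illustrative model of one AES encryption round, following (6.2)-(6.6).
# All helper names are hypothetical; the state is a 4x4 list, state[r][c].
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b):
    """Multiply a and b in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result

def gf_inv(a):
    """Return a^254, i.e. the multiplicative inverse (0 maps to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

AFFINE_ROW0 = [1, 0, 0, 0, 1, 1, 1, 1]  # first row of the matrix in (6.2)
AFFINE_VEC = 0x63                       # constant vector of (6.2), bit 0 first

def sbox(s):
    """Multiplicative inversion followed by the affine transformation (6.2)."""
    inv = gf_inv(s)
    out = 0
    for i in range(8):
        bit = 0
        for j in range(8):
            # Row i of the matrix is row 0 cyclically shifted right by i.
            bit ^= AFFINE_ROW0[(j - i) % 8] & (inv >> j) & 1
        out |= bit << i
    return out ^ AFFINE_VEC

def sub_bytes(state):
    return [[sbox(b) for b in row] for row in state]

def shift_rows(state):
    """(6.4): output entry (r, c) is input entry (r, (r + c) mod 4)."""
    return [[state[r][(r + c) % 4] for c in range(4)] for r in range(4)]

MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]

def mix_columns(state):
    """(6.5): multiply the constant matrix by the state over GF(2^8)."""
    out = [[0] * 4 for _ in range(4)]
    for r in range(4):
        for c in range(4):
            for k in range(4):
                out[r][c] ^= gf_mul(MIX[r][k], state[k][c])
    return out

def add_round_key(state, key):
    """(6.6): modulo-2 addition of the state and the round key."""
    return [[state[r][c] ^ key[r][c] for c in range(4)] for r in range(4)]

def encryption_round(state, key):
    return add_round_key(mix_columns(shift_rows(sub_bytes(state))), key)
```

A hardware design would of course not iterate 254 multiplications per byte; the loop is used here only because it makes the inversion-then-affine structure of (6.2) explicit.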
6.1.2 AES Decryption
In the AES decryption rounds, four transformations, i.e., InvShiftRows, InvSubBytes, AddRoundKey, and InvMixColumns, are utilized. Considering $S'$ as the input state of a decryption round, in the first transformation, InvShiftRows, similar to ShiftRows in encryption, the first row of the input state remains unchanged. However, the entries of the other rows are cyclically shifted to the right as follows:
\[ ISR(S') = \begin{pmatrix}
s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\
s'_{1,3} & s'_{1,0} & s'_{1,1} & s'_{1,2} \\
s'_{2,2} & s'_{2,3} & s'_{2,0} & s'_{2,1} \\
s'_{3,1} & s'_{3,2} & s'_{3,3} & s'_{3,0}
\end{pmatrix}. \tag{6.7} \]
The next transformation in each round is InvSubBytes implemented by 16 inverse S-boxes. In the inverse S-box, the inverse affine transformation precedes the multiplicative inversion in $GF(2^8)$, i.e., $s^{-1}_{r,c}$ is first generated from $s'_{r,c}$ by inverting the affine transformation presented in (6.2). The 8-bit outputs of 16 inverse S-boxes are used to obtain the output state of the InvSubBytes transformation as $S = [s_{r,c}]_{r,c=0}^{3}$.
The next transformation is AddRoundKey in which the input state is added with the key of the round. Then, the output state of AddRoundKey is obtained as $S'' = [s''_{r,c}]_{r,c=0}^{3} = S + K$. Finally, the last transformation, InvMixColumns, is equivalent to multiplying the input state with a constant matrix with hexadecimal entries to obtain the output state of the round as
\[ O = [o_{r,c}]_{r,c=0}^{3} = \begin{pmatrix}
\{0e\}_h & \{0b\}_h & \{0d\}_h & \{09\}_h \\
\{09\}_h & \{0e\}_h & \{0b\}_h & \{0d\}_h \\
\{0d\}_h & \{09\}_h & \{0e\}_h & \{0b\}_h \\
\{0b\}_h & \{0d\}_h & \{09\}_h & \{0e\}_h
\end{pmatrix} S''. \tag{6.8} \]
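The decryption round can be modelled in the same illustrative way. The sketch below follows the order given in this subsection (InvShiftRows, InvSubBytes, AddRoundKey, InvMixColumns); the helper names are hypothetical, and the inverse affine step uses the standard AES inverse affine map (output bit $i$ is the sum of input bits $i+2$, $i+5$, $i+7$ modulo 8, plus the constant $\{05\}_h$), which is stated here as an assumption of the example.

```python
# Illustrative model of one AES decryption round (names are hypothetical).
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b):
    """Multiply a and b in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result

def gf_inv(a):
    """Return a^254, i.e. the multiplicative inverse (0 maps to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def inv_shift_rows(state):
    """(6.7): row r is cyclically shifted to the right by r positions."""
    return [[state[r][(c - r) % 4] for c in range(4)] for r in range(4)]

def inv_sbox(sp):
    """Inverse affine transformation followed by multiplicative inversion."""
    t = 0
    for i in range(8):
        bit = ((sp >> ((i + 2) % 8)) ^ (sp >> ((i + 5) % 8))
               ^ (sp >> ((i + 7) % 8))) & 1
        t |= bit << i
    return gf_inv(t ^ 0x05)

INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]

def inv_mix_columns(state):
    """(6.8): multiply the constant matrix by the state over GF(2^8)."""
    out = [[0] * 4 for _ in range(4)]
    for r in range(4):
        for c in range(4):
            for k in range(4):
                out[r][c] ^= gf_mul(INV_MIX[r][k], state[k][c])
    return out

def decryption_round(state, key):
    s = [[inv_sbox(b) for b in row] for row in inv_shift_rows(state)]
    s2 = [[s[r][c] ^ key[r][c] for c in range(4)] for r in range(4)]
    return inv_mix_columns(s2)
```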
6.2 A New Fault Detection Scheme for the S-box
and the Inverse S-box
In this section, first we present a systematic method for the fault detection of the multiplicative inversion of the S-box and the inverse S-box. Then, the new scheme for the entire S-box and the inverse S-box is presented.
6.2.1 The Systematic Scheme for the Multiplicative Inversion
In what follows, we present a systematic method for the fault detection scheme for the
multiplicative inversion by deriving the matrix-based formulations for the multiplicative
inversion in the S-box/inverse S-box.
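We use the following theorem from [94] to obtain the multiplication of field elements $A = \sum_{i=0}^{m-1} a_i\alpha^i$ and $B = \sum_{i=0}^{m-1} b_i\alpha^i$ in the finite field $GF(2^m)$ constructed by the irreducible polynomial $P(x)$ with the primitive root $\alpha$.
Theorem 6.1 [94] Let $C = \sum_{i=0}^{m-1} c_i\alpha^i$ be the multiplication of $A$ and $B \in GF(2^m)$. Then, the coordinates of $C$ can be obtained from
\[ [c_0, c_1, \cdots, c_{m-1}]^T = (\mathbf{L} + \mathbf{Q}^T\mathbf{U})\mathbf{b}, \tag{6.9} \]
where $\mathbf{b} = [b_0, b_1, \cdots, b_{m-1}]^T$,
\[ \mathbf{L} = \begin{pmatrix}
a_0 & 0 & 0 & 0 & \cdots & 0 \\
a_1 & a_0 & 0 & 0 & \cdots & 0 \\
a_2 & a_1 & a_0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \ddots & \vdots \\
a_{m-2} & a_{m-3} & \cdots & a_1 & a_0 & 0 \\
a_{m-1} & a_{m-2} & \cdots & a_2 & a_1 & a_0
\end{pmatrix}, \tag{6.10} \]
\[ \mathbf{U} = \begin{pmatrix}
0 & a_{m-1} & a_{m-2} & \cdots & a_2 & a_1 \\
0 & 0 & a_{m-1} & \cdots & a_3 & a_2 \\
\vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & a_{m-1} & a_{m-2} \\
0 & 0 & \cdots & 0 & 0 & a_{m-1}
\end{pmatrix}, \tag{6.11} \]
and the $(m-1) \times m$ binary matrix $\mathbf{Q}$ is obtained as follows:
\[ [\alpha^m, \alpha^{m+1}, \ldots, \alpha^{2m-2}]^T = \mathbf{Q}\,[1, \alpha, \alpha^2, \ldots, \alpha^{m-1}]^T \pmod{P(\alpha)}. \tag{6.12} \]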
Let $s = s_7\alpha^7 + s_6\alpha^6 + s_5\alpha^5 + s_4\alpha^4 + s_3\alpha^3 + s_2\alpha^2 + s_1\alpha + s_0$ and $s^{-1} = s^{-1}_7\alpha^7 + s^{-1}_6\alpha^6 + s^{-1}_5\alpha^5 + s^{-1}_4\alpha^4 + s^{-1}_3\alpha^3 + s^{-1}_2\alpha^2 + s^{-1}_1\alpha + s^{-1}_0$ be the 8-bit input and output of the multiplicative inversion in the binary field $GF(2^8)$, respectively. Considering the fact that the result of the multiplication of the 8-bit input $s$, $s \neq 0$, and the output $s^{-1}$ of the multiplicative inversion is the unity polynomial $1 \in GF(2^8)$, the following is derived from Theorem 6.1 for the relation between $s$ and $s^{-1}$.
Corollary 6.1 Let $\mathbf{s} = [s_0, s_1, s_2, s_3, s_4, s_5, s_6, s_7]^T$ and $\mathbf{s}^{-1} = [s^{-1}_0, s^{-1}_1, s^{-1}_2, s^{-1}_3, s^{-1}_4, s^{-1}_5, s^{-1}_6, s^{-1}_7]^T$ be the vectors corresponding to the input and output of the multiplicative inversion. Then, the matrix formulation of the multiplicative inversion of the S-box (resp. the inverse S-box) is as follows:
\[ \mathbf{Z}\mathbf{s}^{-1} = \mathbf{u}, \tag{6.13} \]
where
\[ \mathbf{Z} = \begin{pmatrix}
s_0 & s_7 & s_6 & s_5 & s_4 & s_{7,3} & s_{7,6,2} & s_{6,5,1} \\
s_1 & s_{7,0} & s_{7,6} & s_{6,5} & s_{5,4} & s_{7,4,3} & s_{6,3,2} & s_{7,5,2,1} \\
s_2 & s_1 & s_{7,0} & s_{7,6} & s_{6,5} & s_{5,4} & s_{7,4,3} & s_{6,3,2} \\
s_3 & s_{7,2} & s_{6,1} & s_{7,5,0} & s_{7,6,4} & s_{7,6,5,3} & s_{7,6,5,4,2} & s_{7,6,5,4,3,1} \\
s_4 & s_{7,3} & s_{7,6,2} & s_{6,5,1} & s_{7,5,4,0} & s_{6,4,3} & s_{5,3,2} & s_{7,4,2,1} \\
s_5 & s_4 & s_{7,3} & s_{7,6,2} & s_{6,5,1} & s_{7,5,4,0} & s_{6,4,3} & s_{5,3,2} \\
s_6 & s_5 & s_4 & s_{7,3} & s_{7,6,2} & s_{6,5,1} & s_{7,5,4,0} & s_{6,4,3} \\
s_7 & s_6 & s_5 & s_4 & s_{7,3} & s_{7,6,2} & s_{6,5,1} & s_{7,5,4,0}
\end{pmatrix}, \tag{6.14} \]
$\mathbf{u} = [u, 0, 0, 0, 0, 0, 0, 0]^T$, and $u$ is obtained by logical OR operations of all inputs and outputs, i.e., $u = (s_0 \vee s_1 \vee \cdots \vee s_7) \vee (s^{-1}_0 \vee s^{-1}_1 \vee \cdots \vee s^{-1}_7)$. Moreover, in (6.14), the modulo-2 additions (XOR operations) of the coordinates of $s$ are shown with commas in indices, e.g., $s_{7,0} = s_7 + s_0$.
Proof We prove (6.13) for two cases of $s \neq 0$ and $s = 0$, separately. Let the input $s$ be a non-zero field element in $GF(2^8)$ generated by $P(x) = x^8 + x^4 + x^3 + x + 1$. Then, the multiplicative inversion should generate $s^{-1}$. Using (6.12) in Theorem 6.1 and considering the irreducible polynomial $P(x)$, the $7 \times 8$ matrix $\mathbf{Q}$ can be obtained as
\[ \mathbf{Q} = \begin{pmatrix}
1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\
1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 & 0 & 0 & 1
\end{pmatrix}. \tag{6.15} \]
This matrix is obtained by using the representations of $\alpha^8, \alpha^9, \ldots, \alpha^{14}$ with respect to the polynomial basis for different rows of $\mathbf{Q}$. Considering $A = s \neq 0$ and $B = s^{-1}$ in Theorem 6.1, the matrices $\mathbf{L}$ and $\mathbf{U}$ in (6.10) and (6.11) are functions of the 8-bit input vector $\mathbf{s}$ as
\[ \mathbf{L} = \begin{pmatrix}
s_0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
s_1 & s_0 & 0 & 0 & 0 & 0 & 0 & 0 \\
s_2 & s_1 & s_0 & 0 & 0 & 0 & 0 & 0 \\
s_3 & s_2 & s_1 & s_0 & 0 & 0 & 0 & 0 \\
s_4 & s_3 & s_2 & s_1 & s_0 & 0 & 0 & 0 \\
s_5 & s_4 & s_3 & s_2 & s_1 & s_0 & 0 & 0 \\
s_6 & s_5 & s_4 & s_3 & s_2 & s_1 & s_0 & 0 \\
s_7 & s_6 & s_5 & s_4 & s_3 & s_2 & s_1 & s_0
\end{pmatrix}, \tag{6.16} \]
\[ \mathbf{U} = \begin{pmatrix}
0 & s_7 & s_6 & s_5 & s_4 & s_3 & s_2 & s_1 \\
0 & 0 & s_7 & s_6 & s_5 & s_4 & s_3 & s_2 \\
0 & 0 & 0 & s_7 & s_6 & s_5 & s_4 & s_3 \\
0 & 0 & 0 & 0 & s_7 & s_6 & s_5 & s_4 \\
0 & 0 & 0 & 0 & 0 & s_7 & s_6 & s_5 \\
0 & 0 & 0 & 0 & 0 & 0 & s_7 & s_6 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & s_7
\end{pmatrix}. \tag{6.17} \]
Substituting $\mathbf{Q}$, $\mathbf{L}$, and $\mathbf{U}$ from (6.15)-(6.17) into (6.9) and denoting $\mathbf{Z} = \mathbf{L} + \mathbf{Q}^T\mathbf{U}$, one can obtain the matrix $\mathbf{Z}$ presented in (6.14). Since $s \neq 0 = (0, 0, \ldots, 0) \in GF(2^8)$, $u = 1$ and the result of multiplication is $C = A \cdot B \bmod P(x) = 1 \in GF(2^8)$, i.e., $[c_0, c_1, \ldots, c_7]^T = [1, 0, \ldots, 0]^T$. Therefore, using (6.9), one can prove that (6.13) is valid for $s \neq 0$. Moreover, for $s = 0$, the output of the multiplicative inversion is $0 = (0, 0, \ldots, 0)$. Thus, all entries of the matrix $\mathbf{Z}$ and hence all 8 entries of the left hand side vector of (6.13) are equal to zero. In such a case, the vector $\mathbf{u} = [0, 0, \ldots, 0]^T$ since the result of the OR operation among all $s_i$'s and $s^{-1}_i$'s is zero, i.e., $u = 0$. Therefore, the proof is complete.
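As a numerical cross-check of Theorem 6.1 and Corollary 6.1, the following Python sketch rebuilds $\mathbf{Q}$ from the reduction rule $\alpha^8 = \alpha^4 + \alpha^3 + \alpha + 1$, forms $\mathbf{Z} = \mathbf{L} + \mathbf{Q}^T\mathbf{U}$ for a given input, and tests (6.13) for all 256 inputs with the correct inverter output. The function names are hypothetical and the code is meant only as a software illustration of the matrix formulation.

```python
# Illustrative check of Z s^{-1} = u in (6.13); names are hypothetical.
M_BITS = 8
P_LOW = [1, 1, 0, 1, 1, 0, 0, 0]  # alpha^8 = 1 + alpha + alpha^3 + alpha^4

def bits(x):
    """Coordinates [x_0, ..., x_7] of a byte in the polynomial basis."""
    return [(x >> i) & 1 for i in range(M_BITS)]

def build_q():
    """Rows are alpha^8, ..., alpha^14 reduced modulo P(alpha), as in (6.12)."""
    rows, cur = [], P_LOW[:]
    for _ in range(M_BITS - 1):
        rows.append(cur)
        nxt = [0] + cur[:-1]          # multiply by alpha
        if cur[-1]:                   # reduce the alpha^8 term
            nxt = [x ^ y for x, y in zip(nxt, P_LOW)]
        cur = nxt
    return rows

def build_z(s):
    """Z = L + Q^T U over GF(2), with L and U built from the bits of s."""
    a, q = bits(s), build_q()
    z = [[a[i - j] if i >= j else 0 for j in range(M_BITS)]
         for i in range(M_BITS)]
    u = [[a[M_BITS + k - j] if j > k else 0 for j in range(M_BITS)]
         for k in range(M_BITS - 1)]
    for i in range(M_BITS):
        for j in range(M_BITS):
            for k in range(M_BITS - 1):
                z[i][j] ^= q[k][i] & u[k][j]
    return z

def mat_vec(z, v):
    """Matrix-vector product over GF(2)."""
    return [sum(zij & vj for zij, vj in zip(row, v)) % 2 for row in z]

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8), used only as a reference."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def check_corollary():
    """True if Z s^{-1} = [u, 0, ..., 0]^T for every input s."""
    for s in range(256):
        inv = 1
        for _ in range(254):          # s^254; gives 0 for s = 0
            inv = gf_mul(inv, s)
        u = 1 if s else 0
        if mat_vec(build_z(s), bits(inv)) != [u] + [0] * 7:
            return False
    return True
```

Calling `build_q()` reproduces the rows of (6.15), and `check_corollary()` exercises both the $s \neq 0$ and the $s = 0$ cases of the proof.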
The validity of (6.13) can be used to detect specific faults in the inversion block. Let us consider (6.13) for three special cases. If both the input and the output are zero, i.e., $s = s^{-1} = 0 \in GF(2^8)$, the output is error-free. Then, both sides of (6.13) are zero and thus it holds, which means no fault is detected. On the other hand, the left hand side of (6.13) is zero while in the right hand side $u = 1$ in the following two cases: (i) the input is zero ($s = 0$) and the erroneous output is not zero, i.e., $s^{-1} \neq 0$, (ii) the input is not zero, i.e., $s \neq 0$, but the erroneous output is zero ($s^{-1} = 0$). Thus, in both cases (6.13) does not hold, which indicates that errors have occurred in the output of the multiplicative inversion.
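The three special cases above can be illustrated with a small Python sketch. Since $\mathbf{Z}\mathbf{s}^{-1}$ is, by Theorem 6.1, the field product of the input and the observed inverter output, the check is modelled here with a plain $GF(2^8)$ multiplication rather than with the matrix $\mathbf{Z}$; the function name `inversion_error_flag` is hypothetical.

```python
# Illustrative model of the check in (6.13); names are hypothetical.
def gf_mul(a, b):
    """Multiply a and b in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def inversion_error_flag(s, out):
    """Return 1 if (6.13) fails for input s and observed inverter output out."""
    product = gf_mul(s, out)      # left hand side Z s^{-1} as a byte
    u = 1 if (s | out) else 0     # OR of all input and output bits
    return int(product != u)      # right hand side is [u, 0, ..., 0]^T
```

For example, `inversion_error_flag(0, 0)` models the error-free zero case, while `inversion_error_flag(0, 0x01)` and `inversion_error_flag(0x53, 0)` model cases (i) and (ii).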
One can figure out that implementing (6.13) needs 64 AND gates, 15 OR gates, and 143 XOR gates. It is noted that using subexpression sharing, one can reduce the number of XOR gates to 84. If one implements the S-box using the composite field presented in [22], it requires 36 AND gates and 123 XOR gates for the original S-box implementation. Then, adding this fault detection scheme would require approximately 91% area overhead. This is derived assuming that an XOR gate is implemented by 10 transistors [95] and the silicon area of an AND gate is 0.6 of that of an XOR gate. Furthermore, the upper bound delay of the multiplication can be derived as $T_M \le T_A + 5T_X$, where $T_A$ and $T_X$ are the delays for an AND and an XOR gate, respectively [94]. This is the delay overhead after the derivation of the output of the SubBytes transformation. As a result of this high overhead, this scheme may not be applicable to area/delay-constrained applications.
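The quoted area overhead can be reproduced with a short calculation in XOR-equivalent units. The sketch below uses the 84-XOR implementation with subexpression sharing and the stated AND/XOR area ratio of 0.6; it additionally assumes that an OR gate has the same area as an AND gate, which is not stated in the text and is introduced here only to make the arithmetic concrete.

```latex
\begin{align*}
A_{\text{check}} &= (64 + 15)\times 0.6 + 84 = 131.4,\\
A_{\text{S-box}} &= 36 \times 0.6 + 123 = 144.6,\\
\frac{A_{\text{check}}}{A_{\text{S-box}}} &= \frac{131.4}{144.6} \approx 0.91 .
\end{align*}
```

Under these assumptions the checker costs roughly 91% of the composite field S-box area, in line with the figure given above.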
As mentioned above, comparing the actual result of the multiplication of the input and the output of the multiplicative inversion with the predicted one is not area efficient. Therefore, considering our derivation of the matrix $\mathbf{Z}$ in (6.14), the complexity of the fault detection scheme of the multiplicative inversion can be reduced by deriving only a partial result of the multiplication of the input and the output, based on the rows of $\mathbf{Z}$ that have the lowest overhead. One can then use this low-complexity signature for the fault detection of the multiplicative inversion.
6.2.2 The Proposed Scheme for the S-box and the Inverse S-box
The scheme in [38] does not take the affine transformation into account and checks it separately with an additional overhead. Furthermore, if one implements SubBytes in the AES using LUTs, there is no access to the output of the multiplicative inversion. Therefore, the above mentioned scheme cannot be used. In what follows, we propose a new scheme which is independent of the way the S-box and the inverse S-box are implemented. First, we obtain the matrix-based S-box formulations as follows:
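Theorem 6.2 Let $s = s_7\alpha^7 + s_6\alpha^6 + s_5\alpha^5 + s_4\alpha^4 + s_3\alpha^3 + s_2\alpha^2 + s_1\alpha + s_0$ and $s' = s'_7\alpha^7 + s'_6\alpha^6 + s'_5\alpha^5 + s'_4\alpha^4 + s'_3\alpha^3 + s'_2\alpha^2 + s'_1\alpha + s'_0$ be the 8-bit input and output of the S-box. Then, one can obtain the relation between the input and output of the S-box as:
\[ \mathbf{M}\mathbf{s}' + \mathbf{m} = \mathbf{u}', \tag{6.18} \]
where $\mathbf{u}' = [u', 0, 0, 0, 0, 0, 0, 0]^T$, $u' = (s_0 \vee s_1 \vee \cdots \vee s_7) \vee (\overline{s'_0} \vee \overline{s'_1} \vee s'_2 \vee s'_3 \vee s'_4 \vee \overline{s'_5} \vee \overline{s'_6} \vee s'_7)$ with the overbar denoting the logical complement, $\mathbf{s}' = [s'_0, s'_1, s'_2, s'_3, s'_4, s'_5, s'_6, s'_7]^T$, and $\mathbf{m} = [s_{6,0}, s_{7,6,1}, s_{7,2,0}, s_{6,3,1}, s_{7,6,4,2}, s_{7,5,3}, s_{6,4}, s_{7,5}]^T$. Furthermore, the $8 \times 8$ matrix $\mathbf{M}$ is denoted as follows:
\[ \mathbf{M} = \begin{pmatrix}
s_{6,5,2} & s_{5,4,1} & s_{7,5,3,0} & s_{6,4,2} & s_{7,5,3,1} & s_{7,6,5,2,0} & s_{7,6,5,4,1} & s_{7,6,3,0} \\
s_{7,5,3,2,0} & s_{6,4,2,1} & s_{7,6,5,4,3,1} & s_{7,6,5,4,3,2,0} & s_{7,6,5,4,3,2,1} & s_{5,3,2,1} & s_{4,2,1,0} & s_{6,4,3,1} \\
s_{6,4,3,1} & s_{7,5,3,2,0} & s_{7,6,5,4,2} & s_{7,6,5,4,3,1} & s_{7,6,5,4,3,2,0} & s_{6,4,3,2} & s_{5,3,2,1} & s_{7,5,4,2,0} \\
s_{7,6,4,0} & s_{6,5,3} & s_{6,0} & s_{7,5} & s_{6,4} & s_{6,4,3,2,0} & s_{7,5,3,2,1} & s_{7,5,1} \\
s_{7,6,2,1} & s_{7,6,5,1,0} & s_{5,3,1} & s_{4,2,0} & s_{3,1} & s_{6,4,3,2,1} & s_{7,5,3,2,1,0} & s_{7,3,2} \\
s_{7,3,2} & s_{7,6,2,1} & s_{6,4,2,0} & s_{5,3,1} & s_{4,2,0} & s_{7,5,4,3,2} & s_{6,4,3,2,1} & s_{4,3,0} \\
s_{4,3,0} & s_{7,3,2} & s_{7,5,3,1} & s_{6,4,2,0} & s_{5,3,1} & s_{6,5,4,3,0} & s_{7,5,4,3,2} & s_{5,4,1} \\
s_{5,4,1} & s_{4,3,0} & s_{6,4,2} & s_{7,5,3,1} & s_{6,4,2,0} & s_{7,6,5,4,1} & s_{6,5,4,3,0} & s_{6,5,2}
\end{pmatrix}. \tag{6.19} \]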
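Proof We prove (6.18) for two cases of $s \neq 0$ and $s = 0$, separately. Let the 8-bit input $s$ be a non-zero field element in $GF(2^8)$. Considering (6.2), i.e., inverting the affine transformation, one can obtain
\[ \mathbf{s}^{-1} = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & 0
\end{pmatrix} \mathbf{s}' + \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}. \tag{6.20} \]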
By substituting $\mathbf{s}^{-1}$ from (6.20) in (6.13), the left hand side of (6.13) becomes $\mathbf{Z}$ times the matrix in (6.20) times $\mathbf{s}'$, plus $\mathbf{Z}$ times the constant vector in (6.20). Now, let us denote the product of $\mathbf{Z}$ and the matrix in (6.20) by $\mathbf{M}$ and the product of $\mathbf{Z}$ and the constant vector in (6.20) by $\mathbf{m}$. Then, the left hand side of (6.18) is obtained. Since $s \neq 0 = (0, 0, \ldots, 0) \in GF(2^8)$, $u' = 1$. Moreover, according to the proof of Corollary 6.1, for $s \neq 0$, the left hand side of (6.13) is $[1, 0, \ldots, 0]^T$, i.e., the result of multiplication $C = A \cdot B \bmod P(x) = 1 \in GF(2^8)$. This implies that the left hand side of (6.13) is $\mathbf{Z}\mathbf{s}^{-1} = [1, 0, \ldots, 0]^T = \mathbf{u}'$. Furthermore, because we have $\mathbf{Z}\mathbf{s}^{-1} = \mathbf{M}\mathbf{s}' + \mathbf{m}$, one can prove that (6.18) is valid for $s \neq 0$.
Moreover, according to (6.2), for the input $s = 0 = (0, 0, \ldots, 0) \in GF(2^8)$, we have the output as $\mathbf{s}' = [s'_0, s'_1, \ldots, s'_7]^T = [1, 1, 0, 0, 0, 1, 1, 0]^T$ which corresponds to the field element $s' = \{63\}_h = (0, 1, 1, 0, 0, 0, 1, 1) \in GF(2^8)$. Therefore, as seen in Theorem 6.2, $\mathbf{u}' = [0, 0, \ldots, 0]^T$ since we have $u' = (s_0 \vee s_1 \vee \cdots \vee s_7) \vee (\overline{s'_0} \vee \overline{s'_1} \vee s'_2 \vee s'_3 \vee s'_4 \vee \overline{s'_5} \vee \overline{s'_6} \vee s'_7) = 0$. In addition, for $s = 0$, all the entries of the matrix $\mathbf{M}$ and the vector $\mathbf{m}$ in the left hand side of (6.18) are equal to zero. This results in the vector $[0, 0, \ldots, 0]^T = \mathbf{u}'$ for the left hand side of (6.18). Therefore, the proof is complete.
Let us consider (6.18) for the input $s = 0 = (0, 0, \ldots, 0) \in GF(2^8)$. For this input, the correct output is $s' = \{63\}_h = (0, 1, 1, 0, 0, 0, 1, 1) \in GF(2^8)$ (see (6.2)). If the erroneous output is not $s' = \{63\}_h$, in the right hand side of (6.18) we have $u' = 1$, whereas the left hand side is zero. Therefore, the erroneous output is detected.
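A software sketch can make the structure independence of (6.18) concrete: only the S-box input and output are used, never the internal inverter output. In the hedged Python model below, the left hand side $\mathbf{M}\mathbf{s}' + \mathbf{m}$ is evaluated as the field product of $s$ and the inverse-affine image of $s'$, which follows from the proof of Theorem 6.2 together with Theorem 6.1. The function names are hypothetical, and $u'$ is modelled as 1 unless $s = 0$ and $s' = \{63\}_h$, which is the behaviour the proof requires.

```python
# Illustrative model of the full 8-bit check (6.18); names are hypothetical.
def gf_mul(a, b):
    """Multiply a and b in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def rotl(x, k):
    """Rotate a byte left by k positions."""
    return ((x << k) | (x >> (8 - k))) & 0xFF

def sbox(s):
    """Reference S-box: inversion (as s^254) followed by the affine map (6.2)."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, s)
    return inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63

def inv_affine(sp):
    """Inverse affine map (6.20): bits i+2, i+5, i+7 (mod 8) plus {05}_h."""
    return rotl(sp, 1) ^ rotl(sp, 3) ^ rotl(sp, 6) ^ 0x05

def sbox_error_flag(s, sp):
    """Return 1 if (6.18) fails for S-box input s and observed output sp."""
    lhs = gf_mul(s, inv_affine(sp))          # M s' + m as a byte
    u = 1 if (s != 0 or sp != 0x63) else 0   # u' of Theorem 6.2
    return int(lhs != u)                     # right hand side [u', 0, ..., 0]^T
```

In this model every erroneous output is flagged for every input, which mirrors the statement that checking all eight coordinates of (6.18) detects all errors at the S-box output; the cost of doing so in hardware is what motivates the single-bit parity derived next.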
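[Figure 6.1: block diagram of the S-box (inversion in $GF(2^8)$ followed by the affine transformation) with 8-bit input $s$, 8-bit inversion output $s^{-1}$, and 8-bit output $s'$. Block and signal labels: Signature (Parity), Actual partial signature, Predicted signature, Comparator, Error indication flag; the signature signals are 1 bit wide.]
Figure 6.1: The proposed structure-independent fault detection scheme of the S-box.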
Proposition 6.1 Using subexpression sharing, the implementation of the left hand side
of (6.18) needs 64 AND gates and 111 XOR gates. Furthermore, the upper bound delay
of the relation in the left hand side of (6.18) is TA + 6TX , where, TA and TX are the
delays for an AND and an XOR gate, respectively.
Although checking the formulation of (6.18) detects all errors in the output of the S-box, its implementation is very costly (see Proposition 6.1). To reduce the overhead of the fault detection scheme, as seen in Fig. 6.1, we have obtained the single-bit parity for the formulation of (6.18). As shown in this figure, only one bit is compared instead of the 8-bit data, which detects any combination of an odd number of erroneous bits in the result of the left hand side of (6.18). Thus, one can check the parity of the two sides of (6.18) to obtain a one-bit equation for checking the S-box as follows:
Theorem 6.3 Let $s = s_7\alpha^7 + s_6\alpha^6 + s_5\alpha^5 + s_4\alpha^4 + s_3\alpha^3 + s_2\alpha^2 + s_1\alpha + s_0 \in GF(2^8)$ and $s' = s'_7\alpha^7 + s'_6\alpha^6 + s'_5\alpha^5 + s'_4\alpha^4 + s'_3\alpha^3 + s'_2\alpha^2 + s'_1\alpha + s'_0 \in GF(2^8)$ be the 8-bit input and output of the S-box. Then, the following equation holds for all the possible patterns of $s$ and $s'$:
\[ P_{(\mathbf{M}\mathbf{s}'+\mathbf{m})} = s_0(s'_b + s'_c) + s_1 s'_b + s_2 s'_d + s_3 s'_4 + s_4(s'_c + s'_3) + s_5 s'_a + s_6(s'_d + \overline{s'_6}) + s_7(s'_5 + \overline{s'_4}) = u', \tag{6.21} \]
where $s'_a = s'_0 + s'_2 + s'_3 + s'_5$, $s'_b = s'_a + s'_7$, $s'_c = s'_1 + s'_4 + s'_6$, $s'_d = s'_2 + s'_7$, and the overbar denotes the logical complement.
Proof After obtaining the parity of two sides of (6.18) we have
\[ P_{(\mathbf{M}\mathbf{s}'+\mathbf{m})} = P_{\mathbf{u}'} = u', \tag{6.22} \]
where $\mathbf{M}$, $\mathbf{m}$ and $\mathbf{u}'$ are presented in Theorem 6.2. Considering the fact that parity is a linear operation, one can obtain the left hand side of (6.22) as $P_{(\mathbf{M}\mathbf{s}'+\mathbf{m})} = P_{\mathbf{M}\mathbf{s}'} + P_{\mathbf{m}}$. Then, using $\mathbf{M}$ and $\mathbf{m}$ defined in Theorem 6.2, one can obtain $P_{\mathbf{M}\mathbf{s}'} = s'_0 s_a + s'_1 s_b + s'_2 s_c + s'_3(s_a + s_4) + s'_4(s_b + s_3 + s_7) + s'_5(s_a + s_7) + s'_6(s_b + s_6) + s'_7(s_5 + s_c)$ and $P_{\mathbf{m}} = s_6 + s_7$, where $s_a = s_0 + s_1 + s_5$, $s_b = s_0 + s_4$, and $s_c = s_a + s_2 + s_6$. After rearranging, one reaches (6.21) and the proof is complete.
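The single-bit signature can be exercised exhaustively in software. The Python sketch below evaluates the parity of $\mathbf{M}\mathbf{s}' + \mathbf{m}$ from the sums given in the proof ($P_{\mathbf{M}\mathbf{s}'}$ regrouped by the input bits, plus $P_{\mathbf{m}} = s_6 + s_7$, which appears as the two `^ 1` terms, i.e., the two inverters) and compares it with $u'$. The function names are hypothetical, and $u'$ is modelled as 1 unless $s = 0$ and $s' = \{63\}_h$, as required by the proof of Theorem 6.2.

```python
# Illustrative model of the parity-based check; names are hypothetical.
def gf_mul(a, b):
    """Multiply a and b in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def rotl(x, k):
    """Rotate a byte left by k positions."""
    return ((x << k) | (x >> (8 - k))) & 0xFF

def sbox(s):
    """Reference S-box: inversion (as s^254) followed by the affine map (6.2)."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, s)
    return inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63

def parity_signature(s, sp):
    """Parity of M s' + m, computed only from the bits of s and s'."""
    x = [(s >> i) & 1 for i in range(8)]
    y = [(sp >> i) & 1 for i in range(8)]
    ya = y[0] ^ y[2] ^ y[3] ^ y[5]
    yb = ya ^ y[7]
    yc = y[1] ^ y[4] ^ y[6]
    yd = y[2] ^ y[7]
    return ((x[0] & (yb ^ yc)) ^ (x[1] & yb) ^ (x[2] & yd) ^ (x[3] & y[4])
            ^ (x[4] & (yc ^ y[3])) ^ (x[5] & ya)
            ^ (x[6] & (yd ^ y[6] ^ 1))      # "^ 1" accounts for s6 in P_m
            ^ (x[7] & (y[5] ^ y[4] ^ 1)))   # "^ 1" accounts for s7 in P_m

def error_flag(s, sp):
    """Error indication flag: parity signature compared with u'."""
    u = 1 if (s != 0 or sp != 0x63) else 0
    return parity_signature(s, sp) ^ u
```

As a usage example, `sum(error_flag(s, sbox(s) ^ (1 << k)) for s in range(256) for k in range(8))` counts how many of the 2048 single-bit output errors raise the flag in this model; unlike the full 8-bit check, a single parity bit is not expected to flag all of them.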
To implement (6.21), 18 XOR gates, 8 AND gates, and two NOT gates are needed. Also, the delay overhead associated with this implementation is the delay of 6 XORs and one AND after the completion of the S-box. It is noted that this delay can be overlapped with other AES round transformations and hence it will not reduce the speed of the entire AES implementation with fault detection. More details on this will be presented later in this chapter. The parity obtained by the parity circuit is then compared with $u'$ (see Theorems 6.2 and 6.3) to obtain the error indication flag of each S-box, i.e., $e_{r,c}$, $0 \le r, c \le 3$. It is noted that using an OR tree for the error indication flags of 16 S-boxes, the final error indication flag of the entire SubBytes transformation is obtained. The final error indication flag of the SubBytes transformation signals an error if at least one of the error indication flags of the 16 S-boxes detects an error.
Now, we present the fault detection scheme for the inverse S-box in the AES decryption. The inverse S-box of the decryption consists of the inverse affine transformation (the inverse of the affine transformation in (6.2)) followed by the multiplicative inversion. In other words, one can obtain the inverse S-box by removing the affine transformation and adding the inverse affine one. The inverse affine transformation has the input $s'$ and the output $s^{-1}$, and the following multiplicative inversion has the input $s^{-1}$ and the output $s$. Therefore, Theorems 6.2 and 6.3 are also valid for the inverse S-box and hence we can conclude the following for the inverse S-box.
Corollary 6.2 For the fault detection of the inverse S-box, one can use (6.21) by changing the place of the input and output, i.e., swapping the coordinates of $s$ with $s'$.
6.3 Proposed Fault Detection Schemes for the AES
As mentioned before, the parity-based scheme proposed in [35] is one of the rst fault
detection schemes and has received attention in the literature. Although the approach
in [35] is a good scheme in terms of the fault detection capability, it has two drawbacks.
First, this approach is based on using the expanded S-boxes and inverse S-boxes for parity predictions, i.e., two blocks of $256 \times 9$ memory cells. Not only does this restrict the AES encryption and decryption implementations to LUT-based S-boxes and inverse S-boxes, but it also has a high area overhead. To counteract this drawback, one may use the proposed fault detection scheme for the S-box or the inverse S-box. As an example, for the AES encryption one may use (6.21) for the S-boxes. This results in obtaining the output parity of each S-box concurrently without having an extra circuit for deriving it, i.e., $P_{s'} = \sum_{i=0}^{7} s'_i = s'_b + s'_c$ in (6.21). This simplifies the fault detection circuit of the AES when the output parities of the S-boxes are utilized for the fault detection of other transformations in the AES rounds in [35]. More specifically, if one uses the scheme presented in [35] for the fault detection of the MixColumns transformation, the predicted parities of this transformation become functions of the output parities of the ShiftRows (SubBytes) transformation. Using the proposed scheme for the S-box in this chapter, one can easily utilize the output parities of the S-boxes to predict the parities of the MixColumns transformation.
The second drawback of the approach in [35] is the relatively high area complexity of
the parity predictions of MixColumns in the AES encryption. For the AES decryption,
the area complexity of the predicted parities of InvMixColumns is even more [36]. The
implementation results presented later in this chapter show the high area overhead of
this scheme. Considering the fact that a low-cost fault detection scheme for the AES
encryption and decryption is preferred, in this section, we propose signature-based low-complexity fault detection schemes for the transformations in the AES encryption and decryption. We consider AES-128 (which is denoted as AES in the remainder of this chapter) for the sake of brevity. It is noted that the proposed schemes can also be applied to AES-192 and AES-256. The proposed schemes for the AES transformations are based on deriving the low-cost output signatures of the transformations in the AES rounds and comparing them with their actual signatures for reaching the error indication flags.
6.3.1 AES Encryption
We present the new fault detection structure for the AES encryption in the following. A typical AES encryption round (except for the last round) consists of four transformations. Their fault detection schemes are shown in Fig. 6.2 and presented in detail below.
SubBytes and ShiftRows
In the AES encryption, the SubBytes transformation consists of 16 S-boxes (see (6.3)).
Let $e_{r,c}$, $0 \le r, c \le 3$, be the error indication flag for the S-box with the
input $s_{r,c}$ and the output $s'_{r,c}$. These flags can be written as the following
16 equations:
\[
e_{r,c} = P(M_{r,c}\, s'_{r,c} + m_{r,c}) + u'_{r,c}, \quad 0 \le r, c \le 3, \tag{6.23}
\]
where $u'_{r,c}$ is defined in Theorem 6.2 and, for a typical S-box,
$P(M_{r,c}\, s'_{r,c} + m_{r,c})$ is presented in (6.21).
The 128-bit output of the SubBytes transformation acts as the input to ShiftRows.
As seen in (6.4), the output state of ShiftRows is obtained by shifting the state entries in
(6.3). Therefore, by considering the corresponding output of ShiftRows in (6.4), one can
check the two transformations SubBytes and ShiftRows together using 16 error indication
flags. According to (6.4) and considering (6.23), for row $r$ and column $c$, the flags
can be written as the following 16 equations:
\[
e_{r,c} = P(M_{r,c^*}\, s'_{r,c^*} + m_{r,c^*}) + u'_{r,c^*}, \quad 0 \le r, c \le 3, \tag{6.24}
\]
where $c^* = (r + c) \bmod 4$.
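The index mapping $c^* = (r + c) \bmod 4$ used in (6.24) can be checked with a minimal Python sketch. It assumes that (6.4) is the standard ShiftRows, i.e., row $r$ is rotated to the left by $r$ positions; the function name `shift_rows` and the labelled test state are illustrative only and are not part of the proposed hardware.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state to the left by r positions (standard AES ShiftRows)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

# Label every entry with its (row, column) position so the re-wiring is visible.
state = [[(r, c) for c in range(4)] for r in range(4)]
shifted = shift_rows(state)

# Entry (r, c) of the shifted state comes from column c* = (r + c) mod 4 of the same row.
for r in range(4):
    for c in range(4):
        assert shifted[r][c] == (r, (r + c) % 4)
```

Under this assumption, the flag at position (3, 0) is driven by the S-box of entry (3, 3), which agrees with the pairing of the S-boxes and flags drawn in Fig. 6.2.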
According to (6.24), 16 error indication flags for the SubBytes and ShiftRows
transformations, i.e., one error indication flag for each byte, are obtained. This is shown in
Fig. 6.2. As seen in this figure, (6.24), i.e., instances of the hardware implementation of
(6.21), is utilized for obtaining the 16 error indication flags.

Figure 6.2: The proposed fault detection scheme for the ith round of the AES encryption.
MixColumns and AddRoundKey
The third and the fourth transformations in a typical AES encryption round are Mix-
Columns and AddRoundKey. It is noted that MixColumns is constructed using (6.5).
Furthermore, according to (6.6), AddRoundKey is the modulo-2 addition of the input
state with the roundkey. In what follows, we present a key formulation that is used for
deriving a low-complexity fault detection scheme for MixColumns and AddRoundKey
combined.
Theorem 6.4 Let $SR(S') = [s'_{r,c^*}]_{r,c=0}^{3}$ and $K = [k_{r,c}]_{r,c=0}^{3}$ be the
input and the roundkey input of MixColumns and AddRoundKey in round $i$, respectively.
Let the output of AddRoundKey be $O = [o_{r,c}]_{r,c=0}^{3}$ (see (6.6)). Then, the
following holds:
\[
\sum_{r=0}^{3} (s'_{r,c^*} + k_{r,c} + o_{r,c}) = 0 \in GF(2^8), \quad 0 \le c \le 3, \tag{6.25}
\]
where $c^* = (r + c) \bmod 4$, and each summation is over $GF(2^8)$, which consists of eight
modulo-2 additions.
Proof After adding the entries of each column of $S''$ in (6.5), one reaches the following:
\begin{align}
s''_{0,0} + s''_{1,0} + s''_{2,0} + s''_{3,0} &= (\{2\}_{16} + \{1\}_{16} + \{1\}_{16} + \{3\}_{16})(s'_{0,0} + s'_{1,1} + s'_{2,2} + s'_{3,3}), \tag{6.26} \\
s''_{0,1} + s''_{1,1} + s''_{2,1} + s''_{3,1} &= (\{2\}_{16} + \{1\}_{16} + \{1\}_{16} + \{3\}_{16})(s'_{0,1} + s'_{1,2} + s'_{2,3} + s'_{3,0}), \tag{6.27} \\
s''_{0,2} + s''_{1,2} + s''_{2,2} + s''_{3,2} &= (\{2\}_{16} + \{1\}_{16} + \{1\}_{16} + \{3\}_{16})(s'_{0,2} + s'_{1,3} + s'_{2,0} + s'_{3,1}), \tag{6.28} \\
s''_{0,3} + s''_{1,3} + s''_{2,3} + s''_{3,3} &= (\{2\}_{16} + \{1\}_{16} + \{1\}_{16} + \{3\}_{16})(s'_{0,3} + s'_{1,0} + s'_{2,1} + s'_{3,2}). \tag{6.29}
\end{align}
Considering the fact that $\{3\}_{16} = \{1\}_{16} + \{2\}_{16}$, we have
$\{2\}_{16} + \{1\}_{16} + \{1\}_{16} + \{3\}_{16} = \{1\}_{16}$. Moreover, the right-hand
sides of (6.26)-(6.29) are the additions of the columns of the matrix $SR(S')$ in (6.4).
Therefore, the addition of the column elements of $S''$ is equal to that of the
corresponding column of $SR(S')$, i.e.,
$\sum_{r=0}^{3} s''_{r,c} = \sum_{r=0}^{3} s'_{r,c^*}$, $0 \le c \le 3$.
Furthermore, according to (6.6), we have
\[
\sum_{r=0}^{3} o_{r,c} = \sum_{r=0}^{3} s''_{r,c} + \sum_{r=0}^{3} k_{r,c}, \quad 0 \le c \le 3. \tag{6.30}
\]
Therefore, considering (6.30) we reach
\[
\sum_{r=0}^{3} o_{r,c} = \sum_{r=0}^{3} s'_{r,c^*} + \sum_{r=0}^{3} k_{r,c}, \quad 0 \le c \le 3. \tag{6.31}
\]
Considering (6.31), we have
$\sum_{r=0}^{3} (s'_{r,c^*} + k_{r,c} + o_{r,c}) = (0, 0, \ldots, 0) \in GF(2^8)$ and the proof
is complete.
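The invariant of Theorem 6.4 can be checked numerically with the following minimal Python sketch. It assumes the standard AES MixColumns matrix and the reduction polynomial $x^8 + x^4 + x^3 + x + 1$; the helper names (`xtime`, `mix_column`, `column_flag`) are illustrative and do not come from the thesis. The column passed in plays the role of one column of $SR(S')$.

```python
import random

def xtime(a):
    """Multiply a GF(2^8) element by {02} modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a

def mix_column(col):
    """Apply the MixColumns matrix (rows are rotations of 02 03 01 01) to one column."""
    out = []
    for r in range(4):
        a, b, c, d = (col[(r + i) % 4] for i in range(4))
        out.append(xtime(a) ^ (xtime(b) ^ b) ^ c ^ d)
    return out

def column_flag(sr_col, key_col):
    """XOR of the ShiftRows output, roundkey and round output bytes of one column."""
    out_col = [m ^ k for m, k in zip(mix_column(sr_col), key_col)]
    flag = 0
    for s, k, o in zip(sr_col, key_col, out_col):
        flag ^= s ^ k ^ o
    return flag

# In the error-free case the 8-bit flag is zero for every column and roundkey.
rng = random.Random(2011)
for _ in range(1000):
    sr_col = [rng.randrange(256) for _ in range(4)]
    key_col = [rng.randrange(256) for _ in range(4)]
    assert column_flag(sr_col, key_col) == 0
```

The check works because every column of the MixColumns matrix sums to $\{1\}_{16}$, so the XOR of a column is preserved by the transformation.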
Now, let us introduce the four 8-bit error indication flags for the four columns of the
state as
\[
E_c = \sum_{r=0}^{3} (s'_{r,c^*} + k_{r,c} + o_{r,c}), \quad 0 \le c \le 3. \tag{6.32}
\]
One can use Theorem 6.4 to verify that, in the error-free situation, all 32 bits of the
flags in (6.32) are zero, i.e., $E_c = 0 = (0, 0, \ldots, 0) \in GF(2^8)$, $0 \le c \le 3$.
These 32 error indication flags can be used for the MixColumns and AddRoundKey
transformations combined, i.e., 8 error indication flags for each column of the state matrix.
This is shown in Fig. 6.2. It is noted that in Fig. 6.2, $[k_{r,c}]_{r,c=0}^{3}$ is the
round $i$ key. As seen in this figure, using (6.32), 32 error indication flags are obtained.
These error indication flags can be compressed so that $n$, $1 \le n \le 32$, error
indication flags for these two transformations are achieved. This can be performed by ORing
different combinations of the 32 error indication flags obtained in (6.32), as denoted by the
compressor block in Fig. 6.2. This gives freedom in choosing the number of error indication
flags used in the fault detection scheme of the MixColumns and AddRoundKey transformations.
It is interesting to note that although up to 32 flags can be used, our simulations show that
with 16 error indication flags (the same number as the flags derived for SubBytes and
ShiftRows), more than 99% of the errors are covered.
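The compressor block can be sketched in Python as follows. The text only states that different combinations of the 32 flags are ORed, so the grouping rule used here (flag `i` feeds output `i mod n`) is a hypothetical choice made for illustration, as is the function name `compress_flags`.

```python
def compress_flags(flags32, n):
    """OR-reduce 32 one-bit error flags into n flags, 1 <= n <= 32.

    The grouping (input i feeds output i % n) is an assumed example;
    any partition of the 32 inputs into n OR trees would serve.
    """
    if len(flags32) != 32 or not 1 <= n <= 32:
        raise ValueError("expected 32 input flags and 1 <= n <= 32")
    out = [0] * n
    for i, bit in enumerate(flags32):
        out[i % n] |= bit & 1
    return out

# An error-free round leaves every compressed flag at zero.
assert compress_flags([0] * 32, 16) == [0] * 16

# A single raised input flag always raises exactly one compressed flag.
flags = [0] * 32
flags[17] = 1
assert compress_flags(flags, 16)[17 % 16] == 1
```

Because OR is used rather than XOR, a raised input flag can never be cancelled by another raised flag in the same group; compression only reduces how precisely the error location is reported.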
The last round of the AES encryption (round 10 in AES-128 encryption) consists of
three transformations, i.e., SubBytes, ShiftRows and AddRoundKey. In other words,
compared to the other encryption rounds, the MixColumns transformation has been
removed. We present the following for the fault detection of this round.
Remark Similar to the fault detection scheme for the other rounds of the AES encryption,
one can use (6.24) for the last encryption round to derive 16 error indication
flags for SubBytes and ShiftRows combined. Furthermore, one can use (6.31) for the
relation of the inputs and the output of AddRoundKey (see also Fig. 6.2 with
MixColumns removed). Therefore, (6.32) can also be used for the last round. Consequently, by
removing the MixColumns transformation, one can also utilize the fault detection scheme
in Fig. 6.2 for the last encryption round of the AES.
Further Improvements
The proposed fault detection scheme for a typical round of the AES encryption can be
modified so that its complexity is reduced. This improvement is based on the fact that,
using subexpression sharing, one can reduce the number of logic gates utilized in obtaining
the two sets of error indication flags shown in Fig. 6.2. Specifically, in this chapter, we
propose a fault detection scheme for the MixColumns transformation which has 25% less area
overhead than the scheme presented in [35] and [36].
As seen in Fig. 6.2, the error indication flags of SubBytes and ShiftRows are obtained
utilizing the output state of ShiftRows, i.e., $SR(S')$ in (6.4). Furthermore, as shown in
this figure, this state is also used in obtaining the error indication flags of MixColumns
and AddRoundKey. This leads us to perform subexpression sharing in deriving these two
sets of error indication flags to obtain a low-complexity fault detection scheme for the AES
encryption. We use (6.32) to derive 16 low-complexity signatures for the MixColumns
and AddRoundKey transformations, i.e., 4 signatures for each column of the state matrix.
This is performed by modulo-2 addition of two sets of four coordinates of (6.32) for each
column, i.e., $E_c = (e_{c,7}, e_{c,6}, \ldots, e_{c,0}) \in GF(2^8)$, $0 \le c \le 3$. Let
$\hat{E}_c = (e_{c,4}, e_{c,2}, e_{c,1}, e_{c,0})$ and
$\tilde{E}_c = (e_{c,5}, e_{c,7}, e_{c,6}, e_{c,3})$. Then, the four error indication flags for
column $c$ of the state are
\[
\bar{E}_c = \hat{E}_c + \tilde{E}_c, \quad 0 \le c \le 3. \tag{6.33}
\]
One can utilize four sets of modulo-2 additions of the output bits of each S-box
pre-computed in (6.21), i.e., $s'_4 + s'_5$, $s'_2 + s'_7$, $s'_1 + s'_6$, and $s'_0 + s'_3$,
to obtain the low-complexity error indication flags in (6.33). This is shown in Fig. 6.3. As
seen in this figure, the Common Subexpressions (CS) unit has been utilized to obtain 64
common subexpressions, i.e., 4 for each of the 16 S-boxes in the SubBytes transformation. As
depicted in Fig. 6.3, these outputs are then used in obtaining the two sets of 16 error
indication flags for SubBytes and ShiftRows combined, i.e., $e_{r,c}$, $0 \le r, c \le 3$,
and for MixColumns and AddRoundKey combined, i.e., $\bar{E}_c$, $0 \le c \le 3$,
respectively. Realizing (6.24) in Fig. 6.3 is less complex than in Fig. 6.2, because (6.24)
utilizes the hardware implementation of (6.21), which is less complex when the common
subexpressions are used. It is noted that if any flag in the two derived sets of error
indication flags is one, an error is detected. If all of them are zero, no error has been
detected, although the output can be either erroneous or correct.
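The folding in (6.33) is linear over GF(2), which is why the folded S-box output bits can be shared between the two sets of flags. The Python sketch below illustrates this with the bit pairing stated above ($e_4 + e_5$, $e_2 + e_7$, $e_1 + e_6$, $e_0 + e_3$); the function names and the sample byte values are illustrative assumptions.

```python
def fold_enc(byte):
    """4-bit signature (e4+e5, e2+e7, e1+e6, e0+e3) of one byte, as in (6.33)."""
    b = [(byte >> i) & 1 for i in range(8)]
    return (b[4] ^ b[5], b[2] ^ b[7], b[1] ^ b[6], b[0] ^ b[3])

def xor_sig(a, b):
    """Bitwise modulo-2 addition of two 4-bit signatures."""
    return tuple(x ^ y for x, y in zip(a, b))

def column_signature(sr_col, key_col, out_col):
    """Fold every byte first, then add the folded values (the shared-subexpression order)."""
    sig = (0, 0, 0, 0)
    for byte in (*sr_col, *key_col, *out_col):
        sig = xor_sig(sig, fold_enc(byte))
    return sig

# Folding each byte and then adding gives the same result as adding and then folding.
sr_col, key_col, out_col = [0x12, 0x34, 0x56, 0x78], [0x9A, 0xBC, 0xDE, 0xF0], [1, 2, 3, 4]
total = 0
for byte in (*sr_col, *key_col, *out_col):
    total ^= byte
assert column_signature(sr_col, key_col, out_col) == fold_enc(total)

# Every single-bit error pattern raises a flag; two errors in one pair cancel.
assert all(any(fold_enc(1 << i)) for i in range(8))
assert fold_enc(0b00110000) == (0, 0, 0, 0)
```

The last assertion is an instance of the case noted above in which all flags are zero although the output is erroneous.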
One can compare the complexity of the proposed fault detection scheme for MixColumns
with that of [35] and [36]. For comparison, we consider the error indication
flags of this transformation separately, i.e., without considering AddRoundKey. In the
fault detection scheme of MixColumns, we only need 3 XOR gates for each signature,
i.e., modulo-2 addition of the 4 common subexpressions presented above, e.g.,
$s'_4 + s'_5$, in the four rows. Therefore, we have the following remark:
Figure 6.3: The proposed low-complexity fault detection scheme for the ith round of the
AES encryption utilizing subexpression sharing.
Remark For having 16 signatures for the MixColumns transformation, 48 XOR gates
are needed. Comparing this with the parity-based scheme presented in [35] and [36] which
needs 64 XOR gates for the predicted parities, this is a 25% area overhead reduction.
Moreover, there are two XORs in the critical path delay of the proposed scheme for
MixColumns compared to 3 XORs for the scheme in [35] and [36] which is a 33% reduction
in the critical path delay.
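The figures in this remark follow from simple arithmetic, written out below as a worked check. The counts 64 XOR gates and 3 XOR levels for [35] and [36] are taken from the remark itself.

```latex
\begin{align*}
\text{XOR gates (proposed)} &= 16 \text{ signatures} \times 3 \text{ XORs} = 48, \\
\text{area reduction} &= \frac{64 - 48}{64} = 25\%, \\
\text{critical path reduction} &= \frac{3 - 2}{3} \approx 33\%.
\end{align*}
```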
6.3.2 AES Decryption
We present the fault detection scheme for the AES decryption in what follows. It is noted
that the AES decryption rounds (except for the last round) consist of four transformations,
i.e., InvShiftRows, InvSubBytes, AddRoundKey and InvMixColumns. The fault
detection schemes of these transformations are presented in detail in the following.
InvShiftRows and InvSubBytes
As seen in (6.7), in the AES decryption, the entries of the 128-bit input to InvShiftRows,
i.e., the state matrix $S'$, are cyclically shifted to the right, with the first row
remaining unchanged. Therefore, this transformation is just a re-wiring in hardware.
The output state of the InvShiftRows transformation, i.e., $ISR(S')$ in (6.7), acts
as the input to InvSubBytes. The InvSubBytes transformation in the AES decryption
consists of 16 inverse S-boxes. One can use Corollary 6.2 for the fault detection scheme of
the inverse S-boxes. Then, the fault detection scheme for InvShiftRows and InvSubBytes
combined can be derived so that these two transformations are checked together.
Let $e_{r,c}$, $0 \le r, c \le 3$, be the error indication flag of each byte of these two
transformations combined, with the input $s'_{r,c^*}$ and the output $s_{r,c}$. Then,
according to (6.18), these flags can be written as the following 16 equations:
\[
e_{r,c} = P(M_{r,c}\, s'_{r,c^*} + m_{r,c}) + u'_{r,c}, \quad 0 \le r, c \le 3, \tag{6.34}
\]
where $c^* = |r - c|$.
According to (6.34), 16 error indication flags for the InvShiftRows and InvSubBytes
transformations, i.e., one error indication flag for each byte, are obtained. This is shown
in Fig. 6.4. As seen in this figure, (6.34), i.e., instances of the hardware implementation
of (6.21), is utilized for obtaining these 16 error indication flags.
AddRoundKey and InvMixColumns
As seen in Fig. 6.4, the third and the fourth transformations in a typical AES decryption
round are AddRoundKey and InvMixColumns. In the AddRoundKey transformation,
the input state, i.e., $S$, is added to the roundkey input state, i.e., $K$. Furthermore,
the InvMixColumns transformation is equivalent to multiplying the input state by the
constant matrix in (6.8). In what follows, we present a key formulation used for deriving
a low-complexity fault detection scheme for these two transformations combined.
Theorem 6.5 Let $K = [k_{r,c}]_{r,c=0}^{3}$ and $S = [s_{r,c}]_{r,c=0}^{3}$ be the roundkey
input and the input of AddRoundKey in round $i$, respectively. Let the output of
InvMixColumns be $O = [o_{r,c}]_{r,c=0}^{3}$ (see (6.8)). Then, the following holds:
\[
\sum_{r=0}^{3} (s_{r,c} + k_{r,c} + o_{r,c}) = 0 \in GF(2^8), \quad 0 \le c \le 3, \tag{6.35}
\]
where each summation is over $GF(2^8)$, which consists of eight modulo-2 additions.

Figure 6.4: The proposed fault detection scheme for the ith round of the AES decryption.
Proof After adding the entries of each column of $O$, according to (6.8) one reaches
\begin{align}
o_{0,0} + o_{1,0} + o_{2,0} + o_{3,0} &= (\{e\}_{16} + \{9\}_{16} + \{d\}_{16} + \{b\}_{16})(s''_{0,0} + s''_{1,0} + s''_{2,0} + s''_{3,0}), \tag{6.36} \\
o_{0,1} + o_{1,1} + o_{2,1} + o_{3,1} &= (\{e\}_{16} + \{9\}_{16} + \{d\}_{16} + \{b\}_{16})(s''_{0,1} + s''_{1,1} + s''_{2,1} + s''_{3,1}), \tag{6.37} \\
o_{0,2} + o_{1,2} + o_{2,2} + o_{3,2} &= (\{e\}_{16} + \{9\}_{16} + \{d\}_{16} + \{b\}_{16})(s''_{0,2} + s''_{1,2} + s''_{2,2} + s''_{3,2}), \tag{6.38} \\
o_{0,3} + o_{1,3} + o_{2,3} + o_{3,3} &= (\{e\}_{16} + \{9\}_{16} + \{d\}_{16} + \{b\}_{16})(s''_{0,3} + s''_{1,3} + s''_{2,3} + s''_{3,3}). \tag{6.39}
\end{align}
We have $\{e\}_{16} + \{9\}_{16} + \{d\}_{16} + \{b\}_{16} = \{1\}_{16}$. Noting that the
left-hand sides of (6.36)-(6.39) are the additions of the columns of the output state of
InvMixColumns, the addition of the column elements of $S''$ is equal to that of the
corresponding column of $O$, i.e.,
$\sum_{r=0}^{3} s''_{r,c} = \sum_{r=0}^{3} o_{r,c}$, $0 \le c \le 3$. Furthermore, for the
AddRoundKey transformation we have
\[
\sum_{r=0}^{3} s''_{r,c} = \sum_{r=0}^{3} s_{r,c} + \sum_{r=0}^{3} k_{r,c}, \quad 0 \le c \le 3. \tag{6.40}
\]
Therefore, according to (6.40) we reach
\[
\sum_{r=0}^{3} o_{r,c} = \sum_{r=0}^{3} s_{r,c} + \sum_{r=0}^{3} k_{r,c}, \quad 0 \le c \le 3. \tag{6.41}
\]
Considering (6.41), one can obtain
$\sum_{r=0}^{3} (s_{r,c} + k_{r,c} + o_{r,c}) = (0, 0, \ldots, 0) \in GF(2^8)$ and
the proof is complete.
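As a numerical check of Theorem 6.5, the following Python sketch applies AddRoundKey followed by InvMixColumns to one column and evaluates the sum in (6.35). It assumes the standard InvMixColumns matrix with first row $\{e\}, \{b\}, \{d\}, \{9\}$ and the reduction polynomial $x^8 + x^4 + x^3 + x + 1$; the function names are illustrative and not part of the thesis.

```python
import random

def gmul(a, b):
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

INV_ROW = (0x0E, 0x0B, 0x0D, 0x09)  # first row of the InvMixColumns matrix

def inv_mix_column(col):
    """Apply the circulant InvMixColumns matrix to one 4-byte column."""
    out = []
    for r in range(4):
        acc = 0
        for i in range(4):
            acc ^= gmul(INV_ROW[(i - r) % 4], col[i])
        out.append(acc)
    return out

def dec_column_flag(s_col, key_col):
    """XOR of the AddRoundKey input, roundkey and InvMixColumns output bytes of one column."""
    out_col = inv_mix_column([s ^ k for s, k in zip(s_col, key_col)])
    flag = 0
    for s, k, o in zip(s_col, key_col, out_col):
        flag ^= s ^ k ^ o
    return flag

# The four matrix constants add up to {1}, so every column sum is preserved.
assert 0x0E ^ 0x09 ^ 0x0D ^ 0x0B == 0x01

rng = random.Random(2011)
for _ in range(1000):
    s_col = [rng.randrange(256) for _ in range(4)]
    key_col = [rng.randrange(256) for _ in range(4)]
    assert dec_column_flag(s_col, key_col) == 0
```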
Similar to the AES encryption, for the AES decryption, we introduce the four 8-bit error
indication flags for the four columns of the state as
\[
E_c = \sum_{r=0}^{3} (s_{r,c} + k_{r,c} + o_{r,c}), \quad 0 \le c \le 3. \tag{6.42}
\]
These 32 error indication flags for the four columns of the state can be utilized for the
fault detection of the AddRoundKey and InvMixColumns transformations combined. This is
shown in Fig. 6.4. As in the AES encryption, these error indication flags can
be compressed so that $n$, $1 \le n \le 32$, error indication flags for these two
transformations are achieved. This gives freedom in choosing the number of error indication
flags used in the fault detection scheme of the AddRoundKey and InvMixColumns
transformations. It is interesting to note that our simulations for the AES decryption show
that with 16 error indication flags, more than 99% of the errors are covered.
Similar to the AES encryption, in the last round of the AES decryption, three trans-
formations are used, i.e., InvMixColumns is removed. We present the following for the
fault detection of this round.
Remark Similar to the fault detection scheme for the other rounds of the AES decryption,
one can use (6.34) for the last decryption round to derive 16 error indication flags
for InvShiftRows and InvSubBytes combined. Furthermore, one can use (6.41) for the
relation of the inputs and the output of AddRoundKey (see also Fig. 6.4 with
InvMixColumns removed). Therefore, (6.42) can also be used for the last round. Consequently,
by removing the InvMixColumns transformation, one can also utilize the fault detection
scheme in Fig. 6.4 for the last decryption round of the AES.
Further Improvements
Using subexpression sharing, the proposed fault detection scheme for a typical AES
decryption round can be modified so that its hardware complexity is reduced. As seen
in Fig. 6.4, the error indication flags of InvShiftRows and InvSubBytes are obtained
utilizing the output state of InvSubBytes, i.e., $S$. As shown in Fig. 6.4, this output state
is also used in obtaining the error indication flags of AddRoundKey and InvMixColumns.
Therefore, similar to the fault detection scheme for the AES encryption, we can perform
subexpression sharing in obtaining these two sets of error indication flags to reach a
low-complexity fault detection scheme for the AES decryption. First, we present the following
for the inverse S-boxes by rearranging Corollary 6.2, so that we are able to present a
low-complexity fault detection scheme for the AES decryption.
Corollary 6.3 Let
$s' = s'_7\alpha^7 + s'_6\alpha^6 + s'_5\alpha^5 + s'_4\alpha^4 + s'_3\alpha^3 + s'_2\alpha^2 + s'_1\alpha + s'_0 \in GF(2^8)$
and
$s = s_7\alpha^7 + s_6\alpha^6 + s_5\alpha^5 + s_4\alpha^4 + s_3\alpha^3 + s_2\alpha^2 + s_1\alpha + s_0 \in GF(2^8)$
be the 8-bit input and output of the inverse S-box. Then, the following equation holds for
all the possible patterns of $s$ and $s'$:
\begin{align}
P(Ms' + m) &= s'_0 s_a + s'_1 s_b + s'_2 s_c + s'_3 (s_a + s_4) + s'_4 (s_b + s_3 + s_7) \notag \\
&\quad + s'_5 (s_a + s_7) + s'_6 (s_b + s_6) + s'_7 (s_5 + s_c) + s_6 + s_7 = u', \tag{6.43}
\end{align}
where $s_a = s_0 + s_1 + s_5$, $s_b = s_0 + s_4$, $s_c = s_a + s_2 + s_6$, and
$u' = (s_0 \vee s_1 \vee \ldots \vee s_7) \vee (s'_0 \vee s'_1 \vee s'_2 \vee s'_3 \vee s'_4 \vee s'_5 \vee s'_6 \vee s'_7)$.
Proof According to Theorem 6.3 and Corollary 6.2, one can re-write (6.21) and swap
the input and the output to derive (6.43). Therefore, the proof is complete.
To implement the signature presented in the left-hand side of (6.43), 20 XOR gates and
8 AND gates are needed. Then, it is compared with $u'$ to obtain the error indication flag
of each inverse S-box.
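The gate count stated for (6.43) can be reproduced by a straightforward tally, written out below as a worked check. It assumes that $s_a$, $s_b$ and $s_c$ are each computed once and shared, with no further gate sharing.

```latex
\begin{align*}
\text{XORs for } s_a,\ s_b,\ s_c &: 2 + 1 + 2 = 5, \\
\text{XORs for } (s_a + s_4),\ (s_b + s_3 + s_7),\ (s_a + s_7),\ (s_b + s_6),\ (s_5 + s_c) &: 1 + 2 + 1 + 1 + 1 = 6, \\
\text{XORs for adding the eight products, } s_6 \text{ and } s_7 &: 10 - 1 = 9, \\
\text{total XOR gates} &: 5 + 6 + 9 = 20, \\
\text{total AND gates (one per product } s'_i(\cdot)) &: 8.
\end{align*}
```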
Using Corollary 6.3 and Theorem 6.5, we derive 16 low-complexity signatures for the
AddRoundKey and InvMixColumns transformations, i.e., 4 signatures for each column of
the state matrix. This is performed by modulo-2 addition of two sets of four coordinates
of (6.42) for each column, i.e.,
$E_c = (e_{c,7}, e_{c,6}, \ldots, e_{c,0}) \in GF(2^8)$, $0 \le c \le 3$. For the
AES decryption, let $\hat{E}_c = (e_{c,3}, e_{c,2}, e_{c,1}, e_{c,0})$ and
$\tilde{E}_c = (e_{c,7}, e_{c,6}, e_{c,5}, e_{c,4})$. Then, the
four error indication flags for column $c$ of the state are
\[
\bar{E}_c = \hat{E}_c + \tilde{E}_c, \quad 0 \le c \le 3. \tag{6.44}
\]
One can utilize four sets of modulo-2 additions of the output bits of each inverse S-box
pre-computed in Corollary 6.3, i.e., $s_0 + s_4$, $s_1 + s_5$, $s_2 + s_6$ and
$s_3 + s_7$, to obtain the low-complexity error indication flags in (6.44). This is shown in
Fig. 6.5. As seen in this figure, similar to the AES encryption, the Common Subexpressions
(CS) unit has been utilized to obtain 64 common subexpressions. Then, these outputs are used
in obtaining the two sets of 16 error indication flags for the AES decryption. It is noted
that in Fig. 6.5, the hardware implementation of (6.43) is used in (6.34), which is less
complex when the common subexpressions are used.

Figure 6.5: The proposed low-complexity fault detection scheme for the ith round of the
AES decryption utilizing subexpression sharing.
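The folding in (6.44) maps the 8-bit flag $E_c$ of a column to 4 bits by adding its low and high halves. The Python sketch below enumerates which nonzero 8-bit patterns of $E_c$ fold to zero; the function name `fold_dec` is illustrative. The count concerns a single column flag in isolation and is not the simulated error coverage reported in this chapter.

```python
def fold_dec(byte):
    """4-bit signature of (6.44): bit i is e_i + e_(i+4) for i = 0..3."""
    return (byte & 0x0F) ^ (byte >> 4)

# Every single-bit pattern of E_c survives the folding.
assert all(fold_dec(1 << i) != 0 for i in range(8))

# Nonzero patterns that fold to zero have identical low and high halves.
masked = [e for e in range(1, 256) if fold_dec(e) == 0]
assert all((e & 0x0F) == (e >> 4) for e in masked)
assert len(masked) == 15
```

Under this view, 15 of the 255 nonzero patterns of one 8-bit column flag are masked by the folding, which is the price paid for halving the number of flags per column.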
The proposed fault detection scheme for InvMixColumns requires 48 XOR gates with
two XOR gates in the critical path. Compared to the scheme presented in [36] for the
InvMixColumns transformation, the proposed scheme has less area and a shorter critical
path delay. It is noted that the authors in [36] have not presented the equations for the
parity-based fault detection scheme of InvMixColumns, mentioning only that they have the
same structure as those of MixColumns but are more complicated. Therefore, at
least a 25% area overhead reduction and a 33% reduction in the critical path delay are
expected for the proposed scheme.
6.4 Error Simulations
We have considered both single and multiple stuck-at errors for the proposed scheme.
These models cover both natural faults and fault attacks [82]. If exactly one bit error
appears at the output of the AES encryption or decryption rounds, the presented
fault detection scheme is able to detect it and the error coverage of the proposed
scheme is about 100%. This is because, in this case, one of the error indication
flags of the four columns in (6.33) or (6.44) signals the error. However, because of
technological constraints, it may not be feasible for an attacker to flip exactly one bit
to gain more information [82]. Thus, multiple bits will actually be flipped, and hence
multiple stuck-at errors are also considered in this chapter.
For the multiple stuck-at error models, we rely on simulations for both burst and
random errors. In the case of fault attacks, it is more likely that a transient burst
error appears instead of one-bit flips due to the present constraints [82]. Moreover, most
internal faults are modeled by transient random errors [82]. It is noteworthy that the
results of our simulations are valid for transient errors. Furthermore, the same simulation
results are obtained when permanent internal faults occur.
We use the stuck-at error model at the outputs of the AES transformations. This type
of error forces multiple nodes to be stuck at logic one (for stuck-at one) or zero (for
stuck-at zero) independent of the error-free values. We use the Fibonacci
implementation of Linear Feedback Shift Registers (LFSRs) with 128 output taps
for injecting random multiple errors, where the numbers, locations and types of the
errors are randomly chosen. In this regard, the maximum sequence length polynomial for
the feedback is selected as $L(X) = X^{128} + X^{29} + X^{27} + X^{2} + 1$ according to the
maximum sequence length taps presented in [91].
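A software model of such an error injector can be sketched in Python as follows. The mapping of the polynomial exponents to the feedback stages (stages 128, 29, 27 and 2), the shift direction, and the way the LFSR state is used as an error mask are assumptions made for illustration; the simulations in this chapter were performed on the hardware description, not with this code.

```python
MASK128 = (1 << 128) - 1
TAPS = (128, 29, 27, 2)  # assumed stage numbers for X^128 + X^29 + X^27 + X^2 + 1

def lfsr_step(state):
    """One Fibonacci LFSR shift: XOR the tapped stages and feed the result into stage 1."""
    fb = 0
    for t in TAPS:
        fb ^= (state >> (t - 1)) & 1
    return ((state << 1) | fb) & MASK128

def inject_stuck_at(value, error_mask, stuck_at_one):
    """Force the bits selected by error_mask to logic one or logic zero."""
    if stuck_at_one:
        return (value | error_mask) & MASK128
    return value & ~error_mask & MASK128

# Draw a pseudo-random 128-bit error mask and apply it to a transformation output.
state = 0x0123456789ABCDEF0123456789ABCDEF  # any nonzero seed (hypothetical value)
for _ in range(128):
    state = lfsr_step(state)
output = 0x00112233445566778899AABBCCDDEEFF  # stand-in for a 128-bit round output
faulty = inject_stuck_at(output, state, stuck_at_one=True)
assert faulty & state == state  # every selected node now reads logic one
```

Because stage 128 is part of the feedback, each step is invertible, so a nonzero seed never collapses to the all-zero state.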
We use the fault detection schemes presented in the previous section and shown in Fig.
6.3 and Fig. 6.5 for the AES encryption and decryption, respectively. In our simulations
using the Xilinx® ISE™ version 9.1 Simulator [80], we use the error indication flags at the
outputs of ShiftRows (covering the errors for SubBytes and ShiftRows) and AddRoundKey
(covering the errors for MixColumns and AddRoundKey) for the AES encryption in Fig.
6.3. Moreover, for the AES decryption in Fig. 6.5, we obtain the error indication flags
at the outputs of InvSubBytes (covering the errors for InvShiftRows and InvSubBytes) and
InvMixColumns (covering the errors for AddRoundKey and InvMixColumns). The results of
our simulations show that by having these two sets of error indication flags, an acceptable
error coverage is achieved.
In our simulations, we inject errors in two manners, i.e., burst and random errors,
and obtain the error coverage for these two cases, the details of which are as follows.
Burst Errors
The first type of errors that we consider is burst errors. For this type of errors, we
assume that stuck-at errors occur at the output of only one transformation at a time,
i.e., the errors are injected at the 128-bit output of only one transformation in the AES
encryption/decryption in Fig. 6.3 and Fig. 6.5. This includes both stuck-at zero and
stuck-at one errors. Then, using the two series of 16-bit signatures shown in these figures,
the error coverage is obtained. The results of our simulations for the burst errors in the
AES encryption and decryption are shown in Fig. 6.6. In this figure, the solid and dashed
lines represent the error coverage for the AES encryption and decryption, respectively.
As seen in this figure, we have injected up to 700,000 burst errors at the transformation
outputs, one at a time, and have monitored the errors that are covered by the error
indication flags. It is noted that because the errors are injected only at the output of
one transformation, only one of the two series of the error indication flags can detect
them. As seen in this figure, after injecting up to 700,000 burst errors, for both the AES
encryption and decryption, the error coverage for the two sets of error indication flags is
greater than 99.996%.
Random Errors
The second type of errors is random errors, where errors are injected at random locations,
i.e., at the four 128-bit outputs of the transformations. Our simulations show that after
injecting up to 700,000 random errors, an error coverage very close to 100% is
obtained, i.e., the injected errors are covered by at least one of the two series of the
error indication flags. We also expect an error coverage close to 100% if the number
of injected errors is increased. The high error coverage of the proposed scheme for the AES
rounds is suitable for security-constrained applications on FPGAs. These include any AES
algorithms implemented on FPGAs as well as the bitstream security mechanisms.

Figure 6.6: Simulation results for the error coverages of the proposed fault detection
schemes.
6.5 AES FPGA Implementations and Comparisons
The proposed schemes in this chapter are structure-independent and can be applied to
the AES using both the LUT-based and the composite field S-boxes and inverse S-boxes.
In this section, we have implemented both of these structures so that we are able to
compare the results for the presented schemes with those using LUTs and composite
fields. In what follows, we consider the implementation of both the AES encryption and
decryption.
For the FPGA implementations, we have used VHDL as the design entry for ISE™
version 9.1. Furthermore, the synthesis is performed using the Xilinx® Synthesis Tool
(XST™) on the Virtex™-4 and Virtex™-5 families [80]. It is noted that the results of
the implementations in this section, i.e., the number of occupied slices and the minimum
periods (maximum working frequencies), are all post place and route results.
We have implemented the original AES using LUT-based S-boxes and inverse S-boxes
on Virtex™-4 (xc4vlx160-12) and Virtex™-5 (xc5vlx110-3) devices. These larger devices
are chosen to provide the number of slices needed for the fault detection scheme in [35]
and [36]. We have used pipelined distributed memories for the LUT-based S-boxes and
inverse S-boxes in the AES to increase the design speed and the overall frequency.
XST™ uses the LUT resources in the FPGAs to implement the distributed
memories. Furthermore, pipelining is achieved by describing the necessary registers in
the design-entry language. The schemes in [34], [35], [36], [39], Hardware Redundancy
and the proposed ones in this chapter have been implemented and the results are presented
in Table 6.1. As seen in this table, the Error Coverage (EC %), the number of occupied
slices, the maximum working frequency (MHz), the throughput (Gbps) and the efficiency
(Mbps/slice) for the original schemes and the Fault Detection (FD) ones are derived.
Moreover, the slice overheads (overheads for the number of occupied slices) are presented.
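The derived metrics of Table 6.1 can be recomputed from the slice counts and frequencies with the short Python sketch below. It assumes that one 128-bit block is processed per clock cycle, an assumption that reproduces the tabulated throughputs; the function names are illustrative. The sample values are the Virtex-4 encryption entries of Table 6.1 for the original AES (18335 slices, 240.5 MHz) and the proposed scheme (20127 slices).

```python
def throughput_gbps(freq_mhz, block_bits=128):
    """Throughput in Gbps, assuming one block of block_bits bits per clock cycle."""
    return freq_mhz * block_bits / 1000.0

def efficiency_mbps_per_slice(thro_gbps, slices):
    """Efficiency in Mbps per occupied slice."""
    return thro_gbps * 1000.0 / slices

def slice_overhead_percent(fd_slices, original_slices):
    """Slice overhead of a fault detection scheme relative to the original design."""
    return 100.0 * (fd_slices - original_slices) / original_slices

thro = throughput_gbps(240.5)
print(round(thro, 1))                                    # 30.8 Gbps
print(round(efficiency_mbps_per_slice(thro, 18335), 1))  # 1.7 Mbps/slice
print(round(slice_overhead_percent(20127, 18335), 1))    # 9.8 %
```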
It is noted that there is a difference in the implementations of the LUT-based S-boxes
and inverse S-boxes using distributed memories for the selected FPGAs. Specifically, for
Virtex™-5 and Virtex™-4, 256 and 64 bits per CLB are specified for the distributed
memories, respectively. This causes the LUT implementations for Virtex™-5 to be more
compact than those on Virtex™-4 [80]. This can be observed in Table 6.1. In
this regard, the number of slices for the original AES encryption and decryption using
LUTs, and the slice overhead for the scheme in [35] and [36], whose area overhead is
dominated by the expansion of the S-box to 512 × 9 memories, are less on Virtex™-5.
This makes Virtex™-5 a suitable family for the AES using memory-based S-boxes and
inverse S-boxes and their fault detection schemes. Because of the higher number of slices
for the original AES encryption and decryption on Virtex™-4, the slice overheads of the
proposed schemes and the scheme in [39] are less than those for Virtex™-5.
Table 6.1: Comparisons of the implementations of the fault detection schemes of the AES
using LUT S-boxes and inverse S-boxes on Xilinx® FPGAs. Enc. = encryption, Dec. = decryption.

FPGA family (device) | Scheme | EC (%) | Enc. slices (overhead) | Enc. freq. (MHz) | Enc. thro. (Gbps) | Enc. eff. (Mbps/slice) | Dec. slices (overhead) | Dec. freq. (MHz) | Dec. thro. (Gbps) | Dec. eff. (Mbps/slice)
Virtex™-4 (xc4vlx160-12) | Original | - | 18335 (-) | 240.5 | 30.8 | 1.7 | 19322 (-) | 203.5 | 26.0 | 1.3
Virtex™-4 (xc4vlx160-12) | Algorithm-level [34] | 100% | 38273 (108.7%) | 194.9 | 24.9^a | 0.6 | 38273 (98.1%) | 194.9 | 24.9^a | 0.6
Virtex™-4 (xc4vlx160-12) | Hardware Redundancy | 100% | 28905 (57.6%)^b | 240.5 | 30.8 | 1.1 | 35421 (83.3%) | 203.5 | 26.0 | 0.7
Virtex™-4 (xc4vlx160-12) | FD in [35]^c for encryption, FD in [36]^c for decryption | 99.997% | 39104 (113.3%) | 163.5 | 20.9 | 0.5 | 40244 (108.3%) | 145.4 | 18.6 | 0.5
Virtex™-4 (xc4vlx160-12) | FD in [39]^d from the general scheme in [37] | 98.7% sin., 48-53% mult. | 21211 (15.7%) | 240.5 | 30.8 | 1.4 | 22280 (15.3%) | 203.5 | 26.0 | 1.1
Virtex™-4 (xc4vlx160-12) | Proposed (this chapter) | 99.996% | 20127 (9.8%) | 240.5 | 30.8 | 1.5 | 20909 (8.2%) | 203.5 | 26.0 | 1.2
Virtex™-5 (xc5vlx110-3) | Original | - | 2960 (-) | 371.7 | 47.6 | 16.1 | 3906 (-) | 296.3 | 37.9 | 9.7
Virtex™-5 (xc5vlx110-3) | Algorithm-level [34] | 100% | 5849 (97.6%) | 284.4 | 36.4^a | 6.2 | 5849 (49.7%) | 284.4 | 36.4^a | 6.2
Virtex™-5 (xc5vlx110-3) | Hardware Redundancy | 100% | 4637 (56.7%)^b | 371.7 | 47.6 | 10.2 | 7200 (84.3%) | 296.3 | 37.9 | 5.5
Virtex™-5 (xc5vlx110-3) | FD in [35]^c for encryption, FD in [36]^c for decryption | 99.997% | 5590 (88.9%) | 282.8 | 36.2 | 6.5 | 6688 (71.2%) | 260.2 | 33.3 | 4.9
Virtex™-5 (xc5vlx110-3) | FD in [39]^d from the general scheme in [37] | 98.7% sin., 48-53% mult. | 3619 (22.3%) | 304.0 | 38.9 | 10.7 | 4426 (13.3%) | 277.0 | 35.5 | 8.0
Virtex™-5 (xc5vlx110-3) | Proposed (this chapter) | 99.996% | 3757 (26.9%) | 371.7 | 47.6 | 12.7 | 4286 (9.7%) | 296.3 | 37.9 | 8.8

a The latency is twice as much as the original AES encryption or decryption.
b Although the overhead of greater than 100% is expected, the overhead for the number of occupied slices is less.
c Using two (256 × 9) memories for the fault detection of each S-box or inverse S-box.
d Using (256 × 9) memories for the fault detection of each S-box or inverse S-box.
As seen in Table 6.1, the number of slices for the original decryption is more than
that of the encryption. This is mainly because of the InvMixColumns transformation,
which is more complex than MixColumns in the AES encryption. Furthermore, the slice
overhead for the scheme in [36], in which the LUT sizes are expanded to 512 × 9, is
less on the Virtex™-5 family than on Virtex™-4. As seen in Table 6.1 in bold face,
the proposed structure-independent scheme for the AES decryption is the most efficient
and the most compact one among the fault detection schemes. Moreover, on Virtex™-4, the
proposed scheme for the AES encryption has the least slice overhead. However, the slice
overhead of the proposed encryption scheme implemented on Virtex™-5 (26.9%) is slightly
more than that of the scheme in [39] (22.3%). It is noted that the low overhead of the
scheme in [39] is because it uses one-bit signatures for the 128-bit block of data, whereas
the proposed schemes and the one in [35] and [36] use 16 bits for each 128-bit block. As
seen in Table 6.1, this leads to much higher error coverage.
The scheme in [38] is based on using the output of the multiplicative inversion (not
that of the S-box) to obtain a signature for fault detection. This scheme cannot be
applied to S-boxes using LUTs, where the output of the multiplicative inversion is
not accessible. Therefore, we have implemented the original AES encryption with
S-boxes using polynomial basis and composite fields in order to have access to the
output of the multiplicative inversion. For this reason, we utilize the AES presented in
[22]. This implementation of the AES is a hardware optimization of the scheme in [20],
which is extensively used in the literature, see, for example, [13], [15]. Then, we have
implemented the scheme of [38] and compared it with the proposed scheme presented
in this chapter. Moreover, the scheme in [34] and Hardware Redundancy have been
implemented.
The results of the implementations are shown in Table 6.2. It is worth noting that
the fault detection scheme for the AES decryption is not presented in [38]. Therefore, no
comparison for the AES decryption with this scheme is presented in this table. It is noted
that we have not used sub-pipelining for the implementations and registers are only used
at the output of each round. Using sub-pipelining for the S-boxes using composite fields,
one can reach higher working frequencies compared to those for LUT-based S-boxes. As
seen in this table, the number of slices for the original AES encryption using S-boxes
in composite fields is less than that of the LUT-based one for Virtex™-4 (compare Tables 6.1
and 6.2 for Virtex™-4). However, the original AES using LUT-based S-boxes is more
compact when Virtex™-5 is used. As mentioned before, this is due to the low number
of slices needed for the implementation of the memories in this device family. As seen
in Table 6.2, the proposed scheme is more compact and more efficient than
the scheme in [38], i.e., its efficiency degradations (percent degradation
from the efficiency of the original operations) and slice overheads are the lowest for
both devices. It is noted that the proposed scheme in this chapter uses 16 error indication
flags for the 128-bit output states of the transformations. However, the scheme in [38]
utilizes 32 error indication flags for each output state. Therefore, more slice overhead
and greater error coverage are expected for that scheme. However, as discussed earlier,
this scheme cannot be applied to the AES using LUTs.
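The two comparison metrics used in this discussion can be recomputed from the tabulated values. The following is a minimal sketch, not the tooling used for the thesis; the function names are hypothetical, and the inputs are the Virtex™-4 polynomial basis rows of Table 6.2 (slices and rounded efficiencies in Mbps/slice).

```python
# Illustrative recomputation of two Table 6.2 metrics (a sketch only; the
# function names are hypothetical and the inputs are the rounded table values).

def slice_overhead(org_slices: int, fds_slices: int) -> float:
    """Percent increase in occupied slices caused by the fault detection scheme."""
    return 100.0 * (fds_slices - org_slices) / org_slices

def efficiency_degradation(org_eff: float, fds_eff: float) -> float:
    """Percent drop in efficiency (Mbps/slice) relative to the original design."""
    return 100.0 * (org_eff - fds_eff) / org_eff

# (original slices, slices with fault detection, original eff., eff. with fault detection)
rows = {
    "scheme in [38]": (7498, 10340, 1.9, 1.2),
    "proposed (Fig. 6.3)": (7498, 9252, 1.9, 1.4),
}

for name, (org, fds, eff_org, eff_fds) in rows.items():
    print(f"{name}: overhead {slice_overhead(org, fds):.1f}%, "
          f"degradation {efficiency_degradation(eff_org, eff_fds):.1f}%")
```

Both metrics are relative to the original (unprotected) implementation on the same device, which is why rows sharing an original design can be compared directly.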
Furthermore, we have compared the proposed schemes in this chapter with the
lightweight concurrent fault detection scheme for the AES S-boxes presented in [78]. This
scheme is based on using normal basis for logic gate implementations of the S-boxes in the
AES encryption. In this fault detection scheme, the structure of the S-box using normal
basis is divided into 5 blocks. Then, the predicted parities of these blocks are
obtained. Moreover, through an exhaustive search among all available composite fields,
the optimum solution for the least-overhead S-box and its parity predictions is achieved.
We have implemented the AES encryption with the original S-boxes using normal basis
in composite fields proposed in [23] and verified with the FPGA implementations in
[78]. Then, the fault detection scheme for the S-boxes in [78] has been utilized for
the SubBytes transformation while the proposed scheme in this chapter is used for the
other transformations. In other words, we derive 5 error indication flags for each
S-box in SubBytes (5 × 16 = 80 flags for the entire SubBytes transformation), while the
scheme in Fig. 6.3 is used for the other AES encryption transformations using 16-bit flags.
Moreover, the proposed signature-based structure-independent scheme in this chapter,
i.e., the scheme in Fig. 6.3, has been implemented for the AES encryption with the
S-boxes using normal basis. The results of these implementations are also presented and
compared in Table 6.2. As seen in this table, the FPGA implementations of the original
AES encryption with the S-boxes using normal basis representation in composite fields
have less area compared to the traditional ones using polynomial basis, i.e., 6752 and
3692 slices compared to 7498 and 3718 slices for the two devices, respectively.
Table 6.2: Implementation comparisons of the fault detection schemes of the AES encryption using composite field S-boxes on Xilinx® FPGAs.

FPGA        S-boxes        FDS                      Slice                      Thro. (Gbps)      Eff. (Mbps/slice)
            structures^a                            Org.    FDS     Over.      Org.    FDS       Org.   FDS    deg.
Virtex™-4   PB^b           Algorithm-level [34]     7498    17075   127.7%     14.6    14.6^c    1.9    0.9    52.6%
            PB^b           Hardware Redundancy      7498    14968   99.6%      14.6    14.6      1.9    1.0    47.4%
            PB^b           [38]                     7498    10340   37.9%      14.6    12.3      1.9    1.2    36.8%
            PB^b           Proposed (Fig. 6.3)      7498    9252    23.4%      14.6    12.6      1.9    1.4    26.3%
            NB^d           Proposed^e               6752    9325    38%        17.1    12.9      2.5    1.4    44%
            NB^d           Proposed (Fig. 6.3)      6752    8216    21.7%      17.1    12.3      2.5    1.8    28.0%
Virtex™-5   PB^b           Algorithm-level [34]     3718    7492    101.5%     18.2    15.7      4.9    2.1    57.1%
            PB^b           Hardware Redundancy      3718    7162    92.6%      18.2    16.7      4.9    2.3    53.1%
            PB^b           [38]                     3718    4750    27.8%      18.2    14.2      4.9    3.0    38.8%
            PB^b           Proposed (Fig. 6.3)      3718    4354    17.1%      18.2    14.6      4.9    3.4    30.6%
            NB^d           Proposed^e               3692    4683    26.8%      17.9    14.8      4.8    3.1    35.4%
            NB^d           Proposed (Fig. 6.3)      3692    4286    16.1%      17.9    14.5      4.8    3.4    29.2%

^a The original AES encryption implementations differ only in the S-boxes structures.
^b The S-boxes using polynomial basis presented in [22].
^c Although the throughput is the same as the original AES, the latency is twice as much as the original one.
^d The original S-boxes using normal basis in composite fields proposed in [23] and verified with the FPGA implementations in [78].
^e Fault detection scheme for the S-boxes from [78] and the proposed scheme in this chapter for the other transformations.
In addition, the proposed structure-independent scheme in this chapter has the lowest
area overhead and the highest efficiency for both FPGA families. At this
point, we would like to mention that for the scheme in [78], higher error coverage and
slightly higher throughput are achieved compared to the proposed scheme in this chapter.
However, this comes at the cost of higher area overhead. It is also noted that
the fault detection scheme in [78] not only can be applied solely to the composite field
S-boxes but is also dependent on the composite fields and normal basis chosen, i.e.,
the parity predictions would be different if other composite fields are used. In contrast, the
proposed scheme in this chapter is independent of the structures of the S-boxes used in
the AES encryption.
Recently, a fault tolerant approach which is resistant to fault attacks is proposed
in [50]. This approach is based on protecting the logic blocks and memories of the
AES. To protect the combinational logic blocks used in the four rounds of the AES,
either the parity-based scheme proposed in [36] or the duplication one presented in [96]
is implemented. Furthermore, to protect the memories used for storing the expanded
key and the state matrix either the Hamming or Reed-Solomon error correcting code is
implemented. The results of the comparison of the proposed scheme in this chapter with
the parity-based scheme of [35] and [36] for protecting the combinational logic elements of
the AES are depicted in Table 6.1. Moreover, for certain AES implementations containing
storage elements, one can use the error correcting code-based approach presented in
[50] in addition to the proposed scheme in this chapter to make a more reliable AES
implementation.
To conclude, in this chapter, we have studied a number of fault detection schemes
for the encryption and the decryption of the AES. New fault detection schemes which
are independent of the structures of the S-boxes and the inverse S-boxes have been
proposed. Our simulations show that for the AES encryption and decryption, these
structure-independent schemes reach high error coverage.
Furthermore, our proposed fault detection schemes and almost all of the previously
reported ones have been implemented on the recent Xilinx® Virtex™ FPGAs. Their
area and delay overheads for the AES encryption and decryption have been derived and
compared. In our implementations, we have considered both the look-up table-based
and the composite field AES structures. Our FPGA implementations show that
for the AES encryption, the slice overhead of the proposed scheme is around 9.8% to
26.9%, depending on the FPGA family and the AES implementation. In addition, for
the AES decryption, lower slice overhead is achieved. These slice overheads are less than
those for the other schemes which have the same error coverages.
According to our simulation and implementation results, with acceptable error
coverages, the structure-independent schemes proposed in this chapter have the highest
efficiencies, showing reasonable area and time complexity overheads. Based on the AES
structure chosen, the performance goals to achieve, and the resources available, one can
use combinations of the presented schemes in order to obtain more reliable AES
encryption and decryption structures.
Chapter 7
Efficient and High-Performance
Parallel Hardware Architectures for
the AES-GCM
In the previous chapters, we have proposed different high-performance fault diagnosis
approaches for the AES. These approaches help make the AES hardware architectures
reliable. In this chapter, we present high-speed, parallel hardware architectures for
reaching low-latency and high-throughput structures of the GCM. By investigating the
high-performance $GF(2^{128})$ multiplier architectures, we benchmark the proposed
AES-GCM architectures using quadratic and sub-quadratic hardware complexity $GF(2^{128})$
multipliers. It is shown that the presented AES-GCM architectures
outperform the previously reported ones in the utilized 65-nm CMOS technology.
In this chapter, using a complexity reduction technique, the hardware complexities
of different architectures for the subkey exponentiations in the GCM are reduced. Then,
by utilizing these low-complexity exponentiations, we propose efficient architectures for
the GCM, yielding high throughput and low latency. The proposed hardware
architectures for the AES-GCM are synthesized considering two types of $GF(2^{128})$ multipliers.
We investigate the performance of quadratic and six different sub-quadratic
complexity $GF(2^{128})$ multipliers. It is shown that the proposed architectures for the AES-GCM
have higher throughput and efficiency and reach lower latency compared to the previously
reported ones.
The organization of this chapter is as follows. In Section 7.1, the proposed
high-performance architectures for implementing the GCM are presented. Section 7.2 presents
the ASIC syntheses and comparisons of the proposed architectures and the previously
reported ones. The results presented in this chapter can also be found in [72].
7.1 High-Performance GCM Parallel Architecture
In this section, we propose high-performance parallel architectures for the GCM. These
architectures improve the throughput and the latency of the structures presented in [68]
and [69] for $\mathrm{GHASH}_H$. They also remove the need for consecutive $GF(2^{128})$
multiplications with $H$ for deriving (1.1). We also derive the hardware implementations of the
exponentiations of the hash subkey to the powers of 2, i.e., in the form of $H^{2^j}$, needing only
XOR gates. Because of the low complexity of the implementations of these exponents, we
take advantage of these low-cost hash subkey powers in the proposed high-performance
architectures. We utilize the powers in the form of $H^{2^j}$ to obtain the other powers of
the hash subkey with the least number of multiplications over $GF(2^{128})$ for the proposed
architectures. For instance, we derive $H^3 = H^2 \cdot H$ or $H^6 = H^4 \cdot H^2$.
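The way other subkey powers are assembled from the squared powers can be illustrated in software. The following is a minimal sketch under stated assumptions: field elements are Python integers whose bit $i$ is the coefficient of $x^i$ (plain polynomial order, not the reflected bit order of the GCM specification), the function names are hypothetical, and the subkey value is an arbitrary example.

```python
# Minimal GF(2^128) arithmetic sketch. Assumption: bit i of an integer is the
# coefficient of x^i (plain polynomial order, not GCM's reflected bit order).
P_LOW = (1 << 7) | (1 << 2) | (1 << 1) | 1      # x^7 + x^2 + x + 1
MASK = (1 << 128) - 1

def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication modulo P(x) = x^128 + x^7 + x^2 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> 128:                 # degree reached 128: replace x^128 by P_LOW
            a = (a & MASK) ^ P_LOW
    return result

def gf_square(a: int) -> int:
    """Squaring written as a multiplication; in hardware it needs only XOR gates."""
    return gf_mul(a, a)

H = 0x66E94BD4EF8A2C3B884CFA59CA342B2E   # arbitrary example subkey value
H2 = gf_square(H)                        # cheap powers of the form H^(2^j)
H4 = gf_square(H2)
H3 = gf_mul(H2, H)                       # H^3 = H^2 * H, one multiplication
H6 = gf_mul(H4, H2)                      # H^6 = H^4 * H^2, one multiplication
print(hex(H3), hex(H6))
```

Each of $H^3$ and $H^6$ costs a single field multiplication once the squared powers are available, which is the saving the text refers to.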
7.1.1 High-Performance $\mathrm{GHASH}_H$ Function
Algorithm 3 is used for obtaining the key formulation for the proposed $\mathrm{GHASH}_H$
function. Although there is no restriction in choosing $q$, i.e., the number of parallel
adder-multipliers, we use $q = 2^j$, $1 \le j \le \lfloor \log_2(n) \rfloor$. This leads to a lower number of clock cycles
and a higher throughput for the implementations. In Algorithm 3, the output
$\mathrm{GHASH}(X, H)$ is obtained as follows:

\begin{align}
\mathrm{GHASH}(X, H) ={} & X_1 \underbrace{H^q \cdots H^q}_{\frac{n}{q}\ \text{times}}
 \oplus X_2 \underbrace{H^q \cdots H^q}_{\frac{n}{q}-1\ \text{times}} H^{q-1} \oplus \cdots
 \oplus X_j \underbrace{H^q \cdots H^q}_{\frac{n}{q}-1\ \text{times}} H^{q-j+1} \oplus \cdots \nonumber \\
 & \oplus X_q \underbrace{H^q \cdots H^q}_{\frac{n}{q}-1\ \text{times}} H
 \oplus X_{q+1} \underbrace{H^q \cdots H^q}_{\frac{n}{q}-1\ \text{times}}
 \oplus X_{q+2} \underbrace{H^q \cdots H^q}_{\frac{n}{q}-2\ \text{times}} H^{q-1} \oplus \cdots \oplus X_n H, \tag{7.1}
\end{align}

where all operations are performed over $GF(2^{128})$ constructed by the irreducible
polynomial $P(x) = x^{128} + x^7 + x^2 + x + 1$ and $\oplus$ comprises 128 XOR gates.
One can re-write (7.1) so that only the exponentiations of the hash subkey to the
powers of 2 in the form of $H^{2^j}$ are utilized. This method of exponentiation is based on
the binary exponentiation, see, for example, [97]. As seen from Algorithm 3, for the
exponentiations $H^{q-i+1}$, $1 \le i \le q$, one can use the binary representation of $q - i + 1$ as
$(a_0^{(i)}, \ldots, a_{\log_2(q)}^{(i)})_2$.

Algorithm 3 The proposed high-performance approach for implementing the GCM.
Inputs: $X_p \in GF(2^{128})$, $1 \le p \le n$, and $H^{2^j} \in GF(2^{128})$, $0 \le j \le \log_2(q)$.
Output: $\mathrm{GHASH}(X, H) = \sum_{j=1}^{n} X_j H^{n-j+1}$.
1: for $i = 1$ to $q$ do
2:   $temp_i \leftarrow X_i$
3:   for $j = 1$ to $\frac{n}{q} - 1$ do
4:     $temp_i = (temp_i \cdot H^q \oplus X_{i+jq})$
5:   end for
6:   Let $q - i + 1 = (a_0^{(i)}, \ldots, a_{\log_2(q)}^{(i)})_2$
7:   $temp_i = temp_i \cdot (H^{a_0^{(i)} q} \cdot H^{a_1^{(i)} \frac{q}{2}} \cdots H^{a_{\log_2(q)}^{(i)}})$
8: end for
9: $\mathrm{GHASH}(X, H) = \sum_{i=1}^{q} temp_i$
10: return $\mathrm{GHASH}(X, H)$.
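The steps of Algorithm 3 can be modelled in software and compared against the usual sequential (Horner) evaluation. The following is a minimal sketch, not the hardware design: it assumes plain polynomial bit order (bit $i$ is the coefficient of $x^i$), $q$ a power of two, and $n$ a multiple of $q$; all function names are hypothetical.

```python
# Software sketch of Algorithm 3. Assumptions: plain polynomial bit order
# (bit i <-> x^i), q a power of two, and n a multiple of q.
import random

P_LOW, MASK = 0x87, (1 << 128) - 1       # 0x87 encodes x^7 + x^2 + x + 1

def gf_mul(a, b):
    """Multiplication in GF(2^128) modulo x^128 + x^7 + x^2 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> 128:
            a = (a & MASK) ^ P_LOW
    return r

def ghash_sequential(X, H):
    """Reference: Horner evaluation of sum_j X_j H^(n-j+1)."""
    y = 0
    for x in X:
        y = gf_mul(y ^ x, H)
    return y

def ghash_parallel(X, H, q):
    """Algorithm 3 with q lanes; each lane would be one adder-multiplier."""
    n = len(X)
    assert n % q == 0 and q & (q - 1) == 0
    log2q = q.bit_length() - 1
    pow2 = [H]                                   # H^(2^j), j = 0..log2(q)
    while len(pow2) <= log2q:
        pow2.append(gf_mul(pow2[-1], pow2[-1]))
    Hq = pow2[-1]
    total = 0
    for i in range(1, q + 1):                    # lane i (1-based, as in the text)
        temp = X[i - 1]
        for j in range(1, n // q):               # n/q - 1 multiplications by H^q
            temp = gf_mul(temp, Hq) ^ X[i + j * q - 1]
        e = q - i + 1                            # remaining exponent of lane i
        for k, hp in enumerate(pow2):            # binary expansion of q - i + 1
            if (e >> k) & 1:
                temp = gf_mul(temp, hp)
        total ^= temp                            # final XOR tree
    return total

random.seed(0)
H = random.getrandbits(128)
X = [random.getrandbits(128) for _ in range(16)]     # n = 16 blocks
print(ghash_parallel(X, H, 8) == ghash_sequential(X, H))
```

In this model the lanes are evaluated one after another; in the hardware they run concurrently, so only the inner loop length and the number of power-of-two factors contribute to the cycle count.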
The hardware implementation of Algorithm 3 is presented in Fig. 7.1. For
implementing Algorithm 3 in hardware, in total, $\frac{n}{q} + \log_2(q)$ clock cycles are needed. For
the first $\frac{n}{q} - 1$ clock cycles, the $GF(2^{128})$ multiplications by $H^q$ are performed. This is
achieved by a simple control unit selecting $H^q$. Then, for the next $\log_2(q)$ clock cycles,
the other exponentiations are used. These include the powers of the hash subkey in the
form of $H^{2^j}$ and a number of field elements $1 = (0, 0, \ldots, 1) \in GF(2^{128})$ for bypassing the
$GF(2^{128})$ multiplication operations. We note that if $n$ is not a multiple of $q$, one needs
to add $q - \mathrm{mod}(n, q)$ blocks containing $0 = (0, 0, \ldots, 0) \in GF(2^{128})$ to the beginning of
the $n$ blocks to make the total number of processed blocks a multiple of $q$. After this padding, the hash
computation can be done normally based on the presented procedure. As seen in Fig.
7.1, $q$ adder-multipliers are required and multiplexers are also utilized to select different
exponentiations.
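The padding rule and the cycle count stated above can be written down directly. This is a minimal sketch with hypothetical helper names, in which blocks are represented as Python integers; prepending zero blocks leaves the hash unchanged because a zero block contributes nothing to the sum regardless of the power of $H$ it is multiplied by.

```python
# Sketch of the zero-padding rule and the cycle count. Blocks are Python ints;
# the helper names are illustrative only.
from math import log2

def pad_front(blocks, q):
    """Prepend zero blocks so that the block count becomes a multiple of q."""
    r = len(blocks) % q
    pad = (q - r) if r else 0
    return [0] * pad + list(blocks)

def clock_cycles(n, q):
    """n/q + log2(q) cycles for n blocks (n a multiple of q, q a power of two)."""
    assert n % q == 0 and q & (q - 1) == 0
    return n // q + int(log2(q))

blocks = pad_front([11, 12, 13, 14, 15], q=4)   # 5 blocks padded to 8
print(blocks, clock_cycles(len(blocks), 4))
```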
Figure 7.1: The hardware architecture of the proposed high-performance GCM $\mathrm{GHASH}_H$
function.

To illustrate the proposed scheme, we use the case with $n = 16$ and $q = 8$. In the first
clock cycle ($j = 1$), the outputs of all the multiplexers in Fig. 7.1 are $H^8$ for this case.
Then, according to the following, the outputs of the multiplexers in the other cycles can
be found.
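\begin{align}
& \underbrace{\overbrace{\underbrace{(\overbrace{X_1 H^8}^{j=1} \oplus X_9)\, H^8}_{j=2} \cdot 1}^{j=3} \cdot 1}_{j=4} \;\oplus \nonumber \\
& \underbrace{\overbrace{\underbrace{(\overbrace{X_2 H^8}^{j=1} \oplus X_{10})\, H^4}_{j=2} \cdot H^2}^{j=3} \cdot H}_{j=4} \;\oplus \nonumber \\
& \cdots \nonumber \\
& \underbrace{\overbrace{\underbrace{(\overbrace{X_i H^8}^{j=1} \oplus X_{i+8})\, H^{4a_1^{(i)}}}_{j=2} \cdot H^{2a_2^{(i)}}}^{j=3} \cdot H^{a_3^{(i)}}}_{j=4} \;\oplus \tag{7.2} \\
& \cdots \nonumber \\
& \underbrace{\overbrace{\underbrace{(\overbrace{X_8 H^8}^{j=1} \oplus X_{16})\, H}_{j=2} \cdot 1}^{j=3} \cdot 1}_{j=4},
\end{align}

where $(a_1, a_2, a_3)_2$ is the binary representation of $q - i + 1 = 9 - i$, $1 \le i \le 8$. Five cycles
are required to implement (7.2); 4 cycles are shown in (7.2) with $j = 1$ to $j = 4$, and the
last one is used for the addition of the results of the registers $R_1$-$R_8$ to have the final
result in $R_T$.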
Table 7.1: Performance analysis and comparison of $\mathrm{GHASH}_H$ within the GCM for $n$
blocks and $q$ parallel structures.

Approach                       Latency                      Throughput
Sequential [64], [65], [66]    $n$                          $\frac{128}{(T_{mul}+T_X)\,n}$
[68], [69]                     $\frac{n}{q} + q - 1$        $\frac{128}{(T_{mul}+T_X)(\frac{n}{q}+q-1)}$
Proposed                       $\frac{n}{q} + \log_2(q)$    $\frac{128}{(T_{mul}+T_X)(\frac{n}{q}+\log_2(q))}$
According to Fig. 7.1, the working frequency of the proposed scheme is determined by
the critical path delay $T_{mul} + T_X$ (we note that this delay is larger than that of the XOR tree). Here,
$T_{mul}$ is the time delay of the used multiplier and $T_X$ is the time delay of one set
of modulo-2 additions in the critical path. Furthermore, according to Algorithm 3, the
number of clock cycles needed for the $\mathrm{GHASH}_H$ function is $\frac{n}{q} + \log_2(q)$. The latency and
throughput of the proposed scheme are compared with the ones presented in [64], [65],
[66], [68], and [69] in Table 7.1. As seen in this table, the sequential approach has the least
throughput, which leads to low-performance hardware implementations. The throughput
of the proposed scheme, i.e., $\frac{128}{(T_{mul}+T_X)(\frac{n}{q}+\log_2(q))}$, is higher than that of the scheme in [68]
and [69], i.e., $\frac{128}{(T_{mul}+T_X)(\frac{n}{q}+q-1)}$, especially for high values of parallel structures, i.e., high
values of $q$. For example, for the case presented in (7.2), the proposed architectures of
this chapter need $\frac{n}{q} + \log_2(q) = 2 + 3 = 5$ clock cycles to obtain the result. This can
be compared with the linear relation of the scheme in [68] and [69] with $q$, leading to
$\frac{n}{q} + q - 1 = 2 + 8 - 1 = 9$ clock cycles needed. The complete comparison in terms of
hardware and timing complexities of the proposed architectures with the previous ones
is presented later in this chapter using ASIC syntheses.
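The latency and throughput expressions of Table 7.1 are easy to tabulate for different degrees of parallelism. The following is a minimal sketch with hypothetical function names; latencies are in clock cycles and the delays $T_{mul}$ and $T_X$ are passed in arbitrary but common time units.

```python
# Sketch of the Table 7.1 latency and throughput formulas (function names are
# illustrative; n is assumed to be a multiple of q and q a power of two).
from math import log2

def latency_sequential(n):               # approach of [64], [65], [66]
    return n

def latency_previous_parallel(n, q):     # approach of [68], [69]
    return n // q + q - 1

def latency_proposed(n, q):              # approach of this chapter
    return n // q + int(log2(q))

def throughput(cycles, t_mul, t_x):
    """128 / ((T_mul + T_X) * cycles), as tabulated in Table 7.1."""
    return 128 / ((t_mul + t_x) * cycles)

n = 16
for q in (2, 4, 8, 16):
    print(q, latency_previous_parallel(n, q), latency_proposed(n, q))
```

The gap between the two parallel approaches widens with $q$, since $q - 1$ grows linearly while $\log_2(q)$ grows logarithmically.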
As discussed earlier, with the change in the block cipher key, the re-calculations of
the hash subkey and then raising it to different powers are inevitable. In the following,
we present different methods for obtaining the hash subkey powers required in the
proposed architectures of this section, i.e., $H^{2^j}$. Moreover, through complexity reduction
techniques, low-complexity structures for the exponentiations are obtained in which the
timing complexities remain unchanged.
7.1.2 High-Speed Structures for Hash Subkey Powers
In the following, using squaring operations, we present three methods for implementing
the hash subkey exponentiations. Using a complexity reduction algorithm, we also derive
their hardware-optimized architectures.
According to [5], it is less likely that the GCM is invoked with the same key on
distinct sets of input data. Thus, a new hash subkey and its powers need to be obtained
in each invocation. It is known that the squaring operation in binary extension fields
leads to a linear structure, see, for example, [98]. In other words, implementing squaring
in hardware is less costly than $GF(2^{128})$ multiplications. The squaring of a field element
over $GF(2^{128})$ in the GCM uses the irreducible polynomial $P(x) = x^{128} + x^7 + x^2 + x + 1$.
Utilizing $P(x)$, we have obtained the formulations for the squaring after performing
modular reduction. It is noted that MATLAB® [76] has been utilized to verify the
formulations used for squaring. For the GCM, the critical path delay of squaring is
obtained as $3T_X$, where $T_X$ is the XOR gate delay. Moreover, it requires 202 XOR gates.
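The linearity of squaring can be made concrete: each output bit of $H^2$ is an XOR of input bits, so the whole operation is a fixed binary matrix applied to the coefficient vector. The following is a minimal sketch, assuming plain polynomial bit order (bit $i$ is the coefficient of $x^i$); the names are hypothetical, and the sketch does not attempt to reproduce the gate count or delay reported above.

```python
# Sketch: squaring in GF(2^128) as a linear (XOR-only) map. Assumption: bit i
# of an integer is the coefficient of x^i; P(x) = x^128 + x^7 + x^2 + x + 1.
import random

P_LOW, MASK = 0x87, (1 << 128) - 1

def reduce_poly(v: int) -> int:
    """Reduce a polynomial of degree >= 128 modulo P(x) by folding x^128."""
    while v >> 128:
        hi = v >> 128
        v = (v & MASK) ^ hi ^ (hi << 1) ^ (hi << 2) ^ (hi << 7)
    return v

# Column s of the squaring matrix is the reduced image of x^(2s).
COLUMNS = [reduce_poly(1 << (2 * s)) for s in range(128)]

def gf_square(h: int) -> int:
    """XOR together the matrix columns selected by the set bits of h."""
    out = 0
    for s in range(128):
        if (h >> s) & 1:
            out ^= COLUMNS[s]
    return out

random.seed(0)
a, b = random.getrandbits(128), random.getrandbits(128)
print(gf_square(a ^ b) == gf_square(a) ^ gf_square(b))   # linearity check
```

Because the matrix is fixed, a hardware squarer contains no AND gates, which is why its cost is quoted purely in XOR gates.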
To implement $H^{2^j}$, $2 \le j \le \lfloor \log_2(q) \rfloor$, one can cascade $j$ squaring architectures or
use a feedback structure for deriving them. We refrain from using the feedback structure because of its
low throughput and high latency. According to the hardware and timing complexities of
squaring derived in this section, for $H^{2^j}$, the cascade structure yields hardware and
timing complexities of $202j$ XOR gates and $3jT_X$, respectively. This leads to low-speed
implementations which are not desirable in applications requiring high performance. It
is possible to reduce the delay of the implementations of these exponentiations for the
high-performance hardware implementations. To achieve this, we do not cascade the
squaring implementations. Instead, we find the squaring exponentiations separately so
that their derivations are performed in parallel. This reduces the critical path delay of the
realizations. We present the following lemma for obtaining the exponentiations of the
hash subkey within the GCM.
Lemma 7.1 The squaring exponentiations of the hash subkey, i.e., $H^{2^j}$, $2 \le j \le \lfloor \log_2(q) \rfloor$,
are obtained using the following:

\begin{equation}
H^{2^j} \bmod P(x) = d + \sum_{i=1}^{2^j-1} \tilde{e}_i, \tag{7.3}
\end{equation}

where $d$ and $\tilde{e}_i$, $1 \le i \le 2^j - 1$, are field elements in $GF(2^{128})$ defined as follows:

\[
d = \sum_{s=0}^{\frac{128}{2^j}-1} h_s x^{2^j s}, \qquad
\tilde{e}_i = \Bigl( \sum_{s=\frac{128 i}{2^j}}^{\frac{128(i+1)}{2^j}-1} h_s x^{2^j s} \Bigr) \bmod P(x).
\]

Proof Let $H = \sum_{s=0}^{127} h_s x^s \in GF(2^{128})$ be the hash subkey of the GHASH function.
Then, we have

\begin{align*}
H^{2^j} &= \Bigl( \sum_{s=0}^{127} h_s x^{2^j s} \Bigr) \bmod P(x)
 = \sum_{s=0}^{\frac{128}{2^j}-1} h_s x^{2^j s} + \Bigl( \sum_{s=\frac{128}{2^j}}^{127} h_s x^{2^j s} \bmod P(x) \Bigr) \\
 &= d + \Bigl( \sum_{i=1}^{2^j-1} \sum_{s=\frac{128 i}{2^j}}^{\frac{128(i+1)}{2^j}-1} h_s x^{2^j s} \Bigr) \bmod P(x)
 = d + \sum_{i=1}^{2^j-1} \tilde{e}_i,
\end{align*}

and the proof is complete.

Figure 7.2: The derivation of $H^4$ of the GCM hash subkey.
For clarifying the method, we present the structure for deriving $H^4$ in Fig. 7.2.
We obtain the polynomials $d$ and $e_1$-$e_3$ in (7.3) as: $d = h_{31}x^{124} + h_{30}x^{120} + \cdots + h_0$,
$e_1 = h_{63}x^{252} + h_{62}x^{248} + \cdots + h_{32}x^{128}$, $e_2 = h_{95}x^{380} + h_{94}x^{376} + \cdots + h_{64}x^{256}$, and
$e_3 = h_{127}x^{508} + h_{126}x^{504} + \cdots + h_{96}x^{384}$. As seen in this figure, the coefficients of $d$ are added
with the reduced coefficients of $e_1$-$e_3$ using $P(x)$.
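The decomposition in (7.3) can be checked numerically against repeated squaring. The following is a minimal sketch, assuming plain polynomial bit order (bit $i$ is the coefficient of $x^i$); the function names are hypothetical, and the reduction routine is a generic software fold rather than the optimized XOR network derived in the thesis.

```python
# Sketch checking Lemma 7.1 against repeated squaring. Assumption: bit i of an
# integer is the coefficient of x^i; P(x) = x^128 + x^7 + x^2 + x + 1.
import random

P_LOW, MASK = 0x87, (1 << 128) - 1

def reduce_poly(v: int) -> int:
    """Reduce a polynomial of any degree modulo P(x) by folding x^128."""
    while v >> 128:
        hi = v >> 128
        v = (v & MASK) ^ hi ^ (hi << 1) ^ (hi << 2) ^ (hi << 7)
    return v

def spread(h: int, j: int, lo: int, hi: int) -> int:
    """Return the unreduced sum of h_s x^(2^j * s) for lo <= s < hi."""
    out = 0
    for s in range(lo, hi):
        if (h >> s) & 1:
            out |= 1 << ((1 << j) * s)
    return out

def power_2j_parallel(h: int, j: int) -> int:
    """d + sum of e~_i as in (7.3); every term can be derived independently."""
    seg = 128 >> j                               # 128 / 2^j coefficients per part
    acc = spread(h, j, 0, seg)                   # d: degree < 128, no reduction
    for i in range(1, 1 << j):
        acc ^= reduce_poly(spread(h, j, seg * i, seg * (i + 1)))   # e~_i
    return acc

def power_2j_cascade(h: int, j: int) -> int:
    """Reference: j consecutive squarings, as in the cascade structure."""
    for _ in range(j):
        h = reduce_poly(spread(h, 1, 0, 128))
    return h

random.seed(0)
H = random.getrandbits(128)
print(power_2j_parallel(H, 2) == power_2j_cascade(H, 2))   # H^4 both ways
```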
The complexity reduction techniques use different methods for decreasing the number
of gates needed in the implementations, see, for example, the ones in [99] and [100].
Because it is not guaranteed that the delay of the method in [99] is maintained, we have
implemented the complexity reduction algorithm presented in [100] using a C code. In
our program, the procedure suggested in [100] (to find the shared XOR terms) has been
utilized for the case study of $q = 8$, which requires implementing $H^2$, $H^4$ and $H^8$. It is
noted that through the employed technique, we reach low hardware complexities without
changing the critical path delays.

Figure 7.3: (a) Cascade, (b) parallel, and (c) hybrid realization methods for the hash
subkey exponentiations.
We have performed three experiments for implementing $H^2$, $H^4$ and $H^8$. These are
shown in Fig. 7.3. As seen in Fig. 7.3a, in the cascade method, three identical squaring
architectures are used consecutively. This method has the lowest hardware complexity
and the highest timing complexity. In Fig. 7.3b, the parallel method of implementation
of the hash subkey exponentiations is utilized. Compared to the other methods, this
method has the lowest critical path delay while its hardware complexity is the highest.
On the other hand, in the hybrid method which is shown in Fig. 7.3c, a compromise
between hardware and timing complexities is achieved.
The timing and hardware complexities of these methods and the results of the
complexity reduction technique utilized for them are presented in Table 7.2. In this table,
for the three methods presented in Fig. 7.3, the hardware complexities before and after
complexity reduction are derived. The timing complexity remains unchanged after
applying the complexity reductions. As seen in Table 7.2 in bold face, the least hardware
complexity is achieved for the cascade method after the complexity reduction, i.e., 594
XOR gates. However, the timing complexity of this method is the highest among the
three methods as shown in this table. On the other hand, the timing complexity of
the parallel method is the lowest, i.e., $5T_X$. As shown in Table 7.2, this is at the expense
of higher hardware complexity, which is 1099 XOR gates after about 45% complexity
reduction.

Table 7.2: Complexities of the realizations of the hash subkey exponentiations for $q = 8$
parallel architectures for $\mathrm{GHASH}_H$.

Method                 Hardware      Hardware complexity after   Complexity      Timing
                       complexity    complexity reduction        reduction (%)   complexity
Cascade (Fig. 7.3a)    606 XORs      594 XORs                    ≈ 2%            $9T_X$
Parallel (Fig. 7.3b)   1986 XORs     1099 XORs                   ≈ 45%           $5T_X$
Hybrid (Fig. 7.3c)     1627 XORs     1062 XORs                   ≈ 35%           $6T_X$
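The percentages in Table 7.2 follow directly from the XOR counts before and after complexity reduction. The following is a minimal sketch that recomputes them from the tabulated values only; the dictionary layout and the function name are illustrative.

```python
# Sketch recomputing the Table 7.2 reduction percentages from its XOR counts.
SQUARING_XORS, SQUARING_DELAY = 202, 3      # one squarer: 202 XORs, delay 3 T_X

methods = {                                 # (XORs before, XORs after, delay in T_X)
    "cascade": (606, 594, 9),
    "parallel": (1986, 1099, 5),
    "hybrid": (1627, 1062, 6),
}

def reduction_percent(before: int, after: int) -> float:
    """Percent of XOR gates removed by the complexity reduction."""
    return 100.0 * (before - after) / before

for name, (before, after, delay) in methods.items():
    print(f"{name}: {reduction_percent(before, after):.1f}% fewer XORs, {delay} T_X")
```

The cascade row also matches the $202j$ XOR gates and $3jT_X$ expressions for $j = 3$ before reduction, which leaves little shared logic for the reduction step to remove.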
7.1.3 $GF(2^{128})$ Multipliers for the GCM
Different types of $GF(2^{128})$ multipliers are utilized in the literature for implementing the
$GF(2^{128})$ multiplications in the GCM. In [64], [68], and [69], the multiplications have
been performed using bit-parallel, digit-serial, and hybrid multipliers in composite fields.
Furthermore, in [65] and [101], the efficiencies of different multipliers, including the
sub-quadratic ones, are compared. Moreover, in [102] a high-speed AES-GCM core has been
presented. It is noted that the considered $GF(2^{128})$ multipliers in these works include
the Mastrovito multiplier [103] with quadratic space complexity, the Karatsuba-Ofman
multiplier [104] and the $GF(2^{128})$ multiplier in [105].
We have considered the bit-parallel $GF(2^{128})$ multiplier presented in [94], which has
quadratic hardware complexity. It is noted that this $GF(2^{128})$ multiplier has lower
timing complexity compared to the sub-quadratic hardware complexity $GF(2^{128})$ multipliers.
However, we note that according to the latency of the proposed architectures, i.e.,
$\frac{n}{q} + \log_2(q)$, increasing the number of parallel structures ($q$) results in higher
throughputs. On the other hand, having higher values for $q$ increases the hardware
complexities of $\mathrm{GHASH}_H$. Therefore, for reducing the hardware complexity, using
sub-quadratic hardware complexity $GF(2^{128})$ multipliers is beneficial when high values of $q$
are utilized.
Table 7.3: Hardware and timing complexities analysis of the utilized bit-parallel
multipliers for the GCM.

Multiplier          Type    GE^a      Delay             Efficiency^b
complexity                                              ($10^3 \times \frac{\text{Throughput}}{\text{GE}}$)
Quad. [94]          -       56,957    $T_A + 10T_X$     $0.21/T_X$
Sub-quad. [106]     KO1     44,338    $T_A + 12T_X$     $0.23/T_X$
                    KO2     34,660    $T_A + 14T_X$     $0.25/T_X$
                    KO3     28,195    $T_A + 16T_X$     $0.27/T_X$
                    KO4     24,517    $T_A + 18T_X$     $0.28/T_X$
                    KO5     23,443    $T_A + 20T_X$     $0.27/T_X$
                    KO6     24,961    $T_A + 21T_X$     $0.24/T_X$

^a Gate equivalent in terms of two-input NAND.
^b Considering $T_X = 1.99\,T_A$ according to the utilized technology.
For reducing the hardware complexity of the AES-GCM, we have also used the efficient
realization of the Karatsuba-Ofman multiplier presented in [106] as the sub-quadratic
hardware complexity $GF(2^{128})$ multiplier. It is noted that the gate count of different
steps for one Karatsuba-Ofman multiplier has been presented in [106]. Based on our
technology hardware and timing specifications, we have presented the performance of
the $GF(2^{128})$ multipliers in Table 7.3. As shown in this table, six different steps for
the Karatsuba-Ofman multipliers are considered. We denote these realizations by KO1
(for the case that only one step is performed) to KO6 (for which the 128-bit $GF(2^{128})$
multiplier is broken all the way to 2-bit multiplications using the Karatsuba-Ofman method).
Applying the Karatsuba-Ofman method recursively to obtain KO$i$, $2 \le i \le 6$, for the
GCM would result in low-area implementations with higher timing complexities. As
seen from this table, although the sub-quadratic multiplier KO5 is the most compact
implementation, the sub-quadratic multiplier KO4 reaches the best efficiency. In the
next section, we present the synthesis results of these sub-quadratic multipliers for our
proposed architectures. We also compare the power consumptions and the efficiencies of
different methods for realizing these multipliers.
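The efficiency column of Table 7.3 can be approximately recomputed from the gate counts and delays. The following is a minimal sketch under an explicit assumption that is not stated in the table: throughput is taken as 128 bits per multiplier delay, with all delays expressed in units of $T_X$ using $T_X = 1.99\,T_A$. The results agree with the tabulated values to within rounding.

```python
# Sketch recomputing the efficiency column of Table 7.3. Assumption (not from
# the table itself): throughput = 128 bits per multiplier delay, delays in T_X.
T_A = 1 / 1.99                       # T_X = 1.99 T_A, so T_A expressed in T_X

multipliers = {                      # name: (GE, number of T_X in delay, table value)
    "Quad. [94]": (56957, 10, 0.21),
    "KO1": (44338, 12, 0.23),
    "KO2": (34660, 14, 0.25),
    "KO3": (28195, 16, 0.27),
    "KO4": (24517, 18, 0.28),
    "KO5": (23443, 20, 0.27),
    "KO6": (24961, 21, 0.24),
}

def efficiency(ge: int, xor_levels: int) -> float:
    """10^3 * throughput / GE, in units of 1/T_X, for a delay of T_A + k T_X."""
    delay = T_A + xor_levels
    return 1e3 * (128 / delay) / ge

for name, (ge, k, tabulated) in multipliers.items():
    print(f"{name}: computed {efficiency(ge, k):.3f}, tabulated {tabulated}")
```

Under this assumption the trade-off in the text is visible directly: area keeps shrinking up to KO5, while the growing delay makes KO4 the most efficient point.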
7.2 AES-GCM Performance Comparisons
In this section, first different AES architectures are presented and then we present and
compare the ASIC synthesis results of the proposed and the previously presented
architectures for the AES-GCM function.

Figure 7.4: The AES-128 structure for (a) simple loop, (b) unrolled pipelined, and (c)
unrolled sub-pipelined architectures (MixColumns is bypassed in the last round).

Table 7.4: The proposed architecture for the AES-GCM.

AES                                  GCM (Proposed using Algorithm 3)
(Unrolled pipelined)                 Exponents                        Multiplier
PB S-box (field constants            Complexity-reduced               Quad. and six
$\{11\}_2$ and $\{1010\}_2$),        parallel method                  sub-quad. (KO1-KO6)
optimized using (3.18)               (Table 7.2 and Fig. 7.3b)        in Table 7.3
We have presented different AES-128 architectures in Fig. 7.4. As seen in the AES
simple loop structure (Fig. 7.4a), the AES rounds are executed serially (in the last round
MixColumns is bypassed). This architecture is the most compact AES architecture and
has been used in the literature, see, for instance, [20]. However, it suffers from low
throughput. In Fig. 7.4b, the AES unrolled pipelined structure is shown in which
the pipeline stages are shown by dotted lines (see, for instance, [17]). As seen in this
figure, 10 AES rounds are duplicated, with the last round without the MixColumns
transformation. Although this architecture needs 10 AES rounds to be implemented, it
allows the designers to use pipelining and hence process multiple inputs sequentially for
achieving high throughput. For further increasing the throughput, sub-pipelining of the
AES transformations can be used as depicted in Fig. 7.4c.
Sub-pipelining is useful in increasing the working frequency of the AES at the expense
of more area used for the pipeline registers. However, it increases the latency of structures.
For instance, the latency of a 3-stage sub-pipelined AES is 3 times that
of the unrolled pipelined one. We also note that if the critical path delay is determined
by the multipliers in the GCM architecture, sub-pipelining of the AES transformations
cannot increase the working frequency. Although both pipelined and sub-pipelined AES
architectures can be utilized, in this chapter, for the syntheses and comparisons, we use
the pipelined AES architecture presented in Fig. 7.4b. Moreover, for analyzing the effect of
sub-pipelining, we have used sub-pipelined AES for two AES-GCM architectures. The
details of our implementations are presented later in this section.

Figure 7.5: The proposed AES-GCM high-performance architecture for $q = 8$.
According to Table 7.4, we use the most efficient S-box, i.e., the one using polynomial basis (PB) based on (3.18), to reach the AES-GCM with the highest performance. AES-128 encryption is considered as the block cipher for the GCM and, as indicated in Table 7.4, the 10 rounds of the AES-128 are unrolled and pipelined. Moreover, as seen in Table 7.4, we use the proposed Algorithm 1 for the GCM and utilize the parallel method in Fig. 7.3b for the hash subkey exponentiations (hardware-optimized through the complexity reduction methods in the previous section). Finally, as seen in this table, we use both the quadratic and sub-quadratic multipliers presented in Table 7.3.
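To make the role of the hash subkey powers concrete, the following sketch compares sequential GHASH with a generic q-way interleaved evaluation that uses powers of H up to H^q. It is not Algorithm 1 or the structure of Fig. 7.3b; it is a software illustration that assumes the block count is a multiple of q, and the function names are hypothetical. The field multiplication follows the bit-serial definition in NIST SP 800-38D [5].

```python
R_POLY = 0xE1 << 120  # reduction constant R from NIST SP 800-38D


def gf128_mul(x: int, y: int) -> int:
    """Multiply two 128-bit blocks in GF(2^128) using the GCM bit ordering."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:      # bit i of x, leftmost bit first
            z ^= v
        v = (v >> 1) ^ R_POLY if v & 1 else v >> 1
    return z


def ghash_sequential(h: int, blocks: list) -> int:
    """Horner evaluation: Y = (Y xor X_i) * H for each block in turn."""
    y = 0
    for x in blocks:
        y = gf128_mul(y ^ x, h)
    return y


def ghash_parallel(h: int, blocks: list, q: int = 8) -> int:
    """q interleaved lanes, each stepping by H^q, combined at the end."""
    if len(blocks) % q:
        raise ValueError("sketch assumes len(blocks) is a multiple of q")
    powers = [h]                       # powers[k] holds H^(k+1)
    for _ in range(q - 1):
        powers.append(gf128_mul(powers[-1], h))
    hq = powers[q - 1]
    lanes = [0] * q
    for start in range(0, len(blocks), q):
        for j in range(q):             # the q lanes are independent
            lanes[j] = gf128_mul(lanes[j], hq) ^ blocks[start + j]
    result = 0
    for j in range(q):                 # lane j is weighted by H^(q-j)
        result ^= gf128_mul(lanes[j], powers[q - 1 - j])
    return result
```

Each lane's multiply-add depends only on that lane's previous value, so the q lanes can run concurrently in hardware; only the final weighting and sum join them.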
Fig. 7.5 presents the proposed architecture for the AES-GCM for q = 8 parallel structures. The AES-128 pipeline registers are shown by dashed lines in Fig. 7.5. As seen in this figure, 10 clock cycles are needed for obtaining the ciphertext. After these first 10 clock cycles, a result is obtained every clock cycle. According to Fig. 7.5, 8 parallel AES-128 structures are implemented as part of $\mathrm{GCTR}_K$ to provide inputs to $\mathrm{GHASH}_H$. As seen in this figure, the function $\mathrm{GCTR}_K$ performs the AES counter mode with the Initial Counter Block (ICB) and its one-increments ($CB_i$). Moreover, q = 8 increments (using the INC 8 module) and the plaintext blocks ($P_i$) are used as the inputs. It is assumed that the data is encrypted and that the IV in the GCM is 96 bits, which is recommended for high-throughput implementations [5].
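The counter handling described above can be sketched as follows. The sketch assumes the 96-bit IV case of NIST SP 800-38D [5], where the pre-counter block is IV || 0^31 || 1 and the ICB for encryption is its 32-bit increment. The block cipher itself is omitted, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def inc32(block: int, step: int = 1) -> int:
    """Add `step` to the low 32 bits of a 128-bit block, modulo 2^32."""
    upper = (block >> 32) << 32
    return upper | ((block + step) & MASK32)


def initial_counter_block(iv96: int) -> int:
    """ICB for a 96-bit IV: increment of J0 = IV || 0^31 || 1."""
    j0 = (iv96 << 32) | 1
    return inc32(j0)


def lane_counters(icb: int, q: int, batches: int) -> list:
    """Counter blocks for q parallel AES lanes.

    Lane j starts at ICB + j; every lane then advances by q per batch,
    which plays the role of the INC 8 module when q = 8.
    """
    lanes = [inc32(icb, j) for j in range(q)]
    schedule = []
    for _ in range(batches):
        schedule.append(list(lanes))
        lanes = [inc32(cb, q) for cb in lanes]
    return schedule
```

Read batch by batch, the lanes produce the same counter sequence as repeated one-increments, so the parallel structure encrypts the same blocks as sequential counter mode.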
The results of our syntheses for the AES-GCM using the STM 65-nm CMOS technology [74] are presented in Table 7.5. The architectures have been coded in VHDL as the design entry to the Synopsys® Design Vision® [73]. The proposed architectures in this chapter and the ones in [64], [65], [66], [68] and [69] have been synthesized. The syntheses are based on the case of q = 8 parallel addition-multiplications using the bit-parallel $GF(2^{128})$ multiplier presented in [94], which has quadratic hardware complexity. To achieve low hardware complexity for the AES-GCM, we have also synthesized six different steps for the Karatsuba-Ofman multipliers. As seen in Table 7.5, areas, power consumptions, and maximum working frequencies are tabulated. From the discussions in this chapter, for n input blocks and q parallel structures, the latency for the architecture in [64], [65] and [66] is $n$, for the one in [68] and [69] is $\frac{n}{q} + q - 1$, and for our proposed architectures is $\frac{n}{q} + \log_2(q)$. Based on these latencies, the throughputs and efficiencies of the different architectures are also presented in Table 7.5.
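The three latency expressions can be evaluated directly. The sketch below assumes that n is a multiple of q and that q is a power of two; the function names are hypothetical.

```python
def latency_sequential(n: int) -> int:
    """Latency of the sequential schemes in [64], [65], [66]: n."""
    return n


def latency_parallel_prior(n: int, q: int) -> int:
    """Latency of the parallel schemes in [68], [69]: n/q + q - 1."""
    return n // q + q - 1


def latency_proposed(n: int, q: int) -> int:
    """Latency of the proposed architectures: n/q + log2(q)."""
    log2_q = q.bit_length() - 1   # exact for powers of two
    return n // q + log2_q


# For q = 8 these reduce to n/8 + 7 and n/8 + 3.
example = (latency_sequential(1024),
           latency_parallel_prior(1024, 8),
           latency_proposed(1024, 8))
```

For q = 8 the offsets 7 and 3 are the constants that appear in the latency terms of the throughput and efficiency expressions.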
As presented in Table 7.5, the sequential approach in [64], [65], and [66] has the lowest hardware complexity among the approaches. However, it has the lowest throughput, leading to low-performance hardware implementations. As depicted in Table 7.5, lower areas and power consumptions are achieved for the sub-quadratic hardware complexity $GF(2^{128})$ multipliers used in our proposed architectures compared to the one in [94]. As seen in this table, the maximum working frequency is decreased as we increase
Table 7.5: ASIC synthesis comparisons of the AES-GCM using the STM 65-nm CMOS technology.

| Scheme (a)          | Total area [AES] (b) (mm^2) | Total area [AES] (b) (K-GE) (c) | Power (mW) | Freq. (MHz) | Throughput (Gbps) | Efficiency (Gbps/mm^2) |
| [64], [65], [66]    | 0.23 [0.12]                 | 110 [57]                        | 19.6       | 568         | 72.7/n            | 316.0/n                |
| [68], [69] (d)      | 1.86 [1.02]                 | 894 [490]                       | 144.3      | 568         | 72.7/(n/8 + 7)    | 39.1/(n/8 + 7)         |
| Proposed (quad.) (e) | 1.82 [0.92]                | 875 [442]                       | 142.5      | 641         | 82.0/(n/8 + 3)    | 45.1/(n/8 + 3)         |
| Proposed (KO1) (e)  | 1.62 [0.92]                 | 779 [442]                       | 124.6      | 641         | 82.0/(n/8 + 3)    | 50.6/(n/8 + 3)         |
| Proposed (KO2) (e)  | 1.46 [0.92]                 | 702 [442]                       | 113.0      | 641         | 82.0/(n/8 + 3)    | 56.2/(n/8 + 3)         |
| Proposed (KO3) (e)  | 1.34 [0.92]                 | 644 [442]                       | 104.8      | 621         | 79.4/(n/8 + 3)    | 59.3/(n/8 + 3)         |
| Proposed (KO4) (e)  | 1.31 [0.92]                 | 630 [442]                       | 101.2      | 613         | 78.4/(n/8 + 3)    | 59.8/(n/8 + 3)         |
| Proposed (KO5) (e)  | 1.30 [0.92]                 | 625 [442]                       | 102.0      | 595         | 76.1/(n/8 + 3)    | 58.5/(n/8 + 3)         |
| Proposed (KO6) (e)  | 1.33 [0.92]                 | 639 [442]                       | 105.2      | 578         | 73.9/(n/8 + 3)    | 55.6/(n/8 + 3)         |

(a) For the case of q = 8 parallel structures.
(b) The area of the AES is shown inside brackets.
(c) 10^3 gate equivalents in terms of two-input NAND.
(d) The better scheme in [68] and [69], in terms of timing and hardware complexities, has been synthesized.
(e) The quadratic and six sub-quadratic multipliers used in the proposed AES-GCM.
the number of multiplication steps. However, this trend is not observed for the hardware complexity, i.e., it decreases up to KO5, the optimum, and then rises for KO6. The highest throughput is achieved for the proposed architectures in this chapter, i.e., $\frac{82.0}{n/8 + 3}$ Gbps using the quadratic and KO1/KO2 sub-quadratic multipliers. As seen in Table 7.5, the highest efficiency is derived for KO4, i.e., $\frac{59.8}{n/8 + 3}$ Gbps/mm$^2$. As seen in this table, the working frequencies and throughputs for KO1 and KO2 are similar. We have observed that this is because, for these two multipliers, the critical path delay is dominated by the AES rounds and not by the sub-quadratic multiplier. Inner-round pipelining can be performed to increase the working frequencies of the implementations. Nevertheless, this sub-pipelining increases the area and latency of the AES-GCM architectures. We have performed experiments by sub-pipelining the AES rounds for the architectures using KO1
[Figure: bar chart of normalized efficiency (%) for nine architectures. Values recovered from the figure:
Scheme: [64],[65],[66] | [68],[69] | Quad | KO1 | KO2 | KO3 | KO4 | KO5 | KO6
n1:     101            | 100       | 115  | 129 | 143 | 151 | 152 | 149 | 141
n2:     110            | 100       | 119  | 134 | 148 | 157 | 158 | 155 | 147]
Figure 7.6: Comparison of the efficiencies of nine different AES-GCM architectures for $n_1 = 2^{32} - 2$ and $n_2 = 2^{10}$.
and KO2 multipliers. This is achieved by adding one pipeline stage after ShiftRows and right before MixColumns. The results of our experiments show no major difference in the maximum working frequency of the design utilizing the KO2 multiplier, while its hardware complexity increases. However, for the architecture using KO1 multipliers, a working frequency of 689 MHz with an increased area of 1.70 mm$^2$ is achieved. Therefore, this architecture, which uses KO1 multipliers, obtains the best speed compared to the results in Table 7.5. However, its efficiency is $\frac{51.9}{n/8 + 3}$ Gbps/mm$^2$, which is less than that of the architectures with KO4 multipliers (see Table 7.5).
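The numerators of the throughput and efficiency expressions appear consistent with multiplying the working frequency by the 128-bit block size and then dividing by the total area. That reading is an inference from the reported numbers, not a formula quoted from the text, and the function names below are hypothetical.

```python
BLOCK_BITS = 128  # AES block size in bits


def throughput_numerator_gbps(freq_mhz: float) -> float:
    """Frequency (MHz) times block size, expressed in Gbps."""
    return freq_mhz * BLOCK_BITS / 1000.0


def efficiency_numerator(freq_mhz: float, area_mm2: float) -> float:
    """Throughput numerator divided by total area, in Gbps/mm^2."""
    return throughput_numerator_gbps(freq_mhz) / area_mm2


# Sub-pipelined KO1 design reported above: 689 MHz and 1.70 mm^2.
ko1_sub_pipelined = round(efficiency_numerator(689, 1.70), 1)
# Quadratic, KO1, and KO2 designs: 641 MHz.
quad_throughput = round(throughput_numerator_gbps(641), 1)
```

Both numerators are then divided by the latency term n/8 + 3 to give the tabulated expressions. Small differences from the printed values can arise because the table entries are rounded.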
To compare the efficiencies of the schemes presented in Table 7.5, we present Fig. 7.6. Based on the efficiencies in the last column of Table 7.5, two graphs for two values of n are plotted in Fig. 7.6, i.e., $n_1 = 2^{32} - 2$ (the largest encrypted message size allowed) and $n_2 = 2^{10}$. The efficiencies in this figure are normalized so that the efficiency of the scheme in [68], [69] is 100%. As seen in Fig. 7.6, for both $n_1 = 2^{32} - 2$ and $n_2 = 2^{10}$, KO4 has the highest efficiencies (51% and 44% more than the sequential method, respectively).
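One way such a normalization could be computed from the efficiency column of Table 7.5 is sketched below. Because it uses the rounded table entries, its results need not match the plotted values exactly. The dictionary layout and function names are hypothetical.

```python
Q = 8  # number of parallel structures

# scheme: (efficiency numerator in Gbps/mm^2, latency offset; None means latency n)
TABLE_7_5 = {
    "[64],[65],[66]": (316.0, None),
    "[68],[69]": (39.1, 7),
    "Quad": (45.1, 3),
    "KO1": (50.6, 3),
    "KO2": (56.2, 3),
    "KO3": (59.3, 3),
    "KO4": (59.8, 3),
    "KO5": (58.5, 3),
    "KO6": (55.6, 3),
}


def efficiency(scheme: str, n: int) -> float:
    """Efficiency in Gbps/mm^2 for a message of n blocks."""
    numerator, offset = TABLE_7_5[scheme]
    latency = n if offset is None else n / Q + offset
    return numerator / latency


def normalized_efficiency(scheme: str, n: int,
                          baseline: str = "[68],[69]") -> float:
    """Efficiency as a percentage of the baseline scheme's efficiency."""
    return 100.0 * efficiency(scheme, n) / efficiency(baseline, n)


def most_efficient(n: int) -> str:
    """Scheme with the highest efficiency for n blocks."""
    return max(TABLE_7_5, key=lambda s: efficiency(s, n))
```

With these inputs, KO4 is the most efficient scheme for both message sizes considered in the text.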
We conclude this chapter with a summary of the work presented. In this chapter, we have obtained optimized building blocks for the AES-GCM to propose efficient and high-performance architectures. For the AES, through logic-gate minimizations for the inversion in $GF(2^4)$, the areas of the S-boxes have been reduced. We have also evaluated and compared the performance of different S-box architectures using an ASIC 65-nm CMOS technology. Furthermore, through exhaustive searches for the input patterns, we have performed simulation-based average and peak power derivations for different S-boxes to reach more accurate results compared to the statistical power derivation methods. We have also proposed high-performance and efficient architectures for the GCM. For the case study of q = 8 parallel structures in $\mathrm{GHASH}_H$, we have performed a hardware complexity reduction technique for the hash subkey exponentiations, keeping their timing complexities intact. For comparison, the proposed architectures and the previous ones have been synthesized on ASIC. The results show that better efficiencies are achieved for the proposed architectures. Moreover, according to our results, the structures using the four-step Karatsuba-Ofman $GF(2^{128})$ multiplier are the most efficient ones among our proposed architectures. Based on the available resources and performance goals, one can choose among the proposed AES-GCM architectures to fulfill the constraints of the required applications.
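As a software illustration of the Karatsuba-Ofman idea, the sketch below multiplies binary polynomials (integers used as bit vectors over GF(2)) with a configurable number of recursion steps and falls back to schoolbook multiplication below that depth. Treating "KO k" as k recursion levels is an assumption made here for illustration. The reduction modulo the field polynomial is omitted, and the function names are hypothetical.

```python
def clmul_schoolbook(a: int, b: int) -> int:
    """Carry-less (GF(2)[x]) product of two bit-vector polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def clmul_karatsuba(a: int, b: int, bits: int = 128, steps: int = 4) -> int:
    """Karatsuba-Ofman product with `steps` levels of recursion.

    Each level replaces one bits-wide product by three half-width products;
    over GF(2) the additions and subtractions are all XORs.
    """
    if steps == 0 or bits <= 1:
        return clmul_schoolbook(a, b)
    half = bits // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    low = clmul_karatsuba(a0, b0, half, steps - 1)
    high = clmul_karatsuba(a1, b1, half, steps - 1)
    middle = clmul_karatsuba(a0 ^ a1, b0 ^ b1, half, steps - 1) ^ low ^ high
    return low ^ (middle << half) ^ (high << (2 * half))
```

Each extra step trades multiplier area for additional XOR layers, which matches the qualitative trend in Table 7.5 where area falls and then rises again as steps are added.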
Chapter 8
Summary and Future Work
8.1 Thesis Summary
In this thesis, we have proposed reliable and high-performance hardware implementations for the AES-GCM. This includes novel lightweight and concurrent fault detection schemes for the AES for making it reliable, and high-performance hardware architectures for the AES-GCM for reaching efficient VLSI implementations. The following summarizes the contributions of this work.
In Chapter 3, which has been presented in [71] and [72], we have evaluated the performance of more than 40 S-boxes utilizing a fixed benchmark platform in 65-nm CMOS technology. To obtain the least-complexity S-box, the formulations for the Galois Field (GF) sub-field inversions in $GF(2^4)$ have been optimized. By conducting exhaustive simulations for the input transitions, we have analyzed the average and peak power consumptions of the AES S-boxes considering the switching activities, gate-level netlists, and parasitic information.
In Chapter 4, which has been presented in [78] and [79], we have proposed a lightweight concurrent fault detection scheme for the AES. In the presented approach, for increasing the error coverage, the predicted parities of the five blocks of the S-box and the inverse S-box have been obtained (three predicted parities for the multiplicative inversion and two for the transformation and affine matrices). Through exhaustive searches among all available composite fields, we have found the optimum solutions for the least overhead parity-based fault detection structures. Moreover, through our error injection simulations for one S-box (resp. inverse S-box), we have shown that the total error coverage of almost 100% (99.998%) for 16 S-boxes (resp. inverse S-boxes) can be achieved. Finally, it is shown that both the ASIC and FPGA implementations of the fault detection structures using the obtained optimum composite fields have better hardware and time complexities compared to their counterparts.
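A much smaller illustration of parity prediction than the five-parity scheme summarized above is the AES affine transformation taken on its own. Since each input bit feeds five output bits and the constant 0x63 has even weight, the output parity equals the input parity, so a predictor needs only the input. The sketch below is an illustration under that observation, not the formulation derived in Chapter 4, and the function names are hypothetical.

```python
def parity(value: int) -> int:
    """Return 1 if `value` has an odd number of set bits, else 0."""
    return bin(value).count("1") & 1


def rotl8(value: int, k: int) -> int:
    """Rotate an 8-bit value left by k positions."""
    return ((value << k) | (value >> (8 - k))) & 0xFF


def aes_affine(b: int) -> int:
    """AES S-box affine transformation applied to one byte."""
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


def predicted_parity(b: int) -> int:
    """Parity of aes_affine(b), predicted from the input alone."""
    return parity(b)


def error_flagged(b: int, observed_output: int) -> bool:
    """Compare the actual output parity with the prediction."""
    return parity(observed_output) != predicted_parity(b)
```

A single parity bit flags any odd number of flipped output bits but misses even-weight errors, which is one reason multiple parities or multi-bit signatures are attractive for higher coverage.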
In Chapter 5, which has been presented in [83] and [84], we have proposed a concurrent fault detection scheme for the S-box and the inverse S-box based on the low-cost composite field implementations of the S-box and the inverse S-box. We have divided the structures of these operations into three blocks and found the predicted parities of these blocks. We have obtained new formulations for the five predicted parities for three blocks of the S-box and the inverse S-box. To reach high multiple and burst fault detection capabilities, multiple-bit signatures have been obtained within the blocks constituting more area in the structures of the S-box and the inverse S-box. Our simulations have shown that, except for the redundant units approach, which has hardware and time overheads of close to 100%, the fault detection capabilities of the proposed scheme for the burst and random multiple faults are higher than the previously reported ones. Finally, through ASIC implementations, it has been shown that for the maximum target frequency, the proposed fault detection S-box and inverse S-box in this chapter have the least areas, critical path delays, and power consumptions compared to their counterparts with similar fault detection capabilities.
In Chapter 6, which has been presented in [92] and [93], we have proposed a structure-independent fault detection scheme for the entire AES encryption and decryption. Specifically, we have obtained new formulations for the fault detection of SubBytes and inverse SubBytes using the relation between the input and the output of the S-box and the inverse S-box. The proposed schemes are independent of the way the S-box and the inverse S-box are constructed. Therefore, they can be used for both the S-boxes and the inverse S-boxes using look-up tables and those utilizing logic gates based on composite fields. Our simulation results have shown very high error coverage for the proposed schemes. Finally, our proposed fault detection schemes and almost all of the previously reported ones have been implemented on FPGAs and their area and delay overheads have been derived and compared. The FPGA implementation results have shown the low area and delay overheads for the proposed fault detection schemes.
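One natural input/output relation of this kind is that, for a nonzero input x with S-box output y, the product of x and the inverse affine transformation of y equals 1 in GF(2^8), while a zero input maps to 0x63. The sketch below checks that relation on outputs from a table, without looking inside the S-box. It illustrates the structure-independence idea only and is not claimed to be the formulation of Chapter 6; the function names are hypothetical.

```python
def gf256_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def rotl8(value: int, k: int) -> int:
    """Rotate an 8-bit value left by k positions."""
    return ((value << k) | (value >> (8 - k))) & 0xFF


def aes_sbox(x: int) -> int:
    """Reference S-box: inversion (x^254) followed by the affine map."""
    inv = 0
    if x:
        inv = 1
        for _ in range(254):
            inv = gf256_mul(inv, x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


def inverse_affine(y: int) -> int:
    """Inverse of the AES affine transformation."""
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05


def relation_holds(x: int, y: int) -> bool:
    """True when y is consistent with being the S-box output for input x."""
    if x == 0:
        return y == 0x63
    return gf256_mul(x, inverse_affine(y)) == 1


SBOX_TABLE = [aes_sbox(x) for x in range(256)]  # stands in for any S-box realization
```

Because the check uses only x and y, it applies equally to a look-up table and to a composite-field circuit. A practical scheme would compress the relation, for example into parities, rather than recompute a full multiplication.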
Finally, in Chapter 7, which has been presented in [72], we have presented high-speed, parallel hardware architectures for reaching low-latency and high-throughput structures of the GCM. Having investigated the high-performance $GF(2^{128})$ multiplier architectures, we have benchmarked the proposed AES-GCM architectures using quadratic and sub-quadratic hardware complexity $GF(2^{128})$ multipliers. It has been shown that the presented AES-GCM architectures outperform the previously reported ones in the utilized 65-nm CMOS technology.
Based on the above summary, the contributions of this thesis are:
• Optimizing and benchmarking the AES S-boxes on a fixed hardware platform
• Devising lightweight concurrent exhaustive search-based fault detection schemes for the AES S-boxes using polynomial and normal bases
• Proposing multi-bit signature-based fault diagnosis approaches for the AES S-boxes, inverse S-boxes, and mixed operations
• Presenting structure-independent schemes for fault detection of the entire AES encryption and decryption
• Proposing high-speed, parallel hardware architectures for the GCM
8.2 Future Work
As future work for this thesis, the following can be pursued.
• The fault detection schemes proposed in this thesis have been evaluated using extensive simulations and benchmarked on ASIC hardware platforms. As future work, our proposed fault diagnosis approaches and the corresponding original architectures can be fabricated on chip and actual error injections can be performed. Such error injection into the fabricated chip would verify the effectiveness of the proposed fault detection approaches one level beyond the simulation level.
• Further work on the FPGA platform can be explored, noting that the AES is utilized for bitstream security mechanisms. Specifically, the AES decryption is hardware-implemented in many recent FPGAs. Incorporating the proposed hardware countermeasures and evaluating their effectiveness in counteracting internal/malicious faults on FPGAs would be an interesting future research topic.
• As an extension of this thesis, one can integrate reliability into the design of recent cryptographic data authentication algorithms. One can carry out research on developing reliable architectures for the third-round SHA-3 (Secure Hash Algorithm-3) candidates, one of which contains the AES algorithm as its building block. This development would be a promising advancement in cryptography research and could inform the selection of the final candidates in the ongoing competition to choose the winning function for SHA-3 in 2012.
• Finally, one can work on devising reliable architectures for the recently standardized GCM, which provides data authentication to block ciphers such as the AES. To the best of our knowledge, the aforementioned research on the reliability of these architectures would be carried out for the first time.
Bibliography
[1] National Institute of Standards and Technologies, "Announcing the Advanced Encryption Standard (AES)," Federal Information Processing Standards Publication, no. 197, Nov. 2001.
[2] Wi-Fi, http://standards.ieee.org/getieee802/download/802.11-2007.pdf.
[3] WiMAX, http://standards.ieee.org/getieee802/download/802.16e-2005.pdf.
[4] S. Trimberger, "Security in SRAM FPGAs," IEEE Design and Test of Computers, vol. 24, no. 6, pp. 581, Nov. 2007.
[5] M. Dworkin, "Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC," NIST SP 800-38D, 2007.
[6] IEEE Standard for Local and Metropolitan Area Networks, Media Access Control (MAC) Security, 2006.
[7] Fibre Channel Security Protocols (FC-SP), 2006. Available: http://www.t10.org/ftp/t11/document.06/06-157v0.pdf.
[8] Algotronics Ltd.: GCM Extension for AES G3 Core, 2007.
[9] Helion Technology: AES-GCM Cores, 2007.
[10] Elliptic Semiconductor Inc.: CLP-15: Ultra-High Throughput AES-GCM Core-40 Gbps, 2008.
[11] E. Kasper and P. Schwabe, "Faster and Timing-Attack Resistant AES-GCM," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '09), LNCS 5747, pp. 1-17, 2009.
[12] K. Jankowski and P. Laurent, "Packed AES-GCM Algorithm Suitable for AES/PCLMULQDQ Instructions," IEEE Trans. Computers, vol. 60, no. 1, pp. 135-138, Jan. 2011.
[13] S. Morioka and A. Satoh, "An Optimized S-Box Circuit Architecture for Low Power AES Design," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '02), pp. 172-186, Aug. 2002.
[14] M. McLoone and J.V. McCanny, "High Performance Single-chip FPGA Rijndael Algorithm Implementations," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '01), LNCS 2162, pp. 65-76, 2001.
[15] F. X. Standaert, G. Rouvroy, J. J. Quisquater, and J. D. Legat, "Efficient Implementation of Rijndael Encryption in Reconfigurable Hardware: Improvements and Design Tradeoffs," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '03), LNCS 2779, Springer, pp. 334-350, Sep. 2003.
[16] P. Bulens, F.-X. Standaert, J.-J. Quisquater, P. Pellegrin, and G. Rouvroy, "Implementation of the AES-128 on Virtex-5 FPGAs," Proc. Int'l Conf. Theory and Application of Cryptology and Information Security: Advances in Cryptology (ASIACRYPT '08), LNCS 5023, pp. 16-26, 2008.
[17] A. Hodjat and I. Verbauwhede, "Area-Throughput Trade-Offs for Fully Pipelined 30 to 70 Gbits/s AES Processors," IEEE Trans. Computers, vol. 55, no. 4, pp. 366-372, April 2006.
[18] V. Rijmen, "Efficient Implementation of the Rijndael S-box," Katholieke Universiteit Leuven, Dept. ESAT, Belgium, 2000, available at: http://www.esat.kuleuven.ac.be/~rijmen/rijndael/sbox.pdf.
[19] A. Rudra, P. K. Dubey, C. S. Jutla, V. Kumar, J. R. Rao, and P. Rohatgi, "Efficient Rijndael Encryption Implementation with Composite Field Arithmetic," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '01), pp. 171-184, May 2001.
[20] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A Compact Rijndael Hardware Architecture with S-Box Optimization," Proc. Int'l Conf. Theory and Application of Cryptology and Information Security: Advances in Cryptology (ASIACRYPT '01), pp. 239-254, Dec. 2001.
[21] J. Wolkerstorfer, E. Oswald, and M. Lamberger, "An ASIC Implementation of the AES SBoxes," Proc. Cryptographers' Track-RSA '02, pp. 67-78, Jan. 2002.
[22] X. Zhang and K. K. Parhi, "High-Speed VLSI Architectures for the AES Algorithm," IEEE Trans. Very Large Scale Integration Systems, vol. 12, no. 9, pp. 957-967, Sep. 2004.
[23] D. Canright, "A Very Compact S-Box for AES," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '05), pp. 441-455, Aug. 2005.
[24] X. Zhang and K. K. Parhi, "On the Optimum Constructions of Composite Field for the AES Algorithm," IEEE Trans. Circuits and Systems II: Express Briefs, vol. 53, no. 10, pp. 1153-1157, Oct. 2006.
[25] S. Nikova, V. Rijmen, and M. Schläffer, "Using Normal Bases for Compact Hardware Implementations of the AES S-Box," Proc. Security in Communication Networks '08, pp. 236-245, 2008.
[26] G. Bertoni, M. Macchetti, and L. Negri, "Power-efficient ASIC Synthesis of Cryptographic Sboxes," Proc. Great Lakes Symposium on VLSI '04, pp. 277-281, April 2004.
[27] J. Blomer and J. P. Seifert, "Fault Based Cryptanalysis of the Advanced Encryption Standard (AES)," Proc. Financial Cryptography '03, pp. 162-181, Jan. 2003.
[28] G. Piret and J. J. Quisquater, "A Differential Fault Attack Technique against SPN Structures, with Application to the AES and Khazad," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '03), pp. 77-88, Sep. 2003.
[29] P. Dusart, G. Letourneux, and O. Vivolo, "Differential Fault Analysis on AES," Proc. Int'l Conf. Applied Cryptography and Network Security (ACNS '03), pp. 293-306, Oct. 2003.
[30] C. Giraud, "DFA on AES," In Proc. of AES 2004, pp. 27-41, May 2004.
[31] J. Blomer and V. Krummel, "Fault Based Collision Attacks on AES," Proc. Int'l Workshop Fault Diagnosis and Tolerance in Cryptography (FDTC '06), pp. 106-120, Oct. 2006.
[32] J. Takahashi, T. Fukunaga, and K. Yamakoshi, "DFA Mechanism on the AES Key Schedule," Proc. Int'l Workshop Fault Diagnosis and Tolerance in Cryptography (FDTC '07), pp. 62-72, Sep. 2007.
[33] M. Rivain, "On the Physical Security of Cryptographic Implementations," Ph.D. thesis, University of Luxembourg, Sep. 2009.
[34] R. Karri, K. Wu, P. Mishra, and K. Yongkook, "Fault-based Side-Channel Cryptanalysis Tolerant Rijndael Symmetric Block Cipher Architecture," Proc. IEEE Int'l Symp. Defect and Fault Tolerance in VLSI Systems (DFT '01), pp. 418-426, Oct. 2001.
[35] G. Bertoni, L. Breveglieri, I. Koren, P. Maistri, and V. Piuri, "A Parity Code Based Fault Detection for an Implementation of the Advanced Encryption Standard," Proc. IEEE Int'l Symp. Defect and Fault Tolerance in VLSI Systems (DFT '02), pp. 51-59, Nov. 2002.
[36] G. Bertoni, L. Breveglieri, I. Koren, P. Maistri, and V. Piuri, "Error Analysis and Detection Procedures for a Hardware Implementation of the Advanced Encryption Standard," IEEE Trans. Computers, vol. 52, no. 4, pp. 492-505, April 2003.
[37] R. Karri, G. Kuznetsov, and M. Goessel, "Parity-based Concurrent Error Detection of Substitution-Permutation Network Block Ciphers," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '03), pp. 113-124, Sep. 2003.
[38] M. Karpovsky, K. J. Kulikowski, and A. Taubin, "Differential Fault Analysis Attack Resistant Architectures for the Advanced Encryption Standard," Proc. Conf. Smart Card Research and Advanced Applications (CARDIS '04), vol. 153, pp. 177-192, Aug. 2004.
[39] K. Wu, R. Karri, G. Kuznetsov, and M. Goessel, "Low Cost Concurrent Error Detection for the Advanced Encryption Standard," Proc. Int'l Test Conf. '04, pp. 1242-1248, Oct. 2004.
[40] G. Bertoni, L. Breveglieri, I. Koren, and P. Maistri, "An Efficient Hardware-based Fault Diagnosis Scheme for AES: Performances and Cost," Proc. IEEE Int'l Symp. Defect and Fault Tolerance in VLSI Systems (DFT '04), pp. 130-138, Oct. 2004.
[41] L. Breveglieri, I. Koren, and P. Maistri, "Incorporating Error Detection and Online Reconfiguration into a Regular Architecture for the AES," Proc. IEEE Int'l Symp. Defect and Fault Tolerance in VLSI Systems (DFT '05), pp. 72-80, Oct. 2005.
[42] C. H. Yen and B. F. Wu, "Simple Error Detection Methods for Hardware Implementation of Advanced Encryption Standard," IEEE Trans. Computers, vol. 55, no. 6, pp. 720-731, June 2006.
[43] T. G. Malkin, F. X. Standaert, and M. Yung, "A Comparative Cost/Security Analysis of Fault Attack Countermeasures," Proc. Int'l Workshop Fault Diagnosis and Tolerance in Cryptography (FDTC '06), pp. 159-172, Oct. 2006.
[44] A. Satoh, T. Sugawara, N. Homma, and T. Aoki, "High-Performance Concurrent Error Detection Scheme for AES Hardware," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '08), pp. 100-112, Aug. 2008.
[45] G. Canivet, P. Maistri, R. Leveugle, F. Valette, J. Clediere, and M. Renaudin, "Dependability Analysis of a Countermeasure against Fault Attacks by Means of Laser Shots onto a SRAM-based FPGA," 21st IEEE International Conference on Application-specific Systems Architectures and Processors (ASAP), pp. 115-122, July 2010.
[46] U. Legat, A. Biasizzo, and F. Novak, "A Compact AES Core with On-line Error-detection for FPGA Applications with Modest Hardware Resources," Journal of Microprocessors and Microsystems, vol. 35, no. 4, pp. 405-416, June 2011.
[47] M. Medwed and J.-M. Schmidt, "A Continuous Fault Countermeasure for AES Providing a Constant Error Detection Rate," In Proc. of FDTC 2010, IEEE Press, pp. 66-71, August 2010.
[48] S. Ghaznavi, "Soft Error Resistant Design of the AES Cipher Using SRAM-based FPGA," Ph.D. thesis, University of Waterloo, Ontario, Canada, 2011.
[49] P. Maistri and R. Leveugle, "Double-Data-Rate Computation as a Countermeasure against Fault Analysis," IEEE Trans. Computers, vol. 57, no. 11, pp. 1528-1539, Nov. 2008.
[50] C. Moratelli, F. Ghellar, E. Cota, and M. Lubaszewski, "A Fault-Tolerant DFA-Resistant AES Core," Proc. IEEE Int'l Symp. Circuits and Systems (ISCAS '08), pp. 244-247, May 2008.
[51] G. Di Natale, M. Doulcier, M. L. Flottes, and B. Rouzeyre, "A Reliable Architecture for Parallel Implementations of the Advanced Encryption Standard," Journal of Elec. Testing, vol. 25, no. 4, pp. 269-278, Aug. 2009.
[52] S.-Y. Wu and H.-T. Yen, "On the S-Box Architectures with Concurrent Error Detection for the Advanced Encryption Standard," IEICE Trans. on Fundamentals of Electronics, Communications and Computer Sciences, vol. E89-A, no. 10, pp. 2583-2588, Oct. 2006.
[53] A. E. Cohen, "Architectures for Cryptography Accelerators," PhD dissertation, University of Minnesota, Sep. 2007.
[54] M. Mozaffari Kermani and A. Reyhani-Masoleh, "Parity Prediction of S-box for AES," In Proc. of the IEEE Canadian Conference on Electrical and Computer Engineering (CCECE 2006), pp. 2357-2360, May 2006.
[55] M. Mozaffari Kermani and A. Reyhani-Masoleh, "Parity-based Fault Detection Architecture of S-box for Advanced Encryption Standard," In Proc. of the IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT 2006), pp. 572-580, Oct. 2006.
[56] M. Mozaffari Kermani, "Fault Detection Schemes for High Performance VLSI Implementations of the Advanced Encryption Standard," M.E.Sc. Thesis, Department of Electrical and Computer Engineering, The University of Western Ontario, London, Ontario, Canada, April 2007.
[57] M. Mozaffari Kermani and A. Reyhani-Masoleh, "Fault Detection Structures of the S-boxes and the Inverse S-boxes for the Advanced Encryption Standard," Journal of Elec. Testing, vol. 25, no. 4, pp. 225-245, Aug. 2009.
[58] T. Good and M. Benaissa, "692-nW Advanced Encryption Standard (AES) on a 0.13-μm CMOS," IEEE Trans. VLSI Systems, vol. 18, no. 12, pp. 1753-1757, Dec. 2010.
[59] S. Tillich, M. Feldhofer, T. Popp, and J. Großschädl, "Area, Delay, and Power Characteristics of Standard-Cell Implementations of the AES S-Box," J. Sign. Process Syst, vol. 50, pp. 251-261, 2008.
[60] J. Boyar and R. Peralta, "A New Combinational Logic Minimization Technique with Applications to Cryptology," In Proc. of SEA 2010, pp. 178-189, 2010.
[61] S. Nikova, V. Rijmen, and M. Schläffer, "Using Normal Bases for Compact Hardware Implementations of the AES S-Box," In Proc. of the SCN 2008, LNCS 5229, Springer, pp. 236-245, 2008.
[62] Y. Nogami, K. Nekado, T. Toyota, N. Hongo, and Y. Morikawa, "Mixed Bases for Efficient Inversion in $\mathbb{F}_{((2^2)^2)^2}$ and Conversion Matrices of SubBytes of AES," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '10), LNCS 6225, Springer, pp. 234-247, August 2010.
[63] D. Canright and D.A. Osvik, "A More Compact AES," In Proc. of SAC 2009, LNCS 5867, pp. 157-169, 2009.
[64] S. Lemsitzer, J. Wolkerstorfer, N. Felbert, and M. Braendli, "Multi-gigabit GCM-AES Architecture Optimized for FPGAs," In Proc. of CHES 2007, LNCS, vol. 4727, pp. 227-238, 2007.
[65] P. Patel, "Parallel Multiplier Designs for the Galois/Counter Mode of Operation," Master of Applied Science Thesis, The University of Waterloo, 2008.
[66] B. Yang, S. Mishra, and R. Karri, "High Speed Architecture for Galois/Counter Mode of Operation (GCM)," Cryptology ePrint Archive: Report 2005/146, June 2005.
[67] D. A. McGrew and J. Viega, "The Galois/counter Mode of Operation (GCM)," 2005.
[68] A. Satoh, "High-speed Parallel Hardware Architecture for Galois Counter Mode," In Proc. of ISCAS, pp. 1863-1866, 2007.
[69] A. Satoh, T. Sugawara, and T. Aoki, "High-Performance Hardware Architectures for Galois Counter Mode," IEEE Trans. Computers, vol. 58, no. 7, pp. 917-930, 2009.
[70] N. Meloni, C. Negre, and M. A. Hasan, "High Performance GHASH Function for Long Messages," In Proc. of ACNS 2010, pp. 154-167, 2010.
[71] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Low-Cost S-box for the Advanced Encryption Standard Using Normal Basis," In Proc. of the IEEE International Conference on Electro/Information Technology (EIT 2009), pp. 52-55, June 2009.
[72] M. Mozaffari Kermani and A. Reyhani-Masoleh, "Efficient and High-Performance Parallel Hardware Architectures for the AES-GCM," To appear in IEEE Trans. on Computers.
[73] Synopsys, http://www.synopsys.com/.
[74] STMicroelectronics, http://www.st.com/.
[75] ModelSim, http://www.model.com/.
[76] Mathworks, http://www.mathworks.com/.
[77] S.-Y. Lin and C.-T. Huang, "A High-Throughput Low-Power AES Cipher for Network Applications," In Proc. of ASP-DAC 2007, pp. 595-600, 2007.
[78] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Lightweight Concurrent Fault Detection Scheme for the AES S-boxes Using Normal Basis," In Proc. of CHES 2008, pp. 113-129, Aug. 2008.
[79] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Lightweight High-Performance Fault Detection Scheme for the Advanced Encryption Standard Using Composite Fields," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 19, no. 1, pp. 85-91, Jan. 2011.
[80] XILINX, http://xilinx.com.
[81] N. Mentens, L. Batina, B. Preneel, and I. Verbauwhede, "A Systematic Evaluation of Compact Hardware Implementations for the Rijndael S-box," In Proc. of CT-RSA, pp. 323-333, Feb. 2005.
[82] L. Breveglieri, I. Koren, and P. Maistri, "An Operation-Centered Approach to Fault Detection in Symmetric Cryptography Ciphers," IEEE Trans. Computers, vol. 56, no. 5, pp. 534-540, May 2007.
[83] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Low-Power High-Performance Concurrent Fault Detection Approach for the Composite Field S-box and Inverse S-box," To appear in IEEE Trans. Computers (special issue on Concurrent On-Line Testing and Error/Fault Resilience of Digital Systems).
[84] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A High-Performance Fault Diagnosis Approach for the AES SubBytes Utilizing Mixed Bases," To appear in Proc. of FDTC 2011.
[85] M. Nicolaidis, R. O. Duarte, S. Manich, and J. Figueras, "Fault-Secure Parity Prediction Arithmetic Operators," IEEE Des. Test, vol. 14, no. 2, pp. 60-71, 1997.
[86] N. A. Touba and E. J. McCluskey, "Logic Synthesis of Multilevel Circuits with Concurrent Error Detection," IEEE Trans. CAD, vol. 16, no. 7, pp. 783-789, 1997.
[87] S. Fenn, M. Goessel, M. Benaissa, and D. Taylor, "On-Line Error Detection for Bit-Serial Multipliers in $GF(2^m)$," Journal of Elec. Testing, vol. 13, pp. 29-40, 1998.
[88] C. Metra, M. Favalli, and B. Ricco, "Novel Implementation for Highly Testable Parity Code Checkers," Proc. Int'l Workshop On-Line Testing, pp. 167-171, 1998.
[89] A. Reyhani-Masoleh and M. A. Hasan, "Fault Detection Architectures for Field Multiplication Using Polynomial Bases," IEEE Trans. Computers, Special Issue on Fault Diagnosis and Tolerance in Cryptography, vol. 55, no. 9, pp. 1089-1103, Sep. 2006.
[90] G. C. Cardarilli, M. Ottavi, S. Pontarelli, M. Re, and A. Salsano, "Fault Localization, Error Correction, and Graceful Degradation in Radix 2 Signed Digit-based Adders," IEEE Trans. Computers, vol. 55, no. 5, pp. 534-540, 2006.
[91] M. George and P. Alfke, "Linear Feedback Shift Registers in Virtex Devices," Xilinx Application Note 210, available at: http://www.xilinx.com/support/documentation/application_notes/xapp210.pdf.
[92] M. Mozaffari Kermani and A. Reyhani-Masoleh, "Concurrent Structure-Independent
Fault Detection Schemes for the Advanced Encryption Standard," IEEE Trans.
Computers, Special Issue on System Level Design of Reliable Architectures, vol. 59,
no. 5, pp. 608-622, May 2010.
[93] M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Structure-independent
Approach for Fault Detection Hardware Implementations of the Advanced Encryption
Standard," In Proc. of FDTC 2007, IEEE Press, pp. 47-53, Sep. 2007.
[94] A. Reyhani-Masoleh and M. Hasan, "Low Complexity Bit Parallel Architectures for
Polynomial Basis Multiplication over GF(2^m)," IEEE Trans. Computers, vol. 53,
no. 8, pp. 945-959, 2004.
[95] R. Zimmermann and W. Fichtner, "Low-power Logic Styles: CMOS versus
Pass-transistor Logic," IEEE Journal of Solid-State Circuits, vol. 32, no. 7,
pp. 1079-1090, 1997.
[96] C. Moratelli, E. Cota, and M. Lubaszewski, "A Cryptography Core Tolerant to DFA
Fault Attacks," In Proc. of SBCCI 2006, pp. 190-195, Sep. 2006.
[97] D. E. Knuth, The Art of Computer Programming, vol. 2, "Semi-numerical
Algorithms," pp. 441-466, Addison-Wesley, 1981.
[98] R. Lidl and H. Niederreiter, "Introduction to Finite Fields and Their Applications,"
Cambridge Univ. Press, 1994.
[99] O. Gustafsson and M. Olofsson, "Complexity Reduction of Constant Matrix
Computations over the Binary Field," In Proc. of WAIFI 2007, pp. 103-115, 2007.
[100] H. Yi, J. Song, S. Park and C. Park, "Parallel CRC Logic Optimization Algorithm
for High Speed Communication Systems," In Proc. of ICCS 2006, pp. 1-5, 2006.
[101] G. Zhou, H. Michalik, and L. Hinsenkamp, "Improving Throughput of AES-GCM
with Pipelined Karatsuba Multipliers on FPGAs," In Proc. of ARC 2009, LNCS
5453, pp. 193-203, 2009.
[102] J. Lazaro, A. Astarloa, U. Bidarte, J. Jimenez, and A. Zuloaga, \AES-Galois
Counter Mode Encryption/Decryption FPGA Core for Industrial and Residential
Gigabit Ethernet Communications," In Proc. of ARC 2009, LNCS 5453, pp. 312-
317, 2009.
[103] E. D. Mastrovito, "VLSI Architectures for Computation in Galois Fields," Ph.D.
Thesis, Linkoping University, 1991.
[104] A. Karatsuba and Y. Ofman, "Multiplication of Multidigit Numbers on Automata,"
Soviet Physics Doklady, vol. 7, pp. 595, 1963.
[105] H. Fan and M. A. Hasan, "A New Approach to Subquadratic Space Complexity
Parallel Multipliers for Extended Binary Fields," IEEE Trans. Computers, vol. 56,
no. 2, pp. 224-233, 2007.
[106] G. Zhou, H. Michalik, and L. Hinsenkamp, "Complexity Analysis and Efficient
Implementations of Bit Parallel Finite Field Multipliers Based on Karatsuba-Ofman
Algorithm on FPGAs," IEEE Trans. VLSI Systems, vol. 18, no. 7, pp. 1057-1066,
2010.
Curriculum Vitae
Name: Mehran Mozaffari Kermani
Post-Secondary Education and Degrees:
The University of Western Ontario
2007 - 2011, Ph.D.
The University of Western Ontario
2005 - 2007, M.E.Sc.
Tehran University
2000 - 2005, B.Sc.
Honours and Awards:
• NSERC Postdoctoral Fellowship (PDF), 2011 - 2013
• Ontario Graduate Scholarship (OGS), 2010 - 2011
• Ontario Graduate Scholarship in Science and Technology (OGSST), 2009 - 2010
• Western Graduate Research Scholarship, 2005 - present
• Western Conference Travel Grants, 2006 - 2007, 2009
• Dean of School of Engineering's highest honor award at Tehran University, 2000
Related Work Experience:
• Senior ASIC/Layout Design Engineer (May 2011 - present)
AMD, Markham, Ontario
• Graduate Research Assistant (September 2005 - present)
The University of Western Ontario, London, Ontario
• Graduate Teaching Assistant (September 2005 - present)
The University of Western Ontario, London, Ontario
• Research Assistant (May 2003 - October 2003)
Iran Telecommunication Research Center, Tehran, Iran
Publications:
Journal Articles
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "Efficient and High-Performance
Parallel Hardware Architectures for the AES-GCM," To appear in IEEE Transactions on
Computers.
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Low-Power High-Performance
Concurrent Fault Detection Approach for the Composite Field S-box and Inverse S-box,"
To appear in IEEE Transactions on Computers (special issue on Concurrent On-Line
Testing and Error/Fault Resilience of Digital Systems).
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Lightweight High-Performance
Fault Detection Scheme for the Advanced Encryption Standard Using Composite Fields,"
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 1, pp.
85-91, Jan. 2011.
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "Concurrent Structure-Independent
Fault Detection Schemes for the Advanced Encryption Standard," IEEE Transactions
on Computers, Special Issue on System Level Design of Reliable Architectures, vol. 59,
no. 5, pp. 608-622, May 2010.
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "Fault Detection Structures of the
S-boxes and the Inverse S-boxes for the Advanced Encryption Standard," Journal of
Electronic Testing, vol. 25, no. 4, pp. 225-245, Aug. 2009.
Conference Papers
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "A High-Performance Fault Diagnosis
Approach for the AES SubBytes Utilizing Mixed Bases," To appear in Proc. of FDTC
2011.
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Low-Cost S-box for the Advanced
Encryption Standard Using Normal Basis," In Proc. of EIT 2009, pp. 52-55, June 2009.
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Lightweight Concurrent Fault
Detection Scheme for the AES S-boxes Using Normal Basis," In Proc. of CHES 2008, pp.
113-129, Aug. 2008. (Blind reviewed; acceptance ratio: 25%).
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "A Structure-independent Approach
for Fault Detection Hardware Implementations of the Advanced Encryption Standard,"
In Proc. of FDTC 2007, IEEE Press, pp. 47-53, Sep. 2007.
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "Parity-based Fault Detection
Architecture of S-box for Advanced Encryption Standard," In Proc. of DFT 2006, pp.
572-580, Oct. 2006.
• M. Mozaffari Kermani and A. Reyhani-Masoleh, "Parity Prediction of S-box for AES,"
In Proc. of CCECE 2006, pp. 2357-2360, May 2006.
