A reconfigurable and scalable efficient architecture for AES by Li, Ke & University of Lethbridge. Faculty of Arts and Science
University of Lethbridge Research Repository
OPUS http://opus.uleth.ca
Theses Arts and Science, Faculty of
2008
A reconfigurable and scalable efficient
architecture for AES
Li, Ke
Lethbridge, Alta. : University of Lethbridge, Department of Mathematics and Computer Science, 2008
http://hdl.handle.net/10133/778
Downloaded from University of Lethbridge Research Repository, OPUS
A RECONFIGURABLE AND SCALABLE EFFICIENT
ARCHITECTURE FOR AES
KE LI
Bachelor of Science, University of Electronic Science and Technology of China, 2003
A Thesis
Submitted to the School of Graduate Studies
of the University of Lethbridge
in Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
Department of Mathematics and Computer Science
University of Lethbridge
LETHBRIDGE, ALBERTA, CANADA
© Ke Li, 2008
For my family, who offered me unconditional love and support throughout
the course of this thesis.
Abstract
A new 32-bit reconfigurable FPGA implementation of the AES algorithm is presented in this thesis. It employs a single round architecture to minimize the hardware cost. The combinational logic implementation of the S-Box makes the design suitable for FPGA devices without Block RAMs (BRAMs). Performing both encryption and keyschedule entirely in the composite field GF((2^4)^2) lowers the hardware complexity and makes efficient subpipelining convenient. For the first time, a subpipelined on-the-fly keyschedule over the composite field GF((2^4)^2) is applied to all standard key sizes (128-, 192-, 256-bit). The proposed architecture achieves a throughput of 805.82 Mbits/s using 523 slices, a throughput/slice ratio of 1.54 Mbps/slice, on a Xilinx Virtex2 XC2V2000 ff896 device.
Acknowledgments
I would like to express many thanks to my supervisor Dr. Hua Li, for his invaluable advice
and ideas on the research and also for his devotion of time to me during this program. His
support and expertise resolved many hurdles that I encountered throughout the research.
I am also grateful to other committee members Dr. Howard Cheng and Dr. Gongbing
Shan for their advice.
Finally, I would like to thank my parents for their support.
Contents
Approval/Signature Page
Dedication
Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 32-bit Subpipelined Architecture
1.3 Thesis Outline
2 Mathematical Background
2.1 Finite Fields
2.1.1 AES Arithmetic over Field GF(2^8)
2.2 Composite Fields
2.2.1 AES Arithmetic over Composite Field GF((2^4)^2)
3 AES Algorithm
3.1 Subbytes and Invsubbytes
3.2 Shiftrows and Invshiftrows
3.3 Mixcolumns and Invmixcolumns
3.4 Addroundkey
3.5 Keyschedule
4 Reconfigurable and Compact Architecture of the AES
4.1 32-bit Single Round Unit
4.2 Full Composite Field Encryptor and Keyschedule
4.3 Subpipelined Encryptor and Keyschedule
4.4 Double-Block Subpipelined Architecture
4.4.1 Column Fashion Shiftrows
4.4.2 Subpipelined Subbytes
4.4.3 Mixcolumns on GF((2^4)^2)
4.4.4 Subpipelined Keyschedule
5 Implementation Performance And Comparison
6 Conclusion
Bibliography
List of Tables
3.1 Key-Block-Round Combinations [20]
4.1 AES Encryption Sequence
4.2 Four Control Signals
4.3 Path Delays and Number of Slices for Spartan2E and Virtex2
4.4 Key128 Roundkey Sequence
4.5 Key192 Roundkey Sequence
4.6 Key256 Roundkey Sequence
5.1 Comparisons of BRAMs Based AES Architecture
5.2 Comparisons of Non-BRAMs Architectures
5.3 Comparisons of AES Architectures Functions
List of Figures
3.1 State array input and output
3.2 AES architecture
3.3 AES S-box
3.4 AES IS-box
3.5 AES Shiftrows
3.6 AES Invshiftrows
3.7 Pseudo Code for Key Expansion [20]
3.8 AES Keyschedule
4.1 Unfolded Architecture (a) - Single Round Unit (b) - 32-bit Single Round Unit (c)
4.2 Partial Composite Field (a) - Full Composite Field (b)
4.3 Pipelining (a) and Subpipelining (b)
4.4 AES Encryption Architecture
4.5 Column Fashion Shiftrows
4.6 Two States' Arrangement in Shiftrows Registers
4.7 Input of Shiftrows in Figure (4.6)
4.8 Subbytes in composite field GF(2^4) [34]
4.9 Pipelined Subbytes in composite field GF((2^4)^2)
4.10 GF((2^4)^2) Based Mixcolumns
4.11 Architecture of Keyschedule 128
4.12 Architecture of Keyschedule 192
4.13 Architecture of Keyschedule 256
Chapter 1
Introduction
Cryptography is of great importance in digital communication systems. The security of many applications, such as Automated Teller Machines (ATMs), e-commerce and Internet banking services, depends on various cryptographic schemes.
A symmetric-key cryptographic algorithm, the Data Encryption Standard (DES), has been the encryption standard since 1977. It has been widely used and no attack better than the brute force search has been discovered. But its 56-bit key size has been criticized since its inception. 3DES, with triple the key size of DES, offers higher security but is inefficient in software, because DES was primarily designed for hardware implementations [30].
In 2001, the National Institute of Standards and Technology (NIST) announced the approval of the Federal Information Processing Standard (FIPS) for the Advanced Encryption Standard (AES), FIPS-197 [20]. This standard specifies the Rijndael algorithm [7] as a FIPS-approved symmetric encryption algorithm that may be used by U.S. government organizations (and others) to protect sensitive information.
As a replacement for DES, AES is now widely used in both software and hardware implementations. Hardware approaches are attractive because they provide better throughput as well as higher physical security. Besides, the byte-wise arithmetic in AES is convenient for hardware approaches. There are two main categories of hardware implementations: Application-Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA). Compared with ASICs, FPGAs are becoming more and more popular because of their scalability, re-programmability and obvious advantage in time-to-market [19].
1.1 Motivation
The standard announced by NIST [20] specifies AES as a block cipher with a 128-bit block size and 128-, 192- and 256-bit key sizes. The three key sizes are specified for different security levels. The capability to handle all key sizes makes reconfigurability an important feature of AES implementations.
Numerous FPGA [5, 9, 22, 23, 34] and ASIC [2, 25, 28] implementations of AES have been presented and evaluated. To date, most implementations feature high speed and high cost, which makes them suitable for high-end applications only. The fully unrolled scheme, applied in [8, 9, 11, 17, 34], unfolds all ten rounds (128-bit key) on the device and so provides a convenient platform for pipelining, which yields high throughput with efficient use of area.
The issue of secure communication in computationally restricted environments, such as Personal Digital Assistants (PDAs), wireless devices and many other embedded devices, has become more important recently. To apply AES in these devices, the implementation must be cost efficient. The opposite approach to the fully unrolled scheme is to implement a single round unit in hardware [1, 2, 26, 27, 31]. When no further optimization effort is made, a block of data needs ten cycles (128-bit key) to go through encryption. The low area cost is obtained by sacrificing speed.
In this thesis, a compact design of AES with low hardware cost and adequate throughput is proposed and implemented in a non-BRAM FPGA.
1.2 32-bit Subpipelined Architecture
The following list summarizes the major contributions of this thesis.
• 32-bit Single Round Unit: By spreading one cycle's work over ten cycles (128-bit key), a single round unit requires approximately 1/10 of the hardware area of the fully unrolled scheme; by splitting a 128-bit data block into four words, a 32-bit single round unit theoretically costs 1/40 of the hardware area of the common 128-bit fully unrolled scheme in [8, 9, 11, 17, 34]. Nevertheless, when a 32-bit datapath is used, the shiftrows transformation cannot be implemented simply by rewiring. We use the column fashion shiftrows, which naturally cuts one round unit into four substages.
• Complete Composite Field Based AES: In a non-BRAM design, combinational logic is the approach used for subbytes, also known as the S-Box. It is the most costly transformation in AES, in both time and area. Rijmen [24] suggested an alternative approach to calculating multiplicative inverses in the S-Box. Since then, the relevant research has shown that arithmetic based on the composite field GF((2^4)^2) provides the lowest gate count and the shortest critical path for calculating the multiplicative inverse of a byte, which is the key step in the S-Box. This conversion involves an isomorphic map function before and after the inversion in each round. In [9, 11, 28, 31, 34], with a 128-bit key, each 128-bit block needs ten mappings from the finite field to the composite field and ten inverse mappings for encryption. If the key generator, which also contains S-Boxes, is included, another ten mappings and ten inverse mappings are needed. To save the overhead caused by mapping, our design converts the whole AES algorithm from GF(2^8) to GF((2^4)^2), which needs only one forward mapping before the initial round and one backward mapping after the final round. Only one forward mapping is needed for the keyschedule.
• Subpipelined On-the-fly Keyschedule and Encryptor: An on-the-fly keyschedule supports instant key changing. The previous works [1, 2, 4, 9, 12, 14, 17, 33, 26, 28, 34] applied an on-the-fly key generator, but only [1, 2, 17] integrate an on-the-fly keyschedule for all three key sizes (128-, 192-, 256-bit). These three designs employ subpipelining to optimize the throughput/area ratio. However, none of them uses it in the keyschedule. When pipelining and an on-the-fly keyschedule are both employed in an AES implementation, the keyschedule must be synchronized with the cipher because they share the same clock. The designs in [34, 26] include a subpipelined keyschedule, but they support only the 128-bit key size.
1.3 Thesis Outline
Chapter 2 introduces the mathematical background of the finite fields relevant to AES. It also presents the definition of composite fields.
Chapter 3 gives an overview of the AES standard. This thesis focuses on encryption and keyschedule; however, the complete AES standard, including decryption, is presented in that chapter.
Chapter 4 describes the proposed architecture in detail. The formulas for the non-trivial transformations in the field GF((2^4)^2) are presented. The keyschedules for the three key sizes are illustrated in figures.
Chapter 5 presents the implementation and compares the proposed architecture with previous designs.
Chapter 6 concludes the thesis.
Chapter 2
Mathematical Background
This chapter introduces the mathematical background of AES. Finite fields, also referred to as Galois fields, are the arithmetic basis of AES. The Rijndael algorithm [7] is built on the finite field GF(2^8). C. Paar [21] demonstrated that decomposing the field GF(2^8) into the composite field GF((2^4)^2) makes hardware implementations consume less area. The following sections introduce the relevant properties and definitions in the finite field GF(2^8) and the composite field GF((2^4)^2). All statements are given without proof, with references to the appropriate literature.
2.1 Finite Fields
This section introduces the definition of finite fields, followed by the basic AES mathematical representations and operations over the finite field GF(2^8). We start with the definition of a group.
Definition 2.1 [21] A set G together with a binary operation ◦ : G × G → G is called a group if the following conditions are satisfied:
• The binary operation is associative: (a ◦ b) ◦ c = a ◦ (b ◦ c), for all a, b, c ∈ G;
• There is an identity element e ∈ G such that a ◦ e = e ◦ a = a, for all a ∈ G;
• For any element a ∈ G, there exists an inverse element a′ ∈ G such that a ◦ a′ = a′ ◦ a = e.
If a group satisfies the additional condition that a ◦ b = b ◦ a, for all a, b ∈ G, the group is commutative or abelian.
Definition 2.2 [29] Let F be a set of elements on which two binary operations, called addition “+” and multiplication “·”, are defined. The set F together with the two binary operations + and · is a field if the following conditions are satisfied:
• F is a commutative group under addition +. The identity element with respect to addition is called the zero element or the additive identity of F and is denoted by 0;
• The set of nonzero elements in F is a commutative group under multiplication ·. The identity element with respect to multiplication is called the unit element or the multiplicative identity of F and is denoted by 1;
• Multiplication is distributive over addition; that is, for any three elements a, b and c in F: a · (b + c) = a · b + a · c.
Fields with a finite number of elements are called finite fields or Galois fields, denoted GF(q). Here q is the number of field elements, which is also the order of GF(q). The extension field of order q^m is denoted by GF(q^m) [21]; it can be constructed using an irreducible polynomial P(x) [29] of degree m over GF(q). The field GF(q) is a subfield of GF(q^m) [16]. Every element of GF(q^m) can be represented as a polynomial of degree at most m − 1 over GF(q), which is a residue modulo P(x). Hence P(x) determines the arithmetic operations in GF(q^m).
2.1.1 AES Arithmetic over Field GF(2^8)
AES is built on the specific finite field GF(q^m) with q = 2 and m = 8. GF(2^8) is an extension field of GF(2). We use the same notations and conventions as the AES specification in [20], except for the symbol denoting the multiplication of two elements in GF(2^8). Instead of •, we use ⊗, for consistency with the figures in the subsequent chapters. The basic unit of processing in the AES algorithm is a byte. Each 8-bit sequence of the input, output, states, cipherkey or roundkeys is treated as a single entity.
A. Three Notations of an Element
1. Binary notation: A concatenation of 8 individual bits, each with the value 0 or 1.
{a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0}
2. Polynomial notation: Because GF(2^8) is an extension field of GF(2), its elements can be represented as polynomials over GF(2) (Equation (2.1)). Bit a_i is a coefficient of the polynomial with the value 0 or 1.
$$a(x) = \sum_{i=0}^{7} a_i x^i = a_7 x^7 + a_6 x^6 + a_5 x^5 + a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0 \tag{2.1}$$
3. Hexadecimal notation: {AB}, where A denotes a_7 a_6 a_5 a_4 in hexadecimal representation and B denotes a_3 a_2 a_1 a_0 in hexadecimal representation.
For example, {01100011} (binary notation) can be represented as x^6 + x^5 + x + 1 (polynomial notation) and {63} (hexadecimal notation).
B. Addition
The addition of two elements in GF(2^8) adds their corresponding polynomial coefficients modulo 2, which is the XOR operation denoted by ⊕. For a(x), b(x) ∈ GF(2^8) (a(x) as in Equation (2.1); b(x) = b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0), it can be implemented by Equation (2.2).
$$a(x) \oplus b(x) = \sum_{i=0}^{7} a_i x^i \oplus \sum_{i=0}^{7} b_i x^i = \sum_{i=0}^{7} (a_i \oplus b_i) x^i \tag{2.2}$$
C. Multiplication
For a(x), b(x) ∈ GF(2^8), ⊗ is the multiplication operation in GF(2^8) and × is the normal polynomial multiplication.
Polynomial (2.3) is the irreducible polynomial used in AES. The multiplication of a(x) and b(x) is done by multiplying these two polynomials, followed by a modular reduction by m(x) (Equation (2.4)). The modular reduction ensures that the result is an element of GF(2^8).
$$m(x) = x^8 + x^4 + x^3 + x + 1 \tag{2.3}$$
Given q(x) ∈ GF(2^8), q(x) = q_7 x^7 + q_6 x^6 + q_5 x^5 + q_4 x^4 + q_3 x^3 + q_2 x^2 + q_1 x + q_0, we have:
$$q(x) = a(x) \otimes b(x) = (a(x) \times b(x)) \bmod m(x) \tag{2.4}$$
FIPS-197 [20] gives an efficient method for multiplication in GF(2^8). It uses multiplication by x, which is denoted as xtimes(a(x)). Given t(x) ∈ GF(2^8), t(x) = t_7 x^7 + t_6 x^6 + t_5 x^5 + t_4 x^4 + t_3 x^3 + t_2 x^2 + t_1 x + t_0, we have:
$$\begin{aligned}
t(x) &= \mathrm{xtimes}(a(x)) = (a(x) \times x) \bmod m(x) \\
t_0 &= a_7, \quad t_1 = a_0 \oplus a_7, \quad t_2 = a_1, \quad t_3 = a_2 \oplus a_7 \\
t_4 &= a_3 \oplus a_7, \quad t_5 = a_4, \quad t_6 = a_5, \quad t_7 = a_6
\end{aligned} \tag{2.5}$$
In Equation (2.5), t(x) is the result of multiplying a(x) by x in GF(2^8). It is calculated by multiplying a(x) with x, followed by the modular reduction by m(x). Based on Equation (2.5), we can use Equation (2.6) to perform the multiplication in GF(2^8) (Equation (2.4)).
$$\begin{aligned}
q(x) &= \sum_{i=0}^{7} P_i(x) \times b_i \\
P_i(x) &= \mathrm{xtimes}(P_{i-1}(x)), \quad P_0(x) = a(x)
\end{aligned} \tag{2.6}$$
In Equation (2.6), the partial products P_i(x) are computed first, followed by the addition of the corresponding coefficients. Bit b_i is a coefficient of b(x), with the value 0 or 1.
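Equations (2.5) and (2.6) can be sketched in software as follows. This is a minimal illustration rather than the hardware design of this thesis; the function names `xtimes` and `gf256_mul` and the constant `REDUCTION` are chosen here for the example only.

```python
# Software sketch of Equations (2.5) and (2.6); names are illustrative.
REDUCTION = 0x1B  # x^4 + x^3 + x + 1, the low byte of m(x)


def xtimes(a):
    """Multiply the byte a(x) by x in GF(2^8) (Equation (2.5))."""
    shifted = (a << 1) & 0xFF
    # When a7 = 1, the overflowing x^8 term is replaced by x^4 + x^3 + x + 1.
    return shifted ^ REDUCTION if a & 0x80 else shifted


def gf256_mul(a, b):
    """Multiply a(x) and b(x) in GF(2^8) (Equation (2.6))."""
    q = 0
    partial = a  # P_0(x) = a(x)
    for i in range(8):
        if (b >> i) & 1:  # b_i selects the partial product P_i(x)
            q ^= partial
        partial = xtimes(partial)  # P_{i+1}(x) = xtimes(P_i(x))
    return q
```

For instance, `gf256_mul(0x57, 0x83)` evaluates to `0xC1`, which is the worked example given in FIPS-197 [20].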
D. Multiplicative Inverses
$$\forall a \in GF(2^8) \setminus \{0\}: \quad a \otimes a^{-1} = \{1\} \tag{2.7}$$
a^{-1} is the multiplicative inverse of a in GF(2^8). A popular algorithm for inversion is the Extended Euclidean Algorithm [18], but it is not suitable for hardware implementation because of its high hardware complexity.
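As a software reference for Equation (2.7), the inverse can also be obtained by exponentiation, since a^255 = {1} for every nonzero a in GF(2^8) and therefore a^254 = a^{-1}. The sketch below uses this identity; it is neither the Extended Euclidean Algorithm nor the composite field method adopted later in this thesis, and the names `gf256_mul` and `gf256_inv` are illustrative.

```python
def gf256_mul(a, b):
    """Shift-and-add multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    q = 0
    while b:
        if b & 1:
            q ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by m(x)
        b >>= 1
    return q


def gf256_inv(a):
    """Return a^254, which equals a^{-1} for a != 0 and 0 for a = 0."""
    result = 1
    power = a
    exponent = 254
    while exponent:  # square-and-multiply
        if exponent & 1:
            result = gf256_mul(result, power)
        power = gf256_mul(power, power)
        exponent >>= 1
    return result
```

The zero element has no inverse; the sketch returns 0 for it, which matches the convention AES uses in the S-Box.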
2.2 Composite Fields
Two Galois fields of the same order are isomorphic, but they may have different hardware complexity, which depends on the representation of their field elements. Green and Taylor [10] introduced a certain type of extension field called a composite field, which can simplify the field operations in AES arithmetic.
Definition 2.4
We call two pairs
$$\{GF(2^n),\ Q(y) = y^n + \sum_{i=0}^{n-1} q_i y^i,\ q_i \in GF(2)\}$$
$$\{GF((2^n)^m),\ P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i,\ p_i \in GF(2^n)\}$$
a composite field if
• GF(2^n) is constructed from GF(2) by Q(y);
• GF((2^n)^m) is constructed from GF(2^n) by P(x).
The composite field is denoted by GF((2^n)^m). A composite field GF((2^n)^m) is isomorphic to the field GF(2^k), k = nm [15].
2.2.1 AES Arithmetic over Composite Field GF((2^4)^2)
The specific composite field used in this thesis is GF((2^4)^2), which is isomorphic to the field GF(2^8) (k = 8, n = 4, m = 2). Taking GF(2^8) as a quadratic extension of the field GF(2^4), an element a ∈ GF(2^8) is represented as a linear polynomial with coefficients in GF(2^4).
A. Notation
Wolkerstorfer et al. introduced a two-term polynomial in [32], which is the representation of GF((2^4)^2) used in this thesis.
$$a \cong a_h x + a_l, \quad a \in GF(2^8), \quad a_h, a_l \in GF(2^4) \tag{2.8}$$
The two-term polynomial a_h x + a_l is an isomorphic representation of a. Hence, all mathematical operations applied to elements of GF(2^8) can also be computed in this representation.
B. Addition
Two-term polynomials are added by adding the corresponding coefficients.
$$(a_h x + a_l) \oplus (b_h x + b_l) = (a_h \oplus b_h) x + (a_l \oplus b_l) \tag{2.9}$$
C. Multiplication
Two irreducible polynomials are needed for the two-term polynomial multiplication: n(x) (Equation (2.10)) and m(x) (Equation (2.11)).
$$n(x) = x^2 + \{1\}x + \{E\} \quad (\{E\} \text{ denotes ``1110''}) \tag{2.10}$$
$$m(x) = x^4 + x + 1 \tag{2.11}$$
Equation (2.10) is used to reduce the result to a two-term polynomial. The coefficients of n(x) are written in hexadecimal notation and are elements of GF(2^4) (Section (2.1.1)). Multiplication of two-term polynomials is denoted by ⊗. Normal polynomial multiplication is denoted by ×. Multiplying two two-term polynomials, followed by a modular reduction by n(x), is described by Equation (2.12).
$$(a_h x + a_l) \otimes (b_h x + b_l) = ((a_h x + a_l) \times (b_h x + b_l)) \bmod n(x) \tag{2.12}$$
Equation (2.11) is used to ensure that the result of a multiplication in the subfield GF(2^4) (Equation (2.13)), where a′(x), b′(x) ∈ GF(2^4), is an element of GF(2^4).
$$a'(x) \otimes b'(x) = (a'(x) \times b'(x)) \bmod m(x) \tag{2.13}$$
These two irreducible polynomials n(x) and m(x) were chosen by Wolkerstorfer et al. [32] to optimize the arithmetic.
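Equations (2.12) and (2.13) can be illustrated with a short software sketch. Expanding the product and substituting x^2 = x + {E}, which follows from n(x) in characteristic 2, gives a high coefficient a_h b_h ⊕ a_h b_l ⊕ a_l b_h and a low coefficient (a_h b_h ⊗ {E}) ⊕ a_l b_l. The representation of an element as a pair `(ah, al)` and the names `gf16_mul` and `gf16x2_mul` are assumptions made for this example.

```python
def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) modulo m(x) = x^4 + x + 1 (Equation (2.13))."""
    q = 0
    for _ in range(4):
        if b & 1:
            q ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13  # reduce by x^4 + x + 1
    return q


def gf16x2_mul(a, b):
    """Multiply two-term polynomials (ah, al) and (bh, bl) modulo n(x) (Equation (2.12))."""
    ah, al = a
    bh, bl = b
    hh = gf16_mul(ah, bh)  # coefficient of x^2 before reduction
    # x^2 is replaced by x + {E}, so hh contributes to both coefficients.
    high = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    low = gf16_mul(hh, 0xE) ^ gf16_mul(al, bl)
    return high, low
```

As a quick check, `gf16x2_mul((1, 0), (1, 0))` returns `(1, 0xE)`, that is, x ⊗ x = x + {E}.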
D. Multiplicative Inverses
Multiplying a two-term polynomial with its inverse yields the 1-element of the field GF((2^4)^2):
$$(a_h x + a_l) \otimes (a'_h x + a'_l) = \{0\}x + \{1\} \tag{2.14}$$
where a_h, a_l, a′_h, a′_l ∈ GF(2^4).
$$(a_h x + a_l)^{-1} = (a'_h x + a'_l) = (a_h \otimes d)x + (a_h \oplus a_l) \otimes d \tag{2.15}$$
where
$$d = ((a_h^2 \otimes \{E\}) \oplus (a_h \otimes a_l) \oplus a_l^2)^{-1} = ((a_h^2 \otimes \{E\}) \oplus ((a_h \oplus a_l) \otimes a_l))^{-1}$$
(⊕ is addition in GF(2^4); ⊗ is multiplication in GF(2^4)).
This multiplicative inversion equation was proposed by Wolkerstorfer et al. [32]. Rearranging this equation makes it well suited to subpipelining. We will explain this in Section (4.4.2).
Chapter 3
AES Algorithm
This chapter introduces the AES algorithm presented by NIST in 2001 [20].
The AES algorithm, also known as the Rijndael algorithm, is the encryption standard designed by two Belgian cryptographers, Joan Daemen and Vincent Rijmen [7]. AES is a symmetric-key cipher in which both the encryptor and the decryptor use the same key. It is an iterative algorithm. Each iteration is called a round. According to NIST, AES is a symmetric block cipher with a block size of 128 bits and three key sizes (128-, 192- or 256-bit). The AES parameters depend on the key size (Table (3.1)):
• Nk is the number of 32-bit words comprising the cipher key;
• Nb is the number of 32-bit words comprising a data block, which is four in the AES standard;
• Nr is the number of rounds, which is 10, 12 or 14 for AES-128, AES-192 and AES-256, respectively.
The internal operations of AES are performed on a 4×4 matrix of bytes, termed the state (Figure (3.1)). An individual byte of the state is referred to as s_{r,c} (r represents the row number and c represents the column number: 0 ≤ r < 4, 0 ≤ c < 4). A word W_i (0 ≤ i < 4) consists of the four bytes of column i.
Table 3.1: Key-Block-Round Combinations [20]
           Key Length    Block Size    Number of Rounds
           (Nk words)    (Nb words)    (Nr)
AES-128    4             4             10
AES-192    6             4             12
AES-256    8             4             14
[Figure 3.1: State array input and output. The input bytes in0, ..., in15 (plain/cipher text) fill the state column by column, so that column c of the state, word W_c, holds in(4c), ..., in(4c+3); the output bytes out0, ..., out15 (cipher/plain text) are read from the state in the same order.]
AES iterates over four transformations (inv-/subbytes, inv-/shiftrows, inv-/mixcolumns and addroundkey), applied in a different sequence in encryption and decryption. Figure (3.2) illustrates the basic architecture of AES. In the initial round (r = 0), only addroundkey is performed; the final round (r = Nr) skips inv-/mixcolumns. The keyschedule module expands the cipherkey to (Nr + 1) × 4 words of roundkeys. Each round applies a unique 128-bit roundkey in the addroundkey operation.
3.1 Subbytes and Invsubbytes
Inv-/subbytes, also called the S-Box, is the only non-linear transformation in AES.
A. Subbytes – Uses an S-Box to perform a non-linear byte-by-byte substitution of the state. The S-Box is a 16×16 matrix containing all 256 possible 8-bit values.
[Figure 3.2: AES architecture. Blocks shown for encryption: Subbytes, Shiftrows, Mixcolumns, Addroundkey; for decryption: Invmixcolumns, Addroundkey, Invsubbytes, Invshiftrows; the Keyschedule derives the roundkeys from the cipherkey; rounds r = 0, 1, 2, …, Nr, with special cases for r = 0 and r = Nr.]
Consider a byte {x_7 x_6 x_5 x_4 x_3 x_2 x_1 x_0}. The subbytes transformation has two steps:
1. {x′_7 x′_6 x′_5 x′_4 x′_3 x′_2 x′_1 x′_0} is its multiplicative inverse in GF(2^8), modulo the irreducible polynomial m(x) = x^8 + x^4 + x^3 + x + 1; the multiplicative inverse of {00000000} is defined to be itself;
2. An affine transformation over GF(2) (Equation (3.1)) is applied to the inverse obtained in the first step.
$$\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{pmatrix} =
\begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}
\begin{pmatrix} x'_0 \\ x'_1 \\ x'_2 \\ x'_3 \\ x'_4 \\ x'_5 \\ x'_6 \\ x'_7 \end{pmatrix} +
\begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{pmatrix} \tag{3.1}$$
Figure (3.3) shows the S-Box diagram:
[Figure 3.3: AES S-box. Input {x_7 x_6 x_5 x_4 x_3 x_2 x_1 x_0}, output {y_7 y_6 y_5 y_4 y_3 y_2 y_1 y_0}.]
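The two subbytes steps can be modelled in software as sketched below. The inverse is computed by exponentiation over GF(2^8), not by the composite field method developed in this thesis, and row i of the matrix in Equation (3.1) is written as the index rule y_i = x′_i ⊕ x′_(i+4) ⊕ x′_(i+5) ⊕ x′_(i+6) ⊕ x′_(i+7) ⊕ c_i (indices modulo 8, c = {63}). The function names are illustrative.

```python
def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    q = 0
    while b:
        if b & 1:
            q ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return q


def gf256_inv(a):
    """Step 1: multiplicative inverse as a^254; {00} maps to itself."""
    result, power, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf256_mul(result, power)
        power = gf256_mul(power, power)
        exponent >>= 1
    return result


def affine(x):
    """Step 2: affine transformation of Equation (3.1)."""
    y = 0
    for i in range(8):
        bit = 0x63 >> i  # constant vector, bit i
        for offset in (0, 4, 5, 6, 7):  # nonzero entries of matrix row i
            bit ^= x >> ((i + offset) % 8)
        y |= (bit & 1) << i
    return y


def sbox(x):
    """S-Box output for one byte."""
    return affine(gf256_inv(x))
```

For example, `sbox(0x53)` gives `0xED` and `sbox(0x00)` gives `0x63`.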
B. Invsubbytes – Uses an inverse S-Box (IS-Box) to perform a non-linear byte-by-byte substitution of the state.
Consider a byte {y_7 y_6 y_5 y_4 y_3 y_2 y_1 y_0}. The inverse subbytes transformation has two steps:
1. The inverse affine transformation over GF(2) (Equation (3.2)) is performed first:
$$\begin{pmatrix} x'_0 \\ x'_1 \\ x'_2 \\ x'_3 \\ x'_4 \\ x'_5 \\ x'_6 \\ x'_7 \end{pmatrix} =
\begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & 0
\end{pmatrix}
\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{pmatrix} +
\begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} \tag{3.2}$$
2. {x_7 x_6 x_5 x_4 x_3 x_2 x_1 x_0} is the multiplicative inverse of {x′_7 x′_6 x′_5 x′_4 x′_3 x′_2 x′_1 x′_0} in GF(2^8), modulo the irreducible polynomial m(x) = x^8 + x^4 + x^3 + x + 1; {00000000} is mapped onto itself.
Figure (3.4) shows the IS-Box diagram:
[Figure 3.4: AES IS-box. Input {y_7 y_6 y_5 y_4 y_3 y_2 y_1 y_0}, output {x_7 x_6 x_5 x_4 x_3 x_2 x_1 x_0}.]
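A matching software sketch of the inverse subbytes steps is given below. Row i of the matrix in Equation (3.2) is written as x′_i = y_(i+2) ⊕ y_(i+5) ⊕ y_(i+7) ⊕ d_i (indices modulo 8, d = {05}), and the inversion again uses exponentiation as a reference model. The function names are assumptions of this example.

```python
def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    q = 0
    while b:
        if b & 1:
            q ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return q


def gf256_inv(a):
    """Multiplicative inverse as a^254; {00} maps to itself."""
    result, power, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf256_mul(result, power)
        power = gf256_mul(power, power)
        exponent >>= 1
    return result


def inv_affine(y):
    """Step 1: inverse affine transformation of Equation (3.2)."""
    x = 0
    for i in range(8):
        bit = 0x05 >> i  # constant vector, bit i
        for offset in (2, 5, 7):  # nonzero entries of matrix row i
            bit ^= y >> ((i + offset) % 8)
        x |= (bit & 1) << i
    return x


def inv_sbox(y):
    """IS-Box output: inverse affine transformation, then inversion (step 2)."""
    return gf256_inv(inv_affine(y))
```

For example, `inv_sbox(0xED)` gives `0x53` and `inv_sbox(0x63)` gives `0x00`.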
3.2 Shiftrows and Invshiftrows
This transformation circularly shifts each row of the state to the left in encryption or to the right in decryption. The top row of the state is denoted as row(0) and the bottom row is denoted as row(3). The shift offset of each row equals its row number.
A. Shiftrows – Row(i) of the state is cyclically shifted left by i bytes (i = 0, 1, 2, 3). Figure (3.5) illustrates the shiftrows operation.
[Figure 3.5: AES Shiftrows. Row(i) of the state is rotated left by i bytes, for row(0) to row(3).]
B. Invshiftrows – Row(i) of the state is cyclically shifted right by i bytes (i = 0, 1, 2, 3). Figure (3.6) illustrates the invshiftrows operation.
[Figure 3.6: AES Invshiftrows. Row(i) of the state is rotated right by i bytes, for row(0) to row(3).]
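Both row shifts can be sketched in a few lines. The state is assumed here to be a list of four rows, each a list of four entries, so that `state[r][c]` corresponds to s_{r,c}; the function names are illustrative.

```python
def shift_rows(state):
    """Rotate row(i) of the state left by i bytes."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row(i) of the state right by i bytes."""
    return [row[len(row) - i:] + row[:len(row) - i] for i, row in enumerate(state)]


# Usage example with symbolic entries "s<r>,<c>".
state = [["s%d,%d" % (r, c) for c in range(4)] for r in range(4)]
shifted = shift_rows(state)
restored = inv_shift_rows(shifted)
```

After the call, row(1) of `shifted` reads s1,1 s1,2 s1,3 s1,0, and `restored` equals the original state.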
3.3 Mixcolumns and Invmixcolumns
This transformation treats each column of the state as a four-term polynomial over GF(2^8) and transforms each column into a new one by multiplying it with the constant polynomial a(x) = {03}x^3 + {01}x^2 + {01}x + {02} modulo x^4 + 1. The inverse mixcolumns operation is a multiplication of each column with b(x) = a^{-1}(x) = {0B}x^3 + {0D}x^2 + {09}x + {0E} modulo x^4 + 1.
A. Mixcolumns – Left multiplies the state with the mixcolumns matrix.
The mixcolumns transformation gives each byte of a column a new value based on all four bytes in that column. In matrix form, mixcolumns can be expressed as:
$$\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\begin{pmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\
s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\
s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3}
\end{pmatrix} =
\begin{pmatrix}
s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\
s'_{1,0} & s'_{1,1} & s'_{1,2} & s'_{1,3} \\
s'_{2,0} & s'_{2,1} & s'_{2,2} & s'_{2,3} \\
s'_{3,0} & s'_{3,1} & s'_{3,2} & s'_{3,3}
\end{pmatrix} \tag{3.3}$$
B. Invmixcolumns – Left multiplies the state with the invmixcolumns matrix.
In matrix form, invmixcolumns can be expressed as:
$$\begin{pmatrix}
0E & 0B & 0D & 09 \\
09 & 0E & 0B & 0D \\
0D & 09 & 0E & 0B \\
0B & 0D & 09 & 0E
\end{pmatrix}
\begin{pmatrix}
s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\
s'_{1,0} & s'_{1,1} & s'_{1,2} & s'_{1,3} \\
s'_{2,0} & s'_{2,1} & s'_{2,2} & s'_{2,3} \\
s'_{3,0} & s'_{3,1} & s'_{3,2} & s'_{3,3}
\end{pmatrix} =
\begin{pmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\
s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\
s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3}
\end{pmatrix} \tag{3.4}$$
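The matrix products of Equations (3.3) and (3.4) can be sketched as follows, with entries multiplied in GF(2^8) and added by XOR. This models the standard GF(2^8) form only, not the GF((2^4)^2) version used in the proposed architecture; the state layout `state[r][c]` and all names are assumptions of this example.

```python
MIX = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]

INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]


def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    q = 0
    while b:
        if b & 1:
            q ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return q


def transform_column(matrix, column):
    """Multiply a 4x4 constant matrix with one 4-byte column of the state."""
    result = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf256_mul(coeff, byte)  # addition in GF(2^8) is XOR
        result.append(acc)
    return result


def transform_state(matrix, state):
    """Apply the matrix to every column of a state given as state[r][c]."""
    columns = [transform_column(matrix, [state[r][c] for r in range(4)])
               for c in range(4)]
    return [[columns[c][r] for c in range(4)] for r in range(4)]
```

Calling `transform_column(MIX, [0xDB, 0x13, 0x53, 0x45])` yields `[0x8E, 0x4D, 0xA1, 0xBC]`, and applying `INV_MIX` to that result restores the original column.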
3.4 Addroundkey
The addroundkey transformation is a simple logical XOR of the current state with a roundkey, which is generated by the keyschedule.
Addroundkey – The state is XORed with the 128-bit roundkey (Equation (3.5)).
$$\begin{pmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\
s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\
s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3}
\end{pmatrix} \oplus \text{roundkey} =
\begin{pmatrix}
s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\
s'_{1,0} & s'_{1,1} & s'_{1,2} & s'_{1,3} \\
s'_{2,0} & s'_{2,1} & s'_{2,2} & s'_{2,3} \\
s'_{3,0} & s'_{3,1} & s'_{3,2} & s'_{3,3}
\end{pmatrix} \tag{3.5}$$
3.5 Keyschedule
Keyschedule – Derives the roundkeys from the cipherkey. It consists of two steps:
1. Key Expansion - Uses the AES Key Expansion Algorithm (Figure (3.7)) to generate 4 × (Nr + 1) words of roundkeys (W_0, W_1, ..., W_{4(Nr+1)−1}). The cipherkey is divided into Nk words, which are used as the first Nk roundkey words. The keyschedule then iterates to generate the remaining roundkey words.
2. Roundkey Selection - The first roundkey consists of the first four words, the second roundkey of the next four words, and so on. Each roundkey has 128 bits: roundkey(i) = (W_{4i}, W_{4i+1}, W_{4i+2}, W_{4i+3}).
Figure (3.8) shows the keyschedule architecture, which generates roundkeys for AES-128, AES-192 and AES-256.
• Rotword: One-byte circular left shift on a word. For example, the word (a, b, c, d) becomes (b, c, d, a).
• Subword: Uses the S-Box to perform a byte substitution on each byte of a word.
• Xorrcon: XORs the word with a round constant rcon[j], j = 1, 2, ..., Nr, where rcon[j] = (RC[j], 0, 0, 0), with RC[1] = 1 and RC[j] = 2 · RC[j−1], and with multiplication defined over the field GF(2^8).
//////////////////////////////////////////////////////////////
//Input: key[4*Nk] (Cipherkey)
//Output: w[4*(Nr+1)] (Nr+1 roundkeys)
//Nk and Nr are specified in Table (3.1)
//////////////////////////////////////////////////////////////
KeyExpansion(byte key[4*Nk], word w[4*(Nr+1)], Nk)
begin
    word temp
    i=0
    while(i<Nk)
        w[i]=word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
        i=i+1
    end while
    i=Nk
    while(i<Nb*(Nr+1))
        temp=w[i-1]
        if(i mod Nk=0)
            temp=subword(rotword(temp)) xor rcon[i/Nk]
        else if (Nk>6 and i mod Nk=4)
            temp=subword(temp)
        end if
        w[i]=w[i-Nk] xor temp
        i=i+1
    end while
end
Figure 3.7: Pseudo Code for Key Expansion [20]
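The pseudo code of Figure (3.7) can be turned into a runnable software model as sketched below. It is a reference model only and not the subpipelined on-the-fly keyschedule proposed in this thesis. The S-Box table is computed from the GF(2^8) inverse and the affine transformation, words are held as 32-bit integers with the first key byte in the most significant position, and all function names are choices made for this example.

```python
def xtimes(a):
    """Multiply a byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def gf256_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    q = 0
    while b:
        if b & 1:
            q ^= a
        a = xtimes(a)
        b >>= 1
    return q


def sbox_entry(x):
    """S-Box value: multiplicative inverse (by search), then affine transform."""
    inv = next((c for c in range(1, 256) if gf256_mul(x, c) == 1), 0)
    y = 0
    for i in range(8):
        bit = 0x63 >> i
        for offset in (0, 4, 5, 6, 7):
            bit ^= inv >> ((i + offset) % 8)
        y |= (bit & 1) << i
    return y


SBOX = [sbox_entry(x) for x in range(256)]


def rot_word(w):
    """(a, b, c, d) becomes (b, c, d, a)."""
    return ((w << 8) & 0xFFFFFFFF) | (w >> 24)


def sub_word(w):
    """Apply the S-Box to each of the four bytes of a word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def key_expansion(key):
    """Expand a 16-, 24- or 32-byte cipherkey into 4 * (Nr + 1) words."""
    nk = len(key) // 4
    nr = {4: 10, 6: 12, 8: 14}[nk]
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rc = 1  # RC[1] = 1, RC[j] = 2 * RC[j-1] in GF(2^8)
    for i in range(nk, 4 * (nr + 1)):
        temp = w[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rc << 24)
            rc = xtimes(rc)
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)
        w.append(w[i - nk] ^ temp)
    return w
```

For the AES-128 example key 2b7e151628aed2a6abf7158809cf4f3c of FIPS-197 [20], the first generated word `w[4]` is a0fafe17.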
Figure 3.8: AES Keyschedule (panels AES-128, AES-192 and AES-256; the roundkey words K0 to K31 are derived from the words W using xorrcon(subword(rotword(W))) and subword(W))
Chapter 4
Reconfigurable and Compact Architecture of the AES
In this chapter, the reconfigurable and compact AES architecture is proposed. We introduce
the contributions in detail, followed by the four transformations (shiftrows, subbytes, mix-
columns and addroundkey). The three keyschedules with different key sizes (128-, 192-,
256-bit) are explained individually.
4.1 32-bit Single Round Unit
Figure 4.1: Unfolded Architecture (a), Single Round Unit (b), 32-bit Single Round Unit (c)
The unrolled (unfolded) architecture (Figure (4.1(a))) is widely used to achieve high speed. It
processes several blocks of data during one clock cycle by implementing more than one
round unit in hardware. The more round units the architecture implements, the higher
the hardware cost. The opposite scheme, called the single round unit architecture
(Figure (4.1(b))), can be applied to reduce the hardware complexity. Instead of unfolding
all the round units on the device, it implements a single round unit, which costs approximately
1/Nr of the area of the unfolded scheme (Figure (4.1(a))) at the expense of speed.
Both Figure (4.1(a)) and Figure (4.1(b)) use a 128-bit data path. In keeping with the goal of
a compact design, we propose a 32-bit single round unit (Figure (4.1(c))). It needs four
iterations to perform a round on a 128-bit block. This 32-bit data path scheme saves about
75% of the hardware compared with the 128-bit single round unit (Figure (4.1(b))).
4.2 Full Composite Field Encryptor and Keyschedule
Figure 4.2: Partial Composite Field (a), Full Composite Field (b)
Many high-end FPGA devices possess Block RAMs (BRAMs), which are efficient for
the implementation of the S-Box. The S-Box, also referred to as subbytes, is the key part of both
the encryptor and the keyschedule modules. However, BRAM-based designs cannot be
implemented on low-cost devices that do not have BRAMs. An alternative approach
for S-Box implementation is to use combinational logic, but this method may lead to
high hardware complexity because of the mathematical operations of AES over the finite field
GF(2^8).
The key step of the S-Box is calculating the multiplicative inverse of each byte (Section (3.1)).
Since the introduction of the composite field GF((2^4)^2) based S-Box, numerous studies [9,
11, 28, 31, 34] have investigated the calculation of the multiplicative inverse over GF((2^4)^2),
instead of GF(2^8), to decrease hardware complexity (Figure (4.2(a))). In Figure (4.2), the
arithmetic in the shaded area is performed over the field GF((2^4)^2). Figure (4.2(a)) shows an
architecture that implements only the multiplicative inverse in GF((2^4)^2). The architectures in
[33, 22] extend the field GF((2^4)^2) to the affine transformation, so that all S-Box
operations are performed in GF((2^4)^2). By decomposing these operations from GF(2^8) to its
subfield GF(2^4), the hardware complexity is decreased.
As in Figure (4.2(a)), each round needs an isomorphic mapping function (MAP) before the
S-Box to convert a representation from GF(2^8) to GF((2^4)^2), and the inverse mapping
(MAP^-1) after the S-Box to convert it back. If the key size is 128 bits, the S-Box is applied
to the plaintext and the cipherkey ten times, which means that 20 MAPs and 20
MAP^-1s are needed for the encryption of 128-bit data. In [32], for every byte, MAP costs 11 XOR
gates with 2 gates in the critical path; MAP^-1 costs 15 XOR gates with 3 gates in the critical
path. MAP and MAP^-1 together account for 33.3% of the critical path and 21% of the total
gates of the subbytes transformation.
In order to reduce the cost of MAP and MAP^-1 as much as possible, we propose the
complete composite field approach (Figure (4.2(b))). The GF((2^4)^2) field covers both the
encryptor and the keyschedule. As illustrated in Figure (4.2(b)), one MAP and one MAP^-1 are
applied in encryption, and one MAP is applied to the cipherkey in the keyschedule. This is a
constant overhead which is not affected by the round count. No matter what the key size is,
the cost of mapping is the same.
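The isomorphic mapping functions between the field GF(2^8) and the field GF((2^4)^2) are
determined by the irreducible polynomials of the field GF(2^8) (Equation (2.3)) and the field
GF((2^4)^2) (Equations (2.10) and (2.11)). We use the mapping formulas in [32] to conduct the
transition of representations between GF(2^8) and GF((2^4)^2):
\[
\begin{aligned}
& a_h x + a_l = \mathrm{MAP}(a), \quad a_h, a_l \in GF(2^4),\ a \in GF(2^8) \\
& a_A = a_1 \oplus a_7, \quad a_B = a_5 \oplus a_7, \quad a_C = a_4 \oplus a_6 \\
& a_{l0} = a_C \oplus a_0 \oplus a_5, \quad a_{l1} = a_1 \oplus a_2, \quad a_{l2} = a_A, \quad a_{l3} = a_2 \oplus a_4 \\
& a_{h0} = a_C \oplus a_5, \quad a_{h1} = a_A \oplus a_C, \quad a_{h2} = a_B \oplus a_2 \oplus a_3, \quad a_{h3} = a_B
\end{aligned}
\tag{4.1}
\]
In Equation (4.1), a is an element in the field GF(2^8). MAP(a) converts a to its isomorphic
element in GF((2^4)^2), which is represented as ahx+al.
\[
\begin{aligned}
& a = \mathrm{MAP}^{-1}(a_h x + a_l), \quad a \in GF(2^8),\ a_h, a_l \in GF(2^4) \\
& a_A = a_{l1} \oplus a_{h3}, \quad a_B = a_{h0} \oplus a_{h1} \\
& a_0 = a_{l0} \oplus a_{h0}, \quad a_1 = a_B \oplus a_{h3}, \quad a_2 = a_A \oplus a_B \\
& a_3 = a_B \oplus a_{l1} \oplus a_{h2}, \quad a_4 = a_A \oplus a_B \oplus a_{l3}, \quad a_5 = a_B \oplus a_{l2} \\
& a_6 = a_A \oplus a_{l2} \oplus a_{l3} \oplus a_{h0}, \quad a_7 = a_B \oplus a_{l2} \oplus a_{h3}
\end{aligned}
\tag{4.2}
\]
In Equation (4.2), ahx+al is an element in the field GF((2^4)^2). MAP^-1(ahx+al) converts
ahx+al to its isomorphic element in GF(2^8), which is represented as a.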
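Equations (4.1) and (4.2) translate directly into bit operations. The Python sketch below is an illustration with hypothetical function names; it assumes that the composite element is packed into one byte with al in bits 0 to 3 and ah in bits 4 to 7.

```python
def _bits(v):
    """Return the 8 bits of v, least significant first."""
    return [(v >> i) & 1 for i in range(8)]


def _pack(bits):
    """Inverse of _bits."""
    return sum(b << i for i, b in enumerate(bits))


def map_to_composite(a):
    """MAP of Equation (4.1): GF(2^8) -> GF((2^4)^2)."""
    a0, a1, a2, a3, a4, a5, a6, a7 = _bits(a)
    aA, aB, aC = a1 ^ a7, a5 ^ a7, a4 ^ a6
    al = [aC ^ a0 ^ a5, a1 ^ a2, aA, a2 ^ a4]
    ah = [aC ^ a5, aA ^ aC, aB ^ a2 ^ a3, aB]
    return _pack(al + ah)


def map_from_composite(p):
    """MAP^-1 of Equation (4.2): GF((2^4)^2) -> GF(2^8)."""
    al0, al1, al2, al3, ah0, ah1, ah2, ah3 = _bits(p)
    aA, aB = al1 ^ ah3, ah0 ^ ah1
    return _pack([
        al0 ^ ah0, aB ^ ah3, aA ^ aB, aB ^ al1 ^ ah2,
        aA ^ aB ^ al3, aB ^ al2, aA ^ al2 ^ al3 ^ ah0, aB ^ al2 ^ ah3,
    ])
```

Applying map_from_composite after map_to_composite returns the original byte for all 256 inputs, and both functions are linear with respect to XOR, which is what allows addroundkey to be performed directly on MAPed data.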
4.3 Subpipelined Encryptor and Keyschedule
Figure 4.3: Pipelining (a) and Subpipelining (b)
The technique of pipelining is applied in AES designs to optimize the speed/area ratio
in [1, 2, 8, 9, 11, 17, 33, 26, 27, 31, 34]. By inserting registers between blocks of combinational
logic, multiple blocks of data are processed simultaneously. The frequency is determined by the
maximum delay between two registers: when this maximum delay
is decreased, the frequency is increased.
Figure (4.3(a)) is the fully unrolled pipelining architecture, which is built in two steps.
First, unfold the Nr round units on the device; second, insert registers between the round
units. In this case, the maximum delay is the delay of one round, which contains four
transformations.
By cutting one round unit into several substages, we can further improve the frequency.
This technique is called subpipelining [34]. Figure (4.3(b)) gives an example where registers
are placed both between and inside the round units. The frequency is determined by
the maximum delay of a substage. In this thesis, we propose a single round subpipelined
architecture, where one round unit is implemented and subpipelined into eight substages.
To generate the roundkeys, we design an on-the-fly keyschedule, which generates a 32-bit
roundkey word at each clock cycle. The encryption unit and the key expansion unit share
the same clock, so the overall frequency is determined by the
maximum delay in both units. Hence, the substage balance of the keyschedule is as important
as that of the encryptor. We propose a new subpipelined keyschedule on the composite field for all
standard key sizes. The most costly part of the keyschedule is still the S-Box. We divide it into
the same substages as in the encryptor.
4.4 Double-Block Subpipelined Architecture
An equivalent decryptor was introduced along with the AES in FIPS [20], where the same
architecture can be used for both encryption and decryption. Figure 5.7 in [30] illustrates the
equivalent inverse cipher. It makes use of the fact that the order of subbytes and shiftrows
can be exchanged, because subbytes changes the value of each byte individually while
shiftrows only rearranges their positions. The equivalent inverse cipher therefore exchanges the
order of invshiftrows and invsubbytes, and adds an extra step to conduct invmixcolumns on each
roundkey. We can also change the sequence of shiftrows and subbytes in the encryptor and
obtain the same result. In this design, we put shiftrows before subbytes.
Figure (4.4) illustrates the proposed encryption architecture. The eight 32-bit registers
(four in shiftrows, three in subbytes and one between subbytes and mixcolumns) are used
to cut one round unit into eight substages, which leads to an initial delay of eight clock cycles
before the first 32-bit ciphertext word is generated. clk_counter in Figure (4.4) is a clock
counter register generated in the keyschedule. It counts repeatedly from 0 to 8Nr + 7
(Table (4.1)) and is used to synchronize the encryptor and the keyschedule.
Table 4.1: AES Encryption Sequence

(a) Initial round
clk_counter    0          1          2          3          4          5          6          7
plaintext      PA(0)      PA(1)      PA(2)      PA(3)      PB(0)      PB(1)      PB(2)      PB(3)
cipherkey      KA(0)      KA(1)      KA(2)      KA(3)      KB(0)      KB(1)      KB(2)      KB(3)
outcome(0)     OA(0)      OA(1)      OA(2)      OA(3)      OB(0)      OB(1)      OB(2)      OB(3)

(b) Round 1
clk_counter    8          9          10         11         12         13         14         15
input(1)       OA(0)      OA(1)      OA(2)      OA(3)      OB(0)      OB(1)      OB(2)      OB(3)
roundkey(1)    KA(4)      KA(5)      KA(6)      KA(7)      KB(4)      KB(5)      KB(6)      KB(7)
outcome(1)     OA(4)      OA(5)      OA(6)      OA(7)      OB(4)      OB(5)      OB(6)      OB(7)

...

(c) Round Nr
clk_counter    8Nr        8Nr+1      8Nr+2      8Nr+3      8Nr+4      8Nr+5      8Nr+6      8Nr+7
input(Nr)      OA(4Nr-4)  OA(4Nr-3)  OA(4Nr-2)  OA(4Nr-1)  OB(4Nr-4)  OB(4Nr-3)  OB(4Nr-2)  OB(4Nr-1)
roundkey(Nr)   KA(4Nr)    KA(4Nr+1)  KA(4Nr+2)  KA(4Nr+3)  KB(4Nr)    KB(4Nr+1)  KB(4Nr+2)  KB(4Nr+3)
ciphertext     CA(0)      CA(1)      CA(2)      CA(3)      CB(0)      CB(1)      CB(2)      CB(3)
Figure 4.4: AES Encryption Architecture ((a) 32-bit data path: plaintext, MAP, Shiftrows with register1 to register4, Subbytes with register5 to register7, register8, Mixcolumns, the multiplexer mul with inputs a, b and c controlled by clk_counter, the Keyschedule with MAPed cipherkey and roundkey, MAP^-1, ciphertext; (b) register1 to register4 hold Column1 to Column4 in Shiftrows, register5 to register7 separate Subbytes(A) to Subbytes(D))
We use a double-block (block A and B) data flow to avoid the initial delay of eight clock
cycles. Table (4.1(a)) illustrates the data sequence of the initial round (round 0).
{PA(0), PA(1), PA(2), PA(3)}: 128-bit plaintext of block A
{PB(0), PB(1), PB(2), PB(3)}: 128-bit plaintext of block B
They are put into the AES unit during the first eight clock cycles and then processed alternately.
{KA(0), KA(1), KA(2), KA(3)}: cipherkey for block A
{KB(0), KB(1), KB(2), KB(3)}: cipherkey for block B
Because the encryption in the initial round involves only addroundkey, which is a simple
XOR operation, and the corresponding roundkey is the MAPed cipherkey, the operation in this
round is not delayed by registers. Hence, the outcome of the initial round (outcome(0)) is
produced from the very beginning.
{OA(0), OA(1), OA(2), OA(3)}: outcome of round 0 for block A
{OB(0), OB(1), OB(2), OB(3)}: outcome of round 0 for block B
Table (4.1(b)) is for round 1, which goes through the eight substages. At the eighth
clock cycle, OA(0) has passed through the eight substages and is XORed with the corresponding
roundkey (KA(4)) to generate the outcome (OA(4)) for block A; block B is processed in the same way.
Table (4.1(c)) is for the last round Nr.
{CA(0), CA(1), CA(2), CA(3)}: ciphertext for block A
{CB(0), CB(1), CB(2), CB(3)}: ciphertext for block B
Now we explain the 3-to-1 multiplexer (mul) controlled by clk_counter:
• Case a: In the initial round, where 0 ≤ clk_counter < 8, the 128-bit plaintext is MAPed into
GF((2^4)^2) and XORed with the corresponding roundkey in four clock cycles, 32 bits at
each clock cycle. The result is the outcome of the initial round (round 0), which is the
input of round 1;
• Case b: In normal rounds, where 8 ≤ clk_counter < Nr×8, the outcome of mixcolumns
is XORed with the corresponding roundkey to produce the outcome of this round;
• Case c: In the last round, where Nr×8 ≤ clk_counter < (Nr +1)×8, the transformation
mixcolumns is skipped. The result of subbytes is XORed with its roundkey.
Finally, the outcome of the last round goes through MAP^-1 to generate the ciphertext.
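The three cases of the multiplexer depend only on clk_counter and Nr. The following Python sketch restates the selection rule; the function name mux_case and the returned labels are hypothetical and only mirror Cases a, b and c above.

```python
def mux_case(clk_counter, nr):
    """Return which multiplexer input ('a', 'b' or 'c') is selected.

    'a': initial round, MAPed plaintext XOR roundkey
    'b': normal round, mixcolumns outcome XOR roundkey
    'c': last round, subbytes outcome XOR roundkey (mixcolumns skipped)
    """
    if not 0 <= clk_counter < 8 * (nr + 1):
        raise ValueError("clk_counter must lie in 0 .. 8*Nr + 7")
    if clk_counter < 8:
        return "a"
    if clk_counter < 8 * nr:
        return "b"
    return "c"
```

For AES-128 (Nr = 10), cycles 0 to 7 select Case a, cycles 8 to 79 select Case b, and cycles 80 to 87 select Case c.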
Table 4.2: Four Control Signals
      Counter_W0  Counter_W1  Counter_W2  Counter_W3
W0    1           0           0           0
W1    0           1           0           0
W2    0           0           1           0
W3    0           0           0           1
4.4.1 Column Fashion Shiftrows
This subsection proposes the column fashion shiftrows (Figure (4.5)). It includes 16 8-bit
registers (Row0_Col0, Row0_Col1, ... , Row3_Col3) and three 2-to-1 multiplexers (M1, M2
and M3). Both the input and the output of shiftrows are states. Each column of a state is a word
(W0, W1, W2 and W3), which consists of four bytes. Every clock cycle, shiftrows processes one
32-bit word (one column of a state), so four clock cycles are needed to produce a 128-bit state.
The first 3 clock cycles are initial clock cycles, so the first word is shifted out at the 4th clock cycle.
Figure (4.6) shows how it works in the first eight clock cycles. R00, R01, ... and R33 stand
for registers Row0_Col0, Row0_Col1, ... and Row3_Col3. Each row shows their values
at one clock cycle (clk0, clk1, ... and clk7). We explain the shaded area and the black
border in the following text.
Four counters (Counter_W0, Counter_W1, Counter_W2 and Counter_W3) control the
registers and the multiplexers. Table (4.2) shows how these signals are generated.
When the first word (W0) of a state is shifted in, Counter_W0 = 1;
When the second word (W1) of a state is shifted in, Counter_W1 = 1;
When the third word (W2) of a state is shifted in, Counter_W2 = 1;
When the fourth word (W3) of a state is shifted in, Counter_W3 = 1.
Certain registers are controlled by special enable signals (Enable_row1_col3,
Enable_row2_col23 and Enable_row3_col123); the others use the general enable signal, which
is not shown.
Figure 4.5: Column Fashion Shiftrows (the input state words w0 to w3 enter a 4×4 array of 8-bit registers Row0_Col0 to Row3_Col3; the multiplexers M1, M2 and M3 are controlled by Counter_W0 to Counter_W3; the output state words are w0 = (S0,0 S1,1 S2,2 S3,3), w1 = (S0,1 S1,2 S2,3 S3,0), w2 = (S0,2 S1,3 S2,0 S3,1), w3 = (S0,3 S1,0 S2,1 S3,2))
Figure 4.6: Two States' Arrangement in Shiftrows Registers (rows clk0 to clk7, columns R00 to R33; output for the 1st state: W0 = (A0,0 A1,1 A2,2 A3,3) at clk3, W1 = (A0,1 A1,2 A2,3 A3,0) at clk4, W2 = (A0,2 A1,3 A2,0 A3,1) at clk5, W3 = (A0,3 A1,0 A2,1 A3,2) at clk6)
Enable_row1_col3 (Enable_row1_col3 = Counter_W3) controls register
Row1_Col3. This enable signal is inactive when the clock is clk0, clk1, clk2, clk4, clk5, clk6,
etc. Row1_Col3 is not updated during these clock cycles, which corresponds to the shaded
areas of column R13 in Figure (4.6);
Enable_row2_col23 (Enable_row2_col23 = Counter_W2 ∨ Counter_W3) controls
registers Row2_Col2 and Row2_Col3. This enable signal is inactive when the clock is
clk0, clk1, clk4, clk5, etc. Row2_Col2 and Row2_Col3 are not updated during these clock
cycles, which corresponds to the shaded area of columns R22 and R23 in Figure (4.6);
Enable_row3_col123 (Enable_row3_col123 = Counter_W1 ∨ Counter_W2 ∨
Counter_W3) controls registers Row3_Col1, Row3_Col2 and Row3_Col3. This enable signal
is inactive when the clock is clk0, clk4, etc. Row3_Col1, Row3_Col2 and Row3_Col3 are
not updated during these clock cycles, which corresponds to the shaded area of columns
R31, R32 and R33 in Figure (4.6).
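The enable signals can be checked against the listed clock cycles with a small Python sketch. It assumes, as in Figure (4.6), that word W(k) of a state is shifted in at every clock cycle whose index is congruent to k modulo 4, so that the counters are one-hot; the function name enables is hypothetical.

```python
def enables(clk):
    """Return the three special enable signals at clock cycle clk.

    Assumption: word W(clk mod 4) is shifted in at cycle clk, so exactly
    one of Counter_W0 .. Counter_W3 equals 1 (Table 4.2).
    """
    counter = [int(clk % 4 == k) for k in range(4)]
    return {
        "Enable_row1_col3": counter[3],
        "Enable_row2_col23": counter[2] | counter[3],
        "Enable_row3_col123": counter[1] | counter[2] | counter[3],
    }


def inactive_cycles(signal, cycles=range(8)):
    """List the cycles in which the given enable signal is 0."""
    return [clk for clk in cycles if enables(clk)[signal] == 0]
```

Over the first eight cycles, inactive_cycles reproduces the lists given in the text for each of the three signals.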
For each input word:
In round 1, the input (a state) of shiftrows is the outcome of the initial round, i.e. the MAPed
plaintext XORed with the corresponding roundkey; otherwise it is the outcome of the previous
round. Each word of the state (W0, W1, W2 and W3)
is shifted into the first column of the registers (Row0_Col0, Row1_Col0, Row2_Col0 and
Row3_Col0) at each clock cycle.
For each output word:
• 1st byte is shifted out from Row0_Col3, which corresponds to the black border area
of column R03 in Figure (4.6);
• 2nd byte is shifted out from Row1_Col3 if Counter_W2 is active, otherwise from
Row1_Col2, which corresponds to the black border area of columns R12 and R13 in
Figure (4.6);
• 3rd byte is shifted out from Row2_Col3 if (Counter_W1 or Counter_W2) is active,
otherwise from Row2_Col1, which corresponds to the black border area of columns
R21 and R23 in Figure (4.6);
• 4th byte is shifted out from Row3_Col3 if (Counter_W0 or Counter_W1 or Counter_W2)
is active, otherwise from Row3_Col0, which corresponds to the black border area of
columns R30 and R33 in Figure (4.6).
Figure (4.6) takes two states A and B (Figure (4.7)) as the input of shiftrows. During
the first eight clock cycles, the words of states A and B are shifted into the first column of
registers (R00, R10, R20 and R30) one after another. The first three clock cycles are the initial
cycles with no output.
State A                      State B
W0   W1   W2   W3            W0   W1   W2   W3
A0,0 A0,1 A0,2 A0,3          B0,0 B0,1 B0,2 B0,3
A1,0 A1,1 A1,2 A1,3          B1,0 B1,1 B1,2 B1,3
A2,0 A2,1 A2,2 A2,3          B2,0 B2,1 B2,2 B2,3
A3,0 A3,1 A3,2 A3,3          B3,0 B3,1 B3,2 B3,3
Figure 4.7: Input of Shiftrows in Figure (4.6)
At clk3, the first column of state A is generated from registers (R03, R12, R21 and R30);
At clk4, the second column of state A is generated from registers (R03, R12, R21 and
R33);
At clk5, the third column of state A is generated from registers (R03, R12, R23 and R33);
At clk6, the fourth column of state A is generated from registers (R03, R13, R23 and R33).
The first output state is shown in the lower right corner of Figure (4.6).
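The output selection rules for the four bytes can be written as a short Python sketch and compared with the register lists given for clk3 to clk6. As before, the sketch assumes that Counter_W(k) is active at every clock cycle whose index is congruent to k modulo 4; the function name output_registers is hypothetical.

```python
def output_registers(clk):
    """Return the registers that drive the four output bytes at cycle clk.

    Assumption: Counter_W(clk mod 4) is the active counter at cycle clk.
    """
    w = clk % 4
    row0 = "R03"                                  # always Row0_Col3
    row1 = "R13" if w == 2 else "R12"             # Counter_W2 selects Row1_Col3
    row2 = "R23" if w in (1, 2) else "R21"        # Counter_W1 or Counter_W2
    row3 = "R33" if w in (0, 1, 2) else "R30"     # Counter_W0, W1 or W2
    return (row0, row1, row2, row3)
```

At clk3 the active counter is Counter_W3, so none of the multiplexer conditions holds and the first output column is read from R03, R12, R21 and R30.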
4.4.2 Subpipelined Subbytes
Figure 4.8: Subbytes in composite field GF(2^4) [34] (blocks: x^2, ×e, x^-1, multipliers ×1, ×2 and ×3, two 4-bit XORs and AFF_TRAN; the 8-bit input is split into the 4-bit halves H and L)
The key step of subbytes is the calculation of the multiplicative inverse. Figure (4.8)
illustrates the architecture of subbytes used in [34], which applies Equation (2.15). As
shown in this figure, it uses multiplication in GF(2^4) three times. In order to distinguish
the multipliers, we denote them as ×1, ×2 and ×3. It also needs one inversion (x^-1), one
constant multiplier with {E} (×e; {E} is in hexadecimal notation, which is '1110' in
binary notation), one squarer (x^2) and two 4-bit XORs (⊕). These arithmetic operations are
over the field GF(2^4).
Consider x,y,z ∈ GF(2^4), represented in binary notation as x =
{x3x2x1x0}, y = {y3y2y1y0}, z = {z3z2z1z0}. Let a, b, c, d, e and f be 1-bit values, each of
which equals 0 or 1. ⊕ stands for the XOR operation, and x0y1 means x0 ∧ y1.
The following Equations (4.3), (4.4), (4.5) and (4.6) are used to calculate squaring,
constant multiplication with {E}, multiplication and multiplicative inversion [32].
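\[
\begin{aligned}
& y = x^2 \\
& y_0 = x_0 \oplus x_2, \quad y_1 = x_2 \\
& y_2 = x_1 \oplus x_3, \quad y_3 = x_3
\end{aligned}
\tag{4.3}
\]
\[
\begin{aligned}
& y = x \times \{E\} \\
& a = x_0 \oplus x_1, \quad b = x_2 \oplus x_3 \\
& y_0 = x_1 \oplus b, \quad y_1 = a \\
& y_2 = a \oplus x_2, \quad y_3 = a \oplus b
\end{aligned}
\tag{4.4}
\]
\[
\begin{aligned}
& z = x \times y \\
& a = x_0 \oplus x_3, \quad b = x_2 \oplus x_3, \quad c = x_1 \oplus x_2 \\
& z_0 = x_0 y_0 \oplus x_3 y_1 \oplus x_2 y_2 \oplus x_1 y_3 \\
& z_1 = x_1 y_0 \oplus a y_1 \oplus b y_2 \oplus c y_3 \\
& z_2 = x_2 y_0 \oplus x_1 y_1 \oplus a y_2 \oplus b y_3 \\
& z_3 = x_3 y_0 \oplus x_2 y_1 \oplus x_1 y_2 \oplus a y_3
\end{aligned}
\tag{4.5}
\]
\[
\begin{aligned}
& y = x^{-1} \\
& a = x_1 \oplus x_2 \oplus x_3 \oplus x_1 x_2 x_3 \\
& y_0 = a \oplus x_0 \oplus x_0 x_2 \oplus x_1 x_2 \oplus x_0 x_1 x_2 \\
& y_1 = x_0 x_1 \oplus x_0 x_2 \oplus x_1 x_2 \oplus x_3 \oplus x_1 x_3 \oplus x_0 x_1 x_3 \\
& y_2 = x_0 x_1 \oplus x_2 \oplus x_0 x_2 \oplus x_3 \oplus x_0 x_3 \oplus x_0 x_2 x_3 \\
& y_3 = a \oplus x_0 x_3 \oplus x_1 x_3 \oplus x_2 x_3
\end{aligned}
\tag{4.6}
\]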
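Equations (4.3) to (4.6) can be exercised in software to confirm that they describe one consistent field. The Python sketch below is an illustration with hypothetical function names; a 4-bit element is stored as an integer whose bit i is x_i.

```python
def _b(v):
    """Return the 4 bits of v, least significant first."""
    return [(v >> i) & 1 for i in range(4)]


def _p(bits):
    return sum(b << i for i, b in enumerate(bits))


def gf16_square(x):
    """Equation (4.3)."""
    x0, x1, x2, x3 = _b(x)
    return _p([x0 ^ x2, x2, x1 ^ x3, x3])


def gf16_mul_e(x):
    """Equation (4.4): multiplication by the constant {E}."""
    x0, x1, x2, x3 = _b(x)
    a, b = x0 ^ x1, x2 ^ x3
    return _p([x1 ^ b, a, a ^ x2, a ^ b])


def gf16_mul(x, y):
    """Equation (4.5)."""
    x0, x1, x2, x3 = _b(x)
    y0, y1, y2, y3 = _b(y)
    a, b, c = x0 ^ x3, x2 ^ x3, x1 ^ x2
    return _p([
        (x0 & y0) ^ (x3 & y1) ^ (x2 & y2) ^ (x1 & y3),
        (x1 & y0) ^ (a & y1) ^ (b & y2) ^ (c & y3),
        (x2 & y0) ^ (x1 & y1) ^ (a & y2) ^ (b & y3),
        (x3 & y0) ^ (x2 & y1) ^ (x1 & y2) ^ (a & y3),
    ])


def gf16_inv(x):
    """Equation (4.6); 0 is mapped to 0."""
    x0, x1, x2, x3 = _b(x)
    a = x1 ^ x2 ^ x3 ^ (x1 & x2 & x3)
    return _p([
        a ^ x0 ^ (x0 & x2) ^ (x1 & x2) ^ (x0 & x1 & x2),
        (x0 & x1) ^ (x0 & x2) ^ (x1 & x2) ^ x3 ^ (x1 & x3) ^ (x0 & x1 & x3),
        (x0 & x1) ^ x2 ^ (x0 & x2) ^ x3 ^ (x0 & x3) ^ (x0 & x2 & x3),
        a ^ (x0 & x3) ^ (x1 & x3) ^ (x2 & x3),
    ])
```

The squarer agrees with gf16_mul(x, x), the constant multiplier agrees with gf16_mul(x, 0xE), and gf16_mul(x, gf16_inv(x)) equals 1 for every non-zero x.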
As illustrated in Figure (4.4), subbytes should be cut into four substages. The key to
an efficient subpipelining technique is to balance the delays of these substages. Previous
research [34] calculates the delay of an individual substage by counting the gates in the critical
path.
Xilinx ISE provides a synthesis tool that reports the maximum combinational delay of an
entity. A more straightforward method to achieve the optimal balance is to cut subbytes
in different manners and use this synthesis tool to measure the delay of each substage (an
entity). The arrangement whose substage delays are the most even is taken as the optimally
balanced arrangement.
Based on our experiments, Equation (4.6) is not suitable for this 4-substage subbytes.
With this equation, the substage including x^-1 yields the longest delay; hence, decreasing
this substage's delay can increase the overall frequency. We derive a new Equation (4.7)
from Equation (4.6) to reduce the delay caused by x^-1. Equation (4.7) is derived in three
steps:
1. In Equation (4.6), replacing a by its expression, we have:
\[
\begin{aligned}
y_0 &= x_0 \oplus x_1 \oplus x_2 \oplus x_3 \oplus x_0 x_2 \oplus x_1 x_2 \oplus x_0 x_1 x_2 \oplus x_1 x_2 x_3 \\
y_1 &= x_0 x_1 \oplus x_0 x_2 \oplus x_1 x_2 \oplus x_3 \oplus x_1 x_3 \oplus x_0 x_1 x_3 \\
y_2 &= x_0 x_1 \oplus x_2 \oplus x_0 x_2 \oplus x_3 \oplus x_0 x_3 \oplus x_0 x_2 x_3 \\
y_3 &= x_1 \oplus x_2 \oplus x_3 \oplus x_1 x_2 x_3 \oplus x_0 x_3 \oplus x_1 x_3 \oplus x_2 x_3
\end{aligned}
\]
2. The expressions in step 1 can be equivalently rewritten as:
\[
\begin{aligned}
y_0 &= x_1 \oplus x_2 \oplus x_1 x_2 \oplus x_0 x_2 \oplus (x_0 \oplus x_3)(1 \oplus x_1 x_2) \\
y_1 &= x_1 x_2 \oplus x_0 x_2 \oplus x_0 x_1 \oplus x_3 (1 \oplus x_1 \oplus x_0 x_1) \\
y_2 &= x_2 \oplus x_0 x_2 \oplus x_0 x_1 \oplus x_3 (1 \oplus x_0 \oplus x_0 x_2) \\
y_3 &= x_1 \oplus x_2 \oplus x_3 (1 \oplus x_0 \oplus x_1 \oplus x_2 \oplus x_1 x_2)
\end{aligned}
\]
3. Let a = x1x2, b = x0x2, c = x0x1, d = x1⊕x2, e = 1⊕a and f = b⊕c; we have:
\[
\begin{aligned}
& y = x^{-1} \\
& a = x_1 x_2, \quad b = x_0 x_2, \quad c = x_0 x_1, \quad d = x_1 \oplus x_2 \\
& e = 1 \oplus a, \quad f = b \oplus c \\
& y_0 = a \oplus b \oplus d \oplus (x_0 \oplus x_3) e \\
& y_1 = a \oplus f \oplus x_3 (x_1 \oplus 1 \oplus c) \\
& y_2 = f \oplus x_2 \oplus x_3 (b \oplus 1 \oplus x_0) \\
& y_3 = d \oplus x_3 (e \oplus x_0 \oplus d)
\end{aligned}
\tag{4.7}
\]
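Since Equation (4.7) is obtained from Equation (4.6) by algebraic rewriting only, both must produce the same output for all 16 inputs. The following Python sketch, with hypothetical function names, performs this exhaustive comparison.

```python
def _b(v):
    """Return the 4 bits of v, least significant first."""
    return [(v >> i) & 1 for i in range(4)]


def _p(bits):
    return sum(b << i for i, b in enumerate(bits))


def inv_eq_4_6(x):
    """Inversion in GF(2^4) as written in Equation (4.6)."""
    x0, x1, x2, x3 = _b(x)
    a = x1 ^ x2 ^ x3 ^ (x1 & x2 & x3)
    return _p([
        a ^ x0 ^ (x0 & x2) ^ (x1 & x2) ^ (x0 & x1 & x2),
        (x0 & x1) ^ (x0 & x2) ^ (x1 & x2) ^ x3 ^ (x1 & x3) ^ (x0 & x1 & x3),
        (x0 & x1) ^ x2 ^ (x0 & x2) ^ x3 ^ (x0 & x3) ^ (x0 & x2 & x3),
        a ^ (x0 & x3) ^ (x1 & x3) ^ (x2 & x3),
    ])


def inv_eq_4_7(x):
    """Inversion in GF(2^4) as rewritten in Equation (4.7)."""
    x0, x1, x2, x3 = _b(x)
    a, b, c, d = x1 & x2, x0 & x2, x0 & x1, x1 ^ x2
    e, f = 1 ^ a, b ^ c
    return _p([
        a ^ b ^ d ^ ((x0 ^ x3) & e),
        a ^ f ^ (x3 & (x1 ^ 1 ^ c)),
        f ^ x2 ^ (x3 & (b ^ 1 ^ x0)),
        d ^ (x3 & (e ^ x0 ^ d)),
    ])
```

The rewritten form shares the products a, b and c between the four output bits, which is the source of the shorter path reported above.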
According to Equation (4.7), we design the circuit in Figure (4.9) to perform x^-1 over
GF(2^4).
Besides multiplicative inversion, the other expensive operations in Figure (4.8) are the three
multiplications (×1, ×2 and ×3). In order to decrease the maximum delay caused by
multiplication, we separate each multiplication into two steps and put the steps in different
substages. The register between the two substages stores the result of the first step of the
multiplication and passes it to the second step. We decompose these three multipliers in two
different manners (AB-type and MN-type) to achieve the best balance.
AB-type: Equation (4.8) is derived from Equation (4.5). p0, p1, ..., p15 are 1-bit values,
each of which represents one AND term in Equation (4.5). Step A calculates the values of all
the terms; Step B XORs every four values to generate z0, z1, z2 and z3. A
register is inserted between Step A and Step B to store p0, p1, ..., p15. ×1 in Figure
(4.8) is separated in this way, as ×1A and ×1B in Figure (4.9);
\[
\begin{aligned}
& z = x \times y \quad (\text{AB-type}) \\
& \text{Step A:} \\
& a = x_0 \oplus x_3, \quad b = x_2 \oplus x_3, \quad c = x_1 \oplus x_2 \\
& p_0 = x_0 y_0, \quad p_1 = x_3 y_1, \quad p_2 = x_2 y_2, \quad p_3 = x_1 y_3 \\
& p_4 = x_1 y_0, \quad p_5 = a y_1, \quad p_6 = b y_2, \quad p_7 = c y_3 \\
& p_8 = x_2 y_0, \quad p_9 = x_1 y_1, \quad p_{10} = a y_2, \quad p_{11} = b y_3 \\
& p_{12} = x_3 y_0, \quad p_{13} = x_2 y_1, \quad p_{14} = x_1 y_2, \quad p_{15} = a y_3 \\
& \text{Step B:} \\
& z_0 = p_0 \oplus p_1 \oplus p_2 \oplus p_3 \\
& z_1 = p_4 \oplus p_5 \oplus p_6 \oplus p_7 \\
& z_2 = p_8 \oplus p_9 \oplus p_{10} \oplus p_{11} \\
& z_3 = p_{12} \oplus p_{13} \oplus p_{14} \oplus p_{15}
\end{aligned}
\tag{4.8}
\]
MN-type: Equation (4.9) is also derived from Equation (4.5). Step M creates the values of
a, b and c; Step N finishes the rest of Equation (4.5). A register is inserted between
Step M and Step N to store a, b and c. ×2 and ×3 in Figure (4.8) are separated in this
way, as ×2M and ×2N, ×3M and ×3N in Figure (4.9).
\[
\begin{aligned}
& z = x \times y \quad (\text{MN-type}) \\
& \text{Step M:} \\
& a = x_0 \oplus x_3, \quad b = x_2 \oplus x_3, \quad c = x_1 \oplus x_2 \\
& \text{Step N:} \\
& z_0 = x_0 y_0 \oplus x_3 y_1 \oplus x_2 y_2 \oplus x_1 y_3 \\
& z_1 = x_1 y_0 \oplus a y_1 \oplus b y_2 \oplus c y_3 \\
& z_2 = x_2 y_0 \oplus x_1 y_1 \oplus a y_2 \oplus b y_3 \\
& z_3 = x_3 y_0 \oplus x_2 y_1 \oplus x_1 y_2 \oplus a y_3
\end{aligned}
\tag{4.9}
\]
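Both decompositions compute the same product; they differ only in which intermediate values are held in the register between the two steps. The Python sketch below models each step as a function and compares the results with a plain polynomial multiplication. The reference routine assumes the reduction polynomial x^4 + x + 1, which is the polynomial that Equation (4.5) is consistent with; all function names are hypothetical.

```python
def _b(v):
    """Return the 4 bits of v, least significant first."""
    return [(v >> i) & 1 for i in range(4)]


def _p(bits):
    return sum(b << i for i, b in enumerate(bits))


def mul_step_a(x, y):
    """AB-type Step A: the 16 AND terms p0..p15 (stored in a register)."""
    x0, x1, x2, x3 = _b(x)
    y0, y1, y2, y3 = _b(y)
    a, b, c = x0 ^ x3, x2 ^ x3, x1 ^ x2
    return [x0 & y0, x3 & y1, x2 & y2, x1 & y3,
            x1 & y0, a & y1, b & y2, c & y3,
            x2 & y0, x1 & y1, a & y2, b & y3,
            x3 & y0, x2 & y1, x1 & y2, a & y3]


def mul_step_b(p):
    """AB-type Step B: XOR every four terms to obtain z0..z3."""
    return _p([p[4 * i] ^ p[4 * i + 1] ^ p[4 * i + 2] ^ p[4 * i + 3]
               for i in range(4)])


def mul_step_m(x):
    """MN-type Step M: only a, b and c are computed and registered."""
    x0, x1, x2, x3 = _b(x)
    return (x0 ^ x3, x2 ^ x3, x1 ^ x2)


def mul_step_n(x, y, abc):
    """MN-type Step N: the rest of Equation (4.5)."""
    x0, x1, x2, x3 = _b(x)
    y0, y1, y2, y3 = _b(y)
    a, b, c = abc
    return _p([(x0 & y0) ^ (x3 & y1) ^ (x2 & y2) ^ (x1 & y3),
               (x1 & y0) ^ (a & y1) ^ (b & y2) ^ (c & y3),
               (x2 & y0) ^ (x1 & y1) ^ (a & y2) ^ (b & y3),
               (x3 & y0) ^ (x2 & y1) ^ (x1 & y2) ^ (a & y3)])


def gf16_mul_ref(x, y):
    """Reference: carry-less multiply, then reduce modulo x^4 + x + 1."""
    r = 0
    for i in range(4):
        if (y >> i) & 1:
            r ^= x << i
    for i in (6, 5, 4):
        if (r >> i) & 1:
            r ^= 0b10011 << (i - 4)
    return r
```

The AB-type split registers 16 bits between the steps, whereas the MN-type split registers only 3 bits in addition to the operands, so the two types trade register count against the delay left for the second step.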
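The last operation in subbytes is the affine transformation. We derive Equation (4.16)
to perform the affine transformation, based on Equation (3.1), Equation (4.1) and Equation (4.2).
First, we change the format of Equation (4.1) and Equation (4.2).
Consider p ∈ GF((2^4)^2), q ∈ GF(2^8):
p = {p7p6p5p4p3p2p1p0}
q = {q7q6q5q4q3q2q1q0}
For Equation (4.1):
1. In the expressions of al0,...,ah3, replace aA, aB and aC by their expressions:
\[
\begin{aligned}
a_{l0} &= a_4 \oplus a_6 \oplus a_0 \oplus a_5 \\
a_{l1} &= a_1 \oplus a_2 \\
a_{l2} &= a_1 \oplus a_7 \\
a_{l3} &= a_2 \oplus a_4 \\
a_{h0} &= a_4 \oplus a_6 \oplus a_5 \\
a_{h1} &= a_1 \oplus a_7 \oplus a_4 \oplus a_6 \\
a_{h2} &= a_5 \oplus a_7 \oplus a_2 \oplus a_3 \\
a_{h3} &= a_5 \oplus a_7
\end{aligned}
\]
2. Let p replace ahx+al and q replace a; we have Equation (4.10):
\[
\begin{aligned}
& p = \mathrm{MAP}(q), \quad p \in GF((2^4)^2),\ q \in GF(2^8) \\
& p_0 = q_0 \oplus q_4 \oplus q_5 \oplus q_6 \\
& p_1 = q_1 \oplus q_2 \\
& p_2 = q_1 \oplus q_7 \\
& p_3 = q_2 \oplus q_4 \\
& p_4 = q_4 \oplus q_5 \oplus q_6 \\
& p_5 = q_1 \oplus q_4 \oplus q_6 \oplus q_7 \\
& p_6 = q_2 \oplus q_3 \oplus q_5 \oplus q_7 \\
& p_7 = q_5 \oplus q_7
\end{aligned}
\tag{4.10}
\]
The same steps apply to Equation (4.2):
1. In the expressions of a0,...,a7, replace aA and aB by their expressions:
\[
\begin{aligned}
a_0 &= a_{l0} \oplus a_{h0} \\
a_1 &= a_{h0} \oplus a_{h1} \oplus a_{h3} \\
a_2 &= a_{l1} \oplus a_{h3} \oplus a_{h0} \oplus a_{h1} \\
a_3 &= a_{h0} \oplus a_{h1} \oplus a_{l1} \oplus a_{h2} \\
a_4 &= a_{l1} \oplus a_{h3} \oplus a_{h0} \oplus a_{h1} \oplus a_{l3} \\
a_5 &= a_{h0} \oplus a_{h1} \oplus a_{l2} \\
a_6 &= a_{l1} \oplus a_{h3} \oplus a_{l2} \oplus a_{l3} \oplus a_{h0} \\
a_7 &= a_{h0} \oplus a_{h1} \oplus a_{l2} \oplus a_{h3}
\end{aligned}
\]
2. Let q replace a and p replace ahx+al; we have Equation (4.11):
\[
\begin{aligned}
& q = \mathrm{MAP}^{-1}(p), \quad q \in GF(2^8),\ p \in GF((2^4)^2) \\
& q_0 = p_0 \oplus p_4 \\
& q_1 = p_4 \oplus p_5 \oplus p_7 \\
& q_2 = p_1 \oplus p_4 \oplus p_5 \oplus p_7 \\
& q_3 = p_1 \oplus p_4 \oplus p_5 \oplus p_6 \\
& q_4 = p_1 \oplus p_3 \oplus p_4 \oplus p_5 \oplus p_7 \\
& q_5 = p_2 \oplus p_4 \oplus p_5 \\
& q_6 = p_1 \oplus p_2 \oplus p_3 \oplus p_4 \oplus p_7 \\
& q_7 = p_2 \oplus p_4 \oplus p_5 \oplus p_7
\end{aligned}
\tag{4.11}
\]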
Now we use Equation (3.1), Equation (4.10) and Equation (4.11) to derive Equation
(4.16).
Let x′, y be elements in GF(2^8):
x′ = {x′7x′6x′5x′4x′3x′2x′1x′0}
y = {y7y6y5y4y3y2y1y0}
According to Equation (3.1), we have:
\[
\begin{aligned}
y_0 &= x'_0 \oplus x'_4 \oplus x'_5 \oplus x'_6 \oplus x'_7 \oplus 1 \\
y_1 &= x'_0 \oplus x'_1 \oplus x'_5 \oplus x'_6 \oplus x'_7 \oplus 1 \\
y_2 &= x'_0 \oplus x'_1 \oplus x'_2 \oplus x'_6 \oplus x'_7 \\
y_3 &= x'_0 \oplus x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_7 \\
y_4 &= x'_0 \oplus x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_4 \\
y_5 &= x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_4 \oplus x'_5 \oplus 1 \\
y_6 &= x'_2 \oplus x'_3 \oplus x'_4 \oplus x'_5 \oplus x'_6 \oplus 1 \\
y_7 &= x'_3 \oplus x'_4 \oplus x'_5 \oplus x'_6 \oplus x'_7
\end{aligned}
\tag{4.12}
\]
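In the following, we convert the result y to the field GF((2^4)^2), and use the GF((2^4)^2)
format to represent x′. Thus, we can derive the affine transformation in GF((2^4)^2).
1. Let w represent y in GF((2^4)^2) (w is an element in GF((2^4)^2)). According to
Equation (4.10) (MAP from GF(2^8) to GF((2^4)^2)):
\[
\begin{aligned}
w_0 &= y_0 \oplus y_4 \oplus y_5 \oplus y_6 \\
w_1 &= y_1 \oplus y_2 \\
w_2 &= y_1 \oplus y_7 \\
w_3 &= y_2 \oplus y_4 \\
w_4 &= y_4 \oplus y_5 \oplus y_6 \\
w_5 &= y_1 \oplus y_4 \oplus y_6 \oplus y_7 \\
w_6 &= y_2 \oplus y_3 \oplus y_5 \oplus y_7 \\
w_7 &= y_5 \oplus y_7
\end{aligned}
\tag{4.13}
\]
2. Next, we use the GF((2^4)^2) format to represent x′ in Equation (4.12). Let z be the
GF((2^4)^2) format of x′. From Equation (4.11), we have:
\[
\begin{aligned}
x'_0 &= z_0 \oplus z_4 \\
x'_1 &= z_4 \oplus z_5 \oplus z_7 \\
x'_2 &= z_1 \oplus z_4 \oplus z_5 \oplus z_7 \\
x'_3 &= z_1 \oplus z_4 \oplus z_5 \oplus z_6 \\
x'_4 &= z_1 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_7 \\
x'_5 &= z_2 \oplus z_4 \oplus z_5 \\
x'_6 &= z_1 \oplus z_2 \oplus z_3 \oplus z_4 \oplus z_7 \\
x'_7 &= z_2 \oplus z_4 \oplus z_5 \oplus z_7
\end{aligned}
\tag{4.14}
\]
3. Now, we replace y with its GF((2^4)^2) format (w), and replace x′ with its GF((2^4)^2)
format (z). In the following, \overline{z} denotes z ⊕ 1:
\[
\begin{aligned}
w_0 &= y_0 \oplus y_4 \oplus y_5 \oplus y_6 \\
&= (x'_0 \oplus x'_4 \oplus x'_5 \oplus x'_6 \oplus x'_7 \oplus 1) \oplus (x'_0 \oplus x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_4) \\
&\quad \oplus (x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_4 \oplus x'_5 \oplus 1) \oplus (x'_2 \oplus x'_3 \oplus x'_4 \oplus x'_5 \oplus x'_6 \oplus 1) && \text{(by Equation (4.12))} \\
&= x'_2 \oplus x'_3 \oplus x'_5 \oplus x'_7 \oplus 1 \\
&= (z_1 \oplus z_4 \oplus z_5 \oplus z_7) \oplus (z_1 \oplus z_4 \oplus z_5 \oplus z_6) \oplus (z_2 \oplus z_4 \oplus z_5) \oplus (z_2 \oplus z_4 \oplus z_5 \oplus z_7) \oplus 1 && \text{(by Equation (4.14))} \\
&= z_6 \oplus 1 = \overline{z_6}
\end{aligned}
\]
In the same way, we can get:
\[
\begin{aligned}
w_0 &= \overline{z_6} \\
w_1 &= z_1 \oplus z_2 \oplus z_7 \oplus 1 \\
w_2 &= z_0 \oplus z_3 \oplus z_5 \oplus z_6 \oplus 1 \\
w_3 &= z_1 \oplus z_5 \oplus z_6 \oplus z_7 \\
w_4 &= z_0 \oplus z_2 \oplus z_4 \oplus z_5 \oplus z_6 \oplus z_7 \\
w_5 &= z_1 \oplus z_5 \oplus z_6 \\
w_6 &= z_2 \oplus z_6 \oplus z_7 \oplus 1 \\
w_7 &= z_3 \oplus z_5 \oplus 1
\end{aligned}
\tag{4.15}
\]
4. Finally, for consistency with the other equations in this thesis, we replace w by y and z
by x (x,y ∈ GF((2^4)^2)). Let a = x5⊕x6⊕x7; we have:
\[
\begin{aligned}
& y = \mathrm{AFF\_TRAN}(x) \\
& a = x_5 \oplus x_6 \oplus x_7 \\
& y_0 = \overline{x_6}, \quad y_1 = x_1 \oplus x_2 \oplus x_7 \oplus 1 \\
& y_2 = x_0 \oplus x_3 \oplus x_5 \oplus x_6 \oplus 1, \quad y_3 = x_1 \oplus a \\
& y_4 = x_0 \oplus x_2 \oplus x_4 \oplus a, \quad y_5 = x_1 \oplus x_5 \oplus x_6 \\
& y_6 = x_2 \oplus x_6 \oplus x_7 \oplus 1, \quad y_7 = x_3 \oplus x_5 \oplus 1
\end{aligned}
\tag{4.16}
\]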
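The derivation can be checked numerically: mapping the output of the GF(2^8) affine transformation must give the same result as applying AFF_TRAN to the mapped input. The Python sketch below does this for all 256 bytes. It is an illustration with hypothetical names; each transformation is stored as a list of input bit indices per output bit, and the constant of the composite field version is taken to be MAP({63}) = {C7}, i.e. a complement on y0, y1, y2, y6 and y7, which follows from the four ⊕1 terms of Equation (4.12) and the mapping of Equation (4.10).

```python
# Output bit i is the XOR of the listed input bits, then XORed with the constant.
MAP_ROWS = [(0, 4, 5, 6), (1, 2), (1, 7), (2, 4),
            (4, 5, 6), (1, 4, 6, 7), (2, 3, 5, 7), (5, 7)]            # Eq. (4.10)
AFFINE_ROWS = [(0, 4, 5, 6, 7), (0, 1, 5, 6, 7), (0, 1, 2, 6, 7), (0, 1, 2, 3, 7),
               (0, 1, 2, 3, 4), (1, 2, 3, 4, 5), (2, 3, 4, 5, 6), (3, 4, 5, 6, 7)]  # Eq. (4.12)
AFF_TRAN_ROWS = [(6,), (1, 2, 7), (0, 3, 5, 6), (1, 5, 6, 7),
                 (0, 2, 4, 5, 6, 7), (1, 5, 6), (2, 6, 7), (3, 5)]    # Eq. (4.16)


def xor_rows(rows, x, const=0):
    """Apply a GF(2)-linear map given as bit-index rows, then add a constant."""
    y = 0
    for i, row in enumerate(rows):
        bit = 0
        for j in row:
            bit ^= (x >> j) & 1
        y |= bit << i
    return y ^ const


def map_to_composite(q):
    return xor_rows(MAP_ROWS, q)


def affine_gf256(x):
    """Affine transformation of the S-Box in GF(2^8), constant {63}."""
    return xor_rows(AFFINE_ROWS, x, 0x63)


def aff_tran(x):
    """Affine transformation on GF((2^4)^2) data, assumed constant {C7}."""
    return xor_rows(AFF_TRAN_ROWS, x, 0xC7)
```

The check aff_tran(map_to_composite(x)) == map_to_composite(affine_gf256(x)) holds for every byte x, so no MAP^-1 is needed before the affine step.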
Figure (4.9) describes the proposed subpipelined architecture of subbytes in GF((2^4)^2).
The following list gives the equation for each arithmetic block in Figure (4.9),
except ⊕, which is a simple 4-bit XOR operation. The dashed lines in Figure (4.9)
stand for the registers.
x^2: Equation (4.3) [32]
×e: Equation (4.4) [32]
×1A: Equation (4.8) Step A
×1B: Equation (4.8) Step B
×2M and ×3M: Equation (4.9) Step M
×2N and ×3N: Equation (4.9) Step N
x^-1: Equation (4.7)
AFF_TRAN: Equation (4.16)
Figure 4.9: Pipelined Subbytes in composite field GF((2^4)^2) (substages I, II, III and IV; blocks x^2, ×e, x^-1, ×1A, ×1B, ×2M, ×2N, ×3M, ×3N and AFF_TRAN)
Table (4.3) shows the time (ns) and area (slices) cost of each substage (I, II, III, IV in
Figure (4.9)) when it runs on different FPGA devices. We cut an AES round unit into 8
substages with the maximum delay determined by substage II in subbytes.
Table 4.3: Path Delays and Number of Slices for Spartan2E and Virtex2
Delay (ns) : Slices    I              II             III            IV
Spartan2E              10.955 : 69    11.083 : 27    10.225 : 55    10.025 : 18
Virtex2                7.052 : 69     7.752 : 27     6.925 : 55     6.677 : 18
4.4.3 Mixcolumns on GF((24)2)
Mixcolumns is another transformation which involves mathematical operations on GF((2^4)^2).
We derive the equations to perform mixcolumns in the composite field in this subsection.
Subsection (3.3) describes mixcolumns in the finite field GF(2^8). Since GF((2^4)^2) is
isomorphic to GF(2^8), and in GF((2^4)^2) {02} is mapped to {26}, {03} is mapped to
{27} and {01} is still {01}, Equation (3.3) can be mapped directly to Equation (4.17).
\[
\begin{bmatrix}
26 & 27 & 01 & 01 \\
01 & 26 & 27 & 01 \\
01 & 01 & 26 & 27 \\
27 & 01 & 01 & 26
\end{bmatrix}
\begin{bmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\
s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\
s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3}
\end{bmatrix}
=
\begin{bmatrix}
s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\
s'_{1,0} & s'_{1,1} & s'_{1,2} & s'_{1,3} \\
s'_{2,0} & s'_{2,1} & s'_{2,2} & s'_{2,3} \\
s'_{3,0} & s'_{3,1} & s'_{3,2} & s'_{3,3}
\end{bmatrix}
\tag{4.17}
\]
Observing that in GF((2^4)^2), {27} = {26}⊕{01}, Equation (4.17) is equivalent to Equation
(4.18), where j = 0,1,2,3:
\[
\begin{aligned}
S'_{0,j} &= \{26\} \times (S_{0,j} \oplus S_{1,j}) \oplus S_{1,j} \oplus S_{2,j} \oplus S_{3,j} \\
S'_{1,j} &= \{26\} \times (S_{1,j} \oplus S_{2,j}) \oplus S_{0,j} \oplus S_{2,j} \oplus S_{3,j} \\
S'_{2,j} &= \{26\} \times (S_{2,j} \oplus S_{3,j}) \oplus S_{0,j} \oplus S_{1,j} \oplus S_{3,j} \\
S'_{3,j} &= \{26\} \times (S_{0,j} \oplus S_{3,j}) \oplus S_{0,j} \oplus S_{1,j} \oplus S_{2,j}
\end{aligned}
\tag{4.18}
\]
Equation (4.18) presents the mixcolumns transformation of one column of a state. We
implement the mixcolumns transformation with the structure in Figure (4.10).
Figure 4.10: GF((2^4)^2) Based Mixcolumns (four ×26 blocks; inputs S0,j, S1,j, S2,j and S3,j; outputs S'0,j, S'1,j, S'2,j and S'3,j)
In the following, we derive Equation (4.22) to calculate x×{26} in GF((2^4)^2). That is,
we represent the result of x×{02} in GF((2^4)^2):
1. Let x, y ∈ GF(2^8). Using Equation (2.5) to calculate y = x×{02}, we have:
\[
\begin{aligned}
& y_0 = x_7, \quad y_1 = x_0 \oplus x_7, \quad y_2 = x_1 \\
& y_3 = x_2 \oplus x_7, \quad y_4 = x_3 \oplus x_7, \quad y_5 = x_4 \\
& y_6 = x_5, \quad y_7 = x_6
\end{aligned}
\tag{4.19}
\]
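2. Convert y to a field element in GF((2^4)^2). Let w represent y in GF((2^4)^2) (w is
an element in GF((2^4)^2)). We have the same equation as Equation (4.13).
3. Next, we use the GF((2^4)^2) format to represent x. Let z be the GF((2^4)^2) format of x
(z is an element in GF((2^4)^2)). By Equation (4.11), we have:
\[
\begin{aligned}
x_0 &= z_0 \oplus z_4 \\
x_1 &= z_4 \oplus z_5 \oplus z_7 \\
x_2 &= z_1 \oplus z_4 \oplus z_5 \oplus z_7 \\
x_3 &= z_1 \oplus z_4 \oplus z_5 \oplus z_6 \\
x_4 &= z_1 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_7 \\
x_5 &= z_2 \oplus z_4 \oplus z_5 \\
x_6 &= z_1 \oplus z_2 \oplus z_3 \oplus z_4 \oplus z_7 \\
x_7 &= z_2 \oplus z_4 \oplus z_5 \oplus z_7
\end{aligned}
\tag{4.20}
\]
4. Replacing x and y with their corresponding GF((2^4)^2) formats, z and w, we have:
\[
\begin{aligned}
w_0 &= y_0 \oplus y_4 \oplus y_5 \oplus y_6 && \text{(by Equation (4.13))} \\
&= (x_7) \oplus (x_3 \oplus x_7) \oplus (x_4) \oplus (x_5) && \text{(by Equation (4.19))} \\
&= x_3 \oplus x_4 \oplus x_5 \\
&= (z_1 \oplus z_4 \oplus z_5 \oplus z_6) \oplus (z_1 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_7) \oplus (z_2 \oplus z_4 \oplus z_5) && \text{(by Equation (4.20))} \\
&= z_2 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_6 \oplus z_7
\end{aligned}
\]
By the same method, we derive:
\[
\begin{aligned}
w_0 &= z_2 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_6 \oplus z_7 \\
w_1 &= z_0 \oplus z_2 \oplus z_4 \\
w_2 &= z_0 \oplus z_1 \oplus z_3 \oplus z_4 \oplus z_5 \\
w_3 &= z_1 \oplus z_2 \oplus z_4 \oplus z_5 \oplus z_6 \\
w_4 &= z_3 \oplus z_6 \\
w_5 &= z_0 \oplus z_3 \oplus z_6 \oplus z_7 \\
w_6 &= z_1 \oplus z_4 \oplus z_7 \\
w_7 &= z_2 \oplus z_5
\end{aligned}
\tag{4.21}
\]
5. To be consistent, we replace z with x, and replace w with y (x,y ∈ GF((2^4)^2)). In
addition, in order to calculate the mixcolumns operations efficiently, we store the
intermediate results. Let a = x2⊕x4, b = x3⊕x6⊕x7 and c = x1⊕x5; we have:
\[
\begin{aligned}
& y = x \otimes \{26\}, \quad x, y \in GF((2^4)^2) \\
& a = x_2 \oplus x_4, \quad b = x_3 \oplus x_6 \oplus x_7, \quad c = x_1 \oplus x_5 \\
& y_0 = a \oplus b \oplus x_5, \quad y_1 = a \oplus x_0 \\
& y_2 = c \oplus x_0 \oplus x_3 \oplus x_4, \quad y_3 = c \oplus a \oplus x_6 \\
& y_4 = x_3 \oplus x_6, \quad y_5 = b \oplus x_0 \\
& y_6 = x_1 \oplus x_4 \oplus x_7, \quad y_7 = x_2 \oplus x_5
\end{aligned}
\tag{4.22}
\]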
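Equations (4.18) and (4.22) can be checked against the usual GF(2^8) mixcolumns in software. The Python sketch below is an illustration with hypothetical names: it encodes MAP (Equation (4.10)) and the ×{26} multiplier (Equation (4.22)) as lists of input bit indices per output bit, and uses one column routine for both fields by passing in the doubling function.

```python
MAP_ROWS = [(0, 4, 5, 6), (1, 2), (1, 7), (2, 4),
            (4, 5, 6), (1, 4, 6, 7), (2, 3, 5, 7), (5, 7)]          # Eq. (4.10)
MUL26_ROWS = [(2, 3, 4, 5, 6, 7), (0, 2, 4), (0, 1, 3, 4, 5), (1, 2, 4, 5, 6),
              (3, 6), (0, 3, 6, 7), (1, 4, 7), (2, 5)]              # Eq. (4.22)


def xor_rows(rows, x):
    """Output bit i is the XOR of the input bits listed in rows[i]."""
    y = 0
    for i, row in enumerate(rows):
        bit = 0
        for j in row:
            bit ^= (x >> j) & 1
        y |= bit << i
    return y


def map_to_composite(q):
    return xor_rows(MAP_ROWS, q)


def mul26(x):
    """Multiply by {26} in GF((2^4)^2), Equation (4.22)."""
    return xor_rows(MUL26_ROWS, x)


def xtime(b):
    """Multiply by {02} in GF(2^8), Equation (4.19)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def mixcolumn(col, times2):
    """Equation (4.18) for one column; times2 is xtime or mul26."""
    s0, s1, s2, s3 = col
    return [
        times2(s0 ^ s1) ^ s1 ^ s2 ^ s3,
        times2(s1 ^ s2) ^ s0 ^ s2 ^ s3,
        times2(s2 ^ s3) ^ s0 ^ s1 ^ s3,
        times2(s0 ^ s3) ^ s0 ^ s1 ^ s2,
    ]
```

For the commonly used test column (db, 13, 53, 45), the GF(2^8) routine yields (8e, 4d, a1, bc), and the composite field routine applied to the MAPed column yields the MAPed version of the same result.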
This mixcolumns architecture (Figure (4.10)) is 32-bit parallel combinational logic.
When synthesized on a Virtex2 XC2V2000, it costs 28 1-bit 2-input XOR gates, 44 1-bit
3-input XOR gates, four 1-bit 4-input XOR gates and four 8-bit 2-input XOR gates. The
maximum combinational path delay is 7.922ns.
In the above sections, we have designed all the modules of AES over the field GF((2^4)^2). In
this way, each byte of the data needs only one MAP before the initial round and one inverse
MAP after the last round.
4.4.4 Subpipelined Keyschedule
There are two approaches to implement keyschedule: (1) pre-calculated keyschedule and
(2) on-the-fly keyschedule. In the pre-calculated keyschedule, the (Nr + 1) 128-bit
roundkeys are generated before the encryption or decryption begins and are stored in memory.
The addroundkey operation accesses the roundkeys by referring to the corresponding address
in memory. The advantage of this approach is that the keyschedule only needs to be
performed once; however, the drawbacks include:
1. The (Nr +1) roundkeys cost (Nr +1)×128 bits memory space;
2. The cipherkey cannot change frequently. Every time it changes, the roundkeys must
be recalculated.
In this thesis, we propose a new 32-bit on-the-fly keyschedule in the composite field
GF((2^4)^2) for 128-, 192- and 256-bit key sizes, where each 128-bit roundkey is generated
every four clock cycles (32 bits at each clock). This is suitable for our 32-bit encryption
architecture.
Table (4.1) shows the 32-bit roundkeys at each clock cycle. The following list explains
this table for the three key sizes.
• When key size=128 bits, Nr=10, it generates 11 128-bit roundkeys for both block A
and B from cycles 0 to 87.
The roundkeys for block A:
roundkey[0]={KA(0), KA(1), KA(2), KA(3)}
roundkey[1]={KA(4), KA(5), KA(6), KA(7)}
......
roundkey[10]={KA(40), KA(41), KA(42), KA(43)}
The roundkeys for block B:
roundkey[0]={KB(0), KB(1), KB(2), KB(3)}
roundkey[1]={KB(4), KB(5), KB(6), KB(7)}
......
roundkey[10]={KB(40), KB(41), KB(42), KB(43)}
• When key size=192, Nr=12, it generates 13 roundkeys for both block A and B from
cycles 0 to 103.
The roundkeys for block A:
roundkey[0]={KA(0), KA(1), KA(2), KA(3)}
roundkey[1]={KA(4), KA(5), KA(6), KA(7)}
......
roundkey[12]={KA(48), KA(49), KA(50), KA(51)}
The roundkeys for block B:
roundkey[0]={KB(0), KB(1), KB(2), KB(3)}
roundkey[1]={KB(4), KB(5), KB(6), KB(7)}
......
roundkey[12]={KB(48), KB(49), KB(50), KB(51)}
• When key size=256, Nr=14, it generates 15 roundkeys for both block A and B from
cycles 0 to 119.
The roundkeys for block A:
roundkey[0]={KA(0), KA(1), KA(2), KA(3)}
roundkey[1]={KA(4), KA(5), KA(6), KA(7)}
......
roundkey[14]={KA(56), KA(57), KA(58), KA(59)}
The roundkeys for block B:
roundkey[0]={KB(0), KB(1), KB(2), KB(3)}
roundkey[1]={KB(4), KB(5), KB(6), KB(7)}
......
roundkey[14]={KB(56), KB(57), KB(58), KB(59)}
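The cycle ranges and the interleaving of blocks A and B listed above follow a simple pattern. The sketch below is only an illustration with hypothetical helper names; it assumes, as the lists suggest, that the two blocks alternate in groups of four clock cycles.

```python
# Round counts from the text: Nr = 10, 12, 14 for 128-, 192-, 256-bit keys.
NR = {128: 10, 192: 12, 256: 14}


def total_cycles(key_bits, blocks=2):
    """Clock cycles needed to emit (Nr + 1) roundkeys of four 32-bit words per block."""
    return (NR[key_bits] + 1) * 4 * blocks


def roundkey32_label(clk):
    """Name of the 32-bit word emitted at clock cycle clk (A and B alternate every 4 cycles)."""
    block = "AB"[(clk // 4) % 2]
    index = (clk // 8) * 4 + clk % 4
    return f"K{block}({index})"


# Example: the last cycle for a 128-bit key carries the last word of block B.
last_word_128 = roundkey32_label(total_cycles(128) - 1)
```

With these assumptions the 128-bit case ends at cycle 87, the 192-bit case at cycle 103 and the 256-bit case at cycle 119, matching the ranges quoted in the list.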
Because we are using the on-the-fly keyschedule, keyschedule and encryptor share
the same clock, which means the overall frequency is determined by the maximum
delay in both the keyschedule and encryptor modules. To achieve efficient pipelining, proper
division in keyschedule is as important as in encryptor. Subword is the most
costly part of keyschedule. In order to obtain the same maximum delay in both modules, we
implement subword in the same way as subbytes in encryptor.
In keyschedule module, rotword rearranges the position of each byte without changing
its value, hence the sequence of rotword and subword can be changed. We do the subword
operation before rotword to save one multiplexer in keyschedule 256.
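The reordering is valid because subword acts on each byte independently while rotword only permutes byte positions. The sketch below illustrates this with a hypothetical stand-in substitution (toy_sub is NOT the AES S-box); any per-byte function gives the same result.

```python
def rotword(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def toy_sub(b):
    """Hypothetical stand-in for a byte substitution; not the AES S-box."""
    return (b * 7 + 3) & 0xFF


def subword(w, sub=toy_sub):
    """Apply the byte substitution to each of the four bytes of w."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= sub((w >> shift) & 0xFF) << shift
    return out


samples = [0x00000000, 0x09CF4F3C, 0x01020304, 0xFFFFFFFF, 0x80000001]
orders_agree = all(rotword(subword(w)) == subword(rotword(w)) for w in samples)
```

Since the two orders give the same word, the hardware is free to place rotword after the subword pipeline.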
All mathematical operations in keyschedule are transformed into the field GF((2^4)^2).
Subword shares the same structure as subbytes. Xorrcon is a simple XOR operation with
a round constant, which is initially {01} and is multiplied by {02} in each keyschedule round.
A keyschedule round is defined as follows. It begins when clk_counter = 0. If key size is
128, the keyschedule round cycle is four; if key size is 192, the keyschedule round cycle is six;
if key size is 256, the keyschedule round cycle is eight. As explained in Subsection (4.4.3),
in GF((2^4)^2), {01} is still {01} and {02} is mapped to {26}. We can therefore use Equation (4.22) to
generate the round constant for each keyschedule round in GF((2^4)^2).
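As a rough illustration of this round-constant generation, the sketch below iterates the bit equations of Equation (4.22) starting from {01}. The function names are hypothetical and x0 is assumed to be the least significant bit; this is a software model, not the RC circuit itself.

```python
def mul26(x):
    """Multiply by {26} in GF((2^4)^2) using the bit equations of Equation (4.22)."""
    x0, x1, x2, x3, x4, x5, x6, x7 = [(x >> i) & 1 for i in range(8)]
    a = x2 ^ x4
    b = x3 ^ x6 ^ x7
    c = x1 ^ x5
    y = [a ^ b ^ x5, a ^ x0, c ^ x0 ^ x3 ^ x4, c ^ a ^ x6,
         x3 ^ x6, b ^ x0, x1 ^ x4 ^ x7, x2 ^ x5]
    return sum(bit << i for i, bit in enumerate(y))


def round_constants(n):
    """First n round constants in the composite field: {01}, {26}, {26}^2, ..."""
    rc, out = 0x01, []
    for _ in range(n):
        out.append(rc)
        rc = mul26(rc)
    return out


first_three = round_constants(3)
```

Under these assumptions the sequence starts {01}, {26}, {4A}, which are the composite-field images of {01}, {02}, {04}.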
This keyschedule has three key size options: Key128, Key192 and Key256. In the
following sections, we discuss the generation of roundkeys in detail for these three key size
options. In the rest of the chapter, roundkey32 stands for the 32-bit roundkey of each clock
cycle, and roundkey stands for the 128-bit roundkey of a round of AES.
Key128
When key size is 128 bits, the encryptor round count is ten. The two blocks A and B need 22
roundkeys. Figure (4.11) illustrates the architecture of keyschedule when key size is 128
bits.
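Figure 4.11: Architecture of Keyschedule 128
(Labels recoverable from the figure: cipherkey, MAP, registers W7 to W0, Subword with stages SA, SB, SC, SD, RW, RC over GF((2^4)^2), multiplexer mul with inputs a, b, c controlled by clk_counter, 32-bit roundkey output to the encryptor.)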
In our design, the first step is to map (MAP) the cipherkey from GF(2^8) to GF((2^4)^2).
After that, keyschedule performs its isomorphic functions in GF((2^4)^2). The outputs of keyschedule
are roundkey32s represented in GF((2^4)^2). This is exactly the format required in encryption,
where the message blocks are also represented in GF((2^4)^2); hence no inverse MAP
follows roundkey.
In Figure (4.11), W7, ..., W0 are eight 32-bit registers, which are
used to store the previous eight roundkey32s. SA, SB, SC and SD are the results of the four
parts of subword. We place three registers among the four substages in subword, the same as in
Figure (4.9). RW is the outcome of rotword. RC generates the round constant for xorrcon in
GF((2^4)^2). mul is a 3-to-1 multiplexer controlled by clk_counter, which is the same signal
as in Figure (4.4) and Table (4.1). There are three different cases (a, b, c) to generate the
current roundkey32. Table (4.4) lists the value of each register in Figure (4.11) during
the first 16 clock cycles (clk_counter = 0 to 15). In this table, the row titled mul stands for the multiplexer in Figure
(4.11); a, b and c are the three cases. In the following expressions, roundkey32[i] stands for
the cell of this table in row roundkey32 and column clk_counter = i.
Case a: clk_counter < 8 (initial round):
roundkey32[clk_counter] = MAP(cipherkey[clk_counter]);
Case b: clk_counter ≥ 8 and clk_counter mod 4 ≠ 0 :
roundkey32[clk_counter] = roundkey32[clk_counter−1] ⊕ roundkey32[clk_counter−8]
roundkey32[clk_counter−1] is stored in W7
roundkey32[clk_counter−8] is stored in W0;
Case c: clk_counter ≥ 8 and clk_counter mod 4 = 0 :
roundkey32[clk_counter] = rotword(subword(roundkey32[clk_counter−5]))
⊕ roundkey32[clk_counter−8] ⊕ Rcon
rotword(subword(roundkey32[clk_counter−5])) is stored in RW
roundkey32[clk_counter−8] is stored in W0.
We give examples for each case.
• When clk_counter = 0, the first roundkey32 is MAPed from the first word of
cipherkey. roundkey32[0] = KA(0). (Case a)
• When clk_counter = 1, the second roundkey32 is MAPed from the second word of
cipherkey. roundkey32[1] = KA(1). KA(0) is moved to W7. In the meantime, KA(0)
has finished the first part of subword and is stored in SA. (Case a)
......
• When clk_counter = 8, the transformed KA(3) has been moved into RW, that is, the RW
cell labelled KA(3) in Table (4.4) holds rotword(subword(KA(3))). Now roundkey32[8] =
KA(4) = rotword(subword(KA(3))) ⊕ KA(0) ⊕ Rcon. (Case c)
Table 4.4: Key128 Roundkey Sequence ("-" marks an empty cell)
cipherkey   KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  -      -      -      -      -      -      -      -
clk_counter 0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
mul         a      a      a      a      a      a      a      a      c      b      b      b      c      b      b      b
roundkey32  KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)  KA(7)  KB(4)  KB(5)  KB(6)  KB(7)
W7          -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)  KA(7)  KB(4)  KB(5)  KB(6)
W6          -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)  KA(7)  KB(4)  KB(5)
W5          -      -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)  KA(7)  KB(4)
W4          -      -      -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)  KA(7)
W3          -      -      -      -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)
W2          -      -      -      -      -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)
W1          -      -      -      -      -      -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)
W0          -      -      -      -      -      -      -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)
SA          -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)  KA(7)  KB(4)  KB(5)  KB(6)
SB          -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)  KA(7)  KB(4)  KB(5)
SC          -      -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)  KA(7)  KB(4)
SD          -      -      -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)  KA(7)
RW          -      -      -      -      -      KA(0)  KA(1)  KA(2)  KA(3)  KB(0)  KB(1)  KB(2)  KB(3)  KA(4)  KA(5)  KA(6)
• When clk_counter = 9, KA(4) is put into W7 and into SA (after finishing the first part of
subword). roundkey32[9] = KA(5) = KA(4) ⊕ KA(1). (Case b)
......
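The three cases above can be exercised in a small software model. The sketch below is a simplification under stated assumptions, not the thesis hardware: it works directly in GF(2^8) and omits MAP (the composite field is isomorphic, so the word relations are the same), it indexes the round constant by clock cycle instead of modelling the RC register, and all function names are hypothetical. The test key is the 128-bit example key of FIPS-197 Appendix A.1, used for both blocks.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox_entry(x):
    """AES S-box: multiplicative inverse (x^254) followed by the affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    s = inv
    for k in range(1, 5):
        s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
    return s ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]
RCON = [0x01]
for _ in range(9):
    RCON.append(gf_mul(RCON[-1], 0x02))


def subword(w):
    return sum(SBOX[(w >> s) & 0xFF] << s for s in (24, 16, 8, 0))


def rotword(w):
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def key128_stream(key_a, key_b):
    """roundkey32 for clk_counter = 0..87, following Cases a, b and c."""
    rk = []
    for clk in range(88):
        if clk < 8:                                    # Case a: load cipherkey words
            rk.append((key_a if clk < 4 else key_b)[clk % 4])
        elif clk % 4 != 0:                             # Case b: W7 xor W0
            rk.append(rk[clk - 1] ^ rk[clk - 8])
        else:                                          # Case c: RW xor W0 xor Rcon
            rcon = RCON[clk // 8 - 1] << 24            # assumed indexing of the constant
            rk.append(rotword(subword(rk[clk - 5])) ^ rk[clk - 8] ^ rcon)
    return rk


KEY = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]
stream = key128_stream(KEY, KEY)
```

With the same key for both blocks, stream[8] and stream[12] are both the first expanded word, and stream[83] and stream[87] are both the last word of the final roundkey.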
Key192
When key size is 192 bits, the encryptor round count is 12. Block A and block B need 104
roundkey32s. The cipherkey size does not affect the functional entities, so keyschedule 192 shares the same
subword, rotword and xorrcon as in Figure (4.11). However, the larger key size makes the structure
more complex. When key size is 192, the keyschedule round cycle is six while
the encryptor cycle is still four. This cycle difference requires extra treatment for the input
of subword. We can see in Figure (4.11) that, when key size is 128 bits, the input of
subword is the roundkey. But when key size is 192 bits, the input of subword falls
into three cases. We use multiplexer mul1 in Figure (4.12) to choose the input among cases x,
y and z.
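Figure 4.12: Architecture of Keyschedule 192
(Labels recoverable from the figure: cipherkey and cipherkey6 with MAP, register chain W13 to W0 with taps labelled W13, W11, W9, W4, W3, W2, W0, Subword with stages SA, SC, RW, RC over GF((2^4)^2), multiplexer mul1 with inputs x, y, z, multiplexer mul2 with inputs a to f, clk_counter, roundkey output to the encryptor.)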
Table 4.5: Key192 Roundkey Sequence (selected clock cycles; "-" marks an empty cell)
cipherkey   KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KB(2)   KB(3)   KA(4)   KA(5)   -       -       KB(4)   KB(5)   -       -       -       -       -
cipherkey6  -       -       -       -       -       -       KA(5)   -       -       -       KB(5)   -       -       -       -       -       -       -       -
mul2        a       a       a       a       a       a       a       a       a       a       f       b       a       a       d       c       e       b       b
mul1        -       -       -       -       -       -       x       -       -       -       x       -       -       -       -       -       z       y       -
clk_counter 0       1       2       3       4       5       6       7       8       9       10      11      12      13      16      17      24      30      31
roundkey32  KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KB(2)   KB(3)   KA(4)   KA(5)   KA(6)   KA(7)   KB(4)   KB(5)   KA(8)   KA(9)   KA(12)  KB(14)  KB(15)
W13         -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KB(2)   KB(3)   KA(4)   KA(5)   KA(6)   KA(7)   KB(4)   KB(7)   KA(8)   KB(11)  KB(13)  KB(14)
W12         -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KB(2)   KB(3)   KA(4)   KA(5)   KA(6)   KA(7)   KB(6)   KB(7)   KB(10)  KB(12)  KB(13)
W11         -       -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KB(2)   KB(3)   KA(4)   KA(5)   KA(6)   KB(5)   KB(6)   KB(9)   KA(15)  KB(12)
W10         -       -       -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KB(2)   KB(3)   KA(4)   KA(5)   KB(4)   KB(5)   KB(8)   KA(14)  KA(15)
W9          -       -       -       -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KB(2)   KB(3)   KA(4)   KA(7)   KB(4)   KA(11)  KA(13)  KA(14)
W8          -       -       -       -       -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KB(2)   KB(3)   KA(6)   KA(7)   KA(10)  KA(12)  KA(13)
W7          -       -       -       -       -       -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KB(2)   KA(5)   KA(6)   KA(9)   KB(11)  KA(12)
W6          -       -       -       -       -       -       -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(1)   KA(4)   KA(5)   KA(8)   KB(10)  KB(11)
W5          -       -       -       -       -       -       -       -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KB(3)   KA(4)   KB(7)   KB(9)   KB(10)
W4          -       -       -       -       -       -       -       -       -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(2)   KB(3)   KB(6)   KB(8)   KB(9)
W3          -       -       -       -       -       -       -       -       -       -       -       KA(0)   KA(1)   KA(2)   KB(1)   KB(2)   KB(5)   KA(11)  KB(8)
W2          -       -       -       -       -       -       -       -       -       -       -       -       KA(0)   KA(1)   KB(0)   KB(1)   KB(4)   KA(10)  KA(11)
W1          -       -       -       -       -       -       -       -       -       -       -       -       -       KA(0)   KA(3)   KB(0)   KA(7)   KA(9)   KA(10)
W0          -       -       -       -       -       -       -       -       -       -       -       -       -       -       KA(2)   KA(3)   KA(6)   KA(8)   KA(9)
SA          -       -       -       -       -       -       KA(5)   -       -       -       KB(5)   -       -       -       -       -       KB(11)  KA(17)  -
SB          -       -       -       -       -       -       -       KA(5)   -       -       -       KB(5)   -       -       -       -       -       -       KA(17)
SC          -       -       -       -       -       -       -       -       KA(5)   -       -       -       KB(5)   -       -       -       -       -       -
SD          -       -       -       -       -       -       -       -       -       KA(5)   -       -       -       KB(5)   -       -       -       -       -
RW          -       -       -       -       -       -       -       -       -       -       KA(5)   -       -       -       -       -       KA(11)  -       -
Table (4.5) shows the value of each register in Figure (4.12) during the first 32 clock
cycles. Row mul1 and mul2 correspond to these two multiplexers. In the following section,
we explain the two multiplexers (mul1 and mul2) in Figure (4.12).
1. Multiplexer 1 (mul1)
Case x: (clk_counter = 6 or 10) : SA = MAP(cipherkey6); (Cipherkey6 is the sixth
32-bit word of the 192-bit cipherkey. We can see from Table (4.5) that, when clk_counter =
10, KA(5) must have finished subword and rotword, so that we can produce KA(6),
where KA(6) = KA(0) ⊕ rotword(subword(KA(5))) ⊕ Rcon. Because it needs
five clock cycles to complete rotword(subword(KA(5))), KA(5) must be shifted
into SA when clk_counter = 6. That is why we need cipherkey6 to provide
KA(5).)
Case y: (clk_counter mod 24 = 6 or 10) and (clk_counter > 23) : SA = W11 ⊕
W2 ⊕ W3; (One example is when clk_counter = 30, KA(17) needs to be shifted
into SA. Because KA(17) is not stored in any register, we need to calculate
it from the existing data. KA(17) = KA(16) ⊕ KA(11) = KA(15) ⊕ KA(10) ⊕
KA(11). KA(15), KA(10) and KA(11) are stored in registers W11, W2 and W3,
respectively.)
Case z: (clk_counter mod 24 = 0 or 20) and (clk_counter ≠ 0) : SA = W13;
This is why we need multiplexer mul1 to differentiate three cases for the input of
subword when key size is 192 bits.
2. Multiplexer 2 (mul2)
• Generate roundkey32s from cipherkey directly, KA(i), KB(i), i = 0,...,5
Case a: (clk_counter < 10 or clk_counter = 12, 13) : roundkey32[clk_counter] = MAP(cipherkey
[clk_counter]) (Because each roundkey32 is generated when it is needed in
encryption, the arrangement of row cipherkey in Table (4.5) is determined by the
encryptor.)
• Generate roundkey32s KA(i), KB(i), i ≥ 6 and i mod 6 ≠ 0 (Because of the
round cycle difference between keyschedule and encryptor, we need to classify
this into three sub-cases, based on the value (clk_counter mod 4). Table (4.5)
shows these three sub-cases (b, c and d), where roundkey32s are generated by
the formula KA/B(i) = KA/B(i−1) ⊕ KA/B(i−6). The conditions below are stated so
that they agree with the instances in Table (4.5), for example Case d at
clk_counter = 16 and Case c at clk_counter = 17.)
Case b: (clk_counter mod 4 = 2 or 3), (clk_counter > 7) and (clk_counter mod 24 ≠ 10, 14) :
roundkey32[clk_counter] = roundkey32[clk_counter−1] ⊕ roundkey32[clk_counter−10];
roundkey32[clk_counter−1] is stored in register W13;
roundkey32[clk_counter−10] is stored in register W4;
Case c: (clk_counter mod 4 = 1) and (clk_counter > 13) :
roundkey32[clk_counter] = roundkey32[clk_counter−1] ⊕ roundkey32[clk_counter−14];
roundkey32[clk_counter−1] is stored in register W13;
roundkey32[clk_counter−14] is stored in register W0;
Case d: (clk_counter mod 4 = 0), (clk_counter > 13) and (clk_counter mod 24 ≠ 0, 4) :
roundkey32[clk_counter] = roundkey32[clk_counter−5] ⊕ roundkey32[clk_counter−14];
roundkey32[clk_counter−5] is stored in register W9;
roundkey32[clk_counter−14] is stored in register W0;
• Generate roundkey32s KA(i), KB(i), i ≥ 6 and i mod 6 = 0 (The sub-cases are
caused by the same reason as in the above case. The following two sub-cases are
based on the formula KA/B(i) = rotword(subword(KA/B(i−1))) ⊕ KA/B(i−6) ⊕ Rcon.)
Case e: (clk_counter mod 24 = 0 or 4) and (clk_counter > 7) :
roundkey32[clk_counter] = rotword(subword(roundkey32[clk_counter−5]))
⊕ roundkey32[clk_counter−14] ⊕ RC;
rotword(subword(roundkey32[clk_counter−5])) is stored in register RW;
roundkey32[clk_counter−14] is stored in register W0;
Case f: (clk_counter mod 24 = 10 or 14) :
roundkey32[clk_counter] = rotword(subword(roundkey32[clk_counter−1]))
⊕ roundkey32[clk_counter−10] ⊕ RC;
rotword(subword(roundkey32[clk_counter−1])) is stored in register RW;
roundkey32[clk_counter−10] is stored in register W4;
Table (4.5) lists instances for each case for both mul1 and mul2 during the first 32 clock
cycles.
• When clk_counter = 0, roundkey32[0] = KA(0), which is MAPed from cipherkey (Case a);
......
• When clk_counter = 6, roundkey32[6] = KB(2), which is MAPed from cipherkey (Case a).
SA = KA(5), where KA(5) is MAPed from cipherkey6 and shifted into SA after
finishing subword's first part (Case x);
......
• When clk_counter = 10, roundkey32[10] = KA(6) = rotword(subword(KA(5))) ⊕
KA(0) ⊕ Rcon (Case f);
• When clk_counter = 11, roundkey32[11] = KA(7) = KA(6) ⊕ KA(1) (Case b);
......
• When clk_counter = 16, roundkey32[16] = KA(8) = KA(7) ⊕ KA(2) (Case d);
• When clk_counter = 17, roundkey32[17] = KA(9) = KA(8) ⊕ KA(3) (Case c);
......
• When clk_counter = 24, roundkey32[24] = KA(12) = rotword(subword(KA(11))) ⊕
KA(6) ⊕ Rcon (Case e). SA = KB(11), where KB(11) is shifted into SA after finishing
subword's first part (Case z);
......
• When clk_counter = 30, SA = KA(17) = KA(16) ⊕ KA(11) = (KA(15) ⊕ KA(10)) ⊕
KA(11) (Case y);
......
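The register taps used in the instances above can be derived from the interleaving alone. The sketch below is an illustrative model with hypothetical function names; it assumes blocks A and B alternate every four clock cycles and computes, for a given clock cycle, how many cycles earlier the words K(i−1) and K(i−Nk) appeared on the 32-bit output (Nk = 6 for a 192-bit key).

```python
def clk_of(block, i):
    """Clock cycle at which word K_block(i) is output (block 0 = A, 1 = B)."""
    return (i // 4) * 8 + block * 4 + i % 4


def taps(clk, nk):
    """Return (kind, delay of K(i-1), delay of K(i-nk)) for the word output at clk."""
    block = (clk // 4) % 2
    i = (clk // 8) * 4 + clk % 4
    if i < nk:
        return ("cipherkey", None, None)       # word comes straight from the cipherkey
    if i % nk == 0:
        kind = "rot+sub"                        # needs rotword, subword and Rcon
    elif nk == 8 and i % nk == 4:
        kind = "sub"                            # 256-bit keys only: subword without rotword
    else:
        kind = "xor"
    return (kind, clk - clk_of(block, i - 1), clk - clk_of(block, i - nk))


# Delays for the Key192 instances listed above (nk = 6).
key192_taps = {clk: taps(clk, 6) for clk in (10, 11, 16, 17, 24, 30)}
```

A delay of d cycles corresponds to the register d places along the chain, so delays (1, 10) point to W13 and W4, (1, 14) to W13 and W0, and (5, 14) to W9 and W0. For the rot+sub words the delay is that of the untransformed word; the hardware obtains the transformed value from RW, which is why mul1 has to feed the subword pipeline several cycles in advance.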
Key256
Keyschedule 256 is slightly different from keyschedule 128. The keyschedule round cycle
is eight clock cycles. The architecture is shown in Figure (4.13).
Figure 4.13: Architecture of Keyschedule 256
(Labels recoverable from the figure: cipherkey, MAP, registers W15 to W0, Subword stages SA to SD, RW, RC over GF((2^4)^2), multiplexer mul with inputs a, b, c, d controlled by clk_counter, roundkey output.)
There are four different cases to generate roundkey32s:
Case a: (clk_counter < 16) :
roundkey32[clk_counter] = MAP(cipherkey[clk_counter]);
Case b: (clk_counter ≥ 16) and (clk_counter mod 4 ≠ 0) :
roundkey32[clk_counter] = roundkey32[clk_counter−1] ⊕ roundkey32[clk_counter−16];
roundkey32[clk_counter−1] is stored in register W15;
roundkey32[clk_counter−16] is stored in register W0;
Case c: (clk_counter ≥ 16) and (clk_counter mod 16 = 0 or 4) :
roundkey32[clk_counter] = rotword(subword(roundkey32[clk_counter−5]))
⊕ roundkey32[clk_counter−16] ⊕ RC;
rotword(subword(roundkey32[clk_counter−5])) is stored in register RW;
roundkey32[clk_counter−16] is stored in register W0;
Case d: (clk_counter ≥ 16) and (clk_counter mod 16 = 8 or 12) :
roundkey32[clk_counter] = subword(roundkey32[clk_counter−5]) ⊕ roundkey32[clk_counter−16];
subword(roundkey32[clk_counter−5]) is stored in register RW;
roundkey32[clk_counter−16] is stored in register W0.
The conditions of Case c and Case d are stated so that they agree with Table (4.6), where
Case c occurs at clk_counter = 16 and 20 and Case d at clk_counter = 24 and 28.
Case d uses subword without rotword. Because subword is performed first, Case c and
Case d take their input from the same subword pipeline, and rotword is simply bypassed in
Case d. This is why we change the sequence of subword and rotword: putting subword before
rotword saves one multiplexer when key size is 256 bits.
Table (4.6) gives instances for each case.
• When clk_counter = 0, roundkey32[0] = KA(0), where KA(0) is MAPed from
cipherkey (Case a);
......
• When clk_counter = 16, roundkey32[16] = KA(8) = rotword(subword(KA(7))) ⊕
KA(0) ⊕ Rcon (Case c);
• When clk_counter = 17, roundkey32[17] = KA(9) = KA(8) ⊕ KA(1) (Case b);
......
• When clk_counter = 24, roundkey32[24] = KA(12) = subword(KA(11)) ⊕ KA(4)
(Case d);
......
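The 256-bit instances can be cross-checked against a word-indexed key expansion. The sketch below is a software model under stated assumptions: it works in GF(2^8) and omits MAP, it selects Case c at clk_counter mod 16 = 0 or 4 and Case d at clk_counter mod 16 = 8 or 12 (the cycles at which Table (4.6) shows c and d), it indexes the round constant by clock cycle, and all names are hypothetical. The key is the 256-bit example key of FIPS-197 Appendix A.3, used for both blocks.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox_entry(x):
    inv = 1
    for _ in range(254):                      # x^254 is the inverse (0 maps to 0)
        inv = gf_mul(inv, x)
    s = inv
    for k in range(1, 5):
        s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
    return s ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]
RCON = [0x01]
for _ in range(9):
    RCON.append(gf_mul(RCON[-1], 0x02))


def subword(w):
    return sum(SBOX[(w >> s) & 0xFF] << s for s in (24, 16, 8, 0))


def rotword(w):
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def expand_key(key, n_words):
    """Reference word-indexed key expansion (standard order: subword after rotword)."""
    w, nk = list(key), len(key)
    for i in range(nk, n_words):
        t = w[i - 1]
        if i % nk == 0:
            t = subword(rotword(t)) ^ (RCON[i // nk - 1] << 24)
        elif nk > 6 and i % nk == 4:
            t = subword(t)
        w.append(t ^ w[i - nk])
    return w


def key256_stream(key_a, key_b, n_clk=120):
    """roundkey32 per clock cycle; subword runs first, rotword is applied only in Case c."""
    rk = []
    for clk in range(n_clk):
        if clk < 16:                                      # Case a
            src = key_b if (clk // 4) % 2 else key_a
            rk.append(src[(clk // 8) * 4 + clk % 4])
        elif clk % 4 != 0:                                # Case b
            rk.append(rk[clk - 1] ^ rk[clk - 16])
        else:
            s = subword(rk[clk - 5])                      # shared subword pipeline
            if clk % 16 in (0, 4):                        # Case c
                rk.append(rotword(s) ^ rk[clk - 16] ^ (RCON[clk // 16 - 1] << 24))
            else:                                         # Case d
                rk.append(s ^ rk[clk - 16])
    return rk


KEY = [0x603DEB10, 0x15CA71BE, 0x2B73AEF0, 0x857D7781,
       0x1F352C07, 0x3B6108D7, 0x2D9810A3, 0x0914DFF4]
reference = expand_key(KEY, 60)
stream = key256_stream(KEY, KEY)
matches = all(stream[c] == reference[(c // 8) * 4 + c % 4] for c in range(120))
```

The comparison covers every clock cycle of both blocks, so it exercises all four cases as well as the reordering of subword and rotword.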
Table 4.6: Key256 Roundkey Sequence (selected clock cycles and registers; "-" marks an empty cell)
cipherkey   KA(0)   KB(7)   -       -       -       -       -       -       -
mul         a       a       c       b       b       b       c       d       d
clk_counter 0       15      16      17      18      19      20      24      28
roundkey32  KA(0)   KB(7)   KA(8)   KA(9)   KA(10)  KA(11)  KB(8)   KA(12)  KB(12)
W15         -       KB(6)   KB(7)   KA(8)   KA(9)   KA(10)  KA(11)  KB(11)  KA(15)
W14         -       KB(5)   KB(6)   KB(7)   KA(8)   KA(9)   KA(10)  KB(10)  KA(14)
W13         -       KB(4)   KB(5)   KB(6)   KB(7)   KA(8)   KA(9)   KB(9)   KA(13)
W12         -       KA(7)   KB(4)   KB(5)   KB(6)   KB(7)   KA(8)   KB(8)   KA(12)
W11         -       KA(6)   KA(7)   KB(4)   KB(5)   KB(6)   KB(7)   KA(11)  KB(11)
W0          -       -       KA(0)   KA(1)   KA(2)   KA(3)   KB(0)   KA(4)   KB(4)
SA          -       KB(6)   KB(7)   KA(8)   KA(9)   KA(10)  KA(11)  KB(11)  KA(15)
SB          -       KB(5)   KB(6)   KB(7)   KA(8)   KA(9)   KA(10)  KB(10)  KA(14)
SC          -       KB(4)   KB(5)   KB(6)   KB(7)   KA(8)   KA(9)   KB(9)   KA(13)
SD          -       KA(7)   KB(4)   KB(5)   KB(6)   KB(7)   KA(8)   KB(8)   KA(12)
RW          -       KA(6)   KA(7)   KB(4)   KB(5)   KB(6)   KB(7)   KA(11)  KB(11)
Chapter 5
Implementation Performance And Comparison
A number of hardware implementations of AES have been published. The results in their
comparison tables were synthesized by various design tools on different FPGA
chips. Although the difficulty of comparing FPGA implementations has been reported,
there is still no proven measure that gives a truly fair comparison among different architectures.
Even for devices from the same company (Xilinx), different families use different
technologies, which leads to different frequencies. For example, a slice in Virtex 5 has four LUTs
(Look Up Tables) instead of two in previous families [6], which leads to a different area cost
(number of slices).
Since AES standard includes encryption, decryption and keyschedule with three key
sizes, it is up to the designers to choose which function they would like to realize. Obvi-
ously, more functions need more resource. Hence it is reasonable to compare architectures
providing similar functions.
In this chapter, we first classify previous AES architectures into different categories and
then use tables to compare their performance.
1. Encryption and Decryption: AES architectures include encryption and decryption
units. The designs in [1, 3, 5, 8, 9, 17, 33, 22, 27, 28] provide functions for both encryption
and decryption. As AES is a symmetric algorithm, encryption and decryption share the same
units. Depending on the parameter indicated by the user, the design executes either encryption or
decryption. Some other AES architectures focus only on encryption [2, 4, 11, 12,
25, 26, 31].
2. Key Sizes: AES uses a data size of 128 bits but offers three key sizes (128, 192 and
256 bits). 128-bit is the most common choice in the reported designs [3, 4, 5, 9, 12,
33, 26, 27, 28, 34]. However, as reconfigurability is one of the most important factors
for FPGA implementations, options for all three key sizes are included in a number
of designs [1, 2, 17, 22].
3. Key Expansion: The keyschedule in AES generates roundkeys for each round. The
roundkeys can be previously calculated and stored in memory [1, 2, 3, 5, 22, 27].
This method results in an acceptable initial delay when the data size is relatively large
compared with the key size. A more flexible approach is the on-the-fly keyschedule
[4, 9, 12, 17, 33, 26, 28, 34] which conducts an on-line calculation of roundkeys for
each 128-bit data block. On-the-fly keyschedule affects the general frequency as both
the data unit and key unit share the same clock, especially when it is employed for all
the three key sizes. There are also some architectures that do not include keyschedule
[8, 11, 13, 31].
4. BRAM based S-Box and combinational logic based S-Box: Different approaches
for S-Box implementation have an obvious impact on AES performance. BRAM based
approaches [5, 8, 13, 17, 26, 27] are preferred when a low slice count is required, because they
save the slices required by the combinational logic based approach. Hence it is not
reasonable to compare the throughput/slice ratio between BRAM-based S-Box designs and
combinational logic based S-Box designs [1, 2, 9, 11, 12, 13, 33, 22, 25, 28, 31, 34]. Good
et al. [9] used a conversion factor (32 bits/slice) to convert the number of BRAMs to the number of slices
required to implement the equivalent distributed memory. However, the estimates vary
between 8 and 32 bits/slice depending on the functionality required. In this thesis,
we only compare the throughput/slice of our design with non-BRAM implementations.
The above four categories summarize the major factors affecting the performance of
hardware implementations of AES. Table 5.1 compares the performance of the architectures
using BRAMs. Table 5.2 compares the architectures without BRAMs. Table 5.3
summarizes the functions provided by these architectures.
Table 5.1: Comparisons of BRAMs Based AES Architecture
Design                Device             Frequency (MHz)  Slices  BRAMs  Throughput (Mbps)
Samanta et al. [27]   VIRTEX2 2V6000     76.699           1051    11     111.56
Chodowiec [8]         VIRTEX XCV1000     95               12600   80     12100
Chodowiec et al. [5]  SPARTAN2 XC2S30    60               222     3      166
Chang et al. [4]      SPARTAN2 XC2S30    38.50            200     2      38
Saggese et al. [26]   VIRTEXE XCV2000E   142              648     10     1820
McLoone et al. [17]   VIRTEXE XCV3200E   54.35            2222    100    6956
Among the architectures using BRAMs, Chodowiec [8] employed fully unrolled sub-
pipelining achieving the highest throughput with the largest resource cost. Recently,
Chodowiec et al. made a compact design costing 222 slices in a Spartan2 device offering a
throughput of 166Mbps [5].
In our proposed architecture, we do not use BRAM. In Table 5.2, it can be seen that
Good et al. achieves the highest throughput of 25.107 Gbps on Spartan3 XC3S2000. It
employs fully parallel loop unrolled architecture which calculates multiplicative inverse of
each byte over composite field GF((24)2). It also gets the frequency of 196.1MHz. But it
only deals with 128-bit key size and costs 17425 slices. Another fully unrolled architecture
is proposed by Zhang et al. [34]. This design used number-of-gates-in-critical-path to place
the pipeline cuts. It subpipelines a round into seven substages and achieves 21.556 Gbps
with the throughput/slice ratio of 1.956.
Table 5.2: Comparisons of Non-BRAMs Architectures
Design                   Device              Frequency (MHz)  Area (Slices)  Throughput (Mbps)  Mbps/Slice
Good et al. [9]          SPARTAN3 XC3S2000   196.1            17425          25107              1.441
Zhang et al. [34]        VIRTEXE XCV1000E    168.4            11022          21560              1.956
Jarvinen et al. [12]     VIRTEX XC2V2000     139.1            10750          17800              1.656
Hodjat et al. [11]       VIRTEX2P XC2VP20    169.1            9446           21640              2.291
Lemsitzer et al. [13]    VIRTEX4 FX100       110              7300           3500               0.479
Bulens et al. [3]        SPARTAN3            150              1800           1700               0.944
Standaert et al. [31]    VIRTEXE XCV1000E    167              1767           2085               1.180
Pramstaller et al. [22]  VIRTEXE XCV1000E    161              1125           215                0.191
Our Design               VIRTEX2 XC2V2000    277.4            523            807                1.543
Alam et al. [1]          VIRTEXE XCV1000E    135              510            432                0.847
Compared with the previous architectures, our design focuses on low-cost, non-BRAM
implementations. Little literature is available on low-cost AES designs.
Pramstaller et al. proposed a compact design costing 1125 slices in [22]. Its pre-calculated
key generator can deal with three key sizes. Standaert et al. [31] made a single encryption
architecture with 1767 slices which provides Gbps-level throughput. Alam et al. [1]
reported a design including encryption, decryption and on-the-fly keyschedule for three key
sizes, which achieves 432 Mbps at a frequency of 135 MHz.
Compared with similar previous works, our proposed low-cost and efficient AES architecture
uses only 523 slices and achieves a throughput of 806 Mbps when implemented on
Virtex 2 XC2V2000. The throughput/area ratio is 1.543 Mbps/slice, which is relatively high among
low-cost designs (< 2000 slices). The proposed design can be efficiently applied in environments
with restricted computing resources, such as wireless devices and embedded devices.
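The Mbps/Slice column of Table 5.2 is simply throughput divided by area. As a small arithmetic check, the sketch below recomputes the ratio from the slice counts and throughputs listed in the table; the data layout and function name are illustrative only.

```python
# (design, slices, throughput in Mbps, Mbps/slice as listed in Table 5.2)
TABLE_5_2 = [
    ("Good et al. [9]", 17425, 25107, 1.441),
    ("Zhang et al. [34]", 11022, 21560, 1.956),
    ("Jarvinen et al. [12]", 10750, 17800, 1.656),
    ("[11]", 9446, 21640, 2.291),
    ("Lemsitzer et al. [13]", 7300, 3500, 0.479),
    ("Bulens et al. [3]", 1800, 1700, 0.944),
    ("Standaert et al. [31]", 1767, 2085, 1.180),
    ("Pramstaller et al. [22]", 1125, 215, 0.191),
    ("Our Design", 523, 807, 1.543),
    ("Alam et al. [1]", 510, 432, 0.847),
]


def mbps_per_slice(throughput_mbps, slices):
    """Throughput/area ratio used as the efficiency measure in Table 5.2."""
    return throughput_mbps / slices


# Largest difference between the recomputed and the listed ratio.
max_deviation = max(abs(mbps_per_slice(t, s) - listed) for _, s, t, listed in TABLE_5_2)
```

The recomputed ratios agree with the listed column to within rounding, for example 807 / 523 ≈ 1.543 for the proposed design.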
Table 5.3: Comparisons of AES Architectures Functions ("•" marks a provided function; "-" marks an empty cell)
Design                   Encryption  Decryption  KeySchedule    KeySize      BRAMs
Samanta [27]             •           •           Pre-Calculate  128          •
Chodowiec [8]            •           •           -              -            •
Chodowiec et al. [5]     •           •           Pre-Calculate  128          •
Satoh et al. [28]        •           •           On-The-Fly     128          -
Hodjat et al. [11]       •           -           -              -            -
Jarvinen et al. [12]     •           -           On-The-Fly     128          -
Good et al. [9]          •           •           On-The-Fly     128          -
Zhang et al. [34]        •           -           On-The-Fly     128          -
Chang et al. [4]         •           -           On-The-Fly     128          •
Pramstaller et al. [22]  •           •           Pre-Calculate  128/192/256  -
Saggese et al. [26]      •           -           On-The-Fly     128          •
Standaert et al. [31]    •           -           -              -            -
McLoone et al. [17]      •           •           On-The-Fly     128/192/256  •
Lemsitzer et al. [13]    •           -           -              -            -
Bulens et al. [3]        •           •           Pre-Calculate  128          •
Alam et al. [1]          •           -           On-The-Fly     128/192/256  -
Our Design               •           -           On-The-Fly     128/192/256  -
Chapter 6
Conclusion
AES is an important and popular cryptographic algorithm to secure the information and
data transmission. In this thesis, we propose a compact reconfigurable FPGA architecture
for the AES implementation.
The 32-bit single round unit design results in low area cost, which makes it suitable
for low-end devices. The combinational logic approach to the AES implementation
eliminates the need for BRAMs. The fully composite field (GF((2^4)^2)) based design decreases the
hardware complexity of the arithmetic operations in AES. We apply subpipelining in
both the encryptor and keyschedule modules to optimize the speed/area ratio, which reaches
1.543 Mbps/Slice on Virtex 2 XC2V2000. Besides, the capability to deal with three key
sizes makes our design an efficient reconfigurable architecture for AES.
The proposed design achieves a throughput of 805.8 Mbps. It requires less than a
quarter of the resources of a Xilinx Spartan2 FPGA, which is one of the smallest FPGA
devices. The performance comparison indicates that the proposed AES architecture achieves
higher throughput than previous compact designs.
FIPS standard [20] provides an equivalent inverse cipher which switches the sequence
of the four transformations in decryption round so that the encryption and decryption can
share the same functions, such as the multiplicative inversion in subbytes. In our design, the
encryption conducts shiftrows before subbytes. When implementing the equivalent inverse
cipher, it only needs to switch the relative sequence of inv-mixcolumns and addroundkey.
The positions of inv-shiftrows and inv-subbytes are not changed. The proposed design can
be easily modified into an equivalent cipher.
In conclusion, the proposed compact and reconfigurable AES architecture has high
throughput and low area cost, which makes it well suited to computing-restricted environments
and wireless devices.
Bibliography
[1] Monjur Alam, Santosh Ghosh, Dipanwita RoyChowdhury, and Indranil Sengupta.
Single Chip Encryptor/Decryptor Core Implementation of AES Algorithm. In VLSID
’08: Proceedings of the 21st International Conference on VLSI Design, pages 693–
698, Washington, DC, USA, 2008. IEEE Computer Society.
[2] Monjur Alam, Sonai Ray, Debdeep Mukhopadhayay, Santosh Ghosh, Dipanwita Roy-
Chowdhury, and Indranil Sengupta. An Area Optimized Reconfigurable Encryptor for
AES-Rijndael. In DATE ’07: Proceedings of the conference on Design, automation
and test in Europe, pages 1116–1121, San Jose, CA, USA, 2007. EDA Consortium.
[3] Philippe Bulens, Francois-Xavier Standaert, Jean-Jacques Quisquater, Pascal Pelle-
grin, and Gael Rouvroy. Implementation of the AES-128 on Virtex-5 FPGAs. In
Progress in Cryptology - AfricaCrypt 2008, pages 16 – 26. Springer, 2008.
[4] Chi-Jeng Chang, Chi-Wu Huang, Hung-Yun Tai, and Mao-Yuan Lin. 8-bit AES
Implementation in FPGA by Multiplexing 32-bit AES Operation. In ISDPE ’07:
Proceedings of the First International Symposium on Data, Privacy, and
E-Commerce, pages 505–507, Washington, DC, USA, 2007. IEEE Computer Society.
[5] Pawel Chodowiec and Kris Gaj. Very Compact FPGA Implementation of the AES
Algorithm. In CHES, pages 319–333, 2003.
[6] Adrian Cosoroaba. Achieve Higher Performance with Virtex-5 FPGAs. Xilinx, Inc.
Available at http://china.xilinx.com/publications/xcellonline/xcell_59/xc_
pdf/p016-018_59-consoroba.pdf.
[7] J. Daemen and V. Rijmen. AES Proposal: Rijndael. Technical report, National Institute of
Standards and Technology (NIST). Available at http://www.nic.funet.fi/pub/crypt/
cryptography/symmetric/aes/nist/Rijndael.pdf.
[8] Kris Gaj and Pawel Chodowiec. Comparison of the Hardware Performance of the AES Can-
didates Using Reconfigurable Hardware. In AES Candidate Conference, pages 40–54, 2000.
[9] Tim Good and Mohammed Benaissa. AES on FPGA from the Fastest to the Smallest. In
Josyula R. Rao and Berk Sunar, editors, CHES, volume 3659 of Lecture Notes in Computer
Science, pages 427–440. Springer, 2005.
[10] D.H. Green and I.S. Taylor. Irreducible Polynomials over Composite Galois Fields and Their
Applications in Coding Techniques. pages 935–939, September 1974.
[11] Alireza Hodjat and Ingrid Verbauwhede. A 21.54 Gbits/s Fully Pipelined AES Processor
on FPGA. In FCCM ’04: Proceedings of the 12th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines, pages 308–309, Washington, DC, USA, 2004.
IEEE Computer Society.
[12] Kimmo U. Järvinen, Matti T. Tommiska, and Jorma O. Skyttä. A Fully Pipelined Memoryless
17.8 Gbps AES-128 Encryptor. In FPGA ’03: Proceedings of the 2003 ACM/SIGDA eleventh
international symposium on Field programmable gate arrays, pages 207–215, New York, NY,
USA, 2003. ACM.
[13] Stefan Lemsitzer, Johannes Wolkerstorfer, Norbert Felber, and Matthias Braendli. Multi-
gigabit GCM-AES Architecture Optimized for FPGAs. In Pascal Paillier and Ingrid Ver-
bauwhede, editors, CHES, volume 4727 of Lecture Notes in Computer Science, pages 227–
238. Springer, 2007.
[14] M. Liberatori, F. Otero, J.C. Bonadero, and J. Castineira. AES-128
Cipher. High Speed, Low Cost FPGA Implementation. pages 195–198, Mar del Plata, 2007.
IEEE Computer Society.
[15] Rudolf Lidl and Harald Niederreiter. Finite Fields (Encyclopedia of Mathematics and its
Applications). Addison-Wesley, 1983.
[16] Robert J. McEliece. Finite Fields for Computer Scientists and Engineers. Kluwer Academic
Pub, 1987.
[17] Máire McLoone and John V. McCanny. High Performance Single-Chip FPGA Rijndael
Algorithm Implementations. In CHES ’01: Proceedings of the Third International Workshop on
Cryptographic Hardware and Embedded Systems, pages 65–76, London, UK, 2001.
Springer-Verlag.
[18] Alfred J. Menezes, Scott A. Vanstone, and Paul C. Van Oorschot. Handbook of Applied
Cryptography. CRC Press, Inc., Boca Raton, FL, USA, 1996.
[19] Mike Nelson. Why You Should Use FPGAs in Data Security. Xilinx is an Ideal Plat-
form for Data Security Applications. Storage and Servers. Vertical Markets. Xilinx, Inc.
Available at http://www.xilinx.com/publications/xcellonline/xcell_57/xc_pdf/
p054-057_57-secure.pdf.
[20] NIST. Announcing the ADVANCED ENCRYPTION STANDARD (AES). Available at http:
//csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
[21] Christof Paar. Efficient VLSI Architectures for Bit-Parallel Computation in Galois Fields. PhD
thesis, Institute for Experimental Mathematics – University of Essen, 1994.
[22] Norbert Pramstaller, Stefan Mangard, Sandra Dominikus, and Johannes Wolkerstorfer. Ef-
ficient AES Implementations on ASICs and FPGAs. In Hans Dobbertin, Vincent Rijmen,
and Aleksandra Sowa, editors, AES Conference, volume 3373 of Lecture Notes in Computer
Science, pages 98–112. Springer, 2004.
[23] Norbert Pramstaller and Johannes Wolkerstorfer. A Universal and Efficient AES Co-processor
for Field Programmable Logic Arrays. 3203/2004:565–574, 2004.
[24] Vincent Rijmen. Efficient Implementation of the Rijndael S-box. Available at http://www.
comms.scitech.susx.ac.uk/fft/crypto/rijndael-sbox.pdf.
[25] Atri Rudra, Pradeep K. Dubey, Charanjit S. Jutla, Vijay Kumar, Josyula R. Rao, and Pankaj
Rohatgi. Efficient Rijndael Encryption Implementation with Composite Field Arithmetic. In
CHES ’01: Proceedings of the Third International Workshop on Cryptographic Hardware
and Embedded Systems, pages 171–184, London, UK, 2001. Springer-Verlag.
[26] Giacinto Paolo Saggese, Antonino Mazzeo, Nicola Mazzocca, and Antonio G. M. Strollo.
An FPGA-Based Performance Analysis of the Unrolling, Tiling, and Pipelining of the AES
Algorithm. In FPL, pages 292–302, 2003.
[27] Sounak Samanta. FPGA Implementation of AES Encryption and Decryption. Sardar
Vallabhbhai National Institute of Technology, Surat. Available at
http://www.design-reuse.com/articles/13981/fpga-implementation-of-aes-encryption-and-decryption.html.
[28] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A Compact Rijndael
Hardware Architecture with S-Box Optimization. In ASIACRYPT ’01: Proceedings of the 7th
International Conference on the Theory and Application of Cryptology and Information
Security, pages 239–254, London, UK, 2001. Springer-Verlag.
[29] Shu Lin and Daniel J. Costello. Error Control Coding: Fundamentals and Applications.
Prentice Hall, 1983.
[30] William Stallings. Cryptography and Network Security: Principles and Practices (Fourth
Edition). Pearson Prentice Hall, 2006.
[31] François-Xavier Standaert, Gaël Rouvroy, Jean-Jacques Quisquater, and Jean-Didier Legat.
Efficient Implementation of Rijndael Encryption in Reconfigurable Hardware: Improvements
and Design Tradeoffs. In CHES, pages 334–350, 2003.
[32] Johannes Wolkerstorfer, Elisabeth Oswald, and Mario Lamberger. An ASIC Implementation
of the AES SBoxes. In CT-RSA ’02: Proceedings of the Cryptographer’s Track at the
RSA Conference on Topics in Cryptology, pages 67–78, London, UK, 2002. Springer-Verlag.
[33] Namin Yu and H.M. Heys. Investigation of Compact Hardware Implementation of the
Advanced Encryption Standard. pages 1069–1072, 2005.
[34] Xinmiao Zhang and Keshab K. Parhi. High-speed VLSI architectures for the AES algorithm.
IEEE Trans. Very Large Scale Integr. Syst., 12(9):957–967, 2004.
