Hardware Implementations for Symmetric Key Cryptosystems by El-Razouk, Hayssam
Western University 
Scholarship@Western 
Electronic Thesis and Dissertation Repository 
6-5-2015 12:00 AM 
Hardware Implementations for Symmetric Key Cryptosystems 
Hayssam El-Razouk 
The University of Western Ontario 
Supervisor 
Arash Reyhani-Masoleh 
The University of Western Ontario 
Graduate Program in Electrical and Computer Engineering 
A thesis submitted in partial fulfillment of the requirements for the degree in Doctor of 
Philosophy 
© Hayssam El-Razouk 2015 
 Part of the VLSI and Circuits, Embedded and Hardware Systems Commons 
Recommended Citation 
El-Razouk, Hayssam, "Hardware Implementations for Symmetric Key Cryptosystems" (2015). Electronic 
Thesis and Dissertation Repository. 2927. 
https://ir.lib.uwo.ca/etd/2927 
This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted 
for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of 
Scholarship@Western. For more information, please contact wlswadmin@uwo.ca. 
HARDWARE IMPLEMENTATIONS FOR SYMMETRIC KEY
CRYPTOSYSTEMS
(Thesis format: Monograph)
by
Hayssam El-Razouk
Graduate Program in Electrical and Computer Engineering
A thesis submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
The School of Graduate and Postdoctoral Studies
The University of Western Ontario
London, Ontario, Canada
© Hayssam El-Razouk 2015
Abstract
The use of the global communications network to support new electronic applications is growing.
Many applications provided over this network involve the exchange of security-sensitive
information between different entities. Often, the communicating entities are located at
different places around the globe. This demands the deployment of mechanisms that provide
secure communications channels between these entities. For this purpose, cryptographic
algorithms are used by many of today’s electronic applications to maintain security.
Cryptographic algorithms provide a set of primitives for achieving different security goals
such as confidentiality, data integrity, authenticity, and non-repudiation. In general, two
main categories of cryptographic algorithms can be used to accomplish any of these security
goals, namely, asymmetric key algorithms and symmetric key algorithms. The security of
asymmetric key algorithms is based on the hardness of the underlying computational problems,
which usually incur a large overhead in space and time complexities. On the other hand, the
security of symmetric key algorithms is based on non-linear transformations and permutations,
which allow efficient implementations compared to the asymmetric key ones. Therefore, it is
common to use asymmetric key algorithms for key exchange, while their symmetric key
counterparts are deployed to secure the communications sessions. This thesis focuses on
finding efficient hardware implementations for symmetric key cryptosystems targeting mobile
communications and resource constrained applications.
First, efficient lightweight hardware implementations of two members of the Welch-Gong (WG)
family of stream ciphers, WG(29, 11) and WG-16, are considered for the mobile communications
domain. Optimizations in the WG(29, 11) stream cipher are considered when the $GF(2^{29})$
elements are represented in either the optimal normal basis of type II (ONB-II) or the
polynomial basis (PB). For WG-16, optimizations are considered only for PB representations of
the $GF(2^{16})$ elements. In this regard, optimizations for both ciphers are accomplished
mainly at the arithmetic level by reducing the number of field multipliers, based on novel
trace properties. In addition, other optimization techniques, such as serialization and
pipelining, are also considered.
After this, the thesis explores efficient hardware implementations for digit-level
multiplication over binary extension fields $GF(2^m)$. Efficient digit-level $GF(2^m)$
multiplications are advantageous for ultra-lightweight implementations, not only in symmetric
key algorithms but also in asymmetric key algorithms. The thesis introduces new architectures
for digit-level $GF(2^m)$ multipliers considering the Gaussian normal basis (GNB) and PB
representations of the field elements. The new digit-level $GF(2^m)$ single multipliers do not
require loading the two input field elements before the computation starts. This feature
results in high-throughput, fast multiplication in resource constrained applications with
limited capacity of input data-paths. The new digit-level $GF(2^m)$ single multipliers are
considered for both the GNB and PB. In addition, for the GNB representation, new architectures
for digit-level $GF(2^m)$ hybrid-double and hybrid-triple multipliers are introduced. The new
digit-level $GF(2^m)$ hybrid-double and hybrid-triple GNB multipliers, respectively, accomplish
the multiplication of three and four field elements using the latency required for multiplying
two field elements. Furthermore, a new hardware architecture for the eight-ary exponentiation
scheme is proposed by utilizing the new digit-level $GF(2^m)$ hybrid-triple GNB multipliers.
Keywords: Digit-Level Multipliers, Finite Fields, Finite Field Exponentiation, Gaussian
Normal Basis, Hybrid-Double Multiplication, Linear Feedback Shift Registers, Normal Basis,
Optimal Normal Basis, Polynomial Basis, Pseudo Random Key Generators, Serial Multiplica-
tion, Stream Ciphers, Trace Function, WG Transformation.
Co-Authorship
I would like to thank Dr. Guang Gong, from the Department of Electrical and Computer
Engineering at the University of Waterloo, for her constructive input during our different
discussions and throughout the writing and revision phases of the published and accepted
versions of Chapters 3 and 4 of this thesis.
Dedications
To my great parents, to my lovely wife Shaima, to my little precious girl Quds, and to all my
bigger family.
Acknowledgements
First of all, all praise and thanks are due to my Lord.
I would like to thank my parents, my wife Shaima, my beautiful daughter Quds, and my
sisters for all their support, and prayers.
I would like to thank my supervisor, Dr. Arash Reyhani-Masoleh, for his continuous support
and guidance throughout the course of my PhD studies.
I would also like to thank the examiners, Dr. Anestis Dounavis, Dr. Abdelkader Ouda, Dr.
Marc Moreno Maza, and Dr. Majid Ahmadi, for taking the time and effort to read my PhD
thesis and provide their constructive comments.
Last but not least, I would like to thank my colleagues, Behdad Husseini, Depanwita
Gangopadhyay, Ebrahim Hassan, and Sasan Khoshroo, for all the constructive and fruitful
discussions we had.
Contents
Abstract
Co-Authorship
Dedications
Acknowledgements
List of Figures
List of Tables
Nomenclature
1 Introduction
1.1 Objectives
1.2 Motivations and Significance
1.3 Thesis Outline
2 Background on $GF(2^m)$ and WG Stream Ciphers
2.1 Preliminaries on Algebraic Structures
2.2 Finite Fields
2.2.1 Fields
2.2.2 Finite Fields
2.3 Modular Arithmetic
2.4 Polynomial Rings
2.5 Binary Extension Fields $GF(2^m)$
2.6 Polynomial Basis (PB) Representation
2.7 Gaussian Normal Basis (GNB) Representation
2.8 Addition over $GF(2^m)$
2.9 Multiplication over $GF(2^m)$
2.9.1 $GF(2^m)$ Multiplication in the PB Representation
2.9.1.1 Multiplication of Two Arbitrary $GF(2^m)$ Elements Represented in the PB
2.9.1.2 Efficient Multiplication of a $GF(2^m)$ Element Represented in the PB by $\alpha^q$
2.9.1.3 Previous Work on PB Multiplication
2.9.2 $GF(2^m)$ Multiplication in the GNB Representation
2.9.2.1 Formulation for the BL-PISO $GF(2^m)$ Multiplication in the GNB Representation
2.9.2.2 Multiplication by the Normal Element $\beta$
2.9.2.3 Previous Work on GNB Multiplication
2.10 Exponentiation and Inverse over $GF(2^m)$
2.11 Trace Mapping
2.12 Welch-Gong (WG) Stream Ciphers
2.12.1 Stream Ciphers
2.12.2 WG Stream Ciphers
2.12.2.1 A General Block Diagram
2.12.2.2 Phases of Operation
2.12.2.3 WG(29, 11) and WG-16
2.12.2.4 Parameters of the WG(29, 11)
2.12.2.5 Parameters of the WG-16
3 Implementations of the WG Stream Ciphers Using ONB-II
3.1 Optimized Hardware Design of the MOWG(29, 11, 17) Cipher
3.1.1 Reducing the Hardware Complexity of the MOWG Transformation
3.1.2 Improving the Critical Path of the MOWG Transform
3.1.2.1 Formulation
3.1.2.2 Modified KIA Algorithm
3.1.3 Architecture
3.1.4 The Finite State Machine
3.1.5 Space and Time Complexities
3.1.5.1 Space Complexity
3.1.5.2 Time Complexity
3.2 Low Complexity WG Cipher
3.2.1 Properties of the Trace Function for Type-II ONB
3.2.2 Optimizing the WG Transform’s Hardware for the Run Phase
3.2.3 Serializing the Computation of the Initial Feedback Signal
3.2.3.1 Architecture and Operation of the Modified FSM
3.2.3.2 Architecture and Operation of the Serialized Key Initialization Module
3.2.4 Space and Time Complexities
3.2.4.1 Space Complexity
3.2.4.2 Time Complexity
3.3 Results and Comparisons
3.3.1 Results from FPGA and ASIC Implementations
3.3.2 Discussion
3.4 Conclusion
4 Implementations of the WG Stream Ciphers Using PB
4.1 Architectures of the WG(29, 11) Stream Cipher
4.1.1 Formulation of WGT29
4.1.2 Design Parameters
4.1.2.1 Field Polynomial and Squaring Matrices
4.1.2.2 Characteristic Polynomial of the LFSR
4.1.2.3 Trace Vector
4.1.2.4 Trace of Multiplication of Two Field Elements
4.1.3 Architecture and FSM
4.1.3.1 Architecture of the WG(29, 11) Cipher
4.1.3.2 The Finite State Machine (FSM)
4.1.4 Serialized Implementation of the PB Based WG(29, 11)
4.1.4.1 Architecture of the Serialized WG(29, 11)
4.1.4.2 FSM for the Serialized PB Based WG(29, 11)
4.1.5 Pipelined Implementation of the PB Based WG(29, 11)
4.1.5.1 Architecture of the Pipelined PB Based WG(29, 11)
4.1.5.2 FSM for the Pipelined PB Based WG(29, 11)
4.2 Architectures of the WG-16 Stream Cipher
4.2.1 Formulations of WGP16 and WGT16
4.2.2 Squaring Matrices and Trace Vector
4.2.2.1 Squaring Matrices
4.2.2.2 Trace Vector
4.2.3 Trace of the Multiplication of Two Field Elements for the PB Based WG-16
4.2.4 Architecture and FSM
4.2.4.1 Architecture of the WG-16 Cipher
4.2.4.2 The Finite State Machine
4.2.5 Serialized Implementation of the PB Based WG-16
4.2.5.1 Architecture of the Serialized WG-16
4.2.5.2 FSM for the Serialized WG-16
4.2.6 Pipelined Implementation of the PB Based WG-16
4.2.6.1 Architecture
4.2.6.2 FSM for the Pipelined WG-16
4.3 Implementation Results and Comparisons
4.3.1 ASIC Implementations
4.3.2 Results and Comparisons
4.4 Conclusion
5 Digit-Level Architectures for $GF(2^m)$ Multiplication in the GNB
5.1 Proposed DL-FSIPO Single GNB Multipliers
5.1.1 Proposed MSD DL-FSIPO Single GNB Multiplier
5.1.1.1 Formulations
5.1.1.2 Architecture
5.1.1.3 Space and Time Complexities
5.1.1.4 Bit-Level Case
5.1.2 Proposed LSD DL-FSIPO Single GNB Multiplier
5.1.2.1 Formulations
5.1.2.2 Architecture
5.1.2.3 Space and Time Complexities
5.2 Proposed DL-PISO Single GNB Multiplier
5.2.1 Architecture
5.2.2 Space and Time Complexities
5.3 Proposed Digit-Level Hybrid-Double and Hybrid-Triple GNB Multipliers
5.3.1 Proposed MSD DL-SIPO Hybrid-Double GNB Multiplier
5.3.2 Proposed DL-PIPO Hybrid-Triple GNB Multiplier
5.3.3 Space and Time Complexity Analysis
5.3.4 Hybrid Versus Single Digit-Level GNB Multipliers
5.4 Proposed Architecture for Field Exponentiation
5.5 Conclusion
6 Digit-Level Architectures for $GF(2^m)$ Multiplication in the PB
6.1 Proposed MSD DL-FSIPO PB Multiplier
6.1.1 Formulations
6.1.2 Architecture
6.1.3 Space and Time Complexities
6.2 Proposed LSD DL-FSIPO PB Multiplier
6.2.1 Formulations
6.2.2 Architecture
6.2.3 Space and Time Complexities
6.3 Comparisons
6.4 Conclusion
7 Summary and Future Work
7.1 Summary of Contributions
7.2 Future Work
Bibliography
Curriculum Vitae
List of Figures
2.1 A stream cipher is used for providing privacy over an insecure channel between two communicating entities.
2.2 A general block diagram of a WG stream cipher.
3.1 Proposed MOWG transformation.
3.2 Proposed design of the MOWG(29, 11, 17) cipher.
3.3 FSM of the MOWG.
3.4 The proposed design of the WG transformation.
3.5 Modified FSM after adding the new 3-bit one-hot counter.
3.6 Block diagram of the SKIM module.
3.7 The proposed WG transformation after integration with the SKIM module.
3.8 Serial Implementation of MOWG/WG Stream Ciphers.
4.1 Contributions of this work.
4.2 The matrix S for WG(29, 11).
4.3 Architecture of the WG(29, 11) stream cipher.
4.4 Architecture of the $2^{10} - 1$ module.
4.5 FSM for the PB based implementation of the WG(29, 11) stream cipher.
4.6 Architecture of the serial WGP29/WGT29 implementation.
4.7 a) Architecture of the FSM for the serialized implementation of the WG(29, 11). b) Generating the Clock Enable Control Signals and the Multiplexers’ Selectors.
4.8 Pipelined version of the WGT29.
4.9 Architecture of the FSM for the pipelined version of the WG(29, 11).
4.10 Clock enable control signals for the pipelined version of the WG(29, 11).
4.11 The matrix S for WG-16.
4.12 a) Architecture of the WG-16. b) Generation of the signal $Y^s$ ($s = 2^5 - 1$). c) Generation of the signal $(A_{i+31})^{1057}$.
4.13 Architecture of the serial implementation for the PB based design of the WG-16.
4.14 Generating the Clock Enable Control Signals and the Multiplexers’ Selectors for the serial version of the WG-16.
4.15 Pipelined version of the WG-16 transform.
4.16 Generating the clock enable signals and the ctrl0 and ctrl1 signals for the pipelined version of the WG-16.
5.1 Summary of contributions.
5.2 (a) Architecture of the proposed MSD DL-FSIPO single GNB multiplier. (b) Architecture of $r_j$. (c) Architecture of  j.
5.3 Architecture of the proposed LSD DL-FSIPO single GNB multiplier.
5.4 (a) The proposed architecture of the MSD DL-PISO single GNB multiplier.
5.5 Architectures of the proposed MSD DL-SIPO hybrid-double GNB multiplier. (a) Low area design. (b) High speed design.
5.6 Architectures of the proposed MSD DL-PIPO hybrid-triple GNB multiplier. (a) Low area design. (b) High speed design.
5.7 Architecture of the proposed eight-ary field exponentiation scheme.
6.1 (a) Architecture of the proposed MSD DL-FSIPO PB multiplier. (b) Detailed architecture of $\triangle_j$. (c) Architecture of  module.
6.2 The state of the corresponding $GF(2^3)$ MSD DL-FSIPO PB multiplier for Example 6.1.3, throughout the different iterations of the computation. (a) initial state. i = 0. (b) state after first clock cycle. i = 1. (c) state after second clock cycle. i = 2. (d) state after third clock cycle.
6.3 (a) Architecture of the proposed LSD DL-FSIPO PB multiplier. (b) Detailed architecture of $\triangle'_j$ at i-th iteration.
6.4 The state of the corresponding $GF(2^3)$ LSB BL-FSIPO PB multiplier for Example 6.2.3, throughout the different iterations of the computation. (a) initial state. i = 0. (b) state after first clock cycle. i = 1. (c) state after second clock cycle. i = 2. (d) state after third clock cycle.
6.5 Multiplying an arbitrary $GF(2^m)$ element by the constant $\alpha^q$, where $p(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ is the field’s generating irreducible polynomial with $\omega$ nonzero terms, under the condition on $q$ and $t_1$ given in (6.11).
6.6 Normalized throughput as a function of the digit size for the serial inputs loading case.
6.7 Normalized throughput as a function of the digit size for the parallel inputs loading case.
List of Tables
1.1 Comparison of strength of AES, RSA, DSA, and ECDSA [18].
2.1 Addition over $GF(5)$.
2.2 Subtraction over $GF(5)$.
2.3 Multiplication over $GF(5)$.
3.1 Phase of operation in the proposed MOWG as a function of the state of the 2-bit binary counter.
3.2 Count of 1-bit registers and logic gates in the different components of the proposed MOWG design.
3.3 Signals s0 and s1 as a function of the output of the 3-bit one-hot counter.
3.4 Multiplexers outputs and next states of Register 1 and Register 2 as a function of s0 and s1.
3.5 Count of 1-bit registers and logic gates in the components of the proposed WG(29, 11).
3.6 Results obtained from ASIC implementations.
3.7 Results obtained from FPGA implementations.
4.1 The space and time complexities of the different squaring matrices used in the WG(29, 11).
4.2 Computation of the IF = WGP29 signal over 3 clock cycles during the initialization phase.
4.3 Phase of operation in the proposed PB based WG designs as a function of the state of the 2-bit binary counter.
4.4 Steps for computing the WGP29 and WGT29 in the serial implementation of the WG(29, 11) design.
4.5 Space and propagation delay complexities of the different squaring matrices used in the WG-16.
4.6 Computation of the WGP16 signal over 3 clock cycles.
4.7 Computing WGP16 and WGT16 in the serial implementation of WG-16.
4.8 Results obtained for area and speed from the ASIC implementations.
5.1 Steps for multiplication of the two $GF(2^3)$ elements $A = B = \beta^{2^2} = (0, 0, 1)$.
5.2 Space complexity of digit-level single GNB multipliers.
5.3 Time complexity of digit-level single GNB multipliers.
5.4 Space and time complexity readings for the case of type-4 GNB of $GF(2^{163})$ digit-level single multipliers.
5.5 Steps for multiplication of the two $GF(2^3)$ elements $A = B = \beta^{2^2} = (0, 0, 1)$.
5.6 Space complexity of the digit-level hybrid-double and hybrid-triple GNB multipliers.
5.7 Time complexity of the digit-level hybrid-double and hybrid-triple GNB multipliers.
6.1 Example 6.1.3 for multiplying the two $GF(2^3)$ elements $A = \alpha = (0, 1, 0)$ and $B = \alpha^2 = (1, 0, 0)$ using (6.1) and (6.2).
6.2 Example 6.2.3 for multiplying the two $GF(2^3)$ elements $A = \alpha = (0, 1, 0)$ and $B = \alpha^2 = (1, 0, 0)$ using (6.6) and (6.7).
6.3 Space and time complexities of the different digit-level $GF(2^m)$ PB multipliers.
6.4 Space and time complexities for the NIST recommended field $GF(2^{233})$ defined by the irreducible trinomial $x^{233} + x^{t_1} + 1$, where $t_1 = 74$ and the digit size is $d = 1$.
6.5 Space and time complexity estimates for the multipliers which are listed in Table 6.4 based on the standard 65nm CMOS library measures.
Nomenclature
3GPP                          The 3rd generation partnership project
4G                            Fourth generation mobile communications domain
AES                           Advanced encryption standard
ASIC                          Application specific integrated circuits
CMOS                          Complementary metal-oxide-semiconductor technology
DL                            Digit-level
Double field multiplication   Multiplication of three field elements
DSA                           Digital signature algorithm
DSS                           Digital signature standard
ECDSA                         Elliptic curve digital signature algorithm
FF                            Flip-Flop
FLT                           Fermat’s Little Theorem
FPGA                          Field programmable gate array
FSIPO                         Fully-serial-in-parallel-out
FSM                           Finite state machine
$GF(2)$                       The binary finite field with elements 0 and 1
$GF(2^m)$                     Galois field with $2^m$ elements
GNB                           Gaussian normal basis
i.e.                          That is
IP                            Inner product
IP-networks                   Internet protocol networks
LFSR                          Linear feedback shift register
LSB                           Least significant bit first
LSD                           Least significant digit first
LTE                           Long term evolution
m-sequence                    Maximal length sequence
MOWG                          Multiple output-bits version of the WG stream cipher
MPD                           Maximum propagation delay
MSB                           Most significant bit first
MSD                           Most significant digit first
MUX                           Multiplexer
NB                            Normal Basis
NIST                          National institute of standards and technology
ONB-II                        Optimal normal basis of type 2
PB                            Polynomial Basis
PD                            Propagation delay
PIPO                          Parallel-in-parallel-out
PISO                          Parallel-in-serial-out
PRSG                          Pseudo random sequence generator
RFID                          Radio frequency identification
ROM                           Read-only memory
Single field multiplication   Multiplication of two field elements
SIPO                          Serial-in-parallel-out
SNOW 3G                       The stream cipher SNOW 3G, which is included in the 4G network domain’s cipher suite
SSL                           Secure sockets layer
TLS                           Transport layer security
TP                            Throughput: number of output bits per second
Tr(A)                         The trace of a field element A, which is a mapping from $GF(2^m)$ to either 0 or 1
Triple field multiplication   Multiplication of four field elements
w.r.t                         With respect to
WEP                           Wired equivalent privacy
WG                            Welch-Gong
WG Stream Cipher              Welch-Gong transform based Stream Cipher
WG(29, 11)                    WG stream cipher with an LFSR of length 11 over $GF(2^{29})$
WG(m, l)                      WG stream cipher with an LFSR of length l over $GF(2^m)$
WGP                           Welch-Gong (WG) permutation
WGT                           Welch-Gong (WG) transform
WPA                           Wi-Fi protected access
XOR                           Logical binary exclusive OR operation
XST                           Xilinx synthesis tool
ZUC                           The stream cipher ZUC, which is included in the 4G network domain’s cipher suite
Chapter 1
Introduction
Cryptographic algorithms play an essential role in the security of communications systems. In
general, deployed cryptosystems are divided into two categories: asymmetric key and symmetric
key [82]. Schemes from the former category require relatively high space and time
complexities for their hardware implementations. Hence, asymmetric key cryptosystems are
usually deployed in key set-up mechanisms. On the other hand, schemes from the latter category
offer hardware implementations which require relatively lower complexities in terms of space
and time. Hence, symmetric key cryptosystems are usually used for providing security services
during communications sessions. The lower implementation cost of symmetric key systems is due
to the simpler underlying mathematical constructs, in addition to the smaller key sizes
required by these cryptosystems to achieve a given security level compared to the asymmetric
key ones. Table 1.1 presents a comparison of the key sizes required by AES, RSA, DSA, and
ECDSA in order to achieve several security levels (in terms of bits of security). In this
table, k indicates the key size for AES, while it refers to the size of the modulus in RSA.
l and n, respectively, indicate the sizes of the public and private keys for DSA. For ECDSA,
f is considered to be the key size.
Bits of Security   AES       RSA         DSA                  ECDSA
128                k = 128   k = 3072    l = 3072, n = 256    $256 \le f \le 383$
192                k = 192   k = 7680    l = 7680, n = 384    $384 \le f \le 511$
256                k = 256   k = 15360   l = 15360, n = 512   $f \ge 512$
Table 1.1: Comparison of strength of AES, RSA, DSA, and ECDSA [18].
Ecient hardware implementations of the deployed cryptosystems are necessary for their
practical use. In other words, the implementation of a given cryptosystem should comply
with the performance requirements of the underlying communications system. Therefore, this
1
research focuses on finding ecient hardware implementations for symmetric key cryptosys-
tems, targeting mobile and resource constrained applications, as it is stated in the following
section.
1.1 Objectives
The goal of this research is to introduce efficient hardware implementations for symmetric key
cryptosystems targeting the mobile communications domain and resource constrained
applications, as follows.
First, this research aims for lightweight hardware implementations of the two classes
of the Welch-Gong (WG) family of stream ciphers, WG(29, 11) and WG-16. The targeted
lightweight implementations will provide trade-offs between space and time complexities for a
set of lightweight applications. Specifically, the WG-16 implementations are intended for the
Long Term Evolution (LTE) 4G mobile communications domain.
After this, the research focuses on finding new finite field constructs for higher throughput
in resource constrained applications. New architectures for higher throughput digit-level
field multiplications will be investigated for these applications. In particular, the research
addresses the reduced throughput of digit-level $GF(2^m)$ multiplication of two elements
(single multiplication) that is caused by preloading the inputs in applications where the
input data-path has limited capacity and the value of m (the dimension of the binary extension
field) is large. This issue will be considered for both GNB and PB representations.
Furthermore, new architectures for higher throughput concurrent multiplications of three
(hybrid-double multiplication) and four (hybrid-triple multiplication) field elements will
also be explored for the GNB representation. New field exponentiation architectures based on
the eight-ary scheme [42] will be considered as a practical application of the hybrid-triple
multipliers.
The following section highlights the motivations and significance of the research.
1.2 Motivations and Significance
This section states the importance and significance of the underlying research.
Stream ciphers are symmetric key cryptosystems which are attractive for protecting the
wireless communications domain [24]. This is because stream ciphers prevent error
propagation at the receiving end. In this context, a stream cipher is required to provide the
desired security, generate key-stream sequences with good randomness properties, and show
efficient performance [24]. In addition, stream ciphers can also be used as random number
generators for other algorithms (for example, generating random numbers for the Digital
Signature Standard "DSS" [12]). In this type of application, the randomness of the generated
key sequences is critical. This thesis considers the two classes WG(29, 11) and WG-16 of
stream ciphers. In addition to their resistance to all known attacks, to the best of the author's
knowledge, these two ciphers provide a set of desired randomness properties which cannot be
offered by other existing ciphers [40, 68]. Therefore, new lightweight hardware implementations
of the WG(29, 11) and WG-16 stream ciphers are introduced. The new designs provide trade-offs
between randomness properties and performance for a selection of cryptosystems. In particular,
the new hardware implementations of the WG-16 cipher provide different space options while
complying with the throughput requirements of the 4G domain. This makes the WG-16 cipher an
interesting candidate for securing the 4G mobile domain.
The thesis then considers efficient digit-level $GF(2^m)$ multipliers. Digit-level field
multipliers reduce space complexity at the cost of lower throughput. Hence, digit-level field
multipliers are important for resource constrained applications. Any improvement in this
operation is of great value to a wide range of applications, such as symmetric and asymmetric
crypto algorithms, error coding, random number generation, and digital signal processing.
In this context, this research proposes new architectures for single (multiply two elements),
hybrid-double (multiply three elements), and hybrid-triple (multiply four elements) digit-level
$GF(2^m)$ multipliers, as follows.
New GNB and PB single multipliers are introduced, targeting higher throughput for digit-
level field multiplications in resource constrained applications. In particular, for cases where
the input data-path has limited capacity, higher throughput is achieved through removing the
requirement for inputs preloading.
In addition, new hybrid-double and hybrid-triple GNB multipliers are introduced. The new
hybrid architectures improve the throughput of concurrent multiplications where three and four
field elements are multiplied at the same time. Also, these hybrid multiplier architectures are
advantageous for improving throughput of the important operations of field inversion [51]. It
is noted that, the hybrid-triple multiplier is proposed for the first time in the literature.
As another practical application for the hybrid-triple multipliers, field exponentiation is
an essential operation in asymmetric key cryptography (such as the Diffie-Hellman key exchange
algorithm [29]) and symmetric key cryptography (WG, for example). The proposed hybrid-triple
GNB multipliers are utilized in constructing new eight-ary exponentiation schemes. This
results in hardware exponentiation architectures which run at the same latencies as the
existing eight-ary designs, but without requiring any initial precomputation phase or any
storage of intermediate variables.
The following section outlines the rest of the thesis.
1.3 Thesis Outline
The thesis is outlined as follows. Chapter 1 highlights the objectives, motivations, and
significance, and outlines the remaining chapters. Chapter 2 presents a brief overview of
binary extension fields and WG stream ciphers, sufficient for understanding the rest of this
thesis. Chapter 3 introduces efficient hardware implementations for the WG(29, 11) stream
cipher based on the Optimal Normal Basis Type-II (ONB-II) representation of the field
elements. Chapter 4 introduces efficient hardware implementations for the WG(29, 11) and
WG-16 stream ciphers based on the Polynomial Basis (PB) representation of the field elements.
Chapter 5 proposes new architectures for digit-level single, hybrid-double, and hybrid-triple
field multiplications and exponentiation based on the Gaussian Normal Basis (GNB)
representation. Chapter 6 proposes new architectures for digit-level single field
multiplications based on the Polynomial Basis (PB) representation. Chapter 7 highlights the
contributions of this thesis and lists some future work.
Chapter 2
Background on $GF(2^m)$ and WG Stream Ciphers
This chapter gives a brief introduction to binary extension fields and Welch-Gong (WG)
stream ciphers.
Arithmetic operations over binary extension fields are extensively encountered in the next
four chapters. This chapter presents the necessary background on binary extension fields,
sufficient for clarifying the contents of the remaining chapters. Other references can
be consulted for further reading on finite fields, for example, [82, 59]. In the following
sections, some preliminary definitions are first given. This is followed
by introducing finite fields, modular arithmetic, polynomial rings, and binary extension fields.
After this, the Polynomial basis (PB) and Gaussian normal basis (GNB) representations
of binary extension field elements are discussed. Then, field operations over binary extension
fields are reviewed. At the end of the sections dedicated to reviewing $GF(2^m)$, the trace
mapping is presented. It is noted that the material presented throughout
the above-mentioned sections is a summary based on [82, 59, 5].
After introducing $GF(2^m)$, the chapter introduces WG stream ciphers. First, a brief
introduction to stream ciphers is given. This is followed by an overview of WG stream ciphers,
covering the general WG stream cipher block diagrams and phases of operation, after which the
two classes WG(29, 11) and WG-16 are discussed.
2.1 Preliminaries on Algebraic Structures
Before starting the presentation about finite fields, the following definitions are required.
Definition 2.1.1 The algebraic structure $(G, \star)$, which consists of a set $G$ and a binary
operation $\star$, is said to be a semigroup if:
• $\star$ is closed over $G$, that is: for any $g_1, g_2 \in G$, $g_1 \star g_2 \in G$.
• $\star$ is associative over $G$, that is: for any $g_1, g_2, g_3 \in G$,
$(g_1 \star g_2) \star g_3 = g_1 \star (g_2 \star g_3)$.
Definition 2.1.2 The algebraic structure $(G, \star)$ forms a group if it is a semigroup, and:
• $G$ contains an identity element $e$ with respect to $\star$, where: for any $g_1 \in G$,
$g_1 \star e = e \star g_1 = g_1$.
• For each element $g_1 \in G$, there exists an inverse element $g_2 \in G$ with respect to
$\star$, where: $g_1 \star g_2 = g_2 \star g_1 = e$.
Definition 2.1.3 The algebraic structure $(G, \star)$ is called an abelian group if it is a group, and:
• $\star$ is commutative over $G$, that is: for any $g_1, g_2 \in G$, $g_1 \star g_2 = g_2 \star g_1$.
Definition 2.1.4 The algebraic structure $(R, +, \cdot)$, which consists of a set $R$ and the two
binary operations $+$ (additive) and $\cdot$ (multiplicative), is called a ring if:
• $(R, +)$ is an abelian group. The additive identity element is usually denoted by $0$. The
additive inverse of any element $r_1 \in R$ is denoted by $-r_1$.
• $(R, \cdot)$ is a semigroup.
• $\cdot$ is distributive over $+$, that is: for any $r_1, r_2, r_3 \in R$,
$r_1 \cdot (r_2 + r_3) = r_1 \cdot r_2 + r_1 \cdot r_3$, and
$(r_2 + r_3) \cdot r_1 = r_2 \cdot r_1 + r_3 \cdot r_1$.
Definition 2.1.5 A homomorphism is a map between two algebraic structures (such as groups,
rings, and so on) which preserves the operations. For example, for the two
groups $(G, \star)$ and $(H, \triangle)$, the mapping $\phi : G \to H$ is a homomorphism if
$\phi(g_1 \star g_2) = \phi(g_1) \triangle \phi(g_2)$ for all $g_1, g_2 \in G$. If, in addition,
$\phi$ is surjective (onto, that is, every element in $H$ has at least one corresponding element
in $G$) and one-to-one, then $\phi$ is an isomorphism.
The following section introduces finite fields.
2.2 Finite Fields
2.2.1 Fields
An algebraic structure $(F, +, \cdot)$, constructed from a set $F$ and the two binary operations
$+$ and $\cdot$, forms a field if:
• $(F, +)$ is an abelian group.
• $(F \setminus \{0\}, \cdot)$ is an abelian group. The multiplicative identity element is usually
denoted by $1$. The multiplicative inverse of any nonzero element $f_1 \in F$ is denoted by $f_1^{-1}$.
• Multiplication is distributive over addition.
• There exist no zero divisors in $F$, that is, for any $f_1, f_2 \in F$, if $f_1 \cdot f_2 = 0$,
then either $f_1 = 0$ or $f_2 = 0$.
2.2.2 Finite Fields
A finite field, also known as a Galois field and denoted by $GF(q)$ (or $\mathbb{F}_q$), is a
field $(\mathbb{F}_q, +, \cdot)$, where $\mathbb{F}_q$ is a set with a finite number $q$ of
elements. The order of $GF(q)$, that is the number of field elements
$q$, is a positive integer which is either a prime or a power of a prime (including powers of 2).
In $GF(q)$, there is a zero element, while the remaining $q - 1$ elements form the multiplicative
group of the field (all elements which have multiplicative inverses). For a given order $q$, there
might be more than one representation of the corresponding finite field $GF(q)$. However, all
finite fields of a given order are isomorphic (have the same structure).
In simple words, a finite field is a set with finitely many elements over which one can
perform the operations of addition, subtraction, multiplication, and division, while staying in
the same set.
The following section introduces modular arithmetic and highlights the relation between
the algebraic structure $(\mathbb{Z}_p, +, \cdot)$ and finite fields of the form $GF(p)$, where
$\mathbb{Z}$ denotes the set of integers, $p$ is a prime number, and $\mathbb{Z}_p$ is the set of
non-negative integers less than $p$.
2.3 Modular Arithmetic
This section gives a brief presentation of modular arithmetic. Modular addition and
multiplication are carried out as they are over the set of integers $\mathbb{Z}$, followed by
reducing the result modulo an integer $m > 0$. The integer $m$ is referred to as the modulus.
Applying the "modulo $m$" operator to a given integer $v \in \mathbb{Z}$, written as $v \bmod m$,
returns the remainder of dividing $v$ by $m$ (long division). For example, let $v = 7$ and
$m = 3$. Then, $7 \bmod 3 = 1$, since $7 = 2 \cdot 3 + 1$. If there exists a $v' \in \mathbb{Z}$
such that $v \bmod m = v' \bmod m$, one can also write $v \equiv v' \pmod{m}$. The latter
expression is read as: $v$ is equivalent (or congruent) to $v'$ modulo $m$. In this expression,
$\equiv$ is known as the equivalence (or congruence) operator.
If $v \equiv 0 \pmod{m}$, that is, $v \bmod m = 0$, then $m$ divides $v$, which is simply written
as $m \mid v$. If $v \equiv w \pmod{m}$, that is, $v \bmod m = w \bmod m$, then $m \mid |v - w|$,
where $|\cdot|$ denotes the absolute value operator. The following are some properties of
congruences modulo an integer $m > 0$:
• $\forall v \in \mathbb{Z}$, $v \equiv v \pmod{m}$.
• $\forall v, w \in \mathbb{Z}$, $v \equiv w \pmod{m} \implies w \equiv v \pmod{m}$.
• $\forall v, w, x \in \mathbb{Z}$, if $v \equiv w \pmod{m}$ and $w \equiv x \pmod{m}$, then
$v \equiv x \pmod{m}$.
• $\forall v, w, x, y \in \mathbb{Z}$, if $v \equiv w \pmod{m}$ and $x \equiv y \pmod{m}$, then
$v + x \equiv w + y \pmod{m}$ and $v \cdot x \equiv w \cdot y \pmod{m}$.
• If $n$ is a positive integer such that $n \mid m$, then $\forall v, w \in \mathbb{Z}$ where
$v \equiv w \pmod{m}$, one has $v \equiv w \pmod{n}$.
• Let $n$ be a positive integer such that the greatest common divisor of $m$ and $n$ is
1, that is, $\gcd(m, n) = 1$. Then, $\forall v, w \in \mathbb{Z}$, if $v \equiv w \pmod{m}$ and
$v \equiv w \pmod{n}$, then $v \equiv w \pmod{mn}$.
Now, denote by $\mathbb{Z}_m$ the set of residue classes modulo $m$. That is,
$\mathbb{Z}_m = \{0, \ldots, m - 1\}$ consists of the integers modulo $m$. In general, not all
elements of $\mathbb{Z}_m$ have multiplicative inverses. This is the reason why, in general,
$(\mathbb{Z}_m, +, \cdot)$ forms a commutative ring and not a field. Only elements
of $\mathbb{Z}_m$ which are relatively prime to $m$ have multiplicative inverses. Therefore, if
$m = p$ is a prime, then multiplicative inverses exist for all non-zero elements in
$\mathbb{Z}_p$. In this case, $\mathbb{Z}_p \setminus \{0\}$ is an abelian group under
multiplication, and hence, $(\mathbb{Z}_p, +, \cdot)$ forms a finite field (since
$(\mathbb{Z}_p, +)$ is also an abelian group, $\cdot$ is distributive over $+$, and there are no
zero divisors in $\mathbb{Z}_p$). The finite field $(\mathbb{Z}_p, +, \cdot)$ is isomorphic to
$GF(p)$. The following is an illustrative example of arithmetic operations over $GF(5)$. After
this, the next section introduces polynomial rings.
Example 2.3.1 The field $GF(5)$ contains the elements $\{0, 1, 2, 3, 4\}$. The additive and
multiplicative identities are 0 and 1, respectively. Arithmetic operations are performed modulo 5,
as shown for the cases of addition, subtraction, and multiplication in Tables 2.1, 2.2, and
2.3, respectively. Notice from Table 2.3 that all non-zero elements have multiplicative inverses.

+ | 0 1 2 3 4
--+----------
0 | 0 1 2 3 4
1 | 1 2 3 4 0
2 | 2 3 4 0 1
3 | 3 4 0 1 2
4 | 4 0 1 2 3
Table 2.1: Addition over GF(5).

− | 0 1 2 3 4
--+----------
0 | 0 4 3 2 1
1 | 1 0 4 3 2
2 | 2 1 0 4 3
3 | 3 2 1 0 4
4 | 4 3 2 1 0
Table 2.2: Subtraction over GF(5).
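As a quick cross-check of Example 2.3.1, the tables can be regenerated by reducing ordinary integer arithmetic modulo 5. The following is only a minimal sketch; the helper name `table` and the dictionary `inverses` are illustrative and not part of the thesis. Each entry is indexed as row operand first, column operand second.

```python
# Sketch: arithmetic tables for GF(p), p prime, built from integers modulo p.
p = 5  # the modulus used in Example 2.3.1


def table(op):
    """Return the p-by-p table whose entry [a][b] is op(a, b) reduced modulo p."""
    return [[op(a, b) % p for b in range(p)] for a in range(p)]


add_table = table(lambda a, b: a + b)
sub_table = table(lambda a, b: a - b)  # Python's % already yields a value in 0..p-1
mul_table = table(lambda a, b: a * b)

# Multiplicative inverse of each nonzero element: the column holding 1 in its row.
inverses = {a: mul_table[a].index(1) for a in range(1, p)}

for symbol, t in (("+", add_table), ("-", sub_table), ("*", mul_table)):
    print(symbol, *range(p))
    for a, row in enumerate(t):
        print(a, *row)
print("inverses:", inverses)
```

Every nonzero row of the multiplication table contains a 1, which is the property that fails for a composite modulus.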
2.4 Polynomial Rings
The algebraic structure $(R, +, \cdot)$ forms a ring if $(R, +)$ is an abelian group, $(R, \cdot)$
is a semigroup, and multiplication is distributive over addition. A polynomial ring is a structure
$(R[x], +, \cdot)$ such that
\[ R[x] = \left\{ \sum_{i=0}^{n} r_i x^i \;\middle|\; n \ge 0,\ r_i \in R \right\} \]
represents the set of all polynomials in the variable $x$ with coefficients from $R$, where
addition and multiplication, respectively, are defined as follows:
\[ \sum_{i=0}^{n'} r'_i x^i + \sum_{i=0}^{n''} r''_i x^i = \sum_{i=0}^{\max(n', n'')} \left( r'_i + r''_i \right) x^i, \]
and
\[ \left( \sum_{i=0}^{n'} r'_i x^i \right) \cdot \left( \sum_{i=0}^{n''} r''_i x^i \right) = \sum_{i=0}^{n' + n''} \left( \sum_{j+k=i} r'_j \cdot r''_k \right) x^i. \]
Notice that exact division is available in a field, whereas in a polynomial ring only long
division (division with remainder) is available. Long division requires inverting the leading
coefficient of the divisor, which is always possible for field elements but not for elements of
a general ring. Therefore, in order to have long division in a polynomial ring, the coefficients
need to be from a field and not merely a ring. Otherwise, long division might not be possible.
The following is an example showing operations done over elements of the polynomial ring
$(GF(2)[x], +, \cdot)$.

· | 0 1 2 3 4
--+----------
0 | 0 0 0 0 0
1 | 0 1 2 3 4
2 | 0 2 4 1 3
3 | 0 3 1 4 2
4 | 0 4 3 2 1
Table 2.3: Multiplication over GF(5).
Example 2.4.1 Let $f(x) = x^4 + 1$ and $g(x) = x^3 + x + 1$ be two elements from the polynomial
ring $(GF(2)[x], +, \cdot)$. $GF(2)$ is the binary finite field with elements 0 and 1, which is
isomorphic to $(\mathbb{Z}_2, +, \cdot)$. Notice that, over $GF(2)$: $1 + 1 = 0 + 0 = 0$ and
$0 + 1 = 1 + 0 = 1$, while $0 \cdot 0 = 0 \cdot 1 = 1 \cdot 0 = 0$ and $1 \cdot 1 = 1$. Then, one
conducts addition, multiplication, and division over these two elements as follows:
1) Addition:
\[ f(x) + g(x) = (x^4 + 1) + (x^3 + x + 1) = x^4 + x^3 + x. \]
2) Multiplication:
\[ f(x) \cdot g(x) = (x^4 + 1) \cdot (x^3 + x + 1) = x^7 + x^5 + x^4 + x^3 + x + 1. \]
3) Division (long division):
\[ x^4 + 1 = (x^3 + x + 1) \cdot (x) + (x^2 + x + 1), \]
where $q(x) = x$ is the quotient of $f(x)/g(x)$ and $r(x) = x^2 + x + 1$ is its remainder.
Notice that in the latter operation, the degree of $r(x)$ is less than that of $g(x)$, written as
$\deg(r(x)) < \deg(g(x))$. In general, either $r(x) = 0$ or $0 \le \deg(r(x)) < \deg(g(x))$. If
there exists a $g(x)$ with $0 < \deg(g(x)) < \deg(f(x))$ such that $r(x) = 0$, then
$g(x) \mid f(x)$, and hence, $f(x)$ is reducible over $GF(2)$. If there is no such $g(x)$ which
divides $f(x)$, then $f(x)$ is an irreducible polynomial over $GF(2)$.
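The three operations of Example 2.4.1 can be reproduced with a small sketch in which a polynomial over $GF(2)$ is stored as a Python integer whose bit $i$ is the coefficient of $x^i$. This encoding and the function names `poly_add`, `poly_mul`, and `poly_divmod` are assumptions made for illustration only.

```python
# Sketch: arithmetic in GF(2)[x] with polynomials encoded as integer bitmasks
# (bit i of the integer is the coefficient of x^i).

def poly_add(f, g):
    """Coefficient-wise addition modulo 2 is a bitwise XOR."""
    return f ^ g


def poly_mul(f, g):
    """Shift-and-add (carry-less) multiplication."""
    result = 0
    while g:
        if g & 1:
            result ^= f
        f <<= 1
        g >>= 1
    return result


def poly_divmod(f, g):
    """Long division: return (q, r) with f = q*g + r and deg(r) < deg(g)."""
    if g == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    q = 0
    deg_g = g.bit_length() - 1
    while f.bit_length() - 1 >= deg_g:
        shift = f.bit_length() - 1 - deg_g
        q ^= 1 << shift       # add x^shift to the quotient
        f ^= g << shift       # cancel the leading term of f
    return q, f


f = 0b10001  # x^4 + 1
g = 0b01011  # x^3 + x + 1

print(bin(poly_add(f, g)))   # x^4 + x^3 + x
print(bin(poly_mul(f, g)))   # x^7 + x^5 + x^4 + x^3 + x + 1
print(poly_divmod(f, g))     # quotient x, remainder x^2 + x + 1
```

With the same helpers one can observe that $x + 1$ divides $x^4 + 1$ with zero remainder, so $f(x)$ is reducible, while neither $x$ nor $x + 1$ divides $g(x)$.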
The following section introduces binary extension fields $GF(2^m)$ and shows how irreducible
polynomials are used to construct such fields.
2.5 Binary Extension Fields $GF(2^m)$
An extension field over the finite field $GF(p)$ is referred to as $GF(p^m)$. $GF(p)$ is called
the ground field. Specifically, for $p = 2$, $GF(2^m)$ is called the binary extension field (over
$GF(2)$). $GF(2^m)$ can be viewed as a vector space over $GF(2)$ of dimension $m$. Moreover,
$GF(2^m)$ is isomorphic to $\mathbb{Z}_2[x]/p(x)$ under polynomial addition and multiplication,
where $\mathbb{Z}_2[x]/p(x)$ denotes the set of polynomials in the variable $x$ with coefficients
from $\mathbb{Z}_2$ taken modulo an irreducible polynomial $p(x)$ of degree $m$. There are $2^m$
elements in $GF(2^m)$. Each one of these elements is uniquely represented by $m$ bits (elements of
$GF(2)$ with value 0 or 1) with respect to a basis. A basis is a set of $m$ linearly-independent
elements $\Gamma = \{\gamma_i \in GF(2^m) \mid 0 \le i < m\}$ [59]. Then, an element
$A \in GF(2^m)$ is represented with respect to $\Gamma$ as $A = \sum_{i=0}^{m-1} a_i \gamma_i$,
where $a_i \in GF(2)$ for $0 \le i < m$ are the binary coordinates of $A$ w.r.t. $\Gamma$. The
following two sections introduce two of the most common representations of the elements of
$GF(2^m)$.
2.6 Polynomial Basis (PB) Representation
The most straightforward representation for the elements of $GF(2^m)$ is obtained from the
isomorphism of $GF(2^m)$ to $\mathbb{Z}_2[x]/p(x)$. Here, $p(x)$ is an irreducible polynomial of
degree $m$ over $\mathbb{Z}_2$. That is, the coefficients of the terms of $p(x)$ are either 0 or 1.
Therefore, following this construction, the elements of $GF(2^m)$ are the binary representations
of all polynomials in $x$ of degree less than or equal to $m - 1$.
A PB follows the form $\{\alpha^{m-1}, \ldots, \alpha, 1\}$ and is constructed by finding a root
$\alpha \in GF(2^m)$ of an irreducible polynomial $p(x)$ of degree $m$ over $GF(2)$ [72]. Then,
using this PB, an element $A$ in $GF(2^m)$ is represented as $A = \sum_{i=0}^{m-1} a_i \alpha^i$,
where $a_i$ is either 0 or 1 for $0 \le i < m$. In vector representation, the element $A$ can also
be referred to as $A = (a_{m-1}, \ldots, a_0)$ [5].
2.7 Gaussian Normal Basis (GNB) Representation
On the other hand, a Normal basis (NB) is constructed by finding an element $\beta \in GF(2^m)$
such that the $m$ elements $\beta^{2^0}$ through $\beta^{2^{m-1}}$ are linearly-independent [59].
Then, the set $\{\beta^{2^0}, \ldots, \beta^{2^{m-1}}\}$ forms a NB, where $\beta$ is called a
normal element. In the NB, an element $A$ is represented as
$A = \sum_{i=0}^{m-1} a_i \beta^{2^i}$, with $a_i \in \{0, 1\}$ representing the $i$-th coordinate
of $A$ with respect to the NB, for $0 \le i < m$. In vector representation, the element $A$ can
also be represented as $A = (a_0, \ldots, a_{m-1})$ [5].
Gaussian normal bases (GNBs) are a special subset of NBs which offer field operations with
smaller area and time overhead compared to the general NB, when realized in hardware. A
GNB exists for all $GF(2^m)$ which satisfy the following conditions [15, 52, 70]:
1. $m$ is not divisible by 8, and
2. there exists a prime integer $p = Tm + 1$, where $T > 0$ is an integer, such that
$\gcd(Tm/k, m) = 1$, where $k$ is the order of 2 modulo $p$ (that is, $k$ is the smallest positive
integer with $2^k \equiv 1 \pmod{p}$).
Here, $T$ is called the type of the GNB. It is noted that for odd values of $m$ the type $T$ should
be even (since $p$ is an odd prime). It is also noted that smaller values of $T$ result in more
efficient hardware implementations of field multiplications. Similar to the NB representation,
any element $A \in GF(2^m)$ can be represented w.r.t. the GNB as
$A = \sum_{i=0}^{m-1} a_i \beta^{2^i} = (a_0, \ldots, a_{m-1})$, where $a_i \in \{0, 1\}$.
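The two existence conditions above translate directly into a search for the smallest admissible type $T$. The following is a minimal sketch under stated assumptions: the function names, the trial-division primality test, and the search bound `max_type` are illustrative choices, not part of the thesis or of any standard.

```python
# Sketch: find the smallest GNB type T for a given m by testing the two
# existence conditions (m not divisible by 8; p = T*m + 1 prime with
# gcd(T*m/k, m) = 1, where k is the order of 2 modulo p).
from math import gcd


def is_prime(n):
    """Trial division; adequate for the small values of p used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def order_of_two(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    k, x = 1, 2 % p
    while x != 1:
        x = (x * 2) % p
        k += 1
    return k


def gnb_type(m, max_type=64):
    """Smallest T <= max_type for which a type-T GNB exists, or None."""
    if m % 8 == 0:
        return None
    for T in range(1, max_type + 1):
        p = T * m + 1
        if p < 3 or not is_prime(p):
            continue
        k = order_of_two(p)          # k divides p - 1 = T*m
        if gcd(T * m // k, m) == 1:
            return T
    return None


for m in (163, 233):
    print(m, gnb_type(m))
```

For odd $m$ the search only ever returns even values of $T$, in agreement with the note above, because $Tm + 1$ is even whenever both $T$ and $m$ are odd.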
Common $GF(2^m)$ arithmetic operations include addition, multiplication, exponentiation,
and inversion. The following three sections present more details about arithmetic operations
over $GF(2^m)$ considering the two cases of PB and GNB representations.
2.8 Addition over $GF(2^m)$
Field addition of two arbitrary $GF(2^m)$ elements, say $A$ and $B$, is accomplished by a bit-wise
Exclusive-OR (XOR) operation on the corresponding coordinates of the added elements,
regardless of whether a PB or a GNB representation is used. That is:
\[ A + B = \sum_{i=0}^{m-1} (a_i + b_i)\, \gamma_i, \]
where $a_i$ and $b_i$ are the binary coordinates of $A$ and $B$ with respect to a given basis
$\Gamma = \{\gamma_0, \ldots, \gamma_{m-1}\}$.
2.9 Multiplication over $GF(2^m)$
On the other hand, the field multiplication of two arbitrary $GF(2^m)$ elements $A$ and $B$ is
accomplished as follows:
\[ AB = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_i b_j\, \gamma_i \gamma_j, \]
which is more expensive than field addition, and its complexity depends on the underlying
representation $\Gamma = \{\gamma_0, \ldots, \gamma_{m-1}\}$. It is known that the PB
representation offers more efficient hardware implementations of field multiplication than the
NB / GNB representation [71]. The following two sections give brief literature reviews of
$GF(2^m)$ multiplication in the PB and GNB representations, respectively, sufficient for the
purpose of this thesis.
2.9.1 $GF(2^m)$ Multiplication in the PB Representation
Two popular schemes for the multiplication of two $GF(2^m)$ elements in the PB representation
are the two-step classic multiplication scheme and the matrix-vector scheme [28]. The first
scheme starts by performing polynomial multiplication of the two input field elements; then,
the result is reduced modulo the irreducible field-defining polynomial [89]. The following
example illustrates this multiplication scheme.
Example 2.9.1 This example constructs the field $GF(2^3)$ using the irreducible polynomial
$p(x) = x^3 + x + 1$ over $GF(2)$. Denote by $\alpha$ the root of $p(x)$ over $GF(2^3)$. By using
$\alpha$, the polynomial basis $\{\alpha^2, \alpha, 1\}$ is constructed. Based on this polynomial
basis, any element $A$ from the $2^3 = 8$ elements of $GF(2^3)$ is defined by a unique set of 3
binary coordinates as $A = a_2 \alpha^2 + a_1 \alpha + a_0$. For example, by considering the
$GF(2^3)$ elements $\alpha^2 + \alpha + 1$ and $\alpha$, one has
$(\alpha^2 + \alpha + 1) \cdot (\alpha) = \alpha^3 + \alpha^2 + \alpha = \alpha^2 + 1$ over
$GF(2^3)$. The latter result is obtained after reducing $x^3 + x^2 + x$ by $p(x) = x^3 + x + 1$
(also, it can be obtained by noticing that $p(\alpha) = 0$, which results in
$\alpha^3 = \alpha + 1$).
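The two-step scheme of Example 2.9.1 can be sketched in software as a carry-less polynomial product followed by a reduction modulo $p(x)$. The bitmask encoding (bit $i$ holds $a_i$) and the names `clmul`, `reduce_mod`, and `gf_mul` are assumptions made for this illustration; the sketch models the arithmetic only, not any hardware architecture.

```python
# Sketch: two-step PB multiplication in GF(2^3) with p(x) = x^3 + x + 1.

def clmul(a, b):
    """Step 1: polynomial (carry-less) product over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_mod(c, p):
    """Step 2: remainder of c modulo the field polynomial p."""
    deg_p = p.bit_length() - 1
    while c.bit_length() - 1 >= deg_p:
        c ^= p << (c.bit_length() - 1 - deg_p)
    return c


def gf_mul(a, b, p):
    return reduce_mod(clmul(a, b), p)


P = 0b1011   # x^3 + x + 1
A = 0b111    # alpha^2 + alpha + 1
B = 0b010    # alpha

print(bin(clmul(A, B)))      # unreduced product x^3 + x^2 + x
print(bin(gf_mul(A, B, P)))  # reduced product alpha^2 + 1
```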
In the second scheme, known as the Mastrovito multiplier [62, 84, 44, 72], one performs
the field multiplication as a matrix-by-vector multiplication, in which both steps of
the former scheme are combined into a single step. In the following, the multiplication of two
arbitrary $GF(2^m)$ elements represented in the PB, based on the matrix-vector method, is first
briefly reviewed. This is followed by reviewing efficient ways for the hardware realization of
the fixed multiplication of an arbitrary $GF(2^m)$ element by $\alpha^q$, where $q$ is a positive
integer. After this, a quick review of previous work on PB multiplication is presented.
2.9.1.1 Multiplication of Two Arbitrary $GF(2^m)$ Elements Represented in the PB
This section reviews the multiplication of two arbitrary $GF(2^m)$ elements represented in the
PB based on the matrix-vector method. The formulations presented in this section are
utilized in Section 6.2.1 to accomplish the multiplication of an arbitrary $GF(2^m)$ element by
the constant $\alpha^{m-1}$. Here, $\alpha$ is a root of the irreducible polynomial
$p(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ which generates the field $GF(2^m)$.
$\omega = H(p(x))$ is the Hamming weight of the field polynomial,
denoting the number of nonzero terms in $p(x)$. First, the following notations are defined.
Definition 2.9.2 Define $v[\uparrow i]$ and $v[\downarrow i]$ to be the operations of up and down
$i$-bit shifts, respectively, of a given $m$-bit vertical vector
$v = [v_0 \ \ldots \ v_{m-1}]^T$, where the emptied positions
are filled with zeros ($T$ denotes vector transposition). That is,
\[ v[\downarrow i] = [\,0 \ \ldots \ 0 \ \ v_0 \ \ldots \ v_{m-i-1}\,]^T \]
and
\[ v[\uparrow i] = [\,v_i \ \ldots \ v_{m-1} \ \ 0 \ \ldots \ 0\,]^T. \]
Let $C = (c_{m-1}, \ldots, c_0)$ denote the result of multiplying two arbitrary $GF(2^m)$ elements
$A = (a_{m-1}, \ldots, a_0)$ and $B = (b_{m-1}, \ldots, b_0)$, represented in the PB. Therefore,
the $m$ binary coordinates of $C = AB \bmod p(\alpha)$, represented by the vertical $m$-bit vector
$c = [c_0 \ \ldots \ c_{m-1}]^T$, are obtained as follows [75]
\[ c = d + \sum_{j=0}^{\omega-2} e'[\downarrow t_j], \tag{2.1} \]
where
\[ d = \begin{bmatrix} a_0 & 0 & \cdots & 0 \\ a_1 & a_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{m-1} & a_{m-2} & \cdots & a_0 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{m-1} \end{bmatrix}, \tag{2.2} \]
\[ e' = \sum_{i=0}^{n-1} e[\uparrow l_i], \]
and
\[ e = \begin{bmatrix} 0 & a_{m-1} & \cdots & a_2 & a_1 \\ 0 & 0 & \cdots & a_3 & a_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & a_{m-1} \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{m-2} \\ b_{m-1} \end{bmatrix}. \tag{2.3} \]
Here, $t_0 = 0$, $n$ is the number of nonzero entries in column zero of the $(m - 1) \times m$
binary reduction matrix $Q$ [72], and $l_i$ denotes the row location of the $i$-th nonzero entry
in this column, $0 \le i < n$. The following are some remarks on the values of $n$ and $l_i$.
Remark 2.9.3 Let $p(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ be the generator irreducible
polynomial of $GF(2^m)$ with $\omega$ nonzero terms, then [33, 75]:
• $l_0 = 0$ regardless of the structure of $p(x)$.
• If $p(x)$ is a trinomial of the form $x^m + x + 1$, then: $n = 1$ and $l_0 = 0$.
• If $p(x)$ is a trinomial of the form $x^m + x^{t_1} + 1$ with $1 < t_1 \le \frac{m+1}{2}$, then:
$n = 2$ with $l_0 = 0$ and $l_1 = m - t_1$.
• If $p(x)$ is a general irreducible polynomial with $t_{\omega-2} \le \frac{m+1}{2}$, then:
$n = \omega - 1 - \lfloor 1/t_1 \rfloor$, $l_0 = 0$, and $l_i = m - t_i$ for
$1 \le i \le \omega - 2 - \lfloor 1/t_1 \rfloor$.
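The roles of the vectors $d$ and $e$ in (2.2) and (2.3) can be checked numerically: $d$ collects the coefficients of $x^0, \ldots, x^{m-1}$ of the unreduced product, and $e$ collects the coefficients of $x^m, \ldots, x^{2m-2}$. The sketch below builds both from the triangular matrices and then, as a simplifying assumption, folds $e$ back with a generic modular reduction instead of the shift-based sums of (2.1). The field $GF(2^5)$ with $p(x) = x^5 + x^2 + 1$ and all function names are illustrative choices.

```python
# Sketch: the d / e decomposition of (2.2) and (2.3), checked against a
# direct multiply-then-reduce. The reduction of e uses a generic remainder
# computation, not the shift formulation of (2.1).

def clmul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_mod(c, p):
    deg_p = p.bit_length() - 1
    while c.bit_length() - 1 >= deg_p:
        c ^= p << (c.bit_length() - 1 - deg_p)
    return c


def to_bits(x, m):
    return [(x >> i) & 1 for i in range(m)]


def to_int(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def split_product(a, b, m):
    """Return (d, e): d[i] is the coefficient of x^i, e[i] that of x^(m+i)."""
    d = [0] * m
    e = [0] * m
    for i in range(m):
        for j in range(m):
            if j <= i:
                d[i] ^= a[i - j] & b[j]        # lower-triangular matrix of (2.2)
            else:
                e[i] ^= a[m + i - j] & b[j]    # upper-triangular matrix of (2.3)
    return d, e


M, P = 5, 0b100101  # GF(2^5), p(x) = x^5 + x^2 + 1


def check_all():
    for A in range(1 << M):
        for B in range(1 << M):
            d, e = split_product(to_bits(A, M), to_bits(B, M), M)
            # The unreduced product is d(x) + x^m * e(x).
            assert to_int(d) ^ (to_int(e) << M) == clmul(A, B)
            c = to_int(d) ^ reduce_mod(to_int(e) << M, P)
            assert c == reduce_mod(clmul(A, B), P)
    return True


print(check_all())
```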
2.9.1.2 Efficient Multiplication of a $GF(2^m)$ Element Represented in the PB by $\alpha^q$
It is noted that one can obtain a general formulation for the multiplication of an arbitrary
element $A = (a_{m-1}, \ldots, a_0) \in GF(2^m)$, represented in the PB, by $\alpha^q$, using
(2.1), (2.2), and (2.3) (see Section 6.2.1). However, this section lists some conditions,
originally presented in [81] and [57], for the efficient hardware realization of such constant
field multiplication. Here, $q$ is a positive integer and $\alpha$ is the root of the field's
irreducible polynomial $p(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ with $\omega$ nonzero terms.
Theorem 2.9.4 [81] Assume $p(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ is the irreducible
polynomial which defines $GF(2^m)$. Let $\alpha$ denote the root of $p(x)$. Therefore, for
$q < m - t_{\omega-2}$, the coordinates of $\alpha^{m+q}$ are obtained as follows
\[ \alpha^{m+q} \bmod p(\alpha) = \left( \sum_{i=1}^{\omega-2} \alpha^{t_i} + 1 \right) \alpha^q. \tag{2.4} \]
Theorem 2.9.5 [57] Assume $p(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ is the irreducible
polynomial which defines $GF(2^m)$. Denote by $\alpha$ the root of $p(x)$. Let
$A = (a_{m-1}, \ldots, a_0)$ be an arbitrary $GF(2^m)$ element represented in the PB. Therefore,
for $q \le m - t_{\omega-2}$, the coordinates of $A\alpha^q \bmod p(\alpha)$ are obtained in a
single step using $q(\omega - 2)$ two-input XOR gates with a propagation delay equivalent to
$\lceil \log_2(q + 1) \rceil$ XOR gate delays, as follows:
\begin{align}
A\alpha^q \bmod p(\alpha) &= \sum_{i=0}^{m-1} a_i \alpha^{i+q} \bmod p(\alpha) \nonumber \\
&= \sum_{i=0}^{m-q-1} a_i \alpha^{i+q} + \sum_{i=m-q}^{m-1} a_i \left( \sum_{j=1}^{\omega-2} \alpha^{t_j} + 1 \right) \alpha^{i-(m-q)}. \tag{2.5}
\end{align}
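Equation (2.5) can be sanity-checked in software: the low $m - q$ coordinates are shifted up by $q$ positions, and each of the top $q$ coordinates contributes a shifted copy of $p(x) - x^m$. The sketch below assumes the bitmask encoding (bit $i$ holds $a_i$) and uses $GF(2^5)$ with $p(x) = x^5 + x^2 + 1$ as an illustrative field, so $t_{\omega-2} = 2$ and the single-step formula applies for $q \le 3$. The function names are hypothetical.

```python
# Sketch: single-step multiplication by alpha^q following the structure of (2.5),
# compared with a generic shift-then-reduce computation.

def clmul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_mod(c, p):
    deg_p = p.bit_length() - 1
    while c.bit_length() - 1 >= deg_p:
        c ^= p << (c.bit_length() - 1 - deg_p)
    return c


def mul_alpha_q(a, q, m, tail):
    """A * alpha^q for q <= m - t_{omega-2}; tail encodes p(x) - x^m."""
    low = (a & ((1 << (m - q)) - 1)) << q   # a_0 .. a_{m-q-1}, shifted up by q
    high = a >> (m - q)                     # a_{m-q} .. a_{m-1}, re-indexed from 0
    return low ^ clmul(high, tail)          # each high bit adds tail * alpha^(i-(m-q))


M = 5
P = 0b100101        # x^5 + x^2 + 1
TAIL = P ^ (1 << M)  # x^2 + 1
T_MAX = 2            # t_{omega-2} for this polynomial


def check_all():
    for q in range(0, M - T_MAX + 1):
        for A in range(1 << M):
            assert mul_alpha_q(A, q, M, TAIL) == reduce_mod(A << q, P)
    return True


print(check_all())
```

No second reduction pass is needed because, under the stated bound on $q$, the highest term produced by the folded part has degree $t_{\omega-2} + q - 1 \le m - 1$.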
2.9.1.3 Previous Work on PB Multiplication
In general, the different proposed designs for implementing PB multiplication fall under one
of the two categories of parallel and serial computations. For achieving high throughput, the
parallel implementation is used, where all the output bits of the multiplication are generated in
a single clock cycle [19, 62, 84, 44, 72, 27, 22]. For achieving low space complexity,
digit-level serial computations are considered. In digit-level serial multiplication schemes,
the space complexity is reduced at the expense of increasing the number of clock cycles required
for generating the $m$ output bits (computational latency) to $k = \lceil m/d \rceil$ clock
cycles (in general), where $d$ is the digit size [78, 20, 46, 75, 79, 50, 66].
In a digit-level serial implementation, the multiplication input / output bits are entered /
generated either in parallel, or serially at a rate of one digit per clock cycle. For example,
digit-level serial-in-parallel-out (DL-SIPO) multipliers generate the output bits in parallel
after $k$ clock cycles [20]. In this DL-SIPO scheme, one input is loaded in parallel (in advance
of computations), while the other input enters serially, one digit per clock cycle, during
computations. The serial input of the DL-SIPO multiplier enters in either a
most-significant-digit first (MSD) or least-significant-digit first (LSD) order.
Parallel-in-serial-out is another digit-level multiplication scheme (DL-PISO), in which both
inputs are preloaded in parallel in advance of computations [75]. After this, the output digits
of the DL-PISO multiplier are generated over $k$ clock cycles, one digit per clock cycle. A third
digit-level multiplication scheme, known as parallel-in-parallel-out (PIPO), requires preloading
of both inputs in advance of computations. The output of the PIPO multiplier is generated in
parallel a number of clock cycles after inputs preloading. For example, the PIPO PB
multiplication architecture presented in [50] has a latency of $2t_{\omega-2} + 1$ clock cycles
to generate the $m$ output bits in parallel, where $t_{\omega-2}$ denotes the degree of the
second highest nonzero term of the field irreducible polynomial
$p(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ with $\omega$ nonzero terms. The authors of [50]
show that their serial PIPO PB multiplier offers the lowest latency for cases where
$m \ge 2t_{\omega-2} - 1$; however, the corresponding space complexity is quadratic in $m$.
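To make the latency figure $k = \lceil m/d \rceil$ concrete, the following behavioral sketch models a least-significant-digit-first digit-serial multiplication in software: one operand is held in full, while the other is consumed $d$ bits per iteration. It is not the architecture of [20] or of any other cited design; the function names, the bitmask encoding, and the example field $GF(2^5)$ with $p(x) = x^5 + x^2 + 1$ are assumptions for illustration.

```python
# Sketch: behavioral model of an LSD-first digit-serial GF(2^m) multiplication.
# Each loop iteration stands for one clock cycle consuming one d-bit digit of b.

def clmul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_mod(c, p):
    deg_p = p.bit_length() - 1
    while c.bit_length() - 1 >= deg_p:
        c ^= p << (c.bit_length() - 1 - deg_p)
    return c


def digit_serial_mul(a, b, p, d):
    """Return (a*b mod p, k) where k = ceil(m/d) is the number of iterations."""
    m = p.bit_length() - 1
    k = (m + d - 1) // d
    mask = (1 << d) - 1
    c = 0
    for cycle in range(k):
        digit = (b >> (cycle * d)) & mask       # next digit of the serial operand
        c ^= reduce_mod(clmul(a, digit), p)     # accumulate partial product
        a = reduce_mod(a << d, p)               # a <- a * alpha^d for the next digit
    return c, k


P = 0b100101  # x^5 + x^2 + 1
print(digit_serial_mul(0b10110, 0b01101, P, 2))
```

Choosing a larger digit size $d$ lowers the iteration count $k$ at the price of a wider partial-product computation per iteration, which is the space/time trade-off described above.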
In addition, serial-serial finite field multipliers with two serially-entered inputs and serial
output have also been proposed. For example, in 1992, the authors of [46] presented a bit-level
most-significant-bit-first (MSB) serial-serial PB multiplier which generates the $m$ output
bits serially over $2m$ clock cycles. In [46], the inputs to the multiplier enter serially
bit-by-bit, starting with the MSB, over the first $m$ clock cycles. After reading the serial
inputs, the $m$ output bits are then generated serially, one bit per clock cycle, starting with
the MSB. In 2009, the authors of [14] proposed a generic serial-serial multiplication/reduction
architecture for $GF(q)$, where $q$ can be a prime $p$ or a power of a prime $p^m$, including the
case $p = 2$. For the case of $GF(2^m)$, which is the focus of this thesis, the serial-serial
multiplication/reduction scheme proposed in [14] reads both of its multiplication inputs
serially, one digit at a time, in either least or most-significant first order. The final result
is generated digit-by-digit without using any dedicated parallel-in-serial-out register, starting
with the $(k + 1)$-th clock cycle, where an additional correction step is required in case of the
least-significant first input order. Then, using the scheme in [14], all the $m$ output bits are
produced serially after a total of $2k$ clock cycles. It is noted that the serial-serial
multiplier in [14] is not a dedicated multiplication scheme, in the sense that it works for any
irreducible polynomial by reading it as one of its inputs. In order to allow for scalability,
and to make the field multiplication generic, the serial-serial multiplier in [14] requires
additional multiplexers and storage Flip-Flops (FF), in addition to a number of control signals.
2.9.2 $GF(2^m)$ Multiplication in the GNB Representation
Although multiplication is realized more efficiently in hardware based on the PB
representation [71], NBs are considered advantageous for use in hardware designs of binary
extension field arithmetic [71] because squaring comes at no cost: it is implemented as a
cyclic shift [5]. In particular, the special subset of Gaussian normal bases (GNBs)
offers field operations with smaller area and time overhead compared to the general NB. Hence,
GNBs are often used for efficient hardware implementations of field multiplication; for
example, see the IEEE standard [5] and the National Institute of Standards and Technology (NIST)
standard [12].
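The cyclic-shift property of squaring can be observed on a toy field. The sketch below assumes $GF(2^3)$ defined by $x^3 + x^2 + 1$, whose root $\beta$ is a normal element, so that $\{\beta, \beta^2, \beta^4\}$ is a normal basis. Elements are converted from normal-basis coordinates $(a_0, a_1, a_2)$ to a polynomial-basis bitmask, squared there, and compared with the shifted coordinate vector. The choice of field and all names are illustrative, not taken from the thesis.

```python
# Sketch: squaring in a normal basis equals a cyclic shift of the coordinates.
M = 3
P = 0b1101  # x^3 + x^2 + 1; its root beta is a normal element of GF(2^3)


def gf_mul(a, b):
    """Multiply two PB-encoded elements and reduce modulo P."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    while c.bit_length() - 1 >= M:
        c ^= P << (c.bit_length() - 1 - M)
    return c


# Normal basis elements beta^(2^i), expressed as PB bitmasks.
basis = [0b010]
for _ in range(M - 1):
    basis.append(gf_mul(basis[-1], basis[-1]))


def from_nb(coords):
    """Map NB coordinates (a_0, ..., a_{m-1}) to the PB bitmask of the element."""
    value = 0
    for a_i, b_i in zip(coords, basis):
        if a_i:
            value ^= b_i
    return value


def check_squaring():
    for n in range(1 << M):
        coords = [(n >> i) & 1 for i in range(M)]
        shifted = coords[-1:] + coords[:-1]   # (a_{m-1}, a_0, ..., a_{m-2})
        element = from_nb(coords)
        assert gf_mul(element, element) == from_nb(shifted)
    return True


print(basis, check_squaring())
```

The shift follows from $A^2 = \sum_i a_i \beta^{2^{i+1}}$ together with $\beta^{2^m} = \beta$, so the coordinate attached to $\beta^{2^{m-1}}$ wraps around to $\beta^{2^0}$.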
The original multiplication scheme under the NB representation was proposed by
Massey and Omura [61]. In this scheme, all the output bits of the product are computed,
one bit at a time, by applying a fixed function to different cyclic shifts of the two input
elements. This scheme is referred to as bit-level (BL) parallel-in-serial-out (PISO)
multiplication. This section briefly reviews formulations for the BL-PISO multiplication of two
$GF(2^m)$ elements represented in the GNB. Also, this section shows how the multiplication of a
field element by the normal element $\beta$ is accomplished. In addition, a brief summary of
existing work on GNB multiplication is given.
The following starts by reviewing formulations for the BL-PISO GNB multiplication of
two arbitrary $GF(2^m)$ elements.
2.9.2.1 Formulation for the BL-PISO $GF(2^m)$ Multiplication in the GNB Representation
Here, the formulations for accomplishing bit-level PISO multiplication of two $GF(2^m)$ elements represented in the GNB are presented. As mentioned earlier, if an element $\beta \in GF(2^m)$ is found such that $N = \{\beta^{2^0}, \ldots, \beta^{2^{m-1}}\}$ is a basis, then $N$ is a NB and $\beta$ is a normal element. For any $m > 1$ not divisible by 8, if there exists a prime number $p = mT + 1$ such that $\gcd(mT/k, m) = 1$, where $k$ is the multiplicative order of 2 modulo $p$ (so that $2^k \equiv 1 \pmod{p}$), then $N$ is a Gaussian normal basis (GNB) of type $T$, where $T$ is an even integer if $m$ is odd. Any element $A \in GF(2^m)$ can be represented w.r.t. the GNB as $A = \sum_{i=0}^{m-1} a_i \beta^{2^i} = (a_0, \ldots, a_{m-1})$, where $a_i \in \{0, 1\}$.
Let $P_A(V) = AV = (p_0, \ldots, p_{m-1})$ denote the result of multiplying $A$ by $V = (v_0, \ldots, v_{m-1})$. Then the $l$-th coordinate of $P_A(V)$, for $0 \le l < m$, is obtained as [70]
\[ p_l = a_l v_{((l+1))} + \sum_{i=1}^{m-1} a_{((l+i))} \left( \sum_{j=1}^{T} v_{((l+R[i,j]))} \right), \tag{2.6} \]
where $((q)) = q \bmod m$ and $0 \le R[i,j] < m$, for $1 \le i < m$ and $1 \le j \le T$, is an integer entry of an $(m-1) \times T$ matrix $R$ that gives the position of the $j$-th 1 in the $i$-th row of the GNB's multiplication matrix $M$ [70]. Computing the $l$-th coordinate of the field multiplication with this scheme requires $m$ AND gates and at most $(m-1)T$ XOR gates, with a propagation delay of $T_A + (\lceil \log_2 m \rceil + \lceil \log_2 T \rceil) T_X$ [16], where $T_A$ and $T_X$ denote the propagation delays of a two-input AND gate and a two-input XOR gate, respectively.
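As an illustration of (2.6), the following minimal Python sketch evaluates the formula coordinate by coordinate for a toy field. It assumes $GF(2^3)$ with the type-2 GNB generated by a root of $x^3 + x^2 + 1$; the matrix R below was derived by hand for that toy case and is not taken from the thesis or from [70]. The function names are hypothetical.

```python
# Sketch of the bit-level GNB multiplication in (2.6) for a toy field GF(2^3).
# ASSUMPTION: R was derived by hand for the type-2 GNB of GF(2^3) generated by
# a root of x^3 + x^2 + 1; R[i] lists the positions of the 1s in row i of the
# multiplication matrix (rows 1..m-1).
M = 3
R = {1: (0, 2), 2: (1, 2)}

def gnb_coordinate(a, v, l, R=R, m=M):
    """Return p_l of A*V following (2.6); a and v are lists of m bits."""
    bit = a[l] & v[(l + 1) % m]                 # a_l * v_((l+1))
    for i in range(1, m):
        inner = 0
        for r in R[i]:                          # sum over j of v_((l+R[i,j]))
            inner ^= v[(l + r) % m]
        bit ^= a[(l + i) % m] & inner           # a_((l+i)) * (inner sum)
    return bit

def gnb_multiply(a, v, R=R, m=M):
    """Bit-level PISO view: one output coordinate per call to gnb_coordinate."""
    return [gnb_coordinate(a, v, l, R, m) for l in range(m)]
```

In a BL-PISO multiplier the same combinational function would be reused in every clock cycle while the operand registers are cyclically shifted; the loop over l above plays the role of those shifts.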
The following section shows how the GNB multiplication by the normal element $\beta$ is accomplished.
2.9.2.2 Multiplication by the Normal Element $\beta$
Here, the formulation for accomplishing field multiplication of an arbitrary $GF(2^m)$ element $V = (v_0, \ldots, v_{m-1})$, represented in the Gaussian normal basis $\{\beta, \ldots, \beta^{2^{m-1}}\}$ of type $T$, by the normal element $\beta = (1, 0, \ldots, 0)$ is presented. By substituting $(1, 0, \ldots, 0)$ for $(a_0, \ldots, a_{m-1})$ in (2.6), and considering all values of $l = 0, \ldots, m-1$, one obtains [70]
\[ P_\beta(V) = v_1 \beta + \sum_{i=1}^{m-1} \left( \sum_{j=1}^{T} v_{((i+R[m-i,j]))} \right) \beta^{2^i}, \tag{2.7} \]
which requires at most $(m-1)(T-1)$ XOR gates, with a propagation delay of $\lceil \log_2 T \rceil T_X$. It is noted that the number of XOR gates required for realizing (2.6) or (2.7) can be reduced by a value $\Delta_X$ through signal reuse techniques (see [69, 25] for example), where $\Delta_X$ is obtained through simulation.
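The special case (2.7) can be sketched in a few lines of Python. As a hedged illustration only, the snippet assumes the toy field $GF(2^3)$ with the type-2 GNB generated by a root of $x^3 + x^2 + 1$; the matrix R is a hand-derived assumption for that toy case, and the function name is hypothetical.

```python
# Sketch of (2.7): multiplying V by the normal element in a toy GNB of GF(2^3).
# ASSUMPTION: R[i] lists the positions of the 1s in row i of the multiplication
# matrix for the basis generated by a root of x^3 + x^2 + 1 (derived by hand).
M = 3
R = {1: (0, 2), 2: (1, 2)}

def mul_by_normal_element(v, R=R, m=M):
    """Return the coordinates of beta*V; only XORs of input bits are needed."""
    p = [v[1 % m]]                      # coordinate 0 is simply v_1
    for i in range(1, m):
        s = 0
        for r in R[m - i]:              # sum over j of v_((i + R[m-i, j]))
            s ^= v[(i + r) % m]
        p.append(s)
    return p
```

For this toy case each of the m - 1 = 2 nontrivial coordinates is one XOR of two input bits, which agrees with the (m - 1)(T - 1) bound for T = 2.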
2.9.2.3 Previous Work on GNB Multiplication
Massey and Omura [61] proposed the original scheme for multiplication in the NB representation. Since then, a number of designs have been proposed in an attempt to optimize the throughput and/or space complexity of the NB multiplier [86, 47, 55, 71, 73, 13]. Generally, the proposed designs can be divided into three categories: parallel, bit-level, and digit-level computations. For high-throughput usage, the parallel implementation generates all the output bits of the multiplication in one clock cycle [86, 47, 55, 71]. In $GF(2^m)$, this is achieved with a gate complexity that is quadratic in $m$. For area-critical applications, bit-level schemes are adopted [86, 36, 13]. In these schemes, the space complexity is generally proportional to $m$, while the multiplication process requires $m$ clock cycles to generate the final output. To trade off space against throughput, digit-level multipliers are deployed [73, 37]. In a digit-level scheme, the space complexity is traded off against the number of required clock cycles in such a way that $d$ bits, $2 \le d < m$, are processed in parallel during each of the $k = \lceil m/d \rceil$ clock cycles of computation.
Similar to the PB digit-level multipliers, there are three schemes in terms of types of inputs
and output for the digit-level (DL) GNB multipliers. The first scheme of serial multipliers is
the parallel-in-parallel-out (PIPO) [38, 13, 73]. In this scheme, the inputs are preloaded to the
input registers first, and then, the m output bits are produced in parallel after k clock cycles.
In the remaining two schemes, one or both inputs, or the output, are fed or generated serially during each iteration of computation, where the serial input(s) or output follow either a least significant digit first (LSD) or a most significant digit first (MSD) order. The second scheme is the serial-in-parallel-out (SIPO) [20, 36]. There are two variants of this scheme: one with a single serial input [20] and the other with two serial inputs [36]. For clarity of reference, the variant with two serial inputs is denoted as fully-serial-in-parallel-out (FSIPO). Both the SIPO and the FSIPO multipliers generate the $m$ output bits in parallel after $k$ clock cycles. The SIPO requires one of its inputs to be preloaded before the computation, while the other input enters the multiplier during the computation. On the other hand, an FSIPO multiplier does not require any preloading of the operands, since it reads both inputs as the computation is carried out.
The third scheme is the parallel-in-serial-out (PISO), in which the two operands are preloaded into the input registers before the computation starts, followed by generating the $k$ output digits serially, one digit per clock cycle [61, 37, 76].
By combining one DL-PISO and one DL-SIPO architectures, a DL-PIPO hybrid-double
GNB multiplier has been recently proposed by the authors of [16], which performs two field
multiplications using the same latency required for a single field multiplication (i.e. k iter-
ations). It is noted that the authors of [16] have shown the hybrid-double multiplier to be
useful for applications where two dependent field multiplications are involved, such as double
exponentiation.
2.10 Exponentiation and Inverse over $GF(2^m)$
Field exponentiation and inversion are usually realized in the form of repeated rounds of “square and multiply” operations [5]. Squaring is the operation of multiplying an element by itself. Field exponentiation of an element $A$, say $A^e$, is computed as [42]:
\[ A^e = \prod_{i=0}^{m-1} A^{e_i 2^i}, \]
where the integer exponent $e$, $2 \le e < 2^m - 1$, is represented in its radix-two expansion as $\sum_{i=0}^{m-1} e_i 2^i$ with $e_i \in \{0, 1\}$ for all $i = 0, 1, \ldots, m-1$. Field inversion is a special case of exponentiation in which the exponent has the fixed value $e = 2^m - 2$, according to Fermat's Little Theorem (FLT) [30]. As mentioned earlier, the NB representation requires a larger space overhead than the PB to realize field multiplication in hardware. However, NBs offer free-of-cost squaring operations, which is not the case for PBs. In the NB representation, for any $A = (a_0, \ldots, a_{m-1}) \in GF(2^m)$, one simply obtains $A^2 = (a_{m-1}, a_0, \ldots, a_{m-2})$. In hardware, this is realized as a simple right cyclic shift. Therefore, many hardware designs favor NBs over PBs for implementing exponentiation and inversion. More specifically, the subclass of Gaussian normal bases (GNBs), which offers more efficient hardware implementations of field multiplication than general NBs [5], is usually deployed for exponentiation and inversion.
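The role of the free squaring can be sketched as follows. This minimal Python example assumes coordinates are stored as a list $(a_0, \ldots, a_{m-1})$ w.r.t. a normal basis; the helper names are hypothetical. It only produces the shifted operands $A^{2^i}$ selected by the exponent bits, and leaves the field multiplications that would combine them to a separate multiplier.

```python
def nb_pow2k(a, k):
    """A^(2^k) in a normal basis: k right cyclic shifts of the coordinates."""
    k %= len(a)
    return list(a) if k == 0 else a[-k:] + a[:-k]

def nb_square(a):
    """A^2 = (a_{m-1}, a_0, ..., a_{m-2}): a single right cyclic shift."""
    return nb_pow2k(a, 1)

def exponent_bits(e, m):
    """Radix-two digits e_0, ..., e_{m-1} of the exponent e."""
    return [(e >> i) & 1 for i in range(m)]

def shifted_operands(a, e):
    """Operands A^(2^i) with e_i = 1; their field product equals A^e."""
    return [nb_pow2k(a, i) for i, e_i in enumerate(exponent_bits(e, len(a))) if e_i]
```

For inversion, e = 2^m - 2 has e_0 = 0 and all other digits equal to 1, so shifted_operands returns the m - 1 shifts A^2, ..., A^(2^(m-1)).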
The next section introduces trace mappings of GF (2m) elements.
2.11 Trace Mapping
The trace is a mapping that maps a $GF(2^m)$ element, say $A$, to the ground field $GF(2)$, and is denoted by $\mathrm{Tr}(A)$. The trace of the element $A$ is computed using the following formulation [5]:
\[ \mathrm{Tr}(A) = \sum_{i=0}^{m-1} A^{2^i}. \tag{2.8} \]
In the NB representation, (2.8) reduces to the modulo-2 sum of the coordinates of $A$, that is [5]:
\[ \mathrm{Tr}(A) = \sum_{i=0}^{m-1} a_i. \]
On the other hand, in the PB representation, (2.8) is computed as the inner product of the row vector representing $A$ with the constant column vector $\tau = (\tau_{m-1}, \ldots, \tau_0)$, as follows [5]:
\[ \mathrm{Tr}(A) = \sum_{i=0}^{m-1} a_i \tau_i. \]
The coordinates of the constant vector $\tau$ are precomputed as $\tau_i = \mathrm{Tr}(\alpha^i)$, for $0 \le i < m$, with $\alpha$ representing the root of the defining irreducible polynomial $p(x)$ of $GF(2^m)$.
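The two ways of evaluating the trace can be compared with a small Python sketch. It assumes a toy polynomial-basis field (for example $GF(2^3)$ with $p(x) = x^3 + x + 1$, chosen only for illustration), elements encoded as integers whose bit $i$ is the coefficient of $\alpha^i$, and hypothetical helper names.

```python
from functools import reduce

def trace_nb(a):
    """Trace in a normal basis: modulo-2 sum (XOR) of all coordinates."""
    return reduce(lambda x, y: x ^ y, a, 0)

def gf_mul(a, b, poly, m):
    """Polynomial-basis product in GF(2^m); poly includes the x^m term."""
    r = 0
    for _ in range(m):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly                       # reduce modulo p(x)
    return r

def trace_by_definition(a, poly, m):
    """Tr(A) as the sum of A^(2^i), i = 0..m-1, following (2.8)."""
    t, x = 0, a
    for _ in range(m):
        t ^= x
        x = gf_mul(x, x, poly, m)           # next conjugate A^(2^(i+1))
    return t                                # 0 or 1, an element of GF(2)

def trace_vector(poly, m):
    """Precompute the constants Tr(alpha^i) for 0 <= i < m."""
    return [trace_by_definition(1 << i, poly, m) for i in range(m)]

def trace_pb(a, tau):
    """Trace in the PB as an inner product of the coordinates with tau."""
    return reduce(lambda x, y: x ^ y,
                  (((a >> i) & 1) & t for i, t in enumerate(tau)), 0)
```

Once the constant vector is known, the PB trace costs only the XOR of the coordinates selected by the nonzero constants.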
2.12 Welch-Gong (WG) Stream Ciphers
2.12.1 Stream Ciphers
Stream ciphers are symmetric key cryptosystems that provide privacy by applying encryption and decryption mechanisms to a given message's text. Stream ciphers are attractive for implementing protection in the wireless air-link domain because the input message digits are processed individually, which prevents error propagation at the receiving end. For example, stream ciphers are used in different wireless communications applications, such as Bluetooth [8], network protocols (WEP and WPA) [43], and the 3GPP Long Term Evolution (LTE) security suite [11, 7]. To process the input message's digits individually, a stream cipher encrypts (or decrypts) an input message by XORing each message bit with the corresponding bit of a generated key-stream, where the key-stream is produced by a pseudo-random sequence generator (PRSG). Figure 2.1 presents two entities communicating over an insecure channel where a stream cipher is used to keep the transmitted data private.
Figure 2.1: A stream cipher is used for providing privacy over an insecure channel between two communicating entities.
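The XOR structure described above can be sketched in Python. The key-stream generator below is only a stand-in assumed for illustration (SHA-256 in counter mode); it is not a WG cipher and is not meant for real use. The function names are hypothetical.

```python
import hashlib

def toy_keystream(key: bytes, n: int) -> bytes:
    """Stand-in PRSG (ASSUMPTION, not a WG cipher): SHA-256 in counter mode."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_stream(message: bytes, key: bytes) -> bytes:
    """Encrypt or decrypt by XORing the message with the key-stream."""
    ks = toy_keystream(key, len(message))
    return bytes(m ^ k for m, k in zip(message, ks))
```

Because encryption and decryption are the same XOR, applying xor_stream twice with the same key returns the message, and a single flipped ciphertext bit changes only the corresponding plaintext bit.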
2.12.2 WG Stream Ciphers
The Welch-Gong (WG) stream ciphers are a family of stream ciphers with good randomness properties [39, 68, 67, 24]. The randomness properties provided by the member ciphers of this family are proved mathematically. They include a long period, balanced 0-1 distribution, ideal tuple distribution, exact linear complexity, three-valued cross correlation with an m-sequence, delta-like autocorrelation functions, and high nonlinearity, a combination that no other existing cipher provides [40, 68].
2.12.2.1 A General Block Diagram
Figure 2.2 presents a block diagram showing the architecture of a general WG stream cipher. As shown in this figure, a WG stream cipher is built from a finite state machine (FSM), a linear feedback shift register (LFSR) that consists of $l$ elements from the field $GF(2^m)$, and a WG transform (WGT$m$). Hence, the WG cipher is denoted by either WG($m$, $l$) or WG-$m$. The FSM controls the operation of the cipher. The linear feedback function, which is represented by the LFSR's characteristic polynomial $C(Z)$ in the figure, is primitive over $GF(2^m)$, and therefore the LFSR generates m-sequences having periods of $2^{ml} - 1$. The output of the LFSR, which is taken from the leftmost cell, is filtered by an $m$-bit WG transform WGT$m$. Notice that the LFSR output might first go through decimation (that is, exponentiation) before entering the transform. The WG transform consists of a permutation module (WGP$m$) followed by a trace mapping.
Figure 2.2: A general block diagram of a WG stream cipher.
2.12.2.2 Phases of Operation
There are three phases of operation in a WG stream cipher: the loading phase, the initialization phase, and the run phase. During the loading phase, which takes $l$ clock cycles to complete, the initial state is written to the cells of the LFSR, where the only input to the LFSR is “Initial Vector” in Figure 2.2. Then, the initialization of the cipher starts and continues for $2l$ clock cycles, during which the input to the LFSR is the bitwise XOR of the “Linear Feedback” and “Initial Feedback” signals in Figure 2.2. After this, the cipher enters the run phase, where a single key-stream bit is generated at each clock cycle. The only input to the LFSR during the run phase is the “Linear Feedback” signal.
2.12.2.3 WG(29, 11) and WG-16
The eSTREAM project [6] is the most significant effort for finding secure stream ciphers [67]. The WG(29, 11) [39] is a stream cipher submitted to the hardware profile of phase 2 of this project. The WG(29, 11) offers the proved randomness properties of the WG family of ciphers [40, 39, 68, 24]. Two attacks [88, 77] were launched on WG(29, 11) during this project. However, it is noted that the revised version of the cipher [68] does not suffer from the chosen IV (Initial Value) attack in [88, 64]. Also, by design, the number of key-stream bits per single key/IV pair is strictly less than the number of key-stream bits required to perform the linear span attack introduced in [77], [68].
In the literature, there are a number of proposed WG(29, 11) hardware designs [39, 67, 68, 56]. The original submission uses the normal basis (NB) representation [39], and all designs presented until now have likewise used the NB representation [39, 68, 56, 58]. The authors of [39] adopt a direct design using computation in the optimal normal basis (ONB), which requires 7 multiplications and an inversion over $GF(2^{29})$. The inversion using the Itoh-Tsujii algorithm requires $\lfloor \log_2(28) \rfloor + H(28) - 1 = 4 + 3 - 1 = 6$ multiplications and 28 squarings in $GF(2^{29})$, where $H(28)$ denotes the Hamming weight of 28 [45]. In [68], the authors replaced the inversion operation with a computation of the power $2^k - 1$, which requires 4 multiplications for $k = \lceil 29/3 \rceil = 10$, and reduced the other 7 multiplications of the WG transformation in [39] by one through signal reuse. In [56], the author uses a look-up-table-based approach that uses $2^{29}$ bits of ROM. In [58], the authors propose a multiple-bit output version of the WG cipher, called MOWG. The MOWG reduces the hardware cost through signal reuse by removing one multiplier from the WG permutation in [68], while it generates $d \le 17$ output bits. Furthermore, [58] improves the hardware cost and throughput of the cipher through pipelining with reuse techniques. The keystream sequences generated by the MOWG cipher possess many of the WG keystream randomness properties [58].
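The multiplication counts quoted above follow from one expression, which the short Python sketch below evaluates. The function name is hypothetical; the formula $\lfloor \log_2 n \rfloor + H(n) - 1$ is the one stated in the text for $n = 28$, and it is assumed here that the same addition-chain count applies to the power $2^k - 1$ with $n = k$.

```python
def chain_multiplications(n: int) -> int:
    """Multiplications needed for A^(2^n - 1): floor(log2 n) + H(n) - 1."""
    floor_log2 = n.bit_length() - 1        # floor(log2 n) for n >= 1
    hamming_weight = bin(n).count("1")     # H(n)
    return floor_log2 + hamming_weight - 1

# Inversion in GF(2^29) uses n = m - 1 = 28; the power 2^k - 1 uses n = k = 10.
INVERSION_MULTS = chain_multiplications(28)
POWER_MULTS = chain_multiplications(10)
```

With these inputs the sketch reproduces the 6 multiplications for the inversion and the 4 multiplications for the power $2^{10} - 1$.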
Another initiative for designing secure stream ciphers is the LTE mobile technology. LTE is being established as the fourth generation (4G) mobile technology, where a flat all-Internet-Protocol infrastructure has been adopted [34]. This has changed the threat model of the 4G mobile domain to include the security issues that apply to IP networks [34]. Accordingly, there is a continuous effort by the security specification group of the third generation partnership project (3GPP-TSG) [1] to address these security threats [34]. The cipher suite of 4G LTE consists of two stream ciphers, SNOW 3G and ZUC, and the block cipher AES in counter mode [11, 7]. It is noted that the randomness of the key-streams generated by the 4G LTE cryptographic algorithms is hard to analyze and, more importantly, some weaknesses concerning these ciphers have already been discovered [87, 21]. Furthermore, some security flaws in the LTE integrity protocols have been recently recognized [90]. The authors of [34] propose confidentiality and integrity protection schemes for securing the 4G network domain against the attack in [90]. These schemes are based on the WG-16 stream cipher. The WG-16 offers the proved randomness properties of the WG family of ciphers [34]. In addition, it is reported to resist all known attacks [34]. The only WG-16 hardware design, which uses the NB, is presented in [35]. This design is based on composite field arithmetic and properties of the trace function in the tower field representation.
2.12.2.4 Parameters of the WG(29, 11)
The permutation for the WG(29, 11) is
\[ \mathrm{WGP29} = 1 \oplus Y \oplus Y^{2^{10}+1} \oplus Y^{2^{20}+2^{10}+1} \oplus Y^{2^{20}-2^{10}+1} \oplus Y^{2^{20}+2^{10}-1}, \tag{2.9} \]
where $Y = 1 \oplus A_{i+10}$ and $A_{i+10}$ is the LFSR's output. The WG transform is given as follows [39, 68, 58]
\[ \mathrm{WGT29} = \mathrm{Tr}(\mathrm{WGP29}). \tag{2.10} \]
The linear feedback characteristic polynomial of the WG(29, 11),
\[ C(Z) = Z^{11} \oplus Z^{10} \oplus Z^{9} \oplus Z^{6} \oplus Z^{3} \oplus Z \oplus \gamma, \tag{2.11} \]
is a primitive polynomial of degree 11 over $GF(2^{29})$, where $\gamma = \beta^{464730077}$ is the generator of the Type-II Optimal NB (ONB-II, that is, GNB of type 2) and $\beta$ is a root of the defining polynomial of $GF(2^{29})$ given by [68]
\[ g(x) = x^{29} + x^{28} + x^{24} + x^{21} + x^{20} + x^{19} + x^{18} + x^{17} + x^{14} + x^{12} + x^{11} + x^{10} + x^{7} + x^{6} + x^{4} + x + 1. \tag{2.12} \]
2.12.2.5 Parameters of the WG-16
The WG-16 permutation is [34]
\[ \mathrm{WGP16} = 1 \oplus Y \oplus Y^{2^{11}+1} \oplus Y^{2^{11}+2^{6}+1} \oplus Y^{-2^{11}+2^{6}+1} \oplus Y^{2^{11}+2^{6}-1}, \tag{2.13} \]
where $Y = (A_{i+31})^{1057} \oplus 1$ and $A_{i+31}$ is the output of the LFSR. In [35], WGP16 is computed as
\[ 1 \oplus Y \oplus Y^{2^{11}+1} \oplus Y^{2^{11}(2^{11}-1)+1} \oplus Y^{2^{6}} \left( Y^{2^{11}+1} \oplus Y^{2^{11}-1} \right), \tag{2.14} \]
where
\[ Y^{2^{11}-1} = Y^{((1+2)(1+2^{2})+2^{4})(1+2^{5})+2^{10}}. \]
It is noted that (2.14) requires 10 multiplications (including 2 for computing $(A_{i+31})^{1057}$). The WG transform is $\mathrm{WGT16} = \mathrm{Tr}(\mathrm{WGP16})$. The characteristic polynomial of the WG-16's LFSR² is [34]
\[ C(Z) = Z^{32} \oplus Z^{31} \oplus Z^{22} \oplus Z^{9} \oplus \omega^{11}, \tag{2.15} \]
which is primitive over $GF(2^{16})$, where $\omega$ is the root of the $GF(2^{16})$ field polynomial
\[ g(x) = x^{16} + x^{5} + x^{3} + x^{2} + 1. \tag{2.16} \]
² For the field polynomial (2.16), the multiplication with the constant $\omega^{11}$ in (2.15) requires only 33 XOR gates and a delay of $2T_X$.
Chapter 3
Implementations of the WG Stream Ciphers Using ONB-II
In this chapter, a novel method for computing the trace of a product of two field elements is presented for the type-II ONB representation. Also, two designs are proposed: one for the MOWG(29, 11, 17) cipher (where 29 corresponds to $GF(2^{29})$, 11 is the number of stages in the LFSR, and 17 is the number of output bits) and the other for the WG(29, 11) cipher (which was initially proposed in [39]). Both are demonstrated by ASIC and FPGA implementations. The proposed designs optimize the area by reducing the number of multiplications in the MOWG/WG transforms. This is done through signal reuse for the MOWG(29, 11, 17) and through utilizing the new trace properties for the WG(29, 11). The ASIC and FPGA implementations of the proposed WG(29, 11) design show significant reductions in area and power consumption and an improved speed compared to [68]. Notice that an FPGA offers a predetermined amount of resources. In this context, reducing area consumption in an FPGA implementation means decreasing the number of used look-up tables. This, in turn, leaves more resources for implementing other modules on the FPGA chip.
Throughout this chapter, $\oplus$ represents the bit-wise addition operator (XOR) in $GF(2^m)$. $A^{2^p} = A \gg p$ and $A^{2^{-p}} = A \ll p$ represent the right and left cyclic shift, respectively, of the coordinates of $A = (a_0, \ldots, a_{m-1}) \in GF(2^m)$, w.r.t. the NB, $p$ times. In the NB representation, the addition of $1 = (1, \ldots, 1) \in GF(2^m)$ to another $GF(2^m)$ element can be done by complementing the bits of that element. $C(Z) = Z^l \oplus \sum_{i=0}^{l-1} C_i Z^i$, $C_i \in GF(2^m)$, is the characteristic polynomial of an $l$-stage LFSR over $GF(2^m)$, from which the recurrence relation is obtained as
\[ A_{j+l} = \sum_{i=0}^{l-1} C_i A_{i+j}, \tag{3.1} \]
where $j \ge 0$, $A_i \in GF(2^m)$, and $(A_0, A_1, \ldots, A_{l-1})$ is the initial state of the LFSR.
Also, throughout this chapter, the 29-bit WG transformation and permutation introduced in Chapter 2 are rewritten as follows
\[ \mathrm{WGT29}(A_{i+10} \oplus 1) = \mathrm{Tr}(\mathrm{WGP29}(A_{i+10} \oplus 1)), \tag{3.2} \]
and
\[ \mathrm{WGP29}(X) = 1 \oplus X \oplus X^{r_1} \oplus X^{r_2} \oplus X^{r_3} \oplus X^{r_4} = 1 \oplus X \oplus X^{2^k+1} \oplus X^{2^{2k}+(2^k+1)} \oplus X^{2^k(2^k-1)+1} \oplus X^{2^{2k}+(2^k-1)}, \tag{3.3} \]
where $r_1 = 2^k + 1$, $r_2 = 2^{2k} + 2^k + 1$, $r_3 = 2^{2k} - 2^k + 1$, $r_4 = 2^{2k} + 2^k - 1$, and $k = \lceil 29/3 \rceil$ [58].
It is noted that a version of this chapter appears in [31]. The chapter is organized as follows. Sections 3.1 and 3.2 present the new hardware designs of the MOWG(29, 11, 17) cipher and the WG(29, 11) cipher, respectively. Results based on FPGA and ASIC implementations of the new designs are discussed in Section 3.3. Section 3.4 concludes the chapter.
3.1 Optimized Hardware Design of the MOWG(29, 11, 17) Cipher
This section presents a hardware design of the MOWG(29, 11, 17) cipher. In this design, the MOWG transform uses 7 multipliers, compared to 8 multipliers in [58]. Also, in an attempt to improve the overall speed of the cipher, the LFSR is reconstructed in order to remove the inverters from the critical paths of the run phase and the initialization phase. In what follows, the reduced-area MOWG transform design is introduced first, followed by the changes to the LFSR and the key initialization algorithm (KIA) that improve speed. Then, the proposed architecture and finite state machine are discussed, and the section ends by deriving formulations for the space and time complexities.
3.1.1 Reducing the Hardware Complexity of the MOWG Transformation
The hardware cost of the MOWG(29, 11, 17) cipher is dominated by its transform's field multipliers. Any decrease in the number of these multipliers reduces the area of the overall cipher. This section presents the architecture of the MOWG transform, where the number of field multipliers is reduced by 1 through signal reuse, compared to [58].
The architecture of the proposed MOWG transform is shown in Figure 3.1. In this figure, $X = A_{i+10} \oplus 1$ is the bit-wise complement of the LFSR's output, $r_1 = 2^k + 1$, $r_2 = 2^{2k} + 2^k + 1$, $r_3 = 2^{2k} - 2^k + 1$, $r_4 = 2^{2k} + 2^k - 1$, and $k = \lceil 29/3 \rceil = 10$. By taking $X^{2^{2k}}$ as a common factor of the terms with exponents $2^{2k} + (2^k + 1)$ and $2^{2k} + (2^k - 1)$ in equation (3.3), the architecture in this figure can easily be obtained, where the WG permutation given by (3.3) is now computed as follows
\[ \mathrm{WGP29} = 1 \oplus X \oplus X^{2^k+1} \oplus X^{2^k(2^k-1)+1} \oplus X^{2^{2k}} \left( X^{2^k+1} \oplus X^{2^k-1} \right). \tag{3.4} \]
In the MOWG(29, 11, 17), $k = 10$ and, hence, the signal $X^{2^k-1}$ requires 4 multiplications and 4 squaring operations (which are free of cost in the ONB) [58]. Also, in addition to the multiplication operations involved in computing the signal $X^{2^k-1}$, (3.4) requires three more multiplications to generate the signals $X^{2^k+1}$, $X^{2^k(2^k-1)+1}$, and $X^{2^{2k}} \left( X^{2^k+1} \oplus X^{2^k-1} \right)$. Therefore, the architecture of Figure 3.1 requires a total of 7 $GF(2^{29})$ multiplications. The inverter symbol denoted by (1) in this figure requires 29 NOT gates to generate $X = A_{i+10} \oplus 1$ from the LFSR's output signal $A_{i+10}$. The signal $X \oplus X^{r_1} \oplus X^{r_2} \oplus X^{r_3} \oplus X^{r_4}$ is obtained as the addition in $GF(2^{29})$ of $X$, $X^{r_1} = X^{2^k+1}$, $X^{r_2} \oplus X^{r_4} = X^{2^{2k}} \left( X^{2^k+1} \oplus X^{2^k-1} \right)$, and $X^{r_3} = X^{2^k(2^k-1)+1}$. The signals $X^{2^k}$ and $X^{2^{2k}}$ are obtained by right cyclic shifts of $X$, $k$ and $2k$ times, respectively. $X^{2^k+1}$ is generated by multiplying $X$ with $X^{2^k}$ in $GF(2^{29})$. $X^{2^k(2^k-1)}$ is the right cyclic shift of $X^{2^k-1}$, $k$ times, and $X^{2^k(2^k-1)+1}$ is generated by multiplying $X^{2^k(2^k-1)}$ with $X$ in $GF(2^{29})$. In Figure 3.1, the coordinates of the output of $X \oplus X^{r_1} \oplus X^{r_2} \oplus X^{r_3} \oplus X^{r_4}$ in $GF(2^{29})$ are complemented by the inverter symbol denoted by (2) to generate all 29 bits of the WGP29 function of (3.4), which forms the Initial Feedback. Seventeen bits of the WGP29 are the output of the MOWG(29, 11, 17) in the run phase [58].
Figure 3.1: Proposed MOWG transformation.
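The factoring behind (3.4) can be checked at the level of exponents with a few lines of Python. This is a sketch with hypothetical function names; it only verifies that the shared signals reproduce the exponents $r_1, \ldots, r_4$ of (3.3), and it takes the count of 4 multiplications for $X^{2^k-1}$ from the text rather than deriving it.

```python
def wg_exponents(k):
    """Exponents r1..r4 of the WG permutation as defined for (3.3)."""
    return {
        "r1": 2**k + 1,
        "r2": 2**(2 * k) + 2**k + 1,
        "r3": 2**(2 * k) - 2**k + 1,
        "r4": 2**(2 * k) + 2**k - 1,
    }

def factored_exponents(k):
    """The same exponents assembled from the shared signals used in (3.4)."""
    shared_shift = 2**(2 * k)          # X^(2^(2k)): cyclic shift, no multiplier
    base = 2**k - 1                    # X^(2^k - 1): 4 multiplications per the text
    return {
        "r1": 2**k + 1,                # X * X^(2^k): 1 multiplication
        "r2": shared_shift + (2**k + 1),
        "r3": 2**k * base + 1,         # shift of X^(2^k - 1), then * X: 1 multiplication
        "r4": shared_shift + base,     # r2 and r4 share 1 multiplication
    }

MULTS_FOR_BASE_POWER = 4               # quoted from the text for k = 10
TOTAL_MULTS = MULTS_FOR_BASE_POWER + 3
```

Multiplying the sum $X^{2^k+1} \oplus X^{2^k-1}$ by $X^{2^{2k}}$ once yields both $X^{r_2}$ and $X^{r_4}$, which is where one multiplier is saved.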
3.1.2 Improving the Critical Path of the MOWG Transform
The time delay through the MOWG transform dominates the delay of the overall cipher (see Section 3.1.5.2). This section shows how to slightly reduce the delay through this transform. This is accomplished by removing inverter (1) and by relocating inverter (2) away from the critical paths of the run phase and the key initialization phase. This reduces the delay of the critical path by an amount equivalent to the delay of two inverters. However, the MOWG transform delay remains dominant, due to the delays of 5 serially connected field multipliers. First, the required mathematical formulation is derived; then, the required changes to the KIA algorithm are presented.
3.1.2.1 Formulation
During the key initialization phase and the run phase, inverter (1) in Figure 3.1 generates the complement of $A_{i+10}$. Notice that the cell holding $A_{i+10}$ stores the feedback from the LFSR during the run phase, and the bit-wise XOR of the LFSR feedback and the MOWG transform feedback during the key initialization phase. Therefore, removing inverter (1) requires the direct storage of the complement of these values in both phases. In other words, the LFSR must be reconstructed such that it generates a sequence $B = \{ B_i = 1 \oplus A_i, \ 0 \le i < 2^{319} - 1 \}$, where $B_i \in GF(2^{29})$ and $\{A_i\}$ is the sequence generated by (2.11) over $GF(2^{29})$. Sequence $B$ is referred to as the complement sequence of $\{A_i\}$. The following proposition shows how this is accomplished for an LFSR with a general feedback polynomial of degree $l$ over $GF(2^m)$.
Proposition 3.1.1 Let $B$ be the complement sequence of a sequence $A = \{ A_i, \ 0 \le i < 2^{ml} - 1 \}$, where $A_i \in GF(2^m)$ and $A$ is generated by (3.1). Then, $B$ is generated by the following recurrence relation
\[ B_{j+l} = \left( \sum_{i=0}^{l-1} C_i B_{i+j} \right) \oplus \left( \left( \sum_{i=0}^{l-1} C_i \right) \oplus 1 \right), \tag{3.5} \]
where $j \ge 0$, and the initial state of $B$ is $B_i = 1 \oplus A_i$, for $0 \le i \le l - 1$.
Proof By definition
\[ B_{j+l} = A_{j+l} \oplus 1, \tag{3.6} \]
$j \ge 0$. Using (3.1) in (3.6), one gets $B_{j+l} = \sum_{i=0}^{l-1} C_i A_{i+j} \oplus 1$, and by noticing $2C_i = 0$ one obtains
\[ B_{j+l} = \sum_{i=0}^{l-1} C_i (A_{i+j} \oplus 1) \oplus \sum_{i=0}^{l-1} C_i \oplus 1 = \sum_{i=0}^{l-1} C_i B_{i+j} \oplus \sum_{i=0}^{l-1} C_i \oplus 1. \]
Thus, the assertion is true.
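Proposition 3.1.1 can be exercised numerically with a small simulation. The Python sketch below assumes a toy field $GF(2^3)$ in a polynomial basis with $x^3 + x + 1$, so the field identity is the integer 1 rather than the all-ones vector it would be in a NB; the coefficients, initial state, and function names are arbitrary choices for illustration, not parameters of the MOWG cipher.

```python
POLY, M = 0b1011, 3        # toy field GF(2^3), x^3 + x + 1 (an assumption)
ONE = 1                    # field identity in this polynomial-basis encoding

def gf_mul(a, b):
    """Product of two field elements encoded as integers below 2^M."""
    r = 0
    for _ in range(M):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY      # reduce modulo the field polynomial
    return r

def lfsr_sequence(state, coeffs, steps, constant=0):
    """Extend the state with A_{j+l} = sum_i C_i * A_{i+j} XOR constant."""
    seq = list(state)
    for j in range(steps):
        nxt = constant
        for i, c in enumerate(coeffs):
            nxt ^= gf_mul(c, seq[i + j])
        seq.append(nxt)
    return seq

def complement_constant(coeffs):
    """Constant term of (3.5): (sum of all C_i) XOR 1."""
    c = ONE
    for ci in coeffs:
        c ^= ci
    return c
```

Running the original recurrence from a state and the modified recurrence from the complemented state should give sequences that differ by the identity element at every position.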
By noticing that $X = 1 \oplus A_{i+10}$ in (3.4), then, from Proposition 3.1.1, one can see that $X$ is $B_{i+10}$. Notice that the term $\left( \sum_{i=0}^{l-1} C_i \right) \oplus 1$ in (3.5) is a constant term. Hence, its addition in $GF(2^{29})$ is realized with a number of NOT gates equal to its Hamming weight. For the LFSR of the MOWG(29, 11, 17), replacing the coefficients of (2.11) in (3.5) gives $\left( \sum_{i=0}^{l-1} C_i \right) \oplus 1 = \gamma \oplus 1$, which has a Hamming weight equal to 28.
Inverter (2), on the other hand, realizes the addition of the field element 1 in (3.4). Notice that this addition of the term 1 can be implemented in different ways. One way is to add it to one of the terms $X$, $X^{r_1}$, $X^{r_2} \oplus X^{r_4}$, or $X^{r_3}$ prior to the summation of these terms. Doing so would relocate inverter (2) from its current position. However, this relocation must not result in a delay higher than the current maximum delay of the MOWG transform. For this reason, the inverter is relocated to complement $X$ before it is added to $X^{r_1}$. This is the path at the top of Figure 3.1, which has the lowest delay with only two $GF(2^{29})$ adders between inverters (1) and (2).
The following section presents the changes required in the KIA algorithm of the MOWG(29, 11, 17) cipher.
3.1.2.2 Modified KIA Algorithm
Modifying the LFSR of the MOWG(29, 11, 17) according to (3.5) requires its leftmost stage to hold the complement of the Initial Vector during the loading phase. Therefore, the Initial Vector input must be complemented before it is loaded into the modified LFSR. This can easily be implemented by inserting 29 inverters at the multiplexer input that receives the Initial Vector in Figure 2.2.
Next, the proposed architecture of the MOWG(29, 11, 17) cipher is presented.
3.1.3 Architecture
Here, the overall proposed architecture of the MOWG(29, 11, 17) cipher is presented, as shown in Figure 3.2. In this figure, a double-headed arrow under a component corresponds to a 29-bit register that is inserted for pipelining purposes (see Section 3.3.2 for more details). The Finite State Machine (FSM) controls the input to the LFSR for each phase of operation. In the same figure, due to the bit-wise complement operator denoted by (a), the LFSR receives the complemented Initial Vector during the loading phase. Hence, after 11 clock cycles, the initial state of this LFSR, $(B_0, B_1, \ldots, B_{10})$, is the complement of the initial state of the LFSR in Figure 2.2, i.e. $B_i = A_i \oplus 1$, $0 \le i < 11$. When the key initialization phase starts, the bit-wise XOR of the Initial Feedback and the Linear Feedback is applied to the input of the LFSR. Note that the Linear Feedback in Figure 3.2 is generated by (3.5), which is equivalent to $B_i = A_i \oplus 1$, $11 \le i < 33$ (the complement of the corresponding signal in Figure 2.2). However, the Initial Feedback signal in Figure 3.2 has the same value as the one generated in Figure 3.1. This means that the input to the LFSR during the key initialization phase in Figure 3.2 is complemented w.r.t. the one in Figure 2.2. Throughout the run phase, the only input to the LFSR is the Linear Feedback signal $B_i = A_i \oplus 1$, $33 \le i < 2^{319} - 1$. This sets the MOWG transform of Figure 3.2 to generate the same key-stream bits as that of Figure 3.1. It is clear that the maximum delay of the MOWG transformation is reduced by an amount equivalent to the delay of two inverters, as compared to the one in Figure 3.1. The revised LFSR in Figure 3.2 has an additional $H(\gamma \oplus 1) = 28$ inverters compared to Figure 2.2. This is due to the new constant term $\gamma \oplus 1$ in the feedback polynomial.
Figure 3.2: Proposed design of the MOWG(29, 11, 17) cipher.
The following section presents the finite state machine.
3.1.4 The Finite State Machine
This section describes the architecture of the FSM and how it schedules the input to the LFSR throughout the three phases of operation.
Figure 3.3 shows the components of the FSM. The FSM has two inputs, namely clk and reset, 1 bit each, and two outputs denoted as op0 and op1. The reset input is pulled down before each run of the cipher. This forces the 11-bit one-hot counter to initialize to (1, 0, ..., 0), i.e. output 0 is the only bit set to a high logic level. Also, when the reset signal is low, the 2-bit binary counter resets its state to (0, 0). Due to the 1-bit register connected to the AND gate at the reset input of the 11-bit one-hot counter, this counter starts incrementing one clock cycle after the reset signal is pulled up. This ensures that the 11-bit one-hot counter returns to its initial state after 11 clock cycles. Then, it triggers the 2-bit binary counter to increment, which starts the initialization phase. The output of the 2-bit binary counter controls the cipher's phase of operation. This is done by generating the op0 and op1 signals according to Table 3.1. The op0 and op1 signals select one of the three inputs of the multiplexer in Figure 3.2 and connect it to the input of the LFSR during each phase. It is noted that the loading phase takes 11 clock cycles, followed by the key initialization phase, which takes 22 clock cycles, and then the run phase. During the run phase, the clock inputs of the 11-bit one-hot counter and the 2-bit binary counter become idle.
Figure 3.3: FSM of the MOWG.
2-bit counter
bit 1   bit 0   op1   op0   Phase of operation
0       0       0     0     Load Key and IV
1       0       0     1     Key Initialization
0       1       0     1     Key Initialization
1       1       1     0     Running Phase
Table 3.1: Phase of operation in the proposed MOWG as a function of the state of the 2-bit binary counter.
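The schedule in Table 3.1 can be modelled behaviourally in Python. The sketch below is an assumption-laden illustration, not the circuit of Figure 3.3: it infers from the table that op1 behaves like the AND and op0 like the XOR of the two counter bits, and it models the 11-bit one-hot counter simply as an integer division of the cycle index. All names are hypothetical.

```python
L_STAGES = 11   # number of LFSR stages, hence cycles per one-hot counter round

def decode_ops(bit1, bit0):
    """(op1, op0) as a function of the 2-bit counter, inferred from Table 3.1."""
    op1 = bit1 & bit0      # high only in the running phase
    op0 = bit1 ^ bit0      # high only during key initialization
    return op1, op0

PHASE_BY_OPS = {
    (0, 0): "load",
    (0, 1): "key initialization",
    (1, 0): "run",
}

def phase_at(cycle, l=L_STAGES):
    """Phase in a given clock cycle, counting from 0 at the first load cycle."""
    count = min(cycle // l, 3)            # 2-bit counter saturates in the run phase
    bit1, bit0 = (count >> 1) & 1, count & 1
    return PHASE_BY_OPS[decode_ops(bit1, bit0)]
```

With l = 11 this gives 11 loading cycles, 22 key initialization cycles, and the run phase from cycle 33 onward, matching the durations stated above.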
In what follows, the space and time complexities of the proposed MOWG(29, 11, 17) are studied.
3.1.5 Space and Time Complexities
This section provides the space and time complexities of the MOWG design in Figure 3.2.
3.1.5.1 Space Complexity
The space complexity is evaluated in terms of number of gates in each component, in order
to obtain the overall hardware cost. Let NR, NA, NX, NO, and NI denote the number of 1-bit
Registers, AND gates, XOR gates, OR gates, and Inverters, respectively.
MOWG Transform The transform dominates the hardware complexity of the MOWG de-
sign, as it consists of 7 field multipliers and 4 GF

229

adders. A GF

229

adder requires 29
XOR gates. Also, the multiplier in [71] is used for implementation, which has 841 AND gates
and 1218 XOR gates. Therefore, the total hardware cost of the transformation is as listed in
Table 3.2.
LFSR The LFSR has 11 stages of 29-bit shift registers, and a feedback polynomial. The
feedback polynomial is composed of 1 field multiplier (with a constant; see footnote 1),
5 $GF(2^{29})$ additions, and H (  1) = 28 Inverters. Therefore, the hardware complexity of the
LFSR is as summarized in Table 3.2.

Footnote 1: A multiplication with a constant can be further optimized so that it contains few
XOR gates.

Table 3.2: Count of 1-bit registers and logic gates in the different components of the proposed
MOWG design.

  Component            N_R    N_A    N_X    N_O   N_I
  MOWG Transform         -   5887   8642      -     -
  LFSR                 319    841   1363      -    28
  FSM (Figure 3.3)      14      3      1      -     1
  29-bit 4-to-1 MUX      -    174      -     87     2
4-to-1 29-bit Multiplexer The 4-to-1 29-bit multiplexer is composed of a binary tree of three
2-to-1 29-bit multiplexers and 2 NOTs (selectors). Each 2-to-1 29-bit multiplexer is built from
29 parallel 2-to-1 1-bit multiplexers. A 2-to-1 one bit multiplexer consists of two AND gates
and one OR gate. Therefore, the total cost of the 4-to-1 29-bit multiplexer is as summarized in
Table 3.2.
FSM From Figure 3.3, there are 3 AND gates, 1 XOR gate and 1 Inverter in the FSM. The
11-bit one-hot counter is simply an 11-stage circular shift register with set/reset inputs, having
the output of the last shift register fed to the input of the first one. The 2-bit binary counter is
built from two JK Flip Flops. The two inputs of the first FF are pulled to high logic and its
output drives the two inputs of the second FF (one can also use D FFs instead of JK FFs to
design the 2-bit binary counter). Thus, one can find the total number of one-bit registers in the
FSM as
$$N_R = 11 + 2 + 1 = 14.$$
Table 3.2 summarizes the number of gates in the FSM.
In addition to the above-mentioned components, the MOWG cipher contains two 29-bit
bit-wise complement operators (inverter symbol (a) and inverter symbol (b) in Figure 3.2)
and a $GF(2^{29})$ adder (computing the bit-wise XOR of the Initial Feedback signal and the Linear
Feedback signal). Let $N_O^{MOWG}$, $N_I^{MOWG}$, $N_R^{MOWG}$, $N_A^{MOWG}$, and $N_X^{MOWG}$ denote the number of
OR gates, Inverters, 1-bit Registers, AND gates, and XOR gates in the MOWG of Figure
3.2, respectively. Therefore, by adding the corresponding number of gates in this $GF(2^{29})$
adder and in inverter symbols (a) and (b) to the number of gates in the FSM, the 4-to-1 29-bit
multiplexer, the LFSR, and the MOWG transform (see Table 3.2) one obtains
$$N_O^{MOWG} = 87, \quad N_I^{MOWG} = 89, \quad N_R^{MOWG} = 333,$$
$$N_A^{MOWG} = 6905, \quad N_X^{MOWG} = 10035.$$
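The totals above can be cross-checked by tallying the per-component counts quoted in this subsection. The following is a minimal sketch; the helper names `add` and `scale` and the dictionary layout are illustrative choices, not part of the design.

```python
M = 29
KEYS = ("NR", "NA", "NX", "NO", "NI")

def add(*parts):
    """Sum several gate-count dictionaries key by key."""
    total = dict.fromkeys(KEYS, 0)
    for part in parts:
        for key, value in part.items():
            total[key] += value
    return total

def scale(part, n):
    """Gate count of n copies of a component."""
    return {key: n * value for key, value in part.items()}

multiplier = {"NA": 841, "NX": 1218}        # multiplier of [71], as quoted in the text
gf_adder = {"NX": M}                        # GF(2^29) adder: 29 XOR gates

transform = add(scale(multiplier, 7), scale(gf_adder, 4))
lfsr = add({"NR": 11 * M}, multiplier, scale(gf_adder, 5), {"NI": 28})
fsm = {"NR": 14, "NA": 3, "NX": 1, "NI": 1}
mux4 = add(scale({"NA": 2, "NO": 1}, 3 * M), {"NI": 2})   # three 2-to-1 29-bit muxes
extras = add(scale({"NI": M}, 2), gf_adder)               # two complements + one adder

mowg = add(transform, lfsr, fsm, mux4, extras)
print(mowg)
```

The per-component dictionaries reproduce the rows of Table 3.2, and their sum reproduces the five totals stated above.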
3.1.5.2 Time Complexity
Here, the formulation for the critical path delay of the MOWG cipher (Figure 3.2) is derived.
There are three candidate critical paths in the MOWG:
• Critical path of the LFSR.
• Critical path along the MOWG transformation during the key initialization phase.
• Critical path along the MOWG transformation during the run phase.
The LFSR’s path has one multiplication and five finite field additions. This results in a
propagation delay of
$$T_A + \left(1 + \lceil \log_2(6) \rceil + \lceil \log_2(29) \rceil\right) T_X = T_A + 9T_X, \tag{3.7}$$
where $T_A$ and $T_X$ denote the propagation delay of an AND and an XOR, respectively. The delay
through a finite field multiplier is $T_A + \left(1 + \lceil \log_2(29) \rceil\right) T_X$ [71]. On the other hand, the
two MOWG transform paths have 5 multipliers in series, which corresponds to a delay of
$$5\left(T_A + 6T_X\right) = 5T_A + 30T_X. \tag{3.8}$$
From (3.7) and (3.8), it is clear that the longest path of the MOWG cipher passes through its
transformation.
From Figure 3.2, the critical path of the proposed MOWG during the run phase includes
the delays of a 29-bit Register, 5 field multipliers in series, and 3 $GF(2^{29})$ adders. This results
in the delay stated in (3.9):
$$T_{RunPh} = 5T_A + 33T_X + T_R, \tag{3.9}$$
where $T_{RunPh}$ denotes the maximum time delay through the MOWG during the run phase and
$T_R$ denotes the delay of a Register. In the same figure, the critical path of the MOWG during
the key initialization phase includes the delays of 4 $GF(2^{29})$ adders, 5 field multipliers, a 29-bit
Register, and a 4-to-1 29-bit multiplexer. Notice that the delay through the 4-to-1 29-bit
multiplexer is equivalent to the delay through two 2-to-1 1-bit multiplexers in series. This is
equivalent to the sum of the delays through 2 AND gates, 2 OR gates, and 2 Inverters, with
$T_O$ and $T_I$ denoting the delays of an OR gate and an Inverter, respectively. Therefore, the
delay of the MOWG during the key initialization phase is
$$T_{KIPh} = 7T_A + 34T_X + T_R + 2T_O + 2T_I. \tag{3.10}$$
Comparing (3.9) and (3.10), it is clear that $T_{KIPh} > T_{RunPh}$.
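The delay expressions (3.7) to (3.10) can be reproduced by counting gate delays symbolically. The sketch below uses a `Counter` keyed by gate type; the helper `times` and the key names `TA`, `TX`, `TR`, `TO`, `TI` are illustrative conventions chosen here.

```python
from collections import Counter
from math import ceil, log2

def times(delay, n):
    """Delay of n identical stages in series."""
    return Counter({gate: n * count for gate, count in delay.items()})

# One field multiplier [71]: T_A + (1 + ceil(log2 29)) T_X = T_A + 6 T_X
multiplier = Counter(TA=1, TX=1 + ceil(log2(29)))

# (3.7): LFSR path, one multiplication and five field additions
lfsr_path = Counter(TA=1, TX=1 + ceil(log2(6)) + ceil(log2(29)))

# (3.8): five multipliers in series
five_multipliers = times(multiplier, 5)

# (3.9): run phase = register + 5 multipliers + 3 GF(2^29) adders
run_phase = five_multipliers + Counter(TX=3, TR=1)

# (3.10): key initialization = register + 5 multipliers + 4 adders + 4-to-1 mux
mux4 = Counter(TA=2, TO=2, TI=2)
key_init = five_multipliers + Counter(TX=4, TR=1) + mux4

print(dict(lfsr_path), dict(run_phase), dict(key_init))
```

Since `key_init` contains every term of `run_phase` plus additional gate delays, the conclusion that the key initialization path is the longer one follows for any positive gate delays.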
3.2 Low Complexity WG Cipher
This section proposes a new design of the WG(29, 11). The proposed WG design considers
Figure 3.2 with an added trace to the output of the WGP29 as the starting point for optimization.
Properties of the trace function when the elements of $GF(2^m)$ are represented in ONB of
type-II (which exists for m = 29 [52]) are first introduced. The proposed WG design utilizes
these properties in order to minimize the hardware complexity of its transform. Note that the
proposed design eliminates some signals that are necessary for the generation of the Initial
Feedback, which is required to conduct the key initialization phase of the cipher. The missing
Initial Feedback signal is recovered by introducing a serialized scheme to generate it. At the end
of this section, the hardware and the time complexities of the new implementation are provided.
3.2.1 Properties of the Trace Function for Type-II ONB
This section presents a method for computing the trace of a multiplication of two field elements
when the representation is in the type-II ONB. Also, two corollaries are deduced from the
proposed method.
Fact 3.2.1 [65] Let $\{\beta, \beta^{2}, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$ be a type-II ONB for $GF(2^m)$. Then
$$Tr\left(\beta^{2^i}\right) = 1, \quad i = 0, 1, \cdots, m-1,$$
and
$$Tr\left(\beta^{2^i} \beta^{2^j}\right) = 0 \quad \forall i \neq j, \quad i, j = 0, 1, \cdots, m-1.$$
In other words, a type-II ONB is a self-dual basis. Thus Proposition 3.2.2 is achieved as
follows.
Proposition 3.2.2 In a type-II ONB, the trace of the field multiplication of any two $GF(2^m)$
elements $A = (a_0, a_1, \ldots, a_{m-1})$ and $B = (b_0, b_1, \ldots, b_{m-1})$ is computed as the inner product of
A and B, that is:
$$Tr(AB) = \sum_{i=0}^{m-1} a_i b_i. \tag{3.11}$$
Proof The proof is completed by considering the following derivation:
$$Tr(AB) = Tr\left(\sum_{i=0}^{m-1} a_i \beta^{2^i} \sum_{j=0}^{m-1} b_j \beta^{2^j}\right) = \sum_{0 \le i, j < m} a_i b_j \, Tr\left(\beta^{2^i + 2^j}\right) = \sum_{i=0}^{m-1} a_i b_i,$$
where the last result is obtained using Fact 3.2.1.
Proposition 3.2.2 implies that the trace of a field multiplication of two elements represented
in type-II ONB is easily implemented in hardware using m AND gates and m − 1 XOR gates.
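Proposition 3.2.2 can be checked exhaustively on a toy field. The sketch below is an illustration only: it uses $GF(2^3)$ instead of $GF(2^{29})$, with the normal basis formed by the roots of $x^3 + x^2 + 1$ (a self-dual normal basis, used here as a small stand-in for the type-II ONB of the thesis). All function names are hypothetical.

```python
from itertools import product

M = 3
POLY = 0b1101  # x^3 + x^2 + 1; beta = x generates the normal basis used below

def gf_mul(a, b):
    """Multiply two GF(2^M) elements held as polynomial-basis bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return result

def trace(x):
    """Tr(x) = x + x^2 + ... + x^(2^(M-1)); the result is 0 or 1."""
    total, power = 0, x
    for _ in range(M):
        total ^= power
        power = gf_mul(power, power)
    return total

# Normal basis {beta, beta^2, beta^4} with beta = x.
BASIS = [0b010]
for _ in range(M - 1):
    BASIS.append(gf_mul(BASIS[-1], BASIS[-1]))

def from_onb(coords):
    """Map normal-basis coordinates (a_0, ..., a_{M-1}) to a field element."""
    element = 0
    for bit, basis_element in zip(coords, BASIS):
        if bit:
            element ^= basis_element
    return element

def inner_product(a, b):
    """M AND gates followed by an XOR tree, as in (3.11)."""
    acc = 0
    for x, y in zip(a, b):
        acc ^= x & y
    return acc

def proposition_holds():
    vectors = list(product((0, 1), repeat=M))
    return all(
        trace(gf_mul(from_onb(a), from_onb(b))) == inner_product(a, b)
        for a in vectors for b in vectors
    )
```

Calling `proposition_holds()` compares the trace of the full field product against the coordinate-wise inner product for all 64 pairs of elements; the field multiplication on the left is exactly what the inner product lets the hardware avoid.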
Corollary 3.2.3 In type-II optimal normal basis, the two relations below are valid for any two
elements A and B in $GF(2^m)$
$$Tr(AB) = Tr\left((A \gg n)(B \gg n)\right) = \sum_{i=0}^{m-1} a_{i-n} b_{i-n}, \tag{3.12}$$
and
$$Tr(AB) = Tr\left((A \ll n)(B \ll n)\right) = \sum_{i=0}^{m-1} a_{i+n} b_{i+n}, \tag{3.13}$$
where n is a positive integer and the indices of a and b are computed modulo m.
Proof Let A and B be any two elements in $GF(2^m)$ and n an arbitrary positive integer. It is
well known that
$$Tr\left(X^{2^n}\right) = Tr(X)^{2^n} = Tr(X),$$
for any $X \in GF(2^m)$. Therefore, by replacing X with AB one obtains
$$Tr(AB) = Tr\left(A^{2^n} B^{2^n}\right). \tag{3.14}$$
Using Proposition 3.2.2, the proof is completed by realizing that the squaring operation $X^2$,
and the square root operation $X^{2^{-1}}$, are simply the right cyclic shift and the left cyclic shift of
the coordinates of X (or AB) w.r.t. the ONB, respectively.
According to Corollary 3.2.3, the trace of the field multiplication of any two elements A
and B, represented in type-II ONB, does not change if an n-bit cyclic shift (left or right) is
applied to both elements in the same direction.
Corollary 3.2.4 Let C be a common factor of two or more $GF(2^m)$ elements AC, BC, ..., etc.,
then, the following relation holds:
$$Tr(AC) + Tr(BC) + \cdots = \sum_{i=0}^{m-1} (a_i + b_i + \cdots) c_i. \tag{3.15}$$
Proof Let A, B, ..., etc., be any two or more arbitrary elements from the finite field $GF(2^m)$.
Then,
$$Tr(AC) + Tr(BC) + \cdots = Tr\left((A \oplus B \oplus \cdots) C\right) = \sum_{i=0}^{m-1} (a_i + b_i + \cdots) c_i,$$
where the last result follows from Proposition 3.2.2, and $C \in GF(2^m)$.
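At the coordinate level, the two corollaries reduce to simple statements about inner products of bit vectors. The following sketch illustrates them on random 29-bit vectors; it works only on ONB coordinates and assumes Proposition 3.2.2, so it does not perform any field arithmetic. The helper names and the fixed random seed are arbitrary choices made here.

```python
import random

M = 29

def inner_product(a, b):
    """Inner product over GF(2) of two coordinate lists."""
    acc = 0
    for x, y in zip(a, b):
        acc ^= x & y
    return acc

def rotate(v, n):
    """Cyclic shift of a coordinate list by n positions (n may be negative)."""
    n %= len(v)
    return v[-n:] + v[:-n]

def xor_vec(a, b):
    """Coordinate-wise GF(2) addition."""
    return [x ^ y for x, y in zip(a, b)]

rng = random.Random(2015)

def rand_vec():
    return [rng.randint(0, 1) for _ in range(M)]

A, B, C = rand_vec(), rand_vec(), rand_vec()

# Corollary 3.2.3: shifting both operands by the same amount keeps the trace.
shift_ok = all(
    inner_product(rotate(A, n), rotate(B, n)) == inner_product(A, B)
    for n in range(-M, M + 1)
)

# Corollary 3.2.4: a common factor C needs only one inner product.
factor_ok = (inner_product(A, C) ^ inner_product(B, C)) == inner_product(xor_vec(A, B), C)

print(shift_ok, factor_ok)
```

In hardware terms, the first check says a cyclic rewiring of both inputs is free, and the second says two traces sharing a factor cost one XOR layer plus a single inner product.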
The following section applies the new trace properties of this section in order to optimize
the hardware implementation of the WG transform.
3.2.2 Optimizing the WG Transform’s Hardware for the Run Phase
Here, it is shown how Proposition 3.2.2 and Corollaries 3.2.3 and 3.2.4 are used to further
reduce the number of field multiplications in the WG transform in Figure 3.2 (with trace).
Before proceeding, it is important to mention that by applying (3.11), one can generate the
trace of the field multiplication of two elements A and B directly from A and B. However, the
result of the multiplication operation, i.e. C = AB, will be lost. Therefore, it is important to
apply (3.11) to the multiplication terms in (3.4) which are not used anywhere else. From Figure
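3.2, the two signals $X^{r_2} \oplus X^{r_4}$ and $X^{r_3}$ are used only as inputs to the trace function (after they are
bit-wise XORed), while the signal $X^{r_1}$ is required in generating $X^{r_2} \oplus X^{r_4}$. The first two signals
are generated as follows
$$\begin{cases} X^{r_2} \oplus X^{r_4} = X^{2^{2k}} \left(X^{r_1} \oplus X^{2^k - 1}\right), \\ X^{r_3} = X X^{2^k (2^k - 1)}. \end{cases} \tag{3.16}$$
Therefore, applying the trace function to (3.16) one gets
$$\begin{cases} Tr\left(X^{r_2} \oplus X^{r_4}\right) = Tr\left(X^{2^{2k}} \left(X^{r_1} \oplus X^{2^k - 1}\right)\right), \\ Tr\left(X^{r_3}\right) = Tr\left(X X^{2^k (2^k - 1)}\right). \end{cases} \tag{3.17}$$
Using (3.17), the WG transformation becomes
$$WGT29 = Tr\left(1 \oplus X \oplus X^{r_1}\right) + Tr\left(X X^{2^k (2^k - 1)}\right) + Tr\left(X^{2^{2k}} \left(X^{r_1} \oplus X^{2^k - 1}\right)\right). \tag{3.18}$$
Applying a right cyclic shift of 2k stages to $X$ and $X^{2^k (2^k - 1)}$ in the term $Tr\left(X X^{2^k (2^k - 1)}\right)$ of
(3.18) does not change the value of the trace, i.e.
$$Tr\left(X X^{2^k (2^k - 1)}\right) = Tr\left((X)^{2^{2k}} \left(X^{2^k (2^k - 1)}\right)^{2^{2k}}\right). \tag{3.19}$$
Using (3.19) in (3.18) gives
$$WGT29 = Tr\left(1 \oplus X \oplus X^{r_1}\right) + Tr\left(X^{2^{2k}} X^{2^{3k} (2^k - 1)}\right) + Tr\left(X^{2^{2k}} \left(X^{r_1} \oplus X^{2^k - 1}\right)\right). \tag{3.20}$$
Taking $X^{2^{2k}}$ as a common factor in (3.20) one obtains
$$WGT29 = Tr\left(1 \oplus X \oplus X^{r_1}\right) + Tr\left(X^{2^{2k}} \left(X^{r_1} \oplus X^{2^k - 1} \oplus X^{2^{3k} (2^k - 1)}\right)\right). \tag{3.21}$$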
Notice that by applying Corollary 3.2.4 to (3.21), only one multiplication operation is required
to generate $X^{r_1} = X^{2^k + 1}$ (excluding the generation of the signal $X^{2^k - 1}$). Figure 3.4 captures the
resulting architecture of the WG transform in (3.21). In this figure, the block denoted by “IP”
generates the inner product of the two 29-bit inputs, while the summation block adds the 29 bits
at its input over GF(2). This architecture uses 5 field multipliers, i.e., 4 multipliers fewer than
the WG transform presented in [68].

[Figure 3.4: The proposed design of the WG transformation.]
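The step from (3.17) to (3.21) can be checked numerically on a toy field. The sketch below uses $GF(2^5)$ with $k = 2$, chosen here because $3k \equiv 1 \pmod m$ holds as it does for $(m, k) = (29, 10)$. The exponents $r_2$, $r_3$, $r_4$ are written in their standard WG form, which is an assumption since this section only restates $r_1 = 2^k + 1$; they are consistent with (3.16). The identity being tested holds in any basis, so a polynomial basis is used for simplicity.

```python
M, K = 5, 2                 # toy parameters (the thesis uses m = 29, k = 10)
POLY = 0b100101             # x^5 + x^2 + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two GF(2^M) elements held as polynomial-basis bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return result

def gf_pow(x, e):
    """Square-and-multiply exponentiation for e >= 1."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, x)
        x = gf_mul(x, x)
        e >>= 1
    return result

def trace(x):
    total, power = 0, x
    for _ in range(M):
        total ^= power
        power = gf_mul(power, power)
    return total

R1 = 2**K + 1
R2 = 2**(2 * K) + 2**K + 1      # assumed standard WG exponents
R3 = 2**(2 * K) - 2**K + 1
R4 = 2**(2 * K) + 2**K - 1

def traces_direct(x):
    """Tr(X^r3) + Tr(X^r2 + X^r4): needs the products X^r3 and X^r2 + X^r4."""
    return trace(gf_pow(x, R3)) ^ trace(gf_pow(x, R2) ^ gf_pow(x, R4))

def traces_factored(x):
    """Second trace term of (3.21): one common factor X^(2^(2k))."""
    common = gf_pow(x, 2**(2 * K))
    summed = gf_pow(x, R1) ^ gf_pow(x, 2**K - 1) ^ gf_pow(x, 2**(3 * K) * (2**K - 1))
    return trace(gf_mul(common, summed))

identity_ok = all(traces_direct(x) == traces_factored(x) for x in range(2**M))
```

In the actual design the product inside `traces_factored` is never formed; by Proposition 3.2.2 its trace is obtained as an inner product of ONB coordinates, which is why the corresponding multipliers disappear.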
In Figure 3.4, the key stream bits are obtained by XORing $Tr\left(1 \oplus X \oplus X^{r_1}\right)$ and
$Tr\left(X^{r_2} \oplus X^{r_3} \oplus X^{r_4}\right)$. $Tr\left(1 \oplus X \oplus X^{r_1}\right)$ is the GF(2) addition of the coordinates of $1 \oplus X \oplus X^{r_1}$
w.r.t. the ONB. On the other hand, notice that the signals $X^{r_3}$ and $X^{r_2} \oplus X^{r_4}$ do not exist in the
WG transform. This is because $Tr\left(X^{r_2} \oplus X^{r_3} \oplus X^{r_4}\right)$ is generated directly from $X^{2^{2k}}$, $X^{r_1}$, $X^{2^k - 1}$,
and $X^{2^{3k} (2^k - 1)}$ using an inner product operation, as stated in (3.21). The absence of the two
signals $X^{r_3}$ and $X^{r_2} \oplus X^{r_4}$ results in the elimination of the Initial Feedback signal. The next
section proposes a recovery method for generating the Initial Feedback signal, which is only
used in the key initialization phase.
3.2.3 Serializing the Computation of the Initial Feedback Signal
This section presents a method for the recovery of the Initial Feedback signal through serialized
computation. To accomplish the multiplication operations during this serial computation,
the existing finite field multiplier, which is used in generating the signal $X^{r_1}$ in Figure 3.4, is
utilized. The proposed scheme generates the Initial Feedback signal by serially computing it over
three consecutive clock cycles. Denote this complete round of the serialized Initial Feedback
computation (three clock cycles) as an “extended key initialization round”, and the single clock
cycle version of this computation (as in the MOWG design) as a “simple round”. Therefore,
with serialization, the entire key initialization phase requires 3 × 22 = 66 clock cycles instead
of 22 clock cycles (that is, 22 extended rounds instead of 22 simple rounds). It is noted that
this only affects the key initialization phase without increasing the number of cycles required
for the run phase.
The expansion of the key initialization round from 1 to 3 clock cycles is established through
the support of a new FSM control signal, namely, lfsr_clk (Figure 3.5). This signal controls
the clock input of the LFSR and triggers it to shift once every three clock cycles. Also, in order
to compute the Initial Feedback signal over three stages, a new hardware module denoted as
the Serialized Key Initialization Module (SKIM) is introduced (Figure 3.6). This module
uses the available signals and the field multiplier which is used in the generation of $X^{r_1}$ in
Figure 3.4. This module schedules the proper inputs to the field multiplier in each stage of
the serial computation by means of some multiplexers. The outputs of these multiplexers are
controlled by two new signals generated by the FSM, namely, s0 and s1 (Figure 3.5). The
intermediate results, between two consecutive stages of the computation, are stored in internal
29-bit Registers of the SKIM module. In the following, the FSM changes required for the
support of the serialization process are first introduced. Then, the architecture and operation of
the SKIM module and its integration into the WG transform in Figure 3.4 are discussed.
3.2.3.1 Architecture and Operation of the Modified FSM
Here, the new architecture and operation of the FSM are described. The architecture, which
is shown in Figure 3.5, generates the new set of control signals lfsr_clk, s0, and s1. These are
required for the serial computation of the Initial Feedback signal. Before each run of the cipher,
the FSM resets its 11-bit one-hot counter to (1, 0, ..., 0) and its 2-bit binary counter to (0, 0)
(where the leftmost bit and the rightmost bit, within the brackets, denote the lowest output bit
and the highest output bit of the corresponding counter, respectively). This is done by
pulling down the reset inputs. When the reset signal is released, the 2-bit binary counter
becomes ready. At the same time, the 11-bit one-hot counter’s reset input stays pulled down
for an extra clock cycle. This is due to the 1-bit Register connected to the input of the AND
gate which drives its reset input. This assures that the (1, 0, ..., 0) state of the 11-bit one-hot
counter consumes a clock cycle at the beginning of the loading phase. After 11 clock cycles
from the release of the reset signal, the 11-bit one-hot counter returns to the (1, 0, ..., 0) state.
At this point it triggers the clock input of the 2-bit binary counter. The 2-bit binary counter
changes its state to (1, 0), triggering the start of the key initialization phase. Then, the clk
signal starts triggering the clock input of the 3-bit one-hot counter. However, the counting
starts one clock cycle later, when the output of the 1-bit Register connected to the 3-bit one-hot
counter’s reset input pulls up. This in turn assures that the 3-bit one-hot counter consumes one
clock cycle before incrementing its initial state of (1, 0, 0), at the start of the key initialization
phase. During this phase, the first output bit of the 3-bit one-hot counter drives the clock input
of the 11-bit one-hot counter. Therefore, it takes 33 clock cycles for the 11-bit one-hot counter
to complete 11 counts. Hence, it takes 33 clock cycles for the 2-bit binary counter to increment.
Therefore, it requires 66 clock cycles for the 2-bit binary counter to increment twice in order
to start the running phase. When the running phase starts, with the 2-bit binary counter’s state
at (1, 1), the 11-bit and the 3-bit one-hot counters stop counting, as their clock inputs become
idle.

[Figure 3.5: Modified FSM after adding the new 3-bit one-hot counter.]
Notice that during the key initialization phase, the lfsr_clk is driven by the first output of the
3-bit one-hot counter. Hence, the LFSR shifts once every three clock cycles. The two signals
s0 and s1 are derived from the 3-bit one-hot counter’s output according to Table 3.3. Notice that
this table is realized without any additional hardware by setting s0 to be the second output, and
s1 to be the third output, of the 3-bit one-hot counter, respectively. Therefore, (s0, s1) produces
the three patterns of (0, 0), (1, 0), and (0, 1) during the first, the second, and the third
stage of an extended key initialization round, respectively. During the running phase, (s0, s1)
will generate (0, 0). The following section shows how these patterns are used to accomplish
the proper functionality in the key initialization phase as well as in the running phase.
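The control pattern described above can be sketched as a short cycle-level model of the key initialization phase. This is a simplified illustration: the function name `key_init_controls` is hypothetical, the one-cycle reset delay is ignored, and "lfsr_clk asserted" is modelled as the cycle in which the first one-hot output is high.

```python
def key_init_controls(rounds=22):
    """Return (s0, s1, lfsr_clk) for every clock cycle of the key initialization phase."""
    one_hot = [1, 0, 0]                  # outputs bit 0, bit 1, bit 2 of the 3-bit counter
    controls = []
    for _ in range(3 * rounds):
        bit0, bit1, bit2 = one_hot
        # s0 is the second output, s1 the third; lfsr_clk follows the first output.
        controls.append((bit1, bit2, bit0))
        one_hot = one_hot[-1:] + one_hot[:-1]   # circular shift by one stage
    return controls

controls = key_init_controls()
print(len(controls), sum(lfsr for _, _, lfsr in controls))
print(controls[:3])
```

With 22 extended rounds the model runs for 66 cycles, asserts lfsr_clk 22 times, and cycles (s0, s1) through (0, 0), (1, 0), (0, 1), in line with Table 3.3.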
Table 3.3: Signals s0 and s1 as a function of the output of the 3-bit one-hot counter.

  3-bit one-hot counter      s1   s0
  bit 2   bit 1   bit 0
    0       0       1         0    0
    0       1       0         0    1
    1       0       0         1    0
3.2.3.2 Architecture and Operation of the Serialized Key Initialization Module
Here, the SKIM module, which performs the serialized computation of the Initial Feedback
signal over an extended key initialization round (three clock cycles), is presented.
Figure 3.6 is a block diagram describing the architecture of this module. The Initial Feedback
signal in this figure is connected to the LFSR’s input multiplexer as shown in Figure 2.2.
Also, the $X^{r_1}$ connectivity is shown in more detail in Figure 3.7. In this figure, the block denoted
by “IP” generates the inner product of the two 29-bit inputs, while the summation block adds the
29 bits at its input over GF(2). The double-headed arrows under a component (corresponding to
inserted registers) and the dotted arrow output (Initial Feedback) are used for pipelining (see
Subsection 3.3.2). The numbers under a register specify the clocking of that register within the
pipelined scheme, during the initialization phase. During the extended key initialization round,
the two signals s0 and s1 in Figure 3.6 change values in each stage as mentioned in the previous
section. These two signals control the outputs of the three multiplexers MUX1, MUX2, and
MUX3 according to Table 3.4. In each stage of the extended key initialization round, the SKIM
module computes a partial value of the Initial Feedback signal and stores it in Register 2 (see
Figure 3.6).

[Figure 3.6: Block diagram of the SKIM module.]

[Figure 3.7: The proposed WG transformation after integration with the SKIM module.]
During the first clock cycle, s0 and s1 are both at low logic levels. Hence, MUX1, MUX2,
and MUX3 generate the signals $X^{2^k}$, $X$, and $X \oplus 1$ at their outputs, respectively. The output
of the multiplier becomes $X^{r_1} = X^{2^k + 1}$ and that of the $GF(2^{29})$ adder is $X^{r_1} \oplus X \oplus 1$. Upon
receiving a new clock signal, i.e. at the start of the second clock cycle, Register 1 and Register
2 update their states with the output signal of the multiplier and the output of the $GF(2^{29})$ adder,
respectively. Also, $X^{2^k - 1}$ is stored in a 29-bit register. At the same time s0 pulls up, forcing the
outputs of MUX1, MUX2, and MUX3 to become $X^{r_1} \oplus X^{2^k - 1}$, $X^{2^{2k}}$, and $X^{r_1} \oplus X \oplus 1$ (the state of
Register 2 when the clock signal arrived), respectively. With these settings of the multiplexers
and the registers, the multiplier output changes to $X^{r_2} \oplus X^{r_4} = X^{2^{2k}} \left(X^{r_1} \oplus X^{2^k - 1}\right)$ and that of
the $GF(2^{29})$ adder to $X^{r_4} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$, denoting Register 1’s and Register 2’s next states,
respectively, when the third clock signal arrives. When the third clock cycle starts, s0 changes
to low logic level while s1 changes to high logic level, which forces MUX1, MUX2, and MUX3
to generate $X^{2^k (2^k - 1)}$, $X$, and $X^{r_4} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$ at their outputs, respectively. The multiplier
and the $GF(2^{29})$ adder outputs become $X^{r_3} = X^{2^k (2^k - 1) + 1}$ and $X^{r_4} \oplus X^{r_3} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$,
respectively.

Table 3.4: Multiplexers outputs and next states of Register 1 and Register 2 as a function of s0
and s1. Each row lists, in order: s1, s0; the outputs of MUX1, MUX2, and MUX3; and the next
states of Register 1 and Register 2.

  Stage 1 (s1 = 0, s0 = 0)
    MUX1: $X^{2^k}$    MUX2: $X$    MUX3: $X \oplus 1$
    Register 1: $X^{r_1}$    Register 2: $X^{r_1} \oplus X \oplus 1$
  Stage 2 (s1 = 0, s0 = 1)
    MUX1: $X^{r_1} \oplus X^{2^k - 1}$    MUX2: $X^{2^{2k}}$    MUX3: $X^{r_1} \oplus X \oplus 1$
    Register 1: $X^{r_4} \oplus X^{r_2}$    Register 2: $X^{r_4} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$
  Stage 3 (s1 = 1, s0 = 0)
    MUX1: $X^{2^k (2^k - 1)}$    MUX2: $X$    MUX3: $X^{r_4} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$
    Register 1: $X^{r_3}$    Register 2: $X^{r_4} \oplus X^{r_3} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$
At the arrival of the fourth clock signal (the beginning of a new extended key initialization
round) s0 and s1 both change back to low logic levels, and the LFSR is clocked
and latched with the result of the bit-wise XOR of the computed Initial Feedback signal
($X^{r_4} \oplus X^{r_3} \oplus X^{r_2} \oplus X^{r_1} \oplus X \oplus 1$) and the LFSR’s Linear Feedback signal. At the arrival of the
67th clock signal, the LFSR will have been clocked 22 times and the running phase starts.
Throughout the run phase, both s0 and s1 stay at logic level 0; therefore MUX1 generates
the signal $X^{2^k}$ and MUX2 generates the signal $X$. With these values, the multiplier generates
$X^{r_1}$ and the WG transform in Figure 3.7 produces a stream bit for each cycle.
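The three-stage schedule of Table 3.4 can be exercised on a toy field to confirm that a single shared multiplier reproduces the full Initial Feedback value. The sketch below is a functional model only, under stated assumptions: it uses $GF(2^5)$ with $k = 2$ in a polynomial basis instead of $GF(2^{29})$ in ONB, it assumes the standard WG exponents for $r_2$, $r_3$, $r_4$, and the names `skim_round` and `initial_feedback_direct` are hypothetical.

```python
M, K = 5, 2                 # toy parameters (the thesis uses m = 29, k = 10)
POLY = 0b100101             # x^5 + x^2 + 1, irreducible over GF(2)
ONE = 1                     # the field element 1 in polynomial basis

def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return result

def gf_pow(x, e):
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, x)
        x = gf_mul(x, x)
        e >>= 1
    return result

R1, R2 = 2**K + 1, 2**(2 * K) + 2**K + 1
R3, R4 = 2**(2 * K) - 2**K + 1, 2**(2 * K) + 2**K - 1   # assumed standard WG exponents

def skim_round(x):
    """One extended key initialization round: three uses of the same multiplier."""
    x_2k = gf_pow(x, 2**K)
    x_22k = gf_pow(x, 2**(2 * K))
    x_2k_minus_1 = gf_pow(x, 2**K - 1)
    x_2k_times = gf_pow(x, 2**K * (2**K - 1))
    # Stage 1 (s0, s1) = (0, 0): MUX1 = X^(2^k), MUX2 = X, MUX3 = X + 1
    product = gf_mul(x_2k, x)
    reg1, reg2 = product, product ^ x ^ ONE
    # Stage 2 (s0, s1) = (1, 0): MUX1 = X^r1 + X^(2^k - 1), MUX2 = X^(2^(2k)), MUX3 = Register 2
    product = gf_mul(reg1 ^ x_2k_minus_1, x_22k)
    reg1, reg2 = product, product ^ reg2
    # Stage 3 (s0, s1) = (0, 1): MUX1 = X^(2^k (2^k - 1)), MUX2 = X, MUX3 = Register 2
    product = gf_mul(x_2k_times, x)
    reg1, reg2 = product, product ^ reg2
    return reg2

def initial_feedback_direct(x):
    return (gf_pow(x, R4) ^ gf_pow(x, R3) ^ gf_pow(x, R2)
            ^ gf_pow(x, R1) ^ x ^ ONE)

schedule_ok = all(skim_round(x) == initial_feedback_direct(x) for x in range(2**M))
```

The model trades two extra clock cycles per round for the removal of dedicated multipliers, which is the trade-off the serialized scheme makes during key initialization only.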
The following section studies the space and time complexities of the proposed WG(29, 11)
cipher.
3.2.4 Space and Time Complexities
This section begins with presenting the hardware complexity of the proposed WG implemen-
tation, followed by its time complexity.
3.2.4.1 Space Complexity
The space complexity of the WG transform is reduced, while that of the WG’s FSM is slightly
increased, compared to the corresponding ones in the proposed MOWG. Please refer to Tables
3.2 and 3.5 for a comparison of the number of gates in the transform and FSM of the MOWG
and WG, respectively. In what follows, the hardware complexities of the WG transform and its
FSM are first summarized. Then, the overall hardware cost of the WG design is obtained.
WG Transformation The space complexity of the WG transform has been improved compared
to the MOWG transform. This is mainly because the number of field multipliers in the
WG transform is reduced by 2 w.r.t. that in the MOWG transform. On the other hand, compared
to the MOWG transformation in Figure 3.2, the design in Figure 3.7 has the following additional
components: a $GF(2^{29})$ adder, a 29-bit GF(2) addition, three 29-bit Registers, an XOR
gate, an OR gate, one 4-to-1 29-bit multiplexer, two 2-to-1 29-bit multiplexers with 2 selector
NOTs, and an inner product. A 29-bit GF(2) adder consists of 28 XOR gates. A 2-to-1 29-bit
multiplexer consists of 29 parallel 2-to-1 1-bit multiplexers. The inner product has 29 AND
gates and 28 XORs. Refer to Subsection 3.1.5.1 for details about the hardware of the other
components listed above. By adding the hardware of the additional components to the gate
count in the MOWG transform (Table 3.2), and then subtracting the hardware cost of two field
multipliers, the total hardware cost of the proposed WG transform is obtained as listed in Table
3.5.

Table 3.5: Count of 1-bit registers and logic gates in the components of the proposed
WG(29, 11).

  Component           N_R    N_A    N_X    N_O   N_I
  WG Transform         87   4524   6292    146     4
  FSM (Figure 3.5)     18      7      1      3     2
FSM The FSM depicted in Figure 3.5 has two additional AND gates, two OR gates, a 2-to-1
1-bit multiplexer (with 1 selector NOT), a 1-bit Register, and a 3-bit one-hot counter, as
compared to Figure 3.3. Similar to the 11-bit one-hot counter, the 3-bit one-hot counter is
simply composed of a three-stage circular shift register with set/reset inputs, having the output
of the last register fed to the input of the first register. By adding the gates in the mentioned
components to the number of gates of the FSM in Figure 3.3 (Table 3.2), the total hardware
cost of the FSM in Figure 3.5 is as shown in Table 3.5.
The LFSR and the 4-to-1 MUX of the WG have the same complexities as the ones in the
MOWG (Table 3.2). Moreover, the WG design contains two 29-bit bit-wise complement
operations (inverter symbol (a) and inverter symbol (b) in Figure 3.2) and a $GF(2^{29})$ adder
(computing the bit-wise XOR of the Initial Feedback signal and the Linear Feedback signal). Let
$N_O^{WG}$, $N_I^{WG}$, $N_R^{WG}$, $N_A^{WG}$, and $N_X^{WG}$ denote the number of OR gates, Inverters, 1-bit Registers, AND
gates, and XOR gates in the proposed WG cipher, respectively. Therefore, by adding the
corresponding number of gates in the $GF(2^{29})$ adder and in inverter symbols (a) and (b) to the
number of gates in the 4-to-1 multiplexer and the LFSR (see Table 3.2), as well as to the
number of gates in the FSM and the WG transform (see Table 3.5), one obtains
$$N_O^{WG} = 236, \quad N_I^{WG} = 94, \quad N_R^{WG} = 424,$$
$$N_A^{WG} = 5546, \quad N_X^{WG} = 7685.$$
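The WG totals can be re-derived from the component list given in this subsection. The sketch below follows the text step by step (add the extra components to the MOWG transform, subtract two multipliers, then add the remaining blocks); the helper names and dictionary layout are illustrative choices.

```python
M = 29
KEYS = ("NR", "NA", "NX", "NO", "NI")

def add(*parts):
    """Sum several gate-count dictionaries key by key."""
    total = dict.fromkeys(KEYS, 0)
    for part in parts:
        for key, value in part.items():
            total[key] += value
    return total

def scale(part, n):
    return {key: n * value for key, value in part.items()}

multiplier = {"NA": 841, "NX": 1218}                 # multiplier of [71]
mowg_transform = {"NA": 5887, "NX": 8642}            # Table 3.2
mux2 = scale({"NA": 2, "NO": 1}, M)                  # 2-to-1 29-bit multiplexer
mux4 = add(scale(mux2, 3), {"NI": 2})                # 4-to-1 29-bit multiplexer

wg_transform = add(
    mowg_transform,
    {"NX": M},                   # GF(2^29) adder
    {"NX": M - 1},               # 29-bit GF(2) addition
    {"NR": 3 * M},               # three 29-bit Registers
    {"NX": 1}, {"NO": 1},        # single XOR gate and OR gate
    mux4,
    scale(mux2, 2), {"NI": 2},   # two 2-to-1 multiplexers with 2 selector NOTs
    {"NA": M, "NX": M - 1},      # inner product
    scale(multiplier, -2),       # two field multipliers removed
)

fsm = add(
    {"NR": 14, "NA": 3, "NX": 1, "NI": 1},       # FSM of Figure 3.3 (Table 3.2)
    {"NA": 2, "NO": 2},                          # two AND gates, two OR gates
    {"NA": 2, "NO": 1, "NI": 1},                 # 2-to-1 1-bit multiplexer with selector NOT
    {"NR": 1 + 3},                               # 1-bit Register and 3-bit one-hot counter
)

lfsr = {"NR": 319, "NA": 841, "NX": 1363, "NI": 28}  # Table 3.2
wg = add(wg_transform, fsm, lfsr, mux4, {"NI": 2 * M}, {"NX": M})
print(wg_transform, fsm, wg, sep="\n")
```

The intermediate dictionaries match the two rows of Table 3.5, and the final sum matches the five totals stated above.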
3.2.4.2 Time Complexity
Here, the propagation delay along the critical path of the proposed WG design is derived.
Notice that the LFSR is not a candidate for the critical path, since it still has fewer multipliers
contributing to its propagation delay, compared to the WG transform. In what follows, the
formulation of the longest path during the key initialization phase is presented. After this, the
longest path during the running phase is shown to be the critical path of the cipher.
From Figure 3.7, one can see that the critical path during the key initialization phase extends
between the LFSR (not shown in the figure) and the output of the module generating $X^{2^k - 1}$.
Hence, the propagation delay through the longest path during the key initialization phase of the
WG is
$$T_{KIPh} = 24T_X + 4T_A + T_R. \tag{3.22}$$
The longest path of the WG cipher during the run phase can also be seen in Figure 3.7, extending
between the LFSR and the cipher’s output, passing through the $X^{2^k - 1}$ module. Therefore, the
propagation delay of this run phase longest path is easily obtained by adding the delays of its
components as follows
$$T_{RunPh} = 32T_X + 5T_A + T_R. \tag{3.23}$$
From (3.22) and (3.23), the critical path of the cipher is (3.23).
3.3 Results and Comparisons
The following sections compare the proposed designs of the MOWG(29, 11, 17) and the
WG(29, 11) ciphers with the corresponding previous implementations in [58], [68], and [56].
Also, further optimizations and general applicability of the proposed algorithms are discussed.
3.3.1 Results from FPGA and ASIC Implementations
The proposed WG and MOWG designs, together with the WG in [68], have been realized using
ASIC and FPGA implementations. The ASIC speed and area results are for the 65nm CMOS
technology based on Synopsys Design Compiler’s estimate of area and clock speed prior to
place-and-route, with medium effort for optimizations. The power consumption readings have
been conducted under 140 MHz frequency for all the designs. The FPGA designs have been
synthesized using Xilinx Synthesis Tool (XST) [2]. The FPGA area and speed results are for
Xilinx Virtex4 series FPGA device xc4vfx12sf363-10. All FPGA results are for post place-
and-route and the power consumption results have been recorded for a frequency of 29 MHz
for all the designs.

Table 3.6: Results obtained from ASIC implementations.

  Cipher              Transform architecture                    Technology                              Primary        # Clocks in   Bits/Cycle
                                                                                                        optimization   init. phase   (run phase)
                                                                                                        target
  WG-7 @2MHz [60]     Look-up Table (software)                  4-bit microcontroller MARC4 ATAM893-D   -              10084         -
  WG-7 @8MHz [60]     Look-up Table (software)                  8-bit microcontroller ATmega family     -              10074         -
  WG [68]             Multiplier-based                          CMOS 65nm                               Area           22            1
  WG [56]             Look-up Table (ROM)                       -                                       -              -             1
  MOWG [58]           Multiplier-based (Pipelined with Reuse)   CMOS 90nm                               -              22            -
  WG (Figure 3.7)     Multiplier-based                          CMOS 65nm                               Area           66            1
  MOWG (Figure 3.2)   Multiplier-based                          CMOS 65nm                               Area           22            17

  Cipher              Area          Latency   Speed   Throughput   Throughput per area   Dynamic power   Energy
                      (KGate)       (nsec)    (MHz)   (Mbps)       (Kbps/Gate)           (mW)            (mJ/Gbit)
  WG-7 @2MHz [60]     (a)           -         -       0.098        -                     -               -
  WG-7 @8MHz [60]     (b)           -         -       0.28         -                     -               -
  WG [68]             33.2          6.94      144     144          4.34                  7.28            50.6
  WG [56]             (c)           -         -       -            -                     -               -
  MOWG [58]           187 (Kµm²)    -         1000    8500         45 (Kbps/µm²)         -               -
  WG (Figure 3.7)     19.9          4.45      224     224          11.2                  4.45            19.8
  MOWG (Figure 3.2)   26            6.62      151     2567         98.73                 5.89            2.3

  (a) Code Lines = 1097, Exp/Ret = 7/4.
  (b) SRAM = 0, Flash = 1100.
  (c) 319 Registers + 9000 XORs + 2^29 ROM bits.

Table 3.7: Results obtained from FPGA implementations.

  Cipher              Transform architecture                    Family                          Synthesis tool                 Primary        # Clocks in   Bits/Cycle
                                                                                                                               optimization   init. phase   (run phase)
                                                                                                                               target
  WG [68]             Multiplier-based                          Virtex 4 (xc4vfx12sf363-10)     Xilinx XST                     Area           22            1
  MOWG [58]           Multiplier-based (Pipelined with Reuse)   Stratix II (EP2S15F484C)        Mentor Graphics PrecisionRTL   -              22            -
  WG (Figure 3.7)     Multiplier-based                          Virtex 4 (xc4vfx12sf363-10)     Xilinx XST                     Area           66            1
  MOWG (Figure 3.2)   Multiplier-based                          Virtex 4 (xc4vfx12sf363-10)     Xilinx XST                     Area           22            17

  Cipher              LUTs   Latency   Speed   Throughput   Throughput per area   Total power   Energy
                             (nsec)    (MHz)   (Mbps)       (Kbps/LUT)            (mW)          (J/Gbit)
  WG [68]             6449   33.3      30      30           4.65                  380           12.67
  MOWG [58]           4184   -         218     1853         443                   -             -
  WG (Figure 3.7)     4044   29.4      34      34           8.41                  187           5.5
  MOWG (Figure 3.2)   5512   28.6      35      595          108                   342           0.57
The reported ASIC and FPGA results are listed in Tables 3.6 and 3.7, respectively. In Table
3.6, the WG-7 results (another member of the WG family based on an LFSR over $GF(2^7)$) are
from software implementations presented in [60]. KGate is the area equivalence in terms of
number of NAND gates × 10^3 (estimated area of one NAND gate is 2.08 (µm)^2). The results for
the WG(29, 11) hardware implementation proposed by [56] are based on theoretical analysis.
“Exp” and “Ret” denote the depth of the expression and return stacks, respectively. In Tables
3.6 and 3.7, Throughput is the # bits per cycle × speed (Mbps = 10^6 bit/second). Gbit = 10^9 bit.
Also, the readings shown from the MOWG design in [58] were reported for the pipelined-
with-reuse version of the transform. The following paragraphs analyze the reported results and
compare the proposed WG and MOWG designs to the other listed ones.
The reported results show that the proposed WG takes longer to finish its initialization
phase compared to the one in [68] (293 nsec (ASIC)/1.94 µsec (FPGA) in the proposed scheme
compared to 152 nsec (ASIC)/0.73 µsec (FPGA) in [68]). This is not significant because
initialization is executed only once per run. The reported results also show that the proposed
WG is superior to the one in [68] in terms of throughput, area, and power consumption. The
proposed WG has lower latency, by 36% (ASIC) and 12% (FPGA), w.r.t. the one in [68].
Accordingly, the speed/throughput of the proposed WG is increased by 55% (ASIC) and 13%
(FPGA), compared to [68]. Also, notice that the normalized throughput of the proposed WG is
roughly twice the one in [68] (11.2 versus 4.34 Kbps/Gate for ASIC and 8.41 versus 4.65
Kbps/LUT for FPGA). This is due to the higher throughput and the significant reduction in area
(area reduced by 40% for ASIC and by 37% for FPGA) of the proposed WG compared to the one
in [68]. Moreover, one can see that the proposed WG consumes less power (39% ASIC, 51%
FPGA) and uses less than half the energy reported for [68].
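The derived ASIC figures quoted above can be recomputed from the primary columns of Table 3.6. The sketch below assumes that throughput per area is throughput divided by area and that energy per bit is power divided by throughput; these relations are inferred from the table values rather than stated in the text, and the dictionary keys are names chosen here.

```python
ASIC = {
    "WG [68]": {"init_clocks": 22, "bits_per_cycle": 1, "area_kgate": 33.2,
                "period_ns": 6.94, "speed_mhz": 144, "power_mw": 7.28},
    "WG (proposed)": {"init_clocks": 66, "bits_per_cycle": 1, "area_kgate": 19.9,
                      "period_ns": 4.45, "speed_mhz": 224, "power_mw": 4.45},
}

def derived(row):
    """Recompute the derived columns of Table 3.6 from the primary ones."""
    throughput = row["bits_per_cycle"] * row["speed_mhz"]          # Mbps
    return {
        "throughput_mbps": throughput,
        "kbps_per_gate": throughput / row["area_kgate"],           # Mbps/KGate = Kbps/Gate
        "energy_mj_per_gbit": 1000 * row["power_mw"] / throughput, # mW/Mbps = J/Gbit
        "init_time_ns": row["init_clocks"] * row["period_ns"],
    }

def percent_change(old, new):
    return 100 * (new - old) / old

old, new = ASIC["WG [68]"], ASIC["WG (proposed)"]
for name, row in ASIC.items():
    print(name, {key: round(value, 2) for key, value in derived(row).items()})
for key in ("period_ns", "speed_mhz", "area_kgate", "power_mw"):
    print(key, round(percent_change(old[key], new[key]), 1))
```

The recomputed values agree with the table and with the percentages in the paragraph above to within rounding, including the 152 nsec and 293 nsec initialization times.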
The WG design in [56] requires $2^m$ ROM bits for a general WG over $GF(2^m)$. On the other
hand, the area of the proposed WG is dominated by its field multipliers, which have space
complexity quadratic in m. Specifically, for the WG(29, 11), $2^{29}$ bits of ROM are required in [56]
(in addition to 9000 XORs and 319 registers). There are no results in [56] about the running
speed of the presented WG. According to a similar study on ROM-based and multiplier-based
MOWG designs by [58], ROM-based ASIC implementations are always larger and slower than
those using field multipliers, for m > 11.
The proposed MOWG design is expected to offer better area and speed compared to the one
presented in [58]. The proposed MOWG has 8 multipliers compared to 9 in [58]. Therefore,
its area is expected to be scaled down by a ratio close to 8/9 w.r.t. the one in [58]. It is noted
that the results from [58] are reported for the pipelined-with-reuse version of the transform.
Applying pipeline-with-reuse techniques to the proposed MOWG would result in speed and
area readings similar to the ones reported in [58]. For the non-pipelined and the pipelined
(without reuse) versions, however, the proposed MOWG is expected to show lower area, a
slightly higher speed/throughput, and lower latency, compared to the corresponding versions
from [58]. This is due to the removed multiplier and the removed inverters from its critical
path (see Figure 3.2). Notice that a 6-stage pipeline of the proposed MOWG offers 6 times the
throughput which is reported for its non-pipelined version in Tables 3.6 and 3.7 (see Section
3.3.2). That is almost double the throughput provided by the pipeline-with-reuse MOWG in
[58].
The proposed WG offers higher clock speed, and better area and power consumption, compared
to the proposed MOWG. However, the proposed MOWG has higher throughput and
better energy per bit. Most importantly, the WG has better randomness properties than the
MOWG cipher [68, 58]. Therefore, when security and randomness are critical for the application,
the proposed WG design is preferred. If, instead, throughput and area are the critical
criteria for the application, then the proposed WG design is superior for low area
applications, while the proposed MOWG serves better for high throughput applications. It is
noted that one can apply serialization or pipelining to the WG/MOWG transforms for achieving
lower area or higher throughput, if it is demanded by the application. This is discussed in
the next section.
3.3.2 Discussion
This section discusses the serialization and pipeline techniques as further optimizations to
the proposed WG and MOWG. Also, the applicability of the proposed techniques to general
MOWG/WG ciphers, when field elements are represented in the NB, is considered.
For low throughput applications, smaller area can be achieved by serial computation of
the MOWG/WG transforms. Figure 3.8 presents how this is done using one multiplier.
Figure 3.8: Serial Implementation of MOWG/WG Stream Ciphers.
In this figure, the dotted square is used only for generating the WG stream bits. The rest of
the diagram is common for MOWG and WG. The initialization round takes 8 cycles for both
transforms. During run phase of the MOWG, 17 output bits are generated every 7 cycles. For
the WG, a stream bit is produced every 6 cycles. The maximum propagation delay is equivalent
to 17 levels of gate delays. Compared to 38 levels in (3.23) (WG) and 46 levels in (3.10)
(MOWG), the clocks of the serial WG and MOWG are 2.2 and 2.7 times faster, respectively.
Therefore, the throughputs of the serial versions of the WG and the MOWG ciphers are almost
2/6 and 3/7 of the corresponding original ones in Figures 3.7 and 3.2, respectively. The total gate
counts for the serial versions of the transforms are 4155 (WG) and 4011 (MOWG). Compared
to 11053 gates in the WG transform (Section 3.2.4.1) and 14529 gates in the MOWG transform
(Section 3.1.5.1), the areas of the serial versions of the WG/MOWG transforms are almost
2/5 and 2/7 of their original architectures, respectively. If even lower area is demanded, a digit-
level field multiplier [76, 74] can be deployed, adding more cycles for each multiplication.
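The ratios quoted above follow from the gate-level figures given in the text. The sketch below recomputes them, assuming that the clock period is proportional to the number of gate levels on the critical path and that a serial result carries the same number of output bits as one cycle of the original design; the dictionary layout and names are illustrative.

```python
SERIAL_LEVELS = 17  # gate levels on the critical path of the serial design

# name: (gate levels of the original critical path, cycles per serial result,
#        gate count of the original transform, gate count of the serial transform)
DESIGNS = {
    "WG": (38, 6, 11053, 4155),
    "MOWG": (46, 7, 14529, 4011),
}

def serial_ratios(levels, cycles, gates_full, gates_serial):
    clock_gain = levels / SERIAL_LEVELS      # how much faster the serial clock is
    throughput_ratio = clock_gain / cycles   # serial throughput / original throughput
    area_ratio = gates_serial / gates_full   # serial area / original area
    return clock_gain, throughput_ratio, area_ratio

RATIOS = {name: serial_ratios(*params) for name, params in DESIGNS.items()}
for name, (clk, thr, area) in RATIOS.items():
    print(f"{name}: clock x{clk:.2f}, throughput {thr:.2f}, area {area:.2f}")
```

The computed values are close to, but not exactly, the rounded fractions 2/6, 3/7, 2/5, and 2/7 used in the text.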
The proposed schemes can achieve higher throughput through pipelined transforms. The
LFSR should be reconstructed using the Galois-style feedback, or simply by placing the
multiplication with the feedback constant in between cells $B_{i+1}$ and $B_i$. Otherwise, the LFSR's
speed will constrain the pipelining. Figure 3.2 shows how to achieve a 6-stage pipeline of the
MOWG transform using 19 29-bit registers. The pipelined MOWG critical path has 7 levels of
logic gate delays. The corresponding throughput and run phase latency are $17/(T_A + 6T_X)$ and
$6(T_A + 6T_X)$, respectively. Since (3.10) has 46 levels of logic gate delays, the throughput of the
pipelined MOWG is almost 6 times higher. Similarly, Figure 3.7 shows a 6-stage pipeline of the
WG transform. From this figure, one can find the pipelined WG's latency and throughput as
$6(T_A + 6T_X)$ and $1/(T_A + 6T_X)$, respectively (the latency during initialization is higher, i.e.,
$8(T_A + 6T_X)$). Compared to the throughput which results from (3.23), this is almost 5 times
higher. For even higher throughput, the unfolding technique presented in [26] can be deployed.
Simply, the MOWG/WG LFSR is unfolded to generate $n$ outputs ($2 \le n \le 11$) per cycle. Hence,
by implementing the same number of transforms, the throughput will be $n$ times higher at the
expense of a proportional area increase.
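As a rough check of the quoted speed-ups, assume that every gate level contributes one unit delay (i.e., $T_A \approx T_X$, an assumption made here only for illustration). The pipelined critical path $T_A + 6T_X$ then counts 7 levels, and the ratios to the non-pipelined critical paths are:

```latex
\[
\frac{T_{(3.10)}}{T_A + 6T_X} \approx \frac{46}{7} \approx 6.6 \quad \text{(MOWG)},
\qquad
\frac{T_{(3.23)}}{T_A + 6T_X} \approx \frac{38}{7} \approx 5.4 \quad \text{(WG)}.
\]
```

Both values are consistent with the "almost 6 times" and "almost 5 times" figures stated above.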
Notice that Equation (3.4) is a general form of the WG permutation (for any
MOWG(m, l, d)). Since squarings are cyclic shifts in the NB, only the architecture of
the power $2^k - 1$ will vary for different values of $k = \lceil m/3 \rceil$. By having the WGPm, the MOWG
transform is just a proper selection of $d$ bits from the WGPm [58]. Also, notice that the
complement LFSR in (3.5) is general for any $GF(2^m)$. Similarly, except for the power $2^k - 1$, Equation
(3.21) is general for any WG(m, l). However, (3.11) is only applicable to $GF(2^m)$ where a
self-dual NB exists. Therefore, if there is no self-dual NB [15], the inner product which is used
to compute $Tr\left(X^{2^{2k}} \left(X^{r_1} \oplus X^{2^k - 1} \oplus X^{2^{3k}(2^k - 1)}\right)\right)$ in Figures 3.4 and 3.7 should be replaced with a
field multiplication followed by a trace.
It is interesting to investigate the WG implementation in the PB. It is known that the PB
offers area efficient multipliers, compared to the NB representation. However, there is a penalty
due to the additional space and propagation delay introduced by the squaring operations. This
is considered in the next chapter.
3.4 Conclusion
Two new designs for the MOWG(29, 11, 17) and the WG(29, 11) ciphers have been proposed.
As compared to the MOWG presented in [58], the proposed MOWG reduces the number of
field multipliers in the transform by one through signal reuse. Also, it increases the speed by
eliminating two inverter delays from the critical path. This is accomplished by reconstructing
the key/IV loading algorithm and the feedback polynomial of the LFSR. The proposed WG
is an optimization of the proposed MOWG with trace (WG version). It is obtained through
using the new properties of the trace function for type-II ONB, accompanied with serialized
computation of the Initial Feedback signal during the key initialization phase.
The proposed designs have been implemented on ASIC and FPGA. The ASIC implemen-
tations show that the proposed WG implementation achieves better results compared to [68]
for area, speed, and power consumption. The WG improves the power consumption by a 39%
reduction, area by a 40% reduction, and speed by an increase of 55%. Similarly, the FPGA
implementations show that the proposed WG achieves better results for area, speed, and power
consumption compared to [68]. The power consumption is reduced by 51%, the area is reduced
by 37%, and the speed is increased by 13%.
Based on these results, the proposed implementations of the MOWG(29, 11, 17) cipher and
the WG(29, 11) cipher are promising candidates for high speed and limited resources platforms,
respectively, where throughput, area, and power consumption are of critical importance and the
guaranteed randomness properties are required.
Chapter 4
Implementations of the WG Stream
Ciphers Using PB
The previous chapter presented an optimized WG(29, 11) design based on the Type-II Optimal
Normal Basis (ONB-II). Using the novel trace property presented in the previous chapter, the
design requires only 6 field multipliers. In this chapter, the PB representation is considered for the
first time in the WG stream ciphers. A novel method for computing the trace of the
multiplication of two field elements represented in the PB is proposed. It is noted that the proposed
trace method is applicable to any $GF(2^m)$, while the one presented in the previous chapter
only applies to fields where self-dual bases exist. Based on the trace method proposed here,
a PB-based hardware design of the WG(29, 11), which uses 6 multipliers, is presented. Also,
pipelined and serialized instances of this standard design are presented (see Figure 4.1). The
reported results for the 65nm CMOS ASIC realization of the proposed standard WG(29, 11)
design show smaller area and slightly improved normalized throughput, compared to the best
result presented in the previous chapter.
The only WG-16 hardware design, which uses NB, is presented in [35]. This design is
based on composite field arithmetic and properties of the trace function in the tower field
representation. In this chapter, a new formulation of the WG-16 permutation which requires 8
multiplications, compared to 10 in the formulation of [35], is proposed. Furthermore, a new
formulation for the trace function of the multiplication of two field elements is derived, based on
which a PB-based WG-16 design is proposed using only 6 multipliers for its transform. Also,
pipelined and serialized versions of this standard design are presented, and for each design
both the traditional PB and Karatsuba multipliers are considered (see Figure 4.1). According
to the conducted ASIC (CMOS 65 nm) implementations, the proposed pipelined instance of
the WG-16 offers double the throughput, while it slightly reduces the area, compared to the
results reported in [35].
The goal of this chapter is to show hardware implementations for WG ciphers which, in
turn, provide trade-offs between randomness properties and performance when selecting a
cipher for a particular application. In particular, it is shown that the proposed WG-16
implementations comply with the throughput requirements of the 4G domain. The contributions
of this chapter, which include a novel trace method and nine new designs of the WG stream
ciphers, are summarized in Figure 4.1. In this figure, the standard WG(29, 11) implementation
shows lower space and slightly improved normalized throughput, compared to the one in the
previous chapter. Also, the pipelined instance of the proposed WG-16 reports higher throughput
and lower area compared to the corresponding ones in [35].
[Diagram: the novel method for the trace function of the multiplication of two field elements
represented in the PB leads to two families of designs. The PB based design for the WG(29,11)
has standard, serial, and pipelined versions, implemented using the PB multiplier. The PB based
design for the WG-16 has standard, serial, and pipelined versions, implemented using the PB
multiplier and using the Karatsuba multiplier.]
Figure 4.1: Contributions of this work.
It is noted that, throughout this chapter, $\oplus$ represents the addition operator in $GF(2^m)$. Also,
$C(Z) = Z^l \oplus \sum_{i=0}^{l-1} C_i Z^i$, $C_i \in GF(2^m)$, is the characteristic polynomial of an $l$-stage LFSR over
$GF(2^m)$, from which the feedback recurrence relation can be derived as $A_{j+l} = \sum_{i=0}^{l-1} C_i A_{i+j}$,
where $j \ge 0$, $A_i \in GF(2^m)$, and $(A_0, A_1, \ldots, A_{l-1})$ is the initial state of the LFSR.
It is noted that a version of this chapter appears in [32]. The chapter is organized as follows.
Section 4.1 presents the proposed WG(29, 11) hardware designs based on the PB. Section
4.2 presents the proposed WG-16 hardware designs based on the PB. Results based on ASIC
implementations are discussed in Section 4.3. Section 4.4 concludes the chapter.
4.1 Architectures of the WG(29, 11) Stream Cipher
The WG(29, 11) uses exponentiation over $GF(2^{29})$, and therefore, an ONB was assumed to be
more efficient for hardware design, compared to other representations, due to the free cost of
squaring operations [39, 68]. As shown in the previous chapter, new properties of the
trace function for type-II ONB allow the cipher to be built using only 6 field multiplications,
which is the most optimized WG(29, 11) design so far.
In this section, three PB-based designs for the WG(29, 11) are proposed. These designs
include a standard architecture, its serial version, and its pipelined version. The serial version
is suitable for low-area applications whereas the pipelined one is proposed for high-speed
applications. To the best of the author's knowledge, this is the first implementation of the WG
cipher based on the PB representation. The parameters of the cipher are chosen carefully for a
low area design. Also, for further area reduction, the proposed implementation uses properties
of the trace function for the PB in order to optimize the WG transform. The proposed scheme
offers smaller area and a slightly higher normalized throughput, compared to the best results
presented in the previous chapter, at the expense of a small decrease in speed. In this section,
first, the WG transform formulations are derived. This is followed by finding the design
parameters. After that, the proposed architecture of the WG(29, 11) is introduced.
4.1.1 Formulation of WGT29
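Since replacing $Y^{2^{20} - 2^{10} + 1}$ with $Y^{(2^{20} - 2^{10} + 1)2^{20}}$ in (2.9) does not affect $Tr(WGP29)$, therefore
$$WGT29 = Tr\left(1 \oplus Y \oplus Y \left(Y^{2^5}\right)^{2^5}\right) + Tr\left(\left(\left(Y^{2^5}\right)^{2^5}\right)^{2^{10}} \left(Y \left(Y^{2^5}\right)^{2^5} \oplus Y^{2^{10}-1} \oplus \left(Y^{2^{10}-1}\right)^{2^{30}}\right)\right). \tag{4.1}$$
It is noted that (4.1) shows the order of computing the squarings in the transform. To reduce the
propagation delay due to squarings in the PB, $Y^{2^{10}-1}$ is computed as follows:
$$Y^{2^{10}-1} = \left(Y^{2^5+1}\right)^{2+1} \left(Y^{2^5+1}\right)^{2^4} \left(\left(Y^{2^5+1}\right)^{2+1}\right)^{2^2}. \tag{4.2}$$
The following section introduces the design parameters of the WG(29, 11).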
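The exponent arithmetic behind (4.1) and (4.2) can be checked with plain integers, since exponents of a nonzero element of $GF(2^{29})$ only matter modulo $2^{29} - 1$. The following is a minimal sketch; the function names are illustrative and not part of the original design.

```python
M = 29
ORDER = 2**M - 1  # order of the multiplicative group of GF(2^29)

q = 2**10 - 1
r1 = 2**10 + 1
r2 = 2**20 + 2**10 + 1
r3 = 2**20 - 2**10 + 1
r4 = 2**20 + 2**10 - 1

def exponent_of_4_2():
    """Exponent of Y produced by the three factors of (4.2)."""
    base = 2**5 + 1          # Y^(2^5+1)
    cube = base * (2 + 1)    # (Y^(2^5+1))^(2+1)
    return cube + base * 2**4 + cube * 2**2

def exponents_of_4_1_product():
    """Exponents produced by Y^(2^20) times the bracket of (4.1), mod 2^29 - 1."""
    bracket = [r1, q, q * 2**30]
    return sorted((2**20 + e) % ORDER for e in bracket)

print(exponent_of_4_2() == q)
print(exponents_of_4_1_product() == sorted(e % ORDER for e in (r2, r4, r3 * 2**20)))
```

The second check confirms that the product in (4.1) yields the exponents $r_2$, $r_4$, and $r_3 2^{20}$, which is the substitution described before (4.1).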
4.1.2 Design Parameters
This section presents the design parameters for the proposed PB implementation of the
WG(29, 11). In what follows, the field polynomial, the squaring matrices, the LFSR's
characteristic polynomial, the trace vector, and the formulation for directly computing the trace of
the multiplication of two field elements are presented.
4.1.2.1 Field Polynomial and Squaring Matrices
To compute (4.1) and (4.2), field multiplications and squarings are used. In the original design
of the WG(29, 11) [39] and all reported schemes to date [68, 31], the NB representation is used.
Squaring is a cyclic shift in the NB and hence it is free in hardware implementation.
However, such an operation in the PB is not free. On the other hand, field multiplication using
the PB requires lower complexity than the one using the NB. In the PB, the complexities of these
operations depend on the irreducible polynomial that constructs the finite field. It is known that
irreducible trinomials define PBs offering field multiplications with low space and time
complexities [72, 63, 5]. For $GF(2^{29})$, the following two trinomials are irreducible over $GF(2)$:
$$t_1(x) = x^{29} + x^2 + 1, \tag{4.3}$$
and its reciprocal $t_2(x) = x^{29} t_1(x^{-1}) = x^{29} + x^{27} + 1$. Between $t_1$ and $t_2$, $t_1$ offers
operations with lower space complexities. Specifically, the $t_1$-based PB multiplier requires
$29^2 = 841$ ANDs and $29^2 - 1 = 840$ XORs with a propagation delay of $T_A + 7T_X$ [72],
where $T_A$ and $T_X$ are the delays of an AND gate and an XOR gate, respectively. In the following,
the complexities of the squarings using the PB defined by (4.3) are obtained.
Let $A$ be an arbitrary element of $GF(2^m)$ represented in the PB, and let $V = A^2$. Denote
by $a = (a_0, \ldots, a_{m-1})$ and $v = (v_0, \ldots, v_{m-1})$ the row vectors holding the bits which represent
$A$ and $V$ w.r.t. the PB, respectively. Then, $v = aS$, where $S$ is the binary $m \times m$ squaring
matrix whose entries are either 0 or 1 [5]. In general, $W = A^{2^e}$ is obtained as $w = aS^e$.
This formulation involves $m$ inner products $aS^e_j$, where $S^e_j$ denotes the $j$-th column vector of
$S^e$, $0 \le j < m$. Let $N_X$ denote the number of XOR gates. Then, the hardware realization
of $aS^e$ requires $N_X = \sum_{0 \le j < m,\; H(S^e_j) > 1} \left(H(S^e_j) - 1\right)$ XOR gates and has the propagation delay
$T_{S^e} = \left\lceil \log_2 \left( \max_{0 \le j < m} H(S^e_j) \right) \right\rceil T_X$, where $H(\cdot)$ is the Hamming weight of a vector
and the maximum is taken over the columns with $H(S^e_j) > 1$.
For the PB defined by (4.3), the squaring matrix $S$ is shown in Figure 4.2. Table 4.1 lists the
space and time complexities, before and after signal reuse, for the different squaring matrices
used in the WG(29, 11) implementations. In this table, PD denotes propagation delay.
Figure 4.2: The matrix S for WG(29, 11).
4.1.2.2 Characteristic Polynomial of the LFSR
A primitive characteristic polynomial of degree 11 over $GF(2^{29})$ is required in order for the
WG(29, 11) to produce key-streams with the maximal period of $2^{319} - 1$ [39, 68]. For space
efficiency, the following primitive pentanomial is selected:
$$Z^{11} \oplus Z^6 \oplus Z^2 \oplus Z \oplus \omega, \tag{4.4}$$
where $\omega \in GF(2^{29})$ is a root of the defining polynomial (4.3). The primitive property of
the polynomial has been verified using the "is_primitive()" method provided by the Sage
Notebook online tool [3]. Let $\{A_i,\ 0 \le i < 2^{319} - 1\}$ denote the sequence generated by (4.4).
According to the previous chapter, the following recurrence relation generates the sequence
$\{B_i = A_i \oplus 1,\ 0 \le i < 2^{319} - 1\}$:
$$B_{j+11} = \left(B_{j+6} \oplus B_{j+2} \oplus B_{j+1} \oplus \omega B_j\right) \oplus \omega, \quad j \ge 0, \tag{4.5}$$
where $\{B_i = A_i \oplus 1,\ 0 \le i \le 10\}$ is the initial state of the LFSR. By constructing the LFSR based
on (4.5) instead of (4.4), one obtains $Y = 1 \oplus A_{i+10} = B_{i+10}$ in (4.1) and (4.2). In addition,
notice that (4.5) requires only three field additions, one field multiplication with $\omega$ (a constant; see footnote 1),
and one NOT gate (for the addition of $\omega$).
Footnote 1: For the field polynomial (4.3), one can easily find that the multiplication with the constant $\omega$ requires only
one XOR gate with a propagation delay $T_X$.
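The relation between the original recurrence derived from (4.4) and the complemented recurrence can be checked numerically. The sketch below holds field elements as 29-bit integers (bit $i$ is the coefficient of $x^i$) and takes the root of (4.3) to be the element $x$; the names `ROOT`, `step_a`, and `step_b` are illustrative. The complemented recurrence is written with an extra additive constant equal to the root, which is what substituting $A_i = B_i \oplus 1$ into the original recurrence gives.

```python
import random

M = 29
MASK = (1 << M) - 1
ROOT = 0b10  # the element x, a root of t1(x) = x^29 + x^2 + 1

def mul_root(v):
    """Multiply by x modulo x^29 + x^2 + 1 (a shift and one XOR)."""
    v <<= 1
    if v >> M:
        v = (v & MASK) ^ 0b101  # x^29 = x^2 + 1
    return v

def step_a(s):
    """Original recurrence: A[j+11] = A[j+6] + A[j+2] + A[j+1] + root*A[j]."""
    return s[1:] + [s[6] ^ s[2] ^ s[1] ^ mul_root(s[0])]

def step_b(s):
    """Complemented recurrence: the same taps plus the constant root."""
    return s[1:] + [s[6] ^ s[2] ^ s[1] ^ mul_root(s[0]) ^ ROOT]

def complement_holds(steps=1000, seed=1):
    """Check that B_i = A_i + 1 is preserved while both LFSRs run."""
    rng = random.Random(seed)
    a = [rng.getrandbits(M) for _ in range(11)]
    b = [v ^ 1 for v in a]  # initial state B_i = A_i + 1
    for _ in range(steps):
        a, b = step_a(a), step_b(b)
        if any(x ^ 1 != y for x, y in zip(a, b)):
            return False
    return True

print(complement_holds())
```

Adding the root flips a single coordinate, which matches the single NOT gate mentioned in the text.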
Matrix      No Sig. Reuse       Sig. Reuse
            XOR     PD          XOR     PD
S, S^30     15      T_X         15      T_X
S^2         37      2T_X        30      2T_X
S^4         118     3T_X        65      3T_X
S^5         182     4T_X        97      4T_X
S^10        374     5T_X        214     5T_X
S^20        338     5T_X        200     5T_X
Table 4.1: The space and time complexities of the different squaring matrices used in the
WG(29, 11).
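The squaring matrices and their costs without signal reuse can be derived mechanically from the field polynomial. The following is a minimal sketch, assuming field elements are held as 29-bit integers with bit $i$ the coefficient of $x^i$; the helper names are illustrative, and the signal-reuse optimization of Table 4.1 is not modeled.

```python
from math import ceil, log2

M = 29
POLY = (1 << 29) | (1 << 2) | 1  # t1(x) = x^29 + x^2 + 1

def gf_mul(a, b):
    """Multiply two elements of GF(2^29) in the PB defined by t1(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def squaring_columns(e):
    """Columns of S^e as bit masks: bit i of column j is set when input
    bit a_i contributes to output bit w_j of W = A^(2^e)."""
    cols = [0] * M
    for i in range(M):
        row = 1 << i  # basis element x^i
        for _ in range(e):
            row = gf_mul(row, row)  # (x^i)^(2^e)
        for j in range(M):
            if (row >> j) & 1:
                cols[j] |= 1 << i
    return cols

def xor_cost(e):
    """(number of XOR gates, depth in units of T_X) of a*S^e, no signal reuse."""
    weights = [bin(c).count("1") for c in squaring_columns(e)]
    n_x = sum(w - 1 for w in weights if w > 1)
    depth = ceil(log2(max(weights)))
    return n_x, depth

print(xor_cost(1))
```

Calling `xor_cost(e)` for `e` in `(2, 4, 5, 10, 20)` gives values that can be compared with the "No Sig. Reuse" columns of Table 4.1. Since $A^{2^{29}} = A$, the matrix $S^{30}$ coincides with $S$, which is why the two share a row in the table.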
4.1.2.3 Trace Vector
Let the elements in $GF(2^m)$ be represented in the PB which is defined by an irreducible
polynomial $f(x)$ of degree $m$ over $GF(2)$. Then, the trace of an element $A \in GF(2^m)$ is
obtained as $Tr(A) = a\tau^T$, where $a = (a_0, a_1, \ldots, a_{m-1})$ (the $a_i$'s are the coordinates of $A$ w.r.t. the PB),
and $\tau = (\tau_0, \tau_1, \ldots, \tau_{m-1})$ is a unique and constant $m$-bit vector such that $\tau_i = Tr(\omega^i) \in GF(2)$,
$0 \le i < m$, and $f(\omega) = 0$ [5]. Therefore, for the PB $\{\omega^{28}, \ldots, \omega, 1\}$ defined by (4.3), one obtains
$\tau_i = 1$ for $i \in \{0, 27\}$ and $\tau_i = 0$ otherwise. Thus,
$$Tr(A) = a_0 + a_{27}. \tag{4.6}$$
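The trace vector can be reproduced directly from the definition of the trace, $Tr(A) = \sum_{i=0}^{m-1} A^{2^i}$. The sketch below does this for the PB defined by (4.3), with field elements held as 29-bit integers (bit $i$ is the coefficient of the $i$-th power of the root); the function names are illustrative.

```python
M = 29
POLY = (1 << 29) | (1 << 2) | 1  # t1(x) = x^29 + x^2 + 1

def gf_mul(a, b):
    """Multiply two elements of GF(2^29) in the PB defined by t1(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def trace(a):
    """Tr(A) = A + A^2 + A^4 + ... + A^(2^28), by repeated squaring."""
    acc, t = 0, a
    for _ in range(M):
        acc ^= t
        t = gf_mul(t, t)
    return acc  # an element of GF(2), i.e. 0 or 1

def trace_fast(a):
    """Trace via (4.6): the XOR of coordinates 0 and 27."""
    return (a ^ (a >> 27)) & 1

TRACE_VECTOR = [trace(1 << i) for i in range(M)]
NONZERO = [i for i, t in enumerate(TRACE_VECTOR) if t]
print(NONZERO)
```

The two nonzero positions found this way are the ones used in (4.6), so `trace_fast` needs a single XOR gate in hardware.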
4.1.2.4 Trace of Multiplication of Two Field Elements
The previous chapter presented a method for the direct computation of the trace of the
multiplication of two elements represented in the type-II ONB. In the following, a formulation for the
direct computation of the trace of the multiplication of two field elements represented in the PB
is constructed. This method is then used to optimize the space complexity of the PB based
implementations of the WG(29, 11) and the WG-16 (see Sections 4.1.3 and 4.2.4).
Proposition 4.1.1 Consider the $m$-bit trace vector $\tau = (\tau_0, \ldots, \tau_{m-1})$, $\tau_i = Tr(\omega^i)$, where $\omega$
is the root of the defining polynomial of $GF(2^m)$ over $GF(2)$ [5]. For any two field elements
$A = (a_{m-1}, \ldots, a_0)$ and $B = (b_{m-1}, \ldots, b_0)$, let $C = AB \in GF(2^m)$. Then:
$$Tr(C) = \sum_{i=0}^{m-1} \tau_i \sum_{j=0}^{i} a_{i-j} b_j + \sum_{i=0}^{m-1} \tau_i \sum_{k=0}^{m-2} q_{k,i} \sum_{j=k+1}^{m-1} a_{m-j+k} b_j, \tag{4.7}$$
where $Q_{(m-1) \times m} = [q_{k,i}]$ is the reduction matrix, and $U_{(m-1) \times m} = [u_{k,j}]$ and $L_{m \times m} = [l_{i,j}]$ are as
follows [72]:
$$U = \begin{bmatrix}
0 & a_{m-1} & a_{m-2} & \cdots & a_2 & a_1 \\
0 & 0 & a_{m-1} & \cdots & a_3 & a_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & a_{m-1} & a_{m-2} \\
0 & 0 & 0 & \cdots & 0 & a_{m-1}
\end{bmatrix},$$
and
$$L = \begin{bmatrix}
a_0 & 0 & 0 & \cdots & 0 & 0 \\
a_1 & a_0 & 0 & \cdots & 0 & 0 \\
a_2 & a_1 & a_0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
a_{m-2} & a_{m-3} & a_{m-4} & \cdots & a_0 & 0 \\
a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_1 & a_0
\end{bmatrix}.$$
Proof Let $b = (b_0, \ldots, b_{m-1})$ and $c = (c_0, \ldots, c_{m-1})$ be row vectors holding the bits of $B$ and $C$,
respectively. Then, from [72] one has
$$c^T = Lb^T + Q^T U b^T, \tag{4.8}$$
where $Q^T$ is the transpose of $Q$. Therefore:
$$\begin{aligned}
Tr(C) = c\tau^T &= \left(Lb^T\right)^T \tau^T + \left(Q^T U b^T\right)^T \tau^T \\
&= \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} l_{i,j} b_j \tau_i + \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} \sum_{k=0}^{m-2} q_{k,i} u_{k,j} b_j \tau_i \\
&= \sum_{i=0}^{m-1} \tau_i \sum_{j=0}^{i} l_{i,j} b_j + \sum_{i=0}^{m-1} \tau_i \sum_{k=0}^{m-2} q_{k,i} \sum_{j=k+1}^{m-1} u_{k,j} b_j,
\end{aligned}$$
where the last result is obtained by noticing that $l_{i,j} = 0$ for $j > i$ and $u_{k,j} = 0$ for $j \le k$ [72].
By replacing $l_{i,j}$ and $u_{k,j}$ with the corresponding entries from $L$ and $U$, respectively, one
obtains (4.7).
The hardware realization of (4.7) requires $n$ ANDs, $n - 1$ XORs, and a propagation delay
of $T_A + \lceil \log_2(n) \rceil T_X$, where $n = \sum_{\tau_i \ne 0} (i + 1) + \sum_{\tau_i \ne 0,\, q_{k,i} \ne 0} (m - k - 1)$ is the upper bound of
the number of terms $a_{i-j} b_j$ and $a_{m-j+k} b_j$ in (4.7). It is noted that if the trace vector $\tau$ and $Q$ have low
Hamming weights, then the computation of $Tr(AB)$ using (4.7) becomes more efficient (in
terms of space) than the straightforward method. In what follows, the realization of (4.7) for
the WG(29, 11) is derived.
Corollary 4.1.2 Let $\{\omega^{28}, \ldots, \omega, 1\}$ be the PB of $GF(2^{29})$ over $GF(2)$ which is defined by (4.3).
Then, the trace of the multiplication of two field elements $A = \sum_{i=0}^{28} a_i \omega^i$ and $B = \sum_{i=0}^{28} b_i \omega^i$ is
computed as follows:
$$Tr(AB) = (a_0 + a_{27}) b_0 + \sum_{j=1}^{25} \left(a_{27-j} + a_{29-j}\right) b_j + (a_1 + a_{26}) b_{28} + \sum_{j=26}^{27} \left(a_{27-j} + a_{29-j} + a_{54-j}\right) b_j. \tag{4.9}$$
Proof It is noted that the trace vector $\tau$ has only two nonzero components, $\tau_0$ and $\tau_{27}$ (see Section 4.1.2.3).
The $Q$ (reduction) matrix for the field polynomial (4.3) has been computed, and it has been
found that the only nonzero entries in the 1-st and the 28-th columns of this matrix are $q_{0,0}$,
$q_{27,0}$, $q_{25,27}$, and $q_{27,27}$. Hence, (4.9) results from substituting these values in (4.7).
It is noted that the realization of (4.9) requires 29 AND and 59 XOR gates with a time delay
of $T_A + 6T_X$.
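Formula (4.9) can be cross-checked in software against a full field multiplication followed by a trace computed from its definition. The following is a minimal sketch, with field elements held as 29-bit integers (bit $i$ is coordinate $i$ in the PB); the function names, the number of trials, and the seed are illustrative choices.

```python
import random

M = 29
POLY = (1 << 29) | (1 << 2) | 1  # t1(x) = x^29 + x^2 + 1

def gf_mul(a, b):
    """Multiply two elements of GF(2^29) in the PB defined by t1(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def trace(a):
    """Trace by definition: A + A^2 + ... + A^(2^28)."""
    acc, t = 0, a
    for _ in range(M):
        acc ^= t
        t = gf_mul(t, t)
    return acc

def bit(v, i):
    return (v >> i) & 1

def trace_product(a, b):
    """Tr(AB) via (4.9): 29 ANDs on pre-added bits of A, no full product."""
    t = (bit(a, 0) ^ bit(a, 27)) & bit(b, 0)
    for j in range(1, 26):
        t ^= (bit(a, 27 - j) ^ bit(a, 29 - j)) & bit(b, j)
    t ^= (bit(a, 1) ^ bit(a, 26)) & bit(b, 28)
    for j in (26, 27):
        t ^= (bit(a, 27 - j) ^ bit(a, 29 - j) ^ bit(a, 54 - j)) & bit(b, j)
    return t

def formula_matches(trials=300, seed=7):
    rng = random.Random(seed)
    pairs = ((rng.getrandbits(M), rng.getrandbits(M)) for _ in range(trials))
    return all(trace_product(a, b) == trace(gf_mul(a, b)) for a, b in pairs)

print(formula_matches())
```

Each of the 29 bits of $B$ meets exactly one AND gate, which matches the gate count stated for the realization of (4.9).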
4.1.3 Architecture and FSM
4.1.3.1 Architecture of the WG(29, 11) Cipher
The PB (defined by (4.3)) based architecture of the WG(29, 11), according to the WGT29
formulations in (4.1) and (4.2), and the linear recurrence (4.5), is shown in Figures 4.3 and
4.4. In these two figures, the block $Tr(\cdot)$ generates the trace of a $GF(2^{29})$ element (see Section 4.1.2.3).
The trace-of-product block generates the trace of the multiplication of two $GF(2^{29})$ elements using (4.9). $Y$ is
the output of the LFSR represented by (4.5). $Y^{2^{10}-1}$ is generated based on (4.2). An arrow
represents a register which is inserted for pipelining (see Section 4.1.5). A number $n$ under a
register means it is clocked at the end of the $n$-th clock cycle during each computation of the initial
feedback in the initialization phase. A zero under a register indicates that the register's clock
input is always enabled during the run phase. $r_1 = 2^{10} + 1$, $r_2 = 2^{20} + 2^{10} + 1$, $r_3 = 2^{20} - 2^{10} + 1$, and
$r_4 = 2^{20} + 2^{10} - 1$. The squaring matrices are implemented using the signal reuse constructions,
the complexities of which are presented in Table 4.1. The complement operator (the overbar) inverts
the first bit of the input, which requires only one NOT gate. Notice that the product of $B_i$ with the
LFSR constant, which is required for generating the LFSR feedback signal, is stored in the rightmost
cell of the LFSR (i.e., $B'_i$) as shown in Figure 4.3. This is done to reduce the propagation delay
through the LFSR feedback by one multiplier. This construction avoids having the LFSR's critical
path constrain the speed of the cipher when pipelining is applied to the transform.
Figure 4.3: Architecture of the WG(29, 11) stream cipher.
The finite state machine (FSM) controls the cipher during three different phases of
operation (see Section 4.1.3.2). During the load phase, the LFSR shifts at each clock cycle, where
its leftmost cell is loaded with $1 \oplus IV$ ($IV$ is the initial vector).
It is noted that the initial feedback signal $IF = WGP29$, which is needed for the initialization
phase, is not directly available in Figure 4.3. This is a result of computing WGT29 according to (4.1) using
(4.9). Let $q = 2^{10} - 1$, $r_1 = 2^{10} + 1$, $r_2 = 2^{20} + r_1$, $r_3 = 2^{20} - q$, and $r_4 = 2^{20} + q$. Therefore,
the WGP29 in (2.9) can be written as $1 \oplus Y \oplus Y^{r_1} \oplus Y^{r_2} \oplus Y^{r_3} \oplus Y^{r_4}$, and is recovered using
serial computation over 3 clock cycles as described in Table 4.2. In this table, ctrl0 and ctrl1
are generated by the FSM. WGP29 is the next state of Register 2 in stage 3. Rows of the table
are listed in order of computation stages (first to last). It is noted that the next state of Register 3
is always $Y^q$ (Figure 4.3). During the initialization phase, the LFSR shifts once every 3 clock
cycles and loads its leftmost cell with $IF \oplus \overline{LF}$, where $\overline{LF} = LF \oplus 1$ and $LF$ is the original
linear feedback given by (4.4).
Figure 4.4: Architecture of the $2^{10} - 1$ module.
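ctrl0  ctrl1  MUX #1 output                MUX #2 output   MUX #3 output                                                Register 1 next state          Register 2 next state
0      0      $Y^{2^{10}}$                 $Y$             $Y \oplus 1$                                                 $Y^{r_1}$                      $Y^{r_1} \oplus Y \oplus 1$
1      0      $Y^{r_1} \oplus Y^q$         $Y^{2^{20}}$    $Y^{r_1} \oplus Y \oplus 1$                                  $Y^{r_4} \oplus Y^{r_2}$       $Y^{r_4} \oplus Y^{r_2} \oplus Y^{r_1} \oplus Y \oplus 1$
0      1      $Y^{2^{10} q}$               $Y$             $Y^{r_4} \oplus Y^{r_2} \oplus Y^{r_1} \oplus Y \oplus 1$    $Y^{r_3}$                      $Y^{r_4} \oplus Y^{r_3} \oplus Y^{r_2} \oplus Y^{r_1} \oplus Y \oplus 1$
Table 4.2: Computation of the IF = WGP29 signal over 3 clock cycles during the initialization
phase.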
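The three stages of Table 4.2 can be mimicked in software to confirm that Register 2 ends up holding WGP29. The sketch below is a behavioural model only, under the assumption (read off the table) that Register 1 captures the multiplier output and Register 2 captures that output XORed with the MUX #3 value; all function names are illustrative.

```python
M = 29
POLY = (1 << 29) | (1 << 2) | 1  # t1(x) = x^29 + x^2 + 1

def gf_mul(a, b):
    """Multiply two elements of GF(2^29) in the PB defined by t1(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_pow(a, e):
    """Square-and-multiply exponentiation in GF(2^29)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def frob(a, e):
    """a^(2^e): e successive squarings (the S^e block in hardware)."""
    for _ in range(e):
        a = gf_mul(a, a)
    return a

Q = 2**10 - 1
R1, R2, R3, R4 = 2**10 + 1, 2**20 + 2**10 + 1, 2**20 - Q, 2**20 + Q

def wgp29(y):
    """Reference: 1 + Y + Y^r1 + Y^r2 + Y^r3 + Y^r4."""
    return 1 ^ y ^ gf_pow(y, R1) ^ gf_pow(y, R2) ^ gf_pow(y, R3) ^ gf_pow(y, R4)

def initial_feedback(y):
    """Three multiplier passes following the rows of Table 4.2."""
    yq = gf_pow(y, Q)                      # Register 3: output of the 2^10 - 1 module
    reg1 = gf_mul(frob(y, 10), y)          # stage 1: Y^r1
    reg2 = reg1 ^ y ^ 1                    # stage 1: Y^r1 + Y + 1
    reg1 = gf_mul(reg1 ^ yq, frob(y, 20))  # stage 2: Y^r4 + Y^r2
    reg2 ^= reg1
    reg1 = gf_mul(frob(yq, 10), y)         # stage 3: Y^r3
    reg2 ^= reg1
    return reg2

print(initial_feedback(0x1234567) == wgp29(0x1234567))
```

The model reuses one multiplier three times, which is why an initialization round of the standard design takes 3 clock cycles.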
In the running phase, the LFSR updates its state in each clock cycle, where $B_{i+10}$ is loaded
with LF. In Figure 4.3, the keystream bits are obtained from XORing $Tr(1 \oplus Y \oplus Y^{r_1})$ with
$Tr\left(Y^{r_2} \oplus (Y^{r_3})^{2^{20}} \oplus Y^{r_4}\right)$. $Tr(1 \oplus Y \oplus Y^{r_1})$ is the result of XORing $Tr(1 \oplus Y)$ and $Tr(Y^{r_1})$.
$Tr(1 \oplus Y)$ and $Tr(Y^{r_1})$ are produced by applying the operator $Tr(\cdot)$ to $1 \oplus Y$ and $Y^{r_1}$, respectively.
The operator $Tr(\cdot)$ generates its output according to (4.6). $Y^{r_1}$ is generated by multiplying
$Y$ with $Y^{2^{10}}$ in $GF(2^{29})$ (by setting ctrl0 = ctrl1 = 0 for the running phase in Figure 4.3).
$Y$ is the output $B_{i+10}$ of the LFSR and $Y^{2^{10}}$ is obtained from the squarer $S^5$ operating on $Y^{2^5}$,
which, in turn, is available from the generator of $Y^{2^{10}-1}$ (see Figure 4.4). $1 \oplus Y$ is the
addition of $Y \in GF(2^{29})$ with the unity element $1 = (0, \ldots, 0, 1)$ represented w.r.t. the PB. Thus,
$1 \oplus Y$ results from inverting the least significant bit of $B_{i+10}$ by the complement operator.
$Tr\left(Y^{r_2} \oplus (Y^{r_3})^{2^{20}} \oplus Y^{r_4}\right)$ is generated by applying (4.9) to $Y^{2^{20}}$ and $\left(Y^{r_1} \oplus Y^q \oplus Y^{2^{30}q}\right)$. The
signal $Y^{2^{20}}$ is the result of $S^{10}$ operating on $Y^{2^{10}}$. The signal $Y^{r_1} \oplus Y^q \oplus Y^{2^{30}q}$ is the bitwise XOR of $Y^{r_1}$,
$Y^q$, and $Y^{2^{30}q}$, where $Y^{2^{30}q}$ is obtained from $S^{30}$ operating on $Y^q$, and $Y^q$ is generated as presented
in Figure 4.4.
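The run-phase data flow described above can be compared with the direct definition $WGT29 = Tr(WGP29)$. The sketch below follows the signal names of the paragraph, but forms the full product before taking the trace for brevity, whereas the hardware applies (4.9) without computing the product; all function names are illustrative.

```python
M = 29
POLY = (1 << 29) | (1 << 2) | 1  # t1(x) = x^29 + x^2 + 1

def gf_mul(a, b):
    """Multiply two elements of GF(2^29) in the PB defined by t1(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_pow(a, e):
    """Square-and-multiply exponentiation in GF(2^29)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def frob(a, e):
    """a^(2^e): e successive squarings."""
    for _ in range(e):
        a = gf_mul(a, a)
    return a

def trace(a):
    """Trace via (4.6): the XOR of coordinates 0 and 27."""
    return (a ^ (a >> 27)) & 1

Q = 2**10 - 1
R1, R2, R3, R4 = 2**10 + 1, 2**20 + 2**10 + 1, 2**20 - Q, 2**20 + Q

def wgt29_reference(y):
    """Trace of WGP29 computed term by term."""
    wgp = 1 ^ y ^ gf_pow(y, R1) ^ gf_pow(y, R2) ^ gf_pow(y, R3) ^ gf_pow(y, R4)
    return trace(wgp)

def keystream_bit(y):
    """Run-phase output following (4.1)."""
    yq = gf_pow(y, Q)                      # from the 2^10 - 1 module
    y_r1 = gf_mul(y, frob(y, 10))          # Y * Y^(2^10)
    first = trace(1 ^ y ^ y_r1)
    second = trace(gf_mul(frob(y, 20), y_r1 ^ yq ^ frob(yq, 30)))
    return first ^ second

print(keystream_bit(0x1234567) == wgt29_reference(0x1234567))
```

The two results agree because the trace is unchanged when its argument is raised to the power $2^{20}$, which is the substitution applied to the $Y^{r_3}$ term.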
4.1.3.2 The Finite State Machine (FSM)
The architecture of the FSM is shown in Figure 4.5.
Figure 4.5: FSM for the PB based implementation of the WG(29, 11) stream cipher.
The FSM controls the inputs to the LFSR during the three phases of operation through signals
ph0 and ph1. As presented in the Figure 4.3 column of Table 4.3, the loading phase takes 11 clock
cycles, followed by the initialization phase which lasts for 33 + 33 = 66 clock cycles, after which
the run phase starts. The FSM is built from a 2-bit binary counter, an 11-bit 1-hot counter, and a
3-bit 1-hot counter. The 2-bit counter generates ph0 and ph1. The 11-bit counter triggers the
clock of the 2-bit counter, every 11 counts, during loading and initialization. The 3-bit counter
generates ctrl0 and ctrl1, and triggers the clock of the 11-bit counter as well as the clock of the
LFSR, every 3 counts, during initialization.
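The cycle counts of the loading and initialization phases follow from the way the counters are chained. The sketch below is a behavioural model of that chaining, not the gate-level FSM: it assumes that the LFSR shifts once per clock while loading and once every `cycles_per_init_round` clocks while initializing, and that the 2-bit counter advances after every 11 shifts. The function and parameter names are illustrative.

```python
def simulate_fsm(cycles_per_init_round, stages=11):
    """Count the clock cycles spent in the load and initialization phases."""
    phase = 0   # 2-bit binary counter: 0 = load, 1 and 2 = init, 3 = run
    shifts = 0  # position of the 11-bit 1-hot counter
    sub = 0     # position of the per-round counter (3-bit 1-hot in Figure 4.5)
    counts = {"load": 0, "init": 0}
    while phase < 3:
        counts["load" if phase == 0 else "init"] += 1
        period = 1 if phase == 0 else cycles_per_init_round
        sub += 1
        if sub == period:          # the LFSR shifts and the 11-bit counter advances
            sub = 0
            shifts += 1
            if shifts == stages:   # the 2-bit counter is clocked
                shifts = 0
                phase += 1
    return counts

print(simulate_fsm(3))
```

With 3 cycles per round the model reproduces the Figure 4.3 column of Table 4.3. Passing 8 or 12 cycles per round reproduces the initialization totals implied by the Figure 4.6 and Figure 4.8 columns.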
4.1.4 Serialized Implementation of the PB Based WG(29, 11)
4.1.4.1 Architecture of the Serialized WG(29, 11)
Here, a serialized WGP29/WGT29 design is presented for area constrained applications. The
serial WG(29, 11) which is proposed in this section has the same LFSR as the standard
design in Figure 4.3; however, the WG transform and the FSM are modified. Figure
4.6 presents the proposed serial WGP29/WGT29 architecture. In this figure, $Y = B_{i+10}$ is the
LFSR's output (see Figure 4.3), $r_1 = 2^{10} + 1$, $r_2 = 2^{20} + 2^{10} + 1$, $r_3 = 2^{20} - 2^{10} + 1$, and
$r_4 = 2^{20} + 2^{10} - 1$. In this architecture, only one multiplier is used. The computations of the
different variables used in (4.1) and (4.2) are accomplished sequentially according to Table 4.4.

2-bit counter   ph1/ph0   Phase of     Number of clock cycles for the proposed designs of
(a1 a0)                   operation    Fig. 4.3   Fig. 4.6   Fig. 4.8   Fig. 4.12a   Fig. 4.13   Fig. 4.15
0 0             0/0       Load         11         11         11         32           32          32
0 1             1/1       Init.        33         88         132        96           288         416
1 0             1/1       Init.        33         88         132        96           288         416
1 1             0/1       Run          -          -          -          -            -           -
Table 4.3: Phase of operation in the proposed PB based WG designs as a function of the state
of the 2-bit binary counter.
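Next state    Clock cycle 1                 Clock cycle 2                 Clock cycle 3                 Clock cycle 4
Register 1    $Y^{r_1}$                     $Y^{r_1}$                     $Y^{r_1}$                     $Y^{r_1}$
Register 2    -                             $Y^{2+1}$                     $Y^{\sum_{i=0}^{3} 2^i}$      $Y^{\sum_{i=0}^{4} 2^i}$
Register 3    $1 \oplus Y \oplus Y^{r_1}$   $1 \oplus Y \oplus Y^{r_1}$   $1 \oplus Y \oplus Y^{r_1}$   $1 \oplus Y \oplus Y^{r_1}$

Next state    Clock cycle 5                 Clock cycle 6                                                Clock cycle 7
Register 1    $Y^{r_1}$                     $Y^{r_1}$                                                    $Y^{r_1}$
Register 2    $Y^q$                         $Y^q$                                                        $Y^q$
Register 3    $1 \oplus Y \oplus Y^{r_1}$   $1 \oplus Y \oplus Y^{r_1} \oplus Y^{r_2} \oplus Y^{r_4}$    $1 \oplus Y \oplus Y^{r_1} \oplus Y^{r_2} \oplus Y^{r_3} \oplus Y^{r_4}$
Table 4.4: Steps for computing the WGP29 and WGT29 in the serial implementation of the
WG(29, 11) design.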
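The register contents of Table 4.4 can be reproduced with seven uses of a single multiplier. The sketch below is one possible operand schedule that yields the tabulated values; the exact multiplexer wiring of Figure 4.6 is not recoverable from the text, so the operand choices in cycles 2 to 5 are assumptions, and all function names are illustrative.

```python
M = 29
POLY = (1 << 29) | (1 << 2) | 1  # t1(x) = x^29 + x^2 + 1

def gf_mul(a, b):
    """Multiply two elements of GF(2^29) in the PB defined by t1(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_pow(a, e):
    """Square-and-multiply exponentiation in GF(2^29)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def frob(a, e):
    """a^(2^e): e successive squarings."""
    for _ in range(e):
        a = gf_mul(a, a)
    return a

Q = 2**10 - 1
R1, R2, R3, R4 = 2**10 + 1, 2**20 + 2**10 + 1, 2**20 - Q, 2**20 + Q

def wgp29(y):
    """Reference: 1 + Y + Y^r1 + Y^r2 + Y^r3 + Y^r4."""
    return 1 ^ y ^ gf_pow(y, R1) ^ gf_pow(y, R2) ^ gf_pow(y, R3) ^ gf_pow(y, R4)

def serial_wgp29(y):
    """One multiplication per cycle; returns (Register 2, Register 3) after cycle 7."""
    reg1 = gf_mul(y, frob(y, 10))             # cycle 1: Y^r1
    reg3 = 1 ^ y ^ reg1                       # cycle 1: 1 + Y + Y^r1
    reg2 = gf_mul(y, frob(y, 1))              # cycle 2: Y^(2+1)
    reg2 = gf_mul(reg2, frob(reg2, 2))        # cycle 3: Y^15
    reg2 = gf_mul(reg2, frob(y, 4))           # cycle 4: Y^31
    reg2 = gf_mul(reg2, frob(reg2, 5))        # cycle 5: Y^(2^10 - 1)
    reg3 ^= gf_mul(frob(y, 20), reg1 ^ reg2)  # cycle 6: adds Y^r2 + Y^r4
    reg3 ^= gf_mul(y, frob(reg2, 10))         # cycle 7: adds Y^r3
    return reg2, reg3

print(serial_wgp29(0x1234567)[1] == wgp29(0x1234567))
```

In the run phase only the trace of the final sum is needed, so the last multiplication can be replaced by a trace-of-product computation, which is consistent with a stream bit being produced every 6 cycles.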
It is noted that no changes are required for the loading phase of the serial WG(29, 11).
However, in the architecture of Figure 4.6, an initialization round takes 7 clock cycles to
generate the WGP29 signal. The LFSR is updated at the 8-th clock cycle. During the run phase,
a stream bit is produced every 6 cycles. During these two phases, the multiplexers provide the
inputs to the multiplier and the adder. The multiplexers' inputs are selected by the selectors
m0 - m4. The 3 registers are clocked as specified by the clocking table in Figure 4.6. The
clocking of the different registers is enabled by means of clock enable signals (see Section
4.1.4.2). In this design, the lfsr_clk signal in Figure 4.7a is required in order to clock the LFSR
once every 1 clock cycle, 8 clock cycles, and 6 clock cycles, during the loading, initialization, and
run phases, respectively. This means that the initialization phase takes a total of $8 \times 22 = 176$
clock cycles. The numbers of clock cycles needed for the different phases of Figure 4.6 are
presented in the corresponding column of Table 4.3. Moreover, the signal EO in Figure 4.7a is
used to enable the keystream output every 6 clock cycles during the run phase. These selectors,
clock enables, lfsr_clk, and EO signals are generated through the FSM, as presented next.
Figure 4.6: Architecture of the serial WGP29/WGT29 implementation.
[Clocking table of the figure: Register 1 is clocked in cycle 1, Register 2 in cycles 2 - 5, and
Register 3 in cycles 1, 6, and 7.]
4.1.4.2 FSM for the Serialized PB based WG(29; 11)
Figure 4.7a is a block diagram for the FSM which is used for the serialized PB based
WG(29; 11). Also, Figure 4.7b shows the details of generating the Clock Enable Control Sig-
nals and the Multiplexers’ Selectors. For the clock enable signals, the number at the output
of an OR gate indicates the number of the enabled clock cycle during the initialization phase.
m0, m1, m2, m3, and m4 are the selectors for the multiplexers. The FSM controls the inputs
to the LFSR during the three phases of operations. As shown in Table 4.3 for Figure 4.6, the
Figure 4.7: a) Architecture of the FSM for the serialized implementation of the WG(29, 11).
b) Generating the clock enable control signals and the multiplexers' selectors.
loading phase takes 11 clock cycles, followed by the initialization phase, which lasts 176
clock cycles; the run phase then starts. The FSM is built from a 2-bit binary counter, an
11-bit 1-hot counter, an 8-bit 1-hot counter, and a 6-bit 1-hot counter. The 2-bit counter
generates ph0 and ph1 according to Table 4.3. The 11-bit counter triggers the clock of the
2-bit counter every 11 counts during loading and initialization. The 8-bit counter generates
the clock enable signals and the multiplexers' selectors (see Figure 4.6), and triggers the
clock of the 11-bit counter as well as the clock of the LFSR every 8 counts during
initialization. In the run phase, the 6-bit counter generates the clock enable signals and the
multiplexers' selectors, and triggers the clock of the LFSR every 6 counts. From the start of
the run phase, the 6-bit counter also enables the output of the cipher every 6 counts.
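The counter chain described above can be sketched as a small cycle-level model. This is only an illustration, not the thesis's RTL: the function name, the dictionary layout, and the encoding of the 2-bit counter (0 for loading, 1 and 2 for initialization, 3 for run) are assumptions, since Table 4.3 is not reproduced in this section.

```python
def simulate_serial_fsm(total_cycles, lfsr_stages=11, init_period=8, run_period=6):
    """Cycle-level model of the counter chain; returns per-phase statistics."""
    phase = 0  # 2-bit binary counter; ASSUMED encoding: 0 = load, 1 and 2 = init, 3 = run
    word = 0   # position of the 11-bit 1-hot counter
    sub = 0    # position of the 8-bit (init) or 6-bit (run) 1-hot counter
    cycles = {"load": 0, "init": 0, "run": 0}
    lfsr_clocks = {"load": 0, "init": 0, "run": 0}
    outputs = 0
    for _ in range(total_cycles):
        if phase == 0:
            # Loading: the LFSR shifts on every clock cycle.
            cycles["load"] += 1
            lfsr_clocks["load"] += 1
            word += 1
            if word == lfsr_stages:
                word, phase = 0, 1
        elif phase < 3:
            # Initialization: one LFSR clock per wrap of the 8-bit counter.
            cycles["init"] += 1
            sub += 1
            if sub == init_period:
                sub = 0
                lfsr_clocks["init"] += 1
                word += 1
                if word == lfsr_stages:
                    word, phase = 0, phase + 1
        else:
            # Run: one LFSR clock and one keystream bit per wrap of the 6-bit counter.
            cycles["run"] += 1
            sub += 1
            if sub == run_period:
                sub = 0
                lfsr_clocks["run"] += 1
                outputs += 1
    return cycles, lfsr_clocks, outputs


cycles, lfsr_clocks, outputs = simulate_serial_fsm(11 + 176 + 60)
print(cycles, lfsr_clocks, outputs)
```

Under these assumptions the model spends 11 cycles loading and 8 × 22 = 176 cycles initializing, and then emits one keystream bit per 6 run cycles, which matches the figures quoted in the text.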
4.1.5 Pipelined Implementation of the PB Based WG(29, 11)
4.1.5.1 Architecture of the Pipelined PB Based WG(29, 11)
Figures 4.8 and 4.4 present the pipelined version of the PB based implementation of the
WGT29.
Figure 4.8: Pipelined version of the WGT29.
The pipeline has been constructed with 10 stages during the run phase and 12 stages during the
initialization phase, in order to achieve a critical path with only one multiplier. In these
figures, the double-headed arrows point to the locations where the pipeline registers are
inserted. The numbers under these arrows indicate the clock cycles, within each initial
feedback computation during initialization, in which the registers are clock-enabled. A zero
below a register means that its clock input is always enabled during the run phase. The
clocking of the different registers in the transform is controlled by means of clock enable
signals (see Section 4.1.5.2).
It is noted that no changes are required for the loading phase. However, during the
initialization and the run phases, an input signal now requires 12 and 10 clock cycles,
respectively, to propagate to the output of the transform/permutation. Therefore, for the
initialization phase, the lfsr_clk signal in Figure 4.9 triggers the LFSR once every 12 cycles.
This means that the initialization phase takes a total of 12 × 22 = 264 clock cycles, as
presented in Table 4.3 for Figure 4.8. Also, the multiplexers' outputs in Figure 4.8 are
controlled through the signals ctrl0 and ctrl1 (Figure 4.10) during the initialization and the
run phases. For the run phase, an output enable signal, EO in Figure 4.9, is used to enable the
keystream output after the first 10 clock cycles. The following section presents the FSM and
shows how the different control signals are derived.
4.1.5.2 FSM for the Pipelined PB Based WG(29, 11)
Figure 4.9: Architecture of the FSM for the pipelined version of the WG(29, 11).
Figure 4.9 presents the architecture of the FSM which is used for the pipelined version of
the PB based implementation of the WG(29, 11). Figure 4.10 shows the details of generating
the clock enable signals and the ctrl0 and ctrl1 signals. The numbers at the output indicate
the clock cycles, within each initial feedback computation during initialization, in which the
register is clock-enabled. A 0 at the output means that the clock input is always enabled
during the run phase. Signals s and $c_i$, $0 \le i \le 11$, are shown in Figure 4.9. Similar
to the previously introduced FSMs in this chapter, the FSM controls the inputs to the LFSR
during the three phases of operation by generating the signals ph0 and ph1. According to the
column of Figure 4.8 in Table 4.3, the loading phase takes 11 clock cycles, followed by the
initialization phase, which lasts 264 clock cycles, followed by the run phase. The FSM is
built from a 2-bit binary counter, an 11-bit 1-hot counter, and a 12-bit 1-hot counter. The
2-bit counter generates ph0 and ph1 according to Table 4.3. The 11-bit counter triggers the
clock of the 2-bit counter every 11 counts during loading and initialization. The 12-bit
counter generates the clock enable signals and the multiplexers' selectors (ctrl0 and ctrl1,
see Figure 4.8), and triggers the clock of the 11-bit counter as well as the clock of the LFSR
every 12 counts during initialization. In the run phase, signal s in Figure 4.9 and the 12-bit
counter generate the clock enable signals and the multiplexers' selectors (fixed at
ctrl0 = ctrl1 = 0), respectively. The 12-bit counter enables the output of the cipher 10
counts after the start of the run phase. The LFSR is triggered at each clock cycle in the run
phase.
Figure 4.10: Clock enable control signals for the pipelined version of the WG(29, 11).
4.2 Architectures of the WG-16 Stream Cipher
The WG-16 cipher has been proposed by the authors of [34] for securing the 4G confidentiality
and integrity protection schemes against the attack in [90]. The only WG-16 hardware design,
which uses NB, is presented in [35]. This design is based on composite field arithmetic and
properties of the trace function in the tower field representation.
Here, a new formulation of the WG-16 permutation is proposed. This formulation requires
8 multiplications compared to 10 in the formulation of [35]. Based on this formulation, and
using the trace property in (4.7), this section presents, for the first time, six hardware
architectures of the WG-16 based on the PB representation. The six designs comprise a standard
architecture, its serial version, and its pipelined version, each realized with two different
types of multipliers. The serial version can be used for low-area applications whereas the
pipelined one is suitable for high-speed applications. The pipelined instance of the proposed
scheme offers almost twice the throughput reported by the implementations in [35], at a
slightly smaller area. In what follows, the formulation of the WG-16 transform is derived,
followed by the formulations used for the squaring and trace functions. In addition, the
formulation for direct computation of the trace of the multiplication of two field elements in
the PB is obtained. Then, the proposed standard architecture of the WG-16 is shown. The section
ends by presenting serialized and pipelined versions of the standard design.
4.2.1 Formulations of WGP16 and WGT16
The WGP16 formulation in (2.14) requires 10 multiplications when the field elements are
represented in the PB. In the following, a new formulation is derived which requires
8 multiplications.
Proposition 4.2.1 The WG permutation of the WG-16 stream cipher is computed as follows
\[
\mathrm{WGP16} = 1 \oplus Y \oplus Y^{2^{11}+1} \oplus Y^{2^{11}(2^{5}-1)+2^{6}}
\oplus Y^{2^{11}+1}\left(Y^{2^{6}} \oplus Y^{2(2^{5}-1)}\right), \tag{4.10}
\]
where $Y = (A_{i+31})^{1057} \oplus 1$, $A_{i+31}$ is the output of the LFSR described by
(2.15), and $Y^{2^{5}-1}$ is computed as follows
\[
Y^{2^{5}-1} = \left(Y^{2^{2}+1}\right)^{2+1} Y^{2^{4}}. \tag{4.11}
\]
Proof Let $e_1 = 2^{11}+1$, $e_2 = 2^{11}+2^{6}+1$, $e_3 = -2^{11}+2^{6}+1$, and
$e_4 = 2^{11}+2^{6}-1$ in (2.13). By noticing that
$e_3 + 2^{16} - 1 \equiv e_3 \pmod{2^{16}-1}$, one obtains
\[
e_2 = e_1 + 2^{6}, \qquad e_3 \equiv 2^{11}s + 2^{6} \pmod{2^{16}-1}, \qquad e_4 = e_1 + 2s,
\]
where $s = 2^{5}-1$, and the proof is completed by taking $Y^{e_1}$ as a common factor between
$Y^{e_2}$ and $Y^{e_4}$.
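The exponent identities used in the proof are plain integer arithmetic and can be checked directly. The following is a minimal sketch; the variable names are chosen here for illustration, and the decomposition 1057 = 2^10 + 2^5 + 1 is an arithmetic observation added for context rather than a statement taken from the proof.

```python
MOD = 2**16 - 1  # exponents of nonzero GF(2^16) elements can be reduced modulo 2^16 - 1

s = 2**5 - 1
e1 = 2**11 + 1
e2 = 2**11 + 2**6 + 1
e3 = -2**11 + 2**6 + 1
e4 = 2**11 + 2**6 - 1

# Identities used in the proof of Proposition 4.2.1.
assert e2 == e1 + 2**6
assert e3 % MOD == 2**11 * s + 2**6   # e3 is replaced by its representative modulo 2^16 - 1
assert e4 == e1 + 2 * s

# Addition chain behind (4.11): (2^2 + 1)(2 + 1) + 2^4 = 2^5 - 1.
assert (2**2 + 1) * (2 + 1) + 2**4 == s

# The LFSR output is raised to 1057 = 2^10 + 2^5 + 1.
assert 2**10 + 2**5 + 1 == 1057

print(e1, e2, e3 % MOD, e4, s)
```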
The WG transform is obtained by taking the trace of (4.10). Equation (4.10) requires
8 $GF(2^{16})$ multiplications: 1 for computing $Y^{2^{11}+1}$, 3 for computing
$Y^{2^{5}-1}$, 1 for computing $Y^{2^{11}(2^{5}-1)+2^{6}}$, 1 for computing
$Y^{2^{11}+1}\left(Y^{2^{6}} \oplus Y^{2(2^{5}-1)}\right)$, and 2 for computing
$1 \oplus Y = (A_{i+31})^{1057}$. In addition, (4.10) requires 7 squarings and
5 $GF(2^{16})$ additions. For the transform, the computation of the trace of WGP16 is required.
Section 4.2.3 presents a method which reduces the number of multiplications in the WGT16 to
only 6 by computing $\mathrm{Tr}\left(Y^{2^{11}(2^{5}-1)} Y^{2^{6}}\right)$ directly from
$Y^{2^{11}(2^{5}-1)}$ and $Y^{2^{6}}$, and
$\mathrm{Tr}\left(Y^{2^{11}+1}\left(Y^{2^{6}} \oplus Y^{2(2^{5}-1)}\right)\right)$ directly from
$Y^{2^{11}+1}$ and $Y^{2^{6}} \oplus Y^{2(2^{5}-1)}$, without performing the multiplications.
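The multiplication count of (4.10) can be illustrated with a small software model. This is a sketch under stated assumptions, not the proposed hardware: the field polynomial (2.16) is not reproduced in this section, so x^16 + x^5 + x^3 + x^2 + 1 is assumed here (it is consistent with the trace vector given in Section 4.2.2.2), and all function names are hypothetical. Repeated squarings stand in for the squaring matrices and are not counted as multiplications.

```python
import random

POLY = 0x1002D  # x^16 + x^5 + x^3 + x^2 + 1; ASSUMED stand-in for the field polynomial (2.16)


def gf_mul(a, b):
    """Shift-and-add product of two GF(2^16) elements, reduced modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10000:
            a ^= POLY
    return r


def sq(a, k=1):
    """k successive squarings (the role of the matrix S^k); not counted as multiplications."""
    for _ in range(k):
        a = gf_mul(a, a)
    return a


def gf_pow(a, e):
    """Square-and-multiply exponentiation, used only for the reference value."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


def wgp16_factored(x):
    """Evaluate (4.10) and (4.11); returns (WGP16, number of general multiplications)."""
    count = 0

    def mul(a, b):
        nonlocal count
        count += 1
        return gf_mul(a, b)

    x1057 = mul(mul(sq(x, 10), x), sq(x, 5))  # 1057 = 2^10 + 2^5 + 1: 2 multiplications
    y = x1057 ^ 1                             # Y = X^1057 + 1
    t = mul(sq(y, 2), y)                      # Y^(2^2+1)
    ys = mul(mul(sq(t), t), sq(y, 4))         # (4.11): Y^(2^5-1), 3 multiplications in total
    ye1 = mul(sq(y, 11), y)                   # Y^(2^11+1)
    ye3 = mul(sq(ys, 11), sq(y, 6))           # Y^(2^11(2^5-1)+2^6)
    tail = mul(ye1, sq(y, 6) ^ sq(ys))        # Y^(2^11+1) (Y^(2^6) + Y^(2(2^5-1)))
    return 1 ^ y ^ ye1 ^ ye3 ^ tail, count


def wgp16_direct(x):
    """Reference: 1 + Y + Y^e1 + Y^e2 + Y^e3 + Y^e4 with e3 reduced modulo 2^16 - 1."""
    y = gf_pow(x, 1057) ^ 1
    exps = [2**11 + 1, 2**11 + 2**6 + 1, (-2**11 + 2**6 + 1) % (2**16 - 1), 2**11 + 2**6 - 1]
    out = 1 ^ y
    for e in exps:
        out ^= gf_pow(y, e)
    return out


rng = random.Random(0)
for x in [0, 1] + [rng.randrange(1 << 16) for _ in range(200)]:
    value, mults = wgp16_factored(x)
    assert value == wgp16_direct(x) and mults == 8
```

The loop at the end compares the factored evaluation with the six-term sum on random inputs and confirms that exactly 8 general multiplications are used in this model.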
4.2.2 Squaring Matrices and Trace Vector
Similar to the WG(29, 11), in what follows, the squaring matrices and the trace vector for the
field polynomial (2.16) are presented.
4.2.2.1 Squaring Matrices
Figure 4.11 shows the squaring matrix S for the field polynomial (2.16). One can find the
required squaring operations for the WG-16 permutation from (4.10) and (4.11). Table 4.5 lists
the space and propagation delay complexities of the different squaring matrices used in the
WG-16 implementation (before and after signal reuse). In this table, PD denotes propagation
delay.
Figure 4.11: The matrix S for WG-16.
Matrix | No Sig. Reuse: XOR | No Sig. Reuse: PD | Sig. Reuse: XOR | Sig. Reuse: PD
S      | 30                 | 3T_X              | 21              | 3T_X
S^2    | 82                 | 3T_X              | 45              | 3T_X
S^4    | 103                | 4T_X              | 64              | 4T_X
S^5    | 89                 | 4T_X              | 58              | 4T_X
S^6    | 99                 | 4T_X              | 63              | 4T_X
S^9    | 89                 | 4T_X              | 58              | 4T_X
S^10   | 102                | 4T_X              | 60              | 4T_X
S^11   | 115                | 4T_X              | 62              | 4T_X
Table 4.5: Space and propagation delay complexities of the different squaring matrices used in
the WG-16.
4.2.2.2 Trace Vector
The trace vector for the PB $\{\omega^{15}, \ldots, \omega, 1\}$ defined by (2.16) has
components indexed $0, \ldots, 15$, where the $i$-th component is 1 for $i \in \{11, 13\}$ and
0 otherwise (see Section 4.1.2.3). Thus, for $A \in GF(2^{16})$,
\[
\mathrm{Tr}(A) = a_{11} + a_{13}. \tag{4.12}
\]
4.2.3 Trace of the Multiplication of Two Field Elements for the PB Based WG-16
The following is the realization of (4.7) when applied to WG-16.
Corollary 4.2.2 Consider the $GF(2^{16})$ defined by (2.16) where
$\{\omega^{15}, \ldots, \omega, 1\}$ is its PB. Then, the trace of the multiplication of two
field elements $A = \sum_{i=0}^{15} a_i \omega^i$ and $B = \sum_{i=0}^{15} b_i \omega^i$ is
computed as follows:
\[
\begin{aligned}
\mathrm{Tr}(AB) ={}& \sum_{j=0}^{11} \left(a_{11-j} + a_{13-j}\right) b_j
+ \sum_{j=12}^{13} a_{13-j} b_j
+ \sum_{j=7}^{9} a_{22-j} b_j
+ \left(a_{12} + a_{15}\right) b_{10} \\
&+ \sum_{j=11}^{13} \left(a_{22-j} + a_{25-j} + a_{26-j}\right) b_j
+ \sum_{j=14}^{15} \left(a_{22-j} + a_{25-j} + a_{26-j} + a_{29-j}\right) b_j.
\end{aligned} \tag{4.13}
\]
Proof The trace vector has only two nonzero components, those of indices 11 and 13 (see
Subsection 4.2.2.2). By computing the Q (reduction) matrix for the field polynomial (2.16), one
finds that the only nonzero entries in the 12-th and the 14-th columns of this matrix are
$q_{6,11}$, $q_{8,11}$, $q_{9,11}$, $q_{11,11}$, $q_{8,13}$, $q_{10,13}$, $q_{11,13}$, and
$q_{13,13}$. Hence, by replacing these values of $i$ and $q_{k,i}$ in (4.7), one gets (4.13).
It is noted that the realization of (4.13) requires 23 AND and 47 XOR gates and introduces a
propagation delay of $T_A + 7T_X$.
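Equations (4.12) and (4.13) can be cross-checked numerically against an explicit field multiplication. The sketch below is illustrative only: the field polynomial (2.16) is not reproduced in this section, so x^16 + x^5 + x^3 + x^2 + 1 is assumed (it yields the trace vector stated in Section 4.2.2.2), and the function names are hypothetical.

```python
import random

POLY = 0x1002D  # x^16 + x^5 + x^3 + x^2 + 1; ASSUMED stand-in for the field polynomial (2.16)


def gf_mul(a, b):
    """Shift-and-add product of two GF(2^16) elements, reduced modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10000:
            a ^= POLY
    return r


def trace(a):
    """(4.12): Tr(A) = a_11 + a_13, with bit i of the integer holding coordinate a_i."""
    return ((a >> 11) ^ (a >> 13)) & 1


def trace_by_definition(a):
    """Tr(A) = A + A^2 + A^4 + ... + A^(2^15); the result is the field element 0 or 1."""
    t = 0
    for _ in range(16):
        t ^= a
        a = gf_mul(a, a)
    return t


def trace_of_product(a, b):
    """Tr(AB) from the coordinates of A and B via (4.13), without forming AB."""
    A = [(a >> i) & 1 for i in range(16)]
    B = [(b >> i) & 1 for i in range(16)]
    t = 0
    for j in range(0, 12):
        t ^= (A[11 - j] ^ A[13 - j]) & B[j]
    for j in range(12, 14):
        t ^= A[13 - j] & B[j]
    for j in range(7, 10):
        t ^= A[22 - j] & B[j]
    t ^= (A[12] ^ A[15]) & B[10]
    for j in range(11, 14):
        t ^= (A[22 - j] ^ A[25 - j] ^ A[26 - j]) & B[j]
    for j in range(14, 16):
        t ^= (A[22 - j] ^ A[25 - j] ^ A[26 - j] ^ A[29 - j]) & B[j]
    return t


rng = random.Random(1)
for _ in range(500):
    a, b = rng.randrange(1 << 16), rng.randrange(1 << 16)
    assert trace(a) == trace_by_definition(a)
    assert trace_of_product(a, b) == trace(gf_mul(a, b))
```

Each of the 23 products of a coordinate sum with a coordinate b_j in `trace_of_product` corresponds to one AND gate in the count given above.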
4.2.4 Architecture and FSM
4.2.4.1 Architecture of the WG-16 Cipher
Let $e_1 = 2^{11}+1$, $e_2 = 2^{11}+2^{6}+1$, $e_3 = -2^{11}+2^{6}+1$,
$e_4 = 2^{11}+2^{6}-1$, and $s = 2^{5}-1$. Figures 4.12a, 4.12b, and 4.12c present the
proposed architecture of the WG-16 according to the WGP16 formulations in (4.10) and (4.11),
and the linear recurrence (2.15), based on the PB defined by (2.16).
Figure 4.12: a) Architecture of the WG-16. b) Generation of the signal $Y^s$ ($s = 2^5 - 1$).
c) Generation of the signal $(A_{i+31})^{1057}$.
In Figure 4.12a, $\mathrm{Tr}(\cdot)$ generates the trace of a $GF(2^{16})$ element, and
$\mathrm{Tr}(\ast)$ generates the trace of the multiplication of two $GF(2^{16})$ elements.
Figure 4.12b shows the architecture used for generating the signal $Y^s$ ($s = 2^5 - 1$).
Figure 4.12c shows the architecture for generating the signal $(A_{i+31})^{1057}$. The squaring
matrices in the three figures are implemented using the signal reuse constructions of
Table 4.5. In Figures 4.12b and 4.12c, a double-headed arrow points to the location where a
register is inserted for pipelining purposes (see Section 4.2.6.1). In Figure 4.12a, the FSM
controls the components of the cipher during the different phases of operation. This is
accomplished through signals lfsr_clk, ph0, ph1, ctrl0, and ctrl1 (see Section 4.2.4.2 for
details).
During the load phase, the LFSR shifts at each clock cycle while its leftmost cell is loaded
through the Initial Vector input.
It is noted that the signal $Y^{e_2} \oplus Y^{e_4}$ is missing in Figure 4.12a. This is due
to the generation of $\mathrm{Tr}(Y^{e_2} \oplus Y^{e_4})$ directly from $Y^{2^{11}+1}$ and
$\left(Y^{2^{6}} \oplus Y^{2s}\right)$ using (4.13). As a result, the Initial Feedback (WGP16)
signal, which is needed for the initialization phase, does not exist. It is recovered by
generating WGP16 over 3 clock cycles during initialization, as presented in Table 4.6.
ctrl0/ctrl1 | Output: MUX #1 | Output: MUX #2 | Output: MUX #3 | Next State: Register 1 | Next State: Register 2 | Next State: Register 3
0/0 | $Y$ | $Y^{2^{11}}$ | $Y \oplus 1$ | $Y^{e_1}$ | $Y^{e_1} \oplus Y \oplus 1$ | $Y^{2^{6}} \oplus Y^{2s}$
1/0 | $Y^{e_1}$ | $Y^{2^{6}} \oplus Y^{2s}$ | $Y^{e_1} \oplus Y \oplus 1$ | $Y^{e_2} \oplus Y^{e_4}$ | $Y^{e_4} \oplus Y^{e_2} \oplus Y^{e_1} \oplus Y \oplus 1$ | $Y^{2^{6}} \oplus Y^{2s}$
0/1 | $Y^{2^{6}}$ | $Y^{2^{11}s}$ | $Y^{e_4} \oplus Y^{e_2} \oplus Y^{e_1} \oplus Y \oplus 1$ | $Y^{e_3}$ | $Y^{e_4} \oplus Y^{e_3} \oplus Y^{e_2} \oplus Y^{e_1} \oplus Y \oplus 1$ | $Y^{2^{6}} \oplus Y^{2s}$
Table 4.6: Computation of the WGP16 signal over 3 clock cycles.
In this table, the control signals ctrl0 and ctrl1 are generated by the FSM. WGP16 is the next
state of Register 2 in stage 3. Rows are listed in order of computation stages (first to
last). It is noted that the next state of Register 4 in Figure 4.12a is always $Y^s$. During
the initialization phase, the lfsr_clk signal triggers the LFSR every 3 clock cycles. The
leftmost cell is loaded with the result of the field addition of the LFSR feedback and WGP16
(Initial Feedback).
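The three-stage schedule of Table 4.6 can be traced in software. The sketch below is a reading of the table, not the thesis's circuit: it assumes that Register 1 captures the product of the MUX #1 and MUX #2 outputs and that Register 2 captures that product XORed with the MUX #3 output, which is what the table entries suggest. The field polynomial is again assumed to be x^16 + x^5 + x^3 + x^2 + 1, and all names are hypothetical.

```python
import random

POLY = 0x1002D  # x^16 + x^5 + x^3 + x^2 + 1; ASSUMED stand-in for the field polynomial (2.16)


def gf_mul(a, b):
    """Shift-and-add product of two GF(2^16) elements, reduced modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10000:
            a ^= POLY
    return r


def sq(a, k=1):
    """k successive squarings, standing in for the squaring matrix S^k."""
    for _ in range(k):
        a = gf_mul(a, a)
    return a


def gf_pow(a, e):
    """Square-and-multiply exponentiation."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


def wgp16_over_three_cycles(y):
    """Follow the rows of Table 4.6; returns the final next state of Register 2."""
    ys = gf_pow(y, 31)                   # Register 4 always holds Y^s, s = 2^5 - 1
    reg3 = sq(y, 6) ^ sq(ys)             # Register 3: Y^(2^6) xor Y^(2s)
    # Stage 1 (ctrl0/ctrl1 = 0/0): MUX outputs are Y, Y^(2^11), Y xor 1.
    reg1 = gf_mul(y, sq(y, 11))          # Y^e1
    reg2 = reg1 ^ (y ^ 1)
    # Stage 2 (1/0): MUX outputs are Y^e1, Register 3, and the previous Register 2.
    reg1 = gf_mul(reg1, reg3)            # Y^e2 xor Y^e4
    reg2 = reg1 ^ reg2
    # Stage 3 (0/1): MUX outputs are Y^(2^6), Y^(2^11 s), and the previous Register 2.
    reg1 = gf_mul(sq(y, 6), sq(ys, 11))  # Y^e3
    reg2 = reg1 ^ reg2
    return reg2


def wgp16_direct(y):
    """Reference: 1 + Y + Y^e1 + Y^e2 + Y^e3 + Y^e4 with e3 reduced modulo 2^16 - 1."""
    exps = [2**11 + 1, 2**11 + 2**6 + 1, (-2**11 + 2**6 + 1) % (2**16 - 1), 2**11 + 2**6 - 1]
    out = 1 ^ y
    for e in exps:
        out ^= gf_pow(y, e)
    return out


rng = random.Random(2)
for y in [0, 1] + [rng.randrange(1 << 16) for _ in range(200)]:
    assert wgp16_over_three_cycles(y) == wgp16_direct(y)
```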
In the running phase, the LFSR updates its state at each clock cycle. The only feedback is the
LFSR feedback. The keystream bits are obtained by XORing the signals
$\mathrm{Tr}(1 \oplus Y)$, $\mathrm{Tr}(Y^{e_1})$, $\mathrm{Tr}(Y^{e_3})$, and
$\mathrm{Tr}(Y^{e_2} \oplus Y^{e_4})$. $\mathrm{Tr}(1 \oplus Y)$ and $\mathrm{Tr}(Y^{e_1})$ are
produced from $1 \oplus Y$ and $Y^{e_1}$ using (4.12). $Y^{e_1}$ is generated by multiplying
$Y$ with $Y^{2^{11}}$ in $GF(2^{16})$. $Y$ is generated by complementing the least significant
bit of $(A_{i+31})^{1057}$, and $Y^{2^{11}}$ is obtained from the squarer $S^{11}$ operating on
$Y$. $1 \oplus Y$ is simply $(A_{i+31})^{1057}$. $\mathrm{Tr}(Y^{e_3})$ is generated by
applying (4.13) to $Y^{2^{11}(2^{5}-1)}$ and $Y^{2^{6}}$. The signal $Y^{2^{6}}$ is the result
of $S^{6}$ operating on $Y$. The signal $Y^{2^{11}(2^{5}-1)}$ is the result of $S^{11}$
operating on $Y^{2^{5}-1}$. $\mathrm{Tr}(Y^{e_2} \oplus Y^{e_4})$ is generated by applying
(4.13) to $Y^{2^{11}+1}$ and $\left(Y^{2(2^{5}-1)} \oplus Y^{2^{6}}\right)$. The signal
$Y^{2(2^{5}-1)}$ is the result of $S$ operating on $Y^{2^{5}-1}$, while the signal
$Y^{2(2^{5}-1)} \oplus Y^{2^{6}}$ is the bitwise XOR of $Y^{2(2^{5}-1)}$ and $Y^{2^{6}}$.
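The run-phase data path described above can be modelled to confirm that the keystream bit is obtained with 6 general multiplications. This is a sketch under assumptions, not the proposed circuit: the field polynomial is assumed to be x^16 + x^5 + x^3 + x^2 + 1, the function names are hypothetical, and `trace_of_product` uses a compact regrouping of (4.13), namely that a term a_i b_j contributes exactly when i + j is one of the index sums 11, 13, 22, 25, 26, 29 that appear in (4.13).

```python
import random

POLY = 0x1002D  # x^16 + x^5 + x^3 + x^2 + 1; ASSUMED stand-in for the field polynomial (2.16)
INDEX_SUMS = {11, 13, 22, 25, 26, 29}  # values of i + j over all terms a_i b_j in (4.13)


def gf_mul(a, b):
    """Shift-and-add product of two GF(2^16) elements, reduced modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10000:
            a ^= POLY
    return r


def sq(a, k=1):
    """k successive squarings, standing in for the squaring matrix S^k."""
    for _ in range(k):
        a = gf_mul(a, a)
    return a


def gf_pow(a, e):
    """Square-and-multiply exponentiation, used only for the reference value."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


def trace(a):
    """(4.12): Tr(A) = a_11 + a_13."""
    return ((a >> 11) ^ (a >> 13)) & 1


def trace_of_product(a, b):
    """Tr(AB) without forming AB (regrouped form of (4.13))."""
    t = 0
    for i in range(16):
        if (a >> i) & 1:
            for j in range(16):
                if (b >> j) & 1 and (i + j) in INDEX_SUMS:
                    t ^= 1
    return t


def wgt16_bit(x):
    """Keystream bit for LFSR output x; returns (bit, number of general multiplications)."""
    count = 0

    def mul(a, b):
        nonlocal count
        count += 1
        return gf_mul(a, b)

    x1057 = mul(mul(sq(x, 10), x), sq(x, 5))  # 1 xor Y
    y = x1057 ^ 1                             # complement the least significant bit
    t = mul(sq(y, 2), y)
    ys = mul(mul(sq(t), t), sq(y, 4))         # Y^(2^5-1) via (4.11)
    ye1 = mul(sq(y, 11), y)                   # Y^e1
    bit = trace(x1057) ^ trace(ye1)
    bit ^= trace_of_product(sq(ys, 11), sq(y, 6))    # Tr(Y^e3)
    bit ^= trace_of_product(ye1, sq(ys) ^ sq(y, 6))  # Tr(Y^e2 xor Y^e4)
    return bit, count


def wgt16_reference(x):
    """Trace of the six-term sum 1 + Y + Y^e1 + Y^e2 + Y^e3 + Y^e4."""
    y = gf_pow(x, 1057) ^ 1
    exps = [2**11 + 1, 2**11 + 2**6 + 1, (-2**11 + 2**6 + 1) % (2**16 - 1), 2**11 + 2**6 - 1]
    out = 1 ^ y
    for e in exps:
        out ^= gf_pow(y, e)
    return trace(out)


rng = random.Random(3)
for x in [0, 1] + [rng.randrange(1 << 16) for _ in range(200)]:
    assert wgt16_bit(x) == (wgt16_reference(x), 6)
```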
4.2.4.2 The Finite State Machine
The FSM for the PB based WG-16 is similar to the one used for the PB based implementation
of the WG(29, 11) (see Section 4.1.3.2). However, the WG-16's FSM replaces the 11-bit 1-hot
counter with a 5-bit binary counter, and the 2-bit binary counter is clocked after each
complete run of 32 counts of the 5-bit counter. As can be seen from the column of Figure 4.12a
in Table 4.3, the loading phase takes 32 clock cycles. This is followed by the initialization
phase, which lasts 192 clock cycles, where each initialization round is extended to 3 clock
cycles (for computing WGP16) by means of the 3-bit 1-hot counter. During this phase, the LFSR
is clocked 64 times, once every 3 clock cycles, by means of the 3-bit 1-hot counter. The run
phase then starts. The 3-bit counter also controls the multiplexers' selectors, ctrl0 and
ctrl1, during the initialization and run phases.
4.2.5 Serialized Implementation of the PB Based WG-16
4.2.5.1 Architecture of the Serialized WG-16
The serialized computation of the WG-16 transform results in a lower space complexity compared
to the standard design in Figure 4.12a. Figure 4.13 presents the proposed architecture for the
serial WG-16.
In this architecture, $X = A_{i+31}$ and $Y = 1 \oplus X^{1057}$. The WGP16 is computed over
8 cycles (initialization phase) while the WGT16 is computed over 6 cycles (run phase). The
design uses only one field multiplier. The computations are carried out according to Table 4.7.
It is noted that no changes are required for the loading phase as a result of applying the
serial computation. In this architecture, an initialization round takes 8 clock cycles to
generate the WGP16 signal. The LFSR is updated at the 9-th clock cycle. During the run phase, a
stream bit is produced every 6 cycles. During these two phases, the multiplexers provide the
inputs to the multiplier and adder. The multiplexers' inputs are selected through selectors
m0 - m3 during computations. The 4 registers are clocked as specified by the clocking table in
Figure 4.13. The clocking of the different registers is enabled by means of clock enable
signals (see Section 4.2.5.2). In this design, the FSM's signal lfsr_clk clocks the LFSR once
every 1, 9, and 6 clock cycles during the loading, initialization, and run phases,
[Figure 4.13 diagram omitted. Tables embedded in the figure: registers 1, 2, 3, and 4 are
clocked in cycles 1-2, 3-5, 6-8, and 6-8, respectively; selectors m0, m1, m2, and m3 are
enabled in cycles (1, 3, 5, 7), (2, 4, 6), (4, 5, 6, 7), and (6, 7), respectively.]
Figure 4.13: Architecture of the serial implementation for the PB based design of the WG-16.
respectively. This means that the initialization phase takes a total of 9 × 64 = 576 cycles.
Moreover, an output enable signal EO is used to enable the keystream output every 6 cycles
during the run phase. The selectors, clock enables, lfsr_clk, and EO signals are generated
by the FSM, as presented next.
4.2.5.2 FSM for the Serialized WG-16
The FSM for the serialized WG-16 is a modified version of the one in Section 4.1.4.2. The FSM
of the serial WG-16 is obtained by replacing the 11-bit 1-hot counter with a 5-bit binary
counter and the 8-bit 1-hot counter with a 9-bit 1-hot counter. The 2-bit binary counter
generates ph0 and ph1, and is clocked once every 32 counts of the 5-bit binary counter. As
shown in the column of Figure 4.13 in Table 4.3, the initialization phase takes 576 clock
cycles. Each initialization round takes 9 clock cycles. The LFSR is clocked at the arrival of
the 9-th clock cycle by means of the 9-bit 1-hot counter. During the run phase, the LFSR is
clocked once every 6 clock cycles by means of the 6-bit counter. The cipher's output is enabled
once every 6 clock cycles during the run phase, through the 6-bit counter. The clock enable
signals which control the clocking of the registers in Figure 4.13, and the multiplexers'
selectors m0, m1, m2, and m3, are derived from the signal ph1, the outputs of the 9-bit 1-hot
counter (initialization), and the outputs of the 6-bit 1-hot counter (run phase), as shown in
Figure 4.14. In the figure, for the clock enable signals, the number at the output of an OR
gate indicates the enabled clock cycle(s) during the initialization phase. Signal s is shown in
Figure 4.7a, while signals $c_i$, $0 \le i \le 8$, and $d_i$, $0 \le i \le 5$, are the outputs
of the 9-bit and the 6-bit 1-hot counters, respectively.
Next State | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4
Register 1 | $X^{2^{10}+1}$ | $X^{1057}$ | $X^{1057}$ | $X^{1057}$
Register 2 | - | - | $Y^{2^{2}+1}$ | $Y^{\sum_{i=0}^{3} 2^{i}}$
Register 3 | - | - | - | -
Register 4 | - | - | - | -
Next State | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8
Register 1 | $X^{1057}$ | $X^{1057}$ | $X^{1057}$ | $X^{1057}$
Register 2 | $Y^{s}$ | $Y^{s}$ | $Y^{s}$ | $Y^{s}$
Register 3 | - | $Y^{e_1}$ | $Y^{e_2} \oplus Y^{e_4}$ | $Y^{e_3}$
Register 4 | - | $1 \oplus Y \oplus Y^{e_1}$ | $1 \oplus Y \oplus Y^{e_1} \oplus Y^{e_2} \oplus Y^{e_4}$ | $1 \oplus Y \oplus Y^{e_1} \oplus Y^{e_2} \oplus Y^{e_3} \oplus Y^{e_4}$
Table 4.7: Computing WGP16 and WGT16 in the serial implementation of WG-16.
Figure 4.14: Generating the clock enable control signals and the multiplexers' selectors for
the serial version of the WG-16.
4.2.6 Pipelined Implementation of the PB Based WG-16
4.2.6.1 Architecture
A pipelined version of the PB based implementation of the WG-16 is presented in Figures
4.15, 4.12b, and 4.12c. The critical path of this architecture has only one multiplier. This
is accomplished through a pipeline which has 11 stages during the run phase and 13 stages
during the initialization phase. In these figures, the double-headed arrows point to the
locations where the pipeline registers are inserted. The numbers under an arrow specify the
clock cycles in which the corresponding register is triggered during each WGP16 computation
throughout the initialization phase (13 clock cycles for each computation). A zero under an
arrow indicates that the register is enabled during the run phase. The clocking of the
different registers in the transform is controlled by means of clock enable signals (see
Section 4.2.6.2).
Figure 4.15: Pipelined version of the WG-16 transform.
No changes are required for the loading phase. During the initialization and the run phases,
an input signal requires 13 and 11 clock cycles, respectively, to propagate to the output of
the transform/permutation. Therefore, for the initialization phase, the lfsr_clk signal
triggers the LFSR once every 13 cycles. This means that the initialization phase takes a total
of 13 × 64 = 832 clock cycles. Also, the multiplexers' outputs in Figure 4.15 are controlled
through signals ctrl0 and ctrl1 during the initialization and the run phases. For the run
phase, an output enable signal EO is used to enable the keystream output after the first
11 clock cycles. The following section presents the FSM and shows how the different control
signals are derived.
4.2.6.2 FSM for the Pipelined WG-16
The FSM for the pipelined version of the WG-16 is obtained from the one introduced in Section
4.1.5.2, where a 5-bit binary counter and a 13-bit 1-hot counter replace the 11-bit 1-hot
counter and the 12-bit 1-hot counter, respectively. The 2-bit binary counter is clocked once
every time the 5-bit binary counter completes 32 counts, during load and initialization. From
the column of Figure 4.15 in Table 4.3, the loading phase takes 32 clock cycles, followed by
the initialization phase, which lasts 832 clock cycles; the run phase then starts. The 13-bit
1-hot counter expands the initialization phase to a total of 832 clock cycles. At the end of
each computation of the WGP16 (13 clock cycles), the LFSR is shifted once. The WGP16
computations are controlled through the two signals ctrl0 and ctrl1, which are generated by the
13-bit counter (Figure 4.16). Signal ctrl0 is set during clock cycles 8 and 9, while signal
ctrl1 is set during clock cycles 10, 11, and 12. These two signals are held reset throughout
the run phase. Signals ph0 and ph1 select the LFSR's input. These are derived from the output
of the 2-bit binary counter according to Table 4.3. The output of the cipher is enabled
11 clock cycles after the start of the run phase by means of the 13-bit counter. The clock
enable signals are derived from the outputs of the 13-bit counter during initialization and
from the outputs of the 2-bit counter (signal s is shown in Figure 4.9) during the run phase,
as can be seen from Figure 4.16. In this figure, signals $c_i$, $0 \le i \le 12$, and $d_i$,
$0 \le i \le 5$, are the outputs of the 13-bit and the 6-bit 1-hot counters, respectively.
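The timing of ctrl0 and ctrl1 within one WGP16 computation can be written down as a decode of the 13-bit 1-hot counter. The following is a minimal sketch: the mapping of counter output c_k to clock cycle k + 1 is an assumption made here (Figure 4.16 is not reproduced in this section), and the function name is hypothetical.

```python
def pipelined_ctrl(cycle):
    """Return (ctrl0, ctrl1) for clock cycle 1..13 of one WGP16 computation (initialization)."""
    # 13-bit 1-hot counter; ASSUMES output c[k] is hot during clock cycle k + 1.
    c = [int(k == cycle - 1) for k in range(13)]
    ctrl0 = c[7] | c[8]            # set during clock cycles 8 and 9
    ctrl1 = c[9] | c[10] | c[11]   # set during clock cycles 10, 11, and 12
    return ctrl0, ctrl1


schedule = {cycle: pipelined_ctrl(cycle) for cycle in range(1, 14)}
for cycle, (ctrl0, ctrl1) in schedule.items():
    print(cycle, ctrl0, ctrl1)

# One LFSR shift per 13-cycle computation, 64 shifts in the initialization phase.
INIT_CYCLES = 13 * 64
print(INIT_CYCLES)
```

In this decode the two signals are never set in the same cycle, and both are 0 in cycles 1 to 7 and in cycle 13.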
4.3 Implementation Results and Comparisons
This section presents speed and area results based on ASIC implementations for the nine
different proposed designs. The space and speed trade-offs concerning the standard, pipelined,
and serial versions of the proposed PB based WG(29, 11) and WG-16 designs are examined
and compared to their counterparts.
Figure 4.16: Generating the clock enable signals and, ctrl0 and ctrl1 signals for the pipelined
version of the WG-16.
4.3.1 ASIC Implementations
Table 4.8 presents the speed and area readings for the nine proposed WG designs, based on
the ASIC implementations. In this table, GE denotes Gate Equivalence in terms of number of
NAND gates and TP denotes the throughput. The ASIC implementations provide speed and area
results for the 65nm CMOS technology with medium effort for optimizations using Synopsys
Design Vision [4]. The results are based on Design Vision's estimate of area and clock speed
prior to place-and-route. The PB realizations are accomplished using the multiplier presented
in [72] for both the WG(29, 11) and the WG-16. The WG-16 has also been realized using the
Karatsuba multiplier [53]. We use the VHDL implementations presented in [28] for these two
multipliers. The results for the hardware design of the WG(29, 11) which is proposed in [56]
are based on theoretical analysis. In addition, the results for the WG(29, 11) design in [68]
are reported in the previous chapter of this thesis. For the WG-16 which is presented in [35],
the results are reported for post place-and-route.
4.3.2 Results and Comparisons
As shown in Table 4.8, the space complexity of the proposed standard WG(29, 11) is reduced
with respect to the ones previously presented in [68] and the previous chapter, and the
normalized throughput is improved. While the proposed standard WG(29, 11) design shows higher
throughput compared to the one in [68], it reports a slightly lower throughput compared to
the type-II ONB based design presented in the previous chapter. The WG design presented in
[56] requires a number of ROM bits which is exponential in m (the dimension of the binary
extension field). For the WG(29, 11), this realization requires $2^{29}$ bits of ROM in
addition to 9000 XORs and 319 registers, as can be seen from Table 4.8. On the other hand, the
space complexities of the proposed designs are based on the area of the multiplier, which is
quadratic in m. For high speed applications, the throughput reported in Table 4.8 for the
proposed pipelined version of the PB based WG(29, 11) design is almost 4.5 times that of the
proposed standard one. This comes at the expense of an almost 23% increase in the space
complexity. On the other hand, for area constrained applications, the serial version shows up
to a 59% decrease in the space complexity compared to the standard design, according to the
results in Table 4.8. This comes at the expense of halving the throughput.
In Table 4.8, the Karatsuba based PB implementations of the standard, pipelined, and serial
WG-16 show better readings for throughput, space, and normalized throughput than the same
realizations using the multiplier in [72]. In the same table, in comparison with the pipelined
WG-16 implementations presented in [35], the proposed pipelined PB based WG-16 demonstrates
almost 2.5 times the throughput with even less space complexity. In addition, for low area
applications, the serial version shows up to a 42% decrease in the space complexity compared to
the standard design. This comes at the expense of an around 40% decrease in throughput. On the
other hand, for high speed requirements, the pipelined version of the PB based WG-16 design
increases the throughput by almost 7 times compared to the standard one. This comes at the
expense of an almost 33% increase in space complexity.
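The ratios quoted above can be recomputed from the GE and TP columns of Table 4.8. The sketch below is illustrative: the dictionary keys and helper names are chosen here, and the pairing of rows with each quoted percentage (for example, the Karatsuba rows for the 7 times and 33% figures) is a reading of the table rather than something the text states explicitly.

```python
# (GE, throughput in Mbps) copied from Table 4.8; key names are illustrative.
DESIGNS = {
    "wg29_standard": (17165, 202),
    "wg29_serial": (7050, 101),
    "wg29_pipelined": (21190, 917),
    "wg16_ref35_m16_i8": (12031, 552),
    "wg16_standard_72": (9103, 189),
    "wg16_serial_72": (5267, 113),
    "wg16_standard_53": (8060, 193),
    "wg16_pipelined_53": (10681, 1370),
}


def normalized_throughput(name):
    """Throughput per gate equivalent, in Kbps/GE."""
    ge, tp_mbps = DESIGNS[name]
    return 1000.0 * tp_mbps / ge


def speedup(new, old):
    """Ratio of throughputs of two designs."""
    return DESIGNS[new][1] / DESIGNS[old][1]


def area_change(new, old):
    """Relative change in GE of `new` with respect to `old` (negative means smaller)."""
    return DESIGNS[new][0] / DESIGNS[old][0] - 1.0


print(round(speedup("wg29_pipelined", "wg29_standard"), 2))        # about 4.5
print(round(area_change("wg29_pipelined", "wg29_standard"), 3))    # about +23%
print(round(area_change("wg29_serial", "wg29_standard"), 3))       # about -59%
print(round(speedup("wg16_pipelined_53", "wg16_ref35_m16_i8"), 2)) # about 2.5
print(round(area_change("wg16_serial_72", "wg16_standard_72"), 3)) # about -42%
print(round(speedup("wg16_pipelined_53", "wg16_standard_53"), 2))  # about 7
print(round(area_change("wg16_pipelined_53", "wg16_standard_53"), 3))  # about +33%
print(round(normalized_throughput("wg29_pipelined"), 2))
```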
Moreover, for WG-16, which is proposed by the authors of [34] to overcome the security
flaws in the LTE integrity protocols [90], the reported results of the proposed design in
Table 4.8 clearly show that the different realizations offer bit rates greater than 100 Mbps
and, hence, satisfy the LTE peak bit rate requirements [49]. Although SNOW 3G [54] and ZUC
[10] show better normalized throughput readings compared to our WG-16 designs in Table 4.8,
the reported space complexities for the proposed WG-16 (especially the serial instances) are
competitive with SNOW 3G and ZUC. Hence, WG-16 is an interesting low-area candidate for
the 4G domain. Table 4.8 also lists the 1-bit output versions of Grain and Trivium, which show
better performance compared to the proposed designs of the WG-16. On the other hand, our
pipelined version of the WG-16 has higher throughput and normalized throughput, while our
serial WG-16 instance shows a very close area complexity, compared to Mickey128.
If even higher throughput is demanded, one can apply the unfolding technique which is
presented in [26] to the proposed pipelined WG(29, 11) and WG-16. In this technique, by
implementing multiple transforms, the throughput increases proportionally. Digit-level
field multipliers [66] can be considered if lower area is demanded; however, this comes at the
expense of adding more cycles for each multiplication.
4.4 Conclusion
This chapter proposed, for the first time, new architectures for efficient computation of the
WG stream ciphers using polynomial basis. The proposed architectures require fewer
multiplication operations as compared to the WG counterparts. Moreover, an area-efficient
method for the direct computation of the trace of the multiplication of two $GF(2^m)$ elements
has been derived. Unlike the trace method presented in the previous chapter, which applies only
to type-II ONB, the trace method proposed in this chapter applies to any PB. Based on the
proposed trace properties, two classes of PB based designs (standard architecture) have been
proposed, one for the WG(29, 11) stream cipher and the other one for the WG-16 stream cipher.
In addition, a serialized version and a pipelined version have been proposed for each of the
proposed standard designs.
Nine different proposed designs have been realized through ASIC implementations using
the 65nm CMOS technology. The ASIC implementations show that the proposed PB based
WG(29, 11) design achieves better area and normalized throughput results compared to all
WG(29, 11) counterparts which use NB. Also, it has been shown that the proposed pipelined
PB based WG-16 provides almost double the throughput offered by the implementations
presented in [35], at an even smaller area. In addition, the throughput readings reported for
the different designs of the WG-16 stream cipher meet the requirements for the peak bit rate
specifications of the 4G mobile technology.
Based on these results, the proposed WG(29, 11) and WG-16 designs using PB are competitive
candidates, compared to the previously proposed implementations, for securing mobile and
communication systems [23, 11, 7]. Specifically, the proposed WG-16 designs are promising
for 4G communications, where the guaranteed randomness properties and security aspects
are of significant importance.
Implementation | Basis | WG Transform Architecture | Multiplier | Technology | GE | Speed (MHz) | TP (Mbps) | Normalized Throughput (Kbps/Gate)
SNOW3G [9] | - | - | - | 90nm | 34000 | - | 1900 | 55.88
SNOW3G [54] | - | - | - | 130nm | 25016 | 249 | 7900 | 315.97
ZUC [10] | - | - | - | 65nm | 10000 | - | 1500 | 150
Grain128 (1-bit output version) [41] | - | - | - | 130nm | 1857 | 926 | 926 | 499
Trivium (1-bit output version) [41] | - | - | - | 130nm | 2599 | 358 | 358 | 138
Mickey128 [41] | - | - | - | 130nm | 5039 | 413 | 413 | 82
WG(29,11) [68] | ONB | Standard | [71] | 65nm | 33200 | 144 | 144 | 4.34
WG(29,11) [56] | - | Look-up Table (ROM) | - | - | 319 Registers + 9000 XORs + $2^{29}$ ROM bits | - | - | -
WG(29,11) [31] | ONB | Standard | [71] | 65nm | 19900 | 224 | 224 | 11.26
WG(29,11) (This work, Figure 4.3) | PB | Standard | [72] | 65nm | 17165 | 202 | 202 | 11.77
WG(29,11) (This work, Figure 4.6) | PB | Serialized | [72] | 65nm | 7050 | 610 | 101 | 14.32
WG(29,11) (This work, Figure 4.8) | PB | Pipelined | [72] | 65nm | 21190 | 917 | 917 | 43.28
WG-16 [35] | NB | Pipelined (M16/I8) | - | 65nm | 12031 | 552 | 552 | 45.88
WG-16 [35] | NB | Pipelined (M8/I8) | - | 65nm | 12352 | 558 | 558 | 45.17
WG-16 (This work, Figure 4.12a) | PB | Standard | [72] | 65nm | 9103 | 189 | 189 | 20.76
WG-16 (This work, Figure 4.12a) | PB | Standard | [53] | 65nm | 8060 | 193 | 193 | 23.94
WG-16 (This work, Figure 4.15) | PB | Pipelined | [72] | 65nm | 11795 | 1149 | 1149 | 97.41
WG-16 (This work, Figure 4.15) | PB | Pipelined | [53] | 65nm | 10681 | 1370 | 1370 | 128.26
WG-16 (This work, Figure 4.13) | PB | Serialized | [72] | 65nm | 5267 | 680 | 113 | 21.45
WG-16 (This work, Figure 4.13) | PB | Serialized | [53] | 65nm | 5026 | 714 | 119 | 23.67
Table 4.8: Results obtained for area and speed from the ASIC implementations.
Chapter 5
Digit-Level Architectures for $GF(2^m)$ Multiplication in the GNB
This chapter focuses on field multiplication based on the GNB representation for binary
extension fields of odd values of m. This includes the five fields recommended by NIST for the
elliptic curve digital signature algorithm (ECDSA) [12]. For clarity of reference, in what
follows, the multiplication of two field elements is referred to as single multiplication, while the
multiplication of two or more elements is denoted by hybrid multiplication.
This chapter proposes three new digit-level architectures for the single GNB multiplication,
which follow different input/output order schemes. Two new digit-level architectures for
the FSIPO single GNB multiplication are proposed: one follows an MSD order of its inputs
while the other follows an LSD order. It is worth mentioning that an FSIPO multiplier does not
require any preloading of the operands, which is not the case for the other input schemes (see
Chapter 2). This makes the FSIPO multipliers advantageous for achieving high throughput in
applications where the data path capacity for preloading the inputs is small and m is large. Also,
an area efficient version of the MSD DL-PISO single GNB multiplier, which was originally
presented in [70], is proposed.
In addition to the above three single multipliers, a new DL-SIPO hybrid-double GNB multiplier
and, for the first time in the literature, a DL-PIPO hybrid-triple GNB multiplier are proposed
by combining the proposed DL-PISO and DL-FSIPO single multipliers. The proposed
digit-level hybrid-triple multiplication scheme accomplishes three field multiplications using the
latency required for a single digit-level multiplication, at the expense of more area.
Furthermore, based on the new hybrid-triple GNB multiplier, a digit-level eight-ary
field exponentiation architecture is proposed. Compared to the existing digit-level eight-ary
schemes [83, 42], the proposed architecture offers almost the same latency while it does not
require any precomputation or storage of the field element's odd powers which are less than 8.
The following summarizes the contributions of this chapter.
Contributions
The contributions of this chapter are summarized in Figure 5.1. In this chapter, seven new
digit-level architectures are proposed for the $GF(2^m)$ single, hybrid-double, and hybrid-triple
multiplication, in addition to a new digit-level architecture for the $GF(2^m)$ eight-ary field
exponentiation (see Figure 5.1), based on the GNB representation when m is odd. The contributions
of this chapter are explained as follows:
[Figure 5.1, block diagram. Blocks: MSD/LSD DL-FSIPO Single GNB Multipliers (Figures 5.2a and 5.3); Area Efficient MSD DL-PISO Single GNB Multiplier (Figure 5.4a); Low Area / High Speed MSD DL-SIPO Hybrid-Double GNB Multipliers (Figures 5.5a and 5.5b); Low Area / High Speed DL-PIPO Hybrid-Triple GNB Multipliers (Figures 5.6a and 5.6b); Eight-ary Field Exponentiation Architecture (Figure 5.7).]
Figure 5.1: Summary of contributions.
• Two new architectures are proposed for MSD/LSD DL-FSIPO single GNB multipliers
(Figures 5.2a and 5.3). It is noted that these multipliers do not require preloading of
inputs. Therefore, they are advantageous to achieve high throughput in applications where
the parallel preloading of the inputs is not possible due to limited sizes of the data path,
especially when m is large. For the single bit digit size case, one obtains bit-level
versions of the proposed MSD/LSD DL-FSIPO single GNB multipliers. It is noted that Feng
[36] proposed the original most significant bit (MSB), bit-level (BL), FSIPO NB
multiplication scheme. However, the MSB version which is obtained from the proposed MSD
DL-FSIPO single GNB multiplier is based on a slightly modified formulation compared
to the one in [36]. Also, while no space and/or time complexity formulations are
presented in [36], the formulations for the space and time complexities of the proposed
MSD/LSD DL-FSIPO single GNB multiplication architectures are derived. For $GF(2^5)$,
the bit-level versions of the proposed FSIPO multipliers (based on the type-2 GNB)
require smaller space and time complexities compared to the $GF(2^5)$ multiplier which is
presented in [36]. Moreover, this work proposes reduction of the number of XOR gates
through applying sub-expression sharing techniques to the multiplication by $\beta$.
• An area efficient MSD DL-PISO single GNB multiplier is proposed (Figure 5.4a), where
the number of XOR gates of the original MSD DL-PISO GNB multiplier in [70] is
reduced based on applying the sub-expression sharing presented in [17].
• Low area/high speed designs for an MSD DL-SIPO hybrid-double GNB multiplier are
proposed (Figure 5.5), constructed by combining the proposed MSD DL-FSIPO and
DL-PISO single GNB multipliers (see Figure 5.1). It is noted that the proposed hybrid-double
GNB multiplier is the first DL-SIPO scheme proposed for the hybrid-double GNB-based
multiplication, while the one presented in [16] follows a DL-PIPO scheme. This in turn
allows for proposing a digit-level hybrid-triple multiplier, as stated next.
• Low area/high speed designs for a DL-PIPO hybrid-triple GNB multiplier are proposed
(Figure 5.6). As shown in Figure 5.1, the proposed DL-PIPO hybrid-triple GNB
multipliers are constructed by combining the proposed MSD DL-PISO single and the MSD
DL-SIPO hybrid-double GNB multipliers. It is noted that, as far as the author knows, the
proposed digit-level PIPO hybrid-triple GNB multipliers are the first such multipliers, in
the open literature, which perform three digit-level field multiplications using the latency
of only one multiplication, at the expense of more area.
• Finally, a digit-level architecture which accomplishes field exponentiation based on
radix-8 representation of the exponent is proposed (Figure 5.7). The proposed scheme
has almost the same latency as that offered by the existing digit-level eight-ary
exponentiation schemes [83, 42]; however, it does not require any precomputation or storage
of the field element's odd powers which are less than 8.
The chapter is organized as follows. Section 5.1 presents the proposed MSD/LSD
DL-FSIPO single GNB multiplication schemes. Section 5.2 explains the proposed MSD DL-PISO
single GNB multiplier. Section 5.3 presents the proposed MSD DL-SIPO hybrid-double and
the DL-PIPO hybrid-triple GNB multiplication schemes. Section 5.4 introduces the new
digit-level eight-ary field exponentiation architecture. Section 5.5 concludes the chapter.
5.1 Proposed DL-FSIPO Single GNB Multipliers
The following starts by presenting the proposed MSD DL-FSIPO single GNB multiplier,
followed by the LSD one. In addition to their proofs, the proposed digit-level formulations
presented in this section, i.e. (5.1), (5.2), (5.3), and (5.4), have been verified through simulations
using the Sage tool [3]. It is noted that the proposed multipliers in this section do not require
preloading of inputs, and perform the multiplication operation as the input digits enter the
multiplier. This is advantageous, especially for large m (> 160 in [12]), to achieve high throughput
in applications where the parallel preloading of the inputs is not possible due to limited sizes
of the data path.
5.1.1 Proposed MSD DL-FSIPO Single GNB Multiplier
In this section, a digit-level MSD architecture is proposed for the FSIPO single GNB
multiplication. In what follows, the formulations for the MSD DL-FSIPO single multiplication in the
GNB are first derived, followed by the proposed architecture of the MSD DL-FSIPO
single GNB multiplier; the section ends by analyzing the space and time complexities.
5.1.1.1 Formulations
This section derives formulations for digit-level multiplication of two $GF(2^m)$ elements
represented in the GNB, where the two inputs of the multiplier are entered serially, digit-by-digit,
in an MSD first order. In what follows, the proposed MSD first recursive construction of field
elements when represented in the GNB is shown.
Lemma 5.1.1 Given a digit size $0 < d < m$, a field element $A = (a_0, \ldots, a_{m-1}) \in GF(2^m)$
represented in the GNB is constructed recursively, starting from the most significant digit $A_{k-1}$
(total of $k = \lceil m/d \rceil$ digits $A_0$ through $A_{k-1}$), as follows:
\[
A^{(i)} = A_{k-1-i} + \left(A^{(i-1)}\right)^{2^d} \tag{5.1}
\]
where $i$ takes values from 0 up to $k-1$, $A^{(-1)} = 0$, $A = A^{(k-1)}$, and
$A_{k-1-i} = \sum_{j=0}^{d-1} a_{d(k-1-i)+j}\,\beta^{2^j}$ is the $(k-1-i)$-th digit of
$A = (A_0, \ldots, A_{k-1})$ with $a_{d(k-1-i)+j} = 0$ for $d(k-1-i)+j \ge m$.
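Proof By substituting for $i = 0, \ldots, k-1$ in (5.1), one gets
\[
A^{(k-1)} = A_0 + \left(A_1 + \cdots \left(A_{k-2} + \left(A_{k-1}\right)^{2^d}\right)^{2^d} \cdots \right)^{2^d}
= \sum_{i=k-1}^{0} A_{k-1-i}^{\,2^{d(k-1-i)}},
\]
and by noticing that $A_{k-1-i} = \sum_{j=0}^{d-1} a_{d(k-1-i)+j}\,\beta^{2^j}$ one obtains
\[
\begin{aligned}
A^{(k-1)} &= \sum_{i=k-1}^{0} \sum_{j=0}^{d-1} a_{j+d(k-1-i)}\,\beta^{2^{j+d(k-1-i)}} \\
&= \sum_{j=0}^{d-1} a_j\,\beta^{2^j} + \sum_{j=0}^{d-1} a_{j+d}\,\beta^{2^{j+d}} + \cdots + \sum_{j=0}^{d-1} a_{j+d(k-1)}\,\beta^{2^{j+d(k-1)}} \\
&= \sum_{j=0}^{m-1} a_j\,\beta^{2^j},
\end{aligned}
\]
where the last result is achieved since $a_{j+d(k-1)} = 0$ for $j + d(k-1) \ge m$.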
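The construction in (5.1) can be checked numerically on coordinate vectors. The following is a minimal Python sketch, not taken from the thesis (which reports Sage simulations). It assumes that an element is stored as its list of normal-basis coordinates $(a_0, \ldots, a_{m-1})$, so that raising to the power $2^e$ is a cyclic shift by $e$ positions; the helper names `frobenius` and `msd_construct` are hypothetical.

```python
import random

def frobenius(coeffs, e):
    """Raise a normal-basis element to the power 2^e (cyclic shift by e).

    Assumption: coeffs[j] is the coordinate on beta^(2^j), so the coordinate
    at index j moves to index (j + e) mod m.
    """
    m = len(coeffs)
    out = [0] * m
    for j, c in enumerate(coeffs):
        out[(j + e) % m] = c
    return out

def msd_construct(a, d):
    """Rebuild A from its digits, most significant digit first, as in (5.1)."""
    m = len(a)
    k = -(-m // d)                    # k = ceil(m / d)
    state = [0] * m                   # A^(-1) = 0
    for i in range(k):
        base = d * (k - 1 - i)
        # Digit A_{k-1-i}: bit a_{base+j} sits on beta^(2^j); indices >= m are zero.
        digit = [a[base + j] if j < d and base + j < m else 0 for j in range(m)]
        # A^(i) = A_{k-1-i} + (A^(i-1))^(2^d); field addition is a bitwise XOR.
        state = [x ^ y for x, y in zip(digit, frobenius(state, d))]
    return state

if __name__ == "__main__":
    random.seed(1)
    for m in range(2, 12):
        for d in range(1, m):
            a = [random.randint(0, 1) for _ in range(m)]
            assert msd_construct(a, d) == a
    print("construction (5.1) reproduces A for all tested m and d")
```

The check covers digit sizes that do not divide m, which is the case where the zero padding of the most significant digit matters.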
Then, the multiplication of the $GF(2^m)$ elements A and B is obtained as follows.
Proposition 5.1.2 Let $E = AB$ be the multiplication of the two elements $A, B \in GF(2^m)$
represented in the GNB. By using construction (5.1), one obtains $E = A^{(k-1)}B^{(k-1)}$, where
$k = \lceil m/d \rceil$ and d is the digit size, by the following recurrence starting at $i = 0$ up to $k-1$
\[
A^{(i)}B^{(i)} = \sum_{j=0}^{d-1} \left( \left[ a_{d(k-1-i)+j} \left( B_{k-1-i} + \left(B^{(i-1)}\right)^{2^d} \right) + b_{d(k-1-i)+j} \left(A^{(i-1)}\right)^{2^d} \right]^{2^{-j}} \beta \right)^{2^j} + \left( A^{(i-1)}B^{(i-1)} \right)^{2^d}. \tag{5.2}
\]
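Proof $A^{(i)}B^{(i)}$ is obtained by substituting for $A^{(i)}$ and $B^{(i)}$ in $A^{(i)}B^{(i)}$, using (5.1), as
\[
\begin{aligned}
A^{(i)}B^{(i)} &= \left[ A_{k-1-i} + \left(A^{(i-1)}\right)^{2^d} \right] \left[ B_{k-1-i} + \left(B^{(i-1)}\right)^{2^d} \right] \\
&= A_{k-1-i} \left[ B_{k-1-i} + \left(B^{(i-1)}\right)^{2^d} \right] + B_{k-1-i} \left(A^{(i-1)}\right)^{2^d} + \left( A^{(i-1)}B^{(i-1)} \right)^{2^d},
\end{aligned}
\]
and by substituting for $A_{k-1-i} = \sum_{j=0}^{d-1} a_{d(k-1-i)+j}\,\beta^{2^j}$ in
$A_{k-1-i} \left[ B_{k-1-i} + (B^{(i-1)})^{2^d} \right]$, and for
$B_{k-1-i} = \sum_{j=0}^{d-1} b_{d(k-1-i)+j}\,\beta^{2^j}$ in $B_{k-1-i} (A^{(i-1)})^{2^d}$, the following is obtained
\[
A^{(i)}B^{(i)} = \sum_{j=0}^{d-1} a_{d(k-1-i)+j}\,\beta^{2^j} \left[ B_{k-1-i} + \left(B^{(i-1)}\right)^{2^d} \right] + \sum_{j=0}^{d-1} b_{d(k-1-i)+j}\,\beta^{2^j} \left(A^{(i-1)}\right)^{2^d} + \left( A^{(i-1)}B^{(i-1)} \right)^{2^d}
\]
which yields
\[
A^{(i)}B^{(i)} = \sum_{j=0}^{d-1} \left( \left[ a_{d(k-1-i)+j} \left( B_{k-1-i} + \left(B^{(i-1)}\right)^{2^d} \right) + b_{d(k-1-i)+j} \left(A^{(i-1)}\right)^{2^d} \right]^{2^{-j}} \beta \right)^{2^j} + \left( A^{(i-1)}B^{(i-1)} \right)^{2^d}.
\]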
It is noted that the correctness of (5.2) has also been verified using simulations with the Sage
tool [3].
In (5.2), the multiplication of A by B (elements of $GF(2^m)$) represented in the GNB is
reduced recursively to a number of bit-wise AND operations, field additions, multiplications
with the normal element $\beta$, and cyclic shifts for computing the powers $2^{-j}$, $2^{j}$, and $2^{d}$. Notice
that the addition of the digit $B_{k-1-i}$ to $(B^{(i-1)})^{2^d}$ in (5.2) is a free of cost concatenation. This is
because the most significant digit of $B^{(i-1)}$ is $0^d$ for $0 \le i < k$, where $0^d$ denotes a string of zeros
of length d.
Since it is already given that $A^{(-1)} = B^{(-1)} = 0$, by using (5.2), starting at $i = 0$ and
proceeding up to $i = k-1$ (k clock cycles), the final result of the multiplication $E = A^{(k-1)}B^{(k-1)}$
is obtained. At each step, the $(k-1-i)$-th digits of A and B, i.e. $A_{k-1-i}$ and $B_{k-1-i}$, in addition to
$A^{(i-1)}$, $B^{(i-1)}$, and $A^{(i-1)}B^{(i-1)}$, are used for computing $A^{(i)}$ and $B^{(i)}$, and $A^{(i)}B^{(i)}$ according to (5.1)
and (5.2), respectively. The following example illustrates the proposed multiplication scheme.
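Example 5.1.3 Table 5.1 shows the steps of multiplying the $GF(2^3)$ elements
$A = B = \beta^{2^2} = (0, 0, 1)$, which are represented in the type-2 GNB
$\{\beta, \beta^{2}, \beta^{2^2}\}$ (that is, optimal normal basis type-2), according to (5.1) and (5.2),
for the case of d = 2 (i.e., k = 2). Note that $k = 2$ and $a_3 = b_3 = 0$. Also, (5.2) is rewritten as
\[
A^{(i)}B^{(i)} = Z_{1-i}\,\beta + \left( W_{1-i}^{\,2^{-1}} \beta \right)^{2} + \left( A^{(i-1)}B^{(i-1)} \right)^{2^2},
\]
where $Z_{1-i} = a_{2-2i}X_{1-i} + b_{2-2i}Y_{1-i}$, $W_{1-i} = a_{3-2i}X_{1-i} + b_{3-2i}Y_{1-i}$,
$X_{1-i} = B_{1-i} + (B^{(i-1)})^{2^2}$, and $Y_{1-i} = (A^{(i-1)})^{2^2}$, for $0 \le i < 2$.

i | $A_{1-i} = a_{2-2i}\beta + a_{3-2i}\beta^2$ | $B_{1-i} = b_{2-2i}\beta + b_{3-2i}\beta^2$ | $A^{(i-1)}$ | $B^{(i-1)}$
0 | $A_1 = a_2\beta + a_3\beta^2 = \beta$ | $B_1 = b_2\beta + b_3\beta^2 = \beta$ | $A^{(-1)} = 0$ | $B^{(-1)} = 0$
1 | $A_0 = a_0\beta + a_1\beta^2 = 0$ | $B_0 = b_0\beta + b_1\beta^2 = 0$ | $A^{(0)} = A_1 + (A^{(-1)})^{2^2} = \beta$ | $B^{(0)} = B_1 + (B^{(-1)})^{2^2} = \beta$

i | $X_{1-i} = B_{1-i} + (B^{(i-1)})^{2^2}$ | $Y_{1-i} = (A^{(i-1)})^{2^2}$ | $Z_{1-i} = a_{2-2i}X_{1-i} + b_{2-2i}Y_{1-i}$ | $W_{1-i} = a_{3-2i}X_{1-i} + b_{3-2i}Y_{1-i}$
0 | $X_1 = B_1 + (B^{(-1)})^{2^2} = \beta$ | $Y_1 = (A^{(-1)})^{2^2} = 0$ | $Z_1 = a_2X_1 + b_2Y_1 = \beta$ | $W_1 = a_3X_1 + b_3Y_1 = 0$
1 | $X_0 = B_0 + (B^{(0)})^{2^2} = \beta^{2^2}$ | $Y_0 = (A^{(0)})^{2^2} = \beta^{2^2}$ | $Z_0 = a_0X_0 + b_0Y_0 = 0$ | $W_0 = a_1X_0 + b_1Y_0 = 0$

i | $A^{(i-1)}B^{(i-1)}$ | $A^{(i)}B^{(i)} = Z_{1-i}\beta + (W_{1-i}^{2^{-1}}\beta)^2 + (A^{(i-1)}B^{(i-1)})^{2^2}$
0 | $A^{(-1)}B^{(-1)} = 0$ | $A^{(0)}B^{(0)} = Z_1\beta + (W_1^{2^{-1}}\beta)^2 + (A^{(-1)}B^{(-1)})^{2^2} = \beta^2$
1 | $A^{(0)}B^{(0)} = \beta^2$ | $A^{(1)}B^{(1)} = Z_0\beta + (W_0^{2^{-1}}\beta)^2 + (A^{(0)}B^{(0)})^{2^2} = \beta^{2^3} = \beta$
Table 5.1: Steps for multiplication of the two $GF(2^3)$ elements $A = B = \beta^{2^2} = (0, 0, 1)$.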
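The recurrence (5.2) can also be exercised in software. The following Python sketch is offered under stated assumptions and is not the thesis's Sage script. It models $GF(2^3)$ with the irreducible polynomial $x^3 + x^2 + 1$, finds a normal element by exhaustive search (any normal basis satisfies the algebra of (5.2), so the sketch does not rely on GNB-specific structure), and evaluates (5.2) with field-level operations. All function names are hypothetical.

```python
M = 3                 # field degree, chosen to match Example 5.1.3
POLY = 0b1101         # x^3 + x^2 + 1, irreducible over GF(2)

def gf_mul(x, y):
    """Multiply two GF(2^M) elements stored as polynomial-basis bit masks."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x >> M:
            x ^= POLY
    return r

def frob(x, e):
    """Return x^(2^e); a negative e is reduced mod M because x^(2^M) = x."""
    for _ in range(e % M):
        x = gf_mul(x, x)
    return x

def gf2_rank(vectors):
    """Rank over GF(2) of a list of bit-mask vectors (XOR basis)."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)
        if v:
            basis.append(v)
    return len(basis)

def normal_basis():
    """Conjugates [beta, beta^2, ...] of the first normal element found."""
    for beta in range(1, 1 << M):
        conj = [frob(beta, j) for j in range(M)]
        if gf2_rank(conj) == M:
            return conj
    raise ValueError("no normal element found")

def from_coords(bits, conj):
    """Field element with normal-basis coordinates bits[j] on beta^(2^j)."""
    x = 0
    for bit, c in zip(bits, conj):
        if bit:
            x ^= c
    return x

def msd_fsipo_multiply(a, b, d, conj):
    """Evaluate recurrence (5.2), consuming one digit of A and B per step."""
    m = len(a)
    k = -(-m // d)                                  # k = ceil(m / d)
    beta = conj[0]
    A_prev = B_prev = AB_prev = 0                   # A^(-1), B^(-1), A^(-1)B^(-1)
    for i in range(k):
        base = d * (k - 1 - i)
        a_bits = [a[base + j] if base + j < m else 0 for j in range(d)]
        b_bits = [b[base + j] if base + j < m else 0 for j in range(d)]
        X = from_coords(b_bits, conj) ^ frob(B_prev, d)   # B_{k-1-i} + (B^(i-1))^(2^d)
        Y = frob(A_prev, d)                               # (A^(i-1))^(2^d)
        acc = frob(AB_prev, d)                            # (A^(i-1)B^(i-1))^(2^d)
        for j in range(d):
            t = (X if a_bits[j] else 0) ^ (Y if b_bits[j] else 0)
            acc ^= frob(gf_mul(frob(t, -j), beta), j)     # ((t)^(2^-j) * beta)^(2^j)
        A_prev = from_coords(a_bits, conj) ^ Y            # A^(i), from (5.1)
        B_prev = X                                        # B^(i), from (5.1)
        AB_prev = acc
    return AB_prev

if __name__ == "__main__":
    conj = normal_basis()
    # Example 5.1.3: A = B = beta^(2^2) = (0, 0, 1), d = 2; the product is beta.
    print(msd_fsipo_multiply([0, 0, 1], [0, 0, 1], 2, conj) == conj[0])
```

Because the sketch works on field elements rather than on coordinate vectors, it checks the algebra of the recurrence only; it does not model the gate-level multiplication by $\beta$.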
Next, the proposed architecture of the MSD DL-FSIPO single GNB multiplier is presented.
5.1.1.2 Architecture
Figure 5.2a presents the architecture of the proposed MSD DL-FSIPO single GNB multiplier.
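[Figure 5.2, block diagrams. Panel (a): input digits $A_{k-1}, \ldots, A_{k-1-i}, \ldots, A_0$ and $B_{k-1}, \ldots, B_{k-1-i}, \ldots, B_0$ of width d; shift registers ⟨X⟩ and ⟨Y⟩ (bits 0 to m-d-1); d modules indexed 0 to d-1, each with inputs in1 (m bits), in2 (1 bit), in3 (1 bit), and in4 (m-d bits); register ⟨Z⟩ (bits 0 to m-1). Panels (b) and (c): internal structure of the modules used in panel (a).]
Figure 5.2: (a) Architecture of the proposed MSD DL-FSIPO single GNB multiplier. (b)
Architecture of $r_j$. (c) Architecture of the multiply-by-$\beta$ block used inside $r_j$.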
This architecture is constructed based on (5.1) and (5.2). In Figure 5.2a, d denotes the digit
size, $k = \lceil m/d \rceil$ denotes the total number of cycles of computations, and $0 \le i < k$ refers to the i-th
clock cycle. Part (b) of the same figure presents the architecture of $r_j$ which is used in Figure
5.2a, $0 \le j < d$, where for $0 \le i < k$: in1 $= B_{k-1-i} + (B^{(i-1)})^{2^d}$, in2 $= a_{d(k-1-i)+j}$, in3 $= b_{d(k-1-i)+j}$,
and in4 $= (A^{(i-1)})^{2^d}$, in agreement with (5.2). Also, part (c) shows the architecture of the
multiply-by-$\beta$ block which is used in Figure 5.2b, $0 \le j < d$.
Initially, the $(m-d)$-bit shift registers ⟨X⟩ and ⟨Y⟩, and the m-bit register ⟨Z⟩, are cleared
(i.e., initialized by $A^{(-1)}$, $B^{(-1)}$, and $A^{(-1)}B^{(-1)}$, respectively). Then, at each i-th iteration of the
following k iterations, ⟨X⟩, ⟨Y⟩, and ⟨Z⟩ update their states from $A^{(i-1)}$, $B^{(i-1)}$, and $A^{(i-1)}B^{(i-1)}$
to $A^{(i)}$, $B^{(i)}$, and $A^{(i)}B^{(i)}$, respectively, as follows. The two $GF(2^m)$ input elements A and B are
entered to registers ⟨X⟩ and ⟨Y⟩, respectively, one digit per clock cycle, following a most
significant digit first order starting with the $(k-1)$-th digits (according to (5.1)). At the i-th
iteration, ⟨X⟩ and ⟨Y⟩ perform a d-fold right shift (not cyclic) and the $(k-1-i)$-th digits of A
and B are written to the least significant d bits of ⟨X⟩ and ⟨Y⟩, respectively. At the same time,
register ⟨Z⟩ accumulates the result of the field addition $\sum_{j=0}^{d-1} r_j + (A^{(i-1)}B^{(i-1)})^{2^d}$, where
\[
r_j = \left( \left[ a_{d(k-1-i)+j} \left( B_{k-1-i} + \left(B^{(i-1)}\right)^{2^d} \right) + b_{d(k-1-i)+j} \left(A^{(i-1)}\right)^{2^d} \right]^{2^{-j}} \beta \right)^{2^j}
\]
is generated as shown in Figures 5.2b and 5.2c. According to (5.2), this results in writing $A^{(i)}B^{(i)}$
to ⟨Z⟩. Then, after k clock cycles, i.e. $i = k-1$, one obtains ⟨Z⟩ $= A^{(k-1)}B^{(k-1)} = AB$. It is
noted that the proposed architecture implements $B_{k-1-i} + (B^{(i-1)})^{2^d}$ in (5.2) by concatenating the
d bits of $B_{k-1-i}$ to the least significant digit of $(B^{(i-1)})^{2^d}$ (the concatenations are shown by thick
vertical lines in Figure 5.2, two in Figure 5.2a and one in Figure 5.2b). This concatenation is
possible since the least significant digit of $(B^{(i-1)})^{2^d}$ is simply $0^d$ for all $0 \le i < k$ (notice from
(5.2) that $A^{(k-1)}$ and $B^{(k-1)}$ are not used in generating $A^{(k-1)}B^{(k-1)}$).
In the following, the space and time complexities of the proposed MSD DL-FSIPO single
GNB multiplier are studied.
5.1.1.3 Space and Time Complexities
The space complexity of the proposed MSD DL-FSIPO single GNB multiplier is listed in Table
5.2. This includes the count of logic gates, Flip Flops (FFs), and preloading multiplexers (for
parallel inputs preloading). In this table, d is the digit size and T is the GNB type. P and S denote
whether the corresponding input/output is loaded/generated in parallel or in serial, respectively.
It is noted that this table shows the space complexity of the proposed MSD DL-FSIPO single
GNB multiplier before applying sub-expression sharing. From Figure 5.2b, one can see that
each $r_j$ module, $0 \le j < d$, consists of $m + m - d = 2m - d$ two input AND gates, and therefore,
the total number of two input AND gates in the d $r_j$ modules of Figure 5.2a is $d(2m - d)$. The
total number of two input XOR gates in the $\sum$ module of Figure 5.2a (a $GF(2^m)$ adder which
adds $d + 1$ field elements) is $dm$. In addition, each $r_j$ module, $0 \le j < d$, has $\le (m - d) +
(T - 1)(m - 1)$ XORs, out of which $\le (T - 1)(m - 1)$ are contributed by the multiplications
by $\beta$ (before sub-expression elimination, see Section 2.9.2.2). Therefore, the total number of
XORs in the MSD DL-FSIPO single GNB multiplier is $\le d\,[(2m - d) + (T - 1)(m - 1)]$. In
addition, while register ⟨Z⟩ has m FFs, only $m - d$ FFs are required for each one of registers
⟨X⟩ and ⟨Y⟩, since the $(k-1)$-th digit in these two registers is always zero throughout the
computations. Hence, the total number of FFs is $2(m - d) + m = 3m - 2d$. One can also
see that there are no preloading multiplexers required for the proposed MSD DL-FSIPO single
GNB multiplier.

Multiplier | FF | AND | XOR | 2:1 1-bit MUX | Input 1 | Input 2 | Output
DL-PISO [61] | $2m$ | $d\,[T(m-1)+1]$ | $d\,[T(m-1)]$ | $2m$ | P | P | S
DL-PISO [37] | $2m$ | $dm$ | $d\,[T(m-1)]$ | $2m$ | P | P | S
DL-PISO (1) [16] | $2m$ | $dm$ | $\le d\left[(T-1)\left((m-1) - \frac{d-1}{2}\right)\right] + d(m-1)$ | $2m$ | P | P | S
DL-PIPO [17] | $3m$ | $dm$ | $d\left[\frac{(m-1)(T-1)}{2} + m\right]$ | $2m$ | P | P | P
DL-SIPO (1) [16] | $2m$ | $dm$ | $\le d(T-1)\left[(m-1) - \frac{d-1}{2}\right] + dm$ | $m$ | S | P | P
M/LSD DL-FSIPO (2) (Figures 5.2 & 5.3) | $3m - 2d$ | $d(2m-d)$ | $\le d\,[(2m-d) + (T-1)(m-1)]$ | 0 | S | S | P
MSD DL-PISO (1) (Figure 5.4a) | $2m$ | $dm$ | $\le d\left[(T-1)\left((m-1) - \frac{d-1}{2}\right)\right] + d(m-1)$ | $2m$ | P | P | S
(1) without applying group sub-expression elimination. (2) without applying sub-expression elimination.
Table 5.2: Space complexity of digit-level single GNB multipliers.
On the other hand, Table 5.3 reports the time complexity of the proposed digit-level MSD
FSIPO single GNB multiplier, in terms of the propagation delay of the corresponding levels
of two input XOR and AND gates through the critical path. As seen from Figure 5.2a, the
critical path of the proposed architecture passes through one $r_j$ module and the $\sum$ module. The
propagation delay of an $r_j$ module, $0 \le j < d$, is $T_A + (1 + \lceil \log_2 T \rceil)\,T_X$, where $\lceil \log_2 T \rceil\,T_X$
is the propagation delay through the multiply-by-$\beta$ block of Figure 5.2c (due to the multiplication
with $\beta$, see Section 2.9.2.2). Therefore, by adding the delay of the $\sum$ module (a $GF(2^m)$ adder
which adds $d + 1$ field elements), which is $\lceil \log_2 (d + 1) \rceil\,T_X$, the total propagation delay of the
proposed multiplier becomes $T_A + (1 + \lceil \log_2 (d + 1) \rceil + \lceil \log_2 T \rceil)\,T_X$.

Multiplier | Propagation Delay | Serial Preloading Latency | Computation Latency
DL-PISO [61] | $T_A + \lceil \log_2 (T(m-1)+1) \rceil\,T_X$ | k | k
DL-PISO [37] | $T_A + (\lceil \log_2 m \rceil + \lceil \log_2 T \rceil)\,T_X$ | k | k
DL-PISO [16] | $T_A + (\lceil \log_2 m \rceil + \lceil \log_2 T \rceil)\,T_X$ | k | k
DL-PIPO [17] | $T_A + (\lceil \log_2 (d+1) \rceil + \lceil \log_2 T \rceil)\,T_X$ | k | k
DL-SIPO [16] | $T_A + (\lceil \log_2 (d+1) \rceil + \lceil \log_2 T \rceil)\,T_X$ | k | k
M/LSD DL-FSIPO (Figures 5.2 & 5.3) | $T_A + (1 + \lceil \log_2 (d+1) \rceil + \lceil \log_2 T \rceil)\,T_X$ | 0 | k
MSD DL-PISO (Figure 5.4a) | $T_A + (\lceil \log_2 m \rceil + \lceil \log_2 T \rceil)\,T_X$ | k | k
Table 5.3: Time complexity of digit-level single GNB multipliers.
5.1.1.4 Bit-Level Case
It is noted that the original bit-level FSIPO multiplication scheme was presented by Feng [36]
for an MSB order of the inputs. By considering a single-bit digit size, one obtains a bit-level
MSB FSIPO single GNB multiplier from the proposed digit-level architecture. The obtained
MSB BL-FSIPO single GNB multiplier offers a maximum propagation delay of $T_A + 3T_X$,
while it requires 13 FFs, 9 ANDs, and 13 XORs (for the case of $GF(2^5)$ and T = 2). On
the other hand, the $GF(2^5)$ multiplier presented in [36] has a maximum propagation delay of
$T_A + 6T_X$ and requires 13 FFs, 9 ANDs, and 15 XORs. Moreover, in this work, the formulations
for the space and time complexities of the proposed MSD DL-FSIPO single GNB multiplier are
derived, while there are no such formulations presented in [36]. In addition, this work further
reduces the space complexity of the proposed architecture through applying sub-expression
sharing techniques to the multiplication by $\beta$ (see [69, 25] for example).
Table 5.4 estimates the corresponding space and time complexity readings for the case of
bit-level (d = 1) versions of the different multipliers in Tables 5.2 and 5.3, considering the
type-4 GNB of $GF(2^{163})$, based on the 65nm CMOS standard library's statistics.

Multiplier | MPD (ns) | Serial Input Loading: Total Gates | Latency | TP/G @ 1 GHz | Parallel Input Loading: Total Gates (2) | Latency | TP/G @ 1 GHz
BL-PISO [61] | 0.43 | 3329.75 | 326 | 150 | 3981.75 | 164 | 250
BL-PISO [37] | 0.43 | 2722.25 | 326 | 184 | 3374.25 | 164 | 295
BL-PIPO [17] | 0.15 | 2849.5 | 326 | 175 | 3501.5 | 164 | 284
LSB BL-SIPO [16] | 0.15 | 2724.25 | 326 | 184 | 3050.25 | 164 | 326
M/LSB BL-FSIPO (1) (Figures 5.2 & 5.3, d = 1) | 0.19 | 3854.5 | 163 | 259 | 3854.5 | 163 | 259
(1) without elimination. If we apply the elimination in [69], the saving is 127 XORs = 254 GE. (2) with MUXs.
Table 5.4: Space and time complexity readings for the case of type-4 GNB of $GF(2^{163})$
digit-level single multipliers.

It is noted that the NAND gate equivalence (GE) is obtained for a two input AND, two input XOR, D-type FF,
and a 2:1 1-bit MUX through synthesis using the Synopsys Design Vision tool [4] to be 1.25,
2, 3.75, and 2, respectively. Similarly, the maximum propagation delay (MPD) is obtained
for a two input AND and a two input XOR to be 0.03 ns and 0.04 ns, respectively. In this table,
latency denotes the number of clock cycles required for computing the m bits of output. TP is
throughput and TP/G denotes throughput (@ 1 GHz) per total GE measured in Kbps/Gate. As
one can see from this table, the bit-level version of the proposed DL-FSIPO single GNB
multiplier offers half the latency and provides the best normalized throughput compared to the other
multipliers in the case of serial loading of inputs. Moreover, one can further reduce the space
complexity of the proposed architecture through applying sub-expression sharing techniques
to the multiplication by $\beta$. For example, applying the elimination algorithm proposed in [69]
saves 127 XORs, which is equivalent to 254 GE.
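The closed-form counts for the proposed DL-FSIPO multiplier can be evaluated directly. The sketch below is an illustration rather than part of the thesis: it plugs the Table 5.2 and Table 5.3 expressions and the gate-equivalence and delay figures quoted above into a hypothetical helper, `fsipo_complexity`, and it assumes throughput is m output bits per k clock cycles at the given clock rate.

```python
from math import ceil, log2

# NAND-gate equivalents and gate delays quoted in the text (65nm library).
GE = {"AND": 1.25, "XOR": 2.0, "FF": 3.75}
T_A, T_X = 0.03, 0.04   # two input AND and XOR delays in ns

def fsipo_complexity(m, d, T, clock_hz=1e9):
    """Space/time figures of the M/LSD DL-FSIPO multiplier (before elimination)."""
    k = -(-m // d)                                       # computation latency in cycles
    ff = 3 * m - 2 * d
    and_gates = d * (2 * m - d)
    xor_gates = d * ((2 * m - d) + (T - 1) * (m - 1))    # upper bound from Table 5.2
    levels = 1 + ceil(log2(d + 1)) + ceil(log2(T))       # XOR levels on the critical path
    total_ge = ff * GE["FF"] + and_gates * GE["AND"] + xor_gates * GE["XOR"]
    return {
        "ff": ff,
        "and": and_gates,
        "xor": xor_gates,
        "delay_ns": T_A + levels * T_X,
        "latency": k,
        "total_ge": total_ge,
        # m output bits every k cycles, normalized per gate equivalent
        "kbps_per_gate": (m / k) * clock_hz / total_ge / 1e3,
    }

if __name__ == "__main__":
    print(fsipo_complexity(163, 1, 4))   # bit-level, type-4 GNB of GF(2^163)
    print(fsipo_complexity(5, 1, 2))     # bit-level, type-2 GNB of GF(2^5)
```

For m = 163, d = 1, T = 4 the helper gives 3854.5 GE and a 0.19 ns delay, matching the M/LSB BL-FSIPO row of Table 5.4; other digit sizes can be explored the same way, with the caveat that the XOR count is an upper bound.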
In the following section, a new LSD DL-FSIPO single GNB multiplier is introduced.
5.1.2 Proposed LSD DL-FSIPO Single GNB Multiplier
This section presents the LSD DL-FSIPO single GNB multiplier. The section starts by
deriving the formulations for the LSD DL-FSIPO single multiplication scheme. Then, it presents
the architecture of the proposed multiplier. The section concludes by studying the space and time
complexities.
5.1.2.1 Formulations
Here, the formulations are derived for digit-level multiplication of two $GF(2^m)$ elements based
on the GNB representation, where the two inputs of the multiplier are entered in an LSD first
order. First, the following shows how one constructs the elements of $GF(2^m)$ when represented
in the GNB, digit by digit, starting from the least significant digit.
Lemma 5.1.4 Given the digit size $0 < d < m$, an arbitrary $GF(2^m)$ element $A = (a_0, \ldots, a_{m-1})$
represented in the GNB is constructed, recursively, starting with its least significant digit, as
follows:
\[
A^{(i)} = \left( A_i + A^{(i-1)} \right)^{2^{-d}} \tag{5.3}
\]
where $i$ takes values starting from 0 up to $k-1$, $k = \lceil m/d \rceil$, $A = A^{(k-1)}$, $A^{(-1)} = 0$, and
$A_i = \sum_{j=0}^{d-1} a_{di+j-r}\,\beta^{2^j}$ is the i-th digit of A such that $a_{di+j-r} = 0$ for $di + j - r < 0$,
given $r = kd - m$ which represents the number of left padded zeros.
Proof By substituting for $i = 0, \ldots, k-1$ in (5.3), one gets
\[
A^{(k-1)} = \left( A_{k-1} + \cdots \left( A_1 + \left(A_0\right)^{2^{-d}} \right)^{2^{-d}} \cdots \right)^{2^{-d}}
= \sum_{i=0}^{k-1} A_i^{\,2^{-d(k-i)}},
\]
and by noticing that $A_i = \sum_{j=0}^{d-1} a_{di+j-r}\,\beta^{2^j}$ the following is obtained
\[
A^{(k-1)} = \sum_{i=0}^{k-1} \sum_{j=0}^{d-1} a_{di+j-r}\,\beta^{2^{j-d(k-i)}}
= \left( \sum_{i=0}^{k-1} \sum_{j=0}^{d-1} a_{di+j-r}\,\beta^{2^{j+di}} \right)^{2^{-dk}}.
\]
Now, let $l = di + j$, and notice that $dk = m + r$ (since $r = kd - m$), then
\[
A^{(k-1)} = \left( \sum_{l=0}^{m+r-1} a_{l-r}\,\beta^{2^{l}} \right)^{2^{-m-r}}
= \left( \sum_{l=-r}^{m-1} a_{l}\,\beta^{2^{l+r}} \right)^{2^{-r}}
= \sum_{l=0}^{m-1} a_l\,\beta^{2^l},
\]
where the last result is achieved since $a_l = 0$ for $l < 0$.
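As with the MSD case, the construction in (5.3) can be checked on coordinate vectors. This is a minimal Python sketch under the same assumption as before, namely that an element is a list of normal-basis coordinates and that raising to the power $2^{-d}$ is a cyclic shift by $-d$ positions; `frobenius` and `lsd_construct` are hypothetical names.

```python
import random

def frobenius(coeffs, e):
    """Raise a normal-basis element to the power 2^e (cyclic shift by e, e may be negative)."""
    m = len(coeffs)
    out = [0] * m
    for j, c in enumerate(coeffs):
        out[(j + e) % m] = c
    return out

def lsd_construct(a, d):
    """Rebuild A from its digits, least significant digit first, as in (5.3)."""
    m = len(a)
    k = -(-m // d)            # k = ceil(m / d)
    r = k * d - m             # number of left padded zeros
    state = [0] * m           # A^(-1) = 0
    for i in range(k):
        digit = [0] * m
        for j in range(d):
            idx = d * i + j - r
            if idx >= 0:      # a_{di+j-r} = 0 for negative indices
                digit[j] = a[idx]
        # A^(i) = (A_i + A^(i-1))^(2^-d)
        state = frobenius([x ^ y for x, y in zip(digit, state)], -d)
    return state

if __name__ == "__main__":
    random.seed(2)
    for m in range(2, 12):
        for d in range(1, m):
            a = [random.randint(0, 1) for _ in range(m)]
            assert lsd_construct(a, d) == a
    print("construction (5.3) reproduces A for all tested m and d")
```

The padding count r is nonzero whenever d does not divide m, for example m = 3 and d = 2 as in Example 5.1.6.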
Then, the multiplication of two $GF(2^m)$ elements A and B represented in the GNB and
constructed by (5.3) is obtained as follows.
Proposition 5.1.5 Let $E = AB$ be the multiplication of two elements $A, B \in GF(2^m)$
represented in the GNB. Therefore, using construction (5.3), $E = A^{(k-1)}B^{(k-1)}$ is obtained by the
following recurrence
\[
A^{(i)}B^{(i)} = \left[ \sum_{j=0}^{d-1} \left( \left[ a_{di+j-r} \left( B_i + B^{(i-1)} \right) + b_{di+j-r} A^{(i-1)} \right]^{2^{-j}} \beta \right)^{2^j} + A^{(i-1)}B^{(i-1)} \right]^{2^{-d}}, \tag{5.4}
\]
where $i$ takes values starting from 0 up to $k-1$, $k = \lceil m/d \rceil$, and $A^{(-1)} = B^{(-1)} = A^{(-1)}B^{(-1)} = 0$.
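Proof $A^{(i)}B^{(i)}$ is obtained by substituting for $A^{(i)}$ and $B^{(i)}$ in $A^{(i)}B^{(i)}$, using (5.3), as
\[
\begin{aligned}
A^{(i)}B^{(i)} &= \left[ \left( A_i + A^{(i-1)} \right) \left( B_i + B^{(i-1)} \right) \right]^{2^{-d}} \\
&= \left[ A_i \left( B_i + B^{(i-1)} \right) + B_i A^{(i-1)} + A^{(i-1)}B^{(i-1)} \right]^{2^{-d}},
\end{aligned}
\]
and by substituting for $A_i = \sum_{j=0}^{d-1} a_{di+j-r}\,\beta^{2^j}$ in $A_i \left( B_i + B^{(i-1)} \right)$, and for
$B_i = \sum_{j=0}^{d-1} b_{di+j-r}\,\beta^{2^j}$ in $B_i A^{(i-1)}$, one gets
\[
A^{(i)}B^{(i)} = \left[ \sum_{j=0}^{d-1} a_{di+j-r}\,\beta^{2^j} \left( B_i + B^{(i-1)} \right) + \sum_{j=0}^{d-1} b_{di+j-r}\,\beta^{2^j} A^{(i-1)} + A^{(i-1)}B^{(i-1)} \right]^{2^{-d}}
\]
which yields
\[
A^{(i)}B^{(i)} = \left[ \sum_{j=0}^{d-1} \left( \left[ a_{di+j-r} \left( B_i + B^{(i-1)} \right) + b_{di+j-r} A^{(i-1)} \right]^{2^{-j}} \beta \right)^{2^j} + A^{(i-1)}B^{(i-1)} \right]^{2^{-d}}.
\]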
In (5.4), and similar to (5.2), the multiplication of A by B (elements of $GF(2^m)$) represented in
the GNB is reduced recursively to a number of bit-wise AND operations, field additions,
multiplications with the normal element $\beta$, and cyclic shifts for computing the powers $2^{-j}$, $2^{j}$, and
$2^{-d}$. It is noted that the field addition of the term $B_i$ in (5.4) is realized through concatenation.
The concatenation is possible since the least significant digit of $B^{(i-1)}$ is always 0 for $0 \le i < k$
(only $B^{(k-1)}$ has a non zero LSD; however, $B^{(k-1)}$ is not used in computing $A^{(k-1)}B^{(k-1)}$).
Since it is already given that $A^{(-1)} = B^{(-1)} = 0$, starting at $i = 0$ and proceeding
up to $i = k-1$, one obtains the final result of the multiplication $AB = A^{(k-1)}B^{(k-1)}$. As one can
see from (5.4), at each step, the i-th digits in A and B, i.e. $A_i$ and $B_i$, together with $A^{(i-1)}$, $B^{(i-1)}$,
and $A^{(i-1)}B^{(i-1)}$, are used for computing $A^{(i)}B^{(i)}$. The following example illustrates the proposed
multiplication scheme for the case of $GF(2^3)$ where d = 2 and the elements are represented in
the GNB type-2.
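Example 5.1.6 Table 5.5 shows how one multiplies the $GF(2^3)$ elements
$A = B = \beta^{2^2} = (0, 0, 1)$, which are represented in the type-2 GNB
$\{\beta, \beta^{2}, \beta^{2^2}\}$ (that is, optimal normal basis type-2), using (5.3) and (5.4) for the
case of d = 2. Note that, in this case k = 2, and r = 1. For this example, rewrite
\[
A^{(i)}B^{(i)} = \left[ Z_i\,\beta + \left( W_i^{\,2^{-1}} \beta \right)^{2} + A^{(i-1)}B^{(i-1)} \right]^{2^{-2}},
\]
where $Z_i = a_{2i-1}X_i + b_{2i-1}Y_i$, $W_i = a_{2i}X_i + b_{2i}Y_i$, $X_i = B_i + B^{(i-1)}$, and
$Y_i = A^{(i-1)}$, for $0 \le i < 2$.

i | $A_i = a_{2i-1}\beta + a_{2i}\beta^2$ | $B_i = b_{2i-1}\beta + b_{2i}\beta^2$ | $A^{(i-1)}$ | $B^{(i-1)}$
0 | $A_0 = a_{-1}\beta + a_0\beta^2 = 0$ | $B_0 = b_{-1}\beta + b_0\beta^2 = 0$ | $A^{(-1)} = 0$ | $B^{(-1)} = 0$
1 | $A_1 = a_1\beta + a_2\beta^2 = \beta^2$ | $B_1 = b_1\beta + b_2\beta^2 = \beta^2$ | $A^{(0)} = (A_0 + A^{(-1)})^{2^{-2}} = 0$ | $B^{(0)} = (B_0 + B^{(-1)})^{2^{-2}} = 0$

i | $X_i = B_i + B^{(i-1)}$ | $Y_i = A^{(i-1)}$ | $Z_i = a_{2i-1}X_i + b_{2i-1}Y_i$ | $W_i = a_{2i}X_i + b_{2i}Y_i$
0 | $X_0 = B_0 + B^{(-1)} = 0$ | $Y_0 = A^{(-1)} = 0$ | $Z_0 = a_{-1}X_0 + b_{-1}Y_0 = 0$ | $W_0 = a_0X_0 + b_0Y_0 = 0$
1 | $X_1 = B_1 + B^{(0)} = \beta^2$ | $Y_1 = A^{(0)} = 0$ | $Z_1 = a_1X_1 + b_1Y_1 = 0$ | $W_1 = a_2X_1 + b_2Y_1 = \beta^2$

i | $A^{(i-1)}B^{(i-1)}$ | $A^{(i)}B^{(i)} = [Z_i\beta + (W_i^{2^{-1}}\beta)^2 + A^{(i-1)}B^{(i-1)}]^{2^{-2}}$
0 | $A^{(-1)}B^{(-1)} = 0$ | $A^{(0)}B^{(0)} = [Z_0\beta + (W_0^{2^{-1}}\beta)^2 + A^{(-1)}B^{(-1)}]^{2^{-2}} = 0$
1 | $A^{(0)}B^{(0)} = 0$ | $A^{(1)}B^{(1)} = [Z_1\beta + (W_1^{2^{-1}}\beta)^2 + A^{(0)}B^{(0)}]^{2^{-2}} = [(\beta^2)^2]^{2^{-2}} = [\beta^{2^2}]^{2^{-2}} = \beta$
Table 5.5: Steps for multiplication of the two $GF(2^3)$ elements $A = B = \beta^{2^2} = (0, 0, 1)$.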
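The LSD recurrence (5.4) can be exercised in the same way as the MSD one. The sketch below is an illustration under stated assumptions, not the thesis's Sage script: it models $GF(2^5)$ with the irreducible polynomial $x^5 + x^2 + 1$, uses the first normal element found by search (so it checks the algebra of (5.4) for a normal basis in general, not a particular GNB type), and all function names are hypothetical.

```python
import random

M = 5                 # field degree, assumed for this demonstration
POLY = 0b100101       # x^5 + x^2 + 1, irreducible over GF(2)

def gf_mul(x, y):
    """Multiply two GF(2^M) elements stored as polynomial-basis bit masks."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x >> M:
            x ^= POLY
    return r

def frob(x, e):
    """Return x^(2^e); a negative e is reduced mod M because x^(2^M) = x."""
    for _ in range(e % M):
        x = gf_mul(x, x)
    return x

def gf2_rank(vectors):
    """Rank over GF(2) of a list of bit-mask vectors (XOR basis)."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)
        if v:
            basis.append(v)
    return len(basis)

def normal_basis():
    """Conjugates [beta, beta^2, ...] of the first normal element found."""
    for beta in range(1, 1 << M):
        conj = [frob(beta, j) for j in range(M)]
        if gf2_rank(conj) == M:
            return conj
    raise ValueError("no normal element found")

def from_coords(bits, conj):
    """Field element with normal-basis coordinates bits[j] on beta^(2^j)."""
    x = 0
    for bit, c in zip(bits, conj):
        if bit:
            x ^= c
    return x

def lsd_fsipo_multiply(a, b, d, conj):
    """Evaluate recurrence (5.4), consuming digits least significant first."""
    m = len(a)
    k = -(-m // d)                                   # k = ceil(m / d)
    r = k * d - m                                    # left padded zeros
    beta = conj[0]
    A_prev = B_prev = AB_prev = 0
    for i in range(k):
        idx = [d * i + j - r for j in range(d)]
        a_bits = [a[p] if p >= 0 else 0 for p in idx]
        b_bits = [b[p] if p >= 0 else 0 for p in idx]
        X = from_coords(b_bits, conj) ^ B_prev       # B_i + B^(i-1)
        Y = A_prev                                   # A^(i-1)
        acc = AB_prev
        for j in range(d):
            t = (X if a_bits[j] else 0) ^ (Y if b_bits[j] else 0)
            acc ^= frob(gf_mul(frob(t, -j), beta), j)
        A_prev = frob(from_coords(a_bits, conj) ^ Y, -d)   # A^(i), from (5.3)
        B_prev = frob(X, -d)                               # B^(i), from (5.3)
        AB_prev = frob(acc, -d)                            # outer power 2^-d in (5.4)
    return AB_prev

if __name__ == "__main__":
    conj = normal_basis()
    random.seed(0)
    for d in range(1, M):
        for _ in range(50):
            a = [random.randint(0, 1) for _ in range(M)]
            b = [random.randint(0, 1) for _ in range(M)]
            assert lsd_fsipo_multiply(a, b, d, conj) == gf_mul(
                from_coords(a, conj), from_coords(b, conj))
    print("recurrence (5.4) matches direct multiplication")
```

Digit sizes 2, 3, and 4 leave r nonzero for m = 5, so the zero-padded first digit is covered by the check.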
Next, the proposed architecture of the LSD DL-FSIPO single GNB multiplier is presented.
5.1.2.2 Architecture
Here, the architecture of the proposed LSD DL-FSIPO single GNB multiplier is introduced.
This architecture is shown in Figure 5.3, which is constructed based on (5.3) and (5.4). In this
figure, the digit size is d, $0 < d < m$, and i denotes the i-th clock cycle of the computations,
$0 \le i < k$ where $k = \lceil m/d \rceil$. Architectures of the $r_j$ blocks and of the multiply-by-$\beta$ blocks
(each a component of $r_j$), $0 \le j < d$, are shown in Figures 5.2b and 5.2c, respectively, where in
Figure 5.2b, and at iteration $0 \le i < k$, one has: in1 $= B_i + B^{(i-1)}$, in2 $= a_{di+j-r}$,
in3 $= b_{di+j-r}$, and in4 $= A^{(i-1)}$.
[Figure 5.3, block diagram: input digits $A_0, \ldots, A_i, \ldots, A_{k-1}$ and $B_0, \ldots, B_i, \ldots, B_{k-1}$ of width d; shift registers ⟨X⟩ and ⟨Y⟩ (bits 0 to m-d-1); d modules indexed 0 to d-1, each with inputs in1, in2, in3, and in4; register ⟨Z⟩ (bits 0 to m-1).]
Figure 5.3: Architecture of the proposed LSD DL-FSIPO single GNB multiplier.
First, the $(m-d)$-bit shift registers ⟨X⟩ and ⟨Y⟩, and the m-bit register ⟨Z⟩ are cleared (in
other words, ⟨X⟩, ⟨Y⟩, and ⟨Z⟩ are loaded with $A^{(-1)}$, $B^{(-1)}$, and $A^{(-1)}B^{(-1)}$, respectively). After
this, at the i-th iteration, $0 \le i < k$, registers ⟨X⟩, ⟨Y⟩, and ⟨Z⟩ change states from $A^{(i-1)}$, $B^{(i-1)}$,
and $A^{(i-1)}B^{(i-1)}$ to $A^{(i)}$, $B^{(i)}$, and $A^{(i)}B^{(i)}$, respectively, as follows. At iteration i, ⟨X⟩ and ⟨Y⟩
perform a d-fold left shift (not cyclic) and, at the same time, the i-th digits of the field elements
A and B are written to the most significant d bits of ⟨X⟩ and ⟨Y⟩, respectively, according to
(5.3). Moreover, and at the same time, register ⟨Z⟩ accumulates the d-fold left cyclic shift of
$\sum_{j=0}^{d-1} r_j + A^{(i-1)}B^{(i-1)}$, where the architecture of $r_j$ is captured in Figure 5.2b and implements
\[
r_j = \left( \left[ a_{di+j-r} \left( B_i + B^{(i-1)} \right) + b_{di+j-r} A^{(i-1)} \right]^{2^{-j}} \beta \right)^{2^j}
\]
for in1 $= B_i + B^{(i-1)}$, in2 $= a_{di+j-r}$, in3 $= b_{di+j-r}$, and in4 $= A^{(i-1)}$. Therefore, after the i-th
clock cycle, ⟨Z⟩ $= A^{(i)}B^{(i)}$, as one can see from (5.4). After k clock cycles one gets ⟨Z⟩ $=
A^{(k-1)}B^{(k-1)} = AB$. It is noted that, since the least significant digit of $B^{(i-1)}$ is always $0^d$ for
$0 \le i < k$, where $0^d$ denotes a string of zeros of length d, the proposed architecture implements
$B_i + B^{(i-1)}$ in (5.4) by concatenating $B_i$ to the least significant digit of $B^{(i-1)}$.
In the following, the space and time complexities of the LSD DL-FSIPO single GNB mul-
tiplier are studied.
5.1.2.3 Space and Time Complexities
The space complexity of the proposed LSD DL-FSIPO single GNB multiplier is listed in Table
5.2, in terms of the count of logic gates, FFs, and preloading multiplexers (for the case of
parallel inputs preloading). In Section 5.1.1.3, it was found that each $r_j$ module, $0 \le j < d$,
has $2m - d$ AND gates and $\le (m - d) + (T - 1)(m - 1)$ XOR gates. Figure 5.3, on the other
hand, shows that the number of two input XOR gates in the $\sum$ module (a $GF(2^m)$ adder which
adds $d + 1$ field elements) is $dm$. Hence, the total number of two input AND gates is
$d(2m - d)$, while the total number of XOR gates adds up to $\le d\,[(2m - d) + (T - 1)(m - 1)]$.
In addition, one can notice that, while register ⟨Z⟩ has m FFs, registers ⟨X⟩ and ⟨Y⟩ have $m - d$
FFs each, since their least significant digits will always be zero throughout the k clock cycles
of computations. This adds up to a total of $3m - 2d$ FFs. It is also noted that no preloading is
required for the proposed LSD DL-FSIPO single GNB multiplier.
The time complexity of the proposed LSD DL-FSIPO single GNB multiplier is also reported in Table 5.3, in terms of levels of the propagation delay of two-input XOR and AND gates through its critical path. Similar to the proposed MSD DL-FSIPO single GNB multiplier (see Section 5.1.1.3), Figure 5.3 shows that the propagation delay of the proposed LSD architecture is equivalent to the sum of the propagation delays through one $r_j$ module and the $\sum$ module. Hence, the maximum propagation delay in the proposed LSD DL-FSIPO single GNB multiplier is $T_A + \left(1 + \lceil \log_2 (d + 1) \rceil + \lceil \log_2 T \rceil\right) T_X$, as shown in Table 5.3.
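The counts and delay levels quoted above can be evaluated numerically. The following is a minimal Python sketch, not part of the proposed design; the function names are hypothetical, and the parameters m = 163, d = 4, T = 4 in the usage note are chosen only for illustration.

```python
def ceil_log2(n: int) -> int:
    """Smallest integer e with 2**e >= n, computed without floating point."""
    return (n - 1).bit_length()


def lsd_fsipo_space(m: int, d: int, T: int) -> dict:
    """Gate and flip-flop counts quoted in the text for digit size d and GNB type T."""
    return {
        "AND": d * (2 * m - d),
        # Upper bound on the number of two-input XOR gates.
        "XOR_upper_bound": d * ((2 * m - d) + (T - 1) * (m - 1)),
        "FF": 3 * m - 2 * d,
    }


def lsd_fsipo_delay_levels(d: int, T: int) -> tuple:
    """Critical path as (levels of T_A, levels of T_X)."""
    return 1, 1 + ceil_log2(d + 1) + ceil_log2(T)
```

For example, `lsd_fsipo_space(163, 4, 4)` evaluates the three expressions for a type 4 GNB over $GF(2^{163})$ with digit size 4, and `lsd_fsipo_delay_levels(4, 4)` returns one AND level and six XOR levels.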
It is noted that the proposed multiplication algorithms in Propositions 5.1.2 and 5.1.5 are different from the one presented in [48]. Propositions 5.1.2 and 5.1.5 build a $GF(2^m)$ element recursively digit-by-digit, starting from the MSD and LSD, respectively. This behaviour results in a DL-FSIPO GNB multiplication scheme. The algorithm presented in [48] takes the opposite action by recursively shrinking a $GF(2^m)$ element bit-by-bit, starting from the LSB. The latter behaviour constructs a BL-PIPO GNB multiplication scheme, but not a DL-FSIPO GNB one. In addition, the authors of [48] present a bit-parallel GNB multiplier extended from their algorithm. Bit-parallel multipliers do not require any input or output registers for their processing and usually target high throughput applications by generating the output in one clock cycle, at the expense of a space complexity which is quadratic in m for the scheme in [48]. On the other hand, this chapter focuses on digit-level multiplications for resource constrained applications, which require input/output registers and trade off space complexity against a larger number of clock cycles.
It is worth mentioning that, although the presented DL-FSIPO multiplication algorithms are different from the one in [48], they meet at the bit-parallel level. Accordingly, one might construct multiplexer-based DL-FSIPO GNB multipliers by applying partitioning to the bit-parallel architecture presented in [48] (the proposed DL-FSIPO single GNB multipliers are AND/XOR based). In this case, efforts similar to those presented in this chapter need to be taken in order to optimize the number of FFs and the number of XOR gates within the fixed multiplication by $\beta$. Also, notice that the underlying multiplication algorithm needs to be theoretically aligned/proved according to the proposed formulations. Otherwise, it would be more natural to construct a DL-PIPO GNB multiplier by partitioning the architecture in [48], which reflects the underlying multiplication algorithm presented in [48]. In fact, the absence of any reference to Feng's original work [36] throughout [48] indicates that the authors of [48] were determined to use a DL-PIPO algorithm.
5.2 Proposed DL-PISO Single GNB Multiplier
In this section, the proposed architecture of the area-efficient MSD DL-PISO single GNB multiplier is presented. This multiplier is an area-optimized instance of the one presented by the authors of [70], which is based on (2.6), and hence, the reader is referred to [70] for more details about the formulations. The area reduction is accomplished by applying the group sub-expression sharing algorithm presented in [17]. In what follows, the architecture of the proposed multiplier is first shown, followed by an analysis of its space and time complexities.
5.2.1 Architecture
Figure 5.4: (a) The proposed architecture of the MSD DL-PISO single GNB multiplier. (b) The Rd block before applying sub-expression sharing. (c) The IP block.
Here, an area-efficient architecture is presented for the MSD DL-PISO single GNB multiplier, as shown in Figure 5.4a. Part (b) of this figure depicts the architecture of the Rd block before applying sub-expression sharing. Part (c) shows the architecture of the IP block. The
architecture in this figure differs from the LSD DL-PISO single GNB multiplier, which is presented in [16], in that it generates the multiplication output in the order of most significant digit first. This is accomplished by generating the d output bits $z^{(i)}_{d-1}$ through $z^{(i)}_0$ during iteration i, where $0 \le i < k$ and $k = \lceil m/d \rceil$, as follows. In Figure 5.4a, a bit $z^{(i)}_n$ denotes the
left most (least significant) coordinate of P2
 n
X(i)

Y (i)

, where $0 \le n < d$, and $X^{(i)} = A^{2^{(i+1)d-t}}$ and $Y^{(i)} = B^{2^{(i+1)d-t}}$ ($t = k \cdot d - m$) denote the contents of registers ⟨X⟩ and ⟨Y⟩ at the i-th iteration
of the computations. It is noted that z(i)n , the left most coordinate of P2
 n
X(i)

Y (i)

, is obtained
according to (2.6) as follows
$$z^{(i)}_n = x^{(i)}_n\, y^{(i)}_{n+1} + \sum_{u=1}^{m-1} x^{(i)}_{((n+u))} \left( \sum_{v=1}^{T} y^{(i)}_{((n+R[u,v]))} \right). \tag{5.5}$$
In (5.5), $x^{(i)}_j$ and $y^{(i)}_j$, respectively, denote the j-th coordinates (cells) of registers ⟨X⟩ and ⟨Y⟩ during the i-th iteration. In the same formulation, $((\cdot))$ denotes the reduction modulo m. Therefore, by initializing registers ⟨X⟩ and ⟨Y⟩ such that $X^{(0)} = A^{2^{d-t}}$ and $Y^{(0)} = B^{2^{d-t}}$, one obtains the most significant digit of the output. It is noted that the t-fold left cyclic shift in $A^{2^{d-t}}$ and $B^{2^{d-t}}$ is implemented in order to allow for appending zeros to the t most significant bits of the first output digit (most significant digit), i.e., $\left(z^{(0)}_{d-t}, \ldots, z^{(0)}_{d-1}\right)$. This is required for compatibility of integration with the proposed MSD DL-FSIPO single GNB multiplier (see Section 5.3). Then, at the i-th iteration, $0 \le i < k$, one has $X^{(i)} = A^{2^{(i+1)d-t}}$ and $Y^{(i)} = B^{2^{(i+1)d-t}}$, due to the d-fold right cyclic shifts which are applied to ⟨X⟩ and ⟨Y⟩ at each clock cycle, as shown in Figure 5.4a. According to this, bit $z^{(i)}_n$ of the i-th output digit in Figure 5.4a maps to the output bit $e_{m-(d(i+1)-t)+n}$ of the multiplication result $E = AB$. For all $0 \le n < d$, the inner product in (5.5) is realized through an IP block in Figure 5.4a, while the d instances (for $0 \le n < d$) of the $m - 1$ bits of $\sum_{u=1}^{m-1} \sum_{v=1}^{T} y_{((n+R[u,v]))} \beta^{2^u}$ are generated through the Rd block, which is shown in Figure 5.4b. This figure shows the architecture of Rd before applying the group sub-expression sharing algorithm presented in [17], where each R block represents the matrix multiplication of the lower $m - 1$ rows of the multiplication matrix $M$ by the corresponding m-bit input column vector.
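The parallel-in-serial-out operation above rests on a general normal-basis property: the n-th coordinate of a product equals the 0-th coordinate of the product of the operands after an n-fold cyclic shift of both. The following Python sketch checks this property by brute force. It is only an illustration under stated assumptions: the field $GF(2^5)$, the polynomial $x^5 + x^2 + 1$, the first normal element found by search (not necessarily a Gaussian one), and all function names are hypothetical choices, and the code does not model the Rd or IP blocks.

```python
M = 5                      # small illustrative field size (assumption)
POLY = 0b100101            # x^5 + x^2 + 1, irreducible over GF(2)
MASK = (1 << M) - 1


def gf_mul(a, b):
    """Multiply two GF(2^M) elements stored as polynomial-basis integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


def find_normal_basis():
    """Return (conjugates, value -> coordinates map) for the first normal element."""
    for beta in range(2, 1 << M):
        conj = [beta]
        for _ in range(M - 1):
            conj.append(gf_mul(conj[-1], conj[-1]))   # beta^(2^i)
        lookup = {}
        for bits in range(1 << M):
            value = 0
            for i in range(M):
                if (bits >> i) & 1:
                    value ^= conj[i]
            lookup[value] = bits
        if len(lookup) == 1 << M:      # conjugates are linearly independent
            return conj, lookup
    raise ValueError("no normal element found")


CONJ, TO_COORDS = find_normal_basis()


def from_coords(bits):
    """Field element whose i-th normal-basis coordinate is bit i of `bits`."""
    value = 0
    for i in range(M):
        if (bits >> i) & 1:
            value ^= CONJ[i]
    return value


def rotate(bits, n):
    """Cyclic shift: new coordinate i is old coordinate (i + n) mod M."""
    n %= M
    return ((bits >> n) | (bits << (M - n))) & MASK


def product_coordinate(a_bits, b_bits, n):
    """n-th coordinate of A*B, read as coordinate 0 after shifting both inputs."""
    shifted = gf_mul(from_coords(rotate(a_bits, n)), from_coords(rotate(b_bits, n)))
    return TO_COORDS[shifted] & 1
```

In this sketch, calling `product_coordinate(a_bits, b_bits, n)` for n = 0, 1, ... plays the role of reading successive output bits while the operand registers are cyclically shifted.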
The following is the space and time complexity analysis of the proposed MSD DL-PISO
single GNB multiplier.
5.2.2 Space and Time Complexities
Here, the space and time complexities of the proposed MSD DL-PISO single GNB multiplier in Figure 5.4a are discussed. In this figure, an IP block consists of m AND gates and $m - 1$ XOR gates, as shown in Figure 5.4c. Furthermore, for area efficiency, the Rd block of Figure 5.4a realizes a group sub-expression shared version of the d R blocks in Figure 5.4b based on the algorithm presented in [17]. It is noted that one can find the number of eliminations due to this sharing through simulation. Therefore, the architecture of Figure 5.4a requires a total of $2m$ FFs, $dm$ ANDs, and $\le d\left[(T - 1)\left((m - 1) - \frac{d-1}{2}\right)\right] + d(m - 1)$ XORs, as presented in Table 5.2, where $\le (T - 1)(m - 1)$ is the number of XOR gates in each R block before sub-expression sharing (see Section 2.9.2.2) and the term $\frac{d(d-1)}{2}$ is due to the elimination of common rows between the different R matrices which are used in implementing the d R blocks [16].
For the time complexity, one can see that the critical path of Figure 5.4a has a propagation delay equal to $T_A + \left(\lceil \log_2 T \rceil + \lceil \log_2 m \rceil\right) T_X$, as presented in Table 5.2, where the delay through the area-optimized Rd block is $\lceil \log_2 T \rceil T_X$ [16], and that through an IP module is $T_A + \lceil \log_2 m \rceil T_X$.
In the following, the proposed MSD DL-FSIPO and DL-PISO single GNB multipliers are combined to construct an MSD DL-SIPO hybrid-double GNB multiplier and a DL-PIPO hybrid-triple GNB multiplier.
5.3 Proposed Digit-Level Hybrid-Double and Hybrid-Triple
GNB Multipliers
A hybrid-double digit-level GNB multiplication architecture has been recently proposed by the authors of [16], which performs two field multiplications (multiplication of three field elements) using the same latency required for a single field multiplication (i.e., $k = \lceil m/d \rceil$ iterations for a digit size d). To accomplish this, the authors of [16] have extended the LSB BL-PISO GNB multiplier in [70] and the LSB BL-SIPO GNB multiplier in [20] to the digit level; then, by combining these two digit-level single GNB multipliers, they constructed their DL-PIPO hybrid-double GNB multiplier.
In this section, four new architectures for $GF(2^m)$ digit-level hybrid multiplications are presented: two for an MSD DL-SIPO hybrid-double multiplication (a low area design and a high speed design) and, for the first time, another two (low area and high speed designs) for a DL-PIPO hybrid-triple multiplication (multiplication of four field elements), when the field elements are represented in the GNB. In order to construct the proposed hybrid-double multiplier, the MSD DL-PISO single GNB multiplier presented in Section 5.2.1 is combined with the proposed MSD DL-FSIPO single GNB multiplier of Figure 5.2a. On the other hand, the proposed hybrid-triple multiplier is constructed by combining the MSD DL-PISO single GNB multiplier presented in Section 5.2.1 with the MSD DL-SIPO hybrid-double GNB multiplier proposed in this section¹.
In the following, the proposed architectures of the MSD DL-SIPO hybrid-double GNB
multiplier are first presented, followed by those of the proposed DL-PIPO hybrid-triple GNB
multiplier. This section concludes by analyzing the space and time complexities of the two
proposed hybrid multipliers.
5.3.1 Proposed MSD DL-SIPO Hybrid-Double GNB Multiplier
This section presents the architecture of the proposed MSD DL-SIPO hybrid-double GNB multiplier. It is noted that the hybrid-double GNB multiplier proposed by the authors of [16] follows a DL-PIPO scheme of its inputs/outputs, while the proposed architecture in this work is the first DL-SIPO scheme for the digit-level hybrid-double GNB multiplication. Figure 5.5 shows two versions of the proposed MSD DL-SIPO hybrid-double GNB multiplier, one for a low area design (Figure 5.5a) and the other for a high speed design (Figure 5.5b). The appended $0^d$ (zero digit) in input C of part (b) of this figure balances the timing due to pipelining. It is noted that the most significant t bits (where $t = k \cdot d - m$, $k = \lceil m/d \rceil$, and d is the digit size) of the first output digit (most significant digit) of the MSD DL-PISO single GNB multipliers in Figure 5.5 are set to zero through the MSD signal. As can be seen from the figure, the low area MSD DL-SIPO hybrid-double GNB multiplier is built from MSD DL-PISO and DL-FSIPO single GNB multipliers with the output of the former connected to one input of the latter. On the other hand, the high speed version follows from the low area version by inserting the d-bit register ⟨W⟩ between the output of the DL-PISO and the input of the DL-FSIPO single GNB multipliers, as can be seen from Figure 5.5b. This, in turn, shortens the propagation delay of the multiplier's critical path, which results in reaching higher operating frequencies compared to the low area version. However, it adds one extra clock cycle to the latency. Each one of the two versions of the proposed MSD DL-SIPO hybrid-double GNB multiplier takes three inputs, two of which are m bits wide each (inputs A and B in Figure 5.5), while the third one has only d bits (input C in Figure 5.5).
Figure 5.5: Architectures of the proposed MSD DL-SIPO hybrid-double GNB multiplier. (a) Low area design. (b) High speed design.
¹ It is noted that one can build an LSD DL-SIPO hybrid-double architecture, as well as a DL-PIPO hybrid-triple architecture, by combining the LSD DL-PISO GNB multiplier presented in [16] with the LSD DL-FSIPO GNB multiplier which is proposed in this chapter.
Initially, A and B are loaded to the input registers of the MSD DL-PISO single GNB multiplier, while the input/output registers of the MSD DL-FSIPO single GNB multiplier are cleared out. In the low area version, at the i-th clock cycle, $0 \le i < k$, the MSD DL-PISO single GNB multiplier generates the $(k - 1 - i)$-th output digit for the multiplication AB, while the MSD DL-FSIPO single GNB multiplier generates $E^{(i)} = (AB)^{(i)} C^{(i)}$ (according to (5.1) and (5.2)). After k iterations, the output register of the MSD DL-FSIPO single GNB multiplier holds the result of the double multiplication, i.e., $E^{(k-1)} = (AB)^{(k-1)} C^{(k-1)}$. In the high speed version, an extra clock cycle is required at the beginning to store the MSD output digit of the DL-PISO single GNB multiplier to register ⟨W⟩.
In the following, the proposed architectures for the DL-PIPO hybrid-triple GNB multiplier
are introduced.
5.3.2 Proposed DL-PIPO Hybrid-Triple GNB Multiplier
This section presents the proposed architectures for the DL-PIPO hybrid-triple GNB multiplier. To the best of the author's knowledge, this is the first digit-level hybrid GNB multiplier proposed in the literature which performs three $GF(2^m)$ multiplications using the same latency as a single digit-level field multiplication (multiplying the four field elements A, B, C, and D together). Figures 5.6a and 5.6b present two variants of the proposed DL-PIPO hybrid-triple GNB multiplier. Figure 5.6a is a low area design, while Figure 5.6b shows a high speed design. The most significant t bits, $t = k \cdot d - m$, of the first output digit of the MSD DL-PISO single GNB multipliers in these figures are set to zero through the MSD signal. In Figure 5.6a, the low area DL-PIPO hybrid-triple GNB multiplier is constructed from one MSD DL-PISO single GNB multiplier and one low area MSD DL-SIPO hybrid-double GNB multiplier, with the output of the former connected to the serial input of the latter. The high speed DL-PIPO hybrid-triple GNB multiplier instance uses a high speed MSD DL-SIPO hybrid-double GNB multiplier. Also, it has a d-bit register ⟨V⟩ inserted between the output of the MSD DL-PISO single GNB multiplier and the input of the high speed MSD DL-SIPO hybrid-double GNB multiplier (see Figure 5.6b). This leads to a shorter critical path and, hence, higher operating frequencies compared to the low area instance, at the expense of one extra clock cycle.
Each one of the two versions of the proposed hybrid-triple GNB multiplier takes four m-bit inputs, denoted by A, B, C, and D, and generates an m-bit output, i.e., $E = ABCD$. Initially, A, B, C, and D are loaded to the input registers of the MSD DL-PISO single GNB multipliers (including the one in the DL-SIPO hybrid-double GNB multiplier), while the input/output registers of the MSD DL-FSIPO single GNB multiplier (in the hybrid-double multiplier) are cleared out. In the low area version, at the i-th clock cycle, $0 \le i < k$, the MSD DL-PISO single GNB multipliers generate their $(k - 1 - i)$-th output digits for AB and CD at the same time, while the output register of the MSD DL-SIPO hybrid-double multiplier computes $E^{(i)} = (AB)^{(i)} (CD)^{(i)}$ (according to (5.1) and (5.2)). After k iterations, the output register of the low area DL-PIPO hybrid-triple GNB multiplier holds $E^{(k-1)} = (AB)^{(k-1)} (CD)^{(k-1)} = ABCD$. On the other hand, the high speed version generates its final output after $k + 1$ clock cycles, since an extra clock cycle is required, at the beginning, to store the MSD output digit of CD in register ⟨V⟩ and the MSD output digit of AB in register ⟨W⟩ (see Figure 5.5b).
Figure 5.6: Architectures of the proposed MSD DL-PIPO hybrid-triple GNB multiplier. (a) Low area design. (b) High speed design.
Next, the space and time complexities for the different proposed architectures of the digit-level hybrid-double and hybrid-triple GNB multipliers are given.
5.3.3 Space and Time Complexity Analysis
Here, the space and time complexities of the proposed digit-level hybrid-double and hybrid-
triple GNB multipliers which are presented in Sections 5.3.1 and 5.3.2, respectively, are de-
rived. Table 5.6 shows the space complexities of the proposed hybrid GNB multipliers, while
Table 5.7 presents their corresponding time complexities.
Multiplier | D-FF | AND | XOR¹ | 2:1 MUX
DL-PIPO Hybrid-Double² (low area) [16] | $4m$ | $2dm + t$ | $\le d(T-1)[2(m-1) - (d-1)] + d(2m-1)$ | $3m$
DL-PIPO Hybrid-Double² (high speed) [16] | $4m + d$ | $2dm + t$ | $\le d(T-1)[2(m-1) - (d-1)] + d(2m-1)$ | $3m$
MSD DL-SIPO Hybrid-Double (low area) (Figure 5.5a) | $5m - 2d$ | $d(3m - d) + t$ | $\le d(T-1)\left[2(m-1) - \frac{d-1}{2}\right] + d(3m - (d+1))$ | $2m$
MSD DL-SIPO Hybrid-Double (high speed) (Figure 5.5b) | $5m - d$ | $d(3m - d) + t$ | $\le d(T-1)\left[2(m-1) - \frac{d-1}{2}\right] + d(3m - (d+1))$ | $2m$
DL-PIPO Hybrid-Triple (low area) (Figure 5.6a) | $7m - 2d$ | $d(4m - d) + 2t$ | $\le d(T-1)[3(m-1) - (d-1)] + d(4m - (d+2))$ | $4m$
DL-PIPO Hybrid-Triple (high speed) (Figure 5.6b) | $7m$ | $d(4m - d) + 2t$ | $\le d(T-1)[3(m-1) - (d-1)] + d(4m - (d+2))$ | $4m$
¹ Without sub-expression elimination. ² Note: the authors of [16] did not account for the t ANDs which are required for appending zeros.
Table 5.6: Space complexity of the digit-level hybrid-double and hybrid-triple GNB multipliers.
Multiplier | Propagation Delay | Serial Loading of Inputs Latency | Computation Latency
DL-PIPO Hybrid-Double (low area) [16] | $T_{PISO} + T_{SIPO}$ | $k$ | $k$
DL-PIPO Hybrid-Double (high speed) [16] | $\max\{T_{PISO}, T_{SIPO}\}$ | $k$ | $k + 1$
MSD DL-SIPO Hybrid-Double (low area) (Figure 5.5a) | $T_{PISO} + T_{FSIPO} + T_A$ | $k$ | $k$
MSD DL-SIPO Hybrid-Double (high speed) (Figure 5.5b) | $\max\{T_{PISO}, T_{FSIPO} + T_A\}$ | $k$ | $k + 1$
DL-PIPO Hybrid-Triple (low area) (Figure 5.6a) | $T_{PISO} + T_{FSIPO} + T_A$ | $k$ | $k$
DL-PIPO Hybrid-Triple (high speed) (Figure 5.6b) | $\max\{T_{PISO}, T_{FSIPO} + T_A\}$ | $k$ | $k + 1$
Table 5.7: Time complexity of the digit-level hybrid-double and hybrid-triple GNB multipliers.
In Table 5.6, $t = k \cdot d - m$, $k = \lceil m/d \rceil$, and d is the digit size. T is the GNB type. In Table 5.7, $T_{PISO} = T_A + \left(\lceil \log_2 T \rceil + \lceil \log_2 m \rceil\right) T_X$ denotes the delay in the DL-PISO single GNB multiplier, $T_{SIPO} = T_A + \left(\lceil \log_2 (d + 1) \rceil + \lceil \log_2 T \rceil\right) T_X$ denotes the delay in the DL-SIPO single GNB multiplier, and $T_{FSIPO} = T_A + \left(1 + \lceil \log_2 (d + 1) \rceil + \lceil \log_2 T \rceil\right) T_X$ denotes the delay in the DL-FSIPO single GNB multiplier.
Using the construction of Figure 5.5a, one obtains the space complexity of the low area version of the proposed MSD DL-SIPO hybrid-double GNB multiplier, which is listed in Table 5.6. This is done by adding the corresponding space complexities of the MSD DL-PISO and DL-FSIPO single GNB multipliers, in addition to the t AND gates which are used for padding the most significant digit with zeros. Similarly, one can find the space complexity of the low area version of the proposed DL-PIPO hybrid-triple GNB multiplier. This is done by adding the corresponding gate counts of its MSD DL-PISO single and DL-SIPO hybrid-double GNB multipliers, in addition to the t AND gates in Figure 5.6a, as can be seen from Table 5.6. Moreover, the space complexities of the high speed versions of the proposed hybrid-double and hybrid-triple GNB multipliers are obtained by adding d and 2d FFs, respectively, to the space complexities of the corresponding low area versions.
In addition, from Figures 5.5a and 5.6a, one finds that the low area architectures of the proposed MSD DL-SIPO hybrid-double GNB multiplier and the proposed DL-PIPO hybrid-triple GNB multiplier have maximum propagation delays equal to $T_{PISO} + T_{FSIPO} + T_A$. In the latter formulation, $T_{PISO}$ and $T_{FSIPO}$ denote the propagation delays through the MSD DL-PISO single GNB multiplier and the MSD DL-FSIPO single GNB multiplier, respectively. On the other hand, due to the insertion of registers (Figures 5.5b and 5.6b), the propagation delays of the high speed architectures of the proposed digit-level hybrid-double and hybrid-triple GNB multipliers are reduced to $\max\{T_{PISO}, T_{FSIPO} + T_A\}$.
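The composition of the space complexities described above can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, and it assumes that the MSD DL-FSIPO single multiplier has the same gate and FF counts as those quoted for the LSD variant in Section 5.1.2.3, which is consistent with the entries of Table 5.6.

```python
def piso_space(m, d, T):
    """MSD DL-PISO single multiplier counts quoted in Section 5.2.2."""
    xor_bound = (d * (T - 1) * (m - 1)
                 - (T - 1) * d * (d - 1) // 2     # d(d-1) is even, so this is exact
                 + d * (m - 1))
    return {"FF": 2 * m, "AND": d * m, "XOR": xor_bound}


def fsipo_space(m, d, T):
    """DL-FSIPO single multiplier counts (assumed equal for MSD and LSD)."""
    return {
        "FF": 3 * m - 2 * d,
        "AND": d * (2 * m - d),
        "XOR": d * ((2 * m - d) + (T - 1) * (m - 1)),
    }


def padding_ands(m, d):
    """t = k*d - m AND gates used to zero the top bits of the first digit."""
    k = -(-m // d)          # ceil(m / d)
    return k * d - m


def hybrid_double_low_area(m, d, T):
    """PISO + FSIPO + t padding ANDs (construction of Figure 5.5a)."""
    p, f = piso_space(m, d, T), fsipo_space(m, d, T)
    return {
        "FF": p["FF"] + f["FF"],
        "AND": p["AND"] + f["AND"] + padding_ands(m, d),
        "XOR": p["XOR"] + f["XOR"],
    }


def hybrid_triple_low_area(m, d, T):
    """PISO + hybrid-double + t padding ANDs (construction of Figure 5.6a)."""
    p, h = piso_space(m, d, T), hybrid_double_low_area(m, d, T)
    return {
        "FF": p["FF"] + h["FF"],
        "AND": p["AND"] + h["AND"] + padding_ands(m, d),
        "XOR": p["XOR"] + h["XOR"],
    }
```

Under these assumptions the sums reproduce the closed-form low area rows of Table 5.6; the high speed rows follow by adding d (hybrid-double) or 2d (hybrid-triple) to the FF count.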
In the following, a brief discussion is given about the advantages of using digit-level hybrid
GNB multipliers for accomplishing double and triple field multiplications, compared to using
digit-level single GNB multipliers with 2d and 3d digit sizes, respectively.
5.3.4 Hybrid Versus Single Digit-Level GNB Multipliers
This section briefly discusses advantages for using the proposed digit-level hybrid-double and
hybrid-triple GNB multipliers, instead of using digit-level single GNB multipliers, for accom-
plishing double and triple field multiplications, respectively.
It is noted that the proposed digit-level hybrid-double and hybrid-triple GNB multipliers are constructed using two and three digit-level single GNB multipliers, respectively. Hence, for a fair discussion, we compare the digit-level hybrid-double and hybrid-triple GNB multipliers, of digit size d each, to digit-level single GNB multipliers of digit sizes 2d and 3d, respectively.
The main advantage of using the digit-level (digit size d) hybrid-double and hybrid-triple GNB multipliers, in the case of double and triple field multiplications, respectively, is that one can obtain lower computational latency, and hence higher throughput, compared to using digit-level single GNB multipliers with digit sizes 2d and 3d, respectively. For example, by using a digit-level hybrid-double GNB multiplier with $d = \lceil m/3 \rceil$, one obtains the result of a double field multiplication after 3 clock cycles. However, a digit-level single GNB multiplier computes the double field multiplication over 4 clock cycles for a digit size $2d = 2\lceil m/3 \rceil$, where $\lceil m/2 \rceil < 2\lceil m/3 \rceil < m$. In addition, by using the digit-level hybrid-double/triple GNB multiplier structures (for digit size d), one can achieve lower space/time complexities for some computational latencies, compared to using digit-level single GNB multipliers (for digit sizes 2d and 3d, respectively). For example, a digit-level hybrid-triple GNB multiplier with $d = \lceil m/2 \rceil$ accomplishes a triple field multiplication over 2 clock cycles, while the same latency can only be achieved by using two bit-parallel GNB multipliers when multiplying four field elements.
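The latency figures in the example above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes that a single multiplier runs its multiplications back to back and it ignores any loading cycles, and m = 163 in the usage note is an illustrative choice.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def hybrid_latency(m, d):
    """Clock cycles of a low area hybrid multiplier with digit size d."""
    return ceil_div(m, d)


def sequential_single_latency(m, digit, multiplications):
    """Clock cycles when one single multiplier with the given digit size
    performs the multiplications one after another."""
    return multiplications * ceil_div(m, digit)
```

For m = 163, taking `d = ceil_div(163, 3)` gives `hybrid_latency(163, d)` equal to 3, while `sequential_single_latency(163, 2 * d, 2)` gives 4, in line with the hybrid-double example in the text.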
Therefore, the proposed DL-PIPO hybrid-triple GNB multiplier accomplishes three field
multiplications using the latency required for a single field multiplication, and hence, it can be
used to increase the throughput of applications where such triple multiplications exist. In what
follows, a new architecture for the eight-ary field exponentiation is presented, as an application
for the digit-level hybrid-triple GNB multiplier presented in this section.
5.4 Proposed Architecture for Field Exponentiation
Exponentiation is a fundamental operation for the Diffie-Hellman key exchange algorithm [29] and is also used for other cryptographic applications such as random number generation [85]. The n-ary scheme is used to increase the throughput of $GF(2^m)$ exponentiation [42, 83]. In the case of $n = 2^3$ (i.e., the eight-ary scheme), to compute the exponentiation $A^h$ for $A \in GF(2^m)$ and a positive integer $h = \sum_{i=0}^{\lceil m/3 \rceil - 1} h_i 2^{3i}$, where $0 \le h_i < 8$, one rewrites h as $h = \sum_{w=1}^{7} w \sum_{\{i : h_i = w\}} 2^{3i}$. Then, this eight-ary exponentiation scheme requires finding the coefficients $\sum_{\{i : h_i = w\}} 2^{3i}$ and precomputing and storing odd powers for $1 < w < 8$. This takes at most $\lceil m/3 \rceil + 2$ iterations to complete [42, 83]. In this section, a new architecture for the eight-ary field exponentiation scheme when the $GF(2^m)$ elements are represented in the GNB is presented. The proposed architecture is based on the digit-level hybrid-triple GNB multiplier presented in Section 5.3.2 (Figure 5.6) and computes the exponentiation result using $\lceil m/3 \rceil$ iterations, while it does not require any storage of precomputed values.
In the following, the proposed formulation for field exponentiation is first derived, followed by its corresponding proposed architecture.
Proposition 5.4.1 Let $F = A^h$ denote the exponentiation of an arbitrary $GF(2^m)$ element A represented in the GNB, where $1 < h < 2^m$ is an arbitrary positive integer. Then, one can compute F using the following recurrence:
$$F^{(i)} = A^{h_{k'-1-i}} \left(F^{(i-1)}\right)^{2^3}, \tag{5.6}$$
where $k' = \lceil m/3 \rceil$, $h = \sum_{i=0}^{k'-1} h_{k'-1-i}\, 8^{k'-1-i}$, $0 \le h_{k'-1-i} < 8$, $F^{(-1)} = 1$, and $F = F^{(k'-1)}$.
Proof By substituting for $i = 0, \ldots, k' - 1$ in (5.6), where $k' = \lceil m/3 \rceil$ and $h_{k'-1-i} \in [0, 7]$ are the coefficients of the radix-8 representation of h, one gets
$$F^{(k'-1)} = A^{(((h_{k'-1} 8 + h_{k'-2}) 8 + h_{k'-3}) 8 + \cdots + h_1) 8 + h_0} = A^{\sum_{i=0}^{k'-1} h_{k'-1-i}\, 8^{k'-1-i}}.$$
That is, $F^{(k'-1)} = F$, since $h = \sum_{i=0}^{k'-1} h_{k'-1-i}\, 8^{k'-1-i}$.
Note that (5.6) reads h left-to-right. Similarly, one can read h right-to-left by using $F^{(i)} = A^{h_i} \left(F^{(i-1)}\right)^{2^{-3}}$, for $0 \le i < k'$, where $h = 2^{3(k'-1)} \sum_{i=0}^{k'-1} h_i 2^{-3(k'-1-i)}$ and $F = \left(F^{(k'-1)}\right)^{2^{3(k'-1)}}$.
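The left-to-right recurrence (5.6) can be checked exhaustively on a small field. The Python sketch below is an illustration only: it uses a polynomial-basis representation of $GF(2^5)$ with $x^5 + x^2 + 1$ (an assumed example field, not the GNB representation of the thesis), and the function names are hypothetical.

```python
M = 5                      # illustrative field size (assumption)
POLY = 0b100101            # x^5 + x^2 + 1


def gf_mul(a, b):
    """Multiply two GF(2^M) elements stored as polynomial-basis integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


def gf_pow_naive(a, e):
    """Reference exponentiation by e - 1 repeated multiplications (a^0 = 1)."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def eight_ary_pow(a, h):
    """Evaluate F = a^h with F(i) = a^(h_{k'-1-i}) * (F(i-1))^8, F(-1) = 1."""
    k_prime = -(-M // 3)                       # ceil(M / 3) radix-8 digits
    f = 1
    for i in range(k_prime):
        digit = (h >> (3 * (k_prime - 1 - i))) & 7   # most significant digit first
        for _ in range(3):                     # three squarings give f^8
            f = gf_mul(f, f)
        f = gf_mul(gf_pow_naive(a, digit), f)
    return f
```

In a normal basis the three squarings inside the loop reduce to a cyclic shift, which is why the recurrence maps well onto the GNB multiplier.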
Based on (5.6), the eight-ary exponentiation architecture of Figure 5.7 is proposed, which is constructed based on the proposed digit-level hybrid-triple GNB multiplier of Figure 5.6 (either Figure 5.6a or Figure 5.6b, depending on whether the target application requires a low area design or a high speed design, respectively). It is noted that, in this figure, the 1 inputs to the multiplexers represent the field element $1 = (1, \ldots, 1)$ represented in the GNB. As shown in this figure, the architecture is composed of one DL-PIPO hybrid-triple GNB multiplier and four 2:1 m-bit multiplexers. The first three multiplexers (0, 1, and 2), respectively, are controlled by the coefficients $s^{(i)}_0$, $s^{(i)}_1$, and $s^{(i)}_2$ of the binary representation of $h_{k'-1-i} = s^{(i)}_0 + s^{(i)}_1 2 + s^{(i)}_2 2^2$, where $k' = \lceil m/3 \rceil$ and $0 \le h_{k'-1-i} < 8$ for all $0 \le i < k'$ in (5.6). The last multiplexer, i.e., 3, passes the field element $1 = (1, \ldots, 1)$ during the first iteration, while it selects the 3-fold right cyclic shift of the multiplier's output during the remaining iterations. Therefore, by using this architecture one computes $F = A^h$ after $k'$ runs of the hybrid-triple multiplication. This is equivalent to $k'(L + 1)$ clock cycles in the case of parallel preloading of the multiplier, where $L = k$ if a low area hybrid multiplier is used; otherwise, $L = k + 1$ for a high speed hybrid multiplier ($k = \lceil m/d \rceil$, d is the digit size).
Figure 5.7: Architecture of the proposed eight-ary field exponentiation scheme.
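A behavioural Python sketch of this data flow is given below. It is a simplified model under several assumptions that are not taken from Figure 5.7: the non-unity inputs of multiplexers 0, 1, and 2 are taken to be $A$, $A^2$, and $A^4$ (cyclic shifts of A in a normal basis), the hybrid-triple multiplier is replaced by a plain four-operand product, and elements are held in a polynomial basis of the small field $GF(2^5)$, where the element 1 is the integer 1. All names are hypothetical.

```python
M = 5                      # illustrative field size (assumption)
POLY = 0b100101            # x^5 + x^2 + 1


def gf_mul(a, b):
    """Multiply two GF(2^M) elements stored as polynomial-basis integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


def hybrid_triple(a, b, c, d):
    """Behavioural stand-in for the hybrid-triple multiplier: a*b*c*d."""
    return gf_mul(gf_mul(a, b), gf_mul(c, d))


def exponentiate(a, h):
    """Model of the four-multiplexer data flow around one hybrid-triple multiplier."""
    k_prime = -(-M // 3)
    a2 = gf_mul(a, a)          # assumed mux 1 data input
    a4 = gf_mul(a2, a2)        # assumed mux 2 data input
    f = 1
    for i in range(k_prime):
        digit = (h >> (3 * (k_prime - 1 - i))) & 7
        in0 = a if digit & 1 else 1      # select s0
        in1 = a2 if digit & 2 else 1     # select s1
        in2 = a4 if digit & 4 else 1     # select s2
        if i == 0:
            in3 = 1                      # mux 3 passes 1 in the first iteration
        else:
            in3 = f
            for _ in range(3):           # f^8, a 3-fold cyclic shift in a normal basis
                in3 = gf_mul(in3, in3)
        f = hybrid_triple(in0, in1, in2, in3)
    return f


def exponentiation_cycles(m, d, high_speed=False):
    """Clock cycles k'(L + 1) with parallel preloading, as stated in the text."""
    k = -(-m // d)
    latency = k + 1 if high_speed else k
    return -(-m // 3) * (latency + 1)
```

Because every iteration performs the same four-operand product regardless of the digit value, the cycle count returned by `exponentiation_cycles` does not depend on h.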
One can see that the proposed eight-ary exponentiation architecture does not require any storage of precomputed values, while it has almost the same latency compared to the existing schemes. Also, it is noted that the proposed architecture uses the same latency regardless of the exponent's value. This, in turn, prevents leakage of time/power dissipation information.
5.5 Conclusion
In this chapter, three new architectures for digit-level (DL) single multiplication using GNB have been proposed: two multipliers with fully serial-in-parallel-out (FSIPO) and one with parallel-in-serial-out (PISO). The two DL-FSIPO single GNB multipliers have been proposed for the first time in the literature. They do not require preloading of the inputs and, hence, are advantageous for applications where the parallel loading of inputs is not possible due to the limited size of the data-path.
Using the proposed single digit-level multiplier architectures, a new digit-level serial-in-parallel-out (DL-SIPO) hybrid-double GNB multiplier and, for the first time in the literature, a new digit-level parallel-in-parallel-out (DL-PIPO) hybrid-triple GNB multiplier have been proposed. The proposed digit-level hybrid-double and hybrid-triple multipliers perform two and three field multiplications, respectively, using the same latency as a single digit-level field multiplication.
As an application of the proposed hybrid-triple multiplier, a new digit-level eight-ary field exponentiation architecture has been presented which offers computational latency similar to the existing eight-ary schemes, however, without requiring storage of precomputed values.
Chapter 6
Digit-Level Architectures for $GF(2^m)$ Multiplication in the PB
In this chapter, and to the best of the author's knowledge, two new architectures of $GF(2^m)$ digit-level FSIPO (DL-FSIPO) multipliers for dedicated PBs are proposed for the first time in the literature. The new digit-level serial PB architectures generate the output bits in parallel after k iterations. In the new DL-FSIPO PB multiplication schemes, both inputs enter the multiplier digit-by-digit serially, one digit per clock cycle, starting from the most or least significant digit (MSD or LSD), as the computations are carried out. Therefore, the new MSD and LSD DL-FSIPO PB multiplication structures are expected to be advantageous for resource constrained applications where the data-path of the inputs might have limited capacity, especially when the value of m is large. In addition, by using an additional parallel-in-serial-out register, one can also generate the output bits of the proposed DL-FSIPO PB multipliers serially over 2k clock cycles. The latter scheme is advantageous over the serial-serial schemes in [46, 14] for performing n consecutive multiplications. The serial-serial schemes presented in [46, 14] require 2kn clock cycles to complete n consecutive $GF(2^m)$ digit-level multiplications. The same number of n consecutive digit-level multiplications can be run using only $k(n + 1)$ clock cycles based on the proposed DL-FSIPO multiplication schemes with an additional parallel-in-serial-out output register.
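The two cycle counts quoted above can be compared directly. The following short Python sketch uses hypothetical function names, and the values m = 163, d = 8, n = 10 in the usage note are illustrative assumptions.

```python
def serial_serial_cycles(m, d, n):
    """2kn clock cycles for n consecutive multiplications (serial-serial scheme)."""
    k = -(-m // d)          # k = ceil(m / d)
    return 2 * k * n


def fsipo_serial_out_cycles(m, d, n):
    """k(n + 1) clock cycles for n consecutive multiplications when a
    DL-FSIPO multiplier is followed by a parallel-in-serial-out output register."""
    k = -(-m // d)
    return k * (n + 1)
```

For m = 163 and d = 8 (k = 21), ten consecutive multiplications take 420 cycles in the serial-serial case and 231 cycles in the DL-FSIPO case; for a single multiplication both expressions give 2k.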
It is noted that a preliminary MSB bit-level version of this chapter appears in ARITH 22, the 22nd IEEE Symposium on Computer Arithmetic (June 2015). The rest of the chapter is organized as follows. Sections 6.1 and 6.2, respectively, present the proposed MSD and LSD DL-FSIPO PB multiplication schemes. Section 6.3 presents comparisons between the proposed MSD and LSD DL-FSIPO PB multiplication schemes and the other existing counterparts. Section 6.4 gives some conclusions.
6.1 Proposed MSD DL-FSIPO PB Multiplier
This section presents a new MSD serial multiplier design for dedicated PB which follows an FSIPO input/output scheme. To the best of the author's knowledge, the proposed MSD DL-FSIPO architecture of the serial PB multiplier is presented for the first time in the literature. The proposed MSD DL-FSIPO PB multiplier conducts the multiplication operation as the digits of the two inputs enter the multiplier in digit-by-digit order, one digit per clock cycle (for each input), starting from the most significant digit. Therefore, the proposed MSD DL-FSIPO PB multiplier is advantageous for achieving high throughput in applications where m is large and the parallel preloading of the inputs is not possible due to the limited sizes of the input data-paths. In the following, the required formulations for the MSD DL-FSIPO PB multiplication are first derived. Then, the proposed architecture is shown. The section concludes by studying the space and time complexities.
6.1.1 Formulations
In this section, the required formulations for the proposed MSD DL-FSIPO PB multiplication scheme are derived. First, a recursive digit-level construction of the $GF(2^m)$ elements when represented in the PB is given, by reading the field element digit-by-digit, starting from the most significant digit, as follows.
Definition 6.1.1 Let $\alpha$ be the root of the field's defining irreducible polynomial of $GF(2^m)$. Let us divide the $GF(2^m)$ element, say $A = (a_{m-1}, \ldots, a_0)$ represented in the PB, into $k = \lceil m/d \rceil$ digits of size d each. That is, $A = (A_{k-1}, \ldots, A_{k-1-i}, \ldots, A_0)$, where $A_{k-1-i} = \sum_{j=0}^{d-1} a_{d(k-1-i)+j}\,\alpha^j$ is the $(k - 1 - i)$-th digit and $a_{d(k-1-i)+j} = 0$ for $d(k - 1 - i) + j \ge m$. Then, A can be constructed recursively, starting from the most significant digit $A_{k-1}$, as follows:
$$A^{(i)} = A_{k-1-i} + A^{(i-1)}\alpha^d \tag{6.1}$$
starting at $i = 0$ and obtaining $A = A^{(k-1)}$ at $i = k - 1$, given that $A^{(-1)} = 0$.
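Proof By using (6.1), for $i = 0, 1, \ldots, k-2$ one gets
\begin{align*}
A^{(0)} &= A_{k-1} + A^{(-1)}\alpha^{d} = \sum_{j=0}^{d-1} a_{d(k-1)+j}\,\alpha^{j}, \\
A^{(1)} &= A_{k-2} + A^{(0)}\alpha^{d} = \sum_{j=0}^{d-1} a_{d(k-2)+j}\,\alpha^{j} + \sum_{j=0}^{d-1} a_{d(k-1)+j}\,\alpha^{j+d} = \sum_{j=0}^{2d-1} a_{d(k-2)+j}\,\alpha^{j}, \\
&\;\;\vdots \\
A^{(k-2)} &= A_{1} + A^{(k-3)}\alpha^{d} = \sum_{j=0}^{d-1} a_{d+j}\,\alpha^{j} + \sum_{j=0}^{(k-2)d-1} a_{2d+j}\,\alpha^{j+d} = \sum_{j=0}^{(k-1)d-1} a_{d+j}\,\alpha^{j},
\end{align*}
and hence, for $i = k-1$,
\begin{align*}
A^{(k-1)} &= A_{0} + A^{(k-2)}\alpha^{d} = \sum_{j=0}^{d-1} a_{j}\,\alpha^{j} + \sum_{j=0}^{(k-1)d-1} a_{d+j}\,\alpha^{j+d} = \sum_{j=0}^{kd-1} a_{j}\,\alpha^{j},
\end{align*}
that is, $A^{(k-1)} = \sum_{i=0}^{m-1} a_i \alpha^{i}$ since $a_i = 0$ for $i \ge m$, which completes the proof.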
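The recursion (6.1) can be tried out in software. The following is a minimal sketch, assuming a field element is stored as a Python integer whose bit $i$ holds $a_i$; the helper names `msd_digits` and `rebuild_msd` are illustrative only and do not appear in the thesis.

```python
def msd_digits(a: int, m: int, d: int) -> list:
    """Split an m-bit PB element into k = ceil(m/d) digits, most significant first."""
    k = -(-m // d)                      # ceiling division
    mask = (1 << d) - 1
    return [(a >> (d * (k - 1 - i))) & mask for i in range(k)]


def rebuild_msd(digits: list, d: int) -> int:
    """Apply A(i) = A_{k-1-i} + A(i-1) * alpha^d for i = 0, ..., k-1."""
    acc = 0                             # A(-1) = 0
    for digit in digits:
        acc = (acc << d) ^ digit        # d-bit left shift, then GF(2) addition
    return acc


a = 0b1011001                           # arbitrary element with m = 7
digits = msd_digits(a, m=7, d=3)        # the top digit is zero-extended
print(digits, rebuild_msd(digits, d=3) == a)
```

Because $d$ does not divide $m$ in this example, the most significant digit carries $kd - m = 2$ leading zeros, which is the situation covered by the condition $a_{d(k-1-i)+j} = 0$ for $d(k-1-i)+j \ge m$.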
Notice that the multiplication by $\alpha^{d}$ in (6.1) realizes a $d$-bit left shift and does not require any reduction for $0 \le i < k$. Based on the recursive construction in (6.1), one obtains the multiplication of any two arbitrary $GF(2^m)$ elements $A$ and $B$ as follows.
Proposition 6.1.2 Let $A$ and $B$ be two arbitrary $GF(2^m)$ elements represented in the PB which is generated by the degree-$m$ irreducible polynomial $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ with $\omega$ nonzero terms. Let us define $C_i = A^{(i)}B^{(i)} \bmod p(\alpha)$, where $A^{(i)}$ and $B^{(i)}$ are given in (6.1) and $\alpha$ is the root of $p(x)$. Then, one can compute the multiplication of $A$ and $B$, i.e. $AB \bmod p(\alpha) = C_{k-1}$, based on the following recurrence on $C_i$:
\[ C_i = \sum_{j=0}^{d-1} \Big[ a_{d(k-1-i)+j}\big(B_{k-1-i} + B^{(i-1)}\alpha^{d}\big) + b_{d(k-1-i)+j}\,A^{(i-1)}\alpha^{d} \Big]\alpha^{j} \bmod p(\alpha) + C_{i-1}\alpha^{2d} \bmod p(\alpha), \tag{6.2} \]
$i = 0, \ldots, k-1$, where $C_{-1} = A^{(-1)}B^{(-1)} \bmod p(\alpha) = 0$.
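Proof By using the definition (6.1) for $A^{(i)}$ and $B^{(i)}$ in evaluating $C_i = A^{(i)}B^{(i)} \bmod p(\alpha)$, one obtains
\begin{align*}
C_i &= \big(A_{k-1-i} + A^{(i-1)}\alpha^{d}\big)\big(B_{k-1-i} + B^{(i-1)}\alpha^{d}\big) \bmod p(\alpha) \\
&= A_{k-1-i}\big(B_{k-1-i} + B^{(i-1)}\alpha^{d}\big) + B_{k-1-i}\,A^{(i-1)}\alpha^{d} + C_{i-1}\alpha^{2d} \bmod p(\alpha).
\end{align*}
Now, by substituting $A_{k-1-i} = \sum_{j=0}^{d-1} a_{d(k-1-i)+j}\,\alpha^{j}$ and $B_{k-1-i} = \sum_{j=0}^{d-1} b_{d(k-1-i)+j}\,\alpha^{j}$ in the above formulation, (6.2) is obtained.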
Having $AB \bmod p(\alpha) = A^{(k-1)}B^{(k-1)} \bmod p(\alpha) = C_{k-1}$, then, by iterating for $i = 0, 1, \ldots, k-1$, one obtains the multiplication result after $k$ iterations over (6.1) and (6.2).
Notice that the leftmost $(kd - m)$ bits of the most significant digit of each input are zeros. Hence, the highest-order coordinate in the intermediate elements $A^{(k-2)}$ and $B^{(k-2)}$ is $(k-1)d - 1 - (kd - m) = m - d - 1$. Therefore, the multiplication by $\alpha^{d}$ which appears in the expressions $a_{d(k-1-i)+j}\big(B_{k-1-i} + B^{(i-1)}\alpha^{d}\big)$ and $b_{d(k-1-i)+j}\,A^{(i-1)}\alpha^{d}$ can be accomplished by a simple left shift of $d$ bits without any reduction.
Based on (6.2), the multiplication of the two $GF(2^m)$ elements $A$ and $B$ is reduced recursively to bit-wise AND operations, field additions, left shifts (for the multiplication by $\alpha^{d}$), and multiplications by the fixed field elements $\alpha^{2d}$ and $\alpha^{j}$, $0 < j < d$.
The following is an example illustrating the proposed multiplication scheme.
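Example 6.1.3 Table 6.1 lists the steps for multiplying the two $GF(2^3)$ field elements $A = \alpha = (0, 1, 0)$ and $B = \alpha^{2} = (1, 0, 0)$, represented in the PB $\{\alpha^{2}, \alpha, 1\}$ which is defined by the irreducible trinomial $p(x) = x^{3} + x + 1$. In this example, $d = 1$ (bit-level multiplication).

i | $a_{2-i}$ | $b_{2-i}$ | $A^{(i-1)}$ | $B^{(i-1)}$
0 | $a_2 = 0$ | $b_2 = 1$ | $A^{(-1)} = 0$ | $B^{(-1)} = 0$
1 | $a_1 = 1$ | $b_1 = 0$ | $A^{(0)} = a_2 + A^{(-1)}\alpha = 0$ | $B^{(0)} = b_2 + B^{(-1)}\alpha = 1$
2 | $a_0 = 0$ | $b_0 = 0$ | $A^{(1)} = a_1 + A^{(0)}\alpha = 1$ | $B^{(1)} = b_1 + B^{(0)}\alpha = \alpha$

i | $a_{2-i}\big(b_{2-i} + B^{(i-1)}\alpha\big)$ | $b_{2-i}A^{(i-1)}\alpha$ | $C_{i-1}\alpha^{2} \bmod p(\alpha)$ | $C_i$
0 | $a_2\big(b_2 + B^{(-1)}\alpha\big) = 0$ | $b_2A^{(-1)}\alpha = 0$ | $C_{-1}\alpha^{2} \bmod p(\alpha) = 0$ | $C_0 = 0$
1 | $a_1\big(b_1 + B^{(0)}\alpha\big) = \alpha$ | $b_1A^{(0)}\alpha = 0$ | $C_{0}\alpha^{2} \bmod p(\alpha) = 0$ | $C_1 = \alpha$
2 | $a_0\big(b_0 + B^{(1)}\alpha\big) = 0$ | $b_0A^{(1)}\alpha = 0$ | $C_{1}\alpha^{2} \bmod p(\alpha) = \alpha + 1$ | $C_2 = \alpha + 1 = \alpha^{3}$

Table 6.1: Example 6.1.3 for multiplying the two $GF(2^3)$ elements $A = \alpha = (0, 1, 0)$ and $B = \alpha^{2} = (1, 0, 0)$ using (6.1) and (6.2).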
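The recurrence (6.2) can also be checked with a short software model. The sketch below assumes integers as bit vectors (bit $i$ holds the coefficient of $\alpha^{i}$) and performs the reduction with a generic helper rather than with the dedicated $\alpha^{j}$ and $\alpha^{2d}$ blocks of the hardware; the function names are illustrative and not taken from the thesis.

```python
def gf2_mul(a: int, b: int) -> int:
    """Carry-less (polynomial) product over GF(2), without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def gf2_mod(v: int, p: int, m: int) -> int:
    """Reduce v modulo the degree-m polynomial p."""
    while v.bit_length() > m:
        v ^= p << (v.bit_length() - 1 - m)
    return v


def msd_multiply(a: int, b: int, p: int, m: int, d: int) -> int:
    """Digit-serial product following (6.1) and (6.2), MSD first."""
    k = -(-m // d)
    mask = (1 << d) - 1
    x = y = c = 0                        # <X> = A(-1), <Y> = B(-1), <Z> = C_{-1}
    for i in range(k):
        shift = d * (k - 1 - i)
        a_dig = (a >> shift) & mask      # A_{k-1-i}
        b_dig = (b >> shift) & mask      # B_{k-1-i}
        b_cur = b_dig ^ (y << d)         # B_{k-1-i} + B(i-1) * alpha^d
        acc = c << (2 * d)               # C_{i-1} * alpha^{2d}
        for j in range(d):
            term = 0
            if (a_dig >> j) & 1:
                term ^= b_cur
            if (b_dig >> j) & 1:
                term ^= x << d           # A(i-1) * alpha^d
            acc ^= term << j             # multiplication by alpha^j
        c = gf2_mod(acc, p, m)           # C_i
        x = a_dig ^ (x << d)             # A(i)
        y = b_cur                        # B(i)
    return c


P3 = 0b1011                              # p(x) = x^3 + x + 1
print(bin(msd_multiply(0b010, 0b100, P3, m=3, d=1)))   # alpha * alpha^2
```

For the inputs of Example 6.1.3 the model returns the bit pattern 011, i.e. $\alpha + 1$, in agreement with the last row of Table 6.1.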
The proposed MSD DL-FSIPO PB multiplication scheme in (6.1) and (6.2) can be implemented for an arbitrary irreducible polynomial considering any digit size $d$, $0 < d < m$. However, the following remark gives some conditions for efficient hardware realization, based on Theorems 2.9.4 and 2.9.5.
Remark 6.1.4 Let $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ denote the defining irreducible polynomial with $\omega$ nonzero terms for $GF(2^m)$. Then, by choosing the digit size $d$ of the MSD DL-FSIPO PB multiplier such that
\[ d \le \left\lfloor \frac{m - t_{\omega-2}}{2} \right\rfloor, \tag{6.3} \]
the multiplication of an arbitrary $GF(2^m)$ element by a fixed field element $\alpha^{q}$, where $q \le 2d$, can be accomplished efficiently in a single step using $q(\omega - 2)$ two-input XOR gates with a propagation delay equivalent to $\lceil \log_2(q+1) \rceil$ XOR gate delays.
According to (6.3), efficient hardware realizations of the MSD DL-FSIPO PB multiplication scheme for the five fields recommended by NIST for ECDSA, $GF(2^{163})$, $GF(2^{233})$, $GF(2^{283})$, $GF(2^{409})$, and $GF(2^{571})$, respectively, offer maximum digit sizes of 78, 79, 135, 161, and 280.
6.1.2 Architecture
This section presents the proposed architecture of the MSD DL-FSIPO PB multiplier, as shown in Figure 6.1a, where $A, B \in GF(2^m)$ represent the inputs to the multiplier, $k = \lceil m/d \rceil$, and $d$ is the digit size. Figure 6.1b shows the detailed architecture of the $\triangle_j$ module at the $i$-th iteration, $0 \le j < d$ and $0 \le i < k$. Figure 6.1c shows the architecture of the $\otimes$ module.
Figure 6.1: (a) Architecture of the proposed MSD DL-FSIPO PB multiplier. (b) Detailed architecture of $\triangle_j$. (c) Architecture of the $\otimes$ module.
The architecture in Figure 6.1a is designed based on formulations (6.1) and (6.2). In this design, $\langle X \rangle$ and $\langle Y \rangle$ are left shift registers, which respectively store the bits of $A^{(i-1)}$ and $B^{(i-1)}$ (see (6.1)) during the $i$-th iteration, for $0 \le i < k$. The $(m-d)$-th to $(m-1)$-th coordinates are omitted from $\langle X \rangle$ and $\langle Y \rangle$, since these correspond to zeros in all the intermediate elements $A^{(0)}$ to $A^{(k-2)}$ and $B^{(0)}$ to $B^{(k-2)}$ according to (6.1). It is therefore sufficient to have only $(m-d)$ bits in each of $\langle X \rangle$ and $\langle Y \rangle$. Moreover, one obtains $A^{(i-1)}\alpha^{d}$ and $B^{(i-1)}\alpha^{d}$ by a simple left shift of $d$ bits. During the $i$-th iteration, the vertical thick line in Figure 6.1a represents a $2m$-bit bus which contains the bits of $A_{k-1-i}$, $B_{k-1-i}$, $A^{(i-1)}\alpha^{d}$, and $B^{(i-1)}\alpha^{d}$. In Figure 6.1a, $\alpha^{2d}$ represents the multiplication of the contents of accumulator $\langle Z \rangle$ by the fixed field element $\alpha^{2d}$. The vertical thick line in Figure 6.1b represents the concatenation of the lower $d$ bits of $a_{d(k-1-i)+j}\big(B_{k-1-i} + B^{(i-1)}\alpha^{d}\big)$ with the $(m-d)$-bit result of XORing the higher bits of $a_{d(k-1-i)+j}\big(B_{k-1-i} + B^{(i-1)}\alpha^{d}\big)$ with $b_{d(k-1-i)+j}\,A^{(i-1)}\alpha^{d}$. This is done in order to compute the expression $a_{d(k-1-i)+j}\big(B_{k-1-i} + B^{(i-1)}\alpha^{d}\big) + b_{d(k-1-i)+j}\,A^{(i-1)}\alpha^{d}$ in (6.2) (the multiplication by $\alpha^{d}$ is accomplished through the $d$-bit left shift). In the same figure, the block denoted by $\alpha^{j}$ represents the multiplication by the fixed field element $\alpha^{j}$, $0 \le j < d$. Hence, by adding the outputs of the field multiplications by all $\alpha^{j}$ and $\alpha^{2d}$, one obtains $C_i = A^{(i)}B^{(i)} \bmod p(\alpha)$ in accumulator $\langle Z \rangle$ after the $i$-th clock trigger, according to (6.2). Therefore, by initializing the three registers $\langle X \rangle$, $\langle Y \rangle$, and $\langle Z \rangle$ of Figure 6.1a with zeros, the result $C_{k-1} = AB \bmod p(\alpha) = A^{(k-1)}B^{(k-1)} \bmod p(\alpha)$ is generated in accumulator $\langle Z \rangle$ after $k$ iterations.
As a graphical illustration of Example 6.1.3, Figure 6.2 presents the state of the corresponding $GF(2^3)$ MSD DL-FSIPO PB multiplier during the different iterations of the computation (for multiplying the two field elements $A = \alpha = (0, 1, 0)$ and $B = \alpha^{2} = (1, 0, 0)$ when $d = 1$), based on the architecture which has been introduced in this section. Figure 6.2a shows the initial state ($i = 0$). Figure 6.2b shows the state after the first clock cycle ($i = 1$). Figure 6.2c shows the state after the second clock cycle ($i = 2$). Figure 6.2d shows the state after the third clock cycle, where the result $\alpha^{3} = \alpha + 1$ is stored in the output register which is surrounded by the dotted rectangle. In this figure, the underlined leftmost bits of $A^{(i-1)}$ and $B^{(i-1)}$ are always zero; they represent the missing (not required) leftmost FFs in registers $\langle X \rangle$ and $\langle Y \rangle$.
In the following, the space and time complexities of the proposed MSD DL-FSIPO PB multiplier will be studied.
6.1.3 Space and Time Complexities
This section gives the space and time complexities of the proposed MSD DL-FSIPO PB multiplier. Following the design guidelines of Remark 6.1.4, the space complexity of the proposed MSD DL-FSIPO PB multiplier in Figure 6.1a is given by the following proposition.
Proposition 6.1.5 The total number of gates in the proposed MSD DL-FSIPO PB multiplier of Figure 6.1a is as follows:
\[ \begin{cases} \#\text{ANDs} = d(2m - d), \\ \#\text{XORs} = d\left[2m + \dfrac{(d+3)(\omega-2)}{2} - d\right], \\ \#\text{FFs} = 3m - 2d. \end{cases} \tag{6.4} \]
Figure 6.2: The state of the corresponding $GF(2^3)$ MSD DL-FSIPO PB multiplier for Example 6.1.3, throughout the different iterations of the computation. (a) Initial state, $i = 0$. (b) State after the first clock cycle, $i = 1$. (c) State after the second clock cycle, $i = 2$. (d) State after the third clock cycle.
Proof The total number of two-input AND gates required for the hardware realization of the proposed architecture in Figure 6.1a equals $d(2m - d)$, where each $\triangle_j$ block, $0 \le j < d$, contributes $2m - d$ two-input AND gates. Similarly, from the same figure, one finds the total number of FFs to be $(m-d) + (m-d) + m = 3m - 2d$. The total number of two-input XOR gates consists of the XOR gates in the field addition of $d + 1$ elements, the XOR gates in all the $\triangle_j$ modules, $0 \le j < d$, and the XOR gates which form the multiplication by the constant $\alpha^{2d}$. Notice that, for $j = 0$, the multiplication by $\alpha^{j} = \alpha^{0} = 1$ in the $\triangle_0$ module is free. Therefore, the total number of two-input XOR gates is
\[ dm + d(m-d) + \sum_{j=1}^{d-1} j(\omega-2) + 2d(\omega-2) = d\left[2m + \frac{(d+3)(\omega-2)}{2} - d\right]. \]
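As a quick numerical check of (6.4), the following sketch evaluates the XOR count both as the itemized sum used in the proof and as the closed form, and asserts that they agree. The function name and the sample parameters ($m = 163$, $d = 4$, $\omega = 5$) are chosen here for illustration only.

```python
def msd_gate_counts(m: int, d: int, omega: int) -> dict:
    """Gate counts of (6.4); the XOR count is cross-checked against the proof's sum."""
    ands = d * (2 * m - d)
    ffs = 3 * m - 2 * d
    xors_itemized = (
        d * m                                          # (d+1)-input field adder
        + d * (m - d)                                  # XORs inside the triangle_j modules
        + sum(j * (omega - 2) for j in range(1, d))    # multiplications by alpha^j
        + 2 * d * (omega - 2)                          # multiplication by alpha^{2d}
    )
    # d(d+3) is always even, so the integer division below is exact.
    xors_closed = d * (2 * m - d) + d * (d + 3) * (omega - 2) // 2
    assert xors_itemized == xors_closed
    return {"AND": ands, "XOR": xors_closed, "FF": ffs}


print(msd_gate_counts(m=163, d=4, omega=5))
```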
The time complexity of the proposed MSD DL-FSIPO PB multiplier is derived in terms of the propagation delay through the corresponding levels of two-input AND and two-input XOR gates along the multiplier's longest path, as follows.
Proposition 6.1.6 The maximum propagation delay (PD) through the proposed MSD DL-FSIPO PB multiplier of Figure 6.1a is:
\[ PD = \max\Big\{ T_A + \big(1 + \lceil \log_2(d) \rceil + \lceil \log_2(d+1) \rceil\big) T_X,\ \big(\lceil \log_2(2d+1) \rceil + \lceil \log_2(d+1) \rceil\big) T_X \Big\}, \tag{6.5} \]
where $T_A$ denotes the propagation delay of a single two-input AND gate.
Proof As one can see from Figure 6.1a, there are two main paths in the proposed design of the MSD DL-FSIPO PB multiplier. The first path is between the shift registers $\langle X \rangle$ and $\langle Y \rangle$, on one side, and the accumulator $\langle Z \rangle$, on the other side. This path has a propagation delay of $T_A + \big(1 + \lceil \log_2(d) \rceil + \lceil \log_2(d+1) \rceil\big) T_X$, where the propagation delays contributed by a $\triangle_j$ block, $1 \le j < d$, and by the $(d+1)$-input field adder are $\lceil \log_2(j+1) \rceil T_X$ and $\lceil \log_2(d+1) \rceil T_X$, respectively. The second path lies between the output and input of the accumulator $\langle Z \rangle$, and passes through the $(d+1)$-input field adder and the module $\alpha^{2d}$. This path has a propagation delay equal to $\big(\lceil \log_2(2d+1) \rceil + \lceil \log_2(d+1) \rceil\big) T_X$, where $\lceil \log_2(2d+1) \rceil T_X$ is the propagation delay contributed by the $\alpha^{2d}$ module. Therefore, the propagation delay of the proposed MSD DL-FSIPO PB multiplier takes the value of the maximum propagation delay between these two paths.
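To see which of the two paths dominates for a given digit size, (6.5) can be evaluated directly. The sketch below is only an illustration; the gate delays `t_and` and `t_xor` are placeholder inputs supplied by the caller, not values taken from the thesis.

```python
def ceil_log2(v: int) -> int:
    """ceil(log2(v)) for an integer v >= 1."""
    return (v - 1).bit_length()


def msd_propagation_delay(d: int, t_and: float, t_xor: float) -> float:
    """Evaluate (6.5): the slower of the two register-to-register paths."""
    # Path 1: <X>, <Y> -> AND -> XOR -> alpha^j -> (d+1)-input adder -> <Z>
    shift_to_acc = t_and + (1 + ceil_log2(d) + ceil_log2(d + 1)) * t_xor
    # Path 2: <Z> -> alpha^{2d} -> (d+1)-input adder -> <Z>
    acc_feedback = (ceil_log2(2 * d + 1) + ceil_log2(d + 1)) * t_xor
    return max(shift_to_acc, acc_feedback)


for d in (1, 3, 4, 8):
    print(d, msd_propagation_delay(d, t_and=1.0, t_xor=1.0))
```

With unit delays the two paths are frequently balanced; for example, both evaluate to 7 gate delays at $d = 4$, whereas the first path is longer at $d = 3$.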
In the following, the proposed LSD version of the DL-FSIPO PB multiplier is presented.
6.2 Proposed LSD DL-FSIPO PB Multiplier
In this section, the proposed LSD variant of our DL-FSIPO PB multiplier is presented. To the best of the author's knowledge, the proposed LSD DL-FSIPO multiplier is the first such architecture presented for dedicated PB in the literature. The proposed LSD DL-FSIPO PB multiplier reads its two inputs digit by digit, one digit per clock cycle for each input, while the computations are being performed, starting from the least significant digit. This, in turn, removes the requirement of preloading the inputs before the computation starts. Parallel loading of the inputs might not be possible in resource-constrained applications where the $GF(2^m)$ dimension $m$ is large and the capacity of the input data-paths is limited. Hence, the proposed LSD DL-FSIPO PB multiplier has the potential of achieving high output throughput in such applications. The following starts by deriving the required formulations for the LSD DL-FSIPO PB multiplication scheme. This is followed by constructing the corresponding architecture. At the end of this section, the space and time complexities are studied.
6.2.1 Formulations
This section gives the necessary formulations for constructing the proposed scheme of LSD DL-FSIPO PB multiplication. The following introduces the recursive, least-significant-digit-first, digit-level construction of the $GF(2^m)$ elements, based on the PB representation.
Definition 6.2.1 Let $\alpha$ be the root of the defining irreducible polynomial of $GF(2^m)$. Let $A = \sum_{i=0}^{m-1} a_i \alpha^{i} \in GF(2^m)$ be an arbitrary field element represented in the PB. Divide $A$ into $k = \lceil m/d \rceil$ digits of size $d$ each. That is, $A = (A_{k-1}, \ldots, A_i, \ldots, A_0)$, where $A_i = \sum_{j=0}^{d-1} a_{di+j-r}\,\alpha^{j}$ is the $i$-th digit of $A$ such that $a_{di+j-r} = 0$ for $di + j - r < 0$ ($r = kd - m$ represents the number of right padded zeros). Then, one constructs $A$ recursively, starting from its least significant digit, as follows:
\[ A^{(i)} = A_i\,\alpha^{m-d} + A^{(i-1)}\alpha^{-d}, \tag{6.6} \]
for $i = 0, \ldots, k-1$, given that $A^{(-1)} = 0$.
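Proof Substituting for $i = 0, 1, \ldots, k-2$ in (6.6), one gets
\begin{align*}
A^{(0)} &= A_0\,\alpha^{m-d} + A^{(-1)}\alpha^{-d} = \sum_{j=0}^{d-1} a_{j-r}\,\alpha^{m-d+j}, \\
A^{(1)} &= A_1\,\alpha^{m-d} + A^{(0)}\alpha^{-d} = \sum_{j=0}^{d-1} a_{d+j-r}\,\alpha^{m-d+j} + \sum_{j=0}^{d-1} a_{j-r}\,\alpha^{m-2d+j} = \sum_{j=0}^{2d-1} a_{j-r}\,\alpha^{m-2d+j}, \\
&\;\;\vdots \\
A^{(k-2)} &= A_{k-2}\,\alpha^{m-d} + A^{(k-3)}\alpha^{-d} = \sum_{j=0}^{d-1} a_{(k-2)d+j-r}\,\alpha^{m-d+j} + \sum_{j=0}^{(k-2)d-1} a_{j-r}\,\alpha^{m-(k-1)d+j} = \sum_{j=0}^{(k-1)d-1} a_{j-r}\,\alpha^{m-(k-1)d+j},
\end{align*}
and hence, for $i = k-1$ one has
\begin{align*}
A^{(k-1)} &= A_{k-1}\,\alpha^{m-d} + A^{(k-2)}\alpha^{-d} = \sum_{j=0}^{d-1} a_{(k-1)d+j-r}\,\alpha^{m-d+j} + \sum_{j=0}^{(k-1)d-1} a_{j-r}\,\alpha^{m-kd+j} \\
&= \sum_{j=0}^{kd-1} a_{j-r}\,\alpha^{m-kd+j} = \sum_{j=-r}^{kd-r-1} a_{j}\,\alpha^{m+r-kd+j},
\end{align*}
and by noticing that $kd - m = r$, then
\[ A^{(k-1)} = \sum_{j=-r}^{m-1} a_j \alpha^{j} = \sum_{j=0}^{m-1} a_j \alpha^{j}, \]
since $a_j = 0$ for $j < 0$, which completes the proof.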
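The LSD construction (6.6) can be mirrored in software in the same way as the MSD one. The sketch below assumes integers as bit vectors and models the $r$ right padded zeros by shifting the input left by $r$ bits before it is cut into digits; `lsd_digits` and `rebuild_lsd` are illustrative names.

```python
def lsd_digits(a: int, m: int, d: int) -> list:
    """Digits A_0, ..., A_{k-1} of the element padded with r = kd - m zeros on the right."""
    k = -(-m // d)
    r = k * d - m
    padded = a << r
    mask = (1 << d) - 1
    return [(padded >> (d * i)) & mask for i in range(k)]


def rebuild_lsd(digits: list, m: int, d: int) -> int:
    """Apply A(i) = A_i * alpha^{m-d} + A(i-1) * alpha^{-d} for i = 0, ..., k-1."""
    acc = 0                                   # A(-1) = 0
    for digit in digits:
        acc = (digit << (m - d)) ^ (acc >> d) # new digit on top, d-bit right shift below
    return acc


a = 0b1011001                                 # arbitrary element with m = 7
digits = lsd_digits(a, m=7, d=3)              # r = 2 padded zeros
print(digits, rebuild_lsd(digits, m=7, d=3) == a)
```

The right shift never discards a nonzero bit here, because the bits that leave the register are exactly the padded zeros or positions that have not yet been filled.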
Note that the multiplication of $A^{(i-1)}$ by $\alpha^{-d}$ in (6.6) is realized as a $d$-bit right shift (no reduction is required for $0 \le i < k$). The following proposition utilizes (6.6) in conducting the multiplication of two arbitrary $GF(2^m)$ elements.
Proposition 6.2.2 Let $C_i = A^{(i)}B^{(i)} \bmod p(\alpha)$, where $A$ and $B$ are two arbitrary $GF(2^m)$ elements, $A^{(i)}$ and $B^{(i)}$ are given in (6.6), and $\alpha$ is the root of the field irreducible polynomial $p(x)$. Then, based on (6.6), one computes $AB \bmod p(\alpha) = A^{(k-1)}B^{(k-1)} \bmod p(\alpha) = C_{k-1}$ according to the following recurrence on $C_i$:
\[ C_i = \left( \sum_{j=0}^{d-1} \Big[ a_{di+j-r}\big(B_i\,\alpha^{m-d} + B^{(i-1)}\alpha^{-d}\big) + b_{di+j-r}\,A^{(i-1)}\alpha^{-d} \Big]\alpha^{j-(d-1)} \bmod p(\alpha) \right)\alpha^{m-1} \bmod p(\alpha) + C_{i-1}\alpha^{-2d} \bmod p(\alpha), \tag{6.7} \]
for $i = 0, \ldots, k-1$, given that $C_{-1} = A^{(-1)}B^{(-1)} \bmod p(\alpha) = 0$, where $d$ is the digit size, $k = \lceil m/d \rceil$ is the number of iterations, and $B_i = \sum_{j=0}^{d-1} b_{di+j-r}\,\alpha^{j}$ is the $i$-th digit of $B$ such that $b_{di+j-r} = 0$ for $di + j - r < 0$ ($r = kd - m$ represents the number of right padded zeros).
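Proof By using definition (6.6) for $A^{(i)}$ and $B^{(i)}$ in evaluating $C_i = A^{(i)}B^{(i)} \bmod p(\alpha)$, one has
\begin{align*}
C_i &= \big(A_i\,\alpha^{m-d} + A^{(i-1)}\alpha^{-d}\big)\big(B_i\,\alpha^{m-d} + B^{(i-1)}\alpha^{-d}\big) \bmod p(\alpha) \\
&= A_i\,\alpha^{m-d}\big(B_i\,\alpha^{m-d} + B^{(i-1)}\alpha^{-d}\big) \bmod p(\alpha) + B_i\,\alpha^{m-d}A^{(i-1)}\alpha^{-d} \bmod p(\alpha) + A^{(i-1)}B^{(i-1)}\alpha^{-2d} \bmod p(\alpha) \\
&= \sum_{j=0}^{d-1} \Big[ a_{di+j-r}\big(B_i\,\alpha^{m-d} + B^{(i-1)}\alpha^{-d}\big) + b_{di+j-r}\,A^{(i-1)}\alpha^{-d} \Big]\alpha^{m-d+j} \bmod p(\alpha) + C_{i-1}\alpha^{-2d} \bmod p(\alpha),
\end{align*}
where the last result is obtained by substituting $A_i = \sum_{j=0}^{d-1} a_{di+j-r}\,\alpha^{j}$ in $A_i\,\alpha^{m-d}\big(B_i\,\alpha^{m-d} + B^{(i-1)}\alpha^{-d}\big) \bmod p(\alpha)$ and $B_i = \sum_{j=0}^{d-1} b_{di+j-r}\,\alpha^{j}$ in $B_i\,\alpha^{m-d}A^{(i-1)}\alpha^{-d} \bmod p(\alpha)$, followed by taking $\alpha^{m-d+j}$ as a common factor. Then, by noticing that $\alpha^{m-d}\alpha^{j} = \alpha^{m-1}\alpha^{j-d+1}$, the proof is complete.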
Notice that the rightmost $r = (kd - m)$ bits in the least significant input digits $A_0$ and $B_0$ are zeros. Accordingly, the lowest coordinate in either $A^{(k-2)}$ or $B^{(k-2)}$ has an order of $d$. Therefore, it is sufficient to accomplish the multiplication by $\alpha^{-d}$ in the expressions $B^{(i-1)}\alpha^{-d}$ and $A^{(i-1)}\alpha^{-d}$ of (6.7) by simple $d$-bit right shifts without any reductions. Now, since $A = A^{(k-1)}$ and $B = B^{(k-1)}$, then, by iterating on (6.7) for $i = 0, 1, \ldots, k-1$, one obtains $AB \bmod p(\alpha) = A^{(k-1)}B^{(k-1)} \bmod p(\alpha) = C_{k-1}$.
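The LSD recurrence can be exercised with a small software model as well. The sketch below is an assumption-laden illustration: integers serve as bit vectors, the factor $\alpha^{m-d+j}$ from the proof of Proposition 6.2.2 is applied directly (it equals $\alpha^{j-(d-1)}\alpha^{m-1}$ in (6.7)), and the multiplication by $\alpha^{-2d}$ is done by repeated division by $\alpha$ instead of a single-step constant multiplier. All function names are hypothetical.

```python
def gf2_mul(a: int, b: int) -> int:
    """Carry-less product over GF(2), used only as a reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def gf2_mod(v: int, p: int, m: int) -> int:
    """Reduce v modulo the degree-m polynomial p."""
    while v.bit_length() > m:
        v ^= p << (v.bit_length() - 1 - m)
    return v


def div_alpha(v: int, q: int, p: int) -> int:
    """Multiply a reduced element v by alpha^{-q} modulo p (p has constant term 1)."""
    for _ in range(q):
        if v & 1:
            v ^= p          # clears bit 0 without changing the residue class
        v >>= 1
    return v


def lsd_multiply(a: int, b: int, p: int, m: int, d: int) -> int:
    """Digit-serial product following (6.6) and (6.7), LSD first."""
    k = -(-m // d)
    r = k * d - m
    mask = (1 << d) - 1
    a_pad, b_pad = a << r, b << r             # r zeros padded on the right
    x = y = c = 0                             # <X> = A(-1), <Y> = B(-1), <Z> = C_{-1}
    for i in range(k):
        a_dig = (a_pad >> (d * i)) & mask     # A_i
        b_dig = (b_pad >> (d * i)) & mask     # B_i
        b_cur = (b_dig << (m - d)) ^ (y >> d) # B(i)
        acc = 0
        for j in range(d):
            term = 0
            if (a_dig >> j) & 1:
                term ^= b_cur
            if (b_dig >> j) & 1:
                term ^= x >> d                # A(i-1) * alpha^{-d}
            acc ^= term << (m - d + j)        # multiplication by alpha^{m-d+j}
        c = gf2_mod(acc, p, m) ^ div_alpha(c, 2 * d, p)   # C_i
        x = (a_dig << (m - d)) ^ (x >> d)     # A(i)
        y = b_cur
    return c


P3 = 0b1011                                   # p(x) = x^3 + x + 1
print(bin(lsd_multiply(0b010, 0b100, P3, m=3, d=1)))
```

For $A = \alpha$ and $B = \alpha^{2}$ over $p(x) = x^{3} + x + 1$ the model returns 011, i.e. $\alpha + 1 = \alpha^{3}$, the same product as obtained with the MSD scheme.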
Based on (6.7), the multiplication of $A$ and $B$ is reduced, recursively, to bit-wise AND operations, field additions, right shifts (for the multiplication by $\alpha^{-d}$), in addition to the multiplications by the constant elements $\alpha^{m-1}$ and $\alpha^{-q}$, where $q$ is a positive integer such that $q \le 2d$. The following is an example illustrating the proposed multiplication scheme in (6.7).
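Example 6.2.3 Table 6.2 lists the steps (according to formulations (6.6) and (6.7)) for multiplying the two $GF(2^3)$ field elements $A = \alpha = (0, 1, 0)$ and $B = \alpha^{2} = (1, 0, 0)$, represented in the PB $\{\alpha^{2}, \alpha, 1\}$ which is defined by the irreducible trinomial $p(x) = x^{3} + x + 1$. In this example, $d = 1$ (bit-level multiplication), and hence, $k = \lceil 3/1 \rceil = 3$ and $r = 3 \times 1 - 3 = 0$.

i | $a_i$ | $b_i$ | $A^{(i-1)}$ | $B^{(i-1)}$
0 | $a_0 = 0$ | $b_0 = 0$ | $A^{(-1)} = 0$ | $B^{(-1)} = 0$
1 | $a_1 = 1$ | $b_1 = 0$ | $A^{(0)} = a_0\alpha^{2} + A^{(-1)}\alpha^{-1} = 0$ | $B^{(0)} = b_0\alpha^{2} + B^{(-1)}\alpha^{-1} = 0$
2 | $a_2 = 0$ | $b_2 = 1$ | $A^{(1)} = a_1\alpha^{2} + A^{(0)}\alpha^{-1} = \alpha^{2}$ | $B^{(1)} = b_1\alpha^{2} + B^{(0)}\alpha^{-1} = 0$

i | $X_i = a_i\big(b_i\alpha^{2} + B^{(i-1)}\alpha^{-1}\big)$ | $Y_i = b_iA^{(i-1)}\alpha^{-1}$ | $Z_i = C_{i-1}\alpha^{-2} \bmod p(\alpha)$ | $C_i = (X_i + Y_i)\alpha^{2} \bmod p(\alpha) + Z_i$
0 | $a_0\big(b_0\alpha^{2} + B^{(-1)}\alpha^{-1}\big) = 0$ | $b_0A^{(-1)}\alpha^{-1} = 0$ | $C_{-1}\alpha^{-2} \bmod p(\alpha) = 0$ | $C_0 = 0$
1 | $a_1\big(b_1\alpha^{2} + B^{(0)}\alpha^{-1}\big) = 0$ | $b_1A^{(0)}\alpha^{-1} = 0$ | $C_{0}\alpha^{-2} \bmod p(\alpha) = 0$ | $C_1 = 0$
2 | $a_2\big(b_2\alpha^{2} + B^{(1)}\alpha^{-1}\big) = 0$ | $b_2A^{(1)}\alpha^{-1} = \alpha$ | $C_{1}\alpha^{-2} \bmod p(\alpha) = 0$ | $C_2 = \alpha^{3} \bmod p(\alpha) = \alpha + 1$

Table 6.2: Example 6.2.3 for multiplying the two $GF(2^3)$ elements $A = \alpha = (0, 1, 0)$ and $B = \alpha^{2} = (1, 0, 0)$ using (6.6) and (6.7).

The following presents the formulations needed for realizing the operations of multiplying an arbitrary field element by the constants $\alpha^{m-1}$ and $\alpha^{-q}$, where $q \le 2d$ for some positive integer $d$.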
First, the multiplication by the constant $\alpha^{m-1}$ is considered. Let $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ be the generating irreducible polynomial of $GF(2^m)$, where $\alpha$ is its root. According to Theorem 2.9.5, the multiplication of an arbitrary $GF(2^m)$ element $A$ by the constant element $\alpha^{m-1}$ can be accomplished efficiently in one step if $m - 1 \le m - t_{\omega-2}$. This means that $t_{\omega-2} = 1$, and hence, $p(x)$ is an irreducible trinomial of the form $x^{m} + x + 1$. Since this form of $p(x)$ is not common, the following general formulation for multiplying an arbitrary field element by the constant element $\alpha^{m-1}$ is considered.
Proposition 6.2.4 Let the elements of $GF(2^m)$ be represented in the PB which is defined by the irreducible $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$. Let $\alpha$ be a root of $p(x)$. Denote by $[\uparrow i]$ and $[\downarrow i]$ the operations of up and down $i$-bit shifts, as defined by Definition 2.9.2. Let the $m$-bit vertical vector $\big[a_0^{\alpha^{m-1}} \ldots a_{m-1}^{\alpha^{m-1}}\big]^{T}$ ($T$ is the vector transposition) represent the coordinates of the result of multiplying an arbitrary $GF(2^m)$ element $A = \sum_{i=0}^{m-1} a_i \alpha^{i}$ by $\alpha^{m-1}$, that is $A\alpha^{m-1} \bmod p(\alpha) = \sum_{i=0}^{m-1} a_i^{\alpha^{m-1}} \alpha^{i}$. Then
\[ \begin{bmatrix} a_0^{\alpha^{m-1}} \\ \vdots \\ a_{m-2}^{\alpha^{m-1}} \\ a_{m-1}^{\alpha^{m-1}} \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ a_0 \end{bmatrix} + \sum_{j=0}^{\omega-2} \left( \sum_{i=0}^{n-1} \begin{bmatrix} a_1 \\ \vdots \\ a_{m-1} \\ 0 \end{bmatrix} [\uparrow l_i] \right) [\downarrow t_j]. \tag{6.8} \]
Here, $t_0 = 0$, $n$ is the number of nonzero entries in column zero of the $(m-1) \times m$ binary reduction matrix $Q$ [72], and $l_i$ denotes the row location of the $i$-th nonzero entry in this column.
Proof From Section 2.9.1.1, by setting $B$ in (2.2) and (2.3) to $B = \alpha^{m-1} = (1, \underbrace{0, \ldots, 0}_{m-1})$, one obtains (6.8).
Next, the multiplication of an arbitrary field element $A$ represented in the PB by the constant element $\alpha^{-q}$, i.e. $A\alpha^{-q} \bmod p(\alpha)$, where $q$ is a positive integer, is considered. The following are some conditions for the efficient hardware realization of this operation.
Proposition 6.2.5 Assume $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ is the field irreducible polynomial which defines $GF(2^m)$. Let $\alpha$ denote the root of $p(x)$. Then, for a positive integer $q \le t_1$, the coordinates of $\alpha^{-q}$ are obtained in a single step, as follows:
\[ \alpha^{-q} \bmod p(\alpha) = \left( \alpha^{m} + \sum_{i=1}^{\omega-2} \alpha^{t_i} \right)\alpha^{-q}. \tag{6.9} \]
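Proof Since $p(\alpha) = 0$, then $\alpha^{m} + \sum_{i=1}^{\omega-2} \alpha^{t_i} = 1$, and by multiplying both sides by $\alpha^{-q}$ one gets
\[ \alpha^{-q} \bmod p(\alpha) = \alpha^{m-q} + \sum_{i=1}^{\omega-2} \alpha^{t_i - q}, \]
in which $t_i - q \ge 0$ for all $1 \le i \le \omega - 2$ if $q \le t_1$. Then, the assertion is true.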
Proposition 6.2.6 Assume $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ is the field irreducible polynomial which defines $GF(2^m)$. Denote by $\alpha$ the root of $p(x)$. Let $A = (a_{m-1}, \ldots, a_0)$ be an arbitrary $GF(2^m)$ element represented in the PB. Then, for a positive integer $q \le t_1$, the coordinates of $A\alpha^{-q} \bmod p(\alpha) = \sum_{i=0}^{m-1} a_i \alpha^{i-q} \bmod p(\alpha)$ are obtained in a single step, as follows:
\[ A\alpha^{-q} \bmod p(\alpha) = \sum_{i=q}^{m-1} a_i \alpha^{i-q} + \sum_{i=0}^{q-1} a_i \left( \alpha^{m} + \sum_{j=1}^{\omega-2} \alpha^{t_j} \right)\alpha^{i-q}. \tag{6.10} \]
Proof Note that $A\alpha^{-q} \bmod p(\alpha) = \sum_{i=q}^{m-1} a_i \alpha^{i-q} + \sum_{i=0}^{q-1} a_i \alpha^{i-q} \bmod p(\alpha)$. Since it is given that $q \le t_1$, one can compute $\alpha^{-1}$ through $\alpha^{-q}$ using (6.9). This completes the proof.
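The single-step formula (6.10) translates into a handful of shifts and XORs. The sketch below assumes integers as bit vectors and passes the middle exponents $t_1, \ldots, t_{\omega-2}$ as a tuple; it compares the result with a reference that divides by $\alpha$ one position at a time. The trinomial $x^{7} + x^{3} + 1$ is used only as a small test case, and the function names are illustrative.

```python
def mul_alpha_neg_q(a: int, q: int, m: int, middle: tuple) -> int:
    """A * alpha^{-q} mod p(alpha) via (6.10); requires 1 <= q <= t_1 = min(middle)."""
    assert 1 <= q <= min(middle)
    low = a & ((1 << q) - 1)            # coordinates a_0, ..., a_{q-1}
    out = (a >> q) ^ (low << (m - q))   # a_i alpha^{i-q} (i >= q) and a_i alpha^{m+i-q} (i < q)
    for t in middle:
        out ^= low << (t - q)           # a_i alpha^{t_j + i - q} for i < q
    return out


def div_alpha(v: int, q: int, p: int) -> int:
    """Reference: q successive divisions by alpha modulo p."""
    for _ in range(q):
        if v & 1:
            v ^= p
        v >>= 1
    return v


M, MIDDLE = 7, (3,)                      # p(x) = x^7 + x^3 + 1, so t_1 = 3
P = (1 << M) | (1 << 3) | 1
print(all(
    mul_alpha_neg_q(a, q, M, MIDDLE) == div_alpha(a, q, P)
    for a in range(1 << M) for q in (1, 2, 3)
))
```

No reduction step appears in `mul_alpha_neg_q`: under the condition $q \le t_1$ every exponent produced by (6.10) already lies between $0$ and $m - 1$.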
The following is a remark about the selection of the digit size for efficient hardware implementation of the proposed LSD DL-FSIPO PB multiplier.
Remark 6.2.7 Let $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ denote the defining irreducible polynomial with $\omega$ nonzero terms for $GF(2^m)$. Then, by choosing the digit size $d$ of the LSD DL-FSIPO PB multiplier such that
\[ d \le \left\lfloor \frac{t_1}{2} \right\rfloor, \tag{6.11} \]
the multiplication of an arbitrary $GF(2^m)$ element by the fixed field element $\alpha^{-q}$, where $q$ is a positive integer satisfying $q \le 2d$, can be accomplished in a single step.
According to (6.11), efficient hardware realizations of the LSD DL-FSIPO PB multiplication scheme for the five fields recommended by NIST for ECDSA, $GF(2^{163})$, $GF(2^{233})$, $GF(2^{283})$, $GF(2^{409})$, and $GF(2^{571})$, respectively, offer maximum digit sizes of 1, 37, 2, 43, and 1. It is evident that the MSD version of the DL-FSIPO PB multiplier provides larger flexibility in the selection of digit sizes for the ECDSA recommended fields.
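The two digit-size bounds can be tabulated side by side. The sketch below assumes the NIST reduction polynomials for the five binary fields ($x^{163} + x^{7} + x^{6} + x^{3} + 1$, $x^{233} + x^{74} + 1$, $x^{283} + x^{12} + x^{7} + x^{5} + 1$, $x^{409} + x^{87} + 1$, and $x^{571} + x^{10} + x^{5} + x^{2} + 1$); only their middle exponents are needed, and the dictionary and function names are illustrative.

```python
# Middle exponents t_{omega-2} > ... > t_1 of the NIST reduction polynomials.
NIST_MIDDLE_EXPONENTS = {
    163: (7, 6, 3),
    233: (74,),
    283: (12, 7, 5),
    409: (87,),
    571: (10, 5, 2),
}


def max_digit_sizes(m: int, middle: tuple) -> tuple:
    """Largest d allowed by (6.3) for the MSD scheme and by (6.11) for the LSD scheme."""
    msd = (m - max(middle)) // 2    # floor((m - t_{omega-2}) / 2)
    lsd = min(middle) // 2          # floor(t_1 / 2)
    return msd, lsd


for m, middle in NIST_MIDDLE_EXPONENTS.items():
    print(m, max_digit_sizes(m, middle))
```

The resulting pairs reproduce the maximum digit sizes quoted in the text for the MSD scheme (78, 79, 135, 161, 280) and for the LSD scheme (1, 37, 2, 43, 1).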
In the following section, the architecture of the proposed LSD DL-FSIPO PB multiplier is
presented.
6.2.2 Architecture
This section presents the proposed architecture of the LSD DL-FSIPO PB multiplier, as shown in Figure 6.3a. Figure 6.3b shows the detailed architecture of the $\triangle'_j$ module at the $i$-th iteration, $0 \le j < d$ and $0 \le i < k$. The component $\otimes$ is shown in more detail in Figure 6.1c. Also, it is noted that $r = kd - m$ is the number of right padded zeros.
The architecture of Figure 6.3a is constructed based on (6.6) and (6.7). In the following, $A$ and $B$ denote the input field elements to the multiplier, while $A^{(i)}$ and $B^{(i)}$ are given in (6.6), and $C_i$ is defined in (6.7). In Figure 6.3a, $\langle X \rangle$ and $\langle Y \rangle$ are right shift registers. $\langle X \rangle$ stores the bits of $A^{(i-1)}$, while $\langle Y \rangle$ stores the bits of $B^{(i-1)}$, during the $i$-th iteration of the $k$ clock cycles of computations. Notice that the least significant digit of either $A^{(i-1)}$ or $B^{(i-1)}$ is zero for all $i < k$ (only $A^{(k-1)} = A_{k-1}\alpha^{m-d} + A^{(k-2)}\alpha^{-d}$ and $B^{(k-1)} = B_{k-1}\alpha^{m-d} + B^{(k-2)}\alpha^{-d}$ have nonzero least significant digits). Also, the rightmost $r$ bits of $A^{(k-2)}$ and $B^{(k-2)}$ are zeros (padding zeros). Therefore, during the last iteration $i = k-1$, one has $\langle X \rangle = A^{(k-2)} = \langle a_{m-1-d}, \ldots, a_0, \underbrace{0, \ldots, 0}_{d} \rangle$ and $\langle Y \rangle = B^{(k-2)} = \langle b_{m-1-d}, \ldots, b_0, \underbrace{0, \ldots, 0}_{d} \rangle$, and hence, it is sufficient to have only $(m-d)$ bits in each of $\langle X \rangle$ and $\langle Y \rangle$. The vertical thick line in Figure 6.3a represents a $2m$-bit bus carrying the bits of $A_i$, $B_i$, $\langle X \rangle = A^{(i-1)}$, and $\langle Y \rangle = B^{(i-1)}$, during the $i$-th iteration, for $0 \le i < k$. During the $i$-th iteration, the $m$-bit input $B_i\alpha^{m-d} + B^{(i-1)}\alpha^{-d}$ in Figure 6.3b represents $B^{(i)}$ (see (6.6)) and is constructed by concatenating the $d$ bits from $B_i$ (higher bits) with the $(m-d)$ bits from $B^{(i-1)}$ (lower bits). In the same figure, the vertical thick line concatenates the $d$ bits from $a_{di+j-r}B^{(i)}$ (higher bits) to the $m-d$ bits (lower) resulting from bit-wise XORing the lower $m-d$ bits of $a_{di+j-r}B^{(i)}$ with $b_{di+j-r}A^{(i-1)}\alpha^{-d}$. Here, $j$ denotes the index of the block $\triangle'_j$ in Figure 6.3a and its value satisfies $0 \le j < d$. The multiplication of the latter concatenated $m$-bit signal (of Figure 6.3b) by $\alpha^{j-(d-1)}$ generates the $m$-bit output of the corresponding $\triangle'_j$ block in Figure 6.3a (that is, $\big[a_{di+j-r}\big(B_i\alpha^{m-d} + B^{(i-1)}\alpha^{-d}\big) + b_{di+j-r}A^{(i-1)}\alpha^{-d}\big]\alpha^{j-(d-1)}$). The output of block $\alpha^{m-1}$ represents the result of the fixed multiplication of the summation of the outputs of all $\triangle'_j$ by $\alpha^{m-1}$. At the $i$-th clock trigger, the accumulator $\langle Z \rangle$ is updated by adding the output of block $\alpha^{m-1}$ to the output of block $\alpha^{-2d}$. Block $\alpha^{-2d}$ represents the multiplication of the current state of register $\langle Z \rangle$ by the fixed element $\alpha^{-2d}$. Therefore, after the $i$-th clock signal, $\langle Z \rangle = C_i$, according to (6.7). Then, by initializing the three registers $\langle X \rangle$, $\langle Y \rangle$, and $\langle Z \rangle$ in Figure 6.3a with zeros, one generates the multiplication result $AB \bmod p(\alpha) = A^{(k-1)}B^{(k-1)} \bmod p(\alpha) = C_{k-1}$ in accumulator $\langle Z \rangle$ after $k$ iterations.
Figure 6.3: (a) Architecture of the proposed LSD DL-FSIPO PB multiplier. (b) Detailed architecture of $\triangle'_j$ at the $i$-th iteration.
Figure 6.4 presents a graphical illustration showing the state of the $GF(2^3)$ least significant bit first bit-level (LSB BL-FSIPO) PB multiplier during the different iterations of the computation, for multiplying the two field elements $A = \alpha$ and $B = \alpha^{2}$ in Example 6.2.3, based on the architecture of Figure 6.3a, where $d = 1$. Figure 6.4a shows the initial state ($i = 0$). Figure 6.4b shows the state after the first clock cycle ($i = 1$). Figure 6.4c shows the state after the second clock cycle ($i = 2$). Figure 6.4d shows the state after the third clock cycle, where the result $\alpha^{3} = \alpha + 1$ is stored in the output register which is surrounded by the dotted rectangle. In this figure, the underlined rightmost bit of each of $A^{(i-1)}$ and $B^{(i-1)}$ is always zero; it represents the missing (not required) rightmost FF in each of the registers $\langle X \rangle$ and $\langle Y \rangle$.
Figure 6.4: The state of the corresponding $GF(2^3)$ LSB BL-FSIPO PB multiplier for Example 6.2.3, throughout the different iterations of the computation. (a) Initial state, $i = 0$. (b) State after the first clock cycle, $i = 1$. (c) State after the second clock cycle, $i = 2$. (d) State after the third clock cycle.
In the following, the space and time complexities of the proposed LSD DL-FSIPO PB
multiplier will be studied.
6.2.3 Space and Time Complexities
This section starts by deriving the space and time complexities for the multiplication of an arbitrary field element, represented in the PB, by the constants $\alpha^{m-1}$ and $\alpha^{-q}$ (for $0 \le q \le t_1$), respectively, where $\alpha \in GF(2^m)$ is the root of the field's generating irreducible polynomial. After this, the space and time complexities of the proposed LSD DL-FSIPO PB multiplier are considered.
The following lemma gives the space and time complexities for the multiplication of an arbitrary field element by the constant element $\alpha^{m-1}$.
Lemma 6.2.8 The hardware realization of the multiplication of an arbitrary $GF(2^m)$ element $A = \sum_{i=0}^{m-1} a_i \alpha^{i}$ by the constant element $\alpha^{m-1}$ according to (6.8) requires the following number of two-input XOR gates
\[ N_{\alpha^{m-1}} = (m-1)(n + \omega - 3) + (\omega - 2) - \sum_{i=1}^{n-1} l_i - \sum_{j=1}^{\omega-2} t_j, \tag{6.12} \]
and a propagation delay of
\[ T_{\alpha^{m-1}} = \big( \lceil \log_2(n) \rceil + \lceil \log_2(\omega - 1) \rceil \big) T_X, \tag{6.13} \]
where $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ is the field's generating irreducible polynomial with $\omega$ nonzero terms, $n$ is the number of nonzero entries in column zero of the $(m-1) \times m$ binary reduction matrix $Q$ [72], and $l_i$ denotes the row location of the $i$-th nonzero entry in this column.
Proof The generation of $v_1 = \sum_{i=0}^{n-1} \big[a_1\ a_2\ \ldots\ a_{m-1}\ 0\big]^{T} [\uparrow l_i]$ in (6.8) requires $\sum_{i=1}^{n-1} (m - 1 - l_i) = (n-1)(m-1) - \sum_{i=1}^{n-1} l_i$ two-input XOR gates. After this, one needs another $\sum_{j=1}^{\omega-2} (m - t_j) = (\omega - 2)m - \sum_{j=1}^{\omega-2} t_j$ two-input XORs for the realization of $v_2 = v_3 + \sum_{j=0}^{\omega-2} v_1 [\downarrow t_j]$ in (6.8), where $v_3 = \big[0\ 0\ \ldots\ a_0\big]^{T}$ and $t_0 = 0$. Therefore, by adding these values we get (6.12). Similarly, one obtains (6.13) by adding the propagation delays contributed by the generation of $v_1$ (that is, $\lceil \log_2(n) \rceil T_X$) and $v_2$ (that is, $\lceil \log_2(\omega - 1) \rceil T_X$, since $v_1[\downarrow 0] + v_3$ does not require any XORing).
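Corollary 6.2.9 If $1 < t_{\omega-2} \le \frac{m+1}{2}$, then $N_{\alpha^{m-1}}$ and $T_{\alpha^{m-1}}$ in (6.12) and (6.13), respectively, become
\[ N_{\alpha^{m-1}} = (\omega - 2)(m - 1), \tag{6.14} \]
and
\[ T_{\alpha^{m-1}} = 2 \lceil \log_2(\omega - 1) \rceil T_X. \tag{6.15} \]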
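Proof Since $1 < t_{\omega-2} \le \frac{m+1}{2}$, then $n = \omega - 1$, $l_0 = 0$, and $l_i = m - t_i$ for $1 \le i < \omega - 1$ (see Remark 2.9.3). Based on this, $N_{\alpha^{m-1}}$ becomes
\begin{align*}
N_{\alpha^{m-1}} &= (m-1)(n + \omega - 3) + (\omega - 2) - \sum_{i=1}^{n-1} l_i - \sum_{j=1}^{\omega-2} t_j \\
&= (m-1)(\omega - 1 + \omega - 3) + (\omega - 2) - \sum_{i=1}^{\omega-2} (m - t_i) - \sum_{j=1}^{\omega-2} t_j \\
&= (2\omega - 4)m - (2\omega - 4) + (\omega - 2) - \sum_{i=1}^{\omega-2} m \\
&= (\omega - 2)(m - 1).
\end{align*}
Similarly, $T_{\alpha^{m-1}}$ becomes
\begin{align*}
T_{\alpha^{m-1}} &= \big( \lceil \log_2(n) \rceil + \lceil \log_2(\omega - 1) \rceil \big) T_X \\
&= \big( \lceil \log_2(\omega - 1) \rceil + \lceil \log_2(\omega - 1) \rceil \big) T_X \\
&= 2 \lceil \log_2(\omega - 1) \rceil T_X.
\end{align*}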
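The simplification from (6.12) to (6.14) can be confirmed numerically. The sketch below plugs $n = \omega - 1$ and $l_i = m - t_i$ (the values stated in the proof) into (6.12) and compares the result with $(\omega - 2)(m - 1)$; the NIST middle exponents are used as sample inputs, and the function names are illustrative.

```python
def xor_count_general(m: int, middle: tuple) -> int:
    """Evaluate (6.12) with n = omega - 1, l_0 = 0 and l_i = m - t_i."""
    omega = len(middle) + 2
    n = omega - 1
    rows = [0] + [m - t for t in middle]          # l_0, l_1, ..., l_{n-1}
    return (m - 1) * (n + omega - 3) + (omega - 2) - sum(rows[1:]) - sum(middle)


def xor_count_corollary(m: int, middle: tuple) -> int:
    """Evaluate (6.14): (omega - 2)(m - 1)."""
    return len(middle) * (m - 1)


def xor_levels_corollary(middle: tuple) -> int:
    """Number of XOR levels in (6.15): 2 * ceil(log2(omega - 1))."""
    omega = len(middle) + 2
    return 2 * (omega - 2).bit_length()           # ceil(log2(v)) == (v - 1).bit_length()


for m, middle in {163: (7, 6, 3), 233: (74,), 283: (12, 7, 5)}.items():
    assert 1 < max(middle) <= (m + 1) / 2         # condition of Corollary 6.2.9
    print(m, xor_count_general(m, middle), xor_count_corollary(m, middle),
          xor_levels_corollary(middle))
```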
The following targets efficient hardware implementation of the proposed LSD DL-FSIPO PB multiplier. Therefore, only values of $d$ which satisfy the condition of (6.11) are considered. In this context, the following lemma gives the space and time complexities for the hardware realization of the multiplication of an arbitrary field element $A$ by the constant element $\alpha^{-q}$, based on the formulation (6.9).
Lemma 6.2.10 The hardware realization of the multiplication of an arbitrary $GF(2^m)$ element $A = \sum_{i=0}^{m-1} a_i \alpha^{i}$ by the constant element $\alpha^{-q}$, according to (6.9), requires at most the following number of two-input XOR gates
\[ N_{\alpha^{-q}} = q(\omega - 2), \tag{6.16} \]
and a propagation delay of
\[ T_{\alpha^{-q}} = \lceil \log_2(q + 1) \rceil T_X, \tag{6.17} \]
where $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ is the field's generating irreducible polynomial with $\omega$ nonzero terms and $d$ satisfies the condition of (6.11).
Proof According to (6.9), we have
\begin{align*}
A\alpha^{-q} \bmod p(\alpha) &= \sum_{i=0}^{m-1} a_i \alpha^{i-q} \\
&= \sum_{i=q}^{m-1} a_i \alpha^{i-q} + \sum_{i=0}^{q-1} a_i \left( \alpha^{m} + \sum_{j=1}^{\omega-2} \alpha^{t_j} \right)\alpha^{i-q}.
\end{align*}
As shown in Figure 6.5, the hardware realization of the above formulation requires a number of two-input XOR gates equal to $q(\omega - 1) - q = q(\omega - 2)$ and a propagation delay equivalent to $\lceil \log_2(q + 1) \rceil T_X$.
Figure 6.5: Multiplying an arbitrary $GF(2^m)$ element by the constant $\alpha^{-q}$, where $p(x) = x^{m} + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ is the field's generating irreducible polynomial with $\omega$ nonzero terms and $q \le t_1$ (condition of (6.11)).
Now, the space complexity of the proposed LSD DL-FSIPO PB multiplier in Figure 6.3a is given. What follows assumes that the conditions $1 < t_{\omega-2} \le \frac{m+1}{2}$ (which is true for the five binary extension fields recommended by NIST for ECDSA [12]) and $d \le \lfloor t_1/2 \rfloor$ are valid.
Proposition 6.2.11 Under the conditions $1 < t_{\omega-2} \le \frac{m+1}{2}$ and $d \le \lfloor t_1/2 \rfloor$, the total number of gates in the hardware realization of the proposed LSD DL-FSIPO PB multiplier of
Figure 6.3a is as follows:8>>>>>>><>>>>>>>:
#ANDs = d (2m   d)
#XORs = (2d + !   2)m   (!   2)   d
h
(d 3)(! 2)
2
i
+ 1
#FFs = 3m   2d
: (6.18)
Proof The total number of two-input AND gates required for the hardware realization of
the proposed architecture in Figure 6.3a equals $d (m + m - d) = d (2m - d)$, where each
$\triangle'_j$ block contributes $m + m - d = 2m - d$ two-input AND gates ($0 \le j < d$). From the
same figure, one finds the total number of FFs in registers $\langle X \rangle$, $\langle Y \rangle$, and $\langle Z \rangle$ to be
$(m - d) + (m - d) + m = 3m - 2d$. The total number of two-input XOR gates consists of the XOR
gates in the field addition of d elements plus those in the field addition of 2 elements (that is,
$(d - 1) m + m = dm$ XORs), the XOR gates which form the multiplication by the constant $\alpha^{-2d}$
(that is, $2d (\omega - 2)$; see (6.16)), the XOR gates which form the multiplication by the constant
$\alpha^{m-1}$ (given by (6.14)), in addition to the XOR gates in all the $\triangle'_j$ modules, $0 \le j < d$. Notice
that each $\triangle'_j$ module requires $m - d + N_{\alpha^{j-(d-1)}}$ two-input XOR gates, out of which $N_{\alpha^{j-(d-1)}}$ (given
by (6.16)) two-input XOR gates are required to realize the multiplication by $\alpha^{j-(d-1)}$. Therefore,
the total number of two-input XOR gates is
$$dm + 2d (\omega - 2) + d (m - d) + \sum_{j=0}^{d-1} N_{\alpha^{j-(d-1)}} + N_{\alpha^{m-1}},$$
and by substituting
$$\sum_{j=0}^{d-1} N_{\alpha^{j-(d-1)}} = \sum_{i=0}^{d-1} N_{\alpha^{-i}} = \sum_{i=0}^{d-1} i (\omega - 2) = \frac{d (d-1)(\omega-2)}{2}$$
and $N_{\alpha^{m-1}} = (\omega - 2)(m - 1)$, according to (6.16) and (6.14), respectively,
$(2d + \omega - 2) m - (\omega - 2) - d \left[ \frac{(d-3)(\omega-2)}{2} \right] + 1$ is obtained.
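As a quick numerical check of the closed forms in (6.18), the following Python sketch evaluates them for the bit-level case of the NIST trinomial field (m = 233, $\omega$ = 3, $t_1$ = 74, d = 1). The function name `lsd_dl_fsipo_gates` and its argument order are illustrative assumptions, not part of the thesis; the expressions themselves are taken as stated in Proposition 6.2.11.

```python
def lsd_dl_fsipo_gates(m, omega, d, t1):
    """Gate counts of (6.18) as stated; valid only when d <= floor(t1 / 2)."""
    if d > t1 // 2:
        raise ValueError("(6.18) assumes d <= floor(t_1 / 2)")
    ands = d * (2 * m - d)
    # d * (d - 3) is always even, so the integer division below is exact.
    xors = (2 * d + omega - 2) * m - (omega - 2) - d * (d - 3) * (omega - 2) // 2 + 1
    ffs = 3 * m - 2 * d
    return ands, xors, ffs


# Bit-level case of x^233 + x^74 + 1: compare with the LSB BL-FSIPO row of Table 6.4.
print(lsd_dl_fsipo_gates(233, 3, 1, 74))
```

For these parameters the sketch returns 465 AND gates, 700 XOR gates and 697 flip-flops, the values listed for the LSB BL-FSIPO multiplier in Table 6.4.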
The time complexity of the proposed LSD DL-FSIPO PB multiplier, in terms of the propagation
delay of the corresponding levels of two-input AND and two-input XOR gates along
the multiplier's longest path, is as follows. Again, the conditions $1 < t_{\omega-2} \le \frac{m+1}{2}$ and
$d \le \lfloor \frac{t_1}{2} \rfloor$ are assumed to be valid.
Proposition 6.2.12 Under the conditions $1 < t_{\omega-2} \le \frac{m+1}{2}$ and $d \le \lfloor \frac{t_1}{2} \rfloor$, the maximum
propagation delay for the hardware realization of the proposed LSD DL-FSIPO PB
multiplier in Figure 6.3a is independent of the binary extension field's dimension (i.e., m), and
is equal to
$$PD = T_A + 2 \left( 1 + \lceil \log_2 d \rceil + \lceil \log_2 (\omega - 1) \rceil \right) T_X. \qquad (6.19)$$
Proof There are two main paths in Figure 6.3a. The first path extends between the input
registers ($\langle X \rangle$ and $\langle Y \rangle$) and the output accumulator $\langle Z \rangle$. The second path extends between the
output and input of register $\langle Z \rangle$. For the former path, notice that the multiplication by $\alpha^{-(d-1)}$
requires a higher propagation delay than the multiplications by the constants $\alpha^{-1}$ through $\alpha^{-(d-2)}$.
Therefore, the critical path in Figure 6.3a between the input registers ($\langle X \rangle$ and $\langle Y \rangle$) and the output
accumulator $\langle Z \rangle$ passes through module $\triangle'_0$. This propagation delay consists of $T_A + T_X + T_{\alpha^{-(d-1)}}$
contributed by module $\triangle'_0$, $\lceil \log_2 d \rceil T_X$ contributed by the d-input field adder, $T_X$ contributed
by the 2-input field adder, in addition to $T_{\alpha^{m-1}}$, which is contributed by the multiplication by
$\alpha^{m-1}$ (given by (6.15)). This adds up to $T_A + 2 \left( 1 + \lceil \log_2 d \rceil + \lceil \log_2 (\omega - 1) \rceil \right) T_X$. On the other
hand, the propagation delay of the path between the output and input of accumulator $\langle Z \rangle$ is
equivalent to $T_X + T_{\alpha^{-2d}} = \left( 1 + \lceil \log_2 (2d + 1) \rceil \right) T_X$. Hence, the propagation delay in (6.19) is
the maximum between these two paths.
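The two paths in the proof above can be tallied level by level. The following Python sketch is an illustration only: the helper names are hypothetical, and the gate delays passed to `delay_ns` are the 65nm figures quoted later in this section (0.03 ns for a two-input AND, 0.04 ns for a two-input XOR). It sums the contributions listed in the proof rather than using the closed form, so that the result can be compared with (6.19).

```python
def ceil_log2(x):
    """Exact ceil(log2(x)) for an integer x >= 1."""
    return (x - 1).bit_length()


def lsd_dl_fsipo_paths(d, omega):
    """Return ((AND levels, XOR levels) input-to-accumulator, same for the feedback path)."""
    xor_levels = (
        1                              # T_X inside module 0
        + ceil_log2(d)                 # multiplication by alpha^-(d-1), (6.17) with q = d - 1
        + ceil_log2(d)                 # d-input field adder
        + 1                            # 2-input field adder
        + 2 * ceil_log2(omega - 1)     # multiplication by alpha^(m-1)
    )
    input_to_acc = (1, xor_levels)
    feedback = (0, 1 + ceil_log2(2 * d + 1))   # T_X plus multiplication by alpha^(-2d)
    return input_to_acc, feedback


def delay_ns(levels, t_and=0.03, t_xor=0.04):
    return round(levels[0] * t_and + levels[1] * t_xor, 2)


first, second = lsd_dl_fsipo_paths(d=1, omega=3)
print(first, delay_ns(first))     # input registers to accumulator
print(second, delay_ns(second))   # accumulator feedback
```

For d = 1 and $\omega$ = 3 the first path has one AND level and four XOR levels, which agrees with the $T_A + 4T_X$ entry for the LSB BL-FSIPO multiplier in Table 6.4, and it is the longer of the two paths.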
In the following, a comparison between the proposed DL-FSIPO PB multipliers and other
existing digit-level serial PB multiplication schemes is conducted.
6.3 Comparisons
In this section, the proposed DL-FSIPO PB multipliers are compared to other existing serial
PB multipliers. For this purpose, the propagation delay and space complexity of the different
serial PB multiplication schemes are listed in Table 6.3. In this table, space complexity
is reported in terms of the number of FFs, two-input AND and XOR gates, and 2-to-1 1-bit
multiplexers (for either logic implementation or inputs preloading). Time complexity appears in
terms of the number of levels of two-input AND ($T_A$) and XOR ($T_X$) gates, and 2-to-1 1-bit
multiplexers ($T_M$). $p(x) = x^m + \sum_{i=1}^{\omega-2} x^{t_i} + 1$ is the field's irreducible polynomial, satisfying
$\frac{m+1}{2} \ge t_{\omega-2}$ and $t_1 > 1$. Also, in the table, $T' = \left( 1 + \lceil \log_2 (\omega - 1) \rceil + \lceil \log_2 (m) \rceil \right) T_X$ and
$T'' = \left( 1 + \lceil \log_2 (\omega - 1) \rceil + \lceil \log_2 (m - 1) \rceil \right) T_X$ [75]. Moreover, d is the digit size, $k = \lceil \frac{m}{d} \rceil$, and
for an integer x the function $\delta(x) = 0$ if $x \ne 1$.
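| Multiplier | FF | AND | XOR | 2-to-1 1-bit MUX* | Propagation delay | Parallel loading: 2-to-1 1-bit MUX | Parallel loading: latency | Serial loading: latency |
|---|---|---|---|---|---|---|---|---|
| LSD DL-SIPO [57] | $2m + d - 1$ | $dm + (2d - 1)(\omega - 1)$ | $dm + (d - 1) + (2d - 1)(\omega - 2)$ | $m$ | $T_A + \lceil \log_2 (d + 1) \rceil T_X$ | $m$ | $k + 1 - \delta(d)$ | $2k + 1 - \delta(d)$ |
| MSD DL-SIPO [80] | $2m + d$ | $dm + (2d - 1)(\omega - 1)$ | $dm + d (\omega - 2)$ | 0 | $T_A + \lceil \log_2 (2d + 1) \rceil T_X$ | $m$ | $k + 1 - \delta(d)$ | $2k + 1 - \delta(d)$ |
| BL-PISO (d = 1) [75] | $3m + t_{\omega-2} - 1$ | $2m - 1$ | $(\omega - 1)(m - 1) + \omega - 3 + \sum_{i=1}^{\omega-2} t_i$ | 0 | $T_A + \left( 1 + \lceil \log_2 (\omega - 1) \rceil + \lceil \log_2 (m) \rceil \right) T_X$ | $2m$ | $m$ | $2m$ |
| PIPO [50] | $5m - 1$ | $\frac{m^2 + m}{2}$ | $\frac{m^2 + m}{2}$ | $4m$ | $T_A + \lceil \log_2 m \rceil T_X + 2 T_M$ | $2m$ | $2 t_{\omega-2} + 1$ | $k + 2 t_{\omega-2} + 1$ |
| LSD DL-FSIPO (Figure 6.3a) | $3m - 2d$ | $d (2m - d)$ | $(2d + \omega - 2) m - (\omega - 2) - d \left[ \frac{(d-3)(\omega-2)}{2} \right] + 1$ | 0 | $T_A + 2 \left( 1 + \lceil \log_2 (d) \rceil + \lceil \log_2 (\omega - 1) \rceil \right) T_X$ | 0 | $k$ | $k$ |
| MSD DL-FSIPO (Figure 6.1a) | $3m - 2d$ | $d (2m - d)$ | $2dm + d \left[ \frac{(d+3)(\omega-2)}{2} - d \right]$ | 0 | $\lceil \log_2 (d + 1) \rceil T_X + \max \left\{ \lceil \log_2 (2d + 1) \rceil T_X,\; T_A + \left( 1 + \lceil \log_2 (d) \rceil \right) T_X \right\}$ | 0 | $k$ | $k$ |

* These multiplexers are used in the multiplication logic and are different from the ones used for parallel preloading of inputs.
Table 6.3: Space and time complexities of the different digit-level $GF(2^m)$ PB multipliers.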
While the DL-SIPO PB multipliers listed in this table offer the best space complexities and
propagation delays, one can see that the proposed DL-FSIPO PB multipliers are advantageous
for the case of serial inputs preloading, since they offer lower latency for generating the m output
bits. This feature of the proposed DL-FSIPO multipliers results in low-latency fast multiplication
in resource-constrained applications where the input data-path might have limited capacity
for reading elements from large finite fields. In addition, and similar to the DL-SIPO PB
multipliers in Table 6.3, the proposed MSD and LSD DL-FSIPO PB multipliers offer propagation
delays that are independent of the dimension of $GF(2^m)$. Compared to the PIPO serial PB
multiplier in Table 6.3, both of the proposed MSD and LSD DL-FSIPO PB multipliers show
better space as well as time complexities.
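To make the serial-loading latency comparison concrete, the following Python sketch evaluates the serial-loading latency expressions of Table 6.3 for a few digit sizes over the NIST trinomial field with m = 233 and $t_{\omega-2}$ = 74. The function names are illustrative. The table only defines its indicator function as 0 for arguments other than 1; the value 1 used here at d = 1 is an assumption, chosen because it reproduces the 466-cycle serial-loading latency reported for the bit-level SIPO multipliers in Table 6.4.

```python
def ceil_div(a, b):
    return -(-a // b)


def indicator(x):
    # Assumption: equals 1 at x == 1; the text only states that it is 0 otherwise.
    return 1 if x == 1 else 0


def serial_loading_latency(m, d, t_max):
    """Clock cycles per multiplication when the inputs arrive digit-serially."""
    k = ceil_div(m, d)
    return {
        "DL-SIPO": 2 * k + 1 - indicator(d),
        "PIPO": k + 2 * t_max + 1,
        "DL-FSIPO": k,
    }


for d in (1, 8, 32):
    print(d, serial_loading_latency(233, d, 74))
```

Under these assumptions the DL-FSIPO latency is k cycles for every digit size, roughly half of the DL-SIPO figure, while the PIPO latency is dominated by the $2 t_{\omega-2} + 1$ term.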
Figures 6.6 and 6.7 plot the efficiency as a function of the digit size considering serial
inputs loading and parallel inputs loading, respectively, for the different multipliers in Table
6.3 (except the BL-PISO), under the field $GF(2^{233})$ recommended by NIST, which is defined
by the irreducible trinomial $p(x) = x^{233} + x^{t_1} + 1$, where $t_1 = 74$.
Figure 6.6: Normalized throughput (TP/G) as a function of the digit size d for the serial inputs loading
case (curves: LSD DL-SIPO, MSD DL-SIPO, PIPO, MSD DL-FSIPO, LSD DL-FSIPO).
Figure 6.7: Normalized throughput (TP/G) as a function of the digit size d for the parallel inputs loading
case (curves: LSD DL-SIPO, MSD DL-SIPO, PIPO, MSD DL-FSIPO, LSD DL-FSIPO).
Here, efficiency denotes the normalized throughput, that is, throughput (computed at 1
GHz) per number of NAND gate equivalents (TP/G), measured in Kbps/Gate. The declines
after each peak in these two plots are due to increasing the digit size while the latency (hence
the throughput) stays fixed. For instance, the efficiency peaks which start at d = 117 correspond
to a computational latency of $k = \lceil \frac{233}{117} \rceil = 2$ clock cycles. Increasing d beyond the value of
117, say to d = 140, increases the space complexities; however, the latency stays constant at
$k = \lceil \frac{233}{140} \rceil = 2$.
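The plateau behaviour described above follows directly from $k = \lceil m/d \rceil$. The short Python sketch below (helper names are illustrative) lists the latency for a few digit sizes and the smallest digit size that reaches a given latency, which is where each efficiency peak starts.

```python
def latency(m, d):
    """k = ceil(m / d) clock cycles."""
    return -(-m // d)


def smallest_digit_for_latency(m, k):
    """Smallest d with ceil(m / d) <= k, i.e. where the latency-k plateau begins."""
    return -(-m // k)


m = 233
for d in (116, 117, 140, 232, 233):
    print(d, latency(m, d))
for k in (2, 3, 4):
    print(k, smallest_digit_for_latency(m, k))
```

Between d = 117 and d = 232 the latency stays at two clock cycles, so any extra hardware spent on a larger digit in that range lowers TP/G without improving throughput.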
From the two plots of Figures 6.6 and 6.7, one can see that the proposed DL-FSIPO PB
multipliers are advantageous for the case of serial inputs preloading since they offer better
efficiencies compared to the other multipliers (when running at the same clock speed). The
DL-SIPO PB multipliers show higher efficiencies than the other multiplication schemes in the
case of parallel inputs preloading. However, it is interesting to notice that the gap between the
efficiencies of the proposed DL-FSIPO PB multipliers and those of the DL-SIPO PB multipliers
decreases with increasing d in the case of parallel preloading of inputs, as depicted in Figure 6.7.
Although this is not a practical case, one can see from the chart that for values of
$d \ge 160$, the efficiencies of the proposed DL-FSIPO PB multipliers exceed those of the
DL-SIPO PB multipliers. On the other hand, for the case of serial inputs preloading, the efficiency
gap increases in favor of the proposed DL-FSIPO PB multipliers. Considering the PIPO PB
multiplier, it offers the lowest efficiency in both serial and parallel inputs preloading scenarios.
Notice that, in the case of parallel inputs preloading, the efficiency of the PIPO PB multiplier
is fixed since its latency depends on $t_{\omega-2}$ (the second highest order amongst the orders of the terms
forming the field-defining polynomial), and not on the digit size.
In Table 6.3, the PISO PB multiplier from [75] is a bit-level multiplier (d = 1). Hence,
the digit size is set to d = 1 (one bit) for the different digit-level multipliers in this table, in
order to conduct further comparisons. As a case study, the bit-level case of the field $GF(2^{233})$
recommended by NIST, which is defined by the irreducible trinomial $p(x) = x^{233} + x^{t_1} + 1$, where
$t_1 = 74$, is investigated. For this case, the resulting space and time complexities of the
multipliers listed in Table 6.3 are reported in Table 6.4. Also, Table 6.5 estimates
the corresponding space and time complexity readings based on the 65nm CMOS standard
library's statistics. In this table, the total gate counts are estimated in terms of total NAND gate
equivalence (GE), while MPD denotes the maximum propagation delay. Latency denotes the
total number of clock cycles required to generate the 233 bits of output. TP is the throughput (at 1
GHz) and TP/G denotes the throughput per total GE, measured in Kbps/Gate. SIL and PIL denote
“Serial Input Loading” and “Parallel Input Loading”, respectively.
| Multiplier | FF | AND | XOR | 2-to-1 1-bit MUX* | Propagation delay | Parallel loading: 2-to-1 1-bit MUX | Parallel loading: latency | Serial loading: latency |
|---|---|---|---|---|---|---|---|---|
| LSB BL-SIPO [57] | 466 | 235 | 234 | 233 | $T_A + T_X$ | 233 | 233 | 466 |
| MSB BL-SIPO [80] | 467 | 235 | 234 | 0 | $T_A + T_X$ | 233 | 233 | 466 |
| BL-PISO [75] | 772 | 465 | 538 | 0 | $T_A + 10 T_X$ | 466 | 233 | 466 |
| PIPO [50] | 1164 | 27261 | 27261 | 932 | $T_A + 8 T_X + 2 T_M$ | 466 | 149 | 382 |
| LSB BL-FSIPO (Figure 6.3a) | 697 | 465 | 700 | 0 | $T_A + 4 T_X$ | 0 | 233 | 233 |
| MSB BL-FSIPO (Figure 6.1a) | 697 | 465 | 467 | 0 | $3 T_X$ | 0 | 233 | 233 |

* These multiplexers are used in the multiplication logic and are different from the ones used for parallel preloading of inputs.
Table 6.4: Space and time complexities for the NIST recommended field $GF(2^{233})$ defined by
the irreducible trinomial $x^{233} + x^{t_1} + 1$, where $t_1 = 74$ and the digit size is d = 1.
In the standard 65nm CMOS technology library, the NAND gate equivalences (GEs) for a
two-input AND, two-input XOR, D-type FF, and a 2-to-1 1-bit multiplexer, when reported
based on synthesis results using the Synopsys Design Vision tool [4], are 1.25, 2, 3.75, and 2,
respectively. In addition, and based on synthesis with the same tool using the same technology
library, the maximum propagation delays (MPD) for a two-input AND, two-input XOR, and
2-to-1 1-bit multiplexer are 0.03 ns, 0.04 ns, and 0.03 ns, respectively.

| Multiplier | MPD (ns) | GE (PIL) | GE (SIL) | Latency (PIL) | Latency (SIL) | TP/G @ 1 GHz (PIL) | TP/G @ 1 GHz (SIL) |
|---|---|---|---|---|---|---|---|
| LSB BL-SIPO [57] | 0.07 | 3441 | 2975 | 233 | 466 | 291 | 168 |
| MSB BL-SIPO [80] | 0.07 | 2979 | 2513 | 233 | 466 | 336 | 199 |
| BL-PISO [75] | 0.43 | 5484 | 4552 | 233 | 466 | 182 | 110 |
| PIPO [50] | 0.41 | 95759 | 94827 | 149 | 382 | 16 | 6 |
| LSB BL-FSIPO (Figure 6.3a) | 0.19 | 4595 | 4595 | 233 | 233 | 218 | 218 |
| MSB BL-FSIPO (Figure 6.1a) | 0.12 | 4129 | 4129 | 233 | 233 | 242 | 242 |

Table 6.5: Space and time complexity estimates for the multipliers which are listed in Table
6.4, based on the standard 65nm CMOS library measures.
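The GE and TP/G columns of Table 6.5 can be recomputed from the gate counts in Table 6.4 and the per-gate weights quoted above. The Python sketch below is an illustration under one stated assumption: the parallel-loading multiplexers are counted only in the PIL gate total, which is the reading that reproduces the tabulated figures. Names such as `estimate` and `TABLE_6_4` are hypothetical.

```python
GE = {"FF": 3.75, "AND": 1.25, "XOR": 2.0, "MUX": 2.0}   # NAND gate equivalents
M = 233                                                  # output bits per multiplication

# (FF, AND, XOR, logic MUX, parallel-loading MUX, PIL latency, SIL latency), Table 6.4
TABLE_6_4 = {
    "LSB BL-SIPO":  (466, 235, 234, 233, 233, 233, 466),
    "MSB BL-SIPO":  (467, 235, 234, 0, 233, 233, 466),
    "BL-PISO":      (772, 465, 538, 0, 466, 233, 466),
    "PIPO":         (1164, 27261, 27261, 932, 466, 149, 382),
    "LSB BL-FSIPO": (697, 465, 700, 0, 0, 233, 233),
    "MSB BL-FSIPO": (697, 465, 467, 0, 0, 233, 233),
}


def tp_per_gate(latency, ge):
    """Throughput per gate in Kbps/gate at 1 GHz: (M / latency) Gbps divided by GE."""
    return M * 1e6 / (latency * ge)


def estimate(ff, and_, xor, mux, pl_mux, lat_pil, lat_sil):
    """Return (GE PIL, GE SIL, TP/G PIL, TP/G SIL), rounded to integers."""
    ge_sil = ff * GE["FF"] + and_ * GE["AND"] + xor * GE["XOR"] + mux * GE["MUX"]
    ge_pil = ge_sil + pl_mux * GE["MUX"]   # assumption: preloading MUXes count for PIL only
    return (round(ge_pil), round(ge_sil),
            round(tp_per_gate(lat_pil, ge_pil)), round(tp_per_gate(lat_sil, ge_sil)))


for name, row in TABLE_6_4.items():
    print(name, estimate(*row))
```

With these weights the sketch reproduces the GE and TP/G entries of Table 6.5, which makes it straightforward to re-evaluate the comparison for a different cell library by changing the `GE` dictionary.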
From Table 6.5, one can see that the listed PIPO serial PB multiplier offers the best latency
in the case of parallel preloading of its inputs. However, it has the lowest efficiency (i.e., normalized
throughput in terms of throughput per NAND gate equivalence measured at 1 GHz) in both
parallel and serial preloading scenarios, compared to all the other listed multiplication schemes.
This is mainly due to the relatively large space complexity of this PIPO serial PB multiplier.
It is noted that the BL-SIPO PB multipliers listed in Table 6.5 offer the best space
complexity and the highest operating frequency. In addition, the BL-SIPO PB multipliers in this
table show the best efficiency in the case of parallel preloading of inputs.
However, the proposed MSB and LSB BL-FSIPO PB multipliers offer lower latencies,
compared to the BL-SIPO, BL-PISO, and PIPO multipliers, in the case of serial inputs loading. In this case,
as a result of the low latencies, the proposed MSB and LSB BL-FSIPO PB multipliers show
the best efficiency. In comparison to the BL-PISO and PIPO PB multipliers, which are
listed in Table 6.5, the proposed MSB and LSB BL-FSIPO PB multipliers are advantageous in
terms of space complexity¹, operating frequency, and efficiency, as well as in terms of latency
in the case of serial loading of inputs. Furthermore, Table 6.5 shows that the proposed MSB
BL-FSIPO PB multiplier is superior to the proposed LSB BL-FSIPO PB multiplier in terms of
space complexity, propagation delay, and hardware efficiency.
It is also worth noting that, in the case of parallel preloading of inputs, the BL-PISO PB
multiplier generates its first output bit with a latency of 1 clock cycle, while the proposed
BL-FSIPO, as well as the BL-SIPO PB multipliers, require 233 clock cycles, after which all the
output bits are generated in parallel. For the same case of parallel preloading of the inputs, the
PIPO PB multiplier, which is listed in Table 6.5, requires 149 clock cycles to generate all the
233 output bits in parallel.
¹ The LSB BL-FSIPO PB multiplier shows almost the same space complexity as the BL-PISO multiplier for the case of serial
loading of inputs.
In cases where the multiplication results need to be communicated to other modules of
the underlying system, one can convert the proposed DL-FSIPO PB multipliers into
serial-in-serial-out schemes, if the output is transmitted using the same limited-capacity data-path
as the inputs. This is done by running the proposed DL-FSIPO PB multipliers for an additional $\lceil \frac{k}{2} \rceil$
clock cycles during which the inputs are set to zeros. During each one of the additional clock
cycles, the proposed DL-FSIPO PB multipliers produce two digits of the multiplication result
for transmission over the output data-path. Therefore, the proposed DL-FSIPO PB multipliers
are advantageous over the serial-serial multipliers presented in [46, 14], in the sense that they
fully utilize the output data-path by requiring only $2k - \lceil \frac{k}{2} \rceil$ clock cycles, compared to the 2k clock
cycles required by the multipliers in [46, 14], which use only half the output data-path capacity.
In addition, in case only one output digit is required to be transmitted per clock cycle, one can
accomplish this by using a dedicated parallel-in-serial-out output register with the proposed
DL-FSIPO PB multipliers. This scheme accomplishes n consecutive multiplications using only
(n + 1)k clock cycles, and hence, it is favoured over the serial-serial multiplication schemes in
[46, 14], which require 2nk clock cycles for the same scenario.
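The cycle counts for n consecutive multiplications quoted above can be compared directly. The Python sketch below is only an arithmetic illustration; the function names and the example digit size d = 8 are assumptions made for the example.

```python
def ceil_div(a, b):
    return -(-a // b)


def fsipo_with_piso_register(n, k):
    """DL-FSIPO multiplier with a dedicated parallel-in-serial-out output register."""
    return (n + 1) * k


def serial_serial(n, k):
    """Serial-serial schemes, which need 2k cycles per multiplication."""
    return 2 * n * k


k = ceil_div(233, 8)   # digits per operand for m = 233, d = 8
for n in (1, 10, 100):
    print(n, fsipo_with_piso_register(n, k), serial_serial(n, k))
```

For a single multiplication both schemes take 2k cycles; the advantage of overlapping output transmission with the next multiplication grows with n and approaches a factor of two.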
The following section concludes this chapter.
6.4 Conclusion
This chapter introduced two new digit-level multiplication schemes for the elements of
$GF(2^m)$, based on the PB representation. The proposed formulations for the digit-level PB
multiplications are based on recursive constructions of the field elements, which construct an
element digit-by-digit, one digit per clock cycle, starting from either the most or the least
significant digit. Based on these new formulations, and to the best of the author's knowledge, the
first architectures for digit-level fully-serial-in-parallel-out (DL-FSIPO) multipliers have been
proposed for dedicated PB. The proposed MSD and LSD DL-FSIPO PB multipliers do not
require any preloading of the inputs and, therefore, they are advantageous for achieving high
throughput in applications where the parallel preloading of the inputs is not possible (if the
input data-path size is limited, which is possible in resource-constrained applications). For this
specific case of serial preloading of the inputs, it has been shown, based on the provided
theoretical analysis, that the proposed MSD and LSD DL-FSIPO PB multipliers offer the highest
throughput and normalized throughput when compared to other digit-level serial PB
multiplication schemes.
Chapter 7
Summary and Future Work
This chapter summarizes the contributions of this work and presents some future goals.
7.1 Summary of Contributions
This thesis introduced efficient hardware designs of the WG stream ciphers in Chapters 3 and
4. The presented designs in Chapter 3 are for the multiple-output-bit MOWG(29, 11, 17) and the
single-output-bit WG(29, 11), based on the ONB-II representation of the $GF(2^{29})$ elements.
The hardware complexity of the MOWG(29, 11, 17) has been reduced by one field multiplier
through signal reuse techniques, while its time complexity has been slightly enhanced by
removing some inverters from the critical path. On the other hand, the space complexity of the
WG(29, 11) has been significantly reduced to only five multipliers in its transform through the
utilization of new trace properties. The new trace property generates the trace of the
multiplication of two field elements represented in the ONB-II without performing the multiplication.
The conducted ASIC and FPGA implementations showed superior performance of the
proposed WG(29, 11) compared to previous counterparts.
In Chapter 4, the polynomial basis representation of the field elements has been considered
for the first time for implementing WG stream ciphers. Nine new designs have been introduced,
three of which are for the class of WG(29, 11), including standard, serialized,
and pipelined versions. The other six designs are for the class of WG-16, including standard,
serialized, and pipelined versions, each implemented with a traditional PB multiplier
and a Karatsuba multiplier. The space complexity of these two classes of the WG cipher has
been significantly reduced through using a new trace property for the PB. Similar to the trace
method introduced in Chapter 3, the new trace property of this chapter generates the trace of
the multiplication of two field elements represented in the PB without performing the
multiplication. The different designs have been demonstrated by ASIC implementations and have
shown a promising performance compared to the previous counterparts. In particular, this
chapter showed that the proposed WG-16 designs comply with the bit rate requirements of the
4G mobile network domain, while offering a variety of optimization options.
In addition, new architectures for the digit-level multiplication in the GNB and PB
representations of $GF(2^m)$ elements have been proposed in Chapters 5 and 6. In Chapters 5 and 6,
new DL-FSIPO $GF(2^m)$ multiplication schemes have been proposed for the GNB and
PB representations, respectively. These new architectures are shown to be advantageous for
increasing the throughput in applications with limited data-path capacities. Both MSD and
LSD variants have been constructed for all proposed DL-FSIPO multipliers.
In Chapter 5, an optimized MSD DL-PISO GNB multiplier has also been presented. The
proposed MSD DL-FSIPO and DL-PISO GNB multipliers in Chapter 5 have been interleaved
in order to construct new architectures for an MSD DL-SIPO GNB Hybrid-double and a DL-
PIPO GNB Hybrid-triple multiplications. The latter two hybrid multiplication schemes con-
duct multiplication of three and four field elements, respectively, using the same latency re-
quired to multiply only two elements. Based on the proposed Hybrid-triple GNB multiplier,
a new digit-level eight-ary exponentiation scheme has been presented in Chapter 5. This new
exponentiation uses almost the same latency of existing eight-ary designs, however, it does not
require pre-computations or storage of intermediate values.
The following section presents some future work for this thesis.
7.2 Future Work
In the future, the following projects can be considered as a continuation of this thesis:
• Generalized hybrid-n-ary FSIPO multipliers.
• ASIC and FPGA realizations of the proposed DL-FSIPO and the digit-level hybrid
multipliers, optimized at the gate / transistor level.
• Implementations for fast field inversion designs using the new hybrid multipliers based
on the recently published work about generalized k-chains [51].
• Ultra lightweight hardware designs for the WG-16 stream cipher suitable for RFIDs,
based on the new hybrid-triple digit-level multipliers.
• Concurrent error control in the different presented designs.
Bibliography
[1] 3GPP Technical Specification Groups. http://www.3gpp.org/Specification-Groups.
[2] Xilinx. http://www.xilinx.com/.
[3] The Sage Notebook. http://www.sagenb.org/.
[4] Synopsys. http://www.synopsys.com/.
[5] IEEE Standard Specifications for Public-Key Cryptography. IEEE Std 1363-2000, page i,
2000.
[6] eSTREAM - The ECRYPT Stream Cipher Project, 2005.
[7] 3rd Generation Partnership Project; Long Term Evolution Release 10 and Beyond
(LTE-Advanced); Proposed to ITU at 3GPP TSG RAN Meeting, 2009.
[8] Adopted Bluetooth Core Specifications, Core Version 4.0. Bluetooth Special Interest
Group, June 2010.
[9] CLP-41: SNOW 3G Flow through Core. Elliptic Technologies, 2011.
http://www.elliptictech.com/products-clp-41.php.
[10] ZUC Key Stream Generator. Elliptic Technologies, 2011.
http://www.elliptictech.com/pdf/CLP-410ZUCKeyStreamGenerator.pdf.
[11] 3GPP TS 33.401 v11.0.1. 3rd Generation Partnership Project; Technical Specification
Group Services and Systems Aspects; 3GPP System Architecture Evolution (SAE): Se-
curity Architecture, June 2011 (Release 11).
[12] Digital Signature Standard (DSS). Federal Information Processing Standards (FIPS), July
2013.
[13] Gordon B. Agnew, Ronald C. Mullin, I. M. Onyszchuk, and Scott A. Vanstone. An
Implementation for a Fast Public-Key Cryptosystem. J. Cryptology, 3:63–79, 1991.
[14] Abdulaziz Al-Khoraidly and Mohammad K. Ibrahim. Finite field serial-serial multiplica-
tion/reduction structure and method, US7519644 B2, Apr 2009.
[15] David W. Ash, Ian F. Blake, and Scott A. Vanstone. Low Complexity Normal Bases.
Discrete Applied Math., 25(3):191 – 210, 1989.
[16] R. Azarderakhsh and A. Reyhani-Masoleh. Low-Complexity Multiplier Architectures
for Single and Hybrid-Double Multiplications in Gaussian Normal Bases. IEEE Trans.
Comput., 62(4):744–757, April 2013.
[17] Reza Azarderakhsh and Arash Reyhani-Masoleh. A Modified Low Complexity Digit-Level
Gaussian Normal Basis Multiplier. In M. Anwar Hasan and Tor Helleseth, editors,
Arithmetic of Finite Fields, volume 6087 of Lecture Notes in Computer Science, pages
25–40. Springer Berlin Heidelberg, 2010.
[18] Elaine Barker, William Barker, William Burr, William Polk, and Miles Smid. NIST Special
Publication 800-57. NIST Special Publication, 800(57):1–142, 2007.
[19] Thomas C. Bartee and David I. Schneider. Computation With Finite Fields. Information
and Control, 6(2):79 – 98, 1963.
[20] T. Beth and D. Gollman. Algorithm Engineering for Public Key Algorithms. IEEE J. Sel.
Areas Commun., 7(4):458–466, 1989.
[21] Alex Biryukov, Deike Priemuth-Schmid, and Bin Zhang. Differential Resynchronization
Attacks on Reduced Round SNOW 3G. In Mohammad S. Obaidat, George A. Tsihrintzis,
and Joaquim Filipe, editors, e-Business and Telecommunications, volume 222 of
Communications in Computer and Information Science, pages 147–157. Springer Berlin
Heidelberg, 2012.
[22] M. Cenk, M.A. Hasan, and C. Negre. Efficient Subquadratic Space Complexity Binary
Polynomial Multipliers Based on Block Recombination. IEEE Trans. Comput.,
63(9):2273–2287, September 2014.
[23] L. Chen, J. Franklin, and A. Regenscheid. Guidelines on Hardware-Rooted Security in
Mobile Devices (Draft). In Special Publication 800-164. National Institute of Standards
and Technology, October 2012.
[24] Lidong Chen and Guang Gong. Communication System Security. Chapman and Hall -
CRC Press, 2012.
[25] Yanni Chen and Keshab K. Parhi. Small Area Parallel Chien Search Architectures for
Long BCH Codes. IEEE Trans. Very Large Scale Integr. Syst., 12(5):545–549, May 2004.
[26] Chao Cheng and K.K. Parhi. High-Speed Parallel CRC Implementation Based on Unfold-
ing, Pipelining, and Retiming. IEEE Trans. Circuits and Systems II, 53(10):1017–1021,
2006.
[27] A. Cilardo. Fast Parallel GF(2^m) Polynomial Multiplication for All Degrees. IEEE Trans.
Comput., 62(5):929–943, May 2013.
[28] J.P. Deschamps, J.L. Imaña, and G.D. Sutter. Hardware Implementation of Finite-Field
Arithmetic. McGraw-Hill Education, 2009.
[29] W. Diffie and M.E. Hellman. New Directions in Cryptography. IEEE Trans. Inf. Theory,
22(6):644–654, November 1976.
[30] V. Dimitrov and K. Jarvinen. Another Look at Inversions Over Binary Fields. In 2013
21st IEEE Symposium on Computer Arithmetic (ARITH), pages 211–218, April 2013.
[31] H. El-Razouk, A. Reyhani-Masoleh, and G. Gong. New Implementations of the WG
Stream Cipher. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 22(9):1865–1878,
September 2014.
[32] H. El-Razouk, A. Reyhani-Masoleh, and G. Gong. New Hardware Implementations of
WG(29,11) and WG-16 Stream Ciphers Using Polynomial Basis. IEEE Trans. Comput.,
to appear.
[33] Serdar S. Erdem, Tugrul Yanik, and Çetin K. Koç. Polynomial Basis Multiplication over
GF(2^m). Acta Applicandae Mathematica, 93(1-3):33–55, September 2006.
[34] X. Fan and G. Gong. Specification of the Stream Cipher WG-16 Based Confidentiality
and Integrity Algorithms. Technical Report CACR 2013-06, University of Waterloo,
Waterloo, ON, Canada, 2013.
[35] X. Fan, N. Zidaric, M. Aagaard, and G. Gong. Efficient Hardware Implementation of
the Stream Cipher WG-16 with Composite Field Arithmetic. Technical Report CACR
2013-23, University of Waterloo, Waterloo, ON, Canada, 2013.
[36] G.-L. Feng. A VLSI Architecture for Fast Inversion in GF(2^m). IEEE Trans. Comput.,
38(10):1383–1386, 1989.
[37] L. Gao and G.E. Sobelman. Improved VLSI Designs for Multiplication and Inversion in
GF(2^M) Over Normal Bases. In ASIC/SOC Conference, 2000. Proceedings. 13th Annual
IEEE International, pages 97–101, 2000.
[38] Willi Geiselmann and Dieter Gollmann. Symmetry and Duality in Normal Basis Multipli-
cation. In Teo Mora, editor, Applied Algebra, Algebraic Algorithms and Error-Correcting
Codes, volume 357 of Lecture Notes in Computer Science, pages 230–238. Springer
Berlin Heidelberg, 1989.
[39] Guang Gong and Yassir Nawaz. The WG Stream Cipher. eSTREAM, ECRYPT Stream
Cipher Project, Report 2005/033, 2005.
[40] Guang Gong and A.M. Youssef. Cryptographic Properties of the Welch-Gong Transfor-
mation Sequence Generators. IEEE Trans. Inf. Theory, 48(11):2837 – 2846, Nov. 2002.
[41] T. Good and M. Benaissa. Hardware Results for Selected Stream Cipher Candidates. In
Workshop Record of the State of The Art of Stream Ciphers 2007 (SASC 2007), pages
191–204, 2007.
[42] Daniel M. Gordon. A Survey of Fast Exponentiation Methods. Journal of Algorithms,
27(1):129–146, April 1998.
[43] S.S. Gupta, A. Chattopadhyay, K. Sinha, S. Maitra, and B.P. Sinha. High-Performance
Hardware Implementation for RC4 Stream Cipher. IEEE Trans. Comput., 62(4):730–743,
2013.
[44] A. Halbutogullari and C.K. Koc. Mastrovito Multiplier for General Irreducible Polyno-
mials. IEEE Trans. Comput., 49(5):503 –518, May 2000.
[45] A. Hariri and A. Reyhani-Masoleh. Digit-Level Semi-Systolic and Systolic Structures for
the Shifted Polynomial Basis Multiplication Over Binary Extension Fields. IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., 19(11):2125–2129, Nov. 2011.
[46] M.A. Hasan and V.K. Bhargava. Division and Bit-Serial Multiplication over GF(q^m).
Computers and Digital Techniques, IEE Proceedings E, 139(3):230–236, May 1992.
[47] M.A. Hasan, M.Z. Wang, and V.K. Bhargava. A Modified Massey-Omura Parallel Mul-
tiplier for a Class of Finite Fields. IEEE Trans. Comput., 42(10):1278 –1280, Oct. 1993.
[48] Jenn-Shyong Horng, I.-Chang Jou, and Chiou-Yng Lee. Low-Complexity Multiplexer-Based
Normal Basis Multiplier Over GF(2^m). Journal of Zhejiang University SCIENCE
A, 10(6):834–842, May 2009.
[49] Junxian Huang, Feng Qian, Alexandre Gerber, Z. Morley Mao, Subhabrata Sen, and
Oliver Spatscheck. A Close Examination of Performance and Power Characteristics of 4G
LTE Networks. In Proceedings of the 10th international conference on Mobile systems,
applications, and services, MobiSys ’12, pages 225–238, New York, NY, USA, 2012.
ACM.
[50] J.L. Imaña. Low Latency GF(2^m) Polynomial Basis Multiplier. IEEE Trans. Circuits Syst.
I, Reg. Papers, 58(5):935–946, May 2011.
[51] K. Jarvinen, V. Dimitrov, and R. Azarderakhsh. A Generalization of Addition Chains and
Fast Inversions in Binary Fields. IEEE Transactions on Computers, to appear.
[52] D. Johnson, A. Menezes, and S. Vanstone. The Elliptic Curve Digital Signature Algo-
rithm (ECDSA). Int’l J. Information Security, 1(1):36–63, 2001.
[53] Anatolii Karatsuba and Yuri Ofman. Multiplication of Multidigit Numbers on Automata.
Soviet Physics-Doklady, 7:595–596, 1963.
[54] Paris Kitsos, George Selimis, and Odysseas Koufopavlou. High Performance ASIC Im-
plementation of the SNOW 3G Stream Cipher. In IFIP/IEEE VLSISOC 2008 - Interna-
tional Conference on Very Large Scale Integration (VLSI SOC), Rhodes Island, Greece,
Oct. 13-15 2008.
[55] C.K. Koc and B. Sunar. Low-Complexity Bit-Parallel Canonical and Normal Basis Mul-
tipliers for a Class of Finite Fields. IEEE Trans. Comput., 47(3):353–356, 1998.
[56] E. Krengel. Fast WG Stream Cipher. In IEEE Region 8 Int. Conf. on Computational
Technologies in Elect. and Electron. Eng., 2008. SIBIRCON 2008., pages 31 –35, Jul.
2008.
[57] S. Kumar, T. Wollinger, and C. Paar. Optimum Digit Serial GF(2^m) Multipliers for
Curve-Based Cryptography. IEEE Transactions on Computers, 55(10):1306–1311, October
2006.
[58] C. Lam, M. Aagaard, and G. Gong. Hardware Implementations of Multi-output Welch-
Gong Ciphers. Technical Report CACR 2011-01, University of Waterloo, Waterloo, ON,
Canada, 2009.
[59] Rudolf Lidl and Harald Niederreiter. Introduction to Finite Fields and their Applications.
Cambridge University Press, New York, NY, USA, 1986.
[60] Yiyuan Luo, Qi Chai, Guang Gong, and Xuejia Lai. A Lightweight Stream Cipher WG-
7 for RFID Encryption and Authentication. In Global Telecommunications Conference
(GLOBECOM 2010), 2010 IEEE, pages 1 –6, Dec. 2010.
[61] James L. Massey and Jimmy K. Omura. Computational Method and Apparatus for Finite
Field Arithmetic, May 1986.
[62] Edoardo D. Mastrovito. VLSI Designs for Multiplication Over Finite Fields GF(2m). In
Teo Mora, editor, Applied Algebra, Algebraic Algorithms and Error-Correcting Codes,
volume 357 of Lecture Notes in Computer Science, pages 297–309. Springer Berlin Hei-
delberg, 1989.
[63] Edoardo D. Mastrovito. VLSI Designs for Multiplication over Finite Fields GF(2m). In
Proceedings of the 6th International Conference, on Applied Algebra, Algebraic Algo-
rithms and Error-Correcting Codes, AAECC-6, pages 297–309, London, UK, UK, 1989.
Springer-Verlag.
[64] A. Mirzaei, M. Dakhilalian, and M. Modarres-Hashemi. An Improved Attack on WG
Stream Cipher. IJSNS International Journal of Computer Science and Network Security,
10(4):45–52, apr. 2010.
[65] R. C. Mullin, I. M. Onyszchuk, S. A. Vanstone, and R. M. Wilson. Optimal Normal Bases
in GF(p^n). Discrete Applied Math., 22(2):149–161, Feb. 1989.
[66] S.H. Namin, Huapeng Wu, and M. Ahmadi. Power Efficiency of Digit Level Polynomial
Basis Finite Field Multipliers in GF(2^283). In 2012 19th IEEE International Conference
on Electronics, Circuits and Systems (ICECS), pages 897–900, December 2012.
[67] Yassir Nawaz. Design of Stream Ciphers and Cryptographic Properties of Nonlinear
Functions. PhD thesis, University of Waterloo, 2007.
[68] Yassir Nawaz and Guang Gong. WG: A Family of Stream Ciphers with Designed
Randomness Properties. Inf. Sci., 178(7):1903–1916, 2008.
[69] C. Paar. Optimized Arithmetic for Reed-Solomon Encoders. In Proceedings of the 1997
IEEE International Symposium on Information Theory, pages 250–, June 1997.
[70] A. Reyhani-Masoleh. Ecient Algorithms and Architectures for Field Multiplication
Using Gaussian Normal Bases. IEEE Trans. Comput., 55(1):34–47, Jan 2006.
[71] A. Reyhani-Masoleh and M.A. Hasan. A New Construction of Massey-Omura Parallel
Multiplier Over GF(2^m). IEEE Trans. Comput., 51(5):511–520, May 2002.
[72] A. Reyhani-Masoleh and M.A. Hasan. Low Complexity Bit Parallel Architectures for
Polynomial Basis Multiplication Over GF(2^m). IEEE Trans. Comput., 53(8):945–959,
Aug. 2004.
[73] A. Reyhani-Masoleh and M.A. Hasan. Low Complexity Word-Level Sequential Normal
Basis Multipliers. IEEE Trans. Comput., 54(2):98–110, 2005.
[74] A. Reyhani-Masoleh and M.A. Hasan. Low Complexity Word-Level Sequential Normal
Basis Multipliers. IEEE Trans. Comput., 54(2):98–110, 2005.
[75] Arash Reyhani-Masoleh. A New Bit-Serial Architecture for Field Multiplication Us-
ing Polynomial Bases. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic
Hardware and Embedded Systems - CHES 2008, number 5154 in Lecture Notes in Com-
puter Science, pages 300–314. Springer Berlin Heidelberg, Jan 2008.
[76] Arash Reyhani-Masoleh and M. Anwar Hasan. Efficient Digit-Serial Normal Basis
Multipliers Over Binary Extension Fields. ACM Trans. Embed. Comput. Syst., 3(3):575–592,
August 2004.
[77] Sondre Ronjom and Tor Helleseth. Attacking the Filter Generator Over GF(2^m).
eSTREAM, ECRYPT Stream Cipher Project, Report 2007/011, 2007.
[78] P.A. Scott, S.E. Tavares, and L.E. Peppard. A Fast VLSI Multiplier for GF(2^m). IEEE J.
Sel. Areas Commun., 4(1):62–66, January 1986.
[79] George N. Selimis, Apostolos P. Fournaris, Harris E. Michail, and Odysseas
Koufopavlou. Improved Throughput Bit-Serial Multiplier for GF(2^m) Fields.
Integration, the VLSI Journal, 42(2):217–226, 2009.
[80] Leilei Song and Keshab K. Parhi. Low-Energy Digit-Serial/Parallel Finite Field Multi-
pliers. Journal of VLSI signal processing systems for signal, image and video technology,
19(2):149–166, July 1998.
[81] Leilei Song and K.K. Parhi. Ecient Finite Field Serial/Parallel Multiplication. In Pro-
ceedings of International Conference on Application Specific Systems, Architectures and
Processors, 1996. ASAP 96, pages 72–82, August 1996.
[82] W. Stallings. Cryptography and Network Security: Principles and Practice. Prentice
Hall, 2011.
[83] D. Stinson. Some Observations on Parallel Algorithms for Fast Exponentiation in GF(2^n).
SIAM J. Comput., 19(4):711–717, August 1990.
[84] B. Sunar and C.K. Koc. Mastrovito Multiplier for All Trinomials. IEEE Trans. Comput.,
48(5):522–527, May 1999.
[85] C.C. Wang and D. Pei. A VLSI Design for Computing Exponentiations in GF(2^m) and
its Application to Generate Pseudorandom Number Sequences. IEEE Trans. Comput.,
39(2):258–262, February 1990.
[86] C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, J. Omura, and Irving S. Reed. VLSI
Architectures for Computing Multiplications and Inverses in GF(2^m). IEEE Trans.
Comput., C-34(8):709–717, 1985.
[87] Hongjun Wu, Tao Huang, PhuongHa Nguyen, Huaxiong Wang, and San Ling.
Differential Attacks Against Stream Cipher ZUC. In Xiaoyun Wang and Kazue Sako, editors,
Advances in Cryptology - ASIACRYPT 2012, volume 7658 of Lecture Notes in Computer
Science, pages 262–277. Springer Berlin Heidelberg, 2012.
[88] Hongjun Wu and Bart Preneel. Resynchronization Attacks on WG and LEX. In Matthew
Robshaw, editor, Fast Software Encryption, volume 4047 of Lecture Notes in Computer
Science, pages 422–432. Springer-Verlag, 2006.
[89] Huapeng Wu. Bit-Parallel Finite Field Multiplier and Squarer Using Polynomial Basis.
IEEE Trans. Comput., 51(7):750–758, July 2002.
[90] Teng Wu and Guang Gong. The Weakness of Integrity Protection for LTE. In Proceedings
of the Sixth ACM Conference on Security and Privacy in Wireless and Mobile Networks,
WiSec '13, pages 79–88, Budapest, Hungary, Apr. 17-19, 2013. Also appeared as Technical
Report CACR 2013-03, University of Waterloo, Canada, 2013.
Curriculum Vitae
Name: Hayssam El-Razouk

Post-Secondary Education and Degrees:
University of Western Ontario, London, ON, Canada, 2011 - 2015, Ph.D.
University of Western Ontario, London, ON, Canada, 2004 - 2006, M.E.Sc.
Beirut Arab University, Beirut, Lebanon, 1997 - 2002, B.E.

Honours and Awards:
NSERC CGS D (Canada), 2012 - 2015.
Jamal Abdul Nasir (Lebanon), 1999 - 2002.

Related Work Experience:
Teaching / Research Assistant, University of Western Ontario, 2004 - 2006 and 2011 - 2015.
Software Engineer, RedIron Technologies (Canada), 2006 - 2011.
