Efficient Arithmetic for the Implementation of Elliptic Curve Cryptography by Hasan Abdulrahman, Ebrahim Abdulrahman
Western University 
Scholarship@Western 
Electronic Thesis and Dissertation Repository 
11-22-2013 12:00 AM 
Efficient Arithmetic for the Implementation of Elliptic Curve 
Cryptography 
Ebrahim Abdulrahman Hasan Abdulrahman 
The University of Western Ontario 
Supervisor 
Reyhani-Masoleh 
The University of Western Ontario 
Graduate Program in Electrical and Computer Engineering 
A thesis submitted in partial fulfillment of the requirements for the degree in Doctor of 
Philosophy 
© Ebrahim Abdulrahman Hasan Abdulrahman 2013 
Recommended Citation 
Hasan Abdulrahman, Ebrahim Abdulrahman, "Efficient Arithmetic for the Implementation of Elliptic Curve 
Cryptography" (2013). Electronic Thesis and Dissertation Repository. 1744. 
https://ir.lib.uwo.ca/etd/1744 
EFFICIENT ARITHMETIC FOR THE IMPLEMENTATION OF ELLIPTIC
CURVE CRYPTOGRAPHY
(Thesis format: Monograph)
by
Ebrahim Abdulrahman Hasan
Graduate Program in Electrical and Computer Engineering
A thesis submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
The School of Graduate and Postdoctoral Studies
The University of Western Ontario
London, Ontario, Canada
© Ebrahim Abdulrahman Hasan 2013
Abstract
Elliptic curve cryptography is now an important branch of public-key cryptography.
Cryptographic mechanisms based on elliptic curves depend on the arithmetic of points on the
curve. The most important operation is multiplying a point on the curve by an integer, known
as elliptic curve scalar (or point) multiplication. A cryptographic device is expected to perform
this operation efficiently and securely. Scalar multiplication is performed by combining elliptic
curve point routines, which are in turn defined in terms of the underlying finite field arithmetic
operations.
This thesis focuses on hardware architecture designs for elliptic curve operations. In the first
part, we seek new architectures that implement finite field multiplication more efficiently. To
this end, we propose novel schemes for serial-out bit-level (SOBL) multiplication in the
polynomial basis over $\mathbb{F}_{2^m}$. We show that the smallest SOBL scheme presented here
provides about a 24-26% reduction in area complexity and about a 21-22% reduction in power
consumption for $\mathbb{F}_{2^{163}}$ compared to the current state-of-the-art bit-level multiplier
schemes. We then employ the proposed SOBL schemes to present new hybrid-double multiplication
architectures that perform two multiplications with a latency comparable to that of a single
multiplication.
In the second part of this thesis, we investigate different algorithms for implementing elliptic
curve scalar multiplication. We focus on three aspects: the finite field arithmetic cost, the
critical path delay, and the strength of protection against side-channel attacks (SCAs) based on
simple power analysis. In this regard, we propose a novel scalar multiplication scheme that
processes three bits of the scalar using the exact same sequence of five point arithmetic
operations. We analyse the security of our scheme and show that it holds against both SCAs and
safe-error fault attacks. In addition, we show how the properties of the proposed scheme yield an
efficient hardware design for a single scalar multiplication on a prime extended twisted Edwards
curve incorporating 8 parallel multiplication operations. Our comparison results show that the
proposed hardware architecture for the twisted Edwards curve model, implemented using the
proposed scalar multiplication scheme, is the fastest SCA-protected scalar multiplication scheme
over prime fields reported in the literature.
Keywords: Finite field arithmetic multiplication, elliptic curve cryptography, scalar
multiplication, serial-out bit-level.
Dedication
To my mother for her love, inspiration, and guidance.
Acknowledgments
This work would not have been possible without the support of many people. I would like
to use this space to express my most sincere gratitude to all those who have made it possible.
First and foremost, I would like to thank my supervisor, Prof. Arash Reyhani-Masoleh, for
the advice, guidance, and trust he has given me. I will not forget the benefits I have gained
from his constructive criticism, invaluable advice, and many long late-night discussions, or the
amount of time he spent going over my draft papers. I feel honored to have been able to work
with him and look forward to a continued research relationship in the future.
I am also deeply indebted to Prof. Wu Huapeng of the University of Windsor for taking the
time to review this work as an external examiner. Moreover, I would like to thank Prof. Marc
Moreno Maza, Prof. Abdallah Shami, and Prof. Anestis Dounavis for serving on the thesis
committee and for offering their insightful comments and invaluable suggestions. I truly
appreciate the financial support provided by the University of Bahrain during my PhD studies.
Thanks must also go to my colleagues in the VLSI lab at Western University, Hayssam,
Behdad, Sasan, Depanwita, and Shahriar, for their good spirit and friendship. A big thank you
to my friends Yasser, Fadah, and Aiman for interesting discussions and general friendship.
Finally, this work would not have been possible without the love and moral support of my
mother Mariam and my brother Hasan. Last but certainly not least, I thank my wonderful wife
Fayeza, who helped me in more ways than I can count. Without her love and support, I would not
have finished this dissertation.
To all of you, thank you very much!
Ebrahim A. H. Abdulrahman 2013/11/12
Contents
Abstract
Dedication
Acknowledgments
List of Figures
List of Tables
List of Algorithms
List of Abbreviations
List of Notations
1 Introduction
1.1 Motivation
1.1.1 Field Multiplication Operation
1.1.2 Bit-Level Finite Field Double Multiplication
1.1.3 Elliptic Curve Scalar Multiplication
1.2 Objectives
1.3 Thesis Organization and Outlines
2 Preliminaries
2.1 Public-Key Based Schemes
2.2 Introduction to Elliptic Curves
2.2.1 Elliptic Curve Diffie-Hellman Key Agreement Scheme
2.3 Group Law Operations in Affine Coordinates
2.4 Group Law Operations in Projective Coordinates
2.4.1 Inverse of a Point
2.5 Elliptic Curve Scalar Multiplication
2.5.1 Binary Methods
2.5.2 Window Based Methods
2.6 Power Analysis Attacks
2.6.1 The Secured ECSM Schemes
2.7 Standard Curves
2.8 Finite Field Arithmetic
2.9 Arithmetic over Prime Fields $\mathbb{F}_p$
2.9.1 Field Arithmetic Addition
2.9.2 Field Arithmetic Subtraction
2.9.3 Field Arithmetic Multiplication
2.9.4 Field Reduction
2.9.5 Field Arithmetic Squaring
2.9.6 Field Arithmetic Inversion
2.10 Arithmetic over Binary Extension Fields $\mathbb{F}_{2^m}$
2.10.1 Field Arithmetic Addition
2.10.2 Field Arithmetic Squaring
2.10.3 Field Arithmetic Multiplication
2.10.4 Traditional Parallel-Out Bit-Level Polynomial Basis Multiplication Operation
2.10.5 Field Arithmetic Division/Inversion
3 Architectures for SOBL Multiplication Using Polynomial Basis
3.1 Introduction
3.2 Preliminaries
3.2.1 Notations
3.2.2 Reduction Process
3.3 Proposed SOBL Multiplication Algorithm
3.3.1 Proposed SOBL Multiplication Algorithm for ω-nomials
3.4 Multiplier Architectures
3.4.1 Multiplier Architecture for ω-nomials
3.4.2 Multiplier Architecture for Trinomials
3.5 Novel Very Low Area Multiplication Architecture
3.5.1 Proposed Compact Multiplier Architecture
3.5.2 Extending to a Digit-Level Scheme
3.6 Comparison
3.7 ASIC Implementation
3.8 Conclusions
4 Hybrid-Double Multiplication Architecture
4.1 Introduction
4.2 Architectures for Double Multiplication
4.2.1 New LSB-first/MSB-first POBL Double Multiplications
4.2.2 New Parallel-Out Digit-Level Polynomial Basis Double Multiplication
4.3 Hybrid-Double Multiplication
4.4 ASIC Implementation
4.5 Conclusions
5 New Regular Radix-8 Scheme for ECSM
5.1 Introduction
5.2 Preliminaries
5.2.1 Notations
5.2.2 The SSCA-Protected ECSMs
5.3 Proposed Radix-8 Scalar Multiplication Algorithm
5.3.1 High-Radix Scalar Expansion
5.3.2 Recoding the Scalar k Into Signed Radix-8
5.3.3 Proposed Radix-8 Algorithm for Scalar Multiplication
5.4 Proposed Regular ECSM Scheme
5.4.1 The Four-Stage Levels
5.4.2 The Three-Stage Levels
5.5 Performance Analysis of The Proposed ECSM Scheme
5.6 Parallel Architectures
5.7 Conclusion
6 Summary and Future Work
6.1 Future Work
Bibliography
Curriculum Vitae
List of Figures
1.1 Hierarchical Scheme for The Implementation of ECC Crypto-System [1].
2.1 Diffie-Hellman Key Exchange Scheme [1, 10].
2.2 Graphical Representation of The Chord-and-Tangent Group Law (EC-Operations) for an Elliptic Curve $E: y^2 = x^3 - 2$ over $\mathbb{F}_p$ [1, 91]. (a) Point Addition (ADD) Operation of P and Q on E and Resulting in The Point R. (b) Point Doubling (DBL) Operation of P on E and Resulting in The Point Q.
2.3 Elliptic Curve Diffie-Hellman Key Exchange Scheme [1].
2.4 Modular Addition over $\mathbb{F}_p$ [90].
2.5 Modular Subtraction over $\mathbb{F}_p$ [90].
2.6 Field Arithmetic Squaring Constructed via $P(x) = x^4 + x + 1$ over $\mathbb{F}_{2^4}$.
2.7 The Traditional Parallel-Out Bit-Level (POBL) Field Arithmetic Multiplication Schemes [31]. (a) LSB-First POBL Multiplier. (b) MSB-First POBL Multiplier.
3.1 Constructing The Mastrovito Matrix M over $\mathbb{F}_{2^{163}}$ Generated by $x^{163} + x^7 + x^6 + x^3 + 1$.
3.2 The Process for Constructing The Coordinates of The Signal Vector s over $\mathbb{F}_{2^{163}}$.
3.3 The Proposed SOBL Mastrovito Multiplier Architecture for The ω-nomial Irreducible Polynomials. (a) The High-Level Architecture. (b) The Implementation of The Circuit S.
3.4 The Proposed SOBL Mastrovito Multiplier Architecture for The Irreducible Trinomials. (a) The High-Level Architecture. (b) The Implementation of The Circuit S.
3.5 The proposed compact SOBL multiplier architecture for the pentanomial irreducible polynomial. (a) The high-level architecture. (b) The implementation of the circuit S. (c) An example for BTX4 module when $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$.
3.6 The architecture of serial-out digit-level (SODL) polynomial basis multiplier over $\mathbb{F}_{2^m}$ for the pentanomial irreducible polynomial, i.e., $x^m + x^{t_1} + x^{t_2} + x^{t_3} + 1$, where digit d = 2.
3.7 Hardware Overhead Gates Due to The Parallel I/O Data Transfer. (a) The Circuit That Enables a Register to be Cleared or Updated. (b) The Circuit That Enables a Register to be Switched Between Two Inputs (MUX).
4.1 The Proposed Double Multiplication Architectures That Extend The POBL Schemes Presented in [31]. (a) LSB-First POBL Double Multiplication Architecture. (b) MSB-First POBL Double Multiplication Architecture.
4.2 Proposed architecture for the LSD-first PODL Double Multiplication Operation.
4.3 Proposed architecture for the MSD-first PODL Double Multiplication Operation.
4.4 Architectures for The Hybrid-Double Multiplication. The Hybrid-Double Multiplier Structure is Developed by Connecting The Output of The SOBL Multiplier Into The Input of The POBL Multiplier.
4.5 Architectures for The Hybrid-Double Multiplication. (a) The Critical-Path Delay of The Hybrid-Double Multiplier (th). (b) Reducing The Delay by Inserting Registers at The IPm Block Inside The SOBL Multiplier.
5.1 EC-Operations Dependency Graph for The Montgomery Ladder ECSM Method [189, 190, 191], Which Shows That a Fixed Sequence of Both The ADD and The DBL Blocks Are Performed for Any Value of The $k_i$ Bit, i.e., Only The Operands Are Transposed.
5.2 EC-Operation Dependency Graph That Shows The Usage of Both The ADD and The DBL Blocks When $k_j = 3$ or $k_j = 4$.
5.3 EC-Operation Dependency Graph That Shows The Usage of Both The ADD and The DBL Blocks When $k_j = 2$ or $k_j = 5$.
5.4 EC-Operation Dependency Graph That Shows The Usage of Both The ADD and The DBL Blocks When $k_j = -1$, 0, 1, or 6. Notice That The SUB Operation is Used at Stage 3 for Both Cases $k_j = -1$ and $k_j = 0$.
5.5 EC-Operation Dependency Graph That Shows The Usage of Both The ADD and The DBL Blocks for All Cases of $k_j$, i.e., $k_j \in \{-1, 0, 1, \cdots, 6\}$.
5.6 EC-Operation Dependency Graph for The Proposed Radix-8 ECSM Method That Shows The Total Memory Points Required, The Total EC-Operations Cost, and The Total Computational Time Complexity Per 3 Scalar Bits at The EC-Operation Level.
5.7 EC-Operation Dependency Graph for The Width-4 Okeya Method [64] That Shows The Total Memory Points Required, The Total EC-Operations Cost, and The Total Computational Time Complexity Per 3 Scalar Bits at The EC-Operation Level.
5.8 EC-Operation Dependency Graph for The Montgomery Ladder and Joye’s Binary Methods [189, 192] That Shows The Total Memory Points Required, The Total EC-Operations Cost, and The Total Computational Time Complexity Per 3 Scalar Bits at The EC-Operation Level.
5.9 Data Dependency Graph for Parallel Computing of The ADDDBL Operation for The x-Coordinates Only Montgomery Ladder Method on The Montgomery Curve.
5.10 Data Dependency Graph for Parallel Computing of The Proposed ADDDBL Operation for The Prime Extended Twisted Edwards Curve.
List of Tables
2.1 NIST Recommended Key Sizes Measured in Number of Bits [104].
2.2 NIST Recommended Finite Fields and Their Corresponding Reduction Polynomials [16].
3.1 List of Notations.
3.2 The Operations of The Control Signals Ctrl1 and Ctrl2 in Figure 3.3(a).
3.3 Comparison Table for The Proposed Multiplier Schemes (Figures 3.3(a), 3.4(a), and 3.5(a)) With The Related Bit-Level Multiplier Schemes in Terms of Time and Space Complexities for The ω-nomial, The Pentanomial, and The Irreducible Trinomial.
3.4 Comparison Table for The Proposed Multiplier Schemes (Figures 3.3(a), 3.4(a), and 3.5(a)) With The Related Bit-Level Multiplier Schemes When Having The Same Parallel I/O Data Transfer Format.
3.5 Comparison of Bit-Level Polynomial Basis Multipliers on an ASIC Implementation (Post Synthesis) Over Both $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$ Using 65-nm CMOS Standard Technology.
4.1 Comparison Table for The ASIC Synthesis Results for The Proposed Double Multiplication Architectures (Figure 4.5(a), 4.5(b)) for The Polynomial Basis Over $\mathbb{F}_{2^{163}}$ Using 65-nm CMOS Standard Technology.
4.2 Comparison Table for The ASIC Synthesis Results for The Proposed Double Multiplication Architectures (Figure 4.5(a), 4.5(b)) for The Polynomial Basis Over $\mathbb{F}_{2^{233}}$ Using 65-nm CMOS Standard Technology.
5.1 An Example That Shows The Computation for kP = 6644P Using The Proposed Signed Radix-8 Scalar Multiplication.
5.2 The 4 Stages That The Proposed Algorithm 13 Evaluates for Each Value of $k_j$.
5.3 Comparison Table of Related Binary and Width-4 Window-Based ECSM Schemes With The Proposed Radix-8 Scheme (Figure 5.6) in Terms of Memory Register Space Used, Total EC-Operations Cost, and Computation Time Complexity at The EC-Operations Level per 3 Scalar Bits Evaluations.
5.4 Comparison Table of The Proposed Radix-8 Scheme (Figure 5.6) With The Unified Operation Technique and With Different ECSM Schemes That Resist Side-Channel Attacks in Terms of Total Field Arithmetic Operations Per 3 Scalar Bits on The Weierstraß Elliptic Curve Model.
5.5 Comparison Table of The Proposed Radix-8 ECSM Scheme (Figure 5.6) With Different Scalar Multiplication Schemes That Offer Resistance Against Side-Channel Attacks Using Parallel Environments With Respect to The Computation Time Complexity.
5.6 Comparison Table of Related Parallel Schemes With The Proposed 8-Processor Scheme for The Extended Twisted Edwards Curve over Prime Fields, Which is Shown in Figure 5.10, With Respect to The Computational Time Complexities for The Bit Lengths of The Underlying Fields of NIST Recommended Curves [16].
List of Algorithms
1 The Addition Law for Elliptic Curve E over $\mathbb{F}_p$ in Affine Coordinates [1]
2 The Addition Law for Elliptic Curve E over $\mathbb{F}_{2^m}$ in Affine Coordinates [1]
3 Left-to-Right Double-and-Add Binary Scalar Method [1, 91]
4 Left-to-Right Binary NAF Scalar Method [1, 91]
5 Left-to-Right Standard Signed Radix-r Scalar Multiplication [126]
6 Addition Modulo p [103]
7 Subtraction Modulo p [103]
8 Left Shift Multiplication [101]
9 Proposed Serial-Out Bit-Level Mastrovito Multiplier for ω-nomials $x^m + x^{t_1} + \cdots + x^{t_{\omega-2}} + 1$
10 Proposed Serial-Out Bit-Level ω-nomials $x^m + x^{t_1} + \cdots + x^{t_{\omega-2}} + 1$
11 Proposed LSD-First Parallel-Out Digit-Level Double-Multiplication Operation
12 Proposed Non-Seven Encoding Method
13 Proposed Signed Radix-8 Scalar Multiplication
List of Abbreviations
3DES Triple Data Encryption Standard
ADD Point Addition Operation
ADDDBL Considering Both the ADD and DBL as a Single Composite Operation
AES Advanced Encryption Standard
AFIPS American Federation of Information Processing Societies
ANSI American National Standards Institute
ASIC Application-Specific Integrated Circuit
DBL Point Doubling Operation
DLP Discrete Logarithm Problem
DSA Digital Signature Algorithm
DSS Digital Signature Standard
ECC Elliptic Curve Cryptography
ECDLP Elliptic Curve Discrete Logarithm Problem
ECDSA Elliptic Curve Digital Signature Algorithm
ECSM Elliptic Curve Scalar Multiplication
EC-operations Elliptic Curve Group (Arithmetic Points) Operations
EEA Extended Euclidean Algorithm
FIPS Federal Information Processing Standards
FLT Fermat’s Little Theorem
FPGA Field-Programmable Gate Array
IDEA International Data Encryption Algorithm
IEC International Electrotechnical Commission
IEEE Institute of Electrical and Electronics Engineers
IETF Internet Engineering Task Force
IFP Integer Factorization Problem
ISO International Organization for Standardization
LSB Least Significant Bit
LUT Look-Up Table
MAC Message Authentication Code
mADD Mixed Addition Operation
MOF Mutual Opposite Form
MSB Most Significant Bit
NAF Non-Adjacent Form
NB Normal Basis
NIST National Institute of Standards and Technology
NSA National Security Agency
PB Polynomial Basis
PK Public-Key based scheme
PKI Public Key Infrastructure
POBL Parallel-Out Bit-Level
RC4 Rivest Cipher 4 Stream
RC5 Rivest Cipher 5 Stream
RCS Right Cyclic Shift
RFID Radio Frequency IDentification
RSA Rivest-Shamir-Adleman
SCA Side-Channel Attack
SECG Standards for Efficient Cryptography Group
SOBL Serial-Out Bit-Level
SODL Serial-Out Digit-Level
SSCA Simple Side-Channel Attack
SUB Point Subtraction Operation
TDEA Triple Data Encryption Algorithm
uADD Unified Addition Operation
VHDL Very-high-speed-integrated circuit Hardware Description Language
VLSI Very Large Scale Integration
VoIP Voice over Internet Protocol
List of Notations
⊕ An Addition Group Law Binary Operation on a Curve
⊖ Point Subtraction (SUB) Operation
∆ The Discriminant of a Curve
|G| The Order of G
$[v_j, \cdots, v_i]$ The Range of Bits in The Vector v From Position i to Position j, j > i
$\langle r_j, \cdots, r_i \rangle$ The Range of Bits in The Register 〈R〉 From Position i to Position j, j > i
O The Point at Infinity
α Root of an Irreducible Polynomial
ω Non-Zero Terms in an Irreducible Polynomial
ω-nomials An Irreducible Polynomial With ω Non-Zero Terms
−P A Unary Operation on a Curve E, Namely, Point Inverse
A Finite Field Arithmetic Addition
An(F) Affine n-Space over The Field F
Bit-latency Number of Clock Cycles Required to Generate The First
Output Bit
char(F) The Characteristic of F
D Finite Field Arithmetic Multiplication by Curve Constant
E An elliptic curve
ei || v The Process of Concatenating an Element ei and a Vector v
E(F) A Group Formed by The Points on E Defined over The Field F
E(a1, · · · , a6) Curve E Parameter
F An Arbitrary Finite Field
$\mathbb{F}_{2^m}$ Finite Fields over Characteristic Two (Binary Extension Fields)
$\mathbb{F}_p = \{0, 1, \cdots, p-1\}$ Finite Field With p Elements
$\mathbb{F}_p^* = \{ a \in \mathbb{F}_p \mid \gcd(a, p) = 1 \}$ The Units mod p
$\mathbb{F}_p^+ = \{1, 2, \cdots, p-1\}$ The Non-Zero Residues mod p
g Generator of $\mathbb{F}_p$
$G = \{ g^r : 0 \le r \le p \}$ A Random Finite Cyclic Group of p Elements
$g_a = g^a \bmod p$ A’s Public Key
$g_b = g^b \bmod p$ B’s Public Key
$g^{ab} = g^{ba}$ Shared Key Between A and B
gcd The Greatest Common Divisor of a and b
GF($2^m$) Finite Fields over Characteristic Two (Binary Extension Fields)
GF(p) Finite Fields over Prime Integer
I Finite Field Arithmetic Inversion
k A Scalar Integer
kP Elliptic Curve Scalar Multiplication of an Elliptic Curve Point P
With a Scalar k
kPcost Cost at The Arithmetic Point Level for Computing ECSM
m Positive Integer
M[↓ n] A Down Shift of The Matrix M by n Positions, Emptied Positions
After The Shifts are Filled by Zeros
M(: , j) The jth Column of The Matrix M
M(i , :) The ith Row of The Matrix M
M(i : j) An Entry With Position (i, j) of The Matrix M
M( j , :) [→ 1] A Right Shift of The jth Row of The Matrix M by 1 Position,
Emptied Positions After The Shifts are Filled By Zeros
p Prime Number
P A Point on a Curve
P(x) An Irreducible Polynomial
P(xP, yP) A Point With Coordinates (xP, yP) in Affine Coordinates
P(XP, YP, ZP) A Point With Coordinates (XP, YP, ZP) in Projective Coordinates
r-NAF Radix-r Non-Adjacent Form
S Finite Field Arithmetic Square
trinomials An Irreducible Polynomial With 3 Non-Zero Terms
$v[f_0, \rightarrow 1]$ A Right Shift of The Vector v By One Bit With Cell $f_0$ Fed in Its Left-Most Bit, i.e., For The Vector v of Length l Bits, $v[f_0, \rightarrow 1] = [f_0, \overbrace{0, \cdots, 0}^{l-1}] + v[\rightarrow 1]$
w-NAF Width-w Non-Adjacent Form
Z Set of All Integers
1 Introduction
Ages ago, there was only a negligible probability of confidential data being eavesdropped,
monitored, stolen, or destroyed without being noticed, because direct talk was the only way
for people to communicate, payments were made in cash, and secret documents were kept in
tightly sealed boxes (e.g., treasure chests). However, with the proliferation of communication
technologies over the last couple of decades, new communication channels have been created to
satisfy people’s desire to enhance the quality of life in their communities. Over time, those
channels have become faster, wider, and more accessible to everyone. Nowadays, an enormous
amount of data floods the communication lines each day, carrying love notes, digital cash,
secret corporate documents, and other treasured information. These communication trends make
life easier but at the same time expose more security risks and reveal more information about
individuals and companies than is appreciated.
Cryptography, the art and science of hiding data, is a mathematical tool used by security
engineers to protect data against unauthorized access or manipulation. Cryptography plays a
crucial role in many areas, from communication and commercial applications on the Internet to
many other digital applications. It gives those responsible for security the tools required to
hide data, control access to them, verify their integrity, and estimate the cost and time
required to break the security. Understanding the principles on which cryptography is based
requires a cryptographer to be well versed in cryptographic algorithms and protocols,
computational complexity, and a range of topics in computer arithmetic and mathematics.
In the early days of cryptography, the secret keys used to encrypt and decrypt messages
were exchanged through a direct meeting of the parties or through a trusted third party.
Public-key (PK) based cryptography overcomes this key distribution and key management
problem by relying on number-theoretic problems that are believed to be computationally hard.
PK based cryptography is essential in today’s digital communication and storage infrastructure.
PK schemes can benefit a wide range of applications, from high-performance network processors
down to hand-held and resource-constrained appliances.
[Figure 1.1 layers, top to bottom: elliptic curve primitives (message encryption, message
decryption, message authentication, digital signature generation, ...); scalar multiplication
Q = kP; elliptic curve point routines (point addition, doubling, tripling, halving,
quadrupling, quintupling, ...); finite field arithmetic operations (addition, squaring,
multiplication, inversion).]
Figure 1.1: Hierarchical Scheme for The Implementation of ECC Crypto-System [1].
Elliptic curve cryptography (ECC) is a technology of choice for developing PK based
cryptography, owing to its resistance against powerful index-calculus attacks. Cryptographic
mechanisms based on elliptic curves depend on the arithmetic of points on the curve. The most
important arithmetic operation is multiplying a point on the curve by an integer, which is known
as the elliptic curve scalar (or point) multiplication (ECSM) operation. The computation of an
ECC based crypto-system can be visualized as a hierarchy of operations, as illustrated
in Figure 1.1. This hierarchical view, which represents multiple levels of abstraction, is helpful
for understanding how the operations are implemented and executed. The top
level of the figure represents the elliptic curve cryptographic applications that make up a secure
communication. The next level represents the scalar multiplication algorithm, which depends on
a number of elliptic curve group operations. Finally, at the bottom level, the elliptic curve
arithmetic operations are defined using number theory; that is, all the low-level
operations are carried out in finite fields. For cryptographic applications, elliptic curves are
usually defined over two types of finite field: prime fields $\mathbb{F}_p$ with p a large prime, or binary
extension fields, i.e., fields of characteristic two $\mathbb{F}_{2^m}$ 1. We restrict ourselves to prime values
of m to avoid weaknesses arising from sub-fields 2. The most common finite field operations
used in ECC are addition/subtraction, multiplication, squaring, and inversion/division.
In general, an ECC cryptographic primitive requires only a few scalar multiplication operations, but
hundreds of elliptic curve group operations and correspondingly many thousands of finite field
arithmetic operations.
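To make the relationship between one scalar multiplication and the many group operations beneath it concrete, the following minimal sketch counts the doublings and additions performed by the textbook left-to-right double-and-add method. It is only an illustration and not the scheme proposed in this thesis: the group is a stand-in (integers modulo a hypothetical n under addition) so that the code runs without any curve arithmetic, and the names `scalar_multiply`, `n`, `k`, and `P` are assumptions introduced here.

```python
# Left-to-right double-and-add over a generic additive group.
# The group used below (integers modulo n under addition) is only a
# stand-in so that the sketch runs without any elliptic curve code.

def scalar_multiply(k, point, add, double, identity):
    """Return (k * point, number_of_doublings, number_of_additions)."""
    result = identity
    doublings = additions = 0
    for bit in bin(k)[2:]:          # scan the scalar from the MSB down
        result = double(result)     # one group doubling per scalar bit
        doublings += 1
        if bit == "1":              # one group addition per set bit
            result = add(result, point)
            additions += 1
    return result, doublings, additions

# Hypothetical stand-in group parameters (assumptions, not curve data).
n = 2**163 - 1
add = lambda a, b: (a + b) % n
double = lambda a: (2 * a) % n

k = (1 << 162) + 12345              # an arbitrary 163-bit scalar
P = 7                               # an arbitrary group element
result, doublings, additions = scalar_multiply(k, P, add, double, 0)
print(doublings, additions)
```

With a 163-bit scalar the loop performs 163 doublings plus one addition per set bit of k. In an actual ECC implementation each of these group operations would in turn expand into several finite field multiplications, squarings, and additions.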
In this thesis, we investigate the operations at the lower three levels of the ECC hierarchical
scheme, namely, the finite field arithmetic level, the elliptic curve point arithmetic level, and the
ECSM operation. Emphasis has been placed on the development of hardware-oriented algorithms and
operations for ECC. In this regard, we introduce several algorithms for speeding up or
reducing the area of both the finite field multiplication and the ECSM operation.
This opening chapter aims to provide the reader with a clear idea of what this thesis
focuses on. In this chapter, we introduce the topics at a high level of abstraction. More
specifically, in Section 1.1, we elaborate on the motivation of this research and present its main
contributions; in Section 1.2, we list the main and specific objectives pursued in this research;
and finally, we outline the rest of the thesis in Section 1.3.
1.1 Motivation
There are two main forms of cryptographic schemes. These are PK based schemes 3 and
private-key based schemes 4. In private-key based schemes, the communicating parties
must share a key in a “secret way”. The underlying primitives used by private-key based
schemes are generally not computationally intensive. However, in order for the key to be dis-
tributed, a secure communication channel 5 must be established, or the involved parties must
have access to a trusted third party such as the Kerberos authentication system, which is re-
sponsible for key distribution 6. In practice, both possibilities are problematic, as the key estab-
lishment scheme does not scale well in multi-entity systems. Further, keys must be stored and
1 Elliptic curves can also be defined over fields of characteristic three, $\mathbb{F}_{3^m}$; however, this is not as
common as the prime and binary extension fields.
2 Curves over $\mathbb{F}_{2^m}$ for some non-prime values of m are avoided for cryptographic applications since the
ECDLP can be reduced [2, 3, 4].
3 These are typically based on number theory and usually need complex arithmetic operations.
4 These are usually built with substitution and permutation networks.
5 A secure channel is one on which an adversary does not have the ability to reorder, delete, insert, or
read messages.
6 These are the set of processes and mechanisms, which support key establishment and the maintenance of
ongoing keying relationships between parties, including replacing older keys with new keys as necessary.
distributed for each pair of entities, causing the number of keys to grow as N(N − 1)/2,
where N is the number of entities in the system. For private-key based schemes, confidentiality
is achieved via an encryption algorithm, e.g., the Triple Data Encryption Standard (3DES) [5],
the Advanced Encryption Standard (AES) [6], the Rivest ciphers RC4 (a stream cipher) and RC5
(a block cipher) [7], and IDEA [8]. Data integrity and data origin authentication are accomplished
by message authentication codes (MACs/HMACs) [1] or keyed hash functions. However, the
non-repudiation feature is not achievable, as the secret key is not in the possession of a single entity.
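As a quick worked instance of the pairwise key count mentioned above (the values of N are chosen here purely for illustration), the quadratic growth is already visible for modest system sizes:

```latex
\[
  \left.\frac{N(N-1)}{2}\right|_{N=10} = 45,
  \qquad
  \left.\frac{N(N-1)}{2}\right|_{N=1000} = 499\,500 .
\]
```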
PK based schemes, on the other hand, have higher computation demands, which reduce
their throughput and make them difficult to implement in hardware. However, because
private-key based schemes suffer from the key distribution, key management, and
non-repudiation problems, there is an increasing trend of implementing PK based schemes in
hardware. PK based schemes allow two (or more) communicating parties to negotiate a secret key
on demand. PK based applications and services entering worldwide trade on the Internet have
been expanding enormously in the last few years. A few examples are the numerous financial
transactions that occur over the Internet in daily life, remote access through virtual private
networks, and Voice over Internet Protocol (VoIP).
In the 19th century, the idea of using the intractability of a number theoretic problem for
cryptography was introduced by William Stanley Jevons [9]. About a century later, Diffie and
Hellman invented their famous key exchange protocol based on the discrete logarithm problem
(DLP) [10] 7. Rivest, Shamir, and Adleman then introduced the first practical PK encryption
and signature scheme 8. Their scheme, referred to as RSA 9, is based on another hard
mathematical problem, i.e., the integer factorization problem (IFP) [12]. In 1984, ElGamal invented
another class of powerful and practical PK based schemes, which are also based on the DLP
[13]. For PK based schemes, confidentiality is achieved by means of encryption. For that
purpose, the most commonly used building blocks are RSA, ElGamal and elliptic curve variants of
ElGamal. Data integrity and origin authentication with non-repudiation can be accomplished
by signature schemes such as RSA-PSS [14, 15], the digital signature algorithm (DSA)
[16], and the elliptic curve digital signature algorithm (ECDSA) [17].
Elliptic curves were studied long before they were introduced to cryptography.
Based on the specific properties of elliptic curves, in the mid-1980s, Victor Miller [18] and
Neal Koblitz [19] independently proposed using the group of points on an elliptic curve defined
7 This paper is the one that officially gave birth to PK based cryptography. There is a companion paper
entitled “Multiuser Cryptographic Techniques” that was presented by the same authors at the National Computer
Conference that took place on June 7-10, 1976, in New York City [11].
8 It is currently the most deployed scheme but is intended to be supplanted by elliptic curve cryptography.
9 RSA, named after its inventors Rivest, Shamir and Adleman, was proposed in 1977.
over a finite field in PK based cryptography. Since then, ECC has been intensively studied, and
has become popular among other common PK based schemes, i.e., Diffie-Hellman [10], RSA
[12], and ElGamal [13]. The main advantage of ECC is the absence of known sub-exponential
algorithms to solve the underlying hard problem, namely, the elliptic curve discrete logarithm
problem (ECDLP). ECC therefore features smaller security parameters while providing
protection equivalent to that of factoring-based and classical discrete logarithm techniques for PK
based cryptography.
The technology of ECC is currently well accepted in industry and the academic community
and has been included in the following major standards: the German Brainpool Standard
[20], Institute of Electrical and Electronics Engineers (IEEE) 1363-2000 [21], the National
Institute of Standards and Technology (NIST) US Federal Information Processing Standard
(FIPS) 186-3 [16], American National Standards Institute (ANSI) X9.62 [17], Standards
for Efficient Cryptography Group (SECG) [22], and ISO/IEC 15946-2 [23]. ECC has also
become the standard to protect U.S. information: the United States’ National Security Agency
(NSA) restricts the use of PK based schemes in “Suite B” to ECC [24]. It is also worth
mentioning that ECC is utilized by Cisco Systems for its network infrastructure solutions, Research
In Motion (BlackBerry) for its enterprise software, Unisys for banking applications, and
Motorola and Sony for their mobile security [25].
The underlying cryptographic primitive in ECC is the ECSM operation. This
operation is the most computationally intensive step in every ECC based PK scheme, and
in most cases its computation becomes the performance bottleneck.
In this thesis, with respect to special-purpose hardware, we explore
and optimize the performance of the operations underlying ECC. The specific motivations for
the research presented in this thesis and the corresponding contributions are summarized as
follows:
1.1.1 Field Multiplication Operation
Motivation: The motivation for developing fast and area-efficient hardware solutions for the
arithmetic multiplication operation comes from two facts. First, arithmetic multipliers are
widely used in applications in different fields, such as error-control coding, cryptography,
and digital signal processing [26, 27, 28, 29]. Second, other complex
and time-consuming operations, such as exponentiation and division/inversion, are implemented
by the iterative application of multiplication operations. In PK based schemes, many
algorithms for RSA and ECC are built on the arithmetic multiplication of very
large operands, i.e., sizes from 163 to 4096 bits. Hence, this important operation has a high
impact on the performance of the entire crypto-system.
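The dependence of inversion and exponentiation on repeated multiplication can be sketched in software. The block below is a generic MSB-first, bit-serial shift-and-add multiplier in the polynomial basis, followed by exponentiation and Fermat-style inversion built only from that multiplier. It is not the SOBL architecture studied in this thesis; the toy field (m = 4, reduction polynomial x^4 + x + 1) and all function names are assumptions chosen for illustration.

```python
# Bit-serial (MSB-first) multiplication in GF(2^m), polynomial basis.
# Field elements are Python ints whose bit i is the coefficient of x^i.

M = 4
POLY = 0b10011   # x^4 + x + 1, irreducible over GF(2); toy field for illustration

def gf_mul(a, b, m=M, poly=POLY):
    """Multiply a and b in GF(2^m), consuming one bit of b per iteration."""
    acc = 0
    for i in reversed(range(m)):   # one iteration loosely models one clock cycle
        acc <<= 1                  # multiply the accumulator by x
        if (acc >> m) & 1:         # degree reached m: reduce modulo poly
            acc ^= poly
        if (b >> i) & 1:           # add (XOR) a when the current bit of b is set
            acc ^= a
    return acc

def gf_pow(a, e, m=M, poly=POLY):
    """Square-and-multiply exponentiation built only from gf_mul."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a, m, poly)
        a = gf_mul(a, a, m, poly)
        e >>= 1
    return result

def gf_inv(a, m=M, poly=POLY):
    """Inverse of a nonzero element as a^(2^m - 2), by Fermat's little theorem."""
    return gf_pow(a, (1 << m) - 2, m, poly)

print(gf_mul(0b0010, 0b1000))      # x * x^3 = x^4 = x + 1 in this field
```

In this sketch the complete product is available only after all m iterations, and a single inversion already costs several multiplications, which is why the multiplier dominates the cost of the higher-level operations.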
The serial-out bit-level (SOBL) multiplication scheme is characterized by an important
low-latency feature: it can sequentially generate one output bit of the multiplication
result in each clock cycle. To the best of our knowledge, the best architecture for SOBL
multiplication is the scheme proposed by Reyhani-Masoleh in [30]. It is highly desirable to investigate
and develop other methods for such a serial-out multiplier structure in order to lower its area
cost and its critical path delay.
Contribution 1: We proposed novel schemes for the SOBL finite field multiplication
operation that are constructed from an irreducible polynomial with ω, ω ≥ 3, non-zero terms
(denoted by ω-nomials). We showed that, in terms of area and time complexities, the smallest
proposed SOBL scheme outperforms the existing SOBL schemes available in the literature. In
addition, we showed that the proposed scheme can provide about 24-26% reduction in area
complexity and about 21-22% reduction in power consumption compared to the current
state-of-the-art bit-level multiplier schemes. Because of its reduced area and power
consumption, the proposed multiplier scheme is well suited for use by manufacturers of RFID
tags and sensor networks.
1.1.2 Bit-Level Finite Field Double Multiplication
Motivation: A multiplication scheme based on the SOBL structure has certain advantages
compared to the traditional parallel-out bit-level (POBL) multiplication structures [31]. For
instance, a hybrid-double multiplier has recently been proposed in $\mathbb{F}_{2^m}$ using the normal basis
(NB) representation [32, 33]. In that architecture, the hybrid-double multiplier is obtained by
combining and interleaving a SOBL Gaussian NB multiplier, implemented based on
[34], and a POBL NB multiplier based on [31]. It has been shown in [32, 33] that the
hybrid-double multiplier makes fast exponentiation and inversion possible. A multiplier
operating in the polynomial basis (PB), compared to one using the NB representation, has lower
hardware requirements and an easy-to-derive structure based on the defining irreducible
polynomial P(x) of the field [35]. Hence, it is desirable to investigate a similar hybrid-double
architecture using the PB representation to broaden its usefulness in performing arithmetic computations.
Contribution 1: In order to investigate the applicability of the proposed SOBL schemes,
we employed them to present, to our knowledge, the first approach
for a hybrid-double multiplication architecture in the PB over $\mathbb{F}_{2^m}$. This hybrid multiplier
structure operates on three finite field elements and performs two multiplication tasks with
a latency comparable to that of a single multiplication.
Contribution 2: We extended the traditional POBL multiplier schemes presented in [31]
to propose new LSB-first/MSB-first POBL double multiplication architectures, which perform
two multiplications in the PB over $\mathbb{F}_{2^m}$ together after 2m clock cycles. To obtain actual
implementation results, all the proposed schemes (10 schemes in total) were coded in
Very-High-Speed-Integrated-Circuit Hardware Description Language (VHDL) and implemented on
application-specific integrated circuit (ASIC) technology, over both $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$.
1.1.3 Elliptic Curve Scalar Multiplication
Motivation: Secure PK based schemes are essential in today’s age of rapidly growing Internet
communication. ECC has become popular due to its shorter key size requirement in comparison
with other existing PK based algorithms. Elliptic curves are widely used in many
cryptographic primitives such as digital signatures, key exchange, and data encryption/decryption. The
most important and time-consuming operation in ECC is the scalar multiplication operation.
The speed of the scalar multiplication operation plays an important role in the security and
the efficiency of an implementation of the whole system. Designing secure implementations
requires taking physical attacks into account. Due to its importance, exploring new approaches
and algorithms to perform the scalar multiplication operation is an interesting problem.
It is stated in [36] that the fastest known approach to performing scalar multiplication over
prime fields is due to Hisil et al. [37]. Hisil et al. used the maximum possible number of parallel
operations, i.e., 4 processes, for computing on the extended twisted Edwards curve model. This
has motivated us to come up with a new scalar multiplication scheme that allows incorporating
8 parallel operations for computing the point arithmetic (underlying group) operations on the
extended twisted Edwards curve model.
Contribution 1: We proposed a novel approach for computing the ECSM operation that can
be used on any abelian group. We analysed the security of our approach and showed that it
holds against both simple side-channel (power analysis) attacks (SSCA) [38, 39, 40]
and safe-error (C-safe) fault attacks [41, 42, 43].
Contribution 2: We employed the proposed approach for computing the scalar
multiplication to present a new design for the implementation of an ECSM operation on an
extended twisted Edwards curve model over a prime field, incorporating 8 parallel operations. We
showed that, in comparison to the other SSCA protected schemes over prime fields, the proposed
design is the fastest scalar multiplication scheme reported in the literature.
1.2 Objectives
In line with the motivations formulated above, we define the research objectives that we
consider worth pursuing. There are two main objectives and one overall goal:
1. Based upon the analysis of recent publications on hardware design for the finite field
multiplication operation, we aimed at proposing a new hardware architecture for the SOBL
multiplication operation that is more efficient in terms of speed, size (implementation
cost), or power and energy consumption compared to previously published results.
We also aimed at extending the traditional POBL multiplication hardware scheme to a
POBL double multiplication operation that performs two sequential multiplications. We
further aimed at proposing a new hardware architecture for hybrid-double
multiplication. In order to develop a new arithmetic multiplication scheme over
the fields of characteristic two, i.e., $\mathbb{F}_{2^m}$, we have to:
• Perform an extensive literature survey to identify the most suitable basis (e.g.,
polynomial basis [31, 44, 45, 46, 47], normal basis [34, 48, 49, 50, 51], shifted polynomial
basis [52], etc.) to represent the finite field elements.
• Familiarise ourselves with details of the hardware structures of the arithmetic multi-
plication unit (e.g., bit-level [31], digit-level [53], bit-parallel [46], pipelined struc-
ture [54, 55], hybrid structure [32], etc.).
• Carefully study the existing algorithms and approaches for the arithmetic
multiplication operations over $\mathbb{F}_{2^m}$ (e.g., Mastrovito multiplication [56, 46], dual basis
multiplication [57, 58], Montgomery multiplication [59, 60], etc.).
• Carefully study the irreducible polynomials (e.g., irreducible trinomials [61],
pentanomials [45], all-ones [62], and ω-nomials [30], etc.) that are associated with the
arithmetic multiplication over a finite field $\mathbb{F}_{2^m}$.
• Carefully model the most promising algorithms in VHDL and analyse them, as a
basis for creating our own solutions that are as efficient as possible.
• Familiarise ourselves with many of the Synopsys Design Compiler tools to verify
the correctness of our schemes.
2. Based upon the hardware architecture and the analysis of the different algorithms for the
implementation of ECSM operation, we aimed at building a new hardware scheme for
the scalar multiplication operation that would work on any abelian group. We also aimed
at utilizing the proposed ECSM scheme for the implementation of scalar multiplication
in a special model of elliptic curves known as extended twisted Edwards model. There
are several design options for implementing the scalar multiplication. In order for us to
propose a new approach to computing the scalar multiplication the following must be
considered:
• Select an appropriate addition chain method (e.g., sliding window [1, 63, 64], multi-
based [65, 66, 67], ternary expansion [68], methods based on signed digit represen-
tations [69, 70], etc.).
• Use a representation of the scalar such that the number of point arithmetic opera-
tions is reduced (e.g., non-adjacent form (NAF) [71], radix-r NAF (r-NAF) [72],
width-w NAF (w-NAF) [1, 64, 73], Frobenius map [73, 74], etc.).
• Select an appropriate elliptic curve model with corresponding parameters (e.g., the
Hessian [75], Edwards [76], Huff’s [77], Koblitz [74], Jacobi quartics [78] curve
models, etc.).
• Select the most appropriate coordinate system to represent elliptic curve points
(e.g., the affine, projective, mixed, x-only coordinates, etc.).
• Utilize efficient point arithmetic operation formulas based on a combination of the
underlying finite field arithmetic operations (e.g., implementing point halving in-
stead of the doubling over binary fields [79], point tripling over fields of character-
istic three [66, 68], and using composite operations, i.e., 2Q + P [80]).
• Optimize the architectures of the underlying finite field arithmetic operations (e.g.,
utilizing pipelining methods [81, 82], parallel operations schemes [37, 83, 84, 85,
86, 87, 88], etc.).
• Ensure that the security of a method holds against both side-channel attacks and
safe-error fault attacks.
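As an illustration of how the choice of scalar representation affects the number of point operations, the following sketch computes the non-adjacent form (NAF) mentioned in the list above, using the standard textbook recoding. It is not the representation used by the scheme proposed in this thesis, and the function name `naf` is an assumption introduced here.

```python
def naf(k):
    """Return the NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d            # makes k divisible by 4, forcing the next digit to 0
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits

print(naf(7))                 # 7 = 8 - 1
```

Each nonzero digit corresponds to one point addition or subtraction in a double-and-add style algorithm. For k = 7 the binary form has three nonzero digits, whereas the NAF has two.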
In the end, the goal is to utilize the proposed ECC underlying operations to achieve a
performance gain in the implementation of the elliptic curve crypto-coprocessor over both $\mathbb{F}_p$ and
$\mathbb{F}_{2^m}$ 10. Hardware implementations of ECC over $\mathbb{F}_{2^m}$ are more popular than those over $\mathbb{F}_p$
due to their carry-free addition [89]. However, in the case of field-programmable gate arrays
(FPGAs), carry-chain adders are optimized so that arithmetic over $\mathbb{F}_p$ is less complex and more
suitable for FPGA implementation [90].
10 Prime fields are commonly used for software implementations because integer arithmetic is more
optimized in today’s microprocessors.
1.3 Thesis Organization and Outlines
This thesis is divided into six chapters. The next chapter provides a basic introduction and
preliminaries, while the following three chapters, i.e., Chapters 3, 4 and 5, present the results
of our contributions. In the following, we give an overview of the structure of the thesis and
highlight the main contributions:
• Chapter 1: Introduction. This opening chapter is intended to quickly familiarize the
reader with the work developed in this thesis. The chapter starts by identifying the
motivation for our research topics and highlighting the main contributions. It then
proceeds with outlining the objectives and the organization of the thesis.
• Chapter 2: Preliminaries. The purpose of this chapter is to collect
the prerequisites that form the basis for the novel contributions that follow in the main
chapters, i.e., Chapters 3, 4, and 5. This chapter gives background related to ECC-based
crypto-systems. The advantages of using PK based schemes are first presented
before providing an overview of the Diffie-Hellman key exchange protocol. The ECC
crypto-system is then reviewed in more detail and its main algorithms and operations are
provided. Since the elliptic curve point arithmetic operations, i.e., the group law and scalar
multiplication operations, rely on the finite field arithmetic operations, an introduction
to the modular arithmetic algorithms over both $\mathbb{F}_p$ and $\mathbb{F}_{2^m}$ is provided. Further treatment
of the introductory material presented in this chapter can be found in [1, 91, 92, 93, 94, 95, 96, 97].
• Chapter 3: Architectures for SOBL Multiplication Using Polynomial Basis. This
chapter is based on our work in [98]. In this chapter, novel schemes for the SOBL
multiplication operation using the polynomial basis are introduced. The chapter then proceeds with
analysing the performance of the proposed SOBL schemes and comparing them to their
bit-level counterparts.
• Chapter 4: Architectures for Hybrid-Double Multiplication Using Polynomial Basis.
Part of this chapter can be found in our work in [98]. The chapter starts by
presenting new double multiplication architectures and new hybrid-double multiplication
architectures using the polynomial basis. Then the performance of the proposed
architectures is investigated by implementing each double multiplication architecture
on ASIC technology.
• Chapter 5: New Regular Radix-8 Scheme for Elliptic Curve Scalar Multiplication
Without Pre-computation. This chapter is based on our work in [99]. It starts with an
overview of side channel attacks (SCAs) and their countermeasures. Then, a novel scheme
for the ECSM operation is introduced. The chapter shows how the properties of the proposed
ECSM scheme enhance parallelism at both the point arithmetic and the finite field arithmetic
levels. It also provides detailed security analyses of the proposed scheme and shows that
it can be used to provide a natural protection from SCAs based on simple power analysis
as well as safe-error fault attacks. Finally, it shows how the proposed ECSM scheme can
be employed in a new hardware design for the implementation of an ECSM
on an extended twisted Edwards curve over a prime field incorporating 8 parallel operations.
• Chapter 6: Summary and Future Work. In this final chapter, we summarize our
contributions, present our conclusions, and comment briefly on possible directions for
future work.
2
Preliminaries
The ECC-based crypto-system is considered to be one of the best choices for implementing
PK based schemes. Although the operations involved in ECC are more computationally
intensive, the significantly smaller key parameters that can be used
by ECC lead to a more efficient implementation compared to other PK based crypto-systems.
ECC standards provide different parameter options that can meet a wide range of design
requirements, which are suitable for applications ranging from a “tiny chip” in a
resource-constrained device to NSA Suite B and high-end embedded devices [100]. Cryptographic
mechanisms based on elliptic curves depend on the arithmetic of points on the curve. The most
important arithmetic operation is multiplying a point on an elliptic curve by an integer. This operation
is known as the elliptic curve scalar (or point) multiplication (ECSM) operation. A cryptographic
device for ECC is supposed to perform this operation efficiently and securely. ECSM is
performed by implementing and evaluating the elliptic curve point routines, e.g., the point addition
(ADD) and point doubling (DBL) operations. Both operations are built from the arithmetic
operations in the underlying finite field.
In ECC, parameters such as keys and point coordinates can be seen as finite field
elements. Hence, all the operations involved in ECC are carried out in finite fields. Elliptic curves
are usually defined over two types of finite field: prime fields $\mathbb{F}_p$ with p a large prime, or
binary extension fields, i.e., fields of characteristic two $\mathbb{F}_{2^m}$, with m a prime integer. In order
to understand, build, analyse and study ECC systems, we need sufficient
knowledge of the elliptic curve’s underlying field arithmetic operations over both $\mathbb{F}_p$ and
$\mathbb{F}_{2^m}$.
This chapter gives background related to ECC-based crypto-system. The advantages of
using PK based schemes are first presented before providing an overview of the Diffie-Hellman
key exchange protocol. The ECC crypto-system is then reviewed in more detail and its main
algorithms and operations are provided. In addition, this chapter provides an introduction
to the modular arithmetic algorithms over both $\mathbb{F}_p$ and $\mathbb{F}_{2^m}$. More precisely, the finite field
addition/subtraction, multiplication and division/inversion algorithms suitable for
hardware implementations are presented. Further treatment of the introductory material presented
in this chapter can be found in [1, 91, 92, 93, 94, 95, 96, 97].
2.1 Public-Key Based Schemes
As discussed in the Motivation section of Chapter 1, modern cryptography can be categorized
into PK and private-key based schemes. In private-key based schemes, the two communicating
parties agree upon a single key known only to them. The computation of private-key
based schemes is typically very efficient 1. However, they suffer from the significant drawbacks of key
distribution and key management. To overcome these problems, Diffie and Hellman
introduced a practical algorithm for key exchange [10]. They showed that two parties can
establish a shared secret over an insecure channel 2 without having any prior knowledge of one
another.
The simplest version of the Diffie-Hellman key exchange protocol uses $\mathbb{F}_p^{*}$, the
multiplicative group of integers modulo p. There is also a public generator element $g \in \mathbb{F}_p$, which
is a primitive root 3 mod p. Figure 2.1 shows how the Diffie-Hellman key exchange can be
used when two parties A and B wish to communicate securely 4. As shown in this figure, both
A and B have a public and a private key 5. The private key is a randomly chosen integer, which
we denote by a for party A and b for party B such that $a, b \in \mathbb{F}_p^{+}$. Then, the protocol can be
defined as follows:
1. Party A hides his secret key by raising the generator g to the power of his private key,
i.e., A computes $g_a = g^a \bmod p$.
1 Therefore, it is usually employed for confidentiality purposes and when the non-repudiation feature is not
required in addition to data integrity and origin authentication.
2 An insecure channel is one from which parties other than those for which the information is intended can
reorder, delete, or read.
3 A number g is a primitive root modulo p if every number co-prime to p is congruent to a power of g modulo
p.
4 The scheme presented here is the basic one and is used for the illustrative purpose, additional features (such
as padding plaintext messages with random strings prior to encryption) to the schemes should be added before it
can be considered to offer adequate protection against real attacks.
5 To avoid ambiguity, a common convention is to use the term private key in association with public-key
based schemes, and private-key based scheme in association with symmetric-key based crypto-systems.
[Figure 2.1 (diagram omitted). Recoverable labels: Party A and Party B each hold a private key (an integer, a and b respectively); the modulus p and the generator g are publicly known. Each party (1) calculates its public key, $g_a = g^a \bmod p$ or $g_b = g^b \bmod p$, sends it over the insecure channel, and (2) calculates the shared secret key, $g_{ab} = g_b^{\,a} \bmod p$ at party A or $g_{ab} = g_a^{\,b} \bmod p$ at party B.]
Figure 2.1: Diffie-Hellman Key Exchange Scheme [1, 10].
2. The value $g_a$ (called A’s public key) is then sent over an insecure channel to B, who
can exponentiate $g_a$ and compute $g_{ab} = g_a^{\,b} \bmod p$.
3. Party B computes his public key, i.e., $g_b = g^b \bmod p$, and sends it over an insecure
channel to A, who can compute the shared secret $g_{ab} = g_b^{\,a} \bmod p$.
Both parties are now in possession of a shared secret key $g_{ab}$. The individual secrets, i.e.,
private keys, of both parties are assumed to be safe under the DLP [101], which can be defined
as follows. Let $G = \{ g^r : 0 \le r \le p - 1 \}$ be a random cyclic group of p elements generated
by $g > 1$. Given a primitive element g and a random element $s = g^r \in G$, it is very hard to
compute $r = \log_g s$.
Despite the difficulty of recovering $g_{ab}$ from $g$, $g^a$, and $g^b$, there is still the “brute force”
method for solving this problem. An eavesdropper can successively compute higher
powers of g until one matches the public key. This requires at most $|g|$ multiplications, where $|g|$
is the order of g in the group G. In practice, however, $|G| \approx 10^{160}$ and $p \approx 2^{1880}$ [102] 6,
and hence brute-force attacks on schemes based on the DLP are intractable [103].
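The brute-force search described above can be sketched in a few lines. The parameters below (p = 17, g = 8, secret exponent 5) are toy values assumed purely for illustration, and the function name `brute_force_dlog` is hypothetical. The point of the sketch is that the loop length is bounded by the group size, which is what makes the attack hopeless at cryptographic sizes.

```python
def brute_force_dlog(g, target, p):
    """Return the smallest r >= 1 with g^r = target (mod p), or None."""
    power = 1
    for r in range(1, p):          # at most p - 1 modular multiplications
        power = (power * g) % p
        if power == target:
            return r
    return None

# Toy parameters, assumed for illustration only; far too small to be secure.
p, g = 17, 8
secret = 5
public = pow(g, secret, p)         # the value an eavesdropper observes
recovered = brute_force_dlog(g, public, p)
print(recovered)
```

For these toy values the search succeeds after five multiplications. The number of iterations grows linearly with the order of g, so a group of roughly 10^160 elements puts the search far out of reach.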
Example:
1. Choose the modulus p = 17 and the generator g = 8.
2. Select positive integers as private keys, a = 5 and b = 4.
3. Compute the public keys $g_a = g^a \bmod p = 8^5 \bmod 17 = 9$ and $g_b = g^b \bmod p = 8^4 \bmod 17 = 16$.
4. Compute the shared secret, $g_{ab} = g_a^{\,b} \bmod p = 9^4 \bmod 17 = 16$, and $g_{ab} = g_b^{\,a} \bmod p = 16^5 \bmod 17 = 16$.
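The numbers in the example can be checked with Python's built-in three-argument `pow`, which performs modular exponentiation. The variable names below are assumptions introduced for this sketch. Note also that 8 has order 8 modulo 17 (since 8^8 mod 17 = 1), so it is not actually a primitive root; the exchange still works because both parties compute within the same subgroup.

```python
p, g = 17, 8                     # public modulus and generator from the example
a, b = 5, 4                      # private keys of parties A and B

g_a = pow(g, a, p)               # A's public key
g_b = pow(g, b, p)               # B's public key

shared_at_B = pow(g_a, b, p)     # B raises A's public key to the power b
shared_at_A = pow(g_b, a, p)     # A raises B's public key to the power a

print(g_a, g_b, shared_at_A, shared_at_B)   # prints: 9 16 16 16
```

Both parties arrive at the same value because (g^a)^b and (g^b)^a are equal modulo p.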
It is worth noting that this simple version of the Diffie-Hellman key exchange does not provide
authentication of the origin of information. Thus, one needs to make sure that the key
exchange process is initiated only between the intended users and not with an intruder in the middle.
6 [102] was published in 2005.
Table 2.1: NIST Recommended Key Sizes Measured in Number of Bits [104].
Symmetric Key Size (bits)    RSA & Diffie-Hellman Key Size (bits)    Elliptic Curve Key Size (bits)
80                           1024                                    160
112                          2048                                    224
128                          3072                                    256
192                          7680                                    384
256                          15360                                   521
This is done by defining an authenticated key agreement scheme, wherein the users first authenticate
themselves and then initiate the process after validating their identity or authority 7.
Although the PK based algorithms in use today, such as RSA and ElGamal, are believed to be
secure, some of their implementations have been challenged by fast factoring and integer
discrete logarithm attacks [1, 105, 106]. ECC, which can provide the same level of security with
a shorter key size, has therefore become more attractive [107]. For example, a 160-bit ECC is as
secure as a 1024-bit RSA crypto-system [108] 8. Table 2.1, from [104], lists the key sizes recommended
by NIST, as of 2009, for conventional encryption algorithms such as DES and
AES, together with the key sizes for RSA, Diffie-Hellman and elliptic curves that are needed to
provide equivalent security.
2.2 Introduction to Elliptic Curves
Elliptic curves are important objects occurring in many different areas (e.g., geometry, algebra,
number theory, complex analysis, etc.). Recently, elliptic curves have become widely used
in applications such as factoring [109] and cryptography [18, 19]. Elliptic curves are groups
that are defined over fields. An elliptic curve group allows only one binary operation (the
addition group law), which is built from the arithmetic operations in the
underlying finite field. There are many ways to represent elliptic curves, such as the Legendre
equation, cubic equations, quartic equations, and the intersection of two quadratic surfaces [95]. An
elliptic curve can also be expressed in the form of the Weierstraß equation.
Definition [1, 93] An elliptic curve E defined over a field F using affine coordinates is defined
by the Weierstraß equation
$$E(F) : y^{2} + a_{1}xy + a_{3}y = x^{3} + a_{2}x^{2} + a_{4}x + a_{6}, \tag{2.1}$$
where $a_1, a_2, a_3, a_4, a_6 \in F$ and $\Delta \neq 0$.
7 This is achieved by public key infrastructures (PKIs) like X.509.
8 The reason for this significant difference is the lack of a known index-calculus attack on elliptic curve
discrete logarithms [102].
Here $\Delta$ 9 is the discriminant of E(F) 10. Equation (2.1) is called the general Weierstraß equation
for elliptic curves. Miller [18] and Koblitz [19] were the first to show that the group of rational
points on an elliptic curve E over F can be used for the DLP in a PK based crypto-system.
Aside from all the solutions $(x, y) \in F \times F$ of the equation above, there is an extra point which cannot
be defined using the affine equation, but must be included to complete the group definition. This
point is called the point at infinity, which we denote by O.
If $\mathrm{char}(F) \notin \{ 2, 3 \}$, then E(F) can be transformed to [1]
$$E(F_p) : y^{2} = 4x^{3} + b_{2}x^{2} + 2b_{4}x + b_{6},$$
and then applying the change of coordinates $(x, y) \mapsto \left( \frac{x - 3b_{2}}{36}, \frac{y}{108} \right)$ yields the following simplified equation [1]
$$E(F_p) : y^{2} = x^{3} + ax + b, \tag{2.2}$$
with $a, b \in F_p$. Equation (2.2) is called the short Weierstraß equation for elliptic curves. The
condition $4a^{3} + 27b^{2} \neq 0$ 11 is necessary and sufficient for the cubic $x^{3} + ax + b$ in (2.2) to have
three distinct roots 12. Such an elliptic curve with distinct roots is called a non-singular EC, and its
points form an abelian group with respect to a binary operation.
If char(F) = 2, then an admissible change of variables transforms E(F) to the curve of
equation [1]
$$E(F_{2^m}) : y^{2} + xy = x^{3} + ax^{2} + b, \tag{2.3}$$
where $a, b \in F_{2^m}$. Such a curve is said to be non-supersingular and has a discriminant $\Delta = b$
[1].
9 The condition $\Delta \neq 0$ ensures that the elliptic curve is smooth, that is, there are no points at which the curve
has two or more distinct tangent lines [1].
10 Detailed definition of $\Delta$ can be found in [1].
11 This inequality comes from examining the discriminant of the curve in the short Weierstraß equation, viz.
$\Delta = -16(4a^{3} + 27b^{2})$.
12 That is, we don't allow the curve to have multiple roots.
The set of points $\{ (x, y) \in E(F_p) \} \cup \{ O \}$, under the addition group operation $\oplus$, forms an
additive abelian group: the sum of any two points on a curve is again a
point of the same curve. The elliptic curve group law operation is defined by the point addition
(ADD) and point doubling (DBL) operations. Together, ADD and DBL are usually called the chord-
and-tangent method [91]. To visualize these operations, Figure 2.2 illustrates the graphical
representation of the group law. Given three points P, Q, and R on $E(F_p) : y^{2} = x^{3} - 2$, the
addition of two distinct points P and Q is shown in Figure 2.2(a). It is defined by drawing a line
through the two points; this line intersects the graph of E at a third point $-R$. Then $R = P \oplus Q$
is defined to be the other point where the vertical line through $-R$ intersects the graph of E.
The double of a point P is shown in Figure 2.2(b). It is defined by drawing the tangent line to
the elliptic curve at P. This line intersects the elliptic curve at a second point. Then Q is the
reflection of this point about the x-axis.
Figure 2.2: Graphical Representation of the Chord-and-Tangent Group Law (EC-Operations)
for an Elliptic Curve $E : y^{2} = x^{3} - 2$ over $F_p$ [1, 91]. (a) Point Addition (ADD) Operation of
P and Q on E, Resulting in the Point $R = P \oplus Q$. (b) Point Doubling (DBL) Operation of P on E,
Resulting in the Point $Q = P \oplus P$.
Let us consider the following example:
$$E(F_{17}) : y^{2} = x^{3} + x + 7. \tag{2.4}$$
Equation (2.4) is an elliptic curve. Along with the point at infinity O, the set of rational points
of such an elliptic curve over $F_{17}$ is defined by
$$E(F_{17}) = \{ (x : y) \in A^{2}(F_{17}) : y^{2} = x^{3} + x + 7 \} \cup \{ O \}. \tag{2.5}$$
Here $E(F_{17})$ denotes the set of all points on the curve (2.4), and $A^{2}(F_{17})$ denotes the affine plane over
$F_{17}$. It consists of equivalence classes of pairs $(x, y) \in F_{17} \times F_{17}$, $(x, y) \neq (0, 0)$, two pairs
$(x, y)$ and $(x', y')$ being equivalent if there exists $c \in F_{17}^{*}$ such that $cx = x'$, $cy = y'$; the
equivalence class containing $(x, y)$ is denoted by $(x : y)$. The point $P = (x_P, y_P) = (1, 3)$ lies in
$E(F_{17})$, as do $Q = (x_Q, y_Q) = (6, 5)$ and $R = (x_R, y_R) = (2, 0)$. The point R can be defined as
$R = P \oplus Q$, where $\oplus$ is an EC-point addition (ADD in this case). The coordinates $x_R, y_R$ are
computed from $x_P, y_P, x_Q, y_Q$ using arithmetic in the underlying finite field $F_{17}$. Furthermore, it can also
be verified that $Q = P \oplus P$ (DBL in this case), and accordingly $R = P \oplus P \oplus P$; we usually write
these as $Q = 2P$ and $R = 3P$, where, in general,
$$kP = \underbrace{P \oplus P \oplus \cdots \oplus P}_{k}$$
is called scalar (or point) multiplication and is defined by the addition of the point P to itself $k - 1$ times. The security
of ECC is based on the hardness of the ECDLP, namely, finding the scalar k for two given
points P and S such that $S = kP$. Solving this problem is believed to be intractable for well-chosen
curves, parameters, and base point P.
Figure 2.3: Elliptic Curve Diffie-Hellman Key Exchange Scheme [1]. (Parties A and B hold the private scalars $k_A$ and $k_B$, exchange the public keys $Q_A = k_A P$ and $Q_B = k_B P$ over an insecure channel, and each computes the shared secret $Q = k_A Q_B = k_B Q_A$; the elliptic curve E and the point P are publicly known.)
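Because the example field is tiny, the membership claims for the curve (2.4) can be checked by brute force. The following sketch (function and variable names are illustrative) tests the non-singularity condition, enumerates the affine solutions, and confirms that P, Q and R satisfy the curve equation. Such an enumeration is feasible only for toy parameters.

```python
# Enumerate E(F_17): y^2 = x^3 + x + 7 by brute force (toy field only).
p, a, b = 17, 1, 7

# Non-singularity condition for the short Weierstrass form.
assert (4 * a**3 + 27 * b**2) % p != 0

def on_curve(x, y):
    """Return True if the affine point (x, y) satisfies the curve equation."""
    return (y * y - (x**3 + a * x + b)) % p == 0

points = [(x, y) for x in range(p) for y in range(p) if on_curve(x, y)]

print(len(points) + 1)   # affine points plus the point at infinity O
for pt in [(1, 3), (6, 5), (2, 0)]:
    print(pt, on_curve(*pt))
```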
2.2.1 Elliptic Curve Diffie-Hellman Key Agreement Scheme
The simplest elliptic curve scheme is the elliptic curve Diffie-Hellman key exchange protocol.
The scalar multiplication operation in this scheme is equivalent to the modular exponentiation
in the Diffie-Hellman key exchange scheme. Figure 2.3 shows how this protocol permits the two
parties A and B to communicate securely and agree on a secret Q. In this figure, the scalars
$k_A$ and $k_B$ are the secret keys (private keys) of A and B, respectively. The elliptic curve
parameters and the point P are assumed to be publicly known. Party A hides his secret key by
computing $Q_A = k_A P$, and party B computes $Q_B = k_B P$. The values $Q_A$ and $Q_B$ (called A's
and B's public key, respectively) are sent to each other over an insecure channel. Finally, party
A computes $Q = k_A Q_B$ and party B computes $Q = k_B Q_A$. As a result they both share a secret
Q. The security of this scheme is based on the elliptic curve computational Diffie-Hellman
assumption, which states that if the parameters are chosen carefully, it is computationally
infeasible to calculate $k_A k_B P$ when P, $k_A P$ and $k_B P$ are given. To date, no efficient general attack
on the ECDLP is known. Frequently used attacks such as the Pohlig-Hellman and baby-step
giant-step attacks work only in special situations of elliptic curves [108]. However, these attacks can be
rendered ineffective by carefully choosing the curve's parameters and the point P.
Algorithm 1 The Addition Law for Elliptic Curve E over Fp in Affine Coordinates [1]
Input : $P_1 = (x_1, y_1)$, $P_2 = (x_2, y_2)$, $O \in E(F_p)$.
Output : $P_3 = (x_3, y_3) = P_1 \oplus P_2$.
Step 1 : If $P_1 = O$ Then Return $P_2$;
Step 2 : Else If $P_2 = O$ Then Return $P_1$;
Step 3 : Else If $P_2 = -P_1$ Then Return O; /* $x_1 = x_2$ and $y_1 = -y_2$ */
Step 4 : Else If $P_2 = P_1$ Then /* Perform DBL operation */
Step 4.1 : $\lambda = \frac{3x_1^{2} + a}{2y_1} \bmod p$; /* [tangent] */
Step 5 : Else If $P_2 \neq \pm P_1$ Then /* Perform ADD operation */
Step 5.1 : $\lambda = \frac{y_2 - y_1}{x_2 - x_1} \bmod p$; /* [chord] */
Step 6 : $x_3 = (\lambda^{2} - x_1 - x_2) \bmod p$; $y_3 = (\lambda (x_1 - x_3) - y_1) \bmod p = (\lambda (x_2 - x_3) - y_2) \bmod p$;
Step 7 : Return $(x_3, y_3)$;
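The key agreement of Section 2.2.1 can be sketched on the toy curve (2.4) with the affine addition law of Algorithm 1. The private scalars `k_A = 2` and `k_B = 5` are hypothetical values picked for illustration, and the helper names are not taken from the thesis. The curve is far too small to offer any security.

```python
P_MOD, A_COEF = 17, 1          # toy curve (2.4); far too small for real use
O = None                       # stand-in for the point at infinity

def ec_add(p1, p2):
    """Affine addition law of Algorithm 1."""
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O                                    # P2 = -P1
    if p1 == p2:
        lam = (3 * x1 * x1 + A_COEF) * pow(2 * y1, P_MOD - 2, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, P_MOD - 2, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def scalar_mult(k, point):
    """Left-to-right double-and-add computation of k * point, k >= 1."""
    result = O
    for bit in bin(k)[2:]:
        result = ec_add(result, result)
        if bit == "1":
            result = ec_add(result, point)
    return result

P = (1, 3)                     # public base point
k_A, k_B = 2, 5                # hypothetical private scalars of A and B

Q_A = scalar_mult(k_A, P)      # A's public key
Q_B = scalar_mult(k_B, P)      # B's public key

secret_A = scalar_mult(k_A, Q_B)
secret_B = scalar_mult(k_B, Q_A)
print(Q_A, Q_B, secret_A, secret_B)
```

Both parties obtain the same point because $k_A (k_B P) = k_B (k_A P)$.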
2.3 Group Law Operations in Affine Coordinates
The computation of the addition group law binary operation $\oplus$ in affine coordinates is
summarized in Algorithm 1 over prime fields and in Algorithm 2 over binary extension fields. As
one can see from the two algorithms, both require a division. Since all elliptic curve
point coordinates are represented as finite field elements, the intended division operation is
implemented as a costly and complex finite field inversion operation followed by a multiplication.
Returning to the points P, Q, and R presented above, one can see that the coordinates $x_Q, y_Q$ of
the point $Q = 2P$ can be computed from Algorithm 1 as follows:
$$\lambda = \frac{3x_P^{2} + a}{2y_P} \bmod p = \frac{3 + 1}{6} \bmod 17 = 4 \cdot 3 \bmod 17 = 12,$$
$$x_Q = (\lambda^{2} - 2x_P) \bmod p = (144 - 2) \bmod 17 = 6,$$
$$y_Q = (\lambda (x_P - x_Q) - y_P) \bmod p = (12(1 - 6) - 3) \bmod 17 = -63 \bmod 17 = 5,$$
$$\Rightarrow Q = (6, 5).$$
We note that the fraction 1/6 in arithmetic modulo 17 is the inverse of 6, that is, the solution of
$6x \equiv 1 \pmod{17}$, namely the number 3, because $6 \times 3 = 18 \equiv 1 \pmod{17}$.
Algorithm 2 The Addition Law for Elliptic Curve E over F2m in Affine Coordinates [1]
Input : $P_1 = (x_1, y_1)$, $P_2 = (x_2, y_2)$, $O \in E(F_{2^m})$.
Output : $P_3 = (x_3, y_3) = P_1 \oplus P_2$.
Step 1 : If $P_1 = O$ Then Return $P_2$;
Step 2 : Else If $P_2 = O$ Then Return $P_1$;
Step 3 : Else If $x_1 = x_2$ Then
Step 3.1 : If $x_2 = y_1 + y_2$ Then Return O; /* $P_2 = -P_1$ */
Step 3.2 : Else If $P_2 \neq -P_1$ Then /* Perform DBL operation */
Step 3.2.1 : $\lambda = x_1 + \frac{y_1}{x_1}$;
Step 3.2.2 : $x_3 = \lambda^{2} + \lambda + a = x_1^{2} + \frac{b}{x_1^{2}}$; $y_3 = x_1^{2} + \lambda x_3 + x_3$;
Step 4 : Else /* Perform ADD operation */
Step 4.1 : $\lambda = \frac{y_1 + y_2}{x_1 + x_2}$;
Step 4.2 : $x_3 = \lambda^{2} + \lambda + x_1 + x_2 + a$; $y_3 = \lambda (x_1 + x_3) + x_3 + y_1$;
Step 5 : Return $(x_3, y_3)$;
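Also, the coordinates $x_R, y_R$ of the point $R = Q \oplus P$ can be computed from Algorithm 1 as
follows:
$$\lambda = \frac{y_P - y_Q}{x_P - x_Q} \bmod p = \frac{2}{5} \bmod 17 = 2 \cdot 7 \bmod 17 = 14,$$
$$x_R = (\lambda^{2} - x_P - x_Q) \bmod p = (196 - 1 - 6) \bmod 17 = 189 \bmod 17 = 2,$$
$$y_R = (\lambda (x_P - x_R) - y_P) \bmod p = (14(1 - 2) - 3) \bmod 17 = -17 \bmod 17 = 0,$$
$$\Rightarrow R = (2, 0).$$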
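The two worked examples can be reproduced with a direct transcription of Algorithm 1. This is a minimal sketch for the toy curve (2.4). The names `ec_add` and `inv` are illustrative, `None` stands in for the point at infinity, and the field inverse is computed with Fermat's little theorem, which is one of several possible choices.

```python
# Affine addition law on y^2 = x^3 + a*x + b over F_p (toy parameters of (2.4)).
P_MOD, A_COEF = 17, 1
O = None  # stand-in for the point at infinity

def inv(v):
    """Field inverse via Fermat's little theorem (P_MOD must be prime)."""
    return pow(v % P_MOD, P_MOD - 2, P_MOD)

def ec_add(p1, p2):
    """Return p1 + p2 following the case split of Algorithm 1."""
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O                                              # P2 = -P1
    if p1 == p2:
        lam = (3 * x1 * x1 + A_COEF) * inv(2 * y1) % P_MOD    # tangent slope
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P_MOD                # chord slope
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

P = (1, 3)
Q = ec_add(P, P)     # doubling
R = ec_add(Q, P)     # addition of distinct points
print(Q, R)
```

Each call performs exactly one field inversion, which is the cost that the projective coordinate systems of the next section are designed to avoid.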
2.4 Group Law Operations in Projective Coordinates
As seen in Algorithms 1 and 2, for points represented in affine coordinates the computations
of the elliptic curve point routines involve finite field additions/subtractions, multiplications,
squarings, and also the expensive inversion operation. Since the field inversion
operation is relatively expensive compared to the multiplication and squaring operations,
it is practical to represent elliptic curve points in different coordinate systems [93, 110] 13.
The general way to define the collection of points in projective space for curves defined over
$F_p$, i.e., (2.2), is to homogenise the elliptic curve, that is, to make the substitution $x = X/Z$ and
$y = Y/Z$ and to multiply by $Z^{3}$ to clear the denominators, which gives
$$E_P(F_p) : Y^{2}Z = X^{3} + aXZ^{2} + bZ^{3}. \tag{2.6}$$
13 The mixed coordinate systems are exceptional as some of the points hold their affine coordinate
representation [111].
The projective coordinates $(X_P, Y_P, Z_P)$ can then be used to replace the affine coordinates $(x_P,
y_P)$. These substitutions ($x_P = X_P/Z_P$, $y_P = Y_P/Z_P$), when $Z_P \neq 0$, are the simplest (and
standard) way to obtain projective coordinates 14, but one is not restricted to this choice of
substitution. In general, the projectifying remains the same; that is, one uses projections obtained through
substitutions of the form $x = X/Z^{i}$ and $y = Y/Z^{j}$ [110].
To convert the affine representation of a point $(x_P, y_P)$ into a projective representation, the
coordinate Z is simply set to 1, i.e., $(x_P, y_P, 1)$. The advantage of using projective coordinates is that
it eliminates the need for performing field inversions in the addition law algorithms. However,
using projective coordinates increases the number of field multiplications
and squarings required per bit of the scalar. It should be noted that the projective coordinates are
generally used for internal computations, but the resultant projective point is converted to its
affine coordinate form before being transmitted. Hence, one field inversion
is still required to convert the final result back to affine coordinates. This can be achieved
through the modular exponentiation given by Fermat's little theorem (FLT), which states that
the inverse of $A \in F_p$ is $A^{-1} = A^{p-2} \bmod p$, if $\gcd(A, p) = 1$. A modular inversion can
also be implemented by the extended Euclidean algorithm (EEA) and the Montgomery inversion
algorithm [1, 97, 101].
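The two inversion routes mentioned above can be compared on the inverse of 6 modulo 17 used earlier. The sketch below is illustrative only: the function names are assumptions, and neither routine is written to run in constant time.

```python
def inv_flt(a, p):
    """Inverse of a modulo a prime p via Fermat's little theorem: a^(p-2) mod p."""
    if a % p == 0:
        raise ZeroDivisionError("0 has no inverse")
    return pow(a, p - 2, p)

def inv_eea(a, p):
    """Inverse of a modulo p via the extended Euclidean algorithm."""
    r0, r1 = p, a % p
    t0, t1 = 0, 1            # invariant: t_i * a = r_i (mod p)
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    if r0 != 1:
        raise ZeroDivisionError("a is not invertible modulo p")
    return t0 % p

print(inv_flt(6, 17), inv_eea(6, 17))
```

The FLT route costs one modular exponentiation, whereas the EEA route uses a sequence of integer divisions; which one is preferable depends on the target platform.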
Example 1: Jacobian Projective Coordinates
One of the efficient coordinate systems for curves defined over $F_p$, i.e., (2.2), is the Jacobian projective
coordinate system. In this system, the projective point $(X, Y, Z)$, $Z \neq 0$, corresponds to the
affine point $(X/Z^{2}, Y/Z^{3})$. The corresponding Jacobian projective Weierstraß equation of the
elliptic curve is [1, 111]:
$$E_J(F_p) : Y^{2} = X^{3} + aXZ^{4} + bZ^{6}. \tag{2.7}$$
The point at infinity O corresponds to $(1, 1, 0)$, while the negative of $(X, Y, Z)$ is $(X, -Y, Z)$.
From the substitution of the point $(X/Z^{2}, Y/Z^{3})$ in the affine curve equation, i.e., in Algorithm 1,
it is possible to derive the point DBL and ADD operations. The point $Q = (X_Q, Y_Q, Z_Q)$ resulting
from the doubling of the point $P = (X_P, Y_P, Z_P)$, with $P \neq -P$, i.e., $Y_P \neq 0$, can be written as
follows [1, 111]:
$$\begin{aligned}
X_Q &\leftarrow \left( 3X_P^{2} + a \cdot Z_P^{4} \right)^{2} - 8X_P \cdot Y_P^{2}, \\
Y_Q &\leftarrow \left( 3X_P^{2} + a \cdot Z_P^{4} \right) \left( 4X_P \cdot Y_P^{2} - X_Q \right) - 8Y_P^{4}, \\
Z_Q &\leftarrow 2Y_P \cdot Z_P.
\end{aligned} \tag{2.8}$$
14 A redundant representation (more than 3 coordinates) can also be employed to represent the elliptic curve
points.
If temporary values are stored in registers A to C, the three coordinates $(X_Q, Y_Q, Z_Q)$ of the point
doubling can be computed with 3 field multiplications (M), 1 field multiplication by a
constant (D), 6 field squarings (S), and 11 field additions (A) 15 [111, 112]:
$$\begin{aligned}
A &\leftarrow 2Y_P^{2}, & B &\leftarrow 2X_P \cdot A, & C &\leftarrow 3X_P^{2} + a \cdot Z_P^{4}, \\
X_Q &\leftarrow C^{2} - 2B, & Y_Q &\leftarrow C \cdot (B - X_Q) - 2A^{2}, & Z_Q &\leftarrow 2Y_P \cdot Z_P.
\end{aligned} \tag{2.9}$$
When a fast squaring is available, this DBL operation costs 1M + 8S + 1D [83]. An
interesting case is when the curve parameter is $a = -3$, in which case a fast point doubling can be
performed, saving two field squarings in (2.9) by using [112]:
$$C \leftarrow 3 \left( X_P + Z_P^{2} \right) \cdot \left( X_P - Z_P^{2} \right). \tag{2.10}$$
This yields 4M + 4S + 12A field operations for the fast doubling.
The point $R = (X_R, Y_R, Z_R)$ resulting from the addition of the point $P = (X_P, Y_P, Z_P)$ and the point $Q =
(X_Q, Y_Q, Z_Q)$, with $P \neq \pm Q$ and $Z_P, Z_Q \neq 0$, can be expressed as follows [111, 112]:
$$\begin{aligned}
X_R &\leftarrow F^{2} - E^{3} - 2A \cdot E^{2}, \\
Y_R &\leftarrow F \left( A \cdot E^{2} - X_R \right) - C \cdot E^{3}, \\
Z_R &\leftarrow Z_P \cdot Z_Q \cdot E,
\end{aligned} \tag{2.11}$$
where
$$\begin{aligned}
A &\leftarrow X_P \cdot Z_Q^{2}, & B &\leftarrow X_Q \cdot Z_P^{2}, & C &\leftarrow Y_P \cdot Z_Q^{3}, \\
D &\leftarrow Y_Q \cdot Z_P^{3}, & E &\leftarrow B - A, & F &\leftarrow D - C.
\end{aligned}$$
This yields 12M + 4S + 7A field operations for the general addition.
If either of the two points P or Q is given in affine coordinates, i.e., when performing the addition in
mixed affine-Jacobian projective coordinates, one squaring and four multiplications can be
saved in (2.11), yielding 8M + 3S + 7A field operations [1, 112].
15 For simplicity, we use the term A to refer to both modular addition and subtraction operations.
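As a cross-check of (2.9) and (2.11), the following sketch evaluates both formula sets on the toy curve (2.4) and converts the results back to affine coordinates, where they can be compared with the values $2P = (6, 5)$ and $3P = (2, 0)$ obtained in Section 2.3. The function names are illustrative, and the special cases (point at infinity, $P = \pm Q$) are deliberately left out.

```python
P_MOD, A_COEF = 17, 1   # toy curve (2.4): y^2 = x^3 + x + 7 over F_17

def jac_double(pt):
    """Doubling via the register schedule (2.9); assumes Y != 0."""
    X, Y, Z = pt
    A = 2 * Y * Y % P_MOD
    B = 2 * X * A % P_MOD
    C = (3 * X * X + A_COEF * pow(Z, 4, P_MOD)) % P_MOD
    Xq = (C * C - 2 * B) % P_MOD
    Yq = (C * (B - Xq) - 2 * A * A) % P_MOD
    Zq = 2 * Y * Z % P_MOD
    return (Xq, Yq, Zq)

def jac_add(p, q):
    """General addition (2.11); assumes P != +-Q and Z_P, Z_Q != 0."""
    Xp, Yp, Zp = p
    Xq, Yq, Zq = q
    A = Xp * Zq * Zq % P_MOD
    B = Xq * Zp * Zp % P_MOD
    C = Yp * pow(Zq, 3, P_MOD) % P_MOD
    D = Yq * pow(Zp, 3, P_MOD) % P_MOD
    E = (B - A) % P_MOD
    F = (D - C) % P_MOD
    Xr = (F * F - pow(E, 3, P_MOD) - 2 * A * E * E) % P_MOD
    Yr = (F * (A * E * E - Xr) - C * pow(E, 3, P_MOD)) % P_MOD
    Zr = Zp * Zq * E % P_MOD
    return (Xr, Yr, Zr)

def to_affine(pt):
    """Map (X, Y, Z) to (X/Z^2, Y/Z^3) with a single field inversion."""
    X, Y, Z = pt
    z_inv = pow(Z, P_MOD - 2, P_MOD)
    return (X * z_inv**2 % P_MOD, Y * z_inv**3 % P_MOD)

P = (1, 3, 1)            # affine point (1, 3) lifted with Z = 1
Q = jac_double(P)
R = jac_add(P, Q)
print(to_affine(Q), to_affine(R))
```

Only `to_affine` performs an inversion, so a whole scalar multiplication needs a single inversion at the very end.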
Example 2: López & Dahab Coordinates
One of the efficient coordinate systems for curves defined over $F_{2^m}$, i.e., (2.3), is the López & Dahab
projective coordinate system [113]. In this system, the projective point $(X, Y, Z)$, $Z \neq 0$,
corresponds to the affine point $(X/Z, Y/Z^{2})$. The corresponding López-Dahab projective Weierstraß
equation of the elliptic curve is [113]:
$$E_{LD}(F_{2^m}) : Y^{2} + XYZ = X^{3}Z + aX^{2}Z^{2} + bZ^{4}. \tag{2.12}$$
The point at infinity O corresponds to $(1, 0, 0)$, while the negative of $(X, Y, Z)$ is $(X, XZ + Y, Z)$.
From the substitution of the point $(X/Z, Y/Z^{2})$ in the affine curve equation, it is possible to derive
the point DBL and ADD operations. The point $Q = (X_Q, Y_Q, Z_Q)$ resulting from the doubling of
the point $P = (X_P, Y_P, Z_P)$ can be computed as follows [113]:
$$\begin{aligned}
Z_Q &\leftarrow X_P^{2} \cdot Z_P^{2}, \\
X_Q &\leftarrow X_P^{4} + b \cdot Z_P^{4}, \\
Y_Q &\leftarrow b \cdot Z_P^{4} \cdot Z_Q + X_Q \cdot \left( a \cdot Z_Q + Y_P^{2} + b \cdot Z_P^{4} \right).
\end{aligned} \tag{2.13}$$
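The doubling formula in (2.13) is performed with 3M + 5S + 2D + 4A. The point $R = (X_R,
Y_R, Z_R)$ resulting from the addition of the point $P = (X_P, Y_P, Z_P)$ and the point $Q = (X_Q, Y_Q, Z_Q)$, with
$P \neq \pm Q$, can be computed with 13M + 1D + 6S + 8A [113]:
$$\begin{aligned}
Z_R &\leftarrow F^{2}, \\
X_R &\leftarrow C^{2} + H + G, \\
Y_R &\leftarrow H \cdot I + Z_R \cdot J,
\end{aligned} \tag{2.14}$$
where
$$\begin{aligned}
A_0 &\leftarrow Y_Q \cdot Z_P^{2}, & A_1 &\leftarrow Y_P \cdot Z_Q^{2}, & B_0 &\leftarrow X_Q \cdot Z_P, \\
B_1 &\leftarrow X_P \cdot Z_Q, & C &\leftarrow A_0 + A_1, & D &\leftarrow B_0 + B_1, \\
E &\leftarrow Z_P \cdot Z_Q, & F &\leftarrow D \cdot E, & G &\leftarrow D^{2} \cdot (F + a \cdot E^{2}), \\
H &\leftarrow C \cdot F, & I &\leftarrow D^{2} \cdot B_0 \cdot E + X_R, & J &\leftarrow D^{2} \cdot A_0 + X_R.
\end{aligned}$$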
If one of the two points is given in affine coordinates (here Q, i.e., $Z_Q = 1$), i.e., when using mixed
affine-López-Dahab projective coordinates, the addition formula (2.14) can be further improved as
follows [113]:
$$\begin{aligned}
Z_R &\leftarrow C^{2}, \\
X_R &\leftarrow A^{2} + D + E, \\
Y_R &\leftarrow E \cdot F + Z_R \cdot G,
\end{aligned} \tag{2.15}$$
where
$$\begin{aligned}
A &\leftarrow Y_Q \cdot Z_P^{2} + Y_P, & B &\leftarrow X_Q \cdot Z_P + X_P, & C &\leftarrow Z_P \cdot B, \\
D &\leftarrow B^{2} \cdot (C + a \cdot Z_P^{2}), & E &\leftarrow A \cdot C, & F &\leftarrow X_R + X_Q \cdot Z_R, \\
G &\leftarrow X_R + Y_Q \cdot Z_R.
\end{aligned}$$
If the curve parameter $a \in \{ 0, 1 \}$, the cost of the point ADD operation in the mixed affine-
López-Dahab projective coordinates above can be reduced to 9M + 4S + 5A [113].
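The doubling formula (2.13) can be checked against the affine doubling branch of Algorithm 2 on a very small binary field. The sketch below makes several assumptions that are not part of the thesis: the field GF(2^4) with reduction polynomial $x^{4} + x + 1$, the curve coefficients $a = b = 1$, and all function names. Field elements are stored as integers whose bits are the polynomial coefficients.

```python
M, POLY = 4, 0b10011        # GF(2^4) with reduction polynomial x^4 + x + 1 (toy choice)
A_COEF, B_COEF = 1, 1       # hypothetical coefficients of y^2 + xy = x^3 + a*x^2 + b

def gf_mul(u, v):
    """Multiply two field elements (bit vectors) modulo POLY."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        v >>= 1
        u <<= 1
        if (u >> M) & 1:
            u ^= POLY
    return r

def gf_inv(u):
    """Inverse as u^(2^M - 2), valid for u != 0."""
    r = 1
    for _ in range(2**M - 2):
        r = gf_mul(r, u)
    return r

def on_curve(x, y):
    """Check the affine curve equation (2.3)."""
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_mul(x, x), x) ^ gf_mul(A_COEF, gf_mul(x, x)) ^ B_COEF
    return lhs == rhs

def affine_double(x, y):
    """Doubling branch of Algorithm 2; assumes x != 0."""
    lam = x ^ gf_mul(y, gf_inv(x))
    x3 = gf_mul(lam, lam) ^ lam ^ A_COEF
    y3 = gf_mul(x, x) ^ gf_mul(lam, x3) ^ x3
    return x3, y3

def ld_double(X, Y, Z):
    """Doubling (2.13) in Lopez-Dahab coordinates, no inversion needed."""
    X2, Z2 = gf_mul(X, X), gf_mul(Z, Z)
    bZ4 = gf_mul(B_COEF, gf_mul(Z2, Z2))
    Zq = gf_mul(X2, Z2)
    Xq = gf_mul(X2, X2) ^ bZ4
    Yq = gf_mul(bZ4, Zq) ^ gf_mul(Xq, gf_mul(A_COEF, Zq) ^ gf_mul(Y, Y) ^ bZ4)
    return Xq, Yq, Zq

def ld_to_affine(X, Y, Z):
    """Map (X, Y, Z) to (X/Z, Y/Z^2)."""
    z_inv = gf_inv(Z)
    return gf_mul(X, z_inv), gf_mul(Y, gf_mul(z_inv, z_inv))

# All affine points with x != 0 (points with x = 0 double to the point at infinity).
points = [(x, y) for x in range(1, 2**M) for y in range(2**M) if on_curve(x, y)]
agree = all(ld_to_affine(*ld_double(x, y, 1)) == affine_double(x, y) for x, y in points)
print(len(points), agree)
```

Addition and subtraction in a binary field are both the bitwise XOR, which is why no sign handling appears in the formulas.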
Example 3: López & Dahab x-Coordinates only
Let the points $P = (x_P, y_P)$, $Q = (x_Q, y_Q)$, $R = (x_R, y_R)$, and $S = (x_S, y_S)$ be four different
affine points that belong to the curve (2.3) such that $R = P \oplus Q$ and $S = P \ominus Q$. López and
Dahab in [114] observed that the x-coordinate of the DBL operation can be obtained without any
y-coordinates being involved in its formula (see Algorithm 2). They derived a new
formula to obtain the x-coordinate of the ADD operation, likewise without any y-coordinates being involved.
This x-coordinate-only ADD formula is given as [114]:
$$x_R = x_S + \left( \frac{x_P}{x_P + x_Q} \right)^{2} + \frac{x_P}{x_P + x_Q}.$$
Let the x-coordinates of P and Q be represented by $X_P/Z_P$ and $X_Q/Z_Q$, respectively. Then,
when the points 2P and P + Q are converted to López-Dahab projective coordinates, i.e., $2P =
(X_{2P}, Y_{2P}, Z_{2P})$ and $P + Q = (X_{P+Q}, Y_{P+Q}, Z_{P+Q})$, the two points can be computed as [114]
$$\begin{aligned}
X_{2P} &\leftarrow X_P^{4} + b \cdot Z_P^{4}, \\
Z_{2P} &\leftarrow X_P^{2} \cdot Z_P^{2},
\end{aligned} \tag{2.16}$$
where b is the curve parameter, and
$$\begin{aligned}
Z_{P+Q} &\leftarrow \left( X_P \cdot Z_Q + X_Q \cdot Z_P \right)^{2}, \\
X_{P+Q} &\leftarrow X_S \cdot Z_{P+Q} + (X_P \cdot Z_Q) \cdot (X_Q \cdot Z_P),
\end{aligned} \tag{2.17}$$
where $X_S$ is the x-coordinate of the point $S = P \ominus Q$ 16. The DBL formula above, i.e., (2.16),
requires 1M + 1D + 4S + 1A, and the ADD formula above, i.e., (2.17), requires 4M + 1S +
2A. The formula for recovering the y-coordinate of the point R is obtained as follows [114] 17:
$$Y_R = \frac{(X_P + X_S) \left( (X_P + X_S)(X_Q + X_S) + X_S^{2} + Y_S \right)}{X_S} + Y_S. \tag{2.18}$$
2.4.1 Inverse of a Point
An interesting property of the elliptic curve group is that the unary operation $(-)$, called the
inverse of a point, i.e., $-P$, can be computed virtually for free. This is one reason why signed
representations of the scalar are meaningful. The inverse of a point $P = (x_P, y_P) \in E(F_{2^m})$ is
given by $-P = (x_P, x_P + y_P)$; similarly, $-R = (x_R, -y_R)$ for $R = (x_R, y_R) \in E(F_p)$. Therefore, the
binary SUB operation of two points on an elliptic curve costs practically the same as the
ADD operation [1].
2.5 Elliptic Curve Scalar Multiplication
The fundamental building block of any ECC based protocol is ECSM. Over the years, a number
of algorithms and techniques have been proposed to provide efficient implementations of
the scalar multiplication. Three broad scenarios are possible, depending on how
the ECSM method is performed. The first scenario is multiplying a scalar k by a fixed base
point (generator) [115]. An example of such a scenario is the signature generation of the elliptic
curve digital signature algorithm standard. The second scenario is simultaneously multiplying
two scalars k and l, one by a fixed base point G and the other by an unknown point Q, to
obtain $R = kG + lQ$ [116, 117, 118, 119]. An example of such a scenario is the signature
verification protocols. The third scenario, which we address in this thesis, is when the
base point P is not known in advance (random point) and when only a single scalar
multiplication is required. An example of such a scenario is the elliptic curve
Diffie-Hellman key exchange protocol [120, 121, 122].
Algorithm 3 Left-to-Right Double-and-Add Binary Scalar Method [1, 91]
Input : Integer $k = (1, k_{l-2}, \cdots, k_1, k_0)$, Point $P \in E(F)$.
Output : Point $Q = kP$.
Initialize : $Q \leftarrow P$ ;
Step 1 : For $i = l - 2$ down to 0 do
Step 1.1 : $Q \leftarrow 2Q$ ; /* Perform DBL operation */
Step 1.2 : If $k_i = 1$ Then
Step 1.2.1 : $Q \leftarrow P \oplus Q$ ; /* Perform ADD operation */
Step 2 : End For
Step 3 : Return Q;
16 When the Montgomery Ladder ECSM method is used, this S point is always the base point P.
17 Complete proof of (2.18) is found in [114].
Given P a point of $E(F_p)$ and $k \in \mathbb{N}^{*}$, let kP be the point of the subgroup generated by P
defined by
$$kP = \underbrace{P \oplus \cdots \oplus}_{k-1 \text{ times}} P.$$
This definition extends naturally to $k \in \mathbb{Z}$ with $0P = O$ and $(-k)P = k(-P)$.
In the following, we provide a brief description of the elliptic curve algorithms used by the
elliptic curve processors.
2.5.1 Binary Methods
The fundamental algorithm for the computation of the scalar multiplication, i.e., kP, is the
well known (left-to-right) Double-and-Add binary method that is shown in Algorithm 3 [1,
91, 123]. On average, the computational complexity of the Double-and-Add binary method is
$s - 1$ DBLs and $\frac{s-1}{2}$ ADDs for an s-bit scalar [91] 18. Since the inverse of a point can be easily computed (see
Section 2.4.1), it is possible to lower the number of ADD operations by converting the scalar
k to a signed representation. Let each bit of k be denoted by $k_i$, for $0 \leq i \leq s - 1$. Then $k_i$ in
the signed representation becomes $k_i \in \{ -1, 0, 1 \}$. The signed representation turns the Double-
and-Add binary method into a new method called the signed binary (or addition-subtraction)
method [69, 71, 123]. Among the different signed representation methods, the non-adjacent
form (NAF) that is shown in Algorithm 4 [1, 71, 72, 91, 124] and the mutual opposite form
(MOF) [70] are the most popular. The computation of ECSM in the signed binary
methods is more efficient than in the Double-and-Add binary method. Representing the scalar
k in NAF or MOF lowers the average number of ADDs from about $s/2$ to about $s/3$, i.e., it
saves about $s/6$ ADDs in the computation of kP [1, 91]. The
total run time of the ADDs in both the Double-and-Add binary method and the signed binary
methods depends on the Hamming weight of the scalar k. Hence, an adversary observing the
run time could determine the Hamming weight of the secret k.
18 The number of ADD operations depends on the Hamming weight of the scalar k, whereas the number
of DBL operations is independent of the Hamming weight of k.
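The effect of the signed representation on the number of ADDs can be illustrated with a short sketch. To keep it self-contained, the integers under ordinary addition are used as a stand-in for the elliptic curve group (so that "P" is the integer 1 and kP is simply k); this substitution, the function names and the sample scalar are assumptions made only for illustration. The NAF recoding itself is the standard one.

```python
def naf(k):
    """Non-adjacent form of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)      # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits

def double_and_add(k, P):
    """Left-to-right binary method (Algorithm 3); returns (kP, number of ADDs)."""
    bits = bin(k)[2:]
    Q, adds = P, 0
    for bit in bits[1:]:
        Q = 2 * Q                # DBL
        if bit == "1":
            Q = Q + P            # ADD
            adds += 1
    return Q, adds

def signed_binary(k, P):
    """Left-to-right NAF method; returns (kP, number of ADD/SUB operations)."""
    Q, ops = 0, 0
    for d in reversed(naf(k)):
        Q = 2 * Q                # DBL
        if d != 0:
            Q = Q + d * P        # ADD if d = 1, SUB if d = -1
            ops += 1
    return Q, ops

k = 0b11101111
print(double_and_add(k, 1), signed_binary(k, 1))
```

In both routines the number of additions depends on the digits of k, which is exactly the timing leak described above.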
Algorithm 4 Left-to-Right Binary NAF Scalar Method [1, 91]
Input : Integer $k = (k_l, k_{l-1}, \cdots, k_1, k_0)$, $k_i \in \{ -1, 0, 1 \}$, Point $P \in E(F)$.
Output : Point $Q = kP$.
Initialize : $Q \leftarrow O$ ;
Step 1 : For $i = l$ down to 0 do
Step 1.1 : $Q \leftarrow 2Q$ ; /* Perform DBL operation */
Step 1.2 : $Q \leftarrow Q \oplus k_i P$ ; /* Perform ADD operation */
Step 2 : End For
Step 3 : Return Q;
2.5.2 Window Based Methods
If a sufficient amount of memory is available and allowed to be used, window based methods (or
windowing techniques) can be used to enhance the speed of the ECSM operation. A generalization
of the window based method was first proposed by Brauer in 1939 [125]. The idea is to
slice the scalar k into digits and to process w digits at a time. The scalar k in the window
based methods is represented in base $2^{w}$ (or $2^{r}$ in the radix-r method), where $w, r > 1$. The
algorithms of this type significantly improve the speed of scalar multiplication, i.e.,
they process w digits of k at a time, at the expense of $2^{w-2}$ points stored in a memory look-up table
(LUT). For instance, computing the ECSM using the width-w method introduced by Thurber [127]
requires a set of points iP, for $i \in \{ 1, 3, 5, 7, \cdots, 2^{w-1} \}$, to be pre-computed and stored in the
LUT. A typical standard method to compute ECSM in the radix-r representation is illustrated
in Algorithm 5. In this algorithm, the average density of non-zero digits is $\frac{r-1}{r}$. From
Algorithm 5, one can see that the ADD operation in Step 1.2.1 and the SUB operation in Step
1.3 start only when the repeated DBL operations in Step 1.1 are completed. This represents
a purely sequential method for the computation of elliptic curve point operations at the addition
group law level. We note that in order to make these window based methods feasible for
implementations supporting parallel processing at the addition group law level, all the
pre-computed points need to be doubled $w - 1$ times at each iteration [128]. Let us denote the cost
of computing the ECSM operation by $kP_{cost}$. Then, the $kP_{cost}$ of an s-bit scalar k using the width-w
method is approximately:
$$kP_{cost} = (s - 1)\,\mathrm{DBL} + \left( \frac{s}{w + 1} \right) \mathrm{ADD}.$$
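The trade-off between table size and number of ADDs can be seen in a plain fixed-window ($2^{w}$-ary) variant, sketched below. This is not the width-w method of [127], which stores only odd multiples; it is the simplest member of the family and is shown only to make the digit processing concrete. As an assumption made for brevity, the integers under addition stand in for the curve group, so "P" is an integer and the table holds the multiples iP.

```python
def fixed_window(k, P, w=4):
    """Left-to-right 2^w-ary method; returns (kP, DBL count, ADD count)."""
    # Look-up table of i*P for i = 0 .. 2^w - 1 (entry 0 is the identity).
    table = [0] * (1 << w)
    for i in range(1, 1 << w):
        table[i] = table[i - 1] + P

    # Split k into w-bit digits, most significant first.
    digits = []
    while k > 0:
        digits.append(k & ((1 << w) - 1))
        k >>= w
    digits.reverse()

    Q, dbls, adds = 0, 0, 0
    for d in digits:
        for _ in range(w):
            Q = 2 * Q            # w repeated DBLs per digit
            dbls += 1
        if d != 0:
            Q = Q + table[d]     # one ADD per non-zero digit
            adds += 1
    return Q, dbls, adds

print(fixed_window(0b1011011101, 1, w=3))
```

A real implementation would skip the doublings of the initial identity element; they are kept here so that the loop body stays uniform.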
It is noteworthy that the window based methods described here do not provide resistance
against SCAs 19. The methods have to be performed in a regular structure to resist most
of the SCAs 20.
Algorithm 5 Left-to-Right Standard Signed Radix-r Scalar Multiplication [126]
Input : An l-digit radix-r representation of k and a Point $P \in E(F)$, where
$k = (R_{l-1}, R_{l-2}, \cdots, R_1, R_0)_r$, $R_i \in \{ 0, 1, 2, \cdots, (r - 1) \}$.
Output : Point $Q = kP$.
Pre-computation : $|R_i| P$ for all $R_i \in \{ 1, 2, \cdots, r - 1 \}$.
Initialize : $Q \leftarrow O$ ;
Step 1 : For $i = l - 1$ down to 0 do
Step 1.1 : $Q \leftarrow rQ$ ; /* Perform repeated DBL operation */
Step 1.2 : If $R_i \geq 0$ Then
Step 1.2.1 : $Q \leftarrow Q \oplus R_i P$ ; /* Perform ADD operation */
Step 1.3 : Else $Q \leftarrow Q \ominus |R_i| P$ ; /* Perform SUB operation */
Step 2 : End For
Step 3 : Return Q;
2.6 Power Analysis Attacks
The first official information on SCA dates from 1956 [94]. It is recorded in [129] how Peter
Wright helped the British secret services to break a rotor machine by listening to its clicking
sound with a microphone. In the past few decades there has been considerable concern about
the electromagnetic emanation of video screens [130]. In the mid 1990s academic research
examined three new types of SCAs, namely, execution time [38], computational faults
[131] and power consumption [132, 39]. An attacker here does not focus on the flaws of the
algorithm, but tries to break the system by exploiting weaknesses in the implementation of
the algorithm, e.g., by measuring the elapsed time or the power consumption of operations, which
depends on analysing the VLSI implementation of the crypto-algorithm.
Of all the types of SCAs on PK based schemes, the power analysis attack (or power side
attack) is the most common type. Two main classes of power analysis attacks were presented by
Kocher et al. in [132, 39]. These are simple and differential power analysis attacks. Both of
them are based on monitoring the power consumption of a cryptographic token while it executes
an algorithm that manipulates the secret key. The traces of the measured power are then
analysed to obtain significant information about the key. In an ECC crypto-system, a power
analysis attack can reveal large features of the algorithm, such as identifying the DBL and ADD
operations being executed in the iterations of the loop [40]. Thus, the ECSM algorithm should
be implemented using a fixed sequence of EC-point operations that does not depend on the
value of a particular scalar bit $k_i$. Furthermore, to thwart differential side-channel analysis, the
inputs of the scalar multiplication algorithm, namely, the base point P and the scalar k, should
be randomized.
19 To solve the irregularity in the execution of the window based method, a special consideration must be
made to avoid the zero-digits in the scalar k [1].
20 Two secured window based methods proposed in [63, 64] will be provided and discussed in Chapter
5.
2.6.1 The Secured ECSM Schemes
Designing secure implementations requires taking physical attacks into account. These attacks
include power analysis, which may infer information on a secret key by monitoring how a device
interacts with its environment, and fault analysis, in which an adversary disturbs the normal
functioning of a device to attain the same goal. From Algorithms 1 and 2, it clearly appears
that the formulas for doubling a point and for adding two (distinct) points on the Weierstraß elliptic
curve model are different. So, for example, from the distinction between the two point
arithmetic operations, i.e., ADD and DBL, an SSCA using power traces can reveal the value
of the secret k in the scalar multiplication algorithm. To counter the power attack, the power
consumption of a crypto-algorithm has to be independent of the performed operations and the
processed data values. Hence, it should have one of the following two properties [133]:
• The device consumes a random amount of power in each clock cycle.
• The device consumes an equal amount of power in each clock cycle.
For the former property, the randomization is achieved by methods
such as the randomized projective coordinate method [40], a random double base number system
(DBNS) representation [134], and the randomized curve method proposed in [135]. For various
randomization techniques, comprehensive references are [1, 93, 136].
In order to withstand SSCAs, one must execute the scalar multiplication regularly, such
that it performs a constant operation flow whatever the scalar value. This can be done by one
of the following three basic approaches:
• The first approach is to use unified addition (or indistinguishable addition) formulae,
i.e., the formulas used for both point arithmetic ADD and DBL are the same. Such formulae
exist for standard Weierstraß elliptic curves [137, 138]; however, an implementation of
such formulas would suffer from huge area complexity and low speed computation.
In addition, other unified addition formulas for special elliptic curve models are available
in the literature, for instance, for the Edwards elliptic curve model over odd characteristic
fields [76, 139], binary Edwards curves [140], the inverted Edwards model [141],
the twisted Edwards model [37], the Huff model [77], the Hessian model over odd
characteristic fields [75], binary Hessian models [142], and the Jacobi elliptic curve
model [143, 78, 144].
• The second approach is to split both point arithmetic operations into small homogeneous
blocks of basic field arithmetic operations. If both ADD and DBL are carefully
implemented in an atomic block structure, it becomes impossible to distinguish between
the atomic blocks that come from either of the two point arithmetic operations. This
approach was first proposed in [145]. Different atomic block structures were later presented
in [81, 83, 146, 147].
• The third approach covers the case we address in this thesis, i.e., when the
ADD and DBL operations are different. The only way to make an ECSM algorithm
SSCA-aware is then to use a regular structure scalar multiplication scheme, which evaluates
the point arithmetic operations in a uniform sequence.
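One well-known example of a regular structure is the Montgomery ladder, which performs one ADD and one DBL for every scalar bit. The sketch below is not one of the schemes developed in this thesis; it is given only to show what "a uniform sequence" means. The integers under addition again stand in for the curve group (an assumption for brevity), and the recorded `trace` is a hypothetical device for making the operation sequence visible.

```python
def montgomery_ladder(k, P, bits):
    """Regular left-to-right ladder; returns (kP, trace of operations)."""
    R0, R1 = 0, P                # invariant: R1 - R0 = P
    trace = []
    for i in reversed(range(bits)):
        if (k >> i) & 1:
            R0, R1 = R0 + R1, 2 * R1
        else:
            R0, R1 = 2 * R0, R0 + R1
        trace.append("ADD+DBL")  # both branches perform one ADD and one DBL
    return R0, trace

result_a, trace_a = montgomery_ladder(0b10000001, 1, 8)
result_b, trace_b = montgomery_ladder(0b11111111, 1, 8)
print(result_a, result_b, trace_a == trace_b)
```

The operation sequence is identical for both scalars even though their Hamming weights differ. The data-dependent branch in this sketch would still have to be removed (for example by a conditional swap) in a side-channel-hardened implementation.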
2.7 Standard Curves
Selecting well-suited elliptic curve parameters can make the implementation both secure and
optimized. If chosen poorly, however, the parameters may lead to an insecure system [148]. In regard
to this issue, the two main standards for defining elliptic curves for cryptography, namely
NIST FIPS 186-3 [16] and the German Brainpool standard [20], recommend
certain curve parameters for each finite field 21. These curves have been intentionally selected
because of the cryptographic strength and efficient implementations they provide.
In the binary fields, NIST recommends five finite fields, i.e., $F_{2^{163}}$, $F_{2^{233}}$, $F_{2^{283}}$, $F_{2^{409}}$, and $F_{2^{571}}$,
for use in the ECDSA [16]. These fields and the corresponding reduction polynomials are listed in
Table 2.2. Note that each of the reduction polynomials listed in the table is either a trinomial
or a pentanomial. Also, note that the second-highest non-zero term of each polynomial has a
relatively small degree when compared to the degree of the whole polynomial.
In the prime fields, NIST recommends five finite fields $F_p$, with primes p of 192, 224, 256, 384, and 521 bits,
for use in the ECDSA [16]. The German Brainpool standard recommends seven prime fields, with
primes p of 160, 192, 224, 256, 320, 384, and 512 bits, for the same goal.
21 Other standards such as ANSI X9.62 [17], ISO 15946-2 [23], IEEE P1363 [21] and SECG [22] mainly
provide pointers to NIST curves.
Table 2.2: NIST Recommended Finite Fields and Their Corresponding Reduction Polynomials [16].
Field size m    Reduction polynomial
163             $P(x) = x^{163} + x^{7} + x^{6} + x^{3} + 1$
233             $P(x) = x^{233} + x^{74} + 1$
283             $P(x) = x^{283} + x^{12} + x^{7} + x^{5} + 1$
409             $P(x) = x^{409} + x^{87} + 1$
571             $P(x) = x^{571} + x^{10} + x^{5} + x^{2} + 1$
2.8 Finite Field Arithmetic
Finite field arithmetic has been widely applied in different fields such as error-control
coding, cryptography, and digital signal processing [26, 27, 28, 29]. Most PK-based
schemes also rely on finite field arithmetic operations to implement their functionality.
A field with a finite set of elements is called a finite field 22. Let us denote the finite field by
$F_q$ or GF(q), where q stands for the number of elements in the field. The number of elements
in a finite field is always a prime or a prime power, i.e., $q = p$ or $q = p^m$, where m is a positive
integer and the prime number p is called the characteristic of the finite field. When q is a prime,
i.e., $q = p$, the finite field $F_p$ is called a prime field. The prime field $F_p$ is the field of residue
classes modulo p and its elements are represented by the integers in {0, 1, 2, · · · , p − 1}.
When q is a prime power, i.e., $q = p^m$ with m > 1, the finite field $F_{p^m}$ is called an extension
field. The extension field $F_{p^m}$ is generated by using an irreducible polynomial of degree m over
$F_p$, and it is the field of residue classes modulo this field-generating polynomial. Hence,
in polynomial representation the elements of $F_{p^m}$ are represented by polynomials of degree at
most m − 1 with coefficients in $F_p$.
Arithmetic in a finite field is different from standard integer arithmetic, as the field has a
limited number of elements and all operations performed in the finite field result in an element
within that field. In ECC systems, finite field arithmetic is the key factor that decides the cost
of the curve group operations. The basic field arithmetic operations used in ECC are
addition/subtraction, multiplication, squaring, and inversion/division.
22 It is also called a Galois field, in honor of Evariste Galois the mathematician who first introduced them in
1830 in his proof of the unsolvability of the general quintic equation.
Two types of fields are commonly used: prime fields $F_p$, where p is a large prime,
and binary extension fields $F_{2^m}$, where m is a prime integer. In the scope of this thesis, we
consider curves that are defined over both prime fields and binary extension fields.
2.9 Arithmetic over Prime Fields Fp
As discussed in Section 2.3 and shown in Algorithm 1, the ADD operation on elliptic curves
over Fp requires one modular division 23, one modular multiplication, one modular squaring,
and six modular addition/subtraction operations. Since a division is computed as an inversion
(I) followed by a multiplication, this amounts to 1I + 2M + 1S + 6A. The point doubling
operation requires one modular division, one modular multiplication, two modular squarings,
and five additions/subtractions, i.e., 1I + 2M + 2S + 5A. Combining the architectures for these
field arithmetic operations allows any of the required elliptic curve point routines to be performed.
2.9.1 Field Arithmetic Addition
Let A, B ∈ [0, p − 1], where p is the prime modulus. The modular addition operation, as
seen in Algorithm 6, comes down to an integer addition of A and B, followed by a subtraction
of p if the result is greater than or equal to p, i.e., A + B ≥ p. An architecture to perform
modular addition is illustrated in Figure 2.4. As shown in this figure, the modular adder over
Fp takes three inputs A, B, and p, all of length $\lceil \log_2 p \rceil$ bits, and produces an output
(A + B) mod p, which is also of length $\lceil \log_2 p \rceil$ bits. The rectangular block marked with
a plus sign represents an adder. An adder can be built in many ways; for example, one can
implement a ripple-carry adder. The carry-propagate adders used must be at least
$(1 + \lceil \log_2 p \rceil)$ bits long to represent the intermediate result A + B, which can be greater
than the $\lceil \log_2 p \rceil$-bit modulus p. To subtract p, a second carry-propagate adder takes the
sum of the first adder and the bitwise-inverted modulus p as inputs, with the carry-in tied to '1',
thus performing two's complement subtraction. The carry-out of this adder then indicates that
A + B is greater than or equal to p. This signal controls the multiplexer that selects whether
A + B or (A + B) − p is the correct result.
By setting both inputs in Figure 2.4 to A, the output is 2A mod p, i.e., the modular
doubling operation is performed.
23 The modular division can be performed by multiplying by the modular inverse of an element.
Algorithm 6 Addition Modulo p [103]
Input : Integer x, y, where 0 ≤ x, y < p.
Output : x + y mod p.
Step 1 : a← x + y ; c← a − p ;
Step 2 : If c < 0 Then
Step 2.1 : Return a ;
Step 3 : Else
Step 3.1 : Return c ;
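The steps of Algorithm 6 and the adder/subtractor datapath described above can be sketched in software as follows. This is only an illustrative model, written in Python for convenience; the function names `mod_add`, `mod_add_datapath`, and `mod_double` are assumptions introduced here and do not appear in [103] or [90].

```python
def mod_add(x, y, p):
    """Algorithm 6: return (x + y) mod p for 0 <= x, y < p."""
    a = x + y        # integer sum, may need one extra bit
    c = a - p        # speculative subtraction of the modulus
    return a if c < 0 else c


def mod_add_datapath(x, y, p):
    """Bit-level model of the two-adder structure (hypothetical helper).

    The second adder adds the bitwise-inverted modulus with carry-in 1;
    its carry-out tells whether x + y >= p and drives the multiplexer.
    """
    w = p.bit_length() + 1               # width of the intermediate sum
    mask = (1 << w) - 1
    s = x + y                            # first adder
    t = s + (~p & mask) + 1              # second adder: s - p in two's complement
    carry_out = t >> w
    return (t & mask) if carry_out else s


def mod_double(x, p):
    """Modular doubling: feed the same operand to both adder inputs."""
    return mod_add(x, x, p)
```

For example, `mod_add(4, 5, 7)` evaluates the sum 9, finds that 9 − 7 is non-negative, and returns 2.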
Figure 2.4: Modular Addition over Fp [90].
2.9.2 Field Arithmetic Subtraction
A subtraction in Fp is computed, as seen in Algorithm 7, by an integer subtraction followed
by an addition of p if the result is less than zero. To perform modular subtraction, input B is
bitwise inverted and added to input A with a carry-in of '1'. If the result is negative, i.e., the
carry-out is low, then the modulus must be added to produce an output in the range [0, p − 1].
An architecture to perform modular subtraction is illustrated in Figure 2.5. In this
architecture, the value of (A − B) + p is computed while the relative magnitude of A and B is
being determined. Both possible results are therefore available in slightly more time than a
single m-bit addition, and the correct one is selected by the carry-out bit of the A − B stage.
This eliminates the need to wait for a full m-bit magnitude comparison before deciding whether
to add p.
Algorithm 7 Subtraction Modulo p [103]
Input : Integers x, y, where 0 ≤ x, y < p.
Output : x − y mod p.
Step 1 : a ← x − y ;
Step 2 : If a < 0 Then
Step 2.1 : a ← a + p ;
Step 3 : Return a ;
Figure 2.5: Modular Subtraction over Fp [90].
The following two examples of a modular addition and a modular subtraction are computed
in F7:
(4 + 5) mod 7 = 9 mod 7 = 2,
(4 − 5) mod 7 = −1 mod 7 = 6.
Modular negation may be performed with the modular subtraction architecture illustrated
in Figure 2.5 by setting input A to zero and using only input B.
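A minimal software sketch of Algorithm 7, of the inverted-operand datapath, and of negation through the subtractor is given below. The names `mod_sub`, `mod_sub_datapath`, and `mod_neg` are hypothetical and serve only this illustration.

```python
def mod_sub(x, y, p):
    """Algorithm 7: return (x - y) mod p for 0 <= x, y < p."""
    a = x - y
    if a < 0:
        a += p
    return a


def mod_sub_datapath(x, y, p):
    """Bit-level model: add the inverted y with carry-in 1 (hypothetical helper).

    A low carry-out means x - y is negative, so p must be added back.
    """
    w = p.bit_length()
    mask = (1 << w) - 1
    s = x + (~y & mask) + 1      # x - y in two's complement
    carry_out = s >> w
    diff = s & mask
    if not carry_out:            # negative result: correct by adding p
        diff = (diff + p) & mask
    return diff


def mod_neg(y, p):
    """Negation: set input A of the subtractor to zero."""
    return mod_sub(0, y, p)
```

With p = 7, `mod_sub(4, 5, 7)` reproduces the second example above and returns 6.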
2.9.3 Field Arithmetic Multiplication
Field multiplication in Fp can be accomplished by first performing an integer multiplication,
followed by a reduction step. The product of two field elements A and B lies in the range
AB ∈ [0, $(p − 1)^2$]. Reducing such a large product requires dividing by p such that
$$q = \left\lfloor \frac{AB}{p} \right\rfloor \quad \text{and} \quad r = AB - qp.$$
Here, q is the quotient and r is the remainder of the division, which is always in the range
[0, p − 1]. An example of a modular multiplication in F7 is computed as follows:
(6 × 6) mod 7 = 36 mod 7,
⌊36/7⌋ = 5,
36 − 5 × 7 = 1,
⇒ (6 × 6) mod 7 = 1.
An extensive study has been done in this field to improve the computation capacity of systems
performing such operations. Depending on whether the modular reduction occurs during the
multiplication or only at the end, multiplication methods can be designed as interleaved or
non-interleaved. Interleaved methods are usually less complex and have the advantage of
reducing the memory needed to store the intermediate results. Non-interleaved methods can be
preferred when an efficient modular reduction technique is available. They can build on any of
the basic multiplication algorithms, such as the quadratic complexity methods
(e.g., the schoolbook method and Comba's method [149]) and the sub-quadratic techniques
(e.g., the well-known divide-and-conquer Karatsuba algorithm [150] 24). A modular reduction
is then executed to keep the result in the range of the chosen finite field.
A traditional multiplication operation can be derived as follows. Given two integers
A and $B = (b_{m-1}, \cdots, b_0)_r$ with m digits in radix r, the product AB can be written as:
$$AB = \sum_{i=0}^{m-1} (A \cdot b_i)\, r^i = r\Big(\cdots\big(r\,(0 + A \cdot b_{m-1}) + A \cdot b_{m-2}\big) + \cdots\Big) + A \cdot b_0. \tag{2.19}$$
Algorithm 8 from [101] summarizes the multiplication operation in (2.19). From Algorithm
8, one can see that every step requires a digit multiplication ($A \cdot b_i$), a multiplication by r,
and an addition. For r = 2 the algorithm reduces to a left shift by one bit and an addition of A
or 0 (for more details on the binary method see Subsection 2.10.4). We also point out that
Algorithm 8 can be re-written in terms of right-shift operations [151, 152]. Let $M_{cost}$ denote the
time taken to multiply two n-bit integers. Then the complexity of this algorithm is $M_{cost} = O(n^2)$.
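The left-to-right recurrence in (2.19) can be sketched as follows. The helper `to_digits_msb_first` and the function name `left_shift_multiply` are assumptions made for this illustration; the sketch also reuses the result for the F7 multiplication example by reducing the integer product at the end (a non-interleaved approach).

```python
def to_digits_msb_first(b, r=2):
    """Radix-r digits of a non-negative integer, most significant first."""
    digits = []
    while b:
        digits.append(b % r)
        b //= r
    return digits[::-1] or [0]


def left_shift_multiply(a, b_digits, r=2):
    """Evaluate Z = A * B by Horner's rule over the digits of B.

    Each step multiplies the running sum by r (a left shift when r = 2)
    and adds the partial product A * b_i.
    """
    z = 0
    for digit in b_digits:       # b_{m-1} first, b_0 last
        z = r * z + a * digit
    return z


# Non-interleaved modular multiplication: multiply first, reduce afterwards.
product = left_shift_multiply(6, to_digits_msb_first(6))
remainder = product % 7
```

Here `product` is the integer 36 and `remainder` is 1, matching the worked example in F7.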
24 Which has an asymptotic complexity of $O(n^{1.58})$.
Algorithm 8 Left Shift Multiplication [101]
Input : Integers A and $B = \sum_{i=0}^{m-1} b_i r^i$.
Output : Z = AB.
Initialize : Z ← 0.
Step 1 : For i = 0 to m − 1 do
Step 1.1 : Z ← r · Z + A · $b_{m-1-i}$
Step 2 : End For
Step 3 : Return Z
Given two m-bit integers A and B such that
$$A = 2^{m/2} u + v \quad \text{and} \quad B = 2^{m/2} x + y,$$
where u, v, x, and y are (m/2)-bit integers, the traditional quadratic complexity methods compute
AB using four multiplications of (m/2)-bit integers:
$$AB = 2^{m} ux + 2^{m/2}(uy + vx) + vy.$$
Karatsuba showed that the number of multiplications can be reduced from four to three
using the fact that
$$uy + vx = (u + v)(x + y) - ux - vy.$$
In this case the complexity becomes $M_{cost} = O(n^{\log_2 3})$.
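The three-multiplication identity can be turned into a short recursive routine. The following is a sketch under stated assumptions: the cutoff of 32 bits below which the built-in product is used is an arbitrary choice (it should be at least 4 so that the recursion always shrinks), and the name `karatsuba` is introduced only for this example.

```python
def karatsuba(a, b, cutoff_bits=32):
    """Multiply two non-negative integers with Karatsuba's identity."""
    if a.bit_length() <= cutoff_bits or b.bit_length() <= cutoff_bits:
        return a * b                       # small operands: direct product
    half = max(a.bit_length(), b.bit_length()) // 2
    mask = (1 << half) - 1
    u, v = a >> half, a & mask             # a = 2^half * u + v
    x, y = b >> half, b & mask             # b = 2^half * x + y
    ux = karatsuba(u, x, cutoff_bits)
    vy = karatsuba(v, y, cutoff_bits)
    # uy + vx = (u + v)(x + y) - ux - vy, so only one more product is needed
    cross = karatsuba(u + v, x + y, cutoff_bits) - ux - vy
    return (ux << (2 * half)) + (cross << half) + vy
```

Each level replaces one of the four half-size products by a few additions and subtractions, which is where the exponent $\log_2 3$ comes from.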
2.9.4 Field Reduction
Field reduction can be performed very efficiently if the modulus p is a generalized Mersenne
(GM) prime. These primes are sums or differences of a small number of powers of 2 and have
been adopted for the recommended curves in different standards like NIST, ANSI, and SEC. The
commonly used GM primes for different field sizes are shown here:
$$p_{160} = 2^{160} - 2^{31} - 1,$$
$$p_{192} = 2^{192} - 2^{64} - 1,$$
$$p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1.$$
Fast reduction is possible using these primes since the powers of 2 translate naturally to bit
locations in hardware. For instance, $2^{160} \equiv 2^{31} + 1 \pmod{p_{160}}$, and therefore each of the
higher bits can be wrapped to the lower bit locations based on this equivalence. The steps
required to compute the fast reduction using GM primes are given by NIST 25.
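The wrap-around idea can be sketched for primes of the form $p = 2^k - 2^j - 1$, which covers $p_{160}$ and $p_{192}$ above. This is not the word-oriented routine published by NIST; it is a simplified illustration that folds the high part back using $2^k \equiv 2^j + 1 \pmod p$. The function name `reduce_gm` and its parameters are assumptions.

```python
def reduce_gm(c, k, j):
    """Reduce a non-negative integer c modulo p = 2^k - 2^j - 1.

    Uses only shifts, masks and additions for the folding step,
    followed by a final conditional subtraction.
    """
    p = (1 << k) - (1 << j) - 1
    mask = (1 << k) - 1
    while c >> k:                    # bits at position k and above remain
        hi, lo = c >> k, c & mask
        c = lo + hi + (hi << j)      # hi * 2^k  ==  hi * (2^j + 1)  (mod p)
    while c >= p:                    # at most a small final correction
        c -= p
    return c
```

For a product of two field elements, a few folding passes suffice because each pass shortens the operand considerably.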
25 When using general primes which are not GM primes, two other different techniques can be used: Barrett
2.9.5 Field Arithmetic Squaring
A special case of multiplication is squaring, where the multiplicand and the multiplier are
equal. Using quadratic methods, the main advantage is that all cross products, e.g., x0y1 + x1y0
with y = x, arise twice. Exploiting this symmetry saves about half of the multiplications needed
for the cross products at the expense of shifts or additions.
2.9.6 Field Arithmetic Inversion
The two most popular methods for field arithmetic inversion are either based on the Euclidean
algorithm [155] or one of its derivatives (e.g., the almost inverse algorithm), or on FLT.
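Both families of inversion methods can be sketched in a few lines. The first function applies Fermat's little theorem (FLT), $a^{p-1} \equiv 1 \pmod p$, so that $a^{-1} \equiv a^{p-2}$; the second uses the extended Euclidean algorithm. The function names are hypothetical, and the almost inverse algorithm mentioned above is not covered by this sketch.

```python
def inv_fermat(a, p):
    """Inverse of a modulo a prime p via FLT: a^(p-2) mod p."""
    return pow(a, p - 2, p)


def inv_euclid(a, p):
    """Inverse of a modulo p via the extended Euclidean algorithm.

    Invariant: t0 * a == r0 (mod p) and t1 * a == r1 (mod p).
    """
    r0, r1 = p, a % p
    t0, t1 = 0, 1
    while r1:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    if r0 != 1:
        raise ValueError("element is not invertible")
    return t0 % p
```

A modular division x/y is then obtained as one inversion followed by one multiplication, i.e., `(x * inv_euclid(y, p)) % p`.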
2.10 Arithmetic over Binary Extension Fields F2m
Finite fields $F_{p^m}$ with m > 1 are fields with characteristic p and have $p^m$ elements.
Such a finite field exists for every prime p and positive integer m, and contains a subfield
having p elements. This subfield is called the ground field of the original field. The elements
of $F_{p^m}$ are often represented as polynomials of degree less than or equal to m − 1. The special
case where p = 2 is usually referred to as binary extension fields or $F_{2^m}$ 26. Arithmetic in
$F_{2^m}$ fields has different properties than in $F_p$ fields, but is structurally very similar. The role
of the prime modulus is taken by an irreducible polynomial P(x) of degree m 27. This class of
finite fields, as stated in [91, 28, 156], is very attractive for implementation on digital
computers because of the straightforward representation of coefficients as binary bit strings.
In addition, arithmetic in $F_{2^m}$ fields has three distinct advantages. First, the entire addition
is computed by XOR operations and does not require a carry chain. Second, the multiplication
of two coefficients in $F_2$ is an AND operation. Third, in $F_2$ the element 1 is its own additive
inverse, i.e., 1 + 1 = 0 and hence −1 = 1. It can be concluded then that addition and subtraction
are equivalent.
Since the maximum degree of the input polynomials is m − 1, and the addition and subtraction
operations are a simple bitwise XOR of the associated binary vectors of the input polynomials,
the maximum degree of the output polynomial does not increase. Consequently, the irreducible
polynomial is not needed to reduce the result of the addition/subtraction operations.
reduction [153] and Montgomery reduction [154].
26 Also denoted by GF($2^m$).
27 A polynomial P(x) over $F_2$ is irreducible if P(x) is not a unit element and if P(x) = F(x) × G(x), then F(x) or
G(x) must be a unit element.
The binary extension field $F_{2^m}$ contains $2^m$ different elements. In order for an extension of
$F_2$ to be a field, the polynomial P(x) should be irreducible, which means that it should be
impossible to write it as a product of polynomials with a degree less than m. An irreducible
polynomial of degree m that is associated with $F_{2^m}$ can be written as
$$P(x) = x^m + p_{m-1}x^{m-1} + \cdots + p_1 x + p_0,$$
with $p_i \in F_2$ for all i and $p_0 = 1$. A root α of the polynomial satisfies the following equation:
$$\alpha^m + p_{m-1}\alpha^{m-1} + \cdots + p_1\alpha + p_0 = 0 \;\Rightarrow\; \alpha^m = p_{m-1}\alpha^{m-1} + \cdots + p_1\alpha + p_0.$$
As a consequence, reduction modulo P(α) can be done by replacing $\alpha^m$ with
$p_{m-1}\alpha^{m-1} + \cdots + p_1\alpha + p_0$. The following example illustrates multiplication in $F_{2^7}$
with $P(x) = x^7 + x + 1$:
$$\begin{aligned}
(x^6 + x^5 + x + 1) \times (x^6 + x^4 + x^2 + x)
&= x^{12} + x^{11} + x^{10} + x^9 + x^8 + x^7 + x^5 + x^4 + x^3 + x \\
&= (x^5 + x^4 + x^3 + x^2 + x + 1) \times (x^7 + x + 1) + (x^6 + x^5 + x^4 + x^3 + x + 1) \\
&\equiv x^6 + x^5 + x^4 + x^3 + x + 1 \pmod{P(x)}.
\end{aligned}$$
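The worked example can be checked with a small software model in which a binary polynomial is stored as an integer whose bit i holds the coefficient of $x^i$. This encoding and the helper names `clmul`, `poly_mod`, and `gf2m_mul` are assumptions made for illustration only.

```python
def clmul(a, b):
    """Carry-less (XOR-based) product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(c, poly):
    """Remainder of c modulo the degree-m polynomial poly."""
    m = poly.bit_length() - 1
    while c.bit_length() > m:                 # degree of c is still >= m
        c ^= poly << (c.bit_length() - 1 - m)  # cancel the leading term
    return c


def gf2m_mul(a, b, poly):
    """Field multiplication: polynomial product followed by reduction."""
    return poly_mod(clmul(a, b), poly)


P7 = 0b10000011       # x^7 + x + 1
A = 0b1100011         # x^6 + x^5 + x + 1
B = 0b1010110         # x^6 + x^4 + x^2 + x
C = gf2m_mul(A, B, P7)
```

`C` evaluates to `0b1111011`, i.e., $x^6 + x^5 + x^4 + x^3 + x + 1$, in agreement with the example.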
Precisely how each element is represented is defined by the basis being used. The most
common representation, and the one used in this thesis, is the polynomial basis (PB) 28.
This work considers arithmetic in $F_{2^m}$ using a PB representation. Assuming α is a root of
the irreducible polynomial P(x), an arbitrary element A ∈ $F_{2^m}$ is a polynomial of degree less
than m defined over the basis $(\alpha^{m-1}, \alpha^{m-2}, \cdots, \alpha^1, \alpha^0)$, with coefficients $a_i \in F_2$, i.e.,
$$A(\alpha) = \sum_{i=0}^{m-1} a_i\alpha^i = a_{m-1}\alpha^{m-1} + a_{m-2}\alpha^{m-2} + \cdots + a_1\alpha + a_0, \quad a_i \in \{0, 1\}.$$
The above equation states that in PB representation, an element A ∈ $F_{2^m}$ is represented as a
polynomial with coefficients $a_0$ to $a_{m-1}$. These elements are frequently represented as binary
vectors of dimension m over $F_2$ as $(a_{m-1}, a_{m-2}, \cdots, a_0)$, relative to the given basis
$(\alpha^{m-1}, \alpha^{m-2}, \cdots, \alpha^1, \alpha^0)$, i.e.,
$$\overbrace{a_{m-1}\alpha^{m-1} + a_{m-2}\alpha^{m-2} + \cdots + a_1\alpha + a_0}^{\text{polynomial rep.}} \;\Leftrightarrow\; \overbrace{(a_{m-1}, a_{m-2}, \cdots, a_0)}^{\text{coordinate rep.}}$$
28 The field elements in F2m can be represented using other representations such as shifted polynomial basis,
and normal basis. However, they are beyond the scope of this thesis.
2.10.1 Field Arithmetic Addition
Addition can be performed by adding the corresponding coefficients in $F_2$, i.e., without any
carries. Let A and B be two arbitrary elements of $F_{2^m}$, and let C be their sum,
i.e., C = A + B. C is then obtained as follows: 29
$$C(\alpha) = \sum_{i=0}^{m-1} c_i\alpha^i = A(\alpha) + B(\alpha) = \sum_{i=0}^{m-1} \big((a_i + b_i) \bmod 2\big)\alpha^i,$$
where $c_i, a_i, b_i \in F_2$, which in terms of logic circuits directly translates into an XOR
combination of the coefficients. The following example illustrates addition in GF($2^7$):
$$\begin{aligned}
(x^6 + x^5 + x + 1) + (x^6 + x^4 + x^2 + x)
&= (1 + 1)x^6 + x^5 + x^4 + x^2 + (1 + 1)x + 1 \\
&= x^5 + x^4 + x^2 + 1.
\end{aligned}$$
In hardware, a bit-parallel adder requires m XOR gates, and an addition can be generally
computed in one clock cycle.
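In a software model where bit i of an integer holds the coefficient of $x^i$ (an encoding assumed here for illustration), the m parallel XOR gates collapse into a single bitwise XOR. The name `gf2m_add` is hypothetical.

```python
def gf2m_add(a, b):
    """Addition (and subtraction) in F_{2^m}: coefficient-wise XOR, no carries."""
    return a ^ b


A = 0b1100011         # x^6 + x^5 + x + 1
B = 0b1010110         # x^6 + x^4 + x^2 + x
C = gf2m_add(A, B)    # x^5 + x^4 + x^2 + 1
```

No reduction step is needed, since the degree of the result never exceeds that of the operands.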
2.10.2 Field Arithmetic Squaring
Squaring is a special case of multiplication. While a multiplier can be reused as a squarer, a
dedicated squaring architecture that performs the squaring in the shortest possible time is
highly desirable 30. This is specifically true when the squarer is needed for general
exponentiation as well as for the inversion of a field element. Let P(α) be the irreducible
polynomial over $F_2$ generating the field $F_{2^m}$, and let $A(\alpha) = \sum_{i=0}^{m-1} a_i\alpha^i$ be an
arbitrary element of $F_{2^m}$. The squaring operation of A(α) is
$$A^2 \equiv \sum_{i=0}^{m-1} a_i\alpha^{2i} \bmod P(\alpha) \equiv a_0 + a_1\alpha^2 + a_2\alpha^4 + \cdots + a_{m-1}\alpha^{2m-2} \bmod P(\alpha). \tag{2.20}$$
The squaring operation in (2.20) is performed by first computing the unreduced square, which is
done by simply interleaving zeros between the bits of A, and then reducing modulo P(α). If the
reduction polynomial of the PB has a low Hamming weight, such as a trinomial or a pentanomial,
the reduction becomes simple and hence the circuit has a low area complexity. For instance,
suppose that $F_{2^4}$ is constructed via the trinomial $P(x) = x^4 + x + 1$, and let α be a root of P(x).
By substituting $\alpha^4 = \alpha + 1$, we have
$$\begin{aligned}
A(\alpha)^2 &= a_3\alpha^6 + a_2\alpha^4 + a_1\alpha^2 + a_0 = a_3\alpha^2(\alpha + 1) + a_2(\alpha + 1) + a_1\alpha^2 + a_0 \\
&= a_3\alpha^3 + (a_3 + a_1)\alpha^2 + a_2\alpha + (a_0 + a_2).
\end{aligned}$$
Hence, squaring over $F_{2^4}$ can be realized with 2 XOR gates, as shown in Figure
2.6. For $F_{2^{233}}$ constructed by $P(x) = x^{233} + x^{74} + 1$, a bit-parallel squarer requires 153 XOR
gates 31 and has a latency equal to $2T_X$.
29 The subtraction of two field elements in $F_{2^m}$ is the same as the addition because each element is its own
additive inverse.
30 In the case of the NB, squaring is free in terms of both timing and area, as it is equivalent to a cyclic shift.
Figure 2.6: Field Arithmetic Squaring constructed via $P(x) = x^4 + x + 1$ over $F_{2^4}$.
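The interleave-then-reduce procedure can be sketched as follows and compared against the closed-form expression for $F_{2^4}$. Elements are stored as integers with bit i holding $a_i$; this encoding and the names `spread_bits`, `poly_mod`, `gf2m_square`, and `square_f16_formula` are assumptions made for illustration.

```python
def spread_bits(a):
    """Insert a zero between consecutive bits: bit i moves to bit 2i."""
    out, i = 0, 0
    while a:
        out |= (a & 1) << (2 * i)
        a >>= 1
        i += 1
    return out


def poly_mod(c, poly):
    """Remainder of c modulo the degree-m polynomial poly."""
    m = poly.bit_length() - 1
    while c.bit_length() > m:
        c ^= poly << (c.bit_length() - 1 - m)
    return c


def gf2m_square(a, poly):
    """Squaring as in (2.20): interleave zeros, then reduce."""
    return poly_mod(spread_bits(a), poly)


def square_f16_formula(a):
    """Closed form for P(x) = x^4 + x + 1: only two XORs are needed."""
    a0, a1, a2, a3 = (a >> 0) & 1, (a >> 1) & 1, (a >> 2) & 1, (a >> 3) & 1
    return (a3 << 3) | ((a3 ^ a1) << 2) | (a2 << 1) | (a0 ^ a2)
```

Evaluating both functions over all 16 elements of $F_{2^4}$ gives identical results, which is a quick sanity check of the derivation above.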
2.10.3 Field Arithmetic Multiplication
In the last two decades, there have been a number of papers dealing with the practical hardware
and software implementation of the PB multiplication. The multiplication in $F_{2^m}$ based on the
PB representation depends on two arithmetic operations over binary polynomials, namely,
polynomial multiplication and reduction modulo an irreducible polynomial. An efficient
implementation of bit-parallel PB multiplication was described by Mastrovito in [56]. Mastrovito
built a bit-parallel PB multiplier by utilizing the so-called Mastrovito matrix, which is
constructed from the coefficients of the first multiplicand and the irreducible polynomial defining
the field. The polynomial multiplication and modular reduction steps are then performed
together using this matrix. The authors in [61] have thoroughly studied the Mastrovito multiplier
for irreducible trinomials. The authors in [158] have generalized the Mastrovito multiplier
in [61] to any irreducible polynomial. A practical and systematic design approach for the
Mastrovito multiplier can be found in [159]. The authors in [45] propose to use a reduction
matrix to derive a new formulation for PB multiplication.
31 Obtained by (m + k − 1)/2 [157].
Let P(x) be an irreducible polynomial over $F_2$ generating the field $F_{2^m}$. Let
$A = \sum_{i=0}^{m-1} a_i\alpha^i$ and $B = \sum_{i=0}^{m-1} b_i\alpha^i$ be two arbitrary elements of $F_{2^m}$.
The product C of A and B can be obtained in the following two steps:
1. Polynomial multiplication: C′ = A × B, where
$$C' = \Big(\sum_{i=0}^{m-1} a_i\alpha^i\Big) \times \Big(\sum_{j=0}^{m-1} b_j\alpha^j\Big) = \sum_{k=0}^{2m-2} c'_k\alpha^k,$$
and $c'_k$ is given by $c'_k = \sum_{i+j=k} a_i b_j$, 0 ≤ i, j ≤ m − 1, 0 ≤ k ≤ 2m − 2.
2. Reduction modulo the irreducible polynomial:
$$C = \sum_{i=0}^{m-1} c_i\alpha^i \equiv \sum_{k=0}^{2m-2} c'_k\alpha^k \bmod P(\alpha).$$
The complexity of the first step is independent of the choice of the irreducible polynomial
P(x), while the second step costs (ω − 1)(m − 1) bit operations in $F_2$ when the irreducible
polynomial P(α) has ω non-zero terms [160, 161, 47]. Polynomial multiplication C′ = A × B
can be written in matrix form as [158]:
$$\begin{bmatrix} c'_0 \\ c'_1 \\ c'_2 \\ \vdots \\ c'_{m-2} \\ c'_{m-1} \\ c'_m \\ c'_{m+1} \\ \vdots \\ c'_{2m-3} \\ c'_{2m-2} \end{bmatrix}
=
\begin{bmatrix}
a_0 & 0 & 0 & 0 & \cdots & 0 & 0 \\
a_1 & a_0 & 0 & 0 & \cdots & 0 & 0 \\
a_2 & a_1 & a_0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
a_{m-2} & a_{m-3} & a_{m-4} & a_{m-5} & \cdots & a_0 & 0 \\
a_{m-1} & a_{m-2} & a_{m-3} & a_{m-4} & \cdots & a_1 & a_0 \\
0 & a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_2 & a_1 \\
0 & 0 & a_{m-1} & a_{m-2} & \cdots & a_3 & a_2 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & a_{m-1} & a_{m-2} \\
0 & 0 & 0 & 0 & \cdots & 0 & a_{m-1}
\end{bmatrix}
\times
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{m-2} \\ b_{m-1} \end{bmatrix}. \tag{2.21}$$
The coefficients of C′(α) in (2.21) can be determined by the following expressions
$$c'_k = \begin{cases}
\sum_{i=0}^{k} a_i b_{k-i}, & \text{for } k = 0, \cdots, m - 1, \\
\sum_{i=k-m+1}^{m-1} a_i b_{k-i}, & \text{for } k = m, \cdots, 2m - 2.
\end{cases}$$
The total gate complexity 32 for the bit-parallel implementation of the matrix-by-vector
product given in (2.21) is $m^2$ AND gates and $(m - 1)^2$ XOR gates. The AND gates all operate
in parallel, while the XOR gates are organized as binary trees. The largest XOR tree has m
inputs and computes $c'_{m-1}$. Therefore, the total delay complexity ($T_{prod}$) for the
bit-parallel matrix-by-vector product is $T_{prod} = T_A + \lceil \log_2 m \rceil T_X$,
where $T_A$ and $T_X$ denote the delay of a 2-input AND gate and the delay of a 2-input XOR
gate, respectively.
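The coefficient expressions for $c'_k$ and the resulting gate counts can be reproduced with a short script. Coefficient lists are given least significant first, i.e., `a[i]` holds $a_i$; this convention and the name `schoolbook_gf2` are assumptions made for this sketch.

```python
def schoolbook_gf2(a, b):
    """Unreduced product C' = A x B over F_2, plus AND/XOR gate counts.

    a and b are equal-length lists of 0/1 coefficients.
    """
    m = len(a)
    c = [0] * (2 * m - 1)
    and_gates = xor_gates = 0
    for k in range(2 * m - 1):
        lo = max(0, k - m + 1)               # lower summation limit
        hi = min(k, m - 1)                   # upper summation limit
        terms = [a[i] & b[k - i] for i in range(lo, hi + 1)]
        and_gates += len(terms)              # one AND per partial product
        xor_gates += len(terms) - 1          # XOR tree over the terms
        for t in terms:
            c[k] ^= t
    return c, and_gates, xor_gates


# A = x^6 + x^5 + x + 1 and B = x^6 + x^4 + x^2 + x from the earlier example
a = [1, 1, 0, 0, 0, 1, 1]
b = [0, 1, 1, 0, 1, 0, 1]
c, n_and, n_xor = schoolbook_gf2(a, b)
```

For m = 7 the counts come out as 49 AND gates and 36 XOR gates, i.e., $m^2$ and $(m - 1)^2$.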
32 The gate complexity is measured in terms of the number of logic gates required for an implementation.
Logic gates refer to the traditional two-input gates, i.e., AND gates, OR gates, XOR gates, etc.
C′(α) can also be obtained from the modified version of Karatsuba-Ofman [150]. Let the
elements A(α) and B(α) be represented as [162]
$$A(\alpha) = \alpha^{m/2}\overbrace{\big(\alpha^{m/2-1}a_{m-1} + \cdots + a_{m/2}\big)}^{A_H} + \overbrace{\big(\alpha^{m/2-1}a_{m/2-1} + \cdots + a_0\big)}^{A_L},$$
$$B(\alpha) = \alpha^{m/2}\underbrace{\big(\alpha^{m/2-1}b_{m-1} + \cdots + b_{m/2}\big)}_{B_H} + \underbrace{\big(\alpha^{m/2-1}b_{m/2-1} + \cdots + b_0\big)}_{B_L}.$$
Traditionally, the computation of C′(α) requires four multiplications of (m/2)-bit operands, i.e.,
$$C'(\alpha) = \alpha^m A_H B_H + \alpha^{m/2}(A_H B_L + A_L B_H) + A_L B_L. \tag{2.22}$$
Using the Karatsuba method, the number of multiplications can be reduced from four to three.
First, three auxiliary polynomials are defined as
$$\begin{aligned}
M^{(1)}_0 &= A_L(\alpha)B_L(\alpha), \\
M^{(1)}_1 &= [A_L(\alpha) + A_H(\alpha)][B_L(\alpha) + B_H(\alpha)], \\
M^{(1)}_2 &= A_H(\alpha)B_H(\alpha).
\end{aligned} \tag{2.23}$$
Then the product given in (2.22) can be obtained by [162]:
$$C'(\alpha) = \alpha^m M^{(1)}_2(\alpha) + \alpha^{m/2}\big[M^{(1)}_1(\alpha) + M^{(1)}_0(\alpha) + M^{(1)}_2(\alpha)\big] + M^{(1)}_0(\alpha).$$
The algorithm becomes recursive if it is applied again to the polynomials given in (2.23).
The next iteration step splits the polynomials AL, BL, AH, BH, (AL + AH), and (BL + BH) again
in half. With these newly halved polynomials, new auxiliary polynomials M (2) can be defined
in a similar way to (2.23).
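The recursion can be sketched for binary polynomials stored as integers (bit i holds the coefficient of $\alpha^i$). The sketch below is an assumption-laden illustration, not the formulation of [162]: it splits at ⌈m/2⌉ so that odd sizes also work, and it stops recursing at an arbitrary threshold of 4 bits. The names `clmul` and `karatsuba_gf2` are hypothetical.

```python
def clmul(a, b):
    """Carry-less (XOR-based) schoolbook product, used as the base case."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba_gf2(a, b, m):
    """Unreduced product of two binary polynomials of degree < m."""
    if m <= 4:
        return clmul(a, b)
    h = (m + 1) // 2                          # size of the low halves
    mask = (1 << h) - 1
    a_l, a_h = a & mask, a >> h
    b_l, b_h = b & mask, b >> h
    m0 = karatsuba_gf2(a_l, b_l, h)                  # A_L * B_L
    m1 = karatsuba_gf2(a_l ^ a_h, b_l ^ b_h, h)      # (A_L + A_H)(B_L + B_H)
    m2 = karatsuba_gf2(a_h, b_h, h)                  # A_H * B_H
    # Additions and subtractions coincide over F_2, so the middle term
    # is simply M1 + M0 + M2.
    return (m2 << (2 * h)) ^ ((m1 ^ m0 ^ m2) << h) ^ m0
```

When m is even, 2h equals m and the last line matches the expression for C′(α) given above.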
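When the product C′(α) is obtained, a reduction modulo the irreducible polynomial P(α)
must be performed:
$$\begin{aligned}
C(\alpha) &= C'(\alpha) \bmod P(\alpha) \\
&= \sum_{i=0}^{2m-2} c'_i\alpha^i \bmod P(\alpha) \\
&= \sum_{i=0}^{m-1} c'_i\alpha^i + \sum_{i=m}^{2m-2} c'_i\big(\alpha^i \bmod P(\alpha)\big).
\end{aligned} \tag{2.24}$$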
The digit-level multiplier was introduced in [53] and is described by the following equation:
$$\begin{aligned}
AB &\equiv \Big(A\sum_{i=0}^{K_D-1} B_i\alpha^{Di}\Big) \bmod P(\alpha) \\
&\equiv \Big(\sum_{i=0}^{K_D-1} B_i\big(A\alpha^{Di} \bmod P(\alpha)\big)\Big) \bmod P(\alpha).
\end{aligned} \tag{2.25}$$
In (2.25), B is expressed in $K_D$ digits ($1 \le K_D \le \lceil m/D \rceil$) as follows:
$$B = \sum_{i=0}^{K_D-1} B_i\alpha^{Di}, \tag{2.26}$$
where
$$B_i = \sum_{j=0}^{D-1} b_{Di+j}\alpha^j, \tag{2.27}$$
and D is the digit size in bits. Note that when m/D is not an integer, B is extended to an integer
number of digits ($K_D = \lceil m/D \rceil$) by setting its most significant bits to 0, i.e.,
$b_m = b_{m+1} = \cdots = b_{K_D \cdot D - 1} = 0$.
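Equation (2.25) can be sketched in software by processing one D-bit digit of B per iteration. The integer encoding of field elements (bit i holds the coefficient of $\alpha^i$) and the names `clmul`, `poly_mod`, and `digit_serial_mul` are assumptions; the sketch models the arithmetic only, not the hardware structure of [53].

```python
def clmul(a, b):
    """Carry-less (XOR-based) product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(c, poly):
    """Remainder of c modulo the degree-m polynomial poly."""
    m = poly.bit_length() - 1
    while c.bit_length() > m:
        c ^= poly << (c.bit_length() - 1 - m)
    return c


def digit_serial_mul(a, b, poly, D):
    """Digit-level multiplication following (2.25), least significant digit first."""
    m = poly.bit_length() - 1
    k_d = -(-m // D)                       # K_D = ceil(m / D)
    digit_mask = (1 << D) - 1
    acc = 0
    a_shifted = a                          # holds A * alpha^(D*i) mod P
    for i in range(k_d):
        b_i = (b >> (D * i)) & digit_mask  # digit B_i of (2.27)
        acc ^= clmul(a_shifted, b_i)       # degree may exceed m - 1 by up to D - 1
        a_shifted = poly_mod(a_shifted << D, poly)
    return poly_mod(acc, poly)             # final reduction
```

Shifting `b` beyond its top bit yields zeros, which corresponds to padding the most significant digit as described above. With D = 1 the loop degenerates into a bit-level multiplier, and with D = m it performs a single full-width product.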
2.10.4 Traditional Parallel-Out Bit-Level Polynomial Basis Multiplication Operation
The most compact architecture for bit-level multiplication, as stated in [163], is the classical
parallel-out bit-level (POBL) multiplier (MSB-first or LSB-first) due to Beth and Gollman
[31]. Let P(x) be an irreducible polynomial over $F_2$ generating the field $F_{2^m}$. Let
$A = \sum_{i=0}^{m-1} a_i\alpha^i$ and $B = \sum_{i=0}^{m-1} b_i\alpha^i$ be two arbitrary elements of $F_{2^m}$,
and let C be their product, i.e., C = AB. Then, the LSB-first POBL multiplier is obtained as
follows [31]
$$C = b_{m-1}\big((A\alpha^{m-1}) \bmod P(\alpha)\big) + \cdots + b_0\big(A \bmod P(\alpha)\big),$$
and the MSB-first POBL multiplier is obtained as follows
$$C = \Big(\cdots\big((b_{m-1}A)\alpha \bmod P(\alpha) + b_{m-2}A\big)\alpha \bmod P(\alpha) + \cdots + b_1 A\Big)\alpha \bmod P(\alpha) + b_0 A,$$
where α is a root of the irreducible polynomial P(x).
The LSB-first and MSB-first bit-level multipliers are shown in Figures 2.7(a) and
2.7(b), respectively. In these figures, the two registers ⟨X⟩ and ⟨Z⟩ and the cyclic shift (CS)
register ⟨Y⟩ are of length m bits. Let $\langle X\rangle^{(n)}$, $\langle Y\rangle^{(n)}$, and $\langle Z\rangle^{(n)}$ denote the
contents of ⟨X⟩, ⟨Y⟩, and ⟨Z⟩ at the n-th, 0 ≤ n ≤ m − 1, clock cycle, respectively.
In the LSB-first bit-level multiplier shown in Figure 2.7(a), suppose the ⟨X⟩ register
is initialized with the multiplicand A, i.e., $\langle X\rangle^{(0)} = A$. Then the output of this register at
the n-th clock cycle is $\langle X\rangle^{(n)} \in F_{2^m}$, which is calculated from the input of this register
using the α module shown in Figure 2.7(a) and obtained as
$$\langle X\rangle^{(n)} = \alpha \cdot \langle X\rangle^{(n-1)} \bmod P(\alpha), \quad 1 \le n \le m - 1, \tag{2.28}$$
where $\langle X\rangle^{(0)} = A = (a_{m-1}, \cdots, a_1, a_0)$. Suppose that the right CS register ⟨Y⟩ is
initialized with the multiplier B. Also, suppose that the register ⟨Z⟩ is initially cleared, i.e.,
$\langle Z\rangle^{(0)} = (0, \cdots, 0, 0)$. Then, one can obtain the content of ⟨Z⟩ at the first clock cycle
as $\langle Z\rangle^{(1)} = b_0 A$ and in general at the n-th clock cycle as
$$\langle Z\rangle^{(n)} = b_0 A + \sum_{i=1}^{n-1} b_i \langle X\rangle^{(i)}, \quad 1 < n \le m - 1.$$
Figure 2.7: The Traditional Parallel-Out Bit-Level (POBL) Field Arithmetic Multiplication
Schemes [31]. (a) LSB-First POBL Multiplier. (b) MSB-First POBL Multiplier.
Let C denote the PB multiplication of A and B, i.e., C = AB mod P(α). Then, using (2.28)
recursively, one can obtain
$$C = \sum_{i=0}^{m-1} b_i\big((A\alpha^i) \bmod P(\alpha)\big) = \sum_{i=0}^{m-1} b_i \cdot \langle X\rangle^{(i)}. \tag{2.29}$$
From (2.29), one can determine that after m clock cycles ⟨Z⟩ contains
C = AB mod P(α) ∈ $F_{2^m}$, i.e., $\langle Z\rangle^{(m)} = C$. The implementation of $b_i \cdot \langle X\rangle^{(i)}$ in
(2.29) is done using m 2-input AND gates. This is shown in Figure 2.7(a) by the circle with a
bold dot inside, i.e., ⊙. Also, the sum operation in (2.29) is implemented with m 2-input XOR
gates, shown by the circle with a plus inside, i.e., ⊕.
Similarly, in the MSB-first bit-level multiplier shown in Figure 2.7(b), if the registers
⟨X⟩ and ⟨Z⟩ are initialized with $A = (a_{m-1}, \cdots, a_1, a_0)$ and 0 = (0, · · · , 0, 0),
respectively, then one can verify that after the m-th clock cycle the register
⟨Z⟩ contains the coordinates of C, i.e., $\langle Z\rangle^{(m)} = C$.
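The register-level behaviour of both POBL schemes can be modelled with one loop iteration per clock cycle. The sketch below is an illustrative software model, not a description of the circuits in [31]; the integer encoding of registers (bit i holds coordinate i) and the function names are assumptions.

```python
def times_alpha(x, poly, m):
    """The alpha module: multiply by alpha and reduce modulo P(alpha)."""
    x <<= 1
    if x >> m:            # a term alpha^m appeared: replace it using P
        x ^= poly
    return x


def lsb_first_mul(a, b, poly):
    """LSB-first POBL model: <X> tracks A * alpha^n, <Z> accumulates."""
    m = poly.bit_length() - 1
    x, z = a, 0
    for n in range(m):                 # one iteration per clock cycle
        if (b >> n) & 1:               # AND with b_n, then XOR into <Z>
            z ^= x
        x = times_alpha(x, poly, m)    # update <X> as in (2.28)
    return z


def msb_first_mul(a, b, poly):
    """MSB-first POBL model: Horner evaluation from b_{m-1} down to b_0."""
    m = poly.bit_length() - 1
    z = 0
    for n in range(m - 1, -1, -1):
        z = times_alpha(z, poly, m)    # <Z> passes through the alpha module
        if (b >> n) & 1:
            z ^= a
    return z
```

In the LSB-first model the α module acts on ⟨X⟩, while in the MSB-first model it acts on the accumulator ⟨Z⟩; both return the same product after m iterations.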
In addition to the core multiplier component, the bit-level multiplier processor has to embed
some other functionality to operate properly. For instance, a controller component that allows
controlling the I/O communication signals and generates the control signals is required. Also
to minimize the total latency, the data I/O has to be transferred in parallel (at cost of 1 clock
cycle). These additional components are not shown in Figure 2.7 for simplicity, however, all
components must be considered in the area and time complexity analysis.
2.10.5 Field Arithmetic Division/Inversion
Division/inversion is the most expensive of the field arithmetic operations needed by the point
arithmetic operations in ECC. Division in $F_{2^m}$ can be performed using an architecture similar
to that described by Shantz in [164], which is based on the EEA. The architecture proposed in
[164] can compute the division result in 2m clock cycles. It is also possible to perform a
division using repeated multiplications and squarings with the Itoh and Tsujii inversion
algorithm [165, 166]. Since the Itoh and Tsujii algorithm performs an inversion, it must be
followed by an additional multiplication to obtain the division. Assuming A ≠ 0, A ∈ $F_{2^m}$, the
objective is to find a field element $A^{-1}$ such that $A \cdot A^{-1} = 1$. This algorithm is derived
from FLT, that is, $A^{2^m - 1} = 1$ (a proof of this can be found in [167]), which gives
$A^{-1} = A^{2^m - 2}$. In order to obtain $A^{2^m - 2}$, m − 1 squarings
and $\lfloor \log_2(m - 1) \rfloor + H(m - 1) - 1$ multiplications are required, where H(m − 1) represents
the number of non-zero coefficients in the binary representation (Hamming weight) of (m − 1).
Hence, for $F_{2^{163}}$, this algorithm allows a division to be computed in 10M + 162S operations.
Further information on the Itoh and Tsujii inversion algorithm can be found in [165].
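The operation counts quoted above can be checked with a short sketch of an Itoh and Tsujii style addition chain. It builds $\beta_k = A^{2^k - 1}$ by scanning the bits of m − 1 from the second most significant bit downwards and finishes with one squaring, since $A^{-1} = (\beta_{m-1})^2$. The chain shown is one common formulation and may differ in detail from [165]; all function names and the integer encoding of field elements are assumptions.

```python
def gf2m_mul(a, b, poly):
    """Field multiplication: carry-less product followed by reduction."""
    m = poly.bit_length() - 1
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    while c.bit_length() > m:
        c ^= poly << (c.bit_length() - 1 - m)
    return c


def itoh_tsujii_inverse(a, poly):
    """Return (A^-1, multiplications, squarings) in F_{2^m}, m >= 2, A != 0."""
    m = poly.bit_length() - 1
    mults = squares = 0
    beta, k = a, 1                        # beta_1 = A
    for bit in bin(m - 1)[3:]:            # bits of m-1 after the leading 1
        t = beta
        for _ in range(k):                # t = beta_k ^ (2^k)
            t = gf2m_mul(t, t, poly)
            squares += 1
        beta = gf2m_mul(t, beta, poly)    # beta_{2k} = beta_k^(2^k) * beta_k
        mults += 1
        k *= 2
        if bit == "1":                    # beta_{k+1} = beta_k^2 * A
            beta = gf2m_mul(gf2m_mul(beta, beta, poly), a, poly)
            squares += 1
            mults += 1
            k += 1
    inverse = gf2m_mul(beta, beta, poly)  # A^(2^m - 2) = (beta_{m-1})^2
    squares += 1
    return inverse, mults, squares
```

For m = 163 the chain uses 9 multiplications and 162 squarings for the inversion; the extra multiplication needed for a division gives the 10M + 162S figure stated above.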
3 Architectures for SOBL Multiplication Using Polynomial Basis
Compact hardware implementations are very significant for small embedded devices
such as radio frequency identification (RFID) tags. The area complexity of finite
field arithmetic multiplication is critical for such a resource-constrained environment.
In this chapter, we propose new schemes for the serial-out bit-level (SOBL) multiplication
operation using polynomial basis. We show that in terms of the area and time complexities,
the proposed schemes outperform the existing serial-out bit-level schemes available in the
literature. In addition, we show that the smallest SOBL scheme proposed can provide about
24-26% reduction in area complexity cost and about 21-22% reduction in power consumption
for $F_{2^{163}}$ compared to the current state-of-the-art bit-level multiplier schemes 1.
3.1 Introduction
Finite field arithmetic has been widely applied in applications of different fields like error-
control coding, cryptography, and digital signal processing [26, 27, 28, 29]. The arithmetic
operations in the finite fields over characteristic two F2m have gained widespread use in PK
based cryptography such as point multiplication in ECC [18, 19], and exponentiation-based
crypto-systems [13, 10]. The finite field F2m has 2m elements and each of its elements can be
represented by its m binary coordinates based on the choice of field-generating polynomial.
For such a representation, the addition is relatively straight-forward by bit-wise XORing of
the corresponding coordinates of two field elements. On the other hand, the multiplication
operation requires larger and slower hardware. Other complex and time-consuming operations
such as exponentiation and division/inversion are implemented by the iterative application of
the multiplication operation. Much of the ongoing research in this area is focused on finding
new architectures to implement the arithmetic multiplication operation more efficiently (see for
example [168, 169, 170]).
1 Part of this work can be found in [98].
Finite field multipliers with different properties are obtained by choosing different representations of the field elements. With the advantages of low design complexity, simplicity, regularity, and modularity in architecture, the standard or polynomial basis (PB) representation is extensively used for cryptographic applications [171, 1]. In the PB, a multiplier requires a polynomial multiplication followed by a modular reduction. In practice, these two steps can be combined into a single step by using the so-called Mastrovito matrix [56, 46]. The properties and complexities of the PB multipliers depend heavily on the choice of a field-generating polynomial. In this chapter, we first consider an irreducible polynomial with ω, ω ≥ 3, non-zero terms (denoted by ω-nomials). We then obtain a further optimized structure for the special irreducible trinomial (ω = 3).
The implementation of finite field multipliers can be categorized, in terms of their structures, into three groups: bit-parallel, digit-level, and bit-level types. Various efficient bit-parallel architectures for the PB multipliers have been proposed in the literature; see, for example, [56, 46, 172, 47, 58, 61, 158, 159, 49, 162, 45]. In the bit-parallel multiplier, once the two m-bit inputs are received, the m bits of the product are obtained together at the output after the propagation delay of its logic gates.
The bit-level multiplier is especially attractive for applications on resource-constrained and lightweight devices, whereas the bit-parallel multiplier is attractive for high-speed implementations. When the PB is used, the bit-level multiplication algorithms are classified as least significant bit first (LSB-first) and most significant bit first (MSB-first) schemes [31].
The bit-level multiplier can be further categorized into two types, with either parallel or serial output. In the traditional parallel-out bit-level (POBL) multipliers [31], all of the output bits of the multiplication (from the first bit to the last bit) are generated at the end of the last clock cycle. Serial-out bit-level (SOBL) multipliers, on the other hand, generate the output bits of the product sequentially, after a certain number of clock cycles. Let us denote by bit-latency the number of clock cycles required to generate the first output bit. The bit-latency in the work proposed by Yeh et al. in [54] is 2m cycles. In [31, 173, 174, 175, 176],
this latency has been reduced to m cycles. In [177], the first SOBL multiplier was proposed;
however, in their architecture, the first output bit is constructed after a delay of m cycles, i.e.,
the bit-latency is m. In [178], an architecture for SOBL multiplication using irreducible all-one polynomials has been proposed. The author of [30] has proposed a SOBL multiplication architecture for irreducible trinomials and ω-nomials in $\mathbb{F}_{2^m}$ using the PB representation. A major feature of this architecture is that the bit-latency is one clock cycle. A multiplication scheme based on serial-out architecture, i.e., SOBL, has certain
advantages as compared to the traditional parallel-out architecture. For instance, combining a
SOBL with a traditional LSB-first POBL one, would make fast exponentiation and inversion
possible [32, 33]. In this chapter, alternative schemes for the serial-out multiplication in the PB
over F2m for trinomial, pentanomial, and ω-nomial irreducible polynomial are developed. We
summarize our contributions as follows:
• We propose novel schemes for the SOBL finite field multiplication operation that are constructed from an irreducible polynomial with ω, ω ≥ 3, non-zero terms (denoted by ω-nomials). We show that, in terms of area and time complexities, the proposed schemes outperform the existing SOBL schemes available in the literature. In addition, we show that the smallest SOBL scheme proposed can provide about a 24-26% reduction in area complexity and about a 21-22% reduction in power consumption for $\mathbb{F}_{2^{163}}$ compared to the current state-of-the-art bit-level multiplier schemes.
• To obtain actual implementation results, all the proposed schemes, i.e., 3 SOBL multipliers, and their counterparts, i.e., the LSB-first POBL [31], the MSB-first POBL [31], and the SOBL scheme proposed in [30], are coded in VHDL (6 schemes in total) and implemented on ASIC technology over both $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$.
The organization of this chapter is as follows. Notation and mathematical background are given in Section 2. In Section 3, the formula for a new SOBL multiplication is presented. Section 4 is the core of this chapter, in which two novel architectures for the SOBL multiplier, for both the trinomial and the ω-nomial irreducible polynomials, are presented. In Section 5, another compact approach to the architecture design of the SOBL multiplier is presented. In Section 6, the proposed architectures and the previously reported ones are compared in terms of area, delay, and I/O parallel loading complexities. In Section 7, the performance of the proposed multiplier schemes is investigated by implementing each multiplier and its counterparts on ASIC technology. Finally, the conclusion is presented in Section 8.
3.2 Preliminaries
The binary extension field F2m can be viewed as an m-dimensional vector space defined over F2
[26]. A set of m linearly independent vectors (elements of F2m) is chosen to serve as the basis
of representation. An explicit choice for a basis is the ordered set $\{\alpha^{m-1}, \cdots, \alpha^2, \alpha, 1\}$, where $\alpha \in \mathbb{F}_{2^m}$ is a root of an irreducible polynomial $P(x)$ of degree $m$. This basis is called the polynomial basis (PB). Each element is represented by a polynomial of degree at most $m - 1$, whose coefficients are the binary digits 0 or 1. All coefficient arithmetic is performed modulo 2.
A straightforward $\mathbb{F}_{2^m}$ multiplication consists of two parts: the product of two field elements, followed by a modular reduction [47, 58]. Suppose $A = (a_{m-1}, \cdots, a_1, a_0)$ and $B = (b_{m-1}, \cdots, b_1, b_0)$ are two arbitrary field elements, i.e., $A, B \in \mathbb{F}_{2^m}$. Then, to obtain the field multiplication of $A$ and $B$, $AB$ is computed first; it is then followed by the modular reduction, i.e.,
$$C \triangleq AB \bmod P(\alpha).$$
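The two parts described above (polynomial product, then modular reduction) can be sketched in Python as a software reference model. This is only an illustrative sketch, not a hardware description: the function name `gf2m_mul`, the integer encoding (bit $i$ of an integer holds coordinate $i$), and the `low_exps` parameter are assumptions introduced here.

```python
def gf2m_mul(a: int, b: int, m: int, low_exps: list[int]) -> int:
    """Reference product C = A*B mod P in F_{2^m} (illustrative helper).

    Bit i of an integer holds coordinate i of the field element.
    low_exps lists the exponents of P(x) below x^m, including 0,
    e.g. [7, 6, 3, 0] for x^163 + x^7 + x^6 + x^3 + 1.
    """
    # Part 1: carry-less polynomial product over F_2 (degree <= 2m - 2).
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    # Part 2: modular reduction, clearing x^deg from the top down by
    # substituting x^m = sum of x^t over the low-order terms of P(x).
    for deg in range(2 * m - 2, m - 1, -1):
        if (prod >> deg) & 1:
            prod ^= 1 << deg
            for t in low_exps:
                prod ^= 1 << (deg - m + t)
    return prod


# Toy field F_{2^3} with P(x) = x^3 + x + 1: (x + 1)(x^2 + x) reduces to 1.
print(gf2m_mul(0b011, 0b110, 3, [1, 0]))
```

Reducing from the highest degree downwards guarantees that every bit set by a substitution lies below the bit just cleared, so a single pass suffices.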
In [56, 46], Mastrovito has proposed an efficient dedicated parallel multiplication that combines
the two parts of the product and the modular reduction into a single step. He showed that the
coordinates of C are obtained from the matrix-by-vector product of
$$\mathbf{c} = [c_{m-1}, \cdots, c_1, c_0]^T = \mathbf{M} \cdot \mathbf{b}^T, \tag{3.1}$$
where $T$ denotes transposition; the row vector $\mathbf{b} = [b_{m-1}, \cdots, b_1, b_0]$ contains the coordinates of the multiplier $B = (b_{m-1}, \cdots, b_1, b_0) \in \mathbb{F}_{2^m}$, and $\mathbf{M}$ is an $m \times m$ binary matrix whose entries depend on the coordinates of $A \in \mathbb{F}_{2^m}$. This equation was implicitly used in [61, 158],
and [159] to derive the bit-parallel multiplier and is now used in this work to design the new
SOBL multiplier.
Sunar and Koç [61] have studied the Mastrovito matrix $\mathbf{M}$ and have presented a formulation for the Mastrovito algorithm using irreducible trinomials. Halbutoğulları and Koç in [158] have presented a new architecture for the Mastrovito multiplication and a rigorous analysis of the complexity for a general irreducible polynomial. They have also shown that the coefficients of the product $AB$ can be obtained from the matrix-by-vector product $\mathbf{d} \triangleq [d_{2m-2}, \cdots, d_m, d_{m-1}, \cdots, d_0]^T = \mathbf{Z} \cdot \mathbf{b}^T$, where $\mathbf{Z}$ is a $(2m - 1) \times m$ binary matrix whose entries are
$$
\mathbf{Z} \triangleq
\begin{bmatrix}
a_0 & 0 & \cdots & 0 & 0 \\
a_1 & a_0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{m-2} & a_{m-3} & \cdots & a_0 & 0 \\
a_{m-1} & a_{m-2} & \cdots & a_1 & a_0 \\
0 & a_{m-1} & \cdots & a_2 & a_1 \\
0 & 0 & \cdots & a_3 & a_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & a_{m-1}
\end{bmatrix}. \tag{3.2}
$$
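As a small illustration of (3.2), the following Python sketch builds $\mathbf{Z}$ from the coordinates of $A$ and forms the unreduced product coefficients. The helper names `build_z` and `mat_vec_gf2` are hypothetical, and the sketch assumes that both coordinate lists are stored in ascending order (index $k$ holds $a_k$ or $b_k$), so that column $k$ of $\mathbf{Z}$ multiplies $b_k$ and entry $i$ of the result is the coefficient of $x^i$.

```python
def build_z(a):
    """Build the (2m-1) x m matrix Z of (3.2); a[i] holds coordinate a_i."""
    m = len(a)
    # Entry (i, k) is a_{i-k} when 0 <= i - k <= m - 1, and 0 otherwise.
    return [[a[i - k] if 0 <= i - k < m else 0 for k in range(m)]
            for i in range(2 * m - 1)]


def mat_vec_gf2(mat, vec):
    """Matrix-by-vector product over F_2 (AND for products, XOR for sums)."""
    out = []
    for row in mat:
        acc = 0
        for r, v in zip(row, vec):
            acc ^= r & v
        out.append(acc)
    return out


# A = 1 + x and B = x + x^2, so the unreduced product is AB = x + x^3.
a = [1, 1, 0]
b = [0, 1, 1]
d = mat_vec_gf2(build_z(a), b)   # d[i] is the coefficient of x^i
print(d)
```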
Table 3.1: List of Notations.
| Symbol | Description |
|---|---|
| $\mathbf{b}$ | Row vector. |
| $\mathbf{b}^T$ | Column vector. |
| $\mathbf{M}(i, :)$ | The $i$th row of the matrix $\mathbf{M}$. |
| $\mathbf{M}(:, j)$ | The $j$th column of the matrix $\mathbf{M}$. |
| $\mathbf{M}(i : j)$ | An entry with position $(i, j)$ of the matrix $\mathbf{M}$. |
| $[v_j, \cdots, v_i]$ | The range of bits in the vector $\mathbf{v}$ from position $i$ to position $j$, $j > i$. |
| $\langle r_j, \cdots, r_i \rangle$ | The range of bits in the register $\langle R \rangle$ from position $i$ to position $j$, $j > i$. |
| $\mathbf{M}[\downarrow n]$ | A down shift of the matrix $\mathbf{M}$ by $n$ positions; emptied positions after the shifts are filled by zeros. |
| $\mathbf{M}(j, :)[\rightarrow 1]$ | A right shift of the $j$th row of the matrix $\mathbf{M}$ by 1 position; emptied positions after the shifts are filled by zeros. |
| $\mathbf{v}[f_0, \rightarrow 1]$ | A right shift of the vector $\mathbf{v}$ by one bit with cell $f_0$ fed in its left-most bit, i.e., for the vector $\mathbf{v}$ of length $l$ bits, $\mathbf{v}[f_0, \rightarrow 1] = [f_0, \underbrace{0, \cdots, 0}_{l-1}] + \mathbf{v}[\rightarrow 1]$. |
| $e_i \,\Vert\, \mathbf{v}$ | The process of concatenating an element $e_i$ and a vector $\mathbf{v}$. |
In [159], Zhang and Parhi have proposed the use of a bit-parallel Mastrovito multiplier
based on a systematic design approach for the technique proposed in [158].
3.2.1 Notations
Let us now introduce the following notations, which will be used in this work: Row and column
vectors are represented by small boldfaced characters. Matrices are represented by capital
boldfaced characters, and to represent the entries of a matrix, we use the common notation
used in the literature such as [61, 158, 159, 172, 30]. These notations are summarized in Table
3.1.
3.2.2 Reduction Process
Let us first define an irreducible polynomial with ω non-zero terms, i.e., [30]
$$P(x) \triangleq x^m + \sum_{i=1}^{\omega-1} x^{t_i}, \tag{3.3}$$
where $\frac{m}{2} > t_1 > t_2 > \cdots > t_{\omega-2} > t_{\omega-1} = 0$. Then, from (3.3), we define two new sets: $T$ is the set of degrees of the non-zero terms in (3.3) below $x^m$, and $N$ consists of ω − 1 elements, namely 0 and the differences between $m$ and the degrees of the remaining non-zero terms in (3.3), i.e.,
$$T \triangleq \{0, t_1, \cdots, t_{\omega-2}\},$$
and
$$N \triangleq \{0, \Delta_1, \cdots, \Delta_{\omega-2}\},$$
where $\Delta_1 = m - t_{\omega-2}$, $\Delta_2 = m - t_{\omega-3}$, $\cdots$, $\Delta_{\omega-2} = m - t_1$.
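For concreteness, the two sets can be computed directly from the exponents of the polynomial. The short Python sketch below uses a hypothetical helper `reduction_sets`, which takes the exponents $t_1, \cdots, t_{\omega-2}$ in decreasing order (the constant term is implicit); the range check simply restates the condition given after (3.3).

```python
def reduction_sets(m, exps):
    """Return (T, N) for P(x) = x^m + x^{t_1} + ... + x^{t_{w-2}} + 1.

    exps = [t_1, ..., t_{w-2}] in decreasing order, constant term excluded.
    """
    if list(exps) != sorted(set(exps), reverse=True):
        raise ValueError("exponents must be distinct and decreasing")
    if not all(m / 2 > t > 0 for t in exps):
        raise ValueError("expected m/2 > t_1 > ... > t_{w-2} > 0")
    T = [0] + list(exps)
    # Delta_1 = m - t_{w-2}, ..., Delta_{w-2} = m - t_1.
    N = [0] + [m - t for t in reversed(exps)]
    return T, N


# Pentanomial x^163 + x^7 + x^6 + x^3 + 1.
print(reduction_sets(163, [7, 6, 3]))
```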
Note that the Mastrovito matrix $\mathbf{M}$ in (3.1) can be obtained by reducing the matrix $\mathbf{Z}$ in (3.2) using the generating polynomial (3.3). It is shown in [45] that the matrix $\mathbf{M}$ can be obtained as
$$\mathbf{M} = \mathbf{L} + \mathbf{Q} \cdot \mathbf{U}, \tag{3.4}$$
where $\mathbf{L}$ is an $m \times m$ lower triangular Toeplitz matrix, defined as the first $m$ rows of the matrix $\mathbf{Z}$, and $\mathbf{U}$ is an $(m - 1) \times m$ upper triangular Toeplitz matrix, defined as the last $(m - 1)$ rows of $\mathbf{Z}$, i.e.,
$$
\mathbf{L} \triangleq
\begin{bmatrix}
a_0 & 0 & 0 & 0 & \cdots & 0 \\
a_1 & a_0 & 0 & 0 & \cdots & 0 \\
a_2 & a_1 & a_0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & & \vdots \\
a_{m-2} & a_{m-3} & \cdots & a_1 & a_0 & 0 \\
a_{m-1} & a_{m-2} & \cdots & a_2 & a_1 & a_0
\end{bmatrix},
\quad
\mathbf{U} \triangleq
\begin{bmatrix}
0 & a_{m-1} & a_{m-2} & \cdots & a_1 \\
0 & 0 & a_{m-1} & \cdots & a_2 \\
0 & 0 & 0 & \cdots & a_3 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & a_{m-1} & a_{m-2} \\
0 & 0 & \cdots & 0 & a_{m-1}
\end{bmatrix},
\tag{3.5}
$$
and $\mathbf{Q}$ is a reduction matrix, which is formalized in [159, 45, 172] as
$$\mathbf{Q} = \sum_{n \in N} \hat{\mathbf{Q}}[\rightarrow n], \tag{3.6}$$
where
$$\hat{\mathbf{Q}} = \sum_{t \in T} \mathbf{I}_{m \times (m-1)}[\downarrow t], \tag{3.7}$$
and $\mathbf{I}_{m \times (m-1)}$ represents an $m \times (m - 1)$ identity matrix.
From (3.4), one can see that, based on $\mathbf{Q}$, certain rows of the matrix $\mathbf{U}$ are added to the rows with lower indices. Then, using (3.6) and (3.7), the matrix $\mathbf{M}$ in (3.4) can be written as [159]
$$\mathbf{M} = \mathbf{L} + \mathbf{S} + \sum_{t \in T - \{0\}} \mathbf{S}[\downarrow t], \tag{3.8}$$
where the matrix $\mathbf{S}$ is an $m \times m$ upper triangular Toeplitz matrix with the following form:
$$
\mathbf{S} \triangleq
\begin{bmatrix}
0 & s_{m-1} & s_{m-2} & \cdots & s_1 \\
0 & 0 & s_{m-1} & \cdots & s_2 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & s_{m-1} & s_{m-2} \\
0 & 0 & \cdots & 0 & s_{m-1} \\
0 & 0 & \cdots & 0 & 0
\end{bmatrix},
\tag{3.9}
$$
where row 0 of $\mathbf{S}$, i.e., $\mathbf{S}(0, :)$, can be computed as [159]
$$\mathbf{S}(0, :) = [0, s_{m-1}, \cdots, s_1] = \sum_{n \in N} \mathbf{U}(0, :)[\rightarrow n]. \tag{3.10}$$
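Equation (3.10) can be checked numerically. The Python sketch below computes $\mathbf{S}(0, :)$ by shifting $\mathbf{U}(0, :)$ and compares it with row 0 of the Mastrovito matrix obtained by brute force, where entry $k$ of that row is the constant term of $A x^k \bmod P(x)$. The helper names `s_row0` and `mastrovito_row0` are hypothetical, the coordinate list is assumed to be in ascending order (`a[i]` holds $a_i$), and the pentanomial is the $\mathbb{F}_{2^{163}}$ example used later in this chapter.

```python
import random


def s_row0(a, exps):
    """S(0,:) = [0, s_{m-1}, ..., s_1] from (3.10); a[i] holds a_i."""
    m = len(a)
    u0 = [0] + [a[m - k] for k in range(1, m)]   # U(0,:) = [0, a_{m-1}, ..., a_1]
    row = [0] * m
    for n in [0] + [m - t for t in exps]:        # n runs over the set N
        shifted = [0] * n + u0[:m - n]           # U(0,:)[-> n], zero filled
        row = [r ^ u for r, u in zip(row, shifted)]
    return row


def mastrovito_row0(a, exps):
    """Row 0 of M by brute force: entry k is the constant term of A*x^k mod P."""
    m = len(a)
    col = list(a)                                # coordinates of A * x^0
    row0 = []
    for _ in range(m):
        row0.append(col[0])
        top = col[-1]
        col = [0] + col[:-1]                     # multiply by x
        if top:                                  # x^m = 1 + sum of x^t
            for t in [0] + list(exps):
                col[t] ^= 1
    return row0


random.seed(1)
a = [random.getrandbits(1) for _ in range(163)]
exps = [7, 6, 3]                                 # x^163 + x^7 + x^6 + x^3 + 1
s = s_row0(a, exps)
# M(0,:) should equal [a_0, s_{m-1}, ..., s_1].
print(mastrovito_row0(a, exps) == [a[0]] + s[1:])
```

The brute-force routine never uses the sets $T$ and $N$ beyond the polynomial itself, so agreement between the two results is an independent check of the shift-and-add formulation.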
3.3 Proposed Serial-Out Bit-Level Multiplication Algorithm
From (3.4) and (3.8), one can define a matrix $\mathbf{P}$ as
$$\mathbf{P} = \mathbf{Q} \cdot \mathbf{U} = \mathbf{S} + \sum_{t \in T - \{0\}} \mathbf{S}[\downarrow t]. \tag{3.11}$$
In (3.11), the rows produced due to the reductions corresponding to the $x^{t_i}$ terms in (3.3) are identical to the rows produced at the first reduction iteration. Thus, we can store the elements of row $\mathbf{S}(0, :)$, so that they can be added later to obtain the rows $t_i$, $1 \leq i \leq \omega - 2$, of the matrix $\mathbf{P}$, i.e., $\mathbf{P}(t_i, :)$, for $t_i \in T - \{0\}$. Then, the rows $\mathbf{P}(j, :)$, for $0 \leq j \leq m - 1$, can be obtained as
$$
\mathbf{P}(j, :) =
\begin{cases}
\mathbf{S}(0, :), & \text{for } j = 0, \\
\mathbf{P}(j - 1, :)[\rightarrow 1], & \text{for } 0 < j \text{ and } j \neq t_i, \\
\mathbf{P}(j - 1, :)[\rightarrow 1] + \mathbf{S}(0, :), & \text{for } j = t_i,
\end{cases}
\tag{3.12}
$$
for $1 \leq i \leq \omega - 2$.
From the Toeplitz matrix $\mathbf{L}$, which is shown in (3.5), one can see that the rows $\mathbf{L}(j, :)$, for $0 \leq j \leq m - 1$, can be obtained as
$$
\mathbf{L}(j, :) =
\begin{cases}
[a_0, \underbrace{0, \cdots, 0}_{m-1}], & \text{for } j = 0, \\
\mathbf{L}(j - 1, :)[a_j, \rightarrow 1], & \text{for } 0 < j \leq m - 1.
\end{cases}
\tag{3.13}
$$
From (3.12) and (3.13), the row $j$ of the matrix $\mathbf{M}$ in (3.4), i.e., $\mathbf{M}(j, :)$, for $0 \leq j \leq m - 1$, is obtained as
$$
\mathbf{M}(j, :) =
\begin{cases}
\mathbf{L}(0, :) + \mathbf{S}(0, :), & j = 0, \\
\mathbf{M}(j - 1, :)[a_j, \rightarrow 1], & 0 < j \text{ and } j \neq t_i, \\
\mathbf{M}(j - 1, :)[a_j, \rightarrow 1] + \mathbf{S}(0, :), & j = t_i,
\end{cases}
\tag{3.14}
$$
for $1 \leq i \leq \omega - 2$.
From (3.10) and (3.13), one can see that the row 0 of the matrix $\mathbf{M}$ in (3.14) can be obtained as
$$\mathbf{M}(0, :) = \mathbf{L}(0, :) + \mathbf{S}(0, :) = [a_0, s_{m-1}, s_{m-2}, \cdots, s_1]. \tag{3.15}$$
After calculating $\mathbf{M}(j, :)$ and based on (3.1), one can serially obtain $c_j$, for $0 \leq j \leq m - 1$, as
$$c_j = \mathbf{M}(j, :) \cdot \mathbf{b}^T. \tag{3.16}$$
3.3.1 Proposed SOBL Multiplication Algorithm for ω-nomials
From (3.10), (3.14), (3.15), and (3.16), we propose the following algorithm, which outlines the process of serially generating the coordinates of $C$, from $c_0$ to $c_{m-1}$, for the multiplication of the two field elements $A$ and $B$.
Algorithm 9 is indeed a bit-level algorithmic version of the architecture of the bit-parallel Mastrovito PB multiplier proposed in [159]. In Algorithm 9, the coordinates of the signal vector $\mathbf{s}$ represent the entries of the first row of the matrix $\mathbf{S}$, i.e., $\mathbf{S}(0, :)$. These coordinates are obtained as presented in (3.10). From the Toeplitz matrix $\mathbf{S}$ shown in (3.9), one can see that the leading (left-most) entry of $\mathbf{S}(0, :)$ is zero; hence, it is neglected in Algorithm 9. The signal vector $\mathbf{s}$ is initialized with the coordinates from 1 to $m - 1$ of the multiplicand $A$, i.e., $\mathbf{s} = [s_{m-1}, \cdots, s_1] = [a_{m-1}, \cdots, a_1]$. Then, the elements of $\mathbf{s}$ are accumulated in accordance with (3.10) to produce the desired $\mathbf{S}(0, :)$ after a total of $\omega - 2$ loop iterations. Hence, at each for loop iteration, i.e., in Step 1.2, the coordinates from $\Delta_i + 1$ to $m - 1$, for $1 \leq i \leq \omega - 2$, of the multiplicand $A$ are XORed with the previous iteration's $\mathbf{s}$ signal.
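A software sketch of Algorithm 9 may help to follow the steps. The Python generator below is an illustrative model only: the name `sobl_multiply` is hypothetical, all vectors are kept in column order (index $k$ lines up with column $k$ of $\mathbf{M}$), and the coordinates of $B$ are assumed to be stored in ascending order so that column $k$ meets $b_k$.

```python
from collections import deque


def sobl_multiply(a, b, exps):
    """Yield c_0, ..., c_{m-1} one bit per iteration, following Algorithm 9.

    a[i], b[i] hold a_i, b_i; exps = [t_1, ..., t_{w-2}] (constant term excluded).
    """
    m = len(a)
    # Initialize: s = [s_{m-1}, ..., s_1] = [a_{m-1}, ..., a_1].
    s = [a[m - 1 - p] for p in range(m - 1)]
    # Step 1: accumulate the shifted copies required by (3.10).
    for t in exps:
        delta = m - t
        for p in range(delta, m - 1):
            # delta zeros first, then a_{m-1}, ..., a_{delta+1}.
            s[p] ^= a[m - 1 - (p - delta)]
    # Step 3: w <- s and x <- a_0 || s, i.e. x = M(0,:).
    w = list(s)
    x = [a[0]] + s
    y = deque(a[1:])          # y_0 = a_1 sits at the left end
    z = list(b)               # assumed ordering: z[k] = b_k
    # Step 4: one output bit per iteration.
    for j in range(m):
        c_j = 0
        for xk, zk in zip(x, z):          # Step 4.1: inner product over F_2
            c_j ^= xk & zk
        yield c_j
        feed = y[0] if y else 0
        x = [feed] + x[:-1]               # Step 4.2.1: shift with y_0 fed in
        if j + 1 in exps:                 # Step 4.3.1: case j = t_i - 1
            x = [x[0]] + [xk ^ wk for xk, wk in zip(x[1:], w)]
        y.rotate(-1)                      # Step 4.5: cyclic shift of y


# F_{2^3} with P(x) = x^3 + x + 1: (1 + x)(x + x^2) reduces to 1.
print(list(sobl_multiply([1, 1, 0], [0, 1, 1], [1])))
```

Writing the routine as a generator mirrors the serial-out behaviour: the first coordinate is available after the first iteration, without waiting for the remaining ones.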
Algorithm 9 Proposed Serial-Out Bit-Level Mastrovito Multiplier for ω-nomials $x^m + x^{t_1} + \cdots + x^{t_{\omega-2}} + 1$
Input: The parameters of the ω-nomial irreducible polynomial: $m, t_1, \cdots, t_{\omega-2}$; $A = (a_{m-1}, \cdots, a_0)$, $B = (b_{m-1}, \cdots, b_0) \in \mathbb{F}_{2^m}$.
Output: $c_j$, where $C = (c_{m-1}, \cdots, c_0) = AB \bmod P(\alpha)$.
/* Set signal vectors s, y, and z of length m − 1, m − 1, and m bits, respectively */
Initialize: $\mathbf{y} = [y_{m-2}, \cdots, y_0] = (a_{m-1}, \cdots, a_1)$;
            $\mathbf{z} = [z_{m-1}, \cdots, z_0] = (b_{m-1}, \cdots, b_0)$;
            $\mathbf{s} = [s_{m-1}, \cdots, s_1] = (a_{m-1}, \cdots, a_1)$.
/* Compute s = S(0, :) */
Step 1: For $i = 1$ to $\omega - 2$ do
Step 1.1:     $\Delta_i = m - t_{\omega-1-i}$;
Step 1.2:     $\mathbf{s} = [s_{m-1}, \cdots, s_1] + [\underbrace{0, \cdots, 0}_{\Delta_i}, a_{m-1}, \cdots, a_{\Delta_i+1}]$;
Step 2: End For
/* Set a signal vector w of length m − 1 bits and initialize it with S(0, :),
   and set a signal vector x of length m bits and initialize it with M(0, :) */
Step 3: $\mathbf{w} \leftarrow \mathbf{s}$; $\mathbf{x} \leftarrow a_0 \,\Vert\, \mathbf{s}$;
/* Processes of the loop started in Step 4 are computed in parallel */
Step 4: For $j = 0$ to $m - 1$ do
    /* Compute the inner product: c_j = M(j, :) · b^T */
Step 4.1:     Output $c_j = \mathbf{x} \bullet \mathbf{z}$;
    /* Update x with M(j+1, :) */
Step 4.2:     If $j \neq t_i - 1$ Then
        /* M(j+1, :) = M(j, :)[a_{j+1}, → 1] */
Step 4.2.1:       $\mathbf{x} \leftarrow [y_0, x_{m-1}, \cdots, x_1]$;
Step 4.3:     Else /* j = t_i − 1 */
        /* M(j+1, :) = M(j, :)[a_{j+1}, → 1] + S(0, :) */
Step 4.3.1:       $\mathbf{x} \leftarrow [y_0, x_{m-1} + w_{m-2}, \cdots, x_1 + w_0]$;
Step 4.4:     End If
Step 4.5:     $\mathbf{y} \leftarrow [y_0, y_{m-2}, \cdots, y_1]$;
Step 5: End For

Let us consider the binary extension field $\mathbb{F}_{2^{163}}$ generated by the irreducible pentanomial $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$. Then, the two sets are $T = \{0, 7, 6, 3\}$ and $N = \{0, 160, 157, 156\}$. Given an arbitrary field element $A \in \mathbb{F}_{2^{163}}$, the Mastrovito matrix $\mathbf{M}$ for this example is shown in Figure 3.1. As shown in this figure, the coordinates of the signal vector $\mathbf{s}$ are utilized for obtaining the rows of the matrix $\mathbf{M}$. The coordinates of $\mathbf{s}$ are computed as
$$
s_i =
\begin{cases}
a_i + a_{160+i} + a_{157+i} + a_{156+i}, & 1 \leq i \leq 2, \\
a_i + a_{157+i} + a_{156+i}, & 3 \leq i \leq 5, \\
a_i + a_{156+i}, & i = 6, \\
a_i, & 7 \leq i \leq 162,
\end{cases}
\tag{3.17}
$$
for $i = 1, 2, \cdots, 162$. Equation (3.17) can be realized by an architecture of 6 binary trees of XOR gates, as depicted in Figure 3.2.
$$
\mathbf{M} \triangleq
\begin{bmatrix}
a_0 & s_{162} & s_{161} & \cdots & s_2 & s_1 \\
a_1 & a_0 & s_{162} & \cdots & s_3 & s_2 \\
a_2 & a_1 & a_0 & s_{162} & \cdots & s_3 \\
a_3 & a_2 + s_{162} & \cdots \;\; a_0 + s_{160} & s_{162} + s_{159} & \cdots & s_4 + s_1 \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
a_6 & a_5 + s_{162} & \cdots \;\; a_3 + s_{160} & a_2 + s_{162} + s_{159} & \cdots & s_7 + s_4 + s_1 \\
a_7 & a_6 + s_{162} & a_5 + s_{162} + s_{161} & \cdots & \cdots & s_8 + s_5 + s_2 + s_1 \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
a_{162} & a_{161} \;\; \cdots \;\; a_7 & a_6 + s_{162} & \cdots & \cdots & a_0 + s_{160} + s_{157} + s_{156}
\end{bmatrix}.
$$
Figure 3.1: Constructing The Mastrovito Matrix M over $\mathbb{F}_{2^{163}}$ Generated by $x^{163} + x^7 + x^6 + x^3 + 1$.
Figure 3.2: The Process for Constructing The Coordinates of The Signal Vector s over $\mathbb{F}_{2^{163}}$.
In general, the number of XOR gates for computing the vector $\mathbf{s}$, i.e., $\#XOR_S$, is
$$\#XOR_S = \sum_{i=1}^{\omega-2} (t_i - 1), \tag{3.18}$$
and the time delay of the longest path between the inputs and outputs, $S_T$, is $S_T = \lceil \log_2(\omega - 1) \rceil T_X$, where $T_X$ denotes the delay of the 2-input XOR gate. Hence, in Figure 3.2, the total number of XOR gates is $\#XOR_S = 13$ and the delay is $S_T = 2T_X$.
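The gate count in (3.18) and the tree depth can be reproduced with a few lines of Python. The sketch below uses hypothetical helper names; the second function recounts the XOR gates term by term from the structure of (3.17), as a cross-check of the closed form.

```python
def s_circuit_cost(exps):
    """Return (#XOR_S, depth in units of T_X) for the circuit computing s.

    exps = [t_1, ..., t_{w-2}], constant term excluded, so w = len(exps) + 2.
    """
    omega = len(exps) + 2
    xor_gates = sum(t - 1 for t in exps)      # equation (3.18)
    # For an integer n >= 1, ceil(log2(n)) equals (n - 1).bit_length().
    depth = (omega - 2).bit_length()          # ceil(log2(omega - 1))
    return xor_gates, depth


def count_xors_directly(m, exps):
    """Cross-check: each s_i with k terms needs k - 1 two-input XOR gates."""
    total = 0
    for i in range(1, m):
        terms = 1 + sum(1 for t in exps if i + (m - t) <= m - 1)
        total += terms - 1
    return total


# Pentanomial x^163 + x^7 + x^6 + x^3 + 1.
print(s_circuit_cost([7, 6, 3]), count_xors_directly(163, [7, 6, 3]))
```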
The following lemma proves the correctness of vector s contents in Algorithm 9.
Lemma 3.3.1 Let $A$ be an arbitrary element in $\mathbb{F}_{2^m}$ and $\mathbf{s}$ be a vector of length $m - 1$ that is initialized with the entries $\mathbf{s} = [s_{m-1}, \cdots, s_1] = [a_{m-1}, \cdots, a_1]$. Then, the entries of the vector $\mathbf{s}$ at the end of the for loop at Step 1 of Algorithm 9 become $\mathbf{S}(0, :)$.
Proof Since the vector $\mathbf{s}$ is initialized with the row 0 of the matrix $\mathbf{U}$ in (3.5), each iteration of the for loop in Step 1 accumulates $\mathbf{U}(0, :)[\rightarrow \Delta_i]$ into $\mathbf{s}$. Then, the final returned vector (after a total of $\omega - 2$ loop iterations) satisfies $\mathbf{S}(0, :)$ as in (3.10).
As shown in the initialization step, the coordinates of the multiplier $B$ are stored in the vector $\mathbf{z}$. Also, the coordinates from 1 to $m - 1$ of the multiplicand $A$ are stored in the vector $\mathbf{y}$, which will be used to obtain the rows $j$, for $1 \leq j \leq m - 1$, of the matrix $\mathbf{L}$ as stated in (3.13). In Step 3, the operation $\mathbf{x} \leftarrow a_0 \,\Vert\, \mathbf{s}$ represents the concatenation of $a_0$ and $\mathbf{s}$; hence, $\mathbf{M}(0, :)$, which is shown in (3.15), is generated and stored in the vector $\mathbf{x}$. The vector $\mathbf{s}$ is also stored in $\mathbf{w}$, in order to be added later for obtaining the rows $\mathbf{M}(t_i, :)$, $1 \leq i \leq \omega - 2$, as seen in (3.14).
The operation $\mathbf{x} \bullet \mathbf{z}$ in Step 4.1 represents the inner product of the vectors $\mathbf{x}$ and $\mathbf{z}$, i.e., $\mathbf{x} \bullet \mathbf{z} = \sum_{i=0}^{m-1} x_i z_i$. It is noteworthy that at the end of iteration $j$ of the loop started in Step 4, the output $c_j$ is computed and, in the same iteration, the row $j + 1$ of the matrix $\mathbf{M}$, i.e., $\mathbf{M}(j + 1, :)$, is generated and stored in the vector $\mathbf{x}$. Hence, it is ready for use in the next iteration. The following lemma proves that the contents of the vector $\mathbf{x}$ at the end of clock cycle $j$ become the row $\mathbf{M}(j + 1, :)$ as seen in (3.14).
Lemma 3.3.2 Let $A$ be an arbitrary element in $\mathbb{F}_{2^m}$, $\mathbf{y}$ be a vector of length $m - 1$ that is initialized with the entries $\mathbf{y} = [y_{m-2}, \cdots, y_0] = [a_{m-1}, \cdots, a_1]$, $\mathbf{w}$ be a vector of length $m - 1$ that is initialized with $\mathbf{S}(0, :)$, and $\mathbf{x}$ be a vector of length $m$ that is initialized with row 0 of the matrix $\mathbf{M}$. Then, the vector $\mathbf{x}$ in the for loop at Step 4 of Algorithm 9 holds the correct value of the next row of the matrix $\mathbf{M}$ in (3.4).
Proof The for loop in Step 4 of Algorithm 9 has two conditional cases. For $j \neq t_i - 1$, the for loop computes
$$\mathbf{x} \leftarrow [y_0, x_{m-1}, \cdots, x_1], \quad \mathbf{y} \leftarrow [y_0, y_{m-2}, \cdots, y_1],$$
and for $j = t_i - 1$, the for loop computes
$$\mathbf{x} \leftarrow [y_0, x_{m-1} + w_{m-2}, \cdots, x_1 + w_0], \quad \mathbf{y} \leftarrow [y_0, y_{m-2}, \cdots, y_1].$$
By induction, each iteration of the for loop in Step 4 of Algorithm 9 returns the next row of the matrix $\mathbf{M}$ as in (3.14).
The inner product generated in Step 4.1 and the bit additions of Step 4.3.1 can be performed
independently and in parallel. Therefore, the computation time required for obtaining each bit
of the output result (c j), is proportional to the longest delay that is the delay of the inner product
generated in Step 4.1.
3.4 Multiplier Architectures
In this section, an approach to the architecture design of the SOBL multiplier for both the ω-nomials and the irreducible trinomials is presented in detail. Both architectures are capable of generating each output bit in a single computational clock cycle. The space and time complexities of both architectures are also provided in detail.
We remark that the bit-level structure multiplier is considered as an iterative architecture.
Thus, for any bit-level (or digit-level) multiplier, a control unit that generates a counter is
required to generate the load, start, complete, and other control signals. More details on the
controller and its complexity will be presented in Section 6. We further remark that the loop
iterations of the Algorithm 9 are mapped into hardware clock cycles that are denoted by clk.
3.4.1 Multiplier Architecture for ω-nomials
The architecture for the ω-nomials (irreducible polynomials with ω non-zero terms) is depicted
in Figure 3.3(a). It is composed of a circuit S , an IPm block, and four registers 〈W〉, 〈X〉, 〈Y〉,
and 〈Z〉 that are of length m − 1, m, t1, and m-bits, respectively. The circuit S maps the
implementation of the loop started in Step 1 of Algorithm 9. The detailed implementation of
the circuit S is shown in Figure 3.3(b). In this figure, an oval-shape enclosure indicates a binary
tree of XOR gates. It is noted that the output signal vector $\mathbf{s}$, which is generated by the circuit $S$, is equal to row 0 of the matrix $\mathbf{S}$, i.e., $\mathbf{S}(0, :)$. The register $\langle W \rangle$ is then initialized with the contents of the signal vector $\mathbf{s}$, i.e., $\langle w_{m-2}, \cdots, w_0 \rangle = [s_{m-1}, \cdots, s_1]$; hence, the operation $\mathbf{w} \leftarrow \mathbf{s}$ in Step 3 of Algorithm 9 is realized in this architecture. The output bits obtained from the circuit $S$ are concatenated with the element $a_0$, and the result is loaded into the register $\langle X \rangle$, i.e., $\langle x_{m-1}, \cdots, x_0 \rangle = [a_0, s_{m-1}, \cdots, s_1]$. This indicates that the operation $\mathbf{x} \leftarrow a_0 \,\Vert\, \mathbf{s}$ in Step 3 of Algorithm 9 is also realized in our architecture. It is worth noting that before the loop started in Step 4, i.e., when clk = 0, the initial contents of the register $\langle X \rangle$ are equal to row 0 of the matrix $\mathbf{M}$, i.e., $\mathbf{M}(0, :)$, and the initial contents of the register $\langle W \rangle$ are equal to row 0 of the matrix $\mathbf{S}$, i.e., $\mathbf{S}(0, :)$.
The vector y in Algorithm 9 serves as storage of the coordinates from 1 to m − 1 of the
multiplicand A for obtaining the row j, 1 ≤ j ≤ m − 1, of the matrix M as shown in (3.14).
From Step 3 and (3.10), one can see that the contents of the register $\langle W \rangle$ in the locations from $t_1$ to $m - 2$ are
$$\langle w_{m-2}, \cdots, w_{t_1} \rangle = [s_{m-1}, \cdots, s_{t_1+1}] = [a_{m-1}, \cdots, a_{t_1+1}]. \tag{3.19}$$
Then, the size of the vector y in Algorithm 9 can be reduced to t1 and initialized with the
coordinates from 1 to t1 of the multiplicand A, whereas, the coordinates from t1 + 1 to m − 1
of the multiplicand A would be obtained from the register 〈W〉 as shown in (3.19). As a result
of using our approach, a saving of $m - t_1 - 1$ register bits is achieved. Accordingly, the row $j$, $0 < j \leq m - 1$ and $j \neq t_i$, of the matrix $\mathbf{M}$ in (3.14) is obtained as
$$
\mathbf{M}(j, :) =
\begin{cases}
\mathbf{M}(j - 1, :)[y_0, \rightarrow 1], & \text{for } 0 < j < t_1, \\
\mathbf{M}(j - 1, :)[w_{t_1}, \rightarrow 1], & \text{for } t_1 < j \leq m - 1,
\end{cases}
$$
where $y_0$ and $w_{t_1}$ are the coordinates of the $\langle Y \rangle$ and $\langle W \rangle$ registers, respectively.
Figure 3.3: The Proposed SOBL Mastrovito Multiplier Architecture for The ω-nomial Irreducible Polynomials. (a) The High-Level Architecture. (b) The Implementation of The Circuit S.
Table 3.2: The Operations of The Control Signals Ctrl1 and Ctrl2 in Figure 3.3(a).
| clk | Ctrl1 | Ctrl2 | $\langle W \rangle$ | $\langle X \rangle$ | $\langle Y \rangle$ |
|---|---|---|---|---|---|
| $0 \leq \text{clk} < t_1 - 1$ and $\text{clk} \neq t_i - 1$ † | 0 | 0 | clock is disabled | $\langle X \rangle = \langle y_0, x_{m-1}, \cdots, x_1 \rangle$ | $\langle Y \rangle = \langle y_0, y_{t_1-1}, \cdots, y_1 \rangle$ |
| $\text{clk} = t_i - 1$ † | 0 | 1 | clock is disabled | $\langle X \rangle = \langle y_0, x_{m-1} + w_{m-2}, \cdots, x_1 + w_0 \rangle$ | $\langle Y \rangle = \langle y_0, y_{t_1-1}, \cdots, y_1 \rangle$ |
| $t_1 - 1 < \text{clk} \leq m - 1$ | 1 | 0 | $\langle W \rangle = \langle w_0, w_{m-2}, \cdots, w_1 \rangle$ | $\langle X \rangle = \langle w_{t_1}, x_{m-1}, \cdots, x_1 \rangle$ | clock is disabled |
† For $1 \leq i \leq \omega - 2$.
In Table 3.2, we show how the control signals Ctrl1 and Ctrl2 in Figure 3.3(a) coordinate the contents of the $\langle W \rangle$, $\langle X \rangle$, and $\langle Y \rangle$ registers. As shown in this table, if $\text{clk} \leq t_1 - 1$, the contents of the register $\langle W \rangle$ remain unchanged, i.e., $\langle W \rangle = \mathbf{S}(0, :)$, whereas the contents of the register $\langle Y \rangle$ are right cyclic shifted, which maps the implementation of Step 4.5 of Algorithm 9. The contents of the register $\langle X \rangle$ during clk, for $0 \leq \text{clk} \leq t_1 - 1$, are updated as follows. If $\text{clk} \neq t_i - 1$, then the register $\langle X \rangle$ is updated by the right shift (RS) of its coordinates with $\langle y_0 \rangle$ fed at the MSB. This maps the implementation of Step 4.2.1 of Algorithm 9. If $\text{clk} = t_i - 1$ ($t_i$ is obtained in (3.3)), then the register $\langle X \rangle$ is updated by XORing the coordinates of the register $\langle W \rangle$ with the RS of its coordinates, with $\langle y_0 \rangle$ being fed into the MSB of $\langle X \rangle$. This maps the implementation of Step 4.3.1 of Algorithm 9. If $\text{clk} > t_1 - 1$, the condition $\text{clk} = t_i - 1$ can no longer occur; hence, the contents of the register $\langle W \rangle$, i.e., $\mathbf{S}(0, :)$, are no longer needed. This gives us the freedom of using and changing the contents of the register $\langle W \rangle$. Hence, the contents of the register $\langle W \rangle$ are right cyclic shifted, i.e., $\langle w_{m-2}, \cdots, w_0 \rangle = \langle w_0, w_{m-2}, \cdots, w_1 \rangle$. The register $\langle X \rangle$ is then updated by the RS of its coordinates with $\langle w_{t_1} \rangle$ being fed into the MSB of $\langle X \rangle$.
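The register behaviour described above can be modelled clock by clock in software. The Python sketch below is an illustrative behavioural model, not the VHDL used in this work: the function name `simulate_sobl_registers` is hypothetical, list index $i$ stands for register cell $i$, and the cells of $\langle Z \rangle$ are assumed to be loaded so that cell $x_{m-1}$ (which initially holds $a_0$) meets $b_0$ in the inner product.

```python
def simulate_sobl_registers(a, b, exps):
    """Clock-by-clock sketch of Figure 3.3(a) using the rules of Table 3.2.

    a[i], b[i] hold a_i, b_i; exps = [t_1, ..., t_{w-2}] with t_1 first.
    """
    m, t1 = len(a), exps[0]
    # s_i = a_i plus every a_{i + Delta} that exists, as in (3.10).
    s = {i: a[i] for i in range(1, m)}
    for t in exps:
        for i in range(1, t):
            s[i] ^= a[i + m - t]
    w = [s[i + 1] for i in range(m - 1)]           # w_i = s_{i+1}
    x = [s[i + 1] for i in range(m - 1)] + [a[0]]  # x_{m-1} = a_0, x_i = s_{i+1}
    y = [a[i + 1] for i in range(t1)]              # y_i = a_{i+1}, only t_1 cells
    z = [b[m - 1 - i] for i in range(m)]           # assumed loading order of <Z>
    out = []
    for clk in range(m):
        bit = 0
        for xi, zi in zip(x, z):                   # IP_m block
            bit ^= xi & zi
        out.append(bit)
        if clk <= t1 - 1:                          # Ctrl1 = 0: <W> holds, <Y> rotates
            add_w = (clk + 1) in exps              # Ctrl2 = 1 when clk = t_i - 1
            x = [x[i + 1] ^ (w[i] if add_w else 0) for i in range(m - 1)] + [y[0]]
            y = y[1:] + y[:1]
        else:                                      # Ctrl1 = 1: <Y> holds, <W> rotates
            x = x[1:] + [w[t1]]                    # w_{t1} is fed into the MSB of <X>
            w = w[1:] + w[:1]
    return out


# F_{2^3} with P(x) = x^3 + x + 1: (1 + x)(x + x^2) reduces to 1.
print(simulate_sobl_registers([1, 1, 0], [0, 1, 1], [1]))
```

Only $t_1$ cells are kept for $\langle Y \rangle$ in this model; the remaining coordinates of $A$ are read from $\langle W \rangle$ once $\text{clk} > t_1 - 1$, which is the register saving described in the text.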
As also shown in the initialization step of Algorithm 9, the register $\langle Z \rangle$ is initialized with the coordinates of the multiplier $B$ and its contents remain unchanged during each clock cycle until the end of the multiplication process, i.e., $\langle Z \rangle = \langle b_{m-1}, \cdots, b_1, b_0 \rangle$, for $0 \leq \text{clk} \leq m - 1$. Also, the coordinates from 1 to $t_1$ of the multiplicand $A$ are initially fed into the register $\langle Y \rangle$, i.e., $\langle y_{t_1-1}, \cdots, y_1, y_0 \rangle = [a_{t_1}, \cdots, a_2, a_1]$.
The module IPm that is shown in Figure 3.3(a), maps the implementation of the operation
c j = x • z in Step 4.1. This module, computes the output bit result c j = M( j, :) · bT . It does
so by performing the inner product (IP) of its two input vectors; it first generates the product
in parallel using m AND gates and then, by adding (modulo 2) the generated partial products
using a binary XOR tree. The architecture of the IPm block implements
$$c_j = \sum_{i=0}^{m-1} x_i z_i = [x_0, \cdots, x_{m-1}] \times [z_0, \cdots, z_{m-1}]^T,$$
which requires $m - 1$ XOR gates to accumulate the partial products. The depth of the binary XOR tree is $\lceil \log_2 m \rceil$ and, hence, the total delay of the IPm module ($IPm_{time}$) is
$$IPm_{time} = T_A + \lceil \log_2 m \rceil T_X, \tag{3.20}$$
where $T_A$ denotes the delay of the 2-input AND gate.
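The structure of the IPm block, a layer of AND gates followed by a binary XOR tree, can be mirrored in a short Python sketch that also counts the XOR gates and tree levels. The function name `ip_m` and its return convention are assumptions made for illustration.

```python
def ip_m(x, z):
    """Inner product over F_2 as a layer of ANDs followed by a binary XOR tree.

    Returns (bit, xor_gates, xor_levels).
    """
    level = [xi & zi for xi, zi in zip(x, z)]   # m AND gates in parallel
    xor_gates = xor_levels = 0
    while len(level) > 1:
        pairs = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        xor_gates += len(pairs)
        if len(level) % 2:                      # an odd element passes through
            pairs.append(level[-1])
        level = pairs
        xor_levels += 1
    return level[0], xor_gates, xor_levels


# For m = 163 the tree uses m - 1 = 162 XOR gates in ceil(log2(163)) = 8 levels.
print(ip_m([1] * 163, [1] * 163))
```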
In what follows, the space complexity of the proposed SOBL multiplier for the ω-nomial
irreducible polynomials is obtained.
Proposition 3.4.1 For the finite field $\mathbb{F}_{2^m}$ generated by a ω-nomial irreducible polynomial that is shown in (3.3), the proposed SOBL PB multiplier architecture (Figure 3.3(a)) requires $3m + t_1 - 1$ 1-bit registers, $2m + 2$ 2-input AND gates, and $2m - 2 + \sum_{i=1}^{\omega-2} (t_i - 1)$ 2-input XOR gates.
Proof The number of 1-bit registers includes the ones in the $\langle X \rangle$ register, i.e., $m$, the register $\langle Z \rangle$, i.e., $m$, the register $\langle W \rangle$, i.e., $m - 1$, and the register $\langle Y \rangle$, i.e., $t_1$. Thus, the multiplier requires $3m + t_1 - 1$ 1-bit registers. The IPm block requires $m$ AND gates; a single AND gate for clock enabling the $\langle W \rangle$ register and $m + 1$ AND gates for the connection between the $\langle W \rangle$ and $\langle X \rangle$ registers are also required. Therefore, the multiplier requires $2m + 2$ 2-input AND gates. The number of XOR gates is obtained by adding those for the IPm, the updating signal for the register $\langle X \rangle$, and the $S$ circuit, which are $m - 1$, $m - 1$, and (3.18), respectively. As a result, the number of XOR gates required in the SOBL multiplier architecture generated by a ω-nomial irreducible polynomial is $2m - 2 + \sum_{i=1}^{\omega-2} (t_i - 1)$, and the proof is complete.
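The counts stated in Proposition 3.4.1 are easy to evaluate for a given field. The Python sketch below (with the hypothetical helper name `sobl_gate_counts`) simply evaluates the three closed-form expressions; the numbers it prints are formula values only, not synthesis results.

```python
def sobl_gate_counts(m, exps):
    """Evaluate the counts of Proposition 3.4.1.

    exps = [t_1, ..., t_{w-2}] with t_1 first, constant term excluded.
    Returns (1-bit registers, 2-input AND gates, 2-input XOR gates).
    """
    t1 = exps[0]
    registers = 3 * m + t1 - 1                       # <X>, <Z>, <W>, <Y>
    and_gates = 2 * m + 2
    xor_gates = 2 * m - 2 + sum(t - 1 for t in exps)
    return registers, and_gates, xor_gates


# F_{2^163} generated by x^163 + x^7 + x^6 + x^3 + 1.
print(sobl_gate_counts(163, [7, 6, 3]))
```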
3.4.2 Multiplier Architecture for Trinomials
The proposed SOBL multiplier architecture that is illustrated in Figure 3.3(a) can be further optimized for the irreducible trinomial, which is a special case of (3.3), i.e., $P(x) \triangleq x^m + x^{t_1} + 1$. For the irreducible trinomial, the sets $T$ and $N$ are $\{0, t_1\}$ and $\{0, \Delta_1 = m - t_1\}$, respectively. Recall that the vector $\mathbf{w}$ in Algorithm 9 serves as storage of $\mathbf{S}(0, :)$, given in (3.10), for obtaining the row $j = t_1$ of the matrix $\mathbf{M}$. From Step 3 in Algorithm 9, one can see that $\mathbf{S}(0, :)$ is also stored in the vector $\mathbf{x}$ at the initial stage. Then, if the coordinates from 1 to $t_1$ of the initial contents of $\mathbf{x}$ had been stored, we could have computed the row $t_1$ of the matrix $\mathbf{M}$ in (3.14) without utilizing the vector $\mathbf{w}$. This optimization can be achieved as shown in Figure 3.4(a).
62 Chapter 3. Architectures for SOBL Multiplication Using Polynomial Basis
Figure 3.4: The Proposed SOBL Mastrovito Multiplier Architecture for The Irreducible Trinomials. (a) The High-Level Architecture. (b) The Implementation of The Circuit S.
The architecture in this figure is composed of a circuit S, an IPm block, and three registers 〈X〉, 〈Y〉, and 〈Z〉. The register 〈Y〉 in this figure is reduced to ∆1 − 1 bits. Initially, the coordinates from 1 to t1 of the multiplicand A are fed into 〈Y〉 in the locations from 0 to t1 − 1, i.e., $\langle y_{t_1-1}, \cdots, y_0 \rangle = [a_{t_1}, \cdots, a_1]$. The contents of 〈Y〉 are padded with m − 2t1 − 1 zeros (cleared) at its left-most m − 2t1 − 1 bits, i.e., $\langle y_{\Delta_1-2}, \cdots, y_{t_1} \rangle = [\underbrace{0, 0, \cdots, 0}_{m-2t_1-1}]$.
The register 〈Z〉 and the module IPm remain unchanged from the proposed ω-nomial SOBL architecture presented in Subsection 3.4.1 (Figure 3.3(a)). The S circuit is implemented as shown in Figure 3.4(b). As seen in this figure, it is composed of t1 − 1 parallel XORs and implements Step 1 of Algorithm 9. The output bits obtained from the circuit S are concatenated with the element a0, and the result is loaded into the register 〈X〉, i.e., $\langle x_{m-1}, \cdots, x_0 \rangle = [a_0, s_{m-1}, \cdots, s_1]$. During the clock periods 0 ≤ clk ≤ t1 − 2 and t1 ≤ clk ≤ m − 1, the contents of both registers 〈X〉 and 〈Y〉 are right shifted. The right-most bit (LSB) of the register 〈X〉 is fed into the MSB of the register 〈Y〉, i.e., $\langle y_{\Delta_1-2} \rangle \leftarrow \langle x_0 \rangle$, and similarly, the LSB of the register 〈Y〉 is fed into the MSB of the register 〈X〉, i.e., $\langle x_{m-1} \rangle \leftarrow \langle y_0 \rangle$.
At the clock cycle t1 − 1, both registers 〈X〉 and 〈Y〉 are updated with the proper contents as described in the following:
\[
\begin{aligned}
\langle x_{t_1-2}, \cdots, x_0 \rangle &\leftarrow \langle x_{t_1-1} + y_{\Delta_1-2}, \cdots, x_1 + y_{\Delta_1-t_1} \rangle, \\
\langle x_{m-1}, \cdots, x_{t_1-1} \rangle &\leftarrow \langle y_0, y_{\Delta_2}, \cdots, y_{\Delta_1-t_1}, x_{\Delta_1-t_1} + x_{\Delta_1}, \cdots, x_0 + x_{t_1} \rangle, \\
\langle y_{\Delta_1-1}, \cdots, y_0 \rangle &\leftarrow \langle x_{\Delta_1-1}, \cdots, x_1 \rangle.
\end{aligned}
\]
In what follows, the space complexity of the proposed SOBL multipliers for the irreducible
trinomial is obtained.
Proposition 3.4.2 For the finite field $\mathbb{F}_{2^m}$ generated by the irreducible trinomial $x^m + x^{t_1} + 1$, the proposed SOBL PB multiplier architecture (Figure 3.4(a)) requires 3m − t1 − 1 1-bit registers, 3m − 3 2-input AND gates, and 3m − 4 2-input XOR gates.
Proof The number of 1-bit registers includes the ones in the 〈X〉 register, i.e., m, the register 〈Z〉, i.e., m, and the register 〈Y〉, i.e., ∆1 − 1 = m − t1 − 1. Thus, the multiplier requires 3m − t1 − 1 1-bit registers. The IPm block requires m AND gates. The connection between the 〈X〉 and 〈Y〉 registers requires a further 2m − 3 AND gates. Therefore, the multiplier requires 3m − 3 2-input AND gates. The number of XOR gates is obtained by adding those for the IPm block, the updating signals for the registers 〈X〉 and 〈Y〉, and the S circuit, which are m − 1, m − 1, ∆1, and t1 − 1, respectively. As a result, the number of XOR gates required in the SOBL multiplier architecture generated by the irreducible trinomial is 3m − 3 and the proof is complete.
The critical path delay, which is the longest path from the registers to the output ci, is one
of the main factors that determines the time complexity. It determines the maximum operating
frequency. By properly implementing the proposed SOBL architectures, i.e., Figure 3.3(a) and
Figure 3.4(a), one can see that the critical path delay of both architectures is equal to the total
delay of the IPm module, which is shown in (3.20).
3.5 Novel Very Low Area Multiplication Architecture
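From (3.14) and (3.16), one can calculate $c_j$, for $0 \le j < m$, as follows
\[
\begin{aligned}
c_0 &= \mathbf{L}(0,:) \cdot \mathbf{b}^T + \mathbf{S}(0,:) \cdot \mathbf{b}^T \\
    &= a_0 \cdot b_{m-1} + \mathbf{S}(0,:) \cdot \mathbf{b}^T, \\
c_j &= \begin{cases}
\mathbf{M}(j-1,:)[a_j, \rightarrow 1] \cdot \mathbf{b}^T, & 0 < j \ \text{and} \ j \neq t_i, \\
\mathbf{M}(j-1,:)[a_j, \rightarrow 1] \cdot \mathbf{b}^T + \mathbf{S}(0,:) \cdot \mathbf{b}^T, & j = t_i,
\end{cases}
\end{aligned}
\tag{3.21}
\]
for $1 \le i \le \omega - 2$.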
Algorithm 10 Proposed Serial-Out Bit-Level Multiplication for ω-nomials $x^m + x^{t_1} + \cdots + x^{t_{\omega-2}} + 1$
Input: The parameters of the ω-nomial irreducible polynomial: $m, t_1, \cdots, t_{\omega-2}$; $A = (a_{m-1}, \cdots, a_0)$, $B = (b_{m-1}, \cdots, b_0) \in \mathbb{F}_{2^m}$.
Output: $c_j$, where $C = (c_{m-1}, \cdots, c_0) = AB \bmod P(\alpha)$.
/* Set signal vectors e, s, y, and z of length t1, m − 1, t1, and m bits, respectively */
Initialize: $e = [e_{t_1}, \cdots, e_1] = (0, \cdots, 0)$; $y = [y_{t_1-1}, \cdots, y_0] = (a_{t_1}, \cdots, a_1)$;
            $z = [z_{m-1}, \cdots, z_0] = (b_{m-1}, \cdots, b_0)$; $s = [s_{m-1}, \cdots, s_1] = (a_{m-1}, \cdots, a_1)$.
/* Compute s = S(0, :) */
Step 1: For i = 1 to ω − 2 do
    Step 1.1: $\Delta_i = m - t_{\omega-1-i}$;
    Step 1.2: $s = [s_{m-1}, \cdots, s_1] + [\overbrace{0, \cdots, 0}^{\Delta_i - 1}, s_{m-1}, \cdots, s_{\Delta_i+1}]$;
Step 2: End For
/* Set a signal vector x of length m bits and initialize it with M(0, :) */
Step 3: $x \leftarrow a_0 \,\|\, s$;
Step 4: For j = 0 to m − 1 do
    /* Compute the inner product */
    Step 4.1: $s'_j \leftarrow [x_{m-2}, x_{m-3}, \cdots, x_0] \bullet [b_{m-2}, b_{m-3}, \cdots, b_0]$;
    Step 4.2: $v_j \leftarrow [x_{m-1}] \cdot [b_{m-1}]$;
    Step 4.3: Output $c_j = (s'_j + v_j + e_{t_1} + e_{t_2} + \cdots + e_{t_{\omega-2}})$;
    /* Update e */
    Step 4.4: If j = 0 Then
        Step 4.4.1: $e_1 \leftarrow s'_j$;
    Step 4.5: Else /* Left shift e, retaining e1 */
        Step 4.5.1: $e \leftarrow [e_{t_1-1}, e_{t_1-2}, \cdots, e_1, e_1]$;
    Step 4.6: End If
    /* Update x with M(j+1, :) */
    Step 4.7: $x \leftarrow [y_0, x_{m-1}, \cdots, x_1]$;
    Step 4.8: $y \leftarrow [x_{t_1}, y_{t_1-1}, \cdots, y_1]$;
Step 5: End For
From (3.10), (3.14), (3.15), and (3.21), we propose Algorithm 10, which outlines the process of serially generating the coordinates of C, starting from c0 and ending with cm−1, for the multiplication of the two field elements A and B.
We proceed with Algorithm 10 step by step.
In Steps 1-2 of Algorithm 10, since the signal vector s is initialized with the first row of the matrix U in (3.5), at the end of the for loop in Step 2 it holds row 0 of the matrix S as in (3.10).
In Step 3 of Algorithm 10, the operation $x \leftarrow a_0 \,\|\, s$ gives $[a_0, s_{m-1}, s_{m-2}, \cdots, s_1]$. It can be seen that this expression is equal to the first row of the matrix M as shown in (3.15).
In Step 4 of Algorithm 10, the operation shown in Step 4.1 represents the inner product in (3.21). Note that when j = 0, the value of $s'_0$ becomes $s'_0 \leftarrow \mathbf{S}(0,:) \cdot \mathbf{b}^T$ as shown in (3.21). Since the vector e is initially cleared ($e_{t_1} = e_{t_2} = \cdots = e_{t_{\omega-2}} = 0$), the first output bit, i.e., c0 as shown in (3.21), is obtained in Step 4.3. The value $s'_0$, which is equal to $\mathbf{S}(0,:) \cdot \mathbf{b}^T$, is then fed to the signal vector e at the LSB to be used later to obtain $c_{t_i}$, for 1 ≤ i ≤ ω − 2, as shown in (3.21).
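When checking a serial-out design, it can help to have a software reference that also produces the product one coordinate at a time. The sketch below is such a reference under stated assumptions: it builds the product matrix explicitly from the columns A·x^k mod P(x) and forms each output bit as an inner product of one row with B, rather than using the shift-register row update of Algorithm 10. The function names, the integer bit-vector encoding (bit k holds the coefficient of x^k), and the `exps` parameter are hypothetical choices made for this example.

```python
def mulx_mod(a, m, poly_low):
    """Multiply a(x) by x modulo P(x) = x^m + poly_low(x); ints are bit vectors."""
    a <<= 1
    if (a >> m) & 1:
        a = (a & ((1 << m) - 1)) ^ poly_low
    return a

def low_terms(exps):
    """Pack the exponents below m, e.g. (7, 6, 3, 0), into one integer."""
    poly_low = 0
    for t in exps:
        poly_low |= 1 << t
    return poly_low

def serial_out_product(a, b, m, exps):
    """Yield c_0, c_1, ..., c_{m-1} of C = A*B mod P(x), one bit per step."""
    poly_low = low_terms(exps)
    # Column k of the product matrix holds the coordinates of A * x^k mod P(x).
    cols, col = [], a
    for _ in range(m):
        cols.append(col)
        col = mulx_mod(col, m, poly_low)
    for j in range(m):
        # Row j of the matrix, packed so that bit k lines up with b_k.
        row = 0
        for k in range(m):
            row |= ((cols[k] >> j) & 1) << k
        # Inner product over GF(2): parity of the bitwise AND.
        yield bin(row & b).count("1") & 1

def shift_and_add(a, b, m, exps):
    """Plain shift-and-add multiplication, used as an independent cross-check."""
    poly_low = low_terms(exps)
    c = 0
    for k in range(m):
        if (b >> k) & 1:
            c ^= a
        a = mulx_mod(a, m, poly_low)
    return c
```

Collecting the yielded bits into an integer and comparing against `shift_and_add` gives a simple self-check; the same bit stream could then be compared, cycle by cycle, with the output of a simulated serial-out multiplier.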
3.5.1 Proposed Compact Multiplier Architecture
In this section, an approach to the architecture design of the SOBL multiplier is presented in detail. The architecture is capable of generating one output bit per computational clock cycle. The space and time complexities of the architecture are also provided in detail. We show that the proposed SOBL multiplier architecture provides about 24-26% reduction in area complexity cost and about 21-22% reduction in power consumption for $\mathbb{F}_{2^{163}}$ compared to the current state-of-the-art bit-level (SOBL or POBL) multiplier architectures. Our key idea is to optimize the sharing of hardware resources, i.e., registers and gates, so that they are reused without affecting the multiplier's performance. We remark that the loop iterations of Algorithm 10 are mapped into hardware clock cycles (denoted by clk). For simplicity, the architecture is designed specifically for the pentanomial irreducible polynomial; however, it can be applied to any ω-nomial irreducible polynomial.
The architecture for the pentanomial irreducible polynomial is depicted in Figure 3.5(a). It is composed of a circuit S, an IPm−1 block, a BTX4 block, and four registers 〈E〉, 〈X〉, 〈Y〉, and 〈Z〉 of length t1, m, t1, and m bits, respectively. The circuit S implements the loop started in Step 1 of Algorithm 10. The detailed implementation of the circuit S is shown in Figure 3.5(b). In this figure, an oval-shaped enclosure indicates a binary tree of XOR gates. Note that the output signal vector s generated by the circuit S is equal to row 0 of the matrix S, i.e., S(0, :). The output bits obtained from the circuit S are concatenated with the element a0, and the result is loaded into the right shift register 〈X〉, i.e., $\langle x_{m-1}, \cdots, x_0 \rangle^{(0)} = [a_0, s_{m-1}, \cdots, s_1]$. This realizes the operation $x \leftarrow a_0 \,\|\, s$ of Step 3 of Algorithm 10 in our architecture. The right shift register 〈Y〉 is initialized with the coordinates of A as $\langle y_{t_1-1}, \cdots, y_0 \rangle^{(0)} = [a_{t_1}, a_{t_1-1}, \cdots, a_1]$. Also, the left shift register 〈E〉 is cleared initially, i.e., $\langle e_{t_1}, \cdots, e_1 \rangle^{(0)} = [0, 0, \cdots, 0]$.
Figure 3.5: The proposed compact SOBL multiplier architecture for the pentanomial irreducible polynomial. (a) The high-level architecture. (b) The implementation of the circuit S. (c) An example for BTX4 module when $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$.
As also shown in the initialization step of Algorithm 10, the register 〈Z〉 is initialized with the coordinates of the multiplier B and its contents remain unchanged during each clock cycle until the end of the multiplication process, i.e., $\langle Z \rangle^{(clk)} = \langle b_{m-1}, \cdots, b_1, b_0 \rangle$, for 0 ≤ clk ≤ m − 1.
The module IPm−1 shown in Figure 3.5(a) implements the operation shown in Step 4.1 of Algorithm 10. This module computes the output bit $s'_j$ by performing the inner product (IP) of its two input vectors: it first generates the products in parallel using m − 1 AND gates and then adds (modulo 2) the generated partial products using a binary XOR tree. The architecture of the IPm−1 block implements
\[ s'_j = \sum_{i=0}^{m-2} x_i z_i = [x_0, \cdots, x_{m-2}] \times [z_0, \cdots, z_{m-2}]^T, \]
which requires m − 2 XOR gates to accumulate the partial products. The depth of the binary XOR tree is $\lceil \log_2 (m-1) \rceil$ and, hence, the total delay of the IPm−1 module ($\mathrm{IP}_{m-1}^{\mathrm{time}}$) is
\[ \mathrm{IP}_{m-1}^{\mathrm{time}} = T_A + \lceil \log_2 (m-1) \rceil \, T_X, \tag{3.22} \]
where TA denotes the delay of the 2-input AND gate.
Let $c_j$ denote the output of the BTX4 block in Figure 3.5(a) after the j-th clock cycle. From Algorithm 10 and from (3.21), one can obtain the output value $c_j$ of the BTX4 block in Figure 3.5(a) as
\[
c_j = \begin{cases}
s'_j + v_j + 0 + 0 + 0, & 0 \le j < t_3, \\
s'_j + v_j + s'_0 + 0 + 0, & t_3 \le j < t_2, \\
s'_j + v_j + s'_0 + s'_0 + 0, & t_2 \le j < t_1, \\
s'_j + v_j + s'_0 + s'_0 + s'_0, & t_1 \le j \le m - 1.
\end{cases}
\]
Figure 3.5(c) shows how the BTX4 is structured and connected to the register 〈E〉 when $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$ is used. Since $s'_0$ needs to be fed into the left shift register 〈E〉 only at clock 0, a CTRL signal is ANDed with the clocking signal of the LSB of 〈E〉, i.e., e1. The CTRL signal is set to 1 at clock clk = 0 and is then set to 0 for the duration 1 ≤ clk ≤ m − 1.
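The behaviour of the register 〈E〉 in Steps 4.3 to 4.6 of Algorithm 10 can be traced with a few lines of Python. This is a sketch under stated assumptions: the function name `active_taps` is hypothetical, and the value $s'_0$ is represented by a marker bit 1 so that the code only counts how many of the taps $e_{t_1}, e_{t_2}, e_{t_3}$ carry $s'_0$ at each clock cycle.

```python
def active_taps(m, taps):
    """Count, for each cycle j, how many taps of register E hold s'_0.

    `taps` is (t1, t2, ...) with t1 the largest; e[k] models e_k and e[0] is unused.
    """
    t1 = max(taps)
    e = [0] * (t1 + 1)
    counts = []
    for j in range(m):
        # Step 4.3 reads the taps before the register is updated.
        counts.append(sum(e[t] for t in taps))
        if j == 0:
            e[1] = 1                    # Step 4.4.1: e_1 <- s'_0 (marker bit)
        else:
            e = [0] + [e[1]] + e[1:t1]  # Step 4.5.1: e <- [e_{t1-1}, ..., e_1, e_1]
    return counts
```

For the taps (7, 6, 3) of $x^{163} + x^7 + x^6 + x^3 + 1$, the count steps from 0 to 1 at j = 3, to 2 at j = 6, and to 3 at j = 7, which is the pattern of $s'_0$ terms in the case expression above.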
In what follows, the space complexity of the proposed SOBL multiplier for the pentanomial
irreducible polynomial is obtained.
Proposition 3.5.1 For the finite field $\mathbb{F}_{2^m}$ generated by a pentanomial irreducible polynomial $P(x) = x^m + x^{t_1} + x^{t_2} + x^{t_3} + 1$, the proposed SOBL PB multiplier architecture (Figure 3.5(a)) requires $2m + 2t_1$ 1-bit registers, $m + 1$ 2-input AND gates, and $m + 2 + \sum_{i=1}^{3} (t_i - 1)$ 2-input XOR gates.
Proof The number of 1-bit registers includes the ones in the 〈X〉 register, i.e., m, the register 〈Z〉, i.e., m, the register 〈Y〉, i.e., t1, and the register 〈E〉, i.e., t1. Thus, the multiplier requires 2m + 2t1 1-bit registers. The IPm−1 block requires m − 1 AND gates; a single AND gate for obtaining $v_j$ and a single AND gate for clock enabling the LSB of the 〈E〉 register are also required. Therefore, the multiplier requires m + 1 2-input AND gates. The number of XOR gates is obtained by adding those for the IPm−1 block, the BTX4 block, and the S circuit, which are m − 2, 4, and (3.18), respectively. As a result, the number of XOR gates required in the SOBL multiplier architecture generated by a pentanomial irreducible polynomial is $m + 2 + \sum_{i=1}^{3} (t_i - 1)$ and the proof is complete.
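The counts in Proposition 3.5.1 are easy to evaluate for a concrete field. The helper below is a minimal sketch that follows the terms of the proof; the function name and dictionary keys are hypothetical, and the S-circuit cost is assumed to equal $\sum_{i=1}^{3}(t_i - 1)$, which is what the stated total implies for the term cited as (3.18).

```python
def compact_sobl_cost(m, t1, t2, t3):
    """Register and gate counts of the Figure 3.5(a) scheme per Proposition 3.5.1."""
    s_circuit_xor = (t1 - 1) + (t2 - 1) + (t3 - 1)   # assumed value of (3.18)
    return {
        "registers": m + m + t1 + t1,                # X, Z, Y, E
        "and_gates": (m - 1) + 1 + 1,                # IP_{m-1}, v_j, clock enable of e_1
        "xor_gates": (m - 2) + 4 + s_circuit_xor,    # IP_{m-1}, BTX4, S circuit
    }
```

For m = 163 and (t1, t2, t3) = (7, 6, 3) this gives 340 registers, 164 AND gates, and 178 XOR gates.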
3.5.2 Extending to a Digit-Level Scheme
Unlike the digit-level multipliers available in the open literature such as [53, 179, 180, 44, 181, 182], which generate all m bits of the multiplication in parallel at the final clock cycle, one can extend the proposed SOBL schemes to develop a digit-level scheme that generates K bits of the multiplication in each clock cycle. The digit-level scheme can be obtained by replicating the IPm block shown in Figure 3.3(a) and Figure 3.4(a) and performing j-fold right shifts of the registers 〈W〉, 〈X〉, and 〈Y〉 in Figure 3.3(a) for the irreducible ω-nomial, and of the registers 〈X〉 and 〈Y〉 in Figure 3.4(a) for the irreducible trinomial. Since the environment considered in this work is low-resource platforms and the proposed architecture in Figure 3.5(a) has the lowest area complexity, we discuss its extension to the digit level in further detail. Also, we assume for simplicity that K ≤ t3 − 1; in particular, the digit size K = 2 is selected.
In the SOBL architecture shown in Figure 3.5(a), one can obtain its digit-level version by replicating both an IPm block and a BTX3 block and connecting their 〈X〉 and 〈E〉 inputs to their shifted forms. In other words, let us formulate the output of Figure 3.5(a) by the function f as c = f(〈Z〉, 〈X〉, 〈E〉). It is shown in the previous section that the output c at the j-th clock cycle, 0 ≤ j ≤ m − 1, is $c_j$, which is generated from the j-fold right shift of the 〈X〉 register and the j-fold left shift of the 〈E〉 register, i.e.,
\[ c^{(j)} = c_j = f(\langle Z \rangle, \langle X \rangle \gg j, \langle E \rangle \ll j), \tag{3.23} \]
where 〈R〉 ≫ j and 〈R〉 ≪ j represent the j-fold right shift and the j-fold left shift of the register 〈R〉, respectively. Therefore, by implementing the function $c^{(j)}$, 0 ≤ j ≤ 1, from the j-fold right shifts of 〈X〉 and from the j-fold left shifts of 〈E〉, one can obtain the serial-out digit-level architecture shown in Figure 3.6. In this figure, the IPm and BTX3 blocks are added to the SOBL scheme shown in Figure 3.5(a). Note that all 1-bit shift registers in Figure 3.5(a) are replaced with 2-bit ones. In what follows, the space complexity of the proposed SODL multiplier, K = 2, for the pentanomial irreducible polynomial is obtained.
Figure 3.6: The architecture of serial-out digit-level (SODL) polynomial basis multiplier over $\mathbb{F}_{2^m}$ for the pentanomial irreducible polynomial, i.e., $x^m + x^{t_1} + x^{t_2} + x^{t_3} + 1$, where digit d = 2.
Proposition 3.5.2 For the finite field $\mathbb{F}_{2^m}$ generated by a pentanomial irreducible polynomial $P(x) = x^m + x^{t_1} + x^{t_2} + x^{t_3} + 1$, the proposed SODL PB multiplier architecture (Figure 3.6) requires $2m + 2t_1$ 1-bit registers, $2m + 1$ 2-input AND gates, and $2m + 4 + \sum_{i=1}^{3} (t_i - 1)$ 2-input XOR gates.
Proof The number of 1-bit registers includes the ones in the 〈X〉 register, i.e., m + 1, the register 〈Z〉, i.e., m, the register 〈Y〉, i.e., t1 − 1, and the register 〈E〉, i.e., t1. Thus, the multiplier requires 2m + 2t1 1-bit registers. The IPm−1 block requires m − 1 AND gates and the IPm block requires m AND gates; a single AND gate for obtaining $v_j$ and a single AND gate for clock enabling the LSD bits of the 〈E〉 register are also required. Therefore, the multiplier requires 2m + 1 2-input AND gates. The number of XOR gates is obtained by adding those for the IPm−1 block, the IPm block, the BTX4 block, the BTX3 block, and the S circuit, which are m − 2, m − 1, 4, 3, and (3.18), respectively. As a result, the number of XOR gates required in the SODL multiplier architecture generated by a pentanomial irreducible polynomial is $2m + 4 + \sum_{i=1}^{3} (t_i - 1)$ and the proof is complete.
3.6 Comparison
Let us define bit-latency and total-latency as the number of clock cycles needed for the first bit
of the output to be available, and for the entire multiplication, respectively. Thus, one can see
that the bit-latency of the proposed SOBL multipliers is one, and that the total-latency requires
m clock cycles.
Table 3.3 shows the comparison of the proposed SOBL multipliers with the other efficient POBL and SOBL multipliers in terms of area and time complexities for the irreducible ω-nomials, pentanomials, and trinomials. It can be seen from the table that the time complexity of the SOBL multiplier schemes is higher than that of the POBL multiplier schemes. However, in applications in resource-constrained environments such as RFID tags that run at 100 kHz, the longer critical path delay of the SOBL schemes does not affect the speed performance of the devices. In addition, in many applications a SOBL multiplier would be desirable because of its ability to sequentially generate an output bit of the final multiplication result in each clock cycle with a latency of one cycle. Table 3.3 also shows that in terms of area complexities, the proposed SOBL multiplier scheme shown in Figure 3.5(a) provides about 30-32% reduction in total register cost compared to any bit-level scheme for $\mathbb{F}_{2^{163}}$. The table further shows that in terms of delay complexities, the two proposed SOBL multiplier schemes, i.e., Figure 3.3(a) and Figure 3.4(a), outperform the previously published SOBL ones. As an example, for the binary extension fields $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$ that are recommended by NIST [16] and SECG [22], the critical path delays of the SOBL multiplier proposed in [30] over those two finite fields are TA + 11TX and TA + 10TX, respectively, whereas in the two proposed SOBL multiplier schemes, the critical path delays over both finite fields are TA + 8TX.

Table 3.3: Comparison Table for The Proposed Multiplier Schemes (Figures 3.3(a), 3.4(a), and 3.5(a)) With The Related Bit-Level Multiplier Schemes in Terms of Time and Space Complexities for The ω-nomial, The Pentanomial, and The Irreducible Trinomial.

| Type of Multiplier Scheme | Bit-Latency [cycle] | Total-Latency [cycle] | Critical Path Delay † | Total AND Gates | Total XOR Gates | Total 1-bit Reg. | Output Structure |
|---|---|---|---|---|---|---|---|
| $P(x) = x^m + \sum_{i=1}^{\omega-1} x^{t_i}$, $m/2 > t_1 > t_2 > \cdots > t_{\omega-2} > t_{\omega-1} = 0$ | | | | | | | |
| LSB-first [31] | m | m | TA + TX | m | m + ω − 2 | 3m | Parallel |
| MSB-first [31] | m | m | TA + TX | m | m + ω − 2 | 3m | Parallel |
| SOBL [177] †† | m | 2m | TA + ⌈log2(m − 1)⌉TX | 3m − 1 | 3m − 2 | 4m + 1 | Serial |
| SOBL [30] ††† | 1 | m | TA + max(T1, T2) | 2m − 1 | 2m + ω + γ − 4 | 3m + t1 − 1 | Serial |
| Proposed SOBL, Figure 3.3(a) | 1 | m | TA + ⌈log2 m⌉TX | 2m + 2 | 2m + γ − 2 | 3m + t1 − 1 | Serial |
| Proposed SOBL, Figure 3.5(a) | 1 | m | TA + (1 + ⌈log2(m − 1)⌉)TX | m + 1 | m + ω + γ − 3 | 2m + 2t1 | Serial |
| $P(x) = x^m + x^{t_1} + x^{t_2} + x^{t_3} + 1$, $m/2 > t_1 > t_2 > t_3$ | | | | | | | |
| LSB-first [31] | m | m | TA + TX | m | m + ω − 2 | 3m | Parallel |
| MSB-first [31] | m | m | TA + TX | m | m + ω − 2 | 3m | Parallel |
| SOBL [177] †† | m | 2m | TA + ⌈log2(m − 1)⌉TX | 3m − 1 | 3m − 2 | 4m + 1 | Serial |
| SOBL [30] ††† | 1 | m | TA + (3 + ⌈log2(m)⌉)TX | 2m − 1 | 2m + 1 + γ | 3m + t1 − 1 | Serial |
| Proposed SOBL, Figure 3.3(a) | 1 | m | TA + ⌈log2 m⌉TX | 2m + 2 | 2m + γ − 2 | 3m + t1 − 1 | Serial |
| Proposed SOBL, Figure 3.5(a) | 1 | m | TA + (1 + ⌈log2(m − 1)⌉)TX | m + 1 | m + 2 + γ | 2m + 2t1 | Serial |
| $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$ | | | | | | | |
| LSB-first [31] | 163 | 163 | TA + TX | 163 | 166 | 489 | Parallel |
| MSB-first [31] | 163 | 163 | TA + TX | 163 | 166 | 489 | Parallel |
| SOBL [177] | 163 | 326 | TA + 8TX | 488 | 487 | 653 | Serial |
| SOBL [30] | 1 | 163 | TA + 11TX | 325 | 340 | 495 | Serial |
| Proposed SOBL, Figure 3.3(a) | 1 | 163 | TA + 8TX | 328 | 338 | 495 | Serial |
| Proposed SOBL, Figure 3.5(a) | 1 | 163 | TA + 9TX | 164 | 178 | 340 | Serial |
| $P(x) = x^m + x^{t_1} + 1$, $1 \le t_1 < m/2$ | | | | | | | |
| LSB-first [31] | m | m | TA + TX | m | m + 1 | 3m | Parallel |
| MSB-first [31] | m | m | TA + TX | m | m + 1 | 3m | Parallel |
| SOBL [177] | m | 2m | TA + ⌈log2(m − 1)⌉TX | 3m − 1 | 3m − 2 | 4m + 1 | Serial |
| SOBL [30] | 1 | m | TA + (2 + ⌈log2(m)⌉)TX | 2m − 1 | 2m + t1 − 2 | 3m + t1 − 1 | Serial |
| Proposed SOBL, Figure 3.4(a) | 1 | m | TA + ⌈log2 m⌉TX | 3m − 3 | 3m − 4 | 3m − t1 − 1 | Serial |
| Proposed SOBL, Figure 3.5(a) | 1 | m | TA + (1 + ⌈log2(m − 1)⌉)TX | m + 1 | m + t1 − 3 | 2m + 2t1 | Serial |
| $P(x) = x^{233} + x^{74} + 1$ | | | | | | | |
| LSB-first [31] | 233 | 233 | TA + TX | 233 | 234 | 699 | Parallel |
| MSB-first [31] | 233 | 233 | TA + TX | 233 | 234 | 699 | Parallel |
| SOBL [177] | 233 | 466 | TA + 8TX | 698 | 697 | 933 | Serial |
| SOBL [30] | 1 | 233 | TA + 10TX | 465 | 538 | 772 | Serial |
| Proposed SOBL, Figure 3.4(a) | 1 | 233 | TA + 8TX | 696 | 695 | 624 | Serial |
| Proposed SOBL, Figure 3.5(a) | 1 | 233 | TA + 9TX | 234 | 304 | 614 | Serial |

† The critical path delay of the multiplier schemes is obtained in terms of the delay of a two-input XOR gate (TX) and the delay of a two-input AND gate (TA).
†† The complexity results of [177] are obtained from [176].
††† $T_1 = (1 + \lceil \log_2(\omega - 1) \rceil + \lceil \log_2(m) \rceil) T_X$, $T_2 = (1 + \lceil \log_2(m - 1) \rceil + \lceil \log_2(\omega - 2) \rceil) T_X$, $\gamma = \sum_{i=1}^{\omega-2} (t_i - 1)$.

Figure 3.7: Hardware Overhead Gates Due to The Parallel I/O Data Transfer. (a) The Circuit That Enables a Register to be Cleared or Updated. (b) The Circuit That Enables a Register to be Switched Between Two Inputs (MUX).
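The delay and register figures quoted in this comparison can be reproduced from the formulas listed in Table 3.3. The Python sketch below is illustrative only; the function names and the scheme labels are hypothetical, and the formulas are copied from the corresponding table rows.

```python
def ceil_log2(n):
    """Integer ceiling of log2(n) for n >= 1, computed without floating point."""
    return (n - 1).bit_length()

def xor_levels(m, scheme):
    """Number of T_X terms in the critical path, per the formulas of Table 3.3."""
    if scheme == "proposed_fig_3.3a_or_3.4a":
        return ceil_log2(m)
    if scheme == "proposed_fig_3.5a":
        return 1 + ceil_log2(m - 1)
    if scheme == "sobl30_pentanomial":
        return 3 + ceil_log2(m)
    if scheme == "sobl30_trinomial":
        return 2 + ceil_log2(m)
    raise ValueError(scheme)

def reduction_percent(new, old):
    """Relative saving of `new` with respect to `old`, in percent."""
    return 100.0 * (1.0 - new / old)
```

With these helpers, the scheme of [30] evaluates to 11 and 10 XOR levels for m = 163 (pentanomial) and m = 233 (trinomial), the proposed schemes of Figures 3.3(a) and 3.4(a) to 8 levels in both fields, and the 340 registers of Figure 3.5(a) to a saving of roughly 30.5% and 31.3% relative to 489 and 495 registers.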
In addition to the core multiplier component, the bit-level multiplier processor has to embed some other functionality to operate properly. For instance, a controller component that controls the I/O communication signals and generates the control signals is required. Also, to minimize the total latency, the data I/O has to be transferred in parallel (at the cost of 1 clock cycle). The parallel I/O overhead (time and extra hardware) cannot be considered negligible. Figures 3.7(a) and 3.7(b) illustrate the hardware overhead gates due to the parallel I/O data transfer. The circuit depicted in Figure 3.7(a) enables a bit register to be initially cleared (when load signal = 1) or updated with the update signal (when load signal = 0). The circuit in Figure 3.7(b) enables a bit register to switch between two inputs based on the load signal. Note that no extra gate is required when a bit register holds the same data as at the initialization (as required in the 〈Z〉 register in Figures 3.3(a), 3.4(a), and 3.5(a)). The corresponding loading overhead gates in the proposed multiplier schemes are provided in Table 3.4. In this table, we compare the proposed multiplier schemes with the related bit-level multipliers when having the same parallel I/O communication format. The table shows that in terms of area complexities, the proposed SOBL multiplier scheme shown in Figure 3.5(a) provides about 30-32% reduction in total register cost compared to any bit-level scheme for the pentanomial irreducible polynomial.

Table 3.4: Comparison Table for The Proposed Multiplier Schemes (Figures 3.3(a), 3.4(a), and 3.5(a)) With The Related Bit-Level Multiplier Schemes When Having The Same Parallel I/O Data Transfer Format.

| Type of Multiplier Scheme | Total Reg. [bit] | Never Changed Reg. † [bit] | Initially Cleared Reg. †† [bit] | Loaded and Updated Reg. ††† [bit] | I/O Overhead: Total AND Gates | I/O Overhead: Total OR Gates |
|---|---|---|---|---|---|---|
| $P(x) = x^m + \sum_{i=1}^{\omega-1} x^{t_i}$, $m/2 > t_1 > t_2 > \cdots > t_{\omega-2} > t_{\omega-1} = 0$ | | | | | | |
| LSB-first [31] | 3m | − | m | 2m | 5m | 2m |
| MSB-first [31] | 3m | m | m | m | 3m | m |
| SOBL [30] | 3m + t1 − 1 | m | m + t1 − 1 | m | 3m + t1 − 1 | m |
| Proposed SOBL, Figure 3.3(a) | 3m + t1 − 1 | m | − | 2m + t1 − 1 | 4m + 2t1 − 2 | 2m + t1 − 1 |
| Proposed SOBL, Figure 3.5(a) | 2m + 2t1 | m | t1 | m + t1 | 2m + 3t1 | m + t1 |
| $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$ | | | | | | |
| LSB-first [31] | 489 | − | 163 | 326 | 815 | 326 |
| MSB-first [31] | 489 | 163 | 163 | 163 | 489 | 163 |
| SOBL [30] | 495 | 163 | 169 | 163 | 495 | 163 |
| Proposed SOBL, Figure 3.3(a) | 495 | 163 | − | 332 | 664 | 332 |
| Proposed SOBL, Figure 3.5(a) | 340 | 163 | 7 | 170 | 347 | 170 |
| $P(x) = x^{283} + x^{12} + x^7 + x^5 + 1$ | | | | | | |
| LSB-first [31] | 849 | − | 283 | 566 | 1415 | 566 |
| MSB-first [31] | 849 | 283 | 283 | 283 | 849 | 283 |
| SOBL [30] | 860 | 283 | 294 | 283 | 860 | 283 |
| Proposed SOBL, Figure 3.3(a) | 860 | 283 | − | 577 | 1154 | 577 |
| Proposed SOBL, Figure 3.5(a) | 590 | 283 | 12 | 295 | 602 | 295 |
| $P(x) = x^{233} + x^{74} + 1$ | | | | | | |
| LSB-first [31] | 699 | − | 233 | 466 | 1165 | 466 |
| MSB-first [31] | 699 | 233 | 233 | 233 | 699 | 233 |
| SOBL [30] | 772 | 233 | 306 | 233 | 772 | 233 |
| Proposed SOBL, Figure 3.3(a) | 772 | 233 | − | 539 | 1078 | 539 |
| Proposed SOBL, Figure 3.4(a) | 624 | 233 | 84 | 307 | 698 | 307 |
| Proposed SOBL, Figure 3.5(a) | 614 | 233 | 74 | 307 | 688 | 307 |

† Bit registers with free I/O data transfer.
†† Bit registers with a single AND gate for the I/O data transfer.
††† Bit registers with a multiplexer for the I/O data transfer.
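The overhead columns of Table 3.4 follow from the three register classes defined in its footnotes. The sketch below assumes that each multiplexer of Figure 3.7(b) costs two AND gates and one OR gate (any inverter on the load signal is ignored); this per-MUX cost is an assumption chosen because it reproduces the totals in the table, and the function name is hypothetical.

```python
def io_overhead(never_changed, initially_cleared, loaded_and_updated):
    """Return (and_gates, or_gates) of the parallel I/O overhead.

    never_changed:      registers loaded once and held (no extra gate)
    initially_cleared:  registers behind the circuit of Figure 3.7(a), one AND each
    loaded_and_updated: registers behind a MUX, assumed to be 2 AND + 1 OR each
    """
    and_gates = initially_cleared + 2 * loaded_and_updated
    or_gates = loaded_and_updated
    return and_gates, or_gates
```

For example, the Figure 3.5(a) scheme over $\mathbb{F}_{2^{163}}$ has 163 unchanged, 7 initially cleared, and 170 loaded-and-updated registers, which gives 347 AND gates and 170 OR gates.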
3.7 ASIC Implementation
In this section, we implement the schemes presented in the previous sections and their counterparts (6 schemes in total) to evaluate their area, time, and power requirements. For each scheme, we have two implementations: one without the controller as part of the multiplier scheme (the core multiplier only), and one with the controller that initializes and terminates the computation as part of the multiplier scheme (a complete serial-multiplier circuit). The proposed multiplier schemes are modeled in VHDL and synthesized for the binary extension fields $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$ that are recommended by NIST and SECG. A 65-nm complementary metal-oxide-semiconductor (CMOS) library has been chosen for the synthesis on the ASIC technology. All architectures have been synthesized using Synopsys® Design Vision®, which is a GUI for the Synopsys® Design Compiler® tools [183]. The correctness of the architectures is verified by the Xilinx® ISE™ Simulator (ISim). The map effort for optimizations is set to medium (i.e., default). The voltage settings in the BIOS were fixed and the power consumption readings have been taken at a frequency of 666 MHz for all designs. The fast bit-level multipliers described in [31] and [30] are also modeled in VHDL and synthesized in the same framework as the proposed multipliers to facilitate a quantitative performance comparison. We note that the power compiler in the Synopsys® Design Compiler® tools uses the power characterization specified in the target library and the switching activity to estimate power dissipation [183]. For each multiplier scheme, the area complexities are normalized to the complexity of a two-input NAND gate. The area of a NAND gate in the utilized CMOS library for a drive strength of two is 2.08 µm². The total area is the sum of the combinational area (CA) and the non-combinational area (Non-CA). The timing (ns) for the critical-path delays (CPD) and the dynamic power (mW) are also obtained for all the designs. The reported ASIC results of the implementations of the multipliers over $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$ are listed in Table 3.5. In this table, the total time required for each multiplier is computed by multiplying the number of clock cycles, i.e., m, by the critical-path delay. It can be seen from the table that for the POBL schemes, the computation time required to obtain the first output bit and the total time required for the multiplication are equal, whereas in the SOBL schemes, the computation time required to obtain the first output bit is equal to the critical-path delay. Also, the controller has a longer critical-path delay than the actual POBL schemes (the core multiplier component). From the table, one can see that the proposed ω-nomial SOBL scheme depicted in Figure 3.3(a) has a lower critical-path delay, by an average of 10-14%, than the one in [30]. Also from this table, one can see that when considering the controller as part of the multiplier over $\mathbb{F}_{2^{233}}$, the SOBL multipliers are the most dynamic power efficient schemes.
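The timing columns of Table 3.5 are derived from the critical-path delay as described above. The following sketch restates that arithmetic; the function name and the `serial_out` flag are hypothetical, and the speed is simply the reciprocal of the reported (rounded) CPD, so it matches the tabulated MHz values only to within rounding.

```python
def timing(m, cpd_ns, serial_out):
    """Return (bit_time_ns, total_time_ns, speed_mhz) for an m-cycle multiplier.

    serial_out=True models an SOBL scheme (first bit after one cycle);
    serial_out=False models a POBL scheme (all bits after m cycles).
    """
    total_time_ns = m * cpd_ns
    bit_time_ns = cpd_ns if serial_out else total_time_ns
    speed_mhz = 1000.0 / cpd_ns
    return bit_time_ns, total_time_ns, speed_mhz
```

For the LSB-first POBL multiplier over $\mathbb{F}_{2^{163}}$ without controller (CPD 0.3 ns) this gives 48.9 ns for both the bit time and the total time, and for the SOBL scheme of [30] (CPD 0.86 ns) it gives 0.86 ns and 140.18 ns.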
Table 3.5: Comparison of Bit-Level Polynomial Basis Multipliers on an ASIC Implementation (Post Synthesis) Over Both $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$ Using 65-nm CMOS Standard Technology.

| Type of Multiplier | Type of Scheme | CA [KGate] † | Non-CA [KGate] † | Total Area [KGate] † | CPD [ns] | Speed [MHz] | Bit-Time [ns] | Total-Time [ns] | Dynamic Power [mW] †† |
|---|---|---|---|---|---|---|---|---|---|
| $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$ (Without Controller) | | | | | | | | | |
| LSB-first [31] | POBL | 1.49 | 1.84 | 3.33 | 0.3 | 3333 | 48.9 | 48.9 | 6.653 |
| MSB-first [31] | POBL | 1.16 | 1.84 | 3 | 0.32 | 3125 | 52.16 | 52.16 | 5.76 |
| SOBL [30] | SOBL | 1.63 | 1.9 | 3.53 | 0.86 | 1162 | 0.86 | 140.18 | 4.996 |
| Proposed SOBL, Figure 3.3(a) | SOBL | 1.83 | 1.9 | 3.73 | 0.74 | 1351 | 0.74 | 120.62 | 6.132 |
| Proposed SOBL, Figure 3.5(a) | SOBL | 0.91 | 1.26 | 2.2 | 0.75 | 1333 | 0.75 | 122.25 | 3.861 |
| $P(x) = x^{163} + x^7 + x^6 + x^3 + 1$ (With Controller) | | | | | | | | | |
| LSB-first [31] | POBL | 1.58 | 1.89 | 3.47 | 0.41 | 2439 | 66.83 | 66.83 | 6.748 |
| MSB-first [31] | POBL | 1.23 | 1.89 | 3.12 | 0.43 | 2325 | 70.09 | 70.09 | 5.816 |
| SOBL [30] | SOBL | 1.67 | 1.96 | 3.63 | 0.86 | 1162 | 0.86 | 140.18 | 5.168 |
| Proposed SOBL, Figure 3.3(a) | SOBL | 1.99 | 1.96 | 3.95 | 0.75 | 1333 | 0.75 | 122.25 | 6.338 |
| Proposed SOBL, Figure 3.5(a) | SOBL | 0.97 | 1.38 | 2.35 | 0.75 | 1333 | 0.75 | 122.25 | 4.08 |
| $P(x) = x^{233} + x^{74} + 1$ (Without Controller) | | | | | | | | | |
| LSB-first [31] | POBL | 2.11 | 2.62 | 4.73 | 0.31 | 3225 | 72.23 | 72.23 | 9.498 |
| MSB-first [31] | POBL | 1.65 | 2.62 | 4.27 | 0.32 | 3125 | 74.56 | 74.56 | 8.108 |
| SOBL [30] | SOBL | 2.45 | 2.95 | 5.4 | 0.83 | 1204 | 0.83 | 193.39 | 7.756 |
| Proposed SOBL, Figure 3.3(a) | SOBL | 2.51 | 2.95 | 5.46 | 0.74 | 1351 | 0.74 | 172.42 | 8.738 |
| Proposed SOBL, Figure 3.4(a) | SOBL | 2.67 | 2.34 | 5.01 | 0.73 | 1369 | 0.73 | 170.09 | 8.045 |
| Proposed SOBL, Figure 3.5(a) | SOBL | 1.92 | 2.29 | 4.21 | 0.76 | 1316 | 0.76 | 177.08 | 6.619 |
| $P(x) = x^{233} + x^{74} + 1$ (With Controller) | | | | | | | | | |
| LSB-first [31] | POBL | 2.22 | 2.67 | 4.89 | 0.4 | 2500 | 93.2 | 93.2 | 9.625 |
| MSB-first [31] | POBL | 1.79 | 2.67 | 4.46 | 0.41 | 2439 | 95.53 | 95.53 | 8.297 |
| SOBL [30] | SOBL | 2.51 | 3.01 | 5.52 | 0.83 | 1204 | 0.83 | 193.39 | 7.969 |
| Proposed SOBL, Figure 3.3(a) | SOBL | 2.56 | 3.01 | 5.57 | 0.74 | 1351 | 0.74 | 172.42 | 9.01 |
| Proposed SOBL, Figure 3.4(a) | SOBL | 2.74 | 2.39 | 5.13 | 0.73 | 1369 | 0.73 | 170.09 | 8.191 |
| Proposed SOBL, Figure 3.5(a) | SOBL | 1.98 | 2.37 | 4.35 | 0.76 | 1316 | 0.76 | 177.08 | 6.634 |

† KGate is the area equivalence in terms of number of NAND gates ×10³ (estimated area of one NAND gate is 2.08 µm²).
†† The power consumption readings were conducted under 666 MHz frequency for all the designs.
It can also be seen from the table that the proposed pentanomial SOBL scheme depicted in Figure 3.5(a) provides about 26-30% reduction in area complexity cost and about 22-24% reduction in power consumption compared to the current state-of-the-art bit-level multiplier schemes for $\mathbb{F}_{2^{163}}$.²
3.8 Conclusions
We have presented new hardware schemes for the serial-out bit-level (SOBL) multiplier in PB representation over $\mathbb{F}_{2^m}$ for the ω-nomial, pentanomial, and trinomial irreducible polynomials. Compared to previously published results in terms of time and area complexities, the work presented here outperforms the existing SOBL multiplier schemes. The implementation results show that the smallest proposed SOBL scheme provides about 26-30% reduction in area complexity cost and about 22-24% reduction in power consumption compared to the current state-of-the-art bit-level multiplier schemes for the irreducible pentanomials that are recommended by NIST. We also showed that it is possible to further extend the scheme toward a serial-out digit-level multiplier. The proposed finite field multiplication scheme can be applied to ECC processor implementations in resource-constrained devices.
2 A similar result was obtained for the other NIST-recommended irreducible pentanomials, i.e., those defining
$\mathbb{F}_{2^{283}}$ and $\mathbb{F}_{2^{571}}$.
4
Architectures for Hybrid-Double
Multiplication Using Polynomial Basis
In order to investigate the applicability of the proposed SOBL schemes, in this chapter
we employ the three proposed SOBL schemes, and the SOBL scheme proposed in [30],
to present what is, to our knowledge, the first approach to a hybrid-double multiplication
architecture in the polynomial basis over $\mathbb{F}_{2^m}$. This hybrid multiplier structure operates on
three finite field elements and performs two multiplication tasks with a latency comparable to the
latency of a single multiplication, i.e., the results of two finite field multiplications are obtained
after m + 1 clock cycles. We also extend the traditional POBL multiplier schemes presented
in [31] to propose two new low-complexity and fast LSB-first/MSB-first POBL double
multiplication architectures, which perform two multiplications together after 2m clock cycles. To
obtain actual implementation results, all the proposed schemes, i.e., 4 hybrid-double
architectures and 2 double multiplication architectures, are coded in VHDL and implemented in ASIC
technology over both $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$.¹
4.1 Introduction
The serial-out bit-level (SOBL) multiplication scheme is characterized by an important latency
feature: it sequentially generates one output bit of the multiplication result in
each clock cycle. This makes it applicable in many recent applications, such as the hybrid-double
multiplication architectures. The computational complexity of the existing SOBL multipliers
in $\mathbb{F}_{2^m}$ using the normal basis (NB) representation limits their usefulness in many applications. A
multiplier that operates using the polynomial basis (PB) representation has, compared to the NB,
lower hardware requirements and a structure that is easy to derive from the defining irreducible
polynomial P(x) of the field [35]. In the following, we employ the three proposed SOBL
schemes, and the SOBL scheme proposed in [30], to present, for the first time, hybrid-double
multiplication architectures using PB over $\mathbb{F}_{2^m}$.
1 Part of this work can be found in [98].
The organization of this chapter is as follows. In Section 4.2, new double multiplication
architectures using PB are proposed and discussed. In Section 4.3, an architecture for
hybrid-double multiplication is proposed. In Section 4.4, the performance of the proposed double and
hybrid-double multiplication architectures is investigated by implementing each architecture
in ASIC technology. Finally, the conclusions are presented in Section 4.5.
4.2 Architectures for Double Multiplication
In this section, we first extend the traditional parallel-out bit-level (POBL) multiplier schemes
presented in [31] to propose new low-complexity and fast POBL double multiplication
architectures. We then propose new hybrid-double multiplication architectures using PB over $\mathbb{F}_{2^m}$.
Note that all the presented architectures can be easily modified to extend their structure to
the digit level. However, for the sake of simplicity, in this work we did not investigate the
techniques for the digit-level structures.
4.2.1 New Architectures for LSB-first/MSB-first POBL Double Multiplications
Beth and Gollmann in [31] proposed two types of bit-level multiplier schemes, namely the
LSB-first and MSB-first multipliers. Let A and B be two arbitrary elements of $\mathbb{F}_{2^m}$ and let C be their
product, i.e., C = AB. Then, the LSB-first POBL multiplier is obtained as follows [31]:
$$C = b_{m-1}\big((A\alpha^{m-1}) \bmod P(\alpha)\big) + \cdots + b_0\big(A \bmod P(\alpha)\big),$$
and the MSB-first POBL multiplier is obtained as follows:
$$C = \Big(\cdots\big((b_{m-1}A)\alpha \bmod P(\alpha) + b_{m-2}A\big)\alpha \bmod P(\alpha) + \cdots + b_1 A\Big)\alpha \bmod P(\alpha) + b_0 A.$$
Let D, E ∈ $\mathbb{F}_{2^m}$ such that E = CD mod P(α). A combination of the two consecutive single
multiplications C = AB and E = CD produces the following double multiplication involving
three operands:
$$E = ABD. \tag{4.1}$$
A double multiplier that computes (4.1) can be obtained by extending the traditional POBL
schemes to the schemes presented in Figures 4.1(a) and 4.1(b). In these figures, the
register ⟨Y⟩ is initialized as follows: for the LSB-first double multiplier, i.e., Figure 4.1(a),
$\langle y_{2m-1}, \cdots, y_m \rangle = D$ and $\langle y_{m-1}, \cdots, y_0 \rangle = A$, and for the MSB-first double multiplier, i.e., Figure
4.1(b), $\langle y_{2m-1}, \cdots, y_m \rangle = A$ and $\langle y_{m-1}, \cdots, y_0 \rangle = D$. In both architectures, the register ⟨X⟩
is initialized with B and the register ⟨Z⟩ is initially cleared. The α module multiplies its
input by α and reduces the result by P(x), at a cost of ω − 2 2-input XOR gates. The
dotted block ⊙ in both figures denotes the bit-wise AND operation between the LSB (or
MSB) of the ⟨Y⟩ register and the contents of the register ⟨X⟩, and is implemented using m 2-input
AND gates. The adder block ⊕ denotes bit-wise XOR and is implemented using
m 2-input XOR gates. After m clock cycles, the contents of ⟨Z⟩, which are then the coordinates
of the product C = AB, are loaded into ⟨X⟩. Eventually, at clock cycle 2m, the contents of ⟨Z⟩ become
the coordinates of the product E = CD.
The MSB-first double multiplier scheme shown in Figure 4.1(b) has a longer critical-path delay
than the LSB-first double multiplier scheme shown in Figure 4.1(a), since in
the MSB-first double multiplier scheme the α module must also be considered in the delay
path. However, the parallel I/O data transfer to the ⟨X⟩ register
in the LSB-first double multiplier requires a 3-to-1 multiplexer of size m bits. As a result, the
LSB-first double multiplier has higher area complexity.
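The two double-multiplication schedules described above can be sketched behaviourally in software. The following Python fragment is only an illustrative model under stated assumptions, not the VHDL of this work: field elements are represented as integers (bit i is the coefficient of α^i), the NIST pentanomial for m = 163 is used, and the function names `times_alpha`, `lsb_first_double`, and `msb_first_double` are hypothetical. It also assumes that ⟨Z⟩ is cleared at the moment the intermediate product C is loaded into ⟨X⟩.

```python
M = 163
# P(x) = x^163 + x^7 + x^6 + x^3 + 1, stored as an integer bit mask
P = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def times_alpha(v):
    """Model of the alpha module: multiply by alpha, then reduce by P."""
    v <<= 1
    return v ^ P if (v >> M) & 1 else v

def lsb_first_double(a, b, d):
    """E = ABD in 2m iterations, scanning <Y> = (D, A) from bit 0 upward."""
    y = (d << M) | a              # upper half D, lower half A
    x, z = b, 0                   # <X> holds B, <Z> is cleared
    for clk in range(2 * M):
        if clk == M:              # after m cycles <Z> holds C = AB
            x, z = z, 0           # load C into <X>, clear <Z> (assumption)
        if (y >> clk) & 1:        # AND of the current bit of <Y> with <X>
            z ^= x                # XOR accumulation into <Z>
        x = times_alpha(x)
    return z

def msb_first_double(a, b, d):
    """E = ABD in 2m iterations, scanning <Y> = (A, D) from the top bit down."""
    y = (a << M) | d              # upper half A, lower half D
    x, z = b, 0
    for clk in range(2 * M):
        if clk == M:
            x, z = z, 0
        z = times_alpha(z)        # the alpha module sits in the <Z> loop here
        if (y >> (2 * M - 1 - clk)) & 1:
            z ^= x
    return z
```

Both functions perform exactly 2m loop iterations, matching the 2m clock cycles quoted for the POBL double multipliers. In the MSB-first model the α module is applied to the accumulator itself, which mirrors the remark that it lies on the delay path of that scheme.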
Figure 4.1: The Proposed Double Multiplication Architectures That Extend The POBL
Schemes Presented in [31]. (a) LSB-First POBL Double Multiplication Architecture.
(b) MSB-First POBL Double Multiplication Architecture.
4.2.2 New Parallel-Out Digit-Level Polynomial Basis Double Multiplication
In the following, we extend the traditional least significant digit first (LSD-first) PODL
multiplication algorithm proposed in [44] to propose a new LSD-first PODL double-multiplication
architecture using PB over $\mathbb{F}_{2^m}$.
Let A, B, and C be three arbitrary elements in $\mathbb{F}_{2^m}$ generated by the irreducible polynomial
P(α) in (3.3), where C is the product of A and B as in (3.1). Let D, E ∈ $\mathbb{F}_{2^m}$
such that $E \triangleq CD \bmod P(\alpha)$. Let us assume that $q = \lceil m/w \rceil$, where w is the selected digit size. If
m is not a multiple of w, then the operands A and D must be padded with (qw − m)
zero bits in their most significant positions, i.e.,
$$
\begin{aligned}
A &= (\underbrace{0, \cdots, 0}_{qw-m}, a_{m-1}, a_{m-2}, \cdots, a_1, a_0),\\
D &= (\underbrace{0, \cdots, 0}_{qw-m}, d_{m-1}, d_{m-2}, \cdots, d_1, d_0).
\end{aligned}
\tag{4.2}
$$
Accordingly, the elements A and D can be represented by
$$A = \sum_{i=0}^{q-1} A_i \alpha^{iw}, \qquad D = \sum_{i=0}^{q-1} D_i \alpha^{iw}, \tag{4.3}$$
where
$$
\begin{aligned}
A_i &= a_{iw+w-1}\alpha^{w-1} + \cdots + a_{iw+1}\alpha + a_{iw},\\
D_i &= d_{iw+w-1}\alpha^{w-1} + \cdots + d_{iw+1}\alpha + d_{iw}.
\end{aligned}
\tag{4.4}
$$
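By using the LSD-first PODL multiplication scheme, the double product E can be written as
$$
\begin{aligned}
E &= CD \bmod P(\alpha)\\
  &= C\,(D_{q-1}\alpha^{(q-1)w} + \cdots + D_1\alpha^{w} + D_0) \bmod P(\alpha)\\
  &= E_{q-1} + \cdots + E_1 + E_0 \bmod P(\alpha),
\end{aligned}
\tag{4.5}
$$
where
$$E_i = C^{(i)} D_i, \tag{4.6}$$
and
$$C^{(i)} = C\alpha^{wi} \bmod P(\alpha) = \alpha^{w} C^{(i-1)} \bmod P(\alpha), \tag{4.7}$$
for $0 < i \le q-1$ and $C^{(0)} = C$.
Similarly, the product C can be written as
$$
\begin{aligned}
C &= AB \bmod P(\alpha)\\
  &= C_{q-1} + \cdots + C_1 + C_0 \bmod P(\alpha),
\end{aligned}
\tag{4.8}
$$
where
$$C_i = B^{(i)} A_i, \tag{4.9}$$
and
$$B^{(i)} = B\alpha^{wi} \bmod P(\alpha) = \alpha^{w} B^{(i-1)} \bmod P(\alpha), \tag{4.10}$$
for $0 < i \le q-1$ and $B^{(0)} = B$.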
The LSD-first PODL double multiplication given by (4.5) can be described by Algorithm 11.
The main operations of this algorithm are a multiplication followed by an addition
in Step 1.1.2, and a multiplication by $\alpha^w$ followed by a reduction by P(α) in Step 1.1.3. As
shown in the initialization step, the vector Y is 2qw bits long and is loaded with both of
the elements A and D as presented in (4.2), the vector X is m bits long and is loaded with the
element B, and the vector Z is (m + w − 1) bits long and is initially cleared. According
to (4.8), after $q = \lceil m/w \rceil$ clock cycles, the contents of the vector Z in Algorithm 11 (register ⟨Z⟩ in
Figure 4.2) become $Z = C_0 + C_1 + \cdots + C_{q-1}$. The final reduction E = Z mod P(α) in Step 1.4
then yields the product C, which is loaded
into the vector X to perform the second multiplication as presented in (4.5). This is done at clock
q, as shown in Step 1.2.1. Also at clock q, the vector Z is cleared again, as shown in Step 1.2.2.

Algorithm 11 Proposed LSD-First Parallel-Out Digit-Level Double-Multiplication Operation
Input: $A = (A_{q-1}, \cdots, A_0)$, $B = (b_{m-1}, \cdots, b_0)$, $D = (D_{q-1}, \cdots, D_0) \in \mathbb{F}_{2^m}$, where
  $A_i = a_{iw+w-1}\alpha^{w-1} + \cdots + a_{iw}$, $D_i = d_{iw+w-1}\alpha^{w-1} + \cdots + d_{iw}$, $q = \lceil m/w \rceil$, $0 \le i \le q-1$,
  $a_j, b_j \in GF(2)$ for $0 \le j \le m-1$, and $a_j, d_j = 0$ for $m \le j \le qw-1$.
Output: $E = A \cdot B \cdot D \bmod P(\alpha)$, where $P(\alpha) = \alpha^m + \sum_{i=1}^{\omega-1} \alpha^{t_i}$,
  $\frac{m}{2} > t_1 > t_2 > \cdots > t_{\omega-2} > t_{\omega-1} = 0$, and $P(\alpha) = 0$.
/* Set signal vectors Z, Y, and X of length m+w−1, 2qw, and m bits, respectively */
Initialize: $Z = [z_{m+w-2}, \cdots, z_0] \leftarrow (0, \cdots, 0)$;
            $X = [x_{m-1}, \cdots, x_0] \leftarrow (b_{m-1}, \cdots, b_0)$;
            $Y = [y_{2qw-1}, \cdots, y_0] \leftarrow (\underbrace{0, \cdots, 0}_{qw-m}, d_{m-1}, d_{m-2}, \cdots, d_0, \underbrace{0, \cdots, 0}_{qw-m}, a_{m-1}, a_{m-2}, \cdots, a_0)$;
Step 1: For i = 0 to 2q do
Step 1.1:   If $i \ne q$
            /* Set a signal vector W of length w bits */
Step 1.1.1:   $W = [w_{w-1}, \cdots, w_0] \leftarrow [y_{iw+w-1}, \cdots, y_{iw}]$;
Step 1.1.2:   $Z \leftarrow W \cdot X + Z$;
Step 1.1.3:   $X \leftarrow X\alpha^{w} \bmod P(\alpha)$;
Step 1.2:   Else
Step 1.2.1:   $X \leftarrow E$;
Step 1.2.2:   $Z = [z_{m+w-2}, \cdots, z_0] \leftarrow (0, \cdots, 0)$;
Step 1.3:   End If
Step 1.4:   $E \leftarrow Z \bmod P(\alpha)$;
Step 2: End For
Step 3: Return E;

Figure 4.2 shows an architecture that corresponds to Algorithm 11, that
is, the LSD-first PODL PB double multiplication over $\mathbb{F}_{2^m}$. The architecture consists of one
multiplier core, three registers, two reduction modules ($X\alpha^w \bmod P(\alpha)$ and $Z \bmod P(\alpha)$),
and one (m + w − 1)-bit adder. In this architecture, ⟨Z⟩ is an (m + w − 1)-bit register, which is
initially cleared, contains the coordinates of the polynomial $C_i$ shown in (4.9) during the first q
clock cycles, is cleared again at clock q, and contains the coordinates of the polynomial $E_i$ shown
in (4.6) during the second q clock cycles. The register ⟨X⟩ is an m-bit register, which contains
the coordinates of the polynomial $B^{(i)}$ shown in (4.10) during the first q clock cycles, is updated with
the product C at clock q, and contains the coordinates of the polynomial $C^{(i)}$ shown in (4.7) during the second q
clock cycles. The register ⟨Y⟩ is a 2qw-bit register, which contains the coordinates of both of
the elements A and D as shown in (4.2). In addition, this architecture includes two loops. The
left and the right loops implement Step 1.1.3 and Step 1.1.2 of Algorithm 11, respectively. The
dotted block ⊙ denotes the multiplier core, that is, the multiplication of X (a polynomial
of degree m − 1) by a digit W (a polynomial of degree w − 1); as a result, its output is an
(m + w − 1)-bit signal. It is implemented using wm 2-input AND gates and w(m − 1) 2-input XOR
gates. The multiplier core in Figure 4.2 represents (4.9) during the first q cycles and (4.6)
during the second q cycles. It computes the term W · X that appears in Step 1.1.2 of Algorithm
11. The adder block ⊕ denotes bit-wise XOR gates. It adds the result of the ⊙ block
to the current value of register ⟨Z⟩ and stores the result in register ⟨Z⟩ again. The ⊕ block is
implemented using m + w − 1 2-input XOR gates. The $\alpha^w$ module multiplies the contents of
⟨X⟩ by $\alpha^w$ and reduces the result by P(α), which implements Step 1.1.3 of Algorithm 11. The
result of the $\alpha^w$ module, which represents (4.10) during the first q clock cycles and (4.7) during the second
q clock cycles, is stored in register ⟨X⟩. The mod P(α) module implements Step 1.4, which is the
reduction of a polynomial of degree m + w − 2 by P(α). Note that in Figure 4.2, $[X^{(i)}]$ and $[Z^{(i)}]$
show the contents of the registers ⟨X⟩ and ⟨Z⟩ at the ith iteration of Algorithm 11, respectively.

Figure 4.2: Proposed architecture for the LSD-first PODL Double Multiplication Operation.

From the proposed architecture in Figure 4.2, one can see that the LSD-first PODL double
multiplier requires $2 \times \lceil m/w \rceil + 2$ clock cycles. Figure 4.3 shows an architecture
that corresponds to the MSD-first PODL PB double multiplication over $\mathbb{F}_{2^m}$.
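As a cross-check of the data flow in Algorithm 11, the LSD-first steps can be modelled in a few lines of Python. This is a hedged sketch rather than the hardware description: integers stand in for the registers, m = 163 and w = 2 are chosen only as an example, and the helper names `reduce_mod_p`, `core_multiply`, and `lsd_first_double` are hypothetical. One further assumption is made explicit in the code: the digit pointer into Y does not advance during the reload iteration i = q, so the q digits of D are consumed in iterations q + 1 to 2q.

```python
M, W = 163, 2
P = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # x^163 + x^7 + x^6 + x^3 + 1
Q = -(-M // W)                                        # q = ceil(m / w)

def reduce_mod_p(v):
    """Reduce a binary polynomial of any degree modulo P."""
    for i in range(v.bit_length() - 1, M - 1, -1):
        if (v >> i) & 1:
            v ^= P << (i - M)
    return v

def core_multiply(x, digit):
    """Multiplier core: carry-less product of X with one w-bit digit."""
    out = 0
    for j in range(W):
        if (digit >> j) & 1:
            out ^= x << j
    return out

def lsd_first_double(a, b, d):
    """Software model of Algorithm 11 returning E = ABD mod P."""
    y = (d << (Q * W)) | a         # Y = (0.., D, 0.., A), 2qw bits
    x, z, e = b, 0, 0
    next_digit = 0                 # assumption: advances only when a digit is read
    for i in range(2 * Q + 1):     # i = 0 .. 2q
        if i != Q:
            w_vec = (y >> (next_digit * W)) & ((1 << W) - 1)  # Step 1.1.1
            z ^= core_multiply(x, w_vec)                      # Step 1.1.2
            x = reduce_mod_p(x << W)                          # Step 1.1.3
            next_digit += 1
        else:
            x, z = e, 0                                       # Steps 1.2.1, 1.2.2
        e = reduce_mod_p(z)                                   # Step 1.4
    return e
```

In this model Z is left unreduced between iterations and never exceeds m + w − 1 bits, which is why the algorithm sizes that vector accordingly and applies the reduction only in Step 1.4.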
4.3 Hybrid-Double Multiplication
Recently, a hybrid-double multiplier was proposed in $\mathbb{F}_{2^m}$ using the normal basis representation
[32, 33]. This hybrid-double multiplier is achieved by combining and interleaving a SOBL
Gaussian normal basis multiplier that is implemented based on [34] and a POBL normal basis
multiplier that is based on [31]. Note that a traditional POBL multiplier, such as the Beth and
Gollmann approach [31], cannot by itself form a hybrid-double multiplier component;
however, combining a SOBL multiplier with a traditional POBL one allows a
hybrid-double multiplier to be developed.
Figure 4.3: Proposed architecture for the MSD-first PODL Double Multiplication Operation.
The SOBL polynomial basis multiplication scheme proposed in [30] generates one bit of
the product in each clock cycle. Thus, it can be combined with the traditional POBL
multiplier (such as the Beth and Gollmann approach in [31]) to produce the hybrid-double
multiplication scheme. The structure of the hybrid-double multiplier is illustrated in Figure 4.4. In
this figure, the SOBL multiplier generates one output bit
of the product C = AB in each clock cycle, whereas the POBL multiplier computes all
output coordinates in parallel after m clock cycles. As one can see from Figure 4.4, all bits of
the operands A, B, and D are initially available, while the coordinates of the partial product C
become available serially, starting from the LSB, i.e., $c_0$.
The structure of the hybrid-double multiplier illustrated in Figure 4.4 allows performing
two multiplications simultaneously, where the results are available in parallel after m + 1 clock
cycles, assuming that one clock cycle is required to load the output of the SOBL multiplier
(stored in the register) into the input of the LSB-first POBL multiplier.
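The interleaving described above can be illustrated with a small behavioural model in Python. It is a sketch under explicit assumptions: the serial-out multiplier is replaced by a stand-in generator (`sobl_bits`, a hypothetical name) that simply emits the bits $c_0, c_1, \ldots$ of C = AB one per cycle, and does not reproduce the internal structure of any SOBL scheme of Chapter 3 or of [30]. The LSB-first POBL part consumes each bit as it arrives, with one extra cycle counted for the register between the two multipliers.

```python
M = 163
P = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # x^163 + x^7 + x^6 + x^3 + 1

def times_alpha(v):
    """Multiply by alpha and reduce by P."""
    v <<= 1
    return v ^ P if (v >> M) & 1 else v

def reference_mul(a, b):
    """Plain shift-and-add product in GF(2^m), used only as a stand-in."""
    z, x = 0, a
    for i in range(M):
        if (b >> i) & 1:
            z ^= x
        x = times_alpha(x)
    return z

def sobl_bits(a, b):
    """Stand-in for a serial-out multiplier: yields c_0, c_1, ..., c_{m-1}."""
    c = reference_mul(a, b)
    for i in range(M):
        yield (c >> i) & 1

def hybrid_double(a, b, d):
    """Return (E, cycles) with E = ABD, consuming C = AB one bit per cycle."""
    x, z = d, 0        # LSB-first POBL: <X> preloaded with D, <Z> cleared
    cycles = 1         # one cycle to register the first serial output bit
    for c_i in sobl_bits(a, b):
        if c_i:
            z ^= x     # Z <- Z + c_i * (D alpha^i mod P)
        x = times_alpha(x)
        cycles += 1
    return z, cycles
```

The returned cycle count is m + 1, in line with the latency stated above, because the second multiplication starts as soon as $c_0$ is available rather than after the first product is complete.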
The critical-path delay of the hybrid-double multiplier ($t_h$) is equal to the maximum of
the delays of the SOBL ($t_s$) and the LSB-first POBL ($t_p$) multipliers, i.e., $t_h = \max\{t_s, t_p\}$.
Based on the information provided in Table 3.3, i.e., $t_s > t_p$, one can see that $t_h = t_s$. Thus, to
speed up the multiplication, one can balance the delays of the two multipliers at the cost of a
few additional registers. Let us divide the $IP_m$ block by inserting registers at stage ε; then, the
total number of required register bits is $\upsilon = \lceil m / 2^{\varepsilon} \rceil$. If the position
ε is properly chosen, the total propagation delay of the hybrid-double multiplier
architecture, as depicted in Figure 4.5(b), would be reduced to about $\lceil (t_s + t_p)/2 \rceil$.
Figure 4.4: Architectures for The Hybrid-Double Multiplication. The Hybrid-Double Multiplier
Structure is Developed by Connecting The Output of The SOBL Multiplier Into The Input
of The POBL Multiplier.
4.4 ASIC Implementation
Figure 4.5: Architectures for The Hybrid-Double Multiplication. (a) The Critical-Path Delay
of The Hybrid-Double Multiplier ($t_h$). (b) Reducing The Delay by Inserting Registers at The
$IP_m$ Block Inside The SOBL Multiplier.
In this section, we implement the presented double and hybrid-double architectures to evaluate
their area, time, and power requirements. For each scheme, we have two implementations:
one without the controller as part of the multiplier scheme (the core multiplier
only), and one with the controller that initializes and terminates the computation
as part of the multiplier scheme (a complete serial-multiplier circuit). The proposed
architectures are modeled in VHDL and synthesized for the binary extension fields $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$ that
are recommended by NIST and SECG. A 65-nm complementary metal-oxide-semiconductor
(CMOS) library has been chosen for synthesis on the ASIC technology. All architectures
have been synthesized using Synopsys® Design Vision®, which is a GUI for the Synopsys®
Design Compiler® tools [183]. The correctness of the architectures is verified with the Xilinx® ISE™
Simulator (ISim). The map effort for optimizations is set to medium (i.e., default). The power
consumption readings have been taken at a frequency of 666 MHz for all designs. The
area complexities are normalized to the complexity of a two-input NAND gate. It is noted that
the area of a NAND gate in the utilized CMOS library for a drive strength of two is 2.08 µm².
The total area is the sum of the combinational area (CA) and the non-combinational area
(Non-CA). The critical-path delays (CPD, in ns) and the dynamic power (in mW) are also
obtained for all the designs. The reported ASIC results of the implementations of the proposed
double multiplication architectures over $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$ are listed in Tables 4.1 and 4.2. In these tables, the
total time of the multiplication is computed as follows. For the POBL double-multiplication
architectures, we multiply the total number of clock cycles, i.e., 2m, by the critical-path
delay. For the PODL double-multiplication architectures, we multiply the total number of clock
cycles, i.e., $2 \times \lceil m/w \rceil + 2$, by the critical-path delay. For the hybrid-double multiplication
architectures, we multiply the total number of clock cycles, i.e., m + 1, by the critical-path delay. Also,
for the POBL double-multiplication architectures, the throughput (TPT) of the multiplication
is obtained by multiplying the number of bits per cycle, i.e., $\frac{m}{2m}$, by the speed, whereas the TPT
of the hybrid-double multiplication architectures is obtained by multiplying the number of bits
per cycle, i.e., $\frac{m}{m+1}$, by the speed.
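The derived columns follow directly from the critical-path delay and the cycle count, as the following Python sketch illustrates. The function and key names (`derived_metrics`, `total_ns`, and so on) are hypothetical; the formulas are the ones stated above, with the energy taken as dynamic power divided by throughput as in the table footnotes. The example inputs are the CPD and power values reported for the designs without controller over $\mathbb{F}_{2^{163}}$.

```python
from math import ceil

def derived_metrics(m, cpd_ns, cycles, power_mw):
    """Derive speed, total time, throughput and energy from CPD and cycle count."""
    speed_mhz = 1e3 / cpd_ns                  # clock frequency in MHz
    total_ns = cycles * cpd_ns                # total multiplication time
    tpt_mbps = (m / cycles) * speed_mhz       # bits per cycle times the speed
    energy = 1e3 * power_mw / tpt_mbps        # mW / Mbps = nJ/bit; x1000 gives mJ/Gbit
    return {"speed_mhz": speed_mhz, "total_ns": total_ns,
            "tpt_mbps": tpt_mbps, "energy_mj_per_gbit": energy}

m, w = 163, 2
pobl = derived_metrics(m, 0.41, 2 * m, 7.76)                 # LSB-first POBL double
podl = derived_metrics(m, 0.48, 2 * ceil(m / w) + 2, 8.08)   # LSD-first PODL double
hybrid = derived_metrics(m, 0.87, m + 1, 9.408)              # hybrid-double with [30]

for name, r in (("POBL", pobl), ("PODL", podl), ("hybrid", hybrid)):
    print(name, {k: round(v, 2) for k, v in r.items()})
```

Up to rounding, the values produced this way agree with the corresponding speed, total time, and throughput entries reported for these three designs.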
Tables 4.1 and 4.2 show that employing the proposed SOBL schemes in the hybrid-double
multiplication architectures reduces the total time and improves the throughput
with respect to the POBL double multiplication architectures. Table 4.1 also shows
that employing the proposed compact SOBL scheme, i.e., Figure 3.5(a), in the hybrid-double
multiplication architecture reduces the total area with respect to the other POBL/PODL
double multiplication architectures over $\mathbb{F}_{2^{163}}$.

Table 4.1: Comparison Table for The ASIC Synthesis Results for The Proposed Double Multiplication
Architectures (Figure 4.5(a), 4.5(b)) for The Polynomial Basis Over $\mathbb{F}_{2^{163}}$ Using
65-nm CMOS Standard Technology.
Type of architecture | Type of multiplier used | CA [KGate]† | Non-CA [KGate]† | Total [KGate]† | CPD [ns] | Speed [MHz] | Total time [ns] | TPT†† [Mbps] | TPT/Area [Kbps/Gate] | Dynamic power††† [mW] | Energy†††† [mJ/Gbit]
P(x) = x^163 + x^7 + x^6 + x^3 + 1 (Without Controller)
LSB-first double, Figure 4.1(a) | POBL [31] | 2.00 | 2.45 | 4.45 | 0.41 | 2439 | 133.7 | 1219 | 274 | 7.76 | 6.36
MSB-first double, Figure 4.1(b) | POBL [31] | 1.88 | 2.45 | 4.33 | 0.36 | 2777 | 117.3 | 1389 | 321 | 7.68 | 5.53
LSD-first double, Figure 4.2 | PODL [44], w = 2 | 2.18 | 2.49 | 4.67 | 0.48 | 2083 | 79.7 | 2045 | 438 | 8.08 | 3.95
MSD-first double, Figure 4.3 | PODL [44], w = 2 | 2.05 | 2.49 | 4.54 | 0.46 | 2173 | 76.36 | 2133 | 470 | 8.02 | 3.76
Hybrid-double, Figure 4.5(a) | SOBL [30] | 2.75 | 3.08 | 5.83 | 0.87 | 1149 | 142.7 | 1142 | 196 | 9.408 | 8.23
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.3(a) | 2.95 | 3.12 | 6.07 | 0.61 | 1640 | 100.0 | 1630 | 269 | 10.73 | 6.585
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.5(a) | 1.92 | 2.57 | 4.49 | 0.7 | 1429 | 114.0 | 1420 | 316 | 6.51 | 4.585
P(x) = x^163 + x^7 + x^6 + x^3 + 1 (With Controller)
LSB-first double, Figure 4.1(a) | POBL [31] | 2.05 | 2.51 | 4.56 | 0.48 | 2083 | 156.5 | 1041 | 229 | 8.907 | 8.55
MSB-first double, Figure 4.1(b) | POBL [31] | 1.97 | 2.51 | 4.48 | 0.45 | 2174 | 150.0 | 1087 | 243 | 8.22 | 7.56
LSD-first double, Figure 4.2 | PODL [44], w = 2 | 2.24 | 2.55 | 4.79 | 0.54 | 1851 | 89.65 | 1817 | 379 | 9.31 | 5.12
MSD-first double, Figure 4.3 | PODL [44], w = 2 | 2.12 | 2.55 | 4.67 | 0.52 | 1923 | 86.3 | 1888 | 404 | 9.26 | 4.905
Hybrid-double, Figure 4.5(a) | SOBL [30] | 2.79 | 3.13 | 5.92 | 0.87 | 1149 | 142.7 | 1142 | 193 | 9.506 | 8.32
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.3(a) | 3.01 | 3.17 | 6.18 | 0.62 | 1613 | 101.7 | 1603 | 260 | 11.01 | 6.87
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.5(a) | 1.96 | 2.65 | 4.61 | 0.7 | 1429 | 114.0 | 1420 | 308 | 7.66 | 5.394
† KGate is the area equivalence in terms of number of NAND gates ×10^3 (estimated area of one NAND gate is 2.08 µm²).
†† TPT is the throughput and is equal to the number of bits per cycle times the speed.
††† The power consumption readings were conducted under 666 MHz frequency for all the designs.
†††† Obtained by dynamic power / throughput.
Table 4.2: Comparison Table for The ASIC Synthesis Results for The Proposed Double Multiplication
Architectures (Figure 4.5(a), 4.5(b)) for The Polynomial Basis Over $\mathbb{F}_{2^{233}}$ Using
65-nm CMOS Standard Technology.
Type of architecture | Type of multiplier used | CA [KGate]† | Non-CA [KGate]† | Total [KGate]† | CPD [ns] | Speed [MHz] | Total time [ns] | TPT†† [Mbps] | TPT/Area [Kbps/Gate] | Dynamic power††† [mW] | Energy†††† [mJ/Gbit]
P(x) = x^233 + x^74 + 1 (Without Controller)
LSB-first double, Figure 4.1(a) | POBL [31] | 2.84 | 3.5 | 6.34 | 0.42 | 2380 | 195.72 | 1190 | 188 | 11.15 | 9.37
MSB-first double, Figure 4.1(b) | POBL [31] | 2.66 | 3.5 | 6.16 | 0.35 | 2857 | 163.1 | 1428 | 232 | 10.99 | 7.7
LSD-first double, Figure 4.2 | PODL [44], w = 2 | 3.03 | 3.56 | 6.59 | 0.5 | 2000 | 118 | 1974 | 300 | 11.47 | 5.81
MSD-first double, Figure 4.3 | PODL [44], w = 2 | 2.86 | 3.56 | 6.42 | 0.45 | 2222 | 106 | 2193 | 341 | 11.31 | 5.157
Hybrid-double, Figure 4.5(a) | SOBL [30] | 4.14 | 4.64 | 8.78 | 0.8 | 1250 | 187.2 | 1245 | 142 | 14.11 | 11.34
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.3(a) | 4.32 | 4.70 | 9.02 | 0.6 | 1667 | 140.4 | 1660 | 184 | 15.79 | 9.51
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.4(a) | 3.97 | 4.15 | 8.12 | 0.57 | 1754 | 133.38 | 1747 | 215 | 13.92 | 7.97
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.5(a) | 3.22 | 4.04 | 7.26 | 0.68 | 1470 | 158.4 | 1464 | 202 | 10.71 | 7.32
P(x) = x^233 + x^74 + 1 (With Controller)
LSB-first double, Figure 4.1(a) | POBL [31] | 2.89 | 3.56 | 6.45 | 0.52 | 1923 | 242.32 | 961 | 149 | 12.76 | 13.27
MSB-first double, Figure 4.1(b) | POBL [31] | 2.73 | 3.56 | 6.29 | 0.45 | 2222 | 209.7 | 1111 | 177 | 11.70 | 10.53
LSD-first double, Figure 4.2 | PODL [44], w = 2 | 3.09 | 3.62 | 6.71 | 0.54 | 1852 | 127.4 | 1828 | 272.43 | 12.7 | 6.95
MSD-first double, Figure 4.3 | PODL [44], w = 2 | 2.93 | 3.62 | 6.55 | 0.53 | 1887 | 125.1 | 1863 | 284.43 | 12.55 | 6.74
Hybrid-double, Figure 4.5(a) | SOBL [30] | 4.19 | 4.69 | 8.88 | 0.79 | 1265 | 184.86 | 1260 | 142 | 14.26 | 11.31
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.3(a) | 4.36 | 4.75 | 9.11 | 0.61 | 1640 | 142.74 | 1632 | 179 | 15.64 | 9.58
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.4(a) | 4.02 | 4.20 | 8.22 | 0.57 | 1754 | 133.38 | 1747 | 213 | 14.15 | 8.1
Hybrid-double, Figure 4.5(b) | SOBL, Figure 3.5(a) | 3.29 | 4.12 | 7.41 | 0.68 | 1470 | 158.4 | 1464 | 178 | 11.19 | 7.64
† KGate is the area equivalence in terms of number of NAND gates ×10^3 (estimated area of one NAND gate is 2.08 µm²).
†† TPT is the throughput and is equal to the number of bits per cycle times the speed.
††† The power consumption readings were conducted under 666 MHz frequency for all the designs.
†††† Obtained by dynamic power / throughput.
4.5 Conclusions
We have extended the traditional POBL multiplier schemes to new POBL double multiplication
architectures, which perform two multiplications after 2m clock cycles. Then, we proposed
new hybrid-double multiplication architectures in PB over $\mathbb{F}_{2^m}$. These hybrid multiplier
structures perform two multiplications with a latency comparable to the latency of a single
multiplication, i.e., after m + 1 clock cycles. We have obtained the space and time complexities
of the presented multipliers and have compared them with their counterparts. For practical
purposes, all 6 schemes presented in this work have been implemented in ASIC technology
over both $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$, and the area, timing, power consumption, and energy results have
been presented.
5
New Regular Radix-8 Scheme for Elliptic
Curve Scalar Multiplication Without
Pre-computation
The recent advances in mobile technologies have increased the demand for high-performance
parallel computing schemes. In this chapter, we present a new algorithm
for evaluating elliptic curve scalar multiplication that can be used on any abelian
group. We show that the properties of the proposed algorithm enhance parallelism
at both the point arithmetic and the field arithmetic levels. Then, we employ this algorithm
in proposing a new hardware design for the implementation of an elliptic curve scalar
multiplication on a prime extended twisted Edwards curve incorporating 8 parallel operations. We
further show that, in comparison to the other simple side-channel attack protected schemes over
prime fields, the proposed design of the extended twisted Edwards curve is the fastest scalar
multiplication scheme reported in the literature.¹
5.1 Introduction
In 1976, Diffie and Hellman introduced the idea of public-key cryptography (PKC) [10]. PKC
is now widely used for key establishment, digital signatures, data encryption, and other
applications. Since then, several PK-based crypto-systems have been proposed; the security of these
systems is based on the difficulty of an underlying mathematical problem [185, 186]. Although today's
commonly used PK-based algorithms such as RSA [12] and ElGamal [13] are believed to be
secure, some of their implementations have been challenged by the quick factoring and integer
discrete logarithm attacks [105, 106, 1]. Elliptic curve cryptography (ECC) [18, 19], which can
provide the same level of security with a shorter key size, becomes more attractive in
applications with embedded microprocessors [107]. While ECC provides shorter key sizes, the
required computational complexity may still be excessive. In a properly designed digital
CMOS circuit, the switching activity accounts for more than 90% of the total power consumption
[53, 184]; therefore, various techniques have been proposed to reduce power consumption by
reducing switching activity. The power reduction can be achieved by reformulating a design
procedure, increasing the concurrency of the internal operations, and rearranging the design
topology from array-type to parallel-type architectures. By exploiting parallelization in a
low-power design, a system not only reduces the computation time but also minimizes the
switching activity and the energy expenditure [187].
1 The content of this chapter can be found in [99].
ECC algorithms belong to the class of group-based protocols, whose security is based on
the difficulty of the DLP over a finite group. Using additive notation, this problem can be
described as follows. Given points P and Q in the group, finding a number k such that Q = kP is
assumed to be infeasible in polynomial time [28]. The operation of computing the new point,
i.e., kP, is called the elliptic curve scalar (or point) multiplication (ECSM) operation, which is
the core building block in ECC [124]. ECSM computes the point kP by performing
multiple point additions, based on an s-bit scalar k, where $s = \lceil \log_2 k \rceil$, and a point P that is on an
elliptic curve. This operation is achieved by executing iterated point addition (ADD)
and point doubling (DBL) operations, which involve finite field (or modular) arithmetic operations
over either $\mathbb{F}_p$ or $\mathbb{F}_{2^m}$.
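The iteration of DBL and ADD operations can be made concrete with a minimal Python sketch of the left-to-right binary method. It is written for a generic abelian group supplied through two callables, and for illustration only it is instantiated with a toy group (the integers modulo 1009 under addition) rather than an elliptic curve; the names `double_and_add`, `add`, and `dbl` are hypothetical.

```python
def double_and_add(k, point, add, dbl):
    """Left-to-right binary method; returns (k*point, number of DBLs, number of ADDs)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]                 # most significant bit first
    acc, n_dbl, n_add = point, 0, 0   # the leading 1 bit initialises the accumulator
    for bit in bits[1:]:
        acc = dbl(acc)                # one DBL for every remaining bit
        n_dbl += 1
        if bit == "1":                # one ADD only when the bit is set
            acc = add(acc, point)
            n_add += 1
    return acc, n_dbl, n_add

# Toy abelian group standing in for the curve group: integers mod 1009 under addition.
N = 1009
add = lambda a, b: (a + b) % N
dbl = lambda a: (2 * a) % N

result, n_dbl, n_add = double_and_add(0b1011, 5, add, dbl)
print(result, n_dbl, n_add)   # 55 3 2
```

The number of ADDs depends on the bit pattern of k, which is the data-dependent behaviour that simple side-channel attacks exploit and that regular schemes aim to remove.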
There are three main approaches to computing the scalar multiplication efficiently. The first
approach is to utilize efficient point arithmetic formulas based on a combination
of the underlying finite field operations. Examples include implementing point halving instead
of the DBL operation over binary fields [79], point tripling over fields of characteristic three
[66, 68], and using composite operations, e.g., 2Q + P [80]. The second approach is to use
a representation of the scalar such that the number of point arithmetic operations is reduced.
The non-adjacent form (NAF) [71], radix-r NAF (r-NAF) [72], width-w NAF (w-NAF) [73, 64, 1],
and the Frobenius map [73, 74] are some techniques based on this approach. The third approach
is to use more hardware support, i.e., utilizing memory for pre-computation, and/or parallel
operations [37, 83, 84, 85, 86, 87, 88], and/or pipelining methods [81, 82]. In this work,
we combine the first two approaches with the parallel computation of the third approach to
yield a very efficient scalar multiplication scheme. The main contributions of this work can be
summarized as follows:
• We propose an approach to computing the ECSM operation that is based on processing
three bits of the scalar in the exact same sequence of five point arithmetic operations,
namely, 3 DBLs, and 2 ADDs for all eight different combinations of 3 bits without
using any dummy operations. The scalar k and the point P in the proposed method
are considered to be generic, and no memory lookup-table for precomputed points is
required.
• We analyse the security of our scheme and show that its security holds against both
Simple side-channel (or power analysis) attacks (SSCAs) [38, 39, 40], and safe-error (or
C-safe) fault attacks [41, 42, 43].
• Finally, we show how the properties of the proposed ECSM scheme yield an efficient
hardware design for the implementation of a single ECSM on a prime extended twisted
Edwards curve incorporating 8 parallel multiplication operations. We show that this
design is the fastest SSCA-protected scalar multiplication scheme over prime fields
reported in the literature, including the fast x-coordinate-only method of the Montgomery
ladder on Montgomery curves [114] for the parallel environment.
The organization of this chapter is as follows. In the next section, preliminaries related to the
SSCA-protected ECSM schemes are presented. In Section 3, the formula for a new radix-r
method for evaluating the scalar multiplication is introduced. The generalised radix-r
algorithm is then specialised to radix 8. Section 4 is the core of our work, in which a
novel ECSM scheme that offers resistance against both SSCA and safe-error fault attacks is
presented. Then, to illustrate the advantages of the proposed scheme, in Sections 5 and 6
we evaluate and analyse the efficiency of the proposed ECSM scheme and compare it to the
other well-known ECSM schemes at the elliptic curve group operations level and at the field
arithmetic level, respectively. Section 7 explains how a protected scalar multiplication using
the proposed scheme for the prime extended twisted Edwards model can be performed faster
than all the other parallel and SSCA-protected schemes reported in the literature. Finally, the
conclusion is summarized in Section 8.
5.2. Preliminaries 91
5.2 Preliminaries
The classical method for evaluating kP is the so-called Double-and-Add binary method [123].
On average, the computation complexity of the Double-and-Add binary method for an s-bit
scalar is s − 1 DBLs and (s − 1)/2 ADDs [91]. In order to lower the number of ADDs, the
scalar k is converted to a signed representation. Let each bit of k be denoted by ki, for
0 ≤ i ≤ s − 1. Then ki in signed representation becomes ki ∈ { −1, 0, 1 }. The signed
representation revises the Double-and-Add binary method to a new method called the signed
binary (or addition-subtraction) method [69, 71, 123]. Among the different signed
representation methods, the Non-adjacent form (NAF) [124, 71, 72] and the Mutual opposite
form (MOF) [70] are the most popular. The computation of ECSM in the signed binary methods
is more efficient than in the Double-and-Add binary method. Representing the scalar k as NAF
or MOF would save an average of 1/6 of ADDs in the computation of kP [91, 1]. The total run
time of the ADDs in both the Double-and-Add binary method and the signed binary methods
depends on the Hamming weight of the scalar k. Hence, an adversary observing the run time
could determine the Hamming weight of the secret k.
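The dependence of the ADD count on the Hamming weight can be seen in a small sketch. It is only an illustration under stated assumptions: the curve group is replaced by the integers under addition (so P = 1 and kP = k), and the helper names `double_and_add` and `naf` are hypothetical, not taken from this work.

```python
def double_and_add(k, P=1):
    """Left-to-right Double-and-Add on a stand-in group (integers under addition).

    Returns (kP, number of DBLs, number of ADDs).
    """
    bits = bin(k)[2:]
    Q, dbl, add = P, 0, 0              # the leading 1 bit just loads P
    for b in bits[1:]:
        Q, dbl = 2 * Q, dbl + 1        # DBL for every remaining bit
        if b == "1":
            Q, add = Q + P, add + 1    # ADD only when the bit is 1
    return Q, dbl, add


def naf(k):
    """Non-adjacent form of k, least significant digit first, digits in {-1, 0, 1}."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)            # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


k = 0b110111101111                     # arbitrary example scalar
result, dbls, adds = double_and_add(k)
naf_digits = naf(k)
print(result == k, dbls, adds)         # ADD count equals Hamming weight - 1
print(sum(d != 0 for d in naf_digits)) # number of nonzero NAF digits
```

In this sketch the number of DBLs depends only on the bit length, while the number of ADDs tracks the Hamming weight, which is exactly the quantity a timing observer could recover.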
From a mathematical point of view, ECC is regarded as being secure. However, real-world
hardware implementations of ECC protocols may leak information through channels that the
crypto-algorithms themselves do not address, e.g., the elapsed time or the power consumption
of the VLSI implementation of the crypto-algorithm. Thus, an unsecured implementation can
lead to the exposure of the secret key to attack techniques that analyse such information.
Kocher in [38] reviewed these kinds of attacks and referred to them as side-channel attacks
(SCAs). Of all the types of SCAs, SSCAs are the most common. In ECC crypto-systems, SSCA can
reveal large features of the algorithm, such as identifying the DBL and the ADD operations
being executed in the iterations of the loop [40]. Thus, ECSM should be implemented using a
specific sequence of point arithmetic operations that does not depend on the value of a
particular scalar bit.
5.2.1 Notations
In this work, we refer to the elliptic curve group (arithmetic point) operations as EC-operations.
Also, ADD and DBL stand for the EC-operations of addition and doubling, respectively.
Similarly, the EC-operation of subtraction is denoted by SUB in this chapter. The ADDDBL
operation stands for considering both the ADD and the DBL operations as a single composite
operation. In addition, mADD and uADD stand for the cost of mixed addition and unified
addition, respectively. The costs of field arithmetic operations are denoted by boldfaced
capital letters; hence, I, M, S, A, and D stand for the computing costs of field inversion,
multiplication, squaring, addition, and field multiplication by a curve constant, respectively.
5.2.2 The SSCA-Protected ECSMs
When the ADD and the DBL operations are different, the only way to make an ECSM
algorithm SSCA-aware is to use a regular-structure scalar multiplication scheme, which
evaluates the point arithmetic operations in a uniform sequence. The author in [40] has masked
the dependency between the scalar bit and the evaluated point arithmetic operation by inserting
a dummy operation. However, it is noted in [188, 135] that it may be easy for adversaries
to determine which point arithmetic ADDs are the dummy operations. A method proposed by
Möller in [63] performs the scalar multiplication with a fixed pattern of point arithmetic DBLs
and ADDs; Okeya et al. in [64] have also proposed a similar window-based method. The
Montgomery Ladder binary method [114, 189, 190, 191] is especially suitable for hardware
implementation because of the data independency of its underlying point arithmetic
operations and its resistance to SSCA. Figure 5.1 shows how Montgomery’s scalar multiplication
method operates at the point arithmetic level. It can be seen that although there is a conditional
statement at the beginning of each stage, which is represented by multiplexers, Montgomery’s
method is still considered to be a highly regular method, as both the ADD and the DBL
operations are evaluated together at each iteration of the main loop. Joye in [192] has
also developed a similar binary scalar multiplication method that eliminates power analysis
information.
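The regularity of the Montgomery Ladder described above can be illustrated with a short sketch. It assumes a stand-in group (the integers under addition, with P = 1) instead of a real curve, and the function name and the operation trace are introduced here purely for illustration.

```python
def montgomery_ladder(k, P=1):
    """Montgomery Ladder on a stand-in group (integers under addition), for k >= 1.

    Returns (kP, trace), where trace records the EC-operations in execution order.
    """
    trace = []

    def add(a, b):
        trace.append("ADD")
        return a + b

    def dbl(a):
        trace.append("DBL")
        return 2 * a

    P1, P2 = P, dbl(P)                       # invariant: P2 - P1 = P
    for b in bin(k)[3:]:                     # skip the '0b' prefix and the leading 1 bit
        if b == "1":
            P1, P2 = add(P1, P2), dbl(P2)    # one ADD and one DBL
        else:
            P2, P1 = add(P1, P2), dbl(P1)    # same operations, operands transposed
    return P1, trace


result, trace = montgomery_ladder(0b101101)
print(result, trace)
```

Whatever the bit value, each iteration records one ADD followed by one DBL; only the operands change, which is the property the dependency graph of Figure 5.1 expresses.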
In this work, we present a new regular ECSM scheme. We show that we save 1/3 of the
computation of the ADD operations as compared to the regular binary schemes presented in
[189, 190, 191, 192]. We also show that the proposed scheme requires at least 40% fewer memory
registers than the secured window-based schemes shown in [63, 64]. Further, if the
computational time complexity of 2 ADDs is less than the computational time complexity of
2 DBLs + mADD, the speed of the proposed scheme outperforms those of secured window-based
schemes.
5.3. Proposed Radix-8 Scalar Multiplication Algorithm 93
[Figure 5.1 graphic: Stage 1 through Stage s−1, each consisting of one DBL block and one ADD block operating on the points P1 and P2, with the operands selected by the scalar bits k_{s−2}, ..., k_0; the last stage produces Output 1 and Output 2.]
Figure 5.1: EC-Operations Dependency Graph for The Montgomery Ladder ECSM Method
[189, 190, 191], Which Shows That a Fixed Sequence of Both The ADD and The DBL Blocks
Are Performed for Any Value of The ki Bit, i.e., Only The Operands Are Transposed.
5.3 Proposed Radix-8 Scalar Multiplication Algorithm
In this section, we present a method for evaluating the scalar multiplication in radix-r.
We then explain how the scalar k in radix-8 can be recoded to a signed representation with
digits in the range [−1, 6] so that the scheme we propose in the next section can thwart SSCAs.
5.3.1 High-Radix Scalar Expansion
It is assumed hereafter that the basis r has been chosen to be a power of 2, i.e., $r = 2^w$,
where $2 \le w \le s - 1$. Hence, the computation of rP requires only repeated DBLs. Let the
scalar k (of length s bits) be partitioned into l digits, i.e., $l = \lceil s/w \rceil$, and let
each digit of k be denoted as $k'_i$ for $0 \le i \le l - 1$. The scalar k with radix-r expansion
$(k'_{l-1}, \dots, k'_1, k'_0)_r$, where $k'_i \in \{0, 1, \dots, r-1\}$ for every $i \le l - 1$,
can be presented as
\[ k = \sum_{i=0}^{l-1} k'_i r^i, \quad k'_i \in \{0, 1, \dots, r-1\}. \tag{5.1} \]
Scalar multiplication kP can then be computed as
\[ kP = \sum_{i=0}^{l-1} (k'_i r^i) P. \tag{5.2} \]
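As a concrete instance of (5.1), the following sketch extracts the radix-$2^w$ digits of a scalar. The function name `radix_digits` is an assumption made for this illustration; the scalar 6644 is the one used later in the chapter's worked example.

```python
def radix_digits(k, w, s):
    """Digits k'_0 .. k'_{l-1} of an s-bit scalar k in radix r = 2**w (least significant first)."""
    l = -(-s // w)                       # l = ceil(s / w)
    mask = (1 << w) - 1
    return [(k >> (w * i)) & mask for i in range(l)]


k = 6644
digits = radix_digits(k, w=3, s=k.bit_length())
print(digits)                            # [4, 6, 7, 4, 1], i.e. (14764) in octal
print(sum(d * 8**i for i, d in enumerate(digits)) == k)
```

Read from the most significant end, the digit list is the octal expansion of k, and re-summing the digits with their weights recovers k as in (5.1).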
In the following, we let E(Fq) be an abelian group with an identity element O, and we let P ∈
E(Fq) be an input point element. Notice that our goal is to compute the scalar multiplication
point kP that is also a point in E(Fq), i.e., kP ∈ E(Fq). Let $P_{kP}$ and $P_1$ be two points on
the curve, which are initialized by O and P, respectively. We define the point
\[ P_{kP}^{(j)} = \sum_{i=0}^{j} (k'_i r^i) P, \tag{5.3} \]
for any 0 < j < l.
Comparing (5.2) and (5.3), one can see that $kP = P_{kP}^{(l-1)}$. By removing the upper j-th
term from the summation of (5.3), we get
\[ P_{kP}^{(j)} = k'_j r^j P + P_{kP}^{(j-1)}. \tag{5.4} \]
Let
\[ P_{Acc}^{(j)} = r^j P \tag{5.5} \]
be another point on the curve that is initialized to P, i.e., $P_{Acc}^{(0)} = P$. Substituting
this in (5.4), one can obtain $P_{kP}^{(j)}$ as
\[ P_{kP}^{(j)} = k'_j P_{Acc}^{(j)} + P_{kP}^{(j-1)}. \tag{5.6} \]
Now, we define another recursive point on the curve
\[ P_1^{(j)} = r^{j+1} P - P_{kP}^{(j)}. \tag{5.7} \]
In order to ensure the computation regularity for each specific input $k'_j$, the two recursive
points $P_{kP}^{(j)}$ and $P_1^{(j)}$ have to be properly obtained by performing either the ADD
or the SUB operations as presented below.
Lemma 5.3.1 Consider j to be in the range [1, l − 1], and $0 \le k'_j \le r - 1$; then
$P_{kP}^{(j)}$ can be defined in one of the following two ways:
\[ P_{kP}^{(j)} = \begin{cases} P_{kP}^{(j-1)} + k'_j P_{Acc}^{(j)}, \\ r P_{Acc}^{(j)} - P_1^{(j)}, \end{cases} \tag{5.8} \]
and $P_1^{(j)}$ can be obtained as follows
\[ P_1^{(j)} = \begin{cases} r P_{Acc}^{(j)} - P_{kP}^{(j)}, \\ (r - 1 - k'_j) P_{Acc}^{(j)} + P_1^{(j-1)}, \end{cases} \tag{5.9} \]
where $P_{Acc}^{(j)} = r^j P = r P_{Acc}^{(j-1)}$.
Proof Using (5.6) and (5.7), one can easily obtain (5.8). Changing j to j − 1 and re-arranging
the terms in (5.7), one can obtain $r^j P$ as
\[ r^j P = P_1^{(j-1)} + P_{kP}^{(j-1)}. \tag{5.10} \]
Substituting $r^j P$ from (5.10) into (5.4), one can obtain $P_{kP}^{(j)}$ as
\[ P_{kP}^{(j)} = k'_j P_1^{(j-1)} + (k'_j + 1) P_{kP}^{(j-1)}. \tag{5.11} \]
Substituting $P_{kP}^{(j)}$ from (5.11) into (5.7), one can obtain
\[ P_1^{(j)} = r^{j+1} P - \left( k'_j P_1^{(j-1)} + (k'_j + 1) P_{kP}^{(j-1)} \right). \tag{5.12} \]
Substituting $P_{kP}^{(j-1)}$ from (5.10) into (5.12) and using (5.5), $P_1^{(j)}$ can be further
obtained as
\[ P_1^{(j)} = \left( r - (k'_j + 1) \right) P_{Acc}^{(j)} + P_1^{(j-1)}. \]
The proof is complete.
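The two forms in (5.8) and (5.9) can also be checked numerically. The sketch below is an illustration only: it models P as the integer 1 and O as 0, uses the initial values $P_{kP} = O$ and $P_1 = P$ stated earlier, and the function name `check_lemma` is hypothetical.

```python
def check_lemma(k_digits, r, P=1):
    """Check that both forms in (5.8) and (5.9) agree for every digit position.

    k_digits: radix-r digits of k, least significant first, each in [0, r - 1].
    Returns the final value of P_kP, which should equal kP.
    """
    P_kP, P_1 = 0, P                             # initial values O and P
    for j, d in enumerate(k_digits):
        P_acc = r**j * P                         # (5.5)
        new_kP = P_kP + d * P_acc                # first form of (5.8), i.e. (5.6)
        new_1 = (r - 1 - d) * P_acc + P_1        # second form of (5.9)
        assert new_1 == r * P_acc - new_kP       # first form of (5.9), i.e. (5.7)
        assert new_kP == r * P_acc - new_1       # second form of (5.8)
        P_kP, P_1 = new_kP, new_1
    return P_kP


print(check_lemma([4, 6, 7, 4, 1], r=8))         # digits of 6644 in radix 8
```

Either point can therefore be updated with a single ADD from its previous value, and the other recovered with a single SUB from $r P_{Acc}^{(j)}$, which is what the later algorithm exploits.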
5.3.2 Recoding the Scalar k Into Signed Radix-8
In order to ensure that our scheme is entirely regular, we need to skip the digit $k'_j$ that is
equal to 7 and replace it with −1, with an increment to the next digit as $k'_{j+1} + 1$. Möller
in [63] has described a recoding algorithm for m-ary exponentiation where each digit that is
equal to zero is replaced with −m, and the next most significant digit is incremented by one.
In [193], the scalar digits are recoded in the set { 1, · · · , m }, where each zero digit is
replaced with m and the next digit is decremented by one. In our case, we replace the $k'_j$
value that is equal to digit 7 with (7 − 8 = −1). This representation was discussed by Parhami
in [151]. He used this representation in multiplication schemes that can handle more than one
bit of the multiplier in each cycle. Intuitively, the recoding algorithm replaces each digit 7
by −1 and increments the next more significant digit to adjust the value.

Algorithm 12 Proposed Non-Seven Encoding Method
Input : A (t − 1)-digit radix-8 representation of the scalar k,
        k = (k'_{t−2}, · · · , k'_1, k'_0)_8, k'_j ∈ { 0, 1, · · · , 7 }.
Output : k = (k_{t−1}, · · · , k_1, k_0)_8, k_j ∈ { −1, 0, 1, · · · , 6 }.
Initialize : k = (0, k'_{t−2}, · · · , k'_1, k'_0)_8 ;
Step 1 : For j = 0 to t − 1 do
Step 1.1 :   If k'_j ∈ { 7, 8 } Then
Step 1.1.1 :     k_j = k'_j − 8, k'_{j+1} = k'_{j+1} + 1 ;
Step 1.2 :   Else Leave the digit as it is, i.e., k_j = k'_j
Step 2 : End For
Step 3 : Return k = (k_{t−1}, · · · , k_1, k_0)_8 ;

Let the scalar k of length s bits be given in the radix-8 digit representation, where $k'_j$ is
in the range [0, 7]. Algorithm 12 shows the steps to convert (5.1) for radix-8, i.e., r = 8, to
the following non-seven representation
\[ k = \sum_{j=0}^{t-1} k_j 8^j, \quad k_j \in \{-1, 0, 1, \dots, 6\}. \]
In the next subsection, we define a new radix-8 ECSM algorithm for a t-digit k, where
$t = \lceil \log_8 k \rceil + 1$ and $k_j \in [-1, 6]$, which, as will be shown in Section 5.4,
yields a regular ECSM scheme.
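The non-seven recoding can be sketched in a few lines. This is an illustrative variant rather than a transcription of Algorithm 12: it works directly on the integer k (so the carry into the next digit, including the case where a digit becomes 8, is handled by integer arithmetic), and the function name and the optional padding length `t` are assumptions made here.

```python
def non_seven_recode(k, t=None):
    """Recode a non-negative integer into radix-8 digits in {-1, 0, ..., 6}.

    Returns the digits least significant first; if t is given, the list is
    padded with leading (most significant) zeros up to t digits.
    """
    digits = []
    while k > 0:
        d = k & 7                      # current octal digit, including any carry
        if d == 7:
            d = -1                     # 7 = 8 - 1: emit -1, carry 1 into the next digit
        digits.append(d)
        k = (k - d) >> 3
    if t is not None:
        digits += [0] * (t - len(digits))
    return digits


print(non_seven_recode(6644, t=6))     # [4, 6, -1, 5, 1, 0]
```

Because digits are produced from the least significant end, a recoder of this kind can feed a right-to-left scalar multiplication one digit at a time.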
5.3.3 Proposed Radix-8 Algorithm for Scalar Multiplication
We perform the scalar multiplication with a new right-to-left radix-8 algorithm using the
non-seven representation of k that is discussed in Subsection 5.3.2 and obtained in Algorithm 12.
We notice that the evaluation of the scalar multiplication in the proposed radix-8 algorithm is
performed utilizing three EC-points, i.e., $P_{kP}$, $P_1$, and $P_{Acc}$, without pre-computation.

Algorithm 13 Proposed Signed Radix-8 Scalar Multiplication
Input : Point P ∈ E(Fq), a t-digit integer k, i.e.,
        k = (k_{t−1}, k_{t−2}, · · · , k_0)_8, k_j ∈ { −1, 0, 1, · · · , 6 }.
Output : Point Q = kP.
Initialize : PkP ← O, P1 ← P, PAcc ← P ;
Step 1 : For j = 0 to t − 1 do
Step 1.1 :   If k_j ∈ { −1, 0, 1, 2, 4 } Then
Step 1.1.1 :     PkP ← PkP + k_j PAcc ;
Step 1.1.2 :     PAcc ← 8PAcc ; /* Prepare P^(j+1)_Acc */
Step 1.1.3 :     P1 ← PAcc − PkP ;
Step 1.2 :   Else If k_j ∈ { 3, 5, 6 } Then
Step 1.2.1 :     P1 ← P1 + (7 − k_j)PAcc ;
Step 1.2.2 :     PAcc ← 8PAcc ; /* Prepare P^(j+1)_Acc */
Step 1.2.3 :     PkP ← PAcc − P1 ;
Step 2 : End For
Step 3 : Return (PkP) ;

One can extend Lemma 5.3.1 so that one can compute $P_{kP}^{(j)}$ for any j > 0 and
$-1 \le k_j \le 6$, as follows
\[ P_{kP}^{(j)} = \begin{cases} P_{kP}^{(j-1)} + k_j P_{Acc}^{(j)}, & \text{if } k_j \in \{-1, 0, 1, 2, 4\}, \\ 8 P_{Acc}^{(j)} - P_1^{(j)}, & \text{if } k_j \in \{3, 5, 6\}. \end{cases} \tag{5.13} \]
Similarly, from the extension of Lemma 5.3.1, one can compute $P_1^{(j)}$ for any j > 0 and
$-1 \le k_j \le 6$, as follows
\[ P_1^{(j)} = \begin{cases} 8 P_{Acc}^{(j)} - P_{kP}^{(j)}, & \text{if } k_j \in \{-1, 0, 1, 2, 4\}, \\ (7 - k_j) P_{Acc}^{(j)} + P_1^{(j-1)}, & \text{if } k_j \in \{3, 5, 6\}. \end{cases} \tag{5.14} \]
Note that the reason we have split the eight possible values of $k_j$ in (5.13) into two
cases is so that every $k_j$ in the first group has a Hamming weight of at most one. Similarly,
the reason we have split the eight possible values of $k_j$ in (5.14) into two cases is
so that $7 - k_j$ has a Hamming weight of at most one for every $k_j$ in the second group. Based
on (5.13) and (5.14), we propose Algorithm 13, in which the scalar k is obtained from the output
of Algorithm 12. In Algorithm 13, it is shown that 8PAcc is computed in each iteration, and
the result of its computation is stored in a register known as PAcc (see Steps 1.1.2 and 1.2.2).
Hence, the value of the point $P_{Acc}^{(j+1)} = 8 P_{Acc}^{(j)}$ is evaluated in advance at the
end of iteration j. The evaluation of kP involves a total of t computational iterations. At each
iteration, the sum of the two points PkP and P1 is always equal to the value of the point PAcc.
The final result of kP is obtained at the last iteration, as the content of the register PkP at
iteration t − 1. It is noteworthy that both Algorithms 12 and 13 are evaluated from right to
left; hence, they can be interleaved, resulting in a significant memory register reduction,
because this eliminates the need to store both the scalar and its recoding.

Table 5.1: An Example That Shows The Computation for kP = 6644P Using The Proposed
Signed Radix-8 Scalar Multiplication.

                                    (Iteration No.), kj value
kj group            Initialization  (0), k0 = 4     (1), k1 = 6     (2), k2 = −1    (3), k3 = 5     (4), k4 = 1     (5), k5 = 0
kj ∈ {−1,0,1,2,4}   PkP ← O         PkP ← 4P                        PkP ← −12P                      PkP ← 6644P     PkP ← 6644P
                    PAcc ← P        PAcc ← 8P                       PAcc ← 512P                     PAcc ← 32768P   PAcc ← 262144P
                    P1 ← P          P1 ← 4P                         P1 ← 524P                       P1 ← 26124P     P1 ← 255500P
kj ∈ {3,5,6}                                        PkP ← 52P                       PkP ← 2548P
                                                    PAcc ← 64P                      PAcc ← 4096P
                                                    P1 ← 12P                        P1 ← 1548P
We illustrate Algorithm 13 by showing an example of computing kP. Suppose that k = 6644,
which has the octal representation (14764)_8; this can be further represented as
(0 1 5 1̄ 6 4)_8, where 1̄ = −1, using the non-seven recoding method that is shown in
Algorithm 12. Table 5.1 illustrates the process of computing kP by exploiting the proposed
signed radix-8 scalar multiplication that is shown in Algorithm 13.
As shown in Table 5.1, the three registers PkP, P1, and PAcc are initialized to O, P, and P,
respectively (see the Initialize step in Algorithm 13). The loop started in Step 1 is executed
t times, that is, t = ⌈log_8 6644⌉ + 1 = 6 in this example. As shown in Step 1 in Algorithm
13, the for loop iteration starts from the least significant octal digit of k. This is shown in
the third column of Table 5.1. If the octal digit k_j in a column satisfies k_j ∈ { −1, 0, 1, 2, 4 },
then the operations in Steps 1.1.1 to 1.1.3 are sequentially computed. On the other hand, if
k_j ∈ { 3, 5, 6 }, then the operations in Steps 1.2.1 to 1.2.3 in Algorithm 13 are sequentially
computed. Eventually, the content of the PkP register at iteration t − 1 = 5 (the initial
iteration is 0) contains the desired computation of kP, i.e., in the rightmost column in
Table 5.1. It is clear from the presented example that the total number of computational
cycles required is 6.
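The worked example can be replayed with a direct transcription of Algorithm 13. The sketch below is only an illustration: the curve group is replaced by the integers under addition (P = 1, O = 0), and the function name is chosen here for convenience.

```python
def signed_radix8_mul(digits, P=1):
    """Algorithm 13 with the curve group replaced by the integers (P = 1, O = 0).

    digits: non-seven radix-8 digits of k in {-1, ..., 6}, least significant first.
    """
    P_kP, P_1, P_acc = 0, P, P                   # Initialize step
    for d in digits:
        if d in (-1, 0, 1, 2, 4):
            P_kP = P_kP + d * P_acc              # Step 1.1.1
            P_acc = 8 * P_acc                    # Step 1.1.2
            P_1 = P_acc - P_kP                   # Step 1.1.3
        elif d in (3, 5, 6):
            P_1 = P_1 + (7 - d) * P_acc          # Step 1.2.1
            P_acc = 8 * P_acc                    # Step 1.2.2
            P_kP = P_acc - P_1                   # Step 1.2.3
        else:
            raise ValueError("digit outside {-1, ..., 6}")
        assert P_kP + P_1 == P_acc               # invariant noted in the text
    return P_kP


print(signed_radix8_mul([4, 6, -1, 5, 1, 0]))    # 6644, digits taken from the example
```

The intermediate register values of this run match the columns of Table 5.1, and the assertion inside the loop checks that PkP and P1 always sum to PAcc.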
5.4. Proposed Regular ECSM Scheme 99
Table 5.2: The 4 Stages That The Proposed Algorithm 13 Evaluates for Each Value of kj.

kj   Stage 1          Stage 2           Stage 3                               Stage 4
−1   PTemp ← 2PAcc.   PTemp ← 2PTemp.   PkP ← PkP − PAcc,† PAcc ← 2PTemp.     P1 ← PAcc − PkP.
 0   PTemp ← 2PAcc.   PTemp ← 2PTemp.   PkP ← PAcc − P1,† PAcc ← 2PTemp.      P1 ← PAcc − PkP.
 1   PTemp ← 2PAcc.   PTemp ← 2PTemp.   PkP ← PAcc + PkP, PAcc ← 2PTemp.      P1 ← PAcc − PkP.
 2   PAcc ← 2PAcc.    PTemp ← 2PAcc.    PkP ← PAcc + PkP, PAcc ← 2PTemp.      P1 ← PAcc − PkP.
 3   PAcc ← 2PAcc.    PAcc ← 2PAcc.     P1 ← PAcc + P1, PAcc ← 2PAcc.         PkP ← PAcc − P1.
 4   PAcc ← 2PAcc.    PAcc ← 2PAcc.     PkP ← PAcc + PkP, PAcc ← 2PAcc.       P1 ← PAcc − PkP.
 5   PAcc ← 2PAcc.    PTemp ← 2PAcc.    P1 ← PAcc + P1, PAcc ← 2PTemp.        PkP ← PAcc − P1.
 6   PTemp ← 2PAcc.   PTemp ← 2PTemp.   P1 ← PAcc + P1, PAcc ← 2PTemp.        PkP ← PAcc − P1.

† The SUB operation can be easily obtained using the ADD operation.
5.4 Proposed Regular ECSM Scheme
In this section, we present a uniform addition chain scheme that is resistant to SSCA and
safe-error fault attacks. The proposed radix-8 ECSM shown in Algorithm 13 is revised to behave
in a highly regular manner, so that for any kj digit, the computational cycle of the addition
chain loop is evaluated using the same sequence of EC-operations.
5.4.1 The Four-Stage Levels
In the following, it is assumed that a temporary register PTemp is provided as part of the
processor. It is also assumed that the EC-operations ADD and SUB are indistinguishable under
SSCA attacks [194, 195, 196, 197]. The latter assumption can be justified as follows. The
negation operation in GF(p), i.e., the mapping x → −x, can be carried out by one non-modular
subtraction (which has about half the cost of a modular addition/subtraction). Considering the
extended twisted Edwards curve as an example, one can see from [37] that the cost of ADD
= 8M + 10A. Based on the experimental ratio of the cost of a modular addition to that of a
modular multiplication, i.e., A/M, on smart cards as provided in [146], the average ratio
is A/M ≈ 0.2. Then, one can obtain the cost of ADD in terms of A as ADD ≈ 50A. The cost of
SUB for this curve is equal to the cost of ADD plus the cost of one modular negation operation,
i.e., SUB ≈ 50.5A. We conclude that the ratio of the cost of the point ADD to the cost of the
point SUB becomes ADD/SUB ≈ 0.99.
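The arithmetic behind this ratio is short enough to spell out. The sketch below simply restates the figures quoted above in units of one modular addition; the variable names are chosen here for illustration.

```python
A = 1.0                  # cost of one modular addition, used as the unit
M = 5 * A                # modular multiplication, from the quoted ratio A/M ≈ 0.2
NEG = 0.5 * A            # negation: about half a modular addition, as assumed above

ADD = 8 * M + 10 * A     # extended twisted Edwards point addition, 8M + 10A
SUB = ADD + NEG          # point subtraction = point addition plus one negation

print(ADD, SUB, round(ADD / SUB, 2))   # 50.0 50.5 0.99
```

Under these assumptions the two operations differ by about one percent in cost, which is the basis for treating them as indistinguishable.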
Proposition 5.4.1 For any value of kj, Algorithm 13 would be evaluated in 4 stages as
Stage 1 : DBL.
Stage 2 : DBL.
Stage 3 : DBL, ADD/SUB.
Stage 4 : SUB.
(5.15)
[Figure 5.2 graphic: Stages 1 to 4 with three DBL blocks, one ADD block, and one SUB block operating on the registers PAcc, PkP, and P1, with a selector (Sel) producing Output 1 and Output 2. The ADD operation at Stage 3 computes P1 ← PAcc + P1 when kj = 3, and PkP ← PAcc + PkP when kj = 4.]
Figure 5.2: EC-Operation Dependency Graph That Shows The Usage of Both The ADD and
The DBL Blocks When k j = 3 or k j = 4.
Proof Table 5.2 provides the evaluation sequence for each value of kj separately. Also,
Figures 5.2, 5.3, and 5.4 show how Algorithm 13 is evaluated at the EC-operations
level for each case of kj. We provide here a detailed analysis of two main cases, i.e.,
kj = −1 and kj = 0. Given that kj = −1, the operations in Step 1.1 in Algorithm 13 are
processed. In Step 1.1.1, the evaluation of PkP requires processing PkP − PAcc; hence, the SUB
operation, which is very similar to the ADD operation, is processed. So by shifting the
evaluation of this operation, i.e., PkP = PkP − PAcc, to Stage 3 (see Figure 5.4), the three
Steps 1.1.1 to 1.1.3 are evaluated in 4 stages as follows:
Stage 1 : PTemp ← 2PAcc.
Stage 2 : PTemp ← 2PTemp.
Stage 3 : PkP ← PkP − PAcc, PAcc ← 2PTemp.
Stage 4 : P1 ← PAcc − PkP.
Given that k j = 0, the operations in Step 1.1 in Algorithm 13 are processed. In Step 1.1.1,
the evaluation of PkP requires no processing. However, in order to keep the scheme consistent
[Figure 5.3 graphic: Stages 1 to 4 with three DBL blocks, one ADD block, and one SUB block operating on the registers PTemp, PAcc, PkP, and P1, with a selector (Sel) producing Output 1 and Output 2. The ADD operation at Stage 3 computes PkP ← PAcc + PkP when kj = 2, and P1 ← PAcc + P1 when kj = 5.]
Figure 5.3: EC-Operation Dependency Graph That Shows The Usage of Both The ADD and
The DBL Blocks When k j = 2 or k j = 5.
with the other cases, i.e., highly regular, we re-evaluate PkP by performing the operation
PkP = PAcc − P1, because the sum of the two points PkP and P1 is always preserved and
is equal to the value of the point PAcc. Notice that this operation affects the evaluation of kP,
and, hence, it cannot be considered to be a dummy operation. Then the three Steps 1.1.1 to
1.1.3 are evaluated in 4 stages as follows:
Stage 1 : PTemp ← 2PAcc.
Stage 2 : PTemp ← 2PTemp.
Stage 3 : PkP ← PAcc − P1, PAcc ← 2PTemp.
Stage 4 : P1 ← PAcc − PkP.
Figure 5.5 shows the EC-operation dependency for all eight of the different combinations
of kj. An intriguing feature of this scheme is that for all cases of kj, the same steps are
performed, i.e., only the operands are transposed. This means that the cost per 3 bits is fixed
at 3 DBLs + 2 ADDs. It is worth mentioning here that in order to evaluate Steps 1.1.2 or 1.2.2
of Algorithm 13, 3 repeated DBL operations are necessary. Also, in (5.15), at Stage 3 both the
ADD/SUB and the DBL operations are evaluated in parallel (see Stage 3 in Figures 5.2 to 5.5).

[Figure 5.4 graphic: Stages 1 to 4 with three DBL blocks, one ADD/SUB block controlled by a CTRL input, and one SUB block operating on the registers PTemp, PAcc, PkP, and P1, with a selector (Sel) producing Output 1 and Output 2. The CTRL input switches between ADD and SUB operations. The ADD/SUB operation at Stage 3 computes PkP ← PkP − PAcc when kj = −1, PkP ← PAcc − P1 when kj = 0, PkP ← PAcc + PkP when kj = 1, and P1 ← PAcc + P1 when kj = 6.]
Figure 5.4: EC-Operation Dependency Graph That Shows The Usage of Both The ADD and
The DBL Blocks When kj = −1, 0, 1, or 6. Notice That The SUB Operation is Used at Stage
3 for Both Cases kj = −1 and kj = 0.

[Figure 5.5 graphic: Stages 1 to 4 with three DBL blocks, one ADD/SUB block controlled by a CTRL input, and one SUB block operating on the registers PTemp and PAcc, with selectors Sel 1 and Sel 2 producing Output 1 and Output 2. Note in the figure: there is no operation dependency between the SUB operation at Stage 4 and the DBL operation at Stage 1.]
Figure 5.5: EC-Operation Dependency Graph That Shows The Usage of Both The ADD and
The DBL Blocks for All Cases of kj, i.e., kj ∈ { −1, 0, 1, · · · , 6 }.
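The four-stage schedule of Table 5.2 can be exercised with a small simulation. This is a sketch under stated assumptions: the curve group is modelled by the integers under addition, ADD and SUB are logged as one operation type (in line with the indistinguishability assumption above), the parallel operations of Stage 3 are logged one after the other, and all function names are hypothetical.

```python
def four_stage_iteration(d, P_kP, P_1, P_acc, trace):
    """One loop iteration following the four-stage schedule of Table 5.2 (integer model)."""
    def dbl(x):
        trace.append("DBL")
        return 2 * x

    def addsub(x, y, sign=+1):
        trace.append("ADD/SUB")
        return x + sign * y

    x1 = P_acc
    x2 = dbl(x1)                                     # Stage 1
    x4 = dbl(x2)                                     # Stage 2
    x8 = dbl(x4)                                     # Stage 3, DBL part
    if d in (-1, 0, 1, 2, 4):                        # Stage 3, ADD/SUB part updates P_kP
        if d == -1:
            P_kP = addsub(P_kP, x1, -1)
        elif d == 0:
            P_kP = addsub(x1, P_1, -1)               # re-derives P_kP, so it is not a dummy
        else:
            P_kP = addsub({1: x1, 2: x2, 4: x4}[d], P_kP)
        P_1 = addsub(x8, P_kP, -1)                   # Stage 4
    else:                                            # d in (3, 5, 6): Stage 3 updates P_1
        P_1 = addsub({6: x1, 5: x2, 3: x4}[d], P_1)
        P_kP = addsub(x8, P_1, -1)                   # Stage 4
    return P_kP, P_1, x8


def ecsm_four_stage(digits, P=1):
    """Run the schedule over all digits (least significant first); returns (kP, trace)."""
    P_kP, P_1, P_acc, trace = 0, P, P, []
    for d in digits:
        P_kP, P_1, P_acc = four_stage_iteration(d, P_kP, P_1, P_acc, trace)
    return P_kP, trace


result, trace = ecsm_four_stage([4, 6, -1, 5, 1, 0])  # digits of 6644
print(result, trace[:5])
```

For every digit value the recorded sequence is three DBLs followed by two ADD/SUB operations; only the operands passed to the blocks differ.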
5.4.2 The Three-Stage Levels
Based on Proposition 5.4.1, the proposed Algorithm 13 can be evaluated in a unified sequence
of four stages. Analysing the generalized schedule scheme shown in Figure 5.5, for the eight
cases of k j values, one can see that the DBL operation evaluated for all cases at Stage 1 has
no operation dependency with the SUB operation being evaluated at Stage 4. Since there is no
operation dependency between the two EC-operations, the SUB operation that is evaluated at
Stage 4 can be rearranged to be performed at Stage 1 of the next iteration.
Therefore, the SUB operation of the previous iteration and the first DBL operation of the
current iteration can be evaluated in parallel. The sequence order of the EC-operations is then
adjusted as shown in Figure 5.6; hence, a total of 3 stages would be used at each iteration. In
this case, the proper initialization of the registers has to be considered, i.e., initially, PAcc = 8P,
and based on the value of k0 either P1 = (7 − k0)P, or PkP = k0P. We also note that the
temporary register PTemp can be omitted in the proposed scheme shown in Figure 5.6. Let us
consider the following two possible scenarios:
[Figure 5.6 graphic: Stages 1 to 3 with three DBL blocks, one SUB block, and one ADD/SUB block (3 EC-operations in sequence), registers, selectors Sel 1 to Sel 6, Input 1, Output 1, and Output 2. Notes in the figure: (*) the final result for computing kP is obtained from Output 2; (**) PAcc^(0) and Input 1 are the initial points, defined as PAcc^(0) = 8P and Input 1 = k0 P if k0 ∈ { −1, 0, 1, 2, 4 }, or (7 − k0)P if k0 ∈ { 3, 5, 6 }; (***) the CTRL input switches between ADD and SUB operations.]
Figure 5.6: EC-Operation Dependency Graph for The Proposed Radix-8 ECSM Method That
Shows The Total Memory Points Required, The Total EC-Operations Cost, and The Total
Computational Time Complexity Per 3 Scalar Bits at The EC-Operation Level.
1. The first scenario involves the serial implementation design of Figure 5.6, i.e., one ADD
and one DBL are implemented in parallel. In this case, it takes 3 clock cycles to complete
one iteration of the for loop in Algorithm 13, i.e., to process 3 scalar bits. As one can
see from Figure 5.6, only one DBL operation is required to be executed at clock cycle 2.
Then, during clock cycle 2, the additional temporary registers used to compute the
ADD operation become idle, and it becomes possible to reuse them to store the contents
of PTemp.
2. The second scenario, which is considered in this work, involves a parallel implementation
design of Figure 5.6, i.e., a total of two ADDs and three DBLs are implemented. In
this case, three bits of the scalar (one digit kj) are processed at every clock cycle, and
the contents of PTemp no longer need to be stored. Furthermore, for hardware
resource efficiency in this scenario, a single register can be shared between the two points
PkP and P1. The strategy is to store one point in the register, and to obtain the
second point as the result of the ADD operation at the end of Stage 1 in every iteration
(see Figure 5.6).
Since all the k j cases use the same set of EC-operations, ADD and DBL do not have to be
indistinguishable. Also, as no dummy operations are introduced, the risk posed by the adaptive
fault analysis is minimal [43].
5.5 Performance Analysis of The Proposed ECSM Scheme
As shown in Figure 5.6, the power consumption of the proposed scalar multiplication scheme is
fixed. This indicates that the proposed scheme is intrinsically protected against SSCA because
every iteration in the main loop involves 3 DBLs and 2 ADDs. Furthermore, since no dummy
operation is used, any fault introduced into any operation will result in an incorrect scalar
multiplication result, which makes it resistant to safe-error fault attacks [43].
In the following, we evaluate and analyse the efficiency of the proposed ECSM scheme
(Figure 5.6) and compare it to the other well-known ECSM schemes at the point arithmetic
level. Since the proposed scheme evaluates 3 bits of the scalar per iteration, all
comparisons are made per 3 bits of the scalar k. First, we compare it to
two well-known binary methods: the Double-and-Add [123] and the signed binary methods
[124, 71, 69, 70]. Second, we compare it to the non-secure width-4 [73] and the non-secure
radix-8 NAF schemes [72]. Third, we compare it to the SSCA-aware width-4 window-based
methods, i.e., the width-4 Möller [63] and the width-4 Okeya window schemes (Figure 5.7)
[64]. Fourth, we compare it to the SSCA-aware binary methods: the Montgomery Ladder
[189, 190, 191] and Joye’s binary methods (Figure 5.8) [192]. In our analysis, we assume that
the recoding is secure against SSCA and has a negligible computational cost.
5.5. Performance Analysis of The Proposed ECSM Scheme 105
Table 5.3 summarizes the comparison of the different ECSM schemes. In this table, the
memory consumption is the sum of the look-up table and the registers required during the
evaluation stage. We note that in order to compute the ECSM in a non-secure width-w NAF,
[Figure 5.7 graphic: Stages 1 to 4 (4 EC-operations) with three DBL blocks computing 2Q, 4Q, and 8Q from Q, followed by one mADD/mSUB block controlled by a CTRL input; a look-up table holding 1P, 3P, 5P, and 7P selected by Sel 2; and a register for Q selected by Sel 1. Notes in the figure: Q^(0) is the initial point; the CTRL input switches between ADD and SUB operations; the look-up table holds 2^(w−2) pre-computation points, where w = 4, i.e., 4 memory points.]
Figure 5.7: EC-operation Dependency Graph for The Width-4 Okeya Method [64] That Shows
The Total Memory Points Required, The Total EC-Operations Cost, and The Total
Computational Time Complexity Per 3 Scalar Bits at The EC-Operation Level.
[Figure 5.8 graphic: Stages 1 to 3 (3 EC-operations), each with one DBL block and one ADD block operating on two registers, with the operands selected by the scalar bits; the last stage produces Output 1 and Output 2. Note in the figure: P1^(0) and P2^(0) are the initial points.]
Figure 5.8: EC-Operation Dependency Graph for The Montgomery Ladder and Joye’s Binary
Methods [189, 192] That Shows The Total Memory Points Required, The Total EC-Operations
Cost, and The Total Computational Time Complexity Per 3 Scalar Bits at The EC-Operation
Level.
a total of 2 w−2 − 1 pre-computation points including base point P is required. The width-w
of the Mo¨ller method is based on (2 w−2 + 1) pre-computation look-up tables, and hence, for
w = 4, the total memory consumption in this ECSM scheme is 5 pre-computation points and
1 for the evaluation stage. Also, the SSCA aware width-w NAF method presented by Okeya
and Takagi in [64], has more recoding overhead; but, as shown in Table 5.3, it has 1 memory
106 Chapter 5. New Regular Radix-8 Scheme for ECSM
Table 5.3: Comparison of Related Binary and Width-4 Window-Based ECSM Schemes With the
Proposed Radix-8 Scheme (Figure 5.6) in Terms of Memory Register Space Used, Total
EC-Operations Cost, and Computation Time Complexity at the EC-Operations Level per 3 Scalar
Bits.

| Method | Memory Points | Total EC-operations Cost / 3 Scalar Bits | Computational Time Complexity / 3 Scalar Bits (a) |
|---|---|---|---|
| Non-Secure ECSM Methods | | | |
| Double-and-Add [123] | 2 → [P, Q] | 4.5 uADD or Atomic Structure (b) | 3 EC-operations (Fix) |
| Signed Binary [69]–[71], [124] | 2 → [P, Q] | 4 uADD or Atomic Structure (c) | 3 EC-operations (Fix) |
| Width-4 NAF [73] | 4 → [P, 3P, 5P, Q] (d) | 3.67 uADD or Atomic Structure (e) | 3.67 EC-operations (Av.) |
| Radix-8 NAF [72] | 4 → [P, 3P, 5P, Q] (d) | 3.67 uADD or Atomic Structure (e) | 3.67 EC-operations (Av.) |
| Secure ECSM Methods | | | |
| Width-4 Möller [63] | 6 → [P, 3P, 5P, 7P, 8P, Q] (f) | 3 DBL & 1 mADD | 4 EC-operations (Fix) |
| Width-4 Okeya [64] | 5 → [P, 3P, 5P, 7P, Q] (g) | 3 DBL & 1 mADD | 4 EC-operations (Fix) |
| Montgomery Ladder [189]–[191] | 2 → [P1, P2] (h) | 3 DBL & 3 ADD | 3 EC-operations (Fix) |
| Joye’s Binary Method [192] | 2 → [R0, R1] (h) | 3 DBL & 3 ADD | 3 EC-operations (Fix) |
| Proposed Radix-8 Scheme (Figure 5.6) | 2 → [PAcc, output 1] (i) | 3 DBL & 2 ADD | 3 EC-operations (Fix) |

a Note that the terms Av. and Fix stand for the average and fixed measurements of the computation complexity.
b Utilizing the atomicity principle, on average, the computation complexity is 3 DBLs + 1.5 mADDs.
c Utilizing the atomicity principle, on average, the computation complexity is 3 DBLs + 1 mADD.
d $(2^{w-2} - 1)$ pre-computation points, where w = 4, and another EC-point is used in the evaluation process.
e Utilizing the atomicity principle, on average, the computation complexity is 3 DBLs + 0.67 mADDs.
f $(2^{w-2} + 1)$ pre-computation points, where w = 4, and another EC-point is used in the evaluation process.
g $(2^{w-2})$ pre-computation points, where w = 4, and another EC-point is used in the evaluation process.
h If only the x-coordinates of the EC-points are computed, then the initial (base) EC-point, i.e., P, will be reserved and used to
obtain the ADD operation and the y-coordinate from the x-coordinates. Hence, the total memory points would become 3.
i If one ADD and one DBL are implemented in parallel to design Figure 5.6, then the total number of registers would become 3.
reduction in the size of the look-up table as compared to the width-4 Möller method in [63].
Hence, a total of 5 memory registers, including the register for the evaluation stage, are required
(see Figure 5.7). It can be seen from this table that the secure width-4 window-based ECSM
methods require the largest amount of memory, using at least 40% more memory registers than
the proposed ECSM scheme shown in Figure 5.6.
The secure width-4 window-based methods require a total of 3 DBLs + 1 mADD, assuming
that their pre-computed points are kept in affine coordinates. The SSCA-aware binary methods,
i.e., the Montgomery Ladder and Joye’s binary methods, require a total of 3 DBLs + 3 ADDs for
every 3 bits of the scalar $k_j$, whereas the proposed scheme requires a total of 3 DBLs + 2 ADDs
for every 3 bits of the scalar $k_j$.
The Double-and-Add, signed binary, Radix-8 NAF, and width-4 NAF methods are prone
to SSCA. In order to withstand SSCAs, the methods should either use the unified operation
approach (cf., [37]) or the atomicity principle (cf. [81], and [83]). The first approach uses an
indistinguishable addition, i.e., a uADD, in which the same formula is used for both the ADD and
the DBL; however, the implementation of such a formula for different models of elliptic curves
would suffer from a large area complexity. The atomic structure approach is usually implemented
with DBLs and a Jacobian projective-affine mADD operation. It should be noted that atomic
structure schemes are available for only a few projective coordinate systems, that is, they are
not generalized to all of the elliptic curve models. Further, the architecture design in the
atomic schemes is very restricted: it is limited to performing a specific number of arithmetic
multiplication and squaring operations in each clock cycle.
The SSCA-aware binary methods, i.e., the Montgomery Ladder and Joye’s binary methods,
require a total of 3 DBLs + 3 ADDs for every 3 bits of the scalar $k_j$. The proposed scheme
requires a total of 3 DBLs + 2 ADDs for every 3 bits of the scalar $k_j$. This indicates that the
proposed ECSM scheme shown in Figure 5.6 saves 1/3 of the ADD operations compared to the
SSCA-protected binary methods. It is noted that in those SSCA-aware binary methods, the
computation of the scalar multiplication can be enhanced at the field arithmetic level. For
instance, in the Montgomery Ladder method on the Montgomery curve, only the x-coordinates of
the EC-points are computed in the EC-operations. However, as will be shown in Section 5.6, when
the proposed ECSM scheme is used in a parallel environment, one can gain a significant
performance improvement that yields a faster execution time than the optimized binary ECSM
schemes.
The secure width-4 window-based methods require a total of 3 DBLs + 1 mADD, assuming
that their pre-computed points are kept in affine coordinates. However, as seen in Figures 5.6
to 5.8, in terms of computational time complexity, the proposed method and all of the other
binary methods are more efficient, because in each stage both EC-operations, DBL and ADD, are
independent and can be evaluated in parallel. In contrast, the EC-operations in the non-secure
and secure window-based methods are performed sequentially. Hence, their computational time
complexities become 3.67 and 4 EC-operations, respectively. In order to make these window-based
methods, which involve pre-computations with the base point P, feasible for implementations
supporting parallel processing of EC-operations, i.e., so that their computational time
complexity becomes 3 EC-operations, all of the pre-computed points need to be doubled w − 1
times at each iteration [128].
We apply the proposed ECSM scheme to two well-known Weierstraß elliptic curve models.
Table 5.4 reports the total field arithmetic operations for computing the scalar multiplication
using Double-and-Add, signed binary, and non-secure width-4 NAF algorithms with unified
addition-or-doubling formulas. A comparison of the proposed ECSM scheme, i.e., Figure
5.10, with the other secured ECSM methods is also provided in this table. From Table 5.4, one
can see that the secured width-4 methods require fewer field arithmetic operations. It
must be noted, however, that the secured width-4 methods require additional memory registers
for the pre-computed points. In the following section, we take advantage of the ECSM scheme
Table 5.4: Comparison of the Proposed Radix-8 Scheme (Figure 5.6) With the Unified
Operation Technique and With Different ECSM Schemes That Are Resistant Against Side-Channel
Attacks in Terms of Total Field Arithmetic Operations per 3 Scalar Bits on the Weierstraß
Elliptic Curve Model.

| Security Method | Point Operation Cost (a) | ECSM Method | Field Arithmetic Complexity / 3 Scalar Bits (in terms of M and S) | When S = 0.8M |
|---|---|---|---|---|
| Projective Coordinates Representation [139] | | | | |
| Unified Operation Techniques | ADD → 12M+2S; DBL (b) → 7M+3S; uADD (c) → 13M+3S | Double-and-Add [123] | 58.5M+13.5S | 69.3M |
| | | Signed Binary [69]–[71], [124] | 52M+12S | 61.6M |
| | | Non-Secure Width-4 NAF [72], [73] | 47.71M+11.01S | 56.51M |
| | | Proposed ECSM Scheme (Figure 5.6) | 45M+13S | 55.4M |
| SSCA Secured ECSM Methods | ADD → 12M+2S; DBL (b) → 7M+3S; mADD → 9M+2S | Secure Width-4 Möller Scheme [63]; Secure Width-4 NAF Scheme [64] | 30M+11S (d) | 38.8M |
| | | Montgomery Ladder [189]–[191]; Joye’s Binary Method [192] | 57M+15S | 69M |
| | | Proposed ECSM Scheme (Figure 5.6) | 45M+13S | 55.4M |
| Jacobian Projective Coordinates Representation [111] | | | | |
| SSCA Secured ECSM Methods | ADD → 12M+4S; DBL (b) → 4M+4S; mADD → 8M+3S | Secure Width-4 Möller Scheme [63]; Secure Width-4 NAF Scheme [64] | 20M+15S (d) | 32M |
| | | Montgomery Ladder [189]–[191]; Joye’s Binary Method [192] | 48M+24S | 67.2M |
| | | Proposed ECSM Scheme (Figure 5.6) | 36M+20S | 52M |

a We follow most of the literature in ignoring the cost of A.
b It is assumed that a = −3.
c It is assumed that a = −1.
d An additional computation of 3(k − 1)M + 1I, where k is the total number of pre-computation points in the lookup table, is required
for the transformation of points to affine coordinates in the pre-computation stage, i.e., preparing the points in the lookup table.
we proposed with the objective of deriving faster ECC formulae for parallel architectures. For
the comparison with other parallel environment systems, we decided to choose the Extended
Twisted Edwards coordinates for the curves defined over Fp.
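
The entries of Table 5.4 follow from weighting the EC-operation counts per 3 scalar bits (Table 5.3) with the listed point-operation costs. The short Python sketch below reproduces that arithmetic for three of the schemes. The dictionary and function names are illustrative assumptions and do not appear in the original work; exact fractions are used so that S = 0.8M is applied without rounding.

```python
from fractions import Fraction

# Point-operation costs as (multiplications M, squarings S), copied from Table 5.4.
PROJECTIVE = {"ADD": (12, 2), "DBL": (7, 3), "mADD": (9, 2)}
JACOBIAN = {"ADD": (12, 4), "DBL": (4, 4), "mADD": (8, 3)}

# EC-operations needed per 3 scalar bits, copied from Table 5.3.
SCHEMES = {
    "secure width-4 window": {"DBL": 3, "mADD": 1},
    "Montgomery Ladder / Joye": {"DBL": 3, "ADD": 3},
    "proposed radix-8": {"DBL": 3, "ADD": 2},
}


def field_cost(scheme, point_costs):
    """Return (M, S) for one 3-bit round of a scheme under the given point costs."""
    m = sum(count * point_costs[op][0] for op, count in scheme.items())
    s = sum(count * point_costs[op][1] for op, count in scheme.items())
    return m, s


def in_multiplications(m, s, s_ratio=Fraction(4, 5)):
    """Express a cost of m M + s S in multiplications only, assuming S = 0.8M."""
    return m + s * s_ratio


if __name__ == "__main__":
    for label, costs in (("projective", PROJECTIVE), ("Jacobian", JACOBIAN)):
        for name, scheme in SCHEMES.items():
            m, s = field_cost(scheme, costs)
            total = float(in_multiplications(m, s))
            print(f"{label:10s} {name:26s} {m}M+{s}S = {total:.1f}M")
```

For the proposed scheme in projective coordinates this evaluates 3(7M+3S) + 2(12M+2S) = 45M+13S, which is the value listed in the table.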
5.6 Parallel Architectures
In this section, we explain how a protected scalar multiplication using the proposed scheme
for the prime extended twisted Edwards model can be performed faster than all of the parallel
and SSCA-protected schemes over prime fields reported in the literature including the fast
Montgomery Ladder method on the Montgomery curve.
The objective of using the proposed scheme, i.e., Figure 5.6, is to achieve the fastest scalar
multiplication result. Note that, for simplicity, the required auxiliaries (or registers) in the
ECSM schemes are not discussed or analysed. Also, in the parallelization process, we impose
the restriction that the architectures can only be based on SIMD (single instruction, multiple
data) operations.
The total field arithmetic operations cost of the Montgomery curve is the least among the
existing elliptic curve models over prime fields [110]. We recall from [189] that an elliptic curve
produced by a Montgomery equation is of the form

$$E_M : By^2 = x^3 + Ax^2 + x,$$

where $A, B \in \mathbb{F}_p$ with $(A^2 - 4)B \neq 0$. Let $P_m(X_m, Z_m)$ and $P_n(X_n, Z_n)$ be two arbitrary points
on this curve, and let $P_{m-n}(X_{m-n}, Z_{m-n})$ be another point that is equal to the difference between the
two points, i.e., $P_{m-n} = P_m - P_n$. Assuming that $Z_{m-n} = 1$, the coordinates of the point
$P_{m+n}(X_{m+n}, Z_{m+n}) = P_m + P_n$ are given as follows [189]:

$$X_{m+n} = \bigl((X_m - Z_m)(X_n + Z_n) + (X_m + Z_m)(X_n - Z_n)\bigr)^2,$$
$$Z_{m+n} = X_{m-n}\bigl((X_m - Z_m)(X_n + Z_n) - (X_m + Z_m)(X_n - Z_n)\bigr)^2,$$

and the coordinates of the doubling formulae, i.e., $P_{2m}(X_{2m}, Z_{2m}) = 2P_m$, are given in [189] by

$$4X_mZ_m = (X_m + Z_m)^2 - (X_m - Z_m)^2,$$
$$X_{2m} = (X_m + Z_m)^2 (X_m - Z_m)^2,$$
$$Z_{2m} = (4X_mZ_m)\bigl((X_m - Z_m)^2 + ((A + 2)/4)(4X_mZ_m)\bigr).$$
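
As an illustration of how the differential addition and doubling formulas above combine into an x-coordinate-only ladder, the following Python sketch implements them sequentially. It is a minimal sketch under stated assumptions: the field prime and the curve constant A are those of Curve25519, chosen here only as a convenient, well-known Montgomery curve and not taken from the thesis; the function names are hypothetical; and the code branches on the scalar bits, so it is neither constant-time nor a model of the parallel schedule in Figure 5.9.

```python
# Assumed demo parameters (not from the thesis): the Curve25519 field and constant A.
P_MOD = 2**255 - 19
A = 486662
A24 = (A + 2) * pow(4, P_MOD - 2, P_MOD) % P_MOD  # (A + 2)/4 mod p


def x_add(pm, pn, x_diff):
    """Differential addition: (X, Z) of Pm + Pn, given affine x of Pm - Pn (Z = 1)."""
    xm, zm = pm
    xn, zn = pn
    u = (xm - zm) * (xn + zn) % P_MOD
    v = (xm + zm) * (xn - zn) % P_MOD
    return (u + v) ** 2 % P_MOD, x_diff * (u - v) ** 2 % P_MOD


def x_dbl(pm):
    """Doubling: (X, Z) of 2 * Pm."""
    xm, zm = pm
    s = (xm + zm) ** 2 % P_MOD
    d = (xm - zm) ** 2 % P_MOD
    t = (s - d) % P_MOD  # equals 4 * Xm * Zm
    return s * d % P_MOD, t * (d + A24 * t) % P_MOD


def ladder(k, x):
    """Return projective (X, Z) of k * P for a point P with affine x-coordinate x."""
    r0, r1 = (1, 0), (x % P_MOD, 1)  # invariant: r1 - r0 = P
    for bit in bin(k)[2:]:
        if bit == "1":
            r0, r1 = x_add(r0, r1, x), x_dbl(r1)
        else:
            r0, r1 = x_dbl(r0), x_add(r0, r1, x)
    return r0


def to_affine(point):
    """Convert (X, Z) with Z != 0 to the affine x-coordinate X / Z."""
    X, Z = point
    return X * pow(Z, P_MOD - 2, P_MOD) % P_MOD


if __name__ == "__main__":
    print(hex(to_affine(ladder(7, 9))))
```

Each loop iteration performs exactly one differential addition and one doubling, which is the ADDDBL unit whose parallel cost is analysed next.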
A Montgomery Ladder ADDDBL algorithm with a cost of 5M + 4S + 1D + 8A is given in [110], and a
parallel algorithm for the Montgomery Ladder is given in [37] at an effective time cost of 2M
+ 2S + 1D + 3A using 4 processors. As a point of comparison, in Figure 5.9, we derived the
fastest timings for the Montgomery Ladder’s ADDDBL operation in the parallel strategies. In
our scheme, we assumed that the two S operations performed in Step 2 in Figure 5.9 are carried
out as M operations. Then, all four operations executed in Step 2 are performed at the
same time with a delay of M. We note that the two S operations performed in Step 4 are carried
out by dedicated squaring. From this figure, one can see that the ADDDBL operation for the
Montgomery Ladder can be performed with an effective time of 2M + 1S + 3A for each bit
of the scalar. It is worth noting that data dependencies prevent further reductions even
with more processes. Consequently, for the Montgomery Ladder algorithm, the computation
[Figure 5.9: Data Dependency Graph for Parallel Computing of the ADDDBL Operation for
the x-Coordinate-Only Montgomery Ladder Method on the Montgomery Curve. The graph schedules
Processes 1 to 4 over Steps 1 to 7 and produces Xm+n, Zm+n, X2m, and Z2m; the annotated
effective cost is 2M + 1S + 1D + 3A. Legend: addition; subtraction; x = multiplication;
^2 = squaring, carried out either as a multiplication or by dedicated squaring; D = field
multiplication by a curve constant.]
time complexity per 3 bits of the scalar, as provided in Table 5.5, is 6M + 3S + 3D.
We now investigate the 8-processor implementation of the ADDDBL operation for the
prime extended twisted Edwards curve. The twisted Edwards curve is a generalization of the
Edwards curve [139] and has the equation [198]

$$E_T : ax^2 + y^2 = 1 + dx^2y^2,$$

where $a, d \in \mathbb{F}_p$ with $ad(a - d) \neq 0$. To develop a faster way of performing the DBL and the
ADD operations, an additional auxiliary coordinate was added to the twisted Edwards
coordinates in [37]. It is observed in [37] that points on the extended twisted Edwards curves
are represented by quadruple coordinates, and that for the special case a = −1, the DBL and the
ADD operations can be performed at a computation cost of 4M + 4S + 6A and 8M + 10A operations,
respectively, assuming that the costs of field addition and subtraction are equal [37].
Let $P_1(X_1, Y_1, T_1, Z_1)$ and $P_2(X_2, Y_2, T_2, Z_2)$ be two distinct points on $E_e$, where $E_e$ denotes
the extended twisted Edwards coordinates, with $Z_1 \neq 0$ and $Z_2 \neq 0$. Then the coordinates
of the point $P_3(X_3, Y_3, T_3, Z_3) = P_1 + P_2$ are given as follows [37]:

$$\begin{aligned}
X_3 &= (X_1Y_2 - Y_1X_2)(T_1Z_2 + Z_1T_2), \\
Y_3 &= (Y_1Y_2 - X_1X_2)(T_1Z_2 - Z_1T_2), \\
T_3 &= (T_1Z_2 + Z_1T_2)(T_1Z_2 - Z_1T_2), \\
Z_3 &= (Y_1Y_2 - X_1X_2)(X_1Y_2 - Y_1X_2),
\end{aligned} \tag{5.16}$$
Table 5.5: Comparison of the Proposed Radix-8 ECSM Scheme (Figure 5.6) With Different
Scalar Multiplication Schemes That Offer Resistance Against Side-Channel Attacks
Using Parallel Environments With Respect to the Computation Time Complexity.

| Scheme - Processors (a) | EC Model - Coordinates | ECSM Method | Comput. Time Complexity / 3 Scalar Bits (b), in terms of M, S and D | S = 0.8M and D = 0 |
|---|---|---|---|---|
| [85] - 2 Processors (c) | Jacobian Proj. Coordinates | Width-4 Möller Scheme [63] | 10M+8S | 16.4M |
| [86] - 2 Processors (d) | PL with x-coordinate only [114] | Montgomery Curve [189] | 30M | 30M |
| [87] - 2 Processors (e) | Modified Jacobian Coordinates | Width-4 Möller Scheme [63] | 11M+10S | 19M |
| [87] - 3 Processors (f) | Modified Jacobian Coordinates | Width-4 Möller Scheme [63] | 10M+3S | 12.4M |
| [88] - 3 Processors (g) | Hessian Proj. Curves | Width-4 Möller Scheme [63] | 10M+3S | 12.4M |
| [83] - 2 Processors (h) | Jacobian Proj. Coordinates (atomic) | Non-Secure Width-4 NAF [73], [72] | 8.7M+8.7S (Av.) | 15.66M |
| [83] - 3 Processors (i) | Jacobian Proj. Coordinates | Width-4 Möller Scheme [63] | 6M+8S | 12.4M |
| [83] - 4 Processors (j) | Jacobian Proj. Coordinates | Width-4 Möller Scheme [63] | 3M+10S | 11M |
| [37] - 2 Processors (k) | Extended Twisted Edwards (unified) | Non-Secure Width-4 NAF [73], [72] | 14.68M+3.67D (Av.) | 14.68M |
| [37] - 4 Processors (l) | Extended Twisted Edwards (unified) | Non-Secure Width-4 NAF [73], [72] | 7.34M+3.67D (Av.) | 7.34M |
| [37] - 4 Processors (m) | Extended Twisted Edwards | Width-4 Möller Scheme [63] | 5M+3S | 7.4M |
| [37] - 4 Processors | PL with x-coordinate only [114] | Montgomery Curve [189] | 6M+6S | 10.8M |
| Figure 5.9 - 4 Processors | PL with x-coordinate only [114] | Montgomery Curve [189] | 6M+3S+3D | 8.4M |
| Proposed (Figure 5.10) - 8 Processors (n) | Extended Twisted Edwards | Proposed radix-8 scheme (Figure 5.6) | 5M+1S | 5.8M |

a Processors are based on the number of parallel field multipliers M. The effect of the number of auxiliaries (or registers) on the area is not
discussed here.
b We follow most of the literature in ignoring the cost of A. The experimental ratio A/M on smart cards is provided in [146].
c A sequence of 3 DBLs, i.e., 3(2M+2S), followed by the mADD, i.e., (4M+2S).
d A sequence of 3 parallel computations of [ADD & DBL], i.e., 3(10M) = 30M.
e A sequence of 3 DBLs, i.e., 3(2M+3S), followed by the mADD, i.e., (5M+1S).
f A sequence of 3 DBLs, i.e., 3(2M+1S), followed by the mADD, i.e., (4M).
g A sequence of 3 DBLs, i.e., 3(2M+1S), followed by the ADD, i.e., (4M).
h Each point is represented by sextuplet coordinates. An average of 3 DBLs + 0.67 mADDs, i.e., 3(2M+2S) + 0.67(4M+4S).
i Each point is represented by sextuplet coordinates. A sequence of 3 DBLs, i.e., 3(1M+2S), followed by the mADD, i.e., (3M+2S).
j Each point is represented by sextuplet coordinates. A sequence of 2 special DBLs, i.e., 2(3S), followed by a generalized DBL, i.e., 1M+2S,
followed by the mADD, i.e., 2M+2S.
k Each point is represented by quadruple coordinates. A sequence of 3.67 uDBLs, i.e., 3.67(4M+1D).
l Each point is represented by quadruple coordinates. A sequence of 3.67 uDBLs, i.e., 3.67(2M+1D). It is stated in [36] that this is the fastest known
approach to performing elliptic curve point operations.
m Each point is represented by quadruple coordinates. A sequence of 3 DBLs, i.e., 3(1M+1S), followed by the ADD, i.e., (2M).
n Each point is represented by quadruple coordinates. A sequence of ADDDBL, DBL, and ADDDBL. As shown in Figure 5.6, the ADDDBL
operation can be performed at an effective cost of 2M, and from [37], the DBL operation can be performed at an effective cost of 1M+1S.
and the coordinates of the doubling formulae, i.e., $P_4(X_4, Y_4, T_4, Z_4) = 2P_1$, are given in [37]
by

$$\begin{aligned}
X_4 &= 2X_1Y_1(2Z_1^2 - Y_1^2 + X_1^2), \\
Y_4 &= (Y_1^2 - X_1^2)(Y_1^2 + X_1^2), \\
T_4 &= 2X_1Y_1(Y_1^2 + X_1^2), \\
Z_4 &= (Y_1^2 - X_1^2)(2Z_1^2 - Y_1^2 + X_1^2).
\end{aligned} \tag{5.17}$$
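
To make (5.16) and (5.17) concrete, the Python sketch below evaluates both formulas sequentially for a = −1 and cross-checks them against the textbook affine addition law of twisted Edwards curves. The toy prime p = 101, the constant d = 2, and all function names are assumptions chosen only so that the whole curve can be enumerated by brute force; they are not parameters from the thesis, and the code does not model the 8-processor schedule.

```python
P_MOD = 101  # assumed toy prime with p = 1 (mod 4), so that a = -1 is a square
D = 2        # assumed curve constant d (a non-square modulo 101)


def inv(v):
    """Modular inverse via Fermat's little theorem (v must be non-zero mod p)."""
    return pow(v, P_MOD - 2, P_MOD)


def to_extended(x, y):
    """Affine (x, y) to extended coordinates (X, Y, T, Z) with T = XY/Z."""
    return (x, y, x * y % P_MOD, 1)


def to_affine(point):
    X, Y, _, Z = point
    z_inv = inv(Z)
    return (X * z_inv % P_MOD, Y * z_inv % P_MOD)


def ext_add(p1, p2):
    """Addition (5.16) for distinct points; Z3 = 0 signals an exceptional pair."""
    x1, y1, t1, z1 = p1
    x2, y2, t2, z2 = p2
    a = (x1 * y2 - y1 * x2) % P_MOD
    b = (y1 * y2 - x1 * x2) % P_MOD
    c = (t1 * z2 + z1 * t2) % P_MOD
    e = (t1 * z2 - z1 * t2) % P_MOD
    return (a * c % P_MOD, b * e % P_MOD, c * e % P_MOD, b * a % P_MOD)


def ext_dbl(p1):
    """Doubling (5.17); the T coordinate of the input is not needed."""
    x1, y1, _, z1 = p1
    xx, yy = x1 * x1 % P_MOD, y1 * y1 % P_MOD
    xy2 = 2 * x1 * y1 % P_MOD
    f = (yy - xx) % P_MOD
    g = (yy + xx) % P_MOD
    h = (2 * z1 * z1 - yy + xx) % P_MOD
    return (xy2 * h % P_MOD, f * g % P_MOD, xy2 * g % P_MOD, f * h % P_MOD)


def affine_unified_add(p1, p2):
    """Textbook affine addition law for a = -1, used here only as a reference."""
    x1, y1 = p1
    x2, y2 = p2
    k = D * x1 * x2 * y1 * y2 % P_MOD
    x3 = (x1 * y2 + y1 * x2) * inv(1 + k) % P_MOD
    y3 = (y1 * y2 + x1 * x2) * inv(1 - k) % P_MOD
    return (x3, y3)


def curve_points():
    """All affine points of -x^2 + y^2 = 1 + d x^2 y^2 over the toy field."""
    return [(x, y) for x in range(P_MOD) for y in range(P_MOD)
            if (-x * x + y * y - 1 - D * x * x * y * y) % P_MOD == 0]
```

Note that (5.16) does not involve d and yields Z3 = 0 when the two inputs coincide, which is why the doubling is handled by the separate formula (5.17).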
It was shown in [37] that both the ADD and the DBL operations can be performed utilizing
4 processors with an effective time of 2M + 3A and 1M + 1S + 3A, respectively. We propose
a composite ADDDBL operation for this curve by splitting the computational task of both the
ADD and the DBL operations into 5 steps with the utilization of 8 processors. The data
dependency graph of both (5.16) and (5.17) is presented in Figure 5.10, which shows that combining
these two equations requires a computation cost of 12M + 4S + 15A (1 field addition operation
[Figure 5.10: Data Dependency Graph for Parallel Computing of the Proposed ADDDBL Operation
for the Prime Extended Twisted Edwards Curve. The graph schedules Processes 1 to 8 over Steps 1
to 5, taking the coordinates of P1 and P2 as inputs and producing the coordinates of P3 and P4;
the annotated effective cost is 2M + 3A. Legend: addition; subtraction; x = multiplication;
^2 = squaring carried out as multiplication.]
is saved). According to this figure, the effective time can be reduced to 2M + 3A operations
with 8 processes. As shown in Figure 5.10, the ADDDBL operation scheme consists of eight
independent processing elements, i.e., process 1 to process 8. A finite field arithmetic operation
is represented by a circle and is labeled according to the type of action it performs. In our
scheme, we assumed that the S operations performed in Step 2 are carried out as M operations.
The interconnections among the eight processing elements are needed because of the data
dependencies between the operations in the processing elements. For instance, when arriving at
Step 2, process 5 needs the output data generated by process 4 in Step 1. Thus, an interconnection
between process 4 and process 5 is needed to support such a data dependency. Similarly, the other
necessary interconnections should also be provided. From this figure, and the effective time
cost of the DBL operation for the prime extended twisted Edwards curve that is obtained from
[37], we conclude that one round of computing 3 bits of the scalar in the proposed scheme
Table 5.6: Comparison of Related Parallel Schemes With the Proposed 8-Processor
Scheme for the Extended Twisted Edwards Curve over Prime Fields, Which Is Shown in
Figure 5.10, With Respect to the Computational Time Complexities for the Bit Lengths of
the Underlying Fields of NIST Recommended Curves [16].

| Prime Field Size $\mathbb{F}_p$ | Scheme - Processors | Computational Time Complexities |
|---|---|---|
| s = 192 | 4 Processors for Jacobian Projective Coordinates [83] | 191M+637S |
| | 4 Processors for Extended Twisted Edwards [37] | 319M+191S |
| | Montgomery Ladder method on the Montgomery curve [37] | 382M+382S |
| | Montgomery Ladder method on the Montgomery curve (Figure 5.9) | 382M+191S |
| | Proposed 8 processors scheme (Figure 5.10, and DBL operation obtained from [37]) | 320M+64S |
| s = 224 | 4 Processors for Jacobian Projective Coordinates [83] | 223M+744S |
| | 4 Processors for Extended Twisted Edwards [37] | 372M+223S |
| | Montgomery Ladder method on the Montgomery curve [37] | 446M+446S |
| | Montgomery Ladder method on the Montgomery curve (Figure 5.9) | 446M+223S |
| | Proposed 8 processors scheme (Figure 5.10, and DBL operation obtained from [37]) | 374M+75S |
| s = 256 | 4 Processors for Jacobian Projective Coordinates [83] | 255M+850S |
| | 4 Processors for Extended Twisted Edwards [37] | 425M+255S |
| | Montgomery Ladder method on the Montgomery curve [37] | 510M+510S |
| | Montgomery Ladder method on the Montgomery curve (Figure 5.9) | 510M+255S |
| | Proposed 8 processors scheme (Figure 5.10, and DBL operation obtained from [37]) | 427M+86S |
| s = 384 | 4 Processors for Jacobian Projective Coordinates [83] | 383M+1277S |
| | 4 Processors for Extended Twisted Edwards [37] | 639M+383S |
| | Montgomery Ladder method on the Montgomery curve [37] | 766M+766S |
| | Montgomery Ladder method on the Montgomery curve (Figure 5.9) | 766M+383S |
| | Proposed 8 processors scheme (Figure 5.10, and DBL operation obtained from [37]) | 640M+128S |
| s = 521 | 4 Processors for Jacobian Projective Coordinates [83] | 520M+1734S |
| | 4 Processors for Extended Twisted Edwards [37] | 867M+520S |
| | Montgomery Ladder method on the Montgomery curve [37] | 1040M+1040S |
| | Montgomery Ladder method on the Montgomery curve (Figure 5.9) | 1040M+520S |
| | Proposed 8 processors scheme (Figure 5.10, and DBL operation obtained from [37]) | 869M+174S |
(Figure 5.6), which requires a sequence of ADDDBL, DBL, and ADDDBL, can be completed
in an effective time of 5M + 1S + 9A. Table 5.5 provides the computation time complexity
of the different scalar multiplication schemes provided in the literature, which offer resistance
against side-channel attacks in the parallel environments.
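
As a worked check of the per-round figure quoted above, the effective times of the three steps of one round (ADDDBL, DBL, ADDDBL) can simply be summed, using the effective times 2M + 3A for ADDDBL and 1M + 1S + 3A for DBL stated in the text:

```latex
\begin{align*}
T_{\text{round}} &= T_{\text{ADDDBL}} + T_{\text{DBL}} + T_{\text{ADDDBL}} \\
                 &= (2M + 3A) + (1M + 1S + 3A) + (2M + 3A) \\
                 &= 5M + 1S + 9A .
\end{align*}
```

Ignoring the cost of A and taking S = 0.8M, as in Table 5.5, this gives 5M + 0.8M = 5.8M per 3 scalar bits.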
In general, for an s-bit scalar multiplication, the Montgomery Ladder method shown in
Figure 5.9 requires $6\frac{(s-1)}{3}M + 3\frac{(s-1)}{3}S$, whereas the extended twisted Edwards curve in the
proposed ECSM method requires $\frac{5s}{3}M + \frac{s}{3}S$. Table 5.6 shows the comparison of the 4-processor
scheme for the Jacobian projective coordinates presented in [83], the 4-processor scheme for
the extended twisted Edwards curve presented in [37], the 4-processor Montgomery Ladder
method on the Montgomery curve that is obtained from [37], the 4-processor Montgomery
Ladder method on the Montgomery curve that is shown in Figure 5.9, and the 8-processor
scheme for the extended twisted Edwards curve that is shown in Figure 5.10 in terms of the
computational time complexities for the prime fields that are recommended by NIST [16].
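
The totals in Table 5.6 can be reproduced from the per-round costs in Table 5.5. The Python sketch below does so under two assumptions that are inferred from the tabulated values rather than stated in the text: fractional results are rounded up, and the schemes other than the proposed one process s − 1 scalar bits while the proposed scheme processes s bits. The function name is hypothetical.

```python
def total_cost(bits, m_per_round, s_per_round):
    """Total (M, S) for `bits` scalar bits, given the cost of one 3-bit round.

    Fractions are rounded up (an assumption that matches Table 5.6).
    """
    def ceil_div(a, b):
        return -(-a // b)

    return ceil_div(bits * m_per_round, 3), ceil_div(bits * s_per_round, 3)


if __name__ == "__main__":
    for s in (192, 224, 256, 384, 521):
        rows = {
            "Jacobian, 4 processors [83]": total_cost(s - 1, 3, 10),
            "Ext. twisted Edwards, 4 processors [37]": total_cost(s - 1, 5, 3),
            "Montgomery Ladder [37]": total_cost(s - 1, 6, 6),
            "Montgomery Ladder (Figure 5.9)": total_cost(s - 1, 6, 3),
            "Proposed, 8 processors": total_cost(s, 5, 1),
        }
        for name, (m, sq) in rows.items():
            print(f"s = {s:3d}  {name:40s} {m}M+{sq}S")
```

For s = 192 the proposed scheme needs 64 rounds of 5M + 1S, i.e., 320M + 64S, which is the value listed in the table.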
5.7 Conclusion
In this chapter, a new radix-8 scalar multiplication scheme is introduced that can be used for
any elliptic curve model. It allows one to compute every three bits of the scalar with
five point arithmetic operations in a unified sequence. We showed that the properties of the
proposed scheme enhance parallelism at both the point arithmetic and the field arithmetic
levels. Further, it implicitly provides resistance against certain implementation attacks.
We applied the proposed scheme to the prime extended twisted Edwards curves for the
computation of a scalar multiplication in an 8-processor environment. We then provided the
performance estimates and the comparisons for the proposed scheme and different parallel
schemes presented in the recent papers. We further showed that to the best of the authors’
knowledge, the 8-processor scheme provided in this work is the fastest SSCA protected scalar
multiplication scheme over prime fields in the parallel environment. The proposed 8-processor
scheme provided in this work can be applied to all of the parallel hardware implementations
and also to parallel software environments such as a Cell multiprocessor [199], and ePUMA
[187].
6
Summary and Future Work
In this thesis, we have investigated the two lowest operational levels in the elliptic curve
hierarchical scheme, namely, the finite field arithmetic level and the point arithmetic level.
We aim to provide new hardware designs for the arithmetic in ECC crypto-systems.
After identifying the motivation and the objectives of this research in Chapter 1 and presenting
the introduction and background in Chapter 2, we present novel serial-out bit-level multiplication
schemes in Chapter 3. Then, in Chapter 4, we extend the proposed serial-out bit-level schemes to
hybrid-double multiplication schemes that allow performing two multiplications simultaneously. We
also present a novel scheme for the elliptic curve scalar multiplication (ECSM) operation in
Chapter 5. The following summarizes the contributions of this work.
• In Chapter 3, we have studied the finite field multiplication operation over $\mathbb{F}_{2^m}$. The
specific contributions presented in this chapter are summarized as follows:
1. We have proposed a novel serial-out bit-level (SOBL) multiplier scheme that is
constructed by an ω-nomial irreducible polynomial. We then obtained a further
optimized SOBL multiplier scheme for the irreducible trinomial. We showed that
the two proposed multiplier schemes are faster than the previously published SOBL
schemes.
2. We have further analysed the SOBL schemes, and proposed a compact bit-level
multiplication scheme that is suitable for resource-constrained devices such as RFID
tags. We showed that this proposed scheme can provide about 24-26% reduction
in area complexity cost and about 21-22% reduction in power consumption
for $\mathbb{F}_{2^{163}}$ compared to the current state-of-the-art bit-level multiplier schemes.
• In Chapter 4, which has been submitted for publication in [98], we employed the three
proposed SOBL schemes to present, to our knowledge, the first approach for a hybrid-double
multiplication architecture in the polynomial basis representation over $\mathbb{F}_{2^m}$. In
addition, we extended the traditional parallel-out bit-level (POBL) multiplier schemes to
propose two new low-complexity and fast LSB-first/MSB-first POBL double multiplication
architectures, which perform two multiplications.
• In Chapter 5, which has been submitted for publication in [99], we have studied the
ECSM algorithms. The specific contributions presented in this chapter are summarized
as follows:
1. We proposed a novel approach for computing ECSM that can be used on any abelian
group. We analysed the security of our approach and showed that its security
holds against both simple side-channel attacks and safe-error attacks.
2. We employed the proposed approach for computing the scalar multiplication on a
prime extended twisted Edwards curve model incorporating 8 parallel operations.
We showed that, in comparison to the other simple side-channel attack protected
schemes over $\mathbb{F}_p$, the proposed design for the extended twisted Edwards curve model
is the fastest scalar multiplication scheme reported in the literature.
6.1 Future Work
The research presented in this thesis can serve as the base for several future research directions.
In Chapter 3, all the proposed multiplier schemes have a bit-level structure, which provides
the most efficient design structure in terms of area and power requirements for hardware
implementation. However, this structure is quite slow. One future research direction for the
finite field arithmetic is to extend the proposed schemes to a digit-level multiplier structure.
The digit size can be analysed in order to achieve the best tradeoff between area, power and speed.
In addition, the compact finite field multiplication scheme is ideal for the implementation of an
ECC processor in resource-constrained devices.
In Chapter 4, as the demand for flexible architecture solutions to the finite field
multiplication operation increases, another future direction is to extend the proposed
hybrid-double multiplication architecture into a semi-versatile hybrid-double architecture
that operates on the five irreducible polynomials that are recommended by NIST. Further, a
potential research area is to employ resource sharing techniques in the proposed hybrid-double
multiplication architecture to further reduce the area requirements and the power consumption.
Furthermore, since the MSB-first multiplier has less power consumption than the LSB-first
multiplier, a future research direction for the SOBL structure is to obtain the output from
the most significant bit first.
In Chapter 5, we proposed a novel scheme for the ECSM operation. We employed the
proposed scheme for computing the scalar multiplication on a prime extended twisted Edwards
curve model. Future work in this direction may include the discussion of applying the proposed
scheme to other elliptic curve models such as the Koblitz curve model. It would also be interesting
to investigate whether the proposed scheme can speed up the scalar multiplication over fields
of characteristic three.
Bibliography
[1] Hankerson, D., Menezes, A., Vanstone, S.: Guide to Elliptic Curve Cryptography.
Springer-Verlag, New York, Inc. (January 2004)
[2] Frey, G.: Applications of Arithmetical Geometry to Cryptographic Constructions. In: Proc.
of 5th Int’l Conference on the Finite Fields and Applications, Fq 5. LNCS, pp. 128 -161
(August 1999)
[3] Gaudry, P., Hess, F., Smart, N. P.: Constructive and Destructive Facets of Weil Descent on
Elliptic Curves. In: J. of Cryptology. vol. 15(1), pp. 19-46, Springer-Verlag (March 2002)
[4] Maurer, M., Menezes, A., Teske, E.: Analysis of the GHS Weil Descent Attack on the
ECDLP over Characteristic Two Finite Fields of Composite Degree. In: Proc. of Int’l
Conference on Cryptology in India, INDOCRYPT 2001, LNCS, vol. 2247, pp. 195- 213
(December 2001)
[5] William, C. B., Elaine, B.: Nat’l Inst. of Standards and Technology (NIST) Special
Publication 800-67: Recommendation for the Triple Data Encryption Algorithm (TDEA) Block
Cipher-Revision 1 (January 2012)
[6] Nat’l Inst. of Standards and Technology (NIST). Federal Information Processing Standards
(FIPS) Publication 197: Announcing the Advanced Encryption Standard (AES) Specification
(November 2001)
[7] Rivest, R. L.: The RC5 Encryption Algorithm. In: Proc. of 2nd Int’l Workshop: Fast
Software Encryption. LNCS, vol. 1008, pp. 86-96 (December 1994)
[8] Lai, X., Massey, J. L.: A Proposal for a New Block Encryption Standard. In: Proc. of
Workshop on the Theory and Application of Cryptographic Techniques: Advances in Cryptology,
EUROCRYPT 90. LNCS, vol. 473, pp. 389-404 (May 1990)
[9] Jevons, W. S.: The Principles of Science: A Treatise on Logic and Scientific Method.
London, Macmillan Co. (1913)
[10] Diffie, W., Hellman, M. E.: New Directions in Cryptography. In: IEEE Transactions on
Information Theory, vol. IT-22(6), pp. 644 - 654 (November 1976)
[11] Diffie, W., Hellman, M. E.: Multiuser Cryptographic Techniques. In: Proc. of American
Federation of Information Processing Societies (AFIPS ’76), pp. 109 -112 (June 1976)
[12] Rivest, R. L., Shamir, A., Adleman, L.: A Method for Obtaining Digital Signatures and
Public-Key Cryptosystems. In: J. of the ACM Communications, vol. 21(2), pp. 120 -126
(February 1978)
[13] ElGamal, T.: A Public Key Cryptosystem and a Signature Scheme Based on Discrete
Logarithms. In: IEEE Transactions on Information Theory, vol. 31(4), pp. 469 - 472 (July
1985)
[14] Jonsson, J., Kaliski, B.: Public-Key Cryptography Standards (PKCS) #1: RSA
Cryptography Specifications Version 2.1, RFC 3447, Internet Engineering Task Force (IETF)
(February 2003)
[15] RSA Labs, PKCS #1 v2.1: RSA Cryptography Standard, RSA Security Inc. Available
from URL: http://www.rsasecurity.com/rsalabs/node.asp?id=2125 (June 2002)
[16] Nat’l Inst. of Standards and Technology (NIST). Federal Information Processing
Standards (FIPS) Publication 186-4: Digital Signature Standard (DSS) (July 2013)
[17] American National Standards Inst. (ANSI) X9.62: Public Key Cryptography for The
Financial Services Industry: The Elliptic Curve Digital Signature Algorithm (ECDSA)
(November 2005)
[18] Miller, V. S.: Use of Elliptic Curves in Cryptography. In: Proc. of Advances in
Cryptology, CRYPTO ’85. LNCS, vol. 218, pp. 417-426 (August 1985)
[19] Koblitz, N.: Elliptic Curve Cryptosystems. In: J. of Mathematics of Computation,
American Mathematical Society. vol. 48(177), pp. 203-209 (January 1987)
[20] Lochter, M., Merkle, J.: Elliptic Curve Cryptography (ECC) Brainpool Standard Curves
and Curve Generation. RFC5639. Available from URL: http://tools.ietf.org/html/rfc5639
(March 2010)
[21] IEEE Std 1363-2000: Draft Standard for Specifications for Password based Public Key
Cryptographic Techniques (2007)
[22] Standards for Efficient Cryptography Group (SECG). Certicom Research. SEC 2:
Recommended Elliptic Curve Domain Parameters, Version 2.0 (January 2010)
[23] ISO/IEC 15946-1 to 5: Information Technology Security Techniques Cryptographic
Techniques Based on Elliptic Curves Parts 1 to 5 (2002-2009)
[24] National Security Agency (NSA). Suite B Cryptography / Cryptographic Interoperability.
Available from URL: http://www.nsa.gov/ia/programs/suiteb_cryptography/ (January
2009)
[25] URL: www.certicom.com (visited on May 2013)
[26] Lidl, R., Niederreiter, H.: Introduction to Finite Fields and Their Applications. Revised
Ed., Cambridge University Press, Cambridge, UK (August 1994)
[27] Blahut, R. E.: Theory and Practice of Error Control Codes. 1st Ed., Addison-Wesley Pub.
Co. (May 1983)
[28] Menezes, A. J., Blake, I. F., Gao, X., Mullin, R. C., Vanstone, S. A., Yaghoobian, T.:
Applications of Finite Fields. 1st Ed., Kluwer Academic Pub., Boston, MA (1993)
[29] Blahut, R. E.: Fast Algorithms for Digital Signal Processing. Addison-Wesley Pub. Co.,
Reading, MA (September 1985)
[30] Reyhani-Masoleh, A.: A New Bit-Serial Architecture for Field Multiplication Using
Polynomial Bases. In: Proc. of 10th Int’l Workshop: Cryptographic Hardware and Em-
bedded Systems, CHES 2008. LNCS, vol. 5154, pp. 300 -314 (August 2008)
[31] Beth, T., Gollman, D.: Algorithm Engineering for Public Key Algorithms. In: IEEE J. on
Selected Areas in Communications, vol. 7(4), pp. 458- 466 (May 1989)
[32] Azarderakhsh, R., Reyhani-Masoleh, A.: Low-Complexity Multiplier Architectures for
Single and Hybrid-Double Multiplications in Gaussian Normal Bases. In: IEEE Transac-
tions on Computers, vol. 62(4), pp. 744 -757 (April 2013)
[33] Azarderakhsh, R., Järvinen, K., Dimitrov, V. S.: Fast Inversion in GF(2^m) with Normal
Basis Using Hybrid-Double Multipliers. In: IEEE Transactions on Computers, in process.
[34] Reyhani-Masoleh, A.: Efficient Algorithms and Architectures for Field Multiplication
Using Gaussian Normal Bases. In: IEEE Transactions on Computers, vol. 55(1), pp. 34 -
47 (January 2006)
[35] Katti, R., Brennan, J.: Low Complexity Multiplication in a Finite Field Using Ring Rep-
resentation. In: IEEE Transactions on Computers, vol. 52(4), pp. 418- 427 (April 2003)
[36] Bos, J. W.: On the Cryptanalysis of Public-Key Cryptography. PhD thesis,
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland (2012)
[37] Hisil, H., Wong, K. K. -H., Carter, G., Dawson, E.: Twisted Edwards Curves Revisited.
In: Proc. of 14th Int’l Conference Theory and Application of Cryptology and Information
Security: Advances in Cryptology, ASIACRYPT 2008, LNCS, vol. 5350, pp. 326-343
(December 2008)
[38] Kocher, P. C.: Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS and
Other Systems. In: Proc. of 16th Int’l Cryptology Conference: Advances in Cryptology,
CRYPTO ’96. LNCS, vol. 1109, pp. 104 -113 (August 1996)
[39] Kocher, P. C., Jaffe, J., Jun, B.: Differential Power Analysis. In: Proc. of 19th Int’l Cryp-
tology Conference: Advances in Cryptology, CRYPTO ’99. LNCS, vol. 1666, pp. 388-397
(August 1999)
[40] Coron, J. -S.: Resistance Against Differential Power Analysis for Elliptic Curve Cryp-
tosystems. In: Proc. of 1st Int’l Workshop: Cryptographic Hardware and Embedded Sys-
tems, CHES ’99. LNCS, vol. 1717, pp. 292-302 (August 1999)
[41] Biehl, I., Meyer, B., Müller, V.: Differential Fault Attacks on Elliptic Curve Cryptosystems.
In: Proc. of 20th Int’l Cryptology Conference: Advances in Cryptology, CRYPTO
2000. LNCS, vol. 1880, pp. 131-146 (August 2000)
[42] Yen, S. -M., Joye, M.: Checking Before Output may not be Enough Against Fault-Based
Cryptanalysis. In: IEEE Transactions on Computers, vol. 49(9), pp. 967-970 (September
2000)
[43] Avanzi, R. M.: Side Channel Attacks on Implementations of Curve-Based Cryptograph-
ic Primitives. In: IACR, Cryptology Eprint Archive, 2005/017. Available from URL:
http://eprint.iacr.org/2005/017 (January 2005)
[44] Kumar, S., Wollinger, T., Paar, C.: Optimum Digit Serial GF(2m) Multipliers for Curve-
Based Cryptography. In: IEEE Transactions on Computers, vol. 55(10), pp. 1306 -1311
(October 2006)
[45] Reyhani-Masoleh, A., Hasan, M. A.: Low Complexity Bit Parallel Architectures for Poly-
nomial Basis Multiplication over GF(2m). In: IEEE Transactions on Computers, vol. 53(8),
pp. 945-959 (August 2004)
[46] Mastrovito, E. D.: VLSI Architectures for Computations in Galois Fields. PhD thesis,
Linköping University, Linköping, Sweden (1991)
[47] Wu, H.: Bit-Parallel Finite Field Multiplier and Squarer Using Polynomial Basis. In:
IEEE Transactions on Computers, vol. 51(7), pp. 750 -758 (July 2002)
[48] Massey, J. L., Omura, J. K.: Computational Method and Apparatus for Finite Field Arith-
metic. US Patent No. 4587627.A (May 1986)
[49] Koç, Ç. K., Sunar, B.: Low-Complexity Bit-Parallel Canonical and Normal Basis Multipliers
for a Class of Finite Fields. In: IEEE Transactions on Computers, vol. 47(3), pp.
353-356 (March 1998)
[50] Sunar, B., Koç, Ç. K.: An Efficient Optimal Normal Basis Type II Multiplier. In: IEEE
Transactions on Computers, vol. 50(1), pp. 83-87 (January 2001)
[51] Namin, A. H., Wu, H., Ahmadi, M.: High-Speed Architectures for Multiplication Using
Reordered Normal Basis. In: IEEE Transactions on Computers, vol. 61(2), pp. 164 -172
(February 2012)
[52] Fan, H., Hasan, M. A.: Fast Bit Parallel-Shifted Polynomial Basis Multipliers in
GF(2^n). In: IEEE Transactions on Circuits and Systems I, vol. 53(12), pp. 2606-2615
(December 2006)
[53] Song, L., Parhi, K. K.: Low-Energy Digit-Serial/Parallel Finite Field Multipliers. In: J.
of VLSI signal processing systems for signal, image and video technology, vol. 19(2), pp.
149 -166, Kluwer Academic Pub. (July 1998)
[54] Yeh, C. -S., Reed, I. S., Truong, T. K.: Systolic Multipliers for Finite Fields GF(2m). In:
IEEE Transactions on Computers, vol. C-33(4), pp. 357-360 (April 1984)
[55] Meher, P. K.: Systolic and Non-Systolic Scalable Modular Designs of Finite Field Multipliers
for Reed-Solomon Codec. In: IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 17(6), pp. 747-757 (June 2009)
[56] Mastrovito, E. D.: VLSI Designs for Multiplication over Finite Fields GF(2m). In: Proc.
of 6th Int’l Conference: Applied Algebra, Algebraic Algorithms and Error-Correcting
Codes, AAECC-6. LNCS, vol. 357, pp. 297-309 (July 1988)
[57] Wu, H., Hasan, M. A., Blake, I. F.: New Low-Complexity Bit-Parallel Finite Field Mul-
tipliers Using Weakly Dual Bases. In: IEEE Transactions on Computers, vol. 47(11), pp.
1223 -1234 (November 1998)
[58] Rodríguez-Henríquez, F., Koç, Ç. K.: Parallel Multipliers Based on Special Irreducible
Pentanomials. In: IEEE Transactions on Computers, vol. 52(12), pp. 1535-1542 (December
2003)
[59] Koç, Ç. K., Acar, T.: Montgomery Multiplication in GF(2^k). In: J. of Designs, Codes and
Cryptography. Kluwer Academic Pub., vol. 14(1), pp. 57-69 (April 1998)
[60] Lee, C. -Y., Horng, J. -S., Jou, I. -C., Lu, E. -H.: Low-Complexity Bit-Parallel
Systolic Montgomery Multipliers for Special Classes of GF(2^m). In: IEEE Transactions
on Computers, vol. 54(9), pp. 1061-1070 (September 2005)
[61] Sunar, B., Koç, Ç. K.: Mastrovito Multiplier for All Trinomials. In: IEEE Transactions on
Computers, vol. 48(5), pp. 522-527 (May 1999)
[62] Xie, J., Meher, P. K., He, J.: Low-Complexity Multiplier for GF(2^m) Based
on All-One Polynomials. In: IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 21(1), pp. 168-173 (January 2013)
[63] Möller, B.: Securing Elliptic Curve Point Multiplication Against Side-Channel Attacks.
In: Proc. of 4th Int’l Conference: Information Security, ISC 2001. LNCS, vol. 2200, pp.
324-334 (October 2001)
[64] Okeya, K., Takagi, T.: The Width-w NAF Method Provides Small Memory and Fast
Elliptic Scalar Multiplications Secure Against Side Channel Attacks. In: Proc. of The
Cryptographers’ Track at the RSA: Topics in Cryptology, CT-RSA 2003. LNCS, vol. 2612,
pp. 328-343 (April 2003)
[65] Dimitrov, V. S., Jullien, G. A., Miller, W. C.: Theory and Applications for a Double-Base
Number System. In: Proc. of 13th IEEE Symposium on Computer Arithmetic (ARITH
1997), pp. 44 -51 (July 1997)
[66] Dimitrov, V. S., Imbert, L., Mishra, P. K.: Efficient and Secure Elliptic Curve Point Mul-
tiplication Using Double-Base Chains. In: Proc. of 11th Int’l Conference on the Theory
and Application of Cryptology and Information Security: Advances in Cryptology, ASI-
ACRYPT 2005. LNCS, vol. 3788, pp. 59 -78 (December 2005)
[67] Adikari, J., Dimitrov, V. S., Imbert, L.: Hybrid Binary-Ternary Number System for El-
liptic Curve Cryptosystems. In: IEEE Transactions on Computers, vol. 60(2), pp. 254 -265
(February 2011)
[68] Ciet, M., Joye, M., Lauter, K., Montgomery, P. L.: Trading Inversions for Multiplica-
tions in Elliptic Curve Cryptography. In: J. of Designs, Codes and Cryptography. Kluwer
Academic Pub., vol. 39(2), pp. 189 -206 (May 2006)
[69] Booth, A. D.: A Signed Binary Multiplication Technique. In: Quarterly J. of Mechanics
and Applied Mathematics, vol. 4(2), pp. 236 -240 (August 1951)
[70] Okeya, K., Schmidt-Samoa, K., Spahn, C., Takagi, T.: Signed Binary Representations
Revisited. In: Proc. of 24th Int’l Cryptology Conference: Advances in Cryptology, CRYP-
TO 2004. LNCS, vol. 3152, pp. 123 -139 (August 2004)
[71] Reitwiesner, G. W.: Binary Arithmetic. In: Advances in Computers. Elsevier, vol. 1, pp.
231-308 (1960)
[72] Arno, S., Wheeler, F. S.: Signed Digit Representations of Minimal Hamming Weight. In:
IEEE Transactions on Computers, vol. 42(8), pp. 1007-1010 (August 1993)
[73] Solinas, J. A.: Efficient Arithmetic on Koblitz Curves. In: J. of Designs, Codes and
Cryptography. Kluwer Academic Pub., vol. 19(2-3), pp. 195-249 (March 2000)
[74] Koblitz, N.: CM-Curves with Good Cryptographic Properties. In: Proc. of Advances in
Cryptology, CRYPTO ’91. LNCS, vol. 576, pp. 279-287 (1991)
[75] Joye, M., Quisquater, J. -J.: Hessian Elliptic Curves and Side-Channel Attacks. In: Proc.
of 3rd Int’l Workshop: Cryptographic Hardware and Embedded Systems, CHES 2001.
LNCS, vol. 2162, pp. 402- 410 (May 2001)
[76] Edwards, H. M.: A Normal Form for Elliptic Curves. In: J. of Bulletin, American Math-
ematical Society. vol. 44(3), pp. 393 - 422 (July 2007)
[77] Joye, M., Tibouchi, M., Vergnaud, D.: Huff’s Model for Elliptic Curves. In: Proc. of
9th Int’l Symposium: Algorithmic Number Theory, ANTS-IX 2010. LNCS, vol. 6197, pp.
234 -250 (July 2010)
[78] Billet, O., Joye, M.: The Jacobi Model of an Elliptic Curve and Side-Channel Analysis.
In: Proc. of 15th Int’l Symposium: Applied Algebra, Algebraic Algorithms and Error-
Correcting Codes, AAECC-15. LNCS, vol. 2643, pp. 34 - 42 (May 2003)
[79] Knudsen, E. W.: Elliptic Scalar Multiplication Using Point Halving. In: Proc. of Int’l
Conference on the Theory and Application of Cryptology and Information Security: Advances
in Cryptology, ASIACRYPT ’99. LNCS, vol. 1716, pp. 135-149 (November 1999)
[80] Longa, P., Miri, A.: New Composite Operations and Precomputation Scheme for Elliptic
Curve Cryptosystems over Prime Fields. In: Proc. of 11th Int’l Workshop on Practice and
Theory in Public-Key Cryptography, PKC 2008. LNCS, vol. 4939, pp. 229 -247 (March
2008)
[81] Mishra, P. K.: Pipelined Computation of Scalar Multiplication in Elliptic Curve Cryptosystems
(Extended Version). In: IEEE Transactions on Computers, vol. 55(8), pp. 1000-1010
(August 2006)
[82] Azarderakhsh, R., Reyhani-Masoleh, A.: Efficient FPGA Implementations of Point Mul-
tiplication on Binary Edwards and Generalized Hessian Curves Using Gaussian Normal
Basis. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20(8),
pp. 1453 -1466 (August 2012)
[83] Longa, P., Miri, A.: Fast and Flexible Elliptic Curve Point Arithmetic over Prime Fields.
In: IEEE Transactions on Computers, vol. 57(3), pp. 289 -302 (March 2008)
[84] Koyama, K., Tsuruoka, Y.: Speeding up Elliptic Cryptosystems by Using a Signed Binary
Window Method. In: Proc. of 12th Int’l Cryptology Conference: Advances in Cryptology,
CRYPTO ’92. LNCS, vol. 740, pp. 345-357 (August 1992)
[85] Izu, T., Takagi, T.: Fast Elliptic Curve Multiplications with SIMD Operations. In: Proc.
of 4th Int’l Conference: Information and Communications Security, ICICS 2002. LNCS,
vol. 2513, pp. 217-230 (December 2002)
[86] Fischer, W., Giraud, C., Knudsen, E. W., Seifert, J. -P.: Parallel Scalar Multiplica-
tion on General Elliptic Curves over Fp Hedged Against Non-Differential Side-Channel
Attacks. IACR, Cryptology Eprint Archive, 2002/007, 2002. Available from URL:
http://eprint.iacr.org/2002/007
[87] Aoki, K., Hoshino, F., Kobayashi, T., Oguro, H.: Elliptic Curve Arithmetic Using SIMD.
In: Proc. of 4th Int’l Conference: Information Security, ISC 2001. LNCS, vol. 2200, pp.
235-247 (October 2001)
[88] Smart, N. P.: The Hessian Form of an Elliptic Curve. In: Proc. of 3rd Int’l Workshop:
Cryptographic Hardware and Embedded Systems, CHES 2001. LNCS, vol. 2162, pp. 118-
125 (May 2001)
[89] Eberle, H., Gura, N., Chang-Shantz, S.: A Cryptographic Processor for Arbitrary Elliptic
Curves over GF(2m). In: Proc. of 14th IEEE Int’l Conference on Application-Specific
Systems, Architectures, and Processors (ASAP), pp. 444 - 454 (June 2003)
[90] Daly, A., Marnane, W., Kerins, T., Popovici, E.: An FPGA Implementation of a GF(p)
ALU for Encryption Processors. In: J. of Microprocessors and Microsystems, vol. 28(5-6),
pp. 253 -260, Elsevier Science (August 2004)
[91] Blake, I., Seroussi, G., Smart, N.: Elliptic Curves in Cryptography. In: London Math-
ematical Society Lecture Note Series, Cambridge University Press, Cambridge (August
1999)
[92] Silverman, J. H.: The Arithmetic of Elliptic Curves. In: Graduate Texts in Mathematics,
vol. 106, Springer-Verlag, New York Inc. (1986)
[93] Avanzi, R., Cohen, H., Doche, C., Frey, G., Lange, T., Nguyen, K., Vercauteren, F.:
Handbook of Elliptic and Hyperelliptic Curve Cryptography. 1st Ed., Chapman &
Hall/CRC (July 2005)
[94] Blake, I. F., Seroussi, G., Smart, N. P.: Advances in Elliptic Curve Cryptography. London
Mathematical Society Lecture Note Series, 1st Ed., Cambridge University Press, New York
(April 2005)
[95] Washington, L. C.: Elliptic Curves: Number Theory and Cryptography. Series of Discrete
Mathematics and Its Applications, 1st Ed., Chapman & Hall/CRC, Boca Raton (May 2003)
[96] McEliece, R. J.: Finite Fields for Computer Scientists and Engineers. Springer Int’l Series
in Engineering and Computer Science, Kluwer Academic Pub., (November 1986)
[97] Koblitz, N.: A Course in Number Theory and Cryptography. In: Graduate Texts in Math-
ematics. 2nd Ed., Springer-Verlag (September 1994)
[98] Abdulrahman, E. A. H., Reyhani-Masoleh, A.: High-Speed Hybrid-Double Multiplica-
tion Architectures Using New Serial-Out Bit-Level Mastrovito Multipliers. Submitted for
publication in IEEE Transactions on Computers (Submitted in November 2012, revised in
July 2013)
[99] Abdulrahman, E. A. H., Reyhani-Masoleh, A.: New Regular Radix-8 Scheme for Elliptic
Curve Scalar Multiplication Without Pre-computation. Accepted for publication in IEEE
Transactions on Computers, 14 pages in total (2013)
[100] Eisenbarth, T., Kumar, S., Paar, C., Poschmann, A., Uhsadel L.: A Survey of
Lightweight-Cryptography Implementations. In: IEEE Design & Test of ICs for Secure
Embedded Computing, vol. 24(6), pp. 522-533 (December 2007)
[101] Menezes, A. J., van Oorschot, P. C., Vanstone, S. A.: Handbook of Applied Cryptogra-
phy. In: Discrete Mathematics and Its Applications, Cambridge University Press. 1st Ed.,
CRC Press (October 1996)
[102] Stinson, D. R.: Cryptography: Theory and Practice. Series of Discrete Mathematics and
Its Applications, 3rd Ed., Chapman & Hall/CRC, Boca Raton (November 2005)
[103] de Dormale, G. M.: Destructive and Constructive Aspects of Efficient Algorithms and
Implementation of Cryptographic Hardware. PhD thesis, Université catholique de Louvain,
Belgium (October 2007)
[104] National Security Agency (NSA). The Case for Elliptic Curve Cryptography. Available
from URL: http://www.nsa.gov/business/programs/elliptic_curve.shtml (2009)
[105] Boneh, D.: Twenty Years of Attacks on the RSA Cryptosystem. In: Notices of the
American Mathematical Society vol. 46(2), pp. 203 -213 (February 1999)
[106] Yan, S. Y.: Cryptanalytic Attacks on RSA. Springer US (2008)
[107] Lenstra, A. K., Verheul, E. R.: Selecting Cryptographic Key Sizes. In: J. of Cryptology.
vol. 14(4), pp. 255-293, Springer-Verlag (January 2001)
[108] Trappe, W., Washington, L. C.: Introduction to Cryptography with Coding Theory. 2nd
Ed., Pearson, (July 2005)
[109] Lenstra, H. W.: Factoring Integers with Elliptic Curves. In: J. Annals of Mathematics.
Second Series, vol. 126(3), pp. 649 - 673 (November 1987)
[110] Bernstein, D. J., Lange, T.: Explicit-formulas database. Joint Work by Bernstein,
D. J., and Lange, T., Building on Work by Many Authors. Available from URL:
http://www.hyperelliptic.org/EFD/, (2013)
[111] Cohen, H., Miyaji, A., Ono, T.: Efficient Elliptic Curve Exponentiation Using Mixed
Coordinates. In: Proc. of Int’l Conference on the Theory and Application of Cryptology
and Information Security: Advances in Cryptology, ASIACRYPT ’98, LNCS, vol. 1514,
pp. 51- 65 (October 1998)
[112] Verneuil, V.: Elliptic Curve Cryptography and Security of Embedded Devices. PhD
thesis, Universite´ de Bordeaux, Bordeaux, France (September 2012)
[113] López, J., Dahab, R.: Improved Algorithms for Elliptic Curve Arithmetic in GF(2^n). In:
Proc. of 5th Int’l Workshop: Selected Areas in Cryptography, SAC ’98. LNCS, vol. 1556,
pp. 201-212 (August 1998)
[114] López, J., Dahab, R.: Fast Multiplication on Elliptic Curves over GF(2^m) Without
Precomputation. In: Proc. of 1st Int’l Workshop: Cryptographic Hardware and Embedded
Systems, CHES ’99. LNCS, vol. 1717, pp. 316-327 (August 1999)
[115] Mishra, P. K., Sarkar, P.: Application of Montgomery’s Trick to Scalar Multiplication for
Elliptic and Hyperelliptic Curves Using a Fixed Base Point. In: Proc. of 7th Int’l Workshop
on Theory and Practice in Public Key Cryptography, PKC 2004. LNCS, vol. 2947, pp. 41-54
(March 2004)
[116] Aranha, D. F., Faz-Hernández, A., López, J., Rodríguez-Henríquez, F.: Faster
Implementation of Scalar Multiplication on Koblitz Curves. In: Proc. of 2nd Int’l Conference on
Cryptology and Information Security in Latin America: Progress in Cryptology,
LATINCRYPT 2012. LNCS, vol. 7533, pp. 177-193 (October 2012)
[117] Brown, M., Hankerson, D., López, J., Menezes, A.: Software Implementation of the
NIST Elliptic Curves over Prime Fields. In: Proc. of The Cryptographers’ Track at RSA
Conference: Topics in Cryptology, CT-RSA 2001. LNCS, vol. 2020, pp. 250-265 (April
2001)
[118] Galbraith, S. D., Lin, X., Scott, M.: Endomorphisms for Faster Elliptic Curve Cryptog-
raphy on a Large Class of Curves. In: J. of Cryptology. Springer-Verlag, vol. 24(3), pp.
446 - 469 (July 2011)
[119] Bernstein, D. J., Duif, N., Lange, T., Schwabe, P., Yang, B. -Y.: High-Speed High-
Security Signatures. In: J. of Cryptographic Engineering. Springer-Verlag, vol. 2(2), pp.
77-89 (September 2012)
[120] Taverne, J., Faz-Hernández, A., Aranha, D. F., Rodríguez-Henríquez, F., Hankerson,
D., López, J.: Speeding Scalar Multiplication over Binary Elliptic Curves Using the
New Carry-Less Multiplication Instruction. In: J. of Cryptographic Engineering.
Springer-Verlag, vol. 1(3), pp. 187-199 (November 2011)
[121] Longa, P., Gebotys, C.: Efficient Techniques for High-Speed Elliptic Curve Cryptogra-
phy. In: Proc. of 12th Int’l Workshop: Cryptographic Hardware and Embedded Systems,
CHES 2010. LNCS, vol. 6225, pp. 80 -94 (August 2010)
[122] Gaudry, P., Thomé, E.: The mpFq Library and Implementing Curve-Based
Key Exchange. In: Proc. of Software Performance Enhancement of
Encryption and Decryption (SPEED 2007). pp. 49-64, Available from URL:
http://www.hyperelliptic.org/speed/record.pdf (June 2007)
[123] Knuth, D. E.: The Art of Computer Programming, Volume 1: Fundamental Algorithms.
3rd Ed., Addison-Wesley Pub. Co., MA (1969)
[124] IEEE Standard Specifications for Password-Based Public-Key Cryptographic Tech-
niques, IEEE Standard 1363.2-2008 (January 2009)
[125] Brauer, A.: On Addition Chains. In: Bulletin of the American Mathematical Society.
vol. 45(10), pp. 736-739 (1939)
[126] Han, D. -G., Takagi, T.: Some Analysis of Radix-r Representations. In: IACR,
Cryptology Eprint Archive, 2005/402. Available from URL: http://eprint.iacr.org/2005/402
(November 2005)
[127] Thurber, E. G.: On Addition Chains l(mn) ≤ l(n) - b and Lower Bounds for c(r). In:
Duke Mathematical J., vol. 40(4), pp. 907-913 (1973)
[128] Järvinen, K.: Optimized FPGA-Based Elliptic Curve Cryptography Processor for
High-Speed Applications. In: Integration, the VLSI J., Elsevier, vol. 44(4), pp. 270-279
(September 2011)
[129] Wright, P.: Spy Catcher: The Candid Autobiography of a Senior Intelligence Officer.
Book Club ed., Viking Press (July 1987)
[130] van Eck, W.: Electromagnetic Radiation From Video Display Units: an Eavesdropping
Risk?. In: J. of Computers and Security. vol. 4(4), pp. 269 -286, Elsevier Advanced Tech-
nology Publications Oxford, UK (December 1985)
[131] Boneh, D., DeMillo, R. A., Lipton, R. J.: On The Importance of Checking
Cryptographic Protocols for Faults. In: Proc. of Int’l Conference on the Theory and Application
of Cryptographic Techniques: Advances in Cryptology, EUROCRYPT ’97. LNCS,
vol. 1233, pp. 37-51 (May 1997)
[132] Kocher, P., Jaffe, J., Jun, B.: Introduction to Differential Power Analysis and Related
Attacks. In: Technical report, Cryptography Research Inc. Patent Pending. Available from
URL: http://www.cryptography.com/dpa/technical/index.html. (1998)
[133] Mangard, S., Oswald, E., Popp, T.: Power Analysis Attacks: Revealing the Secrets of
Smart Cards. Springer-Verlag, New York (2007)
[134] Chabrier, T., Pamula, D., Tisserand, A.: Hardware Implementation of DBNS Recoding
for ECC Processor. In: Proc. of 44th IEEE Conference on Signals, Systems and Computers
(ASILOMAR), pp. 1129 -1133 (November 2010)
[135] Clavier, C., Joye, M.: Universal Exponentiation Algorithm: A First Step towards
Provable SPA-Resistance. In: Proc. of 3rd Int’l Workshop: Cryptographic Hardware and
Embedded Systems, CHES 2001. LNCS, vol. 2162, pp. 300-308 (May 2001)
[136] Ciet, M.: Aspects of Fast and Secure Arithmetics for Elliptic Curve Cryptography.
PhD thesis, Louvain-la-Neuve, Belgium (2003)
[137] Brier, É., Joye, M.: Weierstraß Elliptic Curves and Side-Channel Attacks. In: Proc.
of 5th Int’l Workshop on Practice and Theory in Public Key Cryptosystems, PKC 2002.
LNCS, vol. 2274, pp. 335-345 (February 2002)
[138] Bellezza, A.: Countermeasures Against Side-Channel Attacks for Elliptic Curve
Cryptosystems. In: IACR, Cryptology Eprint Archive, Report 2001/103. Available from URL:
http://eprint.iacr.org/eprint-bin/cite.pl?entry=2001/103 (November 2001)
[139] Bernstein, D. J., Lange, T.: Faster Addition and Doubling on Elliptic Curves. In: Proc.
of 13th Int’l Conference on the Theory and Application of Cryptology and Information
Security: Advances in Cryptology, ASIACRYPT 2007. LNCS, vol. 4833, pp. 29 -50 (De-
cember 2007)
[140] Bernstein, D. J., Lange, T., Farashahi, R. R.: Binary Edwards Curves. In: Proc. of 10th
Int’l Workshop: Cryptographic Hardware and Embedded Systems, CHES 2008. LNCS,
vol. 5154, pp. 244 -265 (August 2008)
[141] Bernstein, D. J., Lange, T.: Inverted Edwards Coordinates. In: Proc. of 17th Int’l Sym-
posium: Applied Algebra, Algebraic Algorithms and Error-Correcting Codes, AAECC-17.
LNCS, vol. 4851, pp. 20 -27 (December 2007)
[142] Farashahi, R. R., Joye, M.: Efficient Arithmetic on Hessian Curves. In: Proc. of 13th
Int’l Conference on Practice and Theory in Public Key Cryptography, PKC 2010. LNCS,
vol. 6056, pp. 243 -260 (May 2010)
[143] Liardet, P. -Y., Smart, N. P.: Preventing SPA/DPA in ECC Systems Using the Jacobi
Form. In: Proc. of 3rd Int’l Workshop: Cryptographic Hardware and Embedded Systems,
CHES 2001. LNCS, vol. 2162, pp. 391- 401 (May 2001)
[144] Hisil, H., Wong, K. K. -H., Carter, G., Dawson, E.: Faster Group Operations on
Elliptic Curves. In: IACR, Cryptology Eprint Archive 2007/441. Available from URL:
http://eprint.iacr.org/eprint-bin/cite.pl?entry=2007/441 (2007)
[145] Chevallier-Mames, B., Ciet, M., Joye, M.: Low-Cost Solutions for Preventing Simple
Side-Channel Analysis: Side-Channel Atomicity. In: IEEE Transactions on Computers,
vol. 53(6), pp. 760 -768 (June 2004)
[146] Giraud, C., Verneuil, V.: Atomicity Improvement for Elliptic Curve Scalar Multipli-
cation. In: Proc. of 9th IFIP WG 8.8/11.2 Int’l Conference: Smart Card Research and
Advanced Application, CARDIS 2010. LNCS, vol. 6035, pp. 80 -101 (April 2010)
[147] Abarzúa, R., Thériault, N.: Complete Atomic Blocks for Elliptic Curves in Jacobian
Coordinates over Prime Fields. In: Proc. of 2nd Int’l Conference on Cryptology and
Information Security in Latin America: Progress in Cryptology, LATINCRYPT 2012. LNCS,
vol. 7533, pp. 37-55 (October 2012)
[148] Menezes, A., Teske, E., Weng, A.: Weak Fields for ECC. In: Proc. of the Cryptogra-
phers’ Track at the RSA Conference: Topics in Cryptology, CT-RSA 2004. LNCS, vol.
2964, pp. 366 -386 (February 2004)
[149] Comba, P. G.: Exponentiation Cryptosystems on the IBM PC. In: J. of IBM Systems,
vol. 29(4), pp. 526 -538 (December 1990)
[150] Karatsuba, A., Ofman, Y.: Multiplication of Multidigit Numbers on Automata. In: J. of
Soviet Physics Doklady, vol. 7(7), pp. 595-596 (January 1963)
[151] Parhami, B.: Computer Arithmetic Algorithms and Hardware Designs. Second ed., Ox-
ford University Press (2000)
[152] Koren, I.: Computer Arithmetic Algorithms. 2nd Ed., A K Peters/CRC Press (November
2001)
[153] Barrett, P.: Implementing the Rivest Shamir and Adleman Public Key Encryption Al-
gorithm on a Standard Digital Signal Processor. In: Proc. of Advances in Cryptology,
CRYPTO ’86. LNCS, vol. 263, pp. 311-323 (1987)
[154] Montgomery, P. L.: Modular Multiplication Without Trial Division. In: J. of Mathe-
matics of Computation, American Mathematical Society. vol. 44(170), pp. 519 -521 (April
1985)
[155] de Dormale, G. M., Bulens, P., Quisquater, J. -J.: Efficient Modular Division Imple-
mentation. In: Proc. of 14th Int’l Conference: Field Programmable Logic and Application,
FPL 2004. LNCS, vol. 3203, pp. 231 -240 (August 2004)
[156] Orlando, G.: Efficient Elliptic Curve Processor Architectures for Field Programmable
Logic. PhD thesis, Worcester Polytechnic Inst., Massachusetts, United States (March
2002)
[157] Wu, H.: On Complexity of Polynomial Basis Squaring in F_{2^m}. In: Proc. of 7th Int’l
Workshop: Selected Areas in Cryptography, SAC 2000. LNCS, vol. 2012, pp. 118-129
(August 2000)
[158] Halbutoğulları, A., Koç, Ç. K.: Mastrovito Multiplier for General Irreducible
Polynomials. In: IEEE Transactions on Computers, vol. 49(5), pp. 503-518 (May 2000)
[159] Zhang, T., Parhi, K. K.: Systematic Design of Original and Modified Mastrovito Mul-
tipliers for General Irreducible Polynomials. In: IEEE Transactions on Computers, vol.
50(7), pp. 734 -748 (July 2001)
[160] Afanassiev, V., Gehrmann, C., Smeets, B.: Fast Message Authentication Using Efficient
Polynomial Evaluation. In: Proc. of 4th Int’l Workshop: Fast Software Encryption, FSE
’97. LNCS, vol. 1267, pp. 190-204 (January 1997)
[161] von zur Gathen, J., Nöcker, M.: Exponentiation in Finite Fields: Theory and Practice.
In: Proc. of 12th Int’l Symposium: Applied Algebra, Algebraic Algorithms and
Error-Correcting Codes, AAECC-12. LNCS, vol. 1255, pp. 88-113 (June 1997)
[162] Paar, C.: A New Architecture for a Parallel Finite Field Multiplier with Low Complexity
Based on Composite Fields. In: IEEE Transactions on Computers, vol. 45(7), pp. 856 -861
(July 1996)
[163] Tuyls, P., Batina, L.: RFID-Tags for Anti-counterfeiting. In: Proc. of The Cryptog-
raphers’ Track at the RSA: Topics in Cryptology, CT-RSA 2006. LNCS, vol. 3860, pp.
115-131 (February 2006)
[164] Shantz, S. C.: From Euclid’s GCD to Montgomery Multiplication to The Great Divide.
In: Technical report, TR-2001-95. Sun Microsystems, Inc. (June 2001)
[165] Itoh, T., Tsujii, S.: A Fast Algorithm for Computing Multiplicative Inverses in GF(2m)
Using Normal Bases. In: J. of Information and Computation, vol. 78(3), pp. 171-177,
Elsevier Science (September 1988)
[166] Takagi, N., Yoshiki, J., Takagi, K.: A Fast Algorithm for Multiplicative Inversion in
GF(2m) Using Normal Basis. In: IEEE Transactions on Computers, vol. 50(5), pp. 394-
398 (May 2001)
[167] Deschamps, J. -P., Imaña, J. L., Sutter, G. D.: Hardware Implementation of Finite-Field
Arithmetic. Series in Electronic Engineering, 1st Ed. McGraw-Hill Professional (February
2009)
[168] Hasan, M. A., Namin, A. H., Negre, C.: Toeplitz Matrix Approach for Binary Field Mul-
tiplication Using Quadrinomials. In: IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 20(3), pp. 449 - 458 (March 2012)
[169] Wu, H.: Bit-Parallel Polynomial Basis Multiplier for New Classes of Finite Fields. In:
IEEE Transactions on Computers, vol. 57(8), pp. 1023 -1031 (August 2008)
[170] Hariri, A., Reyhani-Masoleh, A.: Bit-Serial and Bit-Parallel Montgomery Multiplication
and Squaring over GF(2m). In: IEEE Transactions on Computers, vol. 58(10), pp. 1332-
1345 (October 2009)
[171] Hsu, I. S., Truong, T. K., Deutsch, L. J., Reed, I. S.: A Comparison of VLSI Architecture
of Finite Field Multipliers Using Dual, Normal, or Standard Bases. In: IEEE Transactions
on Computers, vol. 37(6), pp. 735-739 (June 1988)
[172] Erdem, S. S., Yanık, T., Koç, Ç. K.: Polynomial Basis Multiplication over GF(2^m). In:
J. of Acta Applicandae Mathematica. Kluwer Academic Pub., vol. 93(1-3), pp. 33-55
(September 2006)
[173] Wang, C. C., Truong, T. K., Shao, H. M., Deutsch, L. J., Omura, J., Reed, I. S.: VLSI
Architectures for Computing Multiplications and Inverses in GF(2^m). In: IEEE Transactions
on Computers, vol. C-34(8), pp. 709-717 (August 1985)
[174] Scott, P. A., Tavares, S. E., Peppard, L. E.: A Fast VLSI Multiplier for GF(2m). In: IEEE
J. Selected Areas in Communications. vol. 4(1), pp. 62- 66 (January 1986)
[175] Wang, C. L., Lin, J. L.: Systolic Array Implementation of Multipliers for Finite Fields
GF(2m). In: IEEE Transactions on Circuits and Systems, vol. 38(7), pp. 796 -800 (July
1991)
[176] Song, L., Parhi, K. K.: Efficient Finite Field Serial/Parallel Multiplication. In: Proc. of
10th IEEE Int’l Conference Application Specific Systems, Architectures and Processors
(ASAP), pp. 72-82 (August 1996)
[177] Hasan, M. A., Bhargava, V. K.: Division and Bit-Serial Multiplication over GF(q^m).
In: IEE Proc. E: Computers and Digital Techniques, vol. 139(3), pp. 230-236 (May
1992)
[178] Fenn, S. T. J., Parker, M. G., Benaissa, M., Taylor, D.: Bit-Serial Multiplication in
GF(2m) Using Irreducible All-One Polynomials. In: Proc. of IEE Computers and Digital
Techniques. vol. 144(6), pp. 391-393 (November 1997)
[179] Hasan, M. A.: Look-Up Table-Based Large Finite Field Multiplication in Memory Con-
strained Cryptosystems. In: IEEE Transactions on Computers, vol. 49(7), pp. 749 -758
(July 2000)
[180] Shu, C., Gaj, K., El-Ghazawi, T.: Low Latency Elliptic Curve Cryptography
Accelerators for NIST Curves Over Binary Fields. In: Proc. of Int’l Conference on
Field-Programmable Technology, pp. 309-310 (December 2005)
[181] Meher, P. K.: On Efficient Implementation of Accumulation in Finite Field Over GF(2m)
and its Applications. In: IEEE Transactions on Very Large Scale Integration (VLSI) Sys-
tems, vol. 17(4), pp. 541-550 (April 2009)
[182] Shu, C., Kwon, S., Gaj, K.: Reconfigurable Computing Approach for Tate Pairing
Cryptosystems Over Binary Fields. In: IEEE Transactions on Computers, vol. 58(9), pp.
1221-1237 (September 2009)
[183] Synopsys, Inc. [Online]. Available: http://www.synopsys.com
[184] Chandrakasan, A. P., Brodersen, R. W.: Low Power Digital CMOS Design. Kluwer
Academic Pub. (1995)
[185] Abdelguerfi, M., Kaliski, B. S. Jr., Patterson, W.: Public-Key Security Systems. In:
IEEE Micro. vol. 16(3), pp. 10-13 (June 1996)
[186] Batina, L., Örs, S. B., Preneel, B., Vandewalle, J.: Hardware Architectures for Public
Key Cryptography. In: Integration, the VLSI J., Elsevier, vol. 34(1-2), pp. 1-64 (May
2003)
[187] Tolunay, J.: Parallel Gaming Related Algorithms for an Embedded Media Processor.
Master’s thesis, Linköping University, Linköping, Sweden (2012)
[188] Yen, S. -M., Kim, S., Lim, S., Moon, S.: A Countermeasure Against One Physical
Cryptanalysis May Benefit Another Attack. In: Proc. of 4th Int’l Conference: Information
Security and Cryptology, ICISC 2001. LNCS, vol. 2288, pp. 414-427 (December 2001)
[189] Montgomery, P. L.: Speeding the Pollard and Elliptic Curve Methods of Factorization.
In: J. of Mathematics of Computation, American Mathematical Society. vol. 48(177), pp.
243-264 (January 1987)
[190] Okeya, K., Kurumatani, H., Sakurai, K.: Elliptic Curves with the Montgomery-Form
and Their Cryptographic Applications. In: Proc. of 3rd Int’l Workshop on Practice and
Theory in Public Key Cryptosystems, PKC 2000. LNCS, vol. 1751, pp. 238-257 (January
2000)
[191] Joye, M., Yen, S. -M.: The Montgomery Powering Ladder. In: Proc. of 4th Int’l
Workshop: Cryptographic Hardware and Embedded Systems, CHES 2002. LNCS, vol. 2523,
pp. 291-302 (August 2002)
[192] Joye, M.: Highly Regular Right-to-Left Algorithms for Scalar Multiplication. In: Proc.
of 9th Int’l Workshop Cryptographic Hardware and Embedded Systems, CHES 2007.
LNCS, vol. 4727, pp. 135-147 (September 2007)
[193] Vuillaume, C., Okeya, K.: Flexible Exponentiation with Resistance to Side Channel
Attacks. In: Proc. of 4th Int’l Conference: Applied Cryptography and Network Security,
ACNS 2006. LNCS, vol. 3989, pp. 268-283 (June 2006)
[194] Kargl, A., Wiesend, G.: On Randomized Addition-Subtraction Chains to Counteract
Differential Power Attacks. In: Proc. of 6th Int’l Conference on the Information and
Communications Security, ICICS ’04. LNCS, vol. 3269, pp. 278-290 (October 2004)
[195] Thériault, N.: SPA Resistant Left-to-Right Integer Recodings. In: Proc. of 12th
Int’l Workshop: Selected Areas in Cryptography, SAC ’05. LNCS, vol. 3897, pp. 345-358
(August 2005)
[196] Han, D. -G., Takagi, T.: Some Analysis of Radix-r Representations. In: IACR,
Cryptology Eprint Archive, 2005/402. Available from URL: http://eprint.iacr.org/2005/402
(November 2005)
[197] Okeya, K., Sakurai, K.: On Insecurity of the Side Channel Attack Countermeasure
Using Addition-Subtraction Chains Under Distinguishability Between Addition and
Doubling. In: Proc. of 7th Australasian Conference: Information Security and Privacy, ACISP
’02. LNCS, vol. 2384, pp. 420-435 (July 2002)
[198] Bernstein, D. J., Birkner, P., Joye, M., Lange, T., Peters, C.: Twisted Edwards Curves.
In: Proc. of 1st Int’l Conference on Cryptology in Africa: Progress in Cryptology,
AFRICACRYPT 2008. LNCS, vol. 5023, pp. 389-405 (June 2008)
[199] Kistler, M., Perrone, M., Petrini, F.: Cell Multiprocessor Communication Network:
Built for Speed. In: IEEE Micro, vol. 26(3), pp. 10-23 (May-June 2006)
Curriculum Vitae
Name: Ebrahim Hasan
Post-Secondary Education and Degrees:
The University of Western Ontario
London, Ontario, Canada
2009 - 2013 Ph.D.
James Cook University
Townsville, Queensland, Australia
2003 - 2005 M.Sc.
Qatar University
Doha, Qatar
1997 - 2002 B.Sc.
Related Work Experience:
Graduate Teaching Assistant
The University of Western Ontario
2009 - 2013
Graduate Research Assistant
The University of Western Ontario
2009 - 2013
Graduate Teaching Assistant
University of Bahrain
2005 - 2008
Graduate Research Assistant
University of Bahrain
2005 - 2008
Publications:
1. Abdulrahman, E. A. H., Reyhani-Masoleh, A.: High-Speed Hybrid-Double Multiplication
Architectures Using New Serial-Out Bit-Level Mastrovito Multipliers. Submitted
for publication in IEEE Transactions on Computers (Submitted in November 2012,
revised in July 2013)
2. Abdulrahman, E. A. H., Reyhani-Masoleh, A.: New Regular Radix-8 Scheme for Elliptic
Curve Scalar Multiplication Without Pre-computation. Accepted for publication in IEEE
Transactions on Computers, 14 pages in total (2013).
