Efficient Elliptic Curve Processor Architectures for Field Programmable Logic by Orlando, Gerardo
Worcester Polytechnic Institute
Digital WPI
Doctoral Dissertations (All Dissertations, All Years) Electronic Theses and Dissertations
2002-03-27
This dissertation is brought to you for free and open access by Digital WPI. It has been accepted for inclusion in Doctoral Dissertations (All
Dissertations, All Years) by an authorized administrator of Digital WPI. For more information, please contact wpi-etd@wpi.edu.
Repository Citation
Orlando, G. (2002). Efficient Elliptic Curve Processor Architectures for Field Programmable Logic. Retrieved from
https://digitalcommons.wpi.edu/etd-dissertations/77
Efficient Elliptic Curve Processor Architectures for Field
Programmable Logic
by
Gerardo Orlando
A Dissertation
Submitted to the Faculty
of the
WORCESTER POLYTECHNIC INSTITUTE
in partial fulfillment of the requirements for the
Degree of Doctor of Philosophy
in
Electrical Engineering
March 4, 2002
APPROVED:
Dr. Christof Paar, Dissertation Advisor, ECE Department
Dr. Berk Sunar, Dissertation Committee, ECE Department
Dr. Fred J. Looft, Dissertation Committee, ECE Department
Dr. Wayne P. Burleson, Dissertation Committee, ECE Department, University of Massachusetts
Dr. John Orr
Head of ECE Department
Abstract
Elliptic curve cryptosystems offer security comparable to that of traditional
asymmetric cryptosystems, such as those based on the RSA encryption and digital
signature algorithms, with smaller keys and computationally more efficient
algorithms. These two properties are among the main reasons why elliptic curve
cryptography has become popular. As its popularity increases, so does the need
for efficient hardware solutions that accelerate the computation of elliptic
curve point multiplications.
This dissertation introduces elliptic curve processor architectures suitable for
the computation of point multiplications for curves defined over fields $GF(2^m)$
and curves defined over fields $GF(p)$. Each of the processor architectures
presented here allows designers to tailor the performance and hardware
requirements according to their performance and cost goals. Moreover, these
architectures are well suited for implementation in modern field programmable
gate arrays (FPGAs), as demonstrated with prototype implementations. The fastest
prototyped $GF(2^m)$ processor can compute an arbitrary point multiplication for
curves defined over fields $GF(2^{167})$ in 0.21 milliseconds, and the prototyped
processor for the field $GF(2^{192} - 2^{64} - 1)$ is capable of computing a point
multiplication in about 3.6 milliseconds.
The most critical component of an elliptic curve processor is its arithmetic unit.
A typical arithmetic unit includes an adder/subtractor, a multiplier, and possibly
a squarer. Some of the architectures presented in this work are based on multiplier
and squarer architectures developed as part of this dissertation: the $GF(2^m)$
least significant bit super-serial multiplier architecture, the $GF(2^m)$ most
significant bit super-serial multiplier architecture, a new $GF(p)$ Montgomery
multiplier architecture, and a new squaring architecture for $GF(2^m)$.
Acknowledgements
The work presented in this dissertation is the result of over four years of research.
I want to thank Christof Paar for advising me throughout this research work. I also
want to thank the members of the dissertation committee, Berk Sunar, Fred J.
Looft, and Wayne P. Burleson, for reviewing this work.
The majority of the research work presented here was done on a part-time basis
while I worked as an engineer. I want to thank Julian Bubrowski, Walter Schneider,
and Jim Sheedy for allowing me to balance my professional work with my research
work. This work would not have been possible without their help.
Finally, this work would not have been possible without the love and moral support
of my sisters Clara and Marian, my mother Maria, and my father Juan. Last but
not least, I want to acknowledge that the love and understanding of my wife Diane,
my son Leonardo, and my daughter Jane carried me over the finish line.
Contents
1 Introduction
1.1 Motivation
1.2 Summary of Research Contributions
1.3 Dissertation Outline
2 Background
2.1 Finite Field Arithmetic
2.2 $GF(p)$ Arithmetic Background
2.2.1 Montgomery Reduction
2.3 $GF(2^m)$ Arithmetic Background
2.4 Elliptic Curve Arithmetic
2.5 Elliptic Curve Discrete Logarithm Problem (ECDLP)
2.6 Coordinate Representation
2.7 Special Coordinates and Algorithms
2.7.1 Montgomery Point Multiplication Algorithm for $GF(2^m)$
2.7.2 Jacobi Form
2.8 Point Multiplication Algorithms
2.8.1 Generic Point Multiplication Algorithms
2.8.2 Fixed-Point Point Multiplication Algorithms
2.8.3 Summary of Point Multiplication Algorithms
3 Elliptic Curve Processor Architecture
3.1 Architecture
3.2 Main Controller (MC)
3.3 Arithmetic Unit Controller (AUC)
3.4 Arithmetic Unit (AU)
4 $GF(2^m)$ Arithmetic Unit
4.1 Introduction
4.1.1 $GF(2^m)$ Multiplier Architectures
4.1.2 $GF(2^m)$ Squarer Architectures
4.2 Adder
4.3 Most Significant Bit First Multiplier (MSB)
4.3.1 Architecture
4.3.2 Complexity, Critical Path Delay, and Performance
4.4 Least Significant Bit First Multiplier (LSB)
4.4.1 Architecture
4.4.2 Complexity, Critical Path Delay, and Performance
4.5 Most Significant Digit First Multiplier (MSD)
4.5.1 Architecture
4.5.2 Complexity, Critical Path Delay, and Performance
4.6 Least Significant Digit First Multiplier (LSD)
4.6.1 Architecture
4.6.2 Complexity, Critical Path Delay, and Performance
4.7 Most Significant Bit First Super-Serial Multiplier (MSB-SSM)
4.7.1 Architecture
4.7.2 Complexity, Critical Path Delay, and Performance
4.8 Least Significant Bit First Super-Serial Multiplier (LSB-SSM)
4.8.1 Architecture
4.8.2 Complexity, Critical Path Delay, and Performance
4.9 New Squaring Architecture
4.9.1 Architecture
4.9.2 Complexity, Critical Path Delay, and Performance
4.10 Parallel Squarers with Fixed Irreducible Polynomial Support
4.11 Zero Test
4.12 Register File
4.13 $GF(2^m)$ Arithmetic Unit Complexity and Performance
4.13.1 Complexity
4.13.2 Performance
5 $GF(p)$ Arithmetic Unit
5.1 Introduction
5.2 High-Radix Montgomery Multiplication with Quotient Pipelining
5.3 High-Radix, Precomputation-Based Montgomery Multiplication with Quotient Pipelining
5.3.1 Validity of Algorithm
5.3.2 Accuracy of Algorithm
5.3.3 Range of $S_{i+1}$
5.3.4 Processing Time
5.3.5 Multiplications of Interest in the Computation of Point Multiplications
5.3.6 Two’s Complement and Binary Stored-Carry Number Representation
5.3.7 Area and Storage
5.4 Adder
5.4.1 Complexity, Critical Path Delay, and Performance
5.5 New Montgomery Multiplier
5.5.1 BSC Shift Register
5.5.2 BSC to NR Converter 1
5.5.3 Booth Recoders 1 and 2
5.5.4 $\tilde{A}B_i$ Scalar Multiplier
5.5.5 $\tilde{Q}\alpha_{i-d}$ Scalar Multiplier
5.5.6 Carry-Save Adder Tree
5.5.7 $S_i/2^k$ Circuit
5.5.8 Registers
5.5.9 Complexity
5.5.10 Critical Path Delay
5.5.11 Performance
5.6 Register File
5.7 Multiplexer
5.8 $GF(p)$ Arithmetic Unit Complexity and Performance
5.8.1 Complexity
5.8.2 Performance
6 Comparison of $GF(2^m)$ and $GF(p)$ Arithmetic Units
6.1 Introduction
6.1.1 Complexity
6.1.2 Performance
6.2 Prototype Implementations
6.2.1 Description of the Prototype Implementations of the $GF(2^m)$ Processors
6.2.2 Complexity and Performance of the Prototyped $GF(2^m)$ Processors
6.2.3 Description of the Prototype Implementation of the $GF(p)$ Processor
6.2.4 Complexity and Performance of the Prototyped $GF(p)$ Processor
6.2.5 Comparison of Prototyped $GF(p)$ and $GF(2^m)$ Processors
7 Conclusions
7.1 Summary and Conclusions
7.2 Recommendations for Further Research
A Hardware Implementation Models
A.1 Introduction
A.1.1 Two-Input Gate Implementations
A.1.2 FPGA Implementations
A.2 Logic Complexity and Critical Path Delay for Implementations that Use Two-Input Gates as Logic Elements
A.3 Logic Complexity and Critical Path Delay for FPGA Implementations
B Acronyms and Symbols
B.1 Acronyms
B.2 Symbols
List of Figures
2.1 Point addition
2.2 Point double
2.3 Elliptic curve analogue of the Diffie-Hellman key agreement algorithm
2.4 Arrangement of multiplier $k$ for the fixed-point comb point multiplication algorithm
2.5 $G[s, I_{s,r}]$ precomputation table for the fixed-point comb point multiplication algorithm
3.1 Point multiplication hierarchy
3.2 Elliptic curve processor architecture
3.3 MC instruction format
3.4 Main controller
3.5 AUC instruction format
3.6 Arithmetic unit controller
3.7 Functional block diagram of the arithmetic unit
4.1 MSB multiplier
4.2 LSB multiplier
4.3 MSD multiplier
4.4 LSD multiplier
4.5 Super-serial multiplier emulation of a bit-serial multiplier
4.6 Processing units of the MSB and the MSB-SSM multipliers
4.7 MSB-SSM multiplier
4.8 Processing units of the LSB-SSM and the LSB multipliers
4.9 LSB-SSM multiplier
4.10 New squaring architecture using LSB or LSD multipliers
4.11 New squaring architecture using LSB-SSM multipliers
4.12 Zero test circuits
4.13 $GF(2^m)$ arithmetic units
5.1 $GF(p)$ arithmetic unit
5.2 Carry-save addition of two’s complement numbers
5.3 $GF(p)$ adder
5.4 $GF(p)$ multiplier
5.5 BSC to NR converter 1
5.6 Example of the Modified Booth Recoding Algorithm
5.7 Window recoder
5.8 Generation and formatting of positive multiples
5.9 Generation and formatting of negative multiples
5.10 Processing unit for nonredundant number representation
5.11 $\tilde{A}B_i$ scalar multiplier example for nonredundant number representation
5.12 $\tilde{A}B_i$ scalar multiplier example for binary stored-carry number representation
5.13 $\tilde{Q}\alpha_{i-d}$ scalar multiplier example for Multiplication and Lookup 1 reduction methods that uses nonredundant number representation
5.14 $\tilde{Q}\alpha_{i-d}$ scalar multiplier example for Lookup 2 based reduction that uses nonredundant number representation
5.15 $S_i/2^k$ circuit
5.16 Multiplexer options
6.1 $GF(2^m)$ arithmetic unit architecture
A.1 Two’s complement with zero circuit
A.2 Shift register circuit
A.3 Binary tree and ripple adder architectures
A.4 Ripple-carry adder
A.5 Increment adder
A.6 3:2 carry-save adder
A.7 4:2 carry-save adder
A.8 $GF(2)$ adder implementation with a LUT
A.9 2:1 multiplexer implementation with a LUT
A.10 Half adder implementation with LUTs
A.11 Full adder implementation with LUTs
A.12 Two’s complement with zero cell circuit implementation with a LUT
A.13 Shift register cell implementation with a LUT
A.14 Binary tree implementation with LUTs
A.15 $GF(2)$ mult/add tree implementation with LUTs
List of Tables
2.1 Complexity of GF (p) arithmetic operations . . . . . . . . . . . . . . . 14
2.2 Approximate bit-complexity of GF (2m) arithmetic operations . . . . 19
2.3 Complexity of GF (2m) arithmetic operations . . . . . . . . . . . . . . 19
2.4 Point addition and point double formulas for elliptic curves defined
over fields GF (p) (p > 3) and points represented using affine coordinates 24
2.5 Point addition and point double formulas for elliptic curves defined
over fields GF (2m) and points represented using affine coordinates. . 24
2.6 Point addition and point double formulas for elliptic curves defined
over fields GF (p) (p > 3) and points represented using Jacobian
coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 Point addition and point double formulas for elliptic curves defined
over fields GF (2m) and points represented using Jacobian coordinates 31
2.8 Computational complexity of point addition and point double for
curves defined over fields GF (p) (p > 3) . . . . . . . . . . . . . . . . 31
2.9 Computational complexity of point addition and point double for
curves defined over fields GF (2m) . . . . . . . . . . . . . . . . . . . . 32
2.10 Montgomery point multiplication using projective coordinates . . . . 37
2.11 Computational complexity of the Montgomery point multiplication
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
xi
2.12 Complexity of point multiplication algorithms . . . . . . . . . . . . . 57
2.13 Complexity of point multiplication algorithms for k = 160 ≈ m ≈
log2 p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.1 MC instruction set – execution control instructions . . . . . . . . . . 67
3.2 MC instruction set – data manipulation and arithmetic instructions . 68
3.3 MC instruction set symbols . . . . . . . . . . . . . . . . . . . . . . . 69
3.4 MC components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5 AUC instruction set . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.6 AUC instruction set symbols . . . . . . . . . . . . . . . . . . . . . . . 77
3.7 AUC components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.1 Time-area characteristics of GF (2m) multipliers . . . . . . . . . . . . 83
4.2 Time-area characteristics of GF (2m) squarers . . . . . . . . . . . . . 86
4.3 Complexity and critical path delay of GF (2m) adders . . . . . . . . . 88
4.4 Complexity and critical path delay of MSB multiplier . . . . . . . . . 91
4.5 Performance of MSB multiplier . . . . . . . . . . . . . . . . . . . . . 91
4.6 Complexity and critical path delay of LSB multiplier . . . . . . . . . 95
4.7 Performance of LSB multiplier . . . . . . . . . . . . . . . . . . . . . . 95
4.8 Complexity and critical path delay of MSD multiplier . . . . . . . . . 102
4.9 Complexity and critical path delay of MSD multiplier for m >>
D >> r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.10 Performance of MSD multiplier . . . . . . . . . . . . . . . . . . . . . 102
4.11 Complexity and critical path delay of LSD multiplier . . . . . . . . . 108
4.12 Complexity and critical path delay of LSD multiplier for m >> D >> r108
4.13 Performance of LSD multiplier . . . . . . . . . . . . . . . . . . . . . . 109
xii
4.14 MSB-SSM multiplication example for AB mod (α3 +
∑2
i=0 fiα
i) with
D = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.15 Complexity and critical path delay of MSB-SSM multiplier . . . . . . 117
4.16 Performance of MSB-SSM multiplier . . . . . . . . . . . . . . . . . . 117
4.17 LSB-SSM multiplication example for AB mod x3+
∑2
i=0 fiα
i+C with
D = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.18 Complexity and critical path delay of LSB-SSM multiplier . . . . . . 124
4.19 Performance of LSB-SSM multiplier . . . . . . . . . . . . . . . . . . . 124
4.20 Number of clock cycles required to compute a square operation when
using different types of multipliers . . . . . . . . . . . . . . . . . . . . 130
4.21 Complexity and critical path delay of the squaring adapter to be used
with LSB and LSD multipliers . . . . . . . . . . . . . . . . . . . . . . 132
4.22 Complexity and critical path delay of the squaring adapter to be used
with LSB-SSM multipliers with digit sizes with even values . . . . . . 132
4.23 Distribution of squaring to multiplication processing time ratios for
the GF (2m) fields polynomials specified in [ANS98, ANS99] with
prime m in the range 163 . . . 997 . . . . . . . . . . . . . . . . . . . . . 133
4.24 Distribution of squaring to multiplication processing time ratios for
the GF (2m) fields polynomials specified in [IEE98] with prime m in
the range 163 . . . 997 when using LSB or LSB-SSM multipliers . . . . 133
4.25 Complexity and critical path delay statistics for parallel squarers that
support the fixed trinomials specified in [IEE98, ANS98, ANS99] of
prime degree in the range 163 . . . 997 . . . . . . . . . . . . . . . . . . 137
4.26 Complexity and critical path delay statistics for parallel squarers that
support the fixed pentanomials specified in [IEE98] of prime degree
in the range 163 . . . 997 . . . . . . . . . . . . . . . . . . . . . . . . . . 137
xiii
4.27 Complexity and critical path delay statistics for parallel squarers that
support the fixed pentanomials specified in [ANS98, ANS99] of prime
degree in the range 163 . . . 997 . . . . . . . . . . . . . . . . . . . . . . 138
4.28 Complexity and critical path delay of zero test circuits . . . . . . . . 140
4.29 Complexity and critical path delay of register file . . . . . . . . . . . 142
4.30 Complexity of architecture 4 for arithmetic units based on MSB-SSM
multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.31 Complexity of architecture 1 for arithmetic units based on LSB-SSM
multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.32 Complexity of architecture 2 for arithmetic units based on LSB-SSM
multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.33 Complexity of architecture 4 for arithmetic units based on MSB mul-
tipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.34 Complexity of architecture 5 for arithmetic units based on MSB mul-
tipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.35 Complexity of architecture 1 for arithmetic units based on LSB mul-
tipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.36 Complexity of architecture 2 for arithmetic units based on LSB mul-
tipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.37 Complexity of architecture 3 for arithmetic units based on LSB mul-
tipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.38 Complexity of architecture 4 for arithmetic units based on MSD mul-
tipliers that support programmable and fixed irreducible polynomials
(m >> D >> r) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
xiv
4.39 Complexity of architecture 5 for arithmetic units based on MSD mul-
tipliers that support programmable and fixed irreducible polynomials
(m >> D >> r) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.40 Complexity of architecture 1 for arithmetic units based on LSD mul-
tipliers that support programmable and fixed irreducible polynomials
(m >> D >> r) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.41 Complexity of architecture 2 for arithmetic units based on LSD mul-
tipliers that support programmable and fixed irreducible polynomials
(m >> D >> r) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.42 Complexity of architecture 3 for arithmetic units based on LSD mul-
tipliers that support programmable and fixed irreducible polynomials
(m >> D >> r) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.43 Summary of arithmetic unit gate complexity according to multiplier
family for architectures that include squaring circuitry, that provide
programmable polynomial support, and that exhibit m >> D >> r . 151
4.44 Summary of arithmetic unit FPGA logic complexity according to
multiplier family for architectures that include squaring circuitry, that
provide programmable polynomial support, and that exhibit m >>
D >> r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.45 Summary of critical path delays of GF (2m) multipliers that support
programmable irreducible polynomials for which m >> D >> r . . . 153
4.46 Summary of critical path delays, according to multiplier families,
of GF (2m) arithmetic units that support programmable irreducible
polynomials for which m >> D >> r . . . . . . . . . . . . . . . . . . 153
4.47 Throughput of GF (2m) arithmetic units for multiplication operations
(in clock cycles) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
xv
4.48 Throughput of GF (2m) arithmetic units for square operations (in
clock cycles) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.49 Throughput of GF (2m) arithmetic units for addition operations (in
clock cycles) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.50 Throughput for the different field operations of GF (2m) arithmetic
units that incorporate squaring circuitry (estimates are provided in
clock cycles according to the multiplier families) . . . . . . . . . . . . 155
5.1 Accuracy of different reduction methods . . . . . . . . . . . . . . . . 170
5.2 Approximate maximum value of Q˜αj+12
k(d+1) − Q˜j+1 . . . . . . . . . 174
5.3 Approximate maximum value of 2k
∑i−d−1
j=0 (Q˜αj+12
k(d+1) − Q˜j+1) 2kj 174
5.4 Multiplications of interest . . . . . . . . . . . . . . . . . . . . . . . . 180
5.5 Bus widths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
5.6 Approximate bus widths (Multiplication reduction method) . . . . . . 186
5.7 Complexity of GF (p) adder . . . . . . . . . . . . . . . . . . . . . . . 193
5.8 Critical path delay of GF (p) adder . . . . . . . . . . . . . . . . . . . 194
5.9 Performance of GF (p) adder . . . . . . . . . . . . . . . . . . . . . . . 194
5.10 Components of the GF (p) multiplier . . . . . . . . . . . . . . . . . . 198
5.11 Complexity and critical path delay of BSC shift register . . . . . . . . 200
5.12 Complexity and critical path delay of BSC to NR converter 1 . . . . . 202
5.13 Complexity and critical path delay of Booth recoder . . . . . . . . . . 207
5.14 Complexity and critical path delay of A˜Bi scalar multiplier . . . . . . 212
5.15 Complexity and critical path delay of Q˜αi−d scalar multiplier . . . . . 217
5.16 Carry-save adder tree configurations . . . . . . . . . . . . . . . . . . . 219
5.17 Complexity and critical path delay of carry-save adder tree . . . . . . 219
5.18 Complexity and critical path delay of Si/2
k circuit . . . . . . . . . . . 222
5.19 Complexity and critical path delay of registers . . . . . . . . . . . . . 223
xvi
5.20 Complexity of GF (p) multiplier . . . . . . . . . . . . . . . . . . . . . 225
5.21 Critical path delay of GF (p) multiplier . . . . . . . . . . . . . . . . . 228
5.22 Average latency of GF (p) multiplier . . . . . . . . . . . . . . . . . . 230
5.23 Average throughput of GF (p) multiplier . . . . . . . . . . . . . . . . 230
5.24 Complexity and critical path delay of register file . . . . . . . . . . . 233
5.25 Complexity and critical path delay of multiplexer . . . . . . . . . . . 235
5.26 Complexity of the GF (p) arithmetic unit . . . . . . . . . . . . . . . 237
5.27 Complexity of the GF (p) arithmetic unit (cont.) . . . . . . . . . . . 238
6.1 Complexity ofGF (2m) andGF (p) arithmetic units (m >> k,D,r,s,u,v,d)243
6.2 Complexity ratio between GF (2m) and GF (p) (min. complexity)
arithmetic units (m >> k,D, r, s, u, v, d) . . . . . . . . . . . . . . . . 243
6.3 Complexity ratio between GF (2m) and GF (p) (max. complexity)
arithmetic units (m >> k,D, r, s, u, v, d) . . . . . . . . . . . . . . . . 243
6.4 Critical path delay of GF (2m) and GF (p) arithmetic units . . . . . . 245
6.5 Processing time for GF (p) and GF (2m) arithmetic units (in # clock
cycles) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.6 Logic complexity of GF (2m) processors (m = 167) . . . . . . . . . . . 249
6.7 Estimated LUT complexity of an arithmetic unit versus measured
LUT complexity of GF (2m) processor (m = 167) . . . . . . . . . . . . 250
6.8 Point multiplication performance of GF (2m) processors . . . . . . . . 251
6.9 Performance of leading hardware accelerators that compute point
multiplications for curves defined over fields GF (2m) . . . . . . . . . 251
6.10 Logic complexity of GF (p) processor . . . . . . . . . . . . . . . . . . 254
6.11 Estimated LUT complexity of arithmetic unit versus measured LUT
complexity of GF (p) processor (w0 = 237) . . . . . . . . . . . . . . . 255
6.12 Estimated point multiplication performance of GF (p) processor . . . 256
xvii
6.13 Performance of leading hardware accelerators that compute point
multiplications for curves defined over fields $GF(p)$
6.14 Complexity of $GF(2^m)$ and $GF(p)$ processors normalized with respect
to $m$
6.15 Approximate processing time for the computation of different field
operations in the $GF(2^m)$ and $GF(p)$ processors
6.16 Ratio of time-area characteristics of prototypes
A.1 Frequently used terms in complexity and timing estimates
A.2 Frequently used terms in complexity and timing estimates (cont.)
A.3 Complexity of basic building blocks for implementations that use
two-input gates as logic elements
A.4 Critical path delay of basic building blocks for implementations that
use two-input gates as logic elements
A.5 Complexity of composite building blocks for implementations that
use two-input gates as logic elements
A.6 Critical path delay of composite building blocks for implementations
using two-input gates as logic elements
A.7 Complexity of basic building blocks for implementations using FPGA
logic
A.8 Critical path delay of basic building blocks for implementations using
FPGA logic
A.9 Complexity of composite building blocks for implementations using
FPGA logic
A.10 Critical path delay of composite building blocks for implementations
using FPGA logic
B.1 Acronyms
B.2 Symbols
Chapter 1
Introduction
1.1 Motivation
This work introduces elliptic curve processor architectures designed to take advan-
tage of the features offered by reprogrammable hardware, in particular field pro-
grammable gate array (FPGA) hardware.
The first elliptic curve cryptosystems were independently proposed in 1985 by
Neal Koblitz [Kob87] and Victor Miller [Mil86]. Since its inception, elliptic curve
cryptography has been the subject of extensive cryptanalysis. Today, elliptic curve
cryptosystems are deemed secure for commercial [ANS98, ANS99, IEE98] as well as
government use [FIP00].
Based on today’s cryptanalytical knowledge, elliptic curve cryptosystems offer
security comparable to that of traditional public-key cryptosystems, such as those
based on the RSA [RSA78] and the ElGamal [ElG85] encryption and digital sig-
nature algorithms, and those based on the Diffie-Hellman key agreement algorithm
[DH76], with smaller keys and computationally more efficient algorithms.
Smaller keys and computationally more efficient algorithms than those of
traditional cryptographic algorithms are two of the main reasons why elliptic
curve cryptography is becoming popular for use in constrained environments such as
cellular phones and personal digital assistant devices, which contain limited amounts
of memory and are battery-powered. The same reasons also make elliptic curve
cryptography attractive for high performance systems, such as secure networking
devices, whose ability to protect and route traffic is a function of their capacity to
establish secure connections. The establishment of these secure connections often
involves public-key operations.
Efficient software implementations of elliptic curve algorithms have been and
continue to be a topic of extensive research. Some important works in this area are
documented in [SOOS95, HLM00, WBV+96, WMPW98, LD99a].
Software architectures have the great advantage that they are portable to mul-
tiple hardware platforms. Their main disadvantages are their lower performance
when compared to specialized hardware architectures and their inability to protect
private keys from disclosure with the same degree of security that is achievable in
hardware. These disadvantages are some of the reasons motivating the study of
efficient hardware architectures.
Efficient hardware implementations of elliptic curve algorithms are just beginning
to be documented. Among the most significant hardware architectures for elliptic
curves defined over fields GF (2m) are [AMV93, Ros98b, GSS99, SES98, LMWL00,
OP00a]. As of this writing, the elliptic curve processor architecture documented in
[OP01] is the only elliptic curve processor architecture for GF (p) documented in the
open literature.
The emphasis of this work is on elliptic curve processor architectures based on the
architectures introduced by the author in [OP00a] and [OP01] as part of the research
work documented here. These architectures are well suited for implementation using
FPGAs, which is the hardware platform emphasized in this work.
Unlike traditional very large scale integration (VLSI) hardware, FPGAs do not
possess fixed functionality after fabrication. FPGAs are reconfigurable hardware de-
vices; that is, devices whose functionality is programmable. The configuration of an
FPGA device can be changed over time thus allowing the same FPGA to implement
different functions. The reconfigurability of FPGAs and the use of architecturally
scalable elliptic curve processor architectures afford implementations the benefits
listed below.
Architecture Efficiency: The complexity of finite field arithmetic architectures
depends greatly on whether arithmetic for one specific field representation
is being implemented or whether it is implemented for arbitrary finite field
representations. The most dramatic example is perhaps squaring in GF (2m)
using standard basis. For a specific field, squaring can be performed in one
clock cycle, whereas a general architecture usually requires m/2 clock cycles
(where m ≥ 160 for elliptic curve cryptosystems) [BG89]. One of the algorithmic
options explored in this work is the use of parallel squarers optimized
for the different GF (2m) fields. The reconfigurability of FPGA logic allows
the instantiation of different squarer architectures in the same hardware.
Scalable Security: Depending on the application, different levels of security may
be required. The main factor that determines the security of an elliptic curve
cryptosystem is the size of the underlying finite field. For instance, the Na-
tional Institute of Standards and Technology (NIST) announced a list of curves
that can support keys of 163 to 571 bits [FIP00]. Realizing such a wide operand
range efficiently in traditional hardware is a major challenge whereas the use
of architecturally scalable elliptic curve processor architectures together with
the reconfigurability of FPGA logic allows implementations to realize different
security levels in the same hardware. For example, an FPGA can be config-
ured with an elliptic curve processor function that supports 163-bit arithmetic.
Later, the same FPGA can be reconfigured with an elliptic curve processor
function that supports 571-bit arithmetic.
Scalable Performance-Cost Trade-Off: Different applications may favor very
different trade-offs between processing speed and logic resources. For the ellip-
tic curve processor architectures presented here, the finite field multiplier is the
most critical component. This work describes different multiplier architectures
whose performance and complexity are scalable. These multiplier architectures
allow implementers to explore different area-time trade-offs. Moreover, the fine
degree of scalability of some of these multipliers allows implementers to best
use the resources available in the targeted FPGA platforms.
The rest of this document is devoted to the description of elliptic curve processor
architectures that exploit the features of FPGA logic described above.
1.2 Summary of Research Contributions
This work introduces new elliptic curve processor architectures suitable for the
computation of point multiplications for elliptic curves defined over fields GF (2m).
These architectures are based on the architecture presented by the author at the
Cryptographic Hardware and Embedded Systems – CHES 2000 conference. This
architecture was published as part of the proceedings of that conference in [OP00a].
This work extends the work documented in [OP00a] to elliptic curve processors
based on different types of GF (2m) multipliers and squaring architectures.
Two of the GF (2m) multipliers used in the elliptic curve processor architectures
discussed in this document were developed as part of the dissertation research docu-
mented here. These multipliers belong to the family of multipliers originally named
super-serial multipliers. These are multipliers of low complexity especially designed
to exploit the availability of memory in modern FPGA logic. The first multiplier was
presented in the Seventh Annual IEEE Symposium on Field-Programmable Custom
Computing Machines (FCCM ’99) and it was published as part of the proceedings
of that conference in [OP99]. A new super-serial multiplier is also introduced in this
dissertation. This multiplier is discussed in Section 4.8.
One of the squaring architectures for GF (2m) discussed in this document was
developed as part of the dissertation research. This squaring architecture is based
on the observation that a squaring operation in GF (2m) can be transformed into
a multiplication by a constant and a sum, when the field elements are represented
using standard basis. The computation of the multiplication and the sum can be
performed with a simple circuit and a least-significant bit/digit first multiplier. This
work was published in the IEE journal Electronics Letters in [OP00b].
This work introduces a new elliptic curve processor architecture suitable for the
computation of point multiplications for elliptic curves defined over fields GF (p).
This processor is an extension of the processor architecture for GF (2m) introduced
in [OP00a]. The centerpiece of this processor is a new Montgomery multiplier op-
timized for field programmable logic, which was developed as part of the work
documented here. This processor architecture was presented at the Cryptographic
Hardware and Embedded Systems – CHES 2001 conference and it was published as
part of the proceedings of that conference in [OP01].
1.3 Dissertation Outline
Chapter 2 introduces the arithmetic and algorithmic concepts needed for the under-
standing of this dissertation. This chapter starts with an introduction to GF (2m)
and GF (p) finite fields. The chapter then proceeds with an introduction to elliptic
curve arithmetic and the elliptic curve discrete logarithm problem. The chapter
ends with an introduction to elliptic curve coordinate representations and with de-
scriptions of point multiplication algorithms, which are the core functions of elliptic
curve processors.
Chapter 3 introduces the general architecture of the elliptic curve processors
presented in this work. This chapter also describes the architecture of the two
programmable processors used in the elliptic curve processor architectures. This
chapter ends with a general description of the architecture of the arithmetic units
used by the elliptic curve processors.
Chapter 4 introduces GF (2m) arithmetic unit architectures. These architectures
are based on six types of GF (2m) multipliers and two squaring architectures. Two
of the multiplier architectures and one of the squaring architectures were developed
by the author as part of the dissertation research presented here. This chapter also
provides complexity and performance estimates for the different architectures.
Chapter 5 introduces a GF (p) arithmetic unit architecture. This architecture
is based on a new Montgomery multiplier introduced by the author as part of the
dissertation research documented here. This chapter describes the new Montgomery
multiplier in detail and provides complexity and performance estimates for different
configurations of the arithmetic unit.
Chapter 6 compares the complexity and performance of arithmetic units for
$GF(2^m)$ finite fields based on digit-serial multipliers and the arithmetic unit
architecture for $GF(p)$ finite fields based on the new Montgomery multiplier. This chapter
also compares the complexity and performance of prototype implementations of an
elliptic curve processor for point multiplication for curves defined over a GF (2m)
finite field and an elliptic curve processor for point multiplication for curves defined
over a GF (p) finite field.
Chapter 7 summarizes the conclusions of this work and provides recommenda-
tions for further research.
Appendix A describes the models used to quantify the complexity and the per-
formance of the circuits that form part of the elliptic curve processors introduced
here. Appendix B lists the acronyms and symbols used in this work.
Chapter 2
Background
2.1 Finite Field Arithmetic
This section provides a brief introduction to finite field arithmetic. Finite field
arithmetic is used in many public-key cryptosystems in use today, including elliptic
curve cryptosystems. Additional information on this topic can be found in [McE87,
LN94, Big85].
A finite field is a finite set of elements with interesting properties. Before
describing its properties, it is convenient to introduce a few terms.
A group is a set of elements G with one binary operation, ∗, that exhibits the
following properties:
1. A group is closed under the ∗ operation: a ∗ b ∈ G for all a, b ∈ G.
2. The ∗ operation is associative: (a ∗ b) ∗ c = a ∗ (b ∗ c).
3. The group contains an identity element e ∈ G such that a ∗ e = e ∗ a = a for
a ∈ G.
4. Every element $a \in G$ has an inverse $a^{-1} \in G$ such that $a * a^{-1} = a^{-1} * a = e$.
Abelian groups are groups with the additional property that the group oper-
ation is commutative; that is, a ∗ b = b ∗ a for a, b ∈ G.
Cyclic groups are characterized by the existence of generators. A generator
$g$ is a group element that can represent each group element $a$ as $a = ig$ for
$i = 0, \ldots, |G| - 1$, where $ig = \underbrace{g * g * \cdots * g}_{i \text{ times}}$.
The number of elements in the group is
known as the order of the group and it is represented here with the symbol $|G|$,
where G represents the group under consideration.
A field is a set of elements F with two binary operations, represented here as
addition (+) and multiplication (∗), that exhibit the following properties:
1. The elements of F form an abelian group under the + operation.
2. The elements of the set F ∗, which is a set that contains all the elements in
the set F except the additive identity, form an abelian group under the ∗
operation.
3. The distributive laws apply to the two binary operations; for example,
a ∗ (b + c) = (a ∗ b) + (a ∗ c) for a, b, c ∈ F.
Finite fields are also referred to as Galois fields and are represented here by the
symbol $GF(q)$. Finite fields exist only for $q = p^m$, where p is a prime number and
m is a positive integer. The number of elements of a finite field is q.
Each field element represents an equivalence class, or a collection of elements
with common properties. For finite fields GF (p), all the integers that are congruent
to each other modulo p represent an equivalence class. Two elements a and b are
congruent if they generate the same remainder r ∈ [0, p) when divided by p . The
congruence relation is represented here as a ≡ b mod p, where a ≡ b mod p if a =
s ∗ p + r, b = t ∗ p + r, and s and t are two arbitrary integers. The elements of a
field are often represented by their least non-negative residues, which are integers in the
range [0, p).
For finite fields $GF(p^m)$, where p is prime and m is greater than one, this work
uses standard (or polynomial) basis representation. In this representation all
polynomials $G(\alpha) = \sum_{i=0}^{\infty} a_i \alpha^i$ with $a_i \in GF(p)$ that are congruent to each
other modulo $F(\alpha)$ represent an equivalence class. $F(x) = x^m + \sum_{i=0}^{m-1} f_i x^i$ with
$f_i \in GF(p)$ is an irreducible polynomial over the field $GF(p)$; that is, a polynomial
that, over the field $GF(p)$, is divisible only by itself or by an element of $GF(p)$.
α is a root of the irreducible polynomial in $GF(p^m)$ that does not belong to the
field $GF(p)$.
Two elements $A = \sum_{i=0}^{n_a} a_i \alpha^i$ and $B = \sum_{i=0}^{n_b} b_i \alpha^i$ of $GF(p^m)$ are congruent
if, when divided by the polynomial $F(\alpha)$, they generate the same remainder
$R = \sum_{i=0}^{m-1} r_i \alpha^i$ of degree lower than the degree of $F(x)$ ($a_i, b_i, r_i \in GF(p)$).
The congruence relation for $GF(p^m)$ is represented here as $A \equiv B \bmod F(\alpha)$,
where $A = S * F(\alpha) + R$, $B = T * F(\alpha) + R$, and where $S = \sum_{i=0}^{n_s} s_i \alpha^i$ and
$T = \sum_{i=0}^{n_t} t_i \alpha^i$ are two arbitrary polynomials with coefficients in $GF(p)$. The elements
of a field $GF(p^m)$ are represented here by the polynomials of least degree in their
equivalence class. Using this representation, the elements of a field are represented
with polynomials of degree lower than that of $F(x)$.
Arithmetic in fields $GF(p^m)$ using standard basis representation is analogous
to polynomial arithmetic with the added restriction that the operations involving
coefficients are handled as finite field operations in the field $GF(p)$.
Of importance in this work are finite fields $GF(p)$ with primes p greater than
three and the fields $GF(2^m)$. The fields $GF(p)$ are often referred to as prime fields
and the fields $GF(2^m)$ are often referred to as binary fields.
2.2 GF(p) Arithmetic Background
This section provides a brief introduction to GF (p) field arithmetic. For additional
information on this subject the reader is referred to [McE87, LN94].
For the computation of elliptic curve point multiplications, an elliptic curve
processor must be capable of computing modular additions, subtractions, multipli-
cations, and inverses.
Addition is the simplest modular operation that an elliptic curve processor needs
to implement. The addition of two field elements a and b, where a, b ∈ [0, p), can
be computed in two steps. The first step adds the two operands. The second step
reduces the result by subtracting p from a+ b, if this result is greater than or equal
to p. The operation just described can be extended to modular subtractions.
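The two-step addition just described, and its extension to subtraction, can be sketched as follows. This is only an illustrative software model under the assumption that operands are already reduced to [0, p); the function names mod_add and mod_sub are hypothetical and do not come from the processor description.

```python
def mod_add(a: int, b: int, p: int) -> int:
    """Add two field elements a, b in [0, p) with one conditional subtraction."""
    s = a + b          # step 1: plain integer addition, s is in [0, 2p - 2]
    if s >= p:         # step 2: subtract p only if the sum left the range [0, p)
        s -= p
    return s


def mod_sub(a: int, b: int, p: int) -> int:
    """Subtract two field elements a, b in [0, p) with one conditional addition."""
    d = a - b          # d is in (-p, p)
    if d < 0:          # bring a negative difference back into [0, p)
        d += p
    return d


if __name__ == "__main__":
    p = 23  # small illustrative prime, not a cryptographic size
    print(mod_add(17, 9, p), mod_sub(5, 9, p))
```

Because both operands are assumed to be reduced, a single conditional correction is always sufficient; no division is required.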
Multiplication, including squaring, is the most critical operation in the
computation of elliptic curve point multiplications. The multiplication of two elements
$a, b \in [0, p)$ yields a result $ab \in [0, (p-1)^2]$. The reduction of this result can be
performed with a division: $ab \bmod p \equiv (qp + r) \bmod p \equiv r \bmod p$, where $q = \lfloor ab/p \rfloor$
and $r = ab - qp \in [0, p)$ (q represents the quotient and r the remainder of the
division).
Division is more complex than multiplication. Division requires quotient es-
timation, multiplications, and additions. Modern reduction algorithms, such as
Montgomery reduction, yield approximated results. These methods trade accuracy
for speed. The approximated results are of the form $r + \epsilon p$, where $\epsilon \geq 0$ defines the
approximation accuracy. These results are congruent modulo p to the desired result
r.
For the elliptic curve processor for GF (p) introduced here, this work recommends
the use of Montgomery multiplication. Montgomery multiplication interleaves mul-
tiplication and reduction steps. The computational cost of Montgomery multiplica-
tion is approximately equal to the computational cost of two multiplications when
assuming that these multiplications are done using the schoolbook multiplication
method.
The following section provides a brief introduction to Montgomery reduction.
The version of the Montgomery multiplication algorithm of interest here is presented
in Section 5.2.
Inversion is the most complex operation that an elliptic curve processor must
implement. This work focuses on the computation of inverses using Fermat’s Little
Theorem (this is the same approach used here for $GF(2^m)$). Equation (2.1) shows
the expression used here to compute inverses ($GF(p)^*$ represents the multiplicative
group of $GF(p)$).
$$a^{-1} \bmod p \equiv a^{p-2} \bmod p \quad \text{for } a \in GF(p)^* \tag{2.1}$$
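As an illustration of Equation (2.1), the sketch below computes an inverse with a square-and-multiply exponentiation to the power p − 2. It is a software model only: it uses ordinary modular multiplication rather than the Montgomery multiplier recommended in this work, and the function name mod_inverse_fermat is a hypothetical choice.

```python
def mod_inverse_fermat(a: int, p: int) -> int:
    """Return a^(p-2) mod p, which equals a^(-1) mod p for prime p and a != 0 mod p."""
    base = a % p
    if base == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse in GF(p)")
    result = 1
    exponent = p - 2
    while exponent > 0:
        if exponent & 1:                 # multiply when the current exponent bit is set
            result = (result * base) % p
        base = (base * base) % p         # square for the next exponent bit
        exponent >>= 1
    return result


if __name__ == "__main__":
    p = 23  # small illustrative prime
    inv = mod_inverse_fermat(5, p)
    print(inv, (5 * inv) % p)
```

The loop performs about log2(p) squarings and at most as many multiplications, which is consistent with an inversion cost that is roughly m times the cost of one multiplication for m-bit operands.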
The bit-complexities of the arithmetic operations that must be implemented
by an elliptic curve processor are summarized in Table 2.1. This table assumes
that the field elements are represented by m-bit numbers ($m = \lceil \log_2 p \rceil$), that
multiplications are computed using the Montgomery multiplication algorithm, and
that inversions are computed according to Equation (2.1) using the Montgomery
multiplication algorithm.
2.2.1 Montgomery Reduction
This section provides a brief introduction to Montgomery reduction. For additional
information on this topic, readers are referred to the following references: [Mon85,
MvOV97, BSS99, KAK96].
Table 2.1: Complexity of $GF(p)$ arithmetic operations

$GF(p)$ operation                                        Bit-complexity
Addition                                                 $O(m)$
Multiplication (Montgomery mult.)                        $O(m^2)$
Inverse (Fermat’s Little Theorem & Montgomery mult.)     $O(m^3)$
The $GF(p)$ multiplier introduced here uses what is known as Montgomery
reduction. This reduction method was introduced in [Mon85]. The Montgomery
reduction of a number $x < RM$ using the algorithm introduced in [Mon85] yields
a weighted residue of the form $xR^{-1} \bmod M \in [0, 2M)$. R is a constant such that
R > M and gcd(R,M) = 1, which is arbitrarily chosen so that it simplifies the
reduction process: a power of two for hardware implementations. M represents the
modulus of operation (M = p for GF (p)). The following sections use M to represent
the modulus when discussing aspects related to Montgomery multiplication.
The effectiveness of Montgomery reduction lies in performing arithmetic using
weighted residues of the form xR mod M rather than x mod M itself. These residues
are referred to here as Montgomery residues. The basic idea is to perform a trans-
formation at the beginning of an algorithm, such as exponentiation, then perform
arithmetic using Montgomery residues, and, at the end of the algorithm, transform
the results back to residues that are not weighted. For example, the computation
of the exponentiation $x^e \bmod M$ involves the following steps:
1. Computation of the residue $xR \bmod M$.
2. Computation of the exponentiation $(xR \bmod M)^e \bmod M$ using multiplications
and Montgomery reductions. (Note that the Montgomery reduction of a
product $(xR \bmod M)(yR \bmod M)$ yields the Montgomery residue $xyR \bmod M$.)
3. Conversion of the result $(x^e)R \bmod M$ to $x^e \bmod M$.
4. If necessary, reduction of $x^e \bmod M$ to obtain the result in least residue
representation (i.e., a result in the range $[0, M)$).
The previous example demonstrates a common use of Montgomery residues.
Note that the arithmetic operations involving Montgomery residues are not limited
to multiplication; addition, subtraction, negation, equality/inequality testing, and
greatest common divisor computations involving M , can all be done using Mont-
gomery residues [Mon85].
Note that the transformation of $x \bmod M$ into $xR \bmod M$ can be performed by
multiplying $x \bmod M$ by $R^2 \bmod M$ and then reducing the result using Montgomery
reduction [Mon85]. (Note that the constant $R^2 \bmod M$ needs to be computed only
once for a given modulus and that standards that specify the use of elliptic curve
cryptography change these infrequently; for example, [FIP00] specifies a set of
moduli for the foreseeable future.)
The conversion of a result $xR \bmod M$ into $x \bmod M$ can be performed by
multiplying $xR \bmod M$ by one and then reducing the result using Montgomery reduction
($((xR \bmod M) * 1)R^{-1} \bmod M \equiv x \bmod M$) [Mon85].
The reduction in Step 4 may be needed when the result of the Montgomery
reduction exceeds the value of M .
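The four exponentiation steps and the two conversions described in this section can be sketched in software as follows. This is a word-level model, not the multiplier of Chapter 5: the names redc and montgomery_pow are hypothetical, the constant $-M^{-1} \bmod R$ is the usual precomputed value of Montgomery's method, and R is assumed here to be a power of two larger than 4M so that operands in [0, 2M) can be multiplied without intermediate corrections. The call pow(M, -1, R) requires Python 3.8 or later.

```python
def redc(x: int, M: int, r_bits: int, m_prime: int) -> int:
    """Montgomery reduction: return a value congruent to x * R^(-1) mod M.

    Requires 0 <= x < R*M with R = 2**r_bits; the result lies in [0, 2M).
    m_prime is the precomputed constant -M^(-1) mod R.
    """
    mask = (1 << r_bits) - 1
    q = ((x & mask) * m_prime) & mask   # q = x * (-M^(-1)) mod R
    return (x + q * M) >> r_bits        # x + q*M is divisible by R, so the shift is exact


def montgomery_pow(x: int, e: int, M: int) -> int:
    """Compute x^e mod M (M odd, e >= 0) following steps 1 to 4 above."""
    if M % 2 == 0:
        raise ValueError("M must be odd so that gcd(R, M) = 1")
    r_bits = M.bit_length() + 2         # assumption: R > 4M
    R = 1 << r_bits
    m_prime = (-pow(M, -1, R)) % R
    r2 = (R * R) % M                    # R^2 mod M, computed once per modulus

    # Step 1: convert x to its Montgomery residue xR mod M.
    xr = redc((x % M) * r2, M, r_bits, m_prime)
    # Montgomery residue of 1, used as the starting accumulator.
    acc = redc(r2, M, r_bits, m_prime)
    # Step 2: left-to-right square-and-multiply on Montgomery residues.
    for bit in bin(e)[2:]:
        acc = redc(acc * acc, M, r_bits, m_prime)
        if bit == "1":
            acc = redc(acc * xr, M, r_bits, m_prime)
    # Step 3: convert back by multiplying by one and reducing.
    out = redc(acc, M, r_bits, m_prime)
    # Step 4: final correction to the least residue in [0, M).
    if out >= M:
        out -= M
    return out


if __name__ == "__main__":
    M = 2**127 - 1
    print(montgomery_pow(7, 65537, M) == pow(7, 65537, M))
```

Note that only the last step uses a comparison against M; every intermediate value is allowed to stay in the wider range [0, 2M).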
2.3 $GF(2^m)$ Arithmetic Background
This section provides a brief introduction to GF (2m) field arithmetic. For additional
information on this subject the reader is referred to [McE87, LN94].
This work considers arithmetic in fields of characteristic two, $GF(2^m)$, using a
standard basis representation. A field $GF(2^m)$ is isomorphic to $GF(2)[x]/(F(x))$,
where $F(x) = x^m + \sum_{i=0}^{t} f_i x^i$ is a monic irreducible polynomial of degree m with
coefficients $f_i \in \{0, 1\}$. Here each residue class is represented by the polynomial of
least degree in its class.
A standard basis representation uses the basis defined by the set of elements
$\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$, where α is a root of the irreducible polynomial $F(x)$. In this
basis, field elements are represented as polynomials in α of degree less than m with
coefficients 0 or 1; for example, an element A is represented as $A = \sum_{i=0}^{m-1} a_i \alpha^i$ with
coefficients $a_i \in \{0, 1\}$.
The addition of two elements $A = \sum_{i=0}^{m-1} a_i \alpha^i$ and $B = \sum_{i=0}^{m-1} b_i \alpha^i$, as shown by
Equation (2.2), requires the modulo 2 addition of the coefficients of the two field
elements.
$$A + B \bmod F(\alpha) = \sum_{i=0}^{m-1} (a_i + b_i \bmod 2)\, \alpha^i \tag{2.2}$$
The multiplication of two field elements $A = \sum_{i=0}^{m-1} a_i \alpha^i$ and $B = \sum_{i=0}^{m-1} b_i \alpha^i$ is
defined by Equation (2.3). This equation expresses the multiplication operation in a
way that resembles the operation of the multipliers studied here. These multipliers
accumulate products of digits of the multiplier B and the multiplicand A. The
reductions modulo $F(\alpha)$ are computed using Equation (2.4).
$$AB \equiv \left( A \sum_{i=0}^{m-1} b_i \alpha^i \right) \bmod F(\alpha) \tag{2.3}$$
$$\alpha^{m+i} \equiv \sum_{j=0}^{m-1} f_j \alpha^{j+i} \tag{2.4}$$
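A bit-level software model of Equations (2.2) to (2.4) is sketched below. It assumes that a field element is stored as a Python integer whose bit i holds the coefficient $a_i$, and that the field polynomial is stored the same way with bit m set. The multiplier processes one coefficient of B per iteration, least-significant coefficient first, which corresponds to a digit size of one; the function names are hypothetical.

```python
def gf2m_add(a: int, b: int) -> int:
    """Equation (2.2): coefficient-wise addition modulo 2 is a bitwise XOR."""
    return a ^ b


def gf2m_mul(a: int, b: int, f: int, m: int) -> int:
    """Equation (2.3): accumulate b_i * (A * alpha^i) mod F(alpha) for i = 0..m-1.

    a, b: field elements with bit i equal to coefficient i (degree < m).
    f:    irreducible polynomial F(x) of degree m, including the x^m bit.
    """
    acc = 0
    for i in range(m):
        if (b >> i) & 1:
            acc ^= a            # add b_i * A * alpha^i to the accumulator
        a <<= 1                 # A * alpha^(i+1), possibly of degree m
        if (a >> m) & 1:
            a ^= f              # Equation (2.4): replace alpha^m by the lower terms of F
    return acc


if __name__ == "__main__":
    # GF(2^4) with F(x) = x^4 + x + 1, chosen only as a small illustrative field.
    F, M = 0b10011, 4
    print(bin(gf2m_mul(0b0010, 0b1000, F, M)))  # alpha * alpha^3
```

In this model the multiplicand is kept reduced after every shift, so the accumulator never exceeds degree m − 1 and no final reduction step is needed.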
Squaring is a special form of multiplication, whose computation in $GF(2^m)$ is
much more efficient than for general multiplications. Equation (2.5) provides an
expression for squaring the field element $A = \sum_{i=0}^{m-1} a_i \alpha^i$. As this equation shows, a
squaring operation requires the computation of a reduction [Wu99].
$$A^2 \equiv \left( \sum_{i=0}^{m-1} a_i \alpha^{2i} \right) \bmod F(\alpha) \tag{2.5}$$
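Equation (2.5) can be modeled in software as a bit-spreading step followed by a reduction. The sketch below makes the same representation assumptions as a plain integer encoding of polynomials (bit i holds coefficient i, and F(x) includes its $x^m$ bit); the function name gf2m_square is hypothetical, and the bit-by-bit reduction loop is chosen for clarity rather than as a model of any parallel squarer.

```python
def gf2m_square(a: int, f: int, m: int) -> int:
    """Square a field element of GF(2^m) according to Equation (2.5)."""
    # Spread the coefficients: a_i * alpha^i becomes a_i * alpha^(2i).
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    # Reduce modulo F(alpha): clear every term of degree >= m, highest first.
    for deg in range(2 * m - 2, m - 1, -1):
        if (spread >> deg) & 1:
            spread ^= f << (deg - m)
    return spread


if __name__ == "__main__":
    # GF(2^4) with F(x) = x^4 + x + 1, a small illustrative field.
    F, M = 0b10011, 4
    print(bin(gf2m_square(0b1000, F, M)))  # (alpha^3)^2 = alpha^6
```

The spreading step involves no logic at all in hardware (it is a rewiring of the coefficients), which is why the cost of squaring is dominated by the reduction.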
Inversions in GF (2m) can be computed with the Extended Euclidean Algorithm
[MvOV97]. They can also be computed with exponentiations using Fermat’s Little
Theorem as shown in Equation (2.6). Inversion with exponentiation takes advantage
of the factorization of the exponent, as it is shown in Equation (2.7), and of the
lower complexity of squaring operations in GF (2m) [BSS99]. This work focuses on
the implementation of inversions with exponentiations.
Variants of the inversion with exponentiation method are proposed in [IT88,
Van99]. These algorithms compute inverses with $m - 1$ squarings and with
$\lfloor \log_2(m-1) \rfloor + W(m-1) - 1$ multiplications [BSS99], where $W(m-1)$ represents the number
of nonzero coefficients in the binary representation of $m - 1$.
$$A^{-1} \equiv A^{2^m - 2} \bmod F(\alpha) \equiv A^{2(2^{m-1} - 1)} \bmod F(\alpha) \tag{2.6}$$
$$A^{2^{m-1}-1} = \begin{cases} A^{(2^{(m-1)/2}-1)(2^{(m-1)/2}+1)} = \left( A^{2^{(m-1)/2}-1} \right)^{2^{(m-1)/2}} A^{2^{(m-1)/2}-1}, & m \text{ is odd}, \\ A \, A^{2^{m-1}-2} = A \left( A^{2^{m-2}-1} \right)^{2}, & m \text{ is even}. \end{cases} \tag{2.7}$$
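The way Equations (2.6) and (2.7) combine can be sketched by applying the factorization of Equation (2.7) recursively to the exponent $2^k - 1$, starting with $k = m - 1$. The code below is an illustrative software model only: the helper names are hypothetical, squarings are performed with the general multiplier for brevity, and elements are assumed to be integers whose bit i holds coefficient i.

```python
def gf2m_mul(a: int, b: int, f: int, m: int) -> int:
    """Shift-and-add multiplication in GF(2^m); f includes the x^m bit."""
    acc = 0
    for i in range(m):
        if (b >> i) & 1:
            acc ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return acc


def pow_2k_minus_1(a: int, k: int, f: int, m: int) -> int:
    """Return A^(2^k - 1) by recursive use of the splitting in Equation (2.7)."""
    if k == 0:
        return 1
    if k == 1:
        return a
    if k % 2 == 0:
        half = pow_2k_minus_1(a, k // 2, f, m)      # A^(2^(k/2) - 1)
        t = half
        for _ in range(k // 2):                     # raise to the power 2^(k/2)
            t = gf2m_mul(t, t, f, m)
        return gf2m_mul(t, half, f, m)
    prev = pow_2k_minus_1(a, k - 1, f, m)           # A^(2^(k-1) - 1)
    return gf2m_mul(a, gf2m_mul(prev, prev, f, m), f, m)


def gf2m_inverse(a: int, f: int, m: int) -> int:
    """Equation (2.6): A^(-1) = (A^(2^(m-1) - 1))^2 for nonzero A."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse in GF(2^m)")
    t = pow_2k_minus_1(a, m - 1, f, m)
    return gf2m_mul(t, t, f, m)


if __name__ == "__main__":
    F, M = 0b100101, 5   # F(x) = x^5 + x^2 + 1, a small illustrative field
    inv = gf2m_inverse(0b00110, F, M)
    print(bin(inv), gf2m_mul(0b00110, inv, F, M))
```

Each even split costs one multiplication and each odd step costs one more, while the squarings add up to m − 1 in total, which is where the operation counts quoted above come from.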
Table 2.2 approximates the bit-complexity for GF (2m) addition, squaring, and
multiplication operations, which are the operations considered here for implemen-
tation in hardware. The table lists the bit-complexity approximations for different
types of irreducible polynomials. The column labeled “Arbitrary” summarizes the
complexity for implementations that support arbitrary field polynomials containing
r+1 nonzero coefficients. The columns labeled “Trinomial” and “Pentanomial” list
the complexity for implementations that support trinomials and pentanomials, for
which r is equal to 2 and 4, respectively.
Table 2.2 highlights that the complexity of squaring can be considered to be
linear for implementations that use fixed trinomials and pentanomials.
In Table 2.2 the complexity of the multiplication operation is based on the
schoolbook multiplication method:
$AB \bmod F(\alpha) = (b_0 A + b_1 A \alpha + \ldots + b_{m-1} A \alpha^{m-1}) \bmod F(\alpha)$,
where $B = \sum_{i=0}^{m-1} b_i \alpha^i$.
Table 2.3 provides a coarse approximation of the complexity of GF (2m) arith-
metic operations.
The security of an elliptic curve cryptosystem rests on the difficulty of the discrete
logarithm problem in the group defined by an elliptic curve over a finite field (see next
section). To preclude attacks, the finite fields must be very large (m ranging from
163 to 571 bits according to [FIP00]). The security of an elliptic curve cryptosystem
does not rest on the finite field itself. As a result, the field representation can be
chosen freely. Standards such as [IEE98, ANS98, ANS99, FIP00] recommend the use
of trinomials ($F(x) = x^m + x^t + 1$) and pentanomials ($F(x) = x^m + x^{t_3} + x^{t_2} + x^{t_1} + 1$)
as irreducible polynomials because they simplify implementations. The remainder
of this document assumes the use of trinomials and pentanomials for cryptographic
applications.
Table 2.2: Approximate bit-complexity of $GF(2^m)$ arithmetic operations

$GF(2^m)$ operation                   Irreducible polynomial
                                      Arbitrary                   Trinomial     Pentanomial
Addition                              $m$                         $m$           $m$
Square                                $rm$                        $2m$          $4m$
Multiplication (Schoolbook method)    $2m^2 + (r-2)m - r + 1$     $2m^2 - 1$    $2m^2 + 2m - 3$
Table 2.3: Complexity of $GF(2^m)$ arithmetic operations

$GF(2^m)$ operation                          Bit-complexity
Addition                                     $O(m)$
Square                                       $O(rm)$
Multiplication (Schoolbook mult. method)     $O(m^2)$
Inverse (Fermat’s Little Theorem)            $O(m^2 \log_2 m)$
2.4 Elliptic Curve Arithmetic
This section provides a brief introduction to elliptic curve arithmetic. Additional
information on this subject can be found in [Sti95, Kob94, BSS99, Ros98a, ST92,
Men93, IEE98].
An elliptic curve over a finite field GF (q) defines a set of points (x, y) that
satisfy an elliptic curve equation together with the point O, known as the “point at
infinity.” The “point at infinity” does not satisfy the elliptic curve equation. The
coordinates x and y of the points on the curve are elements of the field GF (q), where
$q = p^m$ and p is prime.
This work focuses on elliptic curves defined over the fields $GF(p)$ and $GF(2^m)$,
where p and m are primes. The use of elliptic curves defined over composite fields
$GF(2^m)$ (where m is not a prime) is discouraged due to recently discovered cryptographic
weaknesses in these structures [GHS00].
Equation (2.8) defines the elliptic curve equation for fields $GF(p)$ with $p > 3$,
and Equation (2.9) defines the elliptic curve equation for fields $GF(2^m)$. In Equation
(2.8), $a, b \in GF(p)$ and $4a^3 + 27b^2 \not\equiv 0 \bmod p$. In Equation (2.9), $a, b \in GF(2^m)$
with $b \neq 0$.
$$y^2 = x^3 + ax + b \tag{2.8}$$
$$y^2 + xy = x^3 + ax^2 + b \tag{2.9}$$
The set of discrete points on an elliptic curve forms an abelian group (commutative
group), whose group operation is known as point addition. The number of
discrete points on an elliptic curve defined over a finite field is bounded by the
expression shown in Equation (2.10), known as Hasse’s theorem, where the symbol n
represents the number of points on the elliptic curve and where $q = p^m$ represents
the number of elements in the underlying finite field.
$$q + 1 - 2\sqrt{q} \leq n \leq q + 1 + 2\sqrt{q} \tag{2.10}$$
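For very small fields the bound of Equation (2.10) can be checked by exhaustive counting. The sketch below counts the points of a curve of the form of Equation (2.8) over $GF(p)$, including the point at infinity, and tests the bound using integer arithmetic only. The function names and the example curve ($a = 1$, $b = 1$, $p = 23$) are illustrative assumptions; exhaustive counting is not practical for fields of cryptographic size.

```python
def count_points(a: int, b: int, p: int) -> int:
    """Count the points on y^2 = x^3 + ax + b over GF(p), plus the point at infinity."""
    if (4 * a**3 + 27 * b**2) % p == 0:
        raise ValueError("curve parameters violate 4a^3 + 27b^2 != 0 mod p")
    # sqrt_count[v] = number of y in GF(p) with y^2 = v
    sqrt_count = [0] * p
    for y in range(p):
        sqrt_count[(y * y) % p] += 1
    n = 1                                   # the point at infinity
    for x in range(p):
        n += sqrt_count[(x**3 + a * x + b) % p]
    return n


def satisfies_hasse(n: int, q: int) -> bool:
    """Check |n - (q + 1)| <= 2*sqrt(q), rewritten as (n - q - 1)^2 <= 4q."""
    return (n - q - 1) ** 2 <= 4 * q


if __name__ == "__main__":
    n = count_points(1, 1, 23)
    print(n, satisfies_hasse(n, 23))
```

Squaring both sides of the bound avoids floating-point square roots, so the check stays exact for any integer n and q.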
Elliptic curve point addition is defined according to the “chord-tangent process.”
Point addition is easiest to describe for elliptic curves defined over the real numbers
as follows.
Let P and Q be two distinct points on an elliptic curve E defined over the real
numbers with Q not equal to −P (Q is not the additive inverse of P ). The addition
of P and Q is the point R (R = P + Q); where R is the additive inverse of S, and
where S is the third point on the elliptic curve intercepted by a line through points
P and Q. For the curve under consideration, R is the reflection of the point S with
respect to the x-axis; that is, if R is the point (x, y), S is the point (x,−y). The
addition operation just described is shown in Figure 2.1.
When P and Q represent the same point on the elliptic curve and P is not equal
to −P , the addition of P and Q is the point R (R = 2P ); where R is the additive
inverse of S, and where S is the third point on the elliptic curve intercepted by a
line tangent to the curve at point P . The operation just described is referred to as
point double, and it is shown in Figure 2.2.
The “point at infinity,” O, is the additive identity of the group. The most
relevant operations involving O are the following: the addition of a point P and O
is equal to P (P + O = P ) and the addition of a point P and its additive inverse
−P is equal to O (P − P = O). In the previous expression, if P is a point on the
curve, then −P is also a point on the curve.
The point addition and the point double operations are generally computed in
computer systems using algebraic formulas derived from the geometrical operations
just described. Tables 2.4 and 2.5 summarize the point addition and the point
double formulas for elliptic curves defined over the fields $GF(p)$ and $GF(2^m)$. All
the operations in the tables are performed in the field over which the curve is defined.
The operations for the curves defined over fields $GF(2^m)$ are computed modulo $F(\alpha)$,
where $F(x)$ is the irreducible polynomial used to define the field $GF(2^m)$ and α is
a root of $F(x)$ in $GF(2^m)$.
The expressions in Tables 2.4 and 2.5 correspond to point representation using
affine coordinates. In affine coordinates, a point is represented by two coordinates,
x and y; for example, a point P in affine coordinates is represented as (x, y).
Point subtraction is a useful operation in some algorithms. This operation can
be performed with the point addition or point double formulas using the additive
inverse of the point to be subtracted. For example, the point subtraction P − Q
can be computed using the point addition operation as follows: P −Q = P +(−Q).
The additive inverse of a point P = (x, y) is the point (x,−y) for curves defined
over fields GF (p) and (x, x + y) for curves defined over fields GF (2m).
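The affine GF(p) formulas of Table 2.4, together with point negation and subtraction, can be sketched as follows. This is only an illustration: the toy curve y^2 = x^3 + 2x + 2 over GF(17) and the point (5, 1) are assumptions chosen to keep the numbers small, and the names `add`, `neg`, and `sub` are hypothetical.

```python
# Toy parameters (assumed for illustration): y^2 = x^3 + 2x + 2 over GF(17).
P_MOD, A_COEF = 17, 2
O = None  # the point at infinity, used as the additive identity


def neg(P):
    """Additive inverse of an affine point over GF(p): (x, -y)."""
    return O if P is O else (P[0], -P[1] % P_MOD)


def add(P, Q):
    """Affine point addition and point double (Table 2.4), with O handled explicitly."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O  # Q = -P, which also covers 2P = O when y1 = 0
    if P == Q:
        # Point double: slope of the tangent at P
        lam = (3 * x1 * x1 + A_COEF) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        # Point addition: slope of the line through P and Q
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def sub(P, Q):
    """Point subtraction computed as P + (-Q)."""
    return add(P, neg(Q))


G = (5, 1)
print("2G      =", add(G, G))
print("2G - G  =", sub(add(G, G), G))
print("G + (-G) is O:", add(G, neg(G)) is O)
```

The modular inverse uses the three-argument `pow` with exponent -1, which requires Python 3.8 or later.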
The operation used by elliptic curve cryptosystems is referred to here as point
multiplication. This operation is also referred to as scalar point multiplication
[IEE98]. The point multiplication operation is denoted here as kP, where k is
an integer and P is a point on the elliptic curve. The operation kP
represents the addition of k copies of the point P, as shown in Equation (2.11).
\[ kP = \underbrace{P + P + \cdots + P}_{k \text{ points } P} \tag{2.11} \]
Elliptic curve cryptosystems are built over cyclic groups. Each group contains a
finite number of points, n, that can be represented as scalar multiples of a generator
point: iP for i = 0, 1, . . . , n − 1, where P is a generator of the group. The order
of the point P is n, which implies that nP = O and iP ≠ O for i = 1, . . . , n − 1.
The order of each point in the group must divide n. Consequently, a point
multiplication kQ for k > n can be computed as (k mod n)Q. (Note: kQ = (an + b)(qP) =
(anq + bq)P = aq(nP) + bqP = O + b(qP) = bQ, where Q = qP and k = an + b.)
[Figure: elliptic curve with points P, Q, S, and R = P + Q]
Figure 2.1: Point addition
[Figure: elliptic curve with points P, S, and R = 2P]
Figure 2.2: Point double
Table 2.4: Point addition and point double formulas for elliptic curves defined over
fields GF(p) (p > 3) and points represented using affine coordinates
Description                             Formula
Elliptic curve equation                 y^2 ≡ x^3 + ax + b mod p;  4a^3 + 27b^2 ≢ 0 mod p
Point addition                          x_3 ≡ ((y_2 − y_1)/(x_2 − x_1))^2 − x_1 − x_2 mod p
(x_3, y_3) = (x_1, y_1) + (x_2, y_2)    y_3 ≡ ((y_2 − y_1)/(x_2 − x_1))(x_1 − x_3) − y_1 mod p
Point double                            x_3 ≡ ((3x_1^2 + a)/(2y_1))^2 − 2x_1 mod p
(x_3, y_3) = 2(x_1, y_1)                y_3 ≡ ((3x_1^2 + a)/(2y_1))(x_1 − x_3) − y_1 mod p
Table 2.5: Point addition and point double formulas for elliptic curves defined over
fields GF(2^m) and points represented using affine coordinates
Description                             Formula
Elliptic curve equation                 y^2 + xy ≡ x^3 + ax^2 + b mod F(α);  b ≠ 0
Point addition                          x_3 ≡ ((y_2 + y_1)/(x_2 + x_1))^2 + ((y_2 + y_1)/(x_2 + x_1)) + x_1 + x_2 + a mod F(α)
(x_3, y_3) = (x_1, y_1) + (x_2, y_2)    y_3 ≡ ((y_2 + y_1)/(x_2 + x_1))(x_1 + x_3) + x_3 + y_1 mod F(α)
Point double                            x_3 ≡ (x_1 + y_1/x_1)^2 + (x_1 + y_1/x_1) + a mod F(α)
(x_3, y_3) = 2(x_1, y_1)                y_3 ≡ (x_1 + y_1/x_1)(x_1 + x_3) + x_3 + y_1 mod F(α)
2.5 Elliptic Curve Discrete Logarithm Problem (ECDLP)
Elliptic curve cryptosystems base their security on what is known as the elliptic curve
discrete logarithm problem. This problem can be stated as follows. Given a known
elliptic curve and two known points P and Q, where Q = kP, it is computationally
infeasible to determine the value of k if the parameters have been carefully chosen.
Elliptic curve cryptosystems use cyclic groups with very large numbers of points;
for example, the FIPS 186-2 standard [FIP00] recommends groups for which the
number of points ranges from about 2^163 to about 2^571, depending on the curve
and the underlying finite field. The best cryptanalysis algorithms known today, such
as Pollard's rho algorithm [Pol78], compute an elliptic curve discrete logarithm
with an average of O(√n) point operations if the parameters have been carefully
chosen, where n represents the number of points of the cyclic group being used.
Using the best cryptanalysis algorithms, determining a discrete logarithm in the
curves specified by the FIPS 186-2 standard requires over 2^80 point operations. This
is an intractable problem given the current computer technology. It is important to
realize that well-chosen curves achieve the degree of security specified here. Other
curves may exhibit structures that facilitate cryptanalysis; for example, curves de-
fined over composite fields of characteristic two. More information on this topic can
be found in [BSS99, IEE98].
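As a rough worked example of the estimate above, take the two group sizes quoted from FIPS 186-2 and ignore the constant hidden in the O(√n) bound (an assumption made only to keep the arithmetic simple):

```latex
\[
\sqrt{n} \approx \sqrt{2^{163}} = 2^{81.5} > 2^{80},
\qquad
\sqrt{n} \approx \sqrt{2^{571}} = 2^{285.5}.
\]
```

Even the smallest of these groups therefore stays above the 2^80 point operations mentioned in the text.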
Figure 2.3 shows an example of an elliptic curve cryptosystem. The example
shows a key establishment using the elliptic curve analogue of the Diffie-Hellman
key agreement algorithm. In this example, the secret components, which are the
integers a and b, are chosen by Alice and Bob in the first step of the algorithm.
These secret components are never revealed. The public components, which are the
points A and B, are generated and then exchanged over an insecure channel in the
second and the third steps. Finally, each party computes a shared secret point S
in Step 4, using his/her secret component and the peer’s public component. For a
properly set up system, an eavesdropper will be unable to compute the shared secret
from the exchanged public components. In particular, an attacker faces the elliptic
curve discrete logarithm problem if he/she intends to recover a secret component
from its associated public component.
Figure 2.3: Elliptic curve analogue of the Diffie-Hellman key agreement algorithm
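The four steps of the key agreement described above can be sketched on a toy curve. Everything concrete here is an assumption for illustration: the curve y^2 = x^3 + 2x + 2 over GF(17), the base point (5, 1) of prime order 19, and the helper names `add` and `mul`. A group this small offers no security.

```python
import secrets

# Toy domain parameters (assumed): y^2 = x^3 + 2x + 2 over GF(17).
P_MOD, A_COEF = 17, 2
G, N = (5, 1), 19   # base point and its prime order
O = None            # point at infinity


def add(P, Q):
    """Affine point addition/double over GF(p); O is the additive identity."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def mul(k, P):
    """Point multiplication kP by left-to-right double-and-add."""
    Q = O
    for bit in bin(k)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q


# Step 1: each party picks a secret integer in [1, N - 1].
a = secrets.randbelow(N - 1) + 1
b = secrets.randbelow(N - 1) + 1
# Steps 2 and 3: public points are computed and exchanged.
A_pub = mul(a, G)
B_pub = mul(b, G)
# Step 4: each party combines its secret with the peer's public point.
S_alice = mul(a, B_pub)
S_bob = mul(b, A_pub)
print("shared secrets agree:", S_alice == S_bob)
```

Both parties arrive at the same point because a(bG) = b(aG) = (ab)G.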
2.6 Coordinate Representation
Point multiplications are computed with iterated point addition and point double
operations. For cryptographic applications the value of k is generally large, which
translates into a large number of point addition and point double operations for
most algorithms. The complexity of an algorithm is a function of the complexity of
the employed point addition and point double operations.
The point addition and the point double formulas for affine coordinates, listed
in Tables 2.4 and 2.5, include field additions, subtractions, multiplications, squares,
and inverses. The field additions, subtractions, multiplications, and squares can be
efficiently computed in hardware and software. In general, field inverses cannot be
computed as efficiently as field additions, subtractions, multiplications, or squares
in either hardware or software.
Some coordinate representations do not require field inverses in their point ad-
dition and point double operations, which makes them attractive for hardware and
software implementations.
Projective coordinates are attractive for cryptographic systems for which the
computational costs of field inversions are much higher than the computational
costs of field multiplications.
Different projective coordinate representations exist. Some of the most popular
ones are discussed in [CC87, CMO98]. This section discusses point addition using
Jacobian coordinates. Point multiplication using Jacobian coordinates is discussed
in more detail in [BSS99, IEE98].
The computation of point multiplication using Jacobian coordinates usually in-
volves the following steps. First, the point to be multiplied is transformed from affine
to Jacobian coordinates. Then, the point multiplication is done using Jacobian
coordinates. Finally, the resulting point is transformed from Jacobian coordinates back
into affine coordinates.
In Jacobian coordinates, a point P is represented by the coordinates X, Y, and Z
as follows: P = (X, Y, Z). The transformation of a point from affine coordinates to
Jacobian coordinates is done using the following expression: (x, y) ⇒ (X = x, Y =
y, Z = 1). The transformation of a point from Jacobian to affine coordinates is done
using the following expression: (X, Y, Z) ⇒ (x = X/Z^2, y = Y/Z^3).
Table 2.6 lists the elliptic curve equation for curves defined over fields GF (p)
that corresponds to points represented using Jacobian coordinates. This table also
summarizes the corresponding formulas for point addition and point double. Table
2.7 lists the same information as Table 2.6 for curves defined over fields GF (2m).
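The Jacobian form of the GF(p) curve equation follows from the affine equation by substitution. As a short worked derivation, substitute x = X/Z^2 and y = Y/Z^3 into y^2 = x^3 + ax + b and multiply both sides by Z^6:

```latex
\[
\frac{Y^2}{Z^6} \equiv \frac{X^3}{Z^6} + a\,\frac{X}{Z^2} + b
\quad\Longrightarrow\quad
Y^2 \equiv X^3 + aXZ^4 + bZ^6 \pmod{p}.
\]
```

Setting Z = 1 recovers the affine equation, which is consistent with the affine-to-Jacobian transformation (X = x, Y = y, Z = 1).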
From Tables 2.6 and 2.7 it is evident that in Jacobian coordinates, point addi-
tions and point doubles can be computed with field additions, subtractions, multi-
plications, and squares (no field inversions required). The computation of a point
multiplication requires the computation of one field inverse at the end of the point
multiplication process, where the resulting point is converted from Jacobian to affine
coordinates. The computational cost of this inverse is relatively low compared to
the computational cost of the point multiplication operation.
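The GF(p) formulas of Table 2.6 can be sketched as follows, with the single field inversion confined to the final conversion. The toy curve y^2 = x^3 + 2x + 2 over GF(17) and the function names are assumptions for illustration; the sketch does not handle the point at infinity, and `jac_add` assumes two distinct points with P ≠ −Q.

```python
# Toy parameters (assumed): y^2 = x^3 + 2x + 2 over GF(17).
P_MOD, A_COEF = 17, 2
INV2 = pow(2, -1, P_MOD)  # constant 2^(-1) mod p used in the Y3 formula


def to_jacobian(P):
    """(x, y) => (X = x, Y = y, Z = 1)."""
    return (P[0], P[1], 1)


def to_affine(P):
    """(X, Y, Z) => (X/Z^2, Y/Z^3); the only field inversion in the chain."""
    X, Y, Z = P
    zi = pow(Z, -1, P_MOD)
    return (X * zi * zi % P_MOD, Y * zi * zi * zi % P_MOD)


def jac_double(P):
    """Point double in Jacobian coordinates (Table 2.6)."""
    X1, Y1, Z1 = P
    M = 3 * X1 * X1 + A_COEF * Z1 ** 4
    Z3 = 2 * Y1 * Z1
    S = 4 * X1 * Y1 * Y1
    X3 = M * M - 2 * S
    T = 8 * Y1 ** 4
    Y3 = M * (S - X3) - T
    return (X3 % P_MOD, Y3 % P_MOD, Z3 % P_MOD)


def jac_add(P, Q):
    """Point addition in Jacobian coordinates (Table 2.6), P != +/-Q assumed."""
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    U1, S1 = X1 * Z2 ** 2, Y1 * Z2 ** 3
    U2, S2 = X2 * Z1 ** 2, Y2 * Z1 ** 3
    W, R = U1 - U2, S1 - S2
    T, M = U1 + U2, S1 + S2
    Z3 = Z1 * Z2 * W
    X3 = R * R - T * W * W
    V = T * W * W - 2 * X3
    Y3 = (V * R - M * W ** 3) * INV2
    return (X3 % P_MOD, Y3 % P_MOD, Z3 % P_MOD)


G = (5, 1)
D = jac_double(to_jacobian(G))
print("2G =", to_affine(D))
print("3G =", to_affine(jac_add(to_jacobian(G), D)))
```

Intermediate values are left unreduced and reduced only on return, which is convenient with Python integers; a hardware or fixed-width implementation would reduce after every field operation.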
The concept of point addition is not restricted to the addition of points rep-
resented using the same coordinate representation. It is often advantageous to
add points that are represented using different representations, an approach that
is referred to as mixed coordinates. For example, the addition P +Q with P repre-
sented using Jacobian coordinates and with Q represented using affine coordinates
is computationally simpler than the addition of two points represented in Jacobian
coordinates. Mixed coordinate representations are studied in detail in [CMO98] for
curves defined over fields GF(p).
Table 2.8 lists the computational complexities of point addition and point double
for curves defined over fields GF (p), whose points are represented using different
coordinate representations. Table 2.9 lists similar information for curves defined over
fields GF(2^m). In these tables A and J refer, respectively, to affine and Jacobian
coordinates. An expression of the form A + J ⇒ J indicates the addition of a
point represented in affine coordinates and a point represented in Jacobian coordi-
nates whose result is a point represented in Jacobian coordinates. An expression of
the form 2A ⇒ A represents the point double operation of a point represented in
affine coordinates that yields another point represented in affine coordinates. The
computational complexity is measured in terms of field squares, multiplications, and
inverses, which are represented, respectively, by S, M, and I. The computational
complexities of modular additions and subtractions are considered to be much lower
than those of modular multiplications, squares, and inverses.
The results in Tables 2.8 and 2.9 include complexity values for different values
of a, where a is one of the parameters that define an elliptic curve. As the tables
show, the complexity of the point addition operation varies for different values of a.
(Standards, such as [FIP00], specify values of a that facilitate implementations.)
Tables 2.8 and 2.9 also specify the number of field elements that need to be
stored during the computation of point addition or point double operations.
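The trade-off expressed by Table 2.8 can be made concrete by asking how expensive an inversion must be, relative to a multiplication, before a projective operation becomes cheaper than its affine counterpart. The sketch below copies the operation counts for general a from Table 2.8; the function names and the assumed square-to-multiplication cost ratio of 1 are illustrative choices, not values from the text.

```python
# (multiplications, squares, inverses) per operation, from Table 2.8, general a.
TABLE_2_8 = {
    "A+A=>A": (2, 1, 1),
    "2A=>A": (2, 2, 1),
    "J+J=>J": (12, 4, 0),
    "A+J=>J": (8, 3, 0),
    "2J=>J": (4, 6, 0),
}


def cost_in_mults(op, s_per_m, i_per_m):
    """Cost of one group operation expressed in field multiplications."""
    m, s, i = TABLE_2_8[op]
    return m + s * s_per_m + i * i_per_m


def break_even_inverse_cost(affine_op, projective_op, s_per_m):
    """I/M ratio above which the projective operation is the cheaper one.

    Relies on each affine operation in the table using exactly one inverse.
    """
    return (cost_in_mults(projective_op, s_per_m, 0)
            - cost_in_mults(affine_op, s_per_m, 0))


for aff, proj in (("2A=>A", "2J=>J"), ("A+A=>A", "A+J=>J"), ("A+A=>A", "J+J=>J")):
    ratio = break_even_inverse_cost(aff, proj, s_per_m=1.0)
    print(f"{proj} beats {aff} when I > {ratio} M")
```

The comparison ignores field additions and the single inversion needed to convert the final result back to affine coordinates.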
The data in Tables 2.8 and 2.9 show that the point addition and the point
double operations do not require the computation of inverses when one of the points
is represented using projective coordinates, but they require the computation of a
larger number of multiplications, including squares, than those required when the
points are represented using affine coordinates. This behavior also holds for
other projective coordinate representations, such as Chudnovsky [CC87], Chudnovsky-Jacobian
[CC87], modified Jacobian [CMO98], and Lopez-Dahab [LD99b] (GF(2^m) only) coordinates.
Table 2.6: Point addition and point double formulas for elliptic curves defined over
fields GF(p) (p > 3) and points represented using Jacobian coordinates
Description                                 Formula
Elliptic curve equation                     Y^2 ≡ X^3 + aXZ^4 + bZ^6 mod p;  4a^3 + 27b^2 ≢ 0 mod p
Point addition                              U_1 ≡ X_1 Z_2^2 mod p
(X_3, Y_3, Z_3) = (X_1, Y_1, Z_1) +         S_1 ≡ Y_1 Z_2^3 mod p
                  (X_2, Y_2, Z_2)           U_2 ≡ X_2 Z_1^2 mod p
                                            S_2 ≡ Y_2 Z_1^3 mod p
                                            W ≡ U_1 − U_2 mod p
                                            R ≡ S_1 − S_2 mod p
                                            T ≡ U_1 + U_2 mod p
                                            M ≡ S_1 + S_2 mod p
                                            Z_3 ≡ Z_1 Z_2 W mod p
                                            X_3 ≡ R^2 − T W^2 mod p
                                            V ≡ T W^2 − 2X_3 mod p
                                            Y_3 ≡ (V R − M W^3) ∗ 2^{−1} mod p
Point double                                M ≡ 3X_1^2 + aZ_1^4 mod p
(X_3, Y_3, Z_3) = 2(X_1, Y_1, Z_1)          Z_3 ≡ 2Y_1 Z_1 mod p
                                            S ≡ 4X_1 Y_1^2 mod p
                                            X_3 ≡ M^2 − 2S mod p
                                            T ≡ 8Y_1^4 mod p
                                            Y_3 ≡ M(S − X_3) − T mod p
Table 2.7: Point addition and point double formulas for elliptic curves defined over
fields GF(2^m) and points represented using Jacobian coordinates
Description                                 Formula
Elliptic curve equation                     Y^2 + XYZ ≡ X^3 + aX^2 Z^2 + bZ^6 mod F(α);  b ≠ 0
Point addition                              U_1 ≡ X_1 Z_2^2 mod F(α)
(X_3, Y_3, Z_3) = (X_1, Y_1, Z_1) +         S_1 ≡ Y_1 Z_2^3 mod F(α)
                  (X_2, Y_2, Z_2)           U_2 ≡ X_2 Z_1^2 mod F(α)
                                            W ≡ U_1 + U_2 mod F(α)
                                            S_2 ≡ Y_2 Z_1^3 mod F(α)
                                            R ≡ S_1 + S_2 mod F(α)
                                            L ≡ Z_1 W mod F(α)
                                            V ≡ R X_2 + L Y_2 mod F(α)
                                            Z_3 ≡ L Z_2 mod F(α)
                                            T ≡ R + Z_3 mod F(α)
                                            X_3 ≡ aZ_3^2 + T R + W^3 mod F(α)
                                            Y_3 ≡ T X_3 + V L^2 mod F(α)
Point double                                Z_3 ≡ X_1 Z_1^2 mod F(α)
(X_3, Y_3, Z_3) = 2(X_1, Y_1, Z_1)          X_3 ≡ (X_1 + b^{1/4} Z_1^2)^4 mod F(α)
                                            U ≡ Z_3 + X_1^2 + Y_1 Z_1 mod F(α)
                                            Y_3 ≡ X_1^4 Z_3 + U X_3 mod F(α)
Table 2.8: Computational complexity of point addition and point double for curves
defined over fields GF(p) (p > 3)
Operation     Restrictions     Complexity     Storage
A + A ⇒ A     none             2M + S + I     5
2A ⇒ A        none             2M + 2S + I    4
J + J ⇒ J     none             12M + 4S       7
A + J ⇒ J     none             8M + 3S        6
2J ⇒ J        none             4M + 6S        5
              a small          3M + 6S        5
              a ≡ −3 mod p     4M + 4S        5
Table 2.9: Computational complexity of point addition and point double for curves
defined over fields GF(2^m)
Operation     Restrictions     Complexity     Storage
A + A ⇒ A     none             2M + S + I     5
2A ⇒ A        none             2M + S + I     5
J + J ⇒ J     none             15M + 5S       9
              a = 0            14M + 4S       8
A + J ⇒ J     none             11M + 4S       8
              a = 0            10M + 3S       7
2J ⇒ J        none             5M + 5S        4
2.7 Special Coordinates and Algorithms
Computing a point multiplication using the coordinates studied in the previous
section requires the use of different formulas for point addition and point double.
Moreover, a point multiplication algorithm must explicitly handle the computations
involving the “point at infinity.” For example, the point multiplication algorithm
must handle explicitly the cases P + Q = O and 2P = O by verifying that P ≠ −Q
in the first case and that P ≠ −P in the second case. The point multiplication
algorithm must also verify that two points are different before applying the point
addition formula, because applying the point addition formula when Q = P yields
the wrong result for P + Q.
The computational differences among point addition, point double, and the operations
involving the “point at infinity” provide attackers with information about the point
multiplier k, which often serves as a private key. The computational differences lead
to different processing times and different power signatures, which allow attackers
to mount timing and power attacks.
The ideal case would be to use a common formula for point addition and point
double that implicitly handles the operations involving the “point at infinity.”
The following section discusses one algorithm that approximates the ideal case
for curves defined over fields GF (2m). The section after that briefly discusses a new
projective coordinates scheme that works efficiently with elliptic curves defined over
fields GF (p) that exhibit a special set of properties.
2.7.1 Montgomery Point Multiplication Algorithm for GF(2^m)
This section provides a brief description of the Montgomery point multiplication
algorithm introduced in [LD99a]. For more details on this algorithm, the reader is
referred to [LD99a].
The Montgomery point multiplication algorithm for GF (2m) discussed here uses
a variant of the double-and-add point multiplication algorithm discussed later in
this work. The algorithm is based on the observation that the x coordinate of the
sum of two points P1 and P2, whose difference is known to be P (P2−P1 = P ), can
be computed using the x coordinates of the points P , P1, and P2. The y coordinate
of the point P1, which contains the point multiplication result at the end of the
point multiplication process, can be recovered using the x coordinates of P , P1, and
P2 together with the y coordinate of P .
The description in the previous paragraph corresponds to the affine coordinates
version of the algorithm, which is unattractive for implementation because it requires
the computation of inverses for each group operation. An attractive projective
coordinates version of the algorithm was also introduced in [LD99a].
The projective coordinates version of the algorithm uses mixed coordinates. The
x coordinates of the points P1 and P2 are represented in projective coordinates. The
x coordinate of P1 = (x1, y1) is represented by X1 and Z1, where x1 = X1/Z1. The x
coordinate of P2 is represented by X2 and Z2, where x2 = X2/Z2. The y coordinates
of P1 and P2 are not used in the algorithm. The coordinates of P are maintained in
affine coordinates: P = (x, y). The coordinates of the resulting point kP = (xk, yk)
are recovered using the coordinates X1, Z1, X2, Z2, x, and y. (Note that kP = P1 at
the end of the point multiplication algorithm.)
Table 2.10 lists the operations required for the computation of point multiplica-
tion using the projective coordinates version of the Montgomery point multiplication
algorithm. The multiplication algorithm starts with the initialization of P1 = P and
P2 = 2P (initialize function); only the x coordinates of P1 and P2 are initialized.
Then, the point multiplication is computed using a variant of the double-and-add
algorithm (montgomery point multiplication function), which in turn computes the
x coordinate of a point double (mdouble function) and the x coordinate of a point
addition (madd function) in each iteration of the loop. Finally, the coordinates of
the point kP = (x_k, y_k) are recovered using the X1, Z1, X2, and Z2 coordinates
of the points P1 and P2, together with the x and y coordinates of P (compute xy
function).
In the montgomery point multiplication function, the x coordinate of P1, x =
X1/Z1, maintains the x coordinate of the accumulated value of kP (kP = P1). This
is analogous to the accumulation of kP in the double-and-add algorithm. Point P2
is always set to be P1 + P , which guarantees the point difference P2 − P1 = P .
Maintaining this constant point difference is what makes possible the recovery of
the y coordinate of kP at the end of the point multiplication algorithm.
From Table 2.10, one can appreciate that the computational effort is the same
in every iteration of the loop in the montgomery point multiplication function. The
code that implements the montgomery point multiplication function can be bal-
anced so that the computational differences introduced by the values ki are mini-
mized, thus minimizing the amount of information that can be leaked by the point
multiplication process in the form of processing time or power signature differences.
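The projective version of the algorithm summarized in Table 2.10 can be sketched as below and checked against the affine formulas of Table 2.5. The field GF(2^4) with F(x) = x^4 + x + 1, the curve coefficients, and all function names other than those in Table 2.10 are assumptions chosen to keep the example tiny. The sketch assumes x ≠ 0, k ≥ 1, and that neither kP nor (k + 1)P is the point at infinity, and it makes no attempt at constant-time execution.

```python
M_DEG = 4            # toy field GF(2^4); assumed for illustration only
F_POLY = 0b10011     # F(x) = x^4 + x + 1
A, B = 0b0001, 0b0110  # hypothetical curve coefficients, b != 0


def gf_mul(a, b):
    """Multiply in GF(2^m): shift-and-add with reduction modulo F(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M_DEG:
            a ^= F_POLY
    return r


def gf_pow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


def gf_inv(a):
    return gf_pow(a, (1 << M_DEG) - 2)   # a^(2^m - 2) = 1/a for a != 0


def madd(X1, Z1, X2, Z2, x):
    t1, t2 = gf_mul(X1, Z2), gf_mul(X2, Z1)
    Z3 = gf_mul(t1 ^ t2, t1 ^ t2)
    return gf_mul(t1, t2) ^ gf_mul(x, Z3), Z3


def mdouble(X, Z, c):
    X2, Z2 = gf_mul(X, X), gf_mul(Z, Z)
    t = X2 ^ gf_mul(c, Z2)                 # X^2 + c Z^2
    return gf_mul(t, t), gf_mul(X2, Z2)    # (X^4 + c^2 Z^4, X^2 Z^2)


def montgomery_point_multiplication(x, y, k):
    c = gf_pow(B, 1 << (M_DEG - 1))        # c = b^(1/2)
    X1, Z1 = x, 1                          # P1 = P
    X2, Z2 = gf_pow(x, 4) ^ B, gf_mul(x, x)  # P2 = 2P
    for bit in bin(k)[3:]:                 # bits k_{l-2} ... k_0
        if bit == "1":
            X1, Z1 = madd(X1, Z1, X2, Z2, x)
            X2, Z2 = mdouble(X2, Z2, c)
        else:
            X2, Z2 = madd(X2, Z2, X1, Z1, x)
            X1, Z1 = mdouble(X1, Z1, c)
    # compute xy: recover affine (x_k, y_k)
    x1, x2 = gf_mul(X1, gf_inv(Z1)), gf_mul(X2, gf_inv(Z2))
    t = gf_mul(x1 ^ x, x2 ^ x) ^ gf_mul(x, x) ^ y
    return x1, gf_mul(gf_mul(t, x1 ^ x), gf_inv(x)) ^ y


def affine_add(P, Q):
    """Reference affine arithmetic over GF(2^m) (Table 2.5); None is O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if (y1 ^ y2) == x1:                # Q = -P = (x1, x1 + y1)
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))
        x3 = gf_mul(lam, lam) ^ lam ^ A
    else:
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
        x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return (x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1)


def on_curve(x, y):
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    return lhs == (gf_pow(x, 3) ^ gf_mul(A, gf_mul(x, x)) ^ B)


# First curve point with x != 0, found by exhaustive search.
P = next((x, y) for x in range(1, 1 << M_DEG)
         for y in range(1 << M_DEG) if on_curve(x, y))
print("P =", P, " 2P =", montgomery_point_multiplication(P[0], P[1], 2))
```

Both branches of the loop perform one `madd` and one `mdouble`, which mirrors the balanced structure the text describes.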
In addition to protecting against attacks, the Montgomery point multiplication
algorithm is attractive for implementations because of its low computational complexity
and low storage requirements. These are two of the reasons why this algorithm
was prototyped in the elliptic curve processor for curves defined over fields GF(2^m)
discussed later in this work. The computational complexity of this algorithm is
summarized in Table 2.11. The algorithm requires storage for 8 field elements.
The Montgomery point multiplication method described here does not apply to curves
defined over fields GF(p). Recently, a way to recover the y coordinate of a point kP was
described in [OS01] for elliptic curves defined by the following elliptic curve equation:
by^2 = x^3 + ax^2 + x (for curves defined over fields GF(p) with p > 3). These curves are
used in factoring algorithms. Currently, these curves are not specified in standards
for elliptic curve cryptosystems.
Table 2.10: Montgomery point multiplication using projective coordinates
(kP = (x_k, y_k), P = (x, y), y^2 + xy = x^3 + ax^2 + b with b ≠ 0)

montgomery point multiplication(x, y, k)
    /* P1 = P, P2 = 2P, k = (k_{l−1} . . . k_0)_2 */
    (X1, Z1, X2, Z2, c) = initialize(x, y, b)
    for i = l − 2 downto 0 do
        if k_i = 1 then
            /* P1.X = P1.X + P2.X, P2.X = 2P2.X */
            (X1, Z1) = madd(X1, Z1, X2, Z2, x)
            (X2, Z2) = mdouble(X2, Z2, c)
        else
            /* P2.X = P1.X + P2.X, P1.X = 2P1.X */
            (X2, Z2) = madd(X2, Z2, X1, Z1, x)
            (X1, Z1) = mdouble(X1, Z1, c)
        end if
    end for
    return(compute xy(X1, Z1, X2, Z2, x, y))

initialize(x, y, b)
    X1 = x; Z1 = 1; c = b^{1/2} = b^{2^{m−1}}
    X2 = x^4 + b; Z2 = x^2
    return(X1, Z1, X2, Z2, c)

madd(X1, Z1, X2, Z2, x)
    Z3 = ((X1 Z2) + (X2 Z1))^2
    X3 = (X1 Z2)(X2 Z1) + x Z3
    return(X3, Z3)

mdouble(X, Z, c)
    return(X3 = X^4 + c^2 Z^4, Z3 = X^2 Z^2)

compute xy(X1, Z1, X2, Z2, x, y)
    x_k = X1/Z1
    y_k = ((X1/Z1 + x)(X2/Z2 + x) + x^2 + y) ∗ (X1/Z1 + x)/x + y
    return(x_k, y_k)
Table 2.11: Computational complexity of the Montgomery point multiplication algorithm
(kP with log_2 k ≈ m)
# Mult.     # Squares     # Inverses
6m + 10     5m + 3        1
2.7.2 Jacobi Form
In [LS01], the Jacobi form of an elliptic curve is used to compute point
multiplications. The method is applicable to elliptic curves for which x^3 + ax + b has
three roots in GF(p).
Using the Jacobi form of an elliptic curve, point additions and point doubles can
be computed using the same formula. The use of the same formula for point addition
and point double limits the amount of information that can be leaked in the form of
processing time and power signature, especially when used with an algorithm that
requires a number of point additions and point doubles that is independent of the
value of the point multiplier k.
The computation of this formula requires 13 field multiplications and 3 field
squares. The formula can be simplified for the case of point double, which can be
computed with 4 field multiplications and 3 field squares. As for projective coordi-
nates schemes, a point multiplication algorithm must perform a transformation at
the beginning and at the end of the point multiplication algorithm.
Similar to conventional point representations, the point addition method dis-
cussed in [LS01] is not tied to a particular multiplication algorithm as is the case of
the Montgomery point multiplication algorithm discussed in the previous section.
Therefore, this point addition method can be used with the point multiplication
algorithms discussed later in this work.
The theory behind the point addition method introduced in [LS01] is beyond
the scope of this dissertation. Additional information on this subject can be found
in [LS01].
2.8 Point Multiplication Algorithms
This section describes some of the most popular point multiplication algorithms.
The algorithms discussed here for point multiplications are mostly variants of sim-
ilar algorithms employed for exponentiation. The most popular algorithms for ex-
ponentiation are described in [MvOV97]. The algorithms for point multiplication
covered here are discussed in detail in [BSS99, Gor98, HLM00, LD00, IEE98, Sol99].
The following two sections study two main classes of algorithms. These are the
generic point multiplication algorithms that can be used to compute an arbitrary
point multiplication and the fixed-point point multiplication algorithms, which can
be used to compute point multiplications involving known points. The fixed-point
point multiplication algorithms are of interest because point multiplication with a
known point can be computed much more efficiently than for arbitrary points. In
addition, fixed-point multiplication is a common operation in elliptic curve crypto-
graphic algorithms.
Following the description of the point multiplication algorithms, Section 2.8.3
summarizes the performance and memory requirements for the different point mul-
tiplication algorithms.
Note: To simplify the description of the algorithms presented in the following
subsections, a point addition is assumed to add two distinct points. When the points
to be added are the same, the point addition operation must be substituted with a
point double operation; that is, instead of using Q = P + Q the operation that must
be used is Q = 2Q.
2.8.1 Generic Point Multiplication Algorithms
Generic point multiplication algorithms can be used to compute point multiplica-
tions involving arbitrary points. This section discusses five point multiplication
algorithms: double-and-add (or binary), w-ary, addition-subtraction, signed w-ary,
and width-w addition-subtraction point multiplication algorithms.
Double-and-Add Point Multiplication Algorithm
One of the simplest point multiplication algorithms is the double-and-add point
multiplication algorithm. Algorithm 2.8.1.1 shows the left-to-right version of the
double-and-add point multiplication algorithm. This algorithm inspects the multi-
plier k, starting with its most significant bit and ending with its least significant
bit. For each inspected bit, the algorithm performs a point double (Step 2.1), and,
if the inspected bit is a one, the algorithm also performs a point addition (Step 2.2.1).
The double-and-add point multiplication algorithm requires, on average, l point
doubles and l/2 point additions, where l ≈ ⌈log_2 k⌉. This algorithm also requires
the storage of two points, P and Q.
Algorithm 2.8.1.1: Double-and-add point multiplication algorithm
Inputs: P – point to multiply
k = Σ_{i=0}^{l−1} k_i 2^i, k_i ∈ [0, 1] – point multiplier
Output: kP
/* Initialization */
1. Q = O
/* Point multiplication */
2. for i = l − 1 down to 0 do
2.1 Q = 2Q /* point double */
2.2 if k_i ≠ 0 then
2.2.1 Q = Q+ P /* point addition */
end for
3. return (Q)
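Algorithm 2.8.1.1 translates almost line for line into code. The sketch below runs it on a toy curve; the curve y^2 = x^3 + 2x + 2 over GF(17), the point (5, 1), and the helper `add` are assumptions for illustration. The helper folds the point double and the point at infinity into one function for brevity, which is convenient here but is exactly the kind of data-dependent branching that Section 2.7 warns about.

```python
# Toy parameters (assumed): y^2 = x^3 + 2x + 2 over GF(17).
P_MOD, A_COEF = 17, 2
O = None  # point at infinity


def add(P, Q):
    """Affine addition/double over GF(p) (Table 2.4) with O handled explicitly."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if P == Q:
        lam = (3 * x1 * x1 + A_COEF) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def double_and_add(k, P):
    """Left-to-right double-and-add point multiplication."""
    Q = O                                        # Step 1
    for i in range(k.bit_length() - 1, -1, -1):  # Step 2: i = l - 1 down to 0
        Q = add(Q, Q)                            # Step 2.1: point double
        if (k >> i) & 1:                         # Step 2.2
            Q = add(Q, P)                        # Step 2.2.1: point addition
    return Q                                     # Step 3


G = (5, 1)
for k in (1, 2, 3, 7):
    print(f"{k}G =", double_and_add(k, G))
```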
w-ary Point Multiplication Algorithm
The w-ary point multiplication algorithm, Algorithm 2.8.1.2, is a generalization
of the double-and-add point multiplication algorithm that processes w bits of the
multiplier k in each iteration.
The first main step of Algorithm 2.8.1.2 is the recoding of the multiplier k in
radix 2^w with ⌈l/w⌉ digits in the range [0, 2^w): k = Σ_{i=0}^{⌈l/w⌉−1} k′_i 2^{wi}
with k′_i ∈ [0, 2^w).
This representation can be derived directly from the binary representation of k; for
example, the number k = (01 10 11 00)_2 is recoded as k = (1230)_4 in radix 4.
The second main step of Algorithm 2.8.1.2 is the precomputation of the values
iP for i ∈ [2, 2^w). The basic idea is to compute these values once and then use the
precomputed values as necessary in the point multiplication operation.
The third main step of Algorithm 2.8.1.2 is the actual point multiplication
operation. This operation involves ⌈l/w⌉ iterations. In each iteration, the accumulated
value Q is doubled w times (Step 4.1). If the recoded digit k′_i is nonzero, then
the precomputed point P_{k′_i} is added to the accumulated point (Step 4.2.1). Note
that, as in Algorithm 2.8.1.1, the recoded digits are consumed starting with the most
significant digit and ending with the least significant one.
The precomputation effort of the w-ary point multiplication algorithm requires
approximately 2^{w−1} point doubles and 2^{w−1} point additions. The point
multiplication phase requires on average l point doubles and l/w point additions, when
assuming that k′_i ≠ 0 for all i and where l ≈ ⌈log_2 k⌉. In total, the algorithm
requires approximately 2^{w−1} + l point doubles and 2^{w−1} + l/w point additions. The
algorithm also requires the storage of approximately 2^w points.
For a given set of parameters, it may be advantageous to compute 2^w P directly
in Step 4.1 with a closed expression instead of computing it with w individual point
doubles (2(2(2(. . . 2P)))). Formulations for the direct computation of 2^w P are given
in [GP97, LD98a] for curves defined over fields GF(2^m) and in [LD98b, ITT+99] for
curves defined over fields GF(p).
The w-ary point multiplication algorithm is a fixed-size windowing algorithm.
Sliding window algorithms are extensions of the w-ary point multiplication algorithm
that use variable-size windows. A study in [Koc95] reveals that sliding window
algorithms for exponentiation in GF(p^m) for p > 3 are 4.5% to 7.8% faster than
the fixed-size windowing algorithms for exponents ranging from 128 to 512 bits.
The complexity of point double operations is typically lower than that of point
addition operations, which suggests that larger speedups could be obtained in point
multiplication algorithms. Sliding window algorithms are not discussed further in
this work. Additional information about sliding window algorithms can be found in
[BSS99, Gor98].
Algorithm 2.8.1.2: w-ary point multiplication algorithm
Inputs: P – point to multiply
k = Σ_{i=0}^{l−1} k_i 2^i, k_i ∈ [0, 1] – point multiplier
Output: kP
/* Recoding of k: k = Σ_{i=0}^{⌈l/w⌉−1} k′_i 2^{wi}, k′_i ∈ [0, 2^w) */
1. for i = 0 to ⌈l/w⌉ − 1 do
1.1 k′_i = k mod 2^w
1.2 k = k − k′_i
1.3 k = k/2^w
end for
/* Initialization */
2. Q = O; P_1 = P
/* Precomputations: P_i = iP, i ∈ [1, 2^w) */
3. for i = 1 to 2^{w−1} − 1 do
3.1 P_{2i} = 2P_i
3.2 P_{2i+1} = P_{2i} + P
end for
/* Point multiplication */
4. for i = ⌈l/w⌉ − 1 down to 0 do
4.1 Q = 2^w Q /* w point doubles */
4.2 if k′_i ≠ 0 then
4.2.1 Q = Q + P_{k′_i} /* point addition */
end for
5. return (Q)
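The three steps of Algorithm 2.8.1.2 (recoding, precomputation, multiplication) can be sketched generically. To keep the example short and easy to check, the group operations are passed in as parameters and the demonstration uses the integers modulo a small number as a stand-in additive group, so that kP is simply k·P mod n. That stand-in, the modulus 1009, and the function names are assumptions; a real use would pass elliptic curve point addition instead.

```python
def recode_radix_2w(k, w):
    """Step 1: digits k'_i in [0, 2^w), least significant digit first."""
    digits = []
    while k:
        digits.append(k % (1 << w))
        k >>= w
    return digits


def w_ary_multiply(k, P, w, add, identity):
    """w-ary point multiplication over any additive group given by `add`."""
    # Steps 2 and 3: table[i] = iP for i in [0, 2^w)
    table = [identity, P]
    for i in range(1, 1 << (w - 1)):
        table.append(add(table[i], table[i]))   # P_{2i} = 2 P_i
        table.append(add(table[2 * i], P))      # P_{2i+1} = P_{2i} + P
    # Step 4: consume digits from most to least significant
    Q = identity
    for digit in reversed(recode_radix_2w(k, w)):
        for _ in range(w):                      # Step 4.1: Q = 2^w Q
            Q = add(Q, Q)
        if digit:                               # Step 4.2
            Q = add(Q, table[digit])            # Step 4.2.1
    return Q


# Stand-in group (assumed): integers modulo 1009 under addition.
N_MOD = 1009
mod_add = lambda a, b: (a + b) % N_MOD

k = 0b01101100
print("radix-4 digits, most significant first:", recode_radix_2w(k, 2)[::-1])
print("kP =", w_ary_multiply(k, 5, 2, mod_add, 0))
```

The printed digits reproduce the recoding example from the text, (01 10 11 00)_2 = (1230)_4.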
Addition-Subtraction Point Multiplication Algorithm
The addition-subtraction point multiplication algorithm, shown in Algorithm 2.8.1.3,
is an extension of the double-and-add point multiplication algorithm that computes
point multiplications using point additions, point subtractions, and point doubles.
By incorporating point subtractions, whose computational complexities are quite
similar to those of point additions, this algorithm achieves a lower computational
complexity than the double-and-add point multiplication algorithm with a relatively
small increase in implementation complexity.
The addition-subtraction point multiplication algorithm can be realized with different
signed digit representations. This work considers the use of the non-adjacent
form (NAF) representation described in [Sol99]. Using this representation, a
multiplier k = Σ_{i=0}^{l−1} k_i 2^i is uniquely recoded as k = Σ_{i=0}^{l} k′_i 2^i
with k′_i ∈ [−1, 1], where
the recoded representation does not contain contiguous nonzero digits and where
the average number of nonzero digits is l/3. The NAF representation of an l-bit
multiplier k is at most l + 1 digits long.
The first main step of the addition-subtraction point multiplication algorithm is
the recoding of the multiplier k. The second main step is the point multiplication
operation. The point multiplication operation consists of l + 1 loop iterations. In
each loop iteration an accumulated point is doubled (Step 3.1). Also in each iteration
of the loop, if the recoded digit under inspection is a one, a point is added to the
accumulated point (Step 3.2.1). If the value of the recoded digit is negative one (-1),
a point is subtracted from the accumulated point (Step 3.3.1).
The addition-subtraction point multiplication algorithm requires, on average, l
point doubles and l/3 point additions. This algorithm also requires the storage of
two points, P and Q.
Algorithm 2.8.1.3: Addition-subtraction point multiplication algorithm
Inputs: P – point to multiply
k = Σ_{i=0}^{l−1} k_i 2^i, k_i ∈ [0, 1] – point multiplier
Output: kP
/* Recoding of k: k = Σ_{i=0}^{l} k′_i 2^i, k′_i ∈ [−1, 1] */
1. for i = 0 to l do
1.1 if k mod 2 = 1 then
1.1.1 k′_i = 2 − (k mod 2^2)
1.2 else
1.2.1 k′_i = 0
1.3 k = k − k′_i
1.4 k = k/2
end for
/* Initialization */
2. Q = O
/* Point multiplication */
3. for i = l down to 0 do
3.1 Q = 2Q /* point double */
3.2 if k′_i = 1 then
3.2.1 Q = Q + P /* point addition */
3.3 if k′_i = −1 then
3.3.1 Q = Q − P /* point subtraction */
end for
4. return (Q)
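The NAF recoding of Step 1 and the multiplication loop of Step 3 can be sketched as follows. As an assumption made to keep the example self-contained, the group is again passed in as `add`, `neg`, and `identity`, and the demonstration uses integers modulo 1009 as a stand-in additive group; the function names are hypothetical.

```python
def naf(k):
    """NAF digits k'_i in {-1, 0, 1} of k, least significant digit first."""
    digits = []
    while k > 0:
        if k % 2 == 1:
            d = 2 - (k % 4)      # Step 1.1.1: +1 if k = 1 mod 4, -1 if k = 3 mod 4
        else:
            d = 0                # Step 1.2.1
        digits.append(d)
        k = (k - d) // 2         # Steps 1.3 and 1.4
    return digits


def add_sub_multiply(k, P, add, neg, identity):
    """Addition-subtraction point multiplication over a generic additive group."""
    Q = identity                             # Step 2
    for d in reversed(naf(k)):               # Step 3: most significant digit first
        Q = add(Q, Q)                        # Step 3.1: point double
        if d == 1:
            Q = add(Q, P)                    # Step 3.2.1: point addition
        elif d == -1:
            Q = add(Q, neg(P))               # Step 3.3.1: point subtraction
    return Q


# Stand-in group (assumed): integers modulo 1009 under addition.
N_MOD = 1009
mod_add = lambda a, b: (a + b) % N_MOD
mod_neg = lambda a: (-a) % N_MOD

print("NAF(7), most significant first:", naf(7)[::-1])
print("7P =", add_sub_multiply(7, 5, mod_add, mod_neg, 0))
```

For k = 7 the binary form 111 needs two additions, while the NAF form 8 − 1 needs one addition and one subtraction with no adjacent nonzero digits.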
Signed w-ary Point Multiplication Algorithm
The signed w-ary point multiplication algorithm, shown in Algorithm 2.8.1.4, is an
extension of the w-ary point multiplication algorithm that computes point multipli-
cations using point additions, point subtractions, and point doubles.
The first main step of Algorithm 2.8.1.4 is the recoding of the multiplier k in
radix 2^w with ⌈(l+1)/w⌉ digits in the range [−2^{w−1}, 2^{w−1}):
k = Σ_{i=0}^{⌈(l+1)/w⌉−1} k′_i 2^{wi}.
For example, using this representation, the number k = (11 10 01 00)_2 can be
recoded as k = (1 0 2̄ 1 0)_4 in radix 4, where 2̄ = −2. (Note that the signed
w-ary point multiplication algorithm can be implemented using different signed digit
representations.)
The second main step of Algorithm 2.8.1.4 is the precomputation of the values
iP for i ∈ [2, 2^{w−1}]. The basic idea is to compute these values once and then use the
precomputed values as necessary in the point multiplication operation. The additive
inverse of a point is generated as necessary when performing point subtractions.
The third main step of Algorithm 2.8.1.4 is the actual point multiplication
operation. This operation involves $\lceil (l+1)/w \rceil$ iterations. In each iteration, the
accumulated value of Q is doubled w times (Step 5.1). If the recoded digit $k'_i$ is greater
than 0, then the point $P_{k'_i}$ is added to the accumulated point (Step 5.2.1). If the recoded
digit $k'_i$ is negative, then the point $P_{|k'_i|}$ is subtracted from the accumulated point
(Step 5.3.1). Note that, as in the w-ary point multiplication algorithm, the recoded
digits are consumed starting with the most significant digit and ending with the
least significant one.
The precomputations of the signed w-ary point multiplication algorithm require
approximately $2^{w-2}$ point doubles and $2^{w-2}$ point additions. The point
multiplication phase requires approximately l point doubles and l/w point additions (including
subtractions), assuming that $k'_i \neq 0$ for all i, where $l \approx \lceil \log_2 k \rceil$. In total,
the algorithm requires approximately $2^{w-2} + l$ point doubles and $2^{w-2} + l/w$ point
additions (including subtractions). The algorithm also requires the storage of $2^{w-1}$
points.
In comparison with the w-ary point multiplication algorithm, the signed w-ary
point multiplication algorithm requires the precomputation and storage of half as
many points.
Algorithm 2.8.1.4: Signed w-ary point multiplication algorithm
Inputs: P – point to multiply
        $k = \sum_{i=0}^{l-1} k_i 2^i$, $k_i \in [0, 1]$ – point multiplier
Output: kP
/* Recoding of k: $k = \sum_{i=0}^{\lceil (l+1)/w \rceil - 1} k'_i 2^{wi}$, $k'_i \in [-2^{w-1}, 2^{w-1})$ */
1. for i = 0 to $\lceil (l+1)/w \rceil - 1$ do
   1.1 $k'_i = k \bmod 2^w$
   1.2 if $k'_i \geq 2^{w-1}$ then
       1.2.1 $k'_i = -(2^w - k'_i)$
   1.3 $k = k - k'_i$
   1.4 $k = k/2^w$
end for
/* Initialization */
2. Q = O; $P_1 = P$
/* Precomputations: $P_i = iP$, $i \in [1, 2^{w-1}]$ */
3. for i = 1 to $2^{w-2} - 1$ do
   3.1 $P_{2i} = 2P_i$
   3.2 $P_{2i+1} = P_{2i} + P$
end for
4. $P_{2^{w-1}} = 2P_{2^{w-2}}$
/* Point multiplication */
5. for i = $\lceil (l+1)/w \rceil - 1$ down to 0 do
   5.1 $Q = 2^w Q$ /* w point doubles */
   5.2 if $k'_i > 0$ then
       5.2.1 $Q = Q + P_{k'_i}$ /* point addition */
   5.3 if $k'_i < 0$ then
       5.3.1 $Q = Q - P_{|k'_i|}$ /* point subtraction */
end for
6. return (Q)
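A minimal Python sketch of Algorithm 2.8.1.4 follows, with the same caveats as any illustration of this kind: integers under addition stand in for curve points, the function names are hypothetical, w ≥ 2 is assumed, and the recoding loop runs until k is exhausted instead of for a fixed number of digits.

```python
def signed_wary_recode(k, w):
    """Signed radix-2^w digits in [-2^(w-1), 2^(w-1)), least significant first."""
    digits = []
    while k > 0:
        d = k % (1 << w)
        if d >= 1 << (w - 1):
            d -= 1 << w              # map the upper half of the range to negatives
        digits.append(d)
        k = (k - d) >> w             # k - d is a multiple of 2^w
    return digits


def signed_wary_multiply(k, P, w):
    """Compute kP following Steps 2-6; P is an integer stand-in for a point."""
    if w < 2:
        raise ValueError("this sketch assumes w >= 2")
    # Precomputation: table[i] = iP for i in [1, 2^(w-1)] (Steps 2-4).
    table = {1: P}
    for i in range(1, 1 << (w - 2)):
        table[2 * i] = 2 * table[i]
        table[2 * i + 1] = table[2 * i] + P
    table[1 << (w - 1)] = 2 * table[1 << (w - 2)]
    # Point multiplication, most significant digit first (Step 5).
    Q = 0
    for d in reversed(signed_wary_recode(k, w)):
        for _ in range(w):
            Q = 2 * Q                # w point doubles
        if d > 0:
            Q = Q + table[d]         # point addition
        elif d < 0:
            Q = Q - table[-d]        # point subtraction
    return Q
```

For the example in the text, `signed_wary_recode(0b11100100, 2)` read from most to least significant digit gives 1, 0, −2, 1, 0.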
Width-w Addition-Subtraction Point Multiplication Algorithm
Algorithm 2.8.1.5 shows the width-w addition-subtraction point multiplication al-
gorithm described in [Sol99].
The first main step of the algorithm is the recoding of the multiplier k (Steps
1–1.4). In this algorithm, the multiplier $k = \sum_{i=0}^{l-1} k_i 2^i$ with $k_i \in [0, 1]$ is recoded
using a width-w non-adjacent form as follows: $k = \sum_{i=0}^{l} k'_i 2^i$, where
$k'_i \in (-2^{w-1}, 2^{w-1})$ and where each nonzero $k'_i$ is odd.
The second main step of the algorithm is the precomputation of the points iP
for odd values of i in the range $[3, 2^{w-1})$ (Steps 3–5.1).
The third main step of the algorithm is the point multiplication process. This is
an iterative process in which one digit of the recoded k is inspected in each iteration.
In each iteration the accumulated point is doubled (Step 6.1). If the scanned digit is
greater than zero, the point $P_{\lfloor k'_i/2 \rfloor}$ is added to the accumulated point (Step 6.2.1). If
the scanned digit is negative, the point $P_{\lfloor |k'_i|/2 \rfloor}$ is subtracted from the accumulated
point (Step 6.3.1).
The precomputation phase of the width-w addition-subtraction point
multiplication algorithm requires one point double and $2^{w-2} - 1$ point additions. The point
multiplication phase requires on average approximately l point doubles and l/(w+1)
point additions (including subtractions), where $l \approx \lceil \log_2 k \rceil$. In total, the algorithm
requires approximately l point doubles and $2^{w-2} + l/(w+1)$ point additions
(including subtractions). The algorithm also requires the storage of approximately $2^{w-2}$
points.
Algorithm 2.8.1.5: Width-w addition-subtraction point multiplication algorithm
Inputs: P – point to multiply
        $k = \sum_{i=0}^{l-1} k_i 2^i$, $k_i \in [0, 1]$ – point multiplier
Output: kP
/* Recoding of k: $k = \sum_{i=0}^{l} k'_i 2^i$, $k'_i \in (-2^{w-1}, 2^{w-1})$ */
1. for i = 0 to l do
   1.1 if k mod 2 = 1 then
       1.1.1 $k'_i = k \bmod 2^w$
       1.1.2 if $k'_i \geq 2^{w-1}$ then
             1.1.2.1 $k'_i = -(2^w - k'_i)$
   1.2 else
       1.2.1 $k'_i = 0$
   1.3 $k = k - k'_i$
   1.4 $k = k/2$
end for
/* Initialization */
2. Q = O
/* Precomputations */
3. $P_0 = P$
4. T = 2P
5. for i = 1 to $2^{w-2} - 1$ do
   5.1 $P_i = P_{i-1} + T$ /* $P_i = P_{i-1} + 2P$ */
end for
/* Point multiplication */
6. for i = l down to 0 do
   6.1 Q = 2Q
   6.2 if $k'_i > 0$ then
       6.2.1 $Q = Q + P_{\lfloor k'_i/2 \rfloor}$ /* $Q = Q + k'_i P$ */
   6.3 if $k'_i < 0$ then
       6.3.1 $Q = Q - P_{\lfloor |k'_i|/2 \rfloor}$ /* $Q = Q - |k'_i| P$ */
end for
7. return (Q)
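The width-w recoding and the odd-multiple table of Algorithm 2.8.1.5 can be illustrated with the Python sketch below. As an assumption made for testability, integers under addition replace curve points; the function names are hypothetical, w ≥ 2 is assumed, and the recoding stops when k reaches zero.

```python
def wnaf_recode(k, w):
    """Width-w NAF digits of k, least significant first (Steps 1-1.4)."""
    digits = []
    while k > 0:
        if k % 2 == 1:
            d = k % (1 << w)
            if d >= 1 << (w - 1):
                d -= 1 << w          # odd digit in (-2^(w-1), 2^(w-1))
        else:
            d = 0
        digits.append(d)
        k = (k - d) // 2
    return digits


def wnaf_multiply(k, P, w):
    """Compute kP following Steps 2-7; P is an integer stand-in for a point."""
    if w < 2:
        raise ValueError("this sketch assumes w >= 2")
    # Precomputation: table[j] = (2j + 1)P for j in [0, 2^(w-2)) (Steps 3-5.1).
    table = [P]
    T = 2 * P
    for _ in range(1, 1 << (w - 2)):
        table.append(table[-1] + T)
    Q = 0
    for d in reversed(wnaf_recode(k, w)):
        Q = 2 * Q                    # point double
        if d > 0:
            Q = Q + table[d // 2]    # adds dP, since d = 2*(d // 2) + 1
        elif d < 0:
            Q = Q - table[(-d) // 2] # subtracts |d|P
    return Q
```

Because every nonzero digit clears the low w bits of k, at most one digit in any w consecutive positions is nonzero, which is where the l/(w+1) average addition count comes from.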
2.8.2 Fixed-Point Point Multiplication Algorithms
This section discusses the special case of point multiplication using fixed points. This
operation is used in elliptic curve cryptographic algorithms, such as the analogues of
the Diffie-Hellman key agreement algorithm [ANS99], the ElGamal encryption and
digital signature algorithms, and the Digital Signature Algorithm (DSA) [FIP00].
Fixed-Point Windowing Point Multiplication Algorithm
The fixed-point windowing point multiplication algorithm, described by Algorithm
2.8.2.1, is based on the fixed-base exponentiation algorithm introduced in [BGMW93].
Algorithm 2.8.2.1 shows a variant of the fixed-point windowing point multiplication
algorithm discussed in [Gor98] that recodes the multiplier k using signed digit rep-
resentation.
In the fixed-point windowing point multiplication algorithm, the multiplier k is
recoded as $k = \sum_{i=0}^{\lceil (l+1)/w \rceil - 1} k'_i 2^{wi}$ with $k'_i \in [-2^{w-1}, 2^{w-1})$. Using this recoding,
the point multiplication can be expressed as follows:
$kP = \sum_{i=0}^{\lceil (l+1)/w \rceil - 1} k'_i (2^{wi} P)$.
Because the point P is known, it is possible to precompute the points $2^{wi}P$ for
$i = 1 \ldots \lceil (l+1)/w \rceil - 1$. Given the precomputed points, a point multiplication can
be computed by adding, or subtracting, $|k'_i|$ copies of $2^{wi}P$ for all $k'_i \neq 0$. The
fixed-point windowing point multiplication algorithm performs these point additions and
point subtractions in an efficient manner.
The first main step of the fixed-point windowing point multiplication algorithm
is the off-line precomputation of the points $2^{wi}P$ for $i = 1 \ldots \lceil (l+1)/w \rceil - 1$ (Steps
1–2.1).
The second main step of the fixed-point windowing point multiplication algo-
rithm is the recoding of the multiplier k (Steps 3–3.4).
The third main step of the fixed-point windowing point multiplication algorithm is
the point multiplication process (Steps 5–5.3). This is an iterative process that adds a
point $k'_i(2^{wi}P)$ by adding the point $2^{wi}P$ to an accumulated point B when $k'_i > 0$ or by
subtracting the point $2^{wi}P$ from B when $k'_i < 0$. From the iteration in which the point
$2^{wi}P$ is added or subtracted until the last loop iteration, B is added to the result A
a total of $|k'_i|$ times; therefore, the result incorporates the point $k'_i(2^{wi}P)$.
Assuming off-line precomputation, the point multiplication requires
approximately $2^{w-1} + l/w$ point additions (including subtractions), where $l \approx \lceil \log_2 k \rceil$.
The algorithm also requires the storage of approximately $\lceil l/w \rceil$ points.
Algorithm 2.8.2.1: Fixed-point windowing point multiplication algorithm
Inputs: P – fixed-point to multiply
        $k = \sum_{i=0}^{l-1} k_i 2^i$, $k_i \in [0, 1]$ – point multiplier
Output: kP
/* Off-line precomputations: $P_i = 2^{wi}P$ */
1. $P_0 = P$
2. for i = 1 to $\lceil (l+1)/w \rceil - 1$ do
   2.1 $P_i = 2^w P_{i-1}$
end for
/* Recoding of k: $k = \sum_{i=0}^{\lceil (l+1)/w \rceil - 1} k'_i 2^{wi}$, $k'_i \in [-2^{w-1}, 2^{w-1})$ */
3. for i = 0 to $\lceil (l+1)/w \rceil - 1$ do
   3.1 $k'_i = k \bmod 2^w$
   3.2 if $k'_i \geq 2^{w-1}$ then
       3.2.1 $k'_i = -(2^w - k'_i)$
   3.3 $k = k - k'_i$
   3.4 $k = k/2^w$
end for
/* Initialization */
4. A = O; B = O
/* Point multiplication */
5. for j = $2^{w-1}$ down to 1 do
   5.1 for each i for which $k'_i = j$ do
       5.1.1 $B = B + P_i$
   5.2 for each i for which $k'_i = -j$ do
       5.2.1 $B = B - P_i$
   5.3 A = A + B
end for
6. return (A)
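The accumulation trick of Steps 4–6 can be sketched in Python as shown below. Integers under addition stand in for curve points and the function name is hypothetical. As a simplification that departs from the listing, the sketch recodes until k is exhausted and sizes the table to match, because the signed recoding can need one digit more than $\lceil (l+1)/w \rceil$ for some multipliers (for example k = 7 with w = 2 and l = 3); a real implementation would compute the table off-line.

```python
def fixed_window_multiply(k, P, w):
    """Compute kP with the A/B accumulation of Algorithm 2.8.2.1.

    Assumptions: P is an integer stand-in for the fixed point and the
    table, normally computed off-line, is built here for convenience.
    """
    # Recoding: signed radix-2^w digits, least significant first.
    digits = []
    while k > 0:
        d = k % (1 << w)
        if d >= 1 << (w - 1):
            d -= 1 << w
        digits.append(d)
        k = (k - d) >> w
    # Precomputation: table[i] = 2^(wi) P.
    table = [P]
    for _ in range(1, len(digits)):
        table.append((1 << w) * table[-1])
    # Point multiplication: no doubles, only additions and subtractions.
    A = 0
    B = 0
    for j in range(1 << (w - 1), 0, -1):
        for i, d in enumerate(digits):
            if d == j:
                B = B + table[i]
            elif d == -j:
                B = B - table[i]
        A = A + B                    # table[i] ends up counted |d| times in A
    return A
```

A point that enters B at iteration j is carried through the j remaining additions A = A + B, so it contributes j copies to the result without any explicit multiplication by j.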
Fixed-Point Comb Point Multiplication Algorithm
The fixed-point comb point multiplication algorithm, described by Algorithm 2.8.2.2,
is based on the fixed-base comb exponentiation algorithm introduced in [LL94].
The fixed-point comb point multiplication algorithm arranges the scalar
multiplier $k = \sum_{i=0}^{l-1} k_i 2^i$ as shown to the left of the vertical bar in Figure 2.4. The
arrangement of the multiplier k consists of v two-dimensional arrays, where each
two-dimensional array contains h rows and b columns and where l = ah and a = vb
(note that the multiplier k can be extended by adding zeros to the most significant
bit positions to meet these conditions).
The contribution of digit $k_i$ to the point multiplication $kP = \sum_{i=0}^{l-1} k_i 2^i P$,
namely $k_i 2^i P$, is shown in Figure 2.4 as follows. The weight of a multiplier bit $k_i$
within a row is listed in the top row and the weight of a row is listed to the right of
the vertical bar. To determine the contribution of bit $k_i$, multiply the weight at the top
of the column containing $k_i$, the weight of the row that contains $k_i$, and the value of
$k_i$; this process forms the value $k_i 2^i P$.
$2^{b-1}$             . . .  $2$                  $1$
$k_{b-1}$             . . .  $k_1$                $k_0$               |  $P$
$k_{a+b-1}$           . . .  $k_{a+1}$            $k_a$               |  $2^{a}P$
. . .
$k_{(h-1)a+b-1}$      . . .  $k_{(h-1)a+1}$       $k_{(h-1)a}$        |  $2^{(h-1)a}P$

$k_{2b-1}$            . . .  $k_{b+1}$            $k_b$               |  $2^{b}P$
$k_{a+2b-1}$          . . .  $k_{a+b+1}$          $k_{a+b}$           |  $2^{a+b}P$
. . .
$k_{(h-1)a+2b-1}$     . . .  $k_{(h-1)a+b+1}$     $k_{(h-1)a+b}$      |  $2^{(h-1)a+b}P$

. . .

$k_{vb-1} = k_{a-1}$           . . .  $k_{(v-1)b+1}$        $k_{(v-1)b}$        |  $2^{(v-1)b}P$
$k_{a+vb-1} = k_{2a-1}$        . . .  $k_{a+(v-1)b+1}$      $k_{a+(v-1)b}$      |  $2^{a+(v-1)b}P$
. . .
$k_{(h-1)a+vb-1} = k_{ah-1}$   . . .  $k_{(h-1)a+(v-1)b+1}$ $k_{(h-1)a+(v-1)b}$ |  $2^{(h-1)a+(v-1)b}P$
Figure 2.4: Arrangement of multiplier k for the fixed-point comb point multiplication
algorithm
The fixed-point comb point multiplication algorithm makes use of a
two-dimensional precomputation table consisting of v rows and $2^h - 1$ columns. Each entry
in the table is a precomputed point. There is one row in the table for each of the
two-dimensional arrays in Figure 2.4 (v two-dimensional arrays). There is also one
column in the table for each binary combination of the h-tuple formed by the digits
in the columns of the two-dimensional arrays, excluding the h-tuple containing only
zeros ($2^h - 1$ columns).
A precomputation table entry in Algorithm 2.8.2.2 is denoted
G[array index][entry index], where array index is an integer in the range [0, v)
that refers to one of the v two-dimensional arrays in Figure 2.4 and where the
entry index is an integer in the range $[1, 2^h)$ that refers to a column in the table
corresponding to the binary representation of an h-tuple containing digits of k.
In Algorithm 2.8.2.2, a column is pointed to by $I_{s,r} = \sum_{t=0}^{h-1} k_{ta+bs+r} 2^t$, where s
specifies a two-dimensional array and r specifies a column in it. Note that the entry
$G[s][I_{s,r}]$ in the lookup table contains the point $\sum_{t=0}^{h-1} (k_{ta+bs+r} 2^{ta+bs})P$; the
remaining factor $2^r$ is supplied by the point doubles of the main loop (Step 5.1). A
sample precomputation table is shown in Figure 2.5.
Figure 2.5: $G[s][I_{s,r}]$ precomputation table for the fixed-point comb point
multiplication algorithm
$s \backslash I_{s,r}$   $2^h - 1$                               . . .   $2$                 $1$
$0$                      $\sum_{i=0}^{h-1} 2^{ia}P$              . . .   $2^{a}P$            $P$
$1$                      $\sum_{i=0}^{h-1} 2^{ia+b}P$            . . .   $2^{a+b}P$          $2^{b}P$
. . .                    . . .                                   . . .   . . .               . . .
$v - 1$                  $\sum_{i=0}^{h-1} 2^{ia+(v-1)b}P$       . . .   $2^{a+(v-1)b}P$     $2^{(v-1)b}P$
The first main step of the fixed-point comb point multiplication algorithm is the
off-line computation of the precomputation table (Steps 1–2.2.1).
The second main step is the point multiplication process (Steps 5–5.2.2.1). This
is an iterative process consisting of b steps, each of which consists of v sub-steps. In
each sub-step, a precomputed point corresponding to a column in one of the v
two-dimensional arrays is added to an accumulated point. The v sub-steps add the points
corresponding to the same column in each of the v two-dimensional arrays. In each
main step, the accumulated point is doubled and the points selected by the v sub-steps
are added to it (this is analogous to the double-and-add point multiplication
algorithm).
Algorithm 2.8.2.2: Fixed-point comb point multiplication algorithm
Inputs: P – fixed-point to multiply
        $k = \sum_{i=0}^{l-1} k_i 2^i$, $k_i \in [0, 1]$ – point multiplier
        h – number of blocks in which k is divided, also
            the number of rows in the precomputation matrix
        a – width of the blocks
        v – number of sub-blocks in which each block is further
            subdivided
        b – width of the sub-blocks
Output: kP
/* Off-line precomputations */
1. for i = 0 to h − 1 do
   1.1 $P_i = 2^{ai}P$
end for
/* Compute precomputation array. */
2. for i = 1 to $2^h - 1$ do
   2.1 $G[0][i] = \sum_{j=0}^{h-1} i_j P_j$ /* $i = \sum_{j=0}^{h-1} i_j 2^j$, $i_j \in [0, 1]$ */
   2.2 for j = 1 to v − 1 do
       2.2.1 $G[j][i] = 2^{bj} G[0][i]$
   end for
end for
/* Initialization */
4. A = O
/* Point multiplication */
5. for r = b − 1 down to 0 do
   5.1 A = 2A
   5.2 for s = v − 1 down to 0 do
       5.2.1 $I_{s,r} = \sum_{t=0}^{h-1} k_{at+bs+r} 2^t$ /* precomputation table index */
       5.2.2 if $I_{s,r} \neq 0$ then
             5.2.2.1 $A = A + G[s][I_{s,r}]$
   end for
end for
6. return (A)
Assuming off-line precomputation, the fixed-point comb point multiplication
algorithm requires, on average, b − 1 point doubles and a − 1 point additions. This
algorithm also requires the storage of $v(2^h - 1)$ points.
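To make the table indexing of Algorithm 2.8.2.2 concrete, the following Python sketch builds the table G and runs the main loop. It is an illustration only: integers under addition stand in for curve points, the function names are hypothetical, and k is assumed to be already padded so that it fits in l = ah bits with a = vb.

```python
def comb_precompute(P, h, v, a, b):
    """Build G[s][i] = 2^(bs) * sum_j i_j 2^(aj) P for i in [1, 2^h) (Steps 1-2.2.1).

    Entry G[s][0] is a placeholder and is never read.
    """
    base = [(1 << (a * j)) * P for j in range(h)]        # P_j = 2^(aj) P
    G = [[0] * (1 << h) for _ in range(v)]
    for i in range(1, 1 << h):
        G[0][i] = sum(base[j] for j in range(h) if (i >> j) & 1)
        for s in range(1, v):
            G[s][i] = (1 << (b * s)) * G[0][i]
    return G


def comb_multiply(k, G, h, v, a, b):
    """Compute kP from the precomputed table (Steps 4-6)."""
    A = 0
    for r in range(b - 1, -1, -1):
        A = 2 * A                                        # one point double per column
        for s in range(v - 1, -1, -1):
            index = 0
            for t in range(h):                           # I_{s,r} = sum_t k_{at+bs+r} 2^t
                bit = (k >> (a * t + b * s + r)) & 1
                index |= bit << t
            if index != 0:
                A = A + G[s][index]
    return A
```

With h = 4, v = 2, and b = 5 (so a = 10 and l = 40), for instance, the loop performs 5 doubles and at most 10 additions while the table holds v(2^h − 1) = 30 useful entries.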
2.8.3 Summary of Point Multiplication Algorithms
Table 2.12 summarizes the complexity and the storage requirements for the point
multiplication algorithms discussed in the previous sections. The complexity is
specified in terms of point additions (including point subtractions) and point doubles.
The storage requirements are specified in terms of the number of points that need
to be stored.
The results in Table 2.12 highlight that generic point multiplication algorithms
require about one point double per bit in the binary representation of k. The results
also demonstrate that the generic point multiplication algorithms differ in how they
reduce the number of point additions.
The double-and-add, the addition-subtraction, and the Montgomery point
multiplication (for $GF(2^m)$ only) algorithms require no precomputation. The
addition-subtraction point multiplication algorithm reduces the number of point additions
over the double-and-add point multiplication algorithm by recoding the multiplier
k (point subtractions are treated as point additions). The other generic algorithms
include combinations of multiplier recoding and precomputation. The number of
precomputations in these algorithms grows exponentially with the window size, and
the number of point additions, excluding the ones required for the precomputations,
is inversely proportional to the window size. The exponential growth in the
number of precomputations limits the optimal window sizes of the different
algorithms to relatively small values (w < 6 in the examples shown later). The memory
requirements grow exponentially, as each precomputed point requires storage.
The fixed-point windowing point multiplication algorithm does not require point
doubles. The number of point additions required by this algorithm exhibits growth
similar to that of the generic point multiplication algorithms that use
precomputations. The number of point additions contains a component that grows exponentially
with the window size and a component that is inversely proportional to the
window size. As for generic algorithms, this behavior limits the optimal window
sizes to relatively small values (w < 6 in the examples shown later). The memory
requirement of the fixed-point windowing point multiplication algorithm is
inversely proportional to the window size, a feature that is unique among the point
multiplication algorithms studied here that use precomputation.
The processing time of the fixed-point comb point multiplication algorithm is
controlled by the parameters a and b, which are under user control. The number of
point additions is governed by $a \approx m/h$ and the number of point doubles is governed by
$b = (m/h)/v$, where a = vb and where $m \approx \log_2 k$. To reduce the number of point
additions and point doubles, one will choose a large value for h. To further reduce
the number of point doubles, one will choose a large value for v. The user choice is
likely to be restricted by the memory requirements.
The memory required grows exponentially with h and linearly with v ($(2^h - 1)v$
points). An advantage of the fixed-point comb point multiplication algorithm over the
fixed-point windowing point multiplication algorithm is that the user has full control
of the processing time and memory requirements, which allows him/her to make
best use of the available resources. For example, if storage can only be provided for
four points, as could be the case for memory constrained devices, one could choose
h equal to two and v equal to one. With these selections a point multiplication
using the fixed-point comb point multiplication algorithm can be computed with m/2
point additions and m/2 point doubles. Under the same conditions, the fixed-point
windowing point multiplication algorithm requires a window size of the order of
m/4. For this window size, the fixed-point windowing point multiplication algorithm
requires approximately $2^{m/4-1}$ point additions.
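The memory-constrained comparison above can be reproduced numerically with a small Python sketch that evaluates the operation counts of Table 2.12 for the two fixed-point algorithms. The function names are hypothetical, and the choice m = 160 is only an example value.

```python
def comb_cost(m, h, v):
    """Operation counts for the fixed-point comb algorithm (Table 2.12)."""
    a = -(-m // h)                   # ceil(m / h): point additions
    b = -(-a // v)                   # ceil(a / v): point doubles
    return {"doubles": b, "additions": a, "points": v * (2**h - 1)}


def fixed_window_cost(m, w):
    """Operation counts for the fixed-point windowing algorithm (Table 2.12)."""
    digits = -(-m // w)              # ceil(m / w)
    return {"doubles": 0, "additions": 2**(w - 1) + digits, "points": digits}


# Storage limited to about four points, with m = 160 as an example.
print(comb_cost(160, 2, 1))          # 80 doubles, 80 additions, 3 points
print(fixed_window_cost(160, 40))    # 2^39 + 4 additions, 4 points
```

For m = 160 the comb algorithm needs 160 point operations in total, whereas the windowing algorithm with w = m/4 = 40 needs on the order of 2^39 additions, which is why the comb algorithm is the practical choice when storage is this limited.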
Table 2.13 shows the complexity and the memory requirements for the point
multiplication algorithms discussed here for elliptic curves defined over fields GF(p)
and $GF(2^m)$, where k and p are 160-bit numbers and $m \approx 160$. This table shows
examples that use affine coordinates (A), Jacobian coordinates (J), and the projective
coordinates used in [LD99b] for curves defined over $GF(2^m)$ (LD). The results
in this table assume negligible time for the computation of squares in $GF(2^m)$ and
assume the same processing cost for multiplications and squares in GF(p). The
storage requirements for GF(p) assume the representation of numbers in nonredundant
number representation.
In Table 2.13, the results for the fixed-point comb point multiplication algorithm
approximate the processing time of the fixed-point windowing point multiplication
algorithm. It is important to highlight that for the fixed-point comb point multi-
plication algorithm, the processing time can be made arbitrarily small if provided
enough memory. Finally, the results in Table 2.13 do not include the processing cost
of coordinate conversions.
The results in Table 2.13 demonstrate that the processing time for fixed-point
point multiplication algorithms is significantly lower than for general point
multiplications. For the examples provided in this table for curves defined over fields
GF(p), there are moderate changes in processing cost for the generic point
multiplication algorithms. For the examples provided for curves defined over $GF(2^m)$,
the processing cost for the Montgomery point multiplication algorithm is lower than
for the other generic point multiplication algorithms. The results in Table 2.13 do
not use inverses when computing the precomputed values for the generic algorithms.
It may be beneficial to use inverses for curves defined over fields $GF(2^m)$, because
inverses in these fields can be computed with fewer than $2\lceil \log_2 m \rceil$ field
multiplications using Fermat's Little Theorem. The same is not true for curves defined over
fields GF(p) because inverses in these fields require over $\log_2 p$ field multiplications
when computing inverses using Fermat's Little Theorem.
Table 2.12: Complexity of point multiplication algorithms
Algorithm                       Complexity (ave.), $\log_2 k \approx m$          Storage requirements
                                # point doubles    # point additions      # points
double-and-add                  $m$                $m/2$                  $2$
w-ary                           $2^{w-1} + m$      $2^{w-1} + m/w$        $2^w$
addition-subtraction            $m$                $m/3$                  $2$
signed w-ary                    $2^{w-2} + m$      $2^{w-2} + m/w$        $2^{w-1}$
width-w addition-subtraction    $m$                $2^{w-2} + m/(w+1)$    $2^{w-2}$
Montgomery ($GF(2^m)$ only)     $m$                $m$                    1 (affine), 2 (X, Z proj.)
fixed-point windowing           $0$                $2^{w-1} + m/w$        $m/w$
fixed-point comb                $b$                $a$                    $v(2^h - 1)$
(m = ah, a = vb)
Table 2.13: Complexity of point multiplication algorithms for k = 160 ≈ m ≈ $\log_2 p$
Algorithm                       Coord. Op.             w      GF(p)                 $GF(2^m)$
                                                              # mult.    kbytes     # mult.    kbytes
double-and-add                  2J → J, A + J → J      N/A    2480       0.1        1680       0.1
w-ary                           2J → J, J + J → J      4      2448       0.94       1560       0.94
addition-subtraction            2J → J, A + J → J      N/A    2187       0.1        1387       0.1
signed w-ary                    2J → J, J + J → J      5      2320       0.94       1440       0.94
width-w addition-subtraction    2J → J, J + J → J      5      2155       0.47       1320       0.47
Montgomery ($GF(2^m)$ only)     2LD → LD &             N/A    N/A        N/A        960        0.12
                                LD + LD → LD
fixed-point windowing           2J → J, A + J → J      5      528        1.25       528        1.25
fixed-point comb                2J → J, A + J → J      N/A    540        2.34       540        1.17
(m = ah, a = vb)                                              (h = 4, v = 4)        (h = 4, v = 2)
Chapter 3
Elliptic Curve Processor Architecture
3.1 Architecture
The point multiplication algorithms discussed in the previous chapter are independent
of the elliptic curve and of the point representation. The main characteristics of these
algorithms include the recoding of the multiplier k, the precomputation of frequently
used values using point additions and doubles (if required by the algorithm), and
the iterative computation of point multiplication using point additions, subtractions
and doubles. When using projective or mixed coordinates, the point multiplication
algorithms also require coordinate conversions from affine to projective coordinates
and from projective to affine coordinates.
All the point multiplication algorithms discussed in the previous sections share
a hierarchical structure similar to that shown in Figure 3.1.
At the top of the hierarchy is the point multiplication function. This function
is responsible for orchestrating the computation of point multiplications using the
services provided by the point addition/subtraction, the point double, and the
coordinate conversion functions.
[Figure 3.1 blocks: Point Multiplication; Coordinate Conversion; Point Addition/Subtraction;
Point Double; Field Inversion; Field Multiplication; Field Addition/Subtraction]
Figure 3.1: Point multiplication hierarchy
In the computation of a point multiplication using projective or mixed coordi-
nates, the point multiplication function does the following:
1. Commands the coordinate conversion function to convert the point to be mul-
tiplied from affine to projective coordinates.
2. Orchestrates the computation of the precomputed points, if required, by com-
manding the point addition/subtraction and the point double functions to add
and double points.
3. Recodes the multiplier k.
4. Orchestrates the computation of kP by examining the digits of the recoded
k, and, based on their value, commanding the point addition/subtraction and
the point double functions to add and double points.
5. Commands the coordinate conversion function to convert the resulting point
to affine coordinates.
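The division of labor in the five steps above can be sketched as a controller that only issues commands and a lower-level unit that carries them out. The Python sketch below is a hypothetical illustration, not the processor design: the class and method names are invented, integers replace curve points, the coordinate conversions are identity placeholders, and the double-and-add algorithm is used, so steps 2 and 3 are empty.

```python
class PointUnit:
    """Hypothetical stand-in for the lower-level point functions.

    It logs every command so the orchestration can be inspected.
    """

    def __init__(self):
        self.log = []

    def to_projective(self, P):
        self.log.append("affine->projective")
        return P                         # placeholder: no real conversion

    def point_double(self, Q):
        self.log.append("double")
        return 2 * Q

    def point_add(self, Q, P):
        self.log.append("add")
        return Q + P

    def to_affine(self, Q):
        self.log.append("projective->affine")
        return Q                         # placeholder: no real conversion


def point_multiply(k, P, unit):
    """Top-level point multiplication function (double-and-add variant)."""
    P = unit.to_projective(P)            # step 1
    # step 2: double-and-add needs no precomputed points
    bits = bin(k)[2:]                    # step 3: no recoding, MSB first
    Q = 0                                # stand-in for the point at infinity
    for bit in bits:                     # step 4
        Q = unit.point_double(Q)
        if bit == "1":
            Q = unit.point_add(Q, P)
    return unit.to_affine(Q)             # step 5
```

For k = 5 the logged command sequence is one conversion, then double, add, double, double, add, and a final conversion; the top-level function never touches field arithmetic itself.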
The coordinate conversion function converts points from affine to projective or
from projective to affine coordinates using algorithms that require the services pro-
vided by the field addition/subtraction, the field multiplication, and the field inver-
sion functions.
The point addition/subtraction and the point double functions compute point
additions/subtractions and point doubles using algorithms that require the services
provided by the field addition/subtraction and the field multiplication functions.
The field inversion function computes inverses using an algorithm that requires
the services provided by the field multiplication function.
From the previous discussion it is evident that the point multiplication, the co-
ordinate conversion, the point addition/subtraction, the point double, and the finite
field inversion functions are mainly control functions. The field addition/subtraction
and field multiplication are the basic arithmetic functions. This separation of con-
trol and arithmetic functions is reflected in the elliptic curve processor architectures
introduced here.
Figure 3.2 shows a block diagram of the elliptic curve processor architectures
introduced in this work. In this figure, the solid lines represent busses that carry
data and the dotted lines represent signals that carry control information.
[Figure 3.2 blocks: Main Controller (Point Multiplication, System I/O; inputs k and P, output kP;
data, command, control, and status lines to/from Host); Arithmetic Unit Controller (Point
Addition/Subtraction, Point Double, Coordinate Conv., Field Inversion; command, control, and
status lines); Arithmetic Unit (Field Addition/Subtraction, Field Multiplication, Comparison;
control and status lines)]
Figure 3.2: Elliptic curve processor architecture
The elliptic curve processor is composed of an arithmetic unit and two
programmable processors: the main controller and the arithmetic unit controller.
The main controller (MC) realizes the point multiplication function and, in
addition, is responsible for system input/output (system I/O).
The arithmetic unit controller (AUC) realizes the point addition/subtraction,
the point double, the coordinate conversion, and the field inversion functions. The
AUC also controls the multiplier, the squarer ($GF(2^m)$ only, optional), the adder,
and the comparator circuits embedded in the arithmetic unit.
The arithmetic unit (AU) is the computational engine of the elliptic curve
processor. This work considers arithmetic units for GF(p) and for $GF(2^m)$ finite field
arithmetic.
The arithmetic unit for GF(p) field arithmetic incorporates a multiplier, one
or more adders (depending on the number representation), a comparator, and a
register file. All these components work under the control of the AUC.
The arithmetic unit for $GF(2^m)$ field arithmetic incorporates a multiplier, a
squarer (optional), an adder (for some types of multipliers), a comparator, and a
register file. All these components work under the control of the AUC.
The multiplier, the squarer, and the adder circuits are used to perform the
arithmetic operations required to compute point multiplications.
The comparator is used to perform the comparisons required in the point
multiplication process. For example, before performing a point addition it is often
necessary to determine that the two points to be added are different from one another
and that they are not the additive inverse of each other ($P \neq Q$ and $P \neq -Q$).
Determining these conditions requires the comparison of the coordinates of the points
P and Q, or the comparison of the difference of their coordinates against zero.
The register file is the set of registers used to store precomputed points, elliptic
curve parameters, and temporary values.
The following sections describe the architecture of the MC and the AUC, whose
architectures are similar for processors that perform point multiplication for curves
defined over fields $GF(2^m)$ and for processors that compute point multiplication for
curves defined over fields GF(p).
The discussion of the MC and AUC controllers is followed by sections that
describe the AU for processors that perform point multiplication for curves defined
over fields $GF(2^m)$ and for processors that compute point multiplication for curves
defined over fields GF(p).
3.2 Main Controller (MC)
The MC is a reduced instruction set processor. This section describes a possible
architecture for this processor. The details of the implementation are not presented
here. Details on the implementation of processors can be found in [Tan84].
The MC is responsible for orchestrating the point multiplication process. In the
computation of a point multiplication using mixed or projective coordinates, the
MC is responsible for the following: commanding the AUC to convert the point to
be multiplied from affine to projective coordinates; commanding the AUC to per-
form the needed precomputations; recoding the multiplier k (if necessary); guiding
the point multiplication process by commanding the AUC to do point additions,
subtractions, and doubles; and commanding the AUC to convert the resulting point
from projective to affine coordinates. The MC is also responsible for synchronizing
I/O operations with the host processor.
To take advantage of the features provided by the different algorithms while
maintaining a fixed logic footprint and timing behavior, this work recommends the
implementation of the MC using a programmable processor.
The degree of sophistication of the MC processor is a function of the algorithms
to be supported. The system I/O and the ability to dispatch commands to the
AUC is similar for all the point multiplication algorithms. The details of the point
multiplication process differ from one algorithm to another.
The point multiplication process of each of the point multiplication algorithms
includes one or more loops in which point additions, subtractions and doubles are
commanded based on the processed bits of the multiplier k. Processing the bits of k
could involve logical operations, such as testing bits; arithmetic operations; shifting
operations; and the storage of processed values. In addition, the processed values
could be used to indirectly address precomputed values.
Multiple MC architectures could be devised for an elliptic curve processor. This
section presents a reduced instruction set processor architecture. This architecture is
generic in the sense that it does not necessarily perform functions specific to elliptic
curve point multiplication. These functions are realized with software running in
the processor.
The basic instruction set supported by the processor is listed in Tables 3.1 and
3.2. The symbols and variables used in these tables are defined in Table 3.3. The
instruction formats are summarized in Figure 3.3. To facilitate decoding, the
instructions are divided into three fields. Unused fields are left blank in the instruction
formats.
The instruction set includes logical (and, not), arithmetic (add), shift, con-
trol (branch and subroutine instructions), and data move instructions (load, store,
move). The addition instructions along with the status flags support two’s comple-
ment arithmetic. The instruction set also supports a limited number of direct and
indirect addressing modes. The logical, shift, and arithmetic instructions operate on
register data. The basic instruction set specified here is purposely limited to support
the functions that need to be performed by the MC processor. This instruction set
can be extended to support more complex functions or can be reduced further to
reduce the complexity and to increase the throughput of the processor.
The basic architecture presented here does not specify a specific number of data
registers nor does it specify the range of the data and address fields of the in-
structions, which allow sizing the processor architectures according to the system
requirements.
For simplicity of use, all the instructions execute in one processor cycle, where the
period of a processor cycle is a function of the complexity and performance expected
from the processor.
Figure 3.4 shows a block diagram of the MC processor. This figure shows the data
paths represented with solid lines and the most important control paths represented
with dotted lines. The functions performed by the components of the processor are
summarized in Table 3.4.
The following is a possible sample implementation of the MC. The instruction
operational codes, referred to as opcodes, require at least five bits. If the controller
incorporates eight registers, it requires at least three bits to address all the
registers. Finally, assuming that the instructions use eight bits for address and data
fields, the instructions will be 16 bits wide. The sample processor will be capable of
running programs containing a maximum of 256 instructions and can also indirectly
access 256 bytes of data memory. The registers will be at most eight bits wide and
the ALU/Shifter will be an eight-bit unit. The PC stack memory can be sized so
that a reasonable depth of subroutine nesting can be supported; for example, the
PC stack memory could provide storage for 16 return addresses.
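The field sizing of the sample processor can be checked with a short Python sketch that packs an opcode, a register index, and an address/data field into one instruction word. The field order and the function names are assumptions made for illustration; the text fixes only the field widths (5, 3, and 8 bits).

```python
OPCODE_BITS = 5      # at least five bits for the opcodes
REG_BITS = 3         # eight registers
OPERAND_BITS = 8     # address and data fields
WORD_BITS = OPCODE_BITS + REG_BITS + OPERAND_BITS    # 16-bit instructions


def encode(opcode, reg, operand):
    """Pack the three fields into one word (hypothetical field order)."""
    for value, bits in ((opcode, OPCODE_BITS), (reg, REG_BITS), (operand, OPERAND_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value out of range")
    return (opcode << (REG_BITS + OPERAND_BITS)) | (reg << OPERAND_BITS) | operand


def decode(word):
    """Split an instruction word back into (opcode, reg, operand)."""
    operand = word & ((1 << OPERAND_BITS) - 1)
    reg = (word >> OPERAND_BITS) & ((1 << REG_BITS) - 1)
    opcode = word >> (REG_BITS + OPERAND_BITS)
    return opcode, reg, operand
```

An eight-bit address field addresses 2^8 = 256 locations, which matches the stated limits of 256 instructions and 256 bytes of indirectly accessed data memory.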
The sample processor described above is expected to execute one instruction per
processor cycle. The processor cycle could consist of multiple system clock cycles
when the processor is pipelined to maximize system throughput. Note that to
maximize system throughput, the AUC and the AU must operate at the maximum
clock rate possible. The rate of computation of each of these devices is higher than
the rate of computation of the MC. As an example, consider the case of a point
multiplication using the double-and-add point multiplication algorithm. In this
example, the MC must inspect one bit of the multiplier k and must then command
the AUC to do a point double and possibly a point addition. The MC operation
will take a few processor cycles, while the AUC and the AU must compute at least
16 modular multiplications, each requiring a relatively large number of clock cycles,
when using Jacobian coordinates with elliptic curves defined over fields GF(p).
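The division of labor in this example can be sketched as follows. The Python code models the MC side of a left-to-right double-and-add point multiplication: it inspects one bit of k per iteration and issues a double command followed, when the bit is set, by an add command. The group operations are passed in as functions, and the usage lines substitute integer addition modulo a small number for the elliptic curve group purely as a stand-in. The names point_multiply, double, and add are hypothetical.

```python
def point_multiply(k, P, double, add, identity):
    """Left-to-right double-and-add; returns (k*P, list of issued commands)."""
    Q = identity
    commands = []
    for i in reversed(range(k.bit_length())):
        Q = double(Q)               # one point double per bit of k
        commands.append("double")
        if (k >> i) & 1:            # the MC inspects bit i of k
            Q = add(Q, P)           # point addition only when the bit is set
            commands.append("add")
    return Q, commands

# Stand-in group: integers modulo n under addition (not an elliptic curve).
n = 1009
Q, cmds = point_multiply(
    77, 5,
    double=lambda x: (2 * x) % n,
    add=lambda x, y: (x + y) % n,
    identity=0,
)
```

Each entry in the command list corresponds to a long sequence of field operations in the AUC and the AU, whereas producing the entry costs the MC only a shift, a test, and a branch.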
Table 3.1: MC instruction set – execution control instructions
Instruction               Syntax     Operation                     Affected flags  Format
branch direct             bd addr    (PC) = addr                   none            2
branch zero direct        bdz addr   if (Z) = 1 then (PC) = addr   none            2
                                     else (PC) = (PC) + 1
branch carry direct       bdc addr   if (C) = 1 then (PC) = addr   none            2
                                     else (PC) = (PC) + 1
branch overflow direct    bdv addr   if (V) = 1 then (PC) = addr   none            2
                                     else (PC) = (PC) + 1
branch positive direct    bdp addr   if (P) = 1 then (PC) = addr   none            2
                                     else (PC) = (PC) + 1
branch indirect           bi reg     (PC) = (reg)                  none            3
branch zero indirect      biz reg    if (Z) = 1 then (PC) = (reg)  none            3
                                     else (PC) = (PC) + 1
branch carry indirect     bic reg    if (C) = 1 then (PC) = (reg)  none            3
                                     else (PC) = (PC) + 1
branch overflow indirect  biv reg    if (V) = 1 then (PC) = (reg)  none            3
                                     else (PC) = (PC) + 1
branch positive indirect  bip reg    if (P) = 1 then (PC) = (reg)  none            3
                                     else (PC) = (PC) + 1
jump subroutine direct    jsrd addr  ((PCSP)) = (PC)+1             none            2
                                     (PC) = addr
                                     (PCSP) = (PCSP)+1
jump subroutine indirect  jsri reg   ((PCSP)) = (PC)+1             none            3
                                     (PC) = (reg)
                                     (PCSP) = (PCSP)+1
return                    ret        (PCSP) = (PCSP)-1             none            1
                                     (PC) = ((PCSP))
Table 3.2: MC instruction set – data manipulation and arithmetic instructions
Instruction               Syntax           Operation                     Affected flags  Format
load reg immediate        ld regd, data    (regd) = data                 none            4
                                           (PC) = (PC) + 1
load reg indirect         ldi regd, regs   (regd) = ((regs))             none            5
                                           (PC) = (PC) + 1
store register indirect   sti regd, regs   ((regd)) = (regs)             none            5
                                           (PC) = (PC) + 1
move                      mv regd, regs    (regd) = (regs)               none            5
                                           (PC) = (PC) + 1
add                       add regd, regs   (regd) = (regd)+(regs)        C,V,P,Z         5
                                           (PC) = (PC) + 1
add with carry            addc regd, regs  (regd) = (regd)+(regs)+(C)    C,V,P,Z         5
                                           (PC) = (PC) + 1
shift right arithmetic    sr regd, regs    (regd) = [(C)//(regs)]>>1     C,P,Z           5
                                           (C) = (regs) && 1
                                           (PC) = (PC) + 1
bitwise and               and regd, regs   (regd) = (regd) && (regs)     P,Z             5
                                           (PC) = (PC) + 1
bitwise not               not regd, regs   (regd) = !(regs)              P,Z             5
                                           (PC) = (PC) + 1

Format  Graphical representation of MC instructions
1       opcode
2       opcode  addr
3       opcode  reg
4       opcode  reg   data
5       opcode  reg   reg
Figure 3.3: MC instruction format
Table 3.3: MC instruction set symbols
Symbol  Description
PC      Program counter register. Points to the instruction to be executed.
PCSP    Program counter stack pointer register. Points to the location in the PC
        stack memory where the next return address is to be stored on the next
        subroutine call.
C       Indicates if the result of the last operation that sets flags generated a carry.
P       Indicates if the result of the last operation that sets flags was positive.
V       Indicates if the result of the last operation that sets flags generated an
        overflow.
Z       Indicates if the result of the last operation that sets flags was zero.
( )     Represents the content of; for example, (PC) represents the content of the PC
        register.
(( ))   Represents the content of the content of; for example, ((PCSP)) represents
        the content of the memory address whose value is contained in the PCSP
        register (if (PCSP) is 100 then ((PCSP)) represents the value stored
        in memory location 100).
[ ]     Parentheses.
&&      Represents bitwise and.
!       Represents bitwise not.
>>      Represents right shift of one bit.
//      Represents concatenation of values.
regd    Represents the destination register, which is the register that will get the
        result of an operation or the register that contains the address of the
        memory location where the result is to be stored.
regs    Represents the source register, which is one of the registers providing a
        value or the register containing the address of the memory location that
        will be providing a value.
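As an illustration of this notation, the sr instruction concatenates the carry flag with the source register, shifts the result right by one bit, and captures the shifted-out bit in C. The Python sketch models this for an eight-bit register, as in the sample MC implementation, and chains two sr operations to shift a two-byte value, which is one way the MC could scan the bits of k. The function names and the tuple return values are assumptions made for illustration.

```python
WIDTH = 8  # register width of the sample MC

def sr(reg_s, carry_in):
    """Model of sr: (regd) = [(C)//(regs)] >> 1 and (C) = (regs) && 1."""
    assert 0 <= reg_s < (1 << WIDTH) and carry_in in (0, 1)
    result = ((carry_in << WIDTH) | reg_s) >> 1   # carry enters at the top bit
    carry_out = reg_s & 1                         # bit shifted out of the register
    return result, carry_out

def shift_right_16(high, low):
    """Shift a 16-bit value held in two 8-bit registers right by one bit."""
    high, c = sr(high, 0)      # a cleared carry shifts a zero into the top
    low, c = sr(low, c)        # the bit leaving `high` enters `low`
    return high, low, c

high, low, lsb = shift_right_16(0x01, 0x03)   # 0x0103 >> 1
```

After the second sr, the carry holds the least significant bit of the original value, so a bdc instruction could branch on it directly.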
[Figure 3.4 block diagram. Blocks: Controller, PC Stack Memory, Program Memory
(DPRAM), Data Memory (DPRAM), Host Interface, ALU/Shifter (add, and, not, shift,
bypass), Data Reg 0 to Data Reg n, Status Reg (C,V,P,Z), AUC Input Reg (from AUC),
and AUC Output Reg (to AUC).]
Figure 3.4: Main controller
Table 3.4: MC components
Component            Description
Controller           Interprets instructions and based on its interpretation controls the
                     components of the processor.
PC Stack Memory      Stores return addresses while the processor is executing subroutine
                     instructions.
Program Memory       Stores program. DPRAM configuration allows the host processor to
                     load different programs into the processor without having to
                     reconfigure the elliptic curve processor. If reprogrammability is not
                     necessary, the memory can be configured as a ROM.
Data Memory          Stores program data. Also used by the host processor to pass
                     commands and the multiplier k to the MC.
ALU/Shifter          Performs logical, arithmetic, and shift functions. Supports two’s
                     complement arithmetic.
Data Register        Holds operands and addresses during command execution.
                     The content of these registers can be used in logical, arithmetic, and
                     shift operations. The registers can be loaded with results from the
                     aforementioned operations, with the content of memory locations, or
                     with data included in a load instruction. The content of a register
                     can be used to indirectly access memory locations.
Status Register      Holds status flags: C,V,P,Z. These flags are updated by the
                     ALU/Shifter. This register also shares the features of data registers,
                     which allows flags to be stored, manipulated, and restored. These
                     features facilitate data manipulation in subroutines.
AUC Input Register   Relays the status of the AUC.
                     This register also shares the features of data registers with the
                     exception that it can only be loaded with status information from
                     the AUC.
AUC Output Register  Register used to relay commands and data to the AUC.
                     This register also shares the features of data registers with the
                     exception that its output is only forwarded to the AUC.
3.3 Arithmetic Unit Controller (AUC)
The AUC is a reduced instruction set processor. This section describes a possible
architecture for this processor. The details of the implementation are not presented
here. Details on the implementation of processors can be found in [Tan84].
The AUC is responsible for processing MC commands. The AUC processes
MC commands by guiding the AU through the computation of the necessary field
arithmetic operations.
In the computation of point multiplications, the AUC is responsible for guiding
the AU in the computation of point additions, point subtractions, point doubles,
coordinate conversions, and field inversions. The computation of these operations
includes field additions, field subtractions, field squares, field multiplications, and
comparisons of finite field elements. The computation of the aforementioned oper-
ations also requires the storage and use of system parameters, input data, precom-
puted values, and temporary results.
The AU by itself does not compute any of the aforementioned operations. The
AU includes hardware that, under the control of the AUC, is capable of computing
field additions, field subtractions, field multiplications, field squares, and
comparisons of finite field elements. The AUC also manages the storage and the use of
data stored in the AU’s register file. For example, in the computation of a field
multiplication, the AUC sets and clears control signals that force the AU
multiplier hardware to compute field multiplications (note that this function would
typically be implemented with a state machine running in the AU).
The AUC’s tight control of the AU hardware allows the elliptic curve processor
to maximize the throughput of the AU by efficiently scheduling operations and by
concurrently using the processing elements in the AU. For some configurations it also
allows the AUC to use specific features of the AU hardware that would otherwise
be hard to exploit. For example, some multipliers compute multiplications by
accumulating partial results. The AUC can exploit this feature by using the multiplier
as an adder, thus saving the elliptic curve processor from having to incorporate an
adder.
To maximize the throughput of the elliptic curve processor, the AU hardware
must operate at the maximum clock rate possible. To control the AU hardware, the
clock rate of the AUC must match that of the AU. This last requirement suggests
the use of a very simple processor capable of executing at high clock rates.
Multiple AUC architectures could be devised for an elliptic curve processor. This
section presents a reduced instruction set processor architecture. This processor
architecture is generic in the sense that it does not necessarily perform functions
specific to elliptic curve point multiplication nor does it specify an architecture for
the AU. The control sequences that the AUC dispatches to the AU are programmed
into the processor. The AU control signals are wired to one or more output registers
of the AUC. The AU and the MC status signals are wired to the AUC status register.
The basic instruction set supported by the AUC processor is listed in Table 3.5.
The symbols and variables used in this table are defined in Table 3.6. The instruction
formats are summarized in Figure 3.5. To facilitate decoding, the instructions are
divided in three fields. Unused fields are left blank in the instruction formats.
The instruction set includes control instructions (branch and subroutine
instructions) and data move instructions (load, move). The conditional branch
instructions include a mask field. The mask field is OR’ed with the status flags, and, if
the result is nonzero, the branch is taken. This configuration allows the AUC to
process multiple flags concurrently.
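A short sketch of the conditional branch may help. Taken literally, a bitwise OR of the flags with a nonzero mask is always nonzero, so the Python code assumes the intended behavior is that the mask selects the flags of interest and the branch is taken if any selected flag is set (the selected bits are, in effect, OR’ed together). This reading is an assumption, and the function name and the flag assignments are hypothetical.

```python
def bdm_next_pc(pc, flags, mask, addr):
    """Next PC for `bdm mask, addr` under the assumed select-then-OR reading."""
    selected = flags & mask          # keep only the flags named by the mask
    return addr if selected != 0 else pc + 1

# Hypothetical flag assignment for a four-bit status register.
MULT_DONE, ZERO, MC_CMD, MC_ABORT = 0b0001, 0b0010, 0b0100, 0b1000

# One instruction tests two flags at once.
taken = bdm_next_pc(pc=10, flags=ZERO, mask=ZERO | MC_ABORT, addr=40)
not_taken = bdm_next_pc(pc=10, flags=MULT_DONE, mask=ZERO | MC_ABORT, addr=40)
```

Testing several flags with a single instruction avoids a chain of single-flag branches, which matters because every AUC instruction occupies one AU clock cycle.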
The basic instruction set specified for the AUC is purposely limited to support
the functions to be performed by the AUC, which are mainly control functions.
This instruction set can be extended to support more complex functions or can be
reduced further to reduce the complexity and to increase processor throughput.
The basic architecture presented here does not specify a specific number of data
registers, nor does it specify the range of the data and address fields of the
instructions, which allows the processor architecture to be sized according to the
system requirements.
To maximize the elliptic curve processor throughput, each instruction of the
AUC executes in one clock cycle. Note that in the design specified here, both the
AUC and the AU must use the same clock source.
Figure 3.6 shows a block diagram of the AUC processor. This figure shows
the data paths represented with solid lines and the most important control paths
represented with dotted lines. The functions performed by the components of the
processor are summarized in Table 3.7.
The AUC processor architecture presented here is to a degree a subset of the
MC processor architecture. Each processor contains a controller that interprets
instructions and dispatches control sequences to the components of the processor.
Each processor receives inputs via registers and outputs data and control via
registers. The main difference between the two processor architectures is the instruction
sets they support. The AUC supports a very reduced instruction set that lacks
arithmetic instructions, logical instructions, and instructions that deal with data
memory. The AUC also incorporates branch instructions that allow it to concurrently
process multiple status bits. The status bits presented by the AU and the MC are
implementation dependent. The AUC samples the relevant status bits by applying a
mask to the content of the status register. If any of these flags is set, the processor
proceeds to execute instructions from the specified branch address.
The following is one possible implementation of the AUC. The instruction
operational codes require at least four bits, because the instruction set in Table 3.5
contains ten instructions. If the controller incorporates 16 registers,
it requires at least four bits to address all the registers. These same bits can be used
to support four status flags. Finally, assuming that the instructions use a 24-bit
data field, the instructions will be 32 bits wide. Note that the registers are capable
of concurrently controlling 24 AU control signals. These signals could control,
among others, the AU’s multiplier, adder, and register file.
The sample processor will be capable of dispatching control sequences that
manipulate up to 24 control signals per clock cycle. The instruction format provides
the capacity to address programs of up to 2^24 instructions, but in reality the number
of instructions that need to be handled by the processor will be much smaller. For
example, if the AUC needs to support programs of at most 2048 instructions, the least
significant 11 bits of the address field will drive the address signals of the program
memory. In this example, the PC and the input of the PC stack memory will also
be 11 bits wide. The PC stack memory can be sized so that a reasonable depth
of subroutine nesting can be supported; for example, the PC stack memory could
provide storage for 16 return addresses.
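The relationship between the 24-bit address field and the 11-bit program address can be sketched as follows. The Python code assumes a 32-bit word laid out as a four-bit opcode, a four-bit register or mask field, and a 24-bit data or address field, in that order from the most significant bit; the layout order and the opcode value are assumptions made for illustration.

```python
OPCODE_BITS, REG_BITS, DATA_BITS = 4, 4, 24   # widths of the sample AUC
PROGRAM_ADDR_BITS = 11                        # 2048-instruction program memory

def pack(opcode, reg_or_mask, data):
    """Pack one AUC instruction into a 32-bit word (field order is assumed)."""
    assert 0 <= opcode < (1 << OPCODE_BITS)
    assert 0 <= reg_or_mask < (1 << REG_BITS)
    assert 0 <= data < (1 << DATA_BITS)
    return (opcode << (REG_BITS + DATA_BITS)) | (reg_or_mask << DATA_BITS) | data

def program_address(word):
    """Only the least significant 11 bits of the address field reach memory."""
    return word & ((1 << PROGRAM_ADDR_BITS) - 1)

BDM = 0x2                               # hypothetical opcode for "bdm mask, addr"
word = pack(BDM, 0b0101, 0x0007FF)      # branch target: last program location
target = program_address(word)
```

The upper 13 bits of the address field are simply unused by branch instructions in this configuration, while load instructions still use all 24 bits to drive AU control signals.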
Table 3.5: AUC instruction set
Instruction                  Syntax          Operation                 Format
no operation                 nop             (PC) = (PC) + 1           1
branch direct                bd addr         (PC) = addr               2
branch conditional direct    bdm mask, addr  if (flags || mask) != 0   6
                                             then (PC) = addr
                                             else (PC) = (PC) + 1
branch indirect              bi reg          (PC) = (reg)              3
branch conditional indirect  bim mask, reg   if (flags || mask) != 0   7
                                             then (PC) = (reg)
                                             else (PC) = (PC) + 1
jump subroutine direct       jsrd addr       ((PCSP)) = (PC)+1         2
                                             (PC) = addr
                                             (PCSP) = (PCSP)+1
jump subroutine indirect     jsri reg        ((PCSP)) = (PC)+1         3
                                             (PC) = (reg)
                                             (PCSP) = (PCSP)+1
return                       ret             (PCSP) = (PCSP)-1         1
                                             (PC) = ((PCSP))
load reg immediate           ld regd, data   (regd) = data             4
                                             (PC) = (PC) + 1
move                         mv regd, regs   (regd) = (regs)           5
                                             (PC) = (PC) + 1

Format  Graphical representation of AUC instructions
1       opcode
2       opcode        addr
3       opcode  reg
4       opcode  reg   data
5       opcode  reg   reg
6       opcode  mask  addr
7       opcode  mask  reg
Figure 3.5: AUC instruction format
Table 3.6: AUC instruction set symbols
Symbol  Description
PC      Program counter register. Points to the instruction to be executed.
PCSP    Program counter stack pointer register. Points to the location in the PC
        stack memory where the next return address is to be stored on the next
        subroutine call.
mask    Binary value to be OR’ed with the content of the status register. The
        result of the OR operation is used to determine if a branch is to be taken.
flags   Status flags implemented in the status register.
( )     Represents the content of; for example, (PC) represents the content of the
        PC register.
(( ))   Represents the content of the content of; for example, ((PCSP)) represents
        the content of the memory address whose value is contained in the
        PCSP register (if (PCSP) is 100 then ((PCSP)) represents the value
        stored in memory location 100).
!=      Not equal.
||      Bitwise OR.
regd    Represents the destination register (or the register that will get
        the data in a load or move instruction).
regs    Represents the source register.

Figure 3.6: Arithmetic unit controller
Table 3.7: AUC components
Component           Description
Controller          Interprets instructions and based on its interpretation controls the
                    components of the processor.
PC Stack Memory     Stores return addresses while the processor is executing subroutine
                    instructions.
Program Memory      Stores program. DPRAM configuration allows the host processor to
                    load different programs into the processor without having to
                    reconfigure the elliptic curve processor. If reprogrammability is not
                    necessary, the memory can be configured as a ROM.
Data Register       Holds operands and addresses during command execution. These
                    registers can store frequently used values, subroutine parameters,
                    data frequently written to output registers, etc. The content of
                    the data registers can also be used to store branch addresses.
Status Register     Holds MC and AU status flags; for example, the status resulting from
                    the comparison of two field elements in the AU.
MC Input Register   Relays commands and data from the MC to the AUC. In the model
                    presented here, the MC identifies a command by the address of the
                    routine that performs the functions specified in the command. Upon
                    receiving a command, the AUC jumps to the routine whose address
                    is specified in the MC input register.
MC Output Register  Relays status to the MC.
AU Output Register  Registers used to relay control sequences to the AU. Each register
                    has the capacity to set/clear a set of AU control signals; therefore,
                    functions that must be handled concurrently must be allocated to the
                    same register. The control sequences are latched in the registers.
3.4 Arithmetic Unit (AU)
The AU is the main processing engine of an elliptic curve processor. The AU perfor-
mance dictates the performance of the elliptic curve processor. For high performance
elliptic curve processors, the AU complexity also dictates the complexity of the el-
liptic curve processor, because for these implementations the aggregate complexity
of the MC and the AUC is low in comparison with the complexity of the AU.
The AU is responsible for performing field additions, subtractions, multiplica-
tions, squares, and comparisons of finite field elements. The AU is also responsible
for storing elliptic curve parameters, precomputed values, and temporary values.
Figure 3.7 shows a functional diagram of an AU.
[Figure 3.7 block diagram. Blocks: Reg. File, Multiplier, Squarer, Adder, Subtractor,
Zero Test, mux1, and mux2, with input Din and output Dout.]
Figure 3.7: Functional block diagram of the arithmetic unit
The adder, subtractor, multiplier, and squarer functional blocks represent the
arithmetic functions performed by the AU.
The zero test function is used in the comparison of finite field elements. In the
AU architectures presented here, the comparison of two field elements involves the
following steps. First, if necessary, the elements are reduced so that their values fall
in a common nonredundant range; that is, only one value represents a residue class.
Then, one element is subtracted from the other and the result is compared against
zero.
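The comparison steps can be sketched for a GF(p) arithmetic unit as follows. The Python code assumes field elements are held as integers that may lie outside the range [0, p), reduces both to that range, subtracts one from the other modulo p, and applies the zero test. The function name and the small prime are hypothetical.

```python
def field_elements_equal(a, b, p):
    """Compare two GF(p) elements that may be in a redundant representation."""
    a_red = a % p                 # step 1: reduce to the nonredundant range [0, p)
    b_red = b % p
    diff = (a_red - b_red) % p    # step 2: field subtraction
    return diff == 0              # step 3: zero test on the difference

p = 97
same = field_elements_equal(5, 5 + 3 * p, p)   # 5 + 3p is a redundant form of 5
different = field_elements_equal(5, 6, p)
```

Without the reduction step, the raw difference of the first pair would be a nonzero multiple of p and a plain zero test would wrongly report the elements as different.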
The register file represents the registers used to store elliptic curve parameters,
precomputed values, and temporary values. The host processor loads elliptic curve
parameters to the register file and downloads point multiplication results from it via
the Di input and the Do output. The arithmetic functions write the results of their
operations into the register file and read their operands from it.
Not all the blocks shown in Figure 3.7 need to be realized with independent
hardware. The AU for GF(p) presented here realizes the adder and the subtractor
functions with a single two’s complement adder and realizes the squarer and the
multiplier functions with a multiplier.
This work discusses multiple configurations for the AU for GF(2^m). These
configurations realize the adder and subtractor functions with a single adder circuit, since
in GF(2^m) addition and subtraction represent the same arithmetic operation. Depending
on the multiplier used by a particular configuration, the adder, the subtractor,
the squarer, and the multiplier functions can be realized with a multiplier. Some of
the configurations realize the multiplier and squarer functions with a multiplier while
others use a multiplier and a squarer, thus taking advantage of the low complexity
of squaring in GF(2^m) for the fields and the irreducible polynomials recommended
for elliptic curve cryptography.
The AU works under the control of the AUC. The AUC control extends to all
the components of the AU. For example, for the computation of a multiplication
the AUC will need to generate the following command sequences. First, the AUC
must command the register file to output one of the multiplication operands while
at the same time it must command the multiplier to latch the operand. The same
process is repeated to load the second operand. Then, the AUC generates the
control sequences that guide the multiplier through its iterative process. After a
fixed number of iterations, the multiplier hardware computes a product. At this
point, the AUC configures mux1 and mux2 so that the multiplier output can be
transported to the register file, and, at the same time, commands the storage of the
result in the register file. The AUC enforces control over the AU by driving control
signals attached to the different components of the AU.
The following chapters describe arithmetic unit architectures for GF(p) and
GF(2^m) field arithmetic.
Chapter 4
GF(2^m) Arithmetic Unit
4.1 Introduction
This chapter specifies AU architectures for GF(2^m) arithmetic. These architectures
follow the general model introduced in Section 3.4. The different architectures
specified here differ in their allocation of the functions shown in Figure
3.7 and in the complexity and the performance of their components.
This chapter starts with descriptions of adder, multiplier, squarer, zero test, and
register file architectures, which are the main components used in arithmetic unit
architectures. The chapter ends with the specification of multiple arithmetic unit
architectures for which complexity and performance numbers are given.
This chapter describes one adder architecture, one register file architecture, two
zero test architectures, six multiplier architectures, and two squaring architectures.
4.1.1 GF(2^m) Multiplier Architectures
The most critical component of an arithmetic unit is its multiplier. The research of
GF(2^m) multiplier architectures for standard basis has traditionally concentrated on
parallel multipliers (see [LR71, YRT84] for early references) and bit-serial multipliers
(see [STP86, YRT84] for early references). More recently, the following multiplier
types were introduced: hybrid multipliers [Mas91, PR97], digit-serial multipliers
[SP96], and super-serial multipliers [OP99]. The time-area characteristics of these
multipliers are summarized in Table 4.1.
The two super-serial multiplier architectures discussed in this dissertation were
developed by the author as part of the research work documented here.
Table 4.1: Time-area characteristics of GF(2^m) multipliers
Type          Area complexity  Time complexity  Notes
Parallel      O(m^2)           O(1)
Digit-serial  O(mD)            O(m/D)           m > D > 1, where D is the digit size.
Hybrid        O(mp)            O(m/p)           Only for composite fields GF((2^p)^q), where m = pq.
Bit-serial    O(m)             O(m)
Super-serial  O(D)             O(m^2/D)         m > D > 0, where D is the digit size.
Area estimates ignore storage requirements.
Of the multipliers shown in Table 4.1, hybrid multipliers are the only ones whose
use is restricted to composite fields. The use of composite fields is discouraged for
elliptic curve cryptosystems. The framework for an attack on elliptic curve
cryptosystems that use composite fields of the form GF((2^p)^q) is described in [GHS00].
The large fields used for cryptographic applications, for which m ranges from
160 to over 1024, make the use of parallel multiplier architectures impractical for a
large number of applications. This work concentrates on bit-serial, digit-serial, and
super-serial multiplier architectures.
The following sections describe bit-serial, digit-serial, and super-serial multipli-
ers. Two basic versions of each of these multipliers are described. These correspond
to the most significant bit/digit first (MSB/MSD) and the least significant bit/digit
first (LSB/LSD) architectures. Most significant bit/digit architectures compute a
product by multiplying the multiplicand operand by the bits/digits of the multi-
plier operand, starting with the most significant bit/digit of the multiplier operand
and ending with the least significant one. Least significant bit/digit architectures
compute a product by multiplying the multiplicand operand by the bits/digits of
the multiplier operand, starting with the least significant bit/digit of the multiplier
operand and ending with the most significant one.
The description of each multiplier architecture in the following sections includes
the following: description of the multiplication algorithm, description of the hard-
ware architecture, and a summary of the estimated complexity and performance for
implementations using logic gates, generic gates, and FPGA logic. In this docu-
ment, generic gates refer to arbitrary two input gates. The complexity and timing
models used in this work are described in Appendix A.
The emphasis of this work is on the development of elliptic curve processor ar-
chitectures for programmable logic. Programmable logic possesses the distinctive
quality of allowing the instantiation of different circuits in the same hardware. This
quality is exploited here to reduce the logic complexity and to increase the per-
formance of arithmetic circuits. The following sections explore the complexity and
the performance of multipliers that support, for a given field GF(2^m), arbitrary,
programmable, and fixed irreducible polynomials. For digit-serial architectures, the
programmable polynomials are restricted to optimal primitive polynomials accord-
ing to the definition given in [SP96]. A definition for optimal polynomials is given
here in Section 4.5.
Arbitrary polynomial implementations support arbitrary irreducible
polynomials. The least significant coefficients of the irreducible polynomials are programmed
into the multiplier. For example, when using the irreducible polynomial
$F(x) = x^m + \sum_{i=0}^{m-1} f_i x^i$, the $f_i$ coefficients of the polynomial are
programmed into the multiplier. Note that all the coefficients of the polynomial must
be programmed.
Programmable polynomials provide support for a subset of all the possible
polynomials for a given m. These types of multipliers are geared towards supporting
the irreducible polynomials specified in standards such as [IEE98, ANS98, ANS99,
FIP00], which for some finite fields GF(2^m) recommend different irreducible
polynomials. To save logic, these implementations allow programmability of a subset of
the coefficients of the irreducible polynomial. Some of the coefficients are implicitly
set to zero. For example, a multiplier can support programmability of the least
significant t coefficients of the irreducible polynomial, where t < m. In this case the
coefficients $f_i$ for t ≤ i < m are set to zero. With these restrictions a multiplier can
support irreducible polynomials of the form $F(x) = x^m + \sum_{i=0}^{t-1} f_i x^i$.
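The restriction on programmable polynomials can be sketched with polynomials stored as integers, where bit i holds the coefficient of x^i. The Python code checks whether a polynomial of degree m has all of its coefficients f_i for t ≤ i < m equal to zero, which is the condition a multiplier with t programmable coefficients imposes. The function names are hypothetical, and the check does not test irreducibility.

```python
def poly_from_exponents(exponents):
    """Build F(x) from the exponents of its nonzero terms (bit i <-> x^i)."""
    value = 0
    for e in exponents:
        value |= 1 << e
    return value

def is_supported(F, m, t):
    """True if F(x) = x^m + sum_{i<t} f_i x^i, i.e. f_i = 0 for t <= i < m."""
    if F >> m != 1:                            # must be monic of degree m
        return False
    middle = (F >> t) & ((1 << (m - t)) - 1)   # coefficients f_t .. f_{m-1}
    return middle == 0

m = 163
pentanomial = poly_from_exponents([163, 7, 6, 3, 0])   # x^163 + x^7 + x^6 + x^3 + 1
supported_8 = is_supported(pentanomial, m, t=8)   # all low terms fit in 8 bits
supported_4 = is_supported(pentanomial, m, t=4)   # x^7 and x^6 do not fit
```

In this representation only t coefficient bits need to be stored and wired into the reduction logic, instead of m.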
Fixed irreducible polynomial implementations support just one hardwired irre-
ducible polynomial. No provisions are made for programmability of the coefficients
of the irreducible polynomial.
One would expect that general solutions would exhibit higher complexity and
possibly lower performance than specialized solutions. This is demonstrated in the
following sections by studying how the logic complexity and the performance of the
multipliers vary as their support for irreducible polynomials varies.
Each irreducible polynomial of degree m defines a field that is isomorphic to the
fields defined by other irreducible polynomials of degree m. Each irreducible
polynomial leads to a different field representation, and one can transform an element from
one representation to another. Because the security of elliptic curve cryptosystems
does not rest on the irreducible polynomials used to define fields GF(2^m), standards
recommend irreducible polynomials that facilitate implementations; for example,
the standards [IEE98, ANS98, ANS99, FIP00] recommend the use of trinomials
($F(x) = x^m + x^t + 1$) and pentanomials ($F(x) = x^m + x^{t_3} + x^{t_2} + x^{t_1} + 1$).
4.1.2 GF(2^m) Squarer Architectures
Squaring is one of the most common arithmetic operations for algorithms based on
the discrete logarithm problem over fields GF(2^m) and over the groups formed by
elliptic curves defined over fields GF(2^m).
Squaring in normal basis is a simple operation that can be realized with cyclic
shifts. On the other hand, squaring in standard basis for arbitrary fields defined by
arbitrary irreducible polynomials is a complex operation. Table 4.2 shows the area
and time complexity of bit-serial and parallel squaring architectures that support
arbitrary irreducible polynomials.
When restricting the problem of squaring in standard basis to use fixed trinomials
($F(x) = x^m + x^t + 1$) and pentanomials ($F(x) = x^m + x^{t_3} + x^{t_2} + x^{t_1} + 1$),
which are the type of polynomials recommended by the standards [ANS98, ANS99,
IEE98, FIP00], the complexity of squaring can be considered to be linear. Table 4.2
shows the complexity of parallel squarers that support fixed irreducible polynomials.
For the case of trinomials and pentanomials, r is respectively two and four, where r is
the number of nonzero coefficients of F(x) minus one.
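The structure of squaring in standard basis can be sketched in a few lines. In characteristic two the square of a polynomial is obtained by moving each coefficient a_i from x^i to x^(2i) and then reducing modulo F(x). The Python code stores polynomials as integers (bit i holds the coefficient of x^i) and uses the small trinomial x^7 + x + 1 as an example field; the function names and the choice of field are assumptions made for illustration, and the code models the arithmetic rather than any particular hardware squarer.

```python
def reduce_mod(value, F, m):
    """Reduce a GF(2) polynomial modulo F(x), where F has degree m."""
    for i in range(value.bit_length() - 1, m - 1, -1):
        if (value >> i) & 1:
            value ^= F << (i - m)      # add x^(i-m) * F(x) to clear bit i
    return value

def square(a, F, m):
    """Square by spreading the coefficients, then reducing."""
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)     # a_i moves from x^i to x^(2i)
    return reduce_mod(spread, F, m)

def multiply(a, b, F, m):
    """Schoolbook multiplication followed by reduction (reference only)."""
    product = 0
    for i in range(m):
        if (b >> i) & 1:
            product ^= a << i
    return reduce_mod(product, F, m)

m, F = 7, (1 << 7) | (1 << 1) | 1      # trinomial x^7 + x + 1
a = 0b1011011
sq = square(a, F, m)
```

The spreading step needs no logic at all in hardware; all of the gates go into the reduction, and with a fixed sparse polynomial each output bit depends on only a few input bits.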
The standards [ANS98, ANS99, IEE98, FIP00] recommend the use of trinomials
whenever possible. Pentanomials are recommended for fields that cannot be defined
by trinomials. Pentanomials exist for all the fields GF(2^m) for which m is greater
than or equal to four [ANS98].
Table 4.2: Time-area characteristics of GF(2^m) squarers
Type               Polynomial support  Area complexity  Time complexity
Bit-serial [BG89]  Arbitrary           O(m)             O(m)
Parallel [JSP98]   Arbitrary           O(m^2)           O(1)
Parallel [Wu99]    Fixed               < rm             O(1)
This work focuses on two squaring architectures. One architecture is based on
parallel squarers that support fixed irreducible polynomials, specifically trinomials
and pentanomials. These architectures are irregular and depend on the irreducible
polynomials.
The other squaring architecture is based on a new concept developed by the author
as part of the research work documented here. This new squaring architecture was
introduced in [OP00b].
The new squaring architecture is based on the computation of squares using
least significant bit/digit first bit-serial, digit-serial, or super-serial multipliers to-
gether with simple circuits that facilitate the computation of squares. This is a
regular squarer architecture that can be designed to be independent of irreducible
polynomials.
The description of each squarer architecture in the following sections includes
the following: description of the squaring method, description of the hardware ar-
chitecture, and a summary of the estimated complexity and performance for imple-
mentations using logic gates, generic gates, and FPGA logic. The complexity and
timing models used in this work are described in Appendix A.
4.2 Adder
Equation (2.2) describes the addition of two GF(2^m) field elements. The addition
of two GF(2^m) field elements requires the modulo two addition of the coefficients
of each of the input operands, where the modulo two additions can be performed
with XOR gates. A parallel adder requires m XOR gates and its critical path delay
is one XOR gate delay.
The super-serial multipliers operate on digits of the input operands. An arith-
metic unit based on a super-serial multiplier requires an adder that can add one
digit of each of its input operands. This operation can be fulfilled with a parallel
adder of D bits.
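Both adder widths can be sketched with field elements stored as integers, where bit i holds the coefficient of α^i. The Python code adds two elements with a single XOR, as an m-bit parallel adder would, and then repeats the addition D bits at a time, as an arithmetic unit built around a D-bit adder would. The function names and the values of m and D are assumptions made for illustration.

```python
def gf2m_add(a, b):
    """Add two GF(2^m) elements: one XOR per coefficient, no carries."""
    return a ^ b

def add_by_digits(a, b, m, D):
    """Add m-bit operands one D-bit digit at a time."""
    digit_mask = (1 << D) - 1
    result = 0
    for shift in range(0, m, D):
        digit_a = (a >> shift) & digit_mask
        digit_b = (b >> shift) & digit_mask
        result |= (digit_a ^ digit_b) << shift   # digits are independent
    return result

m, D = 163, 16
a = (1 << 162) | 0x1234567
b = (1 << 162) | 0x0F0F0F0
total = gf2m_add(a, b)
```

Because no carry propagates between coefficients, the digit-wise result is identical to the full-width result, and adding b a second time recovers a, which is why addition and subtraction share one circuit.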
Table 4.3 summarizes the complexity and the critical path delay of $GF(2^m)$
adders suitable for arithmetic units based on bit-serial or digit-serial multipliers,
which require m-bit adders, and for arithmetic units based on super-serial multipliers
that require D-bit adders.
Table 4.3: Complexity and critical path delay of $GF(2^m)$ adders
Technology     Complexity     Complexity     Critical path
               m-bit adder    D-bit adder    delay
Gates          m XOR          D XOR          $T_X$
Generic gates  m GG           D GG           $T_G$
FPGA logic     m LUT          D LUT          $T_L$
4.3 Most Significant Bit First Multiplier (MSB)
The MSB multiplier introduced in [STP86] computes the multiplication of two field
elements A and B according to Algorithm 4.3.1. This algorithm is based on the
recursive operation defined by Equation (4.1), where $C^{(i)}$ represents the accumulated
value at iteration i and where $C^{(-1)} = 0$. The multiplication result is $C^{(m-1)}$.
Equation (4.2) defines the coefficients of $C^{(i)} = \sum_{j=0}^{m-1} c_j^{(i)} \alpha^j$ in terms of the
coefficients of A, B, $C^{(i-1)}$, and $F(\alpha)$.
Algorithm 4.3.1: MSB multiplication algorithm
Inputs:  $A = \sum_{i=0}^{m-1} a_i \alpha^i$
         $B = \sum_{i=0}^{m-1} b_i \alpha^i$
         $C = \sum_{i=0}^{m-1} c_i \alpha^i = 0$
         $F(\alpha) = \alpha^m + \sum_{i=0}^{m-1} f_i \alpha^i$
Output:  $C = AB \bmod F(\alpha)$
1. for i = 0 to m − 1 do
   1.1 $C = b_{m-1-i} A + C\alpha \bmod F(\alpha)$
   end for
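$$C^{(i)} = \sum_{j=0}^{m-1} c_j^{(i)} \alpha^j = A b_{m-1-i} + C^{(i-1)} \alpha \bmod F(\alpha) \quad \text{for } i = 0..m-1. \tag{4.1}$$
$$c_j^{(i)} = \begin{cases}
b_{m-1-i} a_0 + c_{m-1}^{(i-1)} f_0 & \text{for } j = 0, \\
b_{m-1-i} a_j + c_{j-1}^{(i-1)} + c_{m-1}^{(i-1)} f_j & \text{for } j = 1..m-1.
\end{cases} \tag{4.2}$$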
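As an illustration, the recursion of Equation (4.1) can be modeled in software.
The following Python sketch is not part of the hardware design; it assumes that
field elements are stored as integers (bit i holds the coefficient of $\alpha^i$) and
that `f_low` holds the coefficients $f_0 \ldots f_{m-1}$ of $F(\alpha)$ without the $\alpha^m$ term.
The function name and the encoding are assumptions made for this example.

```python
def msb_multiply(a: int, b: int, f_low: int, m: int) -> int:
    """Return A*B mod F(alpha), scanning the bits of B from b_{m-1} down to b_0."""
    mask = (1 << m) - 1
    c = 0  # C^(-1) = 0
    for i in range(m):
        top = (c >> (m - 1)) & 1      # c_{m-1}^{(i-1)}, the bit shifted out
        c = (c << 1) & mask           # C * alpha with the alpha^m term dropped
        if top:
            c ^= f_low                # alpha^m is congruent to sum(f_j alpha^j)
        if (b >> (m - 1 - i)) & 1:
            c ^= a                    # add the scalar product b_{m-1-i} * A
    return c
```

For example, in $GF(2^3)$ with $F(\alpha) = \alpha^3 + \alpha + 1$, the call
`msb_multiply(0b010, 0b100, 0b011, 3)` computes $\alpha \cdot \alpha^2 = \alpha + 1$.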
4.3.1 Architecture
Figure 4.1 shows a block diagram of the MSB multiplier. From this figure and
Equation (4.1) one can appreciate that the MSB multiplier incorporates three main
circuits: a scalar multiplier that computes $A b_{m-1-i}$, a multiply by $\alpha$ circuit (for the
$C\alpha$ computation), and a mod $F(\alpha)$ circuit (for the reduction of $C\alpha$). The distinction
between these three circuits is important for understanding the digit-serial multiplier
architectures discussed later.
Figure 4.1: MSB multiplier
4.3.2 Complexity, Critical Path Delay, and Performance
Table 4.4 summarizes the logic complexity and the critical path delay of the MSB
multiplier for arbitrary, programmable, and fixed irreducible polynomial support.
The estimates in the table include the complexity of the m-bit shift register that
holds the B operand.
In Table 4.4, r represents the number of coefficients of F (x) supported by the
multiplier for configurations that support fixed and programmable polynomials.
To highlight the complexity and the critical path delay behavior, only the most
significant terms of the complexity and the critical path delay expressions are
included in Table 4.4. Definitions of fixed, programmable, and arbitrary polynomials
are given in Section 4.1.1.
Estimates are provided for implementations with logic gates, generic gates, and
FPGA logic. The estimates for FPGA logic assume that the combinatorial functions
are implemented using binary trees. The estimates are based on the models intro-
duced in Appendix A. (Note that in Table 4.4 the acronym FF refers to flip-flops.)
Table 4.5 summarizes the latency and the throughput of the MSB multiplier.
These parameters are normalized with respect to the period of a clock cycle. The
minimum clock cycle period is determined by the critical path delay of the multiplier,
which is a function of the logic elements employed and of the irreducible polynomial
support.
Table 4.4: Complexity and critical path delay of MSB multiplier
Technology     Irreducible    Complexity                                              Critical path
               polynomial                                                             delay
               support
Gates          Arbitrary      4m AND + m OR + 2m XOR + 4m FF                          $T_A + 2T_X$
               Programmable   (3m+r) AND + m OR + (m+r) XOR + (3m+r) FF               $T_A + 2T_X$
               Fixed          3m AND + m OR + (m+r) XOR + 3m FF                       $2T_X$
Generic gates  Arbitrary      7m GG + 4m FF                                           $3T_G$
               Programmable   (5m+2r) GG + (3m+r) FF                                  $3T_G$
               Fixed          (5m+r) GG + 3m FF                                       $2T_G$
FPGA logic     Arbitrary      $(\lceil 4/(L-1) \rceil + 1)m$ LUT + 4m FF              $\lceil \log_L 5 \rceil T_L$
               Programmable   $(\lceil 4/(L-1) \rceil r + (2m-r))$ LUT + (3m+r) FF    $\lceil \log_L 5 \rceil T_L$
               Fixed          $(\lceil 3/(L-1) \rceil r + (2m-r))$ LUT + 3m FF        $\lceil \log_L 4 \rceil T_L$
Table 4.5: Performance of MSB multiplier
Attribute                                Performance
Latency (in # clocks)                    m
Throughput (in # operations/# clocks)    1/m
4.4 Least Significant Bit First Multiplier (LSB)
The LSB multiplier introduced in [YRT84] computes the field operation
$AB \bmod F(\alpha) + C$, where A, B, and C are elements of the field $GF(2^m)$. This multiplier
computes this operation according to Algorithm 4.4.1. This algorithm is based on
the recursive operation defined by Equations (4.3) and (4.4), where
$A^{(i)} = A\alpha^i \bmod F(\alpha)$, $C^{(i)}$ represents the accumulated product at time i, and
$C^{(-1)}$ represents the value of C at the beginning of the multiplication. The result
of the multiplication is $C^{(m-1)}$.
Equations (4.5) and (4.6) define the coefficients of $A^{(i)}$ and $C^{(i)}$ in terms of the
coefficients of $A^{(i-1)}$, B, $C^{(i-1)}$, and $F(x)$.
Algorithm 4.4.1: LSB multiplication algorithm
Inputs:  $A = \sum_{i=0}^{m-1} a_i \alpha^i$
         $B = \sum_{i=0}^{m-1} b_i \alpha^i$
         $C = \sum_{i=0}^{m-1} c_i \alpha^i$
         $F(\alpha) = \alpha^m + \sum_{i=0}^{m-1} f_i \alpha^i$
Output:  $C = AB \bmod F(\alpha) + C$
1. for i = 0 to m − 1 do
   1.1 $C = b_i A + C$
   1.2 $A = A\alpha \bmod F(\alpha)$
   end for
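$$C^{(i)} = \sum_{j=0}^{m-1} c_j^{(i)} \alpha^j = b_i A^{(i)} + C^{(i-1)} \quad \text{for } i = 0..m-1. \tag{4.3}$$
$$A^{(i)} = \sum_{j=0}^{m-1} a_j^{(i)} \alpha^j = \begin{cases}
A & \text{for } i = 0, \\
A^{(i-1)} \alpha \bmod F(\alpha) & \text{for } i = 1..m-1.
\end{cases} \tag{4.4}$$
$$a_j^{(i)} = \begin{cases}
a_{m-1}^{(i-1)} f_0 & \text{for } j = 0, \\
a_{j-1}^{(i-1)} + a_{m-1}^{(i-1)} f_j & \text{for } j = 1..m-1 \text{ and } i = 1..m-1.
\end{cases} \tag{4.5}$$
$$c_j^{(i)} = b_i a_j^{(i)} + c_j^{(i-1)} \tag{4.6}$$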
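The recursion of Equations (4.3) and (4.4) can be modeled in software in the same
way. The Python sketch below is only an illustration; it assumes the integer
encoding of field elements (bit i holds the coefficient of $\alpha^i$) and a hypothetical
parameter `f_low` that holds $f_0 \ldots f_{m-1}$ without the $\alpha^m$ term.

```python
def lsb_multiply(a: int, b: int, c: int, f_low: int, m: int) -> int:
    """Return A*B mod F(alpha) + C, scanning the bits of B from b_0 up to b_{m-1}."""
    mask = (1 << m) - 1
    for i in range(m):
        if (b >> i) & 1:
            c ^= a                    # Step 1.1: C = b_i * A^(i) + C
        top = (a >> (m - 1)) & 1      # a_{m-1}, the bit shifted out of A
        a = (a << 1) & mask           # Step 1.2: A * alpha ...
        if top:
            a ^= f_low                # ... reduced modulo F(alpha)
    return c
```

Passing a nonzero initial `c` yields the multiply-accumulate behavior described for
this multiplier; passing `c = 0` yields a plain multiplication.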
4.4.1 Architecture
Figure 4.2 shows a block diagram of the LSB multiplier. From this figure and
Equations (4.3) and (4.4) one can appreciate that the LSB multiplier incorporates three
main circuits: a scalar multiplier that computes $b_i A^{(i)}$, an $A\alpha \bmod F(\alpha)$ circuit,
and an accumulator. The distinction between these three circuits is important for
understanding the digit-serial multiplier architectures discussed later.
Figure 4.2: LSB multiplier
4.4.2 Complexity, Critical Path Delay, and Performance
Table 4.6 summarizes the logic complexity and the critical path delay of the LSB
multiplier for arbitrary, programmable, and fixed irreducible polynomial support.
The estimates in the table include the complexity of the m-bit shift register that
holds the B operand.
In Table 4.6, r represents the number of coefficients of F (x) supported by the
multiplier for configurations that support fixed and programmable polynomials.
The estimates in Table 4.6 assume the existence of a 2:1 multiplexer at the input
of each of the registers that hold coefficients of $A^{(i)}$ (these multiplexers are not
shown in Figure 4.2). These multiplexers are used to load the field element A into
the multiplier at the beginning of the multiplication.
To highlight the complexity and the critical path delay behavior, only the most
significant terms of the complexity and the critical path delay expressions are
included in Table 4.6.
Estimates are provided for implementations with logic gates, generic gates, and
FPGA logic. The estimates are based on the models introduced in Appendix A.
The estimates for FPGA logic assume that the combinatorial functions are imple-
mented using binary trees, including the 2:1 multiplexers and the logic surrounding
them. In other words, the logic required for the 2:1 multiplexers is merged with
the logic surrounding them. The FPGA complexity and critical path delay are then
determined for the merged combinatorial circuits.
Table 4.7 summarizes the latency and the throughput of the LSB multiplier.
These parameters are normalized with respect to the period of a clock cycle. The
minimum clock cycle period is determined by the critical path delay of the multiplier,
which is a function of the logic elements employed and of the irreducible polynomial
support.
Table 4.6: Complexity and critical path delay of LSB multiplier
Technology     Irreducible    Complexity                                              Critical path
               polynomial                                                             delay
               support
Gates          Arbitrary      6m AND + 2m OR + 2m XOR + 4m FF                         $2T_A + T_O + T_X$
               Programmable   (5m+r) AND + 2m OR + (m+r) XOR + (3m+r) FF              $2T_A + T_O + T_X$
               Fixed          5m AND + 2m OR + (m+r) XOR + 3m FF                      $T_A + T_O + T_X$
Generic gates  Arbitrary      10m GG + 4m FF                                          $4T_G$
               Programmable   (8m+2r) GG + (3m+r) FF                                  $4T_G$
               Fixed          (8m+r) GG + 3m FF                                       $3T_G$
FPGA logic     Arbitrary      $(\lceil 4/(L-1) \rceil + 2)m$ LUT + 4m FF              $\lceil \log_L 5 \rceil T_L$
               Programmable   $(\lceil 4/(L-1) \rceil r + (3m-r))$ LUT + (3m+r) FF    $\lceil \log_L 5 \rceil T_L$
               Fixed          $(\lceil 3/(L-1) \rceil r + (3m-r))$ LUT + 3m FF        $\lceil \log_L 4 \rceil T_L$
Table 4.7: Performance of LSB multiplier
Attribute                                Performance
Latency (in # clocks)                    m
Throughput (in # operations/# clocks)    1/m
4.5 Most Significant Digit First Multiplier (MSD)
The MSD multiplier introduced in [SP96, SP97] computes the multiplication of two
field elements A and B according to Algorithm 4.5.1. This algorithm is based on the
recursive operation defined by Equation (4.7). In this equation, the field element
B is expressed in digit form. $C^{(i)}$ represents the accumulated product at time i for
$i = 0..\lceil m/D \rceil - 1$ and $C^{(-1)} = 0$. The result of the multiplication is $C_{out}$, which is
defined in Equation (4.8).
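Algorithm 4.5.1: MSD multiplication algorithm
Inputs:  $A = \sum_{i=0}^{m-1} a_i \alpha^i$
         $B = \sum_{i=0}^{\lceil m/D \rceil - 1} B_i \alpha^{Di}$, where
         $B_i = \sum_{j=0}^{D-1} b_{Di+j} \alpha^j$ and $b_{k \ge m} = 0$.
         $C = \sum_{i=0}^{m-1} c_i \alpha^i = 0$
         $F(\alpha) = \alpha^m + \sum_{i=0}^{m-1} f_i \alpha^i$
Output:  $C = AB \bmod F(\alpha)$
1. for i = 0 to $\lceil m/D \rceil - 1$ do
   1.1 $C = A B_{\lceil m/D \rceil - 1 - i} + (C \bmod F(\alpha)) \alpha^D$
   end for
2. $C = C \bmod F(\alpha)$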
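$$C^{(i)} = A B_{\lceil m/D \rceil - 1 - i} + (C^{(i-1)} \bmod F(\alpha)) \alpha^D \quad \text{for } i = 0..\lceil m/D \rceil - 1 \tag{4.7}$$
$$C_{out} = C^{(\lceil m/D \rceil - 1)} \bmod F(\alpha) \tag{4.8}$$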
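A software model of Algorithm 4.5.1 may help to follow the digit-level recursion.
The Python sketch below is an illustration under stated assumptions: field elements
are integers (bit i holds the coefficient of $\alpha^i$), `f_low` holds the coefficients of
$F(\alpha)$ below $\alpha^m$, and the reduction is performed in a single pass, which is only
valid when $m - t \ge D$, where t is the degree of `f_low`. All function names are
hypothetical.

```python
def clmul(x: int, y: int) -> int:
    """Carry-less product of two GF(2) polynomials stored as integers."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r


def reduce_once(c: int, f_low: int, m: int) -> int:
    """Single-pass reduction of c (degree <= m + D - 1), assuming m - t >= D."""
    return (c & ((1 << m) - 1)) ^ clmul(c >> m, f_low)


def msd_multiply(a: int, b: int, f_low: int, m: int, D: int) -> int:
    """Return A*B mod F(alpha), consuming the digits of B most significant first."""
    t = f_low.bit_length() - 1
    assert m - t >= D, "single-pass reduction needs m - t >= D"
    k = -(-m // D)  # ceil(m / D) digits
    digit_mask = (1 << D) - 1
    c = 0
    for i in range(k):
        b_digit = (b >> (D * (k - 1 - i))) & digit_mask
        # Step 1.1: C = A * B_digit + (C mod F(alpha)) * alpha^D
        c = clmul(a, b_digit) ^ (reduce_once(c, f_low, m) << D)
    return reduce_once(c, f_low, m)  # Step 2
```

The accumulator value `c` can grow to degree m + D − 1 between iterations, which is
why a final reduction (Step 2) is required.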
4.5.1 Architecture
Figure 4.3 shows a block diagram of the MSD multiplier introduced in [SP97]. In
this figure, the notation x : y used in some of the interconnecting busses implies the
use of y busses of at most x bits each. For example, D busses of at most m bits
connect the digit multiplication core to the accumulator.
The MSD multiplier is composed of the following circuits: digit multiplication
core, mod $F(\alpha)$ 1, multiply by $\alpha^D$, accumulator, and mod $F(\alpha)$ 2 circuits.
Figure 4.3: MSD multiplier
In each clock cycle, the digit multiplication core computes a set of D scalar
products $b_{Di}A$, $b_{Di+1}A\alpha$, ..., $b_{Di+D-1}A\alpha^{D-1}$, whose sum represents the scalar
product $AB_i$ as shown by Equation (4.9). The accumulator adds the scalar products
generated by the digit multiplication core, thus computing $AB_i$.
$$AB_i = \sum_{j=0}^{D-1} A b_{Di+j} \alpha^j \tag{4.9}$$
The mod $F(\alpha)$ 1 circuit generates a set of scalar products whose sum represents
$C^{(i-1)} \bmod F(\alpha)$, where $C^{(i-1)}$ is the content of the accumulator.
Equation (4.10) defines $C^{(i-1)} \bmod F(\alpha)$ for general irreducible polynomials
and Equation (4.11) defines it for optimum primitive polynomials.
The term optimum primitive polynomial was introduced in [SP96]. These are
polynomials of the form $F(x) = x^m + \sum_{i=0}^{t} f_i x^i$ where $m - t \ge D$. These polynomials
simplify the mod $F(\alpha)$ reduction operation as can be appreciated from Equation
(4.11).
Optimum primitive polynomials can be extensively used in cryptographic
applications. The irreducible polynomials of prime degree ranging from 163 to 997
(m = 163 . . . 997) that are recommended in [IEE98, FIP00, ANS98, ANS99] can
be classified as optimum primitive polynomials for digit sizes of up to 40 bits
($D \le 40$), since the condition $m - t \ge D$ that holds for D = 40 also holds for any
smaller digit size.
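The condition $m - t \ge D$ is simple to check for a given polynomial and digit size.
The Python sketch below is an illustration; the function name is hypothetical, and
the two example polynomials, $x^{163} + x^7 + x^6 + x^3 + 1$ and $x^{233} + x^{74} + 1$, are
the reduction polynomials commonly listed for the 163-bit and 233-bit binary fields
in FIPS 186-2.

```python
def is_optimum_primitive(m: int, lower_exponents: list[int], D: int) -> bool:
    """Check m - t >= D, where t is the largest exponent below m in F(x)."""
    t = max(lower_exponents)
    return m - t >= D


# x^163 + x^7 + x^6 + x^3 + 1: t = 7, so m - t = 156
pentanomial_163 = [7, 6, 3, 0]
# x^233 + x^74 + 1: t = 74, so m - t = 159
trinomial_233 = [74, 0]
```

Both example polynomials satisfy the condition for D = 40 with a wide margin.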
The following discussion assumes the use of optimum primitive polynomials.
Additional information on this type of multiplier can be found in [SP96, SP97].
When using optimum primitive polynomials, the mod $F(\alpha)$ 1 circuit generates
the term $\sum_{j=0}^{m-1} c_j^{(i-1)} \alpha^j$ together with the scalar products
$c_{m+j}^{(i-1)} (\sum_{k=0}^{t} f_k \alpha^k) \alpha^j$ for j = 0 . . . D − 1. (Note that some of the coefficients
of the irreducible polynomial could be zero.)
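\begin{align}
C^{(i-1)} \bmod F(\alpha) &\equiv \sum_{j=0}^{m-1} c_j^{(i-1)} \alpha^j + \Big(\sum_{j=m}^{m+D-1} c_j^{(i-1)} \alpha^j\Big) \bmod F(\alpha) \tag{4.10} \\
&\equiv \sum_{j=0}^{m-1} c_j^{(i-1)} \alpha^j + \Big(\alpha^m \sum_{j=0}^{D-1} c_{m+j}^{(i-1)} \alpha^j\Big) \bmod F(\alpha) \notag \\
&\equiv \sum_{j=0}^{m-1} c_j^{(i-1)} \alpha^j + \Big(\sum_{j=0}^{D-1} c_{m+j}^{(i-1)} \alpha^j\Big)\Big(\sum_{j=0}^{m-1} f_j \alpha^j\Big) \bmod F(\alpha) \notag
\end{align}
$$C^{(i-1)} \bmod F(\alpha) \equiv \sum_{j=0}^{m-1} c_j^{(i-1)} \alpha^j + \Big(\sum_{j=0}^{D-1} c_{m+j}^{(i-1)} \Big(\sum_{k=0}^{t} f_k \alpha^k\Big) \alpha^j\Big) \tag{4.11}$$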
The multiply by $\alpha^D$ circuit multiplies the scalar products generated by the mod
$F(\alpha)$ 1 circuit by the constant $\alpha^D$.
The accumulator adds the scalar products generated by the digit multiplication
core circuit together with the terms generated by the mod $F(\alpha)$ 1 and the multiply
by $\alpha^D$ circuits. The result of the sum is $C^{(i)}$, which is defined in Equation (4.7).
This result is latched in the accumulator.
The accumulator adds D + 1 terms involving operands of m bits (D from the
digit multiplication core and one from the mod $F(\alpha)$ 1 and the multiply by $\alpha^D$
circuits) together with D terms involving operands of r bits (terms generated
by the mod $F(\alpha)$ 1 and the multiply by $\alpha^D$ circuits that involve coefficients of the
irreducible polynomial).
When considering timing estimates, the expression
$\sum_{j=0}^{D-1} c_{m+j}^{(i-1)} (\sum_{k=0}^{t} f_k \alpha^k) \alpha^j$ in Equation (4.11) can be rewritten as
$\sum_{k=0}^{t} f_k (\sum_{j=0}^{D-1} c_{m+j}^{(i-1)} \alpha^j) \alpha^k$. The latter expression
can be interpreted as the sum of t + 1 scalar products each requiring D bits. When
considering that cryptographic standards recommend the use of trinomials and
pentanomials, the latter expression can be interpreted as the sum of at most two
D-bit elements when using fixed trinomials or the sum of at most four D-bit elements
when using fixed pentanomials.
The mod F (α) 2 circuit computes the reduction shown in Equation (4.8). This
circuit generates the same scalar products that the mod F (α) 1 circuit generates
and in addition computes their sum. By using two mod F (α) circuits, implemen-
tations of the MSD multiplier could realize lower critical path delays than what
would be possible with a single mod F (α) circuit, because the terms generated by
the mod F (α) 1 circuit, which are shifted by the multiply by αD circuit, are added
together with the terms generated by the digit multiplication core in a tree structure.
4.5.2 Complexity, Critical Path Delay, and Performance
Table 4.8 summarizes the complexity and the critical path delay of the MSD
multiplier for programmable and fixed irreducible polynomial support. For programmable
irreducible polynomial support, the estimates assume the use of optimum primitive
polynomials.
In Table 4.8, r represents the number of coefficients of F (x) supported by the
multiplier for configurations that support fixed and programmable polynomials.
The estimates in Table 4.8 include the complexity of the m-bit shift register that
holds the B operand.
To highlight the complexity and the critical path delay behavior, only the most
significant terms of the complexity and the critical path delay expressions are
included in Table 4.8.
Estimates are provided for implementations with logic gates, generic gates, and
FPGA logic. The estimates are based on the models introduced in Appendix A.
The estimates for FPGA logic assume the use of m + D − 1 trees to generate the
coefficients of $C^{(i)}$. Of these trees, (D − 1) + r include reduction terms generated by
the mod $F(\alpha)$ 1 and the multiply by $\alpha^D$ circuits, and m − r trees do not include
reduction terms. In addition, the estimates assume the use of (D − 1) + r trees in
the mod $F(\alpha)$ 2 circuit. These estimates assume the use of irreducible polynomials
of the following form: $F(x) = x^m + \sum_{i=0}^{t} f_i x^i$ with $t = r - 1$ and $f_i \neq 0$. In general,
this need not be the case, especially when using fixed trinomials and pentanomials.
For these latter cases, r, which is used to represent the number of nonzero coefficients
of F(x) minus one, is much lower than m and the results in Table 4.8 provide a good
approximation of the complexity of the multiplier. In general, D will tend to be
much lower than m, which further increases the accuracy of the estimates in Table
4.8.
The trees that include reduction terms are assumed to be implemented using
a variant of the GF (2) mult/add tree described in Appendix A. The number of
LUTs required by each of these trees is determined by adding the effective number
of inputs required to implement the GF (2) multiplications and the number of inputs
that need to be added to the outputs of the GF (2) multipliers. The total number
of inputs is used to determine the number of LUTs required using the binary tree
expressions derived in Appendix A.
Table 4.9 approximates the complexity and the critical path delay of the MSD
multiplier for implementations that satisfy the following conditions: $m \gg D \gg r$.
These conditions define MSD multipliers that exhibit large digit sizes and low
reduction overhead. The approximations in Table 4.9 highlight how the complexity
and the critical path delay scale as a function of the digit size.
Table 4.10 summarizes the latency and the throughput of the MSD multiplier.
These parameters are normalized with respect to the period of a clock cycle. The
minimum clock cycle period is determined by the critical path delay of the multiplier,
which is a function of the logic elements employed and of the irreducible polynomial
support.
Table 4.8: Complexity and critical path delay of MSD multiplier
Technology     Irreducible    Complexity                                                  Critical path delay
               polynomial
               support
Gates          Programmable   (Dm+2m+2Dr) AND + m OR + (Dm+2Dr) XOR + (3m+D+r) FF         $T_A + \lceil \log_2 (D+1+\min(D,r)) \rceil T_X$
               Fixed          (Dm+2m) AND + m OR + (Dm+2Dr) XOR + (3m+D) FF               $T_A + \lceil \log_2 (D+1+\min(D,r)) \rceil T_X$
Generic gates  Programmable   (2Dm+3m+4Dr) GG + (3m+D+r) FF                               $(\lceil \log_2 (D+1+\min(D,r)) \rceil + 1) T_G$
               Fixed          (2Dm+3m+2Dr) GG + (3m+D) FF                                 $(\lceil \log_2 (D+1+\min(D,r)) \rceil + 1) T_G$
FPGA logic     Programmable   $[m + \lceil 2DZ/(L-1) \rceil (m-r)$                         $\lceil \log_L (2Z(D+\min(D,r)) + 1) \rceil T_L$
                              $+ \lceil 2Z(D+\min(D,r))/(L-1) \rceil (D+r)$
                              $+ \lceil 2Z\min(D,r)/(L-1) \rceil (D+r)]$ LUT
                              + (3m+D+r) FF
               Fixed          $[m + \lceil 2DZ/(L-1) \rceil (m-r)$                         $\lceil \log_L (2DZ+\min(D,r)+1) \rceil T_L$
                              $+ \lceil (2DZ+\min(D,r))/(L-1) \rceil (D+r)$
                              $+ \lceil \min(D,r)/(L-1) \rceil (D+r)]$ LUT
                              + (3m+D) FF
Table 4.9: Complexity and critical path delay of MSD multiplier for $m \gg D \gg r$
Technology     Irreducible           Complexity                                   Critical path delay
               polynomial
               support
Gates          Programmable, Fixed   (Dm+2m) AND + m OR + Dm XOR + 3m FF          $T_A + \lceil \log_2 (D+r+1) \rceil T_X$
Generic gates  Programmable, Fixed   (2Dm+3m) GG + 3m FF                          $(\lceil \log_2 (D+r+1) \rceil + 1) T_G$
FPGA logic     Programmable          $(\lceil 2DZ/(L-1) \rceil + 1)m$ LUT + 3m FF   $\lceil \log_L (2Z(D+r)+1) \rceil T_L$
               Fixed                 $(\lceil 2DZ/(L-1) \rceil + 1)m$ LUT + 3m FF   $\lceil \log_L (2DZ+r+1) \rceil T_L$
Table 4.10: Performance of MSD multiplier
Attribute                                Performance
Latency (in # clocks)                    $\lceil m/D \rceil$
Throughput (in # operations/# clocks)    $1/\lceil m/D \rceil$
4.6 Least Significant Digit First Multiplier (LSD)
The LSD multiplier, introduced in [SP96, SP97], computes the multiplication of two
field elements A and B according to Algorithm 4.6.1. This algorithm is based on
the recursive operation defined by Equation (4.12). In this equation, $C^{(i)}$ represents
the accumulated product at time i for $i = 0..\lceil m/D \rceil - 1$ and $C^{(-1)}$ represents the
original value of C in Algorithm 4.6.1, which need not be zero. The result of the
multiplication is $C_{out}$, which is defined in Equation (4.13).
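Algorithm 4.6.1: LSD multiplication algorithm
Inputs:  $A = \sum_{i=0}^{m-1} a_i \alpha^i$
         $B = \sum_{i=0}^{\lceil m/D \rceil - 1} B_i \alpha^{Di}$, where
         $B_i = \sum_{j=0}^{D-1} b_{Di+j} \alpha^j$ and $b_{k \ge m} = 0$
         $C = \sum_{i=0}^{m-1} c_i \alpha^i$
         $F(\alpha) = \alpha^m + \sum_{i=0}^{m-1} f_i \alpha^i$
Output:  $C = AB \bmod F(\alpha) + C$
1. for i = 0 to $\lceil m/D \rceil - 1$ do
   1.1 $C = B_i A + C$
   1.2 $A = A\alpha^D \bmod F(\alpha)$
   end for
2. $C = C \bmod F(\alpha)$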
$$C^{(i)} = B_i (A\alpha^{Di} \bmod F(\alpha)) + C^{(i-1)} \quad \text{for } i = 0..\lceil m/D \rceil - 1 \tag{4.12}$$
$$C_{out} = C^{(\lceil m/D \rceil - 1)} \bmod F(\alpha) \tag{4.13}$$
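Algorithm 4.6.1 can be modeled in software along the same lines as the other
multipliers. The Python sketch below is an illustration only; it assumes the integer
encoding of field elements (bit i holds the coefficient of $\alpha^i$), a hypothetical
parameter `f_low` with the coefficients of $F(\alpha)$ below $\alpha^m$, and a single-pass
reduction that is valid only when $m - t \ge D$, where t is the degree of `f_low`.

```python
def clmul(x: int, y: int) -> int:
    """Carry-less product of two GF(2) polynomials stored as integers."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        y >>= 1
    return r


def reduce_once(c: int, f_low: int, m: int) -> int:
    """Single-pass reduction of c (degree <= m + D - 1), assuming m - t >= D."""
    return (c & ((1 << m) - 1)) ^ clmul(c >> m, f_low)


def lsd_multiply(a: int, b: int, c: int, f_low: int, m: int, D: int) -> int:
    """Return A*B mod F(alpha) + C, consuming the digits of B least significant first."""
    t = f_low.bit_length() - 1
    assert m - t >= D, "single-pass reduction needs m - t >= D"
    digit_mask = (1 << D) - 1
    for i in range(-(-m // D)):  # ceil(m / D) digits
        b_digit = (b >> (D * i)) & digit_mask
        c ^= clmul(a, b_digit)                   # Step 1.1: C = B_i * A + C
        a = reduce_once(a << D, f_low, m)        # Step 1.2: A = A * alpha^D mod F
    return reduce_once(c, f_low, m)              # Step 2
```

In this model the operand A is kept reduced in every iteration, while the
accumulator is reduced only once at the end.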
4.6.1 Architecture
Figure 4.4 shows a block diagram of the LSD multiplier introduced in [SP97]. The
LSD multiplier is composed of the following circuits: $A\alpha^{Di} \bmod F(\alpha)$, digit
multiplication core, accumulator, and mod $F(\alpha)$ circuits.
The $A\alpha^{Di} \bmod F(\alpha)$ circuit generates, per clock cycle, a term
$A^{(i)} = A^{(i-1)}\alpha^D \bmod F(\alpha)$ for $i = 1..\lceil m/D \rceil - 1$, where $A^{(0)} = A$ and
$A^{(i)} = A\alpha^{Di} \bmod F(\alpha)$. Equation (4.14) defines $A^{(i)}$ for general irreducible
polynomials and Equation (4.15) defines it for optimum primitive polynomials. A
definition of the term optimum primitive polynomial is given in Section 4.5.
Figure 4.4: LSD multiplier
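\begin{align}
A^{(i)} &\equiv A^{(i-1)}\alpha^D \bmod F(\alpha) \equiv \alpha^D \Big(\sum_{j=0}^{m-1} a_j^{(i-1)} \alpha^j\Big) \bmod F(\alpha) \tag{4.14} \\
&\equiv \alpha^D \sum_{j=0}^{m-D-1} a_j^{(i-1)} \alpha^j + \Big(\Big(\alpha^D \sum_{j=m-D}^{m-1} a_j^{(i-1)} \alpha^j\Big) \bmod F(\alpha)\Big) \notag \\
&\equiv \alpha^D \sum_{j=0}^{m-D-1} a_j^{(i-1)} \alpha^j + \Big(\Big(\alpha^m \sum_{j=0}^{D-1} a_{m-D+j}^{(i-1)} \alpha^j\Big) \bmod F(\alpha)\Big) \notag \\
&\equiv \alpha^D \sum_{j=0}^{m-D-1} a_j^{(i-1)} \alpha^j + \Big(\Big(\sum_{j=0}^{D-1} a_{m-D+j}^{(i-1)} \alpha^j\Big)\Big(\sum_{j=0}^{m-1} f_j \alpha^j\Big) \bmod F(\alpha)\Big) \notag
\end{align}
$$A^{(i)} = \alpha^D \sum_{j=0}^{m-D-1} a_j^{(i-1)} \alpha^j + \Big(\sum_{j=0}^{D-1} a_{m-D+j}^{(i-1)} \alpha^j\Big)\Big(\sum_{j=0}^{t} f_j \alpha^j\Big) \tag{4.15}$$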
In each clock cycle, the digit multiplication core computes a set of D scalar
products $b_{Di}A^{(i)}$, $b_{Di+1}A^{(i)}\alpha$, ..., $b_{Di+D-1}A^{(i)}\alpha^{D-1}$, whose sum represents the scalar
product $A^{(i)}B_i$ as shown by Equation (4.16). The accumulator adds the scalar
products generated by the digit multiplication core, thus computing $A^{(i)}B_i$. The
digit multiplication core used by the LSD multiplier is identical to the one used by
the MSD multiplier.
$$A^{(i)}B_i = \sum_{j=0}^{D-1} A^{(i)} b_{Di+j} \alpha^j \tag{4.16}$$
The accumulator adds the scalar products generated by the digit multiplication
core circuit to the accumulated value. The new result is latched in the accumulator.
The mod $F(\alpha)$ circuit computes the reduction shown in Equation (4.17).
Equation (4.18) provides a simplified expression for optimum primitive polynomials. The
mod $F(\alpha)$ circuit used by the LSD multiplier is similar to the mod $F(\alpha)$ 2 circuit
used by the MSD multiplier. The former mod $F(\alpha)$ circuit reduces an accumulated
value whose maximum degree is m + D − 2 while the latter reduces an accumulated
value whose maximum degree is m + D − 1.
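\begin{align}
C_{out} &\equiv C \bmod F(\alpha) \tag{4.17} \\
&\equiv \sum_{j=0}^{m-1} c_j \alpha^j + \Big(\sum_{j=m}^{m+D-2} c_j \alpha^j\Big) \bmod F(\alpha) \notag \\
&\equiv \sum_{j=0}^{m-1} c_j \alpha^j + \Big(\alpha^m \sum_{j=0}^{D-2} c_{m+j} \alpha^j\Big) \bmod F(\alpha) \notag \\
&\equiv \sum_{j=0}^{m-1} c_j \alpha^j + \Big(\sum_{j=0}^{D-2} c_{m+j} \alpha^j\Big)\Big(\sum_{j=0}^{m-1} f_j \alpha^j\Big) \bmod F(\alpha) \notag
\end{align}
$$C_{out} = \sum_{j=0}^{m-1} c_j \alpha^j + \Big(\sum_{j=0}^{D-2} c_{m+j} \alpha^j\Big)\Big(\sum_{k=0}^{t} f_k \alpha^k\Big) \tag{4.18}$$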
4.6.2 Complexity, Critical Path Delay, and Performance
Table 4.11 summarizes the complexity and the critical path delay of the LSD
multiplier for programmable and fixed irreducible polynomial support. For programmable
irreducible polynomial support, the estimates assume the use of optimum primitive
polynomials.
In Table 4.11, r represents the number of coefficients of F (x) supported by the
multiplier for configurations that support fixed and programmable polynomials.
The estimates in Table 4.11 include the complexity of the m-bit shift register
that holds the B operand. The estimates also assume that the critical path delay
is dominated by the longest path through the digit multiplication core and the
accumulator circuits.
To highlight the complexity and the critical path delay behavior, only the most
significant terms of the complexity and the critical path delay expressions are
included in Table 4.11.
Estimates are provided for implementations with logic gates, generic gates, and
FPGA logic. The estimates are based on the models introduced in Appendix A.
The estimates for FPGA logic assume that the digit multiplication core and
the accumulator circuits together use m + D − 1 trees to generate the coefficients
of $C^{(i)}$. The estimates also assume the use of irreducible polynomials of the form
$F(x) = x^m + \sum_{i=0}^{t} f_i x^i$ with $t = r - 1$ and $f_i \neq 0$, which implies the use of
D + r − 1 trees for the generation and addition of the reduction terms in the
$A\alpha^{Di} \bmod F(\alpha)$ circuit and the use of D + r − 2 trees for the generation and
addition of the reduction terms in the mod $F(\alpha)$ circuit.
The trees that add multiple GF (2) products together with other terms are as-
sumed to be implemented using a variant of the GF (2) mult/add tree described in
Appendix A. The number of LUTs required by each of these trees is determined by
adding the effective number of inputs required to implement the GF (2) multiplica-
tions and the number of inputs that need to be added to the outputs of the GF (2)
multipliers. The total number of inputs is used to determine the number of LUTs
required using the binary tree expressions derived in Appendix A.
Table 4.12 approximates the complexity and the critical path delay of the LSD
multiplier for implementations that satisfy the following conditions: $m \gg D \gg r$.
These conditions define LSD multipliers that exhibit large digit sizes and low
reduction overhead. The approximations in Table 4.12 highlight how the complexity
and the critical path delay scale as a function of the digit size.
Table 4.13 summarizes the latency and the throughput of the LSD multiplier.
These parameters are normalized with respect to the period of a clock cycle. The
minimum clock cycle period is determined by the critical path delay of the multiplier,
which is a function of the logic elements employed and of the irreducible polynomial
support.
Table 4.11: Complexity and critical path delay of LSD multiplier
Technology     Irreducible    Complexity                                                  Critical path delay
               polynomial
               support
Gates          Programmable   (Dm+4m+2Dr) AND + (Dm+2Dr) XOR + 2m OR + (3m+D+r) FF        $T_A + \lceil \log_2 (D+1) \rceil T_X$
               Fixed          (Dm+4m) AND + (Dm+2Dr) XOR + 2m OR + (3m+D) FF              $T_A + \lceil \log_2 (D+1) \rceil T_X$
Generic gates  Programmable   (2Dm+6m+4Dr) GG + (3m+D+r) FF                               $\lceil \log_2 (2D+1) \rceil T_G$
               Fixed          (2Dm+6m+2Dr) GG + (3m+D) FF                                 $\lceil \log_2 (2D+1) \rceil T_G$
FPGA logic     Programmable   $[2m + \lceil 2DZ/(L-1) \rceil (m+D)$                        $\lceil \log_L (2DZ+1) \rceil T_L$
                              $+ \lceil 2Z\min(D,r)/(L-1) \rceil (D+r)$
                              $+ \lceil 2Z\min(D-1,r)/(L-1) \rceil (D+r)]$ LUT
                              + (3m+D+r) FF
               Fixed          $[2m + \lceil 2DZ/(L-1) \rceil (m+D)$                        $\lceil \log_L (2DZ+1) \rceil T_L$
                              $+ \lceil \min(D,r)/(L-1) \rceil (D+r)$
                              $+ \lceil \min(D-1,r)/(L-1) \rceil (D+r)]$ LUT
                              + (3m+D) FF
Table 4.12: Complexity and critical path delay of LSD multiplier for $m \gg D \gg r$
Technology     Irreducible           Complexity                                   Critical path delay
               polynomial
               support
Gates          Programmable, Fixed   (Dm+4m) AND + 2m OR + Dm XOR + 3m FF         $T_A + \lceil \log_2 (D+1) \rceil T_X$
Generic gates  Programmable, Fixed   (2Dm+6m) GG + 3m FF                          $\lceil \log_2 (2D+1) \rceil T_G$
FPGA logic     Programmable, Fixed   $(\lceil 2DZ/(L-1) \rceil + 2)m$ LUT + 3m FF   $\lceil \log_L (2DZ+1) \rceil T_L$
Table 4.13: Performance of LSD multiplier
Attribute                                Performance
Latency (in # clocks)                    $\lceil m/D \rceil$
Throughput (in # operations/# clocks)    $1/\lceil m/D \rceil$
4.7 Most Significant Bit First Super-Serial Multiplier (MSB-SSM)
The multiplier architecture introduced in this section was developed as part of the
research work documented here.
The most significant bit first super-serial multiplier (MSB-SSM) was recently
introduced in [OP99]. This type of multiplier computes the product of two field
elements in $O(m^2/D)$ clock cycles using $O(D)$ processing units, where D (D < m)
represents the digit size.
The MSB-SSM multiplier computes the product of two field elements A and B
according to Algorithm 4.7.1. This algorithm is a serialized version of Algorithm
4.3.1. Step 1 is the same for both algorithms. Steps 1.1 to 1.3.3 of Algorithm 4.7.1
compute in $\lceil m/D \rceil$ clock cycles what Step 1.1 of Algorithm 4.3.1 computes in one
clock cycle. (Each iteration of the loop in Step 1.3 of Algorithm 4.7.1 requires one
clock cycle. All other steps require negligible processing time.)
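The cycle counts implied by this description can be tabulated with a small
calculation. The Python sketch below is an illustration that assumes, as stated
above, that only the iterations of the loop in Step 1.3 consume clock cycles, so
that the super-serial count is m times $\lceil m/D \rceil$; control overhead is ignored, and
the function name and dictionary keys are hypothetical.

```python
def multiplication_cycles(m: int, D: int) -> dict[str, int]:
    """Approximate clock cycles per multiplication for each multiplier class."""
    digits = -(-m // D)  # ceil(m / D)
    return {
        "bit-serial (MSB, LSB)": m,
        "digit-serial (MSD, LSD)": digits,
        "super-serial (MSB-SSM)": m * digits,
    }


cycles = multiplication_cycles(163, 16)
```

For m = 163 and D = 16 this gives 163, 11, and 1793 cycles respectively, which
reflects the trade of processing time for a smaller number of processing units.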
It can be said that an MSB-SSM multiplier emulates an MSB multiplier using
fewer processing units. Figure 4.5 shows how the MSB-SSM multiplier performs
in $\lceil m/D \rceil$ clock cycles the same function that the MSB multiplier performs in one
clock cycle.
From Figure 4.5, one can appreciate that the output of the processing unit of
the MSB-SSM multiplier at the extreme right (SS4 for D = 5) is forwarded to the
input of the processing unit at the extreme left (SS0). The processing unit at the
extreme left, in some cycles, emulates processing units of the MSB multiplier that
are neighbors of the MSB multiplier's processing units emulated by the processing
unit of the MSB-SSM multiplier at the extreme right. For example, at T = 1 the
SS0 processing unit of the MSB-SSM multiplier emulates the operation of the MSB
multiplier's processing unit S5, which requires the output of processing unit S4. The
MSB multiplier's processing unit S4 is emulated by the MSB-SSM multiplier's
processing unit SS4. In Algorithm 4.7.1, this data transfer mechanism is represented by
the variable $c^*_{D(j-1)+(D-1)}$. Note that $c^*_{D(j-1)+(D-1)}$ is set to zero when SS0 emulates
the least significant processing unit of the MSB multiplier.
Figure 4.5 shows that the output of the processing unit that emulates the most
significant processing unit of the MSB multiplier ($S_{m-1}$) must be forwarded to the
processing units that emulate processing units that perform reductions based on
this output. For example, in Figure 4.5, SS3 emulates $S_{m-1}$. The output of $S_{m-1}$
is needed by processing units S0 and S6, which are emulated by SS0 and SS1. In
Algorithm 4.7.1 this data transfer mechanism is represented by the variable $c^*_{m-1}$.
In Algorithm 4.7.1, A, C and F must be extended to an integer number of digits
($\lceil m/D \rceil$ digits). During the execution of Algorithm 4.7.1 unwanted data may be
accumulated in the coefficients $c_{i \ge m}$ of C. This is a side effect of the algorithm that
is corrected in Step 2.
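Algorithm 4.7.1: MSB-SSM multiplication algorithm
Inputs:  $A = \sum_{i=0}^{\lceil m/D \rceil - 1} A_i \alpha^{Di}$, where $A_i = \sum_{j=0}^{D-1} a_{Di+j} \alpha^j$ with $a_{i \ge m} = 0$.
         $B = \sum_{i=0}^{m-1} b_i \alpha^i$
         $C = \sum_{i=0}^{\lceil m/D \rceil - 1} C_i \alpha^{Di} = 0$, where $C_i = \sum_{j=0}^{D-1} c_{Di+j} \alpha^j$ with $c_{i \ge m} = 0$.
         $F(\alpha) = \alpha^m + \sum_{i=0}^{m-1} f_i \alpha^i$
         $F = \sum_{i=0}^{\lceil m/D \rceil - 1} F_i \alpha^{Di}$, where $F_i = \sum_{j=0}^{D-1} f_{Di+j} \alpha^j$ with $f_{i \ge m} = 0$.
Output:  $C = AB \bmod F(\alpha)$
1. for i = 0 to m − 1 do
   1.1 $c^*_{D(j-1)+(D-1)} = 0$
   1.2 $c^*_{m-1} = c_{m-1}$
   1.3 for j = 0 to $\lceil m/D \rceil - 1$ do
       1.3.1 $C' = \sum_{k=0}^{D-1} c'_k \alpha^k = \big(\sum_{k=1}^{D-1} (c_{Dj+k-1} + c^*_{m-1} f_{Dj+k}) \alpha^k\big) + (c^*_{D(j-1)+(D-1)} + c^*_{m-1} f_{Dj})$
       1.3.2 $c^*_{D(j-1)+(D-1)} = c_{Dj+(D-1)}$
       1.3.3 $C_j = \sum_{k=0}^{D-1} c_{Dj+k} \alpha^k = b_{m-1-i} A_j + C' = \sum_{k=0}^{D-1} (b_{m-1-i} a_{Dj+k} + c'_k) \alpha^k$
       end for
   end for
2. $C = C \bmod \alpha^m$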
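The data movement in Algorithm 4.7.1 can be traced with a bit-level software model.
The Python sketch below is an illustration rather than a description of the hardware:
it assumes that field elements are passed as integers (bit i holds the coefficient of
$\alpha^i$), that `f_low` holds $f_0 \ldots f_{m-1}$, and that the inner loop over `k` stands for the
D processing units that operate in parallel during one clock cycle. The names
`carry` and `c_top` are hypothetical stand-ins for the registers $c^*_{D(j-1)+(D-1)}$ and
$c^*_{m-1}$.

```python
def msb_ssm_multiply(a: int, b: int, f_low: int, m: int, D: int) -> int:
    """Return A*B mod F(alpha) following the digit-by-digit schedule of Algorithm 4.7.1."""
    digits = -(-m // D)            # ceil(m / D)
    n = digits * D                 # operands extended to a whole number of digits
    a_bits = [(a >> t) & 1 for t in range(n)]
    f_bits = [(f_low >> t) & 1 for t in range(n)]
    c = [0] * n
    for i in range(m):
        carry = 0                  # Step 1.1: c*_{D(j-1)+(D-1)} = 0
        c_top = c[m - 1]           # Step 1.2: c*_{m-1} = c_{m-1}
        b_bit = (b >> (m - 1 - i)) & 1
        for j in range(digits):    # Step 1.3: one clock cycle per digit
            base = D * j
            new_digit = []
            for k in range(D):
                prev = carry if k == 0 else c[base + k - 1]
                c_prime = prev ^ (c_top & f_bits[base + k])              # Step 1.3.1
                new_digit.append((b_bit & a_bits[base + k]) ^ c_prime)   # Step 1.3.3
            carry = c[base + D - 1]        # Step 1.3.2, saved before the digit is overwritten
            c[base:base + D] = new_digit
    # Step 2: C mod alpha^m discards the coefficients c_j with j >= m
    return sum(bit << t for t, bit in enumerate(c[:m]))
```

When m/D is not an integer, the positions at and above m can hold nonzero values
at the end of Step 1; the final truncation models Step 2.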
ss3ss2ss1ss0 ss4
sx   - processing unit of bit-serial multiplier
ssx - processing unit of super-serial multiplier
mapping from bit-serial multiplier
to super-serial multiplier
s9s8s7s6s5
s5 s6 s7 s8 s9 sm-4 sm-3 sm-2 sm-1s0 s1 s2 s3 s4
ss4ss3ss2ss1ss0
s4s3s2s1s0
ss3ss2ss1ss0
s
m-2sm-3sm-4
ss4
s
m-1
Time
T = 0
T = 1
T =  m/D  -1
Figure 4.5: Super-serial multiplier emulation of a bit-serial multiplier
4.7.1 Architecture
Figure 4.5 shows how a processing unit from an MSB-SSM multiplier emulates the
functions of up to $\lceil m/D \rceil$ processing units of an MSB multiplier.
Figure 4.6 shows the architecture of a processing unit from an MSB-SSM multiplier
and the architecture of a processing unit of an MSB multiplier. As this figure
shows, to carry out the emulation, the processing units of the MSB-SSM multiplier
incorporate memory elements. The memory elements are used to store the state of
the emulated processing units.
Figure 4.6: Processing units of the MSB and the MSB-SSM multipliers
Figure 4.7 shows a block diagram of the MSB-SSM multiplier. This multiplier
uses two types of storage elements: dual-ported RAM (DPRAM) and registers.
A DPRAM has two interfaces that allow data to be concurrently written or read
from its interfaces. The DPRAM allows, for example, a processing unit k to output
the currently stored value of $c_{Dj+k}$ while at the same time allowing it to store a new
value for $c_{Dj+k}$.
The register $c^*_{D(j-1)+(D-1)}$ transports the value $c_{D(j-1)+(D-1)}$ generated in
iteration j − 1 of the loop in Step 1.3 to the input of the processing unit that generates
the value $c_{Dj}$ in iteration j of the loop. The register $c^*_{m-1}$ transports the value of
$c_{m-1}$ generated in iteration i − 1 of the loop in Step 1 to the inputs of the processing
units requiring this value in iteration i of the loop.
Table 4.14 summarizes the processing steps involved in the computation of $AB \bmod (\alpha^3 + \sum_{i=0}^{2} f_i\alpha^i)$ for a digit size equal to two ($D = 2$). In the table, the variables $i$ and $j$ correspond to the loop index variables in Steps 1 and 1.3 of Algorithm 4.7.1, the variables $c^{(i-1)}_{2j}$ and $c^{(i-1)}_{2j+1}$ represent the outputs of the two processing units, and the variables $c^{(i)}_{2j}$ and $c^{(i)}_{2j+1}$ represent the inputs of the processing units. The values of the superscripts indicate the iteration of the loop in Step 1 in which the values are generated. The $-1$ superscript is used to identify values at the beginning of the multiplication process. The symbol ® in the table represents a don't care value.
From Table 4.14 one can appreciate that when $m/D$ is not an integer, the coefficients $c^{(i)}_{j \geq m}$ are not necessarily zero. Therefore, to eliminate these coefficients, the result at the end of Step 1 of Algorithm 4.7.1 must be reduced modulo $\alpha^m$; this is done in Step 2 of the algorithm. This reduction can be achieved by forcing the coefficients $c^{(i)}_{j \geq m}$ to be equal to zero at the end of the multiplication process.
From Table 4.14 one can also appreciate that the control logic for the MSB-SSM multiplier must force the output of the $c^*_{D(j-1)+(D-1)}$ register to be zero at the beginning of each iteration of Step 1 of Algorithm 4.7.1. The control logic must also latch the value of $c^{(i-1)}_{m-1}$ at the beginning of the loop in Step 1. Lastly, the control logic must generate the address and the control signals for the different memory elements.
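The recurrences in Table 4.14 can be sketched in software as follows. This is a minimal bit-level model in Python, not the hardware design: Algorithm 4.7.1 is not reproduced in this section, so the loop body below is inferred from the table entries, and the function name `msb_ssm_multiply` and the least-significant-first list representation of the coefficients are assumptions made for illustration.

```python
def msb_ssm_multiply(a_bits, b_bits, f_bits, m, D):
    """Bit-level model of the MSB-SSM recurrences shown in Table 4.14.

    a_bits, b_bits, f_bits hold a_0..a_{m-1}, b_0..b_{m-1}, f_0..f_{m-1}
    (least significant coefficient first). Returns c_0..c_{m-1}.
    Hypothetical helper: the loop structure is inferred from the table.
    """
    digits = -(-m // D)                      # ceil(m / D)
    n = digits * D                           # operands padded to whole digits
    a = list(a_bits) + [0] * (n - m)
    f = list(f_bits) + [0] * (n - m)
    c = [0] * n                              # accumulator starts at zero
    for i in range(m - 1, -1, -1):           # Step 1: bits of B, MSB first
        carry = 0                            # register c*_{D(j-1)+(D-1)}, forced to zero
        c_top = c[m - 1]                     # register c*_{m-1}, latched once per pass
        for j in range(digits):              # Step 1.3: one digit per clock cycle
            base = D * j
            new = []
            for k in range(D):
                prev = carry if k == 0 else c[base + k - 1]
                new.append((b_bits[i] & a[base + k]) ^ (c_top & f[base + k]) ^ prev)
            carry = c[base + D - 1]          # old top coefficient of this digit
            c[base:base + D] = new           # write back, as the DPRAM would
    return c[:m]                             # Step 2: drop coefficients c_{j >= m}
```

For example, with $F(\alpha) = \alpha^3 + \alpha + 1$, the call `msb_ssm_multiply([1, 1, 0], [0, 0, 1], [1, 1, 0], 3, 2)` models $(\alpha + 1)\alpha^2$ and returns `[1, 1, 1]`, that is, $\alpha^2 + \alpha + 1$. The padded coefficient $c_3$ accumulates unwanted data during the run and is discarded by the final slice.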
Figure 4.7: MSB-SSM multiplier
4.7.2 Complexity, Critical Path Delay, and Performance
Table 4.15 summarizes the logic complexity and the critical path delay of the MSB-SSM multiplier for arbitrary and programmable irreducible polynomial support. The fixed irreducible polynomial support case could be implemented with decoding logic in place of DPRAMs. This case is not considered here; instead, it is assumed that the programmable irreducible polynomial configuration is used to support implementations that use fixed irreducible polynomials.
In Table 4.15, $r$ represents the number of coefficients of $F(x)$ supported by the multiplier for configurations that support programmable polynomials.
Estimates are provided for implementations with logic gates, generic gates, and FPGA logic. The estimates are based on the models introduced in Appendix A.
The estimates for FPGA logic assume that the combinatorial functions are implemented using binary trees.
Table 4.15 includes the memory requirements for the different options. The memory requirements make no assumptions about the structure of the memory elements. They only specify the required number of storage bits.
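Table 4.14: MSB-SSM multiplication example for $AB \bmod (\alpha^3 + \sum_{i=0}^{2} f_i\alpha^i)$ with $D = 2$

| $i$ | $j$ | $c^{(i)}_{2j}$ | $c^{(i-1)}_{2j}$ | $c^{(i)}_{2j+1}$ | $c^{(i-1)}_{2j+1}$ | $c^*_{D(j-1)+(D-1)}$ | $c^*_{m-1}$ |
|---|---|---|---|---|---|---|---|
| 0 | 0 | $c^{(0)}_0 = b_2a_0$ | 0 | $c^{(0)}_1 = b_2a_1$ | 0 | 0 | 0 |
| 0 | 1 | $c^{(0)}_2 = b_2a_2$ | 0 | $c^{(0)}_3 =$ ® | ® | 0 | 0 |
| 1 | 0 | $c^{(1)}_0 = b_1a_0 + c^{(0)}_2f_0$ | $c^{(0)}_0$ | $c^{(1)}_1 = b_1a_1 + c^{(0)}_2f_1 + c^{(0)}_0$ | $c^{(0)}_1$ | 0 | $c^{(0)}_2$ |
| 1 | 1 | $c^{(1)}_2 = b_1a_2 + c^{(0)}_2f_2 + c^{(0)}_1$ | $c^{(0)}_2$ | $c^{(1)}_3 =$ ® | ® | $c^{(0)}_1$ | $c^{(0)}_2$ |
| 2 | 0 | $c^{(2)}_0 = b_0a_0 + c^{(1)}_2f_0$ | $c^{(1)}_0$ | $c^{(2)}_1 = b_0a_1 + c^{(1)}_2f_1 + c^{(1)}_0$ | $c^{(1)}_1$ | 0 | $c^{(1)}_2$ |
| 2 | 1 | $c^{(2)}_2 = b_0a_2 + c^{(1)}_2f_2 + c^{(1)}_1$ | $c^{(1)}_2$ | $c^{(2)}_3 =$ ® | ® | $c^{(1)}_1$ | $c^{(1)}_2$ |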
The estimates in Table 4.15 include the complexity of the shift register that holds digits of the $B$ operand. Here, it is assumed that one digit of the $B$ operand is read from the register file into a $D$-bit shift register every $\lceil m/D \rceil D$ clock cycles, following an initial load before the multiplication process begins. A bit of the loaded digit is consumed every $\lceil m/D \rceil$ clock cycles. A total of $\lceil m/D \rceil$ digits of $B$ are loaded into the shift register over the course of a multiplication.
Table 4.16 summarizes the latency and the throughput of the MSB-SSM multiplier. The latency estimates assume that the operand $A$ is first loaded into the multiplier and that the results are captured at the input of the processing units' DPRAMs as they become available.
The latency and the throughput estimates are normalized with respect to the period of a clock cycle. The clock cycle period is inversely proportional to the critical path delay of the multiplier, which is a function of the logic elements employed and of the irreducible polynomial support.
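Table 4.15: Complexity and critical path delay of MSB-SSM multiplier

| Technology | Irreducible polynomial support | Logic complexity | Storage bits | Critical path delay |
|---|---|---|---|---|
| Gates | Arbitrary | $4D$ AND + $2D$ XOR + $D$ OR + $D$ FF | $3D \cdot \lceil m/D \rceil$ | $T_A + 2T_X$ |
| | Programmable | $(3D + r)$ AND + $(D + r)$ XOR + $D$ OR + $D$ FF | $(2D + r) \cdot \lceil m/D \rceil$ | |
| Generic gates | Arbitrary | $7D$ GG + $D$ FF | $3D \cdot \lceil m/D \rceil$ | $3T_G$ |
| | Programmable | $(5D + 2r)$ GG + $D$ FF | $(2D + r) \cdot \lceil m/D \rceil$ | |
| FPGA logic | Arbitrary | $(\lceil 4/(L-1) \rceil + 1)D$ LUT + $D$ FF | $3D \cdot \lceil m/D \rceil$ | $\lceil \log_L 5 \rceil\, T_L$ |
| | Programmable | $(\lceil 4/(L-1) \rceil r + (2D - r))$ LUT + $D$ FF | $(2D + r) \cdot \lceil m/D \rceil$ | |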
Table 4.16: Performance of MSB-SSM multiplier

| Attribute | Performance |
|---|---|
| Latency (in # clocks) | $m\lceil m/D \rceil$ |
| Throughput (in # operations/# clocks) | $1/(m\lceil m/D \rceil)$ |
4.8 Least Significant Bit First Super-Serial Multiplier (LSB-SSM)
The multiplier architecture introduced in this section was developed as part of the research work documented here.
This dissertation introduces the least significant bit first super-serial multiplier (LSB-SSM) architecture. This type of multiplier computes the product of two field elements in $O(m^2/D)$ clock cycles using $O(D)$ processing units, where $D$ ($D < m$) represents the digit size.
The LSB-SSM multiplier computes the product of two field elements $A$ and $B$ according to Algorithm 4.8.1. This algorithm is a serialized version of Algorithm 4.4.1. Step 1 is the same for both algorithms. Steps 1.1 to 1.3.4 of Algorithm 4.8.1 compute in $\lceil m/D \rceil$ clock cycles what Steps 1.1 and 1.2 of Algorithm 4.4.1 compute in one clock cycle. (Each iteration of the loop in Step 1.3 of Algorithm 4.8.1 is computed in one clock cycle. All other steps require negligible processing time.)
An LSB-SSM multiplier emulates an LSB multiplier using fewer processing units. The principles behind an LSB-SSM emulation of an LSB multiplier are similar to the principles behind the MSB-SSM emulation of an MSB multiplier. Figure 4.5 shows how an MSB-SSM multiplier computes in $\lceil m/D \rceil$ clock cycles the same function that an MSB multiplier computes in one clock cycle. This figure can also be used to demonstrate how an LSB-SSM multiplier computes in $\lceil m/D \rceil$ clock cycles the same function that an LSB multiplier computes in one clock cycle.
From Figure 4.5 one can appreciate that the output of the processing unit at the extreme right ($SS_4$ for $D = 5$) is forwarded to the input of the processing unit at the extreme left ($SS_0$). The processing unit at the extreme left, in some cycles, emulates processing units that are neighbors of the processing units emulated by the processing unit at the extreme right. In Algorithm 4.8.1 this data transfer mechanism is represented by the variable $a^*_{D(j-1)+(D-1)}$.
Figure 4.5 shows that the output of the processing unit that emulates the most significant processing unit of the LSB multiplier ($S_{m-1}$) must be forwarded to the processing units that emulate processing units that perform reductions based on this output. For example, in Figure 4.5, $SS_3$ emulates $S_{m-1}$. The output of $S_{m-1}$ is needed by processing units $S_0$ and $S_6$, which are emulated by $SS_0$ and $SS_1$. In Algorithm 4.8.1 this data transfer mechanism is represented by the variable $a^*_{m-1}$.
In Algorithm 4.8.1, $A$, $C$, and $F$ must be extended to an integer number of digits ($\lceil m/D \rceil$ digits). During the execution of Algorithm 4.8.1, unwanted data may be accumulated in the coefficients $a_{j \geq m}$ of $A$ and $c_{j \geq m}$ of $C$. This is a side effect of the algorithm that is corrected in Step 2.
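Algorithm 4.8.1: LSB-SSM multiplication algorithm
Inputs: $A = \sum_{i=0}^{\lceil m/D \rceil - 1} A_i\alpha^{Di}$, where $A_i = \sum_{j=0}^{D-1} a_{Di+j}\alpha^j$ with $a_{i \geq m} = 0$.
$B = \sum_{i=0}^{m-1} b_i\alpha^i$
$C = \sum_{i=0}^{\lceil m/D \rceil - 1} C_i\alpha^{Di}$, where $C_i = \sum_{j=0}^{D-1} c_{Di+j}\alpha^j$ with $c_{i \geq m} = 0$.
$F(\alpha) = \alpha^m + \sum_{i=0}^{m-1} f_i\alpha^i$
$F = \sum_{i=0}^{\lceil m/D \rceil - 1} F_i\alpha^{Di}$, where $F_i = \sum_{j=0}^{D-1} f_{Di+j}\alpha^j$ with $f_{i \geq m} = 0$.
Output: $C = AB \bmod F(\alpha) + C$
1. for $i = 0$ to $m - 1$ do
1.1 $a^*_{D(j-1)+(D-1)} = 0$
1.2 $a^*_{m-1} = a_{m-1}$
1.3 for $j = 0$ to $\lceil m/D \rceil - 1$ do
1.3.1 $C_j = \sum_{k=0}^{D-1} c_{Dj+k}\alpha^k = b_iA_j + C_j = \sum_{k=0}^{D-1} (b_ia_{Dj+k} + c_{Dj+k})\alpha^k$
1.3.2 $A' = \sum_{k=0}^{D-1} a'_k\alpha^k = \left(\sum_{k=1}^{D-1} (a_{Dj+k-1} + a^*_{m-1}f_{Dj+k})\alpha^k\right) + (a^*_{D(j-1)+(D-1)} + a^*_{m-1}f_{Dj})$
1.3.3 $a^*_{D(j-1)+(D-1)} = a_{Dj+(D-1)}$
1.3.4 $A_j = \sum_{k=0}^{D-1} a_{Dj+k}\alpha^k = A' = \sum_{k=0}^{D-1} a'_k\alpha^k$
end for
end for
2. $C = C \bmod \alpha^m$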
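The steps of Algorithm 4.8.1 can be traced with a short software model. The following Python sketch is an illustration under stated assumptions rather than a description of the hardware: the function name `lsb_ssm_multiply` is hypothetical, coefficients are stored in lists with the least significant coefficient first, and each pass of the inner loop stands for one clock cycle of the $D$ processing units.

```python
def lsb_ssm_multiply(a_bits, b_bits, c_bits, f_bits, m, D):
    """Bit-level model of Algorithm 4.8.1: returns AB mod F + C.

    All coefficient lists are least significant first and have length m.
    Hypothetical helper written for illustration only.
    """
    digits = -(-m // D)                      # ceil(m / D)
    pad = [0] * (digits * D - m)             # extend A, C, F to whole digits
    a = list(a_bits) + pad
    c = list(c_bits) + pad
    f = list(f_bits) + pad
    for i in range(m):                       # Step 1: bits of B, LSB first
        carry = 0                            # Step 1.1: a*_{D(j-1)+(D-1)} = 0
        a_top = a[m - 1]                     # Step 1.2: a*_{m-1} = a_{m-1}
        for j in range(digits):              # Step 1.3: one digit per clock cycle
            base = D * j
            for k in range(D):               # Step 1.3.1: C_j = b_i A_j + C_j
                c[base + k] ^= b_bits[i] & a[base + k]
            new = [carry ^ (a_top & f[base])]            # Step 1.3.2, k = 0
            for k in range(1, D):                        # Step 1.3.2, k >= 1
                new.append(a[base + k - 1] ^ (a_top & f[base + k]))
            carry = a[base + D - 1]          # Step 1.3.3 (old value of a_{Dj+D-1})
            a[base:base + D] = new           # Step 1.3.4: A_j = A'
    return c[:m]                             # Step 2: C = C mod alpha^m
```

With $F(\alpha) = \alpha^3 + \alpha + 1$, $A = \alpha + 1$, $B = \alpha^2$, $C = 0$, and $D = 2$, the call `lsb_ssm_multiply([1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 1, 0], 3, 2)` returns `[1, 1, 1]`, that is, $\alpha^2 + \alpha + 1$. During the run the padded coefficients $a_3$ and $c_3$ pick up unwanted data, which is the side effect that Step 2 removes.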
4.8.1 Architecture
Figure 4.5 shows how a processing unit from an LSB-SSM multiplier emulates the functions of up to $\lceil m/D \rceil$ processing units of an LSB multiplier.
Figure 4.8 shows the architecture of an LSB-SSM multiplier's processing unit and the architecture of an LSB multiplier's processing unit. As this figure shows, to carry out the emulation, the processing units of the LSB-SSM multiplier incorporate memory elements. The memory elements are used to store the state of the emulated processing units.

Figure 4.8: Processing units of the LSB-SSM and the LSB multipliers
Figure 4.9 shows a block diagram of the LSB-SSM multiplier. This multiplier
uses two types of storage elements: DPRAMs and registers.
One DPRAM allows a processing unit $k$ to output the currently stored value of $c_{Dj+k}$ while, at the same time, storing a new value for $c_{Dj+k}$. Another DPRAM allows a processing unit $k$ to output the currently stored value of $a_{Dj+k}$ while, at the same time, storing a new value for $a_{Dj+k}$.
The register $a^*_{D(j-1)+(D-1)}$ transports the value $a_{D(j-1)+(D-1)}$ generated in iteration $j-1$ of the loop in Step 1.3 to the input of the processing unit that generates the value $a_{Dj}$ in iteration $j$ of the loop. The register $a^*_{m-1}$ transports the value of $a_{m-1}$ generated in iteration $i-1$ of the loop in Step 1 to the inputs of the processing units requiring this value in iteration $i$ of the loop.
Table 4.17 summarizes the processing steps involved in the computation of $AB \bmod (\alpha^3 + \sum_{i=0}^{2} f_i\alpha^i)$ for a digit size equal to two ($D = 2$). In the table, the variables $i$ and $j$ correspond to the loop index variables in Steps 1 and 1.3 of Algorithm 4.8.1, the variables $c^{(i-1)}_{2j}$, $c^{(i-1)}_{2j+1}$, $a^{(i)}_{2j}$, and $a^{(i)}_{2j+1}$ represent the DPRAM outputs of the two processing units, while the variables $c^{(i)}_{2j}$, $c^{(i)}_{2j+1}$, $a^{(i+1)}_{2j}$, and $a^{(i+1)}_{2j+1}$ represent DPRAM inputs. Coefficients with the $-1$ superscript represent the coefficients of $C$ at the beginning of the multiplication process. The symbol ® in the table represents a don't care value.
In each iteration of the loop in Step 1, the value of $A^{(i)} = \sum_{j=0}^{m-1} a^{(i)}_j\alpha^j$ and the value of $C^{(i-1)} = \sum_{j=0}^{m-1} c^{(i-1)}_j\alpha^j$ are used in the computation of $C^{(i)} = \sum_{j=0}^{m-1} c^{(i)}_j\alpha^j$. In addition, in each iteration of the loop, the value of $A^{(i+1)} = \sum_{j=0}^{m-1} a^{(i+1)}_j\alpha^j$ is generated. The generated $A^{(i+1)}$ is used in the next iteration of the loop in Step 1.
From Table 4.17, one can appreciate that when $m/D$ is not an integer, the coefficients $c^{(i)}_{j \geq m}$ are not necessarily zero. Therefore, to eliminate these coefficients, the result at the end of Step 1 of Algorithm 4.8.1 must be reduced modulo $\alpha^m$. This reduction can be achieved by forcing the coefficients $c^{(i)}_{j \geq m}$ to be equal to zero at the end of the multiplication process.
From Table 4.17 one can also appreciate that the control logic of the LSB-SSM multiplier must force the output of the $a^*_{D(j-1)+(D-1)}$ register to be zero at the beginning of each iteration of Step 1 of Algorithm 4.8.1. The control logic must also latch the value of $a^{(i)}_{m-1}$ at the beginning of the loop in Step 1. Lastly, the control logic must generate the address and the control signals for the different memory elements.

Figure 4.9: LSB-SSM multiplier
4.8.2 Complexity, Critical Path Delay, and Performance
Table 4.18 summarizes the logic complexity and the critical path delay of the LSB-SSM multiplier for arbitrary and programmable irreducible polynomial support. The fixed irreducible polynomial support case could be implemented with decoding logic in place of DPRAMs. This case is not considered here; instead, it is assumed that the programmable irreducible polynomial configuration is used to support implementations that use fixed irreducible polynomials.
In Table 4.18, $r$ represents the number of coefficients of $F(x)$ supported by the multiplier for configurations that support programmable polynomials.
Estimates are provided for implementations with logic gates, generic gates, and FPGA logic. The estimates are based on the models introduced in Appendix A.
The estimates for FPGA logic assume that the combinatorial functions are implemented using binary trees.
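Table 4.17: LSB-SSM multiplication example for $AB \bmod (\alpha^3 + \sum_{i=0}^{2} f_i\alpha^i) + C$ with $D = 2$

| $i$ | $j$ | $a^{(i+1)}_{2j}$ | $a^{(i)}_{2j}$ | $a^{(i+1)}_{2j+1}$ | $a^{(i)}_{2j+1}$ | $a^*_{D(j-1)+(D-1)}$ | $a^*_{m-1}$ |
|---|---|---|---|---|---|---|---|
| 0 | 0 | $a^{(1)}_0 = a^{(0)}_2f_0$ | $a^{(0)}_0$ | $a^{(1)}_1 = a^{(0)}_0 + a^{(0)}_2f_1$ | $a^{(0)}_1$ | 0 | $a^{(0)}_2$ |
| 0 | 1 | $a^{(1)}_2 = a^{(0)}_1 + a^{(0)}_2f_2$ | $a^{(0)}_2$ | $a^{(1)}_3 =$ ® | ® | $a^{(0)}_1$ | $a^{(0)}_2$ |
| 1 | 0 | $a^{(2)}_0 = a^{(1)}_2f_0$ | $a^{(1)}_0$ | $a^{(2)}_1 = a^{(1)}_0 + a^{(1)}_2f_1$ | $a^{(1)}_1$ | 0 | $a^{(1)}_2$ |
| 1 | 1 | $a^{(2)}_2 = a^{(1)}_1 + a^{(1)}_2f_2$ | $a^{(1)}_2$ | $a^{(2)}_3 =$ ® | ® | $a^{(1)}_1$ | $a^{(1)}_2$ |
| 2 | 0 | $a^{(0)}_0$ next op. | $a^{(2)}_0$ | $a^{(0)}_1$ next op. | $a^{(2)}_1$ | ® | ® |
| 2 | 1 | $a^{(0)}_2$ next op. | $a^{(2)}_2$ | $a^{(0)}_3 =$ ® | ® | ® | ® |

| $i$ | $j$ | $c^{(i)}_{2j}$ | $c^{(i-1)}_{2j}$ | $c^{(i)}_{2j+1}$ | $c^{(i-1)}_{2j+1}$ |
|---|---|---|---|---|---|
| 0 | 0 | $c^{(0)}_0 = a^{(0)}_0b_0 + c^{(-1)}_0$ | $c^{(-1)}_0$ | $c^{(0)}_1 = a^{(0)}_1b_0 + c^{(-1)}_1$ | $c^{(-1)}_1$ |
| 0 | 1 | $c^{(0)}_2 = a^{(0)}_2b_0 + c^{(-1)}_2$ | $c^{(-1)}_2$ | $c^{(0)}_3 =$ ® | ® |
| 1 | 0 | $c^{(1)}_0 = a^{(1)}_0b_1 + c^{(0)}_0$ | $c^{(0)}_0$ | $c^{(1)}_1 = a^{(1)}_1b_1 + c^{(0)}_1$ | $c^{(0)}_1$ |
| 1 | 1 | $c^{(1)}_2 = a^{(1)}_2b_1 + c^{(0)}_2$ | $c^{(0)}_2$ | $c^{(1)}_3 =$ ® | ® |
| 2 | 0 | $c^{(2)}_0 = a^{(2)}_0b_2 + c^{(1)}_0$ | $c^{(1)}_0$ | $c^{(2)}_1 = a^{(2)}_1b_2 + c^{(1)}_1$ | $c^{(1)}_1$ |
| 2 | 1 | $c^{(2)}_2 = a^{(2)}_2b_2 + c^{(1)}_2$ | $c^{(1)}_2$ | $c^{(2)}_3 =$ ® | ® |

Table 4.18 includes the memory requirements for the different options. The memory requirements make no assumptions about the structure of the memory elements; they only specify the required number of storage bits.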
The estimates in Table 4.18 include the complexity of the shift register that holds digits of the $B$ operand. Here, it is assumed that one digit of $B$ is read from the register file into a $D$-bit shift register every $\lceil m/D \rceil D$ clock cycles, following an initial load before the multiplication begins. A bit of the loaded digit is consumed every $\lceil m/D \rceil$ clock cycles. A total of $\lceil m/D \rceil$ digits of $B$ are loaded into the shift register over the course of a multiplication.
Table 4.19 summarizes the latency and the throughput of the LSB-SSM multiplier. The latency estimates assume that the operand $A$ is first loaded into the multiplier and that the results are captured at the input of the processing units' DPRAMs as they become available.
The latency and the throughput estimates are normalized with respect to the period of a clock cycle. The clock cycle period is inversely proportional to the critical path delay of the multiplier, which is a function of the logic elements employed and of the irreducible polynomial support.
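Table 4.18: Complexity and critical path delay of LSB-SSM multiplier

| Technology | Irreducible polynomial support | Logic complexity | Storage bits | Critical path delay |
|---|---|---|---|---|
| Gates | Arbitrary | $6D$ AND + $2D$ OR + $2D$ XOR + $D$ FF | $3D \cdot \lceil m/D \rceil$ | $2T_A + T_O + T_X$ |
| | Programmable | $(5D + r)$ AND + $(D + r)$ XOR + $2D$ OR + $D$ FF | $(2D + r) \cdot \lceil m/D \rceil$ | |
| Generic gates | Arbitrary | $10D$ GG + $D$ FF | $3D \cdot \lceil m/D \rceil$ | $4T_G$ |
| | Programmable | $(8D + 2r)$ GG + $D$ FF | $(2D + r) \cdot \lceil m/D \rceil$ | |
| FPGA logic | Arbitrary | $(\lceil 4/(L-1) \rceil + 2)D$ LUT + $D$ FF | $3D \cdot \lceil m/D \rceil$ | $\lceil \log_L 5 \rceil\, T_L$ |
| | Programmable | $((\lceil 4/(L-1) \rceil - 1)r + 3D)$ LUT + $D$ FF | $(2D + r) \cdot \lceil m/D \rceil$ | |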
Table 4.19: Performance of LSB-SSM multiplier

| Attribute | Performance |
|---|---|
| Latency (in # clocks) | $m\lceil m/D \rceil$ |
| Throughput (in # operations/# clocks) | $1/(m\lceil m/D \rceil)$ |
4.9 New Squaring Architecture
The squaring architecture introduced in this section was developed as part of the
research work documented here.
The squaring architecture introduced in [OP00b] is based on the observation that a square operation in $GF(2^m)$ can be transformed into a multiplication by a constant and a sum. This transformation allows the efficient computation of squares using an LSB-SSM, an LSB, or an LSD multiplier together with a circuit referred to here as a squaring adapter.
Equations (4.19) to (4.23) show the transformation of a square operation into a multiplication by a constant and a sum: $A^2 \equiv A'B' \bmod F(\alpha) + C'$. (Note that this operation is similar to the product-sums computed by LSB-SSM, LSB, and LSD multipliers.)
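$$
\begin{aligned}
A^2 &\equiv \Big(\sum_{i=0}^{m-1} a_i\alpha^i\Big)^2 \bmod F(\alpha) \qquad (4.19) \\
&\equiv \sum_{i=0}^{m-1} a_i\alpha^{2i} \bmod F(\alpha) \\
&\equiv \sum_{i=\lceil m/2 \rceil}^{m-1} a_i\alpha^{2i} \bmod F(\alpha) + \sum_{i=0}^{\lceil m/2 \rceil - 1} a_i\alpha^{2i}
\end{aligned}
$$

$$A^2 \equiv A'B' \bmod F(\alpha) + C' \qquad (4.20)$$

$$A' = \sum_{i=0}^{\lfloor m/2 \rfloor - 1} a_{i+\lceil m/2 \rceil}\alpha^{2i} \qquad (4.21)$$

$$B' \equiv \alpha^{2\lceil m/2 \rceil} \bmod F(\alpha) \qquad (4.22)$$

$$C' = \sum_{i=0}^{\lceil m/2 \rceil - 1} a_i\alpha^{2i} \qquad (4.23)$$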
The values of $A'$ and $C'$ depend on the value of $A$, while the value of $B'$ depends exclusively on the field polynomial. For irreducible polynomials $F(x) = x^m + \sum_{i=0}^{t} f_ix^i$, $B'$ can be expressed as shown in Equation (4.24).

$$
B' =
\begin{cases}
\sum_{i=0}^{t} f_i\alpha^i & \text{even } m \\
\sum_{i=0}^{t} f_i\alpha^{i+1} & \text{odd } m \text{ and } t < m - 1 \\
\sum_{i=1}^{t} (f_i + f_{i-1})\alpha^i + f_0 & \text{odd } m \text{ and } t = m - 1
\end{cases}
\qquad (4.24)
$$
Recent developments demonstrate that some forms of composite fields give rise to elliptic curves that possess cryptographic weaknesses [GHS00]. Extension fields $GF(2^m)$ with prime $m$ are recommended for elliptic curve cryptosystems. The following discussion focuses on the use of irreducible trinomials and pentanomials for which $m$ is prime and odd.
When considering the case of $m$ prime and odd, Equation (4.19) is equivalent to Equation (4.25). Equation (4.25) can be computed according to Equation (4.20) with $A'$, $B'$, and $C'$ as defined in Equations (4.26) to (4.28).
Equation (4.29) defines $B'$ for irreducible trinomials ($F(x) = x^m + x^t + 1$) and Equation (4.30) defines it for irreducible pentanomials ($F(x) = x^m + x^{t_3} + x^{t_2} + x^{t_1} + 1$). For pentanomials, $t_3$ is equal to $t$ in Equation (4.27).

$$A^2 \equiv \sum_{i=(m+1)/2}^{m-1} a_i\alpha^{2i} \bmod F(\alpha) + \sum_{i=0}^{(m-1)/2} a_i\alpha^{2i} \qquad (4.25)$$

$$A' = \sum_{i=0}^{(m-3)/2} a_{i+(m+1)/2}\,\alpha^{2i+1} \qquad (4.26)$$

$$B' \equiv \alpha^m \bmod F(\alpha) \equiv \sum_{i=0}^{t} f_i\alpha^i \qquad (4.27)$$

$$C' = \sum_{i=0}^{(m-1)/2} a_i\alpha^{2i} \qquad (4.28)$$

$$B' = \alpha^t + 1 \qquad (4.29)$$

$$B' = \alpha^{t_3} + \alpha^{t_2} + \alpha^{t_1} + 1 \qquad (4.30)$$
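Equations (4.25) to (4.28) can be checked numerically with a short software model. The sketch below is an illustration only: it represents field elements as Python integers (bit $i$ holds $a_i$), and the helper names `gf2_mulmod` and `square_via_adapter` are hypothetical. It forms $A'$, $B'$, and $C'$ for odd $m$ and evaluates $A'B' \bmod F(\alpha) + C'$.

```python
def gf2_mulmod(a, b, f, m):
    """Return a*b mod f over GF(2); f includes the x^m term and a < 2**m."""
    result = 0
    for i in range(b.bit_length()):          # process b least significant bit first
        if (b >> i) & 1:
            result ^= a                      # accumulate b_i * (a * x^i mod f)
        a <<= 1                              # multiply a by x ...
        if (a >> m) & 1:
            a ^= f                           # ... and reduce modulo f
    return result


def square_via_adapter(a, f, m):
    """Compute a^2 mod f as A'B' mod F + C' (Equations 4.25 to 4.28), m odd."""
    assert m % 2 == 1, "this sketch covers the odd-m case only"
    half = (m + 1) // 2
    a_prime = 0
    for i in range((m - 3) // 2 + 1):        # Equation (4.26)
        a_prime |= ((a >> (i + half)) & 1) << (2 * i + 1)
    c_prime = 0
    for i in range((m - 1) // 2 + 1):        # Equation (4.28)
        c_prime |= ((a >> i) & 1) << (2 * i)
    b_prime = f ^ (1 << m)                   # Equation (4.27): alpha^m mod F
    return gf2_mulmod(a_prime, b_prime, f, m) ^ c_prime
```

For $F(\alpha) = \alpha^3 + \alpha + 1$ and $A = \alpha + 1$, the call `square_via_adapter(0b011, 0b1011, 3)` returns `0b101`, which matches $A^2 = \alpha^2 + 1$. Because $B'$ has only a few low-degree terms for trinomials and pentanomials, the loop in `gf2_mulmod` runs for about $t + 1$ iterations in this use, which mirrors the argument that the product can be computed in time proportional to $t$.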
The attractiveness of the architecture presented here lies in the efficient computation of the multiplication $A'B' \bmod F(\alpha)$ and in the fact that the processor already requires a multiplier. To compute elliptic curve point multiplications, a processor needs to compute both multiplications and squares. Therefore, the squaring architecture presented here is a good match for the computation of elliptic curve point multiplications.
The following sections demonstrate that the product $A'B' \bmod F(\alpha)$ can be computed in time proportional to $t$ using LSB-SSM, LSB, or LSD multipliers. They also demonstrate that squares can be computed efficiently using this squaring architecture for the majority of the irreducible polynomials specified by the following standards: [ANS98], [ANS99], and [IEE98]. For these standards, $m \gg t$ for the majority of the specified irreducible polynomials in the range $m = 163 \ldots 997$ for which $m$ is prime.
4.9.1 Architecture
The new squaring architecture can be used with LSB-SSM, LSB, or LSD multipliers. Figure 4.10 shows a block diagram of the new squaring architecture that is suitable for use with LSB and LSD multipliers. Figure 4.11 shows a block diagram of the new squaring architecture that is suitable for use with LSB-SSM multipliers that use even digit sizes.

Figure 4.10: New squaring architecture using LSB or LSD multipliers
In the new squaring architecture, the multiplier is used to compute products and
sums, and the squaring adapter is used to format and propagate operands to the
multiplier.
The LSB-SSM, LSB, and LSD multipliers compute product-sums of the form $AB \bmod F(\alpha) + C$. Here, these multipliers are used to compute a sum $X + Y$ in two steps. In the first step, $X$ is loaded into the multiplier's accumulator by computing the product $X \cdot 1$ ($C = 0$ in this step). In the second step, $Y$ is added to the accumulated value by computing the product-sum $Y \cdot 1 + X$ ($C = X$ in this step). Note that if the operand $X$ is already in the accumulator, the first step can be skipped.

Figure 4.11: New squaring architecture using LSB-SSM multipliers
LSB multipliers compute multiplications according to Algorithm 4.4.1. When computing squares, $B' = \sum_{i=0}^{m-1} b'_i\alpha^i$ can be used as the multiplier operand ($B$ in the algorithm). Because the most significant coefficients of $B'$ are zero, the multiplication process can stop after the most significant nonzero coefficient, $b'_t$, is processed. Therefore, when using an LSB multiplier, a square can be computed in $t + 2$ clock cycles: $t + 1$ clock cycles are used to compute the product, and an additional clock cycle is used to compute the sum.
LSB-SSM multipliers compute multiplications according to Algorithm 4.8.1. When computing squares, $B' = \sum_{i=0}^{m-1} b'_i\alpha^i$ can be used as the multiplier operand ($B$ in the algorithm). Because the most significant coefficients of $B'$ are zero, the multiplication process can stop after the most significant nonzero coefficient, $b'_t$, is processed. Therefore, when using an LSB-SSM multiplier, a square can be computed in $(t + 2)\lceil m/D \rceil$ clock cycles: $(t + 1)\lceil m/D \rceil$ clock cycles are used to compute the product and $\lceil m/D \rceil$ clock cycles are used to compute the sum.
LSD multipliers compute multiplications according to Algorithm 4.6.1. When computing squares, $B' = \sum_{i=0}^{\lceil m/D \rceil - 1} B'_i\alpha^{Di}$ can be used as the multiplier operand. The most significant nonzero digit of $B'$ is $B'_{\lceil (t+1)/D \rceil - 1}$. Consequently, the product $A'B' \bmod F(\alpha)$ can be computed in $\lceil (t + 1)/D \rceil$ clock cycles and the sum can be computed in one clock cycle. In summary, when using an LSD multiplier, a square can be computed in $\lceil (t + 1)/D \rceil + 1$ clock cycles.
Table 4.20 summarizes the number of clock cycles required to compute a square operation using the squaring architecture described here with different types of multipliers. In the table, $t$ represents the degree of the most significant nonzero term of $B'$, which corresponds to the degree of the second most significant term of $F(x)$.
Table 4.20: Number of clock cycles required to compute a square operation when using different types of multipliers

| Multiplier type | # clocks / square |
|---|---|
| LSB-SSM | $(t + 2)\lceil m/D \rceil$ |
| LSB | $t + 2$ |
| LSD | $\lceil (t + 1)/D \rceil + 1$ |
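The entries of Table 4.20 are simple enough to evaluate directly. The following Python sketch does so and compares the LSB-SSM squaring time with the LSB-SSM multiplication latency $m\lceil m/D \rceil$ from Table 4.19. The function names and the example parameters ($m = 163$, $t = 7$, $D = 16$) are assumptions chosen for illustration; they are not taken from the text.

```python
def ceil_div(x, y):
    """Integer ceiling of x / y for positive integers."""
    return -(-x // y)


def squaring_clocks(kind, m, t, D=1):
    """Clock cycles per square according to Table 4.20."""
    if kind == "LSB-SSM":
        return (t + 2) * ceil_div(m, D)
    if kind == "LSB":
        return t + 2
    if kind == "LSD":
        return ceil_div(t + 1, D) + 1
    raise ValueError("unknown multiplier type: " + kind)


def lsb_ssm_multiplication_clocks(m, D):
    """Multiplication latency of the LSB-SSM multiplier (Table 4.19)."""
    return m * ceil_div(m, D)


# Illustrative parameters (assumed, not from the text).
m, t, D = 163, 7, 16
t_sq = squaring_clocks("LSB-SSM", m, t, D)
t_mul = lsb_ssm_multiplication_clocks(m, D)
print(t_sq, t_mul, t_sq / t_mul)
```

For the LSB-SSM multiplier the factor $\lceil m/D \rceil$ cancels in the ratio, so the squaring to multiplication ratio reduces to $(t + 2)/m$ and does not depend on the digit size.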
4.9.2 Complexity, Critical Path Delay, and Performance
From Figure 4.10, one can appreciate that the squaring adapter for LSB and LSD multipliers is composed of three multiplexers. For LSB and LSD multipliers, MUX2 and MUX3 are $m$ bits wide.
For LSB multipliers, MUX1 is one bit wide, and for LSD multipliers it is $D$ bits wide. Note that MUX1 is used to multiplex a one (1) when performing additions. These additions consume one bit of the $B$ operand for LSB multipliers and $D$ bits for LSD multipliers.
Figure 4.11 shows an example of a possible implementation of the new squaring architecture that uses an LSB-SSM multiplier with an even digit size.
MUX3 of the squaring adapter for the LSB-SSM multiplier chooses inputs from the least significant half of the input digit or from the most significant half. Note that a single digit is used in two consecutive cycles because only half of its bits are consumed in a single cycle (every other input is zero for $A'$ and $C'$). MUX2 chooses an input digit from $A$ or the zeros needed to generate $A'$ or $C'$. MUX4 chooses the coefficients required to generate a digit from $A$, $A'$, or $C'$. (MUX4 is composed of $D$ 2:1 multiplexers whose select signals are independently controlled to make possible the generation of the digits of $A$, $A'$, and $C'$.)
MUX2 of the squaring adapter for the LSB-SSM multiplier is composed of $D$ 2:1 multiplexers that choose between an input and zero. MUX2 can be realized with $D$ AND gates. MUX3 is composed of $D/2$ 2:1 multiplexers, MUX4 is composed of $D$ 2:1 multiplexers, and MUX1 is composed of one 2:1 multiplexer.
Tables 4.21 and 4.22 summarize the logic complexities and the critical path delays of the squaring adapters to be used with LSB, LSD, and LSB-SSM multipliers. Estimates are provided for implementations with logic gates, generic gates, and FPGA logic. The estimates are based on the models introduced in Appendix A.

Table 4.21: Complexity and critical path delay of the squaring adapter to be used with LSB and LSD multipliers

| Technology | Complexity | Critical path delay |
|---|---|---|
| Gates | $3m$ AND + $m$ OR | $2T_A + T_O$ |
| Generic gates | $4m$ GG | $3T_G$ |
| FPGA logic | $2m$ LUT | $2T_L$ |

Table 4.22: Complexity and critical path delay of the squaring adapter to be used with LSB-SSM multipliers with even digit sizes

| Technology | Complexity | Critical path delay |
|---|---|---|
| Gates | $4D$ AND + $1.5D$ OR | $2T_A + 2T_O$ |
| Generic gates | $5.5D$ GG | $4T_G$ |
| FPGA logic | $2.5D$ LUT | $2T_L$ |

Squaring Time for Cryptosystems
Tables 4.23 and 4.24 summarize the squaring to multiplication processing time ratio, $T_{sq}/T_{mul}$, for the field polynomials recommended by the standards [ANS98, ANS99] and [IEE98], respectively. (The standard [FIP00] specifies a subset of the irreducible polynomials recommended in [IEE98].) The tables assume the use of LSB or LSB-SSM multipliers.
The interpretation of the tables is as follows. The column "$T_{sq}/T_{mul}$" specifies ratios of squaring time to multiplication time in terms of clock cycles; that is, the squaring time divided by the multiplication time. This is a measure of how fast squaring is compared to multiplication; for example, the range (0, 0.10) indicates that squaring is more than ten times faster than multiplication. The "Distribution (%)" column specifies the percentage of irreducible polynomials in the range $m = 163 \ldots 997$ with prime $m$ that satisfy a given squaring to multiplication ratio; for example, the first entry in Table 4.23 specifies that for 28% of the irreducible polynomials of interest, the squaring to multiplication ratio is less than 0.1. Finally, the "Cumulative distribution (%)" column specifies the cumulative distribution; for example, the second entry in Table 4.23 specifies that for 49% of the irreducible polynomials of interest, the squaring to multiplication ratio is less than 0.2.
Table 4.23: Distribution of squaring to multiplication processing time ratios for the $GF(2^m)$ field polynomials specified in [ANS98, ANS99] with prime $m$ in the range $163 \ldots 997$

| $T_{sq}/T_{mul}$ | Distribution (%) | Cumulative distribution (%) |
|---|---|---|
| (0.00, 0.10) | 28 | 28 |
| [0.10, 0.20) | 21 | 49 |
| [0.20, 0.30) | 18 | 67 |
| [0.30, 0.50) | 18 | 85 |
| [0.50, 1.00) | 15 | 100 |

Table 4.24: Distribution of squaring to multiplication processing time ratios for the $GF(2^m)$ field polynomials specified in [IEE98] with prime $m$ in the range $163 \ldots 997$ when using LSB or LSB-SSM multipliers

| $T_{sq}/T_{mul}$ | Distribution (%) | Cumulative distribution (%) |
|---|---|---|
| (0.00, 0.10) | 72 | 72 |
| [0.10, 0.20) | 11 | 83 |
| [0.20, 0.30) | 11 | 94 |
| [0.30, 0.50) | 6 | 100 |
| [0.50, 1.00) | 0 | 100 |
4.10 Parallel Squarers with Fixed Irreducible
Polynomial Support
Parallel squarers that support fixed irreducible polynomials are attractive for FPGA implementations. Synchronous versions of these squarers compute squares in one clock cycle.
FPGAs allow, through reconfiguration, the instantiation of different parallel squarers for different fields.
As specified in the previous section, public-key standards such as [ANS98, ANS99, IEE98, FIP00] recommend the use of trinomials and pentanomials as irreducible polynomials. For these irreducible polynomials, the gate complexity of parallel squarers can be considered to be linear, requiring fewer than $4m$ $GF(2)$ adders. One can verify from Equation (4.19) that the degree of $A^2$ is $2m - 2$ before reduction. Therefore, the serial reduction of $A^2$ requires $m - 1$ additions of the least significant $r$ terms of the irreducible polynomial $F(x)$.
The complexity of parallel squarers based on fixed trinomials was studied in [Wu99, PFSR99]. The time complexity of these squarers was studied in [Wu99]. According to [Wu99], a parallel squarer based on a fixed trinomial can be realized with at most $(m + t - 1)/2$ XOR gates. Such a realization exhibits a critical path delay of, at most, two gate delays. (Note that 47% of the irreducible polynomials specified in [IEE98, ANS98, ANS99] with prime $m$ in the range $163 \ldots 997$ are trinomials.)
The complexity and the critical path delay of a parallel squarer that supports a fixed pentanomial are functions of the irreducible polynomial. Using the expressions in Equations (4.25) to (4.30), the square operation when using a fixed pentanomial with odd degree $m$ and $t_3 < m - 1$ can be expressed as shown in Equation (4.31). The coefficients of $C'$ and $A'$ in Equation (4.31) do not overlap, so the addition of these terms requires no logic. When $t_1$, $t_2$, and $t_3$ have relatively low values, one would expect to add three shifted versions of $A'$ together with a few reduction terms. For this case, one would expect a parallel squarer to require approximately $1.5m$ XOR gates (note that the number of potentially nonzero coefficients in $A'$ is $(m - 1)/2$). As the degrees of the nonzero terms of $B'$ increase, the number of reduction terms to be added is likely to increase, and accordingly the complexity of the parallel squarer is likely to increase with respect to a squarer whose nonzero terms have low degree.

$$
\begin{aligned}
A^2 &\equiv A'B' \bmod F(\alpha) + C' \qquad (4.31) \\
&\equiv C' + A' + A'\alpha^{t_1} \bmod F(\alpha) + A'\alpha^{t_2} \bmod F(\alpha) + A'\alpha^{t_3} \bmod F(\alpha)
\end{aligned}
$$
The standards [IEE98] and [ANS98, ANS99] use different criteria for the selection of pentanomials. It will be demonstrated later that the approximation of $1.5m$ XOR gates is very good for parallel squarers based on the fixed pentanomials specified in [IEE98] of prime degree $m$ in the range $163 \ldots 997$. The average complexity for squarers with the same properties but using the pentanomials specified in [ANS98, ANS99] is somewhat higher, approximately $2m$ XOR gates.
The complexities and the critical path delays of parallel squarers implemented with gates, generic gates, and FPGA logic were determined exhaustively using a program. This program computed the complexities and the critical path delays of parallel squarers that support trinomials and pentanomials of prime degree in the range $163 \ldots 997$. The program assumes that each bit of the result is computed using a binary tree. The program also assumes that each binary tree is independent and that no logic is shared between binary trees. The results are summarized in Tables 4.25 to 4.27.
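The program itself is not listed here, so the following Python sketch only illustrates the kind of computation described above under the stated model (one independent XOR tree per result bit, no sharing between trees). The names `reduce_power` and `squarer_stats`, the integer bit-mask representation, and the restriction to two-input XOR gates are assumptions made for illustration.

```python
def reduce_power(e, m, f_low):
    """Return x^e mod F as a bit mask, where F(x) = x^m + f_low(x)."""
    p = 1 << e
    while p.bit_length() > m:
        d = p.bit_length() - 1               # degree of the leading term
        p ^= (1 << d) ^ (f_low << (d - m))   # replace x^d by x^(d-m) * f_low
    return p


def squarer_stats(m, f_low):
    """Count two-input XOR gates and XOR-tree depth of a parallel squarer.

    Each input a_i contributes to every result bit that appears in
    x^(2i) mod F. A result bit with n contributing inputs needs n - 1
    two-input XOR gates arranged in a tree of depth ceil(log2(n)).
    """
    fan_in = [0] * m
    for i in range(m):
        r = reduce_power(2 * i, m, f_low)
        for k in range(m):
            fan_in[k] += (r >> k) & 1
    xors = sum(n - 1 for n in fan_in if n > 0)
    depth = max((n - 1).bit_length() for n in fan_in if n > 0)
    return xors, depth
```

Dividing the XOR count by $m$ gives a normalized complexity in the sense used by the tables. For the small example $F(x) = x^3 + x + 1$, `squarer_stats(3, 0b011)` returns `(1, 1)`: only the result bit for $\alpha^2$ combines two inputs ($a_1$ and $a_2$).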
In Tables 4.25 to 4.27, LUT3, LUT4, and LUT5 refer, respectively, to FPGA logic that uses lookup tables with three, four, and five inputs. The logic complexity and the critical path delay results for gate implementations apply to implementations using gates and implementations using generic gates (all the gates in the binary trees are of the same type).
The logic complexity results in Tables 4.25 to 4.27 are normalized with respect to
m. The complexities and the critical path delays of the different implementations are
summarized in terms of the maximum value, minimum value, mean value, variance
and standard deviation.
The results in Table 4.25 show that implementations with gates, generic gates,
and FPGA logic consume approximately 0.5m logic elements regardless of their
type. The critical path delay of a gate implementation is, at most, two gate delays
and for FPGA logic is equivalent to the propagation delay of a LUT. As expressed
before, the results in Table 4.25 assume that each bit of the result is computed with
independent binary trees that do not share logic among themselves. Note that this
model is not the same model used in [Wu99] that exploits logic sharing. Finally,
note that the standards [IEE98, ANS98, ANS99] specify the same trinomials.
The results in Table 4.26 show that the gate complexity is approximately 1.5m
gates for parallel squarers based on the fixed pentanomials specified in [IEE98]. The
critical path delays for these implementations correspond to, at most, three gate
delays. FPGA implementations require approximately m LUTs and their critical
path delays are, at most, two LUT delays.
The results in Table 4.27 show that the gate complexity is approximately 2m
gates for parallel squarers based on the fixed pentanomials specified in [ANS98,
ANS99]. The critical path delays for these implementations are, at most, four gate
delays. FPGA implementations require, on average, slightly more than m LUTs
and their critical path delays are, at most, three LUT delays.
Table 4.25: Complexity and critical path delay statistics for parallel squarers that
support the fixed trinomials specified in [IEE98, ANS98, ANS99] of prime degree in
the range 163 . . . 997
                Normalized complexity                Critical path delay
Statistic       Gen. gates/  LUT3    LUT4    LUT5    Gen. gates/  LUT3    LUT4    LUT5
                XOR                                  XOR (TG/TX)  (TL)    (TL)    (TL)
Max. value      0.74         0.69    0.69    0.69    2.00         1.00    1.00    1.00
Mean value      0.57         0.54    0.54    0.54    1.48         1.00    1.00    1.00
Min. value      0.50         0.50    0.50    0.50    1.00         1.00    1.00    1.00
Variance        0.0032       0.0031  0.0031  0.0031  0.2536       0.0000  0.0000  0.0000
Standard dev.   0.06         0.06    0.06    0.06    0.50         0.00    0.00    0.00
Table 4.26: Complexity and critical path delay statistics for parallel squarers that
support the fixed pentanomials specified in [IEE98] of prime degree in the range
163 . . . 997
                Normalized complexity                Critical path delay
Statistic       Gen. gates/  LUT3    LUT4    LUT5    Gen. gates/  LUT3    LUT4    LUT5
                XOR                                  XOR (TG/TX)  (TL)    (TL)    (TL)
Max. value      1.57         1.04    1.01    1.01    3.00         2.00    2.00    2.00
Mean value      1.53         1.01    0.94    0.94    2.67         1.97    1.67    1.29
Min. value      1.50         1.00    0.51    0.51    2.00         1.00    1.00    1.00
Variance        0.0002       0.0001  0.0266  0.0267  0.2238       0.0282  0.2238  0.2070
Standard dev.   0.015        0.01    0.16    0.16    0.47         0.17    0.47    0.46
Table 4.27: Complexity and critical path delay statistics for parallel squarers that
support the fixed pentanomials specified in [ANS98, ANS99] of prime degree in the
range 163 . . . 997
                Normalized complexity                Critical path delay
Statistic       Gen. gates/  LUT3    LUT4    LUT5    Gen. gates/  LUT3    LUT4    LUT5
                XOR                                  XOR (TG/TX)  (TL)    (TL)    (TL)
Max. value      4.47         2.48    1.88    1.51    4.00         3.00    2.00    2.00
Mean value      2.09         1.30    1.07    1.02    2.61         2.00    1.59    1.26
Min. value      1.51         1.00    1.00    0.75    2.00         1.00    1.00    1.00
Variance        0.3606       0.0937  0.0279  0.0075  0.2983       0.0290  0.2462  0.1938
Standard dev.   0.60         0.31    0.17    0.09    0.55         0.17    0.50    0.44
4.11 Zero Test
The zero test circuit determines if a result is zero. This circuit incorporates an OR
binary tree that samples the bits of the input operand. The output of the circuit
is set to zero when all the bits of the input operand are set to zero, otherwise the
output is set to one.
Here, it is assumed that arithmetic units based on bit-serial or digit-serial
multipliers use parallel zero test circuits, that is, zero test circuits that test
all the input bits in parallel.
For arithmetic units based on super-serial multipliers, this work considers the
use of digit-serial zero test circuits. These circuits test D bits of the input operands
per clock cycle. As soon as a nonzero digit is detected, the result of the test circuit
is set to one. The circuit remains in this state until it is reset. The test of an m-bit
input consumes ⌈m/D⌉ clock cycles.
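The behavior of the digit-serial zero test can be modeled in a few lines of Python. The sketch below is a behavioral illustration only, not a hardware description; the function name digit_serial_zero_test is hypothetical, and the model assumes that the flag is reset before the test and that digits are presented least significant digit first.

```python
def digit_serial_zero_test(value, m, D):
    """Model a digit-serial zero test of an m-bit value, D bits per clock cycle.

    Returns (flag, cycles): flag is 0 when every bit of value is zero and 1
    otherwise; cycles is the number of clock cycles consumed, ceil(m / D).
    """
    flag = 0                          # flip-flop state after reset
    cycles = -(-m // D)               # ceil(m / D)
    mask = (1 << D) - 1
    for c in range(cycles):
        digit = (value >> (c * D)) & mask
        flag |= int(digit != 0)       # OR tree over the digit bits and the stored flag
    return flag, cycles


# Example values (assumptions): m = 163, D = 16.
all_zero = digit_serial_zero_test(0, 163, 16)          # (0, 11)
top_bit = digit_serial_zero_test(1 << 162, 163, 16)    # (1, 11)
```

Once the flag is set it stays set for the remaining cycles, which mirrors the sticky behavior of the circuit until it is reset.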
Figure 4.12 shows block diagrams of zero test circuits. Table 4.28 summarizes the
complexity and the critical path delay of parallel and digit-serial zero test circuits.
[Block diagrams: a) Parallel zero test circuit (input operand, OR binary tree, zero
output); b) Digit-serial zero test circuit (digit of the input operand, OR binary
tree, FF, zero output)]
Figure 4.12: Zero test circuits
Table 4.28: Complexity and critical path delay of zero test circuits
Parallel
Technology      Complexity               Critical path delay
Gates           m OR                     ⌈log_2 m⌉ TO
Generic gates   m GG                     ⌈log_2 m⌉ TG
FPGA logic      ⌈m/(L−1)⌉ LUT            ⌈log_L m⌉ TL
Digit-serial
Technology      Complexity               Critical path delay
Gates           D OR + 1 FF              ⌈log_2 (D + 1)⌉ TO
Generic gates   D GG + 1 FF              ⌈log_2 (D + 1)⌉ TG
FPGA logic      ⌈D/(L−1)⌉ LUT + 1 FF     ⌈log_L (D + 1)⌉ TL
4.12 Register File
The register file is the set of registers used by an elliptic curve processor to store
temporary values, system constants, on-line precomputed values, and off-line pre-
computed values.
As indicated in Section 2.8, some point multiplication algorithms accelerate point
multiplications by precomputing frequently used values before a point multiplica-
tion begins. Some algorithms rely on on-line precomputation and others on off-line
precomputation. On the elliptic curve processor architectures presented here, pre-
computed values are stored in the register file.
The register file is also used to store temporary results and long-term constants.
Temporary results are generated, for instance, by the point multiplication primitives:
point add, point subtract, and point double primitives. Long-term constants include
cryptosystems’ parameters such as elliptic curve parameters and generator points.
The register file is intended to contain a large number of registers. At a particular
time, only one register will be accessed. An efficient way to implement a register
file, particularly for modern FPGA devices, is by using fast memory blocks. For the
remainder of this document, it will be assumed that the register file is implemented
using memory blocks.
The size of the memory needed to implement a register file is influenced by the
width of the data items that need to be stored and by the number of registers that
need to be implemented.
For arithmetic units based on bit-serial or digit-serial multipliers, the register
file is assumed to be m bits wide. For these multipliers, each operand stored
in a register file is assumed to require m bits of storage.
For arithmetic units based on super-serial multipliers, the register file is
assumed to be D bits wide. For these multipliers, each operand stored in a
register file is assumed to require D⌈m/D⌉ bits of storage (each operand is assumed
to be represented with ⌈m/D⌉ digits, each of which is D bits wide).
This work does not suggest an absolute number of registers for the register file
because different configurations require different numbers of registers and the cost
of building register files varies from one hardware platform to another. Instead, this
work identifies the number of registers in a register file with the parameter h, which
is a design parameter under the control of the designer.
As an example, algorithms that use on-line precomputation and Jacobian coordinates
could require the storage of up to 32 points (2^w with w = 5), where each point
is represented by three coordinates, for a total of 96 registers. For point
multiplications, implementations also require approximately 16 registers to store
temporary values and constants, which brings the total to about 112 registers. A
register file containing 128 registers (h = 128) is adequate for this example.
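The arithmetic behind this example, together with the storage expressions hm and hD⌈m/D⌉ used for the two register file organizations, can be checked with the short Python sketch below. The function names are hypothetical, and the field size m = 163 and digit size D = 16 are example values chosen only for illustration.

```python
def registers_needed(w, coords_per_point, extra_registers):
    """Registers for 2**w precomputed points plus temporaries and constants."""
    return (1 << w) * coords_per_point + extra_registers


def storage_bits(h, m, D=None):
    """Register file size in bits: h*m, or h*D*ceil(m/D) for a D-bit-wide file."""
    if D is None:
        return h * m
    return h * D * -(-m // D)


needed = registers_needed(w=5, coords_per_point=3, extra_registers=16)  # 112
wide_file = storage_bits(128, 163)          # m-bit-wide file:  128 * 163
narrow_file = storage_bits(128, 163, D=16)  # D-bit-wide file:  128 * 16 * 11
```

With these example values the D-bit-wide organization stores slightly more bits than the m-bit-wide one because each operand is padded to a whole number of digits.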
Table 4.29 summarizes the complexity and the critical path delay of the register
file. These estimates are based on the models introduced in Appendix A. Note that
the models assume ideal memory elements.
Table 4.29: Complexity and critical path delay of register file
Technology              Complexity in storage bits for              Critical path
                        arithmetic units based on                   delay
                        Bit-serial/           Super-serial
                        digit-serial          multipliers
                        multipliers
Gates, generic gates,   hm                    hD⌈m/D⌉               0
and FPGA logic
4.13 GF(2^m) Arithmetic Unit Complexity and Performance
This section summarizes the complexity and the performance of arithmetic units
based on the multipliers and the squaring architectures described in the previous
sections. The architectures under consideration are illustrated in Figure 4.13.
[Block diagrams: a) Architecture 1, b) Architecture 2, c) Architecture 3,
d) Architecture 4, e) Architecture 5. Every arithmetic unit contains a register
file, a multiplier, a zero test circuit, and one or two multiplexers between its
input and output; depending on the architecture, a unit also contains a squarer,
a squaring adapter, an adder, or both a squarer and an adder.]
Figure 4.13: GF(2^m) arithmetic units
Architectures 1 to 3 are analyzed for use with least significant bit/digit first
multipliers. These types of multipliers can be used to efficiently compute additions,
and they can be used with the new squaring architecture introduced in Section 4.9.
Architectures 4 and 5 are analyzed for use with most significant bit/digit first
multipliers. These multipliers cannot be used to perform additions and cannot be
used with the new squaring architecture introduced in Section 4.9.
Architectures 1 and 2 are analyzed for use with LSB-SSM multipliers, and ar-
chitecture 4 is analyzed for use with MSB-SSM multipliers. No other architecture
is analyzed for use with super-serial multipliers. These are the only architectures
that exhibit logic complexities that are a function of the digit size D and memory
requirements that are a function of m.
For arithmetic units based on super-serial multipliers, the logic complexities and
the critical path delays are analyzed for arbitrary and programmable irreducible
polynomial support.
For arithmetic units based on bit-serial multipliers, the logic complexities and
the critical path delays are analyzed for arbitrary, programmable, and fixed
irreducible polynomial support.
For arithmetic units based on digit-serial multipliers, the logic complexities and
the critical path delays are analyzed for programmable and fixed irreducible
polynomial support, where the programmable polynomials are assumed to be optimal
polynomials according to the definition given in Section 4.5.
4.13.1 Complexity
Tables 4.30 to 4.42 summarize the logic complexities of the GF(2^m) arithmetic units
considered here. The logic complexities of all the architectures are analyzed with
respect to implementations with gates, generic gates, and FPGA logic.
The complexity and performance numbers assume the incorporation of registers
along the dotted lines shown in Figure 4.13. These registers isolate the critical path
delays of the different components of an arithmetic unit. The critical path delays
of the arithmetic units are assumed to be dominated by the critical path delays of
their multipliers, which are the most complex components of the arithmetic units.
Tables 4.43 and 4.44 summarize the logic complexities of the different GF(2^m)
arithmetic units that incorporate squaring circuitry, provide programmable
irreducible polynomial support, and exhibit m >> D >> r. In Tables 4.43 and
4.44, the symbol SB represents storage bits.
Table 4.30: Complexity of architecture 4 for arithmetic units based on MSB-SSM
multipliers
Technology     Irreducible    Logic                                    Storage bits
               polynomial
               support
Gates          Arbitrary      8D AND + 4D OR + 3D XOR                  D(h + 3)⌈m/D⌉
                              + 6D FF
               Programmable   (7D + r) AND + 4D OR                     (hD + 2D + r)⌈m/D⌉
                              + (2D + r) XOR + 6D FF
Generic gates  Arbitrary      15D GG + 6D FF                           D(h + 3)⌈m/D⌉
               Programmable   (13D + 2r) GG + 6D FF                    (hD + 2D + r)⌈m/D⌉
FPGA logic     Arbitrary      (D⌈4/(L−1)⌉ + ⌈D/(L−1)⌉ + 4D)            D(h + 3)⌈m/D⌉
                              LUT + 6D FF
               Programmable   (r⌈4/(L−1)⌉ + ⌈D/(L−1)⌉                  (hD + 2D + r)⌈m/D⌉
                              + 5D − r) LUT + 6D FF
Table 4.31: Complexity of architecture 1 for arithmetic units based on LSB-SSM
multipliers
Technology     Irreducible    Logic                                    Storage bits
               polynomial
               support
Gates          Arbitrary      8D AND + 4D OR + 2D XOR                  D(h + 3)⌈m/D⌉
                              + 4D FF
               Programmable   (7D + r) AND + 4D OR                     (hD + 2D + r)⌈m/D⌉
                              + (D + r) XOR + 4D FF
Generic gates  Arbitrary      14D GG + 4D FF                           D(h + 3)⌈m/D⌉
               Programmable   (12D + 2r) GG + 4D FF                    (hD + 2D + r)⌈m/D⌉
FPGA logic     Arbitrary      (D⌈4/(L−1)⌉ + ⌈D/(L−1)⌉ + 3D)            D(h + 3)⌈m/D⌉
                              LUT + 4D FF
               Programmable   (r⌈4/(L−1)⌉ + ⌈D/(L−1)⌉                  (hD + 2D + r)⌈m/D⌉
                              + 4D − r) LUT + 4D FF
Table 4.32: Complexity of architecture 2 for arithmetic units based on LSB-SSM
multipliers
Technology     Irreducible    Logic                                    Storage bits
               polynomial
               support
Gates          Arbitrary      12D AND + 5.5D OR + 2D XOR               D(h + 3)⌈m/D⌉
                              + 5D FF
               Programmable   (11D + r) AND + 5.5D OR                  (hD + 2D + r)⌈m/D⌉
                              + (D + r) XOR + 5D FF
Generic gates  Arbitrary      19.5D GG + 5D FF                         D(h + 3)⌈m/D⌉
               Programmable   (17.5D + 2r) GG + 5D FF                  (hD + 2D + r)⌈m/D⌉
FPGA logic     Arbitrary      (D⌈4/(L−1)⌉ + ⌈D/(L−1)⌉ + 5.5D)          D(h + 3)⌈m/D⌉
                              LUT + 5D FF
               Programmable   (r⌈4/(L−1)⌉ + ⌈D/(L−1)⌉                  (hD + 2D + r)⌈m/D⌉
                              + 6.5D − r) LUT + 5D FF
Table 4.33: Complexity of architecture 4 for arithmetic units based on MSB
multipliers
Technology     Irreducible    Logic
               polynomial
               support
Gates          Arbitrary      8m AND + 4m OR + 3m XOR + 9m FF
               Programmable   (7m + r) AND + 4m OR + (2m + r) XOR + (8m + r) FF
               Fixed          7m AND + 4m OR + (2m + r) XOR + 8m FF
Generic gates  Arbitrary      15m GG + 9m FF
               Programmable   (13m + 2r) GG + (8m + r) FF
               Fixed          (13m + r) GG + 8m FF
FPGA logic     Arbitrary      (m⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 4m) LUT + 9m FF
               Programmable   (r⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 5m − r) LUT + (8m + r) FF
               Fixed          (r⌈3/(L−1)⌉ + ⌈m/(L−1)⌉ + 5m − r) LUT + 8m FF
Storage bits (all rows): hm
Table 4.34: Complexity of architecture 5 for arithmetic units based on MSB
multipliers
Technology     Irreducible    Logic
               polynomial
               support
Gates          Arbitrary      10m AND + 5m OR + 5m XOR + 10m FF
               Programmable   (9m + r) AND + 5m OR + (4m + r) XOR + (9m + r) FF
               Fixed          9m AND + 5m OR + (4m + r) XOR + 9m FF
Generic gates  Arbitrary      20m GG + 10m FF
               Programmable   (18m + 2r) GG + (9m + r) FF
               Fixed          (18m + r) GG + 9m FF
FPGA logic     Arbitrary      (m⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 6.3m) LUT + 10m FF
               Programmable   (r⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 7.3m − r) LUT + (9m + r) FF
               Fixed          (r⌈3/(L−1)⌉ + ⌈m/(L−1)⌉ + 7.3m − r) LUT + 9m FF
Storage bits (all rows): hm
Table 4.35: Complexity of architecture 1 for arithmetic units based on LSB
multipliers
Technology     Irreducible    Logic
               polynomial
               support
Gates          Arbitrary      8m AND + 4m OR + 2m XOR + 7m FF
               Programmable   (7m + r) AND + 4m OR + (m + r) XOR + (6m + r) FF
               Fixed          7m AND + 4m OR + (m + r) XOR + 6m FF
Generic gates  Arbitrary      14m GG + 7m FF
               Programmable   (12m + 2r) GG + (6m + r) FF
               Fixed          (12m + r) GG + 6m FF
FPGA logic     Arbitrary      (m⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 3m) LUT + 7m FF
               Programmable   (r⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 4m − r) LUT + (6m + r) FF
               Fixed          (r⌈3/(L−1)⌉ + ⌈m/(L−1)⌉ + 4m − r) LUT + 6m FF
Storage bits (all rows): hm
Table 4.36: Complexity of architecture 2 for arithmetic units based on LSB
multipliers
Technology     Irreducible    Logic
               polynomial
               support
Gates          Arbitrary      11m AND + 5m OR + 2m XOR + 8m FF
               Programmable   (10m + r) AND + 5m OR + (m + r) XOR + (7m + r) FF
               Fixed          10m AND + 5m OR + (m + r) XOR + 7m FF
Generic gates  Arbitrary      18m GG + 8m FF
               Programmable   (16m + 2r) GG + (7m + r) FF
               Fixed          (16m + r) GG + 7m FF
FPGA logic     Arbitrary      (m⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 5m) LUT + 8m FF
               Programmable   (r⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 6m − r) LUT + (7m + r) FF
               Fixed          (r⌈3/(L−1)⌉ + ⌈m/(L−1)⌉ + 6m − r) LUT + 7m FF
Storage bits (all rows): hm
Table 4.37: Complexity of architecture 3 for arithmetic units based on LSB
multipliers
Technology     Irreducible    Logic
               polynomial
               support
Gates          Arbitrary      10m AND + 5m OR + 4m XOR + 9m FF
               Programmable   (9m + r) AND + 5m OR + (3m + r) XOR + (8m + r) FF
               Fixed          9m AND + 5m OR + (3m + r) XOR + 8m FF
Generic gates  Arbitrary      19m GG + 9m FF
               Programmable   (17m + 2r) GG + (8m + r) FF
               Fixed          (17m + r) GG + 8m FF
FPGA logic     Arbitrary      (m⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 5.3m) LUT + 9m FF
               Programmable   (r⌈4/(L−1)⌉ + ⌈m/(L−1)⌉ + 6.3m − r) LUT + (8m + r) FF
               Fixed          (r⌈3/(L−1)⌉ + ⌈m/(L−1)⌉ + 6.3m − r) LUT + 8m FF
Storage bits (all rows): hm
Table 4.38: Complexity of architecture 4 for arithmetic units based on MSD multipliers
that support programmable and fixed irreducible polynomials (m >> D >> r)
Technology     Logic
Gates          (Dm + 6m) AND + 4m OR + (Dm + m) XOR + 8m FF
Generic gates  (2Dm + 11m) GG + 8m FF
FPGA logic     (m⌈2DZ/(L−1)⌉ + ⌈m/(L−1)⌉ + 4m) LUT + 8m FF
Storage bits (all rows): hm
Table 4.39: Complexity of architecture 5 for arithmetic units based on MSD multipliers
that support programmable and fixed irreducible polynomials (m >> D >> r)
Technology     Logic
Gates          (Dm + 8m) AND + 5m OR + (Dm + 3m) XOR + 9m FF
Generic gates  (2Dm + 16m) GG + 9m FF
FPGA logic     (m⌈2DZ/(L−1)⌉ + ⌈m/(L−1)⌉ + 6.3m) LUT + 9m FF
Storage bits (all rows): hm
Table 4.40: Complexity of architecture 1 for arithmetic units based on LSD multipliers
that support programmable and fixed irreducible polynomials (m >> D >> r)
Technology     Logic
Gates          (Dm + 6m) AND + 4m OR + Dm XOR + 6m FF
Generic gates  (2Dm + 10m) GG + 6m FF
FPGA logic     (m⌈2DZ/(L−1)⌉ + ⌈m/(L−1)⌉ + 3m) LUT + 6m FF
Storage bits (all rows): hm
Table 4.41: Complexity of architecture 2 for arithmetic units based on LSD multipliers
that support programmable and fixed irreducible polynomials (m >> D >> r)
Technology     Logic
Gates          (Dm + 9m) AND + 5m OR + Dm XOR + 7m FF
Generic gates  (2Dm + 14m) GG + 7m FF
FPGA logic     (m⌈2DZ/(L−1)⌉ + ⌈m/(L−1)⌉ + 5m) LUT + 7m FF
Storage bits (all rows): hm
Table 4.42: Complexity of architecture 3 for arithmetic units based on LSD multipliers
that support programmable and fixed irreducible polynomials (m >> D >> r)
Technology     Logic
Gates          (Dm + 8m) AND + 5m OR + (Dm + 2m) XOR + 8m FF
Generic gates  (2Dm + 15m) GG + 8m FF
FPGA logic     (m⌈2DZ/(L−1)⌉ + ⌈m/(L−1)⌉ + 5.3m) LUT + 8m FF
Storage bits (all rows): hm
Table 4.43: Summary of arithmetic unit gate complexity according to multiplier
family for architectures that include squaring circuitry, that provide programmable
polynomial support, and that exhibit m >> D >> r
Mult. family   Minimum                              Maximum
Super-serial   17.5D GG + 5D FF + (h + 2)D⌈m/D⌉ SB (minimum and maximum)
Bit-serial     16m GG + 7m FF + hm SB               18m GG + 9m FF + hm SB
Digit-serial   (2D + 14)m GG + 7m FF + hm SB        (2D + 16)m GG + 9m FF + hm SB
Table 4.44: Summary of arithmetic unit FPGA logic complexity according to multiplier
family for architectures that include squaring circuitry, that provide programmable
polynomial support, and that exhibit m >> D >> r
Mult. family   Minimum                                 Maximum
Super-serial   (6.5D + ⌈D/(L−1)⌉) LUT + 5D FF + (h + 2)D⌈m/D⌉ SB (minimum and maximum)
Bit-serial     (6m + ⌈m/(L−1)⌉) LUT                    (7.3m + ⌈m/(L−1)⌉) LUT
               + 7m FF + hm SB                         + 9m FF + hm SB
Digit-serial   (⌈2DZ/(L−1)⌉m + ⌈m/(L−1)⌉               (⌈2DZ/(L−1)⌉m + ⌈m/(L−1)⌉
               + 5m) LUT + 7m FF + hm SB               + 6.3m) LUT + 9m FF + hm SB
4.13.2 Performance
The performance of an arithmetic unit is a function of the number of clock cycles
required to perform the different arithmetic operations and of the clock cycle period.
The minimum clock cycle period is a function of the critical path delay. Here
it is assumed that the critical path delay of an arithmetic unit corresponds to the
critical path delay of its multiplier. The multiplier is the most complex circuit of an
arithmetic unit. The critical path delays of the other circuits of an arithmetic unit
can be reduced using pipelining techniques, thus allowing the critical path delay of
the multiplier to dominate.
Table 4.45 summarizes the critical path delays of the different GF(2^m)
multipliers considered in this work. This table specifies the critical path delays
for implementations with gates, generic gates, and FPGA logic. The results in this
table assume that m >> D >> r.
Table 4.46 summarizes the critical path delays of the arithmetic units that
include squaring circuitry according to multiplier families (super-serial, bit-serial,
digit-serial). Estimates are provided for implementations with generic gates and
FPGA logic. The results in this table assume that m >> D >> r.
Tables 4.47 to 4.49 summarize the throughput of the different architectures for
addition, square, and multiplication operations. The throughput is specified in
terms of the number of clock cycles required to compute the desired operation.
Table 4.50 summarizes the throughput of the arithmetic units according to their
multiplier families. This table assumes the use of parallel squarers for arithmetic
units based on bit-serial and digit-serial multipliers. This table assumes the use of
the new squaring architecture for arithmetic units based on super-serial multipliers.
From Table 4.50, one can appreciate that the time required to compute a
multiplication is often much larger than the time required to compute an addition
or a square, especially when using parallel squarers. Generally, when estimating
the time required to perform an algorithm with the architectures studied here for
GF(2^m), the time required for addition and square operations can be ignored.
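To give a feeling for the difference in scale, the multiplication cycle counts summarized for the three multiplier families can be evaluated for concrete parameters. The Python sketch below is only an illustration; the function name multiplication_cycles is hypothetical, and m = 163 and D = 16 are example values.

```python
def multiplication_cycles(family, m, D=1):
    """Clock cycles for one GF(2^m) multiplication, by multiplier family."""
    digits = -(-m // D)                  # ceil(m / D)
    if family == "bit-serial":
        return m
    if family == "digit-serial":
        return digits
    if family == "super-serial":
        return m * digits
    raise ValueError(f"unknown multiplier family: {family}")


# Example values (assumptions): m = 163, D = 16.
cycles = {f: multiplication_cycles(f, 163, 16)
          for f in ("bit-serial", "digit-serial", "super-serial")}
```

For these values a bit-serial multiplication takes 163 cycles and a super-serial multiplication takes 1793 cycles, which is large compared with the single cycle of a parallel squarer.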
Table 4.45: Summary of critical path delays of GF(2^m) multipliers that support
programmable irreducible polynomials for which m >> D >> r
Multiplier   Gates                         Generic gates                 FPGA logic
MSB-SSM      TA + 2TX                      3TG                           ⌈log_L 5⌉ TL
MSB          TA + 2TX                      3TG                           ⌈log_L 5⌉ TL
MSD          TA + ⌈log_2 (D + r + 1)⌉ TX   TG + ⌈log_2 (D + r + 1)⌉ TG   ⌈log_L (2Z(D + r) + 1)⌉ TL
LSB-SSM      2TA + TO + TX                 4TG                           ⌈log_L 5⌉ TL
LSB          2TA + TO + TX                 4TG                           ⌈log_L 5⌉ TL
LSD          TA + ⌈log_2 (D + 1)⌉ TX       ⌈log_2 (2D + 1)⌉ TG           ⌈log_L (2DZ + 1)⌉ TL
Table 4.46: Summary of critical path delays, according to multiplier families, of
GF(2^m) arithmetic units that support programmable irreducible polynomials for
which m >> D >> r
AU multiplier   Generic gates (TG)                         FPGA logic (TL)
                Minimum            Maximum                 Minimum             Maximum
Super-serial    3                  4                       ⌈log_L 5⌉           ⌈log_L 5⌉
Bit-serial      3                  4                       ⌈log_L 5⌉           ⌈log_L 5⌉
Digit-serial    ⌈log_2 (2D + 1)⌉   ⌈log_2 (D + r + 1)⌉ + 1   ⌈log_L (2DZ + 1)⌉   ⌈log_L (2Z(D + r) + 1)⌉
Table 4.47: Throughput of GF(2^m) arithmetic units for multiplication operations
(in clock cycles)
AU multiplier   Architecture
                1          2          3          4          5
MSB-SSM         N/A        N/A        N/A        m⌈m/D⌉     N/A
MSB             N/A        N/A        N/A        m          m
MSD             N/A        N/A        N/A        ⌈m/D⌉      ⌈m/D⌉
LSB-SSM         m⌈m/D⌉     m⌈m/D⌉     N/A        N/A        N/A
LSB             m          m          m          N/A        N/A
LSD             ⌈m/D⌉      ⌈m/D⌉      ⌈m/D⌉      N/A        N/A
Table 4.48: Throughput of GF(2^m) arithmetic units for square operations (in clock
cycles)
AU multiplier   Architecture
                1          2                         3       4          5
MSB-SSM         N/A        N/A                       N/A     m⌈m/D⌉     N/A
MSB             N/A        N/A                       N/A     m          1
MSD             N/A        N/A                       N/A     ⌈m/D⌉      1
LSB-SSM         m⌈m/D⌉     see Tables 4.23 and 4.24  N/A     N/A        N/A
LSB             m          see Tables 4.23 and 4.24  1       N/A        N/A
LSD             ⌈m/D⌉      see Tables 4.23 and 4.24  1       N/A        N/A
Table 4.49: Throughput of GF(2^m) arithmetic units for addition operations (in
clock cycles)
AU multiplier   Architecture
                1          2          3          4          5
MSB-SSM         N/A        N/A        N/A        ⌈m/D⌉      N/A
MSB             N/A        N/A        N/A        1          1
MSD             N/A        N/A        N/A        1          1
LSB-SSM         ⌈m/D⌉      ⌈m/D⌉      N/A        N/A        N/A
LSB             1-2        1-2        1-2        N/A        N/A
LSD             1-2        1-2        1-2        N/A        N/A
Table 4.50: Throughput for the different field operations of GF(2^m) arithmetic units
that incorporate squaring circuitry (estimates are provided in clock cycles according
to the multiplier families)
AU multiplier   Multiplication   Square                     Addition
Super-serial    m⌈m/D⌉           see Tables 4.23 and 4.24   ⌈m/D⌉
Bit-serial      m                1                          1
Digit-serial    ⌈m/D⌉            1                          1
Chapter 5
GF(p) Arithmetic Unit
5.1 Introduction
This chapter specifies an arithmetic unit architecture for GF(p) arithmetic. This
architecture follows the general model introduced in Section 3.4. The main com-
ponents of the arithmetic unit are a Montgomery multiplier, a two’s complement
adder, a register file, and a zero test circuit. The multiplier is used to compute mul-
tiplications and squares. The adder is used to compute additions and subtractions.
The GF(p) arithmetic unit introduced here is based on a new Montgomery
multiplier architecture developed by the author as part of the research work
documented here.
Modular multiplication architectures have been extensively researched. Some of
the works in this area are documented in [Kor94, SV93, Oru95, Blu99, EW93].
For the GF(p) arithmetic unit introduced here, this work develops a new
multiplier architecture that draws from [Oru95, FP99] an approach for high-radix
multiplication, from [SV93, Oru95] the ability to delay quotient resolution, and
from [Blu99] the use of precomputation. In particular, this work extends the
concept of precomputation. The resulting multiplier architecture is a high-radix,
precomputation-based Montgomery multiplier.
The new Montgomery multiplier uses on-line precomputation in a way similar
to that used by elliptic curve point multiplication algorithms that use on-line
precomputation. This technique allows the development of a multiplier that reduces
processing hardware at the expense of storage. The availability of memory elements
in modern reconfigurable logic makes this option more attractive than options that
rely exclusively on processing hardware. The adder of the arithmetic unit is the
component used to compute the precomputed values.
This chapter starts with a description of the Montgomery multiplication algo-
rithm on which the new Montgomery multiplier architecture is based. The chapter
then proceeds to introduce the arithmetic concepts upon which the new Montgomery
multiplier is based. The discussion then shifts to the hardware description of the
arithmetic unit’s adder, multiplier, zero test, and register file circuits. The chap-
ter ends with descriptions of different configurations of the arithmetic unit. The
descriptions include complexity and performance estimates for the different config-
urations.
5.2 High-Radix Montgomery Multiplication with
Quotient Pipelining
For the GF(p) arithmetic unit, this work recommends a new multiplier architecture
based on the Montgomery multiplication algorithm with quotient pipelining shown
in Algorithm 5.2.1. This algorithm was introduced in [Oru95].
Notation: The symbol $|x|_M$ is used to represent x mod M in least residue form;
that is, $|\,|x|_M\,| < M$, where the symbol $|y|$ represents the absolute value of y.
The symbol $|x|_{\hat{M}}$ is used to represent $|x|_M + \epsilon M$, where
$\epsilon$ is an integer.
Quotient pipelining refers to the ability of the algorithm to work with delayed
quotients. Steps 4.1 and 4.6 of Algorithm 5.2.1 demonstrate this feature. Step
4.1 shows the computation of the quotient $Q_i$. The actual computation of $Q_i$
can take up to d clock cycles, where d represents the quotient resolution delay.
Step 4.6 shows the use of the delayed quotient $Q_{i-d}$. (Note that the steps of
the algorithms are not sequentially numbered. The steps are numbered so that a
common step in different algorithms uses the same number. Other algorithms are
introduced later in this chapter.)
Some of the most significant features and limitations of Algorithm 5.2.1 are listed
below.
Features of Algorithm 5.2.1:
Simple Setup
The most complicated operation in the setup phase is the computation
of an inverse modulo a power of two. This operation can be efficiently
computed using the algorithm introduced by Dussé and Kaliski in
[DK91]. Moreover, cryptosystems based on the discrete logarithm problems
change their moduli of operation infrequently. In the extreme case,
a system needs to compute the inverse only when a modulus changes.
Simple Quotient Determination
The quotient determination, shown in Step 4.1, consists of a reduction
modulo a power of two. When $S_i$ is represented as a two’s complement
number in nonredundant form, the quotient $Q_i$ can be set equal to the
least significant k bits of $S_i$.
Simple Accumulation and Reduction
The accumulation and reduction shown in Step 4.6 involve the addition
of the following: two scalar multiplications, $AB_i$ and $Q_{i-d}\alpha$, and a
truncated division, $\lfloor S_i/2^k \rfloor$. The scalar multiplications are the
most complex operations of this step. It will be shown later how these scalar
multiplications can be efficiently implemented in programmable hardware. The
division in this step consists of shifting the most significant bits of $S_i$ right
by k positions.
Ability to Pipeline Quotient Resolution
The ability to pipeline quotient resolution saves a system from having to
wait for the computation of quotients, which could involve long combi-
natorial delays. Pipelined quotient resolution allows the computation of
quotients to proceed in the background, which allows the management of
long combinatorial delays with registered stages.
Suitability for Repeated Operations
The outputs of Algorithm 5.2.1 fall within the range of the input operands
allowed by the algorithm. Therefore, outputs from previous operations
can be used as inputs in other operations.
Limitations of Algorithm 5.2.1:
Wide Output Range
The output range for Algorithm 5.2.1 is $[0, 2\tilde{M})$, where
$\tilde{M} \in [M, 2^{k(d+1)}M)$ is a function of k, d, and M. This wide output
range is typically not a problem for cryptographic algorithms for which modular
exponentiation is the most critical operation; for example, the RSA, the ElGamal
and the Diffie-Hellman families of algorithms. These algorithms can operate with
the wide output range for most of their computations and they reduce
the final results so they fall in the range $[0, M)$. On the other hand, the
wide output range can be a severe limitation for elliptic curve algorithms
that incorporate comparisons in time-critical operations because different
numbers can represent a residue class. For example, the numbers $x + iM$
for which i is an integer represent the same residue class modulo M. To
compare two operands, they must first be reduced to numbers in the range
$[0, M)$, which could be a complex operation depending on the value of
$\tilde{M}$.
Wide Operand Range
$\tilde{M}$ defines the range of the operands and therefore their size. This
limitation is not generally a problem for systems for which $\log_2 M \gg k(d+1)$.
Because elliptic curve cryptosystems use smaller moduli M than traditional
cryptosystems based on the integer factorization problem and the discrete
logarithm problem over finite fields, they are more susceptible to the
overhead imposed by k and d.
Processing Time Dependence on Quotient Resolution Delay (d)
The number of loop iterations depends on both n and d. Depending on
the modulus M, n can itself be a function of d: $n > \log_{2^k} 4\tilde{M}$, where
$\tilde{M} \in [M, 2^{k(d+1)}M)$. Therefore, the number of iterations of the loop
could include a term 2d: one d is shown explicitly in Step 4 of the algorithm and
the other is embedded in the value of n.
As will be shown in the following sections, the features of Algorithm 5.2.1
outweigh its limitations, which is why this algorithm is the basis of the
new Montgomery multiplier architecture introduced here.
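The dependence of the iteration count on d can be made concrete with a small worst-case estimate. The Python sketch below is only an illustration under stated assumptions: it uses the upper bound $\tilde{M} < 2^{k(d+1)}M$ with $M < 2^{b}$ for a b-bit modulus, picks the smallest n for which $2^{kn}$ is guaranteed to exceed $4\tilde{M}$, and counts the n + d + 1 passes of the loop in Step 4. The function name loop_iterations and the 192-bit modulus size are hypothetical example choices.

```python
def loop_iterations(modulus_bits, k, d):
    """Worst-case (n, iterations) for a modulus of the given bit length.

    Assumes M < 2**modulus_bits and M~ < 2**(k*(d+1)) * M, so that
    4*M~ < 2**(modulus_bits + k*(d+1) + 2); n is chosen so that
    R = 2**(k*n) is at least that bound.
    """
    bound_bits = modulus_bits + k * (d + 1) + 2
    n = -(-bound_bits // k)              # ceil(bound_bits / k)
    return n, n + d + 1                  # the loop runs for i = 0 .. n + d


# Example values (assumptions): 192-bit modulus, radix 2^4.
estimates = [loop_iterations(192, 4, d) for d in (0, 2, 4)]
```

For these example values the estimate grows from 51 to 55 to 59 iterations as d goes from 0 to 2 to 4, that is, by two iterations for every additional cycle of quotient resolution delay.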
Algorithm 5.2.1: High-radix Montgomery multiplication with quotient pipelining [Oru95]
Inputs:
$A \in [0, 2\tilde{M}]$
$B = \sum_{i=0}^{n+d} B_i 2^{ki} \in [0, 2\tilde{M}]$, $B_{i<n} \in [0, 2^k)$, $B_{i \ge n} = 0$
$\tilde{M} = M\,|M'|_{2^{k(d+1)}} \in [M, 2^{k(d+1)}M)$
$|-MM'|_{2^{k(d+1)}} = 1$, $\gcd(2, M) = 1$
$R = 2^{kn} > 4\tilde{M}$
$\alpha = (\tilde{M} + 1)/2^{k(d+1)} \equiv |2^{-k(d+1)}|_M$
$d \ge 0$ – quotient resolution delay
Output:
$S_{n+d+2} \equiv |AB/R|_{\hat{M}} \in [0, 2\tilde{M})$
/* Preprocessing */
1. $S_0 = 0$, $Q_{i<0} = 0$
/* Processing */
4. for $i = 0$ to $n + d$ do
/* Quotient determination */
4.1. $Q_i = |S_i|_{2^k}$
/* Accumulation and reduction */
4.6. $S_{i+1} = \lfloor S_i/2^k \rfloor + Q_{i-d}\alpha + AB_i$
end for
/* Postprocessing */
5. $S_{n+d+2} = 2^{kd}S_{n+d+1} + \sum_{i=0}^{d-1} Q_{n+1+i} 2^{ki}$
5.3 High-Radix, Precomputation-Based
Montgomery Multiplication with Quotient
Pipelining
The Montgomery multiplication algorithm introduced in this section was developed
as part of the research work documented here.
The new multiplier architecture implements the multiplication algorithm shown
in Algorithm 5.3.1. The basic structure of Algorithm 5.3.1 is similar to that of
Algorithm 5.2.1. The steps of both algorithms are numbered so that equivalent
steps in both algorithms use the same number; for example, Steps 4 and 4.1 are
identical for the two algorithms and Steps 1, 4.6, and 5 show steps with similar
properties for the two algorithms.
In addition to the operations in common with Algorithm 5.2.1, Algorithm 5.3.1
includes a precomputation phase, shown in Steps 2 to 3.1.1, and a Booth
recoding phase, shown in Steps 4.2 and 4.3, and it shows explicitly how the scalar
products are computed in Steps 4.4 and 4.5. In these steps, the sign(x) function
returns 1 if x is positive and −1 if x is negative.
The precomputation phase builds a cache of frequently used values. These values
are then looked up as needed during the accumulation and reduction phase of the
algorithm. This concept allows the use of a simple adder that adds the terms in
Steps 4.4 to 4.6 at the expense of storage and precomputation time at the beginning
of the algorithm.
The recoding phase does two things. First, it allows the representation of
numbers in radix $2^k$ as sums of numbers of lower radix; for example, $B_i$ can be
expressed as a sum of numbers of radix $2^r$ as
$B_i = \sum_{j=0}^{s-1} b_{is+j} 2^{rj}$. This transformation allows
designers to strike a balance between the number of terms to be added by the adder
in the multiplier and the processing time and the storage required to compute and
store the precomputed values.
For example, to compute the products $AB_i$, an implementation can precompute
and store $jA$ for $j = 0 \ldots 2^k - 1$. Such an implementation can compute $AB_i$
with a single table lookup. Alternatively, an implementation can choose to compute
$AB_i$ as $\sum_{j=0}^{s-1} A\,b_{is+j}\,2^{rj}$. For this last computation, an
implementation can precompute a single set $jA$ for $j = 0 \ldots 2^r - 1$. The
precomputed values can be broadcast to s processing units. To compute a scalar
product, an implementation looks up and adds s precomputed values. It will be
demonstrated later that the ability to compute scalar products as sums of
precomputed values allows implementations to use higher radices than the optimum
radices obtained when relying exclusively on precomputation.
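The trade-off between table size and number of lookups can be made concrete with a
small Python sketch. It is an illustration under assumed names (`scalar_product` is
not taken from this work) and uses unsigned digits in $[0, 2^r)$, that is, the case
before Booth recoding is applied.

```python
def scalar_product(A, Bi, r, s):
    """Compute A * Bi for a k-bit digit Bi (k = r*s) with s table lookups."""
    table = [j * A for j in range(1 << r)]   # precomputed jA, j = 0 .. 2^r - 1
    mask = (1 << r) - 1
    total = 0
    for j in range(s):
        digit = (Bi >> (r * j)) & mask       # radix-2^r digit b_{is+j}
        total += table[digit] << (r * j)     # lookup, weighted by 2^(rj)
    return total


if __name__ == "__main__":
    A, Bi = 12345, 0xB7                      # k = 8 bits
    for r, s in ((8, 1), (4, 2), (2, 4)):
        # table size 2^r shrinks as the number of lookups s grows
        print(r, s, 1 << r, scalar_product(A, Bi, r, s) == A * Bi)
```

With r = 8 a single lookup into a 256-entry table suffices, whereas r = 2 needs four
lookups into a 4-entry table.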
The second benefit realized from the recoding phase is the reduction in
precomputation time and in the memory required to store precomputed values. The
previous example assumes that the digits $b_{is+j}$ are in the range $[0, 2^r)$.
Booth recoding allows the representation of a two’s complement number in radix $2^r$
with a set of digits whose values fall in the range $[-2^{r-1}, 2^{r-1}]$. This
recoding method reduces the number of values that need to be precomputed to about
half of what would be required if the number were represented with digits whose
values fall in the range $[0, 2^r)$. Reductions in the number of precomputed values
lead to reductions in the time spent doing precomputations and in the amount of
memory required to store precomputed values. Note that for the recoded numbers only
the positive multiples need to be precomputed; the negative multiples can be derived
from the positive ones as they are needed.
Algorithm 5.3.1: High-radix, precomputation-based Montgomery multiplication
with quotient pipelining
Inputs:
$A \in (-\bar{A}, \bar{A})$, $\bar{A} > 0$
$B = \sum_{i=0}^{n+d} B_i\,2^{ki} \in (-\bar{B}, \bar{B})$, $\bar{B} > 0$, $B_{i<t-1} \in [0, 2^k)$, $B_{i=t-1} \in [-2^{k-1}, 2^{k-1})$,
$B_{i \ge t} = B_{i<0} = 0$, $t \le n$
$\alpha \equiv |2^{-k(d+1)}|_M$, $\gcd(M, 2) = 1$, $R = 2^{kn}$, d: quotient resolution delay
Output:
$S_{n+d+2} \equiv |AB/R|_{\hat{M}} \in (-(\bar{A}\bar{B} + \overline{QM})/R, (\bar{A}\bar{B} + \overline{QM})/R)$
/* Preprocessing */
1. $S_0 = 0$, $Q_{i<0} = 0$, $bh_{i<0} = 0$, $qh_{i<0} = 0$
2. for $i = 0$ to $2^{r-1}$ do
2.1.   $A[i] = iA$
   end for
3. for $i = 0$ to $2^{u-1}$ do
3.1.   for $j = 0$ to $v - 1$ do
3.1.1.   $\alpha[i, j] = |i\,\alpha\,2^{uj}|_{\hat{M}}$
       end for
   end for
/* Processing */
4. for $i = 0$ to $n + d$ do
   /* Quotient determination */
4.1.   $Q_i = |S_i|_{2^k}$
   /* Recoding: $ql_j \in [-2^{u-1}, 2^{u-1}]$; $bl_j \in [-2^{r-1}, 2^{r-1}]$; $qh_j, bh_j \in [0, 1]$ */
4.2.   $qh_i\,2^k + \sum_{j=0}^{v-1} ql_{iv+j}\,2^{uj} = Q_i + qh_{i-1}$   /* k = uv */
4.3.   if $i < t$ then
         $bh_i\,2^k + \sum_{j=0}^{s-1} bl_{is+j}\,2^{rj} = B_i + bh_{i-1}$   /* k = rs */
       else
         $\sum_{j=0}^{s-1} bl_{is+j}\,2^{rj} = 0$   /* $B_{i \ge t} = 0$ */
       end if
   /* Accumulation and reduction */
4.4.   $\widetilde{Q\alpha}_i = \sum_{j=0}^{v-1} \alpha[\,|ql_{iv+j}|, j\,]\,\mathrm{sign}(ql_{iv+j})$
4.5.   $\widetilde{AB}_i = \sum_{j=0}^{s-1} A[\,|bl_{is+j}|\,]\,\mathrm{sign}(bl_{is+j})\,2^{rj}$
4.6.   $S_{i+1} = \lfloor S_i/2^k \rfloor + \widetilde{Q\alpha}_{i-d} + \widetilde{AB}_i$
   end for
/* Postprocessing */
5. $S_{n+d+2} = 2^{kd}S_{n+d+1} + \sum_{i=0}^{d-1} Q_{n+1+i}\,2^{ki} + qh_n$
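To make the control flow of Algorithm 5.3.1 easier to follow, the listing can be
transcribed into Python with arbitrary-precision integers. This is a software sketch
under stated assumptions, not the hardware architecture of this work: the names
`booth_recode` and `montgomery_531` are hypothetical, the table $\alpha[i, j]$ is
filled with least residues (the Lookup 2 method discussed in Section 5.3.2), sign(0)
is taken as 1 (harmless because entry 0 of each table is zero), and `pow` with a
negative exponent requires Python 3.8 or later.

```python
def booth_recode(digit, carry_in, width, count):
    """Recode a (width*count)-bit digit plus carry_in into `count` signed digits
    in [-2**(width-1), 2**(width-1)] and a carry-out bit (Steps 4.2 and 4.3)."""
    out, prev_msb = [], carry_in
    for j in range(count):
        w = (digit >> (width * j)) & ((1 << width) - 1)
        msb = w >> (width - 1)
        out.append(w - (msb << width) + prev_msb)
        prev_msb = msb
    return out, prev_msb


def montgomery_531(A, B, M, k, n, d, r, u, t):
    """Sketch of Algorithm 5.3.1; B is a two's complement number of t k-bit digits.
    Returns S with S * 2**(k*n) congruent to A * B modulo M (M odd)."""
    s, v = k // r, k // u                        # k = r*s = u*v
    mask = (1 << k) - 1
    alpha = pow(2, -k * (d + 1), M)
    A_tab = [i * A for i in range((1 << (r - 1)) + 1)]            # Steps 2-2.1
    a_tab = [[((i * alpha) << (u * j)) % M for j in range(v)]     # Steps 3-3.1.1
             for i in range((1 << (u - 1)) + 1)]

    def sign(x):
        return -1 if x < 0 else 1

    S, bh, qh = 0, 0, 0                          # Step 1
    Q, Qa, qh_hist = [], [], []
    for i in range(n + d + 1):                   # Step 4
        Qi = S & mask                            # Step 4.1 (non-negative residue)
        ql, qh = booth_recode(Qi, qh, u, v)      # Step 4.2
        if i < t:                                # Step 4.3
            bl, bh = booth_recode((B >> (k * i)) & mask, bh, r, s)
        else:
            bl = [0] * s
        Qa.append(sum(sign(q) * a_tab[abs(q)][j] for j, q in enumerate(ql)))  # 4.4
        ABi = sum((sign(b) * A_tab[abs(b)]) << (r * j)
                  for j, b in enumerate(bl))     # Step 4.5
        S = (S >> k) + (Qa[i - d] if i >= d else 0) + ABi         # Step 4.6
        Q.append(Qi)
        qh_hist.append(qh)
    # Step 5
    return ((S << (k * d)) + sum(Q[n + 1 + i] << (k * i) for i in range(d))
            + qh_hist[n])


if __name__ == "__main__":
    M, k, n, t = 1000003, 4, 8, 6
    R = 1 << (k * n)
    S = montgomery_531(-123456, 654321, M, k, n, d=1, r=2, u=2, t=t)
    print((S * R - (-123456) * 654321) % M)      # residue check
```

Python’s `>>` and `&` operators treat negative integers as sign-extended two’s
complement values, which is why the floor division and the residue in Steps 4.1 and
4.6 need no special handling for negative $S_i$.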
Figure 5.1 shows a block diagram of an arithmetic unit based on the new Montgomery
multiplier. This arithmetic unit incorporates a Montgomery multiplier that
implements Algorithm 5.3.1, an adder that also serves as a precomputation engine,
a zero test circuit, and a register file that is used to store precomputed points,
constants and temporary values.
The salient features of the Montgomery multiplier in Figure 5.1 are the following:
the parallel computation of Step 4.6 of Algorithm 5.3.1 using s processing units for
the generation of the terms $A\,bl_{is+j}\,2^{rj}$, v processing units for the
generation of the terms $|ql_{iv+j}\,\alpha\,2^{uj}|_{\hat{M}}$, and an
(s + v + 1)-input adder that adds all the terms (in the figure s = v = 2); the use of
the arithmetic unit’s adder as the precomputation engine (this adder routes
precomputed values to the inputs of the processing units of the multiplier); and the
implementation of Step 5 of the algorithm with a wide register that holds the value
of $S_{n+d+1}$ at the end of the algorithm, d k-bit registers that hold the values
$Q_i$ for $i = n+1 \ldots n+d$, and a one-bit register that holds the value of
$qh_n$, which is the most significant bit of $Q_n$.
Figure 5.1: GF(p) arithmetic unit
5.3.1 Validity of Algorithm
This work uses the Modified Booth Recoding Algorithm [Par99, Kor93]. This is a
window-based recoding algorithm. Here, this recoding algorithm is used to recode
the multiplier operand B and the quotient generated by the multiplication algorithm.
The multiplier operand $B = \sum_{i=0}^{n+d} B_i\,2^{ki}$, where $B_{i \ge t} = 0$ and
$B_i = \sum_{j=0}^{k-1} b_{ik+j}\,2^j$ with $b_{ik+j} \in [0, 1]$, is fed to the
multiplier core one digit at a time. The multiplication algorithm generates one digit
of the quotient, $Q_i = \sum_{j=0}^{k-1} q_{ik+j}\,2^j$, in each iteration of the loop
in Step 4.
Algorithm 5.3.1 recodes the multiplier operand and the quotient on a digit-by-digit
basis. The recoding algorithm is applied independently to the digits of the
multiplier operand and the quotient.
As shown in Step 4.2 of Algorithm 5.3.1, the quotient is recoded in radix $2^u$ with
digits whose values fall in the range $[-2^{u-1}, 2^{u-1}]$. In each iteration of the
loop in Step 4, the digit $Q_i$ of the quotient is computed. Using this digit and the
most significant bit of the digit computed in the previous iteration of the loop,
$qh_{i-1}$, the multiplication algorithm generates v digits of the recoded quotient,
$ql_{iv+j}$ for $j = 0 \ldots v-1$, along with a carry bit, $qh_i$. Equations (5.1)
and (5.2) show how the digits $ql_{iv+j}$ and the carry bit $qh_i$ are computed.
$$ql_{iv+j} = -q_{ik+ju+u-1}\,2^{u-1} + \Big(\sum_{l=0}^{u-2} q_{ik+ju+l}\,2^l\Big) + q_{ik+ju-1} \tag{5.1}$$
$$qh_i = q_{ik+k-1} \tag{5.2}$$
The recoding of the B operand is done in a form analogous to the recoding of
the quotient generated by the multiplication algorithm.
As shown in Step 4.3 of Algorithm 5.3.1, the B operand is recoded in radix $2^r$
with digits whose values fall in the range $[-2^{r-1}, 2^{r-1}]$. In each iteration
of the loop in Step 4, a digit $B_i$ is fed to the multiplication algorithm. Using
this digit and the most significant bit of the digit fed in the previous iteration of
the loop, $bh_{i-1}$, the multiplication algorithm generates s digits of the recoded
operand, $bl_{is+j}$ for $j = 0 \ldots s-1$, along with a carry bit, $bh_i$.
Equations (5.3) and (5.4) show how the digits $bl_{is+j}$ and the carry bit $bh_i$
are computed.
$$bl_{is+j} = -b_{ik+jr+r-1}\,2^{r-1} + \Big(\sum_{l=0}^{r-2} b_{ik+jr+l}\,2^l\Big) + b_{ik+jr-1} \tag{5.3}$$
$$bh_i = b_{ik+k-1} \tag{5.4}$$
The definition of B as a two’s complement number makes the conditional operation in
Step 4.3 of Algorithm 5.3.1 necessary. If B is negative, the most significant bit of
its most significant digit, $B_{t-1}$, is one. If the operation in the if part of the
conditional were executed in iteration i = t, the carry $bh_{t-1}$ would be added to
the recoded operand, which would change the sign and, along with it, the value of B
(note that $B_{i \ge t} = 0$).
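Equations (5.3) and (5.4) and the role of the conditional in Step 4.3 can be checked
numerically. The Python sketch below follows the equations bit by bit; the helper
names `booth_digits`, `recode_operand`, and `recoded_value` are assumptions made for
this illustration.

```python
def booth_digits(x, carry_in, r, s):
    """Apply Equation (5.3) to one k-bit digit x (k = r*s).

    Bit position -1 stands for the carry from the previous digit, bh_{i-1}.
    Returns the s recoded digits and the carry bit of Equation (5.4).
    """
    def bit(pos):
        return carry_in if pos < 0 else (x >> pos) & 1

    digits = []
    for j in range(s):
        low = sum(bit(j * r + l) << l for l in range(r - 1))
        digits.append(-(bit(j * r + r - 1) << (r - 1)) + low + bit(j * r - 1))
    return digits, bit(r * s - 1)


def recode_operand(B, k, r, t):
    """Recode a two's complement operand B of t k-bit digits (iterations i < t)."""
    s, mask = k // r, (1 << k) - 1
    digits, bh = [], 0
    for i in range(t):
        step, bh = booth_digits((B >> (k * i)) & mask, bh, r, s)
        digits += step
    return digits, bh


def recoded_value(digits, r):
    """Weighted sum of the recoded digits, digit idx carrying weight 2^(r*idx)."""
    return sum(dg << (r * idx) for idx, dg in enumerate(digits))


if __name__ == "__main__":
    k, r, t = 8, 2, 2
    digits, bh = recode_operand(-1234, k, r, t)
    print(digits, bh)
    print(recoded_value(digits, r))                      # value without the last carry
    print(recoded_value(digits, r) + (bh << (k * t)))    # value if the carry were applied
```

For a negative operand the final carry bit is one; dropping it, as the else branch of
Step 4.3 does, keeps the recoded value equal to B, whereas applying it in iteration
i = t would add $2^{kt}$.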
The validity of Algorithm 5.3.1 can be proven using an induction argument similar
to the one used in [Oru95] to prove the validity of the Montgomery multiplication
algorithm on which this algorithm is based.
The proof proceeds by induction on i. First, one can verify that the invariant shown
in Equation (5.5) holds in Step 4.1 for i = 0, where $\widetilde{Q\alpha}_i$ is
defined in Step 4.4 and $\tilde{Q}_i$ is defined in Equation (5.6). (Here, it is
assumed that the value of $\sum_{i=m}^{n} z_i$ is equal to zero for n smaller than m.)
One can then verify that the invariant holds for i = l + 1 by assuming that it holds
for i = l in Step 4.1 and checking that it still holds after the update function in
Step 4.6. For the proof, it helps to use the following facts:
$\widetilde{Q\alpha}_i = \sum_{j=0}^{v-1} |ql_{iv+j}\,\alpha\,2^{uj}|_{\hat{M}}$,
$\widetilde{AB}_i = \sum_{j=0}^{s-1} A\,bl_{is+j}\,2^{rj}$, $Q_{i \le 0} = 0$,
$B_{i \ge t} = B_{i<0} = 0$,
$\sum_{j=0}^{v-1} ql_{iv+j}\,2^{uj} = Q_i + qh_{i-1} - qh_i\,2^k$,
$\sum_{j=0}^{s-1} bl_{is+j}\,2^{rj} = B_i + bh_{i-1} - bh_i\,2^k$, and
$\lfloor S_i/2^k \rfloor = (S_i - Q_i)/2^k$.
Equation (5.7) shows the result of Step 4.6 for i = l. This equation makes use
of $\lfloor S_i/2^k \rfloor = (S_i - Q_i)/2^k$ and removes the division in Step 4.6 by
multiplying both sides of the equality by $2^{k(l+1)}$. The invariant can then be
written for i = l, solved for $2^{kl}S_l$, and substituted in Equation (5.7). After
some manipulations, one can verify that the invariant holds for i = l + 1.
$$2^{ki}S_i + 2^{k(i-d)}\sum_{j=0}^{d-1} Q_{i+j-d}\,2^{kj} + qh_{i-d-1}\,2^{k(i-d)} = 2^kA\Big(\Big(\sum_{j=0}^{i-1} B_j\,2^{kj}\Big) - bh_{i-1}\,2^{ki}\Big) + 2^k\sum_{j=0}^{i-d-2}\big(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1}\big)2^{kj} \tag{5.5}$$
$$\tilde{Q}_i = \sum_{j=0}^{v-1} ql_{iv+j}\,2^{uj} \tag{5.6}$$
$$2^{k(l+1)}S_{l+1} = 2^{kl}S_l - 2^{kl}Q_l + 2^{k(l+1)}\,\widetilde{Q\alpha}_{l-d} + 2^{k(l+1)}\,\widetilde{AB}_l \tag{5.7}$$
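After n + d + 1 iterations of the loop in Step 4, the multiplication result is as
defined in Equation (5.8), where $R = 2^{kn}$ and where the definitions for $QM$ and
$\overline{QM}$ are given in Equations (5.9) and (5.10).
One can verify from Equation (5.8) that the multiplication result is
$S_{n+d+2} = |ABR^{-1}|_{\hat{M}} \in (-(\bar{A}\bar{B} + \overline{QM})/R, (\bar{A}\bar{B} + \overline{QM})/R)$.
For this verification note that
$2^{k(d+1)}\,\widetilde{Q\alpha}_{j+1} \equiv |\tilde{Q}_{j+1}|_M$.
$$2^{kd}S_{n+d+1} + \sum_{i=0}^{d-1} Q_{n+1+i}\,2^{ki} + qh_n = \frac{AB + \sum_{j=0}^{n-1}\big(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1}\big)2^{kj}}{R} = (AB + QM)/R \in (-(\bar{A}\bar{B} + \overline{QM})/R, (\bar{A}\bar{B} + \overline{QM})/R) \tag{5.8}$$
$$QM = \sum_{j=0}^{n-1}\big(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1}\big)2^{kj} \in (-\overline{QM}, \overline{QM}) \tag{5.9}$$
$$\overline{QM} > \max(|QM|) \tag{5.10}$$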
5.3.2 Accuracy of Algorithm
Ideally, one would like the output of a multiplier to be in least residue form; that
is, a value in the range [0, M), where M is the modulus of operation. Multiplication
algorithms that conform with this concept are likely to be computationally more
complex than Algorithm 5.3.1, which approximates the multiplication result to a
value in the range $(-\epsilon M, \epsilon M)$ for $\epsilon > 1$. For Algorithm
5.3.1, $\epsilon$ is a function of the maximum values of A, B, and QM.
The most critical factor for determining the range of the multiplication result
is $\overline{QM}$, whose value is a function of k, d, and the reduction method used.
This work considers the two types of reduction methods used in [FP99]:
multiplication-based reduction and lookup-based reduction. In [FP99], the
multiplication-based approach approximates $|x\alpha|_{\hat{M}}$ as $x\,|\alpha|_M$
and the lookup-based method represents $|x\alpha|_{\hat{M}}$ as $|x\alpha|_M$ (least
residue form).
The reduction method is applied in Step 3.1.1 of Algorithm 5.3.1. For the
multiplication-based reduction, this step computes $\alpha[i, j] = i\,\alpha\,2^{uj}$.
The multiplication-based reduction method is referred to here as the Multiplication
reduction method.
Two lookup-based reduction methods are studied here. The first one is referred
to as the Lookup 1 reduction method. This method computes
$\alpha[i, j] = |i\alpha|_M\,2^{uj}$. The other lookup reduction method is referred
to as the Lookup 2 reduction method. This method computes
$\alpha[i, j] = |i\,\alpha\,2^{uj}|_M$.
The reduction methods just described affect the value of $\widetilde{Q\alpha}_i$ in
Step 4.4 of the algorithm. Table 5.1 shows the effect of the reduction methods on
$\widetilde{Q\alpha}_i$ and $\overline{QM}$.
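Table 5.1: Accuracy of different reduction methods
Reduction method | $\widetilde{Q\alpha}_i$ | $\overline{QM}$
Multiplication | $\sum_{j=0}^{v-1} ql_{iv+j}\,\lvert 2^{-k(d+1)} \rvert_M\,2^{uj}$ | $2^{k(d+1)}M\left(\frac{2^{kn}-1}{2^u-1}\right)\left(\frac{2^u}{2}\right) < 2^{k(d+1)}MR$
Lookup 1 | $\sum_{j=0}^{v-1} \lvert ql_{iv+j}\,2^{-k(d+1)} \rvert_M\,2^{uj}$ | $2^{k(d+1)}M\left(\frac{2^{kn}-1}{2^u-1}\right) < 2^{k(d+1)}MR\left(\frac{2}{2^u}\right)$
Lookup 2 | $\sum_{j=0}^{v-1} \lvert ql_{iv+j}\,2^{-k(d+1)}\,2^{uj} \rvert_M$ | $2^{k(d+1)}M\left(\frac{2^{kn}-1}{2^k-1}\right)v < 2^{kd}MR\,(2v)$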
R is a design parameter that influences the output range of the multiplier. As
Table 5.1 shows, the value of $\overline{QM}$ grows proportionally with R. Therefore,
the output range of the multiplication algorithm, as shown in Equation (5.8), is at
least restricted to values of the order of $\overline{QM}/R$.
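The three ways of filling the table $\alpha[i, j]$ in Step 3.1.1 that Table 5.1
compares can be placed side by side in a short Python sketch. The function name
`alpha_tables` and the parameter values are assumptions for illustration; `pow` with
a negative exponent requires Python 3.8 or later.

```python
def alpha_tables(M, k, d, u, v):
    """Build alpha[i][j] for i = 0 .. 2^(u-1), j = 0 .. v-1 with each method."""
    alpha = pow(2, -k * (d + 1), M)          # least residue of 2^(-k(d+1)) mod M
    rows = range((1 << (u - 1)) + 1)
    return {
        # Multiplication: i * alpha * 2^(uj), no reduction of the product
        "Multiplication": [[(i * alpha) << (u * j) for j in range(v)] for i in rows],
        # Lookup 1: reduce i * alpha, then apply the weight 2^(uj)
        "Lookup 1": [[((i * alpha) % M) << (u * j) for j in range(v)] for i in rows],
        # Lookup 2: reduce the complete product
        "Lookup 2": [[((i * alpha) << (u * j)) % M for j in range(v)] for i in rows],
    }


if __name__ == "__main__":
    M, k, d, u, v = 1000003, 8, 1, 4, 2
    for name, table in alpha_tables(M, k, d, u, v).items():
        largest = max(max(row) for row in table)
        print(f"{name:15s} largest entry has {largest.bit_length()} bits")
```

All three tables hold values congruent to $i\,\alpha\,2^{uj}$ modulo M; they differ
only in magnitude, which is what drives the different bounds in Table 5.1.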
The results in Table 5.1 correspond to the worst case values for $\overline{QM}$,
where the value of $|x|_M$ is assumed to be M. Parameter selection can greatly
improve the accuracy and speed of the multiplication; for example, [FIP00] specifies
moduli of the form $M = \sum_{i=t}^{m-1} m_i\,2^i \pm 1$, for which
$|2^{-kx}|_M = \mp(M \mp 1)/2^{kx}$ for $t \ge kx$. For $t \ge k(d + 1)$,
$\overline{QM} < MR$ for all the reduction methods listed in Table 5.1.
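The closed form for $|2^{-kx}|_M$ quoted above is easy to confirm numerically. In the
Python sketch below the function name `inv_pow2_special` and the two sample moduli
($2^{224} - 2^{96} + 1$ and $2^{192} - 2^{64} - 1$) are illustrative choices made
here, not values taken from this section; `pow` with a negative exponent requires
Python 3.8 or later.

```python
def inv_pow2_special(M, e, sign):
    """Return 2^(-e) mod M for M = c * 2^t + sign with t >= e and sign = +1 or -1.

    Implements -/+ (M -/+ 1) / 2^e: the division is exact because M - sign is a
    multiple of 2^t, and the final % M maps the result to the least residue.
    """
    return (-sign * ((M - sign) >> e)) % M


if __name__ == "__main__":
    M_plus = 2**224 - 2**96 + 1     # form "... + 1" with t = 96
    M_minus = 2**192 - 2**64 - 1    # form "... - 1" with t = 64
    e = 64                          # e = k*x, for example k = 16 and x = 4
    print(inv_pow2_special(M_plus, e, +1) == pow(2, -e, M_plus))
    print(inv_pow2_special(M_minus, e, -1) == pow(2, -e, M_minus))
```

No general modular inversion is needed for such moduli, and the magnitude of the
inverse is smaller than M by roughly a factor of $2^{kx}$.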
For applications requiring iterated multiplications, such as modular
exponentiations, R can be chosen so that the accuracy of a multiplication result
falls in the range $(-2\,\overline{QM}/R, 2\,\overline{QM}/R)$, or
$[0, 2\,\overline{QM}/R)$ when handling only positive numbers. An example that uses
the last expression can be found in [Oru95], for which $A, B \in [0, 2\tilde{M}]$,
$R > 4\tilde{M}$, $QM \in [0, \tilde{M}R)$, and
$\tilde{M} = |-M^{-1}|_{2^{k(d+1)}}\,M < 2^{k(d+1)}M$.
From the previous analysis, one can deduce that in some instances the only way
to reduce the output range of the multiplier is by limiting the quotient resolution
delay d.
5.3.3 Range of Si+1
The size of the adder required to compute $S_{i+1}$ in Step 4.6 of Algorithm 5.3.1
can be determined using the invariant shown in Equation (5.5). To determine the size
of the adder, the maximum positive and the minimum negative results need to be
analyzed. For this analysis, the invariant is rearranged in Equation (5.11) and
expressed in terms of i + 1.
$$2^{k(i+1)}S_{i+1} = -2^{k(i+1-d)}\sum_{j=0}^{d-1} Q_{i+1+j-d}\,2^{kj} - qh_{i-d}\,2^{k(i+1-d)} + 2^kA\Big(\Big(\sum_{j=0}^{i} B_j\,2^{kj}\Big) - bh_i\,2^{k(i+1)}\Big) + 2^k\sum_{j=0}^{i-d-1}\big(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1}\big)2^{kj} \tag{5.11}$$
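The expression for the determination of the minimum negative result is shown in
Equation (5.12). This equation assumes worst case values for the variables in
Equation (5.11):
$-2^{k(i+1-d)}\sum_{j=0}^{d-1} Q_{i+1+j-d}\,2^{kj} = -2^{k(i+1-d)}(2^{kd} - 1)$;
$-qh_{i-d}\,2^{k(i+1-d)} = -2^{k(i+1-d)}$;
$\min(2^kA((\sum_{j=0}^{i} B_j\,2^{kj}) - bh_i\,2^{k(i+1)})) = -2^k(2^{k(i+1)})\bar{A}/2$; and
$\min(2^k\sum_{j=0}^{i-d-1}(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1})2^{kj}) = -\max(2^k\sum_{j=0}^{i-d-1}(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1})2^{kj})$,
where $\max(2^k\sum_{j=0}^{i-d-1}(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1})2^{kj})$
is defined in Table 5.3 for different reduction methods.
$$\begin{aligned}
2^{k(i+1)}S_{i+1} &\ge -(2^{k(i+1-d)}(2^{kd} - 1)) - (2^{k(i+1-d)}) \\
&\quad + \min\Big(2^kA\Big(\Big(\sum_{j=0}^{i} B_j\,2^{kj}\Big) - bh_i\,2^{k(i+1)}\Big)\Big) \\
&\quad + \min\Big(2^k\sum_{j=0}^{i-d-1}\big(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1}\big)2^{kj}\Big) \\
&> -(2^{k(i+1-d)}(2^{kd} - 1)) - (2^{k(i+1-d)}) - \Big(\frac{2^k(2^{k(i+1)})\bar{A}}{2}\Big) - \Big(\frac{2^k(2^{k(i+1)})(M - 1)}{C}\Big) \\
&> -\Big(\frac{2^k(2^{k(i+1)})\bar{A}}{2} + \frac{2^k(2^{k(i+1)})M}{C}\Big)
\end{aligned} \tag{5.12}$$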
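The expression for the determination of the maximum positive result is shown in
Equation (5.13). This equation assumes worst case values for the variables in
Equation (5.11):
$\max(2^kA((\sum_{j=0}^{i} B_j\,2^{kj}) - bh_i\,2^{k(i+1)})) = 2^k(2^{k(i+1)})\bar{A}/2$
and the value of
$\max(2^k\sum_{j=0}^{i-d-1}(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1})2^{kj})$
as defined in Table 5.3 for different reduction methods.
$$\begin{aligned}
2^{k(i+1)}S_{i+1} &\le \max\Big(2^kA\Big(\Big(\sum_{j=0}^{i} B_j\,2^{kj}\Big) - bh_i\,2^{k(i+1)}\Big)\Big) \\
&\quad + \max\Big(2^k\sum_{j=0}^{i-d-1}\big(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1}\big)2^{kj}\Big) \\
&< \Big(\frac{2^k(2^{k(i+1)})\bar{A}}{2}\Big) + \Big(\frac{2^k(2^{k(i+1)})(M - 1)}{C}\Big) \\
&< \Big(\frac{2^k(2^{k(i+1)})\bar{A}}{2} + \frac{2^k(2^{k(i+1)})M}{C}\Big)
\end{aligned} \tag{5.13}$$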
Combining Equation (5.12) and Equation (5.13) and simplifying the result, one
obtains an expression for the range of $S_{i+1}$. Equation (5.14) shows the range of
$S_{i+1}$. $\bar{A}$ is typically greater than M; therefore, an approximation for the
range of $S_{i+1}$ is the range $(-2^k\bar{A}, 2^k\bar{A})$.
$$S_{i+1} \in \Big(-\Big(\frac{2^k\bar{A}}{2} + \frac{2^kM}{C}\Big), \Big(\frac{2^k\bar{A}}{2} + \frac{2^kM}{C}\Big)\Big) \tag{5.14}$$
Table 5.2: Approximate maximum value of $\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1}$
Reduction method | $\max(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1})$
Multiplication | $(2^{k(d+1)}(M - 1))(2^k - 1)$
Lookup 1 | $(2^{k(d+1)}(M - 1))\left(\frac{2^k - 1}{2^u - 1}\right)$
Lookup 2 | $(2^{k(d+1)}(M - 1))\,v$
Table 5.3: Approximate maximum value of $2^k\sum_{j=0}^{i-d-1}(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1})\,2^{kj}$
Reduction method | $\max(2^k\sum_{j=0}^{i-d-1}(\widetilde{Q\alpha}_{j+1}\,2^{k(d+1)} - \tilde{Q}_{j+1})\,2^{kj})$ | C
Multiplication | $2^k\,2^{k(i+1)}(M - 1)/C$ | 1
Lookup 1 | $2^k\,2^{k(i+1)}(M - 1)/C$ | $2^u - 1$
Lookup 2 | $2^k\,2^{k(i+1)}(M - 1)/C$ | $(2^k - 1)/v$
5.3.4 Processing Time
Equation (5.15) provides a processing time approximation for the computation of
Algorithm 5.3.1 using the multiplier architecture shown in Figure 5.1. This equation
assumes that a single precomputation engine computes all the precomputed values.
In this equation $T_b$ represents the precomputation time for $iA$ for
$i = 0 \ldots 2^{r-1}$; $T_q$ represents the precomputation time for
$|i\,\alpha\,2^{uj}|_{\hat{M}}$ for $i = 0 \ldots 2^{u-1}$ and $j = 0 \ldots v-1$;
and $T_m$ represents the processing time of the algorithm.
The expression in Equation (5.15) is normalized with respect to a reference
unit of time. The processing cost of a precomputation operation is weighted by a
factor $a_b$ for the precomputed values $iA$ and by a factor $a_q$ for the
precomputed values $|i\,\alpha\,2^{uj}|_{\hat{M}}$. Note that the level of effort may
be different for the two types of precomputed values. The number of operations over
which a set of precomputed values is amortized is weighted by the factor $c_b$ for
the precomputed values $iA$ and by the factor $c_q$ for the precomputed values
$|i\,\alpha\,2^{uj}|_{\hat{M}}$. The processing operations are weighted by a factor b,
where it is assumed that the processing time is equal for the scalar products
$\widetilde{AB}_i$ and $\widetilde{Q\alpha}_i$.
$$T_M = T_b + T_q + T_m = \frac{a_b}{c_b}\,2^{r-1} + \frac{a_q}{c_q}\,2^{u-1}v + b(n + d) \tag{5.15}$$
The processing time expression in Equation (5.15) assumes the computation of
a single set of products $iA$, which is consistent with Steps 2 and 2.1 of Algorithm
5.3.1. This set of products is broadcast to s processing units. For the generation
of $\widetilde{AB}_i$, the processing units perform table lookups using their
respective $bl_{is+j}$. The $2^{rj}$ multiple associated with each of the $bl_{is+j}$
terms can be realized by arranging the processing units in a staggered configuration
where their positions represent powers of $2^r$; for example, the processing unit
that generates $bl_{is+1}A\,2^r$ is shifted r bits with respect to the processing
unit that generates $bl_{is}A$, where the shifts represent multiplications by powers
of two.
The processing time expression in Equation (5.15) assumes that v sets of products
$|i\,\alpha\,2^{uj}|_{\hat{M}}$ need to be computed, which is necessary for the
Lookup 2 reduction method. For the Multiplication and Lookup 1 reduction methods, one
can use the same approach used to compute the products $iA$. Namely, one can
precompute a single set that is broadcast to v processing units. The powers of $2^u$
associated with each term can be realized with a staggered arrangement of the
processing units. The processing time, when using the Multiplication or the Lookup 1
reduction methods in this form, can be computed using Equation (5.15) with v set
equal to one.
For the Multiplication and the Lookup 1 reduction methods, Algorithm 5.3.1 can
be modified as follows: 1) eliminate the loop in Step 3.1; 2) compute
$\alpha[i] = |i\alpha|_{\hat{M}}$ in Step 3.1.1; and 3) compute
$\widetilde{Q\alpha}_i = \sum_{j=0}^{v-1} \alpha[\,|ql_{iv+j}|\,]\,\mathrm{sign}(ql_{iv+j})\,2^{uj}$
in Step 4.4.
To determine the optimum number of precomputations, it is best to express
Equation (5.15) in terms of $m = \lceil \log_2 M \rceil$. Equation (5.16) provides
such an approximation, where $n = \lceil m/k \rceil + d + f$, f is a constant,
$R = 2^{m+k(d+f)}$, and k = rs = uv. This equation assumes that the target output
range is $S_{n+d+2} \in (-2\,\overline{QM}/R, 2\,\overline{QM}/R)$ and that the
inputs are restricted by $\bar{A}, \bar{B} = 2^{k/2}\,\overline{QM}/R$. For these
parameters, f falls in the range [1, 2].
$$T_M = \frac{a_b}{c_b}\,2^{r-1} + \frac{a_q}{c_q}\,2^{u-1}v + b\lceil m/k \rceil + b(2d + f) \tag{5.16}$$
Most cryptographic algorithms in use today require the computation of many
consecutive operations involving the same modulus. As a result, the cost of the
precomputations associated with the terms $|i\,\alpha\,2^{uj}|_{\hat{M}}$ can be
amortized over a large number of operations ($c_q$ large). Moreover, elliptic curve
cryptosystems, as well as cryptosystems based on the discrete logarithm problem in
finite fields, change modulus infrequently. For these algorithms, the set of
precomputed values $|i\,\alpha\,2^{uj}|_{\hat{M}}$ needs to be computed only when the
modulus changes. Consequently, for most cryptographic algorithms in use today, the
time critical operations are those involving the scalar products $iA$.
Equation (5.17) shows a processing time expression that assumes that no
precomputation is needed for the terms $|i\,\alpha\,2^{uj}|_{\hat{M}}$. This
expression makes use of the fact that k is equal to rs.
$$T_M = \frac{a_b}{c_b}\,2^{r-1} + b\lceil m/(rs) \rceil + b(2d + f) \tag{5.17}$$
To determine the optimum number of precomputations with respect to r for
a given s, one can differentiate Equation (5.17) with respect to r and set the
result equal to zero. Equation (5.18) shows the outcome of this operation, in which
the value $\lceil m/(rs) \rceil$ is approximated as $m/(rs)$.
$$r + 2\log_2 r = (\log_2 m - \log_2 s) + (\log_2 b - \log_2 a_b) + \log_2 c_b - \log_2(\ln 2) + 1 \tag{5.18}$$
Equation (5.19) shows an approximation of Equation (5.18) that ignores constant
terms and assumes that $a_b = b = c_b = 1$. From this expression, one can deduce
that r decreases as s increases. Reductions of the value of r lead to reductions in
the number of values that need to be precomputed in Algorithm 5.3.1, which, in
turn, lead to reductions in the amount of memory required to store precomputed
values. There is no apparent limitation to this behavior, but in reality k = rs, and,
therefore, the extreme case of no precomputation corresponds to the case s = k and
r = 1. The practical limitation in this last case is the amount of hardware required
to implement the multiplier.
Another interesting point that can be deduced from Equation (5.19) is that when
s = 1, which corresponds to the maximum degree of precomputation, the value of
r (r = k) is limited to a value lower than $\log_2 m$. Therefore, the applicability
of precomputation is limited for this case because the precomputation costs (of the
order of $2^{k-1}$) overwhelm the reductions in processing costs for large values of
r (of the order of m/r when r = k).
$$r + 2\log_2 r = \log_2 m - \log_2 s \tag{5.19}$$
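The behavior predicted by Equations (5.17) and (5.19) can be explored numerically.
The Python sketch below is only an illustration: the function names, the search range
for r, and the default weights ($a_b = b = c_b = 1$, d = 0, f = 1) are assumptions
chosen here. It evaluates Equation (5.17) for integer r and solves Equation (5.19)
for real r by bisection.

```python
import math


def processing_time(m, r, s, d=0, f=1, ab=1, b=1, cb=1):
    """Equation (5.17): (ab/cb) 2^(r-1) + b ceil(m/(rs)) + b (2d + f)."""
    iterations = -(-m // (r * s))            # integer ceiling of m / (r*s)
    return (ab / cb) * 2 ** (r - 1) + b * iterations + b * (2 * d + f)


def best_r(m, s, r_max=16, **weights):
    """Integer r in 1 .. r_max that minimizes Equation (5.17) for a given s."""
    return min(range(1, r_max + 1), key=lambda r: processing_time(m, r, s, **weights))


def continuous_r(m, s):
    """Solve r + 2 log2(r) = log2(m) - log2(s), Equation (5.19), by bisection."""
    target = math.log2(m) - math.log2(s)
    lo, hi = 1.0, 64.0                       # assumes the root lies in [1, 64]
    for _ in range(60):
        mid = (lo + hi) / 2
        if mid + 2 * math.log2(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for s in (1, 4, 16):
        print(s, best_r(1024, s), round(continuous_r(1024, s), 2))
```

Both the exhaustive search and the continuous estimate decrease as s grows. The
estimate from Equation (5.19) tends to be somewhat lower than the integer optimum
because it drops the constant terms of Equation (5.18).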
The modulus of operation changes infrequently for elliptic curve cryptosystems.
Setting up an elliptic curve cryptosystem can proceed as follows. First, the values
of r and s are chosen so that they provide the desired performance for the amount of
hardware that designers are willing to invest. Next, the values of u and v can be
chosen to match the value of k established by r and s (k = rs = uv). Note that
designers could choose u to be equal to r and v to be equal to s, but, for some
implementations, it may be more efficient to use large tables of precomputed values
that are programmed only when the modulus changes.
The previous discussion demonstrates some of the features of Algorithm 5.3.1.
These include the freedom to strike a balance between the degree of precomputation
and the amount of hardware for an implementation, the ability to amortize precom-
putation costs over multiple operations, and the ability to regulate the output range
of the multiplier by using different reduction methods.
5.3.5 Multiplications of Interest in the Computation of
Point Multiplications
Table 5.4 lists some of the operations of interest in the computation of point
multiplications. This table assumes that
$\bar{A} = \bar{B} = 2^{k/2}\,\overline{QM}/R$. It also assumes that
$S_{n+d+2} \in (-2\,\overline{QM}/R, 2\,\overline{QM}/R)$, where $S_{n+d+2}$
represents the multiplication result.
The parameters just described are of interest here because they define a small
number of iterations for Algorithm 5.3.1 that generate results suitable for repeated
multiplications and they also allow a number of additions to be performed between
multiplications without the need for reduction, a feature that is useful for the im-
plementation of elliptic curve algorithms. Unless otherwise specified, this document
will assume the use of the aforementioned parameters from here on.
Entry 1 in Table 5.4 corresponds to the typical modular multiplication. Entry
2 corresponds to a division by two that can be computed with d + 2 iterations of
the loop in Step 4 of Algorithm 5.3.1 instead of the n + d + 1 iterations required
for a typical multiplication. Division by two is one of the operations required for
the computation of point multiplications according to the algorithms defined in
[IEE98].
Entry 3 in Table 5.4 defines a multiplication of a special form that is used here
to reduce the magnitude of a value presumed to be $|0|_M$ before comparing it to
zero. Note that for Entry 3, $\overline{QM}$ is defined with respect to x (n = x)
according to the definitions in Table 5.1, and that this value is not the same as the
value of $\overline{QM}$ used to define $\bar{A}$.
Some of the elliptic curve algorithms defined in the open literature, such as those
described in [IEE98], use comparisons in time critical functions, such as point
addition and point double functions. Comparisons are used, among other things, to
identify the point at infinity during the point addition and point double operations.
These comparisons involve field elements; therefore, numbers A and B are considered
equal if $A - B \equiv |0|_M$, which implies that their difference is a multiple of M.
The accuracy of Algorithm 5.3.1 is of the order $\overline{QM}/R$, where
$\overline{QM}$ is defined in Table 5.1. Rather than adding specialized circuitry to
perform comparisons, here we recommend an approach that multiplies a value presumed
to be zero by a constant. The idea is to perform this multiplication with high
accuracy in a short amount of time. To achieve high accuracy, we recommend using
Algorithm 5.3.1 with a low quotient resolution delay (d ≈ 0) and possibly a more
exact version of Algorithm 5.3.1 (see Table 5.1). To achieve a short processing
time, this work recommends multiplication by $|2^{-kx}|_{\hat{M}}$ according to Table
5.4, where the parameter x is adjusted so that the value of the multiplication result
is close to the value of M.
The recommended algorithm for the comparison of two field elements A and B
works as follows. First, compute A − B. The result of this operation is a multiple
of M if $A \equiv |B|_M$, and, if that is the case, $|(A - B)/2^{kx}|_{\hat{M}}$ will
also be a multiple of M. Next, compute $|(A - B)/2^{kx}|_{\hat{M}}$ according to
Table 5.4. Finally, refine the result to a value in the range (−M, M) and compare it
against 0.
For the comparison of A with zero, assume that $\bar{A} = 2^{k/2+k(d+1)}M$, which
could correspond to the multiplication-based reduction method shown in Table 5.1,
and that the operation $|A/2^{kx}|_{\hat{M}}$ is done using the Lookup 2 reduction
method with x = d + 2 and no quotient resolution delay (d = 0 for this operation).
For this example, the result of $|A/2^{k(d+2)}|_{\hat{M}}$ falls in the range
$(-(2v + 1)M, (2v + 1)M)$. This result can be computed with d + 3 iterations of the
loop in Step 4 of Algorithm 5.3.1, but, because these iterations are computed without
quotient resolution delay (d = 0), each one takes d clock cycles. Therefore, the
multiplication can be computed in d(d + 3) clock cycles. In summary, the multiplier
can compute multiplications with and without quotient resolution delay, but, when
performing operations involving no quotient resolution delay, the multiplier must
wait for the quotient resolution. The quotient resolution is assumed to take d clock
cycles.
Note that the algorithm just described is useful for a large set of applications.
If additional accuracy is needed for the reduction operations, one can implement
a more accurate version of the Montgomery multiplication algorithm. One such
algorithm is presented in [Oru95].
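Table 5.4: Multiplications of interest
# | Mult. | B | $\lvert AB \rvert <$ | R | n | $\lvert S_{n+d+2} \rvert <$
1 | $\lvert ABR^{-1} \rvert_{\hat{M}}$ | B | $2^k(\overline{QM}/R)^2$ | $> (2^k\,\overline{QM})^{1/2}$ | $\log_{2^k} R$ | $2\,\overline{QM}/R$
2 | $\lvert A/2 \rvert_{\hat{M}}$ | $2^{k-1}$ | $2^{k-1}\bar{A}$ | $2^k$ | 1 | $\bar{A}$
3 | $\lvert A/2^{kx} \rvert_{\hat{M}}$ | 1 | $\bar{A}$ | $2^{kx}$ | x | $(\bar{A} + \overline{QM})/2^{kx}$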
5.3.6 Two’s Complement and Binary Stored-Carry Number
Representation
The previous sections describe the high-radix, precomputation-based Montgomery
multiplication algorithm. This section concentrates on the implementation of this
algorithm using two’s complement and binary stored-carry number representation.
This section provides a brief introduction to two’s complement arithmetic and binary
stored-carry number representation. More information on these topics can be found
in [Par90, Par99, Kor93].
Two’s complement arithmetic is attractive for the implementation of a high-radix,
precomputation-based Montgomery multiplier in programmable hardware because two’s
complement arithmetic is very similar to unsigned number arithmetic. The overhead
associated with the sign in this representation is very low, effectively one bit.
Realizing the complement of a number can be done efficiently in hardware by
complementing all the bits representing a number and then adding a one to the
complemented bits; for example, the complement of $(0101)_2$ is
$(1010)_2 + 1 = (1011)_2$.
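The complement-and-increment rule can be checked with a few lines of Python. The
helper names `negate` and `to_signed` are assumptions for this illustration, and the
word width is passed explicitly because Python integers are unbounded.

```python
def negate(x, bits):
    """Two's complement negation of a `bits`-wide pattern: invert all bits, add one."""
    mask = (1 << bits) - 1
    return ((~x & mask) + 1) & mask


def to_signed(x, bits):
    """Interpret a `bits`-wide pattern as a two's complement value."""
    return x - (1 << bits) if x >> (bits - 1) else x


if __name__ == "__main__":
    y = negate(0b0101, 4)
    print(format(y, "04b"), to_signed(y, 4))   # bit pattern and its signed value
```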
In binary stored-carry number representation a number is represented in radix
two using the digit set {0,1,2} [Par90]. In this work numbers are represented in
binary stored-carry number representation as the sum of two numbers; for example,
the number A can be represented as S + C, where C and S can be treated as two’s
complement numbers in nonredundant number representation. The most significant
benefits of this representation are the following: it supports fast addition using
carry-save adders, it naturally interacts with numbers represented in nonredundant
number representation, and it supports signed number arithmetic using two’s
complement arithmetic. The most significant drawback of this representation is that
a number A is represented by two numbers C and S, each of which is approximately
as big as number A (the size in bits of C and S is approximately equal to
$\lceil \log_2 A \rceil$).
In two’s complement representation, the number $X = (x_{m-1} \ldots x_1x_0)_2$
represents the following:
$X = -x_{m-1}2^{m-1} + x_{m-2}2^{m-2} + \ldots + x_0$. Figure 5.2 shows the addition
of three two’s complement numbers X = −5, Y = −1, and Z = −7 whose result
is the number A = −13. The numbers X, Y, and Z can be added using a 3:2
carry-save adder similar to the one shown in Figure A.6. The basic building block
of a 3:2 carry-save adder is a full adder. Each set of bits that is to be added with a
full adder is enclosed in a rectangle in Figure 5.2.
When A is needed in nonredundant number representation, C and S may be
added using a carry-propagate adder. For this addition, S must be sign-extended
so its sign bit is aligned with C’s sign bit and C must be padded with a zero in its
least significant bit position.
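The example of Figure 5.2 can be reproduced with bitwise operations. The following
Python sketch uses hypothetical helper names (`carry_save_add`, `to_signed`,
`resolve`); it models one full adder per bit position and then performs the
carry-propagate addition, with the sign extension of S and the one-bit shift of C
expressed arithmetically.

```python
def to_signed(x, bits):
    """Interpret a `bits`-wide pattern as a two's complement value."""
    return x - (1 << bits) if x >> (bits - 1) else x


def carry_save_add(x, y, z, bits):
    """3:2 carry-save addition of three `bits`-wide two's complement numbers.

    Returns the sum word S and the carry word C as `bits`-wide patterns;
    C has weight two, that is, it is padded with a zero in its least
    significant position before the final addition.
    """
    mask = (1 << bits) - 1
    x, y, z = x & mask, y & mask, z & mask
    s = x ^ y ^ z                        # sum output of each full adder
    c = (x & y) | (x & z) | (y & z)      # carry output of each full adder
    return s, c


def resolve(s, c, bits):
    """Carry-propagate addition: sign-extended S plus C shifted left by one."""
    return to_signed(s, bits) + 2 * to_signed(c, bits)


if __name__ == "__main__":
    s, c = carry_save_add(-5, -1, -7, 4)
    print(format(s, "04b"), format(c, "04b"), resolve(s, c, 4))
```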
From Figure 5.2, one can deduce that the range of the sum C + S when dealing
with unsigned numbers is $[0, 3(2^m - 1)]$. For the lower limit all the bits in C and
S are set to zero and for the upper limit all the bits are set to one. When treating
C and S as two’s complement numbers in nonredundant number representation, the
range of the sum C + S is $[-3 \cdot 2^{m-1}, 3(2^{m-1} - 1)]$ or
$[-2^m - 2^{m-1}, 2^m + 2^{m-1} - 3]$. From this range, one can verify that to
support A as a two’s complement number in binary stored-carry number representation
with values in the range $[-2^m, 2^m - 1]$, both C and S must be (m + 1)-bit numbers.
The addition of multiple numbers represented in nonredundant or binary stored-carry
number representations can be performed using carry-save adder trees. The
addition of n numbers with a carry-save adder tree requires (n − 2) 3:2 carry-save
adders. The height of such a tree is estimated here using the Wallace tree
height approximation provided in [Par99], which establishes
$\lceil \log_{3/2}(n/2) \rceil$ as the lower bound for the height of an n-input tree.
Figure 5.2: Carry-save addition of two’s complement numbers (operands X, Y, and Z;
outputs S and C; S is sign-extended and C is zero-padded to form A)
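The adder count and the height bound quoted above can be compared against a simple
level-by-level model of a carry-save adder tree. In this Python sketch the function
names are assumptions, and the model greedily groups operands in threes at every
level, which is one common way to build such a tree rather than a description of the
implementation used in this work.

```python
import math


def csa_tree(n):
    """Reduce n operands to 2 with 3:2 carry-save adders, level by level.

    Returns (levels, adders). Each adder turns three operands into two,
    so every adder lowers the operand count by one.
    """
    levels = adders = 0
    while n > 2:
        groups = n // 3          # adders that can work in parallel at this level
        n -= groups
        adders += groups
        levels += 1
    return levels, adders


def height_lower_bound(n):
    """Lower bound ceil(log_{3/2}(n/2)) on the height of an n-input tree."""
    return math.ceil(math.log(n / 2, 1.5))


if __name__ == "__main__":
    for n in (4, 5, 9, 16, 33):
        levels, adders = csa_tree(n)
        print(n, adders, levels, height_lower_bound(n))
```

For the five-input adder suggested by s = v = 2 in Figure 5.1, the model uses three
carry-save adders arranged in three levels.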
The carry-save adder tree could be built with carry-save adders whose ranges
increase as the range of the sums increases at different levels of the tree. The
widest carry-save adder will be needed at the root of the tree because this adder
will have to handle the entire range of the results. As the range increases towards
the root of the tree, it becomes necessary to sign-extend the outputs of the
carry-save adders.
To facilitate the analysis and the construction of the carry-save adder trees, here
it is assumed that all the nodes of the tree use carry-save adders of a common width.
Each carry-save adder in this model handles the range of the addition results. This
approach requires that the inputs to the tree be sign-extended so their widths match
the width of the carry-save adders.
The main drawbacks of binary stored-carry number representation stem from its
representation of a number with two numbers. First, storing a number in binary
stored-carry number representation requires twice as many bits as storing a number
in nonredundant number representation. Second, determining the magnitude of a
number is difficult, which makes sign determination and comparisons also difficult.
Determining the exact magnitude of a number in binary stored-carry number repre-
sentation requires its conversion to nonredundant number representation, a process
that requires carry-propagate addition.
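The trade-off described above can be seen in a minimal Python sketch of a 3:2
carry-save addition. The sketch relies on Python integers behaving as sign-extended
two's complement numbers of unbounded width, which is an assumption of the
illustration and not a property of the fixed-width hardware; the function names are
hypothetical.

```python
def carry_save_add(x, y, z):
    """3:2 carry-save addition: return (S, C) with S + C == x + y + z.

    Every bit position is an independent full adder, so no carry ripples
    across positions; the carries are simply stored in C, shifted left by one.
    """
    s = x ^ y ^ z                              # bitwise sum
    c = ((x & y) | (x & z) | (y & z)) << 1     # bitwise majority = carries
    return s, c


def to_nonredundant(s, c):
    """Resolve the stored carries; this is the carry-propagate addition."""
    return s + c


if __name__ == "__main__":
    s, c = carry_save_add(5, -3, 7)
    # The pair (s, c) alone does not reveal the sign or magnitude cheaply;
    # the value is only known after the carry-propagate step.
    print(s, c, to_nonredundant(s, c))
```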
Algorithm 5.3.1 and the implementations of it described in the following sections
overcome some of the limitations of binary stored-carry number representation while
taking advantage of the processing speed provided by this representation. The main
techniques employed in this work are discussed next.
The use of Booth recoding in Algorithm 5.3.1 alleviates the memory requirements
when using binary stored-carry number representation.
The ability to amortize precomputation costs over multiple operations allows
the use of precomputed values that are represented using nonredundant number
representation. This is especially true for precomputed values associated with the
modulus. A large number of the cryptographic algorithms in use today require the
computation of a large number of operations, all of which use the same modulus.
On-the-fly conversion from binary stored-carry number representation to nonre-
dundant number representation also reduces the hardware complexity of the multi-
plier. In the implementation of Algorithm 5.3.1, the operand B can be converted
to nonredundant number representation at the rate of one digit per iteration of the
loop in Step 4.
Some elliptic curve algorithms, as previously indicated, embed comparisons in
time critical functions. Section 5.3.5 discusses approaches to limit the output range
of Algorithm 5.3.1. The final reduction of a result to a value in the range (−M,M)
may require conversion to nonredundant number representation. Computing these
operations at a high rate may require a large carry-propagate adder. To reduce the
complexity of this adder, this work recommends computing these reductions
in parallel with other operations. For example, a point addition operation using the
projective coordinates algorithms described in [IEE98] requires eleven multiplica-
tions and two comparisons when Z is equal to one (Z = 1). The processing cost of
these comparisons can be absorbed by computing them in parallel with some of the
multiplications required to compute a point addition operation.
The new Montgomery multiplier requires a precomputation engine and a multi-
plication engine. As shown in Figure 5.1, an adder serves as the main precompu-
tation engine. This adder is used to generate the precomputed values in Steps 2 to
3.1.1 of Algorithm 5.3.1. The precomputed values are forwarded to the multiplier,
which computes Steps 4 to 5 of Algorithm 5.3.1.
The complexity and the performance of some of the circuits that form part of the
adder and the multiplier are a function of the range of values they need to support.
A range of values defines the number of bits required to represent the different values
within a range. The number of bits required to represent a range of values is referred
to here as bus width. Table 5.5 provides a summary of the bus widths used by the
adder and the multiplier circuits.
Table 5.6 approximates the bus width values. The approximations in this table
assume the use of the Multiplication reduction method (see Table 5.1 for range),
use 2^m as the approximate value of M (m = ⌈log_2 M⌉), and assume that k/2 is an
integer. Because they assume the Multiplication reduction method, the results in
Table 5.6 represent the worst-case values for the reduction methods considered
here.
Table 5.5: Bus widths
Operands    Operand range              Symbol  Bus width (in bits)      Notes
A, B        (−A, A)                    w0      ⌈log_2(2A − 1)⌉          A = B = 2^{k/2} QM/R > M
A[i]        (−2^{r−1} A, 2^{r−1} A)    w1      ⌈log_2(2^r A − 1)⌉       Precomputed values
α[i, j]     (−2^{u−1} M, 2^{u−1} M)    w′1     ⌈log_2(2^u M − 1)⌉       Precomputed values
S_{i+1}     (−2^k A, 2^k A)            w2      ⌈log_2(2^{k+1} A − 1)⌉   CSA adder tree result
S_{n+d+2}   (−2QM/R, 2QM/R)            w3      ⌈log_2(4QM/R − 1)⌉       Multiplication result
Table 5.6: Approximate bus widths (Multiplication reduction method)
Bus width symbol   Approximate bus width (in bits)
w0                 m + k(d + 1.5) + 1
w1                 w0 + (r − 1)
w′1                m + u
w2                 w0 + k
w3                 w0 − 0.5k + 1
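The approximations in Table 5.6 can be evaluated with a few lines of Python. The
function name, the dictionary keys, and the example parameter values below are
hypothetical and serve only to show how the table entries relate to one another;
the sketch assumes, as the table does, that k/2 is an integer.

```python
def approx_bus_widths(m, k, d, r, u):
    """Evaluate the approximate bus widths of Table 5.6 (in bits)."""
    if k % 2:
        raise ValueError("Table 5.6 assumes that k/2 is an integer")
    # k(d + 1.5) written with integers only: k(2d + 3)/2
    w0 = m + (k * (2 * d + 3)) // 2 + 1
    return {
        "w0": w0,                  # operands A and B
        "w1": w0 + (r - 1),        # precomputed values A[i]
        "w1'": m + u,              # precomputed values alpha[i, j]
        "w2": w0 + k,              # carry-save adder tree result
        "w3": w0 - k // 2 + 1,     # multiplication result
    }


if __name__ == "__main__":
    # Illustrative parameters only (not taken from the text).
    print(approx_bus_widths(m=192, k=8, d=1, r=4, u=4))
```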
5.3.7 Area and Storage
The most complex operation of Algorithm 5.3.1 is the computation of the two scalar
multiplications ÃB_i and Q̃α_i; the multiplication in Step 5 is just a shift operation.
These scalar multiplications could be computed with no precomputation using a
classical scalar multiplier. For the computation of a scalar multiplication, a classical
scalar multiplier would add up to k/2 numbers when using Booth recoding and k
numbers when using no recoding. Assuming that all the numbers in Step 4.6 of
Algorithm 5.3.1 are of the same size, the computation of this step would require
the addition of k + 1 numbers when using Booth recoding or 2k + 1 when using no
recoding. On the other hand, when using precomputation, the computation of Step
4.6 requires the addition of s + v + 1 numbers.
A limiting factor in the practical implementation of multiplication with precom-
putation is the size of the memory required to store precomputed values. The use
of Booth recoding in Algorithm 5.3.1 reduces the memory requirements to about
half of what is required when using no recoding. (Note that no storage needs to be
provided for values known to have zero value; for example, 0 ∗ A).
Assuming the use of the parameters defined in Section 5.3.5 for the Multiplication
reduction method, the memory requirements for Algorithm 5.3.1 are the following.
Given that each precomputed product used in the computation of ÃB_i requires
m + k(d + 1.5) + r bits of storage (w1 in Table 5.6), that each precomputed scalar
product used in the computation of Q̃α_i requires m + u bits of storage (w′1 in
Table 5.6), and assuming that each processing unit stores its own set of precomputed
values, Algorithm 5.3.1 requires approximately
2^{r−1} s (m + k(d + 1.5) + r) + 2^{u−1} v (m + u) bits of storage when all the
numbers are stored in nonredundant number representation. Algorithm 5.3.1 requires
approximately 2^r s (m + k(d + 1.5) + r) + 2^u v (m + u) bits of storage when all
the numbers are stored in binary stored-carry number representation.
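The two storage estimates above can be checked numerically with the following
Python sketch. The function name and the example parameters are hypothetical; the
formulas are the ones stated in the preceding paragraph, with k/2 assumed to be an
integer.

```python
def storage_bits(m, k, d, r, s, u, v, stored_carry=False):
    """Approximate storage for the precomputed values of Algorithm 5.3.1.

    m, k, d, r, s, u, v follow the text; stored_carry selects binary
    stored-carry storage, which doubles the number of bits per value.
    """
    w1 = m + (k * (2 * d + 3)) // 2 + r     # m + k(d + 1.5) + r
    w1_prime = m + u
    # Booth recoding leaves 2^(r-1) (resp. 2^(u-1)) nonzero magnitudes per digit.
    ab_bits = (1 << (r - 1)) * s * w1
    q_alpha_bits = (1 << (u - 1)) * v * w1_prime
    total = ab_bits + q_alpha_bits
    return 2 * total if stored_carry else total


if __name__ == "__main__":
    # Illustrative parameters only (not taken from the text).
    params = dict(m=192, k=8, d=1, r=4, s=2, u=4, v=2)
    print(storage_bits(**params))
    print(storage_bits(**params, stored_carry=True))
```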
The relationship between r and s, and between u and v, allows designers to trade
memory size against the number of processing units; for example, to
achieve a given k, a designer could fix r and then derive s, which defines the
number of processing units required to compute ÃB_i. Similarly, the designer can
fix u and then derive v, which defines the number of processing units
required to compute Q̃α_i. This approach is particularly attractive for architectures
that employ fixed size memory elements, such as field programmable gate arrays.
5.4 Adder
The adder of the GF (p) arithmetic unit, referred to here as GF (p) adder, is used
to compute modular additions and the precomputation values needed in Algorithm
5.3.1. The GF (p) adder also incorporates the logic used to compare numbers against
zero (zero test circuit in Figure 5.1).
Figure 5.1 shows a block diagram of the GF (p) arithmetic unit. The circuit in
this figure is suitable for implementations of Algorithm 5.3.1 using numbers repre-
sented in nonredundant number representation.
In general, the precomputation of the terms α[i, j] in Steps 3 to 3.1.1 of Algo-
rithm 5.3.1 can be amortized over a large number of operations. Consequently, it
is often advantageous to represent these numbers in nonredundant number
representation. On the other hand, the terms A[i] often change from one
multiplication to the next, which precludes amortizing their processing cost over
multiple operations.
For implementations of the GF (p) arithmetic unit using binary stored-carry
and nonredundant number representations, the generation of a precomputed value
in nonredundant number representation usually requires its generation in binary
stored-carry number representation followed by its conversion to nonredundant num-
ber representation.
The conversion of a number from binary stored-carry number representation to
nonredundant number representation requires carry-propagate addition. For the
large operands used for cryptographic applications, a fast carry-propagate adder
requires large amounts of logic; for example, O(n log2 n) gates for carry-lookahead
adders of n bits.
For the GF (p) arithmetic unit, it is cost effective to use a digit-serial carry-
propagate adder because this architecture supports the concurrent operation of the
multiplier and the adder circuits. By allowing the concurrent operation of the mul-
tiplier and the adder, implementations could use simpler adders than those required
if concurrent operation were not possible.
A digit-serial adder computes the addition of two m-bit numbers in ⌈m/D⌉ clock
cycles, where D represents the digit size. A digit-serial adder uses a D-bit carry-
propagate adder, a flip-flop to propagate carries from one cycle to the next, and two
shift registers to serve the operands to be added to the carry-propagate adder. The
digit size can be adjusted to achieve the desired time-area goals.
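The behavior of such an adder can be modeled in software as follows. This is a
minimal sketch for nonnegative m-bit operands; the function name is hypothetical,
and the loop variable stands in for the shift registers that serve one D-bit digit
per clock cycle.

```python
def digit_serial_add(a, b, m, D):
    """Add two nonnegative m-bit numbers D bits at a time.

    Returns ((a + b) mod 2^m, number of clock cycles used).
    """
    digit_mask = (1 << D) - 1
    carry = 0            # models the flip-flop that holds the carry
    result = 0
    cycles = 0
    for shift in range(0, m, D):
        da = (a >> shift) & digit_mask      # digit served by shift register 1
        db = (b >> shift) & digit_mask      # digit served by shift register 2
        total = da + db + carry             # D-bit carry-propagate adder
        result |= (total & digit_mask) << shift
        carry = total >> D                  # carry into the next cycle
        cycles += 1
    return result & ((1 << m) - 1), cycles


if __name__ == "__main__":
    print(digit_serial_add(733, 435, m=10, D=4))
```

Choosing a larger D shortens the loop at the cost of a wider carry-propagate
adder, which is the time-area adjustment mentioned above.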
The use of digit-serial adders in time critical operations is not desirable because
of their relatively long computational time. However, a digit-serial adder can be
used to convert the terms α[i, j] from binary stored-carry number representation
to nonredundant number representation because the precomputation cost of these
terms can be amortized over a large number of operations for the majority of the
cryptographic algorithms in use today. Because the terms A[i] are usually used for
just one multiplication, it is usually best to keep these numbers in binary stored-
carry number representation.
Figure 5.3 shows a block diagram of the GF (p) adder. This adder supports
addition and subtraction of numbers using two’s complement arithmetic.
The adder in Figure 5.3 contains a carry-save adder (CSA Accumulator), a
digit-serial carry-propagate adder (Digit-Serial CPA), a zero test circuit, a carry-
propagate adder sign circuit (CPA Sign), and, optionally, a carry-save adder sign
estimation circuit (CSA Sign Est.).
The zero test circuit inspects each of the intermediate values generated by the
digit-serial adder. As soon as the zero test circuit detects a nonzero value, it
deactivates its zero output. The output remains latched until the circuit is reset. The
zero test circuit is composed of a D-bit OR gate and a latch circuit. The D-bit
input gate can be implemented with a binary tree architecture. The latch circuit
can be implemented with a flip-flop and a 2:1 MUX.
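The latching behavior of the zero test circuit can be sketched in Python as shown
below. The function name is hypothetical, and the list of digits stands in for the
sequence of D-bit values produced by the digit-serial adder after a reset.

```python
def zero_test(digits):
    """Return 1 if every digit seen since reset is zero, otherwise 0."""
    zero = 1                                # state of the flip-flop after reset
    for digit in digits:
        nonzero = int(digit != 0)           # D-bit OR gate
        zero = 0 if nonzero else zero       # 2:1 MUX: clear once, then hold
    return zero


if __name__ == "__main__":
    print(zero_test([0, 0, 0]), zero_test([0, 3, 0]))
```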
The carry-propagate adder sign circuit forwards the sign bit of the value stored in
the S shift register. When using two’s complement arithmetic, the most significant
bit of the S shift register represents the sign of the value stored in it. The carry-
propagate adder sign circuit just forwards the content of the most significant bit of
the S shift register. Note that the value of this bit is valid at the end of the addition.
While a digit-serial addition is in progress, the value of this bit is undefined. The
carry-propagate sign circuit requires zero logic gates.
Optionally, the GF (p) adder circuit could be equipped with a carry-save adder
sign estimation circuit. This circuit can be used to accelerate modulo reductions of
multiplication results, especially where the multiplication results cover a wide range
of values; for example, multiplication results in the range (−2kM, 2kM) where k is
large. The range of a result can be reduced by adding multiples of M and then
checking for changes in the estimated sign of the accumulated value. A binary
approximation method can be used to reduce an accumulated value so that it lies in a
range near (−M,M). Because the signs are being estimated, the exact range of
an accumulated value is unknown. The uncertainties can be removed with carry-
propagate additions. This approach takes advantage of the speed of carry-save
addition and the resolution of carry-propagate addition.
Reduction using carry-save sign estimation is not pursued here. This work
proposes an approximation method in Section 5.3.5 that makes the use of a sign
estimation circuit unnecessary. Readers interested in carry-save sign estimation can
find additional information in [Koc90b, Koc90a].
Depending on the structure of the GF (p) arithmetic unit’s register file, the
GF (p) adder circuit could have one or two inputs. If the register file stores numbers
in nonredundant number representation, the GF (p) adder needs to support only the
I_s input. For this configuration, the GF (p) adder can use a 3:2 carry-save adder
in the CSA accumulator circuit. When the register file stores numbers in binary
stored-carry number representation, the GF (p) adder must support the two inputs
I_s and I_c. For this configuration, the GF (p) adder needs a 4:2 carry-save adder
in the CSA accumulator circuit.
The GF (p) adder is optimized for accumulation, which is one of the critical
operations that it must perform in support of the multiplier. The inputs to the
GF (p) adder are latched in the register file; therefore, the GF (p) adder does not
have to latch its inputs.
The addition of two numbers requires one addition operation if one of the num-
bers is already in the accumulator (C accumulator and S accumulator). If that is
not the case, the addition of two numbers requires two additions. The first addition
adds a number with an accumulated zero value (the accumulator must be reset for
this operation). The second addition adds a new value to the value stored in the
accumulator.
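The accumulation sequence described above can be modeled with a short Python
sketch. The class and method names are hypothetical, the 4:2 adder is modeled as
two cascaded 3:2 adders (one common construction, assumed here and not stated in
the text), and Python's unbounded integers stand in for sign-extended two's
complement words.

```python
def csa_3_2(x, y, z):
    """3:2 carry-save adder: returns (sum word, carry word)."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


class CsaAccumulator:
    """Model of the S accumulator and C accumulator registers."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.s = 0
        self.c = 0

    def add(self, i_s, i_c=None):
        """Accumulate one input; pass i_c for a stored-carry input."""
        s, c = csa_3_2(self.s, self.c, i_s)      # 3:2 step for I_s
        if i_c is not None:
            s, c = csa_3_2(s, c, i_c)            # second step completes a 4:2 adder
        self.s, self.c = s, c

    def value(self):
        """Nonredundant value; in hardware this needs a carry-propagate adder."""
        return self.s + self.c


if __name__ == "__main__":
    acc = CsaAccumulator()      # reset: accumulated value is zero
    acc.add(7)                  # first addition loads the first number
    acc.add(-12)                # second addition adds the new value
    print(acc.value())
```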
The GF (p) adder has two sets of outputs that support multiple adder
configurations. The outputs O_csa_c and O_csa_s are mainly used to output numbers in
binary stored-carry number representation. The output O_cpa_s is generally used
to output numbers in nonredundant number representation. Alternatively, the pair
of outputs O_cpa_c and O_cpa_s could be used to forward numbers in binary
stored-carry number representation. Not every possible configuration of the GF (p)
adder needs to be supported; an implementation should support only those
configurations that make sense for it.
Figure 5.3: GF (p) adder
5.4.1 Complexity, Critical Path Delay, and Performance
Table 5.7 summarizes the logic complexity of the GF (p) adder. Estimates are pro-
vided for versions of the GF (p) adder whose inputs are represented in nonredun-
dant number representation and for versions whose inputs are represented in binary
stored-carry number representation.
Table 5.7 provides estimates for implementations with logic gates, generic gates,
and FPGA logic. These estimates are based on the models introduced in Appendix
A.
Table 5.7: Complexity of GF (p) adder
Technology     Input number representation
               Nonredundant                           Binary stored-carry
Gates          (7w1 + 3D) AND + (4w1 + 3D) OR         (10w1 + 3D) AND + (6w1 + 3D) OR
               + (3w1 + 2D) XOR + 4w1 FF              + (6w1 + 2D) XOR + 4w1 FF
Generic gates  (14w1 + 8D) GG + 4w1 FF                (22w1 + 8D) GG + 4w1 FF
FPGA logic     (5w1 + 2D + ⌈(D − 1)/(L − 1)⌉) LUT     (8w1 + 2D + ⌈(D − 1)/(L − 1)⌉) LUT
               + 4w1 FF                               + 4w1 FF
Table 5.8 defines the critical path delays of the two paths expected to define the
critical path delay of the GF (p) adder. Estimates are provided for versions of the
GF (p) adder whose inputs are represented in nonredundant number representation
and for versions whose inputs are represented in binary stored-carry number rep-
resentation. The symbols NR and BSC are used in Table 5.8 to identify numbers
represented in nonredundant and binary stored-carry number representations. The
estimates in the table are based on the models introduced in Appendix A.
The path I_s/I_c to C/S accum. accounts for the delay through the carry-save
adder, and the path CPA adder to S shift reg. accounts for the delay of the
carry-propagate adder and the shift register. For an implementation, the critical path
delay will be the larger of the two path delays specified.
The critical path delay of the carry-propagate adder will be very susceptible to
the underlying hardware platform. The delays specified in Table 5.8 assume the use
of ripple-carry adders as carry-propagate adders. The delays associated with ripple-
carry adders will be lower for platforms that incorporate fast carry propagation
logic.
Table 5.8: Critical path delay of GF (p) adder
Path                       Input rep.  Delay                  Gates                        Gen. gates  FPGA logic
                                                                                           (in T_G)    (in T_L)
I_s/I_c to C/S accum.      NR          T_TC + T_{3:2 CSA}     3T_X                         4           2
I_s/I_c to C/S accum.      BSC         T_TC + T_{4:2 CSA}     5T_X                         7           3
CPA adder to S shift reg.  NR & BSC    T_SR + T_{CPA,D}       (D + 1)T_A + (2D + 1)T_O     3D + 2      D + 1
                                       = T_SR + T_{RA,D}
Table 5.9 summarizes the latency and the throughput of the GF (p) adder. This
table includes results for carry-save addition as well as carry-propagate addition.
The carry-propagate addition is assumed to be done using a digit-serial adder of
digit size equal to D. The latency for carry-propagate addition is given in terms of
the largest number that the GF (p) adder needs to support. The size of the largest
number in bits is w1.
Table 5.5 provides a definition of w1, and Table 5.6 provides an approximation
of its size for the parameters of interest here.
Table 5.9: Performance of GF (p) adder
Addition type     Latency (in # clocks)   Throughput (in # operations/# clocks)
Carry-save        1–2                     1/(1–2)
Carry-propagate   ⌈w1/D⌉                  1/⌈w1/D⌉
The performance of the GF (p) adder, together with the number representation
and the reduction method used, define the time spent doing precomputations before
each multiplication. The precomputation time is normalized in Section 5.3.4 in
terms of the constants ab and aq.
For most applications, the precomputed values |iα2^{uj}|_M̂ in Algorithm 5.3.1 are
infrequently computed. Therefore, the value of aq is not critical. On the
other hand, the precomputed values iA are computed for most multiplications.
Using the estimates in Table 5.9, the computation of each scalar product iA
requires one clock cycle when these scalar products are represented in binary
stored-carry number representation. When the iA scalar products are represented in
nonredundant number representation, the computation of each scalar product takes
approximately ⌈w1/D⌉ clock cycles.
In summary, ab is approximately equal to one for scalar products iA represented
in binary stored-carry number representation, and it is approximately equal
to ⌈w1/D⌉ for scalar products represented in nonredundant number representation.
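The two cases summarized above reduce to a one-line calculation, sketched here in
Python. The function name and the example values of w1 and D are hypothetical.

```python
def clocks_per_scalar_product(w1, D, stored_carry):
    """Approximate clocks needed to generate one scalar product iA.

    One carry-save addition when the product stays in binary stored-carry
    form; ceil(w1 / D) digit-serial cycles when it must be nonredundant.
    """
    if stored_carry:
        return 1
    return -(-w1 // D)          # integer ceiling of w1 / D


if __name__ == "__main__":
    print(clocks_per_scalar_product(216, 16, stored_carry=True))
    print(clocks_per_scalar_product(216, 16, stored_carry=False))
```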
The computation of a scalar product |iα2^{uj}|_M̂ could require multiple carry-save
and carry-propagate additions, which could result in large values for aq. This is
not generally a problem because the precomputation cost associated with the scalar
products |iα2^{uj}|_M̂ can be amortized over a large number of operations for most
cryptographic algorithms.
5.5 New Montgomery Multiplier
The GF (p) arithmetic unit incorporates the GF (p) adder previously discussed and
the new Montgomery multiplier core circuit. The new Montgomery multiplier core
circuit is referred to here as GF (p) multiplier.
Together the GF (p) adder and the GF (p) multiplier implement the high-radix,
precomputation-based Montgomery multiplication with quotient pipelining algo-
rithm described by Algorithm 5.3.1.
The following sections study the complexity and the performance of some imple-
mentations of the GF (p) multiplier that use nonredundant and binary stored-carry
number representations.
Figure 5.4 shows a block diagram of one of the possible implementations of the
GF (p) multiplier. The inputs to the multiplier are the following: B_c and B_s
are the busses over which the operand B is loaded in binary stored-carry number
representation; AB_D_c and AB_D_s are the busses over which the precomputation
values A[i] are loaded into the multiplier; AB_A is the address bus that identifies
where the precomputed values A[i] are to be stored; Qα_D_c and Qα_D_s are the
busses over which the precomputation values α[i, j] are loaded into the multiplier;
and Qα_A is the address bus that identifies where the precomputed values α[i, j] are
to be stored. The outputs S_{n+d+2}_c and S_{n+d+2}_s are the busses that carry the
multiplication results.
The precomputed values A[i] and α[i, j] are generated by the GF (p) adder.
These values are loaded into the multiplier over the AB_D_c/AB_D_s and the
Qα_D_c/Qα_D_s busses. The O_csa_c/O_csa_s busses of the GF (p) adder can be
connected to the AB_D_c/AB_D_s and the Qα_D_c/Qα_D_s busses of the GF (p)
multiplier, when these busses transport numbers in binary stored-carry number
representation. When the α[i, j] precomputed values are in nonredundant number
representation, the O_cpa_s bus of the GF (p) adder can be connected to the Qα_D_s
bus of the GF (p) multiplier.
Figure 5.4 assumes that the precomputation values α[i, j] are generated and
stored in binary stored-carry number representation. When these numbers are
generated and stored in nonredundant number representation, the following busses
and signals can be eliminated: Qα_D_c, |ql_{iv} α|_M̂_c to
|ql_{iv+v−1} α 2^{u(v−1)}|_M̂_c, and ql_{iv}α_carry_c to ql_{iv+v−1}α_carry_c.
The main components of the GF (p) multiplier and their functions are summa-
rized in Table 5.10. Each of the components of the GF (p) multiplier is described in
detail in the next sections.
Figure 5.4: GF (p) multiplier
Table 5.10: Components of the GF (p) multiplier
Component           Description
BSC shift           Shift register that stores numbers in binary stored-carry number
register            representation and outputs two k-bit digits per clock cycle.
                    One of the digits corresponds to a digit from B_c and the other
                    to a digit from B_s.
BSC to NR           Converts the digits generated by the BSC shift register to
converter 1         nonredundant number representation and formats the
                    results for the Booth recoder 1 circuit.
Booth               Recodes the digits generated by the BSC to NR converter 1
recoder 1           circuit. (Digit-by-digit recoding.)
Booth               Recodes the digits generated by the S_i/2^k circuit and the first
recoder 2           delay register R. (Digit-by-digit recoding.)
ÃB_i scalar         Generates the products bl_{is+j} A 2^{rj} for j = 0 . . . s − 1.
multiplier          Each of these products can be a number in nonredundant or
                    binary stored-carry number representation.
Q̃α_{i−d} scalar     Generates the products |ql_{iv+j} α 2^{uj}|_M̂ for j = 0 . . . v − 1.
multiplier          Each of these products can be a number in nonredundant or
                    binary stored-carry number representation.
Carry-save          Computes S_{i+1} = ⌊S_i/2^k⌋ + Q̃α_{i−d} + ÃB_i by adding the terms
adder tree          bl_{is+j} A 2^{rj} for j = 0 . . . s − 1, the terms
                    |ql_{iv+j} α 2^{uj}|_M̂ for j = 0 . . . v − 1, and the terms that
                    form ⌊S_i/2^k⌋.
S_i/2^k             Generates the quotient ⌊S_i/2^k⌋ and the remainder |S_i|_{2^k}.
                    (S_i = ⌊S_i/2^k⌋ 2^k + |S_i|_{2^k})
Register            Stores the value of S_{i+1} computed in each iteration of the
                    loop in Step 4 of Algorithm 5.3.1.
R                   k-bit registers that store the values Q_{i−1} down to Q_{i−d}.
                    These values are needed in Step 5 of Algorithm 5.3.1.
FF                  One-bit register that stores the value qh_{i−d−1}.
                    This value is needed in Step 5 of Algorithm 5.3.1.
5.5.1 BSC Shift Register
The BSC shift register is a parallel-in/serial-out shift register. This shift register
supports the parallel loading of the multiplication operand B. B can be represented
in nonredundant or in binary stored-carry number representation. B is represented
in binary stored-carry number representation as the sum of B_c and B_s.
When B is represented in nonredundant number representation, the shift register
outputs one digit of B per clock cycle. When B is represented in binary stored-carry
number representation, the shift register outputs two digits per clock cycle: Bi_c and
Bi_s. Bi_c is a digit of B_c and Bi_s is a digit of B_s.
The complexity and the critical path delay of the BSC shift register are sum-
marized in Table 5.11. In this table, the symbols NR and BSC are used to identify
numbers represented in nonredundant and binary stored-carry number representa-
tions.
The estimates are based on the complexity and timing models introduced in
Appendix A.
The complexity of the BSC shift register is a function of the width of the busses
carrying the B operand. Figure 5.4 shows two busses leading into the BSC shift
register. These two busses are needed when B is represented in binary stored-carry
number representation. When B is represented in nonredundant number represen-
tation, only the B_s bus is needed.
The B_c and B_s busses are each w0 bits wide. Table 5.5 provides a definition
for w0, and Table 5.6 provides an approximation of its size for the parameters of
interest here.
Table 5.11: Complexity and critical path delay of BSC shift register
Technology     Representation  Complexity                      Critical path delay
Gates          NR              2w0 AND + w0 OR + w0 FF         T_A + T_O
               BSC             4w0 AND + 2w0 OR + 2w0 FF       T_A + T_O
Generic gates  NR              3w0 GG + w0 FF                  2T_G
               BSC             6w0 GG + 2w0 FF                 2T_G
FPGA logic     NR              w0 LUT + w0 FF                  T_L
               BSC             2w0 LUT + 2w0 FF                T_L
5.5.2 BSC to NR Converter 1
The BSC to NR converter 1 circuit converts the operand B, when it is represented
in binary stored-carry number representation by the sum of B_c and B_s, to
nonredundant number representation on a digit-by-digit basis.
When B is represented in binary stored-carry number representation, the BSC
shift register forwards one digit from B_c, Bi_c, and one digit from B_s, Bi_s, to the
BSC to NR converter 1 in each clock cycle. The converter computes carry_i + B_i =
Bi_c + Bi_s + carry_{i−1}, where B_i is the k-bit digit of the sum, carry_i is the
carry out of the k-bit addition (and therefore has weight 2^k) generated during
iteration i of the loop in Step 4 of Algorithm 5.3.1, and carry_{−1} = 0.
When B is represented in nonredundant number representation, the converter
sets B_i equal to Bi_s.
The converter also stores the most significant bit of B_i, bh_i, and makes it available
in the next iteration of the loop in Step 4 of Algorithm 5.3.1. To guarantee that B_i
is zero for i greater than or equal to t, as required in Step 4.3 of Algorithm 5.3.1,
the register holding bh_{i−1} must be reset before iteration t of the loop begins.
Figure 5.5 shows a block diagram of the BSC to NR converter 1 circuit suitable
for implementations for which B is represented in binary stored-carry number
representation. This circuit consists of a k-bit carry-propagate adder and two flip-flops.
One of the flip-flops stores the carries and the other stores bh_{i−1}. When B is
represented in nonredundant number representation, only the register that holds
bh_{i−1} is needed.
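The digit-by-digit conversion performed by this circuit can be sketched in Python
as follows. The function name and the representation of B_c and B_s as nonnegative
Python integers are assumptions of the sketch; the two local variables carry and
bh_prev play the role of the two flip-flops.

```python
def bsc_to_nr_digits(b_c, b_s, k, n_digits):
    """Convert B = b_c + b_s to nonredundant form, one k-bit digit per cycle.

    Returns (list of (B_i, bh_{i-1}) pairs, final carry).
    """
    digit_mask = (1 << k) - 1
    carry = 0          # carry flip-flop, carry_{-1} = 0 after reset
    bh_prev = 0        # flip-flop holding the MSB of the previous digit
    outputs = []
    for i in range(n_digits):
        bi_c = (b_c >> (i * k)) & digit_mask
        bi_s = (b_s >> (i * k)) & digit_mask
        total = bi_c + bi_s + carry          # k-bit carry-propagate adder
        b_i = total & digit_mask
        carry = total >> k                   # carry_i, used in the next cycle
        outputs.append((b_i, bh_prev))       # values handed to Booth recoder 1
        bh_prev = b_i >> (k - 1)             # bh_i, available next iteration
    return outputs, carry


if __name__ == "__main__":
    print(bsc_to_nr_digits(0x3A7, 0x1F5, k=4, n_digits=3))
```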
The complexity and the critical path delay of the BSC to NR converter 1 circuit
are summarized in Table 5.12. In this table, the symbols NR and BSC are used
to identify numbers represented in nonredundant and binary stored-carry number
representations. The estimates are based on the complexity and timing models
introduced in Appendix A.
Figure 5.5: BSC to NR converter 1 (inputs Bi_c and Bi_s, k bits each; outputs B_i,
k bits, and bh_{i−1}; two resettable flip-flops hold carry_{i−1} and bh_{i−1})
Table 5.12: Complexity and critical path delay of BSC to NR converter 1
Technology     Representation  Complexity                          Critical path delay
Gates          NR              1 FF                                0
               BSC             3k AND + 2k OR + 2k XOR + 2 FF      k(T_A + 2T_O)
Generic gates  NR              1 FF                                0
               BSC             7k GG + 2 FF                        3k T_G
FPGA logic     NR              1 FF                                0
               BSC             2k LUT + 2 FF                       k T_L
5.5.3 Booth Recoders 1 and 2
The GF (p) multiplier incorporates two Booth recoders: Booth recoder 1 and Booth
recoder 2. The input to each recoder consists of a digit with value in the range
[0, 2^k) and a signal with value in the range [0, 1]; for example, the inputs to
Booth recoder 1 are B_i, which is a digit with value in the range [0, 2^k), and
bh_{i−1}, which is a signal with value in the range [0, 1]. The output of each recoder
is a set of digits represented using a custom sign-magnitude representation.
Booth recoder 1 recodes the sum of its inputs with s digits, where the range of
each digit is [−2^{r−1}, 2^{r−1}]. Booth recoder 2 recodes the sum of its inputs with v
digits, where the range of each digit is [−2^{u−1}, 2^{u−1}].
The recoded digits are expressed using a custom sign-magnitude representation.
For each digit, this representation uses a control signal to indicate if the value of
the digit is zero, uses a control signal to indicate the sign of the digit, and uses a
set of signals to represent the absolute value of the digit. The absolute value of a
digit is generally represented by a binary representation of its value. The exception
is the maximum absolute value, which uses the binary representation corresponding
to zero; for example, in Booth recoder 1 the code (0 . . . 00)_2 represents the value
2^{r−1}.
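The custom sign-magnitude representation can be made concrete with the following
Python sketch. The function names are hypothetical, and the sketch assumes that the
magnitude field is r − 1 bits wide, which is consistent with the all-zero code being
reused for the maximum magnitude but is not stated explicitly in the text.

```python
def encode_digit(value, r):
    """Encode a recoded digit in [-2^(r-1), 2^(r-1)] as (zero, sign, magnitude)."""
    max_mag = 1 << (r - 1)
    if not -max_mag <= value <= max_mag:
        raise ValueError("digit out of range")
    zero = int(value == 0)                 # zero control signal
    sign = int(value < 0)                  # sign control signal, 1 = negative
    magnitude = abs(value) % max_mag       # 2^(r-1) wraps to the all-zero code
    return zero, sign, magnitude


def decode_digit(zero, sign, magnitude, r):
    """Inverse of encode_digit."""
    if zero:
        return 0
    mag = magnitude if magnitude else 1 << (r - 1)   # all-zero code means 2^(r-1)
    return -mag if sign else mag


if __name__ == "__main__":
    for value in (-8, -3, 0, 5, 8):
        print(value, encode_digit(value, r=4))
```

Because the zero signal disambiguates a true zero from the maximum magnitude, the
magnitude field can address exactly 2^{r−1} stored scalar products.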
Booth recoding generates digits with values in a range [−2^{x−1}, 2^{x−1}]. When the
zero control signals are not used, a precomputation based scalar multiplier needs to
store 2^{x−1} + 1 scalar products per digit. On the other hand, when the zero control
signals are used and the products known to have zero value are generated with
logic, a precomputation based scalar multiplier needs to store 2^{x−1} scalar products
per digit. Storing a number of scalar products that is a power of two is attractive for
many implementations because the number of memory locations in most memory
chips and ASIC cores is a power of two.
This work recommends the use of the Modified Booth Recoding Algorithm, which
is a parallel Booth recoding algorithm. This algorithm is described in detail in
[Par99].
Figure 5.6 shows the recoding of the sum B_i + bh_{i−1} using the Modified Booth
Recoding Algorithm as bl_{is+1} 2^r + bl_{is}. In this example,
B_i = Σ_{j=0}^{k−1} b_{ik+j} 2^j, bh_{i−1} = b_{ik−1}, bh_i = b_{ik+k−1},
bl_{is} = −b_{ik+3} 2^3 + (Σ_{j=0}^{2} b_{ik+j} 2^j) + bh_{i−1},
bl_{is+1} = −b_{ik+4+3} 2^3 + (Σ_{j=0}^{2} b_{ik+4+j} 2^j) + b_{ik+4−1},
k = 8, s = 2, and r = 4.
Figure 5.6: Example of the Modified Booth Recoding Algorithm
The definition for bl_{is} in Figure 5.6 corresponds to the two’s complement addition
of (b_{ik+3} . . . b_{ik})_2 and bh_{i−1}. Similarly, the definition for bl_{is+1}
corresponds to the two’s complement addition of (b_{ik+7} . . . b_{ik+4})_2 and b_{ik+3}.
The two’s complement representations of bl_{is} and bl_{is+1} need to be converted to
the previously discussed sign-magnitude representation.
For two’s complement numbers that represent positive values, the conversion
involves reflecting a positive sign on the sign control signal, identifying a zero mag-
nitude by asserting the zero control signal, and forwarding the magnitude of the
two's complement number. For example, when $bl_{is}$ represents a positive nonzero
value, its value is represented as follows: $|bl_{is}| = (\sum_{j=0}^{2} b_{ik+j} 2^j) + bh_{i-1}$
(note that $b_{ik+3} = 0$), $bl_{is}\_sign = 0$ (positive), and $bl_{is}\_zero = 0$ (nonzero).
For two’s complement numbers that represent negative values, the conversion
involves reflecting a negative sign on the sign control signal, reflecting a nonzero
value on the zero control signal, and computing and forwarding the magnitude of the
complement of the two's complement number. For example, when $bl_{is}$ represents a
negative value, its value is represented as follows:
$|bl_{is}| = 2^3 - ((\sum_{j=0}^{2} b_{ik+j} 2^j) + bh_{i-1})$
(complement), $bl_{is}\_sign = 1$ (negative), and $bl_{is}\_zero = 0$ (nonzero).
A Booth recoder that implements the Modified Booth Recoding Algorithm is
composed of one or more window recoders. The example in Figure 5.6 uses two
window recoders. The Booth recoder 1 circuit uses s window recoders and the Booth
recoder 2 circuit uses v window recoders.
Figure 5.7 shows a block diagram of a possible window recoder implementation.
This implementation uses two increment adders.
Increment adder 1 together with the gate that determines the sign of the digit
compute two’s complement additions. In the example shown in Figure 5.6, one of
these circuits computes the two’s complement addition of (bik+3 . . . bik)2 and bhi−1.
Increment adder 2 computes the complement of the result of increment adder 1.
This adder complements the outputs of the increment adder 1 and adds a one to
the value that the complemented bits represent.
A set of multiplexers chooses the output of the window recoder. If the result
is positive, these multiplexers forward the output of increment adder 1, otherwise
they forward the output of increment adder 2.
A zero result is determined by sampling the input bits. If each input bit is zero
or if each input bit is one, the zero control signal is set to one, which represents a
zero condition. Otherwise, the zero control signal is set to zero.
The sign of the result is negative if the most significant bit of the input digit is a
one and the carry out of the increment adder 1 is zero. This condition implies that
the input digit is negative and that the addition of the least significant bit in the
window does not change the sign. All other conditions result in a positive result.
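The behavior of the window recoder described above can be sketched in software. The following Python model is an illustration only and is not the circuit of Figure 5.7. The function names `window_recode`, `decode`, and `booth_recode` are hypothetical, bit lists are given least significant bit first, and the model assumes that a magnitude output of zero with the zero control signal deasserted stands for the magnitude $2^{r-1}$, which does not fit in r − 1 bits.

```python
def window_recode(bits, bh_prev):
    """Model one window recoder (hypothetical helper, not the thesis circuit).

    bits    : list of r window bits, least significant first
    bh_prev : the bit just below the window
    Returns (magnitude, sign, zero); magnitude is an (r-1)-bit value.
    """
    r = len(bits)
    mask = (1 << (r - 1)) - 1
    low = sum(b << j for j, b in enumerate(bits[:-1]))
    msb = bits[-1]

    total = low + bh_prev          # increment adder 1
    carry = total >> (r - 1)       # carry out of increment adder 1
    sum1 = total & mask
    sum2 = (~sum1 + 1) & mask      # increment adder 2: complement, then add one

    inputs = bits + [bh_prev]
    # Zero result: every input bit is zero or every input bit is one.
    zero = int(all(b == 0 for b in inputs) or all(b == 1 for b in inputs))
    # Negative result: most significant bit set and no carry out of adder 1.
    sign = int(msb == 1 and carry == 0)
    magnitude = sum2 if sign else sum1   # output multiplexers
    return magnitude, sign, zero


def decode(magnitude, sign, zero, r):
    """Turn the sign-magnitude outputs back into an integer digit."""
    if zero:
        return 0
    # Assumption: an all-zero magnitude with zero == 0 stands for 2^(r-1).
    value = magnitude if magnitude else 1 << (r - 1)
    return -value if sign else value


def booth_recode(B, bh_prev, k=8, r=4):
    """Recode the k-bit word B into k/r digits, one window recoder per digit."""
    bits = [(B >> j) & 1 for j in range(k)]
    digits = []
    below = bh_prev
    for j in range(0, k, r):
        window = bits[j:j + r]
        digits.append(window_recode(window, below))
        below = window[-1]         # top bit of this window feeds the next one
    return digits
```

In this model the top bit of each word carries a negative weight, so for k = 8 and r = 4 the decoded digits sum to B minus $2^k$ times its top bit, plus the incoming bit `bh_prev`. The windows are independent of one another, which reflects the parallel nature of the recoding.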
HA 1HAHA
bl_signbl_zero |blis:2|
increment
adder 2
|blis:1| |blis:0|
bik
HA bhi-1
bik+1
HA
bik+2
HA
bik+3
increment
adder 1
Figure 5.7: Window recoder
Booth recoder 1 uses s window recoders. Each of these recoders receives r + 1
inputs and generates r + 1 outputs. The absolute value of
a recoded digit is reflected in r − 1 outputs, a zero value for a digit is reflected
in one output bit, and the sign of a digit is reflected in another output bit. The
window recoders for Booth recoder 1 use two (r − 1)-bit increment adders, (r − 1)
2:1 multiplexers, two (r+1)-input AND gates (the NOR gate in Figure 5.7 is treated
as an AND gate), an AND gate (which represents the AND gate with an inverted input
in Figure 5.7), and an OR gate.
The architecture of Booth recoder 2 is similar to that of Booth recoder 1. The
number of inputs, outputs, and the size of the components of Booth recoder 2 can
be determined using the expressions derived for Booth recoder 1 using v in place of
s and u in place of r.
The complexities and the critical path delays of Booth recoders 1 and 2 are
summarized in Table 5.13. These estimates are based on the complexity and timing
models introduced in Appendix A. Table 5.13 shows the most significant terms of
the complexity estimates.
To simplify the FPGA complexity estimates, the results in Table 5.13 assume
that each AND gate with r + 1 or u + 1 inputs requires, respectively, r/(L − 1) or
$u/(L-1)$ LUTs. This approximation ignores the ceiling operators in the complexity
estimates of binary trees. (Note that according to the models introduced in Appendix
A, an n-input binary tree requires $\lceil (n-1)/(L-1) \rceil$ LUTs.) Furthermore, because
each LUT must have at least three inputs, the estimates in Table 5.13 assume that
each AND gate with r + 1 or u + 1 inputs requires, respectively, r/2 or u/2 LUTs.
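The difference between the exact tree estimate and the simplified one can be seen with a short calculation. The sketch below is illustrative only; the function names `tree_luts` and `and_gate_luts_approx` are hypothetical and simply restate the two expressions quoted above.

```python
from math import ceil


def tree_luts(n, L):
    """Exact model: LUTs for an n-input binary tree built from L-input LUTs."""
    return ceil((n - 1) / (L - 1))


def and_gate_luts_approx(r, L=3):
    """Simplified estimate for an (r+1)-input AND gate: r/(L-1), no ceiling."""
    return r / (L - 1)
```

With three-input LUTs, an (r+1)-input AND gate for r = 4 gives the same figure under both expressions, whereas for r = 5 the exact model rounds 2.5 up to 3 LUTs.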
The approximations just described permit the specification of the complexity of
Booth recoders 1 and 2 in terms of k, where k = rs = uv.
Table 5.13: Complexity and critical path delay of Booth recoder
Technology      Complexity                 Critical path delay         Critical path delay
                                           Booth recoder 1             Booth recoder 2
Gates           6k AND + k OR + 2k XOR     $(r-1)T_A + T_O + 2T_X$     $(u-1)T_A + T_O + 2T_X$
Generic gates   9k GG                      $(r+2)T_G$                  $(u+2)T_G$
FPGA logic      6k LUT                     $(r+1)T_L$                  $(u+1)T_L$
5.5.4 A˜Bi Scalar Multiplier
The A˜Bi scalar multiplier generates the scalar products $A\,bl_{is+j}\,2^{rj}$ for $j = 0 \ldots s-1$
using s processing units. Processing unit j generates the scalar products $A\,bl_{is+j}\,2^{rj}$.
The carry-save adder tree adds the contributions of the s processing units, thus
computing the scalar product shown in Step 4.5 of Algorithm 5.3.1.
Assuming that the scalar products are represented in nonredundant number
representation, processing unit j computes the scalar product $A\,bl_{is+j}\,2^{rj}$ as follows.
First, the processing unit saves in its local memory the products iA written to
it during Steps 2 and 2.1 of Algorithm 5.3.1.
Second, to compute a product $A\,bl_{is+j}\,2^{rj}$, the processing unit uses the absolute
value of $bl_{is+j}$ as an index into the table containing the precomputed values (i.e.,
$A[\,|bl_{is+j}|\,]$). The result of the table lookup operation is $A\,|bl_{is+j}|\,2^{rj}$, where the
multiple $2^{rj}$ is realized by offsetting the local memory rj bits toward the most
significant bits with respect to the multiple corresponding to one ($2^{r \cdot 0}$); in other
words, by offsetting the output of the table rj bits toward the most significant bits,
the table output is effectively multiplied by $2^{rj}$.
Third, depending on the assertion of the control signals $bl_{is+j}\_zero$ and $bl_{is+j}\_sign$,
the processing unit generates a zero result, a positive multiple, or a negative
multiple. The signals $bl_{is+j}\_zero$ and $bl_{is+j}\_sign$ are generated by the Booth recoder 1
circuit. These signals form the bus $bl_{is+j}\_sel$.
The zero result is generated by forcing the result to zero, which allows the use
of memory location zero to store the multiple $2^{r-1}A$. This arrangement uses $2^{r-1}$
memory locations to store all the necessary multiples of A. Alternatively, if 0A were
stored in memory, $2^{r-1}+1$ memory locations would have been needed. Implementations
could be forced to use $2^r$ memory locations to store $2^{r-1}+1$ values because the
number of memory locations of memory chips and ASIC memory cores is typically
a power of two.
A positive multiple is generated when the multiplier digit $bl_{is+j}$ is positive. This
condition is signaled by $bl_{is+j}\_sign$ set to zero. To generate one of these multiples, a
processing unit looks up the multiple corresponding to the multiplier digit ($A[bl_{is+j}]$).
It then sign-extends and pads the multiple so that it is of the expected length, and it
sets the carry bit to zero.
Figure 5.8 shows an example of the generation and formatting of three positive
multiples, where each of the multiples could be the result of a different processing
unit. This figure shows the results of the table lookups in rectangles; sign extensions
and padding fall outside the rectangles.
a0a1a2a3a4a5
Table Looked Up Values
b0b1b2b3b4b5
c0c1c2c3c4c5
Sign Extension and Padding
a0a1a2a3a4a5
b0b1b2b3b4b5
c0c1c2c3c4c5
a5a5a5a5
b5b5 0 0
0 0 0 0
Figure 5.8: Generation and formatting of positive multiples
A negative multiple is generated when a multiplier digit is negative ($bl_{is+j} < 0$).
This condition is signaled by $bl_{is+j}\_sign$ set to one. To generate one of these
multiples, a processing unit looks up the multiple corresponding to the absolute value
of the multiplier digit ($A[\,|bl_{is+j}|\,]$). It then sign-extends and pads the multiple so
that it is of the expected length. Finally, the processing unit generates the complement
of the multiple by complementing each bit of it and by setting the carry bit to one.
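The three cases handled by a processing unit (zero result, positive multiple, negative multiple) can be summarized with a small behavioral model. The sketch below is an assumption-laden illustration, not the hardware of this work: the names `build_table`, `processing_unit`, and `to_signed` are hypothetical, Python integers stand in for w2-bit buses, and the table places the multiple $2^{r-1}A$ in location zero as described above.

```python
def build_table(A, r):
    """Store iA for i = 1 .. 2^(r-1); location 0 holds 2^(r-1) * A."""
    size = 1 << (r - 1)
    table = [0] * size
    for i in range(1, size + 1):
        table[i % size] = i * A
    return table


def processing_unit(table, magnitude, sign, zero, r, j, w2):
    """Return (w2-bit value, carry bit) for digit j (hypothetical model)."""
    mask = (1 << w2) - 1
    if zero:
        return 0, 0                      # zero result is forced, not looked up
    # Table lookup, offset r*j bits toward the most significant bits.
    multiple = (table[magnitude] << (r * j)) & mask
    if sign:
        return ~multiple & mask, 1       # bitwise complement, carry bit set
    return multiple, 0                   # positive multiple, carry bit clear


def to_signed(value, carry, w2):
    """Interpret value + carry as a w2-bit two's complement number."""
    total = (value + carry) & ((1 << w2) - 1)
    return total - (1 << w2) if total >> (w2 - 1) else total
```

Deferring the addition of the carry bit keeps the processing unit free of carry propagation; in this model the final addition is done by `to_signed`, whereas the text assigns that role to the carry-save adder tree.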
Figure 5.9 shows an example of the generation and formatting of negative multi-
ples, where each of the multiples could be the result of a different processing unit. In
this figure, the complement is generated by complementing each bit of the number
and then adding a one to the value they represent. Complemented bits are denoted
with a bar over the variable name; for example, $\bar{a}_0$ represents the logical function
NOT $a_0$. The one added to the bit-complemented number is shown inside a box
with a dotted line frame.
a0a1a2a3a4a5
Table Looked Up Values
b0b1b2b3b4b5
c0c1c2c3c4c5
Sign Extension and Padding
a0a1a2a3a4a5
b0b1b2b3b4b5
c0c1c2c3c4c5
a5a5a5a5
b5b5 0 0
0 0 0 0
Complement Generation
a0a1a2a3a4a5
b0b1b2b3b4b5
c0c1c2c3c4c5
a5a5a5a5
b5b5 1 1
1 1 1 1
1
1
1
Figure 5.9: Generation and formatting of negative multiples
Figure 5.10 shows a block diagram of a processing unit that handles numbers
in nonredundant number representation. Each processing unit contains a memory
block and a two’s complement with zero circuit.
Each processing unit receives precomputed values over a $w_1$-bit bus and
outputs results over a $w_2$-bit bus together with a carry signal. For the parameters of
interest here, Table 5.5 provides definitions for the different bus widths, and Table
5.6 provides approximations of their size.
(2r-1)A
1A
|blis+j|
r-1
not 0
blis+j_zero
2
w2
A blis+j 2rj
1A,2A,...,2r-1A
w1
w2 w2
w1
w2
blis+j_sign
bl_selis+j
A blis+j_carry
Figure 5.10: Processing unit for nonredundant number representation
Figure 5.11 shows a scalar multiplier consisting of two processing units (s = 2).
It is evident from this figure that the multiples $2^{rj}$ are realized by staggering the
memory elements of the processing units r bits apart.
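[Figure 5.11 (block diagram). Two processing units whose memories hold $1A, \ldots, (2^{r-1})A$, written over
a $w_1$-bit bus carrying $1A, 2A, \ldots, 2^{r-1}A$ and staggered r bits apart.
Inputs: $|bl_{2i}|$ and $|bl_{2i+1}|$ (r − 1 bits each, with "not 0" indications), $bl_{2i}\_sel$ and $bl_{2i+1}\_sel$ (2 bits each).
Outputs: $A\,bl_{2i}$ and $A\,bl_{2i+1}\,2^r$ ($w_2$ bits each), $A\,bl_{2i}\_carry$ and $A\,bl_{2i+1}\_carry$.]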
Figure 5.11: A˜Bi scalar multiplier example for nonredundant number representation
The concept just described for nonredundant number representation can be ex-
tended to binary stored-carry number representation. In binary stored-carry number
representation, a number is represented redundantly as the sum of two numbers; for
example A can be represented by the sum of Ac and As.
In binary stored-carry number representation, the scalar product $A\,bl_{is+j}\,2^{rj}$ is
equivalent to the sum of the products $A_c\,bl_{is+j}\,2^{rj}$ and $A_s\,bl_{is+j}\,2^{rj}$, where these
products can be treated as two's complement numbers in nonredundant number
representation. Two processing units can be employed to compute the scalar products
$A_c\,bl_{is+j}\,2^{rj}$ and $A_s\,bl_{is+j}\,2^{rj}$. The carry-save adder tree is responsible for adding
the scalar products. The binary stored-carry number representation doubles the
number of processing units and thus the number of inputs that need to be added
by the carry-save adder tree.
Figure 5.12 shows a block diagram of a scalar multiplier for binary stored-carry
number representation for s equal to two (s = 2). The scalar multiplier in this
example requires 2s processing units, each of which is similar to the processing unit
shown in Figure 5.10.
[Figure 5.12 (block diagram). Two pairs of processing units, one pair for $A_c$ and one pair for $A_s$.
Memories hold $1A_c, \ldots, (2^{r-1})A_c$ and $1A_s, \ldots, (2^{r-1})A_s$, written over $w_1$-bit buses and staggered r bits apart.
Inputs: $|bl_{2i}|$ and $|bl_{2i+1}|$ (r − 1 bits each, with "not 0" indications), $bl_{2i}\_sel$ and $bl_{2i+1}\_sel$ (2 bits each).
Outputs ($w_2$ bits each): $A\,bl_{2i}\_c$, $A\,bl_{2i+1}\,2^r\_c$, $A\,bl_{2i}\_s$, $A\,bl_{2i+1}\,2^r\_s$.
Carries: $A\,bl_{2i}\_carry\_c$, $A\,bl_{2i+1}\_carry\_c$, $A\,bl_{2i}\_carry\_s$, $A\,bl_{2i+1}\_carry\_s$.]
Figure 5.12: A˜Bi scalar multiplier example for binary stored-carry number representation
An implementation of the A˜Bi scalar multiplier that generates products in nonredundant
number representation requires s processing units, whereas an implementation
that generates products in binary stored-carry number representation requires
2s processing units.
Table 5.14 summarizes the complexity and the critical path delay for implementations
of the A˜Bi scalar multiplier that generate products in nonredundant and
in binary stored-carry number representations. The symbols NR and BSC are used
to identify numbers represented in nonredundant and binary stored-carry number
representations. The estimates in Table 5.14 are based on the complexity and timing
models introduced in Appendix A.
Table 5.14: Complexity and critical path delay of A˜Bi scalar multiplier
Technology      Representation   Logic                                  Bits of storage       Critical path delay
Gates           NR               $s\,w_2$ AND + $s\,w_2$ XOR            $2^{r-1}\,s\,w_1$     $T_A + T_X$
                BSC              $2s\,w_2$ AND + $2s\,w_2$ XOR          $2^{r}\,s\,w_1$
Generic gates   NR               $2s\,w_2$ GG                           $2^{r-1}\,s\,w_1$     $2T_G$
                BSC              $4s\,w_2$ GG                           $2^{r}\,s\,w_1$
FPGA logic      NR               $s\,w_2$ LUT                           $2^{r-1}\,s\,w_1$     $T_L$
                BSC              $2s\,w_2$ LUT                          $2^{r}\,s\,w_1$
5.5.5 Q˜αi−d Scalar Multiplier
The Q˜αi−d scalar multiplier generates the scalar products $|ql_{iv+j}\,\alpha\,2^{uj}|_{\hat{M}}$ for
$j = 0 \ldots v-1$ using v processing units ($\alpha \equiv |2^{-k(d+1)}|_M$). Processing unit j generates
the scalar products $|ql_{iv+j}\,\alpha\,2^{uj}|_{\hat{M}}$. The carry-save adder tree adds the contributions
of the v processing units, thus computing the scalar product shown in Step 4.4 of
Algorithm 5.3.1.
Section 5.3.2 defines three reduction methods: Multiplication, Lookup 1, and
Lookup 2 reduction methods. When using the Multiplication reduction method,
processing unit j computes the scalar products $ql_{iv+j}\,\alpha\,2^{uj}$; when using the Lookup
1 reduction method, it computes the scalar products $|ql_{iv+j}\,\alpha|_M\,2^{uj}$; and when using
the Lookup 2 reduction method, it computes the scalar products $|ql_{iv+j}\,\alpha\,2^{uj}|_M$.
Section 5.3.5 describes a method for performing comparisons that uses precise re-
duction methods, such as the Lookup 2 reduction method, without quotient pipelin-
ing. To support this comparison method, each processing unit stores a different set
of precomputed values for each of the reduction options it supports.
For example, a processing unit that supports the Lookup 2 reduction method
with quotient resolution delay d and with no quotient resolution delay (d = 0) may need to
store two sets of precomputed values. One set consists of $|i(\alpha_d 2^{uj})|_M$ for $i = 1 \ldots 2^{u-1}$,
where $\alpha_d \equiv |2^{-k(d+1)}|_M$. The other set consists of $|i(\alpha 2^{uj})|_M$ for $i = 1 \ldots 2^{u-1}$, where
$\alpha \equiv |2^{-k}|_M$.
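A set of precomputed values of this kind can be generated in software as a sanity check. The sketch below is illustrative only: the name `lookup2_table` is hypothetical, the modulus is assumed to be odd so that the inverse power of two exists, and the three-argument `pow` with a negative exponent requires Python 3.8 or later. In the architecture itself these values are produced by the GF(p) adder rather than by a routine like this one.

```python
def lookup2_table(M, k, d, u, j):
    """Values |i * alpha_d * 2^(u*j)|_M for i = 1 .. 2^(u-1) (hypothetical helper).

    alpha_d = |2^(-k(d+1))|_M; calling with d = 0 gives the set based on |2^(-k)|_M.
    """
    alpha_d = pow(2, -k * (d + 1), M)    # modular inverse power, Python 3.8+
    return {i: (i * alpha_d * (1 << (u * j))) % M
            for i in range(1, (1 << (u - 1)) + 1)}


# Example with small, arbitrary parameters (assumptions, not thesis values).
with_delay = lookup2_table(M=239, k=8, d=1, u=4, j=1)
no_delay = lookup2_table(M=239, k=8, d=0, u=4, j=1)
```

Each call yields $2^{u-1}$ entries, so a unit that keeps both sets needs $2^u$ locations in total.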
For all the reduction methods, the precomputed values are generated by the
GF(p) adder. The precomputed values are written to the memories of the
processing units during the precomputation phase of the multiplication algorithm. For
the Multiplication and the Lookup 1 reduction methods, the multiples $2^{uj}$ are
realized by staggering the memory elements in the processing units u bits apart. This
arrangement is similar to the one used to generate the scalar products that form
A˜Bi.
The Lookup 2 reduction method uses precomputed values that already incorporate
the $2^{uj}$ multiples in them. Consequently, for this type of reduction, all the
memory elements of the different processing units are aligned.
The generation of the scalar products Q˜αi−d is very similar to the generation of
the scalar products A˜Bi. Precomputed values are loaded into the processing units
during the precomputation phase. The precomputed values are then used during
the processing part of the algorithm, where, as needed, a processing unit generates a
zero result, a positive multiple, or a negative multiple based on the values of $|ql_{iv+j}|$,
$ql_{iv+j}\_sign$, and $ql_{iv+j}\_zero$. In addition, if the processing units store multiple sets
of precomputed values, the control signals that indicate which set to use must be
set. Note that because all the precomputed values are stored in memory, the control
signals that indicate which set of precomputed values to use are memory address
signals (e.g., $\alpha\_sel$ in Figure 5.4).
In summary, the architecture of the Q˜αi−d scalar multiplier is very similar to the
architecture of the A˜Bi scalar multiplier. The architecture of the processing units
used by the two scalar multipliers is identical, but the complexity of the processing
units can differ: the processing units used by the
Q˜αi−d scalar multipliers use memory elements whose width is $w'_1$ bits (see Tables
5.5 and 5.6).
For each set of precomputed values, a processing unit must provide $2^{u-1}$ memory
locations. It is expected that each processing unit will have to store at most two
sets of precomputed values: one for general multiplications and one for comparisons.
Therefore, it is expected that the maximum number of memory locations that a
processing unit needs to provide is $2^u$.
The width of the memory is a function of the range of values that the precomputed
values can take. The number of memory locations for each set of precomputed
values is a function of the number of bits needed to represent the digits recoded by
the Booth recoder 2 circuit. The outputs of the processing units are $w_2$ bits wide.
Table 5.5 provides definitions for $w'_1$ and $w_2$. Table 5.6 provides size approximations
for $w'_1$ and $w_2$ for the parameters of interest here.
Figure 5.13 shows an example of a Q˜αi−d scalar multiplier suitable for the Mul-
tiplication and the Lookup 1 reduction methods. Figure 5.14 shows an example of a
Q˜αi−d scalar multiplier suitable for the Lookup 2 reduction method. The examples
in Figure 5.13 and in Figure 5.14 generate scalar products in nonredundant number
representation using two processing units (v = 2). In these figures, the signal $\alpha\_sel$
is used to choose which set of precomputed values to use.
The main architectural difference between Figure 5.13 and Figure 5.14 is that
the memory elements in the processing units are aligned in Figure 5.14 and they are
staggered u bits apart in Figure 5.13.
An implementation of the Q˜αi−d scalar multiplier that generates products in
nonredundant number representation requires v processing units, whereas an im-
plementation that generates products in binary stored-carry number representation
requires 2v processing units.
Table 5.15 summarizes the complexities and the critical path delays of imple-
mentations of the Q˜αi−d scalar multipliers that generate products in nonredundant
number representation and in binary stored-carry number representation. The sym-
bols NR and BSC are used to identify numbers represented in nonredundant and
binary stored-carry number representations.
The estimates in Table 5.15 are based on the complexity and timing models
introduced in Appendix A. These estimates assume that each processing unit stores
two sets of precomputed values.
Figure 5.13: Q˜αi−d scalar multiplier example for Multiplication and Lookup 1 re-
duction methods that uses nonredundant number representation
CDE F
G-H IJ KLMNOH P Q?R
S
GH
IJ KLUTWVYX
S Z
CDE F
GH IJ MN OH
S
GH
IJ T
S Z
PQR[QR
Q#RQ?R
Q
\ ]
Q
\ ]
S
GH IJ
S
^ﬂ_`
S
GH IJ KL
S
^ﬂ_`
QR)Q?R"Q
R
GH
IJ KL+T
M-abcc d GH
IJ T
M-a bcc d
e
Q
\
]
S f
T)VXhg
S
Zi
S
VjTkVXg
S
Zi l l l i
S
V1Xm
L
TkVXg
S
Z
Dc
S
V
X+m
L
T<n
S
Z
S f
T
n
S Z
S
V
Xm
L
T<n<V
X
S
Z
S f
T
n
VXo
Z
S
VX+m
L
T
S
Z
S f
T
S
Z
S
VXm
L
TkVX
S
Z
S f
T)VXpo
Z
T
MN OH
T
MNOH
S f
T<n<VXhg
S
Zi
S
VjT<n<VXhg
S
Zi l l l i
S
VXm
L
T-n<VXhg
S
Z
Figure 5.14: Q˜αi−d scalar multiplier example for Lookup 2 based reduction that
uses nonredundant number representation
Table 5.15: Complexity and critical path delay of Q˜αi−d scalar multiplier
Technology      Representation   Logic                                  Bits of storage        Critical path delay
Gates           NR               $v\,w_2$ AND + $v\,w_2$ XOR            $2^{u}\,v\,w'_1$       $T_A + T_X$
                BSC              $2v\,w_2$ AND + $2v\,w_2$ XOR          $2^{u+1}\,v\,w'_1$
Generic gates   NR               $2v\,w_2$ GG                           $2^{u}\,v\,w'_1$       $2T_G$
                BSC              $4v\,w_2$ GG                           $2^{u+1}\,v\,w'_1$
FPGA logic      NR               $v\,w_2$ LUT                           $2^{u}\,v\,w'_1$       $T_L$
                BSC              $2v\,w_2$ LUT                          $2^{u+1}\,v\,w'_1$
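The storage expressions in Tables 5.14 and 5.15 are easy to evaluate for candidate parameter sets. The following sketch simply restates those expressions; the function names and the numeric parameters in the example are hypothetical and are not values taken from this work.

```python
def ab_storage_bits(r, s, w1, bsc=False):
    """Table 5.14: 2^(r-1) s w1 bits (NR) or 2^r s w1 bits (BSC)."""
    return (1 << (r if bsc else r - 1)) * s * w1


def qalpha_storage_bits(u, v, w1p, bsc=False):
    """Table 5.15, two sets per unit: 2^u v w1' bits (NR) or 2^(u+1) v w1' bits (BSC)."""
    return (1 << (u + 1 if bsc else u)) * v * w1p


# Illustrative parameters only (assumptions): r = u = 4, s = v = 2.
total_nr = ab_storage_bits(4, 2, 200) + qalpha_storage_bits(4, 2, 210)
total_bsc = ab_storage_bits(4, 2, 200, bsc=True) + qalpha_storage_bits(4, 2, 210, bsc=True)
```

In both tables the binary stored-carry representation doubles the storage, which matches the doubling of the number of processing units.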
5.5.6 Carry-Save Adder Tree
The carry-save adder tree adds the outputs of the A˜Bi and the Q˜αi−d scalar multipliers
along with the outputs of the $S_i/2^k$ circuit. The outputs of the scalar multipliers
correspond to the outputs of their processing units. The output of the $S_i/2^k$ circuit
consists of a number in binary stored-carry number representation.
Each processing unit outputs two's complement numbers in nonredundant number
representation. The output of each processing unit consists of a $w_2$-bit number
and a carry bit; for example, $A\,bl_{is}\_c$ and $A\,bl_{is}\_carry\_c$ in Figure 5.4.
The number of inputs of the carry-save adder tree depends on the representation
of the scalar products. Of interest here is the representation of the scalar products
from the A˜Bi and the Q˜αi−d scalar multipliers in nonredundant and binary
stored-carry number representations.
Table 5.16 lists the number of w2-bit inputs and the number of carries that must
be added by the carry-save adder tree. This table also lists the number of 3:2 carry-
save adders (CSA adders) required to implement the carry-save adder tree. The
symbols NR and BSC in the table identify numbers represented in nonredundant
and binary stored-carry number representations.
As specified in [Kor93], the addition of n numbers requires (n−2) 3:2 carry-save
adders. In the carry-save adder tree, each carry-save adder computes the sum of
three w2-bit numbers and a carry bit.
Table 5.16 shows that the number of carry-save adders matches the maximum
number of carries that must be added by the carry-save adder tree, which implies
that the carry-save adders of the tree can absorb all the carries.
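The adder count quoted from [Kor93] can be checked with a small behavioral model of a 3:2 carry-save reduction. The sketch below is an illustration under simplifying assumptions: the names `csa` and `csa_tree` are hypothetical, operands are nonnegative Python integers rather than $w_2$-bit two's complement words, and the carry inputs discussed above are not modeled.

```python
def csa(x, y, z):
    """3:2 carry-save adder: x + y + z == sum_word + carry_word."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def csa_tree(operands):
    """Reduce the operands to two words; return (words, adders_used)."""
    words = list(operands)
    adders = 0
    while len(words) > 2:
        next_level = []
        i = 0
        # Each full group of three words is compressed to two words.
        while i + 3 <= len(words):
            next_level.extend(csa(*words[i:i + 3]))
            adders += 1
            i += 3
        next_level.extend(words[i:])     # leftovers pass to the next level
        words = next_level
    return words, adders


words, adders = csa_tree([13, 7, 22, 5, 9, 30])
```

Every adder removes exactly one word from the set, so reducing n words to the two words of a stored-carry result uses n − 2 adders regardless of how the groups are formed.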
For the GF(p) multiplier, it is best to use the carry input of the carry-save adder
at the root of the tree for a carry from the A˜Bi scalar multiplier. The restriction
that $B_i$ be zero for iterations i greater than or equal to t in Algorithm 5.3.1 forces the
carries from the A˜Bi scalar multiplier to be zero at the end of the multiplication.
This arrangement allows the carry from the $S_i/2^k$ circuit to take the place of the
carry at the root of the tree in the $S_{n+d+2}\_c$ bus. Details about the carry from the
$S_i/2^k$ circuit are provided in Section 5.5.7.
Table 5.16: Carry-save adder tree configurations
A˜Bi             Q˜αi−d           Number of             Max. number    Max. number
representation   representation   $w_2$-bit inputs      of carries     of 3:2 CSAs
NR               NR               s + v + 2             s + v          s + v
NR               BSC              s + 2v + 2            s + 2v         s + 2v
BSC              NR               2s + v + 2            2s + v         2s + v
BSC              BSC              2s + 2v + 2           2s + 2v        2s + 2v
Table 5.17 summarizes the complexity and the critical path delay for different
configurations of the carry-save adder tree. The estimates in Table 5.17 are based
on the complexity and timing models introduced in Appendix A.
Table 5.17: Complexity and critical path delay of carry-save adder tree
Technology      A˜Bi rep.   Q˜αi−d rep.   Complexity                                   Critical path delay
Gates           NR          NR            $w_2(s+v)$ * (3 AND + 2 OR + 2 XOR)          $2\lceil \log_{3/2}((s+v+2)/2) \rceil T_X$
                BSC         NR            $w_2(2s+v)$ * (3 AND + 2 OR + 2 XOR)         $2\lceil \log_{3/2}((2s+v+2)/2) \rceil T_X$
                NR          BSC           $w_2(s+2v)$ * (3 AND + 2 OR + 2 XOR)         $2\lceil \log_{3/2}((s+2v+2)/2) \rceil T_X$
                BSC         BSC           $w_2(2s+2v)$ * (3 AND + 2 OR + 2 XOR)        $2\lceil \log_{3/2}(s+v+1) \rceil T_X$
Generic gates   NR          NR            $7w_2(s+v)$                                  $3\lceil \log_{3/2}((s+v+2)/2) \rceil T_G$
                BSC         NR            $7w_2(2s+v)$                                 $3\lceil \log_{3/2}((2s+v+2)/2) \rceil T_G$
                NR          BSC           $7w_2(s+2v)$                                 $3\lceil \log_{3/2}((s+2v+2)/2) \rceil T_G$
                BSC         BSC           $14w_2(s+v)$                                 $3\lceil \log_{3/2}(s+v+1) \rceil T_G$
FPGA logic      NR          NR            $2w_2(s+v)$                                  $\lceil \log_{3/2}((s+v+2)/2) \rceil T_L$
                BSC         NR            $2w_2(2s+v)$                                 $\lceil \log_{3/2}((2s+v+2)/2) \rceil T_L$
                NR          BSC           $2w_2(s+2v)$                                 $\lceil \log_{3/2}((s+2v+2)/2) \rceil T_L$
                BSC         BSC           $4w_2(s+v)$                                  $\lceil \log_{3/2}(s+v+1) \rceil T_L$
5.5.7 $S_i/2^k$ Circuit
The $S_i/2^k$ circuit divides $S_i$ by $2^k$, thus generating the quotient $\lfloor S_i/2^k \rfloor$ and the
remainder $|S_i|_{2^k}$ (note that $S_i = \lfloor S_i/2^k \rfloor 2^k + |S_i|_{2^k}$).
When using binary stored-carry number representation, the quotient can be computed
using Equation (5.20). In this equation, $\lfloor S_{i\_c}/2^k \rfloor$ and $\lfloor S_{i\_s}/2^k \rfloor$ represent the
most significant bits of the binary stored-carry number representation of $S_i$, where
$S_i$ is represented by the sum of $S_{i\_c}$ and $S_{i\_s}$. $|S_i|_{2^k}\_carry$ in Equation (5.20) is
the carry generated from the addition of $|S_{i\_c}|_{2^k}$ and $|S_{i\_s}|_{2^k}$ as shown in Equation
(5.21). The remainder $|S_i|_{2^k}$ is computed as shown in Equation (5.22).
$\lfloor S_i/2^k \rfloor = \lfloor S_{i\_c}/2^k \rfloor + \lfloor S_{i\_s}/2^k \rfloor + |S_i|_{2^k}\_carry$   (5.20)
$|S_i|_{2^k}\_carry \cdot 2^k + |S_i|_{2^k} = |S_{i\_c}|_{2^k} + |S_{i\_s}|_{2^k}$   (5.21)
$|S_i|_{2^k} = |\,|S_{i\_c}|_{2^k} + |S_{i\_s}|_{2^k}\,|_{2^k}$   (5.22)
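Equations (5.20) through (5.22) can be checked numerically with a few lines of code. The sketch below is illustrative: the name `split_bsc` is hypothetical, and Python integers with floor-style shifts and masks stand in for the two's complement words $S_{i\_c}$ and $S_{i\_s}$.

```python
def split_bsc(s_c, s_s, k):
    """Divide S = s_c + s_s by 2^k without adding the full-width words.

    Returns (quotient, remainder, carry) following Equations (5.20)-(5.22).
    """
    mask = (1 << k) - 1
    low = (s_c & mask) + (s_s & mask)    # k-bit carry-propagate addition
    carry = low >> k                     # carry out of the k low bits, Eq. (5.21)
    remainder = low & mask               # Eq. (5.22)
    # Eq. (5.20): in hardware the three terms go to the carry-save adder
    # tree; they are summed here only to allow a numeric check.
    quotient = (s_c >> k) + (s_s >> k) + carry
    return quotient, remainder, carry
```

Only the k least significant bits go through a carry-propagate addition in this model, which is consistent with the k-bit carry-propagate adder mentioned for the circuit.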
It would be desirable to have the carry-save adder tree add the terms $\lfloor S_{i\_c}/2^k \rfloor$,
$\lfloor S_{i\_s}/2^k \rfloor$, and $|S_i|_{2^k}\_carry$. If the carry-save adder tree does not have the capacity
to absorb the carry, the carry can be maintained by the $S_i/2^k$ circuit. This is the
arrangement shown in Figure 5.4.
When the $S_i/2^k$ circuit maintains the carry, it outputs $\lfloor S_{i\_c}/2^k \rfloor$ and $\lfloor S_{i\_s}/2^k \rfloor$
to the carry-save adder tree, which is equivalent to $\lfloor S_i/2^k \rfloor - |S_i|_{2^k}\_carry$.
Consequently, the carry-save adder tree computes the sum shown in Equation (5.23).
To generate the result expected in Step 4.6 of Algorithm 5.3.1, the $S_i/2^k$ circuit
adds the carry generated in the previous cycle, $|S_{i-1}|_{2^k}\_carry$, to the least significant
bits of the outputs of the registers at the root of the carry-save adder tree. These
registers hold $S_i - |S_{i-1}|_{2^k}\_carry$. The addition of the carry can be done according
to the expression shown in Equation (5.24).
$S_{i+1} - |S_i|_{2^k}\_carry = \lfloor S_i/2^k \rfloor$ + Q˜αi−d + A˜Bi   (5.23)
$|S_i|_{2^k}\_carry \cdot 2^k + |S_i|_{2^k} = |S_{i\_c}|_{2^k} + |S_{i\_s}|_{2^k} + |S_{i-1}|_{2^k}\_carry$   (5.24)
From Equation (5.23), one can appreciate that to generate $S_{n+d+1}$ as required
in Step 5 of Algorithm 5.3.1, $|S_{n+d}|_{2^k}\_carry$ must be added to the multiplication
result. To generate the right result, the value of $|S_{i-1}|_{2^k}\_carry$ is forwarded to one
of the busses carrying the multiplication result. At the end of the multiplication,
the value of $|S_{i-1}|_{2^k}\_carry$ is $|S_{n+d}|_{2^k}\_carry$.
As described in Section 5.5.6, the carry from the $S_i/2^k$ circuit takes the place in
the result that would otherwise be taken by the carry into the carry-save adder at
the root of the tree. This carry is known to have a zero value when it corresponds
to a carry from the A˜Bi scalar multiplier.
Figure 5.15 shows a block diagram of the $S_i/2^k$ circuit that locally stores
$|S_{i-1}|_{2^k}\_carry$. As this figure shows, the $S_i/2^k$ circuit requires a k-bit
carry-propagate adder and a flip-flop.
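[Figure 5.15 (block diagram). Inputs: $S_{i\_c}$ and $S_{i\_s}$ ($w_2$ bits each).
Outputs: $S_{i\_c}/2^k$ and $S_{i\_s}/2^k$ ($w_2$ bits each, formed from the upper $w_2 - k$ bits with a
sign extension of k copies) and $|S_i|_{2^k}$ (k bits).
Internal signals: $|S_{i\_c}|_{2^k}$ and $|S_{i\_s}|_{2^k}$ (k bits each) and a flip-flop (FF) holding $|S_{i-1}|_{2^k}\_carry$.]
Figure 5.15: $S_i/2^k$ circuit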
The complexity and the critical path delay of the $S_i/2^k$ circuit are summarized
in Table 5.18. These estimates are based on the complexity and timing models
introduced in Appendix A.
Table 5.18: Complexity and critical path delay of $S_i/2^k$ circuit
Technology      Complexity                             Critical path delay
Gates           3k AND + 2k OR + 2k XOR + 1 FF         $k(T_A + 2T_O)$
Generic gates   7k GG + 1 FF                           $3k\,T_G$
FPGA logic      2k LUT + 1 FF                          $k\,T_L$
5.5.8 Registers
For the implementation of Algorithm 5.3.1, Figure 5.4 specifies three types of registers
that hold multiplication results. These are referred to in the figure as register, R,
and FF.
The register stores the carry-save adder tree results, which are numbers in binary
stored-carry number representation. Two w2-bit registers are necessary to store the
results from the carry-save adder tree.
The registers referred to as R store the last d values of $Q_i$ ($Q_{i-1}$ down to $Q_{i-d}$).
Each of these registers stores a k-bit number in nonredundant number representation.
The flip-flop, referred to as FF, stores the most significant bit of $Q_{i-d-1}$, $qh_{i-d-1}$.
The complexity and the critical path delay of the registers are summarized in
Table 5.19. These estimates are based on the complexity and timing models intro-
duced in Appendix A. (Note that the critical path delay estimates assume the use
of ideal registers, whose propagation delay is zero.)
Table 5.19: Complexity and critical path delay of registers
Technology                               Complexity               Critical path delay
Gates, generic gates, and FPGA logic     $2w_2 + kd + 1$ FF       0
5.5.9 Complexity
Table 5.20 summarizes the complexity of the GF(p) multiplier in terms of logic
gates, generic gates, and FPGA logic. The estimates in this table are given in terms
of $w_0$, $w_1$, $w'_1$, and $w_2$. Table 5.5 defines these parameters, and Table 5.6 provides
approximations of their values for the parameters of interest here. The symbols
NR and BSC used in the table identify numbers represented in nonredundant and
binary stored-carry number representations.
Table 5.20 shows that the complexity of the GF (p) multiplier is minimum when
B, the scalar products that form A˜Bi, and the scalar products that form Q˜αi−d are
represented using nonredundant number representation. The complexity is almost
twice the one just mentioned when B, the scalar products that form A˜Bi, and the
scalar products that form Q˜αi−d are represented using binary stored-carry number
representation.
For the GF(p) multiplier, the logic element and flip-flop complexities are functions
of $w_0$ and $w_2$. From the parameters specified in Table 5.6, one can deduce that
the logic element and flip-flop complexities grow proportionally with m, kd, and k.
The memory requirements are a function of r, s, u, v, $w_1$, and $w'_1$. The memory
width grows proportionally with m, kd, k, r, and u, and the number of memory
locations grows linearly with respect to s and v and exponentially with respect to r
and u.
Table 5.20 shows that implementations in FPGA logic realize an average of three
generic gates per LUT, where each LUT can perform any logic function of its inputs
and where each LUT has at least three inputs.
Table 5.20: Complexity of GF(p) multiplier
Technology      Q˜αi−d   B, A˜Bi   Logic                                   FF                    Storage bits
                rep.     rep.
Gates           NR       NR        $(4w_2(s+v) + 2w_0 + 15k)$ AND          $2w_2 + w_0 + kd$     $2^{r-1} s\,w_1 + 2^{u} v\,w'_1$
                                   + $(2w_2(s+v) + w_0 + 4k)$ OR
                                   + $(3w_2(s+v) + 6k)$ XOR
                NR       BSC       $(4w_2(2s+v) + 4w_0 + 18k)$ AND         $2w_2 + 2w_0 + kd$    $2^{r} s\,w_1 + 2^{u} v\,w'_1$
                                   + $(2w_2(2s+v) + 2w_0 + 6k)$ OR
                                   + $(3w_2(2s+v) + 8k)$ XOR
                BSC      NR        $(4w_2(s+2v) + 2w_0 + 15k)$ AND         $2w_2 + w_0 + kd$     $2^{r-1} s\,w_1 + 2^{u+1} v\,w'_1$
                                   + $(2w_2(s+2v) + w_0 + 4k)$ OR
                                   + $(3w_2(s+2v) + 6k)$ XOR
                BSC      BSC       $(8w_2(s+v) + 4w_0 + 18k)$ AND          $2w_2 + 2w_0 + kd$    $2^{r} s\,w_1 + 2^{u+1} v\,w'_1$
                                   + $(4w_2(s+v) + 2w_0 + 6k)$ OR
                                   + $(6w_2(s+v) + 8k)$ XOR
Generic gates   NR       NR        $(9w_2(s+v) + 3w_0 + 25k)$ GG           $2w_2 + w_0 + kd$     $2^{r-1} s\,w_1 + 2^{u} v\,w'_1$
                NR       BSC       $(9w_2(2s+v) + 6w_0 + 32k)$ GG          $2w_2 + 2w_0 + kd$    $2^{r} s\,w_1 + 2^{u} v\,w'_1$
                BSC      NR        $(9w_2(s+2v) + 3w_0 + 25k)$ GG          $2w_2 + w_0 + kd$     $2^{r-1} s\,w_1 + 2^{u+1} v\,w'_1$
                BSC      BSC       $(18w_2(s+v) + 6w_0 + 32k)$ GG          $2w_2 + 2w_0 + kd$    $2^{r} s\,w_1 + 2^{u+1} v\,w'_1$
FPGA logic      NR       NR        $(3w_2(s+v) + w_0 + 14k)$ LUT           $2w_2 + w_0 + kd$     $2^{r-1} s\,w_1 + 2^{u} v\,w'_1$
                NR       BSC       $(3w_2(2s+v) + 2w_0 + 16k)$ LUT         $2w_2 + 2w_0 + kd$    $2^{r} s\,w_1 + 2^{u} v\,w'_1$
                BSC      NR        $(3w_2(s+2v) + w_0 + 14k)$ LUT          $2w_2 + w_0 + kd$     $2^{r-1} s\,w_1 + 2^{u+1} v\,w'_1$
                BSC      BSC       $(6w_2(s+v) + 2w_0 + 16k)$ LUT          $2w_2 + 2w_0 + kd$    $2^{r} s\,w_1 + 2^{u+1} v\,w'_1$
5.5.10 Critical Path Delay
Table 5.21 summarizes the critical path delay of the GF (p) multiplier, which corre-
sponds to the longest combinatorial path through the multiplier. The critical path
delay of the multiplier defines the multiplier’s maximum operational frequency. The
results in Table 5.21 assume that s and v are greater than one.
The estimated critical path delay grows linearly with respect to k, u, and r.
The linear growth results from the use of ripple-carry adders in the BSC to NR
converter 1, the $S_i/2^k$, the Booth recoder 1, and the Booth recoder 2 circuits. This
linear growth represents the worst case scenario for hardware platforms that lack
fast carry logic.
For implementations in FPGA logic, the typical critical path delay is much better
than the worst case just described because most modern devices incorporate fast
carry propagation logic. An implementation alternative could be to use
carry-propagate adders with low critical path delays, such as carry-lookahead or
carry-skip adders. For architectures that implement fast carry logic, however,
these adders may exhibit longer critical path delays than ripple-carry adders.
For modern FPGA devices, another alternative is to use lookup tables in place of
traditional carry-propagate adders.
The wide range of alternatives precludes the selection of a single alternative for
all platforms. Therefore, this work suggests prototyping the different alternatives
in the intended hardware platform.
The critical path delay of the GF(p) multiplier also grows logarithmically
with respect to s and v. This growth is due to the carry-save adder tree.
The critical path delay of the GF (p) multiplier can be reduced by decoupling the
critical path delay of its component circuits. This can be achieved by registering the
signals that are propagated from one circuit to another. This technique is referred
to here as pipelining. When using this technique, the critical path delay of a GF (p)
multiplier corresponds to the longest critical path delay of its component circuits.
The critical path delay of the component that defines the critical path delay of a
GF (p) multiplier can also be reduced using the pipelining technique just described.
Pipelining the GF (p) multiplier has two side-effects. Pipelining increases the
latency of the multiplier. Pipelining could also increase the multiplier’s processing
time if it increases the quotient resolution delay.
The addition of pipeline stages in the BSC shift register, the BSC to NR converter
1, the Booth recoder 1, and the ÃB_i scalar multiplier increases the latency of the
multiplier but does not affect the processing time. On the other hand, the addition
of pipeline stages in the S_i/2^k, the Booth recoder 2, the Q̃α_{i−d} scalar multiplier,
and the carry-save adder tree increases the quotient resolution delay, d. Increases in
the quotient resolution delay are accompanied by increases in processing time and
increases in the complexity of the multiplier.
As with the choice of the carry-propagate adder alternative, the most effective
use of pipelining techniques for a particular hardware platform is best determined
through prototyping efforts.
Table 5.21: Critical path delay of GF(p) multiplier

Technology     Q̃α_{i−d}  B, ÃB_i  Critical path delay
               rep.       rep.
Gates          NR         NR       (k + u)TA + (2k + 1)TO + (2⌈log_{3/2}((s + v + 2)/2)⌉ + 3)TX
               NR         BSC      (k + max(r, u))TA + (2k + 1)TO + (2⌈log_{3/2}((2s + v + 2)/2)⌉ + 3)TX
               BSC        NR       (k + u)TA + (2k + 1)TO + (2⌈log_{3/2}((s + 2v + 2)/2)⌉ + 3)TX
               BSC        BSC      (k + max(r, u))TA + (2k + 1)TO + (2⌈log_{3/2}(s + v + 1)⌉ + 3)TX
Generic gates  NR         NR       (3k + u + 4 + 3⌈log_{3/2}((s + v + 2)/2)⌉) TG
               NR         BSC      (3k + max(r, u) + 4 + 3⌈log_{3/2}((2s + v + 2)/2)⌉) TG
               BSC        NR       (3k + u + 4 + 3⌈log_{3/2}((s + 2v + 2)/2)⌉) TG
               BSC        BSC      (3k + max(r, u) + 4 + 3⌈log_{3/2}(s + v + 1)⌉) TG
FPGA logic     NR         NR       (k + u + 2 + ⌈log_{3/2}((s + v + 2)/2)⌉) TL
               NR         BSC      (k + max(r, u) + 2 + ⌈log_{3/2}((2s + v + 2)/2)⌉) TL
               BSC        NR       (k + u + 2 + ⌈log_{3/2}((s + 2v + 2)/2)⌉) TL
               BSC        BSC      (k + max(r, u) + 2 + ⌈log_{3/2}(s + v + 1)⌉) TL
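
The FPGA logic rows of Table 5.21 combine a linear term (the ripple-carry
chains) with a logarithmic term (the carry-save adder tree). The Python sketch
below evaluates them in units of TL using exact rational arithmetic for the
base-3/2 logarithm. The function names and the flags q_bsc and b_bsc are
assumptions made for this example, and the linear term for the NR/BSC row is
taken as max(r, u), following the pattern of the gate-level rows.

```python
from fractions import Fraction

def csa_tree_levels(operands):
    """Smallest n with (3/2)**n >= operands/2, i.e. ceil(log_{3/2}(operands/2))."""
    target = Fraction(operands, 2)
    levels, reach = 0, Fraction(1)
    while reach < target:
        reach *= Fraction(3, 2)
        levels += 1
    return levels

def fpga_critical_path(k, r, s, u, v, q_bsc=False, b_bsc=False):
    """Critical path delay in units of T_L (FPGA logic rows of Table 5.21)."""
    # BSC doubles the number of operands contributed by that side.
    operands = (2 * s if b_bsc else s) + (2 * v if q_bsc else v) + 2
    carry_chain = max(r, u) if b_bsc else u
    return k + carry_chain + 2 + csa_tree_levels(operands)
```

For k = 4 and r = s = u = v = 2, this gives 11 TL for the NR/NR configuration
and 12 TL for the BSC/BSC configuration.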
5.5.11 Performance
Tables 5.22 and 5.23 summarize the latency and the throughput of the multiplier
for the parameters of interest here. In these tables, m represents the number of bits
required to represent the modulus M (m = ⌈log2 M⌉).
Tables 5.22 and 5.23 include the expected maximum and minimum values for
QM. These tables also include values that depend on the number of precomputation
sets |iα2^(uj)|_M̂ that need to be precomputed in Steps 3 to 3.1.1 of Algorithm 5.3.1
(the number of sets depends on the reduction method employed).
The performance numbers in Tables 5.22 and 5.23 use the weight factors defined
in Section 5.3.4. For the architectures presented here, the weight b is expected to
be equal to one. The value of a_b is expected to be one when using binary
stored-carry number representation and about ⌈w1/D⌉ when using nonredundant number
representation. The conversion to nonredundant number representations assumes
the use of a digit-serial adder with digit size equal to D (see Section 5.4.1).
The weight a_q is a function of the reduction method employed and of the number
representation. The value of a_q can range from one when employing the
Multiplication reduction method with numbers represented in binary stored-carry number
representation to O(kM/D) when employing the Lookup 2 reduction method with
numbers represented in nonredundant number representation. This last
approximation assumes that approximately k reductions are needed for each term
|iα2^(uj)|_M, where each of these reductions requires a carry-propagate addition.
The weight c_b defines the number of operations over which the precomputed
values iA are amortized. For most algorithms, the value of c_b is one. The weight c_q
defines the number of operations over which the precomputed values |iα2^(uj)|_M are
amortized. Elliptic curve cryptosystems tend to change their moduli infrequently;
therefore, the value of c_q tends to be very large. For typical applications, the
computational cost associated with the precomputation of the values |iα2^(uj)|_M can be
ignored.
Table 5.22: Average latency of GF(p) multiplier

QM                  n               # |iα2^(uj)|_M̂  Latency (in # clock cycles)
                                    sets
< 2^m R             ⌈m/k⌉ + 1       0                2^(r−1) a_b/c_b + b(⌈m/k⌉ + d + 2)
                                    1                2^(r−1) a_b/c_b + 2^(u−1) a_q/c_q + b(⌈m/k⌉ + d + 2)
                                    v                2^(r−1) a_b/c_b + 2^(u−1) (a_q/c_q) v + b(⌈m/k⌉ + d + 2)
< 2^(k(d+1)+m) R    ⌈m/k⌉ + d + 2   0                2^(r−1) a_b/c_b + b(⌈m/k⌉ + 2d + 3)
                                    1                2^(r−1) a_b/c_b + 2^(u−1) a_q/c_q + b(⌈m/k⌉ + 2d + 3)
                                    v                2^(r−1) a_b/c_b + 2^(u−1) (a_q/c_q) v + b(⌈m/k⌉ + 2d + 3)
Table 5.23: Average throughput of GF(p) multiplier

QM                  n               # |iα2^(uj)|_M̂  Throughput (in # operations / # clock cycles)
                                    sets
< 2^m R             ⌈m/k⌉ + 1       0                1/(2^(r−1) a_b/c_b + b(⌈m/k⌉ + d + 2))
                                    1                1/(2^(r−1) a_b/c_b + 2^(u−1) a_q/c_q + b(⌈m/k⌉ + d + 2))
                                    v                1/(2^(r−1) a_b/c_b + 2^(u−1) (a_q/c_q) v + b(⌈m/k⌉ + d + 2))
< 2^(k(d+1)+m) R    ⌈m/k⌉ + d + 2   0                1/(2^(r−1) a_b/c_b + b(⌈m/k⌉ + 2d + 3))
                                    1                1/(2^(r−1) a_b/c_b + 2^(u−1) a_q/c_q + b(⌈m/k⌉ + 2d + 3))
                                    v                1/(2^(r−1) a_b/c_b + 2^(u−1) (a_q/c_q) v + b(⌈m/k⌉ + 2d + 3))
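
The latency and throughput expressions of Tables 5.22 and 5.23 can be
evaluated directly once the weight factors are chosen. The Python sketch below
is one possible reading of those expressions: it assumes that the cost of the
second precomputation scales linearly with the number of sets (0, 1, or v), and
the function and argument names (including large_qm, which selects the second
group of rows) are hypothetical.

```python
from fractions import Fraction

def multiplier_latency(m, k, d, r, u, a_b, c_b, a_q, c_q,
                       b=1, sets=0, large_qm=False):
    """Average latency in clock cycles (one reading of Table 5.22).

    sets     -- number of precomputation sets (0, 1, or v)
    large_qm -- False for the first group of rows, True for the second
    """
    digits = -(-m // k)                                  # ceil(m/k)
    pre_b = Fraction(2 ** (r - 1) * a_b, c_b)            # amortized over c_b operations
    pre_q = Fraction(2 ** (u - 1) * a_q, c_q) * sets     # amortized over c_q operations
    main = b * (digits + (2 * d + 3 if large_qm else d + 2))
    return pre_b + pre_q + main

def multiplier_throughput(*args, **kwargs):
    """Average throughput in operations per clock cycle (Table 5.23)."""
    return 1 / multiplier_latency(*args, **kwargs)
```

With a very large c_q, the second precomputation term becomes negligible,
which matches the observation that this cost can be ignored for typical
applications.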
5.6 Register File
The register file of the GF(p) arithmetic unit serves the same functions as
the register file of the GF(2^m) arithmetic unit. Refer to Section 4.12 for details.
The main difference between the register files for the different arithmetic units is
that, for some configurations, the register file used in the GF(p) arithmetic unit must
be capable of storing numbers represented using a redundant number representation
that requires two memory locations for each value to be stored. In addition, the
width of the registers of the register file used in the GF(p) arithmetic unit is a
function of the minimum number of bits required to represent the underlying finite
field elements, of the quotient resolution delay parameter d, and of the radix of
operation 2^k. In contrast, for the register file used in the GF(2^m) arithmetic
unit, the width of the register file is only a function of
the minimum number of bits necessary to represent field elements.
The size of the memory elements needed to implement a register file is influenced
by the width of the data items it needs to store, by the number representation of
the data items (nonredundant or binary stored-carry number representations), and
by the number of registers that need to be implemented. The maximum width of
the data items that need to be stored is assumed to be w0 bits.
Two cases are considered here for the representation of the data items: nonre-
dundant and binary stored-carry number representations, which are represented in
Table 5.24 with the symbols NR and BSC. Data items represented in nonredundant
number representation require one w0-bit register while those represented in binary
stored-carry number representation require two w0-bit registers. The estimates in
Table 5.24 assume that all the registers can be used to represent numbers in either
nonredundant or binary stored-carry number representations.
This work does not suggest an absolute number of registers for the register file
because different configurations require different numbers of registers and the cost
of building register files varies from one hardware platform to another. In place of
an absolute number of registers, this work identifies the number of registers in the
register file with the parameter h, which is a design parameter under the control of
the designer.
Table 5.24 summarizes the complexity and the critical path delay of the register
file used in the GF (p) arithmetic unit. The estimates in this table are based on the
complexity and timing models introduced in Appendix A. (Note that the critical
path delay estimates assume the use of ideal registers, whose propagation delay is
zero.)
The estimates in Table 5.24 are defined in terms of the parameters h and w0,
where h is a design parameter and where w0 is a parameter that defines the width in
bits of the data items to be stored in the register file. Table 5.5 provides a definition
for w0, and Table 5.6 provides an approximation of its size for the parameters of
interest here.
Table 5.24 represents the complexity of the register file in terms of the number
of storage bits required for an implementation.
Table 5.24: Complexity and critical path delay of register file

Technology                         Number  Complexity              Critical path
                                   rep.    (in # of storage bits)  delay
Gates, generic gates, FPGA logic   NR      hw0                     0
Gates, generic gates, FPGA logic   BSC     2hw0                    0
5.7 Multiplexer
The multiplexer selects the outputs to be transferred to the register file (see Figure
5.1 for details).
This section considers two implementation options for the multiplexer. In the
first option, the multiplexer forwards numbers in nonredundant number represen-
tation to the register file. In the second option, the multiplexer forwards numbers
represented in binary stored-carry number representation to the register file. Figure
5.16 shows a block diagram of the different multiplexer options.
In the nonredundant number representation option, the multiplexer forwards to
the register file the value in one of the following busses: Sn+d+2_c, Sn+d+2_s, or
O_cpa_s.
In the binary stored-carry number representation option, the multiplexer
forwards to the register file a pair of values. Figure 5.16 shows that the pair of values
can correspond to the values in the busses Sn+d+2_c and Sn+d+2_s or the values in
the busses O_csa_c and O_csa_s. Depending on the configuration of the adder, the
second pair of outputs could be the values in the busses O_cpa_c and O_cpa_s instead
of the values in the busses O_csa_c and O_csa_s.
As shown in Figures 5.3 and 5.4, the widths of the outputs of the GF (p) adder
and the GF (p) multiplier are not equal to the width of the multiplexer. The outputs
of the multiplier must be sign-extended so they match the width of the multiplexer.
The outputs of the adder must be truncated so they match the width of the mul-
tiplexer. Note that by truncating the output of the adder the range of values that
can be passed through the multiplexer is reduced. The reduced range matches the
range of the inputs of the multiplier. Note that the width of the multiplexer is the
same as the width of the register file.
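
The width matching described above can be pictured with a small software model.
The sketch below assumes two's complement bit patterns held in Python integers;
the helper names and the example widths are hypothetical and are not taken from
the hardware description.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend a two's complement pattern from from_bits to to_bits."""
    assert to_bits >= from_bits
    value &= (1 << from_bits) - 1
    if (value >> (from_bits - 1)) & 1:
        # Replicate the sign bit into the new high-order positions.
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value

def truncate(value, to_bits):
    """Keep only the to_bits low-order bits (this reduces the value range)."""
    return value & ((1 << to_bits) - 1)

def to_signed(value, bits):
    """Interpret a bits-wide pattern as a two's complement integer."""
    return value - (1 << bits) if (value >> (bits - 1)) & 1 else value
```

Sign extension preserves the represented value, whereas truncation preserves
it only when the discarded high-order bits carry no information, which is why
the range of values that can pass through the multiplexer is reduced.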
Sn+d+2_c
O_csa_c or
O_cpa_c
w0 w0
RF_in_c
w0
Sn+d+2_s
O_csa_s or
O_cpa_s
w0 w0
RF_in_s
w0
b) Binary stored-carry number
representation option
O_cpa_s
w0 w0
RF_in_s
w0
a) Nonredundant number
   representation option
Sn+d+2_s
w0 w0
Sn+d+2_c
Figure 5.16: Multiplexer options
Table 5.25 summarizes the complexity and the critical path delay of the multi-
plexer. The estimates in this table are based on the complexity and timing models
introduced in Appendix A. These estimates are defined in terms of the parameter
w0. Table 5.5 provides a definition for w0, and Table 5.6 provides an approximation
of its size for the parameters of interest here. The symbols NR and BSC are used in
the table to identify numbers represented in nonredundant and binary stored-carry
number representations.
Table 5.25: Complexity and critical path delay of multiplexer

Technology     Number rep.  Complexity         Critical path delay
Gates          NR           4w0 AND + 2w0 OR   2(TA + TO)
               BSC          4w0 AND + 2w0 OR   TA + TO
Generic gates  NR           6w0                4TG
               BSC          6w0                2TG
FPGA logic     NR           2w0                2TL
               BSC          2w0                TL
5.8 GF(p) Arithmetic Unit Complexity and Performance
The GF(p) arithmetic unit is composed of the GF(p) multiplier, the GF(p) adder,
the register file, and the multiplexer. These circuits are described in detail in the
preceding sections. This section summarizes the complexity, critical path delay, and
performance of the entire GF(p) arithmetic unit.
5.8.1 Complexity
Tables 5.26 and 5.27 summarize the complexity of the GF(p) arithmetic unit. The
complexity is specified in terms of the number representation supported by the
register file and the GF(p) multiplier. In these tables, the symbols NR and BSC
are used to represent numbers in nonredundant and binary stored-carry number
representations.
A study of the complexity reveals that the logic complexity of the arithmetic
unit is dominated by the complexity of the multiplier. The complexity in terms of
the required number of storage bits is dominated by the storage requirements of the
GF (p) multiplier and the register file.
Table 5.26: Complexity of the GF(p) arithmetic unit

Technology  Reg. file  Q̃α_{i−d}  B, ÃB_i  Logic                                            FF                     Storage bits
            rep.       rep.       rep.
Gates       NR         NR         NR       (4w2(s+v) + 6w0 + 7w1 + 15k + 3D) AND            2w2 + 4w1 + w0 + kd    2^(r−1) s w1 + 2^u v w′1 + hw0
                                           + (2w2(s+v) + 3w0 + 4w1 + 4k + 3D) OR
                                           + (3w2(s+v) + 3w1 + 6k + 2D) XOR
            BSC        NR         BSC      (4w2(2s+v) + 8w0 + 10w1 + 18k + 3D) AND          2w2 + 4w1 + 2w0 + kd   2^r s w1 + 2^u v w′1 + 2hw0
                                           + (2w2(2s+v) + 4w0 + 6w1 + 6k + 3D) OR
                                           + (3w2(2s+v) + 6w1 + 8k + 2D) XOR
            BSC        BSC        NR       (4w2(s+2v) + 6w0 + 10w1 + 15k + 3D) AND          2w2 + 4w1 + w0 + kd    2^(r−1) s w1 + 2^(u+1) v w′1 + 2hw0
                                           + (2w2(s+2v) + 3w0 + 6w1 + 4k + 3D) OR
                                           + (3w2(s+2v) + 6w1 + 6k + 2D) XOR
            BSC        BSC        BSC      (8w2(s+v) + 8w0 + 10w1 + 18k + 3D) AND           2w2 + 4w1 + 2w0 + kd   2^r s w1 + 2^(u+1) v w′1 + 2hw0
                                           + (4w2(s+v) + 4w0 + 6w1 + 6k + 3D) OR
                                           + (6w2(s+v) + 6w1 + 8k + 2D) XOR
Table 5.27: Complexity of the GF(p) arithmetic unit (cont.)

Technology     Reg. file  Q̃α_{i−d}  B, ÃB_i  Logic                                                   FF                     Storage bits
               rep.       rep.       rep.
Generic gates  NR         NR         NR       9w2(s+v) + 9w0 + 14w1 + 25k + 8D GG                     2w2 + 4w1 + w0 + kd    2^(r−1) s w1 + 2^u v w′1 + hw0
               BSC        NR         BSC      9w2(2s+v) + 12w0 + 22w1 + 32k + 8D GG                   2w2 + 4w1 + 2w0 + kd   2^r s w1 + 2^u v w′1 + 2hw0
               BSC        BSC        NR       9w2(s+2v) + 9w0 + 22w1 + 25k + 8D GG                    2w2 + 4w1 + w0 + kd    2^(r−1) s w1 + 2^(u+1) v w′1 + 2hw0
               BSC        BSC        BSC      18w2(s+v) + 12w0 + 22w1 + 32k + 8D GG                   2w2 + 4w1 + 2w0 + kd   2^r s w1 + 2^(u+1) v w′1 + 2hw0
FPGA logic     NR         NR         NR       3w2(s+v) + 3w0 + 5w1 + 2D + ⌈(D−1)/(L−1)⌉ + 14k LUT     2w2 + 4w1 + w0 + kd    2^(r−1) s w1 + 2^u v w′1 + hw0
               BSC        NR         BSC      3w2(2s+v) + 4w0 + 8w1 + 2D + ⌈(D−1)/(L−1)⌉ + 16k LUT    2w2 + 4w1 + 2w0 + kd   2^r s w1 + 2^u v w′1 + 2hw0
               BSC        BSC        NR       3w2(s+2v) + 3w0 + 8w1 + 2D + ⌈(D−1)/(L−1)⌉ + 14k LUT    2w2 + 4w1 + w0 + kd    2^(r−1) s w1 + 2^(u+1) v w′1 + 2hw0
               BSC        BSC        BSC      6w2(s+v) + 4w0 + 8w1 + 2D + ⌈(D−1)/(L−1)⌉ + 16k LUT     2w2 + 4w1 + 2w0 + kd   2^r s w1 + 2^(u+1) v w′1 + 2hw0
5.8.2 Performance
The performance of an arithmetic unit is a function of the number of clock cycles
required to perform the different arithmetic operations and of the clock cycle period.
The minimum clock cycle period is a function of the critical path delay of the
arithmetic unit. Here, it is assumed that the critical path delay of an arithmetic unit
corresponds to the critical path delay of its multiplier. The multiplier is the most
complex component of an arithmetic unit. The critical path delay of the other cir-
cuits of an arithmetic unit can be reduced using the pipelining techniques previously
discussed, thus allowing the critical path delay of the multiplier to dominate.
Given that the critical path delay of the arithmetic unit is assumed to be that
of its multiplier, the critical path delay of the arithmetic unit is specified in Table
5.21 for the different multiplier configurations.
The throughput of the arithmetic unit is that of its components. The throughput
of the adder is summarized in Table 5.9 and the throughput of the multiplier is
summarized in Table 5.23.
Chapter 6
Comparison of GF(2^m) and GF(p) Arithmetic Units
6.1 Introduction
This section compares the estimated complexity and performance of the GF(p)
arithmetic unit against the estimated complexity and performance of the GF(2^m)
arithmetic units that incorporate digit-serial multipliers and squaring logic.
6.1.1 Complexity
The complexity of the GF(p) arithmetic unit is summarized in Tables 5.26 and 5.27.
As can be observed from these tables, the main factor that defines the complexity
of the GF(p) arithmetic unit is the number representation. One can observe that
the logic complexity of an implementation that uses mainly binary stored-carry
number representation is about twice that of an implementation that uses mainly
nonredundant number representation.
The complexity of the GF(p) arithmetic unit depends on the following system
parameters: m, k, r, s, u, v, d, h, and D.
The parameter m represents the number of bits required to represent the ele-
ments of the field GF (p).
The parameter k represents the level of parallelism realized by the multiplier,
which is equivalent to the digit size of a GF(2^m) digit-serial multiplier (the digit
size of a digit-serial multiplier is represented with the variable D). The level of
parallelism of the GF(p) multiplier is realized with logic elements and with memory.
The logic complexity is influenced by the parameters s and v, and the memory
requirements are influenced by the parameters r, s, u, and v (note that k = rs = uv).
The parameter d represents the quotient resolution delay of the multiplier, and
the parameter h represents the number of registers in the register file. The parameter
D represents the digit size of the carry-propagate adder embedded in the GF (p)
adder.
The complexities of the GF(2^m) arithmetic units are summarized in Tables 4.43
and 4.44 for implementations with generic gates and FPGA logic. As these tables
show, the complexities of the GF(2^m) arithmetic units that use digit-serial
multipliers are a function of the parameters m, D, and h, where m represents the degree
of the extension field, D represents the digit size of the multiplier, and h represents
the number of registers in the register file.
A precise measure of complexity can be obtained by substituting values for the
parameters that define the complexities of the GF(2^m) and the GF(p) arithmetic
units. In this section, we are interested in establishing the rate of logic growth for
the GF(2^m) and the GF(p) arithmetic units as a function of m, k, and D, where
k represents the level of parallelism achieved by the GF(p) multiplier and where D
represents the level of parallelism achieved by the GF(2^m) multiplier. The analysis
is restricted to GF(2^m) arithmetic units that incorporate digit-serial multipliers and
parallel squarers.
To facilitate the analysis it is assumed that m is much larger than D and k
(m ≫ k, D), and that D is equal to k (D = k). The analysis ignores terms
independent of m, k, r, s, u, v, D, and h. The analysis also assumes that d is equal
to zero, which constitutes the best case for quotient resolution delay. The analysis
is restricted to implementations using generic gates and FPGA logic. For FPGA
implementations, the estimates assume that Z is equal to one and ignore the ceiling
operators.
Table 6.1 summarizes the complexities of the GF(2^m) and the GF(p) arithmetic
units. These estimates are based on the aforementioned simplifications. Table 6.2
summarizes the complexity ratios established by dividing the complexities of
different configurations of the GF(p) arithmetic unit by the complexity of the GF(2^m)
arithmetic unit. This table uses the lowest complexity numbers for the GF(p)
arithmetic unit listed in Table 6.1. Table 6.3 summarizes results similar to those
summarized in Table 6.2, but it uses the highest complexity numbers for the GF(p)
arithmetic unit listed in Table 6.1.
Tables 6.2 and 6.3 include the case for which s is not equal to v (this case
represents scalar multipliers that use different numbers of processing units). These
tables also include cases for which s is equal to v. The first case of this type defines
the use of a multiplier that uses minimum precomputation (s = v = k, r = u = 1).
This case leads to the highest logic complexity and the lowest memory requirements.
The second case defines the use of a multiplier that uses maximum precomputation
(s = v = 1, r = u = k). This case leads to the lowest logic complexity and the
highest memory requirements. As described earlier, this case is limited by the size
of k for which precomputation is worthwhile.
Table 6.1: Complexity of GF(2^m) and GF(p) arithmetic units (m ≫ k, D, r, s, u, v, d)

Arithmetic unit                   # Generic gates  # LUTs       # Storage bits
GF(2^m) (digit-serial mult.       2Dm              2Dm/(L − 1)  hm
  & squaring logic)
GF(p) (min. complexity)           9m(s + v)        3m(s + v)    (h + 2^(r−1) s + 2^u v)m
GF(p) (max. complexity)           18m(s + v)       6m(s + v)    (2h + 2^r s + 2^(u+1) v)m
Table 6.2: Complexity ratio between GF(2^m) and GF(p) (min. complexity) arithmetic
units (m ≫ k, D, r, s, u, v, d)

Case                  # Generic gates     # LUTs                      # Storage bits
s ≠ v                 (9/2)(1/r + 1/u)    (3/2)(L − 1)(1/r + 1/u)     1 + (2^(r−1) s + 2^u v)/h
s = v                 9/r                 3(L − 1)/r                  1 + 3·2^(r−1) s/h
s = v = k, r = u = 1  9                   3(L − 1)                    ≈ 1
s = v = 1, r = u = k  9/k                 3(L − 1)/k                  1 + 3·2^(k−1)/h
Table 6.3: Complexity ratio between GF(2^m) and GF(p) (max. complexity) arithmetic
units (m ≫ k, D, r, s, u, v, d)

Case                  # Generic gates     # LUTs                      # Storage bits
s ≠ v                 9(1/r + 1/u)        3(L − 1)(1/r + 1/u)         2 + (2^r s + 2^(u+1) v)/h
s = v                 18/r                6(L − 1)/r                  2 + 3·2^r s/h
s = v = k, r = u = 1  18                  6(L − 1)                    ≈ 2
s = v = 1, r = u = k  18/k                6(L − 1)/k                  2 + 3·2^k/h
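
The ratios in Tables 6.2 and 6.3 follow from dividing the GF(p) expressions of
Table 6.1 by the GF(2^m) expressions with D = k, s = k/r, and v = k/u. The
Python sketch below performs that division with exact fractions so the
special cases can be checked numerically. The function name and its arguments
are assumptions made for this example.

```python
from fractions import Fraction

def complexity_ratios(k, r, u, h, L, max_complexity=False):
    """GF(p)/GF(2^m) ratios from the simplified expressions of Table 6.1.

    Assumes D = k and k = r*s = u*v. Returns (generic gates, LUTs, storage bits).
    """
    s, v = Fraction(k, r), Fraction(k, u)
    D = k
    scale = 2 if max_complexity else 1
    gates = Fraction(9 * scale) * (s + v) / (2 * D)             # 9m(s+v) / 2Dm
    luts = Fraction(3 * scale) * (s + v) * (L - 1) / (2 * D)    # 3m(s+v) / (2Dm/(L-1))
    if max_complexity:
        storage = (2 * h + 2 ** r * s + 2 ** (u + 1) * v) / h
    else:
        storage = (h + 2 ** (r - 1) * s + 2 ** u * v) / h
    return gates, luts, storage
```

For example, k = 8 with r = u = 1 (minimum precomputation), h = 128, and L = 4
yields a generic gate ratio of 9 and a LUT ratio of 3(L − 1) = 9, in agreement
with the third row of Table 6.2.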
6.1.2 Performance
The performance of an arithmetic unit is a function of the number of clock cycles
required to perform the different arithmetic operations and of the clock cycle pe-
riod. The minimum clock cycle period is a function of the critical path delay of an
arithmetic unit. Here, it is assumed that the critical path delay of an arithmetic
unit is that of its multiplier.
Table 6.4 summarizes the critical path delays of the GF(2^m) and GF(p)
arithmetic units. The GF(2^m) arithmetic unit incorporates a digit-serial multiplier and
squaring circuitry. The results for the different configurations of the GF(p)
arithmetic unit show delays that are proportional to u, r, and k. An analysis of the
critical path delay of a GF(p) multiplier reveals that its critical path delay is
dominated by the critical path delays of ripple-carry adders in the Booth recoding and
the S_i/2^k circuits. FPGA logic usually incorporates fast carry logic that can
realize ripple-carry adders that exhibit lower critical path delay than what is assumed
here. In addition, the delays of the carry-propagate adders can be reduced at the
expense of higher logic complexity by using faster adders, such as carry-lookahead
and carry-skip adders.
Assuming that a GF(p) arithmetic unit is built using ripple-carry adders, one
would expect that the ratio established by dividing the critical path delay of the
GF(p) arithmetic unit by the critical path delay of the GF(2^m) arithmetic unit
would be of the order of O(k/log2 k) for implementations using gates and O(k/log_L k)
for implementations using FPGA logic (assuming k ≫ r, s, u, v, and k = D).
The estimation of the operational clock frequency of an arithmetic unit is a
complex task. As can be appreciated from Table 6.4, the expressions for clock
frequencies depend on several parameters. There are additional parameters not
included in the table, such as routing delays, that influence the critical path delay of
an implementation. To determine the exact operational clock frequency of a circuit,
this work suggests prototyping the circuit in the targeted hardware platform.
Table 6.4: Critical path delay of GF(2^m) and GF(p) arithmetic units

Arithmetic  min./max.  Generic gates (TG)                               FPGA logic (TL)
unit
GF(2^m)     min.       ⌈log2(2D + 1)⌉                                   ⌈log_L(2DZ + 1)⌉
            max.       ⌈log2(D + r + 1)⌉ + 1                            ⌈log_L(2Z(D + r) + 1)⌉
GF(p)       min.       3k + u + 4 + 3⌈log_{3/2}((s + v + 2)/2)⌉         k + u + 2 + ⌈log_{3/2}((s + v + 2)/2)⌉
            max.       3k + max(r, u) + 4 + 3⌈log_{3/2}(s + v + 1)⌉     k + max(r, u) + 2 + ⌈log_{3/2}(s + v + 1)⌉
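
To see how the generic gate expressions of Table 6.4 lead to a delay ratio
that grows roughly as k/log2 k, the following Python sketch evaluates the two
"min." rows with D = k. The helper names and the choice r = u = 2 are
assumptions made for illustration; logarithms are computed exactly to avoid
floating-point rounding of the ceiling operators.

```python
from fractions import Fraction

def ceil_log_3_2(x):
    """Smallest n >= 0 such that (3/2)**n >= x."""
    levels, reach = 0, Fraction(1)
    while reach < x:
        reach *= Fraction(3, 2)
        levels += 1
    return levels

def gf2m_min_delay_gates(D):
    """ceil(log2(2D + 1)) in units of T_G."""
    return (2 * D).bit_length()

def gfp_min_delay_gates(k, u, s, v):
    """3k + u + 4 + 3*ceil(log_{3/2}((s + v + 2)/2)) in units of T_G."""
    return 3 * k + u + 4 + 3 * ceil_log_3_2(Fraction(s + v + 2, 2))

def delay_ratio(k, r=2, u=2):
    """GF(p) min. delay divided by GF(2^m) min. delay, with D = k."""
    s, v = k // r, k // u
    return Fraction(gfp_min_delay_gates(k, u, s, v), gf2m_min_delay_gates(k))
```

Under these assumptions the ratio is 27/4 for k = 4 and 12 for k = 16, and it
keeps increasing with k.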
Table 6.5 summarizes the processing time estimates for the multiplication, the
squaring, the addition, and the subtraction operations. For the GF(p) arithmetic
unit, the processing time for the addition and the subtraction operations assumes
the use of carry-save addition, and the processing time for the squaring and the
multiplication operations assumes that the multiplication processing time is dominated
by the factor ⌈m/k⌉.
As can be appreciated from Table 6.5, the main difference between the two
arithmetic units is the time it takes to perform a square operation. The GF(2^m)
arithmetic unit incorporates a parallel squarer capable of computing a square
operation in one clock cycle while the GF(p) arithmetic unit uses the multiplier to
compute both multiplications and squares.
Table 6.5: Processing time for GF(p) and GF(2^m) arithmetic units (in # clock
cycles)

Operation             GF(2^m)   GF(p)
Addition/subtraction  1–2       1–2 (CSA)
Square                1         ≈ ⌈m/k⌉
Multiplication        ⌈m/D⌉     ≈ ⌈m/k⌉
6.2 Prototype Implementations
This section describes prototype implementations of processors suitable for the
computation of point multiplications for curves defined over fields GF(2^m). These
processors are referred to here as GF(2^m) processors. The architecture of these
prototypes was previously documented in [OP00a].
This section also describes a prototype implementation of a processor suitable for
the computation of point multiplications for curves defined over fields GF(p). This
processor is referred to here as GF(p) processor. The architecture of this prototype
was previously documented in [OP01].
This section starts with descriptions of the prototyped processors. The section
continues with descriptions of the complexity and the performance of each prototype.
The section ends with comparisons of prototype implementations of GF(p) and
GF(2^m) processors with comparable degrees of parallelism.
6.2.1 Description of the Prototype Implementations of the GF(2^m) Processors
This section describes prototype implementations of the GF(2^m) processors. The
GF(2^m) processors are based on the elliptic curve processor model shown in Figure
3.2.
Figure 6.1 shows a block diagram of the arithmetic unit architecture used by
the GF(2^m) processors. This architecture uses the same components used by
architecture 3 of Figure 4.13 with the components arranged in a slightly different
configuration.
The components of the arithmetic unit shown in Figure 6.1 are arranged in a
streamlined fashion for which signals exhibit low fan-out. This arrangement and
the use of register stages to control critical path delays led to fast clock rates. In
addition, the architecture of the arithmetic unit shown in Figure 6.1 is well suited
for the computation of consecutive square and multiply operations, sequences that
are common in the computation of modular exponentiations.
Figure 6.1: GF(2^m) arithmetic unit architecture
The main controllers used by the GF(2^m) processors support the following
subset of the instructions listed in Section 3.2: branch direct, jump to subroutine direct,
jump to subroutine indirect, return, and load register immediate. The main
controllers implement conditional branches using the model used by the arithmetic
unit controller described in Section 3.3, which applies a mask to a flag register and
branches if the result of ORing the mask with the flag register is nonzero. The
main controllers also support a no operation instruction and a shift k instruction.
The no operation instruction does no processing. The shift k instruction forces the
shift register containing the point multiplier k to deposit the content of its most
significant bit into the flag register.
The reduced instruction set of the main controllers is more limited than the
instruction set described in Section 3.2. This limited instruction set is adequate for
simple point multiplication algorithms such as the double-and-add and the
Montgomery point multiplication algorithms. These algorithms are described in Sections
2.8.1 and 2.7.1. The limited instruction set and the use of shift registers to store the
point multipliers relieved the main controllers from having to incorporate arithmetic
logic units and from having to incorporate and support data memories.
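
The way a shift register and a flag-conditioned branch suffice to drive a
double-and-add point multiplication can be sketched in software. The Python
model below is illustrative only: the function name and its callback arguments
are hypothetical, and the group used in the example is the integers modulo N
under addition, a toy stand-in for an elliptic curve group chosen so that the
result is easy to check.

```python
def double_and_add(k, k_bits, point, add, double, identity):
    """Left-to-right double-and-add driven by an MSB-first shift register."""
    mask = (1 << k_bits) - 1
    shift_reg = k & mask
    acc = identity
    for _ in range(k_bits):
        flag = (shift_reg >> (k_bits - 1)) & 1   # "shift k": MSB goes to the flag
        shift_reg = (shift_reg << 1) & mask
        acc = double(acc)
        if flag:                                 # conditional branch on the flag
            acc = add(acc, point)
    return acc

# Toy stand-in group: integers modulo N under addition (NOT an elliptic curve).
N = 1009

def toy_add(a, b):
    return (a + b) % N

def toy_double(a):
    return (2 * a) % N

result = double_and_add(91, 8, 5, toy_add, toy_double, 0)   # 91 * 5 mod 1009
```

No arithmetic on k itself is needed beyond shifting, which is consistent with
the controllers not requiring an arithmetic logic unit or a data memory.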
Each main controller uses 16-bit instructions, of which three bits are used as
opcodes and the rest of the bits are used as data or address fields. It incorporates
a program memory with capacity for 256 instructions and executes one instruction
per clock cycle.
The arithmetic unit controllers used by the GF(2^m) processors support the
following subset of the instructions listed in Section 3.3: no operation, branch direct,
branch conditional direct, jump to subroutine direct, jump to subroutine indirect,
return, and load register immediate.
Each arithmetic unit controller uses 24-bit instructions, of which three bits are
used as opcodes and the rest of the bits are used as data or address fields. It
incorporates a program memory with capacity for 512 instructions and executes one
instruction per clock cycle.
Figure 6.1 shows a block diagram of the arithmetic unit architecture used by the
GF(2^m) processors. The prototyped GF(2^m) processors support the field GF(2^167),
where this field is defined by the following irreducible trinomial: F(x) = x^167 +
x^6 + 1. Each arithmetic unit uses a least significant digit-serial multiplier of the
type described in Section 4.6, a parallel squarer, and a register file containing 128
registers. Each arithmetic unit supports only the aforementioned field polynomial
(fixed polynomial implementation).
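
A software model can make the least significant digit-first multiplication in
GF(2^167) concrete. The Python sketch below represents field elements as
integers whose bits are polynomial coefficients and processes D bits of one
operand per iteration, which corresponds to ⌈m/D⌉ multiplier cycles. It is a
behavioral sketch written for this example, not the hardware architecture of
Section 4.6, and all function names are hypothetical.

```python
M = 167
F = (1 << 167) | (1 << 6) | 1        # F(x) = x^167 + x^6 + 1

def reduce_mod_f(c):
    """Reduce a binary polynomial modulo F(x)."""
    while c.bit_length() > M:
        c ^= F << (c.bit_length() - 1 - M)
    return c

def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def lsd_multiply(a, b, digit_size):
    """Multiply a*b mod F(x), consuming b one digit at a time, LSD first.

    Returns (product, number of digit iterations).
    """
    acc, cycles = 0, 0
    mask = (1 << digit_size) - 1
    for i in range(0, M, digit_size):
        digit = (b >> i) & mask
        acc ^= clmul(a, digit)                  # accumulate A * B_i
        a = reduce_mod_f(a << digit_size)       # A <- A * x^D mod F(x)
        cycles += 1
    return reduce_mod_f(acc), cycles
```

For the prototyped digit sizes 4, 8, and 16, the loop runs 42, 21, and 11
times, respectively.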
Three GF(2^m) processors were prototyped. Each prototype incorporates a
multiplier with a different digit size. The digit sizes of the prototypes are 4, 8, and
16.
The GF(2^m) processors were targeted for Xilinx XCV400E-8 FPGAs. The main
building blocks of these FPGAs are 4-input LUTs, flip-flops, and Block RAMs. A
Block RAM is a 4096-bit dual-ported memory element. In these FPGAs, a LUT
can be used as a 16x1 RAM (16 memory locations, each capable of storing one bit),
as a 16x1 ROM, or as a 4-input gate. These FPGAs incorporate fast carry logic.
6.2.2 Complexity and Performance of the Prototyped GF(2^m) Processors
Table 6.6 summarizes the complexity of the prototyped GF(2^m) processors. This
table includes normalized complexity with respect to m enclosed in parentheses.
In the prototyped GF(2^m) processors, the LUTs are the components used to
implement arithmetic functions. Flip-flops are mainly used to hold operands and
temporary results, and they are also used to manage critical path delays. Block
RAM memory elements are used to implement the registers in the register files, and
they are also used to implement the program memories of the main controllers and
the arithmetic unit controllers.
Table 6.6: Logic complexity of GF(2^m) processors (m = 167)

Digit size  # LUT          # FF           # Block RAM
4           1627 (9.7m)    1745 (10.4m)   10 (0.06m)
8           2136 (12.8m)   1753 (10.5m)   10 (0.06m)
16          3002 (18.0m)   1769 (10.6m)   10 (0.06m)
Table 6.7 compares the estimated LUT complexities of the arithmetic units used
249
by the different processors against the measured LUT complexities for the different
processors. As described previously, LUTs are the components used to perform
arithmetic functions. The number of LUTs was estimated using the expression in
Table 4.42 with Z equal to one and L equal to four.
The results in Table 6.7 suggest that the arithmetic unit of a GF (2m) processor
uses about 90% of the total number of LUTs used by the processor.
Table 6.7: Estimated LUT complexity of an arithmetic unit versus measured LUT
complexity of GF(2^m) processor (m = 167)
Digit   Measured # LUT    Estimated # LUT    (Estimated # LUT arithmetic unit) /
size    GF(2^m) proc.     arithmetic unit    (Measured # LUT GF(2^m) proc.)
4       9.7m              8.6m               0.89
8       12.8m             11.6m              0.91
16      18.0m             16.6m              0.92
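The normalized figures in Tables 6.6 and 6.7 can be re-derived with a short calculation. The sketch below uses only the values listed in those tables; the dictionary and function names are hypothetical.

```python
# Re-derive the normalized complexities of Table 6.6 and the ratios of Table 6.7.

M = 167

# Digit size -> (# LUT, # FF, # Block RAM), from Table 6.6
MEASURED = {4: (1627, 1745, 10), 8: (2136, 1753, 10), 16: (3002, 1769, 10)}

# Digit size -> estimated arithmetic-unit LUTs in units of m, from Table 6.7
ESTIMATED_AU_LUTS_PER_M = {4: 8.6, 8: 11.6, 16: 16.6}


def normalized(count, m=M):
    """Express a resource count as a multiple of m."""
    return count / m


def arithmetic_unit_share(digit_size):
    """Estimated arithmetic-unit LUTs divided by measured processor LUTs."""
    measured_luts = MEASURED[digit_size][0]
    return ESTIMATED_AU_LUTS_PER_M[digit_size] / normalized(measured_luts)


for d in sorted(MEASURED):
    luts, ffs, brams = MEASURED[d]
    print(d, round(normalized(luts), 1), round(normalized(ffs), 1),
          round(normalized(brams), 2), round(arithmetic_unit_share(d), 2))
```

The share stays near 0.9 for all three digit sizes, which is the basis for the statement that the arithmetic unit dominates the LUT count.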
Table 6.8 summarizes the point multiplication performance of the different GF(2^m)
processors when they implement the Montgomery point multiplication algorithm
and the double-and-add point multiplication algorithm with points represented in
Jacobian and affine coordinates. This table also shows the speedup realized with
respect to the processor implementation that uses a multiplier with digit size equal
to four (D = 4).
When the processing cost of multiplications is much greater than the processing
cost of other field and overhead operations, the performance of a processor scales
in proportion to the digit size. The results in Table 6.8 show that the speedup
falls short of this proportional scaling as the digit size increases: as the
processing cost of multiplications decreases, the relative processing cost of other
field and overhead operations increases.
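The scaling behaviour can be checked against the values listed in Table 6.8. The sketch below converts each processing time into clock cycles (MHz times msec gives thousands of cycles) and compares the resulting speedups with the ideal factors of 2 and 4. Reading the Speedup column as a ratio of clock cycles rather than of times in msec is an inference from the listed values, not a statement in the text; the names are hypothetical.

```python
# Digit size -> (clock MHz, Montgomery msec, double-and-add msec), from Table 6.8
TABLE_6_8 = {4: (85.7, 0.55, 0.96), 8: (74.5, 0.35, 0.61), 16: (76.7, 0.21, 0.36)}


def processing_time_ms(digit_size, algorithm):
    _, montgomery_ms, double_add_ms = TABLE_6_8[digit_size]
    return montgomery_ms if algorithm == "montgomery" else double_add_ms


def clock_cycles(digit_size, algorithm):
    """Approximate cycle count: frequency (MHz) * 1e3 * time (msec)."""
    freq_mhz = TABLE_6_8[digit_size][0]
    return freq_mhz * 1e3 * processing_time_ms(digit_size, algorithm)


def cycle_speedup(digit_size, algorithm):
    """Speedup over D = 4 measured in clock cycles."""
    return clock_cycles(4, algorithm) / clock_cycles(digit_size, algorithm)


def time_speedup(digit_size, algorithm):
    """Speedup over D = 4 measured in msec."""
    return processing_time_ms(4, algorithm) / processing_time_ms(digit_size, algorithm)


for d in (8, 16):
    print(d, d // 4,
          round(cycle_speedup(d, "montgomery"), 2),
          round(time_speedup(d, "montgomery"), 2))
```

Under this reading, doubling the digit size from 4 to 8 yields a cycle speedup of about 1.8 instead of 2, and quadrupling it yields about 3 instead of 4. The speedups in msec are lower still because the D = 4 prototype runs at the highest clock frequency.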
Table 6.9 compares the fastest prototype implementation of the GF (2m) pro-
cessor against the leading hardware accelerators for point multiplication for curves
defined over fields GF (2m). The results in this table show that the fastest GF (2m)
processor is about 17 times faster than previously documented processors.
Table 6.8: Point multiplication performance of GF(2^m) processors
Digit size   Clock frequency   Montgomery   Double-and-add   Speedup relative
(D)          (MHz)             (msec)       (msec)           to D = 4
4            85.7              0.55         0.96             1.0
8            74.5              0.35         0.61             1.8
16           76.7              0.21         0.36             3.0
Table 6.9: Performance of leading hardware accelerators that compute point multi-
plications for curves defined over fields GF(2^m)
Implementation/       Platform                    Point mult.   Processing time relative to
fields                                            (msecs)       processing time of GF(2^m)
                                                                processor with D = 16
[AMV93]               VLSI                        3.9           19
GF(2^155)             40 MHz                      est.
[SES98]               Xilinx FPGA                 18.4          88
GF(2^155)             XC4020XL, 15 MHz            est.
[Ros98b]              Xilinx FPGA                 4.5           21
GF(((2^4)^2)^21)      XC4062, 16 MHz              est.
GF((2^8)^21)
[LMWL00]              Xilinx FPGA                 3.7           17
GF(2^113)             XCV300, 45 MHz
[Mot01]               VLSI, 66 MHz                5.7           27
GF(2^155)             (for one engine, a total
                      of six engines in IC)
GF(2^m) proc.         Xilinx FPGA                 0.21          1
with D = 16           XCV400E, 76.7 MHz
GF(2^167)
6.2.3 Description of the Prototype Implementation of the
GF(p) Processor
This section describes the prototype implementation of the GF (p) processor. The
prototyped processor is based on the elliptic curve processor model shown in Figure
3.2. This processor uses an arithmetic unit similar to that shown in Figure 5.1.
The GF (p) processor uses the same main controller architecture that is used by
the GF (2m) processors. The characteristics of the main controller are described in
Section 6.2.1.
The arithmetic unit controller used by the GF(p) processor differs slightly
from the arithmetic unit controllers used by the GF(2^m) processors. It supports
all the instructions supported by the arithmetic unit controllers of the GF(2^m)
processors, and it also supports a variant of the move instruction described in
Section 3.3.
The arithmetic unit controller used by the GF (p) processor uses 32-bit instruc-
tions of which three bits are used as instruction opcodes and the rest of the bits are
used as data or address fields. The controller incorporates a program memory with
capacity for 2048 instructions and executes one instruction per clock cycle.
The computation of additions, subtractions, squares and multiplications is more
complex for the GF(p) processor than for the GF(2^m) processors. Because of this
complexity, the arithmetic unit controller used by the GF(p) processor requires
more program memory capacity than the controllers used by the GF(2^m)
processors.
Figure 5.1 shows a block diagram of the arithmetic unit used by the GF (p)
processor. The arithmetic unit uses an adder similar to the one shown in Figure
5.3. The adder includes a carry-save adder for fast additions and a digit-serial
adder for the conversions of numbers from binary stored-carry number representation
to nonredundant number representation. The digit size of the digit-serial adder
is eight. The adder accepts numbers represented in binary stored-carry number
representation from the register file (the I c and I s inputs of the adder are used).
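The two parts of the adder can be illustrated with a minimal sketch. It assumes that a number in binary stored-carry representation is a pair (sum word, carry word) whose ordinary sum is the represented value, and it models the digit-serial adder as an eight-bit-per-step conversion. The function names and the 200-bit width in the usage note are hypothetical and do not come from Figure 5.3.

```python
def carry_save_add(s, c, x):
    """Add x to the stored-carry pair (s, c) without propagating carries."""
    sum_word = s ^ c ^ x                                # bitwise sum
    carry_word = ((s & c) | (s & x) | (c & x)) << 1     # carries, moved one bit left
    return sum_word, carry_word


def to_nonredundant(s, c, width, digit_size=8):
    """Resolve the pair (s, c) into one integer, digit_size bits per step."""
    mask = (1 << digit_size) - 1
    n_digits = -(-width // digit_size)      # ceil(width / digit_size)
    result = 0
    carry = 0
    for i in range(n_digits):
        pos = i * digit_size
        t = ((s >> pos) & mask) + ((c >> pos) & mask) + carry
        result |= (t & mask) << pos         # low bits form the result digit
        carry = t >> digit_size             # carry into the next digit
    return result | (carry << (n_digits * digit_size))
```

For example, `to_nonredundant(*carry_save_add(a, b, x), width=200)` equals `a + b + x` for non-negative operands that fit in the given width. The carry-save step has a delay that does not depend on the operand length, while the conversion needs one step per digit, which is consistent with reserving the digit-serial adder for conversions.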
The GF (p) processor uses a GF (p) multiplier similar to that shown in Figure
5.4. This multiplier is characterized by k = 8, where k is defined by r = u = 4 and
s = v = 2 (k = rs = uv). The quotient resolution delay of the multiplier is equal
to four (d = 4). The degree of parallelism of this multiplier is comparable to the
degree of parallelism of the multiplier used by the GF (2m) processor that uses a
multiplier with digit size equal to eight.
The multiplier used by the GF(p) processor computes the product of two field
elements of the field GF(2^192 − 2^64 − 1) in 38 clock cycles, where for this multiplier
n = 25, d = 4, and the number of precomputations for each multiplication is equal
to eight (iA precomputations only, each precomputed value is computed in one
clock cycle). The field GF(2^192 − 2^64 − 1) is specified in [FIP00] for elliptic curve
cryptosystems. Note that for this field and the specified degree of parallelism and
quotient resolution delay, QM < 2knM .
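As a point of reference, the following sketch shows plain radix-2^k Montgomery multiplication for p = 2^192 − 2^64 − 1 with k = 8. It is not the multiplier of Figure 5.4: quotient pipelining (the delay d), the precomputed multiples, and the stored-carry arithmetic are all omitted. Interpreting n = 25 as the number of digit iterations, so that R = 2^200, is an assumption, and the function name is hypothetical. The call `pow(p, -1, radix)` requires Python 3.8 or later.

```python
P = 2**192 - 2**64 - 1
K = 8            # digit size in bits (k = 8)
N_DIGITS = 25    # assumed number of digit iterations, so R = 2^(8 * 25)


def montgomery_multiply(a, b, p=P, k=K, n=N_DIGITS):
    """Return a * b * R^-1 mod p with R = 2^(k*n), for 0 <= a, b < p."""
    radix = 1 << k
    p_prime = (-pow(p, -1, radix)) % radix      # -p^-1 mod 2^k
    s = 0
    for i in range(n):
        b_i = (b >> (k * i)) & (radix - 1)      # next k-bit digit of b
        s += a * b_i
        q = (s * p_prime) % radix               # makes s + q*p divisible by 2^k
        s = (s + q * p) >> k                    # exact division by 2^k
    return s - p if s >= p else s               # s < 2p, one subtraction suffices
```

Multiplying the result by R modulo p recovers a * b mod p. The loop makes the data dependence visible: each quotient digit q depends on the partial result s of the same iteration, and this is the dependence that a quotient resolution delay relaxes in a pipelined design.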
The GF (p) processor incorporates a register file containing 128 registers, each
capable of storing numbers in binary stored-carry number representation.
The GF (p) processor was targeted for a Xilinx XCV1000E-8 FPGA. The build-
ing blocks of this FPGA are similar to the building blocks incorporated in the
XCV400E-8 FPGAs used to prototype the GF (2m) processors. The XCV400E-8
and XCV1000E-8 are FPGAs of equal speed grade and different logic densities. De-
tails on the building blocks of the XCV400E-8 and XCV1000E-8 are given in Section
6.2.1.
6.2.4 Complexity and Performance of the Prototyped GF(p)
Processor
Table 6.10 summarizes the complexity of the prototyped GF (p) processor. This
table includes results that are normalized with respect to m and w0.
From Table 5.27, one can appreciate that the complexity of a GF (p) arithmetic
unit is a function, among others, of w0, w1, and w2. From the data specified in
Table 5.6, one can observe that w0 can be used to approximate the values of w1,
w2 and w3, especially when w0 >> k. From Table 5.6, one can also appreciate that
w0 takes into account the effects of quotient resolution delay (d). The definitions in
Table 5.6 suggest that when d is small and m is large, the value of w0 approximates
the value of m.
The complexity estimates normalized with respect to w0 allow one to study the
complexity of the GF(p) processor independently of the quotient resolution delay
(in general w0 >> r, s, u, v, k, d, kd).
As for the GF (2m) processors, the LUTs are the components used in the GF (p)
processor to implement arithmetic functions. In the GF (p) processor, LUTs config-
ured as 16x1 RAMs are used to build the memory elements used by the processing
units of the GF (p) multiplier. In the processing units, each LUT provides storage
for 2^(r−1) or 2^u bits (r = u = 4).
In the GF(p) processor, flip-flops are mainly used to hold operands and tem-
porary results and to manage critical path delays. Block RAM memory elements
implement the registers in the register file, the program memories of the main
controller and the arithmetic unit controller, and the Booth recoders in the
GF(p) multiplier.
Table 6.10: Logic complexity of GF(p) processor
Normalized with respect to                 # LUT     # FF      # Block RAM
N/A                                        11,416    5,735     35
m = 192                                    59.5m     29.9m     0.18m
w0 = m + (k(d + 1.5) + 1)                  48.2w0    24.2w0    0.15w0
   = 192 + 8 ∗ (4 + 1.5) + 1 = 237
Table 6.11 compares the estimated LUT complexity of the arithmetic unit against
the measured LUT complexity of the entire GF (p) processor. The number of LUTs
was estimated by adding the number of LUTs used as memory elements in the
GF (p) multiplier and the number of LUTs specified for the arithmetic unit in Table
5.27.
Each processing unit used by the multiplier used either w1 or w1′ LUTs configured
as memory elements. The GF (p) processor uses an arithmetic unit that supports
numbers in binary stored-carry number representation in the register file, the Q˜αi−d
scalar multiplier, and the A˜Bi scalar multiplier.
The following expression estimates the LUT complexity of the arithmetic unit:
6w2(s + v) + 2w1s + 2w1′v + 4w0 + 8w1 + 2D + ⌈(D − 1)/(L − 1)⌉. The estimates
in Table 6.11 approximate this expression as follows: (8(s + v) + 12)w0. This last
expression uses w0 to approximate the values of w1, w1′, and w2. The last expression
also assumes that w0 >> D.
The results in Table 6.11 suggest that the arithmetic unit of the GF (p) processor
uses about 91% of the total number of LUTs used by the processor.
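The estimate can be reproduced numerically. The sketch below evaluates the LUT expression and its approximation for the prototype parameters; taking D = 8 (the digit size of the digit-serial adder) and L = 4 (the number of LUT inputs) is an assumption based on values given elsewhere in the text, and setting w1, w1′ and w2 equal to w0 mirrors the approximation rather than the definitions in Table 5.6. The function names are hypothetical.

```python
import math


def w0(m, k, d):
    """w0 = m + (k(d + 1.5) + 1), as listed in Table 6.10."""
    return m + (k * (d + 1.5) + 1)


def lut_estimate(w0_, w1, w1p, w2, s, v, D, L):
    """6w2(s+v) + 2w1s + 2w1'v + 4w0 + 8w1 + 2D + ceil((D-1)/(L-1))."""
    return (6 * w2 * (s + v) + 2 * w1 * s + 2 * w1p * v
            + 4 * w0_ + 8 * w1 + 2 * D + math.ceil((D - 1) / (L - 1)))


def lut_estimate_approx(w0_, s, v):
    """(8(s+v) + 12) w0, the approximation used for Table 6.11."""
    return (8 * (s + v) + 12) * w0_


w = w0(192, 8, 4)                                   # 237
approx = lut_estimate_approx(w, 2, 2)               # 44 * w0
full = lut_estimate(w, w, w, w, 2, 2, 8, 4)         # with w1 = w1' = w2 = w0
print(w, approx / w, full - approx, round(approx / 11416, 2))
```

With every w term set to w0, the full expression and the approximation differ only by the 2D + ⌈(D − 1)/(L − 1)⌉ term, which is small compared with 44w0. Dividing the approximation by the 11,416 measured LUTs gives the ratio of about 0.91 reported in Table 6.11.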
Table 6.11: Estimated LUT complexity of arithmetic unit versus measured LUT
complexity of GF(p) processor (w0 = 237)
Measured # LUT      Estimated # LUT     (Estimated # LUT arithmetic unit) /
GF(p) processor     arithmetic unit     (Measured # LUT GF(p) processor)
                    (s = v = 2)
48.2w0              44w0                0.91
Table 6.12 lists the estimated processing time for the computation of an arbitrary
point multiplication using the double-and-add point multiplication algorithm with
points represented in Jacobian and affine coordinates. The estimate ignores the
processing time associated with overhead processing and finite field additions and
subtractions.
Table 6.13 compares the prototype implementation of the GF(p) processor against
the leading hardware accelerators for point multiplication for curves defined over
fields GF (p). The results in this table show that the prototyped GF (p) processor
is over 1.5 times faster than the other processor listed in the table.
Table 6.12: Estimated point multiplication performance of GF(p) processor
Clock frequency (MHz)    Double-and-add (msecs)
40                       3.6
Table 6.13: Performance of leading hardware accelerators that compute point mul-
tiplications for curves defined over fields GF(p)
Implementation/               Platform                    Point mult.   Processing time relative
fields                                                    (msecs)       to prototyped processor
                                                                        for GF(2^192 − 2^64 − 1)
[Mot01]                       VLSI, 66 MHz                5.7           1.58
p = 155-bit number            (for one engine, a total
                              of six engines in IC)
GF(2^192 − 2^64 − 1) proc.    Xilinx FPGA                 3.6           1
with k = 8                    XCV1000E, 40 MHz
6.2.5 Comparison of Prototyped GF(p) and GF(2m)
Processors
This section compares the complexity and the performance of GF (2m) and GF (p)
processors with comparable degrees of parallelism. For a GF (2m) processor, the
degree of parallelism is a function of the digit size of its multiplier. The digit size
is equal to eight for the GF (2m) processor considered in this section. The degree
of parallelism is a function of k for the GF (p) processors. For the GF (p) processor
considered in this section k is equal to eight.
Table 6.14 summarizes the logic complexity of the GF(p) processor and the
GF(2^m) processor with digit size equal to eight. The complexity results are normal-
ized with respect to m, where m is the degree of the irreducible polynomial supported
by the GF(2^m) processor and where m = ⌈log2 p⌉ for the GF(p) processor. m is
a measure of the number of bits required to represent field elements.
Table 6.14 also lists normalized complexity ratios established by dividing the
normalized number of logic resources used by the GF (p) processor by the normalized
number of logic resources used by the GF(2^m) processor with digit size equal to eight.
For example, the GF (p) processor uses 4.6 times the number of LUTs used by the
GF (2m) processor with digit size equal to eight, where the number of LUTs used
by each processor is normalized with respect to m before establishing the ratio.
As previously indicated, the LUT complexity is the most critical number used to
gauge the complexity of an elliptic curve processor because LUTs are the components
used to implement arithmetic functions.
Table 6.14: Complexity of GF(2^m) and GF(p) processors normalized with respect
to m
Logic element    GF(2^m) processor    GF(p) processor                  Complexity ratio:
                 (m = 167, D = 8)     (p = 2^192 − 2^64 − 1,           GF(p) processor /
                                      m = 192, k = 8)                  GF(2^m) processor
# LUT            12.8m                59.5m                            4.6
# FF             10.5m                29.9m                            2.8
# Block RAM      0.06m                0.18m                            3.0
Table 6.15 lists the approximate performance of the prototyped GF (2m) pro-
cessor with digit size equal to eight and the GF (p) processor for different field
operations. The number of clock cycles required by the GF (p) processor to perform
a multiplication is represented as m/5.1. The factor 5.1 represents what could be
considered as the effective digit size. The effective digit size is obtained by dividing
m by the number of clock cycles it takes to perform a multiplication. For the GF (p)
processor, m is equal to 192 and a multiplication requires 38 clock cycles; therefore,
the effective digit size for this prototype is 5.1 (≈ 192/38).
Table 6.15 also lists normalized performance ratios established by dividing the
normalized number of clock cycles required to perform an operation in the GF (p)
processor by the normalized number of clock cycles required to perform the same
operation in the GF (2m) processor with digit size equal to eight.
Of the operations listed in Table 6.15, modular multiplications and squares are
the most important operations to consider in the comparison of prototype imple-
mentations. In general, the processing time of additions, subtractions and squares
can be considered to be negligible for the GF (2m) processors. For the GF (p) pro-
cessor, the processing time of additions and subtractions can be considered to be
negligible. For this processor, the processing time for a multiplication is the same
as the processing time for a square operation. The processing time for multiplica-
tions can be used along with the results listed in Table 2.12 to approximate the
performance of the prototypes for different point multiplication algorithms.
Note that the processing cost of the inverse operation required when using projec-
tive or mixed coordinates is amortized over the entire point multiplication operation.
The low complexity of inversion in GF (2m) fields implies that the complexity of this
operation can be ignored when approximating the point multiplication performance
for the GF(2^m) processors. For the GF(p) processor, the complexity of an inverse
operation is also expected to be low, although not as low as for the GF(2^m)
processors; strictly accurate results would therefore have to account for it.
Because the inverse operation is expected to take only about 10% of the processing
time of an arbitrary point multiplication in the GF(p) processor, it is nevertheless
ignored in the comparisons presented in this section.
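The cost of a GF(p) inversion based on Fermat's Little Theorem can be gauged with a small sketch that computes a^(p−2) mod p by left-to-right square-and-multiply and counts the modular multiplications, with squarings counted as multiplications as assumed for the GF(p) processor. The choice of exponentiation method is an assumption, since the text does not state which one the processor uses, and the function name is hypothetical.

```python
P = 2**192 - 2**64 - 1


def fermat_inverse_with_count(a, p=P):
    """Return (a^(p-2) mod p, number of modular multiplications used)."""
    exponent_bits = bin(p - 2)[3:]      # drop '0b' and the leading 1 bit
    result = a % p                      # accounts for the leading 1 bit
    count = 0
    for bit in exponent_bits:
        result = (result * result) % p  # squaring, counted as a multiplication
        count += 1
        if bit == "1":
            result = (result * a) % p
            count += 1
    return result, count


inverse, multiplications = fermat_inverse_with_count(12345)
print((inverse * 12345) % P, multiplications)
```

Because p − 2 consists almost entirely of one bits for this prime, the count comes close to two multiplications per exponent bit, well above m = 192. An exponentiation organized this way costs more than m multiplications, which is small next to the total for a point multiplication but not negligible.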
Table 6.15: Approximate processing time for the computation of different field op-
erations in the GF(2^m) and GF(p) processors
Operation               GF(2^m) processor      GF(p) processor            Processing time ratio:
                        (m = 167, D = 8)       (p = 2^192 − 2^64 − 1,     GF(p) processor /
                        (in # clock cycles)    m = 192, k = 8)            GF(2^m) processor
                                               (in # clock cycles)
Multiplication          m/8                    m/5.1                      1.6
Square                  1                      m/5.1                      m/5.1
Addition/subtraction    1–2                    1–2 (CSA)                  1
Inverse (Fermat's       < (2⌈log2 m⌉)(m/8)     > m(m/5.1)                 > 0.78m/⌈log2 m⌉
Little Theorem)
If one establishes the ratio between the number of multiplications required to
compute an arbitrary point multiplication for an elliptic curve defined over a field
GF (p) and the number of multiplications required to compute an arbitrary point
multiplication for an elliptic curve defined over a field GF (2m) when using the
same algorithm from those listed in Table 2.13 and when using Jacobian or mixed
coordinates (Jacobian and affine coordinates), one finds that the computation of a
point multiplication for an elliptic curve defined over a field GF (p) requires 1.5 to
1.6 times the number of multiplications required to compute a point multiplication
for an elliptic curve defined over a field GF (2m). This ratio assumes that squares in
GF (2m) are computed in negligible time and that squares in GF (p) are computed
with multiplications. This ratio also assumes that the fields GF (2m) and GF (p)
contain about the same number of elements.
If one establishes the ratio between the number of multiplications required to
compute an arbitrary point multiplication for an elliptic curve defined over a field
GF (p) and the number of multiplications required to compute an arbitrary point
multiplication for an elliptic curve defined over a field GF (2m) when using the best
arbitrary point multiplication algorithms from those listed in Table 2.13 and when
using Jacobian or mixed coordinates (Jacobian and affine coordinates), one finds
that the computation of a point multiplication for an elliptic curve defined over
a field GF (p) requires over 2.2 times the number of multiplications required to
compute a point multiplication for an elliptic curve defined over a field GF (2m).
This ratio is based on the same assumptions listed in the previous paragraph.
Using the complexity and the performance numbers listed in Tables 6.14 and 6.15
together with the ratios developed in the previous paragraphs, one can establish the
time-area ratios listed in Table 6.16.
In Table 6.16, the column named “Common algorithms” defines ratios that as-
sume the use of the same algorithm, from among the algorithms listed in Table 2.13,
for point multiplication for curves defined over fields GF(p) and for curves defined
over fields GF(2^m). The column named “Best algorithms” lists ratios that assume
the use of the best algorithms for arbitrary point multiplication listed in Table
2.13: the width-w addition-subtraction algorithm for point multiplication in curves
defined over fields GF (p) and the Montgomery point multiplication algorithm for
point multiplication for curves defined over fields GF (2m). The rows that list time
in milliseconds (msecs) establish the processing time using the maximum clock fre-
quency of the prototypes. The maximum clock frequency for the GF (2m) processor
with digit size equal to eight is 74.5 MHz and the maximum clock frequency for the
GF (p) processor is 40 MHz.
In summary, assuming that the number of LUTs required by the GF (2m) pro-
cessor with digit size equal to eight is x, the GF (p) processor requires 4.6x LUTs.
The precomputations and the quotient resolution delay incurred by the multiplier
in the GF(p) processor lower its effective degree of parallelism to an effective digit
size of 5.1 while the multiplier of the GF (2m) processor with digit size equal to eight
achieves a degree of parallelism close to eight. The net effect is that if y is the num-
ber of clock cycles required to compute a multiplication in the GF (2m) processor
with digit size equal to eight, then the GF (p) processor requires 1.6y clock cycles
to compute a multiplication (1.6 ≈ 8/5.1).
When considering the use of the same arbitrary point multiplication algorithm
for curves defined over fields GF (p) and GF (2m), the algorithms used in the GF (p)
processor require 1.5 to 1.6 times more multiplications than the algorithms used in
the GF (2m) processor, when considering the use of Jacobian or mixed coordinates
that use a mix of points represented in affine and Jacobian coordinates and when
ignoring the processing time of squares in GF (2m).
When combining the effects of the number of clock cycles required for multi-
plications in the different processors and the number of multiplications required to
perform an arbitrary point multiplication, the net effect is that a point multipli-
cation in the GF (p) processor requires over 2.4 times the number of clock cycles
required to perform a point multiplication in the GF (2m) processor with digit size
equal to eight using the same algorithm (1.6 ∗ 1.5 = 2.4). When considering the
best arbitrary point multiplication algorithms in Table 2.13, a point multiplication
in the GF (p) processor requires 3.5 times the number of clocks required to per-
form a point multiplication in the GF (2m) processor with digit size equal to eight
(1.6 ∗ 2.2 ≈ 3.5).
When considering the maximum operational frequency of the processors (74.5 MHz
for the GF(2^m) processor with digit size equal to eight versus 40 MHz for the
GF(p) processor, a ratio of about 1.9), a point multiplication in the GF(p)
processor takes 4.6 times the time it takes to perform a point multiplication in
the GF(2^m) processor with digit size equal to eight using the same algorithm
(2.4 ∗ 1.9 ≈ 4.6). When considering the best arbitrary point
multiplication algorithms listed in Table 2.13, a point multiplication in the GF (p)
processor takes 6.7 times the time it takes to perform a point multiplication in the
GF (2m) processor with digit size equal to eight (3.5 ∗ 1.9 ≈ 6.7).
The frequency of operation of a particular implementation is influenced by factors
such as the routing tools used, the ability of a tool to use special features of the
FPGAs (e.g., fast carry logic), coding style used in design entry (e.g., VHDL coding
styles), and others. These are factors that in some cases are beyond the control of
the designer. Because a timing representation in terms of the number of clock cycles
it takes to do an operation is under the control of the designer, this work considers
this timing measure as a more reliable measure of time than timing measures that
are influenced by operational frequencies.
Table 6.16 summarizes time-area ratios for point multiplication for the GF(p)
processor and the GF (2m) processor with digit size equal to eight. Table 6.16
provides two time-area ratios: one based on time measurements in terms of clock
cycles and the other based on time measurements in terms of milliseconds. The
second measure takes into account the operational frequencies of the processors.
The time-area results show that for the arbitrary point multiplication algorithms
discussed here, the time-area product of the GF(p) processor is 11.0 to 16.1 times
the time-area product of the GF(2^m) processor with digit size equal to eight, for
comparable finite field and point multiplier sizes and when measuring time in terms
of clock cycles. When measuring time in terms of milliseconds, the time-area
product of the GF(p) processor is 21.2 to 30.8 times the time-area product of
the GF(2^m) processor with digit size equal to eight. The latter range is about
two times larger than the former because the clock frequency of the GF(2^m)
processor with digit size equal to eight is almost twice that of the GF(p)
processor.
Table 6.16: Ratio of time-area characteristics of prototypes
Characteristic                Ratio: GF(p) processor / GF(2^m) processor, D = 8
                              Common algorithms       Best algorithms
Time (# clocks)               2.4 (= 1.6 ∗ 1.5)       3.5 (≈ 1.6 ∗ 2.2)
Time (msecs)                  4.6 (≈ 2.4 ∗ 1.9)       6.7 (≈ 3.5 ∗ 1.9)
Time-area (# clock-# LUT)     11.0 (≈ 2.4 ∗ 4.6)      16.1 (= 3.5 ∗ 4.6)
Time-area (msecs-# LUT)       21.2 (≈ 4.6 ∗ 4.6)      30.8 (≈ 6.7 ∗ 4.6)
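The entries of Table 6.16 follow from four input ratios given in this section: the LUT ratio (4.6), the ratio of clock cycles per multiplication (1.6), the ratio of the number of multiplications (1.5 for common algorithms, 2.2 for best algorithms), and the clock frequency ratio (about 1.9, from 74.5 MHz and 40 MHz). The sketch below chains them in the same way, rounding each intermediate value to one decimal place as the table does. The function names are hypothetical, and decimal arithmetic is used only to make the rounding reproducible.

```python
from decimal import Decimal, ROUND_HALF_UP

LUT_RATIO = Decimal("4.6")          # Table 6.14
MULT_CYCLE_RATIO = Decimal("1.6")   # Table 6.15
CLOCK_RATIO = Decimal("1.9")        # approximately 74.5 MHz / 40 MHz
MULT_COUNT_RATIO = {"common": Decimal("1.5"), "best": Decimal("2.2")}


def round1(x):
    """Round to one decimal place, halves rounded up."""
    return x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def time_area_ratios(kind):
    """Ratios GF(p) processor / GF(2^m) processor (D = 8) for one column."""
    clocks = round1(MULT_CYCLE_RATIO * MULT_COUNT_RATIO[kind])
    msecs = round1(clocks * CLOCK_RATIO)
    return {
        "clocks": clocks,
        "msecs": msecs,
        "clock_area": round1(clocks * LUT_RATIO),
        "msec_area": round1(msecs * LUT_RATIO),
    }


for kind in ("common", "best"):
    print(kind, time_area_ratios(kind))
```

Because every later row multiplies an already rounded value, the table entries differ slightly from what an unrounded chain would give; for instance, 1.6 ∗ 1.5 ∗ 1.9 ∗ 4.6 is about 21.0 rather than 21.2.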
Chapter 7
Conclusions
7.1 Summary and Conclusions
This dissertation introduces elliptic curve processor architectures suitable for the
computation of point multiplications for curves defined over fields GF (2m) and fields
GF (p). These architectures follow the model shown in Figure 3.2. This model
follows the hierarchical view of the point multiplication algorithms shown in Figure
3.1. Each of the elliptic curve processor architectures incorporates a main controller,
an arithmetic unit controller, and an arithmetic unit.
The main controller orchestrates the point multiplication process and also inter-
acts with the host system. The arithmetic unit controller controls the processing of
the arithmetic unit. This controller is responsible for guiding the arithmetic unit
through the computation of the elliptic curve group operations and the coordinate
conversions. The arithmetic unit is the component that computes the field opera-
tions and the comparisons required in the computation of point multiplications.
The main controller and the arithmetic unit controller are programmable pro-
cessors. The programmability of these components allows them to incorporate new,
highly efficient point multiplication algorithms. This feature proved to be effec-
tive during the development of the elliptic curve processor architecture introduced
in [OP00a]. As the architecture of this processor was being developed, the Mont-
gomery point multiplication algorithm described in Section 2.7.1 was published.
Incorporating this algorithm in the processor required the reprogramming of the
main controller and the arithmetic unit controller, functions that did not require
reconfiguration of the prototyped processors. Incorporating this algorithm in the
prototyped processors increased the performance of the processors while maintaining
their hardware footprints.
The most important component of an elliptic curve processor is the arithmetic
unit. This dissertation presents multiple arithmetic unit architectures for GF (2m)
field arithmetic and one architecture for GF (p) arithmetic.
The arithmetic unit architectures for GF(2^m) provide a wide range of time-area
options for designers of elliptic curve processors. These options include arithmetic
units that incorporate digit-serial, bit-serial, and super-serial multipliers.
The arithmetic unit architectures for GF (2m) also provide designers multiple
options for the computation of squares. Squaring is a common operation in the
computation of elliptic curve point multiplications and it is also a common opera-
tion in the computation of field exponentiations. As described in Section 4.9, the
square operation can be considered to be of linear complexity for standard basis
representation of GF (2m) field elements and the field polynomials specified in stan-
dards that recommend the use of elliptic curve cryptography. The low complexity
of a square operation makes possible the use of parallel squarers in the GF (2m)
arithmetic units.
As an alternative to a parallel squarer architecture, whose structure is a function
of the field polynomial used to define a field, this work introduced a new squaring ar-
chitecture. The new squaring architecture is based on the observation that a square
operation in GF (2m) can be transformed into a multiplication of a special form and
a sum. The multiplication and the sum can be efficiently computed with LSB-SSM,
LSB, and LSD multipliers. The new squaring architecture is regular and
independent of the field polynomial used to define a field. The regularity
of the squaring architecture and of the LSB-SSM, LSB, and LSD multipliers makes
these architectural options attractive for use in non-reconfigurable hardware.
The use of the new squaring architecture together with an LSB-SSM multiplier
makes possible the construction of elliptic curve processors of low complexity and
reasonable performance, options that are desirable in area and power constrained
environments.
The work presented here concentrated on the development of a high performance
elliptic curve processor for the computation of point multiplications for curves de-
fined over fields GF (p). The architecture presented here is based on a new Mont-
gomery multiplier architecture. This new Montgomery multiplier architecture is
scalable, a feature that allows designers to target the architecture according to their
performance and cost goals.
High-performance elliptic curve processors were prototyped in FPGA technology.
The prototypes included multiple elliptic curve processors suitable for the compu-
tation of point multiplications for curves defined over fields GF (2m). The GF (2m)
processors use LSD multipliers with different digit sizes and they incorporated par-
allel squarers. One prototype suitable for the computation of point multiplications
for elliptic curves defined over fields GF (p) was also developed.
The prototypes validated the complexity estimates provided in this document
for FPGA technology. Moreover, the complexity estimates for high-performance
processors demonstrated that about 90% of the LUTs used by the processors were
devoted to their arithmetic units (LUTs are the main components used to implement
arithmetic functions).
This work compared prototyped implementations of elliptic curve processors
suitable for the computation of point multiplications for curves defined over fields
GF (2m) and curves defined over fields GF (p). The comparisons were restricted to
processors with comparable degrees of parallelism. The prototyped GF (p) processor
required 4.6 times the number of LUTs required by the GF (2m) processor of com-
parable degree of parallelism. The larger complexity of the GF (p) processor with
respect to the GF (2m) processor is driven by the higher complexity of GF (p) ad-
dition using carry-save addition with respect to GF (2m) addition. The prototyped
GF (p) processor used, mainly, numbers represented in binary stored-carry number
representation (a redundant number representation).
The processing time of a processor is a function of the implemented algorithm,
the point representations, and the frequency of operation of the processor. The
study of the prototypes in Section 6.2.5 reveals that the computation of a point
multiplication using one of the algorithms listed in Table 2.13 for a curve defined over
a field GF (p) requires 2.4 to 3.5 times the number of clock cycles required to compute
a point multiplication for a curve defined over a field GF (2m). When considering the
frequency of operation of the prototypes of comparable degrees of parallelism, the
study in Section 6.2.5 reveals that the computation of a point multiplication using
one of the algorithms listed in Table 2.13 for a curve defined over a field GF (p)
requires 4.6 to 6.7 times the time required to compute a point multiplication for a
curve defined over a field GF (2m).
When combining the area and the processing time listed above, the study in
Section 6.2.5 reveals that when using clock cycles as the unit of time, the time-area
product obtained for the computation of a point multiplication using the prototyped
GF (p) processor is 11.0 to 16.1 times the time-area product obtained for the compu-
tation of a point multiplication in the prototyped GF (2m) processor of comparable
degree of parallelism. When accounting for the operational clock frequency of the
prototypes, the time-area product obtained for the computation of a point multi-
plication using the prototyped GF (p) processor is 21.2 to 30.8 times the time-area
product obtained for the computation of a point multiplication in the prototyped
GF (2m) processor of comparable degree of parallelism.
The time-area products defined in the previous paragraph suggest the use of spe-
cialized arithmetic unit architectures for the computation of point multiplications.
In summary, this dissertation defines elliptic curve processor architectures suit-
able for the computation of point multiplications for curves defined over fields
GF (2m) and curves defined over fields GF (p). These architectures are well suited
for implementation in modern FPGAs, as demonstrated by the prototyped implemen-
tations: the fastest prototyped GF(2^m) processor can compute an arbitrary point
multiplication for curves defined over the field GF(2^167) in 0.21 milliseconds and the
prototyped processor for the field GF(2^192 − 2^64 − 1) is capable of computing a point
multiplication in about 3.6 milliseconds. The programmability of the processors
allows them to incorporate new point multiplication algorithms without the need
for reconfiguration. The wide range of time-area options presented for the different
processors allows designers to tailor the presented architectures according to their
cost-performance goals. The optimization of the processor architectures for FPGA
technology allows implementations to evolve with advancements in FPGA technol-
ogy; for example, as FPGAs become faster, processor implementations ported to
faster FPGAs will achieve higher performance. In addition, as the densities of
FPGAs increase, the ability to develop faster processors increases. Finally, the
reconfigurability of FPGAs allows designers to use optimized point multiplication
processors for different finite fields.
7.2 Recommendations for Further Research
This work introduces processor architectures for the computation of point
multiplication for curves defined over fields GF (p) and for curves defined over fields GF (2^m).
These processor architectures are optimized for implementation in modern FPGAs.
In the development of the architectures presented here, complexity and perfor-
mance estimates were specified for different design options. Prototype implementa-
tions of the processors validated the complexity estimates, particularly for the use
of LUTs, which are the main components used to implement arithmetic functions.
The critical path delay estimates for FPGA implementations approximated the
critical path delays of circuits in terms of LUT delays. For these estimates, the
delays of flip-flops and memory elements were ignored.
In FPGA implementations, the delays of the storage elements are not negligible
nor are the routing delays. The architectures presented here make provisions for
the control of critical path delays using pipelining techniques. Estimating delays on
FPGAs is not a simple task because different FPGA families use different routing
schemes and because the routing delays of FPGA implementations are a function of
how the circuits are laid out in the FPGAs. Improving the work presented here with
more accurate critical path delay estimates will be of great benefit to implementers.
Some of the GF (2^m) processor architectures presented here use specialized
multipliers and squarers for each finite field. This is not particularly limiting for many
implementations, which can exploit the reconfigurability of FPGAs to instantiate optimized
solutions. Exploring the ability to dynamically reconfigure FPGAs with different
optimized elliptic curve processors will be of great research interest. For example,
such a study could determine the time it takes to switch from one optimized processor
option to another.
The work presented here does not address security concerns such as the provision
of countermeasures against timing and power attacks. The programmability of the
processors presented here allows implementations to use point multiplication
algorithms whose processing time is independent of the point multiplier. An example
of such an algorithm is the Montgomery point multiplication algorithm described in
Section 2.7.1. The use of this type of algorithm limits the effectiveness of timing
attacks. Guarding against power attacks is especially important when attackers are
capable of mounting sophisticated attacks against cryptographic devices. Extend-
ing the work presented here to include countermeasures against timing and power
attacks will be of great practical interest.
Appendix A
Hardware Implementation Models
A.1 Introduction
This work introduces elliptic curve processor architectures targeted for hardware
implementations in programmable hardware. The complexity and the performance
of a processor are a function of the complexity and the performance of the arithmetic
unit used by the processor. To study the complexity and the performance of the
arithmetic units, this work studies their implementation using two-input gates and
their implementation in FPGA logic.
FPGAs are programmable hardware devices. Their programmability allows the
instantiation of different circuits on a common hardware platform. The basic logic
elements common to the majority of FPGA devices are lookup tables (LUTs), flip-
flops (FFs), and, for modern devices, memory elements. The LUTs are used to
implement Boolean functions of their inputs; that is, they are used to implement
functions that are traditionally implemented with logic gates. The programmability
of FPGA devices extends to their logic interconnect, or routing. In general, FPGA
devices favor localized, neighbor-to-neighbor routing. These devices also offer global,
high fan-out routing resources that can be used to route control signals. Additional
information about FPGA logic can be found in [Xil00, Alt01, Luc99, Act01].
A.1.1 Two-Input Gate Implementations
For implementations that use two-input gates as logic elements, this work estimates
the logic complexity of a circuit in terms of the number of two-input gates, the
number of flip-flops, and the number of storage bits needed to implement the in-
tended circuit. The gate complexity is measured in terms of the number of logic
gates required for an implementation and in terms of the number of generic gates
needed for an implementation. Logic gates refer to the traditional two-input gates,
such as AND gates, OR gates, etc. A generic gate is a gate that can implement any
Boolean function of its two inputs, which is a concept similar to a lookup table of
two inputs. Generic gates are used here to gauge the number of gates required for
an implementation irrespective of their type.
The performance of a circuit is specified in terms of the circuit’s critical path
delay. A circuit’s critical path delay, Tcp, corresponds to the longest combinatorial
delay of the circuit. The critical path delay of a circuit defines the maximum clock
frequency at which it can operate, F = 1/Tcp, where F represents the maximum
clock frequency. Here, the critical path delay of a circuit is estimated in terms of
two-input gate delays.
As for the complexity estimates, the critical path delay is measured in terms
of traditional logic gate delays and in terms of generic gate delays. The critical
path delay estimates assume ideal flip-flops and memory elements; that is, they assume
negligible logic delay for flip-flops and memory elements.
A.1.2 FPGA Implementations
For implementations in FPGA logic, this work estimates the logic complexity of a
circuit in terms of the number of LUTs, the number of flip-flops, and the number of
storage bits needed to implement the intended circuit. The performance of a circuit
is specified in terms of the circuit’s critical path delay. Here, the critical path delay
of a circuit’s implementation in FPGA logic is estimated in terms of LUT delays.
These estimates assume ideal flip-flops and memory elements. These estimates also
assume negligible routing delays.
Note that a circuit’s critical path delay estimate is used here as a measure of
the relative performance of a circuit. This measure is used here to compare the
performance of different circuits and it is also used as a guide for the improvement
of a circuit’s performance using pipelining techniques.
Here, a critical path delay estimate is not assumed to be a measure of the absolute
performance of a circuit. In actual implementations, the combinatorial delays of
flip-flops and memory elements are not negligible. The routing delays, especially
for FPGA implementations, are also not negligible. To obtain accurate complexity
and critical path delay results, this work recommends prototyping circuits in the
targeted technology.
For implementations that use two-input gates as logic elements, estimating the
complexity of a circuit is a straightforward process. Estimating the critical path
delay of a circuit is also a straightforward process. Estimating the logic
complexity and the critical path delay of an FPGA implementation is more difficult
than estimating the logic complexity and the critical path delay of a two-input gate
implementation.
To simplify the complexity and the critical path delay estimation, this work
estimates the complexities and the critical path delays of the circuits that are used
as building blocks in the construction of the arithmetic circuits studied here. Two
basic sets of circuits are identified here. These are the basic circuits, which are
the most primitive circuits, and the composite circuits, which use basic circuits as
building blocks.
When describing the performance of a circuit, this work also specifies its latency
and throughput. Latency is a measure of the time it takes a circuit to generate
an output given the assertion of its inputs; for example, the time it takes a multiplier
to generate a product after the operands are provided to it. Throughput refers to
the rate at which a circuit can generate results. For the multiplier example, it refers
to the rate at which a multiplier can generate back-to-back products.
In the estimation of logic complexity and performance, a basic set of terms is
frequently used. The basic terms are defined in Tables A.1 and A.2.
The following sections define the complexities and the critical path delays of
the basic and composite circuits for implementations that use two-input gates and
implementations that use FPGA logic elements.
A.2 Logic Complexity and Critical Path Delay for
Implementations that Use Two-Input Gates
as Logic Elements
Tables A.3 and A.4 list, respectively, the complexity and the critical path delay of
the basic set of circuits used to implement the arithmetic circuits studied in this
work. The data in these tables correspond to implementations that use two-input
gates as logic elements. These tables make use of the terms defined in Tables A.1
and A.2.
Table A.1: Frequently used terms in complexity and timing estimates
Symbol Description
CPA Carry-propagate adder.
CSA Carry-save adder.
CSAT Carry-save adder tree.
F Maximum clock frequency.
FA Full adder.
FF Flip-flop.
GF2A GF (2) adder.
GF2M GF (2) multiplier.
GG Generic gate.
HA Half adder.
L Number of inputs of a LUT.
MUX 2:1 multiplexer.
MUXC 2:1 multiplexer cell.
RA Ripple-carry adder.
SR Shift register.
SRC Shift register cell.
T♢ Propagation delay of an unidentified gate.
♢ is a placeholder for the gate type.
T3:2 CSA Propagation delay of 3:2 CSA.
T4:2 CSA Propagation delay of 4:2 CSA.
TA Propagation delay of an AND gate.
TC Two’s complement circuit.
TCC Two’s complement cell.
TCZ Two’s complement with zero circuit.
TCZC Two’s complement with zero circuit cell.
Tcp Critical path delay.
TCPA,k Propagation delay of a CPA adder tree of k inputs.
TCSAT,n Propagation delay of a CSA adder tree of n inputs.
TFA Propagation delay of a full adder.
TFA c Propagation delay of a full adder’s carry output.
TFA s Propagation delay of a full adder’s sum output.
TG Propagation delay of a generic gate.
TGF2A Propagation delay of a GF (2) adder.
TGF2M Propagation delay of a GF (2) multiplier.
THA Propagation delay of a half adder.
Table A.2: Frequently used terms in complexity and timing estimates (cont.)
Symbol Description
THA c Propagation delay of a half adder’s carry output.
THA s Propagation delay of a half adder’s sum output.
TI,k Propagation delay of an increment adder of k inputs.
TL Propagation delay of a LUT.
TMUX Propagation delay of a 2:1 multiplexer.
TMUXC Propagation delay of a 2:1 multiplexer cell.
TO Propagation delay of an OR gate.
TRA,k Propagation delay of a ripple-carry adder of k inputs.
TX Propagation delay of an XOR gate.
TSR Propagation delay of a shift register.
TSRC Propagation delay of a shift register cell.
TTC Propagation delay of a two’s complement circuit.
TTCC Propagation delay of a two’s complement circuit cell.
TTCZ Propagation delay of a two’s complement with zero circuit.
TTCZC Propagation delay of a two’s complement with zero circuit cell.
When comparing the delay of different circuit paths, the delay of an XOR gate
is assumed to be TA + TO, where the TA is considered to be equal to TO. This
approximation is based on the implementation of an XOR gate using two AND
gates and an OR gate (c = (a AND b¯) OR (a¯ AND b)). Once it is determined
that the delays of XOR gates need to be included in the critical path delay measure,
the XOR delays are represented using TX .
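The XOR decomposition used for this approximation can be checked exhaustively. The following Python fragment is a minimal sketch rather than part of the original analysis; the function name and the unit delay value T_A = T_O = 1 are assumptions made only for illustration.

```python
from itertools import product


def xor_from_and_or(a: int, b: int) -> int:
    """One-bit XOR built from two AND gates and one OR gate.

    The complements (1 - a) and (1 - b) are treated as free, as in the text.
    """
    return (a & (1 - b)) | ((1 - a) & b)


# Exhaustive check over the four one-bit input combinations.
XOR_MATCHES = all(
    xor_from_and_or(a, b) == (a ^ b) for a, b in product((0, 1), repeat=2)
)

# Delay used when comparing circuit paths, in units where T_A = T_O = 1.
# The unit value is an assumption made only for this illustration.
T_A = 1
T_O = 1
T_XOR_FOR_COMPARISON = T_A + T_O

if __name__ == "__main__":
    print(XOR_MATCHES, T_XOR_FOR_COMPARISON)
```

Under this convention an XOR gate counts as two unit delays when paths are compared, which is why a path of two XOR gates is treated as longer than a path of one AND gate and two OR gates.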
When determining logic complexity and critical path delays, it is assumed that
the complement of a signal is accomplished with zero logic gates because many
architectures readily support the generation of complements or provide gates that
accept complemented inputs. (The complement of a is NOT a (a¯ = NOT a); for
example, if a is equal to one, its complement is equal to zero.)
The complexity of a 2:1 multiplexer is estimated as three gates, even though
this circuit requires four Boolean operations. As indicated previously, the NOT
operation is assumed to require zero logic gates.
Tables A.3 and A.4 specify the complexities and the critical path delays of two
two’s complement circuits. The two’s complement cell circuit complements the in-
put bit if the cmpl signal is asserted. The two’s complement with zero cell circuit
performs the function performed by the two’s complement cell circuit, and, in ad-
dition, it generates a zero output if the zero signal is asserted. Figure A.12 shows a
block diagram of the two’s complement with zero cell circuit.
Tables A.3 and A.4 specify the complexity and critical path delay of a shift
register cell. This cell accepts a parallel input IL and a shift input IS. This circuit
latches and outputs one of its inputs according to the state of the load signal, as
shown in Figure A.13.
Two critical path delay measures are provided for full adders and for half adders.
One of the measures corresponds to the carry output and the other to the sum
output. Full adders and half adders are used as building blocks in large adders. For
some adder architectures and gate types, the critical path delay is dominated by the
carry output delay and for others by the sum output delay.
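The Boolean representations of the basic building blocks can be exercised directly. The sketch below models the 2:1 multiplexer, the half adder, and the full adder with the Boolean expressions listed in Table A.3 and checks them against ordinary integer arithmetic. The function names are hypothetical and are introduced only for this illustration.

```python
from itertools import product


def mux2(s: int, a: int, b: int) -> int:
    """2:1 multiplexer: c = (s AND a) OR ((NOT s) AND b)."""
    return (s & a) | ((1 - s) & b)


def half_adder(x: int, y: int) -> tuple[int, int]:
    """Return (sum, carry) with s = x XOR y and c = x AND y."""
    return x ^ y, x & y


def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry) with s = x XOR y XOR z and the majority carry."""
    s = x ^ y ^ z
    c = (x & y) | (y & z) | (x & z)
    return s, c


def check_blocks() -> bool:
    """Compare every block with its arithmetic meaning for all inputs."""
    for x, y, z in product((0, 1), repeat=3):
        s, c = full_adder(x, y, z)
        if 2 * c + s != x + y + z:
            return False
        if mux2(x, y, z) != (y if x else z):
            return False
    for x, y in product((0, 1), repeat=2):
        s, c = half_adder(x, y)
        if 2 * c + s != x + y:
            return False
    return True


if __name__ == "__main__":
    print(check_blocks())
```

The carry and sum outputs are returned separately because, as noted above, they lie on different paths and can have different delays.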
Tables A.5 and A.6 list, respectively, the complexity and the critical path delay
of the composite set of circuits used to implement the arithmetic circuits studied in
this work. The data in these tables correspond to implementations that use two-
input gates as logic elements. These tables make use of the terms defined in Tables
A.1 and A.2.
Figure A.1 shows the block diagram of a two’s complement with zero circuit. This
circuit contains m identical two’s complement with zero cells. When generating a
two’s complement, the two’s complement with zero circuit complements all the bits
of the input number and sets a carry bit to one. Figure A.1 shows how the carry bit
is generated from the cmpl control signal. The two’s complement circuit operation
is similar to that of the two’s complement with zero circuit operation but it lacks
Table A.3: Complexity of basic building blocks for implementations that use two-input gates as logic elements
Building block            Boolean representation                    # Gates               # GG  # FF
GF (2) adder              c = a XOR b                               1 XOR                 1     0
GF (2) mult.              c = a AND b                               1 AND                 1     0
2:1 mux                   c = (s AND a) OR ((NOT s) AND b)          2 AND + 1 OR          3     0
two’s cmpl. cell          O = cmpl XOR I                            1 XOR                 1     0
two’s cmpl. w. zero cell  O = (NOT zero) AND (cmpl XOR I)           1 AND + 1 XOR         2     0
full adder                s = x XOR y XOR z                         3 AND + 2 OR + 2 XOR  7     0
                          c = (x AND y) OR (y AND z) OR (x AND z)
half adder                s = x XOR y                               1 AND + 1 XOR         2     0
                          c = x AND y
shift reg. cell           s = (load AND IL) OR ((NOT load) AND IS)  2 AND + 1 OR          3     1
Table A.4: Critical path delay of basic building blocks for implementations that use two-input gates as logic elements
Building block              Symbol    Critical path delay for logic gates   Critical path delay for generic gates (in TG)
GF (2) adder                TGF2A     TX                                    1
GF (2) mult.                TGF2M     TA                                    1
2:1 mux                     TMUXC     TA + TO                               2
two’s cmpl. cell            TTCC      TX                                    1
two’s cmpl. w. zero cell    TTCZC     TA + TX                               2
full adder’s carry output   TFA c     TA + 2TO                              3
full adder’s sum output     TFA s     2TX                                   2
half adder’s carry output   THA c     TA                                    1
half adder’s sum output     THA s     TX                                    1
shift reg. cell             TSRC      TA + TO                               2
Table A.5: Complexity of composite building blocks for implementations that use two-input gates as logic elements
Building block                  Primitive comp. complexity  # Gates                                       # GG       # FF
register, m-bit op.             m FF                        0                                             0          m
2:1 mux, m-bit op.              m MUXC                      2m AND + m OR                                 3m         0
two’s cmpl., m-bit op.          m TCC                       m XOR                                         m          0
two’s cmpl. w. zero, m-bit op.  m TCZC                      m AND + m XOR                                 2m         0
shift reg., m-bit op.           m SRC                       2m AND + m OR                                 3m         m
binary tree                     (n − 1) ♢ gates             (n − 1) ♢ gates                               n − 1      0
GF (2) mult/add tree            N/A                         n AND + (n − 1) XOR                           2n − 1     0
ripple-carry adder, k-bit op.   k FA                        3k AND + 2k OR + 2k XOR                       7k         0
incr. adder, k-bit op.          k HA                        k AND + k XOR                                 2k         0
3:2 CSA, m-bit op.              m FA                        3m AND + 2m OR + 2m XOR                       7m         0
4:2 CSA, m-bit op.              2 * 3:2 CSA                 6m AND + 4m OR + 4m XOR                       14m        0
CSA adder tree, m-bit op.       (n − 2) * 3:2 CSA           3m(n − 2) AND + 2m(n − 2) OR + 2m(n − 2) XOR  7m(n − 2)  0
Table A.6: Critical path delay of composite building blocks for implementations using two-input gates as logic elements
Building block        Symbol    Critical path delay for logic gates   Critical path delay for generic gates (in TG)
register              N/A       0                                     0
2:1 mux               TMUX      TA + TO                               2
two’s cmpl.           TTC       TX                                    1
two’s cmpl. w. zero   TTCZ      TA + TX                               2
shift reg.            TSR       TA + TO                               2
binary tree           N/A       ⌈log_2 n⌉ T♢                          ⌈log_2 n⌉
GF (2) mult/add tree  N/A       TA + ⌈log_2 n⌉ TX                     ⌈log_2 n⌉ + 1
ripple-carry adder    TRA,k     k(TA + 2TO)                           3k
incr. adder           TI,k      k TA                                  k
3:2 CSA               T3:2 CSA  2TX (TFA s)                           3 (TFA c)
4:2 CSA               T4:2 CSA  4TX (2TFA s)                          6 (2TFA c)
CSA adder tree        TCSAT,n   ≈ ⌈log_{3/2} (n/2)⌉ T3:2 CSA          ≈ ⌈log_{3/2} (n/2)⌉ T3:2 CSA
                                ≈ 2⌈log_{3/2} (n/2)⌉ TX               ≈ 3⌈log_{3/2} (n/2)⌉
the capability to generate a zero output. This circuit does not incorporate the zero
circuit in its two’s complement cell.
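The behavior of the two’s complement with zero circuit can be sketched in software. The fragment below applies the cell equation O = (NOT zero) AND (cmpl XOR I) to each of the m input bits and returns the carry bit separately, so that a later adder can complete the two’s complement. The function names are hypothetical, and the choice carry = cmpl (without modeling what happens to the carry when zero is asserted) is an assumption of this sketch, not a statement about Figure A.1.

```python
def tcz_cell(i_bit: int, cmpl: int, zero: int) -> int:
    """Two's complement with zero cell: O = (NOT zero) AND (cmpl XOR I)."""
    return (1 - zero) & (cmpl ^ i_bit)


def tcz_circuit(value: int, m: int, cmpl: int, zero: int) -> tuple[int, int]:
    """Apply m identical cells to an m-bit value.

    Returns (output, carry). The carry is assumed here to equal cmpl; it must
    be added to the output by a later adder to finish the two's complement.
    """
    out = 0
    for j in range(m):
        out |= tcz_cell((value >> j) & 1, cmpl, zero) << j
    carry = cmpl  # assumption of this sketch
    return out, carry


if __name__ == "__main__":
    out, carry = tcz_circuit(0b0101, 4, cmpl=1, zero=0)
    # Adding the carry completes the negation modulo 2^4.
    print(format(out, "04b"), carry, (out + carry) % 16)
```

Setting zero to one forces every output bit to zero regardless of the input, which is the extra capability that distinguishes this circuit from the plain two’s complement circuit.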
Figure A.2 shows the block diagram of a parallel-in/serial-out shift register.
This shift register contains m identical shift register cells. The circuit in Figure A.2
outputs one bit per clock cycle. By rearranging the connections, the circuit can be
modified so it outputs D bits per clock cycle.
Figure A.1: Two’s complement with zero circuit [diagram not reproduced; labels: m-bit input I, m-bit output O, cmpl, zero, not, I_0 ... I_{m−1}, O_0 ... O_{m−1}, carry]
Figure A.2: Shift register circuit [diagram not reproduced; labels: load, FF, IL_0, IS_0, IL_1, IL_m, O_m]
Figure A.3 shows two ways of implementing the GF (2) addition of multiple
inputs. This figure shows a binary tree architecture and a ripple adder architecture.
From this figure, one can verify that both implementations use n− 1 GF (2) adders,
where n represents the number of inputs to be added. From this figure, one can also
verify that the critical path delay of the binary tree architecture is ⌈log_2 n⌉ TX. In
contrast, the critical path delay of a ripple adder architecture is (n − 1) TX.
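The gate counts and depths of the two architectures can be reproduced with a short simulation. The following Python sketch adds n bits over GF (2) once with a chain of XOR gates and once with a binary tree, counting the gates used and the number of gate levels on the longest path. The function names are hypothetical, and the tree passes an unpaired input through to the next level, which is one possible construction.

```python
def xor_ripple(bits: list[int]) -> tuple[int, int, int]:
    """Chain of XOR gates. Returns (result, gate_count, depth)."""
    acc, gates = bits[0], 0
    for b in bits[1:]:
        acc ^= b
        gates += 1
    # Every gate of the chain lies on the longest path.
    return acc, gates, gates


def xor_tree(bits: list[int]) -> tuple[int, int, int]:
    """Binary tree of XOR gates. Returns (result, gate_count, depth)."""
    level, gates, depth = list(bits), 0, 0
    while len(level) > 1:
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        gates += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])  # unpaired input passes to the next level
        level = nxt
        depth += 1
    return level[0], gates, depth


if __name__ == "__main__":
    sample = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1]  # n = 11 inputs
    print(xor_ripple(sample))
    print(xor_tree(sample))
```

Both functions use n − 1 gates; only the depth differs, growing linearly for the chain and logarithmically for the tree.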
Tables A.5 and A.6 specify the characteristics of binary tree structures in generic
terms. The binary tree can be specialized by assigning a function and a gate type
to the symbol ♢. In terms of the notation in Tables A.5 and A.6, ♢ represents a
GF (2) adder, or an XOR gate, for implementations of the circuit shown in Figure
A.3.
Binary tree architectures are not limited to trees where all the nodes perform the
same function. This work uses what is referred to here as GF (2) mult/add trees.
The leaves of this type of tree compute GF (2) multiplications and the internal nodes
of the tree compute GF (2) additions. GF (2) mult/add trees can be visualized as
composed of a set of AND gates that compute the GF (2) multiplications, whose
outputs feed a binary GF (2) adder tree. Figure A.15 shows an implementation of
a GF (2) mult/add tree using FPGA logic.
Figure A.3: Binary tree and ripple adder architectures [diagram not reproduced]
This work assumes that carry-propagate adders are implemented using ripple-
carry adders. The main reasons for this assumption are that the carry-propagate
adders are likely to be small and that many families of FPGA logic incorporate
fast carry logic in their parts. Fast carry logic uses dedicated logic that samples the
inputs of an adder and forwards carries without incurring long combinatorial delays.
Examples of parts that incorporate fast carry logic can be found in [Xil00, Alt01,
Luc99].
Figure A.4 shows a block diagram of a k-bit ripple-carry adder. The input
operands to the adder are (x_{k−1} ... x_0)_2, (y_{k−1} ... y_0)_2, and c_i, where c_i
represents the input carry. The output of the adder is (s_k s_{k−1} ... s_0)_2. In the
figure, c_o represents the output carry, which is also the most significant bit of the
sum, s_k. From Figure A.4, one can see that the critical path delay of this adder is
dominated by the propagation of carries through the k full adders, which corresponds to
the delay listed in Table A.6. The critical path delay specified in this table represents
the worst-case scenario for carry-propagate adders. These estimates do not assume the
use of fast carry logic.
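A bit-level model makes the carry chain explicit. The sketch below chains k full adders as described above and evaluates the worst-case delay expression k(TA + 2TO) listed in Table A.6. The function names and the default unit delays are assumptions made only for this illustration.

```python
def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry) of three one-bit inputs."""
    return x ^ y ^ z, (x & y) | (y & z) | (x & z)


def ripple_carry_add(x: int, y: int, k: int, ci: int = 0) -> int:
    """Add two k-bit operands and an input carry with k chained full adders.

    The output carry c_o becomes bit s_k of the (k+1)-bit result.
    """
    carry, result = ci, 0
    for j in range(k):
        s, carry = full_adder((x >> j) & 1, (y >> j) & 1, carry)
        result |= s << j
    return result | (carry << k)


def ripple_delay(k: int, t_a: int = 1, t_o: int = 1) -> int:
    """Worst-case delay k(T_A + 2 T_O); unit delays are assumed values."""
    return k * (t_a + 2 * t_o)


if __name__ == "__main__":
    print(ripple_carry_add(0b1011, 0b0110, 4, ci=1))
    print(ripple_delay(8))
```

With unit gate delays the estimate for an 8-bit adder is 24 gate delays, which matches the 3k generic gate delays listed in Table A.6.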
xk-1yk-1
sk-2sk-1 co= sk
FAFA
x0y0
FA
x1y1
FA
s0s1
ci
xk-2yk-2
Figure A.4: Ripple-carry adder
Figure A.5 shows a block diagram of a k-bit increment adder. The input
operands to the adder are (x_{k−1} ... x_0)_2 and c_i, where c_i represents the input carry.
The output of the adder is (s_k s_{k−1} ... s_0)_2. In the figure, c_o represents the output
carry, which is also the most significant bit of the sum, s_k. From Figure A.5, one can
see that the critical path delay of this adder is dominated by the propagation
of carries through the k half adders, which corresponds to the delay listed in Table
A.6. As for the ripple-carry adder, the critical path delay estimates do not assume
the use of fast carry logic.
sk-2sk-1 co= sk
HAHA HAHA
s0s1
ci
xk-1 x0x1xk-2
Figure A.5: Increment adder
Figure A.6 shows a block diagram of a 3:2 CSA adder. This adder adds three
m-bit operands. The inputs to the adder are (x_{m−1} ... x_0)_2, (y_{m−1} ... y_0)_2,
(z_{m−1} ... z_0)_2, and c_i. The outputs of the adder are (s_{m−1} ... s_0)_2 and
(c_m ... c_0)_2. In this figure, the input carry, c_i, is placed in the output bit c_0. For
two’s complement addition, this work assumes that the input operands are sign extended so
that the results can be represented by (s_{m−1} ... s_0)_2 and (c_{m−1} ... c_0)_2 (the
output c_m is ignored).
From Figure A.6, one can appreciate that the critical path delay of the 3:2 CSA
adder is dominated by the longest path through a full adder. When considering
logic gate implementations, the critical path delay is dominated by the full adder’s
sum output path, whose delay is 2TX . When considering generic gate delays, the
critical path delay is dominated by the full adder’s carry output path, whose critical
path delay is 3TG.
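The carry-save principle can be verified numerically. The sketch below models a 3:2 CSA as m independent full adders, places the input carry in bit c_0 of the carry vector as described for Figure A.6, and checks that the two output vectors add up to the sum of the inputs. It treats the operands as unsigned integers and keeps bit c_m; the function name is hypothetical.

```python
def csa_3_2(x: int, y: int, z: int, m: int, ci: int = 0) -> tuple[int, int]:
    """3:2 carry-save adder built from m independent full adders.

    Returns (S, C) with x + y + z + ci == S + C for unsigned m-bit inputs.
    The carry of bit position j has weight 2^(j+1); ci occupies bit c_0.
    """
    s = c = 0
    for j in range(m):
        xb, yb, zb = (x >> j) & 1, (y >> j) & 1, (z >> j) & 1
        s |= (xb ^ yb ^ zb) << j
        c |= ((xb & yb) | (yb & zb) | (xb & zb)) << (j + 1)
    return s, c | ci


if __name__ == "__main__":
    s, c = csa_3_2(0b1011, 0b0110, 0b1101, 4, ci=1)
    print(format(s, "04b"), format(c, "05b"), s + c)
```

Because no carry propagates between bit positions, the delay of the adder is that of a single full adder regardless of m, which is the property reflected in the critical path delays above.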
x0y0z0
FA
x1y1z1
FA
x2y2z2
FA
c0s0c1s1c2s2c3
xm-1ym-1zm-1
FA
sm-1cm
ci
Figure A.6: 3:2 carry-save adder
Figure A.7 shows a block diagram of a 4:2 CSA adder. This adder adds four m-bit
operands. The inputs to the adder are W, X, Y, Z, c_{i1}, and c_{i2}. The outputs C and
S are assumed to be m-bit numbers, under the assumption that the inputs are sign
extended so that the results can be represented by (s_{m−1} ... s_0)_2 and (c_{m−1} ... c_0)_2.
From Figure A.7, one can appreciate that the complexity and the critical path delay
of a 4:2 CSA adder are twice those of a 3:2 CSA adder.
Figure A.7: 4:2 carry-save adder [diagram not reproduced; labels: two 3:2 CSA blocks, inputs W, X, Y, Z, c_{i1}, c_{i2}, outputs C and S]
CSA adder trees can be built using 3:2 CSA adders as building blocks. The
construction of an n-operand adder requires (n − 2) 3:2 CSA adders [Kor93]. Each
3:2 CSA can consume one input carry; consequently, the adder tree can consume
n − 2 input carries.
Here, the height of a CSA tree is approximated as ⌈log_{3/2} (n/2)⌉ levels, which
is an approximation of the height of an n-input Wallace tree [Par99]. The height
of a tree defines its critical path delay. For a given height, the critical path delay
of an n-input CSA tree can be approximated as ⌈log_{3/2} (n/2)⌉ T3:2 CSA. This is the
measure specified in Table A.6. Additional information on CSA trees can be found
in [Par99].
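The adder count and the height approximation can be compared against a level-by-level reduction. The sketch below repeatedly groups the remaining operands in threes, as in a Wallace-style tree, and counts the levels and the 3:2 CSA adders used; it also evaluates the approximation ⌈log_{3/2} (n/2)⌉. The function names are hypothetical, and the grouping rule is one common construction, assumed here for illustration.

```python
import math


def csa_reduction(n: int) -> tuple[int, int]:
    """Reduce n operands to 2 with 3:2 CSAs, grouping in threes per level.

    Returns (levels, csa_count).
    """
    levels = csas = 0
    while n > 2:
        groups = n // 3          # each group of three becomes two operands
        csas += groups
        n = 2 * groups + n % 3   # leftover operands pass to the next level
        levels += 1
    return levels, csas


def approx_height(n: int) -> int:
    """Height approximation ceil(log_{3/2}(n / 2)) used in this appendix."""
    return math.ceil(math.log(n / 2, 1.5))


if __name__ == "__main__":
    for n in (3, 4, 6, 9, 10, 16):
        print(n, csa_reduction(n), approx_height(n))
```

The reduction always uses n − 2 adders. The height expression is an approximation: for n = 9 both values are four levels, whereas for n = 10 the level-by-level reduction needs five levels and the approximation yields four.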
A.3 Logic Complexity and Critical Path Delay for
FPGA Implementations
Tables A.7 and A.8 list, respectively, the complexity and the critical path delay of
the basic set of circuits used to implement the arithmetic circuits studied in this
work. The data in these tables correspond to implementations in FPGA logic. These
tables make use of the terms defined in Tables A.1 and A.2.
The critical path delay estimates are expressed in terms of propagation delays
through LUTs. The propagation delay of a LUT is represented by TL. Each LUT is
assumed to be capable of realizing an arbitrary Boolean function of its inputs. The
number of inputs of a LUT is represented by L. This work assumes the use of LUTs
with at least three inputs, which is consistent with the architectures presented in
[Xil00, Alt01, Luc99, Act01]. The generic gate model described previously could be
used to estimate the complexity and the performance of implementations that use
two-input LUTs.
The ability of a LUT to implement arbitrary Boolean functions of its inputs is
illustrated in Figures A.8 to A.13, where the LUTs are represented by rectangles.
These figures demonstrate how the basic circuits can be implemented with LUTs.
Tables A.7 and A.8 summarize the complexity and the critical path delay of the
basic circuits. Only one critical path delay measure is listed for full adders and half
adders because the critical path delay is the same for the sum output and the carry
output (for two-input gate implementations, the delay of the sum output and the
delay of the carry output are different).
Figures A.8 to A.15 show unused inputs in some of the LUTs. These unused
inputs could be used by synthesis tools when mapping a circuit to the targeted
hardware, thus reducing the complexity of the intended circuit. For the estimates
given here, unused LUT inputs are assumed to remain vacant.
Table A.7: Complexity of basic building blocks for implementations using FPGA logic
Building block            Boolean representation                    # LUT  # FF
GF (2) adder              c = a XOR b                               1      0
GF (2) mult.              c = a AND b                               1      0
2:1 mux                   c = (s AND a) OR ((NOT s) AND b)          1      0
two’s cmpl. cell          O = cmpl XOR I                            1      0
two’s cmpl. w. zero cell  O = (NOT zero) AND (cmpl XOR I)           1      0
full adder                s = x XOR y XOR z                         2      0
                          c = (x AND y) OR (y AND z) OR (x AND z)
half adder                s = x XOR y                               2      0
                          c = x AND y
shift reg. cell           s = (load AND IL) OR ((NOT load) AND IS)  1      1
Figure A.8: GF (2) adder implementation with a LUT [diagram not reproduced]
Tables A.9 and A.10 list, respectively, the complexity and the critical path delay
of the composite set of circuits used to implement the arithmetic circuits studied
in this work. The data in these tables correspond to implementations using FPGA
logic. These tables make use of the terms defined in Tables A.1 and A.2.
The complexity and the critical path delay of the two’s complement circuits, the
shift register, the ripple-carry adder, the increment adder, the 3:2 CSA adder, and
the 4:2 CSA adder circuits can be derived from Figures A.1 to A.7.
Table A.8: Critical path delay of basic building blocks for implementations using FPGA logic
Building block            Symbol    Critical path delay (in TL)
GF (2) adder              TGF2A     1
GF (2) mult.              TGF2M     1
2:1 mux                   TMUXC     1
two’s cmpl. cell          TTCC      1
two’s cmpl. w. zero cell  TTCZC     1
full adder                TFA       1
half adder                THA       1
shift reg. cell           TSRC      1
Figure A.9: 2:1 multiplexer implementation with a LUT [diagram not reproduced]
A 2:1 multiplexer that handles m-bit inputs is constructed from m 2:1 multiplexers,
each of which handles one-bit operands. Using this detail, one can corroborate
the results in Tables A.9 and A.10 for the complexity and the critical path delay of
an m-bit 2:1 multiplexer.
The complexity of a CSA adder tree is a function of the number of inputs of
the tree and of the size of each of the input operands. An n-input CSA adder
tree requires (n − 2) 3:2 CSA adders. Therefore, the complexity of a tree is n − 2
times the complexity of a 3:2 CSA adder. The critical path delay of the tree is a
function of the height of the tree. In this work, the height of a CSA adder tree is
approximated as ⌈log_{3/2} (n/2)⌉ levels, where the propagation delay at each level of
the tree corresponds to the propagation delay of a 3:2 CSA adder.
Table A.9: Complexity of composite building blocks for implementations using FPGA logic
Building block                  Primitive comp. complexity  # LUT                     # FF
register, m-bit op.             m FF                        0                         m
2:1 mux, m-bit op.              m MUXC                      m                         0
two’s cmpl., m-bit op.          m TCC                       m                         0
two’s cmpl. w. zero, m-bit op.  m TCZC                      m                         0
shift reg., m-bit op.           m SRC                       m                         m
binary tree                     (n − 1) ♢ gates             ⌈(n − 1)/(L − 1)⌉         0
GF (2) mult/add tree            N/A                         ⌈(n ∗ Z − 1)/(L − 1)⌉     0
ripple-carry adder, k-bit op.   k FA                        2k                        0
incr. adder, k-bit op.          k HA                        2k                        0
3:2 CSA, m-bit op.              m FA                        2m                        0
4:2 CSA, m-bit op.              2 * 3:2 CSA                 4m                        0
CSA adder tree, m-bit op.       (n − 2) * 3:2 CSA           2m(n − 2)                 0
Figure A.10: Half adder implementation with LUTs [diagram not reproduced; labels: inputs x, y, outputs c, s]
Figure A.11: Full adder implementation with LUTs [diagram not reproduced; labels: inputs x, y, z, outputs c, s]
The complexity and the critical path delay of a binary tree implementation using
LUTs are functions of the number of gates that can be packed in a LUT. Figure
A.14 shows how a binary tree can be built using LUTs. One can see in this figure
that the number of gates of a binary tree structure that can be packed into a LUT
is L − 1. Given n, the number of inputs of a binary tree, one can compute the
number of LUTs required to implement a binary tree using the following expression:
⌈(n − 1)/(L − 1)⌉. This expression can be interpreted as the number of LUTs
required to implement n − 1 gates, which is the number of two-input gates required
to compute a function of n inputs, given that each LUT can realize L − 1 gates when
used in a binary tree structure.
From Figure A.14, one can see that the first level of the tree starting from the
root node supports a maximum of L inputs. The next level supports L^2, the one
after that L^3, and so on. The height of this type of tree is therefore ⌈log_L n⌉ levels,
and its critical path delay is ⌈log_L n⌉ TL.
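The two expressions for LUT-based binary trees can be evaluated with integer arithmetic only. The following sketch computes the LUT count ⌈(n − 1)/(L − 1)⌉ and the number of LUT levels ⌈log_L n⌉ without floating-point logarithms. The function names are hypothetical and are used only for this illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive b, using integer arithmetic."""
    return -(-a // b)


def lut_tree_count(n: int, lut_inputs: int) -> int:
    """LUTs needed for an n-input binary tree: ceil((n - 1) / (L - 1))."""
    return ceil_div(n - 1, lut_inputs - 1)


def lut_tree_levels(n: int, lut_inputs: int) -> int:
    """LUT levels of an n-input tree: the smallest h with L^h >= n."""
    levels, capacity = 0, 1
    while capacity < n:
        capacity *= lut_inputs
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (4, 9, 16, 17):
        print(n, lut_tree_count(n, 4), lut_tree_levels(n, 4))
```

For example, a 16-input tree built from four-input LUTs uses four LUTs at the inputs and one at the root, which agrees with the count of five LUTs and two levels produced by the expressions.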
Because a LUT can implement an arbitrary function of its inputs, the nodes of
a binary tree can implement different logical functions. The GF (2) mult/add tree,
shown in Figure A.15, is an example of a tree whose nodes implement more than
one logical function.
Figure A.12: Two’s complement with zero cell circuit implementation with a LUT [diagram not reproduced; labels: I, cmpl, zero, O]
Figure A.13: Shift register cell implementation with a LUT [diagram not reproduced; labels: load, FF, IL, IS, O]
Figure A.15 shows the need to collocate primitive functions within a LUT. In
this figure, the primitive functions are GF (2) multiplications. The two inputs of
each primitive function must be collocated in the same LUT (this operation cannot
be split between neighboring LUTs). For some configurations, the collocation of
logic leads to wasted resources.
To determine the FPGA complexity of a GF (2) mult/add tree, one first computes
what is referred to here as the effective number of inputs of the tree and then one
uses the binary tree expressions to determine the complexity and the critical path
delay of the tree.
Table A.10: Critical path delay of composite building blocks for implementations
using FPGA logic
Building block          Symbol              Critical path delay (in $T_L$)
register                N/A                 0
2:1 mux                 $T_{MUX}$           1
two’s cmpl.             $T_{TC}$            1
two’s cmpl. w. zero     $T_{TCZ}$           1
shift reg.              $T_{SR}$            1
binary tree             N/A                 $\lceil \log_L n \rceil$
GF (2) mult/add tree    N/A                 $\lceil \log_L (n \cdot Z) \rceil$
ripple-carry adder      $T_{RA,k}$          $k$
incr. adder             $T_{I,k}$           $k$
3:2 CSA adder           $T_{3:2\,CSA}$      1
4:2 CSA adder           $T_{4:2\,CSA}$      2
CSA adder tree          $T_{CSAT,n}$        $\approx \lceil \log_{3/2} (n/2) \rceil$
The effective number of inputs is computed by multiplying the number of inputs
dedicated to primitive functions within a LUT by the multiplier Z. Z is defined in
Equation A.1, where I represents the number of inputs of a primitive function. I is
restricted in this work to be less than or equal to L. For GF (2) mult/add trees, I is
equal to two (which represents the number of inputs of a GF (2) multiplier), which
yields Z equal to one for LUTs with an even number of inputs.
Z is the ratio of the number of inputs of a LUT to the number of those inputs
that are dedicated to primitive functions. The effective number of inputs is a
multiple of the number of inputs of a LUT (that is, a multiple of L).
$$Z = \frac{L}{\lfloor L/I \rfloor \, I} \qquad \text{(A.1)}$$
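Equation A.1 can be checked numerically with a minimal Python sketch. It uses exact fractions so that odd LUT sizes do not introduce rounding error. The function names are hypothetical, and `effective_inputs` assumes the product $n \cdot Z$ that appears in the GF (2) mult/add tree entry of Table A.10.

```python
from fractions import Fraction


def multiplier_z(L: int, I: int) -> Fraction:
    """Equation A.1: Z = L / (floor(L / I) * I), for 1 <= I <= L."""
    if not 1 <= I <= L:
        raise ValueError("I must satisfy 1 <= I <= L")
    return Fraction(L, (L // I) * I)


def effective_inputs(n: int, L: int, I: int = 2) -> Fraction:
    """Effective number of inputs n * Z (assumed form, see Table A.10)."""
    return n * multiplier_z(L, I)
```

With $I = 2$, a 4-input LUT gives $Z = 1$, while a 3-input LUT gives $Z = 3/2$ because one input of every leaf LUT is left unused. Eight tree inputs (four GF (2) products) mapped to 3-input LUTs then yield twelve effective inputs, that is, four leaf LUTs.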
Figure A.14: Binary tree implementation with LUTs
Figure A.15: GF (2) mult/add tree implementation with LUTs
To derive the complexity and the critical path delay of the GF (2) mult/add tree,
do the following. Let $n = pL$ represent the effective number of inputs of a GF (2)
mult/add tree. The leaves of this tree consume $p$ LUTs. The outputs of the leaves
are added by what is basically a binary tree. The binary tree requires
$\lceil (p-1)/(L-1) \rceil$ LUTs. When adding the complexity of the leaves and the
complexity of the binary tree, the complexity of the GF (2) mult/add tree is found
to be $\lceil (n-1)/(L-1) \rceil$ LUTs:
$$p + \left\lceil \frac{p-1}{L-1} \right\rceil = \left\lceil p + \frac{p-1}{L-1} \right\rceil = \left\lceil \frac{(pL-p)+(p-1)}{L-1} \right\rceil = \left\lceil \frac{pL-1}{L-1} \right\rceil$$
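The identity used in this derivation can be confirmed by exhaustive checking over a small range. The following Python sketch compares the sum of the leaf LUTs and the adder tree LUTs against the closed form with $n = pL$. The function names and the tested ranges are illustrative choices, not part of the original analysis.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


def mult_add_tree_luts(p: int, L: int) -> int:
    """p leaf LUTs plus the binary tree that adds their p outputs."""
    leaves = p
    adder_tree = ceil_div(p - 1, L - 1)
    return leaves + adder_tree


def closed_form_luts(p: int, L: int) -> int:
    """ceil((n - 1) / (L - 1)) with n = p * L effective inputs."""
    n = p * L
    return ceil_div(n - 1, L - 1)


# Collect every (p, L) pair for which the two expressions disagree.
mismatches = [
    (p, L)
    for L in range(2, 9)
    for p in range(1, 200)
    if mult_add_tree_luts(p, L) != closed_form_luts(p, L)
]
```

The list `mismatches` stays empty because $p$ is an integer and can therefore be moved inside the ceiling.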
Using the effective number of inputs of a GF (2) mult/add tree, one can verify
that the critical path delay of these trees can be computed using the expression
used for binary trees. The critical path delay of a binary tree is $\lceil \log_L n \rceil \, T_L$. For
$n = pL$, the critical path delay expression translates into $(\lceil \log_L p \rceil + 1) \, T_L$, where
the constant $T_L$ delay accounts for the critical path delay of the leaves of the tree
and the delay $\lceil \log_L p \rceil \, T_L$ accounts for the critical path delay of the binary tree
adder.
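The split of the delay into one leaf level plus the adder tree can also be checked numerically. This Python sketch uses an integer-only ceiling logarithm to avoid floating-point rounding at exact powers of $L$. The helper names are assumptions made for illustration.

```python
def ceil_log(base: int, n: int) -> int:
    """Smallest h >= 0 such that base**h >= n, using integers only."""
    h, reach = 0, 1
    while reach < n:
        reach *= base
        h += 1
    return h


def mult_add_tree_delay(p: int, L: int) -> int:
    """Delay in units of T_L: one leaf level plus the adder tree levels."""
    return ceil_log(L, p) + 1


# Compare against the plain binary tree expression evaluated at n = p * L.
delay_mismatches = [
    (p, L)
    for L in range(2, 9)
    for p in range(1, 200)
    if ceil_log(L, p * L) != mult_add_tree_delay(p, L)
]
```

For instance, with $L = 4$ and $p = 4$ leaf LUTs ($n = 16$), both expressions give a delay of $2\,T_L$.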
Appendix B
Acronyms and Symbols
B.1 Acronyms
Table B.1: Acronyms
Acronyms Description
3DES Triple DES
AES Advanced Encryption Standard
ASIC Application Specific IC
AU Arithmetic unit
AUC Arithmetic unit controller
BSC Binary stored-carry number representation
CPA Carry-propagate adder
CSA Carry-save adder
DES Data Encryption Standard
DPRAM Dual Ported RAM
DSA Digital Signature Algorithm
ECDLP Elliptic Curve Discrete Logarithm Problem
FA Full adder
FF Flip-flop
FPGA Field Programmable Gate Array
gcd Greatest common divisor
GG Generic gate
HA Half adder
IC Integrated Circuit
LSB Least significant bit first bit-serial multiplier
LSB-SSM Least significant bit first super-serial multiplier
LSD Least significant digit first digit-serial multiplier
LUT Lookup table
MC Main controller
MSB Most significant bit first bit-serial multiplier
MSB-SSM Most significant bit first super-serial multiplier
MSD Most significant digit first digit-serial multiplier
MUX Multiplexer
MUXC Multiplexer cell
NAF Non-adjacent form number representation
NR Nonredundant number representation
RAM Random Access Memory
ROM Read Only Memory
SB Storage bits
SR Shift register
SRC Shift register cell
SSM Super-serial multiplier
TC Two’s complement circuit
TCC Two’s complement circuit cell
TCZ Two’s complement with zero circuit
TCZC Two’s complement with zero circuit cell
B.2 Symbols
Table B.2: Symbols
Symbol Description
$|x|$                Absolute value of $x$
$|x|_{\hat{M}}$      $x \bmod M + \epsilon M \equiv x \bmod M$, with result in range $[0, (\epsilon + 1)M)$
$|x|_M$              $x \bmod M$, with result in range $[0, M)$
$\lceil x \rceil$    Value of $x$ rounded up to the next integer
$\lfloor x \rfloor$  Value of $x$ rounded down to the next lower integer
Bibliography
[Act01] Actel Corp. Actel’s ProASIC Family, The Only ASIC Design Flow
FPGA. 2001.
[Alt01] Altera Corp. APEX 20KC Programmable Logic Device Data Sheet.
2001.
[AMV93] G.B. Agnew, R.C. Mullin, and S.A. Vanstone. An implementation of
elliptic curve cryptosystems over F2155 . IEEE Journal on Selected Areas
in Communications, 11(5):804–813, June 1993.
[ANS98] ANSI X9.62 – 199x. Public Key Cryptography For The Finantial
Services Industry: The Elliptic Curve Digital Signature Algorithm
(ECDSA). American National Standard, January 1998. Approved
January 7, 1999.
[ANS99] ANSI X9.63 – 199x. Public Key Cryptography For The Finantial Ser-
vices Industry: Key Agreement and Key Transport Using Elliptic Curve
Cryptography. American National Standard, January 1999. Working
Draft.
[BG89] T. Beth and D. Gollmann. Algorithm engineering for public key algo-
rithms. IEEE Journal on Selected Areas in Communications, 7(4):458–
466, 1989.
[BGMW93] E.F. Brickell, D.M. Gordon, K.S. McCurley, and D.B. Wilson. Fast
exponentiation with precomputation. In Advances in Cryptology – EU-
ROCRYPT ’92 (LNCS 658), pages 200 – 207. Springer-Verlag, 1993.
[Big85] N. Biggs. Discrete Mathematics. Oxford University Press, New York,
1985.
[Blu99] T. Blum. Modular exponentiation on reconfigurable hardware. Mas-
ter’s thesis, ECE Dept., Worcester Polytechnic Institute, Worcester,
U.S.A., May 1999.
[BSS99] I. Blake, G. Seroussi, and N.P. Smart. Elliptic Curves in Cryptography.
Cambridge University Press, Cambridge, UK, first edition, 1999.
299
[CC87] D.V. Chudnovsky and G.V. Chudnovsky. Sequences of numbers gener-
ated by addition in formal groups and new primality and factorization
tests. Advances in Applied Mathematics, 7:385 – 434, 1987.
[CMO98] H. Cohen, A. Miyaji, and T. Ono. Efficient elliptic curve exponen-
tiation using mixed coordinates. In Advances in Cryptology – ASI-
ACRYPT ’98 (LNCS 1514), pages 51 – 65. Springer-Verlag, 1998.
[DH76] W. Diffie and M.E. Hellman. New directions in cryptography. IEEE
Transactions on Information Theory, 22:644–654, 1976.
[DK91] S. Dusse´ and B. Kaliski. A cryptographic library for the Motorola
DSP56000. In Advances in Cryptology – EUROCRYPT ’90 (LNCS
473), pages 230–244. Springer-Verlag, 1991.
[ElG85] T. ElGamal. A public-key cryptosystem and a signature scheme based
on discrete logarithms. IEEE Transactions on Information Theory,
31(4):469–472, 1985.
[EW93] S. E. Eldridge and C. D. Walter. Hardware implementation of Mont-
gomery’s modular multiplication algorithm. IEEE Transactions on
Computers, 42(6):693–699, July 1993.
[FIP00] FIPS 186-2. Digital Signature Standard (DSS). Federal Informa-
tion Processing Standards Publication186-2, U.S. Department of Com-
merce/N.I.S.T. National Institute of Standards and Technology, Jan-
uary 2000.
[FP99] W.L. Freking and K.K. Parhi. A unified method for iterative com-
putation of modular multiplications and reduction operations. In In-
ternational Conference on Computer Design (ICCD ’99), pages 80–87,
1999.
[GHS00] P. Gaundry, F. Hess, and N.P. Smart. Constructive and de-
structive facets of Weil descent on elliptic curves. Available
at http://www.hpl.hp.com/techreports/2000/HPL-2000-10.html, Jan-
uary 2000. Preprint.
[Gor98] D.M. Gordon. A survey of fast exponentiation methods. Journal of
Algorithm, 27:129–146, 1998.
[GP97] J. Guajardo and C. Paar. Efficient algorithms for elliptic curve cryp-
tosystems. In Advances in Cryptology – CRYPTO ’97 (LNCS 1294),
pages 342–356. Springer-Verlag, 1997.
300
[GSS99] L. Gao, S. Shrivastava, and G. Sobelman. Elliptic curve scalar multi-
plier design using FPGAs. In Cryptographic Hardware and Embedded
Systems – CHES ’99 (LNCS 1717), pages 257–268. Springer-Verlag,
1999.
[HLM00] D. Hankerson, J. Lopez, and A. Menezes. Software implementation of
elliptic curve cryptography over binary fields. In Cryptographic Hard-
ware and Embedded Systems – CHES ’00 (LNCS 1965), pages 1–24.
Springer-Verlag, 2000.
[IEE98] IEEE P1363. Standard Specifications for Public-Key Cryptography
(Draft Version 8), October 1998.
[IT88] T. Itoh and S. Tsujii. A fast algorithm for computing multiplicative
inverses in GF (2m) using normal bases. Information and Computation,
78(3):171–177, 1988.
[ITT+99] K. Itoh, M. Takenaka, N. Torii, S. Temma, and Y. Kurihara. Fast
implementation of public-key cryptography on a DSP TMS320C6201.
In Cryptographic Hardware and Embedded Systems – CHES ’99 (LNCS
1717), pages 61–72. Springer-Verlag, 1999.
[JSP98] S. K. Jain, L. Song, and K. K. Parhi. Efficient semisystolic architectures
for finite-fields arithmetic. IEEE Transactions on Very Large Scale
Integration (VLSI) Sytems, 6(1):101–113, March 1998.
[KAK96] C. K. Koc, T. Acar, and B. S. Kaliski. Analyzing and comparing Mont-
gomery multiplication algorithms. IEEE Micro., 16(3):26–33, June
1996.
[Kob87] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation,
48:203–209, 1987.
[Kob94] N. Koblitz. A Course in Number Theory and Cryptography. Springer-
Verlag, New York, second edition, 1994.
[Koc90a] C. K. Koc. Carry save adders for computing the product AB modulo
N. Electronic Letters, 26(13):899–900, June 1990.
[Koc90b] C. K. Koc. Multi-operand modulo addition using carry save adders.
Electronic Letters, 26(6):361–363, March 1990.
[Koc95] C. K. Koc. Analysis of sliding window techniques for exponentiation.
Computers and Mathematics with Applications, 30(10):17–24, Novem-
ber 1995.
[Kor93] I. Koren. Computer Arithmetic Architectures. Prentice-Hall, 1993.
301
[Kor94] P. Kornerup. A systolic, linear-array multiplier for a class of right-shift
algorithms. IEEE Transactions on Computers, 43(8):892–898, August
1994.
[LD98a] J. Lopez and R. Dahab. An improvement of the Guajardo-Paar method
for multiplication on non-supersingular elliptic curves. Technical Re-
port IC-98-12, Institute of Computing, State University of Campinas,
Campinas, Sao Paulo, Brazil, April 1998.
[LD98b] J. Lopez and R. Dahab. On computing a multiple of an elliptic curve
point. Technical Report IC-98-13, Institute of Computing, State Uni-
versity of Campinas, Campinas, Sao Paulo, Brazil, April 1998.
[LD99a] J. Lopez and R. Dahab. Fast multiplication on elliptic curves over
GF (2m) without precomputation. In Cryptographic Hardware and Em-
bedded Systems – CHES ’99 (LNCS 1717), pages 316–327. Springer-
Verlag, 1999.
[LD99b] J. Lopez and R. Dahab. Improved algorithms for elliptic curve arith-
metic in GF (2n). In Selected Areas in Cryptography – SAC ’98 (LNCS
1556), pages 201–212. Springer-Verlag, 1999.
[LD00] J. Lopez and R. Dahab. An overview of elliptic curve cryptography.
Technical Report IC-00-10, Institute of Computing, State University
of Campinas, Campinas, Sao Paulo, Brazil, May 2000.
[LL94] C.H. Lim and P.J. Lee. More flexible exponentiation with precompu-
tation. In Advances in Cryptology – CRYPTO ’94 (LNCS 839), pages
95–107. Springer-Verlag, 1994.
[LMWL00] K.H. Leung, K.W. Ma, W.K. Wong, and P.H.W. Leong. FPGA imple-
mentation of a microcoded elliptic curve cryptographic processor. In
Eight Annual IEEE Symposium on Field-Programmable Custom Com-
puting Machines, FCCM ’00, Napa Valley, California, USA, 2000.
[LN94] R. Lidl and H. Niederreiter. Introduction to finite fields and their appli-
cations. Cambridge University Press, Cambridge, UK, revised edition,
1994.
[LR71] B.A. Laws and C.K. Rushforth. A cellular-array multiplier for GF (2m).
IEEE Transactions on Computers, 20:1573–1578, December 1971.
[LS01] P.-Y. Liardet and N.P. Smart. Preventing SPA/DPA in ECC systems
using the Jacobi form. In Cryptographic Hardware and Embedded Sys-
tems – CHES ’01 (LNCS 2162), pages 391–401. Springer-Verlag, 2001.
302
[Luc99] Lucent Technologies Inc. ORCA Series 3C and 3T Field-Programmable
Gate Arrays. 1999.
[Mas91] E.D. Mastrovito. VLSI Architectures for Computation in Galois Fields.
PhD thesis, Linko¨ping University, Dept. Electr. Eng., Linko¨ping, Swe-
den, 1991.
[McE87] R.J. McEliece. Finite Fields for Computer Scientists and Engineers.
Kluwer Academic Publishers, 1987.
[Men93] A.J. Menezes. Elliptic Curve Public Key Cryptosystems. Kluwer Aca-
demic Publishers, 1993.
[Mil86] V. Miller. Uses of elliptic curves in cryptography. In Advances in Cryp-
tology – CRYPTO ’85 (LNCS 218), pages 417–426. Springer-Verlag,
1986.
[Mon85] P.L. Montgomery. Modular multiplication without trial division. Math-
ematics of Computation, 44(170):519–521, April 1985.
[Mot01] Motorola Inc. Technical Summary MPC190 Security Processor.
MPC190TS/D. 2001.
[MvOV97] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of
Applied Cryptography. CRC Press, 1997.
[OP99] G. Orlando and C. Paar. A super-serial Galois fields multiplier for
FPGAs and its application to public-key algorithms. In Seventh Annual
IEEE Symposium on FPGAs for Custom Computing Machines, pages
232–239, Los Alamitos, CA, 1999. IEEE Computer Society Press.
[OP00a] G. Orlando and C. Paar. A high performance elliptic curve processor for
GF (2m). In Cryptographic Hardware and Embedded Systems – CHES
’00 (LNCS 1965), pages 41–56. Springer-Verlag, 2000.
[OP00b] G. Orlando and C. Paar. Squaring architecture for GF (2m) and its
applications in cryptographic systems. Electronic Letters, 36(13):1116–
1117, June 2000.
[OP01] G. Orlando and C. Paar. A scalable GF (p) elliptic curve processor ar-
chitecture for programmable hardware. In Cryptographic Hardware and
Embedded Systems – CHES ’01 (LNCS 2162), pages 348–363. Springer-
Verlag, 2001.
[Oru95] H. Orup. Simplifying quotient determination in high-radix modular
multiplication. In IEEE Proceedings 12th Symposium on Computer
Arithmetic, pages 193–199, 1995.
303
[OS01] K. Okeya and K. Sakurai. Efficient elliptic curve cryptosystem from
a scalar multiplication algorithm with recovery of the y-coordinate on
a Montgomery-form elliptic curve. In Cryptographic Hardware and
Embedded Systems – CHES ’01 (LNCS 2162), pages 126–141. Springer-
Verlag, 2001.
[Par90] B. Parhami. Generalized signed-digit number systems: A unifying
framework for redundant number representations. IEEE Transactions
on Computers, 39(1):89–98, January 1990.
[Par99] B. Parhami. Computer Arithmetic Algorithms and Hardware Designs.
Oxford University Press, Inc., New York, 1999.
[PFSR99] C. Paar, P. Fleischmann, and P. Soria-Rodriguez. Fast arithmetic for
public-key algorithms in Galois fields with composite exponents. IEEE
Transactions on Computers, 48(10):1025–1034, October 1999.
[Pol78] J. Pollard. Monte Carlo methods for index computation mod p. Math-
ematics of Computation, 32:918–924, 1978.
[PR97] C. Paar and P. Soria Rodriguez. A new class of fast finite field archi-
tectures for public-key algorithms. In Advances in Cryptology – EU-
ROCRYPT ’97 (LNCS 1233), pages 363–378. Springer-Verlag, 1997.
[Ros98a] M. Rosing. Implementing Elliptic Curve Cryptography. Manning Pub-
lications Co., Connecticut, 1998.
[Ros98b] M. Rosner. Elliptic curve cryptosystems on reconfigurable hardware.
Master’s thesis, ECE Dept., Worcester Polytechnic Institute, Worces-
ter, USA, May 1998.
[RSA78] R.L. Rivest, A. Sharmir, and L.M. Adleman. A method for obtaining
digital signatures and public-key cryptosystems. Communications of
the ACM, 21:120–126, 1978.
[SES98] S. Sutikno, R. Effendi, and A. Surya. Design and implementation of
arithmetic processor F2155 for elliptic curve cryptosystems. In The 1998
IEEE Asia-Pacific Conference on Circuits and Systems, pages 647–650,
1998.
[Sol99] J. Solinas. Improved algorithms for arithmetic on anomalous binary
curves. Technical Report CORR-46, Dept. of C&O, University of Wa-
terloo, 1999.
[SOOS95] R. Schroeppel, H. Orman, S. O’Malley, and O. Spatscheck. Fast key
exchange with elliptic curve systems. In Advances in Cryptology –
CRYPTO ’95 (LNCS 963), pages 43–56. Springer-Verlag, 1995.
304
[SP96] L. Song and K.K. Parhi. Efficient finite fields serial/parallel multipli-
cation. In Proc. Int. Conf. Application Specific System Architectures
and Processors, pages 72–82. Chicago, IL, August 1996.
[SP97] L. Song and K. K. Parhi. Low-energy digit-serial/parallel finite field
multipliers. Journal of VLSI Signal Processing Systems, 2(22):1–17,
1997.
[ST92] J. H. Silverman and J. Tate. Rational Points on Elliptic Curves.
Springer-Verlag, New York, 1992.
[Sti95] D. R. Stinson. Cryptography, Theory and Practice. CRC Press, 1995.
[STP86] P.A. Scott, S.E. Travares, and L.E. Peppard. A fast VLSI multiplier for
GF (2m). IEEE Journal on Selected Areas in Communications, 4:62–66,
January 1986.
[SV93] M. Shand and J. Vuillemin. Fast implementations of RSA cryptogra-
phy. In IEEE Proceedings 11th Symposium on Computer Arithmetic,
pages 252–259, 1993.
[Tan84] A. S. Tanenbaum. Structured Computer Organization. Prentice-Hall,
Inc., New Jersey, second edition, 1984.
[Van99] S.A. Vanstone. Efficient implementation of elliptic curve cryptography,
June 1999. Certicom Corporation Seminar.
[WBV+96] E. De Win, A. Bosselaers, S. Vandenberghe, P. De Gersem, and J. Van-
dewalle. A fast software implementation for arithmetic operations in
GF (2n). In Advances in Cryptology – ASIACRYPT ’96 (LNCS 1163),
pages 65–76. Springer-Verlag, 1996.
[WMPW98] E. De Win, S. Mister, B. Preneel, and M. Wiener. On the perfor-
mance of signature schemes based on elliptic curves. In Algorithmic
Number Theory, Proceedings Third International Symposium (LNCS
1423), pages 252–266. Springer-Verlag, 1998.
[Wu99] H. Wu. Low complexity bit-parallel finite field arithmetic using polyno-
mial basis. In Cryptographic Hardware and Embedded Systems – CHES
’99 (LNCS 1717), pages 280–291. Springer-Verlag, 1999.
[Xil00] Xilinx Inc. The Programmable Logic Data Book. 2000.
[YRT84] C.S. Yeh, I.S. Reed, and T.K. Truong. Systolic multipliers for finite
fields GF (2m). IEEE Transactions on Computers, 33(4):357–360, April
1984.
305
