Towards Efficient Hardware Implementation of Elliptic and Hyperelliptic Curve Cryptography by Ismail, Marwa Nabil
Towards Ecient Hardware





presented to the University of Waterloo
in fulllment of the
thesis requirement for the degree of
Doctor of Philosophy
in
Electrical and Computer Engineering
Waterloo, Ontario, Canada, 2012
© Marwa Nabil Ismail 2012
I hereby declare that I am the sole author of this thesis. This is a true copy of
the thesis, including any required final revisions, as accepted by my examiners.
I understand that my thesis may be made electronically available to the public.
Abstract
Implementation of elliptic and hyperelliptic curve cryptographic algorithms has
been the focus of a great deal of recent research directed at increasing efficiency.
Elliptic curve cryptography (ECC) was introduced independently by Koblitz and
Miller in the 1980s. Hyperelliptic curve cryptography (HECC), a generalization
of the elliptic curve case, allows a decreasing field size as the genus increases.
The work presented in this thesis examines the problems created by limited
area, power, and computation time when elliptic and hyperelliptic curves are
integrated into constrained devices such as wireless sensor networks (WSNs) and
smart cards. The lack of a battery in wireless sensor network devices limits
their processing power, but they still require security. It was widely believed
that devices with such constrained resources could not incorporate a strong HECC
processor for performing cryptographic operations such as elliptic curve scalar
multiplication (ECSM) or hyperelliptic curve divisor multiplication (HCDM).
However, the work presented in this thesis has demonstrated the feasibility of
integrating an HECC processor into such devices through the use of the proposed
architecture synthesis and optimization techniques for several inversion-free
algorithms.
The goal of this work is to develop a hardware implementation of binary
elliptic and hyperelliptic curves. The focus is on the modeling of three factors:
register allocation, operation scheduling, and storage binding. These factors
were then integrated into architecture synthesis and optimization techniques in
order to determine the best overall implementation suitable for constrained
devices.
The main purpose of the optimization is to reduce the area and power.
Through analysis of the architecture optimization techniques for both datapath
and control unit synthesis, the number of registers was reduced by an average
of 30%. The use of the proposed efficient explicit formulas for the different
algorithms also enabled a reduction in the number of read/write operations from/to
the register file, which reduces the processing power consumption. As a result,
an overall HECC processor requires from 1843 to 3595 slices on a Xilinx
XC4VLX200, and the total computation time is limited to between 10.08 ms and
15.82 ms at a maximum frequency of 50 MHz for a variety of inversion-free
coordinate systems in hyperelliptic curves. The value of the new model has been
demonstrated with respect to its implementation in elliptic and hyperelliptic
curve cryptographic algorithms, through both synthesis and simulations.
In summary, a framework has been provided for consideration of interactions
with synthesis and optimization through architecture modeling for constrained
environments. Insights have also been presented with respect to improving the




All praise be to Allah, the Creator and Sustainer of the world, for giving me the
soul support to complete this thesis.
First, I would like to express my deep and sincere gratitude to my supervisor
at the University of Waterloo, Professor M. Anwar Hasan, for being an
outstanding advisor. His invaluable guidance, encouragement, and support from
the initial to the final level enabled me to complete this thesis. It was an honour
to have worked under his supervision.
I also am deeply indebted to Professor Alfred Menezes, Professor Catherine
H. Gebotys, and Professor Guang Gong for serving on the thesis committee and
for offering their insightful comments and invaluable suggestions. Moreover, I
would like to thank Professor Amr Youssef, Concordia University, for taking the
time to review this work as an external examiner.
Special thanks to Dr. Jithra Adkari and Dr. Abdulaziz Alkhoraidly for providing
me with valuable feedback about my work. I am very grateful to all of my
colleagues for creating such a great environment and for making my research so
enjoyable throughout my stay at Waterloo. Thanks as well to Barbara Trotter
for her assistance in proofreading my thesis. I would also like to thank the ECE
administrative staff for their help, kindness, patience, and cooperation.
I warmly thank my parents for their consistent encouragement, valuable advice,
and the guidance that helped make my graduate studies successful.
Special thanks go to my lovely husband, Abdelaziz Aboueleinin, for his
unconditional help and understanding during these last five years while I was
conducting the research for my thesis. I also would like to thank my lovely children,
Mohamed, Hossam, and Habibah, who have been very patient throughout the
time I was preparing and writing my thesis.
Finally, my warmest appreciation goes to the Ministry of Higher Education
in Egypt, who made this thesis possible through the Bureau of Cultural and
Educational Affairs of Egypt in Canada.
To all of you, I thank you very much!
To my parents, my husband, and my children
Table of Contents
List of Algorithms
List of Figures
List of Tables
List of Abbreviations
1 Introduction
1.1 Research Contribution
1.2 Outline
2 Preliminary Background
2.1 Binary Field Arithmetic
2.1.1 Field Addition Operation
2.1.2 Field Multiplication Operation
2.1.3 Field Squaring Operation
2.2 Elliptic Curve Arithmetic
2.2.1 Example on an Elliptic Curve over $\mathbb{F}_{2^{163}}$
2.2.2 Point Representations
2.2.3 Point Multiplication Costs
2.3 Hyperelliptic Curve Arithmetic
2.3.1 Basic Definitions and Properties
2.3.2 Genus 2 Hyperelliptic Curve over $\mathbb{F}_{2^m}$
2.3.3 Example of a Genus 2 Hyperelliptic Curve over $\mathbb{F}_{2^{83}}$
2.3.4 Divisor Class Representations
2.3.5 Divisor Multiplication Complexities
2.3.6 Hyperelliptic Curve Algorithms
2.3.6.1 Cantor's Group Operations on the Jacobian
2.3.6.2 Harley's Group Operations on the Jacobian
3 Architecture Synthesis and Optimization
3.1 Introduction
3.2 Design Flow of the Behavioral Architecture Synthesis
3.2.1 Register Allocation via Variable Liveness Analysis
3.2.1.1 Modeling Register Allocation
3.2.2 Operation Scheduling via Forwarding Paths
3.2.2.1 The Proposed Scheduling Model
3.2.3 Storage Binding via Efficient Register Spilling
3.3 Optimization Techniques for Architecture Synthesis
3.3.1 Datapath Analysis
3.3.2 Control Unit Analysis
3.3.2.1 Controllers via Resource Liveness Analysis
3.4 Architecture Optimization Tradeoffs
3.4.1 Area/Power Optimization
3.4.2 Power/Time Optimization
4 Efficient Implementation for Elliptic Curve Algorithms
4.1 Introduction
4.2 Previous Work in ECC Hardware Implementation
4.2.1 Low-Power Designs
4.2.2 High performance Designs
4.2.3 Compact Implementations
4.3 Architecture Synthesis of ECC processor
4.3.1 ECC processor on FPGA
4.3.2 Macroscopic Structural view of Conventional ECC datapath
4.3.2.1 Register-Transfer-Level Design for López-Dahab Algorithm Implementation
4.3.3 Macroscopic Structural view of Optimized ECC datapath
4.3.4 Analyzing the Design
4.4 Implementation Results and Comparison
4.5 Conclusion
5 Efficient Implementation of Genus 2 Hyperelliptic Curve Algorithms
5.1 Introduction
5.2 Previous Work related to HECC Hardware Implementation
5.3 Explicit Formulas on Genus 2 Curves over a Binary Field
5.3.1 Cantor's Algorithms for Explicit Formula over Even Characteristics
5.3.2 Recent Coordinates in Even Characteristic When $h_2 \neq 0$
5.4 Efficient Explicit Formula of Genus 2 over Binary Field
5.5 Architecture Synthesis of an HECC Processor
5.5.1 Conventional HECC processor
5.5.1.1 Register File
Multi-Port Register File
5.5.1.2 Multiplexers
Internal Data forwarding
5.6 Architecture Optimization for an HECC Processor
5.6.1 Macroscopic Structural View of HECC Datapath
5.7 Hyperelliptic Curve Datapath Analysis
5.7.1 Area of the HECC processor
5.7.2 Energy Consumption of the HECC Processor
5.7.3 Improve the performance of the HECC processor
5.8 Hyperelliptic Curve Control Analysis
5.9 Experimental Results
6 Comparison of Elliptic and Hyperelliptic
6.1 Complexity of Elliptic and Hyperelliptic Curve Processors
6.2 Elliptic Curve NIST-recommended Comparisons
7 Conclusion and Future Direction
7.1 Future Direction
Bibliography
Appendix
A Efficient HECC Explicit Register Management Formulas
A.1 New Weighted Coordinates ($\mathcal{N}$)
A.2 Projective Coordinates ($\mathcal{P}$)
A.3 Recent Coordinates ($\mathcal{R}$)
B VHDL Source Code
C Synthesis and Simulation
List of Algorithms
2.1 Bit serial multiplication (low to high bits)
2.2 Digit-serial multiplication (low to high bits)
2.3 Left-to-right binary method for scalar multiplication
2.4 Left-to-right binary method for divisor multiplication
2.5 Cantor's Divisor Addition on the Jacobian of HECC
2.6 Cantor's Divisor Doubling on the Jacobian of HECC
2.7 Harley Divisor Addition on the Jacobian of HECC
2.8 Harley's Divisor Doubling on the Jacobian of HECC
4.1 Register management for the left-to-right binary point multiplication in $\mathbb{F}_{2^m}$ using López-Dahab mixed projective coordinates [37].
4.2 Register management for the left-to-right binary point multiplication in $\mathbb{F}_{2^m}$ in López-Dahab projective coordinates [37].
4.3 The modified register management for left-to-right binary point multiplication in $\mathbb{F}_{2^m}$ in López-Dahab mixed projective coordinates.
4.4 The modified register management for left-to-right binary point multiplication in $\mathbb{F}_{2^m}$ in López-Dahab projective coordinates.
5.1 Doubling in recent coordinates $g = 2$, $h_2 \neq 0$, in even characteristic
5.2 Addition in recent coordinates $g = 2$, $h_2 \neq 0$, in even characteristic
5.3 Mixed addition in recent coordinates $g = 2$, $h_2 \neq 0$, in even characteristic
5.4 The modified register management of divisor doubling for recent coordinates for an even characteristic when $h_2 = 0$
A.1 The modified register management of divisor doubling, new weighted coordinates in an even characteristic when $h_2 = 0$
A.2 The modified register management of divisor addition, new weighted coordinates in an even characteristic when $h_2 = 0$
A.3 The modified register management of mixed addition, new weighted coordinates in an even characteristic when $h_2 = 0$
A.4 The modified register management of divisor doubling, new weighted coordinates in an even characteristic when $h_2 \neq 0$
A.5 The modified register management of divisor addition, new weighted coordinates in an even characteristic when $h_2 \neq 0$
A.6 The modified register management of mixed addition for new weighted coordinates for an even characteristic when $h_2 \neq 0$
A.7 The modified register management of the divisor doubling for projective coordinate for an even characteristic when $h_2 = 0$
A.8 The modified register management for divisor addition for projective coordinates for an even characteristic when $h_2 = 0$
A.9 The modified register management of mixed addition for projective coordinates for an even characteristic when $h_2 = 0$
A.10 The modified register management of divisor doubling for projective coordinates for an even characteristic when $h_2 \neq 0$
A.11 The modified register management of divisor addition for projective coordinates for an even characteristic when $h_2 \neq 0$
A.12 The modified register management of mixed addition for projective coordinates for an even characteristic when $h_2 \neq 0$
A.13 The modified register management of divisor addition for recent coordinates for an even characteristic when $h_2 = 0$
A.14 The modified register management of mixed addition for recent coordinates for an even characteristic when $h_2 = 0$
A.15 The modified register management of divisor doubling for recent coordinates for an even characteristic when $h_2 \neq 0$
A.16 The modified register management of divisor addition for recent coordinates for an even characteristic when $h_2 \neq 0$
A.17 The modified register management of mixed addition for recent coordinates for an even characteristic when $h_2 \neq 0$
List of Figures
3.1 Computation flow graph
3.2 The procedure for register allocation via variable liveness analysis
3.3 The procedure for bypass-aware operation scheduling via forwarding path
3.4 Sequential-type versus parallel and forwarding-type architectures
4.1 A top level architecture for the EC processor
4.2 Conventional point addition/doubling unit
4.3 Optimized point addition/doubling unit
4.4 Comparison on the point multiplication computational times of the ECP designs
4.5 Dynamic power comparisons of the ECP designs
5.1 Basic architecture of Hyperelliptic Curve Divisor Implementation
5.2 Structure view of new weighted coordinate divisor doubling datapath
5.3 Structural view of new weighted coordinates divisor mixed addition datapath
5.4 Area in number of slices for different digit-serial multiplier when $h_2 \neq 0$
5.5 Area in number of slices for different digit-serial multiplier when $h_2 = 0$
5.6 Power consumption in (mW) for different digit-serial multiplier when $h_2 \neq 0$
5.7 Power consumption in (mW) for different digit-serial multiplier when $h_2 = 0$
5.8 Top-level Data path for Projective Coordinates Divisor Doubling-Addition
5.9 Energy in µJ for different maximum frequency when $h_2 \neq 0$
5.10 Energy in µJ for different maximum frequency when $h_2 = 0$
5.11 An example of a divisor doubling operation with overlapping
6.1 Area in number of slices for different digit-serial multipliers for $LD_{DBL-ADD}$
6.2 Area in number of slices for different digit-serial multipliers for $LD_{DBL-mADD}$
C.1 RTL schematic results after synthesis for $\mathbb{F}_{2^{163}}$ multiplier component
C.2 RTL schematic results after synthesis for $\mathbb{F}_{2^{163}}$ parallel squarer component
C.3 RTL schematic results after synthesis for $\mathbb{F}_{2^{83}}$ parallel squarer component
C.4 Simulation waveforms for $\mathbb{F}_{2^{163}}$ multiplier VHDL model
C.5 Simulation waveforms for $\mathbb{F}_{2^{163}}$ parallel squarer VHDL model
C.6 FPGA synthesis area results for recent coordinates divisor multiplication when $h_2 \neq 0$
C.7 FPGA synthesis power results through XPower for recent coordinates when $h_2 \neq 0$ at 100 MHz
C.8 Simulation waveforms for $\mathbb{F}_{2^{163}}$ ECC scalar multiplication VHDL model
List of Tables
2.1 FPGA Results for Virtex 4: XC4vlx200-11ff1513
2.2 Scalar multiplication ($kP$) time in cycles (for $k = (k_{162}, \ldots, k_1, k_0)_2$)
2.3 Different coordinates projective ($\mathcal{P}$), new weighted ($\mathcal{N}$), and recent ($\mathcal{R}$) with approximate complexity, in even characteristic, for divisor doubling with $g = 2$
2.4 Different coordinates projective ($\mathcal{P}$), new weighted ($\mathcal{N}$), and recent ($\mathcal{R}$) with approximate complexity, in even characteristic, for divisor addition with $g = 2$
2.5 Divisor multiplication ($kD$) time in cycles
2.6 Divisor multiplication for different coordinates, in even characteristic, with size of $(k)_2 = 165$ bits
4.1 Comparison of the FPGA implementation of the elliptic curve scalar multiplication designs
5.1 Estimated timing results of previous HECC group operations and divisor multiplications
5.2 Estimated area results in slices of previous HECC hardware implementation
5.3 Comparison of the register requirements for a variety of coordinates systems of divisor addition and divisor doubling
5.4 RAVLA strategy of divisor doubling, recent coordinates in an even characteristic when $h_2 = 0$
5.5 RAVLA strategy of divisor doubling, recent coordinates in an even characteristic when $h_2 = 0$ (cont.)
5.6 OSFPs strategy of divisor doubling, recent coordinates in an even characteristic when $h_2 = 0$
5.7 OSFPs strategy of divisor doubling, recent coordinates in an even characteristic when $h_2 = 0$ (cont.)
5.8 Area, power consumption and clock rate for Data path component
5.9 FPGA and ASIC Synthesis and Simulate Power Efficiency Results for HECC Divisor Multiplication
5.10 FPGA and ASIC Synthesis and Simulate Area Efficiency Results for HECC Divisor Multiplication
6.1 Complexity of ECC and HECC processors for the 80 bit level of security
6.2 NIST recommended binary finite fields and their reduction polynomials
6.3 Estimated area, power, and computation time comparison for






N New Weighted Coordinate
P Projective Coordinate
ASIC Application Specic Integrated Circuit
BRAM Block Random Access Memory
ECC Elliptic curve cryptography
ECCs Elliptic Curve Cryptosystems
ECDSA Elliptic Curve Digital Signature Algorithm
ECP Elliptic curve processor
ECSM Elliptic Curve Scalar Multiplication
FFAU Finite Field Arithmetic Unit
FFs Flip Flops
FP Forwarding Path
FPGA Field Programmable Gate Array
FSMD Finite State Machine with Datapath
GCD Greatest Common Divisor
HCDLP Hyperelliptic Curve Discrete Logarithm Problem
HCDM Hyperelliptic Curve Divisor Multiplication
HECC Hyperelliptic curve cryptography
HECCs Hyperelliptic Curve Cryptosystems
HECP Hyperelliptic curve processor
ILP Instruction Level Parallelism
LUTs Lookup Tables
NIST National Institute of Standards and Technology
OSFPs Operation Scheduling via Forwarding Paths
PKC Public Key Cryptography
PKCs Public Key Cryptosystems




SBERS Storage Binding via Efficient Register Spilling
VHDL Very High Speed Integrated Circuit Hardware Description Language




Cryptography is the study of methods to ensure the authenticity, integrity and
non-repudiation of data, and it is a fundamental concern in computer and net-
work security. These days cryptography is incorporated into a variety of ap-
plications, including email, web browsing, banking, and electronic commerce.
A major problem in the early days of cryptography was the requirement that
each user have the same secret key. This key had to be exchanged using a non-
encrypted method, e.g., through a personal meeting of the users or through a
trusted third party. This symmetric key was then used to encrypt and decrypt
messages. Public key cryptography (PKC) overcame this inefficiency through
the use of asymmetric encryption. In a PKC system, each user has two keys:
one public and one private. Only the user knows the private key, while everyone
is aware of the public key.
The concept of PKC was introduced in 1976 by Whitfield Diffie and Martin
Hellman [26]. The security of this scheme is based on the intractability of the
discrete logarithm problem in the multiplicative group of a large finite field. In
1985, Neal Koblitz [47] and Victor Miller [70] independently proposed the use
of an elliptic curve group over a finite field for the implementation of public
key cryptosystems (PKCs). The advantage of elliptic curves is that the discrete
logarithm problem is considered much more difficult in elliptic curve groups than
in some other groups (e.g., the multiplicative group of a finite field), and also
more difficult than the integer factorization problem underlying RSA.
Elliptic curve cryptography (ECC) therefore allows smaller key sizes to be used
to achieve the same level of security, which results in lower memory requirements,
faster encryption and decryption, less power consumption, and lower bandwidth
requirements. Elliptic curve cryptography is consequently well suited for
low-power embedded systems that require high levels of security when transferring
data.
An alternative to RSA and elliptic curves is to use other curves, in particular,
genus 2 curves. These cryptosystems, which have been named hyperelliptic,
were proposed in 1989 [48], soon after the elliptic ones, but their deployment is
far more difficult. The first problem encountered was the group law. For elliptic
curves, the elements of the group are just the points of the curve. There is
no group structure on the set of points of a hyperelliptic curve. Instead, in
hyperelliptic curve cryptography (HECC), the elements of the group are points
of a 2-dimensional variety associated with the genus 2 curve, called the Jacobian
variety. ECC and its generalization, HECC, have since been the subject
of steadily increasing interest, especially with the profusion of embedded
devices, such as mobile telephones, car navigation systems, and trusted computing
modules. Such applications have resulted in a stronger motivation to provide
algorithms for performing cryptographic operations for curve-based cryptography
that require fewer computational and memory resources.
The goal of this thesis is the development of hardware-oriented scalar
multiplication algorithms for elliptic and hyperelliptic curve cryptography, in
particular, techniques for optimizing these operations and reducing the area, power
consumption, and/or computation time. The focus of the research is on curves
with efficiently optimized algorithms over binary fields, using a field-programmable
gate array (FPGA) based design methodology. Further important aspects of
this work involved mapping the development of explicit expressions for these
optimized algorithms, including the use of different coordinate systems, the
implementation of explicit formulas for hyperelliptic curve divisor multiplication,
and the determination of a compact hardware design architecture for this
operation.
1.1 Research Contribution
Hardware synthesis for cryptographic devices lies at the intersection of electrical
and cryptographic engineering. Due to the harsh area and power constraints
associated with constrained computing devices, an increasing need has arisen
for architecture synthesis and optimization solutions that are tailored to these
environments.
The main contributions of this research are as follows.
• Compose an architecture synthesis methodology relying on register allocation
via variable liveness analysis, operation scheduling via forwarding
paths, and storage binding via efficient register spilling. Develop a modified
register management for performing inversion-free operations in both
ECC and HECC arithmetic.
• Applying the above-mentioned methodology, develop an efficient hardware
implementation of HECC and ECC processors for inversion-free coordinates
over a binary field. The design is optimized to reduce memory and register
requirements.
• For genus 2 HECC, the explicit formulas for projective ($\mathcal{P}$), new weighted ($\mathcal{N}$),
and recent ($\mathcal{R}$) coordinates are optimized and a comparison is performed in
order to determine appropriate inversion-free coordinates with respect to
area, power and computation time of the group operations.
• To the best of the author's knowledge, this study is the first practical
comparison of inversion-free coordinate hardware implementations for ECC and
HECC with respect to the same level of security based on memory/register
requirements, energy consumption, and computation time. This comparison
also takes into account different digit sizes for finite field multipliers.
1.2 Outline
Chapter 2 introduces the necessary mathematical background of elliptic and
hyperelliptic curve cryptosystems. The basic definitions and properties are presented
along with those of points and divisors. These explanations enable the arithmetic
and the Jacobian of the ECC and the HECC to be defined. The definition
of a polynomial representation of the equivalence classes is then provided, and the
point and group operations based on the López-Dahab (LD) and Cantor-Harley
algorithms are introduced.
Chapter 3 provides an overview model of the stages in the hardware synthesis
design flow. This chapter gives a high-level understanding of various aspects
of architecture synthesis and optimization design issues, including register
allocation via variable liveness analysis, operation scheduling via forwarding paths,
and storage binding via efficient register spilling. The chapter concludes with an
explanation of datapath and control unit analysis, which offers insight into the
investigation of the tradeoffs required during each step of the hardware synthesis
design flow.
Chapter 4 presents our implementation of an optimized elliptic curve processor
(ECP) with a special scalar multiplier and self-controlled architecture. The
goal of the study in this chapter is to optimize the ECP implementation for
resource-restricted environments with respect to hardware usage. The register
management algorithms for both conventional and optimized point multiplication
are also provided. The area, power, and processing time results are also
presented and compared. The parallelism is exploited on three different levels:
the field arithmetic level, the point operation level, and the scalar multiplication
level. López-Dahab projective and mixed coordinate formulas for genus 1 curves
are targeted. The analysis has resulted in an optimized architecture for ECP.
Chapter 5 first gives a brief review of previous work with respect to HECC
implementation. The improvements to the group operations for the HECC
developed in the research conducted for this thesis are then presented, with an
initial introduction of the methods used for minimizing register requirements in
a group operation. This chapter also includes a description of optimized group
operations based on the explicit formula of Cantor's and Harley's algorithms
and on the projective and the new weighted coordinates introduced by Lange.
The optimized HECC implementation results achieved on an FPGA are also
presented in this chapter. Three different architectures have been developed for
targeting area, power, and speed through the use of projective, the new weighted,
and recent coordinates. The HECC processor is described, and the methodology
for different design options is outlined. This chapter ends with a comparison of
the results related to HECC implementations using FPGAs.
Chapter 6 introduces the theoretical comparison of a variety of different
architecture options for ECC and their HECC equivalents on the proposed processor.
The focus is on the underlying area, power and processing time. The theoretical
comparison matrices are validated using FPGA implementations. This work
finishes with a comparison of the area, power, and computation time of ECC for
a recommended finite field library.
The conclusions arising from this work and some suggestions for further re-
search are summarized in Chapter 7. The Appendix contains additional infor-





This chapter gives background related to finite field operations and elliptic curve
cryptography. It mainly focuses on the field $\mathbb{F}_{2^{163}}$, which is one of the five binary fields
recommended by the National Institute of Standards and Technology (NIST)
for Elliptic Curve Digital Signature Algorithm (ECDSA) application [75]. This
chapter also provides an elementary introduction to the Jacobian of hyperelliptic
curves over finite fields of even characteristic, with attention being given only
to definitions and algorithms that are relevant for this work. References [46, 54,
37, 7] offer additional details about these topics.
2.1 Binary Field Arithmetic
The finite field $\mathbb{F}_2$ has two elements, 0 and 1. Addition and multiplication are performed modulo 2, so addition is a logical XOR and multiplication is a logical AND. $\mathbb{F}_{2^m}$ is the extension field of $\mathbb{F}_2$ and has $2^m$ elements. Each of these elements is represented as a polynomial of degree less than or equal to $m-1$ with coefficients from the ground field $\mathbb{F}_2$. For such a representation, addition is bit-independent and straightforward. However, multiplication and squaring involve polynomial multiplication and squaring modulo an irreducible polynomial of degree $m$. Hence, the design of efficient architectures to perform these arithmetic operations is of great practical concern. This section summarizes the arithmetic and architectures for field operations. We will concentrate on fields $\mathbb{F}_{2^m}$, because fields of characteristic 2 are best suited for hardware architectures. An element $A \in \mathbb{F}_{2^m}$ will be represented as a polynomial $A(x) = \sum_{i=0}^{m-1} a_i x^i$, $a_i \in \mathbb{F}_2$. The irreducible polynomial will be denoted as $F(x)$. The list below provides a short description of each implementation along with references pointing to more detailed information for field multiplication.

Algorithm 2.1 Bit serial multiplication (low to high bits)
Input: An irreducible polynomial $F(x)$ of degree $m$, two elements $A(x), B(x) \in \mathbb{F}_{2^m}$.
Output: $C(x) = A(x)B(x) \bmod F(x)$.
1: $C(x) = 0$
2: for $i = 0$ to $m-1$ do
3:   $C(x) = b_i A(x) + C(x)$
4:   $A(x) = A(x) \cdot x \bmod F(x)$
5: end for
6: return $C(x)$
2.1.1 Field Addition Operation
The addition of two elements, $A$ and $B$, in a binary field $\mathbb{F}_{2^m}$ is performed by a bitwise XOR with no carry propagation:
$$C(x) \equiv \sum_{i=0}^{m-1} c_i x^i = A(x) + B(x) = \sum_{i=0}^{m-1} \big((a_i + b_i) \bmod 2\big)\, x^i,$$
where $c_i, a_i, b_i \in \mathbb{F}_2$. It should be noted that in characteristic 2, subtraction of two field elements is the same as addition because each element is its own additive inverse.
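As a small illustration of the statement above, field addition reduces to a single XOR when elements are stored as bit vectors. The integer encoding (bit $i$ holds $a_i$) and the name `gf2m_add` are assumptions made for this sketch only.

```python
def gf2m_add(a: int, b: int) -> int:
    """Add two elements of F_{2^m}; bit i of each integer is the coefficient of x^i."""
    return a ^ b  # coefficient-wise addition modulo 2, no carries


a = 0b1011             # x^3 + x + 1
b = 0b0110             # x^2 + x
c = gf2m_add(a, b)     # x^3 + x^2 + 1
back = gf2m_add(c, b)  # adding b again undoes the addition, so subtraction equals addition
```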
2.1.2 Field Multiplication Operation
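A number of polynomial basis bit-serial, digit-serial, and bit-parallel finite field multipliers have been proposed. The bit-parallel multipliers have a small critical path delay and high throughput, and are best suited for applications requiring high speed. If two arbitrary field elements $A$ and $B \in \mathbb{F}_{2^m}$ are expressed as $A(x) = \sum_{i=0}^{m-1} a_i x^i$ and $B(x) = \sum_{i=0}^{m-1} b_i x^i$, then the product of $A$ and $B$ can be expressed as
$$C(x) = A(x) \cdot B(x) \bmod F(x),$$
where $F(x)$ is an irreducible binary polynomial of degree $m$ and defines the field $\mathbb{F}_{2^m}$. Therefore,
$$C(x) = \big(b_0 A(x) + b_1 A(x)x + b_2 A(x)x^2 + \cdots + b_{m-1} A(x)x^{m-1}\big) \bmod F(x). \tag{2.1}$$

Algorithm 2.2 Digit-serial multiplication
Input: $A(x) = \sum_{i=0}^{m-1} a_i x^i$, where $a_i \in \mathbb{F}_2$; $B(x) = \sum_{i=0}^{n_G-1} B_i(x) x^{Gi}$, where $B_i(x)$ as in Equation (2.2)
Output: $C(x) = A(x)B(x) = \sum_{i=0}^{m-1} c_i x^i$, where $c_i \in \mathbb{F}_2$
1: $C(x) = 0$
2: for $i = 0$ to $n_G - 1$ do
3:   $C(x) = B_i(x)A(x) + C(x)$
4:   $A(x) = A(x) \cdot x^G \bmod F(x)$
5: end for
6: return $C(x)$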
Algorithm 2.1 shows a well-known procedure for implementing Equation (2.1). The reader is referred to [37] for a diagram of this kind of bit serial multiplier. Other multipliers (e.g., digit serial and bit parallel) require fewer clock cycles but more area. The bit serial multiplier is the most area-efficient hardware implementation. The computation time, area, and power requirements for the developed implementation of the bit serial multiplier are given in Table 2.1 for the case $m = 163$.
Bit serial multipliers are very area efficient, but they are quite slow. On the other hand, fully bit parallel multipliers require too much area. For the purpose of this work, we consider digit serial multipliers. Such multipliers process multiple bits of the input word in one clock cycle. The digit size can be varied in order to achieve a desired tradeoff between area, power, and speed. In this work, the digit-serial finite field multiplier proposed in [86] was used, as shown in Algorithm (2.2). This system is called digit-serial/parallel, which reflects the fact that the multiplicand bits are processed in parallel while the multiplier bits are processed one digit at a time. The design has two major computations: i) partial product generation and accumulation, as seen in Step 3 of Algorithm (2.2), and ii) mod $F(x) = x^m + f_n x^n + \sum_{i=0}^{n-1} f_i x^i$ degree reduction operations, as seen in Step 4 of Algorithm (2.2). The multiplier $B(x)$ is split into $n_G = \lceil m/G \rceil$ digits of at most $G$ bits each, $B(x) = \sum_{i=0}^{n_G-1} B_i(x) x^{Gi}$, with
$$B_i(x) = \begin{cases} \sum_{j=0}^{G-1} b_{Gi+j}\, x^j, & 0 \le i \le n_G - 2 \\ \sum_{j=0}^{m-1-G(n_G-1)} b_{Gi+j}\, x^j, & i = n_G - 1 \end{cases} \tag{2.2}$$
The following equation is thus derived for the least significant digit-serial (LSD) multiplier scheme:
$$\begin{aligned} C(x) = \big( & B_0(x)A(x) + B_1(x)\,(A(x) \cdot x^G \bmod F(x)) \\ & + B_2(x)\,(A(x)x^G \cdot x^G \bmod F(x)) + \cdots \\ & + B_{n_G-1}(x)\,(A(x)x^{G(n_G-2)} \cdot x^G \bmod F(x)) \big) \bmod F(x), \end{aligned} \tag{2.4}$$
which can be written compactly as
$$C(x) = \Big( \sum_{i=0}^{n_G-1} B_i(x)\,\big(A(x)x^{Gi} \bmod F(x)\big) \Big) \bmod F(x).$$
For digit multipliers with digit size $G$, when $G \le m - n$, Step 3 and Step 4 in Algorithm (2.2) can be reduced to degree less than $m$ in one step [86]. In this work, different digit sizes are investigated in order to reduce the computation time as well as the area and power consumption.
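To make the digit-serial scheme concrete, the following is a minimal software sketch of the LSD-first loop of Algorithm (2.2). The helper names (`clmul`, `reduce_poly`, `digit_serial_mul`), the integer encoding of polynomials, and the example operands are assumptions for illustration; the final reduction of the accumulator is done in software here and does not model the single-step reduction hardware of [86].

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # x^163 + x^7 + x^6 + x^3 + 1


def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) product over F_2, without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_poly(p: int, m: int, f: int) -> int:
    """Reduce p modulo f, where f has degree m, by cancelling high bits top-down."""
    for i in range(p.bit_length() - 1, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)
    return p


def digit_serial_mul(a: int, b: int, G: int, m: int = M, f: int = F) -> int:
    """LSD-first digit-serial product a*b mod f with digit size G."""
    n_g = -(-m // G)                      # number of digits, ceil(m / G)
    mask = (1 << G) - 1
    c = 0
    for i in range(n_g):
        b_i = (b >> (G * i)) & mask       # digit B_i(x) of the multiplier
        c ^= clmul(a, b_i)                # step 3: C = B_i * A + C
        a = reduce_poly(a << G, m, f)     # step 4: A = A * x^G mod F
    return reduce_poly(c, m, f)           # bring the accumulator below degree m


a = (1 << 162) | 0x1234567
b = (1 << 160) | 0xABCDEF
reference = reduce_poly(clmul(a, b), M, F)
results = [digit_serial_mul(a, b, G) for G in (1, 4, 8, 16)]
```

Each loop iteration stands for one clock cycle, so the cycle count drops from $m$ to $\lceil m/G \rceil$ as the digit size grows, at the cost of a wider partial-product stage.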
2.1.3 Field Squaring Operation
While squaring is a special case of general multiplication and can be performed by a multiplier, performance can be significantly improved through the optimization of the architecture. Squaring a field element in $\mathbb{F}_{2^m}$ represented via a polynomial $A(x) = \sum_{i=0}^{m-1} a_i x^i$ is performed as
$$A^2(x) = \sum_{i=0}^{m-1} a_i x^{2i} \bmod F(x). \tag{2.5}$$
The square of an element $A(x)$ in $\mathbb{F}_{2^{163}}$, represented by
$$A^2(x) = \underbrace{(a_{162}x^{160} + \cdots + a_{83}x^2 + a_{82})\,x^{164} \bmod F(x)}_{A_H} + \underbrace{(a_{81}x^{162} + \cdots + a_1 x^2 + a_0)}_{A_L}, \tag{2.6}$$
involves three mathematical steps: (i) expand the $A_L$ part with interleaved 0's; (ii) reduce the $A_H$ part with the reduction polynomial $F(x) = x^{163} + x^7 + x^6 + x^3 + 1$; and (iii) add the two parts, $A_H$ and $A_L$. However, for hardware implementations, the use of Equation (2.6) enables these three steps to be combined in four levels of XOR gates if the reduction polynomial has a small second-highest degree, which is the case here. The squaring can hence be efficiently implemented in order to generate the result in one clock cycle without huge area requirements [39]. The implementation and the area costs are shown in Table 2.1. Using a fixed reduction polynomial $F(x)$, squaring can be performed within one clock cycle using a combinatorial logic implementation. The squaring requires at most $(n-1) + (m-1)$ gates, where $n$ represents the number of non-zero coefficients of the reduction field polynomial.
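The three steps of Equation (2.6) can be sketched in software as bit spreading followed by reduction. The names `reduce_poly` and `square` and the integer encoding are assumptions for this sketch; a hardware squarer would realize the same mapping as a fixed XOR network rather than as loops.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # x^163 + x^7 + x^6 + x^3 + 1


def reduce_poly(p: int, m: int, f: int) -> int:
    """Reduce p modulo f, where f has degree m."""
    for i in range(p.bit_length() - 1, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)
    return p


def square(a: int, m: int = M, f: int = F) -> int:
    """Square a field element: interleave zeros, then reduce modulo f."""
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)   # a_i x^i becomes a_i x^(2i)
    return reduce_poly(spread, m, f)


# Squaring is linear over F_2 (no cross terms survive in characteristic 2)
a = (1 << 162) | 0x1234567
b = (1 << 160) | 0xABCDEF
is_linear = square(a ^ b) == square(a) ^ square(b)
```

Because no cross terms appear, the squarer needs no partial-product array, which is consistent with the small gate count quoted above.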
2.2 Elliptic Curve Arithmetic
An ECC-based cryptosystem is considered to be one of the best candidates for light-weight applications, such as mobile devices, due to its small key size and efficient computation [28]. Cryptographic mechanisms based on elliptic curves depend on the arithmetic of points on the curve. Elliptic curve arithmetic is defined in terms of the underlying field operations, which are described in Section 2.1. A non-supersingular elliptic curve over $\mathbb{F}_{2^m}$ can be defined as follows:
$$E(\mathbb{F}_{2^m}) : y^2 + xy = x^3 + ax^2 + b, \quad b \neq 0. \tag{2.7}$$
In its simplest form, an elliptic curve point $P \in E$ is defined as a pair of elements $(x, y)$ of $\mathbb{F}_{2^m}$ that satisfy Equation (2.7). The points on $E$ form a commutative finite group under the point addition operation. For the work presented in this thesis, a NIST-recommended random curve [75] with a parameter $b \neq 0$ was employed. The special point $\mathcal{O}$, known as the point at infinity, is the additive identity of the group. The addition of two points on $E$ is performed using the well-known chord and tangent process [14]. The underlying operations used to perform the point addition are $\mathbb{F}_{2^m}$ arithmetic operations. Point doubling is a special case of point addition, and scalar multiplication $kP$ is the addition of $k$ copies of the point $P$, i.e.,
$$Q = kP = \underbrace{P + P + \cdots + P}_{k \text{ copies}}.$$
For a large value of $k$, the scalar multiplication can be performed using repeated point additions and doublings, as described in Algorithm 2.3.

Algorithm 2.3 Left-to-right binary method for scalar multiplication
Input: $P \in E(\mathbb{F}_{2^m})$, $k = \sum_{i=0}^{l-1} k_i 2^i$ where $k_{l-1} = 1$
Output: $kP \in E(\mathbb{F}_{2^m})$
1: $Q \leftarrow P$
2: for $i = l-2$ down to 0 do
3:   $Q \leftarrow 2Q$
4:   if $k_i = 1$ then
5:     $Q \leftarrow Q + P$
6:   end if
7: end for
8: return $Q = kP$
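The control flow of Algorithm 2.3 is independent of how points are represented, so it can be sketched with the group operations passed in as functions. In the sketch below the function name `left_to_right_scalar_mul` is hypothetical, and the additive group of integers modulo a small prime is used purely as a stand-in for $E(\mathbb{F}_{2^m})$ so that the result is easy to check; it is not an elliptic curve implementation.

```python
def left_to_right_scalar_mul(k, P, double, add):
    """Return (kP, doublings, additions) for k >= 1 using Algorithm 2.3."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]              # k_{l-1} ... k_0, with k_{l-1} = 1
    Q, n_double, n_add = P, 0, 0   # step 1: Q <- P
    for bit in bits[1:]:           # i = l-2 down to 0
        Q = double(Q)              # step 3: Q <- 2Q
        n_double += 1
        if bit == "1":             # steps 4-5: Q <- Q + P
            Q = add(Q, P)
            n_add += 1
    return Q, n_double, n_add


# Stand-in group for illustration only: integers modulo 1009 under addition.
N = 1009
result, n_double, n_add = left_to_right_scalar_mul(
    0b101101, 7, lambda q: (2 * q) % N, lambda q, p: (q + p) % N
)
```

For an $l$-bit scalar the loop performs $l-1$ doublings and one addition for every non-zero bit after the leading one, which is the basis of the average-cost estimates used in this chapter.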
2.2.1 Example on an Elliptic Curve over F2163
In this subsection, an example of a point on an elliptic curve with almost-prime group order [37] is given. The underlying field is $\mathbb{F}_{2^{163}}$ with reduction polynomial $F(z) = z^{163} + z^7 + z^6 + z^3 + 1$. The example is the random elliptic curve $E(\mathbb{F}_{2^{163}})$ of genus 1 defined by





are given in hexadecimal. The order of the base point $P$ over $\mathbb{F}_{2^{163}}$ is
$$n = \underbrace{\texttt{000000040000000000000000000292FE77E70C12A4234C33}}_{l},$$
where the factor $l$ is prime. In cryptographic applications with this group, the multiplier for scalar multiplication should be an integer $k$, where $1 \le k \le l-1$. In this case $l$ is 163 bits in length, so $k$ should also be (at most) 163 bits.
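The stated bit length of the order can be checked directly from the hexadecimal value printed above. The snippet below is only a sanity check of that one number; the use of Python's `secrets` module to draw a scalar in the range $1 \le k \le l-1$ is an illustrative assumption and not part of the thesis.

```python
import secrets

# Order of the base point, copied from the hexadecimal value given in the text.
n = int("000000040000000000000000000292FE77E70C12A4234C33", 16)

order_bits = n.bit_length()

# Draw a scalar uniformly from [1, n - 1].
k = secrets.randbelow(n - 1) + 1
```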
2.2.2 Point Representations
Elliptic curve points can be represented using various coordinate systems, such as affine or projective representations. For each such system, the speed of additions and doublings is different. We consider the projective coordinates presented by López-Dahab (LD) [54] and give the appropriate formulas and the number of operations. A projective elliptic curve point $P = (X, Y, Z)$ consists of three elements of $\mathbb{F}_{2^m}$. To convert the affine point $(x, y)$ to projective coordinates, $Z$ is simply set to 1, i.e., $(x, y, 1)$. Projective coordinates are generally used for internal computations, but the resultant projective point is converted to its affine form before being transmitted. Different types of projective coordinates vary with respect to how the projective point maps to an affine point. As a means of avoiding an expensive field inversion operation, it is convenient to work with the LD projective coordinates [54] because they are efficient and have been used previously in hardware realizations.
An LD [54] projective point $P = (X, Y, Z)$ maps to the affine point $P = (X/Z, Y/Z^2)$. An effort has been made to minimize the number of finite field multiplication operations if mixed coordinates are used. For convenience, explicit formulas are given below for computing the point addition $P + Q = (X_3, Y_3, Z_3)$ in LD coordinates of $P = (X_1, Y_1, Z_1)$ and $Q = (X_2, Y_2, Z_2)$:
$$\begin{aligned} X_3 &= A * (D + B^2) + B * (A^2 + C), \\ Z_3 &= (A + B)^2 * Z_1 * Z_2, \\ Y_3 &= (A * H + F * C) * F + (H + Z_3) * X_3, \end{aligned} \tag{2.8}$$
where
$$\begin{aligned} A &= X_1 * Z_2, & B &= X_2 * Z_1, & C &= Y_1 * Z_2^2, \\ D &= Y_2 * Z_1^2, & E &= A + B, & F &= A^2 + B^2, \\ G &= C + D, & H &= G * E. \end{aligned} \tag{2.9}$$




2 +D + Z3,




A = X1 ∗ Z1, B = X21 C = B + Y1,D = A ∗ C.
Algorithm 2.3 is the left-to-right version of the basic repeated double-and-add method for point multiplication [37]. For scalar multiplication using mixed coordinates, $Q$ is stored in projective coordinates $(X_1, Y_1, Z_1)$, while $P$ is stored in affine coordinates ($\mathcal{A}$) as $(X_2, Y_2, 1)$. The point addition operation can be performed using the following formulas:
$$\begin{aligned} X_3 &= A^2 + D + E, \\ Z_3 &= C^2, \\ Y_3 &= (E + Z_3) * F + G, \end{aligned}$$
where
$$\begin{aligned} A &= Y_2 * Z_1^2 + Y_1, & B &= X_2 * Z_1 + X_1, & C &= Z_1 * B, \\ D &= B^2 * (C + aZ_1^2), & E &= A * C, & F &= X_3 + X_2 * Z_3, \\ G &= (X_2 + Y_2) * Z_3^2. \end{aligned}$$
The point addition and doubling operations iterate through the binary expansion of $k$ [37]. On each iteration, a point doubling is performed. If $k_i = 1$, then a point addition is also performed. In general, the binary expansion of $k$ has approximately $m$ bits. On average, half of these bits are equal to one. The average cost of implementing a scalar multiplication using Algorithm 2.3 is therefore given as $N_{binary} = m N_{double} + (m/2) N_{add}$, where $N_{double}$ and $N_{add}$ are the number of cycles for each point doubling and addition operation, respectively.
2.2.3 Point Multiplication Costs
Estimates for point multiplication costs are presented in terms of curve operations (point additions and point doublings) and in terms of the corresponding field operations (multiplication MUL, squaring SQR, addition ADD, and inversion INV). Point addition and doubling in projective coordinates cost 14MUL + 4SQR + 9ADD and 5MUL + 4SQR + 5ADD, respectively. The cost of conversion back to affine coordinates is 2MUL + 1INV + 1SQR. The special case of $a = 0$ or 1 provides a saving of 1MUL for both point addition and doubling. For the field $\mathbb{F}_{2^{163}}$, the inversion can be implemented in 9MUL + 162SQR operations. In this work the NIST-recommended elliptic curve over $\mathbb{F}_{2^{163}}$ is used so that $a = 1$. When mixed coordinates are used, the operation counts for elliptic curve point addition and point doubling are 8MUL + 5SQR + 9ADD and 4MUL + 4SQR + 5ADD, respectively, in the López-Dahab coordinate system.

Table 2.2: Scalar multiplication ($kP$) time in cycles (for $k = (k_{162}, \ldots, k_1, k_0)_2$)
Coordinate System    N_add    N_double    (kP) Average
Affine (A)           1,793    1,793       437,492
LD projective        2,132    661         280,435
LD mixed             1,318    661         214,501
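The inversion count of 9MUL + 162SQR quoted above for $\mathbb{F}_{2^{163}}$ matches what an Itoh-Tsujii style addition chain on the exponent $2^{163} - 2$ produces. Whether the thesis uses exactly this chain is an assumption; the sketch below, with hypothetical helper names and an integer encoding of field elements, only shows one way of arriving at the same operation count.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # x^163 + x^7 + x^6 + x^3 + 1


def mul(a: int, b: int, m: int = M, f: int = F) -> int:
    """Field product: carry-less multiply, then reduce modulo f."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for i in range(r.bit_length() - 1, m - 1, -1):
        if (r >> i) & 1:
            r ^= f << (i - m)
    return r


def itoh_tsujii_inverse(a: int, m: int = M, f: int = F):
    """Return (a^-1, multiplications, squarings) via a^(2^m - 2)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    muls = sqrs = 0
    beta, k = a, 1                        # invariant: beta = a^(2^k - 1)
    for bit in bin(m - 1)[3:]:            # bits of m-1 after the leading one
        t = beta
        for _ in range(k):                # t = beta^(2^k), k squarings
            t = mul(t, t, m, f)
            sqrs += 1
        beta = mul(t, beta, m, f)         # beta = a^(2^(2k) - 1)
        muls += 1
        k *= 2
        if bit == "1":                    # extend the chain by one
            beta = mul(mul(beta, beta, m, f), a, m, f)
            sqrs += 1
            muls += 1
            k += 1
    inverse = mul(beta, beta, m, f)       # a^(2^m - 2) = a^-1
    sqrs += 1
    return inverse, muls, sqrs


a = (1 << 162) | 0x1234567
inv, n_mul, n_sqr = itoh_tsujii_inverse(a)
```

Squarings are counted separately from multiplications because, as described in Section 2.1.3, a squaring can be completed in a single clock cycle.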
The number of cycles required for performing scalar multiplication is listed in Table 2.2. The cycle counts for point addition and point doubling are used to estimate the average scalar multiplication times, which are given in the rightmost column of the table.
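The averages in Table 2.2 can be reproduced from the per-operation cycle counts with the cost expression $N_{binary} = m N_{double} + (m/2) N_{add}$. In the sketch below the dictionary layout and function name are illustrative assumptions; the only observation added is that the tabulated values are obtained when $m/2$ is rounded down to 81 for $m = 163$.

```python
M = 163

# (N_add, N_double) in cycles, copied from Table 2.2.
CYCLES = {
    "Affine (A)": (1793, 1793),
    "LD projective": (2132, 661),
    "LD mixed": (1318, 661),
}


def average_scalar_mul_cycles(n_add: int, n_double: int, m: int = M) -> int:
    """Average cost of the binary method: m doublings and floor(m/2) additions."""
    return m * n_double + (m // 2) * n_add


averages = {name: average_scalar_mul_cycles(*c) for name, c in CYCLES.items()}
```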
2.3 Hyperelliptic Curve Arithmetic
The main benefit of HECC is that it offers security equivalent to that of ECC for much smaller parameter sizes. This advantage results in smaller datapaths, less memory usage, and lower power consumption [79]. Integrating public key cryptosystems (PKCs) into a constrained device is a challenge due to the limitations in area and power. In the past, it was widely believed that devices with such constrained resources cannot carry out strong cryptographic operations such as hyperelliptic curve divisor multiplication (HCDM). However, the feasibility of integrating PKCs into such devices has recently been demonstrated by the success of several implementations [36, 13, 94, 3, 11, 49, 35].
This section provides an elementary introduction to the mathematical background needed for the application of hyperelliptic curves. Basic definitions and properties of HECC are given as well as an introductory treatment of divisors. With these definitions, the Jacobian of hyperelliptic curves over finite fields of even characteristic can be defined. In order to work efficiently with divisors, Mumford's representation is introduced [74]. Finally, we present the algorithms for addition and doubling of two elements on the Jacobian of HECC, with attention being given only to definitions and algorithms that are relevant for this work. Additional details are available in [48, 46].
The idea behind using hyperelliptic curves in public key cryptography is that groups formed from hyperelliptic curves are suitable for discrete logarithm cryptosystems. This idea was first introduced in 1989 by Neal Koblitz [48]. Hyperelliptic curves are a special class of algebraic curves and can be viewed as a generalization of elliptic curves because there are hyperelliptic curves of every genus $g \ge 1$, and a hyperelliptic curve of genus $g = 1$ is an elliptic curve. This section concentrates on genus 2 (hyperelliptic) curves over fields of characteristic 2 and, in particular, provides details about divisor doubling and divisor addition formulas for different types of even characteristic curves. Choosing curves defined over $\mathbb{F}_{2^m}$ allows very efficient divisor multiplication, as shown in the literature related to even characteristic curves [59, 61].
2.3.1 Basic Definitions and Properties
Denition 1 A eld is a set F with multiplication and addition operations that
satisfy the familiar rules − associativity and commutativity of both addition and
multiplication, the distributive law, existence of an additive identity 0 and a mul-
tiplicative identity 1, additive inverses, and multiplicative inverses for everything
except 0.
Arithmetic operations in the nite eld F2m are used extensively in hyperel-
liptic curve cryptosystems. A polynomial basis is one of the common bases used
as a representation of eld elements of F2m. The literature includes a number
of proposed hardware structures for an F2m HECC processor using a polynomial
basis [94, 93].
Denition 2 If a eld F has the property that every polynomial with coecients
in F factors completely into linear factors, then F can be said to be algebraically
closed. For example, the algebraic closure of the eld of real numbers is the eld
of complex numbers.
If F is a eld and F is the algebraic closure of F, a hyperelliptic curve C of
genus g over F (g ≥ 1) is given by an equation of the form
C : y2 + h(x)y = f(x) in F[x, y] , (2.12)
where h(x) ∈ F[x] is a polynomial of degree at most g (deg(h) ≤ g), f(x) ∈ F[x] is
a monic polynomial of degree 2g + 1 (deg(f) = 2g + 1), and there are no singular
points.
16
Denition 3 A singular point on C is a solution (x, y) ∈ F×F that simultaneously
satises Equation (2.12) and the following partial dierential equation:
2y + h(x) = 0, h′(x)y − f ′(x) = 0. (2.13)
Denition 4 A curve is said to be non-singular if there are no pairs (x, y) ∈ F×F
that simultaneously satisfy the equation of the curve C and the partial dierential
equations 2y + h(x) = 0 and h′(x)y − f ′(x) = 0.
Denition 5 A divisor D is a formal sum of points on a hyperelliptic curve C.
The Jacobian JC is the group of degree 0 divisors modulo the principal divisors.
Denition 6 A principal divisor is the divisor of a rational function. Let f ∈ F̄(C)
be a rational function on C.





2. A divisor D is called principal if D = div(f) for some rational function
f ∈ F̄(C).
3. The set of principal divisors on C is denoted by Princ(C).
In practice, the Mumford representation [74] is used for elements of $J_C$: each divisor is represented by a pair of polynomials $(u, v)$ such that $u$ is a monic polynomial of degree at most 2, $\deg v < \deg u \le 2$, and $u \mid v^2 + vh - f$; these types of divisors are called reduced. This representation provides a very compact definition of divisor classes and also allows divisor classes to be used easily in implementations, since only two lists of coefficients (of the two polynomials $u$ and $v$) of length at most $g$ have to be stored in a computer [62].
2.3.2 Genus 2 Hyperelliptic Curve over F2m
The non-singular curve $C : y^2 + h(x)y = f(x)$, $h, f \in \mathbb{F}_{2^m}[x]$, $\deg(f) = 5$, $\deg(h) \le 2$, is called a hyperelliptic curve of genus 2 over $\mathbb{F}_{2^m}$, where $h(x) = h_2x^2 + h_1x + h_0$ and $f(x) = x^5 + f_4x^4 + f_3x^3 + f_2x^2 + f_1x + f_0$ are polynomials defined over $\mathbb{F}_{2^m}$. Genus 2 curves in characteristic 2 can be divided into three types depending on the polynomial $h$: Type 1 if $\deg h = 2$, Type 2 if $\deg h = 1$, and Type 3 if $\deg h = 0$. A genus 2 curve of Type 2 defined over $\mathbb{F}_{2^m}$ by an equation of the form $y^2 + h(x)y = f(x)$ is isomorphic to a curve defined by an equation of the form $y^2 + xy = x^5 + f_3x^3 + \varepsilon x^2 + f_0$, where $\varepsilon$ is in $\mathbb{F}_2$ if $m$ is odd. In cryptographic applications, $m$ is usually chosen to be prime in order to resist potential Weil descent attacks [22].
The points on the curve $C$ do not form a group, so the divisor class group of $C$ (also called the Jacobian) is used instead. The elements of this group are represented using Mumford's representation as described in Section 2.3.1.
2.3.3 Example of a Genus 2 Hyperelliptic Curve over F283
In this subsection, an example is provided of the Jacobian of a hyperelliptic curve with almost-prime group order [24]. The correctness of this example can be easily checked by multiplying a random divisor by the given group order and then verifying that the result is principal [24]. The underlying field is $\mathbb{F}_{2^{83}}$ with reduction polynomial $F(z) = z^{83} + z^7 + z^4 + z^2 + 1$. The example is the random hyperelliptic curve of genus 2 defined by $y^2 + h(x)y = f(x)$ with $h(x) = h_2x^2 + h_1x + h_0$ and $f(x) = x^5 + f_4x^4 + f_3x^3 + f_2x^2 + f_1x + f_0$, where the coefficients
h0 = 7FF29B08993336B479CD2    h1 = 32C101713C722F8FB5BC9
h2 = 553E16B6A3BC6B2432CA8
f0 = 7AD44882C02B9743CD58B    f1 = 327254FA330B44958262A
f2 = 204AB23E12828D061AF04    f3 = 1C827250FFDEFF93B43BE
f4 = 13D80106C0E5571DFE139
are given in hexadecimal. The group order of the Jacobian $J_C$ over $\mathbb{F}_{2^{83}}$ is
$$\#J_C = 2 \cdot \underbrace{46768052394566313810931349196246034325781246483037}_{l},$$
where the last factor $l$ is prime. In cryptographic applications with this group, the multiplier for divisor multiplication should be an integer $k$, where $1 \le k \le l-1$. In this case $l$ is 165 bits in length, so $k$ should also be (at most) 165 bits.
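Two quick sanity checks on the printed group order can be run without any curve arithmetic: the bit length of $l$, and whether $\#J_C$ lies close enough to $q^2$ (with $q = 2^{83}$) to be compatible with the Hasse-Weil bound $(\sqrt{q} - 1)^4 \le \#J_C \le (\sqrt{q} + 1)^4$ for genus 2. The variable names are illustrative, and the integer slack used below is a slightly relaxed version of that bound, so this is a plausibility check rather than a verification of the order.

```python
from math import isqrt

# Prime factor l of the group order, copied from the decimal value in the text.
l = 46768052394566313810931349196246034325781246483037
group_order = 2 * l
q = 2**83

l_bits = l.bit_length()

# (sqrt(q) + 1)^4 - q^2 = 4 q^(3/2) + 6 q + 4 sqrt(q) + 1; round each root upward
# so that the integer slack is never smaller than the true one.
slack = 4 * (isqrt(q**3) + 1) + 6 * q + 4 * (isqrt(q) + 1) + 1
within_hasse_weil = abs(group_order - q * q) <= slack
```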
2.3.4 Divisor Class Representations
A divisor $D = \sum_{P_i \in C} m_i P_i$, $m_i \in \mathbb{Z}$, is a formal sum of points on a hyperelliptic curve $C$. The degree of $D$, denoted $\deg D$, is the integer $\sum_{P_i \in C} m_i$. The set of all divisors forms an additive group denoted by $\mathbf{D}(C)$. The set of divisors of degree zero is denoted by $\mathbf{D}^0 \subset \mathbf{D}(C)$. A divisor $D \in \mathbf{D}^0$ is called a principal divisor if its points are the zeros and poles of a rational function on $C$. The set of all principal divisors is denoted by $\mathbf{P}$. The set $\mathbf{P}$ is a subgroup of $\mathbf{D}^0$. If $D_1, D_2 \in \mathbf{D}^0$, then one writes $D_1 \sim D_2$ if $D_1 - D_2 \in \mathbf{P}$; $D_1$ and $D_2$ are said to be equivalent divisors. For each divisor $D \in \mathbf{D}^0$ there exists a semi-reduced divisor $D_1 \in \mathbf{D}^0$ such that $D_1 \sim D$. The Jacobian of $C$ can then be defined as the quotient group $J = \mathbf{D}^0/\mathbf{P}$. A divisor $D = \sum m_i P_i$ is said to be defined over $\mathbb{F}$ if $D^\sigma = \sum m_i P_i^\sigma$ is equal to $D$ for all automorphisms $\sigma$ of $\bar{\mathbb{F}}$ over $\mathbb{F}$. It should be noted that a divisor being defined over $\mathbb{F}$ does not mean that each $P_i^\sigma$ is equal to $P_i$; $\sigma$ may permute the points [22].
In [16], Cantor shows that each element of the Jacobian can be represented in the form $D = \sum_{i=1}^{r} P_i - r \cdot \infty$ such that for all $i \neq j$, $P_i$ and $P_j$ are not symmetric points (the point symmetric to $P = (x, y)$ is $(x, -y - h(x))$). Such a divisor is called a semi-reduced divisor. Cantor concludes from the Riemann-Roch Theorem that each element of the Jacobian can be represented uniquely by such a divisor, subject to the additional constraint $r \le g$. Such divisors are referred to as reduced divisors. Finally, Cantor shows that the divisors of the Jacobian can be represented as a pair of polynomials $u$ and $v$, where $u$ is monic, $\deg v < \deg u \le g$, and $u$ divides $v^2 + hv - f$, where the coefficients of $u$ and $v$ are elements of $\mathbb{F}$. A divisor $D$ represented by such polynomials is denoted as $D(u, v)$.
Divisor classes of hyperelliptic curves can be represented using different coordinate systems. For each such system, the speed of divisor addition and divisor doubling is different. In this work, projective, new weighted, and recent coordinate representations are considered, and the appropriate inversion-free formulas and the number of finite field operations are given. In hardware applications, an inversion-free formula is preferred because an inverter is very expensive compared to other components. In addition, the INV/MUL ratio depends on the platform. A projective system introduces an additional coordinate called $Z$, as for an elliptic curve, and lets the quintuple $[U_1, U_0, V_1, V_0, Z]$ correspond to the divisor class $[x^2 + (U_1/Z)x + U_0/Z,\ (V_1/Z)x + V_0/Z]$ in Mumford representation. Projective coordinates are generally used for internal computations, but the resultant projective divisor needs one inversion and four multiplications to be converted to its affine form before being transmitted. This idea was first proposed for genus 2 curves in [97] and then largely improved and generalized in [61].

Table 2.3: Different coordinates, projective (P), new weighted (N), and recent (R), with approximate complexity, in even characteristic, for divisor doubling with g = 2
Operation    Complexity (h2 ≠ 0)                  Complexity (h2 = 0)
2N = P       37MUL + 6SQR + 25ADD                 33MUL + 6SQR + 28ADD
2P = P       38MUL + 7SQR + 39ADD                 29MUL + 7SQR + 23ADD
2R = P       29MUL + 8SQR + 21ADD                 24MUL + 7SQR + 18ADD
2N = N       37MUL + 6SQR + 36ADD                 26MUL + 6SQR + 27ADD
2P = N       35MUL + 6SQR + 27ADD                 32MUL + 6SQR + 23ADD
2R = N       39MUL + 6SQR + 33ADD                 35MUL + 5SQR + 29ADD
2N = R       35MUL + 6SQR + 31ADD                 31MUL + 7SQR + 23ADD
2P = R       37MUL + 6SQR + 38ADD                 29MUL + 5SQR + 21ADD
2R = R       37MUL + 5SQR + 36ADD                 23MUL + 8SQR + 9ADD
2A = N       20MUL + 5SQR + 24ADD                 19MUL + 6SQR + 20ADD
2A = P       24MUL + 5SQR + 18ADD                 21MUL + 6SQR + 19ADD
2A = R       30MUL + 8SQR + 32ADD                 23MUL + 7SQR + 29ADD
2A = A       1INV + 20MUL + 5SQR + 34ADD          1INV + 13MUL + 5SQR + 24ADD

Recent coordinates $[U_1, U_0, V_1, V_0, Z, z]$, as defined in Section 14.5 of [22], correspond to $[x^2 + (U_1/Z)x + U_0/Z,\ (V_1/Z^2)x + V_0/Z^2]$ with $z = Z^2$. If new weighted coordinates are used, one more entry than for projective coordinates is needed. Thus, the sextuple $[U_1, U_0, V_1, V_0, Z_1, Z_2]$ corresponds to $[x^2 + (U_1/Z_1^2)x + U_0/Z_1^2,\ (V_1/(Z_1^3Z_2))x + V_0/(Z_1^3Z_2)]$ so that the divisor doubling and divisor addition operations can be efficiently performed. The divisor operations iterate through the binary expansion of $k$ [37]. During each iteration, a divisor doubling is performed. If $k_i = 1$, then a divisor addition is also performed. If $C$ is a genus 2 hyperelliptic curve over $\mathbb{F}_{2^m}$ of almost-prime order, then scalars $k$ in cryptographic applications are generally $2m$ bits in length. On average, half of these bits are equal to one. Therefore, the cost of implementing a divisor multiplication using Algorithm 2.4 is given as $(2m)D_{DBL} + (m)D_{ADD}$, where $D_{DBL}$ and $D_{ADD}$ are the number of cycles for each divisor doubling and divisor addition operation, respectively.
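The cost expression $(2m)D_{DBL} + (m)D_{ADD}$ can also be evaluated in terms of field operations instead of cycles. The sketch below does this for $m = 83$ using the $h_2 \neq 0$ entries for $2P = P$ (Table 2.3) and $A + P = P$ (Table 2.4) as an example of mixed affine-projective arithmetic. The `OpCount` class and function names are hypothetical, and the result ignores the final conversion back to affine coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpCount:
    """Field-operation count of one group operation."""
    mul: int
    sqr: int
    add: int

    def times(self, n: int) -> "OpCount":
        return OpCount(self.mul * n, self.sqr * n, self.add * n)

    def plus(self, other: "OpCount") -> "OpCount":
        return OpCount(self.mul + other.mul, self.sqr + other.sqr, self.add + other.add)


def divisor_mul_cost(m: int, dbl: OpCount, add: OpCount) -> OpCount:
    """Average cost of the binary method: 2m doublings and m additions."""
    return dbl.times(2 * m).plus(add.times(m))


DBL_P = OpCount(38, 7, 39)    # Table 2.3, 2P = P, h2 != 0
ADD_AP = OpCount(39, 3, 26)   # Table 2.4, A + P = P, h2 != 0

total = divisor_mul_cost(83, DBL_P, ADD_AP)
```

Swapping in other rows of the two tables gives a quick way of comparing coordinate systems before committing to a hardware architecture.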
Table 2.4: Different coordinates, projective (P), new weighted (N), and recent (R), with approximate complexity, in even characteristic, for divisor addition with g = 2
Operation    Complexity (h2 ≠ 0)                  Complexity (h2 = 0)
N + P = P    48MUL + 4SQR + 38ADD                 46MUL + 4SQR + 35ADD
P + P = P    49MUL + 4SQR + 41ADD                 45MUL + 5SQR + 30ADD
R + P = P    38MUL + 7SQR + 32ADD                 35MUL + 8SQR + 29ADD
N + N = N    48MUL + 4SQR + 35ADD                 45MUL + 5SQR + 31ADD
P + N = N    47MUL + 4SQR + 39ADD                 45MUL + 4SQR + 38ADD
R + N = N    39MUL + 8SQR + 38ADD                 36MUL + 6SQR + 35ADD
R + R = R    45MUL + 3SQR + 32ADD                 50MUL + 8SQR + 29ADD
N + R = R    37MUL + 6SQR + 29ADD                 35MUL + 7SQR + 30ADD
P + R = R    34MUL + 8SQR + 31ADD                 32MUL + 5SQR + 28ADD
A + N = N    35MUL + 5SQR + 29ADD                 34MUL + 6SQR + 25ADD
A + P = P    39MUL + 3SQR + 26ADD                 39MUL + 3SQR + 23ADD
A + R = R    36MUL + 4SQR + 31ADD                 25MUL + 8SQR + 18ADD
A + A = A    1INV + 21MUL + 3SQR + 30ADD          1INV + 17MUL + 5SQR + 21ADD
2.3.5 Divisor Multiplication Complexities
Estimates for divisor multiplication costs are presented in terms of curve operations (divisor additions and divisor doublings) and the corresponding field operations (multiplication MUL, squaring SQR, addition ADD, and inversion INV). For example, divisor addition and doubling in projective coordinates cost 49MUL + 4SQR + 45ADD and 38MUL + 7SQR + 36ADD, respectively. The cost of conversion back to affine coordinates is 4MUL + 1INV + 15SQR. If mixed coordinates between affine and projective are used, the operation counts for hyperelliptic curve divisor addition and divisor doubling are 39MUL + 4SQR + 32ADD and 38MUL + 7SQR + 41ADD, respectively. The classification of the different coordinates of genus 2 curves in characteristic 2 leads to a significant number of different complexities in the formulas for divisor doubling and addition, as listed in Table 2.3 and Table 2.4, respectively. The results for $h$ of degree 2 and degree 1 are summarized in the tables.
Definition 7 The hyperelliptic curve discrete logarithm problem (HCDLP) is the following: Let $C$ be a hyperelliptic curve over a finite field $\mathbb{F}$. If $D_1, D_2 \in J_C(\mathbb{F})$, determine the smallest integer $k$ such that $D_2 = kD_1$, if such a $k$ exists.
Divisor multiplication results for different coordinates in even characteristic over $\mathbb{F}_{2^{83}}$ are given in Table 2.6. The number of cycles required for performing divisor multiplication is illustrated in Table 2.5. The corresponding number of cycles for performing a divisor addition and a divisor doubling can be used to estimate the cycles required for divisor multiplication.
A central ingredient in cryptosystems based on the HCDLP in an abelian group is an efficient process for computing $kD$ for $D \in J_C$ and for large integers $k$:
$$\underbrace{D + D + \cdots + D}_{k \text{ times}} = kD.$$
This operation is called divisor multiplication, and it dominates the execution time of hyperelliptic cryptosystems. Algorithm 2.4 gives the left-to-right binary method for divisor multiplication. The following chapters discuss algorithms for performing divisor multiplication efficiently in hardware. In conclusion, if the genus, characteristic, and degree of $h(x)$ are restricted to special cases and the correct coordinates are chosen, the best performance is produced for divisor multiplication with HECC.

Algorithm 2.4 Left-to-right binary method for divisor multiplication
Input: $D_{\mathcal{A}} \in J_C$ in affine coordinates, $k = \sum_{i=0}^{l-1} k_i 2^i$
Output: $D_{\mathcal{N},\mathcal{P},\text{or}\,\mathcal{R}} = kD_{\mathcal{A}}$
1: $D_{\mathcal{N},\mathcal{P},\text{or}\,\mathcal{R}} \leftarrow \infty$
2: for $i = l-1$ down to 0 do
3:   $D_{\mathcal{N},\mathcal{P},\text{or}\,\mathcal{R}} \leftarrow 2D_{\mathcal{N},\mathcal{P},\text{or}\,\mathcal{R}}$
4:   if $k_i = 1$ then
5:     $D_{\mathcal{N},\mathcal{P},\text{or}\,\mathcal{R}} \leftarrow D_{\mathcal{N},\mathcal{P},\text{or}\,\mathcal{R}} + D_{\mathcal{A}}$
6:   end if
7: end for
8: return $D_{\mathcal{N},\mathcal{P},\text{or}\,\mathcal{R}} = kD_{\mathcal{A}}$
2.3.6 Hyperelliptic Curve Algorithms
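This section provides a brief description of the algorithms used for adding and doubling divisors on $J_C(\mathbb{F})$. These group operations are performed in two steps. The first is to find a semi-reduced divisor $D' = \mathrm{div}(u', v')$, such that $D' \sim D_1 + D_2 = \mathrm{div}(u_1, v_1) + \mathrm{div}(u_2, v_2)$ in the group $J_C(\mathbb{F})$. In the second step, the semi-reduced divisor $D' = \mathrm{div}(u', v')$ is reduced to an equivalent divisor $D = (u, v)$ [95].

Algorithm 2.5 Cantor's Divisor Addition on the Jacobian of HECC
Input: $D_1 = \mathrm{div}(u_1, v_1)$, $D_2 = \mathrm{div}(u_2, v_2)$, $h, f \in \mathbb{F}[x]$.
Output: $D_3 = \mathrm{div}(u_3, v_3) = D_1 + D_2$
1: $d = \gcd(u_1, u_2, v_1 + v_2 + h) = s_1u_1 + s_2u_2 + s_3(v_1 + v_2 + h)$
2: $u'_0 = u_1u_2d^{-2}$
3: $v'_0 = [s_1u_1v_2 + s_2u_2v_1 + s_3(v_1v_2 + f)]\,d^{-1} \pmod{u'_0}$
4: $k = 0$
5: while $\deg u'_k > g$ do
6:   $k = k + 1$
7:   $u'_k = (f - v'_{k-1}h - (v'_{k-1})^2)/u'_{k-1}$
8:   $v'_k = (-h - v'_{k-1}) \bmod u'_k$
9: end while
10: Output $(u_3 = u'_k,\ v_3 = v'_k)$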
Group operations on hyperelliptic curves can be performed by using Cantor's algorithm for adding and doubling and Gauss's algorithm for reducing divisors. These steps require the use of polynomial arithmetic for polynomials that have coefficients in the definition field. An alternative approach was proposed by Harley [38]. Harley computed the necessary coefficients from the steps of Cantor's algorithm directly in the definition field without the use of polynomial arithmetic and also made use of computational tricks. The advantage of this approach is a faster execution time. Many attempts to improve the HECC group operations have since been reported in the literature. The following subsections include discussion of only the ones most relevant to the contribution presented in this thesis.
2.3.6.1 Cantor's Group Operations on the Jacobian
Until recently, arithmetic transforms in the Jacobian of HECC have been
performed using Cantor's method [16], with modifications introduced by Koblitz
[48]. Methods of addition and doubling for curves of genus 2 were considered
in [61]. Cantor's algorithm is completely general and holds for all genera, all
fields, and all valid inputs. Cantor's divisor addition and doubling on the
Jacobian of HECC are given in Algorithm 2.5 and Algorithm 2.6, respectively.
All of the explicit formulas have been derived from these algorithms.
Nevertheless, this algorithm is too slow, mainly because it employs the
greatest common
Algorithm 2.6 Cantor's Divisor Doubling on the Jacobian of HECC
Input: D1 =div(u1, v1), h, f ∈ F.
Output: D3 =div(u3, v3) = 2D1
1: d = gcd(u1, 2v1 + h) = s1u1 + s3(2v1 + h)
2: u′ = u1^2 d^{−2}
3: v′ = [s1u1v1 + s3(v1^2 + f)] d^{−1} (mod u′)
4: k = 0
5: while deg u′_k > g do











10: Output (u3 = u′_k, v3 = v′_k)
Algorithm 2.7 Harley Divisor Addition on the Jacobian of HECC
Input: D1 =div(u1, v1), D2 =div(u2, v2)




2: s ≡ (v2 − v1)/u1 mod u2
3: z = s u1
4: u′ = (k − s(z + h + 2v1))/u2
5: v′ = −(h + z + v1) mod u′
6: Output (u3 = u′, v3 = v′)
divisor (GCD) algorithms and uses up too much memory for restricted
environments such as smart cards. To obtain efficient formulas for this work,
Cantor's algorithm was used but was restricted to the special case of genus 2
in even characteristic; for example, D1 = D2 was taken in order to produce an
efficient doubling formula. The efficiency of the formulas also depends heavily
on the degree of the polynomial h(x) in the curve equation.
2.3.6.2 Harley's Group Operations on the Jacobian
The first attempt to avoid Cantor's algorithm, in order to achieve a faster
execution time and to derive explicit formulas, was made by Harley [38]. The
author noticed that one can reduce the number of operations by distinguishing
between possible cases according to the properties of the input divisors.
Harley's divisor addition and divisor doubling on the Jacobian of HECC are
given in Algorithm 2.7 and
Algorithm 2.8, respectively. His approach was then generalized to even charac-
Algorithm 2.8 Harley's Divisor Doubling on the Jacobian of HECC
Input: D1 =div(u1, v1)




2: s = k/(h + 2v1) mod u1
3: u′ = s^2 + (k − s(h + 2v1))/u1
4: v′ = −(h + s u1 + v1) mod u′
5: Output (u3 = u′, v3 = v′)
teristics by Lange [56]. Harley's study triggered other research, which eventually
led to improvements and extensions to the algorithm [41, 79, 57, 88, 87, 60].
The work reported in [97, 41, 88] obtained an even greater increase in speed by
using Montgomery's trick to reduce the number of inversions to 1. Lange [61]
generalized the setting to deal with even characteristic as well, by
determining the exact number of operations needed for addition and doubling in
the most common cases. This algorithm drastically reduced the cost of
calculating divisor addition and doubling by specifying the genus of the
curves. Most recently, an approach based on parallelization for fast
implementation has been proposed by Mishra et al. [72]: they parallelized the
affine and the inversion-free formulas of genus 2 curves. The Harley algorithm
has long calculation procedures and involves a large number of intermediate
variables. In spite of this disadvantage it has potential, but its hardware
implementation with respect to area and power has not received enough
attention. The main focus of this work is the efficient hardware implementation
of HECC processors for inversion-free coordinates over a binary field after the
application of several architecture synthesis and optimization techniques. The
design is optimized to






A basic architecture synthesis and optimization problem addressed in this thesis
is based on four assumptions: (i) a circuit is specified through the sequencing
of an explicit formula algorithm, (ii) a set of functional resources is fully
characterized in terms of area and execution delays, (iii) a set of constraints
is given, and (iv) storage is implemented through the use of registers and
multiplexer interconnections. In this thesis, modified algorithms for
performing behavioral optimization are presented. A low area is achieved
through liveness analysis of variables combined with a register allocation
process. Minimum power consumption is achieved via forwarding paths through
operation scheduling, and high performance is achieved by means of efficient
register spilling through storage binding. This work also incorporates area,
power, and computation time tradeoffs through an implementation optimization
process, which maps the optimized datapath and control unit analysis into
highly efficient hardware implementations.
The modied register allocation via variable liveness analysis (RAVLA), oper-
ation scheduling via forwarding paths (OSFPs), and storage binding via ecient
register spilling (SBERS) algorithms are employed to produce ecient explicit
formulas for both elliptic and hyperelliptic curve algorithms and the optimized
hardware implementations. The entire architecture implementation has been
synthesized and simulated to present results as tradeos among area, power,
28
and computation time. The challenge associated with specication optimiza-
tion is to determine the appropriate number of registers, their sizes, and their
connections to the nite eld arithmetic unit (FFAU). Heuristics for allocating
variables into multiple distinct single-ported register les are presented in [53].
These heuristics are applicable only to a straightline without branches, which
is also the case in the explicit formulas for elliptic and hyperelliptic algorithms.
However, the models presented in this section are more applicable for elliptic
and hyperelliptic algorithms with arbitrary forwarding paths, variable liveness
analysis, and conditional spilling between register and memory binding. Further-
more, for all cases of hyperelliptic curve algorithms, it has been demonstrated
that these algorithms are capable of minimizing the number of registers in the
register le, which in turn, decreases the computation time and reduces power
consumption.
3.2 Design Flow of the Behavioral Architecture Synthesis
Synthesis is the automatic mapping from a high-level description to a low-level
description (e.g., gates to transistors, VHDL to gates). Synthesis is important
because it increases designer productivity and allows exploration of the design
space. Depending on the input and the output descriptions, synthesis comes in
two main flavors: logic synthesis and architectural synthesis [83].
Architectural synthesis, also called high-level synthesis, has the primary goal
of translating a description of circuit behavior into an architecture. It
differs from logic synthesis, which starts from a specification of Boolean
equations that describe a number of operations to be mapped into gates, but
which does not specify the exact clock cycle in which each operation executes,
as an architecture synthesis specification does.
With an architecture synthesis method, the design task is divided into the
specification optimization (behavioral) and implementation optimization
(structural) phases. The specification synthesis optimization process consists
of three subphases: register allocation, operation scheduling, and storage
binding. Conventionally, specification synthesis optimization has the goal of
decreasing computational time for a given set of resources; however, it has
become necessary to develop synthesis optimization techniques whose hardware
design also accounts for the minimization of the area and the dissipation of
power. Implementation optimization is the initial stage of a hardware design
flow that interprets an algorithmic representation of a behaviour and creates a
hardware specification that executes the behaviour. More formally, it is the
process of creating a structural micro-architectural representation, or
register transfer level (RTL) description, from a system specification of an
application. A structural micro-architectural representation defines the exact
interconnections among a set of architectural resources. An architectural
resource is a storage element, a functional unit, or interconnect logic. A
storage element provides a method of saving the state of the circuit. A
register is an example of such a storage element. A functional unit performs an
arithmetic or logic operation (e.g., addition, multiplication, squaring).
Interconnect logic is used to route data between the memory and the functional
units. For example, a multiplexer propagates a particular piece of data
according to its input condition. A control unit issues control signals to
direct the resources.
3.2.1 Register Allocation via Variable Liveness Analysis
This section provides a short description of register allocation by means of a
variable liveness analysis algorithm. Before an explanation of the algorithm is
provided, the following definitions are introduced:

Definition 8 A variable is said to be defined in a statement in an explicit
formula algorithm when a value is assigned to it.

Definition 9 A variable is said to be used in a statement in an explicit
formula when its value is referenced in an arithmetic expression.

Definition 10 A variable is said to be live in a statement if it has been
defined earlier and will be used later.

Definition 11 The live range of a variable is the execution range between the
first definition and the last use of the variable.

Definition 12 The def-use chain of a variable is a directed graph that connects
the operations that define a variable to the operations that use this variable.
Note that a variable can have multiple definitions and uses.

Definition 13 The live time of a variable is defined as the time interval from
its definition to its last use. Two variables can share the same register if
and only if they do not have overlapping live times.

Definition 14 Code for an explicit formula in which each statement has no more
than one destination operand and a maximum of two source operands is said to be
three-operand code.
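As an illustration of Definition 14, the following minimal Python sketch
flattens a multi-operation expression into three-operand statements. The
function name to_three_operand, the temporary names T0, T1, ..., and the
mapping of '+' to ADD, '*' to MUL, and a value multiplied by itself to SQR are
assumptions made for this example and are not taken from the thesis.

```python
import ast


def to_three_operand(expr: str):
    """Flatten an arithmetic expression into (dest, src1, op, src2) tuples.

    Hypothetical conventions: '+' is ADD, '*' is MUL, and a * a is SQR.
    """
    statements = []

    def emit(node):
        # A plain name is an input; it needs no statement of its own.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Mult)):
            left = emit(node.left)
            right = emit(node.right)
            if isinstance(node.op, ast.Add):
                op = "ADD"
            elif left == right:
                op = "SQR"
            else:
                op = "MUL"
            dest = f"T{len(statements)}"
            statements.append((dest, left, op, right))
            return dest
        raise ValueError("unsupported expression")

    emit(ast.parse(expr, mode="eval").body)
    return statements


code = to_three_operand("y*y*x + (x + y + z)*y")
for dest, left, op, right in code:
    print(f"{dest} = {left} {op} {right}")
```

For the expression shown, the sketch emits six statements, each with one
destination operand and two source operands.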
Register allocation is the selection of the appropriate number and type of
registers to be used in a given application implementation. The register
allocation process attempts to allocate the temporary variables in a
three-operand code procedure to the hardware registers in the register file.
The temporary variables can be of two types. Variables of the first type have
a value that is used in the same operation statement in which that value is
assigned. These variables are implemented as wires or buses in the final
implementation. Variables of the other type have values assigned in one
statement and used in another. The time between the assignment of the value and
its last usage defines the lifetime of each variable. In the final
implementation, these variables must be mapped to storage units such as
register files and memories. Assigning variables with non-overlapping lifetimes
to the same storage is the task of register allocation, which directly reduces
computational time by converting some of the memory accesses to register
accesses. Register allocation and operation scheduling are tightly
intertwined: one greatly affects the other. They are therefore sometimes
performed in different orders. Operation scheduling is discussed in Subsection
3.2.2. Before the strategy used for operation scheduling is presented, the
model of register allocation must first be defined.
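The sharing rule of Definition 13 can be sketched for straight-line
three-operand code as follows. This is a minimal greedy illustration, not the
RAVLA algorithm of this thesis; the names CODE, live_ranges, and allocate are
hypothetical, and the sketch assumes that a register whose value is read for
the last time in a statement may receive the destination of that same
statement.

```python
# Statements of a straight-line formula as (dest, src1, op, src2) tuples.
CODE = [
    ("T0", "x", "ADD", "y"),
    ("T1", "y", "SQR", "y"),
    ("T2", "T1", "MUL", "x"),
    ("T3", "T0", "ADD", "z"),
    ("T4", "T3", "MUL", "y"),
    ("T5", "T2", "ADD", "T4"),
]


def live_ranges(code):
    """Return {temporary: (definition index, last-use index)}."""
    ranges = {}
    for i, (dest, a, _op, b) in enumerate(code):
        for src in (a, b):
            if src in ranges:  # inputs such as x, y, z are not tracked
                ranges[src] = (ranges[src][0], i)
        ranges[dest] = (i, i)
    return ranges


def allocate(ranges):
    """Greedily map temporaries to registers; non-overlapping ranges share."""
    free_at = []  # free_at[r] = statement index at which register r is free
    assignment = {}
    for var, (start, end) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        for r, free in enumerate(free_at):
            if free <= start:  # previous occupant is no longer needed
                free_at[r] = end
                assignment[var] = r
                break
        else:
            free_at.append(end)
            assignment[var] = len(free_at) - 1
    return assignment


ranges = live_ranges(CODE)
assignment = allocate(ranges)
print(ranges)
print(assignment)
```

With this statement order the six temporaries fit into two registers, because
at most two of them are needed at the same time.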
3.2.1.1 Modeling Register Allocation
The problem of minimizing the number of registers required for executing a set
of arithmetic formulas has already been studied [72], and the problem is known
to be NP-complete. According to further studies of NP-optimization problems,
there is an O(log^2 m) approximation algorithm for this problem, where m is the
number of operations [72]. In our work, the main emphasis is on exposing a
variable liveness analysis among different types of variables in a certain
explicit formula. In other words, while ECP or HECP are commonly programmed
based on the definition of which operations to execute combined with
information about the source of the operands and the destinations of the
results, variable liveness analysis is programmed based on the used (U) and the
defined (D) sets of the intermediate variables. The name register allocation
via variable liveness analysis comes from the way register allocation is
executed, which is explained in the rest of this subsection.
Our motivation is to determine the minimum number of registers needed to
store an unbounded number of variables in a certain explicit formula of HECC
algorithms based on classifying three types of variables: short, long, and very
long lived variables. Achieving the primary goal of this work is thus dependent
on determining the answer to the following question: Given an explicit formula
E passed through the register allocation via variable liveness analysis model,
what is the minimum number of registers R required in order to store the
intermediate variables necessary for executing E sequentially?
Let I = {Ii}, 1 ≤ i ≤ n, be the input to E. Assume that E is a sequence of
arithmetic operations, each with a unique statement S, where
Si : Ti = Pi Oi Ri, 0 ≤ i ≤ j, Oi is one of the finite field operations
{MUL, SQR, ADD}, and the operands Pi and Ri are inputs or previously computed
values. In the literature, an explicit formula usually occurs in a multiple
operation format. The multiple operation can be converted into a three-operand
form Ti through the use of a simple parser algorithm. A sequence
S = {S0, S1, ..., Sj} of E will be called a valid sequence if E can be computed
through the execution of its three-operand forms Ti in the order dictated by
the sequence S. For example, if E = {S0, S1, S2, S3, S4, S5}, where

S0 : T0 = x ADD y;     S3 : T3 = T0 ADD z;
S1 : T1 = y SQR y;     S4 : T4 = T3 MUL y;
S2 : T2 = T1 MUL x;    S5 : T5 = T2 ADD T4.                              (3.1)

then the sequences that are valid with respect to the data dependencies
include:

{T1, T2, T0, T3, T4, T5}, {T0, T3, T1, T2, T4, T5},
{T0, T3, T4, T1, T2, T5}, {T1, T0, T3, T2, T4, T5}.                      (3.2)

The explicit formula E = {S0, S1, S2, S3, S4, S5} may be executed only in an
order that respects these dependencies: T1 before T2, T0 before T3 before T4,
and T5 last. For the purposes of the work presented in this thesis, the only
knowledge of interest is the determination of which valid sequence needs the
minimum number of registers when variable liveness analysis is used for
defining the different types of variables in the explicit formula E. In E,
specific computations can be performed directly from the set of inputs I to E,
e.g., T0 and T1 in the previous sequence. Executing one or more of these
computations produces intermediate values that can trigger further operations
of E, e.g., T2, T3, T4, and T5 in the previous example. If V is the set of
computations in E that can be computed directly from the set I of inputs to E,
and |V| = α is the size of the set V, the execution of E can begin from any one
of these α operations. Among these operations are three kinds of intermediate
variables: short, long, and very long lived variables. An example of a short
lived variable is T1, which immediately triggers the computation operation that
follows it and is not needed in other subsequent operations. On the other
hand, T2 is a long lived variable, which must be held for several computation
steps until it can be used by the computation operation T5.
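The valid sequences of example (3.1) can be enumerated by brute force. The
sketch below, with the hypothetical names FORMULA, is_valid, and max_live,
treats a sequence as valid when every operand that is a temporary has been
computed earlier, and it reports the peak number of temporaries that must be
held between statements. Under this reading, which considers data dependencies
only, the enumeration finds ten valid sequences, and the four listed in (3.2)
are among them.

```python
from itertools import permutations

# Example (3.1): temporary -> (operator, source operands).
FORMULA = {
    "T0": ("ADD", ("x", "y")),
    "T1": ("SQR", ("y", "y")),
    "T2": ("MUL", ("T1", "x")),
    "T3": ("ADD", ("T0", "z")),
    "T4": ("MUL", ("T3", "y")),
    "T5": ("ADD", ("T2", "T4")),
}


def is_valid(order):
    """True if every temporary operand is computed before it is used."""
    done = set()
    for t in order:
        if any(s in FORMULA and s not in done for s in FORMULA[t][1]):
            return False
        done.add(t)
    return True


def max_live(order):
    """Peak number of temporaries held between two consecutive statements."""
    pos = {t: i for i, t in enumerate(order)}
    last_use = {
        t: max((pos[u] for u in order if t in FORMULA[u][1]), default=pos[t])
        for t in order
    }
    return max(
        sum(1 for t in order if pos[t] <= i < last_use[t])
        for i in range(len(order))
    )


valid = [o for o in permutations(FORMULA) if is_valid(o)]
print(len(valid))
for order in valid:
    print(order, max_live(order))
```

For this small example every valid order peaks at two live temporaries, so the
choice of order affects the register count only for larger formulas.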
An example is provided by a computation flow graph, as shown in Figure 3.1,
of the following type: there is a root state, the number of states at the first
level is α, and each of the first level states corresponds to one of the
operations in V. The sub-graph rooted at a first level state corresponds to the
computation flow graph for completing the remainder of the explicit formula
after the operation corresponding to this state has been executed. The leaf
states of the computation flow graph correspond to the last operation in the
particular sequence of operations. Clearly, all possible valid computation
sequences for completing E are described by all possible paths from the root to
the leaf states. A variable T1 is live at a statement S0 if there is a path in
the flow graph from this statement to a statement S2 such that T1 ∈ I[S2] and
for each S0 ≤ S < S2 : T1 ∉ I[S]. Liveness analysis focuses on analyzing a flow
graph in order to determine which variable places are or are not live. For each
flow graph state, two sets can be derived: 1) Live-in: I[S] gives all variables
T that are live before the execution of statement S; 2) Live-out: U[S] gives
all variables T that are live after the execution of statement S.
The basic methodology of the work presented in Figure 3.1 is essentially
based on a depth search for a valid sequence that requires a minimum number
of registers in order to complete E and also for a valid execution sequence
that yields a maximum number of short lived variables for the efficient use of
the forwarding paths later through the operation scheduling model. To reduce
Figure 3.1: Computation flow graph
1. Neglecting the paths that require the same number of long variables: As the
first valid sequence is obtained, the number of long variables required in
order to execute E according to that sequence is counted and stored in R.
During the process of looking for another path, the size of the set of long
variables is checked after each step. If the current size is equal to the
value stored in R, then no further progress along this path is required,
and another path is tried. If a valid sequence requires fewer than R long
variables, then the value of R is replaced with this new value.
2. Computing both the input and output states at each statement: The
computation of the number of variables in each input or output set of each
statement of the sequence relies on the following properties: 1) If T is used
in statement S, then T ∈ I [S] . 2) If T is live after the execution of S and
T /∈ U [S] , then T is live before the execution of S. 3) In each statement S,
the formal definition of the liveness of variable T is T ∈ I [S] O T ∈ U [S]⇒
T ∈ U [S].
3. Avoiding register writing for short lived variables: Storing the short
lived variables at each step of the sequence is an area-consuming operation.
Each long variable stored in E needs a register R ∈ R in which to be stored.
The area can be reduced if the values of short variables are not stored in the
sequence, but the register writing is bypassed through the use of forwarding
paths. The number of registers needed can therefore not be more than R.
4. Minimizing the number of registers by coalescing: If no conflict exists
between the liveness of two long variables in the same sequence for two
different statements, the same register can be used for both long variables,
and coalescing can then occur in different situations as long as there is no
interference.
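The live-in and live-out sets referred to in item 2 can be computed for
straight-line three-operand code with a single backward pass. The sketch below
uses the standard dataflow equation live_in[S] = use[S] ∪ (live_out[S] − def[S]);
that this is how the sets I[S] and U[S] are obtained is an assumption, and the
function name liveness and the statement encoding are hypothetical.

```python
# Example (3.1) as (dest, src1, op, src2) tuples, in the order S0..S5.
CODE = [
    ("T0", "x", "ADD", "y"),
    ("T1", "y", "SQR", "y"),
    ("T2", "T1", "MUL", "x"),
    ("T3", "T0", "ADD", "z"),
    ("T4", "T3", "MUL", "y"),
    ("T5", "T2", "ADD", "T4"),
]


def liveness(code, outputs):
    """Return (live_in, live_out): one set of live names per statement.

    outputs is the set of names that are still needed after the last statement.
    """
    live = set(outputs)
    live_in = [set() for _ in code]
    live_out = [set() for _ in code]
    for i in range(len(code) - 1, -1, -1):
        dest, a, _op, b = code[i]
        live_out[i] = set(live)
        # The destination is defined here, so it is not live before the
        # statement; the source operands are used, so they are.
        live = (live - {dest}) | {a, b}
        live_in[i] = set(live)
    return live_in, live_out


live_in, live_out = liveness(CODE, outputs={"T5"})
for i, stmt in enumerate(CODE):
    print(i, stmt, sorted(live_in[i]), sorted(live_out[i]))
```

In this sketch the inputs x, y, and z appear in the sets as well; filtering
them out leaves only the temporaries that compete for registers.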
The following chapters demonstrate that the implementation of these optimiza-
[Flowchart steps: three-operand code statement; get LAST_USED for the
destination operand variable; test LAST_USED ≤ 1; get the number of USED_STATE
for each source operand variable; GET_DEST?; get LAST_USED for each source
operand variable; update ASSIGN_REG to store the destination value; next
statement.]
Figure 3.2: The procedure for register allocation via variable liveness analysis
3.2.2 Operation Scheduling via Forwarding Paths
Scheduling determines the temporal order of operations. Given a set of explicit
formulas with execution delays and a partial ordering, the scheduling model
assigns a start time to each operation. The start times must follow the
precedence constraints as specified in the system specifications. Additional
restrictions, such as timing and area constraints, may be added to the problem,
depending on the target architecture. Scheduling affects resource allocation
and vice versa. Therefore, the ordering of these two tasks is sometimes
interchangeable; some synthesis tools perform scheduling and then resource
allocation, while others allocate the resources first and then schedule the
operations. For the work presented in this thesis, the order followed is to
perform the register allocation via variable liveness analysis (RAVLA) and
then the operation scheduling via forwarding paths (OSFPs).
Schedulers fall into two broad families: unconstrained and constrained [69].
Because the number of clock cycles required varies among the finite field
arithmetic operations, the focus of this study was on constrained scheduling in
order to minimize the number of registers without limiting timing. Once the
number of live registers is counted and the register allocation is computed,
the operations are scheduled. This option is the simplest solution and produces
good results. A schedule that minimizes both time and area must be chosen.
Scheduling, which assigns operations to clock cycles, involves two fundamental
steps: register constrained and timing constrained. Given an explicit formula
for a specific elliptic or hyperelliptic curve algorithm, the clock cycle time
for each finite field arithmetic operation, the resource count, and the
resource delays, the register constrained step finds the minimum number of
clock cycles needed to execute the explicit formulas. The timing constrained
step attempts to determine the minimum number of registers required in order to
schedule the operations. However, other multivariate objective functions are
also possible, including power, energy, and area.
3.2.2.1 The Proposed Scheduling Model
In this work, the main emphasis is to expose a forwarding path between the
functional units (e.g., the finite field squarer, multiplier, and adder) and
the register file. In other words, while HECP are commonly programmed based on
the definition of which operations to execute combined with information about
the source of the operands and the destinations of the results, the forwarding
path is programmed based on the definition of the transports of the operands
and the results. The name of the forwarding path comes from the way operations
are executed: the operand data can be read from the output ports of one
functional unit and transported directly to the source of a second functional
unit. Using a forwarding path enables specific optimizations, such as dead
result read elimination, which can lead to reduced register file pressure.
This subsection presents a formal model for operation scheduling via forwarding
paths. Given an explicit formula E, a set of operations O = {o0, o1, ..., on},
and a collection of computational FFAU units {MUL, SQR, ADD} denoted by
rj = {r1, r2, r3}, respectively, each computational unit rj(Aj, Cj, Okj) has an
area Aj and a computational time Cj, and can execute only the operations Okj,
where Okj ⊂ O and ∪j Okj = O. Furthermore, each operation in the explicit
formula E must be executable on at least one type of FFAU unit and must use at
most three registers. A group of registers is defined as
R = {R0, R1, R2, ..., Rw}. Each register is used for storing the source and
destination variables of an intermediate operation. Having each operation
uniquely associated with one computational unit is called homogeneous
scheduling. If an operation can be performed by more than one computational
unit, the process is called heterogeneous scheduling [83]. An FFAU is an
example of a computational unit that can perform only one type of operation, so
homogeneous scheduling is implied. A further assumption is that the
computational cycle delays associated with each operation in different types
of computational units have different values, i.e., Cj is a function of the
computational unit j. The final assumption is that the execution of the
operations is non-preemptive; i.e., once the execution of an operation begins,
it finishes without interruption.
The solution to the scheduling problem is a vector of statements for each
explicit formula, denoted as S = {S0, S1, ..., Sk}. The end time of each
operation is also defined for each statement, E = {e0, e1, ..., ek}, reached
after a specific delay time C = {C0, C1, ..., Ck}. The scheduling problem is
formally defined as the minimization of R, which is the number of registers,
with respect to the following conditions:
1. An operation can start only when its predecessors have finished.
2. At any given cycle t, the number of computational units working does not
exceed the number available in rj = {r1, r2, r3}.
3. The register R = {R0, R1, ..., Rw} read operation is performed only if the
value from the register will be used.
4. The scheduling statements are constructed so that the computational unit
reads the operands from the bypasses, rather than from the registers.
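Conditions 1 and 2, together with the non-preemptive assumption, can be
illustrated with a simple in-order list scheduler. The sketch below is not the
scheduling algorithm of this thesis; the function name list_schedule, the
cycle delays in DELAY, and the single unit per operation type in UNITS are
hypothetical values chosen for illustration.

```python
# Example (3.1) as (dest, src1, op, src2) tuples.
CODE = [
    ("T0", "x", "ADD", "y"),
    ("T1", "y", "SQR", "y"),
    ("T2", "T1", "MUL", "x"),
    ("T3", "T0", "ADD", "z"),
    ("T4", "T3", "MUL", "y"),
    ("T5", "T2", "ADD", "T4"),
]
DELAY = {"MUL": 4, "SQR": 1, "ADD": 1}  # hypothetical cycle counts
UNITS = {"MUL": 1, "SQR": 1, "ADD": 1}  # hypothetical unit counts


def list_schedule(code, delay, units):
    """Assign (start, end) cycles to each statement, in the given order."""
    finish = {}  # temporary -> cycle at which its value becomes available
    busy_until = {op: [0] * n for op, n in units.items()}
    schedule = []
    for dest, a, op, b in code:
        # Condition 1: wait until both operands have been produced.
        ready = max(finish.get(a, 0), finish.get(b, 0))
        # Condition 2: wait for a free unit of the required type.
        unit = min(range(units[op]), key=lambda u: busy_until[op][u])
        start = max(ready, busy_until[op][unit])
        end = start + delay[op]  # non-preemptive: runs to completion
        busy_until[op][unit] = end
        finish[dest] = end
        schedule.append((dest, start, end))
    return schedule


schedule = list_schedule(CODE, DELAY, UNITS)
for entry in schedule:
    print(entry)
```

With these assumed delays the two multiplications are serialized on the single
multiplier, which determines the overall length of the schedule.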
The following subsections explain how significant additional reductions in the
register file area can be obtained based on scheduling operations that transfer
the operands via bypasses rather than reading from the register file.
Bypass-aware operation scheduling heuristics that vary in time complexity have
been developed, and their effectiveness in reducing register file area has been
studied. Operation scheduling is a behavioural view of architecture synthesis
in which the operations performed in each clock cycle are specified by means of
explicit modeling of the units in the datapath. Since the registers that form
the register file are read only when the source operand value is not in the
bypasses, such scheduling can reduce register usage and, consequently,
register file power consumption. Bypass-aware operation is achieved in hardware
by using a multiplexer. The form of multiplexer used has each data input paired
with an enable. Only one enable input is asserted at any time. The data at the
data input whose associated enable is asserted appears at the multiplexer's
output.
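The behaviour of such an enable-per-input multiplexer can be modelled in a few
lines. The following is only a behavioural sketch in Python, not a hardware
description; the function name one_hot_mux and the source labels are
hypothetical.

```python
def one_hot_mux(data, enables):
    """Return the data input whose enable is asserted.

    Each data input is paired with one enable, and exactly one enable must be
    asserted at any time.
    """
    if len(data) != len(enables):
        raise ValueError("each data input needs its own enable")
    if sum(1 for e in enables if e) != 1:
        raise ValueError("exactly one enable must be asserted")
    return data[[bool(e) for e in enables].index(True)]


# Hypothetical operand sources: the register file and two bypasses.
sources = ["regfile", "mul_bypass", "add_bypass"]
selected = one_hot_mux(sources, [False, True, False])
print(selected)
```

Rejecting inputs with zero or several asserted enables makes the one-hot
requirement explicit rather than silently picking a default.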
Figure 3.3 shows the procedure for determining exactly when an instruction
bypasses the results, which source operands can read them, and when and where
the result is written back into the register file. In architecture designs
that have complete bypasses, two functions, LAST_USED and FP_USED, are
required for each operation: LAST_USED is the number of cycles between issuing
the statement and the last statement that uses its destination operand
variable, and FP_USED indicates that the result is written to the bypasses. In
completely bypassed scheduling, the destination of an operation statement is
available to every source operand as long as LAST_USED = 1, and when
1 < LAST_USED < j/2, where j is the total number of statements, the destination
result is available from the register file until it is overwritten. A cost
function, ASSIGN_FP, was used as a means of providing the number of operands of
statement S that are read from bypasses. Another cost function incorporated
into the new algorithms is WRITE_REG, which simply computes the number of
register accesses used through a specific scheduling step. The function
Get_OPR is used to determine the statement operator: either MUL for finite
field multiplication or ADD/SQR for finite field addition/squaring. The reason
is that whenever the operator is MUL, the algorithm must check the FP_USED
function in order to avoid forwarding paths between two finite field
multiplication operations. Because this model was developed for later use in
elliptic and hyperelliptic curve algorithms, Get_OPR takes only three values,
MUL, SQR, and ADD, for finite field multiplication, squaring, and addition,
respectively. The function USED_STAT is used to determine the number of
statements that use a certain defined variable in order to decide whether it
will be assigned to forwarding paths or not. As long as USED_STAT = 1, which
means a short lived variable, the destination is assigned to a forwarding path.
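The decision described above can be sketched as follows. This is one possible
reading of the procedure, not the thesis implementation: LAST_USED is taken as
the distance in statements to the last user, USED_STAT as the number of using
statements, a result is forwarded when both equal 1 unless producer and
consumer are both MUL, and every other result (including the final one) counts
as a register write. The function name schedule_forwarding is hypothetical.

```python
# Example (3.1) in the order S0..S5, as (dest, src1, op, src2) tuples.
CODE = [
    ("T0", "x", "ADD", "y"),
    ("T1", "y", "SQR", "y"),
    ("T2", "T1", "MUL", "x"),
    ("T3", "T0", "ADD", "z"),
    ("T4", "T3", "MUL", "y"),
    ("T5", "T2", "ADD", "T4"),
]


def schedule_forwarding(code):
    """Return ([(dest, 'forward' | 'register')], number of register writes)."""
    decisions = []
    write_reg = 0
    for i, (dest, _a, op, _b) in enumerate(code):
        users = [j for j in range(i + 1, len(code))
                 if dest in (code[j][1], code[j][3])]
        last_used = users[-1] - i if users else 0
        used_stat = len(users)
        consumer_op = code[users[0]][2] if users else None
        forward = (last_used == 1 and used_stat == 1
                   and not (op == "MUL" and consumer_op == "MUL"))
        if not forward:
            write_reg += 1
        decisions.append((dest, "forward" if forward else "register"))
    return decisions, write_reg


decisions, write_reg = schedule_forwarding(CODE)
print(decisions, write_reg)

# The valid sequence {T1, T2, T0, T3, T4, T5} needs one register write fewer.
reordered = [CODE[1], CODE[2], CODE[0], CODE[3], CODE[4], CODE[5]]
decisions2, write_reg2 = schedule_forwarding(reordered)
print(decisions2, write_reg2)
```

Comparing the two orders shows why the choice of valid sequence matters: moving
the definition of T0 next to its only use turns it into a forwarded value.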
3.2.3 Storage Binding via Efficient Register Spilling
This subsection describes the concept of storage binding, which is a key
process in architecture synthesis. This thesis modifies an approach that uses
an efficient method for mapping variables into register files and memory. The
storage binding process binds the variables to the storage units, such as the
register file and memory. A recent trend is for designers to use register files
rather than isolated registers in the storage binding. Approaches to storage
binding using isolated registers are too complicated and time consuming and are
not feasible for large-scale designs. In this work, rather than relying on
isolated registers, the implementations presented in the following chapters are
based on register files and memories as storage units because register files
and memories can be more structured, modular, and dense and because their
regular layout structure requires less chip area in application-specific
integrated circuit (ASIC) designs and fewer resources in field programmable
gate array (FPGA) designs.
The methodology of the architecture synthesis explicitly models the storage
binding via efficient register spilling, which is the assignment of each
variable to a specific hardware component: an explicit mapping between
registers and memory. The goal of storage binding via efficient register
spilling (SBERS) is to minimize the area by allowing variables of a certain
type, which has been determined through variable liveness analysis, to be
stored in memory instead of in registers. The efficient operation scheduling
via forwarding paths limits the amount of register spilling that is possible.
For low power storage binding that exploits optimal processor structure, the
[Flowchart steps: three-operand code statement; get LAST_USED for the
destination operand variable; test LAST_USED ≤ 1; get FP_USED for forwarding
used by the last statement; GET_OPR?; get LAST_USED for the source operand
variable; update WRITE_REG to store the destination value; next statement.]
Figure 3.3: The procedure for bypass-aware operation scheduling via forwarding path
reduction achieved. This section focuses on the reduction of the switching
activity in high-level synthesis, especially in the problem of very long lived
variable binding. An effective register spilling algorithm has been developed
based on the register allocation via variable liveness analysis method so as to
reduce switching activity. The new model is based on the determination of the
very long lived variable type, which satisfies the conditions
LAST_USED ≥ j/2 and USED_STATE ≤ 2 and which has been defined through RAVLA of
the original explicit formula. As a result, it produces optimal or
close-to-optimal variable binding solutions with faster computing times and a
minimum number of registers.
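The three variable types can be separated with the thresholds quoted in the
text. The sketch below is an illustration under stated assumptions: LAST_USED
is measured in statements, a variable with LAST_USED ≥ j/2 but more than two
users is treated as long, the final result is labelled "output", and the
eight-statement formula and the function name classify are hypothetical.

```python
# A hypothetical straight-line formula as (dest, src1, op, src2) tuples.
CODE = [
    ("T0", "x", "MUL", "y"),
    ("T1", "y", "SQR", "y"),
    ("T2", "T1", "ADD", "z"),
    ("T3", "x", "ADD", "z"),
    ("T4", "T2", "MUL", "T3"),
    ("T5", "T4", "SQR", "T4"),
    ("T6", "T5", "ADD", "T0"),
    ("T7", "T6", "MUL", "z"),
]


def classify(code):
    """Label each destination as short, long, very long, or output."""
    j = len(code)
    kinds = {}
    for i, (dest, _a, _op, _b) in enumerate(code):
        users = [k for k in range(i + 1, j)
                 if dest in (code[k][1], code[k][3])]
        if not users:
            kinds[dest] = "output"  # never read again inside the formula
            continue
        last_used = users[-1] - i
        used_state = len(users)
        if last_used <= 1:
            kinds[dest] = "short"       # candidate for a forwarding path
        elif last_used >= j / 2 and used_state <= 2:
            kinds[dest] = "very long"   # candidate for spilling to memory
        else:
            kinds[dest] = "long"        # kept in the register file
    return kinds


kinds = classify(CODE)
print(kinds)
```

Here T0 is held for six statements before its single use, so it is the only
spilling candidate, while T2 stays in a register for two statements.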
Based on the modeling solutions presented in the previous sections and the
functional unit constraints, three novel algorithms have been developed for the
modified explicit formula, which will be generated in later chapters for both
ECC and HECC operation algorithms: (i) a combined register allocation algorithm
with variable liveness analysis, which determines the minimum number
of registers needed and defines the three types of variables: short, long, and very
long lived variables; (ii) a forward datapath scheduling algorithm, which takes
advantage of operation slack in order to accommodate the write-port restriction
of the register files; and (iii) a simultaneous binding algorithm, which determines
the assignment of operations to functional units, with a focus on optimizing
multiplexers. The scheduling limits the possible resource bindings. For example,
operations that are scheduled at the same time cannot share the same resource.
To be more precise, any two operations can be bound to the same resource only
if they are not executed concurrently, i.e., are not scheduled in overlapping time
steps. Some resources are capable of executing different operations; e.g., both an
addition and a subtraction can be bound to the same arithmetic logic unit. The
storage binding can have a significant effect on the area and latency of the circuit
because it dictates the amount of interconnection logic and the number of storage
elements in the circuit. The more hardware that is allocated to the architecture, the more
operations can be performed in parallel. However, this consumes more area.
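The sharing rule stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the binding algorithm developed in this thesis: the operation names, the example schedule, and the greedy first-fit policy are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical operation record: (name, resource type, first step, last step).
Op = Tuple[str, str, int, int]


def overlaps(a: Op, b: Op) -> bool:
    """Return True if the two operations occupy at least one common time step."""
    return a[2] <= b[3] and b[2] <= a[3]


def bind(ops: List[Op]) -> Dict[str, str]:
    """Greedy first-fit binding: an operation reuses an existing unit of its
    type only if it overlaps none of the operations already bound to it."""
    units: Dict[str, List[List[Op]]] = {}  # resource type -> ops per unit instance
    binding: Dict[str, str] = {}
    for op in sorted(ops, key=lambda o: o[2]):
        instances = units.setdefault(op[1], [])
        for idx, bound in enumerate(instances):
            if not any(overlaps(op, other) for other in bound):
                bound.append(op)
                binding[op[0]] = f"{op[1]}{idx}"
                break
        else:
            # Every existing unit is busy in some overlapping step: allocate one more.
            instances.append([op])
            binding[op[0]] = f"{op[1]}{len(instances) - 1}"
    return binding


# Assumed example schedule (not taken from the thesis).
schedule = [
    ("m1", "MUL", 1, 3),
    ("a1", "ADD", 1, 1),
    ("a2", "ADD", 2, 2),
    ("m2", "MUL", 2, 4),
    ("m3", "MUL", 4, 6),
]
result = bind(schedule)
```

In this example the two additions share one adder because they run in different steps, while m2 overlaps m1 and therefore forces a second multiplier, which is the area cost of the extra parallelism.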
3.3 Optimization Techniques for Architecture Synthesis
3.3.1 Datapath Analysis
The architecture synthesis model can be described as a finite state machine
with datapath (FSMD). An FSMD is often referred to as a register-transfer level
(RTL) description; it is composed of a finite state machine that controls the design flow
and a datapath that performs operations. The models considered here abstract
the information represented by VHDL. Abstract models of behaviour at the
architectural level are expressed in terms of operations and their dependencies,
which arise for several reasons, the first of which is the availability of data: when
an input to an operation is the result of another operation, the former
depends on the latter. A second reason is the serialization constraints in the
specification: a task may have to follow a second one regardless of data dependency.
This type of model implicitly assumes the existence of variables whose
values store the information that is required and generated by the operations.
Each variable has a lifetime, which is the interval from its birth to its death: the
former is the time at which the value is generated as an output of an operation,
and the latter is the latest time at which the variable is referenced as an input
to an operation. The model assumes that the values are preserved in registers
during their lifetime.
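The birth and death times defined above can be computed directly from a scheduled three-operand code. The sketch below is a simplified illustration, not the RAVLA procedure itself: the four-step code is hypothetical, every variable is assumed to be written exactly once, and primary inputs (a, b, c) are not tracked.

```python
def lifetimes(code):
    """Map each produced variable to (birth, death).

    `code` is a list of (step, dest, op, src1, src2) tuples sorted by step.
    Birth is the step that writes the variable; death is the last step that
    reads it (equal to the birth step if it is never read).
    """
    birth, death = {}, {}
    for step, dest, _op, *srcs in code:
        for src in srcs:
            if src in birth:          # ignore primary inputs and None
                death[src] = step
        birth[dest] = step
        death[dest] = step
    return {var: (birth[var], death[var]) for var in birth}


def max_live(lt):
    """Largest number of values that must be held across a step boundary."""
    first = min(b for b, _ in lt.values())
    last = max(d for _, d in lt.values())
    return max(
        sum(1 for b, d in lt.values() if b <= s < d)
        for s in range(first, last + 1)
    )


# Hypothetical scheduled code, one operation per time step.
code = [
    (1, "t0", "ADD", "a", "b"),
    (2, "t1", "SQR", "c", None),
    (3, "t2", "MUL", "t1", "t0"),
    (4, "t3", "ADD", "t2", "t0"),
]
lt = lifetimes(code)
registers_needed = max_live(lt)
```

Here t0 lives from step 1 to step 4 while t1 and t2 live for a single step each, so two registers suffice for the intermediate values of this particular schedule.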
3.3.2 Control Unit Analysis
This section explains the most difficult part of architectural synthesis, which
establishes that a control unit specification has a crucial concurrency property:
liveness. Liveness is defined as a condition that must be satisfied at some point
during the interaction between the control unit and the computational unit,
which is a block diagram that defines a set of shared resources. A control unit
specification is expressed in terms of the liveness properties that the control
unit must satisfy. Liveness contrasts with a dependency property, which is
a condition that must hold continuously throughout the execution of the explicit
formula. Employing liveness in combination with a dependency property approach
to synthesis is the basis for a model of a concurrent algorithm in which operations
are executed through the accessing and modification of shared resources.
An operation is designated as a three-operand code that defines an active computation.
An operation computation generally requires the allocation of shared
resources.
3.3.2.1 Controllers via Resource Liveness Analysis
The liveness analysis performed in this work is built around the concept of the
busy state of the resources encapsulated in the computational unit. The analysis
is formulated in terms of a set of variables and a set of operations that can access
and modify those variables. Liveness analysis of the resource controllers is used
to determine the set of clauses that define the state transitions performed by each
operation. The state transitions provide the liveness conditions that govern the
execution of the explicit formula statements. In the following chapters, we
use both datapath and control unit analysis to improve the design process
for hardware implementation of cryptographic algorithms.
3.4 Architecture Optimization Tradeoffs
Architectural optimization comprises architectural synthesis, which includes register
allocation, operation scheduling, and resource binding. Complete architectural
optimization is applicable to circuits that can be modeled by sequencing
(or equivalent) graphs without a start time or binding annotation. The goal of
architectural optimization thus consists of the determination of a register allocation
γ, an operation scheduling ϕ, and a resource binding β that optimize the
objectives (area, time, and power). The optimization of these multiple objectives
can be reduced to the evaluation (or estimation) of the corresponding objective
functions. An architectural optimization problem is often solved through the
exploration of the area/power tradeoff for different numbers of clock cycles. This
approach is based on the fact that the time may be constrained to
one specific value, or to multiple values in an interval, because of system design
considerations. The area/power tradeoff can then be optimized by solving
the appropriate register allocation problems. Other important approaches
are the search for the time/power tradeoff for some of the resource bindings,
and the area/time tradeoff for some of the operation schedules. These tradeoffs
are important for finding a partial optimization solution when scheduling is
performed after allocation or binding, and vice versa. Unfortunately, the
area/power tradeoff for some values of time, as well as the time/power tradeoff
for some values of area, are complex problems to solve, because several allocations
correspond to a given area and several schedules to a specific time value.
3.4.1 Area/Power Optimization
For a specic execution time, the area depends on register usage. The regis-
ter allocation problem provides the framework for determining the area/power
tradeo. The minimum power register allocation can be determined through
the solving of register liveness analysis problems for dierent values of the area
constraints. The ideal tradeo curve in the area/power plane monotonically in-
creases in one parameter as a function of the other. The problem becomes more
complicated when it must be solved for a circuit with a more complex depen-
dency on operation scheduling. If it is rst assumed that the control unit and
the wiring have a negligible impact on area and power, then only registers and
multiplexers must be taken into account jointly with the computation logic of
the operations. Area and power can be determined through register allocation
and operation scheduling because the number of registers allocated depends on
the dependency of the operations. For a xed operation scheduling, the num-
ber of registers and multiplexers depends on the register allocation. Their area
must be added to the computation logic areas when the power is determined.
The register liveness analysis problem may thus aect the register allocation,
which can result in a smaller area as well as minimum power. These corrections
show the circular dependency of operation scheduling, register allocation, and
the estimation of area and power.
In practice, CAD tools for architectural optimization perform either operation
scheduling followed by register allocation or vice versa. The area and the
power are estimated prior to the completion of these tasks. Performing scheduling
before allocation permits characterization of the multiplexers and a more
precise evaluation of the area. No dependency constraints are required for the
allocation because the operation dependency is determined by the scheduling.
In this case, operation scheduling requires that no pair of operations with a
dependency execute concurrently. This approach best fits the synthesis of those
ASIC circuits that are control dominated and in which the register parameters
can be comparable to those of some application specific processors. In the
implementation developed in this study, after operation scheduling is performed,
the number of clock cycles is fixed before the commencement of the register
allocation task. The quality metrics are therefore not the minimum amounts of
computational time. The following metrics can be used in order to compare the
area/power tradeoffs: the number of clock cycles and the minimum period of
usage in one clock cycle.
3.4.2 Power/Time Optimization
In the architectural optimization stage, storage binding has a significant impact
on the time constraints between registers. Different storage binding solutions
lead to different smallest feasible clock periods, which results in the minimum
power consumption. In the new implementation, after the register allocation
task, the number of registers in the register file is fixed before the storage binding
task is begun. The following metrics can therefore be used to compare power and
time: the number of registers in the register file and the size of the multiplexer
used in the forwarding paths.
Designs for optimizing power have been an active research area due to the
demand for constrained applications. Power reduction techniques have been proposed
at all levels of the design hierarchy, from algorithmic and architectural to
circuit and technological innovations. It has been shown that for digital hardware
design, switching activities represent more than 90 % of the total power
consumption [17]. Reducing switching activities is hence a major target in power
reduction studies. Pipelining has been used to reduce the critical path computation
time and also to reduce switching activities, and hence power consumption,
in digital hardware design. However, the large number of pipelining latches adds
a considerable amount of area overhead. Rather than using pipelining, in our
new implementation design, power reduction is achieved through developing
the RAVLA and OSFPS models; through increasing the concurrency of the internal
finite field operations at the field arithmetic level, and at either the point or
the divisor arithmetic level for elliptic and hyperelliptic curves, respectively; and
through rearranging the scalar arithmetic topology from a sequential type
to a parallel and forwarding type. Parallel and forwarding type architectures
are ideal for low power designs due to their balanced structure, which not only
reduces the computation time but also minimizes the switching activities and,
hence, the total energy consumption per operation. The two design types shown


















Figure 3.4: Sequential-type (a) versus parallel and forwarding-type (b) architectures
finite field arithmetic as follows:
$$U'_0 \leftarrow S_1 + U_1, \quad S_0^2 \leftarrow S_0, \quad U'_1 \leftarrow S_0^2 \, U'_0 \qquad (3.3)$$
where $S_1$, $U_1$, $U'_0$, $S_0$, and $U'_1$ are five elements over $\mathrm{GF}(2^m)$. Figure 3.4 a) is a
sequential type architecture, and Figure 3.4 b) is a parallel and forwarding type.
As can be seen, power consumption through each of the arithmetic units is not
the same, and each reads from a different register, which introduces switching into
the circuit. As a result, the power consumption values for subsequent arithmetic
units increase. However, in Figure 3.4 b), the critical paths from all input registers
to the output register are exactly the same, and the parallel and forwarding
type architecture thus consumes approximately 35 % fewer registers than does
the sequential type architecture. Parallel and forwarding type architectures can






Elliptic Curve Cryptography (ECC) was independently proposed in the mid
1980s by Miller [70] and Koblitz [47]. Area and power optimization are two
important design issues for ECC, which is used in many
embedded systems. One benefit of ECC is that it requires a much shorter key
length than other public key cryptosystems to provide an equivalent level of
security. However, efficient hardware implementation of an Elliptic Curve Processor
(ECP) for lightweight devices is a challenge. In this chapter we propose
an efficient processor for ECC that aims to reduce the number of registers compared
to designs that have appeared in the literature. We take advantage of
forwarding paths in the ECP to avoid writing/reading of short-lived variables
to/from the register file. The proposed ECP design is implemented over $\mathbb{F}_{2^{163}}$
on a Xilinx XC4VLX200 FPGA device to verify its functionality and measure its
performance. This work yields an area saving of up to 38% in the number of flip-flops
and up to 27% in the number of look-up tables (LUTs). The
performance overhead is 1.8 ns added to the ECP critical path.
4.1 Introduction
Elliptic curve cryptography has occupied the center stage of public key cryptographic
research. The main reason is its shorter key length for the same
level of security, that is, its ability to provide greater security per bit compared
to public key systems such as RSA. ECC has also been included in the IEEE P1363
[4] and NIST [75] standards. The traditional way of implementing ECC is in software,
running on general-purpose processors or on digital-signal processors [51].
Nevertheless, in some applications dedicated hardware must be considered. The
designer of an elliptic curve hardware implementation is faced with many choices at
design time, each of which can impact the performance of the implementation in
different ways. There are many examples in the literature of how these design
choices can affect the performance of an elliptic curve hardware implementation.
The effect of design choices on area, power, and energy in elliptic curve hardware
has been less well studied. This thesis is concerned with the implementation
of an optimized ECP with a special scalar multiplier and self-controlled architecture.
Its goal is to optimize the ECP implementation targeted for resource
restricted environments in terms of hardware usage. In addition, the power and
energy results are presented and compared.
A number of hardware implementations for standardized elliptic curve cryptography
have been suggested in the literature [18, 19, 52, 6, 100], but very few of
them are aimed at low-end devices [30, 40, 65, 99]. Most implementations focus on
speed and are only suitable for server end applications due to their huge area
requirements [19, 99, 55, 67]. However, there is an equally important need for
stand-alone ECP engines in small constrained devices used for different applications,
such as sensor networks and mobile devices. An application-specific ECP
was reported in [18]. The design is based on a combined algorithm to perform
point addition and doubling using a multistage pipelined finite field multiplier.
In [81], a high performance elliptic curve cryptography processor over $\mathbb{F}_{2^{163}}$ was
proposed, which achieved a high throughput at the cost of a twofold increase in hardware
complexity. The architecture is based on the López-Dahab elliptic curve point
multiplication algorithm with a Gaussian normal basis (GNB) for $\mathbb{F}_{2^{163}}$. Moreover,
in [55] a high performance elliptic curve processor was presented based on the
Montgomery scalar multiplication algorithm. The architecture consists of three
Arithmetic Units (AUs). Each AU contains a word-serial multiplier, a squarer,
and an adder, in addition to bit-parallel modular reduction for the irreducible
pentanomials. A pseudo-multi-core ECP over $\mathbb{F}_{2^{163}}$ was reported in [100]. The
architecture primarily consists of three finite field (FF) RISC cores with a main
controller to achieve pipelining. Each core consists of two 41×163 FF multipliers,
two bit-parallel squarers, and two adders. The presented architecture is based on
instruction-level parallelism (ILP) of the parallelized López-Dahab algorithm among
the three FF cores.
The aims of this work are (i) to optimize area requirements by reducing
the number of registers in the Register File (RF) of the ECP; (ii) to present a
low-power embedded ECP architecture that takes advantage of the forwarding
(bypassing) hardware to avoid the power cost of writing and reading short-lived
variables to and from the RF; and (iii) to improve speed, throughput, and
critical path in the implementation of the ECP while keeping the area requirement
low.
This chapter is organized as follows. In Section 4.2, we briefly recap previous
hardware implementations of ECC and introduce the conventional elliptic curve
implementation for LD projective coordinates. Section 4.3 summarizes the architecture
synthesis and optimization techniques used to develop efficient explicit formulas
for different coordinate systems. Section 4.4 gives synthesis and simulation
results and comparisons. Conclusions are provided in Section 4.5.
4.2 Previous Work in ECC Hardware Implementation
Concerning the design of lightweight implementations, most research on elliptic
curve implementations has focused on minimizing the computation time per
point multiplication. Several implementations aimed at minimizing power consumption
for applications in constrained environments have also been reported
[77, 10]. A particularly active area of research is that of efficient hardware
implementations of elliptic curve operations, including hardware description language
developments, programmable hardware realizations, and fabricated custom circuits.
4.2.1 Low-Power Designs
An ASIC-based low power elliptic curve digital signature chip was presented in
[85]. Typically, an ASIC-based architecture uses one particular algorithm and
coordinate system to perform the point scalar multiplication. Bertoni et al. [12]
present a low power elliptic curve processor based on an Atmel 8-bit microprocessor
and a dedicated ASIC coprocessor to perform field multiplications. Keller
et al. [43] have studied the effects that algorithm and coordinate choice have on
the power consumption of a reconfigurable $\mathbb{F}_{2^m}$ ECP implemented on an
FPGA.
Schroeppel et al. [85] introduce a VHDL (VHSIC Hardware Description
Language, where VHSIC stands for Very High Speed Integrated Circuit)
design that incorporates optimizations intended to provide digital signature
generation with as little power as possible. While the value of elliptic curve arithmetic
in enabling public-key cryptography to serve in resource-constrained
environments is well accepted, the design was motivated by the need to reduce the
resources required to provide strong public-key authentication for sensor-based
monitoring systems and critical infrastructure protection. For these applications,
signature generation is often performed in highly constrained, battery-operated
environments, whereas signature verification is performed on desktop systems
with only the typical constraint of purchasing power. The work of Gaubatz et al.
[31] discusses the necessity and the feasibility of public key cryptography protocols
in sensor networks. In [31], the authors investigated implementations of two
algorithms for this purpose: Rabin's scheme and NTRUEncrypt. The conclusion
is that NTRUEncrypt is suitable for low-power, small-area implementation, with a total
complexity of 3000 gates and a power consumption of less than 20 µW. On the
other hand, RSA is concluded to be infeasible under the imposed constraints. In
the follow-up work [32], the authors compared the previous two with an ECC
solution for wireless sensor networks. The architecture of the ECP occupies an
area of 18,720 gates and consumes less than 400 µW of power at a clock frequency
of 500 kHz. The field used was a prime field with elements that have bit-lengths
of 102. The security level is therefore not very high, but it seems to meet the
requirements for lightweight applications.
The work of Goodman and Chandrakasan [34] also dealt with power efficient
solutions. They proposed a domain-specific reconfigurable cryptographic processor
(DSRCP). The primary component of the DSRCP is the reconfigurable
datapath, which consists of four major functional blocks: an eight-word
register file, a fast adder unit, a comparator unit, and the main reconfigurable
logic unit. Multiplication is performed using Montgomery multiplication. At
50 MHz, the processor operates at a supply voltage of 2 V and consumes at most
75 mW of power. In ultra-low-power mode, the processor consumes at most
525 µW. The PhD thesis of Goodman [35] offers many useful ideas and considerations
for low-power designs.
4.2.2 High performance Designs
Elliptic curve scalar multiplication is a fundamental operation in elliptic curve
cryptosystems. Recently, a number of hardware architectures have been proposed
in the literature to speed up this operation. Schroeppel et al. [85] present a
design that incorporates built-in optimizations using the point-halving algorithm for
elliptic curves and field towers to speed up the finite field arithmetic in general.
Further enhancements of basic finite field arithmetic operations intended to provide
digital signature generation with low power are also included. Kim et al.
[44] introduce a hardware architecture that takes advantage of a non-conventional
basis representation of finite field elements to make point multiplication more
efficient. Moon et al. [73] address field multiplication and division, proposing a
new method for fast elliptic curve arithmetic appropriate for hardware.
Goodman and Chandrakasan [34] tackle the problem of providing energy-efficient
public-key cryptography in hardware while supporting multiple algorithms,
including elliptic curve-based algorithms. Moving closer to applications
of ECC, Aydos et al. [9] have implemented an ECC-based wireless authentication
protocol that utilizes the elliptic curve digital signature algorithm (ECDSA).
Very recently, McLoone and Robshaw [68] reconsidered the hardware cost of public-key
cryptography for ultra-constrained area and power applications. Their implementation
allows for a tag that can participate in an authentication protocol a
limited number of times. As a result, the tag can store in memory the responses
to a limited number of authentications, and thus the hardware performance is
increased. Mishra et al. [71] propose a pipelining scheme for implementing
ECC. They implement a two-stage pipeline, since at any point in time there are
at most two operations in a producer-consumer relation or in one step of the
three-operand code. The operation that enters the pipeline earlier produces
outputs that are consumed by the second operation as inputs. As soon as
the producer process exits the pipeline, the subsequent EC operation enters
the pipeline as the consumer process. The earlier consumer now produces
outputs, which are consumed by the newer process.
The estimated performance of the ECC processor for the field $\mathbb{F}_{2^{163}}$ is given
in [89]. For the irreducible polynomial $F(x) = x^{163} + x^7 + x^6 + x^3 + 1$, one point
multiplication takes $163 \times 15M = 2445M$. Conversion of coordinates $A \to P$ and
$P \to A$ takes $2M$ and $I + 2M$, respectively. Assuming that inversion is done by
means of Fermat's little theorem, the total for conversion is around $300M$. Altogether,
this results in approximately $3000M$. One field multiplication ($M$) takes 163 cycles, which
results in 489,000 cycles for point multiplication. With a clock frequency of even
1 MHz, one point multiplication would take less than half a second.
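The estimate above can be reproduced with a few lines of arithmetic. The sketch below only restates the figures quoted in the text; the variable names are illustrative, and the rounding of 2745M up to 3000M is the text's own approximation, not a computed value.

```python
M_BITS = 163                 # field size m for F_{2^163}
MULTS_PER_KEY_BIT = 15       # field multiplications per scalar bit, as quoted
CONVERSION_MULTS = 300       # coordinate conversion incl. Fermat inversion, as quoted
CLOCK_HZ = 1_000_000         # 1 MHz clock

ladder_mults = M_BITS * MULTS_PER_KEY_BIT          # 163 * 15 multiplications
exact_total = ladder_mults + CONVERSION_MULTS      # before rounding
rounded_total = 3000                               # approximation used in the text

cycles = rounded_total * M_BITS                    # 163 cycles per bit-serial multiplication
seconds = cycles / CLOCK_HZ

print(ladder_mults, exact_total, cycles, seconds)
```

Because the total is rounded up before converting to cycles, the half-second figure is a conservative bound for the quoted operation counts.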
Ansari and Hasan [6] proposed a high performance architecture for elliptic curve
scalar multiplication using the Montgomery ladder method over the finite field $\mathbb{F}_{2^m}$.
The idea depends on the parallel execution of multiplication and addition/squaring
in one iteration and/or in the entire multiplication loop. They also proposed
a pseudo-pipelined finite field multiplier, which uses pipelining to shorten the
critical path. Kumar and Paar
[51] proposed an affine coordinate ASIC implementation of the ECC processor
that uses LD projective coordinates and a modified Montgomery point multiplication
method for binary elliptic curves. An area of 10k to 18k GE is
obtained in a 0.35 µm CMOS process.
4.2.3 Compact Implementations
Previous implementations of ECC processors have been based on VLSI chips
that implement a coprocessor for performing the underlying field operations.
A Motorola DSP 56000 was used to implement a complete ECC system that calculates
5 points per second on a supersingular curve [5]. In 1993, the same team
developed a processor for operations in $\mathbb{F}_{2^{155}}$ [5], which used 11,000 transistors
and could operate at 40 MHz. This implementation was intended to be
compact yet secure. A field programmable gate array (FPGA) based processor
for elliptic curve cryptography in a composite field $\mathbb{F}_{(2^n)^m}$ was developed by Rosner
[82]. A compact super-serial multiplier for FPGAs, which trades off performance
for area, was reported in 1999, and its performance for field (polynomial basis)
and curve multiplications over $\mathbb{F}_{2^{167}}$ was also presented [76]. ECPs based on a serial
finite field multiplier over $\mathbb{F}_{2^m}$ are compact but slower than other
implementations [99, 66].
Low-power and compact implementations have become an important research area
with the constant increase in the number of hand-held devices such as mobile
phones, smart cards, and PDAs. Schroeppel et al. [85] present a design for ECC
over binary fields that was optimized for power, space, and time in order to
provide digital signatures. Wolkerstorfer [91] shows that ECC-based public key
cryptography is feasible on RFID tags by implementing the ECDSA (Elliptic
Curve Digital Signature Algorithm) on a small chip. Batina et al. [10] show that
by trading off performance for area it is possible to implement EC public-key
cryptography on a tag.
Wolkerstorfer [91] has presented an ECP that is able to calculate all arithmetic
operations used in ECC: addition, multiplication, and inversion in the finite
fields $\mathbb{F}_p$ and $\mathbb{F}_{2^m}$. Bit-serial multiplication is calculated by an improved version
of Montgomery's algorithm. A realization of the ECP in a 0.35 µm CMOS process needs
1.31 mm² and can be clocked at 68.5 MHz. In a 180 nm CMOS process, the area
shrinks to 0.35 mm², which is acceptable for RFID tags.
Yong Ki Lee et al. [65] propose an architecture for compact elliptic curve
multiplication processors using López-Dahab's Montgomery scalar multiplication
algorithm [54] and the projective coordinate system to avoid inversion
operations. They show that if López-Dahab's algorithm is implemented based
on the modular arithmetic logic unit in a conventional way, the total number
of registers is nine. As the registers occupy more than 80% of the gate area
in the conventional architecture, reducing the number of registers and the
complexity of the register file is very effective in minimizing the total gate area.
In order to minimize the system size, they propose a new formula for the common
projective coordinate system, in which all the Z-coordinate values are equal.
This reduces the number of registers by one. Two more registers are then
eliminated by using a register reuse technique and by designing a compact register
file architecture that limits access to the registers [64]. Accordingly,
three registers are removed in total, leaving six registers. They also propose
a reduction of the complexity of the register file by designing a circular shift
register file. However, the use of this register file increases the number of cycles
due to the control overhead. The authors also calculate the number of execution
cycles needed to perform a point multiplication operation, which is
313,901 clock cycles. In this work, we propose an optimized architecture
for an ECP that performs the point multiplication operation using only 4 registers,
without using a circular shift register file, in order to avoid this overhead. Furthermore, the
number of execution cycles needed to perform the point multiplication operation is
309,546 clock cycles, which is less than that of Yong Ki Lee et al. [65]. Moreover,
the power consumed by the register file is expected to decrease because of the
reduction in the number of registers in the register file and the use of bypassing hardware
















Figure 4.1: A top level architecture for the EC processor
4.3 Architecture Synthesis of ECC processor
This section presents the results obtained from the implementation of an optimized
ECP architecture on an FPGA. Here, an ECP is an elliptic curve cryptographic
processor that uses López-Dahab projective coordinates, and the FPGA implementation
considers explicit formulas based on these
coordinates. Results of register management for both the conventional and the optimized
point addition/doubling units, which perform the point addition/doubling algorithm,
are presented. A method for calculating the power dissipated by the register
file, the bypassing hardware, and the arithmetic logic unit in the conventional and
optimized schemes is presented. The number of clock cycles needed to perform
a point multiplication operation in both the conventional and optimized approaches
is also tabulated. Moreover, a comparison between the optimized approach
and the conventional approaches in terms of execution time in clock cycles and
power dissipation is presented.
4.3.1 ECC processor on FPGA
The EC processor is, in fact, the hardware architecture for scalar multiplica-
tion. The top-level architecture consists of a control unit, block RAM (BRAM),
and a point addition/doubling unit, as shown in Figure 4.1. The control unit
receives the EC parameters, reads a key (or a scalar), and controls the point
addition/doubling unit according to the binary double and add point multipli-








































Figure 4.2: conventional point addition/doubling unit
the other hand, is responsible for computing all required field arithmetic operations.
At the beginning of the scalar multiplication operation, the BRAM is
assumed to contain the scalar and a projective EC point. These values must be
maintained during the iterations of the EC scalar multiplication. The point
addition/doubling unit designed for computing the scalar multiplication algorithms
is discussed in the following sections.
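The control flow that the control unit follows, scanning the scalar from its most significant bit and issuing a doubling for every bit and an addition for every 1 bit, can be sketched as follows. This is a generic illustration rather than the processor's actual control logic: the group operations are passed in as functions, and the integers modulo 1009 under addition are used only as a stand-in group so that the result can be checked without implementing curve arithmetic.

```python
def double_and_add(k, P, add, double, identity):
    """Left-to-right binary scalar multiplication: returns k*P for k >= 0.

    `add`, `double`, and `identity` describe an abstract group; on a real ECP
    they would correspond to point addition, point doubling, and the point at
    infinity.
    """
    Q = identity
    for bit in bin(k)[2:]:        # most significant bit first
        Q = double(Q)             # executed for every scalar bit
        if bit == "1":
            Q = add(Q, P)         # executed only when the current bit is 1
    return Q


# Stand-in group (an assumption for illustration): integers modulo N under addition.
N = 1009


def add_mod(a, b):
    return (a + b) % N


def double_mod(a):
    return (2 * a) % N


result = double_and_add(45, 7, add_mod, double_mod, 0)
```

With the stand-in group the routine simply computes 45 × 7 modulo 1009; replacing the three group callbacks with projective point operations gives the loop structure that the point addition/doubling unit is driven by.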
4.3.2 Macroscopic Structural view of Conventional ECC datapath
An example of a conventional architecture for point addition/doubling [52, 28, 54]
is depicted in Figure 4.2. Its three main units are an adder, a
multiplier, and a squarer, all of which operate over $\mathbb{F}_{2^m}$. In the work presented in
this thesis, the adder and the squarer are bit parallel, while the
multiplier is bit serial. These three field arithmetic units are closely
interconnected inside a single finite field arithmetic unit (FFAU) and share a
common input data bus A. A second operand is provided to the FFAU through
an additional data bus B. The operands are stored in a number of registers,
and the output of a register is placed on buses A and B using a multiplexer
(MUX1) with control signals (RSelA and RSelB). The three field arithmetic units are
connected to bus C via a multiplexer (MUX2) controlled by AUsel. The question is, given an
explicit formula for both point addition and point doubling operations, what is
the minimum number of registers required to compute the formula sequentially
or in parallel?
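To make the bit-serial multiplier mentioned above concrete, the following sketch models one common textbook variant, an MSB-first shift-and-add multiplier with interleaved reduction, in which each loop iteration corresponds to one clock cycle. It is an assumed software model for illustration and not the multiplier circuit used in this thesis; field elements are represented as Python integers whose bits are polynomial coefficients.

```python
def gf2m_mul_bit_serial(a, b, m, poly):
    """Multiply a and b in GF(2^m), MSB first, one bit of b per iteration.

    `poly` is the irreducible polynomial including its x^m term.
    The loop runs exactly m times, mirroring m clock cycles in hardware.
    """
    acc = 0
    for i in range(m - 1, -1, -1):
        acc <<= 1                      # multiply the accumulator by x
        if (acc >> m) & 1:             # degree reached m: reduce modulo poly
            acc ^= poly
        if (b >> i) & 1:               # add a if the current bit of b is set
            acc ^= a
    return acc


# Small sanity field GF(2^3) with x^3 + x + 1.
small = gf2m_mul_bit_serial(0b010, 0b100, 3, 0b1011)

# F_{2^163} with the pentanomial x^163 + x^7 + x^6 + x^3 + 1.
M = 163
POLY_163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
wrapped = gf2m_mul_bit_serial(1 << 162, 0b10, M, POLY_163)   # x^162 * x
```

In the small field the product x · x² reduces to x + 1, and in the 163-bit field x¹⁶² · x wraps around to x⁷ + x⁶ + x³ + 1, which shows the reduction step at work; addition in the same field is a plain bitwise XOR, which is why the adder can be bit parallel at little cost.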
4.3.2.1 Register-Transfer-Level Design for López-Dahab Algorithm Implementation
Register-transfer-level (RTL) design is a complicated-sounding name for a simple
concept. In RTL design a circuit is described as a set of registers and a set of
transfer functions that describe the flow of data between the registers. The
registers are implemented directly as flip-flops, while the transfer functions are
implemented as blocks of combinational logic.
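The register/transfer-function view described above can be mimicked in software by evaluating every transfer function on the current register values and then updating all registers at once, as a clock edge would. The sketch below is a toy model under assumed names (R0, R1, and the two transfer functions are hypothetical), not a description of the thesis datapath.

```python
def clock_tick(regs, transfers):
    """Model one rising clock edge.

    All transfer functions (combinational logic) are evaluated on the current
    register values first; only then are the registers (flip-flops) updated,
    so every function sees the values from before the edge.
    """
    next_values = {name: fn(regs) for name, fn in transfers.items()}
    updated = dict(regs)
    updated.update(next_values)
    return updated


# Hypothetical two-register design: R1 accumulates R0 XOR R1, R0 takes the old R1.
transfers = {
    "R1": lambda r: r["R0"] ^ r["R1"],   # addition in GF(2^m) is a bitwise XOR
    "R0": lambda r: r["R1"],
}

state = {"R0": 0b1010, "R1": 0b0110}
after_one = clock_tick(state, transfers)
after_two = clock_tick(after_one, transfers)
```

Note that R0 receives the value R1 held before the edge rather than the newly computed one, which is the behaviour that distinguishes a register transfer from a sequence of ordinary assignments.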
This stage in the RTL design cycle is commonly referred to as register allocation
and operation scheduling, described in Chapter 3. Register allocation
refers to the mapping of data operations onto hardware resources. Operation
scheduling refers to the choice of clock cycle during which an operation will be
performed in a multi-cycle operation. Registers must also be allocated for all
values that cross over from one clock cycle to a later one. Register allocation
and operation scheduling are interlinked and must normally be carried out
simultaneously. The aim is to maximize resource usage and simultaneously to
minimize the number of registers required to store intermediate results. Owing
to its simplicity, the allocation stage can be considered trivial because all
multiplications, squarings, and additions must be allocated to the one multiplier,
squarer, and adder, respectively. The scheduling operation entails choosing in
which clock cycle each multiplication, squaring, and addition is to be performed.
Algorithm 4.1 describes the register-transfer-level design for the López-Dahab
mixed coordinates point multiplication operation. The three-operand code, like
the register-transfer statements, is shown at each step of the algorithm. Fig-
ure 4.2 shows that the algorithm is implemented using nine registers. In fact,
this section presents an RTL design for López-Dahab coordinates that are both
projective and mixed. Figure 4.2 shows a general design for both mixed and
projective coordinates, with the only difference being the number of intermediate
values used by the algorithm. For example, with mixed coordinates the interme-
diate values occupy only registers R0, R1, R2, and R3, as indicated in Algorithm
4.1. At the beginning of the mixed coordinates point multiplication operation,
it is assumed that five read accesses from BRAM are performed in order to
store X1, Y1, Z1, X2, and Y2 in R4, R5, R6, R7, and R8 respectively. If ki = 1, point
addition and point doubling are executed. At the end of the point multiplication
operation, three write accesses are executed in order to store the resulting point
Q = (X3, Y3, Z3). At each step of the register file management, each field operation
Algorithm 4.1 Register management for the left-to-right binary point multiplication in $\mathbb{F}_{2^m}$
using López-Dahab mixed projective coordinates [37].
Input: P, k where $P \in E(\mathbb{F}_{2^m}) : y^2 + xy = x^3 + ax^2 + b$, $b = 1$, $k = [k_{t-1} \cdots k_1, k_0]_2$.
Output: Q = kP where Q = (X3, Y3, Z3) in LD coordinates $\in E(\mathbb{F}_{2^m})$
1: for i = t− 1 down to 0 do
2: R0 ←MUL (X1, Z1)
3: R1 ← SQR (X1)
4: R2 ← ADD (R1, Y1)
5: R3 ←MUL (R0, R2)
6: Z3 ← SQR (R0)
7: R0 ← SQR (R2)
8: R0 ← ADD (R0, R3)
9: X3 ← ADD (R0, Z3)
10: R0 ← ADD (Z3, R3)
11: R1 ← SQR (R1)
12: R1 ←MUL (R1, Z3)
13: R0 ←MUL (R0, X3)
14: Y3 ← ADD (R0, R1)
15: if ki = 1 then
16: R0 ← SQR (Z1)
17: R0 ←MUL (Y2, R0)
18: R0 ← ADD (Y1, R0)
19: R1 ←MUL (X2, Z1)
20: R1 ← ADD (X1, R1)
21: R2 ←MUL (R1, Z1)
22: Z3 ← SQR (R2)
23: R1 ← SQR (R1)
24: R1 ← ADD (R0, R1)
25: R1 ← ADD (R1, R2)
26: R1 ←MUL (R2, R1)
27: R3 ← SQR (R0)
28: X3 ← ADD (R3, R1)
29: R1 ←MUL (R0, R2)
30: R0 ←MUL (X2, Z3)
31: R2 ← ADD (R0, X3)
32: R1 ← ADD (R1, Z3)
33: R2 ←MUL (R2, R1)
34: R0 ← ADD (Y2, X2)
35: R1 ← SQR (Z3)
36: R3 ←MUL (R0, R1)
37: Y3 ← ADD (R3, R2)
38: end if
39: end for
40: Return (Q = (X3, Y3, Z3))
is performed according to the explicit formulas stated in section 2.3.4.
For example, the operation in step 17 is a field multiplication between the two
operands Y2 and $Z_1^2$, the latter stored in R0. These two operands are made
available at the multiplier inputs by the RSelA and RSelB select control signals.
After 163 clock cycles, the result is written back to R0 as governed by the AUSel
selection control signal, as shown in Figure 4.2. Each field addition and field
squaring step is executed in only one cycle. On the other hand, a field
multiplication operation requires 163 cycles, corresponding to the degree of the
irreducible pentanomial $f(x) = x^{163} + x^7 + x^6 + x^3 + 1$ for the NIST
recommended curves with m = 163 [37].
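The register-transfer steps of the doubling part of Algorithm 4.1 can be checked in software. The following Python sketch is a minimal model under stated assumptions: field elements are Python integers, the multiplier is modelled bit-serially (163 iterations, mirroring the cycle count above), and squaring simply reuses the multiplier model even though the hardware squarer takes one cycle. Expanding steps 2 to 14 gives the affine doubling law with curve coefficient a = 1; that value is an observation drawn from the listed steps, not a parameter stated in the text, and the test values are arbitrary.

```python
# Sketch: software model of the doubling steps (2-14) of Algorithm 4.1.
# Field: F_{2^163} with the NIST pentanomial quoted in the text.
M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def add(a, b):
    """Field addition is a bitwise XOR."""
    return a ^ b

def mul(a, b):
    """Bit-serial, MSB-first multiplication: M iterations, one per bit of b."""
    r = 0
    for i in range(M - 1, -1, -1):
        r <<= 1
        if r >> M:            # degree reached M: reduce by the pentanomial
            r ^= POLY
        if (b >> i) & 1:
            r ^= a
    return r

def sqr(a):
    """Squaring, modelled here with the multiplier for brevity."""
    return mul(a, a)

def inv(a):
    """Inverse as a^(2^M - 2); used only to check results in affine form."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = mul(r, a)
        a = sqr(a)
        e >>= 1
    return r

def ld_double(X1, Y1, Z1):
    """Steps 2-14 of Algorithm 4.1, transcribed literally."""
    R0 = mul(X1, Z1)
    R1 = sqr(X1)
    R2 = add(R1, Y1)
    R3 = mul(R0, R2)
    Z3 = sqr(R0)
    R0 = sqr(R2)
    R0 = add(R0, R3)
    X3 = add(R0, Z3)
    R0 = add(Z3, R3)
    R1 = sqr(R1)
    R1 = mul(R1, Z3)
    R0 = mul(R0, X3)
    Y3 = add(R0, R1)
    return X3, Y3, Z3

def affine_double(x, y):
    """Textbook affine doubling on y^2 + xy = x^3 + a x^2 + b, assuming a = 1."""
    lam = add(x, mul(y, inv(x)))
    x3 = add(add(sqr(lam), lam), 1)
    y3 = add(sqr(x), mul(add(lam, 1), x3))
    return x3, y3

# Arbitrary (hypothetical) values; x and z must be non-zero.
x, y, z = 0x1234567, 0x89ABCDE, 0x42
X3, Y3, Z3 = ld_double(mul(x, z), mul(y, sqr(z)), z)   # x = X/Z, y = Y/Z^2
assert (mul(X3, inv(Z3)), mul(Y3, inv(sqr(Z3)))) == affine_double(x, y)
```

The final assertion converts the López-Dahab result back to affine form (x = X/Z, y = Y/Z^2) and compares it with the affine doubling formula, so a transcription error in any of the thirteen steps would be caught.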
Similarly, Algorithm 4.2 shows the conventional register management for the
point multiplication with López-Dahab projective coordinates. As presented
Algorithm 4.2 Register management for the left-to-right binary point multiplication in $\mathbb{F}_{2^m}$
in López-Dahab projective coordinates [37].
Input: P, k where $P \in E(\mathbb{F}_{2^m}) : y^2 + xy = x^3 + ax^2 + b$, $b = 1$, $k = [k_{t-1} \cdots k_1, k_0]_2$.
Output: Q = kP where Q = (X3, Y3, Z3) in LD coordinates $\in E(\mathbb{F}_{2^m})$
1: for i = t− 1 downto 0 do
2: R0 ←MUL (X1(3), Z1(3))
3: R1 ← SQR (X1(3))
4: R2 ← ADD (R1, Y1(3))
5: R3 ←MUL (R0, R2)
6: Z3 ← SQR (R0)
7: R0 ← SQR (R2)
8: R0 ← ADD (R0, R3)
9: X3 ← ADD (R0, Z3)
10: R0 ← ADD (Z3, R3)
11: R1 ← SQR (R1)
12: R1 ←MUL (R1, Z3)
13: R0 ←MUL (R0, X3)
14: Y3 ← ADD (R0, R1)
15: if ki = 1 then
16: R0 ←MUL (X1, Z2)
17: R1 ←MUL (X2, Z1)
18: R2 ← SQR (R0)
19: R3 ← SQR (R1)
20: R4 ← ADD (R2, R3)
21: R5 ← SQR (Z2)
22: R5 ←MUL (Y1, R5)
23: R6 ← SQR (Z1)
24: R6 ←MUL (Y2, R6)
25: R7 ← ADD (R5, R6)
26: R8 ← ADD (R0, R1)
27: R7 ←MUL (R7, R8)
28: R8 ←MUL (Z1, Z2)
29: Z3 ←MUL (R4, R8)
30: R3 ← ADD (R6, R3)
31: R6 ← ADD (R2, R5)
32: R2 ←MUL (R1, R6)
33: R1 ←MUL (R0, R3)
34: X3 ← ADD (R1, R2)
35: R3 ←MUL (R4, R5)
36: R5 ←MUL (R0, R7)
37: R0 ← ADD (R5, R3)
38: R2 ←MUL (R0, R4)
39: R0 ← ADD (R7, Z3)
40: R5 ←MUL (R0, X3)
41: Y3 ← ADD (R2, R5)
42: end if
43: end for
44: Return (Q = (X3, Y3, Z3))
in the algorithm, the intermediate values require nine registers, R0 to R8. The
design therefore stores the elements X1, Y1, Z1, X2, Y2 in the BRAM, which, if
ki = 1, needs a total of 20 read accesses through DIN and six write accesses
through DOUT in one iteration. For example, in step 16 of Algorithm 4.2, a finite
field multiplication (MUL) performed between two intermediate results stored in
R4 and R8 results in an element Z3, which consumes a write access to BRAM. This
element is repeatedly used by the remainder of the point addition steps or by
the point doubling steps, as specified in step 30. The notation Z1(3) indicates
reuse of the memory location of Z1 by Z3 after the first iteration of the point
multiplication algorithm, which has the value ki = 0, has been executed.
The following section presents the use of the optimization analysis described
in chapter 3 in order to reduce the number of registers used for storing the
intermediate variables and to reduce the number of read/write accesses that are
required from/to BRAM.
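For orientation, the control structure shared by Algorithms 4.1 and 4.2 is the left-to-right binary (double-and-add) method. The sketch below is a generic Python illustration and not the thesis's implementation: the group operations are passed in as functions, the toy group used to exercise it is the integers modulo n under addition, and the cycle estimate assumes only the figures quoted above (163 cycles per multiplication, one cycle per addition or squaring) together with the operation counts read off Algorithm 4.1 (4 MUL, 4 SQR, 5 ADD for doubling; 8 MUL, 5 SQR, 9 ADD for mixed addition). BRAM accesses and control overhead are ignored.

```python
# Sketch: left-to-right binary scalar multiplication with operation counting.

def scalar_multiply(k, P, double, add, identity):
    """Return (kP, number of doublings, number of additions)."""
    Q = identity
    doublings = additions = 0
    for bit in bin(k)[2:]:            # most significant bit first
        Q = double(Q)
        doublings += 1
        if bit == "1":
            Q = add(Q, P)
            additions += 1
    return Q, doublings, additions

# Assumed per-operation costs, derived from the step counts of Algorithm 4.1.
MUL_CYCLES, SQR_CYCLES, ADD_CYCLES = 163, 1, 1
DBL_CYCLES = 4 * MUL_CYCLES + 4 * SQR_CYCLES + 5 * ADD_CYCLES     # 661
PADD_CYCLES = 8 * MUL_CYCLES + 5 * SQR_CYCLES + 9 * ADD_CYCLES    # 1318

def estimated_cycles(doublings, additions):
    """Arithmetic-only cycle estimate (no memory or control overhead)."""
    return doublings * DBL_CYCLES + additions * PADD_CYCLES

# Toy additive group Z_n stands in for the curve group (illustration only).
n = 1009
Q, d, a = scalar_multiply(
    0b1011001, 7,
    double=lambda q: (2 * q) % n,
    add=lambda q, p: (q + p) % n,
    identity=0,
)
print(Q, d, a, estimated_cycles(d, a))
```

For a t-bit scalar the loop performs t doublings and one addition per non-zero bit, which is why the register and memory traffic of the addition branch matters only when ki = 1.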
4.3.3 Macroscopic Structural View of Optimized ECC Datapath
An optimized point addition/doubling unit is illustrated in Figure 4.3. Similar to
the conventional point addition/doubling unit described in section 4.3.2, it
consists of the three main components: the adder, the multiplier, and the squarer,
all of which operate over $\mathbb{F}_{2^{163}}$. The enhancement in the new unit
is the addition of two multiplexers, MUX3 and MUX4, which are used to provide the
forwarding paths from one finite arithmetic unit to another. These multiplexers
are controlled by two signals: DSel and ESel. The operands are stored in the
register file, which consists of four registers whose output is selected for buses
A and B using multiplexer MUX1 with control signals RSelA and RSelB. These signals
are issued by the control unit, as shown in Figure 4.1. Only four registers are
needed as a result of applying the new methodology for register allocation and
operation scheduling explained in chapter 3. The following subsection provides an
analysis of the modified left-to-right binary point multiplication in both mixed
and projective López-Dahab coordinate systems. This analysis is based on the
application of the three models described in section 3.2.

Figure 4.3: Optimized point addition/doubling unit
4.3.4 Analyzing the Design
The modifications of the López-Dahab mixed coordinates point multiplication
scheme presented in Algorithm 4.3 indicate that registers R0 and R1 are used for
storing long variables. It is assumed that field elements Y1 and Z1 are stored in
registers R2 and R3, respectively; however, X1(3), X2, Y2 are defined or modified
by read or write operations through BRAM. As a result, if ki = 1, there will be a
total of 10 read and three write accesses. However, if ki = 0, the total number
of read and write accesses will be three and two, respectively. Based on
Figure 4.3, a forwarding path (FP) can be obtained between the finite field
multiplier and adder through MUX3. The short variable resulting from the finite
field multiplier is forwarded directly to the adder, and the results from the
adder are then stored in the register file, as shown in step 4. Based on the
developed register allocation and operation scheduling models, MUX3 can provide
several FPs that can be used in different steps either from the finite field
multiplier (MUL) to the adder (ADD), defined as FP (MUL−ADD), or from the finite
field squarer (SQR) to the ADD, defined as FP (SQR−ADD). As mentioned in section
3.2.3, the use of storage binding as a strategy for minimizing the number of
registers is evident in several steps in Algorithm 4.3. According to the variable
liveness analysis of element Y1, step 6 coalesces Y1 by means of the intermediate
variable D, defined as Y1(D). On the other hand, step 24 shows the coalescing of
Y1 by means of the intermediate variable C.
Modified register management for the left-to-right point multiplication
algorithm using López-Dahab projective coordinates is presented in Algorithm 4.4.
It should be noted that when the algorithm is actually implemented, additional
controls are required; the algorithm as shown indicates only the register file
management. According to the new methodology presented in chapter 3, it is
assumed that R0 to R3 are used to store long intermediate variables. The field
elements X1, Y1, Z1, X2, Y2, Z2 are defined and modified via DIN and DOUT through
the BRAM. The total number of read accesses is 30 if ki = 1 and 11 if ki = 0.
In addition, 11 write accesses are needed if ki = 1 and 4 accesses if ki = 0. As
shown in Figure 4.3, FP (SQR−MUL) and FP (ADD−MUL) can be accomplished
through MUX4. The terms used and defined, such as Z1(T), Y1(J), X1(C), are
examples of the applications of the coalescing operation mentioned in step 4 of
the methodology explained in section 3.2.3. The freedom to schedule the movement
of operand/result data using forwarding paths in the point multiplication
cycles reduces the pressure on the register file access, which was one of the main
motivations for the addition of the data forwarding paths.
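The effect of a forwarding path on register file traffic can be sketched with a toy model. In the Python fragment below, the `RegisterFile` class, the helper names, and the small field GF(2^8) (with the polynomial 0x11B standing in for the 163-bit field) are all hypothetical and chosen only to keep the example short. It models step 17 of Algorithm 4.3, where the product MUL(Y2, R0) is forwarded straight to the adder instead of being written back first.

```python
# Sketch: counting register-file write-backs with and without forwarding.

class RegisterFile:
    """Minimal register file that counts write accesses."""
    def __init__(self, **initial):
        self.values = dict(initial)
        self.writes = 0

    def read(self, name):
        return self.values[name]

    def write(self, name, value):
        self.values[name] = value
        self.writes += 1

def gf_add(a, b):
    return a ^ b

def gf_mul(a, b, m=8, poly=0x11B):
    """Bit-serial multiplication in a toy binary field GF(2^m)."""
    r = 0
    for i in range(m - 1, -1, -1):
        r <<= 1
        if r >> m:
            r ^= poly
        if (b >> i) & 1:
            r ^= a
    return r

def mul_then_add_conventional(rf, dst, a, b, c):
    # The product is written back, then read again by the adder.
    rf.write(dst, gf_mul(rf.read(a), rf.read(b)))
    rf.write(dst, gf_add(rf.read(c), rf.read(dst)))

def mul_then_add_forwarded(rf, dst, a, b, c):
    # The product stays on the forwarding path; only the sum is written.
    fp = gf_mul(rf.read(a), rf.read(b))
    rf.write(dst, gf_add(rf.read(c), fp))

conventional = RegisterFile(R0=0x53, Y1=0xCA, Y2=0x1D)
forwarded = RegisterFile(R0=0x53, Y1=0xCA, Y2=0x1D)
mul_then_add_conventional(conventional, "R0", "Y2", "R0", "Y1")
mul_then_add_forwarded(forwarded, "R0", "Y2", "R0", "Y1")
print(conventional.writes, forwarded.writes)
```

Both variants leave the same value in R0, but the forwarded version performs one register write instead of two, which is the kind of saving the short-variable forwarding in Algorithms 4.3 and 4.4 is aiming at.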
4.4 Implementation Results and Comparison
The proposed ECP over $\mathbb{F}_{2^{163}}$ has been completely implemented for
both the conventional and the optimized designs in RTL-level VHDL [63, 90]. For a
fair comparison with other architectures presented in the literature, the code
was synthesized and implemented on a Xilinx XC4VLX200 FPGA. For performance
analysis and low-power metrics, the Virtex-4 family was targeted. The entire
design was verified in ModelSim SE v. 6.5e using a special test DO program.
Table 4.1 summarizes the area requirements of both the conventional and the
optimized ECPs and provides a comparison with other related work described in
[18, 19, 55, 100].
Algorithm 4.3 The modified register management for left-to-right binary point multiplication
in $\mathbb{F}_{2^m}$ in López-Dahab mixed projective coordinates.
Input: P, k where $P \in E(\mathbb{F}_{2^m}) : y^2 + xy = x^3 + ax^2 + b$, $b \neq 0$, $k = [k_{t-1} \cdots k_1, k_0]_2$.
Output: Q = kP where Q = (X3, Y3, Z3) in LD coordinates $\in E(\mathbb{F}_{2^m})$
1: for i = t − 1 down to 0 do
2: R0 ← MUL (X1(3), Z1(3))
3: R1 ← SQR (X1(3))
4: X1(C) ← ADD (R1, Y1(3))
5: Y1(D) ← MUL (R0, X1(C))
6: Z1(Z3) ← SQR (R0)
7: R0 ← SQR (X1(C))
8: R0 ← ADD (R0, Y1(D))
9: X1(X3) ← ADD (R0, Z1(Z3))
10: R0 ← ADD (Z1(Z3), Y1(D))
11: R1 ← SQR (R1)
12: R1 ← MUL (R1, Z1(Z3))
13: FP (MUL−ADD) ← MUL (R0, X1(X3))
14: Y1(Y3) ← ADD (R1, FP (MUL−ADD))
15: if ki = 1 then
16: R0 ← SQR (Z1)
17: FP (MUL−ADD) ← MUL (Y2, R0), R0 ← ADD (Y1, FP (MUL−ADD))
18: FP (MUL−ADD) ← MUL (X2, Z1), R1 ← ADD (X1, FP (MUL−ADD))
19: Y1(C) ← MUL (R1, Z1)
20: Z1(Z3) ← SQR (Y1(C))
21: FP (SQR−ADD) ← SQR (R1), R1 ← ADD (R0, FP (SQR−ADD))
22: R1 ← ADD (R1, Y1(C))
23: R1 ← MUL (R1, Y1(C))
24: FP (SQR−ADD) ← SQR (R0), X1(X3) ← ADD (R1, FP (SQR−ADD))
25: R1 ← MUL (R0, Y1(C))
26: FP (MUL−ADD) ← MUL (X2, Z1(3)), R0 ← ADD (X1(3), FP (MUL−ADD))
27: R0 ← ADD (R1, Z1(Z3))
28: Y1(T) ← MUL (R0, R1)
29: R0 ← ADD (X2, Y2)
30: R1 ← SQR (Z1(Z3))
31: FP (MUL−ADD) ← MUL (R0, R1), Y1(Y3) ← ADD (Y1(T), FP (MUL−ADD))
32: end if
33: end for
34: Return (Q = (X3, Y3, Z3))

Algorithm 4.4 The modified register management for left-to-right binary point multiplication
in $\mathbb{F}_{2^m}$ in López-Dahab projective coordinates.
Input: P, k where $P \in E(\mathbb{F}_{2^m}) : y^2 + xy = x^3 + ax^2 + b$, $b \neq 0$, $k = [k_{t-1} \cdots k_1, k_0]_2$.
Output: Q = kP where Q = (X3, Y3, Z3) in LD coordinates $\in E(\mathbb{F}_{2^m})$
1: for i = t − 1 down to 0 do
2: R0 ← MUL (X1(3), Z1(3))
3: R1 ← SQR (X1(3))
4: X1(C) ← ADD (R1, Y1(3))
5: Y1(D) ← MUL (R0, X1(C))
6: Z1(Z3) ← SQR (R0)
7: R0 ← SQR (X1(C))
8: R0 ← ADD (R0, Y1(D))
9: X1(X3) ← ADD (R0, Z1(Z3))
10: R0 ← ADD (Z1(Z3), Y1(D))
11: R1 ← SQR (R1)
12: R1 ← MUL (R1, Z1(Z3))
13: FP (MUL−ADD) ← MUL (R0, X1(X3))
14: Y1(Y3) ← ADD (R1, FP (MUL−ADD))
15: if ki = 1 then
16: R0 ← MUL (X1, Z2)
17: X1(B) ← MUL (X2, Z1)
18: FP (SQR−MUL) ← SQR (Z2), R1 ← MUL (Y1, FP (SQR−MUL))
19: FP (SQR−MUL) ← SQR (Z1), R2 ← MUL (Y2, FP (SQR−MUL))
20: Y1(I) ← ADD (R0, R1)
21: FP (ADD−MUL) ← ADD (R0, X1(B)), Y1(J) ← MUL (Y1(I), FP (ADD−MUL))
22: Z1(T) ← MUL (Z1, Z2)
23: R3 ← SQR (R0)
24: FP (SQR−ADD) ← SQR (X1(B)), FP (ADD−MUL) ← ADD (R3, FP (SQR−ADD))
25: Z1(Z3) ← MUL (FP (ADD−MUL), Z1(Z3))
26: FP (ADD−MUL) ← ADD (R1, R3), R3 ← MUL (FP (ADD−MUL), X1(B))
27: FP (SQR−ADD) ← SQR (X1(B)), FP (ADD−MUL) ← ADD (R2, FP (SQR−ADD))
28: R2 ← MUL (R0, FP (ADD−MUL))
29: X1(X3) ← ADD (R2, R3)
30: R3 ← SQR (X1(X3))
31: FP (SQR−ADD) ← SQR (R0), R2 ← ADD (FP (SQR−ADD), R3)
32: R1 ← MUL (R1, R2)
33: R0 ← MUL (R0, Y1(J))
34: FP (ADD−MUL) ← ADD (R0, R1), R2 ← MUL (FP (ADD−MUL), R2)
35: FP (ADD−MUL) ← ADD (Y1(J), Z1(3)), R3 ← MUL (FP (ADD−MUL), X1(3))
36: Y1(Y3) ← ADD (R2, R3)
37: end if
38: end for
39: Return (Q = (X3, Y3, Z3))

Figure 4.4: Comparison on the point multiplication computational times of the ECP designs

The register usage of the optimized design is 35% less than that of conventional
architectures. However, using forwarding paths does indeed increase the
critical path, which affects the maximum clock frequency and the total
computational time for a point multiplication operation. In [18], the area used
is roughly more than twice that of our proposed optimized design because the
architecture in [18] is based on a seven-stage pipelined finite field multiplier.
The scalar multiplication design using affine point representation that was
implemented in [25] has been synthesized and simulated for the current work using
the same FPGA device used for the conventional and optimized designs tested in
this study. The superior results produced by the conventional architecture
compared to the one implemented in [25] are due largely to the use of projective
coordinates. It is particularly interesting to note that in [19] the maximum
frequency is even lower than in [18], as shown in Figure 4.4, because in the new
design, the critical path is approximately equal to the time delay of one
iteration of the finite field multiplier plus the adder. However, in [19] the
critical path is a 55-bit-sized multiplier plus the squarer and the adder. The
possibility exists that the conventional and the optimized designs can fit into
the XC4VLX80 device because the number of available slices is 35,840. The
developed design therefore requires less area than the architecture presented in
[19]. Using three finite field multi-core elliptic curve processors significantly
decreases the computational time for a point multiplication. On the other hand,
it uses approximately three times more hardware resources than the developed
mixed coordinate optimized design.
Figure 4.5: Dynamic power comparisons of the ECP designs
A comparison of the ECP designs with respect to the dynamic power dissipated in
the point multiplication unit is shown in Figure 4.5. The dynamic power
consumption was estimated using the Xilinx ISE XPower Analyzer. Based on the
implementation given in [25], the power consumed by their design was estimated.
The results indicate a decrease of about 15.8% in the total dynamic power
dissipation between the design in [25] and the proposed mixed coordinate
optimized design. However, the difference in power consumption between the
conventional and the optimized designs is only approximately 8%. The small
reduction in the total power results from the large amount of power dissipated
specifically inside the multiplier, which is an indication that future work needs
to address the development of a methodology for optimizing power dissipation in
the finite field multiplier. There is also a possibility that reducing the area
requirements may lead to a significant reduction in total power consumption by
the optimized design because it would fit into a smaller FPGA device. This
improvement would clearly result in an additional reduction, especially in
leakage power.
4.5 Conclusion
The optimization of area and power are two important design issues with respect
to the ECC used in many embedded systems. One benefit of ECC is that it requires
a much shorter key length than other public key cryptosystems in order to
provide an equivalent level of security. However, the hardware implementation
of an elliptic curve processor (ECP) for lightweight devices is a challenge. In
this work, an efficient processor is proposed for ECC that aims to reduce the
number of registers relative to those that have been presented in the literature.
Forwarding paths in the ECP were employed as a means of avoiding the
writing/reading of short variables to/from the register file. The proposed ECP
design was implemented over $\mathbb{F}_{2^{163}}$ on a Xilinx XC4VLX200 FPGA
device in order to verify its functionality and to measure its performance. The
results of this work show a saving in area of up to 38% for the number of
flip-flops (#FF) and up to 27% with respect to the number of look-up tables
(#LUTs). The performance overhead is 1.8 ns added to the critical path of the
ECP. When constrained devices must be used for implementing a public key
cryptographic system, an ECP implementation with a smaller key size can be
employed. This study has demonstrated the small-area implementation of an ECP for
both projective and mixed coordinates. Using the proposed model of register
allocation and operation scheduling, area savings of approximately 38% have been
demonstrated at the point arithmetic level. The ECP has been implemented using
FPGAs. An improved register management scheme has been developed for point
multiplication using projective and mixed López-Dahab coordinates as well as
forwarding paths that lead to a reduction in area and in dynamic power. Point
multiplication computational times of 1642.88 µs and 1271.99 µs have been
achieved for projective and mixed coordinates, respectively. These results show
that optimizing the number of write-backs into the register file, analyzing the
data dependency of intermediate variables, and scheduling scalar multiplication
operations must be carefully handled in order to save both area and power.
Chapter 5
Efficient Implementation of Genus 2
Hyperelliptic Curve Algorithms
5.1 Introduction
Hyperelliptic Curve Cryptography (HECC) was proposed in 1989 by Koblitz [48]
and can be seen as a generalization of elliptic curve cryptography (ECC). The
main advantage of HECC is that it provides the same level of security as ECC,
but the ground field is only half the size. For example, a hyperelliptic curve
of genus 2 over $\mathbb{F}_{2^{83}}$ can provide the same level of security as an
elliptic curve defined over $\mathbb{F}_{2^{163}}$, and such shorter operands
appear promising for applications in constrained environments. While ECC
applications are highly developed in practice, the use of HECC is still only of
academic interest. However, the transformation in the group of points of an
elliptic curve is accepted as a modern public key primitive [4]. Recently
published works, such as [79], have shown that the transformation in the Jacobian
of HECC is considered the most promising substitute for ECC and that the
performance of HECC can be compared to that of ECC. The cryptographic
transformation in the Jacobian is also based on the scalar multiplication [48] of
reduced divisors, called divisor multiplication for the purposes of this study.
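The field-size comparison above can be made slightly more concrete with the Hasse-Weil bound, a standard result that is not stated in the text and is added here only as a worked illustration. For a curve $C$ of genus $g$ over $\mathbb{F}_q$, the order of the Jacobian is close to $q^g$, so a genus 2 curve over a field of roughly half the bit length yields a group of roughly the same size as an elliptic curve:

```latex
\[
  (\sqrt{q} - 1)^{2g} \;\le\; \#J_C(\mathbb{F}_q) \;\le\; (\sqrt{q} + 1)^{2g},
  \qquad\text{so}\qquad \#J_C(\mathbb{F}_q) \approx q^{g}.
\]
\[
  g = 1,\; q = 2^{163}:\quad \#E(\mathbb{F}_q) \approx 2^{163};
  \qquad
  g = 2,\; q = 2^{83}:\quad \#J_C(\mathbb{F}_q) \approx 2^{166}.
\]
```

The security comparison then rests on the group orders being of similar magnitude; the precise security level also depends on the best known attacks for each genus, which this estimate does not address.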
The main emphasis of this chapter is the application of architecture synthesis
and optimization techniques to algorithms with the goal of developing efficient
explicit formulas over a finite field of even characteristic. Efficient explicit
formulas are presented for a variety of inversion-free coordinate systems:
projective, new weighted, and recent coordinates. These three systems enable the
avoidance of inversions in the group operation. To reduce the number of registers,
architectural synthesis techniques were applied, which include consideration of
the greatest power consumer in the HECC processor. Architecture optimization
techniques were next applied in order to analyze the overall tradeoffs between
area, power consumption, and computation time. A brief overview of previous
work related to the implementation of HECC processor hardware is included.
This work also describes the hardware architecture utilized for HECC for the
three inversion-free coordinates and presents an efficient architecture for
implementing divisor addition, divisor doubling, and the calculation of the
divisor multiplication. Our goal was to determine the lower limits of an
area/power public key processor for HECC curves. To this end, predictability was
sacrificed in order to design inversion-free coordinates for characteristic 2
curves, which is quite reasonable for constrained devices. All of the proposed
architectures are both area and power optimized. Based on the author's previous
implementation on ECC processors, the memory and register requirements for
storing points and temporary variables can contribute substantially (more than
58%) to the overall size of an ECC processor. The goal of this study was hence an
efficient architecture that requires less memory and fewer registers even if it
entails a small computational disadvantage. To the best of the author's
knowledge, this work is the first to use synthesis and optimization to develop an
efficient hardware architecture for projective, new weighted, and recent
coordinates over a binary field.
5.2 Previous Work related to HECC Hardware
Implementation
Relatively few HECC hardware implementations have been reported, e.g., [45, 94,
93]. The advantages of hardware implementations include better performance
and increased physical security compared with software solutions. This section
provides a brief summary of previous attempts to use hardware to implement an
HECC processor for performing divisor multiplication. The first work that
proposed an architecture for the hardware implementation of a hyperelliptic
cryptosystem was performed by Thomas Wollinger in 2001 [92]. The implementation
was based on consideration of the general form of Cantor's algorithm with its
underlying polynomial arithmetic. The implemented hyperelliptic curve of genus 2
is given by $C : y^2 + xy = x^5 + f_7 x^7 + f_3 x^3 + 1$. The author developed
computer architectures for the algorithms required for implementing the necessary
field and polynomial operations in the HECC hardware. The architectures were
developed for a reconfigurable platform based on field programmable gate arrays
(FPGAs).

Table 5.1: Timing results of previous HECC hardware implementations
Reference        Field                    G    ADD       DBL       MUL        Fmax (MHz)
[92]             $\mathbb{F}_{2^{81}}$    1    118 µs    71 µs     25 ms      54
[20]             $\mathbb{F}_{2^{83}}$    1    71 µs     62 µs     10 ms      45
[20]             $\mathbb{F}_{2^{83}}$    4    52 µs     43 µs     9 ms       45
[15]             $\mathbb{F}_{2^{113}}$   1    105 µs    90 µs     19 ms      45
[21]             $\mathbb{F}_{2^{163}}$   1    147 µs    123 µs    40 ms      45
[21]             $\mathbb{F}_{2^{163}}$   4    109 µs    85 µs     35 ms      45
[29]             $\mathbb{F}_{2^{113}}$   1    35.8 µs   30.7 µs   7.53 ms    45.6
[29]             $\mathbb{F}_{2^{113}}$   4    10 µs     8.58 µs   2.12 ms    46.7
Type 1 in [45]   $\mathbb{F}_{2^{83}}$    32   n.a.      n.a.      436 µs     62.9
Type 2 in [45]   $\mathbb{F}_{2^{83}}$    32   n.a.      n.a.      791 µs     50.1
Type 3 in [45]   $\mathbb{F}_{2^{83}}$    32   n.a.      n.a.      1 020 µs   50.5
The first complete hardware implementation of an HECC processor, however,
was presented in [15]. The final processor developed in that study included
two polynomial multipliers and one each of the other polynomial computation
blocks, including the adder, the divisor, and the squarer. It was also based
on Cantor's algorithm but included an alternative method derived from the
calculation of the greatest common divisor (GCD). Boston et al. [15] provided
actual time measurements on real hardware and presented concrete performance
results from a hardware-based genus 2 hyperelliptic curve coprocessor over
$\mathbb{F}_{2^{113}}$. The hardware implementation was conducted on a Xilinx
Virtex II FPGA. The authors used Verilog HDL and the Xilinx Integrated Software
Environment to synthesize and implement the logic design. The implementation was
based on consideration of the following curve: $y^2 + xy = x^5 + f_2 x^2 + 1$,
i.e., all coefficients are elements of $\mathbb{F}_2$. Although curves of this
type are Koblitz curves, the authors made no use of the Frobenius automorphism.
Reference [58] provides additional details about the Frobenius automorphism.
In [20, 21] the authors presented extended results for the work reported in
[15]. They implemented an HECC coprocessor using a variety of base fields ranging
from $\mathbb{F}_{2^{83}}$ to $\mathbb{F}_{2^{163}}$ with two different multiplier
digit sizes: G = 1 bit and G = 4 bits. The divisor multiplication took between
9 ms and 40 ms and used between 22 000 and 118 000 slices. Based on the
implementation numbers listed in Table 5.2, it seems that the design with
118 000 slices involves unreasonable hardware requirements because the maximum
available number of slices in a Xilinx Virtex-4, for example, is approximately
90 000.
In [45], the authors implemented three different designs of an HECC processor,
ranging from high-performance to moderate-area designs. They chose the
following parameters for their implementation: underlying field
$\mathbb{F}_{2^{89}}$, curve parameters $h(x) = x$ and $F(x) = x^5 + f_1 x + f_0$,
and explicit formulas based on affine coordinates [61]. In the case of the
high-performance design, which is Type 1, two independent arithmetic units are
used: one for group addition and one for group doubling. In the Type 2 design,
the HECC processor provides only one arithmetic unit shared between group
addition and group doubling. The final design presented in [45], Type 3, aimed
for a low area. In this type of design, the authors used only memory rather than
both memory and registers, claiming that the decoding logic for reading/writing
data from/to the register file is intrinsically implemented inside the memory,
which reduces the total design area. However, the total number of clock cycles
for the overall computation increases because of the expensive movement of data
from and to the memory. Reference [29] reported the first implementation of an
HECC processor on an FPGA based on the explicit formulas that use projective
coordinates, and the implementation was also over $\mathbb{F}_{2^{113}}$.
The timing and area results for all of these studies are shown in Table 5.1
and Table 5.2, respectively. The results that are achievable through projective
explicit formulas based on Harley's algorithm are far better than those that can
be achieved via Cantor's algorithm because the four-level hierarchy based on
Cantor's algorithm, that is, finite field arithmetic, polynomial ring
arithmetic, computations on the Jacobian of the curve, and divisor
multiplications, is reduced to a three-level hierarchy. Based on Lange's explicit
formulas for projective and mixed coordinates, the three levels of such a
hierarchy are finite field arithmetic, computations of divisor addition and
doubling, and divisor multiplication. In addition to the studies listed in
Table 5.1 and Table 5.2, the literature includes reports of some ASIC
implementations of HECC using projective coordinates. For example, in [84], the
author proposed an HECC processor using 130 nm CMOS technology. The processor
runs at 500 MHz and can perform one divisor multiplication of HECC over
$\mathbb{F}_{2^{83}}$ in 63 µs.

Table 5.2: Estimated area results in slices of previous HECC hardware implementation
Reference        Technology         Field                    G    Divisor ADD   Divisor DBL   Divisor MUL
[20]             Xilinx Virtex II   $\mathbb{F}_{2^{83}}$    1    10 400        10 200        22 000
[20]             Xilinx Virtex II   $\mathbb{F}_{2^{83}}$    4    29 700        29 300        60 000
[15]             Virtex II 2VP30    $\mathbb{F}_{2^{113}}$   1    16 600        15 100        30 816
[21]             Xilinx Virtex II   $\mathbb{F}_{2^{163}}$   1    20 400        20 100        42 000
[21]             Xilinx Virtex II   $\mathbb{F}_{2^{163}}$   4    58 400        57 600        118 000
[29]             XC2V8000           $\mathbb{F}_{2^{113}}$   1    9514          9052          22 183
[29]             XC2V8000           $\mathbb{F}_{2^{113}}$   4    10 988        10 087        25 911
Type 1 in [45]   XC2V4000           $\mathbb{F}_{2^{83}}$    32   n.a.          n.a.          9950
Type 2 in [45]   XC2V4000           $\mathbb{F}_{2^{83}}$    32   n.a.          n.a.          7096
Type 3 in [45]   XC2V4000           $\mathbb{F}_{2^{83}}$    32   n.a.          n.a.          4995
5.3 Explicit Formulas on Genus 2 Curves over a
Binary Field
The first attempt to avoid using Cantor's algorithm to increase the speed of
group operations in the Jacobian of hyperelliptic curves of genus 2 was made by
Robert Harley in 2000 [33]. He describes an efficient algorithm, carefully
optimized to reduce the number of required group operations. For simplicity,
Harley restricted his algorithm to odd characteristic, although all of the tricks
can be carried over to even characteristic. His basic concept was to explicitly
compute the divisor addition and divisor doubling algorithms using Cantor's
algorithm so that only field operations and no polynomial operations are
necessary. Further simplifications are obtained through the use of the Chinese
remainder theorem in the group addition algorithm and through the simplification
of the group doubling algorithm with the help of one Newton iteration. The latest
improvement of genus 2 HECC of odd characteristic was developed in [88]. The
authors used Montgomery's trick of simultaneous inversions to compute two
inverses by performing only one field inversion and three field multiplications.
The extension of the explicit formulas developed in [88] for arithmetic on genus 2
curves to fields of even characteristic and to arbitrary equations of the curve
was achieved by Tanja Lange [59]. Lange also presented timings for the
implementation of the formulas using a variety of libraries for the field
arithmetic and determined the exact number of operations needed for performing
the addition and doubling required in the most common cases.
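Montgomery's trick of simultaneous inversion mentioned above can be sketched in a few lines. The Python fragment below is an illustration only: it works in a prime field of odd characteristic (matching the setting of [88]), the prime `p` and the function name `simultaneous_inverse` are arbitrary choices, and the built-in `pow(x, -1, p)` (Python 3.8 or later) stands in for the field inverter.

```python
# Sketch: two inverses from one inversion and three multiplications.

def simultaneous_inverse(a, b, p):
    """Return (a^-1, b^-1) mod p for non-zero a and b."""
    ab = (a * b) % p             # multiplication 1
    ab_inv = pow(ab, -1, p)      # the single field inversion
    a_inv = (ab_inv * b) % p     # multiplication 2: (ab)^-1 * b = a^-1
    b_inv = (ab_inv * a) % p     # multiplication 3: (ab)^-1 * a = b^-1
    return a_inv, b_inv

p = 2**127 - 1                   # a Mersenne prime, chosen for illustration
a, b = 12345, 67890
a_inv, b_inv = simultaneous_inverse(a, b, p)
assert (a * a_inv) % p == 1 and (b * b_inv) % p == 1
```

The trade is worthwhile whenever one inversion costs more than three multiplications, which is typically the case in both software and hardware field arithmetic.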
An analysis of the complexity of the operations using the known methods of
arithmetic transforms in the Jacobian of genus 2 HECC over even and odd
characteristics demonstrates that the existing methods are already efficient but
that room is available for further refinement. Based on the classification of
genus 2 curves in even characteristic, the complexity of operations on recent
coordinates for Type 1 designs was determined using an assumed count of
$h_2 = 1$ and $f_4 = f_3 = f_2 = 0$. In this study, an efficient method of
hardware implementation for a genus 2 Harley algorithm over $\mathbb{F}_{2^m}$ is
presented. The new method is based on several active steps of architecture
synthesis and optimization, including register allocation, operation scheduling,
storage binding, and data path/control unit analysis. A comprehensive hardware
implementation for inversion-free hyperelliptic curve algorithms has been
demonstrated.
5.3.1 Cantor's Algorithms for Explicit Formula over Even
Characteristics
For general cases and computation purposes, a group operation is based on
Cantor's algorithm [16], which operates directly in Mumford's representation [74].
Cantor's original version worked only in odd characteristic and was extended
to all fields by Koblitz [48]. Chapter 2 gives the algorithms restricted to curves
over fields of characteristic two. To optimize the execution time of Cantor's
algorithm, one can write the steps for adding, doubling, and reducing divisors
explicitly, i.e., calculate the resulting divisor without the use of polynomial
arithmetic. This method results in formulas similar to those known for elliptic
curves. The concept behind explicit formulas is to replace the polynomial-based
form of Cantor's algorithm with a coefficient-based approach. These formulas are
dependent on whether the divisors are distinct (group addition) or equal (group
doubling). Using explicit formulas has a number of advantages [8] that result in a
significant reduction in the area and time required for the computation:
• In Cantor's algorithm, some of the partial computations may be performed
74
twice, with the only dierence being the names of the variables. In an
explicit formula, these duplications are avoided because those intermediate
values are held in the memory or in instant registers.
• The number of multiplications can be reduced [33] with the use of the
Karatsuba multiplication algorithm [42], the Chinese remainder theorem,
and a Newton iteration. The resulting explicit formulas are advantageous
in applications where a short computation time is critical.
• The goal of introducing inversion-free formulas [57] for calculating a group
operation on a genus 2 HECC can be achieved through the addition of
another coordinate to represent the elements of the divisor class group. The
resulting explicit formulas are advantageous in applications where inversions
are more expensive than multiplications.
• More efficient inversion-free explicit formulas have been developed for
projective and weighted coordinates on genus 2 for even characteristic when
h2 ∈ {0, 1} and for recent coordinates when h2 = 0. In addition, [22] has
provided a thorough comparison of arithmetic on hyperelliptic curves
of genus 2 in various coordinate systems.
Most of the previous work has attempted to optimize explicit formulas for
hyperelliptic curves in odd characteristic or to optimize the execution time in odd
characteristic. This study takes a slightly different approach, which optimizes
the number of registers required for storing intermediate values. In [72], related
work was conducted, with the goal of minimizing the memory requirement. After
the completion of the register allocation via variable liveness analysis, operation
scheduling via forwarding paths, and storage binding via efficient register binding
steps, the final efficient explicit formulas are determined for different
inversion-free coordinates. Table 5.3 shows the register complexity when various
inversion-free coordinate systems of divisor addition and divisor doubling are
optimized using the three optimizing techniques presented in Chapter 3. The third
column indicates the number of registers for new weighted coordinates obtained
by the method reported in [72]. The fourth and the last columns show the
number of registers obtained using the modified register allocation and the
operation scheduling and storage binding algorithms described in Chapter 3. The
following subsection introduces explicit formulas for recent coordinates in even
Table 5.3: Comparison of the register requirements for a variety of coordinate systems of

System             Ref.                      Register Reuse   Allocation   Scheduling & Binding
Projective Coordinates (P)
PDBL,  h2 ≠ 0      [22]                      n.a.             17           14
PDBL,  h2 = 0      [22]                      n.a.             13           10
PADD,  h2 ≠ 0      [61]                      n.a.             23           17
PADD,  h2 = 0      [93]                      n.a.             21           15
PmADD, h2 ≠ 0      [61]                      n.a.             23           17
PmADD, h2 = 0      [50]                      n.a.             19           13
New Weighted Coordinates (N)
NDBL,  h2 ≠ 0      [22]                      20               18           14
NDBL,  h2 = 0      [61]                      16               15           11
NADD,  h2 ≠ 0      [22]                      23               23           18
NADD,  h2 = 0      [61]                      23               22           17
NmADD, h2 ≠ 0      [61]                      20               22           17
NmADD, h2 = 0      [27]                      19               21           16
Recent Coordinates (R)
RDBL,  h2 ≠ 0      This Work (section 5.4)   n.a.             17           13
RDBL,  h2 = 0      [22]                      n.a.             11           10
RADD,  h2 ≠ 0      This Work (section 5.4)   n.a.             24           18
RADD,  h2 = 0      [22]                      n.a.             23           16
RmADD, h2 ≠ 0      This Work (section 5.4)   n.a.             22           15
RmADD, h2 = 0      [22]                      n.a.             19           13
characteristic when h2 ≠ 0 based on coordinate conversion from new weighted
coordinates in section 14.5 in [22] to recent coordinates.
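The idea of register allocation via variable liveness analysis can be illustrated with a small sketch. It is not the algorithm of Chapter 3; it is a generic greedy allocator for a straight-line program, under the assumptions that every variable is defined exactly once and that a destination may reuse the register of a source that dies in the same step. The function name `allocate` and the toy program are hypothetical.

```python
def allocate(program, inputs, outputs):
    """Greedy register allocation for a straight-line program.

    program: list of (dest, [sources]) in execution order.
    Returns (variable -> register index, number of registers used).
    """
    # Liveness: record the last step at which each variable is read.
    last_use = {v: -1 for v in inputs}
    for step, (dest, srcs) in enumerate(program):
        for s in srcs:
            last_use[s] = step
        last_use.setdefault(dest, step)
    for v in outputs:                      # outputs stay live to the end
        last_use[v] = len(program)

    reg_of, free, n_regs = {}, [], 0

    def take():
        nonlocal n_regs
        if free:
            free.sort()
            return free.pop(0)             # lowest-numbered free register
        n_regs += 1
        return n_regs - 1

    for v in inputs:
        reg_of[v] = take()
    for step, (dest, srcs) in enumerate(program):
        for s in set(srcs):
            if last_use[s] == step:        # source dies here: release it
                free.append(reg_of[s])
        reg_of[dest] = take()
        if last_use[dest] == step:         # value is never read afterwards
            free.append(reg_of[dest])
    return reg_of, n_regs


# Toy program with six variables (two inputs, four results).
program = [
    ("w1", ["U1"]),        # U1 is still needed later
    ("a",  ["U1", "U0"]),  # last use of U0
    ("b",  ["w1", "a"]),   # last use of w1 and a
    ("c",  ["b", "U1"]),   # last use of b and U1
]
regs, count = allocate(program, inputs=["U1", "U0"], outputs=["c"])
print(count, regs)
```

In this toy case the six variables fit into three registers because registers are released as soon as the value they hold is dead.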
5.3.2 Recent Coordinates in Even Characteristic When h2 ≠ 0
This section presents the algorithms for computing with a recent coordinate
system in even characteristic when h2 ≠ 0. The more uncommon case of h2 = 0
was considered in [22]. For the purpose of the current study, it was assumed that
h2 ≠ 0. With this approach, the algorithms were developed from Algorithm 14.47
and Algorithm 14.48 in [22] for new weighted coordinates in even characteristic.
New weighted coordinates N were presented by Lange [60]. For the general case
in even characteristic, it is most useful to use a set of coordinates extended by
means of precomputations. Let N denote $[U_1, U_0, V_1, V_0, Z_1, Z_2, z_1, z_2, z_3, z_4]$ with the
interpretation $u_i = U_i/Z_1^2$, $v_i = V_i/(Z_1^3 Z_2)$ and the precomputations
$z_1 = Z_1^2$, $z_2 = Z_2^2$, $z_3 = Z_1 Z_2$ and $z_4 = z_1 z_3$. Let R denote
$[U_1, U_0, V_1, V_0, Z, z]$ with $u_i = U_i/Z$, $v_i = V_i/Z^2$
and the precomputation $z = Z^2$. In Algorithms 5.1, 5.2, and 5.3 the intermediate
steps are listed for divisor doubling, addition, and mixed addition for recent
coordinates in even characteristic together with the number of multiplications
(MUL), squarings (SQR), and additions (ADD) needed. Because it is assumed
that h2, h1, f4 ∈ {0, 1}, operations with these coefficients are not counted. Explicit
formulas are derived simply based on the assumed method presented in [22]
for converting new weighted coordinates into recent coordinates and using the
explicit formulas for new weighted coordinates for even characteristic when h2 ≠ 0.
For example, the following steps show the development of doubling in recent
coordinates in even characteristic when h2 ≠ 0:
1. In step 1, the calculation of Z′2 is skipped if $Z_1 = Z_2$, $z_1 = Z$, $z_4 = Z^2$, with
$u_i = U_i/Z$, $v_i = V_i/Z^2$, is assumed.
2. Step 2, the computation of the almost inverse, requires no change; it amounts
to a renaming of the two intermediate variables $\tilde{V}_1$ and $w_3$.
3. The operation count of step 3 is reduced to 3MUL because two field
multiplications of $t_1 \leftarrow w_3 z_2 + V_1 h_2 z_3$ are saved through the use of
$t_1 \leftarrow Z(w_3 + V_1 h_2)$.
4. In step 4, the computation of $s_0 \leftarrow w_0 + U_0 w_1 z_1$ is replaced with
$s_0 \leftarrow w_0 + U_0 w_1 Z$.
5. In step 5, the computation of $z'_4$ is skipped if $Z' \leftarrow s_1 Z$, $z' \leftarrow Z'^2$
is assumed.
6. In step 6, if $z'_3 = Z'$ is assumed, the computation of $l_2 \leftarrow l_2 + S + h_2 z'_3$ is
replaced with $l_2 \leftarrow l_2 + S + h_2 Z'$.
For steps 7, 8 and 9, $U'_0 \leftarrow S_0 + Z' y$, $U'_1 \leftarrow h_2 Z'$,
$V'_1 \leftarrow w_1 + Z'(l_1 + R V_1 + U'_0) + z' h_1$,
and $V'_0 \leftarrow w_0 + Z'(l_0 + R V_0) + z' h_0$. Explicit formulas for computing divisor
addition and divisor mixed addition are derived in a similar manner and
are presented in Algorithms 5.2 and 5.3, respectively.
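The interpretation of recent coordinates given above ($u_i = U_i/Z$, $v_i = V_i/Z^2$, $z = Z^2$) can be made concrete with a short sketch. The toy field $\mathbb{F}_{2^8}$ with reduction polynomial $x^8 + x^4 + x^3 + x + 1$ is an assumption chosen only to keep the numbers small; it is not a field size used in this thesis, and the helper names are hypothetical. The last lines also check the rewriting used in step 3, which relies on $z_2 = z_3 = Z$ when $Z_1 = Z_2$.

```python
M = 8            # toy extension degree (assumption, for illustration only)
POLY = 0x11B     # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M); addition is XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # reduce as soon as the degree reaches M
            a ^= POLY
    return r


def gf_inv(a):
    """Inverse via a^(2^M - 2); needed only when leaving projective form."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    result, base, e = 1, a, (1 << M) - 2
    while e:
        if e & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        e >>= 1
    return result


def affine_to_recent(u1, u0, v1, v0, Z):
    """[U1, U0, V1, V0, Z, z] with Ui = ui*Z, Vi = vi*Z^2, z = Z^2."""
    z = gf_mul(Z, Z)
    return (gf_mul(u1, Z), gf_mul(u0, Z), gf_mul(v1, z), gf_mul(v0, z), Z, z)


def recent_to_affine(U1, U0, V1, V0, Z, z):
    zi = gf_inv(Z)
    zi2 = gf_mul(zi, zi)
    return (gf_mul(U1, zi), gf_mul(U0, zi), gf_mul(V1, zi2), gf_mul(V0, zi2))


affine = (3, 5, 7, 11)
recent = affine_to_recent(*affine, Z=0x53)
round_trip = recent_to_affine(*recent)

# Step 3 rewriting with z2 = z3 = Z (h2 = 1): one multiplication instead of two.
w3, V1, Z = 0x2A, 0x91, 0x53
lhs = gf_mul(w3, Z) ^ gf_mul(V1, Z)
rhs = gf_mul(Z, w3 ^ V1)
```

A single inversion appears only in `recent_to_affine`; the group operations themselves stay inversion-free.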
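Algorithm 5.1 Doubling in recent coordinates, g = 2, h2 ≠ 0, in even characteristic
Input: $D = [U_1, U_0, V_1, V_0, Z, z]$
Output: $[U'_1, U'_0, V'_1, V'_0, Z', z'] = [2]D$
1. compute resultant and precomputations [8MUL + 3SQR + 7ADD]
$\tilde{h}_1 \leftarrow Z h_1$ and $\tilde{h}_0 \leftarrow Z h_0$
$\tilde{V}_1 \leftarrow \tilde{h}_1 + h_2 U_1$ and $\tilde{V}_0 \leftarrow \tilde{h}_0 + h_2 U_0$
$w_0 \leftarrow V_1^2$, $w_1 \leftarrow U_1^2$ and $w_2 \leftarrow \tilde{h}_1^2 + h_2^2 w_1$
$w_3 \leftarrow Z(h_1 U_1 + h_2 U_0 + \tilde{h}_0) + h_2 w_1$
$r \leftarrow w_2 U_0 + \tilde{V}_0 w_3$, $\tilde{Z} \leftarrow Z r$
2. compute almost inverse
$\mathrm{inv}_1 \leftarrow \tilde{V}_1$ and $\mathrm{inv}_0 \leftarrow w_3$
3. compute t [3MUL + 6ADD]
$w_3 \leftarrow f_3 Z^2 + w_1$ and $t_1 \leftarrow Z(w_3 + V_1 h_2)$
$t_0 \leftarrow U_1 t_1 + w_0 + z(V_1 h_1 + V_0 h_2 + f_2 z)$
4. compute s = (t inv) mod u [6MUL + 6ADD]
$w_0 \leftarrow t_0\,\mathrm{inv}_0$ and $w_1 \leftarrow t_1\,\mathrm{inv}_1$
$s_1 \leftarrow (\mathrm{inv}_0 + \mathrm{inv}_1)(t_0 + t_1) + w_0 + w_1(1 + U_1)$
$s_0 \leftarrow w_0 + U_0 w_1 Z$
5. precomputations [6MUL + 2SQR + 2ADD]
$y \leftarrow h_2 s_0 + s_1(h_2 U_1 + \tilde{h}_1)$, $Z' \leftarrow s_1 Z$, $S_0 \leftarrow s_0^2$ and $s_1 \leftarrow Z' s_1$
$S \leftarrow s_0 Z'$, $R \leftarrow \tilde{Z} Z'$, $s_0 \leftarrow s_0 s_1$ and $z' \leftarrow Z'^2$
6. compute l [3MUL + 6ADD]
$l_2 \leftarrow s_1 U_1$, $l_0 \leftarrow s_0 U_0$ and $l_1 \leftarrow (s_1 + s_0)(U_1 + U_0) + l_0 + l_2$
$l_2 \leftarrow l_2 + S + h_2 Z'$
7. compute U′ [1MUL + 1ADD]
$U'_0 \leftarrow S_0 + Z' y$ and $U'_1 \leftarrow h_2 Z'$
8. precomputations [2MUL + 1ADD]
$l_2 \leftarrow l_2 + U'_1$, $w_0 \leftarrow l_2 U'_0$ and $w_1 \leftarrow l_2 U'_1$
9. compute V′ [6MUL + 7ADD]
$V'_1 \leftarrow w_1 + Z'(l_1 + R V_1 + U'_0) + z' h_1$
$V'_0 \leftarrow w_0 + Z'(l_0 + R V_0) + z' h_0$
return $(U'_1, U'_0, V'_1, V'_0, Z', z')$ [total complexity: 35MUL + 5SQR + 36ADD]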
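Algorithm 5.2 Addition in recent coordinates, g = 2, h2 ≠ 0, in even characteristic
Input: Two divisor classes D1 and D2 represented by $D_1 = [U_{11}, U_{10}, V_{11}, V_{10}, Z_1, z_1]$ and
$D_2 = [U_{21}, U_{20}, V_{21}, V_{20}, Z_2, z_2]$.
Output: $[U'_1, U'_0, V'_1, V'_0, Z', z'] = D_1 \oplus D_2$.
1. precomputations [5MUL]
$\tilde{U}_{21} \leftarrow U_{21} Z_1$, $\tilde{U}_{20} \leftarrow U_{20} Z_1$, $\tilde{V}_{21} \leftarrow V_{21} z_1$ and $\tilde{V}_{20} \leftarrow V_{20} z_1$
$\tilde{Z}_1 \leftarrow Z_1 Z_2$
2. compute resultant r = Res(U1, U2) [8MUL + 1SQR + 4ADD]
$y_1 \leftarrow U_{11} Z_2 + \tilde{U}_{21}$, $y_2 \leftarrow U_{10} Z_2 + \tilde{U}_{20}$ and $y_3 \leftarrow U_{11} y_1 + y_2 Z_1$
$r \leftarrow y_2 y_3 + y_1^2 U_{10}$, $\tilde{Z}_2 \leftarrow r \tilde{Z}_1$ and $Z' \leftarrow \tilde{Z}_2 \tilde{Z}_1$
3. compute almost inverse of u2 modulo u1
$\mathrm{inv}_1 \leftarrow y_1$ and $\mathrm{inv}_0 \leftarrow y_3$
4. compute s [8MUL + 8ADD]
$w_0 \leftarrow V_{10} z_2 + \tilde{V}_{20}$, $w_1 \leftarrow V_{11} z_2 + \tilde{V}_{21}$, $w_2 \leftarrow \mathrm{inv}_0 w_0$ and $w_3 \leftarrow \mathrm{inv}_1 w_1$
$s_1 \leftarrow (\mathrm{inv}_0 + Z_1 \mathrm{inv}_1)(w_0 + w_1) + w_2 + w_3(Z_1 + U_{11})$
$s_0 \leftarrow w_2 + U_{10} w_3$
5. precomputations [8MUL + 2SQR + 1ADD]
$\tilde{s}_0 \leftarrow s_0 \tilde{Z}_1$, $S_0 \leftarrow \tilde{s}_0^2$, $Z' \leftarrow s_1 \tilde{Z}_1$ and $R \leftarrow r Z'$
$y_4 \leftarrow s_1(y_1 + \tilde{U}_{21})$, $U'_1 \leftarrow y_1 s_1$, $s_1 \leftarrow s_1 Z'$ and $s_0 \leftarrow s_0 Z'$
$z' \leftarrow Z'^2$ and $\tilde{h}_1 \leftarrow h_1 Z'$
6. compute l [3MUL + 2ADD]
$l_2 \leftarrow s_1 \tilde{U}_{21}$, $l_0 \leftarrow s_0 \tilde{U}_{20}$ and $l_1 \leftarrow (s_0 + s_1)(\tilde{U}_{20} + \tilde{U}_{21}) + l_0 + l_2$
7. compute U′ [5MUL + 7ADD]
$U'_0 \leftarrow S_0 + y_4 U'_1 + y_2 s_1 + Z'(h_2(\tilde{s}_0 + y_4) + y_1 \tilde{Z}_1) + \tilde{h}_1$
$U'_1 \leftarrow Z'(U'_1 + h_2)$
8. precomputations [3MUL + 3ADD]
$l_2 \leftarrow l_2 + Z'(\tilde{s}_0 + h_2) + U'_1$, $w_0 \leftarrow l_2 U'_0$ and $w_1 \leftarrow l_2 U'_1$
9. compute V′ [5MUL + 7ADD]
$V'_1 \leftarrow w_1 + Z'(l_1 + R \tilde{V}_{21} + U'_0 + \tilde{h}_1)$
$V'_0 \leftarrow w_0 + Z'(l_0 + R \tilde{V}_{20}) + z' h_0$
return $(U'_1, U'_0, V'_1, V'_0, Z', z')$ [total complexity: 45MUL + 3SQR + 32ADD]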
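Algorithm 5.3 Mixed addition in recent coordinates, g = 2, h2 ≠ 0, in even characteristic
Input: Two divisor classes D1 and D2 represented by $D_1 = [U_{11}, U_{10}, V_{11}, V_{10}]$ and
$D_2 = [U_{21}, U_{20}, V_{21}, V_{20}, Z_2, z_2]$.
Output: $[U'_1, U'_0, V'_1, V'_0, Z', z'] = D_1 \oplus D_2$.
1. compute resultant r = Res(U1, U2) [6MUL + 2SQR + 4ADD]
$y_1 \leftarrow U_{11} Z_2 + U_{21}$, $y_2 \leftarrow U_{10} Z_2 + U_{20}$ and $y_3 \leftarrow U_{11} y_1 + y_2$
$r \leftarrow y_2 y_3 + y_1^2 U_{10}$, $\tilde{Z}_2 \leftarrow r Z_2^2$ and $Z' \leftarrow \tilde{Z}_2$
2. compute almost inverse of u2 modulo u1
$\mathrm{inv}_1 \leftarrow y_1$ and $\mathrm{inv}_0 \leftarrow y_3$
3. compute s [7MUL + 7ADD]
$w_0 \leftarrow V_{10} z_2 + V_{20}$, $w_1 \leftarrow V_{11} z_2 + V_{21}$, $w_2 \leftarrow \mathrm{inv}_0 w_0$ and $w_3 \leftarrow \mathrm{inv}_1 w_1$
$s_1 \leftarrow (\mathrm{inv}_0 + \mathrm{inv}_1)(w_0 + w_1) + w_2 + w_3(1 + U_{11})$
$s_0 \leftarrow w_2 + U_{10} w_3$
4. precomputations [8MUL + 2SQR + 1ADD]
$\tilde{s}_0 \leftarrow s_0 Z_2$, $S_0 \leftarrow \tilde{s}_0^2$, $Z' \leftarrow s_1 Z_2$ and $R \leftarrow r Z'$
$y_4 \leftarrow s_1(y_1 + U_{21})$, $U'_1 \leftarrow y_1 s_1$, $s_1 \leftarrow s_1 Z'$ and $s_0 \leftarrow s_0 Z'$
$z' \leftarrow Z'^2$ and $\tilde{h}_1 \leftarrow h_1 Z'$
5. compute l [3MUL + 4ADD]
$l_2 \leftarrow s_1 U_{21}$, $l_0 \leftarrow s_0 U_{20}$ and $l_1 \leftarrow (s_0 + s_1)(U_{20} + U_{21}) + l_0 + l_2$
6. compute U′ [4MUL + 6ADD]
$U'_0 \leftarrow S_0 + y_4 U'_1 + y_2 s_1 + Z'(h_2(\tilde{s}_0 + y_4) + y_1 \tilde{Z}_2) + h_1$
$U'_1 \leftarrow Z'(U'_1 + h_2)$
7. precomputations [3MUL + 2ADD]
$l_2 \leftarrow l_2 + Z'(\tilde{s}_0 + h_2) + U'_1$, $w_0 \leftarrow l_2 U'_0$ and $w_1 \leftarrow l_2 U'_1$
8. compute V′ [5MUL + 7ADD]
$V'_1 \leftarrow w_1 + Z'(l_1 + R V_{21} + U'_0 + h_1)$
$V'_0 \leftarrow w_0 + Z'(l_0 + R V_{20}) + z' h_0$
return $(U'_1, U'_0, V'_1, V'_0, Z', z')$ [total complexity: 36MUL + 4SQR + 31ADD]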
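The total complexities quoted for Algorithms 5.1, 5.2, and 5.3 can be cross-checked against the per-step counts given in square brackets. The following sketch simply sums the bracketed annotations; the function name `tally` and the list layout are illustrative choices, and the cost strings are copied from the three algorithms.

```python
import re
from collections import Counter


def tally(step_costs):
    """Sum per-step operation counts written as e.g. '8MUL + 3SQR + 7ADD'."""
    total = Counter()
    for cost in step_costs:
        for n, op in re.findall(r"(\d+)\s*(MUL|SQR|ADD)", cost):
            total[op] += int(n)
    return total


# Steps without an annotation (the almost-inverse renaming) cost nothing.
ALG_5_1 = ["8MUL + 3SQR + 7ADD", "", "3MUL + 6ADD", "6MUL + 6ADD",
           "6MUL + 2SQR + 2ADD", "3MUL + 6ADD", "1MUL + 1ADD",
           "2MUL + 1ADD", "6MUL + 7ADD"]
ALG_5_2 = ["5MUL", "8MUL + 1SQR + 4ADD", "", "8MUL + 8ADD",
           "8MUL + 2SQR + 1ADD", "3MUL + 2ADD", "5MUL + 7ADD",
           "3MUL + 3ADD", "5MUL + 7ADD"]
ALG_5_3 = ["6MUL + 2SQR + 4ADD", "", "7MUL + 7ADD", "8MUL + 2SQR + 1ADD",
           "3MUL + 4ADD", "4MUL + 6ADD", "3MUL + 2ADD", "5MUL + 7ADD"]

for name, costs in [("5.1", ALG_5_1), ("5.2", ALG_5_2), ("5.3", ALG_5_3)]:
    print(name, dict(tally(costs)))
```

The sums agree with the stated totals of 35MUL + 5SQR + 36ADD, 45MUL + 3SQR + 32ADD, and 36MUL + 4SQR + 31ADD.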
5.4 Efficient Explicit Formula of Genus 2 over Binary Field
For HECC, two types of finite fields are currently under consideration: binary
and prime fields. A binary field offers far more options because there are many
choices for bases, irreducible polynomials, composite fields, etc. Explicit
formulas for curves of genus two have been studied extensively [38, 41, 80, 61], both
for binary and prime fields. The work in this thesis is an extensive study of the
challenge related to efficient hardware implementation of genus two hyperelliptic
Jacobians over a binary field for a variety of inversion-free coordinates.
First, a detailed explicit formula was presented for both doubling and
addition in recent coordinates for even characteristic in the case of h2 ≠ 0, as
shown in section 5.3.2. Methods of improving the best-known explicit
formulas in projective, new weighted, and recent coordinates for genus two curves
are explained in Chapter 3. Special models for register allocation via variable
liveness analysis, operation scheduling via forwarding paths, and storage binding
via efficient register spilling were used as appropriate in order to reformulate the
explicit formulas. The goals were to minimize the area and power and to
moderate the computational time penalties associated with architecture synthesis and
optimization problems, which have a greater impact as the field and group sizes
increase.
For the general case in even characteristic and h2 ≠ 0, it was assumed that
f4 = f3 = f2 = 0 and h2 = 1. The algorithms include h2, f3, and f2,
but their values were not included in the operation counts, and f4 was
completely omitted. It should be noted that different choices for f4, f3, and f2
give rise to different isomorphism classes while h(x) distinguishes only between
the two quadratic twists [22]. Sections A.1, A.2, and A.3 contain the efficient register
management formulas for new weighted, projective, and recent coordinates in even
characteristic, respectively. The algorithms included in these sections indicate
the register transfer statements at each step. For example, Algorithm 5.4 shows
the modified register management of the divisor doubling for recent coordinates
in even characteristic when h2 = 0. As explained in Chapter 3, the algorithm shows
that the design is implemented using ten registers R0, R1, R2, ..., R9. In addition,
at the beginning of the divisor doubling operation, it is assumed that five read
accesses from the memory are performed in order to store U1, U0, V1, V0 and Z in
R0, R1, R2, R3 and R4, respectively. At each step of the register file management,
each field operation is performed according to the explicit formula as developed
in [22]. To improve readability, Tables 5.4 and 5.5 present the application of the
RAVLA strategy developed in Chapter 3 and, more specifically, the contents of
each register as a function of the definition and use of each
variable. For example, in the second row, the long-lived variable z is read
from R5 to define another long-lived variable Z4. However, the short-lived variable
t1(T2) is defined in R8 and is used only in the next step by the statement
t1 ← t1(T1) + t1(T2). Tables 5.6 and 5.7 present the application of FP to
every short-lived variable, which leads to a reduction in the number of registers. For
performing divisor doubling in recent coordinates, applying OSFPs results in a
reduction in the number of registers by 1. However, for the other cases of performing
divisor doubling/addition in projective and new weighted coordinates, OSFPs
could lead to a reduction in the number of registers by 4.
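The effect of forwarding paths on short-lived variables can be sketched as follows. This is a simplified model rather than the OSFP algorithm of Chapter 3: it assumes that a value consumed exactly once, by the immediately following operation, can bypass the register file, and it reports the peak number of values that must be held in registers. The function name and the toy operation list are hypothetical.

```python
def peak_registers(program, inputs, outputs, forwarding=False):
    """Peak number of register-resident values in a straight-line program.

    program: list of (dest, [sources]); every variable is defined once.
    With forwarding=True, a result whose only use is the next operation
    is passed along a forwarding path and never occupies a register.
    """
    uses = {}
    for step, (_, srcs) in enumerate(program):
        for s in srcs:
            uses.setdefault(s, []).append(step)

    live = set(inputs)
    peak = len(live)
    for step, (dest, srcs) in enumerate(program):
        for s in set(srcs):
            if uses[s][-1] == step:          # last read: value is dead
                live.discard(s)
        forwarded = (forwarding and dest not in outputs
                     and uses.get(dest) == [step + 1])
        if not forwarded and (dest in outputs or dest in uses):
            live.add(dest)
        peak = max(peak, len(live))
    return peak


# Toy sequence: t, w3, and r are each used only by the next operation.
program = [
    ("w1", ["U1"]),
    ("t",  ["z"]),
    ("w3", ["t", "w1"]),
    ("r",  ["w3", "U1"]),
    ("q",  ["r", "z"]),
]
without_fp = peak_registers(program, ["U1", "z"], ["q"])
with_fp = peak_registers(program, ["U1", "z"], ["q"], forwarding=True)
print(without_fp, with_fp)
```

In this toy sequence the peak drops from four registers to three once the short-lived results are forwarded.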
5.5 Architecture Synthesis of an HECC Processor
This section explains the application of the architecture synthesis methods
described in Chapter 3 for improving the formulas presented in [59, 57, 60] for genus
2 curves over fields of even characteristic. Four main approaches were
followed:
1. Reduce the number of registers required. Because registers are usually used
for storing intermediate variables, reducing the number of registers as much
as possible is often a good idea, even if this change means slightly increased
computational time, as was the case for the elliptic curve described in
Chapter 4. The final optimized implementation must be capable of emulating
the conventional implementation of a global register file with a minimum
number of read/write ports and registers. During the implementation, the
interconnection of N register fields to M functional units may have a cost
proportional to N × M. The register file allocation process assigns values
to positions in single-ported register files. The execution of each
three-operand code instruction proceeds through three phases: reading input
values from the register file, executing the operation, and writing the result
back into the register field. Because register files are single-ported, distinct
values must be accessed concurrently for either reading or writing.
Algorithm 5.4 The modified register management of divisor doubling for recent coordinates
for an even characteristic when h2 = 0
Input: $[U_1, U_0, V_1, V_0, Z, z]$, $h = h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: $[U'_1, U'_0, V'_1, V'_0, Z', z'] = 2[U_1, U_0, V_1, V_0, Z, z]$.
1: R0 ← U1, R1 ← U0, R2 ← V0, R3 ← Z, R4 ← z
2: R5 ← SQR (R4)
3: R1 ← SQR (R1)
4: R6 ← SQR (R0)
5: FP (MUL, ADD) ← MUL (R4, f3)
6: R7 ← ADD (FP (MUL, ADD), R6)
7: R6 ← MUL (R5, f0)
8: R2 ← SQR (R2)
9: R2 ← ADD (R2, R6)
10: R6 ← MUL (R2, R4)
11: R5 ← MUL (R1, R5)
12: R8 ← MUL (R1, R7)
13: R4 ← MUL (R4, R8)
14: R0 ← MUL (R0, R2)
15: FP (MUL, ADD) ← MUL (R0, R3)
16: R3 ← ADD (FP (MUL, ADD), R4)
17: FP (SQR, MUL) ← SQR (h1)
18: R0 ← MUL (FP (SQR, MUL), R5)
19: FP (MUL, ADD) ← MUL (R2, R7)
20: R7 ← ADD (FP (MUL, ADD), R0)
21: R5 ← MUL (R0, R5)
22: R0 ← MUL (R0, R6)
23: R3 ← SQR (R3)
24: U′0 ← ADD (R3, R0)
25: R3 ← SQR (R6)
26: FP (MUL, SQR) ← MUL (R2, V1)
27: R8 ← SQR (FP (MUL, SQR))
28: FP (MUL, ADD) ← MUL (R3, f2)
29: R8 ← ADD (FP (MUL, ADD), R8)
30: FP (MUL, ADD) ← MUL (R4, R7)
31: R4 ← ADD (FP (MUL, ADD), R8)
32: R4 ← MUL (R3, R4)
33: FP (MUL, ADD) ← MUL (R3, R5)
34: R4 ← MUL (FP (MUL, ADD), R4)
35: R0 ← SQR (h1)
36: R4 ← MUL (R0, R4)
37: R2 ← MUL (R2, R3)
38: R1 ← MUL (R1, R2)
39: FP (MUL, ADD) ← MUL (R7, R8)
40: R1 ← ADD (FP (MUL, ADD), R1)
41: R6 ← MUL (R1, R6)
42: R6 ← MUL (R0, R6)
2. Reduce the number of read/write operations for the register file. If the
number of read/write operations from the register file can be reduced, the
consumption of processing power will also decrease. Most explicit formulas
accomplish this goal by reducing the number of finite field operations.
However, with the methods presented in this thesis, further improvement was
obtained through the selection of faster explicit formulas, special register
allocation via variable liveness analysis, the combination of multiplication
and squaring or multiplication and addition operations using operation
scheduling via forwarding paths, efficient register spilling through storage
binding, and the overlapping of multiplication and addition or squaring
operations whenever there is no datapath dependence.
3. Reduce the amount of computation time. The goal here was to use a
special operation scheduling algorithm that makes use of the operation of the
squaring and addition units, each of which consumes only one clock cycle, in
parallel with finite field multiplication, which consumes a number of clock
cycles equal to at least half of the field
size. Although the performance of the design is not affected, this approach
reduces the total number of clock cycles needed for computing one divisor
addition or one divisor doubling. Further improvement was also obtained
through analyzing the liveness of the control unit and through
overlapping the divisor addition and divisor doubling operations within one divisor
multiplication, which leads to a reduction in the computational time either
at the divisor addition and divisor doubling level or at the divisor
multiplication level.
4. Reduce the amount of storage memory. The goal here was to apply the new
storage binding algorithm, which rearranges the storing of the intermediate
variables throughout the divisor addition and divisor doubling operations,
keeping the results of either the divisor doubling or divisor addition
operations in registers if they are to be used later in the next explicit formula.
This storage occurs continually because the intermediate results between
the doubling and the addition operations are usually computed towards
the end of the explicit formula. Keeping some of the results in registers
that are to be used by the next loop of the divisor multiplication operation
thus reduces the amount of storage memory that is being used to store the
Figure 5.1: Basic architecture of Hyperelliptic Curve Divisor Implementation
5.5.1 Conventional HECC processor
The HECC processor is, in fact, the hardware architecture for divisor
multiplication. The basic architecture consists of a control unit, block RAM (BRAM),
and a divisor addition/doubling unit, as shown in Figure 5.1. The architecture
of the divisor addition/doubling unit comprises four classes of blocks: a finite
field multiplier, finite field arithmetic units (i.e., a squarer and an adder), a
register file for storing intermediate results, and multiplexers for selecting and
forwarding intermediate results. The control unit receives the HECC parameters, reads a key
(or a scalar), and controls the divisor addition/doubling unit according to the
binary double-and-add divisor multiplication algorithm shown in section 2.3.5.
The divisor addition/doubling unit, on the other hand, is responsible for
computing all required field arithmetic operations. At the beginning of the scalar
multiplication operation, the BRAM is assumed to contain the scalar and a
divisor. These values must be maintained during the iterations of the HECC scalar
multiplication.
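The control flow that the control unit follows can be sketched as a generic binary double-and-add loop. The left-to-right bit order is an assumption, since the algorithm of section 2.3.5 is not reproduced here, and the group used in the example is a stand-in (integers modulo 101 under addition) rather than a Jacobian; only the sequencing of doubling and addition is illustrated.

```python
def scalar_multiply(k, D, add, double, identity):
    """Left-to-right binary double-and-add: returns k*D in an additive group.

    add(P, Q) and double(P) stand in for divisor addition and divisor
    doubling; identity is the neutral element of the group.
    """
    result = identity
    for bit in bin(k)[2:]:            # scan the scalar from the top bit down
        result = double(result)       # one doubling per scalar bit
        if bit == "1":
            result = add(result, D)   # one addition per set bit
    return result


# Stand-in group for illustration only: integers modulo 101 under addition.
N = 101
kD = scalar_multiply(
    45, 7,
    add=lambda p, q: (p + q) % N,
    double=lambda p: (2 * p) % N,
    identity=0,
)
print(kD)
```

Because the base divisor D is reused in every addition, it has to remain available for the whole loop, which matches the requirement above that the scalar and the divisor be maintained during the iterations.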
For HECP, the processor must compute the scalar multiplication kD. Efficient
algorithms are required in order to implement the group operations as well as
the finite field arithmetic. The work for this thesis included the development
of different prototypes of processor architectures that are well suited for
implementation in field programmable gate arrays (FPGAs). For example, the user
can choose to have two finite field adders, one finite field multiplier, one finite
field squarer, etc. The conventional HECC processor consists of four main
components: the control unit (CU), the field arithmetic unit (FAU), the register file
(RF), and the multiplexers (MUXs). The control unit generates control signals
for the RF, FAU, and MUXs. The MUXs govern the data exchange between the
RF and the FAU. The FAU performs the field and group operations as described
in Chapter 2. The following subsections provide a more detailed description of
the other individual components.
5.5.1.1 Register File
A register file (RF) is one of the main blocks of digital electronic systems used in
computers, communications, and embedded systems. Flip-flops are a
fundamental building block of an RF. In synchronous systems, they are responsible for
storing and sampling data with respect to the clock signal for sequential circuit
designs in both pipelined systems and state machines. In asynchronous systems,
flip-flops often act as synchronizers that interface between two unrelated signals
that may operate at different frequencies and phases. Traditional RF designs
have focused primarily on a balanced design tradeoff between delay and power,
as indicated by the optimum power-delay-product (PDP) value. In this work,
register liveness analysis was incorporated into the HECC processor design along
with performance and power consumption considerations in order to produce a
high-performance and low-power design. Register liveness analysis is an
optimization technique that has been used to minimize the number of read/write
ports of the RF and has been described in Chapter 3.
Multi-Port Register File This register file has two write ports, including
one asynchronous write port, and two read ports, including one asynchronous
read port. Using such a register file accomplishes multiple goals for tuning
the area, power, and performance of register files per execution iteration of the
algorithm. The optimized specification for the HECC processor is expressed as
three-operand code for a conventional implementation with a multi-port register
file. In optimized implementations, positions in a conventional HECC processor
register file are referred to as variables. A three-operand code operation R1 ⇐
R2 + R3 is said to define the variable stored in register R1 and to use the variables
stored in registers R2 and R3. Using a multi-port register file for storing the values of
the variables creates problems. One problem is associated with computing the
minimum number of ports required to access as many variables as needed. If
each variable always accesses the register file through the same port, then the
problem is reduced. Storage binding, which was described in section 3.2.3, has
been used to bind variables to ports. In this work, a register file is assumed
to have two ports for either read or write, thus requiring one cycle per access.
5.5.1.2 Multiplexers
The multiplexer (MUX), one of the basic datapath connection elements,
contributes a significant amount of area cost, especially for FPGA designs. A recent
study from Altera, based on the analysis of 100 customer designs, stated that
multiplexers account for 26% of the logic element utilization. Optimizing
multiplexer use is therefore very important for the overall quality of digital designs.
Behavioural synthesis, which compiles designs specified in high-level languages
into register-transfer level code, determines the main micro-architecture of
designs and thus has a significant impact on the quality of a design. Behavioural
synthesis consists of three basic stages: register allocation, operation
scheduling, and storage binding. Allocation determines how many instances of each
type of resource (functional units or registers) are needed; scheduling determines
when a computational operation is executed; storage binding assigns operations
(or variables) to either the memory or a register. Each of these steps has
an influence on multiplexer utilization. In this work, the first two steps were
assumed to be already completed, and the focus was on only the third step,
storage binding, in order to optimize multiplexer use.
Internal Data Forwarding A very serious difficulty with a genus 2 explicit
formula is the number of registers used to store the intermediate variables
involved. An implementation of the divisor addition operation that does not
take register requirements into account would use up to 30 registers.
The size of each register is the same as the field size. This number of
registers is too large for constrained environments. The area of the register file can
be further reduced through the use of internal data forwarding among multiple
functional units. An internal data forwarding conflict is a situation in which an
algorithm statement refers to the data of a preceding statement. The technique
used to discover the internal data forwarding conflicts among statements is called
dependence analysis. Consider two algorithm statements S1 and S2 with S1
occurring before S2. The three possible types of conflicts are true dependency, load
dependency, and anti dependency.
A true dependency, also known as read after write, occurs when a statement
depends on the result of a previous statement. In other words, if statement S2
tries to read a source variable before statement S1 writes it, statement S2
incorrectly gets the old value. A load dependency, also known as write after
write, occurs when the ordering of statements affects the final load value of
a variable. For example, statement S2 tries to write an operand before it is
written by statement S1. The write operations are then performed in the wrong
order, leaving the value in the destination written by statement S1 rather than
by S2. An anti dependency, also known as write after read, occurs when a
statement reads a variable that is later updated by a write operation. If statement
S2 tries to write a destination value before it is read by statement S1, statement
S1 incorrectly gets the new value. This conflict arises because some statements write
results early in the algorithm and other statements read operands late in the
algorithm. For optimal dependence analysis, data forwarding conflicts must be
traced by choosing an optimal statement schedule.
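The three conflict types can be detected mechanically from the read and write sets of two statements. The sketch below is a generic illustration of dependence analysis, not the analysis tool used in this work; statements are modelled as hypothetical (destination, sources) pairs in the register-transfer style of Algorithm 5.4.

```python
def classify(s1, s2):
    """Dependencies of s2 on s1, where s1 occurs before s2.

    Each statement is (dest, [sources]), e.g. ("R7", ["FP", "R6"]).
    """
    w1, r1 = {s1[0]}, set(s1[1])
    w2, r2 = {s2[0]}, set(s2[1])
    found = []
    if w1 & r2:
        found.append("true dependency (read after write)")
    if w1 & w2:
        found.append("load dependency (write after write)")
    if r1 & w2:
        found.append("anti dependency (write after read)")
    return found


# R6 <- SQR(R0) followed by R7 <- ADD(FP, R6): S2 reads what S1 wrote.
print(classify(("R6", ["R0"]), ("R7", ["FP", "R6"])))

# R2 <- SQR(R2) followed by R2 <- ADD(R2, R6): all three conflicts on R2.
print(classify(("R2", ["R2"]), ("R2", ["R2", "R6"])))

# R5 <- SQR(R4) followed by R1 <- SQR(R1): independent statements.
print(classify(("R5", ["R4"]), ("R1", ["R1"])))
```

Statements for which the returned list is empty can be reordered or executed in parallel; any non-empty result constrains the schedule.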
5.6 Architecture Optimization for an HECC Processor
To demonstrate the accuracy of the newly derived formulas, all of the group
operations introduced in this thesis were implemented on a Xilinx FPGA. An
additional goal was to show that an HECC can achieve the same performance
level as an ECC and, in some cases, even outperform an ECC. A variety of
hardware architectures for genus 2 HECC were investigated, including ones that
use projective, new weighted, and recent coordinate systems that can consider a
group order of approximately $2^{80}$. The optimization of the HECC was investigated
at three levels: the field operation level, the group operation level, and the
scalar multiplication level. The study included analysis of tradeoffs between
parallelization options, latency, read/write power consumption, and area/time
optimized configurations.
5.6.1 Macroscopic Structural View of HECC Datapath
A datapath is an interconnection of resources, which implement finite field
arithmetic operations; steering logic such as multiplexers,
which send data to the appropriate destination at the appropriate time; and
registers or memory arrays for storing data, input/output ports, and the control
unit. For example, the inputs, outputs, and controls of the multiplexers have to
be specified and connected. An example of one possible macroscopic structure
for an HECC processor is shown in Figures 5.2 and 5.3.
Figure 5.2: Structure view of new weighted coordinate divisor doubling datapath
Figure 5.2 shows a refined view of the datapath of the new weighted
coordinate divisor doubling with 11 registers, one multiplier, one squarer, and one
adder. It shows explicitly the interconnection of the multiplexers. For simplicity,
registers have been used to store the constants 1 and 0. Obviously, these registers can
be changed into a hard-wired implementation. The blocks cADD and cMUL have
been used to perform an addition and a multiplication with a constant,
respectively. The connections to the input/output ports are not shown.
Additional multiplexers are required to load registers with input data from memory.
The connections to the control unit are the enable signals for all registers and the
selectors of the multiplexers. The multiplier returns a done signal to the control
unit, indicating the completion of one finite field multiplication.
5.7 Hyperelliptic Curve Datapath Analysis
This section presents an FPGA implementation that includes consideration of
Figure 5.3: Structural view of new weighted coordinates divisor mixed addition datapath
goal of this section is to describe a general datapath analysis for an HECC
processor targeted at hardware implementation. Datapath analysis has been used
to study the area, power and performance of the processor architectures and
their implementation in FPGA logic.
5.7.1 Area of the HECC processor
Table 5.8 specifies the area complexity, the critical path delay, and the power
consumed by each datapath component in the HECC processor architecture. The
performance of a circuit is specified in terms of its critical path delay,
which corresponds to the longest combinatorial delay of the circuit. The
critical path delay defines the maximum clock frequency Fmax at which the
circuit can operate. When more resources are available, it is possible to
shorten the critical path. With the use of a digit-serial multiplier, as
described in Section 2.1, a finite field multiplication can be performed twice
as fast as with the bit-serial multiplier. The main finding is that, for an
area-optimized architecture, the design based on a bit-serial multiplier is
preferable to those using a digit-serial multiplier.
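As a hedged illustration of the bit-serial versus digit-serial tradeoff
described above, the following Python sketch models a
most-significant-digit-first multiplier over a binary field that consumes G
bits of one operand per iteration, so that one multiplication takes ⌈m/G⌉
iterations. The function name `digit_serial_mul`, the iteration-count model,
and the toy field GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x + 1
are assumptions chosen for illustration; this is not the multiplier of
Section 2.1.

```python
from math import ceil


def digit_serial_mul(a, b, m, poly, G):
    """Multiply a*b in GF(2^m), consuming G bits of b per iteration.

    a, b : field elements encoded as integers (bit i = coefficient of x^i)
    poly : reduction polynomial as an integer with bit m set
    G    : digit size; G = 1 corresponds to a bit-serial multiplier
    Returns (product, iterations); iterations is a rough model of clock cycles.
    """
    digits = ceil(m / G)
    acc = 0
    for d in reversed(range(digits)):        # most significant digit first
        # acc <- acc * x^G mod poly, as G shift-and-reduce steps
        for _ in range(G):
            acc <<= 1
            if acc >> m:
                acc ^= poly
        # acc <- acc + a * (current digit of b)
        digit = (b >> (d * G)) & ((1 << G) - 1)
        for j in range(G):
            if (digit >> j) & 1:
                t = a
                for _ in range(j):           # t <- a * x^j mod poly
                    t <<= 1
                    if t >> m:
                        t ^= poly
                acc ^= t
    return acc, digits


if __name__ == "__main__":
    # Toy field (assumption): GF(2^8) with x^8 + x^4 + x^3 + x + 1 (0x11B).
    for G in (1, 2, 4, 8):
        product, cycles = digit_serial_mul(0x57, 0x83, 8, 0x11B, G)
        print(f"G={G}: product=0x{product:02X}, iterations={cycles}")
```

Under this simple model, a field of m = 83 bits needs 83, 42, 21, 11 and 6
iterations for G = 1, 2, 4, 8 and 16, respectively. The model counts
iterations only; the extra partial-product logic needed for larger G is what
appears as additional area.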






























































































































































































































































































































































































































































































































































Figure 5.4: Area in number of slices for different digit-serial multipliers when $h_2 \neq 0$
(Xilinx XC4vlx200-11ff1513; designs P_DBL_ADD, P_DBL_mADD, N_DBL_ADD, N_DBL_mADD, R_DBL_ADD, R_DBL_mADD; digit sizes G = 1, 2, 4, 8, 16)
ten in VHDL the RTL description and synthesized it using Xilinx
XC4vlx200-11ff1513. Figure 5.4 and Figure 5.5 show the required area for the
proposed HECP for different inversion-free coordinates when $h_2 \neq 0$ and
$h_2 = 0$, respectively. As shown in the figures, the required areas are
proportional to the digit size G of the finite field multiplier.
5.7.2 Energy Consumption of the HECC Processor
Power is defined as the energy used over time, i.e. the rate at which energy is
expended. In other words, power is energy divided by time. Power is measured in
milliwatts, while energy is measured in microjoules. The following simple
equation governs dynamic power consumption:
$$P_{\text{dynamic}} = C V^2 f$$
where $C$ is the capacitance of the switching node, $V$ is the supply voltage,
and $f$ is the switching frequency. If the capacitance and supply voltage are
not changing, an increase in the switching frequency causes the dynamic power
to increase.
Figure 5.5: Area in number of slices for different digit-serial multipliers when $h_2 = 0$
(Xilinx XC4vlx200-11ff1513; designs P_DBL_ADD, P_DBL_mADD, N_DBL_ADD, N_DBL_mADD, R_DBL_ADD, R_DBL_mADD; digit sizes G = 1, 2, 4, 8, 16)
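The dynamic power relation $P = C V^2 f$ and the power/energy relationship
described in this subsection can be checked numerically with a small Python
sketch. The capacitance and supply voltage used below are hypothetical
placeholder values, not measurements from this work, and the function names
are assumptions made for illustration.

```python
def dynamic_power_w(c_farads, v_volts, f_hz):
    """Dynamic power P = C * V^2 * f, in watts."""
    return c_farads * v_volts ** 2 * f_hz


def energy_uj(power_mw, time_ms):
    """Energy = power * time; milliwatts times milliseconds gives microjoules."""
    return power_mw * time_ms


if __name__ == "__main__":
    C = 10e-12   # hypothetical switched capacitance (10 pF)
    V = 1.2      # hypothetical supply voltage (volts)
    for f_mhz in (33, 50, 70):
        p_mw = dynamic_power_w(C, V, f_mhz * 1e6) * 1e3
        print(f"f = {f_mhz} MHz -> dynamic power = {p_mw:.4f} mW")
    # Energy of a hypothetical 10 ms operation drawing 2 mW:
    print(f"energy = {energy_uj(2.0, 10.0):.1f} uJ")
```

With C and V held fixed, doubling f doubles the dynamic power in this model,
which is the behaviour described in the text. Energy additionally depends on
how long the operation runs.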
The design of low-energy cryptographic hardware is an active research area
due to the demand for portable applications of cryptographic algorithms. Power
reduction techniques have been proposed for all levels of the design hierarchy,
from algorithmic and architectural optimizations to circuit and technological
innovations. Reference [98] reported that switching activities dominate and
contribute more than 90% of the total power consumption. Reducing switching
activities has therefore become the major target in attempts at power
reduction. Pipelining has been used to reduce both the critical path
computation time and switching activities. Rather than using pipelining, this
work demonstrates that, as shown in Figure 5.6 and Figure 5.7, power reduction
can be achieved by reformulating the explicit formulas for performing HECC
algorithms through modifications to the register allocation, operation
scheduling, and storage binding procedures. Power is also reduced by increasing
the parallelism of the finite field multiplication operation and by rearranging
the datapath topology through a reduction in the number of read/write
operations from/to the register file using









[Figure: Dynamic power (mW) for P_DBL_ADD, P_DBL_mADD, N_DBL_ADD, N_DBL_mADD, R_DBL_ADD and R_DBL_mADD with digit sizes G = 1, 2, 4, 8, 16]
[Figure: Dynamic power (mW) for P_DBL_ADD, P_DBL_mADD, N_DBL_ADD, N_DBL_mADD, R_DBL_ADD and R_DBL_mADD with digit sizes G = 1, 2, 4, 8, 16]


















































Figure 5.8: Top-level Data path for Projective Coordinates Divisor Doubling-Addition
5.7.3 Improving the Performance of the HECC Processor
In this section we describe our work on improving the performance of computing
divisor multiplication for the HECC processor. The idea of reducing the
critical path was the starting point for our improvement. The critical path
plays an important role in the performance of a design on FPGAs. Good timing
performance requires that the critical path delay be as short as possible. It
is often acceptable to let the area increase within a tolerance limit if the
timing can be improved. This approach involved slightly increasing the number
of registers and reducing the size of the multiplexers. Although this generally
works against reducing area and power, such changes can be made in order to
improve the performance. Synthesis results showed that it was indeed faster to
use more registers.
5.8 Hyperelliptic Curve Control Analysis
The control unit liveness of shared resources is the main focus of this section.








[Figure: Energy (µJ) for P_DBL_ADD, P_DBL_mADD, N_DBL_ADD, N_DBL_mADD, R_DBL_ADD and R_DBL_mADD at 33 MHz, 50 MHz and 70 MHz]
Figure 5.10: Energy in µJ for different maximum frequencies when $h_2 = 0$
(designs P_DBL_ADD, P_DBL_mADD, N_DBL_ADD, N_DBL_mADD, R_DBL_ADD, R_DBL_mADD; clock frequencies 33 MHz, 50 MHz, 70 MHz)
by graphing the busy states of the shared resources, and second, by using this
graph to drive the formal analysis of the divisor multiplication specification.
If the goal of a designer is to lower the number of registers required, the
designer can trade the number of registers for additional latency; however,
overlapping should then be avoided, and a group operation should start only
when the previous one has finished. It should be noted that a single group
operation uses from 8 to 14 registers and that the maximum number of registers
is reached when two group operations overlap.
To achieve the primary goal of reducing the number of registers, lower register
usage was chosen over lower latency. In most cases, meeting this goal means
that the design should avoid overlapping between datapaths and should start a
divisor doubling operation, as shown in Figure 5.2, only when the previous
divisor doubling or mixed addition, as shown in Figure 5.3, has been completed.
This work demonstrates, however, that overlapping group operations, as shown in
Figure 5.11, is in fact necessary for minimizing the number of registers and
can be achieved after a specific liveness analysis of the resource controllers
at the divisor multiplication level, which was previously described in Section
3.3.2. For example, in the case of projective coordinates, the output consists
of five coefficients: five field elements that are neither produced at the same
time nor required for the initiation of the next group operation. This feature
means that once one field element has been computed, it can then be used by the
next group operation.
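The kind of liveness analysis referred to above can be sketched in a few lines
of Python. The sketch below takes a straight-line schedule of field operations
and reports the peak number of values that must be held in registers at the
same time. The function name, the single-assignment assumption, and the
four-step toy schedule are all hypothetical; they illustrate the general
technique and are not the register management procedure of Section 3.3.2.

```python
def max_live_registers(schedule, outputs):
    """Peak number of simultaneously live values in a straight-line schedule.

    schedule : list of (dest, sources) tuples, one per step, in execution
               order; every dest name is assumed to be assigned only once.
    outputs  : set of names that must still be held after the last step.
    A value is counted after step t if it already exists and is either read
    by a later step or is an output. A destination may therefore reuse the
    register of a source that is read for the last time in the same step.
    """
    first_def, last_use = {}, {}
    for step, (dest, sources) in enumerate(schedule):
        for s in sources:
            first_def.setdefault(s, -1)   # never defined earlier: an input
            last_use[s] = step
        first_def.setdefault(dest, step)
    peak = 0
    for t in range(-1, len(schedule)):    # t = -1 is the state before step 0
        live = 0
        for name, born in first_def.items():
            needed_later = last_use.get(name, -1) > t or name in outputs
            if born <= t and needed_later:
                live += 1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    # Hypothetical toy schedule with inputs a and b and output t3.
    toy = [
        ("t0", ("a", "b")),    # t0 = a * b
        ("t1", ("a", "b")),    # t1 = a + b
        ("t2", ("t0", "t1")),  # t2 = t0 * t1
        ("t3", ("t2", "a")),   # t3 = t2 + a
    ]
    print(max_live_registers(toy, {"t3"}))
```

Applying the same count to the concatenated schedule of two group operations
shows how register pressure changes when the second operation starts consuming
outputs of the first as soon as they are produced.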
A number of optimizations can be used to improve the proposed design, which was
previously described in Section 5.6. Namely, we drew the control flow diagram
as shown in Figure 5.11, so that it achieves overlapping between group
operations as well as field operations. The advantage of overlapping is that
less computation time needs to be controlled.
5.9 Experimental Results
This section presents the results obtained from the implementation of HECC on
an FPGA and an ASIC. FPGAs have potential advantages over ASIC implementations
in cryptographic applications for measuring dynamic power as well as algorithm
agility, speed facility, architecture efficiency, resource efficiency, ability to






































































































































































































































































































































































































































































































































































therefore been done to implement HECC on hardware devices, such as FPGAs.
Most of the implementations, however, are based on the algorithm introduced
by Cantor and not on explicit formulas. Accelerating this curve-based
arithmetic is possible in several ways: speeding up the finite field
arithmetic, especially multiplication; choosing a coordinate representation
that is more efficient; and increasing the speed of the divisor multiplication
operation.
In this work, Lange's explicit formulas for even characteristic and a special
polynomial $h(x)$, where $h_2 \neq 0$ and $h_2 = 0$, were used to develop
efficient explicit formulas for HECC. The focus was on techniques for
optimizing an implementation to make it suitable for constrained devices,
mainly through a reduction in area and power consumption. Register liveness
analysis, combined with data forwarding paths, was used in order to reduce the
number of registers from 23 to 14 for HECC new coordinate divisor addition,
from 19 to 14 for mixed HECC new coordinate divisor addition, and from 16 to 14
for HECC new coordinate divisor doubling. As a result, the overall HECC
processor requires from 40,021 to 49,819 gate elements, and the total divisor
multiplication time ranges from 10.08 ms to 15.82 ms.
The original goal of designing the HECP was to reduce the overall area. In
order to accomplish this, pipelining was excluded from the design since adding
pipeline stages would significantly increase the area of the resultant design.
The strategy pursued involved composing architecture synthesis processes for
the different inversion-free explicit formulas with the goal of reducing the
number of registers. After obtaining the register management, architecture
optimizations would be made, based on an analysis of the tradeoff between area,
power, and computation time.
An interesting new technology that evolved in the early 2000s is that of
creating an ASIC directly from an FPGA-based design. Many designers use FPGAs
for ASIC prototyping [90]. They use automated tools to implement their circuit
on FPGAs, and they then extensively test the circuit in its environment, for
example, in a prototype digital video player or a prototype satellite
communication chip. The FPGA-based prototype may be larger, costlier, and more
power-hungry than an ASIC-based implementation, but can be useful for detecting
and correcting errors in the circuit and for creating other components and
software that interact with the circuit. Once the designers are satisfied with
the circuit, automated tools can be used to reimplement the circuit on an ASIC.
The ASIC implementation traditionally did not utilize any information from the
FPGA implementation.
Table 5.9 and Table 5.10 show the results obtained with optimized explicit
formulas for a number of different inversion-free coordinate systems after
synthesis. The first column indicates the coordinate design system, and the
remainder of the table is divided into two sections. The center set of columns
shows the results using Xilinx Virtex-4 FPGAs, and the right-hand set shows the
results using synthesis tools based on ASIC compiler design.
In the FPGA section, the first column shows the number of flip-flops used for
synthesizing each design, while the second column shows the number of LUTs.
The next two columns present the number of slices and the computation time






































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Comparison of Elliptic and
Hyperelliptic
Elliptic curve cryptosystems (ECCs) are widely used in practical applications.
Cryptographic protocols based on ECC are fast and considered a secure
alternative to other common PKCs like RSA [78]. Hyperelliptic curve
cryptosystems (HECCs) have had very little practical relevance, in part because
the group operations on HECC are believed to have a relatively high
computational cost. Recent developments of shorter explicit formulas and
high-performance implementations could turn the trend from ECC towards HECC
[96]. In this chapter, the emphasis lies on the comparison of the complexity of
ECP and HECP based on FPGA hardware implementation. A comparison will be
derived to state whether an HECP or an ECP is more efficient in terms of energy
and time/area factors. For simplicity we restrict ourselves to characteristic
two and a security level of 80 bits. Therefore, we deal with field sizes of 163
bits for ECC and 83 bits for HECC of genus two. We will also be able to compare
the expected register requirements, area, performance, and energy factors of
different cryptosystems depending on the bit sizes and the properties of the
implemented NIST-recommended field libraries.





















































































































































































































































6.1 Complexity of Elliptic and Hyperelliptic Curve Processors
Developing hardware implementations of cryptographic algorithms for ultra-low
area and power devices is not as straightforward as compiling existing VHDL
code for a low-area and low-power FPGA or ASIC library. We carefully selected
several algorithms that seemed promising for an ultra-low area and power
implementation. We made extensive use of area and power saving techniques at
the architectural, logic, and system levels when we implemented the algorithms.
In most cases the speed of the algorithm is not as important as the area and
energy. At the low clock frequencies of these devices, leakage power is
dominant. Therefore, we minimized the energy by minimizing the circuit size.
A fair comparison between ECC and HECC is difficult to achieve due to the
different field sizes, types of operations, and the non-deterministic nature
of the HECC operations, in particular the computation of polynomial greatest
common divisors when using Cantor's algorithm. In addition, many of the
published ECC results contain platform-specific optimizations which vary
greatly between different implementations. The metric we use for measuring
area, power and computation time for ECC and HECC over characteristic two
fields is based on time/area ratios and energy tradeoffs rather than bit
complexity or specific timings. Relying on an efficient hardware implementation
obtained through the proposed architecture synthesis and optimization
methodology, we will be able to determine in which cases ECP is more time/area
and energy efficient than HECP in terms of inversion-free coordinates and vice
versa.
Table 6.1 compares the two inversion-free coordinates $LD_{\text{DBL-ADD}}$ and
$P_{\text{DBL-ADD}}$ when $h_2 = 0$ or $h_2 \neq 0$. A few observations can be
made from the table. First, the energy consumption and the computation time are
lower in ECC for $LD_{\text{DBL-ADD}}$ than in HECC for $P_{\text{DBL-ADD}}$.
Another observation is that the combined #FF and #LUT for $P_{\text{DBL-ADD}}$
significantly decreases the area requirements compared to the results obtained
for $LD_{\text{DBL-ADD}}$. In addition, we see that under certain conditions
genus 2 hyperelliptic curves can be more area efficient than ECC at the same
level of security. This result, however, requires the use of very specific
curves for HECC.
Our efforts to verify the correctness of the above-mentioned implementations
included a comparison of theoretical and practical results based upon a
bottom-up implementation for both ECP and HECP. Finite field and point or
divisor arithmetic components are coded and verified first. These components
are then interconnected using structural-style descriptions to form the
top-level component in which the scalar multiplication is implemented. This
approach constitutes a bottom-up implementation. Each component should be
defined so that it is individually verifiable. Each component, either finite
field arithmetic or divisor or point arithmetic, is independently verified
before being combined with other components to form a scalar multiplication
component. The scalar multiplication component is in turn independently
verified before being combined to form the complete system. The complete system
is then verified.
6.2 Elliptic Curve NIST-recommended Comparisons
In previous chapters, we discussed the methods for reducing the number of
registers for different inversion-free coordinates when performing scalar
multiplication based on either ECC or HECC. Another respect in which the
register file can be a problem is the size of each register. Because field
operations are performed on whole field elements rather than on partial field
elements, and the whole field element must be stable before a field arithmetic
operation, the register size is conventionally set equal to the field size. An
increasing field size then requires larger registers to store the intermediate
results; a fixed-register-size algorithm avoids this, but its computation time
may be larger. If we carefully design an ECC processor with a fixed register
size, the rate at which the area decreases can be larger than the rate at which
the computation time increases when the field size is increased. For the
following discussion, we will use the fields $\mathbb{F}_{2^{163}}$,
$\mathbb{F}_{2^{233}}$, $\mathbb{F}_{2^{283}}$, $\mathbb{F}_{2^{409}}$, and
$\mathbb{F}_{2^{571}}$ for ECC. We focus on a comparison of the register
requirements, area, power, and computational complexity.
In February 2000, FIPS 186-1 was revised by NIST to include the elliptic
curve digital signature algorithm (ECDSA) as specified in ANSI X9.62 [1] with
further recommendations for the selection of underlying finite fields and
elliptic curves; the revised standard is called FIPS 186-2 [2]. For binary
fields $\mathbb{F}_{2^m}$, $m$ was chosen so that there exists a random curve
of almost prime order over $\mathbb{F}_{2^m}$. FIPS 186-2 recommends five
binary fields: $\mathbb{F}_{2^{163}}$, $\mathbb{F}_{2^{233}}$,
$\mathbb{F}_{2^{283}}$, $\mathbb{F}_{2^{409}}$, and $\mathbb{F}_{2^{571}}$. For
each of the binary fields, one randomly selected elliptic curve and one Koblitz
curve were selected; see, for example, [37]. Table 6.2 lists the five
NIST-recommended binary fields and their reduction polynomials.
Table 6.2: NIST recommended binary finite fields and their reduction polynomials
Binary finite field      Reduction polynomial
$\mathbb{F}_{2^{163}}$   $F(x) = x^{163} + x^7 + x^6 + x^3 + 1$
$\mathbb{F}_{2^{233}}$   $F(x) = x^{233} + x^{74} + 1$
$\mathbb{F}_{2^{283}}$   $F(x) = x^{283} + x^{12} + x^7 + x^5 + 1$
$\mathbb{F}_{2^{409}}$   $F(x) = x^{409} + x^{87} + 1$
$\mathbb{F}_{2^{571}}$   $F(x) = x^{571} + x^{10} + x^5 + x^2 + 1$
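To make the role of the reduction polynomials in Table 6.2 concrete, the
following Python sketch multiplies two elements of $\mathbb{F}_{2^{163}}$ and
reduces the result modulo $F(x) = x^{163} + x^7 + x^6 + x^3 + 1$. Field
elements are encoded as Python integers whose bit $i$ is the coefficient of
$x^i$. The function names are assumptions made for illustration, and this is a
software model only, not the hardware multiplier used in this work.

```python
M = 163
# F(x) = x^163 + x^7 + x^6 + x^3 + 1, the first row of Table 6.2
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2m_reduce(c, m=M, poly=F):
    """Reduce the polynomial c modulo poly, which has degree m."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= poly << (i - m)   # cancel the x^i term
    return c


def gf2m_mul(a, b):
    """Multiply two field elements of GF(2^163)."""
    return gf2m_reduce(clmul(a, b))


if __name__ == "__main__":
    # x^162 * x = x^163, which reduces to x^7 + x^6 + x^3 + 1
    print(hex(gf2m_mul(1 << 162, 0b10)))   # prints 0xc9
```

Because the low-order part of each polynomial in Table 6.2 has only a few
terms of small degree, each reduction step touches only a handful of bit
positions.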
Based on the above discussion, this work proposes a fixed register file size
that can be implemented with only $w$ bits, independent of the $m$ bits of the
field size of $\mathbb{F}_{2^m}$, so that registers for storing intermediate
computational variables have only $w$ bits. Moreover, the work presented in
this chapter takes different field sizes into account and analyzes the result
by comparing the area, power and timings measured on hardware. In addition,
this work explains that using a fixed register file size can yield a smaller
ECP area compared to a conventional ECP with a variable register file size
based on the field size. For the theoretical comparison we use López-Dahab
projective and mixed coordinates on curves over different finite fields. To our
knowledge, López-Dahab coordinates are usually the system of choice for
hardware implementation of elliptic curves over binary fields.
In this work, for a field $\mathbb{F}_{2^m}$ and a register size of $w = 83$
bits, López-Dahab projective coordinates require $\lceil m/83 \rceil \times 4$
registers to store the intermediate variables of operations for a scalar
multiplication, depending on the underlying field size $m$, whereas
López-Dahab mixed coordinates require $\lceil m/83 \rceil \times 2$ registers.
The disadvantage is the increased number of computation cycles, which increases
the total computation time of performing the scalar multiplication. Table 6.3
shows that the results range from 6.149 ms to 71.390 ms for a scalar
multiplication using López-Dahab projective coordinates on curves over
different finite fields. Based on our cycle count calculation in Table 2.2, the
number of cycles has increased by 27,038 and 20,647 for
$LD_{\text{projective}}$ and $LD_{\text{mixed}}$, respectively.
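The register counts implied by the expressions $\lceil m/83 \rceil \times 4$
and $\lceil m/83 \rceil \times 2$ can be tabulated for the five NIST field
sizes with a short Python sketch. The function and parameter names are
assumptions made for illustration; only the formula itself is taken from the
text.

```python
NIST_FIELD_SIZES = (163, 233, 283, 409, 571)


def register_count(m, w=83, words_factor=4):
    """Number of w-bit registers: ceil(m / w) * words_factor.

    words_factor is 4 for López-Dahab projective coordinates and 2 for
    López-Dahab mixed coordinates, following the expressions in the text.
    """
    return -(-m // w) * words_factor   # integer ceiling division


if __name__ == "__main__":
    for m in NIST_FIELD_SIZES:
        projective = register_count(m, words_factor=4)
        mixed = register_count(m, words_factor=2)
        print(f"m = {m}: projective = {projective}, mixed = {mixed}")
    # projective: 8, 12, 16, 20, 28; mixed: 4, 6, 8, 10, 14
```

The counts grow in steps because $\lceil m/83 \rceil$ only changes when $m$
crosses a multiple of the 83-bit register width.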
Figure 6.1: Area in number of slices for different digit-serial multipliers for $LD_{\text{DBL-ADD}}$
(Xilinx XC4vlx200-11ff1513; digit sizes G = 1, 2, 4, 8)
Table 6.3 presents the total energy consumption and execution time values of
$LD_{\text{DBL-ADD}}$ and $LD_{\text{DBL-mADD}}$ for varying field sizes. A
lower number of registers is used to store the intermediate variables. A few
important observations can be made from the table. First, the energy
consumption as well as the computation time increases with the field size.
This is because the larger the field size, the more register objects are
allocated and the more clock cycles are needed.
On a closer look, it is observed that the energy and the execution time values
increase in a stepwise manner. Second, it is observed that for
$LD_{\text{DBL-mADD}}$ the energy values are lower than the corresponding
energy values for $LD_{\text{DBL-ADD}}$. This implies that the reduction of the
number of registers contributes to reducing the total energy consumption.
Lastly, it is observed that for each field size the normalized execution time
values are lower than the corresponding energy consumption values. This implies
that the reduction of the number of registers has more impact on reducing the
total energy consumption of the design than on reducing the computation time.
This observation is justified because the difference in energy per access to
the register is larger than the difference in its access time.
Figure 6.1 and Figure 6.2 depict the area in slices for all NIST-recommended
field sizes for $LD_{\text{DBL-ADD}}$ and $LD_{\text{DBL-mADD}}$, respectively.
From Figure 6.1, a couple of observations can be made for
$LD_{\text{DBL-ADD}}$ using different digit sizes for the finite field
multiplier. First, increases in the digit size and the field size both cause an
increase in the area in slices. Second, the comparison of
$LD_{\text{DBL-ADD}}$ with $LD_{\text{DBL-mADD}}$, in Figure 6.2, reveals that
for the
















[Figure 6.2: Area in slices for Xilinx XC4vlx200-11ff1513, $LD_{\text{DBL-mADD}}$, digit sizes G = 1, 2, 4, 8]
























































































































































































































































































































































































































































Conclusion and Future Direction
Security is a critical factor for ultra-low power devices due to their impact
on privacy, trust and control. Wireless sensor networks (WSNs) belong to a new
set of ultra-low power applications which make computing ubiquitous. WSNs are
quickly becoming a vital part of our infrastructure [23]. These applications
impose severe power and area constraints on the underlying hardware devices.
Traditional cryptographic algorithms are considered too bulky, complex and
power hungry for these devices. The goal of our research was to develop an
efficient hardware implementation for these constrained devices.
In this thesis we have considered architecture, algorithms and arithmetic for
curve-based cryptography, more precisely ECC and HECC. Hyperelliptic curves,
a generalization of elliptic curves, require smaller field sizes as the genus
increases. Hyperelliptic curves of genus $g$ achieve security equivalent to
that of ECC with a field size $1/g$ times the field size of ECC for $g \leq 2$.
For example, a genus 2 HECC with an operand length of 83 bits provides the same
level of security as ECC with an operand length of 163 bits. Over the past few
years, a number of researchers have attempted to increase the efficiency of
curve-based cryptography arithmetic. The group operations of HECC, however, are
considerably more complex. The most time-consuming operations in ECC and in
HECC are the ECSM and the HCDM, respectively. The emphasis of this work has
been on efficient hardware implementations with awareness of architecture
synthesis and optimization and on applications with resource-constrained
devices. There exist several ways to accelerate the curve-based arithmetic,
such as speeding up the finite field arithmetic, choosing a suitable coordinate
system for efficient group operations, and accelerating the ECSM or HCDM
operations. Most of these aspects have been considered in previous works, but
mainly aimed towards efficient software implementations. Nevertheless, a
hardware processor is a necessity in some constrained applications to minimize
the area, to lower the power consumption, and/or to speed up the processing
time.
First, we have focused on modeling architecture synthesis factors that should
lead to reduced register requirements, starting from providing efficient
explicit formulas for performing group operations in different inversion-free
coordinate systems. Second, we have developed the macroscopic structural view
of both the ECP and HECP datapaths for the different coordinate systems. Third,
we have generated the VHDL source code for synthesis and simulation using FPGA
and ASIC technologies. Fourth, we have discussed the possibilities of improving
the design through datapath and control unit analysis and investigated the
effect on area, power and computation time. Finally, we compared HECP and ECP
for the same level of security in terms of area, computation time, and energy.
It is widely recognized that data security will play a central role in future
information technology systems. The use of elliptic curves in cryptography can
be found in many applications. Elliptic curve algorithms are used for services
like key establishment and digital signatures. Hyperelliptic curve based
cryptosystems have been enjoying increasing attention in recent years. They
have long been considered not competitive with their elliptic curve based
counterparts because of the difficulty of constructing them, but the situation
has changed in the last few years. It is now possible to efficiently construct
hyperelliptic curves whose Jacobian has cryptographically good order. Long
operands are a major drawback in embedded applications such as cellular phones,
where memory and processing power are constrained. Curves of genus two over
binary fields require only 80 to 120-bit arithmetic, compared to numbers that
are typically 160-256 bits in length for elliptic curve cryptosystems, which
seems promising for many applications.
When constrained devices are to be used for implementing a public key
cryptographic system, one can use a smaller key size ECP implementation. We
have shown here a small-area implementation of an ECP in both projective and
mixed coordinates. Using register liveness analysis, we have observed
approximately 38% area savings at the point arithmetic level. The ECP has been
implemented using FPGAs. An improved register management scheme has been
developed for point multiplication using the López-Dahab projective and mixed
coordinates. We have also used forwarding paths that lead to a reduction in
area and dynamic power. Point multiplication computation times of 1652.88 µs
and 1271.99 µs have been achieved for projective and mixed coordinates,
respectively. These results show that optimizing the number of write-backs into
the register file, analyzing the data dependency of intermediate variables and
scheduling the scalar multiplication operations must be carefully handled to
save area and power.
7.1 Future Direction
Hardware implementations of cryptographic algorithms are vulnerable to
side-channel attacks. Side-channel attacks that are based on multiple
measurements of the same operation can potentially be countered by employing
masking techniques. Many protection measures assume an idealized hardware
model that is very expensive to meet with real hardware. In what follows,
we discuss some of the related open problems that can provide a basis for
future work.
• This work has focused on the hardware implementation of the divisor
arithmetic for genus ≤ 2. However, it is important to address the
implementation of genus > 2, including protection against side-channel attacks.
• We have presented a modified register management of different explicit
formulas for genus ≤ 2 hyperelliptic curves in projective, new weighted and
recent coordinates. These formulas allow inversion-free group arithmetic
for those curves. Such group operations are particularly interesting for
hardware implementations since we remove the need for an inverter unit,
which can produce significant savings in the area of the processor, or allow
more multiplier units for high performance. Developing a modified register
management for genus > 2 can be a subject of future work.
• In this work, we have reduced the number of registers required to store the
intermediate variables for both genus 1 and 2. However, the significant
increase in the number of intermediate variables for genus 3 and 4 makes
proposing an efficient register allocation, operation scheduling and storage
binding, aiming at low register requirements, an interesting topic for future
work.
• In this work, we have proposed a special model for register allocation and
operation scheduling, which can be practically implemented in software for
defining efficient explicit formulas for hyperelliptic curves of genus higher
than 2.
• In this work, we have developed a stand-alone hyperelliptic curve processor
for performing divisor multiplication based on inversion-free coordinates.
However, formal verification, which means a mathematical proof that a certain
design produces a correct output for all possible inputs, can be interesting
future work for both elliptic and hyperelliptic curve processors based on
special architecture synthesis models.
Bibliography
[1] ANSI X9.62, Public Key Cryptography for the Financial Services Industry:
The Elliptic Curve Digital Signature Algorithm (ECDSA), 1999.
[2] National Institute of Standards and Technology, Digital Signature Stan-
dard, FIPS publication 186-2, February 2000.
[3] Implementing cryptographic pairings on smartcards. In Cryptographic
Hardware and Embedded Systems - CHES 2006, 8th International Workshop,
pages 134–147, 2006.
[4] IEEE Std 1363-2000. IEEE standard specifications for public-key
cryptography. IEEE Computer Society, August 2000.
[5] G.B. Agnew, R.C. Mullin, and S.A. Vanstone. An implementation of elliptic
curve cryptosystems over GF(2^155). IEEE Journal on Selected Areas in
Communications, 11(5):804–813, 1993.
[6] Bijan Ansari and Anwar Hasan. High-Performance Architecture of Elliptic
Curve Scalar Multiplication. IEEE Transactions on Computers,
57(11):1443–1453, 2008.
[7] Andrew W. Appel and Maia Ginsburg. Modern Compiler Implementation in C.
2004.
[8] R. Avanzi, N. Thériault, and Z. Wang. Rethinking low genus hyperelliptic
Jacobian arithmetic over binary fields: Interplay of field arithmetic and
explicit formulae. Technical report, Cen-




[9] M. Aydos, T. Yanik, and C.K. Koc. High-speed implementation of an
ECC-based wireless authentication protocol on an ARM microprocessor.
Communications, IEE Proceedings-, 148(5):273279, 2001.
[10] L. Batina, J. Guajardo, T. Kerins, N. Mentens, P. Tuyls, and I. Ver-
bauwhede. An elliptic curve processor suitable for RFID-tags. Cryptology
ePrint Archive, Report 2006/227, 2006.
[11] Lejla Batina, Nele Mentens, Kazuo Sakiyama, Bart Preneel, and Ingrid
Verbauwhede. Public-key cryptography on the top of a needle. In Proc.
IEEE International Symposium on Circuits and Systems (ISCAS07), Spe-
cial Session on Novel Cryptographic Architectures for Low-Cost RFID,
pages 18311834, 2007.
[12] G. Bertoni, L. Breveglieri, and M. Venturi. Power aware design of an
elliptic curve coprocessor for 8 bit platforms. In Pervasive Computing
and Communications Workshops, 2006. PerCom Workshops 2006. Fourth
Annual IEEE International Conference on, pages 5341, 2006.
[13] Guido Bertoni, Jorge Guajardo, Sandeep Kumar, Gerardo Orlando,
Christof Paar, Thomas Wollinger, and Gerardo Orlando. Ecient GF(pm)
arithmetic architectures for cryptographic applications. In TOPICS IN
CRYPTOLOGY - CT RSA 2003, pages 158175. Springer-Verlag, 2003.
[14] I. F Blake, G. Seroussi, and N. P Smart. Elliptic curves in cryptogra-
phy. London Mathematical Society Lecture Note Series, 265, Cambridge
University Press, 1999.
[15] N. Boston, T. Clancy, Y. Liow, and J. Webster. Genus two hyperellip-
tic curve coprocessor. In In Workshop on Cryptographic Hardware and
Embedded Systems | CHES 2002, pages 400414. Springer-Verlag, 2002.
[16] David G. Cantor. Computing in the Jacobian of a hyperelliptic curve.
Mathematics of Computation, 48(177):95101, 1987.
[17] A.P. Chandrakasan and R. W. Brodersen. Low Power Digital CMOS De-
sign. Kluwer Academic Publishers, 1995.
121
[18] William N. Chelton and Mohammed Benaissa. Fast Elliptic Curve Cryp-
tography on FPGA. IEEE transactions on Very Large Scale Integrated
(VLSI) systems, 16(2):198205, February 2008.
[19] Hyun Min Choi, Chun Pyo Hong, and Chang Hoon Kim. FPGA implem-
ntation of high performance elliptic curve cryptographic processor over
GF(2163). Journal of Systems Architecture, 54(10):893900, April 2008.
[20] T. Clancy. Analysis of FPGA-based hyperelliptic curve cryosystems. Mas-
ter's thesis, University of Illinois, Urbana-Champaign, Illinois, 2002.
[21] T. Clancy. FPGA-based hyperelliptic curve cryptosystems. Invited paper
presented at AMS Central Section Meeting, 2003.
[22] Henri Cohen and Gerhard Frey, editors. Handbook of Elliptic and Hyper-
elliptic Curve Cryptography, volume 34 of Discrete Mathematics and Its
Applications. Chapman & Hall/CRC, 2005.
[23] David E. Culler and Hans Mulder. Smart sensors to network the world.
In Scientic American, pages 84-91, June 2004.
[24] Jan Denef and Frederik Vercauteren. An extension of Kedlaya's algorithm
to hyperelliptic curves in characteristic 2, 2002.
[25] Jean-Pierre Deschamps. Hardware Implementation of Finite-Field Arith-
metic. McGraw-Hill Professional, 1st edition, February 2009.
[26] Whiteld Die and Martin E. Hellman. New directions in cryptography.
IEEE Transactions on Information Theory, IT-22(6):644654, 1976.
[27] Sylvain Duquesne and Oberthur Card Systems. Classication of genus
2 curves over F2n and optimization of their arithmetic. cryptology eprint
archive: Report 2004/107.
[28] Thomas Eisenbarth, Sandeep Kumar, Christof Paar, Axel Poschmann, and
Leif Uhsadel. A survey of Lightweight-Cryptography implementations.
IEEE Des. Test, 24(6):522533, 2007.
[29] Grace Elias, Ali Miri, and Tet-Hin Yeap. On ecient implementation
of FPGA-based hyperelliptic curve cryptosystems. Comput. Electr. Eng.,
33:349366, September 2007.
122
[30] M. Fayed, M.W. El-Khamshi, and F. Gebali. A high-speed, low-area
processor array architecture for multiplication over GF(2m). In Interna-
tional Conference on Microelectronics (ICM), pages 6164, Cairo, Decem-
ber 2007.
[31] G. Gaubatz, J. Kaps, and B. Sunar. Public key cryptography in sensor
networks - revisited. In 1st European Workshop on Security in Ad-Hoc
and Sensor Networks (ESAS 2004), August 2004.
[32] G. Gaubatz, J.-P. Kaps, E. Ozturk, and B. Sunar. State of the art in ultra-
low power public key cryptography for wireless sensor networks. In Per-
vasive Computing and Communications Workshops, 2005. PerCom 2005
Workshops. Third IEEE International Conference on, pages 146150, 2005.
[33] Pierrick Gaudry and Robert Harley. Counting points on hyperelliptic
curves over nite elds. In Proceedings of the 4th International Symposium
on Algorithmic Number Theory, ANTS-IV, pages 313332, London, UK,
UK, 2000. Springer-Verlag.
[34] J. Goodman and A.P. Chandrakasan. An Energy-ecient Recongurable
Public-key Cryptography Processor. IEEE Journal of Solid-State Circuits,,
36(11):18081820, 2001.
[35] James Ross Goodman. Energy scalable recongurable cryptographic hard-
ware for portable applications. 2000. AAI0802715.
[36] Jorge Guajardo, Tim GÃneysu, Sandeep Kumar, Christof Paar, and Jan
Pelzl. Ecient hardware implementation of nite elds with applica-
tions to cryptography. Acta Applicandae Mathematicae: An International
Survey Journal on Applying Mathematics and Mathematical Applications,
93(1):75118, 2006.
[37] Darrel Hankerson, Alfred Menezes, and Scott Vanstone. Guide to Elliptic
Curve Cryptography. Springer-Verlag, New York, USA, 2004.
[38] R. Harley. Fast arithmetic on genus two curves. Technical report, Available
at http://cristal.inria.fr/harley/hyper, adding.txt and doubling.c, 2000.
[39] Anwar Hasan. ECE 720: Selected Topics in Cryptographic Computations
Lecture Notes. University of Waterloo, 2008.
123
[40] J. Heyszl and F. Stumpf. Ecient One-pass Entity Authentication Based
on ECC for Constrained Devices. In IEEE International Symposium on
Hardware-Oriented Security and Trust (HOST), pages 8893, Anaheim,
CA, July 2010.
[41] J. Chao K. Matsuo and S. Tsujii. Fast genus two hyperelliptic curve cryp-
tosystems. Technical report, ISEC2001-23, IEICE, pages 89-96, 2001.
[42] A. Karatsuba and Y. Ofman. Multiplication of multidigit numbers on
automata. In Sov. Phys. Dokl. (English translation), vol. 7, no. 7, pp.
595-596, 1963.
[43] Maurice Keller and William Marnane. Low power elliptic curve cryp-
tography. In Integrated Circuit and System Design. Power and Timing
Modeling, Optimization and Simulation, pages 310319. 2007.
[44] Chang Han Kim, Sangho Oh, and Jongin Lim. A new hardware architec-
ture for operations in GF(2n). IEEE Transactions on Computers, 51(1):90
92, 2002.
[45] Howon Kim, Thomas Wollinger, Yongje Choi, Kyoil Chung, and Christof
Paar. Hyperelliptic curve coprocessors on a FPGA. In Workshop on In-
formation Security Applications - WISA, Jeju Island, Korea, pages 2325.
Springer-Verlag, 2004.
[46] N. Koblitz. Algebraic Aspects of Cryptosystem. Springer, 1998.
[47] Neal Koblitz. Elliptic curve cryptosystems. Mathematics of Computation,
48(177):203209, 1987.
[48] Neal Koblitz. Hyperelliptic cryptosystems. J. Cryptol., 1:139150, January
1989.
[49] Unal Kocabas, Junfeng Fan, and Ingrid Verbauwhede. Implementation
of Binary Edwards Curves for Very-constrained Devices. In 2010 21st
IEEE International Conference on Application-specic Systems Architec-
tures and Processors (ASAP),, pages 185 191, july 2010.
[50] Vladislav Kovtun and Thomas Wollinger. Fast explicit formulae for genus 2
hyperelliptic curves using projective coordinates (updated), 2008. updated
vladislav.kovtun@gmail.com 13911 received 2 Feb 2008.
124
[51] S. Kumar and C. Paar. Are Standards Compliant Elliptic Curve Cryptosys-
tems Feasible on RFID? In Workshop Record of the ECRYPT Workshop
RFID Security, page 19, 2006.
[52] Sandeep S. Kumar. Elliptic Curve Cryptography for Constrained Devices.
VDM Verlag, Germany, August 2008.
[53] F. J. Kurdahi and A. C. Parker. Real: a program for register allocation. In
Proceedings of the 24th ACM/IEEE Design Automation Conference, DAC
'87, pages 210215, New York, NY, USA, 1987.
[54] Julio López and Ricardo Dahab. Fast multiplication on elliptic curves over
GF (2m) without precomputation. In Proceedings of the First International
Workshop on Cryptographic Hardware and Embedded Systems, CHES '99,
pages 316327, London, UK, 1999. Springer-Verlag.
[55] Jyu-Yuan Lai, Tzu-Yu Hung, Kai-Hsiang Yang, and Chih-Tsun Huang.
High-performance Architecture for Elliptic Curve Cryptography Over Bi-
nary Field. In IEEE International Symposium on Circuits and Systems
(ISCAS), pages 39333936, Paris, August 2010.
[56] T. Lange. Ecient Arithmetic on Hyperelliptic Curves. PhD thesis, Uni-
versity Gesamthochschule Essen, 2001.
[57] T. Lange. Inversion-free arithmetic on genus 2 hyperelliptic curves. Tech-
nical report, Crypology ePrint Archive, 2002.
[58] Tanja Lange. Ecient Arithmetic on Hyperelliptic Koblitz Curves. Tech-
nical report, Institute for Experimental Mathematics, 2001.
[59] Tanja Lange. Ecient Arithmetic on Genus 2 Hyperelliptic Curves over
Finite Fields via Explicit Formulae. In Cryptology ePrint archive, Report
2002/121, 2002.
[60] Tanja Lange. Weighted coordinates on genus 2 hyperelliptic curves. Tech-
nical report, International Association for Cryptologic Research (IACR)
Eprint archive, 2002. lange@itsc.ruhr-uni-bochum.de 12194 received 11
Oct 2002, last revised 22 May 2003.
125
[61] Tanja Lange. Formulae for arithmetic on genus 2 hyperelliptic curves. Ap-
plicable Algebra in Engineering, Communication and Computing, 15:295
328, 2003.
[62] Tanja Lange and Marc Stevens. Ecient doubling on genus two curves over
binary elds. In Proceedings of the 11th international conference on Se-
lected Areas in Cryptography, SAC'04, pages 170181, Berlin, Heidelberg,
2005. Springer-Verlag.
[63] Weng Fook Lee. VHDL Coding and Logic Synthesis with Synopsys. Aca-
demic Press, 1st edition, August 2000.
[64] Yong Ki Lee, K. Sakiyama, L. Batina, and I. Verbauwhede. Elliptic-Curve-
Based security processor for RFID. IEEE Transaction on Computers,
57(11):15141527, 2008.
[65] Yong Ki Lee and I. Verbauwhede. A Compact Architecture for Mont-
gomery Elliptic Curve Scalar Multiplication Processor. In Proc. Eighth
Inteernational Workshop Infromation Security Applications (WISA'07),
pages 115127, 2007.
[66] K.H. Leung, K.W. Ma, W.K. Wong, and P.H.W. Leong. FPGA imple-
mentation of a microcoded elliptic curve cryptographic processor. In 2000
IEEE Symposium on Field-Programmable Custom Computing Machines,
pages 6876, 2000.
[67] Peng Luo, Xinan Wang, Jun Feng, and Ying Xu. Low-power Hardware
Implemntation of ECC Processor Suitable for Low-cost RFID Tags. In 9th
International Conference on Solid-State and Integrated-Circuit Technology
(ICSICT), pages 16811684, 2008.
[68] M. McLoone and M. Robshaw. Public key cryptography and RFID tags.
In Topics in Cryptology CT-RSA 2007, pages 372384. 2006.
[69] Giovanni De Micheli. Synthesis and Optimization of Digital Circuits.
McGraw-Hill, Inc., 1994.
[70] Victor S Miller. Use of elliptic curves in cryptography. In Lecture notes in
computer sciences; 218 on Advances in cryptologyCRYPTO 85, pages
417426, New York, NY, USA, 1986. Springer-Verlag New York, Inc.
126
[71] Pradeep Kumar Mishra. Pipelined computation of scalar multiplication in
elliptic curve cryptosystems (Extended version). IEEE Trans. Comput.,
55(8):10001010, 2006.
[72] Pradeep Kumar Mishra, Pinakpani Pal, and Palash Sarkar. Towards Min-
imizing Memory Requirement for Implementation of Hyperelliptic Curve
Cryptosystems. Springer-Verlag Berlin Heidelberg, ISPEC 2007(LNCS
4464):269283, 2007.
[73] Sangook Moon, Jaemin Park, and Yongsurk Lee. Fast VLSI arithmetic
algorithms for high-security elliptic curve cryptographic applications. IEEE
Transactions on Consumer Electronics, 47(3):700708, 2001.
[74] D. Mumford. Tata Lectures on Theta II. Springer, 1984.
[75] National Institute of Standards and Technology (NIST). Recommended
Elliptic Curve for Federal Government Use, 1999.
[76] G. Orlando and C. Paar. A super-serial galois elds multiplier for FPGAs
and its application to public-key algorithms. In Field-Programmable Cus-
tom Computing Machines, 1999. FCCM '99. Proceedings. Seventh Annual
IEEE Symposium on, pages 232239, 1999.
[77] E. Öztürk and B. Sunar. Low-power elliptic curve cryptography using
scaled modular arithmetic. In Proceedings of 6th International Workshop
on Cryptographic Hardware in Embedded Systems (CHES), pages 92106.
SpringerVerlag, 2004.
[78] Jan Pelzl. Hyperelliptic cryptosystems on embedded microprocessors.
Master's thesis, Ruhr-Universität Bochum, September 2002.
[79] Jan Pelzl, Thomas Wollinger, Jorge Guajardo, and Christof Paar. Hy-
perelliptic curve cryptosystems: Closing the performance gap to elliptic
curves. In Workshop on Cryptographic Hardware and Embedded Systems,
CHES 2003, pages 351365. Springer-Verlag, 2003.
[80] Jan Pelzl, Thomas Wollinger, and Christof Paar. High performance arith-
metic for special hyperelliptic curve cryptosystems of genus two. In Inter-
national Conference on Information Technology: Coding and Computing -
ITCC 2004. IEEE Computer Society, 2004.
127
[81] Yong ping DAN, Xue cheng ZOU, Zheng lin LIU, Yu HAN, and Li hua
YI. Design of highly ecient elliptic curve crypto-processor with two mul-
tiplications over GF(2163). The Journal of China Universities of Posts and
Telecommunications, 16(2):7279, April 2009.
[82] Martin Christopher Rosner. Elliptic curve cryptosystems on recongurable
hardware. Master Thesis, Worcester Polytechnic Inst, 1998.
[83] Anup Hosangadi Ryan Kastner and Farzan Fallah. Arithmetic Optimiza-
tion Techniques for Hardware and Software Design. Cambridge University
Press, 2010.
[84] K. Sakiyama. Secure Design Methodology and Implementation for Embed-
ded Public-key Cryptosystems. PhD thesis, Katholieke Universiteit Leuven,
Belgium, 2007.
[85] Richard Schroeppel, Cheryl Beaver, Rita Gonzales, Russell Miller, and
Timothy Draelos. A Low-Power design for an elliptic curve digital signature
chip. In Cryptographic Hardware and Embedded Systems - CHES 2002,
pages 5364. 2003.
[86] Leilei Song and Keshab K. Parhi. Low-energy digit-serial/parallel nite
eld multipliers. J. VLSI Signal Process. Syst., 19:149166, July 1998.
[87] Matsuo K. Chao J. Tsujii S. Sugizaki, T. An extension of harley addition
algorithm for hyperelliptic curves over nite elds of characterisitc two.
Technical report, ISEC2002-9, IEICE Japan, 49-56, 2002.
[88] M. Takahashi. Improving Harley Algorithms for Jacobians of genus 2 Hy-
perelliptic Curves. In Proc. of SCIS2002, IEICE Japan, 2002. in Japanese.
[89] Pim Tuyls and Lejla Batina. RFID-tags for Anti-Counterfeiting, 2006.
[90] Frank Vahid. Digital Design with RTL Design, Verilog and VHDL. Wiley,
2nd edition, March 2010.
[91] J. Wolkerstorfer. Scaling ECC hardware to a minimum. In ECRYPT
workshop - Cryptographic Advances in Secure Hardware, September 2005.
128
[92] T. Wollinger. Computer architectures for cryptosystems based on hyper-
elliptic curves. Master's thesis, ECE Department, Worcester Polytechnic
Institute, Worcester, Massachusetts, USA, May 2001.
[93] Thomas Wollinger. Software and hardware implementation of hyperelliptic
curve cryptosystems. Technical report, Ruhr-University Bochum, Bochum,
Germany, 2004.
[94] Thomas Wollinger and Christof Paar. Hardware architectures proposed
for cryptosystems based on hyperelliptic curves. In In Proceedings of the
9th IEEE International Conference on Electronics, Circuits and Systems
- ICECS 2002, volume III, pages 11591163, 2002.
[95] Thomas Wollinger, Jan Pelzl, and Christof Paar. Cantor versus harley:
Optimization and analysis of explicit formulae for hyperelliptic curve cryp-
tosystems. IEEE Transactions on Computers, 54:861872, 2005.
[96] Thomas Wollinger, Jan Pelzl, Volker Wittelsberger, Christof Paar, Gïkay
Saldamli, and ïetin K. Koï. Elliptic and hyperelliptic curves on embedded
µp. ACM Transactions on Embedded Computing Systems, 3(3):509533,
2004.
[97] K. Matsuo J. Chao Y. Miyamoto, H. Doi and S. Tsujii. A fast addition
algorithm of genus two hyperelliptic curve. In Proc. of SCIS2002, IEICE
Japan, pages 497-502, 2002. in Japanese.
[98] Hiroto Yasuura and Hiroyuki Tomiyama. Power optimization by datapath
width adjustment. In Power Aware Design Methodologies, pages 181199.
2002.
[99] Xiaoyang Zeng, Xiaofang Zhou, and Qianling Zhang. Hardware/software
Co-design of Elliptic Curves Public-key Cryptosystems. volume 2, pages
14961499, 2002.
[100] Yu Zhang, Dongdong Chen, Younhee Choi, Li Chen, and Seok-Bum Ko.
A high performance pseudo-multi-core ECC processor over GF(2163). In
IEEE International Symposium on Circuits and Systems (ISCAS), pages
701704, Paris, August 2010.
129
Appendix A
Efficient HECC Explicit Register
Management Formulas
A.1 New Weighted Coordinates (N)
This section contains efficient HECC explicit register management formulas
specified for doubling, addition, and mixed addition in new weighted
coordinates. These explicit formulas are the ones that result after applying
the register allocation, operation scheduling, and storage binding procedures.
They are used as the basis for synthesizing the data path and the control unit
so as to minimize the hardware implementation of an HECC processor. The
formulas for new weighted coordinates when $h_2 = 0$ and when $h_2 \neq 0$ can
be found in Algorithms A.1, A.2, A.3, A.4, A.5, and A.6.
A.2 Projective Coordinates (P)
Similarly, efficient HECC explicit register management formulas are
provided for doubling, addition, and mixed addition in projective coordinates.
For counting, formulas are given for even characteristic when $h_2 = 0$ and
when $h_2 \neq 0$, respectively. The formulas for projective coordinates can
be found in Algorithms A.7, A.8, A.9, A.10, A.11, and A.12.
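The ADD, SUB, MUL, and SQR operations in the algorithms of this appendix act on elements of a binary field. As a minimal sketch, assuming a toy field GF(2^4) with reduction polynomial x^4 + x + 1 (chosen only for illustration, not the field used in this work), the operations could be modeled in Python as follows. The function names are hypothetical.

```python
M = 4            # toy extension degree (assumption, for illustration only)
POLY = 0b10011   # x^4 + x + 1, irreducible over GF(2)


def gf_add(a, b):
    """Addition in GF(2^M) is a bitwise XOR of the coefficient vectors."""
    return a ^ b


# In characteristic 2 every element is its own negative, so SUB equals ADD.
gf_sub = gf_add


def gf_mul(a, b):
    """Shift-and-add multiplication with interleaved modular reduction."""
    result = 0
    while b:
        if b & 1:          # current coefficient of b is 1: accumulate a
            result ^= a
        b >>= 1
        a <<= 1            # multiply a by x
        if a >> M:         # degree reached M: reduce by the field polynomial
            a ^= POLY
    return result


def gf_sqr(a):
    """Squaring expressed through the general multiplier."""
    return gf_mul(a, a)


print(gf_mul(0b0010, 0b1000))  # x * x^3 = x^4 = x + 1
print(gf_sqr(0b0110))          # (x^2 + x)^2 = x^4 + x^2 = x^2 + x + 1
```

A hardware design would typically implement squaring with a dedicated circuit rather than through the multiplier; the sketch only fixes the arithmetic meaning of the operations.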
Algorithm A.1 The modified register management of divisor doubling, new weighted coordinates in an even characteristic when $h_2 = 0$
Input: [U1, U0, V1, V0, Z1, Z2, z1, z2, z3, z4], $h = h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′1, Z′2, z′1, z′2, z′3, z′4] = 2[U1, U0, V1, V0, Z1, Z2, z1, z2, z3, z4].
1: R0 ← U1, R1 ← U0, R2 ← z1, R3 ← z4
2: R4 ← cMUL (R1, h1)
3: R5 ← cMUL (R0, h1)
4: FP (ADD, cMUL)← ADD (R4, R5)
5: R4 ← cMUL (FP (ADD, cMUL), h1)
6: R5 ← cMUL (R2, SQR(h0))
7: FP (ADD,MUL)← ADD (R4, R5)
8: R5 ←MUL (FP (ADD,MUL), R3)
9: Z ′2 ←MUL (R5, R3), DOUT ← Z ′2
10: R4 ← cMUL (R0, h1)
11: FP (cMUL,ADD)← cMUL (R2, h0)
12: R6 ← ADD (FP (cMUL,ADD), R4)
13: R4 ← V1, R7 ← SQR (R4)
14: R8 ← SQR (R0)
15: FP (SQR, cMUL)← SQR (R2)
16: R9 ← cMUL (FP (SQR, cMUL), f3)
17: FP (ADD,MUL)← ADD (R8, R9)
18: R8 ←MUL (FP (ADD,MUL), z2)
19: R9 ← cMUL (R3, f2)
20: FP (cMUL,ADD)← cMUL (R4, h1)
21: R9 ← ADD (FP (cMUL,ADD), R9)
22: FP (MUL,ADD)←MUL (R3, R9)
23: R3 ← ADD (FP (MUL,ADD), R7)
24: FP (MUL,ADD)←MUL (R0, R8)
25: R7 ← ADD (R3, FP (MUL,ADD), R3)
26: R3 ←MUL (R6, R7)
27: R9 ← cMUL (R8, h1)
28: R6 ← cADD (R6, h1)
29: FP (ADD,MUL)← ADD (R7, R8)
30: R6 ←MUL (FP (ADD,MUL), R6)
31: R7 ← ADD (R3, R6)
32: FP (ADD,MUL)← cADD (R0, 1)
33: R6 ← ADD (FP (ADD,MUL), R9)
34: R6 ← ADD (R6, R7)
35: R7 ←MUL (R2, R9)
36: FP (MUL,ADD)←MUL (R7, R1)
37: R3 ← ADD (FP (MUL,ADD), R3)
38: R7 ←MUL (R6, R2)
39: S0 ← SQR (R3)
40: S ←MUL (R3, R7)
41: R5 ←MUL (R7, R5)
42: R2 ← SQR (R7)
43: R8 ← SQR (Z ′2)
44: R7 ←MUL (R7, R6)
45: R9 ←MUL (R7, Z ′2)
46: z′4 ←MUL (R9, R2)
47: R1 ←MUL (R1, R13)
48: R3 ←MUL (R3, R6)
49: R10 ←MUL (R0, R6)
50: R7 ←MUL (R1, R3)
51: R3 ← ADD (R3, R6)
52: FP (ADD,MUL)← ADD (R0, R1)
53: R3 ←MUL (FP (ADD,MUL), R3)
54: R3 ← SUB (R3, R7)
55: R3 ← SUB (R3, R10)
56: R10 ← ADD (R10, S)
57: FP (cMUL,ADD)← cMUL (R9, h1)
58: R0 ← ADD (FP (cMUL,ADD), S0)
59: R10 ← SUB (R10, R8)
60: R1 ←MUL (R10, R0)
61: R6 ←MUL (R10, R8)
62: FP (MUL,ADD)←MUL (R5, R4)
63: R4 ← ADD (FP (MUL,ADD), R3)
64: FP (ADD,MUL)← ADD (R0, R4)
65: R0 ←MUL (FP (ADD,MUL), R2)
66: R0 ← ADD (R0, R6)
67: FP (cMUL,ADD)← cMUL (R9, h1)
68: V ′1 ← ADD (FP (cMUL,ADD), R0)
69: FP (MUL,ADD)←MUL (R5, V0)
70: R0 ← ADD (FP (MUL,ADD), R7)
71: FP (MUL,ADD)←MUL (R0, R2)
72: R0 ← ADD (FP (MUL,ADD), R1)
73: FP (cMUL,ADD)← cMUL (h0, R9)
74: V ′0 ← ADD (FP (cMUL,ADD), R0)
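A listing in this notation can be tallied mechanically. The following sketch, which is an illustration and not a tool used in this work, counts the field operations and the distinct registers in a listing. The sample text is steps 1 to 8 of Algorithm A.1 with spacing normalized; the function and pattern names are hypothetical.

```python
import re
from collections import Counter

# An operation name counts only when it is followed by an argument list,
# so the unit names inside "FP (ADD, cMUL)" are not counted.
OP_PATTERN = re.compile(r"\b(cMUL|cADD|MUL|SQR|ADD|SUB)\s*\(")
REG_PATTERN = re.compile(r"\bR(\d+)\b")

STEPS = """\
1: R0 ← U1, R1 ← U0, R2 ← z1, R3 ← z4
2: R4 ← cMUL (R1, h1)
3: R5 ← cMUL (R0, h1)
4: FP (ADD, cMUL) ← ADD (R4, R5)
5: R4 ← cMUL (FP (ADD, cMUL), h1)
6: R5 ← cMUL (R2, SQR(h0))
7: FP (ADD,MUL) ← ADD (R4, R5)
8: R5 ← MUL (FP (ADD,MUL), R3)
"""


def tally(listing):
    """Return (operation counts, set of register indices) for a listing."""
    ops = Counter()
    registers = set()
    for line in listing.splitlines():
        if "←" not in line:
            continue
        # Operations appear only to the right of the first arrow.
        _, rhs = line.split("←", 1)
        ops.update(OP_PATTERN.findall(rhs))
        registers.update(int(i) for i in REG_PATTERN.findall(line))
    return ops, registers


ops, registers = tally(STEPS)
print(dict(ops), len(registers))
```

Running the same function over a complete listing gives a quick check of its operation count and of the highest register index it touches.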
Algorithm A.2 The modified register management of divisor addition, new weighted coordinates in an even characteristic when $h_2 = 0$
Input: [U11, U10, V11, V10, Z11, Z12, z11, z12, z13, z14], [U21, U20, V21, V20, Z21, Z22, z21, z22, z23, z24], $h = h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
1: R0 ← U11, R1 ← U10,R2 ← V11, R3 ← V10,R4 ← z11, R5 ← z13
2: R6 ← z14, R7 ← V21, R8 ← V20, R9 ← z21, R10 ← z23
3: R11 ←MUL (R4, U21)
4: R12 ←MUL (R4, U20)
5: R7 ←MUL (R6, R7)
6: FP (MUL, ADD)←MUL (R0, R9)
7: R10 ←MUL (FP (MUL, ADD), R11)
8: FP (MUL, ADD)←MUL (R1, R9)
9: R9 ← ADD (FP (MUL, ADD), R12)
10: R13 ←MUL (R0, R10)
11: FP (MUL, ADD)←MUL (R6, R4)
12: R13 ← ADD (FP (MUL, ADD), R13)
13: FP (SQR, MUL)← SQR (R10)
14: R14 ← ADD (FP (SQR, MUL), R1)
15: FP (MUL, ADD)←MUL (R13, R9)
16: R14 ← ADD (FP (MUL, ADD), R14)
17: R5 ←MUL (R14, R5)
18: Z ′2 ← ADD (R5, R6)
19: FP (SQR, MUL)← SQR (R5)
20: R5 ←MUL (FP (SQR, MUL), R6)
21: FP (MUL, ADD)←MUL (R3, z24)
22: R3 ← ADD (FP (MUL, ADD), R8)
23: R15 ←MUL (R3, R13)
24: R16 ←MUL (R2, R10)
25: FP (ADD, MUL)←MUL (R0, R4)
26: R0 ← MUL (FP (ADD, MUL), R16)
27: R0 ← ADD (R0, R15)
28: R4 ←MUL (R4, R10)
29: FP (ADD, MUL)← ADD (R4, R13)
30: R2 ←MUL (FP (ADD, MUL), R2)
31: FP (MUL, ADD)←MUL (R1, R16)
32: R15 ← ADD (FP (MUL, ADD), R15)
33: R2 ←MUL (R0, R6)
34: R3 ←MUL (R2, R14)
35: R6 ←MUL (R15, R6)
36: R4 ←MUL (R2, R6)
37: R13 ← SQR (Z ′2)
38: R14 ←MUL (R2, Z ′2)
39: z′4 ←MUL (R14, R16)
40: R2 ← ADD (R2, R0)
41: FP (ADD, MUL)← ADD (R17, U21)
42: R2 ←MUL (FP (ADD, MUL), R2)
43: FP (ADD, ADD)← ADD (R2, R15)
44: R2 ← ADD (FP (ADD, ADD), R12)
45: FP (ADD, MUL)← ADD (R10, R11)
46: R1 ← ADD (FP (ADD, MUL), R1)
47: FP (ADD, MUL)← ADD (R1, R5)
48: R1 ←MUL (FP (ADD, MUL), R10)
49: R9 ←MUL (R0, R9)
50: R6 ← ADD (R6, R9)
51: FP (MUL, ADD)←MUL (h1, R14)
52: R6 ← ADD (FP (MUL, ADD), R6)
53: FP (MUL, ADD)←MUL (R0, R10)
54: R13 ← ADD (FP (MUL, ADD), R13)
55: FP (MUL, ADD)←MUL (R3, R7)
56: R2 ← ADD (FP (MUL, ADD), R2)
57: FP (ADD, MUL)← ADD (R2, R6)
58: R2 ← ADD (FP (ADD, MUL), R1)
59: FP (MUL, ADD)←MUL (h1, R2)
60: R4 ← ADD (FP (MUL, ADD), R4)
61: FP (MUL, ADD)←MUL (R3, R8)
62: R12 ← ADD (FP (MUL, ADD), R12)
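When listings of this kind are transcribed or edited by hand, a simple consistency check is to confirm that every register is written before it is read. The sketch below is an illustration under stated assumptions, not part of this work: it assumes that a register name directly followed by an arrow is a destination and that every other occurrence is a read. The sample listing and all names are hypothetical.

```python
import re

WRITE = re.compile(r"\b(R\d+)\s*←")          # register used as a destination
READ = re.compile(r"\b(R\d+)\b(?!\s*←)")     # any other register occurrence

SAMPLE = """\
1: R0 ← U1, R1 ← U0
2: R2 ← MUL (R0, R1)
3: R3 ← ADD (R2, R5)
4: R2 ← SQR (R2)
"""


def read_before_write(listing):
    """Return (step, register) pairs where a register is read while unset."""
    written = set()
    problems = []
    for line in listing.splitlines():
        if "←" not in line:
            continue
        step_text, body = line.split(":", 1)
        # Reads are checked first, so "R2 ← SQR (R2)" needs R2 set earlier.
        for reg in READ.findall(body):
            if reg not in written:
                problems.append((int(step_text), reg))
        written.update(WRITE.findall(body))
    return problems


print(read_before_write(SAMPLE))
```

In the hypothetical sample, step 3 reads R5, which no earlier step has written, so it is the only entry reported.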
Algorithm A.3 The modified register management of mixed addition, new weighted coordinates in an even characteristic when $h_2 = 0$
Input: [U11, U10, V11, V10], [U21, U20, V21, V20, Z21, Z22, z21, z22, z23, z24], $h = h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
1: R0 ← V21, R1 ← V20,R2 ← U21, R3 ← U20,R4 ← Z21, R5 ← z21
2: R6 ← z23, R7 ← z24
3: FP (MUL, ADD)←MUL (R5, U11)
4: R8 ← ADD (FP (MUL, ADD), R2)
5: FP (MUL, ADD)←MUL (R5, U10)
6: R5 ← ADD (FP (MUL, ADD), R3)
7: FP (MUL, ADD)←MUL (R8, U11)
8: R9 ← ADD (FP (MUL, ADD), R5)
9: FP (SQR, MUL)← SQR (R8)
10: R10 ←MUL (FP (SQR, MUL), U10)
11: FP (MUL, ADD)←MUL (R5, R9)
12: R10 ← ADD (FP (MUL, ADD), R10)
13: FP (MUL, ADD)←MUL (R7, V10)
14: R11 ← ADD (FP (MUL, ADD), R1)
15: FP (MUL, ADD)←MUL (R7, V11)
16: R7 ← ADD (FP (MUL, ADD), V21)
17: R15 ←MUL (FP (ADD, MUL), R13)
18: FP (ADD, MUL)← ADD (R7, R11)
19: R14 ←MUL (FP (ADD, MUL), R14)
20: FP (SUB, SUB)← SUB (R12, R14)
21: R14 ← SUB (FP (SUB, SUB), R15)
22: R14 ←MUL (R14, R10)
23: FP (MUL, ADD) ← MUL (U10, R13)
24: R9 ← ADD (FP (MUL, ADD), R12)
25: R7 ← SQR (R7)
26: R9 ←MUL (R10, R14)
27: R11 ←MUL (Z ′2, R11)
28: z′4 ←MUL (R14, R11)
29: R13 ←MUL (R12, R4)
30: R9 ←MUL (R9, R3)
31: FP (ADD, MUL)← ADD (R2, R3)
32: R2 ←MUL (FP (ADD, MUL), R15)
33: FP (SUB, SUB)← SUB (R2, R9)
34: R2 ← SUB (FP (SUB, SUB), R4)
35: R4 ← ADD (R4, R13)
36: R3 ←MUL (U11, R7)
37: FP (MUL, ADD)← ADD (R6, R3)
38: R6 ←MUL (FP (MUL, ADD), R8)
39: FP (MUL, ADD)←MUL (R7, R8)
40: R5 ← ADD (FP (MUL, ADD), z′2)
41: R4 ← ADD (R4, R5)
42: R6 ←MUL (R3, R4)
43: FP (MUL, ADD)←MUL (R0, R12)
44: R2 ← ADD (FP (MUL, ADD), R2)
45: R4 ← ADD (R4, R2)
46: FP (ADD, MUL)← ADD (R3, R4)
47: R2 ← ADD (FP (ADD,MUL), R14)
48: R0 ← ADD (R2, R6)
49: FP (MUL, ADD)←MUL (R1, R12)
50: R4 ← ADD (FP (MUL, ADD), R9)
51: FP (ADD, MUL)← ADD (R2, R4)
52: R4 ←MUL (FP (ADD, MUL), R14)
Algorithm A.4 The modified register management of divisor doubling, new weighted coordinates in an even characteristic when $h_2 \neq 0$
Input: [U1, U0, V1, V0, Z1, Z2, z1, z2, z3, z4], $h = h_2 x^2 + h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′1, Z′2, z′1, z′2, z′3, z′4] = 2[U1, U0, V1, V0, Z1, Z2, z1, z2, z3, z4].
1: R0 ← U1, R1 ← U0, R2 ← z1, R3 ← z4
2: FP (ADD, cMUL)← ADD (R4, R5)
3: R4 ← cMUL (FP (ADD, cMUL), h1)
4: R5 ← cMUL (R2, SQR(h0))
5: FP (ADD,MUL)← ADD (R4, R5)
6: R5 ←MUL (FP (ADD,MUL), R3)
7: Z ′2 ←MUL (R5, R3), DOUT ← Z ′2
8: FP (cMUL,ADD)← cMUL (R2, h0)
9: R6 ← ADD (FP (cMUL,ADD), R4)
10: R8 ← SQR (R0)
11: FP (SQR, cMUL)← SQR (R2)
12: R9 ← cMUL (FP (SQR, cMUL), f3)
13: FP (ADD,MUL)← ADD (R8, R9)
14: R8 ←MUL (FP (ADD,MUL), z2)
15: FP (cMUL,ADD)← cMUL (R4, h1)
16: R9 ← ADD (FP (cMUL,ADD), R9)
17: FP (MUL,ADD)←MUL (R3, R9)
18: R3 ← ADD (FP (MUL,ADD), R7)
19: FP (MUL,ADD)←MUL (R0, R8)
20: R7 ← ADD (R3, FP (MUL,ADD), R3)
21: R3 ←MUL (R6, R7)
22: R9 ← cMUL (R8, h1)
23: R6 ← cADD (R6, h1)
24: FP (ADD,MUL)← ADD (R7, R8)
25: R6 ←MUL (FP (ADD,MUL), R6)
26: R7 ← ADD (R3, R6)
27: FP (ADD,MUL)← cADD (R0, 1)
28: R6 ← ADD (FP (ADD,MUL), R9)
29: R6 ← ADD (R6, R7)
30: R7 ←MUL (R2, R9)
31: FP (MUL,ADD)←MUL (R7, R1)
32: R3 ← ADD (FP (MUL,ADD), R3)
33: R7 ←MUL (R6, R2)
34: S0 ← SQR (R3)
35: S ←MUL (R3, R7)
36: R5 ←MUL (R7, R5)
37: R2 ← SQR (R7)
38: R8 ← SQR (Z ′2)
39: R7 ←MUL (R7, R6)
40: R9 ←MUL (R7, Z ′2)
41: z′4 ←MUL (R9, R2)
42: R1 ←MUL (R1, R13)
43: R3 ←MUL (R3, R6)
44: R10 ←MUL (R0, R6)
45: R7 ←MUL (R1, R3)
46: R3 ← ADD (R3, R6)
47: FP (ADD,MUL)← ADD (R0, R1)
48: R3 ←MUL (FP (ADD,MUL), R3)
49: R3 ← SUB (R3, R7)
50: R3 ← SUB (R3, R10)
51: R10 ← ADD (R10, S)
52: FP (cMUL,ADD)← cMUL (R9, h1)
53: R0 ← ADD (FP (cMUL,ADD), S0)
54: R10 ← SUB (R10, R8)
55: R1 ←MUL (R10, R0)
56: R6 ←MUL (R10, R8)
57: FP (MUL,ADD)←MUL (R5, R4)
58: R4 ← ADD (FP (MUL,ADD), R3)
59: FP (ADD,MUL)← ADD (R0, R4)
60: R0 ←MUL (FP (ADD,MUL), R2)
61: R0 ← ADD (R0, R6)
62: FP (cMUL,ADD)← cMUL (R9, h1)
63: V ′1 ← ADD (FP (cMUL,ADD), R0)
64: FP (MUL,ADD)←MUL (R5, V0)
65: R0 ← ADD (FP (MUL,ADD), R7)
66: FP (MUL,ADD)←MUL (R0, R2)
67: R0 ← ADD (FP (MUL,ADD), R1)
68: FP (cMUL,ADD)← cMUL (h0, R9)
69: V ′0 ← ADD (FP (cMUL,ADD), R0)
Algorithm A.5 The modified register management of divisor addition, new weighted coordinates in an even characteristic when $h_2 \neq 0$
Input: [U11, U10, V11, V10, Z11, Z12, z11, z12, z13, z14], [U21, U20, V21, V20, Z21, Z22, z21, z22, z23, z24], $h = h_2 x^2 + h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
1: R0 ← V21, R1 ← V20,R2 ← U21, R3 ← U20,R4 ← Z21, R5 ← z21
2: R6 ← z23, R7 ← z24
3: FP (MUL, ADD)←MUL (R5, U11)
4: R8 ← ADD (FP (MUL, ADD), R2)
5: FP (MUL, ADD)←MUL (R5, U10)
6: R5 ← ADD (FP (MUL, ADD), R3)
7: FP (MUL, ADD)←MUL (R8, U11)
8: R9 ← ADD (FP (MUL, ADD), R5)
9: FP (SQR, MUL)← SQR (R8)
10: R10 ←MUL (FP (SQR, MUL), U10)
11: FP (MUL, ADD)←MUL (R5, R9)
12: R10 ← ADD (FP (MUL, ADD), R10)
13: FP (MUL, ADD)←MUL (R7, V10)
14: R11 ← ADD (FP (MUL, ADD), R1)
15: FP (MUL, ADD)←MUL (R7, V11)
16: R7 ← ADD (FP (MUL, ADD), V21)
17: R13 ←MUL (R7, R8)
18: FP (ADD, MUL)← ADD (U11, 1)
19: R15 ←MUL (FP (ADD, MUL), R13)
20: R14 ← ADD (R8, R9)
21: FP (ADD, MUL)← ADD (R7, R11)
22: R14 ←MUL (FP (ADD, MUL), R14)
23: FP (SUB, SUB)← SUB (R12, R14)
24: R14 ← SUB (FP (SUB, SUB), R15)
25: R14 ←MUL (R14, R10)
26: FP (MUL, ADD) ← MUL (U10, R13)
27: R9 ← ADD (FP (MUL, ADD), R12)
28: R11 ←MUL (R7, R4)
29: R12 ←MUL (R7, R10)
30: R10 ←MUL (R4, R9)
31: R13 ←MUL (R10, R11)
32: R10 ← SQR (R10)
33: R9 ←MUL (R7, R9)
34: R7 ← SQR (R7)
35: R9 ←MUL (R10, R14)
36: R14 ← SQR (R11)
37: z′2 ← SQR (Z ′2)
38: R11 ←MUL (Z ′2, R11)
39: FP (ADD, MUL)← ADD (R2, R3)
40: R2 ←MUL (FP (ADD, MUL), R15)
41: FP (SUB, SUB)← SUB (R2, R9)
42: R2 ← SUB (FP (SUB, SUB), R4)
43: R4 ← ADD (R4, R13)
44: R3 ←MUL (U11, R7)
45: FP (MUL, ADD)← ADD (R6, R3)
46: R6 ←MUL (FP (MUL, ADD), R8)
47: R3 ← ADD (R10, R6)
48: R5 ←MUL (R5, R7)
49: R6 ← cMUL (h1, R11)
50: R5 ← ADD (R5, R3)
51: R3 ← ADD (R5, R6)
52: FP (MUL, ADD)←MUL (R7, R8)
53: R5 ← ADD (FP (MUL, ADD), z′2)
54: R4 ← ADD (R4, R5)
55: R6 ←MUL (R3, R4)
56: R7 ←MUL (R4, R5)
57: FP (MUL, ADD)←MUL (R0, R12)
58: R2 ← ADD (FP (MUL, ADD), R2)
59: FP (ADD, MUL)← ADD (R3, R4)
60: R2 ← ADD (FP (ADD,MUL), R14)
61: R0 ← ADD (R2, R6)
62: FP (MUL, ADD)←MUL (R1, R12)
63: R4 ← ADD (FP (MUL, ADD), R9)
64: FP (ADD, MUL)← ADD (R2, R4)
65: R4 ←MUL (FP (ADD, MUL), R14)
Algorithm A.6 The modified register management of mixed addition for new weighted coordinates for an even characteristic when $h_2 \neq 0$
Input: [U11, U10, V11, V10], [U21, U20, V21, V20, Z21, Z22, z21, z22, z23, z24], $h = h_2 x^2 + h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
1: R0 ← V21, R1 ← V20,R2 ← U21, R3 ← U20,R4 ← Z21, R5 ← z21
2: R6 ← z23, R7 ← z24
3: FP (MUL, ADD)←MUL (R5, U11)
4: R8 ← ADD (FP (MUL, ADD), R2)
5: FP (MUL, ADD)←MUL (R5, U10)
6: R5 ← ADD (FP (MUL, ADD), R3)
7: FP (MUL, ADD)←MUL (R8, U11)
8: R9 ← ADD (FP (MUL, ADD), R5)
9: FP (SQR, MUL)← SQR (R8)
10: R10 ←MUL (FP (SQR, MUL), U10)
11: FP (MUL, ADD)←MUL (R5, R9)
12: R10 ← ADD (FP (MUL, ADD), R10)
13: R6 ←MUL (R6, R10)
14: Z ′2 ←MUL (R6, R4)
15: R6 ← SQR (R6)
16: FP (MUL, ADD)←MUL (R7, V10)
17: R11 ← ADD (FP (MUL, ADD), R1)
18: FP (MUL, ADD)←MUL (R7, V11)
19: R7 ← ADD (FP (MUL, ADD), V21)
20: R13 ←MUL (R7, R8)
21: FP (ADD, MUL)← ADD (U11, 1)
22: R15 ←MUL (FP (ADD, MUL), R13)
23: R14 ← ADD (R8, R9)
24: FP (ADD, MUL)← ADD (R7, R11)
25: R14 ←MUL (FP (ADD, MUL), R14)
26: FP (SUB, SUB)← SUB (R12, R14)
27: R14 ← SUB (FP (SUB, SUB), R15)
28: R14 ←MUL (R14, R10)
29: FP (MUL, ADD) ← MUL (U10, R13)
30: R9 ← ADD (FP (MUL, ADD), R12)
31: R11 ←MUL (R7, R4)
32: R12 ←MUL (R7, R10)
33: R10 ←MUL (R4, R9)
34: R13 ←MUL (R10, R11)
35: R10 ← SQR (R10)
36: R9 ←MUL (R7, R9)
37: R7 ← SQR (R7)
38: R9 ←MUL (R10, R14)
39: R14 ← SQR (R11)
40: z′2 ← SQR (Z ′2)
41: R11 ←MUL (Z ′2, R11)
42: FP (ADD, MUL)← ADD (R2, R3)
43: R2 ←MUL (FP (ADD, MUL), R15)
44: FP (MUL, ADD)← ADD (R6, R3)
45: R6 ←MUL (FP (MUL, ADD), R8)
46: R3 ← ADD (R10, R6)
47: R5 ←MUL (R5, R7)
48: R6 ← cMUL (h1, R11)
49: R5 ← ADD (R5, R3)
50: R3 ← ADD (R5, R6)
51: FP (MUL, ADD)←MUL (R7, R8)
52: R5 ← ADD (FP (MUL, ADD), z′2)
53: R4 ← ADD (R4, R5)
54: R6 ←MUL (R3, R4)
55: R7 ←MUL (R4, R5)
56: R4 ← cMUL (h1, R11)
57: FP (MUL, ADD)←MUL (R0, R12)
58: R2 ← ADD (FP (MUL, ADD), R2)
59: R4 ← ADD (R4, R2)
60: FP (ADD, MUL)← ADD (R3, R4)
61: R2 ← ADD (FP (ADD,MUL), R14)
62: R0 ← ADD (R2, R6)
63: R2 ← cMUL (h1, R11)
64: FP (MUL, ADD)←MUL (R1, R12)
65: R4 ← ADD (FP (MUL, ADD), R9)
66: FP (ADD, MUL)← ADD (R2, R4)
67: R4 ←MUL (FP (ADD, MUL), R14)
68: R1 ← ADD (R6, R4)
Algorithm A.7 The modified register management of divisor doubling for projective coordinates for an even characteristic when $h_2 = 0$
Input: [U1, U0, V1, V0, Z], $h = h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′] = 2[U1, U0, V1, V0, Z].
1: R0 ← U1, R1 ← U0,R2 ← V1, R3 ← V0,R4 ← Z
2: R4 ← SQR (Z)
3: R5 ← SQR (R2)
4: R3 ← SQR (R0, U1)
5: FP (SQR, MUL)← SQR (R4)
6: R4 ← ADD (FP (SQR, MUL), R1)
7: R7 ← SQR (R0)
8: R8 ←MUL (R0, R7)
9: FP (MUL, ADD)←MUL (R2, R3)
10: R5 ← ADD (FP (MUL, ADD), R5)
11: FP (MUL, ADD)←MUL (R10, R9)
12: R5 ←MUL (FP (MUL, ADD), R8)
13: R8 ←MUL (R5, R6)
14: R2 ←MUL (R3, R7)
15: R9 ←MUL (R1, R3)
16: FP (MUL, ADD)←MUL (R2, R9)
17: R9 ← ADD (FP (MUL, ADD), R8)
18: R6 ← ADD (R3, R6)
19: FP (ADD, MUL)← ADD (R5, R7)
20: R5 ←MUL (FP (ADD, MUL), R6)
21: R5 ← ADD (R5, R8)
22: FP (ADD, MUL)← ADD (1, R0)
23: R6 ←MUL (FP (ADD, MUL), R2)
24: R5 ← ADD (R5, R6)
25: R6 ←MUL (R3, R5)
26: R7 ←MUL (R4, R6)
27: R8 ← ADD (R5, R6)
28: R2 ←MUL (R5, R9)
29: R3 ←MUL (R2, R3)
30: R5 ← ADD (R4, R5)
31: R0 ←MUL (R0, R8)
32: R8 ← ADD (R2, R8)
33: R2 ←MUL (R1, R2)
34: FP (ADD, MUL)← ADD (R1, U1)
35: R1 ←MUL (FP (ADD, MUL), R8)
36: R1 ← ADD (R1, R2)
37: R9 ← SQR (R9)
38: R1 ← ADD (R0, R1)
39: R9 ← ADD (R7, R9)
40: R4 ← SQR (R4)
41: U ′0 ←MUL (R7, R9)
42: U ′1 ←MUL (R4, R7)
43: R6 ← SQR (R6)
44: Z ′ ←MUL (R7, R6)
45: R8 ← ADD (R0, R4)
46: FP (ADD, MUL)← ADD (R3, R8)
47: R8 ←MUL (FP (ADD, MUL), R9)
48: FP (MUL, ADD)←MUL (R5, V0)
49: R2 ← SUB (FP (MUL, ADD), R2)
50: FP (MUL, ADD)←MUL (R2, R6)
51: V ′0 ← ADD (FP (MUL, ADD), R8)
52: R0 ← ADD (R0, R4)
53: FP (ADD, MUL)← ADD (R0, R3)
54: R4 ←MUL (FP (ADD, MUL), R4)
55: FP (MUL, ADD)←MUL (R5, V1)
56: R1 ← ADD (FP (MUL, ADD), R1)
57: R7 ← ADD (R1, R7)
58: FP (ADD, MUL)← ADD (R7, R9)
59: R6 ←MUL (FP (ADD, MUL), R6)
60: V ′1 ← ADD (R4, R6)
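The ADD, MUL and SQR steps in these listings operate on elements of a binary field, since the curves have even characteristic. The following minimal sketch shows what each operation computes in software. The field size M = 83 is taken from the VHDL parameters in the appendix; the reduction polynomial x^83 + x^7 + x^4 + x^2 + 1 and the function names are assumptions made for illustration only, not the thesis implementation.

```python
# Sketch of binary-field arithmetic GF(2^M). An element is a Python int whose
# bits are the coefficients of a polynomial over GF(2).
M = 83
# ASSUMPTION: reduction polynomial x^83 + x^7 + x^4 + x^2 + 1 (illustrative).
POLY = (1 << 83) | (1 << 7) | (1 << 4) | (1 << 2) | 1


def add(a, b):
    """Field addition: coefficient-wise XOR (no carries in characteristic 2)."""
    return a ^ b


def mul(a, b):
    """Bit-serial shift-and-add multiplication with interleaved reduction.

    Both inputs must already be reduced (less than 2**M).
    """
    result = 0
    while b:
        if b & 1:          # current coefficient of b is 1: accumulate a
            result ^= a
        b >>= 1
        a <<= 1            # multiply a by x
        if a >> M:         # degree reached M: reduce by XORing the modulus
            a ^= POLY
    return result


def sqr(a):
    """Squaring, written here as a plain multiplication for clarity."""
    return mul(a, a)
```

A dedicated squarer is normally much cheaper than a general multiplier in hardware, which is why the listings distinguish SQR from MUL even though the two agree mathematically.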
Algorithm A.8 The modified register management for divisor addition for projective
coordinates for an even characteristic when h2 = 0
Input: [U11, U10, V11, V10, Z1]; [U21, U20, V21, V20, Z2]; $h = h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′] = [U11, U10, V11, V10, Z1] + [U21, U20, V21, V20, Z2]
1: R0 ← U11, R1 ← U10, R2 ← Z1, R3 ← Z2
2: R4 ←MUL (R2, U21)
3: Ũ20 ←MUL (R2, U20)
4: Ṽ21 ←MUL (R2, V21)
5: V ′20 ←MUL (R2, V20)
6: FP (MUL, ADD)←MUL (R0, R3)
7: R5 ← ADD (FP (MUL, ADD), R4)
8: FP (MUL, ADD)←MUL (R1, R3)
9: R6 ←MUL (FP (MUL, ADD), Ũ20)
10: R7 ←MUL (R0, R5)
11: FP (MUL,ADD)←MUL (R2, R6)
12: R7 ← ADD (FP (MUL,ADD), R7)
13: R8 ←MUL (R6, R7)
14: FP (SQR, MUL)← SQR (R5)
15: R9 ←MUL (FP (SQR, MUL), R1)
16: R8 ← ADD (R8, R9)
17: FP (MUL, ADD)←MUL (R3, V10)
18: R9 ← ADD (FP (MUL, ADD), Ṽ20)
19: R10 ←MUL (R7, R9)
20: FP (MUL,ADD)←MUL (R2, R5)
21: R7 ← ADD (FP (MUL,ADD), R7)
22: FP (MUL,ADD)←MUL (R3, V11)
23: R11 ← ADD (FP (MUL,ADD), Ṽ21)
24: FP (ADD, MUL)← ADD (R9, R11)
25: R7 ←MUL (FP (ADD, MUL), R7)
26: R9 ←MUL (R5, R11)
27: R9 ← ADD (R0, R2)
28: R3 ←MUL (R2, R3)
29: R11 ←MUL (R9, R11)
30: R0 ←MUL (R7, R10)
31: FP (MUL, ADD)← ADD (R9, R1)
32: R1 ← ADD (FP (MUL, ADD), R10)
33: R2 ←MUL (R3, R8)
34: R1 ←MUL (R3, R1)
35: R7 ←MUL (R0, R3)
36: R3 ←MUL (R0, R1)
37: R11 ←MUL (R0, R7)
38: R12 ←MUL (R1, R7)
39: FP (MUL, ADD)←MUL (R4, R11)
40: R12 ← ADD (FP (MUL, ADD), R12)
41: FP (ADD, MUL)← ADD (R4, Ũ20)
42: R10 ←MUL (FP (ADD, MUL), R10)
43: FP (MUL, ADD)←MUL (R5, R8)
44: R8 ←MUL (FP (MUL, ADD), R7)
45: R7 ← SQR (R7)
46: Z ′ ←MUL (R7, R9)
47: R8 ←MUL (R2, R8)
48: FP (MUL, ADD)←MUL (R6, R11)
49: R6 ← ADD (FP (MUL, ADD), R8)
50: FP (SQR, MUL)← SQR (R0)
51: R0 ←MUL (FP (SQR, MUL), R5)
52: FP (ADD, MUL)← ADD (R4, R5)
53: R0 ←MUL (FP (ADD, MUL), R0)
54: R6 ← ADD (R0, R6)
55: FP (SQR, ADD)← SQR (R1)
56: R1 ← ADD (FP (SQR, ADD), R6)
57: U ′0 ←MUL (R1, R9)
58: R4 ←MUL (R5, R11)
59: R3 ←MUL (R3, Ũ20)
60: R5 ← ADD (R11, R10)
61: R0 ←MUL (R1, R12)
62: FP (MUL, ADD)←MUL (R7, R10)
63: R0 ← ADD (FP (MUL, ADD), R0)
64: FP (ADD, MUL)← ADD (R1, R5)
65: R5 ←MUL (FP (ADD, MUL), R7)
66: FP (MUL, ADD)←MUL (R2, R12)
67: R5 ← ADD (FP (MUL, ADD), R5)
68: FP (MUL, ADD)←MUL (R11, Ṽ20)
69: V ′0 ← ADD (FP (MUL, ADD), R0)
70: FP (ADD, MUL)← ADD (R3, Ṽ21)
71: R11 ←MUL (FP (ADD, MUL), R11)
72: V ′1 ← ADD (R5, R11)
Algorithm A.9 The modified register management of mixed addition for projective
coordinates for an even characteristic when h2 = 0
Input: [U11, U10, V11, V10], [U21, U20, V21, V20, Z2], $h = h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′] = [U11, U10, V11, V10] + [U21, U20, V21, V20, Z2]
1: R0 ← U11, R1 ← U10, R2 ← V11, R3 ← V10, R4 ← Z2
2: FP (MUL, ADD)←MUL (R0, R4)
3: R5 ← ADD (FP (MUL, ADD), U21)
4: FP (MUL, ADD)←MUL (R1, R4)
5: R6 ← ADD (FP (MUL, ADD), U20)
6: FP (MUL, ADD)←MUL (R0, R5)
7: R7 ← ADD (FP (MUL, ADD), R6)
8: R8 ←MUL (R6, R7)
9: FP (SQR, MUL)← SQR (R5)
10: R9 ←MUL (FP (SQR, MUL), R1)
11: R8 ← ADD (R8, R9)
12: FP (MUL, ADD)←MUL (R3, R4)
13: R3 ← ADD (FP (MUL, ADD), V20)
14: FP (MUL, ADD)←MUL (R2, R4)
15: R2 ← ADD (FP (MUL, ADD), V21)
16: R10 ←MUL (R3, R7)
17: R10 ←MUL (R2, R5)
18: FP (ADD, MUL)← ADD (R1, R10)
19: R1 ←MUL (FP (ADD, MUL), R9)
20: R7 ← ADD (R5, R7)
21: R2 ← ADD (R2, R3)
22: R2 ←MUL (R0, R10)
23: FP (MUL, ADD)←MUL (R2, R7)
24: R2 ← ADD (FP (MUL, ADD), R9)
25: R3 ←MUL (R8, R4)
26: R7 ←MUL (R1, R4)
27: R4 ←MUL (R2, R4)
28: R9 ←MUL (R3, R4)
29: R10 ←MUL (R1, R2)
30: R11 ←MUL (R2, R4)
31: R1 ←MUL (R1, R4)
32: R12 ←MUL (R11, U21)
33: R13 ←MUL (R2, R3)
34: R14 ←MUL (R10, R13)
35: R1 ← ADD (R1, R12)
36: R10 ← ADD (R11, R10)
37: FP (ADD, MUL)← ADD (R13, R14)
38: R10 ←MUL (FP (ADD, MUL), R10)
39: R12 ← ADD (R13, R14)
40: R12 ← ADD (R10, R12)
41: R7 ← SQR (R7)
42: R0 ←MUL (R0, R5)
43: FP (SQR, MUL)← SQR (R2)
44: R0 ←MUL (FP (SQR, MUL), R0)
45: R0 ← ADD (R0, R7)
46: R6 ←MUL (R6, R11)
47: R0 ← ADD (R0, R6)
48: R2 ←MUL (R3, R8)
49: R8 ←MUL (R2, R5)
50: FP (ADD, ADD)← ADD (R8, R9)
51: R0 ← ADD (FP (ADD, ADD), R0)
52: R5 ←MUL (R5, R11)
53: FP (SQR, ADD)← SQR (R3)
54: R3 ←MUL (FP (SQR, ADD), R5)
55: R0 ←MUL (R0, R9)
56: R4 ← SQR (R4)
57: FP (MUL, ADD)←MUL (R13, V21)
58: R7 ← ADD (FP (MUL, ADD), R3)
59: FP (ADD, MUL)← ADD (R7, R12)
60: R7 ←MUL (FP (ADD, MUL), R4)
61: FP (ADD, MUL)← ADD (R1, R3)
62: R2 ←MUL (FP (ADD, MUL), R3)
63: V ′1 ← ADD (R2, R7)
64: FP (MUL, ADD)←MUL (R13, V20)
65: R7 ← ADD (FP (MUL, ADD), R14)
66: R7 ←MUL (R4, R7)
67: FP (ADD, MUL)← ADD (R1, R3)
68: R0 ←MUL (FP (ADD, MUL), R0)
69: R1 ← ADD (R0, R7)
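Steps 2 to 7 of Algorithm A.9 can be traced in software to see how the FP entries interact with the register file. The sketch below assumes that FP (X, Y) denotes a value forwarded from unit X straight into unit Y without being written to a register, which is how the notation reads in these listings. The interpreter, the `run` and `gf_mul` names, the reduction polynomial, and the small input values are all hypothetical and serve only as an illustration.

```python
M = 83
# ASSUMPTION: reduction polynomial x^83 + x^7 + x^4 + x^2 + 1 (illustrative).
POLY = (1 << 83) | (1 << 7) | (1 << 4) | (1 << 2) | 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M) with interleaved reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


def run(steps, regs):
    """Execute (dest, op, src1, src2) steps.

    A destination of "FP" keeps the value on the forwarding path instead of
    writing it to a register; a source of "FP" reads that forwarded value.
    Returns the register file and the number of register writes performed.
    """
    fp = None
    writes = 0
    for dest, op, x, y in steps:
        vx = fp if x == "FP" else regs[x]
        vy = fp if y == "FP" else regs[y]
        value = gf_mul(vx, vy) if op == "MUL" else vx ^ vy  # ADD is XOR
        if dest == "FP":
            fp = value
        else:
            regs[dest] = value
            writes += 1
    return regs, writes


# Steps 2-7 of Algorithm A.9, transcribed from the listing.
STEPS = [
    ("FP", "MUL", "R0", "R4"),
    ("R5", "ADD", "FP", "U21"),
    ("FP", "MUL", "R1", "R4"),
    ("R6", "ADD", "FP", "U20"),
    ("FP", "MUL", "R0", "R5"),
    ("R7", "ADD", "FP", "R6"),
]

# Hypothetical small inputs: R0 = U11, R1 = U10, R4 = Z2.
regs = {"R0": 0x3, "R1": 0x5, "R4": 0x7, "U21": 0x2, "U20": 0xB}
regs, writes = run(STEPS, regs)
```

Under this reading, six operations cost only three register writes, because every product is consumed directly by the following addition.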
Algorithm A.10 The modified register management of divisor doubling for projective
coordinates for an even characteristic when h2 ≠ 0
Input: [U1, U0, V1, V0, Z], $h = h_2 x^2 + h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′] = 2[U1, U0, V1, V0, Z]
1: R0 ← V1, R1 ← V0,R2 ← U1, R3 ← U0,R4 ← Z
2: R5 ← cMUL (R4, h1)
3: R6 ← cMUL (R4, h0)
4: FP (MUL, ADD)←MUL (R5, U1)
5: R8 ← ADD (FP (MUL, ADD), R2)
6: FP (MUL, ADD)←MUL (R5, U0)
7: R5 ← ADD (FP (MUL, ADD), R3)
8: R9 ← SQR (R8)
9: R9 ← SQR (R5)
10: FP (ADD, MUL)←MUL (R8)
11: R10 ←MUL (FP (ADD, MUL), U1)
12: FP (MUL, ADD)←MUL (R5, R9)
13: R10 ← ADD (FP (MUL, ADD), R10)
14: R6 ←MUL (R6, R10)
15: Z ′ ←MUL (R6, R4)
16: FP (MUL, ADD)←MUL (R7, V0)
17: R11 ← ADD (FP (MUL, ADD), R1)
18: FP (MUL, ADD)←MUL (R7, V1)
19: R7 ← ADD (FP (MUL, ADD), V0)
20: R13 ←MUL (R7, R8)
21: R15 ←MUL (FP (ADD, MUL), R13)
22: R14 ← ADD (R8, R9)
23: FP (ADD, MUL)← ADD (R7, R11)
24: R14 ←MUL (FP (ADD, MUL), R14)
25: FP (SUB, SUB)← SUB (R12, R14)
26: R14 ← SUB (FP (SUB, SUB), R15)
27: R14 ←MUL (R14, R10)
28: FP (MUL, ADD)←MUL (U0, R13)
29: R9 ← ADD (FP (MUL, ADD), R12)
30: R11 ←MUL (R7, R4)
31: R12 ←MUL (R7, R10)
32: R10 ←MUL (R4, R9)
33: R13 ←MUL (R10, R11)
34: R10 ← SQR (R10)
35: R9 ←MUL (R7, R9)
36: R7 ← SQR (R7)
37: R9 ←MUL (R10, R14)
38: R14 ← SQR (R11)
39: R11 ←MUL (Z ′, R11)
40: R4 ←MUL (R4, R7)
41: R15 ← ADD (R7, R9)
42: R13 ←MUL (R12, R4)
43: R9 ←MUL (R9, R3)
44: FP (ADD, MUL)← ADD (R2, R3)
45: R2 ←MUL (FP (ADD, MUL), R15)
46: FP (SUB, SUB)← SUB (R2, R9)
47: R2 ← SUB (FP (SUB, SUB), R4)
48: R4 ← ADD (R4, R13)
49: R3 ←MUL (U1, R7)
50: FP (MUL, ADD)← ADD (R6, R3)
51: R6 ←MUL (FP (MUL, ADD), R8)
52: R3 ← ADD (R10, R6)
53: R5 ←MUL (R5, R7)
54: R5 ← ADD (R5, R3)
55: R3 ← ADD (R5, R6)
56: FP (MUL, ADD)←MUL (R7, R8)
57: R4 ← ADD (R4, R5)
58: R6 ←MUL (R3, R4)
59: R7 ←MUL (R4, R5)
60: FP (MUL, ADD)←MUL (R0, R12)
61: R2 ← ADD (FP (MUL, ADD), R2)
62: R4 ← ADD (R4, R2)
63: FP (ADD, MUL)← ADD (R3, R4)
64: R2 ← ADD (FP (ADD,MUL), R14)
65: R0 ← ADD (R2, R6)
66: FP (MUL, ADD)←MUL (R1, R12)
67: R4 ← ADD (FP (MUL, ADD), R9)
68: FP (ADD, MUL)← ADD (R2, R4)
69: R4 ←MUL (FP (ADD, MUL), R14)
70: R1 ← ADD (R6, R4)
Algorithm A.11 The modified register management of divisor addition for projective
coordinates for an even characteristic when h2 ≠ 0
Input: [U11, U10, V11, V10, Z1], [U21, U20, V21, V20, Z2], $h = h_2 x^2 + h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′] = [U11, U10, V11, V10, Z1] + [U21, U20, V21, V20, Z2]
1: R0 ← U21, R1 ← U20,R2 ← V21, R3 ← V20,R4 ← Z1, R5 ← Z2
2: R6 ←MUL (R4, R5)
3: R7 ←MUL (R6, U21)
4: R6 ←MUL (R6, U20)
5: R11 ←MUL (R10, U11)
6: R12 ←MUL (R11, R6)
7: FP (MUL, ADD)←MUL (R9, V10)
8: R13 ← ADD (FP (MUL, ADD), R3)
9: FP (MUL, ADD)←MUL (R9, V11)
10: R9 ← ADD (FP (MUL, ADD), R2)
11: R0 ← SQR (R13)
12: FP (MUL, ADD)← ADD (R11, R10)
13: R11 ←MUL (R11, FP (MUL, ADD))
14: FP (MUL, ADD)←MUL (R10, R9)
15: R9 ← ADD (R9, FP (MUL, ADD))
16: R9 ←MUL (R9, R14)
17: FP (SQR, MUL)← SQR (R9)
18: R13 ←MUL (FP (SQR, MUL), R10)
19: FP (MUL, ADD)←MUL (R13, U10)
20: R12 ← ADD (R12, FP (MUL, ADD))
21: R8 ←MUL (R8, R12)
22: R13 ←MUL (R8, R8)
23: R8 ←MUL (R8, R4)
24: R14 ← ADD (1, U11)
25: R14 ←MUL (R14, R10)
26: FP (MUL, ADD)←MUL (R10, U10)
27: R10 ← ADD (FP (MUL, ADD), R11)
28: FP (SQR, MUL)← SQR (R13)
29: R11 ←MUL (R12, FP (SQR, MUL))
30: R12 ←MUL (R10, R4)
31: R10 ←MUL (R10, R9)
32: R6 ←MUL (R6, R14)
33: R9 ←MUL (R10, R14)
34: FP (MUL, ADD)←MUL (R2, R8)
35: R9 ← ADD (R9, FP (MUL, ADD))
36: R16 ←MUL (R8, R4)
37: R17 ←MUL (R14, U11)
38: R13 ← ADD (R13, R17)
39: R10 ←MUL (R10, R13)
40: R18 ← ADD (R0, R1)
41: FP (MUL, ADD)←MUL (R12, R13)
42: R10 ← ADD (R10, FP (MUL, ADD))
43: R6 ← ADD (R6, R10)
44: R3 ←MUL (R3, R11)
45: R2 ←MUL (R2, R11)
46: R0 ←MUL (R0, R14)
47: FP (ADD, MUL)← ADD (R10, R14)
48: R10 ←MUL (FP (ADD, MUL), R18)
49: R1 ←MUL (R1, R10)
50: R10 ← ADD (R10, R1)
51: R1 ← ADD (R1, R3)
52: FP (SQR, ADD)← SQR (R10)
53: R2 ← ADD (FP (SQR, ADD), R3)
54: R0 ← ADD (R0, R13)
55: R0 ← ADD (R0, R9)
56: R3 ←MUL (R9, R0)
57: R10 ←MUL (R16, h1)
58: FP (ADD, MUL)← ADD (R6, R10)
59: R0 ←MUL (R0, FP (ADD, MUL))
60: R12 ← ADD (R12, R13)
61: FP (MUL, ADD)←MUL (R16, h1)
62: R2 ← ADD (R2, FP (MUL, ADD))
63: FP (ADD, MUL)← ADD (R2, R6)
64: R2 ←MUL (FP (ADD, MUL), R17)
65: R2 ← ADD (R2, R3)
66: R3 ←MUL (R16, h0)
67: FP (ADD, MUL)← ADD (R1, R3)
68: R1 ←MUL (R1, FP (ADD, MUL))
69: R3 ←MUL (R16, R17)
70: R11 ←MUL (R7, h2)
Algorithm A.12 The modified register management of mixed addition for projective
coordinates for an even characteristic when h2 ≠ 0
Input: [U11, U10, V11, V10], [U21, U20, V21, V20, Z2], $h = h_2 x^2 + h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.








1: R0 ← V21, R1 ← V20,R2 ← U21, R3 ← U20,R4 ← Z2
2: FP (MUL, ADD)←MUL (R5, U11)
3: R8 ← ADD (FP (MUL, ADD), R2)
4: FP (MUL, ADD)←MUL (R5, U10)
5: R5 ← ADD (FP (MUL, ADD), R3)
6: FP (MUL, ADD)←MUL (R8, U11)
7: R9 ← ADD (FP (MUL, ADD), R5)
8: FP (SQR, MUL)← SQR (R8)
9: R10 ←MUL (FP (SQR, MUL), U10)
10: FP (MUL, ADD)←MUL (R5, R9)
11: R10 ← ADD (FP (MUL, ADD), R10)
12: R6 ←MUL (R6, R10)
13: Z ′ ←MUL (R6, R4)
14: R6 ← SQR (R6)
15: FP (MUL, ADD)←MUL (R7, V10)
16: R11 ← ADD (FP (MUL, ADD), R1)
17: FP (MUL, ADD)←MUL (R7, V11)
18: R7 ← ADD (FP (MUL, ADD), V21)
19: R13 ←MUL (R7, R8)
20: FP (ADD, MUL)← ADD (U11, 1)
21: R15 ←MUL (FP (ADD, MUL), R13)
22: R14 ← ADD (R8, R9)
23: FP (ADD, MUL)← ADD (R7, R11)
24: R14 ←MUL (FP (ADD, MUL), R14)
25: FP (SUB, SUB)← SUB (R12, R14)
26: R14 ← SUB (FP (SUB, SUB), R15)
27: R14 ←MUL (R14, R10)
28: FP (MUL, ADD)←MUL (U10, R13)
29: R9 ← ADD (FP (MUL, ADD), R12)
30: R11 ←MUL (R7, R4)
31: R12 ←MUL (R7, R10)
32: R10 ←MUL (R4, R9)
33: R13 ←MUL (R10, R11)
34: R10 ← SQR (R10)
35: R9 ←MUL (R7, R9)
36: R7 ← SQR (R7)
37: R9 ←MUL (R10, R14)
38: R14 ← SQR (R11)
39: z′2 ← SQR (Z ′2)
40: R11 ←MUL (Z ′2, R11)
41: z′4 ←MUL (R14, R11)
42: R4 ←MUL (R4, R7)
43: R15 ← ADD (R7, R9)
44: R13 ←MUL (R12, R4)
45: R9 ←MUL (R9, R3)
46: FP (ADD, MUL)← ADD (R2, R3)
47: R2 ←MUL (FP (ADD, MUL), R15)
48: FP (SUB, SUB)← SUB (R2, R9)
49: R2 ← SUB (FP (SUB, SUB), R4)
50: R4 ← ADD (R4, R13)
51: R3 ←MUL (U11, R7)
52: FP (MUL, ADD)← ADD (R6, R3)
53: R6 ←MUL (FP (MUL, ADD), R8)
54: R3 ← ADD (R10, R6)
55: R5 ←MUL (R5, R7)
56: R6 ← cMUL (h1, R11)
57: R5 ← ADD (R5, R3)
58: R3 ← ADD (R5, R6)
59: FP (MUL, ADD)←MUL (R7, R8)
60: R5 ← ADD (FP (MUL, ADD), z′2)
61: R4 ← ADD (R4, R5)
62: R6 ←MUL (R3, R4)
63: R7 ←MUL (R4, R5)
64: FP (MUL, ADD)←MUL (R0, R12)
65: R2 ← ADD (FP (MUL, ADD), R2)
66: R4 ← ADD (R4, R2)
67: FP (ADD, MUL)← ADD (R3, R4)
68: R2 ← ADD (FP (ADD,MUL), R14)
69: R0 ← ADD (R2, R6)
70: FP (MUL, ADD)←MUL (R1, R12)
71: R4 ← ADD (FP (MUL, ADD), R9)
72: FP (ADD, MUL)← ADD (R2, R4)
73: R4 ←MUL (FP (ADD, MUL), R14)
A.3 Recent Coordinates (R)
The classication of the dierent types of genus 2 curves in a characteristic
2 allows signicant increase in the speed of the formulas for doubling which
are included in this section for recent coordinates. These coordinates have the
advantage of allowing ecient doubling explicit formulas but the additions are
more expensive. However, the doubling operation is usually the operation that is
repeated in each loop of the performance of the divisor multiplication operation.
The formulas for recent coordinates can be found in Algorithms 5.4, A.13, A.14,
A.15, A.16, and A.17.
.
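The trade-off described above can be made concrete by counting group operations in a scalar multiplication loop. The sketch below uses a plain left-to-right double-and-add loop, which is an assumption made for illustration and not necessarily the divisor multiplication algorithm used in this thesis. The integers modulo N under addition stand in for the divisor class group so that the example runs on its own; `scalar_multiply`, `N`, and the sample scalar are hypothetical.

```python
def scalar_multiply(k, P, double, add):
    """Left-to-right double-and-add.

    Returns (k*P, number of doublings, number of additions).
    `double` and `add` are the group operations supplied by the caller.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    Q = P
    doublings = additions = 0
    for bit in bin(k)[3:]:      # skip the '0b' prefix and the leading 1 bit
        Q = double(Q)           # one doubling for every remaining bit
        doublings += 1
        if bit == "1":
            Q = add(Q, P)       # one addition only when the bit is set
            additions += 1
    return Q, doublings, additions


# Stand-in group (NOT a Jacobian): integers modulo N under addition.
N = 1_000_003
k = 0b1011001
result, doublings, additions = scalar_multiply(
    k, 5,
    double=lambda Q: (2 * Q) % N,
    add=lambda Q, P: (Q + P) % N,
)
```

In this loop the number of doublings is fixed by the bit length of the scalar, while the number of additions depends on how many of its bits are set, so a coordinate system with cheap doubling reduces the cost of every iteration.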
Algorithm A.13 The modified register management of divisor addition for recent
coordinates for an even characteristic when h2 = 0
Input: [U11, U10, V11, V10, Z1, z1], [U21, U20, V21, V20, Z2, z2], $h = h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′, z′] = [U11, U10, V11, V10, Z1, z1] + [U21, U20, V21, V20, Z2, z2]
1: R0 ← U21, R1 ← U11,R2 ← U10, R3 ← V11,R4 ← V10, R5 ← Z1, R6 ← Z2, R7 ← z1
2: R9 ←MUL (R5, R6)
3: R10 ← SQR (R9)
4: R11 ←MUL (R0, R5)
5: R12 ←MUL (R5, U20)
6: R3 ←MUL (R7, V21)
7: R13 ←MUL (R12, V20)
8: R0 ← ADD (R2, R12)
9: FP (MUL, ADD)←MUL (R13, V10)
10: R13 ← ADD (R9, FP (MUL, ADD))
11: R19 ← ADD (R13, R2)
12: FP (MUL, ADD)←MUL (R11, R7)
13: R1 ← ADD (R1, R11)
14: R4 ←MUL (R2, R7)
15: FP (MUL, ADD)←MUL (R0, R7)
16: R2 ← ADD (FP (MUL, ADD), R5)
17: R2 ←MUL (R2, R0)
18: R3 ←MUL (R0, R2)
19: R17 ←MUL (R2, R12)
20: FP (MUL, ADD)←MUL (R1, R2)
21: R2 ← ADD (FP (MUL, ADD), R4)
22: FP (MUL, ADD)←MUL (R4, R13)
23: R0 ← ADD (FP (MUL, ADD), R2)
24: FP (MUL, ADD)←MUL (R2, R7)
25: R6 ← ADD (R4, FP (MUL, ADD))
26: R9 ←MUL (R6, R9)
27: R13 ← ADD (R19, R13)
28: FP (ADD, MUL)← ADD (R13, R3)
29: R9 ←MUL (FP (ADD, MUL), R4)
30: R3 ← ADD (R2, R5)
31: R8 ←MUL (R9, R8)
32: R4 ←MUL (R2, R12)
33: FP (MUL, ADD)←MUL (R24, R2)
34: R9 ← ADD (R9, FP (MUL, ADD))
35: R16 ←MUL (R16, R7)
36: R12 ←MUL (R9, R12)
37: R1 ←MUL (R9, R16)
38: FP (ADD, MUL)← ADD (R19, R23)
39: R19 ←MUL (FP (ADD, MUL), R22)
40: R12 ←MUL (R12, R13)
41: R16 ←MUL (R16, R10)
42: R10 ←MUL (R10, R14)
43: FP (MUL, ADD)←MUL (R14, R16)
44: R16 ← ADD (FP (MUL, ADD), R13)
45: R13 ←MUL (R13, R15)
46: R10 ←MUL (R10, R11)
47: R11 ← ADD (R11, R15)
48: FP (MUL, ADD)←MUL (R11, R16)
49: R11 ← ADD (FP (MUL, ADD), R13)
50: R13 ← ADD (R13, R3)
51: R11 ← ADD (R11, R10)
52: R11 ← ADD (R11, R12)
53: FP (ADD, MUL)← ADD (R10, R2)
54: R12 ←MUL (FP (ADD, MUL), R16)
55: R12 ← ADD (R12, R9)
56: FP (ADD, MUL)← ADD (R12, R1)
57: R16 ←MUL (R14, FP (ADD, MUL))
58: R10 ←MUL (R16, R10)
59: R15 ←MUL (R4, R8)
60: R2 ←MUL (R8, R1)
61: R11 ← ADD (R12, R10)
62: FP (ADD, MUL)← ADD (R13, R11)
63: R12 ←MUL (FP (ADD, MUL), R2)
64: FP (ADD, MUL)← ADD (R12, R13)
65: R3 ←MUL (FP (ADD, MUL), R12)
66: R10 ← ADD (R10, R13)
67: FP (ADD, MUL)← ADD (R11, R12)
68: R11 ←MUL (FP (ADD, MUL), R16)
69: R13 ←MUL (R16, R9)
70: R11 ← SQR (R11)
71: R12 ← SQR (R12)
Algorithm A.14 The modified register management of mixed addition for recent
coordinates for an even characteristic when h2 = 0
Input: [U11, U10, V11, V10], [U21, U20, V21, V20, Z2, z2], $h = h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′, z′] = [U11, U10, V11, V10] + [U21, U20, V21, V20, Z2, z2]
1: R0 ← U21, R1 ← U20,R2 ← V21, R3 ← V20,R4 ← Z2, R5 ← z2
2: FP (MUL, ADD)←MUL (R4, U11)
3: R10 ← ADD (FP (MUL, ADD), R0)
4: FP (MUL, ADD)←MUL (R6, U10)
5: R6 ← ADD (FP (MUL, ADD), R1)
6: FP (MUL, ADD)←MUL (R10, U11)
7: R11 ← ADD (FP (MUL, ADD), R6)
8: R12 ←MUL (R11, R6)
9: FP (MUL, ADD)←MUL (R9, V10)
10: R13 ← ADD (FP (MUL, ADD), R3)
11: FP (MUL, ADD)←MUL (R9, V11)
12: R9 ← ADD (FP (MUL, ADD), R2)
13: R4 ← ADD (R11, R10)
14: R11 ←MUL (R11, R13)
15: R10 ←MUL (R10, R9)
16: R9 ← ADD (R9, R13)
17: R9 ←MUL (R9, R14)
18: R13 ←MUL (R10, R10)
19: FP (MUL, ADD)←MUL (R13, U10)
20: R12 ← ADD (FP (MUL, ADD), R13)
21: R8 ←MUL (R8, R12)
22: R8 ←MUL (R8, R4)
23: R4 ←MUL (R1, R10)
24: FP (MUL, ADD)←MUL (R10, U10)
25: R10 ← ADD (FP (MUL, ADD), R11)
26: FP (SQR, MUL)← SQR (R9, R14)
27: R11 ←MUL (R12, FP (SQR, MUL))
28: R12 ←MUL (R10, R4)
29: R10 ←MUL (R10, R9)
30: R4 ←MUL (R9, R9)
31: R4 ←MUL (R4, R9)
32: R6 ←MUL (R6, R4)
33: R9 ← ADD (R9, R5)
34: R6 ←MUL (R8, R4)
35: FP (MUL, ADD)←MUL (R14, U11)
36: R13 ← ADD (R13, FP (MUL, ADD))
37: R10 ←MUL (R10, R13)
38: R13 ←MUL (R12, R4)
39: R7 ←MUL (R8, R4)
40: R8 ← ADD (R0, R1)
41: FP (MUL, ADD)←MUL (R12, R12)
42: R10 ← ADD (R10, FP (MUL, ADD))
43: R6 ← ADD (R6, R10)
44: R3 ←MUL (R3, R11)
45: R2 ←MUL (R2, R11)
46: FP (MUL, ADD)←MUL (R0, R14)
47: R10 ← ADD (FP (MUL, ADD), R14)
48: FP (MUL, ADD)←MUL (R1, R10)
49: R1 ← ADD (FP (MUL, ADD), R3)
50: R2 ← ADD (R2, R3)
51: R0 ← ADD (R0, R13)
52: FP (ADD, MUL)← ADD (R0, R9)
53: R3 ←MUL (R9, FP (ADD, MUL))
54: R10 ←MUL (R16, h1)
55: FP (ADD, MUL)← ADD (R6, R10)
56: R0 ←MUL (FP (ADD, MUL), R6)
57: R12 ← ADD (R12, R13)
58: R2 ← ADD (R2, R10)
59: FP (ADD, MUL)← ADD (R2, R6)
60: R2 ←MUL (FP (ADD, MUL), R13)
61: R2 ← ADD (R2, R3)
62: FP (ADD, MUL)← ADD (R1, R3)
63: R1 ←MUL (FP (ADD, MUL), R13)
64: R3 ←MUL (R16, R13)
65: R11 ← SQR (R7)
66: R0 ← SQR (R0)
Algorithm A.15 The modified register management of divisor doubling for recent
coordinates for an even characteristic when h2 ≠ 0
Input: [U1, U0, V1, V0, Z, z], $h = h_2 x^2 + h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′, z′] = 2[U1, U0, V1, V0, Z, z].
1: R0 ← U1, R1 ← U0,R2 ← V1, R3 ← V0,R4 ← Z, R5 ← z
2: R10 ←MUL (R4, h1)
3: R11 ←MUL (R4, h0)
4: FP (MUL, ADD)←MUL (h2, R1)
5: R13 ← ADD (R13, FP (MUL, ADD))
6: R16 ←MUL (R6, h0)
7: FP (ADD, MUL)← ADD (R13, R6)
8: R13 ←MUL (R6, FP (ADD, MUL))
9: R13 ←MUL (R9, R13)
10: FP (MUL, ADD)←MUL (R0, h2)
11: R10 ← ADD (R10, FP (MUL, ADD))
12: R6 ←MUL (R0, R10)
13: FP (MUL, ADD)←MUL (R12, R1)
14: R11 ← ADD (FP (MUL, ADD), R12)
15: R11 ←MUL (R11, R1)
16: FP (MUL, ADD)←MUL (R6, h2)
17: R13 ← ADD (R13, FP (MUL, ADD))
18: R7 ←MUL (R12, R7)
19: R15 ←MUL (R2, h2)
20: FP (MUL, ADD)←MUL (R5, R13)
21: R11 ← ADD (R11, FP (MUL, ADD))
22: R8 ←MUL (R8, R11)
23: R9 ←MUL (R8, R9)
24: R11 ←MUL (R2, R2)
25: FP (MUL, ADD)←MUL (R0, R7)
26: R11 ← ADD (FP (MUL, ADD), R11)
27: R11 ← ADD (R11, R4)
28: R13 ← ADD (R12, R10)
29: R12 ←MUL (R11, R12)
30: R10 ←MUL (R7, R10)
31: FP (ADD, MUL)← ADD (R11, R7)
32: R7 ←MUL (R13, FP (ADD, MUL))
33: R7 ← ADD (R7, R12)
34: R13 ←MUL (R10, R13)
35: FP (ADD, MUL)← ADD (R12, R10)
36: R12 ←MUL (FP (ADD, MUL), h2)
37: FP (MUL, ADD)←MUL (R6, h1)
38: R11 ← ADD (FP (MUL, ADD), R13)
39: FP (MUL, ADD)←MUL (R7, R11)
40: R11 ← ADD (R12, FP (MUL, ADD))
41: R11 ←MUL (R9, R11)
42: R6 ←MUL (R7, R6)
43: R3 ←MUL (R8, R3)
44: FP (MUL, ADD)←MUL (R10, R10)
45: R8 ← ADD (FP (MUL, ADD), R11)
46: R11 ←MUL (R10, R6)
47: R10 ←MUL (R10, R7)
48: FP (MUL, ADD)←MUL (R6, R7)
49: R12 ← ADD (R7, FP (MUL, ADD))
50: R7 ←MUL (R7, R0)
51: R10 ←MUL (R10, R1)
52: R11 ← ADD (R7, R11)
53: FP (MUL, ADD)←MUL (R12, R0)
54: R0 ← ADD (FP (MUL, ADD), R10)
55: R0 ← ADD (R0, R2)
56: FP (ADD, MUL)← ADD (R0, R8)
57: R1 ←MUL (FP (ADD, MUL), R6)
58: R2 ←MUL (R1, R3)
59: R3 ←MUL (R3, R9)
60: R12 ←MUL (R3, h2)
61: FP (MUL, ADD)←MUL (R3, h2)
62: R13 ← ADD (R7, FP (MUL, ADD))
63: R11 ← ADD (R11, R12)
64: R11 ← ADD (R11, R13)
65: FP (MUL, ADD)←MUL (R11, R8)
66: R2 ← ADD (FP (MUL, ADD), R2)
67: FP (MUL, ADD)←MUL (R11, R13)
68: R0 ← ADD (FP (MUL, ADD), R0)
69: FP (MUL, ADD)←MUL (R10, h1)
70: R0 ← ADD (FP (MUL, ADD), R11)
71: R11 ←MUL (R10, h0)
72: R2 ← ADD (R2, R11)
Algorithm A.16 The modified register management of divisor addition for recent
coordinates for an even characteristic when h2 ≠ 0
Input: [U11, U10, V11, V10, Z1, z1], [U21, U20, V21, V20, Z2, z2], $h = h_2 x^2 + h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′, z′] = [U11, U10, V11, V10, Z1, z1] + [U21, U20, V21, V20, Z2, z2]
1: R0 ← U11, R1 ← U10,R2 ← V11, R3 ← V10,R4 ← Z1, R5 ← Z2
2: R6 ←MUL (R4, U21)
3: R9 ←MUL (R4, U20)
4: R10 ←MUL (R1, V21)
5: R12 ←MUL (R0, V20)
6: R13 ← ADD (R4, R5)
7: R12 ←MUL (R12, R9)
8: FP (MUL, ADD)←MUL (R13, R9)
9: R13 ← ADD (R9, FP (MUL, ADD))
10: FP (MUL, ADD)←MUL (R11, R6)
11: R11 ← ADD (R11, FP (MUL, ADD))
12: R5 ←MUL (R12, R6)
13: FP (MUL, ADD)←MUL (R10, R6)
14: R17 ← ADD (R2, FP (MUL, ADD))
15: R10 ←MUL (R12, R20)
16: R13 ←MUL (R18, R23)
17: R10 ←MUL (R10, R1)
18: FP (MUL, ADD)←MUL (R0, R2)
19: R9 ← ADD (R9, FP (MUL, ADD))
20: FP (MUL, ADD)←MUL (R12, R5)
21: R14 ← ADD (R11, FP (MUL, ADD))
22: R8 ←MUL (R6, R8)
23: R6 ←MUL (R16, R6)
24: R12 ←MUL (R3, R12)
25: R13 ←MUL (R2, R13)
26: R12 ←MUL (R12, R9)
27: R11 ←MUL (R9, R5)
28: FP (MUL, ADD)←MUL (R9, R10)
29: R10 ← ADD (FP (MUL, ADD), R9)
30: R2 ←MUL (R2, R13)
31: FP (ADD, MUL)← ADD (R10, R11)
32: R9 ←MUL (FP (ADD, MUL), R9)
33: R11 ← ADD (R11, R10)
34: R11 ← ADD (R11, R12)
35: R12 ←MUL (R13, R2)
36: FP (MUL, ADD)←MUL (R6, R12)
37: R12 ← ADD (R12, FP (MUL, ADD))
38: R12 ← ADD (R12, R9)
39: FP (ADD, MUL)← ADD (R13, R6)
40: R9 ←MUL (FP (ADD, MUL), h2)
41: FP (MUL, ADD)←MUL (R15, R13)
42: R10 ← ADD (R10, FP (MUL, ADD))
43: FP (MUL, ADD)←MUL (R5, h1)
44: R9 ← ADD (R9, FP (MUL, ADD))
45: R8 ← ADD (R9, R8)
46: R8 ←MUL (R6, R8)
47: R12 ← ADD (R12, R8)
48: R11 ← ADD (R11, R12)
49: R1 ←MUL (R9, R3)
50: R10 ←MUL (R5, R8)
51: R3 ←MUL (R19, R20)
52: R4 ←MUL (R10, h2)
53: FP (MUL, ADD)←MUL (R10, h2)
54: R10 ← ADD (R10, FP (MUL, ADD))
55: R8 ← ADD (R8, R4)
56: R1 ← ADD (R8, R12)
57: R10 ← ADD (R10, R2)
58: FP (MUL, ADD)←MUL (R10, R12)
59: R13 ← ADD (FP (MUL, ADD), R13)
60: FP (MUL, ADD)←MUL (R10, R22)
61: R10 ← ADD (FP (MUL, ADD), R11)
62: FP (MUL, ADD)←MUL (R13, h1)
63: R10 ← ADD (R10, FP (MUL, ADD))
64: FP (MUL, ADD)←MUL (R23, h0)
65: R11 ← ADD (R13, FP (MUL, ADD))
Algorithm A.17 The modified register management of mixed addition for recent
coordinates for an even characteristic when h2 ≠ 0
Input: [U11, U10, V11, V10], [U21, U20, V21, V20, Z2, z2], $h = h_2 x^2 + h_1 x + h_0$, $f = x^5 + f_3 x^3 + f_2 x^2 + f_1 x + f_0$.
Output: [U′1, U′0, V′1, V′0, Z′, z′] = [U11, U10, V11, V10] + [U21, U20, V21, V20, Z2, z2]
1: R0 ← U21, R1 ← U20,R2 ← V21, R3 ← V20,R4 ← Z2, R5 ← z2
2: FP (MUL, ADD)←MUL (R4, U11)
3: R14 ← ADD (FP (MUL, ADD), U21)
4: FP (MUL, ADD)←MUL (R4, U10)
5: R10 ← ADD (FP (MUL, ADD), U20)
6: FP (MUL, ADD)←MUL (R0, R14)
7: R15 ← ADD (FP (MUL, ADD), R10)
8: R16 ←MUL (R10, R15)
9: R7 ←MUL (R4, R14)
10: FP (MUL, ADD)←MUL (R17, R1)
11: R16 ← ADD (R16, FP (MUL, ADD))
12: R12 ←MUL (R16, R12)
13: FP (MUL, ADD)←MUL (R3, R13)
14: R1 ← ADD (FP (MUL, ADD), R7)
15: R13 ←MUL (R2, R13)
16: R13 ← ADD (R13, R6)
17: FP (ADD, MUL)← ADD (R15, R14)
18: R15 ←MUL (R15, FP (ADD, MUL))
19: R17 ← ADD (R17, R13)
20: R17 ←MUL (R8, R17)
21: R13 ←MUL (R14, R13)
22: R13 ←MUL (R1, R13)
23: R13 ← ADD (R15, R13)
24: R15 ← ADD (R17, R8)
25: R16 ←MUL (R16, R15)
26: R17 ←MUL (R13, R8)
27: R13 ←MUL (R13, R15)
28: R7 ←MUL (R16, R7)
29: R6 ←MUL (R16, R6)
30: R16 ←MUL (R15, R15)
31: R15 ←MUL (R15, R8)
32: R10 ←MUL (R16, R10)
33: R13 ←MUL (R13, R5)
34: R7 ←MUL (R13, R7)
35: R8 ←MUL (R12, R8)
36: R12 ←MUL (R12, R12)
37: R12 ←MUL (R14, R12)
38: R4 ← ADD (R4, R5)
39: R4 ←MUL (R18, R4)
40: R4 ← ADD (R4, R6)
41: R5 ←MUL (R8, h2)
42: R5 ← ADD (R7, R5)
43: R6 ←MUL (R15, R7)
44: R6 ←MUL (R9, R6)
45: R13 ←MUL (R15, R15)
46: R7 ←MUL (R13, R7)
47: R14 ←MUL (R15, R8)
48: FP (MUL, ADD)←MUL (R8, R8)
49: R14 ← ADD (R16, FP (MUL, ADD))
50: R9 ←MUL (R14, h2)
51: FP (ADD, MUL)← ADD (R11, R16)
52: R9 ←MUL (R0, FP (ADD, MUL))
53: R5 ← ADD (R5, R9)
54: R5 ← ADD (R5, R10)
55: R10 ←MUL (R14, h1)
56: R5 ← ADD (R5, R10)
57: R5 ← ADD (R5, R12)
58: R4 ← ADD (R4, R5)
59: R4 ←MUL (R13, R4)
60: R10 ←MUL (R13, R14)
61: FP (MUL, ADD)←MUL (R14, h2)
62: R12 ← ADD (R12, FP (MUL, ADD))
63: FP (MUL, ADD)←MUL (R14, h2)
64: R6 ← ADD (R6, FP (MUL, ADD))
65: R6 ← ADD (R6, R16)
66: FP (MUL, ADD)←MUL (R6, R5)
67: R7 ← ADD (FP (MUL, ADD), R7)
68: FP (MUL, ADD)←MUL (R6, R16)
69: R4 ← ADD (R6, FP (MUL, ADD))
70: FP (MUL, ADD)←MUL (R10, h1)
71: R4 ← ADD (R4, FP (MUL, ADD))
72: FP (MUL, ADD)←MUL (R10, h0)




This appendix provides an example of VHDL source code for one case of divisor
multiplication: new weighted coordinates for an even characteristic when
h2 = 0. The datapath and control VHDL code refer to the structural views of the
new weighted coordinate divisor doubling and mixed addition datapaths shown in
Figure 5.2 and Figure 5.3, respectively.
library IEEE;
use IEEE.std_logic_1164.all;

-- The package is declared first so that the entity below can use it.
package parameters is
  constant M : integer := 83;
  -- "100" (3 bits) & 20 hex digits (80 bits) gives an M = 83 bit vector.
  constant F : std_logic_vector(M-1 downto 0)
    := "100" & x"00000000000000000095";
end parameters;

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
use work.parameters.all;
entity NW_data_path is port (
  U1, U0, V1, V0, Z1, Z2, z_1,
  z_2, z3, z4, Din, Dout : in std_logic_vector(M-1 downto 0);
  -- Clock and reset are named i_clock and i_reset, as the architecture uses them.
  i_clock, i_reset, start_mult, init : in std_logic;
  Sel_ADD2, Sel_MUL1, Sel_MUL2, Sel_SQR,
  Sel_Const : in std_logic;
  Sel_ADD1, Sel_cMUL : in std_logic_vector(1 downto 0);
  Sel_RA_out1, Sel_RA_out2, Sel_RA_out3, Sel_RA_out4,
  Sel_RB_out1, Sel_RB_out2, Sel_RB_out3, EN_RC
    : in std_logic_vector(3 downto 0);
  U1o, U0o, V1o, V0o, Z1o, Z2o, z1_o, z2_o, z3o, z4o
    : out std_logic_vector(M-1 downto 0);
  readR0, readR1, readR2, readR3, readR4, readR5, readR6,
  readR7, readR8, readR9, readR10
    : in std_logic_vector(M-1 downto 0);
  mult_done, infinity : out std_logic);
end NW_data_path;
architecture circuit of NW_data_path is
  -- Combinational squarer over the M-bit field
  component squarer_83 is port (
    SQR_in : in std_logic_vector(M-1 downto 0);
    SQR_out : out std_logic_vector(M-1 downto 0));
  end component;
  -- Sequential multiplier with start/done handshake
  component mult is port (
    SA_MUL, SB_MUL : in std_logic_vector(M-1 downto 0);
    clk, reset, start_mult : in std_logic;
    MUL_out : out std_logic_vector(M-1 downto 0);
    done_mult : out std_logic);
  end component;
  component Mux2t1 is port (
    in0 : in std_logic_vector(M-1 downto 0);
    in1 : in std_logic_vector(M-1 downto 0);
    Sel_mux2t1 : in std_logic;
    Mux2t1_out : out std_logic_vector(M-1 downto 0));
  end component;
  component Mux3t1 is port (
    in0 : in std_logic_vector(M-1 downto 0);
    in1 : in std_logic_vector(M-1 downto 0);
    in2 : in std_logic_vector(M-1 downto 0);
    Sel_mux3t1 : in std_logic_vector(1 downto 0);
    mux3t1_out : out std_logic_vector(M-1 downto 0));
  end component;
  component mux4t1 is port (
    in0 : in std_logic_vector(M-1 downto 0);
    in1 : in std_logic_vector(M-1 downto 0);
    in2 : in std_logic_vector(M-1 downto 0);
    in3 : in std_logic_vector(M-1 downto 0);
    Sel_mux4t1 : in std_logic_vector(1 downto 0);
    mux4t1_out : out std_logic_vector(M-1 downto 0));
  end component;
  -- Register-file read multiplexer: 11 inputs, 4 independently selected outputs
  component mux11t1 is port (
    in0 : in std_logic_vector(M-1 downto 0);
    in1 : in std_logic_vector(M-1 downto 0);
    in2 : in std_logic_vector(M-1 downto 0);
    in3 : in std_logic_vector(M-1 downto 0);
    in4 : in std_logic_vector(M-1 downto 0);
    in5 : in std_logic_vector(M-1 downto 0);
    in6 : in std_logic_vector(M-1 downto 0);
    in7 : in std_logic_vector(M-1 downto 0);
    in8 : in std_logic_vector(M-1 downto 0);
    in9 : in std_logic_vector(M-1 downto 0);
    in10 : in std_logic_vector(M-1 downto 0);
    Sel_mux11t1_out1 : in std_logic_vector(3 downto 0);
    Sel_mux11t1_out2 : in std_logic_vector(3 downto 0);
    Sel_mux11t1_out3 : in std_logic_vector(3 downto 0);
    Sel_mux11t1_out4 : in std_logic_vector(3 downto 0);
    mux11t1_out1 : out std_logic_vector(M-1 downto 0);
    mux11t1_out2 : out std_logic_vector(M-1 downto 0);
    mux11t1_out3 : out std_logic_vector(M-1 downto 0);
    mux11t1_out4 : out std_logic_vector(M-1 downto 0));
  end component;
  -- Register-file read multiplexer: 11 inputs, 3 independently selected outputs
  component mux11t3 is port (
    in0 : in std_logic_vector(M-1 downto 0);
    in1 : in std_logic_vector(M-1 downto 0);
    in2 : in std_logic_vector(M-1 downto 0);
    in3 : in std_logic_vector(M-1 downto 0);
    in4 : in std_logic_vector(M-1 downto 0);
    in5 : in std_logic_vector(M-1 downto 0);
    in6 : in std_logic_vector(M-1 downto 0);
    in7 : in std_logic_vector(M-1 downto 0);
    in8 : in std_logic_vector(M-1 downto 0);
    in9 : in std_logic_vector(M-1 downto 0);
    in10 : in std_logic_vector(M-1 downto 0);
    Sel_mux11t3_out1 : in std_logic_vector(3 downto 0);
    Sel_mux11t3_out2 : in std_logic_vector(3 downto 0);
    Sel_mux11t3_out3 : in std_logic_vector(3 downto 0);
    mux11t3_out1 : out std_logic_vector(M-1 downto 0);
    mux11t3_out2 : out std_logic_vector(M-1 downto 0);
    mux11t3_out3 : out std_logic_vector(M-1 downto 0));
  end component;
  signal SA_ADD, SB_ADD, ADD_out, SA_MUL, SB_MUL, MUL_out,
    cADD_in1, cADD_out, cMUL_in1, cMUL_in2, cMUL_out,
    RA_out1, RA_out2, RA_out3, RA_out4, RB_out1, RB_out2, RB_out3,
    RC_in, SQR_in, SQR_out : std_logic_vector(M-1 downto 0);
  signal R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10
    : std_logic_vector(M-1 downto 0);
  signal loadR0, loadR1, loadR2, loadR3, loadR4, loadR5,
    loadR6, loadR7, loadR8, loadR9, loadR10
    : std_logic;
  constant zero : std_logic_vector(M-1 downto 0) := (others => '0');
  constant one : std_logic_vector(M-1 downto 0)
    := conv_std_logic_vector(1, M);
  -- One-hot state encoding: exactly one of the 20 bits is set per state.
  subtype state_ty is std_logic_vector(19 downto 0);
  constant S0 : state_ty := "00000000000000000001";
  constant S1 : state_ty := "00000000000000000010";
  constant S2 : state_ty := "00000000000000000100";
  constant S3 : state_ty := "00000000000000001000";
  constant S4 : state_ty := "00000000000000010000";
  constant S5 : state_ty := "00000000000000100000";
  constant S6 : state_ty := "00000000000001000000";
  constant S7 : state_ty := "00000000000010000000";
  constant S8 : state_ty := "00000000000100000000";
  constant S9 : state_ty := "00000000001000000000";
  constant S10 : state_ty := "00000000010000000000";
  constant S11 : state_ty := "00000000100000000000";
  constant S12 : state_ty := "00000001000000000000";
  constant S13 : state_ty := "00000010000000000000";
  constant S14 : state_ty := "00000100000000000000";
  constant S15 : state_ty := "00001000000000000000";
  constant S16 : state_ty := "00010000000000000000";
  constant S17 : state_ty := "00100000000000000000";
  constant S18 : state_ty := "01000000000000000000";
  constant S19 : state_ty := "10000000000000000000";
  signal state : state_ty;
begin
MUX1: Mux3t1 port map ( in0 => cADD_out , in1 => ADD_out,
in2 => cADD_out , Sel_mux3t1 => Sel_ADD1 ,
mux3t1_out => SA_ADD) ;
MUX2: Mux2t1 port map ( in0 => ADD_out,
in1 => RA_out1 , Sel_mux2t1 => Sel_MUL1 ,
Mux2t1_out => SA_MUL) ;
MUX3: Mux4t1 port map ( in0 => cMUL_out , in1 =>
SQR_out , in2 => Din , in3 => RB_out1 , Sel_mux4t1 =>
Sel_cMUL , mux4t1_out => cMUL_in1 ) ;
MUX4: Mux2t1 port map ( in0 => RA_out2 , in1 => one ,
Sel_mux2t1 => Sel_SQR , Mux2t1_out => SQR_in ) ;
MUX5: mux11t1 port map (
in0 => readR0 , in1 => readR1 , in2 => readR2 ,
in3 => readR3 , in4 => readR4 , in5 => readR5 ,
in6 => readR6 , in7 => readR7 , in8 => readR8 ,
in9 => readR9 , in10 => readR10 ,
Sel_mux11t1_out1 => Sel_RA_out1 , Sel_mux11t1_out2 =>
Sel_RA_out2 , Sel_mux11t1_out3 => Sel_RA_out3 ,
Sel_mux11t1_out4 => Sel_RA_out4 , mux11t1_out1 =>
RA_out1 , mux11t1_out2 => RA_out2 , mux11t1_out3 =>
RA_out3 , mux11t1_out4 => RA_out4 ) ;
MUX6: mux11t3 port map ( in0 => readR0 , in1 => readR1 ,
in2 => readR2 , in3 => readR3 , in4 => readR4 ,
in5 => readR5 , in6 => readR6 , in7 => readR7 ,
in8 => readR8 , in9 => readR9 , in10 => readR10 ,
153
Sel_mux11t3_out1 => Sel_RB_out1 ,
Sel_mux11t3_out2 => Sel_RB_out2 ,
Sel_mux11t3_out3 => Sel_RB_out3 ,
mux11t3_out1 => RB_out1 ,
mux11t3_out2 => RB_out2 , mux11t3_out3 => RB_out3 ) ;
M6: Mux2t1 port map ( in0 => Din ,
in1 => RB_out2 , Sel_mux2t1 =>
Sel_ADD2 , Mux2t1_out => SB_ADD) ;
M7: Mux2t1 port map ( in0 => Din , in1 => RB_out3 ,
Sel_mux2t1 => Sel_MUL2 , Mux2t1_out => SB_MUL) ;
M8: Mux2t1 port map ( in0 => SQR_out , in1 => one ,
Sel_mux2t1 => Sel_const ,
Mux2t1_out => cMUL_in2 ) ;
-- Field addition in a binary field is a bitwise XOR
ADD: for i in 0 to M-1 generate
  ADD_out(i) <= SA_ADD(i) XOR SB_ADD(i);
end generate;
Bit_multiplier: mult port map (SA_MUL => SA_MUL,
  SB_MUL => SB_MUL, i_clock => i_clock,
  i_reset => i_reset, start_mult =>
  start_mult, MUL_out => MUL_out, done_mult =>
  mult_done);
-- Constant adder: adds the field element "one" to RA_out3
cADD: for i in 0 to M-1 generate
  cADD_out(i) <= RA_out3(i) XOR one(i);
end generate;
squarer: squarer_83 port map (SQR_in => SQR_in,
  SQR_out => SQR_out);
-- Working registers: synchronous reset to an input coordinate,
-- otherwise loaded from RC_in when the matching load signal is high
register_R0: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R0 <= U0;
    elsif loadR0 = '1' then
      R0 <= RC_in;
    end if;
  end if;
end process;
register_R1: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R1 <= U1;
    elsif loadR1 = '1' then
      R1 <= RC_in;
    end if;
  end if;
end process;
register_R2: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R2 <= V1;
    elsif loadR2 = '1' then
      R2 <= RC_in;
    end if;
  end if;
end process;
register_R3: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R3 <= V0;
    elsif loadR3 = '1' then
      R3 <= RC_in;
    end if;
  end if;
end process;
register_R4: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R4 <= Z1;
    elsif loadR4 = '1' then
      R4 <= RC_in;
    end if;
  end if;
end process;
register_R5: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R5 <= Z2;
    elsif loadR5 = '1' then
      R5 <= RC_in;
    end if;
  end if;
end process;
register_R6: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R6 <= z_1;
    elsif loadR6 = '1' then
      R6 <= RC_in;
    end if;
  end if;
end process;
register_R7: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R7 <= z_2;
    elsif loadR7 = '1' then
      R7 <= RC_in;
    end if;
  end if;
end process;
register_R8: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R8 <= z3;
    elsif loadR8 = '1' then
      R8 <= RC_in;
    end if;
  end if;
end process;
register_R9: process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      R9 <= z4;
    elsif loadR9 = '1' then
      R9 <= RC_in;
    end if;
  end if;
end process;
-- State register and next-state logic
process (i_clock) begin
  if rising_edge(i_clock) then
    if i_reset = '1' then
      state <= S0;
    else
      case state is
        when S0 =>
          if in_k(M-1) = '0' then state <= S1;
          else state <= S10;
          end if;
        when S1 =>
          if start_mult = '0' then state <= S2; end if;
        when S2 =>
          if mult_done = '1' then state <= S3; end if;
        when S3 => state <= S4;
        when S4 =>
          if mult_done = '1' then state <= S5; end if;
        when S5 => state <= S6;
        when S6 => state <= S7;
        when S7 =>
          if mult_done = '1' then state <= S8; end if;
        -- The S8 transition is absent from the source listing; S9 is
        -- assumed from the pattern of the surrounding states.
        when S8 => state <= S9;
        when S9 => state <= S10;
        when S10 =>
          if mult_done = '1' then state <= S11; end if;
        when S11 =>
          if mult_done = '1' then state <= S12; end if;
        when S12 => state <= S13;
        when S13 => state <= S14;
        when S14 =>
          if mult_done = '1' then state <= S15; end if;
        when S15 => state <= S16;
        when S16 => state <= S17;
        when S17 =>
          if mult_done = '1' then state <= S18; end if;
        when S18 => state <= S19;
        when S19 => state <= S20;
        when S20 =>
          if count < M-1 then state <= S0; end if;
      end case;
    end if;
  end if;
end process;
-- Control word decoded from the current state
control_unit: process (i_clock, i_reset, state)
begin
  case state is
    when S0 | S1 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "00";
      Sel_ADD1 <= "00"; Sel_RA <= "0101";
      Sel_RB <= "0010"; EN_RC <= "1000"; shift <= '0';
    when S2 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "00";
      Sel_ADD1 <= "00"; Sel_RA <= "0101";
      Sel_RB <= "0011"; EN_RC <= "0101"; shift <= '0';
    when S3 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "00";
      Sel_ADD1 <= "00"; Sel_RA <= "1000";
      Sel_RB <= "0101"; EN_RC <= "1001"; shift <= '0';
    when S4 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '1';
      Sel_Const <= '0'; Sel_cMUL <= "01";
      Sel_ADD1 <= "10"; Sel_RA <= "1000";
      Sel_RB <= "0110"; EN_RC <= "0100";
      shift <= '0'; done <= '1';
    when S5 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "00";
      Sel_ADD1 <= "00"; Sel_RA <= "0111";
      Sel_RB <= "0100"; EN_RC <= "0011";
    when S6 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "01";
      Sel_ADD1 <= "00"; Sel_RA_out1 <= "0101";
      Sel_RA_out2 <= "1010"; Sel_RA_out3 <= "0010";
      Sel_RA_out4 <= "0100"; Sel_RB <= "0100";
      EN_RC <= "0000";
    when S7 =>
      Sel_ADD2 <= '1'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "11";
      Sel_ADD1 <= "10"; Sel_RA_out1 <= "1001";
      Sel_RA_out2 <= "0010"; Sel_RA_out3 <= "0100";
      Sel_RA_out4 <= "1000"; Sel_RB <= "0000";
      EN_RC <= "0010";
    when S8 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '1'; Sel_cMUL <= "00";
      Sel_ADD1 <= "00"; Sel_RA <= "0011";
      Sel_RB <= "0010"; EN_RC <= "1010";
    when S9 =>
      Sel_ADD2 <= '1'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "01";
      Sel_ADD1 <= "00"; Sel_RA <= "0101";
      Sel_RB <= "0100"; EN_RC <= "0100";
    when S10 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "00";
      Sel_ADD1 <= "00"; Sel_RA <= "0011";
      Sel_RB <= "0100"; EN_RC <= "0010";
    when S11 =>
      Sel_ADD2 <= '1'; Sel_MUL1 <= '1';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "11";
      Sel_ADD1 <= "00"; Sel_RA <= "0000";
      Sel_RB <= "0000"; EN_RC <= "0000";
    when S12 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '1';
      Sel_MUL2 <= '0'; Sel_SQR <= '1';
      Sel_Const <= '0'; Sel_cMUL <= "10";
      Sel_ADD1 <= "00"; Sel_RA <= "1001";
      Sel_RB <= "0010"; EN_RC <= "0100";
    when S13 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "01";
      Sel_ADD1 <= "00"; Sel_RA <= "0011";
      Sel_RB <= "0010"; EN_RC <= "0011";
    when S14 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "11";
      Sel_ADD1 <= "00"; Sel_RA <= "1001";
      Sel_RB <= "0111"; EN_RC <= "1001";
    when S15 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '1';
      Sel_Const <= '0'; Sel_cMUL <= "01";
      Sel_ADD1 <= "00"; Sel_RA <= "0100";
      Sel_RB <= "0011"; EN_RC <= "1000";
    when S16 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "11";
      Sel_ADD1 <= "00"; Sel_RA <= "0110";
      Sel_RB <= "0101"; EN_RC <= "0110";
    when S17 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '1';
      Sel_MUL2 <= '0'; Sel_SQR <= '1';
      Sel_Const <= '1'; Sel_cMUL <= "11";
      Sel_ADD1 <= "00"; Sel_RA <= "0100";
      Sel_RB <= "0011"; EN_RC <= "0100";
    when S18 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '0'; Sel_SQR <= '0';
      Sel_Const <= '0'; Sel_cMUL <= "11";
      Sel_ADD1 <= "00"; Sel_RA <= "0100";
      Sel_RB <= "0010"; EN_RC <= "0100";
    when S19 =>
      Sel_ADD2 <= '0'; Sel_MUL1 <= '1';
      Sel_MUL2 <= '0'; Sel_SQR <= '1';
      Sel_Const <= '1'; Sel_cMUL <= "11";
      Sel_ADD1 <= "00"; Sel_RA <= "1001";
      Sel_RB <= "0111"; EN_RC <= "0001";
    when S20 =>
      Sel_ADD2 <= '1'; Sel_MUL1 <= '0';
      Sel_MUL2 <= '1'; Sel_SQR <= '0';
      Sel_Const <= '1'; Sel_cMUL <= "00";
      Sel_ADD1 <= "00"; Sel_RA <= "0100";
      Sel_RB <= "1000"; EN_RC <= "0010";
  end case;




The RTL schematic for the $\mathbb{F}_{2^{163}}$ multiplier, generated by Mentor
Graphics Precision RTL, is shown in Figure C.1. The schematic was produced from
a VHDL model of the finite field multiplier component. This model is used with
the top-level component of the divisor addition datapath to form a complete
divisor multiplication. Similarly, the RTL schematics for the
$\mathbb{F}_{2^{83}}$ and $\mathbb{F}_{2^{163}}$ parallel squarers are shown in
Figure C.3 and Figure C.2, respectively.
Simulation waveforms from ModelSim are shown in Figure C.4. In this simulation
run, the two operands of the $\mathbb{F}_{2^{163}}$ multiplier were set to two
random 163-bit values, displayed in hexadecimal. The expected result is
available at 16840 ns. The simulation was performed by creating a VHDL test
bench for the finite field multiplier. Similarly, simulation waveforms from
ModelSim for the $\mathbb{F}_{2^{163}}$ parallel squarer, together with its VHDL
model, are shown in Figure C.5. The expected result is available at 5.932 ns,
which is treated as one clock cycle. This VHDL model is used with the
$\mathbb{F}_{2^{163}}$ multiplier model to form a complete FFAU used in the
higher-level component of the divisor addition datapath.
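When checking waveforms such as those described above, it can help to compare the simulated multiplier and squarer outputs against a software reference model. The following Python sketch is one such model and is not taken from the thesis. It assumes a polynomial basis and, for m = 163, the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1; the function names `gf2m_add`, `gf2m_mul` and `gf2m_sqr` are hypothetical, and the polynomial should be replaced by the one the hardware actually uses.

```python
# Software reference model for binary-field arithmetic (polynomial basis).
# ASSUMPTION: reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (NIST, m = 163);
# substitute the polynomial actually used by the hardware model.
M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf2m_add(a: int, b: int) -> int:
    """Field addition is a bitwise XOR of the two operands."""
    return a ^ b


def gf2m_mul(a: int, b: int, m: int = M, poly: int = POLY) -> int:
    """MSB-first shift-and-add multiplication, consuming one bit of b per step."""
    acc = 0
    for i in reversed(range(m)):
        acc <<= 1                  # multiply the accumulator by x
        if (acc >> m) & 1:         # degree reached m: subtract (XOR) the modulus
            acc ^= poly
        if (b >> i) & 1:           # add a when bit i of b is set
            acc ^= a
    return acc


def gf2m_sqr(a: int, m: int = M, poly: int = POLY) -> int:
    """Squaring: interleave zeros between the bits of a, then reduce."""
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    # Clear every bit of degree >= m, highest degree first.
    for deg in range(2 * m - 2, m - 1, -1):
        if (spread >> deg) & 1:
            spread ^= poly << (deg - m)
    return spread
```

To use it, copy the two operand values from the waveform as Python integers (for example `int("...", 16)` on the displayed hexadecimal string) and compare `gf2m_mul(a, b)` with the value on `MUL_out` once `done_mult` is asserted. The squarer can be checked the same way with `gf2m_sqr`, which should always agree with `gf2m_mul(a, a)`.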
Figure C.1: RTL schematic results after synthesis for $\mathbb{F}_{2^{163}}$ multiplier component
Figure C.2: RTL schematic results after synthesis for $\mathbb{F}_{2^{163}}$ parallel squarer component
Figure C.3: RTL schematic results after synthesis for $\mathbb{F}_{2^{83}}$ parallel squarer component
Figure C.4: Simulation waveforms for $\mathbb{F}_{2^{163}}$ multiplier VHDL model
Figure C.5: Simulation waveforms for $\mathbb{F}_{2^{163}}$ parallel squarer VHDL model
Figure C.6: FPGA synthesis area results for recent coordinates divisor multiplication when $h_2 \neq 0$
Figure C.7: FPGA synthesis power results through XPower for recent coordinates when $h_2 \neq 0$ at 100 MHz
Figure C.8: Simulation waveforms for $\mathbb{F}_{2^{163}}$ ECC scalar multiplication VHDL model
