# High Speed and Low-Complexity Hardware Architectures for Elliptic Curve-Based Crypto-Processors 

Reza Azarderakhsh<br>University of Western Ontario

Supervisor
Dr. Arash Reyhani-Masoleh
The University of Western Ontario

Graduate Program in Electrical and Computer Engineering
A thesis submitted in partial fulfillment of the requirements for the degree in Doctor of Philosophy
© Reza Azarderakhsh 2011


## Recommended Citation

Azarderakhsh, Reza, "High Speed and Low-Complexity Hardware Architectures for Elliptic Curve-Based Crypto-Processors" (2011). Electronic Thesis and Dissertation Repository. 308.
https://ir.lib.uwo.ca/etd/308


# High Speed and Low-Complexity Hardware Architectures for Elliptic Curve-Based Crypto-Processors 

(Spine Title: Hardware Architectures for Elliptic
Curve Cryptography)
(Thesis Format: Monograph)
by
Reza Azarderakhsh

Faculty of Engineering
Department of Electrical and Computer Engineering

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

School of Graduate and Postdoctoral Studies
The University of Western Ontario
London, Ontario, Canada
November, 2011
© Reza Azarderakhsh 2011

# Certificate of Examination 

The University of Western Ontario

School of Graduate and Postdoctoral Studies

Supervisor: Dr. Arash Reyhani-Masoleh

Examining Board: Dr. Ali Miri, Dr. Abdallah Shami, Dr. Anestis Dounavis, Dr. Éric Schost

The thesis by
Reza Azarderakhsh
entitled:
High Speed and Low-Complexity Hardware Architectures for Elliptic Curve-Based Crypto-Processors
is accepted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy

November 18, 2011


#### Abstract

Elliptic curve cryptography (ECC) has been identified as an efficient scheme for public-key cryptography. This thesis studies the efficient implementation of ECC crypto-processors on hardware platforms using a bottom-up approach. We first study efficient and low-complexity architectures for finite field multiplication over Gaussian normal basis (GNB). We propose three new low-complexity digit-level architectures for finite field multiplication. The architectures are modified to make them more suitable for hardware implementation, with a particular focus on reducing area usage. Then, for the first time, we propose a hybrid digit-level multiplier architecture that performs two multiplications together (double-multiplication) in the same number of clock cycles as a single multiplication. We propose a new hardware architecture for point multiplication on the newly introduced binary Edwards and generalized Hessian curves. We investigate higher-level parallelization and lower-level scheduling for point multiplication on these curves. We also propose a highly parallel architecture for point multiplication on Koblitz curves by modifying the addition formulation. Several FPGA implementations exploiting these modifications are presented in this thesis. We employ the proposed hybrid multiplier architecture to reduce the latency of point multiplication in ECC crypto-processors as well as of double-exponentiation. This scheme is the first known method to increase the speed of point multiplication whenever parallelization fails due to data dependencies among lower-level arithmetic computations. Our comparison results show that the proposed multiplier architectures outperform their counterparts in the literature. Furthermore, fast computation of point multiplication on different binary elliptic curves is achieved.


Keywords: Elliptic curve cryptography, Gaussian normal basis, digit-level finite field multiplication, hybrid multiplier, point multiplication, FPGA, ASIC.

## Dedication

To Olfat and Ava.
To my parents.

## Acknowledgements

All praise is due to God. I would like to express my appreciation and gratitude to Prof. Arash Reyhani-Masoleh for supervising my research during my Ph.D. studies at the University of Western Ontario. I also would like to thank my colleagues, Dr. Arash Hariri, Dr. Mehran Mozaffari-Kermani, and Christopher Kennedy for their suggestions, comments, and sharing their knowledge with me. I would like to thank my lab-mates Mohsen Bahramali, Adam Aksoy, Ebrahim Hasan, and S. Behdad Hosseini.

I would like to thank my committee members, Dr. Ali Miri, Dr. Éric Schost, Dr. Abdallah Shami, and Dr. Anestis Dounavis, for taking the time to read this thesis and for providing constructive comments.

Last but not least, I am grateful to my wife for her love, kind support, and patience while I was working on this research. I would like to thank my parents, my sisters, and my brother for their wisdom and moral support. Special thanks go to my brother, Alireza Azarderakhsh, for his dedication and unflagging support.

## Contents

Certificate of Examination
Abstract
Dedication
Acknowledgements
Contents
List of Tables
List of Figures
List of Algorithms
Nomenclature
1 Introduction
1.1 Problem Statement and Motivation
1.2 Objectives of the Thesis
1.3 Thesis Outline
2 Preliminaries and Literature Review
2.1 Finite Fields
2.2 Binary Fields Arithmetic
2.2.1 Polynomial Basis
2.2.2 Normal Basis
2.2.3 Finite Field Multiplication
2.2.3.1 Multiplication Using Normal Basis
2.2.3.2 Multiplication Using Gaussian Normal Basis
2.2.3.3 Inversion
2.2.3.4 Trace and Quadratic Equation Solution
2.2.4 Multiplier Architectures
2.2.4.1 Bit-Level NB Multiplication
2.2.4.2 An Example
2.2.4.3 Digit-level GNB multiplication
2.2.4.4 Digit-level PISO GNB multiplier
2.2.4.5 Digit-level PIPO GNB Multiplier
2.3 Elliptic Curve Cryptography
2.3.1 Elliptic Curve Arithmetic
2.3.2 Inversion free Coordinates
2.3.2.1 Standard Projective Coordinates
2.3.2.2 Lopez-Dahab Projective Coordinates
2.3.2.3 Jacobian Projective Coordinates
2.3.3 Point Multiplication
2.3.3.1 Double-And-Add Point Multiplication
2.3.3.2 Montgomery Point Multiplication
3 Low-Complexity Architectures for Digit-level and Bit-parallel GNB Multipliers over $G F\left(2^{m}\right)$
3.1 An Improved Architecture for Digit-level PIPO GNB Multiplier
3.1.1 Complexities
3.1.2 An Example over $G F\left(2^{7}\right)$
3.1.3 Simulation Results for the DL-PIPO GNB Multiplier over $G F\left(2^{163}\right)$ and $G F\left(2^{283}\right)$
3.2 New Architecture for Digit-Level SIPO GNB Multiplier
3.2.1 Formulation
3.2.2 New Architecture
3.2.2.1 Complexities
3.2.2.2 Complexity Reduction
3.2.3 An Illustrative Example
3.2.4 Simulations
3.3 New Architecture for Digit-Level PISO GNB multiplier
3.3.1 Low-Complexity Digit-Level PISO GNB Multiplier
3.3.1.1 Improved Architecture
3.3.1.2 Complexities
3.3.2 Complexity Comparison
3.4 An Extension to Bit-Parallel GNB Multiplier
3.4.1 Comparison
3.5 FPGA and ASIC Implementations
3.6 Conclusion
4 Efficient FPGA Implementation of Point Multiplication over Binary Edwards and Generalized Hessian Curves Using Gaussian Normal Basis
4.1 Preliminaries
4.1.1 Arithmetic over Binary Edwards and Generalized Hessian Curves
4.1.2 Point Addition and Doubling Using Differential Formulations in $w$-coordinates
4.2 Point Multiplication on Binary Edwards and Generalized Hessian Curves
4.2.1 Point Multiplication
4.2.2 Parallelism in Point Multiplication Algorithm
4.2.2.1 Scheduling Field Operations for PA and PD
4.2.2.2 Parallelization for Binary Edwards Curve (BEC)
4.2.2.3 Parallelization for Generalized Hessian Curve (GHC)
4.2.2.4 Parallelization for Binary Generic Curve (BGC)
4.2.3 Recovering the Final Coordinates of $x$ and $y$
4.2.4 Latency of Point Multiplication Operations
4.3 Architecture of the Proposed Elliptic Curve Crypto-Processor
4.3.1 Field Arithmetic Unit (FAU)
4.3.2 A Fast and Low-Complexity Digit-Level GNB Multiplier over $G F\left(2^{m}\right)$
4.3.2.1 Hardware Architecture
4.3.2.2 Complexities
4.3.2.3 LUT-based Critical-path Delay Analysis
4.3.2.4 Implementation
4.3.3 Memory and Control Unit
4.3.3.1 Memory
4.3.3.2 Control Unit
4.4 Comparisons and Implementations
4.4.1 Side-Channel Analysis
4.4.2 Implementation Results and Discussion
4.5 Conclusions
5 New Architecture for Double-Multiplication Using GNB and Its Applications for Exponentiation and Elliptic Curve Cryptography
5.1 Hybrid Multiplication
5.1.1 Traditional Multiplication Scheme
5.1.2 Hybrid Multiplication Scheme
5.1.2.1 Analysis
5.2 Applications of the Proposed Hybrid Multiplier
5.2.1 Double-Exponentiation
5.2.2 Reducing the Latency of Point Multiplication on Binary Curves
5.2.2.1 Binary Edwards Curves
5.2.2.2 Generalized Hessian Curves
5.2.2.3 Binary Koblitz Curves
5.2.2.4 Attacking ECC2K-130
5.3 Implementations
5.4 Conclusion
6 Highly Parallel and Fast Crypto-Processor for Point Multiplication on Koblitz Curves
6.1 Properties of Koblitz Curves
6.1.1 Point Addition on Koblitz Curves
6.1.1.1 Lopez-Dahab Projective Coordinates
6.1.2 Point Multiplication on Koblitz Curves
6.2 High-Speed Parallelization of Point Addition
6.2.1 Latency of Point Multiplication
6.3 Proposed Crypto-processor for Point Multiplication
6.3.1 Field Arithmetic Unit (FAU)
6.3.2 Control Unit and the Register File
6.3.3 Coordinate Converter
6.4 FPGA Implementations
6.4.1 Comparisons
6.5 Conclusion
7 Summary and Future Work
7.1 Thesis Contributions
7.2 Future Work
Bibliography
Vita

## List of Tables

2.1 The Sequence of $F$ for type 4 GNB over $G F\left(2^{7}\right)$
2.2 The values of $F$ for type 2 GNB over $G F\left(2^{5}\right)$
2.3 Content of Variables in the LSB-first and MSB-first multiplication of $A=(01110)$ and $B=(10101)$ over $G F\left(2^{5}\right)$.
3.1 Comparison of number of XOR gates between bit-parallel GNB multipliers for $G F\left(2^{163}\right)$ and $G F\left(2^{283}\right)$.
3.2 Contents of variables in the proposed architecture for LSD-first DL-SIPO type 4 GNB multiplier over $G F\left(2^{7}\right)$.
3.3 Comparison of the most recently proposed type $T$ digit-level GNB multipliers over $G F\left(2^{m}\right)$ with parallel outputs.
3.4 Area and time complexity comparison of bit-parallel GNB multipliers over $G F\left(2^{m}\right)$. Note that for Type T GNB: $C_{N} \leq T m-T+1$.
3.5 FPGA implementation of BL-SIPO (Fig. 2.1) multiplier for type 4 over $G F\left(2^{163}\right)$ on xc4vlx100-ff1148 device.
3.6 ASIC synthesis results for BL-SIPO (Fig. 2.1) multiplier for type 4 over $G F\left(2^{163}\right)$.
3.7 FPGA (Xilinx® Virtex™-4 xc4vlx100-ff1148 device) and ASIC (65-nm CMOS library) synthesis results for the improved DL-SIPO (Fig. 3.3) multiplier architectures for type 4 GNB over $G F\left(2^{163}\right)$ for different digit sizes.
3.8 FPGA (Xilinx® Virtex™-4 xc4vlx100-ff1148 device) and ASIC (65-nm CMOS library) synthesis results for the improved DL-PISO (Fig. 3.5) multiplier architecture for type 4 GNB over $G F\left(2^{163}\right)$ for different digit sizes.
3.9 FPGA (Xilinx® Virtex™-4 xc4vlx100-ff1148 device) and ASIC (65-nm CMOS library) synthesis results for the improved DL-PIPO (Fig. 3.1) multiplier architecture for type 4 GNB over $G F\left(2^{163}\right)$ for different digit sizes.
4.1 Cost of point operations on binary Edwards curves (BECs), generalized Hessian curves (GHCs), and binary generic curves (BGCs) over $G F\left(2^{m}\right)$ [1], [2], and [3].
4.2 Multiplier Utilization factors for data dependency graph of different curves.
4.3 Latency of the operations in the point multiplication with $\mathcal{M}=1,2,3$, where $M$ is the number of clock cycles required for multiplication of two arbitrary field elements.
4.4 Critical-path delay of the pipelined and non-pipelined architecture of presented digit-level type 4 GNB multiplier over $G F\left(2^{163}\right)$.
4.5 LUT-based critical-path delay (CPD) $\left(T_{L U T}\right)$ of the presented pipelined multiplier for different digit sizes $(d)$ and levels of accumulation $(\ell)$ for type 4 GNB multiplier over $G F\left(2^{163}\right)$ where $K=\left\lceil\frac{d}{\ell}\right\rceil$.
4.6 FPGA implementation results for BECs over $G F\left(2^{163}\right)$ and $\mathcal{M}=2$.
4.7 FPGA implementation results for GHC over $G F\left(2^{163}\right)$ and $\mathcal{M}=2$.
4.8 FPGA implementation results for BGC over $G F\left(2^{163}\right)$ and $\mathcal{M}=2$.
4.9 Comparison of ECC implementations on FPGA over $G F\left(2^{163}\right)$.
5.1 Time delay evaluation of the proposed structure for type 4 GNB over $G F\left(2^{163}\right)$.
5.2 ASIC and FPGA implementation results for the proposed low-complexity hybrid multiplier architecture (Fig. 5.1) over $G F\left(2^{163}\right)$ for different digit sizes.
6.1 Comparison of the latency for performing point addition in the main loop on Koblitz curves in terms of number of multipliers.
6.2 The implementation results of the point multiplication on Koblitz curves on Altera® Stratix™ II EP2S180F1020C3 FPGA device.
6.3 Comparison of related works for FPGA implementations of point multiplication on Koblitz curves using digit-level finite field multipliers.

## List of Figures

2.1 The architecture of (a) LSB-first bit-level SIPO (b) MSB-first bit-level normal basis multipliers [4] (c) The architecture of $P$ module for type $T$ GNB.
2.2 The architecture of the digit-level PISO GNB multiplier [5].
2.3 The architecture of Digit-level PIPO GNB multiplier proposed in [5], [6], where the $i$-fold right cyclic shift is denoted by ${ }_{\Perp}^{i}$ and $r$ is a number $0 \leq r \leq d-1$ such that $m=q d-r$.
2.4 Group law on Elliptic curve over $\mathbb{R}$.
3.1 The proposed improved architecture for DL-PIPO GNB multiplier.
3.2 Comparison between the number of XOR gates required in the DL-PIPO and the improved DL-PIPO for (a): $m=163(T=4)$, (b): $m=283(T=6)$.
3.3 (a) The proposed architecture for LSD-first DL-SIPO multiplier. (b) An example of the proposed multiplier for type 4 GNB over $G F\left(2^{7}\right)$ and $d=2$.
3.4 Comparison among the numbers of XOR gates required in the original and the improved digit-level SIPO multiplier architectures [7] for (a) type $T=4$ GNB over $G F\left(2^{163}\right)$ and (b) type $T=6$ GNB over $G F\left(2^{283}\right)$.
3.5 (a) The architecture of the improved digit-level PISO GNB multiplier architecture with the LSD-first output. (b) The improved architecture of type 4 GNB multiplier over $G F\left(2^{7}\right)$ and $d=2$.
3.6 Comparison among the numbers of XOR gates required in the original and improved digit-level PISO multiplier architectures for (a) type $T=4$ GNB over $G F\left(2^{163}\right)$ and (b) type $T=6$ GNB over $G F\left(2^{283}\right)$.
3.7 The architecture of proposed bit-parallel GNB multiplier.
4.1 Data dependency graphs for parallel computing of the combined PA and PD operations on binary Edwards curves (a): $d_{1} \neq d_{2}$ and (b): $d_{1}=d_{2}$ assuming $\mathcal{M}=2$. It requires five registers of $T_{1}, T_{2}, T_{3}, T_{4}$, and $T_{5}$. The constant parameters, $c_{1}=\sqrt{d_{1}}, c_{2}=\sqrt{d_{2} / d_{1}+1}, c_{3}=\sqrt{c_{1}}$, and $c_{4}=\sqrt{c_{2}}$ are assumed to be precomputed and stored in the memory.
4.2 Data dependency graph for parallel computing of the combined PA and PD operations for $\mathcal{M}=2$ available multipliers on (a) generalized Hessian curves, assuming $c_{1}=d^{3}$, and $c_{2}=\frac{1}{\sqrt{d^{3}}}$ and (b) binary generic curves (BGCs) [8].
4.3 Architecture of the proposed elliptic curve crypto-processor for binary Edwards, generalized Hessian, and binary generic curves.
4.4 The pipelined architecture of the low-complexity type $T$ digit-level GNB multiplier with parallel-output [9].
4.5 Time-Area ratio of the presented pipelined low-complexity digit-level GNB multiplier for type 4 over $G F\left(2^{163}\right)$ for different digit sizes $d$.
4.6 Configuration of BRAMs for the proposed architecture.
4.7 Implementation results of point multiplication for binary Edwards, generalized Hessian, and binary generic curves reported in Tables 4.6, 4.7, and 4.8 on Xilinx® Virtex™-5 xc5vlx110-2ff1760 FPGA device. The points are related to digit sizes of $d=21,24,28,33,41,55,82$.
5.1 (a) Proposed structure for the hybrid multiplier. (b) Two digit-level multipliers with parallel output operating in two separate steps. (c) A hybrid multiplier operating in one step and composed of an improved DL-PISO and an improved LSD-first DL-SIPO multipliers.
5.2 Architectures for multiplexer based double-exponentiation. (a) with one multiplier (b) with incorporating the proposed hybrid multiplier.
5.3 Data dependency graph for fast computation of combined PA and PD for binary Edwards curves (a): employing four different PIPO multipliers. (b): employing proposed hybrid multiplier. $c_{1}=\sqrt{d_{1}}$, $c_{2}=\sqrt{d_{2} / d_{1}+1}, c_{3}=\sqrt{c_{1}}$, and $c_{4}=\sqrt{c_{2}}$.
5.4 Generalized Hessian curves with $c_{1}=d^{3}$, and $c_{2}=\frac{1}{\sqrt{d^{3}}}$, employing the proposed hybrid multiplier.
5.5 Parallel computation of point addition on Koblitz curves using Jacobian coordinates (a): with three finite field multipliers and (b): employing hybrid multiplier and three parallel multipliers.
6.1 Data dependency graph for parallel computation of point addition on Koblitz curves (a): using three finite field multipliers adopted from [10] (b): proposed scheme employing four multipliers.
6.2 The architecture of point multiplication crypto-processor.
6.3 (a): Latency of point computation on Koblitz curves over $\operatorname{GF}\left(2^{163}\right)$ for different digit sizes. (b): Latency-area product of the proposed architecture for point multiplication.

## List of Algorithms

2.1 Solving quadratic equation $X^{2}+X=A$ using normal basis [11].
2.2 Left-to-right Double-and-add point multiplication algorithm [11].
2.3 Lopez-Dahab Scalar Multiplication [12].
4.1 Montgomery's algorithm [13] for point multiplication using $w$-coordinates.
6.1 Point multiplication on Koblitz curves using Double-and-add-or-subtract algorithm [11].

## List of Abbreviations

| Abbreviation | Definition |
| :--- | :--- |
| ASIC | Application-Specific Integrated Circuit |
| FPGA | Field-Programmable Gate Array |
| CMOS | Complementary Metal-Oxide-Semiconductor |
| PA | Point Addition |
| PD | Point Doubling |
| PIPO | Parallel-in Parallel-out |
| SIPO | Serial-in Parallel-out |
| PISO | Parallel-in Serial-out |
| GNB | Gaussian Normal Basis |
| CPD | Critical Path Delay |
| ECC | Elliptic Curve Cryptography |
| GF | Galois Field |
| LSB | Least Significant Bit |
| LSD | Least Significant Digit |
| BEC | Binary Edwards Curve |
| BGC | Binary Generic Curve |
| BKC | Binary Koblitz Curve |
| GHC | Generalized Hessian Curve |
| MSB | Most Significant Bit |
| MSD | Most Significant Digit |
| NIST | National Institute of Standards and Technology |
| ECDLP | Elliptic Curve Discrete Logarithm Problem |
| ECDH | Elliptic Curve Diffie-Hellman |
| VHDL | Very-high-speed integrated circuit Hardware Description Language |
| VLSI | Very Large Scale Integration |
| KDC | Key Distribution Center |

## Chapter 1

## Introduction

The history of cryptography dates back about 2000 years, to the time of Julius Caesar, when two communicating parties were required to share a common secret, i.e., the symmetric key used for encryption and decryption. The main problem with this approach is that the two parties must somehow meet and agree on the common key. In 1976, Diffie and Hellman [14] demonstrated an algorithm for secure key exchange, which led to the development of today's public-key cryptosystems such as RSA [15]. Recent small and always-connected devices, such as mobile hand-held devices, RFID tags, near field communication (NFC) devices, smart cards, and wireless sensor nodes (WSNs), to name a few, require efficient and high-performance computation of cryptographic protocols. Traditional schemes such as RSA have been found to be infeasible for these devices, which has resulted in the adoption of a newer technology based on elliptic curves, called elliptic curve cryptography (ECC). ECC was proposed independently by Neal Koblitz [16] and Victor Miller [17] for public-key cryptography and has gained significant attention in recent research. The use of ECC has been identified as an efficient and suitable methodology for achieving public-key cryptography in embedded and resource-constrained environments, and it has been approved by the IEEE [18] and NIST [19] standards. The main advantage of ECC is that it offers a security level similar to that of RSA while employing a smaller key size, which allows efficient implementations on resource-constrained devices with limited storage, bandwidth, and silicon area. The security of ECC-based cryptosystems relies on the difficulty of solving the elliptic curve discrete logarithm problem (ECDLP) [19].

All these topics can be viewed as an applied science in the overlap between mathematics, computer science, and computer engineering.

### 1.1 Problem Statement and Motivation

Security in resource-constrained environments (such as smart cards, WSNs, hand-held devices, and RFID tags) and in high-performance web servers (used, for example, for secure e-commerce transactions and online banking) requires efficient cryptographic computations (such as ECC). The former applications are limited by the available silicon area, while the latter suffer from the low speed of current security protocols. Moreover, because of the increasing number of small devices connected to servers, efficient computation of cryptographic protocols is crucial.

Elliptic curves over finite fields can be defined over prime fields and binary extension fields. Several works in the literature consider implementations of ECC over both types of field. Depending on the application and the available resources, prime fields have typically been chosen for software implementations, whereas binary fields provide better performance in hardware implementations. Recently proposed schemes in the literature (for example, [20], [21], [10], [6], [22], [23], [24], [25], [26], and [27]) did not consider a systematic implementation of ECC over binary fields. For instance, they employed finite field multipliers available in the literature without considering their performance within the proposed crypto-processors. The hierarchy of ECC computations requires efficient computation at the lowest level, i.e., finite field arithmetic, followed by the curve and protocol levels. Therefore, a bottom-up approach to designing an ECC crypto-processor that targets specific applications is one of the most important tasks to explore.

Also, in some previous works, parallelism is regarded as the only method for reducing the latency of curve-level arithmetic computations and thereby increasing the speed of the overall point multiplication in ECC-based crypto-processors. However, because of the data dependencies between curve-level computations, parallelism is not applicable in several situations, such as point multiplication on binary Edwards curves and double-exponentiation for elliptic curve digital signature verification. These dependencies limit the speed-up that can be obtained by increasing the number of parallel processors.

### 1.2 Objectives of the Thesis

In this thesis, efficient and low-complexity ECC-based crypto-processors are proposed. A bottom-up approach is followed in designing the crypto-processors, starting with low-complexity finite field arithmetic units. This thesis not only considers the standard curves available in the literature but also describes efficient implementations of newly introduced complete binary elliptic curves, such as binary Edwards and generalized Hessian curves. The objectives of this thesis are to design high-performance and fast ECC-based crypto-processors for web servers, as well as low-complexity and efficient ones for small and hand-held devices, for different security levels and key sizes.

### 1.3 Thesis Outline

This thesis is organized as follows. In Chapter 2, we review some of the existing work on normal basis multiplication and elliptic curve cryptography.

In Chapter 3, we present low-complexity Gaussian normal basis multiplier architectures, including parallel-in parallel-out, parallel-in serial-out, and serial-in parallel-out ones. We also propose a low-complexity architecture for bit-parallel multiplication in this chapter.

In Chapter 4, we propose an efficient ECC-based crypto-processor on binary Edwards and generalized Hessian curves employing a parallel-in parallel-out digit-level GNB multiplier proposed in Chapter 3. The implementation results are provided and compared with the counterparts in the literature.

In Chapter 5, based on the low-complexity digit-level multiplier architectures proposed in Chapter 3, a new hybrid multiplier to perform double-multiplication is proposed. Also, in this chapter we evaluate the efficiency of the new hybrid multiplier and its application for reducing the latency of double-exponentiation and point multiplication on binary elliptic curves.

In Chapter 6, a highly parallel and fast ECC crypto-processor for point multiplication on Koblitz curves is presented. The implementation results are reported and compared with the leading ones in the literature.

Finally, in Chapter 7, we summarize our contributions and provide possible directions for future work.

## Chapter 2

## Preliminaries and Literature Review

In this chapter, we provide preliminaries and review previous work in the literature on the arithmetic of finite fields and on elliptic curve cryptography. The following discussion is based on the comprehensive presentations given in [28], [29], [11], and [30].

### 2.1 Finite Fields

Finite fields are usually referred to as Galois fields (in honor of Évariste Galois, 1811-1832, a French mathematician) and are important in many applications such as cryptography, network coding, and error control theory. Because of these applications, their implementations have been studied extensively by computer engineers and computer scientists. A finite field consists of a finite set of objects called field elements, together with the description of two operations (addition and multiplication) that can be performed on pairs of field elements. Finite field arithmetic plays an important role in ECC, and all the low-level operations are carried out in these fields. It is important to describe these fields in order to precisely specify cryptographic methods based on ECC.

A set $G$ and a binary operation $\star$ form a group $(G, \star)$ if they satisfy properties 1, 2, 4, and 5 below. If property 3 also holds, the group is called Abelian:

1. The operation $\star$ is closed (i.e., $a \star b \in G$ for all $a, b \in G$).
2. The operation $\star$ is associative (i.e., $a \star(b \star c)=(a \star b) \star c$ for all $a, b, c \in G$).
3. The operation $\star$ is commutative (i.e., $a \star b=b \star a$ for all $a, b \in G$). In this case, the group $(G, \star)$ is called Abelian.
4. There exists an identity element $e \in G$ such that $e \star a=a \star e=a$ for all $a \in G$.
5. For every $a \in G$, there exists an inverse element $b \in G$ such that $a \star b=e$.

A group $(G, \star)$ whose operation is multiplication $\times$ is known as a multiplicative group $(G, \times)$; its identity element is 1 and the inverse of $a$ is denoted by $a^{-1} \in G$. Similarly, for an additive group $(G,+)$, the identity element is 0 and the inverse of $a$ is $-a$. The order of the group, $\operatorname{ord}(G)$, is the number of elements in the set $G$. The group $G$ is finite if $\operatorname{ord}(G)$ is finite. The order of an element $a \in G$, i.e., $\operatorname{ord}(a)$, is the smallest positive integer $n$ for which $a^{n}=e$.

The group $G$ is cyclic if all of its elements can be generated by applying the group operation repeatedly to a single element $a$; such an element $a$ is called a generator of $G$.
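As a small illustration of element order and generators, the following Python sketch enumerates the multiplicative group of integers modulo 7. The modulus 7 and the helper name `element_order` are illustrative choices made here, not taken from this thesis.

```python
def element_order(a, p):
    """Return the smallest n >= 1 with a^n = 1 (mod p).

    Assumes p is prime and a is in {1, ..., p-1}, so that a belongs to the
    multiplicative group modulo p and the loop terminates.
    """
    n, x = 1, a % p
    while x != 1:
        x = (x * a) % p   # apply the group operation once more
        n += 1
    return n

p = 7                                  # illustrative prime modulus
group = list(range(1, p))              # elements of the multiplicative group
orders = {a: element_order(a, p) for a in group}

# A generator is an element whose order equals the order of the group.
generators = [a for a in group if orders[a] == len(group)]

print(orders)
print(generators)
```

In this example the group has order 6, every element order divides 6, and the elements whose order is exactly 6 generate the whole group.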

A field $\mathbb{F}$ is a set of elements with two binary operators, denoted as + (addition) and $\times$ (multiplication) which exhibits the following properties:

1. $\mathbb{F}$ is an abelian group under the addition + operation.
2. The non-zero elements of $\mathbb{F}$ form an abelian group under the operation $\times$.
3. The operation $\times$ is distributive over the operation $+$, i.e., $a \times(b+c)=(a \times b)+(a \times c)$ and $(b+c) \times a=(b \times a)+(c \times a)$ for all $a, b, c \in \mathbb{F}$.

A field $\mathbb{F}$ with $q$ elements is said to be finite if $q$ is finite and is denoted by $\mathbb{F}_{q}$ which is also referred to as Galois field as $G F(q)$. The order of $\mathbb{F}_{q}$ is the number of elements in $\mathbb{F}_{q}$, and $\mathbb{F}_{q}$ exists if and only if $q$ is prime or a power of a prime, i.e., $q=p^{m}$ for $m \geq 1$. Then, for $m=1$ it is called a prime field and for $m \geq 2$ it is called an extension field. Extension fields with $p=2$, i.e., $\mathbb{F}_{2^{m}}$ or $G F\left(2^{m}\right)$ are called binary fields (or fields with characteristic two) which can be seen as a vector space of dimension $m$ over the field $\mathbb{F}_{2}$ which has only 0 and 1.

As defined above, $\mathbb{F}_{q}$ has two main operations, i.e., addition and multiplication. Subtraction and division can be defined through addition (i.e., $a-b=a+(-b)$, where $b+(-b)=0$) and multiplication (i.e., $a / b=a \times b^{-1}$, where $b \times b^{-1}=1$ and $b \in \mathbb{F}_{q}-\{0\}$), respectively.

Definition 2.1. An element $\alpha$ in a finite field $\mathbb{F}_{q}$ is called a primitive element (or generator) of $\mathbb{F}_{q}$ if $\mathbb{F}_{q}=\left\{0, \alpha, \alpha^{2}, \ldots, \alpha^{q-1}\right\}$.

Definition 2.2. The order of a non-zero element $\alpha \in \mathbb{F}_{q}$ denoted by ord $(\alpha)$, is the smallest positive integer $k$ such that $\alpha^{k}=1$.

Definition 2.3. The non-zero elements in $\mathbb{F}_{q}$ form a multiplicative group of $\mathbb{F}_{q}$ denoted by $\mathbb{F}_{q}^{\star}$ which is cyclic with $\operatorname{ord}\left(\mathbb{F}_{q}^{\star}\right)=q-1$. Hence

$$
\begin{equation*}
a^{q}=a, \tag{2.1}
\end{equation*}
$$

for all $a \in \mathbb{F}_{q}$. This is also known as Fermat's Little Theorem as $a^{p} \equiv a(\bmod p)$.
Then, for $\mathbb{F}_{2^{m}}$ the order of a multiplicative group is $2^{m}-1$ and for an element $A \in \mathbb{F}_{2^{m}}$ one has $A^{2^{m}-1}=1$. In this thesis, we use $G F\left(2^{m}\right)$ to indicate binary Galois fields instead of $\mathbb{F}_{2^{m}}$.

### 2.2 Binary Fields Arithmetic

The binary field $G F\left(2^{m}\right)$, of characteristic two, is a finite field [30] that contains $2^{m}$ different elements. The elements of $G F\left(2^{m}\right)$ are represented as vectors over $G F(2)$, which contains 0 and 1, with respect to a basis. As the two elements of $G F(2)$ can be represented with a bit, $m$ bits are required to represent elements of $G F\left(2^{m}\right)$. The binary field $G F\left(2^{m}\right)$ is associated with an irreducible polynomial (i.e., one that cannot be represented as a product of two polynomials with positive degrees) $F(z)$, with $\operatorname{deg}(F(z))=m$ over $G F(2)$, i.e.,

$$
\begin{equation*}
F(z)=f_{m} z^{m}+f_{m-1} z^{m-1}+\cdots+f_{1} z+f_{0}, f_{i} \in G F(2) \tag{2.2}
\end{equation*}
$$

If $f_{m}=1$, then $\operatorname{deg}[F(z)]=m$. Addition of two elements in $G F\left(2^{m}\right)$ is simply a bit-wise XOR (modulo 2) operation, but multiplication depends on the field basis and on the dependencies between the field elements. From an implementation point of view, binary fields are faster than prime fields as they provide carry-free operations. The field elements can be represented using a polynomial (or standard) basis, normal basis, dual basis, or redundant basis. However, polynomial and normal bases are the two common types of bases that have been used in conventional hardware and software applications and that are approved and recommended by international standards such as IEEE and NIST. In the following, we briefly review the polynomial basis and explain the normal basis in detail as it is used in this thesis.

### 2.2.1 Polynomial Basis

Let $\alpha \in G F\left(2^{m}\right)$ be a root of the irreducible polynomial $F(z)$, i.e., $F(\alpha)=0$. Then the set $\left\{1, \alpha, \alpha^{2}, \cdots, \alpha^{m-1}\right\}$ is known as the polynomial basis and an element $A \in G F\left(2^{m}\right)$ can be represented as a linear combination of this set, i.e., as a polynomial of degree at most $m-1$ over $G F(2)$, as $A=\sum_{i=0}^{m-1} a_{i} \alpha^{i}$, where $a_{i} \in G F(2)$. For simplicity, a bit-vector representation is commonly used so that $A=\left(a_{m-1}, a_{m-2}, \cdots, a_{1}, a_{0}\right)$, where $a_{m-1}$ and $a_{0}$ are the most significant bit (MSB) and least significant bit (LSB), respectively. In polynomial basis the identity element of addition, i.e., 0 , is $(0,0, \cdots, 0,0)$ and the identity element of multiplication, i.e., 1 , is $(0,0, \cdots, 0,1)$.

Addition of two field elements, say, $A=\left(a_{m-1}, \cdots, a_{1}, a_{0}\right)$ and $B=\left(b_{m-1}, \cdots, b_{1}, b_{0}\right)$ in $G F\left(2^{m}\right)$ represented by polynomial basis is $C=A+B$ and can be obtained by pair-wise addition of the coordinates of $A$ and $B$ over $G F(2)$ (i.e., modulo 2 addition) as $c_{i}=a_{i} \oplus b_{i}$. Multiplication of two field elements $A, B \in G F\left(2^{m}\right)$ is more involved. First, $A$ and $B$ are multiplied using ordinary polynomial multiplication, and then the intermediate product is reduced by $F(z)$, i.e., $A \cdot B \bmod F(z)$. Squaring in polynomial basis is also more involved, and its complexity depends on the irreducible polynomial $F(z)$ [31, 32, 33, 34].
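As a minimal sketch of the two-step multiplication described above (polynomial product followed by reduction), the following Python fragment works in the toy field $G F\left(2^{4}\right)$. The irreducible polynomial $F(z)=z^{4}+z+1$, the packing of coordinates into an integer (bit $i$ holds $a_{i}$), and the names `pb_add` and `pb_mul` are assumptions made for illustration only; they are not taken from this thesis.

```python
# Toy polynomial-basis arithmetic in GF(2^4).
# Assumption: F(z) = z^4 + z + 1 is used as the irreducible polynomial.
M = 4
F_POLY = 0b10011  # bit i is the coefficient f_i of z^i


def pb_add(a, b):
    """Coordinate-wise addition over GF(2), i.e. c_i = a_i XOR b_i."""
    return a ^ b


def pb_mul(a, b, f=F_POLY, m=M):
    """Multiply a and b as polynomials over GF(2), then reduce modulo F(z)."""
    # Step 1: ordinary (carry-free) polynomial multiplication.
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    # Step 2: cancel every term of degree >= m using F(z).
    for deg in range(2 * m - 2, m - 1, -1):
        if (prod >> deg) & 1:
            prod ^= f << (deg - m)
    return prod


if __name__ == "__main__":
    # alpha^3 * alpha = alpha^4 = alpha + 1 for this choice of F(z)
    print(bin(pb_mul(0b1000, 0b0010)))
```

With this representation, repeated squaring can be used to check (2.1) exhaustively for the toy field, which is a convenient sanity test before moving to cryptographic field sizes.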

### 2.2.2 Normal Basis

It is shown that there exists a normal basis for the binary extension field $G F\left(2^{m}\right)$ for all positive integers $m$. The normal basis is constructed by finding a normal element $\beta \in G F\left(2^{m}\right)$, where $\beta$ is a root of an irreducible polynomial of degree $m$. Then set $N=\left\{\beta, \beta^{2}, \cdots, \beta^{2^{m-1}}\right\}$ is a basis for $G F\left(2^{m}\right)$ and its elements are linearly independent. In this case, $A \in G F\left(2^{m}\right)$, can be represented as $A=\sum_{i=0}^{m-1} a_{i} \beta^{2^{i}}$, where $a_{i} \in G F(2)$. The identity element of addition, i.e., 0 , is $(0,0, \cdots, 0,0)$ and the identity element of multiplication, i.e., 1 , is $(1,1, \cdots, 1,1)$ as $1=\beta+\beta^{2}+\beta^{2^{2}}+\cdots+\beta^{2^{m-1}}$.

Normal basis is attractive mainly because it provides efficient computation for squaring. For an element, say, $A \in G F\left(2^{m}\right)$ its power of two can be written as $A^{2}=\sum_{i=0}^{m-1} a_{i} \beta^{2^{i+1}}$ and one can get $\beta^{2^{m}}=\beta$ from (2.1). Then, squaring is a linear operation and for $A=\left(a_{0}, a_{1}, \cdots, a_{m-1}\right) \in G F\left(2^{m}\right)$ one can obtain it by a right cyclic shift operation as $A^{2}=\left(a_{m-1}, a_{0}, a_{1}, \cdots, a_{m-2}\right)$. Similar to the polynomial basis the addition can be obtained by bit-wise XOR operation for two given elements $A$ and $B$ as $A+B=\sum_{i=0}^{m-1}\left(a_{i} \oplus b_{i}\right) \beta^{2^{i}}$.
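The two normal-basis operations described above can be sketched in a few lines of Python. The tuple representation $(a_{0}, a_{1}, \cdots, a_{m-1})$ and the names `nb_square` and `nb_add` are illustrative assumptions.

```python
def nb_square(a):
    """Squaring in normal basis: right cyclic shift of the coordinates,
    (a_0, ..., a_{m-1}) -> (a_{m-1}, a_0, ..., a_{m-2})."""
    return a[-1:] + a[:-1]


def nb_add(a, b):
    """Addition in normal basis: coordinate-wise XOR."""
    return tuple(x ^ y for x, y in zip(a, b))


if __name__ == "__main__":
    a = (1, 0, 0, 0, 1)
    print(nb_square(a))  # the last coordinate moves to the front
```

Applying `nb_square` $m$ times returns the original element, which mirrors $A^{2^{m}}=A$ from (2.1).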

### 2.2.3 Finite Field Multiplication

Among finite field representations, normal basis is more efficient in hardware implementations since squaring of a field element over $G F\left(2^{m}\right)$ can be performed by a simple cyclic shift. This makes normal basis more attractive for cryptosystems that utilize frequent squarings (e.g., point multiplication on Koblitz curves and exponentiation-based cryptosystems).

### 2.2.3.1 Multiplication Using Normal Basis

Let $A=\left(a_{0}, a_{1}, \cdots, a_{m-1}\right)=\sum_{i=0}^{m-1} a_{i} \beta^{2^{i}}$ and $B=\left(b_{0}, b_{1}, \cdots, b_{m-1}\right)=\sum_{j=0}^{m-1} b_{j} \beta^{2^{j}}$ be two field elements in $G F\left(2^{m}\right)$. Let $C \in G F\left(2^{m}\right)$ be their product, i.e., $C=$ $\left(c_{0}, c_{1}, \cdots, c_{m-1}\right)=A B=\sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_{i} b_{j} \beta^{2^{i}+2^{j}}$. Let us represent the field element $\beta^{2^{i}+2^{j}} \in G F\left(2^{m}\right), 0 \leq i, j \leq m-1$, with respect to $N=\left\{\beta, \beta^{2}, \cdots, \beta^{2^{m-1}}\right\}$ as $\beta^{2^{i}+2^{j}}=\sum_{l=0}^{m-1} \mu_{i, j}^{(l)} \beta^{2^{l}}$. Then, one can find $C$ as

$$
\begin{equation*}
C=\sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_{i} b_{j} \sum_{l=0}^{m-1} \mu_{i, j}^{(l)} \beta^{2^{l}}=\sum_{l=0}^{m-1} \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_{i} b_{j} \mu_{i, j}^{(l)} \beta^{2^{l}} . \tag{2.3}
\end{equation*}
$$

By representing $C$ with respect to $N$, i.e., $C=\sum_{l=0}^{m-1} c_{l} \beta^{2^{l}}$, and equating with (2.3), the $l$-th coordinate of $C$ can be written as $c_{l}=\sum_{i=0}^{m-1} \sum_{j=0}^{m-1} a_{i} b_{j} \mu_{i, j}^{(l)}$. Then, it can be written in a matrix form as

$$
\begin{equation*}
c_{l}=\underline{a} \mathbf{M}^{(l)} \underline{b}^{t r}, 0 \leq l \leq m-1, \tag{2.4}
\end{equation*}
$$

where $\mathbf{M}^{(l)}=\left[\mu_{i, j}^{(l)}\right]_{i, j=0}^{m-1}, \mu_{i, j}^{(l)} \in G F(2), 0 \leq i, j \leq m-1, \underline{a}=\left[a_{0}, a_{1}, \cdots, a_{m-1}\right]$ and $\underline{b}^{t r}$ denotes the matrix transpose of row vector $\underline{b}=\left[b_{0}, b_{1}, \cdots, b_{m-1}\right]$. In (2.4), $\mathbf{M}^{(l)}$ is obtained from the $l$-fold right and down circular shifts of the multiplication matrix $\mathbf{M}=\mathbf{M}^{(0)}$. The computation of the entries of $\mathbf{M}$ can be found in [18]. Massey and Omura in [35] have proposed a bit-level PISO multiplier by implementing (2.4) for one coordinate, say $c_{0}=\underline{a} \mathbf{M} \underline{b}^{t r}=F(A, B)$. Then, the $l$-th coordinate of $C$ is obtained by left cyclic shifts of the coordinates of $A$ and $B$, i.e., $c_{l}=F(A \ll l, B \ll l)$ [35]. The number of ones, $C_{N}, 2 m-1 \leq C_{N} \leq m^{2}$, in $\mathbf{M}$ defines the complexity of the multiplication. It is well known that for $C_{N}=2 m-1$, the normal basis is called an optimal normal basis (ONB) [36]. There are two types of ONBs, referred to as Type I and Type II ONBs. It should be noted that an ONB does not exist for all $m$, for example $m=163$. As an extension of the work on ONBs, a class of low-complexity normal bases of type $T$ was proposed by Ash et al. [37], which are referred to as Gaussian normal bases (GNBs). For $T=1$ and 2, the GNBs become the two types of ONBs of [36], and in general $C_{N} \leq T m-T+1$. In the following subsection, we discuss multiplication using GNB in more detail, as it is the basis employed in this thesis and is included in international standards such as [18] and [19].

### 2.2.3.2 Multiplication Using Gaussian Normal Basis

GNB has been constructed by Ash et al. [37] and is a special class of normal basis which is included in the IEEE 1363 [18] and NIST [19] standards and exists for every $m>1$ that is not divisible by eight [29].

Definition 2.4. [29] Let $p=m T+1$ be a prime number and $\operatorname{gcd}(m T / k, m)=1$, where $k$ is the multiplicative order of 2 modulo $p$. Then, the normal basis $N=\left\{\beta, \beta^{2}, \cdots, \beta^{2^{m-1}}\right\}$ over $G F\left(2^{m}\right)$ is called the Gaussian normal basis (GNB) of type $T, T>1$.

The complexities of a type $T$ GNB multiplier in terms of time and area depend on $T>1$. In this thesis, we only consider the GNBs with odd values of $m$, which implies that $T$ is an even number. Such GNBs cover all five binary fields, i.e., $m \in\{163,233,283,409,571\}$, recommended by the IEEE 1363 [18] and NIST [19] standards for ECDSA. The corresponding types for these fields are $T=4,2,6,4$, and 10, respectively.
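The conditions of Definition 2.4 are easy to test numerically. The sketch below searches for the smallest type $T$ that satisfies them for a given $m$; the function names, the trial-division primality test, and the search bound `t_max` are illustrative assumptions and not part of the standards.

```python
from math import gcd


def is_prime(n):
    """Trial division; adequate for the small primes p = mT + 1 used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def order_of_two(p):
    """Multiplicative order k of 2 modulo an odd prime p."""
    k, x = 1, 2 % p
    while x != 1:
        x = 2 * x % p
        k += 1
    return k


def gnb_type(m, t_max=20):
    """Smallest T <= t_max with p = mT + 1 prime and gcd(mT/k, m) = 1."""
    for t in range(1, t_max + 1):
        p = m * t + 1
        # For odd m and odd T, p is even and is rejected by is_prime.
        if is_prime(p) and gcd(m * t // order_of_two(p), m) == 1:
            return t
    return None


if __name__ == "__main__":
    for m in (163, 233, 283, 409, 571):
        print(m, gnb_type(m))
```

For the five field sizes listed above, this search returns the types quoted in the text.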

Let $A=\left(a_{0}, a_{1}, \cdots, a_{m-1}\right)=\sum_{i=0}^{m-1} a_{i} \beta^{2^{i}}$ and $B=\left(b_{0}, b_{1}, \cdots, b_{m-1}\right)=\sum_{j=0}^{m-1} b_{j} \beta^{2^{j}}$ be two field elements over $G F\left(2^{m}\right)$ and let $C \in G F\left(2^{m}\right)$ be their product, i.e., $C=\left(c_{0}, c_{1}, \cdots, c_{m-1}\right)=A B$. Then, the first coordinate of $C$, i.e., $c_{0}$, can be obtained from an explicit formula given in [18] as follows

$$
\begin{align*}
c_{0} & =a_{0} b_{1}+\sum_{k=2}^{p-2} a_{F(k)} b_{F(k+1)}, \\
& =a_{0} b_{1}+\sum_{i=1}^{m-1} a_{i}\left(\sum_{F(k)=i} b_{F(k+1)}\right), 2 \leq k \leq p-2, \tag{2.5}
\end{align*}
$$

where in (2.5), the sequence $F(1), F(2), \cdots, F(p-1)$ can be obtained by precomputation using

$$
\begin{equation*}
F(k)=F\left(2^{i} u^{j} \bmod p\right)=i, 0 \leq i \leq m-1,0 \leq j<T, \tag{2.6}
\end{equation*}
$$

where $u$ is an integer of order $T \bmod p$ and $p=T m+1$ [18]. In Table 2.1, the sequence of $F$ for type 4 GNB over $G F\left(2^{7}\right)$ is given. It is noted that for each $i, 1 \leq i \leq m-1$, the values $F(k+1), 2 \leq k \leq p-2$, in (2.5) can be used as the entries of an $(m-1) \times T$ matrix $\mathbf{R}$. Let us denote the $(i, j)$-th element of this matrix as $R(i, j), 0 \leq R(i, j) \leq m-1$, $1 \leq i \leq m-1,1 \leq j \leq T$. Each row of the matrix $\mathbf{R}$ contains $T$ integer entries in $[0, m-1]$.

Table 2.1: The Sequence of $F$ for type 4 GNB over $G F\left(2^{7}\right)$

| $k$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $F(k)$ | 0 | 1 | 5 | 2 | 1 | 6 | 5 | 3 | 3 | 2 | 4 | 0 | 4 | 6 |
| $k$ | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| $F(k)$ | 6 | 4 | 0 | 4 | 2 | 3 | 3 | 5 | 6 | 1 | 2 | 5 | 1 | 0 |

Then, one can write $c_{0}$ as [5]

$$
\begin{equation*}
c_{0}=a_{0} b_{1}+\sum_{i=1}^{m-1} a_{i}\left(\sum_{j=1}^{T} b_{R(i, j)}\right) . \tag{2.7}
\end{equation*}
$$
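The precomputation in (2.6) and the construction of $\mathbf{R}$ from (2.5) can be sketched as follows. The function names are hypothetical, and picking the smallest element $u$ of order $T$ is an arbitrary choice made here (every element of order $T$ generates the same subgroup and therefore the same sequence). The index $i$ is taken over $0, \cdots, m-1$ so that all $p-1$ values of $k$ are covered.

```python
def multiplicative_order(a, p):
    """Order of a modulo the prime p."""
    k, x = 1, a % p
    while x != 1:
        x = x * a % p
        k += 1
    return k


def gnb_sequence(m, t):
    """Return F as a list with F[k] = i whenever k = 2^i * u^j mod p; F[0] is unused."""
    p = m * t + 1
    u = next(x for x in range(1, p) if multiplicative_order(x, p) == t)
    f = [None] * p
    w = 1
    for _ in range(t):          # w runs over u^0, u^1, ..., u^(T-1)
        n = w
        for i in range(m):      # n runs over 2^i * u^j mod p
            f[n] = i
            n = 2 * n % p
        w = u * w % p
    assert all(v is not None for v in f[1:]), "no type-T GNB for this m"
    return f


def r_matrix(m, t):
    """Row i (1 <= i <= m-1) lists F(k+1) for every k in [1, p-2] with F(k) = i."""
    p = m * t + 1
    f = gnb_sequence(m, t)
    return {i: sorted(f[k + 1] for k in range(1, p - 1) if f[k] == i)
            for i in range(1, m)}


if __name__ == "__main__":
    print(gnb_sequence(5, 2)[1:])
    print(r_matrix(5, 2))
```

Calling `gnb_sequence(5, 2)` regenerates the sequence for type 2 GNB over $G F\left(2^{5}\right)$ that is used in the worked example of this chapter, and `r_matrix(5, 2)` gives the rows of the corresponding matrix $\mathbf{R}$.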

Note that, to obtain the $l$-th coordinate of $C$, i.e., $c_{l}$, one needs to add $l$ (modulo $m$) to all indices in (2.7). Therefore, one can find all coordinates of $C$ as follows:

Lemma 2.1. [5] The product of $A$ and $B$ in $G F\left(2^{m}\right)$ is

$$
\begin{equation*}
C=(A \odot(B \ll 1)) \oplus \sum_{i=1}^{m-1}(A \ll i) \odot S(i, B) \tag{2.8}
\end{equation*}
$$

where

$$
\begin{equation*}
S(i, B)=((B \ll R(i, 1)) \oplus(B \ll R(i, 2)) \oplus \cdots \oplus(B \ll R(i, T))), 1 \leq i \leq m-1 \tag{2.9}
\end{equation*}
$$

and $(X \ll i)$ is the $i$-fold left cyclic shift of $X \in G F\left(2^{m}\right)$ and $X \odot Y=\left(x_{0} y_{0}, \cdots, x_{m-1} y_{m-1}\right)$ and $X \oplus Y=\left(x_{0}+y_{0}, \cdots, x_{m-1}+y_{m-1}\right)$ denote bit-wise AND and XOR operations between coordinates of $X$ and $Y$, respectively.

Remark 2.1. From (2.6) one can realize that for $T>2$ there are situations (for example $F(k)=\frac{m-1}{2}$ and $F(k)=\frac{m+1}{2}$ for $T=4$ ) where matrix $\mathbf{R}$ contains (two) equal entries.
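Lemma 2.1 maps directly onto cyclic shifts and bit-wise AND/XOR operations. The sketch below evaluates (2.8) and (2.9) for type 2 GNB over $G F\left(2^{5}\right)$, with the matrix $\mathbf{R}$ copied from the worked example given later in this chapter. The helper names and the tuple representation $(x_{0}, \cdots, x_{m-1})$ are assumptions for illustration.

```python
# Rows 1..4 of the R matrix for type 2 GNB over GF(2^5) (from the worked example).
R = {1: (0, 3), 2: (3, 4), 3: (1, 2), 4: (2, 4)}


def bits(s):
    """'01110' -> (0, 1, 1, 1, 0), i.e. (x_0, ..., x_{m-1})."""
    return tuple(int(ch) for ch in s)


def rotl(x, i):
    """i-fold left cyclic shift: coordinate l of the result is x_{l+i}."""
    i %= len(x)
    return x[i:] + x[:i]


def band(x, y):
    return tuple(p & q for p, q in zip(x, y))


def bxor(x, y):
    return tuple(p ^ q for p, q in zip(x, y))


def gnb_multiply(a, b, r=R):
    """C = (A AND (B << 1)) XOR sum_i (A << i) AND S(i, B), as in (2.8)."""
    m = len(a)
    c = band(a, rotl(b, 1))
    for i in range(1, m):
        s = (0,) * m
        for shift in r[i]:                 # S(i, B) from (2.9)
            s = bxor(s, rotl(b, shift))
        c = bxor(c, band(rotl(a, i), s))
    return c


if __name__ == "__main__":
    print(gnb_multiply(bits("01110"), bits("10101")))
```

A useful cross-check of such a model is that multiplying an element by itself must agree with the right cyclic shift used for squaring in normal basis.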

### 2.2.3.3 Inversion

Inversion, i.e., for a given non-zero element $A \in G F\left(2^{m}\right)$ finding an element $A^{-1} \in G F\left(2^{m}\right)$ such that $A \cdot A^{-1}=1$, is considered an expensive operation. It is commonly required in cryptographic applications of finite fields and its efficient implementation is important. There are two ways to compute inversion over finite fields: the extended Euclidean algorithm and Fermat's Little Theorem [38]. The inversion based on Fermat's Little Theorem uses consecutive squarings and multiplications and is more suitable when field elements are represented by normal basis. Based on Definition 2.3, it follows that $A^{2^{m}-2}=A^{-1}$ and its computation (i.e., exponentiation) requires $m-1$ squarings and $m-2$ multiplications as $2^{m}-2=(11, \cdots, 110)_{2}$. However, Itoh and Tsujii [38] proposed an efficient algorithm which reduces the number of multiplications to $\left\lfloor\log _{2}(m-1)\right\rfloor+H(m-1)-1$, where $H(m-1)$ represents the Hamming weight of ( $m-1$ ).
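The multiplication count quoted above can be checked without any field arithmetic by tracking only the exponent of $A$. The sketch below follows the binary expansion of $m-1$: each step either doubles $k$ in $A^{2^{k}-1}$ or increments it by one, and each such step costs one multiplication. This is one common way to organise the Itoh-Tsujii chain and is an assumption here; the function name is hypothetical.

```python
def itoh_tsujii_exponent(m):
    """Return (final exponent of A, number of multiplications) for GF(2^m)."""
    n = m - 1
    k = 1          # the running value is A^(2^k - 1)
    e = 1          # e = 2^k - 1
    mults = 0
    for bit in bin(n)[3:]:          # bits of m-1 after the leading one
        # Doubling: (A^e)^(2^k) * A^e, exponent (2^k - 1)(2^k + 1) = 2^(2k) - 1.
        e = e * (1 << k) + e
        k *= 2
        mults += 1
        if bit == "1":
            # Add one: (A^e)^2 * A, exponent 2(2^k - 1) + 1 = 2^(k+1) - 1.
            e = 2 * e + 1
            k += 1
            mults += 1
    return 2 * e, mults             # one final squaring gives 2^m - 2


if __name__ == "__main__":
    for m in (163, 233, 283, 409, 571):
        exponent, mults = itoh_tsujii_exponent(m)
        print(m, mults, exponent == 2**m - 2)
```

For comparison, the plain square-and-multiply evaluation of $A^{2^{m}-2}$ needs $m-2$ multiplications, e.g., 161 for $m=163$.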

### 2.2.3.4 Trace and Quadratic Equation Solution

The trace function $\operatorname{Tr}: G F\left(2^{m}\right) \rightarrow G F(2)$ is a linear map and for an element $A=$ $\left(a_{0}, a_{1}, \cdots, a_{m-1}\right) \in G F\left(2^{m}\right)$ is defined as $\operatorname{Tr}(A)=\sum_{i=0}^{m-1} A^{2^{i}} \in\{0,1\}$. For normal basis, when $m$ is odd trace of element $A$ can be computed as $\operatorname{Tr}(A)=\sum_{i=0}^{m-1} a_{i}$, which is bit-wise XOR operation of all bits of vector $A$.

The quadratic equation $X^{2}+X=A$ for $X=\left(x_{0}, x_{1}, \cdots, x_{m-1}\right) \in G F\left(2^{m}\right)$ has a solution if and only if $\operatorname{Tr}(A)=0$, and if $X$ is a solution, then $X+1$ is also a solution. In normal basis the solution can be found bit-wise. However, in polynomial basis it is more involved and needs half-trace computations which require $m-1$ squarings and $(m-1) / 2$ additions [11]. In Algorithm 2.1, an efficient algorithm to solve the quadratic equation using normal basis is presented. The cost of solving the quadratic equation using normal basis is only $m-2$ additions.

Example 2.1. Let $A=\beta+\beta^{16}=(10001)$ be an element of the finite field $G F\left(2^{5}\right)$ for type 2 GNB. Then, the solutions of the quadratic equation $X^{2}+X=A$ can be obtained using Algorithm 2.1. First, we check that $\operatorname{Tr}(A)=\sum_{i=0}^{4} a_{i}=1+0+0+0+1=0$. Then, $X$ can be obtained bit-wise as $x_{0}=1, x_{1}=1, x_{2}=1, x_{3}=1$, and $x_{4}=0$, so $X=(11110)$. Also, $X+1$ is a solution, i.e., $X+1=(11110)+(11111)=(00001)$. These two solutions satisfy the quadratic equation. As seen, the cost of solving this equation is only 3 modulo-two additions (i.e., XOR operations).

In Chapter 4, we employ this algorithm to solve a quadratic equation for recovering the final point of the point multiplication algorithm.
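Algorithm 2.1 and Example 2.1 can be reproduced with the following Python sketch. It assumes the normal-basis conventions used above (trace as the XOR of all coordinates, squaring as a right cyclic shift); the function names are illustrative.

```python
def nb_trace(a):
    """Trace in normal basis: XOR of all coordinates of A."""
    t = 0
    for bit in a:
        t ^= bit
    return t


def nb_square(a):
    """Squaring in normal basis: right cyclic shift."""
    return a[-1:] + a[:-1]


def solve_quadratic(a):
    """Solve X^2 + X = A following Algorithm 2.1; returns None if Tr(A) != 0."""
    if nb_trace(a) != 0:
        return None
    m = len(a)
    x = [0] * m
    x[0] = a[0]                      # Step 1
    for i in range(1, m - 1):        # Step 2: m - 2 XOR operations
        x[i] = a[i] ^ x[i - 1]
    x[m - 1] = 0                     # Step 3
    return tuple(x)


if __name__ == "__main__":
    a = (1, 0, 0, 0, 1)
    x = solve_quadratic(a)
    x_plus_one = tuple(bit ^ 1 for bit in x)   # 1 = (1, 1, ..., 1) in normal basis
    print(x, x_plus_one)
```

Substituting either returned vector back into $X^{2}+X$ (a shift followed by an XOR) gives $A$ again, which is a cheap self-check for an implementation.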

Algorithm 2.1 Solving quadratic equation $X^{2}+X=A$ using normal basis [11].

```
Input: \(A=\left(a_{0}, a_{1}, \cdots, a_{m-1}\right) \in G F\left(2^{m}\right)\).
Output: \(X=\left(x_{0}, x_{1}, \cdots, x_{m-1}\right) \in G F\left(2^{m}\right)\).
Step 1: \(\quad x_{0} \leftarrow a_{0}\).
Step 2: For \(i\) from 1 to \(m-2\) do
\(x_{i} \leftarrow a_{i} \oplus x_{i-1}\).
end for
Step 3: \(\quad x_{m-1} \leftarrow 0\).
Step 4: Return \(X\).
```

### 2.2.4 Multiplier Architectures

The implementation of finite field multipliers using normal basis, and more specifically GNB, can be categorized, in terms of their structures, into three groups: (i) bit-level, which includes parallel-in serial-out (PISO) [35], serial-in parallel-out (SIPO) [39], [4], [40], and parallel-in parallel-out (PIPO) [41], [42]; (ii) digit-level, which includes parallel-in serial-out (PISO) [43], parallel-in parallel-out (PIPO) [44], [5], [45], and serial-in parallel-out (SIPO) [46]; and (iii) bit-parallel, which includes the multipliers of [47], [48], [49], and [50].

### 2.2.4.1 Bit-Level NB Multiplication

Bit-level multipliers provide the lowest possible area complexity. The first bit-level normal basis multiplier was invented by Massey and Omura [35]; in it, all coordinates of both input operands should be present during the multiplication operation. It is also known as a sequential multiplier with serial output in the literature [43]. Bit-level SIPO multipliers have been studied for normal basis and two different structures, namely Least Significant Bit (LSB) first and Most Significant Bit (MSB) first structures, have been proposed by Beth and Gollmann in [4]. A PIPO version of their multiplier is also presented in [41] and its time and area complexities are derived.

Based on the way the input bits are processed and the output bits are produced, there are four kinds of bit-level normal basis multipliers. They are called the LSB-first and the MSB-first bit-level SIPO multipliers [31] and the LSB-first and the MSB-first PISO normal basis multipliers [35].

## LSB-first bit-level SIPO normal basis multiplier

In an LSB-first bit-level multiplication, all coordinates of one operand, say $B$, are present, while the other operand, i.e., $A$, is processed from its LSB, i.e., $a_{0}$, one bit per clock cycle. In [4], Beth and Gollmann presented an architecture for bit-level multiplication using normal basis. The key formulation of this multiplier is presented below.

Lemma 2.2. [4] Let $A$ and $B$ be two elements of $G F\left(2^{m}\right)$ and let $C$ be their product, i.e., $C=A B$, expressed as

$$
\begin{align*}
C & =\sum_{i=0}^{m-1}\left(a_{i} \beta^{2^{i}}\right) B=\sum_{i=0}^{m-1}\left(a_{i} \cdot \beta B^{2^{-i}}\right)^{2^{i}} \\
& =a_{0} \beta B+a_{1}\left(\beta B^{2^{-1}}\right)^{2}+\cdots+a_{m-1}\left(\beta B^{2^{-(m-1)}}\right)^{2^{m-1}}, \tag{2.10}
\end{align*}
$$

then similar to Horner's rule one can obtain

$$
C=\left(\left(\cdots\left(\left(a_{0} \beta B\right)^{2^{-1}}+a_{1} \beta B^{2^{-1}}\right)^{2^{-1}}+\cdots\right)^{2^{-1}}+a_{m-1} \beta B^{2^{-(m-1)}}\right)^{2^{-1}}
$$

Let us denote $P(B)=\beta B \in G F\left(2^{m}\right)$ as a field element in GNB. In [5], $P(B)$ can be obtained for GNB multiplier based on the $\mathbf{R}$ matrix as

$$
\begin{equation*}
P(B)=\left(b_{1}, s_{0}(1, B), s_{0}(2, B), \cdots, s_{0}(m-1, B)\right), \tag{2.11}
\end{equation*}
$$

where $s_{0}(i, B)=\sum_{j=1}^{T} b_{R(i, j)} \in\{0,1\}, 1 \leq i \leq m-1$. Then, using (2.11) and Lemma 2.2, we can state the following.

Corollary 2.1. For GNB, the product of $A=\left(a_{0}, a_{1}, \cdots, a_{m-1}\right) \in G F\left(2^{m}\right)$, given in bit-serial fashion, and $B \in G F\left(2^{m}\right)$ can be written as

$$
\begin{equation*}
C=\left(\left(\cdots\left(\left(a_{0} P(B)\right) \ll 1+a_{1} P(B \ll 1)\right) \ll 1+\cdots\right) \ll 1+a_{m-1} P(B \ll m-1)\right) \ll 1, \tag{2.12}
\end{equation*}
$$

where "$\ll$" denotes a left cyclic shift.

Equation (2.12) can be realized by the architecture depicted in Fig. 2.1a. The implementation of $P(B) \in G F\left(2^{m}\right)$ given in (2.11) is performed by a $P$ module shown in Fig. 2.1c for type $T$ GNB. The product $a_{i} P(B)$ in Fig. 2.1a denotes a bit-wise AND operation between $a_{i}$ and the elements of $P(B)$ and is performed using $m$ 2-input AND gates. Also, the sum (adder block in Fig. 2.1a) is implemented using $m$ 2-input XOR gates. As one can see from Fig. 2.1a, all bits of the operand $B$ are available, while the coordinates of the operand $A$ should be available in serial fashion with the LSB first, i.e., $a_{0}$. In this architecture, the $m$-bit registers $\langle Y\rangle=\left\langle y_{0}, y_{1}, \cdots, y_{m-1}\right\rangle$ and $\langle Z\rangle=\left\langle z_{0}, z_{1}, \cdots, z_{m-1}\right\rangle$ should be initialized with operand $B=\left(b_{0}, b_{1}, \cdots, b_{m-1}\right)$ and $0=(0,0, \cdots, 0)$ (i.e., $Y(0)=B$ and $Z(0)=0$ ), respectively. Let $Z(0)$ denote the initial value of the register $\langle Z\rangle$ and $Z(i), 1 \leq i \leq m$, be the content of the register $\langle Z\rangle$ in clock cycle $i$. After one clock cycle the content of $\langle Z\rangle$ is $Z(1)=a_{0} P(B) \in G F\left(2^{m}\right)$. Then, the registers $\langle Y\rangle$ and $\langle Z\rangle$ are cyclically shifted to the left according to (2.12). As one can verify, after the $m$-th clock cycle the register $\langle Z\rangle$ contains the coordinates of $Z(m)=C^{2}=\left(c_{m-1}, c_{0}, c_{1}, \cdots, c_{m-2}\right)$ (see (2.12)). Thus, $C$ can be obtained by a left cyclic shift of register $\langle Z\rangle$, i.e., $C=(Z(m) \ll 1)$. The presented architecture requires at most $(T-1)(m-1)$ XOR gates in the $P$ module, $m$ XOR gates for the adder, $m$ AND gates, and two $m$-bit registers. Also, its critical-path delay due to delays through the $P$ module $\left(\left\lceil\log _{2} T\right\rceil T_{X}\right)$, AND gates $\left(T_{A}\right)$, and XOR gates $\left(T_{X}\right)$ is $T_{A}+\left(1+\left\lceil\log _{2} T\right\rceil\right) T_{X}$.

Figure 2.1: The architecture of (a) LSB-first bit-level SIPO (b) MSB-first bit-level normal basis multipliers [4] (c) The architecture of $P$ module for type $T$ GNB.

## The MSB-first bit-level SIPO normal basis multiplier

In an MSB-first bit-level SIPO GNB multiplication, the operand $A$ is processed from its MSB, i.e., $a_{m-1}$, and in each clock cycle one bit is considered.

Table 2.2: The values of $F$ for type 2 GNB over $G F\left(2^{5}\right)$

| $k$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $F(k)$ | 0 | 1 | 3 | 2 | 4 | 4 | 2 | 3 | 1 | 0 |

Table 2.3: Content of Variables in the LSB-first and MSB-first multiplication of $A=(01110)$ and $B=(10101)$ over $G F\left(2^{5}\right)$.

| $j$ | LSB-first $Y$ | LSB-first $A$ | LSB-first $Z$ | MSB-first $Y$ | MSB-first $A$ | MSB-first $Z$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 10101 | - | 00000 | 11010 | - | 00000 |
| 1 | 10101 | 0 | 00000 | 11010 | 0 | 00000 |
| 2 | 01011 | 1 | 11011 | 01101 | 1 | 10100 |
| 3 | 10110 | 1 | 10000 | 10110 | 1 | 01101 |
| 4 | 01101 | 1 | 10101 | 01011 | 1 | 01101 |
| 5 | 11010 | 0 | $C^{2}=01011$ | 10101 | 0 | $C=10110$ |

Let $A, B$ be two elements of $G F\left(2^{m}\right)$ and $C$ be their product, i.e., $C=A B$, then similar to Horner's rule one can obtain [4]:

$$
\begin{equation*}
C=A B=\left(\cdots\left(\left(a_{m-1} \beta B^{2^{-(m-1)}}\right)^{2}+a_{m-2} \beta B^{2^{-(m-2)}}\right)^{2}+\cdots\right)^{2}+a_{0} \beta B . \tag{2.13}
\end{equation*}
$$

To realize the implementation of (2.13), one needs to perform multiplication by $\beta$ as $\beta B=\underline{\beta}^{t r} \cdot\left(\underline{\beta} \cdot \underline{b}^{t r}\right)=\left(\underline{\beta}^{t r} \cdot \underline{\beta}\right) \cdot \underline{b}^{t r}=\mathbf{M} \cdot \underline{b}^{t r}$, which is a matrix-by-vector multiplication for GNB, and then compute $C$ as

$$
\begin{aligned}
C= & \left(\cdots\left(\left(a_{m-1} \odot P(Y)\right) \gg 1+a_{m-2} \odot P(Y \gg 1)\right) \gg 1+\cdots\right) \gg 1+ \\
& a_{0} \odot P(Y \gg m-1),
\end{aligned}
$$

where $Y=B^{2^{-(m-1)}}=B^{2}$. The architecture for the MSB-first SIPO GNB multiplication is depicted in Fig. 2.1b. As one can see every bit of operand $B$ is available, while operand $A$ should be available in serial with the MSB first. In this multiplier structure, both registers $\langle Y\rangle$ and $\langle Z\rangle$ are initialized to $Y=(B \gg 1)=\left(b_{m-1}, b_{0}, b_{1}, \cdots, b_{m-2}\right)$ and $0=(0,0, \cdots, 0)$, respectively. In the first clock cycle, the register $\langle Z\rangle$ contains $Z(1)=a_{m-1} \odot P(B \gg 1)$. Then, registers $\langle Y\rangle$ and $\langle Z\rangle$ should be cyclically shifted to the right. Thus, as one can verify, after $m$-th clock cycle the register $\langle Z\rangle$ contains the coordinates of $C$, i.e., $Z(m)=C$.

### 2.2.4.2 An Example

Consider the finite field $G F\left(2^{5}\right)$ with a type 2 GNB. Using Table 2.2 and the formulation given in [19], we obtain the following multiplication matrix $\mathbf{M}$ and the corresponding matrix $\mathbf{R}$:

$$
\mathbf{M}=\left(\begin{array}{ccccc}
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1
\end{array}\right)_{5 \times 5}, \mathbf{R}=\left(\begin{array}{cc}
0 & 3 \\
3 & 4 \\
1 & 2 \\
2 & 4
\end{array}\right)_{4 \times 2}
$$

Let $A=(01110)$ and $B=(10101)$ be two field elements in $G F\left(2^{5}\right)$. Based on the architectures depicted in Fig. 2.1, Table 2.3 illustrates the contents of the registers $\langle Y\rangle$ and $\langle Z\rangle$, which are updated with the clock cycles. For the MSB-first structure, first, registers $\langle Y\rangle$ and $\langle Z\rangle$ are initialized (in the row with $j=0$) with $B^{2^{-4}}=B^{2}=11010$ and 00000 , respectively. Then, after $j=5$ clock cycles the register $\langle Z\rangle$ contains the product, i.e., $C=10110$. For the LSB-first structure, in the initialization step, registers $\langle Y\rangle$ and $\langle Z\rangle$ are loaded with operand $B$ and 00000 , respectively. Then, after 5 clock cycles the register $\langle Z\rangle$ contains $C^{2}=01011$. Therefore, after a left cyclic shift (i.e., rewiring) one can obtain the result of the multiplication as $C=10110$.
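The register contents listed in Table 2.3 can be reproduced with a short behavioural simulation of the two bit-level SIPO structures. The sketch below is not a hardware description: the function names, the tuple representation, and the way each clock cycle is modelled (shift both registers, then accumulate $a_{i} \odot P(Y)$) are assumptions chosen to match the description of Fig. 2.1.

```python
# Rows 1..4 of the R matrix for type 2 GNB over GF(2^5), as given above.
R = {1: (0, 3), 2: (3, 4), 3: (1, 2), 4: (2, 4)}


def bits(s):
    return tuple(int(ch) for ch in s)


def show(x):
    return "".join(str(bit) for bit in x)


def rotl(x, i=1):
    """Left cyclic shift (inverse of squaring in normal basis)."""
    i %= len(x)
    return x[i:] + x[:i]


def rotr(x, i=1):
    """Right cyclic shift (squaring in normal basis)."""
    return rotl(x, -i)


def p_module(y):
    """P(Y) = (y_1, s_0(1, Y), ..., s_0(m-1, Y)) as in (2.11)."""
    out = [y[1]]
    for i in range(1, len(y)):
        s = 0
        for col in R[i]:
            s ^= y[col]
        out.append(s)
    return tuple(out)


def accumulate(z, a_bit, y):
    """Z <- Z XOR (a_bit AND P(Y)), coordinate by coordinate."""
    return tuple(zi ^ (a_bit & pi) for zi, pi in zip(z, p_module(y)))


def lsb_first(a, b):
    m = len(a)
    y, z = b, (0,) * m
    history = [show(z)]
    for i in range(m):
        if i > 0:
            y, z = rotl(y), rotl(z)
        z = accumulate(z, a[i], y)
        history.append(show(z))
    return rotl(z), history          # Z(m) holds C^2, so C = Z(m) << 1


def msb_first(a, b):
    m = len(a)
    y, z = rotr(b), (0,) * m
    history = [show(z)]
    for i in range(m):
        if i > 0:
            y, z = rotr(y), rotr(z)
        z = accumulate(z, a[m - 1 - i], y)
        history.append(show(z))
    return z, history                # Z(m) holds C directly


if __name__ == "__main__":
    a, b = bits("01110"), bits("10101")
    print(lsb_first(a, b))
    print(msb_first(a, b))
```

The `history` lists correspond to the $Z$ columns of Table 2.3 for $j=0, \cdots, 5$, so a mismatch in any clock cycle is easy to localise.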

### 2.2.4.3 Digit-level GNB multiplication

Digit-level multipliers are alternatives to bit-level and bit-parallel multipliers in which the digit size can be chosen depending on the amount of resources available. A digit-level PIPO version of the Massey-Omura multiplier [51] and its improved version [44] are used in ECC-based crypto-processors in [10], [6], and [26]. To satisfy the high-speed and low-complexity requirements of cryptographic applications, there is a need to design efficient architectures for finite field multiplication using normal basis. In [5], two efficient digit-level PISO and PIPO GNB multipliers are presented. In [9], a subexpression sharing algorithm is introduced and applied to obtain the least number of gates for the digit-level PIPO multiplier. In the following, we summarize the contributions of this work.

### 2.2.4.4 Digit-level PISO GNB multiplier

In [5], a digit-level PISO GNB multiplier architecture is proposed. This architecture which uses the following formulation is depicted in Fig. 2.2.


Figure 2.2: The architecture of the digit-level PISO GNB multiplier [5].

Lemma 2.3. [5] Let us denote $z_{l}=\underline{x} \mathbf{M}^{(l)} \underline{y}^{t r}$, where $\mathbf{M}^{(l)}$ denotes the $l$-fold right and down circular shift of the multiplication matrix $\mathbf{M}$. Then, for a digit-level architecture one needs to implement all entries of the $d$ vectors

$$
\begin{equation*}
\mathbf{v}^{(l)}=\left[v_{0}^{(l)}, v_{1}^{(l)}, \cdots, v_{m-1}^{(l)}\right]^{t r}=\mathbf{M}^{(l)} \underline{y}^{t r}, 0 \leq l \leq d-1 \tag{2.14}
\end{equation*}
$$

Then, by $\underline{y}=\underline{b}$

$$
\begin{equation*}
z_{l}=\underline{x} \mathbf{v}^{(l)}=\sum_{i=0}^{m-1} x_{i} v_{i}^{(l)} \tag{2.15}
\end{equation*}
$$

for $\underline{x}=\underline{a}$ and $c_{l}=z_{l}$. Consecutive $d$ coordinates of $C=A B$ can be obtained from (2.14) and (2.15) by a $d$-fold left cyclic shift of $\underline{x}$ and $\underline{y}$. This multiplier requires $q=\left\lceil\frac{m}{d}\right\rceil, 1 \leq q \leq m, 1 \leq d \leq m$, clock cycles to generate all the $m$ coordinates of $C=A B$.

The architecture which realizes (2.14) and (2.15) is shown in Fig. 2.2. A $d$-fold left cyclic shift is denoted by " $\stackrel{d}{\ll}$ " in this figure.

It is noted that the $\mathbf{R}$ matrix used in (2.7) can be easily obtained from $\mathbf{M}$. Specifically, the $(i, j)$-th, $1 \leq i \leq m-1,1 \leq j \leq T$, entry of the matrix $\mathbf{R}$, i.e., $R(i, j), 0 \leq R(i, j) \leq m-1$, contains the column index of a non-zero entry in row $i$ of $\mathbf{M}$. If the number of 1s in row $i$ of $\mathbf{M}$ is $T$, then all $R(i, j), 1 \leq j \leq T$, contain an integer in $[0, m-1]$. Otherwise, rows of $\mathbf{R}$ with even number of entries should be initialized with a constant value [5]. Therefore, one can obtain

$$
\begin{equation*}
c_{l}=a_{l} b_{l+1 \bmod m}+\sum_{i=1}^{m-1} a_{l+i \bmod m}\left(\sum_{j=1}^{T} b_{l+R(i, j) \bmod m}\right), \tag{2.16}
\end{equation*}
$$

and implement $d$ copies of $c_{l}$ in hardware to achieve a digit-level architecture for $0 \leq l \leq d-1$.
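The digit-level PISO schedule can be modelled in software by evaluating (2.16) for $d$ consecutive coordinates per clock cycle and then shifting both operand registers by $d$ positions. The sketch below does this for type 2 GNB over $G F\left(2^{5}\right)$ with the matrix $\mathbf{R}$ of the earlier example; the function names are hypothetical, and discarding the surplus outputs of the last cycle when $d$ does not divide $m$ is an assumption of this model.

```python
from math import ceil

# Rows 1..4 of the R matrix for type 2 GNB over GF(2^5), as in the earlier example.
R = {1: (0, 3), 2: (3, 4), 3: (1, 2), 4: (2, 4)}


def rotl(x, i):
    """i-fold left cyclic shift of the coordinate vector x."""
    i %= len(x)
    return x[i:] + x[:i]


def coordinate(x, y, l):
    """Evaluate (2.16) for index l on the current register contents x and y."""
    m = len(x)
    c = x[l % m] & y[(l + 1) % m]
    for i in range(1, m):
        s = 0
        for col in R[i]:
            s ^= y[(l + col) % m]
        c ^= x[(l + i) % m] & s
    return c


def digit_level_piso(a, b, d):
    """Produce d coordinates of C = AB per cycle; returns (C, number of cycles)."""
    m = len(a)
    q = ceil(m / d)
    x, y = a, b
    out = []
    for _ in range(q):
        out.extend(coordinate(x, y, l) for l in range(d))   # d copies of c_l
        x, y = rotl(x, d), rotl(y, d)                        # d-fold left shift
    return tuple(out[:m]), q


if __name__ == "__main__":
    a = (0, 1, 1, 1, 0)
    b = (1, 0, 1, 0, 1)
    for d in range(1, 6):
        print(d, digit_level_piso(a, b, d))
```

Setting $d=1$ gives the bit-level PISO behaviour, while $d=m$ produces all coordinates in a single cycle, which illustrates the area and latency trade-off controlled by the digit size.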

### 2.2.4.5 Digit-level PIPO GNB Multiplier

In [5] and [6] a digit-level GNB multiplier with parallel output (DL-PIPO) is proposed. It requires $q, 1 \leq q \leq m$, clock cycles to generate all $m$ coordinates of $C=A B$ simultaneously at the end of the final clock cycle. The original multiplier structure of DL-PIPO is shown in Fig. 2.3. Let $\langle X\rangle$ and $\langle Y\rangle$ be the input registers of this multiplier. Then, it implements [5]

$$
\begin{equation*}
J(X, Y)=\sum_{k=0}^{m-1} x_{m-k} s_{0}^{\prime}(k, Y) \beta^{2^{i}} \tag{2.17}
\end{equation*}
$$

where

$$
\begin{equation*}
s_{0}^{\prime}(k, Y)=\sum_{i \in R_{k}} y_{i-k} \tag{2.18}
\end{equation*}
$$

and $R_{k}$ is a set containing the locations of non-zero entries of row $2 k, 0 \leq 2 k \leq m-1$, of the multiplication matrix $\mathbf{M}=\mathbf{M}^{(0)}$ defined in (2.4). Based on the properties of $\mathbf{M}$ for GNB, one can find $s_{0}^{\prime}(0, Y)=y_{1}$ and $s_{0}^{\prime}(k, Y)=s_{0}^{\prime}(m-k, Y), 1 \leq k \leq \frac{m-1}{2}$ [5]. Also, it is shown in [44] and [5] that the number of elements in $R_{k}$ is even and less than or equal to $T$, i.e., $\left|R_{k}\right| \leq T$. The $J$ block in Fig. 2.3 performs (2.17) using $m$ AND gates. For the multiplication operation, the registers $\langle X\rangle$ and $\langle Y\rangle$ of this figure are initially loaded with the coordinates of $A$ and $B$, respectively. Also, the output register $\langle Z\rangle$ should be cleared before starting the multiplication operation. Then, after $q$ clock cycles, the output register $\langle Z\rangle$ contains the coordinates of $C=A B$. In the following section, we modify this multiplier to reduce the number of XOR gates.

Figure 2.3: The architecture of the digit-level PIPO GNB multiplier proposed in [5], [6], where the $i$-fold right cyclic shift is denoted by $\stackrel{i}{\Rightarrow}$ and $r$ is a number $0 \leq r \leq d-1$ such that $m=q d-r$.

### 2.3 Elliptic Curve Cryptography

To date, several forms of elliptic curves over finite fields of characteristic two have been considered for hardware implementation of such cryptosystems in the literature; see for example, [20], [21], [10], [6], [22], [23], [24], [25], [26], and [27]. They cover a wide variety of cases regarding different basis representations (e.g., polynomial basis and normal basis), different coordinate systems (e.g., affine, projective, mixed, etc.), and different curve forms (e.g., generic and Koblitz). In these implementations, various hardware platforms such as field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) have been utilized. For different target applications, efficient implementations of ECC on these platforms with a balance between complexity of computations and availability of the resources are crucial to provide highly efficient cryptographic systems.

Binary Edwards curves have been introduced recently by Bernstein, Lange, and Farashahi in [1]. They showed that all generic elliptic curves over binary fields can be written in Edwards form to obtain efficient complete and unified addition formulas which work for all pairs of inputs. In [2], a generalized form of binary Hessian curves is proposed which has similar characteristics to the binary Edwards curves. Both of these curves offer unified and complete formulas for point operations, which provide resistance against side-channel attacks (SCAs). Despite the efficiency of binary Edwards and generalized Hessian curves, only a limited number of articles in the literature, such as [52], [53], and [54], have investigated their implementations. In [52], an ASIC implementation of point multiplication on a special case of binary Edwards curves has been presented, addressing energy consumption and simple power analysis attacks over $G F\left(2^{m}\right)$ using polynomial basis representation. An SCA resistance evaluation of binary Edwards curves has been discussed in [53], employing the unified addition formula for doubling. The work presented in [54] mainly focuses on software implementation of point multiplication on these curves employing different curve parameters.

### 2.3.1 Elliptic Curve Arithmetic

In this thesis we mainly focus on binary fields and limit definitions of elliptic curves on $G F\left(2^{m}\right)$.

Let $E_{W, a, b}$ be a non-supersingular binary generic elliptic curve (short Weierstrass) defined as

$$
\begin{equation*}
E_{W, a, b}: y^{2}+x y=x^{3}+a x^{2}+b, \tag{2.19}
\end{equation*}
$$

where $a, b \in G F\left(2^{m}\right)$, and $b \neq 0$. The set of points $(x, y)$ that satisfy (2.19), together with a special point at infinity $\mathcal{O}$ (the group identity), forms a finite Abelian group under a defined addition operation. The so-called chord-and-tangent rule (as shown in Fig. 2.4) is used to define the group operation [11]. For all $P \in E_{W, a, b}\left(G F\left(2^{m}\right)\right)$, $P+\mathcal{O}=\mathcal{O}+P=P$. The negative of a point $P=(x, y)$ is $-P=(x, x+y)$, where $(x, y)+(x, x+y)=\mathcal{O}$.

Then, for two points $P_{1}, P_{2} \in E_{W, a, b}\left(G F\left(2^{m}\right)\right)$, the third point $P_{3}=P_{1}+P_{2} \in E_{W, a, b}\left(G F\left(2^{m}\right)\right)$ exists and can be produced using arithmetic operations in $G F\left(2^{m}\right)$; this operation is called point addition. Also, for a point $P=\left(x_{1}, y_{1}\right)$ with $P \neq-P$, the point doubling is $P_{4}=2 P=\left(x_{4}, y_{4}\right)$.

Elliptic curve point multiplication is defined over the Abelian group as

$$
\begin{equation*}
Q=k P=\underbrace{P+P+\cdots+P}_{k}, \tag{2.20}
\end{equation*}
$$

where $P$ and $Q$ are two points on $E_{W, a, b}$ and $k>1$ is an integer. The point $P$ is called the base point and $Q$ is the result point.

Definition 2.5. Given the cyclic additive group generated by $P$ on $E_{W, a, b}\left(G F\left(2^{m}\right)\right)$, the order of point $P$, $\operatorname{ord}(P)$, is the smallest positive integer $r$ for which $r P=\mathcal{O}$. Then, the integer $k$ is bounded as $1<k \leq \operatorname{ord}(P)-1$.

Although the point multiplication of the form (2.20) is the most common operation in elliptic curve cryptosystems, some applications (such as digital signatures) require a double point multiplication of the form $m P+n Q$, where $P, Q \in E_{W, a, b}\left(G F\left(2^{m}\right)\right)$ are points of order $r$ and $1 \leq m, n \leq r-1$.


Figure 2.4: Group law on Elliptic curve over $\mathbb{R}$.

Definition 2.6. Given two points $P$ and $Q$, where $Q=k P$, the problem of obtaining $k$ is known as the elliptic curve discrete logarithm problem (ECDLP), and solving it is considered computationally infeasible. The best known algorithms for the ECDLP currently have exponential complexity, and no polynomial-time solution is known (without considering quantum computers).

Point addition in affine coordinates, $P_{3}=\left(x_{3}, y_{3}\right)=P_{1}+P_{2}$, where $P_{1} \neq \pm P_{2}$, is given by [28]:

$$
\left\{\begin{array}{l}
x_{3}=\lambda^{2}+\lambda+x_{1}+x_{2}+a, \quad \lambda=\frac{y_{2}+y_{1}}{x_{2}+x_{1}},  \tag{2.21}\\
y_{3}=\lambda\left(x_{1}+x_{3}\right)+x_{3}+y_{1},
\end{array}\right.
$$

at a cost of $\mathbf{I}+2 \mathbf{M}+\mathbf{S}+8 \mathbf{A}$. Point doubling, $P_{4}=\left(x_{4}, y_{4}\right)=2 P_{1}$, is given by

$$
\begin{cases}x_{4} & =x_{1}^{2}+\frac{b}{x_{1}^{2}}  \tag{2.22}\\ y_{4} & =x_{1}^{2}+\left(x_{1}+\frac{y_{1}}{x_{1}}\right) x_{4}+x_{4}\end{cases}
$$

and it costs $\mathbf{I}+2 \mathbf{M}+\mathbf{S}+4 \mathbf{A}$. Since computing the inversion is costly in finite fields, some alternative approaches have been considered.
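The affine formulas (2.21) and (2.22) can be checked numerically on a toy curve. The following is a minimal sketch under stated assumptions: it uses the small field $G F\left(2^{4}\right)$ in polynomial basis with the reduction polynomial $x^{4}+x+1$ and the hypothetical curve parameters $a=b=1$, none of which come from this thesis (which targets GNB over much larger fields). The function names are illustrative only.

```python
# Toy field GF(2^4) in polynomial basis (an assumption for illustration;
# the thesis works with Gaussian normal bases over much larger fields).
M_DEG = 4
POLY = 0b10011  # x^4 + x + 1, irreducible over GF(2)


def gf_mul(a, b):
    """Multiply two field elements (shift-and-add with modular reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M_DEG:
            a ^= POLY
    return r


def gf_inv(a):
    """Invert a nonzero element as a^(2^m - 2)."""
    r = 1
    for _ in range(2**M_DEG - 2):
        r = gf_mul(r, a)
    return r


def on_curve(P, a, b):
    """Check y^2 + xy = x^3 + a x^2 + b, i.e. equation (2.19)."""
    x, y = P
    x_sq = gf_mul(x, x)
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(x_sq, x) ^ gf_mul(a, x_sq) ^ b
    return lhs == rhs


def point_add(P1, P2, a):
    """Affine addition (2.21); requires x1 != x2, i.e. P1 != +/-P2."""
    (x1, y1), (x2, y2) = P1, P2
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


def point_double(P, b):
    """Affine doubling (2.22); requires x1 != 0, i.e. P != -P."""
    x1, y1 = P
    x1_sq = gf_mul(x1, x1)
    x4 = x1_sq ^ gf_mul(b, gf_inv(x1_sq))
    y4 = x1_sq ^ gf_mul(x1 ^ gf_mul(y1, gf_inv(x1)), x4) ^ x4
    return (x4, y4)


A_COEF, B_COEF = 1, 1  # hypothetical curve parameters, b != 0
points = [(x, y) for x in range(16) for y in range(16)
          if on_curve((x, y), A_COEF, B_COEF)]
```

Every sum or double computed by these functions from points in `points` should again satisfy `on_curve`, which is a quick sanity check on the two formulas. Each call performs one field inversion, which is the cost the projective coordinate systems discussed next are designed to avoid.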

### 2.3.2 Inversion-Free Coordinates

Inversion is known as an expensive operation in finite fields. Therefore, instead of having point coordinates represented in affine coordinates, it is efficient to define them in projective coordinates. In the following, different types of projective coordinates are presented.

### 2.3.2.1 Standard Projective Coordinates

In standard projective coordinates, a point is represented with the triple $(X, Y, Z)$ to represent $(X / Z, Y / Z)$ in affine with $Z \neq 0$ and $\mathcal{O}=(0,1,0)$. Then, the curve equation will be

$$
Y^{2} Z+X Y Z=X^{3}+a X^{2} Z+b Z^{3}
$$

where the cost of point addition and doubling is $16 \mathbf{M}+2 \mathbf{S}+6 \mathbf{A}$ and $8 \mathbf{M}+4 \mathbf{S}+5 \mathbf{A}$, respectively.

### 2.3.2.2 Lopez-Dahab Projective Coordinates

For Lopez-Dahab coordinates [3], the triple $(X, Y, Z)$ is used to represent $\left(X / Z, Y / Z^{2}\right)$ in affine when $Z \neq 0$ and $\mathcal{O}=(1,0,0)$. The curve equation is

$$
Y^{2}+X Y Z=X^{3} Z+a X^{2} Z^{2}+b Z^{4}
$$

where the cost of point addition and doubling is $13 \mathbf{M}+4 \mathbf{S}+9 \mathbf{A}$ and $5 \mathbf{M}+4 \mathbf{S}+5 \mathbf{A}$, respectively. In Lopez-Dahab coordinates, when one of the points is represented in affine coordinates, the cost of mixed projective point addition, i.e., $\left(X_{3}, Y_{3}, Z_{3}\right)=\left(X_{1}, Y_{1}, Z_{1}\right)+\left(x_{2}, y_{2}\right)$, reduces to $9 \mathbf{M}+5 \mathbf{S}+9 \mathbf{A}$ [55].

### 2.3.2.3 Jacobian Projective Coordinates

In Jacobian projective coordinates, the triple $(X, Y, Z)$ corresponds to the affine point $\left(X / Z^{2}, Y / Z^{3}\right)$ with the curve equation as

$$
Y^{2}+X Y Z=X^{3}+a X^{2} Z^{2}+b Z^{6}
$$

where the costs of mixed point addition and doubling are $10 \mathbf{M}+3 \mathbf{S}+7 \mathbf{A}$ and $5 \mathbf{M}+5 \mathbf{S}$, respectively.
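As a brief worked check (added here for illustration), the three projective curve equations above follow from (2.19) by direct substitution. For Lopez-Dahab coordinates, putting $x=X / Z$ and $y=Y / Z^{2}$ into (2.19) gives

$$
\frac{Y^{2}}{Z^{4}}+\frac{X Y}{Z^{3}}=\frac{X^{3}}{Z^{3}}+a \frac{X^{2}}{Z^{2}}+b,
$$

and multiplying both sides by $Z^{4}$ yields $Y^{2}+X Y Z=X^{3} Z+a X^{2} Z^{2}+b Z^{4}$. In the same way, the substitution $(x, y)=(X / Z, Y / Z)$ followed by multiplication by $Z^{3}$ gives the standard projective equation, and $(x, y)=\left(X / Z^{2}, Y / Z^{3}\right)$ followed by multiplication by $Z^{6}$ gives the Jacobian one.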

```
Algorithm 2.2 Left-to-right Double-and-add point multiplication algorithm [11]
Inputs: An integer \(k>1, k:=\left(k_{l-1} \cdots k_{1} k_{0}\right)_{2}\), and \(P=(x, y) \in E\left(G F\left(2^{m}\right)\right)\)
Output: \(Q=k P \in E\left(G F\left(2^{m}\right)\right)\)
Initialize: \(Q=P\)
For \(i:=l-2\) down to 0 do
    \(Q=2 Q\)
    if \(k_{i}=1\) then
        \(Q=Q+P\)
    end if
end for
return \(Q\)
```


### 2.3.3 Point Multiplication

The elliptic curve point multiplication is defined in the Abelian group as $Q=k \cdot P=P+P+\cdots+P$ ($k$ times), where $k$ is a positive integer, and $Q$ and $P$ are two points on the elliptic curve, $Q, P \in E\left(G F\left(2^{m}\right)\right)$ [3]. The efficiency of point multiplication depends on finding the minimum number of steps to reach $k P$ from a given point $P$. In the following, the two most widely used algorithms for point multiplication are presented.

### 2.3.3.1 Double-And-Add Point Multiplication

The simplest method to perform point multiplication is the double-and-add method as shown in Algorithm 2.2. As one can see, the scalar $k$ is given in binary form, i.e., $k=\sum_{i=0}^{l-1} k_{i} 2^{i}$, and the algorithm iterates through each bit of $k$. In each iteration, a point doubling is performed and, when $k_{i}$ is one, a point addition is also performed. Clearly, the computational cost of the double-and-add method depends on the number of ones in the binary expansion of $k$, i.e., the Hamming weight $H(k)$ of $k$. Therefore, this method requires $l-1$ point doublings and $H(k)-1$ point additions ($H(k) \approx l / 2$ on average). As $H(k)$ determines the performance of the double-and-add point multiplication algorithm, reducing it is always desired. A Non-Adjacent Form (NAF) representation of $k$ is used to reduce $H(k)$. In this representation, no two consecutive digits are both nonzero, i.e., $k_{i} k_{i+1}=0$ and $k_{i} \in\{0, \pm 1\}$ for all $i$. The NAF method reduces the Hamming weight to $H(k) \approx l / 3$ on average.

The double-and-add point multiplication is not secure against side channel attacks, as an attacker can reveal $k$ by tracing the power consumption for doubling and addition in each iteration. This method is suitable for applications where point addition and doubling have equal cost of computation, for example in binary Edwards [1] and generalized Hessian curves [2].
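The operation counts quoted above can be illustrated with a short sketch. It follows the structure of Algorithm 2.2 but, as a simplifying assumption, runs over a stand-in group (the integers under addition, so that $k P=k \cdot P$) instead of an elliptic curve; the function names `double_and_add` and `naf` are illustrative and not taken from the thesis.

```python
def double_and_add(k, P, add, double):
    """Left-to-right double-and-add (Algorithm 2.2) over any additive group.

    Returns (kP, number of doublings, number of additions).
    """
    assert k >= 1
    bits = bin(k)[2:]          # k_{l-1} ... k_1 k_0 with k_{l-1} = 1
    Q, doublings, additions = P, 0, 0
    for bit in bits[1:]:       # i = l-2 down to 0
        Q = double(Q)
        doublings += 1
        if bit == "1":
            Q = add(Q, P)
            additions += 1
    return Q, doublings, additions


def naf(k):
    """Non-adjacent form of k, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)    # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


# Stand-in group for illustration: integers under addition, so kP = k * P.
k, P = 0b1011011, 5
Q, n_dbl, n_add = double_and_add(k, P, lambda a, b: a + b, lambda a: 2 * a)
print(Q == k * P, n_dbl, n_add)   # l - 1 doublings and H(k) - 1 additions
print(naf(k)[::-1])               # NAF digits, most significant first
```

The `if bit == "1"` branch is exactly what makes the power trace depend on the scalar, which is the side-channel weakness noted above.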

### 2.3.3.2 Montgomery Point Multiplication

Lopez and Dahab [3] generalized Montgomery's idea [13] to binary generic curves (2.19) and obtained a very efficient algorithm for point multiplication. This method is known as Montgomery point multiplication or Montgomery's ladder and is widely used in the literature. It relies on the fact that the $y$-coordinate is not required during point multiplication because it can be recovered at the end. Then, the $x$-coordinate of the point addition $P_{3}=P_{1}+P_{2}$ can be obtained from

$$
\begin{equation*}
Z_{3}=\left(X_{1} \cdot Z_{2}+X_{2} \cdot Z_{1}\right)^{2}, \quad X_{3}=x \cdot Z_{3}+\left(X_{1} \cdot Z_{2}\right) \cdot\left(X_{2} \cdot Z_{1}\right), \tag{2.23}
\end{equation*}
$$

with the cost of $4 \mathbf{M}+\mathbf{S}+2 \mathbf{A}$, where $x$ is the $x$-coordinate of the base point $P$ (the ladder maintains $P_{2}-P_{1}=P$), and the $x$-coordinate of the point doubling $P_{4}=2 P_{1}$ from

$$
\begin{equation*}
X_{4}=X_{1}^{4}+b \cdot Z_{1}^{4}, \quad Z_{4}=Z_{1}^{2} \cdot X_{1}^{2} \tag{2.24}
\end{equation*}
$$

with the cost of $2 \mathbf{M}+3 \mathbf{S}+\mathbf{A}$. Then, the $y$-coordinate is recovered with the cost of $\mathbf{I}+10 \mathbf{M}+\mathbf{S}+6 \mathbf{A}$ [3]. In point multiplication using the Montgomery algorithm, both a point addition and a point doubling are performed in each step. Due to this uniform structure, the algorithm reveals no information that distinguishes point addition from point doubling and hence is resistant to simple power analysis attacks. It also provides fast computations in comparison to the case where explicit addition and doubling formulations are employed. The cost of combined point addition and doubling based on the $x$-coordinates only is $6 \mathbf{M}+4 \mathbf{S}+3 \mathbf{A}$. Algorithm 2.3 presents Montgomery point multiplication for a given point $P \in E_{W, a, b}\left(G F\left(2^{m}\right)\right)$. Also, Mxy converts the Lopez-Dahab coordinates to affine ones and it is the only operation in this algorithm which requires inversion. The Montgomery point multiplication is fast, uniform, and secure against side channel attacks such as simple power analysis attacks. For detailed information about elliptic curve cryptography and its mathematical computations, one can refer to [11].

Algorithm 2.3 Lopez-Dahab Scalar Multiplication [12]

Inputs: An integer $k>1, k:=\left(k_{l-1} \cdots k_{1} k_{0}\right)_{2}$, and $P=(x, y) \in E$
Output: $Q=k P$
Step 1: $X_{1}:=x, Z_{1}:=1, X_{2}:=x^{4}+b, Z_{2}:=x^{2}$
Step 2: For $i:=l-2$ down to 0 do
Step 3: $\quad$ if $k_{i}=1$ then $\left(X_{1}, Z_{1}\right)=\operatorname{ADD}\left(X_{1}, Z_{1}, X_{2}, Z_{2}\right),\left(X_{2}, Z_{2}\right)=\operatorname{DBL}\left(X_{2}, Z_{2}\right)$
Step 4: $\quad$ else $\left(X_{2}, Z_{2}\right)=\operatorname{ADD}\left(X_{1}, Z_{1}, X_{2}, Z_{2}\right),\left(X_{1}, Z_{1}\right)=\operatorname{DBL}\left(X_{1}, Z_{1}\right)$
Step 5: return $Q=\operatorname{Mxy}\left(X_{1}, Z_{1}, X_{2}, Z_{2}\right)$
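Algorithm 2.3 together with (2.23) and (2.24) can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware design of this thesis: it uses the toy field $G F\left(2^{4}\right)$ in polynomial basis (reduction polynomial $x^{4}+x+1$), evaluates the formulas in the most direct way without minimizing squarings, and returns only the affine $x$-coordinate of $k P$ instead of performing the full Mxy recovery of $y$. The names `madd`, `mdouble`, and `ladder_x` are hypothetical.

```python
M_DEG = 4        # toy field GF(2^4); an assumption for illustration only
POLY = 0b10011   # x^4 + x + 1 (polynomial basis, not the GNB of the thesis)


def gf_mul(a, b):
    """Multiply two field elements (shift-and-add with modular reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M_DEG:
            a ^= POLY
    return r


def gf_sqr(a):
    return gf_mul(a, a)


def gf_inv(a):
    """Invert a nonzero element as a^(2^m - 2)."""
    r = 1
    for _ in range(2**M_DEG - 2):
        r = gf_mul(r, a)
    return r


def madd(x, X1, Z1, X2, Z2):
    """x-only addition (2.23); x is the affine x-coordinate of the base point."""
    t1 = gf_mul(X1, Z2)
    t2 = gf_mul(X2, Z1)
    Z3 = gf_sqr(t1 ^ t2)
    X3 = gf_mul(x, Z3) ^ gf_mul(t1, t2)
    return X3, Z3


def mdouble(b, X1, Z1):
    """x-only doubling (2.24), evaluated directly."""
    x_sq, z_sq = gf_sqr(X1), gf_sqr(Z1)
    X4 = gf_sqr(x_sq) ^ gf_mul(b, gf_sqr(z_sq))
    Z4 = gf_mul(x_sq, z_sq)
    return X4, Z4


def ladder_x(k, x, b):
    """Affine x-coordinate of kP for a base point with x != 0 (None if kP = O)."""
    X1, Z1 = x, 1
    X2, Z2 = mdouble(b, x, 1)            # (x^4 + b, x^2), Step 1
    for i in range(k.bit_length() - 2, -1, -1):
        # Both branches perform one addition and one doubling.
        if (k >> i) & 1:
            (X1, Z1), (X2, Z2) = madd(x, X1, Z1, X2, Z2), mdouble(b, X2, Z2)
        else:
            (X2, Z2), (X1, Z1) = madd(x, X1, Z1, X2, Z2), mdouble(b, X1, Z1)
    if Z1 == 0:
        return None
    return gf_mul(X1, gf_inv(Z1))        # the only inversion


print(ladder_x(3, 1, 1))
```

Note that the curve parameter $a$ does not appear in the $x$-only formulas, and that the single inversion is deferred to the final conversion, in line with the discussion above.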

In the next chapter, we will present low-complexity hardware architectures for digit-level GNB multipliers.

## Chapter 3

## Low-Complexity Architectures for Digit-level and Bit-parallel GNB Multipliers over $G F\left(2^{m}\right)$

Our objective in this chapter is to reduce the area complexity of the digit-level GNB multiplier architectures presented in the previous chapter. The multiplication of two field elements in a finite field of characteristic two, i.e., $G F\left(2^{m}\right)$, is more complicated than the other operations (e.g., addition and squaring) and plays an important role in determining the efficiency of cryptographic systems. Massey and Omura (MO) [35] invented a bit-level, parallel-in serial-out $G F\left(2^{m}\right)$ normal basis multiplier. Such a bit-level multiplier is slow as it generates the results of multiplication after $m$ clock cycles. The fastest type of multiplier is the bit-parallel one, whose results are available after the propagation delay through the gates in one clock cycle. We note that for type 2 GNB (which is type 2 optimal normal basis), there are several efficient multipliers available in the literature. For instance, in [56], Sunar and Koç proposed a bit-parallel multiplier based on a permuted normal basis. An efficient, highly regular systolic version of their multiplier was later proposed by Kwon [57] for type 2 GNB. Also, sub-quadratic multipliers have been proposed in [58], [59], and [60] which require smaller area but higher delays. A digit-level version of the MO multiplier [35] is investigated for FPGA implementation of ECC in [10]. Also, Kwon et al. [44] proposed an improved digit-level GNB multiplier which has been employed in [6] for FPGA implementation of ECC over $G F\left(2^{163}\right)$. In order to satisfy the high speed and low complexity requirements of an ECC crypto-processor, one needs to design an efficient architecture for finite field multiplication using normal basis [10].

The results presented in this chapter can be found in [9] and partly in [61]. The contributions of this chapter can be summarized as follows.

- We present a low-complexity architecture for the digit-level parallel-in parallel-out (DL-PIPO) GNB multiplier and propose a common subexpression elimination algorithm. We also reduce the complexity of the digit-level parallel-in serial-out (DL-PISO) architecture presented in the previous chapter.
- We propose a new formulation and an improved architecture for the digit-level serial-in parallel-out (DL-SIPO) GNB multiplier and derive its time and area complexities. It is noted that the proposed architecture requires smaller area in comparison to the leading ones in the literature.
- We simulate the performance of the complexity reduction algorithm for different digit sizes of the proposed digit-level multiplier architectures.
- A low-complexity bit-parallel architecture is obtained by extending the presented DL-PISO multiplier architecture, and its time and area complexities are compared with those of its counterparts in the literature.
- Finally, our proposed multiplier architectures are implemented on the Xilinx® Virtex™-4 FPGA and synthesized using a 65-nm CMOS library (ASIC) technology for different digit sizes. The timing and required area are also reported.

The rest of this chapter is organized as follows. In Section 3.1, a low-complexity digit-level parallel-in parallel-out multiplier architecture is presented. In Section 3.2, a new architecture for the digit-level serial-in parallel-out multiplier is proposed and its time and area complexities are derived. In Section 3.3, the digit-level parallel-in serial-out architecture presented in the previous chapter is improved. In Section 3.4, a low-complexity bit-parallel architecture is proposed and its time and area complexities are compared with those of its counterparts. In Section 3.5, the proposed multiplier architectures are implemented on FPGA and ASIC and the results are reported for different digit sizes. Finally, we conclude this chapter in Section 3.6.

### 3.1 An Improved Architecture for Digit-level PIPO GNB Multiplier

In this section, we propose an improved architecture for the digit-level PIPO multiplier presented in the previous chapter. The number of XOR gates of the DL-PIPO multiplier can be reduced by reusing the common terms appearing at the outputs of the $P$ blocks. The DL-PIPO GNB multiplier architecture has several $P$ blocks shown as $p_{0}$ to $p_{d-1}$ in Fig. 2.3. As shown in this figure, the $P$ blocks use the shifted combination of the input operand $B$ (preloaded in register $\langle Y\rangle$ ). Therefore, we first determine these combinations and, after they are computed, we use their results in different computations to optimize the area complexity by reducing the number of signals and consequently the number of XOR gates. We propose a method to combine the computations of the $P$ blocks into a $\rho$ block as illustrated in the architecture of Fig. 3.1. The number of outputs of an unoptimized $P$ block is $\frac{m+1}{2}$. These are based on the following signals [5]

$$
\begin{equation*}
P_{k}(Y)=\left(y_{1-k}, s_{0}^{\prime}(1, Y \ll k), s_{0}^{\prime}(2, Y \ll k), \cdots, s_{0}^{\prime}\left(\frac{m-1}{2}, Y \ll k\right)\right), \quad 0 \leq k \leq d-1 \tag{3.1}
\end{equation*}
$$

for the $P$ block that generates $P_{k}(Y)$. All signals in (3.1) are used to build the block $\rho$ in Fig. 3.1. As shown in this figure, the signals $y_{1-k}$ are removed from the block $\rho$. To reduce the complexity of the $\rho$ block in Fig. 3.1, we divide it into two blocks, $\rho_{1}$ and $\rho_{2}$, where $\rho_{1}$ includes all common pairs used to generate the signals in (3.1). In the following, we explain the procedure to build the $\rho$ block and propose a complexity reduction algorithm to obtain the optimized blocks $\rho_{1}$ and $\rho_{2}$ such that the sum of their gate delays is the same as the delay of the original block $\rho$.

## Constructing the $\rho$ Block

1. Corresponding to the output signals of the $P$ block in Fig. 2.3, an $\frac{m-1}{2} \times T$ matrix denoted by $\mu=\left[\mu_{k}\right]_{k=1}^{\frac{m-1}{2}}$ is constructed, where $\mu_{k}$ is the row $k, 1 \leq k \leq$ $\frac{m-1}{2}$ of the matrix $\mu$. The entries of $\mu_{k}$ are at most $T$ integers in the range of $[0, m-1]$ and can be found from (2.18) which can be written as $s_{0}^{\prime}(k, Y)=$ $\sum_{j \in \mu_{k}} y_{j}, 1 \leq k \leq \frac{m-1}{2}$.
2. Based on the matrix $\mu$ and the given digit-size $d$, a matrix denoted by $\rho$ is obtained by appending the $d-1$ matrices $\mu-[i] \bmod m$, $1 \leq i \leq d-1$, to $\mu$ as follows:

$$
\rho=\left(\begin{array}{c}
\mu \\
\mu-[1] \bmod m \\
\mu-[2] \bmod m \\
\vdots \\
\mu-[d-1] \bmod m
\end{array}\right)_{\left(d \times \frac{m-1}{2}\right) \times T} \tag{3.2}
$$

where $[i], 1 \leq i \leq d-1$, denotes an $\frac{m-1}{2} \times T$ matrix all of whose entries are $i$.
3. Let $\rho_{i}$ be a set which contains the entries in row $i$ of the matrix $\rho$. Then, all signals

$$
\begin{equation*}
s_{i}=\sum_{j \in \rho_{i}} y_{j}, \quad 1 \leq i \leq d \frac{(m-1)}{2} \tag{3.3}
\end{equation*}
$$

should be implemented by the block $\rho$ shown in Figure 3.1.

Figure 3.1: The proposed improved architecture for DL-PIPO GNB multiplier

## Complexity Reduction Algorithm

We want to find the common addition pairs to realize (3.3) with the least number of XOR gates without changing the delay of the modified multiplier as compared with the original one.

1. Generate a pairset to hold all pairs that should be implemented in the block $\rho_{1}$.
2. Initialize the pairset in Step 1 with all pairs taken from the rows of the matrix $\rho$ that have only two entries.
3. Based on the number of times that these pairs are repeated, update the $\rho$ matrix by removing the pairs obtained in Step 2. Then, go to Step 1.
4. Repeat the above iteration until there are no rows with more than two entries in the $\rho$ matrix.
5. Generate the block $\rho_{1}$ inside the $\rho$ block based on the common pairs stored in the pairset.
6. Reuse the outputs of the block $\rho_{1}$ to generate all signals from the block $\rho_{2}$ in Figure 3.1.

It is noted that unlike the complexity reduction schemes available in the literature, see for example [62], the proposed algorithm does not increase the gate delay of the proposed architecture as compared to the original one.

### 3.1.1 Complexities

In this subsection, the complexity of the proposed digit-level PIPO multiplier is given in terms of gate counts and critical-path delay.

Proposition 3.1. The proposed improved architecture for DL-PIPO type $T$ GNB multiplier over $G F\left(2^{m}\right)$ requires $d m$ AND gates, three $m$-bit registers, and $n_{p}+v_{p}\left(\frac{T}{2}-1\right)+d m$ XOR gates, where $n_{p} \leqslant \min \left\{\frac{v_{p} T}{2},\binom{m}{2}\right\}$ is the number of XOR gates (pairs) required to construct the block $\rho_{1}$ in the proposed structure and $v_{p}=d \times \frac{m-1}{2}$ is the number of rows of the matrix which builds $\rho$. Also, its critical path delay is

$$
\begin{equation*}
T_{D L-P I P O}=T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(d+1)\right\rceil\right) T_{X} \tag{3.4}
\end{equation*}
$$

where $T_{A}$ and $T_{X}$ are the time delay of a two-input AND gate and an XOR gate, respectively.

Proof. The number of rows in the matrix which builds $\rho$ is $v_{p}=d \times \frac{m-1}{2}$ and each row consists of at most $\frac{T}{2}$ pairs. Then, the number of pairs inside the $\rho_{1}$ block will be less than or equal to $v_{p} \times \frac{T}{2}$. In the case where $d=m$ (bit-parallel), one can find the upper bound of $n_{p}$ as $\binom{m}{2}=\frac{m(m-1)}{2}$. Therefore, for the digit-level structure, i.e., $1<d<m$, $n_{p}$ is at most the minimum of $\left\{\frac{v_{p} T}{2}, \frac{m(m-1)}{2}\right\}$.

Moreover, at most $v_{p} \times\left(\frac{T}{2}-1\right) \mathrm{XOR}$ gates in the $\rho_{2}$ block are required to build all the signals of the $\rho$ block. To construct the $G F\left(2^{m}\right)$ adders, one needs $d m$ XOR gates. As a result, the complexity of the proposed multiplier is $n_{p}+v_{p}\left(\frac{T}{2}-1\right)+d m \mathrm{XOR}$ gates, $d m$ AND gates, and $3 m$ 1-bit registers.

The critical-path delay of the proposed architecture can be obtained by adding the delays of the blocks $\rho_{1}$, $\rho_{2}$, $J$, and the $G F\left(2^{m}\right)$ adder, which are $T_{X}$, $\left\lceil\log _{2} \frac{T}{2}\right\rceil T_{X}, T_{A}$, and $\left\lceil\log _{2}(d+1)\right\rceil T_{X}$, respectively. Since $1+\left\lceil\log _{2} \frac{T}{2}\right\rceil=\left\lceil\log _{2} T\right\rceil$, this results in the total delay of $T_{X}+\left\lceil\log _{2} \frac{T}{2}\right\rceil T_{X}+T_{A}+\left\lceil\log _{2}(d+1)\right\rceil T_{X}=T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(d+1)\right\rceil\right) T_{X}$, which completes the proof.
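The expressions of Proposition 3.1 are easy to evaluate for a given parameter set. The helper below is an illustrative sketch (the function name and the dictionary keys are hypothetical); it takes $n_{p}$ as an input because that value comes from running the complexity reduction algorithm, not from a closed formula. The XOR count it returns uses the $\rho_{2}$ term $v_{p}\left(\frac{T}{2}-1\right)$ exactly as stated in the proposition, which the proof describes as an upper limit, so a concrete design may need fewer gates.

```python
from math import ceil, log2


def dl_pipo_complexity(m, T, d, n_p):
    """Gate counts and delay expressions stated in Proposition 3.1.

    n_p (number of shared pairs in rho_1) is supplied by the caller.
    """
    v_p = d * (m - 1) // 2                      # rows of the rho matrix
    return {
        "and_gates": d * m,
        "xor_gates": n_p + v_p * (T // 2 - 1) + d * m,
        "register_bits": 3 * m,                 # three m-bit registers
        "n_p_upper_bound": min(v_p * T // 2, m * (m - 1) // 2),
        # critical path is T_A + xor_levels * T_X, see (3.4)
        "xor_levels": ceil(log2(T)) + ceil(log2(d + 1)),
    }


print(dl_pipo_complexity(m=7, T=4, d=7, n_p=18))
```

For instance, with $m=d=7$ and $T=4$ the critical path evaluates to $T_{A}+5 T_{X}$.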

In the following subsection, we present an illustrative example for the proposed complexity reduction algorithm.

### 3.1.2 An Example over $G F\left(2^{7}\right)$

To better understand the complexity reduction algorithm, we illustrate it for a type 4 GNB multiplier over $G F\left(2^{7}\right)$ when the digit-size is $d=m=7$. The matrix $\mathbf{M}$ for type 4 GNB over $G F\left(2^{7}\right)$ is

$$
\mathbf{M}=\left(\begin{array}{lllllll}
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 1
\end{array}\right)_{7 \times 7}
$$

The matrix $\mu$ can be generated according to the output of the $P$ blocks in Fig. 2.3 as $s_{0}^{\prime}(1, Y)=y_{1-1}+y_{3-1}+y_{4-1}+y_{5-1}=y_{0}+y_{2}+y_{3}+y_{4}, s_{0}^{\prime}(2, Y)=y_{2-2}+y_{6-2}=y_{0}+y_{4}$, and $s_{0}^{\prime}(3, Y)=y_{1-3}+y_{4-3}+y_{5-3}+y_{6-3}=y_{5}+y_{1}+y_{2}+y_{3}$. Then $\mu$ can be written as

$$
\mu=\left(\begin{array}{cccc}
0 & 2 & 3 & 4 \\
0 & 4 & - & - \\
5 & 1 & 2 & 3
\end{array}\right)_{3 \times 4}
$$

Based on the digit-size $d=7$ and the matrix $\mu_{(3 \times 4)}$, the matrix $\rho_{(21 \times 4)}$ can be generated as described in the construction of the $\rho$ block. One can see that 7 rows of the matrix $\rho_{(21 \times 4)}$ have just two entries. Therefore, the pairs corresponding to these rows should be implemented, and they are collected in pairset1. The matrix $\rho$ is updated to $\rho^{(1)}$ by deleting all the two-entry rows collected in pairset1. Then the elements of pairset1 are searched in $\rho^{(1)}$, all common pairs are removed, and $\rho^{(1)}$ is updated to $\rho^{(2)}$. This iteration is repeated until there are no rows with more than two entries. As a result, all the remaining pairs, collected in pairset2, should be implemented, and repeated pairs (which are underlined in the updated $\rho^{(2)}$ matrix) are removed. The union of pairset1 and pairset2 includes the total of 18 pairs that should be implemented for the block $\rho_{1}$ as follows:

$$
\begin{aligned}
\text { pairset }= & \left\{y_{04}, y_{63}, y_{52}, y_{41}, y_{30}, y_{26}, y_{15}, y_{23}, y_{13}, y_{12}, y_{01}, y_{60},\right. \\
& \left.y_{50}, y_{56}, y_{46}, y_{45}, y_{35}, y_{24}\right\},
\end{aligned}
$$

where $y_{i j}=y_{i}+y_{j}$. In addition to the implementation of the $\rho_{1}$ block, which requires 18 XOR gates, one needs $d \frac{m-1}{2}-d=14$ (as $d=m$) extra XOR gates for the block $\rho_{2}$ to construct its outputs. Therefore, the total number of XOR gates required to implement the $\rho$ block will be $18+14=32$, whereas the unoptimized $P$ blocks need 49 XOR gates and the scheme proposed in [5] requires 35 XOR gates.

Figure 3.2: Comparison between the number of XOR gates required in the DL-PIPO and the improved DL-PIPO for (a): $m=163(T=4)$, (b): $m=283(T=6)$.

It is noted that other complexity reduction algorithms available in the literature may result in fewer gates at the expense of more delay. To compare our complexity reduction algorithm with the one proposed in [62], we have applied the latter to the block $\rho$ of this example. It decreases the number of XOR gates to 23 while increasing the critical path delay to $8 T_{X}$ (eight levels of XOR gates). Note that our scheme for this block results in a complexity of 32 XOR gates with the same critical path delay as the original one, i.e., $2 T_{X}$.
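The first steps of this example can be reproduced with a short script. The sketch below is illustrative only: it rebuilds $\mu$ and $\rho$ from the matrix $\mathbf{M}$ given above, collects the pairs coming from the two-entry rows, and counts the XOR gates needed when no terms are shared. The index rule used in `build_mu` (row $k$ of $\mu$ holds $j-k \bmod m$ for the nonzero columns $j$ of row $2 k$ of $\mathbf{M}$) is an assumption inferred from the sums written out in this example, and the function names are hypothetical. The later pair-selection steps are not reproduced, since the set of pairs they yield depends on the order in which candidate pairs are examined.

```python
# Multiplication matrix M for type 4 GNB over GF(2^7), copied from the example.
M = [
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
]


def build_mu(M):
    """Row k of mu lists j - k (mod m) for each nonzero column j of row 2k of M.

    Assumption: index rule inferred from the worked example in the text.
    """
    m = len(M)
    return [[(j - k) % m for j, bit in enumerate(M[2 * k]) if bit]
            for k in range(1, (m - 1) // 2 + 1)]


def build_rho(mu, d, m):
    """Stack mu, mu - [1], ..., mu - [d-1] with entries mod m, as in (3.2)."""
    return [[(e - i) % m for e in row] for i in range(d) for row in mu]


m = d = 7
mu = build_mu(M)
rho = build_rho(mu, d, m)

# Pairs taken from the rows of rho that have exactly two entries.
initial_pairs = {frozenset(row) for row in rho if len(row) == 2}

# XOR gates when every row is summed on its own (no sharing between rows).
unoptimized_xors = sum(len(row) - 1 for row in rho)

print(mu)
print(len(rho), sorted(sorted(p) for p in initial_pairs), unoptimized_xors)
```

In this instance every four-entry row of $\rho$ contains at least one of the initial pairs, which is what allows each such row to be formed from two shared pairs with a single additional XOR gate in $\rho_{2}$.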

### 3.1.3 Simulation Results for the DL-PIPO GNB Multiplier over $G F\left(2^{163}\right)$ and $G F\left(2^{283}\right)$

To evaluate the efficiency of the proposed complexity reduction algorithm, MATLAB code was written to generate the common pairs and signals used in the blocks $\rho_{1}$ and $\rho_{2}$ of Fig. 3.1. It is noted that for type 2 GNB, which is a type 2 optimal normal basis over $G F\left(2^{m}\right)$, there are no common terms to be reused in the block $\rho$. Therefore, the algorithm presented here cannot reduce the number of XOR gates for $T=2$. The simulation results of the algorithm for the improved DL-PIPO GNB multipliers over $G F\left(2^{163}\right)$ and $G F\left(2^{283}\right)$ are obtained and plotted in Fig. 3.2a and Fig. 3.2b, respectively. In these figures, we plot the number of required XOR gates versus the digit size for the fields $G F\left(2^{163}\right)(T=4)$ and $G F\left(2^{283}\right)(T=6)$ recommended by NIST for ECDSA [19], as compared to those of the original DL-PIPO architecture. For a given number of clock cycles $q, 1 \leq q \leq m$, the least digit size of the form $d=\left\lceil\frac{m}{q}\right\rceil, 1 \leq d \leq m$, is implemented so that the area complexity is optimized for both multipliers.

From Fig. 3.2a and 3.2b, one can see that as the digit size increases, more common pairs are found. As an example, in Fig. 3.2a for the digit size $d=m=163$, the total number of XOR gates required in the original DL-PIPO is 66,178, whereas the improved one requires 50,400 XOR gates for $G F\left(2^{163}\right)$. This means that the XOR gate count of the proposed improved DL-PIPO is about $24 \%$ less than that of the original multiplier. A larger reduction can be found in Fig. 3.2b for $G F\left(2^{283}\right)$ with $d=m=283$. As seen, the number of XOR gates needed by the original DL-PIPO is 279,604, whereas the proposed DL-PIPO requires 185,375 XOR gates, which is about $34 \%$ less than that of the original multiplier. The exact values of $n_{p}$, i.e., the number of pairs to construct $\rho_{1}$, are given in Table 3.1 and were obtained by simulations.

Table 3.1: Comparison of number of XOR gates between bit-parallel GNB multipliers for $G F\left(2^{163}\right)$ and $G F\left(2^{283}\right)$.

| $m$ | $T$ | $n_{p}$ | Original <br> DL-PIPO | This work |
| :---: | :---: | :---: | :---: | :---: |
| 163 | 4 | 10,791 | 66,178 | 50,400 |
| 283 | 6 | 25,763 | 279,604 | 185,375 |

### 3.2 New Architecture for Digit-Level SIPO GNB Multiplier

In a digit-level SIPO multiplier, the bits of an operand are grouped into digits and in each clock cycle one digit is processed. We extend the architecture of the LSB-first bit-level GNB multiplier presented in Chapter 2 and propose a low-complexity LSD-first digit-level SIPO GNB multiplier architecture. In the following, we present the formulation, architecture, and complexity of the proposed multiplier.

### 3.2.1 Formulation

Let us assume $A=\sum_{i=0}^{m-1} a_{i} \beta^{2^{i}}=\left(a_{0}, a_{1}, \cdots, a_{m-1}\right)$. Then one can group the bits into $q=\left\lceil\frac{m}{d}\right\rceil$ digits denoted by $A_{i}, 0 \leq i \leq q-1$, as $\left(a_{0}, \cdots, a_{d-1}\right)$ for the first digit, followed by $\left(a_{d}, \cdots, a_{2 d-1}\right)$ for the second digit, and finally $\left(a_{d(q-1)}, \cdots, a_{m-1}\right)$ for the $q$th digit, where $d, 2 \leq d \leq m-1$, denotes the number of bits in each digit. Note that if the last digit does not have $d$ bits, it is padded with zeros at its most significant end. Then, each digit can be represented as $A_{i}=\left(a_{i d}, a_{i d+1}, \cdots a_{i d+d-2}, a_{i d+d-1}\right)=\sum_{j=0}^{d-1} a_{j+i d} \beta^{2^{j}}, A_{i} \in G F\left(2^{m}\right)$ with respect to the GNB and thus operand $A$ can be written as

$$
A=\sum_{i=0}^{q-1} A_{i}^{2^{i d}}=\left(A_{0}, A_{1}, \cdots, A_{q-1}\right)
$$

Therefore, for a field element $B \in G F\left(2^{m}\right)$, one can write the product $A B=C \in G F\left(2^{m}\right)$ as

$$
\begin{align*}
C & =A B \\
& =\sum_{i=0}^{q-1} A_{i}^{2^{i d}} \cdot B=\sum_{i=0}^{q-1}\left(A_{i} \cdot B^{2^{-i d}}\right)^{2^{i d}}=\sum_{i=0}^{q-1}\left(C^{(i)}\right)^{2^{i d}}, \tag{3.5}
\end{align*}
$$

where

$$
\begin{equation*}
C^{(i)}=A_{i} B^{2^{-i d}} \tag{3.6}
\end{equation*}
$$

In order to derive a formulation for multiplication whose implementation is more hardware-oriented we state the following.

Corollary 3.1. Given the ith digit of $A$, i.e., $A_{i}$ with $d$ bits and a field element of $B^{2^{-i d}} \in G F\left(2^{m}\right)$, their product $C^{(i)} \in G F\left(2^{m}\right)$ can be obtained as

$$
C^{(i)}=\sum_{j=0}^{d-1} J^{2^{j}}\left(a_{j+i d}, B^{2^{-(i d+j)}}\right),
$$

where $J(x, Y)=x \cdot P(Y) \in G F\left(2^{m}\right)$.

Proof. Using (3.6), one has

$$
\begin{equation*}
C^{(i)}=\sum_{j=0}^{d-1} a_{j+i d} \beta^{2^{j}} \cdot B^{2^{-i d}}=\sum_{j=0}^{d-1}\left(a_{j+i d} \cdot \beta B^{2^{-i d-j}}\right)^{2^{j}} . \tag{3.7}
\end{equation*}
$$

Now we define $J(x, Y)$ as a function of the product of a bit $x \in G F(2)$ and a field element $P(Y) \in G F\left(2^{m}\right)$ as

$$
\begin{equation*}
J(x, Y)=x \cdot P(Y) \tag{3.8}
\end{equation*}
$$

Then, using (2.11) and Corollary 1, one can write $\beta B=P(B)$ to simplify $C^{(i)}$ in (3.7) as follows

$$
\begin{align*}
C^{(i)} & =\sum_{j=0}^{d-1}\left(a_{j+i d} \cdot P(B \ll(i d+j))\right)^{2^{j}} \\
& =\sum_{j=0}^{d-1} J^{2^{j}}\left(a_{j+i d}, B^{2^{-(i d+j)}}\right) \tag{3.9}
\end{align*}
$$

This completes the proof.

Then, the multiplication of $A$ and $B$ can be obtained from

$$
\begin{equation*}
C=A B=\sum_{i=0}^{q-1}\left(C^{(i)} \gg i d\right) . \tag{3.10}
\end{equation*}
$$
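As a sanity check of (3.9) and (3.10), the following Python sketch evaluates the digit-wise formulation and compares it with a plain bit-wise normal basis product. It assumes that $P(Y)$ denotes the coordinates of $\beta Y$ (as stated in the proof above) and borrows the multiplication matrix $\mathbf{M}$ of the type 4 GNB over $G F\left(2^{7}\right)$ from Section 3.2.3; all function names are hypothetical.

```python
# Multiplication matrix M (type 4 GNB over GF(2^7), Section 3.2.3):
# row i holds the normal basis coordinates of beta * beta^(2^i).
M = [
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
]
m = 7

def shl(v, k):
    """k-fold left cyclic shift of the coordinates, i.e. V^(2^-k)."""
    k %= len(v)
    return v[k:] + v[:k]

def shr(v, k):
    """k-fold right cyclic shift of the coordinates, i.e. V^(2^k)."""
    return shl(v, -k)

def xor(u, v):
    return [a ^ b for a, b in zip(u, v)]

def P(y):
    """Coordinates of beta * Y (assumed reading of P(Y))."""
    out = [0] * m
    for k, bit in enumerate(y):
        if bit:
            out = xor(out, M[k])
    return out

def J(x, y):
    """J(x, Y) = x * P(Y) for a single bit x, as in (3.8)."""
    return [x & p for p in P(y)]

def digit_product(a, b, i, d):
    """C^(i) from (3.9); indices beyond m-1 are the padding zeros."""
    c = [0] * m
    for j in range(d):
        bit = a[i * d + j] if i * d + j < m else 0
        c = xor(c, shr(J(bit, shl(b, i * d + j)), j))
    return c

def multiply_digitwise(a, b, d):
    """C = sum_i (C^(i) >> id), as in (3.10)."""
    q = -(-m // d)                    # ceil(m / d)
    c = [0] * m
    for i in range(q):
        c = xor(c, shr(digit_product(a, b, i, d), i * d))
    return c

def multiply_bitwise(a, b):
    """Reference product: sum_i a_i * (beta * B^(2^-i))^(2^i)."""
    c = [0] * m
    for i in range(m):
        if a[i]:
            c = xor(c, shr(P(shl(b, i)), i))
    return c

A = B = [1, 1, 0, 0, 0, 1, 1]
print(multiply_digitwise(A, B, 2))    # [1, 1, 1, 0, 0, 0, 1]
```

Since $A=B$ in this run, the product is $A^{2}$, which in a normal basis is a one-fold right cyclic shift of $A$; the printed result agrees with that.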

In the following, we present the architecture of the proposed DL-SIPO GNB multiplier.

### 3.2.2 New Architecture

In order to map the formulation obtained in the previous subsection to hardware, an architecture for the LSD-first DL-SIPO GNB multiplier is depicted in Fig. 3.3. Initially, the register $\langle Y\rangle$ is loaded with $B=\left(b_{0}, b_{1}, \cdots, b_{m-1}\right)$ and the register $\langle Z\rangle$ is cleared to 0. The $d$-fold left cyclic shifts are realized by "$\stackrel{d}{<}$" as shown in Fig. 3.3. Also, as one can see in this figure, the last digit of operand $A$, i.e., $A_{q-1}$, is padded with $r=q d-m$, $0 \leq r \leq d-1$, zeros at its most significant end. The remaining input bits correspond to the terms appearing in $A_{q-1}$ (as $m$ is not always a multiple of the digit size $d$). This avoids redundant computations in the last clock cycle.

Figure 3.3: (a) The proposed architecture for LSD-first DL-SIPO multiplier. (b) an example of the proposed multiplier for type 4 GNB over $G F\left(2^{7}\right)$ and $d=2$.

The DL-SIPO GNB multiplier architecture has several $P$ blocks, shown as $p_{0}$ to $p_{d-1}$ in Fig. 3.3b as a $P$ array. As shown in this figure, the $P$ blocks use the shifted combinations of $P(Y) \in G F\left(2^{m}\right)$ defined in (2.11) for the input operand $B$ (preloaded in register $\langle Y\rangle$). Therefore, we first determine these combinations, and after they are computed, we reuse their results in different computations to optimize the area complexity by reducing the number of signals and consequently the number of XOR gates. We propose a method to combine the computations of the $P$ blocks into a $Q$ block as illustrated in the architecture of Fig. 3.3a.

The $Q$ block is generated for the digit size $d$ and type $T$ GNB for operand $B$ as $Q(Y)=(P(Y), P(Y) \gg 1, \cdots, P(Y) \gg d-1)$, as illustrated in Fig. 3.3, where $P(Y) \gg l$, $0 \leq l \leq d-1$, denotes the $l$-fold right cyclic shift of $P(Y) \in G F\left(2^{m}\right)$. As shown in this figure, $y_{l+1}$, $0 \leq l \leq d-1$, are removed from the block $Q$ as they correspond to the lines on the $v_{s}$-bus connected to register $\langle Y\rangle$. The $Q$ block can also be represented by the $\mathbf{Q}$ matrix as

$$
\mathbf{Q}=\left(\begin{array}{c}
\mathbf{R}^{(0)}  \tag{3.11}\\
\mathbf{R}^{(1)} \\
\mathbf{R}^{(2)} \\
\vdots \\
\mathbf{R}^{(l)}
\end{array}\right)_{v_{s} \times T} \quad, 0 \leq l \leq d-1
$$

where, using (2.16), $\mathbf{R}^{(l)}$ is obtained by adding $l$ modulo $m$ to every entry $R(i, j)$, $0 \leq R(i, j) \leq m-1$, $1 \leq i \leq m-1$, $1 \leq j \leq T$, of the matrix $\mathbf{R}=\mathbf{R}^{(0)}$, i.e., $R(i, j)+l \bmod m$. Also, $v_{s}=d(m-1)-\frac{d(d-1)}{2}$ is the total number of rows inside the $\mathbf{Q}$ matrix. This is due to the fact that every two matrices $\mathbf{R}^{\left(i^{\prime}\right)}$ and $\mathbf{R}^{\left(i^{\prime \prime}\right)}$, $0 \leq i^{\prime}, i^{\prime \prime} \leq d-1$, have a common row, which gives a total of $\binom{d}{2}=\frac{d(d-1)}{2}$ repeated rows in the $\mathbf{Q}$ matrix [5]. Then, as one can see, the multiplication of every bit of $A_{i}$ in (3.9) by the outputs of the $Q$ block, which is connected to the $v_{s}$-bus, is performed by the $J$ blocks ($J_{0}$ to $J_{d-1}$) using (3.8), where each $J$ block includes $m$ two-input AND gates as shown in Fig. 3.3a. After the first clock cycle, the content of register $\langle Y\rangle$ is $B^{2^{-d}}$, and in general it contains $B^{2^{-i d}}$ after the $i$th clock cycle. Let $Z(q) \in G F\left(2^{m}\right)$ denote the field element whose coordinates are stored in the $m$-bit register $\langle Z\rangle$ after the $q$-th clock cycle. Then, after one clock cycle, with the use of (3.9), the register $\langle Z\rangle$ contains

$$
\begin{equation*}
C^{(0)}=A_{0} B=\sum_{j=0}^{d-1} J^{2^{j}}\left(a_{j}, B^{2^{-j}}\right) . \tag{3.12}
\end{equation*}
$$

Then, both registers $\langle Y\rangle$ and $\langle Z\rangle$ should be $d$-fold cyclically shifted to the left to obtain $C^{(1)}, C^{(2)}, \cdots, C^{(q-1)}$, accordingly. The sum of the $d$ $m$-bit intermediate results and the $m$-bit content of register $\langle Z\rangle$ is performed in the accumulator, which is implemented using a $G F\left(2^{m}\right)$ adder (as shown in Fig. 3.3). Therefore, considering (3.10), one can verify that after the $q$-th clock cycle the register $\langle Z\rangle$ contains

$$
\begin{align*}
Z(q)= & \left(\cdots\left(\left(\left(C^{(0)}\right)^{2^{-d}}+C^{(1)}\right)^{2^{-d}}+C^{(2)}\right)^{2^{-d}}+\cdots\right)^{2^{-d}} \\
& +C^{(q-1)} . \tag{3.13}
\end{align*}
$$

By comparing (3.10) with (3.13) one can write $Z(q)=C^{2^{-d(q-1)}}=C^{2^{m+(d-r)}}=$ $C^{2^{d-r}}$. Thus, the coordinates of $C=A B$ can be obtained by $(d-r)$-fold left cyclic shift of the register $\langle Z\rangle$, i.e., $C=(Z(q) \ll d-r)$.
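The exponent arithmetic behind this comparison can be written out explicitly. The following worked steps use only $q d=m+r$, the additivity of squaring in $G F\left(2^{m}\right)$, and the identity $X^{2^{m}}=X$; the last line instantiates them for the $G F\left(2^{7}\right)$, $d=2$ case of Section 3.2.3.

$$
\begin{aligned}
Z(q) & =\sum_{i=0}^{q-1}\left(C^{(i)}\right)^{2^{-d(q-1-i)}}=\left(\sum_{i=0}^{q-1}\left(C^{(i)}\right)^{2^{i d}}\right)^{2^{-d(q-1)}}=C^{2^{-d(q-1)}}, \\
-d(q-1) & =-(q d-d)=-(m+r-d) \equiv d-r \pmod{m}, \\
m=7,\ d=2 & : \quad q=4,\ r=1,\ Z(4)=C^{2^{1}}=C^{2} .
\end{aligned}
$$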

Remark 3.1. Using the above formulation, one can design similar architecture for the MSD-first digit-level SIPO GNB multiplier.

### 3.2.2.1 Complexities

In this section, the complexity of the proposed digit-level SIPO multiplier is given in terms of gate counts and critical-path delay.

The number of rows in the matrix which builds $Q$ is $v_{s}=d(m-1)-\frac{d(d-1)}{2}$ and each row consists of at most $\frac{T}{2}$ pairs. We divide the $Q$ block into two blocks, $Q_{1}$ and $Q_{2}$. Block $Q_{1}$ contains at most $n_{s}$, $n_{s} \leq v_{s} \times \frac{T}{2}$, XOR gates with the delay of one XOR gate, as shown in Fig. 3.3a. Block $Q_{2}$ consists of trees of XOR gates for GNBs with $T>2$. The $Q_{2}$ block connects its input bus to the $v_{s}$-bus, with each of its outputs being the sum of at most $T$ coordinates of $\langle Y\rangle$, which can be obtained by adding at most $\frac{T}{2}$ signals from the output of $Q_{1}$. Therefore, if no common subexpressions in the $Q$ block are reused, the numbers of XOR gates in the $Q_{1}$ and $Q_{2}$ blocks of Fig. 3.3a are at most $v_{s} \frac{T}{2}$ and $v_{s}\left(\frac{T}{2}-1\right)$, respectively. It is noted that for the case where $d=m$ (i.e., bit-parallel architecture), the upper bound for $n_{s}$ can be obtained as $\binom{m}{2}=\frac{m(m-1)}{2}$ and hence in general $n_{s} \leq \min \left\{\frac{v_{s} T}{2},\binom{m}{2}\right\}$. Also, the $G F\left(2^{m}\right)$ adder (which adds $d+1$ $m$-bit inputs together) requires $d m$ XOR gates. Moreover, the $J$ blocks require $d m$ two-input AND gates. Therefore, based on the above discussion, the following can be stated to obtain the gate count and time complexity of the proposed multiplier architecture.

Proposition 3.2. The gate complexities of the proposed LSD-first DL-SIPO multiplier architecture are

$$
\begin{aligned}
& \# A N D=d m \\
& \# X O R \leq v_{s}(T-1)+d m
\end{aligned}
$$

Remark 3.2. The area complexity of the proposed LSD-first DL-SIPO multiplier can be further reduced to $n_{s}+v_{s} \times\left(\frac{T}{2}-1\right)+d m$ XOR gates by incorporating a common subexpression elimination algorithm, where $n_{s}$ is upper bounded by $n_{s} \leq \min \left\{\frac{v_{s} T}{2},\binom{m}{2}\right\}$ and its exact value can be obtained by simulation.

To obtain the maximum clock frequency for the proposed multiplier, one can see that the critical-path delay of the proposed multiplier architecture includes those for the $Q_{1}$ and $Q_{2}$ blocks (i.e., $T_{X}$ and $\left\lceil\log _{2} \frac{T}{2}\right\rceil T_{X}$ respectively), the $J$ blocks, (i.e., $T_{A}$ ) and the $G F\left(2^{m}\right)$ adder (i.e., $\left.\left\lceil\log _{2}(d+1)\right\rceil T_{X}\right)$. Then, the total critical-path delay due to delays through the above mentioned blocks is $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(d+1)\right\rceil\right) T_{X}$.
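The bounds of Proposition 3.2 and the critical-path expression above are easy to tabulate for a given field and digit size. The sketch below is only a calculator for those closed-form expressions (before common subexpression elimination); the function names are hypothetical, and the delay is reported as the number of XOR levels that follow the single AND level.

```python
from math import ceil, log2

def v_s(m, d):
    """Number of rows of the Q matrix: d(m-1) - d(d-1)/2."""
    return d * (m - 1) - d * (d - 1) // 2

def dl_sipo_bounds(m, d, T):
    """Return (#AND, upper bound on #XOR, XOR levels on the critical path)."""
    and_gates = d * m
    xor_upper = v_s(m, d) * (T - 1) + d * m
    xor_levels = ceil(log2(T)) + ceil(log2(d + 1))
    return and_gates, xor_upper, xor_levels

# Type 4 GNB over GF(2^7) with d = 2, the case of the illustrative example.
print(dl_sipo_bounds(7, 2, 4))   # (14, 47, 4)
```

For the $G F\left(2^{7}\right)$ example the critical path evaluates to $T_{A}+4 T_{X}$, and the XOR bound of 47 is the general upper bound before the rows with only two entries are taken into account.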

### 3.2.2.2 Complexity Reduction

As explained in the previous subsection, the number of rows inside the $\mathbf{Q}$ matrix is $v_{s}=d(m-1)-\frac{d(d-1)}{2}$ to generate all signals at the output of $Q(Y)$. As mentioned in Conjecture 1, the matrix $\mathbf{R}$ contains rows with two equal entries (these entries cancel each other in the formulation). Then, the $\mathbf{Q}$ matrix has some rows with only two entries (i.e., one pair). Based on this fact and the number of times that these pairs are repeated, the subexpression sharing method presented in [9] is used here to obtain the optimized number of pairs in $Q_{1}$, i.e., $n_{s}$. In the following, we give an illustrative example for the proposed multiplier architecture.

### 3.2.3 An Illustrative Example

Table 3.2: Contents of variables in the proposed architecture for LSD-first DL-SIPO type 4 GNB multiplier over $G F\left(2^{7}\right)$.

| Clock $j$ | $A$ | $Y$ | Acc | $Z$ |
| :---: | :---: | :---: | :---: | :---: |
| 0 | - | $B=1100011$ | - | 0000000 |
| 1 | 11 | 1100011 | 0111010 | 0111010 |
| 2 | 00 | 0001111 | 0000000 | 1101001 |
| 3 | 01 | 0111100 | 1100111 | 1000000 |
| 4 | 10 | 1110001 | 1111010 | $C^{2}=1111000$ |

We consider the multiplication matrix $\mathbf{R}$ for type $T=4$ GNB over $G F\left(2^{7}\right)$ as follows:

$$
\mathbf{R}=\left(\begin{array}{cccc}
0 & 2 & 5 & 6 \\
1 & 3 & 4 & 5 \\
2 & 5 & \underline{3} & \underline{3} \\
2 & 6 & \underline{0} & \underline{0} \\
1 & 2 & 3 & 6 \\
1 & 4 & 5 & 6
\end{array}\right)_{(6 \times 4)} . \tag{3.14}
$$

This matrix can be obtained from the location of non-zero entries (excluding the first row) of the multiplication matrix $\mathbf{M}$ as

$$
\mathbf{M}=\left(\begin{array}{lllllll}
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 1
\end{array}\right)_{7 \times 7}
$$

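With the digit size $d=2$, the matrix $\mathbf{Q}_{(11 \times 4)}$ is generated by stacking $\mathbf{R}^{(0)}$ and $\mathbf{R}^{(1)}$ and keeping their common row $(0,2,5,6)$ only once:

$$
\mathbf{Q}=\left(\begin{array}{cccc}
0 & 2 & 5 & 6 \\
1 & 3 & 4 & 5 \\
2 & 5 & \underline{3} & \underline{3} \\
2 & 6 & \underline{0} & \underline{0} \\
1 & 2 & 3 & 6 \\
1 & 4 & 5 & 6 \\
1 & 3 & 6 & 0 \\
2 & 4 & 5 & 6 \\
3 & 6 & \underline{4} & \underline{4} \\
3 & 0 & \underline{1} & \underline{1} \\
2 & 3 & 4 & 0
\end{array}\right)_{(11 \times 4)} .
$$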

In this matrix, $\mathbf{R}^{(1)}$ is obtained by adding "$1 \bmod 7$" to the $(i, j)$-th entry of $\mathbf{R}=\mathbf{R}^{(0)}$. As one can see, the number of rows in this matrix is $v_{s}=2 \times(7-1)-\binom{2}{2}=11$ (as $\mathbf{R}^{(0)}$ and $\mathbf{R}^{(1)}$ have a common row which is removed from this matrix) and it has $2 d=4$ rows with just two entries (as the equal underlined entries cancel each other in those four rows). We first collect these pairs (in rows with two entries), i.e., $(2,5),(2,6),(3,6)$, and $(0,3)$, as a pair set to initialize the $\mathbf{Q}_{1}$ matrix. The numbers of times that these pairs are repeated are $2,3,2$, and 2, respectively. Then, applying the common subexpression elimination algorithm presented in [9], one can obtain the pairs inside the matrix $\mathbf{Q}_{1}$ as $\mathbf{Q}_{1}=\left\{y_{25}, y_{26}, y_{36}, y_{03}, y_{05}, y_{13}, y_{45}, y_{16}, y_{24}\right\}$, where $y_{i j}=y_{i}+y_{j}$ and $n_{s}=9$ is the number of pairs in $Q_{1}$. Also, each row in $\mathbf{Q}$ needs $\left(\frac{T}{2}-1\right)$ gates, except for the rows with only two entries (of which there are $2 d$ here), and there are $v_{s}$ rows in total; hence $v_{s}\left(\frac{T}{2}-1\right)-2 d=7$ XOR gates are required in block $Q_{2}$ to produce the outputs of $Q(Y)$. The architecture of the proposed multiplier over $G F\left(2^{7}\right)$ for $d=2$ is depicted in Fig. 3.3c. Therefore, the complexity of the presented improved DL-SIPO multiplier is $n_{s}+v_{s}\left(\frac{T}{2}-1\right)-2 d+d m=30$ XOR gates. Note that the unoptimized structure (without common subexpression sharing) requires $\left(d(m-1)-\frac{d(d-1)}{2}\right)(T-1)-2 d+d m=43$ XOR gates and the architecture proposed in [7] requires $m(d T+1)-d=61$ XOR gates. Also, the critical-path delay is $T_{A}+4 T_{X}$.
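The row count $v_{s}$ and the two-entry rows quoted above can be reproduced from $\mathbf{R}$ in (3.14) with a short script. This is a minimal sketch, assuming that rows are compared after equal entries have cancelled; the helper names `reduce_row` and `build_q` are hypothetical, and the common subexpression elimination of [9] is not attempted here.

```python
# Matrix R of (3.14) for the type 4 GNB over GF(2^7).
R = [
    [0, 2, 5, 6],
    [1, 3, 4, 5],
    [2, 5, 3, 3],
    [2, 6, 0, 0],
    [1, 2, 3, 6],
    [1, 4, 5, 6],
]
m = 7

def reduce_row(row):
    """Cancel equal entries in pairs (y + y = 0 over GF(2)) and sort the rest."""
    kept = [e for e in set(row) if row.count(e) % 2 == 1]
    return tuple(sorted(kept))

def build_q(R, m, d):
    """Stack R^(l) = (R + l) mod m for l = 0..d-1, keeping each distinct row once."""
    rows = []
    for l in range(d):
        for row in R:
            shifted = reduce_row([(e + l) % m for e in row])
            if shifted not in rows:
                rows.append(shifted)
    return rows

Q = build_q(R, m, d=2)
pairs = [row for row in Q if len(row) == 2]
print(len(Q))    # 11
print(pairs)     # [(2, 5), (2, 6), (3, 6), (0, 3)]
```

For $d=2$ the script finds 11 distinct rows and the four initial pairs $(2,5),(2,6),(3,6),(0,3)$; for $d=3$ it finds 15 rows, in agreement with $v_{s}=d(m-1)-\frac{d(d-1)}{2}$.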

Figure 3.4: Comparison among the numbers of XOR gates required in the original and the improved digit-level SIPO multiplier architectures [7] for (a) type $T=4$ GNB over $G F\left(2^{163}\right)$ and (b) type $T=6$ GNB over $G F\left(2^{283}\right)$.

For the multiplier operation, as one can see in Fig. 3.3c, operand $A$ is grouped into four digits as $A_{0}=\left(a_{0}, a_{1}\right), A_{1}=\left(a_{2}, a_{3}\right), A_{2}=\left(a_{4}, a_{5}\right)$, and $A_{3}=\left(a_{6}, 0\right)$, each with the size of two bits, i.e., $d=2$. Before starting the clock, the register $\langle Y\rangle$ is loaded with the coordinates of $B=\left(b_{0}, b_{1}, \cdots, b_{6}\right)$ and register $\langle Z\rangle$ is cleared to zero, i.e., $\langle Z\rangle=(0,0, \cdots, 0)$. Then, in the first clock cycle, the two bits of the LSD, i.e., $a_{0}$ and $a_{1}$ of operand $A$, are the inputs of the corresponding AND gates. One can realize that after $q=\left\lceil\frac{7}{2}\right\rceil=4$ clock cycles, the result $C^{2^{d-r}}$ is available in parallel at register $\langle Z\rangle$. The contents of the registers are given in Table 3.2 for $A=B=(1100011)$. Note that, as mentioned before, the result of multiplication $C=A B$ is obtained after one $(d-r=1)$ left cyclic shift of the content of register $\langle Z\rangle$ at the last clock cycle, i.e., $C=(Z(q) \ll 1)=1110001$.
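The register contents in Table 3.2 can be reproduced with a small cycle-level model. The sketch below is a behavioral model that follows (3.9) and (3.13) only; it does not describe the gate-level $Q_{1}$/$Q_{2}$ structure. It assumes $P(Y)$ holds the coordinates of $\beta Y$ and uses the matrix $\mathbf{M}$ given above; the names `rotl`, `P`, and `dl_sipo_trace` are hypothetical.

```python
# Multiplication matrix M for the type 4 GNB over GF(2^7) (Section 3.2.3).
M = [
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
]
m = 7

def rotl(v, k):
    """k-fold left cyclic shift; a negative k gives a right cyclic shift."""
    k %= len(v)
    return v[k:] + v[:k]

def xor(u, v):
    return [a ^ b for a, b in zip(u, v)]

def P(y):
    """Coordinates of beta * Y (assumed reading of P(Y))."""
    out = [0] * m
    for k, bit in enumerate(y):
        if bit:
            out = xor(out, M[k])
    return out

def dl_sipo_trace(a, b, d):
    """Return (C, trace), where trace lists (Y, Acc, Z) for each clock cycle."""
    q = -(-m // d)                  # number of clock cycles, ceil(m / d)
    r = q * d - m                   # padding zeros in the last digit
    a = list(a) + [0] * r
    Y, Z, trace = list(b), [0] * m, []
    for i in range(q):
        acc = [0] * m
        for j in range(d):          # J_j: bit a_{id+j} gates P(Y << j) >> j
            if a[i * d + j]:
                acc = xor(acc, rotl(P(rotl(Y, j)), -j))
        Z = xor(rotl(Z, d), acc)    # accumulation step of (3.13)
        trace.append((Y, acc, Z))
        Y = rotl(Y, d)              # Y now holds B^(2^-(i+1)d)
    return rotl(Z, d - r), trace    # final (d - r)-fold left shift gives C

def bits(v):
    return "".join(map(str, v))

C, trace = dl_sipo_trace([1, 1, 0, 0, 0, 1, 1], [1, 1, 0, 0, 0, 1, 1], d=2)
for clk, (Y, acc, Z) in enumerate(trace, start=1):
    print(clk, bits(Y), bits(acc), bits(Z))
print("C =", bits(C))               # C = 1110001
```

Each printed row corresponds to one clock cycle of Table 3.2, and the final left shift recovers $C=1110001$.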

### 3.2.4 Simulations

To compare the complexity of the proposed improved DL-SIPO GNB multiplier with that of its counterpart, MATLAB code was written to generate the common pairs and signals used in the blocks $Q_{1}$ and $Q_{2}$ of the proposed architecture in Fig. 3.3a. The simulation results of the algorithm for the improved DL-SIPO GNB multiplier for $T=4$ over $G F\left(2^{163}\right)$ and $T=6$ over $G F\left(2^{283}\right)$ are plotted in terms of different digit sizes in Figs. 3.4a and 3.4b, respectively.


Figure 3.5: (a) The architecture of the improved digit-level PISO GNB multiplier architecture with the LSD-first output. (b) The improved architecture of type 4 GNB multiplier over $G F\left(2^{7}\right)$ and $d=2$.


Figure 3.6: Comparison among the numbers of XOR gates required in the original and improved digit-level PISO multiplier architectures for (a) type $T=4$ GNB over $G F\left(2^{163}\right)$ and (b) type $T=6$ GNB over $G F\left(2^{283}\right)$.

### 3.3 New Architecture for Digit-Level PISO GNB multiplier

### 3.3.1 Low-Complexity Digit-Level PISO GNB Multiplier

In this subsection, we present a low-complexity architecture for the digit-level PISO GNB multiplier presented in Chapter 2. The improvement of the new architecture is based on a formulation of the multiplication operation, which is given in the following.

### 3.3.1.1 Improved Architecture

In this section, similar to the previous section, we present an improved architecture for the DL-PISO GNB multiplier and reduce its area complexity. As shown in Fig. 2.2, the digit-level PISO multiplier architecture has several BTX blocks that use the same combination of the input operand $B$ (preloaded in the register $\langle Y\rangle$). We combine the computations of the parallel computed functions into a $Q$ block (which is the same as the one presented in the previous section for the DL-SIPO architecture) as illustrated in the architecture in Fig. 3.5. As shown in this figure, the $y_{1+d}$'s are removed from the block $Q$ as they correspond to the lines on the $v_{s}$-bus connected to the register $\langle Y\rangle$. The $v_{s}$-bus contains all signals needed to generate all different terms required in (2.14). These signals are implemented by the blocks $Q_{1}$ and $Q_{2}$ inside the $Q$ block. We first use the block $Q_{1}$ to implement all pairs required for all signals in (2.14). In this architecture, each $J$ block consists of $m$ 2-input AND gates to implement (2.15). Then, a level of XOR trees is utilized to implement all $z_{0}, z_{1}, \cdots, z_{d-1}$ coordinates in (2.15). The proposed improved architecture provides the LSD of the multiplication at the first clock cycle (LSD-first).

For the purpose of illustration, the improved architecture of DL-PISO $(d=2)$ for type 4 GNB over $G F\left(2^{7}\right)$ is shown in Fig. 3.5b. As shown in this figure, the $Q_{1}$ and $Q_{2}$ blocks are generated for the given matrix $\mathbf{R}$ in (3.14). The registers $\langle X\rangle$ and $\langle Y\rangle$ should be initialized with the coordinates of $A$ and $B$ and then after each clock cycle two bits of $C=A B$ become available at the output.

In the following, we derive the complexity of the improved LSD-first DL-PISO GNB multiplier.

### 3.3.1.2 Complexities

To determine the area and time complexities of the presented architecture, the following is stated.

Proposition 3.3. For type $T$ GNB over $G F\left(2^{m}\right)$, the improved digit-level PISO GNB multiplier requires $d m$ AND gates and $n_{s}+v_{s} \times\left(\frac{T}{2}-1\right)+d(m-1)$ XOR gates. Also, the critical-path delay of the improved architecture is the same as the original structure, i.e., $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2} m\right\rceil\right) T_{X}$.

Proof. The proof is similar to the one presented in Subsection 3.2.2.1.

We further optimize the number of XOR gates required for the improved LSD-first DL-PISO GNB multiplier in a manner similar to the one proposed for the DL-SIPO multiplier. The simulation results obtained for different digit sizes are plotted in Figs. 3.6a and 3.6b for $m=163$ and $m=283$, respectively. As one can see, the improved architecture requires fewer XOR gates.

### 3.3.2 Complexity Comparison

Table 3.3: Comparison of the most recently proposed type $T$ digit-level GNB multipliers over $G F\left(2^{m}\right)$ with parallel outputs.

| Multiplier Architecture | # AND gates | # XOR gates ${ }^{1}$ | # Reg. | Critical-path delay | Output | Input $A$ | Input $B$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| WL-PIPO [45] | $d m$ | $2 v_{p} \cdot T+d$ | $3 m$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(d+1)\right\rceil\right) T_{X}$ | Parallel | Parallel | Parallel |
| DL-PIPO [5] | $d m$ | $\leq v_{p} \cdot T+\frac{d}{2}(m+1)$ | $3 m$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(d+1)\right\rceil\right) T_{X}$ | Parallel | Parallel | Parallel |
| DL-PIPO [9] | $d m$ | $\leq v_{p}(T-1)+d m$ | $3 m$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(d+1)\right\rceil\right) T_{X}$ | Parallel | Parallel | Parallel |
| DL-SIPO [7] | $d m$ | $2 v_{p} \cdot T+d(T-1)+m$ | $2 m$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2} d\right\rceil+1\right) T_{X}$ | Parallel | Serial | Parallel |
| DL-SIPO (Fig. 3.3) ${ }^{2}$ | $d m$ | $\leq v_{s}(T-1)+d m$ | $2 m$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(d+1)\right\rceil\right) T_{X}$ | Parallel | Serial | Parallel |

1. $v_{p}=\frac{d(m-1)}{2}$ and $v_{s}=d(m-1)-\frac{d(d-1)}{2}$.
2. Without applying common subexpression elimination algorithm.

Figure 3.7: The architecture of proposed bit-parallel GNB multiplier

In Table 3.3, the time and area complexities of the presented DL-SIPO multiplier (before applying the common subexpression elimination algorithm) are compared with those of the DL-SIPO [7], DL-PISO [5], and DL-PIPO [45] multipliers, as they appear to be the most recently proposed works available in the literature. It is noted that our presented multiplier architecture (Fig. 3.3) requires fewer gates than the previously proposed DL-SIPO [7] and DL-PIPO [45] multipliers. Also, as seen in this table, in terms of time complexity our presented multiplier (Fig. 3.3) compares favorably with the DL-SIPO [7]. Moreover, in Fig. 3.4, the area complexity of the improved architecture over $G F\left(2^{163}\right)$ and $G F\left(2^{283}\right)$ after applying the common subexpression elimination algorithm is illustrated in terms of different digit sizes and compared with that of its counterpart [7]. As illustrated in Figs. 3.4 and 3.6, the presented improved architectures require fewer XOR gates than the one proposed in [7] and the original one proposed in [5], respectively.

In the following section, we propose a new bit-parallel multiplier.

### 3.4 An Extension to Bit-Parallel GNB Multiplier

Based on the formulation used in the previous sections, we present a new bit-parallel GNB multiplier over $G F\left(2^{m}\right)$ in this section. The proposed digit-level GNB multiplier architectures can be easily scaled up to the bit-parallel type. To obtain the bit-parallel multiplier, one can implement (2.4) in hardware for all $c_{l}, 0 \leq l \leq m-1$. Thus, the hardware architecture of a bit-parallel multiplier is obtained by implementing $m$ copies of identical structures used for $c_{0}$ with cyclic shifts of their inputs.

The architecture of the proposed bit-parallel GNB multiplier is depicted in Fig. 3.7. In Propositions 3.1 and 3.3 for the DL-PIPO and DL-PISO multiplier architectures, we defined $n_{p}$ and $n_{s}$ as the number of pairs (inside the blocks $\rho_{1}$ and $Q_{1}$) to build the $\rho$ and $Q$ blocks, respectively. For a bit-parallel architecture, the upper bound for the number of pairs in these blocks is equal to the number of all combinations of two coordinates of $A$, i.e., $n=\binom{m}{2}=\frac{m(m-1)}{2}$ combinations. Note that for $T=2$, $n=\frac{m(m-1)}{2}$ and the block $\rho_{2}$ connects its input bus to the next bus without using any XOR gates. Note that the exact complexities of $Q_{1}$ and $Q_{2}$ depend on the GNB. However, one can find the upper bound for the number of XOR gates and time delay of this structure as follows.

Proposition 3.4. For Type $T$ GNB over $G F\left(2^{m}\right)$, the proposed bit-parallel $G N B$ multiplier architecture requires $m^{2} A N D$ gates and at most $(T+4)\left(\frac{m(m-1)}{4}\right)$ XOR gates with the critical path delay of

$$
\begin{equation*}
T_{C}=T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2} m\right\rceil\right) T_{X}, \tag{3.15}
\end{equation*}
$$

where $T_{A}$ and $T_{X}$ are the time delay of a two-input AND gate and an XOR gate, respectively.

Proof. The proof can be obtained by equating $n=n_{s}=n_{p}=\frac{m(m-1)}{2}$ in Propositions 3.1 and 3.3. Then, one can obtain the upper bound for the total number of XOR gates as $\frac{m(m-1)}{2}+\frac{m(m-1)(T-2)}{4}+m(m-1)=(T+4)\left(\frac{m(m-1)}{4}\right)$.

The critical-path delay of the proposed architecture can be obtained by adding the delays of the $Q_{1}$, $Q_{2}$, and $J$ blocks and the $G F\left(2^{m}\right)$ adders, which are $T_{X}$, $\left\lceil\log _{2} \frac{T}{2}\right\rceil T_{X}$, $T_{A}$, and $\left\lceil\log _{2} m\right\rceil T_{X}$, respectively. This results in the total delay of $T_{X}+\left\lceil\log _{2} \frac{T}{2}\right\rceil T_{X}+T_{A}+\left\lceil\log _{2} m\right\rceil T_{X}=T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2} m\right\rceil\right) T_{X}$, which completes the proof.
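The algebra used in this proof can be spelled out step by step. The lines below expand the XOR bound of Proposition 3.4, evaluate it for the values of $T$ listed in Table 3.4, and show the ceiling identity used for the delay.

$$
\begin{aligned}
\frac{m(m-1)}{2}+\frac{m(m-1)(T-2)}{4}+m(m-1) & =\frac{m(m-1)}{4}\bigl(2+(T-2)+4\bigr)=(T+4)\frac{m(m-1)}{4}, \\
T=2 & : \quad \tfrac{6}{4} m(m-1)=1.5\, m(m-1), \\
T=4 & : \quad 2 m(m-1)=2 m^{2}-2 m, \\
T=6 & : \quad 2.5\, m(m-1)=2.5 m^{2}-2.5 m, \\
1+\left\lceil\log _{2} \tfrac{T}{2}\right\rceil & =1+\left\lceil\log _{2} T-1\right\rceil=\left\lceil\log _{2} T\right\rceil .
\end{aligned}
$$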

### 3.4.1 Comparison

The time and area complexities of the proposed bit-parallel GNB multiplier and the previous schemes are compared in Table 3.4 for general and special values of $T$. As shown in this table, the critical path delay of the proposed multiplier matches the fastest results available in the literature. For type $T=2$ GNB, the number of XOR gates also matches the best comparable result available in the open literature, i.e., $1.5 m(m-1)$. However, it is much greater than the sub-quadratic results proposed in [63] and [59], which incur a much higher delay than the one proposed here. It is interesting to note that for $T>2$, the proposed multiplier requires a smaller area than its most recently proposed counterparts with the same delay, as shown in this table.

It should be noted that, to obtain the exact number of XOR gates for a given GNB, the exact value of $n$ should be obtained by simulations. Using the complexity reduction algorithm proposed in Section 3.1, a comparison between the number of XOR gates of bit-parallel GNB multipliers is illustrated in Table 2 for $G F\left(2^{163}\right)$ and $G F\left(2^{283}\right)$ fields recommended by NIST for ECDSA.

Table 3.4: Area and time complexity comparison of bit-parallel GNB multipliers over $G F\left(2^{m}\right)$. Note that for Type T GNB: $C_{N} \leq T m-T+1$.

| Multiplier | type $T \geq 2$ |  |  |
| :---: | :---: | :---: | :---: |
|  | \#AND | \#XOR | Critical path |
| Massey \& Omura [35] | $m^{2}$ | $m\left(C_{N}-1\right)$ | $T_{A}+\left\lceil\log _{2} C_{N}\right\rceil T_{X}$ |
| Gao \& Sobelman[51] | $m^{2}$ | $m\left(C_{N}-1\right)$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| [50] | $m^{2}$ | $\leq \frac{m}{2}\left(C_{N}+m-2\right)$ | $T_{A}+\left(\left\lceil\log _{2}\left(C_{N}+1\right)\right\rceil\right) T_{X}$ |
| DLGMp [5], [6] $(d=m)$ | $m^{2}$ | $\leq \frac{m}{2}\left(C_{N}+m\right)$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| DLGMs [5] $(d=m)$ | $m^{2}$ | $\leq \frac{m(m-1)}{2}(T+1)$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| DL-PIPO [45] $(d=m)$ | $m^{2}$ | $\leq T m(m-1)+m$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| DL-SIPO [7] $(d=m)$ | $m^{2}$ | $\leq(T-1) m^{2}+m(m-1)$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| This work | $m^{2}$ | $\leq\left(\frac{m(m-1)}{4}\right)(T+4)$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
|  | $T=2$ |  |  |
| [35, 51] | $m^{2}$ | $2 m(m-1)$ | $T_{A}+\left\lceil\log _{2}(2 m-1)\right\rceil T_{X}$ |
| Koc \& Sunar [48] | $m^{2}$ | $1.5 m(m-1)$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
| Fan \& Hasan [59] | $2 m^{1.6}$ | $11 m^{1.6}-12 m+1$ | $T_{A}+\left(2 \log _{2} m+1\right) T_{X}$ |
| Gathen et. al [63] | $2 m^{1.6}$ | $7.6 m^{1.6}+\mathcal{O}(m \log m)$ | $T_{A}+\left(2 \log _{2} m+1\right) T_{X}$ |
| [50, 5, 6], This work | $m^{2}$ | $1.5 m(m-1)$ | $T_{A}+\left(1+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ |
|  | $T=4$ |  |  |
| [35], [51] | $m^{2}$ | $4 m^{2}-4 m$ | $T_{A}+\left(2+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| [50] | $m^{2}$ | $2.5 m^{2}-4.5 m$ | $T_{A}+\left\lceil 1+\log _{2}(2 m-1)\right\rceil T_{X}$ |
| DLGMp [5], [6] $(d=m)$ | $m^{2}$ | $2.5 m^{2}-1.5 m$ | $T_{A}+\left(2+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| DLGMs [5] $(d=m)$ | $m^{2}$ | $2.5 m^{2}-2.5 m$ | $T_{A}+\left(2+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| DL-PIPO [45] $(d=m)$ | $m^{2}$ | $4 m^{2}-3 m$ | $T_{A}+\left(2+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| DL-SIPO [7] $(d=m)$ | $m^{2}$ | $4 m^{2}-m$ | $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| This work | $m^{2}$ | $\leq 2 m^{2}-2 m$ | $T_{A}+\left(2+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
|  | $T=6$ |  |  |
| [35], [51] | $m^{2}$ | $6 m^{2}-6 m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| [50] | $m^{2}$ | $3.5 m^{2}-3.5 m$ | $T_{A}+\left(\left\lceil\log _{2}(6 m-4)\right\rceil\right) T_{X}$ |
| DLGMp [5], [6] $(d=m)$ | $m^{2}$ | $3.5 m^{2}-2.5 m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| DLGMs [5] $(d=m)$ | $m^{2}$ | $3.5 m^{2}-3.5 m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| DL-PIPO [45] $(d=m)$ | $m^{2}$ | $6 m^{2}-5 m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| DL-SIPO [7] $(d=m)$ | $m^{2}$ | $6 m^{2}-m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |
| This work | $m^{2}$ | $\leq 2.5 m^{2}-2.5 m$ | $T_{A}+\left(3+\left\lceil\log _{2}(m)\right\rceil\right) T_{X}$ |

Table 3.5: FPGA implementation of BL-SIPO (Fig. 2.1) multiplier for type 4 over $G F\left(2^{163}\right)$ on xc4vlx100-ff1148 device.

| Multiplier | CPD [ns] | FF | LUT | Slice | Time $[\mathrm{ns}]$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| BL-SIPO | 1.9 | 326 | 486 | 323 | 309.7 |

Table 3.6: ASIC synthesis results for BL-SIPO (Fig. 2.1) multiplier for type 4 over $G F\left(2^{163}\right)$.

| Multiplier | CPD $[n \mathrm{~s}]$ | Area $\left[\mu \mathrm{m}^{2}\right]$ | Time $[\mathrm{ns}]$ |
| :---: | :---: | :---: | :---: |
| BL-SIPO | 0.34 | 6817.2 | 55.42 |

### 3.5 FPGA and ASIC Implementations

In this section, we implement the architectures presented in the previous sections to evaluate their area and time requirements. We have selected the Xilinx${ }^{\circledR}$ Virtex${ }^{\text {TM }}$-4 xc4vlx100-ff1148 device as the target FPGA. In terms of available resources, the xc4vlx100-ff1148 contains 49,152 slices (98,304 LUTs and 98,304 registers). Each slice contains two flip-flops (FFs) and two 4-input look-up tables (LUTs) [64].

The proposed multiplier architectures are modeled in VHDL and synthesized for different digit sizes using $\mathrm{XST}^{\mathrm{TM}}$ of Xilinx${ }^{\circledR}$ $\mathrm{ISE}^{\mathrm{TM}}$ version 12.1 design software. Also, a 65-nm Complementary Metal-Oxide-Semiconductor (CMOS) library has been chosen for the synthesis on application-specific integrated circuit (ASIC) technology. The proposed architectures are synthesized using Synopsys${ }^{\circledR}$ Design Vision${ }^{\circledR}$, which is a GUI for the Synopsys${ }^{\circledR}$ Design Compiler${ }^{\circledR}$ tools. The correctness of the multiplier architectures is verified by the Xilinx${ }^{\circledR}$ $\mathrm{ISE}^{\mathrm{TM}}$ Simulator (ISim), and $m$-bit 2-to-1 multiplexers are used to preload operands to the registers in each architecture. For the FPGA implementations, the optimization goal is set to speed (i.e., the default) and the optimization effort is set to normal; the area (Slices, LUTs, and FFs) and the critical-path delays (CPD, in ns) are obtained for different digit sizes. It is noted that all FPGA results are obtained after place and route. For the ASIC implementations, the map effort is set to medium with a target clock period of 5 ns, and the area $\left(\mu \mathrm{m}^{2}\right)$ and timing (ns) are obtained for each of the designs.

We first implemented the LSB-first BL-SIPO (Fig. 2.1) multiplier and the results are tabulated in Table 3.5 and 3.6, for FPGA (after post place and route) and ASIC (after synthesis), respectively. Then, we have implemented the proposed architectures for LSD-first SIPO, digit-level PISO, and digit-level PIPO, multipliers for different

Table 3.7: FPGA (Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-4$ xc4vlx100-ff1148 device) and ASIC (65-nm CMOS library) synthesis results for the improved DL-SIPO (Fig. 3.3) multiplier architectures for type 4 GNB over $G F\left(2^{163}\right)$ for different digit sizes.

| digit <br> size | $\left[\frac{m}{d}\right\rceil$ |  | Flices |  |  |  |  | FF | LUT | CPD $[\mathrm{ns}]$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Area $\left[\mu m^{2}\right]$ | CPD $[n \mathrm{~s}]$ | $\mathrm{T}[n \mathrm{~s}]$ |  |  |  |  |  |  |  |
| 11 | 15 | 1,691 | 326 | 3,365 | 4.8 | 72.0 | $34,278.4$ | 0.93 | 13.95 |  |
| 21 | 8 | 3,099 | 326 | 6,185 | 5.8 | 46.4 | 63,283 | 1.56 | 12.48 |  |
| 33 | 5 | 5,739 | 326 | 10,281 | 6.3 | 31.5 | $97,420.4$ | 2.16 | 10.80 |  |
| 41 | 4 | 7,229 | 326 | 12,783 | 6.5 | 26.0 | 120,295 | 2.57 | 10.28 |  |
| 55 | 3 | 9,323 | 326 | 16,715 | 6.7 | 20.1 | $160,298.3$ | 3.25 | 9.75 |  |

Table 3.8: FPGA (Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-4$ xc4vlx100-ff1148 device) and ASIC (65-nm CMOS library) synthesis results for the improved DL-PISO (Fig. 3.5) multiplier architecture for type 4 GNB over $G F\left(2^{163}\right)$ for different digit sizes.

| digit <br> size | $\left\lceil\frac{m}{d}\right\rceil$ | FPGA Implementation |  |  |  |  | ASIC Synthesis |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | Slices | FF | LUT | CPD [ns] | T [ns] | Area $\left[\mu m^{2}\right]$ | CPD [ns] | T [ns] |
| 11 | 15 | 1,899 | 444 | 3,912 | 5.7 | 85.5 | $34,837.4$ | 1.38 | 20.70 |
| 21 | 8 | 3,754 | 408 | 6,995 | 6.1 | 48.8 | $63,397.2$ | 1.85 | 14.80 |
| 33 | 5 | 5,908 | 365 | 10,735 | 6.8 | 34.0 | $97,804.2$ | 2.37 | 11.85 |
| 41 | 4 | 7,385 | 378 | 13,218 | 6.9 | 27.6 | 121,356 | 2.94 | 10.96 |
| 55 | 3 | 9,678 | 419 | 17,348 | 7.3 | 21.9 | $161,494.8$ | 3.85 | 10.65 |

digit sizes. The results of the implementations for different digit sizes are reported in Tables 3.7, 3.8, and 3.9. As one can see, the digit-level PIPO multiplier architecture requires the smallest area for both FPGA and ASIC implementations. Moreover, it is faster than the other multiplier architectures. We note that the critical-path delay of the proposed multiplier architectures can be reduced by pipelining them, which maintains high-throughput performance. For any particular application, the digit size should be chosen to achieve the highest performance in light of the time-area trade-offs.
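
The relation between the digit size, the number of clock cycles, and the multiplication time in the tables can be checked with a short script. The sketch below assumes that the T column equals $\left\lceil\frac{m}{d}\right\rceil$ multiplied by the CPD, which is what the tabulated DL-PIPO values of Table 3.9 suggest; the names `PIPO_CPD_NS` and `mult_time_ns` are illustrative and do not come from the thesis.

```python
import math

M_BITS = 163  # field size m for GF(2^163)

# digit size d -> (FPGA CPD [ns], ASIC CPD [ns]), copied from Table 3.9 (DL-PIPO)
PIPO_CPD_NS = {
    11: (4.7, 0.91),
    21: (4.9, 1.48),
    33: (5.4, 2.16),
    41: (5.6, 2.59),
    55: (5.8, 3.39),
}

def mult_time_ns(d, cpd_ns, m=M_BITS):
    """Assumed model: one multiplication takes ceil(m/d) cycles of length CPD."""
    return math.ceil(m / d) * cpd_ns

for d, (fpga_cpd, asic_cpd) in PIPO_CPD_NS.items():
    q = math.ceil(M_BITS / d)
    print(d, q, round(mult_time_ns(d, fpga_cpd), 2), round(mult_time_ns(d, asic_cpd), 2))
```

Under this model, a larger digit size lowers the cycle count faster than it raises the CPD, which is why T decreases down each table while the area grows.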

### 3.6 Conclusion

In this chapter, we have proposed three improved multiplier architectures, namely DL-PIPO, DL-PISO, and DL-SIPO, for digit-level GNB multiplication. We have proposed a complexity reduction algorithm to reduce the complexity of each multiplier. Then, we have derived the area and time complexities of the proposed architectures and compared them with the counterparts in the literature. It has been shown that

Table 3.9: FPGA (Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-4$ xc4vlx100-ff1148 device) and ASIC (65-nm CMOS library) synthesis results for the improved DL-PIPO (Fig. 3.1) multiplier architecture for type 4 GNB over $G F\left(2^{163}\right)$ for different digit sizes.

| digit <br> size | $\begin{gathered} q= \\ \left\lceil\frac{m}{d}\right\rceil \\ \hline \end{gathered}$ | FPGA Implementation |  |  |  |  | ASIC Synthesis |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | Slices | FF | LUT | CPD [ $n \mathrm{~s}$ ] | T [ $n \mathrm{~s}$ ] | Area $\left[\mu m^{2}\right]$ | CPD [ $n \mathrm{~s}$ ] | T [ $n \mathrm{~s}$ ] |
| 11 | 15 | 1,563 | 495 | 2,399 | 4.7 | 70.5 | 28,667 | 0.91 | 13.65 |
| 21 | 8 | 2,545 | 532 | 4,261 | 4.9 | 39.2 | 52,663 | 1.48 | 11.84 |
| 33 | 5 | 4,033 | 554 | 7,194 | 5.4 | 27.0 | 80,566 | 2.16 | 10.8 |
| 41 | 4 | 4,628 | 502 | 8,503 | 5.6 | 22.4 | 99,546 | 2.59 | 10.36 |
| 55 | 3 | 6,484 | 500 | 11,412 | 5.8 | 17.4 | 132,225 | 3.39 | 10.17 |

the proposed architectures require smaller area than the leading ones in the literature when area and time complexities are compared. To study the application of the proposed multiplier architectures, we have implemented them on FPGA and ASIC and compared the results. We also extended the DL-PISO multiplier architecture to a bit-parallel architecture, and its time and area complexities are also compared with those of its counterparts. As seen from the FPGA and ASIC implementation results, the DL-PIPO multiplier architecture requires the smallest area and runs at the highest clock frequencies in comparison to the DL-SIPO and DL-PISO architectures. These multiplier architectures are suitable for applications such as exponentiation and point multiplication on binary elliptic curves where GNB multiplication is desired. In the next chapter, we employ the DL-PIPO multiplier architecture to design an ECC-based crypto-processor. We also provide an efficient pipelined architecture for this multiplier.

## Chapter 4

## Efficient FPGA Implementation of Point Multiplication over Binary Edwards and Generalized Hessian Curves Using Gaussian Normal Basis

In the previous chapter, we presented a low-complexity digit-level parallel-in parallel-out (DL-PIPO) architecture for the Gaussian normal basis multiplier. In this chapter, we efficiently pipeline the proposed DL-PIPO architecture and study its time-area trade-offs. Then, we choose efficient values for the digit size and compare the results with the non-pipelined architecture. We employ the proposed multiplier architecture for efficient implementation of point multiplication over binary elliptic curves, including binary generic, Edwards, and generalized Hessian curves. We demonstrate how parallelization at higher levels can be performed with full resource utilization when computing the point addition and point doubling formulas for the binary Edwards and generalized Hessian curves. We employ the $w$-coordinate differential formulations for computing point multiplication. Using look-up table (LUT) based pipelining and an efficient digit-level GNB multiplier, we evaluate the LUT complexity and time-area trade-offs of the proposed crypto-processor on FPGA. We compare the implementation results of point multiplication on these curves with the ones on the traditional binary generic curve. We note that this is the first FPGA implementation of point multiplication on binary Edwards and generalized Hessian curves represented by $w$-coordinates.

The main contributions of this chapter, which have also been presented in [65], can be summarized as follows:

- We propose an efficient hardware architecture for point multiplication on binary Edwards and generalized Hessian curves incorporating higher-level parallelization and optimum lower-level scheduling. This increases the overall performance through maximum utilization of the available resources.
- We incorporate the $w$-coordinate version of Montgomery's ladder for point multiplication on binary Edwards and generalized Hessian curves using the mixed differential representation.
- For the proposed crypto-processor architecture over $G F\left(2^{m}\right)$, we obtain the optimum digit sizes in terms of time-area trade-offs for the proposed fast and low-complexity digit-level Gaussian normal basis multiplier.
- Finally, we perform efficient FPGA implementations of point multiplication on binary Edwards and generalized Hessian curves over $G F\left(2^{163}\right)$ on a Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-5$ device and investigate the LUT-based time-area efficiency for different digit sizes. We have also implemented ECC on a binary generic curve and compared its FPGA implementation results with the ones obtained for binary Edwards and generalized Hessian curves.

The rest of the chapter is organized as follows. In Section 4.1, preliminaries of arithmetic on binary Edwards and generalized Hessian curves are presented. In Section 4.2, point multiplication and parallelization of point addition and doubling are explained. The proposed hardware architecture for elliptic curve crypto-processor is presented in Section 4.3. In this section, a pipelined version of digit-level PIPO GNB multiplier architecture proposed in the previous chapter is also presented and analyzed in terms of time-area trade-offs for different digit sizes. Section 4.4 presents the results of FPGA implementations for the proposed ECC crypto-processor. Finally, we conclude this chapter in Section 4.5.

### 4.1 Preliminaries

### 4.1.1 Arithmetic over Binary Edwards and Generalized Hessian Curves

It is well known that a non-supersingular binary generic (short Weierstraß) elliptic curve is defined as the set of points $(x, y)$ that satisfy the following equation, together with a special point at infinity $\mathcal{O}$ (the group identity):

$$
\begin{equation*}
E_{a, b} / G F\left(2^{m}\right): y^{2}+x y=x^{3}+a x^{2}+b, \tag{4.1}
\end{equation*}
$$

where $a, b \in G F\left(2^{m}\right)$ and $b \neq 0$ [11]. These curves are called anomalous binary curves or Koblitz curves if $a \in\{0,1\}$ and $b=1$, i.e., if they are defined over $G F(2)$ [66].

Binary Edwards curves belong to a special class of generic elliptic curves defined over binary field when $m \geq 3$ [1]. The merit of binary Edwards curves over generic curves is that they have two special properties of being unified and complete [1]. The former is that the point addition formulations can be used for point doubling while the latter means that point addition formulations can be used for all pairs of inputs on the curve.

Definition 4.1. [1] Let $\mathbb{K}$ be a finite field of characteristic two, i.e., $\operatorname{char}(\mathbb{K})=2$ and $d_{1}$ and $d_{2}$ be the elements of $\mathbb{K}$ with $d_{1} \neq 0$ and $d_{2} \neq d_{1}^{2}+d_{1}$. The binary Edwards curve with coefficients $d_{1}$ and $d_{2}$ is the affine curve

$$
\begin{align*}
& E_{B, d_{1}, d_{2}} / G F\left(2^{m}\right): \\
& d_{1}(x+y)+d_{2}\left(x^{2}+y^{2}\right)=x y+x y(x+y)+x^{2} y^{2} \tag{4.2}
\end{align*}
$$

where $d_{1}, d_{2} \in G F\left(2^{m}\right)$.
Given a point $P=(x, y)$, its negation, $-P$, is obtained as $(y, x)$ which has no cost [1]. The point $(0,0)$ is the neutral element and $(1,1)$ has order 2 [1]. The binary Edwards curves are complete if $\operatorname{Tr}\left(d_{2}\right)=1$, i.e., $d_{2}$ cannot be written as $e^{2}+e$ for any $e$ in $\mathbb{K}$, where $\operatorname{Tr}$ is the absolute trace of $G F\left(2^{m}\right)$ over $G F(2)$ [1].
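
The properties stated above can be checked by brute force on a toy field. The following sketch is illustrative only: it works over $G F\left(2^{5}\right)$ in polynomial basis with the irreducible polynomial $x^{5}+x^{2}+1$ (an assumption made to keep the search small, not the $G F\left(2^{163}\right)$ GNB setting of this thesis), picks $d_{1}=d_{2}=1$, and uses hypothetical helper names (`gf_mul`, `trace`, `on_curve`).

```python
M = 5
POLY = 0b100101  # x^5 + x^2 + 1, irreducible over GF(2) (toy field, assumption)

def gf_mul(a, b):
    """Multiply two elements of GF(2^5) in polynomial basis (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY  # reduce as soon as the degree reaches M
    return r

def trace(a):
    """Absolute trace a + a^2 + ... + a^(2^(M-1)); the result is 0 or 1."""
    t, s = 0, a
    for _ in range(M):
        t ^= s
        s = gf_mul(s, s)
    return t

def on_curve(x, y, d1, d2):
    """Check d1(x+y) + d2(x^2+y^2) = xy + xy(x+y) + x^2 y^2, i.e., Eq. (4.2)."""
    xy = gf_mul(x, y)
    lhs = gf_mul(d1, x ^ y) ^ gf_mul(d2, gf_mul(x, x) ^ gf_mul(y, y))
    rhs = xy ^ gf_mul(xy, x ^ y) ^ gf_mul(xy, xy)
    return lhs == rhs

# Toy parameters: d1 != 0, d2 != d1^2 + d1 = 0, and Tr(d2) = Tr(1) = 1 because M is odd.
D1, D2 = 1, 1
POINTS = [(x, y) for x in range(1 << M) for y in range(1 << M)
          if on_curve(x, y, D1, D2)]

# Elements of the form e^2 + e are exactly the trace-zero elements.
ARTIN_SCHREIER_IMAGE = {gf_mul(e, e) ^ e for e in range(1 << M)}
```

Because Eq. (4.2) is symmetric in $x$ and $y$, every point $(x, y)$ in `POINTS` has its negative $(y, x)$ in the list as well, and $(0,0)$ and $(1,1)$ always appear.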

Definition 4.2. [2] Let $c$ and $d$ be elements of $\mathbb{K}$ such that $c \neq 0$ and $d^{3} \neq 27 c$. The generalized Hessian curve $H_{c, d}$ over $\mathbb{K}$ is defined by the equation

$$
\begin{equation*}
H_{c, d} / G F\left(2^{m}\right): x^{3}+y^{3}+c=d x y \tag{4.3}
\end{equation*}
$$

where $c=1$ results in a Hessian curve, i.e., $H_{d}$.
Note that the generalized Hessian curves are complete if and only if $c$ is not a cube in $\mathbb{K}$.

The standard formulas on generic curves [3] fail in computing addition of two points on curves if one of the points or their addition is at infinity. These possibilities should be tested before designing an elliptic curve cryptosystem. Note that point

Table 4.1: Cost of point operations on binary Edwards curves (BECs), generalized Hessian curves (GHCs), and binary generic curves (BGCs) over $G F\left(2^{m}\right)$ [1], [2], and [3].

| Curve | Curve Parameter | Combined Addition and Doubling ${ }^{1}$ |  |
| :---: | :---: | :---: | :---: |
|  |  | Projective Diff | Mixed Diff |
| BEC [1] | $d_{1} \neq d_{2}$ | $8 \mathbf{M}+4 \mathbf{S}+2 \mathbf{D}$ | $6 \mathbf{M}+4 \mathbf{S}+4 \mathbf{D}$ |
|  | $d_{1}=d_{2}$ | $7 \mathbf{M}+4 \mathbf{S}+2 \mathbf{D}$ | $5 \mathbf{M}+4 \mathbf{S}+2 \mathbf{D}$ |
| GHC [2] | $c \neq 1$ | $7 \mathbf{M}+4 \mathbf{S}+3 \mathbf{D}$ | $5 \mathbf{M}+4 \mathbf{S}+3 \mathbf{D}$ |
|  | $c=1$ | $7 \mathbf{M}+4 \mathbf{S}+2 \mathbf{D}$ | $5 \mathbf{M}+4 \mathbf{S}+2 \mathbf{D}$ |
| BGC [3] | $b \neq 0$ | $7 \mathbf{M}+5 \mathbf{S}+1 \mathbf{D}$ | $5 \mathbf{M}+5 \mathbf{S}+1 \mathbf{D}$ |

${ }^{1}$ $\mathbf{M}$, $\mathbf{S}$, and $\mathbf{D}$ are the costs of a multiplication of two field elements, a squaring, and a multiplication by a constant, respectively.

addition and doubling formulas on binary Edwards and generalized Hessian curves work for all input pairs. This characteristic is called completeness. In what follows, we discuss the point addition and doubling using $w$-coordinates for binary Edwards and generalized Hessian curves.

### 4.1.2 Point Addition and Doubling Using Differential Formulations in $w$-coordinates

Differential addition [13] is the computation of $Q+P$ given the points $Q$, $P$, and $Q-P$. In [1] and [2], the idea of Montgomery's ladder [13] is used to present fast formulas for $w$-coordinate differential addition on binary Edwards and generalized Hessian curves, respectively. Let $w$ be a linear and symmetric function of the coordinates $x$ and $y$ of the point $P$, defined as $w_{i}=x_{i}+y_{i}$, so that $w(P)=w(-P)$. Bernstein et al. [1] have defined $w$-coordinate differential addition as computing $w(Q+P)$ given $w(Q), w(P)$, and $w(Q-P)$. Similarly, $w$-coordinate differential doubling is the computation of $w(2 P)$ given $w(P)$. Therefore, using the $w$-coordinate differential addition and doubling formulas, $w((2 n+1) P)$ and $w(2 n P)$ can be computed recursively given $w(n P)$ and $w((n+1) P)$ [1]. In the following, we revisit the differential addition and doubling formulas for binary Edwards and generalized Hessian curves using $w$-coordinates [1], [2].

Let $P_{1}=\left(x_{1}, y_{1}\right)$ and $P_{2}=\left(x_{2}, y_{2}\right)$ be two affine points on the binary Edwards curve $E_{B, d_{1}, d_{2}}$. Let us define $P_{3}=P_{1}+P_{2}=\left(x_{3}, y_{3}\right)$, $P_{4}=2 P_{2}=\left(x_{4}, y_{4}\right)=\left(x_{2}, y_{2}\right)+\left(x_{2}, y_{2}\right)$, and $P_{0}=P_{2}-P_{1}=\left(x_{0}, y_{0}\right)=\left(x_{2}, y_{2}\right)-\left(x_{1}, y_{1}\right)$. Then, one can write $w_{3}=w\left(P_{1}+P_{2}\right), w_{4}=w\left(2 P_{2}\right)$, and $w_{0}=w\left(P_{2}-P_{1}\right)$ as defined above. In the mixed representation, $w_{1}$ and $w_{2}$ are written as the projective fractions $w_{1}=w\left(P_{1}\right)=W_{1} / Z_{1}$ and $w_{2}=w\left(P_{2}\right)=W_{2} / Z_{2}$, while $w_{0}$ is given as an affine field element. Then, the mixed $w$-coordinate differential addition (MDiffADD) of these two points can be obtained from [1] as

$$
\begin{align*}
& C=W_{1} \cdot\left(Z_{1}+W_{1}\right), D=W_{2} \cdot\left(Z_{2}+W_{2}\right), \\
& E=Z_{1} \cdot Z_{2}, F=W_{1} \cdot W_{2}, V=C \cdot D, W_{3}=V+w_{0} Z_{3}, \\
& Z_{3}=V+\left(\sqrt{d_{1}} \cdot E+\sqrt{d_{2} / d_{1}+1} \cdot F\right)^{2}, \tag{4.4}
\end{align*}
$$

and the formulas for $w$-coordinate doubling (DiffDBL) [1] are

$$
\begin{align*}
& C=W_{2} \cdot\left(Z_{2}+W_{2}\right), W_{4}=C^{2} \\
& Z_{4}=W_{4}+\left(\left(\sqrt[4]{d_{1}} \cdot Z_{2}+\sqrt[4]{d_{2} / d_{1}+1} \cdot W_{2}\right)^{2}\right)^{2} \tag{4.5}
\end{align*}
$$
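
As a consistency check on (4.5), the projective doubling formulas can be expanded back into affine form. This short derivation is not part of the original text; it only uses the fact that squaring is linear in characteristic two, so that $(a+b)^{4}=a^{4}+b^{4}$:

$$
\begin{align*}
\left(\sqrt[4]{d_{1}} \cdot Z_{2}+\sqrt[4]{d_{2} / d_{1}+1} \cdot W_{2}\right)^{4} & =d_{1} Z_{2}^{4}+\left(d_{2} / d_{1}+1\right) W_{2}^{4}, \\
W_{4}=C^{2} & =W_{2}^{2} Z_{2}^{2}+W_{2}^{4}, \\
Z_{4} & =d_{1} Z_{2}^{4}+W_{2}^{2} Z_{2}^{2}+\left(d_{2} / d_{1}\right) W_{2}^{4}, \\
w_{4}=\frac{W_{4}}{Z_{4}} & =\frac{w_{2}^{2}+w_{2}^{4}}{d_{1}+w_{2}^{2}+\left(d_{2} / d_{1}\right) w_{2}^{4}},
\end{align*}
$$

where the last line follows by dividing the numerator and the denominator by $Z_{2}^{4}$ and substituting $w_{2}=W_{2} / Z_{2}$. The two $W_{2}^{4}$ terms in $Z_{4}$ cancel because the field has characteristic two.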

For the generalized Hessian curves, the $w$-coordinate differential addition formulas can be written as follows [2]

$$
\begin{align*}
& A=W_{1} \cdot Z_{2}, B=W_{2} \cdot Z_{1}, C=A B, \\
& U=d^{3} \cdot C, Z_{3}=(A+B)^{2}, \\
& W_{3}=U+w_{0} \cdot Z_{3}, \tag{4.6}
\end{align*}
$$

and similarly for doubling, those are presented as follows [2]:

$$
\begin{align*}
& A=W_{2}^{2}, B=Z_{2}^{2}, C=A+\sqrt{c^{3}\left(d^{3}+c\right)} \cdot B \\
& D=d^{3} \cdot B, W_{4}=C^{2}, Z_{4}=A D \tag{4.7}
\end{align*}
$$

The costs of different coordinates to compute differential addition and doubling are given in Table 4.1 for binary Edwards [1], generalized Hessian [2], and generic curves [3]. Let $\mathbf{M}, \mathbf{S}$, and $\mathbf{D}$ be the costs of multiplication of two field elements, a squaring, and a multiplication by a constant curve parameter, respectively. As illustrated in this table, the mixed $w$-coordinate offers fast and comparable PA and PD formulas. Therefore, we use the mixed $w$-coordinate differential addition and doubling formulas [1]. Note that the difference of two points for differential addition is given in affine, i.e.,
$w_{0}=w\left(P_{2}-P_{1}\right)$. Moreover, the mixed $w$-coordinate addition and doubling formulas are complete which means there is no need to check for the exceptional cases [1]. In order to have efficient computation of point operations, i.e., PAs and PDs, one needs to employ an efficient point multiplication algorithm. In the following section, we give an explanation of using Montgomery's ladder for point multiplication.
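
The mixed-coordinate entries of Table 4.1 can be reproduced by tallying the operations in (4.4) to (4.7). The sketch below is a bookkeeping aid, not part of the original text. It assumes that, in a combined addition-and-doubling step, the product $W_{2} \cdot\left(Z_{2}+W_{2}\right)$ is computed once and shared between (4.4), where it is called $D$, and (4.5), where it is called $C$; the list names are hypothetical.

```python
from collections import Counter

# Tags: "M" = general multiplication, "S" = squaring, "D" = multiplication by a constant.
BEC_ADD = ["M", "M", "M", "M", "M",  # C, D, E, F, V in (4.4)
           "D", "D", "S",            # sqrt(d1)*E, sqrt(d2/d1+1)*F, squaring of their sum
           "M"]                      # w0 * Z3
BEC_DBL = ["S",                      # W4 = C^2, with C shared with D of (4.4) (assumption)
           "D", "D", "S", "S"]       # two constant multiplications, two nested squarings

GHC_ADD = ["M", "M", "M", "D", "S", "M"]  # A, B, C = A*B, U = d^3*C, Z3, w0*Z3 in (4.6)
GHC_DBL = ["S", "S", "D", "D", "S", "M"]  # A, B, constant*B, D = d^3*B, W4, Z4 = A*D in (4.7)

def combined_cost(add_ops, dbl_ops):
    """Return (M, S, D) counts for one combined addition-and-doubling step."""
    total = Counter(add_ops) + Counter(dbl_ops)
    return total["M"], total["S"], total["D"]

print("BEC, d1 != d2:", combined_cost(BEC_ADD, BEC_DBL))
print("GHC, c != 1:  ", combined_cost(GHC_ADD, GHC_DBL))
```

The tallies agree with the Mixed Diff column of Table 4.1 for the general cases $d_{1} \neq d_{2}$ and $c \neq 1$.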

### 4.2 Point Multiplication on Binary Edwards and Generalized Hessian Curves

In this section, we consider Montgomery's ladder [13] and its modified version [3] to present point multiplication algorithm over $w$-coordinates for binary Edwards, generalized Hessian, and binary generic curves. Using combined PA and PD formulations, we explain how parallelization can increase the performance of point multiplication. At the end, the cost of recovering final coordinates of point multiplication is derived.

### 4.2.1 Point Multiplication

The elliptic curve point multiplication is defined in the Abelian group as $Q=k \cdot P=P+P+\cdots+P$ ($k$ times), where $k$ is a positive integer and $Q$ and $P$ are two points on the elliptic curve, $Q, P \in E\left(G F\left(2^{m}\right)\right)$ [3]. The efficiency of point multiplication depends on finding the minimum number of steps to reach $k P$ from a given point $P=\left(x_{0}, y_{0}\right)$. On binary Edwards and generalized Hessian curves, point multiplication can be defined similarly to the one on generic curves [3]. Let $P$ be a point on a binary Edwards curve $E_{B, d_{1}, d_{2}}$ and let us assume that $w(n P)$ and $w((n+1) P)$, $0<n<k$, are known. Then, one can use the $w$-coordinate differential addition and doubling formulas to compute their sum as $w((2 n+1) P)$ and the double of $w(n P)$ as $w(2 n P)$.

Among the different algorithms for point multiplication on elliptic curves, Montgomery's ladder [13] is widely used in the literature. It has a uniform double-and-add structure, which makes it secure against non-differential (simple) side-channel attacks [1], [53]. In [3], an efficient version of Montgomery's algorithm is proposed over $G F\left(2^{m}\right)$. Montgomery's ladder algorithm for point multiplication using mixed $w$-coordinates is provided in Algorithm 4.1. As shown in Step 1 of this algorithm, the point $P=\left(x_{0}, y_{0}\right)$ is converted to the mixed $w$-coordinates by computing $w_{0}=w(P)=x_{0}+y_{0}$ and setting $W_{1}=w_{0}$ and $Z_{1}=1$. Assume the scalar $k$ is represented in binary, i.e., $k=\sum_{i=0}^{l-1} k_{i} 2^{i}, k_{i} \in G F(2)$. Then, the initialization steps, i.e., Steps 1a and 1b of Algorithm 4.1, produce $w(P)=\left(W_{1}, Z_{1}\right)$

```
Algorithm 4.1 Montgomery's algorithm [13] for point multiplication using w-
coordinates.
Inputs: A point \(P=\left(x_{0}, y_{0}\right) \in E\left(G F\left(2^{m}\right)\right)\) on a
binary curve and an integer \(k=\left(k_{l-1}, \cdots, k_{1}, k_{0}\right)_{2}\).
Output: \(w(Q)=w(k P) \in E\left(G F\left(2^{m}\right)\right)\).
1: set: \(w_{0} \leftarrow x_{0}+y_{0}\) and initialize
    a: \(W_{1} \leftarrow w_{0}\) and \(Z_{1} \leftarrow 1\)
    b: \(\left(W_{2}, Z_{2}\right)=\operatorname{DiffDBL}\left(W_{1}, Z_{1}\right)\)
2: for \(i\) from \(l-2\) down to 0 do
    a: if \(k_{i}=1\) then
        i): \(\left(W_{1}, Z_{1}\right)=\operatorname{MDiffADD}\left(W_{1}, Z_{1}, W_{2}, Z_{2}, w_{0}\right)\)
        ii): \(\left(W_{2}, Z_{2}\right)=\operatorname{DiffDBL}\left(W_{2}, Z_{2}\right)\)
    b: else
        i): \(\left(W_{2}, Z_{2}\right)=\operatorname{MDiffADD}\left(W_{1}, Z_{1}, W_{2}, Z_{2}, w_{0}\right)\)
        ii): \(\left(W_{1}, Z_{1}\right)=\operatorname{DiffDBL}\left(W_{1}, Z_{1}\right)\)
    end if
   end for
3: return \(w(k P) \leftarrow\left(W_{1}, Z_{1}\right)\) and \(w((k+1) P) \leftarrow\left(W_{2}, Z_{2}\right)\)
```

and $w(2 P)=\left(W_{2}, Z_{2}\right)$ using (4.5) [67]. For binary Edwards curves, the formulations of (4.4) and (4.5) are implemented in the MDiffADD and DiffDBL functions of Algorithm 4.1, respectively. Therefore, after $l-1$ iterations of Steps 2a and 2b of Algorithm 4.1, the $w$-coordinates of $k P$ and $(k+1) P$, i.e., $w(k P)=\left(W_{1}, Z_{1}\right)$ and $w((k+1) P)=\left(W_{2}, Z_{2}\right)$, are available. Similarly, for generalized Hessian curves, $w_{0}=w(P)=1+d x_{0} y_{0}, d \neq 0$, is computed in Step 1 and $\left(W_{1}, Z_{1}\right)=\left(w_{0}, 1\right)$ is initialized in Step 1a for point multiplication [2]. For this curve, the formulations of (4.6) and (4.7) are implemented in the MDiffADD and DiffDBL functions of Algorithm 4.1, respectively.
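
The control flow of Algorithm 4.1 can be exercised independently of any curve arithmetic. In the minimal sketch below, the group is replaced by the integers under addition (an assumption made purely for illustration), so that the "point" $n P$ is represented by the integer $n$. The function name `montgomery_ladder` and its callback parameters are hypothetical.

```python
def montgomery_ladder(k, diff_add, dbl, p):
    """Return the pair (kP, (k+1)P) using the ladder structure of Algorithm 4.1.

    diff_add(r1, r2) stands in for MDiffADD and dbl(r) for DiffDBL; both are
    supplied by the caller, so any group with these two operations can be used.
    """
    bits = bin(k)[2:]        # k_{l-1} ... k_0 with k_{l-1} = 1
    r1, r2 = p, dbl(p)       # Step 1: (P, 2P)
    for bit in bits[1:]:     # Step 2: i from l-2 down to 0
        if bit == "1":
            r1, r2 = diff_add(r1, r2), dbl(r2)
        else:
            # Both results are computed from the values held before the update.
            r1, r2 = dbl(r1), diff_add(r1, r2)
    return r1, r2

# Integer stand-in for the group: nP is represented by n itself.
def int_add(a, b):
    return a + b

def int_dbl(a):
    return 2 * a

print(montgomery_ladder(5, int_add, int_dbl, 1))
```

Each iteration performs exactly one addition and one doubling regardless of the bit value, and the two running values always differ by $P$, which is what makes the fixed difference $w_{0}$ usable in every differential addition.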

### 4.2.2 Parallelism in Point Multiplication Algorithm

Parallelism is an approach to reduce the number of field arithmetic operations, mainly multiplications, in the critical-path by using multiple multipliers concurrently [10]. In addition, merging point operations, i.e., the PA and PD, can result in less data dependency and can reduce the latency of the point multiplication over binary Edwards and generalized Hessian curves. Computing the $w$-coordinates of PA and PD for binary Edwards curves together in one step of the Montgomery's algorithm requires six general finite field multiplications and four field multiplications by constants as


Figure 4.1: Data dependency graphs for parallel computing of the combined PA and PD operations on binary Edwards curves (a): $d_{1} \neq d_{2}$ and (b): $d_{1}=d_{2}$ assuming $\mathcal{M}=2$. It requires five registers of $T_{1}, T_{2}, T_{3}, T_{4}$, and $T_{5}$. The constant parameters, $c_{1}=\sqrt{d_{1}}, c_{2}=\sqrt{d_{2} / d_{1}+1}, c_{3}=\sqrt{c_{1}}$, and $c_{4}=\sqrt{c_{2}}$ are assumed to be precomputed and stored in the memory.

reported in Table 4.1. As summarized in this table, for generalized Hessian curves, the cost of combined PA and PD is five field multiplications and two multiplications by constants [2]. In the following, we explain how parallel field operations can be utilized to reduce the latency of the point multiplication operation.

### 4.2.2.1 Scheduling Field Operations for PA and PD

We have obtained the data dependency graphs for the combined PA and PD on binary Edwards and generalized Hessian curves as illustrated in Fig. 4.1 (Fig. 4.1a for $d_{1} \neq d_{2}$ and Fig. 4.1b for $d_{1}=d_{2}$ ) and Fig. 4.2a, respectively. As shown in these figures, the latency (in terms of number of clock cycles) of each step is the latency of an operation with the longest latency. As one can see in Fig. 4.1a and 4.1b, the first four operations of PA, i.e., Step 0 to Step 3, on binary Edwards curve should be performed before any PD operation. This is because computation of PD depends on the PA. For generalized binary Hessian curve (Fig. 4.2a), operations of PA and PD can be performed in parallel at any time. Note that the latency of field additions and field squarings are negligible in comparison to the latency of the


Figure 4.2: Data dependency graph for parallel computing of the combined PA and PD operations for $\mathcal{M}=2$ available multipliers on (a) generalized Hessian curves, assuming $c_{1}=d^{3}$, and $c_{2}=\frac{1}{\sqrt{d^{3}}}$ and (b) binary generic curves (BGCs) [8].

field multipliers. Therefore, we calculate the latency of the critical-path in terms of number of field multiplications. Let $M$ be the latency (in terms of number of clock cycles) for multiplying two field elements and $D$ be the latency of multiplication of a field element by a constant (e.g., curve parameters, $d_{1}$ or $d_{2}$ ). Let us denote $\mathcal{M}$ as the number of parallel finite field multipliers. In the following, we investigate the parallelization using different numbers of multipliers, $\mathcal{M}=1,2$, and 3.

### 4.2.2.2 Parallelization for Binary Edwards Curve (BEC)

For binary Edwards curves with $d_{1} \neq d_{2}$ and one available multiplier $(\mathcal{M}=1)$, the latency of the combined PA and PD is $6 M+4 D$ as reported in Table 4.1. Utilizing two multipliers, i.e., $\mathcal{M}=2$, reduces the latency to $4 M+1 D$ and $3 M+1 D$ for $d_{1} \neq d_{2}$ (Fig. 4.1a) and $d_{1}=d_{2}$ (Fig. 4.1b), respectively. As one can see in Steps 3, 5, 6, 7, and 10 of Fig. 4.1a, two independent multipliers are fully utilized. Thus, the utilization factor of the two multipliers in Fig. 4.1a is $100 \%$. Similarly, in Steps 3, 4, and 6 of Fig. 4.1b, both multipliers are fully utilized. However, in Step 8 of Fig. 4.1b, only one of the two multipliers is utilized and the other one is idle (the idle multiplier is not drawn in the figure). Therefore, the utilization factor of the two multipliers in Fig. 4.1b is $7 / 8 \times 100=87.5 \%$.

If three parallel multipliers, i.e., $\mathcal{M}=3$, are employed, the latency becomes $4 M$ and $3 M$ for $d_{1} \neq d_{2}$ and $d_{1}=d_{2}$, respectively. Therefore, adding one multiplier only reduces the latency by one multiplication by a constant. Moreover, the utilization factors for $d_{1} \neq d_{2}$ and $d_{1}=d_{2}$ drop to $10 / 12 \times 100=83.33 \%$ and $7 / 9 \times 100=77.78 \%$, respectively. In addition, employing four multipliers reduces the latency to $3 M$ for $d_{1} \neq d_{2}$ and has no impact for the case where $d_{1}=d_{2}$. Note that employing more multipliers, i.e., $\mathcal{M}>4$, does not decrease the latency. As a result, the maximum utilization of the multipliers with low latency for the combined PA and PD operations is achieved only by choosing $\mathcal{M}=2$. Multiplier utilization factors for the data dependency graphs of the different curves are summarized in Table 4.2. It is also worth noting that employing two multipliers for the case where $d_{1} \neq d_{2}$ reduces the latency by nearly $50 \%$ as compared to the case where only one multiplier is utilized.

### 4.2.2.3 Parallelization for Generalized Hessian Curve (GHC)

For the generalized Hessian curve with $\mathcal{M}=1$, the latency of the combined PA and PD algorithm is $5 M+2 D$. In this case, the multiplier is always busy and hence the multiplier utilization for $\mathcal{M}=1$ is $100 \%$. The data dependency graph for GHC using the combined PA and PD is illustrated in Fig. 4.2a, where two multipliers are available, i.e., $\mathcal{M}=2$. As shown in Steps 2, 3, and 4 of Fig. 4.2a, two multipliers operate in parallel, whereas in Step 5 only one multiplier performs a multiplication. Therefore, the utilization for $\mathcal{M}=2$ is $7 / 8 \times 100=87.5 \%$. Also, the latency of computing the combined PA and PD operations in parallel is $3 M+1 D$. Note that employing three parallel multipliers $(\mathcal{M}=3)$ reduces the latency to $2 M+1 D$. However, all three multipliers are utilized only in a new step that combines Steps 2 and 3 of Fig. 4.2a; in Step 4, i.e., the multiplication by a constant, only one multiplier performs the operation and the other two are idle. As a result, the utilization factor drops to $7 / 9 \times 100=77.78 \%$. Thus, increasing the number of multipliers from two to three reduces the latency by only $14 \%$ while increasing the required area by about $33 \%$.

### 4.2.2.4 Parallelization for Binary Generic Curve (BGC)

For the sake of comparison, we have included data dependency graph for binary generic curves employing two multipliers $\mathcal{M}=2$ in Fig. 4.2b [8]. As seen from this figure, the latency of the combined PA and PD operations in parallel is $3 M$. Incorporating three multipliers $\mathcal{M}=3$ reduces the latency to $2 M$ with multiplier utilization

Table 4.2: Multiplier Utilization factors for data dependency graph of different curves.

| Curve | Utilization factor |  |
| :---: | :---: | :---: |
|  | $\mathcal{M}=2$ | $\mathcal{M}=3$ |
| BEC $d_{1} \neq d_{2}$ (Fig. 4.1a) | $100 \%$ | $83.33 \%$ |
| BEC $d_{1}=d_{2}$ (Fig. 4.1b) | $87.5 \%$ | $77.78 \%$ |
| GHC (Fig. 4.2a) | $87.5 \%$ | $77.78 \%$ |
| BGC (Fig. 4.2b) | $100 \%$ | $100 \%$ |

of $100 \%$ [6]. It is worth mentioning that employing more than three multipliers, i.e., $\mathcal{M} \geq 4$, will not reduce the latency of point multiplication. This has been investigated in a different way with $\mathcal{M}=4$ to parallelize PA and PD operations as well as parallelizing finite field operations in [8]. We note that parallel computation of point multiplication over binary generic curves has been widely studied in the literature, for instance one can refer to [20], [21], [10], [6], [25], and [8].

In the proposed architecture, multiplication by a constant is performed using one of the available multipliers. As a result, its cost is counted the same as that of a general multiplication.
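
The utilization factors discussed in this section follow from a simple ratio: the total number of multiplications (general and by constants) divided by the number of multiplier slots, i.e., the number of multipliers times the number of multiplication steps on the critical path. The sketch below recomputes them from the operation counts and latencies quoted in the text, with $D=M$; the dictionary layout and the name `utilization` are illustrative assumptions.

```python
# curve -> (total multiplications, steps with 2 multipliers, steps with 3 multipliers)
# e.g., BEC with d1 != d2: 6M + 4D = 10 multiplications, 4M + 1D = 5 steps, 4M = 4 steps.
CURVES = {
    "BEC d1!=d2": (10, 5, 4),
    "BEC d1=d2": (7, 4, 3),
    "GHC": (7, 4, 3),
    "BGC": (6, 3, 2),
}

def utilization(total_mults, multipliers, steps):
    """Percentage of multiplier slots that perform useful work."""
    return 100.0 * total_mults / (multipliers * steps)

for name, (total, steps2, steps3) in CURVES.items():
    print(name, round(utilization(total, 2, steps2), 2), round(utilization(total, 3, steps3), 2))
```

Up to rounding, these values match Table 4.2, and they make the trade-off explicit: a third multiplier shortens the critical path by at most one step while leaving more slots idle.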

As illustrated in Figs. 4.1 and 4.2, in each step, two words (e.g., $W_{1}$ and $Z_{1}$ in Step 0 of Figs. 4.1a and 4.1b) are read from the memory as the inputs (this is discussed in detail in Section 4.3.3). Consequently, this reduces the memory requirements. The scheduling assumes two multipliers $(\mathcal{M}=2)$, two adders, and two squarers for efficient implementation. Addition and squaring can each be performed in one clock cycle, whereas multiplication using the digit-level multiplier requires $M=\left\lceil\frac{m}{d}\right\rceil$ clock cycles plus additional clock cycles for loading the inputs. Note that the operations are ordered to achieve the optimum number of clock cycles, as illustrated in each step of the data dependency graphs. At the end of point multiplication (the bottoms of the data dependency graphs), the results of the PAs and PDs are written to the memory. In what follows, we explain how to recover $Q=k P$ from $P, w(k P)$, and $w((k+1) P)$ at the end of the proposed Montgomery's point multiplication.

### 4.2.3 Recovering the Final Coordinates of $x$ and $y$

In this thesis, having $w$-coordinates in the last step of point multiplication, one can obtain $w(k P)=w_{1}=W_{1} \cdot Z_{1}^{-1}$ and $w((k+1) P)=w_{2}=W_{2} \cdot Z_{2}^{-1}$. The procedure of recovering the final point from $w$-coordinates is presented in [1]. At the end of differential addition, one has $w(k P), w((k+1) P)$, and $(x, y)$ for the base point $P$.

Table 4.3: Latency of the operations in the point multiplication with $\mathcal{M}=1,2,3$, where $M$ is the number of clock cycles required for multiplication of two arbitrary field elements.

| Operation | Latency of Point Multiplication Operations |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| Curve | BEC [1] |  | GHC [2] | BGC [3] |
| Parameter | $d_{1} \neq d_{2}$ | $d_{1}=d_{2}$ | $c=1$ |  |
| Initialization | $1 M+5$ | $1 M+5$ | $1 M+3$ | 5 |
| PA \& PD, $\mathcal{M}=1$ | $10 M+21$ | $7 M+16$ | $7 M+10$ | $6 M+10$ |
| PA \& PD, $\mathcal{M}=2$ | $5 M+15$ | $4 M+11$ | $4 M+8$ | $3 M+8$ |
| PA \& PD, $\mathcal{M}=3$ | $4 M+9$ | $3 M+7$ | $3 M+9$ | $2 M+5$ |
| $w$-coord/aff, $\mathcal{M}=1$ | $22 M+109$ | $21 M+104$ | $20 M+98$ | $19 M+75$ |
| $w$-coord/aff, $\mathcal{M}=2,3$ | $15 M+105$ | $15 M+105$ | $15 M+98$ | $15 M+74$ |

First, one needs to check if $w_{1}^{2}+w_{1} \neq 0$ and then obtain $x_{2}^{2}+x_{2}=A^{\prime}$ from the equation given in [1]. Since $\operatorname{Tr}\left(A^{\prime}\right)=0$ [1], employing the linear half-trace $\mathrm{H}: G F\left(2^{m}\right) \rightarrow G F\left(2^{m}\right)$ computation over $G F\left(2^{163}\right)$, one has $x_{2}$ or $x_{2}+1$ as the output for polynomial basis. By solving the curve equation for $x_{2}$ (or $x_{2}+1$), one can get $y_{2}$ (or $y_{2}+1$), whose cost is $I+13 M+167 S+81 A$ for $m=163$. Note that, using normal basis, solving the quadratic equation and computing inversion can be performed very efficiently, as explained in Chapter 1. Inversion requires $\left\lfloor\log _{2}(m-1)\right\rfloor+H W(m-1)-1$ multiplications and $m-1$ squarings, where $H W(m-1)$ is the Hamming weight (number of ones) of the binary representation of $m-1$. Thus, for $m=163$, the cost of an inversion is $9 M+162 S$, where $M$ and $S$ are the costs (in terms of number of clock cycles in our analysis) of a finite field multiplication and squaring, respectively. Then, the total cost of recovering the $(x, y)$ coordinates of $k P$ as a final point is $22 M+109$ clock cycles (Table 4.3, $d_{1} \neq d_{2}$ and $\mathcal{M}=1$).
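
The inversion cost quoted above is easy to evaluate for any field size. The following sketch simply evaluates the formula $\left\lfloor\log _{2}(m-1)\right\rfloor+H W(m-1)-1$ for the multiplications and $m-1$ for the squarings; the function name `inversion_cost` is an illustrative choice.

```python
def inversion_cost(m):
    """Return (multiplications, squarings) for an inversion in GF(2^m)."""
    e = m - 1
    floor_log2 = e.bit_length() - 1      # floor(log2(m - 1))
    hamming_weight = bin(e).count("1")   # HW(m - 1)
    return floor_log2 + hamming_weight - 1, m - 1

print(inversion_cost(163))
```

For $m=163$, $m-1=162=(10100010)_{2}$, so $\left\lfloor\log _{2} 162\right\rfloor=7$ and $H W(162)=3$, which gives the $9 M+162 S$ stated in the text.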

### 4.2.4 Latency of Point Multiplication Operations

The latencies of the point multiplication operations are summarized in Table 4.3 for $\mathcal{M}=1,2,3$. The total latency consists of the latencies of initialization ( $L_{\text {initial }}$ ), computing PA and PD in the main loop ( $L_{\text {loop }}$ ), and recovering the final point $\left(L_{R}\right)$ for binary Edwards and generalized Hessian curves as follows

$$
\begin{equation*}
L_{\text {Total }}=L_{\text {initial }}+(l-1) \times L_{\text {loop }}+L_{R} \tag{4.8}
\end{equation*}
$$

Figure 4.3: Architecture of the proposed elliptic curve crypto-processor for binary Edwards, generalized Hessian, and binary generic curves.

In Table 4.3, $M$ denotes the number of clock cycles required to multiply two field elements, or to multiply a field element by a constant curve parameter. As an example, the latency of combined PA and PD with $\mathcal{M}=2$ is calculated from Fig. 4.1a as $5 M+15$ by adding all clock cycles in the 15 steps shown in Fig. 4.1a, under the assumption $D=M$.
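As a rough illustration of how (4.8) is evaluated, the following Python sketch plugs in the loop cost $5M+15$ quoted above and the recovery cost $15M+105$ from the same column of Table 4.3. The function name, the choice $M=5$, and the placeholder value for $L_{\text {initial }}$ are assumptions made for this example only; the actual initialization latency must be taken from Table 4.3.

```python
def total_latency(l_initial: int, l_loop: int, l_recover: int, scalar_bits: int) -> int:
    """Evaluate L_Total = L_initial + (l - 1) * L_loop + L_R from Eq. (4.8)."""
    return l_initial + (scalar_bits - 1) * l_loop + l_recover


M = 5                       # assumed multiplier latency in clock cycles
l_loop = 5 * M + 15         # combined PA and PD, two multipliers (from the text)
l_recover = 15 * M + 105    # w-coordinates to affine, same column of Table 4.3
l_initial = 0               # placeholder only; not the value from Table 4.3

print(l_loop, l_recover)                                  # 40 180
print(total_latency(l_initial, l_loop, l_recover, 163))   # 6660
```

The main loop dominates: with $l=163$ it contributes $162 \times 40=6480$ of the cycles in this example, which is why reducing $L_{\text {loop }}$ matters far more than reducing the one-time initialization or recovery costs.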

### 4.3 Architecture of the Proposed Elliptic Curve Crypto-Processor

In this section, we propose a hardware architecture for point multiplication over binary Edwards, generalized Hessian, and binary generic curves. A generic structure for the implementation of the point multiplication on an FPGA platform is depicted in Fig. 4.3. The architecture comprises several blocks: a finite field arithmetic unit (FAU), a control unit, and memory. The FAU includes two field multipliers, two adders, and two squarers, as well as five 163-bit registers to store intermediate results. The controller uses program instructions and implements a finite state machine (FSM). The memory includes Block RAMs (BRAMs) and ROM to store the intermediate/final results and program instructions. The lower-level (finite field) arithmetic is implemented in the FAU, and the higher levels, i.e., PA and PD, are implemented in the control logic as an FSM. In the following, we explain these blocks in detail.


Figure 4.4: The pipelined architecture of the low-complexity type $T$ digit-level GNB multiplier with parallel-output [9].

### 4.3.1 Field Arithmetic Unit (FAU)

In the binary field with characteristic two, $G F\left(2^{m}\right)$, addition is a bit-wise XOR and can be computed in one clock cycle. In normal basis, squaring of a field element is almost free (in hardware) in terms of both timing and area as it is equivalent to rewiring. The finite field multiplier plays the main role in determining the performance as it dominates the costs of point operations. Therefore, it is essential to design an efficient multiplier.
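The two inexpensive operations described above can be sketched in Python on coordinate vectors. This is a minimal software model, assuming an element is stored as a list of bits $(a_0, \ldots, a_{m-1})$ with respect to a normal basis; the function names and the toy length $m=5$ are illustrative choices, not part of the hardware design.

```python
def nb_add(a: list[int], b: list[int]) -> list[int]:
    """Addition in GF(2^m): coordinate-wise XOR."""
    return [x ^ y for x, y in zip(a, b)]


def nb_square(a: list[int]) -> list[int]:
    """Squaring in normal basis: cyclic shift of the coordinates,
    (a_0, ..., a_{m-1}) -> (a_{m-1}, a_0, ..., a_{m-2})."""
    return a[-1:] + a[:-1]


a = [1, 0, 1, 1, 0]   # toy element with m = 5
b = [0, 1, 1, 0, 0]

print(nb_add(a, b))    # [1, 1, 0, 1, 0]
print(nb_square(a))    # [0, 1, 0, 1, 1]

# Squaring m times returns the original element, since A^(2^m) = A.
x = a
for _ in range(len(a)):
    x = nb_square(x)
print(x == a)          # True
```

In hardware the shift is only a permutation of wires, which is why the text describes squaring as almost free; multiplication has no comparable shortcut and is modelled separately.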

Bit-parallel multipliers can perform the finite field multiplication in one clock cycle. These multipliers are fast but occupy a large area. Bit-serial multipliers require $m$ clock cycles for the entire multiplication; they are efficient in terms of area but slow. Digit-level multipliers are the most suitable choice because the digit-size can be chosen for a specific cryptographic application based on the available resources. In this work, we use a digit-level multiplier, which is explained in the following.

### 4.3.2 A Fast and Low-Complexity Digit-Level GNB Multiplier over $G F\left(2^{m}\right)$

In this subsection, we first present a pipelined low-complexity hardware architecture for a digit-level GNB multiplier over $G F\left(2^{m}\right)$. Then, we evaluate the practical time-area efficiency of the presented multiplier by implementing it on a Xilinx$^{\circledR}$ Virtex$^{\text{TM}}$-5 FPGA device.

### 4.3.2.1 Hardware Architecture

Let $A=\left(a_{0}, a_{1}, \cdots, a_{m-1}\right)$ and $B=\left(b_{0}, b_{1}, \cdots, b_{m-1}\right)$ be the field elements represented by type $T$ GNB over $G F\left(2^{m}\right)$. Let $C=\left(c_{0}, c_{1}, \cdots, c_{m-1}\right)$ denote their multiplication, i.e., $C=A B$. Reyhani-Masoleh in [5] has proposed a digit-level GNB multiplier with parallel output and digit-size $d, 1 \leq d \leq m$. It requires $M=\left\lceil\frac{m}{d}\right\rceil$, $1 \leq M \leq m$, clock cycles to generate all the $m$ coordinates of $C=A B$ simultaneously at the end of the final clock cycle. In [9], a modified and low-complexity version of the digit-level GNB multiplier proposed in [5] is presented. In this section, we pipeline this architecture to make a faster VLSI architecture which operates at very high clock frequencies.

The pipelined multiplier used here is depicted in Fig. 4.4. It consists of a $\rho$ block, $J$ blocks in Path-1, and the pipelined $G F\left(2^{m}\right)$ adder in Path-2. The $\rho$ block includes two sub-blocks $\rho_{1}$ and $\rho_{2}$, and its structure depends on the type $T$, $T \geq 2$, of the GNB and on the multiplication matrix. Each $J$ block consists of $m$ two-input AND gates and each $G F\left(2^{m}\right)$ adder consists of binary trees of XOR gates. As illustrated in Fig. 4.4, the multiplier is pipelined by adding a stage of pipeline registers inside the $G F\left(2^{m}\right)$ adder so that the multiplier can operate at very high clock frequencies. Therefore, instead of performing a $G F\left(2^{m}\right)$ addition of $d m$ inputs (as shown in Fig. 4.4), which are connected to the outputs of the AND gates in the $J$ blocks, we perform the additions in two stages, i.e., over $\left\lceil\frac{d m}{\ell}\right\rceil$-inputs. The first stage contains $\ell$ $G F\left(2^{m}\right)$ adders, each of which has at most $K=\left\lceil\frac{d}{\ell}\right\rceil$ $m$-bit inputs; they are depicted by $j_{0}$ to $j_{d-1}$ in the architecture. The outputs of the first-stage adders are added to the output of the $Z$ register using another $G F\left(2^{m}\right)$ adder in the second stage. Choosing the optimum value of $\ell$ plays an important role in designing a fast multiplier; this is considered later in this section. It is shown in [5] and [9] that the critical-path delay of the non-pipelined multiplier is composed of the delays of the components located in Path-1 and Path-2, i.e., $\left(\left\lceil\log _{2} T\right\rceil T_{X}+T_{A}\right)$ and $\left(\left\lceil\log _{2}(d+1)\right\rceil T_{X}\right)$ for $1 \leq d \leq m$, respectively. Note that these are functions of the type of the multiplier $T$ and the digit size $d$. As shown in Fig. 4.4, Path-2 is divided into Path-2a and Path-2b by inserting a stage of pipeline registers in between (hereafter we call it $\ell$-level of accumulation). This technique reduces the number of logic gates in the critical path and simplifies the routing.

### 4.3.2.2 Complexities

In this subsection, we give the number of registers and the time complexity of the pipelined digit-level GNB multiplier over $G F\left(2^{m}\right)$. The gate count of the pipelined multiplier remains the same as that of the non-pipelined modified architecture presented in [9]. It requires $d m$ AND gates and $n_{p}+v_{p}\left(\frac{T}{2}-1\right)+d m$ XOR gates, where $n_{p} \leqslant \min \left\{\frac{v_{p} T}{2},\binom{m}{2}\right\}$ [9].

Proposition 4.1. The pipelined multiplier structure of Fig. 4.4 requires $(3+\ell) m$ registers and its critical-path delay is

$$
\begin{equation*}
\max \left\{\left(T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2} K\right\rceil\right) T_{X}\right),\left(\left\lceil\log _{2}(\ell+1)\right\rceil T_{X}\right)\right\} \tag{4.9}
\end{equation*}
$$

where $\ell$ is the level of accumulation and $K=\left\lceil\frac{d}{\ell}\right\rceil$.

Proof. As one can see from Fig. 4.4, $\ell m$ registers are required between Path-2a and Path-2b for pipelining. As a result, $(\ell+3) m$ 1-bit registers are required in the presented multiplier. The critical-path delay of Path-1, $D_{\text {Path-1 }}$, is composed of the delays of the components in Path-1, i.e., $T_{X}$, $\left\lceil\log _{2} \frac{T}{2}\right\rceil T_{X}$, and $T_{A}$. The delay of Path-2a, $D_{\text {Path-2a }}$, is the delay of a $G F\left(2^{m}\right)$ adder with at most $K=\left\lceil\frac{d}{\ell}\right\rceil$ $m$-bit inputs, i.e., $\left\lceil\log _{2} K\right\rceil T_{X}$, and the delay of Path-2b, $D_{\text {Path-2b }}$, is $\left\lceil\log _{2}(\ell+1)\right\rceil T_{X}$. Therefore, the critical-path delay of the presented architecture is $\max \left\{\left(D_{\text {Path-1 }}+D_{\text {Path-2a }}\right),\left(D_{\text {Path-2b }}\right)\right\}$, which completes the proof.
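Proposition 4.1 can be tabulated with a short Python sketch. The function below is a hypothetical helper written for illustration; it reports each path delay as a pair (number of $T_{A}$ delays, number of $T_{X}$ delays) rather than a single number, because comparing the two paths numerically would require technology-specific values of $T_{A}$ and $T_{X}$ that are not assumed here.

```python
def ceil_log2(n: int) -> int:
    """Return ceil(log2(n)) for n >= 1 using integer arithmetic."""
    return (n - 1).bit_length()


def pipelined_multiplier_costs(m: int, T: int, d: int, ell: int):
    """Register count and the two candidate critical paths of Proposition 4.1.

    Each path is returned as (count of T_A, count of T_X)."""
    K = -(-d // ell)                                   # K = ceil(d / ell)
    registers = (3 + ell) * m
    path_1_plus_2a = (1, ceil_log2(T) + ceil_log2(K))  # T_A + (log T + log K) T_X
    path_2b = (0, ceil_log2(ell + 1))                  # log(ell + 1) T_X
    return registers, path_1_plus_2a, path_2b


# Type 4 GNB over GF(2^163) with d = 55 and ell = 10 (values used later in the text).
print(pipelined_multiplier_costs(163, 4, 55, 10))   # (2119, (1, 5), (0, 4))
```

For these parameters the first path, $T_{A}+5 T_{X}$, clearly exceeds the second, $4 T_{X}$, so it determines the gate-level critical-path delay.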

The critical-path delays of the pipelined and non-pipelined architectures of the presented multiplier are listed in Table 4.4 in terms of the number of levels of accumulation, $\ell$, and the digit-size, $d$. Note that employing the proposed $\ell$-level of accumulation with one stage of pipeline registers increases the latency of the multiplication by one clock cycle, to $\left\lceil\frac{m}{d}\right\rceil+1$.
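The pipelined multiplier latency $\left\lceil\frac{m}{d}\right\rceil+1$ is easy to tabulate. The Python sketch below (the helper name is an assumption for illustration) evaluates it for $m=163$ and the digit sizes used in the implementation tables later in this chapter.

```python
def pipelined_mult_latency(m: int, d: int) -> int:
    """Clock cycles per multiplication: ceil(m / d) plus one pipeline stage."""
    return -(-m // d) + 1


digit_sizes = [21, 24, 28, 33, 41, 55, 82]
print([pipelined_mult_latency(163, d) for d in digit_sizes])
# [9, 8, 7, 6, 5, 4, 3]
```

These values coincide with the $M+1$ column of Tables 4.6, 4.7, and 4.8. They also show why only certain digit sizes are worth considering: increasing $d$ costs area, but it only helps when it lowers $\left\lceil\frac{m}{d}\right\rceil$.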

Lemma 4.1. The number of feasible accumulators is upper bounded by $\ell \leq\left\lceil\frac{d}{2}\right\rceil$ and is lower bounded by $\ell \geq 2$.

Table 4.4: Critical-path delay of the pipelined and non-pipelined architecture of presented digit-level type 4 GNB multiplier over $G F\left(2^{163}\right)$.

| Non-Pipelined [5], [9] |  | Pipelined |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $d$ | $D_{\text {Path-1 }}+D_{\text {Path-2 }}$ | $K$ | $D_{\text {Path-2a }}$: $\left\lceil\log _{2} K\right\rceil T_{X}$ | $\ell$ | $D_{\text {Path-2b }}$: $\left\lceil\log _{2}(\ell+1)\right\rceil T_{X}$ |
| $2 \leq d \leq 3$ | $T_{A}+4 T_{X}$ | $2<K \leq 4$ | $2 T_{X}$ | $2 \leq \ell \leq 3$ | $2 T_{X}$ |
| $3<d \leq 7$ | $T_{A}+5 T_{X}$ | $4<K \leq 8$ | $3 T_{X}$ | $3<\ell \leq 7$ | $3 T_{X}$ |
| $7<d \leq 15$ | $T_{A}+6 T_{X}$ | $8<K \leq 16$ | $4 T_{X}$ | $7<\ell \leq 15$ | $4 T_{X}$ |
| $15<d \leq 31$ | $T_{A}+7 T_{X}$ | $16<K \leq 32$ | $5 T_{X}$ | $15<\ell \leq 31$ | $5 T_{X}$ |
| $31<d \leq 63$ | $T_{A}+8 T_{X}$ | $32<K \leq 64$ | $6 T_{X}$ | $31<\ell \leq 63$ | $6 T_{X}$ |

Proof. From (4.9), the following conditions must hold for pipelining to achieve its goal:

$$
\left\{\begin{array}{ll}
(a):\left\lceil\log _{2}(\ell+1)\right\rceil<D_{\text {Path-1 }}+\left\lceil\log _{2}(d+1)\right\rceil, & \ell \geq 1  \tag{4.10}\\
(b):\left\lceil\log _{2} K\right\rceil<\left\lceil\log _{2}(d+1)\right\rceil, & K \geq 2
\end{array},\right.
$$

where $K=\left\lceil\frac{d}{\ell}\right\rceil$ as defined above. From (4.10)(a), one obtains $\left\lceil\log _{2}(\ell+1)\right\rceil<\left\lceil\log _{2}(d+1)\right\rceil$, so the level of accumulation should be less than the digit-size, i.e., $\ell<d$. From (4.10)(b), one gets a tighter upper bound for $\ell$, since $K \geq 2$ and $K<d+1$. The former requires the number of accumulators to satisfy $1<\ell<d$, and the latter requires it to be at most about half of the digit-size, i.e., $1<\ell \leq\left\lceil\frac{d}{2}\right\rceil$. This completes the proof.

### 4.3.2.3 LUT-based Critical-path Delay Analysis

In this subsection, we investigate the critical-path delay of the presented pipelined scheme based on the 6-input programmable look-up tables (LUTs) available in the Xilinx$^{\circledR}$ Virtex$^{\text{TM}}$-5 FPGA device. To estimate the resource consumption and the critical-path delay, the gate-oriented schematics need to be converted to LUT-based schematics. When a tree of XOR gates is converted into a $\Gamma$-input ($\Gamma=6$ in this case) LUT-oriented schematic, $\Gamma-1$ XOR gates can be replaced by one LUT in the best case. For type $T \leq 4$, each output of the $\rho$ block is obtained by adding (XORing) $T$ inputs, and the $J$ block contributes one additional input for the AND operation. Therefore, such outputs can be implemented using 6-input LUTs with a delay of $1 T_{L U T}$, and the LUT-based critical-path delay of Path-1 is $1 T_{L U T}$ for type $T \leq 4$.

Table 4.5: LUT-based critical-path delay (CPD) $\left(T_{L U T}\right)$ of the presented pipelined multiplier for different digit sizes $(d)$ and levels of accumulation $(\ell)$ for type 4 GNB multiplier over $G F\left(2^{163}\right)$ where $K=\left\lceil\frac{d}{\ell}\right\rceil$.

| $d$ | $D_{\text {Path-1 }}$ | $D_{\text {Path-2a }}$: <br> $\left\lceil\log _{6} K\right\rceil$ | $\ell$ | $D_{\text {Path-2b }}$: <br> $\left\lceil\log _{6}(\ell+1)\right\rceil$ |
| :---: | :---: | :---: | :---: | :---: |
| $11 \leq d \leq 28$ | $1 T_{L U T}$ | $1 T_{L U T}$ | $2 \leq \ell \leq 5$ | $1 T_{L U T}$ |
| $33 \leq d \leq 163$ | $1 T_{L U T}$ | $1 T_{L U T}$ | $6 \leq \ell \leq 28$ | $2 T_{L U T}$ |

The critical-path delay of Path-2 is summarized in Table 4.5 in terms of the level of accumulation, $\ell$, and the digit-size, $d$. The critical-path delays of Path-2a and Path-2b are $\left\lceil\log _{6} K\right\rceil T_{L U T}$ and $\left\lceil\log _{6}(\ell+1)\right\rceil T_{L U T}$, respectively. Therefore, $K$ and $\ell$ should be chosen so that the LUT-based critical-path delays of the two stages are balanced. For example, assume the digit-size is $d=55$; then the critical-path delay of the non-pipelined multiplier is $1 T_{L U T}+\left\lceil\log _{6} 56\right\rceil T_{L U T}=4 T_{L U T}$. Employing $\ell=10$ levels of accumulation results in at most $K=\left\lceil\frac{55}{10}\right\rceil=6$ inputs for each $G F\left(2^{163}\right)$ adder in Path-2a. Then, the critical-path delay of the presented multiplier is $\max \left\{\left(1 T_{L U T}+\left\lceil\log _{6} 6\right\rceil T_{L U T}\right),\left(\left\lceil\log _{6} 11\right\rceil\right) T_{L U T}\right\}=2 T_{L U T}$. Therefore, for practical implementations, one needs to determine the optimum level of pipelining by taking the number of LUT inputs into account.
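The worked example for $d=55$ and $\ell=10$ can be reproduced with the following Python sketch. It is an illustrative model only: the function names are assumptions, the Path-1 delay is fixed at $1 T_{L U T}$ as stated for type $T \leq 4$, and the logarithm is computed with integer arithmetic to avoid floating-point rounding at exact powers of six.

```python
def ceil_log(n: int, base: int) -> int:
    """Smallest e >= 0 with base**e >= n (integer arithmetic only)."""
    e, power = 0, 1
    while power < n:
        power *= base
        e += 1
    return e


def lut_delays(d: int, ell: int, lut_inputs: int = 6, path1: int = 1):
    """Return (non-pipelined, pipelined) critical-path delay in units of T_LUT."""
    K = -(-d // ell)                                    # K = ceil(d / ell)
    non_pipelined = path1 + ceil_log(d + 1, lut_inputs)
    pipelined = max(path1 + ceil_log(K, lut_inputs),    # Path-1 + Path-2a
                    ceil_log(ell + 1, lut_inputs))      # Path-2b
    return non_pipelined, pipelined


print(lut_delays(55, 10))   # (4, 2)
```

Sweeping `ell` over the range allowed by Lemma 4.1 for a given `d` is a quick way to see which levels of accumulation balance the two pipeline stages before committing to a floorplan.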

In this work, we have proposed an LUT-based pipelining scheme. We have tried several different pipelining techniques including the re-timing scheme of ISE tools but none of them was as efficient as the LUT-based analysis. Therefore, inserting pipelined registers in appropriate locations has a significant impact on the critical path delay of the proposed structure as the $G F\left(2^{m}\right)$ adder of the multiplier has the major critical path delay. In the following subsection, we implement the presented multiplier on FPGA.

### 4.3.2.4 Implementation

To evaluate the practical performance, the presented pipelined digit-level type 4 GNB PIPO multiplier over $G F\left(2^{163}\right)$ is implemented on a Xilinx$^{\circledR}$ Virtex$^{\text{TM}}$-5 FPGA device. First, feasible values for the digit size $d$ are chosen in such a way as to decrease the critical-path delay while increasing the area (as a result of the ceiling operation). Then, a careful LUT-based design with floorplanning is performed for the given number of accumulators $\ell$ and digit-size $d$. The efficiency of the multiplier is measured in terms of the reciprocal of the time-area product, i.e., $(\text{time} \times \text{area})^{-1}$, and is plotted for different digit sizes $d$, $11 \leq d \leq 82$, in Fig. 4.5. As shown in this figure, the local optima (for time-area efficiency) in terms of digit size for the presented multiplier can be chosen as $d \in\{21,24,28,33,41,55\}$. Note that the two largest digit sizes, $d=82$ and $d=163$, degrade the maximum clock frequency as the place and route (PAR) operation becomes complicated. Therefore, we exclude $d=163$ from our analysis and keep $d=82$ for comparison purposes. The presented multiplier is faster (i.e., operates at higher clock frequencies) and smaller than the digit-level MO multiplier employed in [10] for FPGA implementations of ECC over $G F\left(2^{163}\right)$ [5].

Figure 4.5: Time-Area ratio of the presented pipelined low-complexity digit-level GNB multiplier for type 4 over $G F\left(2^{163}\right)$ for different digit sizes $d$.

### 4.3.3 Memory and Control Unit

### 4.3.3.1 Memory

Figure 4.6: Configuration of BRAMs for the proposed architecture.

The proposed architecture requires RAM to store intermediate variables and outputs from the FAU and the registers, and ROM to store program instructions and constant values. As illustrated in Figs. 4.1 and 4.2, two 163-bit words are accessed from memory in each cycle. Dual-port BRAMs are therefore configured as two single-port BRAMs with independent data access [64], so that two read operations can be performed per cycle. This feature allows us to reduce the number of required BRAMs and achieve greater utilization of this resource. In the utilized Xilinx$^{\circledR}$ Virtex$^{\text{TM}}$-5 FPGA device, 36-Kbit (1024 36-bit words) dual-port BRAM blocks are available with a combined 72-bit bus width (36 bits per port). The dual-port RAM is assigned through the Xilinx$^{\circledR}$ Synthesis Tool (XST$^{\text{TM}}$). In Fig. 4.3, the storage RAM has been designed to allow the reading and writing of 163-bit words for $m=163$, which minimizes the number of accesses to the memory. Therefore, as shown in Fig. 4.6, the storage RAM is constructed with $\left\lceil\frac{163 \times 2}{72}\right\rceil=5$ BRAMs, resulting in the storage of $512 \times 163$-bit words to store the intermediate inputs, as illustrated in the data flow diagrams of Figs. 4.1 and 4.2.
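The BRAM count quoted above follows from simple ceiling arithmetic, sketched here in Python. The function name and parameter names are hypothetical; the numbers (163-bit words, two words per cycle, 72-bit combined bus, 36-Kbit blocks) are the ones given in the text.

```python
def brams_for_storage(word_bits: int = 163, words_per_cycle: int = 2,
                      bram_bus_bits: int = 72) -> int:
    """Number of BRAMs placed side by side to deliver the required bits per cycle."""
    return -(-(word_bits * words_per_cycle) // bram_bus_bits)   # ceiling division


print(brams_for_storage())      # 5

# Depth of one 36-Kbit block when used at its full 72-bit width.
print((36 * 1024) // 72)        # 512
```

The second figure is consistent with the depth of 512 mentioned in the text, under the assumption that each block is used at its full combined width.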

The basic field arithmetic operations, i.e., multiplication, addition, and squaring, are implemented in the FAU. The constants $d_{1}, d_{2}, c_{1}, c_{2}, c_{3}$, and $c_{4}$ are stored in the ROM. The ROM that stores the constants is implemented in the same BRAM explained above by reserving a few addresses. A register file of $5 \times 163$-bit registers (shown by $T_{i}$ in Figs. 4.1 and 4.2) is incorporated in the FAU to reduce the overhead of the communication between the FAU and the RAM. Each load and store between the FAU and the memory requires a single clock cycle, and we count all of these clock cycles when calculating the total latency of the point multiplication. The ROM is also generated using Xilinx$^{\circledR}$ BRAMs, as illustrated in Fig. 4.3. The latencies of the required arithmetic operations are reported in Table 4.3.

### 4.3.3.2 Control Unit

The control unit of the ECC crypto-processor controls the FAU and the memory, and it is implemented as an FSM. As shown in Fig. 4.3, the control unit has two address signals, Addr_A and Addr_B, which control the interface between the FAU and the memory. The program instructions are stored in ROM; the control unit fetches and decodes the instructions and sends the appropriate control signals to the other units based on the data dependency graphs of Figs. 4.1 and 4.2. Note that the ROM that stores the program instructions is instantiated using BRAMs as $1024 \times 36$-bit words. Therefore, one extra BRAM is required to store the program instructions. The control unit also decides where to store, and conditionally swap (based on $k_{i}$), the results of the combined PA and PD operations.

### 4.4 Comparisons and Implementations

In this section, we discuss the results obtained in the previous sections and compare them with the counterparts in terms of side-channel analysis and implementation results.

### 4.4.1 Side-Channel Analysis

As mentioned before, Montgomery's ladder is a highly regular and suitable choice to protect the scalar $k$ against simple power analysis attacks [68]. The newly introduced binary Edwards and generalized Hessian curves have two special properties: they are unified and complete [1]. The former means that the point addition formulas can also be used for point doubling, while the latter means that the point addition formulas can be used for all pairs of inputs on the curve. Then, a point multiplication algorithm based on unified addition and doubling operations does not cause this kind of side-channel leakage and hence is protected against the corresponding side-channel attacks (SCA). Baldwin et al. [69] have investigated the resistance of the unified operations against simple power analysis (SPA) attacks for twisted Edwards curves over prime fields $G F(p)$. This has also been investigated in [53] using the unified addition formula of binary Edwards curves. The authors of [53] have also incorporated a simple random order execution (i.e., randomly changing the storage location of the results) in Montgomery's ladder, which makes differential power analysis (DPA) attacks difficult. In this work, we take advantage of the completeness of the $w$-coordinate differential PA and PD formulas on Montgomery's ladder, which is also SPA resistant.
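The regularity of Montgomery's ladder can be illustrated with a small Python sketch. This is a toy model under loudly stated assumptions: the group is the integers modulo $n$ under addition (not an elliptic curve), `add` and `dbl` stand in for the differential PA and PD, and the Python `if` statements are not constant-time, so the code shows only the fixed operation pattern per key bit, not an SPA-resistant implementation.

```python
from typing import Callable, TypeVar

P = TypeVar("P")


def montgomery_ladder(k: int, p: P, add: Callable[[P, P], P],
                      dbl: Callable[[P], P]) -> P:
    """Compute k*p for k >= 1 with one addition and one doubling per scalar bit."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    bits = bin(k)[2:]          # most significant bit first; bits[0] is always '1'
    r0, r1 = p, dbl(p)         # invariant: r1 - r0 == p
    for bit in bits[1:]:       # l - 1 iterations, as in the (l - 1) * L_loop term
        if bit == "1":
            r0, r1 = r1, r0    # conditional swap selected by the key bit
        r1 = add(r0, r1)       # one addition ...
        r0 = dbl(r0)           # ... and one doubling, whatever the bit value
        if bit == "1":
            r0, r1 = r1, r0    # swap back
    return r0


# Toy stand-in group: integers modulo n under addition (NOT an elliptic curve).
n = 1009
add = lambda a, b: (a + b) % n
dbl = lambda a: (2 * a) % n
print(montgomery_ladder(91, 7, add, dbl) == (91 * 7) % n)   # True
```

Every loop iteration performs the same addition and doubling regardless of the key bit; only the roles of the two registers change. This mirrors the conditional swap performed by the control unit in the proposed architecture.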

The cost of explicit point addition is $8 \mathbf{M}+5 \mathbf{S}+1 \mathbf{D}$ for generic curves [55], $13 \mathbf{M}+3 \mathbf{S}+3 \mathbf{D}$ for binary Edwards curves [1], and $8 \mathbf{M}+3 \mathbf{S}$ for Hessian curves [2]. Therefore, the generalized Hessian curves offer the fastest addition formulas for binary elliptic curves. Although the explicit addition formulas for generic curves are faster than those for binary Edwards curves, they are not complete and unified. One can also see that the cost of one step of point multiplication on binary Edwards curves using the explicit addition formulas in [53] is higher than that of employing Montgomery's differential addition algorithm, i.e., combined differential PA and PD. It is interesting to note that one can reduce this cost by employing explicit addition formulas for generalized Hessian curves.

Table 4.6: FPGA implementation results for BECs over $G F\left(2^{163}\right)$ and $\mathcal{M}=2$.

| $d$ | $M+1$ | $L_{\text {Total }}$ ($d_{1} \neq d_{2}$) | $L_{\text {Total }}$ ($d_{1}=d_{2}$) | $f_{\max }$ (MHz) | LUTs ($d_{1} \neq d_{2}$) | LUTs ($d_{1}=d_{2}$) | FFs ($d_{1} \neq d_{2}$) | FFs ($d_{1}=d_{2}$) | Slices ($d_{1} \neq d_{2}$) | Slices ($d_{1}=d_{2}$) | P.M. Time [$\mu \mathrm{s}$] ($d_{1} \neq d_{2}$) | P.M. Time [$\mu \mathrm{s}$] ($d_{1}=d_{2}$) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 21 | 9 | 10041 | 7915 | 269.3 | 8158 | 8158 | 3097 | 2771 | 3181 | 3181 | 37.2 | 29.3 |
| 24 | 8 | 9208 | 7247 | 268.8 | 8750 | 8750 | 3423 | 3097 | 3371 | 3371 | 34.2 | 26.9 |
| 28 | 7 | 8375 | 6579 | 267.5 | 10309 | 10309 | 3423 | 3097 | 4078 | 4078 | 31.3 | 24.6 |
| 33 | 6 | 7542 | 5911 | 265.8 | 11139 | 11139 | 3249 | 3423 | 4681 | 4681 | 28.3 | 22.2 |
| 41 | 5 | 6709 | 5243 | 264.5 | 14235 | 14235 | 4075 | 3749 | 5788 | 5788 | 25.3 | 19.8 |
| 55 | 4 | 5876 | 4575 | 263.3 | 17432 | 17432 | 5053 | 4727 | 6536 | 6536 | 22.3 | 17.3 |
| 82 | 3 | 5043 | 3907 | 196.1 | 23301 | 23301 | 6357 | 6031 | 8872 | 8872 | 25.7 | 19.9 |

Table 4.7: FPGA implementation results for GHC over $G F\left(2^{163}\right)$ and $\mathcal{M}=2$.

| $d$ | $M+1$ | Total Latency, Clock Cycles ($L_{\text {Total }}$) | $f_{\max }$ (MHz) | Area: LUTs | Area: FFs | Area: Slices | P.M. Time [$\mu \mathrm{s}$] |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 21 | 9 | 7419 | 272.3 | 8158 | 2934 | 3181 | 27.2 |
| 24 | 8 | 6751 | 271.8 | 8750 | 3260 | 3371 | 24.8 |
| 28 | 7 | 6083 | 269.3 | 10309 | 3260 | 4078 | 22.5 |
| 33 | 6 | 5415 | 268.2 | 11139 | 3586 | 4681 | 20.1 |
| 41 | 5 | 4747 | 267.1 | 14235 | 3912 | 5788 | 17.7 |
| 55 | 4 | 4079 | 266.2 | 17432 | 4890 | 6536 | 15.3 |
| 82 | 3 | 3411 | 196.1 | 23301 | 6194 | 8872 | 17.3 |

Table 4.8: FPGA implementation results for BGC over $G F\left(2^{163}\right)$ and $\mathcal{M}=2$.

| $d$ | $M+1$ | Total Latency, Clock Cycles ($L_{\text {Total }}$) | $f_{\max }$ (MHz) | Area: LUTs | Area: FFs | Area: Slices | P.M. Time [$\mu \mathrm{s}$] |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 21 | 9 | 5884 | 272.3 | 8158 | 3097 | 3181 | 21.6 |
| 24 | 8 | 5383 | 271.8 | 8750 | 3423 | 3371 | 19.8 |
| 28 | 7 | 4882 | 269.3 | 10309 | 3423 | 4078 | 18.1 |
| 33 | 6 | 4381 | 268.2 | 11139 | 3249 | 4681 | 16.3 |
| 41 | 5 | 3880 | 267.1 | 14235 | 4075 | 5788 | 14.5 |
| 55 | 4 | 3379 | 266.2 | 17432 | 5053 | 6536 | 12.7 |
| 82 | 3 | 2878 | 196.1 | 23301 | 6357 | 8872 | 14.7 |

### 4.4.2 Implementation Results and Discussion

We have selected the Xilinx$^{\circledR}$ Virtex$^{\text{TM}}$-5 xc5vlx110-2ff1760 device as the target FPGA. In terms of available resources, the xc5vlx110-2ff1760 contains 17,280 slices (69,120 LUTs and 69,120 registers), 128 BlockRAMs (BRAMs), and 800 input/output (I/O) pins. Each slice contains 4 flip-flops (FFs) and 4 look-up tables (LUTs) [64].

Choosing a Xilinx$^{\circledR}$ Virtex$^{\text{TM}}$-5 FPGA increases the performance and speed of our design. This is mainly due to the availability of 6-input LUTs and the large word size of its 36-Kbit BRAMs. Having 6-input LUTs allows the design to be implemented with fewer logic levels, and the large word size makes it easier to build large memory arrays (for storing large field elements over $G F\left(2^{m}\right)$) with less routing delay. As a result, using Xilinx$^{\circledR}$ Virtex$^{\text{TM}}$-5 FPGAs increases the speed by reducing both the critical-path delay and the number of clock cycles (latency). Note that, for comparison purposes, we also implement the proposed design on a Xilinx$^{\circledR}$ Virtex$^{\text{TM}}$-4 xc4vlx100 device (which offers 4-input LUTs) and compare it to the counterparts.

The presented architecture for the elliptic curve crypto-processor of Section 4.3 is coded in VHDL and synthesized for different digit sizes $d$, $d \in\{21,24,28,33,41,55,82\}$, using XST$^{\text{TM}}$ of Xilinx$^{\circledR}$ ISE$^{\text{TM}}$ version 12.1 design software. The optimization goal for synthesis is set to the default value (i.e., speed). The results of the timing analysis of the implementations after place and route are reported in Tables 4.6 and 4.7 for binary Edwards and generalized Hessian curves, respectively. The number of clock cycles required for computing the point multiplication is also presented in these tables for the different digit sizes and different curve parameters, i.e., $d_{1}=d_{2}$, $d_{1} \neq d_{2}$, and $c=1$. Moreover, the total latencies are found from (4.8) using $l=163$ as the sum of the clock cycles required for the initialization, the PA and PD operations in the main loop of the point multiplication, and the conversion, as obtained from Table 4.3.

The area requirements are stated in terms of the number of occupied slices (including LUTs and FFs), as reported in Tables 4.6 and 4.7. Note that the proposed architecture for the FAU is the same for binary Edwards (with $d_{1}=d_{2}$ and $d_{1} \neq d_{2}$) and generalized Hessian curves; the designs differ only in the control logic provided by the instruction program (in ROM) and in the number of required registers. Therefore, the numbers of LUTs and occupied slices are equal for these curves, as presented in Tables 4.6 and 4.7. The fastest point multiplications are computed for digit size $d=55$ at approximately $17.3 \mu \mathrm{s}$ and $15.3 \mu \mathrm{s}$ for binary Edwards and generalized Hessian curves, respectively. The proposed architecture requires 6,536 occupied slices (17,432 LUTs and 5,053 FFs) and 6 BRAM blocks for $d=55$. Similar implementation results are found for the binary generic curve, as illustrated in Table 4.8.
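The point multiplication times in Tables 4.6, 4.7, and 4.8 follow from dividing the total latency by the maximum clock frequency. The Python sketch below recomputes a few entries from the tabulated cycle counts and frequencies; the helper name is an assumption, and small differences in the last digit are due to rounding in the tables.

```python
def pm_time_us(total_cycles: int, f_max_mhz: float) -> float:
    """Point multiplication time in microseconds: cycles divided by MHz."""
    return total_cycles / f_max_mhz


# (curve, d, L_Total, f_max in MHz) taken from Tables 4.6 and 4.7.
rows = [
    ("BEC d1=d2", 55, 4575, 263.3),
    ("GHC",       55, 4079, 266.2),
    ("BEC d1=d2", 82, 3907, 196.1),
    ("GHC",       82, 3411, 196.1),
]
for curve, d, cycles, f_max in rows:
    print(f"{curve:10s} d={d:2d}: {pm_time_us(cycles, f_max):.2f} us")
```

Although $d=82$ needs fewer clock cycles than $d=55$, its lower clock frequency (196.1 MHz) makes the overall point multiplication slower, which is why $d=55$ gives the fastest results.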

From our implementation results (Tables 4.6, 4.7, and 4.8), one can see that the slice occupation is usually larger than the number of LUTs divided by four $\left(\frac{\# L U T}{4}\right)$ for Virtex$^{\text{TM}}$-5. This is because the ISE design software starts the unrelated logic packing only after the CLB pack factor ($100 \%$ for the default value) is reached [64]. A higher percentage results in lower-density packing, whereas a lower pack factor results in a denser design with more difficult place and route and, consequently, higher delays.

Several implementations of ECC have been published in the literature, targeting various applications with different requirements in terms of time-area trade-offs. The implementation results of this work are reported in Table 4.9 and are compared with the results for generic and Koblitz curves available in the literature. We note that, because different curves and different FPGA technologies are used to implement different crypto-processors, meaningful quantitative comparisons of the area and time results are difficult. Therefore, as mentioned above, we have implemented the crypto-processor for $d=55$ on a Virtex$^{\text{TM}}$-4 device, and its area and timing results are reported in Table 4.9. Moreover, as the finite field multiplier plays an important role in determining the performance of an ECC crypto-processor, we discuss the performance results in terms of the efficiency of the finite field multiplier and compare them fairly with the counterparts.

It is worth mentioning that, in these implementations, we have chosen normal basis as it offers free repeated squarings. We could have taken further advantage of normal basis, as is done for Koblitz curves in [10] and [23]. Nevertheless, by using normal basis, we have eliminated the extra hardware for squarings in the proposed ECC crypto-processor over binary Edwards and generalized Hessian curves. Moreover, recovering the final coordinates $(x, y)$ of $Q=k P$ (represented in $w$-coordinates) requires several repeated squarings and a half-trace computation, whose costs are reduced by using normal basis.

In [10], Järvinen et al. have presented the use of parallelization on different levels of point multiplication and have extensively studied the speed and area requirements for NIST $B-163$ and $K-163$ curves. For generic curves, the time-area performances are investigated using one, two, and four digit-level MO [35] multipliers over $G F\left(2^{163}\right)$. As discussed in [5], the area complexity of a digit-level MO multiplier and its improved version is larger than the one presented in this work. Also, as one can realize,

Table 4.9: Comparison of ECC implementations on FPGA over $G F\left(2^{163}\right)$.

| Work$^{1}$ | Device | Basis | $d$ | $\mathcal{M}$ | Area | Time [$\mu\mathrm{s}$] |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BGC [10] | Stratix II | NB | 14 | 4 | 11,800 ALMs | 48.88 |
| BKC [10] | Stratix II | NB | 11 | 4 | 13,472 ALMs | 25.81 |
| BKC [26] | Stratix II | NB | 17 | 4 | 23,580 ALMs (26,647 ALUTs, 20,575 FFs) | 9.48 |
| BGC [10] | Stratix II | NB | 41 | 2 | 18,489 ALMs | 51.56 |
| BKC [10] | Stratix II | NB | 41 | 2 | 19,498 ALMs | 35.1 |
| BGC | Virtex-5 | NB | 41 | 2 | 5,788 Slices (14,235 LUTs, 4,075 FFs) | 14.4 |
| BEC ($d_{1} \neq d_{2}$) | Virtex-5 | NB | 41 | 2 | 5,788 Slices (14,235 LUTs, 4,075 FFs) | 24.9 |
| BEC ($d_{1}=d_{2}$) | Virtex-5 | NB | 41 | 2 | 5,788 Slices (14,235 LUTs, 3,749 FFs) | 19.5 |
| GHC ($c=1$) | Virtex-5 | NB | 41 | 2 | 5,788 Slices (14,235 LUTs, 3,912 FFs) | 17.4 |
| BGC [6] | Virtex-4 | NB | 55 | 3 | 24,363 Slices | 10.11 |
| BGC | Virtex-4 | NB | 55 | 2 | 12,834 Slices (22,815 LUTs, 6,683 FFs) | 17.2 |
| BEC ($d_{1} \neq d_{2}$) | Virtex-4 | NB | 55 | 2 | 12,834 Slices (22,815 LUTs, 6,683 FFs) | 23.3 |
| BEC ($d_{1}=d_{2}$) | Virtex-4 | NB | 55 | 2 | 12,834 Slices (22,815 LUTs, 6,520 FFs) | 22.9 |
| GHC ($c=1$) | Virtex-4 | NB | 55 | 2 | 12,834 Slices (22,815 LUTs, 6,520 FFs) | 20.8 |
| BGC | Virtex-5 | NB | 55 | 2 | 6,536 Slices (17,305 LUTs, 4,075 FFs) | 12.9 |
| BEC ($d_{1} \neq d_{2}$) | Virtex-5 | NB | 55 | 2 | 6,536 Slices (17,432 LUTs, 5,053 FFs) | 22.3 |
| BEC ($d_{1}=d_{2}$) | Virtex-5 | NB | 55 | 2 | 6,536 Slices (17,432 LUTs, 4,727 FFs) | 17.3 |
| GHC ($c=1$) | Virtex-5 | NB | 55 | 2 | 6,536 Slices (17,305 LUTs, 4,890 FFs) | 15.3 |

1. BGC: binary generic curve, BKC: binary Koblitz curve, BEC: binary Edwards curve, GHC: generalized Hessian curve.


Figure 4.7: Implementation results of point multiplication for binary Edwards, generalized Hessian, and binary generic curves reported in Tables 4.6, 4.7, and 4.8 on Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-5$ xc5vlx110-2ff1760 FPGA device. The points are related to digit sizes of $d=21,24,28,33,41,55,82$.
time complexity of our presented multiplier is lower than that of the digit-level MO multiplier, as compared in [5]. In addition, we have reached higher clock frequencies with LUT-based pipelining techniques. Further, the implementations in ([10], Table VII) for generic curves over $G F\left(2^{163}\right)$ have a higher latency and consequently a larger computation time.

In [26], the same digit-level MO multiplier has been used for point multiplication on Koblitz curves, and the results have been compared with those obtained using polynomial basis. The authors indicated that the implementations using polynomial basis are faster than the ones using normal basis for the same area ([26], Table 4). They have also taken advantage of operation interleaving in their implementations on Koblitz curves. However, it is worth mentioning that the large area consumption of the normal basis implementations in [26] might be a result of the large number of pipeline registers, and the implementation results of [26] could be improved using our proposed scheme. Therefore, if one employs our presented multiplier architecture together with the techniques proposed in [26], the results of point multiplication using normal basis would be comparable with the ones using polynomial basis. We further note that our implementations are not claimed to be the best possible, nor faster than counterparts using polynomial basis.

The point multiplication scheme proposed in [6] by Kim et al. has been performed on the NIST $B$-163 generic curve employing $\mathcal{M}=3$ digit-serial GNB multipliers (proposed by Kwon et al. in [44]) with Montgomery's ladder on a 4-input Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-4$ FPGA. The maximum clock frequency reported for the ECC crypto-processor is $f_{\max }=143 \mathrm{MHz}$, achieved with digit-size $d=55$. Therefore, as the multiplier determines the upper bound for the critical-path delay, one can estimate that the maximum operating frequency for the multiplier is 143 MHz. In contrast, our presented multiplier operates at $f_{\max }=196.5 \mathrm{MHz}$ on the Virtex ${ }^{\text {TM }}-4$ FPGA with only one level of pipelining. We further note that the proposed LUT-based pipelining technique significantly increases $f_{\max }$. Moreover, the latency of point multiplication (i.e., the number of clock cycles) in [6] is $L_{\text {Total }}=1+162 \times(2 M+2)+149=1446$ employing three multipliers, and hence the total time achieved for point multiplication is $T_{k P}=\frac{L_{k P}}{f_{\max }}=\frac{1446}{143}=10.11 \mu \mathrm{~s}$ while occupying 24,363 slices. Our implementation on the Virtex ${ }^{\text {TM }}-4$ FPGA uses only two GNB multipliers and computes a point multiplication in $17.2 \mu \mathrm{s}$ using only 12,834 slices, as reported in Table 4.9.
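The arithmetic behind the figures quoted for [6] can be reproduced with a short sketch. It assumes that $M$ in the latency expression is the number of clock cycles of one digit-serial multiplication, i.e., $M=\lceil 163 / 55\rceil=3$, which is the value that makes the expression evaluate to 1446; the function and parameter names are illustrative only.

```python
from math import ceil


def point_mult_cycles(m: int = 163, d: int = 55, iterations: int = 162) -> int:
    """Clock cycles following L = 1 + 162 * (2M + 2) + 149 as quoted for [6]."""
    # Assumption: M is the latency of one digit-serial multiplication, ceil(m / d).
    M = ceil(m / d)
    return 1 + iterations * (2 * M + 2) + 149


def time_us(cycles: int, f_mhz: float) -> float:
    """Execution time in microseconds for a clock frequency given in MHz."""
    return cycles / f_mhz


cycles = point_mult_cycles()
print(cycles, round(time_us(cycles, 143.0), 2))
```

With the default arguments this reproduces the 1446 cycles and the 10.11 µs stated above.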

Table 4.9 shows a number of related designs (on NIST $B-163$ and $K-163$ ) which are implemented on different FPGA platforms using different types and numbers of multipliers. To have a fair comparison, we have implemented the ECC crypto-processor based on the NIST $B$-163 generic curve using the presented GNB multiplier for different digit sizes. The data dependency graph of point multiplication on this curve is illustrated in Fig. 4.2b and its latencies are summarized in Table 4.3. The implementation results are tabulated in Table 4.8.

In Fig. 4.7, the implementation results are illustrated and point multiplication time is plotted versus area (number of occupied slices). As shown in this figure, increasing the area, as a result of increasing digit-size $d$, results in faster point multiplications. As seen from Fig. 4.7, digit sizes larger than 55, i.e., $d>55$, are not efficient for the proposed architecture. Therefore, incorporating multiple smaller multipliers is more efficient than using one large multiplier. As illustrated in Table 4.8 and Fig. 4.7, our results indicate that the point multiplication over the binary generic curve is faster than over binary Edwards and generalized Hessian curves. This is because it has a smaller latency, i.e., it requires fewer clock cycles.

We further note that the implementations of point multiplication over binary generic curves (short Weierstraß) require special hardware to handle the point at infinity. During each point operation, a check should be performed to ensure that the resulting point is not at infinity. It should be noted that the proposed ECC crypto-processor for binary Edwards and generalized Hessian curves works for all the input pairs without any changes (i.e., it is complete). However, exceptional cases should be tested separately when employing NIST generic and Koblitz curves, which requires extra hardware and time.

### 4.5 Conclusions

In this chapter, we have investigated the hardware implementation of point multiplication on binary Edwards and generalized Hessian curves over $G F\left(2^{163}\right)$ using GNB. We have presented a pipelined version of the digit-level GNB PIPO multiplier which operates at higher clock frequencies and studied its time-area trade-offs for different digit sizes. The effect of parallelization using two multipliers for computing the point addition and point doubling on binary Edwards and generalized Hessian curves has been investigated. For point multiplication, the widely-used Montgomery's ladder has been incorporated for differential $w$-coordinates. The proposed architecture has been implemented on FPGA to obtain the optimum digit-size. Also, we have examined the completeness of the point operations. For binary Edwards and generalized Hessian curves, the fastest point multiplication is achieved by choosing $d=55$. The proposed architecture requires 6,536 occupied slices ( 17,432 LUTs and $5,053 \mathrm{FFs}$ ), and computes a single point multiplication in $17.3 \mu \mathrm{~s}$ and $15.3 \mu \mathrm{~s}$ for binary Edwards and generalized Hessian curves, respectively. Our implementation results also indicate that the point multiplication over the binary generic curve is faster than over binary Edwards and generalized Hessian curves. On the other hand, the point multiplication over binary Edwards and generalized Hessian curves is complete. In the next chapter, we propose a new method to reduce the latency of point multiplication on binary Edwards and generalized Hessian curves.

## Chapter 5

## New Architecture for Double-Multiplication Using GNB and Its Applications for Exponentiation and Elliptic Curve Cryptography

In this chapter, based on the two low-complexity multiplier architectures proposed in Chapter 3, we present a new digit-level hybrid multiplier which performs two multiplications in the same number of clock cycles as required for one multiplication. It has advantages for high speed finite field arithmetic operations such as exponentiation and elliptic curve point multiplication. The hybrid structure is developed by connecting the output of the proposed digit-level PISO GNB multiplier to the input of a new digit-level SIPO multiplier.

To the best of our knowledge, this is the first digit-level hybrid GNB multiplier which performs two multiplications with the same latency as the one for one multiplier. In order to investigate the applicability of the proposed hybrid multiplier architecture, we employ it for double-exponentiation which is the key operation for Schnorr [70] and ElGamal-type signature verification algorithms [71]. We further note that this scheme can be incorporated to reduce the latency of point multiplication for ECC-based cryptosystems when other schemes (such as parallelization and interleaving) fail due to data dependencies. To obtain the actual implementation results, the proposed hybrid multiplier architecture is coded using VHDL and then implemented on Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-4$ field-programmable gate array (FPGA) and synthesized using 65-nm CMOS library of application-specific integrated circuit (ASIC) technology for different digit sizes.

Figure 5.1: (a) Proposed structure for the hybrid multiplier. (b) Two digit-level multipliers with parallel output operating in two separate steps. (c) A hybrid multiplier operating in one step and composed of an improved DL-PISO and an improved LSD-first DL-SIPO multipliers.

The rest of this chapter is organized as follows. In Section 5.1, the architecture of the proposed hybrid multiplier is presented and its complexities are studied for different digit sizes. In Section 5.2, the applications of the proposed hybrid multiplier are investigated. In Section 5.3, the proposed hybrid multiplier is implemented on FPGA and ASIC and the timing and area requirements are reported. In Section 5.4, we conclude this chapter.

### 5.1 Hybrid Multiplication

The previous chapters dealt with low-complexity and improved DL-PISO and DL-SIPO GNB multipliers. Based on the information provided there, we here present a new hybrid structure by connecting the output of the DL-PISO multiplier to the serial input of the DL-SIPO multiplier. The entire hybrid multiplier performs two multiplications simultaneously, where the results are available in parallel after $\left\lceil\frac{m}{d}\right\rceil+1$ clock cycles, assuming that one clock cycle is required to load the output of the first multiplier (stored in the register) to the input of the second multiplier. The structure of the proposed hybrid multiplier is illustrated in Fig. 5.1a. It computes $E=A \times B \times D$ over $G F\left(2^{m}\right)$.

### 5.1.1 Traditional Multiplication Scheme

The traditional method requires two separate multiplications, one to compute $A \times B$ and the other to multiply its result by $D$. Thus, the latency of computing $E$ is that of two multiplications if a traditional multiplication scheme is used, and it can be obtained as follows. In Fig. 5.1b, two digit-level multipliers with parallel output (DL-PIPO) are employed to compute $E=A \times B \times D, E \in G F\left(2^{m}\right)$. Let us assume that registers $\langle X\rangle,\langle Y\rangle$, and $\langle F\rangle$ are preloaded with the operands $A, B$, and $D$, respectively. Also, the register $\langle Z\rangle$ should be initialized with $0 \in G F\left(2^{m}\right)$. The top multiplier (of Fig. 5.1b) requires $q$ clock cycles to compute $C=A \times B$ and store the result in the $m$-bit register. Also, the bottom multiplier requires $q$ clock cycles to perform $(A B) \times D$ and store it in the register $\langle Z\rangle$. Therefore, to obtain the result in register $\langle Z\rangle, 2 q+1$ clock cycles are required. It should be noted that the critical-path delay is equal to $t_{p}$, which is the delay of a digit-level GNB multiplier with parallel output. Then, the required time to compute $E$ is $T=t_{p} \times(2 q+1)$.

### 5.1.2 Hybrid Multiplication Scheme

Now, we consider Fig. 5.1c, which depicts a hybrid multiplier composed of a digit-level PISO GNB multiplier and an LSD-first digit-level SIPO multiplier. This multiplier performs two dependent multiplications and reduces their latency to that of one multiplication. Let $C \in G F\left(2^{m}\right)$ be the product of $A$ and $B$, i.e., $C=A B$. At the output of the digit-level PISO multiplier, $C$ becomes available starting from its LSD as $C_{0}, C_{1}, \cdots, C_{q-1}$, one digit per clock cycle. In the first clock cycle it provides the first digit of $C$, in the order of $c_{0}$, followed by $c_{1}, \cdots$, and $c_{d-1}$, i.e., $C_{0}=\left(c_{0}, c_{1}, \cdots, c_{d-1}\right)$. In the second clock cycle, the bottom multiplier (i.e., DL-SIPO) multiplies the first digit of $C$, i.e., $C_{0}$ by $D$ (stored in register $\langle F\rangle$ ) and the top multiplier computes the second digit of $C$, i.e., $C_{1}=\left(c_{d}, c_{d+1}, \cdots, c_{2 d-1}\right)$. Then, one can realize that after $q+1$ clock cycles, register $\langle Z\rangle$ contains the result of the multiplication $E=A \times B \times D$. The critical-path delay of the hybrid multiplier is equal to the maximum of the delays for the DL-PISO and DL-SIPO multipliers i.e., $t_{s}=\max \left\{t_{p}, t_{s}\right\}$, and consequently one can obtain the time of multiplication as $T=t_{s} \times(q+1)$.
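The difference between the two schedules can be illustrated with a small latency model. This is only a sketch of the cycle counts stated above ($2q+1$ versus $q+1$ with $q=\lceil m / d\rceil$), not a model of the GNB datapath; the function names are hypothetical.

```python
from math import ceil


def digits(m: int, d: int) -> int:
    """Number of d-bit digits q = ceil(m / d) in an m-bit field element."""
    return ceil(m / d)


def traditional_latency(m: int, d: int) -> int:
    """Two back-to-back PIPO multiplications: 2q + 1 clock cycles."""
    return 2 * digits(m, d) + 1


def hybrid_latency(m: int, d: int) -> int:
    """PISO feeding SIPO digit by digit: q + 1 clock cycles."""
    return digits(m, d) + 1


def hybrid_schedule(q: int):
    """List (cycle, digit index produced by PISO, digit index consumed by SIPO)."""
    schedule = []
    for cycle in range(1, q + 2):
        produced = cycle - 1 if cycle <= q else None  # C_0 appears in cycle 1
        consumed = cycle - 2 if cycle >= 2 else None  # SIPO lags by one cycle
        schedule.append((cycle, produced, consumed))
    return schedule


for d in (8, 16, 32, 55):
    print(d, traditional_latency(163, d), hybrid_latency(163, d))
```

The schedule makes the one-cycle offset explicit: digit $C_{i}$ is produced in cycle $i+1$ and consumed in cycle $i+2$, so the last digit is absorbed in cycle $q+1$.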

Based on the information provided above, one can state the following to obtain the complexities of the presented hybrid multiplier.

Proposition 5.1. The proposed hybrid multiplier architecture requires $\leq 2 v_{s}(T-1)+2 d m-d$ XOR gates, $2 d m$ AND gates, four $m$-bit registers and one $d$-bit register. Also, its critical-path delay is equal to $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ which is due to the delays through logic gates in the path with longer critical-path delay (i.e., DL-PISO architecture).

### 5.1.2.1 Analysis

In Table 5.1, the latency and time delay of the proposed hybrid multiplier are investigated in terms of different digit sizes for type 4 GNB over $G F\left(2^{163}\right)$. As shown in this table, the latency, critical-path delay, and time to perform the entire multiplication are given for different digit sizes $d, 7<d<128$. For the traditional method, i.e., the structure of Fig. 5.1b, the latency is $2 q+1$ while for the hybrid structure, i.e., Fig. 5.1c, the latency is $q+1$. The time of multiplication for the proposed hybrid structure is $T=(q+1) T_{A}+(10 q+10) T_{X}$, which is about $17 \%$ less than that of the traditional method for smaller digit sizes, e.g., $7<d \leq 15$, and $38 \%$ less for larger digit sizes, e.g., $31<d \leq 63$. Therefore, the proposed hybrid structure in Fig. 5.1c reduces the latency and consequently the total time of multiplication, and is faster than the one depicted in Fig. 5.1b.
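The expression for $T$ follows from the critical-path delay in Proposition 5.1. As a worked step, take type 4 GNB over $G F\left(2^{163}\right)$, so that the GNB type is $T=4$ and $m=163$ (note that $T$ is also used for the multiplication time in this section; $t_{\mathrm{cp}}$ is a symbol introduced here only for illustration):

$$
\begin{aligned}
t_{\mathrm{cp}} & =T_{A}+\left(\left\lceil\log _{2} 4\right\rceil+\left\lceil\log _{2} 163\right\rceil\right) T_{X}=T_{A}+(2+8) T_{X}=T_{A}+10 T_{X}, \\
T & =(q+1) \, t_{\mathrm{cp}}=(q+1) T_{A}+(10 q+10) T_{X} .
\end{aligned}
$$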

### 5.2 Applications of the Proposed Hybrid Multiplier

The proposed hybrid architecture is particularly applicable for reducing the latency whenever there are repeated multiplications. In this section, we provide some applications of the proposed hybrid multiplier architecture in which high speed double-multiplications are required.

### 5.2.1 Double-Exponentiation

The exponentiation on an Abelian group (e.g., finite fields) is one of the most important arithmetic operations for public key cryptography such as Diffie-Hellman [14] key agreement and RSA, and for encoding Reed-Solomon codes [72], [73], and [74]. The exponentiation is usually accomplished by performing repeated field multiplications and squarings [72]. Let $A$ and $B$ be two field elements and $K$ and $H$ be two integers. Then, the computation of $A^{K} B^{H}$ (denoted by double-exponentiation) is a crucial operation for cryptographic applications such as Schnorr- and ElGamal-like signature verifications [70] and [71]. In [74], double-exponentiation is computed by multiplying the results of two single exponentiations. Such a scheme is not the most efficient method, and an efficient computation of double-exponentiation is required.


As explained before, under normal basis representation of field elements squarings are free. Thus, to speed up double-exponentiation one requires to reduce the total number of field multiplications as well as the complexity of each multiplication. The former reduces the latency (in terms of number of clock cycles) while the latter improves the execution time of the multiplier (in terms of propagation delay through logic gates). Based on the discussion regarding low-complexity multipliers presented in the previous sections, we reduce the latency of double-exponentiation using the proposed hybrid multiplier architecture. The following is used in [73] to compute the double-exponentiation operation.

Lemma 5.1. [73] Let $A$ and $B$ be two field elements on $G F\left(2^{m}\right)$ and represented by normal basis and assume $K$ and $H$ be the two positive integers represented by $K=$ $\left(k_{m-1}, \cdots, k_{1}, k_{0}\right)_{2}$ and $H=\left(h_{m-1}, \cdots, h_{1}, h_{0}\right)_{2}$, respectively. Double-exponentiation of the form $A^{K} B^{H}$ is computed by

$$
\begin{aligned}
A^{K} B^{H} & =A^{k_{0}+k_{1} 2+\cdots+k_{m-1} 2^{m-1}} B^{h_{0}+h_{1} 2+\cdots+h_{m-1} 2^{m-1}} \\
& =\left(A^{k_{0}} B^{h_{0}}\right)\left(A^{k_{1}} B^{h_{1}}\right)^{2} \cdots\left(A^{k_{m-1}} B^{h_{m-1}}\right)^{2^{m-1}} \\
& \left.=\left(\ldots\left(A^{k_{m-1}} B^{h_{m-1}}\right)^{2} A^{k_{m-2}} B^{h_{m-2}}\right)^{2} \ldots\right)^{2} A^{k_{0}} B^{h_{0}} .
\end{aligned}
$$
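The nested form in Lemma 5.1 can be sketched in software as a left-to-right square-and-multiply loop that selects $1$, $A$, $B$, or the precomputed $A B$ in each iteration. The sketch below is an illustration only: it uses integers modulo a prime as a stand-in group (an assumption made for brevity, not the normal basis arithmetic over $G F\left(2^{m}\right)$ used in this chapter), and all names are hypothetical.

```python
P = 2**61 - 1  # stand-in modulus (a Mersenne prime); not the field used in the thesis


def mul(x: int, y: int) -> int:
    """Group multiplication in the stand-in group."""
    return (x * y) % P


def double_exp(A: int, B: int, K: int, H: int, nbits: int) -> int:
    """Evaluate A^K * B^H with the nested squaring form of Lemma 5.1."""
    # Multiplexer inputs indexed by the bit pair (k_i, h_i); A*B is precomputed.
    table = {(0, 0): 1, (1, 0): A, (0, 1): B, (1, 1): mul(A, B)}
    acc = 1
    for i in reversed(range(nbits)):  # from bit m-1 down to bit 0
        acc = mul(acc, acc)           # squaring (free in normal basis hardware)
        acc = mul(acc, table[((K >> i) & 1, (H >> i) & 1)])
    return acc


A, B, K, H = 123456789, 987654321, 0b1011001, 0b0110111
print(double_exp(A, B, K, H, 7) == mul(pow(A, K, P), pow(B, H, P)))
```

The loop performs one multiplication per bit position regardless of the bit values, which mirrors the multiplexer-based selection described in the text.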

The architecture of a multiplexer-based double-exponentiation using one multiplier is given in Fig. 5.2a. It is assumed that $A B$ is precomputed [73]. As seen in this figure, the result of double-exponentiation is available after $m-1$ iterations, i.e., $(m-1) \times q, q=\left\lceil\frac{m}{d}\right\rceil$ clock cycles. In Fig. 5.2b, we propose a new architecture employing our proposed hybrid multiplier architecture. This hybrid multiplier performs two multiplications with the latency of one multiplication, and the double-exponentiation result is available in the register $\langle Z\rangle$ after $\left\lceil\frac{m-1}{2}\right\rceil$ iterations, i.e., $\left\lceil\frac{m-1}{2}\right\rceil \times(q+1)$ clock cycles. This is due to the fact that in each iteration two bits of $K, k_{i} k_{i+1}$ and $H, h_{i} h_{i+1}$ are processed from their LSB in parallel. One should note that, as the field elements are represented in normal basis, the computation of repeated squarings is free. Therefore, our proposed scheme reduces the latency of the double-exponentiation based on choosing efficient values for digit-size $d$. It is noted that the fast operation is achieved at the expense of extra area. More importantly, one can obtain a trade-off between time and area by choosing suitable values for $d$. The presented architectures for double-exponentiation can be easily modified to eliminate the multiplication by 1, i.e., $(1, \cdots, 1,1)$ in normal basis, whenever $h_{i}$ and $k_{i}$ are both zero. However, for the sake of simplicity we do not investigate it here. In [74], a new exponentiation algorithm based on split exponents is proposed. Using normal basis representation and the proposed hybrid multiplier, it can be improved.

Figure 5.2: Architectures for multiplexer based double-exponentiation. (a) with one multiplier (b) with incorporating the proposed hybrid multiplier.
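The two clock-cycle counts given for double-exponentiation, $(m-1) \times q$ with one multiplier and $\left\lceil\frac{m-1}{2}\right\rceil \times(q+1)$ with the hybrid multiplier, can be tabulated with a short sketch. The function names are illustrative, and the digit sizes are example values only.

```python
from math import ceil


def single_multiplier_cycles(m: int, d: int) -> int:
    """(m - 1) iterations of q = ceil(m / d) cycles each."""
    return (m - 1) * ceil(m / d)


def hybrid_cycles(m: int, d: int) -> int:
    """ceil((m - 1) / 2) iterations of q + 1 cycles each."""
    return ceil((m - 1) / 2) * (ceil(m / d) + 1)


m = 163
for d in (1, 8, 55):
    one, hyb = single_multiplier_cycles(m, d), hybrid_cycles(m, d)
    print(d, one, hyb, f"{100 * (one - hyb) / one:.1f}% fewer cycles")
```

Because every iteration of the hybrid scheme costs one extra cycle, the saving is close to one half for small digit sizes (large $q$) and shrinks as $d$ grows.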

### 5.2.2 Reducing the Latency of Point Multiplication on Binary Curves

In this section, we employ the proposed hybrid multiplier to perform double-multiplication and reduce the overall latency of point multiplication on binary elliptic curves.

### 5.2.2.1 Binary Edwards Curves

In Chapter 4, we have proposed a parallel processor for computing point multiplication on binary Edwards curves employing two digit-level multipliers. In binary Edwards curves, mixed $w$-coordinate has been incorporated to compute mixed differential PA and PD for Montgomery point multiplication with $d_{1} \neq d_{2}$ as given in [1] as:


Figure 5.3: Data dependency graph for fast computation of combined PA and PD for binary Edwards curves (a): employing four different PIPO multipliers. (b): employing proposed hybrid multiplier. $c_{1}=\sqrt{d_{1}}, c_{2}=\sqrt{d_{2} / d_{1}+1}, c_{3}=\sqrt{c_{1}}$, and $c_{4}=\sqrt{c_{2}}$.

$$
\begin{align*}
& C=W_{1} \cdot\left(Z_{1}+W_{1}\right), D=W_{2} \cdot\left(Z_{2}+W_{2}\right), E=Z_{1} \cdot Z_{2}, \\
& F=W_{1} \cdot W_{2}, V=C \cdot D, Z_{3}=V+\left(c_{1} \cdot E+c_{2} \cdot F\right)^{2}, \\
& W_{3}=V+w_{0} \cdot Z_{3}, W_{4}=D^{2}, \\
& Z_{4}=W_{4}+\left(\left(c_{3} \cdot Z_{2}+c_{4} \cdot W_{2}\right)^{2}\right)^{2}, \tag{5.1}
\end{align*}
$$

where $c_{1}=\sqrt{d_{1}}, c_{2}=\sqrt{d_{2} / d_{1}+1}, c_{3}=\sqrt{c_{1}}$, and $c_{4}=\sqrt{c_{2}}$. As seen from the above formulations, the cost of the combined PA and PD operations is $10 M$, where $M$ is the cost of a multiplication. To achieve the highest degree of parallelization, we employ the maximum number of parallel multipliers. The data dependency graph is depicted in Fig. 5.3a employing four DL-PIPO multipliers. In Steps S2 and S3 of Fig. 5.3a four DL-PIPO multipliers operate in parallel, and in Step 7 only two multipliers perform operations. Therefore, the multiplier utilization is $84 \%$. As one can see, the smallest latency for the combined PA and PD, $3 M+12$, is achieved by employing four multipliers. Note that employing more than four multipliers does not reduce the latency due to data dependencies.

We modify the combined PA and PD formulations in (5.1) so as to incorporate the proposed hybrid multiplier, remove the data dependencies, and further reduce the number of multipliers in the data path (i.e., reduce the latency). The modified formulations are as follows

$$
\begin{align*}
& C=W_{1} \cdot\left(Z_{1}+W_{1}\right), D=W_{2} \cdot\left(Z_{2}+W_{2}\right), \\
& E=Z_{1} \cdot Z_{2} \cdot c_{1}, F=W_{1} \cdot W_{2} \cdot c_{2}, G=c_{3} \cdot Z_{2} \\
& V=C \cdot D \cdot w_{0}, Z_{3}=C \cdot D+(E+F)^{2}, H=c_{4} \cdot W_{2} \\
& W_{3}=V+(E+F)^{2} \cdot w_{0}+C D, W_{4}=D^{2}, \\
& Z_{4}=W_{4}+\left((G+H)^{2}\right)^{2} . \tag{5.2}
\end{align*}
$$
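As a sanity check, the algebraic equivalence of (5.1) and (5.2) can be tested numerically. The sketch below is an illustration under stated assumptions: it works in a toy field $G F\left(2^{8}\right)$ in polynomial basis (not the GNB representation over $G F\left(2^{163}\right)$ used in this thesis), treats $w_{0}$ and $c_{1}, \ldots, c_{4}$ as arbitrary field elements since the identity does not depend on their values, and uses hypothetical function names.

```python
import random

M_BITS = 8
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2) (toy field)


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^8); addition is XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M_BITS:
            a ^= POLY
    return r


def gf_sqr(a: int) -> int:
    return gf_mul(a, a)


def pa_pd_original(W1, Z1, W2, Z2, w0, c1, c2, c3, c4):
    """Combined PA and PD following (5.1)."""
    C = gf_mul(W1, Z1 ^ W1)
    D = gf_mul(W2, Z2 ^ W2)
    E = gf_mul(Z1, Z2)
    F = gf_mul(W1, W2)
    V = gf_mul(C, D)
    Z3 = V ^ gf_sqr(gf_mul(c1, E) ^ gf_mul(c2, F))
    W3 = V ^ gf_mul(w0, Z3)
    W4 = gf_sqr(D)
    Z4 = W4 ^ gf_sqr(gf_sqr(gf_mul(c3, Z2) ^ gf_mul(c4, W2)))
    return W3, Z3, W4, Z4


def pa_pd_modified(W1, Z1, W2, Z2, w0, c1, c2, c3, c4):
    """Combined PA and PD following (5.2); triple products map to hybrid multipliers."""
    C = gf_mul(W1, Z1 ^ W1)
    D = gf_mul(W2, Z2 ^ W2)
    CD = gf_mul(C, D)
    E = gf_mul(gf_mul(Z1, Z2), c1)   # double-multiplication
    F = gf_mul(gf_mul(W1, W2), c2)   # double-multiplication
    G = gf_mul(c3, Z2)
    H = gf_mul(c4, W2)
    V = gf_mul(CD, w0)               # C * D * w0 as a double-multiplication
    S = gf_sqr(E ^ F)
    Z3 = CD ^ S
    W3 = V ^ gf_mul(S, w0) ^ CD
    W4 = gf_sqr(D)
    Z4 = W4 ^ gf_sqr(gf_sqr(G ^ H))
    return W3, Z3, W4, Z4


rng = random.Random(2011)
ok = all(
    pa_pd_original(*args) == pa_pd_modified(*args)
    for args in ([rng.randrange(256) for _ in range(9)] for _ in range(1000))
)
print(ok)
```

The check relies only on distributivity: $W_{3}=V+w_{0} Z_{3}$ in (5.1) expands to $C D+w_{0} C D+w_{0}(E+F)^{2}$, which is the three-term sum written out in (5.2).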

The corresponding data dependency graph for the modified formulations for combined PA and PD is illustrated in Fig. 5.3b. As shown in this figure, we employed the proposed hybrid multiplier in Steps S2 and S5. In Step S2, we combined the computation of field multiplications by constants ( $c_{1}$ and $c_{2}$ ) and performed them in one step with the latency of $M+2$ using two hybrid multipliers. Three regular multipliers are also operating in this step. In Step S5, we modified the formulation of the PA operation in computing ( $W_{3}$ and $Z_{3}$ ) to take advantage of the hybrid multiplier as much as possible. As one can see, in this step the computation of $V=C \cdot D \cdot w_{0}$ is done using one hybrid multiplier with the latency of $M+2$. As a result, the latency of the combined PA and PD over binary Edwards curves is reduced to $2 M+12$. Therefore, applying the proposed technique reduces the latency of computation of combined PA and PD by about $34 \%$. We further note that the proposed approach is a new method to reduce the latency of point multiplication when parallelization fails due to data dependency. Therefore, one can achieve higher speeds in computing point multiplication for the high speed applications mentioned before.
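The relative saving implied by going from $3 M+12$ to $2 M+12$ clock cycles depends on $M$. The following sketch evaluates it for a few values, assuming that $M$ denotes the latency of one field multiplication in clock cycles (e.g., $\lceil m / d\rceil$ for a digit-level multiplier); the helper names are hypothetical.

```python
def edwards_latency(M: int):
    """Latency of combined PA and PD before and after using the hybrid multiplier."""
    return 3 * M + 12, 2 * M + 12


def relative_saving(before: int, after: int) -> float:
    """Fraction of clock cycles saved."""
    return (before - after) / before


for M in (3, 21, 163):  # e.g., d = 55, d = 8, d = 1 over GF(2^163)
    before, after = edwards_latency(M)
    print(M, before, after, f"{100 * relative_saving(before, after):.1f}%")
```

The saving equals $M /(3 M+12)$, which approaches one third as $M$ grows, so the largest gains are obtained for small digit sizes.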

The proposed hybrid structure is also applicable to explicit addition formulas for generic, Hessian, and Koblitz elliptic curves, wherever data dependencies limit the use of parallelization to reduce latency and achieve higher speeds.

### 5.2.2.2 Generalized Hessian Curves

Similar to binary Edwards curves, mixed $w$-coordinate has been incorporated to compute mixed differential PA and PD for Montgomery point multiplication as follows [2]:


Figure 5.4: generalized Hessian curves with $c_{1}=d^{3}$, and $c_{2}=\frac{1}{\sqrt{d^{3}}}$, employing the proposed hybrid multiplier.

$$
\begin{align*}
& A=W_{1} \cdot Z_{2}, B=W_{2} \cdot Z_{1}, Z_{4}=W_{2}^{2} \cdot Z_{2}^{2} \\
& Z_{3}=(A+B)^{2}, D=W_{2}^{2}+Z_{2}^{2} \\
& E=w_{0} \cdot Z_{3}, F=(A \cdot B), G=D \cdot c_{2} \\
& H=F \cdot c_{1}, W_{3}=E+H, W_{4}=\left(Z_{4}+G\right)^{2} \tag{5.3}
\end{align*}
$$

where $c_{1}=d^{3}$, and $c_{2}=\frac{1}{\sqrt{d^{3}}}$. As one can see, the cost of combined PA and PD is $7 M$. In Fig. 5.4a, the data dependency graph for combined PA and PD is depicted employing three parallel multipliers. As illustrated in this figure, the latency is $3 M+9$ and employing more than three multipliers will not reduce the latency. This is the maximum possible number of parallel multipliers that can be used to accelerate the computation of combined PA and PD. However, by employing the hybrid multiplier we can reduce the latency to $2 M+10$ as shown in Fig. 5.4b. As one can see, the computation of $A \cdot B \cdot c_{1}$ is done in one step (Step 5) with the latency of $M+2$ clock cycles.

### 5.2.2.3 Binary Koblitz Curves

## Jacobian Projective Coordinates

In Jacobian projective coordinates [11], the projective point $(X: Y: Z), Z \neq 0$, corresponds to the affine point $\left(X / Z^{2}, Y / Z^{3}\right)$ with the projective equation of the curve being $Y^{2}+X Y Z=X^{3}+a X^{2} Z^{2}+b Z^{6}$. The addition formulas for computing $P_{3}=\left(X_{3}, Y_{3}, Z_{3}\right)=\left(X_{1}, Y_{1}, Z_{1}\right)+\left(x_{2}, y_{2}\right)$ in mixed coordinates cost $10 M+3 S+7 A$ with $Z_{2}=1$ as

$$
\begin{aligned}
& B=x_{2} Z_{1}^{2}, D=y_{2} Z_{1}^{2} Z_{1}, E=X_{1}+B, F=Y_{1}+D \\
& Z_{3}=E Z_{1}, H=x_{2} F+y_{2} Z_{3}, I=F+Z_{3}, G=Z_{3}^{2} \\
& X_{3}=a G+F I+E E^{2}, Y_{3}=I X_{3}+G H
\end{aligned}
$$

where $a \in\{0,1\}$. In Fig. 5.5, the data dependency graph for computing point addition on Koblitz curves with mixed coordinates is depicted. In Fig. 5.5a, we have employed three parallel field multipliers to reduce the latency as much as data dependency allows. As one can see, in Steps S5 and S8 three multipliers are operating while in Steps S2 and S11 only two multipliers are operating. Thus, the latency of the point addition is $4 M+13$. As one can realize, employing four or more multipliers does not reduce the latency due to the data dependencies in Steps S5, S8, and S11. In Fig. 5.5b, we have slightly modified the computation of point addition and employed a hybrid architecture to reduce the latency. As seen in this figure, in Step S2 a hybrid multiplier is employed to perform a double-multiplication. Also, in Step S5 the hybrid multiplier is used to perform two double-multiplications. Note that in Step S5 we recompute $Z_{3}=E \cdot Z_{1}$ employing another parallel multiplier. However, one can eliminate this multiplier and obtain $Z_{3}$ from the first output of the hybrid multiplier, i.e., DL-PISO. By employing the hybrid technique, the latency of mixed point addition on Koblitz curves with Jacobian coordinates is reduced to $3 M+14$, which is the smallest one that has been achieved in the literature.
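The mixed addition formulas above can be checked against the affine chord rule on a small example. The sketch below is only an illustration under explicit assumptions: it uses a toy field $G F\left(2^{8}\right)$ in polynomial basis with $b=1$ (a Koblitz-type curve $y^{2}+x y=x^{3}+a x^{2}+1$), handles only the generic case of two points with different $x$-coordinates, and all function names are hypothetical.

```python
M_BITS = 8
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (toy field, polynomial basis)


def mul(a, b):
    """Multiplication in GF(2^8); addition is XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M_BITS:
            a ^= POLY
    return r


def inv(a):
    """Inverse of a nonzero element as a^(2^m - 2)."""
    r = 1
    for _ in range(M_BITS - 1):
        a = mul(a, a)
        r = mul(r, a)
    return r


def on_curve(x, y, a):
    xx = mul(x, x)
    return (mul(y, y) ^ mul(x, y)) == (mul(xx, x) ^ (xx if a else 0) ^ 1)


def affine_add(p, q, a):
    """Affine addition of two points with different x-coordinates."""
    (x1, y1), (x2, y2) = p, q
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    return x3, mul(lam, x1 ^ x3) ^ x3 ^ y1


def mixed_add(X1, Y1, Z1, x2, y2, a):
    """Mixed Jacobian addition following the formulas in the text (Z2 = 1)."""
    Z1s = mul(Z1, Z1)
    B = mul(x2, Z1s)
    D = mul(mul(y2, Z1s), Z1)
    E, F = X1 ^ B, Y1 ^ D
    Z3 = mul(E, Z1)
    H = mul(x2, F) ^ mul(y2, Z3)
    I = F ^ Z3
    G = mul(Z3, Z3)
    X3 = (G if a else 0) ^ mul(F, I) ^ mul(E, mul(E, E))
    Y3 = mul(I, X3) ^ mul(G, H)
    return X3, Y3, Z3


def to_affine(X, Y, Z):
    zi = inv(Z)
    zi2 = mul(zi, zi)
    return mul(X, zi2), mul(Y, mul(zi2, zi))


def sample_points(a, count):
    """First `count` curve points with pairwise different nonzero x."""
    pts = []
    for x in range(1, 256):
        y = next((y for y in range(256) if on_curve(x, y, a)), None)
        if y is not None:
            pts.append((x, y))
        if len(pts) == count:
            break
    return pts


for a in (0, 1):
    p, q = sample_points(a, 2)
    Z1 = 0x53  # arbitrary nonzero scaling of the first point
    Zs = mul(Z1, Z1)
    X1, Y1 = mul(p[0], Zs), mul(p[1], mul(Zs, Z1))
    print(a, to_affine(*mixed_add(X1, Y1, Z1, q[0], q[1], a)) == affine_add(p, q, a))
```

Here $Z_{3}=E \cdot Z_{1}$ is nonzero whenever the two $x$-coordinates differ, so the conversion back to affine coordinates is well defined in this case.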

### 5.2.2.4 Attacking ECC2K-130

In [75], Fan et al. have performed an extensive investigation to solve one of the Certicom elliptic curve discrete logarithm problem (ECDLP) challenges, ECC2K-130, using Pollard's rho method [76]. They have focused on Koblitz curves over $G F\left(2^{131}\right)$ and, because several squarings are performed, normal basis is incorporated, as the Hamming weight of the $x$-coordinate is also represented with this basis [75]. Each iteration of their method requires five multiplications that cannot be reduced by employing parallel multipliers due to data dependencies. However, our proposed hybrid multiplier for GNB (for type 2) can be incorporated to reduce the latency of each iteration to four multiplications and improve the overall speed of the attack.


Figure 5.5: Parallel computation of point addition on Koblitz curves using Jacobian coordinates (a): with three finite field multipliers and (b): employing hybrid multiplier and three parallel multipliers.

### 5.3 Implementations

In this section, to study the time and area requirements of the proposed hybrid multiplier, we implemented it on a Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-4$ xc4vlx100-ff1148 FPGA and synthesized it using a 65-nm Complementary Metal-Oxide-Semiconductor (CMOS) library for application-specific integrated circuit (ASIC) technology. The proposed hybrid architecture for double-multiplication is modeled in VHDL and synthesized for different digit sizes using XST ${ }^{\text {TM }}$ of Xilinx ${ }^{\circledR}$ ISE ${ }^{\text {TM }}$ version 12.1 design software and Synopsys ${ }^{\circledR}$ Design Vision ${ }^{\circledR}$ which is a GUI for Synopsys ${ }^{\circledR}$ Design Compiler ${ }^{\circledR}$ tools. The implementation results are reported in Table 5.2 for different digit sizes over $G F\left(2^{163}\right)$. The correctness of the multiplier architectures is verified by Xilinx ${ }^{\circledR}$ ISE ${ }^{\text {TM }}$ Simulator (ISim). For the FPGA implementations, the optimization goal is set to speed (i.e., the default) and the optimization effort is set to normal, and the area (Slices, LUTs, and FFs) and timing ( $n s$ ) for the critical-path delays (CPD) are obtained for different digit sizes. It is noted that the FPGA implementation results are all post place-and-route results. For the ASIC implementations, the map effort is set to medium with a target clock period of 5 ns and the area $\left(\mu \mathrm{m}^{2}\right)$ and timing $(n s)$ are obtained for each of the designs.

### 5.4 Conclusion

In this chapter, we proposed for the first time a digit-level hybrid multiplier over GNB which performs two multiplications with the same latency as that of a single multiplier proposed in the literature. We employed the proposed hybrid architecture to reduce the latency of double-exponentiation. The analysis results indicate that the proposed hybrid multiplier architecture reduces the latency of double-exponentiation by about $50 \%$. Moreover, we employed the hybrid multiplier architecture to reduce the latency of point multiplication on binary Edwards, generalized Hessian, and Koblitz curves. It is shown that the proposed scheme reduces the latency of point multiplication by about $33 \%$ for both binary Edwards and generalized Hessian curves and by $25 \%$ for Koblitz curves using Jacobian coordinates. Therefore, using our hybrid multiplier, the point multiplication on binary Edwards and generalized Hessian curves is competitive with that on binary generic curves and provides completeness for input points. It is worth mentioning that the proposed architecture is suitable for applications in which fast computation of point multiplication is desired.

Table 5.2: ASIC and FPGA implementation results for the proposed low-complexity hybrid multiplier architecture (Fig. 5.1) over $G F\left(2^{163}\right)$ for different digit sizes.

## Chapter 6

## Highly Parallel and Fast Crypto-Processor for Point Multiplication on Koblitz Curves

In this chapter, based on the DL-PIPO GNB multiplier architecture proposed in Chapter 3, we propose a highly parallel and fast crypto-processor for point multiplication on Koblitz curves. Binary Koblitz (or anomalous) curves are a special class of binary generic curves on which point multiplication can be computed efficiently using their special properties. These curves employ the Frobenius map (instead of doubling) and the point addition operation using projective mixed coordinates for computing point multiplication. The binary Koblitz curves are specified in NIST [19], IEEE [18], and SEC2 [77] as the most commonly standardized and specified binary curves for different levels of security depending on the availability of the resources. In the recent past, considerable efforts have been made to accelerate the computation of point multiplication over binary elliptic curves. Those include parallelization [78], [6], and [10], interleaving [79], [26], and pipelining [80]. The two former techniques are used to reduce the latency of the computation, whereas the latter is used to increase the maximum operating clock frequency. In this chapter, we employ parallelization and efficient pipelining in our implementations for high speed applications.

Parallelization is a well-known approach to accelerate ECC computations by employing multiple parallel field arithmetic units (mainly multipliers) at the lower level, i.e., finite field computations; see, for instance, [81], [78], and [82]. It is worth mentioning that when there are dependencies among the lower-level computations, achieving parallelization is a challenging task, and employing more than a certain number of parallel arithmetic units will not increase the speed of ECC computations.

Recently, several methods to perform parallel computations for point addition on Koblitz curves have been proposed [83], [10], and [26]. It has been claimed that the maximum number of finite field multipliers needed to achieve the highest parallelization in computing point multiplication on Koblitz curves is three. However, here we modify the point addition formulation in such a way as to employ four multipliers to reduce the latency of point addition. This technique increases the overall speed of point multiplication on Koblitz curves. To do so, we first perform a data-flow analysis for the ECC computations to understand how data has to move between the different logic and computational elements such as field multipliers, adders, and squarers. Then, we perform a latency analysis to determine where potential bottlenecks may occur and find a balance between the desired performance and the cost of implementing the design. To this end, we modify the point addition formulation to employ four parallel finite field multipliers, which reduces the latency of point multiplication by about $25 \%$.

For investigating the practical performance of the proposed architecture, we implement it on FPGA for different digit sizes over $G F\left(2^{163}\right)$ targeting the applications where high speed is required and area usage should be considered as well. It is noted that our method can be applied to any finite field representation and for the sake of efficient implementation and comparison, we use GNB in this chapter.

The rest of this chapter is organized as follows. In Section 6.1, properties of Koblitz curves and arithmetic on these curves are presented. In Section 6.2, parallel computation of point multiplication is investigated. In Section 6.3, the hardware architecture of the proposed crypto-processor on Koblitz curves is presented. In Section 6.4, the implementation results for the proposed architecture on FPGA are presented. Finally, we conclude this chapter in Section 6.5.

### 6.1 Properties of Koblitz Curves

In a finite field of characteristic two, the Frobenius map $\phi$ is an endomorphism that squares every element, i.e., $\phi: x \rightarrow x^{2}$. Squaring over $G F\left(2^{m}\right)$ using GNB is a free operation in hardware. Hence, the Frobenius endomorphism can be carried out efficiently (at no cost) if the elements of the finite field are represented in normal basis [11]. Koblitz [84] showed that point doublings can be performed efficiently by utilizing the Frobenius endomorphism if the binary curve is defined over $G F(2)$ as

$$
\begin{equation*}
E_{K, a} / G F\left(2^{m}\right): y^{2}+x y=x^{3}+a x^{2}+1, \tag{6.1}
\end{equation*}
$$

and $a \in\{0,1\}$. Then, the Frobenius map can be defined as

$$
\begin{aligned}
\phi & : E\left(G F\left(2^{m}\right)\right) \rightarrow E\left(G F\left(2^{m}\right)\right) \\
& (x, y) \rightarrow\left(x^{2}, y^{2}\right),
\end{aligned}
$$

and one can show that

$$
\phi^{2}(P)-\mu \phi(P)+2 P=0 \text { for every } P \in E_{K, a}\left(G F\left(2^{m}\right)\right)
$$

Here, $\mu=(-1)^{1-a}$. Let $\tau$ be a complex root of $P(T)=T^{2}-\mu T+2$, which is the characteristic polynomial of the Frobenius endomorphism. Then, if one represents the scalar $k$ in $\tau$-adic NAF ($\tau$NAF), i.e., $k=\sum_{i=0}^{l-1} k_{i} \tau^{i}$ for $k_{i} \in\{0,1,-1\}$ and $k_{i} k_{i+1}=0$, point multiplication can be computed as $k P=\sum_{i=0}^{l-1} k_{i} \tau^{i}(P)$ [11]. The average density of non-zero digits in the $\tau$NAF is the same as in the binary NAF, i.e., about $1/3$, whereas its length is approximately $2 m$, which is twice the length of the binary NAF. Since $\left(\phi^{m}-1\right) P=\phi^{m} P-P=P-P=\mathcal{O}$ holds for all $P \in E_{K, a}\left(G F\left(2^{m}\right)\right)$, Solinas [85] proposed a method in which, if $k^{\prime} \equiv k(\bmod \delta)$ with $\delta=\left(\tau^{m}-1\right) /(\tau-1)$, then $k^{\prime} P=k P$ and the length of the $\tau$NAF of the remainder $k^{\prime}$ is reduced to about $m$. Recently, efficient hardware architectures for $\tau$NAF conversion have been proposed in [86], [87], and [88].
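
As a concrete illustration of the $\tau$NAF representation, the following minimal Python sketch converts an integer $k$ into $\tau$NAF digits using the well-known repeated division by $\tau$ and checks the result exactly in $\mathbb{Z}[\tau]$. The function names `tnaf` and `eval_tau` are illustrative and not taken from this thesis, no reduction modulo $\delta$ is performed, and the hardware converters of [86], [87], and [88] are organized differently.

```python
def tnaf(k, mu):
    """Return the tau-NAF digits of the integer k, least significant first.

    mu = (-1)**(1 - a) selects the Koblitz curve (mu = 1 for a = 1).
    """
    r0, r1 = k, 0                      # current value is r0 + r1 * tau
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 % 2 == 1:
            # Pick u in {1, -1} so that the next digit is forced to be zero.
            u = 2 - ((r0 - 2 * r1) % 4)
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # Divide r0 + r1 * tau by tau (r0 is even here), using tau^2 = mu*tau - 2.
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits


def eval_tau(digits, mu):
    """Evaluate sum(digits[i] * tau**i) exactly; returns (a, b) meaning a + b*tau."""
    a, b = 0, 0
    for u in reversed(digits):         # Horner's rule, most significant digit first
        # (a + b*tau) * tau + u = (-2*b + u) + (a + mu*b) * tau
        a, b = -2 * b + u, a + mu * b
    return a, b


if __name__ == "__main__":
    k = 2011
    d = tnaf(k, 1)
    print(len(d), sum(1 for u in d if u), eval_tau(d, 1) == (k, 0))
```

The exact evaluation in `eval_tau` avoids floating-point complex arithmetic, so the identity $k=\sum_i k_i \tau^i$ can be checked without rounding concerns.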

In normal basis, when $P=(x, y)$ is known, $\tau^{i}(P)$ can be computed by $i$-fold right cyclic shifts of the $x$ and $y$ coordinates of $P$, i.e., $\tau^{i}(P)=\left(x^{2^{i}}, y^{2^{i}}\right)=(x \gg i, y \gg i)$. As $2 P=-\tau^{2}(P)+\mu \tau(P)$, the point doubling operation requires two squarings and a point addition. The faster computation of $\tau(P)=(x \gg 1, y \gg 1)$ in normal basis results in a faster point multiplication $Q=k P=\sum_{i=0}^{m-1} k_{i} \tau^{i}(P)$ than the traditional methods [89].
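
The claim that squaring is a cyclic shift in normal basis can be checked on a toy field. The sketch below is only an illustration under stated assumptions: it uses $G F\left(2^{3}\right)$ defined by $x^{3}+x^{2}+1$ (chosen here for brevity, not a field used in this thesis), takes the root of that polynomial as the normal element, and finds normal-basis coordinates by brute force. All names are hypothetical.

```python
M = 3
POLY = 0b1101                  # x^3 + x^2 + 1, an assumed toy field polynomial


def gf_mul(a, b):
    """Multiply two elements of GF(2^M) given in polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


# Normal basis {beta, beta^2, beta^4} generated by beta = x (assumed normal element).
BASIS = [0b010]
for _ in range(M - 1):
    BASIS.append(gf_mul(BASIS[-1], BASIS[-1]))


def to_normal(e):
    """Coordinates (c0, c1, c2) with e = c0*beta + c1*beta^2 + c2*beta^4."""
    for c in range(1 << M):
        bits = [(c >> i) & 1 for i in range(M)]
        acc = 0
        for bit, b in zip(bits, BASIS):
            if bit:
                acc ^= b
        if acc == e:
            return bits
    raise ValueError("BASIS is not a basis of the field")


def frobenius(bits, i=1):
    """i-fold squaring of a normal-basis vector: a right cyclic shift by i."""
    i %= len(bits)
    return bits[-i:] + bits[:-i]


if __name__ == "__main__":
    for e in range(1 << M):
        print(e, to_normal(e), "->", frobenius(to_normal(e)))
```

In hardware the shift is pure rewiring, which is why the Frobenius map is treated as free in the cost analysis of this chapter.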

### 6.1.1 Point Addition on Koblitz Curves

Point addition on Koblitz curves can be performed using different coordinate systems, such as Jacobian, standard projective, and Lopez-Dahab projective coordinates. Among them, the Lopez-Dahab coordinate system provides an efficient point addition formulation, as described in the following.

### 6.1.1.1 Lopez-Dahab Projective Coordinates

In Lopez-Dahab coordinates [3], the triple $(X, Y, Z)$ represents the affine point $\left(X / Z, Y / Z^{2}\right)$ when $Z \neq 0$, and $\mathcal{O}=(1,0,0)$. The curve equation in these coordinates is

$$
Y^{2}+X Y Z=X^{3} Z+a X^{2} Z^{2}+b Z^{4}, a, b \in G F\left(2^{m}\right)
$$

and the costs of point addition and doubling are $13 \mathbf{M}+4 \mathbf{S}+9 \mathbf{A}$ and $5 \mathbf{M}+4 \mathbf{S}+5 \mathbf{A}$, respectively, where $\mathbf{M}$, $\mathbf{S}$, and $\mathbf{A}$ denote the costs of a field multiplication, squaring, and addition, respectively. In Lopez-Dahab coordinates, when one of the points is represented in affine coordinates, the cost of mixed projective point addition, i.e., $\left(X_{3}, Y_{3}, Z_{3}\right)=\left(X_{1}, Y_{1}, Z_{1}\right)+\left(x_{2}, y_{2}\right)$, reduces to $9 \mathbf{M}+5 \mathbf{S}+9 \mathbf{A}$ [55].

The explicit formulas are given as follows [55]:

$$
\begin{align*}
& Z:\left\{\begin{array}{l}
A=Y_{1}+y_{2} Z_{1}^{2}, B=X_{1}+x_{2} Z_{1} \\
C=B Z_{1}, \\
Z_{3}=C^{2},
\end{array}\right. \\
& X:\left\{\begin{array}{l}
D=x_{2} Z_{3}, \\
X_{3}=A^{2}+C\left(A+B^{2}+a C\right),
\end{array}\right. \\
& Y: Y_{3}=\left(D+X_{3}\right)\left(A C+Z_{3}\right)+\left(y_{2}+x_{2}\right) Z_{3}^{2} \tag{6.2}
\end{align*}
$$

where $a \in\{0,1\}$ for Koblitz curves, so the multiplication $a C$ is free and the cost reduces to $8 \mathbf{M}+5 \mathbf{S}+9 \mathbf{A}$.
The binary Koblitz curve sect163K1 with $a=1$ [11] is specified in SEC2 [77] as the most widely standardized binary curve at the 83-bit security level.
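
To make formulation (6.2) concrete, the following Python sketch evaluates it over a toy field and compares the result with the textbook affine addition law. It is a minimal sketch under assumptions that are not part of this thesis: the field is $G F\left(2^{5}\right)$ with the polynomial $x^{5}+x^{2}+1$ in polynomial basis, $a=1$, and only the generic case $x_{1} \neq x_{2}$ is handled. All function names are hypothetical.

```python
M, POLY, A_COEFF = 5, 0b100101, 1      # assumed toy field GF(2^5) and curve a = 1


def mul(a, b):
    """Field multiplication in GF(2^M), polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


def inv(a):
    """Inverse as a^(2^M - 2); adequate for a toy field."""
    r = 1
    for _ in range(2 ** M - 2):
        r = mul(r, a)
    return r


def curve_points():
    """All affine points of y^2 + xy = x^3 + a*x^2 + 1 by exhaustive search."""
    return [(x, y) for x in range(1 << M) for y in range(1 << M)
            if mul(y, y) ^ mul(x, y)
            == mul(mul(x, x), x) ^ mul(A_COEFF, mul(x, x)) ^ 1]


def affine_add(P, Q):
    """Affine addition, valid only when the x-coordinates differ."""
    (x1, y1), (x2, y2) = P, Q
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEFF
    return x3, mul(lam, x1 ^ x3) ^ x3 ^ y1


def ld_mixed_add(X1, Y1, Z1, x2, y2):
    """Mixed Lopez-Dahab addition following (6.2)."""
    A = Y1 ^ mul(y2, mul(Z1, Z1))
    B = X1 ^ mul(x2, Z1)
    C = mul(B, Z1)
    Z3 = mul(C, C)
    D = mul(x2, Z3)
    X3 = mul(A, A) ^ mul(C, A ^ mul(B, B) ^ mul(A_COEFF, C))
    Y3 = mul(D ^ X3, mul(A, C) ^ Z3) ^ mul(y2 ^ x2, mul(Z3, Z3))
    return X3, Y3, Z3


def to_affine(X, Y, Z):
    """Convert (X, Y, Z) to (X/Z, Y/Z^2)."""
    zi = inv(Z)
    return mul(X, zi), mul(Y, mul(zi, zi))


if __name__ == "__main__":
    P1, P2 = [p for p in curve_points() if p[0] != 0][:2]
    print(to_affine(*ld_mixed_add(P1[0], P1[1], 1, *P2)), affine_add(P1, P2))
```

Software field arithmetic is used here only to check the algebra; it says nothing about the hardware cost, which is governed by the multiplier count discussed above.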

### 6.1.2 Point Multiplication on Koblitz Curves

The algorithm for computing point multiplication, i.e., $Q=k P$, on Koblitz curves is given in Algorithm 6.1, where the scalar $k$ is represented in $\tau$NAF [11]. This algorithm requires on average $m-1$ Frobenius maps and $m / 3-1$ point additions or subtractions. Since Frobenius maps can be computed with free squarings in normal basis, the computation of point addition determines the efficiency of point multiplication. Therefore, our main focus is on high performance computation of point multiplication employing multiple efficient digit-level finite field multipliers. In the following, we study the parallelization of point addition on Koblitz curves.

```
Algorithm 6.1 Point multiplication on Koblitz curves using Double-and-add-or-subtract algorithm [11].
Inputs: A point \(P=(x, y) \in E_{K}\left(G F\left(2^{m}\right)\right)\) on curve
and integer \(k, k=\sum_{i=0}^{l-1} k_{i} \tau^{i}\) for \(k_{i} \in\{0, \pm 1\}\).
Output: \(Q=k P\).
1: initialize
    a: if \(k_{l-1}=1\) then \(Q \leftarrow(x, y, 1)\)
    b: if \(k_{l-1}=-1\) then \(Q \leftarrow(x, x+y, 1)\)
2: for \(i\) from \(l-2\) downto 0 do
    \(Q \leftarrow \phi(Q)=\left(X^{2}, Y^{2}, Z^{2}\right)\)
    if \(k_{i} \neq 0\) then
        \(Q \leftarrow Q+k_{i} P=(X, Y, Z) \pm(x, y)\)
    end if
   end for
3: return \(Q \leftarrow\left(X / Z, Y / Z^{2}\right)\)
```

### 6.2 High-Speed Parallelization of Point Addition

Parallelization for hardware implementation of point addition on Koblitz curves has been investigated employing different numbers of field multipliers in [10], [78], [82], and [81]. In [10], it is shown that employing two finite field multipliers reduces the number of multiplications in the critical data path (and hence the latency of ECC point multiplication) to five. It is also shown in [10] that the maximum number of parallel finite field multipliers that can be employed to implement the fastest point multiplication is three, and that employing three parallel finite field multipliers reduces the number of multiplications in the longest data path to four. The data dependency graph for point addition employing three multipliers is depicted in Fig. 6.1a [10]. As one can see, the latency of point addition is $4 M+13$, where $M$ is the latency of a multiplication. In Step S4, only one multiplier is operating and the other two multipliers are idle. This is mainly because the computation of $C$ depends on $B$ (as shown in (6.2)), which does not allow us to compute $B$ and $C$ in parallel. As seen from Fig. 6.1a, a potential bottleneck occurs in computing $C$, which uses only one multiplier in Step S4. This results in $66 \%$ multiplier utilization (8 out of 12 multiplications) for the data dependency graph presented in Fig. 6.1a employing three parallel multipliers.

Figure 6.1: Data dependency graph for parallel computation of point addition on Koblitz curves (a): using three finite field multipliers adopted from [10] (b): proposed scheme employing four multipliers.

The formulation of point addition [55] can be modified to employ one additional parallel multiplication to reduce its latency, as stated in the following proposition.

In computing the $Z$ coordinate of the point addition formulation of (6.2), the data dependency in computing $C$ can be eliminated by the following

$$
Z:\left\{\begin{array}{l}
A=Y_{1}+y_{2} Z_{1}^{2}, B=X_{1}+x_{2} Z_{1},  \tag{6.3}\\
C=x_{2} Z_{1}^{2}+X_{1} Z_{1}, Z_{3}=C^{2},
\end{array}\right.
$$

As one can see from (6.3), the computation of $C$ can be performed in parallel with $B$ at the cost of one more multiplication as compared to the formulation presented in (6.2). Therefore, we can employ four multipliers in parallel to compute point addition. The data dependency graph for computing point addition based on (6.3) is depicted in Fig. 6.1b, which employs four parallel multipliers. As one can see, in Step S2 of Fig. 6.1b four multipliers are operating in parallel. Therefore, the multiplication in Step S4 in Fig. 6.1a is eliminated. As seen in Fig. 6.1b, the number of field multiplications in the critical data path is reduced to three, with an overall latency of $3 M+13$ clock cycles. Therefore, employing four parallel multipliers results in about $25 \%$ reduction in the multiplication latency of point addition (from $4 M$ to $3 M$) in comparison with the case where three multipliers are employed. Note that the multiplier utilization is increased to $75 \%$, as 9 out of 12 multiplications are performed using four multipliers. Our presented approach reduces the latency of point addition using four field multipliers and consequently speeds up the point multiplication, as explained before.
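
The multiplication counts and critical-path depths quoted above can be reproduced with a small dependency-graph model. The sketch below is an illustration only: the graph is transcribed from (6.2) and (6.3) with $a=1$, the node names (`E`, `T`, `U`) are hypothetical labels for intermediate values, squarings and additions are treated as having zero multiplication depth, and the utilization figure assumes that a schedule with as many steps as the critical-path depth exists, as shown in Fig. 6.1.

```python
def build_graph(parallel_c):
    """Operations of mixed point addition for a = 1: name -> (is_mult, operands)."""
    g = {
        "Z1^2": (False, ["Z1"]),
        "y2*Z1^2": (True, ["y2", "Z1^2"]),
        "A": (False, ["Y1", "y2*Z1^2"]),
        "x2*Z1": (True, ["x2", "Z1"]),
        "B": (False, ["X1", "x2*Z1"]),
        "Z3": (False, ["C"]),
        "D": (True, ["x2", "Z3"]),
        "B^2": (False, ["B"]),
        "E": (False, ["A", "B^2", "C"]),        # E = A + B^2 + C (a = 1)
        "C*E": (True, ["C", "E"]),
        "A^2": (False, ["A"]),
        "X3": (False, ["A^2", "C*E"]),
        "A*C": (True, ["A", "C"]),
        "Z3^2": (False, ["Z3"]),
        "T": (True, ["y2", "x2", "Z3^2"]),      # T = (y2 + x2) * Z3^2
        "U": (True, ["D", "X3", "A*C", "Z3"]),  # U = (D + X3) * (A*C + Z3)
        "Y3": (False, ["U", "T"]),
    }
    if parallel_c:                               # formulation (6.3)
        g["x2*Z1^2"] = (True, ["x2", "Z1^2"])
        g["X1*Z1"] = (True, ["X1", "Z1"])
        g["C"] = (False, ["x2*Z1^2", "X1*Z1"])
    else:                                        # formulation (6.2)
        g["C"] = (True, ["B", "Z1"])
    return g


def mult_depth(g, node):
    """Number of multiplications on the longest path ending at node."""
    if node not in g:                            # primary input
        return 0
    is_mult, operands = g[node]
    return int(is_mult) + max(mult_depth(g, op) for op in operands)


def summary(g, multipliers):
    """Return (critical-path depth, multiplication count, utilization)."""
    depth = max(mult_depth(g, out) for out in ("X3", "Y3", "Z3"))
    count = sum(is_mult for is_mult, _ in g.values())
    return depth, count, count / (depth * multipliers)


if __name__ == "__main__":
    print(summary(build_graph(False), 3))        # three multipliers, (6.2)
    print(summary(build_graph(True), 4))         # four multipliers, (6.3)
```

The model deliberately ignores the 13 additional clock cycles of the adders, squarers, and register transfers, so it captures only the $4 M$ versus $3 M$ part of the latency.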

### 6.2.1 Latency of Point Multiplication

Point multiplication on Koblitz curves is composed of three main blocks: the $\tau$NAF converter, the main processor (point addition and Frobenius map), and the coordinate converter. In [88], an efficient circuit is presented for $\tau$NAF conversion which requires $m+6$ clock cycles for $m=163$. Also, the latency of coordinate conversion from projective Lopez-Dahab to affine coordinates is $11 M+11$ based on the Itoh-Tsujii method [38]. Since these latencies are fixed for all implementations, we only compare the latency of the main processor in computing point additions, as given in Table 6.1. We assume that two adders and two squarers are available, based on the data dependency graph depicted in Fig. 6.1b. In this table, $H(k)$ is the Hamming weight of the $\tau$NAF expansion of $k$, and the total latency of point addition is computed by multiplying the number of non-zero terms in $k$ by the latency of a point addition. As shown in Table 6.1, for higher speed implementations our proposed data dependency graph provides a smaller latency in comparison to the others, assuming equal cost for the Frobenius maps. We note that if one employs polynomial basis to represent field elements, the cost of the Frobenius map should be considered as well.

Table 6.1: Comparison of the latency for performing point addition in the main loop on Koblitz curves in terms of the number of multipliers.

| \# of Multipliers | $E_{K}[10]$ | This work |
| :---: | :---: | :---: |
| 4 | $(H(k)-1)(4 M+13)$ | $(H(k)-1)(3 M+13)$ |

Figure 6.2: The architecture of point multiplication crypto-processor

### 6.3 Proposed Crypto-processor for Point Multiplication

In this section, we present a hardware architecture for point multiplication on Koblitz curves. The architecture of the crypto-processor is depicted in Fig. 6.2. As one can see, it consists of a field arithmetic unit (FAU), a register file, a coordinate converter, and a control unit. The registers store the point coordinates as well as the intermediate and final values during point additions. In the following, we explain how the proposed architecture operates and produces the point multiplication result for a given point $P$ and scalar $k$ represented in $\tau$NAF.

### 6.3.1 Field Arithmetic Unit (FAU)

The FAU performs four basic arithmetic operations employing four digit-level GNB multipliers, two $G F\left(2^{m}\right)$ adders, and two squarers. Multiplication in $G F\left(2^{m}\right)$ plays the main role in determining the efficiency of the point multiplication in the crypto-processor. Finite field multipliers are available in bit-level (with area complexity of $O(m)$ and time complexity of $O(m)$), digit-level (with area complexity of $O(m d)$ and time complexity of $O(m / d)$), and bit-parallel (with area complexity of $O\left(m^{2}\right)$ and time complexity of $O(1)$) architectures, depending on the available resources. We employ the low-complexity and pipelined digit-level parallel-in parallel-out GNB multiplier presented in Chapter 3. Recall that in a digit-level parallel-in parallel-out GNB multiplier both input operands, $A$ and $B$, should be present throughout the multiplication process and the result is available in parallel after $M=\left\lceil\frac{m}{d}\right\rceil$ clock cycles, $1 \leq d \leq m$. Thus, with one clock cycle added for one level of pipelining, the latency of the multiplier (in terms of clock cycles) is $M+1=\left\lceil\frac{m}{d}\right\rceil+1$. For the given field size $m=163$ (for which a type 4 GNB exists), the digit sizes are chosen such that each increase in $d$ actually reduces the latency $\left\lceil\frac{m}{d}\right\rceil$. Therefore, we choose the digit sizes from the set $d \in\{11,21,33,41,55\}$ for $m=163$. We note that the finite field multiplier determines the time and area requirements of the point multiplier of the crypto-processor. A digit-level version of the Massey-Omura multiplier [35] is investigated for FPGA implementation of ECC on Koblitz curves in [90], [91], [23], [26], and [10]. In terms of area complexity, the Massey-Omura multiplier requires $d m$ AND gates and $d T(m-1)$ XOR gates, and its critical-path delay is $T_{A}+\left(\left\lceil\log _{2} T\right\rceil+\left\lceil\log _{2} m\right\rceil\right) T_{X}$ for type $T$ GNB. Note that the multiplier employed in this work requires smaller area in comparison to the counterparts used in [91], [23], [26], and [10].

The $G F\left(2^{m}\right)$ adder uses $m$ XOR gates to perform the addition and requires only one clock cycle to store the result in the registers. The squarer is a simple rewiring in normal basis and requires one clock cycle to store its result in the registers. Note that the Frobenius map is performed on the coordinates $X$, $Y$, and $Z$ independently.
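
The relation between digit size and multiplier latency can be tabulated with a few lines of Python. This is a sketch for illustration: the helper names are hypothetical, and the observation that each chosen digit size is the smallest one reaching its cycle count is derived here from $\lceil m / d\rceil$, not stated in this form in the thesis.

```python
def multiplier_latency(m, d, pipeline_stages=1):
    """Clock cycles of a digit-level multiplier: ceil(m/d) plus pipeline stages."""
    return -(-m // d) + pipeline_stages


def minimal_digit_sizes(m):
    """Map each achievable ceil(m/d) to the smallest digit size d that reaches it."""
    best = {}
    for d in range(1, m + 1):
        best.setdefault(-(-m // d), d)
    return best


if __name__ == "__main__":
    m = 163
    smallest = minimal_digit_sizes(m)
    for d in (11, 21, 33, 41, 55):
        cycles = multiplier_latency(m, d, pipeline_stages=0)
        print(d, cycles + 1, smallest[cycles] == d)
```

Any digit size between two of the listed values would cost additional area without saving a clock cycle, which is consistent with the digit sizes used in this chapter.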

### 6.3.2 Control Unit and the Register File

The control unit is designed as a finite state machine (FSM) to perform the point multiplication together with the other units. First, the coordinates of $P=(x, y)$ are loaded to the registers. Once $k$ is available in the $\tau$NAF representation at the input of the control unit, the FAU starts the computations based on the FSM stored in the control unit. The final and intermediate results are stored in the registers. The data bus width is set to 163 bits.

Table 6.2: The implementation results of the point multiplication on Koblitz curves on Altera ${ }^{\circledR}$ Stratix ${ }^{\text {TM }}$ II EP2S180F1020C3 FPGA device.

| $d$ | $M+1$ | Latency ($L_{\text{Total}}$) | $f_{\max}$ (MHz) | Area (ALMs) | P.M. Time ($\mu$s) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| 11 | 16 | 3791 | 198 | 7,978 | 19.15 |
| 21 | 9 | 2601 | 195 | 13,032 | 13.45 |
| 33 | 6 | 2091 | 192 | 20,386 | 10.89 |
| 41 | 5 | 1921 | 191 | 24,815 | 10.22 |
| 55 | 4 | 1751 | 165 | 32,856 | 10.62 |

### 6.3.3 Coordinate Converter

The coordinate converter takes the projective coordinates of $Q=k P$, i.e., $(X, Y, Z)$, and provides the affine coordinates of $Q=(x, y)=\left(X / Z, Y / Z^{2}\right)$ using an inversion based on the Itoh-Tsujii scheme [38] and field multiplications. As one can see in Fig. 6.2, it employs a multiplier and a squarer. The coordinate converter is implemented as dedicated hardware, and its latency and area are included in the implementation results presented in Table 6.2.
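
The multiplication count behind the $11 M$ term of the conversion latency can be illustrated with a software model of Itoh-Tsujii inversion. The sketch below is a generic textbook variant under stated assumptions, not the converter of Fig. 6.2: it works in polynomial basis over the toy field $G F\left(2^{7}\right)$ with $x^{7}+x+1$, counts only general multiplications (squarings are free in normal basis), and uses hypothetical function names.

```python
def gf_mul(a, b, m, poly):
    """Multiply in GF(2^m), polynomial basis, reduction polynomial poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << m):
            a ^= poly
    return r


def itoh_tsujii(a, m, poly):
    """Return (a^-1, number of general multiplications used)."""
    beta, k, mults = a, 1, 0               # invariant: beta = a^(2^k - 1)
    for bit in bin(m - 1)[3:]:             # bits of m-1 after the leading one
        t = beta
        for _ in range(k):                 # t = beta^(2^k), k squarings
            t = gf_mul(t, t, m, poly)
        beta = gf_mul(t, beta, m, poly)    # beta = a^(2^(2k) - 1)
        mults += 1
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_mul(beta, beta, m, poly), a, m, poly)
            mults += 1
            k += 1
    return gf_mul(beta, beta, m, poly), mults   # final squaring: a^(2^m - 2)


def itoh_tsujii_mults(m):
    """floor(log2(m-1)) + HammingWeight(m-1) - 1 multiplications."""
    return (m - 1).bit_length() - 1 + bin(m - 1).count("1") - 1


if __name__ == "__main__":
    # Inversion plus the products X * Z^-1 and Y * Z^-2 for m = 163.
    print(itoh_tsujii_mults(163) + 2)
```

With this count, the inversion for $m=163$ accounts for nine multiplications and the two final products for the remaining two, which agrees with the $11 M$ term quoted in Section 6.2.1.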

### 6.4 FPGA Implementations

FPGAs have advantages for prototyping and proof of concept. To have a fair comparison with previous works, we have selected the Altera ${ }^{\circledR}$ Stratix ${ }^{\text {TM }}$ II EP2S180F1020C3 device as the target FPGA for our implementations. In terms of available resources, the target FPGA contains 71,760 ALMs (143,520 ALUTs and 143,520 registers) and 743 input/output (I/O) pins. Each ALM contains two flip-flops (FFs) and two adaptive look-up tables (ALUTs). ALUTs are flexible and can be used to implement up to a 7-to-1-bit LUT. The architecture for point multiplication of the crypto-processor presented in Section 6.3 is coded in VHDL and synthesized for different digit sizes $d \in\{11,21,33,41,55\}$ for the Koblitz curve defined over $G F\left(2^{163}\right)$.

We use Altera ${ }^{\circledR}$ Quartus ${ }^{\circledR}$ II version 11 design software for our implementations. The area and maximum clock frequencies of the implementations after place and route (provided by the fitter) are reported in Table 6.2. As one can see, increasing the digit size reduces the latency of the point multiplication, i.e., $L_{\text {Total }}$, at the cost of an increase in area and a decrease in the operating clock frequency. The point multiplication time is obtained by dividing the total number of clock cycles $\left(L_{\text {Total }}\right)$ by the maximum operating clock frequency $\left(f_{\max }\right)$. To achieve higher clock frequencies, we pipelined the digit-level GNB multiplier with only one level of pipeline registers. Therefore, we add one clock cycle to the latency of the multiplier, as seen in the second column of Table 6.2 (i.e., $M+1$). The latency of loading the operands to the multipliers is counted in the total latency, as shown in the data dependency graph illustrated in Fig. 6.1. Note that the fastest computation of point multiplication is obtained for $d=41$, which takes $10.22 \mu$s employing 24,815 ALMs.

Figure 6.3: (a): Latency of point computation on Koblitz curves over $G F\left(2^{163}\right)$ for different digit sizes. (b): Latency-area product of the proposed architecture for point multiplication.

In Fig. 6.3a, the latency of point multiplication is plotted in terms of digit sizes. As one can see, as $d$ increases the latency of point multiplication decreases, and $d=41$ is the largest digit size that results in a significant reduction in latency. To investigate the efficiency of the proposed architecture in terms of time-area trade-offs, we plot the latency-area product for different digit sizes in Fig. 6.3b. As one can see, the latency-area product always increases as the digit size increases, but the increase is moderate when $d \leq 41$.
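
The latency-area trend described above can be recomputed directly from Table 6.2. The following Python sketch is only a convenience for the reader: the data are copied from the table, and the product is taken as clock cycles times ALMs, which is an assumption about how the quantity in Fig. 6.3b is defined.

```python
# (digit size d, total latency in clock cycles, area in ALMs), from Table 6.2
TABLE_6_2 = [
    (11, 3791, 7978),
    (21, 2601, 13032),
    (33, 2091, 20386),
    (41, 1921, 24815),
    (55, 1751, 32856),
]


def latency_area_products(rows):
    """Latency-area product (clock cycles x ALMs) for each digit size."""
    return {d: latency * alms for d, latency, alms in rows}


if __name__ == "__main__":
    products = latency_area_products(TABLE_6_2)
    base = products[11]
    for d, value in products.items():
        # Print the product and its growth relative to d = 11.
        print(d, value, round(value / base, 2))
```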

In what follows, we compare the implementation results to the counterparts especially the ones recently proposed in the literature.

### 6.4.1 Comparisons

High performance FPGA implementations of point multiplication on Koblitz curves have been considered in [90], [91], [23], [79], [26], and [10]. In Table 6.3, their best results in terms of time and area are summarized for point multiplication on the Koblitz curve over $G F\left(2^{163}\right)$, i.e., NIST K-163. As one can see, we implement our point multiplication crypto-processor on the same FPGA device used by the counterparts. This makes our time and area comparisons fair and feasible.

As mentioned in Subsection 6.3.1, the finite field multiplier determines the area and time requirements of an ECC crypto-processor. We note that the finite field multiplier employed in this work, i.e., the digit-level parallel-in parallel-out GNB multiplier, requires smaller area and operates at higher clock frequencies as compared to the ones used in [90], [91], [23], [79], [26], and [10].

The latency of the proposed architecture for point addition is smaller than that of the counterparts and is comparable with the one proposed in [26]. In [26] and [79], a scheme known as interleaving is proposed to reduce the latency of point addition on Koblitz curves. The interleaving idea is based on the fact that a point addition requires the result of the previous point addition. Thus, some parts of it (i.e., coordinates $Z$ and $X$) can be processed with the data available before the previous operation (computing $Y$) is finished. This scheme reduces the latency of point addition by about $50 \%$ compared with the one proposed in [10], employing four finite field multipliers. We note that in a reliable crypto-processor, a check validating that the resulting point is not the point at infinity is required. Employing interleaving as in [26] and [79] may result in redundant computations when a point at infinity occurs. Therefore, our proposed scheme is the fastest for computing point multiplication after the one proposed in [26], which is slightly faster.

In [23], a method to reduce the number of point additions for computing point multiplication on Koblitz curves is proposed. Instead of representing $k$ in $\tau$-adic NAF, a two-dimensional Frobenius expansion (based on Kleinian integers) is introduced. This reduces the number of non-zero terms in $k$ and consequently reduces the number of point additions. Also, instead of taking advantage of parallelism at the lower level, multiple processors are used to compute the point multiplication, and the best results (in terms of time-area trade-off) have been reported for four processors. A digit-level version of the Massey-Omura multiplier with digit size $d=25$ over $G F\left(2^{163}\right)$ is employed in each processor to perform finite field multiplications. With an efficient choice of the parameters for the two-dimensional Frobenius expansion of $k$, the smallest latency and time to compute a point multiplication are obtained as 2033 clock cycles and $17.15 \mu \mathrm{~s}$ ($13.38 \mu \mathrm{~s}$ without conversion), respectively. It is worth mentioning that parallelization at the arithmetic level is more beneficial than parallelization at higher levels, i.e., at the point multiplication level as employed in [23]. Furthermore, one can achieve higher speeds by combining the two-dimensional Frobenius expansion with our parallelization scheme.

Table 6.3: Comparison of related works for FPGA implementations of point multiplication on Koblitz curves using digit-level finite field multipliers.

In [90], a double point multiplication algorithm is proposed which employs a digit-level Massey-Omura multiplier with digit size $d=4$. It employs only one multiplier to perform point addition on Koblitz curves. Since double point multiplication is required in the digital signature algorithm and its fast computation is important, our highly parallel scheme can improve its timing results.

The proposed scheme employing four parallel multipliers can also be applied to schemes based on polynomial basis, and hence a similar improvement can be achieved. Note that in this chapter we did not consider resistance against side channel attacks, as the main focus of this chapter is on highly parallel implementation of point multiplication. The reader is referred to [89] for detailed information about countermeasures against side channel attacks.

### 6.5 Conclusion

We have proposed a new fast data flow graph for the point addition formulation using Lopez-Dahab mixed coordinates employing four parallel multipliers on Koblitz curves. It is shown that the data flow graph has three multiplications in its critical path as compared to four for the best scheme available in the literature. We have used a low-complexity digit-level GNB multiplier to perform finite field multiplications. The analysis results show that our method results in smaller latencies in computing point addition. Moreover, the implementation results on Altera ${ }^{\circledR}$ Stratix ${ }^{\text {TM }}$ II indicate that our parallel multipliers operate at higher clock frequencies, and that the point multiplication results are faster than the previous ones available in the literature and compare favorably in terms of area with the one proposed in [26]. Our proposed architecture performs a point multiplication on NIST K-163 in 10.22 $\mu \mathrm{s}$ employing 24,815 ALMs.

## Chapter 7

## Summary and Future Work

### 7.1 Thesis Contributions

In this thesis, we have investigated finite field multipliers using Gaussian normal basis and proposed different architectures. These include novel high-speed digit-level multiplier architectures that make ECC fast. We have also considered the design, implementation, and evaluation of different elliptic curve crypto-processors for binary elliptic curves. The following summarizes the contributions of this work.

- In Chapter 3, which has been published in [9] and [61], we have presented a low-complexity architecture for the digit-level parallel-in parallel-out (DL-PIPO) GNB multiplier and proposed a common subexpression elimination algorithm to reduce its area complexity. We have also reduced the complexity of the digit-level parallel-in serial-out (DL-PISO) GNB multiplier architecture in this chapter. Moreover, an improved architecture for the digit-level serial-in parallel-out (DL-SIPO) GNB multiplier is proposed and its time and area complexities are derived. It is noted that the proposed architecture outperforms the leading ones in the literature in terms of time and area. Further, we have extended the digit-level architectures to a low-complexity bit-parallel architecture and compared it with its counterparts. To evaluate the performance of the proposed multiplier architectures, we have implemented them on FPGA and ASIC and reported their area and timing results, which appear to be the best results in comparison to the counterparts in the literature.
- In Chapter 4, which has recently appeared in [65], we have proposed, for the first time, an efficient hardware architecture for point multiplication on binary Edwards and generalized Hessian curves incorporating higher level parallelization and optimum lower level scheduling. We have proposed an efficient pipelining method for the digit-level GNB multiplier architecture and employed it in the proposed ECC crypto-processor over $G F\left(2^{m}\right)$. Then, we have obtained the optimum digit sizes in terms of time-area trade-offs for the proposed crypto-processor. Further, we have performed efficient FPGA implementations of point multiplication on binary Edwards and generalized Hessian curves over $G F\left(2^{163}\right)$ on a Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-5$ FPGA device and have investigated the LUT-based time-area efficiency for different digit sizes. The implementation results have been compared with the counterparts using binary generic curves.
- In Chapter 5, which has been outlined in [61], for the first time, we have proposed a new digit-level hybrid architecture which performs two multiplications together (double-multiplication) with the same number of clock cycles required as the one for one multiplication. The hybrid structure takes advantage of digit-level data interleaving and its structure is developed by combining the architecture of the proposed digit-level PISO GNB multiplier and a digit-level SIPO multiplier architecture. We have employed the proposed hybrid multiplier to reduce the latency of finite field double-exponentiation and point multiplication on binary elliptic curves. The analysis results indicated that the proposed architecture is suitable for the high speed applications whenever higher level of parallelization fails due to the data dependencies in computing point operations. Finally, we have implemented the hybrid architecture on a Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}-4$ FPGA device and $65-\mathrm{nm}$ ASIC and timing and area results have been reported.
- In Chapter 6, which has been presented in [92], we have proposed a highly parallel and fast crypto-processor for point multiplication on Koblitz curves. We have performed a latency analysis to determine where potential bottlenecks may occur and then found a balance between the desired performance and the cost of implementing the design. To this end, we have modified the point addition formulation to employ four parallel finite field multipliers and reduced the latency of point multiplication by about $25 \%$ in comparison with the fastest one available in the literature. For investigating the practical performance of the proposed architecture, we have implemented the proposed ECC crypto-processor on an Altera ${ }^{\circledR}$ Stratix ${ }^{\text {TM }}$ II FPGA for different digit sizes over $G F\left(2^{163}\right)$, targeting applications where high speed is required and area usage should be considered as well. The implementation results have indicated that the proposed architecture outperforms the most recent ones available in the literature.


### 7.2 Future Work

The following directions can be pursued as future work for this thesis.

- Recently, a method that employs efficiently computable endomorphisms to speed up point multiplication on elliptic curves over quadratic extensions has been proposed. As future work, the idea can be extended to binary Edwards curves with some reasonable modifications which make it possible to use differential addition and an efficient endomorphism to speed up point multiplication. This scheme is more efficient than many traditional doublings, and the results will provide a new set of standards for efficient implementations of ECC crypto-processors. These standards are applicable to a wide range of ECC applications.
- Pairing-based cryptography has the potential to solve many open problems in cryptography, such as identity-based encryption and short signatures. The pairing computation is the most time-consuming operation in pairing-based schemes. The development of techniques and methods to optimize the pairing computation is of great importance and remains a challenging effort for cryptosystems in commercial applications. There has been little research in the literature on the implementation of pairings on binary elliptic curves. Therefore, as the lower level computations of pairing-based cryptography rely on finite field arithmetic, the low-complexity multiplier architectures proposed in this thesis can be employed for efficient implementation of pairings as future work.
- Another direction that can be explored for the proposed ECC crypto-processors is their resistance against side-channel attacks, including simple power analysis and differential power analysis attacks. Binary Edwards and generalized Hessian curves provide complete and unified addition formulas, which makes them very suitable for applications where side-channel attacks should be prevented. Therefore, fast computation of point multiplication on these curves should be considered for such applications.
- Finally, one can work on making the ECC crypto-processors proposed in this thesis reliable against the known faults and fault attacks in the literature. To this end, a novel concurrent error detection scheme should be designed and tested for the point multiplication architectures presented in this thesis. For this purpose, parity-based approaches can be utilized, as they provide a reasonable time/area overhead and an efficient error detection capability.
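
The parity-based approach mentioned in the last item can be illustrated with a minimal software sketch. It is not the concurrent error detection scheme to be designed, only a toy model under stated assumptions: field elements of $G F\left(2^{163}\right)$ are held as Python integers whose bits are normal-basis coordinates, addition is a bitwise XOR, squaring is a cyclic shift of the coordinates, and a single parity bit is predicted from the inputs and compared with the parity of the output. The names `checked_add`, `checked_square`, and the `fault_mask` parameter (which models an injected fault) are hypothetical and introduced here for illustration only.

```python
M = 163  # field size GF(2^163), as used for the implementations in this thesis


def parity(x: int) -> int:
    """Return the XOR of all coordinate bits of x (0 or 1)."""
    return bin(x).count("1") & 1


def nb_square(x: int, m: int = M) -> int:
    """Normal-basis squaring modelled as a one-position cyclic shift.

    The shift direction depends on the bit-ordering convention (assumed here);
    a rotation in either direction leaves the parity unchanged.
    """
    mask = (1 << m) - 1
    return ((x << 1) & mask) | (x >> (m - 1))


def checked_add(a: int, b: int, fault_mask: int = 0):
    """Add in GF(2^m) and compare the predicted parity with the actual one."""
    predicted = parity(a) ^ parity(b)   # parity of a XOR b
    result = (a ^ b) ^ fault_mask       # fault_mask flips output bits (toy fault model)
    return result, parity(result) == predicted


def checked_square(a: int, m: int = M, fault_mask: int = 0):
    """Square in normal basis; a cyclic shift preserves the input parity."""
    predicted = parity(a)
    result = nb_square(a, m) ^ fault_mask
    return result, parity(result) == predicted


if __name__ == "__main__":
    a, b = 0b1011, 0b0110
    print(checked_add(a, b))                   # fault-free addition
    print(checked_add(a, b, fault_mask=0b1))   # single-bit fault
    print(checked_add(a, b, fault_mask=0b11))  # double-bit fault
```

In this model a fault that flips an odd number of output bits changes the parity and is flagged, whereas a fault that flips an even number of bits goes undetected by a single parity bit. Multiplication needs a more involved parity prediction and is deliberately left out of the sketch.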


## Bibliography

[1] D. Bernstein, T. Lange, and R. Farashahi, "Binary Edwards Curves," in Proceedings of Workshop on Cryptographic Hardware and Embedded Systems (CHES 2008), vol. 5154, 2008, pp. 244-265.
[2] R. Farashahi and M. Joye, "Efficient Arithmetic on Hessian Curves," in Proceedings of The 13th International Conference on Practice and Theory of Public Key Cryptography (PKC 2010), 2010, pp. 243-260.
[3] J. López and R. Dahab, "Fast Multiplication on Elliptic Curves Over $G F\left(2^{m}\right)$ Without Precomputation," in Proceedings of Workshop on Cryptographic Hardware and Embedded Systems (CHES 1999), 1999, pp. 316-327.
[4] T. Beth and D. Gollman, "Algorithm Engineering For Public Key Algorithms," IEEE Journal on Selected Areas in Communications, vol. 7, no. 4, pp. 458-466, 1989.
[5] A. Reyhani-Masoleh, "Efficient Algorithms and Architectures for Field Multiplication Using Gaussian Normal Bases," IEEE Transactions on Computers, vol. 55, no. 1, pp. 34-47, 2006.
[6] C. H. Kim, S. Kwon, and C. P. Hong, "FPGA Implementation of High Performance Elliptic Curve Cryptographic Processor over $G F\left(2^{163}\right)$," Journal of Systems Architecture, vol. 54, no. 10, pp. 893-900, 2008.
[7] C.-Y. Lee, "Concurrent Error Detection Architectures for Gaussian Normal Basis Multiplication over $G F\left(2^{m}\right)$," Integration, the VLSI Journal, vol. 43, no. 1, pp. 113-123, 2010.
[8] F. Rodriguez-Henriquez, N. Saqib, and A. Díaz-Pérez, "A Fast Parallel Implementation of Elliptic Curve Point Multiplication over $G F\left(2^{m}\right)$," Microprocessors and Microsystems, vol. 28, no. 5-6, pp. 329-339, 2004.
[9] R. Azarderakhsh and A. Reyhani-Masoleh, "A Modified Low Complexity Digit-Level Gaussian Normal Basis Multiplier," in Proceedings of Third International Workshop on Arithmetic of Finite Fields (WAIFI 2010), vol. 6087, 2010, pp. 25-40.
[10] K. Järvinen and J. Skyttä, "On Parallelization of High-Speed Processors for Elliptic Curve Cryptography," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 9, pp. 1162-1175, 2008.
[11] D. Hankerson, S. Vanstone, and A. Menezes, Guide to Elliptic Curve Cryptography. Springer-Verlag New York Inc, 2004.
[12] J. Lopez and R. Dahab, "Fast Multiplication on Elliptic Curves over $G F\left(2^{m}\right)$ without Precomputation," Cryptographic Hardware and Embedded Systems: First International Workshop, CHES'99, Worcester, MA, USA, August 1999: Proceedings, 1999.
[13] P. Montgomery, "Speeding the Pollard and Elliptic Curve Methods of Factorization," Mathematics of Computation, pp. 243-264, 1987.
[14] W. Diffie and M. Hellman, "New Directions in Cryptography," IEEE Transactions on Information Theory, vol. 22, no. 6, pp. 644-654, 1976.
[15] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978.
[16] N. Koblitz, "Elliptic Curve Cryptosystems," Mathematics of Computation, vol. 48, no. 177, pp. 203-209, 1987.
[17] V. S. Miller, "Use of Elliptic Curves in Cryptography," in Proceedings of Advances in Cryptology-CRYPTO 85, ser. Lecture Notes in Computer Science, Vol. 218, 1986, pp. 417-426.
[18] IEEE Std 1363-2000, "IEEE Standard Specifications for Public-Key Cryptography," Jan. 2000.
[19] U.S. Department of Commerce/NIST, "National Institute of Standards and Technology," Digital Signature Standard, FIPS Publications 186-2, January 2000.
[20] R. Cheung, N. Telle, W. Luk, and P. Cheung, "Customizable Elliptic Curve Cryptosystems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 9, pp. 1048-1059, 2005.
[21] B. Ansari and M. Hasan, "High-Performance Architecture of Elliptic Curve Scalar Multiplication," IEEE Transactions on Computers, vol. 57, no. 11, pp. 1443-1453, 2008.
[22] Y. K. Lee, K. Sakiyama, L. Batina, and I. Verbauwhede, "Elliptic-Curve-Based Security Processor for RFID," IEEE Transactions on Computers, vol. 57, no. 11, pp. 1514-1527, 2008.
[23] V. S. Dimitrov, K. U. Järvinen, M. J. J. Jr., W. F. Chan, and Z. Huang, "Provably Sublinear Point Multiplication on Koblitz Curves and its Hardware Implementation," IEEE Transactions on Computers, vol. 57, no. 11, pp. 1469-1481, 2008.
[24] W. Chelton and M. Benaissa, "Fast Elliptic Curve Cryptography on FPGA," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 2, pp. 198-205, 2008.
[25] M. Keller, A. Byrne, and W. P. Marnane, "Elliptic Curve Cryptography on FPGA for Low-Power Applications," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 2, no. 1, pp. 1-20, 2009.
[26] K. Järvinen and J. Skyttä, "Fast Point Multiplication on Koblitz Curves: Parallelization Method and Implementations," Microprocessors and Microsystems, vol. 33, no. 2, pp. 106-116, 2009.
[27] Y. Zhang, D. Chen, Y. Choi, L. Chen, and S.-B. Ko, "A High Performance ECC Hardware Implementation with Instruction-level Parallelism over $G F\left(2^{m}\right)$," Microprocessors and Microsystems - Embedded Hardware Design, vol. 34, no. 6, pp. 228-236, 2010.
[28] H. Cohen, G. Frey, and R. Avanzi, Handbook of Elliptic and Hyperelliptic Curve Cryptography. CRC Press, 2006.
[29] A. Menezes, I. Blake, S. Gao, R. Mullin, S. Vanstone, and T. Yaghoobian, Applications of Finite Fields. Kluwer Academic Publisher, 1993.
[30] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their Applications, 2nd Edition, Cambridge University Press, 1997.
[31] T. Beth and D. Gollman, "Algorithm Engineering for Public Key Algorithms," IEEE Journal on Selected Areas in Communications, vol. 7, no. 4, pp. 458-466, 1989.
[32] J. Imana and J. Sanchez, "Bit-Parallel Finite Field Multipliers for Irreducible Trinomials," IEEE Transactions on Computers, vol. 55, no. 5, pp. 520-533, 2006.
[33] S. Kumar, T. Wollinger, and C. Paar, "Optimum Digit Serial $G F\left(2^{m}\right)$ Multipliers for Curve-Based Cryptography," IEEE Transactions on Computers, vol. 55, no. 10, pp. 1306-1311, 2006.
[34] A. Reyhani-Masoleh and M. Hasan, "Low Complexity Bit Parallel Architectures for Polynomial Basis Multiplication over $G F\left(2^{m}\right)$," IEEE Transactions on Computers, vol. 53, no. 8, pp. 945-959, 2004.
[35] J. Massey and J. Omura, "Computational Method and Apparatus for Finite Arithmetic," US Patent, no. 4587627, 1986.
[36] R. C. Mullin, I. M. Onyszchuk, S. A. Vanstone, and R. M. Wilson, "Optimal Normal Bases in $G F\left(p^{n}\right)$," Discrete Applied Mathematics, vol. 22, no. 2, pp. 149-161, 1989.
[37] D. W. Ash, I. F. Blake, and S. A. Vanstone, "Low Complexity Normal Bases," Discrete Applied Mathematics, vol. 25, no. 3, pp. 191-210, 1989.
[38] T. Itoh and S. Tsujii, "A Fast Algorithm for Computing Multiplicative Inverses in $G F\left(2^{m}\right)$ Using Normal Bases," Information and Computation, vol. 78, no. 3, pp. 171-177, 1988.
[39] G. Feng, "A VLSI Architecture for Fast Inversion in $G F\left(2^{m}\right)$," IEEE Transactions on Computers, vol. 38, no. 10, pp. 1383-1386, 1989.
[40] C. Lee, P. Meher, and J. Patra, "Concurrent Error Detection in Bit-Serial Normal Basis Multiplication Over $G F\left(2^{m}\right)$ Using Multiple Parity Prediction Schemes," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 8, pp. 1234-1238, 2010.
[41] W. Geiselmann and D. Gollmann, "Symmetry and Duality in Normal Basis Multiplication," in Proceedings of Sixth Symposium Applied Algebra, Algebraic Algorithms and Error-Correcting Codes (AAECC 1989), July 1989, pp. 230-238.
[42] G. B. Agnew, R. C. Mullin, I. M. Onyszchuk, and S. A. Vanstone, "An Implementation for a Fast Public-Key Cryptosystem," Journal of Cryptology, vol. 3, no. 2, pp. 63-79, 1991.
[43] A. Reyhani-Masoleh and M. A. Hasan, "Efficient Digit-serial Normal Basis Multipliers over Binary Extension Fields," ACM Transactions on Embedded Computing Systems (TECS), vol. 3, no. 3, pp. 575-592, Aug. 2004.
[44] S. Kwon, K. Gaj, C. H. Kim, and C. P. Hong, "Efficient Linear Array for Multiplication in $G F\left(2^{m}\right)$ using a Normal Basis for Elliptic Curve Cryptography," in Proceedings of Workshop on Cryptographic Hardware and Embedded Systems (CHES 2004), 2004, pp. 76-91.
[45] A. H. Namin, H. Wu, and M. Ahmadi, "A Word-Level Finite Field Multiplier Using Normal Basis," IEEE Transactions on Computers, vol. 99, no. Preprints, 2010.
[46] C. Lee and P. Chang, "Digit-Serial Gaussian Normal Basis Multiplier over $G F\left(2^{m}\right)$ Using Toeplitz Matrix-Approach," in Proceedings of International Conference on Computational Intelligence and Software Engineering (CiSE 2009), 2009, pp. 1-4.
[47] C. C. Wang, T. K. Truong, H. M. Shao, L. J. Deutsch, J. K. Omura, and I. S. Reed, "VLSI Architectures for Computing Multiplications and Inverses in $G F\left(2^{m}\right)$," IEEE Transactions on Computers, vol. 34, no. 8, pp. 709-717, 1985.
[48] Ç. K. Koç and B. Sunar, "An Efficient Optimal Normal Basis Type II Multiplier over $G F\left(2^{m}\right)$," IEEE Transactions on Computers, vol. 50, no. 1, pp. 83-87, 2001.
[49] M. Hasan, M. Wang, and V. Bhargava, "A Modified Massey-Omura Parallel Multiplier for a Class of Finite Fields," IEEE Transactions on Computers, vol. 42, no. 10, pp. 1278-1280, 2002.
[50] A. Reyhani-Masoleh and M. A. Hasan, "A New Construction of Massey-Omura Parallel Multiplier over $G F\left(2^{m}\right)$," IEEE Transactions on Computers, vol. 51, no. 5, pp. 511-520, 2002.
[51] L. Gao and G. E. Sobelman, "Improved VLSI Designs for Multiplication and Inversion in $G F\left(2^{M}\right)$ over Normal Bases," in Proceedings of 13th Annual IEEE International ASIC/SOC Conference, 2000, pp. 97-101.
[52] U. Kocabas, J. Fan, and I. Verbauwhede, "Implementation of Binary Edwards Curves for Very-Constrained Devices," in Proceedings of 21st International Conference on Application-specific Systems Architectures and Processors (ASAP 2010), 2010, pp. 185-191.
[53] L. Batina, J. Hogenboom, N. Mentens, J. Moelans, and J. Vliegen, "Side-channel Evaluation of FPGA Implementations of Binary Edwards Curves," in Proceedings of 17th IEEE International Conference on Electronics, Circuits, and Systems (ICECS 2010), 2010, pp. 1255-1258.
[54] R. Moloney, A. O'Mahony, and P. Laurent, "Efficient Implementation of Elliptic Curve Point Operations Using Binary Edwards Curves," Cryptology ePrint Archive, Report 2010/208, 2010, http://eprint.iacr.org/.
[55] E. Al-Daoud, R. Mahmod, M. Rushdan, and A. Kilicman, "A New Addition Formula for Elliptic Curves Over $G F\left(2^{m}\right)$," IEEE Transactions on Computers, vol. 51, no. 8, pp. 972-975, 2002.
[56] B. Sunar and Ç. K. Koç, "An Efficient Optimal Normal Basis Type II Multiplier over $G F\left(2^{m}\right)$," IEEE Transactions on Computers, vol. 50, no. 1, pp. 83-87, 2001.
[57] S. Kwon, "A Low Complexity and a Low Latency Bit Parallel Systolic Multiplier over $G F\left(2^{m}\right)$ Using an Optimal Normal Basis of Type II," in Proceedings of 16th IEEE Symposium on Computer Arithmetic (Arith-16 2003), 2003, pp. 196-202.
[58] J. Gathen, A. Shokrollahi, and J. Shokrollahi, "Efficient Multiplication Using Type 2 Optimal Normal Bases," in Proceedings of First International Workshop on Arithmetic of Finite Fields, (WAIFI 2007), vol. 4547, 2007, pp. 55-68.
[59] H. Fan and M. Hasan, "Subquadratic Computational Complexity Schemes for Extended Binary Field Multiplication Using Optimal Normal Bases," IEEE Transactions on Computers, vol. 56, no. 10, p. 1435, 2007.
[60] D. Bernstein and T. Lange, "Type-II Optimal Polynomial Bases," in Proceedings of Third International Workshop on Arithmetic of Finite Fields (WAIFI 2010), vol. 6078, 2010, pp. 41-61.
[61] R. Azarderakhsh and A. Reyhani-Masoleh, "A Low Complexity Hybrid Architecture for Double-Multiplication Using Gaussian Normal Basis," IEEE Transactions on Computers, 2011.
[62] O. Gustafsson and M. Olofsson, "Complexity reduction of constant matrix computations over the binary field," in WAIFI, ser. Lecture Notes in Computer Science, vol. 4547. Springer, 2007, pp. 103-115.
[63] J. Gathen, A. Shokrollahi, and J. Shokrollahi, "Efficient multiplication using type 2 optimal normal bases," in WAIFI, ser. Lecture Notes in Computer Science, C. Carlet and B. Sunar, Eds., vol. 4547. Springer, 2007, pp. 55-68.
[64] Xilinx, "Xilinx Virtex-5 device data sheet," www.xilinx.com/support/documentation/virtex-5.htm, vol. ver5.0, February 2009.
[65] R. Azarderakhsh and A. Reyhani-Masoleh, "Efficient FPGA Implementation of Point Multiplication on Binary Edwards and Generalized Hessian Curves Using Gaussian Normal Basis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, no. 99, 2011.
[66] N. Koblitz, "Elliptic Curve Cryptosystems," Mathematics of Computation, vol. 48, pp. 203-209, 1987.
[67] D. J. Bernstein, "Batch Binary Edwards," in Proceedings of the 29th Annual International Cryptology Conference on Advances in Cryptology (CRYPTO 2009), 2009, pp. 317-336.
[68] E. Brier and M. Joye, "Weierstraß Elliptic Curves and Side-channel Attacks," in Proceedings of International Conference on Practice and Theory of Public Key Cryptography (PKC 2002), 2002, pp. 183-194.
[69] B. Baldwin, R. Moloney, A. Byrne, G. McGuire, and W. P. Marnane, "A Hardware Analysis of Twisted Edwards Curves for an Elliptic Curve Cryptosystem," in Proceedings of 5th International Workshop on Reconfigurable Computing: Architectures, Tools and Applications (ARC 2009), vol. 5453, 2009, pp. 355-361.
[70] C.-P. Schnorr, "Efficient Signature Generation by Smart Cards," Journal of Cryptology, vol. 4, no. 3, pp. 161-174, 1991.
[71] T. E. Gamal, "A Public Key Cryptosystem and a Signature Scheme Based on Discrete Logarithms," IEEE Transactions on Information Theory, vol. 31, no. 4, pp. 469-472, 1985.
[72] C. Wang and D. Pei, "A VLSI Design for Computing Exponentiations in $G F\left(2^{m}\right)$ and its Application to Generate Pseudorandom Number Sequences," IEEE Transactions on Computers, vol. 39, no. 2, pp. 258-262, Feb. 1990.
[73] C. Lee, J. Lin, and C. Chiou, "Scalable and Systolic Architecture for Computing Double Exponentiation Over $G F\left(2^{m}\right)$," Acta Applicandae Mathematicae, vol. 93, no. 1, pp. 161-178, 2006.
[74] J. H. Cheon, S. Jarecki, T. Kwon, and M.-K. Lee, "Fast Exponentiation Using Split Exponents," IEEE Transactions on Information Theory, vol. 57, no. 3, pp. 1816-1826, Mar. 2011.
[75] J. Fan, D. Bailey, L. Batina, T. Guneysu, C. Paar, and I. Verbauwhede, "Breaking Elliptic Curves Cryptosystems using Reconfigurable Hardware," in Proceedings of 20th International Conference on Field Programmable Logic and Applications (FPL 2010), 2010, pp. 133-138.
[76] Certicom, "Certicom ECC Challenge," www.certicom.com, 1997.
[77] Standards for Efficient Cryptography Group, "SEC2: Recommended Elliptic Curve Domain Parameters," 2010, http://www.secg.org/download/aid-784/sec2v2.pdf.
[78] R. Cheung, N. Telle, W. Luk, and P. Cheung, "Customizable Elliptic Curve Cryptosystems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 9, pp. 1048-1059, 2005.
[79] K. Järvinen, "Optimized FPGA-based Elliptic Curve Cryptography Processor for High-Speed Applications," Integration, the VLSI Journal, vol. 44, no. 4, pp. 270-279, 2011.
[80] W. N. Chelton and M. Benaissa, "Fast Elliptic Curve Cryptography on FPGA," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 2, pp. 198-205, 2008.
[81] O. Ahmadi, D. Hankerson, and F. Rodríguez-Henríquez, "Parallel Formulations of Scalar Multiplication on Koblitz Curves," Journal of Univers. Computing Sci., vol. 14, no. 3, pp. 481-504, 2008.
[82] J.-Y. Lai and C.-T. Huang, "Elixir: High-Throughput Cost-Effective Dual-Field Processors and the Design Framework for Elliptic Curve Cryptography," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 11, pp. 1567-1580, 2008.
[83] B. Ansari and M. A. Hasan, "High-Performance Architecture of Elliptic Curve Scalar Multiplication," IEEE Transactions on Computers, vol. 57, no. 11, pp. 1443-1453, 2008.
[84] N. Koblitz, "CM-curves with Good Cryptographic Properties," in Advances in Cryptology (CRYPTO 1991). Springer, 1992, pp. 279-287.
[85] J. A. Solinas, "Efficient Arithmetic on Koblitz Curves," Des. Codes Cryptography, vol. 19, pp. 195-249, March 2000.
[86] K. Järvinen, J. Forsten, and J. Skyttä, "Efficient Circuitry for Computing $\tau$-adic Non-Adjacent Form," in Proceedings of the 13th IEEE International Conference on Electronics, Circuits and Systems, (ICECS 2006). IEEE, 2006, pp. 232-235.
[87] B. B. Brumley and K. U. Järvinen, "Conversion Algorithms and Implementations for Koblitz Curve Cryptography," IEEE Transactions on Computers, vol. 59, no. 1, pp. 81-92, 2010.
[88] J. Adikari, V. Dimitrov, and K. Järvinen, "A Fast Hardware Architecture for Integer to $\tau$-NAF Conversion for Koblitz Curves," IEEE Transactions on Computers, vol. PP, no. 99, to appear, 2011.
[89] M. A. Hasan, "Power Analysis Attacks and Algorithmic Approaches to Their Countermeasures for Koblitz Curve Cryptosystems," IEEE Transactions on Computers, vol. 50, no. 10, pp. 1071-1083, 2001.
[90] J. Adikari, V. S. Dimitrov, and R. J. Cintra, "A New Algorithm for Double Scalar Multiplication Over Koblitz Curves," in International Symposium on Circuits and Systems (ISCAS 2011). IEEE, 2011, pp. 709-712.
[91] C. Vuillaume, K. Okeya, and T. Takagi, "Short-Memory Scalar Multiplication for Koblitz Curves," IEEE Transactions on Computers, vol. 57, no. 4, pp. 481-489, 2008.
[92] R. Azarderakhsh and A. Reyhani-Masoleh, "Highly Parallel and Fast Crypto-processor for Point Multiplication on Koblitz Curves," IEEE Transactions on Computers, Special Issue on Computer Arithmetic, submitted, 9 pages, 2011.

# Curriculum Vitae 

Name: Reza Azarderakhsh

Post-secondary Education and Degrees:

The University of Western Ontario
Ph.D., London, Canada

Sharif University of Technology
M.Sc., Tehran, Iran

Civil Aviation Technology College
Tehran, Iran

Honors and Awards:

NSERC/IRDF Award (2012-2013)
Ontario Graduate Scholarship (OGS) 2011-2012.
Western Graduate Scholarship 2007-2011.
PIMS and Western, Travel Grants 2008 and 2010.
Polito TOPMED Scholarship 2006-2007.
ITRC Master's Thesis Scholarship 2003-2005.

Related Work Experience:

Limited Duties Faculty Position (2011-present)
The University of Western Ontario
Graduate Teaching Assistant (2007-2011)
The University of Western Ontario
Graduate Research Assistant (2007-2011)
The University of Western Ontario
Visiting Instructor (2004-2007)
Civil Aviation Technology College, Tehran, Iran
Competent Electronic Design Engineer (2003-2007)
Iranian Airport Holding Company, Tehran, Iran

## PUBLICATIONS

## Journal Papers:

1. R. Azarderakhsh and A. Reyhani-Masoleh, "Efficient FPGA Implementation of Point Multiplication on Binary Edwards and Generalized Hessian Curves Using Gaussian Normal Basis", IEEE Transactions on VLSI Systems, accepted for publication, 2011, 14 pages.
2. R. Azarderakhsh, A. Reyhani-Masoleh, "Secure Clustering and Symmetric Key Establishments in Heterogeneous Wireless Sensor Networks", EURASIP Journal on Wireless Communication and Networking (JWCN), Special Issue on Security and Resiliency for Smart Devices and Applications, Article ID 893592, 12 pages, 2011, doi:10.1155/2011/893592.

## Journal Papers (Under Revision):

1. R. Azarderakhsh and A. Reyhani-Masoleh, "A Low Complexity Hybrid Architecture for Double-Multiplication Using Gaussian Normal Basis", IEEE Transactions on Computers, Submitted, 2011, 14 pages.
2. R. Azarderakhsh and A. Reyhani-Masoleh, "Highly Parallel and Fast Cryptoprocessor for Point Multiplication on Koblitz Curves", IEEE Transactions on Computers, Special Issue on Computer Arithmetic, Submitted, 2011, 9 pages.

## Conference Papers:

1. R. Azarderakhsh and A. Reyhani-Masoleh, "A Modified Low Complexity Digit-Level Gaussian Normal Basis Multiplier," a chapter in proceedings of 3rd International Workshop on the Arithmetic of Finite Fields (WAIFI 2010), LNCS No. 6087, Pages: 25-40, 27-30 Jun. 2010.
2. R. Azarderakhsh, A. Reyhani-Masoleh, and Z. Abid, "A Key Management Scheme for Cluster Based Wireless Sensor Networks," in proceedings of IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2008), Volume 2, Pages: 222-227, 17-20 Dec. 2008.
3. X. Yuan, H. Jürgensen, R. Azarderakhsh, and A. Reyhani-Masoleh, "Key Management for Wireless Sensor Networks Using Trusted Neighbors," in proceedings of IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2008), Volume 2, Pages: 228-233, 17-20 Dec. 2008.
4. A. R. Masoum, A. H. Jahangir, Z. Taghikhaki, R. Azarderakhsh, "A New Multi Level Clustering Model to Increase Lifetime in Wireless Sensor Networks," in proceedings of the 2nd IEEE International Conference on Sensor Technologies and Applications, (SENSORCOMM 2008), Pages: 185-190, 25-31 Aug. 2008.
5. R. Azarderakhsh, A. H. Jahangir, and M. Keshtgary, "Network Survivability Performance Evaluation in Wireless Sensor Networks," in proceedings of the 11th International CSI Computer Conference (CSI 2006), Pages: 567-570, 24-26 Jan. 2006.
6. R. Azarderakhsh, A. H. Jahangir and M. Keshtgary, "A New Virtual Backbone for Wireless Ad Hoc Sensor Network with Connected Dominating Set," in proceedings of the 3rd IFIP Annual Conference on Wireless On demand Network Systems and Services (WONS 2006), Pages: 191-195, 18-20 Jan. 2006.
7. R. Azarderakhsh, A. H. Jahangir, "Optimized Routing Algorithms for Efficient Power Consumption in Wireless Sensor Networks," in proceedings of 13th International Electrical Engineering Conference (IEEC 2005), Pages: 178-183, Apr. 2005.
8. R. Azarderakhsh, S.Gh. Miremadi, Gh. Moradi, "Flight Safety Management Systems," in proceedings of the 1st International Conference on Air Transport Industries Management (ICATIM 2005), Pages: 89-99, 19-20 Jan. 2005.
