A high performance pseudo-multi-core elliptic curve cryptographic processor over GF(2^163) by Zhang, Yu
A high performance pseudo-multi-core
elliptic curve cryptographic processor
over GF(2^163)
A Thesis Submitted to the
College of Graduate Studies and Research
in Partial Fulllment of the Requirements
for the degree of Master of Science
in the Department of Electrical and Computer Engineering
University of Saskatchewan
Saskatoon
By
Yu Zhang
cYu Zhang, April 2010. All rights reserved.
Permission to Use
In presenting this thesis in partial fulfillment of the requirements for a Postgraduate
degree from the University of Saskatchewan, I agree that the Libraries of this
University may make it freely available for inspection. I further agree that permission
for copying of this thesis in any manner, in whole or in part, for scholarly purposes
may be granted by the professor or professors who supervised my thesis work or, in
their absence, by the Head of the Department or the Dean of the College in which
my thesis work was done. It is understood that any copying or publication or use of
this thesis or parts thereof for financial gain shall not be allowed without my written
permission. It is also understood that due recognition shall be given to me and to the
University of Saskatchewan in any scholarly use which may be made of any material
in my thesis.
Requests for permission to copy or to make other use of material in this thesis in
whole or part should be addressed to:
Head of the Department of Electrical and Computer Engineering
57 Campus Drive
University of Saskatchewan
Saskatoon, Saskatchewan
Canada
S7N 5A9
Abstract
The elliptic curve cryptosystem is a public-key system that provides the same
security level as Rivest, Shamir and Adleman (RSA) with a smaller key size. The
more compact key of elliptic curve cryptography (ECC) brings advantages in circuit
area, memory requirement, power consumption, performance and bandwidth. However,
compared to private-key systems such as the Advanced Encryption Standard (AES),
ECC is still much more complicated and computationally intensive. In real
applications, private-key and public-key systems are therefore usually combined to
achieve high performance. The ultimate goal of this research is to architect a high
performance ECC processor for high performance applications such as network
servers and cellular sites.
In this thesis, a high performance processor for ECC over the Galois field
GF(2^163) using polynomial representation is proposed. It has three finite field (FF)
reduced instruction set computer (RISC) cores and a main controller to achieve
instruction-level parallelism (ILP) with pipelining, so that the largely parallelized
algorithm for elliptic curve point multiplication (PM) is well suited to this platform.
Instructions for combined FF operations are proposed to decrease the number of
clock cycles. The interconnection among the three FF cores and the main controller
is obtained by analyzing the data dependency in the parallelized algorithm. A
five-stage pipeline is employed in this architecture. Finally, the μ-code executed on
the three FF cores is manually optimized to save clock cycles. The proposed design
reaches 185 MHz with 20,807 slices when implemented on a Xilinx XC4VLX80 FPGA
device, and 263 MHz with 217,904 gates when synthesized with TSMC 0.18 μm CMOS
technology. The implementation of the proposed architecture completes one ECC PM
in 1428 cycles, and is 1.3 times faster than the fastest implementation over
GF(2^163) previously reported in the literature while consuming 14.6% less area on
the same FPGA device.
Acknowledgements
I would like to take this opportunity to express my sincere appreciation to my
supervisor, Dr. Seok-Bum Ko. Without his guidance, advice and support throughout
my research, this work could not have been realized. Besides, he helped me a
lot when I was hunting for jobs around the time of graduation, and gave me a lot of
valuable advice.
I would like to thank my supervisor Dr. Li Chen for his advice for my life,
constant encouragement and support on my research and graduate studies.
I would also like to thank my lab member Dongdong Chen. Without the discussions
with him over the past two years on research and algorithms, I could not have gained
a deep understanding of this research.
I would also like to thank my other friends in the VLSI lab. Working with them has
been an invaluable and wonderful experience in my life.
I would like to thank my family for their understanding, constant support and
encouragement throughout my study and life.
This is the dedication to my grandma
Contents
Permission to Use
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
List of Algorithms
List of Abbreviations
1 Introduction
1.1 Cryptography
1.2 Previous Work
1.3 Motivation
1.4 Contribution
1.5 Thesis Outline
2 Elliptic Curve Cryptography
2.1 ECC Diffie-Hellman key exchange protocol
2.2 Elliptic Curve Geometry
2.2.1 A line through Two Distinct Points On Elliptic Curve
2.2.2 A Tangent Line of Elliptic Curve
2.2.3 Point Addition on Elliptic Curve
2.2.4 Point Doubling on Elliptic Curve
2.2.5 NIST-recommended random elliptic curves over binary fields
2.3 Elliptic Curve Arithmetic
2.3.1 Montgomery PM
2.3.2 Projective Coordinates
2.3.3 Lopez-Dahab Algorithm
2.3.4 Parallelized Lopez-Dahab Algorithm
2.3.5 Proposed Instruction-level Parallelism for Parallelized Lopez-Dahab Algorithm
3 Elliptic curve cryptographic processor
3.1 Finite Field Arithmetic Operations
3.1.1 Basic Finite Field Operations
3.1.2 Parallel Finite Field Reduction
3.1.3 Word-level finite field multiplier
3.1.4 FF square and double square
3.1.5 FF inversion
3.2 Architecture and implementation
3.2.1 Instruction Set Design of FF Cores
3.2.2 Register files, interconnection and swap logic
3.2.3 Main controller
3.2.4 Critical path analysis
3.2.5 Pipeline and timing
4 Experiment Results
5 Conclusion and Future Work
References
A μ-code on FF cores
A.1 μ-code in ROM1
A.2 μ-code in ROM2
A.3 μ-code in ROM3
List of Tables
2.1 NIST-recommended random elliptic curves over binary fields
3.1 Itoh-Tsujii Algorithm for GF(2^163)
3.2 Instruction description
3.3 Instruction Set Design of FF Core 1
3.4 Instruction Set Design of FF Core 2
3.5 Instruction Set Design of FF Core 3
3.6 Long paths and comparison
4.1 Area information of different blocks
4.2 Performance comparison
List of Figures
1.1 Key size on equivalent strength between RSA and ECC [3]
1.2 Conventional hierarchy for ECC operation
2.1 An example of ECC key exchange
2.2 An example of an elliptic curve in real numbers system
2.3 Special point on elliptic curve
2.4 Intersection points on Elliptic Curve
2.5 Tangent line on Elliptic Curve
2.6 Point Addition on Elliptic Curve
2.7 Point Doubling on Elliptic Curve
2.8 Data dependency inside the LOOP
2.9 Data dependency inside the LOOP in [11]
2.10 Proposed instruction set based on the data dependency
3.1 Architecture of finite field adder
3.2 An example of FF multiplication
3.3 An example of pure parallel FF multiplication
3.4 The architecture of finite field ALU
3.5 The structure of pseudo-multi-core ECC processor
3.6 Interconnection and register files for FF cores
3.7 Instruction formats in three cores
3.8 Examples of LOOP instruction
3.9 Swap logic and address decoding unit of DA in core 1
3.10 Architecture of 8-to-1 multiplexer in Fig. 3.6
3.11 Timing in the loop of Algorithm 6
List of Algorithms
1 Binary NAF method [19]
2 Montgomery Method [18]
3 Lopez-Dahab algorithm [14]
4 Modified Lopez-Dahab algorithm [10]
5 Parallelized version of Lopez-Dahab algorithm with uniform addressing [11]
6 Proposed ILP of parallelized Lopez-Dahab algorithm on three FF cores
7 82-bit word-level FF multiplier
List of Abbreviations
AES Advanced Encryption Standard
ALU Arithmetic Logic Unit
ASIC Application-specic integrated circuit
CMOS Complementary Metal Oxide Semiconductor
DES Data Encryption Standard
DL Data Loading
EC Elliptic Curve
ECADD Elliptic Curve Point Addition
ECC Elliptic Curve Cryptography
ECDBL Elliptic Curve Point Doubling
ECP Elliptic Curve Processor
EX Instruction Executing
FF Finite Field
FFA Finite Field Addition
FFI Finite Field Inversion
FFM Finite Field Multiplication
FPGA Field Programmable Gate Array
FFS Finite Field Square
ILP Instruction-level Parallelism
ID Instruction Decoding
IF Instruction Fetching
GNB Gaussian Normal Basis
GF Galois Field
LSB Least Signicant Bit
MQV Menezes-Qu-Vanstone
NAF Non-adjacent Form
NIST National Institute of Standards and Technology
PB Polynomial Basis
PM Point Multiplication
RISC Reduced Instruction Set Computer
RSA Rivest, Shamir and Adleman
WB Writing Back
Chapter 1
Introduction
1.1 Cryptography
Cryptography plays a very important role in modern communications, as it protects
the confidential data being communicated. Cryptography consists of encryption and
decryption. Encryption converts plain text (ordinary information) to cipher text
(disordered information) by using a key, and decryption reverses the process. There
are two types of cryptography: symmetric cryptography and asymmetric cryptography.
In symmetric cryptography (also called private-key cryptography) such as AES [1],
the sender and the receiver share the same key, so they have to agree on that key
before communicating. In asymmetric cryptography (also called public-key
cryptography) such as ECC and RSA, different keys are used for encryption and
decryption, and the two sides do not need to share their private keys with each other.
RSA, which stands for Rivest, Shamir and Adleman, who first publicly described
it in 1977 [2], is so far the most widely used public-key cryptosystem. It is based
on the fact that it is easy to generate extremely large prime numbers, while it
is not practical in terms of time and money to factor the product of two primes of
this size. ECC was independently proposed by Miller [4] and Koblitz [5] in 1986
and 1987 respectively. ECC is based on the hardness of the elliptic curve discrete
logarithm problem. ECC provides the same security level as RSA with a smaller key
size, as shown in Fig. 1.1. Therefore, the key of ECC can be more compact, which
brings advantages in circuit area, memory requirement, power consumption,
performance and bandwidth. ECC has also been included in the IEEE 1363 [7] and
NIST [6] standards. Consequently, ECC is said to be the next-generation
cryptography, and a vast amount of research has been done on its efficient
implementation in software and hardware.

Time to break      RSA/DSA      ECC          RSA/ECC
(MIPS years)       key size     key size     key size ratio
10^4               512          106          5 : 1
10^8               768          132          6 : 1
10^11              1,024        160          7 : 1
10^20              2,048        210          10 : 1
10^78              21,000       600          35 : 1
Figure 1.1: Key size on equivalent strength between RSA and ECC [3]
1.2 Previous Work
Compared to private-key cryptography, ECC is computationally intensive, mainly
because of the computation of PM, which involves arithmetic in an FF of large order.
The computation of PM is normally composed of point addition (ECADD) and point
doubling (ECDBL) operations, and these operations in turn rely on finite field (FF)
operations. The conventional operation hierarchy is shown in Fig. 1.2. Normally, the
complexity of these FF operations, in order from the most to the least expensive, is
FF inversion (FFI), FF multiplication (FFM), FF square (FFS) and FF addition (FFA).
Thus many ECC arithmetic algorithms, such as the Lopez-Dahab algorithm [14], use
different projective coordinates to reduce the number of FFI operations. There are
also many finite field arithmetic algorithms to implement FFI [17] [19], which
usually rely on FFM, FFS and FFA.

Figure 1.2: Conventional hierarchy for ECC operation (Point Multiplication at the
top; Point Addition and Point Doubling below it; Field Addition/Subtraction, Field
Multiplication, Field Square and Field Inversion at the bottom)
As we can see in Fig. 1.2, the lowest level of the PM operation consists of the FF
operations, and there are different algorithms for them in either hardware or
software implementations. Elliptic curve arithmetic refers to the algorithm that
calculates PM by using ECADD and ECDBL. Therefore, before implementing the PM,
several choices have to be made: the underlying finite field, the field
representation, the elliptic curve, the algorithms for finite field arithmetic, and
the algorithm for elliptic curve arithmetic. Different systems make different
selections, which are determined by the system requirements (gate count, power
consumption) and resources (availability of a microprocessor, performance of the
microprocessor, ROM size and RAM size). These selections are tightly related to the
implementation approach (hardware, software, or hardware/software co-design), which
in turn depends on these selections, so it is very hard to make the "best" choice.
Normally, ECC over a prime field is implemented in software, because the operations
in a prime field are very similar to those in the real number system except that an
extra modular operation is needed. Algorithms for software implementation can be
found in [8]. Hardware/software co-design is an alternative approach to implement
ECC when the hardware resources are tight [26]. In some high performance oriented
applications like network servers and cellular sites, a software ECC implementation
becomes a bottleneck of the entire system as the number of services requested per
second increases. Therefore, a hardware PM core implemented on FPGA or ASIC is a
solution for these applications. Many papers [10] [11] [12] [14] [15] [16] focus on
the hardware implementation of PM over GF(2^163) as defined for the NIST curves over
binary fields, and compare their performance extensively with others. In [10], the
data dependency of the Lopez-Dahab algorithm was analyzed in detail, so that a
single FF multiplier in the elliptic curve processor (ECP) was kept running without
idle cycles while other FF operations were performed in parallel with it. As [10]
did not exploit the largest parallelism of the Lopez-Dahab algorithm, in [11] three
55-bit word-level Gaussian normal basis (GNB) FF multipliers were employed to
parallelize the Lopez-Dahab algorithm to the largest extent. Besides, by using the
GNB finite field representation, the operation $A^{2^s}$ in Itoh-Tsujii's FF
inversion [17] can be accomplished simply by an s-bit cyclic shift. The whole system
is composed of two FF arithmetic atomic blocks, namely a point doubling & addition
unit and a coordinate conversion unit. ECC hardware implementations on Koblitz
curves can be found in [13]; these curves belong to a special class of binary curves
on which the PM can be computed very efficiently. However, they are often vulnerable
to side channel attacks [24].
1.3 Motivation
Ecient ECC hardware implementation depends on all the factors mentioned above.
However, the above works only consider either elliptic curve arithmetic [14] or algo-
rithms for nite eld arithmetic [11], which is usually not the best case for hardware
implementation. This work focuses on hardware implementation of PM on FPGA
and ASIC on a NSIT proposed random curve over GF(2163).
In ECC hardware implementations, there is usually no dedicated hardware for
FFI, and it is implemented by a large number of other FF operations. FFM usually
is the second time-consuming FF operation after FFI, and there are large numbers
of FFMs involved in a PM operation. However, high speed FF multiplier by FF
arithmetic algorithms doesn't necessarily mean high performance of the entire ECC
system, which is determined by the following relationship, where the delay of the
critical path is the period of the clock.
System performance = Total clock cycles Delay of the critical path: (1.1)
For example, if the computation of FFM consumes only one cycle, other simple
FF operations that also consume one cycle, such as FFA, become as expensive as FFM.
As a result, although the total number of clock cycles is reduced, the system
performance may remain low because of the long critical path that the FF multiplier
introduces into the system architecture. On the other hand, if the FF multiplier has
a short critical path and the FFM consumes several cycles, the system performance
may still be low because of the large total number of clock cycles.
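As a rough illustration of Eq. 1.1, the figures quoted in the abstract (1428 cycles per PM, 185 MHz on the FPGA) can be plugged in directly. The product is the time for one PM, so a smaller value means higher performance. The second line is only a sketch that assumes the same cycle count applies to the ASIC clock of 263 MHz.

```latex
\begin{align*}
T_{\mathrm{PM,\,FPGA}} &= 1428 \times \frac{1}{185\ \mathrm{MHz}}
  \approx 1428 \times 5.41\ \mathrm{ns} \approx 7.72\ \mu\mathrm{s} \\
T_{\mathrm{PM,\,ASIC}} &= 1428 \times \frac{1}{263\ \mathrm{MHz}}
  \approx 1428 \times 3.80\ \mathrm{ns} \approx 5.43\ \mu\mathrm{s}
\end{align*}
```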
In order to improve the system performance, the ECC arithmetic algorithm level, the
FF arithmetic level and the system hardware architecture level should be considered
simultaneously. This requires balancing the total number of clock cycles against
the critical path, which usually involves adding pipeline stages and increasing
parallelism.
In [11], the critical path of the system is not analyzed in detail. The GNB based
FF multiplier has a relatively longer critical path than its polynomial based
counterpart. Therefore, the performance may still be improved by analyzing the
system critical path when using the polynomial representation. In order to increase
the system performance, this thesis considers the following three aspects
simultaneously:
1. ECC arithmetic algorithm level: largely parallelize the Lopez-Dahab
algorithm.
2. FF arithmetic level: keep the FF multipliers running as often as possible, and
perform other operations in parallel with the FF multipliers.
3. System hardware architecture level: combine simple operations where possible,
and make the system critical path lie in the polynomial based FF multiplier.
1.4 Contribution
The main contributions of this thesis are:
1. A customized instruction set is proposed for the parallel version of the
Lopez-Dahab algorithm;
2. An architecture based on three FF cores with a five-stage pipeline is designed,
and its critical path is analyzed in detail;
3. The μ-code for the three FF cores is given based on the proposed architecture;
4. Both FPGA and ASIC implementation results are provided. In the FPGA
implementation, this work is 1.3 times faster than the fastest implementation
over GF(2^163) previously reported in the literature, while consuming only 85.4%
of its area on the same FPGA device.
1.5 Thesis Outline
The rest of this thesis is organized as follows. Chapter 2 gives the background
knowledge in cryptography and presents the ECC algorithms used in this work.
Chapter 3 describes the architecture of the proposed ECC processor and analyzes the
critical path of the system in detail. Chapter 4 shows the implementation results of
this work on FPGA and ASIC, compares them with some recent works, and gives some
analysis of the results. Finally, the conclusion and future work are given in
Chapter 5.
Chapter 2
Elliptic Curve Cryptography
An ECC system can use different protocols, including the ECC Diffie-Hellman key
exchange protocol, the Elliptic Curve Digital Signature Algorithm, and elliptic curve
Menezes-Qu-Vanstone (MQV) [19] [21]. As this thesis is not mainly concerned with
protocols, it only briefly introduces the ECC Diffie-Hellman key exchange
protocol. Elliptic curve arithmetic algorithms are the main concern of this chapter.
We present how the proposed algorithm is built on previous elliptic curve
arithmetic algorithms in the literature [10] [11] [14] [18].
2.1 ECC Die-Hellman key exchange protocol
The ECC Die-Hellman key exchange protocol [19] [21] relies on the fact that the
scalar of point P on elliptic curve (kP ) is relatively easier while retrieving k knowing
kP and P is a discrete logarithm problem. The mechanism of ECC Die-Hellman
key exchange is shown in Fig. 2.1, where Alice and Bob are two persons who are
involved in the communication, and Eva is a cracker. The key exchange mechanism
procedure is described as follows:
1. Alice generates a random private key integer ka, computes a public key
Pa = kaP, and sends Pa to Bob
2. Bob generates a random private key integer kb, computes a public key
Pb = kbP, and sends Pb to Alice
3. Alice computes kaPb = kakbP
4. Bob computes kbPa = kbkaP
5. Finally, Alice and Bob arrive at the same key kakbP

Figure 2.1: An example of ECC key exchange (Alice: private key ka, public key kaP;
Bob: private key kb, public key kbP; Eva monitors the exchanged kaP and kbP)

In this communication, the only information exposed to the public (Eva) is Pa, Pb
and P, and retrieving ka and kb from them is a discrete logarithm problem [21].
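The five steps above can be sketched in a few lines of Python. This is a toy illustration only, not the thesis' implementation: it assumes a tiny field GF(2^8) with the reduction polynomial z^8 + z^4 + z^3 + z + 1, hypothetical curve coefficients A = 1 and B = 0x5A, a base point found by brute force, and fixed private keys. The helper names (`fmul`, `finv`, `add`, `mul`) are made up for this sketch, and the affine formulas used are those of Eq. 2.21 and Eq. 2.25 later in this chapter.

```python
# Toy ECDH on a binary curve; all parameters are illustrative assumptions, not secure.
M = 8                 # toy field GF(2^8); the thesis works in GF(2^163)
F = 0x11B             # z^8 + z^4 + z^3 + z + 1, an irreducible polynomial
A, B = 1, 0x5A        # hypothetical coefficients of y^2 + xy = x^3 + A x^2 + B
O = None              # point at infinity

def fmul(a, b):
    """Multiply in GF(2^M): shift-and-XOR with reduction by F."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r

def finv(a):
    """Inverse of a nonzero element, computed as a^(2^M - 2)."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = fmul(r, a)
        a = fmul(a, a)
        e >>= 1
    return r

def on_curve(P):
    if P is O:
        return True
    x, y = P
    xx = fmul(x, x)
    return (fmul(y, y) ^ fmul(x, y)) == (fmul(xx, x) ^ fmul(A, xx) ^ B)

def add(P, Q):
    """Affine group law: Eq. 2.21 for distinct points, Eq. 2.25 for doubling."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return O                      # Q = -P, or P is its own negative
        lam = x1 ^ fmul(y1, finv(x1))
        xx = fmul(x1, x1)
        x3 = xx ^ fmul(B, finv(xx))
        y3 = xx ^ fmul(lam, x3) ^ x3
    else:
        lam = fmul(y1 ^ y2, finv(x1 ^ x2))
        x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
        y3 = fmul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)

def mul(k, P):
    """Left-to-right binary method for kP."""
    R = O
    for bit in bin(k)[2:]:
        R = add(R, R)
        if bit == "1":
            R = add(R, P)
    return R

# Base point: first curve point with x != 0, found by brute force.
P = next((x, y) for x in range(1, 1 << M) for y in range(1 << M) if on_curve((x, y)))

ka, kb = 23, 57                    # private keys (fixed here for reproducibility)
Pa, Pb = mul(ka, P), mul(kb, P)    # public keys exchanged in steps 1 and 2
shared_a = mul(ka, Pb)             # step 3, computed by Alice
shared_b = mul(kb, Pa)             # step 4, computed by Bob
print(shared_a == shared_b)        # step 5: both sides hold the same point
```

Eva sees only `P`, `Pa` and `Pb`. In this toy field she could recover `ka` by exhaustive search, which is why a real system uses a field as large as GF(2^163) and randomly generated private keys.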
2.2 Elliptic Curve Geometry
First, we present the basic mathematical derivation for an elliptic curve in the
real number system, and describe how ECADD and ECDBL are defined and how
cryptography is defined on an elliptic curve.
Figure 2.2: An example of an elliptic curve in real numbers system
NIST recommends the elliptic curve E in Eq. 2.1, where a, b, x and y are all
finite field elements [6] [19] [20]. The prime field and the binary finite field are
the two finite fields used in ECC.
\[ y^2 + xy = x^3 + ax^2 + b \tag{2.1} \]
In the real number system, this type of equation describes an elliptic curve
geometrically, as in the example shown in Fig. 2.2, where the equation is
\[ y^2 + xy = x^3 - 2x^2 + 1 \tag{2.2} \]
In the following, we analyze the elliptic curve geometry in the real number system.
Figure 2.3: Special point on elliptic curve
2.2.1 A line through Two Distinct Points On Elliptic Curve
Given two distinct points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ on the elliptic
curve E, and the line L through them, we can obtain the third intersection point
$T = (T_x, T_y)$ of L with E. If $x_1$ equals $x_2$, as shown in Fig. 2.3, the slope
of L does not exist and the third intersection does not physically exist on E.
Therefore, a special point O is defined as the third intersection point for this
situation, and it is located at infinity as shown in Fig. 2.3.
If $x_1 \neq x_2$ as shown in Fig. 2.4, the slope of line L is
\[ \lambda_1 = \frac{y_2 - y_1}{x_2 - x_1}, \tag{2.3} \]
Figure 2.4: Intersection points on Elliptic Curve
then, the line L is
\[ y = \lambda_1 (x - x_1) + y_1. \tag{2.4} \]
By substituting Eq. 2.4 into Eq. 2.1,
\[ (\lambda_1 (x - x_1) + y_1)^2 + x(\lambda_1 (x - x_1) + y_1) = x^3 + ax^2 + b. \tag{2.5} \]
At first glance, it seems that a cubic equation has to be solved. However, since
$x_1$ and $x_2$ are known roots, we can instead compare the coefficients of Eq. 2.5
with those of the following equation,
\[ (x - x_1)(x - x_2)(x - T_x) = 0, \tag{2.6} \]
and after expansion, we have
\[ x^3 - (x_1 + x_2 + T_x)x^2 + (T_x x_1 + T_x x_2 + x_1 x_2)x - x_1 x_2 T_x = 0. \tag{2.7} \]
Then, we can compare the coefficient of $x^2$ between Eq. 2.7 and Eq. 2.5,
\[ a - \lambda_1 - \lambda_1^2 = -(x_1 + x_2 + T_x) \tag{2.8} \]
\[ T_x = \lambda_1^2 + \lambda_1 - (x_2 + x_1) - a. \tag{2.9} \]
Finally, from Eq. 2.9 and Eq. 2.4,
\[ T_y = \lambda_1 (T_x - x_1) + y_1. \tag{2.10} \]
2.2.2 A Tangent Line of Elliptic Curve
If $P_1$ and $P_2$ are the same point, and the line L is the tangent line through
it as shown in Fig. 2.5, the third intersection point T can be obtained similarly.
It can also be the point at infinity O, as shown in Fig. 2.3, if the slope of the
tangent line L does not exist. If the slope $\lambda_2$ of L exists, it is the
derivative of E at $P_1$,
\[
\begin{aligned}
\frac{d}{dx}\left(y^2 + xy\right) &= \frac{d}{dx}\left(x^3 + ax^2 + b\right) \\
2yy' + y + xy' &= 3x^2 + 2ax \\
y' &= \frac{3x^2 + 2ax - y}{2y + x}
\end{aligned}
\tag{2.11}
\]
Figure 2.5: Tangent line on Elliptic Curve
Then,
\[ \lambda_2 = \frac{3x_1^2 + 2ax_1 - y_1}{2y_1 + x_1} \tag{2.12} \]
and the line L is
\[ y = \lambda_2 (x - x_1) + y_1. \tag{2.13} \]
By substituting Eq. 2.13 into Eq. 2.1, we have
\[ (\lambda_2 (x - x_1) + y_1)^2 + x(\lambda_2 (x - x_1) + y_1) = x^3 + ax^2 + b. \tag{2.14} \]
As the two points coincide, $x_1$ is a double root, so
\[ (x - x_1)(x - x_1)(x - T_x) = 0, \tag{2.15} \]
and after expansion, it is
\[ x^3 - (2x_1 + T_x)x^2 + (2T_x x_1 + x_1^2)x - x_1^2 T_x = 0. \tag{2.16} \]
By comparing the coefficients between Eq. 2.16 and Eq. 2.14, we get
\[ T_x = \lambda_2^2 + \lambda_2 - 2x_1 - a. \tag{2.17} \]
\[ T_y = \lambda_2 (T_x - x_1) + y_1. \tag{2.18} \]
2.2.3 Point Addition on Elliptic Curve
In the above sections, we have described the mathematical background of the
geometry of an elliptic curve in the real number system. We now turn to how the PM
is defined on an elliptic curve over a binary finite field. ECC uses finite field
elements because operations on an elliptic curve over the real number system
involve rounding, so the results would be inexact. Besides, in a binary finite
field using polynomial representation, addition is a simple exclusive OR of the
corresponding bits of the two operands, which is why this work chooses this number
system. In the following, all the numbers are in a binary finite field.
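The carry-free addition mentioned above can be made concrete with a minimal sketch. It assumes field elements are stored as Python integers whose bit i is the coefficient of z^i, and uses the B-163 reduction polynomial from Table 2.1. The bit-serial `ff_mul` is included only to show where reduction enters; it is not the word-level multiplier designed in Chapter 3, and the function names are hypothetical.

```python
M = 163
# Reduction polynomial of B-163 (Table 2.1): z^163 + z^7 + z^6 + z^3 + 1
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def ff_add(a, b):
    """Addition (and subtraction) in GF(2^163): bitwise XOR, no carries."""
    return a ^ b

def ff_mul(a, b):
    """Bit-serial multiplication modulo F (simple sketch, one bit of b per step)."""
    r = 0
    while b:
        if b & 1:
            r = ff_add(r, a)
        b >>= 1
        a <<= 1                 # multiply a by z
        if a >> M:              # degree reached 163: reduce with F
            a ^= F
    return r

a = 0b1011                      # z^3 + z + 1
b = 0b0110                      # z^2 + z
print(bin(ff_add(a, b)))                # 0b1101
print(ff_add(ff_add(a, b), b) == a)     # True: subtracting is the same XOR
print(hex(ff_mul(1 << 162, 0b10)))      # 0xc9: z^163 reduces to z^7 + z^6 + z^3 + 1
```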
As shown in Fig. 2.6, ECADD is defined by taking two distinct points ($P_1$ and
$P_2$) on the elliptic curve and drawing a straight line connecting them. The third
point (T) at which the straight line intersects the elliptic curve is then
reflected, and the resulting point ($P_3$) defines the ECADD. We defined a
special point O previously, and it is also used here for point operations.

Figure 2.6: Point Addition on Elliptic Curve

Now, let us first look at the rules of ECADD.
1. $P_1 + P_2 = P_2 + P_1$
2. $P_1 + (-P_1) = O$
3. $P_1 + P_2 = O$ if $P_1 = -P_2$
As $P_3$ equals $-T$, it has the same x-coordinate as T.
\[
\begin{aligned}
y^2 + xy &= x^3 + ax^2 + b \\
y^2 + xy - x^3 - ax^2 - b &= 0
\end{aligned}
\tag{2.19}
\]
From Eq. 2.19, which is quadratic in y for a fixed x, we get the relation between
the y-coordinates $T_y$ and $y_3$,
\[
\begin{aligned}
T_y + y_3 &= -T_x \\
P_3 &= (T_x, -T_x - T_y)
\end{aligned}
\tag{2.20}
\]
Based on the previous sections, given $P_1$ and $P_2$, we already know how to get
the third intersection T in the real number system. In a binary finite field, the
operations are slightly different, and some equations can be simplified. Addition
and subtraction are the same, and both are bitwise XOR operations. Finally, we have
\[
\begin{aligned}
x_3 &= \left(\frac{y_1 + y_2}{x_1 + x_2}\right)^2 + \frac{y_1 + y_2}{x_1 + x_2} + x_1 + x_2 + a \\
y_3 &= \left(\frac{y_1 + y_2}{x_1 + x_2}\right)(x_1 + x_3) + x_3 + y_1
\end{aligned}
\tag{2.21}
\]
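The step from the real-number formulas to Eq. 2.21 can be spelled out as follows. This is a sketch that only replaces every subtraction in Eqs. 2.3, 2.9, 2.10 and 2.20 by an addition, since $-u = u$ in a binary field; $\lambda$ denotes the chord slope.

```latex
\begin{align*}
\lambda &= \frac{y_1 + y_2}{x_1 + x_2}
  && \text{(Eq. 2.3)} \\
T_x &= \lambda^2 + \lambda + x_1 + x_2 + a
  && \text{(Eq. 2.9)} \\
T_y &= \lambda (T_x + x_1) + y_1
  && \text{(Eq. 2.10)} \\
x_3 &= T_x, \qquad
y_3 = T_x + T_y = \lambda (x_1 + x_3) + x_3 + y_1
  && \text{(Eq. 2.20)}
\end{align*}
```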
2.2.4 Point Doubling on Elliptic Curve
Point doubling is dened similarly with ECADD, except instead of using two distinct
points to draw the straight line, it uses the tangent line of a single point (P1) as shown
in Fig. 2.7.
The slope  can be further simplied in binary nite eld
 =
3x21 + 2ax1 + y1
2y1 + x1
 =
x21 + y1
x1
 = x1 +
y1
x1
(2.22)
17
Figure 2.7: Point Doubling on Elliptic Curve
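By substituting Eq. 2.22 into Eq. 2.17 and Eq. 2.18 respectively, we have
\[
\begin{aligned}
T_x &= \left(x_1 + \frac{y_1}{x_1}\right)^2 + x_1 + \frac{y_1}{x_1} + a \\
T_x &= x_1^2 + \frac{y_1^2}{x_1^2} + x_1 + \frac{y_1}{x_1} + a \\
T_x &= x_1^2 + x_1 + \frac{y_1^2 + x_1 y_1}{x_1^2} + a \\
T_x &= x_1^2 + x_1 + \frac{x_1^3 + ax_1^2 + b}{x_1^2} + a \\
T_x &= x_1^2 + x_1 + x_1 + a + \frac{b}{x_1^2} + a \\
T_x &= x_1^2 + \frac{b}{x_1^2}
\end{aligned}
\tag{2.23}
\]
\[
\begin{aligned}
T_y &= \left(x_1 + \frac{y_1}{x_1}\right)(T_x + x_1) + y_1 \\
T_y &= \left(x_1 + \frac{y_1}{x_1}\right)T_x + x_1^2.
\end{aligned}
\tag{2.24}
\]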
Finally, the ECDBL result $P_3$ is obtained as follows,
\[
\begin{aligned}
x_3 &= x_1^2 + \frac{b}{x_1^2} \\
y_3 &= x_1^2 + \left(x_1 + \frac{y_1}{x_1}\right)x_3 + x_3.
\end{aligned}
\tag{2.25}
\]
2.2.5 NIST-recommended random elliptic curves over binary fields
Table 2.1: NIST-recommended random elliptic curves over binary fields
B-163: m = 163, f(z) = z^163 + z^7 + z^6 + z^3 + 1, a = 1, h = 2
    b = 0x 00000002 0A601907 B8C953CA 1481EB10 512F7874
           4A3205FD
B-233: m = 233, f(z) = z^233 + z^74 + 1, a = 1, h = 2
    b = 0x 00000066 647EDE6C 332C7F8C 0923BB58 213B333B
           20E9CE42 81FE115F 7D8F90AD
B-283: m = 283, f(z) = z^283 + z^12 + z^7 + z^5 + 1, a = 1, h = 2
    b = 0x 027B680A C8B8596D A5A4AF8A 19A0303F CA97FD76
           45309FA2 A581485A F6263E31 3B79A2F5
B-409: m = 409, f(z) = z^409 + z^87 + 1, a = 1, h = 2
    b = 0x 0021A5C2 C8EE9FEB 5C4B9A75 3B7B476B 7FD6422E
           F1F3DD67 4761FA99 D6AC27C8 A9A197B2 72822F6C
           D57A55AA 4F50AE31
B-571: m = 571, f(z) = z^571 + z^10 + z^5 + z^2 + 1, a = 1, h = 2
    b = 0x 02F40E7E 2221F295 DE297117 B7F3D62F 5C6A97FF
           CB8CEFF1 CD6BA8CE 4A9A18AD 84FFABBD 8EFA5933
           2BE7AD67 56A66E29 4AFD185A 78FF12AA 520E4DE7
           39BACA0C 7FFEFF7F 2955727A
NIST recommends five random elliptic curves over binary fields [6] [19], and they
are listed in Table 2.1. The f(z) is the reduction polynomial of degree m, and as the
key size (m) increases from 163 to 571, the security level increases
correspondingly, and so do the computation time and the complexity of the system.
NIST also recommends five random elliptic curves over prime fields. These are more
suitable for software implementations, as their field elements are integers modulo
a prime and a general-purpose microprocessor can perform the required integer
arithmetic. In hardware implementations, the binary finite field is more popular
because its finite field arithmetic can be very simple. For example, addition and
subtraction are just an exclusive OR of the corresponding bits of the two operands.
Therefore, there is no carry propagation delay, which is usually the bottleneck of
the critical path for prime field arithmetic in hardware.

Algorithm 1 Binary NAF method [19]
Input: a point P over E(GF(2^m)), a positive l-bit integer
  $k = \sum_{i=0}^{l-1} k_i 2^i$, $k_i \in \{-1, 0, 1\}$.
Output: Q = kP.
//*Initialization*//
P1 ← P, P2 ← O.
//*PM Loop Process*//
for i = l − 1 down to 0 do
  P2 ← 2P2
  if k_i = 1 then
    P2 ← P2 + P1
  end if
  if k_i = −1 then
    P2 ← P2 − P1
  end if
end for
return (Q = P2)
2.3 Elliptic Curve Arithmetic
Algorithm 2 Montgomery Method [18]
Input: a point P(x, y) over E(GF(2^m)), a positive l-bit integer
  $k = (k_{l-1}, \dots, k_1, k_0)_2$.
Output: Q = kP.
//*Initialization*//
P1 ← P, P2 ← 2P.
//*PM Loop Process*//
for i = l − 2 down to 0 do
  if k_i = 1 then
    P1 ← P1 + P2, P2 ← 2P2.
  else
    P2 ← P1 + P2, P1 ← 2P1.
  end if
end for
return (Q(x0, y0) = P1)

The algorithms for calculating PM on elliptic curve are also called elliptic curve
arithmetic, and they follow the conventional hierarchy shown in Fig. 1.2. There are many
methods for calculating kP. The most common one is the binary method, which works
the same way as multiplication of binary numbers (e.g.
11P = 2(2(2P) + P) + P). As in binary hardware multiplier design,
there are many methods to recode k so that the number of
ECADD and ECDBL operations can be reduced. The signed-digit (SD) representation [19] of k is
\[ k = \sum_{i=0}^{l-1} k_i 2^i, \qquad k_i \in \{-1, 0, 1\}. \]
If no two adjacent digits are non-zero, the SD
form is called the non-adjacent form (NAF) [19]. The NAF has the least weight (the fewest
non-zero digits) of any SD representation of k, and it is unique for every integer k.
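As a small illustration of the recoding described above, the following Python sketch computes the NAF digits of an integer and compares its weight with that of the plain binary form. The function names `naf` and `weight` are hypothetical helpers introduced only for this example; they are not part of the design in this thesis.

```python
def naf(k):
    """Return the NAF digits of k > 0, least-significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)      # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d               # makes k divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def weight(digits):
    """Number of non-zero digits, i.e. the number of ECADD operations needed."""
    return sum(1 for d in digits if d != 0)


k = 7
binary = [(k >> i) & 1 for i in range(k.bit_length())]
print("binary digits (LSB first):", binary, "weight", weight(binary))
print("NAF digits    (LSB first):", naf(k), "weight", weight(naf(k)))
```

For k = 7 the binary form 111 needs three additions, while the NAF (1, 0, 0, −1) needs only two, at the cost of one extra digit.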
2.3.1 Montgomery PM
The Montgomery method is a variant of the binary method. It is based on keeping an
invariant relationship during the whole kP calculation, namely P2 − P1 = P.
By using this relationship, only the x-coordinates are needed in the kP loop
calculation, and the y-coordinate of the final result point can be recovered at the end
from the x-coordinates.
From Eq. 2.21, if P1 ≠ P2 we have
\[ x_3 = \left(\frac{y_1 + y_2}{x_1 + x_2}\right)^2 + \frac{y_1 + y_2}{x_1 + x_2} + x_1 + x_2 + a
       = \frac{x_1 y_2 + x_2 y_1 + x_1 x_2^2 + x_2 x_1^2}{(x_1 + x_2)^2} \tag{2.26} \]
Here P = (x, y), and P = P2 − P1. Since −P1 = (x1, x1 + y1), replacing y1 by x1 + y1 in Eq. 2.26 gives
\[ x = \frac{x_1 y_2 + x_2 (x_1 + y_1) + x_1 x_2^2 + x_2 x_1^2}{(x_1 + x_2)^2}. \tag{2.27} \]
By combining Eq. 2.26 and Eq. 2.27, the x-coordinate of P3 is
\[ x_3 = x + \left(\frac{x_1}{x_1 + x_2}\right)^2 + \frac{x_1}{x_1 + x_2} \tag{2.28} \]
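Finally, we have
\[ x_3 = \begin{cases}
x + \left(\dfrac{x_1}{x_1 + x_2}\right)^2 + \dfrac{x_1}{x_1 + x_2} & \text{if } P_1 \neq P_2, \\[2ex]
x_1^2 + \dfrac{b}{x_1^2} & \text{if } P_1 = P_2.
\end{cases} \tag{2.29} \]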
Therefore, the y-coordinate is not involved in the calculation during the LOOP process.
The y-coordinate can then be calculated from the x-coordinates in the
final stage. As P2 = P1 + P, then from Eq. 2.26,
\[ x_2 = \frac{x_1 y + x y_1 + x_1 x^2 + x x_1^2}{(x_1 + x)^2} \tag{2.30} \]
Solving for y_1 gives
\[ y_1 = \left(\frac{x_1}{x} + 1\right)\left\{(x_1 + x)(x_2 + x) + x^2 + y\right\} + y \tag{2.31} \]
Therefore, we only need to calculate the x-coordinate in the loop process, which
significantly decreases the number of finite field operations.
2.3.2 Projective Coordinates
So far, all the calculations are defined in conventional affine coordinates. The
problem with the Montgomery method in affine coordinates is that many
finite field inversions are involved in the loop process. As the iteration
number of the LOOP is 162 in this work, the Montgomery method
consumes nearly 2 × 162 finite field inversions in total. As finite field
inversion is usually the most time-consuming finite field operation, many works
use projective coordinates [22][25] to reduce the number of inversions.
In standard projective coordinates [8], a projective point
has the following relationship with its corresponding affine point on the elliptic curve, where
Z ≠ 0:
\[ (X, Y, Z) \Longleftrightarrow (X/Z,\; Y/Z). \tag{2.32} \]
By substituting (X/Z, Y/Z) for (x, y) in the affine curve equation, the projective form of the EC is
\[ Y^2 Z + XYZ = X^3 + aX^2 Z + bZ^3 \tag{2.33} \]
Algorithm 3 Lopez-Dahab algorithm [14]
Input: a point P(x, y) over E(GF(2^m)), a positive l-bit integer k = (k_{l-1}, ..., k_1, k_0)_2 with k_{l-1} = 1.
Output: Q = kP.
//*Affine To Projective Coordinate Initialization*//
(X1, Z1) ← (x, 1); (X2, Z2) ← (x^4 + b, x^2).
//*PM Loop Process*//
for i = l − 2 down to 0 do
  if k_i = 1 then
    (X1, Z1) ← Madd(X1, Z1, X2, Z2, x);
    (X2, Z2) ← Mdouble(X2, Z2, b).
  else
    (X2, Z2) ← Madd(X1, Z1, X2, Z2, x);
    (X1, Z1) ← Mdouble(X1, Z1, b).
  end if
end for
//*Projective To Affine Coordinate Conversion*//
Q ← Mxy(X1, Z1, X2, Z2, x, y).
return Q(x0, y0).
Similarly, there are also other projective coordinates such as Jacobian projective
coordinates [23], Chudnovsky Jacobian coordinates [23], and Lopez-Dahab projective
coordinates. For example, in Jacobian projective coordinates, the projective point
(X, Y, Z), Z ≠ 0, corresponds to (X/Z^2, Y/Z^3) in affine coordinates, and the projective
form of the elliptic curve is Y^2 + XYZ = X^3 + aX^2Z^2 + bZ^6.
2.3.3 Lopez-Dahab Algorithm
Lopez-Dahab projective coordinates are one of the most effective ways to lower the
number of FF inversion operations. In Lopez-Dahab projective coordinates, each
point on the elliptic curve is also written as (X, Y, Z). However, the
relation to affine coordinates is different: (X, Y, Z) ⟺ (X/Z, Y/Z^2). The projective
form of the elliptic curve is then
\[ Y^2 + XYZ = X^3 Z + aX^2 Z^2 + bZ^4 \tag{2.34} \]
Having obtained the Lopez-Dahab projective coordinates, we now see how the
Lopez-Dahab algorithm optimizes the Montgomery method by using them.
Eq. 2.29 can be rewritten in Lopez-Dahab projective coordinates.
If P1 = P2,
\[ \frac{X_3}{Z_3} = \left(\frac{X_1}{Z_1}\right)^2 + b\left(\frac{Z_1}{X_1}\right)^2
   = \frac{X_1^4 + bZ_1^4}{Z_1^2 X_1^2} \tag{2.35} \]
Then,
\[ X_3 = X_1^4 + bZ_1^4, \qquad Z_3 = Z_1^2 X_1^2. \tag{2.36} \]
Similarly, if P1 ≠ P2,
\[ X_3 = xZ_3 + (X_1 Z_2)(X_2 Z_1), \qquad Z_3 = (X_1 Z_2 + X_2 Z_1)^2. \tag{2.37} \]
Finally, we obtain the Lopez-Dahab method in Algorithm 3, where Madd stands
for ECADD as in Eq. 2.37, and Mdouble stands for ECDBL as in Eq. 2.36. As
we can see, no FF inversion is involved in the LOOP process, and
FF inversions are only needed in the final stage in Mxy. Therefore, the Lopez-Dahab
method can significantly improve the performance if the FF inversion operation is
very expensive. To summarize, Madd, Mdouble and Mxy are defined as follows:
Madd(X1, Z1, X2, Z2, x)
{
  X ← X1 Z2 X2 Z1 + x (X1 Z2 + X2 Z1)^2;
  Z ← (X1 Z2 + X2 Z1)^2;
  Return(X, Z);
}
Mdouble(X1, Z1, b)
{
  X ← X1^4 + b Z1^4;
  Z ← X1^2 Z1^2;
  Return(X, Z);
}
Mxy(X1, Z1, X2, Z2, x, y)
{
  xk = X1/Z1;
  yk = [x^2 + y + (x + X1/Z1)(x + X2/Z2)](x + xk)/x + y;
  Return(xk, yk);
}
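The three routines above can be exercised end to end on a toy example. The sketch below is only an illustration under stated assumptions: it uses the small field GF(2^4) with f(x) = x^4 + x + 1 and the curve y^2 + xy = x^3 + x^2 + 1 (both chosen here for brevity, not the GF(2^163) parameters of this work), and all function names are hypothetical. It runs the ladder of Algorithm 3 and compares the result with repeated affine point addition wherever kP and (k + 1)P are finite, which is what Mxy requires.

```python
M, F = 4, 0b10011          # toy field GF(2^4), f(x) = x^4 + x + 1 (assumption)
A, B = 1, 1                # toy curve y^2 + xy = x^3 + A x^2 + B (assumption)

def gf_mul(a, b):
    """Shift-and-add multiplication with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= F
    return r

def gf_inv(a):
    """a^(2^M - 2); returns 0 for a = 0, which callers must avoid."""
    r = 1
    for _ in range((1 << M) - 2):
        r = gf_mul(r, a)
    return r

def on_curve(x, y):
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A, x2) ^ B

def affine_add(P, Q):
    """Reference affine group law; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y2 == (x1 ^ y1):                      # Q = -P
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))        # doubling
        x3 = gf_mul(lam, lam) ^ lam ^ A
        return x3, gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1

def madd(X1, Z1, X2, Z2, x):
    t1, t2 = gf_mul(X1, Z2), gf_mul(X2, Z1)
    Z = gf_mul(t1 ^ t2, t1 ^ t2)
    return gf_mul(t1, t2) ^ gf_mul(x, Z), Z

def mdouble(X1, Z1, b):
    xs, zs = gf_mul(X1, X1), gf_mul(Z1, Z1)
    return gf_mul(xs, xs) ^ gf_mul(b, gf_mul(zs, zs)), gf_mul(xs, zs)

def mxy(X1, Z1, X2, Z2, x, y):
    xk = gf_mul(X1, gf_inv(Z1))
    xn = gf_mul(X2, gf_inv(Z2))
    t = gf_mul(x, x) ^ y ^ gf_mul(x ^ xk, x ^ xn)
    return xk, gf_mul(gf_mul(t, x ^ xk), gf_inv(x)) ^ y

def ladder(k, x, y, b=B):
    """Algorithm 3 for k >= 1 and x != 0."""
    xs = gf_mul(x, x)
    X1, Z1, X2, Z2 = x, 1, gf_mul(xs, xs) ^ b, xs
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            (X1, Z1), (X2, Z2) = madd(X1, Z1, X2, Z2, x), mdouble(X2, Z2, b)
        else:
            (X2, Z2), (X1, Z1) = madd(X1, Z1, X2, Z2, x), mdouble(X1, Z1, b)
    return mxy(X1, Z1, X2, Z2, x, y)

points = [(x, y) for x in range(1, 16) for y in range(16) if on_curve(x, y)]

def check_all():
    """Compare ladder and affine results wherever kP and (k+1)P are finite."""
    checked = 0
    for P in points:
        Q = P
        for k in range(1, 40):
            nxt = affine_add(Q, P)
            if Q is not None and nxt is not None:
                assert ladder(k, P[0], P[1]) == Q
                checked += 1
            Q = nxt
    return checked

print("ladder agrees with affine arithmetic in", check_all(), "cases")
```

Note that the loop itself contains only multiplications, squarings and XORs; the three inversions appear only in `mxy`, which mirrors the argument made above.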
In [10], a slight modification was made to move the condition evaluation to the
end of the LOOP, as shown in Algorithm 4. By doing so, Madd() and Mdouble()
Algorithm 4 Modified Lopez-Dahab algorithm [10]
Input: a point P(x, y) over E(GF(2^m)), a positive l-bit integer k = (k_{l-1}, ..., k_1, k_0)_2 with k_{l-1} = 1.
Output: Q = kP.
//*Affine To Projective Coordinate Initialization*//
(X1, Z1) ← (x, 1); (X2, Z2) ← (x^4 + b, x^2).
if k_{l-2} = 1 then
  Swap(X1, X2); Swap(Z1, Z2)
end if
//*PM Loop Process*//
for i = l − 2 down to 0 do
  (X2, Z2) ← Madd(X1, Z1, X2, Z2, x);
  (X1, Z1) ← Mdouble(X1, Z1, b).
  if (i ≠ 0 and k_i ≠ k_{i-1}) or (i = 0 and k_i = 1) then
    Swap(X1, X2); Swap(Z1, Z2)
  end if
end for
//*Projective To Affine Coordinate Conversion*//
Q ← Mxy(X1, Z1, X2, Z2, x, y).
return Q(x0, y0).
[Figure 2.8: Data dependency inside the LOOP. Inputs x, Z2, X1, Z1, X2 and b pass through multiplications (*), additions (+) and squarings (^2) to produce Z2, X2, Z1 and X1.]
have uniform inputs, and we can easily analyze the data dependency before the swap
operation inside the LOOP as shown in Fig. 2.8.
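The effect of moving the condition to the end of the loop can be checked without any field arithmetic. In the sketch below, which is only an illustration, a point mP is represented by the integer m, so that Madd becomes integer addition and Mdouble becomes doubling; `ladder_plain` follows Algorithm 3 and `ladder_swapped` follows Algorithm 4. Both function names are hypothetical.

```python
def ladder_plain(k):
    """Algorithm 3 with the point mP represented by the integer m."""
    p1, p2 = 1, 2
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            p1, p2 = p1 + p2, 2 * p2
        else:
            p2, p1 = p1 + p2, 2 * p1
    return p1


def ladder_swapped(k):
    """Algorithm 4: uniform loop body, with the swap decided at the end."""
    l = k.bit_length()

    def bit(i):
        return (k >> i) & 1

    p1, p2 = 1, 2
    if l >= 2 and bit(l - 2) == 1:
        p1, p2 = p2, p1
    for i in range(l - 2, -1, -1):
        p2, p1 = p1 + p2, 2 * p1          # always Madd into slot 2, Mdouble slot 1
        if (i != 0 and bit(i) != bit(i - 1)) or (i == 0 and bit(i) == 1):
            p1, p2 = p2, p1
    return p1


for k in (1, 2, 3, 5, 11, 163):
    print(k, ladder_plain(k), ladder_swapped(k))
```

Both functions return k for every positive k, which shows that the swaps only relabel the two register pairs and do not change the computed multiple.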
2.3.4 Parallelized Lopez-Dahab Algorithm
In hardware implementations, parallelism is a common way to improve
performance at the cost of circuit area and power consumption. In [11], Algorithm
5 is fully parallelized based on its data dependency. Two uniform steps are used
to support the data dependency, as shown in Fig. 2.9, and as the GNB is used,
the square operation is simply a cyclic shift. Therefore, a uniform block with an
addition appended to the output of the multipliers is architected in [11].
Algorithm 5 Parallelized version of Lopez-Dahab algorithm with uniform addressing [11]
Input: P = (x, y) ∈ E(GF(2^m)), an l-bit integer k = (k_{l-1}, ..., k_1, k_0)_2.
Output: kP = (x0, y0).
//*Affine To Projective Coordinate Initialization*//
(X1, Z1) ← (x, 1); (X2, Z2) ← (x^4 + b, x^2).
//*PM LOOP Process*//
if k_{l-2} = 1 then
  Swap(X1, X2), Swap(Z1, Z2)
end if
for i = l − 2 down to 0 do
  1. T1 ← X1 Z2; T2 ← X2 Z1; T3 ← (X1 Z1)^2; Z3 ← (T1 + T2)^2; Z2 ← Z3;
  2. X2 ← T1 T2 + x Z3; X1 ← b Z1^4 + X1^4; Z1 ← T3
  if (i ≠ 0 and k_i ≠ k_{i-1}) or (i = 0 and k_i = 1) then
    Swap(X1, X2), Swap(Z1, Z2)
  end if
end for
//*Projective To Affine Coordinate Conversion*//
x0 ← X1/Z1,
y0 ← (1/x)(x + X1/Z1){(x + X1/Z1)(x + X2/Z2) + x^2 + y} + y.
return kP = (x0, y0).
[Figure 2.9: Data dependency inside the LOOP in [11]]
[Figure 2.10: Proposed instruction set based on the data dependency. Same data-flow graph as Fig. 2.8, with the longest FF operation path marked.]
2.3.5 Proposed Instruction-level Parallelism for Parallelized
Lopez-Dahab Algorithm
In this work, a processor based architecture is employed to improve the system
performance. The advantages of a processor based architecture are that it is
flexible, and that it is easier to control the system critical path because the
control path and data path can be easily separated and pipelined.
As the program executed on the cryptoprocessor is fixed for a certain type of elliptic
curve arithmetic algorithm, we can generate a customized instruction set for the
algorithm to accelerate the system performance. First, let us review the data dependency
shown in Fig. 2.10. Conventionally, only the FFM, FFA and FFS instructions are
available. In that case, the longest FF operation path includes two FFM operations,
two FFA operations and one FFS operation. As these operations are all inside the
LOOP, the total number of clock cycles needed to finish the longest FF operation path
has a significant impact on the system's performance. In this work, we combine FFA
and FFS, and an instruction (A + B)^2 is proposed to finish them in one clock cycle.
Therefore, we save one clock cycle per iteration, and 162 clock cycles in total over the PM
calculation. An instruction A^4 is also proposed to finish two square operations in
one clock cycle, which decreases the clock cycles of the FFI
operation by nearly half, as described in Chapter 3.
In this work, the ILP for the parallelized Lopez-Dahab algorithm is achieved on
three FF cores as in Algorithm 6. There are three columns of FF operations, one
executed on each FF core, using the customized FF arithmetic instruction set AB,
A + B, (A + B)^2 and A^4. NOP in the algorithm stands for an empty operation.
Of course, this customized instruction set could be replaced by AB, A + B and A^2,
but calculating (A + B)^2 and A^4 would then cost two clock cycles each. Therefore,
the instructions (A + B)^2 and A^4, which both finish in one clock cycle in the customized
instruction set, decrease the number of clock cycles. The customized instruction set
also meets the hardware level aspect mentioned above, as described later
in the critical path analysis. As an FFM costs several clock cycles, the operation A^4 in
the loop is computed in parallel with the FF multiplication to meet the instruction level
aspect mentioned above. Interconnections among the three cores are needed to support the
data dependency between the operations in each core. For instance, at step
2 in the loop, core 1 needs the data V2 from core 2, which is generated in step 1.
Thus an interconnection between core 1 and core 2 is needed to support this data
dependency. The other necessary interconnections are obtained similarly.
Algorithm 6 Proposed ILP of parallelized Lopez-Dahab algorithm on three FF cores
Input: P = (x, y) ∈ E(GF(2^m)), an l-bit integer k = (k_{l-1}, ..., k_1, k_0)_2.
Output: kP = (x0, y0).
//Affine To Projective Coordinate Initialization
//   core 1                core 2              core 3
1:   X1 ← (x + 0);         NOP;                Z2 ← (x + 0)^2;
2:   Z1 ← (1 + 0);         NOP;                X2 ← x^4;
3:   NOP;                  NOP;                X2 ← X2 + b;
//PM LOOP Process
for i = l − 2 down to 0 do
//   core 1                core 2              core 3
1:   V1 ← X1 Z2;           V2 ← X2 Z1;         V3 ← X1 Z1;
                                               R3 ← Z1^4;
2:   Z2 ← (V1 + V2)^2;     NOP;                Z1 ← (V3 + 0)^2;
3:   V1 ← V1 V2;           V2 ← x Z2;          V3 ← b R3;
                                               R3 ← X1^4;
4:   X2 ← V1 + V2;         NOP;                X1 ← V3 + R3;
  if (i ≠ 0 and k_i ≠ k_{i-1}) or (i = 0 and k_i = 1) then
    Swap(X1, X2), Swap(Z1, Z2)
  end if
end for
//Projective To Affine Coordinate Conversion
//   core 1                core 2              core 3
1:   V1 ← Inv(Z1);         V2 ← Inv(Z2);       V3 ← Inv(x);
2:   R1 ← X1 V1;           V2 ← X2 V2;         R3 ← (x + 0)^2;
                                               R3 ← R3 + y;
3:   V1 ← x + R1;          V2 ← x + V2;        NOP;
4:   V1 ← V1 V3;           V2 ← V2 V1;         NOP;
5:   NOP;                  V2 ← V2 + R3;       NOP;
6:   NOP;                  V2 ← V1 V2;         NOP;
7:   NOP;                  R2 ← V2 + y;        NOP;
return kP = (x0, y0) = (R1, R2).
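The four loop steps of Algorithm 6 can be checked against the Madd and Mdouble formulas of Eq. 2.37 and Eq. 2.36. The following sketch is an illustration only: it uses a toy field GF(2^4) with f(x) = x^4 + x + 1 instead of GF(2^163), models the three cores simply as the three entries of each tuple assignment, and uses hypothetical function names.

```python
M, F = 4, 0b10011   # toy field GF(2^4), f(x) = x^4 + x + 1 (assumption)

def mul(a, b):
    """Multiply in the toy field: shift-and-add with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= F
    return r

def sq(a):
    return mul(a, a)

def reference_step(X1, Z1, X2, Z2, x, b):
    """Madd into (X2, Z2) and Mdouble into (X1, Z1), as in Eq. 2.37 and 2.36."""
    t1, t2 = mul(X1, Z2), mul(X2, Z1)
    Z2n = sq(t1 ^ t2)
    X2n = mul(t1, t2) ^ mul(x, Z2n)
    X1n = sq(sq(X1)) ^ mul(b, sq(sq(Z1)))
    Z1n = sq(mul(X1, Z1))
    return X1n, Z1n, X2n, Z2n

def scheduled_step(X1, Z1, X2, Z2, x, b):
    """Loop steps 1 to 4 of Algorithm 6; tuples are (core 1, core 2, core 3)."""
    V1, V2, V3 = mul(X1, Z2), mul(X2, Z1), mul(X1, Z1)   # step 1: AB
    R3 = sq(sq(Z1))                                       # step 1: A^4 beside the multiply
    Z2, Z1 = sq(V1 ^ V2), sq(V3 ^ 0)                      # step 2: (A + B)^2
    V1, V2, V3 = mul(V1, V2), mul(x, Z2), mul(b, R3)      # step 3: AB
    R3 = sq(sq(X1))                                       # step 3: A^4 beside the multiply
    X2, X1 = V1 ^ V2, V3 ^ R3                             # step 4: A + B
    return X1, Z1, X2, Z2

print(reference_step(3, 5, 7, 9, 6, 11) == scheduled_step(3, 5, 7, 9, 6, 11))
```

The sketch also makes the ordering constraint visible: R3 ← X1^4 in step 3 must read X1 before step 4 overwrites it.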
Chapter 3
Elliptic curve cryptographic processor
3.1 Finite Field Arithmetic Operations
In this section, we present the algorithms used in this design to implement the FF
arithmetic instructions. The critical path of each FF arithmetic
operation is also given for analysis. Before looking into the implementation of each
FF operation, we introduce the basic operations in a binary finite field.
3.1.1 Basic Finite Field Operations
A binary finite field has two commonly used representations: Gaussian Normal Basis (GNB) and
Polynomial Basis (PB). As this work focuses on the PB representation, we only
describe field elements in PB representation here. Each binary number has a
corresponding PB representation, as in the following example. In the following,
all operations are based on the PB representation.
\[ 11010010 \rightarrow x^7 + x^6 + x^4 + x \tag{3.1} \]
FF addition and subtraction are the same operation in a binary finite field; they only
need a bitwise XOR of the two operands. Therefore, FF addition can be
finished in one clock cycle with a delay of only T_Xor, as shown in Fig. 3.1.
[Figure 3.1: Architecture of finite field adder. The 163-bit operands A (a162 ... a0) and B (b162 ... b0) are XORed bit by bit into the 163-bit result C.]
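The PB representation and the XOR-based addition can be illustrated in a few lines of Python. This is a minimal sketch; `poly_str` and `ff_add` are hypothetical helper names, and Python integers stand in for the 163-bit registers.

```python
def poly_str(bits):
    """Render an integer bit pattern as a polynomial in x (PB representation)."""
    terms = []
    for i in range(bits.bit_length() - 1, -1, -1):
        if (bits >> i) & 1:
            terms.append("1" if i == 0 else "x" if i == 1 else f"x^{i}")
    return " + ".join(terms) or "0"


def ff_add(a, b):
    """FF addition (and subtraction) in GF(2^m): bitwise XOR, no carries."""
    return a ^ b


a, b = 0b11010010, 0b01000011
print(poly_str(a))                 # x^7 + x^6 + x^4 + x
print(poly_str(ff_add(a, b)))      # x^7 + x^4 + 1
```

Because every element is its own additive inverse, `ff_add(a, a)` is always zero, which is why addition and subtraction coincide.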
FF multiplication is similar to ordinary integer multiplication,
except that a modular reduction is involved in the calculation and the addition has no
carry propagation delay. An example of FF multiplication is shown in Fig. 3.2, where
f(x) is the irreducible polynomial.
[Figure 3.2: An example of FF multiplication. A = x^3 + x (1010), B = 0x^3 + x^2 + x + 1 (0111), f(x) = x^4 + x + 1; the shifted partial products are added one at a time to give C = A * B mod f(x) = (x^5 + x^4 + x^2 + x) mod f(x), i.e. 110110 before reduction.]
[Figure 3.3: An example of pure parallel FF multiplication. Same operands as Fig. 3.2, with all partial products added at once.]
From this example, we can see that an FF multiplier can be implemented in hardware
either bit-serially or fully in parallel. In bit-serial FF
multiplication, it takes one clock cycle to calculate each intermediate result. Therefore,
it takes four clock cycles to finish the above calculation (the modular operation
is not considered here). In the pure parallel FF multiplier, all the operations are
finished in one clock cycle, as shown in Fig. 3.3.
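The worked example of Fig. 3.2 can be reproduced with a short bit-serial model. This is a sketch only: `mul_bit_serial` and `reduce_poly` are hypothetical names, each loop iteration stands for one clock cycle of a bit-serial multiplier, and the reduction is a generic polynomial long division rather than the parallel reduction used later in this chapter.

```python
def mul_bit_serial(a, b, nbits):
    """Carry-less multiply: one shifted partial product is XORed in per cycle."""
    acc = 0
    for i in range(nbits):
        if (b >> i) & 1:
            acc ^= a << i
    return acc


def reduce_poly(c, f):
    """Reduce c modulo f by cancelling the leading term until deg(c) < deg(f)."""
    deg_f = f.bit_length() - 1
    while c.bit_length() - 1 >= deg_f:
        c ^= f << (c.bit_length() - 1 - deg_f)
    return c


A, B, F = 0b1010, 0b0111, 0b10011     # x^3 + x, x^2 + x + 1, x^4 + x + 1
product = mul_bit_serial(A, B, 4)
print(bin(product))                    # 0b110110, i.e. x^5 + x^4 + x^2 + x
print(bin(reduce_poly(product, F)))    # 0b11, i.e. x + 1
```

The unreduced product matches the value 110110 shown in Fig. 3.2; reducing it with x^4 = x + 1 leaves x + 1.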
3.1.2 Parallel Finite Field Reduction
So far, we have not considered the modular operation in Figs. 3.2 and 3.3. The
modular operation is also called FF reduction. It can be calculated after each
intermediate step of other FF operations or at the end of them. For example, if an
FFM consumes 10 cycles and we calculate a modular operation after each
intermediate cycle of the FFM, there will be 10 modular operations in total. Alternatively,
we can calculate only one modular operation when the FFM is finished after 10
cycles, but its hardware complexity is much higher, as illustrated in this section.
FF reduction is performed after every FF operation, and the maximum size of
the polynomial result that needs FF reduction is 325 coefficients, which arises from the product
\[ C(x) = [A(x)B(x)] \bmod f(x)
        = \Big[\sum_{i=0}^{162} \sum_{j=0}^{162} a_i b_j x^{i+j}\Big] \bmod f(x), \tag{3.2} \]
where the irreducible polynomial is f(x) = x^163 + x^7 + x^6 + x^3 + 1, and the unreduced product is
C(x) = c_{324} x^{324} + ... + c_1 x + c_0, c_i ∈ {0, 1}. C(x) can be decomposed as
\[ C(x) = H(x)f(x) + R(x), \tag{3.3} \]
where the maximum size of H(x) is 162, and R(x) is the reduced result. In this
equation, it is easy to find that the coefficients c_i with i > 162 in C(x) are
determined not by the reduced result R(x) but by H(x)f(x). Taking h_j = 0 for j > 161, we then have
\[ h_i = \begin{cases}
c_{i+163} & 7 \le i \le 161, \\
c_{i+163} \oplus h_{i+156} & i = 6, \\
c_{i+163} \oplus h_{i+157} \oplus h_{i+156} & 3 \le i \le 5, \\
c_{i+163} \oplus h_{i+160} \oplus h_{i+157} \oplus h_{i+156} & 0 \le i \le 2.
\end{cases} \tag{3.4} \]
From Eq. 3.3, the following equations can be obtained:
\[ R(x) = C(x) + H(x)f(x). \tag{3.5} \]
\[ r_i = \begin{cases}
c_i \oplus h_{i-3} \oplus h_{i-6} \oplus h_{i-7} & i = 162, \\
c_i \oplus h_i \oplus h_{i-3} \oplus h_{i-6} \oplus h_{i-7} & 7 \le i \le 161, \\
c_i \oplus h_i \oplus h_{i-3} \oplus h_{i-6} & i = 6, \\
c_i \oplus h_i \oplus h_{i-3} & 3 \le i \le 5, \\
c_i \oplus h_i & 0 \le i \le 2.
\end{cases} \tag{3.6} \]
Finally, by combining Eq. 3.4 and Eq. 3.6, the final reduced result can be obtained. With
\[ W = c_i \oplus c_{i+157} \oplus c_{i+160}, \qquad M = c_i \oplus c_{i+163} \oplus c_{i+319}, \]
\[ r_i = \begin{cases}
W \oplus c_{i+156} & i = 162, \\
W \oplus c_{i+156} \oplus c_{i+163} & 13 \le i \le 161, \\
W \oplus c_{i+156} \oplus c_{i+163} \oplus c_{i+312} & 11 \le i \le 12, \\
W \oplus c_{i+156} \oplus c_{i+163} \oplus c_{i+312} \oplus c_{i+314} & 7 \le i \le 10, \\
W \oplus c_{i+163} \oplus c_{i+313} \oplus c_{i+314} \oplus c_{i+316} & i = 6, \\
M \oplus c_{i+160} \oplus c_{i+316} \oplus c_{i+317} & 3 \le i \le 5, \\
M \oplus c_{i+320} & i = 2, \\
M \oplus c_{i+320} \oplus c_{i+323} & 0 \le i \le 1.
\end{cases} \tag{3.7} \]
Generally, the delay of the FF reduction presented above is $\lceil \log_2 7 \rceil T_{Xor}$, since at most
seven coefficients are XORed for one output bit. However,
in some specific FF operations, some coefficients of C(x) are zero, so the FF
reduction can be further simplified. For example, if c_i = 0, then c_i ⊕ c_j = c_j. The specific
delay of the FF reduction in each FF operation is analyzed separately in the
following.
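Eq. 3.7 can be checked numerically against an ordinary polynomial long division. The sketch below is an illustration only: `reduce_eq37` evaluates Eq. 3.7 bit by bit, `reduce_generic` is a reference reduction, both names are hypothetical, and Python integers hold the coefficient vectors (bit n is c_n).

```python
import random

M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # f(x) of this work

def reduce_generic(c):
    """Reference reduction: cancel the leading term until deg(c) < 163."""
    while c.bit_length() > M:
        c ^= F << (c.bit_length() - 1 - M)
    return c

def reduce_eq37(c):
    """Parallel reduction of a product with up to 325 coefficients (Eq. 3.7)."""
    def bit(n):
        return (c >> n) & 1
    r = 0
    for i in range(M):
        w = bit(i) ^ bit(i + 157) ^ bit(i + 160)
        m = bit(i) ^ bit(i + 163) ^ bit(i + 319)
        if i == 162:
            ri = w ^ bit(i + 156)
        elif i >= 13:
            ri = w ^ bit(i + 156) ^ bit(i + 163)
        elif i >= 11:
            ri = w ^ bit(i + 156) ^ bit(i + 163) ^ bit(i + 312)
        elif i >= 7:
            ri = w ^ bit(i + 156) ^ bit(i + 163) ^ bit(i + 312) ^ bit(i + 314)
        elif i == 6:
            ri = w ^ bit(i + 163) ^ bit(i + 313) ^ bit(i + 314) ^ bit(i + 316)
        elif i >= 3:
            ri = m ^ bit(i + 160) ^ bit(i + 316) ^ bit(i + 317)
        elif i == 2:
            ri = m ^ bit(i + 320)
        else:
            ri = m ^ bit(i + 320) ^ bit(i + 323)
        r |= ri << i
    return r

rng = random.Random(2010)
samples = [rng.getrandbits(325) for _ in range(100)]
print(all(reduce_eq37(c) == reduce_generic(c) for c in samples))
```

In hardware each r_i is a small XOR tree evaluated in parallel; the loop over i here only serializes what the circuit does at once.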
3.1.3 Word-level finite field multiplier
Algorithm 7 82-bit word-level FF multiplier
Input: A(x), B(x) ∈ GF(2^m), with B(x) = B3(x)x^123 + B2(x)x^82 + B1(x)x^41 + B0(x).
Output: C(x) ∈ GF(2^m).
R(x) = 0; //Initialize
for i = 1 down to 0 do
  1. T1 = A(x)B_{2i}(x); T2 = A(x)B_{2i+1}(x);
  2. C(x) = T1 + x^41 · T2 + x^82 · R(x);
     R(x) = Reduction(C(x));
end for
return R(x).
The FF multiplier algorithm used in this thesis is from [10]. In our design,
an 82 × 163 word-level FF multiplier is used, where two 41 × 163 FF multipliers are
employed in the first level, and the two sub-products are summed in the second
level, as in Fig. 3.4. In order to support the data loading stage of the five-stage pipeline,
a 2-input multiplexer is added to select the input data registers F and S. The delay of
path 1 in the FF multiplier is $T_{Mux} + T_{And} + \lceil \log_2 41 \rceil T_{Xor}$. In path 2, the length
of the summed result is 245, so the reduction can be simplified to a delay
of $\lceil \log_2 5 \rceil T_{Xor}$ in this FF multiplier. As the summation and reduction units are
synthesized together, the delay of path 2 is $\lceil \log_2 (3 \times 5) \rceil T_{Xor}$.
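Algorithm 7 can be modelled in software to see why two iterations suffice. The sketch below is a behavioural illustration under assumptions: B is split into four 41-bit digits B0 to B3, the two 41 × 163 multipliers are modelled by a generic carry-less multiply, and the reduction is a generic long division instead of the simplified parallel reduction. All function names are hypothetical.

```python
import random

M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
D = 41                                     # digit width of each sub-multiplier

def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_f(c):
    while c.bit_length() > M:
        c ^= F << (c.bit_length() - 1 - M)
    return c

def mul_word_level(a, b):
    """Algorithm 7: two passes, each consuming two 41-bit digits of B."""
    mask = (1 << D) - 1
    digits = [(b >> (D * j)) & mask for j in range(4)]    # B0, B1, B2, B3
    r = 0
    for i in (1, 0):
        t1 = clmul(a, digits[2 * i])                      # 41 x 163 multiplier M0
        t2 = clmul(a, digits[2 * i + 1])                  # 41 x 163 multiplier M1
        c = t1 ^ (t2 << D) ^ (r << (2 * D))               # T1 + x^41 T2 + x^82 R
        assert c.bit_length() <= 245                      # summed result length
        r = reduce_f(c)
    return r

rng = random.Random(163)
a, b = rng.getrandbits(163), rng.getrandbits(163)
print(mul_word_level(a, b) == reduce_f(clmul(a, b)))
```

The internal assertion reflects the statement above that the summed result has a length of 245, which is what allows the reduction on path 2 to be simplified.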
[Figure 3.4: The architecture of finite field ALU. Inputs DA and DB are loaded through multiplexers into the registers F, S, T, R0 and R1; two 41 × 163 multipliers (M0, M1) feed the x^41 and x^82 shifted sum and the reduction unit; the outputs are Mout (AB), ASout (A + B and (A + B)^2) and SSout (A^4); paths 1 to 4 are marked.]
3.1.4 FF square and double square
As in Eq. 3.8, A(x) is a binary FF element in PB representation, and its FF square
can be computed by FF multiplication. However, because both operands are the same (i.e.
A × A), squaring can be simplified to inserting zeros between the bits of A, as shown in
Eq. 3.9, followed by an FF reduction.
\[ A(x) = a_{162}x^{162} + a_{161}x^{161} + \cdots + a_i x^i + \cdots + a_0 x^0 \tag{3.8} \]
\[ A^2(x) = a_{162}x^{324} + 0 + a_{161}x^{322} + \cdots + a_i x^{2i} + \cdots + a_2 x^4 + 0 + a_1 x^2 + 0 + a_0 x^0 \tag{3.9} \]
As there are many zeros in A^2(x), the FF reduction of the FF square is further
simplified by eliminating XOR operations with zeros. The FF square operation then
has a delay of $\lceil \log_2 5 \rceil T_{Xor}$. By combining A^2 with A + B as in Fig. 3.4, path 3 for
(A + B)^2 has a delay of $\lceil \log_2 (2 \times 5) \rceil T_{Xor} + T_{Mux}$.
In order to accelerate the FF inverse operation, we propose a new operation, A^4(x).
It is obtained by combining two A^2(x) operations, and its simplification relies on
a_i ⊕ 0 = a_i and a_i ⊕ a_i = 0. The FF operation C(x) = Reduction(A^4(x)) is
performed in one cycle with a delay of $\lceil \log_2 12 \rceil T_{Xor}$.
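The zero-insertion view of squaring in Eq. 3.9 can be checked against a full multiplication. The following is a minimal sketch with hypothetical function names; note that `fourth` simply chains two squarings, whereas the A^4 instruction of this work merges them into one combinational block with a single simplified reduction.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_f(c):
    while c.bit_length() > M:
        c ^= F << (c.bit_length() - 1 - M)
    return c

def spread(a):
    """Insert a zero between neighbouring bits: bit i moves to bit 2i (Eq. 3.9)."""
    out, i = 0, 0
    while a:
        if a & 1:
            out |= 1 << (2 * i)
        a >>= 1
        i += 1
    return out

def square(a):
    return reduce_f(spread(a))

def fourth(a):
    """Functional model of A^4: two squarings (the hardware does this in one cycle)."""
    return square(square(a))

a = (1 << 162) | 0x123456789ABCDEF
print(square(a) == reduce_f(clmul(a, a)))
print(fourth(a) == reduce_f(clmul(square(a), square(a))))
```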
3.1.5 FF inversion
Table 3.1: Itoh-Tsujii Algorithm for GF(2^163)
i   u_i   [β_{i1}(a)]^(2^{u_{i2}}) · β_{i2}(a)   β_i(a) = a^(2^{u_i} − 1)
0   1     -                                      β_0(a) = a^(2^1 − 1)
1   2     [β_0(a)]^(2^{u_0}) · β_0(a)            β_1(a) = a^(2^2 − 1)
2   3     [β_1(a)]^(2^{u_0}) · β_0(a)            β_2(a) = a^(2^3 − 1)
3   5     [β_2(a)]^(2^{u_1}) · β_1(a)            β_3(a) = a^(2^5 − 1)
4   10    [β_3(a)]^(2^{u_3}) · β_3(a)            β_4(a) = a^(2^10 − 1)
5   20    [β_4(a)]^(2^{u_4}) · β_4(a)            β_5(a) = a^(2^20 − 1)
6   40    [β_5(a)]^(2^{u_5}) · β_5(a)            β_6(a) = a^(2^40 − 1)
7   41    [β_6(a)]^(2^{u_0}) · β_0(a)            β_7(a) = a^(2^41 − 1)
8   81    [β_7(a)]^(2^{u_6}) · β_6(a)            β_8(a) = a^(2^81 − 1)
9   162   [β_8(a)]^(2^{u_8}) · β_8(a)            β_9(a) = a^(2^162 − 1)
Compared to other FF operations, FF inversion is the most time-consuming
operation. In this thesis, we adopt the Itoh-Tsujii algorithm [17] for FF inversion, which
we briefly describe in the following.
In $GF(2^{163})$, the nonzero elements form a cyclic group of order $2^{163} - 1$, so any nonzero
element $a$ satisfies $a^{2^{163}-1} = 1$. Therefore, the
inverse of $a$ can be obtained as $a^{-1} = a^{2^{163}-2}$. Here, we define $\beta_k(a) = a^{2^k - 1}$, $k \in \mathbb{N}$,
which has the following property:
\[ \beta_{k+j}(a) = a^{2^{k+j}-1}
   = \left(\frac{a^{2^k}}{a}\right)^{2^j} \frac{a^{2^j}}{a}
   = \left(a^{2^k-1}\right)^{2^j} a^{2^j-1}
   = \beta_k(a)^{2^j}\,\beta_j(a) \tag{3.10} \]
Now, we can compute $a^{2^{163}-2} = (a^{2^{162}-1})^2$ by the sequence of FF operations listed
in Table 3.1.
As $a^{2^s}$ is computed frequently, employing A^4 decreases the clock cycles
needed in FF inversion by nearly half. For example, $a^{2^{16}}$ can be
calculated with eight successive A^4 instructions in 8 cycles, while 16 cycles are needed
when using A^2.
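The addition chain of Table 3.1 can be turned into a short functional model. The sketch below is an illustration with hypothetical names: it applies Eq. 3.10 for each row of the table, counts the squarings, and checks that the result is indeed the inverse. It models arithmetic only, not the cycle-accurate behaviour of the cores.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def mul(a, b):
    """FF multiplication: carry-less product followed by generic reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    while r.bit_length() > M:
        r ^= F << (r.bit_length() - 1 - M)
    return r

# (k, j) pairs: beta_{k+j} = beta_k^(2^j) * beta_j, giving 2, 3, 5, 10, 20, 40, 41, 81, 162
CHAIN = [(1, 1), (2, 1), (3, 2), (5, 5), (10, 10), (20, 20), (40, 1), (41, 40), (81, 81)]

def inverse(a):
    """Itoh-Tsujii inversion; returns (a^-1, number of squarings used)."""
    beta = {1: a}                          # beta[k] = a^(2^k - 1)
    squarings = 0
    for k, j in CHAIN:
        t = beta[k]
        for _ in range(j):                 # raise to the power 2^j
            t = mul(t, t)
        squarings += j
        beta[k + j] = mul(t, beta[j])
    return mul(beta[162], beta[162]), squarings + 1

a = 0x3F0EBA16286A2D57EA0991168D4994637E8343E36
inv, n_sq = inverse(a)
print(mul(a, inv) == 1, "squarings:", n_sq, "multiplications:", len(CHAIN))
```

The chain uses 9 multiplications and 162 squarings, so the squarings dominate; this is why pairing them with an A^4 instruction roughly halves the inversion time.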
3.2 Architecture and implementation
The proposed architecture consists of a main controller and three FF cores, as shown
in Fig. 3.5. A five-stage pipeline, consisting of instruction fetching (IF), instruction
decoding (ID), data loading (DL), instruction executing (EX) and writing back (WB),
is employed in each FF core. The three FF cores are almost the same except for some
differences in the arrangement of register files and interconnection, as shown in Fig. 3.6.
As only a few variables are involved in the fixed program of Algorithm 6
executed on the three cores, the register files of each core in Fig. 3.6 are sufficient to store
the intermediate results. The instruction sets of the three FF cores are very similar except
for some minor differences, as shown in Table 3.3, Table 3.4, and Table 3.5.
[Figure 3.5: The structure of pseudo-multi-core ECC processor. ECC_TOP contains the main controller, CORE1 to CORE3 with ROM1 to ROM3, and the interconnection; control path, data path and instruction connections are marked; inputs are {x, y, b}, enable, rst and clk, outputs are {x0, y0} and done.]
3.2.1 Instruction Set Design of FF Cores
In order to reduce the complexity caused by the instruction set design, all the
instructions in each core have the same length, as in Fig. 3.7. As the double square
instruction SS needs only one source operand, only the source 2 field is used for it in this design.
The LOOP instruction is used to control the program counter in each core. It
[Figure 3.6: Interconnection and register files for FF cores]
Table 3.2: Instruction description
Operation   Clock cycles   Description
MUL         2              FF multiplication. The bit width is 163 x 163
SMUL        2              Special FF multiplication, used for swap purposes
SQA         1              FF square and addition (a combined operation)
ADD         1              FF addition
SS          1              Double square (finishes two square operations in one cycle)
LOOP        1              For iteration purposes
NOP         1              IDLE status for one cycle
Table 3.3: Instruction Set Design of FF Core 1
Operation          Destination        Source
3'b111  MUL        2'b11  DA1_S       8'b00000_111  SA1_Z
3'b110  SMUL       2'b10  DA1_X       8'b00000_110  SA3_Z
3'b101  SQA        2'b01  DA1_Z       8'b00000_101  SA1_X
3'b100  ADD        2'b00  Res.        8'b00000_100  SA3_X
3'b011  SS         -      -           8'b00000_011  Rx
3'b010  LOOP       -      -           8'b00000_010  Ry
3'b001  Res.       -      -           8'b00000_001  SA1_S
3'b000  NOP        -      -           8'b00000_000  SA2_S
-       -          -      -           8'b10000_xxx  A3_BP_OUT2
-       -          -      -           8'b01000_xxx  A1_BP_OUT2
-       -          -      -           8'b00100_xxx  A1_SS_OUT
-       -          -      -           8'b00010_xxx  A1_BP_OUT1
-       -          -      -           8'b00001_xxx  A2_BP_OUT1
[Figure 3.7: Instruction formats in three cores. General format: Operation | Destination | Source 1 | Source 2. LOOP format: LOOP | Res. | Parameter 1 | Parameter 2. SS format: SS | Destination | Res. | Source 2.]
Table 3.4: Instruction Set Design of FF Core 2
Operation          Destination        Source
3'b111  MUL        1'b0   Res.        8'b00000_111  SA1_Z
3'b110  SMUL       1'b1   DA2_S       8'b00000_110  SA3_Z
3'b101  SQA        -      -           8'b00000_101  SA1_X
3'b100  ADD        -      -           8'b00000_100  SA3_X
3'b011  SS         -      -           8'b00000_011  Rx
3'b010  LOOP       -      -           8'b00000_010  SA2_S
3'b001  Res.       -      -           8'b00000_001  Ry
3'b000  NOP        -      -           8'b00000_000  Res.
-       -          -      -           8'b10000_xxx  A3_BP_OUT2
-       -          -      -           8'b01000_xxx  A1_BP_OUT2
-       -          -      -           8'b00100_xxx  A2_SS_OUT
-       -          -      -           8'b00010_xxx  A2_BP_OUT2
-       -          -      -           8'b00001_xxx  A2_BP_OUT1
Table 3.5: Instruction Set Design of FF Core 3
Operation          Destination        Source
3'b111  MUL        2'b11  DA3_S       7'b0000_111  SA1_Z
3'b110  SMUL       2'b10  DA3_X       7'b0000_110  SA3_Z
3'b101  SQA        2'b01  DA3_Z       7'b0000_101  SA1_X
3'b100  ADD        2'b00  Res.        7'b0000_100  SA3_X
3'b011  SS         -      -           7'b0000_011  Rx
3'b010  LOOP       -      -           7'b0000_010  SA3_S
3'b001  Res.       -      -           7'b0000_001  Rb
3'b000  NOP        -      -           7'b0000_000  Ry
-       -          -      -           7'b1000_xxx  A3_BP_OUT2
-       -          -      -           7'b0100_xxx  A1_BP_OUT2
-       -          -      -           7'b0010_xxx  A3_SS_OUT
-       -          -      -           7'b0001_xxx  A3_BP_OUT1
[Figure 3.8: Examples of LOOP instruction. Listed in execution direction: (a) LOOP 9, 0; Ins 1; Ins 2; Ins 3. (b) LOOP 9, 1; Ins 1; Ins 2; Ins 3. (c) LOOP 9, 0; Ins 1; Ins 1; Ins 3.]
has two parameters: the first is the number of loops, and the second is
the offset address. The bit lengths of parameter 1 and parameter 2 in core 1 and
core 2 follow the bit lengths of source 1 and source 2 respectively. As the
maximum number of iterations in this design is 162, and source 1 in core 3
is only 7 bits long (maximum value 127), one bit is borrowed from the Res. field to
reach the maximum number of loops. Due to the instruction fetching and instruction
decoding stages of the 5-stage pipeline, the LOOP instruction has the limitation that it
cannot control the immediately following instruction. For example, in Fig. 3.8(a), the
instruction LOOP 9, 0 sets the number of loops (9) for the instruction Ins2, and it
can never include Ins1 in the LOOP although its offset address is 0. However,
sometimes the LOOP operation is needed on Ins1 because of tight data dependency
and limited register files; this can be accomplished by the implementation in Fig.
3.8(c), where Ins1 runs 10 times. In Fig. 3.8(b), the offset address is 1, and
the LOOP operation affects the instructions from Ins2 to Ins3.
[Figure 3.9: Swap logic and address decoding unit of DA in core 1. The SMUL, swap1 and swap2 signals act on the 8-bit address DA_Address[0:7] to produce the select signals DA_S[7:0] for the registers and interconnection. Address map:]
Data path    8-bit address
A1_Z         00000_111
A3_Z         00000_110
A1_X         00000_101
A3_X         00000_100
Rx           00000_011
Ry           00000_010
A1_G         00000_001
A2_G         00000_000
A3_ASout     1xxxx_xxx
A1_ASout     01xxx_xxx
A1_SSout     001xx_xxx
A1_Mout      0001x_xxx
A2_Mout      00001_xxx
3.2.2 Register files, interconnection and swap logic
As register files, interconnection and swap logic are tightly related, we analyze them
together in this section. Fig. 3.6 presents the architecture of the register files and
interconnection for DA. Above the dashed line in each core are the local registers,
which can only be written by the local core. Ai_Z and Ai_X are special registers that store
Zi and Xi in Algorithm 6. Ai_G is the only general register. Under
the dashed line in each core are the interconnections and the by-pass results from the local
FF ALU, which are determined by the ILP data dependency in Algorithm 6.
Rx, Ry and Rb are interconnections from the main controller for accessing x, y and b.
The address of the data path is divided into two levels, as shown in Fig. 3.9. The
first level has the higher priority and uses the five most-significant bits to distinguish five
non-register data paths. The second level uses the remaining three least-significant bits
to decode register data paths. If the second level is valid and the third bit from the
least-significant bit (LSB) of the address is "1", the address represents a special register.
With the arrangement in Fig. 3.9, a swap between special registers can be
easily performed by changing the LSB of the address. In our design, A1_X and A1_Z
are chosen as the default source for accessing X1 and Z1, while X2 and Z2 are stored
in them at the end of each loop in core 1; core 3 is similar. Therefore,
a swap operation exists by default. The swap logic is composed of two swap
signals, swap1 and swap2, which are generated by the main controller. swap1
is used to swap special registers by changing the LSB of the address, and swap2
is used to swap the by-pass data A1_ASout and A3_ASout, which are the results Xi
at the end of each loop in Algorithm 6. The first FFM in the loop is defined as
swap multiplication (SMUL) to differentiate it from the common FFM, and with
swap2 the swap between A1_ASout and A3_ASout can be done. One
cycle is then saved per loop, because the data for SMUL is loaded directly from the ALU
by-pass output ASout.
The address unit of DB in core 1 is nearly the same as that of DA, except that
swap2 and SMUL are not needed. The address units in core 2 and core 3 can be obtained similarly.
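The two-level decoding and the LSB-based swap can be sketched behaviourally. The following Python model is an assumption-laden illustration, not the RTL of this design: the address map is taken from Fig. 3.9 for DA in core 1, while the function `decode_da`, its arguments, and the exact way swap1 and swap2 are applied are hypothetical.

```python
# Second level: 3 LSBs select a register (address map of Fig. 3.9, core 1 DA).
REGISTERS = {
    0b111: "A1_Z", 0b110: "A3_Z", 0b101: "A1_X", 0b100: "A3_X",
    0b011: "Rx",   0b010: "Ry",   0b001: "A1_G", 0b000: "A2_G",
}
# First level: 5 MSBs, highest bit has priority (1xxxx, 01xxx, 001xx, 0001x, 00001).
BYPASS = ["A3_ASout", "A1_ASout", "A1_SSout", "A1_Mout", "A2_Mout"]


def decode_da(addr, swap1=False, swap2=False):
    """Return the name of the data path selected by an 8-bit DA address."""
    for pos, name in zip(range(7, 2, -1), BYPASS):
        if (addr >> pos) & 1:                     # first level wins if any MSB is set
            if swap2 and name in ("A1_ASout", "A3_ASout"):
                name = "A3_ASout" if name == "A1_ASout" else "A1_ASout"
            return name
    low = addr & 0b111
    if swap1 and (low & 0b100):                   # special registers have bit 2 set
        low ^= 1                                  # flipping the LSB swaps A1_* and A3_*
    return REGISTERS[low]


print(decode_da(0b00000_111), decode_da(0b00000_111, swap1=True))
print(decode_da(0b01000_000), decode_da(0b01000_000, swap2=True))
```

With this address arrangement a swap costs no data movement: the same instruction word simply resolves to the other register of the pair.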
3.2.3 Main controller
The main controller has two main tasks: it provides the three data paths Rx, Ry and Rb
that deliver the data x, y, b, 0 and 1 to the three FF cores, and it generates the two swap signals swap1
and swap2. As the swap operation was described in the previous section, here
we only discuss the first task. In order to decrease the complexity of the FF cores and
interconnection, our instruction set has no data move operation, as shown in
Algorithm 6. In the initialization stage, we move the data x to A1_X with the help
of the main controller by setting Ry to the constant 0 and performing an FF addition
between x and 0. The data x is thereby moved into A1_X. Similarly, x^2 is moved into
A3_Z, and this is also the way the FF square operation is performed in this design. As
the data y is only needed in the final coordinate conversion stage, Ry is always
set to 0 until y is needed in the calculation. The main controller is implemented as
a finite state machine.
3.2.4 Critical path analysis
The address unit and interconnection described above is carefully designed by con-
sidering the hardware level aspect. In Fig. 3.4, the critical path, path 1 in FF
multiplier is TMux+TAnd+ dlog2 41eTXor, and the three by-pass delays path2, path3
and path4 from the FF ALU in Fig. 3.4 are dlog2 15eTXor, dlog2 10eTXor + TMux
and dlog2 12eTXor respectively. As the data path DA is similar with DB in core 1,
core 2 and core 3, we only need to consider the critical path in DA. As the three
by-pass delays in ALU nally go to Ai DA, then, by adding the previous delay in
49
Table 3.6: Long paths and comparison
No. Logic delaya Description
1 TMux+TAnd+dlog2 41eTXor path1b
2 dlog2 10eTXor + 4TMux path3b + fA2 ASout to A2 DAg
3 dlog2 12eTXor + 3TMux path4b + fA1 SSout to A1 DAg
4 dlog2 15eTXor + 3TMux path2b +fA1 Mout to A1 DAg
5 6TMux fregisters to A1 DAg
6 2TAnd + 9TXor critical path in [11]
a In TSMC18 standard cell library [27], TAnd is 0:101ns, TXor is 0:183ns and TMux is
0:153ns.
b Refer to Fig. 3.4 (refer to p. 39)
MuxMuxMuxMux
MuxMux
Mux
S2
S0
S1
Figure 3.10: Architecture of 8-to-1 multiplexer in Fig. 3.6 (refer p. 43)
ALU to the delay between the by-pass output and Ai_DA, we get the total
delay of the three paths. For instance, as A1_ASout goes through two 2-to-1
multiplexers to arrive at A1_DA, the total delay is
$\lceil \log_2 10 \rceil T_{Xor} + 3T_{Mux}$. The other cases are obtained
similarly and listed in Table 3.6. The 5th long path comes from the second part
of the address decoder unit, which is an 8-to-1 multiplexer and can be implemented
by seven 2-to-1 multiplexers as shown in Fig. 3.10. We also compare these long
paths with the critical path of [11] in Table 3.6. The longest logic delay path is
determined by the ratio of $T_{And}$, $T_{Xor}$ and $T_{Mux}$. Based on the delay
parameters provided in TSMC18
technology [27], the longest logic delay path in the proposed architecture
is path1 in the FF multiplier, and it is approximately $3T_{Xor}$ shorter than the
critical path in [11].
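The comparison in Table 3.6 can be checked numerically. The following is a minimal sketch, assuming each XOR tree with n inputs is balanced and therefore contributes $\lceil \log_2 n \rceil$ XOR levels; the variable and function names are illustrative only and do not come from the design.

```python
import math

# Gate delays in ns, as quoted for the TSMC18 library in Table 3.6 (footnote a).
T_AND, T_XOR, T_MUX = 0.101, 0.183, 0.153

def xor_tree(n):
    """Delay of a balanced XOR tree with n inputs (assumed structure)."""
    return math.ceil(math.log2(n)) * T_XOR

# Logic delay of each long path, keyed by its row number in Table 3.6.
paths = {
    1: T_MUX + T_AND + xor_tree(41),  # path1 in the FF multiplier
    2: xor_tree(10) + 4 * T_MUX,      # path3 + {A2_ASout to A2_DA}
    3: xor_tree(12) + 3 * T_MUX,      # path4 + {A1_SSout to A1_DA}
    4: xor_tree(15) + 3 * T_MUX,      # path2 + {A1_Mout to A1_DA}
    5: 6 * T_MUX,                     # registers to A1_DA
    6: 2 * T_AND + 9 * T_XOR,         # critical path in [11]
}

# Rows 1 to 5 belong to the proposed architecture; row 6 is the reference design.
longest = max(range(1, 6), key=lambda row: paths[row])
gap_in_xor = (paths[6] - paths[longest]) / T_XOR

for row, delay in paths.items():
    print(f"row {row}: {delay:.3f} ns")
print(f"longest path in the proposed design: row {longest}")
print(f"margin over [11]: {gap_in_xor:.2f} T_Xor")
```

With these delay parameters, row 1 (path1) comes out only slightly longer than row 2, so a different standard cell library could change which path is critical, as the text notes.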
3.2.5 Pipeline and timing
A five-stage pipeline (IF, ID, DL, EX and WB) with ILP is employed to increase the
performance. In the IF stage, an instruction is fetched from the ROM of the
corresponding core and stored in the instruction register. During the ID stage, the
instruction is decoded to generate control signals for the FF ALU, and the swap
operation is also accomplished in this stage by using the address decoding unit in
Fig. 3.9. In the DL stage, the data needed for the calculation is loaded into the
input registers (F, S, T, R0 and R1) of the FF ALU. In the EX stage, the instruction
is executed, and the corresponding by-pass results Mout, ASout and SSout are
generated. In the final stage, WB, the result is written into the register.
In the following, we present the timing of the loop process of Algorithm 6 in
Fig. 3.11 to show how the μ-code is optimized and executed based on the given
architecture and pipeline. In this design, the FF multiplication AB costs two clock
cycles, and the other FF operations need only one cycle. Three rules are followed
when generating the μ-code:
1. The special register Ai_X is only used to store Xi, and Ai_Z is only used to
store Zi, so that other cores can access them. Ai_G is used to store other
intermediate data.
2. If the next FF operation needs the result from the present FF operation, it can
load the by-pass result at the end of the EX stage of the present FF operation
Figure 3.11: Timing in the loop of Algorithm 6. Pipeline diagram for Core1, Core2
and Core3 over cycles 1 to 9 (first loop and next loop), showing the IF, ID, DL,
EX (AB, A+B, (A+B)^2, Z1^4) and WB stages and the writes to A1_G, A1_Z, A1_X,
A2_G, A3_G, A3_Z and A3_X.
to save one clock cycle. For example, the operation A + B in core 1 can load
the by-pass results AB from the FF ALUs in core 1 and core 2 at the end of the
EX stage.
3. If the result from the present FF operation is needed in the next FF operation,
and is not needed afterwards, it does not need to be written into the register file.
Thus, we can decrease the number of registers to the number shown in
Fig. 3.6. For instance, as the second multiplication result AB in core 1 is only
needed in the next A + B operation and not afterwards, the WB stage of this
instruction can be omitted.
As the loop instruction is used in our instruction set to control the program
counter directly, the number of clock cycles needed for one loop iteration in
Algorithm 6 is 8.
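To make the arithmetic behind one loop iteration concrete, the following is a minimal software sketch. It assumes that the loop of Algorithm 6 follows the López-Dahab Montgomery ladder step [14] and that the field uses the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1; neither is restated in this section. The function names `mul`, `add`, `sqa` and `ss` mirror the mnemonics MUL, ADD, SQA and SS of Appendix A but are otherwise illustrative, and the sketch models only the arithmetic, not the three-core scheduling or the pipeline.

```python
M = 163
# Assumed reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def mul(a, b):
    """AB: shift-and-add multiplication with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def add(a, b):
    """A+B: addition in GF(2^m) is a bitwise XOR."""
    return a ^ b

def sqa(a, b):
    """(A+B)^2; with b = 0 this is a plain squaring."""
    s = a ^ b
    return mul(s, s)

def ss(a):
    """A^4: two successive squarings."""
    s = mul(a, a)
    return mul(s, s)

def madd(xa, za, xb, zb, x):
    """x-only addition of two points whose difference has affine x-coordinate x."""
    t1, t2 = mul(xa, zb), mul(xb, za)
    z = sqa(t1, t2)
    return add(mul(x, z), mul(t1, t2)), z

def mdouble(xa, za, b):
    """x-only doubling: X = X^4 + b*Z^4, Z = (X*Z)^2."""
    return add(ss(xa), mul(b, ss(za))), sqa(mul(xa, za), 0)

def inv(a):
    """Inverse as a^(2^M - 2) by square-and-multiply (for checking only)."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = mul(r, a)
        a = mul(a, a)
        e >>= 1
    return r

def ladder_x(k, x, b):
    """Affine x-coordinate of kP from x = x(P), for k >= 2."""
    x1, z1 = x, 1                        # P
    x2, z2 = add(ss(x), b), sqa(x, 0)    # 2P
    for bit in bin(k)[3:]:               # bits of k below the leading 1
        if bit == "1":
            (x1, z1), (x2, z2) = madd(x1, z1, x2, z2, x), mdouble(x2, z2, b)
        else:
            (x2, z2), (x1, z1) = madd(x2, z2, x1, z1, x), mdouble(x1, z1, b)
    return mul(x1, inv(z1))
```

Under these assumptions, each iteration performs six AB operations (four in `madd`, two in `mdouble`), which is consistent with the two MUL instructions per core in the loop bodies of Appendix A.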
Similarly, the μ-code of the initialization stage and the projective to affine
coordinate conversion stage in Algorithm 6 can be obtained, subject to the
following additional requirements. During step 2 of the projective to affine
coordinate conversion stage, R3 ← R3 + y in core 3 needs the main controller to
change the value in Ry from 0 to y. Besides, as the result R3 is needed in core 2,
R3 cannot be stored in a general register, as there is no interconnection from
A3_G to core 2, and it must be stored in a special register. As Xi and Zi are not
needed after step 2, the special registers can be put to general use afterwards,
which is done by the main controller disabling swap1 (swap1 ← 0). Therefore, R3
can be stored in a special register in core 3. For the same reason, in step 4 of
core 1, V1 is stored in a special register to meet the data dependency of core 2.
Finally, the PM results (x0, y0) are obtained in A1_X and A2_G in Fig. 3.6
respectively when the calculation ends.
Chapter 4
Experiment Results
The μ-code in each ROM is presented in Appendix A. When the FF cores
are not enabled, they keep fetching the NOP at address 0 in the ROM, and they
remain IDLE. When the input enable signal is set, the three FF cores start fetching
instructions from their ROMs at the same time. The sizes of ROM1, ROM2 and ROM3
are 128 × 21 bits, 128 × 20 bits and 128 × 19 bits respectively. All the ROMs are
implemented using combinational logic blocks. As the three ROMs hold valid μ-code
of different lengths, the three cores stop working at different times, and FF core 2
finishes after FF core 1 and FF core 3. After each FF core finishes its work, it
has to remain IDLE to ensure that the result is not altered. Therefore, the μ-code
of ROM1 after address 69 is all NOP, the μ-code of ROM2 after address 76 is all
NOP, and the μ-code of ROM3 after address 66 is all NOP.
The total number of clock cycles is composed of three parts. First, 5 clock cycles
are needed for the affine to projective coordinate initialization. Then, 8 × 162
clock cycles are consumed in the PM loop process. The final coordinate conversion
stage consumes 123 cycles, of which the FF inversion costs 111 cycles. Therefore,
the total number of clock cycles required for one PM is 5 + 8 × 162 + 123 + 4 =
1428, in which the last 4 cycles result from the five-stage pipeline.
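The cycle budget above, and the PM latency it implies, can be reproduced with a few lines. This is a minimal sketch: the constant names are illustrative, and the clock frequencies passed to `pm_time_us` are the 185 MHz (FPGA) and 263 MHz (TSMC18) figures reported in this chapter.

```python
INIT_CYCLES = 5          # affine to projective coordinate initialization
LOOP_CYCLES = 8          # clock cycles per PM loop iteration
LOOP_ITERATIONS = 162    # iterations of the PM loop
CONVERSION_CYCLES = 123  # final coordinate conversion (111 of them for FF inversion)
PIPELINE_CYCLES = 4      # extra cycles caused by the five-stage pipeline

total_cycles = (INIT_CYCLES + LOOP_CYCLES * LOOP_ITERATIONS
                + CONVERSION_CYCLES + PIPELINE_CYCLES)

def pm_time_us(cycles, clock_mhz):
    """Time in microseconds for `cycles` clock cycles at `clock_mhz` MHz."""
    return cycles / clock_mhz

print("total cycles:", total_cycles)
print("FPGA   (185 MHz):", round(pm_time_us(total_cycles, 185), 1), "us")
print("TSMC18 (263 MHz):", round(pm_time_us(total_cycles, 263), 1), "us")
```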
Table 4.1: Area information of different blocks

Block            ROM1    ROM2    ROM3    AB      (A+B)^2          A^4
Occupied slices  54      50      55      5,037   232              181

Block            FF ALU  core 1  core 2  core 3  Main controller  Total
Occupied slices  5,437   6,993   6,830   6,994   616              20,847
The proposed architecture is coded in Verilog HDL, and the μ-code in each
ROM is translated to machine code by a Perl script. We first verified each
submodule of our design in Modelsim using directed test cases, and then
verified the top level of our design in the same way. All these directed
test cases are carefully chosen to cover both the corner cases and the ordinary
cases. The whole system is simulated in Modelsim. Finally, we implement it on both
a Xilinx XC4VLX80 FPGA device and TSMC18 technology. We use Xilinx ISE 11.1
for synthesis, place and route, and the highest frequency it can reach is 185
MHz with 20,807 slices. In order to compare the area among different blocks, we
synthesize them separately, and the area of each block is obtained from the
synthesis report of ISE, except the total area, which is obtained after place and
route. As we can see in Table 4.1, the total area is not equal to the summed area
of its components, for two reasons: some global optimizations are applied when the
whole design is synthesized together, and some slices are used for route-through.
In this table, we can see that the three FF multipliers occupy nearly three
quarters of the total area. When synthesized by Synopsys Design Compiler in TSMC18
CMOS technology [27], the design reaches a highest frequency of 263 MHz and
occupies 217,904 gates. Both implementation results show that the critical path
lies in path1 of the FF multiplier.
Table 4.2: Performance comparison

Work               Field      Technology    Area           Clk (MHz)  #clk cycles  Time for kP  Remarks
Kazuo [12] (2007)  GF(2^163)  0.13 µm CMOS  154K gates     555.6      -            12 µs        TNAF method
Kim [11] (2008)    GF(2^163)  XC4VLX80      24,363 slices  143        1446         10 µs        Three 55-bit GNB mul.
Bijan [10] (2008)  GF(2^163)  XC2V2000      3,416 slices   100        4050         41 µs        One 41-bit Karatsuba mul.
Kimmo [16] (2008)  GF(2^163)  Stratix II    -              -          -            49 µs        -
This work          GF(2^163)  XC4VLX80      20,807 slices  185        1428         7.7 µs       Three FF cores
This work          GF(2^163)  TSMC18        217,904 gates  263        1428         5.4 µs       -
We compare our work with several recent works in Table 4.2. To the best of our
knowledge, our work is the fastest implementation over GF(2^163) reported in the
literature. When implemented on the Xilinx XC4VLX80 FPGA, our work takes 77% of
the total time of [11] while its area is only 85.4% of that design. The better
performance of our work compared with [11] is mainly attributed to our higher
frequency, which results from the short critical path in the FF multiplier.
Besides, the total number of clock cycles of our work (1428) is less than that
in [11] (1446). Our work uses the same number of FF multipliers (3) as [11]. As
the most area-consuming block is the FF multiplier, and the complexity of the FF
multiplier using PB representation is smaller than that of its counterpart using
GNB representation, the area of our work is slightly smaller than that of [11].
As our design does not fit in the FPGA device used in [10], it is hard to
accurately compare our result using the XC4VLX80 with the result in [10]
using XC2V2000. Therefore, we only briefly compare them. In [10], only one 42-bit
FF multiplier is used, which is why their area is much smaller (6 times) than ours.
In turn, the performance gain of our work (5 times faster) mainly results from the
highly parallel architecture using 3 FF multipliers.
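The ratios quoted in this comparison follow directly from the FPGA rows of Table 4.2. The sketch below simply recomputes them; the dictionary layout and names are illustrative.

```python
# Time for kP (microseconds) and area (slices), taken from Table 4.2.
this_work = {"time_us": 7.7, "slices": 20807}   # XC4VLX80
kim_11 = {"time_us": 10.0, "slices": 24363}     # [11], XC4VLX80
ansari_10 = {"time_us": 41.0, "slices": 3416}   # [10], XC2V2000

time_vs_11 = this_work["time_us"] / kim_11["time_us"]
area_vs_11 = this_work["slices"] / kim_11["slices"]
area_vs_10 = this_work["slices"] / ansari_10["slices"]
speedup_vs_10 = ansari_10["time_us"] / this_work["time_us"]

print(f"time relative to [11]: {time_vs_11:.0%}")
print(f"area relative to [11]: {area_vs_11:.1%}")
print(f"area relative to [10]: {area_vs_10:.1f}x")
print(f"speed-up over [10]:    {speedup_vs_10:.1f}x")
```

Note that the comparison with [10] spans two different FPGA families, so the slice counts are only roughly comparable, as stated above.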
As many papers [12] only use the synthesized result from Synopsys Design Compiler
for comparison, we also compare our work in the same way. The TSMC18 ASIC
result of our work is faster than that in [12], which uses 0.13 µm CMOS technology.
Therefore, our work can be faster than [12] when using the same technology.
From the above comparison, we can see that our performance gain mainly results from
three factors: the high frequency enabled by the short critical path of the whole
system, the highly parallel architecture using three FF multipliers, and the small
number of clock cycles achieved by using ILP and the proposed instruction set,
especially A^4.
Chapter 5
Conclusion and Future Work
In this thesis, we proposed an FF arithmetic instruction set AB, A + B, (A + B)^2
and A^4 for a parallelized algorithm for ECC PM, where (A + B)^2 and A^4
are proposed to decrease the clock cycles needed in the loop of the algorithm and
in Itoh-Tsujii's finite field inversion respectively, while not affecting the system
critical path. Then, the register files and interconnection of the three FF cores
are carefully designed to minimize the critical path and support the data
dependency in the proposed algorithm. Finally, a pseudo-multi-core architecture
with a five-stage pipeline (IF, ID, DL, EX and WB) in each core is obtained to
perform the ECC PM.
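To illustrate why an A^4 instruction helps the inversion, the following is a minimal sketch of Itoh-Tsujii inversion [17] in GF(2^163). It assumes the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 and the addition chain 1, 2, 4, 5, 10, 20, 40, 80, 81, 162; the chain and all names are illustrative and are not claimed to match the schedule in the μ-code of Appendix A.

```python
M = 163
# Assumed reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def mul(a, b):
    """Multiplication in GF(2^163): shift-and-add with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def sqr_n(a, n):
    """Square a repeatedly n times, i.e. compute a^(2^n)."""
    for _ in range(n):
        a = mul(a, a)
    return a

# Each pair (k, j) extends the chain from beta[k] and beta[j] to beta[k + j].
CHAIN = [(1, 1), (2, 2), (4, 1), (5, 5), (10, 10),
         (20, 20), (40, 40), (80, 1), (81, 81)]

def itoh_tsujii_inverse(a):
    """Return a^(-1) = (a^(2^162 - 1))^2 for nonzero a."""
    beta = {1: a}  # beta[k] = a^(2^k - 1)
    for k, j in CHAIN:
        # beta[k + j] = beta[k]^(2^j) * beta[j]
        beta[k + j] = mul(sqr_n(beta[k], j), beta[j])
    return sqr_n(beta[162], 1)

# Operation counts for this chain: one multiplication per chain step,
# and j squarings per step plus the final squaring.
multiplications = len(CHAIN)
squarings = sum(j for _, j in CHAIN) + 1
print(multiplications, "multiplications,", squarings, "squarings")
```

With this chain the squarings dominate the operation count, and they occur in long consecutive runs, so an instruction that performs two squarings at once roughly halves the cycles spent on them.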
The implementation of the proposed architecture can finish one ECC PM in 1428
cycles, and is 1.3 times faster than the current fastest implementation over
GF(2^163) reported in the literature while consuming only 85.4% of its area on the
same FPGA device. Therefore, the proposed architecture and algorithm are well
suited to high performance applications.
As the elliptic curve arithmetic algorithm is largely parallelized based on the data
dependency, and three FF cores are employed to support this data dependency in the
proposed architecture, the circuit area is still very large. Some potential
approaches that could achieve a better trade-off between performance and area can
be analyzed
to meet different applications, as follows:
1. If one FF core is used, we need to re-analyze the data dependency based on one
FF core. The FFM needs to be re-designed to balance the critical path against the
total clock cycles. The elliptic curve arithmetic algorithm needs to be
modified so that it only requires one FFI, as there is only one core available.
Otherwise, several FFI operations need to be calculated serially, and hence
consume many cycles.
2. If we use two FF cores, the data dependency also needs to be re-analyzed, and
the FFM needs to be re-designed to balance the critical path against the
total clock cycles. The elliptic curve arithmetic algorithm needs to be modified
to include either one FFI or two FFIs.
3. Also, some FF arithmetic algorithms in [19] can be employed to decrease clock
cycles, especially algorithms for FFI such as the extended Euclidean algorithm.
4. As we only consider the architecture for GF(2^163), similar analysis can be
applied to other curves recommended by [6], especially Koblitz curves, as the value
of b in Eq. 2.1 equals 1 for Koblitz curves. Therefore, some special methods
can be employed to simplify the PM calculation based on this feature.
References
[1] Joan Daemen, Vincent Rijmen, "The Design of Rijndael: AES - The Advanced
Encryption Standard". Springer, 2002.
[2] Rivest RL, Shamir A, Adleman L (1978) "A method for obtaining digital
signatures and public-key cryptosystems". Commun ACM pp. 120-126.
[3] The Elliptic Curve Cryptosystem for smart cards, A Certicom White Paper,
Published: May 1998.
[4] V.S. Miller, "Use of elliptic curves in cryptography", CRYPTO85: Proceedings
of the Advances in Cryptology, Springer-Verlag, pp. 417-426, 1986.
[5] N. Koblitz, "Elliptic curve cryptosystems", Mathematics of Computation, vol.
48, no. 177, pp. 203-209, 1987.
[6] NIST, "Recommended elliptic curves for federal government use", May 1999.
[7] IEEE 1363, Standard Specifications for Public-Key Cryptography, 2000.
[8] D. Hankerson, J. Hernandez, A. Menezes, "Software implementation of elliptic
curve cryptography over binary fields", in: Proceedings of the CHES 2000,
Lecture Notes in Computer Science, vol. 1965, 2000, pp. 1-24.
[9] J. Huang, H. Li, and P. Sweany, "An FPGA Implementation of Elliptic Curve
Cryptography for Future Secure Web Transaction", International Conference
on Parallel and Distributed Computing Systems, pp. 296-301, Sept. 2007.
[10] B. Ansari and M.A. Hasan, "High-Performance Architecture of Elliptic Curve
Scalar Multiplication", IEEE Trans. on Computers, vol. 57, no. 11, pp. 1443-1452,
Nov. 2008.
[11] C.H. Kim, S. Kwon and C.P. Hong, "FPGA implementation of high performance
elliptic curve cryptographic processor over GF(2^163)", Journal of Systems
Architecture, vol. 54, no. 10, pp. 893-900, Apr. 2008.
[12] K. Sakiyama, L. Batina and B. Preneel, "High-performance Public-key
Cryptoprocessor for Wireless Mobile Applications", Mobile Networks and Applications,
vol. 12, no. 4, pp. 245-258, Oct. 2007.
[13] Kimmo J., Jorma S., "Fast point multiplication on Koblitz curves:
Parallelization method and implementations", Journal of Microprocessors and
Microsystems, Elsevier, 2009.
[14] J. Lopez and R. Dahab, "Fast Multiplication on Elliptic Curves over GF(2^m)
without Precomputation", Proc. First Int'l Workshop Cryptographic Hardware
and Embedded Systems, C.K. Koc and C. Paar, eds., pp. 316-327, 1999.
[15] Louis Dupont, Sebastien Roy and Jean-Yves Chouinard, "A FPGA Implementation
of an Elliptic Curve Cryptosystem", IEEE International Symposium on
Circuits and Systems, 2006.
[16] K. Jarvinen and J. Skytta, "On Parallelization of High-Speed Processors for
Elliptic Curve Cryptography", IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 16, no. 9, pp. 1162-1175, Sept. 2008.
[17] T. Itoh and S. Tsujii, "A Fast Algorithm for Computing Multiplicative Inverses
in GF(2^m) Using Normal Bases", Information and Computation, vol. 78, no. 3,
pp. 171-177, 1988.
[18] P.L. Montgomery, "Speeding the Pollard and Elliptic Curve Methods of
Factorization", Math of Computation, vol. 48, pp. 243-264, 1987.
[19] Hankerson D, Menezes A, Vanstone S, Guide to elliptic curves cryptography.
Springer, 2004.
[20] Henk C.A. van Tilborg, Fundamentals of Cryptology, Eindhoven University of
Technology, Kluwer Academic Publishers, 2000.
[21] Roberto M. Avanzi, Henri Cohen, Christophe Doche, Gerhard Frey, Tanja
Lange, Kim Nguyen, Frederik Vercauteren, Handbook of Elliptic and Hyperelliptic
Curve Cryptography, Published by Chapman & Hall/CRC, 2006.
[22] J. Lopez and R. Dahab, "Algorithms for Elliptic Curve Arithmetic in GF(2^n)",
SAC'98, LNCS Springer Verlag, 1998.
[23] D. Chudnovsky and G. Chudnovsky, "Sequences of numbers generated by
addition in formal groups and new primality and factoring tests", Advances in
Applied Mathematics, 7 (1987), 385-434.
[24] Okeya, K., Takagi, T., Vuillaume, C.: "Efficient representations on Koblitz
curves with resistance to side channel attacks". In: Boyd, C., Gonzalez Nieto,
J.M. (eds.) ACISP 2005. LNCS, vol. 3574, pp. 218-229. Springer, Heidelberg,
2005.
[25] A. Menezes, Elliptic curve public key cryptosystems, Kluwer Academic
Publishers, 1993.
[26] M. Koschuch, J. Lechner, A. Weitzer, J. Großschädl, A. Szekely, S. Tillich, and
J. Wolkerstorfer, "Hardware/Software Co-Design of Elliptic Curve Cryptography
on an 8051 Microcontroller", Cryptographic Hardware and Embedded Systems,
vol. 4249, pp. 430-444. Springer Verlag, 2006.
[27] TSMC 0.18 µm Process 1.8-Volt SAGE-X(TM) Standard Cell Library Databook,
2001.
Appendix A
μ-code on FF cores
A.1 μ-code in ROM1
//Stay idle when the enable signal for point multiplication is not set
0 NOP
//Begin initialization
1 ADD DA1_X, Rx, Ry
2 NOP
3 ADD DA1_Z, Rx, Ry
// Start the Loop, and it begins from address 6 to 13
// The number of Loop is 161
4 LOOP RES, 161, 7
5 MUL DA1_S, SA1_X, SA3_Z
6 NOP
7 NOP
8 SQA DA1_Z, A1_BP_OUT1, A2_BP_OUT1
9 MUL RES, SA1_S, SA2_S
10 NOP
11 NOP
12 ADD DA1_X, A1_BP_OUT1, A2_BP_OUT1
13 SMUL DA1_S, A1_BP_OUT2, SA3_Z
// The last Loop is from address 13 to 20
14 NOP
15 NOP
16 SQA DA1_Z, A1_BP_OUT1, A2_BP_OUT1
17 MUL RES, SA1_S, SA2_S
18 NOP
19 NOP
20 ADD DA1_X, A1_BP_OUT1, A2_BP_OUT1
// FF inversion calculation is from address 21 to 64
21 SQA RES, SA1_Z, Ry
22 MUL DA1_S, A1_BP_OUT2, SA1_Z
23 NOP
24 NOP
25 SQA RES, A1_BP_OUT1, Ry
26 MUL RES, A1_BP_OUT2, SA1_Z
27 NOP
28 NOP
29 SS RES, XXXX, A1_BP_OUT1
30 MUL DA1_S, A1_SS_OUT, SA1_S
31 NOP
32 NOP
33 SS RES, XXXX, A1_BP_OUT1
34 SS RES, XXXX, A1_SS_OUT
35 SQA RES, Ry, A1_SS_OUT
36 MUL DA1_S, A1_BP_OUT2, SA1_S
37 NOP
38 LOOP RES, 4, 0
39 SS RES, XXXX, A1_BP_OUT1
40 SS RES, XXXX, A1_SS_OUT
41 MUL DA1_S, A1_SS_OUT, SA1_S
42 NOP
43 LOOP RES, 9, 0
44 SS RES, XXXX, A1_BP_OUT1
45 SS RES, XXXX, A1_SS_OUT
46 MUL DA1_S, A1_SS_OUT, SA1_S
47 NOP
48 NOP
49 SQA RES, A1_BP_OUT1, Ry
50 MUL RES, A1_BP_OUT2, SA1_Z
51 NOP
52 LOOP RES, 19, 0
53 SS RES, XXXX, A1_BP_OUT1
54 SS RES, XXXX, A1_SS_OUT
55 MUL DA1_S, A1_SS_OUT, SA1_S
56 NOP
57 LOOP RES, 39, 0
58 SS RES, XXXX, A1_BP_OUT1
59 SS RES, XXXX, A1_SS_OUT
60 SQA RES, A1_SS_OUT, Ry
61 MUL RES, A1_BP_OUT2, SA1_S
62 NOP
63 NOP
64 SQA RES, A1_BP_OUT1, Ry
// End of FF inversion
65 MUL DA1_X, A1_BP_OUT2, SA1_X
66 NOP
67 NOP
68 ADD RES, A1_BP_OUT1, Rx
69 MUL DA1_Z, A1_BP_OUT2, SA3_X
A.2 μ-code in ROM2
//Stay idle when the enable signal for point multiplication is not set
0 NOP
//Begin initialization
1 NOP
2 NOP
3 NOP
// Start the Loop, and it begins from address 6 to 13
// The number of Loop is 161
4 LOOP RES, 161, 7
5 MUL DA2_S, SA3_X, SA1_Z
6 NOP
7 NOP
8 NOP
9 MUL RES, Rx, A1_BP_OUT2
10 NOP
11 NOP
12 NOP
13 SMUL DA2_S, A3_BP_OUT2,SA1_Z
// The last Loop is from address 13 to 20
14 NOP
15 NOP
16 NOP
17 MUL RES, Rx, A1_BP_OUT2
18 NOP
19 NOP
20 NOP
// FF inversion calculation is from address 21 to 64
21 SQA RES, SA3_Z, Ry
22 MUL DA2_S, A2_BP_OUT2, SA3_Z
23 NOP
24 NOP
25 SQA RES, A2_BP_OUT1, Ry
26 MUL RES, A2_BP_OUT2, SA3_Z
27 NOP
28 NOP
29 SS RES, XXXX, A2_BP_OUT1
30 MUL DA2_S, A2_SS_OUT, SA2_S
31 NOP
32 NOP
33 SS RES, XXXX, A2_BP_OUT1
34 SS RES, XXXX, A2_SS_OUT
35 SQA RES, Ry, A2_SS_OUT
36 MUL DA2_S, A2_BP_OUT2, SA2_S
37 NOP
38 LOOP RES, 4, 0
39 SS RES, XXXX, A2_BP_OUT1
40 SS RES, XXXX, A2_SS_OUT
41 MUL DA2_S, A2_SS_OUT, SA2_S
42 NOP
43 LOOP RES, 9, 0
44 SS RES, XXXX, A2_BP_OUT1
45 SS RES, XXXX, A2_SS_OUT
46 MUL DA2_S, A2_SS_OUT, SA2_S
47 NOP
48 NOP
49 SQA RES, A2_BP_OUT1, Ry
50 MUL RES, A2_BP_OUT2, SA3_Z
51 NOP
52 LOOP RES, 19, 0
53 SS RES, XXXX, A2_BP_OUT1
54 SS RES, XXXX, A2_SS_OUT
55 MUL DA2_S, A2_SS_OUT, SA2_S
56 NOP
57 LOOP RES, 39, 0
58 SS RES, XXXX, A2_BP_OUT1
59 SS RES, XXXX, A2_SS_OUT
60 SQA RES, A2_SS_OUT, Ry
61 MUL RES, A2_BP_OUT2, SA2_S
62 NOP
63 NOP
64 SQA RES, A2_BP_OUT1, Ry
// End of FF inversion
65 MUL RES, A2_BP_OUT2, SA3_X
66 NOP
67 NOP
68 ADD RES, A2_BP_OUT1, Rx
69 MUL RES, A1_BP_OUT2, A2_BP_OUT2
70 NOP
71 NOP
72 ADD DA2_S, A2_BP_OUT1, SA3_Z
73 MUL RES, A2_BP_OUT2, SA1_Z
74 NOP
75 NOP
76 ADD DA2_S, A2_BP_OUT1, Ry
A.3 μ-code in ROM3
//Stay idle when the enable signal for point multiplication is not set
0 NOP
//Begin initialization
1 SQA DA3_Z, Rx, Ry
2 SS RES, Ry, Rx
3 ADD DA3_X, A3_SS_OUT, Rb
// Start the Loop, and it begins from address 6 to 13
// The number of Loop is 161
4 LOOP RES, 161, 7
5 MUL RES, SA1_X, SA1_Z
6 SS DA3_S, XXXX, SA1_Z
7 NOP
8 SQA DA3_Z, Ry, A3_BP_OUT1
9 MUL RES, SA3_S, Rb
10 SS DA3_S, XXXX, SA1_X
11 NOP
12 ADD DA3_X, A3_BP_OUT1, SA3_S
13 SMUL RES, A1_BP_OUT2(X1), SA1_Z
// The last Loop is from address 13 to 20
14 SS DA3_S, XXXX, SA1_Z
15 NOP
16 SQA DA3_Z, Ry, A3_BP_OUT1
17 MUL RES, SA3_S, Rb
18 SS DA3_S, XXXX, SA1_X
19 NOP
20 ADD DA3_X, A3_BP_OUT1, SA3_S
// FF inversion calculation is from address 21 to 64
21 SQA RES, Rx, Ry
22 MUL DA3_S, A3_BP_OUT2, Rx
23 NOP
24 NOP
25 SQA RES, A3_BP_OUT1, Ry
26 MUL RES, A3_BP_OUT2, Rx
27 NOP
28 NOP
29 SS RES, XXXX, A3_BP_OUT1
30 MUL DA3_S, A3_SS_OUT, SA3_S
31 NOP
32 NOP
33 SS RES, XXXX, A3_BP_OUT1
34 SS RES, XXXX, A3_SS_OUT
35 SQA RES, Ry, A3_SS_OUT
36 MUL DA3_S, A3_BP_OUT2, SA3_S
37 NOP
38 LOOP RES, 4, 0
39 SS RES, XXXX, A3_BP_OUT1
40 SS RES, XXXX, A3_SS_OUT
41 MUL DA3_S, A3_SS_OUT, SA3_S
42 NOP
43 LOOP RES, 9, 0
44 SS RES, XXXX, A3_BP_OUT1
45 SS RES, XXXX, A3_SS_OUT
46 MUL DA3_S, A3_SS_OUT, SA3_S
47 NOP
48 NOP
49 SQA RES, A3_BP_OUT1, Ry
50 MUL RES, A3_BP_OUT2, Rx
51 NOP
52 LOOP RES, 19, 0
53 SS RES, XXXX, A3_BP_OUT1
54 SS RES, XXXX, A3_SS_OUT
55 MUL DA3_S, A3_SS_OUT, SA3_S
56 NOP
57 LOOP RES, 39, 0
58 SS RES, XXXX, A3_BP_OUT1
59 SS RES, XXXX, A3_SS_OUT
60 SQA RES, A3_SS_OUT, Ry
61 MUL RES, A3_BP_OUT2, SA3_S
62 NOP
63 NOP
64 SQA DA3_X, A3_BP_OUT1, Ry
// End of FF inversion
65 SQA RES, Rx, Ry
66 ADD DA3_Z, A3_BP_OUT2, Ry
