Finite Field Multiplier Architectures for Cryptographic Applications by El-Gebaly, Mohamed





presented to the University of Waterloo
in fullment of the
thesis requirement for the degree of
Master of Applied Science
in
Electrical Engineering
Waterloo, Ontario, Canada, 2000
cMohamed El-Gebaly 2000
I hereby declare that I am the sole author of this thesis. This is a true copy of
the thesis, including any required final revisions, as accepted by my examiners.
I understand that my thesis may be made electronically available to the public.
Abstract
Security issues have started to play an important role in wireless communication
and computer networks due to the migration of commerce practices to the
electronic medium. The deployment of security procedures requires the
implementation of cryptographic algorithms. Performance has always been one of
the most critical issues of a cryptographic function, since it determines the
function's effectiveness. Among those cryptographic algorithms are the elliptic
curve cryptosystems, which use the arithmetic of finite fields. Furthermore,
fields of characteristic two are preferred since they provide carry-free
arithmetic and at the same time a simple way to represent field elements on
current processor architectures.
Multiplication is a crucial operation in finite field computations. In this
contribution, we compare most of the multiplier architectures found in the
literature to clarify the issue of choosing a suitable architecture for a
specific application. The importance of measuring the energy consumption, in
addition to the conventional measures, is also emphasized for energy-critical
applications. A new parallel-in serial-out multiplier based on all-one
polynomials (AOP) using the shifted polynomial basis of representation is
presented. The proposed multiplier is area efficient for hardware realization.
Low hardware complexity is advantageous for implementation in constrained
environments such as smart cards.
The architecture of an elliptic curve coprocessor has been developed using the
proposed multiplier. The instruction set architecture has also been designed.
The coprocessor has been simulated using VHDL to verify the functionality. The
coprocessor is capable of performing the scalar multiplication operation over
elliptic curves. Point doubling and addition procedures are hardwired inside the
coprocessor to allow for faster operation.
Acknowledgements
All praise is due to Allah for guiding me throughout my life and giving me the
ability to complete this work. I am at a loss of words to express my gratitude to
my mother and my brother for their continuous love and support.
I am very fortunate to have had Prof. Hasan as my research advisor. This thesis
would not have been possible without his support, encouragement, and patience of
listening to my ideas.
Here in Waterloo I am grateful to all Waterloo faculty who have taught me, and
my colleagues from whom I learned a lot. I want especially to mention Prof. Agnew
and my colleague Amr Wassal for the useful discussions that helped me throughout
this work.





1.1 Motivation
1.2 Hardware Cryptographic Architectures
1.3 Thesis Outline
2 Architectural-Level Comparisons
2.1 Introduction
2.2 Mathematical Background
2.2.1 Bases of Representations
2.2.2 Choices of Irreducible Polynomials
2.2.3 Performance and Complexity Metrics
2.3 GF(2^m) Multiplier Architectures
2.3.1 Polynomial Basis Multipliers
2.3.2 Normal Basis Multipliers
2.3.3 Dual Basis Multipliers
2.3.4 Composite Field Multipliers
2.4 Conclusions
3 Low-Energy GF Multipliers
3.1 Motivation
3.2 Sources of power dissipation in CMOS circuits
3.2.1 Static Power
3.2.2 Dynamic Power
3.3 Architectures Compared and Methodology
3.3.1 Multiplier Architectures selection
3.3.2 Methodology
3.4 Comparison Results
3.4.1 Delay Comparison
3.4.2 Power Comparison
3.4.3 Energy Comparison
3.5 Conclusion
4 Bit Serial Multiplication over a class of Finite Fields
4.1 Introduction
4.2 AOP Related Bases of Representations
4.3 Multiplication and Squaring over the Shifted Polynomial Basis
4.3.1 Multiplication
4.3.2 Squaring
4.4 Multiplier Architecture And Comparison
4.5 Conclusion
5 Elliptic Curve Coprocessor
5.1 Elliptic Curve Cryptosystem
5.1.1 Elliptic curves governing equations over GF(2^m)
5.2 Elliptic Curve Operations over GF(2^m)
5.2.1 Group Operation Algorithms using Projective coordinates
5.2.2 Scalar Multiplication
5.2.3 Diffie-Hellman Key Exchange
5.3 Elliptic Curve Coprocessor Architecture
5.3.1 Overview
5.3.2 Coprocessor Architecture
5.3.3 Instruction Set Architecture
5.4 Comparison
5.5 Conclusion
6 Conclusion and Future Work
6.1 Summary and Conclusion
6.2 Recommendations for Future Work
7 Appendix






DLP Discrete logarithm problem
EC Elliptic curve
ECC Elliptic curve cryptosystem
ECDLP Elliptic curve discrete logarithm problem
ESP Equally-spaced polynomial
FSM Finite state machine
GF Galois field







SPB Shifted polynomial basis
List of Tables
2.1 Non-systolic polynomial basis GF(2^m) multiplier architectures
2.2 Systolic polynomial basis GF(2^m) multiplier architectures
2.3 Normal basis GF(2^m) multiplier architectures
2.4 Dual basis GF(2^m) multiplier architectures
3.1 Multiplier architectures selected for comparison
4.1 Comparison between the proposed multiplier and other serial multipliers
5.1 The Point Doubling (Double) and Point Addition (Add-pnt) algorithms
5.2 The Scalar Multiplication (Smultiply) algorithm
5.3 NAF-Scalar Multiplication (NAF-Smultiply) algorithm
5.4 Binary encoding of Datapath Registers
5.5 Instruction Set
5.6 Operation count for Point Doubling and Addition
5.7 Performance of the proposed architecture
List of Figures
2.1 MSB-first multiplier architecture
2.2 Massey-Omura serial multiplier
2.3 Berlekamp multiplier configured for polynomial basis multiplication
3.1 CMOS inverter and the different components of power dissipation
3.2 Delay comparison
3.3 Power consumption comparison
3.4 Energy comparison
4.1 Squaring over the shifted polynomial basis
4.2 The proposed multiplier for multiplication over GF(2^4)
5.1 Diffie-Hellman Key Exchange Protocol
5.2 The elliptic curve coprocessor architecture
5.3 Datapath architecture
5.4 I/O unit structure
5.5 Read/Write Operation
5.6 Instruction set architecture
7.1 Simulation Waveforms
7.2 Simulation Waveforms (cont.)
7.3 Simulation Waveforms (cont.)
7.4 Simulation Waveforms (cont.)
7.5 Simulation Waveforms (cont.)
7.6 Simulation Waveforms (cont.)
7.7 Simulation Waveforms (cont.)
7.8 Simulation Waveforms (cont.)
7.9 Simulation Waveforms (cont.)





With the tremendous growth of commerce transactions over wired and wireless
media, the critical role that security plays is greatly emphasized. Electronic
commerce practices are endangered by the possibility of unauthorized access,
disclosure, alteration, substitution, or destruction of the information being
transmitted. The necessity for security has fueled research in the area of
cryptographic protocols and cryptographic algorithms.
Cryptographic computations are very intensive since the operand size is usually
very large. This has led to the development of efficient hardware and software
implementations to save system resources. In constrained environments such as
mobile and portable devices, energy consumption is one of the resources that has
to be optimized.
The use of elliptic curves (EC) in cryptography is promising for many reasons.
CHAPTER 1. INTRODUCTION 2
Elliptic curve cryptosystems (ECC) allow for shorter key lengths without
compromising the security of the system. More conventional public key
cryptographic protocols, such as RSA and systems based on the discrete
logarithm problem (DLP), use key lengths of about 1024 bits, while EC systems,
which are based on the elliptic curve discrete logarithm problem (ECDLP), use
160-bit operands. From a hardware point of view, this translates to increased
performance, less area and lower bandwidth. From a security standpoint, ECC
provides better long-term security due to the lack of sub-exponential attacks of
the kind that can be applied to DLP systems. ECC is currently being reviewed for
standardization by the IEEE P1363 standards committee [20].
The elliptic curve cryptosystems which use the arithmetic of finite fields have
been shown to have efficient implementations, especially in constrained
environments [25]. Furthermore, fields of characteristic two are preferred since
they provide carry-free arithmetic and at the same time a simple way to
represent field elements on current processor architectures. Addition in GF(2^m)
can be as simple as bit-wise XOR operations. However, finite field
multiplication is much more difficult. Nevertheless, multiplication is an
essential operation in finite field arithmetic since other operations such as
inversion and exponentiation can be performed using repeated multiplication
operations.
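The two observations above, that addition is a bit-wise XOR and that inversion
can be built from repeated multiplications, can be illustrated with a minimal
software sketch. The field GF(2^4), the polynomial x^4 + x + 1 and the function
names are assumptions chosen for illustration only; they do not describe the
hardware discussed in this thesis.

```python
# Illustrative field GF(2^4) built from p(x) = x^4 + x + 1 (an assumed choice).
M = 4
POLY = 0b10011  # bit i holds the coefficient of x^i


def gf_add(a: int, b: int) -> int:
    """Addition in GF(2^m): coefficient-wise addition mod 2, i.e. XOR."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication with reduction modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add the current multiple of a
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> M:           # degree reached m: reduce by XORing POLY
            a ^= POLY
    return result


def gf_inv(a: int) -> int:
    """Inverse computed as a^(2^m - 2), using only multiplications."""
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    result, base, exponent = 1, a, (1 << M) - 2
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exponent >>= 1
    return result
```

The inversion relies on the fact that every nonzero element a of GF(2^m)
satisfies a^(2^m - 1) = 1, so a^(2^m - 2) is its inverse.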
1.2 Hardware Cryptographic Architectures
Cryptographic computations are very demanding in terms of processing power and
speed. This fact has led to the implementation of such systems on a hardware
chip rather than a software program. The chip is a piece of hardware dedicated
to performing the computations in the underlying finite field. Many
cryptographic chips have been implemented to speed up the cryptographic
computations [21, 38] using parallel architectures to achieve the high speed
required. In constrained environments such as smart cards, two very important
design factors are to be considered: area and power consumption of the chip.
Parallel architectures are not suitable for such environments since they consume
much more area and power than what a device can support. As a result, bit-serial
or digit-serial architectures are of more practical importance. To evaluate the
hardware architectures suitable for a certain application, the following
measures are to be considered:
• Hardware complexity (gate count).
• Time complexity (maximum delay).
• Regularity and Modularity.
In addition to the above measures, a very important metric in the evaluation
process, especially for energy-critical applications, is the energy consumption.
Chapter 3 shows the importance of the energy measure in evaluating multiplier
architectures.
1.3 Thesis Outline
This thesis is organized as follows. Chapter 2 provides a survey of most of the
GF(2^m) multiplier architectures found in the literature. Based on the
representations of the field elements, the architectures are grouped into four
main categories, namely, polynomial, normal, dual, and composite field
multipliers. The comparison measures are the gate count and the critical path
delay. Chapter 3 adds another performance measure to those discussed in
Chapter 2. The conventional performance measures are found to be insufficient to
select the most suitable architecture for a particular application. Energy
consumption is shown to be a very critical performance measure for wireless and
mobile applications. Extending the work done in [51], the dual basis multiplier
is added to the comparison. Another group of multipliers based on AOPs is also
added to the comparison.
A new serial multiplier architecture is proposed in Chapter 4. The proposed
multiplier is based on all-one polynomials as the field defining polynomial and
the field elements are represented in the shifted polynomial basis. The hardware
complexity of the proposed architecture is low, which is advantageous in
constrained environments.
An elliptic curve coprocessor that is capable of performing the elliptic curve
scalar multiplication operation is presented in Chapter 5. The main component
inside the coprocessor architecture, the finite field multiplier, is the
architecture proposed in Chapter 4. The datapath, the I/O unit, the control
unit, as well as the




The importance of the multiplication operation amongst other finite field
operations is greatly emphasized since finite field multiplication is a complex
operation to perform. Many finite field multiplier architectures have been
proposed in the literature based on different field bases or different
application requirements. In most cases, gate count and critical path delay are
used as complexity and performance measures to compare and evaluate different
architectures. In this chapter, most of the GF(2^m) multiplier architectures
found in the literature are clustered into four main groups: polynomial, normal,
dual, and composite field multipliers. The multipliers within each group are
compared in terms of the timing and area metrics.
With the emergence of wireless devices and the application of finite fields to
securely and reliably transmitting data, the need for low-energy architectures and
CHAPTER 2. ARCHITECTURAL-LEVEL COMPARISONS 6
implementations is much more emphasized. Conventional complexity and performance
measures are no longer sufficient by themselves and have to be integrated with
an energy metric. This metric is used to compare a group of multipliers in
Chapter 3.
This chapter is organized as follows. In Section 2.2, the mathematical
background needed is briefly reviewed, presenting the different bases used in
finite field arithmetic in general and multipliers in particular. Different
architectural features that influence the choice of an architecture for a
specific application are discussed in Section 2.2.3. Performance and complexity
metrics used to quantitatively differentiate multiplier architectures are also
discussed in that section, which introduces the energy-delay metric. Section 2.3
provides a survey of most of the prominent multiplier architectures in the
literature and compares them based on the conventional metrics.
2.2 Mathematical Background
To understand what finite field multipliers are about and how they work, a few
mathematical concepts need to be reviewed [28, 45]. An Abelian group G is a set
of elements together with a binary operation satisfying the following
properties: closure, associativity, existence of an identity element, existence
of inverses, and commutativity. A field is a set F together with two operations,
addition and multiplication, such that F is an Abelian group under addition with
0 as the identity element, the non-zero elements of F form an Abelian group
under multiplication with 1 as the identity element, and the distributive law
a(b + c) = ab + ac holds for all a, b, and c in the field. A field with a finite
number of elements q is called a finite field of order q and is usually denoted
by GF(q), i.e., Galois Field of order q. The order q must be a prime or a power
of a prime. For a prime q, the field operations are addition and multiplication
modulo q; for q = p^m, they are addition and multiplication of polynomials over
GF(p) modulo an irreducible polynomial of degree m. The binary field GF(2) and
its extension GF(2^m) are of special interest because of their wide usage in
computer hardware and communications equipment.
A polynomial p(x) over GF(2) of degree m is said to be irreducible over GF(2) if
p(x) is not divisible by any polynomial over GF(2) of degree less than m but
greater than zero. Also, an irreducible polynomial p(x) of degree m is said to
be primitive if the smallest positive integer n for which p(x) divides x^n + 1
is n = 2^m − 1. An irreducible polynomial p(x) of degree m is the generator of
the extension field GF(2^m) if its nonzero elements are powers of α and α is a
root of p(x).
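The definition of a primitive polynomial can be checked directly for small
degrees. The sketch below is illustrative only (the function names and the two
example polynomials are assumptions, not taken from this thesis): it finds the
smallest n for which p(x) divides x^n + 1 by repeatedly multiplying by x modulo
p(x).

```python
def order_of_x(poly: int) -> int:
    """Smallest n >= 1 with x^n = 1 modulo poly (poly given as a bit mask).

    Assumes poly has a nonzero constant term, so that x is invertible
    modulo poly and the loop terminates.
    """
    m = poly.bit_length() - 1    # degree of poly
    value, n = 1, 0              # value holds x^n mod poly
    while True:
        value <<= 1              # multiply by x
        if value >> m:           # degree reached m: reduce
            value ^= poly
        n += 1
        if value == 1:
            return n


def is_primitive(poly: int) -> bool:
    """For an irreducible poly of degree m: primitive iff x has order 2^m - 1."""
    m = poly.bit_length() - 1
    return order_of_x(poly) == (1 << m) - 1
```

For m = 4, x^4 + x + 1 (bit mask 0b10011) passes this check, whereas the all-one
polynomial x^4 + x^3 + x^2 + x + 1 (0b11111) is irreducible but not primitive,
since it already divides x^5 + 1.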
If $\{\alpha_0, \alpha_1, \ldots, \alpha_{m-1}\}$ is a basis of GF(2^m) over
GF(2), each element $\beta \in$ GF(2^m) can be uniquely represented in the form
$\beta = a_0\alpha_0 + a_1\alpha_1 + \ldots + a_{m-1}\alpha_{m-1}$, where
$a_i \in$ GF(2) for $0 \le i \le m-1$. Different multipliers use different bases
and the choice of the underlying basis is heavily dependent on the application.
2.2.1 Bases of Representations
Polynomial Basis
This basis is also known as the canonical or standard basis. It is defined as
the set $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$, where $\alpha$ is a
root of the irreducible polynomial p(x) used to construct the field GF(2^m). In
polynomial basis over GF(2^m), addition is simply bit-wise XORing. It is also
worth noting that there is no carry to propagate in finite field computations,
which means a shorter critical path compared to ordinary arithmetic operations.
Polynomial basis multipliers are based on polynomial multiplication and modular
reduction.
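As a small worked example of the two steps (polynomial multiplication followed
by modular reduction), consider GF(2^4) with the illustrative choice
p(x) = x^4 + x + 1, writing α for a root of p(x), so that α^4 = α + 1. The field
and the operands are assumptions chosen only to keep the arithmetic short.

```latex
\begin{align*}
A &= \alpha^3 + \alpha + 1, \qquad B = \alpha^3 + 1 \\
A \cdot B &= \alpha^6 + \alpha^4 + \alpha^3 + \alpha^3 + \alpha + 1
           = \alpha^6 + \alpha^4 + \alpha + 1
           && \text{(coefficients add modulo 2)} \\
\alpha^4 &= \alpha + 1, \qquad
\alpha^6 = \alpha^2 \cdot \alpha^4 = \alpha^3 + \alpha^2
           && \text{(reduction modulo } p(x)\text{)} \\
A \cdot B &= (\alpha^3 + \alpha^2) + (\alpha + 1) + \alpha + 1
           = \alpha^3 + \alpha^2
\end{align*}
```

No carries appear at any step; every addition is a coefficient-wise XOR.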
Normal Basis
This basis is given by the set
$\{\beta, \beta^2, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$ where $\beta \in$
GF(2^m). The concept of Optimal Normal Basis was introduced in [37] to reduce
the complexity of multiplier architectures. Unfortunately, an optimal normal
basis exists for only approximately 23% of the fields GF(2^m), 2 ≤ m < 1200.
Optimal normal basis has two types. Unlike type-II, type-I has very few
irreducible polynomials.
Dual Basis
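The concept of duality is defined as follows. Let $\{\lambda_j\}$ and
$\{\mu_i\}$ be two bases of representation for GF(2^m), and let Tr() be a linear
function GF(2^m) → GF(2). The bases $\{\lambda_j\}$ and $\{\mu_i\}$ are said to
be dual with respect to Tr() if

\[
\mathrm{Tr}(\lambda_j \mu_i) =
\begin{cases}
1 & \text{if } i = j, \\
0 & \text{if } i \neq j.
\end{cases}
\tag{2.1}
\]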




\[
\cdots =
\begin{cases}
1 & \text{if } i = j, \\
0 & \text{if } i \neq j.
\end{cases}
\tag{2.2}
\]
The trace function over GF(2^m) is given by
$\mathrm{Tr}(\beta) = \beta + \beta^2 + \beta^{2^2} + \ldots + \beta^{2^{m-1}}$.
The dual basis was first used by Berlekamp [3] in the implementation of
Reed-Solomon codecs. Another variation of the dual basis is the Weakly Dual
basis introduced in [55].
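The trace function can be evaluated by repeated squaring. The following sketch
is an illustration under assumed parameters (GF(2^4) with p(x) = x^4 + x + 1 and
hypothetical function names); it shows that the trace of every element lands in
GF(2), which is what allows it to serve as the linear function in the duality
definition.

```python
# Illustrative field GF(2^4) with p(x) = x^4 + x + 1 (an assumed choice).
M = 4
POLY = 0b10011


def gf_mul(a: int, b: int) -> int:
    """Polynomial basis multiplication modulo POLY (shift-and-add)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


def trace(a: int) -> int:
    """Tr(a) = a + a^2 + a^(2^2) + ... + a^(2^(m-1)), returned as 0 or 1."""
    total, term = 0, a
    for _ in range(M):
        total ^= term             # accumulate the current conjugate
        term = gf_mul(term, term)  # square to get the next conjugate
    return total


traces = [trace(a) for a in range(1 << M)]
```

Because the trace is linear and not identically zero, exactly half of the field
elements have trace 1.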
Other Bases
Other bases that are less commonly used but are gaining more momentum include
the Triangular basis and the Redundant basis. The triangular basis is similar to
the dual basis in many aspects. This basis is the result of a pre-multiplication
of any basis by the matrix

\[
T = \begin{bmatrix}
p_m    & 0      & \cdots & 0       & 0      \\
\vdots & \vdots & \ddots & \vdots  & \vdots \\
p_2    & p_3    & \cdots & p_m     & 0      \\
p_1    & p_2    & \cdots & p_{m-1} & p_m
\end{bmatrix}
\]

and the matrix entries p_i, 0 ≤ i ≤ m, are the coefficients of the irreducible
polynomial used to generate the field. For any irreducible polynomial p(x), p_m
is always 1 and T is guaranteed to be nonsingular. The advantage of this basis
is that a transform of coordinates to or from a polynomial basis can be done
using shift registers with their connections determined by the irreducible
polynomial used to construct the field [17]. It was used recently to build a
variable dimension Galois Field coprocessor [13].
Another representation that has gained attention recently is the Composite
Field, GF((2^n)^m), obtained by extending the extension field GF(2^n) to degree
m. Performing multiplication operations over large field polynomials by
splitting those polynomials has been proposed in [41] based on the
Karatsuba-Ofman algorithm (KOA) [26] for multiplication of large numbers. The
architecture of the composite field multipliers is basically composed of
multipliers and adders of the smaller order field represented in the polynomial
[41] or the normal basis [43]. Implementing a parallel multiplier using KOA
reduces its complexity below the O(m^2) bound [41].
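The splitting idea behind KOA can be sketched in software for polynomials over
GF(2). This is only an illustration of the Karatsuba-Ofman recursion under
assumed names and an assumed cut-off of 8 coefficients; it is not the
composite-field multiplier of [41], and it leaves out the final reduction modulo
the field polynomial.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less product of two GF(2) polynomials (bit masks)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba(a: int, b: int, n: int) -> int:
    """Product of two GF(2) polynomials with fewer than n coefficients each."""
    if n <= 8:                        # small operands: use the schoolbook method
        return clmul(a, b)
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = karatsuba(a_lo, b_lo, half)
    hi = karatsuba(a_hi, b_hi, n - half)
    mid = karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, n - half)
    # Over GF(2) subtraction is XOR, so the cross term is mid + lo + hi.
    return lo ^ ((mid ^ lo ^ hi) << half) ^ (hi << (2 * half))
```

Each level replaces four half-size products by three, which is where the
sub-quadratic growth comes from.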
2.2.2 Choices of Irreducible Polynomials
The choice of the irreducible polynomial greatly affects the complexity and
regularity of the multiplier architecture; hence, special classes of irreducible
polynomials are often used. However, using such special classes might limit the
applications of the architecture, e.g., it might not be acceptable in some
cryptographic applications. The availability or rarity of those polynomials for
certain field dimensions also restricts their applications.
The all-one polynomial (AOP), p(x) = 1 + x + x^2 + ... + x^m, is irreducible
over GF(2) if and only if m + 1 is a prime and 2 is a primitive root modulo
m + 1; in that case m + 1 divides 2^m − 1 and all the (m + 1)th roots of unity
are in GF(2^m) [35]. For m ≤ 100, the AOP is irreducible for m = 2, 4, 10, 12,
18, 28, 36, 52, 58, 60, 66, 82 and 100. Polynomial basis multipliers based on
the irreducible trinomial x^m + x^k + 1 with 1 ≤ k ≤ ⌊m/2⌋, most commonly with
k = 1 or m/2, are also attractive since they require fewer bit operations for
modular reduction.
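The list of degrees for which the AOP is irreducible can be reproduced with a
short script. The sketch below assumes the standard criterion that the AOP of
degree m is irreducible over GF(2) exactly when m + 1 is prime and 2 has
multiplicative order m modulo m + 1; the function names are hypothetical.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for the small values used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def order_of_two(p: int) -> int:
    """Multiplicative order of 2 modulo an odd prime p."""
    value, k = 2 % p, 1
    while value != 1:
        value = (value * 2) % p
        k += 1
    return k


def aop_is_irreducible(m: int) -> bool:
    """Criterion assumed above: m + 1 prime and 2 of order m modulo m + 1."""
    p = m + 1
    return p > 2 and is_prime(p) and order_of_two(p) == m


degrees = [m for m in range(2, 101) if aop_is_irreducible(m)]
```

For example, m = 6 is rejected even though 7 is prime, because 2 has order 3
modulo 7.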
Another attractive case is the Equally Spaced Polynomial (ESP). An ESP with a
spacing s, the s-ESP, is given by 1 + x^s + x^{2s} + ... + x^{ns}. It has been
shown that the ESP increases the regularity of the architecture to some extent
[16, 23].
2.2.3 Performance and Complexity Metrics
Several performance and complexity metrics are used to compare and evaluate
finite field multiplier architectures. These metrics and architectural features
are reviewed below.
Gate Count
This is the main complexity metric, which is usually given as the numbers of
2-input AND and XOR gates, flip-flops and switches or 2-to-1 multiplexers. It is
sometimes tied to the silicon area used for implementation using the area and
count of an equivalent 2-input NAND gate to represent the hardware complexity
[51].
Throughput
Throughput is basically determined by the time required to complete a
multiplication operation, which is usually expressed in clock cycles. The clock
period, on the other hand, is proportional to the critical path delay. The
choice of an architecture and an irreducible polynomial should try to minimize
the critical path delay to decrease the clock period and increase the
throughput. Also, pipelined architectures have the advantage of dividing the
critical path delay over several stages, thus increasing the clock frequency and
the throughput. On the other hand, increasing the clock frequency has a negative
effect in terms of power dissipation.
Latency
The delay between the first input and the first output of the multiplier,
expressed in clock cycles, is defined here as the latency. This measure is of
special importance in the semi-systolic and systolic architectures, where the
output experiences a delay of a number of clock cycles after the arrival of the
input.
Power Dissipation and Energy
As described above, a quantitative approach is needed to select appropriate
architectures for energy-starved applications such as wireless and mobile
applications. One approach was to seek the primitive polynomial that minimizes
the power dissipation by reducing the switching activity [46]. However, this
approach uses an exhaustive search to find the optimal polynomial, which is not
feasible for very large field dimensions. Another approach tries to minimize the
power-delay product, and hence the energy, for an architecture through the logic
style used in the implementation [51]. There is always a power versus delay
trade-off which governs practical VLSI architectures, and minimizing their
product achieves the best trade-off between power consumption and speed and
conserves energy resources.
Regularity and Modularity
Although subjective, this is also a very important metric. Many applications use
very large field dimensions, which makes a regular multiplier that can be
implemented in bit-slices a very attractive option. Polynomial basis multipliers
are the best in terms of regularity while normal basis multipliers are the
worst. Regularity usually affects performance positively too.
2.3 GF(2^m) Multiplier Architectures
This section compares most of the multiplier architectures found in the
literature in terms of the architectural features previously mentioned. A
previous comparison [19] covered only three multiplier architectures: Berlekamp
[3], Massey-Omura [29], and Scott-Tavares-Peppard [44]. The VLSI chip area was
compared for the three multipliers for order m = 8. The dual basis multiplier by
Berlekamp was the most efficient architecture.
2.3.1 Polynomial Basis Multipliers
Polynomial basis multipliers are the most common multipliers in the literature.
Several architectures, serial or parallel, systolic or non-systolic, have been
developed mainly because the polynomial basis is the simplest to represent.
Throughout this section, the irreducible polynomial is referred to as
$p(x) = 1 + p_1 x + p_2 x^2 + \ldots + x^m$ and, for an AOP,
$p(x) = 1 + x + x^2 + \ldots + x^m$. The multiplier operands will be referred to
as A and B, where
$A = a_0 + a_1\alpha + a_2\alpha^2 + \ldots + a_{m-1}\alpha^{m-1}$ and
$B = b_0 + b_1\alpha + b_2\alpha^2 + \ldots + b_{m-1}\alpha^{m-1}$.
Non-Systolic Architectures
Many non-systolic polynomial basis multipliers have been developed mainly
because their hardware complexity is smaller compared to the systolic
architectures. Most of these architectures depend heavily on the choice of the
irreducible polynomial used to generate the field. Selecting certain irreducible
polynomials greatly simplifies the underlying architecture and increases its
regularity and modularity. Choosing an AOP increases the regularity, while using
a trinomial has been shown to produce hardware-efficient architectures
[5, 14, 30, 48]. Using these special polynomials, on the other hand, puts
restrictions on the order of the finite field used since those polynomials are
irreducible only for certain orders.
Mastrovito [30] presented a parallel multiplier based on the reduction of the
product polynomial from a degree of 2m − 2 to m − 1. Consider that C = AB
mod p(x) is the product of A and B reduced modulo the irreducible polynomial
p(x). The product polynomial d = AB, which is of degree at most 2m − 2, is first
computed. The product polynomial d is then reduced to an m − 1 degree polynomial
using the multiplication matrix, Z, C = ZA. The reduction process is performed
through producing a reduction matrix, Q, to reduce the elements of orders m or
higher to orders ≤ m − 1. This can be accomplished
























The coefficients of the matrix Q depend on the choice of the irreducible
polynomial used to generate the field. Selecting trinomials of the form
x^m + x + 1 or x^m + x^{m/2} + 1 has been shown to reduce the number of terms in
the Q matrix and therefore reduce the overall complexity of the multiplier [30].
Recently, the hardware complexity of the Mastrovito multiplier for the trinomial
of the form x^m + x^n + 1 for 1 ≤ n ≤ m − 1 has been shown to be the same as
that of the original Mastrovito multiplier [48]. It was also shown that the
multiplication matrix Z can be constructed from three simpler matrices. For the
special case of n = m/2, the number of the required XOR gates is reduced. The
time complexity in that special case is also greatly reduced.
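The role of the reduction matrix can be sketched in software. The code below is
an illustration under assumed conventions (row k of Q holds the polynomial basis
coordinates of x^(m+k) modulo p(x), stored as a bit mask; the function names are
hypothetical); it is not the gate-level structure of [30] or [48].

```python
def reduction_matrix(poly: int) -> list[int]:
    """Row k is x^(m+k) mod poly as a bit mask, for k = 0 .. m-2."""
    m = poly.bit_length() - 1
    rows = []
    value = poly ^ (1 << m)      # x^m mod poly: the lower-order terms of poly
    for _ in range(m - 1):
        rows.append(value)
        value <<= 1              # next power of x
        if value >> m:           # degree reached m: reduce
            value ^= poly
    return rows


def multiply(a: int, b: int, poly: int) -> int:
    """Two-step product: full product d = a * b, then reduction through Q."""
    m = poly.bit_length() - 1
    d = 0
    for i in range(m):           # d has degree at most 2m - 2
        if (b >> i) & 1:
            d ^= a << i
    c = d & ((1 << m) - 1)       # terms of degree < m pass through unchanged
    for k, row in enumerate(reduction_matrix(poly)):
        if (d >> (m + k)) & 1:   # the term x^(m+k) is present in d
            c ^= row
    return c
```

For the trinomial x^7 + x + 1 (bit mask 0b10000011) every row of Q produced by
this sketch has exactly two nonzero entries, which illustrates why trinomials
keep the reduction logic small.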
A generalized Mastrovito multiplier has been introduced in [47] for which the











The multiplier has a complexity proportional to (m − 1 − H(p)), where H(p) is
the Hamming weight of the underlying irreducible polynomial. However, the
original Mastrovito multiplier has a complexity proportional to H(p). The
multiplier proposed in [47] has a low hardware complexity if the Hamming weight
of the generating polynomial is high, while the original Mastrovito multiplier
has a lower complexity for low Hamming weights. For the AOP case, where
H(p) = m − 1, the generalized multiplier has a lower complexity that is exactly
the same as the one proposed in [5].
Another category of irreducible polynomials commonly used in polynomial basis
multipliers is the AOP. Using AOPs can considerably enhance the modularity of
the architecture. For example, Hasan and Bhargava [15] presented a serial AOP





is the product of A and B, where A, B, and C are all elements in GF(2^m). The












â0 â2 : : : âm 1




































representing the ith coordinate of the element k. The matrix in (2.4) was used
to construct a serial multiplier in [15]. The matrix multiplication can be
realized using a Linear-Feedback Shift Register (LFSR) and m AND gates. Another
unit is required to perform the partial products accumulation. An AOP was used
in [16] as the irreducible polynomial to construct a parallel multiplier. It was
shown that the use of an AOP greatly simplifies the construction of the
multiplication matrix and increases the modularity of the design.
Itoh and Tsujii [23] also proposed a parallel multiplier based on AOPs. The
operands and the product are expressed in a modified version of the polynomial
basis representation. For example, for $A \in$ GF(2^m), A is expressed as
$A = A_0 + A_1\alpha + \ldots + A_{m-1}\alpha^{m-1} + A_m\alpha^m$














A0 A1 : : : Am





A2 A3 : : : A1













A serial multiplier was derived from (2.5) in [11]. In that structure the operands
A and B are represented in the polynomial basis form while the product C is












A1 A2 : : : 0
A0 A1 : : : Am 1

















Equation (2.6) is obtained by the application of $A_m = B_m = 0$ to (2.5). Note
that $A_i = a_i$, $B_i = b_i$ and $C_i = c_i$ for $i = 0, 1, \ldots, m-1$. The
product C is converted to




i=1 AiBm i [23]. This multiplier requires m clock cycles to produce the
output sequence while the original architecture by Itoh and Tsujii requires m + 1
clock cycles.


















































































































































































































































































































































































































































































CHAPTER 2. ARCHITECTURAL-LEVEL COMPARISONS 20
Koc and Sunar [5] utilized the Shifted Polynomial Basis representation and
Mastrovito's multiplication algorithm to construct an AOP multiplier. It has been
shown that the multiplication matrix Z in [30] can be divided into two simpler
matrices Z1+Z2.Those two matrices can be easily computed when an AOP is used
as the irreducible polynomial. The gate and time complexities of this multiplier
are better than that of Mastrovito's. The architecture has an extra advantage that
it can perform multiplication in the normal basis as well as the polynomial basis.
The normal basis multiplication is performed by adding a permutation circuit to the
inputs and the inverse of that permutation to its output. The permutation circuit
does not add any gate complexity to the multiplier since it can be done by rewiring.
The normal basis version of the architecture proposed in [5] will be discussed in more
details in Section 2.3.2. The non-systolic polynomial basis architectures covered in
this chapter are compared quantitatively in Table 2.1.
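The reason an AOP simplifies reduction can be illustrated in software. The following is a minimal sketch, not any of the cited architectures: it assumes coefficient lists with index i holding the coefficient of $\alpha^i$, and the function names are hypothetical. It uses the fact that a root of the AOP satisfies $\alpha^{m+1} = 1$, so the product is a cyclic convolution of length m + 1 followed by folding the $\alpha^m$ coefficient into all the others.

```python
def aop_multiply(a, b, m):
    """Multiply a and b modulo the all-one polynomial of degree m.

    a and b are lists of m bits; index i holds the coefficient of alpha^i.
    """
    n = m + 1
    # Extended representation with A_m = B_m = 0 appended.
    ext_a = a + [0]
    ext_b = b + [0]
    # alpha^(m+1) = 1, so the product is a cyclic convolution of length m+1.
    ext_c = [0] * n
    for i in range(n):
        if ext_a[i]:
            for j in range(n):
                ext_c[(i + j) % n] ^= ext_b[j]
    # Fold alpha^m back: alpha^m = 1 + alpha + ... + alpha^(m-1).
    return [ext_c[i] ^ ext_c[m] for i in range(m)]


def reference_multiply(a, b, m):
    """Schoolbook product followed by explicit reduction modulo the AOP."""
    prod = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            prod[i + j] ^= a[i] & b[j]
    for k in range(2 * m - 2, m - 1, -1):
        if prod[k]:
            prod[k] = 0
            # x^k = x^(k-m) * (1 + x + ... + x^(m-1))
            for i in range(m):
                prod[k - m + i] ^= 1
    return prod[:m]
```

The cyclic form needs no polynomial-specific reduction logic, which is one way to see why AOP-based designs tend to be regular and modular.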
Systolic Architectures
Systolic architectures are advantageous in some applications because they are easy
to pipeline and to expand. The basic metric in measuring the hardware complexity
of systolic architectures is the complexity of the basic cell. The critical path delay is
also the critical delay of the basic unit. Due to the existence of pipelining registers
in the data path, systolic architectures suffer from longer latency and longer initial
delay.
Most of the systolic multipliers are based on array-type multiplication where
one of the operands is processed in parallel and the other is processed one bit at a
time. Depending upon the order of processing of the second operand, the array-type
algorithms are classified as least-significant bit first (LSB-first) and most-significant
bit first (MSB-first) schemes. The LSB-first scheme processes the LSB of the second
operand first while the MSB-first scheme processes the MSB first [24].
Assuming that A, B, and $C \in$ GF($2^m$) are represented in the polynomial basis
and p(x) is the irreducible polynomial, the product C of A and B can be written
as
\[
\begin{aligned}
C &= AB \bmod p(x) \\
  &= b_0 A + b_1 (A\alpha \bmod p(x)) + b_2 (A\alpha^2 \bmod p(x)) + \dots + b_{m-1} (A\alpha^{m-1} \bmod p(x))
\end{aligned}
\tag{2.7}
\]

i $(0 \le k \le m-1)$.
The computation in (2.7) is called the LSB-first scheme and can be performed

i and $C^{(0)} = 0$, $A^{(k)} = A\alpha^k$ and $A^{(0)} = A$.
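For $k = 0$, we have $A\alpha^0 = A$, while for $1 \le k \le m-1$, we have
A

Since $p(\alpha) = 0$, $\alpha^m = p_{m-1}\alpha^{m-1} + p_{m-2}\alpha^{m-2} + \dots + p_1\alpha + p_0$

\[
a_i^{(k)} =
\begin{cases}
a_{i-1}^{(k-1)} + a_{m-1}^{(k-1)} p_i, & 1 \le i \le m-1, \\
a_{m-1}^{(k-1)} p_0, & i = 0,
\end{cases}
\]

where $a_i^{(k)}$ and $c_i^{(k)}$ denote the ith coefficients in $A^{(k)}$ and $C^{(k)}$ respectively, and the
final product C is $C^{(m)}$.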
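The LSB-first recursion can be sketched in software as follows. This is an illustrative model only, assuming the usual form of the scheme: the accumulator starts at $C^{(0)} = 0$, step k adds $b_{k-1} A^{(k-1)}$, and $A^{(k)}$ is obtained from $A^{(k-1)}$ by one multiplication by $\alpha$ modulo p(x). Field elements are held as integers whose bit i is the coefficient of $\alpha^i$; the function name and argument layout are hypothetical.

```python
def lsb_first_multiply(a, b, p, m):
    """Bit-serial LSB-first multiplication in GF(2^m).

    a, b: integers whose bit i is the coefficient of alpha^i.
    p:    integer holding p_0..p_(m-1) of p(x); the x^m term is implied.
    """
    mask = (1 << m) - 1
    acc = 0      # C^(0) = 0
    a_k = a      # A^(0) = A
    for k in range(1, m + 1):
        # Accumulate: C^(k) = C^(k-1) + b_(k-1) * A^(k-1)
        if (b >> (k - 1)) & 1:
            acc ^= a_k
        # Multiply by alpha: A^(k) = A^(k-1) * alpha mod p(x)
        msb = (a_k >> (m - 1)) & 1
        a_k = (a_k << 1) & mask
        if msb:
            a_k ^= p
    return acc       # C^(m)
```

For example, with p(x) = x^4 + x + 1 the call `lsb_first_multiply(0b1000, 0b0010, 0b0011, 4)` multiplies $\alpha^3$ by $\alpha$. The accumulation and the multiply-by-$\alpha$ update inside the loop do not depend on each other within a step, which is the parallelism the scheme exposes in hardware.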
The MSB-first scheme uses the MSB as the first bit in the multiplication operation
to produce the product as follows
\[
C = (\dots(A b_{m-1}\alpha \bmod p(x) + A b_{m-2})\alpha \bmod p(x) + \dots + A b_1)\alpha \bmod p(x) + A b_0.
\tag{2.11}
\]
The basic kth step in the MSB-first algorithm performs the following computation:
\[
C^{(k)} = C^{(k-1)}\alpha \bmod p(x) + A b_{m-k},
\]
\[
c_i^{(k)} =
\begin{cases}
c_{i-1}^{(k-1)} + c_{m-1}^{(k-1)} p_i + a_i b_{m-k}, & 1 \le i \le m-1, \\
c_{m-1}^{(k-1)} p_0 + a_0 b_{m-k}, & i = 0.
\end{cases}
\tag{2.13}
\]
The recursive equation (2.13) can be implemented using the architecture shown in
Figure 2.1(a). The basic cell structure is shown in Figure 2.1(b).
The operations performed in both array-multiplication algorithms can be identified
as multiply-by-$\alpha$, generate-current-partial-products and
accumulate-to-previous-result [24]. The multiply-by-$\alpha$ operation is common to
both schemes. In the LSB-first scheme, the three operations are performed in parallel,
while in the MSB-first scheme they are performed sequentially. Parallelism in the
LSB-first scheme leads to efficient implementations with less area complexity than the
MSB-first scheme. The LSB-first and the MSB-first schemes can be easily mapped into
serial or parallel VLSI implementations. The choice of implementation depends heavily
on the nature of the application and the availability of the input operands at the
beginning of the computation.

Figure 2.1: MSB-first multiplier architecture
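For comparison with the hardware view, the MSB-first recursion can be modelled in a few lines. This is a sketch under the same assumptions as the integer representation used for bit-serial models in general (bit i holds the coefficient of $\alpha^i$, and `p` holds only the low coefficients of p(x)); it is Horner's rule in $\alpha$ and is not a description of the cell in Figure 2.1. The function name is hypothetical.

```python
def msb_first_multiply(a, b, p, m):
    """Bit-serial MSB-first multiplication in GF(2^m) (Horner's rule in alpha).

    a, b: integers whose bit i is the coefficient of alpha^i.
    p:    integer holding p_0..p_(m-1) of p(x); the x^m term is implied.
    """
    mask = (1 << m) - 1
    acc = 0                                  # C^(0) = 0
    for k in range(1, m + 1):
        # C^(k-1) * alpha mod p(x)
        msb = (acc >> (m - 1)) & 1
        acc = (acc << 1) & mask
        if msb:
            acc ^= p
        # + A * b_(m-k)
        if (b >> (m - k)) & 1:
            acc ^= a
    return acc                               # C^(m)
```

Here each step must finish the multiply-by-$\alpha$ of the accumulator before the new partial product is added, which mirrors the sequential ordering of the three operations in this scheme.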
In [57], the LSB-first scheme was used to implement serial and parallel systolic
array multipliers. These multipliers can perform the product-sum operation,
P = AB + C, in GF($2^m$). Jain et al. [24] proposed a semi-systolic multiplier
architecture. Semi-systolic architectures have lower latency and fewer
latches than systolic architectures.
The MSB-first scheme was utilized in the implementation of several systolic
polynomial basis multipliers. In [44], a bit-slice architecture for a serial-in serial-out
multiplier was implemented. The architecture has two global control signals, which
introduce synchronization problems when the chip becomes large for large field
dimensions. The first serial systolic architecture reported in the literature was due
to Zhuo [58]. It has only one control signal and a very efficient architecture. Other
implementations using systolic array architectures were proposed in [49] and [54].
The two implementations have about the same area and timing complexity. Only one
control signal is used in both designs, which gives these architectures an advantage
over the design presented in [44]. In [7], a systolic product-sum architecture was
presented that is less complex than that proposed in [57]. However, they both have
the same critical path delay.
In [31], an area-efficient architecture was proposed utilizing the MSB-first scheme
to implement a serial systolic multiplier and a variation of it in the form of
a serial/parallel systolic architecture. The design uses two bits of one operand as
inputs to the basic cell. Hence, the required number of basic cells in the multiplier
is halved, i.e. m/2 instead of m. The overall complexity of the multiplier
is better than that of other designs, but the critical path delay is longer. Also, two
control signals are required to produce the output.
Table 2.2 shows a comparison between most of the systolic architectures proposed
in the literature. The entries of the table represent the whole architecture
rather than the basic cell.
2.3.2 Normal Basis Multipliers
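An element $B \in$ GF($2^m$) is represented using the normal basis as
\[
\begin{aligned}
B &= b_0\beta + b_1\beta^2 + b_2\beta^4 + \dots + b_{m-1}\beta^{2^{m-1}} \\
  &= [b_0, b_1, \dots, b_{m-1}]\,[\beta, \beta^2, \dots, \beta^{2^{m-1}}]^t \\
  &= \mathbf{b}\,\boldsymbol{\beta}^t,
\end{aligned}
\tag{2.14}
\]
or more simply by the vector of coordinates, $\mathbf{b}$, for the normal basis representation.
A powerful feature of the normal basis representation is that squaring of an
element B is simply a cyclic shift of its coordinates, which can be implemented
using a binary shift register. Since $\beta^{2^m} = \beta$,
\[
B^2 = b_{m-1}\beta + b_0\beta^2 + b_1\beta^4 + \dots + b_{m-2}\beta^{2^{m-1}}
    = \mathbf{b}^{(1)}\boldsymbol{\beta}^t,
\tag{2.15}
\]
where $\mathbf{b}^{(k)}$ represents the k-fold right cyclic shift of $\mathbf{b}$.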
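The squaring property can be checked numerically on a small field. The sketch below assumes GF($2^4$) generated by the AOP x^4 + x^3 + x^2 + x + 1 and takes $\beta$ to be a root of that polynomial, so that {$\beta$, $\beta^2$, $\beta^4$, $\beta^8$} is a normal basis; the helper names are hypothetical and the coordinate conversion is a brute-force search that is only sensible for tiny fields.

```python
M = 4
POLY = 0b11111  # x^4 + x^3 + x^2 + x + 1 (assumed example field)


def gf_mul(a, b):
    """Multiply two GF(2^4) elements held as polynomial-basis bit masks."""
    r = 0
    for _ in range(M):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


# Normal basis generated by beta = alpha (bit mask 0b0010).
BASIS = [0b0010]
for _ in range(M - 1):
    BASIS.append(gf_mul(BASIS[-1], BASIS[-1]))


def from_coords(coords):
    """Field element with the given normal-basis coordinates."""
    value = 0
    for bit, element in zip(coords, BASIS):
        if bit:
            value ^= element
    return value


def to_coords(value):
    """Normal-basis coordinates of a field element (brute-force search)."""
    for v in range(1 << M):
        coords = [(v >> i) & 1 for i in range(M)]
        if from_coords(coords) == value:
            return coords
    raise ValueError("BASIS does not span the field")


def right_cyclic_shift(coords, k=1):
    """k-fold right cyclic shift: [b0, b1, b2, b3] -> [b3, b0, b1, b2] for k = 1."""
    k %= len(coords)
    return coords[-k:] + coords[:-k]


def square_via_shift(coords):
    """Square an element given in normal-basis coordinates using only a shift."""
    return right_cyclic_shift(coords, 1)
```

Comparing `square_via_shift(b)` with `to_coords(gf_mul(B, B))` for every element B confirms that no logic gates are needed for squaring in this representation.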
Massey and Omura [29] proposed a normal basis multiplier based on the following
principle. If $\mathbf{a} = [a_0, a_1, a_2, \dots, a_{m-1}]$ and $\mathbf{b} = [b_0, b_1, b_2, \dots, b_{m-1}]$ are the
vector representations of the coordinates of two elements $A, B \in$ GF($2^m$) in a normal
basis

where the product matrix, M, is defined by

\[
\mathbf{M} = M_0\beta + M_1\beta^2 + \dots + M_{m-1}\beta^{2^{m-1}},
\tag{2.17}
\]

$\beta^{2^i + 2^j}$ is represented using the normal basis and $k = 0, 1, \dots, m-1$.
From the above equations, the coordinates of the product can be obtained using
\[
c_{m-1-k} = \mathbf{a} M_{m-1-k} \mathbf{b}^t = \mathbf{a}^{(k)} M_{m-1} \mathbf{b}^{(k)t}.
\tag{2.18}
\]
Hence, the same logic function used to implement equation (2.18) can be used to
compute all the coordinates of C serially from the cyclically shifted coordinates of
A and B. A parallel multiplier can also be obtained by using m identical replicas of
that same logic to calculate the coordinates of C simultaneously, using shifted wiring
for the inputs A and B. The main disadvantage of Massey-Omura multipliers is
that the function implementing equation (2.18) depends heavily on the choice of
the polynomial. Therefore, the structure is irregular and cannot be expanded easily
to high order fields. A general architecture for the Massey-Omura serial multiplier
is shown in Figure 2.2.

Figure 2.2: Massey-Omura serial multiplier
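The idea of reusing one bilinear function on cyclically shifted operands can be tried out on a toy field. The following sketch is not the circuit of [29] or [50]; it assumes GF($2^4$) generated by the AOP x^4 + x^3 + x^2 + x + 1 with the normal basis generated by a root of that polynomial, builds the matrix for coordinate m - 1 by brute force, and evaluates every product coordinate with that single matrix. All names are hypothetical.

```python
M = 4
POLY = 0b11111  # x^4 + x^3 + x^2 + x + 1 (assumed example field)


def gf_mul(a, b):
    """Multiply two GF(2^4) elements held as polynomial-basis bit masks."""
    r = 0
    for _ in range(M):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


# Normal basis {beta, beta^2, beta^4, beta^8} with beta = alpha (mask 0b0010).
BASIS = [0b0010]
for _ in range(M - 1):
    BASIS.append(gf_mul(BASIS[-1], BASIS[-1]))


def from_coords(coords):
    value = 0
    for bit, element in zip(coords, BASIS):
        if bit:
            value ^= element
    return value


def to_coords(value):
    for v in range(1 << M):
        coords = [(v >> i) & 1 for i in range(M)]
        if from_coords(coords) == value:
            return coords
    raise ValueError("BASIS does not span the field")


def right_cyclic_shift(coords, k):
    k %= len(coords)
    return coords[-k:] + coords[:-k]


# M_LAST[i][j] is coordinate m-1 of beta^(2^i) * beta^(2^j).
M_LAST = [[to_coords(gf_mul(BASIS[i], BASIS[j]))[M - 1] for j in range(M)]
          for i in range(M)]


def massey_omura_multiply(a, b):
    """Product coordinates from one fixed bilinear form and shifted inputs."""
    c = [0] * M
    for k in range(M):
        a_k = right_cyclic_shift(a, k)
        b_k = right_cyclic_shift(b, k)
        bit = 0
        for i in range(M):
            for j in range(M):
                bit ^= a_k[i] & M_LAST[i][j] & b_k[j]
        c[M - 1 - k] = bit      # c_(m-1-k) = a^(k) M_(m-1) b^(k)t
    return c
```

The number of ones in `M_LAST` is what determines the AND/XOR cost of the shared logic function, which is why the choice of generating polynomial matters so much for this family.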
Wang et al. [50] proposed a VLSI pipelined implementation of both the serial
and parallel versions using an AND-XOR implementation of equation (2.18) with
pipelining registers between the different levels of the XOR tree. However, this
implementation increases the design area and power considerably for large fields.
They have also pipelined the input into the shift registers in such a way that there
is no time lost between operations except for an initial fixed time delay.
Hasan et al. [18] proposed a modified Massey-Omura multiplier based on choosing
an irreducible AOP as the generating polynomial. This choice allows
$M_{m-1}$ to be written as a sum of two matrices, P + Q, as follows
\[
M_{m-1} = P + Q \pmod 2,
\tag{2.19}
\]

\[
P_{i,j} =
\begin{cases}
1 & \text{if } i = \frac{m}{2} + j \pmod m, \\
0 & \text{otherwise.}
\end{cases}
\tag{2.20}
\]
Using equation (2.19), it was shown that equation (2.18) can be written as
\[
c_{m-1-k} = \mathbf{a} P \mathbf{b}^t + \mathbf{a}^{(k)} Q \mathbf{b}^{(k)t} \pmod 2.
\tag{2.21}
\]
The proposed parallel architecture introduced a significant reduction in the hardware
complexity compared to the original parallel Massey-Omura multiplier because
the first term in the above equation is independent of k and needs to be computed
only once for all of the product coordinates. However, the critical path delay is still
the same as that in [50]. The only restriction on this architecture is the use of
AOPs as the generating polynomials.
Agnew et al. [1] presented a serial normal basis multiplier that has a regular
architecture and is therefore suitable for VLSI implementations. The multiplication
algorithm is based on that of Massey-Omura, but the authors ended up with a regular
architecture. Their algorithm starts by writing the coordinates of the product
C = AB in a bilinear form of the coordinates of A and B.

where all the indices are reduced modulo m.
Equation (2.23) is another representation of the Massey-Omura algorithm in (2.18).
In [1], the function $F_j^{[k]}$

The coordinates of A and B are stored in two registers A and B which are shifted

(0) are formed in a special
structure C that implements equation (2.24) from its previous contents and the
current contents of register A. After m clock cycles, the result is stored in C.
This multiplier architecture has a lower hardware complexity than that presented
by Massey-Omura. The hardware complexity can be reduced further by using the
optimal normal basis [37] as the basis of the multiplier.
As mentioned earlier, Koc and Sunar [5] proposed a normal basis multiplier
based on their polynomial basis one. The generating polynomial must be an AOP.
This allows them to write another set

 = f;2; 3; : : : ; mg; (2.26)
and use it as a basis to represent the elements of GF(2m). This basis is a shifted
version of the polynomial basis. The normal basis representation of an element A














CHAPTER 2. ARCHITECTURAL-LEVEL COMPARISONS 33
This conversion can be simply implemented using the permutation given by
b
0
2imod(m+1) = bi for i = 0; 1; : : : ;m  1: (2.28)
This permutation is conducted at the input bits simply by rewiring them with-
out any additional gates. The output of the polynomial basis multiplier is then
C = AB=2. Multiplying the output of the polynomial basis multiplier by 2 and
applying the inverse permutation results in the product in the normal basis rep-
resentation. This results in the same hardware complexity as in the polynomial
basis multiplier since the inverse permutation does not use any additional gates.
Therefore, the proposed multiplier has the same architectural complexity and crit-
ical path delay as the polynomial basis multiplier. It can be shown that Hasan's
multiplier [16] has exactly the same hardware complexity as the one proposed by
Koc [5] however Hasan's multiplier has a better critical delay.
Table 2.3 shows a comparison between the normal basis multipliers covered in
this chapter.

Figure 2.3: Berlekamp multiplier configured for polynomial basis multiplication
2.3.3 Dual Basis Multipliers
Dual basis multipliers, especially bit-serial ones, are known to have the lowest
hardware complexity of all available GF($2^m$) multipliers and to be particularly suited
for constant multiplication [9]. The dual basis representation was first utilized in
finite field multiplication by Berlekamp [3]. A setup for the Berlekamp serial multiplier
to perform polynomial basis multiplication is shown in Figure 2.3.
Consider the set $\{\lambda_0, \lambda_1, \dots, \lambda_{m-1}\}$ to be the dual basis of the polynomial basis
$\{1, \alpha, \dots, \alpha^{m-1}\}$, where $\alpha$ is a root of the polynomial p(x). Using the definition of

represented in the polynomial basis while $B = \sum_{i=0}^{m-1} b_i \lambda_i$

$b_0 \;\; b_1 \;\; \dots \;\; b_{m-1}$

where $b_k = \mathrm{Tr}(B\alpha^k)$ $(k = 0, 1, \dots, 2m-2)$, $c_k = \mathrm{Tr}(C\alpha^k)$ $(k = 0, 1, \dots, m-1)$, and $b_k$
and $c_k$ are the dual basis coordinates of B and C respectively. The $b_k$ $(k = m, m+1, \dots, 2m-2)$
can be generated using an LFSR initialized with the coordinates $b_k$
$(k = 0, 1, \dots, m-1)$ and the feedback connections corresponding to the nonzero

$\sum_{j=0}^{m-1} p_j b_{j+k}$, where $b_k$ $(k = 0, 1, \dots, m-1)$ are the dual basis coordinates of B
can be computed using such an LFSR. The coordinates of the product $c_k$ (k =

where $[A\alpha^k]_j$ is the jth coordinate of $A\alpha^k$ in the polynomial basis.
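A small numerical model can make the LFSR-based dual basis multiplication concrete. The sketch below is an illustration under stated assumptions, not the circuit of Figure 2.3: it assumes GF($2^4$) with p(x) = x^4 + x + 1, takes the dual basis with respect to the plain trace function, extends the coordinates of B with the recurrence $b_{m+k} = \sum_j p_j b_{j+k}$, and forms each product coordinate as $c_k = \sum_j a_j b_{j+k}$. The function names are hypothetical.

```python
M = 4
POLY = 0b10011          # p(x) = x^4 + x + 1 (assumed example field)
P_LOW = [1, 1, 0, 0]    # p_0, p_1, p_2, p_3


def gf_mul(a, b):
    """Multiply two GF(2^4) elements held as polynomial-basis bit masks."""
    r = 0
    for _ in range(M):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


def trace(x):
    """Tr(x) = x + x^2 + x^4 + x^8, which is always 0 or 1."""
    t, y = 0, x
    for _ in range(M):
        t ^= y
        y = gf_mul(y, y)
    return t


def alpha_pow(k):
    r = 1
    for _ in range(k):
        r = gf_mul(r, 0b0010)
    return r


def dual_coords(elem, count=M):
    """b_k = Tr(B * alpha^k) for k = 0..count-1."""
    return [trace(gf_mul(elem, alpha_pow(k))) for k in range(count)]


def extend_with_lfsr(b_dual):
    """Extend b_0..b_(m-1) to b_0..b_(2m-2) with the feedback taps of p(x)."""
    b = list(b_dual)
    for k in range(M - 1):
        feedback = 0
        for j in range(M):
            feedback ^= P_LOW[j] & b[j + k]
        b.append(feedback)
    return b


def dual_basis_multiply(a_poly, b_dual):
    """a_poly: polynomial-basis bits of A; b_dual: dual-basis bits of B.

    Returns the dual-basis coordinates of C = AB.
    """
    b = extend_with_lfsr(b_dual)
    return [sum(a_poly[j] & b[j + k] for j in range(M)) & 1 for k in range(M)]
```

Each output coordinate is an inner product of the fixed operand A with a sliding window of the LFSR sequence, which is why the serial form of this multiplier needs so little logic.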
The hardware complexity of the parallel multiplier presented in [9] depends
upon the Hamming weight, H(p), of the generating irreducible polynomial. The
hardware complexity of this multiplier is at its minimum when p(x) is a trinomial.
Furthermore, the delay is minimal when p(x) is a trinomial of the form
$p(x) = x^m + x + 1$. Also, when p(x) is a trinomial, only 2 additional XOR gates are required
to convert from the dual to the polynomial basis. This adds more flexibility to the
multiplier by allowing both operands to be used in the dual basis form. Fenn
et al. [10] proposed two systolic architectures to perform field multiplication in
serial and in parallel.
A serial systolic architecture was presented in [53] that is based on the multiplication
algorithm in (2.30). Using $p(x) = x^m + x + 1$ as the generating polynomial
reduces the hardware complexity since there is no need to have an input for the
coefficients of p(x). These multipliers have smaller delays and require only one
control signal, while the one presented by Fenn et al. [10] has a longer delay and
requires two control signals. Most of the GF($2^m$) dual basis multiplier architectures
found in the literature are shown in Table 2.4.
2.3.4 Composite Field Multipliers
Using composite fields to implement parallel multipliers has been proposed in [41-43].
Performing the multiplication operation using a composite field has been shown
to lower the area complexity of parallel multipliers below the $O(m^2)$ bound. A
GF($(2^n)^m$) multiplier can be built using identical modules which provide GF($2^n$)
arithmetic. Consider the field GF($2^n$) with n > 1. The elements of an extension
field GF($(2^n)^m$) may be represented in the polynomial basis as polynomials with a
maximum degree of m - 1 over GF($2^n$). Hence, the element $A \in$ GF($(2^n)^m$) can be
represented by the vector $(a_0, a_1, \dots, a_{m-1})$ where $a_i \in$ GF($2^n$) $(0 \le i \le m-1)$.
The field polynomial of the extension field is an irreducible polynomial P(x) of
degree m over GF($2^n$).
The product of two elements $A, B \in$ GF($(2^n)^m$), C = AB mod P(x), can be
computed in two steps:
1. Ordinary polynomial multiplication and
2. Reduction modulo the field polynomial P(x).
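The first step is performed using the Karatsuba-Ofman algorithm (KOA) [26].
The multiplication in the KOA saves polynomial multiplications over GF($2^n$) at
the cost of polynomial addition, which is free. For example, the first iteration in the
KOA splits the operands as
\[
\begin{aligned}
A &= x^{m/2}\,(x^{m/2-1} a_{m-1} + \dots + a_{m/2}) + (x^{m/2-1} a_{m/2-1} + \dots + a_0) = x^{m/2} A_h + A_l, \\
B &= x^{m/2}\,(x^{m/2-1} b_{m-1} + \dots + b_{m/2}) + (x^{m/2-1} b_{m/2-1} + \dots + b_0) = x^{m/2} B_h + B_l.
\end{aligned}
\]
Three intermediate variables are now defined as:
\[
\begin{aligned}
d_0 &= A_l B_l, \\
d_1 &= (A_l + A_h)(B_l + B_h), \\
d_2 &= A_h B_h.
\end{aligned}
\]
The product polynomial C' = AB is given by:
\[
C' = d_0 + x^{m/2}(d_1 - d_0 - d_2) + x^m d_2.
\]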
The reduction modulo P(x) operation can be viewed as a linear mapping of the
coefficients resulting from the multiplication operation using the polynomial basis
coefficients. The polynomial multiplication of the two polynomials A and B results
in the product polynomial C' over GF($2^n$) with $\deg(C') \le 2m - 2$. The modulo
operation will result in a polynomial C with $\deg(C) \le m - 1$ which represents the
final product.
The selection of the composite field order has a direct impact upon the hardware
complexity of the multiplier. In [41], the polynomial $P(x) = x^2 + x + p_0$, where
$p_0 \in$ GF($2^n$), has been used to develop a GF($(2^n)^2$) multiplier. The product
C = AB mod P is given by:
\[
C = (a_0 b_0 + p_0 a_1 b_1) + x\,((a_0 + a_1)(b_0 + b_1) + a_0 b_0),
\]
where $a_i, b_i \in$ GF($2^n$). The KOA was used in [42] to perform the multiplication
operation over GF($(2^n)^4$). The first two iterations of the KOA generate nine
intermediate variables $d_i$, $i = 0, 1, \dots, 8$, as follows:
\[
\begin{aligned}
d_0 &= a_0 b_0 \\
d_1 &= (a_0 + a_1)(b_0 + b_1) \\
d_2 &= a_1 b_1 \\
d_3 &= (a_0 + a_2)(b_0 + b_2) \\
d_4 &= (a_0 + a_1 + a_2 + a_3)(b_0 + b_1 + b_2 + b_3) \\
d_5 &= (a_1 + a_3)(b_1 + b_3) \\
d_6 &= a_2 b_2 \\
d_7 &= (a_2 + a_3)(b_2 + b_3) \\
d_8 &= a_3 b_3.
\end{aligned}
\]
The coefficients of the product polynomial C' can now be written as:
\[
\begin{aligned}
c'_0 &= d_0 \\
c'_1 &= d_0 + d_1 + d_2 \\
c'_2 &= d_0 + d_2 + d_3 + d_6 \\
c'_3 &= d_0 + d_1 + d_2 + d_3 + d_4 + d_5 + d_6 + d_7 + d_8 \\
c'_4 &= d_2 + d_5 + d_6 + d_8 \\
c'_5 &= d_6 + d_7 + d_8 \\
c'_6 &= d_8.
\end{aligned}
\tag{2.31}
\]
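The nine-multiplication scheme in (2.31) can be checked against the sixteen-multiplication schoolbook product. The sketch below is only a software check, not the multiplier of [42]; it assumes the subfield GF($2^4$) with field polynomial x^4 + x + 1 as an example coefficient field, and the function names are hypothetical.

```python
N = 4
SUBFIELD_POLY = 0b10011  # x^4 + x + 1, an assumed example for GF(2^n)


def gf_mul(a, b):
    """Multiply two GF(2^4) elements held as bit masks."""
    r = 0
    for _ in range(N):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << N):
            a ^= SUBFIELD_POLY
    return r


def koa4(a, b):
    """Coefficients c'_0..c'_6 of A*B for degree-3 A, B, using 9 multiplications."""
    d0 = gf_mul(a[0], b[0])
    d1 = gf_mul(a[0] ^ a[1], b[0] ^ b[1])
    d2 = gf_mul(a[1], b[1])
    d3 = gf_mul(a[0] ^ a[2], b[0] ^ b[2])
    d4 = gf_mul(a[0] ^ a[1] ^ a[2] ^ a[3], b[0] ^ b[1] ^ b[2] ^ b[3])
    d5 = gf_mul(a[1] ^ a[3], b[1] ^ b[3])
    d6 = gf_mul(a[2], b[2])
    d7 = gf_mul(a[2] ^ a[3], b[2] ^ b[3])
    d8 = gf_mul(a[3], b[3])
    return [
        d0,
        d0 ^ d1 ^ d2,
        d0 ^ d2 ^ d3 ^ d6,
        d0 ^ d1 ^ d2 ^ d3 ^ d4 ^ d5 ^ d6 ^ d7 ^ d8,
        d2 ^ d5 ^ d6 ^ d8,
        d6 ^ d7 ^ d8,
        d8,
    ]


def schoolbook4(a, b):
    """Reference product using all 16 coefficient multiplications."""
    c = [0] * 7
    for i in range(4):
        for j in range(4):
            c[i + j] ^= gf_mul(a[i], b[j])
    return c
```

Since addition in characteristic two is an XOR, the extra additions introduced by the KOA are cheap compared with the seven subfield multiplications saved.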
The second step is to compute the final product polynomial,
$C(x) = c_3 x^3 + c_2 x^2 + c_1 x + c_0$, by performing a modulo reduction operation on the
polynomial C'. The modulo reduction operation can be performed using the linear mapping
between the coefficients $c_i$ and $c'_i$ in (2.31) as follows:
\[
c_i = c'_i + \sum_{j=0}^{2} r_{i,j}\, c'_{j+4}, \quad i = 0, 1, 2, 3 \text{ and } r_{i,j} \in \mathrm{GF}(2^n).
\tag{2.32}
\]

\[
r_{i,j} =
\begin{cases}
p_i, & i = 0, \dots, 3;\; j = 0, \\
r_{i-1,j-1} + r_{3,j-1}\, r_{i,0}, & i = 0, \dots, 3;\; j = 1, 2,
\end{cases}
\]
where $r_{i-1,j-1} = 0$ if $i = 0$. The complexity of the composite field multiplier
depends on the choice of the polynomial P(x) over GF($2^n$). For example, using the
polynomial $P(x) = x^4 + x^3 + x^2 + x + 1$ in the GF($(2^n)^4$) multiplier proposed in [42]
leads to the least hardware complexity amongst other choices.
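The reduction step can be illustrated by computing the table of $r_{i,j}$ values and applying (2.32). The sketch below assumes, as an example only, coefficients in GF($2^4$) with subfield polynomial x^4 + x + 1 and a monic degree-4 field polynomial whose low coefficients $p_0, \dots, p_3$ are passed as a list; the function names are hypothetical.

```python
N = 4
SUBFIELD_POLY = 0b10011  # x^4 + x + 1, an assumed example for GF(2^n)


def gf_mul(a, b):
    """Multiply two GF(2^4) elements held as bit masks."""
    r = 0
    for _ in range(N):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << N):
            a ^= SUBFIELD_POLY
    return r


def reduction_table(p, mul=gf_mul):
    """r[i][j] with x^(4+j) = sum_i r[i][j] x^i modulo P(x).

    p: [p_0, p_1, p_2, p_3] for P(x) = x^4 + p_3 x^3 + p_2 x^2 + p_1 x + p_0.
    """
    r = [[0] * 3 for _ in range(4)]
    for i in range(4):
        r[i][0] = p[i]
    for j in (1, 2):
        for i in range(4):
            shifted = r[i - 1][j - 1] if i > 0 else 0
            r[i][j] = shifted ^ mul(r[3][j - 1], r[i][0])
    return r


def reduce_poly(c_prime, r, mul=gf_mul):
    """Map c'_0..c'_6 to c_0..c_3 using the linear mapping with table r."""
    return [
        c_prime[i]
        ^ mul(r[i][0], c_prime[4])
        ^ mul(r[i][1], c_prime[5])
        ^ mul(r[i][2], c_prime[6])
        for i in range(4)
    ]
```

For P(x) = x^4 + x^3 + x^2 + x + 1 every table entry is 0 or 1, so the mapping needs no subfield constant multipliers, which is consistent with that polynomial being the cheapest choice mentioned above.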
2.4 Conclusions
The choice of GF($2^m$) multiplier architecture depends heavily on the underlying
basis representation as well as the hardware complexity and the critical path delay
of the architecture. Polynomial basis representation has an advantage over the
other bases as it can be performed using ordinary polynomial arithmetic. In the
normal basis, squaring, which is crucial to other operations such as inversion, can
be done for free. The dual basis representation yields the simplest architectures.
Selecting serial or parallel architectures is solely dependent on the availability of
the operands at the time of computation. Also, systolic architectures allow for
pipelining while non-systolic structures are more hardware efficient. Taking all
those factors into consideration, selecting the multiplier architecture is easier. For
example, if the preference is for low hardware complexity, the non-systolic
architectures come to mind. Semi-systolic architectures are also attractive because of
their lower hardware complexity compared to fully-systolic architectures. On the
other hand, common control signals used in the semi-systolic structures make it
difficult to expand the multiplier to higher order fields. Using composite fields to
construct multiplier architectures is also attractive. Multiplication over the composite
field GF($(2^n)^m$) can be performed using GF($2^n$) arithmetic modules which
lower the area complexity and increase the modularity of the architecture. However,
the selection of a particular multiplier within each category of multipliers is
not quite clear. Combining the hardware complexity measure as well as the critical
path delay into one metric is essential in energy-critical applications. The energy




The tremendous demand for wireless communications devices has reinforced the need
to reassess designs from the power and energy dissipation perspective. This is due to
the fact that wireless devices are mainly battery operated, and the efficiency of such
a device is therefore directly tied to its battery life. Cryptographic devices have
been increasingly used since the rapid trend towards electronic financial transactions
through automated banking machines or through the Internet. Cryptographic chips
mounted on the surface of a smart card in a wireless phone or a portable device are a
very good example of energy-critical designs. Such a constrained environment needs
a strong power and energy management strategy to cope with the highly demanding
cryptographic computations and the limited energy available to the chip.
In this chapter, some of the Galois field (GF) multiplier architectures are compared
in terms of the energy metric to assess their suitability for energy-critical
CHAPTER 3. LOW-ENERGY GF MULTIPLIERS 46
applications. GF multipliers are specifically chosen for comparison here since GF
multiplication is a very critical operation in GF arithmetic. Many other operations
such as inversion can be performed using multiple multiplication operations. This
work extends the work done in [51] to include the dual basis multipliers in the
energy comparisons. Another group of multipliers, based on AOPs, is also
included in the comparison.
This chapter is an attempt to bridge the gap between the many theoretical publications
describing different approaches to VLSI-suitable GF multipliers on one side,
and the very few reports that compare the architectures from an implementation
point of view on the other side.
3.2 Sources of power dissipation in CMOS circuits
In this section, the different components of power dissipation in CMOS circuits
are defined. The power a CMOS circuit dissipates falls into two broad categories:
static and dynamic [6]. Fig. 3.1 illustrates a simple CMOS inverter and the different
components of power dissipation.

Figure 3.1: CMOS inverter and the different components of power dissipation

3.2.1 Static Power
Static power is defined as the power dissipated by a gate when it is not switching,
i.e. when it is inactive or static. Static power is dissipated in a number of ways.
The largest percentage of static power results from source-to-drain subthreshold
leakage. This leakage is caused by reduced threshold voltages that prevent the gate
from completely turning off. Static power is also dissipated when current leaks
between the diffusion layers and the substrate. For this reason, static power is
often called leakage power. The total leakage power of a design is the sum of the
leakage power of all cells in the design:
\[
P_{\text{LeakagePower}} = \sum_{\forall\, \text{cells}(i)} P_{\text{CellLeakage}_i}
\tag{3.1}
\]
where
$P_{\text{LeakagePower}}$ is the total leakage power dissipation of the design, and
$P_{\text{CellLeakage}_i}$ is the leakage power dissipation for each cell i.
Leakage power is dominant when the circuit is idle, but it becomes less than one
percent of the total power when the circuit becomes active.
3.2.2 Dynamic Power
Dynamic power is the power dissipated when the circuit is active. A circuit is
active anytime the voltage on a net changes due to some stimulus applied to the
circuit. Because voltage on a net can change without necessarily resulting in a logic
transition, dynamic power can be dissipated even when a net does not change its
logic state. The dynamic power is composed of two main components: Switching
power and Internal power.
Switching Power
The switching power of a driving cell is the power dissipated by the charging and
discharging of the load capacitance at the output of the cell. The total load
capacitance at the output of a driving cell is the sum of the net and gate capacitances on
the driving output. Because such charging and discharging is the result of the logic
transitions at the output of the cell, switching power increases as logic transitions
increase. Therefore, the switching power of a cell is a function of both the total
load capacitance at the cell output and the rate of logic transitions. It is important
to point out that switching power comprises 70-90 percent of the dynamic power
dissipation in CMOS circuits.
\[
P_{\text{switching}} = \frac{V_{dd}^2}{2} \sum_{\forall\, \text{nets}(i)} \left( C_{load_i} \times TR_i \right)
\tag{3.2}
\]
where
$C_{load_i}$ Capacitive load on net i,
$TR_i$ Toggle rate of net i, transitions per second, and
$V_{dd}$ Supply voltage.
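As a quick numerical illustration of how load capacitance and toggle rate combine, the sketch below evaluates a switching power estimate of the common form $\frac{1}{2} V_{dd}^2 \sum_i C_{load_i} TR_i$. The one-half factor and the example net values are assumptions made for this illustration, not measurements from this work, and the function name is hypothetical.

```python
def switching_power(nets, vdd):
    """Estimate switching power in watts.

    nets: iterable of (c_load_farads, toggle_rate_per_second) pairs, one per net.
    vdd:  supply voltage in volts.
    Assumes each logic transition dissipates C * Vdd^2 / 2.
    """
    return 0.5 * vdd ** 2 * sum(c_load * rate for c_load, rate in nets)


# Hypothetical example: two nets on a 3.3 V supply.
example_nets = [(10e-15, 50e6), (20e-15, 25e6)]
example_power = switching_power(example_nets, 3.3)
```

Two designs with the same gate count can therefore differ noticeably in power if their nets toggle at different rates or drive different loads.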
Internal power
The internal power is any power dissipated within the boundary of a cell. The
definition of internal power includes power dissipated by a momentary short circuit
between the P and N transistors of a gate, called short circuit power. This happens
for a short period of time during a logic transition when both the N and P
transistors are ON at the same time. During that time a short circuit current, $I_{sc}$,
flows from $V_{dd}$ to GND, causing a short circuit power, $P_{sc}$, to be dissipated.
For circuits with fast transition times, short circuit power can be small. However,
for circuits with slow transition times, short circuit power can account for
30 percent of the total power dissipated by the gate. Short circuit power is also
affected by the dimensions of the transistors and the load capacitance at the gate's
output. Internal power is mostly due to short circuit power and therefore the terms
internal power and short circuit power are used interchangeably. The internal power
formula is:
\[
P_{int} = E_{out} \times TR_{out}
\tag{3.3}
\]
where
$P_{int}$ Total internal power
$E_{out}$ Internal energy for the cell's output as a function of the input
transitions and output load
= $f[C_{load}, WAvg(trans)]$,

$Trans_i$ Transition time of input i.
3.3 Architectures Compared and Methodology
To show the importance of the energy performance measure in selecting energy-critical
designs, two sets of architectures were selected based on the conventional
measures. The gate-level architectures were built using the Cadence(TM) design tools
suite. Then the architectures were simulated using Hspice(TM) to measure the power
consumption and the maximum delay.
3.3.1 Multiplier Architectures selection
The first set of multipliers includes three parallel non-systolic architectures featuring
the least hardware complexity within each basis. The polynomial basis architecture
chosen was that of Mastrovito [30]. Hasan's architecture [18] was chosen as the
normal basis multiplier, while the multiplier proposed in [9] by Fenn et al. was
selected to represent the dual basis multiplier architectures. The second set includes
three parallel non-systolic polynomial basis architectures which use an AOP as the
field defining polynomial. The architectures selected were those proposed in [16],
[23] and [5]. The data of the two sets of multipliers are summarized in Table 3.1.
From the data shown, the architectures within each group are close to each other
in terms of the hardware complexity to guarantee a fair comparison between the
multipliers selected.
3.3.2 Methodology
The two sets of multipliers were implemented over the field GF($2^4$) using a 0.35 µm
CMOS technology and simulated at the transistor level using the Hspice simulator.
The clock frequency used was 50 MHz. Order 4 was selected because the AOP is
irreducible at that order and because the simulation time is considerable. Going to
higher orders, the AOP forces the next step to be of order 10, which is prohibitively
time consuming to simulate. The polynomial used to construct the field for both
the polynomial and the dual basis multipliers in the first set was the trinomial
$x^4 + x + 1$, for m = 4. The normal basis architecture used an AOP.
The test vectors were selected to include all the possible combinations as inputs to
the multipliers. The comparison results of the first set will be referred to as the




This section shows the comparison results between the multipliers in terms of
the critical path delay. Fig. 3.2 (a) shows the delay comparisons for the Inter-bases
group of multipliers while Fig. 3.2 (b) shows that for the AOP group. It can be
seen that the critical path delay measured is directly related to the critical path
delay estimation given in Table 3.1.

Figure 3.2: Delay comparison. Panel (a): Inter-bases results (recovered legend entries: Dual (Fenn), Normal (Hasan))
3.4.2 Power Comparison
The simulation results for the power consumption of the two sets of multipliers
are shown in Fig. 3.3. Although the hardware complexity seems to be a good
estimate for the power consumption, it fails to predict the power measure for the
Inter-bases group. The hardware complexity of that set is almost the same for the
three multipliers, but the power consumption of Mastrovito's multiplier is relatively
lower. This is because the switching activity is different for each architecture.
Recall from (3.2) that the switching power is directly proportional to the switching
activity. Therefore, hardware complexity is not the only measure of the power
consumption; the interconnects inside the architecture also affect the power
dissipation.
[Figure: panel (a) Inter-bases results; legend entries Dual (Fenn), Normal (Hasan)]
Figure 3.3: Power consumption comparison
3.4.3 Energy Comparison
Figure 3.4 illustrates the calculated energy results for the two groups of
multipliers. In the inter-bases comparison, the dual basis multiplier came in
third place in both the delay and the power comparisons, and the energy
comparison reiterated that conclusion. However, if we were interested in the
first and the second best architectures, the delay and power results would not have
been sufficient. Ranking the multipliers is not obvious from the data given in
Table 3.1 alone. The energy metric makes the choice of the most efficient
architecture very clear.
In the AOP polynomial basis comparison, the power results could not distinguish
the best architecture since Hasan's and Koç's multipliers have approximately the
same power consumption. When the delay results were taken into consideration in
the energy comparison, selecting the most efficient architecture was fairly simple.
[Figure: panel (a) Inter-bases results; legend entries Dual (Fenn), Normal (Hasan)]
Figure 3.4: Energy comparison
3.5 Conclusion
Energy-critical designs have to consider the energy consumed as the primary
comparative measure rather than hardware complexity or critical path delay. Those
traditional measures are no longer sufficient for selecting a suitable architecture
for wireless or portable devices. Selecting the most efficient device for
energy-critical applications becomes unclear when the compared architectures have
nearly the same hardware complexity and critical path delay. The energy comparison
should be taken into consideration in order to select the best architecture for a
particular application.
Chapter 4
Bit Serial Multiplication over a
class of Finite Fields
4.1 Introduction
Many multiplier architectures have been previously introduced to efficiently perform
finite field multiplication. Examples of serial multipliers are [8, 11, 14, 15, 56] while
parallel ones can be found in [5, 16, 23, 30, 50]. A number of all-one polynomial
(AOP) multipliers have been proposed since using an AOP as the field defining
polynomial was shown to produce hardware efficient architectures [5, 11, 16, 23].
The first AOP-based multiplier was reported by Itoh and Tsujii [23]. In [16],
Hasan et al. extended the work of [23] and presented an efficient multiplier using
AOPs. The extended polynomial basis representation, first introduced in [23], was
used in [11] to develop a serial AOP multiplier. Koç and Sunar proposed another
parallel polynomial basis AOP multiplier that can also be used for normal basis
multiplication through the introduction of the shifted polynomial basis [5]. Other
basis representations have recently been proposed to develop hardware and software
efficient multipliers, such as the redundant basis [56], the polynomial ring [8] and
the palindromic representation [4].
In this chapter we present a parallel-in serial-out multiplier using all-one
polynomials. The proposed multiplier uses a modified version of the polynomial basis,
which is referred to as the shifted polynomial basis (SPB). This basis was first
introduced in [5] and it can be formed when the field defining polynomial is an AOP.
The SPB can be easily converted to the normal basis representation using a simple
permutation circuit. This chapter is organized as follows. An introductory
background and a review of the shifted polynomial basis representation follow in
Section 4.2. More background information can be found in [45] and [35]. The
proposed multiplication algorithm is introduced in Section 4.3. Section 4.4 describes
the architecture of the proposed multiplier and presents a comparison between the
proposed architecture and other $GF(2^m)$ serial multipliers.
4.2 AOP Related Bases of Representations
Consider the AOP $g(x) = 1 + x + \cdots + x^m$ of degree $m$. The polynomial $g(x)$ is
irreducible if and only if $m + 1$ is prime and 2 is primitive mod $(m + 1)$ [35]. For
$m \le 100$, the AOP is irreducible when $m = 2, 4, 10, 12, 18, 28, 36, 52, 58, 60, 66, 82$
and 100 [23].
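The irreducibility condition above is easy to check numerically. The following is a minimal sketch, assuming small degrees so that trial division and a brute-force order computation are adequate; the function names are chosen here for illustration only.

```python
def is_prime(n):
    """Trial-division primality test (adequate for small n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def two_is_primitive(p):
    """True if 2 generates the multiplicative group modulo the odd prime p."""
    order, x = 1, 2 % p
    while x != 1:
        x = (x * 2) % p
        order += 1
    return order == p - 1


def aop_irreducible_degrees(max_m):
    """Degrees 2 <= m <= max_m for which 1 + x + ... + x^m is irreducible over GF(2)."""
    return [m for m in range(2, max_m + 1)
            if is_prime(m + 1) and two_is_primitive(m + 1)]
```

Calling `aop_irreducible_degrees(100)` reproduces the list of degrees quoted above.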
Consider the binary field $GF(2)$ and its finite extension $GF(2^m)$, which is of
particular interest in many applications. Let $\alpha \in GF(2^m)$ be a root of $g(x)$.
Then the elements $\alpha^i$, $i = 0, 1, \ldots, m-1$, form the basis
$\{1, \alpha, \ldots, \alpha^{m-1}\}$, which is a polynomial basis (PB). On the other
hand, the elements $\alpha^{2^i}$, $i = 0, 1, \ldots, m-1$, form the basis
$\{\alpha, \alpha^2, \ldots, \alpha^{2^{m-1}}\}$, which is a normal basis (NB).
Since $\alpha$ is a root of $g(x)$, $g(\alpha) = 1 + \alpha + \alpha^2 + \cdots + \alpha^m = 0$.
Thus, $\alpha^m = 1 + \alpha + \cdots + \alpha^{m-1}$, and

$$\alpha^{m+1} = 1 \qquad (4.1)$$

If an AOP is used to construct the field $GF(2^m)$, then from (4.1) the NB
$\{\alpha, \alpha^2, \ldots, \alpha^{2^{m-1}}\}$ can be rewritten in the form
$\{\alpha, \alpha^2, \ldots, \alpha^m\}$, which is referred to as the shifted
polynomial basis (SPB) [5]. For example, if $m = 4$, then $\alpha^5 = 1$ and
$\alpha^8 = \alpha^3$. Therefore, the NB $\{\alpha, \alpha^2, \alpha^4, \alpha^8\}$ can
be rewritten as $\{\alpha, \alpha^2, \alpha^4, \alpha^3\}$, which is the SPB of
$GF(2^4)$ over $GF(2)$.
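The reordering in the example can be reproduced by reducing the normal basis exponents modulo $m + 1$. This is only an illustrative sketch; the function name is hypothetical.

```python
def nb_exponents_as_spb(m):
    """Exponents 2^i of the normal basis elements, reduced mod (m + 1), for i = 0..m-1."""
    return [pow(2, i, m + 1) for i in range(m)]
```

For m = 4 the exponents come out as 1, 2, 4, 3, matching the example above. Whenever the AOP of degree m is irreducible, the result is a permutation of 1, ..., m.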











Let an element $A \in GF(2^m)$ be written as

$$A = \sum_{i=0}^{m-1} \hat{a}_i \alpha^{2^i} = \sum_{i=1}^{m} a_i \alpha^i,$$

where $\hat{a}_i$, $i = 0, 1, \ldots, m-1$, is the $i$th coordinate of $A$ represented in
the NB and $a_i$, $i = 1, 2, \ldots, m$, is the $i$th coordinate of $A$ represented in the
SPB. The conversion from the NB to the SPB representation can be done using the
permutation $P$ [5], which is given by

$$a_{(2^i)} = \hat{a}_i, \qquad i = 0, 1, \ldots, m-1,$$

where $(\cdot)$ denotes a mod $(m + 1)$ operation. The result can be converted back to
the NB form using the inverse permutation $P^{-1}$. Both the permutation and the
inverse permutation are implemented by rewiring without using any extra gates.
Converting SPB to PB can be done using $m - 1$ XOR gates. The transformation
required for the conversion takes the form

$$\tilde{a}_0 = a_m, \qquad \tilde{a}_i = a_i + a_m, \quad i = 1, 2, \ldots, m-1,$$

where $\tilde{a}_i$, $i = 0, 1, \ldots, m-1$, is the $i$th coordinate of $A$ with respect to
the PB. The inverse transformation is given by

$$a_i = \tilde{a}_0 + \tilde{a}_i, \quad i = 1, 2, \ldots, m-1, \qquad a_m = \tilde{a}_0.$$

Hence, converting the SPB to and from the NB can be performed for free, while
converting to and from the PB requires $m - 1$ XOR gates.
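The conversions of this section can be sketched in software as follows. The data layout is an assumption made for illustration: an SPB element is a list of length m + 1 whose entry i holds the coefficient of $\alpha^i$ (entry 0 is unused and kept at 0), while NB and PB elements are lists of length m.

```python
def nb_to_spb(a_nb):
    """NB -> SPB: the coordinate of alpha^(2^i) moves to index 2^i mod (m + 1)."""
    m = len(a_nb)
    a = [0] * (m + 1)
    for i, bit in enumerate(a_nb):
        a[pow(2, i, m + 1)] = bit
    return a


def spb_to_nb(a):
    """SPB -> NB: inverse permutation of nb_to_spb."""
    m = len(a) - 1
    return [a[pow(2, i, m + 1)] for i in range(m)]


def spb_to_pb(a):
    """SPB -> PB using alpha^m = 1 + alpha + ... + alpha^(m-1); costs m - 1 XORs."""
    m = len(a) - 1
    return [a[m]] + [a[i] ^ a[m] for i in range(1, m)]


def pb_to_spb(p):
    """PB -> SPB: inverse of spb_to_pb; also m - 1 XORs."""
    m = len(p)
    return [0] + [p[0] ^ p[i] for i in range(1, m)] + [p[0]]
```

For example, with m = 4 the element $\alpha^4$ is `[0, 0, 0, 0, 1]` in the SPB and `spb_to_pb` maps it to `[1, 1, 1, 1]`, that is $1 + \alpha + \alpha^2 + \alpha^3$.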
4.3 Multiplication and Squaring over the Shifted
Polynomial Basis
The algorithms described in this section use an approach first introduced
in [15] and [16]. In [15] a serial multiplier for any irreducible polynomial was
developed, while a parallel AOP multiplier was introduced in [16]. Here, a serial
AOP multiplier is developed. The basis of representation used here is the SPB,
while that used in [15] and [16] was the PB.
4.3.1 Multiplication
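Let $C = AB \in GF(2^m)$ and assume that $A$, $B$ and $C$ are all represented with
respect to the SPB. Then,

$$C = \sum_{k=1}^{m} c_k \alpha^k = \sum_{i=1}^{m} \sum_{j=1}^{m} a_i b_j \alpha^{i+j}.$$

Each power $\alpha^{i+j}$ can be written in the SPB as
$\alpha^{i+j} = \sum_{k=1}^{m} \lambda_{ij}^{(k)} \alpha^k$, where $\lambda_{ij}^{(k)}$
is the $k$th coordinate of the element $\alpha^{i+j}$ with respect to the SPB.
Therefore,

$$c_k = \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_{ij}^{(k)} a_i b_j, \qquad k = 1, 2, \ldots, m. \qquad (4.5)$$

Since $\alpha^{m+1} = 1$, for any integer $l$ we have

$$\alpha^l = \alpha^{(l)},$$

and since $\alpha^0 = 1 = \alpha + \alpha^2 + \cdots + \alpha^m$,

$$\lambda_{ij}^{(k)} = \begin{cases}
1 & (i+j) = 0, \text{ and } k = 1, 2, \ldots, m \\
1 & 1 \le (i+j) \le m, \text{ and } k = (i+j) \\
0 & 1 \le (i+j) \le m, \text{ and } k \ne (i+j)
\end{cases}$$

Substituting into (4.5) and defining $a_0 = 0$ gives

$$c_k = \sum_{i=1}^{m} b_i \left( a_{(-i)} + a_{(k-i)} \right), \qquad k = 1, 2, \ldots, m. \qquad (4.6)$$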
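The coordinate formula for SPB multiplication, $c_k = \sum_i b_i (a_{(-i)} + a_{(k-i)})$ with $a_0 = 0$, can be checked against ordinary polynomial arithmetic modulo the AOP. The sketch below assumes the same illustrative layout as before (a list of length m + 1, index i holding the coefficient of $\alpha^i$, index 0 fixed at 0); the helper names are hypothetical.

```python
def spb_mul(a, b):
    """SPB product: c_k = XOR_i b_i & (a_(-i) ^ a_(k-i)), indices mod (m + 1), a_0 = 0."""
    n = len(a)                      # n = m + 1
    c = [0] * n
    for k in range(1, n):
        for i in range(1, n):
            c[k] ^= b[i] & (a[(-i) % n] ^ a[(k - i) % n])
    return c


def spb_to_poly(a):
    """Pack coefficients into an integer: bit i is the coefficient of x^i."""
    return sum(bit << i for i, bit in enumerate(a))


def clmul(p, q):
    """Carry-less (GF(2)) polynomial multiplication."""
    r = 0
    while q:
        if q & 1:
            r ^= p
        p <<= 1
        q >>= 1
    return r


def reduce_mod_aop(p, m):
    """Reduce a GF(2) polynomial modulo 1 + x + ... + x^m (degree of result < m)."""
    g = (1 << (m + 1)) - 1
    while p.bit_length() > m:
        p ^= g << (p.bit_length() - 1 - m)
    return p
```

Two SPB elements multiply correctly when `reduce_mod_aop(spb_to_poly(spb_mul(a, b)), m)` equals `reduce_mod_aop(clmul(spb_to_poly(a), spb_to_poly(b)), m)`; for m = 4 this can be confirmed exhaustively over all 256 operand pairs.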














Figure 4.1: Squaring over the shifted polynomial basis
Squaring of a field element represented in the SPB form can be performed by
converting the SPB to the NB form. Converting SPB to/from NB is a simple
permutation which can be implemented by rewiring, as discussed earlier. Squaring
in the NB is just a cyclic shift. The squared NB element is then converted back to
the SPB form. This operation can be performed in only 2 clock cycles. Fig. 4.1
shows the structure required to perform SPB squaring.
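The squaring path just described (permute to NB, shift cyclically, permute back) can be sketched as below. The list layout and function names are assumptions for illustration: NB coordinate i belongs to $\alpha^{2^i}$, and an SPB element is a list of length m + 1 with index 0 unused.

```python
def spb_to_nb(a):
    """SPB -> NB permutation: NB coordinate i is the SPB coordinate at 2^i mod (m + 1)."""
    m = len(a) - 1
    return [a[pow(2, i, m + 1)] for i in range(m)]


def nb_to_spb(a_nb):
    """NB -> SPB permutation (inverse of spb_to_nb)."""
    m = len(a_nb)
    a = [0] * (m + 1)
    for i, bit in enumerate(a_nb):
        a[pow(2, i, m + 1)] = bit
    return a


def spb_square(a):
    """Square an SPB element via the NB: squaring maps alpha^(2^i) to alpha^(2^(i+1))."""
    nb = spb_to_nb(a)
    nb_squared = [nb[-1]] + nb[:-1]     # cyclic shift by one position
    return nb_to_spb(nb_squared)
```

In the SPB itself the net effect is that the coefficient of $\alpha^i$ moves to $\alpha^{(2i)}$, so for m = 4 the element $\alpha + \alpha^2$ squares to $\alpha^2 + \alpha^4$.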
4.4 Multiplier Architecture And Comparison
Using the multiplication algorithm described in the previous section, a parallel-in
serial-out multiplier is presented below. From (4.6), it can be noted that each
coordinate $c_k$, $1 \le k \le m$, has the terms corresponding to $a_{(-i)}$ and $m$
elements from the sequence $\{0, a_m, a_{m-1}, \ldots, a_1\}$. The first coordinate,
$c_1$, has the first $m$ elements of the sequence, the second coordinate, $c_2$,
contains the first $m$ elements of the one-fold right cyclic shift of that sequence,
and so on. Therefore, the multiplier can be implemented using an $(m + 1)$-cell
cyclic shift register initialized with the elements of the sequence
$\{0, a_m, a_{m-1}, \ldots, a_1\}$ and $m$ XOR gates to add the terms corresponding
to $a_{(-i)}$. In order to produce the partial products and to accumulate them,
$m$ AND gates and $m - 1$ XOR gates are required. The architecture produces one
bit of the product at a time, starting with $c_1$. Clocking the register $m$ times
produces the output sequence $\{c_1, c_2, \ldots, c_m\}$. The structure of the
proposed multiplier contains a total of $m + (m - 1) = 2m - 1$ XOR gates and
$m$ AND gates in addition to $m + 1$ registers.
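A behavioural model of the register-based scheme may help in following the data flow. This is a software sketch only, assuming the list layout used earlier (index i holds the coefficient of $\alpha^i$, index 0 is 0); it mirrors the cyclic register and the inner-product step but does not claim to reproduce the exact wiring of the hardware.

```python
def serial_multiply(a, b):
    """Model of the parallel-in serial-out SPB multiplier; returns [c_1, ..., c_m]."""
    n = len(a)                                     # n = m + 1
    m = n - 1
    reg = [a[(-t) % n] for t in range(n)]          # register: 0, a_m, a_(m-1), ..., a_1
    fixed = [a[(-i) % n] for i in range(1, n)]     # the a_(-i) terms, i = 1..m
    out = []
    for _ in range(m):                             # one product bit per clock
        bit = 0
        for i in range(1, n):
            # m XORs form (a_(-i) + register cell); m ANDs and m - 1 XORs accumulate
            bit ^= b[i] & (fixed[i - 1] ^ reg[i - 1])
        out.append(bit)
        reg = [reg[-1]] + reg[:-1]                 # one-fold right cyclic shift
    return out
```

For m = 4, `serial_multiply([0, 0, 0, 1, 0], [0, 0, 0, 0, 1])` models $\alpha^3 \cdot \alpha^4 = \alpha^7 = \alpha^2$ and yields the bits of $c_1, \ldots, c_4$ in that order.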
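Example: For $m = 4$, the multiplication of any two elements $A$ and $B \in GF(2^4)$
represented in the SPB follows directly from (4.6) as:

$$\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix} =
\begin{bmatrix}
a_4 & a_3 + a_4 & a_2 + a_3 & a_1 + a_2 \\
a_4 + a_1 & a_3 & a_2 + a_4 & a_1 + a_3 \\
a_4 + a_2 & a_3 + a_1 & a_2 & a_1 + a_4 \\
a_4 + a_3 & a_3 + a_2 & a_2 + a_1 & a_1
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}$$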




















The matrix multiplication above can be performed using the multiplier
architecture shown in Fig. 4.2. The register cells $A_1$ through $A_4$ are initialized
with $a_1$ through $a_4$ respectively, while $A_5$ is first set to 0. The matrix
multiplication, i.e. the inner-product unit, is implemented by the AND and XOR gates
at the upper part of the architecture. The multiplier needs 4 clock cycles to
serially produce the result, $\{c_1, c_2, c_3, c_4\}$.
[Figure: cyclic register cells A1 to A5 feeding an inner-product unit with inputs b1 to b4 and serial output c1, c2, c3, c4]
Figure 4.2: The proposed multiplier for multiplication over $GF(2^4)$
A comparison between the proposed multiplier and other related multipliers
proposed in the literature is shown in Table 4.1. The proposed multiplier has the
same number of AND gates but fewer registers and XOR gates than the multiplier
proposed in [15] when its field defining polynomial is an AOP. When the I/O
format is parallel-in/serial-out, the proposed multiplier is the most efficient
among the architectures presented in Table 4.1. The all-one polynomial
multiplier (AOPM) presented in [11] uses fewer XOR gates but it requires
$m + 1$ clock cycles to complete the operation. When the multiplication operation
is to be performed in $m$ clock cycles in the modified all-one polynomial multiplier
(MAOPM) architecture [11], the hardware complexity rises significantly. Compared
to the architectures presented in [8] and in [56], the proposed multiplier has a
higher throughput but a slightly larger hardware complexity. Higher throughput is
attractive in many applications, which favors the proposed multiplier over the
other ones.
4.5 Conclusion
A parallel-in serial-out finite field multiplier based on using an irreducible AOP as
the field defining polynomial is proposed. The multiplier uses the SPB
representation. The proposed SPB multiplier can perform NB multiplication after
adding conversion modules to the inputs and output. The conversion modules can be
implemented without adding any gate complexity to the multiplier structure. The
proposed multiplier is also capable of performing polynomial basis multiplication
by adding $m - 1$ XOR gates to convert to the SPB. Also, the multiplier can perform
the multiplication operation more efficiently than other parallel-in/serial-out
multipliers. The proposed multiplier has a very regular architecture and is
therefore well suited for VLSI implementation.
Chapter 5
Elliptic Curve Coprocessor
Elliptic Curve Cryptosystems (ECC) have been gaining attention recently as one
of the promising cryptographic techniques. ECCs offer the same level of security as
other public key systems with much smaller key lengths. For example, an ECC with
a 160-bit key can provide the same level of security as RSA with a 1024-bit key
[25]. This allows for more efficient hardware and software implementations than
the other alternatives. Standardization of the ECC is currently underway. Among
the efforts in that regard is the IEEE P1363 draft standard [20]. The ANSI
X9.62 and X9.63 standards have already been approved by the US government [39].
ECC can be implemented over any group of elements. Of particular interest are the
implementations over the group of integers less than a prime number $p$ and over the
finite field $GF(2^m)$. The $GF(2^m)$ implementation of the elliptic curve system has
been shown to be practical in constrained environments such as smart cards [25].
In this chapter the Elliptic Curve Cryptosystem is described. The curves over
the extension field of characteristic 2, $GF(2^m)$, are considered. The multiplier
algorithm described in Chapter 4 is used to perform multiplication. The projective
coordinates are used to avoid inversion over $GF(2^m)$ [34]. The design of an Elliptic
Curve coprocessor is also described. The coprocessor has been simulated using
VHDL to verify the functionality.
5.1 Elliptic Curve Cryptosystem
The elliptic curve cryptosystem was independently proposed by Victor Miller [36] and
Neal Koblitz [27] in the mid-eighties. As with any public key cryptosystem, it takes
years to reach a reasonable level of confidence in it. In the last few years the first
commercial implementations have started to appear in many real-world applications,
such as email security, web security, smart cards, etc. More detailed information
about elliptic curve cryptosystems can be found in [32].
5.1.1 Elliptic curves governing equations over $GF(2^m)$
Elliptic curves over $GF(2^m)$ can be divided into two sets. The set of all solutions
for the equation

$$Y^2 + aY = X^3 + bX + c \qquad (5.1)$$

where $a, b, c \in GF(2^m)$, $a \ne 0$, together with the point $O$ at infinity, is a
supersingular curve over $GF(2^m)$. The second set includes all the solutions for the
equation

$$Y^2 + XY = X^3 + aX^2 + b \qquad (5.2)$$

where $a, b \in GF(2^m)$, $b \ne 0$, which together with the point $O$ at infinity form a
non-supersingular curve over $GF(2^m)$. The pair $(X, Y)$ represents a point on the curve
$E$ if both $X$ and $Y$ satisfy (5.1) or (5.2). The coordinate system represented by
the pair $(X, Y)$ is called the affine coordinate system. For cryptographic purposes,
the non-supersingular family of curves is more attractive. That family of curves
is more resistant against the baby-step giant-step attack, one of the most powerful
attacking algorithms known today [2].
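5.2 Elliptic Curve Operations over $GF(2^m)$
The elliptic curve group operation for the non-supersingular family of curves is
defined as the addition of two elliptic curve points $P_1 = (X_1, Y_1)$ and
$P_2 = (X_2, Y_2)$ resulting in a third point $P_3 = (X_3, Y_3)$. The addition and
doubling formulas are

$$X_3 = \lambda^2 + \lambda + X_1 + X_2 + a, \qquad Y_3 = \lambda (X_1 + X_3) + X_3 + Y_1, \qquad (5.3)$$

where

$$\lambda = \begin{cases}
\dfrac{Y_1 + Y_2}{X_1 + X_2} & \text{if } P_1 \ne P_2, \\[2mm]
X_1 + \dfrac{Y_1}{X_1} & \text{if } P_1 = P_2.
\end{cases}$$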
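The affine group law can be exercised on a toy field. The sketch below assumes $GF(2^4)$ built with the AOP $x^4 + x^3 + x^2 + x + 1$ in the polynomial basis (bit i of an integer is the coefficient of $x^i$); the curve parameters `A_COEF` and `B_COEF` and all function names are hypothetical choices for illustration, and `None` stands for the point at infinity.

```python
M = 4
G = 0b11111                       # AOP x^4 + x^3 + x^2 + x + 1
A_COEF, B_COEF = 0b0011, 0b0001   # toy curve parameters (hypothetical), b != 0


def fmul(p, q):
    """Multiply two elements of GF(2^M) in the polynomial basis."""
    r = 0
    while q:
        if q & 1:
            r ^= p
        q >>= 1
        p <<= 1
        if (p >> M) & 1:
            p ^= G
    return r


def finv(p):
    """Inverse of a nonzero element, computed as p^(2^M - 2)."""
    r = 1
    for _ in range(2 ** M - 2):
        r = fmul(r, p)
    return r


def on_curve(P, a, b):
    """Check y^2 + xy = x^3 + a x^2 + b (the point at infinity counts as on the curve)."""
    if P is None:
        return True
    x, y = P
    return (fmul(y, y) ^ fmul(x, y)) == (fmul(fmul(x, x), x) ^ fmul(a, fmul(x, x)) ^ b)


def ec_add(P, Q, a):
    """Affine addition/doubling on a non-supersingular curve over GF(2^M)."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return None                               # Q = -P, or P has order two
        lam = x1 ^ fmul(y1, finv(x1))                 # doubling slope
    else:
        lam = fmul(y1 ^ y2, finv(x1 ^ x2))            # addition slope
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    y3 = fmul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)
```

Enumerating all pairs (x, y) that satisfy `on_curve` and adding them pairwise gives results that again satisfy the curve equation, which is a convenient sanity check on the formulas.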
The most common representations of the $GF(2^m)$ elements in elliptic curve
computations are the polynomial and the normal basis representations, especially
the optimal normal basis [2]. In this work, the shifted polynomial basis is used to
represent the field elements. More information about the shifted polynomial basis
can be found in Chapter 4.
5.2.1 Group Operation Algorithms using Projective Coordinates
The point addition formula in (5.3) requires inversion in the underlying finite field.
That inversion operation is very slow and is considered the main bottleneck in the
process. To avoid inversion, the projective coordinates were proposed in [34].
The point $(X, Y) \in E$ in the affine coordinates is mapped to the point
$(X : Y : Z) \in E$, $Z = 1$, in the projective coordinates. The inverse mapping between
the projective and the affine coordinates is done through dividing by $Z$, which
results in $(X/Z : Y/Z : 1)$. The identity point $O$ would be $(0 : 1 : 0)$. Using
the projective coordinates, the addition of the two points $P = (X_1 : Y_1 : Z_1)$ and
$Q = (X_2 : Y_2 : Z_2)$, if $P \ne Q$, would become

$$X_3 = AD, \qquad Y_3 = CD + A^2 (B X_1 + A Y_1) Z_2, \qquad Z_3 = A^3 Z_1 Z_2, \qquad (5.4)$$

where $A = X_2 Z_1 + X_1 Z_2$, $B = Y_2 Z_1 + Y_1 Z_2$, $C = A + B$ and
$D = A^2 (A + a Z_1 Z_2) + Z_1 Z_2 B C$. If $P = Q$, the doubling formulas are

$$X_3 = AB, \qquad Y_3 = X_1^4 A + B (X_1^2 + Y_1 Z_1 + A), \qquad Z_3 = A^3, \qquad (5.5)$$

where $A = X_1 Z_1$ and $B = b Z_1^4 + X_1^4$.
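As a consistency check, the projective addition expressions for $X_3$ and $Z_3$ can be recovered from the affine formula (5.3) by substituting $x_i = X_i / Z_i$ and $y_i = Y_i / Z_i$. The lower-case symbols are introduced here only for this sketch, and $A$, $B$, $C$, $D$ denote $X_2 Z_1 + X_1 Z_2$, $Y_2 Z_1 + Y_1 Z_2$, $A + B$ and $A^2 (A + a Z_1 Z_2) + Z_1 Z_2 B C$ respectively.

```latex
\begin{align*}
\lambda &= \frac{y_1 + y_2}{x_1 + x_2}
         = \frac{Y_1 Z_2 + Y_2 Z_1}{X_1 Z_2 + X_2 Z_1}
         = \frac{B}{A}, \\
x_3 &= \lambda^2 + \lambda + x_1 + x_2 + a
     = \frac{B^2}{A^2} + \frac{B}{A} + \frac{A}{Z_1 Z_2} + a \\
    &= \frac{Z_1 Z_2\, B (A + B) + A^2 (A + a Z_1 Z_2)}{A^2 Z_1 Z_2}
     = \frac{D}{A^2 Z_1 Z_2}
     = \frac{A D}{A^3 Z_1 Z_2}
     = \frac{X_3}{Z_3}.
\end{align*}
```

The same substitution applied to $y_3 = \lambda (x_1 + x_3) + x_3 + y_1$, multiplied through by $A^3 Z_1 Z_2$, gives $C D + A^2 (B X_1 + A Y_1) Z_2$.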





The elliptic curve group operations are included in the IEEE P1363 draft
standard [20]. Details of those algorithms are given below.
Projective elliptic point doubling algorithm:
The Double algorithm doubles a point $P_1 = (X_1 : Y_1 : Z_1)$ in terms of
the projective coordinates.
Input: A point $P_1 = (X_1 : Y_1 : Z_1)$ on the elliptic curve defined by the parameters
$a$ and $b$.
Output: The point $P_2 = (X_2 : Y_2 : Z_2) = 2P_1$ on the curve.
Projective elliptic point addition algorithm:
This algorithm, Add-pnt, adds two points $P_0 = (X_0 : Y_0 : Z_0)$ and
$P_1 = (X_1 : Y_1 : Z_1)$ in terms of the projective coordinates.
Input: The two points $P_0 = (X_0 : Y_0 : Z_0)$ and $P_1 = (X_1 : Y_1 : Z_1)$ on the elliptic
curve defined by the parameters $a$ and $b$.
Output: The point $P_2 = (X_2 : Y_2 : Z_2) = P_0 + P_1$ on the curve.
Add-pnt and Double are shown in Table 5.1.
Table 5.1: The Point Doubling (Double) and Point Addition (Add-pnt) algorithms
Double                       Add-pnt
 1. T1 ← X1                   1. T1 ← X0           23. T2 ← T2 × T4
 2. T2 ← Y1                   2. T2 ← Y0           24. T7 ← T3^2
 3. T3 ← Z1                   3. T3 ← Z0           25. T8 ← T7 × T8
 4. T4 ← c = b^(2^(m-2))      4. T4 ← X1           26. T1 ← T1 × T8
 5. T2 ← T2 × T3              5. T5 ← Y1           27. T1 ← T1 + T2
 6. T3 ← T3^2                 6. T8 ← a            28. T4 ← T1 × T4
 7. T4 ← T3 × T4              7. T6 ← T3^2         29. T2 ← T4 + T6
 8. T3 ← T1 × T3              8. T7 ← T4 × T6      30. X2 ← T1
 9. T2 ← T2 + T3              9. T1 ← T1 + T7      31. Y2 ← T2
10. T4 ← T1 + T4             10. T6 ← T3 × T6      32. Z2 ← T3
11. T4 ← T4^2                11. T7 ← T5 × T6
12. T4 ← T4^2                12. T2 ← T2 + T7
13. T1 ← T1^2                13. T4 ← T2 × T4
14. T2 ← T1 + T2             14. T3 ← T1 × T3
15. T2 ← T2 × T4             15. T5 ← T3 × T5
16. T1 ← T1^2                16. T4 ← T4 + T5
17. T1 ← T1 × T3             17. T5 ← T3^2
18. T2 ← T1 + T2             18. T6 ← T4 × T5
19. T1 ← T4                  19. T4 ← T2 + T3
20. X2 ← T1                  20. T2 ← T2 × T4
21. Y2 ← T2                  21. T5 ← T1^2
22. Z2 ← T3                  22. T1 ← T1 × T5
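The Double column of the table can be traced in software and compared with affine doubling. The sketch below covers the doubling sequence only and rests on stated assumptions: a toy field $GF(2^4)$ with the AOP as defining polynomial, hypothetical curve parameters `A_COEF` and `B_COEF`, and the weighted projective mapping $x = X/Z^2$, $y = Y/Z^3$, under which the register sequence reproduces the affine result. All function names are illustrative.

```python
M = 4
G = 0b11111                       # AOP x^4 + x^3 + x^2 + x + 1
A_COEF, B_COEF = 0b0011, 0b0001   # toy curve parameters (hypothetical), b != 0


def fmul(p, q):
    """Multiply two elements of GF(2^M) in the polynomial basis."""
    r = 0
    while q:
        if q & 1:
            r ^= p
        q >>= 1
        p <<= 1
        if (p >> M) & 1:
            p ^= G
    return r


def finv(p):
    """Inverse of a nonzero element, computed as p^(2^M - 2)."""
    r = 1
    for _ in range(2 ** M - 2):
        r = fmul(r, p)
    return r


def table_double(X1, Y1, Z1, b):
    """Register-level trace of the Double sequence (steps numbered as in the table)."""
    c = b
    for _ in range(M - 2):
        c = fmul(c, c)               # c = b^(2^(M-2))
    T1, T2, T3, T4 = X1, Y1, Z1, c   # steps 1-4
    T2 = fmul(T2, T3)                # 5
    T3 = fmul(T3, T3)                # 6
    T4 = fmul(T3, T4)                # 7
    T3 = fmul(T1, T3)                # 8
    T2 ^= T3                         # 9
    T4 ^= T1                         # 10
    T4 = fmul(T4, T4)                # 11
    T4 = fmul(T4, T4)                # 12
    T1 = fmul(T1, T1)                # 13
    T2 ^= T1                         # 14
    T2 = fmul(T2, T4)                # 15
    T1 = fmul(T1, T1)                # 16
    T1 = fmul(T1, T3)                # 17
    T2 ^= T1                         # 18
    T1 = T4                          # 19
    return T1, T2, T3                # 20-22: X2, Y2, Z2


def to_affine(X, Y, Z):
    """Assumed mapping x = X / Z^2, y = Y / Z^3 (Z nonzero)."""
    zi = finv(Z)
    zi2 = fmul(zi, zi)
    return fmul(X, zi2), fmul(Y, fmul(zi2, zi))


def affine_double(x, y, a):
    """Affine doubling for x != 0, used here as the reference."""
    lam = x ^ fmul(y, finv(x))
    x3 = fmul(lam, lam) ^ lam ^ a
    y3 = fmul(x, x) ^ fmul(lam ^ 1, x3)
    return x3, y3
```

The sequence uses five multiplications, five squarings and four additions, which can be read off directly from the commented steps.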
5.2.2 Scalar Multiplication
The operation in which a point $P$ on the elliptic curve $E$ is added to itself $k$
times is denoted by $kP$ and is called scalar multiplication. The algorithm
Smultiply performs the scalar multiplication of an elliptic curve point
$P = (X : Y : Z)$ by an integer $k$. This is the main operation in key establishment
protocols such as the Diffie-Hellman key exchange.
Input: The point $P = (X : Y : Z)$ on the elliptic curve defined by the parameters
$a$ and $b$ and a random integer $k$.
Output: The point $Q = (X_2 : Y_2 : Z_2) = kP$ on the curve.
Table 5.2: The Scalar Multiplication (Smultiply) algorithm
Smultiply
1. If k = 0 or Z = 0 then output (1, 1, 0) and stop.
2. X2 ← X.
3. Z2 ← Z.
4. Z1 ← 1.
5. Y2 ← Y.
6. If Z2 = 1 then set X1 ← X2, Y1 ← Y2.
7. Let k_l k_(l-1) ... k_1 k_0 be the binary representation of k.
8. For i from l - 1 downto 0 do
   8.1 Set (X2, Y2, Z2) ← Double[(X2, Y2, Z2)].
   8.2 If k_i = 1 then set (X2, Y2, Z2) ← Add-pnt[(X2, Y2, Z2), (X1, Y1, Z1)].
9. Output (X2, Y2, Z2).
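The double-and-add loop can be sketched independently of the underlying group. In the sketch below the group operations are passed in as functions, which is a choice made for illustration; in the usage note integer addition stands in for the elliptic curve operations so that the result is easy to check.

```python
def scalar_multiply(k, P, double, add, identity):
    """Left-to-right binary scalar multiplication: returns k * P."""
    if k == 0:
        return identity
    bits = bin(k)[2:]            # k_l k_(l-1) ... k_0, with k_l = 1
    Q = P                        # accounts for the leading bit k_l
    for bit in bits[1:]:         # i = l - 1 down to 0
        Q = double(Q)
        if bit == "1":
            Q = add(Q, P)
    return Q
```

With `double = lambda q: 2 * q`, `add = lambda q, p: q + p` and `identity = 0`, the call `scalar_multiply(k, 7, double, add, 0)` returns `7 * k`. Every bit after the leading one costs one doubling, and every nonzero bit after the leading one costs one addition.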
Using the non-adjacent form (NAF) has been suggested to improve the performance
of the Smultiply algorithm [12].
A NAF of an integer $k$ is defined as a signed binary expansion with the property
that no two consecutive coefficients are nonzero. Any integer $k$ has a unique
NAF(k) representation, which has the fewest nonzero coefficients of any signed
binary expansion of $k$. To derive NAF(k), $k$ is repeatedly divided by 2 and the
remainder of 0 or $\pm 1$ is stored. If the remainder is to be $\pm 1$, the stored
remainder is chosen to make the quotient even.
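The derivation of NAF(k) described above can be written as a short routine. This is a minimal sketch for nonnegative integers; the digit list is returned least significant digit first, which is a convention chosen here.

```python
def naf(k):
    """Non-adjacent form of k >= 0, least significant digit first, digits in {-1, 0, 1}."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)      # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= d               # makes k divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits
```

For example, `naf(7)` gives `[-1, 0, 0, 1]`, that is 7 = 8 - 1, which has two nonzero digits instead of the three in the binary form 111.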
The algorithm NAF-Smultiply, which uses NAF(k) instead of the pure binary
representation of $k$, is shown in Table 5.3.
Table 5.3: NAF-Scalar Multiplication (NAF-Smultiply) algorithm
NAF-Smultiply
1. If k = 0 or Z = 0 then output (1, 1, 0) and stop.
2. X2 ← X.
3. Z2 ← Z.
4. Z1 ← 1.
5. n ← k.
5. Y2 ← Y.
6. If Z2 = 1 then set X1 ← X2, Y1 ← Y2.
7. Let k_l k_(l-1) ... k_1 k_0 be the binary representation of k.
8. For i from l - 1 downto 1 do
   8.1 Set (X2, Y2, Z2) ← Double[(X2, Y2, Z2)].
   8.2 If k_i = 1 then
       8.2.1 Set u ← 2 - (k mod 4)
       8.2.2 Set k ← k - u
       8.2.3 If u = 1 then set (X2, Y2, Z2) ← Add-pnt[(X2, Y2, Z2), (X1, Y1, Z1)]
       8.2.4 If u = -1 then set (X2, Y2, Z2) ← Add-pnt[(X2, Y2, Z2), (X1, X1 Z1 + Y1, Z1)]
9. Output (X2, Y2, Z2).
5.2.3 Diffie-Hellman Key Exchange
The discrete logarithm problem described by Diffie and Hellman is based on the
problem of finding logarithms with respect to a primitive element in the
multiplicative group of integers modulo a prime $p$. This idea can be extended to
arbitrary groups, with the difficulty of the problem depending upon the choice of
the group. The EC-based Diffie-Hellman key exchange protocol is illustrated in
Fig. 5.1. More details about the D-H key exchange can be found in [33, 52]. The
basic steps are
1. Setup: Alice and Bob agree on a common elliptic curve, $E$, and a point on
that curve, $P$, with the coordinates $(X_P : Y_P : Z_P) \in E$. Alice generates a
random number $a$, which is her secret key, and Bob generates his secret key $b$.
2. Communication: Alice computes the new point $Q_A = aP$ and sends it to
Bob while Bob computes $Q_B = bP$ and sends it to Alice. Now, $Q_A$ and $Q_B$
are the public keys. Although $P$ is common to both $Q_A$ and $Q_B$, the ECDLP
ensures that it is computationally infeasible to recover $a$ from $Q_A$ or
$b$ from $Q_B$.
3. Final Step: Alice computes $aQ_B = a(bP) = (X_A : Y_A : Z_A)$ and Bob computes
$bQ_A = b(aP) = (X_B : Y_B : Z_B)$.
Figure 5.1: Diffie-Hellman Key Exchange Protocol
After the final step, Alice and Bob can compute the shared session key $K$ as
$K = X_A / Z_A = X_B / Z_B$. An adversary cannot recover the session key $K$, as he does
not know the secret keys $a$ and $b$. The difficulty of recovering the original message
lies in the difficulty of recovering the secret keys from the public keys. This recovery
problem is called the Elliptic Curve Discrete Logarithm Problem (ECDLP). Once
the session key is established between Alice and Bob, both parties can communicate
securely using a private key algorithm such as DES for faster encryption speeds.
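The exchange can be illustrated on a toy curve. The sketch below is for illustration only and makes several assumptions: the field is $GF(2^4)$ with the AOP as defining polynomial, the curve parameters `A_COEF` and `B_COEF` are hypothetical, points are kept in affine coordinates with `None` as the point at infinity, and the group is far too small to offer any security.

```python
M = 4
G = 0b11111                       # AOP x^4 + x^3 + x^2 + x + 1
A_COEF, B_COEF = 0b0011, 0b0001   # toy curve parameters (hypothetical), b != 0


def fmul(p, q):
    """Multiply two elements of GF(2^M) in the polynomial basis."""
    r = 0
    while q:
        if q & 1:
            r ^= p
        q >>= 1
        p <<= 1
        if (p >> M) & 1:
            p ^= G
    return r


def finv(p):
    """Inverse of a nonzero element, computed as p^(2^M - 2)."""
    r = 1
    for _ in range(2 ** M - 2):
        r = fmul(r, p)
    return r


def ec_add(P, Q):
    """Affine addition/doubling on y^2 + xy = x^3 + A_COEF x^2 + B_COEF."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return None
        lam = x1 ^ fmul(y1, finv(x1))
    else:
        lam = fmul(y1 ^ y2, finv(x1 ^ x2))
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEF
    return (x3, fmul(lam, x1 ^ x3) ^ x3 ^ y1)


def scalar_mul(k, P):
    """Binary double-and-add; returns k * P."""
    Q = None
    for bit in bin(k)[2:]:
        Q = ec_add(Q, Q)
        if bit == "1":
            Q = ec_add(Q, P)
    return Q


def base_point():
    """First affine point with x != 0 found by exhaustive search (toy sizes only)."""
    for x in range(1, 2 ** M):
        for y in range(2 ** M):
            lhs = fmul(y, y) ^ fmul(x, y)
            rhs = fmul(fmul(x, x), x) ^ fmul(A_COEF, fmul(x, x)) ^ B_COEF
            if lhs == rhs:
                return (x, y)
    return None
```

With `P = base_point()`, Alice's public point is `scalar_mul(a, P)` and Bob's is `scalar_mul(b, P)`; both sides then arrive at the same shared point because `scalar_mul(a, scalar_mul(b, P))` equals `scalar_mul(b, scalar_mul(a, P))`.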
5.3 Elliptic Curve Coprocessor Architecture
In this section an elliptic curve coprocessor is presented. The coprocessor uses the
projective coordinates to represent the curve points over $GF(2^m)$. The multiplier
and squarer architectures selected for this implementation were presented in
Chapter 4 and use the SPB representation. Therefore, the field
defining polynomial is restricted to be an AOP. Practically speaking, the field order
should be 162, 172, 180, 196 or 226 to provide a reasonable security level with








Figure 5.2: The elliptic curve coprocessor architecture
5.3.1 Overview
The elliptic curve coprocessor is designed to perform the cryptographic
computations and to relieve the main processor in the system of that task. The
architecture can perform different finite field arithmetic operations. It is also
capable of performing the elliptic curve group operations: add, double and scalar
multiplication. The coprocessor architecture performs those operations through
hardwired control logic. The size of the field used is variable and can be
reprogrammed if the design is implemented on reconfigurable hardware, e.g. FPGAs.
5.3.2 Coprocessor Architecture
The coprocessor architecture is divided into three major blocks. The datapath
unit has all the hardware required to perform the basic $GF(2^m)$ operations such as
multiplication, addition, and inversion, as well as EC operations, and some registers
to hold intermediate results. The control unit controls the operation of the whole
coprocessor. The I/O unit buffers instructions and data coming from/written to
the main processor. It is mainly used to buffer data to be read/written through the
I/O interface, which is usually smaller in size. The architecture of the coprocessor
is shown in Figure 5.2.
Datapath
The datapath is the main computing core of the coprocessor. The datapath
architecture is shown in Figure 5.3. The datapath can be divided into two main
building blocks, the arithmetic block and the storage block.
Figure 5.3: Datapath architecture
The arithmetic block has four registers, R0, R1, RI and C, to perform the
multiplication operation. R1, RI and C are all $m$-bit registers while R0 is an
$(m + 1)$-bit register. Both R0 and R1 are cyclic shift registers, where R0 can be
shifted to the right while R1 can be shifted to the left. R0 acts as the A register
in Fig. 4.2. A field adder is connected to registers R0 and R1. The inner-product
unit is connected to the RI register and its result is stored in the C register in a
bit-serial fashion. Two conversion units are used to convert the SPB operands
to/from NB. Those conversion modules are used to implement the squaring operation.
Squaring is performed using the relation between the shifted polynomial basis (SPB)
and the normal basis (NB) mentioned in Section 4.3. Squaring in the SPB can be done
in only two clock cycles, one to convert from the SPB to the NB and the other to
square in the NB and convert back to the SPB at the same time. Squaring in the NB
is done by cyclically shifting the register R1 to the left. The storage block holds
the temporary variables required for the elliptic curve point addition and
doubling. There are eight temporary registers, T1 through T8, required for the
point addition and point doubling algorithms. The registers T7 and T8 are special
registers that can be used in the conversion of SPB to/from PB. The register T7 is
used to convert from SPB to PB while T8 is used in the conversion from PB to SPB.
Two registers are required to hold the original point coordinates, $X$ and $Y$, and
another two registers are used to hold the curve parameters $a$ and $b$. The integer
value $k$ used in the scalar multiplication operation is stored in the K register.
Controller
The control unit is a finite state machine (FSM) that includes all the control
sequences for the different instructions. The FSM has the enumerated states {Idle,
Fetch, Decode, Exec1, Exec2, Exec3}. The FSM remains in the Idle state while the
Receive signal is inactive. The state of the FSM goes to Fetch when the Receive
signal becomes active. The FSM remains in the Fetch state until the instruction is
available for the controller to read. Once the instruction is read, the FSM goes to
the Decode state. If the fetched instruction is a Load, the coprocessor goes to the
Exec state and remains there until the data is ready to be processed. However, if
the instruction is not a Load, the FSM performs the required actions in the Decode
state and goes to the Exec state. Depending upon the type of instruction being
executed, the FSM goes to the Exec2 or Exec3 state and may remain in the Exec3
state for some time until the execution ends. For example, when executing a Multiply
instruction, the FSM remains in the Exec3 state for $m$ clock cycles before writing
the result and going back to Idle. For the Unload instruction, the FSM remains in
the Exec state until the DataOut buffer of the I/O unit is emptied, then it goes
back to Idle.
The elliptic curve point operations are hardwired inside the controller so that
the AddPoint and DoublePoint instructions execute a sequence of other instructions
according to Table 5.1. The scalar multiplication operation is also hardwired.
Although hardwiring instructions increases the hardware required, it simplifies
the programming of the coprocessor and reduces the program size. Two counters
are used inside the control unit: the k counter indicates the bit of the number $k$
being processed, and the operation counter holds the current operation number in the
sequence of operations inside a subroutine. This number is indicated before each
operation in Table 5.1.
I/O Unit
The elliptic curve coprocessor is designed to handle operands in $GF(2^m)$ of $m$-bit
size. Other cryptosystems such as RSA require operand sizes of more than 1000
bits. This large size compared to ordinary arithmetic unit creates a challenge in the











Figure 5.4: I/O unit structure
Bit-parallel I/O transfers for that large data size are prohibitively complex even
with modern VLSI technologies. In contrast, bit-serial I/O operations are much
simpler but exhibit a very long delay compared to their parallel counterparts.
Splitting the large operand into smaller equally-sized chunks of bits seems to be the
most efficient approach to accomplish I/O operations. The size of the bit-chunk
can be set according to the interface width of the other device connected to the
coprocessor.
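The chunked transfer idea can be sketched as follows. The function names, the least-significant-chunk-first ordering and the zero padding of the last chunk are assumptions made for illustration; they are not taken from the coprocessor design.

```python
def split_operand(value, m, width):
    """Split an m-bit operand into ceil(m / width) chunks, least significant chunk first."""
    n_chunks = -(-m // width)            # ceiling division
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(n_chunks)]


def join_chunks(chunks, width):
    """Reassemble an operand from chunks produced by split_operand."""
    value = 0
    for i, chunk in enumerate(chunks):
        value |= chunk << (i * width)
    return value
```

For a 196-bit operand and a 32-bit interface, `split_operand` produces seven chunks, the last of which carries only four significant bits.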







(a) Read Data operation






(b) Write Data operation
Figure 5.5: Read/Write Operation
The I/O unit has two buffering registers to hold the data being transmitted.
The DataIn register holds the incoming data and the DataOut register buffers the
outgoing data. The InstructionReg buffers the instructions. The I/O unit structure
is shown in Figure 5.4.
The I/O operation uses a handshaking protocol to start transmitting
instructions/data. Two input signals, Receive and I_D_sel, and the Ready output
signal are used to perform the input handshaking protocol. The Receive signal
informs the coprocessor of incoming data/instruction while I_D_sel identifies its
nature. The Ready signal becomes inactive whenever the instruction or data buffer
is partially full. The Send signal is used to inform other devices connected to the
coprocessor that data needs to be written out. Send becomes inactive when the
outgoing data buffer is empty. The read and write handshaking operations are shown
in Figs. 5.5 (a) and (b) respectively.
5.3.3 Instruction Set Architecture
5 149 8121316
Opcode Des Src1 Src2
Figure 5.6: Instruction set architecture
The 16-bit instruction is divided into 4 parts, 4-bit each as shown in Fig. 5.6.
The instruction set can be divided into three main groups. One group has the
register load, copy, clear and unload operations. This group only provides the
opcode and the destination, Des, register. The two source operands, Src1 and Src2,
are not provided and are not used. The second group includes all the nite eld
CHAPTER 5. ELLIPTIC CURVE COPROCESSOR 85
Table 5.5: Instruction Set
Opcode Mnemonic Operands Description
0000 NOP None No Operation
0001 LoadReg Des Des  (dpData port in I/O unit)
0010 UnloadReg Des (Data port of I/O unit)  Des
0011 CopyReg Des, Src1 Des  Src1
0100 ClrReg Des Des  0
0101 Add Des, Src1, Src2 Des  Src1 + Src2
0110 Multiply Des, Src1, Src2 Des  Src1 * Src2
0111 Square Des, Src1 Des  (Src1)2
1000 NB2SPB Des, Src1 Des  NB-to-SPB(Src1)
1001 SPB2NB Des, Src1 Des  SPB-to-NB(Src1)
1010 PB2SPB Des, Src1 (Src1 = T8) Des  PB-to-SPB(Src1)
1011 SPB2PB Des, Src1 (Src1 = T7) Des  SPB-to-PB(Src1)
1100 Invert Des, Src1 Des  inv(Src1)
1101 DoublePoint None EC point (T1; T2; T3) doubling
1110 AddPoint None EC point (T1; T2; T3) addition
1111 Smultiply None EC point scalar multiplication
arithmetic instructions. Each instruction in this group provides the destination operand
and one or two source operands (see Table 5.5) in addition to the opcode. The arithmetic
operations supported by the coprocessor architecture are: multiply, add, square,
invert, convert SPB to/from NB and convert SPB to/from PB. The elliptic curve
point operations form the third group. Instructions in this group provide only the opcode
and carry no information about the operands. The hardwired elliptic curve
operations are: Add point, Double point and Scalar multiply. Although these
hardware macros add significant hardware complexity to the chip, they greatly
simplify the programming task. The coprocessor instructions are shown in
Table 5.5.
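As an illustration of the 16-bit format of Fig. 5.6, the following Python sketch packs and unpacks an instruction word using the opcodes of Table 5.5. It is not part of the coprocessor design: the placement of the opcode in the most significant nibble, the function names encode and decode, and the numeric register indices are assumptions made only for this example.

```python
# Opcodes copied from Table 5.5.
OPCODES = {
    "NOP": 0b0000, "LoadReg": 0b0001, "UnloadReg": 0b0010, "CopyReg": 0b0011,
    "ClrReg": 0b0100, "Add": 0b0101, "Multiply": 0b0110, "Square": 0b0111,
    "NB2SPB": 0b1000, "SPB2NB": 0b1001, "PB2SPB": 0b1010, "SPB2PB": 0b1011,
    "Invert": 0b1100, "DoublePoint": 0b1101, "AddPoint": 0b1110,
    "Smultiply": 0b1111,
}
MNEMONICS = {code: name for name, code in OPCODES.items()}


def encode(mnemonic, des=0, src1=0, src2=0):
    """Pack four 4-bit fields into one 16-bit word.

    Assumed field order (hypothetical): Opcode | Des | Src1 | Src2,
    with the opcode in the most significant nibble.
    """
    for field in (des, src1, src2):
        if not 0 <= field <= 0xF:
            raise ValueError("register index must fit in 4 bits")
    return (OPCODES[mnemonic] << 12) | (des << 8) | (src1 << 4) | src2


def decode(word):
    """Split a 16-bit word back into (mnemonic, des, src1, src2)."""
    return (MNEMONICS[(word >> 12) & 0xF], (word >> 8) & 0xF,
            (word >> 4) & 0xF, word & 0xF)


# Multiply with hypothetical register indices 3, 1 and 2.
word = encode("Multiply", des=3, src1=1, src2=2)
print(f"{word:016b}")
print(decode(word))
```

Instructions of the first and third groups simply leave the unused fields at zero in this sketch; the actual hardware may treat them as don't-care bits.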
5.4 Comparison
Table 5.6 shows the number of field operations performed in each EC operation.
The large number of multiplications results from using projective coordinates,
which avoid field inversions.
Table 5.6: Operation count for Point Doubling and Addition
Operation              EC Doubling   EC Addition
Field Addition         4             6
Field Multiplication   5             13
Field Squaring         5             4
The performance analysis of the proposed design is shown in Table 5.7. The
number of clock cycles required to perform the field operations is indicated, together
with the clock cycle count of EC point addition and doubling. For m =
196 and a clock frequency of 80 MHz, the time required for each operation is given.
Inversion is the slowest field operation since the design is not optimized for inversion. The
scalar multiplication operation was implemented over the Galois Field Processor
(GFP) in [13] using a software program and the affine coordinate system.
Point doubling on the proposed design is about two times faster than on the GFP, while
EC addition is slightly slower. Point doubling is a crucial operation in scalar
multiplication since it is performed m - 1 times, while point addition is performed
(m - 1)/2 times on average, for the
binary representation of the scalar.
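The operation counts quoted above can be checked with a short Python sketch. It counts the doublings and additions of the left-to-right binary (double-and-add) method for a given scalar and combines the average counts, m - 1 doublings and (m - 1)/2 additions, with the field-operation counts of Table 5.6. The function names are hypothetical, and control overhead and any final coordinate conversion are ignored.

```python
# Field-operation counts per EC operation, copied from Table 5.6.
FIELD_OPS = {
    "double": {"add": 4, "mul": 5, "sqr": 5},
    "add": {"add": 6, "mul": 13, "sqr": 4},
}


def binary_method_op_count(k):
    """Return (doublings, additions) of left-to-right double-and-add, k > 0.

    The leading 1 bit only initialises the accumulator; every later bit
    costs one doubling, and every later 1 bit costs one extra addition.
    """
    bits = bin(k)[2:]
    return len(bits) - 1, bits.count("1") - 1


def average_field_ops(m):
    """Average field-operation totals for an m-bit scalar."""
    doublings = m - 1
    additions = (m - 1) / 2  # half of the remaining bits are 1 on average
    return {
        op: doublings * FIELD_OPS["double"][op]
        + additions * FIELD_OPS["add"][op]
        for op in ("add", "mul", "sqr")
    }


print(binary_method_op_count(0b1011))
print(average_field_ops(196))
```

Although a single addition needs more multiplications than a single doubling, doubling is executed twice as often, which is why its cost weighs heavily on the total.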
Table 5.7: Performance of the proposed architecture
Operation              Clock cycles required        Time in μs
                                                    EC coprocessor (1)   GFP (2) [13]
Field Addition         3                            0.03                 0.03
Field Multiplication   m + 2                        2.43                 2.41
Field Squaring         3                            0.03                 2.41
Field Inversion        3(m - 1) + n(m + 2) (3)      31.4                 4.81
EC Doubling            5m + 37                      12.5                 24.61
EC Addition            13m + 56                     31.9                 27.05
(1) m = 196, assuming a clock frequency of 80 MHz.
(2) m = 191, assuming a clock frequency of 80 MHz.
(3) n = number of multiplications required for one inversion (e.g. for m = 196, n = 10 [22]).
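The cycle formulas of Table 5.7 can be evaluated with the following Python sketch. The value of n is obtained from the usual multiplication count of the Itoh-Tsujii method [22], floor(log2(m - 1)) + HW(m - 1) - 1, where HW denotes the Hamming weight; for m = 196 this gives n = 10, in agreement with the footnote. The function names are hypothetical, and the conversion to time assumes exactly one clock period per cycle at 80 MHz, so the printed times are close to, but not identical with, the tabulated values.

```python
def itoh_tsujii_mults(m):
    """Multiplications needed for one inversion in GF(2^m) (Itoh-Tsujii)."""
    e = m - 1
    return (e.bit_length() - 1) + bin(e).count("1") - 1


def cycles(m):
    """Clock-cycle counts per operation, using the formulas of Table 5.7."""
    n = itoh_tsujii_mults(m)
    return {
        "Field Addition": 3,
        "Field Multiplication": m + 2,
        "Field Squaring": 3,
        "Field Inversion": 3 * (m - 1) + n * (m + 2),
        "EC Doubling": 5 * m + 37,
        "EC Addition": 13 * m + 56,
    }


def to_microseconds(cycle_count, f_clk_hz=80e6):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycle_count / f_clk_hz * 1e6


for name, count in cycles(196).items():
    print(f"{name:22s}{count:6d} cycles {to_microseconds(count):8.2f} us")
```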
5.5 Conclusion
A GF(2^m) elliptic curve (EC) coprocessor has been designed to speed up the Diffie-Hellman
key exchange protocol. The coprocessor uses the parallel-in serial-out GF(2^m) finite field
multiplier proposed in Chapter 4. The architecture of the coprocessor as well as the
instruction set have been described. The design has been simulated using VHDL
to verify its functionality. The proposed design uses the projective coordinate
system. The coprocessor performs elliptic curve point addition, point doubling,
and scalar multiplication. These elliptic curve functions are hardwired inside the
controller of the coprocessor to facilitate the programming task. EC point doubling
implemented in software on the design in [13], using the affine coordinate system,
has been found to be slower than the same operation on the proposed architecture.
Speeding up the EC point doubling operation has a positive impact on the performance
of elliptic curve scalar multiplication, which is the main operation in the Diffie-Hellman
key exchange algorithm.
Chapter 6
Conclusion and Future Work
6.1 Summary and Conclusion
The choice of GF(2^m) multiplier architecture depends heavily on the underlying
basis representation as well as on the hardware complexity and the critical path delay of
the architecture. Selecting a serial or a parallel architecture depends on the availability
of the operands at the time of computation. In addition, systolic architectures allow for
pipelining, while non-systolic structures are more hardware efficient. These are the
conventional measures that can be used in the selection process. They
are no longer sufficient for choosing an architecture for wireless or
portable devices. Selecting the most suitable device for energy-critical applications
becomes unclear when the architectures under consideration have nearly the
same hardware complexity and critical path delay. The energy metric then has
to be taken into consideration in order to select the most efficient architecture for
a particular application.
A parallel-in serial-out finite field multiplier based on an irreducible AOP as
the field defining polynomial has been proposed. The proposed multiplier can perform
polynomial basis as well as normal basis multiplication after adding conversion
modules to the inputs and output. Also, the multiplier can perform the multiplication
operation more efficiently than other parallel-in/serial-out multipliers. The
proposed multiplier has a very regular architecture and is therefore well suited for
VLSI implementation.
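The arithmetic property that makes an AOP attractive can be illustrated with a small behavioural model in Python. Since x^(m+1) = 1 modulo the AOP x^m + x^(m-1) + ... + x + 1, a product can be accumulated by cyclic rotations of an (m + 1)-bit word, followed by a single reduction step. This is only a sketch in the ordinary polynomial basis, with bit i holding the coefficient of x^i; it does not model the shifted polynomial basis datapath of the proposed multiplier, and the function name aop_mul is hypothetical.

```python
def aop_mul(a, b, m):
    """Multiply a and b in GF(2^m) defined by the all-one polynomial.

    Assumes the AOP of degree m is irreducible (e.g. m = 4) and that
    a and b are integers below 2^m in polynomial-basis form.
    """
    n = m + 1
    mask = (1 << n) - 1
    acc = 0
    for i in range(n):
        if (b >> i) & 1:
            # a * x^i modulo x^n + 1 is a cyclic left rotation by i bits.
            acc ^= ((a << i) | (a >> (n - i))) & mask
    # Reduce modulo the AOP: x^m = x^(m-1) + ... + x + 1, so a set
    # top bit is cleared by XOR-ing with the all-ones word.
    if (acc >> m) & 1:
        acc ^= mask
    return acc


# m = 4: x * x^3 = x^4, which reduces to x^3 + x^2 + x + 1.
print(format(aop_mul(0b0010, 0b1000, 4), "04b"))
```

The rotation step is what keeps the hardware regular: no polynomial-specific feedback taps are needed, only a cyclic shift and one conditional inversion of all bits.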
An elliptic curve coprocessor that uses the proposed multiplier has been designed using
the projective coordinates. The projective coordinates are advantageous in avoiding
inversion in the underlying finite field. The coprocessor architecture as well as the
instruction set have been developed. VHDL has been used in verifying the design.
The coprocessor is able to perform elliptic curve point addition, point doubling,
and scalar multiplication. These elliptic curve functions are hardwired inside the
controller of the coprocessor to facilitate the programming task. The use of the
projective coordinate system has enabled us to reduce the computation time for
point doubling. Speeding up the EC point doubling operation has a positive impact
on the elliptic curve scalar multiplication, which is the main operation in the
Diffie-Hellman key exchange algorithm.
6.2 Recommendations for Future Work
This thesis has considered the hardware structure of an elliptic curve cryptosystem
over the finite field GF(2^m). It has also examined the different multiplier
architectures which can be used for that system. The implementation of such a system
requires a thorough analysis of its building blocks. The work that could still be done
in this regard is summarized below.
1. The energy measure could be extended to compare more finite field computing
devices such as inverters and squarers.
2. The size of the multiplier architectures simulated in this work has been
restricted by the available CAD tools. Simulating higher orders, even degree
10, would have taken a very long time. Using other tools, such as PowerMill(TM),
could have shortened the simulation time considerably.
3. Efficient AOP inversion would be very helpful in performing EC computations
in the affine coordinates rather than the projective coordinates. Using the




The following example was simulated on the coprocessor and verified using a
MATLAB program. The field order is: m = 4.
Input
Point coordinates: X = 0010, Y = 0001.





The following simulation results validate the architecture.
Figure 7.1: Simulation Waveforms
Figure 7.2: Simulation Waveforms (cont.)
Figure 7.3: Simulation Waveforms (cont.)
Figure 7.4: Simulation Waveforms (cont.)
Figure 7.5: Simulation Waveforms (cont.)
Figure 7.6: Simulation Waveforms (cont.)
Figure 7.7: Simulation Waveforms (cont.)
Figure 7.8: Simulation Waveforms (cont.)
Figure 7.9: Simulation Waveforms (cont.)
Figure 7.10: Simulation Waveforms (cont.)
Bibliography
[1] G. B. Agnew, R. C. Mullin, I. M. Onyszchuk, and S. A. Vanstone. An
Implementation for a Fast Public-Key Cryptosystem. Journal of Cryptology,
3(2):63-79, 1991.
[2] G. B. Agnew, R. C. Mullin, and S. A. Vanstone. An Implementation of
Elliptic Curve Cryptosystem Over F_(2^155). IEEE Journal on Selected Areas in
Communications, 11(5):804-813, June 1993.
[3] E. R. Berlekamp. Bit-Serial Reed-Solomon Encoders. IEEE Transactions on
Information Theory, 28(6):869-874, Nov. 1982.
[4] I. Blake, R. Roth, and G. Seroussi. Efficient Arithmetic in
GF(2^m) through palindromic representation. Visual computing
dept., Hewlett Packard Laboratories, 1998. Available at:
http://www.hpl.hp.com/techreports/98/HPL-98-134.html.
[5] C. K. Koc and B. Sunar. Low-Complexity Bit-Parallel Canonical and Normal
Basis Multipliers for a Class of Finite Fields. IEEE Transactions on
Computers, 47(3):353-356, March 1998.
[6] A. P. Chandrakasan and R. W. Brodersen. Low power digital CMOS design.
Kluwer Academic Publishers, 1995.
[7] M. Diab. Systolic architectures for Multiplication over Finite Field GF(2^m).
In Applied Algebra, Algebraic algorithms and Error-Correcting codes. 8th
International Conference, AAECC-8, pages 329-340, 1991.
[8] G. Drolet. A new representation of elements of Finite Fields GF(2^m)
yielding small complexity arithmetic circuits. IEEE Transactions on Computers,
47(9):938-946, Sept. 1998.
[9] S. T. J. Fenn, M. Benaissa, and D. Taylor. GF(2^m) Multiplication and Division
Over the Dual Basis. IEEE Transactions on Computers, 45(3):319-327, Mar.
1996.
[10] S. T. J. Fenn, M. Benaissa, and D. Taylor. Dual basis Systolic Multipliers for
GF(2^m). IEE Proceedings-E, 144(1):43-46, Jan. 1997.
[11] S. T. J. Fenn, M. G. Parker, M. Benaissa, and D. Taylor. Bit-serial
multiplication in GF(2^m) using Irreducible all-one polynomials. IEE Proceedings-E,
144(6):391-393, Nov. 1997.
[12] D. M. Gordon. A survey of fast exponentiation methods. Journal of
Algorithms, 27:129-146, 1998.
[13] A. Hasan and A. Wassal. VLSI Algorithms, Architectures and Implementation
of a Versatile GF(2^m) Processor. Submitted to the IEEE Transactions on
Computers, 1999.
[14] M. A. Hasan and V. K. Bhargava. Bit-Serial Systolic Divider and Multiplier for
Finite Fields GF(q^m). IEEE Transactions on Computers, 41(8):972-980, Aug.
1992.
[15] M. A. Hasan and V. K. Bhargava. Division and Bit-Serial Multiplication over
GF(q^m). IEE Proceedings-E, 139(3):230-236, May 1992.
[16] M. A. Hasan and V. K. Bhargava. Modular Construction of low complexity
Parallel Multipliers for a Class of Finite Fields GF(2^m). IEEE Transactions
on Computers, 41(8):962-971, Aug. 1992.
[17] M. A. Hasan and V. K. Bhargava. Architecture for a low complexity
rate-adaptive Reed-Solomon encoder. IEEE Transactions on Computers,
44(7):938-942, July 1995.
[18] M. A. Hasan, M. Z. Wang, and V. K. Bhargava. A Modified Massey-Omura
Parallel Multiplier for a Class of Finite Fields. IEEE Transactions on Computers,
42(10):1278-1280, Oct. 1993.
[19] I. Hsu, T. Truong, L. Deutsch, and I. Reed. A Comparison of VLSI
Architecture of Finite Field Multipliers using Dual, Normal, or Standard Bases. IEEE
Transactions on Computers, 37(6):735-739, June 1988.
[20] IEEE. IEEE P1363: Editorial Contribution to Standard for Public-Key
Cryptography, August 1999.
[21] S. Ishii, K. Oyama, and K. Yamanaka. A High-Speed Public Key Encryption
Processor. Systems and Computers in Japan, 29(1):20-32, Jan. 1998.
[22] T. Itoh and S. Tsujii. Computing Multiplicative Inverses in GF(2^m) using
Normal Bases. Information and Computation, 78(3):171-177, Sept. 1988.
[23] T. Itoh and S. Tsujii. Structure of Parallel Multipliers for a Class of Fields
GF(2^m). Information and Computation, 83(1):21-40, Oct. 1989.
[24] S. K. Jain, L. Song, and K. K. Parhi. Efficient Semisystolic architectures for
Finite Field Arithmetic. IEEE Transactions on Computers, 6(1):101-113, March
1998.
[25] D. B. Johnson and A. J. Menezes. Elliptic Curve DSA (ECDSA):
An Enhanced DSA. Certicom White Papers, 1998. Available at:
http://www.certicom.com/ecc/wpaper.htm.
[26] D. Knuth. The art of computer programming. Vol. 2: Semi-numerical
Algorithms. Reading, Massachusetts: Addison-Wesley, 2nd ed., 1981.
[27] N. Koblitz. Elliptic Curve Cryptosystems. In Mathematics of Computations,
volume 48, pages 203-209, 1987.
[28] R. Lidl and H. Niederreiter. Introduction to Finite Fields and their
applications. Cambridge University Press, 1994.
[29] J. L. Massey and J. K. Omura. Computational method and apparatus for Finite
Field arithmetic. U.S. Patent 4587627, issued May 1986.
[30] E. D. Mastrovito. VLSI architectures for computations in Galois field. PhD
thesis, Dept of Electrical Eng., Linkoping University, S-581 83 Linkoping, Sweden,
1991.
[31] M. C. Mekhallalati and A. S. Ashur. Novel structures for Serial Multiplication
over the Finite Field. Journal of VLSI Signal Processing Systems for Signal,
Image and Video Technology, 15(3):223-245, March 1993.
[32] A. Menezes. Elliptic Curve Public-Key Cryptosystems. Kluwer Academic
Publishers, 1993.
[33] A. Menezes, P. Oorschot, and S. Vanstone. Handbook of Applied Cryptography.
CRC Press, 1997.
[34] A. J. Menezes and S. A. Vanstone. Elliptic Curve Cryptosystems and their
Implementations. Journal of Cryptology, 6:209-224, 1993.
[35] A. J. Menezes, editor. Applications of Finite Fields. Kluwer Academic
Publishers, Boston, MA, 1993.
[36] V. Miller. Uses of Elliptic Curves in Cryptography. In Lecture Notes in
Computer Science, Advances in Cryptology: Proceedings of Crypto'85, volume 218,
pages 417-426. Springer-Verlag, Berlin, 1986.
[37] R. C. Mullin, I. M. Onyszchuk, S. A. Vanstone, and R. M. Wilson. Optimal
Normal Bases in GF(p^n). Discrete Applied Mathematics, pages 149-161,
1988/1989.
[38] D. Naccache and D. M'Raihi. Cryptographic Smart Cards. IEEE Micro,
78(3):14-24, June 1996.
[39] National Institute of Standards and Technology (NIST). Digital
Signature Standard (DSS), Feb 2000. Available at:
http://csrc.nist.gov/cryptval/dss/fr000215.html.
[40] C. Paar. Efficient VLSI Architectures for Bit-Parallel Computation in Galois
Fields. PhD thesis, Institute for Experimental Mathematics, University of
Essen, Germany, 1994.
[41] C. Paar. A new architecture for a parallel Finite Field Multiplier with low
complexity based on Composite Fields. IEEE Transactions on Computers,
45(7):856-861, July 1996.
[42] C. Paar, P. Fleischmann, and P. Roelse. Efficient Multiplier Architectures for
Galois Fields GF(2^(4n)). IEEE Transactions on Computers, 47(2):162-170,
Feb 1998.
[43] A. Pincin. A new algorithm for Multiplication in Finite Fields. IEEE
Transactions on Computers, 38(7):1045-1049, July 1989.
[44] P. A. Scott, S. E. Tavares, and L. E. Peppard. A Fast VLSI Multiplier for
GF(2^m). IEEE Journal on Selected Areas in Communications, 4(1):62-66,
Jan. 1986.
[45] S. Lin and D. J. Costello Jr. Error Control Coding: Fundamentals and
Applications. Prentice-Hall, Inc., 1983.
[46] L. Song and K. K. Parhi. Optimum primitive polynomials for low-area
low-power Finite Field Semi-Systolic Multipliers. In Proceedings of the IEEE
Workshop on Signal Processing Systems, pages 375-384, New York, NY, 1997.
[47] L. Song and K. K. Parhi. Low-Complexity modified Mastrovito Multipliers over
GF(2^m). In Proceedings of the IEEE International Symposium on Circuits and
Systems, ISCS'99, pages I508-I512, San Diego, CA, May 1999.
[48] B. Sunar and C. K. Koc. Mastrovito Multiplier for all Trinomials. IEEE
Transactions on Computers, 48(5):522-527, May 1999.
[49] C.-L. Wang and J.-L. Lin. Systolic Array Implementation of Multipliers for
Finite Fields GF(2^m). IEEE Transactions on Circuits and Systems, 38(7):796-800,
July 1991.
[50] C. C. Wang, T. K. Truong, H. M. Shao, L. J. Deutsch, J. K. Omura, and I. S. Reed.
VLSI Architectures for Computing Multiplications and Inverses in GF(2^m).
IEEE Transactions on Computers, 34(8):709-716, Aug. 1985.
[51] A. G. Wassal, M. A. Hasan, and M. I. Elmasry. Low-Power Design of Finite
Field Multipliers for Wireless Applications. In Proceedings of the IEEE 8th
Great Lakes Symposium on VLSI, pages 19-25, Lafayette, LA, Feb. 1998.
[52] E. D. Win and B. Preneel. Elliptic Curve Public-Key Cryptosystems: An
Introduction. State of the Art in Applied Cryptography. Course on Computer
Security and Industrial Cryptography. Revised Lectures. Springer-Verlag, Berlin,
Germany, pages 131-141, 1998.
[53] J. J. Wozniak. Systolic dual basis serial multiplier. IEE Proceedings-E,
145(3):237-241, May 1998.
[54] C.-W. Wu and M.-K. Chang. Bit-Level Systolic Arrays for Finite-Field
Multiplications. Journal of VLSI Signal Processing, 10(1):85-92, June 1995.
[55] H. Wu, A. Hasan, and I. Blake. New low-complexity bit-parallel Finite Field
multipliers using Weakly Dual Bases. IEEE Transactions on Computers,
47(11):1223-1234, Nov. 1998.
[56] H. Wu, M. Hasan, and I. Blake. Finite Field Multipliers using Redundant
Basis. To appear in Proceedings of CHES'99, Workshop on Cryptographic
Hardware and Embedded Systems, 1999.
[57] C.-S. Yeh, I. S. Reed, and T. K. Truong. Systolic Multipliers for Finite Fields
GF(2^m). IEEE Transactions on Computers, 33(4):357-360, April 1984.
[58] B. B. Zhou. A New Bit-Serial Systolic Multiplier Over GF(2^m). IEEE
Transactions on Computers, 37(6):749-751, June 1988.
