University of Windsor

Scholarship at UWindsor
Electronic Theses and Dissertations

Theses, Dissertations, and Major Papers

2009

High speed world level finite field multipliers in F2m
Ashkan Hosseinzadeh Namin
University of Windsor

Follow this and additional works at: https://scholar.uwindsor.ca/etd

Recommended Citation
Namin, Ashkan Hosseinzadeh, "High speed world level finite field multipliers in F2m" (2009). Electronic
Theses and Dissertations. 7899.
https://scholar.uwindsor.ca/etd/7899

This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor
students from 1954 forward. These documents are made available for personal study and research purposes only,
in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution,
Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder
(original author), cannot be used for any commercial purposes, and may not be altered. Any other use would
require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or
thesis from this database. For additional inquiries, please contact the repository administrator via email
(scholarship@uwindsor.ca) or by telephone at 519-253-3000 ext. 3208.

High Speed World Level Finite Field Multipliers
in F_{2^m}

by

Ashkan Hosseinzadeh Namin

A Thesis
Submitted to the Faculty of Graduate Studies through
the Department of Electrical and Computer Engineering in Partial Fulfillment
of the Requirements for the Degree of Doctor of Philosophy at the
University of Windsor

Windsor, Ontario, Canada
2009

Library and Archives Canada
Published Heritage Branch
395 Wellington Street
Ottawa ON K1A 0N4
Canada

ISBN: 978-0-494-57643-4

NOTICE:

The author has granted a non-exclusive license allowing Library and Archives Canada to
reproduce, publish, archive, preserve, conserve, communicate to the public by
telecommunication or on the Internet, loan, distribute and sell theses worldwide, for
commercial or non-commercial purposes, in microform, paper, electronic and/or any other
formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis
nor substantial extracts from it may be printed or otherwise reproduced without the author's
permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed
from this thesis.

While these forms may be included in the document page count, their removal does not
represent any loss of content from the thesis.

© 2009 Ashkan Hosseinzadeh Namin

All Rights Reserved. No part of this document may be reproduced, stored or otherwise
retained in a retrieval system or transmitted in any form, on any medium by any means
without prior written permission of the author.

Declaration of Co-Authorship

I hereby declare that this thesis incorporates material that is the result of joint research as
follows:
This thesis incorporates the outcome of joint research undertaken in collaboration with
Karl Leboeuf under the supervision of Dr. Roberto Muscedere. The collaboration contributions are outlined in Chapter 8. The personal contributions, design work and development
performed by the author are the focus of this chapter.
I am aware of the University of Windsor Senate Policy on Authorship and I certify that I
have properly acknowledged the contributions of other researchers to my thesis, and have
obtained written permission from the co-authors to include the above materials in my thesis.
I certify that with the above qualification, this thesis, and the research to which it refers
is the product of my own work.


Abstract

Finite fields have important applications in number theory, algebraic geometry, Galois theory, cryptography, and coding theory. Recently, the use of finite field arithmetic in the
area of cryptography has increasingly gained importance. Elliptic curve and El-Gamal
cryptosystems are two important examples of public key cryptosystems widely used today based on finite field arithmetic. Research in this area is moving toward finding new
architectures to implement the arithmetic operations more efficiently.
Two types of finite fields are commonly used in practice: the prime field GF(p) and the
binary extension field GF(2^m). Binary extension fields are attractive for high speed
cryptography applications since they are suitable for hardware implementations. Hardware
implementations of finite field multipliers can usually be divided into three categories:
bit-serial, bit-parallel, and word-level architectures. Word-level multipliers provide
architectural flexibility and a trade-off between performance and the limitations of VLSI
implementation and I/O ports, and are thus of more practical significance.
In this work, different word-level architectures for multiplication in binary fields are
proposed. It is shown that the proposed architectures are more efficient than similar
proposals when area/delay complexities are considered as a measure of performance. Practical
size multipliers for cryptography applications have been realized in hardware using FPGA or
standard CMOS technology. Also, different VLSI implementations for multipliers were explored,
which resulted in more efficient implementations for some of the regular architectures. The
new implementations use a simple module designed in domino logic as the main building block
for the multiplier. Significant speed improvements were achieved in designing practical size
multipliers using the proposed methodology.


To my family, with love ...


Acknowledgments

There are several people who deserve my sincere thanks for their generous contributions
to this project. I would first like to express my sincere gratitude and appreciation to Dr.
Huapeng Wu and Dr. Majid Ahmadi, my supervisors for their invaluable guidance and
constant support throughout the course of this thesis work.
In addition to my advisors, I would like to thank the rest of my thesis committee: Dr.
Roberto Muscedere and Dr. Esam Abdel-Raheem from the electrical and computer engineering department, Dr. Arunita Jaekel from the school of computer science and Dr. Gerald
E. Sobelman from the University of Minnesota for their participation in my seminars, reviewing my thesis, and their constructive comments.
I would also like to thank Dr. Roberto Muscedere and Till Kuendiger for their assistance
regarding the VLSI CAD tools and facilities used during the course of the project. I am
thankful to my research colleagues in the RCIM lab of the University of Windsor and
my friends who have supported and believed in me: Golnar Khodabandehloo, Mitra
Mirhassani, Mahzad Azarmehr, Andrew Tarn, Kevin Biswas and Matthew Meloche.
Special thanks go to Karl Leboeuf for proofreading this thesis.
Finally, my deepest gratitude goes to my family for their unconditional love, support
and encouragement.


Contents

Declaration of Co-Authorship
Abstract
Dedication
Acknowledgments
List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Summary of Contributions
  1.2 Outline of the Thesis

2 Mathematical Preliminaries
  2.1 Groups, Rings and Fields
  2.2 Binary Field and Bases
    2.2.1 Normal Basis and Its Arithmetic in F_{2^m}
      2.2.1.1 Normal Basis Representation
      2.2.1.2 Normal basis multiplication
    2.2.2 Redundant Basis and its Arithmetic in F_{2^m}
      2.2.2.1 Redundant Representation
      2.2.2.2 Redundant Basis Multiplication
      2.2.2.3 Redundant Basis and Normal Basis

3 Comb Architectures for Finite Field Multiplication in F_{2^m}
  3.1 Introduction
  3.2 Preliminaries on Finite Field Bases and Arithmetic in F_{2^m}
    3.2.1 Redundant Representation and Multiplication
    3.2.2 Reordered Normal Basis Representation and Multiplication
  3.3 Proposed Hybrid Multiplier Using Redundant Representation
    3.3.1 Bit-serial word-parallel multiplication algorithm
    3.3.2 Comb style multiplication architecture
    3.3.3 Architecture complexities and comparison
  3.4 Proposed Hybrid Multiplier Using Reordered Normal Basis
    3.4.1 Bit-serial word-parallel multiplication algorithm
    3.4.2 Bit-serial word-parallel multiplier architecture
    3.4.3 Architecture complexities and comparison
  3.5 FPGA Implementations
  3.6 Conclusions

4 A New Finite Field Multiplier Using Redundant Representation
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 Redundant Representation
    4.2.2 Redundant Basis Multiplication
    4.2.3 An Overview of Bit-Serial RB Multipliers
  4.3 Proposed SIPO Multiplier
    4.3.1 A New Bit-Serial RB Multiplication Algorithm
    4.3.2 Multiplier Architecture
  4.4 Complexity Comparison
    4.4.1 Comparison to Other RB Multipliers
    4.4.2 Comparison to Normal Basis Multipliers
  4.5 Proposed Digit-Level SIPO Multiplier
    4.5.1 A New Digit-Level RB Multiplication Algorithm
    4.5.2 Proposed Digit-Level RB Multiplier
    4.5.3 Complexity Comparison to Previous RB Multipliers
    4.5.4 Complexity Comparison to Previous NB Multipliers
  4.6 Conclusions

5 A High Speed Word Level Multiplier in F_{2^m} Using Redundant Representation
  5.1 Introduction
  5.2 A Brief Review of Redundant Representation and Its Arithmetic in F_{2^m}
    5.2.1 Redundant Basis for F_{2^m}
    5.2.2 Redundant Basis Multiplication in F_{2^m}
  5.3 Proposed Word Level Multiplication In RB
  5.4 Proposed Word Level Multiplier Architecture in RB
    5.4.1 Multiplier Architecture
    5.4.2 Architecture Complexities
    5.4.3 An example
    5.4.4 Word Level Architecture with MSB First
  5.5 Complexity Comparison
    5.5.1 Comparison to Other Word Level RB Multipliers
    5.5.2 Comparison to Other Word Level NB and RB Multipliers When There Exists A Type I ONB
  5.6 Conclusions

6 High Speed Word Level Multipliers in GF(2^m) Using Reordered Normal Basis
  6.1 Introduction
  6.2 A Brief Review of Reordered Normal Basis and Its Arithmetic in GF(2^m)
    6.2.1 Reordered Normal Basis
    6.2.2 Reordered Normal Basis Multiplication
  6.3 Proposed High Speed Word Level Multiplier Type One Using Reordered Normal Basis
    6.3.1 Word-level multiplication algorithm using reordered normal basis
    6.3.2 Multiplier Architecture
    6.3.3 Architecture complexities
    6.3.4 An example
  6.4 Proposed High Speed Word Level Multiplier Type Two Using Reordered Normal Basis
    6.4.1 Word-level multiplication algorithm using reordered normal basis
    6.4.2 Multiplier architecture
    6.4.3 Architecture complexity
    6.4.4 An example
  6.5 Comparisons
  6.6 Conclusions

7 High Speed VLSI Implementation of a Multiplier Using Redundant Representation
  7.1 Introduction
  7.2 A Brief Review of Redundant Basis and its Arithmetic in F_{2^m}
    7.2.1 Redundant Basis for F_{2^m}
    7.2.2 Redundant Basis Multiplication in F_{2^m}
    7.2.3 Redundant Basis and Normal Basis
    7.2.4 Multiplier Architecture in Redundant Basis
  7.3 Design of the Multiplier Main Building Block (x-module)
  7.4 Design of a Practical Size Multiplier Using the x-module
  7.5 Design of Practical Size Multiplier Using static CMOS
  7.6 Different VLSI implementation Comparisons
  7.7 Conclusions

8 High Speed Implementation of a SIPO Multiplier Using Reordered Normal Basis
  8.1 Introduction
  8.2 A Brief Review of Reordered Normal Basis and Its Arithmetic in F_{2^m}
    8.2.1 Reordered Normal Basis Definition
    8.2.2 Reordered Normal Basis Multiplication
    8.2.3 A Review of Existing Architectures for ONB Type II Multiplication
  8.3 Design of A Practical Size Multiplier Using xax-module
    8.3.1 Multiplier Size Selection
    8.3.2 Selected Multiplier Architecture
    8.3.3 Design and Implementation of the xax-module
    8.3.4 Design and Implementation of the 233-bit Multiplier Using the xax-module
  8.4 Design of the 233-bit Multiplier Using Static CMOS
  8.5 A Comparison of Different VLSI Implementations
  8.6 Conclusions

9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work

References

VITA AUCTORIS

List of Figures

3.1 Words vs. comb style inputs
3.2 Proposed comb style redundant basis multiplier for F_{2^4} (LSB first)
3.3 Proposed comb style redundant basis multiplier for F_{2^4} (MSB first)
3.4 Proposed comb style multiplier in F2e using reordered normal basis (LSB first)
4.1 Previously proposed bit-serial RB multipliers
4.2 Proposed serial-in parallel-out RB multiplier (MSB first)
4.3 Proposed serial-in parallel-out RB multiplier (LSB first)
4.4 Proposed hybrid SIPO RB multiplier (LSB first)
5.1 Proposed high speed word level multiplier
5.2 Word level (w = 3) multiplier in F_{2^4} with the padded zero bits for the input (the second option)
5.3 Proposed high speed word level multiplier (MSB First)
6.1 Proposed word-level high speed multiplier using reordered normal basis
6.2 Viewing the (2m+1)-bit circular shift register R as two virtual (2m+1)-bit circular shift registers R_1 and R_2
6.3 Architecture of WL-RNB I in GF(2^3) with w = k = 2
6.4 Proposed word-level high speed multiplier using reordered normal basis
6.5 Architecture of WLM-RNB-II in GF(2^3) with w = k = 2
7.1 High Speed Serial Multiplier in Redundant Basis
7.2 (a) x-module Block Detail (b) Multiplier Composed of x-module Blocks
7.3 AND-XOR Function in Domino Logic
7.4 Layout for the AND-XOR Function in Domino Logic
7.5 Proposed 197 Bit Multiplier Layout
7.6 Post Place-and-Route Simulation Result of the Proposed 197 Bit Multiplier, from top to bottom: input A, input B, input C, node Q, node R, x-module out, clock
7.7 Static CMOS 197 Bit Multiplier Layout
8.1 Serial-In Parallel-Out Reordered Normal Basis Multiplier
8.2 xax-module and the SIPO Multiplier Composed of xax-module
8.3 XOR-AND-XOR Function Implementation in Domino Logic
8.4 A New XOR-AND-XOR Function Implementation in Domino Logic
8.5 Layout for the XOR-AND-XOR Function in Domino Logic
8.6 Block Diagram of the 233-bit Multiplier
8.7 233-bit Proposed Multiplier Layout
8.8 Post Place-and-Route Simulation Result of the Proposed 233-bit Multiplier, from top to bottom: input a, input b1, input b2, input c, node R, node Q, clock
8.9 Static CMOS 233-bit Multiplier Layout

List of Tables

3.1 Complexities comparison between hybrid redundant basis multipliers
3.2 Complexity comparison of hybrid multipliers using reordered normal basis or normal basis
3.3 Comparison of FPGA implementations of hybrid redundant basis multipliers
3.4 FPGA implementation results for hybrid reordered normal basis multipliers
4.1 Register contents C_j during a multiplication operation
4.2 Complexities comparison between bit-serial RB multipliers
4.3 Complexities comparison between the proposed RB multiplier and bit-serial NB multipliers when there exists a type I optimal normal basis
4.4 Complexities Comparison Between Digit-Level RB Multipliers
4.5 Complexities comparison between digit-level architectures: the proposed RB multiplier versus some NB multipliers for a class of fields that there exists a type I ONB
5.1 Contents of the flip-flops in the proposed multiplier in F_{2^4}
5.2 Complexity comparison for word level redundant basis multipliers
5.3 Area-Delay Complexity comparison for different architectures where there exist a type I ONB
5.4 Complexity comparison of word level type I optimal NB or RB multipliers in F_{2^268} for different values of w, k
5.5 Normalized Complexity Comparison of Different Word Size Multipliers in F_{2^268}
6.1 Complexities comparison
6.2 Complexity comparison of word-level type II optimal normal basis or reordered normal basis multipliers in GF(2^233) for different values of w, k
7.1 Complexity Comparison between Two VLSI implementations for a 197 Bit Multiplier
8.1 Complexities Comparison Between Type II ONB / Reordered Normal Basis Multipliers
8.2 load-module Input/Output Characteristics
8.3 Complexity Comparison Between Two VLSI Implementations for a 233-bit Multiplier

List of Abbreviations

AEDS    AND-Efficient Digit-Serial.
ASIC    Application-Specific Integrated Circuit.
BPWS    Bit-Parallel Word-Serial.
BSWP    Bit-Serial Word-Parallel.
CAD     Computer Aided Design.
CMOS    Complementary Metal-Oxide-Semiconductor.
ECC     Elliptic Curve Cryptography.
ECDSS   Elliptic Curve Digital Signature Standard.
FPGA    Field-Programmable Gate Array.
GF      Galois Field.
IC      Integrated Circuit.
IEEE    Institute of Electrical and Electronics Engineers.
ISO     International Organization for Standardization.
LUT     Look-Up-Table.
NB      Normal Basis.
NIST    National Institute of Standards and Technology.
ONB     Optimal Normal Basis.
RB      Redundant Basis.
RNB     Reordered Normal Basis.
SMPO    Sequential Multiplier with Parallel Output.
VHDL    Very high speed integrated circuit (VHSIC) Hardware Description Language.
VLSI    Very-Large-Scale Integration.
XEDS    XOR-Efficient Digit-Serial.

Chapter 1
Introduction

A finite field is a finite set of elements on which one can add, subtract, multiply, and divide
such that the properties of associativity, distributivity, and commutativity are satisfied [25].
Finite fields have important applications in error control coding and cryptography [29].
Two types of finite field are commonly used in practice: the prime field F_p and the
binary field F_{2^m}. The binary field is an extension of the prime field F_2 and contains 2^m
elements. Binary fields are attractive for high speed cryptography applications since they are
suitable for hardware implementation [18]. For applications to elliptic curve cryptography,
binary field sizes are required to be at least 160 bits [18].
In F_{2^m}, addition is simply the bitwise exclusive-OR of two binary vectors. Multiplication is
more complicated, while division or inversion can be broken down into a series of consecutive
multiplication operations [11], [41]. In practice, the finite field multiplier becomes the
key arithmetic unit for any system based on finite field computations.
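The three operations mentioned above can be sketched in a few lines of Python. This is a minimal software illustration only, under assumptions not taken from the text: a polynomial basis, the small field F_{2^4}, and the modulus x^4 + x + 1. The names gf_add, gf_mul and gf_inv are hypothetical. Inversion is written as a chain of multiplications using a^(2^m - 2) = a^(-1), which is one standard way to reduce inversion to multiplication.

```python
# Elements of F_{2^M} are Python ints; bit i is the coefficient of x^i.
M = 4
POLY = 0b10011  # x^4 + x + 1, an irreducible polynomial chosen for illustration


def gf_add(a, b):
    """Addition in F_{2^M} is a bitwise XOR of the coefficient vectors."""
    return a ^ b


def gf_mul(a, b):
    """Shift-and-add multiplication with interleaved reduction modulo POLY."""
    result = 0
    while b:
        if b & 1:          # current bit of b selects the shifted copy of a
            result ^= a
        b >>= 1
        a <<= 1            # multiply a by x
        if a >> M:         # degree reached M, so subtract (XOR) the modulus
            a ^= POLY
    return result


def gf_inv(a):
    """Inversion as repeated multiplication: a^(2^M - 2) for a != 0."""
    if a == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    result, base, exp = 1, a, (1 << M) - 2
    while exp:
        if exp & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exp >>= 1
    return result
```

The inversion routine calls gf_mul repeatedly, which is why the multiplier dominates the cost of any system built on these operations.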
The efficiency of finite field multiplication depends on the choice of the basis used to
represent field elements. Bases that have been used for realizing finite field multipliers
include polynomial basis, normal basis (NB), dual bases, triangular basis, redundant
representation or redundant basis, and their variations (e.g., shifted polynomial basis)
[18, 33, 16, 10, 7, 38, 44, 17, 8].
In this work, we are mainly interested in normal basis and redundant representation,
since in both the squaring operation reduces to reordering the element coefficients, which is
free in hardware. A free squaring operation can be used to speed up the exponentiation
operation by repeated squaring and multiplication [14].
In normal basis, the complexity of multiplication is measured with the multiplication
matrix [30]. For a binary extension field, the multiplication matrix entries are either zero or
one, and the number of ones inside the multiplication matrix is referred to as the normal basis
complexity. A normal basis in GF(2^m) for which the complexity achieves its minimum,
2m - 1, is referred to as an optimal normal basis (ONB). Two types of optimal normal
bases have been found, which are referred to as type I and type II optimal normal bases [30].
A reordered normal basis is a certain permutation of a type II optimal normal
basis [12], [44].
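Whether a given m admits an optimal normal basis can be checked numerically. The sketch below uses the standard existence criteria from the general literature on optimal normal bases, which are not stated in this chapter and are therefore an assumption here: a type I ONB exists in GF(2^m) when m + 1 is prime and 2 is primitive modulo m + 1, and a type II ONB exists when 2m + 1 is prime and either 2 is primitive modulo 2m + 1, or 2m + 1 ≡ 3 (mod 4) and 2 has multiplicative order m modulo 2m + 1. All function names are hypothetical.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small illustrative inputs."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def order_of_two(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    k, x = 1, 2 % p
    while x != 1:
        x = (x * 2) % p
        k += 1
    return k


def has_type1_onb(m):
    """Assumed criterion: m + 1 is an odd prime and 2 is primitive modulo m + 1."""
    p = m + 1
    return is_prime(p) and p > 2 and order_of_two(p) == m


def has_type2_onb(m):
    """Assumed criterion: 2m + 1 prime, and 2 primitive or of order m with p = 3 mod 4."""
    p = 2 * m + 1
    if not is_prime(p):
        return False
    k = order_of_two(p)
    return k == 2 * m or (p % 4 == 3 and k == m)
```

Under these criteria, m = 233 admits a type II ONB and m = 268 admits a type I ONB, which matches the field sizes used for the comparisons later in the thesis.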
Redundant representation is especially interesting because it not only offers almost free
squaring, as normal basis does, but also eliminates the modular reduction in multiplication.
The main idea for multiplication using redundant representation is to embed a field in a
larger ring and perform the multiplication there [44]. The ring used here has a simple
structure and is referred to as a cyclotomic ring, such that the modular operation can be
saved in a multiplication operation. Since the embedding of a field is not unique, each field
element can be represented in the ring in more than one way, so the representation contains a
certain amount of redundancy.
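The arithmetic inside such a ring can be sketched as follows. This is a minimal illustration, assuming the ring F_2[x]/(x^n + 1) with elements stored as lists of n bits; the choice n = 5 in the usage note and the names ring_mul and ring_square are assumptions made for the example, not the author's architecture. Multiplication is a cyclic convolution, so exponents simply wrap around modulo n and no reduction by an irreducible polynomial is needed. Squaring moves coefficient i to position 2i mod n, which for odd n is a pure permutation of the coefficients.

```python
def ring_mul(a, b, n):
    """Multiply in F_2[x]/(x^n + 1): cyclic convolution of the bit lists a and b."""
    c = [0] * n
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                if bj:
                    c[(i + j) % n] ^= 1   # x^i * x^j = x^((i + j) mod n)
    return c


def ring_square(a, n):
    """Square in F_2[x]/(x^n + 1): coefficient i moves to position 2i mod n."""
    c = [0] * n
    for i, ai in enumerate(a):
        c[(2 * i) % n] ^= ai
    return c
```

For example, with n = 5 the product of x^4 and x wraps around to 1, and ring_square(a, 5) agrees with ring_mul(a, a, 5) for every 5-bit vector a.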
The main drawback of the redundant representation is that it uses more bits to represent
a field element, where the number of representation bits depends on the size of the
cyclotomic ring. For the class of fields F_{2^m} such that there exists a type I optimal normal
basis (ONB), the number of bits required for a redundant representation of a field element
is m + 1. Also, for the class of fields F_{2^m} such that there exists a type II optimal normal
basis (ONB), the number of bits required for a redundant representation of a field element
is 2m + 1.
Hardware implementations of finite field multipliers can usually be divided into three
categories. The first category consists of bit-level or bit-serial multipliers [22], [1], [15], [11].
A bit-level multiplier takes m clock cycles to finish one multiplication in F_{2^m}.
Multipliers in this class have low power consumption and occupy a small silicon area, but
are slow for large field sizes. The second category consists of bit-parallel or full-parallel
multipliers [35], [24], [20], [43]. A full-parallel multiplier takes one clock cycle to finish
one field multiplication. These multipliers are usually not economical to implement since
they require a large silicon area and high bandwidth for the input and output ports.
The third category consists of word-level or digit-level finite field multipliers, which are the
most commonly implemented in practice [12], [44], [22], [32], [31], [36], [37]. A word-level
multiplier takes w clock cycles, 1 ≤ w ≤ m, to finish one multiplication operation in F_{2^m}.
The value of w can be selected by the designer to set the trade-off between area and speed
according to the application. Decreasing w results in faster and larger multipliers, while
increasing w yields smaller and slower multipliers. Note that bit-level and full-parallel
multipliers can be viewed as special cases of word-level multipliers for w = m and w = 1,
respectively.
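The trade-off between the number of clock cycles and the amount of work per cycle can be modelled in software. The sketch below is only a behavioural model under assumptions not taken from the text: a polynomial basis, the field F_{2^4} with modulus x^4 + x + 1, and an operand consumed most significant digit first. The digit size d is a hypothetical parameter, related to the cycle count of the text by w = ceil(m / d), so d = 1 gives the bit-serial case and d = m the full-parallel case.

```python
M = 4
POLY = 0b10011  # x^4 + x + 1, assumed modulus for illustration


def clmul(a, b):
    """Carry-less (polynomial) product of two bit vectors over F_2."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_poly(v):
    """Reduce the polynomial v modulo POLY, clearing bits from the top down."""
    for shift in range(v.bit_length() - 1, M - 1, -1):
        if (v >> shift) & 1:
            v ^= POLY << (shift - M)
    return v


def word_level_mul(a, b, d):
    """Multiply a by b, consuming d bits of b per modelled clock cycle.

    Returns (product, cycles), where cycles = ceil(M / d).
    """
    cycles = -(-M // d)
    acc = 0
    for k in range(cycles - 1, -1, -1):
        digit = (b >> (k * d)) & ((1 << d) - 1)
        # One cycle: shift the accumulator by d positions and add a * digit.
        acc = reduce_poly((acc << d) ^ clmul(a, digit))
    return acc, cycles
```

A larger d means more partial products are combined per cycle (more logic) in exchange for fewer cycles, which mirrors the area and speed trade-off described above.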

1.1 Summary of Contributions

In this work, different word-level architectures for multiplication in binary fields are
proposed. It is shown that the proposed architectures are more efficient than similar
proposals when area/delay complexities are considered as a measure of performance. Practical
size multipliers for cryptography applications have been realized in hardware using
FPGA or standard CMOS technology. Also, different VLSI implementations for multipliers
were explored, which resulted in more efficient implementations for some of the
regular architectures. The new implementations use a simple module designed in domino
logic as the main building block for the multiplier. Significant improvements were achieved
in designing practical size multipliers using the proposed methodology.

1.2 Outline of the Thesis

The rest of this thesis is organized as follows. Chapter 2 is a brief review of finite field
theory. After covering basic definitions and elementary properties such as group, ring and
field, bases for finite fields are presented. Normal basis and redundant basis representation
with their arithmetic operations are discussed in detail. Type I and II optimal normal basis,
which are two important classes of normal basis, and their relationship with redundant
representation are also discussed in this chapter.
Chapter 3 discusses two new high speed bit-serial word-parallel, or comb style finite
field multipliers. The first proposal utilizes redundant representation for any binary field,
and the other uses a reordered normal basis for the binary field where there exists a type
II optimal normal basis. The proposed redundant representation architecture has a smaller
critical path delay compared to the previous methods, while its complexity remains approximately the same. The proposed reordered normal basis multiplier has a significantly
smaller critical path delay compared to the previous methods using the same basis or normal basis. FPGA implementation results of the proposed multipliers are compared to those
of the previous methods using the same basis, confirming that the proposed multipliers
allow for a much higher clock rate.
Chapter 4 presents a novel serial-in parallel-out finite field multiplier using redundant
representation. It is shown that the proposed architecture has either significantly lower
complexity and comparable critical path delay, or significantly smaller critical path delay
and comparable complexity, in comparison to the previously proposed architectures using
the same representation. For the class of fields such that there exists a type I optimal normal
basis, the proposed multiplier compares favorably to the normal basis multipliers. A
digit-level version of the new multiplier is also presented in this chapter.
In Chapter 5, a high speed word-level finite field multiplier in F2m using redundant
representation is proposed. For the class of fields such that there exists a type I optimal
normal basis, the new architecture has significantly higher speed compared to previously
proposed word-level architectures using either normal basis or redundant representation
at the expense of moderately higher area complexity. One of the unique features of the
proposed word-level multiplier is that the critical path delay is not a function of the field
size, nor the word size. It is also shown that the new multiplier out-performs all other
multipliers in the comparison when considering the product of area and delay as a measure
of performance. VLSI implementation of the proposed multiplier in a 0.18 μm CMOS
process is also presented as a module for an elliptic curve processor.
In Chapter 6, two high speed word-level finite field multipliers using reordered normal
basis are proposed, where reordered normal basis is referred to as a certain permutation of
type II optimal normal basis. Complexity comparison shows that the proposed architectures
are faster than all the previously presented architectures in the open literature using either a
type II optimal normal basis or a reordered normal basis at the expense of moderately higher
complexity. One unique feature of the new word-level architectures is that the critical
path delay is not a function of the word size or the field size. This enables the proposed
multipliers to operate at very high clock rate regardless of the word or field size. Such high
speed word-level multipliers are expected to have applications in public key cryptography,
e.g., elliptic curve cryptosystems.
Chapter 7 presents a high speed VLSI implementation of a 197-bit multiplier using
redundant representation, built from a domino logic block (the x-module) and compared
with a static CMOS implementation.
Chapter 8 presents a high speed VLSI implementation of a 233-bit Serial-In Parallel-Out
finite field multiplier. The proposed design performs multiplication using a reordered
normal basis, a permutation of a type II optimal normal basis. The multiplier was
implemented in a 0.18 μm TSMC CMOS technology using multiples of a domino logic block.
The domino logic design was simulated, and functioned correctly up to a clock rate of
1.587 GHz, yielding a 99% speed improvement over the static CMOS simulation results,
while the area was reduced by 49%. The multiplier size of 233 bits is currently recommended
by the National Institute of Standards and Technology (NIST) in their Elliptic
Curve Digital Signature Standard (ECDSS), and is used in practice for binary field
multiplication in Elliptic Curve Cryptosystems.
Finally, some concluding remarks and future work are presented in Chapter 9.

Chapter 2
Mathematical Preliminaries

This chapter briefly reviews the mathematical background on finite fields. It starts with
reviewing basic definitions such as group, ring, and field, and then covers more advanced
topics such as bases and arithmetic operations. Normal basis and redundant representation
with their arithmetic operations are discussed in detail. The relationship between different
classes of normal basis and redundant representation is also discussed in this chapter. For a
more detailed review of finite fields and their applications, readers are referred to [25, 29, 26].

2.1 Groups, Rings and Fields
Definition 2.1.1. [25] A group (G, *) is a set G together with a binary operation * on G
such that the following three properties hold:
1. The binary operator * is associative; that is, for any a, b, c ∈ G,
a * (b * c) = (a * b) * c.
2. There is an identity (or unity) element e in G such that for all a ∈ G,
a * e = e * a = a.
3. For each a ∈ G, there exists an inverse element a^{-1} in G such that
a * a^{-1} = a^{-1} * a = e.
If a * b = b * a for all a, b ∈ G, then G is referred to as an abelian or commutative group.
A group with a finite number of elements is referred to as a finite group.
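For a small finite set, the properties in Definition 2.1.1 can be verified exhaustively. The following sketch is a brute-force check offered for illustration only; the function name is_group and the example sets are assumptions, and closure under the operation is checked explicitly because the definition requires * to be a binary operation on G.

```python
from itertools import product


def is_group(elements, op):
    """Brute-force check of the group axioms on a finite set."""
    elements = list(elements)
    members = set(elements)
    # Closure: op must map G x G back into G.
    if any(op(a, b) not in members for a, b in product(elements, repeat=2)):
        return False
    # Property 1: associativity.
    if any(op(a, op(b, c)) != op(op(a, b), c)
           for a, b, c in product(elements, repeat=3)):
        return False
    # Property 2: existence of an identity element e.
    identities = [e for e in elements
                  if all(op(a, e) == a and op(e, a) == a for a in elements)]
    if not identities:
        return False
    e = identities[0]
    # Property 3: every element has an inverse.
    return all(any(op(a, b) == e and op(b, a) == e for b in elements)
               for a in elements)
```

With this check, the integers modulo 5 under addition and the nonzero integers modulo 5 under multiplication both pass, while the nonzero integers modulo 4 under multiplication fail because 2 * 2 = 0 leaves the set.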
Definition 2.1.2. [25] A ring (R, +, *) is a set R together with two binary operations,
denoted by + and *, such that the following three properties hold:
1. R is an abelian group with respect to +.
2. The binary operator * is associative, which means for all a, b, c ∈ R,
(a * b) * c = a * (b * c).
3. The distributive law holds, which means for all a, b, c ∈ R,
a * (b + c) = a * b + a * c and (b + c) * a = b * a + c * a.
The identity element of the abelian group R with respect to + is called the zero element,
while the identity element with respect to * (if it exists) is called the identity element. A ring
is called commutative if the binary operator * is commutative.
Definition 2.1.3. [25] A field (F, +, *) is a set F together with two binary operations,
denoted by + and *, such that the following two properties hold:
1. F is a commutative ring under + and *.
2. The nonzero elements of F form a group with the binary operation *.
A field with a finite number of elements is referred to as a finite field. The order of a finite
field is the number of elements in the field. There exists a finite field F of order q if and
only if q is a prime power, that is, q = p^m where p is a prime number referred to as
the characteristic of F and m is a positive integer [18]. For any prime power q, there is
essentially only one finite field of order q. This means that any two finite fields of order q
are structurally the same, except that the labeling used to represent the field elements may
be different. We say that any two finite fields of order q are isomorphic, and denote such a
field by F_q or GF(q), that is, F_{p^m} or GF(p^m) (GF stands for Galois Field, in honor of
Évariste Galois, a French mathematician who is known for his work on the theory of
equations and abelian integrals).

2.2 Binary Field and Bases
For a finite field $F$ of order $q = p^m$, if $m = 1$ then the field is referred to as a prime
field. If $m \geq 2$, then the finite field is referred to as an extension field. Finite fields of order
$q = 2^m$ are called binary fields or characteristic-two finite fields.
The basis used to represent the field elements has a major effect on the efficiency of finite field arithmetic. Common bases used in practice are polynomial basis (PB) and normal basis (NB) [25], [33]. Polynomial basis is probably the most
popular basis and has been widely used for hardware and software implementations [18].
Normal basis, on the other hand, is advantageous for hardware implementation since the
squaring operation can be implemented at no cost. Free squaring operations can be used to
speed up exponentiation by repeated squaring and multiplication [14], [2].
Recently, a method of redundant representation of field elements has attracted attention [42, 7, 38, 44]. The idea is to use the minimal cyclotomic ring in which the
field can be embedded, and to perform the field arithmetic operations in the ring.
Redundant representation not only offers free squaring, but its 'basis' elements also form a cyclic group, and thus the
modulo reduction step can be avoided when carrying out the field multiplication operation.


2.2.1 Normal Basis and Its Arithmetic in $\mathbb{F}_{2^m}$
2.2.1.1 Normal Basis Representation

Theorem 2.2.1. [25] Let $P(x)$ be a degree $m$ irreducible polynomial over $\mathbb{F}_2$ whose $m$
roots $\{\beta, \beta^2, \ldots, \beta^{2^{m-1}}\}$ are linearly independent over $\mathbb{F}_2$. Then these $m$ roots form a basis
of $\mathbb{F}_{2^m}$, which is referred to as a normal basis.
It is well known that there always exists a normal basis for the finite field $\mathbb{F}_{2^m}$ for all positive values of $m$ [25]. Assume that $\beta \in \mathbb{F}_{2^m}$ is an element such that $I = \{\beta, \beta^2, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$
is a normal basis. Then an element $A \in \mathbb{F}_{2^m}$ can be represented as
$$A = \sum_{i=0}^{m-1} a_i \beta^{2^i} = a_0\beta + a_1\beta^{2} + a_2\beta^{2^2} + \cdots + a_{m-1}\beta^{2^{m-1}}.$$
The main advantage of normal basis representation is that an element $A$ can be squared by
a simple right circular shift on its coordinates, $A^2 = \sum_{i=0}^{m-1} a_i \beta^{2^{\langle i+1 \rangle}}$, where $\langle i + 1 \rangle$ denotes
that $i + 1$ is to be reduced modulo $m$. This property of the normal basis comes from the fact
that $\beta^{2^m} = \beta$, and it is used to speed up exponentiation with the square-and-multiply
algorithm [18].
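The shift property can be checked numerically on a small field. The following is a minimal sketch, assuming the field $\mathbb{F}_{2^5}$ defined by $P(x) = x^5 + x^4 + x^3 + x + 1$ (the polynomial used in Example 1 of this chapter); the helper names `gf_mul`, `from_normal` and `to_normal` are illustrative and do not come from the thesis.

```python
M = 5
P = 0b111011  # P(x) = x^5 + x^4 + x^3 + x + 1; bit k is the coefficient of x^k

def gf_mul(a, b):
    """Multiply in F_2[x]/(P(x)): shift-and-add with reduction by P."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached M, so subtract (XOR) P(x)
            a ^= P
    return r

# Normal basis beta, beta^2, beta^4, beta^8, beta^16 with beta = x
basis = [0b10]
for _ in range(M - 1):
    basis.append(gf_mul(basis[-1], basis[-1]))

def from_normal(coords):
    """Normal-basis coordinates (a_0, ..., a_{m-1}) -> field element."""
    v = 0
    for c, e in zip(coords, basis):
        if c:
            v ^= e
    return v

def to_normal(v):
    """Field element -> normal-basis coordinates (brute force; the field is tiny)."""
    for bits in range(1 << M):
        coords = [(bits >> i) & 1 for i in range(M)]
        if from_normal(coords) == v:
            return coords
    raise ValueError("not representable in the given basis")

a = [1, 0, 1, 1, 0]               # A = beta + beta^4 + beta^8
A = from_normal(a)
squared = to_normal(gf_mul(A, A))
shifted = a[-1:] + a[:-1]         # right circular shift by one position
print(squared, shifted)
```

Squaring through the polynomial-basis multiplier and shifting the coordinate vector should give the same list for every element, which is what makes squaring free in hardware.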
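2.2.1.2 Normal basis multiplication

Let field elements $A, B \in \mathbb{F}_{2^m}$ be represented with respect to (w.r.t.) the normal basis
$I = \{\beta, \beta^2, \ldots, \beta^{2^{m-1}}\}$ as $A = \sum_{i=0}^{m-1} a_i \beta^{2^i}$ and $B = \sum_{j=0}^{m-1} b_j \beta^{2^j}$, respectively. Then the
product of $A$ and $B$ can be given by
$$C = A \cdot B = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_i b_j \beta^{2^i}\beta^{2^j} = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_i b_j \left(\beta\beta^{2^{\langle j-i \rangle}}\right)^{2^i}. \tag{2.1}$$
Define $t_{l,k} \in \mathbb{F}_2$ to be the coefficient of $\beta^{2^k}$ in the expansion of the product $\beta\beta^{2^l}$ when
represented w.r.t. $I$ [30],
$$\beta\beta^{2^l} = \sum_{k=0}^{m-1} t_{l,k}\,\beta^{2^k}. \tag{2.2}$$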


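Then it follows from (2.2) that
$$\left(\beta\beta^{2^{\langle j-i \rangle}}\right)^{2^i} = \left(\sum_{k=0}^{m-1} t_{\langle j-i \rangle,k}\,\beta^{2^k}\right)^{2^i} = \sum_{k=0}^{m-1} t_{\langle j-i \rangle,k}\,\beta^{2^{\langle k+i \rangle}} = \sum_{k=0}^{m-1} t_{\langle j-i \rangle,\langle k-i \rangle}\,\beta^{2^k}. \tag{2.3}$$
The last step in (2.3) follows from the proper substitution on the subscript $k$. Substituting
$\left(\beta\beta^{2^{\langle j-i \rangle}}\right)^{2^i}$ in (2.1) using (2.3),
$$C = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_i b_j \left(\sum_{k=0}^{m-1} t_{\langle j-i \rangle,\langle k-i \rangle}\,\beta^{2^k}\right) = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1}\sum_{k=0}^{m-1} a_i b_j\, t_{\langle j-i \rangle,\langle k-i \rangle}\,\beta^{2^k}. \tag{2.4}$$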

Then the coefficients of the product $C$ w.r.t. the NB $I$ can be given by
$$C = \sum_{k=0}^{m-1} c_k \beta^{2^k}, \quad \text{where } c_k = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_i b_j\, t_{\langle j-i \rangle,\langle k-i \rangle}. \tag{2.5}$$
Also note that from (2.5), after proper substitution on $i$ and $j$, we can compute $c_{k+1}$
with
$$c_{k+1} = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_i b_j\, t_{\langle j-i \rangle,\langle k-i+1 \rangle} = \sum_{i=0}^{m-1}\sum_{j=0}^{m-1} a_{\langle i+1 \rangle} b_{\langle j+1 \rangle}\, t_{\langle j-i \rangle,\langle k-i \rangle}. \tag{2.6}$$
Equation (2.6) shows that $c_{k+1}$ can be computed by applying the same formula used to compute
$c_k$, if the coefficient vectors of $A$ and $B$ are cyclically shifted by one.
To obtain the values of $t_{\langle j-i \rangle,\langle k-i \rangle}$, a matrix $T = [t_{l,n}]$ is created where row $l$ corresponds
to the coefficients in the expansion of the product $\beta\beta^{2^l}$ and column $n$ corresponds to
the coefficients of $\beta^{2^n}$ in the expansion of the products $\beta\beta^{2^l}$, $l = 0, 1, \ldots, m-1$.
Matrix $T$ is referred to as the multiplication table for the normal basis in this thesis.
Note that in [30] the same multiplication table was denoted by $T_0$, while matrices $T_k$, $k =
1, 2, \ldots, m-1$, were defined for the expansion of $\beta^{2^k}\beta^{2^l}$. It should also be noted that the other
matrices $T_i$, $i = 1, 2, \ldots, m-1$, in [30] can be generated from $T_0$ by circular row/column
shifting. In this thesis we use only one multiplication matrix $T$ when deriving the formula
for computing $c_i$, $i = 0, 1, \ldots, m-1$.
Let the number of nonzero entries in $T$ be denoted by $C_N$. It can be seen from equations
(2.5) and (2.6) that the product coefficient $c_k$ can be computed by summing up exactly $C_N$
terms. Thus, the generation of each $c_k$ requires $C_N$ multiplication operations in $\mathbb{F}_2$ and
$C_N - 1$ addition operations in $\mathbb{F}_2$.
$C_N$ is referred to as the complexity of a normal basis and it has been shown that $C_N \geq
2m - 1$ [30]. When $C_N = 2m - 1$ for an NB, it is called an optimal normal basis (ONB).
Since computations in the optimal normal basis are minimized, these bases are of high
importance for cryptographic applications. A type II ONB corresponds to the case where
no row or column in $T$ contains more than two nonzero entries.
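Example 1. Consider the finite field $\mathbb{F}_{2^5}$. The root $\beta$ of the irreducible polynomial $P(x) =
x^5 + x^4 + x^3 + x + 1$ generates a normal basis $I = \{\beta, \beta^2, \beta^{2^2}, \beta^{2^3}, \beta^{2^4}\}$. The multiplication
table $T$ can be found as
$$T = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \end{bmatrix}.$$
The closed-form solution for multiplication can be found from (2.5) as follows, with all subscripts reduced modulo 5:
$$c_i = b_i a_{i+3} + b_{i+1}(a_{i+2} + a_{i+4}) + b_{i+2}(a_{i+1} + a_{i+4}) + b_{i+3}(a_i + a_{i+4}) + b_{i+4}(a_{i+1} + a_{i+2} + a_{i+3} + a_{i+4}). \tag{2.7}$$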
2.2.2 Redundant Basis and its Arithmetic in $\mathbb{F}_{2^m}$
2.2.2.1 Redundant Representation

Let $\beta$ be a primitive $n$th root of unity in some extension field of $\mathbb{F}_2$ ($\beta^n = 1$). The field
generated by $\beta$, that is, the splitting field of $x^n - 1$, is called the $n$th cyclotomic field and denoted by $\mathbb{F}_2^{(n)}$. Elements in $\mathbb{F}_2^{(n)}$ can be
represented in the form
$$A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \quad a_i \in \mathbb{F}_2,\ i = 0, 1, \ldots, n-1. \tag{2.8}$$


Let $\mathbb{F}_{2^m}$ be a field that can be embedded in $\mathbb{F}_2^{(n)}$. The following theorem characterizes
the relationship between $m$ and $n$.
Theorem 2.2.2. [25] Let $n$ be an odd positive integer. Then $\mathbb{F}_{2^m}$ is contained in $\mathbb{F}_2^{(n)}$ if and
only if $m$ divides the multiplicative order of 2 mod $n$.
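The divisibility condition of Theorem 2.2.2 is easy to test by direct search. The sketch below is an illustration only; the function names `mult_order_of_2` and `minimal_n` are hypothetical. It returns the smallest odd $n \geq 3$ for which $m$ divides the multiplicative order of 2 modulo $n$, which is the size of the smallest redundant basis for $\mathbb{F}_{2^m}$.

```python
def mult_order_of_2(n):
    """Multiplicative order of 2 modulo an odd integer n >= 3."""
    k, x = 1, 2 % n
    while x != 1:
        x = (x * 2) % n
        k += 1
    return k

def minimal_n(m):
    """Smallest odd n >= 3 such that m divides the order of 2 mod n.

    The search always terminates because n = 2^m - 1 satisfies the condition.
    """
    n = 3
    while mult_order_of_2(n) % m != 0:
        n += 2
    return n

for m in (4, 5):
    print(m, minimal_n(m))
```

For $m = 4$ the search stops at $n = 5 = m + 1$, and for $m = 5$ it stops at $n = 11 = 2m + 1$, matching the two basis sizes discussed in Remarks 2.2.3 and 2.2.4.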
For a given $\mathbb{F}_{2^m}$ we are particularly interested in $\mathbb{F}_2^{(n)}$ with the minimal value of $n$
such that $\mathbb{F}_{2^m}$ can be embedded in $\mathbb{F}_2^{(n)}$. Obviously, a field element $A \in \mathbb{F}_{2^m}$ can also be
represented with (2.8). Note that $1 + \beta + \beta^2 + \cdots + \beta^{n-1} = 0$, so the representation
of $A$ is not unique. For example, the two $n$-tuples $(a_0, a_1, \ldots, a_{n-1})$ and $(a_0 + 1, a_1 +
1, \ldots, a_{n-1} + 1)$ represent the same element $A$. By slightly abusing the terminology, the
set $[1, \beta, \beta^2, \ldots, \beta^{n-1}]$ is denoted as a redundant basis (RB) for $\mathbb{F}_{2^m}$ [44]. Also note that the
elements of an RB form a cyclic group of order $n$ and
$$\beta \cdot \beta^i = \begin{cases} \beta^{i+1}, & 0 \leq i \leq n-2, \\ 1, & i = n-1. \end{cases}$$

2.2.2.2 Redundant Basis Multiplication

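Let field elements $A, B \in \mathbb{F}_{2^m}$ be represented with respect to the RB $I_1 = [1, \beta, \beta^2, \ldots, \beta^{n-1}]$
as
$$A = \sum_{i=0}^{n-1} a_i\beta^i \quad \text{and} \quad B = \sum_{i=0}^{n-1} b_i\beta^i,$$
respectively, where $a_i, b_i \in \mathbb{F}_2$, $i = 0, 1, \ldots, n-1$. Note that $n \geq m + 1$ and $\beta^n = 1$. Then
it follows that $\beta^i \cdot B = \sum_{j=0}^{n-1} b_{\langle j-i \rangle}\beta^j$, where $\langle j - i \rangle$ denotes that $j - i$ is to be reduced modulo
$n$. The product of field elements $A$ and $B$ can be given by [44]
$$A \cdot B = \sum_{i=0}^{n-1} a_i(\beta^i \cdot B) = \sum_{i=0}^{n-1} a_i\left(\sum_{j=0}^{n-1} b_{\langle j-i \rangle}\beta^j\right) = \sum_{j=0}^{n-1}\left(\sum_{i=0}^{n-1} a_i b_{\langle j-i \rangle}\right)\beta^j.$$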
If we define $A \cdot B = C = \sum_{j=0}^{n-1} c_j\beta^j$, then $c_j$ can be given by
$$c_j = \sum_{i=0}^{n-1} a_i b_{\langle j-i \rangle}, \quad j = 0, 1, \ldots, n-1. \tag{2.9}$$

2.2.2.3 Redundant Basis and Normal Basis

As mentioned before, when the complexity of normal basis multiplication is minimized the
normal basis is referred to as an optimal normal basis. Optimal normal basis representations
are classified as Type I or Type II. Since computations in the optimal normal basis are
minimized, these bases are of high importance for cryptographic applications. For the class
of fields where there exists a type I ONB, the size of the redundant basis representation is
almost the same as that of the NB, as shown in Remark 2.2.3. For the class of fields where
there exists a type II ONB, the size of the redundant basis representation is almost twice
that of the NB, as shown in Remark 2.2.4. Note that the complexities of redundant basis multipliers
for the class of fields where there exists a type II ONB can be greatly reduced by applying a
symmetry property of the redundant representation, which is shown in Remark 2.2.5.
Remark 2.2.3. [44] If there is a type I optimal normal basis in $\mathbb{F}_{2^m}$, then $\mathbb{F}_{2^m}$ is contained
in $\mathbb{F}_2^{(m+1)}$, so there is a redundant basis of size $m + 1$ for $\mathbb{F}_{2^m}$.
Remark 2.2.4. [44] If there is a type II optimal normal basis in $\mathbb{F}_{2^m}$, then $\mathbb{F}_{2^m}$ is contained
in $\mathbb{F}_2^{(2m+1)}$, so there is a redundant basis of size $2m + 1$ for $\mathbb{F}_{2^m}$.

Remark 2.2.5. Assume that there exists a Gauss period of type $(m, k)$ over $\mathbb{F}_2$. Then $I =
[1, \beta, \beta^2, \ldots, \beta^{n-1}]$ is a redundant representation basis for $\mathbb{F}_{2^m}$ over $\mathbb{F}_2$, where $n = mk + 1$.
Let $A \in \mathbb{F}_{2^m}$ and $A = (a_0, a_1, \ldots, a_{n-1})$ with respect to $I$. Assume $k \geq 2$ is even; then
$$a_i = a_{n-i}, \quad \text{for } i = 1, 2, \ldots, n-1. \tag{2.10}$$
Proof: This is a direct result of Lemma 2 in [44], noting that the redundant basis
$I$ and the basis $I_4$ used in [44] satisfy $I = \{1\} \cup I_4$.

Chapter 3
Comb Architectures for Finite Field
Multiplication in $\mathbb{F}_{2^m}$

3.1 Introduction
For applications to elliptic curve and ElGamal public key cryptography, binary field sizes
are required to be at least 160 and 1000 bits, respectively [18]. A full bit-parallel finite field
multiplier in these fields could be difficult to implement using current VLSI technology
and would also be inefficient, considering that the width of the system data bus is usually much
smaller than the size of the field. A bit-serial finite field multiplier in $\mathbb{F}_{2^m}$, on the other hand, usually
requires $m$ clock cycles to perform one operation, which is too slow. Hybrid architectures
offer moderate complexity and relatively high speed.
There are at least two types of hybrid architectures: bit-parallel word-serial and bit-serial word-parallel (or comb style). One important difference between the two types of
architectures is that the throughput¹ of a bit-parallel word-serial architecture is proportional
to the word size, whereas the throughput of a comb style architecture is proportional to the
reciprocal of the word size. This difference makes a comb style architecture attractive
where both high throughput and small word size are required. For example, a comb style
architecture would sustain a higher throughput than a bit-parallel word-serial one
if the word size is chosen to be smaller than the square root of the field size $m$.
One important factor that affects the efficiency of finite field computation is the choice of
the basis. Several types of bases have been utilized for the construction of finite field multipliers,
including polynomial basis [27], normal basis [22], dual basis [3], triangular basis [19]
and redundant representation or redundant basis [7, 44, 23]. The advantage of the redundant
basis is that all the basis elements form a cyclic group, so that the modulo
reduction can be avoided in the multiplication operation.
The idea of using redundant representation was first introduced in [7], where arithmetic
in $\mathbb{F}_{2^m}$ is performed in the ring $\mathbb{F}_2[x]/(x^n - 1)$. In [7], $n$ is chosen as the minimal value such
that an irreducible polynomial of degree $m$ is a factor of $x^n - 1$. This representation is called
polynomial ring basis representation. It was later found in [44] that the value of $n$ can be
reduced further while still having $\mathbb{F}_{2^m}$ embedded in $\mathbb{F}_2[x]/(x^n - 1)$. It has been proved in [44]
that the value of $n$ is optimal when $\mathbb{F}_2^{(n)}$ is the cyclotomic field of $\mathbb{F}_{2^m}$. When a type II
optimal normal basis exists in $\mathbb{F}_{2^m}$, it is found in [12, 44] that a reordered normal basis can
be derived from the redundant basis such that this reordered normal basis not only offers
free squaring but also avoids the modulo reduction step in a multiplication operation.
In this chapter, we propose two new bit-serial word-parallel or comb style multipliers,
one using redundant representation and the other using reordered normal basis. It is shown
that the proposed redundant representation multiplier has a smaller critical path delay and
thus can operate much faster than previous similar methods. The proposed reordered normal basis multiplier also has a significantly smaller critical path delay compared
to most recent similar multipliers using reordered normal basis or normal basis. Note that by
choosing the word size equal to one or to the field size $m$, the proposed multiplier would have
a bit-serial or a full bit-parallel architecture. We have also implemented the proposed multipliers using FPGA, and the results show that the proposed multipliers are faster than the
ones presented in [44] by 66 and 81 percent, respectively.

¹ Here we refer to throughput as the number of operations per clock cycle.
The rest of this chapter is organized as follows: Section 3.2 is a brief review of
redundant basis, reordered normal basis and their multiplication operations. New hybrid
multipliers using redundant basis and reordered normal basis are proposed in Sections 3.3
and 3.4, respectively. The complexities of the proposed multipliers are also compared to
those of previous proposals in these sections. FPGA implementations of multipliers with different word
sizes are presented in Section 3.5, and a few concluding remarks are given in Section 3.6.

3.2 Preliminaries on Finite Field Bases and Arithmetic in $\mathbb{F}_{2^m}$
3.2.1 Redundant Representation and Multiplication
Let $x^n - 1$ be a polynomial defined over $\mathbb{F}_2$. Then the splitting field of $x^n - 1$ is called the $n$th
cyclotomic field, denoted by $\mathbb{F}_2^{(n)}$. Let $\mathbb{F}_{2^m}$ be a field which can be embedded in $\mathbb{F}_2^{(n)}$. Then
the following theorem characterizes the relationship between $n$ and $m$.
Theorem 3.2.1. Let $n$ be an odd positive integer. Then $\mathbb{F}_{2^m}$ is contained in $\mathbb{F}_2^{(n)}$ if and only
if $m$ divides the multiplicative order of 2 mod $n$.
Proof: This is a special case of the theorem in [44]. $\blacksquare$
We are particularly interested in $\mathbb{F}_2^{(n)}$ with the minimal value of $n$ in which $\mathbb{F}_{2^m}$ can be
embedded. Let $\beta$ belong to some extension field of $\mathbb{F}_2$ and be a primitive $n$th root of
unity. Then $\mathbb{F}_2^{(n)}$ can be generated by $\beta$ and elements of $\mathbb{F}_2^{(n)}$ can be represented by
$$A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \quad a_i \in \mathbb{F}_2,\ i = 0, 1, \ldots, n-1. \tag{3.1}$$


Such a representation of $A$ is not unique since $1 + \beta + \beta^2 + \cdots + \beta^{n-1} = 0$. We call (3.1)
the redundant representation of $A$, or by slight abuse of the term 'basis', we call the set
$[1, \beta, \beta^2, \ldots, \beta^{n-1}]$ a redundant basis for any subfield of $\mathbb{F}_2^{(n)}$ [44].
One advantage of using redundant basis for finite field arithmetic is that the modulo
reduction step can be avoided for the multiplication operation. This is due to the fact that the
redundant basis elements form a cyclic group of order $n$ and
$$\beta \cdot \beta^i = \begin{cases} \beta^{i+1}, & 0 \leq i \leq n-2, \\ 1, & i = n-1. \end{cases}$$
Consider the redundant basis for $\mathbb{F}_{2^m}$ over $\mathbb{F}_2$:
$$I_1 = [1, \beta, \beta^2, \ldots, \beta^{n-1}].$$
Let $A = (a_0, a_1, \ldots, a_{n-1})$, $B = (b_0, b_1, \ldots, b_{n-1}) \in \mathbb{F}_{2^m}$ be represented with respect to
(w.r.t.) $I_1$, where $a_i, b_i \in \mathbb{F}_2$. The multiplication operation can be given by [44]:
$$A \cdot B = C = \sum_{j=0}^{n-1} c_j\beta^j,$$
where
$$c_j = \sum_{i=0}^{n-1} a_i b_{\langle j-i \rangle}, \quad j = 0, 1, \ldots, n-1. \tag{3.2}$$
Note that $\langle j - i \rangle$ denotes that $j - i$ is to be reduced modulo $n$.
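Equation (3.2) is a cyclic convolution over $\mathbb{F}_2$, which makes it simple to prototype in software. The following sketch is illustrative only; `rb_mul` and `complement` are hypothetical helper names, and the example values assume $n = 5$, the redundant basis size for $\mathbb{F}_{2^4}$.

```python
def rb_mul(a, b):
    """Redundant-basis product, eq. (3.2): c_j = sum_i a_i * b_<j-i> over F_2."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) % 2 for j in range(n)]

def complement(a):
    """Add 1 + beta + ... + beta^(n-1) = 0: another n-tuple for the same element."""
    return [x ^ 1 for x in a]

a = [1, 0, 1, 0, 0]      # 1 + beta^2
b = [0, 1, 1, 0, 0]      # beta + beta^2
c = rb_mul(a, b)         # beta + beta^2 + beta^3 + beta^4
print(c, complement(c))
```

No modular reduction by a field polynomial appears anywhere; the index wrap-around `(j - i) % n` is all that is needed. In this example the product is the complement of $(1, 0, 0, 0, 0)$, so the two operands are inverses of each other, and replacing either operand by its complement changes the output tuple at most by a complement, never the field element it represents.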

3.2.2 Reordered Normal Basis Representation and Multiplication

Theorem 3.2.2. [12] Let $\beta$ be a primitive $(2m+1)$st root of unity in $\mathbb{F}_{2^{2m}}$ and let $\gamma = \beta + \beta^{-1}$
generate a type II optimal normal basis. Then $\{\gamma_i, i = 1, 2, \ldots, m\}$ with $\gamma_i = \beta^i + \beta^{-i} =
\beta^i + \beta^{2m+1-i}$, $i = 1, 2, \ldots, m$, is also a basis in $\mathbb{F}_{2^m}$.

It has been shown that the basis $\{\gamma_i, i = 1, 2, \ldots, m\}$ is a permutation of the normal
basis $\{\gamma^{2^i}, i = 0, 1, \ldots, m-1\}$ [44]. We denote the basis $I_2 = [\gamma_1, \gamma_2, \ldots, \gamma_m]$ as the
reordered normal basis following [44].


Reordered normal basis not only offers free squaring but also avoids the modulo reduction step in a multiplication operation. Multiplication using the reordered normal basis can
be described as follows. Define
$$s(i) = \begin{cases} i \bmod (2m+1), & \text{if } 0 \leq i \bmod (2m+1) \leq m, \\ 2m + 1 - \big(i \bmod (2m+1)\big), & \text{otherwise.} \end{cases}$$
Let $A = (a_1, \ldots, a_m)$, $B = (b_1, \ldots, b_m) \in \mathbb{F}_{2^m}$ w.r.t. $I_2$, and let $b_0 = 0$. Then the product
coefficients are given by [12, 44]
$$c_j = \sum_{i=1}^{m} a_i\,(b_{s(j+i)} + b_{s(j-i)}), \quad j = 1, 2, \ldots, m, \tag{3.3}$$
where $c_j$ is defined by $A \cdot B = C = \sum_{j=1}^{m} c_j\gamma_j$.
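The index arithmetic in (3.3) can be cross-checked against the redundant representation. The sketch below assumes $m = 5$ (so $2m + 1 = 11$) and uses hypothetical helper names. It expands each $\gamma_i$ as $\beta^i + \beta^{2m+1-i}$, multiplies by cyclic convolution as in (3.2), and compares the result with (3.3). The check only exercises the identity $\gamma_i\gamma_j = \gamma_{s(i+j)} + \gamma_{s(i-j)}$ in the ring $\mathbb{F}_2[x]/(x^{2m+1} - 1)$; it does not construct the field itself.

```python
def s(i, m):
    """Index map s(i) from Section 3.2.2."""
    r = i % (2 * m + 1)
    return r if r <= m else 2 * m + 1 - r

def rnb_mul(a, b, m):
    """Eq. (3.3). a and b have length m+1; index 0 is unused and must be 0."""
    c = [0] * (m + 1)
    for j in range(1, m + 1):
        c[j] = sum(a[i] & (b[s(j + i, m)] ^ b[s(j - i, m)])
                   for i in range(1, m + 1)) % 2
    return c

def to_ring(a, m):
    """Expand gamma_i = beta^i + beta^(n-i), n = 2m+1, into an n-tuple."""
    n = 2 * m + 1
    v = [0] * n
    for i in range(1, m + 1):
        v[i] = v[n - i] = a[i]
    return v

def ring_mul(u, v):
    """Cyclic convolution over F_2, as in eq. (3.2)."""
    n = len(u)
    return [sum(u[i] & v[(j - i) % n] for i in range(n)) % 2 for j in range(n)]

m = 5
a = [0, 1, 0, 1, 1, 0]   # gamma_1 + gamma_3 + gamma_4
b = [0, 0, 1, 1, 0, 1]   # gamma_2 + gamma_3 + gamma_5
c = rnb_mul(a, b, m)
print(c[1:], to_ring(c, m) == ring_mul(to_ring(a, m), to_ring(b, m)))
```

Setting `b[0] = 0` matters: the term $b_{s(j-i)}$ with $i = j$ refers to $b_0$, which corresponds to $\gamma_0 = \beta^0 + \beta^0 = 0$ in characteristic two.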

3.3 Proposed Hybrid Multiplier Using Redundant Representation
3.3.1 Bit-serial word-parallel multiplication algorithm
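From (3.2) it can be seen that one product bit $c_j$ is a sum of $n$ terms where each term is a
partial product bit $a_i b_{\langle j-i \rangle}$. Let $w$ denote the word size. Write the subscript of $a_i$ in (3.2)
as $i = kw + \ell$, $k = 0, 1, \ldots, \lceil n/w \rceil - 1$ and $\ell = 0, 1, \ldots, w - 1$. Replacing $i$ in (3.2) with
$kw + \ell$ gives
$$c_j = \sum_{\ell=0}^{w-1}\ \sum_{k=0}^{\lceil n/w \rceil - 1} a_{kw+\ell}\, b_{\langle j-kw-\ell \rangle}, \quad j = 0, 1, \ldots, n-1. \tag{3.4}$$
Let
$$c_j^{(\ell)} = \sum_{k=0}^{\lceil n/w \rceil - 1} a_{kw+\ell}\, b_{\langle j-kw-\ell \rangle}, \quad \ell = 0, 1, \ldots, w-1. \tag{3.5}$$
Then (3.4) becomes
$$c_j = \sum_{\ell=0}^{w-1} c_j^{(\ell)}, \quad j = 0, 1, \ldots, n-1. \tag{3.6}$$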

Figure 3.1: Words vs. comb style inputs
Clearly, if $c_j^{(\ell)}$, $\ell = 0, 1, \ldots, w-1$, $j = 0, 1, \ldots, n-1$, can be computed and implemented
in one clock cycle, then it takes only $w$ clock cycles to obtain $c_j$ for $j = 0, 1, \ldots, n-1$.
A multiplication carried out with (3.5) and (3.6) requires a comb style input $A$,
while input $B$ has a much more complex format. Input operand $A$ can be represented with
$\lceil n/w \rceil$ words
$$A = \underbrace{a_{\lceil n/w \rceil w-1}\, a_{\lceil n/w \rceil w-2} \cdots a_{(\lceil n/w \rceil - 1)w}}_{A_{\lceil n/w \rceil - 1}}\ \cdots\ \underbrace{a_{w-1}\, a_{w-2} \cdots a_0}_{A_0},$$
where each word $A_i = a_{iw+w-1}\, a_{iw+w-2} \cdots a_{iw}$ contains $w$ bits. Note that if $w$ does not divide
$n$, then the most significant word $A_{\lceil n/w \rceil - 1}$ of $A$ contains some zero bits at its most
significant bit positions. Let the comb inputs from $A$ be denoted by $A[w-1], A[w-2], \ldots, A[0]$. Fig. 3.1 shows how the comb inputs can be obtained from the words of the
input operand $A$. It can be seen that
$$A[\ell] = a_{(\lceil n/w \rceil - 1)w+\ell}\, a_{(\lceil n/w \rceil - 2)w+\ell} \cdots a_{\ell}, \quad \text{for } \ell = 0, 1, \ldots, w-1.$$
Define $B^{(j)}[\ell]$ such that
$$B^{(j)}[\ell] = b_{\langle j-(\lceil n/w \rceil - 1)w-\ell \rangle}\, b_{\langle j-(\lceil n/w \rceil - 2)w-\ell \rangle} \cdots b_{\langle j-\ell \rangle}.$$
Note that $B^{(j+1)}[\ell]$ is a leftward circular shift of $B^{(j)}[\ell]$ by one bit.


Let the inner product between two $n$-tuple vectors $X$ and $Y$ be given by
$$X \odot Y = \sum_{i=0}^{n-1} x_i y_i,$$
where $X = (x_0 x_1 \cdots x_{n-1})$, $Y = (y_0 y_1 \cdots y_{n-1})$ and $x_i, y_i \in \mathbb{F}_2$. Then it follows from
(3.4) that
$$c_j = \sum_{\ell=0}^{w-1} A[\ell] \odot B^{(j)}[\ell], \quad j = 0, 1, \ldots, n-1.$$

The algorithm for comb style multiplication using redundant representation is given below.
Algorithm 1. Comb style multiplication using redundant basis
Input: $A[\ell]$, $B^{(j)}[\ell]$
Output: $c_j$
1. For $j = 0$ to $n - 1$
2.     $c_j = 0$;
3.     For $\ell = 0$ to $w - 1$
4.         $c_j = c_j + A[\ell] \odot B^{(j)}[\ell]$;

3.3.2 Comb style multiplication architecture

A comb style multiplication architecture can be developed from Algorithm 1. Operand $A$ is
available at the input in a comb style, i.e., $A[\ell]$ is available at the $\ell$th cycle, for $\ell = 0, 1, \ldots, w-1$.
Operand $B$ is stored in a circular shift register from which $B^{(j)}[\ell]$, $j = 0, 1, \ldots, n-1$, can
be read in cycle $\ell$. A combinatorial circuit module consisting of $\lceil n/w \rceil$ AND gates
and $\lceil n/w \rceil$ XOR gates is designed to generate the inner product $A[\ell] \odot B^{(j)}[\ell]$ in one clock
cycle.
In the first clock cycle the input bits are $A[0] = a_{(\lceil n/w \rceil - 1)w}\, a_{(\lceil n/w \rceil - 2)w} \cdots a_0$, and by the
end of the cycle the contents of registers $c_j$ are $A[0] \odot B^{(j)}[0]$. In the second clock cycle the
input bits are $A[1]$, and by the end of the cycle the contents of $c_j$ are $A[0] \odot B^{(j)}[0] + A[1] \odot
B^{(j)}[1]$. In the $w$th clock cycle the input bits are $A[w-1]$. Note that an input bit whose index
exceeds $n - 1$ should be replaced by a zero bit. By the end of the $w$th cycle the contents of
registers $c_j$ are
$$c_j = \sum_{\ell=0}^{w-1} A[\ell] \odot B^{(j)}[\ell], \quad \text{for } j = 0, 1, \ldots, n-1.$$
Clearly, $w$ clock cycles are required to finish one multiplication operation.
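The cycle-by-cycle behaviour described above can be modelled in software. The sketch below is a behavioural model only, not a description of the circuit; the function names are hypothetical and the example assumes $n = 5$ and $w = 2$, the parameters quoted for Fig. 3.2. The outer loop plays the role of the clock, and the two inner loops stand for logic that the hardware evaluates in parallel.

```python
from math import ceil

def comb_rb_mul(a, b, w):
    """LSB-first comb style redundant-basis multiplication (Algorithm 1)."""
    n = len(a)
    words = ceil(n / w)                     # bits per comb input A[l]
    a_pad = a + [0] * (words * w - n)       # indices above n-1 are fed as zeros
    c = [0] * n                             # accumulator registers c_j
    for l in range(w):                      # one clock cycle per comb input
        for j in range(n):                  # n inner-product modules in parallel
            for k in range(words):          # A[l] (.) B^(j)[l]
                c[j] ^= a_pad[k * w + l] & b[(j - k * w - l) % n]
    return c

def direct_rb_mul(a, b):
    """Reference: eq. (3.2) evaluated directly."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) % 2 for j in range(n)]

a = [1, 1, 0, 1, 0]
b = [0, 1, 1, 0, 1]
print(comb_rb_mul(a, b, 2), direct_rb_mul(a, b))
```

Because the accumulation is over $\mathbb{F}_2$, the order in which the comb inputs arrive does not affect the result, which is why an MSB-first variant only needs the loop over $\ell$ reversed.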
The output bits $c_j$ are produced by accumulation circuits after $w$ cycles. A comb
style redundant basis multiplier for $\mathbb{F}_{2^4}$ is shown in Fig. 3.2, where $n = 5$ and we choose
$w = 2$. The multiplier architecture discussed so far is a least-significant-bit (LSB) first
version, where the multiplier takes operand $A$ in the order $A[0], A[1], \ldots, A[w-1]$. A
most-significant-bit (MSB) first version of the comb style multiplication algorithm can also be
developed by changing line 3 in Algorithm 1 to
3. For $\ell = w - 1$ to 0 step $-1$
An MSB first version of the comb style redundant basis multiplier for $\mathbb{F}_{2^4}$ is shown in
Fig. 3.3.

Figure 3.2: Proposed comb style redundant basis multiplier for $\mathbb{F}_{2^4}$ (LSB first)

Figure 3.3: Proposed comb style redundant basis multiplier for $\mathbb{F}_{2^4}$ (MSB first)

3.3.3 Architecture complexities and comparison
Architectural complexity and gate counts for the proposed multiplier, along with similar
previous proposals, are shown in Table 3.1. In the table the delay of a two-input AND gate
is denoted by $T_A$ and the delay of an $n$-input XOR gate is approximated by $\lceil \log_2 n \rceil T_X$.
The first row of the table shows the complexity result of the hybrid multiplier previously
proposed in [44], where $w$ denotes the number of parallel modules. If we set the value of $w$
for the hybrid multiplier in [44] to be approximately equal to $\lceil n/w \rceil$ in the proposed multiplier, then the complexities and the number of clock cycles for the two multipliers are about
the same except for the critical path delay, where the proposed multiplier has a significantly smaller
critical path delay than the one in [44].
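Multipliers | Basis | # AND | # XOR | # Reg (bits) | # Cycles | Critical Path Delay
Hybrid [44] | redundant | $nw$ | $(n-1)w$ | $2n$ | $\lceil n/w \rceil$ | $T_A + \lceil \log_2 n \rceil T_X$
Proposed | redundant | $n\lceil n/w \rceil$ | $n\lceil n/w \rceil$ | $2n$ | $w$ | $T_A + \lceil \log_2 \lceil n/w \rceil \rceil T_X$

Table 3.1: Complexities comparison between hybrid redundant basis multipliers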


Note that the comb style input does not cause any problem. Assume that a point addition/doubling is the basic operation for an elliptic curve system and a finite field exponentiation is the basic operation for an ElGamal system. There are dozens of finite field
multiplications, squarings and/or additions required to complete one basic operation [18].
Consider a system where finite field addition and squaring operations are performed by
bit-parallel structures and multiplication operations are performed by the proposed multiplier. Since all the sum or product bits are available at the same time, it does not make any
difference in what order the input coefficients are fed into the multiplier.

3.4 Proposed Hybrid Multiplier Using Reordered Normal Basis
3.4.1 Bit-serial word-parallel multiplication algorithm
It can be seen from (3.3) that $c_j$ is a sum of $m$ terms where each term is a partial product
bit $a_i\,(b_{s(j+i)} + b_{s(j-i)})$. Let $w$ denote the word size. Write the subscript of $a_i$ in (3.3) as $i = kw + \ell$,
$k = 0, 1, \ldots, \lceil m/w \rceil - 1$ and $\ell = 1, 2, \ldots, w$. Replacing $i$ in (3.3) with $kw + \ell$ gives
$$c_j = \sum_{\ell=1}^{w}\ \sum_{k=0}^{\lceil m/w \rceil - 1} a_{kw+\ell}\,(b_{s(j+kw+\ell)} + b_{s(j-kw-\ell)}), \tag{3.7}$$
for $j = 1, \ldots, m$. Let
$$c_j^{(\ell)} = \sum_{k=0}^{\lceil m/w \rceil - 1} a_{kw+\ell}\,(b_{s(j+kw+\ell)} + b_{s(j-kw-\ell)}),$$
for $\ell = 1, 2, \ldots, w$. If $c_j^{(\ell)}$ can be implemented and computed in one clock cycle, then it
takes only $w$ clock cycles to obtain $c_j = \sum_{\ell=1}^{w} c_j^{(\ell)}$.

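Let $A[\ell]$, $B_-^{(j)}[\ell]$ and $B_+^{(j)}[\ell]$ be defined respectively by
$$A[\ell] = a_{(\lceil m/w \rceil - 1)w+\ell}\, a_{(\lceil m/w \rceil - 2)w+\ell} \cdots a_{\ell},$$
$$B_-^{(j)}[\ell] = b_{s(j-(\lceil m/w \rceil - 1)w-\ell)}\, b_{s(j-(\lceil m/w \rceil - 2)w-\ell)} \cdots b_{s(j-\ell)},$$
and
$$B_+^{(j)}[\ell] = b_{s(j+(\lceil m/w \rceil - 1)w+\ell)}\, b_{s(j+(\lceil m/w \rceil - 2)w+\ell)} \cdots b_{s(j+\ell)}, \quad \text{for } \ell = 1, 2, \ldots, w.$$
Then it follows from (3.7) that
$$c_j = \sum_{\ell=1}^{w} A[\ell] \odot \left(B_-^{(j)}[\ell] + B_+^{(j)}[\ell]\right), \quad j = 1, 2, \ldots, m. \tag{3.8}$$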
The algorithm (LSB first) for comb style multiplication using reordered normal basis is
given below. An MSB version of this algorithm is also available by changing line 3.
Algorithm 2. Comb style multiplication using reordered normal basis (LSB first)
Input: $A[\ell]$, $B_-^{(j)}[\ell]$, $B_+^{(j)}[\ell]$
Output: $c_j$
1. For $j = 1$ to $m$
2.     $c_j = 0$;
3.     For $\ell = 1$ to $w$
4.         $c_j = c_j + A[\ell] \odot (B_-^{(j)}[\ell] + B_+^{(j)}[\ell])$;
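Algorithm 2 can be modelled in software in the same way as Algorithm 1. The sketch below is a behavioural illustration with hypothetical function names, assuming $m = 5$ and $w = 2$. It accumulates one comb input per iteration of the outer loop, following (3.7), and compares the result with the direct formula (3.3).

```python
from math import ceil

def s(i, m):
    """Index map s(i) from Section 3.2.2."""
    r = i % (2 * m + 1)
    return r if r <= m else 2 * m + 1 - r

def comb_rnb_mul(a, b, m, w):
    """LSB-first comb style RNB multiplication; a, b have length m+1, index 0 is 0."""
    words = ceil(m / w)
    a_pad = a + [0] * (words * w - m)        # a_i = 0 for i > m
    c = [0] * (m + 1)
    for l in range(1, w + 1):                # one clock cycle per comb input A[l]
        for j in range(1, m + 1):            # parallel in hardware
            for k in range(words):
                i = k * w + l
                c[j] ^= a_pad[i] & (b[s(j + i, m)] ^ b[s(j - i, m)])
    return c

def direct_rnb_mul(a, b, m):
    """Reference: eq. (3.3) evaluated directly."""
    c = [0] * (m + 1)
    for j in range(1, m + 1):
        c[j] = sum(a[i] & (b[s(j + i, m)] ^ b[s(j - i, m)])
                   for i in range(1, m + 1)) % 2
    return c

m, w = 5, 2
a = [0, 1, 0, 1, 1, 0]
b = [0, 0, 1, 1, 0, 1]
print(comb_rnb_mul(a, b, m, w)[1:], direct_rnb_mul(a, b, m)[1:])
```

The XOR of the two $b$ operands before the AND corresponds to the sum $B_-^{(j)}[\ell] + B_+^{(j)}[\ell]$ in line 4 of the algorithm, so each partial product needs one AND gate and one extra XOR gate compared with the redundant basis version.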

3.4.2 Bit-serial word-parallel multiplier architecture
Similar to the redundant basis multiplier proposed in the previous section, a comb style
multiplication architecture can be developed from Algorithm 2. Operand $A$ is available
at the input in comb style, i.e., $A[\ell]$ is available at the $\ell$th cycle, $\ell = 0, 1, \ldots, w-1$. Note that the
circular shift register to store $B$ is $(2m - 1)$ bits, which is almost double the size of operand
$B$. The initial contents of this $(2m - 1)$-bit shift register have to be carefully arranged so that
in each of the following clock cycles both $B_-^{(j)}[\ell]$ and $B_+^{(j)}[\ell]$, $j = 0, 1, \ldots, n-1$, can be read
from the register. A combinatorial circuit network is used to generate the inner product
$A[\ell] \odot (B_-^{(j)}[\ell] + B_+^{(j)}[\ell])$ in each clock cycle.


Figure 3.4: Proposed comb style multiplier in $\mathbb{F}_{2^5}$ using reordered normal basis (LSB first)
For the LSB first version of the multiplier architecture, by the end of the first clock
cycle the contents of registers $c_j$ are $A[0] \odot (B_-^{(j)}[0] + B_+^{(j)}[0])$. By the end of the second
cycle the contents of $c_j$ are $A[0] \odot (B_-^{(j)}[0] + B_+^{(j)}[0]) + A[1] \odot (B_-^{(j)}[1] + B_+^{(j)}[1])$. By the
end of the $w$th clock cycle the contents of registers $c_j$ are
$$c_j = \sum_{\ell=0}^{w-1} A[\ell] \odot \left(B_-^{(j)}[\ell] + B_+^{(j)}[\ell]\right), \quad \text{for } j = 0, 1, \ldots, n-1.$$
An MSB version of the multiplier can also be easily developed, similar to the one presented
in Section 3.3.
Fig. 3.4 shows such a multiplier architecture for $\mathbb{F}_{2^5}$ when $w = 2$. Note that $b_0 = 0$.
Output registers are initialized to zero, and after $\lceil m/w \rceil = 3$ clock cycles they should
contain the product bits $c_1, c_2, \ldots, c_5$.


3.4.3 Architecture complexities and comparison
The architecture complexities of the proposed hybrid multiplier using reordered normal
basis are shown in Table 3.2, where optimal normal basis type II is denoted by ONB II and
reordered normal basis is denoted by RNB. Complexities for some previously proposed
hybrid multipliers using reordered normal basis and normal basis are also listed in the table.
Note that most of the previous proposals are bit-parallel word-serial (BPWS) or digit-level
architectures, whereas the proposed architecture is bit-serial word-parallel (BSWP). Also
note that we use $d$ to denote the word (digit) size for BPWS multipliers and $w$ to denote the
word size for the proposed BSWP architecture.
In order to compare the different styles of architectures on a fair basis,
we assume that all the architectures take the same number of clock cycles to complete one
multiplication operation. This means that the value of $w$ for the proposed multiplier is equal to
$\lceil m/d \rceil$ for all the previous proposals shown in the table.
Multipliers

Basis

Style

MO [41]

ONB II

BPWS

AEDS [36]

ONB II

BPWS

(m - d/2 + l / 2 ) d

XEDS [36]

ONB II

BPWS

(2m — n)d

#AND
(2m -

\)d

#XOR

#Cycl

(2m - 2)d

\m/d~\

TA+

fm/d]

TA

+(1+

\log2(m)-\)Tx

(2m - d / 2 - 3 / 2 ) d

\m/d]

TA

+(1+

riog2("i)l)TX

\m/d\

2 T ^ + ( 3 + r i o g 2 ( d - 1)1 )TX

\m/d~\

2 T A + ( 3 + r i o g 2 ( d - 1)1 )TX

(3m -

d-2)d

w-SMPO 1 [37]

ONB II

BPWS

rn + (Lm/2J + \)d

(2m - l ) d

w-SMPO II [37]

ONB II

BPWS

m + md

( m + Lm/2J)d

Hybrid [44]

RNB

BPWS

md

Proposed here

RNB

BSWP

m\m/w~\

(2m -

\)d

Critical Path Delay

\m/d~\

2m[m/iu]

w

TA
TA

r i ° g 2 ( 2 m - !)1 T X

+ ( 1 + ["l°g2»*»l)Tx

+ ( 1 + riog2(r»n/'"l + l ) 1 ) T x

Table 3.2: Complexity comparison of hybrid multipliers using reordered normal basis or
normal basis
Fact 3.4.1. Assume that $m > 4$. Also assume that all multipliers shown in Table 3.2 are
neither bit-serial nor full bit-parallel architectures and that they take the same number of clock
cycles to complete one multiplication. Then the proposed BSWP multiplier has the smallest
critical path delay among all the multipliers listed in Table 3.2.
Proof: Since $w \geq 2$, the following inequality holds: $\lceil m/w \rceil \leq \lceil m/2 \rceil \leq (m+1)/2$.
Also, because $m > 4$, it follows that $2m - 1 = m + m - 1 > m + 3$. Then,
$$T_A + \left(1 + \lceil \log_2(\lceil m/w \rceil + 1) \rceil\right)T_X \leq T_A + \left(1 + \left\lceil \log_2 \frac{2m-1}{2} \right\rceil\right)T_X = T_A + \lceil \log_2(2m-1) \rceil T_X. \tag{3.9}$$
It is assumed that $w = \lceil m/d \rceil$, so $w \geq m/d$. Then we have $d \geq m/w$, or $d \geq \lceil m/w \rceil$ ($\because$
$d$ is an integer). We also assumed that $d \geq 2$, from which it follows that
$$4d = d + 3d \geq \lceil m/w \rceil + 6 > \lceil m/w \rceil + 5,$$
or
$$4(d - 1) > \lceil m/w \rceil + 1.$$
It follows that
$$2T_A + \left(3 + \lceil \log_2(d-1) \rceil\right)T_X > T_A + \left(1 + \lceil \log_2 4(d-1) \rceil\right)T_X \geq T_A + \left(1 + \lceil \log_2(\lceil m/w \rceil + 1) \rceil\right)T_X. \tag{3.10}$$
Also note from (3.9) that
$$T_A + \left(1 + \lceil \log_2 m \rceil\right)T_X = T_A + \lceil \log_2 2m \rceil T_X \geq T_A + \lceil \log_2(2m-1) \rceil T_X \geq T_A + \left(1 + \lceil \log_2(\lceil m/w \rceil + 1) \rceil\right)T_X. \tag{3.11}$$
Summarizing from (3.9), (3.10) and (3.11), we can conclude that the proposed BSWP
multiplier has the smallest critical path delay among all the multipliers listed in Table 3.2. $\blacksquare$
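The three inequalities can also be spot-checked numerically. The sketch below is an illustration, not part of the proof. It represents each critical path as a pair (number of $T_A$, number of $T_X$), takes the three competing expressions from (3.9), (3.10) and (3.11), and confirms that the proposed delay is no larger in either component for every $m$ from 5 to 300 and every digit size $2 \leq d \leq m - 1$. The helper names and the tested range are assumptions made here.

```python
def clog2(x):
    """ceil(log2(x)) for an integer x >= 1, computed without floating point."""
    return (x - 1).bit_length()

def ceil_div(a, b):
    return -(-a // b)

def path_delays(m, d):
    """Critical paths as (#T_A, #T_X) pairs, with w = ceil(m/d)."""
    w = ceil_div(m, d)
    proposed = (1, 1 + clog2(ceil_div(m, w) + 1))
    others = {
        "(3.9)":  (1, clog2(2 * m - 1)),
        "(3.10)": (2, 3 + clog2(d - 1)),
        "(3.11)": (1, 1 + clog2(m)),
    }
    return proposed, others

def fact_holds(m, d):
    proposed, others = path_delays(m, d)
    return all(proposed[0] <= o[0] and proposed[1] <= o[1] for o in others.values())

violations = [(m, d) for m in range(5, 301) for d in range(2, m) if not fact_holds(m, d)]
print(len(violations))
```

Comparing the two components separately avoids having to assume particular gate delays; if both counts are no larger, the total delay is no larger for any positive $T_A$ and $T_X$.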
The number of AND gates for the proposed multiplier is also the smallest except for
AEDS [36], which uses many more XOR gates. The number of XOR gates for the proposed
multiplier is moderate, smaller than that of AEDS [36] and w-SMPO II [37] but higher than
that of XEDS [36].


3.5 FPGA Implementations
The proposed multipliers have been implemented in FPGA. For comparison purposes,
some of the previous proposals have also been implemented using the same FPGA technology.
Considering the large number of gates involved in the proposed multipliers for a binary
field of practically significant size, we used the Altera Stratix FPGA family to implement
the multipliers.
FPGA implementation results of the comb style redundant basis multiplier for $n = 211$ are
shown in Table 3.3, along with the results for the hybrid redundant basis multiplier in [44].
We deliberately choose different values of $w$ for the two multipliers such that $w$ for the
proposed multiplier is equal to $\lceil n/w \rceil$ for the multiplier in [44]. This setting allows the
two multipliers to take the same number of clock cycles to complete one multiplication
operation. In fact, we set $w = 16$ for the proposed multiplier and $w = 14$, i.e., $\lceil n/w \rceil = 16$,
for the multiplier in [44]. It can be seen from Table 3.3 that the proposed multiplier uses
slightly more logic elements but has a much lower critical path delay than the multiplier
in [44].
We have implemented the comb style reordered normal basis multiplier for $\mathbb{F}_{2^{209}}$ and
$w = 16$. For comparison purposes, the hybrid reordered normal basis multiplier in [44] has
also been implemented in FPGA, where $w$ is chosen as 14 such that $\lceil n/w \rceil = 16$. The
implementation results are shown in Table 3.4. It can be seen that the proposed multiplier
not only uses slightly fewer logic elements but also has a much lower critical path delay.
It can be seen from Tables 3.3 and 3.4 that the speed improvements are 66% and 81%,
respectively.

| Multiplier | Field Size (n) | # Logic Elements | Critical Path Delay | # Clock Cycles |
|---|---|---|---|---|
| PISO [44] | 211 | 2185 | 6.25 ns | 16 |
| Proposed | 211 | 2321 | 3.77 ns | 16 |

Table 3.3: Comparison of FPGA implementations of hybrid redundant basis multipliers


| Multiplier | Field Size | # Logic Elements | Critical Path Delay | # Clock Cycles |
|---|---|---|---|---|
| PISO [44] | 209 | 3625 | 7.96 ns | 16 |
| Proposed | 209 | 3558 | 4.40 ns | 16 |

Table 3.4: FPGA implementation results for hybrid reordered normal basis multipliers
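
The speed figures quoted in this section can be rechecked from the critical path delays listed in Tables 3.3 and 3.4. The sketch below assumes that "speed improvement" means the relative increase in the maximum clock rate, that is, the ratio of the two delays minus one. The function name is ours and is only illustrative.

```python
def speed_improvement_percent(previous_delay_ns: float, proposed_delay_ns: float) -> float:
    """Relative gain in maximum clock rate, in percent.

    Assumption: the maximum clock rate is the reciprocal of the critical
    path delay, so the gain is (previous / proposed - 1) * 100.
    """
    if previous_delay_ns <= 0 or proposed_delay_ns <= 0:
        raise ValueError("delays must be positive")
    return (previous_delay_ns / proposed_delay_ns - 1.0) * 100.0


# Delays taken from Table 3.3 (redundant basis) and Table 3.4 (reordered normal basis).
redundant_basis_gain = speed_improvement_percent(6.25, 3.77)
reordered_nb_gain = speed_improvement_percent(7.96, 4.40)
print(round(redundant_basis_gain), round(reordered_nb_gain))
```

Under this assumption the two rounded values agree with the 66% and 81% stated in the text.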

3.6 Conclusions

Two new bit-serial word-parallel, or comb style, finite field multipliers have been proposed,
one using redundant representation and the other using reordered normal basis. The hybrid
architecture lets the designer set the trade-off between area and speed.
The architectural complexities of the proposed multipliers compare favorably with those of
previously proposed architectures of similar type. The two proposed multipliers have also
been implemented on an FPGA. The hardware implementation results show that the proposed
multipliers have much lower critical path delays, thus allowing much faster operating clock
rates. The proposed architectures are suitable for high speed cryptographic applications such
as elliptic curve cryptography.


Chapter 4
A New Finite Field Multiplier Using
Redundant Representation

4.1 Introduction
In [44], a redundant representation was derived from the minimal cyclotomic ring. The
idea is to embed the current field in the minimal cyclotomic ring that contains it and to
perform the field arithmetic operations in the ring. The redundant representation not only
offers a free squaring operation, as a normal basis does, but its 'basis' elements also form
a cyclic group, so the modulo reduction step can be avoided in field multiplication. Two
different types of bit-serial multipliers using redundant representation have been proposed:
one is parallel-in serial-out (PISO) and the other is serial-in parallel-out (SIPO) [44]. In
this chapter, a novel SIPO multiplier architecture is proposed. Compared to the previous
SIPO architecture, the proposed multiplier has significantly lower complexity while having
the same critical path
delay. Compared to the previous PISO multiplier, the proposed architecture has a significantly
smaller critical path delay while using about the same number of gates and registers. For the
class of fields for which a type I optimal normal basis (ONB) exists, it is also shown that the
proposed multiplier has lower complexity and a smaller critical path delay than the recently
proposed NB multipliers.
The rest of the chapter is organized as follows. Redundant representation and
its multiplication are reviewed in Section 2. A brief overview of previously proposed
bit-serial redundant representation multipliers is also given in that section. A new algorithm
and architecture for bit-serial redundant representation multiplication are proposed in
Section 3. A complexity comparison is made in Section 4. In Section 5, a digit-level version of
the proposed multiplier is presented. A few concluding remarks are made in Section 6.

4.2 Preliminaries
4.2.1 Redundant Representation
Let $\beta$ be a primitive $n$th root of unity in some extension field of $F_2$. The splitting field of $\beta$
is called the $n$th cyclotomic field and denoted by $F_2^{(n)}$. Elements in $F_2^{(n)}$ can be represented
in the form
$$A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \quad a_i \in F_2,\ i = 0, 1, \ldots, n-1. \qquad (4.1)$$

Let $F_{2^m}$ be a field that can be embedded in $F_2^{(n)}$. The following theorem characterizes
the relationship between $m$ and $n$.
Theorem 4.2.1. [25] Let $n$ be an odd positive integer. Then, $F_{2^m}$ is contained in $F_2^{(n)}$ if and
only if $m$ divides the multiplicative order of 2 mod $n$.
For a given $F_{2^m}$ we are particularly interested in $F_2^{(n)}$ with the minimal value of $n$
such that $F_{2^m}$ can be embedded in $F_2^{(n)}$. Obviously, a field element $A \in F_{2^m}$ can also be
represented with (4.1). Note that $1 + \beta + \beta^2 + \cdots + \beta^{n-1} = 0$ and the representation
of $A$ is not unique. For example, the two $n$-tuples $(a_0, a_1, \ldots, a_{n-1})$ and
$(a_0 + 1, a_1 + 1, \ldots, a_{n-1} + 1)$ represent the same element $A$. By slightly abusing the
terminology, the set $[1, \beta, \beta^2, \ldots, \beta^{n-1}]$ is denoted as redundant basis (RB) for $F_{2^m}$ [44].
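
The condition in Theorem 4.2.1 suggests a direct way to search for the minimal $n$ for a given $m$. The following is only a brute-force sketch under that reading of the theorem: it tries odd $n = 3, 5, 7, \ldots$ and stops at the first one for which $m$ divides the multiplicative order of 2 mod $n$. The function names are ours, not part of the thesis.

```python
def order_of_two(n: int) -> int:
    """Multiplicative order of 2 modulo an odd integer n >= 3."""
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd integer >= 3")
    order, power = 1, 2 % n
    while power != 1:
        power = (power * 2) % n
        order += 1
    return order


def minimal_rb_size(m: int) -> int:
    """Smallest odd n >= 3 such that m divides the order of 2 mod n.

    By Theorem 4.2.1 this is the smallest n for which F_{2^m} is
    contained in the n-th cyclotomic field over F_2.
    """
    if m < 1:
        raise ValueError("m must be a positive integer")
    n = 3
    while order_of_two(n) % m != 0:
        n += 2
    return n


print(minimal_rb_size(4), minimal_rb_size(268))
```

For $m = 4$ the search returns $n = 5$, and for $m = 268$ it returns $n = 269$, the value used in the example of Section 4.4.1. The search always terminates because $n = 2^m - 1$ satisfies the condition.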

4.2.2 Redundant Basis Multiplication

Let field elements $A, B \in F_{2^m}$ be represented with respect to (w.r.t.) the RB
$I_1 = [1, \beta, \beta^2, \ldots, \beta^{n-1}]$ as
$$A = \sum_{i=0}^{n-1} a_i\beta^i \quad\text{and}\quad B = \sum_{j=0}^{n-1} b_j\beta^j,$$
respectively, where $a_i, b_i \in F_2$, $i = 0, 1, \ldots, n-1$. Then it follows that
$\beta^i \cdot B = \sum_{j=0}^{n-1} b_{(j-i)}\beta^j$, where $(j - i)$ denotes that $j - i$ is to be reduced modulo $n$.
The product of field elements $A$ and $B$ can be given by [44]
$$A \cdot B = \sum_{i=0}^{n-1} a_i(\beta^i \cdot B) = \sum_{j=0}^{n-1}\Big(\sum_{i=0}^{n-1} a_i b_{(j-i)}\Big)\beta^j.$$
If we define $A \cdot B = C = \sum_{j=0}^{n-1} c_j\beta^j$, then $c_j$ can be given by
$$c_j = \sum_{i=0}^{n-1} a_i b_{(j-i)}, \quad j = 0, 1, \ldots, n-1. \qquad (4.2)$$
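
Equation (4.2) is a cyclic convolution over $F_2$, so it can be evaluated directly in software. The sketch below is a plain reference model rather than a hardware description. Coefficient vectors are Python lists with index $i$ holding $a_i$, and the helper names are ours. It also illustrates the non-uniqueness noted in Section 4.2.1: complementing every bit of $A$ leaves the product unchanged as a field element, so the computed coefficient vector is either the same or its complement.

```python
from typing import List


def rb_multiply(a: List[int], b: List[int]) -> List[int]:
    """Evaluate (4.2): c[j] = sum over i of a[i] * b[(j - i) mod n], over F_2."""
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length n")
    return [
        sum(a[i] & b[(j - i) % n] for i in range(n)) % 2
        for j in range(n)
    ]


def complement(bits: List[int]) -> List[int]:
    """Add 1 + beta + ... + beta^(n-1) = 0, i.e. flip every coefficient."""
    return [x ^ 1 for x in bits]


# Arbitrary example operands for n = 5 (chosen for illustration only).
a = [1, 0, 1, 1, 0]
b = [0, 1, 1, 0, 1]
c = rb_multiply(a, b)
c_alt = rb_multiply(complement(a), b)

# Both results represent the same field element.
same_element = c_alt == c or c_alt == complement(c)
print(c, c_alt, same_element)
```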

4.2.3 An Overview of Bit-Serial RB Multipliers

At least two types of bit-serial RB multipliers have been proposed, as shown in Fig. 4.1 [44].
One is SIPO type and the other is PISO type. The PISO multiplier uses fewer registers but
has a longer critical path delay. A complexity comparison between these previous two
multipliers and the proposed one will be made in the next section.

Figure 4.1: Previously proposed bit-serial RB multipliers: (a) PISO, (b) SIPO

4.3 Proposed SIPO Multiplier
4.3.1 A New Bit-Serial RB Multiplication Algorithm
Let $a_j$ and $b_j$, $j = 0, 1, \ldots, n-1$ be given as in Section 2. A new algorithm to compute
RB multiplication can be shown as follows.
Algorithm 3. Bit-serial RB multiplication [MSB first]
Input: $a_j, b_j$, $j = 0, 1, \ldots, n-1$
Output: $c_j$, $j = 0, 1, \ldots, n-1$
1. Initialization: $c_j^{(0)} = 0$, for $j = 0, 1, \ldots, n-1$.
2. For $k = 1$ To $n$
3.   For $j = 0$ To $n-1$
4.     $c_{(j)}^{(k)} = c_{(j-1)}^{(k-1)} + a_{(j)} b_{n-k}$; \qquad (4.3)
The final value $c_j^{(n)} = c_j$ for $j = 0, 1, \ldots, n-1$.


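Proof of correctness of Algorithm 3:
It follows from (4.3),
$$\begin{aligned}
c_j^{(n)} &= c_{(j-1)}^{(n-1)} + a_{(j)} b_0 \\
&= c_{(j-2)}^{(n-2)} + a_{(j-1)} b_1 + a_{(j)} b_0 \\
&= c_{(j-3)}^{(n-3)} + a_{(j-2)} b_2 + a_{(j-1)} b_1 + a_{(j)} b_0 \\
&\ \ \vdots \\
&= c_{(j-n)}^{(0)} + a_{(j+1-n)} b_{n-1} + \cdots + a_{(j-1)} b_1 + a_{(j)} b_0 \\
&= c_{(j-n)}^{(0)} + \sum_{i=0}^{n-1} a_{(j-i)} b_i. \qquad (4.4)
\end{aligned}$$
Note that $c_{(j-n)}^{(0)} = 0$ from Step 1 of Algorithm 3, then we have
$$c_j^{(n)} = \sum_{i=0}^{n-1} a_{(j-i)} b_i = \sum_{i=0}^{n-1} a_i b_{(j-i)}. \qquad (4.5)$$
The last step in (4.5) comes from proper substitution of subscripts. Comparing (4.5) and
(4.2), it follows that $c_j^{(n)} = c_j$. $\square$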
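
The recurrence (4.3) can be exercised with a small register-level model. The sketch below is illustrative only: it keeps the $n$ register bits in a list, applies one cyclic shift and one AND/XOR per register in each of the $n$ clock cycles, and compares the final register contents with a direct evaluation of (4.2). The function names are ours.

```python
from itertools import product
from typing import List


def rb_multiply_reference(a: List[int], b: List[int]) -> List[int]:
    """Direct evaluation of (4.2), used only as a check."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) % 2 for j in range(n)]


def sipo_bit_serial_msb_first(a: List[int], b: List[int]) -> List[int]:
    """Model of Algorithm 3: B enters one bit per cycle, MSB (b[n-1]) first."""
    n = len(a)
    regs = [0] * n                      # Step 1: all registers cleared
    for k in range(1, n + 1):           # Step 2: clock cycles 1..n
        b_bit = b[n - k]                # serial input bit b_{n-k}
        # Steps 3-4: register j takes register j-1 (cyclically) XOR a_j * b_{n-k}
        regs = [regs[(j - 1) % n] ^ (a[j] & b_bit) for j in range(n)]
    return regs


# Exhaustive comparison for n = 5.
n = 5
all_match = all(
    sipo_bit_serial_msb_first(list(a), list(b)) == rb_multiply_reference(list(a), list(b))
    for a in product([0, 1], repeat=n)
    for b in product([0, 1], repeat=n)
)
print(all_match)
```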

4.3.2 Multiplier Architecture
An architecture to realize Algorithm 3 is proposed and shown in Fig. 4.2. Every bit of
operand A should be available throughout the multiplication operation, while operand B
is available in bit-serial fashion with the MSB first. The contents of the n-bit registers are
initialized as zero. The registers are circularly connected and interleaved with XOR gates,
where the XOR gates perform the addition operation expressed in (4.3). At each clock
cycle, the registers cyclically shift and take on new values from the outputs of the XOR
gates. At clock cycle $k$, the content of register $R_j$ is $c_j^{(k)}$. It takes $n$ clock cycles for the
multiplier to finish one operation.
Table 4.1 shows the contents of the registers at the end of the $i$th clock cycle, $i = 1, 2, \ldots, n$.
At the end of multiplication, the registers $R_0, R_1, \ldots, R_{n-1}$ will respectively contain the
product coefficients $c_0, c_1, \ldots, c_{n-1}$.

Figure 4.2: Proposed serial-in parallel-out RB multiplier (MSB first)
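
| Register | Cycle 1 | Cycle 2 | Cycle $k$ | Cycle $n$ |
|---|---|---|---|---|
| $R_0$ | $a_0b_{n-1}$ | $a_{n-1}b_{n-1} + a_0b_{n-2}$ | $a_0b_{(n-k)} + a_{n-1}b_{(n-k+1)} + \cdots + a_{(1-k)}b_{n-1}$ | $a_0b_0 + a_{n-1}b_1 + \cdots + a_1b_{n-1}$ |
| $R_1$ | $a_1b_{n-1}$ | $a_0b_{n-1} + a_1b_{n-2}$ | $a_1b_{(n-k)} + a_0b_{(n-k+1)} + \cdots + a_{(2-k)}b_{n-1}$ | $a_1b_0 + a_0b_1 + \cdots + a_2b_{n-1}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $R_{n-2}$ | $a_{n-2}b_{n-1}$ | $a_{n-3}b_{n-1} + a_{n-2}b_{n-2}$ | $a_{n-2}b_{(n-k)} + a_{n-3}b_{(n-k+1)} + \cdots + a_{(n-1-k)}b_{n-1}$ | $a_{n-2}b_0 + a_{n-3}b_1 + \cdots + a_{n-1}b_{n-1}$ |
| $R_{n-1}$ | $a_{n-1}b_{n-1}$ | $a_{n-2}b_{n-1} + a_{n-1}b_{n-2}$ | $a_{n-1}b_{(n-k)} + a_{n-2}b_{(n-k+1)} + \cdots + a_{(n-k)}b_{n-1}$ | $a_{n-1}b_0 + a_{n-2}b_1 + \cdots + a_0b_{n-1}$ |

Table 4.1: Register contents $c_j^{(k)}$ during a multiplication operation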

A least-significant-bit (LSB) first version of the proposed multiplier is shown in Fig. 4.3,
which has applications where the least significant bits of operand B are available before the
other parts of the operand. Note that the proposed MSB first and LSB first versions have
the same complexities and critical path delays.

Figure 4.3: Proposed serial-in parallel-out RB multiplier (LSB first)


4.4 Complexity Comparison
4.4.1 Comparison to Other RB Multipliers
Compared to the two previous architectures [7, 44], the proposed one has a more regular
structure. Moreover, the new multiplier has significantly lower complexity or a smaller
critical path delay than the previous RB multipliers, as shown in Table 4.2. The
proposed architecture is shown in the bottom row of the table. Compared to the
SIPO [44], the proposed multiplier requires only half the number of registers while incurring
the same critical path delay. Compared to the PISO [44], the new architecture uses
about the same number of gates and registers but has a significantly smaller critical path
delay.
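
| Architecture | # of AND | # of XOR | # of registers | Critical path delay |
|---|---|---|---|---|
| PISO [7, 44] | $n$ | $n - 1$ | $n + 1$ | $T_A + \lceil \log_2 n \rceil T_X$ |
| SIPO [44] | $n$ | $n$ | $2n$ | $T_A + T_X$ |
| Proposed SIPO | $n$ | $n$ | $n$ | $T_A + T_X$ |

Table 4.2: Complexities comparison between bit-serial RB multipliers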
For example, let $m = 268$, for which the minimal value of $n$ is 269. Then
the proposed RB multiplier requires 269 fewer flip-flops than the SIPO multiplier
presented in [44], while both have the same critical path delay. Compared to the PISO
RB multiplier presented in [44], which has a critical path delay of
$T_A + \lceil \log_2 269 \rceil T_X = T_A + 9T_X$, the proposed multiplier has a much smaller critical
path delay of $T_A + T_X$.

4.4.2 Comparison to Normal Basis Multipliers
A comparison of the proposed RB multiplier to other bit-serial NB multipliers, for the case
where a type I ONB exists, is shown in Table 4.3. It can be seen that the proposed multiplier has
the lowest complexity in terms of the number of XOR gates and registers. The proposed
architecture also has the smallest critical path delay. Note that the new RB multiplier
requires $m + 1$ instead of $m$ clock cycles to finish a multiplication operation.
Conversion between RB and NB is simple for certain classes of fields where both RB
and NB are generated by the same Gauss period [44]. For example, for the class of fields
for which a type I ONB exists, the RB elements (except the element '1') are a permutation of
the NB elements. Let $R$ and $N$ be the RB and NB, respectively, for this class of fields. Then
$$R = \{\beta^i\}_{i=0}^{n-1} = \{1, \beta, \beta^2, \ldots, \beta^{n-1}\}, \quad\text{and}\quad N = \{\beta^{2^j}\}_{j=0}^{m-1} = \{\beta, \beta^2, \ldots, \beta^{2^{m-1}}\},$$
where $\beta$ is a primitive $n$th root of unity, or Gauss period of type $(m, 1)$, and $n = m + 1$. It
follows by noting $\beta^n = 1$ that $\beta^{2^j} = \beta^{(2^j)}$, where $(2^j)$ denotes $2^j$ reduced modulo $n$.
Since 2 is a generator of the multiplicative group $Z_n^*$, there always exists a unique
$i \in \{1, 2, \ldots, n-1\}$ for a given $j$ such that $i = (2^j)$.
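
The permutation between the NB and the RB elements can be listed explicitly. The sketch below assumes a type I ONB exists for the given $m$, so that 2 generates $Z_n^*$ with $n = m + 1$, and returns for each NB index $j$ the RB exponent $i = 2^j \bmod n$. The function name is ours, and the check on the generator property is only a guard for this illustration.

```python
from typing import List


def nb_to_rb_exponents(m: int) -> List[int]:
    """For j = 0..m-1, return i = 2^j mod n with n = m + 1.

    beta^(2^j), the j-th normal basis element, equals beta^i, the i-th
    redundant basis element. Raises ValueError when 2 does not generate
    the multiplicative group Z_n^* (no type I ONB for this m).
    """
    n = m + 1
    exponents = [pow(2, j, n) for j in range(m)]
    if sorted(exponents) != list(range(1, n)):
        raise ValueError("2 is not a generator of Z_n^* for n = m + 1")
    return exponents


# For m = 4 (n = 5): beta, beta^2, beta^4, beta^8 = beta^3.
print(nb_to_rb_exponents(4))
```

Converting an NB coordinate vector to RB then amounts to placing NB coordinate $j$ at RB position $i$ and setting the coefficient of '1' to zero.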
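
| Multiplier | # AND | # XOR | # Registers | Critical Path Delay | # Cyc. | Basis |
|---|---|---|---|---|---|---|
| Massey-Omura [22] | $2m - 1$ | $2m - 2$ | $2m$ | $T_A + (1 + \lceil \log_2 m \rceil)T_X$ | $m$ | NB |
| IMO [11] | $m$ | $2m - 2$ | $2m$ | $T_A + (1 + \lceil \log_2 m \rceil)T_X$ | $m$ | NB |
| Beth-Gollmann [4] | $m$ | $2m - 1$ | $2m$ | $T_A + 2T_X$ | $m$ | NB |
| Geiselmann-Gollmann [15] | $m$ | $m + \lfloor m/2 \rfloor$ | $3m$ | $T_A + 3T_X$ | $m$ | NB |
| Feng [9] | $2m - 1$ | $3m - 2$ | $3m - 2$ | $T_A + 4T_X$ | $m$ | NB |
| Agnew [37] | $m$ | $2m - 1$ | $3m$ | $T_A + 2T_X$ | $m$ | NB |
| b-SMPO I [36] | $\lfloor m/2 \rfloor + 1$ | $2m - 1$ | $3m$ | $T_A + 3T_X$ | $m$ | NB |
| b-SMPO II [36] | $m$ | $m + \lfloor m/2 \rfloor$ | $3m$ | $T_A + 3T_X$ | $m$ | NB |
| Proposed | $m + 1$ | $m + 1$ | $m + 1$ | $T_A + T_X$ | $m + 1$ | RB |

Table 4.3: Complexities comparison between the proposed RB multiplier and bit-serial NB
multipliers when there exists a type I optimal normal basis.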


4.5 Proposed Digit-Level SIPO Multiplier
4.5.1 A New Digit-Level RB Multiplication Algorithm
For given $j$, $c_j^{(k)}$ accumulates one term $a_j b_{(n-k)}$ during clock cycle $k$ according to (4.3) in
Algorithm 3. Certain parallelism can be introduced to Algorithm 3 so that the bit-serial RB
multiplication can be sped up. The proposed digit-level SIPO algorithm lets $c_j^{(k)}$
accumulate $w$ terms during clock cycle $k$, and thus the multiplication can be completed in
$\lceil n/w \rceil$ clock cycles for a positive integer $w$ with $1 \le w \le n$.
Algorithm 4. Digit-level RB multiplication
Input: $a_j, b_j$, $j = 0, 1, \ldots, n-1$
Let $w$ be an integer such that $1 \le w \le n$, also let $b_j = 0$ for $j = -1, \ldots, n - \lceil n/w \rceil w$.
Output: $c_j$, $j = 0, 1, \ldots, n-1$
1. Initialization: $c_j^{(0)} = 0$, for $j = 0, 1, \ldots, n-1$.
2. For $k = 1$ To $\lceil n/w \rceil$
3.   For $j = 0$ To $n-1$
4.     $c_{(j+\lceil n/w \rceil)}^{(k)} = c_{(j+\lceil n/w \rceil-1)}^{(k-1)} + \sum_{i=0}^{w-1} a_{(j+\lceil n/w \rceil+i\lceil n/w \rceil)} b_{n-k-i\lceil n/w \rceil}$ \qquad (4.6)
The final value $c_{(j+\lceil n/w \rceil)}^{(\lceil n/w \rceil)} = c_j$ for $j = 0, 1, \ldots, n-1$.


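Proof of correctness of Algorithm 4:
It follows from (4.6) that
$$\begin{aligned}
c_{(j+\lceil n/w \rceil)}^{(k)}\Big|_{k=\lceil n/w \rceil}
&= c_{(j+\lceil n/w \rceil-1)}^{(\lceil n/w \rceil-1)} + \sum_{i=0}^{w-1} a_{(j+\lceil n/w \rceil+i\lceil n/w \rceil)} b_{n-\lceil n/w \rceil-i\lceil n/w \rceil} \\
&= c_{(j+\lceil n/w \rceil-2)}^{(\lceil n/w \rceil-2)} + \sum_{i=0}^{w-1} a_{(j+\lceil n/w \rceil+i\lceil n/w \rceil)} b_{n-\lceil n/w \rceil-i\lceil n/w \rceil} \\
&\qquad + \sum_{i=0}^{w-1} a_{(j+(\lceil n/w \rceil-1)+i\lceil n/w \rceil)} b_{n-(\lceil n/w \rceil-1)-i\lceil n/w \rceil} \\
&\ \ \vdots \\
&= c_{(j)}^{(0)} + \sum_{i=0}^{w-1}\sum_{\ell=0}^{\lceil n/w \rceil-1} a_{(j+i\lceil n/w \rceil+\ell+1)} b_{n-i\lceil n/w \rceil-\ell-1} \\
&= c_{(j)}^{(0)} + \sum_{i_1=0}^{n-1} a_{(j+i_1+1)} b_{n-i_1-1} \qquad (4.7) \\
&= c_{(j)}^{(0)} + \sum_{i=0}^{n-1} a_{(j-i)} b_i \qquad (4.8) \\
&= c_j. \qquad (4.9)
\end{aligned}$$
Equation (4.7) uses the substitution $i_1 = i\lceil n/w \rceil + \ell$ and equation (4.8) uses the substitution
$i = n - i_1 - 1$. The final step (4.9) comes from Step 1 of Algorithm 4 and equation (4.2). $\square$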
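
A software model of recurrence (4.6) makes the indexing easier to follow. The sketch below is illustrative only and uses our own function names: each of the $\lceil n/w \rceil$ clock cycles shifts the register ring by one position and adds $w$ partial products per register, with out-of-range bits of $B$ treated as zero as required by the algorithm. The product bit $c_j$ is then read from register $(j + \lceil n/w \rceil) \bmod n$, and the result is compared exhaustively with (4.2) for $n = 5$.

```python
from itertools import product
from typing import List


def rb_multiply_reference(a: List[int], b: List[int]) -> List[int]:
    """Direct evaluation of (4.2), used only as a check."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) % 2 for j in range(n)]


def digit_level_sipo(a: List[int], b: List[int], w: int) -> List[int]:
    """Model of Algorithm 4 with digit size w (1 <= w <= n)."""
    n = len(a)
    if not 1 <= w <= n:
        raise ValueError("w must satisfy 1 <= w <= n")
    d = -(-n // w)                       # ceil(n / w) clock cycles

    def b_at(index: int) -> int:
        # b_j is defined as 0 for negative subscripts (zero padding of B).
        return b[index] if 0 <= index < n else 0

    regs = [0] * n                       # Step 1
    for k in range(1, d + 1):            # Step 2
        new_regs = [0] * n
        for r in range(n):               # r plays the role of (j + d) mod n
            acc = regs[(r - 1) % n]
            for i in range(w):
                acc ^= a[(r + i * d) % n] & b_at(n - k - i * d)
            new_regs[r] = acc
        regs = new_regs
    # c_j resides in register (j + d) mod n after d cycles.
    return [regs[(j + d) % n] for j in range(n)]


n = 5
all_match = all(
    digit_level_sipo(list(a), list(b), w) == rb_multiply_reference(list(a), list(b))
    for w in range(1, n + 1)
    for a in product([0, 1], repeat=n)
    for b in product([0, 1], repeat=n)
)
print(all_match)
```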

4.5.2 Proposed Digit-Level RB Multiplier
Fig. 4.4 shows a hybrid SIPO architecture of RB multiplication when $w = 2$. In this
case, the input operand $B$ is divided into two parts, each of size $\lceil n/2 \rceil$ bits. All the
registers are initialized to zero. At the end of clock cycle $k$ the contents of the registers
$R_j$ are $c_j^{(k)}$. It takes $\lceil n/2 \rceil$ clock cycles for the multiplier to complete one operation. At the
end of $\lceil n/2 \rceil$ clock cycles, the product coefficients $c_0, c_1, \ldots, c_{n-1}$ reside in registers
$R_{(\lceil n/2 \rceil)}, R_{(\lceil n/2 \rceil+1)}, \ldots, R_{(\lceil n/2 \rceil+n-1)}$, respectively.

Figure 4.4: Proposed hybrid SIPO RB multiplier (LSB first)

4.5.3 Complexity Comparison to Previous RB Multipliers
Table 4.4 compares the complexity of the proposed digit-level multiplier with that of
the existing digit-level multiplier proposed in [44], in terms of the number of gates and
latches as well as the critical path delay. The first row of this table shows the hybrid Parallel-In
Serial-Out multiplier (Hybrid PISO) containing [^] parallel modules which was proposed
in [44], and the second row presents our proposal. Our proposal always has
a smaller critical path delay than the previous architecture. For small values of w the
proposed architecture requires even fewer gates than the hybrid PISO, but for
large values of w the hybrid PISO has a lower gate complexity.
Multiplier

#AND

Hybrid PISO [44]

«r-i

proposed

urn

#XOR
1 w !

wn + w + 1

#Reg.

Critical Path Delay

# Clock Cyc.

"+rsi

7U + (riog2nl)T;t

n

TA + (\log2w + l])Tx

r-i
r-i
1

w '

Table 4.4: Complexities Comparison Between Digit-Level RB Multipliers

4.5.4 Complexity Comparison to Previous NB Multipliers

A complexity comparison of the proposed digit-level RB multiplier to some popular NB
multipliers, for the class of fields for which a type I ONB exists, is given in Table 4.5. It can
be seen that the proposed architecture uses the fewest registers and has the smallest critical
path delay. The number of XOR and AND gates used for the new RB multiplier is also
comparable to the lowest number of gates required by the NB multipliers. Note that the
proposed multiplier may need one more clock cycle to complete one multiplication than
the NB multipliers listed in Table 4.5 when m is a multiple of w.
Multiplier
WLMO [22]

# AND
w(2m

- 1)

#XOR

#Reg.

#Cyc.

Basis

w(2rn - 2)

2m

TA

riog2ml)Tx

fm/iwl

NB

w(2m

2m

TA + ( l + r i o g 2 m l ) T x

\m/w~\

NB
NB

IMO [11]

turn

AEDS [36J

(w + 1) f

(iu + l ) ( f m - 2) + l

2m

TA + ( l + r i o g 2 m l ) T x

\m/w~\

XEDS [36]

w(m — 1) + m

(tu + l ) ( m - 1)
aj|m + m + „ _ i

2m

TA + (1+

riog2ml)rx

[m/ui]

NB

+(3+

ri°g2(™-l)DTx

[m/u;]

NB

fm/iu]

NB

T(m+ l)/iu]

RB

m-SMPOI [37]

2f± +m

+ w + 1

tu-SMPOII [37]

wm + m + w + 1

proposed

wm + w

- 2)

Critical Path Delay
+ {1+

wm + m +

TJJ

— 1

wm + m + w + 1

3m

ITA

3m

2TA

m + 1

+ ( 3 + [ l o g 2 ( ^ - 1)1 )TX
TA

+ (\\oS2w

+

l^)Tx

Table 4.5: Complexities comparison between digit-level architectures: the proposed RB
multiplier versus some NB multipliers for the class of fields for which a type I ONB exists

4.6 Conclusions

A new SIPO finite field multiplier using redundant representation has been proposed. It
has been shown that the multiplier compares favorably to the previous proposals in terms
of complexity or critical path delay. For the class of fields for which a type I ONB exists, the
proposed multiplier has significantly lower complexity than previously proposed
NB multipliers. A digit-level version of the new multiplier has also been presented, and its
complexities compare favorably to other type I ONB multipliers. The new architecture
accommodates both MSB first and LSB first inputs. It is expected that the proposed
multiplier has applications in elliptic curve and ElGamal cryptography.


Chapter 5
A High Speed Word Level Multiplier in
$F_{2^m}$ Using Redundant Representation

5.1 Introduction

Hardware implementations of finite field multipliers can usually be grouped into three
categories. The first category consists of bit level multipliers [22],[1],[15],[11]. A bit level
multiplier takes $m$ clock cycles to finish one multiplication in the binary field $F_{2^m}$. The second
category consists of full parallel multipliers [35],[24],[20],[43]. A full parallel multiplier takes one
clock cycle to finish one field multiplication.
The third category consists of word level or digit level finite field multipliers, which are the
most commonly implemented in practice [12],[44],[22],[32],[31],[36],[37]. A word level
multiplier takes $w$ clock cycles, $1 \le w \le m$, to finish one multiplication operation in $F_{2^m}$.
The value of $w$ can be selected by the designer to set the trade-off between area and speed
according to the application. Decreasing the value of $w$ will result in faster and larger
multipliers while increasing $w$ will make smaller and slower multipliers. Note that bit level
and full parallel multipliers can be viewed as special cases of word level multipliers for
$w = m$ and $w = 1$, respectively.
In this chapter, a new word level finite field multiplier in $F_{2^m}$ using redundant
representation is proposed. For the class of fields for which a type I ONB exists, we show that the
new architecture is much faster than previously proposed word level architectures
using either NB or redundant representation. It is also shown that, for this class of fields, the
new multiplier outperforms all the other multipliers when the product of area
and delay is taken as the measure of performance. One of the unique features of the proposed word
level multiplier is that the critical path delay is a function of neither the field size nor the word
size. This enables the architecture to operate at very high speed even for large field sizes or
large word sizes.
The organization of this chapter is as follows. Section 2 is a brief review of redundant
representation and multiplication. In Sections 3 and 4, a new word level algorithm and
architecture for multiplication in redundant representation are proposed, respectively. The
architectural complexities of the proposed multiplier are compared to other similar previous
proposals in Section 5. A few concluding remarks are given in Section 6.

5.2 A Brief Review of Redundant Representation and Its
Arithmetic in $F_{2^m}$
5.2.1 Redundant Basis for $F_{2^m}$
Let $K$ be a field and $f(x) \in K[x]$ be a polynomial defined over $K$. Then the field that
contains all the roots of $f(x)$ is called the splitting field of the polynomial $f(x)$. The
splitting field of $x^n - 1$ is called the $n$th cyclotomic field, denoted by $K^{(n)}$ [44]. Let $\beta$ be a
primitive $n$th root of unity. Then $K^{(n)}$ is generated by $\beta$ over $K$ and elements in $K^{(n)}$ can be
represented in the form
$$A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \quad a_i \in K.$$
Thus the set $[1, \beta, \beta^2, \ldots, \beta^{n-1}]$ acts as a basis for $K^{(n)}$. Since $1 + \beta + \beta^2 + \cdots + \beta^{n-1} = 0$,
the representation of $A$ is not unique. So, by slightly abusing the terminology, we call the
set $[1, \beta, \beta^2, \ldots, \beta^{n-1}]$ redundant basis (RB) for any subfield of $K^{(n)}$.
We are particularly interested in the following case: Let $K$ be the binary field $F_2$ and
$K^{(n)}$ be a cyclotomic field that $F_{2^m}$ can be embedded in. The following theorem characterizes
the relationship between $m$ and $n$.

Theorem 5.2.1. [25] Let $n$ be an odd positive integer. Then, $F_{2^m}$ is contained in $F_2^{(n)}$ if and
only if $m$ divides the multiplicative order of 2 mod $n$.
Remark 5.2.2. If there is a type I Optimal Normal Basis in $F_{2^m}$, then $F_{2^m}$ is contained in
$F_2^{(m+1)}$, so there is a RB of size $m + 1$ for $F_{2^m}$.

5.2.2 Redundant Basis Multiplication in $F_{2^m}$
Consider the RB for $F_{2^m}$ over $F_2$:
$$I = [1, \beta, \beta^2, \ldots, \beta^{n-1}].$$
Let field elements $A, B \in F_{2^m}$ be represented with respect to $I$:
$$A = \sum_{i=0}^{n-1} a_i\beta^i, \quad B = \sum_{j=0}^{n-1} b_j\beta^j,$$
where $a_i, b_j \in F_2$, $i, j = 0, 1, \ldots, n-1$. Note that $\beta^n = 1$. Then multiplication of $B$ by $\beta^i$
using the RB $I$ can be given by
$$\beta^i \cdot B = \sum_{j=0}^{n-1} b_j\beta^{(i+j)} = \sum_{j=0}^{n-1} b_{(j-i)}\beta^j,$$
where $(j - i)$ denotes that $j - i$ is to be reduced modulo $n$. Then the product of field
elements $A$ and $B$ can be given by
$$A \cdot B = \sum_{i=0}^{n-1} a_i(\beta^i \cdot B) = \sum_{j=0}^{n-1}\Big(\sum_{i=0}^{n-1} a_i b_{(j-i)}\Big)\beta^j. \qquad (5.1)$$
If we define $A \cdot B = C = \sum_{j=0}^{n-1} c_j\beta^j$, then $c_j$ can be given by
$$c_j = \sum_{i=0}^{n-1} a_i b_{(j-i)}, \quad j = 0, 1, \ldots, n-1. \qquad (5.2)$$

5.3 Proposed Word Level Multiplication In RB
From (5.2) it can be seen that one product bit $c_j$ is a sum of $n$ partial product bits $a_i b_{(j-i)}$. If
each partial product bit is to be calculated at one clock cycle, the multiplier will take $n$ clock
cycles to finish one multiplication operation, which is the case for the bit level architecture
[44].
Let $w$ denote the word size. Then the operand $A$ in RB can be represented in
$k = \lceil n/w \rceil$ words:
$$A = \underbrace{a_0 a_1 \ldots a_{w-1}}_{A_0}\ \underbrace{a_w \ldots a_{2w-1}}_{A_1}\ a_{2w} \cdots\ \underbrace{a_{(k-1)w} \ldots a_{n-1}\, 0 \ldots 0}_{A_{k-1}}.$$
Or $A = \sum_{h=0}^{k-1} A_h\beta^{hw}$, where $A_h = \sum_{\ell=0}^{w-1} a_{hw+\ell}\beta^{\ell}$. Note that $a_i = 0$ if the subscript $i$ is
greater than $n - 1$. Replace $i$ in (5.2) with $hw + \ell$:
$$c_j = \sum_{h=0}^{k-1}\sum_{\ell=0}^{w-1} a_{hw+\ell}\, b_{(j-hw-\ell)}, \quad j = 0, 1, \ldots, n-1. \qquad (5.3)$$
Define a new signal $d_{h,j}^{(\ell)}$ as follows:
$$d_{h,j}^{(-1)} = 0 \quad\text{and}\quad d_{h,j}^{(\ell)} = d_{h,j}^{(\ell-1)} + a_{hw+\ell}\, b_{(j-hw-\ell)} \quad\text{for } \ell = 0, 1, \ldots, w-1. \qquad (5.4)$$
Then it follows from (5.4)
$$d_{h,j}^{(w-1)} = \sum_{\ell=0}^{w-1} a_{hw+\ell}\, b_{(j-hw-\ell)}. \qquad (5.5)$$
Comparing (5.3) with (5.5), it follows
$$c_j = \sum_{h=0}^{k-1} d_{h,j}^{(w-1)}. \qquad (5.6)$$

An algorithmic form for word level multiplication using RB can be given as follows.
Algorithm 5. Word level RB multiplication algorithm
Input: $A = (A_0, \ldots, A_{k-1})$, $B = (B_0, \ldots, B_{k-1})$, both w.r.t. RB.
Output: $C = A \times B = (c_0, \ldots, c_{n-1})$ also w.r.t. RB
1. Initialization: $k = \lceil n/w \rceil$, and $d_{h,j}^{(-1)} = 0$ for $h = 0, 1, 2, \ldots, k-1$ and $j = 0, 1, \ldots, n-1$
2. For all values of $j = 0, 1, 2, \ldots, n-1$, compute
3.   For all values of $h = 0, 1, \ldots, k-1$, compute
4.     For $\ell = 0$ To $w - 1$
5.       $d_{h,j}^{(\ell)} = d_{h,j}^{(\ell-1)} + a_{hw+\ell}\,[b_{(j-hw-\ell)}]$
6.     End For
7.     $c_j = \sum_{h=0}^{k-1} d_{h,j}^{(w-1)}$
8.   End For
9. End For
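
Algorithm 5 can be modeled in software to confirm that the word-wise accumulation of (5.4) followed by the final sum (5.6) reproduces (5.2). The sketch below is a behavioral model only, with our own function names. It pads $A$ with zeros up to $kw$ bits, runs the $w$ accumulation cycles for the $k$ units of each module $M_j$, and then XORs the $k$ flip-flop values.

```python
from itertools import product
from typing import List


def rb_multiply_reference(a: List[int], b: List[int]) -> List[int]:
    """Direct evaluation of (5.2), used only as a check."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) % 2 for j in range(n)]


def word_level_rb_multiply(a: List[int], b: List[int], w: int) -> List[int]:
    """Behavioral model of Algorithm 5 with word size w."""
    n = len(a)
    if not 1 <= w <= n:
        raise ValueError("w must satisfy 1 <= w <= n")
    k = -(-n // w)                               # k = ceil(n / w) words
    a_padded = list(a) + [0] * (k * w - n)       # a_i = 0 for i > n - 1
    c = []
    for j in range(n):                           # module M_j
        d = [0] * k                              # flip-flops d_{h,j}, initially 0
        for ell in range(w):                     # clock cycles 0..w-1
            for h in range(k):                   # the k accumulation units
                d[h] ^= a_padded[h * w + ell] & b[(j - h * w - ell) % n]
        bit = 0
        for value in d:                          # k-input XOR network, (5.6)
            bit ^= value
        c.append(bit)
    return c


n = 5
all_match = all(
    word_level_rb_multiply(list(a), list(b), w) == rb_multiply_reference(list(a), list(b))
    for w in range(1, n + 1)
    for a in product([0, 1], repeat=n)
    for b in product([0, 1], repeat=n)
)
print(all_match)
```

In hardware the loops over $j$ and $h$ run in parallel, so only the loop over $\ell$ costs clock cycles.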

5.4 Proposed Word Level Multiplier Architecture in RB
5.4.1 Multiplier Architecture
Based on the algorithm proposed in the last section, a new word level architecture for RB
multiplication is presented and shown in Fig. 5.1. The architecture includes, from top to
bottom, one $n$-bit circular shift register $R$, a permutation/expansion module with $n$-bit input
and $kn$-bit output, a layer of $kn$ AND gates, a layer of $kn$ XOR gates, $kn$ flip-flops, and $n$
$k$-input binary tree networks of XOR gates.

Figure 5.1: Proposed high speed word level multiplier
Note that the layer of $kn$ AND gates, the layer of $kn$ XOR gates and the $kn$ flip-flops
form $kn$ accumulation units in parallel, which are responsible for the accumulation
operation in Step 5 of Algorithm 5. These $kn$ accumulation units are divided into $n$ groups
with each group containing $k$ units. The parallel structure of $n$ groups corresponds to Steps 2
and 9, and the structure of $k$ units in each group corresponds to Steps 3 and 8 in Algorithm 5.
The iteration shown in Steps 4 and 6 of the algorithm requires the architecture to take $w$
clock cycles to complete one multiplication operation.


At first, let us look at the part of the circuit (framed by the dashed lines) that generates
$c_0$, which is denoted by $M_0$. There are $k$ accumulation units and they are numbered from
left to right as $h$, $h = 0, 1, \ldots, k-1$. The contents of flip-flop $h$ during clock cycle $\ell$
are denoted as $d_{h,0}^{(\ell)}$. Before clock cycle zero all the $k$ flip-flops are initialized as zero,
$d_{h,0}^{(-1)} = 0$, $h = 0, 1, \ldots, k-1$. During clock cycle $\ell$, accumulation unit $h$ performs
$d_{h,0}^{(\ell)} = d_{h,0}^{(\ell-1)} + a_{hw+\ell}\, b_{(-hw-\ell)}$ and the content of the flip-flop is $d_{h,0}^{(\ell)}$. During clock cycle
$w - 1$, flip-flop $h$ contains $d_{h,0}^{(w-1)}$. The $k$-input XOR gate is used to generate the output
$c_0 = \sum_{h=0}^{k-1} d_{h,0}^{(w-1)}$.
Now consider accumulation unit $h$ at $M_j$. During clock cycles $0, 1, \ldots, w-1$, the
accumulation unit takes inputs $a_{hw}, a_{hw+1}, \ldots, a_{hw+w-1}$ from $A$, and
$b_{(j-hw)}, b_{(j-hw-1)}, \ldots, b_{(j-hw-w+1)}$ from $B$, respectively. For example, during clock cycle $\ell$,
accumulation unit $h$ takes inputs $a_{hw+\ell}$ and $b_{(j-hw-\ell)}$ and performs the operation
$d_{h,j}^{(\ell)} = d_{h,j}^{(\ell-1)} + a_{hw+\ell}\, b_{(j-hw-\ell)}$.

From the above discussion we can observe the following:
Fact 5.4.1. For any accumulation unit, let the inputs be $a_{i_1}$ from $A$ and $b_{j_1}$ from $B$ during
clock cycle $\ell$, then the input is $a_{i_1+1}$ from $A$ and $b_{(j_1-1)}$ from $B$ during clock cycle $\ell + 1$.
Also note that at clock cycle $\ell = 0$ the input bit from $A$ to accumulation unit $h$ is
$a_{hw}$, $h = 0, 1, \ldots, k-1$, which is the least significant bit of the word $A_h$ from $A$. Then it
can be seen from Fact 5.4.1 that inputs from operand $A$ can be organized into $k$ words, and
word $h$ inputs to accumulation unit $h$ in a bit serial fashion with the least significant bit first.
Inputs from operand $B$ can be organized as an $n$-bit circular shift register and a
permutation/expansion module as shown in Fig. 5.1. The shifting direction of $R$ should conform
with Fact 5.4.1, and the permutation/expansion module can be explained as follows: during
clock cycle 0, the output of the P/E module that inputs to accumulation unit $h$ in $M_j$ should
be connected to $b_{(j-hw)}$ in $R$.
Let the critical path of the architecture be denoted by $T_{cp}$. Then $T_{cp}$ is decided by the
accumulation unit, which is $T_{cp} = T_A + T_X$, where $T_A$ and $T_X$ denote the time delays
caused by one AND gate and one XOR gate, respectively. Note that after $w$ clock cycles
the product bits are not available until a further time delay of $\lceil \log_2 k \rceil T_X$, caused by
a $k$-input XOR gate, has elapsed. To keep the outputs $c_j$ stable for a desired time,
two options are available for how the multiplier behaves during the final addition
of the partial products. In the first option, the multiplier clock is stopped after $w$
clock cycles and the outputs are read from the output ports after the addition delay,
which can be estimated beforehand. This option requires an extra counter to
disable the input clock to the flip-flops after $w$ clock cycles. Its advantage
is that it saves dynamic power during the final addition.
The second option is to pad zero input bits for the input words of operand $A$ after the
$w$ clock cycles. This eliminates the need to stop the clock signal of the multiplier
and does not require extra circuitry. The output product bits can be read from the output
ports a certain number of clock cycles after the $w$ cycles. Assume that
the clock period is chosen to be the critical path delay $T_{cp} = T_A + T_X$, and let the number of
extra clock cycles besides $w$ be denoted by $w_{ex}$. Then the total number of clock cycles
required for a multiplication operation is $w + w_{ex}$, where $w_{ex}$ can be obtained as follows
$$w_{ex} = \left\lceil \frac{\lceil \log_2 k \rceil T_X}{T_{cp}} \right\rceil = \left\lceil \frac{\lceil \log_2 k \rceil T_X}{T_A + T_X} \right\rceil. \qquad (5.7)$$
Note that $w_{ex}$ zero bits need to be appended at the end of each input word from $A$.

5.4.2 Architecture Complexities

The area and delay complexity of the proposed design can be easily determined from
Fig. 5.1. The circular shift register $R$ contains $n$ flip-flops. The P/E module is nothing
but a rewiring of lines. There are $n$ modules $M_j$, $j = 0, 1, \ldots, n-1$, where each module
contains $k$ AND gates, $k$ XOR gates, $k$ flip-flops, and a $k$-input XOR gate that is equivalent
to $k - 1$ (two-input) XOR gates. In total the proposed architecture requires $kn$ AND gates,
$(2k - 1)n$ XOR gates, and $(k + 1)n$ flip-flops.
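
The gate counts above and the cycle count from (5.7) are simple enough to tabulate with a short script. The sketch below is a convenience calculator, not part of the proposed design. The function name and the dictionary keys are ours, and the gate delays passed in are placeholder values that the caller must supply for the target technology.

```python
import math
from typing import Dict


def word_level_rb_cost(n: int, w: int, t_and: float, t_xor: float) -> Dict[str, float]:
    """Gate counts and cycle count for the word level RB multiplier.

    n      : size of the redundant basis
    w      : word size (number of accumulation cycles)
    t_and  : delay of one AND gate (any consistent time unit)
    t_xor  : delay of one two-input XOR gate (same unit)
    """
    if not 1 <= w <= n:
        raise ValueError("w must satisfy 1 <= w <= n")
    k = -(-n // w)                                # k = ceil(n / w)
    tree_levels = (k - 1).bit_length()            # ceil(log2 k) for k >= 1
    t_cp = t_and + t_xor                          # critical path of one unit
    w_ex = math.ceil(tree_levels * t_xor / t_cp)  # extra cycles, equation (5.7)
    return {
        "k": k,
        "and_gates": k * n,
        "xor_gates": (2 * k - 1) * n,
        "flip_flops": (k + 1) * n,
        "critical_path": t_cp,
        "extra_cycles": w_ex,
        "total_cycles": w + w_ex,
    }


# Example with n = 5, w = 3 and equal (placeholder) gate delays.
cost = word_level_rb_cost(n=5, w=3, t_and=1.0, t_xor=1.0)
print(cost)
```

For $n = 5$ and $w = 3$ this gives $k = 2$, 10 AND gates, 15 XOR gates, 15 flip-flops and $3 + 1 = 4$ clock cycles, matching the example in Section 5.4.3.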


5.4.3 An example
An example of the proposed word level RB multiplier for F24 is shown in Fig. 5.2. Note
n = m + 1 = 5 since there exists a type I ONB in F24. Let the two operands be given as
A = (a4, CI3, a 2 ,01, a0) and £? = (64, 63,62, &i, &o)- Choose IU = 3 and the operand A is
divided into two words A0 = (a2, ai, a 0 ) and vli = (0, a4, a 3 ).

Figure 5.2: Word level (w = 3) multiplier in F24 with the padded zero bits for the input
(the second option)
The critical path delay is Tcp = TA + TX and it takes w + wex = 3 + wex clock cycles
to complete one multiplication, where wex can be obtained from (5.7) by noting k = 2
wP = \TX/Tcp\ = \TX/(TA+TX)]

= 1.

Note that wex = 1 zero bit should be appended at the end of each input word A0 and A\.
The proposed architecture contains n = 5 module M and each module M contains k =

Table 5.1: Contents of the flip-flops in the proposed multiplier in $F_{2^4}$
(content $d_{h,j}^{(\ell)}$ of flip-flop $r_h$ in module $M_j$ during clock cycle $\ell$)

Flip-flop    ℓ = −1   ℓ = 0    ℓ = 1            ℓ = 2 and ℓ = 3
d_{0,0}      0        a0b0     a0b0 + a1b4      a0b0 + a1b4 + a2b3
d_{1,0}      0        a3b2     a3b2 + a4b1      a3b2 + a4b1
d_{0,1}      0        a0b1     a0b1 + a1b0      a0b1 + a1b0 + a2b4
d_{1,1}      0        a3b3     a3b3 + a4b2      a3b3 + a4b2
d_{0,2}      0        a0b2     a0b2 + a1b1      a0b2 + a1b1 + a2b0
d_{1,2}      0        a3b4     a3b4 + a4b3      a3b4 + a4b3
d_{0,3}      0        a0b3     a0b3 + a1b2      a0b3 + a1b2 + a2b1
d_{1,3}      0        a3b0     a3b0 + a4b4      a3b0 + a4b4
d_{0,4}      0        a0b4     a0b4 + a1b3      a0b4 + a1b3 + a2b2
d_{1,4}      0        a3b1     a3b1 + a4b0      a3b1 + a4b0

$\lceil n/w \rceil = 2$ accumulation units. Let the content of flip-flop $r_h$ in $M_j$ during clock cycle
$\ell$ be $d_{h,j}^{(\ell)}$. During clock cycle −1, all the flip-flops are initialized to zero. During clock cycle
$\ell$, it follows from (5.4) that flip-flop $r_h$ in $M_j$ contains

\[ d_{h,j}^{(\ell)} = d_{h,j}^{(\ell-1)} + a_{hw+\ell}\, b_{(j-hw-\ell)}, \]

where the subscript $(j-hw-\ell)$ is reduced modulo n. Table 5.1 shows $d_{h,j}^{(\ell)}$ for $\ell = 0, 1, 2, 3$.
Note that the flip-flop contents do not change during the last cycle because the inputs from A are zero bits.
If cycle $\ell = 0$ is counted as the first clock cycle, then the final product can be read out
during cycle $\ell = 3$. During clock cycle $\ell = 2$, the flip-flops contain $d_{h,j}^{(w-1)} = d_{h,j}^{(2)}$. As
soon as the contents of the flip-flops are updated to $d_{h,j}^{(2)}$, the XOR network at the output
end performs the summation to obtain the product bits following (5.6):
\[
\begin{aligned}
c_0 &= d_{0,0}^{(2)} + d_{1,0}^{(2)}, \\
c_1 &= d_{0,1}^{(2)} + d_{1,1}^{(2)}, \\
c_2 &= d_{0,2}^{(2)} + d_{1,2}^{(2)}, \\
c_3 &= d_{0,3}^{(2)} + d_{1,3}^{(2)}, \\
c_4 &= d_{0,4}^{(2)} + d_{1,4}^{(2)}.
\end{aligned}
\tag{5.8}
\]


The time delay caused by (5.8) is $T_X$, which is less than one clock cycle, so the total multiplication delay for the multiplier is four clock cycles. The multiplier requires kn = 10 AND
gates, (2k − 1)n = 15 XOR gates, and (k + 1)n = 15 flip-flops. With the
contents of the flip-flops during cycle 2 shown in Table 5.1, the final output product bits
follow from (5.8):

\[
\begin{aligned}
c_0 &= a_0 b_0 + a_1 b_4 + a_2 b_3 + a_3 b_2 + a_4 b_1, \\
c_1 &= a_0 b_1 + a_1 b_0 + a_2 b_4 + a_3 b_3 + a_4 b_2, \\
c_2 &= a_0 b_2 + a_1 b_1 + a_2 b_0 + a_3 b_4 + a_4 b_3, \\
c_3 &= a_0 b_3 + a_1 b_2 + a_2 b_1 + a_3 b_0 + a_4 b_4, \\
c_4 &= a_0 b_4 + a_1 b_3 + a_2 b_2 + a_3 b_1 + a_4 b_0.
\end{aligned}
\]

5.4.4 Word Level Architecture with MSB First

A most significant bit (MSB) first version of the multiplier architecture is also presented
and shown in Fig. 5.3, where the MSB of each word of the operand A is fed into the
system first. The k words, each of w bits, are obtained from A as follows.

\[
A = \underbrace{0 \cdots 0\, a_0 \cdots a_{n-(k-1)w-1}}_{A_0} \cdots a_{n-2w-1}\,
\underbrace{a_{n-2w} \cdots a_{n-w-1}}_{A_{k-2}}\,
\underbrace{a_{n-w} \cdots a_{n-1}}_{A_{k-1}}
\]

Note that the shift direction of the circular shift register R is different from that in Fig. 5.1.
The MSB version of the architecture has the same complexities and time delay as the LSB
version (Fig. 5.1).
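
As a rough software check of the example in Section 5.4.3, the following Python sketch models the flip-flop recurrence $d_{h,j}^{(\ell)} = d_{h,j}^{(\ell-1)} + a_{hw+\ell} b_{(j-hw-\ell)}$ and the output XOR network of the LSB-first architecture. The function names and the list layout (index i holds $a_i$) are illustrative assumptions and not part of the original design; the model ignores timing and only checks that the accumulated result equals the cyclic convolution of the two operands.

```python
from itertools import product


def rb_word_level_multiply(a, b, w):
    """Model the word-level accumulation for two n-bit redundant-basis operands.

    a[i] and b[i] hold the coordinates a_i and b_i (LSB at index 0).
    """
    n = len(a)
    k = -(-n // w)                          # ceil(n / w) words per operand
    a_padded = list(a) + [0] * (k * w - n)  # zero bits pad the last word of A
    d = [[0] * n for _ in range(k)]         # d[h][j]: flip-flop r_h in module M_j
    for cycle in range(w):                  # one bit of every word enters per cycle
        for h in range(k):
            for j in range(n):
                d[h][j] ^= a_padded[h * w + cycle] & b[(j - h * w - cycle) % n]
    # Output XOR network: c_j is the sum of the k flip-flops of module M_j.
    return [sum(d[h][j] for h in range(k)) % 2 for j in range(n)]


def cyclic_convolution(a, b):
    """Reference product: c_j = sum_i a_i * b_((j - i) mod n) over GF(2)."""
    n = len(a)
    return [sum(a[i] & b[(j - i) % n] for i in range(n)) % 2 for j in range(n)]


# Exhaustive comparison for the n = 5, w = 3 example.
for a in product((0, 1), repeat=5):
    for b in product((0, 1), repeat=5):
        assert rb_word_level_multiply(a, b, 3) == cyclic_convolution(a, b)
```

The exhaustive loop covers all 1024 operand pairs of the $F_{2^4}$ example, so it also confirms the five product expressions listed at the end of Section 5.4.3.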

5.5 Complexity Comparison

5.5.1 Comparison to Other Word Level RB Multipliers

Figure 5.3: Proposed high speed word level multiplier (MSB First)

The complexity of the proposed multiplier and of similar word level redundant basis multipliers
is compared in Table 5.2. It can be seen from the table that the proposed design has the
smallest critical path delay and multiplication delay among all similar proposals. As can
be seen from the table, the critical path delay of the proposed architecture is not a function
of the word size or the field size and is always equal to $T_A + T_X$.
Table 5.2: Complexity comparison for word level redundant basis multipliers

Multiplier    #AND  #XOR       #Reg       Critical Path Delay (T_cp)   Multiplication Delay
PISO [44]     kn    k(n − 1)   n          T_A + ⌈log2 n⌉T_X            wT_cp
Comb [31]     kn    kn         2n         T_A + ⌈log2(k + 1)⌉T_X       wT_cp
ASH [32]      kn    kn + n     n          T_A + ⌈log2(k + 1)⌉T_X       wT_cp
Proposed RB   kn    (2k − 1)n  (k + 1)n   T_A + T_X                    wT_cp + ⌈log2 k⌉T_X


5.5.2 Comparison to Other Word Level NB and RB Multipliers When
There Exists A Type I ONB

The type I ONB is probably the most popular and the most efficient class of NB used for the
realization of NB multipliers in the literature. For this class of fields, the size of the RB is
almost the same as that of the NB, as shown in Remark 1 in Section 2. The complexities of the
proposed multiplier and of similar architectures for the class of fields in which a
type I ONB exists are shown in Table 5.3. The area and delay complexities of the RB multipliers
are obtained by substituting n with m + 1, according to Remark 1.
Table 5.3: Area-delay complexity comparison for different architectures where there exists
a type I ONB

Multiplier       Basis   #AND               #XOR                    #Reg             Critical Path Delay (T_cp)        Multiplication Delay
WLMO [22]        ONB I   k(2m − 1)          k(2m − 2)               2m               T_A + (1 + ⌈log2 m⌉)T_X           wT_cp
IMO [11]         ONB I   km                 k(2m − 2)               2m               T_A + (1 + ⌈log2 m⌉)T_X           wT_cp
AEDS [36]        ONB I   (k + 1)m/2         (k + 1)(3m/2 − 2) + 1   2m               T_A + (1 + ⌈log2 m⌉)T_X           wT_cp
XEDS [36]        ONB I   k(m − 1) + m       (k + 1)(m − 1)          2m               T_A + (1 + ⌈log2 m⌉)T_X           wT_cp
w-SMPO I [37]    ONB I   km/2 + m + k + 1   3km/2 + k + m − 1       3m               2T_A + (3 + ⌈log2(k − 1)⌉)T_X     wT_cp
w-SMPO II [37]   ONB I   km + m + k + 1     km + k + m − 1          3m               2T_A + (3 + ⌈log2(k − 1)⌉)T_X     wT_cp
PISO [44]        RB      km + k             km                      m + 1            T_A + ⌈log2(m + 1)⌉T_X            wT_cp
Comb [31]        RB      km + k             km + k                  2m + 2           T_A + ⌈log2(k + 1)⌉T_X            wT_cp
ASH [32]         RB      km + k             km + k + m + 1          m + 1            T_A + ⌈log2(k + 1)⌉T_X            wT_cp
Proposed         RB      km + k             (2k − 1)(m + 1)         (k + 1)(m + 1)   T_A + T_X                         wT_cp + ⌈log2 k⌉T_X

For the purpose of illustration we have tabulated the area-delay complexity of the proposed architecture together with the previously proposed multipliers in Table 5.4. The field size is
chosen as m = 268, for which there exists a type I ONB. The number of parallel modules is selected
to be k = 8, 16, 32, which represent practical-size multipliers for VLSI implementations.
The following assumptions are made in Table 5.4: the VLSI areas of an XOR gate and
a flip-flop are assumed to be twice and three times the area of an AND gate, respectively
(in a typical CMOS VLSI realization, an AND gate can be implemented with 6 transistors, while an XOR
gate and a flip-flop can be implemented with 12 and 16 transistors, respectively [40]).
It is also assumed that the area of an AND gate is 1. The column Area Cost in Table 5.4
represents the sum of the number of AND gates, twice the number of XOR gates and three

Table 5.4: Complexity comparison of word level type I optimal NB or RB multipliers in
$F_{2^{268}}$ for different values of w, k.

Multiplier       Basis   #AND (N_A)   #XOR (N_X)   #Register (N_R)   Area Cost (N_A + 2N_X + 3N_R)   Delay Cost (T_A = 1, T_X = 2)   Area × Delay

w = 34, k = 8
WLMO [22]        ONB I   4280         4272         536               14432                           34T_A + 340T_X = 714            10304448
IMO [11]         ONB I   2144         4272         536               12296                           34T_A + 340T_X = 714            8779344
AEDS [36]        ONB I   1206         3601         536               10016                           34T_A + 340T_X = 714            7151424
XEDS [36]        ONB I   2404         2403         536               8818                            34T_A + 340T_X = 714            6296052
w-SMPO I [37]    ONB I   1349         3491         804               10743                           68T_A + 204T_X = 476            5113668
w-SMPO II [37]   ONB I   2421         2419         804               9671                            68T_A + 204T_X = 476            4603396
PISO [44]        RB      2152         2144         269               7247                            34T_A + 306T_X = 646            4681562
Comb [31]        RB      2152         2152         538               8070                            34T_A + 136T_X = 306            2469420
ASH [32]         RB      2152         2421         269               7801                            34T_A + 136T_X = 306            2387106
Proposed         RB      2152         4035         2421              17485                           34T_A + 37T_X = 108             1888380

w = 17, k = 16
WLMO [22]        ONB I   8560         8544         536               27256                           17T_A + 170T_X = 357            9730392
IMO [11]         ONB I   4288         8544         536               22984                           17T_A + 170T_X = 357            8205288
AEDS [36]        ONB I   2278         6801         536               17488                           17T_A + 170T_X = 357            6243216
XEDS [36]        ONB I   4540         4539         536               15226                           17T_A + 170T_X = 357            5435682
w-SMPO I [37]    ONB I   2429         6715         804               18271                           34T_A + 119T_X = 272            4969712
w-SMPO II [37]   ONB I   4573         4571         804               16127                           34T_A + 119T_X = 272            4386544
PISO [44]        RB      4304         4288         269               13687                           17T_A + 153T_X = 323            4420901
Comb [31]        RB      4304         4304         538               14526                           17T_A + 85T_X = 187             2716362
ASH [32]         RB      4304         4573         269               14257                           17T_A + 85T_X = 187             2666059
Proposed         RB      4304         8339         4573              34701                           17T_A + 21T_X = 59              2047359

w = 9, k = 32
WLMO [22]        ONB I   17120        17088        536               52904                           9T_A + 90T_X = 189              9998856
IMO [11]         ONB I   8576         17088        536               44360                           9T_A + 90T_X = 189              8384040
AEDS [36]        ONB I   4422         13201        536               32432                           9T_A + 90T_X = 189              6129648
XEDS [36]        ONB I   8812         8811         536               28042                           9T_A + 90T_X = 189              5299938
w-SMPO I [37]    ONB I   4589         13163        804               33327                           18T_A + 72T_X = 162             5398974
w-SMPO II [37]   ONB I   8877         8875         804               29039                           18T_A + 72T_X = 162             4704318
PISO [44]        RB      8608         8576         269               26567                           9T_A + 81T_X = 171              4542957
Comb [31]        RB      8608         8608         538               27438                           9T_A + 54T_X = 117              3210246
ASH [32]         RB      8608         8877         269               27169                           9T_A + 54T_X = 117              3178773
Proposed         RB      8608         16947        8877              69133                           9T_A + 14T_X = 37               2557921

times the number of registers. The delay of an XOR gate is assumed to be twice that
of an AND gate [5]. If the delay of an AND gate is assumed to be 1, the column Delay Cost in
Table 5.4 represents the multiplication delay as a multiple of the delay of an AND gate.
Compared to the other architectures, the proposed architecture has a much smaller delay
cost at the expense of a higher area cost. Note that both area and delay are objectives to minimize for a word level multiplier, and decreasing one is usually at the expense of
increasing the other. In Table 5.4 we also use the area-delay product, given in the last column of the table, to show the balance between the changes in area cost and delay cost.
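
The cost model behind the "Proposed" rows of Table 5.4 can be written out in a few lines of Python. This is only a sketch of the arithmetic described above; the function name and the returned dictionary keys are illustrative assumptions, while the weights (AND = 1, XOR = 2, flip-flop = 3, $T_A = 1$, $T_X = 2$) are the ones stated in the text.

```python
def proposed_rb_costs(m, w, k, t_a=1, t_x=2):
    """Area and delay cost of the proposed RB multiplier for a type I ONB field."""
    n = m + 1                              # size of the redundant basis
    n_and = k * n
    n_xor = (2 * k - 1) * n
    n_reg = (k + 1) * n
    area = n_and + 2 * n_xor + 3 * n_reg   # XOR counts twice, flip-flop three times
    tree_depth = (k - 1).bit_length()      # ceil(log2 k) for k >= 1
    delay = w * (t_a + t_x) + tree_depth * t_x
    return {"and": n_and, "xor": n_xor, "reg": n_reg,
            "area": area, "delay": delay, "area_delay": area * delay}


for w, k in ((34, 8), (17, 16), (9, 32)):
    print(w, k, proposed_rb_costs(268, w, k))
```

For w = 34 and k = 8 the function returns an area cost of 17485, a delay cost of 108 and an area-delay product of 1888380, matching the corresponding row of Table 5.4.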

Table 5.5: Normalized Complexity Comparison of Different Word Size Multipliers in $F_{2^{268}}$
(all values relative to the proposed multiplier)

                 w = 34, k = 8                  w = 17, k = 16                 w = 9, k = 32
Multiplier       Area   Delay   Area × Delay    Area   Delay   Area × Delay    Area   Delay   Area × Delay
WLMO [22]        82%    661%    545%            78%    605%    475%            76%    510%    390%
IMO [11]         70%    661%    464%            66%    605%    400%            64%    510%    327%
AEDS [36]        57%    661%    378%            50%    605%    304%            46%    510%    239%
XEDS [36]        50%    661%    333%            43%    605%    265%            40%    510%    207%
w-SMPO I [37]    61%    440%    270%            52%    461%    242%            48%    437%    211%
w-SMPO II [37]   55%    440%    243%            46%    461%    214%            42%    437%    183%
PISO [44]        41%    598%    247%            39%    547%    215%            38%    462%    177%
Comb [31]        46%    283%    130%            41%    316%    132%            39%    316%    125%
ASH [32]         44%    283%    126%            41%    316%    130%            39%    316%    124%
Proposed         100%   100%    100%            100%   100%    100%            100%   100%    100%

It can be seen that for all three word sizes, the area-delay product of the new multiplier is
smaller than that of all the previously proposed architectures listed in the table.
If the area-delay product of the proposed multiplier is taken as one, then the relative values of
the area-delay product for the previously proposed architectures are as listed in Table 5.5.
It can be seen from the table that the area-delay product of the new multiplier is at most
about 80% of that of any other multiplier for the given values of w and k.

5.6 Conclusions

A high speed word level finite field multiplier using RB has been proposed. The proposed
architecture is significantly faster than previously proposed architectures, at the
expense of higher area complexity. For the class of fields in which there exists
a type I ONB, the proposed multiplier performs much faster than other word level NB
multipliers available in the open literature. It was shown that the new multiplier outperforms all
the other multipliers in the comparison when the product of area and delay is used as the
measure of performance.


Chapter 6
High Speed Word Level Multipliers in
GF(2^m) Using Reordered Normal Basis

6.1 Introduction

An important factor that greatly affects the efficiency of finite field arithmetic is the basis
used to represent the field elements. Common bases used in practice are the polynomial basis
(PB) and the normal basis (NB) [25],[33]. The polynomial basis is probably the most popular basis and has been widely used for hardware and software implementations [18]. The normal
basis, on the other hand, is advantageous for hardware implementation since the squaring operation can be implemented at no cost. Free squaring can be used to speed up
exponentiation by repeated squaring and multiplication [14],[2].
Since addition can be implemented simply by exclusive OR, and inversion by repeated squaring and multiplication, multiplication is considered
to be the main operation for systems using a normal basis. In a normal basis, the complexity
of multiplication is measured with the multiplication matrix [30]. For a binary extension
field the entries of the multiplication matrix are either zero or one, and the number of ones in
the multiplication matrix is referred to as the normal basis complexity. A normal basis in
GF(2^m) for which the complexity achieves its minimum, 2m − 1, is referred to as an
optimal normal basis (ONB). Two types of optimal normal bases have been found, which
are referred to as type I and type II optimal normal bases [30]. A reordered normal basis is
a certain permutation of a type II optimal normal basis [12], [44].
In this work two new word-level finite field multipliers using a reordered normal basis
are presented. It is shown that the proposed architectures are faster than all the previously
presented architectures in the open literature that use either a type II optimal normal basis
or a reordered normal basis, at the expense of moderately higher complexity. One unique
feature of the new word-level architectures is that the critical path delay is independent of
the word size and the field size. This enables the proposed multipliers to operate at a very high
clock rate regardless of the word size or the field size.
The organization of this chapter is as follows. The reordered normal basis and multiplication using this basis are briefly reviewed in Section 2. In Sections 3 and 4, two new
word-level multipliers using a reordered normal basis are proposed. An architectural complexity comparison of the proposed architectures with similar proposals is presented
in Section 5. Finally, some concluding remarks are given in Section 6.

6.2 A Brief Review of Reordered Normal Basis and Its
Arithmetic in GF(2^m)
6.2.1 Reordered Normal Basis
The idea of reordered normal basis was first proposed by Gao and Vanstone in [12] and
was later used to design several multiplier architectures in [44].


Theorem 6.2.1. [12] Let $\beta$ be a primitive $(2m+1)$st root of unity in GF(2^m). Then $\gamma =
\beta + \beta^{-1}$ generates a type II optimal normal basis. The ordered set $\{\gamma_i, i = 1, 2, \ldots, m\}$
with $\gamma_i = \beta^i + \beta^{-i}$ also forms a basis in GF(2^m).
It has been shown that the basis $[\gamma_1, \gamma_2, \ldots, \gamma_m]$¹ is a permutation of the normal basis
$[\gamma, \gamma^2, \ldots, \gamma^{2^{m-1}}]$ [12], and it is referred to as the reordered normal basis following [44].
Define the function s(i), which maps the set of integers to the set $\{0, 1, \ldots, m\}$, as follows [12, 44]:

\[
s(i) =
\begin{cases}
i \bmod (2m+1), & \text{if } 0 \le i \bmod (2m+1) \le m, \\
(2m+1) - \bigl(i \bmod (2m+1)\bigr), & \text{if } m < i \bmod (2m+1) \le 2m.
\end{cases}
\tag{6.1}
\]

The following lemmas will be useful either to facilitate reordered normal basis arithmetic
or to derive a reordered normal basis from a given type II ONB. Note that all the results in
the lemmas have already been stated in [12].
Lemma 6.2.2. Let $\beta$ and $\gamma$ be defined as in Theorem 6.2.1. Then $\gamma_i = \gamma_{s(i)}$ for any integer
i.
Proof: Consider the following two cases and note that $\beta$ is a primitive $(2m+1)$st root
of unity in GF(2^m).
• Case 1: If $0 \le i \bmod (2m+1) \le m$, then

\[
\gamma_i = \beta^{i} + \beta^{-i}
= \beta^{\,i \bmod (2m+1)} + \beta^{-(i \bmod (2m+1))}
= \beta^{s(i)} + \beta^{-s(i)} = \gamma_{s(i)}.
\]

• Case 2: If $m < i \bmod (2m+1) \le 2m$, then

\[
\begin{aligned}
\gamma_i = \beta^{i} + \beta^{-i}
&= \beta^{\,i \bmod (2m+1)} + \beta^{-(i \bmod (2m+1))} \\
&= \beta^{-(2m+1 - i \bmod (2m+1))} + \beta^{\,2m+1 - i \bmod (2m+1)} \\
&= \beta^{-s(i)} + \beta^{s(i)} = \gamma_{s(i)}.
\end{aligned}
\]

¹ We use [ ··· ] to denote an ordered set.
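
The index function s(i) of (6.1) is easy to model in software. The Python sketch below is an illustration only (the function name and the two-argument signature are assumptions made here); it checks the two properties used in the proof of Lemma 6.2.2, namely that s is periodic with period 2m + 1 and symmetric, s(i) = s(−i).

```python
def s(i, m):
    """Index map of (6.1): fold any integer i into the range 0..m."""
    r = i % (2 * m + 1)            # Python's % is non-negative for a positive modulus
    return r if r <= m else (2 * m + 1) - r


m = 5
for i in range(-50, 50):
    assert 0 <= s(i, m) <= m
    assert s(i, m) == s(i + 2 * m + 1, m)   # periodicity
    assert s(i, m) == s(-i, m)              # symmetry

# Indices of the squared elements gamma^(2^i) for m = 5 (2m + 1 = 11).
print([s(2 ** i, m) for i in range(m)])     # [1, 2, 4, 3, 5]
```

For m = 5 the indices s(2^i), i = 0, ..., 4, visit every value in 1, ..., 5 exactly once, which illustrates how repeated squaring permutes the reordered basis elements.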


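□
Lemma 6.2.3. Let $I' = [\gamma^{2^0}, \gamma^{2^1}, \ldots, \gamma^{2^{m-1}}]$ be a type II optimal normal basis. Then the
reordered normal basis $I = [\gamma_1, \gamma_2, \ldots, \gamma_m]$ is a permutation of the basis elements of $I'$,
and the permutation is given by

\[ \gamma^{2^i} = \gamma_{s(2^i)}, \quad i = 0, 1, \ldots, m-1. \]

Proof: It follows from Lemma 6.2.2 that

\[ \gamma^{2^i} = \beta^{2^i} + \beta^{-2^i} = \gamma_{2^i} = \gamma_{s(2^i)} \quad \text{for } i = 0, 1, \ldots, m-1. \]
□
Lemma 6.2.4. For all $i, j = 1, 2, \ldots, m$, we have

\[ \gamma_i \gamma_j = \gamma_{s(i+j)} + \gamma_{s(i-j)}. \]

Proof:

\[
\begin{aligned}
\gamma_i \gamma_j &= (\beta^{i} + \beta^{-i})(\beta^{j} + \beta^{-j}) \\
&= (\beta^{i+j} + \beta^{-(i+j)}) + (\beta^{i-j} + \beta^{-(i-j)}) \\
&= \gamma_{i+j} + \gamma_{i-j} \\
&= \gamma_{s(i+j)} + \gamma_{s(i-j)}.
\end{aligned}
\]

Note that the last step comes from Lemma 6.2.2.
□

6.2.2 Reordered Normal Basis Multiplication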

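Assume that A and B are two arbitrary elements in GF(2^m) represented with respect to
(w.r.t.) the reordered normal basis $I = [\gamma_1, \gamma_2, \ldots, \gamma_m]$,

\[ A = \sum_{i=1}^{m} a_i \gamma_i \quad \text{and} \quad B = \sum_{i=1}^{m} b_i \gamma_i. \]

Their product C is represented with respect to the same basis, $C = \sum_{i=1}^{m} c_i \gamma_i$. Assuming
$\gamma_0 = 0$ and $b_0 = 0$, then following Lemma 6.2.4, $\gamma_j B$, $j = 1, 2, \ldots, m$, can be given by [12]

\[
\begin{aligned}
\gamma_j B &= \sum_{i=1}^{m} b_i \gamma_i \gamma_j \\
&= \sum_{i=1}^{m} b_i \left[ \gamma_{s(i+j)} + \gamma_{s(i-j)} \right] \\
&= \sum_{i=1}^{m} \left[ b_{s(i+j)} + b_{s(i-j)} \right] \gamma_i.
\end{aligned}
\tag{6.2}
\]

The last step in (6.2) comes from proper substitution of the subscripts. The product C can
be obtained as follows.

\[
\begin{aligned}
C = A \cdot B &= \sum_{j=1}^{m} a_j \gamma_j B \\
&= \sum_{j=1}^{m} a_j \sum_{i=1}^{m} \left[ b_{s(i+j)} + b_{s(i-j)} \right] \gamma_i \\
&= \sum_{i=1}^{m} \Bigl( \sum_{j=1}^{m} a_j \left[ b_{s(i+j)} + b_{s(i-j)} \right] \Bigr) \gamma_i.
\end{aligned}
\]

Then we have [44]

\[ c_i = \sum_{j=1}^{m} a_j \left[ b_{s(i+j)} + b_{s(i-j)} \right], \quad i = 1, 2, \ldots, m. \tag{6.3} \]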
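
To make (6.3) concrete, the following Python sketch evaluates it bit by bit. It is a plain software model and not one of the architectures of this chapter; the function names and the convention that list position i − 1 holds the coordinate with subscript i are assumptions made for illustration. Two consequences of (6.3) are checked exhaustively for m = 3: the result does not depend on the order of the operands, and the all-ones vector acts as the multiplicative identity.

```python
from itertools import product


def s(i, m):
    """Index map of (6.1)."""
    r = i % (2 * m + 1)
    return r if r <= m else (2 * m + 1) - r


def rnb_multiply(a, b):
    """Evaluate (6.3); a[i - 1] and b[i - 1] hold a_i and b_i for i = 1..m."""
    m = len(a)
    b_ext = [0] + list(b)                 # b_ext[0] models b_0 = 0
    c = []
    for i in range(1, m + 1):
        acc = 0
        for j in range(1, m + 1):
            acc ^= a[j - 1] & (b_ext[s(i + j, m)] ^ b_ext[s(i - j, m)])
        c.append(acc)
    return c


m = 3
ones = [1] * m
for a in product((0, 1), repeat=m):
    assert rnb_multiply(a, ones) == list(a)             # all-ones is the identity
    for b in product((0, 1), repeat=m):
        assert rnb_multiply(a, b) == rnb_multiply(b, a)  # commutativity

print(rnb_multiply([1, 0, 0], [0, 1, 0]))  # gamma_1 * gamma_2 -> [1, 0, 1]
```

The printed result agrees with Lemma 6.2.4: for m = 3, $\gamma_1 \gamma_2 = \gamma_{s(3)} + \gamma_{s(-1)} = \gamma_3 + \gamma_1$.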

6.3 Proposed High Speed Word Level Multiplier type One
Using Reordered Normal Basis
6.3.1 Word-level multiplication algorithm using reordered normal basis
Let w denote the word size and $k = \lceil m/w \rceil$ be the number of words required for representing a field element in GF(2^m). Write the subscript j of $a_j$ in (6.3) as $j = gw + \ell$ for
$g = 0, 1, \ldots, k-1$ and $\ell = 1, 2, \ldots, w$, and replace j in (6.3) with $gw + \ell$:

\[ c_i = \sum_{g=0}^{k-1} \sum_{\ell=1}^{w} a_{gw+\ell} \left[ b_{s(i+gw+\ell)} + b_{s(i-gw-\ell)} \right]. \tag{6.4} \]

Note that the coordinates of the operands A and B are zero if their subscript exceeds
m. Define the signal $d_{i,g}^{(\ell)}$, $i = 1, 2, \ldots, m$ and $g = 0, 1, \ldots, k-1$, as

\[
d_{i,g}^{(0)} = 0 \quad \text{and} \quad
d_{i,g}^{(\ell)} = d_{i,g}^{(\ell-1)} + a_{gw+\ell} \left[ b_{s(i+gw+\ell)} + b_{s(i-gw-\ell)} \right]
\quad \text{for } \ell = 1, 2, \ldots, w. \tag{6.5}
\]

Then it follows from (6.5) that

\[ d_{i,g}^{(w)} = \sum_{\ell=1}^{w} a_{gw+\ell} \left[ b_{s(i+gw+\ell)} + b_{s(i-gw-\ell)} \right]. \tag{6.6} \]

Comparing (6.4) with (6.6), it follows that

\[ c_i = \sum_{g=0}^{k-1} d_{i,g}^{(w)}. \]

An algorithm for word-level multiplication using reordered normal basis can be given
as follows.
Algorithm 6. Word-level reordered normal basis (RNB) multiplication algorithm I
Input:  $A = (a_1, \ldots, a_m)$, $B = (b_1, \ldots, b_m)$, both w.r.t. RNB,
        and the word size w, 1 < w < m
Output: $C = A \times B = (c_1, \ldots, c_m)$, also w.r.t. RNB
1. Initialization: $k = \lceil m/w \rceil$, and $d_{i,g}^{(0)} = 0$ for $i = 1, 2, \ldots, m$ and $g = 0, 1, \ldots, k-1$
2. For all values of $i = 1, 2, \ldots, m$, compute
3.   For all values of $g = 0, 1, \ldots, k-1$, compute
4.     For $\ell = 1$ To w
5.       $d_{i,g}^{(\ell)} = d_{i,g}^{(\ell-1)} + a_{gw+\ell} \left[ b_{s(i+gw+\ell)} + b_{s(i-gw-\ell)} \right]$
6.     End For
7.     $c_i = \sum_{g=0}^{k-1} d_{i,g}^{(w)}$
8.   End For
9. End For

6.3.2 Multiplier Architecture

A high speed word-level reordered normal basis multiplier can be built based on Algorithm 6, which is referred to as word-level reordered normal basis type I (WL-RNB I) and
is shown in Fig. 6.1.
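
A bit-level software model of Algorithm 6 can serve as a reference when checking such an architecture. The Python sketch below is an assumption-laden illustration rather than the hardware itself: the names are hypothetical, list position i − 1 holds the coordinate with subscript i, and the outer loop over $\ell$ plays the role of the clock, with all m·k accumulation units updated in each cycle. The word-level result is compared against a direct evaluation of (6.3).

```python
from itertools import product


def s(i, m):
    """Index map of (6.1)."""
    r = i % (2 * m + 1)
    return r if r <= m else (2 * m + 1) - r


def rnb_multiply_direct(a, b):
    """Reference: direct evaluation of (6.3)."""
    m = len(a)
    b_ext = [0] + list(b)
    return [sum(a[j - 1] & (b_ext[s(i + j, m)] ^ b_ext[s(i - j, m)])
                for j in range(1, m + 1)) % 2
            for i in range(1, m + 1)]


def wl_rnb1_multiply(a, b, w):
    """Word-level accumulation following Algorithm 6."""
    m = len(a)
    k = -(-m // w)                               # ceil(m / w)
    a_pad = [0] + list(a) + [0] * (k * w - m)    # a_pad[j] = a_j, zero beyond m
    b_ext = [0] + list(b)                        # b_ext[0] models b_0 = 0
    d = [[0] * k for _ in range(m + 1)]          # d[i][g]: accumulation unit (i, g)
    for l in range(1, w + 1):                    # one clock cycle per value of l
        for i in range(1, m + 1):
            for g in range(k):
                j = g * w + l
                d[i][g] ^= a_pad[j] & (b_ext[s(i + j, m)] ^ b_ext[s(i - j, m)])
    # Step 7: c_i is the XOR of the k accumulated values of group i.
    return [sum(d[i]) % 2 for i in range(1, m + 1)]


m, w = 5, 2
for a in product((0, 1), repeat=m):
    for b in product((0, 1), repeat=m):
        assert wl_rnb1_multiply(a, b, w) == rnb_multiply_direct(a, b)
```

With m = 5 and w = 2 the last word of A is zero-padded (kw = 6 > m), so the check also exercises the rule that coordinates with a subscript above m are treated as zero.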
From the top to the bottom, the architecture contains a (2m + 1)-bit circular shift register, the Expansion/Permutation module, one layer of XOR gates, one layer of AND gates,
one layer of accumulation units, and one layer of XOR gate networks.
Step 5 of Algorithm 6 can be implemented using one XOR gate, one AND gate and
one accumulation unit, as shown in the block of dashed lines in Fig. 6.1. The accumulation
unit requires one XOR gate and one flip-flop. Steps 2 and 3 of Algorithm 6 require in total
m × k such accumulation units in m groups, with each group containing k units. Step 7 of
Algorithm 6 shows that each $c_i$, $i = 1, 2, \ldots, m$, is a sum of k terms, which are the outputs
of the k accumulation units after w clock cycles. A k-input XOR gate, or a binary tree of
k − 1 two-input XOR gates, is used to produce the final output $c_i$, as shown at the bottom of
Fig. 6.1.
The input operand A is required to be fed into the multiplier in a comb style. Let A be
divided into k words, each of w bits. Then, in the first clock cycle the inputs
are the first bit of every word, $a_1, a_{w+1}, a_{2w+1}, \ldots, a_{(k-1)w+1}$. In the second clock cycle
the inputs are $a_2, a_{w+2}, a_{2w+2}, \ldots, a_{(k-1)w+2}$. Finally, in the wth clock cycle the inputs are
$a_w, a_{2w}, a_{3w}, \ldots, a_{kw}$. Note that if the subscript of an input bit exceeds m, then it is replaced
by a zero bit.

Figure 6.1: Proposed word-level high speed multiplier using reordered normal basis
The input operand B is stored in a (2m + 1)-bit circular shift register R, and from there
the input bits are fed into the Expansion/Permutation module. Suppose that the (2m + 1)-bit circular shift register R is initially loaded, from the top to the bottom, as $b_0, b_1, b_2, \ldots, b_m, b_m, b_{m-1},
b_{m-2}, \ldots, b_1$, as shown in Fig. 6.1. Note from (6.1) that $s(i) = s(2m+1-i)$ for $i =
0, 1, 2, \ldots, 2m$, or

\[
\begin{aligned}
b_0 &= b_{s(0)} = b_{s(2m+1)}, \\
b_1 &= b_{s(1)} = b_{s(2m)}, \\
&\;\;\vdots \\
b_m &= b_{s(m)} = b_{s(m+1)}, \\
b_m &= b_{s(m+1)} = b_{s(m)}, \\
b_{m-1} &= b_{s(m+2)} = b_{s(m-1)}, \\
&\;\;\vdots \\
b_1 &= b_{s(2m)} = b_{s(1)}.
\end{aligned}
\]

Then register R can be viewed as two virtual (2m + 1)-bit registers $R_1$ and $R_2$, as shown in
Fig. 6.2(b): from the top to the bottom, one ($R_1$) contains $b_{s(0)}, b_{s(1)}, b_{s(2)}, \ldots, b_{s(2m)}$, and
the other ($R_2$) contains $b_{s(2m+1)}, b_{s(2m)}, b_{s(2m-1)}, \ldots, b_{s(1)}$.

Figure 6.2: Viewing the (2m + 1)-bit circular shift register R as two virtual (2m + 1)-bit
circular shift registers $R_1$ and $R_2$

The following two lemmas are useful for the description of the Expansion/Permutation
module.
Lemma 6.3.1. If the content of the ith bit (with the bit at the top as bit zero) of $R_1$ is given as
$r_{1,i}^{(\ell)} = b_{s(h)}$ at clock cycle $\ell$ for some integer h, then its content will be $r_{1,i}^{(\ell+1)} = b_{s(h-1)}$
at the next clock cycle.
Proof: The contents of $R_1$ starting from the top bit at clock cycle $\ell$ can be any (2m + 1)
consecutive bits in the sequence

\[ b_{s(0)}, b_{s(1)}, \ldots, b_{s(2m)}, b_{s(0)}, b_{s(1)}, \ldots, b_{s(2m)}, b_{s(0)}, b_{s(1)}, \ldots, b_{s(2m)}, \ldots \]

So it is obvious that $r_{1,i}^{(\ell+1)} = b_{s(h-1)}$ if h > 0. If h = 0 then $r_{1,i}^{(\ell+1)} = b_{s(2m)}$. The lemma follows
by noting that $b_{s(2m)} = b_{s(-1)}$ from (6.1).
□
Lemma 6.3.2. If the content of the ith bit of $R_2$ is given as $r_{2,i}^{(\ell)} = b_{s(h)}$ at clock cycle $\ell$
for some integer h, then its content will be $r_{2,i}^{(\ell+1)} = b_{s(h+1)}$ at the next clock cycle.
Proof: The contents of $R_2$ starting from the top bit at clock cycle $\ell$ can be any (2m + 1)
consecutive bits in the sequence

\[ b_{s(2m+1)}, b_{s(2m)}, \ldots, b_{s(1)}, b_{s(2m+1)}, b_{s(2m)}, \ldots, b_{s(1)}, b_{s(2m+1)}, b_{s(2m)}, \ldots, b_{s(1)}, \ldots \]

So it is obvious that $r_{2,i}^{(\ell+1)} = b_{s(h+1)}$ if h < 2m + 1. If h = 2m + 1 then $r_{2,i}^{(\ell+1)} = b_{s(1)}$. The
lemma follows by noting that $b_{s(1)} = b_{s(2m+2)}$ from (6.1).
□
For given i and g in $d_{i,g}$ as in (6.5), in every clock cycle the argument of the function s(·) increases by one in $b_{s(i+gw+\ell)}$ and decreases by one in $b_{s(i-gw-\ell)}$. So, following Lemmas 6.3.1
and 6.3.2, the input $b_{s(i+gw+\ell)}$ is connected to bit $(i + gw + \ell)$ of $R_2$, while the input $b_{s(i-gw-\ell)}$
is connected to bit $(i - gw - \ell)$ of $R_1$. It is then easy to see that the Expansion/Permutation module
is just a reordering and copying module which does not contain any gates. This module
accepts 2m + 1 inputs from the circular shift register and provides 2km outputs for the
layer of XOR gates.
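
The two views of the register R can be checked with a short Python sketch. The register is modeled by the subscripts of the b coordinates it holds rather than by bit values, and one shift is assumed to move every bit one position toward the bottom (with the bottom bit wrapping to the top); this direction is an assumption chosen to agree with Lemma 6.3.1, and the helper names are illustrative only.

```python
def s(i, m):
    """Index map of (6.1)."""
    r = i % (2 * m + 1)
    return r if r <= m else (2 * m + 1) - r


def load_register(m):
    """Subscripts held by R from top to bottom: 0, 1, ..., m, m, ..., 1."""
    return list(range(m + 1)) + list(range(m, 0, -1))


def shift(reg):
    """One circular shift: each bit moves one position toward the bottom."""
    return [reg[-1]] + reg[:-1]


m = 5
reg = load_register(m)
for t in range(3 * (2 * m + 1)):                 # several full rotations
    for i, subscript in enumerate(reg):
        # R_1 view: bit i holds b_{s(h)} with h = i - t, so h decreases each cycle.
        assert subscript == s(i - t, m)
        # R_2 view: the same bit holds b_{s(h)} with h = 2m + 1 - i + t.
        assert subscript == s(2 * m + 1 - i + t, m)
    reg = shift(reg)
```

Both assertions refer to the same physical bit, which is why a single (2m + 1)-bit register is enough to supply operands whose index moves in opposite directions.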
The critical path of the proposed multiplier contains one AND gate and two XOR gates,
as shown in the block formed by dashed lines in Fig. 6.1. Let $T_A$ and $T_X$ respectively denote
the delay of a two-input AND gate and a two-input XOR gate. Then the critical path delay is
$T_{cp} = T_A + 2T_X$. Note that the critical path delay depends on neither the field size m nor
the word size w. It is worth pointing out that the binary tree of k − 1 XOR gates is not part of
the critical path, since the k-term summation for the $c_i$ outputs has to be calculated only once at
the end of the multiplication. Consequently, the product bits are not immediately available
following w clock cycles. Instead, a time delay of about $T_{ex} = \lceil \log_2 k \rceil T_X$
has to be spent before the product bits are generated at the output ends.
Two options are available for the multiplier during the time spent on the final
addition of the partial products. In the first option, the multiplier clock is stopped
after w clock cycles and the outputs are read from the output ports after the addition delay, which can be measured approximately beforehand. This style requires an extra counter
but can save dynamic power consumption during the final addition.
The second option is to enter zero input bits for input A once the w clock cycles are
over. This eliminates the need for stopping the clock signal of the multiplier and does
not require extra circuitry. The output product bits can be read from the output ports a certain number of clock cycles after the first w cycles. The exact number of
clock cycles required for a multiplication operation can be computed beforehand; assuming
that the clock period is chosen to be the critical path delay $T_{cp}$, it is

\[ w + \left\lceil \frac{\lceil \log_2 k \rceil T_X}{T_A + 2T_X} \right\rceil. \]

For example, when m = 233 and w = 32, $k = \lceil m/w \rceil = 8$. Then the product bits
are available after a time delay of $T_{ex} = \lceil \log_2 k \rceil T_X = 3T_X$ following w = 32 clock
cycles. Since the extra time delay $T_{ex} = 3T_X$ is less than twice the critical path delay
$T_{cp} = T_A + 2T_X$ (assuming that $T_A \approx T_X/2$ and the system clock period is chosen to be
the critical path delay)², in a practical implementation the product bits can be read out after
w + 2 = 34 clock cycles, while two zero bits have to be appended to every input word of
the operand A. In the rest of the chapter, we assume that the second option is adopted for
the multiplier.

² In a typical CMOS VLSI realization, the delay of an AND gate is about half of that of an XOR gate [5].
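
The cycle count for the second option can be computed with a few lines of Python. This is a sketch of the formula above only; the function name is hypothetical, and the delays are given in integer units with $T_A = 1$ and $T_X = 2$, which reflects the assumption that an AND gate is about twice as fast as an XOR gate.

```python
import math


def wl_rnb1_total_cycles(w, k, t_a=1, t_x=2):
    """Clock cycles per multiplication when the clock period equals T_cp."""
    t_cp = t_a + 2 * t_x                   # critical path: one AND and two XOR gates
    t_ex = (k - 1).bit_length() * t_x      # ceil(log2 k) * T_X for the output XOR tree
    return w + math.ceil(t_ex / t_cp)


print(wl_rnb1_total_cycles(32, 8))   # m = 233, w = 32, k = 8 -> 34
print(wl_rnb1_total_cycles(2, 2))    # m = 3,   w = 2,  k = 2 -> 3
```

The number of zero bits to append to each input word of A is the second term of the sum, i.e. the returned value minus w.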

6.3.3 Architecture complexities
The area and delay complexity of the proposed design can be determined from Fig. 6.1.
From the top to the bottom, the complexity of each part of the multiplier can be obtained as
follows. The circular shift register R contains 2m + 1 flip-flops. The Expansion/Permutation
module does not contain any gates or flip-flops. There are mk gates in each of the
layers of XOR gates and AND gates. The number of accumulation units is also mk, each of
which contains one flip-flop and one XOR gate. The m binary trees of XOR gates at the
bottom consist of m(k − 1) XOR gates in total. The complexities can be summarized as
follows:

#AND    #XOR         #Registers
km      (3k − 1)m    (k + 2)m + 1

6.3.4 An example

A proposed word-level multiplier in GF(2^3) using reordered normal basis is shown in
Fig. 6.3. The word size is chosen to be w = 2, and then $k = \lceil m/w \rceil = 2$. The critical path
delay is $T_{cp} = T_A + 2T_X$. It takes w + 1 = 3 clock cycles to complete one multiplication,
since one extra cycle is needed due to the time delay caused by the XOR gate at the output
ends. Correspondingly, a zero bit has to be appended to the end of each of the two input
words of the operand A so that the contents of the flip-flops in the accumulation units
remain unchanged during the last clock cycle.

Figure 6.3: Architecture of WL-RNB I in GF(2^3) with w = k = 2

6.4 Proposed High Speed Word Level Multiplier Type Two
Using Reordered Normal Basis
6.4.1 Word-level multiplication algorithm using reordered normal basis
Let w, k, g, and $\ell$ be defined as in Subsection 6.3.1. It follows from (6.4) that

\[
\begin{aligned}
c_i &= \sum_{g=0}^{k-1} \sum_{\ell=1}^{w} a_{gw+\ell} \left[ b_{s(i+gw+\ell)} + b_{s(i-gw-\ell)} \right] \\
&= \sum_{g=0}^{k-1} \Bigl( \sum_{\ell=1}^{w} a_{gw+\ell}\, b_{s(i+gw+\ell)} + \sum_{\ell=1}^{w} a_{gw+\ell}\, b_{s(i-gw-\ell)} \Bigr).
\end{aligned}
\tag{6.7}
\]

Define the signals $d_{i,g}^{(\ell)}$ and $e_{i,g}^{(\ell)}$, $i = 1, 2, \ldots, m$, $g = 0, 1, \ldots, k-1$, for $\ell = 1, 2, \ldots, w$, as
follows

\[
\begin{aligned}
d_{i,g}^{(0)} &= 0, & d_{i,g}^{(\ell)} &= d_{i,g}^{(\ell-1)} + a_{gw+\ell}\, b_{s(i+gw+\ell)}, \\
e_{i,g}^{(0)} &= 0, & e_{i,g}^{(\ell)} &= e_{i,g}^{(\ell-1)} + a_{gw+\ell}\, b_{s(i-gw-\ell)}.
\end{aligned}
\tag{6.8}
\]

Then it follows from (6.8) that

\[
d_{i,g}^{(w)} = \sum_{\ell=1}^{w} a_{gw+\ell}\, b_{s(i+gw+\ell)}, \qquad
e_{i,g}^{(w)} = \sum_{\ell=1}^{w} a_{gw+\ell}\, b_{s(i-gw-\ell)}.
\tag{6.9}
\]

Comparing (6.7) with (6.9), it follows that

\[ c_i = \sum_{g=0}^{k-1} \bigl( d_{i,g}^{(w)} + e_{i,g}^{(w)} \bigr) = \sum_{g=0}^{k-1} d_{i,g}^{(w)} + \sum_{g=0}^{k-1} e_{i,g}^{(w)}. \]

An algorithm for high speed word-level multiplication is given as follows.
Algorithm 7. Word-level reordered normal basis (RNB) multiplication algorithm II
Input:  $A = (a_1, \ldots, a_m)$, $B = (b_1, \ldots, b_m)$, both w.r.t. RNB,
        and the word size w, 1 < w < m
Output: $C = A \times B = (c_1, \ldots, c_m)$, also w.r.t. RNB
1.  Initialization: $k = \lceil m/w \rceil$, and $d_{i,g}^{(0)} = 0$, $e_{i,g}^{(0)} = 0$
    for $i = 1, 2, \ldots, m$ and $g = 0, 1, \ldots, k-1$
2.  For all values of $i = 1, 2, \ldots, m$, compute
3.    For all values of $g = 0, 1, \ldots, k-1$, compute
4.      For $\ell = 1$ To w By Step One
5.        $d_{i,g}^{(\ell)} = d_{i,g}^{(\ell-1)} + a_{gw+\ell}\, b_{s(i+gw+\ell)}$
6.        $e_{i,g}^{(\ell)} = e_{i,g}^{(\ell-1)} + a_{gw+\ell}\, b_{s(i-gw-\ell)}$
7.      End For
8.      $c_i = \sum_{g=0}^{k-1} d_{i,g}^{(w)} + \sum_{g=0}^{k-1} e_{i,g}^{(w)}$
9.    End For
10. End For

6.4.2 Multiplier architecture

A word-level multiplier following Algorithm 7 using RNB, which is referred to as WL-RNB II, is shown in Fig. 6.4. It can be seen that the architecture is similar to that of
WL-RNB I. The (2m + 1)-bit circular shift register and the Expansion/Permutation module
are the same for the two architectures.
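
As with the first architecture, Algorithm 7 can be modeled in software for checking purposes. The Python sketch below keeps the two accumulator families d and e separate, as the algorithm does, and compares the result with a direct evaluation of (6.3). The function names and the list layout (position i − 1 holds the coordinate with subscript i) are assumptions made for illustration; the model captures the arithmetic only, not the timing.

```python
from itertools import product


def s(i, m):
    """Index map of (6.1)."""
    r = i % (2 * m + 1)
    return r if r <= m else (2 * m + 1) - r


def rnb_multiply_direct(a, b):
    """Reference: direct evaluation of (6.3)."""
    m = len(a)
    b_ext = [0] + list(b)
    return [sum(a[j - 1] & (b_ext[s(i + j, m)] ^ b_ext[s(i - j, m)])
                for j in range(1, m + 1)) % 2
            for i in range(1, m + 1)]


def wl_rnb2_multiply(a, b, w):
    """Word-level accumulation following Algorithm 7 (2k accumulators per c_i)."""
    m = len(a)
    k = -(-m // w)                               # ceil(m / w)
    a_pad = [0] + list(a) + [0] * (k * w - m)    # a_pad[j] = a_j, zero beyond m
    b_ext = [0] + list(b)                        # b_ext[0] models b_0 = 0
    d = [[0] * k for _ in range(m + 1)]
    e = [[0] * k for _ in range(m + 1)]
    for l in range(1, w + 1):                    # one clock cycle per value of l
        for i in range(1, m + 1):
            for g in range(k):
                j = g * w + l
                d[i][g] ^= a_pad[j] & b_ext[s(i + j, m)]   # step 5
                e[i][g] ^= a_pad[j] & b_ext[s(i - j, m)]   # step 6
    # Step 8: a 2k-input XOR combines both accumulator families.
    return [(sum(d[i]) + sum(e[i])) % 2 for i in range(1, m + 1)]


m, w = 5, 2
for a in product((0, 1), repeat=m):
    for b in product((0, 1), repeat=m):
        assert wl_rnb2_multiply(a, b, w) == rnb_multiply_direct(a, b)
```

Splitting each term into separate d and e accumulators removes the XOR in front of the AND gate, at the cost of doubling the number of AND gates and accumulators per output bit.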
The main difference between the two multiplier architectures lies within the blocks of
dashed lines shown in Fig. 6.1 and Fig. 6.4, respectively. In the dashed block of WL-RNB
I shown in Fig. 6.1, the two bits of operand B are first added together, and then the sum
is multiplied by one bit of A to produce a partial product bit. The partial product bit is then fed
into the accumulation unit. After w clock cycles, the content of the accumulation unit along
with those of the other k − 1 accumulation units is fed into the XOR network to generate
one product bit $c_i$.
The dashed block of WL-RNB II has the same two input bits of B as that of WL-RNB I. Each input bit is first multiplied by one bit of A to produce a partial product bit. Each
partial product bit is then fed into its own accumulation unit. After w clock cycles, the content
of the accumulation unit along with those of the other 2k − 1 accumulation units is fed into
the XOR network to generate one product bit $c_i$. Note that the critical path of WL-RNB II
contains only one AND gate and one XOR gate, which is shorter than that of WL-RNB I by one
XOR gate. The final XOR gate network for generating $c_i$ has 2k inputs, twice as many as the XOR
network in WL-RNB I. So the final product bits are available after an extra time delay of
about $\lceil \log_2(2k) \rceil T_X$ immediately following the w clock cycles.

Figure 6.4: Proposed word-level high speed multiplier using reordered normal basis

6.4.3 Architecture complexity
The area and delay complexity of the proposed design can be determined from Fig. 6.4. Registers appear in two parts of the design. The first part is the circular shift register R, which contains 2m + 1 flip-flops. The second part consists of the one-bit flip-flops that hold the partial sums of the output product bits; their number is 2km, since each output bit uses 2k flip-flops. The total number of two-input AND gates is 2km, since each output product bit uses 2k two-input AND gates. XOR gates exist in two parts of the architecture. The first part is the XOR gates that follow the AND gates. For each output product bit the number of these two-input XOR gates is 2k, the same as the number of AND gates, so for m output product bits there are 2km of them in total. The second part is the 2k-input XOR gate that exists for each output product bit. A 2k-input XOR gate is equivalent to 2k - 1 two-input XOR gates, so these contribute (2k - 1)m two-input XOR gates. Consequently, the equivalent number of two-input XOR gates in the architecture is 2km + (2k - 1)m = (4k - 1)m. The complexities can be summarized as follows:

| #AND | #XOR | #Registers |
|---|---|---|
| 2km | 4km - m | 2km + 2m + 1 |
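As a quick sanity check of the counts above, the following Python sketch evaluates the WL-RNB II expressions for a given m and k. The function name and the dictionary keys are illustrative choices, not part of the original design description.

```python
def wl_rnb2_complexity(m: int, k: int) -> dict:
    """Gate and register counts for WL-RNB II, using the expressions of Section 6.4.3."""
    and_gates = 2 * k * m                 # 2k two-input AND gates per product bit
    xor_after_and = 2 * k * m             # one accumulation XOR behind every AND gate
    xor_output_trees = (2 * k - 1) * m    # a 2k-input XOR tree per product bit
    shift_register = 2 * m + 1            # circular shift register R
    accumulators = 2 * k * m              # 2k accumulation flip-flops per product bit
    return {
        "and": and_gates,
        "xor": xor_after_and + xor_output_trees,   # equals (4k - 1)m
        "registers": shift_register + accumulators,
    }


# Example: m = 233 with word size w = 64, i.e. k = 4.
print(wl_rnb2_complexity(233, 4))
```

For m = 233 and k = 4 the sketch reproduces the WL-RNB II entries listed for w = 64 in Table 6.2.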

The critical path delay is $T_{cp} = T_A + T_X$. Similar to WL-RNB I, the product bits are not available immediately after w clock cycles. Instead, an extra time delay of about $T_{ex} = \lceil \log_2 2k \rceil T_X$ has to elapse before the product bits are generated at the outputs. If the clock period is chosen to be equal to the critical path delay, the total number of clock cycles for completing a multiplication is

$$w + \left\lceil \frac{\lceil \log_2 2k \rceil T_X}{T_A + T_X} \right\rceil.$$

Take again the example of m = 233 and w = 32, so that $k = \lceil m/w \rceil = 8$. Then the product bits are available after a time delay of $\lceil \log_2 2k \rceil T_X = 4T_X$ following w = 32 clock cycles. Since the extra time delay $T_{ex} = 4T_X$ is about three times the critical path delay $T_{cp} = T_X + T_A$, in a practical implementation the product bits can be read out after w + 3 = 35 clock cycles, while three zero bits have to be appended to the end of each input word from A to keep the contents of the flip-flops of the accumulation units unchanged during the last three clock cycles.
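The cycle count above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the delay ratio $T_A = 1$, $T_X = 2$ used later in Section 6.5; the helper names are illustrative.

```python
def ceil_log2(x: int) -> int:
    """Smallest integer e with 2**e >= x, for x >= 1."""
    return (x - 1).bit_length()


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def wl_rnb2_cycles(m: int, w: int, t_and: int = 1, t_xor: int = 2) -> int:
    """Clock cycles for one WL-RNB II multiplication when the clock period equals T_cp."""
    k = ceil_div(m, w)                     # number of words per operand
    t_cp = t_and + t_xor                   # critical path: one AND plus one XOR
    t_ex = ceil_log2(2 * k) * t_xor        # delay of the final 2k-input XOR tree
    return w + ceil_div(t_ex, t_cp)


print(wl_rnb2_cycles(233, 32))  # 35, as in the example above
print(wl_rnb2_cycles(3, 2))     # 4, i.e. w = 2 plus two extra cycles
```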


6.4.4 An example

Figure 6.5: Architecture of WL-RNB II in $GF(2^3)$ with w = k = 2

A diagram of WL-RNB II in $GF(2^3)$ is shown in Fig. 6.5. The word size is chosen to be w = 2, so that $k = \lceil m/w \rceil = 2$. The critical path delay is $T_{cp} = T_A + T_X$. Since a binary XOR tree with 4 inputs is used at each output to generate the final product bit $c_i$, which has a time delay of $2T_X$, two extra clock cycles following the w clock cycles are needed before the product bits are available at the outputs. Correspondingly, two zero bits have to be appended to each of the two input words of the operand A so that the contents of the flip-flops in the accumulation units remain unchanged during the last two clock cycles.

6.5 Comparisons

A complexity comparison between the proposed multipliers and some other multipliers in the literature is shown in Table 6.1. Since a reordered normal basis is a permutation of a type II ONB, it is interesting to compare the complexity of the proposed reordered normal basis multipliers with that of some popular NB multipliers for the class of fields in which a type II ONB exists. These architectures are shown in the top six rows of the table. The table also includes two previously proposed word-level reordered normal basis multipliers, shown in rows seven and eight.
The first row of Table 6.1 represents the word-level Massey-Omura (WLMO) multiplier which uses k identical bit-level Massey-Omura multipliers [22], while the second row
shows the improved Massey-Omura multiplier (IMO) [11]. The AND-efficient digit-serial
(AEDS) and XOR-efficient digit-serial (XEDS) multipliers proposed in [36] are shown at
the third and fourth rows of the table. Fifth and sixth rows of the table represent respectively the word-level sequential multipliers with parallel output type I and type II, which
were recently reported in [37].
The seventh row of the table shows the Hybrid PISO architecture proposed in [44] with k levels of pipelining, and the eighth row gives the Comb Style architecture recently proposed in [31]. The last two rows of the table present the proposed architectures WL-RNB I and WL-RNB II.
As can be seen the critical path delays (Tcp) of the proposed architectures are independent of the field size m or the word size w, and much smaller than any other previously
proposed word-level architectures shown in the table. Note that the gain in the clock speed
is at the expense of using significantly more flip-flops for both WL-RNB I and WL-RNB
II and more VLSI gates for WL-RNB II.
The last column (Multiplication Delay) of Table 6.1 represents the time it takes a multiplier to complete one multiplication operation. Note that the proposed multipliers need some extra time ($\lceil \log_2 k \rceil T_X$ or $\lceil \log_2 2k \rceil T_X$) besides the w clock cycles. If the system clock period is chosen as the critical path delay, to complete one multiplication operation WL-RNB I would take $w + \left\lceil \frac{\lceil \log_2 k \rceil T_X}{T_A + 2T_X} \right\rceil$ clock cycles, while WL-RNB II would require $w + \left\lceil \frac{\lceil \log_2 2k \rceil T_X}{T_A + T_X} \right\rceil$ clock cycles.
Table 6.1: Complexities comparison

| Multiplier | Basis | #AND | #XOR | #Reg | Critical Path Delay ($T_{cp}$) | Multiplication Delay |
|---|---|---|---|---|---|---|
| WLMO [22] | ONB II | $k(2m - 1)$ | $k(2m - 2)$ | $2m$ | $T_A + (1 + \lceil \log_2 m \rceil)T_X$ | $wT_{cp}$ |
| IMO [11] | ONB II | $km$ | $k(2m - 2)$ | $2m$ | $T_A + (1 + \lceil \log_2 m \rceil)T_X$ | $wT_{cp}$ |
| AEDS [36] | ONB II | $k(m - \frac{1}{2}k + \frac{1}{2})$ | $k(3m - k - 2)$ | $2m$ | $T_A + (1 + \lceil \log_2 m \rceil)T_X$ | $wT_{cp}$ |
| XEDS [36] | ONB II | $k(2m - k)$ | $k(2m - \frac{1}{2}k - \frac{3}{2})$ | $2m$ | $T_A + (1 + \lceil \log_2 m \rceil)T_X$ | $wT_{cp}$ |
| w-SMPOI [37] | ONB II | $k(\lfloor m/2 \rfloor + 1) + m$ | $k(2m - 1)$ | $3m$ | $2T_A + (3 + \lceil \log_2(k - 1) \rceil)T_X$ | $wT_{cp}$ |
| w-SMPOII [37] | ONB II | $km + m$ | $k(m + \lfloor m/2 \rfloor)$ | $3m$ | $2T_A + (3 + \lceil \log_2(k - 1) \rceil)T_X$ | $wT_{cp}$ |
| PISO [44] | RNB | $km$ | $k(2m - 1)$ | $2m + 1$ | $T_A + (1 + \lceil \log_2 m \rceil)T_X$ | $wT_{cp}$ |
| Comb Style [31] | RNB | $km$ | $2km$ | $3m + 1$ | $T_A + (1 + \lceil \log_2(k + 1) \rceil)T_X$ | $wT_{cp}$ |
| WL-RNB I | RNB | $km$ | $2km$ | $(k + 2)m + 1$ | $T_A + 2T_X$ | $wT_{cp} + \lceil \log_2 k \rceil T_X$ |
| WL-RNB II | RNB | $2km$ | $(4k - 1)m$ | $2(k + 1)m + 1$ | $T_A + T_X$ | $wT_{cp} + \lceil \log_2 2k \rceil T_X$ |

For the purpose of illustration we have tabulated the area-delay complexity of the proposed architectures and the previously proposed multipliers in Table 6.2. The field size is chosen as m = 233 because the field $GF(2^{233})$ is both one of the fields recommended by the National Institute of Standards and Technology (NIST) [34] and a field in which a type II ONB, and hence a reordered normal basis, exists. Word sizes of 16, 32 and 64 bits are adopted, which represent typical bus widths for a general-purpose processor or an embedded computer system.
The following assumptions are made in Table 6.2. The VLSI area of an XOR gate and that of a flip-flop are assumed to be twice and three times the area of an AND gate, respectively, with the area of an AND gate taken as 1. (In a typical CMOS VLSI realization, an AND gate can be implemented with 6 transistors, while an XOR gate and one flip-flop with set/reset inputs can be implemented with 12 and 16 transistors, respectively [40].) The column Area Cost in Table 6.2 represents the sum of the number of AND gates, twice the number of XOR gates and three times the number of registers. The delay of an XOR gate is assumed to be twice that of an AND gate [5]. With the delay of an AND gate taken as 1, the column Delay Cost in Table 6.2 gives the multiplication delay as a multiple of the delay of an AND gate.

Compared to the other architectures, the proposed architectures have a much smaller delay cost at the expense of a modestly higher area cost. Note that both area and delay are objectives to minimize for a word-level multiplier, and decreasing one usually comes at the expense of increasing the other. In Table 6.2 we also use the area-delay product to show the balance between the changes in area cost and delay cost; it is given in the last column of the table. It can be seen that for all three word sizes, the area-delay product of both new multipliers is much smaller than that of all the previously proposed architectures listed in the table. In fact, the area-delay product of WL-RNB I is only 75%, 67%, and 77% of that of the previously proposed multiplier with the smallest area-delay product, Comb Style [31], for word sizes 16, 32, and 64, respectively. WL-RNB II has a slightly higher area-delay product than WL-RNB I, but it has the highest speed among all the multipliers listed in the table.

Table 6.2: Complexity comparison of word-level type II optimal normal basis or reordered normal basis multipliers in $GF(2^{233})$ for different values of w, k.

| Multiplier | Basis | w, k | #AND | #XOR | #Register | Area Cost ($N_A + 2N_X + 3N_R$) | Delay Cost ($T_A = 1$, $T_X = 2$) | Area × Delay |
|---|---|---|---|---|---|---|---|---|
| WLMO [22] | ONB II | 16, 15 | 6975 | 6960 | 466 | 22293 | $16T_A + 144T_X = 304$ | 6777072 |
| IMO [11] | ONB II | 16, 15 | 3495 | 6960 | 466 | 18813 | $16T_A + 144T_X = 304$ | 5719152 |
| AEDS [36] | ONB II | 16, 15 | 3390 | 10230 | 466 | 25248 | $16T_A + 144T_X = 304$ | 7675392 |
| XEDS [36] | ONB II | 16, 15 | 6765 | 6855 | 466 | 21873 | $16T_A + 144T_X = 304$ | 6649392 |
| w-SMPOI [37] | ONB II | 16, 15 | 1988 | 6975 | 699 | 18035 | $32T_A + 112T_X = 256$ | 4616960 |
| w-SMPOII [37] | ONB II | 16, 15 | 3728 | 5235 | 699 | 16295 | $32T_A + 112T_X = 256$ | 4171520 |
| PISO [44] | RNB | 16, 15 | 3495 | 6975 | 467 | 18846 | $16T_A + 144T_X = 304$ | 5729184 |
| Comb Style [31] | RNB | 16, 15 | 3495 | 6990 | 700 | 19575 | $16T_A + 80T_X = 176$ | 3445200 |
| WL-RNB I | RNB | 16, 15 | 3495 | 6990 | 3962 | 29361 | $16T_A + 36T_X = 88$ | 2583768 |
| WL-RNB II | RNB | 16, 15 | 3990 | 13747 | 7457 | 53855 | $16T_A + 21T_X = 58$ | 3123590 |
| WLMO [22] | ONB II | 32, 8 | 3720 | 3712 | 466 | 12542 | $32T_A + 288T_X = 608$ | 7625536 |
| IMO [11] | ONB II | 32, 8 | 1864 | 3712 | 466 | 10686 | $32T_A + 288T_X = 608$ | 6497088 |
| AEDS [36] | ONB II | 32, 8 | 1836 | 5512 | 466 | 14258 | $32T_A + 288T_X = 608$ | 8668864 |
| XEDS [36] | ONB II | 32, 8 | 3664 | 3684 | 466 | 12430 | $32T_A + 288T_X = 608$ | 7557440 |
| w-SMPOI [37] | ONB II | 32, 8 | 1169 | 3720 | 699 | 10706 | $64T_A + 192T_X = 448$ | 4796288 |
| w-SMPOII [37] | ONB II | 32, 8 | 2097 | 2792 | 699 | 9778 | $64T_A + 192T_X = 448$ | 4380544 |
| PISO [44] | RNB | 32, 8 | 1864 | 3720 | 467 | 10705 | $32T_A + 288T_X = 608$ | 6508640 |
| Comb Style [31] | RNB | 32, 8 | 1864 | 3728 | 700 | 11420 | $32T_A + 160T_X = 352$ | 4019840 |
| WL-RNB I | RNB | 32, 8 | 1864 | 3728 | 2331 | 16313 | $32T_A + 67T_X = 166$ | 2707958 |
| WL-RNB II | RNB | 32, 8 | 3728 | 7223 | 4159 | 30651 | $32T_A + 36T_X = 104$ | 3187704 |
| WLMO [22] | ONB II | 64, 4 | 1860 | 1856 | 466 | 6970 | $64T_A + 576T_X = 1216$ | 8475520 |
| IMO [11] | ONB II | 64, 4 | 932 | 1856 | 466 | 6042 | $64T_A + 576T_X = 1216$ | 7347072 |
| AEDS [36] | ONB II | 64, 4 | 926 | 2772 | 466 | 7868 | $64T_A + 576T_X = 1216$ | 9567488 |
| XEDS [36] | ONB II | 64, 4 | 1848 | 1850 | 466 | 6946 | $64T_A + 576T_X = 1216$ | 8446336 |
| w-SMPOI [37] | ONB II | 64, 4 | 701 | 1860 | 699 | 6518 | $128T_A + 320T_X = 768$ | 5005824 |
| w-SMPOII [37] | ONB II | 64, 4 | 1165 | 1396 | 699 | 6054 | $128T_A + 320T_X = 768$ | 4649472 |
| PISO [44] | RNB | 64, 4 | 932 | 1860 | 467 | 6053 | $64T_A + 567T_X = 1198$ | 7251494 |
| Comb Style [31] | RNB | 64, 4 | 932 | 1864 | 700 | 6460 | $64T_A + 256T_X = 576$ | 3720960 |
| WL-RNB I | RNB | 64, 4 | 932 | 1864 | 1399 | 8857 | $64T_A + 130T_X = 324$ | 2869668 |
| WL-RNB II | RNB | 64, 4 | 1864 | 3495 | 2331 | 15847 | $64T_A + 67T_X = 198$ | 3137706 |
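The WL-RNB I figures quoted above can be recomputed from the complexity expressions of Table 6.1 and the cost model of Table 6.2. The following Python sketch does so for m = 233; the function and constant names are illustrative, and the Comb Style area-delay products are copied from Table 6.2 rather than recomputed.

```python
T_A, T_X = 1, 2  # gate delays assumed for Table 6.2

# Area-delay products of Comb Style [31], copied from Table 6.2 (keyed by word size w).
COMB_STYLE_AREA_DELAY = {16: 3445200, 32: 4019840, 64: 3720960}


def ceil_log2(x: int) -> int:
    return (x - 1).bit_length()


def wl_rnb1_costs(m: int, w: int):
    """Area cost, delay cost and area-delay product of WL-RNB I."""
    k = -(-m // w)                                   # k = ceil(m / w)
    n_and, n_xor, n_reg = k * m, 2 * k * m, (k + 2) * m + 1
    area = n_and + 2 * n_xor + 3 * n_reg             # N_A + 2 N_X + 3 N_R
    delay = w * (T_A + 2 * T_X) + ceil_log2(k) * T_X # w T_cp + ceil(log2 k) T_X
    return area, delay, area * delay


for w in (16, 32, 64):
    area, delay, product = wl_rnb1_costs(233, w)
    ratio = product / COMB_STYLE_AREA_DELAY[w]
    print(f"w={w}: area={area}, delay={delay}, area x delay={product}, "
          f"vs Comb Style: {ratio:.0%}")
```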

6.6 Conclusions

Two high speed word-level finite field multipliers in $GF(2^m)$ using reordered normal basis have been presented. Architectural complexity comparison and numerical examples show that the new architectures are faster than other similar proposals using either NB or RNB for the same class of fields. One unique feature of the proposed architectures is that the critical path delay of the multiplier is not a function of the field size or the word size, unlike the previously proposed word-level multipliers. The high speed of the multipliers makes them suitable for high speed public-key cryptography applications where fast finite field multipliers in large fields are needed.


Chapter 7
High Speed VLSI Implementation of a
Multiplier Using Redundant
Representation

7.1 Introduction

Finite field arithmetic has important applications in number theory, algebraic geometry, and
cryptography, particularly in public key cryptography [25, 6]. Elliptic curve and El-Gamal
cryptosystems are two important examples of public key cryptosystems based completely
on finite field arithmetic [28, 6]. Two types of finite fields are commonly used in practice, the prime field $F_p$ and the binary field $F_{2^m}$. The binary field is an extension of the prime field $F_2$ and contains $2^m$ elements. Binary fields are attractive for high speed cryptography applications since they are suitable for hardware implementation [28, 18].
The efficiency of finite field multiplication depends on the choice of the basis used to represent field elements. Several bases have been proposed in the literature, including the polynomial basis, normal basis (NB), dual basis, triangular basis, and redundant representation, or redundant basis [18, 44, 37, 16].
Redundant representation is especially interesting because it not only offers an almost
free squaring operation, but also eliminates modular operation for multiplication. The main
concept for multiplication using redundant representation is to embed a field in a larger
ring and perform the multiplication in the ring [44]. Since the embedding of a field is not unique, each field element in the ring can be represented in more than one way, such that the representation contains a certain amount of redundancy.
The main drawback of the redundant representation is that it uses more bits to represent a field element, where the number of representation bits depends on the size of the cyclotomic ring. If a type I optimal normal basis (ONB) exists in $F_{2^m}$, the number of bits required for a redundant representation of a field element is m + 1.
In this chapter, a new VLSI implementation for a finite field multiplier using redundant basis is proposed. Simulation of the final post place-and-route layout shows that the
multiplier can be clocked up to 1.82 GHz, which is 143% faster than the static CMOS implementation of the same design with a Virtual Silicon standard cell logic library. Also,
the proposed design occupies nearly 16% less silicon area compared to the static CMOS
implementation.
Improvements in the new VLSI implementation come from the fact that the selected multiplier has a very regular architecture and can be implemented completely with multiple copies of one simple building block. This block, the x-module, is made of one XOR gate, one AND gate, and a flip-flop. The AND-XOR function is realized with a domino logic cell, and the flip-flop is a negative-edge-triggered D flip-flop with zero hold time, properties which are required to properly latch the output data of the domino logic cell.
In our VLSI implementation, we have selected an input size of 197 bits, which is in the practical range for cryptography applications [18, 6]. The proposed multiplier design is intended to be used inside an elliptic curve processor; consequently, it was designed as a macro module without any I/O pads.
The organization of the rest of the chapter is as follows: Section 2 is a brief review of
redundant basis representation, multiplication and multiplier architectures. In section 3, the
design and implementation of the x-module is discussed. Section 4 presents the implementation of a large size multiplier using the x-module as the main building block. Section 5
examines the static CMOS implementation of the multiplier. In section 6, comparisons between different VLSI implementations are presented. A few concluding remarks are given
in section 7.

7.2 A Brief Review of Redundant Basis and its Arithmetic in $F_{2^m}$
7.2.1 Redundant Basis for $F_{2^m}$
Let K be a field, and let $f(x) \in K[x]$ be a polynomial defined over K. Then the smallest field that contains K and all the roots of f(x) is called the splitting field of the polynomial f(x). The splitting field of $x^n - 1$ is called the nth cyclotomic field, denoted by $K^{(n)}$. Let $\beta$ be a primitive nth root of unity. Then $K^{(n)}$ is generated by $\beta$ over K, and elements in $K^{(n)}$ can be represented in the form

$$A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \quad a_i \in K.$$

Thus the set $[1, \beta, \beta^2, \ldots, \beta^{n-1}]$ can be viewed as a basis for $K^{(n)}$ [13, 14]. Since $1 + \beta + \beta^2 + \cdots + \beta^{n-1} = 0$, the representation of A is not unique. For example, the two n-tuples $(a_0, a_1, \ldots, a_{n-1})$ and $(1 - a_0, 1 - a_1, \ldots, 1 - a_{n-1})$ represent the same element A. So, the basis $[1, \beta, \beta^2, \ldots, \beta^{n-1}]$ is called a redundant basis for any subfield of $K^{(n)}$. Note that the
elements in the redundant basis form a cyclic group of order n and

1

i = n — 1.

7.2.2 Redundant Basis Multiplication in $F_{2^m}$
Consider the redundant basis in $F_{2^m}$ over $F_2$:

$$I = [1, \beta, \beta^2, \ldots, \beta^{n-1}].$$

Let field elements $A, B \in F_{2^m}$ be represented with respect to I:

$$A = a_0 + a_1\beta + a_2\beta^2 + \cdots + a_{n-1}\beta^{n-1}, \qquad B = b_0 + b_1\beta + b_2\beta^2 + \cdots + b_{n-1}\beta^{n-1},$$

where $a_i, b_i \in F_2$, $i = 0, 1, \ldots, n - 1$. Note that $n \ge m + 1$ and the sets of coefficients $\{a_i\}$ and $\{b_i\}$ are not unique. Also note that $\beta^n = 1$. Then the multiplication operation using the redundant basis I can be given by

$$\begin{aligned}
\beta^i \cdot B &= b_0\beta^i + b_1\beta^{i+1} + \cdots + b_{n-i} + \cdots + b_{n-1}\beta^{i-1} \\
&= b_{n-i} + b_{n-i+1}\beta + \cdots + b_0\beta^i + \cdots + b_{n-i-1}\beta^{n-1} \\
&= \sum_{j=0}^{n-1} b_{(j-i)}\beta^j,
\end{aligned}$$

where $(j - i)$ denotes that $j - i$ is to be reduced modulo n. Then the product of field elements A and B can be given by

$$\begin{aligned}
A \cdot B &= \sum_{i=0}^{n-1} a_i(\beta^i \cdot B) \\
&= \sum_{i=0}^{n-1} a_i \sum_{j=0}^{n-1} b_{(j-i)}\beta^j \\
&= \sum_{j=0}^{n-1} \left( \sum_{i=0}^{n-1} a_i b_{(j-i)} \right) \beta^j.
\end{aligned} \tag{7.1}$$

If we define $A \cdot B = C = \sum_{j=0}^{n-1} c_j\beta^j$, then $c_j$ can be given by

$$c_j = \sum_{i=0}^{n-1} a_i b_{(j-i)}, \quad j = 0, 1, \ldots, n - 1. \tag{7.2}$$
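Equation (7.2) is a cyclic convolution of the coefficient vectors over $F_2$. The Python sketch below evaluates it directly for n = 5 (the case m = 4) and also illustrates the redundancy of the representation: replacing the coordinates of A by their complements yields a product that is either identical or the complement, i.e. the same field element. The function names are illustrative and not taken from the thesis.

```python
def rb_multiply(a, b):
    """c_j = sum_i a_i * b_((j - i) mod n) over F_2, as in equation (7.2)."""
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length n")
    c = []
    for j in range(n):
        acc = 0
        for i in range(n):
            acc ^= a[i] & b[(j - i) % n]   # AND is the F_2 product, XOR the F_2 sum
        c.append(acc)
    return c


def complement(v):
    """The other redundant representation (1 - v_0, ..., 1 - v_{n-1})."""
    return [1 - x for x in v]


def same_field_element(u, v):
    """Two n-tuples represent the same element if equal or mutually complementary."""
    return u == v or u == complement(v)


a = [1, 0, 1, 1, 0]
b = [0, 1, 1, 0, 1]
c = rb_multiply(a, b)
c_alt = rb_multiply(complement(a), b)
print(c, c_alt, same_field_element(c, c_alt))
```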

7.2.3 Redundant Basis and Normal Basis
Normal basis is the most popular basis for hardware implementations of finite field arithmetic since the squaring operation in normal basis is simply a cyclic shift of the coordinates
of the elements [25]. The complexity of multiplication under normal basis is minimized for
two subclasses of normal basis referred to as optimal normal basis type I and II, which is
the main reason that they are often used in practice to implement cryptosystems [30],[33].
For the class of fields where there exists a type I optimal normal basis, redundant basis
elements (except the element '1') are a permutation of the optimal normal basis elements
[44]. Due to this property, an (m + l)-bit redundant basis multiplier can be employed as
an m-bit optimal normal basis type I multiplier. This property was used to select the size
of the multiplier, so that it can perform operations over optimal normal basis type I.
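The permutation property can be checked numerically. The sketch below assumes the standard description of a type I optimal normal basis as $\{\beta^{2^i}, i = 0, \ldots, m - 1\}$ with $\beta$ a primitive (m + 1)th root of unity, and lists the exponents $2^i \bmod (m + 1)$; the basis is a permutation of $\{\beta, \beta^2, \ldots, \beta^m\}$ exactly when these exponents cover 1, ..., m. The function names are illustrative.

```python
def type1_onb_exponents(m: int):
    """Exponents e_i = 2**i mod (m + 1) of the normal basis elements beta**(2**i)."""
    n = m + 1
    return [pow(2, i, n) for i in range(m)]


def is_permutation_of_redundant_basis(m: int) -> bool:
    """True if the exponents are exactly 1, ..., m in some order."""
    return sorted(type1_onb_exponents(m)) == list(range(1, m + 1))


print(type1_onb_exponents(4))                  # [1, 2, 4, 3]
print(is_permutation_of_redundant_basis(196))  # field size used for the 197-bit multiplier
```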

7.2.4 Multiplier Architecture in Redundant Basis

Different architectures have been proposed for multiplication in redundant basis [44],[32],
all of which realize (7.2). In this work, we are mainly interested in an architecture recently
proposed in [32], which is shown in Fig. 7.1.
Figure 7.1: High Speed Serial Multiplier in Redundant Basis
In their architecture, all bits of the operand A should be held constant throughout the
multiplication operation, while operand B is available in bit-serial fashion. The contents
of the n flip-flops should be initialized to zero, while the n output bits of the multiplier can
be read from the flip-flops after n clock cycles.
The main advantage of this architecture is its small critical path delay, which is equal to
the delay of one XOR gate in addition to the delay of one AND gate. The high regularity of
the architecture's structure is another major advantage. It has been proven that the architecture shown in Fig.7.1 has the smallest critical path delay, and the smallest area compared
to all other similar multipliers [32].
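A behavioural model helps to see why n clock cycles of AND-XOR updates realize (7.2). The Python sketch below is one possible reading of the architecture, not a transcription of [32]: it assumes that the flip-flops form a ring, that flip-flop j is loaded with the output of flip-flop j - 1 XORed with $a_j \cdot b_{in}$, and that the bits of B are fed in the order $b_{n-1}, \ldots, b_0$. Under these assumptions the register contents obey $R \leftarrow \beta R + b_{in} A$ on each clock. The function names are illustrative.

```python
import itertools


def serial_rb_multiply(a, b):
    """Clock-by-clock model of a bit-serial redundant basis multiplier."""
    n = len(a)
    regs = [0] * n                      # flip-flops initialised to zero
    for t in range(n - 1, -1, -1):      # assumed feed order: b_{n-1} first, b_0 last
        b_in = b[t]
        # All cells update simultaneously: one AND followed by one XOR per cell.
        regs = [regs[(j - 1) % n] ^ (a[j] & b_in) for j in range(n)]
    return regs


def direct_rb_multiply(a, b):
    """Reference result computed directly from equation (7.2)."""
    n = len(a)
    out = []
    for j in range(n):
        acc = 0
        for i in range(n):
            acc ^= a[i] & b[(j - i) % n]
        out.append(acc)
    return out


# Exhaustive comparison for n = 5.
vectors = [list(v) for v in itertools.product((0, 1), repeat=5)]
print(all(serial_rb_multiply(a, b) == direct_rb_multiply(a, b)
          for a in vectors for b in vectors))
```

Each update touches a single AND and a single XOR per cell, which is consistent with the critical path of $T_A + T_X$ stated above.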
Figure 7.2: (a) x-module Block Detail (b) Multiplier Composed of x-module Blocks

The aforementioned multiplier can be modeled as a series of connected blocks, each composed of one XOR gate, one AND gate and a flip-flop. We refer to this block as the x-module. The detailed view of the x-module block, and the multiplier composed of a series of x-module blocks, is shown in Fig. 7.2.

7.3 Design of the Multiplier Main Building Block (x-module)

Reducing the delay of the AND-XOR function in the x-module will result in an increased multiplication speed. One way to do so is to implement the critical path, made of one AND gate and one XOR gate, in domino logic [40]. However, careful considerations must be taken into account when designing such circuits [39].
The schematic of the implemented domino AND-XOR function is shown in Fig. 7.3. As shown, the design is quite simple, consisting of 18 transistors. Transistor P1 acts as the pull-up network, charging the node Q during the precharge phase. Transistors N2-8 form the pull-down network responsible for discharging node Q when the appropriate combinations of inputs exist. N2-8 connect to the evaluate transistor, N9, which opens a path to ground during the evaluate phase. Transistor P3 is the keeper, reducing the charge leakage effect at node Q.
Transistors P0 and N0 create the output NOT stage, providing the output current drawn from the module. Three NOT gates also exist in the module (not shown in the figure), which generate the complements of the module inputs.
The final layout for the domino logic block is shown in Fig. 7.4. The height of the layout was set to be the same as the height of the standard cells of the library from which the D flip-flop was selected: 16.50 μm. The total area of the x-module layout was measured to be 108.24 μm².
The flip-flop used in the design of the x-module was selected from the Virtual Silicon CMOS library. Care was taken to select an appropriate flip-flop to interface with our domino logic cell. A negative-edge-triggered flip-flop was needed to maximize the time available for the domino cell to evaluate. Furthermore, it was required to have a hold time less than or equal to zero, as the domino cell's output becomes invalid immediately after the falling edge of the clock.

Figure 7.3: AND-XOR Function in Domino Logic

Figure 7.4: Layout for the AND-XOR Function in Domino Logic

7.4 Design of a Practical Size Multiplier Using the x-module
For our multiplier, we have selected an input size of 197 bits for two main reasons. First,
the size is in the practical range for cryptography applications [18]. Second, there exists
an optimal normal basis type I for the field size of 196, for which our multiplier would be
applicable [33].
The design has been carried out with the Virtuoso Layout Editor and the Cadence Schematic Composer. The design process began by replicating 196 x-module blocks and connecting them serially. 14 blocks were used in each row, which set the total number of rows needed to 14. One extra x-module block was placed along the side, bringing the total to 197. Also, 14 buffers were selected from the standard CMOS library and carefully connected to generate the clock tree for the multiplier. Input B was also connected to tree structures of appropriately sized buffers. Other inputs were also buffered to enable high performance while complying with loading requirements. The final layout for the proposed design is shown in Fig. 7.5; its size was measured to be 481.97 μm × 125.49 μm.

Figure 7.5: Proposed 197 Bit Multiplier Layout
The final layout of the multiplier, including all parasitic capacitances, was simulated with Cadence's Analog Environment using Spectre. The circuit performed correctly up to a clock rate of 1.82 GHz. Simulation waveforms for the clock frequency of 1.72 GHz are shown in Fig. 7.6. In this figure, the first two rows represent the buffered inputs a and b to the x-module, and the third row represents the third input, c, which is connected to the previous x-module's output. The fourth and fifth rows are the voltages at nodes Q and P of the x-module (shown in Fig. 7.3), while the sixth row indicates the output of the D flip-flop, which is the output of the x-module. Finally, the last row shows the clock output of the clock tree as it enters the x-module.

Figure 7.6: Post Place-and-Route Simulation Result of the Proposed 197 Bit Multiplier,
from top to bottom: input A, input B, input C, node Q, node R, x-module out, clock

7.5 Design of Practical Size Multiplier Using static CMOS
We began the static CMOS design process by writing the parametrized C code to generate
the VHDL code describing the multiplier hardware. Using this method, different size multipliers are easily generated by changing parameters in the C code. The generated VHDL
code was synthesized afterwards to a gate-level netlist using Synopsys' Design Compiler.
Next, the gate-level netlist was used for partitioning, placement and routing the multiplier
module using Cadence Encounter; the clock tree was also generated using Encounter's
Clock Synthesizer.
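The idea of generating HDL from a parametrized program can be illustrated with a short script. The sketch below is written in Python rather than C and is not the generator used in this work; the entity name, the port names (clk, rst, b_in, a, c) and the rising-edge register update are hypothetical choices made only for illustration, and the emitted VHDL follows the same assumed cell update as a ring of AND-XOR-register cells.

```python
def generate_rb_multiplier_vhdl(n: int) -> str:
    """Return VHDL text for an n-bit serial multiplier built from AND-XOR-register cells."""
    if n < 2:
        raise ValueError("n must be at least 2")
    top = n - 1
    return f"""library ieee;
use ieee.std_logic_1164.all;

entity rb_multiplier_{n} is
  port (
    clk, rst, b_in : in  std_logic;
    a              : in  std_logic_vector({top} downto 0);
    c              : out std_logic_vector({top} downto 0)
  );
end entity rb_multiplier_{n};

architecture rtl of rb_multiplier_{n} is
  signal r : std_logic_vector({top} downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        r <= (others => '0');
      else
        r(0) <= r({top}) xor (a(0) and b_in);
        for j in 1 to {top} loop
          r(j) <= r(j - 1) xor (a(j) and b_in);
        end loop;
      end if;
    end if;
  end process;
  c <= r;
end architecture rtl;
"""


vhdl = generate_rb_multiplier_vhdl(197)
print(vhdl.splitlines()[3])   # the entity declaration line
```

Changing the argument of the generator is enough to obtain a multiplier of a different size, which is the benefit of the parametrized approach described above.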

The post place-and-route area of the multiplier was 72083.85 μm², and the multiplier could be clocked up to 748 MHz. The total number of standard cells used was 987, while achieving a maximum gate density of 80%. The final layout for the static CMOS multiplier is shown in Fig. 7.7.

Figure 7.7: Static CMOS 197 Bit Multiplier Layout

7.6 Different VLSI implementation Comparisons
A comparison between the two VLSI implementations is shown in Table 7.1. The first row of the table presents the static CMOS implementation from Section 7.5, and the second row represents the proposed design using our x-module. As shown, the clock rate increase is 143%, while the area reduction is 16% for the proposed design compared to static CMOS.

Table 7.1: Complexity Comparison between Two VLSI Implementations for a 197 Bit Multiplier

| Architecture | Area | Clock Frequency |
|---|---|---|
| Static CMOS | 72083.85 μm² | 748 MHz |
| Proposed design | 60482.41 μm² | 1.82 GHz |

7.7 Conclusions

A new VLSI implementation for a 197 bit finite field multiplier was presented. The proposed design employs a main building block designed in domino logic. The speed improvement was measured to be 143% in comparison to static CMOS implementation, while area
reduction was 16%. The post place-and-route design was successfully simulated up to a
clock rate of 1.82 GHz. Our proposed design has applications in public key cryptography,
especially elliptic curve cryptosystems where high speed, large size multipliers are needed.


Chapter 8
High Speed Implementation of a SIPO
Multiplier Using Reordered Normal Basis

8.1 Introduction
Polynomial basis is the most widely used basis for hardware and software implementation.
Normal basis, on the other hand, is more suitable for hardware implementation due to the
simplicity of the squaring operation. In normal basis, the squaring operation is achieved
via circular-shifting the element coefficients, which can be implemented in hardware at a
very small cost. This leads to a low-cost, high-speed squaring operation that can be used to accelerate the exponentiation operation by repeated squaring and multiplication via Fermat's little theorem [14], [21].
There exist two sub-classes of normal basis for which the complexity of multiplication is minimized. These two classes are referred to as optimal normal basis (ONB) type I and II, which are particularly attractive for cryptography applications [30]. Many different architectures have been proposed for multiplication using these two classes of bases, such as those in [37, 15, 9, 36]. A reordered normal basis is a certain permutation of a type II optimal normal basis [44, 12].
Not all field sizes are suitable for cryptography applications. According to different
standards, different field sizes have been selected based on their suitability for cryptography
applications. In this work we have selected the field size of 233, which is recommended in the NIST digital signature standard for elliptic curves [34]. According to the IEEE standard for public key cryptography, a type II optimal normal basis also exists for this field size [33].
In this work, a new VLSI implementation for a Serial-In Parallel-Out finite field multiplier using a reordered normal basis is presented. It is shown that the new implementation
operates at a much higher speed than a static CMOS implementation of the same architecture, while significantly reducing the area. This performance advantage is the result of
implementing the design as a series connection of a simple block designed and optimized
in domino logic, which has a smaller delay/area compared to the equivalent static CMOS
realization.
The organization of this chapter is as follows: Reordered normal basis and multiplication using this basis are briefly reviewed in Section 2. In Section 3, the design and implementation of the xax-module, which is the main building block of the multiplier, is discussed. Section 4 presents the implementation of a 233-bit multiplier using the xax-module as the main building block. Section 5 examines the static CMOS implementation of the same multiplier. In Section 6, comparisons between different VLSI implementations are presented. A few concluding remarks are given in Section 7.


8.2 A Brief Review of Reordered Normal Basis and Its Arithmetic in $F_{2^m}$
8.2.1 Reordered Normal Basis Definition
Theorem 8.2.1. [12] Let $\beta$ be a primitive $(2m + 1)$st root of unity in $F_{2^{2m}}$ ($\beta^{2m+1} = 1$) and let $\gamma = \beta + \beta^{-1}$ generate a type II optimal normal basis. Then $\{\gamma_i, i = 1, 2, \ldots, m\}$ with $\gamma_i = \beta^i + \beta^{-i} = \beta^i + \beta^{2m+1-i}$, $i = 1, 2, \ldots, m$, is also a basis in $F_{2^m}$.
It has been shown that the basis $\{\gamma_i, i = 1, 2, \ldots, m\}$ is actually a permutation of the normal basis $\{\gamma^{2^i}, i = 0, 1, \ldots, m - 1\}$ [44]. We denote the basis $I_2 = [\gamma_1, \gamma_2, \ldots, \gamma_m]$ as the reordered normal basis following [44]. Note that the reordered normal basis not only offers free squaring but also avoids the modulo reduction step in a multiplication operation.
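The permutation between the two bases can be made explicit. Since $\gamma^{2^i} = \beta^{2^i} + \beta^{-2^i}$, the normal basis element $\gamma^{2^i}$ equals $\gamma_k$ where k is $2^i$ folded into the range 1, ..., m modulo 2m + 1. The Python sketch below computes this index map; the function names are illustrative, and the folding helper mirrors the function s(i) used for multiplication in this basis.

```python
def fold(i: int, m: int) -> int:
    """Fold an integer into 0..m modulo 2m + 1, identifying r with (2m + 1) - r."""
    n = 2 * m + 1
    r = i % n
    return r if r <= m else n - r


def rnb_permutation(m: int):
    """Index k with gamma**(2**i) = gamma_k, for i = 0, ..., m - 1."""
    return [fold(pow(2, i, 2 * m + 1), m) for i in range(m)]


print(rnb_permutation(5))                                     # [1, 2, 4, 3, 5]
print(sorted(rnb_permutation(233)) == list(range(1, 234)))    # field size used in this chapter
```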

8.2.2 Reordered Normal Basis Multiplication

Assume that A and B are two arbitrary elements in $F_{2^m}$ represented with respect to the reordered normal basis $I = [\gamma_1, \gamma_2, \ldots, \gamma_m]$ and $C = A \cdot B$,

$$A = \sum_{i=1}^{m} a_i\gamma_i, \quad B = \sum_{i=1}^{m} b_i\gamma_i, \quad C = \sum_{i=1}^{m} c_i\gamma_i.$$

To facilitate multiplication, a function s(i) has been defined, mapping the set of integers to the set $\{0, 1, \ldots, 2m + 1\}$ [44]:

$$s(i) \triangleq \begin{cases} i \bmod (2m + 1), & \text{if } 0 \le i \bmod (2m + 1) \le m, \\ (2m + 1) - i \bmod (2m + 1), & \text{otherwise.} \end{cases}$$

Next compute $\gamma_j A$, where $1 \le j \le m$:

$$\gamma_j A = \gamma_j \sum_{i=1}^{m} a_i\gamma_i = \sum_{i=1}^{m} a_i[\gamma_{s(i+j)} + \gamma_{s(i-j)}]. \tag{8.1}$$

And also note that [44]

$$\sum_{i=1}^{m} a_i[\gamma_{s(i+j)} + \gamma_{s(i-j)}] = \sum_{i=1}^{m} [a_{s(i+j)} + a_{s(i-j)}]\gamma_i.$$

Then

$$\begin{aligned}
C &= A\sum_{j=1}^{m} b_j\gamma_j = \sum_{j=1}^{m} b_j(\gamma_j A) \\
&= \sum_{j=1}^{m} b_j \sum_{i=1}^{m} a_i[\gamma_{s(i+j)} + \gamma_{s(i-j)}] \\
&= \sum_{j=1}^{m} b_j \sum_{i=1}^{m} [a_{s(i+j)} + a_{s(i-j)}]\gamma_i \\
&= \sum_{j=1}^{m}\sum_{i=1}^{m} b_j[a_{s(i+j)} + a_{s(i-j)}]\gamma_i.
\end{aligned} \tag{8.2}$$
Note that it was assumed that $a_0 = 0$ [44]. The value for $c_i$ can be calculated as follows:

$$c_i = \sum_{j=1}^{m} b_j \left[a_{s(i+j)} + a_{s(j-i)}\right] = \sum_{j=1}^{m} a_j \left[b_{s(i+j)} + b_{s(j-i)}\right], \quad i = 1, 2, \ldots, m. \tag{8.3}$$
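The coefficient formula for $c_i$ can be checked in software. The following Python sketch is a bit-level reference model only, not the hardware architecture discussed later; the function names `s` and `rnb_multiply` and the list-based coefficient layout are assumptions made for illustration. It relies on the convention $a_0 = 0$ and on the fact that $s(j-i) = s(i-j)$.

```python
def s(i, m):
    """Index map s(i): fold i mod (2m+1) into the range 0..m."""
    r = i % (2 * m + 1)  # Python's % is non-negative for a positive modulus
    return r if r <= m else 2 * m + 1 - r


def rnb_multiply(a, b):
    """Multiply two elements given by their reordered normal basis coefficients.

    a and b are lists [x_1, ..., x_m] of bits. The result [c_1, ..., c_m] is
    c_i = sum_j b_j * (a_{s(i+j)} + a_{s(j-i)}) over GF(2), with a_0 = 0.
    """
    m = len(a)
    if len(b) != m:
        raise ValueError("operands must have the same length")
    a_ext = [0] + list(a)  # a_ext[0] is a_0 = 0, a_ext[k] is a_k
    c = []
    for i in range(1, m + 1):
        acc = 0
        for j in range(1, m + 1):
            acc ^= b[j - 1] & (a_ext[s(i + j, m)] ^ a_ext[s(j - i, m)])
        c.append(acc)
    return c


if __name__ == "__main__":
    # m = 5 is a small size for which a type II ONB exists (2m + 1 = 11).
    a = [1, 0, 1, 1, 0]
    one = [1] * 5  # the sum of all basis elements represents the element 1
    print(rnb_multiply(a, one))  # prints [1, 0, 1, 1, 0]
```

Because the basis elements sum to 1, multiplying by the all-ones vector returns the other operand unchanged, which gives a quick sanity check of the index map.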
8.2.3 A Review of Existing Architectures for ONB Type II Multiplication

As mentioned in the previous section, the ONB type II and reordered normal basis representations of an element are simple permutations of each other. Therefore, reordered normal basis multipliers can be used as optimal normal basis type II multipliers and vice versa. Many different architectures have been proposed for multiplication in normal basis. The complexities of the multipliers available in the open literature are listed in Table 8.1.

| Multiplier | #AND | #XOR | #Flip-flops | #Clock Cycles | Critical Path Delay | Basis |
|---|---|---|---|---|---|---|
| MO [22] | $2m-1$ | $2m-2$ | $2m$ | $m$ | $T_A + \lceil \log_2(2m-1) \rceil T_X$ | Normal |
| IMO [11] | $m$ | $2m-2$ | $2m$ | $m$ | $T_A + (1 + \lceil \log_2 m \rceil) T_X$ | Normal |
| GG [15] | $m$ | $\frac{3m-1}{2}$ | $3m$ | $m$ | $T_A + 3T_X$ | Normal |
| Feng [9] | $2m-1$ | $3m-2$ | $3m-2$ | $m$ | $T_A + 4T_X$ | Normal |
| Agnew [1] | $m$ | $2m-1$ | $3m$ | $m$ | $T_A + 2T_X$ | Normal |
| XEDS [36] | $2m-1$ | $2m-2$ | $2m$ | $m$ | $T_A + \lceil \log_2(2m-1) \rceil T_X$ | Normal |
| AEDS [36] | $m$ | $3m-3$ | $2m$ | $m$ | $T_A + \lceil \log_2(2m-1) \rceil T_X$ | Normal |
| w-SMPO I [37] | $\lfloor m/2 \rfloor + 1$ | $3m$ | $3m$ | $m$ | $T_A + 3T_X$ | Normal |
| w-SMPO II [37] | $m$ | $m + \lfloor m/2 \rfloor$ | $3m$ | $m$ | $T_A + 3T_X$ | Normal |
| PISO [44] | $m$ | $2m-1$ | $2m+1$ | $m$ | $T_A + (1 + \lceil \log_2 m \rceil) T_X$ | Reordered |
| SIPO [44] | $m$ | $2m$ | $3m+1$ | $m$ | $T_A + 2T_X$ | Reordered |

Table 8.1: Complexity Comparison Between Type II ONB / Reordered Normal Basis Multipliers
In this table the first row represents the well-known Massey-Omura normal basis multiplier, and the second row represents the improved version proposed by Gao and Sobelman. The third row shows the architecture proposed by Geiselmann and Gollmann, which exploits the symmetry property of the normal basis. The fourth and fifth rows list the architectures proposed by Feng and Agnew, respectively. The next two rows represent the XOR Efficient Digit Serial (XEDS) and AND Efficient Digit Serial (AEDS) multipliers of Reyhani-Masoleh and Hasan, while the two rows after them show the Sequential Multipliers with Parallel Output (SMPO) type I and type II, proposed by the same authors. The last two rows list the Parallel-In Serial-Out (PISO) and Serial-In Parallel-Out (SIPO) reordered normal basis multipliers recently proposed by Wu et al. As can be seen from the table, the Agnew architecture (fifth row) and the SIPO architecture (last row) have the smallest critical path delay, $T_A + 2T_X$, of all the architectures.
In this work, we are mainly interested in the SIPO architecture proposed in [44], for two reasons. First, its critical path delay is the smallest among all the architectures, matched only by the Agnew architecture, which has similar complexity. Second, the architecture has a very regular structure, which greatly simplifies the VLSI implementation.
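To make the critical path comparison concrete, the following Python sketch evaluates the number of XOR gate delays ($T_X$) on the critical path for several of the architectures in Table 8.1 at the field size $m = 233$ used later in this chapter. Every architecture listed also has one AND gate delay ($T_A$). The function and dictionary names are assumptions made for illustration, and only rows whose delay expressions are unambiguous in the table are included.

```python
def ceil_log2(n):
    """Return ceil(log2(n)) for a positive integer n using integer arithmetic."""
    return (n - 1).bit_length()


def xor_levels(m):
    """XOR gate delays on the critical path, per the Table 8.1 expressions."""
    return {
        "MO": ceil_log2(2 * m - 1),
        "IMO": 1 + ceil_log2(m),
        "GG": 3,
        "Agnew": 2,
        "PISO": 1 + ceil_log2(m),
        "SIPO": 2,
    }


if __name__ == "__main__":
    for name, levels in xor_levels(233).items():
        print(f"{name}: T_A + {levels} T_X")
```

For $m = 233$ the logarithmic-depth designs need nine XOR delays, whereas the Agnew and SIPO designs need two regardless of the field size.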


8.3 Design of A Practical Size Multiplier Using xax-module
8.3.1 Multiplier Size Selection
According to [6], to provide a sufficient level of security, the field size is required to be at least 160 bits for an elliptic curve cryptosystem. Some fields have been recommended by different standards for use, while others were banned. In this work, we have selected the field size of 233 for three reasons. First, it is in the range suitable for elliptic curve cryptography. Second, a type II ONB exists for this field size, meaning that the reordered normal basis also exists [33]. Finally, the field size is recommended by the National Institute of Standards and Technology (NIST) in its Digital Signature Standard (DSS) for the Elliptic Curve Digital Signature Algorithm (ECDSA). A few other field sizes are also recommended by the standard, but 233 is the only one for which a type II ONB exists.
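The claim about the recommended field sizes can be checked with a short script. The sketch below uses the standard existence criterion for a type II ONB in $\mathbb{F}_{2^m}$, which is not stated in this chapter: $p = 2m+1$ must be prime, and either 2 is a primitive root modulo $p$, or $p \equiv 3 \pmod 4$ and 2 has multiplicative order $m$ modulo $p$. The list of binary field sizes (163, 233, 283, 409, 571) and all function names are assumptions made for illustration.

```python
def is_prime(n):
    """Trial-division primality test, adequate for the small values used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def order_of_two(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    k, x = 1, 2 % p
    while x != 1:
        x = (x * 2) % p
        k += 1
    return k


def has_type2_onb(m):
    """Check the type II ONB existence criterion for the field size m."""
    p = 2 * m + 1
    if not is_prime(p):
        return False
    k = order_of_two(p)
    return k == 2 * m or (p % 4 == 3 and k == m)


if __name__ == "__main__":
    for m in (163, 233, 283, 409, 571):
        print(m, has_type2_onb(m))
```

Among these five sizes only $m = 233$ passes, since $2m+1$ is composite for the other four.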

8.3.2 Selected Multiplier Architecture

Figure 8.1: Serial-In Parallel-Out Reordered Normal Basis Multiplier

Figure 8.2: xax-module and the SIPO Multiplier Composed of xax-module
The SIPO multiplier proposed in [44] is shown in Fig. 8.1. The architecture has been redrawn in Fig. 8.2 to show its regularity and to highlight the fact that it can be implemented as a serial connection of a single module, which is shown inside the box in the figure. This module, which is made of three flip-flops, two XOR gates, and an AND gate, is referred to as an xax-module. The two XOR gates, together with the AND gate, form the critical path of the multiplier. One way to reduce the critical path delay is to implement the XOR-AND-XOR function using domino logic [40]. However, care must be taken when designing such circuitry [39]. The 233-bit multiplier can be made by replicating the xax-module once for every bit of the multiplier and connecting the copies serially.

8.3.3 Design and Implementation of the xax-module
The schematic shown in Fig. 8.3 implements the XOR-AND-XOR function in domino logic; it computes the function $((b_1 \oplus b_2) \cdot a) \oplus c$. We have reduced the number of transistors, while maintaining the same functionality, by using the schematic shown in Fig. 8.4 for our implementation. The resulting design is quite simple, consisting of 17 transistors (as opposed to 21 in the original design).
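The logic function of the xax-module is small enough to enumerate exhaustively. The following Python sketch is a behavioural model of the combinational function only (it does not model the flip-flops or the domino precharge/evaluate timing); the names `xax` and `truth_table` are assumptions made for illustration.

```python
from itertools import product


def xax(a, b1, b2, c):
    """XOR-AND-XOR function of the xax-module: ((b1 xor b2) and a) xor c."""
    return ((b1 ^ b2) & a) ^ c


def truth_table():
    """Map every (a, b1, b2, c) input combination to the module output."""
    return {bits: xax(*bits) for bits in product((0, 1), repeat=4)}


if __name__ == "__main__":
    for (a, b1, b2, c), out in truth_table().items():
        print(a, b1, b2, c, "->", out)
```

The table has 16 rows, one per input combination, and can serve as the reference against which a circuit simulation of the module is compared.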
In Fig. 8.4, transistor P1 acts as the pull-up network, charging node Q during the precharge phase. Transistors N2-N13 form the pull-down network, which is responsible for discharging node Q when the appropriate combination of inputs is present. N2-N13 connect to the evaluate transistor, N1, which opens a path to ground during the evaluate phase. Transistor P2 is the keeper, which reduces the effect of charge leakage at node Q.

Transistors P0 and N0 form the output NOT stage, which supplies the output current drawn from the module. Four NOT gates (not shown in the figure) also exist in the module; they generate the complements of the xax-module's inputs.

Figure 8.3: XOR-AND-XOR Function Implementation in Domino Logic
The final layout for the domino logic block is shown in Fig. 8.5. Note that this figure is rotated 90 degrees to the left for better readability. The three large vertical stripes, from left to right, are the VDD, VSS, and VDD wires. The section on the left contains the four NOT gates used to create the complements of the inputs, while the section on the right implements the schematic shown in Fig. 8.4. The height of the layout was set to be three times the height of a standard cell: 19.962 μm, since three D flip-flops were to be connected to each domino cell. The total area of the xax-module, including the area of the three D flip-flops, was measured to be 449.18 μm². The area of the domino cell on its own is 198.86 μm².


Figure 8.4: A New XOR-AND-XOR Function Implementation in Domino Logic
The flip-flop used in the design of the xax-module was selected from the Virtual Silicon CMOS library. Care was taken to select a flip-flop suitable for interfacing with our domino-logic cell. A negative-edge-triggered flip-flop was needed to maximize the time available for the domino cell to evaluate. Furthermore, it was required to have a hold time less than or equal to zero, as the domino cell's output becomes invalid immediately after the falling edge of the clock.

8.3.4 Design and Implementation of the 233-bit Multiplier Using the xax-module

Figure 8.5: Layout for the XOR-AND-XOR Function in Domino Logic

The block diagram of the 233-bit multiplier is shown in Fig. 8.6. As can be seen, the main part of the multiplier architecture can be implemented by connecting the xax-modules serially. We have used 13 xax-modules in each row, for a total of 18 rows, to create the complete multiplier. This row/column distribution was chosen to give the design an even aspect ratio, which is typically desirable when floorplanning. One additional flip-flop, which holds $b_0$ in Fig. 8.6, also needs to be added to the design.
We have added one extra xax-module, referred to as the load-module in Fig. 8.6, which is used to load the coefficients of input B serially into the multiplier when the load signal is enabled. In order to achieve this, we have connected the a input of the load-module to the load enable signal, and the c input to the external input, Ext_input. Input b1 of the load-module was connected to the output of the previous stage, and input b2 was tied to ground. Since the xax-module implements the function $((b_1 \oplus b_2) \cdot a) \oplus c$, the output of the load-module is ((Int_input · load) ⊕ Ext_input). This can then be used to load the coefficients of operand B into the circular shift register. Table 8.2 tabulates the two combinations that can be used to load the data into the multiplier. When the load signal is "0", the output of the load-module is equal to Ext_input, and when load is "1" and Ext_input is "0", the output is equal to Int_input, so the shift register acts as a circular shift register.

Figure 8.6: Block Diagram of the 233-bit Multiplier
Table 8.2: load-module Input/Output Characteristics

| Load | Ext_input | Int_input | Output |
|---|---|---|---|
| 0 | Ext_input | X | Ext_input |
| 1 | 0 | Int_input | Int_input |
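The behaviour in Table 8.2 can be reproduced with a small behavioural model. The Python sketch below wires the xax function as described for the load-module (a = load, b1 = Int_input, b2 = 0, c = Ext_input) and places it in the feedback path of a hypothetical shift register; the register model and the names `load_module`, `clock_register`, and `load_then_rotate` are assumptions made for illustration, not a description of the actual layout.

```python
def xax(a, b1, b2, c):
    """XOR-AND-XOR function: ((b1 xor b2) and a) xor c."""
    return ((b1 ^ b2) & a) ^ c


def load_module(load, ext_input, int_input):
    """xax-module wired as the load-module: (Int_input and load) xor Ext_input."""
    return xax(a=load, b1=int_input, b2=0, c=ext_input)


def clock_register(state, load, ext_input):
    """Advance a hypothetical shift register by one clock cycle.

    The last stage feeds the load-module, whose output enters the first stage.
    """
    new_bit = load_module(load, ext_input, state[-1])
    return [new_bit] + state[:-1]


def load_then_rotate(bits, rotations):
    """Shift `bits` in serially with load = 0, then circulate with load = 1."""
    state = [0] * len(bits)
    for bit in bits:
        state = clock_register(state, load=0, ext_input=bit)
    for _ in range(rotations):
        state = clock_register(state, load=1, ext_input=0)
    return state


if __name__ == "__main__":
    print(load_then_rotate([1, 0, 1, 1], rotations=0))  # prints [1, 1, 0, 1]
    print(load_then_rotate([1, 0, 1, 1], rotations=4))  # prints [1, 1, 0, 1]
```

With load = 0 the register ignores its own feedback and takes in external bits; with load = 1 and Ext_input = 0 the contents simply circulate, returning to the starting state after as many cycles as there are stages.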

A tree of similarly sized buffers was used to generate the clock tree for the multiplier's clock signal. The same was done for the input a, since it is a high-fan-out net that connects to every xax-module in the design. The full layout of the multiplier is shown in Fig. 8.7.

Figure 8.7: 233-bit Proposed Multiplier Layout
The final layout of the multiplier, including all parasitic capacitances, was simulated in Cadence's Analog Environment using Spectre. The circuit performed correctly up to a clock rate of 1.587 GHz. Simulation voltage waveforms for a clock frequency of 1.54 GHz are shown in Fig. 8.8.
In this figure, the first row is the buffered input a. Rows two, three, and four show the signals b1, b2, and c as they exit the xax-module's flip-flops and enter the inputs of the XOR-AND-XOR function. Row five is the output of the xax-module (node R), while row six is the voltage at node Q in Fig. 8.4. Finally, row seven shows the multiplier's clock waveform as it exits the buffer tree and enters the xax-modules. All 16 possible input combinations of a, b1, b2, and c were tested and verified to give the correct output when determining the maximum operating frequency.

Figure 8.8: Post Place-and-Route Simulation Result of the Proposed 233-bit Multiplier; from top to bottom: input a, input b1, input b2, input c, node R, node Q, clock

8.4 Design of the 233-bit Multiplier Using Static CMOS
The static CMOS multiplier implements the same functionality as the domino logic design. Like the domino logic design, the static CMOS version incorporates a load-module (implemented in static CMOS) to serially load an external input into the multiplier.

We began the static CMOS design process by writing parametrized C code that generates the VHDL code describing the multiplier in hardware. Using this method, multipliers of different sizes are easily generated by changing parameters in the C code. The VHDL code was simulated using Cadence's NCSim to confirm that the architectural code was functioning correctly. Afterwards, the VHDL code was synthesized to a gate-level netlist using Synopsys' Design Compiler. Compilation parameters were always chosen to maximize the operating frequency of our design; the critical path delay at this stage was 1.11 ns.
Next, the generated gate-level netlist was simulated again using Cadence's NCSim to confirm that the functionality did not change during the synthesis stage. The verified gate-level netlist was then used for partitioning, placement, and routing in Cadence's Encounter; the clock tree was also generated using Encounter's Clock Synthesizer. The worst negative slack calculated by Encounter after the place-and-route steps was reported to be 0.145 ns, bringing the total critical path delay to 1.255 ns.

The post place-and-route area of the multiplier was 216737.136 μm², and the multiplier could be clocked at up to 796 MHz. A total of 1965 standard cells were used, with a maximum gate density of 80%. The final layout for the static CMOS multiplier is shown in Fig. 8.9.
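The reported timing figures are mutually consistent, as the following short calculation shows. The symbols $t_{\text{synth}}$, $t_{\text{slack}}$, $t_{\text{crit}}$, and $f_{\max}$ are introduced here only for illustration.

```latex
t_{\text{crit}} = t_{\text{synth}} + t_{\text{slack}}
               = 1.11\,\text{ns} + 0.145\,\text{ns} = 1.255\,\text{ns},
\qquad
f_{\max} = \frac{1}{t_{\text{crit}}} = \frac{1}{1.255\,\text{ns}} \approx 796.8\,\text{MHz}.
```

This agrees with the maximum clock rate of 796 MHz quoted for the static CMOS design.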

8.5 A Comparison of Different VLSI Implementations
A comparison between the two VLSI implementations is shown in Table 8.3. The first row of the table presents the static CMOS implementation from Section 8.4, and the second row represents the proposed design using our xax-module. As shown, the proposed design achieves a 99% higher clock rate and a 49% smaller area than the static CMOS design. If we define area times delay as a performance metric, we can conclude that our proposed design is almost four times more efficient than the static CMOS design.
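The percentages and the area-delay figure follow directly from the area and clock frequency values reported for the two designs. The Python sketch below reproduces them; the constant and function names are assumptions made for illustration, and delay is taken as the reciprocal of the maximum clock frequency.

```python
STATIC_AREA_UM2 = 216737.136   # static CMOS area
STATIC_CLOCK_MHZ = 796.0       # static CMOS maximum clock rate
DOMINO_AREA_UM2 = 109644.819   # proposed (domino logic) area
DOMINO_CLOCK_MHZ = 1587.0      # proposed maximum clock rate


def compare():
    """Return (clock rate increase, area reduction, area-delay improvement)."""
    clock_increase = DOMINO_CLOCK_MHZ / STATIC_CLOCK_MHZ - 1.0
    area_reduction = 1.0 - DOMINO_AREA_UM2 / STATIC_AREA_UM2
    # Area x delay, with delay proportional to 1 / clock frequency.
    static_ad = STATIC_AREA_UM2 / STATIC_CLOCK_MHZ
    domino_ad = DOMINO_AREA_UM2 / DOMINO_CLOCK_MHZ
    return clock_increase, area_reduction, static_ad / domino_ad


if __name__ == "__main__":
    clock_increase, area_reduction, ad_gain = compare()
    print(f"clock rate increase: {clock_increase:.0%}")  # prints 99%
    print(f"area reduction: {area_reduction:.0%}")       # prints 49%
    print(f"area-delay improvement: {ad_gain:.2f}x")     # prints 3.94x
```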


Figure 8.9: Static CMOS 233-bit Multiplier Layout
Table 8.3: Complexity Comparison Between Two VLSI Implementations for a 233-bit Multiplier

| Architecture | Area | Clock Frequency |
|---|---|---|
| Static CMOS | 216737.136 μm² | 796 MHz |
| Proposed design | 109644.819 μm² | 1.587 GHz |

8.6 Conclusions

A new VLSI implementation of a 233-bit finite field multiplier was presented. The proposed design employs a main building block designed in domino logic. The speed improvement was measured to be 99% in comparison to the static CMOS implementation, while the area reduction was 49%. The final design was successfully simulated up to a clock rate of 1.587 GHz. The field size of our design is currently recommended by NIST in its Digital Signature Standard for the Elliptic Curve Digital Signature Algorithm, making the multiplier a desirable building block in elliptic curve cryptosystem designs.


Chapter 9
Conclusions and Future Work

9.1 Conclusions

A finite field is a finite set of elements in which one can add, subtract, multiply, and divide such that the properties of associativity, distributivity, and commutativity are satisfied. Finite fields have important applications in error control coding and cryptography. Two types of finite field are commonly used in practice: the prime field $\mathbb{F}_p$ and the binary field $\mathbb{F}_{2^m}$. The binary field is an extension of the prime field $\mathbb{F}_2$ and contains $2^m$ elements. Binary fields are attractive for high speed cryptography applications since they are suitable for hardware implementation. In $\mathbb{F}_{2^m}$, addition is simply the bitwise exclusive-OR of two binary vectors. Multiplication is more complicated, while division or inversion can be broken down into a series of consecutive multiplication operations. In practice, the finite field multiplier becomes the key arithmetic unit for any system based on finite field computations.
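These three observations can be illustrated on a toy field. The Python sketch below works in $GF(2^4)$ with a polynomial basis and the irreducible polynomial $x^4 + x + 1$; both choices, and the function names, are assumptions made purely for illustration and differ from the normal basis representations studied in this thesis. Inversion uses the identity $a^{-1} = a^{2^m - 2}$, so it reduces to repeated multiplication.

```python
M = 4            # field size: GF(2^4)
POLY = 0b10011   # x^4 + x + 1, irreducible over GF(2)


def gf_add(a, b):
    """Addition is the bitwise XOR of the two coefficient vectors."""
    return a ^ b


def gf_mul(a, b):
    """Shift-and-add multiplication with reduction modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:       # degree reached M: subtract (XOR) the modulus
            a ^= POLY
    return result


def gf_inv(a):
    """Inverse as a^(2^M - 2), computed with multiplications only."""
    if a == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    result = 1
    for _ in range(2 ** M - 2):
        result = gf_mul(result, a)
    return result


if __name__ == "__main__":
    print(gf_add(0b1010, 0b0110))   # prints 12
    print(gf_mul(0b0010, 0b1000))   # prints 3, since x * x^3 = x^4 = x + 1
    print(gf_mul(7, gf_inv(7)))     # prints 1
```

The simple loop in `gf_inv` uses $2^m - 2$ multiplications; practical designs use far fewer (for example, the method of [21]), but the dependence on the multiplier is the same.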
The efficiency of finite field multiplication depends on the choice of the basis used to represent field elements. Bases that have been used for realizing finite field multipliers include polynomial basis, normal basis (NB), dual basis, triangular basis, redundant representation or redundant basis, and their variations (e.g., shifted polynomial basis).
In normal basis, the complexity of multiplication is measured with the multiplication matrix. For a binary extension field, the entries of the multiplication matrix are either zero or one, and the number of ones in the multiplication matrix is referred to as the normal basis complexity. A normal basis in $GF(2^m)$ for which the complexity achieves its minimum, $2m - 1$, is referred to as an optimal normal basis (ONB). Two types of optimal normal bases have been found, which are referred to as type I and type II optimal normal bases.
Hardware implementations of finite field multipliers can usually be divided into three categories. The first category contains bit-level, or bit-serial, multipliers. A bit-level multiplier takes $m$ clock cycles to finish one multiplication in a binary field of size $m$. The multipliers in this class are considered to have low power consumption, occupy a small area of silicon, and operate slowly for large field sizes. The second category contains bit-parallel, or full parallel, multipliers. A full parallel multiplier takes one clock cycle to finish one field multiplication. These multipliers are not usually economical to implement, since they require a large silicon area and high bandwidth for input and output ports. The third category contains word-level, or digit-level, finite field multipliers, which are the most commonly implemented in practice. A word-level multiplier takes $w$ clock cycles, $1 \le w \le m$, to finish one multiplication operation in $\mathbb{F}_{2^m}$. The value of $w$ can be selected by the designer to set the trade-off between area and speed according to the application. Decreasing the value of $w$ results in faster and larger multipliers, while increasing $w$ results in smaller and slower multipliers. Note that bit-level and full parallel multipliers can be viewed as special cases of word-level multipliers for $w = m$ and $w = 1$, respectively.
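A rough feel for the trade-off can be obtained by asking how many coefficient bits must be processed per clock cycle to finish in $w$ cycles. The Python sketch below assumes a multiplier that consumes a fixed number of bits per cycle, which is a simplification introduced here for illustration; the function name `bits_per_cycle` is likewise hypothetical.

```python
def bits_per_cycle(m, w):
    """Smallest digit size that finishes an m-bit multiplication in w cycles."""
    if not 1 <= w <= m:
        raise ValueError("w must satisfy 1 <= w <= m")
    return -(-m // w)  # ceiling division with integers


if __name__ == "__main__":
    for w in (1, 8, 233):
        print(w, bits_per_cycle(233, w))
```

For $m = 233$, $w = 1$ corresponds to processing all 233 bits at once (full parallel), $w = 233$ to one bit per cycle (bit-level), and $w = 8$ to 30 bits per cycle.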
In this thesis, a number of high speed word-level architectures for finite field multiplication have been proposed. Most of the proposed architectures have been implemented in hardware, using FPGA or standard CMOS platforms. It has been shown that the proposed word-level architectures are more efficient than optimal normal basis type I or type II architectures for the classes of fields in which they exist.

Also, different VLSI implementations of finite field multipliers were explored, which resulted in more efficient implementations for some of the regular architectures. The new implementations use a simple module designed in domino logic as the main building block of the multiplier. Significant improvements were achieved when designing practical-sized multipliers using the proposed methodology.

9.2 Future Work
More research can be conducted to find more efficient algorithms and architectures for finite field multiplication in the optimal normal basis type I or type II classes of fields. Special attention should be paid to full parallel architectures for two primary reasons. First, advances in VLSI design now allow large parallel systems to be realized on a single chip. Second, the need for faster multipliers to further speed up the encryption and decryption processes grows as more data must be encrypted. Note that the main challenges in designing such multipliers would be managing the power and I/O requirements, which differ from those of word-level architectures.

Further research efforts should be devoted to other classes of normal basis, including Gaussian normal bases of types III and IV. These classes of fields are considered to be the most efficient, after optimal normal basis types I and II, for cryptography applications. Examples are the binary field sizes of 163 and 409, which are recommended by the National Institute of Standards and Technology for elliptic curve cryptography.

Another research area worth exploring is the low power VLSI design and implementation of bit-level multiplier architectures. The low power and small area requirements of such designs make them attractive candidates for resource-constrained applications such as smart cards, cellular phones, and personal digital assistants.


References
[1] G.B. Agnew, R.C. Mullin, I.M. Onyszchuk, and S.A. Vanstone. An implementation for a fast public-key cryptosystem. Journal of Cryptology, 3:63-79, 1991.
[2] G.B. Agnew, R.C. Mullin, and S. Vanstone. Fast exponentiation in $\mathbb{F}_{2^n}$. In Lecture Notes in Computer Science on Advances in Cryptology-EUROCRYPT'88, pages 251-255, New York, NY, 1988. Springer.
[3] E.R. Berlekamp. Bit-serial Reed-Solomon encoders. IEEE Trans. on Information Theory, 28(6):869-874, November 1982.
[4] T. Beth and D. Gollmann. Algorithm engineering for public key algorithms. IEEE Journal on Selected Areas in Communications, 7(4):458-465, May 1989.
[5] Taiwan Semiconductor Manufacturing Company. 0.18 μm TSMC CMOS technology standard cell library, September 1999.
[6] Certicom Corporation. Current public-key cryptographic systems, 2000. White Paper.
[7] G. Drolet. A new representation of elements of finite fields $GF(2^m)$ yielding small complexity arithmetic circuits. IEEE Trans. on Computers, 47(9):938-946, September 1998.
[8] H. Fan and M.A. Hasan. Fast bit parallel-shifted polynomial basis multipliers in $GF(2^n)$. IEEE Transactions on Circuits and Systems I, 53:2606-2615, 2006.
[9] M. Feng. A VLSI architecture for fast inversion in $GF(2^m)$. IEEE Trans. on Computers, 38(10):1383-1386, October 1989.
[10] R. Furness, S.T. Fenn, and M. Benaissa. Multiplication using the triangular basis representation over $GF(2^m)$. In Global Telecommunications Conference, 1996. GLOBECOM '96, volume 2, pages 788-792, November 1996.
[11] L. Gao and G.E. Sobelman. Improved VLSI designs for multiplication and inversion in $GF(2^m)$ over normal bases. In Proceedings of the 13th Annual IEEE ASIC/SOC Conference, pages 97-101, September 2000.

[12] S. Gao and S. Vanstone. On orders of optimal normal basis generators. Mathematics of Computation, 64(2):1227-1233, July 1995.
[13] S. Gao, J. von zur Gathen, and D. Panario. Gauss periods and fast exponentiation in finite fields. Lecture Notes in Computer Science, 911:311-322, 1995.
[14] S. Gao, J. von zur Gathen, D. Panario, and V. Shoup. Algorithms for exponentiation in finite fields. Journal of Symbolic Computation, 29:879-889, 2000.
[15] W. Geiselmann and D. Gollmann. Symmetry and duality in normal basis multiplication. In Proceedings of the Applied Algebra, Algebraic Algorithms, and Error Correcting Codes Symposium (AAECC-6), pages 230-238, July 1988.
[16] W. Geiselmann and D. Gollmann. Self-dual bases in $\mathbb{F}_{q^n}$. Designs, Codes and Cryptography, 3:333-345, 1993.
[17] W. Geiselmann and R. Steinwandt. Redundant representation of $GF(q^n)$ for designing arithmetic circuits. IEEE Trans. on Computers, 52(7):1848-1853, July 2003.
[18] D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer, New York, NY, 2003.
[19] M.A. Hasan and V.K. Bhargava. Low complexity architecture for exponentiation in $GF(2^m)$. Electronics Letters, 28(21):1984-1986, October 1992.
[20] M.A. Hasan, M.Z. Wang, and V.K. Bhargava. A modified Massey-Omura parallel multiplier for a class of finite fields. IEEE Trans. on Computers, 42(10):1278-1280, October 1993.
[21] T. Itoh and S. Tsujii. A fast algorithm for computing multiplicative inverses in $GF(2^m)$ using normal bases. Information and Computation, 78:171-177, September 1988.
[22] J.L. Massey and J.K. Omura. U.S. Pat. 4587627: Computational method and apparatus for finite field arithmetic, September 1986.
[23] R. Katti. Low complexity multiplication in a finite field using ring representation. IEEE Trans. on Computers, 52(4):418-427, April 2003.
[24] C.K. Koc and B. Sunar. Low-complexity bit-parallel canonical and normal basis multipliers for a class of finite fields. IEEE Trans. on Computers, 47(3):353-356, March 1998.
[25] R. Lidl and H. Niederreiter. Introduction to Finite Fields and Their Applications. Cambridge University Press, Cambridge, England, 1997.

[26] R. Lidl and H. Niederreiter. Finite Fields (Encyclopedia of Mathematics and its Applications). Cambridge University Press, Cambridge, England, 2008.
[27] E.D. Mastrovito. Architectures for Computations in Galois Fields. PhD thesis, Linkoping University, Linkoping, Sweden, 1991.
[28] A. Menezes, P. v. Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC-Press, 1996.
[29] A.J. Menezes, I.F. Blake, X. Gao, R.C. Mullin, S.A. Vanstone, and T. Yaghoobian. Applications of Finite Fields. The Springer International Series in Engineering and Computer Science, New York, NY, 1993.
[30] R.C. Mullin, I.M. Onyszchuk, S.A. Vanstone, and R.M. Wilson. Optimal normal bases in $GF(p^n)$. Discrete Applied Mathematics, 22:149-161, February 1989.
[31] A.H. Namin, H. Wu, and M. Ahmadi. Comb architectures for finite field multiplication in $\mathbb{F}_{2^m}$. IEEE Trans. on Computers, 56:909-916, July 2007.
[32] A.H. Namin, H. Wu, and M. Ahmadi. A new finite field multiplier using redundant representation. IEEE Trans. on Computers, 57:716-720, May 2008.
[33] Institute of Electrical and Electronics Engineers. IEEE standard specifications for public-key cryptography, August 2000.
[34] National Institute of Standards and Technology. Digital signature standards, January 2000.
[35] A. Reyhani-Masoleh and M.A. Hasan. A new construction of Massey-Omura parallel multiplier over $GF(2^m)$. IEEE Trans. on Computers, 51(5):511-520, May 2002.
[36] A. Reyhani-Masoleh and M.A. Hasan. Efficient digit-serial normal basis multipliers over $GF(2^m)$. IEEE Trans. on Computers, special issue on cryptographic hardware and embedded systems, 52(4):428-439, April 2003.
[37] A. Reyhani-Masoleh and M.A. Hasan. Low complexity word-level sequential normal basis multipliers. IEEE Trans. on Computers, 54(2):98-110, February 2005.
[38] J.H. Silverman. Fast multiplication in finite fields $GF(2^n)$. Lecture Notes in Computer Science, Proceedings of the First International Workshop on Cryptographic Hardware and Embedded Systems, 1717:122-134, August 1999.
[39] P. Srivastava, A. Pua, and L. Welch. Issues in the design of domino logic circuits. In Proceedings of the 8th Great Lakes Symposium on VLSI, pages 108-112, 1998.
[40] J.P. Uyemura. CMOS Logic Circuit Design. Springer, New York, NY, 1999.

[41] C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, J.K. Omura, and I.S. Reed. VLSI architectures for computing multiplications and inverses in $GF(2^m)$. IEEE Trans. on Computers, C-34(8):709-717, August 1985.
[42] J.K. Wolf. Efficient circuits for multiplying in $GF(2^m)$ for certain values of $m$. Discrete Mathematics, 106-107:497-502, September 1992.
[43] H. Wu and M.A. Hasan. Low complexity bit-parallel multipliers for a class of finite fields. IEEE Trans. on Computers, 47(8):883-887, August 1998.
[44] H. Wu, M.A. Hasan, I.F. Blake, and S. Gao. Finite field multiplier using redundant representation. IEEE Trans. on Computers, 51(11):1306-1316, November 2002.

VITA AUCTORIS

Ashkan Hosseinzadeh Namin was born in Tehran, Iran, on September 21, 1979. He received his BSc degree in Electrical Engineering from Isfahan University of Technology, Isfahan, Iran, in 2002, and his MSc degree in Electronics from Sharif University of Technology, Tehran, Iran, in 2004. Since September 2004, he has been pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada. His research interests include digital and analog integrated circuits, architectures in finite fields, and hardware implementation of cryptosystems.


