VLSI Architectures and Arithmetic Operations with Application to the Fermat Number Transform by Lars-Inge Alfredsson
Linköping Studies in Science and Technology.
Dissertation No. 425
VLSI Architectures and Arithmetic
Operations with Application to
the Fermat Number Transform
Lars-Inge Alfredsson
Department of Electrical Engineering
Linköping University, S-581 83 Linköping, Sweden
Linköping 1996
ISBN 91-7871-694-2 ISSN 0345-7524
Printed in Sweden by LJ Foto & Montage/VTT-Grafiska, Vimmerby 1996

To my wife
Anneli
and to our children
Michaela, Sebastian, Jonathan, and Antonia

Abstract
The properties of arithmetic operations in Fermat integer quotient rings
$\mathbb{Z}_{2^m+1}$, where $m = 2^t$, are investigated. The arithmetic
operations considered are mainly those involved in the computation of the
Fermat number transform. We consider some ways of representing the binary
coded integers in such rings and investigate VLSI architectures for
arithmetic operations, with respect to the different element
representations. The VLSI architectures are mutually compared with respect
to area ($A$) and time ($T$) complexity and area-time performance ($AT^2$).
The VLSI model chosen is a linear switch-level $RC$ model.
In the polar representation, the nonzero elements of a field are represented
by the powers of a primitive element of the field. In the thesis we
particularly investigate the properties of arithmetic operations and their
corresponding VLSI architectures with respect to the polar representation of
the elements of Fermat prime fields. Some new results regarding the
applicability of the Fermat number transform when using the polar
representation are also presented.

Acknowledgements
My time as a PhD student has come to an end. I have really enjoyed teach-
ing, studying, and doing research, which have been my main duties during
these years. One of the main reasons why I wanted to join the Data Transmis-
sion group was the friendly and inspiring atmosphere that was — and still is
— prevalent among the people in the group. I would like to thank all mem-
bers of the Data Transmission group for providing this friendly and inspiring
atmosphere.
I particularly would like to thank my supervisor, Professor Thomas Ericson,
for giving me the opportunity to join the Data Transmission group. He has
been an excellent guide on my tour into the world of science and he has
always supported my work with a proper balance between friendly
encouragements and educating directions.
I also appreciate the fruitful discussions with Professor Stefan Dodunekov,
Professor Christer Svensson, and Dr. Edoardo Mastrovito.
Finally, I would like to thank my wonderful family, to whom I dedicate this
thesis. The seemingly never-ending process of writing the thesis has come to
an end. From now on, I will spend a lot more time with You!
Linköping, March 1996
Lasse Alfredsson

There are certain privileges of a writer,
the beneﬁt whereof, I hope, there will be no reason to doubt;
Particularly, that where I am not understood, it shall be concluded,
that something very useful and profound is couched underneath.
– Jonathan Swift
(Tale of a Tub, preface 1704)
Not that the story need to be long,
but it will take a long while to make it short.
– Henry David Thoreau
(Letter, 16 Nov. 1857.)

Contents
1 Introduction
2 Binary Arithmetic in the Fermat Integer Quotient Ring
  2.1 The Integer Quotient Ring
  2.2 The Number Theoretic Transform
    2.2.1 Suitable Integer Rings
  2.3 The Fermat Number Transform
    2.3.1 Fermat Numbers
    2.3.2 The Transform Kernel
    2.3.3 Butterfly Computations
  2.4 Element Representation
3 Applications
  3.1 Convolution and Correlation of Real Integer Sequences
  3.2 Decoding of Reed-Solomon Codes
4 The VLSI Model
  4.1 Introduction
  4.2 Complexity and Performance
    4.2.1 The Delay Model
    4.2.2 Area and Time Complexities
  4.3 Basic CMOS Building Blocks
    4.3.1 The Inverter and the Transmission Gate
    4.3.2 The Two-Input Multiplexer
    4.3.3 Two-Input Gates
    4.3.4 The Single-Bit Adder
    4.3.5 The Register
    4.3.6 Table of Complexity Parameters
  4.4 Implementing the Fermat Number Transform
5 The Normal Binary Coded Representation
  5.1 Architectures for Arithmetic Operations
    5.1.1 Modulus Reduction
    5.1.2 Negation
    5.1.3 Addition and Subtraction
    5.1.4 Multiplication by Powers of 2
    5.1.5 General Multiplication
    5.1.6 Exponentiation of the Transform Kernel
  5.2 Summary
6 The Diminished–1 Representation
  6.1 Linearly Transformed Representations
    6.1.1 Arithmetic Operations
  6.2 The Use of a Zero Indicator
  6.3 The Diminished–1 Representation
    6.3.1 Code Translation
    6.3.2 Modulus Reduction
    6.3.3 Negation
    6.3.4 Addition and Subtraction
    6.3.5 Multiplication by Powers of 2
    6.3.6 General Multiplication
    6.3.7 Exponentiation of the Transform Kernel
  6.4 Summary
7 The Polar Representation
  7.1 Introduction
  7.2 Arithmetic Operations
    7.2.1 Discrete Exponentiation
    7.2.2 The Discrete Logarithm
    7.2.3 Modulus Reduction
    7.2.4 Negation
    7.2.5 Addition and Subtraction
    7.2.6 General Multiplication
    7.2.7 Multiplication by Powers of …
  7.3 Zech's Logarithm
  7.4 Properties of the $D_m$ Matrix
    7.4.1 Discrete Exponentiation
    7.4.2 The Discrete Logarithm
    7.4.3 Zech's Logarithm
  7.5 The Mirror Sequence $M_m$
    7.5.1 Discrete Exponentiation Using a Look-Up Table
    7.5.2 The Discrete Logarithm Using a Look-Up Table
    7.5.3 The Mirror Properties of $M_m$
    7.5.4 Finding the Unique Distinct Positions in $M_m$
    7.5.5 Addressing the Look-Up Table for Discrete Logarithm
  7.6 Architectures for Arithmetic Operations
    7.6.1 Discrete Exponentiation
    7.6.2 The Discrete Logarithm
    7.6.3 Negation
    7.6.4 Addition
    7.6.5 General multiplication
    7.6.6 Multiplication by powers of …
  7.7 Summary
8 Comparisons Between Element Representations
  8.1 Arithmetic Operations
    8.1.1 Modulus Reduction
    8.1.2 Code Translation
    8.1.3 Negation
    8.1.4 Addition
    8.1.5 General Multiplication
    8.1.6 Multiplication by Powers of …
    8.1.7 Butterfly Computations
  8.2 Other element representations
9 Conclusions
A Proofs of Some Theorems
  A.1 Proof of Theorem 2.1
  A.2 Proof of Theorem 2.3
  A.3 Proof of Theorem 2.5
B A Table of Some Primes
C Further Properties of Zech's Logarithms
Bibliography

Chapter 1
Introduction
In 1972 Rader [77] proposed transforms in the ring of integers modulo a
Mersenne or a Fermat number ($2^n - 1$ and $2^m + 1$ with $m = 2^t$,
respectively) to compute error-free convolutions of real integer sequences.
Later,
Agarwal and Burrus [2] showed that for some transform lengths the radix-2
Fermat number transform can be implemented using only addition, subtrac-
tion, and bit shifting, i.e. without using multiplication. This transform was
shown to be faster than the conventional fast Fourier transform over the com-
plex ﬁeld.
There are also other applications of the Fermat number transform. Justesen
[54] was one of the first to consider Reed-Solomon codes over the finite
field of integers modulo a Fermat prime. He stated that the decoding
complexity of such codes can be reduced if the Fermat number transform is
used to evaluate the syndromes and error magnitudes. This was further
investigated by Reed et al. [82] and others.
The special attributes of the Fermat number transform have led several re-
searchers to consider the VLSI (Very Large Scale Integration) implementation
of arithmetic operations in the ring of integers modulo a Fermat number.
These operations are traditionally implemented using binary logic circuits,
which means that the elements of the ring have a binary coded form of
representation. The $2^m + 1$ binary coded elements of the ring of integers
modulo a Fermat number can be represented using $m + 1$ bits. We thus get
numerous ways of representing the elements of the ring. The complexity and
performance of architectures for arithmetic operations depend inter alia on
the representation chosen.
The best-known representations are the ones proposed by McClellan [65] and
Leibowitz [58]. Their coding schemes are linear coordinate transformations
of the normal binary coded representation of the elements in the ring. Using
their representations, operations like addition, multiplication by two, and
the code translation can be carried out fairly easily in VLSI. Also, for
some relatively small transform lengths, the transform multiplications by
powers of the transform kernel can be carried out as binary shifts. This is
a well-known property of the Fermat number transform. One of the main
disadvantages of using McClellan's or Leibowitz' element representation is
that for most other possible transform lengths, the resulting transform
computation involves general multiplications (by powers of the transform
kernel). Nevertheless, Leibowitz' so-called diminished–1 representation is
used by most people who consider the VLSI implementation of the Fermat
number transform.
In this thesis we investigate various ways of representing the binary coded
elements of the ring of integers modulo a Fermat number. For each element
representation considered, the properties of the arithmetic operations
involved in the computation of the Fermat number transform are thoroughly
investigated. Some other (arithmetic) operations are also considered. We
also investigate VLSI architectures for the arithmetic operations. Some
architectures are previously known and some are new. We show how each of
these architectures is derived from its associated analytical expression for
the arithmetic operation in question.
One of our main goals is to find a representation that makes it possible to
compute the Fermat number transform with favourable area-time performance
for all possible transform lengths. In particular, we focus on the
arithmetic operations obtained when using the polar representation of the
elements of Fermat prime fields. In the polar representation, the elements
of a field are represented by powers of some primitive element of the field.

Chapter 2
Binary Arithmetic in the Fermat Integer Quotient Ring
In this chapter we give a formal introduction to the number theoretic
transform in general and the Fermat number transform in particular. The
chapter contains several known results from the area of number theory. We
also consider some fast Fourier transform algorithms for implementing the
Fermat number transform. For each algorithm, we find out which arithmetic
operations are needed and the complexity of computing the transform. The
purpose of the survey is to put our work into perspective. The chapter is
concluded by presenting some aspects of representing the binary coded
integers of the Fermat integer quotient ring.
2.1 The Integer Quotient Ring
A ring is an algebraic system consisting of a set of elements together with ad-
dition, subtraction, and multiplication. The result of any of these arithmetic
operations is always an element of the original set. It may also be possible to
divide in a ring. Then the multiplicative inverse of the divisor must exist in the
ring.
A natural example of a ring is $\mathbb{Z}$, the ring of integers; for
$a, b \in \mathbb{Z}$, we have $a + b,\ a - b,\ a \cdot b \in \mathbb{Z}$.
Denote by $\mathbb{Z}_q$ the quotient ring of integers modulo an integer
$q$. It consists of the set $\{0, 1, 2, \ldots, q - 1\}$ of integers, and
the result of every arithmetic operation is reduced modulo $q$. Thus, an
integer $c$ maps into $\mathbb{Z}_q$ as the remainder $r$ of $c$ divided by
$q$. If we have $c = r + dq$ for some integer $d$, then $c$ and $r$ are
congruent modulo $q$. The notation for such a congruence is

$$c \equiv r \pmod{q}.$$

The multiplicative inverse of an element of $\mathbb{Z}_q$ exists if and
only if the element is relatively prime to the modulus $q$.¹ If $q$ is a
prime number, then every nonzero element of $\mathbb{Z}_q$ has a
multiplicative inverse and thus division becomes a general operation in the
ring. Then $\mathbb{Z}_q$ is called a field.² For a detailed mathematical
survey on the theory of rings and fields, see for example Lidl and
Niederreiter [60] or Herstein [50].³

In this thesis we investigate VLSI architectures for arithmetic operations
in the integer quotient ring $\mathbb{Z}_q$, where $q$ is a Fermat number.
Even though the development of multiple-valued logic has progressed over the
years [29], it is still a difficult problem to design $q$-valued logic
circuits for large $q$. Therefore, we restrict ourselves to representations
of the integers modulo $q$ as binary coded symbols and use binary logic
circuits in the VLSI architectures for the arithmetic operations in
$\mathbb{Z}_q$.

¹ If $a \in \mathbb{Z}_q$ and $q$ are relatively prime, then we have
$1 = ab + dq \equiv ab \pmod{q}$, where $b$ and $d$ are integers. The
integer $b \bmod q$ is then referred to as the multiplicative inverse of $a$
under multiplication modulo $q$.
² Thus, a field is a ring in which it is also possible to divide.
³ The quotient ring $\mathbb{Z}_q$ is denoted by $\mathbb{Z}/(q)$ and $J_q$
in [60] and [50] respectively. The notation $\mathbb{Z}_q$, which we
conveniently use in this thesis, is very common in many other books on
abstract algebra and number theory.

2.2 The Number Theoretic Transform
Before going into details about the Fermat number transform, we give the
definition of the number theoretic transform in an arbitrary integer
quotient ring $\mathbb{Z}_q$. We also discuss which moduli $q$ are most
suitable, with respect to the complexity of computing the number theoretic
transform. The computation of the number theoretic transform (NTT) involves
integer ring arithmetic operations. The NTT is a DFT-like (discrete Fourier
transform) transform which is computed in the ring of integers modulo some
integer:

Definition 2.1 In the ring $\mathbb{Z}_q$ of integers modulo a positive
integer $q = p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$, the number theoretic
transform of the sequence $x = \{x_n\}_{n=0}^{N-1}$ of elements
$x_n \in \mathbb{Z}_q$ is a sequence $X = \{X_k\}_{k=0}^{N-1}$,
$X_k \in \mathbb{Z}_q$, given by

$$X_k \equiv \sum_{n=0}^{N-1} x_n \omega^{kn} \pmod{q}, \qquad k = 0, 1, \ldots, N-1, \tag{2.1}$$

where $\omega$ is any element with order $N$ in $\mathbb{Z}_q$.
The factors $p_1, p_2, \ldots, p_k$ of $q$ are distinct primes.
Remark: Let $\omega$ and $q$ be relatively prime positive integers. Then,
the least positive integer $N$ such that $\omega^N \equiv 1 \pmod{q}$ is
called the order of $\omega$ modulo $q$. We denote the order of $\omega$
modulo $q$ by $\mathrm{ord}_q\,\omega$. Thus, for the transform kernel
$\omega$ we get $\mathrm{ord}_q\,\omega = N$. Sometimes, $\omega$ is said to
be a primitive $N$th root of unity.
Because we have $\mathrm{ord}_q\,\omega = N$, the product $kn$ in the
exponent of $\omega$ in (2.1) is calculated modulo $N$. It is easy to show
that the NTT, as well as the DFT, possesses the cyclic convolution property,
i.e. the transform of a cyclic convolution of two sequences is equal to the
product of their transforms. There are also other properties of the DFT that
have their counterparts in the NTT. The inverse number theoretic transform
is given by

$$x_n \equiv N^{-1} \sum_{k=0}^{N-1} X_k \omega^{-kn} \pmod{q}, \qquad n = 0, 1, \ldots, N-1, \tag{2.2}$$

where $N^{-1}$ is the multiplicative inverse of $N$ modulo $q$, i.e. the
least positive integer $M$ for which $N \cdot M \equiv 1 \pmod{q}$. Such an
inverse exists if and only if $\gcd(N, q) = 1$. The factor $\omega^{-kn}$ in
(2.2) is congruent to $\omega^{N - (kn \bmod N)}$ modulo $q$. Therefore,
(2.2) involves multiplication by positive powers of $\omega$ modulo $q$.
It is sometimes convenient to use the multiplicative inverse $\omega^{-1}$
of $\omega$ modulo $q$ instead of $\omega$ as the transform kernel of the
inverse NTT.⁴ If there exists an integer $\omega$ with order $N$ modulo $q$,
then its inverse $\omega^{-1} \equiv \omega^{N-1} \pmod{q}$ also exists.

⁴ We have $\omega^{-kn} = (\omega^{-1})^{kn}$.

Thus, we can say that a number theoretic transform of length $N$ and its
inverse transform exist in $\mathbb{Z}_q$ if there is an integer $\omega$
with order $N$ modulo $q$ and $N$ has a multiplicative inverse modulo $q$.
The following theorem may be useful when determining the possible lengths of
an invertible transform in an integer quotient ring:
Theorem 2.1 There exists an invertible NTT of length $N$ in $\mathbb{Z}_q$
if and only if $N \mid (p_i - 1)$ for every prime $p_i$ that divides $q$.
Proof: See Section A.1 of Appendix A. □
Thus, the theorem says that the transform length $N$ must satisfy

$$N \mid \gcd(p_1 - 1, p_2 - 1, \ldots, p_k - 1), \tag{2.3}$$

where $q = p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$. In particular, if $q = p$
is a prime, then every nonzero element of the prime field $\mathbb{Z}_p$ has
a multiplicative inverse and there exists an NTT of every length $N$ that
divides $p - 1$.
2.2.1 Suitable Integer Rings
There exist infinitely many number theoretic transforms. The modulus
$q = p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$ should be chosen in a suitable
way with respect to the complexity and performance of the architectures for
the binary coded integer arithmetic operations modulo $q$, and with respect
to the possible NTT lengths that will be obtained. Multiplication by powers
of the transform kernel $\omega$ is usually the most complex arithmetic
operation involved in the computation of the NTT. Therefore, the efficiency
of a VLSI implementation of an NTT is often largely determined by the
efficiency with which such multiplications can be carried out.
N requiresin the order of
N
  mul-
tiplications and
N
 
N
 
 
  additions. If the transform length is composite the
NTTcanbedecomposedintoseveraltransformsofsmallersizeswhichmaybe
computed using some fast Fourier transform (FFT) algorithm [17, Ch. 4]. The
FFT algorithm is most efﬁciently computed if the transform is a single-radix
transform with a small radix, i.e. if the transformlength can be expressed as a
powerofasmallinteger. Forexample, if
N
 
r
b,f o rs o m e
rand
b,t h eN T Tc a n
be computed using a radix-r FFT algorithm. Such an algorithm requires in the
order of
k
 
r
 
 
 
N
l
o
g
r
N multiplications and
 
r
 
 
 
N
l
o
g
r
N additions, where
k depends on
N and the choice of
  [33, 35]. Hence, the complexity of com-
puting the NTT can be signiﬁcantly reduced by choosing a suitable transform
length and using an FFTalgorithm. From (2.3)it follows that it is the modulus
that determines the possible transform lengths.
From a VLSI implementation point of view, the reduction modulo $q$ of a
binary coded integer is simplest to perform when $q$ is close to a power of
two or when the binary coded representation of $q$ contains few ones. The
modulus reduction in $\mathbb{Z}_{2^m}$ is very simple and straightforward,
but since 2 is a prime factor of $q = 2^m$, the maximum possible NTT length
in any ring of size $2^m$ is 1. The same conclusion holds for every even
modulus $q$. Integer quotient rings with even modulus are therefore not
interesting from an NTT application point of view.
Any odd natural number $q$ can be written in the form $q = a \cdot r^m + 1$
for some natural numbers $a$, $r$, and $m$, where $r$ does not divide $a$.
When $q$ is a prime, we see from (2.3) that the possible transform lengths
are the ones that divide $a \cdot r^m$. Therefore, the maximum radix-$r$
transform length in the prime field $\mathbb{Z}_{a \cdot r^m + 1}$ is $r^m$.
Because a radix-$r$ transform of length $N = r^b$ involves in the order of
$(r-1)N\log_r N$ multiplications and additions, the transform is most
efficiently computed if $N$ is highly composite, i.e. $r$ is small.
Chevillat gives a table [33, Tab. II] of 8-bit to 16-bit moduli whose
associated integer quotient rings each contain a single-radix transform of
length $N$ … . Some of these moduli are composite, but most of them are
prime numbers. The modulus should be chosen such that the modulus reduction
is not a very complex operation. As an example we consider $\mathbb{Z}_q$
with a prime modulus $q$ = … , for which $q - 1$ = … . This is one of
Chevillat's numbers. The maximum transform length of a single-radix NTT in
$\mathbb{Z}_q$ is … . However, because the normal binary representation of
$q$ is … , the reduction modulo $q$ may not be as simply performed as when
$q$ can be represented by much fewer ones or when it is closer to a power of
two.
We mentioned above that multiplication by powers of the transform kernel
should be carried out as simply as possible. The complexity of such a mul-
tiplication depends inter alia on the kernel chosen. However, in an arbitrary
integer quotient ring there may not exist a suitable kernel for which this com-
plexity is low. In general, even if there exist single-radix transforms of great
lengths in an integer ring, it is not certain that a transform multiplication can
be computed using a procedure that is simpler than general multiplication.
Mersenne numbers
A set of integers of particular interest is the set of Mersenne numbers.
These numbers are of the form $2^m - 1$, where $m$ is a positive integer. We
denote such numbers by $M_m$. The NTT in a Mersenne integer quotient ring
$\mathbb{Z}_{M_m}$ is usually called the Mersenne number transform. One of
the first to consider Mersenne number transforms was Rader in 1972 [77].
Arithmetic operations are easily carried out in $\mathbb{Z}_{M_m}$ if the
elements are represented as normal binary coded $m$-bit integers, because
then the complexity of performing the operations equals the complexity of
one's-complement arithmetic: because $2^m \equiv 1 \pmod{2^m - 1}$, the
modulus reduction is equivalent to the procedure for handling overflow in
one's-complement arithmetic.
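
The overflow handling referred to above is the end-around carry. As a software analogue (a sketch only, not a description of any VLSI architecture in the thesis; the function name is hypothetical), the bits above position $m - 1$ are added back into the low $m$ bits until the value fits:

```python
def reduce_mersenne(x, m):
    """Reduce a non-negative integer x modulo M_m = 2^m - 1 by end-around carry."""
    mask = (1 << m) - 1  # M_m itself: a word of m ones
    while x > mask:
        # 2^m is congruent to 1 modulo M_m, so the high part folds onto the low part.
        x = (x & mask) + (x >> m)
    # The all-ones word is the second representation of zero.
    return 0 if x == mask else x


print(reduce_mersenne(200, 7), 200 % 127)
```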
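
$m$   $M_m = 2^m - 1$   $M_m - 1 = 2(2^{m-1} - 1)$
3     7                 $2 \cdot 3$
5     31                $2 \cdot 3 \cdot 5$
7     127               $2 \cdot 3^2 \cdot 7$
13    8191              $2 \cdot 3^2 \cdot 5 \cdot 7 \cdot 13$
17    131071            $2 \cdot 3 \cdot 5 \cdot 17 \cdot 257$
19    524287            $2 \cdot 3^3 \cdot 7 \cdot 19 \cdot 73$
31    2147483647        $2 \cdot 3^2 \cdot 7 \cdot 11 \cdot 31 \cdot 151 \cdot 331$

Table 2.1: The first 7 Mersenne prime numbers.
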
There is, however, no general fast algorithm for the computation of the
Mersenne number transform. Let $m = \lambda k$ where $\lambda$ is a prime
number. Then $2^\lambda - 1$ divides $2^{\lambda k} - 1$. This is easily
shown by using the relation
$x^k - 1 = (x - 1)(x^{k-1} + x^{k-2} + \cdots + x + 1)$ for $x = 2^\lambda$,
which gives
$2^{\lambda k} - 1 = (2^\lambda - 1)(2^{\lambda(k-1)} + 2^{\lambda(k-2)} + \cdots + 2^\lambda + 1)$,
and thus we get $(2^\lambda - 1) \mid (2^{\lambda k} - 1)$. If
$m = \lambda k$ is even then $2^2 - 1 = 3$ is a prime factor of $M_m$ which,
from (2.3), implies that the transform length divides 2. Thus, a transform
of meaningful length can only be obtained when $m$ is odd.
Furthermore, if $M_m = 2^m - 1$ is prime then $m$ must also be prime, i.e.
$k$ equals 1 in the previous factorisation of $2^{\lambda k} - 1$. The
converse, however, is not always true; for example
$2^{11} - 1 = 2047 = 23 \cdot 89$ is not a prime number. This shows, by
applying (2.3) to the prime factorisation of $M_m$, that the possible
lengths of the NTT in $\mathbb{Z}_{M_m}$ are relatively small when $m$ is
odd and $M_m$ is composite.
When $M_m$ is prime the NTT length must divide
$M_m - 1 = 2(2^{m-1} - 1)$. The third column of Table 2.1 shows the prime
factorisations of $M_m - 1$ for the first 7 Mersenne primes. We see that for
large $M_m$ the number $M_m - 1$ is not highly composite. Therefore, there
may not exist any efficient FFT-type algorithm to compute transforms of
great lengths in $\mathbb{Z}_{M_m}$. Properties of Mersenne number
transforms and some applications are further discussed in Chapter 6.3 of
Blahut [17] and by Rader [77].
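
The factorisations of $M_m - 1$ discussed above can be recomputed with a few lines of Python. This is a sketch using plain trial division, which is sufficient for these sizes; the function name is hypothetical.

```python
def factorise(n):
    """Prime factorisation of n as a list of (prime, exponent) pairs."""
    out, p = [], 2
    while p * p <= n:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        if e:
            out.append((p, e))
        p += 1
    if n > 1:
        out.append((n, 1))
    return out


for m in (3, 5, 7, 13, 17, 19, 31):
    M = 2**m - 1
    print(m, M, factorise(M - 1))
```

In every row the prime 2 appears with exponent 1 only, so a radix-2 transform in these Mersenne prime fields cannot be longer than 2.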
Numbers of the form $2^n - 2^m + 1$
The final set of numbers to be considered here are prime numbers of the form
$q = 2^n - 2^m + 1$, where $0 < m < n$. Several of these numbers can also be
found in the set of Chevillat numbers. In 1976, Pollard [73] stated that
such numbers are good choices as integer ring moduli.
The normal binary representation of the $n$-bit modulus
$q = 2^n - 2^m + 1$ is

$$\underbrace{\overbrace{1\,1 \cdots 1}^{n-m\ \text{ones}}\ \overbrace{0\,0 \cdots 0}^{m-1\ \text{zeros}}\ 1}_{n\ \text{bits}}$$

i.e. a block of $n - m$ ones followed by a block of $m - 1$ zeros and a one
in the least significant bit position. It is quite easy to perform the
modulus reduction in VLSI when the modulus has this form and if the integer
to be reduced is less than $2^n$. Because
$2^n - 2^m + 1 \equiv 0 \pmod{2^n - 2^m + 1}$ we get
$2^n - 2^m \equiv -1 \pmod{2^n - 2^m + 1}$. Therefore, when the $n - m$ most
significant bits of an integer not greater than $2^n - 1$ are all ones, the
modulus reduction is carried out by changing these bits to zero and
subtracting one (1) from the resulting binary coded integer. When the
integer to be reduced is greater than $2^n - 1$, the modulus reduction
procedure is just slightly more complicated.
Example: $n$ = … , $m$ = … , $q$ = … .
Modulus reduction … (mod … ): …
In Section 5.1.1 we show how subtraction by one can be carried out in VLSI in
a simple way.
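
The reduction rule for moduli of the form $2^n - 2^m + 1$ can be sketched in Python as follows. The function name and the parameter values $n = 8$, $m = 4$ (so the modulus is 241) are picked here purely for illustration and are not taken from the thesis. The sketch covers only inputs below $2^n$, as in the simple case described above.

```python
def reduce_special(x, n, m):
    """Reduce 0 <= x < 2^n modulo q = 2^n - 2^m + 1, assuming 0 < m < n."""
    q = (1 << n) - (1 << m) + 1
    top = ((1 << n) - 1) ^ ((1 << m) - 1)  # mask of the n - m most significant bits
    if (x & top) == top:
        # 2^n - 2^m is congruent to -1 modulo q:
        # clear the block of ones and subtract one.
        x = (x & ~top) - 1
        if x < 0:
            x += q  # x was exactly 2^n - 2^m = q - 1, which is already reduced
    return x


# Illustrative values only: n = 8, m = 4 gives q = 241.
print(reduce_special(0b11110110, 8, 4), 0b11110110 % 241)
```

If the top $n - m$ bits are not all ones, the integer is already smaller than $q$ and is returned unchanged.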
For prime moduli $q$, the possible transform lengths divide
$q - 1 = 2^m(2^{n-m} - 1)$, which implies that there exist radix-2 NTTs of
length $N = 2^b$, where $b \le m$, in the corresponding prime fields
$\mathbb{Z}_q$. In Table B.1 of Appendix B we present the factorisations of
$q - 1$ together with $n$, $m$, and $q$ for all primes $q$ of the form
$q = 2^n - 2^m + 1$ where $0 < m < n \le$ … . These primes were found by
computer search. In order to obtain a transform of great length, $m$ should
be large. On the other hand, in order to make efficient use of the $n$-bit
representation of the integers in $\mathbb{Z}_q$, $m$ should be as small as
possible ($m \ge 1$). The best choice of $m$ with respect to $n$ may differ,
depending on the NTT application in question.
We have not found any general structure of the prime factorisations of
composite moduli $q = 2^n - 2^m + 1 = 2^m(2^{n-m} - 1) + 1$. However, it may
be profitable to consider subsets of this set of moduli for which the NTT in
the corresponding integer rings possesses some of the desirable properties.
Such a subset may, for example, consist of moduli for which $n - m$ is
constant. Properties of the NTT in $\mathbb{Z}_q$ can then be examined
separately in each subset.
We see in Table B.1 of Appendix B that there are several prime moduli for
which $n - m$ is small. A report on primes of the form $k \cdot 2^m + 1$ was
published by Robinson in 1958 [83]. In [83] he also presented a table of all
such primes for bounded ranges of $k$ and $m$. Liu et al. [62] considered
primes of the form $2^{2m} - 2^m + 1$, i.e. for $n = 2m$, for some values of
$m$. Number theoretic transforms in the integer ring modulo
$2^{2m} - 2^m + 1$ have also been considered by Dubois and Venetsanopoulos
[38, 39]. Some other researchers have investigated properties of moduli of
the form $3 \cdot 2^m + 1$, i.e. for $n = m + 2$, see for example Golomb
[47] and Golomb et al. [48]. In [48] the authors discuss how to perform
arithmetic operations in $\mathbb{Z}_{3 \cdot 2^m + 1}$.
The above-mentioned numbers are all special cases of a more general family
of numbers expressed in terms of integers $q$, $p$, and $n$. In a recent
article by Dimitrov et al. [37], the authors define number theoretic
transforms in integer quotient rings with such moduli for particular values
of $q$ and $p$, and for some appropriate values of $n$.
In the present thesis we consider moduli $q = 2^m(2^{n-m} - 1) + 1$ for
$n = m + 1$, i.e. moduli of the form $q = 2^m + 1$. For $m$ equal to a power
of two, such numbers are called Fermat numbers.
2.3 The Fermat Number Transform
2.3.1 Fermat Numbers
In this section we study number theoretic transforms in integer quotient
rings with moduli of the form $2^m + 1$ for some $m$.
Theorem 2.2 If $2^m + 1$ is a prime then $m$ is a power of two.
Proof: (From [42, pp. 23–24]) Suppose $m$ has an odd factor $k > 1$, say
$m = nk$. Using the factorisation

$$x^k + 1 = (x + 1)(x^{k-1} - x^{k-2} + x^{k-3} - \cdots + x^2 - x + 1)$$

for $x = 2^n$ we get

$$2^m + 1 = 2^{nk} + 1 = (2^n + 1)(2^{n(k-1)} - 2^{n(k-2)} + 2^{n(k-3)} - \cdots + 2^{2n} - 2^n + 1),$$

which is evidently composite. The only numbers that have no odd factor
greater than 1 are the powers of two. □

We have shown that $2^m + 1$ is not a prime when $m$ is not a power of two,
but for which $m = 2^t$ do we get a prime? The number

$$F_t = 2^m + 1, \qquad m = 2^t,$$

where $t \in \mathbb{N}$, is defined as the $t$th Fermat number.⁵ Fermat
observed that the first five such numbers are all prime:

$$F_0 = 3, \quad F_1 = 5, \quad F_2 = 17, \quad F_3 = 257, \quad F_4 = 65537.$$

Fermat expressed his belief that every $F_t$ is a prime, but admitted that
he had no proof.
From Fermat's little theorem [84, Th. 5.3] it follows that if $p$ is a prime
and $a$ is a positive integer, then $a^p \equiv a \pmod{p}$, that is
$p \mid (a^p - a)$.⁶ In general, if $a$ is a positive integer and $q$ is a
composite positive integer that divides $a^q - a$, then $q$ is usually
called a pseudoprime to the base $a$. One of the reasons for Fermat's
statement that every $F_t$ is a prime may have been that in fact all Fermat
numbers are either primes or pseudoprimes.
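We see that for every Fermat number $F_t = 2^{2^t} + 1$, where
$t \in \mathbb{N}$, the relation $F_t \mid (2^{F_t} - 2)$ holds
[93, Exerc. 2]: For any positive integer $t$ we have $t + 1 \le 2^t$, and
thus $2^{t+1} \mid 2^{2^t}$. Consequently, we have
$(2^{2^{t+1}} - 1) \mid (2^{2^{2^t}} - 1) = (2^{F_t - 1} - 1)$. Because
$2^{2^{t+1}} - 1 = (2^{2^t} + 1)(2^{2^t} - 1)$ we get
$F_t = (2^{2^t} + 1) \mid (2^{F_t - 1} - 1)$ and hence
$F_t \mid (2^{F_t} - 2)$.
Therefore, all composite Fermat numbers $F_t$ are pseudoprimes to the base
2.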
When trying to ﬁnd the factors of composite Fermat numbers, the following
theorem is of good use:
 Henceforth, whenever the number
 
m
 
 appears in the thesis we alwaysmean the Fer-
mat number
F
t, i.e. we implicitly assume
m
 
 
t for some natural number
t.
 Even the ancient Chinese had a test for primality which is similar to Fermat’s little the-
orem. The test said that an integer
p is a prime if and only if
p
j
 
 
p
 
 
 .B y F e r m a t ’ s
little theorem we know that the test is correct when
p is an odd prime, but the converse
is not always true. For example, the ancient Chinese did not discover that the smallest
composite integer that passes their test is
 
 
 
 
 
 
 
 
 . It can easily be veriﬁed that
 
 
 
 
 
 
 
m
o
d
 
 
 
  and thus
 
 
 
j
 
 
 
 
 
 
 
 .12 Chapter 2. Binary Arithmetic in the Fermat Integer Quotient Ring
Theorem 2.3 Every prime divisor of the Fermat number $F_t = 2^{2^t} + 1$, where $t \ge 2$,
is of the form $k \cdot 2^{t+2} + 1$, for some natural number $k$.
Proof: See Section A.2 of Appendix A. The proof involves Euler's theorem
and the concept of quadratic residues. $\square$
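Thus, every prime divisor of $F_t$ is congruent to 1 modulo $2^{t+2}$ for $t \ge 2$. Actually,
because the product of two numbers of the form $kn + 1$ is also of this form,
any divisor of $F_t$ is congruent to 1 modulo $2^{t+2}$ for $t \ge 2$. Lucas [36, pp. 376–379]
was the first to prove that every prime factor of $F_t$ is of the form $k \cdot 2^{t+2} + 1$.
Prior to Lucas' proof Euler showed that $641 = 5 \cdot 2^7 + 1$ is a factor of $F_5$. The
complete factorisations of $F_5$ and $F_6$ are
$$F_5 = 641 \cdot 6700417 = (5 \cdot 2^7 + 1)(52347 \cdot 2^7 + 1) \qquad \text{Euler 1732}$$
$$F_6 = 274177 \cdot 67280421310721 = (1071 \cdot 2^8 + 1)(262814145745 \cdot 2^8 + 1) \qquad \text{Landry 1880}$$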
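The factorisations of $F_5$ and $F_6$, and the factor form stated in Theorem 2.3, can be confirmed with a few lines of integer arithmetic. This is only a sketch; the dictionary `known_factors` and the helper `cofactor_k` are hypothetical names introduced for the illustration.

```python
def fermat_number(t: int) -> int:
    return 2 ** (2 ** t) + 1


def cofactor_k(p: int, t: int):
    """Return k if p = k * 2^(t+2) + 1, otherwise None."""
    k, remainder = divmod(p - 1, 2 ** (t + 2))
    return k if remainder == 0 else None


# Prime factors of F_5 (Euler) and F_6 (Landry), as quoted in the text.
known_factors = {5: [641, 6700417], 6: [274177, 67280421310721]}

for t, factors in known_factors.items():
    product = 1
    for p in factors:
        product *= p
        k = cofactor_k(p, t)
        print(f"F_{t}: {p} = {k} * 2^{t + 2} + 1, k odd: {k % 2 == 1}")
    assert product == fermat_number(t)
```

All four values of $k$ come out odd, which is the property used later to bound the transform length for these moduli.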
To this day, no Fermat prime greater than $F_4$ has been found. Since the days of
Euler, finding the prime factors of composite Fermat numbers or proving that
certain Fermat numbers are composite have been two of the most famous problems
in number theory. In 1958, Robinson presented a list [83, Table 2] of all
known prime factors of composite Fermat numbers together with the dates of
discovery. Using today's powerful computing tools still more prime factors
have been found. In [28, page lxxxviii], Brillhart et al. published a table of all
factors of composite Fermat numbers known in 1988. To the author's knowledge,
the largest Fermat number with known factorisation is $F_{11} = 2^{2^{11}} + 1$,
which was factored by Brent and Morain in 1988 using the elliptic curve
method [24], [25]. The ninth Fermat number $F_9$ was factored by A. K. Lenstra,
H. W. Lenstra Jr., M. S. Manasse, and J. M. Pollard in 1990 by means of the
number field sieve [59]. The complete factorisation of
F
 
  is still not known.
The largest Fermat number with a known factor is $F_{23471}$. It is divisible by
$5 \cdot 2^{23473} + 1$.
2.3.2 The Transform Kernel
The number theoretic transform in the Fermat integer quotient ring $\mathbb{Z}_{F_t}$ is often
referred to as the Fermat number transform (FNT). A great advantage of the
FNT is that the possible transform lengths are all highly composite. As shown
in Section 2.3.1, a composite Fermat number $F_t$ can be factorised into prime
powers as
$$F_t = (k_1 2^{t+2} + 1)^{n_1} (k_2 2^{t+2} + 1)^{n_2} \cdots (k_l 2^{t+2} + 1)^{n_l}$$
for some $k_1, k_2, \ldots, k_l$ and $n_1, n_2, \ldots, n_l$. Let $2^{\tilde{k}}$ be a common factor of
$k_1, k_2, \ldots, k_l$ for some $\tilde{k}$. Equation (2.3) then implies that there exist radix-2
transforms in $\mathbb{Z}_{F_t}$. The transform length $N$ must divide $2^{t + \tilde{k} + 2}$. Furthermore, when
$F_t$ is prime the possible lengths $N$ divide $F_t - 1 = 2^{2^t}$. Thus, the radix-2 FNT
in $\mathbb{Z}_{F_t}$ is of length
$$N = 2^b, \qquad \begin{cases} 1 \le b \le t + \tilde{k} + 2, & F_t \text{ is composite} \\ 1 \le b \le m = 2^t, & F_t \text{ is prime} \end{cases} \tag{2.4}$$
Because the FNT length $N$ is a power of two the transform can be computed
using a fast and efficient algorithm. Using the radix-2 Cooley-Tukey FFT algorithm
[35], a transform of length $N = 2^b$ in a Fermat integer quotient ring $\mathbb{Z}_{F_t}$
can be computed using only $(N/2) \log_2 N$ multiplications and $N \log_2 N$ additions
modulo $F_t$. Since elements of the sequence that is to be transformed are
multiplied by powers of the kernel $\omega$, the complexity of computing the transform
depends strongly on the choice of $\omega$.
Using binary arithmetic, multiplication by a power of two can be implemented
in VLSI as binary shifts. We see by the congruence
$$1 \equiv (-1)^2 \equiv 2^{2m} \pmod{2^m + 1}$$
that the integer 2 has order $2m = 2^{t+1}$ modulo $2^m + 1$ and hence can be used
as the kernel of an FNT of length $2m$. Then, all multiplications involved in
the transform computation can be carried out as binary shifts. Equation (2.4)
implies that $N$ must divide $2^{t + \tilde{k} + 2}$ when $F_t$ is composite, i.e. for $t \ge 5$. In particular
it can be verified that $\tilde{k}$ is zero for $F_5$, $F_6$, and $F_7$, i.e. the $k_i$'s in the factorisations
of these numbers are all odd (see Section 2.3.1 and [28, page lxxxviii]).
Thus for $F_5$, $F_6$ and $F_7$ the maximum transform length is $2^{t+2} = 4m$.
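To make the shift argument concrete, the sketch below multiplies by $2^k$ modulo $2^m + 1$ using only a left shift, a split into low and high parts, and one subtraction, relying on $2^m \equiv -1$. It is an illustration in software, not the VLSI architecture studied later in the thesis; `order` and `mul_pow2` are hypothetical helper names.

```python
def order(a: int, q: int) -> int:
    """Multiplicative order of a modulo q (a and q assumed coprime)."""
    k, x = 1, a % q
    while x != 1:
        x = (x * a) % q
        k += 1
    return k


def mul_pow2(x: int, k: int, m: int) -> int:
    """Compute x * 2^k mod (2^m + 1) with shifts, using 2^m = -1 (mod 2^m + 1)."""
    q = (1 << m) + 1
    k %= 2 * m                    # 2 has order 2m, so exponents wrap around
    negate = k >= m               # 2^(m + j) = -(2^j)
    if negate:
        k -= m
    y = x << k                    # plain binary left shift
    low = y & ((1 << m) - 1)      # bits below position m
    high = y >> m                 # bits from position m upwards
    y = (low - high) % q          # fold the high part back with a minus sign
    return (q - y) % q if negate else y


for m in (4, 8, 16):
    print(m, order(2, (1 << m) + 1))   # the order equals 2m in each case
```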
Footnote (on $\tilde{k}$): In general, we have $\gcd(k_1, k_2, \ldots, k_l) = k' \cdot 2^{\tilde{k}}$ for some $k'$ and $\tilde{k}$, but here we are only
interested in the cases when the transform length is a power of two.

A suitable kernel of a $4m$-length transform is an integer that has 2 as its square.
Such an integer exists if the congruence $x^2 \equiv 2 \pmod{F_t}$ has a solution. By
the definition of quadratic residues in Section A.2, the integer 2 is then called
a quadratic residue modulo $F_t$. The least positive solution $x$ to the mentioned
congruence is often denoted $\sqrt{2}$ in the literature. The following theorem says
that there really exists such a solution $x$.
Theorem 2.4 The integer 2 is a quadratic residue modulo each Fermat number $F_t$ for
$t \ge 2$.
Proof: From the proof of Theorem 2.3, given in Section A.2, we know that the
integer 2 is a quadratic residue modulo every odd prime factor $p_i$ of the Fermat
number $F_t = p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$ for $t \ge 2$. Then, 2 is also a quadratic residue modulo
$p_i^{n_i}$ (see for example Stewart, [95, Prop. A.13]). By Proposition A.10 of [95] we
then get that the integer 2 is a quadratic residue modulo $F_t$ for $t \ge 2$. $\square$
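The square of the element $\sqrt{2}$ can be expressed in the following way:
$$(\sqrt{2})^2 \equiv 2 \equiv -2 \cdot 2^m \equiv 2^{m/2}(2^m + 1) - 2 \cdot 2^m = 2^{m/2}\left(2^m - 2 \cdot 2^{m/2} + 1\right) = \left(2^{m/4}(2^{m/2} - 1)\right)^2 \pmod{2^m + 1}$$
and thus we get
$$\sqrt{2} \equiv 2^{m/4}(2^{m/2} - 1) \equiv 2^{3m/4} - 2^{m/4} \pmod{2^m + 1}.$$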
Powers of $\sqrt{2}$ can be written as
$$(\sqrt{2})^n \equiv \begin{cases} 2^{n/2}, & n \text{ even} \\ 2^{(n-1)/2} \sqrt{2} \equiv 2^{3m/4 + (n-1)/2} - 2^{m/4 + (n-1)/2} \pmod{2^m + 1}, & n \text{ odd} \end{cases} \tag{2.5}$$
which means that multiplication by powers of $\sqrt{2}$ can be implemented in VLSI
as binary shifts when the exponent $n$ is even, and two binary shifts and one
addition when the exponent is odd. This is the reason why the element $\sqrt{2}$ is
practically always used as the kernel of the FNT of length $4m$ in $\mathbb{Z}_{2^m+1}$. It can
easily be shown that the order of $\sqrt{2}$ modulo $F_t$ is $4m$ for $t \ge 2$ [2, App. C].
Because we have $4m = 2^m = N_{\max}$ for $m = 4$, the kernel $\sqrt{2}$ will yield the maximum
length FNT in $\mathbb{Z}_{F_2}$. The same kernel will also yield the maximum length
FNT in $\mathbb{Z}_{F_t}$ for $t = 5, 6$, and 7. However, in several applications the transform
length $4m$ is still relatively small. In general, one-dimensional prime field
FNTs of length greater than $4m$ require nontrivial multiplications. For a maximum
length FNT ($N = 2^m$) in a Fermat prime field, the transform kernel must
be a primitive element.
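The closed form for $\sqrt{2}$ and the shift-and-add rule in (2.5) can be checked directly. The following sketch assumes $m$ is a multiple of 4 (i.e. $t \ge 2$); the function names `sqrt2`, `order` and `mul_sqrt2_pow` are hypothetical, and modular exponentiation stands in for the hardware shifts.

```python
def sqrt2(m: int) -> int:
    """The element 2^(3m/4) - 2^(m/4) modulo 2^m + 1 (m a multiple of 4)."""
    q = 2 ** m + 1
    return (2 ** (3 * m // 4) - 2 ** (m // 4)) % q


def order(a: int, q: int) -> int:
    """Multiplicative order of a modulo q (a and q assumed coprime)."""
    k, x = 1, a % q
    while x != 1:
        x = (x * a) % q
        k += 1
    return k


def mul_sqrt2_pow(x: int, n: int, m: int) -> int:
    """x * sqrt(2)^n mod (2^m + 1) following (2.5): one power of two for
    even n, a difference of two powers of two for odd n."""
    q = 2 ** m + 1
    if n % 2 == 0:
        return x * pow(2, n // 2, q) % q
    h = (n - 1) // 2
    return (x * pow(2, 3 * m // 4 + h, q) - x * pow(2, m // 4 + h, q)) % q


for m in (4, 8, 16, 32):
    q = 2 ** m + 1
    r = sqrt2(m)
    print(m, r, r * r % q, order(r, q))   # square is 2, order is 4m
```

For $m = 4$ this gives the element 6 modulo 17, whose square is $36 \equiv 2$ and whose order is 16.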
t    m = 2^t    F_t - 1 = 2^m            N for ω = 2    N for ω = √2    N for primitive ω
0    1          2                        2*             –               –
1    2          4                        4*             –               4*
2    4          16                       8              16*             16*
3    8          256                      16             32              256*
4    16         65536                    32             64              65536*
5    32         4294967296               64             128*            –
6    64         18446744073709551616     128            256*            –
Table 2.2: Some parameters for the FNT. The numbers marked with an asterisk are the maximum
obtainable transform lengths. The kernel in the last column is any primitive element modulo $F_t$.
Every primitive element of a prime field $\mathbb{Z}_p$ has maximum order $p - 1$ modulo $p$.
In Chapter 7 we find use of the following property:
Theorem 2.5 The integer 3 is a primitive element of each Fermat prime field $\mathbb{Z}_{F_t}$
where $t \ge 1$.
Proof: See Appendix A.3. $\square$
Remark: Cunningham (see [36, page 199]) noted that for
t
 
 ,t h ei n t e g e r s
 
 
 
 
 
 
 
 
 
 
  and 12 are all primitive elements of the ﬁeld of integers
modulo a Fermat prime
F
t for
t
 
 .
By Theorem 2.5 the maximum length FNT in a Fermat prime field can be computed
using the primitive element 3 as transform kernel. Table 2.2 shows the
relations between some kernels and their corresponding FNT lengths for the
seven first Fermat numbers.
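For each primitive element $\alpha \in \mathbb{Z}_{2^m+1}$, where $2^m + 1$ is a prime, we have
$\alpha^{2^m} = (\alpha^{2^{m-b}})^{2^b} \equiv 1 \pmod{2^m + 1}$. Because the order of the element
$\alpha^{2^{m-b}}$ modulo $2^m + 1$ equals $2^b$, it may be chosen as the kernel of an FNT of arbitrary length
$N = 2^b$ for $1 \le b \le m$. This is further discussed in Section 7.2.7.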
Footnote (on primitive elements): In general, if $\alpha$ and $q > 1$ are relatively prime integers such that
$\mathrm{ord}_q(\alpha) = \varphi(q)$, where $\varphi$ denotes Euler's totient function, then $\alpha$ is called a primitive root modulo $q$.

As previously mentioned, we would like to calculate the radix-2 FNT with
as low complexity and high performance as possible for every such transform
length $N = 2^b$. Hence, we would like its approximately $N \log_2 N$ additions together
with its $(N/2) \log_2 N$ multiplications by powers of the kernel to be carried
out as simply as possible. In the present section we do not go into detail
about what we mean by 'simple'. Complexity issues are further discussed in
Chapter 4.
One purpose of our work is to find suitable ways of representing the binary
coded integers of $\mathbb{Z}_{2^m+1}$, in order to simplify the arithmetic operations (especially
multiplication by powers of the transform kernel) involved in the computation
of the FNT of every possible length $N = 2^b$. We are particularly interested
in the rings for which $2^m + 1$ is a prime, i.e. the Fermat prime fields.
2.3.3 Butterﬂy Computations
The Radix-2 Decimation-In-Time Algorithm
We mentioned above that the FNT of length $N = 2^b$ can be computed using
a radix-2 FFT algorithm. When using the well known decimation-in-time algorithm,
which is due to Cooley and Tukey [35], the FNT of the form in (2.1) is
first split into two parts as follows.
$$\begin{aligned}
X_k &= \sum_{n=0}^{N-1} x_n \omega^{kn} = \sum_{n\ \mathrm{even}} x_n \omega^{kn} + \sum_{n\ \mathrm{odd}} x_n \omega^{kn} \\
&= \sum_{r=0}^{N/2-1} x_{2r}\, \omega_{N/2}^{kr} + \omega^k \sum_{r=0}^{N/2-1} x_{2r+1}\, \omega_{N/2}^{kr} \\
&= G_k + \omega^k H_k \pmod{F_t}, \qquad k = 0, 1, \ldots, N-1,
\end{aligned}$$
where $G_k$ and $H_k$ are the $N/2$-point FNTs of the sequences $\{x_{2r}\}_{r=0}^{N/2-1}$ and
$\{x_{2r+1}\}_{r=0}^{N/2-1}$, respectively. The order of the kernel $\omega_{N/2} = \omega^2$ modulo $F_t$ is $N/2$.
Because $\omega^{N/2} \equiv -1 \pmod{F_t}$ we have $\omega^{k+N/2} \equiv -\omega^k \pmod{F_t}$ and thus the
FNT can be expressed as
$$X_k = G_k + \omega^k H_k \pmod{F_t}$$
$$X_{k+N/2} = G_k - \omega^k H_k \pmod{F_t}$$
for $k = 0, 1, \ldots, N/2 - 1$. The name decimation-in-time is due to the decimation
of $x_n$ by a factor of 2. A repeated decimation of the sequences $\{x_{2r}\}$
and $\{x_{2r+1}\}$ will result in four $N/4$-point FNTs after the second step, eight $N/8$-point
FNTs after the third step and so on, until we end up in $N/2$ 2-point FNTs
after step $\log_2 N - 1$. Thus, the computation of the FNT of length $N = 2^b$ may be
carried out in $\log_2 N$ stages, where each stage consists of $N/2$ 2-point FNTs [74,
Fig. 9.14]. Hence, the FNT can be computed as $(N/2) \log_2 N = (N/2)\, b$ FNTs of
length 2. Figure 2.1 illustrates how such a basic 2-point FNT is computed. Because
of the flow graph symmetry of the 2-point transform, it is usually called
a butterfly. The two output signals from the decimation-in-time butterfly of
Figure 2.1 are
$$u + v\,\omega^r \pmod{F_t}$$
$$u - v\,\omega^r \pmod{F_t}$$
for some $r$ and where $u$ and $v$ are the butterfly inputs. Because each butterfly
involves two additions and one multiplication, the total number of additions
modulo $F_t$ equals $N \log_2 N$ and the total number of multiplications modulo $F_t$
equals $(N/2) \log_2 N$, as we have previously indicated.

Figure 2.1: Butterfly of a radix-2 decimation-in-time FFT.

Footnote (on the derivation): The derivation of the decimation-in-time FFT algorithm can be found in most books on
digital signal processing, e.g. [74, Ch. 9.3.3].
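The decimation-in-time recursion can be sketched in a few lines. The code below is an illustrative software model, not the thesis's architecture: `fnt_direct` evaluates the defining sum, `fnt_dit` applies the split into even- and odd-indexed subsequences, and the parameters ($q = 257$, kernel 2 of order 16, a ramp input) are arbitrary choices for the example.

```python
def fnt_direct(x, omega, q):
    """Defining sum X_k = sum_n x_n * omega^(k*n) mod q, O(N^2) operations."""
    N = len(x)
    return [sum(x[n] * pow(omega, k * n, q) for n in range(N)) % q
            for k in range(N)]


def fnt_dit(x, omega, q):
    """Recursive radix-2 decimation-in-time FNT; len(x) must be a power of two
    and omega must have order len(x) modulo q."""
    N = len(x)
    if N == 1:
        return [x[0] % q]
    omega_half = omega * omega % q          # kernel of the N/2-point transforms
    G = fnt_dit(x[0::2], omega_half, q)     # even-indexed terms
    H = fnt_dit(x[1::2], omega_half, q)     # odd-indexed terms
    X = [0] * N
    w = 1                                   # running twiddle factor omega^k
    for k in range(N // 2):
        t = w * H[k] % q
        X[k] = (G[k] + t) % q               # butterfly, upper output
        X[k + N // 2] = (G[k] - t) % q      # butterfly, lower output
        w = w * omega % q
    return X


q = 2 ** 8 + 1          # F_3 = 257
omega = 2               # order 2m = 16 modulo 257
x = list(range(16))
X = fnt_dit(x, omega, q)
print(X == fnt_direct(x, omega, q))   # True
```

With a power of two as kernel, every product `w * H[k]` in the loop is a multiplication by a power of two, which is what allows the shift-based implementation discussed in the text.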
 
 
When the FFT algorithm is used for computing the ordinary DFT, the real and
imaginary parts of the factors $\omega^r = e^{-j2\pi r/N}$, which are often called the twiddle
factors, are usually stored in a table. This yields the fastest algorithm, at the
cost of a look-up table. Concerning the FNT, by choosing a suitable kernel for
the transform it may not be necessary to store the different powers of the kernel
modulo $F_t$. For example, for $\omega = \sqrt{2}$ and $N = 4m = 2^{t+2}$ multiplication by
powers of $\omega$ can be carried out as two binary shifts and one addition, as mentioned
in connection with (2.5). For such kernels the $b$-bit exponents $r$ may be
generated by some control logic for small transform lengths [101]. For larger
transform lengths, the exponents are preferably stored in a table [90]. However,
if there is no suitable kernel $\omega$ for which multiplication by powers of $\omega$
can be carried out more simply than the procedure for general multiplication, then
we may still want a table of the twiddle factors involved in the computation
of the transform.

Footnote (on the operation count): Subtraction is regarded as addition, because it can be carried out by adding the minuend
to the negated subtrahend (see Section 5.1.3).

Figure 2.2: Butterfly of a radix-2 decimation-in-frequency FFT.
The Radix-2 Decimation-In-Frequency Algorithm
When using the decimation-in-time FFT algorithm, the input sequence must
appear in a bit-reversed order [74, Ch. 9.3.3]. The transformed sequence is,
however, obtained in natural order. Using the radix-2 decimation-in-frequency
FFT algorithm, we have the opposite situation. Then the input occurs in
natural order while the output is obtained in bit-reversed order. The decimation-in-frequency
algorithm is obtained by repeatedly dividing the transform into
two transforms, one which depends on the first half of the sequence and the
other depending on the second half of the sequence. This algorithm is due to
Gentleman and Sande [45].
As for the decimation-in-time algorithm, the decimation-in-frequency algorithm
also divides the $N$-length transform into $\log_2 N$ stages of $N/2$ butterflies.
Figure 2.2 shows the butterfly for the decimation-in-frequency algorithm.
For the butterfly input variables $u$ and $v$, we have the output variables
$$u + v \pmod{F_t}$$
and
$$(u - v)\,\omega^r \pmod{F_t}$$
for some $r$.
The computations in both the decimation-in-time and decimation-in-frequency
algorithms are done in place, which means that the same memory locations
that hold the $N$ elements of the sequence $\{x_n\}$ can be used to store the results
of the butterfly computations at each of the $\log_2 N$ stages. Also, both algorithms
involve $(N/2) \log_2 N$ butterfly operations, each consisting of one multiplication
by a twiddle factor and two additions. The two algorithms can be
arranged such that both the input and output sequences are maintained in
natural order. However, the resulting algorithms are no longer in-place algorithms,
which implies that additional memory is required.
Remark: Because of the similarity between the FNT and its inverse transform,
they can be computed using the same FFT algorithm. The two transforms
differ only in the factor $1/N$ and the sign of the exponent of $\omega$.
Radix-4 Algorithms
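If $b = \log_2 N$ is even we have $N = 2^b = 4^{b/2}$ and thus the transform can be computed
using a radix-4 FFT algorithm. Such an algorithm can be obtained by repeatedly
dividing the input sequence into four parts in a manner that is similar
to the procedure for deriving a radix-2 algorithm [74, Ch. 9.3.4]. The radix-4
FFT algorithm consists of $b/2$ stages of $N/4$ butterflies. The four outputs, say
$y_0$, $y_1$, $y_2$, and $y_3$, of a decimation-in-time butterfly can be expressed in matrix
form as
$$\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega^{N/4} & \omega^{N/2} & \omega^{3N/4} \\ 1 & \omega^{N/2} & \omega^{N} & \omega^{3N/2} \\ 1 & \omega^{3N/4} & \omega^{3N/2} & \omega^{9N/4} \end{pmatrix} \begin{pmatrix} u_0 \\ \omega^{r} u_1 \\ \omega^{2r} u_2 \\ \omega^{3r} u_3 \end{pmatrix} \pmod{F_t}$$
for some $r$ and where $u_0$, $u_1$, $u_2$, and $u_3$ are the four butterfly input data. Because
the order of $\omega^{N/4}$ modulo $F_t$ is 4 we get the congruences
$\omega^{N} \equiv 1 \pmod{F_t}$, $\omega^{N/2} \equiv \omega^{3N/2} \equiv -1 \pmod{F_t}$, $\omega^{3N/4} \equiv -\omega^{N/4} \pmod{F_t}$,
and $\omega^{9N/4} \equiv \omega^{N/4} \pmod{F_t}$. In order to reduce the number of additions,
the butterfly is usually derived from the following factorised twiddle-factor
matrix:
$$\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega^{N/4} & \omega^{N/2} & \omega^{3N/4} \\ 1 & \omega^{N/2} & \omega^{N} & \omega^{3N/2} \\ 1 & \omega^{3N/4} & \omega^{3N/2} & \omega^{9N/4} \end{pmatrix} \equiv \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & \omega^{N/4} \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -\omega^{N/4} \end{pmatrix} \begin{pmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & -1 \end{pmatrix} \pmod{F_t}$$

Figure 2.3: Butterfly of a radix-4 decimation-in-time FFT.

A radix-4 decimation-in-time butterfly is shown in Figure 2.3. The two-stage
structure of the butterfly is due to the factorisation of the twiddle-factor matrix.
Note that the input is in bit-reversed order, because then the computations
can be carried out in place.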
For the ordinary DFT, which has kernel $\omega = e^{-j2\pi/N}$, we have $\omega^{N/4} = -j$ (see
[74, Eq. 9.3.44]).
Let $\beta \equiv \omega^{N/4} \pmod{F_t}$. It can be proved that for prime $F_t \ge 5$, the four incongruent
solutions to the congruence $\beta^4 \equiv 1 \pmod{F_t}$ are $\pm 1$ and $\pm 2^{m/2}$
modulo $F_t$. By Theorem 8.8 in [84] there are $\varphi(4) = 2$ incongruent integers of
order 4 modulo a prime $F_t$. Obviously, the integers
$$2^{m/2} \pmod{F_t} \qquad \text{and} \qquad -2^{m/2} \equiv 2^{3m/2} \pmod{F_t}$$
are these two incongruent integers. In particular, we see that for the FNTs of
length $N = 2m$ and $N = 4m$ and with kernels $\omega = 2$ and $\omega = \sqrt{2} \pmod{F_t}$,
respectively, we have $\beta \equiv \omega^{N/4} \equiv 2^{m/2} \pmod{F_t}$.
We showed earlier that for composite $F_t$, the maximum radix-2 FNT length
is at least $4m = 2^{t+2}$. Because the order of 2 modulo $F_t$ is $2m = 2^{t+1}$ it
follows that for all transform lengths $N = 2^b$, where $1 \le b \le t + 1$, we have
$\mathrm{ord}_{F_t}(2^{2m/N}) = N$. Therefore, for every such transform length there exists a kernel
which is a power of two.

Footnote (on the maximum length): For most composite $F_t$ with at least one factor known, the maximum length is exactly $4m$.
Hence, by choosing a suitable kernel $\omega$, the radix-4 butterfly multiplication by
$\omega^{N/4}$ can simply be carried out as some binary shifts modulo $F_t$ for every Fermat
number $F_t$ and every possible transform length in $\mathbb{Z}_{F_t}$. Therefore, using
three general multiplications and eight additions modulo $F_t$ per butterfly, a
radix-4 FNT can be computed using $3 \cdot (N/4) \log_4 N = (3N/8) \log_2 N$ multiplications
and $8 \cdot (N/4) \log_4 N = N \log_2 N$ additions modulo $F_t$.
Compared with the radix-2 FNT algorithm, the radix-4 algorithm requires 25%
fewer multiplications but the same number of additions, i.e. we get the same
complexity reduction as is obtained for the "ordinary" radix-4 DFT (see [74,
Ch. 9.3.4]).
By using appropriate decimating procedures, it is also possible to define fast
algorithms for radix-$r$ transforms for $r > 4$. These algorithms are quite similar
to the radix-2 and radix-4 algorithms, and they do not result in a significant
reduction of the number of arithmetic operations. Therefore, they are not considered here.
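The two-stage structure of the radix-4 butterfly can be illustrated with a short sketch. The input and output names (`u`, `y`), the helper `radix4_butterfly` and the parameter choice ($q = 257$, kernel 2, $N = 16$, so that $\omega^{N/4} = 16$) are assumptions made for this example; the reference computation multiplies by the unfactorised $4 \times 4$ matrix of powers of $\omega^{N/4}$.

```python
def radix4_butterfly(u, r, omega, N, q):
    """Radix-4 decimation-in-time butterfly in factorised (two-stage) form.
    u holds the four inputs; returns the four outputs modulo q."""
    beta = pow(omega, N // 4, q)                 # element of order 4
    # Three general multiplications by twiddle factors.
    t0 = u[0] % q
    t1 = u[1] * pow(omega, r, q) % q
    t2 = u[2] * pow(omega, 2 * r, q) % q
    t3 = u[3] * pow(omega, 3 * r, q) % q
    # First stage: four additions, pairing inputs (0, 2) and (1, 3).
    s0, s1, s2, s3 = t0 + t2, t0 - t2, t1 + t3, t1 - t3
    # Second stage: one multiplication by beta and four additions.
    b = beta * s3
    return [(s0 + s2) % q, (s1 + b) % q, (s0 - s2) % q, (s1 - b) % q]


def radix4_reference(u, r, omega, N, q):
    """Same butterfly via the full matrix with entries beta^(i*j)."""
    beta = pow(omega, N // 4, q)
    t = [u[j] * pow(omega, j * r, q) % q for j in range(4)]
    return [sum(pow(beta, i * j, q) * t[j] for j in range(4)) % q
            for i in range(4)]


q, omega, N = 257, 2, 16
u = [5, 7, 11, 13]
print(all(radix4_butterfly(u, r, omega, N, q) == radix4_reference(u, r, omega, N, q)
          for r in range(4)))   # True
```

In this example `beta` equals 16, a power of two, so the one multiplication in the second stage would reduce to a shift in a binary implementation.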
The Split-Radix Algorithm
The split-radix algorithm, which is due to Duhamel and Hollman [40], [41], is
presently the most efficient radix-2 FFT algorithm. The decimation-in-frequency
algorithm is derived by using a radix-2 decomposition of the even-indexed
terms and a radix-4 decomposition of the odd-indexed terms. In the first stage,
the even-indexed terms are inputs to a radix-2 transform of length $N/2$ and the
odd-indexed terms are again decomposed into two sequences of length $N/4$,
which become the inputs of two radix-4 transforms. The even-indexed terms
are given by
$$X_{2k} = \sum_{n=0}^{N/2-1} \left(x_n + x_{n+N/2}\right) \omega^{2kn} \pmod{F_t}$$
for $k = 0, 1, \ldots, N/2 - 1$ and the two radix-4 transforms are given by
$$X_{4k+1} = \sum_{n=0}^{N/4-1} \left[\left(x_n - x_{n+N/2}\right) + \omega^{N/4}\left(x_{n+N/4} - x_{n+3N/4}\right)\right] \omega^{n} \omega^{4kn}$$
and
$$X_{4k+3} = \sum_{n=0}^{N/4-1} \left[\left(x_n - x_{n+N/2}\right) - \omega^{N/4}\left(x_{n+N/4} - x_{n+3N/4}\right)\right] \omega^{3n} \omega^{4kn}$$
for $k = 0, 1, \ldots, N/4 - 1$ and where both congruences are reduced modulo
$F_t$. As shown above, in $\mathbb{Z}_{F_t}$ the factor $\omega^{N/4}$ equals some power of two.
Thus, using a binary coded element representation, multiplication by $\omega^{N/4}$ can
be carried out as binary shifts modulo $F_t$. Figure 2.4 shows a butterfly of a
split-radix decimation-in-frequency FFT.

Figure 2.4: Butterfly of a split-radix decimation-in-frequency FFT.
In the ﬁrst stage of the algorithm, the input variables
 
 
 
 
 
 
 
 ,a n d
 
  are
x
n
 
x
n
 
N
 
 
 
x
n
 
N
 
 ,a n d
x
n
 
 
N
 
 , respectively, for some
n. Theoutput variables
 
  and
 
  are used to calculate some of the even-indexed terms of the trans-
formed sequence, and
 
  and
 
  are used to calculate the terms with odd in-
dices of the forms
 
k
 
 and
 
k
 
  , respectively, for some
k.
Because the split-radix algorithm is a kind of mixture of a radix-2 and a radix-4
FFT, it does not progress stage by stage. Therefore, the indexing will be more
complicated compared with for example a fixed-radix FFT algorithm. It has
been shown that a split-radix FFT can be computed using in the order of
$(N/3) \log_2 N$ multiplications and $N \log_2 N$ additions for great transform
lengths $N$ (see for example Proakis et al. [75, Ch. 2.14] or Skodras and Constantinides [94]).
As seen above, the only arithmetic operations that are involved in the computation
of the FNT and its inverse transform are addition, subtraction (i.e. negation
followed by addition), multiplication by powers of the transform kernel,
and multiplication by powers of two modulo $F_t$. In this thesis we mainly focus
on these arithmetic operations and others that may be needed in connection
with the transform computation. Examples of such operations are general
multiplication, the discrete logarithm, and exponentiation modulo $F_t$. We do
not care about which FFT algorithm is used (radix-2, radix-4, split-radix, or
any other). We are only interested in the arithmetic operations involved in the
computation of the transform.

Integer        Normal binary coded repr.
$2^m$          1000···000
$2^m - 1$      0111···111
$2^m - 2$      0111···110
$2^m - 3$      0111···101
  ⋮               ⋮
3              0000···011
2              0000···010
1              0000···001
0              0000···000
Table 2.3: The normal binary coded integer representation.
2.4 Element Representation
We mentioned in Section 2.1 that we represent the elements of the Fermat integer
quotient rings $\mathbb{Z}_{2^m+1}$ as binary coded integers and use binary logic circuits
in the VLSI architectures for the arithmetic operations in $\mathbb{Z}_{2^m+1}$. It is clear that
$m + 1$ bit positions are needed to represent the $2^m + 1$ elements of $\mathbb{Z}_{2^m+1}$. Thus,
there are
$$\underbrace{2^{m+1}\left(2^{m+1} - 1\right)\left(2^{m+1} - 2\right) \cdots \left(2^{m+1} - 2^m\right)}_{2^m + 1\ \text{factors}} = \frac{(2^{m+1})!}{(2^m - 1)!}$$
different ways of representing these elements. The very well known normal
binary coded representation of integers is illustrated in Table 2.3.
This representation, however, may not be the best one with respect to the complexity
and performance of the VLSI architectures for arithmetic operations in
$\mathbb{Z}_{2^m+1}$. Depending on how complexity and performance are defined, it may require
a great effort to find the 'optimum' representation among the
$(2^{m+1})!/(2^m - 1)!$ possible ones, e.g. there are about $2 \cdot 10^{23}$ ways to represent
the 5-bit binary coded integers of $\mathbb{Z}_{2^4+1}$. We therefore choose to restrict ourselves
to consider a subset of representations that can be expressed as elementary
functions of the normal binary coded representation.
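The size of this search space is easy to evaluate. The sketch below simply computes the quotient of factorials given above; `num_representations` is a hypothetical helper name.

```python
from math import factorial


def num_representations(m: int) -> int:
    """Number of ways to assign distinct (m+1)-bit codewords to the
    2^m + 1 elements of the ring: (2^(m+1))! / (2^m - 1)!."""
    return factorial(2 ** (m + 1)) // factorial(2 ** m - 1)


for m in (1, 2, 4):
    print(m, num_representations(m))
```

For $m = 1$ (the ring of integers modulo 3, two-bit codewords) the count is $4 \cdot 3 \cdot 2 = 24$; for $m = 4$ it is roughly $2 \cdot 10^{23}$, which is why an exhaustive search over representations is not attempted.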
The first form of representation considered is the normal binary coded representation.
In Chapter 5 we study VLSI architectures for arithmetic operations
using this representation. Linear coordinate transformations of the normal binary
coded representation and the corresponding VLSI architectures are considered
in Chapter 6. Finally, in Chapter 7 we particularly focus on the polar
representation, which can be regarded as a nonlinear coordinate transformation
of the normal binary coded representation.

Chapter 3
Applications
The Fermat number transform (FNT) is one of the most useful and powerful
number theoretic transforms. As mentioned in Chapter 1, in the beginning of
the 1970's the interesting properties of the FNT attracted several researchers. In
this chapter we describe some of the main applications of the FNT. In particular,
we consider digital convolution and correlation in Fermat integer quotient
rings and Reed-Solomon codes over Fermat prime fields.
There are also other applications of the FNT. Siu and Constantinides [87] have
shown that the number of multiplications required to compute the discrete
Fourier transform can be reduced by using number theoretic transforms. In
[88] they particularly consider the FNT for reducing the complexity of computing
the discrete Fourier transform. Truong et al. [102] later considered the
computation of the discrete Fourier transform using the FNT in a quadratic
residue Fermat number system. Several other researchers have also studied
the computation of the discrete Fourier transform using number theoretic
transforms.
Boussakta and Holt have shown that the discrete Hartley transform can be
calculated using the FNT [20, 21]. In [22], the same authors showed how to
compute the Walsh-Hadamard transform using the FNT and vice versa. Two
decades ago, Rader [78] discussed number theoretic transforms for use in a
block-mode image filtering scheme. A microprocessor-based architecture for
block-mode image filters using the FNT was later implemented in VLSI by
Shakaff et al. [90].
Boussakta et al. [23] showed that the FNT of periodic data has a regular structure
with many transform components equal to zero. Any small imperfection
in the periodic data significantly changes the high regularity of its FNT. As a
consequence of the results in [23], the authors conclude that the FNT is highly
applicable in areas such as the detection of errors in mask making for
integrated circuit design and defect detection in industrial inspection. They
also suggest applications for image compression and data storage, where only
the nonzero elements of the FNT of periodic data need to be stored together
with their locations.
3.1 Convolution and Correlation of Real Integer
Sequences
Discrete convolution and correlation are two very common operations in digital
signal processing (see for example Blahut [17]). The cyclic convolution of
two sequences $\{x_n\}_{n=0}^{N-1}$ and $\{h_n\}_{n=0}^{N-1}$ is given by the sum
$$y_n = \sum_{k=0}^{N-1} x_k\, h_{(n-k) \bmod N}, \qquad n = 0, 1, \ldots, N-1 \tag{3.1}$$
Correlation and convolution are computationally equivalent. The cross-correlation
of two sequences $\{x_n\}$ and $\{h_n\}$ is obtained by convolving $\{x_n\}$ with
$\{h_{-n}\}$.
Like the discrete Fourier transform the FNT also has the cyclic convolution
property, i.e. the transform of a cyclic convolution of two sequences is equal
to the product of their transforms. Because the method of computing the convolution
sum using transform calculations is often faster than the direct computation
of the sum, the procedure is sometimes called fast convolution. The
method is particularly efficient when the sequence length is highly composite,
because then some FFT algorithm can be applied to compute the transform.
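The cyclic convolution property can be demonstrated with a small sketch. The sequences below are hypothetical illustration values, not data from the thesis; the modulus 257 and the kernel 16 (which has order 4 modulo 257) are assumptions chosen so that a 4-point transform exists, and the transform is evaluated directly rather than with an FFT.

```python
def fnt(x, omega, q):
    """Direct evaluation of X_k = sum_n x_n * omega^(k*n) mod q."""
    N = len(x)
    return [sum(x[n] * pow(omega, k * n, q) for n in range(N)) % q
            for k in range(N)]


def cyclic_convolution_direct(x, h):
    """The convolution sum (3.1) over the ordinary integers."""
    N = len(x)
    return [sum(x[k] * h[(n - k) % N] for k in range(N)) for n in range(N)]


def cyclic_convolution_fnt(x, h, omega, q):
    """Transform both sequences, multiply pointwise, transform back."""
    N = len(x)
    X, H = fnt(x, omega, q), fnt(h, omega, q)
    Y = [(a * b) % q for a, b in zip(X, H)]
    y = fnt(Y, pow(omega, -1, q), q)         # inverse kernel omega^(-1)
    n_inv = pow(N, -1, q)                    # scaling factor N^(-1) mod q
    return [(n_inv * v) % q for v in y]


x = [1, 2, 3, 4]      # illustrative input sequence
h = [4, 3, 2, 1]      # illustrative impulse response
print(cyclic_convolution_direct(x, h))           # [24, 22, 24, 30]
print(cyclic_convolution_fnt(x, h, 16, 257))     # [24, 22, 24, 30]
```

The two results agree because every value of the true convolution is smaller than the modulus, so nothing is lost by reducing modulo 257.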
It is often possible, and sometimes preferable, to let computations in one
algebraic field be carried out in another field, which is then usually called a
surrogate field. Depending on the application in question, this computational
procedure may also apply to rings. A computation of interest where this is
applicable is convolution via transform calculations. Using a computer or a
digital signal processor, these calculations are often carried out in the complex
field $\mathbb{C}$, i.e. the discrete Fourier transform is used. However, if the sequences
that are to be convolved consist of real integers, the convolution can instead be
computed in an integer quotient ring $\mathbb{Z}_q$, for some suitable modulus $q$ [2].
There are some advantages of computing the transforms in $\mathbb{Z}_q$ rather than in
the complex field: A complex multiplication requires several real multiplications
while a multiplication in $\mathbb{Z}_q$ is a single and often simpler operation (integer
multiplication). The computation precision is also improved since computations
in a finite ring are exact. Another very important consequence of the
simplified arithmetic is that, depending on $q$, a hardware implementation of a
transform in $\mathbb{Z}_q$ can have lower complexity and higher performance than the
corresponding implementation of the transform in $\mathbb{C}$.
The modulus $q$ must be chosen such that every element $x_n$, $h_n$, and $y_n$, for
$n = 0, 1, \ldots, N-1$, is contained in the ring $\mathbb{Z}_q$. Because of the congruence relation
modulo $q$ in the ring $\mathbb{Z}_q$, negative integers are represented as positive integers,
in accordance with the congruence $-x \equiv q - x \pmod{q}$.
In the following example we illustrate how discrete cyclic convolution of real
integers can be computed in an integer quotient ring.
Example 3.1 If the convolution of two positive real integer sequences $x$ and $h$
is to be carried out in the surrogate field $\mathbb{Z}_q$, then the greatest integer in the
convolution sum must be less than the modulus $q$, i.e. the dynamic range of
$y_n$ (and of $x_n$ and $h_n$) must not exceed $q - 1$. For
x
 
f
 
 
 
 
 
 
 
 
gand
h
 
f
 
 
 
 
 
 
 
 
g,
by (3.1) we can compute the convolution
y
 
f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
g.T h ep r i m e
modulus $q = 2^8 + 1 = 257$ is greater than the maximum value of $y$. Consequently,
this convolution can be carried out in $\mathbb{Z}_{257}$. Furthermore, because
the sequences involved have length 4, which divides $257 - 1 = 256$, the convolution
$y$ can be obtained by using FNT calculations in $\mathbb{Z}_{257}$.
Because the order of the integer 16 is 4 modulo 257, it can be chosen as the
kernel $\omega$ of an FNT of length $N = 4$. Thus, the 4-point FNT of $x$ is
$$X_k = \sum_{n=0}^{3} x_n\, 16^{kn} \pmod{257}, \qquad k = 0, 1, 2, 3$$
with a similar relation for the transform of $h$. Using matrix notations we have
 
B
B
 
X
 
X
 
X
 
X
 
 
C
C
A
 
 
B
B
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
C
C
A
 
B
B
 
 
 
 
 
 
 
C
C
A
 
 
B
B
 
 
 
 
 
 
 
 
 
 
 
 
 
C
C
A
 
m
o
d
 
 
 
 28 Chapter 3. Applications
and
 
B
B
 
H
 
H
 
H
 
H
 
 
C
C
A
 
 
B
B
 
 
 
 
 
 
 
 
 
 
C
C
A
 
m
o
d
 
 
 
 
 
Each component $Y_k$ of the FNT of $y$ is then obtained by multiplying $X_k$ by $H_k$
modulo 257, which gives
 
B
B
 
Y
 
Y
 
Y
 
Y
 
 
C
C
A
 
 
B
B
 
 
 
 
 
 
 
 
 
 
 
 
C
C
A
 
m
o
d
 
 
 
 
 
Regarding the inverse transform we need to know
N
 
  and
 
 
 .F r o m t h e
congruences
 
 
 
 
 
 
 
 
m
o
d
 
 
 
  and
 
 
 
 
 
 
 
 
 
m
o
d
 
 
 
  we get
N
 
 
 
 
 
 
 
 
 
 
 
m
o
d
 
 
 
  and
 
 
 
 
 
 
 
 
 
 
 
 
 
m
o
d
 
 
 
 , respectively.
Hence, the inverse transform is
 
B
B
 
y
 
y
 
y
 
y
 
 
C
C
A
 
 
 
 
 
 
B
B
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
C
C
A
 
B
B
 
 
 
 
 
 
 
 
 
 
 
 
C
C
A
 
 
B
B
 
 
 
 
 
 
 
 
 
 
 
 
 
 
C
C
A
 
m
o
d
 
 
 
 
 
which agrees with the convolution
y obtained when using the conventional
convolution sum in (3.1).
￿
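The procedure of Example 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the modulus 257 and the kernel 16 are taken from the example, whereas the input sequences `x` and `h` and all function names are hypothetical and chosen here for demonstration.

```python
# Minimal sketch of a 4-point FNT-based cyclic convolution in Z_257.
Q = 257       # Fermat prime 2**8 + 1 (from Example 3.1)
ALPHA = 16    # kernel: 16 has order 4 modulo 257 (from Example 3.1)


def fnt(seq, kernel, q=Q):
    """Transform seq with the given kernel, all arithmetic modulo q."""
    n = len(seq)
    return [sum(seq[i] * pow(kernel, i * k, q) for i in range(n)) % q
            for k in range(n)]


def inverse_fnt(spec, kernel, q=Q):
    """Inverse transform: N^{-1} times the transform with kernel^{-1}."""
    n_inv = pow(len(spec), q - 2, q)      # Fermat's little theorem, q prime
    kernel_inv = pow(kernel, q - 2, q)
    return [(n_inv * value) % q for value in fnt(spec, kernel_inv, q)]


def cyclic_convolution_fnt(x, h):
    """Multiply the transforms componentwise and transform back."""
    spectrum = [(a * b) % Q for a, b in zip(fnt(x, ALPHA), fnt(h, ALPHA))]
    return inverse_fnt(spectrum, ALPHA)


def cyclic_convolution_direct(x, h):
    """Reference: the conventional cyclic convolution sum over the integers."""
    n = len(x)
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]


x = [1, 2, 3, 4]   # hypothetical input sequence
h = [1, 1, 0, 0]   # hypothetical impulse response
assert cyclic_convolution_fnt(x, h) == cyclic_convolution_direct(x, h)
```

The two results agree only as long as every output value is smaller than the modulus, which is the dynamic range condition stated at the start of the example.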
Let $x$ be the input sequence of a linear time-invariant system with impulse response $h$. Let $A$ be the dynamic range of $x_n$, i.e. we have $|x_n| \le A$ for $n = 0, 1, \ldots, N-1$. If $x_n$ can take on negative values, the convolution sum yields
$$|y_n| \le A \sum_{k=0}^{N-1} |h_k| \le \frac{q-1}{2}$$
and thus
$$A \le \frac{q-1}{2 \sum_{k=0}^{N-1} |h_k|}.$$
If $A$ is also the dynamic range of $h_n$ and the computations are carried out in $Z_{2^m+1}$, we get $|h_k| \le A$, $q = 2^m + 1$, and $N = 2^b$. Thus, we have
$$A \le \sqrt{\frac{2^m + 1 - 1}{2 \cdot 2^b}} = 2^{(m-b-1)/2},$$
which implies that the maximum dynamic range is
$$A = \left\lfloor 2^{(m-b-1)/2} \right\rfloor,$$
i.e. the greatest integer less than or equal to $2^{(m-b-1)/2}$. Table 3.1 shows the dynamic range of $x_n$ and $h_n$ for some values of $m$ and $b$.

  b \ m |   2    4    8    16    32    64
  ------+---------------------------------
     1  |   1    2    8   128     ?     ?
     2  |   0    1    4    64     ?     ?
     3  |   –    1    4    64     ?     ?
     4  |   –    0    2    32     ?     ?
     5  |   –    –    2    32     ?     ?
     6  |   –    –    1    16     ?     ?
     7  |   –    –    1    16     ?     ?
     8  |   –    –    0     8     –     ?
     9  |   –    –    –     8     –     –
    10  |   –    –    –     4     –     –
    11  |   –    –    –     4     –     –
    12  |   –    –    –     2     –     –
    13  |   –    –    –     2     –     –
    14  |   –    –    –     1     –     –
    15  |   –    –    –     1     –     –
    16  |   –    –    –     0     –     –
  (? = entry not legible in the source)

Table 3.1: The dynamic range of $x_n$ and $h_n$, for which the corresponding sequences are of length $N = 2^b$. In $Z_{2^m+1}$ we have $1 \le b \le m$ for $m = 1, 2, 4, 8$ and 16, and $2^b \le 4m$ for $m = 32$ and $m = 64$.

Because of the relatively poor dynamic range for small $m$, digital filtering of real integer sequences is generally considered to be applicable primarily for
m
 
 
 .
A common situation in filtering applications is the filtering of a relatively long sequence by an FIR filter of much shorter length. This involves a linear convolution of great length which can be impractical to compute. There exist, however, two well known techniques that simplify the computation of great-length linear convolutions: using the overlap-add method or the overlap-save method, the longer sequence is sectioned into shorter-length subsequences that are cyclically convolved with the impulse response [79, Ch. 2.25]. Truong et al. [103, 107] have devised a general overlap-save method for filters of arbitrary length using the Fermat number transform.
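The sectioning idea can be illustrated with a small overlap-add sketch. It is not the method of Truong et al.; it is a generic textbook variant, with hypothetical function names, in which each block is zero-padded so that a cyclic convolution of length `block_len + len(h) - 1` equals the linear convolution of that block. The cyclic convolution is computed directly here; in an FNT setting it would be the step carried out by the transform.

```python
def cyclic_convolution(a, b):
    """Cyclic convolution of two equally long integer sequences."""
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]


def overlap_add(x, h, block_len):
    """Linear convolution of a long x with a short h via short cyclic convolutions."""
    m = len(h)
    n = block_len + m - 1                 # cyclic length without wrap-around
    h_padded = list(h) + [0] * (n - m)
    y = [0] * (len(x) + m - 1)
    for start in range(0, len(x), block_len):
        block = list(x[start:start + block_len])
        block += [0] * (n - len(block))   # zero-pad the section
        part = cyclic_convolution(block, h_padded)
        for i, value in enumerate(part):
            if start + i < len(y):        # tail of the last block is all zeros
                y[start + i] += value     # overlapping tails are added
    return y


def linear_convolution(x, h):
    """Reference: direct linear convolution."""
    y = [0] * (len(x) + len(h) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(h):
            y[i + j] += a * b
    return y
```

For example, `overlap_add(list(range(1, 11)), [1, 2, 1], 4)` processes the ten input samples in sections of four and returns the same sequence as `linear_convolution`.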
An important application of the cyclic convolution property is multiplication of (large) integers. Let $u$ and $v$ be two $m$-bit normal binary coded integers, i.e. $u = \sum_{n=0}^{m-1} u_n 2^n$ and $v = \sum_{n=0}^{m-1} v_n 2^n$ where $u_n, v_n \in Z_2$. The procedure for multiplying $u$ by $v$ is equivalent to the convolution $(u * v)_n$. The direct convolution requires in the order of $m^2$ bit operations. If $m$ is a power of 2, this complexity can be reduced to approximately $m^{\log_2 3}$ bit operations by using the Karatsuba-Ofman algorithm [55], [4, Ch. 2.6].
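The equivalence between multiplication and convolution can be made concrete with a short sketch. The function names are hypothetical. The bit sequences are convolved directly (the order $m^2$ method), and the carries are resolved implicitly when the convolution terms are weighted by powers of two.

```python
def to_bits(value, m):
    """Bits u_0, ..., u_{m-1} of an m-bit integer, least significant first."""
    return [(value >> n) & 1 for n in range(m)]


def multiply_by_convolution(u, v, m):
    """Multiply two m-bit integers by convolving their bit sequences."""
    u_bits, v_bits = to_bits(u, m), to_bits(v, m)
    conv = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            conv[i + j] += u_bits[i] * v_bits[j]   # (u * v)_n, no carries yet
    # Weighting term n by 2**n performs the carry propagation.
    return sum(term << n for n, term in enumerate(conv))
```

A fast multiplication algorithm replaces the double loop by a fast convolution, for instance one based on a number theoretic transform.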
The most efficient algorithm for multiplication of large $m$-bit integers, where $m$ is a power of two, is due to Schönhage and Strassen [86]. The algorithm multiplies two $m$-bit normal binary coded integers $u$ and $v$, where $m = 2^t$. The output is the $(m+1)$-bit product of $u$ and $v$ modulo the Fermat number $F_t = 2^m + 1$. The product is computed using the FNT in
Z
F
s for
s
 
 
t
 
 
 
 
  if
$t$ is odd and for
s
 
 
t
 
 
 
 
 if
$t$ is even. The algorithm, which requires in the order of $m \log m \log \log m$ bit operations, is described in English by Aho et al. in [4, Ch. 7.5].
3.2 Decoding of Reed-Solomon Codes
Denote by $GF(q)$ an algebraic finite field of order $q$.* The Galois field Fourier transform (GFT) can be regarded as a generalisation of the well known discrete Fourier transform. The GFT of the vector $v = (v_0, v_1, v_2, \ldots, v_{N-1})$ over $GF(q)$ is the vector $V = (V_0, V_1, V_2, \ldots, V_{N-1})$, where
$$V_j = \sum_{i=0}^{N-1} v_i \alpha^{ij}, \qquad j = 0, 1, \ldots, N-1.$$
The transform kernel $\alpha$ is an element of $GF(q^n)$ of order $N$, where $N$ divides $q^n - 1$ for some positive integer $n$, see for example Blahut [16, Def. 8.1.1]. The inverse GFT of $V$ is given by
$$v_i = N^{-1} \sum_{j=0}^{N-1} V_j \alpha^{-ij}, \qquad i = 0, 1, \ldots, N-1,$$
where the multiplicative inverse $N^{-1}$ is computed modulo the characteristic $p$ of the field $GF(q)$. Each transform component $V_j$ is an element of $GF(q^n)$.

* The only finite field we have considered so far is the prime field $GF(p)$, where $p$ is a prime number. Because the set $Z_p = \{0, 1, \ldots, p-1\}$ of integers modulo the prime $p$ forms the prime field $GF(p)$ under addition and multiplication modulo $p$, the prime field of order $p$ is often denoted $Z_p$.
Let $C = (C_0, C_1, C_2, \ldots, C_{N-1})$ be the GFT of a Reed-Solomon codeword $c = (c_0, c_1, c_2, \ldots, c_{N-1})$ of length $N$. For Reed-Solomon codes the exponent $n$ in the GFT computation equals one, i.e. the transform kernel $\alpha$ is an element of $GF(q)$. The codeword polynomial $c(x) = \sum_{i=0}^{N-1} c_i x^i$, which is associated with the codeword $c$, has $2t$ consecutive powers of $\alpha$ as its roots, where $t$ is the number of errors that can be corrected by the code. Thus, we have
$$c(\alpha^{u+l}) = \sum_{i=0}^{N-1} c_i \alpha^{(u+l)i} = C_{(u+l) \bmod N} = 0$$
for some $u$ and $l = 0, 1, \ldots, 2t-1$. Consequently, each cyclically contiguous transform component $C_{(u+l) \bmod N}$ equals zero. This property may be used to construct Reed-Solomon codes in the transform domain: The encoder first sets $2t$ consecutive* components of $C$ equal to zero. The remaining $K = N - 2t$ positions of $C$ are filled with message symbols. Next, the resulting transform is inverted to produce the desired codeword $c$. Depending on the choice of $q$, $N$, and $K$, this procedure may yield a computational complexity that is smaller than the complexity of the 'direct' computation of the codeword, i.e. by means of polynomial multiplication in the time domain.
The decoding procedure at the receiver's end may also take place in the transform domain (see Blahut [16, Ch. 8–9]). The receiver first computes the GFT vector $R$ from the received vector $r = c + e$, where $e$ is an error vector of length $N$. In the transform domain we have the relation $R = C + E$. The transmitted codeword $c$ can be obtained as the inverse GFT of $C = R - E$. When the encoding takes place in the transform domain, the message symbols may be obtained directly from $C$. Because the encoder has set $2t$ consecutive positions of $C$ equal to zero, $E$ equals $R$ in these positions. The $2t$ corresponding components of $R$ are called the syndromes of $r$. If not more than $t$ errors have occurred, the remaining $N - 2t$ unknown components of $E$ can be recursively computed from the syndromes using for example the Berlekamp-Massey algorithm.
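Transform-domain encoding and syndrome computation can be sketched as follows. All concrete choices are assumptions made for illustration and do not come from the text: the Fermat prime field $GF(17)$, the length $N = 16$, the kernel 3 (a primitive element modulo 17), $t = 2$, and the zero positions $0, \ldots, 2t-1$. The recursive extension of the syndromes (e.g. Berlekamp-Massey) is not included.

```python
P = 17        # assumed Fermat prime field GF(17)
N = 16        # assumed code length
ALPHA = 3     # assumed kernel: 3 has order 16 modulo 17
T = 2         # assumed number of correctable errors


def gft(vec, kernel):
    """V_j = sum_i v_i * kernel**(i*j) over GF(P)."""
    return [sum(vec[i] * pow(kernel, i * j, P) for i in range(N)) % P
            for j in range(N)]


def inverse_gft(spec):
    """v_i = N^{-1} * sum_j V_j * ALPHA**(-i*j) over GF(P)."""
    n_inv = pow(N, P - 2, P)
    kernel_inv = pow(ALPHA, P - 2, P)
    return [(n_inv * value) % P for value in gft(spec, kernel_inv)]


def encode(message):
    """Set 2T transform components to zero, fill the rest, and invert."""
    assert len(message) == N - 2 * T
    return inverse_gft([0] * (2 * T) + list(message))


def syndromes(received):
    """The 2T components of R in the positions the encoder set to zero."""
    return gft(received, ALPHA)[:2 * T]
```

For an error-free codeword all syndromes vanish. A single error of value $e_5$ in position 5 produces the syndromes $e_5 \alpha^{5j}$, $j = 0, \ldots, 3$, in agreement with $R = C + E$.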
When $q$ is a Fermat prime and $n$ equals one, the FNT in the prime field $Z_{F_t}$, $t = 0, 1, \ldots, 4$, is obtained as a special case of the GFT. Justesen [54] was among the first researchers to consider Reed-Solomon codes over Fermat prime fields. He stated that the decoding complexity of such codes can be reduced if the FNT is used to calculate the syndromes.

* Or cyclically contiguous.

Reed, one of the originators of the Reed-Solomon codes, has coauthored several articles concerning fast decoding of Reed-Solomon codes using the FNT. For example, in [80], Reed et al. show how to use the FNT and continued fractions in the decoding procedure. In [82], Reed et al. conclude that a decoder for
Reed-Solomon codes of length
 
t
 
  over
G
F
 
F
t
  using an FNT is simpler than
corresponding decoders for a code of length
 
t
 
  using a GFT in
G
F
 
 
t
. Liu et al. [63] considered Reed-Solomon codes over
G
F
 
F
 
for use in space communication applications. In a recent article by Shiozaki et al. [91], the authors consider a Reed-Solomon code as a special case of a redundant residue polynomial code. They present a fast algorithm for decoding Reed-Solomon codes over $GF(F_t)$ using the FNT and the Euclidean algorithm.

Chapter 4
The VLSI Model
We use complementary metal-oxide-semiconductor (CMOS) circuits in the
VLSI architectures presented in this thesis. The CMOS technology offers high
packing density, high yield, wide noise margin, low power dissipation, and
low cost. Because of these attractive properties, CMOS has become one of the
most important VLSI technologies of today (see for example Weste and Esh-
raghian [113, Ch. 1]). The VLSI model adopted in this thesis (and deﬁned in
the present chapter) is only valid for CMOS and
nMOS circuits.
In integer quotient rings, all arithmetic operations involve modulus reduction. When performing modulus reduction of a binary coded integer, the value in each bit position of the reduced binary coded integer may depend on the value in every bit position of the original binary coded integer. Therefore, depending on the modulus, bit-serial architectures are often impracticable for arithmetic operations in integer quotient rings. This particularly applies to integer arithmetic operations modulo a Fermat number $2^m + 1$. Most of the architectures presented in the subsequent chapters are based on bit-parallel transmission and processing of the data. The main exceptions are the bit-serial/parallel multipliers in Sections 5.1.5, 6.3.6, and 7.6.6 and the bit-serial multipliers in Sections 7.6.5 and 7.6.6.
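To see why a reduced bit may depend on every input bit, consider the following sketch of a reduction modulo $2^m + 1$. It is a generic software illustration, not one of the architectures of this thesis, and the function name is hypothetical. It uses $2^m \equiv -1$, so the upper half of the operand is subtracted from the lower half, and the borrow of that subtraction can ripple through all bit positions.

```python
def reduce_mod_fermat(x, m):
    """Reduce 0 <= x < 2**(2*m) modulo 2**m + 1 using 2**m = -1 (mod 2**m + 1)."""
    q = (1 << m) + 1
    low = x & ((1 << m) - 1)    # x mod 2**m
    high = x >> m               # floor(x / 2**m)
    r = low - high              # x = low + high * 2**m = low - high (mod q)
    if r < 0:
        r += q                  # one conditional correction suffices
    return r
```

For instance, with $m = 4$ the operand 16 has a single nonzero bit, yet its residue modulo 17 is 16, a result that differs from the input pattern in the upper bit positions only through the subtraction and correction steps.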
4.1 Introduction
In the VLSI circuit design process it is important to consider aspects like floorplanning and interconnections. These aspects play a major role when minimising parameters like clock skew, noise, and power dissipation (see the book of Bakoglu [14]). Over the years, two of the main goals for integrated circuit designers have been to minimise the area and maximise the performance of the implemented circuits. In recent years, there has also been an increasing interest in low-power digital CMOS design, see Chandrakasan et al. [30, 31] and Liu [61]. One reason for this is the increasing amount of portable equipment requiring low power. Another reason is that the scaling of digital CMOS circuits results in a higher power consumption.
In Chapters 5, 6, and 7 we investigate different architectures for arithmetic operations. These architectures are mutually compared mainly with respect to their area complexity and time performance. The chip area occupied by the corresponding implemented circuit is denoted by $A$ and the time required to perform the operation is denoted by $T$. In order to take both chip area and computation time into account, we also consider the area-time performance $AT^2$ of each architecture. The $AT^2$ performance is a cost function to be minimised. Thompson [100] is one of the originators of this area-time performance measure. In his paper of 1979 [100] Thompson proposed a VLSI model of computations. Based on this model, he derived a lower bound $N^2$ on the $AT^2$ performance of computing the discrete Fourier transform of length $N$, i.e. $AT^2 = \Omega(N^2)$ for such a computation.* Brent and Kung [26] also did some basic work on VLSI models and complexity. They derived the lower bound $AT^{2\alpha} = \Omega(N^{1+\alpha})$, for $0 \le \alpha \le 1$, on the performance of $N$-bit binary multiplication. A survey of computational algorithms and their VLSI implementation is given by Ullman [104]. For example, in Chapter 2 of [104], Ullman gives an introduction to the area of $AT^2$ performance.
As indicated above, two very important steps in the VLSI design process are floorplanning and the routing of interconnections and communication paths. Interconnections usually occupy a large part of the chip, typically more than fifty percent of the total chip area. The placement of the different modules of the chip is crucial to the interconnection delay. The wire lengths between modules and within each module should be as small as possible in order to get a small interconnection delay.
* By $a(m) = O(b(m))$ (or “$a(m)$ is $O(b(m))$”) we mean that, for increasing $m$, the function $a(m)$ does not grow faster than the function $b(m)$. The notation $a(m) = \Omega(b(m))$ (or “$a(m)$ is $\Omega(b(m))$”) is used to bound the growth rate of $a(m)$ from below. The notations $O$ and $\Omega$ are conventionally used in the area of VLSI complexity, see for example Ullman [104].
The choice of VLSI model differs between researchers. The modelling of the chip area is quite uncontroversial. The main difference lies in the modelling of the interconnection delay. The time for a signal to propagate along a wire of length $l$ is usually modelled as either $O(1)$ (synchronous model), $O(\log l)$ (capacitive model), $O(l)$ (transmission line model), or $O(l^2)$ ($RC$ model), see Bilardi et al. [15] and Bakoglu [14, Ch. 5-6]. The capacitive model, adopted by for example Thompson, is appropriate for short wires and the $RC$ model for long wires. It is, however, common to divide long wires into shorter subsections using repeaters (buffers). These repeaters have the effect of reducing the interconnection delay from $O(l^2)$ to $O(l)$, see for example Bakoglu [14, Ch. 5.4.2].
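The effect of repeaters can be illustrated numerically. The sketch below is an assumption-laden toy model, not taken from Bakoglu: the wire is a uniform distributed $RC$ line whose Elmore delay is half the product of its total resistance and total capacitance, every repeater adds a fixed delay, and the unit values and function names are hypothetical.

```python
import math


def rc_wire_delay(length, r_unit=1.0, c_unit=1.0):
    """Elmore delay of a uniform distributed RC wire: 0.5 * (r*l) * (c*l)."""
    return 0.5 * (r_unit * length) * (c_unit * length)


def repeated_wire_delay(length, segment, repeater_delay=1.0):
    """Delay when the wire is cut into segments, each followed by a repeater.

    The last segment is counted with full length, which slightly
    overestimates the delay when length is not a multiple of segment.
    """
    count = math.ceil(length / segment)
    return count * (rc_wire_delay(segment) + repeater_delay)
```

Doubling the length of an unbuffered wire quadruples `rc_wire_delay`, whereas `repeated_wire_delay` with a fixed segment length only doubles.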
Because device dimensions are getting smaller and chips are getting larger, the lengths of on-chip wires are increasing. Therefore, the interconnection delay is more and more becoming a major factor when determining the overall circuit performance. In this thesis we assume that interconnection delays within each module are $O(1)$, i.e. we adopt the synchronous model. For large systems, the synchronous model gives a gross simplification of the true interconnection delay. Because in general the architectures studied here do not involve global routing (interconnections between modules on the chip) our delay estimations should not, however, considerably deviate from the true intramodular delays.
If we go a couple of steps further in the design process and consider the implemented circuit, then it is simpler to estimate the delays caused by the wiring. It is also simple to estimate the true interconnection delay after the floorplanning and routing steps of the design process. One way of estimating the average lengths of the chip interconnections is to partition the circuit design into different sections and calculate the number of connections between the sections. The average lengths can then be modelled by using Rent's rule, which is described by, for example, Bakoglu [14, Ch. 9.8.1].
Adopting the synchronous model for the interconnection delays does not mean that we disregard the wiring effects. Our effort is to design architectures with a high degree of regularity and with wires only connecting neighbouring gates.
In general, the architectures presented in this thesis do not contain logic circuits for generating control signals. For example, every clock signal is assumed to be available wherever needed and without any clock skew involved.
The phrase “low power” can be found in many current publication titles. Several aspects of low power digital CMOS design can be found in the recently published PhD thesis by Liu [61]. For example, in his thesis Liu considers low power CMOS device design, low power circuit and system techniques, and power estimations in digital CMOS VLSI chips.
In this thesis, we do not give estimates of the power dissipated by the investigated circuits. However, with a low-power design strategy in mind, we often follow the guidelines suggested by Liu [61] and others when choosing clocking strategy and combinational logic circuits.
4.2 Complexity and Performance
4.2.1 The Delay Model
Over the years, the linear switch-level $RC$ model for CMOS transistors has been adopted by many researchers when investigating the timing properties of digital VLSI circuits. We refer to the articles by Ousterhout [70] and Rubinstein et al. [85], and Chapter 1 in Mead and Conway's book [66]. A linear switched $RC$ model for the nMOS transistor is shown in Figure 4.1.
For the sake of simplicity, all nMOS and pMOS transistors are modelled to have the same characteristics, e.g. they have equal size and the pull-down and pull-up times of the nMOS and pMOS transistors, respectively, are the same. The $RC$ model for the pMOS transistor is similar to the nMOS transistor model in Figure 4.1. When a transistor is off, the switch is open and the transistor acts only as a capacitive load to the rest of the circuit. The variables $C_g$, $C_d$, $C_s$, and $R_0$ are the gate, drain, and source capacitances, and the channel resistance, respectively.
A consequence of modelling the non-linear MOS transistors of a circuit as linear switched $RC$ circuits is that the estimated circuit delays are more or less erroneous. A circumstance which is often neglected is the fact that the transistor capacitances and the channel resistance are actually functions of voltage. Inaccuracies in delay computations may also occur because of the difficulties in including input waveform effects. Nevertheless, for most $RC$ models the delay estimations do not deviate more than 20 percent from SPICE simulations, see the articles by Ousterhout [70], Sundblad and Svensson [96], and Hedenstierna and Jeppson [49].
Figure 4.1: Model of an nMOS transistor. (a) Symbolic description. (b) A switched linear $RC$ model of the transistor. The transistor is switched on when the gate voltage $V_G$ is high.

In this thesis we adopt the Penfield-Rubinstein model [85] in which the input voltages are modelled as step waveforms and transistors are modelled as the transistor in Figure 4.1. The delay calculation of a circuit is based on Elmore's delay model for an $RC$ tree without side branches, i.e. an $RC$ chain [43], [85]. For example, the Elmore delay $T_d$ from the input to the output of the $RC$ chain in Figure 4.2 equals
$$T_d = \sum_{i=1}^{4} \Big( \sum_{j=1}^{i} R_j \Big) C_i = R_1 C_1 + (R_1 + R_2) C_2 + (R_1 + R_2 + R_3) C_3 + (R_1 + R_2 + R_3 + R_4) C_4,$$
i.e. each capacitor contributes to the delay as the product of the capacitance and the total resistance between the capacitor and the signal source (or ground).* The delay is defined as the time from the 50-percent level of the input signal waveform to the 50-percent level of the output signal waveform.
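The Elmore sum for an $RC$ chain is straightforward to evaluate. The following sketch uses a hypothetical function name and arbitrary element values; it accumulates the resistance between the source and each capacitor and adds the product with that capacitance.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay of an RC chain: sum_i (R_1 + ... + R_i) * C_i."""
    total = 0.0
    upstream = 0.0                  # resistance between source and node i
    for r, c in zip(resistances, capacitances):
        upstream += r
        total += upstream * c
    return total
```

With four unit capacitors and resistances 1, 2, 3, 4 the contributions are 1, 3, 6, and 10, so `elmore_delay([1, 2, 3, 4], [1, 1, 1, 1])` evaluates to 20.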
Without considering wire capacitances, the capacitive loads in different nodes of most of today's digital CMOS combinational logic circuits are dominated by gate capacitances [113, Ch. 4.3.4], [49]. The main reason for this is that the number of gates connected to a node is often several times greater than the number of drains and sources connected to the node. Furthermore, according to Weste and Eshraghian [113, Ch. 4.3.4] and others, the gate capacitance is typically several times greater than the drain and source capacitances. Due to these facts and in order to obtain a measure of time complexity in a simple form, we generally disregard the effects of the drain and source capacitances on the delays. Thus, the only capacitances and resistances that are involved in our delay computations are gate capacitances and transistor channel resistances, respectively.

* Each capacitor is assumed to be charged (or discharged) through all resistors between the capacitor and the signal source (ground).

Figure 4.2: An $RC$ chain.
4.2.2 Area and Time Complexities
Denote by $A$ the chip area occupied by an implemented circuit. Because the CMOS technology changes so fast, it is essential to have a measure of the chip area that is technology-independent. The area complexities of the architectures considered in this thesis are given in terms of the sizes of the architectures.
Definition 4.1 The size of an architecture is the number of CMOS transistors that form the architecture and is denoted by $C$.
Consequently, with equally sized transistors the chip area $A$ occupied by the implemented circuit, not including the wire area, is proportional to $C$. If the total circuit area is to be determined, the circuit interconnections must also be considered. Even though the area occupied by these interconnections is not considered here, we still strive to design modular and regular architectures in order to reduce the interconnection area and simplify the interconnection work.
The clock frequency of a circuit is related to the length of its critical path. The critical path is the longest path along which signals are pulling up and pulling down circuit transistors, or propagating through them, during one clock interval. The critical path usually starts and ends with a clocked latch or flip-flop. Thus, the minimum clock cycle time of a circuit is proportional to the length of its critical path. Suppose the circuit needs $m$ clock cycles to perform its operation. Then, the computation time is proportional to $m$ times the critical path length.
When determining the critical path, we use the same strategy as the one used in the timing verification program Crystal, which is described by Ousterhout in [70]. The circuit to be examined is decomposed into chains of transistors called stages. A stage runs from the supply voltage source or ground through a number of transistors to the gate inputs of some other transistors.
Definition 4.2 Let $s$ denote a certain stage of a circuit and let $T_s$ denote the delay of that stage. Then, the length $L_s$ of stage $s$ is the ratio of $T_s$ and the time constant $R_0 C_g$, i.e.
$$L_s = \frac{T_s}{R_0 C_g},$$
where $R_0$ and $C_g$ are the linearised MOS transistor channel resistance and gate capacitance, respectively.
The delay of a stage is calculated as Elmore's delay of the $RC$ chain model of the stage. The critical path through a circuit is formed by an ordered set of stages, where each stage gives a separate contribution to the total circuit delay.
Definition 4.3 The length $L_{CP}$ of the critical path (CP) equals the sum of the lengths of the stages that form the critical path.
One of the transistors in each stage is called the trigger. The trigger is the last transistor to turn on in a stage.
Consider, as an example, the circuit in Figure 4.3. This circuit has no relevance, except for being an example. The CP through the circuit is the ordered set $\{s_1, s_2, s_3, s_4\}$ of stages. These stages, which become active for
i
n
 
  ,a r e
signified by dotted lines in the figure. The $RC$ circuits in the bottom of the figure correspond to the equivalent $RC$ models of the four stages $s_1$, $s_2$, $s_3$, and $s_4$. The total delay $T_d$ of the circuit is approximately equal to the sum
d
 
T
 
 
T
 
 
T
 
 
T
 
 
R
 
 
 
C
g
 
 
R
 
 
 
C
g
 
R
 
C
g
 
 
R
 
 
R
 
 
 
n
C
g
  (4.1)
where $T_1$, $T_2$, $T_3$, and $T_4$ are the delay contributions of the stages $s_1$, $s_2$, $s_3$, and $s_4$, respectively, and where $n$ is the fan-out of the circuit, i.e. the number of transistor gates that are driven by the circuit output signal. The trigger of stage $s_2$, $s_3$, $s_4$ is the transistor in the end of stage $s_1$, $s_2$, $s_3$, respectively. By Definitions 4.2 and 4.3, the length of the CP through the circuit in Figure 4.3 equals $L_{CP} = T_d / (R_0 C_g)$.
Figure 4.3: An example of a circuit (the one within the dashed box) that has CP
length
L
C
P
 
 
r
 
 
 
 
n
 
r
 
 
 
 and size
$C = 8$. Here, the output load is strictly capacitive; $n$ is the circuit fan-out. The $RC$ equivalent circuits of the stages $s_1$, $s_2$, $s_3$, and $s_4$ are shown in the bottom of the figure.
Definition 4.4 We define the normalised resistance $r$ of a resistor with resistance $R$ as the ratio
$$r = \frac{R}{R_0},$$
where $R_0$ is the linearised MOS transistor channel resistance.
R
 
 
r
 
R
  and
R
 
 
r
 
R
  we get, from (4.1), the CP length
L
C
P
 
 
r
 
 
 
 
n
 
r
 
 
 
 for the circuit in Figure 4.3. Hence, the time from the ris-
ing/falling of the input signal to the rising/falling of the output signal is pro-
portional to
L
C
P
 
 
r
 
 
 
 
n
 
r
 
 
 
  .W ea l s on o t et h a tt h ec i r c u i th a ss i z e
C
 
  , because it comprises 8 transistors.
  The area-time performance
A
T
  of
the circuit is proportional to
C
L
 
C
P
 
 
 
 
 
r
 
 
 
 
n
 
r
 
 
 
 
 
 .
* In this example we do not consider the area of the two resistors $R_1$ and $R_2$.
The architectures of this thesis mainly comprise basic building blocks like inverters, transmission gates, 2-input gates, adder elements, and registers. In the following Section 4.3 we present the sizes and time performances of such CMOS logic circuits, with respect to the adopted delay model. The size of an architecture was defined above as the number of nMOS and pMOS transistors that form the circuit. We describe the time performance of an architecture in terms of its fan-in, internal CP length, and output normalised resistance:
Definition 4.5
1. The fan-in $f$ of a circuit is the number of transistor gates of the circuit that are driven by the circuit input signal.
2. An internal stage of a circuit is a stage whose associated $RC$ model does not depend on how the circuit is connected to other circuits. The internal CP length $L_{CP}$ is the sum of the lengths of the internal stages.
3. The output normalised resistance $r$ is the total normalised resistance from the circuit output node back to the supply voltage source (or ground).
The length of the CP is used as a measure of time performance. When characterising the CP through a circuit, it is partitioned into three parts, in accordance with Definition 4.5-1, 2, and 3:
1. The first part of the CP is the input stage of the circuit. The delay of the input stage depends on the total capacitance $n_{in} C_g$ at the input node, where $n_{in}$ is the fan-out of the preceding circuit that has this input stage as its output stage. Henceforth, we describe the input stage of a circuit in terms of its contribution to $n_{in}$, i.e. we only state the fan-in $f$ of the circuit.
For example, the fan-in of the circuit in Figure 4.3, whose input stage is $s_1$, equals 2.
2. The second part of the CP is the set of internal stages, which is described
by the internal CP length.
For example, the internal stages of the circuit in Figure 4.3 are $s_2$ and $s_3$. Their
respective lengths equal
 
 
 
 
 and
 
 
 
 
  . Hence, the internal CP length
of the circuit equals
 
 
 
 
  .
3. The third part of the CP is the output stage. The output normalised resis-
tance determines, together with the subsequent resistive and capacitive loads
of the output stage, the length of the stage. The circuit contribution to this
length is described in terms of its output normalised resistance.
For example, the output stage of the circuit in Figure 4.3 is $s_4$ and its output normalised resistance equals $r_2 + 1$.
Remark: There are architectures in Chapters 5, 6, and 7 for which the delays of some stages are proportional to $m$, where $m$ is the exponent of 2 in the Fermat number $2^m + 1$. These delays are generally due to the fact that single logic gates or inverters are driving large capacitive loads. For example, if a logic gate with output normalised resistance $r$ is driving $m$ logic gates, each with fan-in equal to $f$, the delay of that stage is proportional to its length $r f m$. The traditional way of reducing the delay of a stage with a large capacitive load is to properly buffer the stage by using a number of cascaded drivers (inverters) of gradually increasing size. Then, the resulting total delay can be bounded to be proportional to $\log m$, see Mead and Conway [66, Sec. 1.5]. Note, however, that regarding the architectures in Chapters 5, 6, and 7, we generally do not consider the problem of driving large capacitive loads.
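The difference between the two growth rates in the remark can be illustrated with a rough sketch. The buffered case uses an idealised sizing model that is an assumption of this example and not the thesis's delay model: each driver in the cascade is a fixed factor larger than its predecessor, so every stage sees the same ratio between load and drive strength, and the self-loading of the drivers is ignored. The function names and the taper factor 4 are hypothetical.

```python
import math


def direct_drive_length(r, f, m):
    """Stage length when one gate drives m gates of fan-in f directly: r*f*m."""
    return r * f * m


def tapered_chain_length(r, f, m, taper=4.0):
    """Idealised total length of a cascade of gradually larger drivers."""
    load = f * m
    stages = max(1, math.ceil(math.log(load, taper)))
    step = load ** (1.0 / stages)     # equal load-to-drive ratio per stage
    return stages * r * step          # grows roughly like log(m)
```

Under these assumptions, a load of $m = 512$ gates with $f = 2$ gives a direct length of 1024, while the cascade needs only a handful of stages with a length of about 4 each.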
4.3 Basic CMOS Building Blocks
In this section we derive the sizes, fan-ins, internal CP lengths, and output normalised resistances of the inverter, the transmission gate, the two-input multiplexer, two-input gates, the single-bit adder, and the register (D flip-flop), with respect to the VLSI model defined in the previous section. In Section 4.3.6, these parameters are all listed in a table.
4.3.1 The Inverter and the Transmission Gate
The Inverter
Because the CMOS inverter comprises two MOS transistors, its size equals $C_{inv} = 2$. The inverter is shown in Figure 4.4. When the inverter input signal changes from high to low, the stages marked by the dashed lines in Figure 4.4(b) are activated. For a low-to-high input signal transition, the stages marked by the dotted lines are activated. Because the two possible input stages, as well as the two output stages, are actually equivalent, the inverter contribution to the CP is simply its fan-in, which equals $f_{inv} = 2$, and its output normalised resistance $r_{inv} = 1$. There is no internal stage.
In Figure 4.4(c), $n_{in}$ is the total number of transistor gate inputs that are connected to the inverter input node and $n$ is the fan-out of the inverter. Furthermore, $r_{in}$ is the output normalised resistance of the circuit prior to the inverter.
Figure 4.4: A CMOS inverter. (a) Symbolic description. (b) Schematic description. The CP is formed either by the dotted or the dashed stages. (c) Simple $RC$ equivalents of the stages.
The Transmission Gate
The transmission gate has the same size $C_{TG} = 2$ as the inverter. Figure 4.5 shows how the transmission gate is formed by an nMOS and a pMOS transistor in parallel. The dotted path in Figure 4.5(b) is the output stage of the CP. This stage is also the output stage of a preceding circuit whose output signal is the input signal of the transmission gate. The stage runs through one of the transmission gate transistors. Therefore, the output normalised resistance equals $r_{TG} = r_{prior} + 1$, where $r_{prior}$ is the output normalised resistance of the mentioned circuit prior to the transmission gate. Note that because the transmission gate is not connected to the supply voltage source or ground, its equivalent pass transistor resistor is a series resistor (and not a Thevenin equivalent resistor).
If one of the transistors of the transmission gate is the trigger of the output stage, then the stage that ends up in the gate input of this transistor also belongs to the CP; it becomes the input stage. Then, the fan-in $f_{TG}$ equals the fan-in of the trigger,* i.e. we have $f_{TG} = 1$. Otherwise, the fan-in equals zero. Like the inverter, the transmission gate has no internal stage.

* In accordance with Definition 4.5-1, by the fan-in of the trigger we mean the number of gates of a circuit that are driven by the signal on the gate of the trigger.
Figure 4.5: A transmission gate. (a) Symbolic description. (b) Schematic description. The dotted line is the output stage of the transmission gate.
4.3.2 The Two-Input Multiplexer
The two-input multiplexer is simply constructed using two transmission
gates, as shown in Figure 4.6.
Because the multiplexer comprises two transmission gates, each of size 2, the
total size of the two-input multiplexer equals
C
M
U
X
 
  . Like the output stage
of the transmission gate, the multiplexer output stage (
s
  in Figure 4.6) is also
the outputstageof anothercircuit. Hence, theoutput normalisedresistance of
the multiplexer equals
r
M
U
X
 
r
p
r
i
o
r
 
  ,w h e r e
r
p
r
i
o
r is the output normalised
resistance of the circuit prior to the multiplexer.
Furthermore, if the transmission gate transistor of the output stage is the
trigger of that stage, the stage that carries the control signal $S$ also
belongs to the CP. Then, the multiplexer fan-in equals $f_{MUX} = 2$, because
the control signal $S$ controls two of the multiplexer transistors.* If the
control signal stage does not belong to the CP, the multiplexer has no input
stage and thus its fan-in equals zero. The internal CP length equals zero.
* If $\bar{S}$ controls the trigger of the output stage, the fan-in $f_{MUX}$
also equals two.
Figure 4.6: A two-input multiplexer. (a) Symbolic description. (b) Schematic
description. The dotted lines show the two stages of the CP when the control
signal opens the trigger of the output stage.
4.3.3 Two-Input Gates
NAND/NOR Gates
Schematic descriptions of the 2-input NAND and NOR gates are given in
Figures 4.7(a) and (b), respectively. The NAND and NOR gates have equal size
$C_{NAND/NOR} = 4$. With respect to the switch-level transistor model, the gate
delays are also the same. The $RC$ equivalents of the NAND and NOR gates are
given in Figure 4.7(c). In the worst case delay, one of the input signals is
the input signal of the trigger of the output stage. Because each of the input
signals controls the switching of two transistors, the fan-in equals
$f_{NAND/NOR} = 2$ for both the NAND gate and the NOR gate. We also get the same
output normalised resistance $r_{NAND/NOR} = 2$ for both gates. The NAND and
NOR gates have no internal stage.
In a more realistic transistor model, the NAND gate is often preferable to the
NOR gate. For example, if the gates are designed to have symmetric switch-
ing, the area occupied by the NAND gate is smaller than the area required for
the NOR gate, see Uyemura [106, Ch. 6.5.3]. Conversely, for transistors of the
same size, the rise-time and fall-time asymmetry is greater for the NOR gate
than for the NAND gate.
AND/OR Gates
Two-input AND gates and OR gates are usually designed as NAND gates
and NOR gates, respectively, each followed by an inverter. Thus, AND gates
and OR gates have size $C_{AND/OR} = 6$. The fan-in $f_{AND/OR}$ equals the
fan-in $f_{NAND/NOR} = 2$ of the NAND (and NOR) gate and the output normalised
resistance $r_{AND/OR}$ of the AND and OR gates equals the normalised resistance
$r_{inv} = 1$ of the inverter. The internal CP length $L_{AND/OR}$ equals the
product of the output normalised resistance of the NAND (and NOR) gate and the
inverter fan-in, i.e. we have
$L_{AND/OR} = r_{NAND/NOR} f_{inv} = 2 \cdot 2 = 4$.
Figure 4.7: Two-input NAND and NOR gates. (a) A NAND gate. (b) A NOR gate.
(c) $RC$ equivalents of the CP stages when there are no side branches or
extended branches. The NAND gate and the NOR gate have similar CP stages, see
the dotted lines in (a) and (b).
XOR/XNOR Gates
The XOR gate can be designed in several ways. For example, a transmission
gate-based XOR gate can be built with as few as six transistors [113, Fig. 8.11].
However, it may be rather difficult to track down CPs of circuits that contain
such XOR gates. Therefore, we instead consider the realisation shown in
Figure 4.8(b), which has size $C_{XOR} = 12$. This gate contains more
transistors than the transmission gate-based XOR gate, but it is quite easy to
find the stages of its CP.
There are eight different stages in the XOR gate in Figure 4.8. Which ones
will be activated depends on the input signals
D
  and
D
  and their last val-
ues. Among the 16 different transitions of the input signals that may occur,
the one from
 
D
 
 
D
 
 
 
 
 
 
 
  to
 
D
 
 
D
 
 
 
 
 
 
 
  activates the stages
s
 ,
s
 ,
and
s
  in the listed order. These stages, which are signiﬁed by the dotted lines
in Figure 4.8(b), form a CP through the XOR gate.
 
The fan-in of the XOR gate equals $f_{XOR} = 4$ and the normalised resistance
of the output stage equals $r_{XOR} = 2$. The $RC$ equivalent of the internal
stage of the CP through the XOR gate is shown in Figure 4.8(c). The length
$L_{XOR}$ of this internal stage equals 2.
The two-input XNOR gate can be constructed by interchanging the connec-
tions of
  and its binary inverse (i.e. its one’s complement)
  in the rightmost
part of the circuit in Figure 4.8(b). Consequently, the XNOR gate has the
same size and delay characteristics as the XOR gate.
* There are also other stages of the XOR gate whose lengths sum up to the
length of the CP chosen.
Figure 4.8: A static two-input XOR gate. (a) Symbolic description. (b) Schematic
description. The dotted lines show the three stages of a possible CP through the
gate. (c) $RC$ equivalent circuit of the internal stage of the CP.
4.3.4 The Single-Bit Adder
Addition is a fundamental operation in all arithmetic processes. There are
many ways to implement an $m$-bit binary adder. In general, it consists of a
number of single-bit full adder elements. A parallel $m$-bit adder can be
formed by cascading $m$ such adder elements. Figure 4.9 shows the Karnaugh maps
for the sum output $\sigma$ and carry output $c$ of the full adder element. The
adder has three inputs: the signals $\alpha$ and $\beta$ and the carry input
$c_{in}$. According to the Karnaugh maps, the carry and sum outputs of the full
adder element can be expressed as the Boolean expressions
$c = \alpha\beta + c_{in}(\alpha + \beta)$   (4.2)
$\sigma = \alpha \oplus \beta \oplus c_{in}$   (4.3)
respectively. The symbol $\oplus$ denotes the XOR function, i.e. addition modulo
2. Figure 4.10(a) shows the symbolic description of the single-bit full adder
element.
Figure 4.9: Karnaugh maps of the sum output $\sigma$ and carry output $c$ of the
full adder element.
There are various ways of implementing the full adder element. Here, we use
the conventional static full adder element shown in Figure 4.10(b), which is
based on the carry output Boolean function given by (4.2) and the sum output
Boolean function
$\sigma = \alpha\beta c_{in} + \bar{c}(\alpha + \beta + c_{in})$,
which is obtained by rewriting (4.3). The delays of this adder element can
easily be estimated when using the adopted switch-level $RC$ delay model.
Another advantage is that the adder outputs are driven by inverters. However,
compared with dynamic adders it has at least one disadvantage: from an
investigation of various adder elements, Liu and Svensson [61, Paper 5]
conclude that the power consumption of the static adder in Figure 4.10(b) is
typically two to three times greater than the power consumption of dynamic full
adder elements. The size of the chosen full adder, which equals $C_{FA} = 28$,
is comparable to the sizes of most other dynamic and static full adder elements.
There are 64 different input signal transitions of the adder element that may
occur. From a CP search point of view, however, most of them are ruled out.
Yuan and Svensson [109] propose two principles for determining the number
of significant transitions. Firstly, the start stage of each transition should
include as many transistors as possible. Secondly, the final stage should have
as few transistors in parallel as possible. Using these principles, the number
of interesting input transitions is reduced to 14 [109, Fig. 5]. When
investigating these transitions, we have found that they all give rise to paths
of the same lengths.
For example, one such CP is obtained when the input signals change from
 
 
 
 
 
c
i
n
 
 
 
 
 
 
 
 
  to
 
 
 
 
 
c
i
n
 
 
 
 
 
 
 
 
 . If this transition occurs synchro-
nously for the three adder inputs, the CP from the input to the sum output is
equal to the set
f
s
 
 
s
 
 
s
 
 
s
 
g of stages, see the dotted lines in Figure 4.10(b).
The internal signal
c opens the trigger of stage
s
 .M o r e o v e r ,t h eC Pf r o mt h e
input to the carry outputis equivalentto the set
f
s
 
 
s
 
 
s
 
g.I f
  opens thetrig-
ger of stage
s
 ,t h e n
s
  is replaced by
s
  in the above sets of stages. However,
because both
  and
  drive eight of the full adder transistors, the fan-in of the
trigger of stage
s
  will still be the same;
f
 
 
f
 
 
  .
If the trigger of stage
s
  is the transistor with gate input signal
c
i
n,i . e .i f
c
i
n ap-
pears at the adder input later than the moment when the end node of stage
s
Figure 4.10: The single-bit binary full adder element. (a) Symbolic description.
(b) Schematic description. The dotted and the dashed lines are stages that form
the different CPs of the adder element.
is fully charged, then stage
s
  is included in the input-to-sum CP. This situa-
tion typically occurs in parallel adders, for which the CP is usually the carry
chain through the full adder elements. Then, the CP from the carry input to
the carry output is the set
f
s
 
 
s
 
 
s
 
g, for which only one of the two parallel
transistors (controlled by
  and
 )instage
s
  is switchedon. Because the carry
input
c
i
n is connected to the gates of six transistors of the full adder, the fan-in
of the trigger of stage
s
  (and of the trigger of stage
s
 ) equals
f
 
 
  .U s i n g
the switch-level
R
C model, we obtain the lengths
L
 
 
 
 
 
 
  ,
L
 
 
 
 
 
 
  ,
and
L
 
 
 
 
 
 
 of the internal stages
s
 ,
s
 ,a n d
s
 , respectively.
From the above reasoning we get that the full adder fan-ins equal
$f_{FA,signal} = 8$ and $f_{FA,carry} = 6$, with respect to the signal and carry
input nodes, respectively. When the CP through the full adder leads to the sum
output, the internal CP length equals $L_{FA,sum} = 12$ and when it leads to the
carry output, the internal CP length equals $L_{FA,carry} = 8$; each value is
the sum of the lengths of the internal stages on the respective path. In both
cases we get the same output normalised resistance $r_{FA} = 1$.
If one of the inputs, say the carry input, of the full adder element is always
equal to zero, a half adder may be used instead of a full adder. Then, the sum
and carry outputs of the half adder are the Boolean functions
$\sigma = \alpha \oplus \beta$ and $c = \alpha\beta$, respectively. These
functions may be directly implemented using one XOR gate for the sum output and
one AND gate for the carry output, where the latter gate is realised as a NAND
gate followed by an inverter.
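The Boolean relations of the full and half adder elements can be checked exhaustively. The following Python sketch is an illustration only: the function names and the input names `a`, `b` (for the two signal inputs) are our own and do not come from the thesis. It evaluates the carry as the majority of the three inputs, the sum as their XOR, and the rewritten sum that reuses the inverted carry, which is the form a static adder cell of this kind is built around.

```python
from itertools import product


def full_adder(a, b, c_in):
    """Return (sum, carry) of a single-bit full adder.

    carry = a*b + c_in*(a + b)   (majority of the three inputs)
    sum   = a XOR b XOR c_in
    """
    carry = (a & b) | (c_in & (a | b))
    total = a ^ b ^ c_in
    return total, carry


def full_adder_static(a, b, c_in):
    """Same adder, with the sum rewritten to reuse the inverted carry."""
    carry = (a & b) | (c_in & (a | b))
    not_carry = carry ^ 1
    total = (a & b & c_in) | (not_carry & (a | b | c_in))
    return total, carry


def half_adder(a, b):
    """Full adder with the carry input tied to zero: XOR sum, AND carry."""
    return a ^ b, a & b


def forms_agree():
    """True if both sum expressions agree for all eight input combinations."""
    return all(
        full_adder(a, b, c) == full_adder_static(a, b, c)
        for a, b, c in product((0, 1), repeat=3)
    )
```

Calling `forms_agree()` compares the two formulations over the whole truth table; `half_adder(a, b)` should match `full_adder(a, b, 0)` for every input pair.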
The half adder element is depicted in Figure 4.11. The size of this half adder
equals $C_{HA} = C_{XOR} + C_{NAND} + C_{inv} = 12 + 4 + 2 = 18$. Its CP delay
parameters are shown at the bottom of the figure.
Figure 4.11: A half adder element, realised using one NAND gate, one XOR gate,
and one inverter. Its CP delay parameters are:
• Internal length, input to carry output: $L_{HA,carry} = r_{NAND} f_{inv} = 4$
• Internal length, input to sum output: $L_{HA,sum} = L_{XOR} = 2$
• Normalised resistance of the carry output stage: $r_{HA,carry} = r_{inv} = 1$
• Normalised resistance of the sum output stage: $r_{HA,sum} = r_{XOR} = 2$
• Total fan-in: $f_{HA} = f_{NAND} + f_{XOR} = 6$
4.3.5 The Register
As for many other CMOS circuits, there are several ways of designing a
register. Here, we consider the dynamic true single-phase clock master-slave D
flip-flop depicted in Figure 4.12. This positive edge-triggered flip-flop is an
extended version of a precharged inverting D flip-flop suggested by Yuan et al.
[108]. The size of the flip-flop in Figure 4.12, which equals $C_{reg} = 16$, is
less than the sizes of ordinary static D flip-flops. Also, Liu and Svensson
[61, Ch. 3.3], [99] found that the power consumption of this flip-flop is less
than the power consumption of other known static and dynamic master-slave D
flip-flops.
The flip-flop in Figure 4.12 has an asynchronous reset input. In some circuits,
one may need settable registers, and when using a plain bit-serial shift
register there is no need for settable or resettable registers. There are also
other types of registers and D flip-flops. We make the following assumptions
regarding the register elements (and D flip-flops) in the architectures
considered in Chapters 5, 6, and 7:
• Every register element (and D flip-flop) has the same size and delay
  parameters as the D flip-flop in Figure 4.12.
• Data is fed from the output of one register through a block of combinational
  logic to the input of another register during one clock cycle. Consequently,
  each CP starts with the output stages of the first register and ends with the
  input stages of the destination register.
• The register input data obeys the setup and hold time constraints of the
  register.
• The clock signal $clk$ of a register is the output signal of an inverter that
  only drives this particular register clock input. The delay time of any
  stages ahead of the clock input stages (the dashed stages in Figure 4.12) is
  not included in the total delay of the register.
• In a particular CMOS system, the various architectures for arithmetic
  operations all share the same registers for storing the input data and they
  share the same registers for storing the output data. Therefore, the input
  and output registers are generally not considered when deriving the total
  size of an investigated architecture. In some architectures, like
  architectures for serial/parallel multiplication, the input data are loaded
  into registers which are used throughout the whole execution time (during
  several clock cycles). An output register may also be used in a similar
  manner, for example as a feedback shift register. The size of every register
  involved in the computation in this way is included in the total size of an
  architecture.
Figure 4.12: A resettable register, realised as a dynamic, true single-phase
clock, positive edge-triggered master-slave D flip-flop. The register is reset
by the asynchronous reset input $R$. (a) Symbolic descriptions. (b) Schematic
description.
Table 4.1: Lengths $L_i$ of the internal stages $s_i$ of the register in
Figure 4.12.
The D flip-flop stages that are included in the CP are marked with dotted
(signal stages) and dashed (clock stages) lines in Figure 4.12. These CP stages
are activated when an output (input) signal of the CP start (end) register
changes from low to high. The start register contributes three output stages,
the first of which is due to the rising clock, and the end register contributes
six input stages.
The lengths of the register stages are tabulated in Table 4.1. The total
internal length of the start register equals $L_{reg,out} = 6$, the sum of the
lengths of its two internal stages. The total internal length of the end
register equals $L_{reg,in} = 16$, the sum of the lengths of its five internal
stages. Hence, the register contribution to the total CP length equals
$L_{reg} = L_{reg,out} + L_{reg,in} = 22$. The fan-in $f_{reg}$ of the register
and the output normalised resistance $r_{reg}$ of its output stage both equal 2.
4.3.6 Table of Complexity Parameters
In Table 4.2 we have listed the complexity parameters of the circuits that
have been analysed in the previous Sections 4.3.1 to 4.3.5. For each circuit
we state its size $C$, fan-in $f$, internal CP length $L$, and its output
normalised resistance $r$.

CMOS circuit                      Size C   Fan-in f   Internal CP    Output norm.
                                                      length L       resistance r
Inverter                          2        2          -              1
Transmission gate*                2        1 (0)      -              r_prior + 1
2-input multiplexer*              4        2 (0)      -              r_prior + 1
2-input NAND and NOR gates        4        2          -              2
2-input AND and OR gates          6        2          4              1
2-input XOR and XNOR gates        12       4          2              2
Full adder element                28                                 1
  Signal input to sum output               8          12
  Signal input to carry output             8          8
  Carry input to sum output                6          12
  Carry input to carry output              6          8
Half adder element                18       6
  Signal input to sum output                          2              2
  Signal input to carry output                        4              1
(Shift) register, D flip-flop     16                  16 + 6 = 22
  Input path                               2          16
  Output path                                         6              2

* If a transmission gate transistor is the trigger of the output stage, then the
fan-in equals 1 for the transmission gate and 2 for the multiplexer. If not, the
fan-in equals zero in both cases and only the output stage contributes to the
CP. The normalised resistance $r_{prior}$ equals the output normalised
resistance of the circuit that is prior to the transmission gate (or
multiplexer).
Table 4.2: The sizes, fan-ins, internal CP lengths, and output normalised
resistances of some frequently used CMOS circuits.
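The composite entries of Table 4.2 follow from the primitive ones by the rules stated in Sections 4.3.3 and 4.3.4. The Python sketch below is illustrative only: the dictionary layout and function names are our own assumptions, and a missing internal stage (a dash in the table) is encoded as length 0. It derives the AND/OR and half adder parameters from the inverter, NAND/NOR, and XOR entries.

```python
# Primitive parameters transcribed from Table 4.2
# (C = size, f = fan-in, L = internal CP length, r = output normalised resistance).
# A circuit without an internal stage is given L = 0 here (our convention).
PARAMS = {
    "inv":  {"C": 2,  "f": 2, "L": 0, "r": 1},
    "nand": {"C": 4,  "f": 2, "L": 0, "r": 2},
    "nor":  {"C": 4,  "f": 2, "L": 0, "r": 2},
    "xor":  {"C": 12, "f": 4, "L": 2, "r": 2},
}


def and_or_gate(base):
    """AND (OR) gate built as a NAND (NOR) gate followed by an inverter."""
    gate, inv = PARAMS[base], PARAMS["inv"]
    return {
        "C": gate["C"] + inv["C"],   # sizes add
        "f": gate["f"],              # fan-in of the first gate
        "L": gate["r"] * inv["f"],   # internal stage drives the inverter
        "r": inv["r"],               # output is driven by the inverter
    }


def half_adder():
    """Half adder built from one XOR gate, one NAND gate, and one inverter."""
    xor, nand, inv = PARAMS["xor"], PARAMS["nand"], PARAMS["inv"]
    return {
        "C": xor["C"] + nand["C"] + inv["C"],
        "f": nand["f"] + xor["f"],
        "L_sum": xor["L"],
        "L_carry": nand["r"] * inv["f"],
        "r_sum": xor["r"],
        "r_carry": inv["r"],
    }


def total_size(gate_counts):
    """Sum the sizes of an architecture given a {gate name: count} mapping."""
    return sum(PARAMS[name]["C"] * count for name, count in gate_counts.items())
```

For instance, `total_size({"xor": 1, "nand": 1, "inv": 1})` reproduces the half adder size, and `and_or_gate("nand")` reproduces the AND/OR row of the table.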
4.4 Implementing the Fermat Number Transform
In the previous chapters, we have mentioned several advantages of number
theoretic transforms in general and the Fermat number transform in particular.
For example, digital convolution of real integer sequences can be implemented
using Fermat number transforms for which multiplication by powers of the
transform kernel can be carried out as binary shifts (rotations). Also, no
round-off errors occur during the computations, because the arithmetic
operations involved are carried out in a finite ring or field.
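As a concrete illustration of exact convolution with a Fermat number transform, the following Python sketch uses the modulus $2^8 + 1 = 257$ and the kernel 2, which has order 16 modulo 257, so the transform length is 16. These parameter choices, the function names, and the direct $O(N^2)$ evaluation are our own assumptions for the example; a practical implementation would use a fast algorithm and replace the multiplications by powers of 2 with shifts.

```python
def fnt(x, root=2, modulus=257):
    """Number theoretic transform of x, evaluated directly (O(N^2))."""
    n = len(x)
    return [
        sum(x[j] * pow(root, j * k, modulus) for j in range(n)) % modulus
        for k in range(n)
    ]


def inverse_fnt(X, root=2, modulus=257):
    """Inverse transform; inverses via Fermat's little theorem (prime modulus)."""
    n = len(X)
    n_inv = pow(n, modulus - 2, modulus)
    root_inv = pow(root, modulus - 2, modulus)
    return [(n_inv * v) % modulus for v in fnt(X, root_inv, modulus)]


def cyclic_convolution(x, h, root=2, modulus=257):
    """Cyclic convolution of x and h modulo `modulus` via the transform."""
    X, H = fnt(x, root, modulus), fnt(h, root, modulus)
    Y = [(a * b) % modulus for a, b in zip(X, H)]
    return inverse_fnt(Y, root, modulus)


def direct_cyclic_convolution(x, h, modulus=257):
    """Reference cyclic convolution computed from the definition."""
    n = len(x)
    return [
        sum(x[j] * h[(k - j) % n] for j in range(n)) % modulus
        for k in range(n)
    ]
```

Both inputs must have length 16 here, because that is the order of the kernel 2 modulo 257. The result is exact as long as every true convolution value is smaller than the modulus; otherwise it is correct modulo 257 only.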
As mentioned in Section 2.3.2 the Fermat number transform, whose length is
a power of two, can be computed using a suitable fast Fourier transform
algorithm. In Section 2.3.3 we considered the conventional radix-2, radix-4, and
split-radix algorithms, in which the transform additions and multiplications
are partitioned into so-called butterfly computations. These algorithms exploit
different degrees of parallelism.
Since the publication in 1976 of McClellan’s [65] hardware implementation of
a Fermat number transform, several Fermat number transform architectures
have appeared in the literature. Truong et al. [103, 107] considered the
implementation of fast digital filtering using a generalised overlap-save method
and a parallel pipelined Fermat number transform architecture.
Based on the work of Truong et al., Towers et al. [101] designed a cascadable
nMOS VLSI circuit for fast convolution, involving a pipelined Fermat number
transformer.
Shakaff et al. [90] investigate the practical aspects of using the Fermat number
transform as a block-mode image filtering tool on small microprocessor based
systems. Their transform architecture is based on a gate array implementation
of the butterfly computational unit.
Several aspects and techniques for implementing the Fermat number transform
in (nMOS) VLSI are investigated in the theses by Pajayakrit [71, Ch. 4–]
and Shakaff [89].
Finally, we also would like to mention the recent paper by Benaissa et al. [13],
in which the authors present a CMOS VLSI design of a high-speed Fermat
number transform-based convolver/correlator. The VLSI chip comprises a
complete 64-point pipeline transformer that can be used for both the forward
and the inverse Fermat number transform.
In all the above papers, except the one by McClellan, the authors have adopted
the diminished–1 representation of the elements in Fermat integer quotient
rings. The diminished–1 representation is thoroughly investigated in Chapter 6.
As mentioned before, we are primarily interested in the arithmetic operations
required to compute the Fermat number transform. Architectures for the complete
transform or the transform butterflies are not further considered in this
thesis.
Chapter 5
The Normal Binary Coded
Representation
We only consider element representations that can be expressed as simple
elementary functions of the normal binary coded (NBC) representation. In the
present chapter, we study integer arithmetic operations modulo $2^m + 1$ with
respect to the NBC representation itself.
5.1 Architectures for Arithmetic Operations
We are mainly interested in VLSI architectures for the arithmetic operations
that may be involved in the computation of the Fermat number transform and
its inverse transform. Therefore, we consider architectures for modulus
reduction, negation, addition, subtraction, multiplication by powers of 2,
general multiplication, and exponentiation, with respect to a binary coded
representation of the integers of $Z_{2^m+1}$. Not all of these operations need
be involved in the computation of the Fermat number transform, but for
completeness they are still considered. For example, general multiplication can
be avoided using a suitable transform kernel, see Section 2.3.2. We do not
consider division, because it is not needed when computing the Fermat number
transform and it is not a general operation in every Fermat integer quotient
ring.
The architectures for some of the arithmetic operations considered in the
thesis are based on architectures for operations on ordinary two’s complement
binary coded numbers. There is a wide variety of VLSI designs available for
these operations. For example, an adder circuit can be implemented in several
ways. It can be a carry ripple adder, a carry select adder, a carry save
adder, a carry look-ahead adder, a conditional-sum adder, or some other type
of adder [113, Ch. 8.2.1]. The architectures in Chapters 5, 6, and 7 are not
necessarily optimal with respect to chip area, computation time, or area-time
performance. We primarily consider architectures that can be mutually compared
in order to decide which form of element representation is most advantageous,
with respect to some area and/or time complexity of the resulting architectures.
Thus, for a certain element representation and a certain arithmetic operation
there may exist architectures that have better area-time performance than the
one (or the ones) presented here.
Henceforth, most of the architectures presented are valid for arithmetic oper-
ations in the Fermat integer quotient ring
Z
 
 
 
 ,i . e .f o r
m
 
  . However, in
general the architectures are regular in such a way that they can easily be
expanded (or contracted) to become applicable in any ring $Z_{2^m+1}$, where
$m$ is a power of two. The only exceptions are the architectures in Chapter 7.
They are based on the polar representation and are applicable only when
$Z_{2^m+1}$ is a field, i.e. for $m = 1, 2, 4, 8, 16$.
The $2^m + 1$ binary coded integers of $Z_{2^m+1}$ are represented as
$(m+1)$-bit NBC numbers. Therefore, by a congruence $a \equiv b \pmod{2^m + 1}$
we generally consider $a$ to be the least nonnegative $(m+1)$-bit residue of $b$
modulo $2^m + 1$.
5.1.1 Modulus Reduction
It is important that the reduction modulo $2^m + 1$ is carried out as simply and
quickly as possible, because it is involved in all arithmetic operations in
$Z_{2^m+1}$. For some operations, the modulus reduction may be included in the
overall computation.
Let
  be an
n-bit normal binary coded integer. This integer
 
 
 
n
 
 
 
n
 
 
 
 
n
 
 
 
n
 
 
 
 
 
 
 
 
 
 
 
 
 ,w h e r e
 
i
 
Z
  for
 
 
i
 
n
 
 , may also be
represented by the
n-bit binary vector
 
 
n
 
 
 
 
 
 
 
n
 
 
 
 
n
 
 
 
 
 
 
 
 
 
 
 
 
 
 .T h e
notation
 
 
n
 
 
  is occasionally used also for the integer
 .
 
Theresidueofan
 
m
 
 
 -bitinteger
 
 
 
m
 
m
o
d
 
m
 
 
 is simply calculated
byﬁrstchangingtheone(1)in themostsigniﬁcant bitposition
 
m of
  to azero
andthensubtractingaonefromthemodiﬁednumber,i.e.
 
 
 
m
 
m
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
m
 
 
 
 
 
 
m
o
d
 
m
 
 
  . In a hardware realisation, a simple
way to subtract1 from the binary coded
m-bit positive integer
 
 
m
 
 
  is to add
 This is illustrated, for
n
 
m
 
  , in Table 2.3 of Section 2.4.5.1. Architectures for Arithmetic Operations 61
the two’s complement of 1 to the integer;
 
 
 
 
m
 
 
 
 
 
 
m
o
d
 
m
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
where
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 . It follows by (4.2) and (4.3) that the carry output
c
i
 
  and the sum output
 
i of a binary full adder element can be expressed as
the Boolean functions
c
i
 
 
 
 
i
 
i
 
c
i
 
 
i
 
 
i
 
 
i
 
 
i
 
 
i
 
c
i
respectively, where
 
i and
 
i are the adder input signals and
c
i is the carry in-
put signal. By letting
 
 
 
m
 
 ,i . e .l e t
 
i
 
 for
 
 
i
 
m
 
 , the carry and
the sum output functions reduce to
c
i
 
 
 
 
i
 
c
i (5.1)
 
i
 
 
i
 
c
i
  (5.2)
respectively. Hence, we get
 
 
c
m
 
m
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 ,w h e r e
c
m
and
 
i
 
 
 
i
 
m
 
  are given in (5.1) and (5.2), respectively.
The full adder elements can be connected in different ways. We consider two
types of two-operand parallel adders, which are based on how the internal
carries between the adder elements are generated. One of the adder types is the
carry ripple (or ripple carry) adder, for which the carry output of each full
adder element is connected to the carry input of the subsequent full adder
element (the one in the next higher-order bit position). The second adder type
is the carry look-ahead adder, for which the internal carry signals are
precomputed. The carry look-ahead adder is usually faster than the carry ripple
adder, but the penalty paid for this is a greater area complexity, see for
example Weste and Eshraghian [113, Ch. 8.2.1] or Hwang [52, Ch. 3].
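The carry ripple connection can be sketched in a few lines of Python. This is a behavioural illustration under our own naming, not a description of the thesis circuits: each bit position evaluates one full adder element, and its carry output feeds the carry input of the next higher-order position. A carry look-ahead adder produces the same result but computes the carries in parallel, which is not modelled here.

```python
def full_adder(a, b, c_in):
    """Single-bit full adder: returns (sum, carry)."""
    carry = (a & b) | (c_in & (a | b))
    return a ^ b ^ c_in, carry


def ripple_carry_add(x, y, m, c_in=0):
    """Add two m-bit nonnegative integers with a chain of m full adders.

    Returns (m-bit sum, carry out of the most significant position).
    """
    result = 0
    carry = c_in
    for i in range(m):  # least significant bit first
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result, carry
```

The loop makes the sequential dependency explicit: position `i` cannot finish before the carry from position `i - 1` is available, which is why the carry chain usually forms the critical path of a carry ripple adder.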
A Carry Ripple-Based Architecture
Figure 5.1 shows an architecture that performs the modulus reduction
 
 
 
 
m
o
d
 
 
 
 
 using essentially a simpliﬁed carry ripple adder followed by
two-input multiplexers. The multiplexers, which are formed by the transmis-
sion gate pairs at the outputs, let either
  (if
 
 
 
m)o r
  (if
 
 
 
m) pass to
the output. The signal
h and its inverse control the multiplexers. The output
residue is the
 
m
 
 
 -bitnormal binarycoded integer
 
 
 
m
 
m
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 . The architecture is based on the following algorithm:62 Chapter 5. The Normal Binary Coded Representation
1. If
 
 
 
 
 
m (
h
 
  ), then let
 
 
 
2. If
 
m
 
 
 
 
 
 
m
 
 
 
  (
h
 
  ), then let
 
 
 
 
 
 
 
m
 
 
 
 ,
where
 
 
m
 
 
 
 
 
 
m
 
 
 
 
 .
The carry signals are realised using a chain of OR gates. The Boolean function
h
 
 
m
c
m is used to indicate whether
  is greaterthan
 
m. We assume that
  is
alwaysan
 
m
 
 
 -bit integer, i.e. the maximum reducible overﬂow is
 
m
 
 
 
 .
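The reduction algorithm can be modelled at the bit level as follows. This Python sketch reflects our reading of the description above and uses our own names: the carries form an OR chain starting from a zero carry input, each low-order result bit is the XNOR of the input bit and the incoming carry (which amounts to adding the all-ones word, i.e. subtracting 1 modulo $2^m$), and the overflow flag `h` is the AND of the most significant input bit and the last carry.

```python
def reduce_mod_fermat(x, m):
    """Reduce an (m+1)-bit integer x modulo 2**m + 1 at the bit level."""
    if not 0 <= x < (1 << (m + 1)):
        raise ValueError("x must be an (m+1)-bit nonnegative integer")
    bits = [(x >> i) & 1 for i in range(m + 1)]

    carry = 0          # carry into bit position 0
    low_bits = []      # the m low-order bits of (x mod 2**m) - 1
    for i in range(m):
        low_bits.append(1 ^ bits[i] ^ carry)  # XNOR of input bit and carry
        carry = bits[i] | carry               # OR chain for the next carry

    h = bits[m] & carry  # 1 exactly when x > 2**m
    if h:
        # Overflow: drop the most significant bit and subtract one.
        return sum(b << i for i, b in enumerate(low_bits))
    return x             # x <= 2**m is already reduced
```

The final `if` plays the role of the output multiplexers: it passes either the unchanged input or the decremented low-order word, depending on `h`.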
The architecture in Figure 5.1 for reduction modulo
 
m
 
 comprises
m
 
 
OR gates,
m
 
  XNOR gates, one NAND gate,
m
 
 two-input multiplexers,
 
and two inverters. Using the size parameters of Table 4.2, the size
 of this ar-
chitecture equals
C
m
o
d
 
 
 
 
m
 
 
 
 
C
O
R
 
C
X
N
O
R
 
 
C
N
A
N
D
 
 
m
 
 
 
C
M
U
X
 
 
C
i
n
v
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
  (5.3)
Thecritical path
 (CP) through the circuit is the path from
 
  to
htogether with
the path from
 
m
 
  to
 
m
 
 ,a ss i g n i ﬁed by the dotted lines in the ﬁgure. The
fan-in,
 with respect to the
 
 -input node, equals
f
m
o
d
 
 
 
f
i
n
v
 
f
O
R
 
f
X
N
O
R
 
 
 
 
 
 
 
 
 
The output normalised resistance
 equals
r
m
o
d
 
 
 
r
m
 
 
 
 
 
where
r
m
 
  is the total normalised resistance from the
 
m
 
 -input node to the
supply voltage source, i.e. the output normalised resistance of the preceding
circuit.
Regarding
r
m
o
d
 
 , the length of the output stage, which is the dotted path from
 
m
 
  to
 
m
 
  in Figure 5.1, actually equals
r
m
 
 
 
f
O
R
 
f
X
N
O
R
 
 
 
r
m
 
 
 
 
 
n
m
 
 ,
where
n
m
 
  is thefan-outseenfromthe
 
m
 
 -output node. However, because
the
 
m
 
 -input node is fully charged when the output multiplexer opens,
we do not include the former part of the expression. Thus, only the term
 
r
m
 
 
 
 
 
n
m
 
  contributes to the length of the output path and hence the out-
put normalised resistance
r
m
o
d
 
  equals
r
m
 
 
 
 . The internal CP length of the
 For the sake of simplicity, the single inverter of the output circuitry in bit position
m is
regarded as a transmission gate.
 The size, critical path, fan-in, and output normalisedresistance ofan architecture was de-
ﬁned in Section 4.2.2.5.1. Architectures for Arithmetic Operations 63
h
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
o
d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c
 
Figure 5.1: A carry ripple-typecircuit for reduction modulo
 
m
 
 ,w h e r e
 
 
m
 
 
 
 
 
 
m
 
 
 
 
 
 
m
o
d
 
m
 and
m
 
  .I f
 
 
 
m (overﬂow) then
h
 
  , otherwise
h
 
  . The two dotted lines indicate the set of stages that form the critical path
through the circuit.64 Chapter 5. The Normal Binary Coded Representation
circuit equals
L
C
P
 
m
o
d
 
 
 
 
m
 
 
 
L
O
R
 
 
m
 
 
 
r
O
R
 
f
O
R
 
f
X
N
O
R
 
 
r
O
R
f
N
A
N
D
 
r
N
A
N
D
 
 
m
 
 
 
f
i
n
v
 
 
r
i
n
v
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
The circuit parameters forming
L
C
P
 
m
o
d
 
  are obtained from Table 4.2.
Insomesituations, themodulusreductionmaydirectlysucceedoperationsfor
which the signal
 
m
 
  appears on its input node after t h et i m ew h e nt h ei n t e r -
nal carry signal
c
m
 
  appears on its node in the carry chain. See for example
the adder architecture of Figure 5.7. Then, the modulus reduction circuit con-
tribution to the total CP length may be much smaller than
L
C
P
 
m
o
d
 
 .
In Chapter 4 we mentioned that, with respect to size and time performance,
NAND and NOR gates are preferable to AND and OR gates, respectively.
Accordingly, it may seem advantageous to realise the carry chain using a chain
of alternating NAND and NOR gates instead of a chain of OR gates. We have
designed such an architecture. The size of that architecture equals
 
 
m
 
 ,
which is slightly less than the size
C
m
o
d
 
 
 
 
 
m
 
  of the OR-type architec-
ture in Figure 5.1. However, the internal CP length of the NAND/NOR-type
architecture equals
 
 
m
 
 
  ,w h i c hi sgreater than the corresponding length
L
C
P
 
m
o
d
 
 
 
 
 
m
 
  of the architecture in Figure 5.1. The increase of the ﬁrst
term by
 
m (from
 
 
m to
 
 
m) equalsthe differencebetweenthe contributions
 
L
O
R
 
r
O
R
 
f
O
R
 
f
X
N
O
R
 
 
m
 
 
 
 
 
 
 
 
m
 
 
 
m and
r
N
A
N
D
 
N
O
R
 
f
N
A
N
D
 
N
O
R
 
f
X
N
O
R
 
m
 
 
 
 
 
 
m
 
 
 
m to the CP lengths of the OR-type and the NAND/
NOR-type architectures, respectively. Thus, for each bit position, the increase
of the CP length (by 6) due to the doubling of the normalised resistance in a
stage (when exchanging an OR gate for a NAND or NOR gate) is greater than
the decrease of the length (by 4) due to the elimination of the internal length
$L_{OR}$ of the OR gate.
A Carry Look-Ahead-Based Architecture
In Figure 5.2 we present a carry look-ahead-type circuit for modulus reduc-
tion. Here, the carry signals are generated in parallel using the tree of NAND
and NOR gates that precedes the row of XNOR gates in the ﬁgure. The struc-
ture of this simpliﬁed and distributed carry look-ahead tree is similar to the
structure of Brent and Kung’s carry look-ahead tree [27].
The depth of the tree is $\log_2 m$ and there are $m - 2^i$ NAND or NOR gates in
level $i$ of the tree, starting with $i = 0$ for the input level. Thus, there
are a total of
$\sum_{i=0}^{\log_2 m - 1} (m - 2^i) = m(\log_2 m - 1) + 1$
such gates in the tree, distributed such that the NOR gates are only in the even
numbered levels of the tree and the NAND gates are only in the odd numbered
levels. Also, there are $2^i$ inverters in level $i$ of the tree, which means
that the total number of inverters in the tree is $m - 1$. Hence, the size of
the carry look-ahead-type modulus reduction circuit in Figure 5.2 equals
C
m
o
d
 
 
 
 
m
 
l
o
g
 
m
 
 
 
 
 
 
C
N
A
N
D
 
N
O
R
 
m
C
i
n
v
 
 
m
 
 
 
C
X
N
O
R
 
C
N
A
N
D
 
 
m
 
 
 
C
M
U
X
 
 
m
 
l
o
g
 
m
 
 
 
 
 
 
 
 
 
m
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
l
o
g
 
m
 
 
 
m
  (5.4)
The CP through the circuit is the set of stages along the two dotted lines in the
ﬁgure.
  The fan-in of the circuit equals
f
m
o
d
 
 
 
f
X
N
O
R
 
 
f
N
O
R
 
 
 
 
 
 
 
 
and its output normalised resistance
r
m
o
d
 
  equals
r
m
o
d
 
 
 
r
m
o
d
 
 
 
r
m
 
 
 
 
 
Thus the architectures in Figures 5.1 and 5.2 have equal fan-in and output normalised
resistance. The internal CP length of the carry look-ahead-type architecture
equals
L
C
P
 
m
o
d
 
 
 
 
l
o
g
 
m
 
 
 
r
N
A
N
D
 
N
O
R
 
f
N
O
R
 
N
A
N
D
 
f
i
n
v
 
 
r
N
A
N
D
f
N
A
N
D
 
r
N
A
N
D
 
 
m
 
 
 
f
i
n
v
 
 
r
i
n
v
 
 
 
m
 
 
 
 
 
l
o
g
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
As mentioned above, in some situations the CP may enter the circuit via the
 
m
 
-input node. In such a situation, the carry ripple-type architecture in Figure 5.1
is preferable to the carry look-ahead-type architecture in Figure 5.2, because
the path from the
 
m
 
  input to the last carry signal
c
m is shorter in the former
case than in the latter case. Regarding the architecture in Figure 5.2, we would
like to mention the following:
Actually, the CP output stage is any of the stages from a
 
i-input node to the corresponding
 
i-output node, where
 
 
i
 
m
 
.
Figure 5.2: A carry look-ahead-type circuit for reduction modulo
 
m
 
, where
m
 
  . The two dotted lines signify the set of stages that form the critical path
through the circuit.
1. With respect to both area and time complexity, NAND and NOR gates
are preferable to OR gates in the carry look-ahead tree. The difference in
complexity is, however, not signiﬁcant.
2. The placement of the row of
m
 
  inverters in the last level of the carry
look-ahead tree differs, depending on whether
l
o
g
 
m is odd or even.
The inverters can be omitted if the subsequent row of
m
 
  XNOR gates
is exchanged for a row of XOR gates.
3. A disadvantage of the carry look-ahead tree may be its relatively long
internal wires.
Regarding the architectures in both Figure 5.1 and Figure 5.2, for large
m there
may be a problem for the NAND gate and the inverter (whose output signals
are
h and
h, respectively) to each drive
 
 
m
 
 
multiplexer transistors. The
delay of a stage with a large capacitive load can be signiﬁcantly reduced by
using properly sized drivers. Such drivers are, however, not used here.
Our comparison between different architectures with respect to their area-time
performance is made under the assumption that the architectures are both
preceded and followed by a parallel register. Then, using the architecture in
Figure 5.1 or the one in Figure 5.2, the time
T to perform the modulus reduction
operation is proportional to the lengths
L
m
o
d
 
 
 
L
r
e
g
 
r
r
e
g
f
m
o
d
 
 
 
L
C
P
 
m
o
d
 
 
 
r
m
o
d
 
 
f
r
e
g
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
m
 
 
 
L
m
o
d
 
 
 
L
r
e
g
 
r
r
e
g
f
m
o
d
 
 
 
L
C
P
 
m
o
d
 
 
 
r
m
o
d
 
 
f
r
e
g
 
 
 
 
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
respectively, where
r
m
o
d
 
 
 
r
m
o
d
 
 
 
r
r
e
g
 
 
 
. Using the size parameters
C
m
o
d
 
  and
C
m
o
d
 
  in (5.3) and (5.4), respectively, and the above lengths
L
m
o
d
 
 
and
L
m
o
d
 
 , the area-time performances
A
T
  of these modulus reduction cir-
cuits are proportional to the products
C
L
 
m
o
d
 
 
 
 
C
m
o
d
 
 
 
L
m
o
d
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
 
 
O
 
m
 
 
C
L
 
m
o
d
 
 
 
 
C
m
o
d
 
 
 
L
m
o
d
 
 
 
 
 
 
 
m
l
o
g
 
m
 
 
 
m
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
 
 
O
 
m
 
l
o
g
 
m
 
 
respectively. The sizes, CP lengths, and
A
T
  performances of the above two
circuits for modulus reduction are plotted in Figure 5.3. We see that the size
C
m
o
d
 
of the carry look-ahead-type architecture in Figure 5.2 is greater than
the size
C
m
o
d
 
  of the carry ripple-type architecture in Figure 5.1. On the other
hand, for the CP lengths of the architectures we have the reverse relation. The
ratio of the CP lengths
L
m
o
d
 
  and
L
m
o
d
 
  converges relatively fast to
 
 
  with
growing
m. We conclude that, with respect to the time complexities and the
A
T
performances, the architecture in Figure 5.2 is preferable to the architecture
in Figure 5.1.
[Figure 5.3 (plot): three panels, each plotted versus m = 2, 4, 8, 16, 32, 64, 128, 256, with one curve per modulus reduction architecture. Panel titles: Area complexity (size, C); Time complexity (CP length, L); Area-time performance.]
Figure 5.3: Sizes, CP lengths, and
A
T
  performances of the two modulus re-
duction architectures. The parameters are plotted versus
m for
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 . The lines connecting the parameter values are
plotted only to clearly illustrate how the complexity parameters grow with
m.
5.1.2 Negation
Let
  be an
 
m
 
 
  -bit NBC integer. Then we have
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
  (5.5)
where
 
 
 
 
m
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
  is the one’s complement of
 
 
 
 
m
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 . Adding 3 to
seems to be a simple operation, but we also would
like to perform the modulus reduction in the same computation step. We
therefore expand
  as
 
 
 
m
 
m
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
m
 
 
 
 
 
m
 
 
 
m
o
d
 
m
 
 
where, by Deﬁnition 2.4,
 
 
m
 
 
 
 
 
 
m
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
m
 
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 . Hence, (5.5) can be written as
 
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
m
o
d
 
m
 
 
 
 
where
 
 
 
m
 
 
 
m
 
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
  is an
 
m
 
 
-bit binary coded
integer for which
 
i
 
 
i
 
 
 
 
 
i
 
m
 
. Let
 
 
 
 
 
, where
 
 
  .
This sum can be computed using a simpliﬁed
 
m
 
 
 -bit parallel adder. By
(4.3) and (4.2) the sum output and the carry output of the adder element in bit
position
i are the Boolean functions
 
i
 
 
i
 
 
i
 
c
i and
c
i
 
 
 
 
i
 
i
 
c
i
 
 
i
 
 
i
 
respectively, where
 
 
i
 
m
 
 . We have
 
i
 
 
i
 
  for
 
 
i
 
m
 
 ,
 
i
 
 for
 
 
i
 
m
 
 ,
 
 
 
, and
c
 
 
  . Hence, the sum and carry output
functions simplify to
 
i
 
 
i
 
c
i
 
 
i
 
 
 
c
i
c
i
 
 
 
 
i
c
i
 
 
i
 
 
c
i
respectively, for
i
 
 
 
 
 
 
 
 
 
m
 
 . We identify these functions as the sum
and carry outputs of the half adder element, see Section 4.3.4 (Figure 4.11).
From the above it follows that the desired integer
 equals the
 
m
 
 
 -bit NBC
integer
 
 
 
 
m
 
 
m
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
. We obtain the bit values
 
m and
 
  as Boolean functions of the carry signal
c
m
 
  and the input signals
 
m and
 
 . There are four situations that have to be handled:
1. If
 
 
, then
 
 
 
m
 
 ,
 
 
 
, and
 
m
 
  .
Let
 
m
 
c
m
 
 
 
 and
 
 
 
 
 
 
  .
2. If
 
 
, then
 
 
 
m
 
 ,
 
 
 
, and
 
m
 
  .
Let
 
m
 
c
m
 
 
 
 and
 
 
 
 
 
 
  .
3. If
 
 
 
 
 
m
 
, then
 
 
 
 
 
m
 
 
 
 ,
 
  is arbitrary, and
 
m
 
  .
Let
 
m
 
c
m
 
 
 
 and
 
 
 
 
 .
4. If
 
 
 
m
 
 
 
 
m
o
d
 
m
 
 
, then
 
 
 
m
 
 ,
 
 
 
, and
 
m
 
  .
Let
 
m
 
c
m
 
 
 
 and
 
 
 
 
 
 
  .
These special cases yield the Karnaugh maps for
 
m and
 
  in Figure 5.4. The
respective Boolean functions are
 
m
 
c
m
 
 
 
 
 
c
m
 
 
 
 
 
 
 
 
 
 
m
 
c
m
 
 
 
 
 
 
m
 
 
c
m
 
 
 
 
 
[Figure 5.4: two Karnaugh maps, panels (a) and (b)]
Figure 5.4: Karnaugh maps for the output variables
 
m and
 
  of the negation cir-
cuit. X = “don’t care”. (a)
 
m
 
c
m
 
 
 
 . (b)
 
 
 
 
m
 
c
m
 
 
 
 .
Figure 5.5 shows a realisation in
Z
 
 
 
  of the negation circuit. As seen in the
ﬁgure, the above-mentioned simpliﬁed parallel adder consists of a row of half
adder elements, each comprising one AND gate and one XOR gate. In order
to generate the signal
c
m
 
 , the carry output of the half adder in the most sig-
niﬁcant bit position is inverted. The inversion is realised by exchanging the
half adder AND gate for a NAND gate.
The size of the architecture in Figure 5.5 equals
C
n
e
g
 
 
m
 
 
 
C
i
n
v
 
 
m
 
 
 
C
H
A
 
C
X
O
R
 
 
C
N
A
N
D
 
C
N
O
R
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
The CP of the architecture is the dotted path from the
 
  input along the carry
chain to
c
m
 
  and ﬁnally through two NAND gates to
 
 . Hence, the fan-in of
the architecture equals
f
n
e
g
 
f
i
n
v
 
 
and its output normalised resistance equals
r
n
e
g
 
r
N
A
N
D
 
 
 
The length
L
C
P
 
n
e
g of the internal CP equals
L
C
P
 
n
e
g
 
r
i
n
v
 
f
i
n
v
 
f
H
A
 
 
 
m
 
 
 
 
L
H
A
 
c
a
r
r
y
 
r
H
A
 
c
a
r
r
y
f
H
A
 
 
r
N
A
N
D
 
f
N
O
R
 
f
N
A
N
D
 
f
N
A
N
D
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
The negater in Figure 5.5 is a carry ripple type of architecture. The AND-gate
carry chain can be exchanged for a chain of alternating NAND and NOR gates,
Figure 5.5: Negation modulo
 
m
 
 
 
m
 
  .
 
 
 
 
 
 
 
 
 
m
 
 
 
 
m
o
d
 
m
 
 
, where
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
.
resembling the modiﬁcation of the OR-gate carry chain of the modulus reduc-
tion circuit in Figure 5.1. Such a modiﬁcation slightly reduces the size of the
architecture, but the CP length increases. This was also the case
for the circuit in Figure 5.1.
It is possible to design a carry look-ahead type of architecture for NBC nega-
tion. The carry look-ahead part of such a circuit may be similar to the carry
look-ahead part of the modulus reduction architecture in Figure 5.2. The dif-
ference in area-time performance between that carry look-ahead negater and
the architecture in Figure 5.5 is in the same order as the difference in area-time
performance between the architecture in Figure 5.2 and the one in Figure 5.1.
When the negater in Figure 5.5 is preceded and followed by parallel registers,
its total CP length equals
L
n
e
g
 
L
r
e
g
 
r
r
e
g
f
n
e
g
 
L
C
P
 
n
e
g
 
r
n
e
g
f
r
e
g
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
which means that the area-time performance
A
T
  of the suggested negater is
proportional to
C
L
 
n
e
g
 
 
C
n
e
g
 
L
n
e
g
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
 
O
 
m
 
 
 
In Section 8.1.3 we compare the complexity parameters of the above negater
with the complexity parameters of other negation circuits.
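
The negation rule of this section can be checked numerically. The following Python sketch is a behavioural model only, not the gate-level circuit of Figure 5.5. The function name `negate_mod_fermat` and the use of plain Python integers are assumptions made for illustration. It relies on the observation that, for an (m+1)-bit word, the one's complement plus 3 is congruent to the negated value modulo 2^m + 1.

```python
def negate_mod_fermat(x: int, m: int) -> int:
    """Return (-x) mod (2**m + 1) for an (m+1)-bit input 0 <= x <= 2**m.

    Behavioural sketch: one's complement over m+1 bits, then add 3.
    complement = 2**(m+1) - 1 - x, so complement + 3 = 2*(2**m + 1) - x,
    which is congruent to -x modulo 2**m + 1.
    """
    modulus = (1 << m) + 1
    mask = (1 << (m + 1)) - 1          # (m+1)-bit word
    ones_complement = ~x & mask        # invert every bit of the word
    return (ones_complement + 3) % modulus


if __name__ == "__main__":
    m = 4                              # modulus 17
    for x in range((1 << m) + 1):
        assert (x + negate_mod_fermat(x, m)) % ((1 << m) + 1) == 0
    print("negation check passed for m =", m)
```

The final `% modulus` stands in for the modulus reduction that the hardware performs in the same computation step.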
5.1.3 Addition and Subtraction
Addition
We consider the addition
 
 
 
 
 
 
m
o
d
 
m
 
 
, where the
 
m
 
 
  -bit
NBC integers
  and
  are elements of
Z
 
m
 
 , i.e. we have
 
 
 
 
 
 
 
m. In
order to simplify the arithmetic operation, we expand the above addition in
the following way.
Let
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 . This sum can be obtained by using an
m-bit parallel
carry ripple adder. We write
 in the form
 
 
c
m
 
m
 
 
m
 
 
 
m
 
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
, where
c
m is the carry output of the full adder element in bit posi-
tion
m
 
  and
 
i is the sum output of the adder element in bit position
i for
 
 
i
 
m
 
 . Because the ﬁrst carry input signal is always equal to zero,
the adder element in the least signiﬁcant bit position may be implemented as
a half adder.
Furthermore, let
 
 
 
 
 
m
 
 
 
 
 
m
 
. This is the same type of addition as the one
that resulted in the carry ripple-type modulus reduction circuit of Figure 5.1 in
Section 5.1.1. With
 
i being the input of a simpliﬁed adder element of the type
described in Section 5.1.1, the corresponding output is
 
i. We denote by
g
m
the carry output from the simpliﬁed adder element in the most signiﬁcant bit
position. Finally, we need a binary control signal, say
h, that lets either
 
 
m
 
 
 
or
 
 
m
 
 
  pass to the adder output. Thus, we deﬁne
h
 
Z
  such that
 
 
m
 
 
 
 
h
 
 
m
 
 
 
 
 
 
 
h
 
 
 
m
 
 
, i.e.
 
i
 
h
 
i
 
h
 
i for
 
 
i
 
m
 
 .
With regard to the above deﬁnitions, we consider the following seven special
cases of input signal combinations.
1. If
 
 
 
 
, then
 
 
m
 
 
m
 
 
 
 
 
 
 
 
 
c
m
 
g
 
 
 
 
 
 
 .
Let
h
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 .
2. If
 
 
 
 
 
 
 
 
 
m
 
  or
 
 
 
 
 
m
 
 
 
 
 
  ,
then
 
 
m
 
 
m
 
 
 
 
 
 
 
 
 
c
m
 
g
 
 
 
 
 
 
 .
Let
h
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 .
3. If
 
 
 
 
 
 
 
m or
 
 
 
m
 
 
 
  ,
then
 
 
m
 
 
m
 
 
 
 
 
 
  or
 
 
 
 
 
 
 
c
m
 
g
 
 
 
 
 
 
 .
Let
h
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 .
4. If
 
 
 
 
 
 
 
m
 
  and
 
 
 
m,
then
 
 
m
 
 
m
 
 
 
 
 
 
 
 
 
c
m
 
g
 
 
 
 
 
 
 .
Let
h
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 .
5. If
 
 
 
 
 
 
 
m
 
  and
 
 
 
m,
then
 
 
m
 
 
m
 
 
 
 
 
 
 
 
 
c
m
 
g
 
 
 
 
 
 
 .
Let
h
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 .
6. If
 
 
 
 
 
m
 
 
 
 
 
 
m or
 
 
 
m
 
 
 
 
 
 
m
 
 ,
then
 
 
m
 
 
m
 
 
 
 
 
 
  or
 
 
 
 
 
 
 
c
m
 
g
 
 
 
 
 
 
 .
Let
h
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 .
7. If
 
 
 
 
 
m, then
 
 
m
 
 
m
 
 
 
 
 
 
 
 
 
c
m
 
g
 
 
 
 
 
 
 .
Let
h
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
.
[Figure 5.6: two Karnaugh maps, panels (a) and (b)]
Figure 5.6: Karnaugh maps of the Boolean function
h and
 
m.
(a)
h
 
 
m
 
m
 
g
m
 
c
m
 
 
m
 
 
m
 . (b)
 
m
 
h
 
 
m
 
m
c
m.
The binary control signal
h and the output bit
 
m can be expressed as Boolean
functions of the variables
 
m,
 
m,
c
m, and
g
m. From the Karnaugh map in Figure 5.6(a) we derive the Boolean function
h
 
 
m
 
m
 
g
m
 
c
m
 
 
m
 
 
m
 
 
 
m
 
m
 
g
m
 
c
m
 
 
m
 
 
m
 
 
By writing
h in this form we see that it can be realised using four NAND gates,
one inverter, and one XOR gate. The Boolean function
 
m
 
 
c
m
 
 
 
m
 
 
m
 
 
 
g
m
is derived from the Karnaugh map in Figure 5.6(b). However, by comparing
the Karnaugh maps in Figure 5.6(a) and (b), we see that the map for
 
m is the
inverse of the map for
h, except in the positions
 
 
m
 
 
m
 
c
m
 
g
m
 
 
 
 
 
 
 
 
 
 
 
and
 
 
 
 
 
 
 
 
 . Therefore, the function
 
m can also be expressed as
 
m
 
h
 
c
m
 
m
 
m
 
h
 
c
m
 
 
m
 
 
m
 
 
Because the inverse of the term
c
m
 
 
m
 
 
m
 is a part of the expression for
h, we
consequently only need one inverter and one NOR gate to generate
 
m from
the gates producing the signal
h.
Figure 5.7 shows an adder architecture whose structure is based on the above
reasoning. The sum
 
 
 
 
m
 
 
 
 
 
 
m
 
 
  is computed using an ordinary
m-bit
parallel carry ripple adder. This part of the architecture may be replaced by a
carry look-ahead adder, if desirable. The gates in the leftmost dashed box in
Figure 5.7: A circuit performing the addition
 
 
 
 
 
 
m
o
d
 
m
 
 
 for
m
 
  .
The dotted paths P
  and P
  form the CP through the circuit. The gates within
the rightmost dashed box perform the modulus reduction. This part of the architecture
can also be found in Figure 5.1. The gates within the leftmost dashed
box generate the output bit
 
m and the control signal
h.
Figure 5.7 generate the control signal
h and the output bit
 
m. The gates in the
rightmost dashed box in the ﬁgure generate the
m least signiﬁcant bits of the
output binary coded integer
 by subtracting, if necessary, one (1) from
 
 
m
 
 
 
modulo
 
m.
As a result of a timing analysis based on the
R
C delay model described in
Chapter 4, we found that the dotted paths marked by P
  and P
  form the CP
through the adder architecture. The fan-in
f
a
d
d of the architecture equals the
fan-in of the half adder, i.e. we get
f
a
d
d
 
f
H
A
 
 
 
The internal length
L
C
P
 
a
d
d through the adder equals the length of path P
, i.e.
L
C
P
 
a
d
d
 
L
H
A
 
c
a
r
r
y
 
 
m
 
 
 
r
F
A
f
F
A
 
c
a
r
r
y
 
 
m
 
 
 
L
F
A
 
c
a
r
r
y
 
L
F
A
 
s
u
m
 
r
F
A
 
f
X
N
O
R
 
f
O
R
 
 
L
O
R
 
 
r
O
R
 
r
N
A
N
D
 
f
N
A
N
D
 
r
N
A
N
D
 
 
m
 
f
N
O
R
 
f
i
n
v
 
 
r
i
n
v
 
 
m
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
m
 
 
 
 
The
m two-input multiplexers at the circuit output are opened simultaneously
by the control signal
h and its inverse signal. The maximum normalised re-
sistance of the stage that runs through the multiplexer at bit position
i equals
r
X
N
O
R
 
 
 
 for
 
 
i
 
m
 
  and
r
H
A
 
s
u
m
 
 
 
 for
i
 
  . Therefore, the CP
output stage is any of the stages associated with the
m least signiﬁcant bits of
  and thus, the output normalised resistance of the adder equals
r
a
d
d
 
 
 
The size of the addition circuit equals
C
a
d
d
 
 
m
 
 
 
 
C
F
A
 
C
O
R
 
C
X
N
O
R
 
 
C
H
A
 
m
C
M
U
X
 
 
C
N
A
N
D
 
 
C
N
O
R
 
 
C
i
n
v
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
Assuming that
  and
  are outputs of
 
m
 
 
-bit parallel registers and that
 
is also stored in such a register, we get the total CP length
L
a
d
d
 
L
r
e
g
 
r
r
e
g
f
a
d
d
 
L
C
P
 
a
d
d
 
r
a
d
d
f
r
e
g
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 (5.6)
The area-time performance of the circuit is proportional to
C
L
 
a
d
d
 
 
C
a
d
d
 
L
a
d
d
 
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
 
 
O
 
m
 
 
 
Subtraction
The most straightforward method of performing subtraction is to ﬁrst negate
the subtrahend and then add it to the minuend, i.e.
 
 
 
 
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
 
Subtraction can thus be realised using the architectures of Figures 5.5 and 5.7.
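
The addition and subtraction rules above can be illustrated in software. The sketch below is a behavioural model under stated assumptions, not the gate-level adder of Figure 5.7: the function names are hypothetical, and the seven special cases of the text are collapsed into a single correction term. It adds the m least significant bits with an ordinary m-bit addition and then subtracts one for each weight-2^m contribution (the two most significant input bits and the carry out of the m-bit adder), using the fact that 2^m is congruent to -1 modulo 2^m + 1.

```python
def negate_mod_fermat(x: int, m: int) -> int:
    """(-x) mod (2**m + 1): one's complement over m+1 bits, plus 3."""
    modulus = (1 << m) + 1
    mask = (1 << (m + 1)) - 1
    return ((~x & mask) + 3) % modulus


def add_mod_fermat(a: int, b: int, m: int) -> int:
    """(a + b) mod (2**m + 1) for (m+1)-bit inputs 0 <= a, b <= 2**m."""
    modulus = (1 << m) + 1
    low_mask = (1 << m) - 1
    a_msb, b_msb = a >> m, b >> m            # bits of weight 2**m
    s = (a & low_mask) + (b & low_mask)      # ordinary m-bit addition
    carry_out = s >> m                       # carry out of the m-bit adder
    s_low = s & low_mask
    # Every bit of weight 2**m contributes -1 modulo 2**m + 1.
    result = s_low - (a_msb + b_msb + carry_out)
    if result < 0:                           # at most one correction needed
        result += modulus
    return result


def sub_mod_fermat(a: int, b: int, m: int) -> int:
    """(a - b) mod (2**m + 1): negate the subtrahend, then add."""
    return add_mod_fermat(a, negate_mod_fermat(b, m), m)


if __name__ == "__main__":
    m = 4                                    # modulus 17
    print(add_mod_fermat(16, 16, m))         # 16 + 16 = 32 = 15 (mod 17)
    print(sub_mod_fermat(3, 5, m))           # 3 - 5 = -2 = 15 (mod 17)
```

The single conditional correction mirrors the role of the control signal that selects between the uncorrected and the corrected sum at the adder output.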
5.1.4 Multiplication by Powers of 2
Multiplication of an NBC number by two is easily carried out as a binary shift
of the number. Because the modulus reduction operation is not so straightforward
and we use an
 
m
 
 
  -bit representation of the binary coded integers
of
Z
 
m
 
, multiplication by an arbitrary power of two is preferably carried out
as repeated multiplication by two. The modulus reduction is carried out after
every single shift, i.e. according to the congruence
 
n
 
 
 
 
 
n
 
 
 
m
o
d
 
m
 
 
 
 
m
o
d
 
m
 
 
 
 
Multiplication by 2
Figure 5.8 shows an architecture for computing
 
 
 
 
 
m
o
d
 
m
 
 
, where
m
 
  . The modulus reduction part of the circuit may for example be the ar-
chitecture in Figure 5.1 or the one in Figure 5.2. Here, due to its favourable
A
T
  performance, we only consider the carry look-ahead-type architecture in
Figure 5.2.
The input to the residue circuit is
 
 
 
 
m
 
 
 
m
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
when
 
 
 
 
 
m
 
, and
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
when
  equals
 
m. This is easily implemented using simplified two-input multiplexers
prior to the reduction circuit, as shown in Figure 5.8. Hence, the complete
circuit has size
C
m
u
l
t
 
 
C
i
n
v
 
 
m
 
 
 
 
C
T
G
 
 
 
 
C
m
o
d
 
 
 
 
m
 
l
o
g
 
m
 
 
 
m
 
 
Figure 5.8: Multiplication by two;
 
 
 
 
 
m
o
d
 
m
 
 
 for
 
 
Z
 
m
 
, where
m
 
  .
The CP is formed by the two dotted paths in the ﬁgure. Because the fan-in of
the reduction circuit in Figure 5.2, with respect to its least signiﬁcant bit posi-
tion, equals
f
N
O
R
 
f
i
n
v
 
  , the total fan-in of the architecture in Figure 5.8,
with respect to the
 
m-input node, equals
f
m
u
l
t
 
 
f
i
n
v
 
m
 
 
 
 
 
m
 
 
 
The output normalised resistance of the architecture is
r
m
u
l
t
 
 
r
m
o
d
 
 
 
r
 
 
 
 
where
r
  equals the normalised resistance from the
 
 -input node to the sup-
ply voltage source (or ground). The internal CP length equals
L
C
P
 
m
u
l
t
 
 
r
i
n
v
 
 
 
m
 
 
 
 
 
r
 
 
 
 
f
m
o
d
 
 
 
L
C
P
 
m
o
d
 
 
 
 
 
m
 
 
 
 
 
 
r
 
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
 
m
 
l
o
g
 
m
 
r
 
 
 
 
 
 
If the input
 is obtained from a parallel register and
  is stored in a similar reg-
ister, then
r
 
 
r
r
e
g
 
 and the total CP length equals
L
m
u
l
t
 
 
L
r
e
g
 
r
r
e
g
f
m
u
l
t
 
 
L
C
P
 
m
u
l
t
 
 
r
m
u
l
t
 
f
r
e
g
 
 
 
 
 
 
m
 
 
 
 
 
 
m
 
l
o
g
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
  (5.7)
The
A
T
  performance of the architecture in Figure 5.8 is proportional to the
product
C
L
 
m
u
l
t
 
 
 
C
m
u
l
t
 
 
L
m
u
l
t
 
 
 
 
 
 
m
 
l
o
g
 
m
 
 
 
m
 
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
 
 
O
 
m
 
l
o
g
 
m
 
 
The row of transmission gates and
pMOS transistors at the input of the multi-
plication-by-2 circuit in Figure 5.8 may be exchanged for a row of
m
 
  OR
gates. The OR gate in bit position
i, for
 
 
i
 
m
 
, would have
 
m and
 
i as
its input signals. If such a row of OR gates is used, the fan-in and the output
normalised resistance of the circuit are reduced to
 
m
 
 
 
f
O
R
 
 
 
m
 
 
  and
r
O
R
 
 
 
, respectively. The total CP length
L
m
u
l
t
 decreases by 30 but the circuit
size increases by
 
m
 
 . With respect to the
A
T
 performance, the row of
OR gates is preferable to the row of transmission-gates-and-
pMOS-transistors
only for
m
 
 
. For
m
 
 
 , the circuit in Figure 5.8 has better area-time per-
formance, compared with an architecture with a row of OR gates at the input.
Multiplication by
 
n
Multiplication by powers of two can be carried out by using a feedback cou-
pledmultiplication-by-2 circuit with aparallelregisterin thefeedbackloop. A
blockdiagramofsuchacircuitisshowninFigure5.9. Here, themultiplication-
by-2 block is the circuit in Figure 5.8.
For
 
 
Z
 
m
 
  and
n
 
N, the arithmetic operation
 
 
 
n
 
 
m
o
d
 
m
 
 
 
is carried out by ﬁrst, during an initial clock cycle, loading
  into the parallel
register and then running the circuit for an appropriate number of clock cycles.
Because the integer 2 has order
 
m
 
 
t
 
  modulo
 
m
 
 (see Section 2.3.2)
we have
 
 
 
n
 
t
 
 
 
m
o
d
 
m
 
 
, where
t
 
l
o
g
 
m. Thus, only the
t
 
least signiﬁcant bits of
n have to be considered. This implies that the desired
product
 
n
 
m
o
d
 
m
 
 is present at the circuit output after
n
 
t
  clock cycles
(not counting the initial clock cycle for loading the feedback register with
 ).
The chip area
A occupied by the circuit in Figure 5.9 is proportional to
C
m
u
l
t
 
n
 
C
m
u
l
t
 
 
 
m
 
 
 
C
r
e
g
 
 
m
 
l
o
g
 
m
 
 
 
m
 
 
 
 
The internal CP of the circuit is the feedback path from the output of the reg-
ister element in the most signiﬁcant bit position, through the multiply-by-2
circuit, to the input of any of the other register elements. Assuming that the
output
 
n
 
m
o
d
 
m
 
 is stored in an
 
m
 
 
  -bit parallel register, the length
of the internal CP equals
L
C
P
 
m
u
l
t
 
n
 
L
r
e
g
 
r
r
e
g
f
m
u
l
t
 
 
L
C
P
 
m
u
l
t
 
 
r
m
u
l
t
 
 
 
f
r
e
g
 
 
 
 
 
 
 
m
 
 
 
 
 
 
m
 
l
o
g
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
After
n
 
t
 
 
 clock cycles, including one clock cycle for initiating the feedback
register, the result is shifted out to the output register. Hence, the time
T re-
quired to perform the entire operation is proportional to
L
m
u
l
t
 
n
 
 
n
 
t
 
 
 
 
L
C
P
 
m
u
l
t
 
n
 
 
n
 
t
 
 
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
 
where
 
 
n
 
t
 
 
 
m
 
 , and the area-time performance
A
T
  of the multiply-
by-
 
n architecture is proportional to the product
C
m
u
l
t
 
n
 
L
m
u
l
t
 
n
 
 .
Note that, because we have
n
 
t
 
 
n
t
 
t
 
n
t
 
 
 
t
 
 
 
 
 
 
 
n
 
 
n
t
 
m
 
n
 
t
 
 
 ,
we can write
 
 
 
n
 
t
 
 
 
 
 
m
 
n
t
 
n
 
t
 
 
 
 
 
 
 
 
 
n
t
 
n
 
t
 
 
 
 
 
m
o
d
 
m
 
 
 
  (5.8)
Hence,
 
n
 
m
o
d
 
m
 
 can be computed by ﬁrst running the feedback multi-
plication-by-two circuit
n
 
t
 
 
  clock cycles. Then, if
n
t
 
 the desired product
 
 
 
n
 
 
m
o
d
 
m
 
 
 is present at the circuit output and if
n
t
 
 the re-
sult must be negated to obtain
 . The area-time performance of the resulting
circuit, which consequently also comprises a negater, is smaller (but not sig-
niﬁcantly smaller) than the area-time performance of the circuit in Figure 5.9.
It is also possible to design a strictly parallel architecture that performs
multiplication by powers of two in one clock cycle. The structure of such an
architecture would be similar to the structure of a barrel shifter. Such an architecture
is considered in Chapter 6 but not in the present chapter.
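
The feedback scheme for multiplication by powers of two can be sketched in software as follows. This is a behavioural illustration under stated assumptions, not a model of the circuits in Figures 5.8 and 5.9: the function names are hypothetical, and each loop iteration stands for one clock cycle of the feedback circuit. The first variant uses the fact that 2 has order 2m modulo 2^m + 1, so only the exponent modulo 2m matters. The second variant uses the shortcut that 2^m is congruent to -1, so at most m - 1 doublings are needed, followed by a negation when the exponent modulo 2m is at least m.

```python
def double_mod_fermat(x: int, m: int) -> int:
    """(2 * x) mod (2**m + 1) for 0 <= x <= 2**m: shift, then reduce once."""
    modulus = (1 << m) + 1
    y = x << 1                      # binary shift by one position
    return y - modulus if y >= modulus else y


def mul_pow2_feedback(x: int, n: int, m: int) -> int:
    """(x * 2**n) mod (2**m + 1) by repeated doubling (one per clock cycle)."""
    for _ in range(n % (2 * m)):    # 2 has order 2m modulo 2**m + 1
        x = double_mod_fermat(x, m)
    return x


def mul_pow2_with_negation(x: int, n: int, m: int) -> int:
    """Same product, using 2**m = -1 (mod 2**m + 1) to halve the cycle count."""
    modulus = (1 << m) + 1
    negate, cycles = divmod(n % (2 * m), m)   # negate is 0 or 1
    for _ in range(cycles):
        x = double_mod_fermat(x, m)
    return (modulus - x) % modulus if negate else x


if __name__ == "__main__":
    m = 8                                     # modulus 257
    print(mul_pow2_feedback(3, 10, m))        # 3 * 1024 mod 257
    print(mul_pow2_with_negation(3, 10, m))   # same value
```

In the second variant the saving in clock cycles is paid for by the extra negation step, which matches the remark that the resulting circuit also comprises a negater.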
[Figure 5.9 (block diagram): a parallel register in a feedback loop around the multiplication-by-2 block (Figure 5.8)]
Figure 5.9: Block diagram for multiplication by powers of two.
5.1.5 General Multiplication
An overview of some well known approaches to the binary multiplication
problem can be found for example in Hwang [52, Ch. 5] and Weste and Esh-
raghian [113, Ch. 8.2.7]. In principle, there are three types of architectures for
general multipliers, namely the serial-type, the serial/parallel-type, and the
parallel-type architecture. Factors like form of data transmission, circuit area
and computation time requirements, potential for pipelining (to increase the
clock frequency), and power dissipation constraints may govern the choice
of architecture type. For multiplication in an integer quotient ring, a serial/
parallel or strictly parallel architecture is generally preferable to a serial ar-
chitecture, inter alia with respect to the complexity of performing the mod-
ulus reduction operation. This issue was brieﬂy discussed in the beginning of
Chapter 4.
Independently of the type of architecture, multiplication of NBC integers is
generally performed as sequential addition of partial products. For the multi-
plicand
 
 
P
m
i
 
 
 
i
 
i and the multiplier
 
 
P
m
i
 
 
 
i
 
i, where as usual
 
i
 
 
i
 
Z
, we get the product
 
 
 
 
 
 
m
X
i
 
 
 
i
 
 
i
 
 
 
m
o
d
 
m
 
 
 
A common approach when designing a fast multiplier is to ﬁnd a way to
quickly sum up all the partial products. The serial/parallel multiplier, which
is also known as the shift-and-add multiplier, is one of the most well known
multipliers. It successively adds the partial products together using one feed-
back parallel adder. In each clock interval, a partial product
 
i
 
m
o
d
 
m
 
 
is calculated as
 
 
 
 
i
 
 
 
 
m
o
d
 
m
 
 , i.e. using repeated multiplication by 2
modulo
 
m
 
  .
A block diagram for a serial/parallel multiplier over
Z
 
m
 
  is shown in Fig-
ure 5.10. The parallel-input multiplicand
  and the serial-input multiplier
 
are initially loaded into the registers R
  and SR, respectively. The register R
 
is initiated with the all-zero word. These initiations are carried out during one
clock cycle. After the following
i clock cycles, the
 
m
 
 
  -bit parallel register
R
  contains the partial product
 
i
 
m
o
d
 
m
 
. Each output bit from register
R
  is fed both to one of the inputs of a two-input AND gate and to the input
of the multiplication-by-2 circuit. The bit-serial output
 
i of the shift register
SR is connected to the second input of each of these
m
 
 AND gates, making
the fan-out of the shift register equal to
 
m
 
 
 
f
A
N
D. Hence, the value of the
least signiﬁcant bit of SR controls, in each clock interval, whether the all-zero
word (for
 
i
 
  ) or the partial product in R
  (for
 
i
 
) is to be added to the
contents of R
 .
The CP of the serial/parallel multiplier architecture in Figure 5.10 is the path
from the output of shift register SR through an AND gate, the parallel adder,
and into one of the register elements in R
 .
 Using the carry ripple-type adder
in Figure 5.7, the length of this CP equals
L
C
P
 
m
u
l
t
 
L
r
e
g
 
r
r
e
g
 
 
m
 
 
 
f
A
N
D
 
L
A
N
D
 
r
A
N
D
f
a
d
d
 
L
C
P
 
a
d
d
 
r
a
d
d
f
r
e
g
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
After
m
 
 clock cycles, the product
 
 
 
 
 
 
m
o
d
 
m
 
 
 is obtained in
register R
. An initial clock cycle is required for loading the registers with their
initial values and an extra clock cycle is required to shift
 from register R
  into
an output register (not shown in the figure). Hence, the total computation time
T is proportional to
L
m
u
l
t
 
 
m
 
 
 
L
C
P
 
m
u
l
t
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
If the adder architecture in Figure 5.7 is adopted here, the CP ends in the register element
in the next most signiﬁcant bit position (
m
 
 ) of the parallel register R
.
[Figure 5.10 (block diagram), block labels: shift register SR; two parallel registers R; multiplication by 2 mod F_t; row of AND gates; addition modulo F_t]
Figure 5.10: The block diagram for a serial/parallel multiplier. The product
 
 
 
 
 
 
m
o
d
F
t
, where
F
t
 
 
m
 
  , is generated and stored in register R
  after
m
 
 clock cycles. The initial contents of the registers R
  and R
  are
  and the
all-zero word, respectively, and the shift register SR is initiated with
 . These
initial values are shown in the respective registers in the figure.
Using the multiplication-by-two circuit in Figure 5.8, the chip area
A occupied
by the circuit in Figure 5.10 is proportional to its size
C
m
u
l
t
 
C
m
u
l
t
 
 
 
 
m
 
 
 
C
r
e
g
 
 
m
 
 
 
C
A
N
D
 
C
a
d
d
 
 
m
 
l
o
g
 
m
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
m
 
 
 
 
m
 
l
o
g
 
m
 
 
 
 
m
 
 
 
 
The area and/or time complexities of the serial/parallel multiplier may be re-
duced by for example replacing the parallel carry ripple-type adder with an
adder that has better area-time performance. For standard binary serial/
parallel multiplication of binary coded two’s complement numbers, the computation
of the sum of partial products may be sped up by adopting
a different multiplication scheme, see for example Chapter 5 in Hwang’s
book on computer arithmetic [52]. However, using the NBC representation of
the integers of
Z
 
m
 
, it seems as if none of these schemes yields a serial/parallel
architecture whose area-time performance is significantly improved,
compared to the area-time performance of the serial/parallel multiplier in Figure
5.10. The area-time performance
A
T
  of the latter multiplier is proportional to
C
L
 
m
u
l
t
 
 
L
m
u
l
t
 
C
m
u
l
t
 
 
 
O
 
m
 
l
o
g
 
m
 
 
Remark: We have not investigated the properties of any bit-parallel architec-
ture for NBC multiplication in
Z
 
m
 
 . However, a promising candidate
for such an architecture is a modified version of the pipelined array multiplier
suggested by Benaissa et al. [11, Fig. 4]. Their multiplier is based
on an NBC representation of both the multiplier and the multiplicand.
Because this multiplier is basically a diminished–1 multiplier, its proper-
ties are investigated in Section 6.3.6 (see page 138). A block diagram of
the multiplier is shown in Figure 6.21.
5.1.6 Exponentiation of the Transform Kernel
Consider the exponentiation
$\beta \equiv \alpha^n \pmod{2^m+1}$ (5.9)
where $\alpha \in \mathbb{Z}_{2^m+1}$ and the exponent $n$ is an integer. Because the order of every element of $\mathbb{Z}_{2^m+1}$ divides $\varphi(2^m+1)$, where $\varphi$ is Euler's totient function, the only part of the exponent $n$ that has to be considered is $n \bmod \varphi(2^m+1)$. In particular, when $\alpha$ is the transform kernel $\omega$, the order of $\omega$ modulo $2^m+1$ equals the transform length $N = 2^b$ for some integer $b$ (see (2.4) in Section 2.3.2). Therefore, when computing powers of the transform kernel, we use the exponent $n \bmod N$.
Footnote: See Corollary 8.1.1 in Rosen's book [84]. The totient function $\varphi(2^m+1)$ equals $2^m$ in the Fermat prime fields, i.e. for $m = 1, 2, 4, 8, 16$.
There exist several algorithms for integer exponentiation. Probably the most well-known method is the so-called binary method, which is described by Knuth in [56, Ch. 4.6.3]. It is based on the NBC expansion of the exponent $n$. The $r$-bit NBC integer $n \bmod q$ can be written in the form
$n \bmod q = n_{r-1}2^{r-1} + n_{r-2}2^{r-2} + \cdots + n_1 2 + n_0$,
where $r = \lfloor \log_2(n \bmod q) \rfloor + 1$, $n_0, \ldots, n_{r-1} \in \mathbb{Z}_2$, and $q = \varphi(2^m+1)$ (for arbitrary $\alpha \in \mathbb{Z}_{2^m+1}$) or $q = N$ (for $\alpha = \omega$). Consequently, the congruence in (5.9) can be written as
$\beta \equiv \left(\cdots\left(\left(\alpha^{n_{r-1}}\right)^2 \alpha^{n_{r-2}}\right)^2 \cdots \alpha^{n_1}\right)^2 \alpha^{n_0} \pmod{2^m+1}$.
In the binary method, the right-hand side of this congruence is evaluated using repeated squaring and multiplication. Hence, depending on $n$, $r-1$ squarings and at most the same number of multiplications are required to perform the exponentiation. In a conventional circuit for exponentiation we use a full-width exponent representation, i.e. we have $r = \lfloor \log_2(\varphi(2^m+1) - 1) \rfloor + 1$ in the general case (for arbitrary nonzero $\alpha \in \mathbb{Z}_{2^m+1}$) and $r = \lfloor \log_2(N-1) \rfloor + 1 = b$ for $\alpha = \omega$. The multiplications required to perform the exponentiation are general multiplications. For some choices of base $\alpha$, these multiplications may be carried out in a simpler way, but such simplified multiplications are not considered here.
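The binary method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `binary_exponentiation` and the operation counters are hypothetical, ordinary integer arithmetic stands in for the multiplier hardware, and the caller is assumed to have already reduced the exponent modulo $q$.

```python
def binary_exponentiation(alpha, n, m):
    """Left-to-right binary method modulo 2**m + 1.

    Returns (alpha**n mod 2**m + 1, number of squarings, number of multiplications).
    The exponent n is assumed to be already reduced (n mod q in the text).
    """
    modulus = (1 << m) + 1
    if n == 0:
        return 1 % modulus, 0, 0
    bits = bin(n)[2:]                 # n_{r-1} ... n_0, leading bit is 1
    result = alpha % modulus          # accounts for the leading bit n_{r-1}
    squarings = 0
    multiplications = 0
    for bit in bits[1:]:              # remaining r - 1 bits
        result = (result * result) % modulus
        squarings += 1
        if bit == "1":
            result = (result * alpha) % modulus
            multiplications += 1
    return result, squarings, multiplications
```

For an $r$-bit exponent the loop performs exactly $r-1$ squarings and at most $r-1$ multiplications, which is the operation count stated in the text.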
Zuras [115] discusses how to find the fastest way to square (and multiply) large integers in software. Denote by $T_{\text{mult}}$ and $T_{\text{square}}$ the computation times for general multiplication and squaring, respectively. Because squaring is a special case of multiplying, we trivially have $T_{\text{square}} \le T_{\text{mult}}$. There is no known algorithm for squaring that is significantly faster than general multiplication. From the equation
$AB = \dfrac{(A+B)^2 - (A-B)^2}{4}$
it follows that a multiplication can be carried out as two squarings, three additions, and one multiplication by $2^{-2}$. Assuming that addition and multiplication by $2^{-2}$ take at most $O(m)$ time (see for example (5.6) and (5.7)), where $m$ is the operand word bit length, we thus get
$T_{\text{mult}} \le 2T_{\text{square}} + O(m)$.
Hence, as stated by Zuras, even though someone may discover an algorithm for squaring that is faster than any existing multiply algorithm, any squaring algorithm can be used to construct a multiply algorithm that is not more than a constant factor slower than the squaring algorithm. Regarding the NBC representation of the elements of $\mathbb{Z}_{2^m+1}$ we do not consider any specially designed architecture for squaring. Squarings are performed as general multiplications, which means that exponentiation requires at most $2(r-1)$ multiplications, where $r = \lfloor \log_2(\varphi(2^m+1) - 1) \rfloor + 1$ for arbitrary nonzero $\alpha \in \mathbb{Z}_{2^m+1}$ and $r = b$ for $\alpha = \omega$. Hence, using the NBC representation and the binary method as described above, exponentiation in $\mathbb{Z}_{2^m+1}$ can be performed in $O(2(r-1)L_{\text{mult}})$ time.
Footnote: For $x \in \mathbb{R}$, the expression $\lfloor x \rfloor$ denotes the greatest integer less than or equal to $x$.
Footnote: When $2^m+1$ is prime we get $r = m$.
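As a small illustration of the squaring identity above, the following Python sketch multiplies two residues modulo $2^m+1$ using two squarings, three additions/subtractions, and one multiplication by the inverse of 4. The function name is hypothetical, and the modular inverse is obtained with Python's built-in `pow(4, -1, modulus)` (available from Python 3.8), which is an implementation convenience rather than anything taken from the thesis.

```python
def multiply_via_squarings(a, b, m):
    """Compute a*b mod 2**m + 1 from two squarings, following
    AB = ((A + B)**2 - (A - B)**2) / 4."""
    modulus = (1 << m) + 1
    inv4 = pow(4, -1, modulus)        # 2**(-2); exists because the modulus is odd

    def square(x):
        return (x * x) % modulus

    total = a + b                     # addition 1
    diff = a - b                      # addition 2 (subtraction)
    return ((square(total) - square(diff)) * inv4) % modulus   # addition 3, scaling
```

The sketch makes the bound $T_{\text{mult}} \le 2T_{\text{square}} + O(m)$ concrete: apart from the two calls to `square`, only linear-time operations are used.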
Alternative methods of performing integer exponentiation are described for example in Chapter 4.6.3 in Knuth's book [56] and in Zuras' paper [115]. Bocharova and Kudryashov [18], [19] investigate exponentiation schemes based on different source codes. Gollmann et al. [46] consider exponentiation based on a signed-digit representation of the exponent. See also the articles on integer exponentiation in the reference list in Gollmann's paper [46]. Compared with the binary method, most other algorithms for integer exponentiation reduce the number of true multiplications, often by processing several bits of the binary (or signed-digit) representation of the exponent at a time, which for some algorithms is done at the cost of a precomputed look-up table. The number of squarings is approximately the same for most algorithms. In the present chapter we only consider the above binary method.
In Section 2.3.2 we showed that for some sequence lengths $N$ there exist suitable choices of the kernel $\omega$ for which the different powers of the kernel are easily calculated. For example, for the combinations $(N, \omega) = (2m, 2)$ and $(N, \omega) = (4m, \sqrt{2})$, multiplication by a power of $\omega$ can be simply carried out as binary shifts in the former case and a pair of binary shifts and one addition in the latter case.
For transforms of arbitrary lengths, the powers of the transform kernel are either directly calculated when needed or precomputed and stored in a memory (look-up table) from which they are read when needed. For the direct calculations, we use the above binary method for exponentiation. When the powers of $\omega$ are precomputed, the exponentiations are suitably carried out as repeated multiplication by $\omega$, i.e. $\omega^2 = \omega \cdot \omega$, $\omega^3 = \omega \cdot \omega^2$, and so on. We do not consider the complexity of these precomputations.
In Fermat prime fields, i.e. in $\mathbb{Z}_{2^m+1}$ for $m = 1, 2, 4, 8, 16$, there are more ways of performing exponentiation. For example, in Section 7.2.1 we describe methods of performing exponentiation with respect to the polar representation.
5.2 Summary
In Table 5.1 we have summarised the sizes, the fan-ins, the internal and total CP lengths, the output normalised resistances, and the area-time performances $AT^2$ of the architectures in the present chapter. Note that the modulus reduction operation is included in the other four operations.
Table 5.1: The complexity parameters of the architectures in the present chapter. Columns: operation, figure, subscript name, size $C$, fan-in $f$, internal CP length $L_{\text{int}}$, normalised output resistance $r_o$, total CP length $L$ (including registers), and area-time performance $CL^2$. Rows: modulus reduction (Figure 5.1, subscript mod,1, and Figure 5.2, subscript mod,2), negation (Figure 5.5, neg), addition (Figure 5.7, add), multiplication by 2 (Figure 5.8, mult2), multiplication by $2^n$ (Figure 5.9, mult$2^n$), general multiplication (Figure 5.10, mult), and exponentiation, for which at most $2(r-1)$ multiplications are required, where $r = \lfloor \log_2(\varphi(2^m+1) - 1) \rfloor + 1$ or $r = b$.
Chapter 6
The Diminished–1 Representation
6.1 Linearly Transformed Representations
In this chapter we investigate properties of arithmetic operations in $\mathbb{Z}_{2^m+1}$, with respect to a linear coordinate transformation of the $(m+1)$-bit normal binary coded (NBC) integers of $\mathbb{Z}_{2^m+1}$. In the resulting number system, an NBC integer $\alpha \in \mathbb{Z}_{2^m+1}$ is represented by the binary coded integer
$T(\alpha) \equiv k\alpha + l \pmod{2^m+1}$, (6.1)
where $k, l \in \mathbb{Z}_{2^m+1}$. The reverse code translation (from $T(\alpha)$ to $\alpha$) can be written as $\alpha \equiv k^{-1}(T(\alpha) - l) \pmod{2^m+1}$. Consequently, the reverse code translation only exists if $k$ has a multiplicative inverse $k^{-1}$ in $\mathbb{Z}_{2^m+1}$, i.e. if $\gcd(k, 2^m+1) = 1$.
Depending on the constants $k$ and $l$, we obtain various VLSI architectures for arithmetic operations in $\mathbb{Z}_{2^m+1}$. Trivially, for $k = 1$ and $l = 0$ we get the NBC representation $T(\alpha) = \alpha$ and consequently the architectures considered in Chapter 5.
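The code translation (6.1) and its reverse can be illustrated with a short Python sketch. The helper name `make_translation` and its return convention are hypothetical and chosen only for this example; the modular inverse relies on Python's built-in `pow(k, -1, modulus)` (Python 3.8 or later).

```python
from math import gcd


def make_translation(k, l, m):
    """Return (forward, reverse) for T(alpha) = k*alpha + l mod 2**m + 1.

    Raises ValueError when k has no multiplicative inverse modulo 2**m + 1,
    i.e. when the reverse code translation does not exist.
    """
    modulus = (1 << m) + 1
    if gcd(k, modulus) != 1:
        raise ValueError("k must be coprime to 2**m + 1")
    k_inv = pow(k, -1, modulus)

    def forward(alpha):
        return (k * alpha + l) % modulus

    def reverse(t):
        return (k_inv * (t - l)) % modulus

    return forward, reverse
```

With `k = 1` and `l = 0` both functions are the identity, which corresponds to the plain NBC representation. Note that the coprimality check only matters when $2^m+1$ is composite, for example for $m = 5$, where $2^m + 1 = 33$.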
Because every translated integer $T(\alpha)$ is an $(m+1)$-bit NBC integer in $\mathbb{Z}_{2^m+1}$, reduction modulo $2^m+1$ can be performed using the procedure described in Section 5.1.1 for any $k$ and $l$. In Section 6.2 we investigate how to choose the constants $k$ and $l$ so that the modulus reduction operation can be incorporated into the various arithmetic operations in a straightforward way, i.e. in a way that minimises the computational complexity of each operation.
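For the sake of convenience, we occasionally denote a translated NBC integer $T(\alpha)$ by $\tilde\alpha$, i.e. for $\alpha = (\alpha_m, \alpha_{m-1}, \ldots, \alpha_0)$ with bits in $\mathbb{Z}_2$ we have
$T(\alpha) = \tilde\alpha = (\tilde\alpha_m, \tilde\alpha_{m-1}, \ldots, \tilde\alpha_0)$.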
6.1.1 Arithmetic Operations
For arbitrary $\alpha, \beta, k, l \in \mathbb{Z}_{2^m+1}$, where $\gcd(k, 2^m+1) = 1$, we get the following arithmetic:
Negation
By (6.1) we get $T(\alpha) + T(-\alpha) \equiv 2l \pmod{2^m+1}$ and thus
$T(-\alpha) \equiv -T(\alpha) + 2l \pmod{2^m+1}$.
By (5.5) we get the congruence $-T(\alpha) \equiv \overline{T(\alpha)} + 3 \pmod{2^m+1}$, which gives
$T(-\alpha) \equiv \overline{T(\alpha)} + 3 + 2l \pmod{2^m+1}$ (6.2)
Thus, the integer $-\alpha$ is represented by $\overline{T(\alpha)} + 3 + 2l \bmod 2^m+1$, where $\overline{T(\alpha)}$ is the one's complement of the $(m+1)$-bit translated NBC integer $T(\alpha)$.
Addition
$T(\alpha + \beta) \equiv k(\alpha + \beta) + l \equiv T(\alpha) + T(\beta) - l \pmod{2^m+1}$ (6.3)
For the sake of simplicity we sometimes use the symbol $\oplus$ to denote addition between translated symbols. Hence, we define such an addition as
$T(\alpha) \oplus T(\beta) \equiv T(\alpha + \beta) \pmod{2^m+1}$ (6.4)
Subtraction
The congruence
$T(\alpha - \beta) \equiv T(\alpha) \oplus T(-\beta) \pmod{2^m+1}$ (6.5)
follows directly from (6.3) and (6.4), i.e. subtraction is performed in the traditional way by first negating the subtrahend and then adding it to the minuend.
Footnote: Chang et al. [32] use the same notation. Note that the symbol $\oplus$ denotes the logical XOR function whenever it appears in a Boolean expression.
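The negation, addition, and subtraction rules of this section can be checked numerically with a small Python sketch. The function names are hypothetical. The constant 3 in `t_neg` is derived here from the fact that an $(m+1)$-bit word $x$ and its one's complement sum to $2^{m+1} - 1 \equiv -3 \pmod{2^m+1}$; it is stated as an assumption of this sketch.

```python
def ones_complement(t, m):
    """One's complement of an (m+1)-bit word."""
    return t ^ ((1 << (m + 1)) - 1)


def t_neg(t_alpha, l, m):
    """T(-alpha) from T(alpha): complement, then add 3 + 2l (assumed form of (6.2))."""
    modulus = (1 << m) + 1
    return (ones_complement(t_alpha, m) + 3 + 2 * l) % modulus


def t_add(t_alpha, t_beta, l, m):
    """T(alpha + beta) = T(alpha) + T(beta) - l, cf. (6.3) and (6.4)."""
    modulus = (1 << m) + 1
    return (t_alpha + t_beta - l) % modulus


def t_sub(t_alpha, t_beta, l, m):
    """T(alpha - beta): negate the subtrahend, then add, cf. (6.5)."""
    return t_add(t_alpha, t_neg(t_beta, l, m), l, m)
```

None of the three functions needs the constant $k$, in line with the observation made later in the chapter that $k$ only enters code translation, general multiplication, and exponentiation.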
Multiplication by Powers of 2
We expand $T(2^n\alpha)$ as
$T(2^n\alpha) \equiv k2^n\alpha + l + 2^n l - 2^n l \equiv 2^n T(\alpha) - (2^n - 1)l$ (6.6)
$\equiv \bigoplus_{i=1}^{2^n} T(\alpha) \pmod{2^m+1}$ (6.7)
where $\bigoplus$ denotes the special summation of translated symbols, i.e. repeated use of the addition $\oplus$. The special case of simple multiplication by 2,
$T(2\alpha) \equiv 2T(\alpha) - l \pmod{2^m+1}$ (6.8)
can also be directly obtained from the addition formula (6.3). The product $2T(\alpha)$ is simply obtained by shifting the NBC integer $T(\alpha)$ one bit to the left.
The translated product $T(2^n\alpha)$ can be calculated in a way that is computationally more efficient than the direct computation of (6.6). We have
$T(2^n\alpha) \equiv k2^n\alpha + l \equiv 2(k2^{n-1}\alpha + l) - l \equiv 2T(2^{n-1}\alpha) - l \pmod{2^m+1}$ (6.9)
which is simply computed using repeated multiplication by 2 and addition. The modulus reduction is performed after every multiplication by 2 and addition of $-l$.
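The recursion (6.9) translates directly into a loop of shift-and-correct steps. The following Python sketch uses hypothetical function names and plain integer arithmetic, and also includes the closed form corresponding to (6.6) for comparison.

```python
def t_times_power_of_two(t_alpha, n, l, m):
    """T(2**n * alpha) by n steps of T(2x) = 2*T(x) - l, reducing after each step."""
    modulus = (1 << m) + 1
    t = t_alpha % modulus
    for _ in range(n):
        t = ((t << 1) - l) % modulus   # one-bit left shift, add -l, reduce
    return t


def t_times_power_of_two_direct(t_alpha, n, l, m):
    """Closed form 2**n * T(alpha) - (2**n - 1) * l, cf. (6.6)."""
    modulus = (1 << m) + 1
    return ((t_alpha << n) - ((1 << n) - 1) * l) % modulus
```

The iterative version keeps every intermediate value within $m+1$ bits before the shift, which is the reason the text prefers it to the direct evaluation.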
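General Multiplication
$T(\alpha\beta) \equiv k\alpha\beta + l \equiv k\,k^{-1}(T(\alpha) - l)\,k^{-1}(T(\beta) - l) + l \equiv k^{-1}\bigl(T(\alpha)T(\beta) - l(T(\alpha) + T(\beta) - l)\bigr) + l \pmod{2^m+1}$ (6.10)
Because $T(\alpha)$ and $T(\beta)$ are NBC integers, it is possible to simplify (6.10). By writing $T(\beta)$ in the form $T(\beta) = \sum_{i=0}^{m} \tilde\beta_i 2^i$, where $\tilde\beta_i \in \{0, 1\}$, we get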
T
 
 
 
 
 
 
k
 
 
 
l
 
 
 
k
 
 
l
 
 
l
 
 
l
 
 
 
T
 
 
 
 
l
k
 
 
 
k
 
 
l
 
 
l
 
k
 
 
 
l
 
 
m
X
i
 
 
 
 
i
 
i
 
l
k
 
 
 
l
 
k
 
T
 
 
 
 
 
m
X
i
 
 
 
 
 
i
 
i
 
 
l
 
 
m
l
 
l
k
 
 
 
l
 
k
 
T
 
 
 
 
 
l
 
 
m
X
i
 
 
 
T
 
 
 
i
 
i
 
 
 
 
l
k
 
 
 
l
 
k
 
T
 
 
 
  (mod
 
m
 
  )
  (6.11)
In some applications one may wish to represent either the multiplicand or the multiplier as an NBC number. For example, constants or the Fermat number transform coefficients (the powers of the transform kernel) may just as well be stored in that format. By writing the NBC multiplier $\beta$ in the form $\beta = \sum_{i=0}^{m} \beta_i 2^i$ we get
T
 
 
 
 
 
 
k
 
 
 
l
 
k
 
m
X
i
 
 
 
i
 
i
 
l
 
m
X
i
 
 
 
k
 
i
 
i
 
 
l
 
 
l
 
m
 
 
 
 
l
 
m
X
i
 
 
T
 
k
 
i
 
i
 
 
 
l
m
 
m
X
i
 
 
 
T
 
k
 
i
 
i
 
  (mod
 
m
 
  )
  (6.12)
which apparently has a simpler structure than both (6.10) and (6.11). The efficiency of computing (6.10) and (6.11) depends on which values are assigned to $k$ and $l$. The multiplication procedure according to (6.12) depends on the value of $k$, but not $l$. It is also possible to obtain an expression for general multiplication which involves $l$ but not $k$:
T
 
 
 
 
 
 
k
 
 
 
l
 
 
k
 
 
l
 
m
X
i
 
 
 
i
 
i
 
l
 
 
l
 
m
X
i
 
 
 
i
 
i
 
k
 
 
l
 
 
l
 
 
l
 
m
X
i
 
 
 
i
 
k
 
i
 
 
l
 
 
l
m
X
i
 
 
 
i
 
l
m
X
i
 
 
 
i
 
i
 
l
 
 
l
 
m
X
i
 
 
 
i
T
 
 
i
 
 
 
l
 
 
 
m
X
i
 
 
 
i
 
 
m
X
i
 
 
 
i
T
 
 
i
 
 
 
l
 
m
X
i
 
 
 
 
 
 
i
 
 
m
 
 
m
X
i
 
 
 
 
i
T
 
 
i
 
 
 
 
i
l
 
 
m
 
m
 
l
 
 
 
 
m
X
i
 
 
 
 
 
i
T
 
 
i
 
 
 
 
i
l
 
 
m
 
l
 
 
 
 
m
o
d
 
m
 
 
 
  (6.13)
In Section 6.3.6 we consider various multiplication procedures which are based on the above congruences (6.10), (6.11), (6.12), and (6.13).
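To make the role of the constants concrete, the sketch below evaluates general multiplication of translated symbols in two ways. `t_mul` follows the structure of (6.10) and needs $k^{-1}$. `t_mul_bit_serial` is an illustrative shift-and-add variant, assumed here rather than copied from the text, that takes the multiplier $\beta$ in NBC form and uses only $l$, by combining the addition rule (6.3) with the doubling rule (6.8). All names are hypothetical.

```python
def t_mul(t_alpha, t_beta, k, l, m):
    """T(alpha*beta) = k^-1 * (Ta*Tb - l*(Ta + Tb - l)) + l, cf. (6.10)."""
    modulus = (1 << m) + 1
    k_inv = pow(k, -1, modulus)        # requires gcd(k, 2**m + 1) = 1
    return (k_inv * (t_alpha * t_beta - l * (t_alpha + t_beta - l)) + l) % modulus


def t_mul_bit_serial(t_alpha, beta, l, m):
    """T(alpha*beta) for an NBC multiplier beta, using only l.

    Accumulates T(beta_i * 2**i * alpha) with the translated addition;
    T(0) = l is the neutral element of that addition.
    """
    modulus = (1 << m) + 1
    acc = l % modulus                  # T(0)
    shifted = t_alpha % modulus        # T(2**0 * alpha)
    for i in range(m + 1):
        if (beta >> i) & 1:
            acc = (acc + shifted - l) % modulus     # translated addition (6.3)
        shifted = (2 * shifted - l) % modulus       # translated doubling (6.8)
    return acc
```

The bit-serial variant mirrors the serial/parallel multiplier structure: one conditional accumulation and one doubling per multiplier bit.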
Exponentiation
The formula for general exponentiation is
$T(\alpha^n) \equiv k\alpha^n + l \equiv k^{1-n}(T(\alpha) - l)^n + l \pmod{2^m+1}$ (6.14)
6.2 The Use of a Zero Indicator
Generally speaking, the best choice of $k$ and $l$ in (6.1) yields optimum complexity and performance of the corresponding VLSI architectures for arithmetic operations. Among the arithmetic operations considered in the previous section, the code translation (Equation (6.1)), general multiplication according to (6.10), (6.11), and (6.12), and exponentiation (Equation (6.14)) are the only ones involving the constant $k$. It is involved in these operations in the following ways:
• Multiplication by $k$ and $k^{-1}$.
• Multiplication by $lk^{-1}$ (or $-lk^{-1}$) and addition by $l + k$.
• Multiplication by $k^{1-n}$.
These operations are most simply carried out if $k = 1$, $k = -l$, and $k = 1$, respectively. The operations then reduce to multiplication by one (for all equations involving $k$) and addition by zero (Equation (6.11)). Hence, we assert that choosing $k = 1$ is the best choice with respect to the simplification of the above operations.
Now, let us consider the choice of the constant $l$. We have seen that addition of translated symbols occurs in several operations (Equations (6.3), (6.5), (6.7), (6.10), (6.11), (6.12), and (6.13)). We therefore first focus on the sum
$T(\alpha) \oplus T(\beta) \equiv T(\alpha + \beta) \equiv T(\alpha) + T(\beta) - l \pmod{2^m+1}$
in (6.3). Multiplication by two,
$T(2\alpha) \equiv 2T(\alpha) - l \pmod{2^m+1}$
is another operation of special interest, because it is involved in the computation of general multiplication. Multiplication by two is of course a special case of addition, but the product $2T(\alpha)$ is preferably carried out as a binary shift of $T(\alpha)$ instead of ordinary addition. We would, however, like to carry out the addition by $-l$ followed by the reduction modulo $2^m+1$ as simply as possible.
The following $(m+1)$-bit NBC integers modulo $2^m+1$ are the elements of $\mathbb{Z}_{2^m+1}$:
$(1\,0\,0 \cdots 0\,0) = 2^m$
$(0\,1\,1 \cdots 1\,1) = 2^m - 1$
$(0\,1\,1 \cdots 1\,0) = 2^m - 2$
...
$(0\,0\,0 \cdots 0\,1) = 1$
$(0\,0\,0 \cdots 0\,0) = 0$
It would be very convenient, at least from an implementation point of view, if $2^m$ represents the zero element. Because the NBC integer $2^m$ is the only element of $\mathbb{Z}_{2^m+1}$ which has a one in its most significant bit position, it would then be enough to check in one bit position whether an element is zero. Such a procedure can be helpful for example when computing sums and products; addition by zero ($T(\alpha + 0) \equiv T(\alpha)$) and multiplication by zero ($T(\alpha \cdot 0) \equiv T(0)$) are two operations that can be simply carried out in VLSI during a single clock interval.
When representing the zero element by the integer $2^m$, the nonzero integers $1, 2, \ldots, 2^m$ are consequently represented by the ($m$-bit) NBC integers of $\mathbb{Z}_{2^m}$. Hence, we can use an $m$-bit arithmetic for the nonzero elements of $\mathbb{Z}_{2^m+1}$, which from a complexity point of view is preferable to the $(m+1)$-bit arithmetic associated with the NBC representation in Chapter 5.
Henceforth, the translated element $2^m$ is called the zero indicator. Thus, by letting $T(0) = 2^m$ represent $\alpha = 0$ we get $l = -1$ from (6.1) (since $T(0) = l$ and $2^m \equiv -1 \pmod{2^m+1}$), which is the same value of $l$ that is obtained from the choice of $k$: for $k = 1$ and $k = -l$, we get $l = -1$.
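The congruences (6.3) and (6.8) then change to
$T(\alpha + \beta) \equiv T(\alpha) + T(\beta) + 1 \pmod{2^m+1}$ (6.15)
$T(2\alpha) \equiv 2T(\alpha) + 1 \pmod{2^m+1}$ (6.16)
respectively. It is also interesting to note that, for $l = -1$, negation (Equation (6.2)) and general multiplication according to (6.13) simplify to
$T(-\alpha) \equiv \overline{T(\alpha)} + 1 \pmod{2^m+1}$ (6.17)
$T(\alpha\beta) \equiv \bigoplus_{i=0}^{m} \bigl(\beta_i T(2^i\alpha) - \overline{\beta_i}\bigr) \pmod{2^m+1}$ (6.18)
respectively, where $\beta_i$ denotes bit $i$ of the NBC multiplier $\beta$ and $\bigoplus$ denotes the special summation of translated symbols. Obviously, the addition by 1 modulo $2^m+1$ appears in the three congruences (6.15), (6.16), and (6.17) (and actually also in the congruence (6.18), which is formed by $m$ additions of the type in (6.15)).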
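For the choice $k = 1$, $l = -1$, the congruences (6.15) to (6.17) become very simple to state in code. The Python sketch below uses hypothetical function names and ordinary integer arithmetic; it is meant only to let the reader check the three rules and the role of $2^m$ as zero indicator, not to model any particular circuit.

```python
def to_dim1(alpha, m):
    """Translate an NBC integer to T(alpha) = alpha - 1 mod 2**m + 1."""
    return (alpha - 1) % ((1 << m) + 1)


def dim1_add(x, y, m):
    """T(alpha + beta) = T(alpha) + T(beta) + 1, cf. (6.15)."""
    return (x + y + 1) % ((1 << m) + 1)


def dim1_double(x, m):
    """T(2*alpha) = 2*T(alpha) + 1, cf. (6.16)."""
    return ((x << 1) + 1) % ((1 << m) + 1)


def dim1_neg(x, m):
    """T(-alpha) = one's complement of the (m+1)-bit word T(alpha), plus 1, cf. (6.17)."""
    complement = x ^ ((1 << (m + 1)) - 1)
    return (complement + 1) % ((1 << m) + 1)
```

In this representation `to_dim1(0, m)` equals `1 << m`, so a zero operand can be detected by inspecting the most significant bit alone.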
 
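For an arbitrary $(m+1)$-bit NBC integer $\gamma \in \mathbb{Z}_{2^{m+1}}$ with bits $\gamma_m, \gamma_{m-1}, \ldots, \gamma_0$, the congruence $\gamma' \equiv \gamma + 1 \pmod{2^m+1}$ can be simplified as
$\gamma' \equiv \gamma_m 2^m + (\gamma \bmod 2^m) + 1 \equiv (\gamma \bmod 2^m) + 1 - \gamma_m \pmod{2^m+1}$ (6.19)
where $\gamma \bmod 2^m = \gamma_{m-1}2^{m-1} + \cdots + \gamma_0 2^0$ and where we have used $2^m \equiv -1 \pmod{2^m+1}$. We thus have
$\gamma' \equiv (\gamma \bmod 2^m) + 1$ for $\gamma_m = 0$, and $\gamma' \equiv \gamma \bmod 2^m$ for $\gamma_m = 1 \pmod{2^m+1}$ (6.20)
which can easily be computed using for example a chain of half adders. This is further discussed in Section 6.3.1. The sum $\gamma'$ may be associated with $T(\alpha + \beta)$, $T(2\alpha)$, or $T(-\alpha)$ and the addend $\gamma$ may be associated with $T(\alpha) + T(\beta)$, $2T(\alpha)$, or $\overline{T(\alpha)}$ in (6.15), (6.16), or (6.17), respectively.
Footnote: Note that by letting $3 + 2l$ of (6.2) be equal to $-l$ of (6.3), we get $l = -1$ and thus $3 + 2l = 1$ and $-l = 1$.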
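The case split in (6.20) means that adding 1 modulo $2^m+1$ to an $(m+1)$-bit word never needs a separate reduction step: the most significant bit simply decides whether the increment is applied to the $m$ low bits. A minimal Python sketch, with a hypothetical function name, is:

```python
def add_one_mod_fermat(x, m):
    """Return (x + 1) mod 2**m + 1 for any (m+1)-bit word 0 <= x < 2**(m+1).

    The m low bits are kept, and 1 is added only when the most significant
    bit is 0 (adding the complement of that bit), following the case split
    described in the text.
    """
    low = x & ((1 << m) - 1)
    msb = (x >> m) & 1
    return low + (1 - msb)
```

The returned value always lies in the range $0, \ldots, 2^m$, so it is already a reduced element of $\mathbb{Z}_{2^m+1}$.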
From the above arguments we conclude that, from a computational complexity point of view and with respect to the complexity and performance of the VLSI architectures for arithmetic operations, the best choice of the constants $k$ and $l$ in (6.1) is $(k, l) = (1, -1)$.
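McClellan's Representation
In 1976, McClellan [65] proposed a way of representing the integers of $\mathbb{Z}_{2^m+1}$. By letting the $(m+1)$-bit binary coded number $T(\alpha) = \tilde\alpha = (\tilde\alpha_m, \tilde\alpha_{m-1}, \ldots, \tilde\alpha_0)$ represent the NBC integer $\alpha \in \mathbb{Z}_{2^m+1}$, the coding scheme is defined as follows:
If $\tilde\alpha_m = 1$, then $\alpha = 0$.
If $\tilde\alpha_m = 0$, then $\alpha \equiv a_{m-1}2^{m-1} + a_{m-2}2^{m-2} + \cdots + a_0 2^0 \pmod{2^m+1}$,
where $a_j = 1$ if $\tilde\alpha_j = 1$ and $a_j = -1$ if $\tilde\alpha_j = 0$.
Thus, McClellan uses binary weightings with $\pm 1$ instead of 0 and 1. The core of his representation is that the binary coded integer $2^m$ represents the integer 0, i.e. he uses $2^m$ as a zero indicator. Consequently, all the nonzero elements have a zero in their most significant bit position and thus it is possible to perform $m$-bit arithmetic operations on these elements.
For a nonzero integer $\alpha$, for which $\tilde\alpha_m = 0$, we have the relation $a_j = 2\tilde\alpha_j - 1$. Therefore, $\alpha$ can be expanded as
$\alpha \equiv \sum_{j=0}^{m-1} (2\tilde\alpha_j - 1)2^j \equiv 2\sum_{j=0}^{m-1} \tilde\alpha_j 2^j - (2^m - 1) \equiv 2\sum_{j=0}^{m-1} \tilde\alpha_j 2^j + 2 \pmod{2^m+1}$.
Because $\tilde\alpha_m$ equals zero we get $\alpha \equiv 2(\tilde\alpha_m 2^m + \tilde\alpha_{m-1}2^{m-1} + \cdots + \tilde\alpha_0 2^0 + 1) \equiv 2(T(\alpha) + 1) \pmod{2^m+1}$. It turns out that this congruence also holds for the zero element; $2(T(0) + 1) = 2(2^m + 1) \equiv 0$. Hence, the congruence
$\alpha \equiv 2(T(\alpha) + 1) \pmod{2^m+1}$
holds for every element $\alpha \in \mathbb{Z}_{2^m+1}$. Because we have $2^{-1} \equiv 2^{m-1} + 1 \pmod{2^m+1}$, the code translation from $\alpha$ to $T(\alpha)$ is performed according to the congruence
$T(\alpha) \equiv 2^{-1}\alpha - 1 \equiv (2^{m-1} + 1)\alpha - 1 \pmod{2^m+1}$.
We thus get McClellan's element representation by choosing $k = 2^{m-1} + 1$ and $l = -1$ in (6.1).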
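McClellan's coding scheme can be tried out with the following Python sketch. The function names are hypothetical; the decoder applies the $\pm 1$ weighting rule bit by bit, while the encoder uses the closed form $T(\alpha) \equiv (2^{m-1}+1)\alpha - 1$, which is assumed here to be the correct reading of the final congruence of this section.

```python
def mcclellan_decode(code, m):
    """Decode an (m+1)-bit McClellan word to the NBC integer it represents."""
    modulus = (1 << m) + 1
    if (code >> m) & 1:               # zero indicator: most significant bit set
        return 0
    value = 0
    for j in range(m):
        weight = 1 if (code >> j) & 1 else -1   # bit 1 -> +1, bit 0 -> -1
        value += weight * (1 << j)
    return value % modulus


def mcclellan_encode(alpha, m):
    """Encode via T(alpha) = k*alpha + l with k = 2**(m-1) + 1 and l = -1."""
    modulus = (1 << m) + 1
    k = (1 << (m - 1)) + 1            # inverse of 2 modulo 2**m + 1
    return (k * alpha - 1) % modulus
```

Round-tripping every element of $\mathbb{Z}_{2^m+1}$ through both functions is a quick way to confirm that the two descriptions of the code agree.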
Leibowitz' Representation
Also in 1976, Leibowitz [58] presented another way of representing the integers of $\mathbb{Z}_{2^m+1}$. In his article, he mentions that McClellan's element representation belongs to the set of element translations of the form
$T(\alpha) \equiv k\alpha - 1 \pmod{2^m+1}$, $k, k^{-1} \in \mathbb{Z}_{2^m+1}$,
which all give the same simplified binary arithmetic modulo $2^m+1$. However, this is true only for operations like negation (Equation (6.2)), addition (Equation (6.3)), multiplication by two (Equation (6.8)), and general multiplication according to (6.12). The integer $k$ is not involved in any of these operations. Leibowitz claimed that the choice $k = 1$ will give the simplest code translation. Then, $\alpha$ is simply obtained from $T(\alpha)$ by adding 1 to $T(\alpha)$ modulo $2^m+1$. The reverse operation is carried out by diminishing $\alpha$ by 1 modulo $2^m+1$. Owing to this fact, Leibowitz calls his element representation the diminished–1 representation.
The diminished–1 representation has been adopted in most published architectures since 1976. As indicated above, the two main reasons for that should be the utilisation of the element $2^m$ as a zero indicator ($l = -1$), and the simplified element translation ($k = 1$). A very common application to Fermat integer quotient ring computations is the computation of the Fermat number transform of lengths $2m$ and $4m$, for which the transform kernel $\omega$ is preferably chosen as 2 and $\sqrt{2}$, respectively (see Section 2.3.2). Then it is possible to compute the Fermat number transform using bit shifts and additions but no general multiplication or exponentiation. Hence, no operation involving $k$ is needed to compute such a transform. If the translation from the normal binary representation to the diminished–1 representation and vice versa must take place, then, as asserted above, $k = 1$ is the best choice.
6.3 The Diminished–1 Representation
6.3.1 Code Translation
Obviously, the code translation from the NBC integer $\alpha \in \mathbb{Z}_{2^m+1}$ to the diminished–1 integer $T(\alpha) \equiv \alpha - 1 \pmod{2^m+1}$ and the reverse translation $\alpha \equiv T(\alpha) + 1 \pmod{2^m+1}$ only involve subtraction by one and addition by one modulo $2^m+1$, respectively.
NBC to Diminished–1 Representation
The forward translation $T(\alpha) = \tilde\alpha \equiv \alpha - 1 \pmod{2^m+1}$, where $0 \le \alpha \le 2^m$, can quite easily be carried out using a simplified parallel adder. As in Section 5.1.1, we first compute a sum $\sigma = c_m 2^m + \sigma_{m-1}2^{m-1} + \cdots + \sigma_0 2^0$, where $c_m$ is the carry output from the most significant bit position of the adder and $\sigma_i$ is the adder sum output in bit position $i$, for $i = 0, 1, \ldots, m-1$. With one of the adder input signals in bit position $i$ high and the second input signal equal to $\alpha_i$ we get, in accordance with (5.2) and (5.2), the carry and sum outputs
$c_{i+1} = \alpha_i + c_i$, $\sigma_i = \overline{\alpha_i \oplus c_i}$
respectively, where $+$ denotes the logical OR function. The first carry input $c_0$ equals zero. In order to determine the desired sum $\tilde\alpha$, three cases have to be considered:
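1. If $\alpha = 0$, then let $\tilde\alpha_m = \overline{c_m} = 1$ and $(\tilde\alpha_{m-1}, \ldots, \tilde\alpha_0) = (\overline{\sigma_{m-1}}, \ldots, \overline{\sigma_0})$, i.e. $\tilde\alpha_i = 0$ for $0 \le i \le m-1$.
2. If $1 \le \alpha \le 2^m - 1$, then let $\tilde\alpha_m = \overline{c_m} = 0$ and $(\tilde\alpha_{m-1}, \ldots, \tilde\alpha_0) = (\sigma_{m-1}, \ldots, \sigma_0)$.
3. If $\alpha = 2^m$, then let $\tilde\alpha_m = c_m = 0$ and $(\tilde\alpha_{m-1}, \ldots, \tilde\alpha_0) = (\sigma_{m-1}, \ldots, \sigma_0)$.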
From these three cases we form the two Karnaugh maps in Figure 6.1 for the bit values of $\tilde\alpha$. According to the maps, for $0 \le i \le m-1$ the Boolean functions for $\tilde\alpha_m$ and $\tilde\alpha_i$ can be expressed as
$\tilde\alpha_m = \overline{\alpha_m}\,\overline{c_m}$, $\tilde\alpha_i = (\alpha_m + c_m)\sigma_i$
respectively.
Figure 6.1: Karnaugh maps for the bit values $\tilde\alpha_i$ of $\tilde\alpha$, as functions of $\alpha_m$, $c_m$, and $\sigma_i$. X = "don't care". (a) $\tilde\alpha_m = \overline{\alpha_m}\,\overline{c_m} = \overline{\alpha_m + c_m}$. (b) $\tilde\alpha_i = (\alpha_m + c_m)\sigma_i$ for $0 \le i \le m-1$.
The sum $\sigma$ may be computed using either a carry ripple or a carry look-ahead type of architecture. In the previous chapter we concluded that, due to its favourable $AT^2$ performance, the carry look-ahead-type architecture in Figure 5.2 is preferable to the carry ripple-type architecture in Figure 5.1.
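The Boolean functions above can be exercised with a bit-level Python sketch. It models a carry ripple evaluation (not the carry look-ahead circuit preferred in the text), assumes that the adder adds the all-ones word to the $m$ least significant bits of $\alpha$ with zero carry-in, and uses a hypothetical function name.

```python
def nbc_to_dim1_bits(alpha, m):
    """Bit-level NBC to diminished-1 translation, alpha in 0..2**m.

    sigma_i = XNOR(alpha_i, c_i) and c_{i+1} = alpha_i OR c_i model a full
    adder whose second operand bit is tied high.
    """
    a = [(alpha >> i) & 1 for i in range(m + 1)]
    carry = 0                                   # first carry input c_0 = 0
    sigma = []
    for i in range(m):
        sigma.append(1 - (a[i] ^ carry))        # XNOR
        carry = a[i] | carry                    # OR
    select = a[m] | carry                       # alpha_m + c_m
    msb = 1 - select                            # NOR: zero indicator bit
    low = [select & s for s in sigma]           # (alpha_m + c_m) AND sigma_i
    return (msb << m) | sum(bit << i for i, bit in enumerate(low))
```

Running the function over all $2^m + 1$ inputs and comparing with $(\alpha - 1) \bmod (2^m + 1)$ reproduces the three cases listed above.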
We have designed a carry look-ahead type of architecture for the computation of $T(\alpha)$ (i.e. $\tilde\alpha$) from $\alpha$. The architecture, which is shown in Figure 6.2, is similar to the architecture in Figure 5.2. The row of combined AND–NOR gates at the output generates the one's complement of the $m$ least significant bits $(\tilde\alpha_{m-1}, \ldots, \tilde\alpha_0)$ of $\tilde\alpha$, i.e. the gate in bit position $i$ generates the signal $\overline{\tilde\alpha_i} = \overline{(\alpha_m + c_m)\sigma_i}$. A schematic description of such a gate is shown in the bottom of Figure 6.2. The gate is characterised by its size $C_{\text{AND–NOR}}$, fan-in $f_{\text{AND–NOR}}$, and output normalised resistance $r_{\text{AND–NOR}}$. It has no internal stage. The output array of inverters generates the desired output $(\tilde\alpha_{m-1}, \ldots, \tilde\alpha_0)$.
The total size of the ’NBC-to-diminished–1’ architecture in Figure 6.2 equals
 
C
N
B
C
 
D
i
m
 
 
m
 
l
o
g
 
m
 
 
 
 
 
 
C
N
A
N
D
 
N
O
R
 
 
m
 
 
 
 
C
i
n
v
 
C
X
N
O
R
 
 
C
N
O
R
 
m
 
C
A
N
D
 
N
O
R
 
C
i
n
v
 
 
 
m
 
l
o
g
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
l
o
g
 
m
 
 
m
 
 
 
The CP is the dotted path from the
 
 -input node through the circuit and to
the
 
 
 -output node. The fan-in and the output normalised resistance of the
architecture, with respect to this CP, equal
f
N
B
C
 
D
i
m
 
f
m
o
d
 
 
 
 
r
N
B
C
 
D
i
m
 
r
A
N
D
 
N
O
R
 
 
 
Footnote: Compare the derivation of this expression with the derivation of $C_{\text{mod},2}$ in (5.4).
Figure 6.2: An architecture performing the code translation from the NBC to the diminished–1 representation (from $\alpha$ to $T(\alpha) = \tilde\alpha$). The dotted line indicates the CP through the circuit. The bottom part of the figure shows how each combined AND–NOR gate is designed.
respectively, and the internal CP length equals
L
C
P
 
N
B
C
 
D
i
m
 
 
l
o
g
 
m
 
 
 
r
N
A
N
D
 
N
O
R
 
f
N
O
R
 
N
A
N
D
 
f
i
n
v
 
 
r
N
A
N
D
 
f
N
O
R
 
m
f
A
N
D
 
N
O
R
 
 
r
A
N
D
 
N
O
R
f
i
n
v
 
 
l
o
g
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
l
o
g
 
m
 
As in Chapter 5, when determining the $AT^2$ performance of the architecture, we assume that it is both preceded and followed by $(m+1)$-bit parallel registers. Therefore, the time $T$ required to evaluate the congruence $\tilde\alpha \equiv \alpha - 1 \pmod{2^m+1}$ is proportional to
L
N
B
C
 
D
i
m
 
L
r
e
g
 
r
r
e
g
f
N
B
C
 
D
i
m
 
L
C
P
 
N
B
C
 
D
i
m
 
r
N
B
C
 
D
i
m
f
r
e
g
 
 
 
 
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
which implies that the $AT^2$ performance of the circuit is proportional to the product
C
L
 
N
B
C
 
D
i
m
 
 
C
N
B
C
 
D
i
m
 
L
N
B
C
 
D
i
m
 
 
 
 
 
m
 
l
o
g
 
m
 
 
m
 
 
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
 
 
O
 
m
 
l
o
g
 
m
 
 
This product is less than the area-time product $CL^2_{\text{mod},2}$ of the modulus reduction circuit in Figure 5.2.
An alternative procedure for performing the subtraction $\tilde\alpha \equiv \alpha - 1 \pmod{2^m+1}$, for $\alpha = (\alpha_m, \alpha_{m-1}, \ldots, \alpha_0)$ and $\tilde\alpha = (\tilde\alpha_m, \tilde\alpha_{m-1}, \ldots, \tilde\alpha_0)$, is the following:
1. If $0 \le \alpha \le 2^m - 1$, i.e. if $\alpha_m = 0$, then let $\delta = (\overline{\alpha_m}, \alpha_{m-1}, \ldots, \alpha_0) = 2^m + \alpha \equiv \alpha - 1 \pmod{2^m+1}$. The diminished–1 integer $\tilde\alpha$ is obtained by reducing $\delta$ modulo $2^m+1$.
2. If $\alpha = 2^m$, i.e. if $\alpha_m = 1$ and $\alpha_{m-1} = \cdots = \alpha_0 = 0$, then let $\delta = (\overline{\alpha_m}, \overline{\alpha_{m-1}}, \ldots, \overline{\alpha_0}) = 2^m - 1$. The diminished–1 integer $\tilde\alpha$ equals $\delta$.
In case 1, $\delta$ can be obtained from $\alpha$ simply by inverting its most significant bit $\alpha_m$. By letting $\delta$ be the input of a modulus reduction circuit, for example the one in Figure 5.1 or the one in Figure 5.2, we get the desired integer $\tilde\alpha$ as the output of the circuit. In case 2, $\delta$ can be obtained by inverting all bits of $\alpha$. Even though this $\delta$ is the desired $\tilde\alpha$, we avoid an unnecessarily complex control logic by letting $\delta$ pass through the modulus reduction circuit also in case 2. Still, it is the procedure in case 1 that determines the overall time performance of the operation. Figure 6.3 shows an architecture that generates the binary coded diminished–1 integer $\tilde\alpha$ from the NBC integer $\alpha$ using the above procedures in cases 1 and 2.
Figure 6.3: An alternative architecture for performing the code translation from the NBC integer $\alpha$ to the diminished–1 coded integer $T(\alpha) = \tilde\alpha \equiv \alpha - 1 \pmod{2^m+1}$ using a modulus reduction circuit.
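The two-case procedure can be written down as a short Python sketch. The function name and the intermediate name `delta` are chosen for illustration only, and the `%` operator stands in for the modulus reduction circuit.

```python
def nbc_to_dim1_via_reduction(alpha, m):
    """NBC to diminished-1 translation by bit inversion plus modulus reduction.

    Case 1 (alpha_m = 0): invert the most significant bit, i.e. add 2**m.
    Case 2 (alpha = 2**m): invert all m+1 bits, which gives 2**m - 1.
    In both cases the result is passed through the modulus reduction.
    """
    modulus = (1 << m) + 1
    if (alpha >> m) & 1 == 0:
        delta = alpha ^ (1 << m)                # flip alpha_m only
    else:
        delta = alpha ^ ((1 << (m + 1)) - 1)    # flip every bit
    return delta % modulus
```

Case 1 works because $2^m \equiv -1 \pmod{2^m+1}$, so setting the most significant bit is the same as subtracting one before the reduction.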
It may be convenient to utilise a modulus reduction circuit to perform the code translation; however, both the size and the total CP length of such an architecture are greater than the corresponding parameters of an architecture that is specially designed for the operation. For example, if the modulus reduction part of the circuit in Figure 6.3 is the carry look-ahead-type architecture in Figure 5.2, its total size equals $C_{\text{mod},2} + mC_{\text{OR}} + C_{\text{inv}}$ and its total CP length (including the delay contribution of one input and one output register) equals $L_{\text{reg}} + r_{\text{reg}}(mf_{\text{OR}} + f_{\text{inv}}) + L_{\text{OR}} + r_{\text{OR}}f_{\text{mod},2} + L_{\text{CP},\text{mod},2} + r_{\text{mod},2}f_{\text{reg}}$. These two complexity parameters should be compared with the smaller size $C_{\text{NBC}\to\text{Dim}}$ and the smaller CP length $L_{\text{NBC}\to\text{Dim}}$, respectively, of the architecture in Figure 6.2.
In his paper of 1976 Leibowitz [58] suggests that the code translation should be carried out as an ordinary diminished–1 addition (see Section 6.3.4) of $\alpha$ and the NBC integer $2^m - 1$. This is a good solution if the diminished–1 adder is readily available and if the time requirements for the code translation are fulfilled. However, at least from a time performance point of view, a special-purpose architecture like the one in Figure 6.2 is preferable to a general-purpose architecture (like the diminished–1 adder).
Diminished–1 to NBC Representation
Leibowitz [58] described how to perform the translation from a binary coded diminished–1 integer $a = T(\alpha)$ to an NBC integer $\alpha$ by adding $\bar{a}_m$ to $a_{(m)}$, where $a_m$ is the most significant bit of $a$ and $a_{(m)}$ denotes its $m$ least significant bits. Thus, we have
\[ \alpha = a_{(m)} + \bar{a}_m = (a + 1) \bmod (2^m + 1). \]
This operation, which in hardware does not require any modulus reduction, can be performed using a row of half adder elements. Consider an $m$-bit parallel carry ripple adder with input signals $a_{(m)}$ and $0$. In bit position $i$, where $0 \le i \le m-1$, the signal input bits are $a_i$ and $0 \in Z_2$, which implies that, by (4.2) and (4.3), the carry and sum outputs are equal to
\[ c_{i+1} = c_i a_i, \qquad \alpha_i = c_i \oplus a_i, \]
respectively. Because these functions are also the respective carry and sum outputs of the half adder element (see the end of Section 4.3.4), the parallel adder may be formed by a row of half adder elements, where the first carry input $c_0$ equals $\bar{a}_m$. An architecture that performs the diminished–1-to-NBC coordinate transformation using the above procedure is shown in Figure 6.4.
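The half adder chain described above can be sketched in a few lines of Python. This is only an illustrative bit-level model, not the thesis circuit itself: the function names and the integer encoding of the $(m+1)$-bit words are assumptions made for the example.

```python
def nbc_to_dim1(x: int, m: int) -> int:
    """Diminished-1 code of the NBC residue x in Z_(2^m + 1); zero maps to 2^m."""
    return (x - 1) % ((1 << m) + 1)


def dim1_to_nbc(a: int, m: int) -> int:
    """Translate a diminished-1 word to NBC with a ripple of half adders.

    The first carry input is the complement of the most significant bit a_m,
    so no modulus reduction is needed (hypothetical helper, for illustration).
    """
    carry = 1 - ((a >> m) & 1)          # c_0 = NOT a_m
    x = 0
    for i in range(m):
        bit = (a >> i) & 1
        x |= (bit ^ carry) << i         # half adder sum output
        carry &= bit                    # half adder carry output
    return x | (carry << m)             # last carry is the bit of weight 2^m
```

For $m = 4$, translating every residue of $Z_{17}$ to the diminished–1 code and back should return the original value.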
The size of this architecture equals
\[ C_{\mathrm{Dim\to NBC}} = m\,C_{\mathrm{HA}} + C_{\mathrm{inv}}, \]
which is linear in $m$.
The CP runs from the $a_m$-input node through the inverter and the chain of cascaded half adder elements. Denote by $n_s$ and $n_c$ the fan-out with respect to the $\alpha_{m-1}$-output node and the $\alpha_m$-output node, respectively. If $L_{\mathrm{HA,sum}} + r_{\mathrm{HA,sum}}\, n_s$ exceeds $L_{\mathrm{HA,carry}} + r_{\mathrm{HA,carry}}\, n_c$, the CP runs from the input to the sum output $\alpha_{m-1}$ of the half adder element in the most significant bit position. Otherwise, the CP runs to the carry output $\alpha_m$. The former path is the one most likely to belong to the CP (for example, if $n_s = n_c$, the CP runs to the carry output only for a restricted range of $n_c$). Therefore, the fan-in, the output normalised resistance, and the internal CP length of the circuit equal
\[ f_{\mathrm{Dim\to NBC}} = f_{\mathrm{inv}}, \qquad r_{\mathrm{Dim\to NBC}} = r_{\mathrm{HA,sum}}, \]
\[ L_{\mathrm{CP,Dim\to NBC}} = r_{\mathrm{inv}} f_{\mathrm{HA}} + (m-1)\,(L_{\mathrm{HA,carry}} + r_{\mathrm{HA,carry}} f_{\mathrm{HA}}) + L_{\mathrm{HA,sum}}, \]
respectively, so that the internal CP length is linear in $m$.

Figure 6.4: A simple architecture for performing the code translation from the binary coded diminished–1 integer $T(\alpha)$ to the NBC integer $\alpha$, modulo $2^m + 1$.

Footnote (to the code translation by diminished–1 addition above): We have $\alpha + (2^m - 1) + 1 \equiv \alpha - 1 \pmod{2^m + 1}$.

When the input and the output of the coordinate transformation circuit in Figure 6.4 are each connected to an $(m+1)$-bit register, the time $T$ to perform its operation is proportional to the total CP length
Figure 6.5: The sizes, CP lengths, and $AT^2$ performances of the code translation architectures (panels: area complexity, size $C$; time complexity, CP length $L$; area-time performance, $CL^2$; curves for the NBC-to-diminished–1 and the diminished–1-to-NBC architectures). The parameters are plotted versus $m$.
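\[ L_{\mathrm{Dim\to NBC}} = L_{\mathrm{reg}} + r_{\mathrm{reg}} f_{\mathrm{Dim\to NBC}} + L_{\mathrm{CP,Dim\to NBC}} + r_{\mathrm{Dim\to NBC}} f_{\mathrm{reg}}, \]
which is linear in $m$. Hence, the area-time performance $AT^2$ of this circuit is proportional to
\[ CL^2_{\mathrm{Dim\to NBC}} = C_{\mathrm{Dim\to NBC}}\, L^2_{\mathrm{Dim\to NBC}} = O(m^3). \]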
The row of half adder elements in Figure 6.4 is a carry ripple type of architecture. Other classes of architectures, like the parallel carry look-ahead (half) adder, are not considered here.
In Figure 6.5 we have plotted the sizes, total CP lengths, and area-time performances of the architectures in Figure 6.2 and Figure 6.4. When comparing these parameters we see that the ordering of the sizes $C_{\mathrm{NBC\to Dim}}$ and $C_{\mathrm{Dim\to NBC}}$, as well as the ordering of the total CP lengths $L_{\mathrm{NBC\to Dim}}$ and $L_{\mathrm{Dim\to NBC}}$, depends on $m$, whereas the ordering of $CL^2_{\mathrm{NBC\to Dim}}$ and $CL^2_{\mathrm{Dim\to NBC}}$ is the same for all $m$.
6.3.2 Modulus Reduction
Because the diminished–1 integers are represented by the NBC integers in $Z_{2^m+1}$, the residue modulo $2^m + 1$ of an $(m+1)$-bit binary coded diminished–1 integer can be computed using any of the modulus reduction circuits in Section 5.1.1. However, one of the nice properties of the diminished–1 representation is that it yields arithmetic operations for which the modulus reductions together with the arithmetic operations can be carried out in a more straightforward way than what is possible when using the ordinary NBC representation. This is demonstrated in the following sections.
6.3.3 Negation
By letting $l = 1$ in (6.2) we get
\[ T(-\alpha) \equiv -T(\alpha) - 2 \pmod{2^m + 1} \qquad (6.21) \]
This congruence was also considered in (6.17). The computational complexity of computing $T(-\alpha) \bmod (2^m + 1)$ may seem to be of the same order as the complexity of performing the code translation from the diminished–1 integer $T(\alpha) \in Z_{2^m+1}$ to the NBC integer $(T(\alpha) + 1) \bmod (2^m + 1)$. However, by expanding (6.21) as
\[ T(-\alpha) \equiv \begin{cases} 2^m - 1 - a_{(m)} & \text{if } a_m = 0 \\ 2^m & \text{if } a_m = 1 \end{cases} \pmod{2^m + 1}, \]
where $a = T(\alpha)$, $a_m$ is the most significant bit of $a$, and $a_{(m)}$ denotes its $m$ least significant bits, it proves to be quite easy to implement. The negative of a nonzero integer $\alpha$ (for which $a_m = 0$) is simply derived by inverting the $m$ least significant bits of $a$. For the zero element we have the relation $T(-0) = T(0) = 2^m$, which means that the symbol $T(0)$ stays unmodified. For the $(m+1)$-bit binary integer $T(\alpha) = (a_m, a_{m-1}, \ldots, a_1, a_0)$ we thus have $T(-\alpha) = (a'_m, a'_{m-1}, \ldots, a'_1, a'_0)$, where the Boolean function for $a'_i$ equals
\[ a'_i = \begin{cases} \overline{a_i + a_m} = \bar{a}_i \bar{a}_m & \text{for } i = 0, 1, \ldots, m-1 \\ a_m & \text{for } i = m \end{cases} \]
Figure 6.6 shows an architecture that realises such a negation, using a row of NOR gates. This simple circuit is generally used for negation of diminished–1 numbers, see for example Pajayakrit [71, Ch. 3.4], Benaissa et al. [11, Fig. 8], and Sunder et al. [97, Fig. 3].

Figure 6.6: Negation modulo $2^m + 1$. Here, we have $a = T(\alpha)$ and $a' = T(-\alpha)$.
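As a hedged illustration of the NOR-gate negation, the following Python sketch models the row of gates bit by bit. The function name and the integer encoding of the diminished–1 word are assumptions introduced for the example only.

```python
def dim1_negate(a: int, m: int) -> int:
    """Negate a diminished-1 coded word a = T(alpha) modulo 2^m + 1.

    Each of the m low output bits is NOR(a_i, a_m); the most significant
    bit a_m is passed through unchanged (illustrative model only).
    """
    a_m = (a >> m) & 1
    out = 0
    for i in range(m):
        a_i = (a >> i) & 1
        out |= (1 - (a_i | a_m)) << i   # NOR gate in bit position i
    return out | (a_m << m)
```

The zero code $2^m$ is mapped to itself, and every other code has its $m$ low bits inverted, in agreement with the expansion of (6.21).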
The CP through the negater circuit is the path from the $a_m$-input node through one of the NOR gates to the circuit output. The fan-in and the output normalised resistance, with respect to this path, equal
\[ f_{\mathrm{dimneg}} = n_m + m f_{\mathrm{NOR}}, \qquad r_{\mathrm{dimneg}} = r_{\mathrm{NOR}}, \]
respectively, where $n_m$ is the negater fan-out with respect to the $a_m$ node. Note that the delay of the input stage will be excessively long if both $m$ and the normalised resistance of the input stage are large. One way of reducing this delay is to properly buffer the circuit input stage, i.e. by using drivers in the stage. However, as mentioned before, such buffering is not considered here. Therefore, the chip area $A$ occupied by the negater circuit in Figure 6.6 is proportional to its size
\[ C_{\mathrm{dimneg}} = m\,C_{\mathrm{NOR}}. \]
There is no internal CP of the architecture. When the negater input is taken from an $(m+1)$-bit parallel register and the output is stored in a similar register, we get $n_m = f_{\mathrm{reg}}$. Then, the negation time $T$ is proportional to the total CP length
\[ L_{\mathrm{dimneg}} = L_{\mathrm{reg}} + r_{\mathrm{reg}} f_{\mathrm{dimneg}} + r_{\mathrm{dimneg}} f_{\mathrm{reg}}, \]
which is linear in $m$. Hence, the product $AT^2$ is proportional to
\[ CL^2_{\mathrm{dimneg}} = C_{\mathrm{dimneg}}\, L^2_{\mathrm{dimneg}} = O(m^3). \]
6.3.4 Addition and Subtraction
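In Section 6.2 (Equations (6.3) and (6.15)) we showed that the choice $l = 1$ in (6.1) yields
\[ T(\alpha + \beta) \equiv T(\alpha) + T(\beta) + 1 \pmod{2^m + 1} \qquad (6.22) \]
For $0 \le T(\alpha), T(\beta) \le 2^m$, we expand this equation as
\[ s = T(\alpha + \beta) \equiv \begin{cases} \sigma + 1 & \text{if } (a_m, b_m) = (0, 0) \\ \sigma & \text{if } (a_m, b_m) = (0, 1) \text{ or } (1, 0) \\ 2^m & \text{if } (a_m, b_m) = (1, 1) \end{cases} \pmod{2^m + 1}, \]
where $a = T(\alpha)$, $b = T(\beta)$, and $\sigma = a_{(m)} + b_{(m)}$ is the sum of the $m$ least significant bits of $a$ and $b$. The three cases in the above equation are handled in the following way:
1. $s \equiv \sigma + 1 \pmod{2^m + 1}$:
Using the congruence $\sigma + 1 = \sigma_m 2^m + \sigma_{(m)} + 1 \equiv \sigma_{(m)} + \bar{\sigma}_m \pmod{2^m + 1}$, where $\sigma_m$ is the carry out of the $m$-bit addition, we get $s = \sigma_{(m)} + \bar{\sigma}_m$.
2. $s \equiv \sigma \pmod{2^m + 1}$:
In this case we have either $(a_m, b_m) = (0, 1)$ or $(a_m, b_m) = (1, 0)$, which implies $\sigma_m = 0$. Therefore, no modulus reduction is needed. We let $s_m = 0$ and $s_{(m)} = \sigma_{(m)}$.
3. $s \equiv 2^m \pmod{2^m + 1}$:
In this case we have $(a_m, b_m) = (1, 1)$, which gives $\sigma = 0$. Therefore, let $s_m = 1$ and $s_{(m)} = \sigma_{(m)} = 0$.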
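The three cases above can be checked with a small Python sketch. It is a behavioural model under stated assumptions: the inputs are valid diminished–1 codes in the range $0 \ldots 2^m$, and the function name is hypothetical.

```python
def dim1_add(a: int, b: int, m: int) -> int:
    """Diminished-1 addition modulo 2^m + 1, following the three cases.

    a = T(alpha) and b = T(beta) are assumed to be valid codes (0 .. 2^m).
    """
    mask = (1 << m) - 1
    a_m, b_m = a >> m, b >> m
    sigma = (a & mask) + (b & mask)         # sum of the m low bits
    if a_m == 0 and b_m == 0:               # case 1: both operands nonzero
        sigma_m = sigma >> m
        return (sigma & mask) + (1 - sigma_m)
    if a_m != b_m:                          # case 2: exactly one operand is zero
        return sigma
    return 1 << m                           # case 3: both operands are zero
```

An exhaustive comparison against ordinary modular addition for $m = 4$ is a convenient sanity check of the case analysis.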
A Carry Look-Ahead Adder
For a bit-parallel transmission of both $a$ and $b$, the sum $\sigma = a_{(m)} + b_{(m)}$ may be calculated using an $m$-bit parallel adder. In the above case 1, the supplementary addition by $\bar{\sigma}_m$ may be carried out by letting the carry in $c_0$ in the least significant bit position be equal to $\bar{\sigma}_m$. The bit value $\sigma_m$ must then be generated by a carry look-ahead circuit. In cases 2 and 3, the initial carry in $c_0$ equals zero. Hence, $c_0$ can be generated according to the Boolean function
\[ c_0 = \bar{a}_m \bar{b}_m \bar{\sigma}_m = \overline{a_m + b_m + \sigma_m} \qquad (6.23) \]
The most significant bit $s_m$ of the output $s$ must be generated separately. Let $c_m$ denote the final carry of the addition $a_{(m)} + b_{(m)} + c_0$. Table 6.1 shows the possible states of $a_m$, $b_m$, $\sigma_m$, $c_0$, $c_m$, and the resulting sum $s$. We see that $s_m$ is high only for $(a_m, b_m) = (1, 1)$ and for $(c_0, c_m) = (1, 1)$, which means that it can be described as the Boolean function
\[ s_m = a_m b_m + c_0 c_m = \overline{\overline{a_m b_m} \cdot \overline{c_0 c_m}} \]
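Case   a_m   b_m   sigma_m   c_0   c_m   s                      s_m
1      0     0     0         1     0     sigma_(m) + c_0        0
1      0     0     0         1     1     2^m                    1
1      0     0     1         0     1     sigma_(m) + c_0        0
2      0/1   1/0   0         0     0     sigma_(m) + c_0        0
3      1     1     0         0     0     2^m                    1
Table 6.1: The possible states of some variables involved in the computation of $s = T(\alpha + \beta)$ (see also Figure 6.7).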
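The following Python sketch models the signals of Table 6.1 at the word level: the look-ahead carry $\sigma_m$, the carry in $c_0$ of (6.23), the final carry $c_m$, and the separately generated bit $s_m$. It is an illustration only; the function name is an assumption, and the look-ahead circuit is replaced by a plain integer addition.

```python
def dim1_add_lookahead(a: int, b: int, m: int) -> int:
    """Word-level model of a diminished-1 adder with look-ahead carry-in.

    sigma_m is the carry of a_(m) + b_(m) with zero carry-in; it selects
    the carry-in c_0 of the m-bit parallel adder according to (6.23).
    """
    mask = (1 << m) - 1
    a_m, b_m = (a >> m) & 1, (b >> m) & 1
    sigma_m = ((a & mask) + (b & mask)) >> m      # look-ahead carry
    c0 = 1 - (a_m | b_m | sigma_m)                # NOR of the three bits
    total = (a & mask) + (b & mask) + c0          # m-bit parallel adder
    c_m = total >> m                              # final carry of that adder
    s_m = (a_m & b_m) | (c0 & c_m)                # most significant output bit
    return (s_m << m) | (total & mask)
```

For valid codes the result should agree with $(a + b + 1) \bmod (2^m + 1)$, which covers every row of the table.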
Figure 6.7 shows one possible architecture of a diminished–1 carry look-ahead adder. The carry-in of the carry look-ahead block equals zero. The adder has about the same structure as McClellan’s adder [65, Fig. 7]. However, because of an incorrect gating of the carry $c_m$, McClellan’s adder gives an erroneous output when $\sigma_m$ equals one (the third line of Table 6.1). On the other hand, the gating is correctly realised for the carry look-ahead adder in Figure 8 of [65].
Benaissa et al. [11] and others also use an adder that is based on the adder in Figure 6.7. However, in Figure 9 of [11], the authors use AND and OR gates to form the output bit $s_m$, in contrast to the NAND gates used in Figure 6.7. Pajayakrit [71, Fig. 3.3] also considers an adder whose architecture slightly differs from the one presented in [11]. In Pajayakrit’s adder, there is an AND gate that has the output of the carry look-ahead circuit (which by Pajayakrit is named $D$) as one of its input signals. This signal is exchanged for $c_0$ in Benaissa’s adder. Using the $RC$ model adopted in this thesis, it can easily be verified that when $c_0$ is chosen as input signal to the AND gate, the internal delay from the output node of the carry look-ahead circuit to the $c_0$ carry input node of the parallel adder is less than the corresponding delay if the carry look-ahead output is chosen as the input signal of the AND gate.
Remark: Pajayakrit’s adder is actually a corrected version of the adder considered by Towers et al. [101, Fig. 9]. In Towers’ adder, which is based on McClellan’s adder, the carry-in signal $c_0$ was improperly formed from $a_m$, $b_m$, and $\sigma_m$, instead of according to the correct Boolean function given by (6.23).
When comparing the adder architectures described above we ﬁnd that the
adder in Figure 6.7 is preferable to the others, with respect to correctness and
both area and time complexity. So far, we have not considered the choice of adder type for the $m$-bit adder in Figure 6.7 (the one whose inputs are $a_{(m)}$, $b_{(m)}$, and $c_0$). If this adder is implemented as a carry look-ahead type of adder, like for example McClellan’s adder [65, Fig. 8], we presumably obtain a faster diminished–1 adder than if it is implemented as a carry ripple type of adder. However, the chip area occupied by an adder is generally larger for the carry look-ahead adder than for the carry ripple adder.

Figure 6.7: Diminished–1 addition modulo $2^m + 1$ (blocks: carry look-ahead logic, $m$-bit parallel adder): $a = T(\alpha)$ and $b = T(\beta)$.

In order to get fair comparisons between the bit-parallel carry ripple-type NBC adder in Figure 5.7 and the bit-parallel adders in this section, the parallel diminished–1 adders considered here are all plain carry ripple-type adders. As mentioned in the beginning of Section 5.1, we primarily consider architectures that can be mutually compared in order to decide which form of element representation is most advantageous, with respect to chip area, computation time, and area-time performance. Therefore, we generally compare architectures of the same type, but with respect to different element representations, rather than try to find the most area-time efficient architecture for a certain element representation.
Hence, the bit-parallel $m$-bit carry ripple adder in Figure 6.7 simply consists of a row of $m$ cascaded full adder elements. Consequently, the total size $C_{\mathrm{dimadd},1}$ of the entire diminished–1 adder (the subscript 1 referring to the adder of Figure 6.7) equals the sum of $m\,C_{\mathrm{FA}}$, $C_{\mathrm{CLA}}$, and the sizes of the NAND, NOR, and OR gates in the figure, i.e. $C_{\mathrm{CLA}}$ plus a term that is linear in $m$ (6.24), where the complexity $C_{\mathrm{CLA}}$ of the carry look-ahead logic depends on how it is implemented. It is well known that the output carry $c'_{i+1}$ from a full adder element in bit position $i$, whose input signals are $a_i$, $b_i$, and $c'_i$, may be expressed as the Boolean function
\[ c'_{i+1} = g_i + p_i c'_i \qquad (6.25) \]
where $g_i = a_i b_i$ and $p_i = a_i + b_i$ are called the carry generate and propagate functions, respectively. (For the diminished–1 adder in Figure 6.7, we denote by $c'_i$ the input carry signal in bit position $i$ of the carry look-ahead circuit. We do this in order to distinguish it from the corresponding carry signal $c_i$ of the parallel adder in the bottom part of the figure.) For the diminished–1 adder we have $\sigma_m = c'_m$. Therefore, by expanding (6.25) for $i = m-1$ we get
\[ \sigma_m = c'_m = g_{m-1} + p_{m-1} g_{m-2} + p_{m-1} p_{m-2} g_{m-3} + \cdots + p_{m-1} p_{m-2} \cdots p_1 g_0 + p_{m-1} p_{m-2} \cdots p_0 c'_0 \qquad (6.26) \]
However, the addend $p_{m-1} p_{m-2} \cdots p_0 c'_0$ of (6.26) can be excluded here, because for the circuit in Figure 6.7 we have $c'_0 = 0$. The resulting Boolean function can efficiently be evaluated using the carry look-ahead tree in Figure 6.8. This architecture for generating only one carry signal is also suggested by Yuan et al. [110, Fig. 4]. Furthermore, it is essentially a modified version of Brent and Kung’s [27] well known carry look-ahead tree.

Figure 6.8: A carry look-ahead tree that generates the carry signal $\sigma_m = c'_m$. Each $g/p$ cell has an inverted carry propagate and an inverted carry generate as output signals. The odd and even levels of the tree consist of the $O$ cells and the $E$ cells, respectively. The output functions of the three cells are displayed in the top-rightmost part of the figure.
As seen in Figure 6.8, the $g/p$ cells generate the inverted initial carry propagate and generate signals $\bar{p}_i$ and $\bar{g}_i$, respectively, for $0 \le i \le m-1$. Consider the tree subsequent to the array of $g/p$ cells. This tree has $\log_2 m$ levels. By subsequently numbering the levels from 1 to $\log_2 m$ (from left to right for the tree in Figure 6.8), we deduce that the odd indexed levels of the tree are formed only by $O$ cells and the even indexed levels are formed only by $E$ cells. Consequently, the end cell is either an $E$ cell or an $O$ cell, depending on whether $\log_2 m$ is even or odd, respectively. Note that in the former case, the $g$ output signal of the end cell must be inverted to form the desired carry signal $\sigma_m$. Henceforth, we do not consider this extra inverter needed for even $\log_2 m$. The input and output signals of the $g/p$, $O$, and $E$ cells are displayed in the top-rightmost part of Figure 6.8.

Cell   Size   Input                               Fan-in   Output norm. res.
g/p    8      a_i, b_i                            4        2
O      10     one p input and both g inputs       2        2
              the other p input                   4
E      10     one p input and both g inputs       2        2
              the other p input                   4
Table 6.2: The sizes, fan-ins, and output normalised resistances of the $g/p$, $O$, and $E$ cells of Figure 6.8. We use the cell names as subscripts of the complexity parameters, e.g. $r_{g/p} = 2$, $f_{E,p} = 4$ for the $p$ input with the largest fan-in, and $C_{E/O} = C_E = C_O = 10$.
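As a rough functional illustration of the tree, the Python sketch below evaluates the single carry $c'_m$ of (6.26) with $c'_0 = 0$ by pairwise merging of (generate, propagate) pairs. It ignores the alternating signal polarities of the $O$ and $E$ cells and assumes that $m$ is a power of two; the function name and the returned cell and level counts are additions made for this example.

```python
def lookahead_carry(a_low: int, b_low: int, m: int):
    """Return (carry, cells, levels) for the carry out of an m-bit addition.

    Leaves hold g_i = a_i AND b_i and p_i = a_i OR b_i; each merge cell
    combines a low and a high pair as (g_hi + p_hi g_lo, p_hi p_lo).
    m is assumed to be a power of two (illustrative model only).
    """
    nodes = [(((a_low >> i) & (b_low >> i)) & 1,
              ((a_low >> i) | (b_low >> i)) & 1) for i in range(m)]
    cells = levels = 0
    while len(nodes) > 1:
        merged = []
        for lo, hi in zip(nodes[0::2], nodes[1::2]):
            g = hi[0] | (hi[1] & lo[0])     # group generate
            p = hi[1] & lo[1]               # group propagate
            merged.append((g, p))
        cells += len(merged)
        levels += 1
        nodes = merged
    return nodes[0][0], cells, levels
```

In this model the tree uses $m - 1$ merge cells arranged in $\log_2 m$ levels, and the root generate signal equals the carry out of $a_{(m)} + b_{(m)}$.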
Each $g/p$ cell consists of one NAND gate and one NOR gate. The $E$ cell consists of one NAND gate and one combined AND–NOR gate (see Figure 6.2). The complexity parameters of the latter gate are given on page 99. Also, a schematic description of the gate is given in Figure 6.2. The $O$ cell consists of one NOR gate and one gate which has a similar structure and the same complexity parameters as the AND–NOR gate. Recently, Wei and Thompson [112] derived an $AT^2$ optimal parallel carry look-ahead adder based on Brent and Kung’s carry look-ahead tree. Two of their basic cells which they use to implement the parallel carry computation are equivalent to the $E$ and $O$ cells in Figure 6.8: Their ’black ba’ cell [112, Fig. 3(a)] is equivalent to our $E$ cell and their ’black bb’ cell [112, Fig. 3(b)] is equivalent to our $O$ cell. The sizes, the fan-ins, and the output normalised resistances of the cells in Figure 6.8 are given in Table 6.2.
The binary carry look-ahead tree in Figure 6.8 comprises $m$ $g/p$ cells and $m - 1$ $E$ and $O$ cells. Hence, the size $C_{\mathrm{CLA}}$ of the tree equals
\[ C_{\mathrm{CLA}} = m\,C_{g/p} + (m-1)\,C_{E/O} = 8m + 10(m-1) = 18m - 10 \qquad (6.27) \]
The values of $C_{g/p}$ and $C_{E/O}$ are taken from Table 6.2. By combining (6.27) and (6.24), we get the total size $C_{\mathrm{dimadd},1}$ of the diminished–1 adder of Figure 6.7, which is linear in $m$. The fan-in $f_{\mathrm{CLA}}$ of the carry look-ahead tree equals $f_{g/p} = 4$. The output normalised resistance equals $r_{\mathrm{CLA}} = r_{E/O} = 2$.
Regarding the $E$ and $O$ cells, because the respective $p$-inputs with fan-in 4 (see Table 6.2) have the largest fan-ins, the CP is the path from one of the operand input nodes through the carry look-ahead tree to $\sigma_m$ and onwards through the $m$-bit parallel adder to the $s_m$ output. With respect to this CP, the fan-in and the output normalised resistance equal
\[ f_{\mathrm{dimadd},1} = f_{\mathrm{FA,signal}} + f_{g/p}, \qquad r_{\mathrm{dimadd},1} = r_{\mathrm{NAND}}, \]
respectively. The internal CP length $L_{\mathrm{CP,dimadd},1}$ is the sum of the delay through the $\log_2 m$ levels of the carry look-ahead tree, the delay through the gates that form $c_0$, and the delay through the $m$ cascaded full adder carry stages. It therefore consists of a term that is linear in $m$ and a term that is proportional to $\log_2 m$.
A Carry Ripple Adder
The NBC adder of Figure 5.7 in Section 5.1.3 is a carry ripple type of adder. In Figure 6.9 we present a comparable diminished–1 carry ripple adder. We have $\sigma = a_{(m)} + b_{(m)}$. In the carry look-ahead adder of Figure 6.7, for case 1 (see page 109) the addend $\bar{\sigma}_m$ is added to the sum $\sigma_{(m)}$ by letting $\bar{\sigma}_m$ be the carry input signal of the parallel adder. In contrast, the carry input signal of the parallel adder in Figure 6.9 is always equal to zero. Therefore, it is sufficient to use a half adder element in the least significant bit position of the adder. Furthermore, the addition of $\sigma_{(m)}$ by $\bar{\sigma}_m$ is carried out by multiplexing either the sum $\sigma_{(m)}$ (for $\sigma_m = 1$ in case 1) or $\sigma_{(m)} + 1$ (for $\sigma_m = 0$ in case 1) to the output. The addition of $\sigma_{(m)}$ by 1 is carried out by the row of half adder elements in the figure, in accordance with the circuit in Figure 6.4 for code translation from the diminished–1 representation to the NBC representation. Here, because one of the inputs of the half adder element in the least significant bit position equals zero, the adder element can be simplified to an inverter (see Figure 6.9).
Let $\sigma' = \sigma_{(m)} + 1$, where $\sigma = a_{(m)} + b_{(m)}$ as before. For all three cases described in the beginning of Section 6.3.4, we introduce a Boolean function $f$ to control which of $\sigma_{(m)}$ (for $f = 1$) or $\sigma'_{(m)}$ (for $f = 0$) should be passed to the output $s_{(m)}$. The most significant bit $s_m$ is generated separately. In Table 6.3 we have listed the possible values of $a_m$, $b_m$, $\sigma_m$, $\sigma'_m$, $f$, and $s_m$ for the three above-mentioned cases. By using Karnaugh maps for $f$ and $s_m$ we obtain the minimised Boolean functions
\[ f = a_m + b_m + \sigma_m, \qquad s_m = a_m b_m + \bar{a}_m \bar{b}_m \sigma'_m. \]
These functions are formed by the logic gates in the leftmost part of Figure 6.9.

Figure 6.9: A carry ripple architecture for diminished–1 addition modulo $2^m + 1$ (blocks: row of adder elements, add-by-one circuit, 2/1 multiplexers). The paths $P_1$ and $P_2$ form the CP through the circuit.

Case   a_m   b_m   sigma_m   sigma'_m   f   s_m
1      0     0     0         0          0   0
1      0     0     0         1          0   1
1      0     0     1         0          1   0
2      0/1   1/0   0         0/1        1   0
3      1     1     0         0          1   1
Table 6.3: Properties of some variables of the addition circuit of Figure 6.9. The Boolean functions $f$ and $s_m$ depend on the other variables.
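A word-level Python sketch of the multiplexer-based adder may help to see how the select signal and the most significant output bit interact. The Boolean expressions used for the select signal and for the top bit are one functionally consistent choice for valid diminished–1 inputs (assumed here, not quoted from the figure), and the function name is hypothetical.

```python
def dim1_add_mux(a: int, b: int, m: int) -> int:
    """Word-level model of a carry ripple diminished-1 adder with multiplexers.

    The ripple adder forms sigma = a_(m) + b_(m) with zero carry-in, an
    add-by-one row forms sigma_(m) + 1, and a select signal f chooses
    which of the two m-bit words is passed to the output.
    """
    mask = (1 << m) - 1
    a_m, b_m = (a >> m) & 1, (b >> m) & 1
    sigma = (a & mask) + (b & mask)             # carry ripple adder
    sigma_m = sigma >> m
    plus_one = (sigma & mask) + 1               # add-by-one circuit
    plus_one_m = plus_one >> m
    f = a_m | b_m | sigma_m                     # assumed select function
    low = (sigma & mask) if f else (plus_one & mask)
    s_m = (a_m & b_m) | ((1 - a_m) & (1 - b_m) & plus_one_m)
    return (s_m << m) | low
```

For valid codes the output should again equal $(a + b + 1) \bmod (2^m + 1)$, so this model and the look-ahead variant are interchangeable at the functional level.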
By comparison with the adder in Figure 6.7, the size of the adder in Figure 6.9 is reduced. Its size $C_{\mathrm{dimadd},2}$ (the subscript 2 referring to the adder of Figure 6.9) equals $(m-1)\,C_{\mathrm{FA}} + m\,C_{\mathrm{HA}} + m\,C_{\mathrm{MUX}}$ plus the sizes of a few OR gates, NAND gates, and inverters, and is thus linear in $m$.
Also, as expected when comparing a carry ripple adder with a carry look-ahead adder, the internal CP length is greater for the carry ripple adder: The CP through the adder is formed by the two paths labelled $P_1$ and $P_2$ in Figure 6.9. Hence, for this adder the fan-in and the output normalised resistance equal
\[ f_{\mathrm{dimadd},2} = f_{\mathrm{HA}}, \qquad r_{\mathrm{dimadd},2} = r_{\mathrm{HA,sum}}, \]
respectively, and the internal CP length $L_{\mathrm{CP,dimadd},2}$ is the sum of the delays through the half adder and the cascaded full adder carry stages, the OR gate, and the inverter that drives the $m$ multiplexers.
Carry Look-Ahead versus Carry Ripple Adder
In order to take all three delay parameters, i.e. the fan-in, the internal CP length, and the output normalised resistance, of each adder into account we assume as before that the adders are both preceded and followed by registers. Then, the computation times ($T$) of the carry look-ahead adder in Figure 6.7 and the carry ripple adder in Figure 6.9 are equal to
\[ L_{\mathrm{dimadd},1} = L_{\mathrm{reg}} + r_{\mathrm{reg}} f_{\mathrm{dimadd},1} + L_{\mathrm{CP,dimadd},1} + r_{\mathrm{dimadd},1} f_{\mathrm{reg}}, \]
which contains terms proportional to $m$ and to $\log_2 m$, and
\[ L_{\mathrm{dimadd},2} = L_{\mathrm{reg}} + r_{\mathrm{reg}} f_{\mathrm{dimadd},2} + L_{\mathrm{CP,dimadd},2} + r_{\mathrm{dimadd},2} f_{\mathrm{reg}} \qquad (6.28) \]
which is linear in $m$, respectively. Apparently, the carry look-ahead adder is faster than the carry ripple adder. By combining the size and the total CP length (including registers) of each adder, we find that the $AT^2$ performance of the two adders is proportional to
\[ CL^2_{\mathrm{dimadd},1} = C_{\mathrm{dimadd},1}\, L^2_{\mathrm{dimadd},1} = O(m^3), \]
\[ CL^2_{\mathrm{dimadd},2} = C_{\mathrm{dimadd},2}\, L^2_{\mathrm{dimadd},2} = O(m^3), \]
respectively. The sizes, total CP lengths, and $CL^2$ products of the two adders are plotted versus $m$ in Figure 6.10. We see that the values of the complexity parameters do not differ much between the adders. However, for all $m$ the size of the carry look-ahead adder is slightly greater than the size of the carry ripple adder. For all but the smallest values of $m$, the total CP length $L_{\mathrm{dimadd},1}$ of the carry look-ahead adder is less than the total CP length $L_{\mathrm{dimadd},2}$ of the carry ripple adder.
With respect to the $AT^2$ performance, the carry look-ahead adder is preferable to the carry ripple adder for one range of $m$, but the carry ripple adder is preferable for the other (see Figure 6.10).
Figure 6.10: The sizes, CP lengths, and $AT^2$ performances of the two diminished–1 adders (panels: area complexity, size $C$; time complexity, CP length $L$; area-time performance, $CL^2$). The parameters are plotted versus $m$.
A Bit-Serial Adder
The chip area required to perform an arithmetic operation is usually smaller
for bit-serial architectures than for bit-parallel architectures. Another advan-
tage of bit-serial architectures is that they can be clocked with a higher clock
frequency, i.e. they have a higher throughput. However, it is not certain that
the total time required to perform an operation is smaller for a bit-serial archi-
tecture than for a bit-parallel architecture.
The bit-serial carry-save adder of Figure 6.11 adds the two binary coded numbers $a$ and $b$. Here, we assume that the binary digits of $a$ and $b$ are fed into the adder element with the least significant bits first. Each digit of the sum $a + b$ is either directly stored in a shift register or first manipulated in some way, for example to form the sum $s$ (according to diminished–1 addition), before it is stored. In any case, assuming that $a_i$ and $b_i$ are the outputs from two (shift) registers, the internal CP length of the bit-serial adder cannot be less than the length of the path from the input register through the full adder element (from the signal input to the sum output) to an output register. This minimal CP length equals
\[ L_{\mathrm{CP,seradd,min}} = L_{\mathrm{reg}} + r_{\mathrm{reg}} f_{\mathrm{FA,signal}} + L_{\mathrm{FA,sum}} + r_{\mathrm{FA}} f_{\mathrm{reg}}. \]
It takes at least $m + 1$ clock cycles before the desired diminished–1 sum $s = (a + b + 1) \bmod (2^m + 1)$ is present at the adder output, whether it is in bit-serial or bit-parallel form. Hence, the total computation time is proportional to the length $L_{\mathrm{seradd,min}}$, for which we have
\[ L_{\mathrm{seradd,min}} = (m + 1)\, L_{\mathrm{CP,seradd,min}}. \]
This length is also greater than the total CP lengths $L_{\mathrm{dimadd},1}$ and $L_{\mathrm{dimadd},2}$ of the carry look-ahead adder and the carry ripple adder, respectively. We assert that from an $AT^2$ performance point of view, when comparing the bit-serial adder with the carry look-ahead and carry ripple adders, the bit-serial adder is not competitive. The bit-serial adder is not further considered here.
Other Adders
In addition to the adders described above, we would like to mention some
other diminished–1 adders that have been presented in the literature. In this
thesis, we do not consider the complexity or performance of these adders.
Firstly, because we let the adder block of the diminished–1 adder in Figure 6.7 be a carry ripple adder, the complete adder architecture is not a true carry look-ahead adder. McClellan [65, Fig. 8], however, implements this adder part of the circuit using length-4 arithmetic logic units and carry look-ahead logic blocks.
Secondly, Towers et al. [101, Fig. 10] and Pajayakrit [71, Fig. 3.4] propose a true carry look-ahead diminished–1 adder that is based on McClellan’s adder [65, Fig. 8]. They forward the generate and propagate signals obtained in the carry look-ahead block (see Figure 6.7) to an array of 4-bit carry look-ahead units, which in turn is followed by an array of XOR gates forming the sum output. The resulting adder is implemented in nMOS technology. The authors state that their carry look-ahead scheme “seemed to be the best in terms of area and speed”.

Figure 6.11: The adder part of a diminished–1 bit-serial adder.
Thirdly, Morikawa et al. [68] have implemented a three-input diminished–1
adder, i.e. an adder that adds three numbers at a time. Their adder is based
on the three-input carry-save adder presented by Hwang [52, Ch. 4.2]. Re-
cently, Benaissa et al. [13, Fig. 5] presented a VLSI design of a Fermat number
transform using three-input adders. Benaissa’s adder is an improved version
of Morikawa’s adder.
Subtraction
At the end of Section 5.1.3 we wrote that subtraction is most simply carried out by first negating the subtrahend and then adding the result to the minuend. This should also be the most straightforward procedure for diminished–1 subtraction. Thus, subtraction can be performed using the negation architecture in Figure 6.6 and any of the two-input adders described in the present section.
6.3.5 Multiplication by Powers of 2
Multiplication by 2
Multiplication by two is simply performed in VLSI when using the dimin-
ished–1 representation. By letting
 
 
  in (6.22), we get
 
 
 
 
T
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
m
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
  if
 
 
 
 
m (i.e. if
 
 
  )
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
i
f
 
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
 
where
 
 
m
 
 
 
m and
 
 
i
 
 
 
i
 
  for
i
 
 
 
 
 
 
 
 
 
m
 
  holds for all elements
 
 
 
Z
 
m
 
 .F o r
 
 
 
 
m weget
 
 
 
 
 andfor
 
 
 
 
 
 
m
 
  weget
 
 
 
 
 
 
m
 
 .
Hence, the binary digits of
 
  are formed as
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
m
 
 
 
 
 
m
 
 
 
m
 
 
 
 
i
 
 
 
i
 
 
 
 
 
i
 
m
 
 
 
 
m
 
 
 
m
Figure 6.12 shows an architecture for multiplication by 2. The CP of this simple
architecture is the path from the $\alpha_m$-input node to the $\gamma_0$-output node. With
respect to this CP, the size of the circuit, its fan-in, and its output normalised
resistance equal

$$
C_{dimmult2} = C_{NOR}, \qquad
f_{dimmult2} = n_m + f_{NOR}, \qquad
r_{dimmult2} = r_{NOR},
$$

respectively, where $n_m$ is the circuit fan-out with respect to the $\gamma_m$-output node.
There is no internal stage. If $n_m$ is (much) greater than 2, the circuit performance
can be improved by connecting the $\gamma_m$-output to a simple driver (two
cascaded inverters). Then, $n_m = f_{inv}$ and thus the fan-in $f_{dimmult2}$ equals 6.

Figure 6.12: Diminished–1 multiplication by 2 modulo $2^m + 1$.
The total CP length of the multiplication-by-2 circuit in Figure 6.12, including
registers, equals

$$
L_{dimmult2} = L_{reg} + r_{reg} f_{dimmult2} + r_{dimmult2} f_{reg} .
$$

Hence, the area-time performance is proportional to the product $C L^2_{dimmult2}$.
 Thus, we have
n
m
 
r
r
e
g
 
.
Figure 6.13: A feedback shift register for repeated multiplication by 2.
Multiplication by $2^n$
Some computations, like for example bit-serial/parallel multiplication, in-
volve repeated multiplication by 2. Repeated multiplication by 2 may be con-
veniently implemented as a feedback shift register with a NOR gate in the
feedback loop, as shown in Figure 6.13. This circuit is based on the circuit in
Figure 6.12.
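The behaviour of such a feedback shift register can be simulated in software.
The sketch below is a behavioural model under our own naming (it is not a
description of the circuit in Figure 6.13 at gate level): each clock cycle
shifts the register one position and feeds the NOR of the two most significant
bits back into the least significant position.

```python
def to_dim1(x, m):
    """Diminished-1 code of x in Z_(2^m+1); zero is coded as 2^m."""
    return (x - 1) % (2**m + 1)

def clock_cycle(state, m):
    """One clock cycle of the feedback shift register (multiplication by 2)."""
    msb = (state >> m) & 1
    feedback = 1 - (msb | ((state >> (m - 1)) & 1))    # NOR gate in the loop
    return (msb << m) | ((state << 1) & ((1 << m) - 1)) | feedback

def run_register(initial, k, m):
    """Register contents after k clock cycles."""
    state = initial
    for _ in range(k):
        state = clock_cycle(state, m)
    return state
```

After k clock cycles the register holds the diminished–1 code of 2^k times the
loaded integer. Because 2 has order 2m modulo 2^m + 1, the contents repeat with
period 2m.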
The feedback shift register is initially loaded with $\tilde{\alpha}$. After $k$ clock cycles, the
contents of the register (including the single register element holding the most
significant bit) equals

$$
T(2^k \alpha) \equiv 2^k(\tilde{\alpha} + 1) - 1 \pmod{2^m+1} .
$$

As concluded in Section 5.1.4 (page 80), for $t = \log_2 m$ only the $t+1$ least
significant bits of the exponent $n$ have to be considered when computing $T(2^n\alpha)$:
because $\mathrm{ord}_{2^m+1}(2) = 2m = 2^{t+1}$ it is enough to consider $n \bmod 2m$.
Hence, at most $k = n \bmod 2m$ clock cycles are required to compute $T(2^n\alpha)$ using the
circuit in Figure 6.13.

Furthermore, in Section 5.1.4 we also concluded from (5.8) that it is enough to
first compute $\tilde{\rho} = T(2^{n_{(t)}}\alpha) \bmod (2^m+1)$, where $n_{(t)} = n \bmod m$ is the integer
formed by the $t$ least significant bits of $n$, and then, if and only if $n_t = 1$, negate
$\tilde{\rho}$ to obtain the desired result. By (5.8) we get
$T(2^n\alpha) \equiv T(2^{n_{(t)}}\alpha) \pmod{2^m+1}$ if $n_t = 0$ and
$T(2^n\alpha) \equiv T(-2^{n_{(t)}}\alpha) \pmod{2^m+1}$ if $n_t = 1$. Diminished–1 negation
is performed by inverting the $m$ least significant bits of the register contents
$\tilde{\rho}$. If $n_t$ equals zero or $\rho_m$ equals one, the negation does not take place.

Let $\tilde{\gamma} = T(2^n\alpha) \bmod (2^m+1)$. The binary digit $\gamma_i$ is obtained from the
Karnaugh map in Figure 6.14 as the Boolean function

$$
\gamma_i = \rho_i\,\overline{(\bar{\rho}_m n_t)} + \bar{\rho}_i\,(\bar{\rho}_m n_t) = (\bar{\rho}_m n_t) \oplus \rho_i \qquad (6.29)
$$
where $\rho_i$ is the contents in bit position $i$ of the feedback register. Figure 6.15
shows an architecture that performs the operation $\tilde{\gamma} = T(2^n\alpha) \bmod (2^m+1)$
using repeated multiplication by 2 according to the above procedure. The control
logic is not included in the figure. According to the Karnaugh map in
Figure 6.14, $\gamma_i$ can also be formed by other Boolean functions, depending on
which values are assigned to the “don’t cares”. However, the function in (6.29)
results in the most efficient realisation (the array of XOR gates), with respect
to the area-time performance.

Figure 6.14: Karnaugh map for the output bit $\gamma_i$ of $\tilde{\gamma}$ for $0 \le i \le m-1$
(input variables $\rho_m$, $n_t$, and $\rho_i$). X = “don’t care”.
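The complete procedure (repeated multiplication by 2 followed by a conditional
negation through a row of XOR gates) can be sketched in Python as follows. The
names are hypothetical, nonnegative exponents are assumed, and m is assumed to
be a power of two, m = 2^t, as for the Fermat numbers.

```python
def to_dim1(x, m):
    """Diminished-1 code of x in Z_(2^m+1); zero is coded as 2^m."""
    return (x - 1) % (2**m + 1)

def dim1_times2(d, m):
    """One shift of the feedback register (NOR gate in the feedback loop)."""
    msb = (d >> m) & 1
    feedback = 1 - (msb | ((d >> (m - 1)) & 1))
    return (msb << m) | ((d << 1) & ((1 << m) - 1)) | feedback

def dim1_times_pow2(d, n, m):
    """Diminished-1 code of 2^n * x, given the code d of x (n >= 0)."""
    t = m.bit_length() - 1
    assert 1 << t == m, "m is assumed to be a power of two"
    n_low = n % m                  # the t least significant bits of n
    n_t = (n >> t) & 1             # bit t of n decides on the negation
    r = d
    for _ in range(n_low):         # n_low shifts of the feedback register
        r = dim1_times2(r, m)
    ctrl = n_t & (1 - (r >> m))    # no negation if the register holds zero
    return r ^ (((1 << m) - 1) * ctrl)   # row of XOR gates
```

The control signal `ctrl` corresponds to the AND of the bit n_t and the
inverted most significant register bit, so at most m − 1 shifts are needed for
any exponent.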
The size of the architecture in Figure 6.15 equals

$$
C_{seqmult2^n} = (m+1)\,C_{reg} + m\,C_{XOR} + C_{NOR} + C_{AND} + C_{inv} .
$$

The internal CP during the shift operation is the feedback path $P_1$ from the
register holding $\rho_{m-1}$ through the NOR gate to the register in the least significant
bit position. This path has length

$$
L_{CP,seqmult2^n} = L_{reg} + r_{reg}(f_{XOR} + f_{NOR}) + r_{NOR} f_{reg} .
$$

During an initial clock cycle, $\tilde{\alpha}$ is loaded into the shift register. After the
$n_{(t)} = n \bmod m$ subsequent clock cycles, the shift register contains the diminished–1
integer $T(2^{n_{(t)}}\alpha)$. An extra clock cycle is then required to shift this result through the
array of XOR gates to the output. Assuming that $\tilde{\gamma}$ is directly stored in a register,
the length of this final output path (which is named $P_2$ in Figure 6.15)
equals

$$
L_{P_2} = L_{reg} + r_{reg}(f_{XOR} + f_{NOR}) + L_{XOR} + r_{XOR} f_{reg} ,
$$

which is slightly greater than the length $L_{CP,seqmult2^n}$ of the internal critical
path $P_1$. Therefore, by letting the clock interval be proportional to $L_{P_2}$, the time
$T$ required to perform the multiplication by $2^n$ modulo $2^m+1$ is proportional
to

$$
L_{seqmult2^n} = (n_{(t)} + 2)\,L_{P_2} .
$$

Because $0 \le n_{(t)} \le 2^t - 1 = m - 1$, the maximum multiplication time is
proportional to $(m+1)\,L_{P_2}$. When the circuit in Figure 6.15 is followed by a
register, the length $L_{P_3}$ of path $P_3$ of the figure equals

$$
L_{P_3} = L_{reg} + r_{reg}(f_{reg} + f_{inv}) + r_{inv} f_{AND} + L_{AND} + r_{AND}\, m f_{XOR} + L_{XOR} + r_{XOR} f_{reg} .
$$

Therefore, for $L_{seqmult2^n} < L_{P_3}$, i.e. for $n_{(t)} < \lceil L_{P_3}/L_{P_2} \rceil - 2$,
the computation time $T$ is proportional to $L_{P_3}$ and hence, for
$n_{(t)} \ge \lceil L_{P_3}/L_{P_2} \rceil - 2$ the computation time is proportional to $L_{seqmult2^n}$. The $AT^2$
performance of the circuit is proportional to the product

$$
CL^2_{seqmult2^n} =
\begin{cases}
C_{seqmult2^n} L^2_{P_3} = O(m^3) & \text{for } 0 \le n_{(t)} < \lceil L_{P_3}/L_{P_2} \rceil - 2 \\
C_{seqmult2^n} L^2_{seqmult2^n} = O\!\left(m\,(n_{(t)}+2)^2\right) & \text{for } \lceil L_{P_3}/L_{P_2} \rceil - 2 \le n_{(t)} \le m-1
\end{cases}
$$

Figure 6.15: An architecture for diminished–1 multiplication by a power of 2.
Note that for a nonzero integer $\alpha$, its corresponding diminished–1 integer $\tilde{\alpha}$ is
an element of $Z_{2^m}$, i.e. we have $\alpha_m = 0$. Because the odd Fermat number $2^m + 1$
is not divisible by 2, for $n \in N$ every integer $2^n\alpha \bmod (2^m+1)$ is also a nonzero
integer. Thus, for all such nonzero numbers $\alpha$ we get $\gamma_m = 0$, where
$\tilde{\gamma} = T(2^n\alpha) \bmod (2^m+1)$, and hence the NOR gate in the feedback path can be
replaced by an inverter. Also, for $0 \le i \le m-1$, each output bit $\gamma_i$ of the
circuit is then formed by the Boolean function $\gamma_i = n_t \oplus \rho_i$, where $\rho_i$ is the
contents in bit position $i$ of the feedback register. The procedure
for repeated multiplication by 2 using only a feedback shift register with an
inverter in the feedback loop was originally described by Leibowitz [58].
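For nonzero operands the simplification can be illustrated as follows. The
sketch is ours (hypothetical names, m assumed to be a power of two, nonnegative
exponent) and models the inverter feedback and the XOR row behaviourally.

```python
def to_dim1(x, m):
    """Diminished-1 code of x in Z_(2^m+1); zero is coded as 2^m."""
    return (x - 1) % (2**m + 1)

def times2_nonzero(r, m):
    """Shift with an inverter (instead of a NOR gate) in the feedback loop."""
    feedback = 1 - ((r >> (m - 1)) & 1)
    return ((r << 1) & ((1 << m) - 1)) | feedback

def times_pow2_nonzero(d, n, m):
    """Diminished-1 code of 2^n * x for a nonzero x with m-bit code d."""
    t = m.bit_length() - 1
    assert 1 << t == m and d < (1 << m)
    r = d
    for _ in range(n % m):
        r = times2_nonzero(r, m)
    n_t = (n >> t) & 1
    return r ^ (((1 << m) - 1) * n_t)   # output bit i = n_t XOR register bit i
```

Only the m least significant bits are stored here, since the most significant
bit is known to be zero throughout the computation.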
Multiplication Using a Modified Barrel Shifter
In 1982, Truong et al. [103] proposed a method of computing multiplication
by a power of two using a modified Barrel shifter. To the author’s knowledge,
Truong’s multiplication method (or modified versions of the method) is used
by most people when implementing diminished–1 multiplication by powers
of two. For example, Pajayakrit [71, Ch. 3.6] and Towers et al. [101, Sec. 11.1.3]
propose an nMOS architecture which comprises two modified Barrel (circular)
shifters. This architecture works for both negative and positive exponents $n$, in
order to be applicable in the computation of the inverse Fermat number transform
as well as the forward transform.
One of the shifters is a diminished–1 left shifter, which is used for positive
exponents. The second diminished–1 shifter is used when the exponent is negative.
This shifter shifts the input to the right. The shifters are controlled by a
decoder. Figure 6.16 shows a block diagram for such a multiplier over $Z_{2^m+1}$.
The signal $ctrl$ in the figure controls which of the shifters is to be activated.
If the exponent is always nonnegative (or nonpositive), i.e. we have
$\tilde{\gamma} = T(2^{n_{(t+1)}}\alpha) \bmod (2^m+1)$ (or $\tilde{\gamma} = T(2^{-n_{(t+1)}}\alpha) \bmod (2^m+1)$),
where $n_{(t+1)} = n \bmod 2m$, then it is sufficient
to use only one modified Barrel shifter. Figure 6.17(a) shows a block diagram
of Truong’s modified Barrel (left) shifter together with a decoder. The
decoder has $t+1$ inputs and, consequently, $2^{t+1} = 2m$ outputs. Hence, the
size of the shifter is $O(m \times 2m)$. The modified Barrel shifter differs from an
ordinary Barrel shifter only in the wirings of its transistors. An architecture of
a transmission-gate based modified Barrel shifter over
Z
 
 
 
  is presented in a
paper by Shakaff et al. [90, Fig. 6].
Figure 6.16: A diminished–1 multiplier of a power of two modulo $2^m + 1$ (from [71,
Fig. 3.8] but using our notations). The signal $ctrl$ controls whether the input $\tilde{\alpha}$
is to be shifted to the left (for a positive exponent) or to the right (for a negative
exponent). The output $\tilde{\gamma}$ equals $\tilde{\gamma} = T(2^{\pm n}\alpha)$, where $n \in N$.

In Figure 6.17(b), we present a block diagram of a multiplier for which the
size of the decoder is half the size of the decoder in Truong’s architecture. The
decoder has the $t$-bit NBC number $n_{(t)} = n \bmod m$ as its input and it has $2^t = m$
outputs. Therefore, the size of the Barrel shifter is $O(m \times m)$, i.e. half the size of the
shifter needed in Truong’s architecture. The output of the reduced-size shifter
equals $\tilde{\rho} = T(2^{n_{(t)}}\alpha) \bmod (2^m+1)$ (we assume that the exponent is positive).
If $n_t$ equals one, $\tilde{\rho}$ is inverted to form the desired result
$\tilde{\gamma} = T(2^n\alpha) = T(-2^{n_{(t)}}\alpha) \bmod (2^m+1)$. If $n_t$ equals zero, $\tilde{\rho}$ is passed unchanged to the
output. The inversion is carried out by a row of XOR gates, as in Figure 6.15.
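The effect of the reduced-size shifter followed by the XOR row can be modelled
in a few lines of Python. The sketch rests on an assumption of ours that is
not spelled out above: for a nonzero operand, a diminished–1 multiplication by
2^k with 0 ≤ k < m amounts to a circular left shift by k positions in which
the k bits that wrap around are complemented. The function names are
hypothetical.

```python
def to_dim1(x, m):
    """Diminished-1 code of x in Z_(2^m+1); zero is coded as 2^m."""
    return (x - 1) % (2**m + 1)

def modified_barrel_shift(d, k, m):
    """Circular left shift by k (0 <= k < m) with complemented wrap-around bits."""
    mask = (1 << m) - 1
    wrapped = d >> (m - k)                          # the k most significant bits
    return ((d << k) & mask) | (~wrapped & ((1 << k) - 1))

def barrel_times_pow2(d, n, m):
    """Shifter controlled by n mod m, then a row of XOR gates driven by n_t."""
    t = m.bit_length() - 1
    assert 1 << t == m and d < (1 << m)             # nonzero operand assumed
    r = modified_barrel_shift(d, n % m, m)
    n_t = (n >> t) & 1
    return r ^ (((1 << m) - 1) * n_t)
```

The identity behind the model is 2^k(d + 1) − 1 ≡ L·2^k + (2^k − 1 − H)
modulo 2^m + 1, where H and L are the upper k and lower m − k bits of d.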
The decoder can be implemented in several ways, see for instance Weste and
Eshraghian [113, Ch. 8.3.1.1.3]. The choice of implementation may for exam-
ple be governed by speed requirements, power dissipation constraints, and
chip size constraints. In this section, we do not give any further details about
the complexities and performances of the Barrel-shifter type of architectures
in Figures 6.16 and 6.17.
6.3.6 General Multiplication
Several algorithms and architectures for diminished–1 general multiplication
have appeared in the literature. To our knowledge, only bit-serial/parallel
and bit-parallel architectures are suggested by the originators of these
architectures.

Figure 6.17: Diminished–1 multiplication of a positive power of two modulo $2^m + 1$,
using one shifter. (a) Truong’s modified Barrel shifter of size $m \times 2m$ bits. The
output equals $\tilde{\gamma} = T(2^n\alpha) \bmod (2^m+1)$. (b) A modified Barrel shifter of
size $m \times m$ bits, followed by an array of XOR gates. For the output $\tilde{\gamma}$ we
have $\gamma_i = \rho_i \oplus n_t$, where $0 \le i \le m-1$ and $\tilde{\rho} = T(2^{n_{(t)}}\alpha) \bmod (2^m+1)$.
6.3.6.1 Bit-Parallel Architectures
Leibowitz [58] was the first to present some procedures for general multiplication.
He briefly considers three procedures, which are based on how the
multiplicand and the multiplier are represented:
1. The multiplicand and the multiplier are both diminished–1 numbers.
2. The multiplicand and the multiplier are both NBC numbers.
3. The multiplicand is an NBC number and the multiplier is a diminished–1
number, or vice versa.
The third multiplication procedure is generally used only for bit-serial/parallel
multiplication. In the literature, we have not found any architecture for bit-parallel
diminished–1 general multiplication that is based on such a procedure.
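In this section we give an analytical description of the above multiplication
procedures 1 and 2, beginning with the first procedure. For nonzero factors
$\alpha$ and $\beta$, i.e. for the diminished–1 numbers $\tilde{\alpha}, \tilde{\beta} \in Z_{2^m}$, the product in
(6.10) can be written as

$$
T(\alpha\beta) \equiv \tilde{\alpha}\tilde{\beta} + \tilde{\alpha} + \tilde{\beta}
\equiv 2^m \sum_{i=0}^{m-1} \pi_{m+i} 2^i + \sum_{i=0}^{m-1} \pi_i 2^i + \tilde{\alpha} + \tilde{\beta}
\equiv \pi_L - \pi_H + \tilde{\alpha} + \tilde{\beta} \pmod{2^m+1} \qquad (6.30)
$$

where $\pi = \tilde{\alpha}\tilde{\beta} = (\pi_{2m-1}, \ldots, \pi_0)$ is a $2m$-bit NBC integer,
$\pi_L = \sum_{i=0}^{m-1} \pi_i 2^i$, and $\pi_H = \sum_{i=0}^{m-1} \pi_{m+i} 2^i$ is an $m$-bit
NBC integer. Therefore, we have $-\pi_H \equiv \bar{\pi}_H + 2 \pmod{2^m+1}$, where
$\bar{\pi}_H$ is the one’s complement of $\pi_H$, and hence
(6.30) can be expressed as

$$
T(\alpha\beta) \equiv \pi_L + \bar{\pi}_H + \tilde{\alpha} + \tilde{\beta} + 2 \pmod{2^m+1} \qquad (6.31)
$$

Hence, the above procedure 1 involves one ordinary $(m \times m)$-bit general
multiplication, one ordinary $m$-bit addition, and two diminished–1 additions (see
Example 12 in Leibowitz’ paper [58]).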
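The operation count of procedure 1 can be made concrete with the following
Python sketch. It is our own illustration with hypothetical names: it assumes
nonzero operands, takes the product in (6.10) to be the sum of the ordinary
product of the two diminished–1 numbers and the two numbers themselves, and
models a diminished–1 addition arithmetically as x + y + 1 modulo 2^m + 1.

```python
def to_dim1(x, m):
    """Diminished-1 code of x in Z_(2^m+1); zero is coded as 2^m."""
    return (x - 1) % (2**m + 1)

def dim1_add(x, y, m):
    """Arithmetic model of a diminished-1 addition (nonzero operands)."""
    return (x + y + 1) % (2**m + 1)

def dim1_mul_procedure1(a, b, m):
    """Diminished-1 product of two nonzero operands with m-bit codes a and b."""
    mask = (1 << m) - 1
    p = a * b                                 # ordinary (m x m)-bit multiplication
    p_lo, p_hi = p & mask, p >> m             # lower and upper halves of the product
    s = a + b                                 # ordinary addition
    u = dim1_add(p_lo, p_hi ^ mask, m)        # first diminished-1 addition
    return dim1_add(u, s, m)                  # second diminished-1 addition
```

The one's complement `p_hi ^ mask` replaces the subtraction of the upper half,
and the two diminished–1 additions supply the constant 2.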
 
Leibowitz did not formulate any algorithm for diminished–1 general multiplication. He
only sketched the main steps of the proposed multiplication procedures and presented two
examples.
Sunder’s Parallel Multiplier
An $(m \times m)$-bit multiplication can be carried out using a conventional (Braun)
bit-parallel multiplier array, see for instance Weste and Eshraghian [113, Ch.
8.2.7.1 (Figures 8.36 and 8.37)] or Hwang [52, Ch. 6.1 (Fig. 6.3)]. Sunder et
al. [97] recently proposed an architecture for diminished–1 general multiplication
based on the above procedure (Equation (6.30)). They modified the conventional
array multiplier such that the addition of the product $\tilde{\alpha}\tilde{\beta}$ and the
non-reduced sum $\tilde{\alpha} + \tilde{\beta}$, together with an addition by one, is performed in the
multiplier array. Thus, following the notations above, the output of the array is the
$(2m+1)$-bit NBC integer

$$
\pi = \tilde{\alpha}\tilde{\beta} + \tilde{\alpha} + \tilde{\beta} + 1 = 2^m \sum_{i=0}^{m} \pi_{m+i} 2^i + \pi_L
$$

rather than just the $2m$-bit NBC integer $\tilde{\alpha}\tilde{\beta}$, where $\pi_L = \sum_{i=0}^{m-1} \pi_i 2^i$. Here, we consequently
let $\pi_H = \sum_{i=0}^{m} \pi_{m+i} 2^i$ (compare this $(m+1)$-bit $\pi_H$ with the $m$-bit $\pi_H$ used in (6.30)
and (6.31)). By (5.5) we get $2^m \pi_H \equiv -\pi_H \pmod{2^m+1}$, which implies that
$T(\alpha\beta)$ can be written as

$$
T(\alpha\beta) \equiv \pi - 1 \equiv \pi_L - \pi_H - 1 \equiv \pi_L + (-\pi_H - 2) + 1 \pmod{2^m+1} .
$$

The addend $-\pi_H - 2$ can be obtained using the procedure for diminished–1 negation.
Figure 6.18 shows the modified multiplier array. The addends $\tilde{\alpha}$ and
$\tilde{\beta}$ are added to the sum of partial products in the first row of the array. The
addition by one is carried by the rightmost column of half adder elements.
A block diagram of Sunder’s bit-parallel diminished–1 pipelined multiplier is
shown in Figure 6.19. The output from the diminished–1 adder of the architecture
is the sum $\tilde{\sigma} \equiv \pi_L + (-\pi_H - 2) + 1 \pmod{2^m+1}$, where $\pi_L$ and $\pi_H$ are defined
as above. The desired product $T(\alpha\beta)$ equals $\tilde{\sigma}$ only for nonzero inputs, i.e. for
$\tilde{\alpha}, \tilde{\beta} \in Z_{2^m}$. When either $\alpha$ or $\beta$ (or both) equals zero, we have $\alpha_m = 1$ ($\tilde{\alpha} = 2^m$)
and $\beta_m = 1$ ($\tilde{\beta} = 2^m$), respectively, and $T(\alpha\beta) = 2^m \bmod (2^m+1)$. However,
when only one of $\alpha$ and $\beta$ equals zero, the adder output $\tilde{\sigma}$ of Figure 6.19
is nonzero. The correct output is formed by a row of inverters and NOR gates,
see Figure 2 in Sunder’s paper [97]. Sunder names this circuit the output controller.

(Regarding the addition by one in the multiplier array, compare with the equivalent
add-by-one circuit in Figure 6.4.)

Figure 6.18: A modified $(m \times m)$-bit multiplier (from Sunder et al. [97, Fig. 1]).
(a) The multiplier array. The dotted line is the CP through the array.
(b), (c), and (d): The FA′, HA′, and FA″ cells, respectively.

Figure 6.19: A block diagram of a modified $(m \times m)$-bit diminished–1 multiplier over
$Z_{2^m+1}$ (essentially from Sunder et al. [97, Fig. 4]).

For $0 \le i \le m-1$, the output bit in position $i$ can be expressed as the Boolean
function $\overline{\alpha_m + \beta_m + \bar{\sigma}_i}$, i.e. its value can be generated using one inverter and
one NOR gate. The Boolean function $\alpha_m + \beta_m$ is evaluated separately. Sunder
assigns the value of this function to the most significant bit of the output
$T(\alpha\beta)$. However, when the modulus $2^m + 1$ is composite, there exist products
of nonzero integers of $Z_{2^m+1}$ that are congruent to zero modulo $2^m + 1$. If such
a situation occurs, the most significant bit of the product $T(\alpha\beta)$ should not
be formed by the Boolean function $\alpha_m + \beta_m$. The correct Boolean function is
$\alpha_m + \beta_m + \sigma_m$, i.e. only an extra OR gate is needed to form the true output (see
Figure 6.19).
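The data flow described above (modified array, negater, diminished–1 adder,
and output controller with the extra OR gate) can be sketched behaviourally in
Python. The function name is hypothetical and the arithmetic blocks are
modelled by integer operations rather than by the cells of Figure 6.18.

```python
def to_dim1(x, m):
    """Diminished-1 code of x in Z_(2^m+1); zero is coded as 2^m."""
    return (x - 1) % (2**m + 1)

def sunder_style_multiply(a_code, b_code, m):
    """Behavioural model of the multiplier, including zero handling."""
    mask = (1 << m) - 1
    F = 2**m + 1
    zero = (a_code >> m) | (b_code >> m)       # OR of the two zero-indicator bits
    a, b = a_code & mask, b_code & mask
    p = a * b + a + b + 1                      # output of the modified array
    p_lo, p_hi = p & mask, p >> m              # m-bit and (m+1)-bit parts
    neg = p_hi ^ mask if p_hi <= mask else p_hi    # diminished-1 negation
    s = (p_lo + neg + 1) % F                   # diminished-1 adder output
    low = 0 if zero else (s & mask)            # NOR of zero flag and inverted bit
    msb = zero | (s >> m)                      # extra OR gate for composite moduli
    return (msb << m) | low
```

For m = 32 the modulus 2^32 + 1 = 641 · 6700417 is composite, so the product of
the nonzero factors 641 and 6700417 is zero; the extra OR gate is what makes
the most significant output bit equal to one in that case.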
Let $C_{Sunder,array}$ denote the size of Sunder’s $(m \times m)$-bit array multiplier of
Figure 6.18. This size is the sum of the sizes of all cells of the array and is $O(m^2)$,
where $C_{FA'}$, $C_{FA''}$, $C_{HA'}$, $C_{FA}$, and $C_{HA}$ are the sizes of the FA′, FA″, HA′, FA, and
HA cells, respectively; the sizes $C_{FA'}$ and $C_{HA'}$ include the size $C_{AND}$ of an AND gate.
The registers labelled R in Figure 6.19 are $m$-bit parallel registers
and $(m+1)$-bit parallel registers (distinguished by their indices in the figure).
The D cells are D flip-flops (single-bit registers). Hence, the total chip area $A$ occupied by the
bit-parallel multiplier in Figure 6.19 is proportional to its size $C_{Sunder,mult}$, which
comprises $C_{Sunder,array}$, the registers (multiples of $C_{reg}$), the negater $C_{dimneg}$, the
adder $C_{dimadd}$, the output controller $C_{outctrl}$, and the extra OR gate $C_{OR}$,
where $C_{outctrl} = m(C_{NOR} + C_{inv})$ is the size of the output controller. We
assume that the diminished–1 adder in Figure 6.19 is the carry look-ahead adder
of Figure 6.7.
The CP through Sunder’s array multiplier is the dotted path in Figure 6.18
from the $\alpha_{m-1}$-input node to the $\pi_{2m}$-output node. With the inputs taken
directly from registers, the minimum clock cycle time of the complete multiplier
in Figure 6.19 is proportional to the length

$$
\begin{aligned}
L_{CP,Sunder,array} ={}& L_{reg} + r_{reg}(m f_{AND} + f_{FA,carry}) + L_{AND} + r_{AND} f_{FA,signal} \\
&+ m L_{FA,sum} + (m-1)\, r_{FA} f_{FA,carry} + r_{FA} f_{HA} + L_{HA,carry} + r_{HA,carry} f_{FA,carry} \\
&+ m L_{FA,carry} + (m-1)\, r_{FA} f_{FA,carry} + r_{FA} f_{reg}
\end{aligned}
$$

of this path, which grows linearly with $m$.
Remark: The carry input fan-in $f_{FA,carry}$ of a full adder element (the one in
Figure 4.10) is less than its signal input fan-in $f_{FA,signal}$. Therefore, in order
to minimise the overall propagation delay, we assume that the sum output
of the full adder element in each modified FA cell along the CP is fed to the
carry input of the full adder element in the subsequent cell. Then the CP passes
through the full adder element of each such cell from the carry input to
the sum output. The smallest propagation delay through a modified FA cell with
an $\alpha_i$ input is obtained when the $\alpha_i$-input signal is connected to the carry
input of the full adder element.
The desired product $T(\alpha\beta)$ is obtained in an output register after three clock
cycles. Hence, the total multiplication time $T$ is proportional to

$$
L_{Sunder,mult} = 3\,L_{CP,Sunder,array} ,
$$

which implies that the area-time performance $AT^2$ is proportional to

$$
CL^2_{Sunder,mult} = C_{Sunder,mult}\,L^2_{Sunder,mult} = O(m^4) .
$$
Ashur’s Parallel Multiplier
Quite recently, Ashur et al. [10] presented an architecture for bit-parallel
diminished–1 multiplication that is based on Sunder’s architecture in Figure
6.19. They obtain a smaller area-time performance for their architecture inter
alia by including the negation step in the array multiplier. Below, we analytically
describe how Ashur’s algorithm works: For nonzero NBC factors $\alpha$ and
$\beta$, i.e. for the diminished–1 numbers $\tilde{\alpha}, \tilde{\beta} \in Z_{2^m}$, the general multiplication
$T(\alpha\beta) \equiv \tilde{\alpha}\tilde{\beta} + \tilde{\alpha} + \tilde{\beta} \pmod{2^m+1}$ can be expanded
as

$$
T(\alpha\beta) \equiv \tilde{\alpha}\tilde{\beta} + \tilde{\alpha} + \tilde{\beta} + 2^m(2^m - 1) - 2^m(2^m - 1)
\equiv 2^m \sum_{i=0}^{m} \pi_{m+i} 2^i + \pi_L - 2^m(2^m - 1) \pmod{2^m+1} \qquad (6.32)
$$

where

$$
\pi = \tilde{\alpha}\tilde{\beta} + \tilde{\alpha} + \tilde{\beta} + 2^m(2^m - 1) \qquad (6.33)
$$

is a $(2m+1)$-bit NBC integer, $\pi_L = \sum_{i=0}^{m-1} \pi_i 2^i$, and $\pi_H = \sum_{i=0}^{m} \pi_{m+i} 2^i$ is an
$(m+1)$-bit NBC integer. Again, by (5.5) we get $2^m \pi_H \equiv -\pi_H \pmod{2^m+1}$. Using this
congruence and the congruence $2^m(2^m - 1) \equiv 2 \pmod{2^m+1}$, (6.32) can be
written as

$$
T(\alpha\beta) \equiv \pi_L - \pi_H - 2 \pmod{2^m+1} \qquad (6.34)
$$

Figure 6.20 shows a block diagram of Ashur’s diminished–1 multiplier. Ashur
et al. have modified Sunder’s array multiplier (the one in Figure 6.18) in the
following way: The rightmost column of half adder elements, which performs
an addition by one, is excluded. Instead, the addition by $2^m(2^m - 1)$ in (6.33) is
carried out by exchanging the half adder elements in the leftmost column for
full adder elements (i.e. exchange the HA′ cells for FA′ cells) and letting each
redundant full adder input be equal to one. The so far grounded input of the
leftmost full adder element in the bottom row of Sunder’s array multiplier should
now also be equal to one.

Furthermore, the resulting $m$ full adder elements in the bottom row of the array
form an $m$-bit carry ripple adder. The output of this adder is the $(m+1)$-bit
integer $\pi_H$, which is defined above. The output $\pi_H$ can be formed by the sum
$\pi_H = \pi^c + \pi^s$, where in turn the $m$-bit integer $\pi^c$ is
formed by the carry outputs and the $m$-bit integer $\pi^s$
is formed by the sum outputs of the row of full adder elements prior to the
carry ripple adder (see Figure 6.20). Hence, (6.34) can be further expanded as

$$
T(\alpha\beta) \equiv \pi_L - (\pi^c + \pi^s) - 2 \equiv \pi_L + \bar{\pi}^c + \bar{\pi}^s + 2 \pmod{2^m+1} \qquad (6.35)
$$

where $\bar{\pi}^c$ and $\bar{\pi}^s$ are the one’s complements of the $m$-bit integers $\pi^c$ and $\pi^s$.
The addends $\bar{\pi}^c$ and $\bar{\pi}^s$ are formed by the row of inverters below the array
multiplier in Figure 6.20. Carry-save adders are preferably used when more than
two numbers are to be added together. For example, array multipliers (like the
ones described in the present section) generally comprise rows of carry-save
adders that perform the summation of the partial products. The final addition
is performed using a carry ripple or carry look-ahead adder.

Ashur et al. efficiently add the $m$-bit addends $\pi_L$, $\bar{\pi}^c$, and $\bar{\pi}^s$ in (6.35) by
using a carry-save adder. These three addends are the inputs of the carry-save
adder which is subsequent to the row of inverters in Figure 6.20. Let
$\pi_L + \bar{\pi}^c + \bar{\pi}^s = \kappa + \sigma$, where the $(m+1)$-bit integer
$\kappa = \sum_{i=1}^{m} \kappa_i 2^i \equiv \sum_{i=1}^{m-1} \kappa_i 2^i - \kappa_m \pmod{2^m+1}$
and the $m$-bit integer $\sigma = \sum_{i=0}^{m-1} \sigma_i 2^i$ are formed
by the carry outputs $\kappa_i$ and sum outputs $\sigma_i$, respectively, of the carry-save
adder. Hence, (6.35) can be written as the diminished–1 sum

$$
T(\alpha\beta) \equiv \kappa + \sigma + 2 \equiv \sum_{i=1}^{m-1} \kappa_i 2^i - \kappa_m + \sigma + 2 \equiv \kappa' + \sigma + 1 \pmod{2^m+1} \qquad (6.36)
$$

where $\kappa' = \sum_{i=1}^{m-1} \kappa_i 2^i + \bar{\kappa}_m$ is an $m$-bit integer, i.e. we have

$$
\kappa' = \sum_{i=0}^{m-1} \kappa'_i 2^i \quad \text{where } \kappa'_0 = \bar{\kappa}_m \text{ and } \kappa'_i = \kappa_i \text{ for } 1 \le i \le m-1 .
$$

The addition by $\bar{\kappa}_m$ is thus carried out by inverting the most significant carry
output $\kappa_m$ of the carry-save adder and feeding it into the vacant least significant
bit position of the register that holds $\kappa'$. For consistency, we have introduced
our notations for Ashur’s multiplier in Figure 6.20. As for Sunder’s
multiplier in Figure 6.19, we have modified the output controller in order to
obtain the correct output when $\tilde{\alpha}, \tilde{\beta} \in Z_{2^m}$ and $T(\alpha\beta) = 2^m \bmod (2^m+1)$
(i.e. when $\alpha\beta \equiv 0 \pmod{2^m+1}$).

Figure 6.20: A block diagram of a modified $m \times m$ diminished–1 multiplier over
$Z_{2^m+1}$, based on Sunder’s multiplier (from Ashur et al. [10, Fig. 1]). The dotted
line is the CP through the array multiplier.
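The carry-save step can be checked numerically with the sketch below. It is a
behavioural illustration under stated assumptions: the names are hypothetical,
the operands are nonzero, and the upper part of the array output is split into
two m-bit words in an arbitrary way, whereas in the hardware these two words
are the carry and sum outputs of the last row of full adder elements.

```python
def to_dim1(x, m):
    """Diminished-1 code of x in Z_(2^m+1); zero is coded as 2^m."""
    return (x - 1) % (2**m + 1)

def ashur_style_multiply(a, b, m):
    """Diminished-1 product of two nonzero operands with m-bit codes a and b."""
    mask = (1 << m) - 1
    F = 2**m + 1
    p = a * b + a + b + (mask << m)        # array output, includes 2^m (2^m - 1)
    p_lo, p_hi = p & mask, p >> m
    p_s = min(p_hi, mask)                  # assumed split p_hi = p_c + p_s
    p_c = p_hi - p_s                       # (both words fit into m bits)
    x, y, z = p_lo, p_c ^ mask, p_s ^ mask         # row of inverters
    sum_word = x ^ y ^ z                           # carry-save adder: sum bits
    carry_word = ((x & y) | (x & z) | (y & z)) << 1    # carry bits, weight 2
    k_m = (carry_word >> m) & 1
    k_prime = (carry_word & mask) | (1 - k_m)      # inverted carry into bit 0
    return (k_prime + sum_word + 1) % F            # final diminished-1 addition
```

The final line is an ordinary diminished–1 addition of two m-bit words, so no
separate negater stage is needed.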
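The chip area $A$ occupied by Ashur’s multiplier is proportional to its size
$C_{Ashur,mult}$, which comprises the FA′ and FA″ cells of the array, the row of
inverters ($C_{inv}$), the full adder elements of the carry-save adder ($C_{FA}$), the
registers ($C_{reg}$), the diminished–1 adder ($C_{dimadd}$), the output controller
($C_{outctrl}$), and the additional OR gate and inverter ($C_{OR} + C_{inv}$), and which is $O(m^2)$.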
The CP through the array multiplier (see the dotted line in Figure 6.20)
determines the maximum clock frequency. The maximum clock frequency is
inversely proportional to the CP length

$$
\begin{aligned}
L_{CP,Ashur,array} ={}& L_{reg} + r_{reg}(m f_{AND} + f_{FA,carry}) + L_{AND} + r_{AND} f_{FA,signal} \\
&+ (m-1)(L_{FA,sum} + r_{FA} f_{FA,carry}) + L_{FA,carry} + r_{FA} f_{inv} + r_{inv} f_{FA,signal} \\
&+ L_{FA,sum} + r_{FA} f_{reg} ,
\end{aligned}
$$

which grows linearly with $m$. Because Ashur’s multiplier computes the product
$T(\alpha\beta)$ in only two clock cycles, the total computation time $T$ is proportional to

$$
L_{Ashur,mult} = 2\,L_{CP,Ashur,array}
$$

and the $AT^2$ performance is proportional to the product

$$
CL^2_{Ashur,mult} = C_{Ashur,mult}\,L^2_{Ashur,mult} = O(m^4) .
$$
Benaissa’s Parallel Multiplier
Regarding the second of Leibowitz’ multiplication procedures (see page 130),
the diminished–1 product $T(\alpha\beta)$ can be written as

$$
T(\alpha\beta) \equiv \alpha\beta - 1 \equiv 2^m \sum_{i=0}^{m} \pi_{m+i} 2^i + \pi_L - 1 \pmod{2^m+1}
$$

where $\pi = \alpha\beta$ is a $(2m+1)$-bit NBC integer, $\pi_L = \sum_{i=0}^{m-1} \pi_i 2^i$, and
$\pi_H = \sum_{i=0}^{m} \pi_{m+i} 2^i$ is an $(m+1)$-bit NBC integer. (Leibowitz [58], however,
erroneously stated that $\pi_H$ is an $m$-bit NBC integer. The procedure described in his
article gives an incorrect answer for $\alpha = \beta = 2^m$.) By (5.5) we get
$2^m \pi_H \equiv -\pi_H \pmod{2^m+1}$ and therefore
$T(\alpha\beta)$ can be formed by the diminished–1 sum

$$
T(\alpha\beta) \equiv \pi_L + (-\pi_H - 2) + 1 \pmod{2^m+1}
$$

where $-\pi_H - 2$ equals diminished–1 negation of $\pi_H$.
Benaissa et al. [11] have implemented a bit-parallel multiplier which is based
on this multiplication procedure. Figure 6.21 shows a block diagram of their
pipelined multiplier. The translation blocks translate the inputs $\tilde{\alpha}$ and $\tilde{\beta}$ to $\alpha$
and $\beta$, respectively, which are multiplied in the array multiplier. The realisation
of the translation blocks is shown in Figure 6.4 of Section 6.3.1. Benaissa
et al. use a standard $(m+1) \times (m+1)$-bit square-version array multiplier, see
Benaissa et al. [11, Fig. 6] or Weste and Eshraghian [113, Fig. 8.37], which comprises
$m^2 - 1$ full adder elements, $m + 1$ half adder elements, and $(m+1)^2$ AND
gates. Hence, the size of this array multiplier equals

$$
C_{Benaissa,array} = (m^2 - 1)\,C_{FA} + (m+1)\,C_{HA} + (m+1)^2\,C_{AND}
$$

which is slightly less than the size $C_{Sunder,array}$ of Sunder’s array multiplier in
Figure 6.18. Using the same types of $m$-bit and $(m+1)$-bit parallel registers and
the same type of carry look-ahead adder as in Figure 6.19, the
size $C_{Benaissa,p-mult}$ of the complete multiplier of Figure 6.21 comprises
$C_{Benaissa,array}$, the registers (multiples of $C_{reg}$), the negater $C_{dimneg}$, the adder
$C_{dimadd}$, the translation circuits $C_{Dim-NBC}$, and an OR gate $C_{OR}$, and it is $O(m^2)$.

(Footnote to the CP length of Ashur’s array multiplier on the preceding page: The CP
starts with the output path of a register.)
The CP of the array multiplier is similar to the CP of Sunder’s array multiplier
in Figure 6.18. It runs from a register output into the AND gate in the
top-leftmost position of the array and then diagonally through the array of
full adder elements and finally to the left along the bottom carry-chain row
to the register holding the most significant output bit $\pi_{2m}$. The length
$L_{CP,Benaissa,array}$ of this CP is the sum of the register output delay, the delays of
the AND gate and the half adder elements on the path, and the sum and carry
delays of the full adder elements along the diagonal and the bottom row, and it
grows linearly with $m$.

The product $T(\alpha\beta)$ is obtained in the output register subsequent to the
diminished–1 adder after four clock cycles. Hence, the time $T$ required to multiply
using Benaissa’s array multiplier architecture is proportional to

$$
L_{Benaissa,p-mult} = 4\,L_{CP,Benaissa,array} ,
$$

which means that the $AT^2$ performance of the multiplier is proportional to the
product

$$
CL^2_{Benaissa,p-mult} = C_{Benaissa,p-mult}\,L^2_{Benaissa,p-mult} = O(m^4) .
$$

Figure 6.21: A block diagram of Benaissa’s [11, Fig. 4] diminished–1 pipelined array
multiplier.
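A behavioural sketch of this second procedure (translate, multiply as NBC
numbers, negate the upper part, add) is given below. The function name is
hypothetical, nonzero operands are assumed, and the blocks of Figure 6.21 are
modelled by integer operations only.

```python
def to_dim1(x, m):
    """Diminished-1 code of x in Z_(2^m+1); zero is coded as 2^m."""
    return (x - 1) % (2**m + 1)

def benaissa_style_multiply(a_code, b_code, m):
    """Diminished-1 product of two nonzero operands given by their m-bit codes."""
    mask = (1 << m) - 1
    F = 2**m + 1
    alpha, beta = a_code + 1, b_code + 1       # translation to NBC (add one)
    p = alpha * beta                           # (m+1) x (m+1)-bit multiplication
    p_lo, p_hi = p & mask, p >> m              # p_hi needs m+1 bits
    neg = p_hi ^ mask if p_hi <= mask else p_hi    # diminished-1 negation
    return (p_lo + neg + 1) % F                # diminished-1 addition
```

The case `p_hi > mask` occurs only when both factors equal 2^m; it is the case
for which treating the upper part as an m-bit number would give a wrong
result.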
Remark: If the multiplier (or the multiplicand) is available as an NBC number
$\beta$ (or $\alpha$), one of the translation circuits in Figure 6.21 can be excluded.
This reduces the total size $C_{Benaissa,p-mult}$. If both the multiplier and the
multiplicand are NBC numbers, the translation part and the two input
registers can be excluded. Consequently, the initial clock cycle is
then excluded. This reduces the total computation time as well as the
total circuit size. Note also that a simple additional modification of the
multiplier (in Figure 6.21) makes it applicable for general multiplication
with respect to the NBC symbol representation.
6.3.6.2 Bit-Serial/Parallel Multipliers
Probably the most frequently used diminished–1 multiplier is the bit-serial/
parallel multiplier. (The multiplication scheme adopted is often referred to as
the iterative shift-and-add technique.) In general, serial/parallel multipliers are
known to occupy less chip area than the corresponding parallel multipliers, but at
the cost of a poorer time performance.
Several algorithms for serial/parallel diminished–1 multiplication have ap-
peared in the literature. They mainly differ in how the multiplicand and the
multiplier are represented and which initial values have to be computed. The
registers needed in the corresponding architectures are loaded with the ini-
tially computed values.
Chang’s Serial/Parallel Multiplier
Chang et al. [32] were among the first to publish a VLSI implementation of a serial/parallel diminished–1 general multiplier. It is based on a diminished–1 representation of the multiplicand and an NBC representation of the multiplier. Let $\tilde a$ and $\tilde b$ be the multiplicand and the multiplier, respectively, in their diminished–1 form of representation. The multiplication algorithm is valid only for $a, b \neq 0$, i.e. for $\tilde a, \tilde b \neq 2^m$. Situations where either $a$ or $b$ (or both) equals zero are handled separately. The algorithm of Chang et al. is based on the following expansion of $T(ab)$ (here, we use our notation):
T
 
 
 
 
 
 
 
 
 
 
 
 
m
X
i
 
 
 
i
 
i
 
 
 
m
X
i
 
 
 
i
 
 
i
 
 
 
 
 
m
X
i
 
 
 
i
 
 
 
m
 
m
 
 
m
X
i
 
 
 
i
T
 
 
i
 
 
 
m
 
 
 
 
m
 
 
 
D
 
 
 
 
 
m
X
i
 
 
 
 
i
T
 
 
i
 
 
 
 
D
 
m
o
d
 
m
 
 
 
  (6.37)
where
D
 
 
m
 
P
m
i
 
 
 
i and where
P
  and
  denote diminished–1 addition.
Because
 
 
Z
 
m
 
  we get
D
 
Z
m
 
 , which is represented as an
m-bit NBC
integer.
Chang et al. [32, Fig. 1] presented a simple architecture which computes (6.37) using a recursive shift-and-add technique, where the modulus reduction is carried out simultaneously during each recursion. Thus, (6.37) can be expressed in the recursive form
P
 
i
 
 
 
 
P
 
i
 
 
 
i
T
 
 
i
 
 
 
m
o
d
 
m
 
 
 
  for
 
 
i
 
m
 
where
P
 
 
 
 
D.F o r
i
 
m we then get
P
 
m
 
 
 
 
T
 
 
 
 
 . Chang et al.
present a slightly modified algorithm to compute the desired product
T
 
 
 
 
 in
m
 
clock cycles. The algorithm, however, suffers principally from two drawbacks. Firstly, the multiplier $\tilde b$ needs to be translated from its diminished–1 representation to its NBC representation. Secondly, an initial computation of $D$ must be performed before the shift-and-add procedure can begin.
A simplified block diagram of a general diminished–1 multiplier, based on Chang's multiplier and an MC68000 microprocessor, can be found in Shakaff's PhD thesis [89, Fig. 3.21(a)]. Shakaff concludes that the main disadvantage of the above multiplication procedure is the need to compute the initial value $D$. He instead proposes the multiplier by Benaissa et al. [12] as a competitive alternative. Benaissa's multiplier, which needs no precomputations, is presented below.
Benaissa’s Serial/Parallel Multipliers
For the diminished–1 representation, the parameters $k$ and $l$ in (6.1) are equal to $1$ and $-1$, respectively. Hence, the diminished–1 form of the general multiplication formula in (6.13) is
T
 
 
 
 
 
 
m
X
i
 
 
 
 
 
i
T
 
 
i
 
 
 
 
i
 
 
m
X
i
 
 
 
 
 
i
T
 
 
i
 
 
 
 
i
 
m
 
 
m
o
d
 
m
 
 
 
  (6.38)
The multiplication algorithm suggested by Benaissa et al. [12] is based on this congruence. In their formulation of $T(ab)$ they have omitted the term $\bar b_i 2^m$. They express the product as $\sum b_i T(2^i a)$ (see [12, Eq. (6)]) and only mention that for $b_i = 0$, the addend is set equal to the diminished–1 zero (i.e. the integer $2^m$). The correct expression for $T(ab)$, however, is given in (6.38). The congruence (6.38) can be expressed in the recursive form
P
 
i
 
 
 
 
P
 
i
 
 
 
 
i
T
 
 
i
 
 
 
 
i
 
m
 
 
m
o
d
 
m
 
 
 
  for
 
 
i
 
m
 
where the initial value $P(0)$ equals $2^m$ (diminished–1 zero). Moreover, for $i = m$ we have $P(m+1) = T(ab)$.
A block diagram of Benaissa's multiplier is shown in Figure 6.22. The control signals are not shown in the figure. The three registers are all $m+1$ bits wide.
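The recursion behind this architecture can be illustrated in software. The following is a minimal sketch, assuming plain Python integers in place of the registers; the function names `dim1`, `dim1_add`, and `dim1_mul_serial` are hypothetical helpers introduced here for illustration and do not appear in the thesis or in Benaissa's paper. One addend is accumulated per multiplier bit, the diminished–1 zero $2^m$ is used whenever the bit is 0, and the accumulator starts at the diminished–1 zero.

```python
def dim1(x, m):
    """Diminished-1 code of x in Z_(2^m+1): T(x) = x - 1 (mod 2^m + 1)."""
    return (x - 1) % ((1 << m) + 1)


def dim1_add(u, v, m):
    """Diminished-1 addition: T(x + y) = T(x) + T(y) + 1 (mod 2^m + 1)."""
    return (u + v + 1) % ((1 << m) + 1)


def dim1_mul_serial(a_dim, b_dim, m):
    """Shift-and-add product T(ab) for nonzero operands given as T(a), T(b)."""
    p = (1 << m) + 1
    zero = 1 << m                      # diminished-1 zero, T(0) = 2^m
    b = (b_dim + 1) % p                # code translation of the multiplier to NBC
    addend = a_dim                     # holds T(2^i * a)
    acc = zero                         # P(0) = T(0)
    for i in range(m + 1):             # one step per bit of the (m+1)-bit multiplier
        bit = (b >> i) & 1
        acc = dim1_add(acc, addend if bit else zero, m)
        addend = (2 * addend + 1) % p  # T(2x) = 2*T(x) + 1 (mod 2^m + 1)
    return acc
```

Adding the diminished–1 zero leaves the accumulator unchanged, which is why the unselected addends must be $2^m$ rather than the all-zero word. The sketch models only the arithmetic, not the timing or the register structure of the circuit.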
During an initial clock cycle,
 
  is loaded into register R
  and the translated
integer
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 is loaded into register R
 .A l s o ,t h e
 
m
 
 
  -
bit integer
 
m is loaded into R
 . After the subsequent clock cycles, R
  con-
tains
T
 
 
 
 ,
T
 
 
 
 
 ,
T
 
 
 
 
 , etc. We assume that the output
T
 
 
 
 
  is directly
stored in an $(m+1)$-bit parallel register.

Figure 6.22: A block diagram of Benaissa's diminished–1 serial/parallel multiplier (essentially from Benaissa et al. [12, Fig. 1]).

The CP is the dotted path in Figure 6.22, from the output of the shift register through an AND gate and the carry look-ahead adder to the input of the parallel register. The length $L_{\text{CP,Benaissa,s/p-mult}}$ of this path equals
L
C
P
 
B
e
n
a
i
s
s
a
 
s
 
p
 
m
u
l
t
 
L
r
e
g
 
r
r
e
g
 
 
m
 
 
 
f
A
N
D
 
L
A
N
D
 
r
A
N
D
f
d
i
m
a
d
d
 
 
 
L
d
i
m
a
d
d
 
 
 
r
d
i
m
a
d
d
 
 
 
 
f
r
e
g
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
 
The initial clock cycle is followed by $m+1$ clock cycles, during which the partial products are computed and recursively added together. Hence, the total computation time $T$ is proportional to
L
B
e
n
a
i
s
s
a
 
s
 
p
 
m
u
l
t
 
 
m
 
 
 
L
C
P
 
B
e
n
a
i
s
s
a
 
s
 
p
 
m
u
l
t
 
 
 
m
 
 
 
m
l
o
g
 
m
 
 
 
 
m
 
 
 
l
o
g
 
m
 
 
 
 
 
The desired product $P(m+1) = T(ab)$ is shifted into an output parallel register during the final clock cycle. In order to make a fair comparison with the parallel diminished–1 multipliers described above, we again assume that the diminished–1 adder in Figure 6.22 is the carry look-ahead adder of Figure 6.7. Then, the chip area $A$ occupied by the multiplier in Figure 6.22 is proportional to its size
C
B
e
n
a
i
s
s
a
 
s
 
p
 
m
u
l
t
 
 
 
m
 
 
 
C
r
e
g
 
C
D
i
m
 
N
B
C
 
m
C
A
N
D
 
 
C
N
A
N
D
 
N
O
R
 
C
i
n
v
 
C
d
i
m
a
d
d
 
 
 
 
 
 
m
 
 
 
 
Hence, the area-time performance $AT^2$ of the multiplier is proportional to
$$CL^2_{\text{Benaissa,s/p-mult}} = C_{\text{Benaissa,s/p-mult}}\,L^2_{\text{Benaissa,s/p-mult}} = O(m^5).$$
If the multiplier in the multiplication operation is available as an NBC number $b$, the translation circuit in Figure 6.22 can be excluded. This reduces the total size of the multiplier architecture, but the computation time is not changed.
In their paper, Benaissa et al. [12, Ch. 3.2] also describe a procedure for diminished–1 multiplication which is a slight modification of the above procedure. The procedure is based on (6.38), but it uses $\tilde b$ as multiplier instead of $b$. This eliminates the need for the code translation from $\tilde b$ to $b \equiv \tilde b + 1 \pmod{2^m+1}$.
For nonzero
  we can write
 
 
m
 
 
m
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
. Benaissa et al. state that $b$ can be replaced by $\tilde b$ in (6.38) by letting the least significant bit of the multiplier take on the value $\tilde b_0 + 1$. Actually, this can easily be formulated analytically by expanding (6.38) in the following way:
T
 
 
 
 
 
 
m
X
i
 
 
 
 
 
i
T
 
 
i
 
 
 
 
i
 
m
 
 
m
 
 
X
i
 
 
 
 
 
 
i
T
 
 
i
 
 
 
 
 
i
 
m
 
 
 
 
 
 
 
 
 
T
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
m
 
 
X
i
 
 
 
 
 
 
i
T
 
 
i
 
 
 
 
 
i
 
m
 
 
T
 
 
 
 
 
 
m
 
 
X
i
 
 
 
 
 
 
i
T
 
 
i
 
 
 
 
 
i
 
m
 
 
T
 
 
 
 
m
o
d
 
m
 
 
 
  (6.39)
which can be expressed on the recursive form
P
 
i
 
 
 
 
P
 
i
 
 
 
 
 
i
T
 
 
i
 
 
 
 
 
i
 
m
 
 
m
o
d
 
m
 
 
 
  for
 
 
i
 
m
 
 
 
where
P
 
 
 
 
T
 
 
 
 
 
 . We thus have
P
 
m
 
 
T
 
 
 
 
. Figure 6.23 shows the modified multiplier by Benaissa et al. It is a modified version of the multiplier in Figure 6.22. During an initial clock cycle,
 
  is loaded into register R
  and
 
  is loaded into both register R
  and R
 . After the subsequent
m clock cycles,
register R
  will contain the product
P
 
m
 
 
T
 
 
 
 
. An additional clock pulse shifts the product to an output register. If
 
 
 (
 
 
m
 
  )o r
 
 
 (
 
 
m
 
  ),
the output controller sets the correct output
T
 
 
 
 
 
 
 
m (see page 131)
 
 .
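The modified procedure can be sketched in the same way. This is a minimal illustration under stated assumptions, not the circuit itself: the function name `dim1_mul_modified` is hypothetical, the operands are plain Python integers holding the diminished–1 codes, and the zero cases are resolved up front rather than by a separate output controller. It relies only on the identity $ab = a\tilde b + a$, so that the accumulator can start at $T(a)$ and the diminished–1 multiplier bits can be used directly.

```python
def dim1_mul_modified(a_dim, b_dim, m):
    """T(ab) from T(a) and T(b), using the bits of T(b) directly as multiplier."""
    p = (1 << m) + 1
    zero = 1 << m                          # diminished-1 zero, T(0) = 2^m
    if a_dim == zero or b_dim == zero:
        return zero                        # a = 0 or b = 0 gives T(ab) = T(0)
    acc = a_dim                            # P(0) = T(a) accounts for the "+ a" term
    addend = a_dim                         # holds T(2^i * a)
    for i in range(m):                     # m steps: T(b) < 2^m has m significant bits
        if (b_dim >> i) & 1:
            acc = (acc + addend + 1) % p   # diminished-1 addition
        # a zero bit would add the diminished-1 zero, which leaves acc unchanged
        addend = (2 * addend + 1) % p      # T(2x) = 2*T(x) + 1 (mod 2^m + 1)
    return acc
```

Compared with the unmodified recursion, no translation of the multiplier to NBC form is needed, at the price of initializing the accumulator with the multiplicand.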
The chip area $A$ occupied by Benaissa's modified multiplier is proportional to its size
C
B
e
n
a
i
s
s
a
 
s
 
p
 
m
u
l
t
 
 
 
 
 
m
 
 
 
C
r
e
g
 
m
C
A
N
D
 
 
C
N
A
N
D
 
N
O
R
 
 
C
O
R
 
C
i
n
v
 
C
d
i
m
a
d
d
 
 
 
C
o
u
t
 
c
t
r
l
 
 
 
 
m
 
 
 
 
The CP of the multiplier is marked by the dotted line in Figure 6.23. It only
slightly differs from the CP of the multiplier in Figure 6.22. The length of the
CP equals
L
C
P
 
B
e
n
a
i
s
s
a
 
s
 
p
 
m
u
l
t
 
 
 
L
C
P
 
B
e
n
a
i
s
s
a
 
s
 
p
 
m
u
l
t
 
r
d
i
m
a
d
d
 
 
f
r
e
g
 
 
 
m
 
 
l
o
g
 
m
 
 
 
 
 
 
Footnote: Benaissa et al. [12, Fig. 3] use a row of AND gates instead of a row of inverters and NOR gates.

Figure 6.23: A block diagram of Benaissa's [12, Fig. 3] modified diminished–1 serial/parallel multiplier.
As described above, the modified multiplier needs $m+2$ clock cycles to compute a product, i.e. the same number of clock cycles as was required for the non-modified multiplier. Hence, the computation time $T$ is proportional to
L
B
e
n
a
i
s
s
a
 
s
 
p
 
m
u
l
t
 
 
 
 
m
 
 
 
L
C
P
 
B
e
n
a
i
s
s
a
 
s
 
p
 
m
u
l
t
 
 
 
 
 
m
 
 
 
m
l
o
g
 
m
 
 
 
 
m
 
 
 
l
o
g
 
m
 
 
 
 
and the product $AT^2$ is proportional to
$$CL^2_{\text{Benaissa,s/p-mult,2}} = C_{\text{Benaissa,s/p-mult,2}}\,L^2_{\text{Benaissa,s/p-mult,2}} = O(m^5).$$
Shyu’s Serial/Parallel Multiplier
The final serial/parallel diminished–1 multiplier to be considered here is the one suggested by Shyu et al. [92]. Theorem 1 in their paper says that diminished–1 multiplication can be calculated in $Z_{2^m+1}$ as follows:
 
 
T
 
 
 
 
 
 
 
m
X
i
 
 
 
T
 
 
 
i
 
i
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
  (6.40)
This congruence can be written on the recursive form
P
 
i
 
 
 
 
P
 
i
 
 
T
 
 
 
i
 
i
 
 
 
m
o
d
 
m
 
 
 
  for
 
 
i
 
m
 
where
P
 
 
 
 
 
  and for which we have
P
 
m
 
 
 
 
T
 
 
 
 
 .
Note that this equation can also be derived from the more general expression in (6.11) by letting $k = -l$, which is the case for the diminished–1 representation ($k = 1$, $l = -1$). Because (6.39) and (6.40) are quite similar, the two associated architectures have about the same structure. Figure 6.24 shows a modified version of Shyu's [92, Fig. 1] multiplier. It is based on the serial/parallel multiplier proposed by Chang et al. [32]. In Figure 6.24, we have exchanged most of Shyu's nMOS pass transistors for transmission gates. We have also modified their multiplier such that the output product is correct also when either of the diminished–1 operands (or both) is equal to $2^m$. Such a situation is handled inter alia by the “output controller” circuit.
The multiplication algorithm works as follows: During an initial clock pulse,
the
 
m
 
 
  -bit registers A and D are loaded with
 
  and
 
 
 
m
 
 
 , respectively,
 
Footnote: Here, we use our notation.

Figure 6.24: A modified version of the multiplier proposed by Shyu et al. [92, Fig. 1]. The two dotted paths form the CP during one clock cycle.
and the
 
m
 
 
 -bitregistersB and C are loaded with
 
m
 
 
 
 
  and
 
m
 
 
 
, respectively. Also, the single D flip-flop is loaded with
 
 
m. After the subsequent
m
 
 clock cycles, the product
T
 
 
 
 
  has been shifted through the output
controller and into an
 
m
 
 
  -bit parallel register. This output register is not
shown in Figure 6.24. The multiplication process is described more in detail
by Shyu et al. [92] and, to some extent, by Chang et al. [32].
The size of the multiplier in Figure 6.24 equals
C
S
h
y
u
 
m
u
l
t
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
 
 
C
r
e
g
 
m
C
F
A
 
C
N
A
N
D
 
N
O
R
 
 
C
O
R
 
 
C
i
n
v
 
C
o
u
t
 
c
t
r
l
 
 
 
m
 
 
 
C
T
G
 
m
 
 
 
 
m
 
 
 
 
The CP is formed by the two dotted paths in the figure. The CP length equals
L
C
P
 
S
h
y
u
 
m
u
l
t
 
L
r
e
g
 
r
r
e
g
 
f
i
n
v
 
m
f
T
G
 
 
r
i
n
v
 
m
 
f
T
G
 
 
 
 
 
r
r
e
g
 
 
 
f
F
A
 
s
i
g
n
a
l
 
m
L
F
A
 
c
a
r
r
y
 
 
m
 
 
 
 
L
F
A
 
c
a
r
r
y
 
r
F
A
f
F
A
 
c
a
r
r
y
 
 
L
F
A
 
s
u
m
 
 
r
F
A
 
 
 
f
r
e
g
 
 
 
m
 
 
 
Because the desired diminished–1 product is shifted into the output register
during the last of a total of
m
 
 clock cycles, the computation time
T is pro-
portional to
L
S
h
y
u
 
m
u
l
t
 
 
m
 
 
 
L
C
P
 
S
h
y
u
 
m
u
l
t
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
Hence, the area-time performance $AT^2$ is proportional to
$$CL^2_{\text{Shyu,mult}} = C_{\text{Shyu,mult}}\,L^2_{\text{Shyu,mult}} = O(m^5).$$
It is possible to obtain yet another algorithm for diminished–1 serial/parallel multiplication, based on the general expression in (6.12). For $k = 1$, (6.12) changes to
T
 
 
 
 
 
 
m
X
i
 
 
 
T
 
 
i
 
i
 
 
 
m
o
d
 
m
 
 
 
  (6.41)
This formula is also derived by Shyu et al. [92, Theorem 2]. An architecture for multiplication based on (6.41) is suitably used when the multiplier ($b$) is represented in NBC form. The architecture in Figure 6.24 may be modified to be based on (6.41). However, such an architecture is not considered here.
6.3.6.3 Comparisons
In Table 6.4 we have listed the sizes and the total CP lengths of the diminished–1 general multipliers presented in the thesis. It is clear that the multiplier proposed by Ashur et al. [10] has the smallest size and CP length among the bit-parallel architectures. It is also clear that the multiplier proposed by Shyu et al. [92] has the smallest size and CP length among the bit-serial/parallel architectures.
The sizes, total CP lengths, and $AT^2$ performances of Ashur's and Shyu's multipliers are plotted versus $m$, for
m
 
 
t;
 
 
t
 
, in Figure 6.25.
Table 6.4: The sizes and total CP lengths of the diminished–1 multiplier architectures. The product $CL^2$ is $O(m^4)$ for the bit-parallel architectures and $O(m^5)$ for the bit-serial/parallel architectures.

Multiplier type       Subscript name          Figure   Size C     Total CP length L
Bit-parallel          Sunder, mult            6.19     O(m^2)     O(m)
                      Ashur, mult             6.20     O(m^2)     O(m)
                      Benaissa, p-mult        6.21     O(m^2)     O(m)
Bit-serial/parallel   Benaissa, s/p-mult      6.22     O(m)       O(m^2)
                      Benaissa, s/p-mult,2    6.23     O(m)       O(m^2)
                      Shyu, mult              6.24     O(m)       O(m^2)
Figure 6.25: The sizes, total CP lengths, and $AT^2$ performances of Ashur's and Shyu's diminished–1 multipliers, see Figures 6.20 and 6.24, respectively. (Panels: area complexity, size $C$; time complexity, CP length $L$; area-time performance, $CL^2$; each with one curve for Ashur's and one for Shyu's multiplier.) The parameters are plotted versus
m
 
 
t for
 
 
t
 
 .
From the figure we conclude that, for all $m$, the size of Ashur's multiplier is greater than the size of Shyu's multiplier. On the other hand, with respect both to their time performance and their $AT^2$ performance, Ashur's multiplier is preferable to Shyu's multiplier.
All in all, we conclude that the sizes, the total CP lengths, and the $AT^2$ performances of the bit-parallel multipliers are $O(m^2)$, $O(m)$, and $O(m^4)$, respectively, while the corresponding parameters of the bit-serial multipliers are $O(m)$, $O(m^2)$, and $O(m^5)$, respectively. The choice of architecture for general multiplication in $Z_{2^m+1}$ is further discussed in Section 8.1.5.
6.3.7 Exponentiation of the Transform Kernel
For the diminished–1 element representation, the linear coordinate transformation parameters $k$ and $l$ in (6.1) are equal to $1$ and $-1$, respectively. Hence, by (6.14) we get
$$T(\omega^n) = (T(\omega) + 1)^n - 1 \equiv \omega^n - 1 \pmod{2^m+1},$$
which is also directly obtained from the general NBC-to-diminished–1 formula $T(a) \equiv a - 1 \pmod{2^m+1}$. It seems as if the diminished–1 representation does not provide a procedure for performing exponentiation in $Z_{2^m+1}$ which is computationally simpler than the procedures for performing exponentiation with respect to the normal binary coded element representation. Therefore, for the computation of $\omega^n \bmod (2^m+1)$ we refer to the exponentiation procedures described in Section 5.1.6 of the previous chapter. When the modulus $2^m+1$ is prime, there are some properties of the prime field $Z_{2^m+1}$ which can be applied such that exponentiation modulo $2^m+1$ can be performed in a simplified way. This is further discussed in Section 7.2.1.
6.4 Summary
The complexity and performance parameters of the architectures considered in this chapter are listed in Table 6.5. Regarding the parameters for the architectures for general diminished–1 multiplication, we refer to Table 6.4.
Table 6.5: Complexity parameters of the architectures in the present chapter.

Operation                    Figure   Subscript name
NBC to dim.–1 transl.        6.2      NBC2Dim
Dim.–1 to NBC transl.        6.4      Dim2NBC
Negation                     6.6      dimneg
Addition (carry l.-a.)       6.7      dimadd,1
Addition (carry-r.)          6.9      dimadd,2
Multiplication by 2          6.12     dimmult2
Multiplication by $2^n$      6.15     seq,mult2n
General multiplication       See Table 6.4

Columns of the table: Operation, Figure, Subscript name, Size $C$, Fan-in $f$, Norm. output res. $r_o$, Internal CP length $L_{CP}$, Total CP length $L$ (including registers), Area-time perf. $CL^2$.

Chapter 7
The Polar Representation
In Chapter 6, we considered the diminished–1 representation of the elements in Fermat integer quotient rings $Z_{2^m+1}$. The code translation $T(a) \equiv a - 1 \pmod{2^m+1}$, where $a$ is an NBC integer of $Z_{2^m+1}$ and $T(a)$ is the diminished–1 representation of $a$, belongs to the set of linear coordinate transformations given by (6.1).
There exist many forms of nonlinear coordinate transformations, i.e. mappings $P$ from the NBC representation of integers $a \in Z_{2^m+1}$ to $P(a)$, where $P$ is a nonlinear function of $a$. In this chapter we investigate the properties of one such form of representation, namely the polar representation. A restriction of the polar representation, however, is that it is only applicable in finite fields.
7.1 Introduction
For $m = 1, 2, 4, 8, 16$ the Fermat number $2^m+1$ is prime and hence the integer quotient ring $Z_{2^m+1}$ is a field. Let $\alpha$ be a primitive element of a prime field $Z_p$, i.e. an element of $Z_p$ with maximum order $p-1$. It is well known [60, Th. 1.15, Th. 2.8] that the multiplicative group of nonzero elements of $Z_p$ is the cyclic group formed by the $p-1$ distinct powers of $\alpha$. (Actually, for any finite field, its multiplicative group can be formed by the powers of a primitive element.) Let the symbol $-\infty$ be defined by the equation $\alpha^{-\infty} = 0$. Then, any integer of $Z_p$ can be expressed as some power of $\alpha$.

Footnote: In the literature, the polar representation is sometimes referred to as the index representation, see for example Niederreiter [60, Ch. 10.1] and Rosen [84, Ch. 8.4].
Z
 
m
 
 ;
m
 
 
 
 
 
 
 
 
 
 
 .I nt h e
polar representation of
Z
 
m
 
 , each element (integer)
 
 
Z
 
m
 
  is represented by
its associated power
P
 
 
  of a primitive element
  of the ﬁeld.
Accordingly, we have
 
 
 
P
 
 
 
 
m
o
d
 
m
 
 
 
  (7.1)
where
 
P
 
 
 
 
Z
 
m
 
 
 
 
 
P
 
 
 
 
 
.
An element in the polar representation is referred to as a polar element.
In the diminished–1 representation, the zero element is represented by the integer $T(0) = 2^m$, which we called the zero indicator, see Section 6.2. We suitably use the integer $2^m$ as a zero indicator also in the polar representation, i.e. we have $P(0) = 2^m$. Similar to the diminished–1 representation, by letting all integers $P(a)$ be $(m+1)$-bit normal binary coded integers, the zero representative $P(0)$ is the only integer $P(a)$ for which the most significant bit equals one. Situations where one of the operands in an arithmetic operation is the zero element are handled separately. For nonzero integers $a$ we have $P(a) \in Z_{2^m}$ and the order of the primitive element $\alpha$ modulo $2^m+1$ equals $2^m$. Consequently, for nonzero integers $a$ we can use an $m$-bit binary arithmetic modulo $2^m$ for the associated exponents $P(a)$ of $\alpha$.
The general properties of arithmetic operations in finite fields, with respect to the polar representation, are well known. However, the particular properties of arithmetic operations in Fermat prime fields, with respect to the polar representation, have not been studied before. An investigation of such properties is carried out in this chapter. Henceforth, we generally refer to $Z_{2^m+1}$ as a Fermat prime field.
7.2 Arithmetic Operations
Occasionally, we have denoted diminished–1 elements $T(a)$ by $\tilde a$. In the polar representation we conveniently use the same kind of notation, i.e. the $(m+1)$-bit polar integer $P(a)$ is denoted by the normal binary coded integer $\tilde a = (\tilde a_m \tilde a_{m-1} \tilde a_{m-2} \cdots \tilde a_1 \tilde a_0)$.
In the present section we describe the arithmetic operations involved in the
computation of the Fermat number transform with respect to the polar repre-
sentation. Later, in Section 7.6, we also consider VLSI architectures for some
of these arithmetic operations.
7.2.1 Discrete Exponentiation
The code translation from a polar number $P(a)$ to its corresponding normal binary coded number $a$ is carried out using a discrete exponentiation modulo $2^m+1$, as given by (7.1). In Section 5.1.6 we considered some procedures for general exponentiation modulo $2^m+1$. The integer $a \equiv \alpha^{P(a)} \pmod{2^m+1}$ may be computed using any of those procedures. For example, by using the well known binary method, which is briefly described in Section 5.1.6, exponentiation can be performed using $m-1$ squarings and at most $m-1$ multiplications modulo $2^m+1$. By performing a squaring as a general multiplication, at most $2(m-1)$ general multiplications modulo $2^m+1$ of normal binary coded numbers are required to compute $a$ from $P(a)$ in (7.1).
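The operation count of the binary method can be checked with a short sketch. The code below is an illustration only, assuming a left-to-right variant of the binary method; the function name `power_binary` and the returned counters are hypothetical and are not taken from Section 5.1.6.

```python
def power_binary(alpha, e, m):
    """Left-to-right binary method modulo 2^m + 1 for an m-bit exponent e.

    Returns (alpha**e mod 2^m+1, number of squarings, number of multiplications).
    """
    p = (1 << m) + 1
    result = 1
    squarings = multiplications = 0
    started = False
    for i in reversed(range(m)):          # scan the exponent bits, msb first
        bit = (e >> i) & 1
        if started:
            result = result * result % p  # one squaring per remaining bit
            squarings += 1
            if bit:
                result = result * alpha % p
                multiplications += 1
        elif bit:
            result = alpha % p            # leading one: load alpha, no operation
            started = True
    return result, squarings, multiplications
```

For an exponent whose most significant bit is set, the loop performs exactly $m-1$ squarings and between $0$ and $m-1$ multiplications, which is the bound quoted above.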
In Section 7.4 we consider a new procedure [5] for discrete exponentiation in
Fermat prime ﬁelds using some properties of Zech’s logarithms.
7.2.2 The Discrete Logarithm
By taking the $\alpha$-logarithm of both sides of (7.1) we get the congruence
$$P(a) \equiv \log_\alpha a \pmod{2^m} \tag{7.2}$$
which is called the discrete logarithm to the base $\alpha$ modulo $2^m$. The problem of computing (7.2) is generally known as the discrete logarithm problem. In general, it is quite hard to compute the discrete logarithm in a large prime field $Z_p$. Several algorithms suggested in the literature require $O(\sqrt{p})$ multiplications to compute the logarithm.
The Pohlig-Hellman Algorithm
In 1978, Pohlig and Hellman [72] presented an algorithm for computing the discrete logarithm in $Z_p$ which, when $p-1$ has only small prime factors, requires only $O(\log^2 p)$ multiplications modulo $p$. In particular, for Fermat primes $p = 2^m+1$ their algorithm computes (7.2) by recursively determining the binary digits $\tilde a_i$ of $\tilde a = \tilde a_{m-1}2^{m-1} + \tilde a_{m-2}2^{m-2} + \cdots + \tilde a_1 2 + \tilde a_0$ such that $a \equiv \alpha^{\tilde a} \pmod{2^m+1}$ holds. The algorithm, which is based on the fact that the order of the primitive element $\alpha$ modulo $2^m+1$ equals $2^m$, works as follows (see [72, Sec. III]):
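The least significant bit $\tilde a_0$ of $\tilde a$ is determined by raising the nonzero integer $a$ to the $2^{m-1}$th power and identifying whether the result equals $1$ or $-1$. Let $a_0 = a$ and $\gamma_0 \equiv \alpha^{-1} \pmod{2^m+1}$. Then we have
$$a_0^{2^{m-1}} \equiv \alpha^{\tilde a\,2^{m-1}} \equiv \bigl(\alpha^{2^{m-1}}\bigr)^{\tilde a} \equiv (-1)^{\tilde a_0} \equiv \begin{cases} 1 \pmod{2^m+1} & \text{if } \tilde a_0 = 0 \\ -1 \pmod{2^m+1} & \text{if } \tilde a_0 = 1 \end{cases} \tag{7.3}$$
Only $m-1$ squarings are required to compute $a_0^{2^{m-1}} \bmod (2^m+1)$. The digit $\tilde a_0$ is set to either 0 or 1, depending on whether $a_0^{2^{m-1}} \bmod (2^m+1)$ is evaluated to $1$ or $-1$, respectively. Now, let $a_1 \equiv a_0\gamma_0^{\tilde a_0} \equiv \alpha^{\tilde a - \tilde a_0} \pmod{2^m+1}$. The digit $\tilde a_1$ can be determined in the same way as above from the congruence
$$a_1^{2^{m-2}} \equiv \bigl(\alpha^{2^{m-1}}\bigr)^{(\tilde a - \tilde a_0)/2} \equiv (-1)^{\tilde a_1} \equiv \begin{cases} 1 \pmod{2^m+1} & \text{if } \tilde a_1 = 0 \\ -1 \pmod{2^m+1} & \text{if } \tilde a_1 = 1 \end{cases} \tag{7.4}$$
which can be computed using $m-2$ squarings. Next, compute $\gamma_1 \equiv \gamma_0^2 \pmod{2^m+1}$ and $a_2 \equiv a_1\gamma_1^{\tilde a_1} \pmod{2^m+1}$ and determine $\tilde a_2$ from $a_2^{2^{m-3}} \bmod (2^m+1)$, etc., until the most significant bit $\tilde a_{m-1}$ has been determined.
In order to determine the digit $\tilde a_i$ $(1 \le i \le m-1)$, one squaring is required to compute $\gamma_{i-1} \equiv \gamma_{i-2}^2 \pmod{2^m+1}$, one multiplication is required to compute the product $a_i \equiv a_{i-1}\gamma_{i-1}^{\tilde a_{i-1}} \pmod{2^m+1}$, and $m-i-1$ squarings are required to compute $a_i^{2^{m-i-1}} \bmod (2^m+1)$. Hence, if $\alpha^{-1}$ is precomputed and squarings are performed as multiplications, the algorithm requires approximately $2m + \sum_{i=1}^{m-1} i = m(m+3)/2$ general multiplications modulo $2^m+1$. Assuming that each multiplication can be carried out using at most $m$ additions, the Pohlig-Hellman algorithm requires at most $m^2(m+3)/2$ additions modulo $2^m+1$.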
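The bit-by-bit procedure above can be written compactly in software. The following is a minimal sketch, not a transcription of [72]: the function name `dlog_fermat` is hypothetical, the default base 3 is an assumption (3 is a primitive element of every Fermat prime larger than 3), and Python's built-in `pow` stands in for the repeated squarings.

```python
def dlog_fermat(a, m, alpha=3):
    """Discrete logarithm of nonzero a to the base alpha in Z_(2^m+1), 2^m+1 prime.

    Determines the m bits of the logarithm from the least significant bit upwards.
    """
    p = (1 << m) + 1
    gamma = pow(alpha, p - 2, p)           # alpha^(-1), by Fermat's little theorem
    z = a % p                              # current reduced element a_i
    log = 0
    for i in range(m):
        # a_i^(2^(m-1-i)) is 1 if bit i of the logarithm is 0, and -1 if it is 1
        t = pow(z, 1 << (m - 1 - i), p)
        if t == p - 1:
            log |= 1 << i
            z = z * gamma % p              # clear bit i from the exponent of z
        gamma = gamma * gamma % p          # gamma_i = alpha^(-2^i), squared each step
    return log
```

The reduced element `z` always has a logarithm whose lowest `i` bits are zero, which is what makes the test against $-1$ sufficient at each step.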
New Algorithms
In 1993, we [5] presented a new algorithm for computing the discrete loga-
rithm in Fermat prime ﬁelds
Z
 
m
 
 . The algorithm, which is based on some
properties of Zech’s logarithms in Fermat prime ﬁelds, requires at most
 
m
 
 
m
additions modulo
 
m
 
  .
Recently, we [7] proposed another algorithm for computing the discrete logarithm, which in turn is based on the algorithm in [5]. By using a look-up table
of size
 
m
 
 
m
 
m bits, we show how to compute the logarithm using at most
 
m
 
  binary shifts (rotations), one table look-up, and one addition and one
simpliﬁed multiplication modulo
 
m.
Our two algorithms are thoroughly described in Sections 7.4 and 7.5, respec-
tively.
7.2.3 Modulus Reduction
Modulus reduction in the polar representation is a very simple operation.
When
  is nonzero, the least positive residue of
 
  modulo
 
m equals
 
 
 
m
 
 
 ,
which is instantaneously obtained from
 
 .W h e n
  is congruent to zero mod-
ulo
 
m
 
 we have
 
 
 
 
m.
B e c a u s ew eu s ea n
 
m
 
 
  -bit normal binary coded representation of the ex-
ponents
 
  of
 , there are only two cases that have to be considered:
Exponent Reduced exponent (mod
 
m)
 
 
 
 
 
m
 
 
P
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
P
 
 
 
 
 
 
 
m
7.2.4 Negation
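Like modulus reduction, negation is also simply carried out in the polar representation. Because the order of the primitive element $\alpha$ modulo $2^m+1$ equals $2^m$ we have $\alpha^{2^{m-1}} \equiv -1 \pmod{2^m+1}$. For $a \equiv \alpha^{P(a)}$ and $-a \equiv \alpha^{P(-a)}$ we therefore get
$$-a \equiv \alpha^{2^{m-1}}\alpha^{\tilde a} \equiv \alpha^{\tilde a + 2^{m-1}} \pmod{2^m+1}.$$
Hence, because in the polar representation all arithmetic operations are carried out in the exponent of $\alpha$, the polar element $P(-a)$ is obtained from the congruence
$$P(-a) \equiv \tilde a + 2^{m-1} \pmod{2^m} \tag{7.5}$$
which, for $0 \le \tilde a \le 2^m - 1$, we expand as
$$P(-a) \equiv (\tilde a_{m-1} + 1)\,2^{m-1} + \sum_{i=0}^{m-2}\tilde a_i 2^i \equiv \bar{\tilde a}_{m-1}2^{m-1} + \sum_{i=0}^{m-2}\tilde a_i 2^i \pmod{2^m} \tag{7.6}$$
For $\tilde a = 2^m$, i.e. for $a = 0$, we let $P(-a) = 2^m$. In Section 7.6.3 we consider a VLSI architecture for negation based on (7.6).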
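In software, polar negation amounts to complementing a single bit of the exponent. The sketch below is an illustration under stated assumptions: the function name `polar_negate` is hypothetical, exponents are plain Python integers, and the zero indicator $2^m$ is passed through unchanged.

```python
def polar_negate(e, m):
    """Polar code of -a, given the polar code e of a in Z_(2^m+1)."""
    zero = 1 << m                  # zero indicator P(0) = 2^m
    if e == zero:
        return zero                # -0 = 0
    # adding 2^(m-1) modulo 2^m complements bit m-1 and leaves the other bits alone
    return e ^ (1 << (m - 1))
```

For example, with $m = 4$ and the primitive element 3 of $Z_{17}$, the exponent 0 (the element 1) is mapped to the exponent 8, and $3^8 \equiv -1 \pmod{17}$.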
7.2.5 Addition and Subtraction
Addition
When considering addition in the polar representation we need the following
deﬁnition (see for example Conway [34, Ch. 6]).
Deﬁnition 7.2 Zech’s logarithm
  of the polarelement
 
  is denotedby
Z
 
 
 
 and de-
ﬁned by the congruence
 
 
 
 
 
 
 
Z
 
 
 
 
 
m
o
d
 
m
 
 
 
 
For nonzero
 
 
 
 
Z
 
m
 
 ,l e t
 
 
 
P
 
 
  and
 
 
 
P
 
 
 . The function evaluated
whenperformingaddition inthepolarrepresentationisfoundintheexponent
of
  in the congruence
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Z
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
 
 
 
m
o
d
 
m
 
 
 
  (7.7)
i.e. we have
P
 
 
 
 
 
 
 
 
 
 
 
 
Z
 
 
 
 
 
 
 
 
m
o
d
 
m
 
  (7.8)
where
Z is Zech’s logarithm. Using the congruence
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
m
o
d
 
m
 ,w h e r e
 
 
 
m
 
 
  is the one’s complement of
 
 
 
m
 
 
 ,w e
rewrite (7.8) as
 
 
 
 
 
 
m
 
 
 
 
Z
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
o
d
 
m
 
  (7.9)
Hence, according to (7.9), addition in the polar representation may be carried
out using two additions and one discrete logarithm modulo
 
m.T h e d i r e c t
computation of
Z
 
 
 
 , as expressed by the congruence
Z
 
 
 
 
 
l
o
g
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
  (7.10)
requiresonediscrete exponentiation andone addition modulo
 
m
 
 followed
by one discrete logarithm modulo
 
m, which makes it quite an intricate func-
tion. Some researchers, like for instance Conway [34], Imamura [53], and Hu-
ber [51] have considered different methods of computing Zech’s logarithms
 Zech’s logarithm is also referred to as Jacobi’s logarithm [60, Exc. 2.8].7.2. Arithmetic Operations 161
in
G
F
 
p
n
  in a simpliﬁed way. In particular, the researchers consider ﬁelds of
characteristic
p
 
  . In order to speed up the computation of
Z
 
 
 
 , the men-
tioned methods all involve the use of look-up tables.
Remark: The particular properties of Zech's logarithms in Fermat prime fields $Z_{2^m+1}$ are investigated in Section 7.3. The main purpose of the investigation is to find an area-time efficient way of computing Zech's logarithms in $Z_{2^m+1}$.
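The case when either of the addends $\alpha$ and $\beta$ (or both) equals zero is handled separately. For the sum $\alpha + \beta \bmod (2^m+1)$, where $\alpha = 0$ or $\beta = 0$, we can simply do the following:
If $\alpha = 0$ ($P(\alpha) = 2^m$) then let $\alpha + \beta = \beta$, i.e. let $P(\alpha + \beta) = P(\beta)$.
If $\beta = 0$ ($P(\beta) = 2^m$) then let $\alpha + \beta = \alpha$, i.e. let $P(\alpha + \beta) = P(\alpha)$.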
Subtraction
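The polar integer $P(\alpha - \beta)$, for which $\alpha$ and $\beta$ are nonzero integers of $Z_{2^m+1}$, can be derived by replacing $\beta$ with $-\beta$ in (7.7). Then, by (7.8) we get
\[ P(\alpha - \beta) \equiv \hat\alpha + Z\left(P(-\beta) - \hat\alpha\right) \pmod{2^m} \tag{7.11} \]
where $\hat\alpha = P(\alpha)$.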
Consequently, subtraction in the polar representation can be carried out in a
conventional way as a (polar) negation followed by a (polar) addition.
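The addition and subtraction rules above can be tried out on a small field. The following is a minimal sketch, assuming $m = 4$ (the prime 17), the primitive element $\gamma = 3$, and a full look-up table for Zech's logarithm; the names `polar_add`, `polar_neg`, `zech` and the constant `ZERO` are illustrative and do not come from the thesis.

```python
# Toy model of polar arithmetic in Z_17 (m = 4); all names are illustrative.
M = 4
Q = 2**M            # polar integers are reduced modulo 2^m
P_MOD = Q + 1       # the Fermat prime 2^m + 1
GAMMA = 3           # assumed primitive element
ZERO = Q            # polar code reserved for the field element 0

EXP = [pow(GAMMA, e, P_MOD) for e in range(Q)]   # polar integer -> field element
LOG = {x: e for e, x in enumerate(EXP)}          # field element -> polar integer
LOG[0] = ZERO


def decode(a):
    """Field element represented by the polar integer a."""
    return 0 if a == ZERO else EXP[a]


def zech(e):
    """Zech's logarithm by table look-up: polar code of gamma^e + 1."""
    return LOG[(decode(e) + 1) % P_MOD]


def polar_neg(a):
    """Negation: add 2^(m-1) to the exponent, since gamma^(2^(m-1)) = -1."""
    return a if a == ZERO else (a + Q // 2) % Q


def polar_add(a, b):
    """Addition following (7.8), with zero addends handled separately."""
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    z = zech((b - a) % Q)
    return ZERO if z == ZERO else (a + z) % Q


def polar_sub(a, b):
    """Subtraction as a polar negation followed by a polar addition."""
    return polar_add(a, polar_neg(b))
```

The table-based `zech` stands in for whatever area-time efficient evaluation is eventually chosen; only the surrounding control flow is meant to mirror the text.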
7.2.6 General Multiplication
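For nonzero $\alpha$ and $\beta$, with $\hat\alpha = P(\alpha)$ and $\hat\beta = P(\beta)$, the product $\alpha\beta \bmod (2^m+1)$ can be expanded as
\[ \alpha\beta \equiv \gamma^{\hat\alpha}\gamma^{\hat\beta} \equiv \gamma^{(\hat\alpha + \hat\beta) \bmod 2^m} \pmod{2^m+1} \tag{7.12} \]
By this congruence we get
\[ P(\alpha\beta) \equiv \hat\alpha + \hat\beta \pmod{2^m} \tag{7.13} \]
which is a well known property of the polar representation; general multiplication in a finite field $GF(p^n)$ turns into addition modulo $p^n - 1$ when using a polar representation. When either of the factors $\alpha$ and $\beta$ (or both) equals zero, $P(\alpha\beta)$ is set to $P(0) = 2^m$.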
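As a small illustration of (7.13), the sketch below multiplies in $Z_{17}$ by adding exponents. It assumes $m = 4$ and $\gamma = 3$; the conversion helpers use brute force and are only there to make the example self-contained.

```python
M = 4
Q = 2**M            # modulus of the polar integers
P_MOD = Q + 1       # Fermat prime 17
GAMMA = 3           # assumed primitive element
ZERO = Q            # polar code of the field element 0


def to_polar(x):
    """Brute-force discrete logarithm; acceptable only for this toy field."""
    x %= P_MOD
    if x == 0:
        return ZERO
    return next(e for e in range(Q) if pow(GAMMA, e, P_MOD) == x)


def from_polar(a):
    """Field element represented by the polar integer a."""
    return 0 if a == ZERO else pow(GAMMA, a, P_MOD)


def polar_mul(a, b):
    """General multiplication as in (7.13): exponents add modulo 2^m."""
    if a == ZERO or b == ZERO:
        return ZERO
    return (a + b) % Q
```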
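7.2.7 Multiplication by Powers of $\omega$
The computation of the Fermat number transform of length $N$ involves multiplication by powers of the transform kernel $\omega$ of order $N$. Because $\omega$ has order $N$, we have $\omega^n \equiv \omega^{n \bmod N} \pmod{2^m+1}$. Let $\lambda = P(\omega)$ and $\hat\alpha = P(\alpha)$. Then, by (7.12) and (7.13) it follows that
\[ P(\omega^n\alpha) \equiv \hat\alpha + (n \bmod N)\lambda \pmod{2^m} \tag{7.14} \]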
Multiplication by $2^n$
The Fermat number transforms most commonly used are the ones of lengths $N = 2m$ and $4m$, with transform kernels $\omega = 2$ and $\omega = \sqrt{2} \equiv 2^{m/4}(2^{m/2} - 1)$, respectively. The main reason is that, with respect to the diminished-1 and the NBC representations of the integers of $Z_{2^m+1}$, multiplication by powers of 2 can then be carried out as binary shifts (rotations) (see Sections 2.3.2, 5.1.4, and 6.3.5).
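Multiplication by powers of two can be carried out in a simple way in the polar representation as well. By the congruence $2^m \equiv -1 \equiv \gamma^{2^{m-1}} \equiv \gamma^{mZ(0)} \pmod{2^m+1}$, where $\gamma$ is a primitive element of $Z_{2^m+1}$, we get
\[ mZ(0) \equiv 2^{m-1} \pmod{2^m} \tag{7.15} \]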
Because $m$ is a power of two, $m = 2^t$, $t = 0, 1, 2, 3, 4$, by Theorem 3.4 of Rosen [84] we can rewrite (7.15) as
\[ Z(0) \equiv 2^c \pmod{2^{c+1}} \tag{7.16} \]
where $c$ is defined by the equation
\[ 2^c = 2^{m-1}/m \tag{7.17} \]
Consequently, for some integer $k$ we can write
\[ Z(0) = k2^{c+1} + 2^c = \kappa 2^c \tag{7.18} \]
where $\kappa = 2k + 1$. Because $\kappa$ is an odd $(t+1)$-bit normal binary coded integer, where $t = \log_2 m$, it follows that $k \in Z_m$. Thus, depending on the primitive element $\gamma$ chosen, the corresponding Zech's logarithm of zero is of the form given in (7.18) for some $k = 0, 1, \ldots, m-1$.
Theorem 7.1 For each $k \in Z_m$ there exist $2^c$ primitive elements $\gamma$ of $Z_{2^m+1}$ such that the equality $Z(0) = (2k+1)2^c$ holds.
Theorem 7.1 may also be formulated as follows. The primitive elements of $Z_{2^m+1}$ can be partitioned into $m$ sets, each comprising $2^c$ elements, such that the primitive elements in each set all have the same Zech's logarithm of zero on the form given by (7.18).
Proof: Let $\gamma$ and $\gamma'$ be two primitive elements of $Z_{2^m+1}$. By Corollary 8.4.1 of Rosen [84] we know that $\gamma^u$ is a primitive element of $Z_{2^m+1}$ if and only if $\gcd(u, 2^m) = 1$, which is true for all odd integers $u$. Hence, for some integer $r \in Z_{2^{m-1}}$, $\gamma'$ can be written on the form $\gamma' \equiv \gamma^{2r+1} \pmod{2^m+1}$. By (7.10) we have $Z(0) \equiv \log_\gamma 2 \pmod{2^m}$. Suppose that $\gamma'$ has the same Zech's logarithm of zero as $\gamma$, i.e. suppose $Z(0) \equiv \log_{\gamma'} 2 \pmod{2^m}$. Then it follows that
\[ \gamma^{Z(0)} \equiv \gamma'^{Z(0)} \equiv \gamma^{(2r+1)Z(0)} \pmod{2^m+1} \]
and hence we have $Z(0) \equiv (2r+1)Z(0) \pmod{2^m}$, where $Z(0) = \kappa 2^c$. Because $2^m/\gcd(\kappa 2^c, 2^m) = 2m$, by [84, Th. 3.4] it follows that $1 \equiv 2r+1 \pmod{2m}$. Consequently, $\gamma$ and $\gamma' \equiv \gamma^{2r+1} \pmod{2^m+1}$ have the same Zech's logarithm of zero only if $2r \equiv 0 \pmod{2m}$, or equivalently if $m$ is a divisor of $r$.
From the above reasoning we conclude that there exist exactly $m$ Zech's logarithms of zero on the form given by (7.18). By [84, Th. 8.5] we know that there are $\varphi(\varphi(2^m+1)) = \varphi(2^m) = 2^{m-1}$ primitive roots of $Z_{2^m+1}$. Hence, these primitive elements can be partitioned into $m$ sets of $2^{m-1}/m = 2^c$ elements, which all have the same Zech's logarithm of zero.
□
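Theorem 7.1 can be checked by exhaustive search in a small field. The sketch below assumes $m = 4$, $t = 2$ (so $c = 1$) and groups the primitive elements of $Z_{17}$ by the index $k$ in $Z(0) = (2k+1)2^c$; the function names are illustrative only.

```python
from collections import defaultdict

M, T = 4, 2                 # m = 2^t
Q = 2**M
P_MOD = Q + 1               # Fermat prime 17
C = M - T - 1               # 2^c = 2^(m-1) / m


def order(g):
    """Multiplicative order of g modulo the prime P_MOD."""
    k, x = 1, g % P_MOD
    while x != 1:
        x = x * g % P_MOD
        k += 1
    return k


def zech_of_zero(g):
    """Z(0) with respect to g, i.e. the exponent e with g^e = 2."""
    return next(e for e in range(Q) if pow(g, e, P_MOD) == 2)


def classes_by_zech_of_zero():
    """Map k -> primitive elements whose Z(0) equals (2k + 1) * 2^c."""
    groups = defaultdict(list)
    for g in range(2, P_MOD):
        if order(g) == Q:                      # g is a primitive element
            z = zech_of_zero(g)
            assert z % 2**(C + 1) == 2**C      # the form (7.18)
            groups[(z >> C) // 2].append(g)
    return dict(groups)
```

For this field the search yields $m = 4$ classes of $2^c = 2$ primitive elements each, which is the partition the theorem describes.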
For $\omega = 2$ we get $\lambda = Z(0)$ and $N = 2m$ in (7.14). By Theorem 7.1, there exist $2^c$ primitive elements for which the associated Zech's logarithms of zero are all equal to $2^c$ ($k = 0$ in (7.18)). Consequently, by appropriately choosing such a $\gamma$, multiplication by a power of two can be computed in the polar representation as in (7.14), with $\lambda \equiv 2^c \pmod{2^m}$, i.e. we get
\[ P(2^n\alpha) \equiv \hat\alpha + (n \bmod 2m)2^c \pmod{2^m} \tag{7.19} \]
Let $\nu = (n \bmod 2m)2^c = n_{(t+1)}2^c$, where $n_{(t+1)} = n \bmod 2^{t+1}$. This binary coded integer may be computed as $c = m - t - 1$ binary shifts of $n_{(t+1)}$. However, because the factor $n_{(t+1)}$ is a $(t+1)$-bit NBC integer, no reduction modulo $2^m$ is needed for the $m$-bit NBC integer $n_{(t+1)}2^c$. The shifts can therefore be carried out instantaneously and hence the evaluation of (7.19) only requires one addition. Furthermore, since $\nu \bmod 2^c = 0$, i.e. the $c$ least significant bits of $\nu$ are zero, this addition modulo $2^m$ simplifies to a $(t+1)$-bit addition of $n_{(t+1)}$ by the $t+1$ most significant bits of $\hat\alpha$. The sum is reduced modulo $2^{t+1}$. This computational procedure is generalised and further explained below.
An Optimal Choice of $\omega$
The main results in the remainder of the present section (7.2.7) can also be found in [6]. Because the transform length $N$ is a power of two, i.e. we have $N = 2^b$ for $1 \le b \le m$, we can write (7.14) as
\[ P(\omega^n\alpha) \equiv \hat\alpha + (n \bmod 2^b)\lambda \equiv \hat\alpha + n_{(b)}\lambda \pmod{2^m} \tag{7.20} \]
where $n_{(b)} = n \bmod 2^b$. By choosing an appropriate kernel $\omega$, it is possible to compute $P(\omega^n\alpha)$ with a complexity that is smaller than the complexity of performing one general multiplication followed by one addition modulo $2^m$, i.e. according to the direct computation of (7.20). We showed above that for some bases $\gamma$ and for $\omega = 2$, $P(\omega^n\alpha)$ can conveniently be computed using only one addition modulo $2m = 2^{t+1}$. That simple way of computing $P(\omega^n\alpha)$ is actually a special case of a general procedure for computing $P(\omega^n\alpha)$ which only requires one addition modulo $N$ for all possible transform lengths $N = 2^b$ in $Z_{2^m+1}$.
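Theorem 7.2 Let $\omega \equiv \gamma^{2^{m-b}} \pmod{2^m+1}$ and $P(\omega^n\alpha) \equiv \hat\alpha + n_{(b)}\lambda \pmod{2^m}$, where $\gamma$ is a primitive element of the prime field $Z_{2^m+1}$, $n$ is a nonnegative integer, and $1 \le b \le m$. Then the order of $\omega$ modulo $2^m+1$ equals $2^b$ and $P(\omega^n\alpha)$ can be computed using only one $b$-bit addition modulo $2^b$.
The choice of $\omega \equiv \gamma^{2^{m-b}} \pmod{2^m+1}$ as the kernel of a Fermat number transform of length $2^b$ was also considered in Section 2.3.2 (page 15). In the proof of Theorem 7.2 we need the following notation.
Definition 7.3 Let $\hat\alpha$ be an $m$-bit polar integer. By $\hat\alpha^{(i)}$ we denote the NBC integer which is formed by the $m - i$ most significant bits of $\hat\alpha$ such that, for $0 \le i \le m$, $\hat\alpha$ can be written on the form $\hat\alpha = \hat\alpha^{(i)}2^i + \hat\alpha_{(i)}$, where $\hat\alpha_{(i)} = \hat\alpha \bmod 2^i$.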
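Proof: (Theorem 7.2) The order of the primitive element $\gamma$ modulo the prime $2^m+1$ equals $\varphi(2^m+1) = 2^m$. By Theorem 8.4 of Rosen [84], for $1 \le b \le m$ the order of $\gamma^{2^{m-b}}$ modulo $2^m+1$ equals $2^m/\gcd(2^m, 2^{m-b}) = 2^b$. Hence, the element
\[ \omega \equiv \gamma^{2^{m-b}} \pmod{2^m+1} \]
can be used as the kernel of a Fermat number transform of length $N = 2^b$. (Footnote: For $b = t+1$ we get $N = 2^b = 2^{t+1} = 2m$, which is the order of the transform kernel $\omega = 2$ used above.) Similar to the notation used above, let $\nu = n_{(b)}\lambda$, where we now have $\lambda = P(\omega) = 2^{m-b}$. Thus, using this definition of $\nu$ we can write
\[ P(\omega^n\alpha) \equiv \hat\alpha + \nu \pmod{2^m} \tag{7.21} \]
where $\nu = n_{(b)}2^{m-b}$. Because $n_{(b)}$ is a $b$-bit NBC integer, $\nu$ is an $m$-bit NBC integer for which $\nu_{(m-b)} = 0$, i.e. the $m - b$ least significant bits of $\nu$ equal zero. Hence, $P(\omega^n\alpha)$ in (7.21) can be computed as a $b$-bit addition modulo $2^b$.
By Definition 7.3 we can write $\hat\alpha = \hat\alpha^{(m-b)}2^{m-b} + \hat\alpha_{(m-b)}$ and, with $\hat\pi = P(\omega^n\alpha)$,
\[ \hat\pi = \hat\pi^{(m-b)}2^{m-b} + \hat\pi_{(m-b)} \tag{7.22} \]
where
\[ \hat\pi^{(m-b)} = \left(\hat\alpha^{(m-b)} + n_{(b)}\right) \bmod 2^b, \qquad \hat\pi_{(m-b)} = \hat\alpha_{(m-b)} \tag{7.23} \]
Obviously, $P(\omega^n\alpha)$ can be computed using only one $b$-bit addition of $\hat\alpha^{(m-b)}$ and $n_{(b)}$ modulo $2^b$.
□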
Theorem 7.2 leads immediately to the following corollary.
Corollary 7.1 Consider a Fermat number transform in the prime field $Z_{2^m+1}$ and of arbitrary transform length $N = 2^b$, such that $1 \le b \le m$. By letting $\omega \equiv \gamma^{2^{m-b}} \pmod{2^m+1}$ be the kernel of the transform, in the polar representation each transform multiplication by a power of $\omega$ can be computed using only one addition modulo $2^b$.
Proof: The proof follows directly from Theorem 7.2.
□
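The $b$-bit addition of Theorem 7.2 and Corollary 7.1 can be sketched as follows, assuming $m = 4$ and $\gamma = 3$. The function `mul_by_kernel_power` is a hypothetical name; it touches only the $b$ most significant bits of the polar integer and passes the remaining $m - b$ bits through unchanged.

```python
M = 4
Q = 2**M
P_MOD = Q + 1       # Fermat prime 17
GAMMA = 3           # assumed primitive element
ZERO = Q            # polar code of the field element 0


def kernel(b):
    """omega = gamma^(2^(m-b)), an element of order 2^b."""
    return pow(GAMMA, 2**(M - b), P_MOD)


def mul_by_kernel_power(a, n, b):
    """Polar code of omega^n * alpha, given a = P(alpha), using one b-bit addition."""
    if a == ZERO:
        return ZERO
    low_width = M - b
    low = a & ((1 << low_width) - 1)       # the m-b least significant bits pass through
    high = a >> low_width                  # the b most significant bits
    high = (high + n) & ((1 << b) - 1)     # b-bit addition modulo 2^b
    return (high << low_width) | low
```

With $b = m$ the addition spans the whole word, and with small $b$ it shrinks to a few bits, which is the point of choosing the kernel this way.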
7.3 Zech’s Logarithm
In Section 7.2.5 we saw that polar addition involves the evaluation of a Zech's logarithm. The computational complexity of polar addition depends heavily on the complexity of computing the Zech logarithm. Zech's logarithms are defined in Definition 7.2 (page 160). In this section we investigate some properties of Zech's logarithm over Fermat prime fields, with the purpose of finding an (area-time) efficient way of computing the logarithm.
We mentioned in Section 7.2.5 that some researchers have considered different methods of computing Zech's logarithms in finite fields, in particular fields of characteristic two. To the author's knowledge, their methods all involve the use of a look-up table. See for example Huber's [51] technique for computing Zech's logarithm in $GF(2^n)$. He uses a restricted set of elements which, together with their Zech's logarithms, are stored in a look-up table. Arbitrary Zech's logarithms in the field can then be computed by using this table and some properties of Zech's logarithms in fields of characteristic two. However, some of the properties used by Huber for computing Zech's logarithms do not apply to prime fields. In this chapter, we present new methods of computing Zech's logarithms which only apply to Fermat prime fields.
In Appendix C, we investigate some special properties of Zech's logarithms in Fermat prime fields. Using these properties we show that the integers of $Z_{2^m} \setminus \{2^{m-1}\}$ can be partitioned into $(2^m+2)/6$ subsets of six integers each, such that the Zech logarithm of any integer of a subset can be computed from any of the other integers of the set and its Zech's logarithm. Consequently, a method of computing Zech's logarithm could be the following:
Select one integer $\hat\alpha_0$ from each subset and store the associated Zech's logarithms $Z(\hat\alpha_0)$ in a look-up table (of size $\frac{2^m+2}{6} \cdot m$ bits). Given $\hat\alpha$, suppose we want to compute $Z(\hat\alpha)$.
1. The first step is to find which subset contains $\hat\alpha$. We know that the Zech logarithm $Z(\hat\alpha_0)$ of one of the integers $\hat\alpha_0$ of this subset is stored in the table. Thus, the first step of the method is to find $\hat\alpha_0$ and then obtain $Z(\hat\alpha_0)$ from the look-up table.
2. The remaining integers of the subset are subsequently computed using the equations in Theorem C.2 until $\hat\alpha$ is found.
3. The desired logarithm $Z(\hat\alpha)$ is computed using the appropriate equation in Theorem C.1.
In step 2, at most one addition modulo $2^m$ is required to compute, from $\hat\alpha_0$ and $Z(\hat\alpha_0)$, an arbitrary integer of the subset. In step 3, $Z(\hat\alpha)$ can also be computed using at most one addition modulo $2^m$.
There is, however, a major drawback of this procedure for computing Zech's logarithms. We have not yet discovered a straightforward way of finding a simple connection between an arbitrary integer of $Z_{2^m} \setminus \{2^{m-1}\}$ and the associated subset to which it belongs. Thus, in the above step 1 we are not able to find the subset which contains $\hat\alpha$, or equivalently, find the associated integer $\hat\alpha_0$ in the table, without searching the whole table. Therefore, the above procedure is not further considered in this section. The properties of the integers of the mentioned subsets (and their Zech's logarithms) are thoroughly investigated in Appendix C.
(Footnote to the partition into subsets of six integers: However, one of the subsets only contains three integers.)
In the following section we consider properties of Zech's logarithms which lead to procedures for computing Zech's logarithms, either with or without the use of look-up tables. When using look-up tables, the tables required are smaller than the table of size $\frac{2^m+2}{6} \cdot m$ bits used in the above-mentioned procedure. The main contents of Sections 7.4 and 7.5 have recently been presented by the author in [5] and [7].
7.4 Properties of the $D_m$ Matrix
Definition 7.4 Let $\hat\alpha$ be a polar integer, i.e. $\hat\alpha \in Z_{2^m} \cup \{2^m\}$. We define the $j$th Zech logarithm of $\hat\alpha$ as
\[ Z^{\{j\}}(\hat\alpha) = Z\left(Z^{\{j-1\}}(\hat\alpha)\right), \quad j \in Z \]
where $Z^{\{0\}}(\hat\alpha) = \hat\alpha$.
From Definition 7.2 of Zech's logarithm in Fermat prime fields we have (see also (C.1) and (C.2) in Appendix C)
\[ Z(2^{m-1}) = 2^m \equiv 0 \pmod{2^m}, \qquad Z(2^m) = 0 \]
which, together with the fact that there exists an integer of $Z_{2^m}$ whose Zech logarithm equals $2^{m-1}$, implies $Z^{\{j + k(2^m+1)\}}(\hat\alpha) = Z^{\{j\}}(\hat\alpha)$. Hence, we have
\[ Z^{\{j\}}(\hat\alpha) = Z^{\{j \bmod (2^m+1)\}}(\hat\alpha) \]
In Section 7.2.7 we saw that $Z(0)$, the Zech logarithm of zero, is involved in the computation of multiplication by powers of two. As seen below, we obtain several interesting properties of Zech's logarithms in Fermat prime fields which are related to $Z(0)$. Henceforth, each Zech's logarithm $Z(\hat\alpha)$ in $Z_{2^m+1}$ is generally considered as a $j$th Zech logarithm of zero, for some $j$. In Figure 7.1, we visualise the sequence of $j$th Zech's logarithms by drawing lines from $Z^{\{j\}}(0)$ to $Z^{\{j+1\}}(0)$ for $j = 0, 1, \ldots, 2^m$. From the figure we can derive some special properties of Zech's logarithms in $Z_{2^m+1}$. This is further discussed in Appendix C.
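Theorem 7.3 Let $\gamma$ be a primitive element of the Fermat prime field $Z_{2^m+1}$ and let $Z^{\{j\}}(0)$ be the $j$th Zech logarithm of zero. Also, for $i, k \in Z$, let
\[ a_i = 2^i(a_0+1) - 1 \equiv 2a_{i-1} + 1 \pmod{2^m+1}, \quad a_0 \in Z_{2^m} \tag{7.24} \]
\[ d_k = \gamma^k - 1 \equiv \gamma d_{k-1} + \gamma - 1 \pmod{2^m+1} \tag{7.25} \]
Then we have
\[ \gamma^{Z^{\{j\}}(0)} \equiv j + 1 \pmod{2^m+1} \tag{7.26} \]
\[ Z^{\{a_i\}}(0) \equiv Z^{\{a_0\}}(0) + iZ(0) \equiv Z^{\{a_{i-1}\}}(0) + Z(0) \pmod{2^m} \tag{7.27} \]
\[ Z^{\{d_k\}}(0) \equiv k \pmod{2^m} \tag{7.28} \]
\[ mZ(0) \equiv 2^{m-1} \pmod{2^m} \tag{7.29} \]
Figure 7.1: The sequence of Zech's logarithms $Z^{\{j\}}(0)$, starting from $Z^{\{2^m+1\}}(0) = Z^{\{0\}}(0) = 0$ and $Z^{\{1\}}(0) = Z(0)$ and ending with $Z^{\{2^m\}}(0) = Z(2^{m-1}) \equiv 0 \pmod{2^m}$. [Figure not reproduced.]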
Proof: The expansions of $a_i$ and $d_k$ in (7.24) and (7.25), respectively, follow easily from the definitions of $a_i$ and $d_k$.
• Equation (7.26): By Definitions 7.2 and 7.4 we get
\[ \gamma^{Z^{\{j\}}(0)} \equiv \underbrace{1 + 1 + \cdots + 1}_{j} + \gamma^{Z^{\{0\}}(0)} \equiv j + 1 \pmod{2^m+1} \]
• Equation (7.27): By combining (7.26) and (7.24) and using the congruence $2 \equiv \gamma^{Z(0)} \pmod{2^m+1}$ we get
\[ \gamma^{Z^{\{a_i\}}(0)} \equiv a_i + 1 \equiv 2^i(a_0+1) \equiv \gamma^{Z^{\{a_0\}}(0) + iZ(0)} \pmod{2^m+1} \]
\[ a_i + 1 \equiv 2(a_{i-1}+1) \equiv \gamma^{Z^{\{a_{i-1}\}}(0) + Z(0)} \pmod{2^m+1} \]
from which we get (7.27).
• Equation (7.28): By letting $j = d_k$ in (7.26) we get
\[ \gamma^{Z^{\{d_k\}}(0)} \equiv d_k + 1 \equiv \gamma^k \pmod{2^m+1} \]
from which we get (7.28).
• Equation (7.29) was obtained on page 162 (see the congruence leading to (7.15)). It is repeated here only for the sake of completeness.
□
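The four relations of Theorem 7.3 can be verified numerically in a small field. The sketch below assumes $m = 4$ and $\gamma = 3$ and evaluates the iterated Zech logarithm by repeated table look-up; `zech_iter` and `check_theorem_7_3` are illustrative names.

```python
M = 4
Q = 2**M
P_MOD = Q + 1       # Fermat prime 17
GAMMA = 3           # assumed primitive element
ZERO = Q            # polar code of the field element 0

LOG = {pow(GAMMA, e, P_MOD): e for e in range(Q)}
LOG[0] = ZERO


def zech(e):
    """Zech's logarithm of the polar integer e."""
    x = 0 if e == ZERO else pow(GAMMA, e, P_MOD)
    return LOG[(x + 1) % P_MOD]


def zech_iter(j, e=0):
    """j-th Zech logarithm of e, obtained by applying Z j times."""
    for _ in range(j % P_MOD):
        e = zech(e)
    return e


def a_seq(i, a0):
    """a_i as in (7.24)."""
    return (pow(2, i, P_MOD) * (a0 + 1) - 1) % P_MOD


def d_seq(k):
    """d_k as in (7.25)."""
    return (pow(GAMMA, k, P_MOD) - 1) % P_MOD


def check_theorem_7_3():
    """Return the truth values of (7.26), (7.27), (7.28) and (7.29)."""
    z0 = zech(0)
    ok_26 = all(pow(GAMMA, zech_iter(j), P_MOD) == j + 1 for j in range(Q))
    ok_27 = all(zech_iter(a_seq(i, a0)) == (zech_iter(a0) + i * z0) % Q
                for a0 in range(Q) for i in range(2 * M))
    ok_28 = all(zech_iter(d_seq(k)) == k for k in range(Q))
    ok_29 = (M * z0) % Q == Q // 2
    return ok_26, ok_27, ok_28, ok_29
```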
The recursive part of (7.24) is on the same form as diminished-1 multiplication by two. Therefore, $a_i$ can very simply be obtained from $a_{i-1}$ using an $m$-bit feedback shift register with an inverter in the feedback loop (see page 127, the last paragraph concerning multiplication by 2, in Section 6.3.5).
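The register behaviour can be modelled in a few lines. This is a software sketch only, with hypothetical function names: one clock cycle shifts the $m$-bit word one position towards the most significant bit and feeds the inverted outgoing bit into the least significant position, which equals $2a + 1 \bmod (2^m+1)$ for every $m$-bit $a$.

```python
def fsr_step(a, m):
    """One clock cycle: shift left, feed the inverted outgoing bit back in."""
    msb = (a >> (m - 1)) & 1
    return ((a << 1) & ((1 << m) - 1)) | (msb ^ 1)


def a_by_shifting(a0, i, m):
    """a_i obtained from a_0 by i clock cycles of the register."""
    a = a0
    for _ in range(i):
        a = fsr_step(a, m)
    return a
```

After $m$ cycles the register holds the one's complement of its initial contents, and after $2m$ cycles the initial contents return, in line with the period $2m$ of the sequence.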
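Theorem 7.4 Let $a_i = 2^i(a_0+1) - 1 \pmod{2^m+1}$, where $i \in Z$ and $a_0 \in Z_{2^m}$. Then, the sequence $\ldots, a_{i-1}, a_i, a_{i+1}, \ldots$ is cyclic with period $2m$, i.e. we have
\[ a_i = a_{i \bmod 2m} \pmod{2^m+1} \]
For $a_0 = 2^m$ and $i \in Z$, we have $a_i = a_0 \pmod{2^m+1}$.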
Proof: The cyclic property of the sequence $\ldots, a_{i-1}, a_i, a_{i+1}, \ldots$ follows simply from the fact that the order of 2 modulo $2^m+1$ equals $2m$. By letting $a_i = a_0 \pmod{2^m+1}$ in (7.24) we get $a_0 \equiv 2^i(a_0+1) - 1 \pmod{2^m+1}$, which implies
\[ (2^i - 1)(a_0 + 1) \equiv 0 \pmod{2^m+1} \]
This congruence has the solutions $2^i \equiv 1 \pmod{2^m+1}$, i.e. $i \equiv 0 \pmod{2m}$, and $a_0 \equiv -1 \equiv 2^m \pmod{2^m+1}$. However, because we have $a_0 \in Z_{2^m}$, the only valid solution is $i \equiv 0 \pmod{2m}$. Consequently,
\[ a_i \not\equiv a_0 \pmod{2^m+1} \text{ for } 2m \nmid i, \qquad a_i \equiv a_0 \pmod{2^m+1} \text{ for } 2m \mid i \]
□
For $k \in Z$ we have $\gamma^k \not\equiv 0 \pmod{2^m+1}$ which, by (7.25), implies $d_k \not\equiv -1 \equiv 2^m \pmod{2^m+1}$ and thus $d_k \in Z_{2^m}$. By Theorem 7.4 it follows that when representing each Zech's logarithm in $Z_{2^m+1}$ on the form given by (7.28), i.e. as some $d_k$th Zech logarithm of zero, the set $\{Z^{\{d_k\}}(0)\}_{k \in Z_{2^m}}$ can be partitioned into $2^c = 2^{m-1}/m$ distinct cyclic groups. Each subgroup can be generated using (7.24) and (7.27) and with the knowledge of only $Z(0)$ and $Z^{\{a_0\}}(0)$, for some integer $a_0$ associated with the group.
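The sequence $d_0, d_1, d_2, \ldots, d_{2^m-1}$ of indices can be arranged to form a matrix $D_m$ of size $2^c \times 2m$, which we define [5, Sec. 2] as
\[ D_m = \begin{pmatrix} d_0 & d_{2^c} & d_{2 \cdot 2^c} & \cdots & d_{(2m-1)2^c} \\ d_1 & d_{1+2^c} & d_{1+2 \cdot 2^c} & \cdots & d_{1+(2m-1)2^c} \\ d_2 & d_{2+2^c} & d_{2+2 \cdot 2^c} & \cdots & d_{2+(2m-1)2^c} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d_{2^c-1} & d_{2 \cdot 2^c-1} & d_{3 \cdot 2^c-1} & \cdots & d_{2^m-1} \end{pmatrix} \tag{7.30} \]
where $2^c = 2^{m-1}/m$ (see (7.17)). Thus, the matrix $D_m$ is formed such that, by writing $k$ on the form $k = \rho + \sigma 2^c$, the element $d_k$ is in row $\rho$ and column $\sigma$ of $D_m$, where $0 \le \rho \le 2^c - 1$ and $0 \le \sigma \le 2m - 1$. Because $k$ is an $m$-bit NBC integer, the NBC integer $\rho$ is in turn formed by the $c = m - t - 1$ least significant bits and the NBC integer $\sigma$ is formed by the $t + 1$ most significant bits of $k$.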
Theorem 7.5 The set of $2m$ elements in row $\rho$ of $D_m$ equals the cyclic group $\{2^i(a_0+1) - 1 \pmod{2^m+1}\}$, $i = 0, 1, \ldots, 2m-1$, where $a_0$ is any element of the row.
The theorem is simply proved using the following lemma.
Lemma 7.1 Denote by $a_{i|a}$ the integer $a_i = 2^i(a_0+1) - 1 \pmod{2^m+1}$, where $a_0 = a$. Let $\kappa^{-1}$ be the multiplicative inverse of $\kappa$ modulo $2m$, where $\kappa$ is defined by the Zech logarithm $Z(0) = \kappa 2^c$. Also, let $r \in Z_{2m}$ and $d_k = a_{i|a_0} \pmod{2^m+1}$ for some $k$, $i$, and $a_0$. Then we have
\[ d_{k+r2^c} = a_{i+r\kappa^{-1}|a_0} \pmod{2^m+1} \tag{7.31} \]
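Proof: It follows from (7.25) that
\[ d_{k+r2^c} = \gamma^{k+r2^c} - 1 \equiv (d_k + 1)(d_{r2^c} + 1) - 1 \pmod{2^m+1} \]
From the congruence $\kappa\kappa^{-1} \equiv 1 \pmod{2m}$ it follows that $2m \mid (\kappa\kappa^{-1} - 1)$ and hence $2^c \cdot 2m = 2^m \mid 2^c(\kappa\kappa^{-1} - 1) = \kappa^{-1}Z(0) - 2^c$, which implies $2^c \equiv \kappa^{-1}Z(0) \pmod{2^m}$. Hence
\[ d_{r2^c} = \gamma^{r2^c} - 1 \equiv \gamma^{r\kappa^{-1}Z(0)} - 1 \equiv 2^{r\kappa^{-1}} - 1 = a_{r\kappa^{-1}|0} \pmod{2^m+1} \]
and consequently
\[ d_{k+r2^c} \equiv (a_{i|a_0} + 1)(a_{r\kappa^{-1}|0} + 1) - 1 \equiv 2^{i+r\kappa^{-1}}(a_0+1) - 1 = a_{i+r\kappa^{-1}|a_0} \pmod{2^m+1} \]
□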
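In the following proof of Theorem 7.5 we consider row $\rho$ of $D_m$, which we denote by $D_m[\rho]$.
Proof: (Theorem 7.5) Let $d_k$, where $k = \rho + \sigma 2^c$, be the integer in position $\sigma$ of $D_m[\rho]$. Then we can write
\[ D_m[\rho] = \left(d_\rho,\ d_{\rho+2^c},\ \ldots,\ d_{\rho+(\sigma-1)2^c},\ d_{\rho+\sigma 2^c},\ d_{\rho+(\sigma+1)2^c},\ \ldots,\ d_{\rho+(2m-1)2^c}\right) \]
\[ = \left(d_{k-\sigma 2^c},\ d_{k-(\sigma-1)2^c},\ \ldots,\ d_{k-2^c},\ d_k,\ d_{k+2^c},\ \ldots,\ d_{k+(2m-1-\sigma)2^c}\right) \]
Let $d_k = a_{i|a_0} \pmod{2^m+1}$ for some $i \in Z_{2m}$ and $a_0 \in Z_{2^m}$. Because $\gcd(\kappa^{-1}, 2m) = 1$, the set $\{a_{i+r\kappa^{-1}|a_0}\}$, $r = 0, 1, \ldots, 2m-1$ (mod $2^m+1$), forms the cyclic group of order $2m$ which contains $a_{i|a_0}$. Hence, by Lemma 7.1 (Equation (7.31)), $D_m[\rho]$ can be written on the form
\[ \left(a_{i-\sigma\kappa^{-1}|a_0},\ a_{i-(\sigma-1)\kappa^{-1}|a_0},\ \ldots,\ a_{i-\kappa^{-1}|a_0},\ a_{i|a_0},\ a_{i+\kappa^{-1}|a_0},\ \ldots,\ a_{i+(2m-1-\sigma)\kappa^{-1}|a_0}\right) \]
where $a_0$ is the integer in column $(\sigma - \kappa i) \bmod 2m$. (Since $a_{i|a_0}$ is in column $\sigma$, the integer $a_0$ is in column $(\sigma + r) \bmod 2m$, where $r$ is obtained from the congruence $i + r\kappa^{-1} \equiv 0 \pmod{2m}$. Thus, we have $r \equiv -(\kappa^{-1})^{-1}i \equiv -\kappa i \pmod{2m}$.)
□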
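The structure of $D_m$ and the row property of Theorem 7.5 can be inspected directly for a small case. The following sketch assumes $m = 4$, $t = 2$ and $\gamma = 3$, so that $D_4$ has $2^c = 2$ rows and $2m = 8$ columns; the helper names are illustrative.

```python
M, T = 4, 2
Q = 2**M
P_MOD = Q + 1       # Fermat prime 17
GAMMA = 3           # assumed primitive element
C = M - T - 1       # 2^c rows, 2m columns


def build_dm():
    """D_m as a list of rows; entry (rho, sigma) is d_k with k = rho + sigma * 2^c."""
    return [[(pow(GAMMA, rho + sigma * 2**C, P_MOD) - 1) % P_MOD
             for sigma in range(2 * M)]
            for rho in range(2**C)]


def orbit(a0):
    """The set {2^i (a0 + 1) - 1 mod (2^m + 1) : i = 0, ..., 2m - 1}."""
    return {(pow(2, i, P_MOD) * (a0 + 1) - 1) % P_MOD for i in range(2 * M)}


def rows_match_theorem_7_5(dm):
    """True if every row equals the orbit generated by any of its elements."""
    return all(set(row) == orbit(a0) for row in dm for a0 in row)
```

For these parameters the top row comes out as 0, 8, 12, 14, 15, 7, 3, 1, i.e. exactly the $m$-bit words consisting of a block of zeros and a block of ones.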
Henceforth, for each row of $D_m$, we generally let $a_0$ be the element in the first (leftmost) position of the row. Thus, for the row vector $D_m[\rho]$ we have $a_0 = d_\rho$.
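In particular, in the first row $D_m[0]$ of $D_m$ we get $a_0 = d_0 = \gamma^0 - 1 = 0$. By (7.24), the remaining elements of the row are, written as $m$-bit words, $a_{1|0} = (0 \cdots 001)$, $a_{2|0} = (0 \cdots 011)$, ..., $a_{m-1|0} = (01 \cdots 1)$, $a_{m|0} = (11 \cdots 1)$, $a_{m+1|0} = (1 \cdots 110)$, $a_{m+2|0} = (1 \cdots 100)$, ..., $a_{2m-1|0} = (10 \cdots 0)$. Hence, for $0 \le i \le m$, the NBC integer $a_{i|0}$ is formed by a block of $m - i$ zeros followed by a block of $i$ ones. For $m < i \le 2m - 1$, $a_{i|0}$ is formed by a block of $2m - i$ ones followed by a block of $i - m$ zeros. Let $e$ be an arbitrary NBC integer of $D_m[0]$. Consequently, for $a_{i|0} = e$, the subscript $i$ is simply obtained as
\[ i = \bar e_0 m + n_e \tag{7.32} \]
where $e_0$ is the least significant bit of $e$, $\bar e_0$ is its complement, and $n_e$ is the number of bits of $e$ which are equal to $e_0$. For example, for $m = 8$ we have $e = (00011111) = a_{i|0}$, where $i = 0 \cdot 8 + 5 = 5$, and $e = (11100000) = a_{i|0}$, where $i = 1 \cdot 8 + 5 = 13$.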
So far, we have not said much about the multiplicative inverse $\kappa^{-1}$ of $\kappa$ modulo $2m$. In Table 7.1 we have listed the parameters $Z(0)$, $\kappa$, and $\kappa^{-1}$ for $m = 2, 4, 8, 16$, with respect to the primitive element $\gamma = 3$. By definition, we have $\kappa\kappa^{-1} \equiv 1 \pmod{2m}$, where $\kappa$ is defined by $Z(0) = \kappa 2^c = \log_\gamma 2 \bmod 2^m$.
Regarding $\kappa$ we have observed the following property.
Observation 7.1 For $m = 4, 8, 16$ and $\gamma = 3$ we can write $\kappa$ on the form
\[ \kappa \equiv m + 11 \pmod{2m} \tag{7.33} \]
For $m = 2$ we simply have $\kappa = \kappa^{-1} = 3$. We have not been able to show whether the fact that Observation 7.1 holds can be derived from the definition of $\kappa$ or if it is just some kind of coincidence. Anyhow, a consequence of (7.33) of Observation 7.1 is the following theorem.
Theorem 7.6 For $m = 4, 8, 16$, when the primitive element $\gamma$ equals 3, the multiplicative inverse $\kappa^{-1}$ of $\kappa$ modulo $2m$ can be written on the form
\[ \kappa^{-1} \equiv m + 3 \pmod{2m} \tag{7.34} \]
Proof: By Observation 7.1, for $m = 4, 8, 16$ and $\gamma = 3$ we have $\kappa \equiv m + 11 \pmod{2m}$. The congruence
\[ (m + 11)(m + 3) = m^2 + 14m + 33 \equiv 1 \pmod{2m} \]
holds for $m = 2, 4, 8, 16$. Hence, the multiplicative inverse of $\kappa \equiv m + 11$ modulo $2m$ equals $m + 3$.
□
  m     Z(0)    κ    κ^{-1}
  2        3    3     3
  4       14    7     7
  8       48    3    11
 16    55296   27    19
Table 7.1: The parameters $Z(0)$, $\kappa$, and $\kappa^{-1}$ for $m = 2, 4, 8, 16$ when the primitive element $\gamma$ equals 3.
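The entries of Table 7.1 can be recomputed with a short script. This is a sketch that finds $Z(0) = \log_3 2$ by brute force, which is feasible only because $m \le 16$; the function name `table_row` is hypothetical, and the three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
def table_row(m, t, gamma=3):
    """Return (m, Z(0), kappa, kappa_inv) for the prime 2^m + 1, where m = 2^t."""
    p = 2**m + 1
    c = m - t - 1
    z0, x = 0, 1
    while x != 2:                       # brute-force log_gamma(2)
        x = x * gamma % p
        z0 += 1
    kappa = z0 >> c                     # Z(0) = kappa * 2^c with kappa odd
    kappa_inv = pow(kappa, -1, 2 * m)   # multiplicative inverse modulo 2m
    return m, z0, kappa, kappa_inv
```

Calling `table_row(2, 1)`, `table_row(4, 2)` and `table_row(8, 3)` reproduces the first three rows of the table; the row for $m = 16$ can be compared in the same way.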
7.4.1 Discrete Exponentiation
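The properties of the matrix $D_m$ derived in the previous section can be utilised to perform exponentiation and compute the discrete logarithm.
Theorem 7.7 Let $P(\alpha) = \hat\alpha \in Z_{2^m}$ be on the form $\hat\alpha = \rho + \sigma 2^c$, where $\rho$ and $\sigma$ are $c$-bit and $(m-c)$-bit NBC integers, respectively. Then, the discrete exponentiation $\alpha \equiv \gamma^{\hat\alpha} \pmod{2^m+1}$ can be performed by first deriving the integer $d_{\hat\alpha}$ in position $(\rho, \sigma)$ of $D_m$ and then computing $\alpha = d_{\hat\alpha} + 1$.
Proof: For $\hat\alpha \in Z_{2^m}$, $\alpha$ is a nonzero integer of $Z_{2^m+1}$. By (7.1), (7.25) and (7.28) we then have
\[ \alpha \equiv \gamma^{\hat\alpha} \equiv d_{\hat\alpha} + 1 \pmod{2^m+1} \tag{7.35} \]
where $\hat\alpha = Z^{\{d_{\hat\alpha}\}}(0) \bmod 2^m$. From the definition of the matrix $D_m$ in (7.30) we know that for a given $\hat\alpha = \rho + \sigma 2^c$, the associated integer $d_{\hat\alpha}$ is located in row $\rho$ and column $\sigma$ of $D_m$. After finding this integer $d_{\hat\alpha}$ we get, from (7.35), $\alpha = d_{\hat\alpha} + 1$. Because $0 \le d_{\hat\alpha} \le 2^m - 1$ we have $d_{\hat\alpha} + 1 \in Z_{2^m+1}$, i.e. no modulus reduction is needed when computing $\alpha$ from $d_{\hat\alpha}$.
□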
The computational complexity of performing discrete exponentiation according to the procedure described in Theorem 7.7 mainly depends on the complexity of obtaining $d_{\hat\alpha}$ from $\hat\alpha = \rho + \sigma 2^c$.
The discrete exponentiation $\alpha \equiv \gamma^{\hat\alpha} \pmod{2^m+1}$, where $\hat\alpha = \rho + \sigma 2^c$, can be computed in the following way:
1. The first step is to compute $d_{\sigma 2^c}$. By letting $a_0 = d_0 = 0$ it follows, from Lemma 7.1 (Equation (7.31)), that $d_{\sigma 2^c} = a_{\sigma\kappa^{-1}|0} \pmod{2^m+1}$. Let $i = \sigma\kappa^{-1} \bmod 2m$. The NBC integer $a_{i|0}$ is preferably computed in either of the following two ways:
(a) In the paragraph subsequent to the proof of Theorem 7.5, we described how the elements of $D_m[0]$, i.e. the top row of $D_m$, are formed. Consequently, if $0 \le i \le m$ we let $a_{i|0} = 2^i - 1$ (a block of $m - i$ zeros followed by a block of $i$ ones) and if $m < i \le 2m - 1$ we let $a_{i|0} = 2^m - 2^{i-m}$ (a block of $2m - i$ ones followed by a block of $i - m$ zeros).
(b) As mentioned in the paragraph subsequent to the proof of Theorem 7.3, $a_{i|0}$ can simply be recursively computed in $i$ clock cycles using a feedback shift register of length $m$ and with an inverter in the feedback loop. This computational procedure is based on the recursive form of $a_i$ in (7.24).
2. The second step is to recursively compute $d_{\hat\alpha} = d_{\rho+\sigma 2^c}$ from $d_{\sigma 2^c} = a_{i|0}$. Thus, we compute $d_{1+\sigma 2^c}$ from $d_{\sigma 2^c}$, $d_{2+\sigma 2^c}$ from $d_{1+\sigma 2^c}$, etc., until, after $\rho$ steps, $d_{\hat\alpha} = d_{\rho+\sigma 2^c}$ is computed from $d_{\rho-1+\sigma 2^c}$. In each computation step, $d_k$ is computed from $d_{k-1}$ using the recursive congruence $d_k \equiv \gamma d_{k-1} + \gamma - 1 \pmod{2^m+1}$ in (7.25).
3. In the third step we compute the desired result $\alpha = d_{\hat\alpha} + 1$.
In Figure 7.2 we illustrate which parts of the matrix $D_m$ are associated with the above Steps 1 and 2 of the procedure for performing discrete exponentiation.
The complexity of computing the recursive congruence $d_k \equiv \gamma d_{k-1} + \gamma - 1 \pmod{2^m+1}$ strongly depends on which primitive element $\gamma$ is chosen. By Theorem 2.5 of Section 2.3.2, the integer 3 is a primitive element of every Fermat prime field $Z_{2^m+1}$; $m \ge 2$. If the primitive element $\gamma$ equals 3, we get
\[ d_k \equiv 3d_{k-1} + 2 \equiv (2d_{k-1} + 1) + d_{k-1} + 1 \equiv (2d_{k-1} + 1) \oplus d_{k-1} \pmod{2^m+1} \tag{7.36} \]
where $2d_{k-1} + 1 \pmod{2^m+1}$ is equivalent to diminished-1 multiplication by 2 and where $\oplus$ denotes diminished-1 addition. Figure 7.8 in Section 7.6.1 shows how to compute (7.36) using a feedback parallel adder.
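The three steps of the exponentiation procedure can be sketched in software as follows, assuming $\gamma = 3$ and taking $\kappa^{-1}$ from Table 7.1. The function `polar_exp` is an illustrative name, and Step 1 is modelled with the shift register variant (b).

```python
def fsr_step(a, m):
    """Shift left with the inverted outgoing bit fed back: a -> 2a + 1 mod (2^m + 1)."""
    msb = (a >> (m - 1)) & 1
    return ((a << 1) & ((1 << m) - 1)) | (msb ^ 1)


def polar_exp(pi, m, t, kappa_inv):
    """Return 3^pi mod (2^m + 1) for a polar integer 0 <= pi < 2^m.

    kappa_inv is the tabulated inverse of kappa modulo 2m for gamma = 3.
    """
    p = 2**m + 1
    c = m - t - 1
    rho, sigma = pi & ((1 << c) - 1), pi >> c
    # Step 1: d_(sigma * 2^c) = a_(i|0), reached by i register shifts from a_0 = 0.
    d = 0
    for _ in range(sigma * kappa_inv % (2 * m)):
        d = fsr_step(d, m)
    # Step 2: rho applications of d_k = 3 d_(k-1) + 2, cf. (7.36).
    for _ in range(rho):
        d = (3 * d + 2) % p
    # Step 3: alpha = d + 1; no modular reduction is needed.
    return d + 1
```

For example, `polar_exp(5, 4, 2, 7)` walks six shifts along the top row of $D_4$ and one step down the column before adding one.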
Remark: Step 2 (computations along a column) may be carried out prior to Step 1(b) (computations along a row) as follows: Firstly, $d_\rho$ is recursively computed from $d_0$ as in Step 2 using (7.25) (which for $\gamma = 3$ is equivalent to (7.36)). Secondly, by letting $a_0 = d_\rho$, the integer $d_{\hat\alpha} = a_{i|d_\rho} \pmod{2^m+1}$ is recursively computed as in Step 1(b) using (7.24).
Theorem 7.8 Let the matrix $D_m$ be written on the form
\[ D_m = \begin{pmatrix} D_m^{(1)} & D_m^{(2)} \end{pmatrix} \]
where $D_m^{(1)}$ and $D_m^{(2)}$ are formed respectively by the $m$ first and $m$ last columns of $D_m$. Also, let $\bar D_m^{(1)}$ denote the matrix obtained when exchanging each ($m$-bit) integer of $D_m^{(1)}$ for its one's complement. Then
\[ D_m^{(2)} = \bar D_m^{(1)} \]
Figure 7.2: The computation steps for performing discrete exponentiation using properties of the matrix $D_m$. Step 1 runs along the top row from $a_0 = 0$ to $d_{\sigma 2^c} = a_{i|0} \pmod{2^m+1}$, $i = \sigma\kappa^{-1} \bmod 2m$; Step 2 runs down column $\sigma$ to $d_{\rho+\sigma 2^c}$. Equations (7.24) and (7.36) are used in Step 1 and 2, respectively. [Figure not reproduced.]
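In the proof of Theorem 7.8 we use the following properties: Because the multiplicative inverse $\kappa^{-1}$ of $\kappa$ modulo $2m$ is coprime to the even modulus $2m$, it follows that $\kappa^{-1}$ is odd, i.e. we have $\kappa^{-1} = 2d + 1$ for some nonnegative integer $d$. Hence, by Theorem 7.4 and (7.24) we get
\[ a_{i+m\kappa^{-1}|a_0} = a_{i+m+2dm|a_0} = a_{(i+m) \bmod 2m|a_0} \equiv 2^{i+m}(a_0+1) - 1 \equiv -2^i(a_0+1) - 1 \equiv 2^m - 1 - a_{i|a_0} = \bar a_{i|a_0} \pmod{2^m+1} \tag{7.37} \]
where $\bar a_{i|a_0}$ is the one's complement of the $m$-bit integer $a_{i|a_0}$, for any $a_0 \in Z_{2^m}$.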
Proof: (Theorem 7.8) Let $a_{i|a_0}$ be the $m$-bit integer in an arbitrary position of $D_m^{(1)}$. Then, by (7.31) in Lemma 7.1 and the definition of $D_m^{(2)}$, $a_{i+m\kappa^{-1}|a_0}$ is the $m$-bit integer in the corresponding position of $D_m^{(2)}$. Because the congruence in (7.37) holds for every integer $a_{i|a_0}$ of $D_m^{(1)}$, we have $D_m^{(2)} = \bar D_m^{(1)}$.
□
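The complement relation of Theorem 7.8 is easy to confirm for a small matrix. The sketch below assumes $m = 4$, $t = 2$ and $\gamma = 3$ and compares column $\sigma + m$ with the bitwise complement of column $\sigma$; the function names are illustrative.

```python
M, T = 4, 2
Q = 2**M
P_MOD = Q + 1       # Fermat prime 17
GAMMA = 3           # assumed primitive element
C = M - T - 1


def dm_entry(rho, sigma):
    """d_k for k = rho + sigma * 2^c, the entry in row rho and column sigma of D_m."""
    return (pow(GAMMA, rho + sigma * 2**C, P_MOD) - 1) % P_MOD


def right_half_is_complement():
    """True if each entry in column sigma + m is the one's complement of column sigma."""
    mask = Q - 1
    return all(dm_entry(rho, sigma + M) == (dm_entry(rho, sigma) ^ mask)
               for rho in range(2**C) for sigma in range(M))
```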
As it is described on page 174, the computation of $a_{i|0}$ from $a_0 = 0$ in Step 1(b) requires at most $2m - 1$ clock cycles. One binary shift (rotation) is performed during each clock cycle. We have $i = \sigma\kappa^{-1} \bmod 2m$, which implies $i \le 2m - 1$. In consequence of Theorem 7.8, when $i \ge m$, $a_{i|0}$ can be obtained in at most $m - 1$ clock cycles: If $i \ge m$, let $i = j + m$ for some integer $j \in Z_m$. From (7.37), after $j$ clock cycles we then get $a_{i|0} = \bar a_{j|0} \pmod{2^m+1}$.
Another way of reducing the number of shifts required in Step 1(b) is the following: Let $l = 2m - i$. Because the sequence $\{a_{i|a_0}\}_{i \in Z}$ is cyclic with period $2m$, we have $a_{i|a_0} = a_{2m-l|a_0} = a_{-l|a_0} \pmod{2^m+1}$. Hence, for $0 \le i \bmod 2m \le m$, $a_{i|a_0}$ can be obtained by rotating the $m$-bit NBC integer $a_{0|a_0}$ $i$ bits to the left (in the increased significant bits direction). As before, there is an inverter in the feedback loop. For $m < i \bmod 2m \le 2m - 1$, which implies $1 \le l \bmod 2m < m$, $a_{i|a_0} = a_{-l|a_0} \pmod{2^m+1}$ is obtained by rotating $a_{0|a_0}$ $l$ bits in the opposite direction (to the right). The bits in the feedback loop are inverted. This procedure requires either two feedback shift registers or just one register which can rotate its contents in both directions. In any case, the maximum number of shifts is $m$.
We preferably state the computational complexity of an algorithm in terms of the number of additions required to perform the algorithm. Here, all additions are carried out modulo $2^m+1$. We assume that the $i$ binary shifts of Step 1(b) (where $i$ can be maximised to $m - 1$) can be carried out as fast as one addition. With $\gamma = 3$, the above Step 2 requires at most $2^c = 2^{m-1}/m$ additions and about half as many additions on average. Step 3 can be carried out using a simplified adder.
Consequently, the algorithm for performing discrete exponentiation described in this section can be performed using at most
 
c
 
 
 
 
m
 
 
m
 
additions (modulo $2^m+1$). The algorithm requires about
 
c
 
 
 
 
 
 
m
 
 
m
 
additions modulo $2^m+1$ on average.
As mentioned in Section 7.2.1, the binary method for discrete exponentiation requires at most $2(m-1)$ multiplications. Assuming that a binary multiplication is computed using at most $m$ additions, the binary method requires at most $2m(m-1)$ additions. The average number of additions required is about
m
 
m
 
 
 
m
 
 
 
 
m
 
 
m
 
 
 
 
 .
 In Figure 6.4 of Section 6.3.1, we see that addition by one can simply be carried out using
a row of
m cascaded half adder elements.7.4. Properties of the
D
m Matrix 177
7.4.2 The Discrete Logarithm
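Theorem 7.9 Let $\beta$ be a nonzero integer of $Z_{2^m+1}$ and $d_\lambda = \beta - 1$, where $\lambda$ is the polar representation of $\beta$. Then, the discrete logarithm $\lambda \equiv \log_\omega \beta \pmod{2^m}$ can be obtained by first finding the position $(\rho, \kappa)$ in $D_m$ where $d_\lambda$ is located and then forming $\lambda$ as $\lambda = \rho + \kappa 2^c$.
Proof: For $1 \le \beta \le 2^m$, the integer $d_\lambda = \beta - 1$ is an element of $Z_{2^m}$. Then, by (7.1), (7.25) and (7.28) we have
$$\beta = 1 + d_\lambda \equiv \omega^\lambda \pmod{2^m+1}$$
where $\lambda \equiv Z\{d_\lambda\} \equiv \rho + \kappa 2^c \pmod{2^m}$. Also, from the definition of the matrix $D_m$ in (7.30) we know that the integer $d_\lambda$, which is associated with $\beta$, is located in row $\rho$ and column $\kappa$ of $D_m$. Hence, by finding the position $(\rho, \kappa)$ we directly obtain the desired discrete logarithm $\lambda$.
□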
By Theorem 7.9, the problem of computing the discrete logarithm $\lambda \equiv \log_\omega \beta \pmod{2^m}$ is equivalent to the problem of finding the position $(\rho, \kappa)$ in $D_m$ where $d_\lambda$ is located. One way of finding this position is to compute $d_{\lambda+1}$, $d_{\lambda+2}$, $d_{\lambda+3}$, etc., using (7.36) until, for some $i$, the integer $d_{(\kappa+1)2^c} \equiv a_{i|0} \pmod{2^m+1}$ of the top row $D_m(0)$ of $D_m$ is obtained. As described on page 172, each NBC integer of $D_m(0)$ is formed either by a block of zeros followed by a block of ones or vice versa. Hence, for $j = 1, 2, 3, \ldots$, the recursive computation of $d_{\lambda+j}$ from $d_{\lambda+j-1}$ progresses until such a binary word is detected. By (7.32), the subscript $i$ of $a_{i|0}$ equals
$$i = \bar{e}_0 m + n_e$$
where $e_0$ is the least significant bit of $a_{i|0}$, $\bar{e}_0$ is its complement, and $n_e$ is the number of bits in $a_{i|0}$ which are equal to $e_0$. Because the integer $d_{(\kappa+1)2^c} \equiv a_{i|0} \pmod{2^m+1}$ is in column $\kappa+1$ of $D_m$, it follows from Lemma 7.1 (Equation (7.31)) that $i \equiv (\kappa+1)\nu \pmod{2m}$. Consequently, $d_\lambda$ is in column
$$\kappa \equiv i\nu^{-1} - 1 \equiv (i - \nu)\nu^{-1} \pmod{2m}$$
of $D_m$. The row position $\rho$ is obtained from the number of recursions. This gives the desired discrete logarithm $\lambda = \rho + \kappa 2^c$.
The above procedure for computing the discrete logarithm $\lambda$ of $\beta$ is summarised in the following algorithm.
1. Let $d = \beta - 1$ and $j = 2^c$.
2. If $d \in D_m(0)$, goto Step 3.
Otherwise, let $j = j - 1$, compute the next $d$ using the recursive congruence in (7.36) (i.e. $d_{\mathrm{next}} \equiv (2d + 1) + d + 1 \pmod{2^m+1}$), and goto Step 2.
3. Let $\rho = j \bmod 2^c$.
Compute the subscript $i = \bar{e}_0 m + n_e$ of $a_{i|0} = d$.
Also, let $\kappa = i\nu^{-1} - 1 \pmod{2m}$.
Then we have $\lambda = \rho + \kappa 2^c$.
[Figure 7.3: diagram of the matrix $D_m$ showing the elements visited in Step 2, from $d_\lambda$ to the top-row element $a_{i|0}$.]
Figure 7.3: The procedure for computing the discrete logarithm using properties of the matrix $D_m$. Equation (7.36) is used in Step 2.
Footnote: Or in general Equation (7.25). However, as in Section 7.4.1, by choosing $\omega = 3$ the recursive congruence in (7.25) changes to (7.36).
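The row-walking procedure can be sketched in Python under stated assumptions: $\omega = 3$ is a primitive element of $Z_{2^m+1}$ (true for $m = 4, 8, 16$), $d_\lambda = \omega^\lambda - 1$, and $\nu$ is defined by $\omega^{2^c} \equiv 2^\nu \pmod{2^m+1}$. The function name and the dictionary used for the top-row test are hypothetical; a hardware realisation would detect the zeros/ones block pattern instead. The sketch forms $\lambda$ from $j$ directly rather than from $j \bmod 2^c$, so that an input already located in the top row is also covered.

```python
def dlog_by_row_walk(beta, m, omega=3):
    """Discrete logarithm of beta (1 <= beta <= 2^m) to the base omega, mod 2^m + 1."""
    F = (1 << m) + 1                   # assumed to be a Fermat prime
    rows = (1 << m) // (2 * m)         # 2^c rows of D_m
    g = pow(omega, rows, F)            # omega^(2^c) = 2^nu (mod F)
    nu = next(n for n in range(2 * m) if pow(2, n, F) == g)
    nu_inv = pow(nu, -1, 2 * m)        # nu is odd, hence invertible mod 2m
    # Top row of D_m: the 2m integers a_{i|0} = 2^i - 1 (mod F), keyed by value.
    top = {(pow(2, i, F) - 1) % F: i for i in range(2 * m)}

    d, j = beta - 1, rows              # Step 1
    while d not in top:                # Step 2: at most 2^c - 1 iterations
        j -= 1
        d = (omega * (d + 1) - 1) % F  # equals (2d + 1) + d + 1 for omega = 3
    i = top[d]                         # Step 3: subscript of a_{i|0} = d
    col_top = (i * nu_inv) % (2 * m)   # column of the top-row element reached
    return (j + (col_top - 1) * rows) % (1 << m)
```

For instance, with $m = 4$ the call `dlog_by_row_walk(9, 4)` returns 2, since $3^2 = 9$.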
In Figure 7.3 we illustrate which elements of the $D_m$ matrix are computed in the above algorithm.
The initial value of $d$ is computed in Step 1 using one (simplified) addition. In Step 2 we assume that the computation of $d$ and the checking whether $d$ is an element of $D_m(0)$ are concurrent operations. Then, in total at most $2^c - 1$ additions modulo $2^m+1$ are required in Step 2. Finally, assuming that the computational complexity of the derivation of $\kappa$ in Step 3 equals the complexity of performing one addition modulo $2^m+1$, the complete algorithm presented above for computing the discrete logarithm requires at most $2^c + 1 = 2^m/2m + 1$ additions modulo $2^m+1$. About $2^m/4m + 1$ additions are required on average.
Footnote: This may be possible only if $d_{\mathrm{next}}$ is computed using a carry ripple diminished-1 adder.
7.4.3 Zech’s Logarithm
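Theorem 7.10 Let $\lambda_1 = \rho_1 + \kappa_1 2^c$ for some integers $\rho_1$ and $\kappa_1$. The Zech logarithm of $\lambda_1$ can be obtained by first finding the integer $d_{\lambda_1}$, which is in position $(\rho_1, \kappa_1)$ of $D_m$, and then finding the position $(\rho_2, \kappa_2)$ of $D_m$ where $d_{\lambda_2} = d_{\lambda_1} + 1$ is located. Then we have $Z(\lambda_1) = \lambda_2 = \rho_2 + \kappa_2 2^c$.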
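Proof: By taking Zech's logarithm on both sides of (7.28) and letting $k = \lambda_1$ we get
$$Z(\lambda_1) \equiv Z\{d_{\lambda_1} + 1\} \pmod{2^m} \qquad (7.38)$$
Let $d_{\lambda_2} = d_{\lambda_1} + 1$. Then, again by (7.28) it follows that $Z\{d_{\lambda_1} + 1\}$ in (7.38) equals the subscript $\lambda_2 = \rho_2 + \kappa_2 2^c$ of $d_{\lambda_2}$. Consequently, we have
$$Z(\lambda_1) = \lambda_2 = \rho_2 + \kappa_2 2^c$$
where, by the definition of the matrix $D_m$ in (7.30), $(\rho_2, \kappa_2)$ is the position in $D_m$ where $d_{\lambda_2} = d_{\lambda_1} + 1$ is located. Again by the definition of $D_m$, for $\lambda_1 = \rho_1 + \kappa_1 2^c$, $d_{\lambda_1}$ is the integer located in position $(\rho_1, \kappa_1)$ of $D_m$.
□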
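The statement of Theorem 7.10 can be checked numerically with a small Python sketch. It assumes $\omega = 3$ and $d_\lambda = \omega^\lambda - 1$, and it replaces the matrix search by a plain dictionary from each integer $d_\lambda$ to its subscript; this brute-force look-up is an assumption of the sketch, not the procedure of Sections 7.4.1 and 7.4.2.

```python
def zech_log(lam1, m, omega=3):
    """Return Z(lam1) with omega^Z(lam1) = omega^lam1 + 1 (mod 2^m + 1), or None."""
    F = (1 << m) + 1
    # Map every integer d_lambda = omega^lambda - 1 to its subscript lambda.
    subscript = {(pow(omega, lam, F) - 1) % F: lam for lam in range(1 << m)}
    d1 = (pow(omega, lam1, F) - 1) % F   # d_{lambda_1}
    d2 = d1 + 1                          # d_{lambda_2} = d_{lambda_1} + 1
    # d2 = 2^m corresponds to omega^lam1 + 1 = 0, which has no logarithm.
    return subscript.get(d2)
```

The only subscript without a Zech logarithm is $\lambda_1 = 2^{m-1}$, for which $\omega^{\lambda_1} \equiv -1$.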
From Theorem 7.10 it follows that, using properties of the matrix $D_m$, we need both the procedure in Section 7.4.1 for discrete exponentiation and the procedure in Section 7.4.2 for the discrete logarithm in order to compute Zech's logarithms. This should be compared with the direct computation of the Zech logarithm, as expressed in (7.10), which also requires one discrete exponentiation and one discrete logarithm (and one addition by one) over $Z_{2^m+1}$.
Consequently, in any case we need one discrete exponentiation, one addition by one, and one discrete logarithm in order to compute a Zech logarithm. The number of additions modulo $2^m+1$ required for performing discrete exponentiation and computing the discrete logarithm are given at the end of Sections 7.4.1 and 7.4.2. In Table 7.2 we have listed these complexity numbers together with the resulting number of additions required for computing Zech's logarithm. For comparison, we have also listed the number of additions required when using the binary method for performing exponentiation and Pohlig-Hellman's algorithm for computing the discrete logarithm. These algorithms are described in Sections 7.2.1 (and 7.4.1) and 7.2.2.
In Figure 7.4 we have plotted the number of additions modulo $2^m+1$ required to perform discrete exponentiation (“Exp”) and compute the discrete loga-
[Figure 7.4: three panels, each plotting the number of additions versus $m$. Panel (a): Exp$_{av,1}$, Exp$_{av,2}$, Log$_{av,1}$, Log$_{av,2}$. Panel (b): Exp$_{max,1}$, Exp$_{max,2}$, Log$_{max,1}$, Log$_{max,2}$. Panel (c): Zech$_{av,1}$, Zech$_{av,2}$, Zech$_{max,1}$, Zech$_{max,2}$.]
Figure 7.4: The number of additions modulo $2^m+1$ required to perform discrete exponentiation and compute the discrete logarithm and Zech's logarithm, with respect to different algorithms (see Table 7.2). The functions are plotted versus $m$ for $m = 2, 4, 8, 16$.
Operation Algorithm Average Maximum
Discrete Algorithm in Sec. 7.4.1
 
m
 
m
 
 
 
m
 
m
 
 
expon. The binary method
m
 
 
m
 
 
 
 
 
m
 
m
 
 
 
Discrete Algorithm in Sec. 7.4.2
 
m
 
m
 
 
 
m
 
m
 
 
logarithm Pohlig-Hellman (P-H)
m
 
 
m
 
 
 
 
m
 
 
m
 
 
 
 
Zech’s Alg.s in Sec.s 7.4.1 & 7.4.2
 
m
 
m
 
 
 
m
m
 
 
logarithm Binary method & P-H
m
 
 
 
m
 
 
 
m
 
 
 
m
 
 
 
m
 
 
 
m
 
 
 
Table 7.2: The average and maximum number of additions modulo $2^m+1$ required to perform discrete exponentiation and compute the discrete logarithm and the Zech logarithm, with respect to the algorithms in Sections 7.4.1 and 7.4.2 and with respect to the binary method for exponentiation and Pohlig-Hellman's algorithm for computing the discrete logarithm.
rithm (“Log”) and Zech's logarithm (“Zech”). The functions plotted are the ones in Table 7.2. The subscript “1” refers to the algorithms in Sections 7.4.1 and 7.4.2 for performing exponentiation and computing the discrete logarithm, respectively. The subscript “2” refers to the binary method for exponentiation and to Pohlig-Hellman's algorithm for computing the discrete logarithm.
Figure 7.4(a) shows the average (“av”) number of additions required in the exponentiation and discrete logarithm algorithms. Figure 7.4(b) shows the maximum (“max”) number of additions required in the algorithms. We see that in general, the algorithms presented in Sections 7.4.1 and 7.4.2 are superior to the binary method and Pohlig-Hellman's algorithm, respectively. However, for $m = 16$ the binary method requires fewer additions than the algorithm in Section 7.4.1.
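The crossover for exponentiation can be reproduced with a few lines of Python. The sketch only uses the two maximum counts stated in Section 7.4.1, namely $2^m/2m + 2$ for the matrix-based algorithm and $2m(m-1)$ for the binary method; the function names are hypothetical.

```python
def max_additions_matrix_exp(m):
    """Maximum additions for the algorithm of Section 7.4.1: 2^m / 2m + 2."""
    return (1 << m) // (2 * m) + 2


def max_additions_binary_exp(m):
    """Maximum additions for the binary method: 2m(m - 1)."""
    return 2 * m * (m - 1)


for m in (2, 4, 8, 16):
    print(m, max_additions_matrix_exp(m), max_additions_binary_exp(m))
```

Only for $m = 16$ is the binary method bound (480) below the matrix-based bound (2050).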
Figure 7.4(c) shows both the average and maximum number of additions required to compute Zech's logarithm when using (“1”) the algorithms in Sections 7.4.1 and 7.4.2 and (“2”) the binary method and Pohlig-Hellman's algorithm. Again, we conclude that in general the fewest additions are required when using the algorithms in Sections 7.4.1 and 7.4.2. However, because the number of additions required by the binary method is relatively small for $m = 16$ (see the previous paragraph), the maximum number of additions required by our algorithms for $m = 16$ is greater than the number of additions required by the binary method together with Pohlig-Hellman's algorithm.
7.5 The Mirror Sequence $M_m$
From Figure 7.4 we conclude that, for $m \le 8$, the computational complexities (in terms of the required number of additions) of performing discrete exponentiation and computing the discrete logarithm and the Zech logarithm using the algorithms proposed in the previous section are relatively low. For $m = 16$, however, no significant reduction of the computational complexities is made, compared with conventional algorithms.
The number of additions required when performing a discrete exponentiation and computing a discrete logarithm is at most about $2^c = 2^m/2m$ for each operation. These additions derive in both cases from the recursive computation of $d_k$ from $d_{k-1}$ for some $k$. The mentioned additions along some column of the matrix $D_m$ can be avoided by using two look-up tables, one for exponentiation and one for the discrete logarithm.
7.5.1 Discrete Exponentiation Using a Look-Up Table
When using a look-up table, we can define an algorithm for performing discrete exponentiation, based on the algorithm proposed in Section 7.4.1 (see page 174), in the following way.
The table used has size $2^c \cdot m$ bits and it contains the integers from the leftmost column of the matrix $D_m$: For $0 \le \rho \le 2^c - 1$, location $\rho$ of the table contains $d_\rho$. For $\lambda = \rho + \kappa 2^c \in Z_{2^m}$, the integer $\beta = 1 + d_\lambda \equiv \omega^\lambda \pmod{2^m+1}$ can be computed in the following way:
1. Let $a_0 = d_\rho$, where $d_\rho$ is obtained from the look-up table at location $\rho$.
2. By Lemma 7.1 (Equation (7.31)) we have $d_\lambda = d_{\rho + \kappa 2^c} \equiv a_{i|d_\rho} \pmod{2^m+1}$, where $i = \kappa\nu \bmod 2m$. Thus, by loading an $m$-bit feedback shift register (which has an inverter in the feedback loop) with $d_\rho$ and rotating the contents of the register $i$ steps, the resulting contents of the register equals $d_\lambda$.
3. The desired result $\beta = d_\lambda + 1$ is obtained simply by adding a one to $d_\lambda$.
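The three steps can be sketched in Python, assuming $\omega = 3$, $d_\rho = \omega^\rho - 1$ and $\omega^{2^c} \equiv 2^\nu \pmod{2^m+1}$. All function names are hypothetical. For simplicity the sketch always rotates to the left (up to $2m-1$ steps) and does not apply the reduction of the number of shifts described in Section 7.4.1.

```python
def rotate_left_inv(x, m):
    """Left rotation of an m-bit word with an inverter in the feedback loop."""
    msb = (x >> (m - 1)) & 1
    return ((x << 1) & ((1 << m) - 1)) | (msb ^ 1)


def find_nu(m, omega=3):
    """nu such that omega^(2^c) = 2^nu (mod 2^m + 1)."""
    F = (1 << m) + 1
    g = pow(omega, (1 << m) // (2 * m), F)
    return next(n for n in range(2 * m) if pow(2, n, F) == g)


def build_exp_table(m, omega=3):
    """Leftmost column of D_m: location rho holds d_rho = omega^rho - 1."""
    F = (1 << m) + 1
    rows = (1 << m) // (2 * m)
    return [(pow(omega, rho, F) - 1) % F for rho in range(rows)]


def exp_with_table(lam, m, table, nu):
    """beta = omega^lam (mod 2^m + 1) from one look-up, shifts and one increment."""
    rows = len(table)
    rho, kappa = lam % rows, lam // rows
    a = table[rho]                            # Step 1: table look-up
    for _ in range((kappa * nu) % (2 * m)):   # Step 2: i = kappa * nu mod 2m shifts
        a = rotate_left_inv(a, m)
    return a + 1                              # Step 3: addition by one
```

For $m = 8$ the table holds $2^c = 16$ words of 8 bits, i.e. 128 bits in total.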
As described in Section 7.4.1, it is possible to bound the maximum number of required shifts in the above Step 2 to $m-1$. Consequently, we can perform a discrete exponentiation using one table look-up followed by at most $m-1$ binary shifts and an addition by one. Hence, the complexity of performing a discrete exponentiation can be considerably reduced, compared with the computational complexity obtained when using the procedure described in Section 7.4.1. This holds particularly for exponentiation in $Z_{2^{16}+1}$, i.e. for $m = 16$. The reduced computational complexity is achieved at the cost of the look-up table of size $(2^m/2m) \cdot m$ bits.
7.5.2 The Discrete Logarithm Using a Look-Up Table
When computing a discrete logarithm $\lambda \equiv \log_\omega \beta \pmod{2^m}$ using the algorithm proposed in Section 7.4.2, the most demanding part of the algorithm is the procedure for finding the row $\rho$ of the matrix $D_m$ in which $d_\lambda$ is located. Depending on $\rho$, this procedure may require up to about $2^c = 2^m/2m$ additions modulo $2^m+1$. However, these additions can be avoided by using a look-up table.
The look-up table considered here contains a number of $m$-bit subscripts $\lambda_k$ of $d_{\lambda_k}$. For each $c$-bit integer $\rho \in Z_{2^c}$, there is at least one $(m-c)$-bit integer $\kappa \in Z_{2^{m-c}}$ such that $\lambda_k = \rho + \kappa 2^c$ is stored in the table. In other words, there is at least one integer $d_{\lambda_k}$ in each row ($D_m(\rho)$) of $D_m$ for which its subscript $\lambda_k$ is stored in the table.
Notation 7.1 We denote by $\Delta$ the set of integers $d_{\lambda_k}$ whose respective subscripts $\lambda_k$ form the contents of the look-up table used for computing the discrete logarithm.
A table of minimum size, i.e. whose size equals $2^c \cdot m$ bits, where $c = m - \log_2 m - 1$, is formed in such a way that its associated set $\Delta$ only contains one element from each row of $D_m$. Hence, for each row of $D_m$ we preferably would like to find one suitable such element $d_{\lambda_k}$ that in a simple way maps to a unique entry of the look-up table. This table, which performs a one-to-one mapping from $d_{\lambda_k}$ to $\lambda_k$, is in a sense some kind of inverse table of the above table used for performing discrete exponentiation. However, each integer $d_\rho$ stored in the table for exponentiation originates from some row position $\rho$ in the leftmost column of $D_m$, while the various integers $d_{\lambda_k}$ (for which $\lambda_k$ is stored in the table) used here may originate from an arbitrary column position in $D_m$.
The discrete logarithm $\lambda = \rho + \kappa 2^c$ of a nonzero integer $\beta \equiv \omega^\lambda \pmod{2^m+1}$ can be computed using the look-up table of subscripts $\lambda_k$ in the following way:
We know that the integer $d_\lambda = \beta - 1$ is in column $\kappa$ and row $\rho$ of $D_m$. Starting from the integer $a_0 = d_\lambda$, we use (7.24) to successively calculate $a_{1|d_\lambda}$, $a_{2|d_\lambda}$, etc., until after $i$ successions we obtain an integer $a_{i|d_\lambda}$ which is an element of $\Delta$. The desired logarithm $\lambda = \rho + \kappa 2^c$ can now be formed using the associated table output $\lambda_k = \rho' + \kappa' 2^c$, which is the subscript of $d_{\lambda_k} = a_{i|d_\lambda}$. Because $d_\lambda$ and $d_{\lambda_k}$ are in the same row of $D_m$ we get $\rho = \rho'$. We therefore have
$$d_{\lambda_k} = d_{\lambda + j2^c} = d_{\rho + (\kappa + j)2^c} \pmod{2^m+1}$$
for some integer $j$, and hence $\kappa' \equiv \kappa + j \pmod{2m}$, i.e. $\kappa \equiv \kappa' - j \pmod{2m}$. Because we also have $d_{\lambda_k} = a_{i|d_\lambda}$ it follows by (7.31) that $i \equiv j\nu \pmod{2m}$, i.e. $j \equiv i\nu^{-1} \pmod{2m}$, where $\nu^{-1}$ is the multiplicative inverse of $\nu$ modulo $2m$. Consequently, we obtain
$$\kappa \equiv \kappa' - i\nu^{-1} \pmod{2m}.$$
Thus, the above procedure for computing the discrete logarithm $\lambda \equiv \log_\omega \beta \pmod{2^m}$ using a look-up table can be summarised in the following algorithm.
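1. Let $a_0 = \beta - 1$ and $i = 0$.
2. If $a_{i|a_0} \in \Delta$, goto Step 3.
Otherwise, let $i_{\mathrm{next}} = i + 1$, compute $a_{i|a_0}$ from $a_{i-1|a_0}$ using (7.24), and goto Step 2.
3. Perform the mapping from $d_{\lambda_k} = a_{i|a_0}$ to a table address and read $\lambda_k = \rho' + \kappa' 2^c$ from the look-up table.
4. Let $\rho = \rho'$ and compute $\kappa = \kappa' - i\nu^{-1} \pmod{2m}$.
Then we have $\lambda = \rho + \kappa 2^c$.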
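A minimal Python sketch of these four steps follows, assuming $\omega = 3$, $d_\lambda = \omega^\lambda - 1$ and $\omega^{2^c} \equiv 2^\nu \pmod{2^m+1}$. The representative stored for each row is simply the smallest integer of that row; this choice, like the function names, is an assumption made for the illustration, and a dictionary stands in for the address mapping of Step 3.

```python
def rotate_left_inv(x, m):
    """Left rotation of an m-bit word with an inverter in the feedback loop."""
    msb = (x >> (m - 1)) & 1
    return ((x << 1) & ((1 << m) - 1)) | (msb ^ 1)


def find_nu_inverse(m, omega=3):
    """Inverse modulo 2m of nu, where omega^(2^c) = 2^nu (mod 2^m + 1)."""
    F = (1 << m) + 1
    g = pow(omega, (1 << m) // (2 * m), F)
    nu = next(n for n in range(2 * m) if pow(2, n, F) == g)
    return pow(nu, -1, 2 * m)


def build_log_table(m, omega=3):
    """One entry per row of D_m: smallest d_{lambda_k} of the row -> lambda_k."""
    F = (1 << m) + 1
    rows = (1 << m) // (2 * m)
    table = {}
    for rho in range(rows):
        row = {(pow(omega, rho + kappa * rows, F) - 1) % F: rho + kappa * rows
               for kappa in range(2 * m)}
        d = min(row)
        table[d] = row[d]
    return table


def log_with_table(beta, m, table, nu_inv):
    """Discrete logarithm of beta (1 <= beta <= 2^m) to the base omega."""
    rows = (1 << m) // (2 * m)
    a, i = beta - 1, 0                          # Step 1
    while a not in table:                       # Step 2: at most 2m - 1 shifts
        a, i = rotate_left_inv(a, m), i + 1
    lam_k = table[a]                            # Step 3
    rho, kappa_k = lam_k % rows, lam_k // rows
    kappa = (kappa_k - i * nu_inv) % (2 * m)    # Step 4
    return rho + kappa * rows
```

Because every row contains exactly one stored representative, the loop in Step 2 always terminates within one period of $2m$ shifts.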
When using this algorithm to compute the discrete logarithm we need an addition by one (Step 1), at most $2m-1$ binary shifts (Step 2), one table look-up (Step 3), and one multiplication by $\nu^{-1}$ and one addition modulo $2m$ (Step 4). Hence, the computational complexity of computing the discrete logarithm using the above algorithm is considerably reduced compared with the computational complexity of the corresponding algorithm described in Section 7.4.2. A similar reduction in complexity was obtained for exponentiation in Section 7.5.1, again at the cost of a look-up table of size at least $2^c \cdot m$ bits.
In Step 2, we also need to check (at most $2m-1$ times) whether $a_{i|a_0}$ is an element of $\Delta$. The complexity of such a check and the complexity of obtaining the table address from $d_{\lambda_k}$ depend strongly on the binary representation of the integers of $\Delta$. The ideal situation would of course be if the integers of $\Delta$ were consecutive numbers. The problem of finding a suitable set $\Delta$ is considered in the following sections.
7.5.3 The Mirror Properties of $M_m$
The set $\Delta$ of integers $d_{\lambda_k}$ from the matrix $D_m$ was introduced in Notation 7.1 of the previous section. In the remainder of Section 7.5 we consider the problem of finding a suitable set $\Delta$ such that the elements of $\Delta$ can be analytically described in a simple way and such that we obtain a straightforward mapping from each $d_{\lambda_k}$ to its associated table entry. The forming of the set $\Delta$ is based on the properties of an integer sequence $M_m$:
Footnote: For exponentiation, the size of the table used is exactly $2^c \cdot m$ bits.
Definition 7.5 Let $j \in Z_{2^m}$ and let $\mu_j$ be equal to the number of the row of $D_m$ which contains the integer $j$. We refer to the sequence
$$M_m = (\mu_0, \mu_1, \ldots, \mu_j, \ldots, \mu_{2^m-1})$$
which has length $2^m$, as the mirror sequence associated with $D_m$.
From the definition follows that if $j = d_k$ for some $k = \rho + \kappa 2^c$ then $\mu_j = \rho$. Consequently, each row number $\rho \in Z_{2^c}$ of $D_m$ appears $2m$ times in $M_m$. The following theorem describes the main connection between the subscripts of the $2m$ integers $\mu$ of $M_m$ which are all located in the same row of $D_m$.
Theorem 7.11 Let, for some $j \in Z_{2^m}$, the integer $\mu_j$ be an element of the sequence $M_m$. Then, for any $i \in Z$, we have
$$\mu_{a_{i|j}} = \mu_j.$$
Proof: For $a_0 = j \in Z_{2^m}$ and $\rho \in Z_{2^c}$, it follows by Theorem 7.5 that the $2m$ integers $j$, $a_{1|j}$, $a_{2|j}, \ldots, a_{2m-1|j}$ form the set of all elements in one of the rows $D_m(\rho)$ of $D_m$. Consequently, from Definition 7.5 it follows that any two elements $\mu_g$ and $\mu_h$ with subscripts $g, h \in D_m(\rho)$ are equal.
□
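Corollary 7.2 For $j \in Z_{2^m}$, we have
$$\mu_{2^m-1-j} = \mu_j \qquad (7.39)$$
$$\mu_{2^{m-1}-1-j} = \mu_{2j} \qquad (7.40)$$
Proof: The equalities follow by choosing some appropriate subscripts $i$ in Theorem 7.11 and then using (7.24).
For $i = m$ and $a_0 = j$ we get
$$a_{m|j} \equiv 2^m(j+1) - 1 \equiv 2^m - 1 - j \pmod{2^m+1}$$
and hence we have $\mu_{2^m-1-j} = \mu_j$.
For $i = m-1$ and $a_0 = 2j$ we get
$$a_{m-1|2j} \equiv 2^{m-1}(2j+1) - 1 \equiv 2^{m-1} - 1 - j \pmod{2^m+1}$$
and hence we have $\mu_{2^{m-1}-1-j} = \mu_{2j}$.
□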
By (7.39) of Corollary 7.2 it follows that the contents of the second half (the elements in positions $2^{m-1}$ to $2^m-1$) of the sequence $M_m$ is a mirror image of its first half (the elements in positions 0 to $2^{m-1}-1$). Furthermore, (7.40) reveals another kind of distributed “mirror” property of $M_m$: It follows from (7.40) that the sequence $\mu_0, \mu_2, \mu_4, \ldots, \mu_{2^m-2}$ (i.e. we take every second element of $M_m$, starting from $\mu_0$) equals the sequence $\mu_{2^{m-1}-1}, \mu_{2^{m-1}-2}, \mu_{2^{m-1}-3}, \mu_{2^{m-1}-4}, \ldots, \mu_0$ (i.e. we take every consecutive element of $M_m$, starting from $\mu_{2^{m-1}-1}$ and going in the opposite direction). It is mainly because of these and other similar properties of $M_m$ that we refer to $M_m$ as a mirror sequence.
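Both mirror properties can be checked numerically. The following Python sketch assumes $\omega = 3$ and that the integer $j = \omega^\lambda - 1$ lies in row $\lambda \bmod 2^c$ of $D_m$; the function name is hypothetical.

```python
def mirror_sequence(m, omega=3):
    """Return M_m as a list: mu[j] is the row of D_m that contains the integer j."""
    F = (1 << m) + 1
    rows = (1 << m) // (2 * m)        # 2^c
    mu = [0] * (1 << m)
    x = 1                             # omega^0
    for lam in range(1 << m):
        mu[x - 1] = lam % rows        # j = omega^lam - 1 lies in row lam mod 2^c
        x = (x * omega) % F
    return mu


mu = mirror_sequence(8)
n = len(mu)
# Property (7.39): the second half mirrors the first half.
assert all(mu[n - 1 - j] == mu[j] for j in range(n))
# Property (7.40): every second element equals the first half read backwards.
assert all(mu[n // 2 - 1 - j] == mu[2 * j] for j in range(n // 2))
```

With these assumptions, `mirror_sequence(8)` also reproduces the two values $\mu_{16} = 8$ and $\mu_{17} = 2$ quoted for $M_8$.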
In Figure 7.5 we show the structure of the first half $(\mu_0, \mu_1, \ldots, \mu_{127})$ of the sequence $M_8$, in which each row number appears $m$ times. We have plotted the row numbers versus their respective positions in $M_8$ in the form of checkerboard plots. The 128 cell columns of the two checkerboard plots are associated with the 128 first positions (0 to 127) of $M_8$. In each column (position) there is only one cell that is black. The row number $\mu$ associated with the black cell in column $p$ equals the element $\mu_p$ of $M_m$. For example, the black cells in columns 16 and 17 are located in the cell rows which are numbered 8 and 2, respectively. This implies $\mu_{16} = 8$ and $\mu_{17} = 2$.
Figures 7.5(a) and (b) differ only in the ordering of the cell rows. In Figure 7.5(a), where the rows appear in a natural increasing order, the checkerboard plot does not seem to reveal any particular structure. In Figure 7.5(b), however, it is quite easy to identify the mirror properties of $M_m$. The ordering of the rows of the checkerboard plot in Figure 7.5(b) is based on the following rule:
1. Consider the positions $0, 1, 2, \ldots, 2^{m-1}-1$ in $M_m$.
2. Pick row 0 (zero) as the first row.
The positions $0, 1, 3, 7, \ldots, 2^{m-1}-1$ in the first half of $M_m$ which contain zero are now ruled out.
3. Select the row number which is contained in the foremost position of $M_m$ that is not ruled out. This row is the next row of the checkerboard.
4. All positions in $M_m$ which contain the last selected row number are now ruled out. Repeat from Step 3 until all rows have been selected.
By following this rule, the checkerboard plots of $M_4$, $M_8$, and $M_{16}$ are all of the same type as the plot in Figure 7.5. In fact, it shows that the sequences $M_4$, $M_8$, and $M_{16}$ are special cases of a more general class of mirror sequences. The sequences in this class all have checkerboard plots of the same type as the one in Figure 7.5(b). Also, the general class, which is not further considered here, contains one such sequence of length $2^m$ for every positive integer $m$.
Footnote: This follows from the first “mirror” property (Equation (7.39)).
[Figure 7.5: two checkerboard plots of row number $\mu$ versus position $p$, with position ticks 0, 16, 32, 48, 64, 80, 96, 112, 127. Panel (a): rows labelled 0, 1, 2, ..., 15. Panel (b): rows labelled 0, 1, 7, 5, 2, 4, 10, 8, 13, 6, 12, 14, 3, 11, 15, 9.]
Figure 7.5: Checkerboard plots of the first half of the sequence $M_8$.
7.5.4 Finding the Unique Distinct Positions in $M_m$
From the definition of the mirror sequence $M_m$ we have the following:
The problem of finding one unique matrix element in each row of $D_m$, as described in Section 7.5.2, is equivalent to finding $2^c$ positions in $M_m$ such that the contents in these positions are the $2^c$ distinct row numbers of $D_m$ which form the set $\Delta$.
This set $\Delta$ of $2^c$ distinct row numbers of $D_m$ was introduced in Notation 7.1 (page 184). Note that for $m = 1$ and $m = 2$, we have $2^c = 2^m/2m = 1$, which means that $D_1$ and $D_2$ are row vectors (i.e. $M_1$ and $M_2$ only contain the integer zero). Therefore, the results in the remainder of Section 7.5 are valid only for $m \ge 4$. When deriving the set $\Delta$ we need the following mapping:
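Definition 7.6 Let $P = \{p_0, \ldots, p_{n-1}\}$ be a set of $n$ integers $p_j \in Z_{2^m}$ for $j = 0, 1, \ldots, n-1$. The mapping $f^{(i)} : P \to P_i$ is defined as
$$f^{(i)} : P = \{p_0, \ldots, p_{n-1}\} \to P_i = 2^i(P+1) - 1 = \{a_{i|p_0}, \ldots, a_{i|p_{n-1}}\}$$
where $a_{i|p_j} \equiv 2^i(p_j+1) - 1 \pmod{2^m+1}$.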
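Theorem 7.12 Let $P_i = f^{(i)}(P)$ and let $M_P$ denote the set $\{\mu_{p_0}, \ldots, \mu_{p_{n-1}}\}$ of elements from the sequence $M_m$. Then
$$M_{P_i} = M_P$$
where $M_{P_i} = \{\mu_{a_{i|p_0}}, \ldots, \mu_{a_{i|p_{n-1}}}\}$.
Proof: By Theorem 7.5, the integers $p_j$ and $a_{i|p_j}$ are located in the same row of the matrix $D_m$. Therefore, from Definition 7.5 it follows that $\mu_{p_j}$ equals $\mu_{a_{i|p_j}}$. Hence, for $j = 0, 1, \ldots, n-1$, we have
$$M_P = \{\mu_{p_0}, \ldots, \mu_{p_{n-1}}\} = \{\mu_{a_{i|p_0}}, \ldots, \mu_{a_{i|p_{n-1}}}\} = M_{P_i}.$$
□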
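Theorem 7.12 can be illustrated with a short Python sketch. It assumes $\omega = 3$ and the row assignment $\mu_j = \lambda \bmod 2^c$ for $j = \omega^\lambda - 1$; both helper names are hypothetical, and the positions in the example are arbitrary.

```python
def mirror_sequence(m, omega=3):
    """mu[j] = row of D_m containing the integer j (j = omega^lam - 1)."""
    F = (1 << m) + 1
    rows = (1 << m) // (2 * m)
    mu = [0] * (1 << m)
    x = 1
    for lam in range(1 << m):
        mu[x - 1] = lam % rows
        x = (x * omega) % F
    return mu


def f_map(i, P, m):
    """The mapping f^(i): every p in P is sent to a_{i|p} = 2^i (p + 1) - 1."""
    F = (1 << m) + 1
    factor = pow(2, i % (2 * m), F)    # 2 has order 2m modulo 2^m + 1
    return [(factor * (p + 1) - 1) % F for p in P]


m = 8
mu = mirror_sequence(m)
P = [16, 17, 90]
rows_of_P = [mu[p] for p in P]
# The row numbers are unchanged by f^(i) for every shift count i.
assert all([mu[q] for q in f_map(i, P, m)] == rows_of_P for i in range(2 * m))
```

Since $f^{(i)}$ only multiplies $p + 1$ by a power of two, it moves each position along its own row of $D_m$, which is why the row numbers are preserved.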
Using the above notations, the goal is to find a set $\Delta = \{\delta_0, \delta_1, \ldots, \delta_{n-1}\}$ of positions such that $M_\Delta = \{0, 1, \ldots, 2^c-1\}$ and such that the size $n \ge 2^c$ of $\Delta$ is preferably equal to $2^c$.
Definition 7.7 The ordered set $\{s_0, s_0+t, s_0+2t, s_0+3t, \ldots, s_{n-1}\}$ of integers from $Z_{2^m}$ is denoted by $\{s_0, \ldots, s_{n-1}\}_t$.
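Lemma 7.2 Every integer of the set $\{0, 1, \ldots, 2^{m-2}-1\} \subset Z_{2^m+1}$ maps, by $f^{(i)}$, into the set $\Theta = \{2^{m-2}, \ldots, 2^{m-1}-1\}$ of size $2^{m-2}$.
Proof: First, we have $f^{(m-1)}(\{0\}) = a_{m-1|0} \equiv 2^{m-1}(0+1) - 1 = 2^{m-1} - 1 \pmod{2^m+1}$, which is an element of $\Theta$. For $i = 0, 1, \ldots, m-3$, we then apply the mapping $f^{(m-2-i)}$ on the set $P = \{2^i, \ldots, 2^{i+1}-1\}$, which gives the set
$$P_{m-2-i} = 2^{m-2-i}(P+1) - 1 = \{2^{m-2} + 2^{m-2-i} - 1, \ldots, 2^{m-1}-1\}_{2^{m-2-i}} \pmod{2^m+1}.$$
It is obvious that $P_{m-2-i} \subset \Theta$.
□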
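The shift counts used in this proof can be verified exhaustively for small $m$. The Python sketch below picks the exponent exactly as in the proof ($m-1$ for position 0, and $m-2-i$ for a position in $\{2^i, \ldots, 2^{i+1}-1\}$); the function name is hypothetical.

```python
def shift_into_theta(j, m):
    """Return (i, f^(i)(j)) with the shift count i chosen as in the proof."""
    F = (1 << m) + 1
    if j == 0:
        i = m - 1
    else:
        i = m - 2 - (j.bit_length() - 1)   # j lies in {2^t, ..., 2^(t+1) - 1}
    return i, (pow(2, i, F) * (j + 1) - 1) % F


m = 8
lower, upper = 1 << (m - 2), (1 << (m - 1)) - 1   # Theta = {64, ..., 127}
images = [shift_into_theta(j, m)[1] for j in range(1 << (m - 2))]
assert all(lower <= x <= upper for x in images)
```

No modular reduction actually takes place here, because $2^i(j+1) - 1$ stays below $2^{m-1}$ for the chosen $i$.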
Theorem 7.13 Let
M
 
 
 
f
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
g be the set of row num-
bers which is associated with the set
 
 
f
 
m
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
g of
 
m
 
 
positions in
M
m. Then, we have
M
 
 
f
 
 
 
 
 
 
 
 
 
c
 
 
g
 
Z
 
c
 
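Proof: We prove the theorem by showing that all positions $\theta \in Z_{2^m}$ map, via $f^{(i)}$, into the set $\Omega$. We know by (7.39) that the second half of $M_m$ is a mirror image of its first half. This implies (see the proof of Corollary 7.2)
$$f^{(m)}(\{2^{m-1}, \ldots, 2^m-1\}) = \{0, \ldots, 2^{m-1}-1\} \pmod{2^m+1}.$$
From Lemma 7.2 we get that all positions in $\{0, \ldots, 2^{m-2}-1\}$ map into $\Theta$. Hence, for every $\theta \in \{0, \ldots, 2^{m-2}-1\} \cup \{2^{m-1}, \ldots, 2^m-1\}$, the row number $\mu_\theta$ is contained in $M_\Theta$.
Now, only the $2^{m-2}$ positions of $\Theta = \{2^{m-2}, \ldots, 2^{m-1}-1\}$ remain. We partition this set into $m-1$ disjoint subsets $\Theta_i$ in the following way. Let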
 
 
S
m
 
 
i
 
 
 
i,s u c ht h a t
 
i
 
 
 
 
 
 
i
 
 
f
 
m
 
 
 
i
 
 
 
 
 
 
m
 
 
 
i
 
 
g
 
 
i
  for
 
 
i
 
m
 
 
 
m
 
 
  for
i
 
m
 
 
 
where
 
i
 
 
 
i
 
 
X
n
 
 
 
 
 
 
 
n
 
 
 
 
n
 
 
 
 
 
 
 
 
 
 
i
 
 
i
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
i
 
 
 
  if
i is even
 
i
 
 
 
  if
i is odd
For
i
 
 
 
 
 
 
 
 
 
m
 
 ,t h es e t
 
i contains
 
m
 
 
 
i elements. The set
 
m
 
  only
contains one element. For
 
 
i
 
m
 
 , we have
f
 
i
 
m
 
 
 
 
 
 
 
 
i
 
 
 
i
 
m
 
 
 
 
 
 
 
i
 
 
 
 
m
 
 
 
i
 
 
 
 
 
 
m
 
 
 
i
 
 
 
 
 
i
 
 
 
 
 
 
 
 
m
 
i
 
 
m
 
 
 
i
 
 
 
 
 
 
m
 
 
 
i
 
 
 
 
 
 
m
 
i
 
 
 
 
 
 
 
 
 
 
 
i
 
 
m
 
 
 
 
m
 
 
 
i
 
 
 
 
m
o
d
 
m
 
 
 
 
If
i is even, we get
f
 
i
 
m
 
 
 
 
 
 
 
 
i
 
 
 
 
m
 
 
 
i
 
 
 
 
 
 
m
 
 
 
i
 
 
 
 
 
 
 
m
 
 
 
 
m
 
 
 
i
 
 
 
m
 
 
 
m
 
 
 
 
m
 
 
 
i
 
 
 
 
 
 
 
m
 
 
 
 
m
 
 
 
i
 
 
 
 
 
m
o
d
 
m
 
 
 
 
It can easily be checked that this set of integers is a subset of
 .
If
i is odd, we get
f
 
i
 
m
 
 
 
 
 
 
 
 
i
 
 
 
 
 
m
 
 
 
i
 
 
 
 
 
 
m
 
 
 
i
 
 
 
 
 
m
 
 
 
 
m
 
 
 
i
 
 
 
 
 
 
m
 
 
 
 
m
 
 
 
i
 
 
 
 
 
 
 
m
 
 
 
 
m
 
 
 
i
 
 
 
 
 
m
o
d
 
m
 
 
 
 
which is also a subset of
 .
Furthermore, for
i
 
m
 
 , the mapping
f
 
i
 
m
 
 
 
 
 
 
 
 
i
  equals
f
 
 
m
 
 
 
 
m
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
m
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
m
o
d
 
m
 
 
 
 
which is an integer of $\Omega$. Hence, we have
f
 
i
 
m
 
 
 
 
 
 
 
 
i
 
 
 for
 
 
i
 
m
 
 ,
which means that for every
 
 
 
 
S
m
 
 
i
 
 
 
i,t h er o wn u m b e r
 
  is contained
in
M
 .
￿
[Figure 7.6: checkerboard plot of subset $\Theta_i$ versus position, with position ticks 64, 70, 74, 90, 106, 127.]
Figure 7.6: Checkerboard plots of the contents $\{64, 65, \ldots, 127\}$ of $\Theta = \bigcup_{i=0}^{m-2} \Theta_i$ for $m = 8$.
The conception of the partitioning of $\Theta$ into the disjoint subsets $\Theta_i$ in the above proof may at first be difficult to grasp. We prefer not to go into detail here about the forming of these subsets. However, the checkerboard plot of the contents of $\Theta$ in Figure 7.6 may give some insight into the partitioning of $\Theta$. The black cells in a cell row indicate which positions are contained in the subset $\Theta_i$ which is associated with that particular cell row. For example, we see that
 
 
is formed by the positions 90 and 122.
Definition 7.8 If an element appears several times in a set, we say that the extra elements are redundant. By the relative redundancy of a set we mean the ratio of the redundant elements to the total number of elements in the set.
For $m \ge 4$, the relative redundancy of the set $M_\Omega$ equals $(2^{m-3} - 2^c)/2^{m-3} = (m-4)/m$. Then, for $m = 4$ the relative redundancy equals zero, which means
that
 
 
f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
g
 
f
 
 
 
g is such a set
 
  of unique positions
that we are looking for. The integers
 
 
 
 
 
 
 
 
  and
 
 
 
 
 
 
 
 
  of
 
 
 
 
are the only 4-bit NBC integers whose three most signiﬁcant bits equal
 
 
 
 
 
 .
Therefore,inStep2onpage185,thecheckingwhether
a
i
j
a
  is anintegerof
 
 is
performed by checking whether the three most signiﬁcant bits of
a
i
j
a
  equals
 
 
 
 
 
 . In Step 3 on the same page, the least signiﬁcant bit of
a
i
j
a
  can be used
to address the look-up table. If
a
i
j
a
 
 
d
 
k
 
 
 
  , the subscript
 
k
  is read from
table location 0 and if
a
i
j
a
 
 
d
 
k
 
 
 
  , the subscript
 
k
is read from table location 1.
For $m = 8$ and $m = 16$ the relative redundancy is $1/2$ and $3/4$, respectively. It is desirable to reduce these redundancies even more. We therefore further reduce the set $M_\Omega$.
Theorem 7.14 Let
 
 
 
 
 
 
 
 
 ,w h e r e
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
are disjoint subsets of
 . Also, let
M
 
 
  be the set of
 
m
 
 
 
 integers of
M
m in the
positions given by the elements of
 
 
 . Then, for
m
 
 and
m
 
 
  ,w eh a v e
M
 
 
 
 
f
 
 
 
 
 
 
 
 
 
c
 
 
g
 
Z
 
c
 
Proof: Let
 
 
S
 
i
 
 
 
i,w h e r e
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
and where
 
  and
 
  are deﬁned in the theorem. The equality
M
 
 
 
 
Z
 
c in
Theorem 7.14 holds if (and only if) the integers in the sets
 
  and
 
  map, by
f
 
i
 ,i n t o
 
  and/or
 
  for some
i.L e t
 
 
 
S
 
j
 
 
 
 
 
j,w h e r e
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
are disjoint subsets of
 
 . Then, using the function
f
 
m
 
 
 ,w eg e t
f
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
f
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
f
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
where clearly
f
 
m
 
 
 
 
 
 
 
 
  is a subset of
 
 ,
f
 
m
 
 
 
 
 
 
 
 
  is a subset of
 
 ,a n d
f
 
m
 
 
 
 
 
 
 
 
  is a subset of
 
 . We again partition
f
 
m
 
 
 
 
 
 
 
 
  into three dis-
joint subsets, say
 
 
 
 
 
 ,
 
 
 
 
 
 ,a n d
 
 
 
 
 
 , in the same way as we did with
 
 .
Then, we get
f
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 ,
f
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 ,a n d
f
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 .
This process of partitioning
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  into three disjoint subsets which
map (by
f
 
m
 
 
 )i n t o
 
 ,
 
 ,a n d
 
 , respectively, is repeated until only two
integersremain. One of theseintegersmapinto
 
  andthe other integermaps
into
 
 . Hence, for every integer
 
 
 
  there is a positive integer
i such that
f
 
i
 
m
 
 
 
 
 
 
 
 
 
 
 .
In order to show that the set
 
  can be mapped into
 
 
 
 
 
 
 
  we parti-
tion it into a number of disjoint subsets which map into
 
 
  in different ways.
Thisapproachis usedbothin theproofofTheorem7.13andinthe aboveproof
that
 
  maps into
 
 
 .I ts h o w st h a t
 
  can be partitioned into
m
 
  disjoint
subsets, say
 
 
 
 ,
 
 
 
 
 
 
 
 ,
 
 
 
m
 
 ,s u c ht h a t
f
 
j
 
 
 
 
 
 
 
j
 
 
 
 
 
 
 
 
 
where
 
 
j
 
m
 
 . We have not yet been able to express the subsets
 
 
 
j
analytically, but they can easily be obtained as follows.
 
 
1. Let
j
 
 and
P
 
 
f
 
 
 
 
 
 
 .
2. Let
j
 
j
 
 and
P
j
 
 
P
j
 
 
 
 
 
m
o
d
 
m
 
 
  .
3. Let
 
 
 
j be equal to the set of integers in
P
j which are also
contained in
 
 
 
 
 
 .
Let
P
j
 
P
j
n
 
 
 
j.
4. If
P
j
 
 
  (i.e. if
j
 
m
 
 ), goto Step 2.
Otherwise, stop.
Hence, for every integer
 
 
 
there is an integer
i
 
 
 
i
 
m
 
such that
f
 
i
 
 
 
 
 
 
 
 
 
 
 . As shown above, the integers of
 
  in turn map into
 
 
. Consequently, every integer of
 
 
 
 
  maps into
 
 
 , which means that
M
 
 
 
 
M
 
 
Z
 
m.
□
Because
 
  contains
 
 
m
 
 
 
 
 
 
  elements and
 
  contains
 
 
m
 
 
 
 
 
 
elements, their union
 
 
 
 
 
 
 
 
  contains
 
 
m
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
m
 
 
 
 
elements. Therefore, the relative redundancy of the set
M
 
 
  equals
 
m
 
 
 
 
 
 
m
 
 
m
 
m
 
 
 
 
 
m
 
 
m
 
 
 
m
 
 
m
 
 
 
 
 
 
 
 
 
 
 
  for
m
 
 
 
 
 
 
 
 
 
 
 
 
  for
m
 
 
 
 
 
 For example, for
m
 
 we have
 
 
 
 
 
f
 
 
 
 
 
 
 
 
g,
 
 
 
 
 
f
 
 
 
 
 
g,
 
 
 
 
 
f
 
 
 
 
 
g,
and
 
 
 
 
 
f
 
 
 
 
 
 
 
 
g.
In order to store
 
m
 
 
 
 integers of
Z
 
m, we need a table of size
 
m
 
 
 
m
bits. Unfortunately, almost half of the locations in such a table would not be
used. In the next section we show that one of the elements in the set
M
 
 
 
is dispensable. This property makes it possible to reduce the size of the table
needed to
 
m
 
 
 
m bits.
7.5.5 Addressing the Look-Up Table for Discrete Logarithm
The sets
 
 
 
n
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
o
and
 
 
 
n
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
o
can be viewed as row vectors of
m-bit NBC integers:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
. . .
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
. . .
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
We see that the four most signiﬁcant bits in every NBC integer of
 
  and
 
 
are equal to
 
 
 
 
 
 
  and
 
 
 
 
 
 
, respectively. Hence, arbitrary integers
p
 
 
 
 
and
p
 
 
 
can be written on the forms
p
 
 
 
m
 
 
 
p
 
m
 
 
 
  and
p
 
 
 
m
 
 
 
 
m
 
 
 
p
 
m
 
 
 
, where
p
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
p
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
are
 
m
 
 
 -bit NBC integers. Let
 
 
  denote the set formed by the one’s com-
plements of the
 
m
 
 
 -bit NBC integers of
 
 
 
 
 
m
o
d
 
m
 
 , i.e. we have
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
f
 
m
 
 
 
 
g
 
m
o
d
 
m
 
 
 
 
Hence, we have
 
 
 
 
 
 
 
 
f
 
 
 
 
 
 
 
m
 
 
 
 
g with only one redundant element
– the integer
 
m
 
 
 
  appears twice in the union set
 
 
 
 
 
 
 . Fortunately, we
can allow this overlap. For any
p
 
 
 
, let
q
 
 
p
 
m
 
 
 
 
 
  . Then, we have
q
 
m
 
 
 
 
 
 
m
 
 
 
 
 
q
 
m
 
 
 
 
 
 
 
 . Using (7.24) we can write
p
 
 
 
m
 
 
 
on
the form
p
 
 
 
m
 
 
 
 
 
 
 
 
 
 
a
m
 
 
j
a
 
 
m
o
d
 
m
 
 
, where
a
 
 
d
 
 
  .
Hence, we have
p
 
 
d
 
k
for some
 
k
 
 
 
 
 
 
 
 
 
 
c, where
 
 
 
 
 (i.e.
p
  is in row
zero (the top row) of the matrix
D
m). By (7.31) in Lemma 7.1 we also have
p
 
 
d
 
k
 
 
a
 
 
 
 
 
j
 
 
m
o
d
 
m
 
 
 
 
which consequently implies
 
 
 
 
 
 
m
 
 
 
m
o
d
 
m
 . From this congruence
we get
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
m
o
d
 
m
. Hence, for
p
 
 
d
 
k
 
 
 
m
 
 
 
, the
subscript
 
k
 
 
 
m
 
 
 
 
 
c
 
m
o
d
 
m
  is stored in location
p
 
m
 
 
 
 
 
 
m
 
 
 
  of
the look-up table.
Now, consider the integer
p
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
a
m
 
 
j
a
 
 
m
o
d
 
m
 
 
, where
a
 
 
d
 
 
  , and which maps to the same table entry
as
p
  (note that here we generally use the primitive element
 
 
 when com-
puting
d
k in (7.25)). Thus,
p
  is in row
 
 
 
 
 of
D
m. Again, for
 
k
 
 
 
 
 
 
 
 
 
 
c,
we have
p
 
 
d
 
k
 
 
a
 
 
 
 
 
j
 
 
m
o
d
 
m
 
 
 
by (7.31). This gives
 
 
 
 
 
 
m
 
 
 
m
o
d
 
m
, from which we get
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
m
o
d
 
m
. For
m
 
 and
m
 
 
, which are the cases considered here, we can write
 
 
 
 
m
 
 
 
 
m
 
 
 
 
m
o
d
 
m
  and
 
 
 
 
m
 
 
 
 
m
o
d
 
m
. Hence, with
 
 
  being the least nonnegative residue of
m
 
 
 
modulo
 
m, we get
 
 
 
 
 
m
 
 
 
 
 
 , which means that
 
 
  can be obtained from
 
 
by inverting its
$\log_2 m$ most significant bits. Also,
 
 
 
 
can be obtained from
 
 
 
 
 by inverting its least signiﬁcant bit.
Consequently, by storing the subscript
 
k
  of
d
 
k
 
 
p
 
 
 
m
 
 
 
  in the table,
the subscript
 
k
  of
d
 
k
 
 
p
 
 
 
 
 
m
 
 
 
  (which maps to the same table
entry as
p
 ) can be obtained simply by inverting
l
o
g
 
m
 
 of the table output
bits. Each inversion may be implemented as a 2-input XOR gate, with one of
its inputs coming from the table output and the other input coming from the
last carry of the
 
m
 
 
 -bit addition
q
 
 
 
p
 
 
 . This carry is high only when
 
p
 
 
 
m
 
 
 
, i.e. when
p
 
 
 
 
 
m
 
 
 
 , which therefore is the only case when
the table output is changed (from
d
 
k
  to
d
 
k
 ) by the XOR gates. This is further
discussed in Section 7.6.2.
From the above reasoning we conclude that the
 
m
 
 
 
 elements of
 
 
 
 
 
 
 
 
  can be mapped onto the entries of a look-up table (memory) of size
 
m
 
 
 
m bits as follows:
  Each position
p
 
 
 
  maps to table entry
p
 
m
 
 
 
  . The table output is the
m-bit NBC subscript
 
k
  of
d
 
k
 
 
p
.
  Each position
p
 
 
 
  maps to table entry
q
 
m
 
 
 
  , i.e. the one’s comple-
ment of
q
 
m
 
 
 
, where
q
 
 
p
 
m
 
 
 
 
 
 . The table output is the
m-bit NBC
subscript
 
k
  of
d
 
k
 
 
p
 . However, if
p
 
 
 
 
 
m
 
, the table output is
the subscript
 
k
  of
d
 
k
 
 
 
m
 
 
 
 . This table output is modiﬁed to the
desired value by a simple circuit.
Because we can handle the problem when
 
 
 
m
 
 
 
 and
 
m
 
 
 
both map to
the same table entry, the actual relative redundancy of the set
M
 
 
  is exactly
 
 for
m
 
 and
 
 
 for
m
 
 
  . In the latter case, for
m
 
 
  , it is possible to
reduce the set
 
 
even further. However, we have not yet succeeded in reducing
it by half to
 
m
 
elements. Such a set would have
 
 relative redundancy.
If the number of elements in the reduced set obtained from
 
 
  is greater than
 
m
 
 , we still need a look-up table (memory) of size
 
m
 
 
 
m bits. Therefore,
for
m
 
 
 
 
  we let
 
 
 
 
 
, where
 
  is the set introduced in Notation 7.1 in
Section 7.5.2. Note that for
m
 
, we let
 
 
 
(see the paragraph subsequent
to Deﬁnition 7.8 in Section 7.5.4).
7.6 Architectures for Arithmetic Operations
In this section we propose VLSI architectures for most of the arithmetic oper-
ations considered in Sections 7.2 – 7.5. The sizes, fan-ins, internal CP delays,
and output normalised resistances of the basic building blocks in the architectures
are given in Chapter 4. Note that these complexity parameters are summarised
in Table 4.2.
7.6.1 Discrete Exponentiation
An Architecture for Computing
a
i
j
a
 
The respective algorithms in Sections 7.4.1 and 7.5.1 for performing discrete
exponentiation both involve the computation of
a
i
j
a
  from some
a
  using a
feedback shift register of length
m (see Step 1(b) on page 174 and Step 2 on
page 183). Such a shift register is shown in Figure 7.7. Generally, the circuit can
be used to compute
a
i
j
a
, which is defined in (7.24), from an arbitrary
a
 
 
Z
 
m
by loading it with
a
  and shifting (rotating) the register contents
i steps to the
left. The size of the circuit equals
C
a
i
 
m
C
r
e
g
 
C
i
n
v
 
 
 
m
 
[Figure 7.7 labels: shift register, CP.]
Figure 7.7: Recursive computation of
a
i
j
a
  from
a
  using a feedback shift register of
length
m.
and the CP, which runs from the output of the register element in the most significant
bit position to the input of the element in its least significant bit position,
equals
L
C
P
 
a
i
 
L
r
e
g
 
r
r
e
g
 
f
i
n
v
 
f
n
e
x
t
 
 
r
i
n
v
f
r
e
g
 
 
 
 
 
f
n
e
x
t
 
where
$f_{next}$ is the fan-in (seen from the most significant bit position of the register)
of the circuit subsequent to the register. For example, if the subsequent
circuit is another register, we get
f
n
e
x
t
 
f
r
e
g
 
and thus
L
C
P
 
a
i
 
 
  . Assum-
ing that an initial clock cycle is required to load the shift register with
a
, the
register contains the desired result
a
i
j
a
  after
i additional clock cycles. Thus,
the total computation time
T is proportional to
$$L_{a_i} = (i + 1)\, L_{CP,a_i} \qquad (7.41)$$
where
i
 
 
m
 
 .
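The behaviour of the register in Figure 7.7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the single inverter is assumed to sit in the feedback wire (suggested by the inverter term in the size expression and by the CP running from the most significant to the least significant register element), and the function names are hypothetical. Definition (7.24) is not reproduced here, so the sketch only demonstrates the shift operation itself. Under this assumption, one shift of an $m$-bit word $d$ yields $(2d + 1) \bmod (2^m + 1)$.

```python
def twisted_rotate_left(word: int, m: int) -> int:
    """One clock of an m-bit shift register whose feedback wire carries
    the INVERTED most significant bit into the least significant position.
    (Assumed reading of Figure 7.7; not taken verbatim from the thesis.)"""
    mask = (1 << m) - 1
    msb = (word >> (m - 1)) & 1
    return ((word << 1) & mask) | (msb ^ 1)


def shift_steps(word: int, steps: int, m: int) -> int:
    """Apply the feedback shift `steps` times to an m-bit word."""
    for _ in range(steps):
        word = twisted_rotate_left(word, m)
    return word


def clock_cycles(steps: int) -> int:
    """One load cycle plus one cycle per shift, matching the (i + 1) factor in (7.41)."""
    return steps + 1
```

For example, `shift_steps(d, 2 * m, m)` returns `d` again for every $m$-bit `d`, because the inverted feedback gives the register a period that divides $2m$.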
An Architecture for Computing
d
k
When performing discrete exponentiation using the algorithm in Section 7.4.1,
we need to recursively compute
d
k from
d
k
 
  (see Step 2 on page 174). The
architecture in Figure 7.8 for computing
d
k is based on (7.36), i.e.
d
k
 
 
 
d
k
 
 
 
 
 
 
d
k
 
 
 
 
 
m
o
d
 
m
 
 
 
which was obtained from (7.25) by letting
 
 
  . The addend
 
d
k
 
 
 
 
m
o
d
 
m
 
, which equals
a
 
j
d
k
 
 , is obtained simply by inverting the most signiﬁ-
cant bit (the wire labelled “msb” in the ﬁgure) of
d
k
 
  and modifying the feed-
back wirings.
 
  Let
d
k
 
 
 
 
 
m
o
d
 
m
 
 
, where
 
 
 
 
 
d
k
 
 
 
 
 
 
d
k
 
 
 
 
d
k
 
 
 
 is the sum of the addend
 
d
k
 
 
 
 and the augend
d
k
 
in Figure 7.8.
If
 
 
 
m, i.e. if
 
m
 
  , we have
d
k
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
m
 
 
 
 
m
o
d
 
m
 
 
 .
In this case we set the ﬁrst carry signal, which in the ﬁgure is denoted by
c
 ,
of the adder equal to zero. If
 
 
 
m, we have
d
k
 
 
 
m
 
 
 
 
 
 
m
o
d
 
m
 
 
 .
Here, we set
c
  equal to one.
The carry signal
c
can be generated using, for example, a comparator. From the
inequality
 
 
 
d
k
 
 
 
 
 
 
m it follows that
d
k
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 .
A comparator can, for example, be implemented using a chain of full adder elements,
where the carry out of the most significant bit position indicates whether
one of the addends is greater than the other (see Weste and Eshraghian [113,
Fig. 8.26]). In our case, where one of the comparator addends is always
 
 
m
 
 
 
 
 , the comparator simpliﬁes to a chain of
m
 
  alternating OR and AND
gates.
 
  With respect to the comparator propagation delay, this carry ripple
type of comparator is preferably used together with a parallel carry ripple
adder: By inserting the comparator prior to the register in Figure 7.8 and by
modifying its output circuitry, the resulting comparator does not have any effect on
the CP length of the total circuit.
If
d
k
 
 
 
 
 
m
 
 
 
 
 , the comparator output equals 1, otherwise it equals 0.
Note that we do not refer to the output of the modiﬁed comparator (see the
ﬁgure). The ﬁrst carry
c
  of the adder equals the inverse of the comparator
output. In the architecture in Figure 7.8, this inversion is realised by exchang-
ing the output OR gate of the comparator for a NOR gate. In the ﬁgure, the
resulting NOR gate is moved outside the comparator.
With the
m-bit parallel adder in Figure 7.8 being a standard carry ripple adder,
the size of the complete circuit equals
C
d
k
 
m
C
F
A
 
 
m
 
 
 
C
r
e
g
 
C
c
o
m
p
 
C
N
O
R
 
C
i
n
v
 
 
 
m
 
 
 
 
where
C
c
o
m
p
 
 
m
 
 
 
C
A
N
D
 
O
R
 
 
 
m
 
 
is the size of the modified comparator.
The CP through the circuit is the path from the output of the register element
in the least signiﬁcant bit position along the carry chain of the parallel adder
to the input of the register element in the most signiﬁcant bit position. This
path, which is marked by the dotted line in Figure 7.8, has length
 
 The procedure is based on the architecture in Figure 7.7.
 
The resulting chain of gates both starts and ends with an OR gate.
[Figure 7.8 labels: $m$-bit parallel adder (FA cells), modified comparator, $m$-bit register, D flip-flop, msb, CP. Inset: the most significant bits of the adder, comparator, and register.]
Figure 7.8: Recursive computation of
d
k from
d
k
 
  when
 
 
  . In order to con-
veniently label the sum output bits, say as
 
 
m
 
 ,
 
 
m
 
 ,
 
 
m
 
 
 
 
 
 
 
 
 
, we only
temporarily deﬁne
 
 
 
 
d
k in this figure.
L
C
P
 
d
k
 
L
r
e
g
 
r
r
e
g
 
 
f
F
A
 
s
i
g
n
a
l
 
 
m
 
 
 
 
L
F
A
 
c
a
r
r
y
 
r
F
A
f
F
A
 
c
a
r
r
y
 
 
L
F
A
 
s
u
m
 
r
F
A
f
r
e
g
 
 
 
m
 
 
 
 
Hence, after the register has been loaded with some integer
d
k
 
Z
 
m, the time
T needed for each recursive computation of
d
k
 
 ,
d
k
 
 ,
d
k
 
 , etc. is propor-
tional to
L
C
P
 
d
k. Note that in order to obtain the correct carry signal
c
for these
computations, the initial integer
d
k must also be fed to the modified comparator. If
not, the D flip-flop in Figure 7.8 may not be properly initialised.
The Look-Up Table
The algorithm in Section 7.5.1 for performing discrete exponentiation involves
the process of reading integers from a look-up table. Such a look-up table is
suitably implemented as a semiconductor memory. A look-up table of size
 
u
 
 
w bits is usually implemented as a memory of size
 
a
 
 
b bits (
 
a rows
and
 
b columns), where
a
 
b
 
u
 
w and
b
 
w. Figure 7.9 shows the block
diagram of a typical random-access memory (RAM) of size
 
a
 
 
b bits.
When reading from the memory,
a of the
u address lines select one of the
 
a
rows of the memory array. The remaining
u
 
a address lines select
 
w of the
 
b columns. The contents of the
 
w corresponding memory cells in the accessed
row are detected and multiplexed to the data output. A memory cell and the
sense ampliﬁer connected to the
b
i
t and
b
i
t lines of that cell is shown in Fig-
ure 7.10. The memory cell considered is a standard six-transistor static RAM
cell. The sense ampliﬁer is used to sense the state of the memory cell.
The size of the
 
 
a
 
 
b
 -bit memory cell array equals
 
 
 
a
 
b. Using a NOR-type row decoder [44, Fig. 9.10-2] and a standard column tree decoder [44, Fig.
9.10-3], the size of these (line) decoders together with the sense ampliﬁers is
roughly in the order of
a
 
a
 
b
 
b. Hence, using six-transistor memory cells, the
total chip area occupied by the
 
 
a
 
 
b
 -bit RAM in Figure 7.9 is proportional
to the size
$$C_{RAM} = 6 \cdot 2^{a+b} + a 2^a + b 2^b \qquad (7.42)$$
In order to minimise the area complexity of the address decoding, the memory
array is usually organised as a square array, i.e. we have $a = b$ (if $a + b$ is even).
Note that we do not consider the chip area occupied by the address bus or the
word lines and data lines of the memory.
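The size model can be evaluated numerically to see why a square array is preferred. The sketch below assumes the reading of (7.42) suggested by the surrounding text: six units per memory cell for a $2^a \times 2^b$ array, plus decoder and sense amplifier terms of order $a 2^a + b 2^b$. The function names are hypothetical and the numbers are illustrative only.

```python
def ram_size(a: int, b: int) -> int:
    """Approximate size of a (2^a x 2^b)-bit RAM, following the assumed
    reading of (7.42): six-transistor cells plus decoder/sense-amp terms."""
    cell_array = 6 * 2 ** (a + b)
    decoders = a * 2 ** a + b * 2 ** b
    return cell_array + decoders


def best_split(total_bits: int) -> tuple:
    """Return the (a, b) with a + b = total_bits that minimises ram_size.
    The cell-array term is fixed by a + b, so only the decoder term varies."""
    candidates = [(a, total_bits - a) for a in range(total_bits + 1)]
    return min(candidates, key=lambda ab: ram_size(*ab))
```

With `a + b = 8`, `best_split(8)` selects the square organisation `(4, 4)`, since the cell-array term is the same for every split and the decoder term is smallest when the two exponents are equal.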
The critical path associated with the process of reading from the memory can
be separated into two main paths. The first path runs from the address input
 
[Figure 7.9 labels: address, row decoder, array of $2^a \times 2^b$ memory cells, column decoder, sense amplifiers and write control, data.]
Figure 7.9: A block diagram of a typical random-access memory.
through the row decoder and along one of the word lines in the memory ar-
ray. The second path runs from inside an accessed memory cell along a data
bit line, through a sense amplifier, to the data output. The memory access time
is dominated by the time required to fully charge the word line plus the time
required to sense the state of an accessed memory cell. In order to minimise
the length of the first path, i.e. to speed up the charging of the word line, a column
of drivers is usually inserted between the row decoder and the memory
array. The chip area occupied by these drivers is neglected here.
 
 
The delay of a stage with capacitive load
C
L, which is driven by an optimised
driver, is proportional to
l
o
g
 
 
C
L
 
C
g
 ,wher e
C
g is the (minimum size) transis-
 
 An optimised driver on a word line is formed by a number of cascaded inverters of in-
creasing size. The total area occupied by the column of such drivers is actually greater than
the row decoder area, but it is less than the area occupied by the memory cell array.
[Figure 7.10 labels: $V_{ref}$, $V_{dd}$, sense amplifier, precharge, word line, control, data out, $bit$, $\overline{bit}$, memory cell.]
Figure 7.10: A memory cell and the sense amplifier for detecting the memory cell contents.
tor gate capacitance. We refer to Mead and Conway [66, Sec. 1.5]. When using
the six-transistor static memory cell in Figure 7.10, the total capacitance $C_L$ at
each word line (not counting the wire capacitance) equals $2^b \cdot 2C_g$, where $2^b$ is
the number of memory cells in one row of the memory array. Hence, the word
line delay is proportional to $\log_2 2^{b+1} = b + 1$, which implies that the length
of the ﬁrst part of the critical path is about
L
 
 
 
b
 
 
 
L
, where
L
  is some
constant.
Prior to the driving of the word line, all $bit$ and $\overline{bit}$ lines of the memory array
are precharged to some suitable potential. When a word line opens the memory
cells in a row, the potentials on the $bit$ and $\overline{bit}$ lines start changing. The resulting
difference in potential is either positive or negative, depending on the
data stored in the cell. When the difference in potential between the lines has
reached some specified voltage level
$\Delta V$, the memory state can be detected by
the (differential) sense ampliﬁer. There exist various sense ampliﬁers, see for
example Bakoglu [14, Ch. 4.9] and Annaratone [8, Ch. 6.4.3].
By properly precharging the $bit$ and $\overline{bit}$ lines and choosing a suitable type of
sense amplifier, the memory cell contents can be detected very quickly. Let $t_{cell}$
denote the time needed for a memory cell to induce the potential difference
$\Delta V$ between the $bit$ and $\overline{bit}$ lines and let $t_{sense}$ denote the sense amplifier delay
time. Then, the delay associated with the second part of the memory critical
path equals $t_{detect} = t_{cell} + t_{sense}$. It can be shown that $t_{detect}$ can be minimised
to be approximately proportional to $\log_2(C_{bit}/C_g)$, where $C_{bit}$ is the total capacitive
load at a bit line and $C_g$ is the transistor gate capacitance. We refer to
Svensson et al. [98], McCarroll et al. [64], and Mohsen and Mead [67].
Let $C_d$ denote the drain capacitance of a CMOS transistor. For a $(2^a \times 2^b)$-bit
memory, where each memory cell has a transistor drain connected to the $bit$
line (and another transistor drain connected to the $\overline{bit}$ line), we get
$C_{bit} = 2^a C_d$. Assuming that the drain capacitance is approximately equal to the gate
capacitance $C_g$, we get
$$t_{detect} \propto \log_2 2^a = a$$
Then, the length of the second part of the critical path, which has delay time
$t_{detect}$, is about
L
 
 
a
L
, where
L
  is some constant.
Hence, the access time of the
 
 
a
 
 
b
-bit memory in Figure 7.9 is proportional
to the length of its critical path, which in turn is approximately equal to
L
R
A
M
 
L
 
 
L
 
 
 
b
 
 
 
L
 
 
a
L
 
  (7.43)
The look-up table used in the algorithm described in Section 7.5.1 has size
 
c
 
m bits, where
c
 
m
 
t
 
  and
m
 
 
t. Using the above notations, we have
u
 
c,
w
 
l
o
g
 
m
 
t, and thus
a
 
b
 
c
 
t
 
m
 
. Because
m
 
  is odd,
the memory array associated with the table can not be square. Instead, we let
a
 
b
 
 (or alternatively
a
 
b
 
 ) which implies
a
 
m
 
  and
b
 
m
 
 
 
 .
Then, the size
C
e
x
p
 
t
a
b of the memory in which the
 
 
c
 
m
 -bit look-up table is
stored is approximately equal to
C
R
A
M
j
 
a
 
b
 
 
 
m
 
 
 
m
 
 
 
 
, i.e. we get
C
e
x
p
 
t
a
b
 
 
 
 
 
 
m
 
 
 
 
m
 
 
 
 
m
 
 
 
 
 
 
 
 
m
  (7.44)
The critical path through the memory equals
L
e
x
p
 
t
a
b
 
L
R
A
M
j
 
a
 
b
 
 
 
m
 
 
 
m
 
 
 
 
 
 
m
L
e
x
p
 
t
a
b
  (7.45)
where
L
e
x
p
 
t
a
b
 
 
L
 
 
L
 
 
 
  is some constant.
 
We do not include the wire capacitance.
Remark: Note that the memory size
C
R
A
M and the length
L
R
A
M of the critical
path through the memory in Figure 7.9 are approximate reflections of the
true chip area occupied by the memory and its true access time, respec-
tively.
7.6.2 The Discrete Logarithm
In Section 7.4.2, the algorithm for computing the discrete logarithm without
using any table involves the recursive computation of
d
k from
d
k
 
  (see Step 2
on page 178). An architecture for this computation was considered in Sec-
tion 7.6.1.
An architecture for computing
a
i
j
a
  is needed in the algorithm described in
Section 7.5.1 (Step 2 on page 185). Such an architecture is also considered in
Section 7.6.1.
In Section 7.5.5, page 196, we describe how to correct an erroneous look-up table
output by letting each of the
l
o
g
 
m most signiﬁcant bits and the least sig-
niﬁcant bit of the NBC output integer pass through an XOR gate. Figure 7.11
shows how the table output is modified by the XOR gates. When the control
signal $ctrl$ equals 1, each XOR gate inverts its signal taken from the table.
For $ctrl = 0$, the XOR gates do not change the table output bits. Let
p
 
and
q
  be deﬁned as in Section 7.5.5. The erroneous table mapping occurs for
p
 
 
 
 
 
m
 
 
 
, which is the only case where the carry out, say
c
, from the
most signiﬁcant bit position of the sum
q
 
 
p
 
m
 
 
 
 
 
 equals one (1).
Let
p
 
 
 
  be an integer which maps to an entry of the look-up table. Then,
p
m
 
 
 
 if
p
 
 
  and
p
m
 
 
 
 if
p
 
 
 . Hence, the control signal can be
formed by the Boolean function
c
t
r
l
 
c
 
 
p
m
 
 
 
c
 
 
p
m
 
 
 
Note that if we deﬁne
c
 
 
 whenever
p
 
 
 
, we simply get
c
t
r
l
 
c
 .
For
m
 
 , the above-mentioned look-up table used when computing the dis-
crete logarithm has size
 
m
 
 
 
m bits, see the end of Section 7.5.5. We do
not consider the simple look-up tables used when
m
 
 and
m
 
. When
l
o
g
 
m is even (i.e. when
m
 
 
) we let
a
 
b
 
 
m
 
 
 
l
o
g
 
m
 
 
  in
(7.42) and (7.43) and when
l
o
g
 
m is odd (i.e. when
m
 
  )w el e t
 
a
 
b
 
 
 
b
 
 
 
 
m
 
 
 
l
o
g
 
m
 
 
 
 . Then, the size of the memory which realises the
 
 
m
 
 
 
m
-bit table and the length of the critical path through that memory
[Figure 7.11 labels: table output (bit groups), $ctrl$, look-up table for discrete logarithm.]
Figure 7.11: Correcting the one case of erroneous output from the table used when
computing the discrete logarithm.
are equal to
C
l
o
g
 
t
a
b
 
C
R
A
M
 
 
m
 
 
m
 
  (7.46)
L
l
o
g
 
t
a
b
 
L
R
A
M
 
 
m
 
 
 
l
o
g
 
m
 
L
l
o
g
 
t
a
b
  (7.47)
respectively, where
L
l
o
g
 
t
a
b is some constant.
7.6.3 Negation
From (7.6) we have
 
 
 
P
 
 
 
 
 
 
 
m
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
m
o
d
 
m
 
  where
  is
a nonzero integer of
Z
 
m
 
. Thus, for
 
 
 
Z
 
m (i.e. for
 
 
m
 
  ),
 
  is obtained
from
 
  simply by inverting the digit
 
 
m
 
. If
 
 
 we let
 
 
 
 
 
 
P
 
 
 
 
 
m.
Consequently we get
 
 
m
 
 
 
m
 
 
m
 
 
 
 
 
m
 
 
 
m
 
 
 
 
 
m
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
Figure 7.12: Negation in the polar representation.
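The negation rule can be checked with a small Python model. The sketch assumes that a nonzero element is stored as its $m$-bit exponent with respect to a primitive element, that zero is coded as $2^m$ (an extra flag bit), and that negation inverts the most significant exponent bit, realised here as a NOR of that bit and the zero flag. The bit assignment and the function name are assumptions made for illustration. The modulus 17 with primitive element 3 is chosen only as a small test case.

```python
def polar_negate(k: int, m: int) -> int:
    """Negate an element given in polar form.

    k in [0, 2^m - 1] is the exponent of a nonzero element; k == 2^m codes zero.
    The new exponent MSB is NOR(old MSB, zero flag); all other bits pass through.
    (Assumed bit layout; hypothetical helper, not the thesis's own notation.)
    """
    zero_flag = (k >> m) & 1
    msb = (k >> (m - 1)) & 1
    new_msb = 1 - (msb | zero_flag)          # two-input NOR
    low_bits = k & ((1 << (m - 1)) - 1)
    return (zero_flag << m) | (new_msb << (m - 1)) | low_bits
```

For $m = 4$ this works because $3^8 \equiv -1 \pmod{17}$, so adding $2^{m-1}$ to the exponent modulo $2^m$ (a flip of the most significant bit) multiplies the element by $-1$, while the code for zero is left unchanged.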
Figure 7.12 shows an architecture for performing negation with respect to the
polar representation. The CP through this simple circuit runs from
the most signiﬁcant bit input to the output of the NOR gate. The size, fan-in,
and output normalised resistance of the circuit equal
C
p
o
l
n
e
g
 
C
N
O
R
 
 
f
p
o
l
n
e
g
 
n
p
o
l
n
e
g
 
f
N
O
R
 
n
p
o
l
n
e
g
 
 
r
p
o
l
n
e
g
 
r
N
O
R
 
 
 
respectively, where
n
p
o
l
n
e
g is the fan-out of the circuit, with respect to the
 
 
m-
output node. Assuming that the input
 
  is obtained from a parallel register
and the output
 
is also stored in a register,
 
the time required for performing
negation in the polar representation is proportional to the length
L
p
o
l
n
e
g
 
L
r
e
g
 
r
r
e
g
f
p
o
l
n
e
g
 
r
p
o
l
n
e
g
f
r
e
g
 
 
 
and hence the area-time performance of the architecture is proportional to
C
L
 
p
o
l
n
e
g
 
 
C
p
o
l
n
e
g
 
L
p
o
l
n
e
g
 
 
 
 
 
 
 
 
 
Like we did in Chapters 5 and 6.
7.6.4 Addition
For nonzero
 
 
 
 
Z
 
m
 
, i.e. for
 
 
 
 
 
 
Z
 
m, the following congruence was
given in (7.9):
 
 
 
P
 
 
 
 
 
 
 
 
 
Z
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
o
d
 
m
 
  (7.48)
where
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
  is the one’s complement of
 
 
 
m
 
 
. With respect
to area complexity, this congruence is preferably computed in two clock
cycles, using an
m-bit feedback parallel adder. Let
 
  denote the sum output
of the adder. During the ﬁrst clock interval we compute the Zech logarithm
Z
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
 
. The two adder inputs are
 
 
 
m
 
 
  and
 
 
 
m
 
 
  and the
ﬁrst carry input signal equals 1. The
m-bit sum
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
 
m
o
d
 
m
  is the input of a circuit which outputs the
m-bit Zech logarithm
Z
 
 
 
 . However, because we have deﬁned
Z
 
 
m
 
 
 
 
 
 
 
 
m, the Zech's logarithm
circuit will generate an erroneous output when the input
 
 equals
 
m
 
 .
This situation is handled by deﬁning an additional output signal, say
Z
i
n
d,
which indicates whether the logarithm output of the circuit is the correct Zech
logarithm of its input
 
 :
 
 
 
 
 
m
 
 
 
  Correct output
Z
 
 
 
 
 
Z
i
n
d
 
 
 
 
 
 
m
 
 
 
  Incorrect output
Z
 
 
 
 
 
Z
i
n
d
 
 
 
During the second clock interval, the new adder output
 
  equals the sum
 
 
 
m
 
 
 
 
Z
 
 
 
 
m
o
d
 
m, i.e.
 
 
s
e
c
o
n
d
 
 
 
 
m
 
 
 
 
Z
 
 
 
 
r
s
t
 . Thus, the adder input
signals are
 
 
 
m
 
 
  and
Z
 
 
 
  and the carry input signal equals 0. If
Z
i
n
d
 
, the
desired sum
 
  equals the adder output
 
 .I f
Z
i
n
d
 
, we let
 
 
 
 
 
 
m. If
both
  and
  are zero, i.e. if
 
 
 
 
 
 
 
m, we also let
 
 
 
 
m. If only
  (or
 )
equals zero, we let
 
 
 
P
 
 
 
 
 
 
 
  (or
 
 
 
 
 ).
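The two-interval procedure can be modelled in Python as follows. This is a sketch under explicit assumptions, not the circuit itself: the field is the small Fermat prime field with 17 elements and primitive element 3, zero is coded as $2^m$, the Zech logarithm is built from one exponentiation table and one logarithm table, and the indicator `z_ind` is taken to be 1 when the sum inside the logarithm is zero (the polarity used in the thesis is not recoverable here). All names are hypothetical.

```python
Q, M, G = 17, 4, 3            # assumed example: Fermat prime 2^4 + 1, primitive element 3
N = 1 << M                    # group order 2^m; the value N also codes the zero element

EXP = [pow(G, k, Q) for k in range(N)]        # exponent -> field element
LOG = {v: k for k, v in enumerate(EXP)}       # nonzero field element -> exponent


def zech(delta: int) -> tuple:
    """Return (Z(delta), z_ind) where g^Z(delta) = 1 + g^delta.
    z_ind = 1 flags the single case 1 + g^delta == 0 (delta == 2^(m-1))."""
    s = (1 + EXP[delta]) % Q
    if s == 0:
        return 0, 1
    return LOG[s], 0


def polar_add(i: int, j: int) -> int:
    """Add two elements given as polar codes (exponent, or N for zero)."""
    if i == N and j == N:
        return N
    if i == N:
        return j
    if j == N:
        return i
    # First clock interval: j - i mod 2^m, formed as j + (one's complement of i) + carry-in 1.
    delta = (j + (~i & (N - 1)) + 1) & (N - 1)
    z, z_ind = zech(delta)
    if z_ind:
        return N              # the two operands cancel, so the sum is zero
    # Second clock interval: i + Z(delta) mod 2^m with carry-in 0.
    return (i + z) & (N - 1)


def encode(value: int) -> int:
    return N if value % Q == 0 else LOG[value % Q]


def decode(code: int) -> int:
    return 0 if code == N else EXP[code]
```

As a usage example, `decode(polar_add(encode(5), encode(12)))` returns 0, because 5 and 12 are additive inverses modulo 17 and the indicator case is triggered.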
Figure 7.13 shows an architecture for “polar” addition, which is based on the
procedure described above. The
m-bit parallel registers R
  and R
  are initially
loaded with
 
 
 
m
 
 
  and
 
 
 
m
 
 
 , respectively, and the D ﬂip-ﬂops D
 ,D
, and
D
  are loaded with 1,
 
 
m, and
 
 
m, respectively. The number of parallel wires
in every signal bus (“
 
 ”) in the ﬁgure equals
m. Consequently, the inverter
which has the
m-bit contents of register R
  as its input signal is actually a row
of
m ordinary one-bit inverters.
The desired output signal
 
  is formed by the output controller circuit in the
bottom-leftmost part of Figure 7.13. Table 7.3 shows which output is gener-
ated for different values of
 
 
m,
 
 
m, and
Z
i
n
d. Based on this table, we form the
two Karnaugh maps in Figure 7.14 for
 
 
m and
 
 
i, where
 
 
i
 
m
 
.
 
 
 
[Figure 7.13 labels: parallel registers R, D flip-flops D, $m$-bit parallel adder, Zech's logarithm circuit with indicator $Z_{ind}$, output controller, paths P.]
Figure 7.13: An architecture for addition using the polar representation. The
arrangement of the output controller circuit is shown in Figure 7.16.
 
 
m
 
 
m
Z
i
n
d
 
 
m
 
 
 
m
 
 
 
1 1 0
 
 
m
 
 
 
m
 
  0
1 0
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
0 1
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
0 0 1 1 0
0 0
 
 
 
m
 
 
 
Table 7.3: The output
 
 
 
 
 
m
 
m
 
 
 
 
m
 
 
  for the various values of
 
 
m,
 
 
m, and
Z
i
n
d.
0
 
 
i
 
 
i X
0
 
 
i
 
 
i
 
 
i 00 0
1 00
0
10
1
00 00 11 01 01
1
 
 
m
11 10
 
 
m
 
 
m
 
 
m
 
 
m
0
Z
i
n
d
Z
i
n
d
 
 
i
 
 
 
i
 
m
 
 
1
X
Figure 7.14: Karnaugh maps for
 
 
m and
 
 
i, where
 
 
i
 
m
 
 . X=“don’t care”.
From these maps, we obtain the Boolean functions
 
 
m
 
 
 
m
 
 
 
m
 
Z
i
n
d
 
 
 
m
 
 
m
 
 
 
 
m
 
 
 
m
 
Z
i
n
d
 
 
 
m
 
 
m (7.49)
 
 
i
 
 
 
m
 
 
 
m
 
Z
i
n
d
 
 
 
i
 
 
 
i
 
 
 
 
m
 
 
 
m
 
 
 
Z
i
n
d
 
 
 
i
 
 
 
 
i
  (7.50)
for
 
 
i
 
m
 
  and where
 
 
i
 
 
 
i
 
 
m
 
 
m
 
 
 
i
 
 
m
 
 
m. The Boolean function
 
 
i
can simply be generated using the reduced four-input multiplexer shown in
Figure 7.15. This multiplexer lets either
 
 
i or
 
 
i pass to the output
 
 
i, depend-
ing on whether
 
 
 
m
 
 
 
m
  equals
 
 
 
 
  or
 
 
 
 
 , respectively. According to the
Boolean function for
 
 
i, when
 
 
 
m
 
 
 
m
  equals
 
 
 
 
  or
 
 
 
 
 , we should have
 
 
i
 
  . Therefore, in order to always get the correct output, each output node
 
 
i of
the reduced multiplexer should be discharged (i.e. we set the logical level equal to zero)
before the control signals
 
 
m and
 
 
m and their inverses are present at the multiplexer
inputs. The circuitry for doing this, however, is not considered here.
 
 
Figure 7.15: A reduced four-input multiplexer. (a) Symbolic description.
(b) Schematic description.
Figure 7.16 shows the structure of the output controller in Figure 7.13. For
 
 
i
 
m, the gates of the circuit generate the binary digits
 
 
i of
 
 
 
P
 
 
 
 
 
where, depending on
i,
 
 
i is given either by (7.49) or (7.50). The size
C
c
t
r
l of
the output controller in Figure 7.16 equals
C
c
t
r
l
 
m
C
R
M
U
X
 
 
m
 
 
 
C
r
e
g
 
 
 
m
 
 
 
C
N
A
N
D
 
N
O
R
 
 
 
m
 
 
 
C
i
n
v
 
 
 
m
 
 
 
 
where
C
R
M
U
X
 
 is the size of one reduced (four-input) multiplexer. The fan-
in, internal delay, and output normalised resistance of the output controller,
with respect to the dotted path P
  in the ﬁgure, equal
f
c
t
r
l
 
f
i
n
v
 
 
L
c
t
r
l
 
r
i
n
v
f
N
O
R
 
r
N
O
R
f
N
A
N
D
 
r
N
A
N
D
f
N
A
N
D
 
 
 
r
c
t
r
l
 
r
N
A
N
D
 
 
 
respectively. The chip area
A occupied by the entire “polar” adder in Figure
7.13 is proportional to its size
C
p
o
l
a
d
d
 
 
 
C
Z
e
c
h
 
C
m
a
d
d
 
 
 
m
 
 
 
C
r
e
g
 
 
m
 
 
 
C
i
n
v
 
C
c
t
r
l
[Figure 7.16 labels: D flip-flops, register R, $Z_{ind}$, path P, a row of reduced 4-input multiplexers.]
Figure 7.16: The output controller of Figure 7.13.
 
C
Z
e
c
h
 
C
m
a
d
d
 
 
 
m
 
 
 
 
where
C
m
a
d
d is the size of the
m-bit parallel adder and
C
Z
e
c
h is the size of the
Zech’s logarithm circuit in the bottom-rightmost part of the ﬁgure. The Zech
logarithm can be computed using one discrete exponentiation and one dis-
crete logarithm, which both can be efﬁciently computed using look-up tables,
see Sections 7.5.1 and 7.5.2. The size of such a Zech’s logarithm circuit is dom-
inated by the sizes of the look-up tables. For
m
 
  we thus have
 
 
C
Z
e
c
h
 
C
e
x
p
 
t
a
b
 
C
l
o
g
 
t
a
b
 
 
 
 
m
 
 
m
 
 
m
 
 
  (7.51)
where
C
e
x
p
 
t
a
b and
C
l
o
g
 
t
a
b are given by (7.44) and (7.46), respectively. Assuming
that the parallel adder in Figure 7.13 is an ordinary carry ripple adder, consisting
of
m full adder elements, we have
C
m
a
d
d
 
m
C
F
A
 
 
 
m and thus
C
p
o
l
a
d
d
 
 
 
 
 
 
m
 
 
m
 
 
m
 
 
 
 
 
 
m
 
The total CP of the “polar” adder architecture is formed by the two dotted paths $P_1$ and $P_2$ in the figure, where $P_1$ is the CP during the first clock interval and $P_2$ is the CP during the second clock interval of the computation. The
length of path P
 , which runs from the output of register R
  through one in-
verter, along the carry chain of the carry ripple adder, and through the Zech’s
logarithm circuit to the input of register R
 , equals
L
P
 
 
L
r
e
g
 
r
r
e
g
f
i
n
v
 
r
i
n
v
 
f
r
e
g
 
f
F
A
 
s
i
g
n
a
l
 
 
 
m
 
 
 
 
L
F
A
 
c
a
r
r
y
 
r
F
A
f
F
A
 
c
a
r
r
y
 
 
L
F
A
 
s
u
m
 
r
F
A
 
f
c
t
r
l
 
f
Z
e
c
h
 
 
L
Z
e
c
h
 
r
Z
e
c
h
f
r
e
g
 
L
Z
e
c
h
 
f
Z
e
c
h
 
 
r
Z
e
c
h
 
 
 
m
 
 
 
 
where $L_{\text{Zech}}$, $f_{\text{Zech}}$, and $r_{\text{Zech}}$ are the internal CP length, the fan-in, and the output normalised resistance of the Zech's logarithm circuit, respectively. Note that when the
path P
  and P
  are active we have
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 , which means that the re-
duced multiplexers of the output controller are switched off. Hence, the out-
put controller does not affect the output stages of register R
  and R
 .
Assuming that the sum
 
  is stored in a parallel register, the length of path P
 ,
which runs from the output of register R
  through the carry ripple adder and
the output controller, equals
L
P
 
 
L
r
e
g
 
r
r
e
g
f
F
A
 
s
i
g
n
a
l
 
 
m
 
 
 
 
L
F
A
 
c
a
r
r
y
 
r
F
A
f
F
A
 
c
a
r
r
y
 
 
L
F
A
 
s
u
m
 
r
F
A
 
f
c
t
r
l
 
f
Z
e
c
h
 
 
L
c
t
r
l
 
r
c
t
r
l
f
r
e
g
 
 
 
m
 
 
 
 
f
Z
e
c
h
 
 
 The simpler cases for which
m
 
  should be handled separately.
If either
 
 
m or
 
 
m (or both) equals one, the
m-bit NBC integer
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
  is loaded into register R
  of the output controller at the beginning of
the first clock cycle. Then, at the beginning of the second clock cycle, the output
controller sets
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 . Hence, if we do not include the time needed for the initiation of the registers (and D flip-flops), the time $T$ required to perform an addition using the polar representation is proportional to
 
 
L
p
o
l
a
d
d
 
 
 
 
L
P
 
 
L
P
 
 
L
Z
e
c
h
 
 
f
Z
e
c
h
 
 
r
Z
e
c
h
 
 
 
m
 
 
 
 
As indicated above, a Zech's logarithm can be computed using two look-up tables, one for exponentiation and one for the discrete logarithm. The algorithms for performing discrete exponentiation and computing the discrete logarithm are described in Sections 7.5.1 and 7.5.2, respectively. We conclude that the worst-case time for computing a Zech's logarithm using these algorithms is approximately proportional to the length
L
Z
e
c
h
j
m
a
x
 
L
e
x
p
 
t
a
b
 
L
l
o
g
 
t
a
b
 
 
L
a
i
j
m
a
x
 
m
L
e
x
p
 
t
a
b
 
 
m
 
 
 
l
o
g
 
m
 
L
l
o
g
 
t
a
b
 
 
 
 
m
 
where $L_{\text{exp,tab}}$ and $L_{\text{log,tab}}$ are given in (7.45) and (7.47), respectively, and
L
a
i
j
m
a
x
 
 
 
m is the maximum value of
L
a
i given in (7.41). Note that
L
a
i
j
m
a
x
can quite simply be reduced to
 
 
m, for example by using an architecture for computing $a_i$ which allows shifting to the right as well as to the left. With
L
Z
e
c
h
 
L
Z
e
c
h
j
m
a
x and by assuming
f
Z
e
c
h
 
 and
r
Z
e
c
h
 
 , we get
L
p
o
l
a
d
d
 
 
 
m
L
e
x
p
 
t
a
b
 
 
m
 
 
 
l
o
g
 
m
 
L
l
o
g
 
t
a
b
 
 
 
 
m
 
 
 
 
  (7.52)
The area-time product ($AT^2$) performance of the adder architecture in Figure 7.13 is proportional to
$CL^2_{\text{poladd},1} = C_{\text{poladd},1} \cdot L^2_{\text{poladd},1}$.
Remark: The two clock intervals associated with the respective CP lengths of paths $P_1$ and $P_2$ are not equally long.
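The Zech-logarithm addition performed by the adder in Figure 7.13 can be illustrated in software. The following Python sketch is only an illustration under stated assumptions: the small Fermat prime field with $m = 4$ (modulus 17), the primitive element 3, and the names `EXP`, `LOG`, `zech`, and `polar_add` are chosen here and do not come from the thesis. Zero operands are not handled.

```python
M = 4                      # assumed exponent width; the modulus is 2**M + 1 = 17
P = 2**M + 1
G = 3                      # assumed primitive element (3 generates the nonzero elements mod 17)
ORDER = 2**M               # exponents are handled modulo 2**M

# Look-up tables: exponent -> element, and element -> exponent.
EXP = [pow(G, k, P) for k in range(ORDER)]
LOG = {value: k for k, value in enumerate(EXP)}

def zech(k):
    """Return Z(k) such that 1 + G**k = G**Z(k), or None if 1 + G**k is zero."""
    s = (1 + EXP[k % ORDER]) % P     # one discrete exponentiation
    return LOG.get(s)                # one discrete logarithm

def polar_add(a, b):
    """Exponent of G**a + G**b, or None when the sum is zero."""
    z = zech((b - a) % ORDER)        # G**a + G**b = G**a * (1 + G**(b - a))
    return None if z is None else (a + z) % ORDER
```

Each call to `zech` uses both tables once, which mirrors the one exponentiation and one discrete logarithm mentioned above.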
An Alternative Adder
Let
  and
  be nonzero integers of the prime field $\mathbb{Z}_{2^m+1}$. Using the congruences
 
 
 
 
d
 
 
 
m
o
d
 
m
 
 
 and
 
 
 
 
d
 
 
 
m
o
d
 
m
 
 
 (see (7.35) in the proof of Theorem 7.7), addition in $\mathbb{Z}_{2^m+1}$ can be expressed as
 
 
 
 
 
 
 
 
d
 
 
 
 
 
d
 
 
 
d
 
 
 
d
 
 
 
 
 
d
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
  (7.53)
 
 The time needed for initiating the registers is negligible in comparison with the first
and second clock cycle times.
where
d
 
 
 
d
 
 
 
d
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
and where
 denotes diminished–1 addition. Based on (7.53), we can perform
polar addition in the following way: First, we compute
d
 
  and
d
 
  from
 
  and
 
 , respectively. This can be done either using direct computations (see the al-
gorithm in Section 7.4.1) or using a look-up table and some binary shifts (see
the algorithm in Section 7.5.1). Then, the sum
d
 
 
 
d
 
 
 
d
 
 
 
m
o
d
 
m
 
 
 
is formed by the output of a diminished–1 adder (see Section 6.3.4) with
d
 
 
and
d
 
  as its input signals. The desired result
 
 
 
P
 
 
 
 
  is obtained from
d
 
  either by using direct computations (see the algorithm in Section 7.4.2) or
using a look-up table and essentially some binary shifts (see the algorithm in
Section 7.5.2).
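The three-step procedure above (translate to diminished-1 form, add, translate back) can be sketched in Python. This is a hedged illustration, not the thesis's circuit: it assumes the usual diminished-1 convention $d(x) = x - 1$, the small field with $m = 4$ (modulus 17), and the primitive element 3; all function and table names are hypothetical, and zero addends are not treated.

```python
M = 4                      # assumed exponent width; the modulus is 2**M + 1 = 17
P = 2**M + 1
G = 3                      # assumed primitive element
ORDER = 2**M

EXP = [pow(G, k, P) for k in range(ORDER)]        # exponent -> element
LOG = {value: k for k, value in enumerate(EXP)}   # element -> exponent

def to_dim1(k):
    """Translate an exponent k into the diminished-1 value of G**k."""
    return EXP[k % ORDER] - 1

def dim1_add(dx, dy):
    """Diminished-1 addition: d(x + y) = d(x) + d(y) + 1 (mod 2**M + 1)."""
    return (dx + dy + 1) % P

def from_dim1(d):
    """Translate a diminished-1 value back to an exponent (None if the element is zero)."""
    return LOG.get((d + 1) % P)

def polar_add_via_dim1(a, b):
    """Exponent of G**a + G**b, computed through the diminished-1 domain."""
    return from_dim1(dim1_add(to_dim1(a), to_dim1(b)))
```

The design point illustrated here is that both translations are table look-ups, so only the middle step needs an adder.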
Figure 7.17 shows a block diagram of a polar adder which is based on the above
addition procedure. The adder in the figure is modified to work also for zero
addends (
 
 
 and/or
 
 
  ). Let
 
 
 
 
 
 
m
 
m
 
d
 
 
 
m
 
 
 
 
 
 
 
 
m
 
m
 
d
 
 
 
m
 
 
 
 
 
m
o
d
 
m
 
 
 
 
where
d
 
 
 
m
 
 
  and
d
 
 
 
m
 
 
  are obtained from
 
 
 
m
 
 
  and
 
 
 
m
 
 
 , respectively, be
the output of the diminished–1 adder in Figure 7.17. Then, we have
 
 
m
 
 
 
m
and
 
 
 
m
 
 
  is obtained from
d
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 . This can be understood from the
following three special cases:
1. If
 
 
 
 
 
  , i.e.
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 :
(a) If
 
 
 
 
m, which occurs if
 
 
 
 
 
 
m
o
d
 
m
 
 
  , then
 
 
 
 
 
 
m. Thus, we set
 
 
m
 
 
 
m
 
  . Also,
 
 
 
m
 
 
 
 
 is obtained from
d
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
  .
(b) If
 
 
 
 
 
 
m
 
 , which occurs if
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
  ,
then we set
 
 
m
 
 
 
m
 
  . Also,
 
 
 
m
 
 
 
 
Z
 
m is obtained from
d
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
Z
 
m.
2. If
 
 
 
 and
 
 
  , i.e.
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 ,
 
 
 
 
 
m
 
 
 
 
 
m
 
 , and
 
 
 
m
 
 
 
 
  (or
 
 
 and
 
 
 
  , i.e.
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 ,
 
 
 
m
 
 
 
 
  , and
 
 
 
 
 
m
 
 
 
 
 
m
 
 ):
Then
 
 
 
d
 
 
 
m
 
 
  (or
 
 
 
d
 
 
 
m
 
 
 ) and thus, we get
 
 
m
 
 
 
m
 
  . Also,
 
 
 
m
 
 
 
 
 
  (or
 
 
 
m
 
 
 
 
 
 ) is obtained from
d
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 .
3. If
 
 
 
 
  , i.e.
 
 
 
m
 
 
 
m
 
 
 
 
 
 
 , and
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
  :
Then
d
 
 
 
m
 
 
 
 
d
 
 
 
m
 
 
 
 
  , which means that
 
 
 
 
m. We have
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 and thus
 
 
m
 
 
 
m
 
  . Also,
 
 
 
m
 
 
 
 
 is
obtained from
d
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
  . Hence, the output
 
  equals
 
 
 
m.
The polar adder in Figure 7.17 works as follows: The computation procedure
is split into two parts (clock cycles). During the first clock cycle,
d
 
 
 
m
 
 
  is com-
puted from the input
 
 
 
m
 
 
  and is stored, together with
 
 
m, in the register.
During the second clock cycle,
d
 
 
 
m
 
 
  is ﬁrst computed from the input
 
 
 
m
 
 
 
and the addends
 
 
m
 
m
 
d
 
 
 
m
 
 
  and
 
 
m
 
m
 
d
 
 
 
m
 
 
  appear at the inputs of the
diminished-1 adder. Then, the desired sum
 
 
 
P
 
 
 
 
  is computed from
the adder output
 
 .
Assuming that the two translation circuits in Figure 7.17 are realised using
look-up tables, as described above, the sizes of the input and output transla-
tion circuits are approximately equal to
C
e
x
p
 
t
a
b
 
 
 
 
m and
C
l
o
g
 
t
a
b
 
 
m
 
 
m
 
 , respectively.
 
  The parameters $C_{\text{exp,tab}}$ and $C_{\text{log,tab}}$ are given by (7.44) and (7.46), respectively. Using the carry ripple adder in Figure 6.9, which has size
C
d
i
m
a
d
d
 
 
 
 
 
m
 
  , the total size of the polar adder in Figure 7.17 equals
 
 
C
p
o
l
a
d
d
 
 
 
C
e
x
p
 
t
a
b
 
C
l
o
g
 
t
a
b
 
C
d
i
m
a
d
d
 
 
 
 
m
 
 
 
C
r
e
g
 
 
 
 
m
 
 
m
 
 
m
 
 
 
 
 
m
 
where
C
r
e
g
 
 
  . Hence, the size of the polar adder in Figure 7.17 is less than the
size of the polar adder in Figure 7.13. Also, the overall structure of the former
architecture is simpler than the structure of the latter one.
The total critical path through the circuit in Figure 7.17 is formed by the paths $P_1$ and $P_2$, which correspond to the critical paths associated with the first and second clock cycles, respectively. The length $L_{P_1}$ of path $P_1$ is approximately
equal to
L
e
x
p
 
t
a
b
 
L
a
i
j
m
a
x
 
m
 
L
e
x
p
 
t
a
b
 
 
 
  and the length $L_{P_2}$ of path $P_2$ is approximately equal to
L
e
x
p
 
t
a
b
 
L
d
i
m
a
d
d
 
 
 
L
l
o
g
 
t
a
b
 
 
L
a
i
j
m
a
x
 
m
L
e
x
p
 
t
a
b
 
 
m
 
 
 
l
o
g
 
m
 
L
l
o
g
 
t
a
b
 
 
 
 
m (see (7.45), (6.28), (7.47), and (7.41)). Hence, the total
time required to perform polar addition, using the architecture in Figure 7.17,
is approximately proportional to the length
L
p
o
l
a
d
d
 
 
 
L
P
 
 
L
P
 
 
 
m
L
e
x
p
 
t
a
b
 
 
m
 
 
 
l
o
g
 
m
 
L
l
o
g
 
t
a
b
 
 
 
 
m
 
where $L_{\text{exp,tab}}$ and $L_{\text{log,tab}}$ are some constants. By comparing $L_{\text{poladd},2}$ with $L_{\text{poladd},1}$, which is given in (7.52), we conclude that the polar adder in Figure 7.13 is faster than the adder in Figure 7.17.
 
 Thus, the sum of the sizes of the two translation circuits is equal to $C_{\text{Zech}}$ (see (7.51)).
 
 Note that here we only consider the cases
m
 
 and
m
 
 
  .
Figure 7.17: The block diagram of an alternative polar adder (blocks: translation from $k$ to $d_k$, register, $(m+1)$-bit diminished-1 adder, translation from $d_k$ to $k$). The paths $P_1$ and $P_2$ form the CP through the circuit. We have $k, d_k \in \mathbb{Z}_{2^m}$.
The $AT^2$ performance of the polar adder in Figure 7.17 is proportional to the product
$CL^2_{\text{poladd},2} = C_{\text{poladd},2} \cdot L^2_{\text{poladd},2}$.
Remark: The two clock intervals associated with the respective CP lengths of paths $P_1$ and $P_2$ are not equally long.
Subtraction
By (7.11) and (7.6) we get
P
 
 
 
 
 
 
 
 
 
Z
 
P
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
o
d
 
m
 
  (7.54)
where
P
 
 
 
 
 
 
 
m
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
m
o
d
 
m
 . Hence, by comparing (7.54)
with (7.48), we conclude that polar subtraction can be carried out by using
the adder architecture in Figure 7.13 but with the input bit
 
 
m
 
  exchanged for
its one’s complement
 
 
m
 
 . An XOR gate can be used to control whether the
input bit
 
 
m
 
  of
 
  is to be inverted (when subtracting) or unchanged (when
adding).
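The bit inversion used for subtraction can be checked numerically. The sketch below is an illustration only: it relies on the fact that, for a primitive element of order $2^m$, the element $-1$ has exponent $2^{m-1}$, so negation adds $2^{m-1}$ modulo $2^m$, which flips the most significant exponent bit. The field size ($m = 4$), the primitive element 3, and the function name `polar_negate` are assumptions made here, not notation from the thesis.

```python
M = 4                      # assumed exponent width; the modulus is 2**M + 1 = 17
P = 2**M + 1
G = 3                      # assumed primitive element

def polar_negate(b, m=M):
    """Exponent of -(G**b): invert bit m-1 of the m-bit exponent b."""
    return b ^ (1 << (m - 1))
```

For example, `pow(G, polar_negate(5), P)` and `(-pow(G, 5, P)) % P` denote the same field element, so a subtractor only needs the existing adder plus this single conditional inversion.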
Polar subtraction can also be carried out by using a modiﬁed version of the
polar adder in Figure 7.17. Similar to (7.53), we can write
 
 
 
 
 
 
 
 
d
 
 
 
 
 
 
d
 
 
 
 
d
 
 
 
 
m
 
 
 
d
 
 
 
 
 
d
 
 
 
d
 
 
 
 
 
d
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
 
  (7.55)
where
d
 
  is the one’s complement of the
m-bit normal binary coded integer
d
 
 
and
d
 
 
 
d
 
 
 
d
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
 
  . A row of $m$ XOR gates can be placed at the output of the $k$-to-$d_k$ translation circuit in Figure 7.17 to control whether
the output is to be inverted (when subtracting) or unchanged (when adding).
7.6.5 General multiplication
From Section 7.2.6, we get
 
 
 
 
P
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
m
o
d
 
m
 
  if
 
 
m
 
 
 
m
 
 
 
 
 
 
 
 
m
  if
 
 
m
 
 and/or
 
 
m
 
 
 
(7.56)
which we compute as follows. Let
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
m
o
d
 
m
 . Then, by
(7.56) and for
 
 
i
 
m
 
  , each output bit
 
 
i of
 
  can be written as the
Boolean function
 
 
i
 
 
 
i
 
  (7.57)
where
 
 
 
 
m
 
 
 
m
 
 
 
m
 
 
 
m indicates whether
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
m
o
d
 
m
 
(
 
 
 
  ) or
 
 
 
 
m (
 
 
 
  ). The most signiﬁcant bit
 
 
m of
 
  equals
 .
Figure 7.18: A bit-parallel architecture for general multiplication using the polar representation.
A Bit-Parallel Architecture
An architecture for general “polar” multiplication, which is based on the above procedure, is shown in Figure 7.18. The AND gate in the figure represents a row of $m$ AND gates, each of which, for some
i
 
 
 
 
 
 
 
 
 
m
 
 , generates the output bit
 
 
i according to the Boolean function in (7.57).
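A software view of this multiplier may help. The Python sketch below is a minimal illustration under assumptions that are not confirmed symbol by symbol in the text above: a polar integer is taken to be an $(m+1)$-bit word whose bit $m$ flags the zero element and whose $m$ low bits hold the exponent, and the zero element is encoded as $2^m$. The function names, the field size ($m = 4$), and the primitive element 3 are hypothetical choices.

```python
M = 4                      # assumed exponent width; the modulus is 2**M + 1 = 17
P = 2**M + 1
G = 3                      # assumed primitive element
ZERO = 1 << M              # assumed encoding of the zero element (flag bit m set)

def to_polar(x):
    """Encode a field element as an (m+1)-bit polar integer (brute-force logarithm)."""
    if x % P == 0:
        return ZERO
    return next(k for k in range(2**M) if pow(G, k, P) == x % P)

def from_polar(a):
    """Decode a polar integer back to a field element."""
    return 0 if a & ZERO else pow(G, a, P)

def polar_mul(a, b):
    """Multiply two polar integers: add exponents mod 2**m, gated by the zero flags."""
    if (a | b) & ZERO:                 # either operand is zero -> product is zero
        return ZERO
    return (a + b) & (ZERO - 1)        # m-bit addition; the carry out is discarded
```

The masking in the last line plays the role of the row of AND gates: when a zero flag is set, the adder output is suppressed.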
Assuming that the parallel adder in Figure 7.18 is an ordinary carry ripple
adder, which comprises
m full adder elements, the chip area
A occupied by
the general multiplier architecture is proportional to its size
C
p
o
l
m
u
l
t
 
p
a
r
 
m
C
F
A
 
m
C
A
N
D
 
C
N
O
R
 
C
i
n
v
 
 
 
m
 
 
 
The internal CP is the path from the least significant bit input of the parallel adder, through the chain of full adder elements, to the output of the AND gate in bit position $m-1$. The length of this CP equals
L
C
P
 
p
o
l
m
u
l
t
 
p
a
r
 
 
m
 
 
 
 
L
F
A
 
c
a
r
r
y
 
r
F
A
f
F
A
 
c
a
r
r
y
 
 
L
F
A
 
s
u
m
 
r
F
A
f
A
N
D
 
L
A
N
D
 
 
 
m
 
 
The fan-in and the output normalised resistance of the architecture are equal
to
f
p
o
l
m
u
l
t
 
p
a
r
 
f
F
A
 
s
i
g
n
a
l
 
 
r
p
o
l
m
u
l
t
 
p
a
r
 
r
A
N
D
 
 
 
respectively. Assuming as before that the circuit inputs are obtained directly
from some parallel registers and the output
 
  is directly stored in a parallel
register, the total computation time
T is proportional to
L
p
o
l
m
u
l
t
 
p
a
r
 
L
r
e
g
 
r
r
e
g
f
p
o
l
m
u
l
t
 
p
a
r
 
L
C
P
 
p
o
l
m
u
l
t
 
p
a
r
 
r
p
o
l
m
u
l
t
 
p
a
r
f
r
e
g
 
 
 
m
 
 
 
 
Hence, the area-time product
A
T
  is proportional to
C
L
 
p
o
l
m
u
l
t
 
p
a
r
 
 
C
p
o
l
m
u
l
t
 
p
a
r
 
L
p
o
l
m
u
l
t
 
p
a
r
 
 
 
 
 
 
m
 
 
 
 
 
 
m
 
 
 
 
 
A Bit-Serial Architecture
In the beginning of Chapter 4 we stated that, depending on the modulus, bit-serial architectures are often impracticable for arithmetic operations in integer quotient rings. However, when using the polar representation of the integers of $\mathbb{Z}_{2^m+1}$, all computations are carried out modulo $2^m$. Because the reduction modulo $2^m$ can be carried out instantaneously, bit-serial architectures may be competitive for some arithmetic operations.
In Figure 7.19 we show a bit-serial architecture for general “polar” multipli-
cation, which is based on the parallel multiplier in Figure 7.18. The chip area
A occupied by this multiplier is proportional to its size
C
p
o
l
m
u
l
t
 
s
e
r
 
C
F
A
 
 
 
m
 
 
 
C
r
e
g
 
C
A
N
D
 
C
N
O
R
 
C
i
n
v
 
 
 
m
 
 
 
 
During an initial clock cycle, the
$m$-bit shift registers R
  and R
  are loaded with
 
 
 
m
 
 
  and
 
 
 
m
 
 
 , respectively, and the D flip-flops D
 , D
 , and D
  are loaded
with
 
 
m,
 
 
m, and zero, respectively. Then, during the
m subsequent clock cy-
cles, the digits
 
 
 ,
 
 
 
 
 
 
 ,
 
 
m
 
  are shifted into the feedback shift register R
 .
Hence, after a total of $m+1$ clock cycles, the result
 
 
 
m
 
 
  is contained in reg-
ister R
 .
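The bit-serial operation described above can be mimicked in a few lines of Python. This is only a behavioural sketch: one loop iteration stands for one clock cycle, the variable `carry` stands for the carry flip-flop that is reset to zero, and the zero-flag handling of the real circuit is left out. The function name is hypothetical.

```python
def bit_serial_add_mod_2m(a, b, m):
    """Add two m-bit exponents bit by bit, least significant bit first."""
    carry = 0                          # carry flip-flop, reset before the first cycle
    result = 0
    for i in range(m):                 # one clock cycle per bit
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry              # full adder sum output
        carry = (x & y) | (carry & (x ^ y))   # full adder carry output
        result |= s << i               # sum bit shifted into the result register
    return result                      # final carry is dropped: reduction mod 2**m
```

Dropping the last carry is all that the reduction modulo $2^m$ requires, which is why the serial version needs no extra correction step.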
The CP is the dotted path from the serial output of R
  (or R
 ) to the serial input
of R
 . The length of this path equals
L
C
P
 
p
o
l
m
u
l
t
 
s
e
r
 
L
r
e
g
 
r
r
e
g
f
F
A
 
s
i
g
n
a
l
 
L
F
A
 
s
u
m
 
r
F
A
f
A
N
D
 
L
A
N
D
 
r
A
N
D
f
r
e
g
 
 
 
Figure 7.19: A bit-serial architecture for general multiplication using the polar representation.
which implies that the total computation time
T is proportional to
L
p
o
l
m
u
l
t
 
s
e
r
 
 
m
 
 
 
L
C
P
 
p
o
l
m
u
l
t
 
s
e
r
 
 
 
 
m
 
 
 
and the
A
T
  performance is proportional to
C
L
 
p
o
l
m
u
l
t
 
s
e
r
 
 
C
p
o
l
m
u
l
t
 
s
e
r
 
L
p
o
l
m
u
l
t
 
s
e
r
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
In Figure 7.20 we have plotted the parameters $C_{\text{polmult,par}}$, $C_{\text{polmult,ser}}$, $L_{\text{polmult,par}}$, $L_{\text{polmult,ser}}$, $CL^2_{\text{polmult,par}}$, and $CL^2_{\text{polmult,ser}}$ versus
m for
m
 
 
 
 
 
 
 
 
  . Obviously, the bit-parallel architecture in Figure 7.18 is superior to the bit-serial architecture in Figure 7.19 with respect to both chip area and computation time and, consequently, also with respect to area-time performance. Note, however, that the size (area) of the bit-parallel architecture becomes greater than the size of the bit-serial architecture if the input and output registers are included in the size parameter $C_{\text{polmult,par}}$.
Figure 7.20: The sizes $C$, lengths $L$, and $AT^2$ performances $CL^2$ of the bit-parallel and the bit-serial polar multipliers in Figure 7.18 and Figure 7.19, respectively (panels: area complexity, time complexity, area-time performance). The parameters are plotted versus $m$ for
m
 
 
 
 
 
 
 
 
 .
7.6.6 Multiplication by powers of
 
One of the major attributes of the polar representation of the elements of $\mathbb{Z}_{2^m+1}$ follows from Corollary 7.1: When computing a Fermat number transform of length $N = 2^b$ using the transform kernel
 
 
 
 
m
 
b
 
m
o
d
 
m
 
 
  , each
multiplication by
  can be performed as one
$b$-bit addition modulo $2^b$.
Let
  beanonzerointeger of
Z
 
m
 
  and
 
 
b
 
m. By Definition 7.3, the polar
integer
P
 
 
 
 
 
 
 
Z
 
m can be written in the form
 
 
 
 
 
 
m
 
b
 
 
m
 
b
 
 
 
 
m
 
b
 
 
 ,
where
 
 
 
m
 
b
  is formed by the
b most signiﬁcant bits and
 
 
 
m
 
b
 
 
  is formed
by the
m
 
b least signiﬁcant bits of
 
 . By (7.21), (7.22) and (7.23), the polar
representation of the product
 
 
 
 
n
 
 
 
n
 
m
 
b
 
m
o
d
 
m
 
 
 equals
P
 
 
 
n
 
 
 
 
 
 
 
 
m
 
b
 
 
m
 
b
 
 
 
 
m
 
b
 
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 
Figure 7.21: Computation of
 
 
 
 
 
 
 
 
 
m
o
d
 
m
 , where
 
 
 
n
 
b
 
 
 
 
m
 
b.
where
 
 
 
 
 
 
m
 
b
 
 
 
 
 
m
 
b
 
 
 
 
 
m
 
b
 
 
m
o
d
 
b
 
 
 
 
m
 
b
 
 
 
 
 
 
 
m
 
b
 
 
 
 
where in turn we have
 
 
 
m
 
b
 
 
n
 
b
 
 
 . Obviously,
P
 
 
n
 
  can be computed
using only one
b-bit addition of
 
 
 
m
 
b
  and
n
 
b
 
 
  modulo $2^b$. This is illustrated
in Figure 7.21.
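The observation that only the $b$ most significant exponent bits change can be verified with a short Python sketch. It is an illustration under assumptions made here, not a transcription of the thesis's formulas: the kernel is taken to be the primitive element raised to the power $2^{m-b}$, the field is the small one with $m = 4$ (modulus 17), the primitive element is 3, and the function name `mul_by_kernel_power` is hypothetical.

```python
M = 4                      # assumed exponent width; the modulus is 2**M + 1 = 17
P = 2**M + 1
G = 3                      # assumed primitive element

def mul_by_kernel_power(a, n, m, b):
    """Exponent of G**a * kernel**n, where kernel = G**(2**(m-b)) has order 2**b.

    Only the b most significant bits of the m-bit exponent a are modified.
    """
    low_width = m - b
    low = a & ((1 << low_width) - 1)          # m-b least significant bits, unchanged
    high = a >> low_width                     # b most significant bits
    high = (high + n) & ((1 << b) - 1)        # one b-bit addition modulo 2**b
    return (high << low_width) | low
```

For instance, with `b = 2` the kernel is `pow(G, 4, P)`, and `mul_by_kernel_power(a, n, M, 2)` leaves the two low exponent bits of `a` untouched for every `n`.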
7.6.6.1 Fixed Architectures
An architecture which computes
 
 
 
P
 
 
n
 
  , for some fixed
b
 
 
 
 
m
  , is
shown in Figure 7.22. Let
 
 
 
 
 
 
m
 
b
 
 
n
 
b
 
 
 
 
m
o
d
 
b
  denote the
$b$-bit output
of the parallel adder in the ﬁgure. Each digit of
 
  and each digit of
 
 
 
m
 
b
 
 
  is
forwarded to one of the inputs of a two-input AND gate. The one’s comple-
ment
 
 
m of
 
 
m is the second input of each AND gate. If
 
 
 
Z
 
m, i.e. if
 
 
 
  ,
the desired product
 
 
 
 
 
 
 
m
 
b
 
 
n
 
b
 
 
 
m
o
d
 
b
 
 
m
 
b
 
 
 
 
m
 
b
 
 
  will be present
at the circuit output. If
 
 
 
 
 
 
m, i.e. if
 
 
 
  , the output
 
 
 
m
 
 
  is set equal
to zero by the row of AND gates. We always have
 
 
m
 
 
 
m.
If the
b-bit parallel adder in Figure 7.22 is an ordinary carry ripple adder, the
size of the architecture in the ﬁgure equals
C
m
u
l
t
 
 
 
p
a
r
 
b
C
F
A
 
m
C
A
N
D
 
C
i
n
v
 
 
m
 
 
 
b
 
 
 
where
 
 
b
 
m. The internal CP of the architecture, which is indicated by
the dotted path in the figure, has length
Figure 7.22: A bit-parallel architecture for computing polar multiplication by pow-
ers of
 ;
 
 
 
P
 
 
 
n
 
 
 
 
m
 
m
 
 
 
 
m
 
b
 
 
m
 
b
 
 
 
 
m
 
b
 
 
 , where
o
r
d
 
m
 
 
 
 
 
 
 
b for some ﬁxed
b
 
 
 
 
m
  . The output circuitry is formed by a row of $b + (m - b) = m$ two-input AND gates.
L
C
P
 
m
u
l
t
 
 
 
p
a
r
 
 
b
 
 
 
 
L
F
A
 
c
a
r
r
y
 
r
F
A
f
F
A
 
c
a
r
r
y
 
 
L
F
A
 
s
u
m
 
r
F
A
f
A
N
D
 
L
A
N
D
 
 
 
b
 
 
 
The fan-in and the output normalised resistance , with respect to this CP, are
equal to
f
m
u
l
t
 
 
 
p
a
r
 
f
F
A
 
s
i
g
n
a
l
 
 
r
m
u
l
t
 
 
 
p
a
r
 
r
A
N
D
 
 
respectively. With the CP both starting and ending in a register, the total computation time of the architecture in Figure 7.22 is proportional to
L
m
u
l
t
 
 
 
p
a
r
 
L
r
e
g
 
r
r
e
g
f
m
u
l
t
 
 
 
p
a
r
 
L
C
P
 
m
u
l
t
 
 
 
p
a
r
 
r
m
u
l
t
 
 
 
p
a
r
f
r
e
g
 
 
 
b
 
 
 
and hence, the area-time performance is proportional to
C
L
 
m
u
l
t
 
 
 
p
a
r
 
 
C
m
u
l
t
 
 
 
p
a
r
 
L
m
u
l
t
 
 
 
p
a
r
 
 
 
 
 
m
 
 
 
b
 
 
 
 
 
 
b
 
 
 
 
 
 
Note that for
b
 
m we have
L
m
u
l
t
 
 
 
p
a
r
 
L
p
o
l
m
u
l
t
 
p
a
r and
C
m
u
l
t
 
 
 
p
a
r
 
C
p
o
l
m
u
l
t
 
p
a
r.
For
b
 
m we have
L
m
u
l
t
 
 
 
p
a
r
 
L
p
o
l
m
u
l
t
 
p
a
r and
C
m
u
l
t
 
 
 
p
a
r
 
C
p
o
l
m
u
l
t
 
p
a
r.
A bit-serial architecture for polar multiplication by powers of
  can be designed in a rather straightforward manner. It is derived from the bit-parallel architecture in Figure 7.22 in the same way as the bit-serial general multiplier in Figure 7.19 was derived from the bit-parallel multiplier in Figure 7.18. Such an architecture would be rather similar to the universal bit-serial architecture in Figure 7.24, which is described below. Therefore, it is not considered here.
7.6.6.2 Universal Architectures
A bit-serial/parallel architecture
So far, all architectures considered in the present chapter, except the one in Figure 7.22, can be used when computing the Fermat number transform of length $N = 2^b$ in $\mathbb{Z}_{2^m+1}$ for some given
m
 
 
 
 
 
 
 
 
 . The circuit in Figure 7.22 can
only be used for some ﬁxed
b
 
 
 
 
m
 . The bit-serial/parallel architecture in
Figure 7.23, however, is a universal circuit for multiplication by powers of
 ,
i.e. it is applicable for all possible transform lengths $N = 2^b$, where
b
 
 
 
 
m
 .
The circuit works as follows.
  During an initial clock cycle, the parallel register R
  is loaded with
n
 
b
 
 
 ,
shift register R
  is loaded with
 
 
 
m
 
 
 , and the D flip-flop is loaded with
 
 
m. All registers in the architecture are
m bits wide.
  During the following $b$ clock cycles, the transmission gates subsequent to
the parallel adder are all closed and the $b$-bit NBC integer
 
 
 
m
 
b
  is shifted
into both register R
  and R
 . The signal
$S$ (Shift enable) is a control signal
that either enables (
S
 
  ) or disables (
S
 
  ) the shifting of the contents
of the shift registers. Consequently, during the
b shifts just mentioned,
we have
S
 
  . Each clock interval is proportional to the length
L
P
 
 
L
r
e
g
 
r
r
e
g
 
 
f
r
e
g
 
 
Figure 7.23: A universal bit-serial/parallel architecture for polar multiplication by
powers of
 ;
 
 
 
P
 
 
 
n
 , where
o
r
d
 
m
 
 
 
 
 
 
 
b for any
b
 
 
 
 
m
 .
of the dotted path P
  in Figure 7.23.
  After the
b shifts, the control signal
S is set to 0 (zero). Let
 
  denote the
NBC integer which is formed by the
b least signiﬁcant output bits of the
parallel (carry ripple) adder. Then we have
 
 
 
 
 
 
m
 
b
 
 
n
 
b
 
 
 
 
m
o
d
 
b
 .
The
m
 
b most signiﬁcant output bits of the adder are redundant.
1. If
 
 
m
 
  , i.e. if
 
 
  , the transmission gates subsequent to the
adder remain closed, so that the contents
 
 
 
m
 
b
 
 
 
 
 
 
 
 
 
 
 
 
 
  in
the
b least signiﬁcant bit positions of R
  remain unchanged.
2. If
 
 
m
 
  , i.e. if
 
 
 
  , the transmission gates are open
 
  and the
register R
  is loaded with the adder output
 
 .
The time needed to compute
 
 
 
 
 
 
m
 
b
  and load it into the
b least sig-
niﬁcant bit positions of R
  is proportional to the length
 
 
L
P
 
 
L
r
e
g
 
r
r
e
g
f
F
A
 
s
i
g
n
a
l
 
 
b
 
 
 
 
L
F
A
 
c
a
r
r
y
 
r
F
A
f
F
A
 
c
a
r
r
y
 
 
L
F
A
 
s
u
m
 
 
r
F
A
 
 
 
f
r
e
g
 
 
 
b
 
 
 
of the dotted path P
  in the ﬁgure. The maximum computation time is
obtained for
b
 
m, i.e. we have
L
P
 
 
m
a
x
 
 
m
a
x
L
P
 
 
 
 
m
 
 
  .
  Next, the control signal $S$ is set to 1 (one). This transition closes the transmission gates (if they were open) and enables shifting of the shift register
contents. For
 
 
m
 
  , the time to close the transmission gates is propor-
tional to the length
L
P
 
 
r
S
 
f
O
R
 
 
f
S
 
r
e
g
 
 
L
O
R
 
r
O
R
 
f
i
n
v
 
m
 
 
r
i
n
v
 
m
 
r
S
 
 
 
 
f
S
 
r
e
g
 
 
 
m
 
 
 
where
r
S is the normalised resistance from the
S input node of the OR
gate to the supply voltage source and
f
S
 
r
e
g is the fan-in of the shift regis-
ters, with respect to the control input signal
S. By assuming
r
S
 
 and
f
S
 
r
e
g
 
  , we get
L
P
 
 
 
m
 
 
  .
  Finally, during
m
 
b clock cycles,
 
 
 
m
 
b
 
 
 
 
 
 
 
m
 
b
 
 
  is shifted from
R
  into the
m
 
b least signiﬁcant bit positions of R
  while
 
 
 
m
 
b
 
 
 
  is
shifted up to the
b most signiﬁcant bit positions of R
 .
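The sequence of steps above amounts to a rotate, add, rotate schedule, which can be sketched in Python. This is a behavioural sketch only, under assumptions made here: register names and control signals are omitted, the operand is assumed nonzero (so the zero-flag path that leaves the register unchanged is not modelled), and the function names `rotl` and `universal_mul` are hypothetical.

```python
def rotl(x, r, m):
    """Rotate the m-bit word x left by r positions."""
    r %= m
    mask = (1 << m) - 1
    return ((x << r) | (x >> (m - r))) & mask

def universal_mul(a, n, m, b):
    """Multiply by the n-th kernel power for any 1 <= b <= m using rotations.

    Step 1: b shifts bring the b most significant exponent bits to the low end.
    Step 2: a b-bit addition of n, with the carry out of bit b-1 discarded.
    Step 3: m-b further shifts restore the original bit order.
    """
    r = rotl(a, b, m)
    low_mask = (1 << b) - 1
    r = (r & ~low_mask) | ((r + n) & low_mask)
    return rotl(r, m - b, m)
```

The total of $b + (m - b) = m$ shifts is independent of $b$, which is what makes one circuit usable for every transform length.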
We assume that the registers can be initialised during one cycle of the shift
register clock. Then, the total time needed to perform a polar multiplication
by a power of
 , using the universal architecture in Figure 7.23, is proportional
to
L
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r
 
 
b
 
 
 
L
P
 
 
L
P
 
 
L
P
 
 
 
m
 
b
 
L
P
 
 
 
 
m
 
 
 
b
 
 
 
 
 The transmission gates have opened before the digit
 
 
b
￿
  of
 
  is present at the adder
output.
 
 We assume that, for all
b
 
m, the length of path P
  is always greater than the length of
path P
 . This is true in virtually all cases.
which, for
b
 
m, equals
L
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r
 
m
a
x
 
 
 
m
 
 
  . The desired result
 
 
 
P
 
 
 
 
n
  is present at the output of register R
  (and the D ﬂip-ﬂop). The
size of the universal architecture equals
C
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r
 
m
C
F
A
 
 
 
m
 
 
 
C
r
e
g
 
m
C
T
G
 
C
O
R
 
C
i
n
v
 
 
 
m
 
 
 
 
which implies that its area-time performance is proportional to
C
L
 
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r
 
 
C
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r
 
L
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r
 
 
 
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
b
 
 
 
 
 
 
A bit-serial architecture
In Figure 7.24 we show a universal bit-serial architecture for polar multipli-
cation by powers of
 , which is based on the bit-serial/parallel architecture
in Figure 7.23. Also, it is quite similar to the bit-serial architecture for general
multiplication, see Figure 7.19. The size of the architecture in Figure 7.24, in
which the shift registers R
  and R
  are
m bits wide, equals
C
u
n
i
v
 
m
u
l
t
 
 
 
s
e
r
 
C
F
A
 
 
 
m
 
 
 
C
r
e
g
 
C
A
N
D
 
 
C
i
n
v
 
C
T
G
 
 
 
 
 
m
 
 
 
 
The control signal
$S$ is the same signal for shift enabling/disabling that is used
in the above universal bit-serial/parallel architecture. During an initial clock
cycle,
S is set to zero, the registers R
  and R
  are loaded with
n
 
b
 
 
  and
 
 
m
 
 ,
respectively, and the D ﬂip-ﬂops D
  and D
  are loaded with
 
 
m and zero, re-
spectively. During the following
m
 
b clock cycles,
 
 
 
m
 
b
 
 
  is shifted into
the most signiﬁcant bit positions of shift register R
 , i.e. we simply perform an $(m-b)$-bit rotation of the contents of R
 . Then, $S$ is set to one (this is done during one clock cycle) and if
 
 
m
 
  , the
b-bit sum
 
 
 
 
 
 
m
 
b
 
 
n
 
b
 
 
 
 
m
o
d
 
b
 
is shifted into register R
 . If
 
 
m
 
  , only zeros are shifted into the register
(
 
 
 
P
 
 
 
 
n
 
 
P
 
 
 
 
 
m
 
 
 
 
m
 
 
 
 
  ).
The CP of the multiplier is the dotted path from the serial output of register
R
  to the serial input of register R
  in Figure 7.24. The clock cycle time is pro-
portional to the length
L
C
P
 
u
n
i
v
 
m
u
l
t
 
 
 
s
e
r
 
L
r
e
g
 
 
r
r
e
g
 
 
 
f
F
A
 
s
i
g
n
a
l
 
L
F
A
 
s
u
m
 
r
F
A
f
A
N
D
 
L
A
N
D
 
r
A
N
D
f
r
e
g
 
 
Figure 7.24: A universal bit-serial architecture for polar multiplication by powers of
 ;
 
 
 
P
 
 
 
n
 , where
o
r
d
 
m
 
 
 
 
 
 
 
b for any
b
 
 
 
 
m
 .
of the CP. Because the desired product
 
 
 
P
 
 
 
  is obtained in register R
 
after
 
 
 
m
 
b
 
 
 
 
b
 
m
 
 clock cycles, the total computation time is
proportional to
L
u
n
i
v
 
m
u
l
t
 
 
 
s
e
r
 
 
m
 
 
 
L
C
P
 
p
o
l
m
u
l
t
 
s
e
r
 
 
 
 
m
 
 
 
 
which implies that the
A
T
  performance of the bit-serial architecture is pro-
portional to
C
L
 
u
n
i
v
 
m
u
l
t
 
 
 
s
e
r
 
 
C
u
n
i
v
 
m
u
l
t
 
 
 
s
e
r
 
L
u
n
i
v
 
m
u
l
t
 
 
 
s
e
r
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
The area and time complexities and the area-time performances of the above universal bit-serial/parallel and strictly bit-serial architectures are plotted versus $m$ in Figure 7.25. Note that for
L
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r and
C
L
 
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r we have
actually set $b = m$, i.e. we have plotted
L
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r
 
m
a
x and
C
L
 
u
n
i
v
 
m
u
l
t
 
 
 
p
a
r
 
m
a
x.
As expected, for all
m
 
 
 
 
 
 
 
 
 , the size of the bit-serial/parallel archi-
tecture is greater than the size of the bit-serial architecture, while we have the
opposite relation when considering their respective computation time. With
respect to their area-time performance, the bit-serial/parallel architecture is
preferable to the bit-serial architecture for
m
 
 
 
  with
b
 
m and for
m
 
 
 
 
  with
b
 
m
 
 
 
  . The bit-serial architecture is preferable to the bit-
serial/parallel architecture for
m
 
 
 
 
  with
b
 
m
 
 
 
  . Note, however, that the difference in area-time performance of the two architectures is relatively insignificant.
Figure 7.25: The sizes $C$, lengths $L$, and $AT^2$ performances $CL^2$ of the universal bit-serial/parallel and bit-serial architectures in Figure 7.23 (for $b = m$) and Figure 7.24, respectively (panels: area complexity, time complexity, area-time performance). The parameters are plotted versus $m$ for $m = 2, 4, 8, 16$.
Remark: For simplicity, we have assumed that the shift registers in Figures 7.23 and 7.24, with shift enable control signal $S$, have the same area and time complexities as the other registers considered in the thesis.
7.7 Summary
In Sections 5.2 and 6.4 we summarised the complexity and performance parameters of the architectures considered in the respective chapters. In Table 7.4, we have summarised the corresponding parameters for the architectures considered in the present chapter.
[Table 7.4 (rotated in the original; only the entries listed here are legible).
Columns: Operation; Figure; Subscript name; Size, $C$; Fan-in, $f$; Int. CP length, $L_{CP}$; Norm. output res., $r_o$; Total CP length, $L$ (including registers); Area-time perf., $CL^2$.
Rows (operation, figure, subscript name):
Computing $a_{ij}$, Figure 7.7
Computing $d_k$, Figure 7.8
Negation, Figure 7.12, "polneg": $C = 4$, $L = 34$, $CL^2 = 4624$
Addition, Figure 7.13, "poladd,1"
Addition, Figure 7.17, "poladd,2"
General multipl., Figure 7.18, "polmult,par"
General multipl., Figure 7.19, "polmult,ser"
Multiplication by a kernel power, Figure 7.22
Univ. multipl. by a kernel power, Figure 7.23 (bit-serial/parallel)
Univ. multipl. by a kernel power, Figure 7.24 (bit-serial)]
Table 7.4: Complexity parameters of the architectures considered in the present chapter.

Chapter 8
Comparisons Between Element
Representations
The purpose of this chapter is to make brief comparisons between the element
representations in Chapters 5, 6, and 7, i.e. the normal binary coded (NBC),
the diminished–1, and the polar representation, respectively. We compare the
respective VLSI architectures for arithmetic operations which are considered
in these chapters.
8.1 Arithmetic Operations
Only the measure $C$ of area complexity, the measure $L$ of time complexity, and the measure $CL^2$ of combined area-time performance of each architecture are considered here. Regarding the parameter $L$, we generally only consider the total CP length (which is proportional to the total computation time) and not the internal CP length (which for a bit-serial architecture is proportional to the clock cycle time). For a detailed characterisation of the architectures, we refer to Chapters 5, 6, and 7. In particular, see Tables 5.1, 6.5, and 7.4 in the respective Summary sections 5.2, 6.4, and 7.7.
8.1.1 Modulus Reduction
Form of repr.      Size $C$    Total CP length $L$    $CL^2$
NBC
  CR type          …           …                      …
  CLA type         …           …                      …
Diminished–1       As in the NBC case
Polar              0           0                      0
Table 8.1: Sizes $C$, total CP lengths $L$, and area-time performances $CL^2$ of the architectures for modulus reduction, with respect to element representation. "CR" = carry ripple, "CLA" = carry look-ahead.

One of the main advantages of the polar representation is that modulus reduction is an instantaneous operation: The residue of the normal binary coded
integer
 
 
 
Zmodulo
 
m equals
 
 
 
m
 
 
 
 
 
 
 
m
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
. In both the diminished–1 and the polar representation, the integer $2^m$ is used as a representative of zero, which means that we can use an $m$-bit arithmetic for the nonzero integers of $\mathbb{Z}_{2^m+1}$.
In Section 5.1.1 (see for example Figure 5.3) we concluded that, with respect to the area-time performance $AT^2$ (and the time performance), the carry look-ahead type modulus reduction architecture is preferable to the carry ripple type architecture. From the $CL^2$ parameters in the rightmost column of Table 8.1 one may conclude that the carry ripple type architecture (with
C
L
 
 
O
 
m
 
 ) is preferable to the carry look-ahead type architecture (for which
C
L
 
 
O
 
m
 
l
o
g
 
m
 ). However, the product
C
L
  is smaller for the former
architecture, compared to the latter one, only for very large
m;
m
 
 
 
 .
Anyhow, as seen in Chapters 5, 6, and 7, in most circuits performing arithmetic
operations, the modulus reduction part of the operation is preferably incorporated
into each separate arithmetic operation.
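The reduction that the NBC architectures have to perform can be illustrated in software. The following is only a minimal sketch, not the circuit of Chapter 5: it assumes nothing beyond the identity $2^m \equiv -1 \pmod{2^m+1}$, and the function name `reduce_mod_fermat` is hypothetical.

```python
def reduce_mod_fermat(x: int, m: int) -> int:
    """Reduce 0 <= x < 2**(2*m) modulo 2**m + 1 without a division.

    Writing x = low + high * 2**m and using 2**m = -1 (mod 2**m + 1)
    gives x = low - high (mod 2**m + 1).
    """
    modulus = (1 << m) + 1
    low = x & ((1 << m) - 1)   # the m least significant bits
    high = x >> m              # the remaining bits, each worth -1
    r = low - high
    if r < 0:                  # one conditional correction is enough
        r += modulus
    return r
```

A single subtraction and one conditional addition suffice here because both halves are smaller than $2^m$.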
8.1.2 Code Translation
The code translation from the NBC to the diminished–1 representation is simply carried out as a subtraction by one modulo $2^m + 1$. As seen in Table 8.2, the area-time product $CL^2$ is slightly less for the reverse translation (addition by one modulo $2^m + 1$). The code translation from the NBC to the polar representation and its reverse code translation involve the computation of the discrete logarithm and discrete exponentiation, respectively. In Sections 7.4 and 7.5, we showed how to compute the discrete logarithm and perform discrete exponentiation either without (Sec. 7.4) or with (Sec. 7.5) the use of look-up tables.

Form of repr.         Size $C$    Total CP length $L$    $CL^2$
NBC                   —
Diminished–1
  NBC to dim.–1       …           …                      …
  Dim.–1 to NBC       …           …                      …
Polar
  NBC to polar        One discrete logarithm
  Polar to NBC        One discrete exponentiation
Table 8.2: Sizes $C$, total CP lengths $L$, and area-time performances $CL^2$ of the architectures for code translation, with respect to element representation.

Form of repr.      Size $C$    Total CP length $L$    $CL^2$
NBC                …           …                      …
Diminished–1       …           …                      …
Polar              4           34                     4624
Table 8.3: Sizes $C$, total CP lengths $L$, and area-time performances $CL^2$ of the architectures for negation, with respect to element representation.
It is obvious that both the area complexities and the time performances of the
code translations to and from the diminished–1 representation are less than
the corresponding complexities of the code translations to and from the po-
lar representation. Regarding the area and time complexities of the discrete
logarithm and discrete exponentiation, we refer to Sections 7.6.1 and 7.6.2.
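The two kinds of code translation can be contrasted with a small table-based sketch. It is an illustration under stated assumptions rather than the thesis's architecture: the field is fixed to $m = 4$ (modulus 17), the primitive element 3 is an assumed choice, the code $2^m$ is taken to represent zero in the polar representation, and all names are hypothetical.

```python
M = 4
Q = (1 << M) + 1     # Fermat prime 17
ALPHA = 3            # assumed primitive element of Z_17
ZERO_CODE = 1 << M   # assumed polar code for the element zero

def nbc_to_dim1(x: int) -> int:
    """Diminished-1 code of x: subtract one modulo 2**m + 1 (so 0 -> 2**m)."""
    return (x - 1) % Q

def dim1_to_nbc(d: int) -> int:
    """Inverse translation: add one modulo 2**m + 1."""
    return (d + 1) % Q

# Look-up tables for discrete exponentiation and discrete logarithm.
EXP_TABLE = [pow(ALPHA, i, Q) for i in range(Q - 1)]
LOG_TABLE = {value: i for i, value in enumerate(EXP_TABLE)}

def nbc_to_polar(x: int) -> int:
    """One discrete logarithm (a table look-up here)."""
    return ZERO_CODE if x == 0 else LOG_TABLE[x]

def polar_to_nbc(e: int) -> int:
    """One discrete exponentiation (a table look-up here)."""
    return 0 if e == ZERO_CODE else EXP_TABLE[e]
```

The diminished–1 translation is a single modular increment or decrement, whereas the polar translation needs a table with one entry per nonzero field element, which is the cost difference the text refers to.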
8.1.3 Negation
Table 8.3 shows some complexity parameters related to the architectures for negation using the NBC, diminished–1, and polar representations. The parameters $C$, $L$, and $CL^2$ are plotted versus $m$ in Figure 8.1. With respect to each of these parameters, it is clear that diminished–1 negation is generally less complex than NBC negation. In Fermat prime fields, i.e. when the modulus $2^m + 1$ is prime, negation in the polar representation is in turn less complex than negation in the diminished–1 representation.

[Figure 8.1 (plots omitted): three panels plotted versus $m$ from 2 to 256: area complexity (size, $C$), time complexity (total CP length, $L$), and area-time performance ($CL^2$), each with curves for NBC, Dim.–1, and Polar.]
Figure 8.1: Plots of the complexity parameters $C$, $L$, and $CL^2$ for negation when using the NBC, the diminished–1, or the polar representation. The parameters are obtained from Table 8.3.
8.1.4 Addition
As seen in Table 8.4, the complexity and the performance of performing addition in $\mathbb{Z}_{2^m+1}$ are approximately the same when using the NBC representation as when using the diminished–1 representation. For a comparison between the carry ripple-type and the carry look-ahead-type diminished–1 adders, we refer to Section 6.3.4.

Form of repr.       Size $C$    Total CP length $L$    $CL^2$
NBC (carry r.)      …           …                      …
Diminished–1
  Carry ripple      …           …                      …
  Carry l.-a.       …           …                      …
Polar
  Figure 7.13       …           …                      —
  Figure 7.17       …           …                      —
Table 8.4: Sizes $C$, total CP lengths $L$, and area-time performances $CL^2$ of the architectures for addition, with respect to element representation. The sizes and CP lengths for polar addition are valid for $m = 8$ and $m = 16$.

One of the main disadvantages of the polar representation derives from the fact that each polar addition involves the computation of one Zech's logarithm (see Figure 7.13), or essentially two discrete exponentiations and one discrete logarithm (see Figure 7.17). We have considered realisations of these operations which involve look-up tables. As seen in Table 8.4, polar addition is a much more complex operation than, for example, diminished–1 addition. Note, however, that in order to get a fair comparison between the polar representation and the diminished–1 (or NBC) representation, polar addition should be compared with diminished–1 (or NBC) general multiplication and polar general multiplication should be compared with diminished–1 (or NBC) addition. This is further discussed in Section 8.1.7.
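The asymmetry between polar addition and polar multiplication can be seen in a small software model. This is a hedged sketch, not the architectures of Figures 7.13 and 7.17: it assumes $m = 4$, the primitive element 3, the usual definition of Zech's logarithm $Z(n)$ by $1 + \alpha^n = \alpha^{Z(n)}$, and the code $2^m$ for zero; all names are hypothetical.

```python
M = 4
Q = (1 << M) + 1     # 17
ORDER = 1 << M       # order of the multiplicative group
ALPHA = 3            # assumed primitive element
ZERO = ORDER         # assumed exponent code for the element zero

EXP = [pow(ALPHA, i, Q) for i in range(ORDER)]
LOG = {value: i for i, value in enumerate(EXP)}
# Zech's logarithm table; ZERO marks the case 1 + alpha**n = 0.
ZECH = [LOG.get((1 + EXP[n]) % Q, ZERO) for n in range(ORDER)]

def to_int(i: int) -> int:
    """Integer value of the polar code i."""
    return 0 if i == ZERO else EXP[i]

def polar_mul(i: int, j: int) -> int:
    """Multiplication is one exponent addition modulo 2**m."""
    if i == ZERO or j == ZERO:
        return ZERO
    return (i + j) % ORDER

def polar_neg(i: int) -> int:
    """Negation adds 2**(m-1), because alpha**(2**(m-1)) = -1."""
    return i if i == ZERO else (i + ORDER // 2) % ORDER

def polar_add(i: int, j: int) -> int:
    """alpha**i + alpha**j = alpha**(i + Z(j - i)): one Zech look-up."""
    if i == ZERO:
        return j
    if j == ZERO:
        return i
    z = ZECH[(j - i) % ORDER]
    return ZERO if z == ZERO else (i + z) % ORDER
```

In this model multiplication and negation need no table at all, while addition needs a table with $2^m$ entries, which mirrors the area penalty discussed above.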
Remark: The complexity parameters for polar addition in Table 8.4 are approximate estimations. When using the delay model described in Section 4.2, we have not been able to determine the values of the constants $L_{\text{exp-tab}}$ and $L_{\text{log-tab}}$.
Form of repr.       Size $C$    Total CP length $L$    $CL^2$
NBC (s/p)           …           …                      …
Diminished–1
  Ashur's par       …           …                      …
  Shyu's s/p        …           …                      …
Polar
  Bit-parallel      …           …                      …
  Bit-serial        …           …                      …
Table 8.5: Sizes $C$, total CP lengths $L$, and area-time performances $CL^2$ of the architectures for general multiplication, with respect to element representation. "s/p" = bit-serial/parallel. "par" = bit-parallel.
8.1.5 General Multiplication
The sizes $C$, total CP lengths $L$, and area-time products $CL^2$ of the architectures for general multiplication considered in Chapters 5, 6, and 7 are listed in Table 8.5. In Chapter 6 we considered six different diminished–1 general multipliers, of which three are bit-serial and the other three are bit-serial/parallel multipliers. The sizes and total CP lengths of these multipliers were summarised in Table 6.4 at the end of Section 6.3.6. Also, the complexity parameters of the best bit-parallel multiplier (Ashur's) and the best bit-serial/parallel multiplier (Shyu's) were plotted versus $m$ in Figure 6.25. Among the diminished–1 multipliers, only these two are considered in Table 8.5.
In Figure 8.2, we have plotted the parameters $C$, $L$, and $CL^2$ of the NBC bit-serial/parallel multiplier, Ashur's diminished–1 bit-parallel multiplier, and our polar bit-parallel multiplier. For
m
 
 , the complexity parameters of
Shyu's multiplier are all slightly less than the corresponding complexity parameters of the NBC multiplier. Therefore, Shyu's multiplier is not considered in Figure 8.2. We see that the $AT^2$ performance of Ashur's multiplier is less than the $AT^2$ performance of the NBC multiplier. In Fermat prime fields, the polar multiplier is in turn superior to the other multipliers.

[Figure 8.2 (plots omitted): three panels plotted versus $m$ from 2 to 256: area complexity (size, $C$), time complexity (total CP length, $L$), and area-time performance ($CL^2$), each with curves for NBC, Ashur's, and Pol. par.]
Figure 8.2: Plots of the complexity parameters $C$, $L$, and $CL^2$ for general multiplication with respect to the NBC, the diminished–1, or the polar representation. The diminished–1 multiplier is Ashur's bit-parallel multiplier and the polar multiplier is the bit-parallel one. The parameters are obtained from Table 8.5.
8.1.6 Multiplication by Powers of
 
Multiplication by $2^n$

Form of repr.      Size $C$    Total CP length $L$    $CL^2$
NBC                …           …                      …
Diminished–1       …           …                      …
Polar              …           …                      …
Table 8.6: Sizes $C$, total CP lengths $L$, and area-time performances $CL^2$ of the architectures for multiplication by $2^n$, with respect to element representation.

Multiplications by powers of two typically occur when computing Fermat number transforms of lengths $N = 2m$ and $N = 4m$ using the NBC or the diminished–1 representation. Then, the transform kernels most often used are $2$ (for $N = 2m$) and $\sqrt{2}$ (for $N = 4m$), see Section 2.3.2. In Table 8.6 we have listed some complexity parameters of architectures for multiplication by $2^n$, $n \in \mathbb{Z}$, with respect to the NBC, the diminished–1, and the polar representation. The parameters of the architecture for polar multiplication (the bottom
row in the table) are obtained by letting $b = \log_2 2m = \log_2 m + 1$ in the corresponding parameters of the fixed architecture for polar multiplication by powers of the transform kernel in Table 8.7.
The parameters in Table 8.6 are plotted versus
m in Figure 8.3. The complex-
ity and performance of the architecture for the NBC representation are rela-
tively high for all
m. The architecture for the diminished–1 representation is
generally superior to the other architectures. However, for
m
 
 
 
 
 
 
 
 
 ,
the architecture for the polar representation has the smallest time complexity
and the smallest area-time performance. Hence, whenever applicable, the ar-
chitecture for polar multiplication by powers of two is preferable to the other
architectures performing the same operation.
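Why multiplication by $2^n$ is cheap in the NBC and diminished–1 representations can be sketched as follows. The code is an illustration only, assuming the identity $2^m \equiv -1 \pmod{2^m+1}$ (so that 2 has order $2m$); it is not a model of the architectures in Table 8.6, and the function names are hypothetical.

```python
def multiplicative_order(a: int, q: int) -> int:
    """Smallest k >= 1 with a**k = 1 (mod q); assumes gcd(a, q) = 1."""
    k, x = 1, a % q
    while x != 1:
        x = (x * a) % q
        k += 1
    return k

def mul_pow2(x: int, n: int, m: int) -> int:
    """Compute x * 2**n modulo 2**m + 1 using a shift and a subtraction."""
    q = (1 << m) + 1
    n %= 2 * m                 # 2 has order 2m
    negate = n >= m            # a shift by m positions equals a sign change
    n %= m
    y = x << n                 # plain binary shift
    low, high = y & ((1 << m) - 1), y >> m
    r = (low - high) % q       # fold the overflow back with weight -1
    return (q - r) % q if negate else r
```

For example, with $m = 4$ the kernel 2 has order 8, which gives the transform length $N = 2m$ mentioned above.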
Multiplication by Powers of
 
 
 
 
m
 
b
When using the diminished–1 representation (or the NBC representation), the Fermat number transform is generally known to be applicable only for some small transform lengths, because then the transform multiplications by powers of the transform kernel can be carried out using only binary shifts (rotations) (we mentioned above the kernels $2$ and $\sqrt{2}$, for which we get the transform lengths $N = 2m$ and $N = 4m$). The restriction to relatively small transform lengths, however, is still adequate in Fermat integer quotient rings where the modulus $2^m + 1$ is composite, because in such rings the maximum possible transform length is relatively small in comparison with the modulus.
In the Fermat prime fields $\mathbb{Z}_{2^8+1}$ and $\mathbb{Z}_{2^{16}+1}$, however, i.e. where the modulus $2^m + 1$ is prime, there exist transforms of much greater lengths than $2m$ and $4m$: We know that whenever $2^m + 1$ is prime, there exist Fermat number transforms of length $N = 2^b$ in $\mathbb{Z}_{2^m+1}$, where $1 \le b \le m$.

Footnote: The equality follows from the fact that the order of the kernel $2$ modulo $2^m + 1$ equals $N = 2^b = 2m$.

[Figure 8.3 (plots omitted): three panels plotted versus $m$ from 2 to 256: area complexity (size, $C$), time complexity (total CP length, $L$), and area-time performance ($CL^2$), each with curves for NBC, Dim.–1, and Polar.]
Figure 8.3: Plots of the complexity parameters $C$, $L$, and $CL^2$ for multiplication by $2^n$ when using the NBC, the diminished–1, or the polar representation. The parameters are obtained from Table 8.6.
Using the NBC representation or the diminished–1 representation, when computing a transform of arbitrary length $N = 2^b$, each nontrivial multiplication by a power of the transform kernel of order $N$ modulo $2^m + 1$ must generally be carried out as a general multiplication. The powers of the kernel which appear in the computation of each transform may be precomputed and stored in a memory. If not, they can be obtained using general exponentiations.
In Chapter 7 we showed how to compute multiplications by arbitrary powers of the transform kernel (the primitive element raised to the power $2^{m-b}$, reduced modulo $2^m + 1$) of arbitrary order $2^b$, $1 \le b \le m$, modulo $2^m + 1$, using one simplified addition in the polar representation. The complexity parameters of the architectures for polar multiplication by a power of this kernel are listed in Table 8.7. These parameters are also plotted versus $m$ in Figure 8.4. Some of the parameters are plotted twice in the figure. For each such pair of curves, the upper curve is an upper bound (for $b = m$) and the lower curve is a lower bound (for $b = 1$) on the parameter in question.

Form of repr.          Size $C$    Total CP length $L$    $CL^2$
NBC                    General (exponentiation and) multiplication needed
Diminished–1           General (exponentiation and) multiplication needed
Polar
  Fixed                …           …                      …
  Universal
  – Serial/parallel    …           …                      …
  – Serial             …           …                      …
Table 8.7: Sizes $C$, total CP lengths $L$, and area-time performances $CL^2$ of the architectures for multiplication by powers of the transform kernel of order $2^b$, with respect to element representation.

For all $b$ with $1 \le b \le m$, the fixed architecture is superior to the two universal architectures. However, the universal bit-serial architecture has the smallest size among the architectures. Note that the complexity and performance of these three architectures are less than the complexity and performance of the architectures for NBC and diminished–1 general multiplication.
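The "one simplified addition" can be made concrete with a short sketch. It is written under assumptions that are not taken from the thesis: $m = 4$, primitive element 3 (called `ALPHA` here), and hypothetical function names. The point it illustrates is that multiplying $\alpha^i$ by the $n$-th power of the kernel $\alpha^{2^{m-b}}$ only changes the $b$ most significant bits of the $m$-bit exponent $i$, by an addition modulo $2^b$.

```python
M = 4
Q = (1 << M) + 1    # 17
ALPHA = 3           # assumed primitive element

def kernel(b: int) -> int:
    """Transform kernel of order 2**b: ALPHA raised to 2**(M - b)."""
    return pow(ALPHA, 1 << (M - b), Q)

def polar_mul_kernel_power(i: int, n: int, b: int) -> int:
    """Exponent of alpha**i * kernel(b)**n, for a nonzero element alpha**i.

    Only the upper b bits of the exponent take part in the addition.
    """
    shift = M - b
    high = i >> shift                  # b most significant bits
    low = i & ((1 << shift) - 1)       # untouched M - b low bits
    high = (high + n) % (1 << b)       # one addition modulo 2**b
    return (high << shift) | low
```

A narrower adder is needed for shorter transforms (small $b$), which is consistent with the fixed architecture being bounded between the $b = 1$ and $b = m$ curves.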
8.1.7 Butterﬂy Computations
In Section 2.3.3, we considered some algorithms for computing the Fermat number transform. In each of these algorithms, the transform computation is subdivided into a number of butterfly computations. For example, in the well known radix-2 decimation-in-time and decimation-in-frequency algorithms, which are described in Section 2.3.3, a Fermat number transform of length $N = 2^b$ is obtained by computing $(N/2)\log_2 N$ basic butterflies. Each butterfly, which performs a transform of length two, involves one negation, two additions, and one multiplication by some power of the transform kernel. We refer to Figures 2.1 and 2.2.
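The butterfly count can be checked with a compact radix-2 decimation-in-frequency transform. This is a generic textbook-style sketch, not the algorithm description of Section 2.3.3; the recursive formulation, the natural-order output, and the name `fnt_dif` are choices made here for brevity.

```python
def fnt_dif(x, omega, q):
    """Radix-2 decimation-in-frequency number theoretic transform.

    x has length N = 2**b and omega has order N modulo q.
    Returns (transform in natural order, number of butterflies used).
    """
    n = len(x)
    if n == 1:
        return list(x), 0
    half = n // 2
    upper, lower = [], []
    w = 1
    for k in range(half):
        a, b = x[k], x[k + half]
        upper.append((a + b) % q)             # first addition
        lower.append(((a - b) % q) * w % q)   # negation, addition, kernel power
        w = (w * omega) % q
    even, count_even = fnt_dif(upper, (omega * omega) % q, q)
    odd, count_odd = fnt_dif(lower, (omega * omega) % q, q)
    out = [0] * n
    out[0::2] = even
    out[1::2] = odd
    return out, half + count_even + count_odd
```

For instance, with modulus 17 and kernel 2 (order 8), a length-8 transform uses $(8/2)\log_2 8 = 12$ butterflies.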
[Figure 8.4 (plots omitted): three panels plotted versus $m$ for $m = 2, 4, 8, 16$: area complexity (size, $C$), time complexity (total CP length, $L$), and area-time performance ($CL^2$), each with curves for Fixed, Un. s/p, and Un. s.]
Figure 8.4: Plots of the complexity parameters $C$, $L$, and $CL^2$ for multiplication by powers of the transform kernel of order $2^b$ when using the polar representation. The parameters are obtained from Table 8.7. "Fixed" = the fixed bit-parallel architecture. "Un. s/p" = the universal bit-serial/parallel architecture. "Un. s" = the universal bit-serial architecture.
Next, we consider gross estimations of the total size and the total critical path
length of such a butterﬂy, with respect to the normal binary coded represen-
tation, the diminished–1 representation, and the polar representation. When
using the normal binary coded and the diminished–1 representations, we as-
sume that we have two adders in parallel. We use the following complexity
parameters:

• The Normal Binary Coded Representation:
  Negation: $C_{\mathrm{neg}} = \ldots$, $L_{\mathrm{neg}} = \ldots$
  Addition: $C_{\mathrm{add}} = \ldots$, $L_{\mathrm{add}} = \ldots$
  General multiplication: $C_{\mathrm{mult}} = \ldots$, $L_{\mathrm{mult}} = \ldots$
  Total complexity:
    $C_{\mathrm{butt,NBC}} = C_{\mathrm{neg}} + 2C_{\mathrm{add}} + C_{\mathrm{mult}} = \ldots$
    $L_{\mathrm{butt,NBC}} = L_{\mathrm{neg}} + L_{\mathrm{add}} + L_{\mathrm{mult}} = \ldots$

• The Diminished–1 Representation:
  Negation: $C_{\mathrm{dimneg}} = \ldots$, $L_{\mathrm{dimneg}} = \ldots$
  Addition: $C_{\mathrm{dimadd}} = \ldots$, $L_{\mathrm{dimadd}} = \ldots$
  General multiplication: $C_{\mathrm{Ashur\,mult}} = \ldots$, $L_{\mathrm{Ashur\,mult}} = \ldots$
  Total complexity:
    $C_{\mathrm{butt,dim}} = C_{\mathrm{dimneg}} + 2C_{\mathrm{dimadd}} + C_{\mathrm{Ashur\,mult}} = \ldots$
    $L_{\mathrm{butt,dim}} = L_{\mathrm{dimneg}} + L_{\mathrm{dimadd}} + L_{\mathrm{Ashur\,mult}} = \ldots$
When using the polar representation, the two butterfly additions cannot be computed exactly in parallel, because then we would get a memory access conflict. Therefore, the total critical path runs through the negater and the two adders of the decimation-in-frequency butterfly in Figure 2.2. Multiplication by a power of the transform kernel is carried out during the computation of the second addition. If the decimation-in-time butterfly is used, the path through the multiplier must also be added to the total critical path length. Hence, for the polar representation (and when using decimation-in-frequency butterflies), we use the following complexity parameters:

• The Polar Representation:
  Negation: $C_{\mathrm{polneg}} = \ldots$, $L_{\mathrm{polneg}} = \ldots$
  Addition: $C_{\mathrm{poladd}} = \ldots$, $L_{\mathrm{poladd}} = \ldots$
  Multiplication by a kernel power: $C_{\mathrm{mult,par}} = \ldots$, $L_{\mathrm{mult,par}} = \ldots$
  Total complexity:
    $C_{\mathrm{butt,polar}} = C_{\mathrm{polneg}} + C_{\mathrm{poladd}} + C_{\mathrm{mult,par}} = \ldots$
    $L_{\mathrm{butt,polar}} = L_{\mathrm{polneg}} + 2L_{\mathrm{poladd}} = \ldots$
The butterfly complexity parameters $C_{\mathrm{butt,NBC}}$, $L_{\mathrm{butt,NBC}}$, $C_{\mathrm{butt,dim}}$, $L_{\mathrm{butt,dim}}$, $C_{\mathrm{butt,polar}}$, and $L_{\mathrm{butt,polar}}$ are plotted versus $m$ in Figure 8.5. For $C_{\mathrm{butt,polar}}$ and $L_{\mathrm{butt,polar}}$ we have set maximum $b = m$ and $L_{\text{exp-tab}} = L_{\text{log-tab}} = \ldots$, respectively. These complexity parameters, however, do not change significantly for other (reasonable) values of $b$, $L_{\text{exp-tab}}$, and $L_{\text{log-tab}}$.
From Figure 8.5 we conclude that, for all $m$, the diminished–1 representation is superior to the normal binary coded representation. Regarding the polar representation, the complexity parameters $C_{\mathrm{butt,polar}}$ and $L_{\mathrm{butt,polar}}$ should be taken with a pinch of salt. The reason for this is the inaccuracies of the modelled areas and access times of the memories used to perform discrete exponentiation and compute discrete logarithms. In Sections 7.6.1 and 7.6.2, we only derived approximate estimations of the parameters $C_{\text{exp-tab}}$, $L_{\text{exp-tab}}$, $C_{\text{log-tab}}$, and $L_{\text{log-tab}}$. We can obtain more accurate estimations of these parameters by considering all parts of the memory architecture in Figure 7.9. Such a complex modelling, however, is not considered in this thesis.
As mentioned earlier in the thesis, the main disadvantage of the polar representation is the relatively large chip area required when implementing polar addition in the form which uses look-up tables. Still, some other nice properties of the polar representation may make up for this disadvantage. For example, we have proposed universal architectures for multiplication by powers of the transform kernel with favourable sizes and critical path lengths, see Sections 7.6.6 and 8.1.6. Any of these universal architectures can be used in the computation of a Fermat number transform of arbitrary allowed length in a Fermat prime field.

[Figure 8.5 (plots omitted): three panels plotted versus $m$ from 2 to 256: area complexity (size, $C$), time complexity (total CP length, $L$), and area-time performance ($CL^2$), each with curves for NBC, Dim.–1, and Polar.]
Figure 8.5: Plots of the complexity parameters $C$, $L$, and $CL^2$ for the complete decimation-in-frequency butterfly when using the NBC, the diminished–1, or the polar representation.
8.2 Other element representations
We have focused on the normal binary coded, the diminished–1, and the polar representation. A few alternative ways of representing the (binary coded) integers of Fermat integer quotient rings $\mathbb{Z}_{2^m+1}$ have been suggested in the literature. For example, Agrawal and Rao [3] describe an
 
m
 
 
  -bit binary
coded representation which uses one of the bits as a zero indicator. However,
none of these forms of representation have been considered in this thesis.
(See also references [6] and [7] in their paper.)

Chapter 9
Conclusions
The arithmetic operations considered in this thesis are essentially modulus reduction, code translation, negation, addition, subtraction, general multiplication, and multiplication by powers of the Fermat number transform kernel. All operations are carried out in Fermat integer quotient rings. The properties of these operations were thoroughly investigated with respect to the normal binary coded representation, the diminished–1 representation, and the polar representation of the binary coded integers of Fermat integer quotient rings. The polar representation is applicable only when the Fermat number modulus is prime.

Based on a linear switch-level $RC$ model for CMOS transistors we derived area and time complexities and combined area × time² performances of the various architectures for the above arithmetic operations. The architectures were mutually compared with respect to these measures of complexity and performance. To the author's knowledge, such a comparison has not been carried out before.

Regarding the normal binary coded representation, we found that the area × time² performance of some of the architectures considered was relatively poor. This derives mainly from the relatively complex circuitry for performing the modulus reduction of the corresponding arithmetic operations. In some architectures, the modulus reduction part of the circuit represented a rather large part of the complete architecture.

With respect to the area and time complexities and the area × time² performance, we established the superiority of the diminished–1 representation over the normal binary coded representation. We also came to the general conclusion that, mainly from a computational complexity point of view, the diminished–1 representation is in fact the most efficient one in the class of element representations that can be expressed as a linear elementary function of the normal binary coded representation.

Using properties of Zech's logarithms, we derived an algorithm for efficiently computing the discrete logarithm in Fermat prime fields, principally using only a number of recursive diminished–1 additions. We also derived an algorithm for performing discrete exponentiation using only a number of recursive diminished–1 additions and some binary shifts. Based on these algorithms, we then derived computational procedures for computing the discrete logarithm and performing discrete exponentiation using look-up tables of appropriate sizes (one table for each operation). Each resulting algorithm principally only involves a number of binary shifts and a table look-up. Hence, the complexity of computing the discrete logarithm and performing discrete exponentiation was significantly reduced, at the cost of two look-up tables.

One of the main advantages of the polar representation concerns the complexity of performing multiplication by powers of the transform kernel. We proved that, for every possible transform length $N = 2^b$, $1 \le b \le m$, the polar representation provides a suitable choice of the transform kernel for which multiplication by powers of the transform kernel can be carried out using only one addition modulo $2^b$. We also designed universal architectures (one bit-serial/parallel and one bit-serial) for performing such multiplications. Thus, any of these universal architectures can be used in the computation of a Fermat number transform of arbitrary allowed length in a Fermat prime field.

Appendix A
Proofs of Some Theorems
In this Appendix we present proofs of some theorems of the thesis. The proofs
themselves may not be of central importance for the results of the thesis, but
they are included mainly because they have great number theoretic significance
in the context of the thesis.
A.1 Proof of Theorem 2.1
The outline of the proof is essentially the same as the outline of the proof by Agarwal and Burrus in [2, Th. 1]. The theorem is equivalent to

Theorem A.1 There exists an invertible NTT of length $N$ in $\mathbb{Z}_q$ if and only if $N \mid (p_i - 1)$ for every prime $p_i$ that divides $q$.
Proof: According to Euler's theorem ("if $q$ is a positive integer and $\omega$ is relatively prime to $q$, then $\omega^{\phi(q)} \equiv 1 \pmod{q}$"), the order $N$ of the transform kernel $\omega$ modulo $q$ must divide $\phi(q)$, where $\phi$ denotes Euler's totient function (see for example Rosen, [84, Ch. 5.3]). It can be shown that for such an integer $q$ with prime-power factorisation $q = p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$, the totient function is
\[ \phi(q) = p_1^{n_1-1}(p_1 - 1)\, p_2^{n_2-1}(p_2 - 1) \cdots p_k^{n_k-1}(p_k - 1). \]
Hence, we get
\[ N \mid p_1^{n_1-1}(p_1 - 1)\, p_2^{n_2-1}(p_2 - 1) \cdots p_k^{n_k-1}(p_k - 1). \]
However, by the congruence $\omega^N \equiv 1 \pmod{q}$ we get $q \mid (\omega^N - 1)$, and hence $p_i^{n_i} \mid (\omega^N - 1)$, i.e.
\[ \omega^N \equiv 1 \pmod{p_i^{n_i}} \]
for every factor $p_i^{n_i}$ of $q$. Then, by Euler's theorem, we get
\[ N \mid \phi(p_i^{n_i}) = p_i^{n_i-1}(p_i - 1). \tag{A.1} \]
In order for the inverse transform to exist, $N^{-1}$ must exist in the ring. The congruence $N \cdot N^{-1} \equiv 1 \pmod{q}$ implies that $N$ and $q$ must be relatively prime, which means that no prime factor $p_i$ of $q$ can be a factor of $N$. Therefore (A.1) reduces to
\[ N \mid (p_i - 1) \]
for $i = 1, 2, \ldots, k$, which can also be written as
\[ N \mid \gcd(p_1 - 1, p_2 - 1, \ldots, p_k - 1). \]
Conversely, if $N \mid (p_i - 1)$ we know, by Theorem 8.8 of [84], that there are $\phi(N)$ incongruent integers with order $N$ modulo $p_i$. For $p_i = 2$ we get the solution $N = 1$ and the theorem becomes trivial. For odd primes $p_i$, let $\alpha_i$ be an integer with $\gcd(\alpha_i, p_i) = 1$ such that $\mathrm{ord}_{p_i}\alpha_i = p_i - 1$. Then, each nonzero integer of $\mathbb{Z}_{p_i}$ is congruent to some power of $\alpha_i$ modulo $p_i$ [84, Th. 8.3]. For such an integer $\omega_i = \alpha_i^{r_i}$ with $\mathrm{ord}_{p_i}\omega_i = N$ and some positive integer $r_i$, it follows from [84, Th. 8.4] that
\[ N = \frac{p_i - 1}{\gcd(p_i - 1, r_i)}. \]
By Theorems 8.9 and 8.10 of [84] we know that if $\mathrm{ord}_{p_i}\alpha_i = \phi(p_i)$, the order of $\alpha_i$ modulo $p_i^{n_i}$ is $\phi(p_i^{n_i}) = (p_i - 1)p_i^{n_i-1}$ for all positive integers $n_i$. From the above reasoning we get
\[ \alpha_i^{(p_i-1)p_i^{n_i-1}} = \alpha_i^{\frac{N}{N}(p_i-1)p_i^{n_i-1}} = \left(\alpha_i^{\gcd(p_i-1,\,r_i)\,p_i^{n_i-1}}\right)^{N} \equiv 1 \pmod{p_i^{n_i}} \]
and consequently we can choose
\[ \omega_i = \alpha_i^{\gcd(p_i-1,\,r_i)\,p_i^{n_i-1}} \]
as an integer with order $N$ modulo $p_i^{n_i}$. By the Chinese remainder theorem [84, Th. 3.12] we can find a unique solution $\omega$ modulo $q = p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$ such that
\[ \omega \equiv \omega_i \pmod{p_i^{n_i}} \]
for distinct primes $p_i$ and $i = 1, 2, \ldots, k$. Also, the order of $\omega$ modulo $q$ is $N$. Because $\gcd(N, p_i) = 1$ we have $\gcd(N, q) = 1$ and thus there exists an inverse of $N$ modulo $q$. Hence, there exists an invertible NTT of length $N$ in $\mathbb{Z}_q$ for which $N \mid (p_i - 1)$ for every prime factor $p_i$ of $q$. $\square$
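The condition of Theorem A.1 is easy to evaluate numerically. The sketch below is only an illustration of the statement $N \mid \gcd(p_1 - 1, \ldots, p_k - 1)$; the trial-division factoriser and the function names are assumptions made here and are adequate only for small moduli.

```python
from math import gcd

def prime_factors(q: int) -> set:
    """Distinct prime factors of q by trial division (small q only)."""
    factors = set()
    p = 2
    while p * p <= q:
        while q % p == 0:
            factors.add(p)
            q //= p
        p += 1
    if q > 1:
        factors.add(q)
    return factors

def max_ntt_length(q: int) -> int:
    """Largest N allowed by Theorem A.1: gcd of p - 1 over primes p | q."""
    g = 0
    for p in prime_factors(q):
        g = gcd(g, p - 1)
    return g
```

For the Fermat prime 17 this gives the maximum length 16, while for the composite Fermat number $2^{32} + 1 = 641 \cdot 6700417$ it gives only 128, which illustrates why transform lengths are comparatively short when the modulus is composite.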
A.2 Proof of Theorem 2.3
In most number theory books the author leaves the proof of Theorem 2.3 as an exercise for the reader. In this section we present our solution to this exercise. The proof involves the concept of quadratic residues.

Definition A.1 An integer $a$ which is relatively prime to a positive integer $q$ is said to be a quadratic residue modulo $q$ if the congruence $x^2 \equiv a \pmod{q}$ has an integer solution $x$. If the congruence has no solution, we say that $a$ is a quadratic nonresidue modulo $q$.

The Legendre symbol $\left(\frac{a}{p}\right)$ is frequently used to indicate whether an integer $a$, not divisible by the odd prime $p$, is a quadratic residue modulo $p$:
\[ \left(\frac{a}{p}\right) = \begin{cases} 1 & \text{if $a$ is a quadratic residue modulo $p$} \\ -1 & \text{if $a$ is a quadratic nonresidue modulo $p$} \end{cases} \tag{A.2} \]
Euler's criterion is useful when deciding whether an integer is a quadratic residue modulo a prime:

Lemma A.1 If $p$ is an odd prime and $a$ is a positive integer not divisible by $p$, then
\[ \left(\frac{a}{p}\right) \equiv a^{\frac{p-1}{2}} \pmod{p}. \tag{A.3} \]
Proof: See the proof of Theorem 9.2 of Rosen in [84]. $\square$
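Now, we are ready for the proof of Theorem 2.3, which is equivalent to
Theorem A.2 Every prime divisor of the Fermat number $F_t = 2^{2^t} + 1$, where $t \geq 2$, is of the form $k \cdot 2^{t+2} + 1$, for some natural number $k$.
Proof: For every Fermat number $F_t$ we have
\[ \left(2^{2^t}\right)^2 \equiv (-1)^2 \equiv 1 \pmod{F_t}, \]
which implies $F_t \mid (2^{2^{t+1}} - 1)$. Therefore, for every prime divisor $p$ of $F_t$, we get $p \mid (2^{2^{t+1}} - 1)$ or equivalently
\[ 2^{2^{t+1}} \equiv 1 \pmod{p} \tag{A.4} \]
Since $2^{2^t} \equiv -1 \pmod{p}$, the order of 2 modulo $p$ does not divide $2^t$ and is therefore exactly $2^{t+1}$. Also, by Euler's theorem we have $2^{p-1} \equiv 1 \pmod{p}$ and therefore $2^{t+1} \mid (p - 1)$, which means that $p$ is of the form $p = k' \cdot 2^{t+1} + 1$ for some positive integer $k'$. For $t \geq 2$ we see that $p = k' \cdot 2^{t+1} + 1$ is congruent to 1 modulo 8. By Proposition A.17(ii) of Stewart [95], 2 is a quadratic residue modulo $p$, i.e. $\left(\frac{2}{p}\right) = 1$, and thus, from Euler's criterion (Equation (A.3)) we get
\[ 2^{(p-1)/2} \equiv 1 \pmod{p} \tag{A.5} \]
Hence, from (A.4) and (A.5), we see that
\[ 2^{t+1} \;\Big|\; \frac{p - 1}{2} \]
which implies that $p$ is of the form $p = k \cdot 2^{t+2} + 1$ for some natural number $k$.
□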
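Theorem A.2 can be illustrated numerically. The following is a minimal Python sketch, not part of the original proof: the function name `smallest_prime_factor_of_fermat` is our own, and the search simply tries the candidates $k \cdot 2^{t+2} + 1$ in increasing order, which is only practical for small $t$ (here $t \leq 6$).

```python
def smallest_prime_factor_of_fermat(t):
    """Smallest prime factor of F_t = 2**(2**t) + 1 for t >= 2.

    Only candidates of the form k * 2**(t + 2) + 1 are tested, since by
    Theorem A.2 every prime divisor of F_t has this form.
    """
    fermat = 2 ** (2 ** t) + 1
    step = 2 ** (t + 2)
    k = 1
    while True:
        d = k * step + 1
        if d * d > fermat:
            return fermat          # no smaller divisor: F_t itself is prime
        if fermat % d == 0:
            return d               # first hit is the smallest prime factor
        k += 1


if __name__ == "__main__":
    for t in range(2, 7):
        p = smallest_prime_factor_of_fermat(t)
        k = (p - 1) // 2 ** (t + 2)
        print(f"t={t}: smallest prime factor {p} = {k} * 2^{t + 2} + 1")
```

For $t = 5$ the search stops at $k = 5$, giving the classical factor $641 = 5 \cdot 2^7 + 1$ of $F_5$.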
A.3 Proof of Theorem 2.5
The Legendre symbol, which was defined in (A.2), can be used to check whether an integer is primitive or not. It follows from Euler's criterion (Equation (A.3)), together with the definition of primitive elements, that a primitive element in $Z_{F_t}$ is a quadratic nonresidue modulo $F_t$.
The quadratic reciprocity law, which was discovered by Euler and proved by Gauss, can be of great help to calculate the Legendre symbol:
Lemma A.2 If $p$ and $q$ are odd primes, then
\[ \left(\frac{q}{p}\right) = \left(\frac{p}{q}\right) (-1)^{\frac{p-1}{2} \cdot \frac{q-1}{2}} \]
Proof: See for example Lang [57, pp. 76–78] or Rosen [84, Ch. 9.2].
□
We are now able to prove Theorem 2.5, which is equivalent to
Theorem A.3 The integer 3 is a primitive element of each Fermat prime field $Z_{F_t}$ where $t \geq 1$.
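Proof: (See for example the proof of Theorem 9.7 (Pepin's test) in the book by Rosen, [84]). Consider the primes among the Fermat numbers $F_t = 2^m + 1$; $m = 2^t$ for $t \geq 1$. The quadratic reciprocity law yields
\[ \left(\frac{3}{2^m + 1}\right) = \left(\frac{2^m + 1}{3}\right) (-1)^{\frac{(2^m + 1) - 1}{2} \cdot \frac{3 - 1}{2}} = \left(\frac{2^m + 1}{3}\right) \]
By Euler's criterion (A.3) we can write $\left(\frac{2^m + 1}{3}\right)$ as
\[ \left(\frac{2^m + 1}{3}\right) \equiv (2^m + 1)^{(3-1)/2} \equiv 2^m + 1 \equiv (-1)^m + 1 \equiv -1 \pmod{3} \]
and hence we have
\[ \left(\frac{3}{2^m + 1}\right) = -1 \]
or equivalently
\[ 3^{2^{m-1}} \equiv -1 \pmod{2^m + 1} \tag{A.6} \]
By Euler's theorem we know that the order of 3 modulo the prime $F_t$ divides $F_t - 1 = 2^m$, i.e. $\mathrm{ord}_{F_t}\,3 \mid 2^m$, which means that $\mathrm{ord}_{F_t}\,3$ is a power of two. Furthermore, since (A.6) implies that $\mathrm{ord}_{F_t}\,3 \nmid 2^{m-1}$, we consequently get $\mathrm{ord}_{F_t}\,3 = 2^m$. Thus, the integer 3 is a primitive element of $Z_{2^m+1}$ whenever $2^m + 1$ is prime.
□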
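Congruence (A.6) and the conclusion of Theorem A.3 are easy to check numerically for the known Fermat primes. The sketch below is illustrative only; the helper names `pepin_holds` and `multiplicative_order` are ours, and the order is found by naive repeated multiplication, which is adequate for moduli up to $F_4 = 65537$.

```python
def pepin_holds(t):
    """True if 3**((F_t - 1) / 2) is congruent to -1 modulo F_t, cf. (A.6)."""
    fermat = 2 ** (2 ** t) + 1
    return pow(3, (fermat - 1) // 2, fermat) == fermat - 1


def multiplicative_order(a, p):
    """Order of a modulo p by repeated multiplication (assumes gcd(a, p) = 1)."""
    exponent, value = 1, a % p
    while value != 1:
        value = value * a % p
        exponent += 1
    return exponent


if __name__ == "__main__":
    for t in range(1, 5):
        fermat = 2 ** (2 ** t) + 1
        print(t, fermat, pepin_holds(t), multiplicative_order(3, fermat))
```

For $t = 1, \ldots, 4$ the order of 3 equals $F_t - 1$, so 3 generates the whole multiplicative group; for the composite $F_5$ the congruence (A.6) fails.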
Appendix B
A Table of Some Primes
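 n   m            q   q - 1
 2   1            3   $2$
 3   1            7   $2 \cdot 3$
 3   2            5   $2^2$
 4   2           13   $2^2 \cdot 3$
 5   1           31   $2 \cdot 3 \cdot 5$
 5   2           29   $2^2 \cdot 7$
 5   4           17   $2^4$
 6   2           61   $2^2 \cdot 3 \cdot 5$
 7   1          127   $2 \cdot 3^2 \cdot 7$
 7   4          113   $2^4 \cdot 7$
 7   5           97   $2^5 \cdot 3$
 8   4          241   $2^4 \cdot 3 \cdot 5$
 8   6          193   $2^6 \cdot 3$
 9   2          509   $2^2 \cdot 127$
 9   6          449   $2^6 \cdot 7$
 9   8          257   $2^8$
10   2         1021   $2^2 \cdot 3 \cdot 5 \cdot 17$
10   4         1009   $2^4 \cdot 3^2 \cdot 7$
10   8          769   $2^8 \cdot 3$
11   5         2017   $2^5 \cdot 3^2 \cdot 7$
12   2         4093   $2^2 \cdot 3 \cdot 11 \cdot 31$
13   1         8191   $2 \cdot 3^2 \cdot 5 \cdot 7 \cdot 13$
13   5         8161   $2^5 \cdot 3 \cdot 5 \cdot 17$
13   8         7937   $2^8 \cdot 31$
13   9         7681   $2^9 \cdot 3 \cdot 5$
14   2        16381   $2^2 \cdot 3^2 \cdot 5 \cdot 7 \cdot 13$
14   4        16369   $2^4 \cdot 3 \cdot 11 \cdot 31$
14  10        15361   $2^{10} \cdot 3 \cdot 5$
14  12        12289   $2^{12} \cdot 3$
15   9        32257   $2^9 \cdot 3^2 \cdot 7$
16   4        65521   $2^4 \cdot 3^2 \cdot 5 \cdot 7 \cdot 13$
16  10        64513   $2^{10} \cdot 3^2 \cdot 7$
16  12        61441   $2^{12} \cdot 3 \cdot 5$
17   1       131071   $2 \cdot 3 \cdot 5 \cdot 17 \cdot 257$
17   5       131041   $2^5 \cdot 3^2 \cdot 5 \cdot 7 \cdot 13$
17   6       131009   $2^6 \cdot 23 \cdot 89$
17   8       130817   $2^8 \cdot 7 \cdot 73$
17  14       114689   $2^{14} \cdot 7$
17  16        65537   $2^{16}$
19   1       524287   $2 \cdot 3^3 \cdot 7 \cdot 19 \cdot 73$
19   5       524257   $2^5 \cdot 3 \cdot 43 \cdot 127$
19   9       523777   $2^9 \cdot 3 \cdot 11 \cdot 31$
19  12       520193   $2^{12} \cdot 127$
20   2      1048573   $2^2 \cdot 3^3 \cdot 7 \cdot 19 \cdot 73$
20  14      1032193   $2^{14} \cdot 3^2 \cdot 7$
20  18       786433   $2^{18} \cdot 3$
22   2      4194301   $2^2 \cdot 3 \cdot 5^2 \cdot 11 \cdot 31 \cdot 41$
23   4      8388593   $2^4 \cdot 524287$
23  13      8380417   $2^{13} \cdot 3 \cdot 11 \cdot 31$
23  17      8257537   $2^{17} \cdot 3^2 \cdot 7$
23  20      7340033   $2^{20} \cdot 7$
24   2     16777213   $2^2 \cdot 3 \cdot 23 \cdot 89 \cdot 683$
24   6     16777153   $2^6 \cdot 3^3 \cdot 7 \cdot 19 \cdot 73$
24   8     16776961   $2^8 \cdot 3 \cdot 5 \cdot 17 \cdot 257$
24  14     16760833   $2^{14} \cdot 3 \cdot 11 \cdot 31$
24  18     16515073   $2^{18} \cdot 3^2 \cdot 7$
25  12     33550337   $2^{12} \cdot 8191$
25  14     33538049   $2^{14} \cdot 23 \cdot 89$
25  18     33292289   $2^{18} \cdot 127$
26  12     67104769   $2^{12} \cdot 3 \cdot 43 \cdot 127$
26  16     67043329   $2^{16} \cdot 3 \cdot 11 \cdot 31$
27  11    134215681   $2^{11} \cdot 3 \cdot 5 \cdot 17 \cdot 257$
27  21    132120577   $2^{21} \cdot 3^2 \cdot 7$
28  16    268369921   $2^{16} \cdot 3^2 \cdot 5 \cdot 7 \cdot 13$
29   2    536870909   $2^2 \cdot 7 \cdot 73 \cdot 262657$
29   6    536870849   $2^6 \cdot 47 \cdot 178481$
29   8    536870657   $2^8 \cdot 7^2 \cdot 127 \cdot 337$
29   9    536870401   $2^9 \cdot 3 \cdot 5^2 \cdot 11 \cdot 31 \cdot 41$
29  18    536608769   $2^{18} \cdot 23 \cdot 89$
29  26    469762049   $2^{26} \cdot 7$
30  18   1073479681   $2^{18} \cdot 3^2 \cdot 5 \cdot 7 \cdot 13$
31   1   2147483647   $2 \cdot 3^2 \cdot 7 \cdot 11 \cdot 31 \cdot 151 \cdot 331$
31   9   2147483137   $2^9 \cdot 3 \cdot 23 \cdot 89 \cdot 683$
31  17   2147352577   $2^{17} \cdot 3 \cdot 43 \cdot 127$
31  19   2146959361   $2^{19} \cdot 3^2 \cdot 5 \cdot 7 \cdot 13$
31  24   2130706433   $2^{24} \cdot 127$
31  25   2113929217   $2^{25} \cdot 3^2 \cdot 7$
31  27   2013265921   $2^{27} \cdot 3 \cdot 5$
32  20   4293918721   $2^{20} \cdot 3^2 \cdot 5 \cdot 7 \cdot 13$
32  30   3221225473   $2^{30} \cdot 3$
Table B.1: Prime numbers of the form $q = 2^n - 2^m + 1$ for $1 \leq m < n \leq 32$. The last column gives the prime factorization of $q - 1 = 2^m (2^{n-m} - 1)$.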
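The entries of Table B.1 can be regenerated with a few lines of code. The following Python sketch is only an illustration under our own naming (`is_prime`, `primes_of_form`); it uses plain trial division, which is sufficient because $q < 2^{32}$ in the table, although the full range $n \leq 32$ takes noticeably longer than small ranges.

```python
def is_prime(q):
    """Primality test by trial division (adequate for q below 2**32)."""
    if q < 2:
        return False
    if q % 2 == 0:
        return q == 2
    d = 3
    while d * d <= q:
        if q % d == 0:
            return False
        d += 2
    return True


def primes_of_form(n_max):
    """All triples (n, m, q) with q = 2**n - 2**m + 1 prime, 1 <= m < n <= n_max."""
    return [
        (n, m, 2 ** n - 2 ** m + 1)
        for n in range(2, n_max + 1)
        for m in range(1, n)
        if is_prime(2 ** n - 2 ** m + 1)
    ]


if __name__ == "__main__":
    for n, m, q in primes_of_form(12):
        print(f"{n:2d} {m:2d} {q}")
```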
Appendix C
Further Properties of Zech’s
Logarithms
Several properties of Zech’s logarithms in Fermat prime ﬁelds were consid-
ered in Chapter 7. In this appendix we present some additional properties of
such logarithms. These properties may be used to derive alternative ways of
computing Zech’s logarithms in Fermat prime ﬁelds.
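Theorem C.1 Let $P(\beta) = \xi$ be a polar representation of $\beta \in Z_{2^m+1}$. With $P(0)$ denoting the polar representation of zero, we have
\[ Z(2^{m-1}) \equiv P(0) \pmod{2^m} \tag{C.1} \]
\[ Z(P(0)) \equiv 0 \pmod{2^m} \tag{C.2} \]
For nonzero $\beta$, i.e. for $\xi \in Z_{2^m}$, the following congruences hold.
\[ Z(-\xi) \equiv Z(\xi) - \xi \pmod{2^m} \tag{C.3} \]
\[ Z(Z(\xi) + 2^{m-1}) \equiv \xi + 2^{m-1} \pmod{2^m} \tag{C.4} \]
\[ Z(2^{m-1} - Z(\xi)) \equiv \xi - Z(\xi) \pmod{2^m} \tag{C.5} \]
\[ Z(2^{m-1} + \xi - Z(\xi)) \equiv -Z(\xi) \pmod{2^m} \tag{C.6} \]
\[ Z(2^{m-1} - \xi + Z(\xi)) \equiv 2^{m-1} - \xi \pmod{2^m} \tag{C.7} \]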
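Proof: Let $\alpha$ denote the primitive element on which the polar representation is based.
Equation (C.1): From Definitions 7.1 and 7.2 we get $\alpha^{Z(2^{m-1})} \equiv \alpha^{2^{m-1}} + 1 \equiv -1 + 1 \equiv 0 \pmod{2^m + 1}$ and thus $Z(2^{m-1}) \equiv P(0) \pmod{2^m}$.
Equation (C.2): From $\alpha^{Z(P(0))} \equiv 0 + 1 \equiv \alpha^0 \pmod{2^m + 1}$ we get $Z(P(0)) \equiv 0 \pmod{2^m}$.
Equation (C.3): Taking the discrete logarithm of the congruence $\alpha^{Z(-\xi)} \equiv \alpha^{-\xi} + 1 \equiv \alpha^{-\xi}(1 + \alpha^{\xi}) \equiv \alpha^{-\xi} \alpha^{Z(\xi)} \pmod{2^m + 1}$ yields $Z(-\xi) \equiv Z(\xi) - \xi \pmod{2^m}$.
Equation (C.4): From the congruence $\alpha^{Z(Z(\xi) + 2^{m-1})} \equiv -\alpha^{Z(\xi)} + 1 \equiv -(\alpha^{\xi} + 1) + 1 \equiv -\alpha^{\xi} \equiv \alpha^{\xi + 2^{m-1}} \pmod{2^m + 1}$ we get $Z(Z(\xi) + 2^{m-1}) \equiv \xi + 2^{m-1} \pmod{2^m}$.
Equation (C.5): By (C.3) we get $Z(2^{m-1} - Z(\xi)) \equiv Z(-(Z(\xi) + 2^{m-1})) \equiv Z(Z(\xi) + 2^{m-1}) - Z(\xi) - 2^{m-1} \pmod{2^m}$. Using (C.4) we then get $Z(2^{m-1} - Z(\xi)) \equiv \xi + 2^{m-1} - Z(\xi) - 2^{m-1} \equiv \xi - Z(\xi) \pmod{2^m}$.
Equation (C.6): Using (C.3) and (C.5), we can write $Z(2^{m-1} + \xi - Z(\xi)) \equiv Z(2^{m-1} - Z(-\xi)) \equiv -\xi - Z(-\xi) \equiv -\xi - (Z(\xi) - \xi) \equiv -Z(\xi) \pmod{2^m}$.
Equation (C.7): Using (C.3) and (C.4), we can write $Z(2^{m-1} - \xi + Z(\xi)) \equiv Z(2^{m-1} + Z(-\xi)) \equiv -\xi + 2^{m-1} \pmod{2^m}$.
□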
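The congruences (C.3) to (C.7) can be verified exhaustively for a small Fermat prime. The sketch below is not taken from the thesis: it assumes the primitive element 3 (cf. Theorem A.3), builds the Zech logarithm table by brute force, and represents the undefined value $Z(2^{m-1})$ by `None`. The function names are hypothetical, and $m$ must be chosen so that $2^m + 1$ is prime with 3 primitive ($m = 2, 4, 8, 16$).

```python
def zech_table(m, alpha=3):
    """Z[xi] with alpha**Z[xi] = alpha**xi + 1 (mod 2**m + 1); None if the sum is 0."""
    p, n = 2 ** m + 1, 2 ** m
    log = {pow(alpha, e, p): e for e in range(n)}   # discrete logarithm table
    if len(log) != n:
        raise ValueError("alpha is not primitive modulo 2**m + 1")
    return [log.get((pow(alpha, xi, p) + 1) % p) for xi in range(n)]


def check_identities(m, alpha=3):
    """Check (C.3)-(C.7) for every xi in Z_{2^m} except xi = 2^(m-1)."""
    n, h = 2 ** m, 2 ** (m - 1)
    Z = zech_table(m, alpha)
    for xi in range(n):
        if xi == h:
            continue                                   # alpha**xi = -1, no Zech logarithm
        z = Z[xi]
        assert Z[-xi % n] == (z - xi) % n              # (C.3)
        assert Z[(z + h) % n] == (xi + h) % n          # (C.4)
        assert Z[(h - z) % n] == (xi - z) % n          # (C.5)
        assert Z[(h + xi - z) % n] == -z % n           # (C.6)
        assert Z[(h - xi + z) % n] == (h - xi) % n     # (C.7)
    return True


if __name__ == "__main__":
    print(zech_table(4))
    print(check_identities(4), check_identities(8))
```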
The set of all polar integers $\xi \in Z_{2^m}$ can be partitioned into subsets such that the Zech logarithms of all integers in each subset can be computed using the knowledge of only one logarithm in the subset. This property is demonstrated in the following theorem.
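Theorem C.2 Let $\xi \in Z_{2^m} \setminus \{2^{m-1}\}$ be a polar integer and let $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$ be mappings from $Z_{2^m}$ to $Z_{2^m}$, given by
\[ f_1(\xi) \equiv -\xi \pmod{2^m} \tag{C.8} \]
\[ f_2(\xi) \equiv Z(\xi) + 2^{m-1} \pmod{2^m} \tag{C.9} \]
\[ f_3(\xi) \equiv 2^{m-1} - Z(\xi) \pmod{2^m} \tag{C.10} \]
\[ f_4(\xi) \equiv 2^{m-1} + \xi - Z(\xi) \pmod{2^m} \tag{C.11} \]
\[ f_5(\xi) \equiv 2^{m-1} - \xi + Z(\xi) \pmod{2^m} \tag{C.12} \]
Let $F(\xi)$ be a set of polar integers defined by $F(\xi) = \{\xi, f_1(\xi), f_2(\xi), f_3(\xi), f_4(\xi), f_5(\xi)\}$. Then, we have $F(\xi) = F(f_1(\xi)) = F(f_2(\xi)) = F(f_3(\xi)) = F(f_4(\xi)) = F(f_5(\xi))$.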
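Proof: Let $j \in \{1, 2, 3, 4, 5\}$. Then, by (C.8), (C.9), (C.10), (C.11), and (C.12) we have
\[ f_1(f_j(\xi)) \equiv -f_j(\xi) \pmod{2^m} \]
\[ f_2(f_j(\xi)) \equiv Z(f_j(\xi)) + 2^{m-1} \pmod{2^m} \]
\[ f_3(f_j(\xi)) \equiv 2^{m-1} - Z(f_j(\xi)) \pmod{2^m} \]
\[ f_4(f_j(\xi)) \equiv 2^{m-1} + f_j(\xi) - Z(f_j(\xi)) \pmod{2^m} \]
\[ f_5(f_j(\xi)) \equiv 2^{m-1} - f_j(\xi) + Z(f_j(\xi)) \pmod{2^m} \]
respectively. Depending on $j$, Zech's logarithm of $f_j(\xi)$ is given by either (C.3), (C.4), (C.5), (C.6), or (C.7). By (C.8) and (C.3) it follows that $Z(f_1(\xi)) \equiv Z(-\xi) \equiv Z(\xi) - \xi \pmod{2^m}$. Therefore, for $j = 1$ we get
\[ f_1(f_1(\xi)) \equiv -f_1(\xi) \equiv \xi \pmod{2^m} \]
\[ f_2(f_1(\xi)) \equiv Z(f_1(\xi)) + 2^{m-1} \equiv Z(\xi) - \xi + 2^{m-1} \equiv f_5(\xi) \pmod{2^m} \]
\[ f_3(f_1(\xi)) \equiv 2^{m-1} - Z(f_1(\xi)) \equiv 2^{m-1} - Z(\xi) + \xi \equiv f_4(\xi) \pmod{2^m} \]
\[ f_4(f_1(\xi)) \equiv 2^{m-1} + f_1(\xi) - Z(f_1(\xi)) \equiv 2^{m-1} - \xi - Z(\xi) + \xi \equiv f_3(\xi) \pmod{2^m} \]
\[ f_5(f_1(\xi)) \equiv 2^{m-1} - f_1(\xi) + Z(f_1(\xi)) \equiv 2^{m-1} + \xi + Z(\xi) - \xi \equiv f_2(\xi) \pmod{2^m} \]
and thus $F(f_1(\xi)) = \{f_1(\xi), f_1(f_1(\xi)), f_2(f_1(\xi)), f_3(f_1(\xi)), f_4(f_1(\xi)), f_5(f_1(\xi))\} = \{f_1(\xi), \xi, f_5(\xi), f_4(\xi), f_3(\xi), f_2(\xi)\} = F(\xi)$.
For $j = 2, 3, 4, 5$ and $i = 1, 2, 3, 4, 5$, the integer $f_i(f_j(\xi))$ can be obtained in a way similar to the above derivation of $f_i(f_1(\xi))$. All elements $f_i(f_j(\xi))$ are shown in Table C.1. For example, $f_2(f_3(\xi))$ is found as the element $f_4(\xi)$ in the intersection of the $f_2$-row and the $f_3$-column. The elements in the first row of the table form the set $F(\xi)$, the elements in the second row form the set $F(f_1(\xi))$, the elements in the third row form the set $F(f_2(\xi))$, etc. Furthermore, we see that each of the elements $\xi$, $f_1(\xi)$, $f_2(\xi)$, $f_3(\xi)$, $f_4(\xi)$, and $f_5(\xi)$ only appears once in every row (and column) of the table. Hence, we have $F(\xi) = F(f_1(\xi)) = F(f_2(\xi)) = F(f_3(\xi)) = F(f_4(\xi)) = F(f_5(\xi))$.
□
\[
\begin{array}{c|cccccc|c}
g \backslash h & \text{—} & f_1 & f_2 & f_3 & f_4 & f_5 & \\
\hline
\text{—} & \xi & f_1(\xi) & f_2(\xi) & f_3(\xi) & f_4(\xi) & f_5(\xi) & F(\xi) \\
f_1 & f_1(\xi) & \xi & f_3(\xi) & f_2(\xi) & f_5(\xi) & f_4(\xi) & F(f_1(\xi)) \\
f_2 & f_2(\xi) & f_5(\xi) & \xi & f_4(\xi) & f_3(\xi) & f_1(\xi) & F(f_2(\xi)) \\
f_3 & f_3(\xi) & f_4(\xi) & f_1(\xi) & f_5(\xi) & f_2(\xi) & \xi & F(f_3(\xi)) \\
f_4 & f_4(\xi) & f_3(\xi) & f_5(\xi) & f_1(\xi) & \xi & f_2(\xi) & F(f_4(\xi)) \\
f_5 & f_5(\xi) & f_2(\xi) & f_4(\xi) & \xi & f_1(\xi) & f_3(\xi) & F(f_5(\xi))
\end{array}
\]
Table C.1: The table shows, for $i, j = 1, 2, 3, 4, 5$, all combinations of $f_i(f_j(\xi)) = g(h(\xi))$, where $g$ and $h$ denote $f_i$ and $f_j$, respectively. The symbol '—' indicates that no mapping is carried out, i.e. if $h$ '=' — or $g$ '=' — we get the mapping $g(\xi)$ or $h(\xi)$, respectively.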
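The closure property of Theorem C.2 can also be checked by machine. The sketch below is illustrative only: it assumes the primitive element 3 and $m = 4$ (the field $Z_{17}$), and the names `zech_table`, `mappings`, and `subset` are our own. Index 0 of the returned list is the identity (the '—' entry), and indices 1 to 5 are $f_1, \ldots, f_5$ as defined in (C.8) to (C.12).

```python
def zech_table(m, alpha=3):
    """Brute-force Zech logarithm table for the Fermat prime 2**m + 1."""
    p, n = 2 ** m + 1, 2 ** m
    log = {pow(alpha, e, p): e for e in range(n)}
    return [log.get((pow(alpha, xi, p) + 1) % p) for xi in range(n)]


def mappings(m, alpha=3):
    """Return [identity, f1, f2, f3, f4, f5] acting on Z_{2^m} minus {2^(m-1)}."""
    n, h = 2 ** m, 2 ** (m - 1)
    Z = zech_table(m, alpha)
    return [
        lambda x: x % n,                   # identity
        lambda x: -x % n,                  # f1, (C.8)
        lambda x: (Z[x] + h) % n,          # f2, (C.9)
        lambda x: (h - Z[x]) % n,          # f3, (C.10)
        lambda x: (h + x - Z[x]) % n,      # f4, (C.11)
        lambda x: (h - x + Z[x]) % n,      # f5, (C.12)
    ]


def subset(maps, xi):
    """The set F(xi) = {xi, f1(xi), ..., f5(xi)}."""
    return {f(xi) for f in maps}


if __name__ == "__main__":
    maps = mappings(4)
    for xi in range(16):
        if xi != 8:
            print(xi, sorted(subset(maps, xi)))
```

Every element of a printed subset produces the same subset again, which is the statement $F(\xi) = F(f_j(\xi))$.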
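Thus, Theorem C.2 says that given two arbitrary elements $\eta_1$ and $\eta_2$ of $F(\xi)$, the sets $F(\eta_1)$ and $F(\eta_2)$ are equal. Let $\xi_2 \equiv f_3(\xi_1) \pmod{2^m}$ and $\xi_3 \equiv f_5(\xi_1) \pmod{2^m}$. Then, by Theorem C.2 we get $F(\xi_1) = \{\xi_1, \xi_2, \xi_3\} \cup \{-\xi_1, -\xi_2, -\xi_3\}$, where $-\xi_1 \equiv f_1(\xi_1) \pmod{2^m}$, $-\xi_2 \equiv f_2(\xi_1) \pmod{2^m}$, and $-\xi_3 \equiv f_4(\xi_1) \pmod{2^m}$. Using this notation for the elements of $F(\xi_1)$, we show in Figure C.1 how these elements are related to each other, via the mappings defined in Theorem C.2. Table C.1 and Figure C.1 are equivalent descriptions of the relations between the elements of $F(\xi_1)$.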
It can be seen in Figure C.1 that the paths $f_3$ and $f_5$ associated with a set $F(\xi)$ (for some $\xi$) form a pair of triples. The paths $f_3$ and $f_5$ are marked with thicker lines in the figure. The set of all integers of $Z_{2^m}$, except the integer $2^{m-1}$, can be partitioned into disjoint subsets of size six, which each can be viewed as such a pair of triples. Note, however, that one of these subsets only comprises three integers: It follows from (C.8), (C.9), (C.10), (C.11), and (C.12) in Theorem C.2 that $f_1(0) = 0$, $f_2(0) = f_5(0)$, and $f_3(0) = f_4(0)$. Hence, the subset $F(0)$ is equal to $\{0, f_2(0), f_3(0)\} = \{0, 2^{m-1} + Z(0), 2^{m-1} - Z(0)\}$, which has size three (3).
Figure C.1: Relations between the polar integers $\xi_1$, $-\xi_1$, $\xi_2$, $-\xi_2$, $\xi_3$, and $-\xi_3$ modulo $2^m$, with respect to the mappings $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$.
The set of all Zech's logarithms, except $Z(2^{m-1})$, is also partitioned into corresponding subsets of size six. This is illustrated, for $m = 4$, in Figure C.2. The number of disjoint subsets of size six, as described above, equals
\[ \frac{(2^m - 1) - 3}{6} = \frac{2^{m-1} - 2}{3}. \]
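The partition and the subset count above can be reproduced for small $m$. The following Python sketch rests on our own assumptions (primitive element 3, brute-force Zech logarithms, the hypothetical function name `partition`); it collects the subsets $F(\xi)$ for all $\xi \neq 2^{m-1}$.

```python
def partition(m, alpha=3):
    """Disjoint subsets F(xi) of Z_{2^m} minus {2^(m-1)} for the prime 2**m + 1."""
    p, n, h = 2 ** m + 1, 2 ** m, 2 ** (m - 1)
    log = {pow(alpha, e, p): e for e in range(n)}
    Z = [log.get((pow(alpha, xi, p) + 1) % p) for xi in range(n)]
    seen, subsets = set(), []
    for xi in range(n):
        if xi == h or xi in seen:
            continue
        z = Z[xi]
        # xi together with f1(xi), ..., f5(xi) from (C.8)-(C.12)
        members = frozenset(
            v % n for v in (xi, -xi, z + h, h - z, h + xi - z, h - xi + z)
        )
        seen |= members
        subsets.append(members)
    return subsets


if __name__ == "__main__":
    for m in (4, 8):
        sizes = [len(s) for s in partition(m)]
        print(m, sizes.count(6), sizes.count(3), (2 ** (m - 1) - 2) // 3)
```

For $m = 4$ this gives two subsets of size six and the single subset $\{0, 6, 10\}$ of size three, in agreement with $(2^{m-1} - 2)/3 = 2$.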
Figure C.2: The Zech logarithms in $Z_{2^4+1}$, partitioned into pairs of triples.
Suppose the Zech logarithm of one integer, say $\xi$, from each of the above subsets (of size six) is stored in a table. Then, the Zech logarithm $Z(x)$ of an arbitrary integer $x \in Z_{2^m} \setminus \{2^{m-1}\}$ can be computed in the following way:
1. Find the unique integer $\xi$ which is contained in $F(x)$ and whose Zech's logarithm $Z(\xi)$ is stored in the table. The set $F$ was defined in Theorem C.2.
2. Read $Z(\xi)$ from the look-up table.
3. Use the congruences in Theorems C.1 and C.2 to compute, from $x$, $\xi$, and $Z(\xi)$, the desired logarithm $Z(x)$.
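The three steps can be sketched in code as follows. This is a hedged illustration, not the author's method: the reduced table here stores the smallest member of each subset as the representative (an arbitrary choice of ours), and Step 1 is done by brute-force search over the stored entries, since an efficient way to carry out Step 1 is left open in the text. The names `build_reduced_table` and `zech_lookup` are hypothetical, and 3 is assumed as primitive element.

```python
def build_reduced_table(m, alpha=3):
    """Store Z(xi) for one representative xi (the smallest) of every subset F(xi)."""
    p, n, h = 2 ** m + 1, 2 ** m, 2 ** (m - 1)
    log = {pow(alpha, e, p): e for e in range(n)}
    reduced, seen = {}, set()
    for xi in range(n):
        if xi == h or xi in seen:
            continue
        z = log[(pow(alpha, xi, p) + 1) % p]
        seen |= {v % n for v in (xi, -xi, z + h, h - z, h + xi - z, h - xi + z)}
        reduced[xi] = z
    return reduced


def zech_lookup(x, reduced, m):
    """Compute Z(x) for x != 2^(m-1) from the reduced table only."""
    n, h = 2 ** m, 2 ** (m - 1)
    for xi, z in reduced.items():            # Step 1 (brute force) and Step 2
        candidates = {                       # Step 3: member of F(xi) -> its Zech logarithm
            xi % n: z,
            -xi % n: (z - xi) % n,           # (C.3)
            (z + h) % n: (xi + h) % n,       # (C.4)
            (h - z) % n: (xi - z) % n,       # (C.5)
            (h + xi - z) % n: -z % n,        # (C.6)
            (h - xi + z) % n: (h - xi) % n,  # (C.7)
        }
        if x in candidates:
            return candidates[x]
    raise ValueError("x has no Zech logarithm in this field")


if __name__ == "__main__":
    table = build_reduced_table(4)
    print(table)
    print([zech_lookup(x, table, 4) for x in range(16) if x != 8])
```

For $m = 4$ the reduced table has only three entries instead of fifteen, at the price of the search in Step 1.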
When $2^{m-1}$ is added to an $m$-bit normal binary coded integer, the sum is simply obtained by inverting the most significant bit of the integer. Therefore, apart from this simple operation, the computation of an integer $f_j(\xi)$, or its Zech's logarithm $Z(f_j(\xi))$, requires at most one addition modulo $2^m$. The main problem here is to carry out Step 1. We have not fully investigated how to select the integers $\xi$ which in a unique way should map to the entries of the look-up table. This problem is similar to the problem in Section 7.5.4 of finding the unique positions in $M_m$.
We conclude this appendix by presenting two properties of the subset $F(x)$. These properties may be of help in Step 1, when trying to find the unique integer $\xi$ of $F(x)$. Consider the set $F(x)$ of integers, where $x$ is an arbitrary integer in any triple, as described above. Then, by the congruences in Theorems C.1 and C.2 we straightforwardly obtain the two following properties:
\[ x + f_3(x) + f_5(x) \equiv x + (2^{m-1} - Z(x)) + (2^{m-1} - x + Z(x)) \equiv 0 \pmod{2^m} \tag{C.13} \]
\[ Z(x) + Z(f_3(x)) + Z(f_5(x)) \equiv Z(x) + (x - Z(x)) + (2^{m-1} - x) \equiv 2^{m-1} \pmod{2^m} \tag{C.14} \]
Alternatively, using the above notation, we can write $\xi_1 + \xi_2 + \xi_3 \equiv 0 \pmod{2^m}$ and $Z(\xi_1) + Z(\xi_2) + Z(\xi_3) \equiv 2^{m-1} \pmod{2^m}$.
Remark: We have derived still more properties of Zech's logarithms in Fermat prime fields. However, these properties are not considered here.
Bibliography
[1] M. Afgahi and J. Yuan, “A Novel Implementation of Double-Edge
Trigger Flip-Flop for High Speed CMOS Circuit”, IEEE Journal of
Solid-State Circuits, Vol. 26, No. 8, pp. 1168–1170, August 1991.
[2] R. C. Agarwal and C. S. Burrus, “Fast Convolution Using Fermat
Number Transforms with Applications to Digital Filtering”, IEEE
Trans. Acoust., Speech, and Signal Processing, Vol. ASSP-22, No. 2,
pp. 87–97, April 1974.
[3] D. P. Agrawal and T. R. N. Rao, “Modulo $(2^n + 1)$ arithmetic logic”,
IEE Journ. Electronic Circuits and Systems, Vol. 2, pp. 186–188,
November 1978.
[4] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis
of Computer Algorithms, Addison-Wesley, 1974.
[5] L.-I. Alfredsson, “Properties of Zech’s Logarithms over Fermat
Prime Fields”, Proc. Sixth Joint Swedish-Russian International
Workshop on Information Theory, Mölle, Sweden, pp. 310–314, August
1993.
[6] L.-I. Alfredsson, “A Fast Fermat Number Transform for Long
Sequences”, Proc. Seventh European Signal Processing Conference,
(EUSIPCO-94), Edinburgh, Scotland, Vol. III, pp. 1579–1581,
September 1994.
[7] L.-I. Alfredsson, “A Mirrored Integer Sequence of Length $2^m$ and
the Discrete Logarithm in Fermat Prime Fields”, Proc. Sixth Joint
Swedish-Russian International Workshop on Information Theory,
St.-Petersburg, Russia, pp. 15–19, June 1995.
[8] M. Annaratone, Digital CMOS Circuit Design, Kluwer Academic
Publishers, 1986.
[9] B. Arambepola and S. Choomchuay, “Algorithms and Architectures
for Reed-Solomon Codes”, GEC Journal of Research, Vol. 9, No. 3,
pp. 172–184, 1992.
[10] A. S. Ashur, “Area-Time Efficient Diminished–1 Multiplier for
Fermat Number Transform”, Electronic Letters, Vol. 30, No. 20, pp.
1640–1641, September 1994.
[11] M. Benaissa, A. Bouridane, S. S. Dlay, and A. G. J. Holt,
“Diminished–1 Multiplier for a Fast Convolver and Correlator
Using the Fermat Number Transform”, IEE Proceedings, Vol.
135, Pt. G, No. 5, pp. 187–193, October 1988.
[12] M. Benaissa, A. Pajayakrit, S. S. Dlay, and A. G. J. Holt, “VLSI De-
sign for Diminished–1 Multiplication of Integers Modulo a Fermat
Number”, IEE Proceedings, Vol. 135, Pt. E, No. 3, pp. 161–164, May
1988.
[13] M. Benaissa, S. S. Dlay, and A. G. J. Holt, “CMOS VLSI Design
of a High-Speed Fermat Number Transform Based
Convolver/Correlator Using Three-Input Adders”, IEE Proceedings,
Vol. 138, Pt. G, No. 2, pp. 182–190, April 1991.
[14] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI,
Addison-Wesley, 1990.
[15] G. Bilardi, M. Pracchi, and F. P. Preparata, “A Critique and an
Appraisal of VLSI Models of Computation”, VLSI Systems and
Computations, pp. 81–88, Editors: H. T. Kung, B. Sproull, and G. Steele,
Springer-Verlag, 1981.
[16] R. E. Blahut, Theory and Practice of Error Control Codes,
Addison-Wesley, 1984.
[17] R. E. Blahut, Fast Algorithms for Digital Signal Processing,
Addison-Wesley, 1985.
[18] I. E. Bocharova and B. D. Kudryashov, “Fast Exponentiation Based
on Lempel-Ziv Algorithm”, Proc. of the Sixth Joint Swedish-Russian
International Workshop on Information Theory, Mölle, Sweden, pp.
259–263, August 1992.
[19] I. E. Bocharova and B. D. Kudryashov, “Fast Exponentiation
Based on Data Compression Algorithms”, Proc. of the Seventh Joint
Swedish-Russian International Workshop on Information Theory,
St.-Petersburg, Russia, pp. 36–39, June 1995.
[20] S. Boussakta and A. G. J. Holt, “Calculation of the discrete Hartley
transform via the Fermat number transform using a VLSI chip”,
IEE Proceedings, Vol. 135, Pt. G, No. 3, pp. 101–103, June 1988.
[21] S. Boussakta and A. G. J. Holt, “Fast multidimensional discrete
Hartley transform using Fermat number transform”, IEE Proceed-
ings, Vol. 135, Pt. G, No. 6, pp. 253–257, December 1988.
[22] S. Boussakta and A. G. J. Holt, “Relationship between the Fermat
number transform and the Walsh-Hadamard transform”, IEE Pro-
ceedings, Vol. 136, Pt. G, No. 4, pp. 191–204, August 1989.
[23] S. Boussakta, A. Y. Md. Shakaff, F. Marir, and A. G. J. Holt,
“Number theoretic transforms of periodic structures and their
applications”, IEE Proceedings, Vol. 135, Pt. G, No. 2, pp. 83–96, April 1988.
[24] R. P. Brent, “Factorization of the Eleventh Fermat Number” (pre-
liminary report), Abstracts, Amer. Math. Soc., Vol. 10, 89T-11-73,
1989.
[25] R. P. Brent, “Parallel Algorithms for Integer Factorizations”, Num-
ber Theory and Cryptography, London Math. Soc. Lecture Note Se-
ries, Editor: J. H. Loxton, Vol. 154, Cambridge, 1990.
[26] R. P. Brent and H. T. Kung, “The Area-Time Complexity of Binary
Multiplication”, Journ. of the Ass. for Comp. Mach., Vol. 28, No. 3, pp.
521–534, July 1981.
[27] R. P. Brent and H. T. Kung, “A Regular Layout for Parallel Adders”,
IEEE Trans. on Computers, Vol. C-31, No. 3, pp. 260–264, March
1982.
[28] J. Brillhart, D. H. Lehmer, J. L. Selfridge, B. Tuckerman, and S. S.
Wagstaff, Jr., Factorizations of $b^n \pm 1$, $b = 2, 3, 5, 6, 7, 10, 11, 12$ up
to high powers, Contemporary Mathematics, Volume 22, American
Mathematical Society, Second Edition, 1988.
[29] J. T. Butler (editor), Multiple-valued logic in VLSI, IEEE Computer
Society Press, Los Alamitos, 1991.
[30] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-Power
CMOS Digital Design”, IEEE Journal of Solid-State Circuits, Vol. 27,
No. 4, pp. 473–484, April 1992.
[31] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS
Design, Kluwer Academic Publishers, 1995.
[32] J. J. Chang, T. K. Truong, H. M. Shao, I. S. Reed, and I-S Hsu, “The
VLSI Design of a Single Chip for the Multiplication of Integers
Modulo a Fermat Number”, IEEE Trans. Acoust., Speech, and Signal
Processing, Vol. ASSP-33, No. 6, pp. 1599–1602, December 1985.
[33] P. R. Chevillat, “Transform-Domain Digital Filtering with Number
Theoretic Transforms and Limited Word Lengths”, IEEE Trans.
Acoust., Speech, and Signal Processing, Vol. ASSP-26, No. 4, pp.
284–290, August 1978.
[34] J. H. Conway, “A Tabulation of Some Information Concerning
Finite Fields”, Computers in Mathematical Research (R. F. Churchhouse
and J.-C. Herz, Editors), pp. 37–50, North-Holland, Amsterdam,
1968.
[35] J. W. Cooley and J. W. Tukey, “An algorithm for the machine
calculation of complex Fourier series”, Mathematics of Computation, Vol.
19, pp. 297–301, 1965.
[36] L. E. Dickson, History of the theory of numbers, Vol. I, Washington D.
C.: Carnegie Institute, 1919.
[37] V. S. Dimitrov, T. V. Cooklev, and B. D. Donevsky, “Generalized
Fermat-Mersenne Number Theoretic Transform”, IEEE Trans. on
Circuits and Systems–II: Analog and Digital Signal Processing, Vol. 41,
No. 2, pp. 1–7, February 1994.
[38] E. Dubois and A. N. Venetsanopoulos, “Number Theoretic
Transforms with modulus $2^{2q} - 2^q + 1$”, Rec. 1978 IEEE Int. Conf. Acoust.,
Speech, and Signal Processing, pp. 624–627, April 1978.
[39] E. Dubois and A. N. Venetsanopoulos, “The Generalized Discrete
Fourier Transform in Rings of Algebraic Integers”, IEEE Trans.
Acoust., Speech, and Signal Processing, Vol. ASSP-28, No. 2, pp. 169–
175, April 1980.
[40] P. Duhamel and H. Hollman, “Split-radix FFT Algorithm”,
Electron. Lett., Vol. 20, pp. 14–16, January 1984.
[41] P. Duhamel, “Implementation of ’Split-Radix’ FFT Algorithms for
Complex, Real, and Real-Symmetric Data”, IEEE Trans. Acoust.,
Speech, and Signal Processing, Vol. ASSP-34, No. 2, pp. 285–295,
April 1986.
[42] H. M. Edwards, Fermat’s Last Theorem, A Genetic Introduction to Al-
gebraic Number Theory, Springer-Verlag, New York 1977.
[43] W. C. Elmore, “The Transient Response of Damped Linear Net-
works with Particular Regard to Wideband Ampliﬁer”, Journal of
Applied Physics, Vol. 19, No. 1, pp. 55–63, January 1948.
[44] R. L. Geiger, P. E. Allen, and N. R. Strader, VLSI Design Techniques
for Analog and Digital Circuits, McGraw-Hill Publishing Company,
1991.
[45] W. M. Gentleman and G. Sande, “Fast Fourier transforms for fun
and proﬁt”, Fall Joint Computing Conference, AFIPS Proc., Vol. 29,
pp. 563–578, 1966.
[46] D. Gollman, Y. Han, and C. J. Mitchell, “Redundant Integer
Representations and Fast Exponentiation”, To appear in Designs, Codes
and Cryptography.
[47] S. W. Golomb, “Properties of the Sequence $3 \cdot 2^n + 1$”, Mathematics
of Computation, Vol. 30, No. 135, pp. 657–663, July 1976.
[48] S. W. Golomb, I. S. Reed, and T. K. Truong, “Integer Convolutions
over the Finite Field $GF(3 \cdot 2^n + 1)$”, SIAM Journal of Applied Math.,
Vol. 32, No. 2, pp. 356–365, March 1977.
[49] N. Hedenstierna and K. O. Jeppson, “CMOS Circuit Speed and
Buffer Optimization”, IEEE Trans. on Computer-Aided Design, Vol.
CAD-6, No. 2, pp. 270–281, March 1987.
[50] I. N. Herstein, Topics in Algebra, Second Edition, John Wiley & Sons,
1975.
[51] K. Huber, “Some Comments on Zech’s Logarithms”, IEEE Trans.
on Inf. Theory, Vol. IT-36, No. 4, pp. 946–950, July 1990.
[52] K. Hwang, Computer Arithmetic: principles, architecture, and design,
John Wiley & Sons, 1979.
[53] K. Imamura, “A Method for Computing Addition Tables in
$GF(p^n)$”, IEEE Trans. on Inf. Theory, Vol. IT-26, No. 3, pp. 367–369,
May 1980.
[54] J. Justesen, “On the Complexity of Decoding Reed-Solomon
Codes”, IEEE Trans. on Inf. Theory, Vol. IT-22, pp. 237–238, March
1976.
[55] A. Karatsuba and Y. Hofman, “Multiplication of multidigit num-
bers on automata” (in Russian), Dokl. Akad. Nauk SSSR, Vol. 145,
pp. 293–294, 1962.
[56] D. E. Knuth, The Art of Computer Programming. Vol. 2: Seminumerical
Algorithms, Addison-Wesley, Reading, MA, 1969.
[57] S. Lang, Algebraic Number Theory, Springer-Verlag, New York, 1986.
[58] L. M. Leibowitz, “A Simplified Binary Arithmetic for the Fermat
Number Transform”, IEEE Trans. Acoust., Speech, and Signal
Processing, Vol. ASSP-24, No. 5, pp. 356–359, October 1976.
[59] A. K. Lenstra, H. W. Lenstra, Jr., M. S. Manasse, J. M. Pollard, “The
factorization of the ninth Fermat Number”, Math. Comp., Vol. 61,
pp. 318–349, 1993.
[60] R. Lidl and H. Niederreiter, Finite Fields, Encyclopedia of Math-
ematics and its Applications, Volume 20, Cambridge University
Press, 1984.
[61] D. Liu, Low Power Digital CMOS Design, Ph.D. dissertation, No.
364, Linköping University, Linköping, Sweden 1994.
[62] K. Y. Liu, I. S. Reed, and T. K. Truong, “Fast Number-Theoretic
Transforms for Digital Filtering”, Electronic Letters, Vol. 12, No. 24,
pp. 644–646, November 1976.
[63] K. Y. Liu, I. S. Reed, and T. K. Truong, “High-Radix Transforms for
Reed-Solomon Codes over Fermat Primes”, IEEE Trans. on Inf.
Theory, Vol. IT-23, No. 6, pp. 776–778, November 1977.
[64] B. J. McCarroll, C. G. Sodini, and H.-S. Lee, “A High-Speed CMOS
Comparator for Use in an ADC”, IEEE Journal of Solid-State Circuits,
Vol. 23, No. 1, pp. 159–165, February 1988.
[65] J. H. McClellan, “Hardware Realization of a Fermat Number
Transform”, IEEE Trans. Acoust., Speech, and Signal Processing, Vol.
ASSP-24, No. 3, pp. 216–225, June 1976.
[66] C. Mead and L. Conway, Introduction to VLSI Systems,
Addison-Wesley, 1980.
[67] A. M. Mohsen and C. A. Mead, “Delay-Time Optimization for
Driving and Sensing of Signals on High-Capacitance Paths of VLSI
Systems”, IEEE Journal of Solid-State Circuits, Vol. SC-14, No. 2, pp.
462–470, April 1979.
[68] Y. Morikawa, H. Hamada, and K. Nagayasu, “Hardware Reali-
sation of High Speed Butterﬂy for the Maximum Length Fermat
Number Transform”, Trans. IECE, Japan, Vol. J66-D, No. 1, pp. 81–
88 1983. (We have not yet been able to get a copy of this reference.)
[69] H. Murakami, I. S. Reed, and L. R. Welch, “A Transform Decoder
for Reed-Solomon Codes in Multiple-User Communication
Systems”, IEEE Trans. on Inf. Theory, Vol. IT-23, No. 6, pp. 675–683,
November 1977.
[70] J. K. Ousterhout, “A Switch-Level Timing Verifier for Digital MOS
VLSI”, IEEE Trans. on Computer-Aided Design, Vol. CAD-4, No. 3,
pp. 336–349, July 1985.
[71] A. Pajayakrit, VLSI Architecture and Design for the Fermat Number
Transform Implementation, PhD. Thesis, Dept. of Electrical and
Electronic Engineering, University of Newcastle-Upon-Tyne, United
Kingdom, 1988.
[72] S. C. Pohlig and M. E. Hellman, “An Improved Algorithm for
Computing Logarithms over $GF(p)$ and Its Cryptographic
Significance”, IEEE Trans. on Inf. Theory, Vol. IT-24, No. 1, pp. 106–110,
January 1978.
[73] J. M. Pollard, “Implementation of Number-Theoretic Transforms”,
Electronic Letters, Vol. 12, No. 15, pp. 378–379, July 1976.
[74] J. G. Proakis and D. G. Manolakis, Digital Signal Processing:
Principles, Algorithms, and Applications, Second Edition, Macmillan Publ.
Comp., 1992.
[75] J. G. Proakis, C. M. Rader, F. Ling, and C. L. Nikias, Advanced Digital
Signal Processing, Macmillan Publ. Company, 1992.
[76] D. A. Pucknell and K. Eshraghian, Basic VLSI Design, Third Edi-
tion, Prentice Hall, 1994.
[77] C. M. Rader, “Discrete Convolution via Mersenne Transforms”,
IEEE Trans. Comput., Vol. C-21, No. 12, pp. 1269–1273, December
1972.
[78] C. M. Rader, “On the application of the number theoretic meth-
ods of high-speed convolution to two-dimensional ﬁltering”, IEEE
Trans. on Circuits and Systems, Vol. 22, p. 575, 1975.
[79] L. Rabiner and B. Gold, Theory and Application of Digital Signal
Processing, Prentice-Hall, 1975.
[80] I. S. Reed, R. A. Scholtz, T. K. Truong, and L. R. Welsh, “The Fast
Decoding of Reed-Solomon Codes Using Fermat Theoretic
Transforms and Continued Fractions”, IEEE Trans. on Inf. Theory, Vol.
IT-24, No. 1, pp. 100–106, January 1978.
[81] I. S. Reed and G. Solomon, “Polynomial Codes over Certain Finite
Fields”, Journal of the Society for Industrial and Applied Mathematics,
Vol. 8, pp. 300–304, June 1960.
[82] I. S. Reed, T. K. Truong, and L. R. Welch, “The Fast Decoding of
Reed-Solomon Codes Using Fermat Transforms”, IEEE Trans. on
Inf. Theory, Vol. IT-24, No. 4, pp. 497–499, July 1978.
[83] R. M. Robinson, “A Report on Primes of the Form $k \cdot 2^n + 1$ and
on Factors of Fermat Numbers”, Proc. of the Amer. Math. Soc., Vol.
9, pp. 673–681, 1958.
[84] K. H. Rosen, Elementary Number Theory and its Applications, Third
Edition, Addison-Wesley, 1993.
[85] J. Rubinstein, P. Penfield Jr., and M. A. Horowitz, “Signal Delay
in $RC$ Tree Networks”, IEEE Trans. on Computer-Aided Design, Vol.
CAD-2, No. 3, pp. 202–211, July 1983.
[86] A. Schönhage and V. Strassen, “Fast multiplication of integers” (in
German), Computing, Vol. 7, pp. 281–292, 1971.
[87] W. C. Siu and A. G. Constantinides, “Very fast discrete Fourier
transform, using number theoretic transform”, IEE Proceedings,
Vol. 130, Pt. G, No. 5, pp. 201–204, October 1983.
[88] W. C. Siu and A. G. Constantinides, “On the computation of dis-
crete Fourier transform using Fermatnumbertransform”, IEEPro-
ceedings, Vol. 131, Pt. F, No. 1, pp. 7–14, February 1984.
[89] A. Y. Md. Shakaff, Practical Implementation of the Fermat Number
Transform with Applications to Filtering and Image Processing,P h D .
Thesis, Dept. of Electrical and Electronic Engineering, University
of Newcastle-Upon-Tyne, United Kingdom, 1988.Bibliography 277
[90] A. Y. Md. Shakaff, A. Pajayakrit, and A. G. J. Holt, “Practical
implementations of block-mode image filters using the Fermat number
transform on a microprocessor based system”, IEE Proceedings, Vol.
135, Pt. G, No. 4, pp. 141–154, August 1988.
[91] A. Shiozaki, T. K. Truong, K. M. Cheung, and I. S. Reed, “Fast
Transform Decoding of Nonsystematic Reed-Solomon codes”, IEE
Proceedings, Vol. 137, Pt. E, No. 2, pp. 139–143, March 1990.
[92] H. C. Shyu, T. K. Truong, I. S. Reed, I. S. Hsu, and J. J. Chang, “A
New VLSI Complex Integer Multiplier Which Uses a Quadratic-
Polynomial Residue System with Fermat Numbers”, IEEE Trans.
Acoust., Speech, and Signal Processing, Vol. ASSP-35, No. 7,
pp. 1076–1079, July 1987.
[93] W. Sierpiński, Elementary Theory of Numbers, Editor: A. Schinzel,
Second Edition, PWN-Polish Scientific Publishers, 1988.
[94] A. N. Skodras and A. G. Constantinides, “Efficient Computation of
the Split-Radix FFT”, IEE Proceedings–F, Vol. 139, No. 1, pp. 56–60,
February 1992.
[95] I. N. Stewart and D. O. Tall, Algebraic Number Theory, Second Edi-
tion, Chapman & Hall, 1987, Reprinted 1994.
[96] R. Sundblad and C. Svensson, “Fully Dynamic Switch-Level Sim-
ulation of CMOS Circuits”, IEEE Trans. on Computer-Aided Design,
Vol. CAD-6, No. 2, pp. 282–289, March 1987.
[97] S. Sunder, F. El-Guibali, and A. Antoniou, “Area-efficient
Diminished–1 Multiplier for Fermat Number-theoretic Transform”,
IEE Proceedings–G, Vol. 140, No. 3, pp. 211–215, June 1993.
[98] C. Svensson, K. Cheng, and J. Yuan, “Decision making in Fast A/D
Converters”, Int. Report LiTH-IFM-IS-154, Linköping University,
October 1989.
[99] C. Svensson and D. Liu, “Low Power Circuit Techniques”, Manu-
script, 1995.
[100] C. D. Thompson, “Area-Time Complexity for VLSI”, Proc. Eleventh
Annual ACM Symposium on the Theory of Computing, pp. 81–88,
1979.
[101] P. J. Towers, A. Pajayakrit, and A. G. J. Holt, “Cascadable NMOS
VLSI circuit for implementing a fast convolver using the Fermat
number transform”, IEE Proceedings, Vol. 134, Pt. G, No. 2, pp. 57–
66, April 1987.
[102] T. K. Truong, J. J. Chang, I. S. Hsu, D. Y. Pei, and I. S. Reed,
“Techniques for Computing the Discrete Fourier Transform Using the
Quadratic Residue Fermat Number Systems”, IEEE Trans. on Computers,
Vol. C-35, No. 11, pp. 1008–1012, November 1986.
[103] T. K. Truong, I. S. Reed, C. -Yeh, and H. M. Shao, “A Parallel VLSI
Architecture for a Digital Filter of Arbitrary Length Using Fermat
Number Transforms”, Proceedings of IEEE International Conference
on Circuits and Computers (ICCC ’82), New York, pp. 574–578, Sept.
28 – Oct. 1, 1982.
[104] J. D. Ullman, Computational Aspects of VLSI, Computer Science
Press, 1983.
[105] J. P. Uyemura, Circuit Design for CMOS VLSI, Kluwer Academic
Publishers, 1992.
[106] J. P. Uyemura, Fundamentals of MOS Digital Integrated Circuits,
Addison-Wesley, 1988.
[107] C. -Yeh, I. S. Reed, J. J. Chang, and T. K. Truong, “VLSI Design of
Number Theoretic Transforms for a Fast Convolution”, Proceedings
of IEEE International Conference on Computer Design: VLSI in Computing,
New York, pp. 200–203, October 1983.
[108] J. Yuan, I. Karlsson, and C. Svensson, “A True Single-Phase-Clock
Dynamic CMOS Circuit Technique”, IEEE Journal of Solid-State Circuits,
Vol. 22, No. 5, pp. 899–901, October 1987.
[109] J. Yuan and C. Svensson, “CMOS Circuit Speed Optimization
Based on Switch Level Simulation”, Proceedings of 1988 IEEE Inter-
national Symposium on Circuits and Systems, Espoo, Finland, Vol. 3,
pp. 2109–2112, June 1988.
[110] J. Yuan, C. Svensson, and P. Larsson, “New Domino Logic
Precharged by Clock and Data”, Electronics Letters, Vol. 29, No. 25,
pp. 2188–2189, December 1993.
[111] L. Wanhammar and B. Sikström, DSP Integrated Circuits, Dept. of EE,
Linköping University, Sweden, 1990.
[112] B. W. Y. Wei and C. D. Thompson, “Area-Time Optimal Adder De-
sign”, IEEE Trans. on Computers, Vol. C-39, No. 5, pp. 666–675, May
1990.
[113] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI De-
sign: A Systems Perspective, Second Edition, Addison-Wesley Publ.
Comp., 1993.
[114] S. B. Wicker, Error Control Systems for Digital Communication and
Storage, Prentice Hall, 1995.
[115] D. Zuras, “More on Squaring and Multiplying Large Integers”,
IEEE Trans. on Computers, Vol. 43, No. 8, pp. 899–908, August 1994.

Linköping Studies in Science and Technology
Dissertations, Data Transmission
V. Ramamoorthy: Speech coding based on a composite-Gaussian source model.
Dissertation No. 60, 1981.
Jan-Erik Stjernvall: A study of distortion measures for source coding.
Dissertation No. 68, 1981.
Jens Zander: Distributed access algorithms for a class of multi-access channels.
Dissertation No. 123, 1985.
Edoardo Mastrovito: VLSI architectures for computations in Galois fields.
Dissertation No. 242, 1991.
Shakir Abdul-Jabbar: Disjunctive codes for the multiple access OR-channel.
Dissertation No. 254, 1991.
Youzhi Xu: Contributions to the decoding of Reed-Solomon and related codes.
Dissertation No. 257, 1991.
Tommy Pedersen: Performance aspects of concatenated codes.
Dissertation No. 245, 1992.
Jan Nilsson: On hard and soft decoding of block codes.
Dissertation No. 333, 1994.
Per-Olof Anderson: Superimposed codes for the Euclidean channel.
Dissertation No. 342, 1994.
Per Larsson: Codes for correction of localized errors.
Dissertation No. 374, 1995.
Eva Englund: Codes with unequal error protection.
Dissertation No. 412, 1995.
Ralf Kötter: On algebraic decoding of algebraic-geometric and cyclic codes.
Dissertation No. 419, 1996.