Squared Law Algorithms: Theory and Applications. by Rao, Poornachandra Bellamkonda
Louisiana State University
LSU Digital Commons
LSU Historical Dissertations and Theses Graduate School
1993
Squared Law Algorithms: Theory and Applications.
Poornachandra Bellamkonda Rao
Louisiana State University and Agricultural & Mechanical College

Order Number 9405417
Squared law algorithms: Theory and applications
Rao, Poornachandra Bellamkonda, Ph.D.
The Louisiana State University and Agricultural and Mechanical Col., 1993

SQUARED LAW ALGORITHMS: THEORY AND APPLICATIONS
A Dissertation
Submitted to the Graduate Faculty of the 
Louisiana State University and 
Agricultural and Mechanical College 
in partial fulfillment of the 
requirements for the degree of 
Doctor of Philosophy
in
The Department of Electrical and Computer Engineering
by
Poornachandra B. Rao 
B. E., Osmania University, 1984 
M.S. in E.E., Louisiana State University, 1989 
August 1993
Acknowledgments
I would like to thank my major professor, Dr. Alexander Skavantzos, for his 
advice and guidance throughout this research effort.
I would also like to thank Drs. Ahmed El-Amawy, Bush Jones, Subhash Kak, 
Charles Monlezun, and Suresh Rai for their willingness to serve as members of my doctoral 
committee. In particular, I would like to thank Professor Subhash Kak who through 
many conversations not only provided encouragement but also helped me see the 
broader picture.
Most of all, I would like to thank my parents, sister, and brother whose 
patience, faith, and moral support made this dissertation possible.
Table of Contents
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
1.1 Overview of existing methods
1.2 Our approach
Chapter 2 Mathematical Foundations
2.1 Background
2.2 Algorithm for convolution using an exponential number of squares
2.3 Direct extension of the one over eight squared algorithm
2.4 Squared law theorems for cyclic convolutions
2.5 Analysis of part 1
2.5.1 Comparison of methods 1 - 3
2.6 Number of squares
2.7 Number of additions
2.8 Example
2.9 Summary
Chapter 3 Implementation Issues
3.1 CSA implementation of the multiplication operation
3.2 CSA implementation of the squaring operation
3.3 Alternate CSA implementation of the squaring operation
3.4 CSA implementation of the cyclic convolution
3.4.1 4-point cyclic convolution-traditional
3.4.2 4-point cyclic convolution-modular
3.4.3 4-point cyclic convolution-squares
3.4.4 8-point cyclic convolution-traditional
3.4.5 8-point cyclic convolution-modular
3.4.6 8-point cyclic convolution-squares
3.4.7 16-point cyclic convolution-traditional
3.4.8 16-point cyclic convolution-modular
3.4.9 16-point cyclic convolution-squares
3.4.10 Discussion
3.5 Hybrid implementation of cyclic convolution
3.5.1 8-point cyclic convolution-hybrid, modular
3.5.2 8-point cyclic convolution-hybrid, squares
3.6 Applications to computer arithmetic
3.6.1 Modulo $2^N - 1$ multiplication
3.6.2 Extending the modulo $2^N - 1$ multiplier
3.6.3 Example
3.6.3.1 Modulo $2^N - 1$ product
3.6.3.2 Modulo $2^N + 1$ product
3.6.3.3 Modulo $2^N$ product
3.6.3.4 Full precision product
3.6.4 Hardware and speed analysis
3.7 Summary
Chapter 4 ROM Based Methods for Computing the Squaring Operation in Modular Rings
4.1 Memory compression schemes for arithmetic in modulo $2^n$
4.1.1 Analysis when the high word is one bit long
4.1.2 Analysis when the high word is two bits long
4.1.3 Analysis when the high word is three bits long
4.2 Optimized memory compression schemes for arithmetic in modulo $2^n$
4.2.1 Analysis when n is even
4.3 Numerical example
4.3.1 Illustrating techniques of section 4.1.1
4.3.2 Illustrating techniques of section 4.1.2
4.3.3 Illustrating techniques of section 4.1.3
4.3.4 Illustrating techniques of section 4.2.1
4.4 Comparing techniques of section 4.1 with 4.2
4.4.1 Cost and speed analysis for section 4.1
4.4.2 Cost and speed analysis for section 4.2
4.5 Memory compression schemes for arithmetic in modulo $2^n - 1$
4.5.1 Analysis when the high word is one bit long
4.5.2 Analysis when the high word is two bits long
4.6 Optimized memory compression schemes for arithmetic in modulo $2^n - 1$
4.7 Memory compression schemes for arithmetic in modulo $2^n + 1$
4.8 Optimized memory compression schemes for arithmetic in modulo $2^n + 1$
4.9 Conclusions
Chapter 5 Conclusions
5.1 Summary
5.2 Future research emphasis
References




List of Tables
Table 2.1 Comparison of the number of multiplications versus squaring operations
Table 3.1 Hardware cost in 2-input gates for cyclic convolution of 4, 8, and 16 points
Table 3.2 Time delay of cyclic convolution of 4, 8, and 16 points
Table 3.3 Hardware and speed comparison of various look-up table techniques
Table 3.4 Cost comparison in ROM bits of the various techniques for computing $\langle A \times B \rangle_{2^n}$
Table 3.5 Cost comparison in ROM bits for integrated multiplier, based on techniques of this section
Table 4.1 Values of $\langle A_{H2} A_{L2} 2^{n-1} \rangle_{2^n}$
Table 4.2 Values of $\langle A_{H3} A_{L3} 2^{n-2} \rangle_{2^n}$
Table 4.3 Results when n is even. $A_H = a_{n-1} a_{n-2} \ldots a_{n/2}$, $A_L = a_{(n/2)-1} a_{(n/2)-2} \ldots a_0$, and $QS = 2^{n/2-1}\{(A_H + A_L)^2 - (A_H - A_L)^2\}$
Table 4.4 Results when n is odd. $A_H = a_{n-1} a_{n-2} \ldots a_{(n+1)/2}$, $A_L = a_{(n-1)/2} \ldots a_1 a_0$, and $QS = 2^{(n-1)/2}\{(A_H + A_L)^2 - (A_H - A_L)^2\}$
Table 4.5 Cost comparison in 2-input gates of techniques of section 4.1 with 4.2
Table 4.6 Speed comparison in 2-input gate delays of techniques of section 4.1 with 4.2
Table 4.7 Values of <2n'4 A ^2 + a
List of Figures
Figure 3.1 CSA implementation of an 8 x 8 multiplier
Figure 3.2 Array of summands for an 8 bit squarer
Figure 3.3 CSA implementation of an 8 bit squarer
Figure 3.4 Intuitive CSA implementation of an 8 bit squarer
Figure 3.5 Reduced and regular array of summands for an 8 bit squarer
Figure 3.6 CSA implementation of reduced 8 bit squarer
Figure 3.7 Pictorial representation of a 4-point cyclic convolution
Figure 3.8 8-point cyclic convolution-module 1
Figure 3.9 8-point cyclic convolution-module 2
Figure 3.10 8-point cyclic convolution-module 3
Figure 3.11 Hardware architecture to implement equation (3.2) using traditional techniques
Figure 3.12 Hardware architecture for implementing equation (3.2) using the quarter squared algorithm
Figure 3.13 Hardware architecture to realize equations (3.26) and (3.28)
Figure 4.1 The direct computation of $\langle A^2 \rangle_{2^n}$, ROM size $2^n \times n$
Figure 4.2 The computation of $\langle A^2 \rangle_{2^n}$ based on equation (4.3), ROM size $2^{n-1} \times n$
Figure 4.3 The computation of $\langle A^2 \rangle_{2^n}$ based on equation (4.9), ROM size $2^{n-2} \times n$
Figure 4.4 The computation of $\langle A^2 \rangle_{2^n}$ based on equation (4.16), ROM size $2^{n-3} \times n$
Figure 4.5 The computation of $\langle A^2 \rangle_{2^n}$ based on equations (4.23)-(4.24), total ROM bits = $5 \times 2^{n/2} \times n$
Figure 4.6 Basic scheme for techniques of section 4.1
Abstract
This dissertation focuses on a new approach for a hardware implementation of 
the cyclic convolution operation. The cyclic convolution operation is the core of 
several functions used in applications related to digital signal processing and error 
control. Since the operation is multiplication intensive and the cost of a multiplication 
operation is very high, most of the present research effort attempts to reduce the 
number of multiplications.
Our approach, however, aims at obtaining an efficient implementation by 
relying on the properties of the special case of multiplication, namely, the squaring 
operation. Owing to the properties of the squaring operation, a squarer unit requires 
less hardware and has a smaller time delay than a multiplication 
unit. This is true for both memory and non-memory based implementations.
In this dissertation we have developed all the necessary theory required to 
express the cyclic convolution of two n-point sequences, where n is a power of 2, in 
terms of the elementary arithmetic operations add, square, and subtract. Our 
algorithms require fewer squaring operations than multiplication operations required 
by a traditional implementation of the cyclic convolution operation, do not introduce 
any round-off errors, place no restriction on word length, and are valid when the 
number of points to be convolved is a power of two. We then clearly demonstrate that 
our algorithms are also more hardware efficient for both memory and non-memory 
based implementations. Further, schemes to multiply two numbers based on the cyclic 
convolution operation are presented. Finally, efficient ways of computing the squaring 
operation when arithmetic is performed in modular rings are developed.
Chapter 1 
Introduction
Applications in the fields of digital signal processing (DSP) and error control are 
a few of the many interests of a hardware design engineer. Hardware design for these 
applications is challenging because of their high computational complexity. The main 
computational tasks in these applications are convolutions, Fourier transforms, and the 
inversion of Toeplitz systems of equations for spectral estimation [1]. A wealth of 
literature already exists in these areas [2]-[6], to name a few. Apart from these books 
there are several journals dedicated to research and development in these areas. Much of 
the material in these fields is centered around the discrete Fourier transform (DFT). The 
DFT has many powerful algebraic properties that are valid in numerous number 
systems. Researchers have exploited these properties by exploring several different 
alternatives for the field of operations [6]-[11]. An appropriate selection provides the 
designer with a number of tricks that can speed up algorithms and simplify hardware 
implementations. For instance, selecting a Galois field of the form $GF(2^n - 1)$ or 
$GF(2^n + 1)$ significantly simplifies the arithmetic processing. We observe that arithmetic 
performed modulo $(2^n - 1)$ is similar to one's complement arithmetic.
Much of the field of digital signal processing and error control coding is devoted 
to the task of removing noise by passing a known signal through a suitable filter [1]. 
The main computational problem involved in this is the convolution operation. The 
convolution operation is used in implementations of finite impulse response filters [12], 
infinite impulse response filters [13], auto and cross correlations [14], and polynomial 
multiplication and multiplication of very large integers [15],[16]. The large sized
problems in filtering are broken into smaller linear convolutions or cyclic convolutions 
using well known overlap techniques [17],[18]. This dissertation focuses on computing 
the convolution operation using a new approach that does not rely on any transforms. 
Instead, we focus on this operation from the computer arithmetic point of view by 
examining the applicability of other elementary functions in evaluating the convolution 
operation.
The rest of the introductory chapter is organized into two sections. The first is a 
brief overview of existing approaches for computing convolutions and the second is an 
introduction to our approach. The intent of the first section is to impress upon the 
reader some of the difficulties and complexities associated with existing methods and, 
further, to motivate the need for approaching the problem from a fundamentally different 
angle.
1.1 Overview of existing methods
Signals are typically generated whenever things vibrate, pump, pulse, or in any 
other way change with time [19]. While such signals or waveforms in real life are 
continuous in nature they can, for pragmatic purposes, only be represented with a finite 
amount of precision. Further, while the data may be a real or complex number, it can be 
temporarily rescaled by shifting the decimal point to the right and treating the number as 
an integer. This practice of treating data as integer sequences is common [1], [8] and 
does not in any way detract from the quality of the analysis. Based on the application the 
designer may choose an appropriate word length to prevent overflow after data 
manipulation. We assume without loss of generality that data, also referred to as points 
of input and output sequences, are integers. The linear and cyclic convolution operations 
are defined as computations on two sets of integers that yield a third set of integers. 
More precisely, the linear convolution is defined as
$$c_i = \sum_{k=0}^{n-1} a_{i-k}\, b_k \quad \text{for } i = 0, 1, \ldots, n-1 \qquad (1.1)$$
and the cyclic convolution as 
$$c_i = \sum_{k=0}^{n-1} a_{\langle i-k \rangle_n}\, b_k \quad \text{for } i = 0, 1, \ldots, n-1 \qquad (1.2)$$
where the $a_i$ and $b_i$ are the input sets of data and the $c_i$ are the data of the convolved 
sequence. The notation $\langle x \rangle_m$ denotes the operation x modulo m. The number of points 
in the sequence, 'n', is also known as the block length. We note that the above 
computation requires $n^2$ multiplications. Performing the arithmetic in the above two 
equations, (1.1) and (1.2), modulo p, where p is prime, changes the entire picture. This 
is because the computations are now being carried out in the Galois field GF(p). This is 
very attractive due to the fact that the properties of the convolution theorem can be used 
[20],[21]. Before we discuss the usefulness of the convolution theorem one must note 
that if the choice of p in the above is such that the input data and the computed results 
are smaller than p then the modulo p operation is redundant.
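
The cyclic convolution of definition (1.2) can be sketched directly in a few lines. This is only an illustrative sketch: the function name `cyclic_convolution` and the multiplication counter are assumptions introduced here, not part of the dissertation. The counter simply confirms the $n^2$ multiplication count noted above.

```python
def cyclic_convolution(a, b):
    """Direct evaluation of equation (1.2).

    Returns the convolved sequence and the number of multiplications used.
    The name and the counter are illustrative assumptions.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("both sequences must have the same block length")
    c = [0] * n
    multiplications = 0
    for i in range(n):
        for k in range(n):
            # Python's % always returns a value in 0..n-1, which is <i - k>_n.
            c[i] += a[(i - k) % n] * b[k]
            multiplications += 1
    return c, multiplications


if __name__ == "__main__":
    result, count = cyclic_convolution([1, 2, 3, 4], [5, 6, 7, 8])
    print(result, count)
```

For a 4-point input the loop body runs 16 times, one multiplication per pass.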
The convolution theorem [1] enables the computation of the cyclic convolution 
of two vectors A and B by first computing the Fourier transforms of the vectors, then 
obtaining a new vector by performing a point by point multiplication of the transformed 
vectors, and then finally applying the inverse Fourier transform on the new vector, i.e. 
on the vector obtained in the transform domain. If we assume that there are n elements 
in each of the input vectors then it is easy to see from the above that only n 
multiplications are needed, the multiplications being in the transformed domain. Clearly 
the convolution theorem is useful if and only if there exist efficient ways of computing 
the Fourier and inverse Fourier transforms. The existence of such transforms is 
discussed in [20],[21].
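The discrete Fourier transform can be applied on a discrete set of points. The 
DFT maps a discrete time domain waveform x(t) into a frequency domain representation X(k) 
and an inverse discrete Fourier transform (IDFT) maps it back into the time domain. The 
transforms are symbolically represented as
$$x(t) \xrightarrow{\ \text{DFT}\ } X(k) \quad \text{and} \quad X(k) \xrightarrow{\ \text{IDFT}\ } x(t)$$
and defined by
$$X(k) = \sum_{t=0}^{n-1} x(t)\, e^{-j 2\pi k t / n}, \quad k = 0, 1, \ldots, n-1 \qquad (1.3)$$
$$x(t) = \frac{1}{n} \sum_{k=0}^{n-1} X(k)\, e^{j 2\pi k t / n}, \quad t = 0, 1, \ldots, n-1 \qquad (1.4)$$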
From equations (1.3) and (1.4) it can be seen that to obtain c(t), which is the 
cyclic convolution of a(t) and b(t), a(t) and b(t) are mapped into the frequency domain, 
represented by parameters A(k) and B(k), using n multiplications for each point for a 
total of $2n^2$ multiplications and $2n^2$ additions. Multiplication of A(k) and B(k) requires 
another n multiplications while the inverse mapping requires another $n^2$ multiplications 
and $n^2$ additions. We thus have a total of $(3n^2 + n)$ multiplications and $3n^2$ additions. 
Thus, when the DFTs are computed directly this approach is not of much practical value 
as the direct computation of the cyclic convolution itself requires only $n^2$ multiplications.
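
The transform-domain route described above can be sketched as follows. This is a hedged illustration, not an implementation from the dissertation: the names `dft` and `cyclic_convolution_via_dft` are assumptions, the transforms are evaluated directly from the standard DFT definition (no FFT), and Python's floating-point `cmath` module stands in for whatever arithmetic a hardware design would use.

```python
import cmath


def dft(x, inverse=False):
    """Direct O(n^2) evaluation of the DFT, or of the inverse DFT if inverse=True."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        # The 1/n scale factor belongs to the inverse transform only.
        out.append(s / n if inverse else s)
    return out


def cyclic_convolution_via_dft(a, b):
    """Transform both inputs, multiply point by point, transform back."""
    A = dft(a)
    B = dft(b)
    C = [Ak * Bk for Ak, Bk in zip(A, B)]  # the n transform-domain products
    return dft(C, inverse=True)


if __name__ == "__main__":
    c = cyclic_convolution_via_dft([1, 2, 3, 4], [5, 6, 7, 8])
    # The results are complex floating-point values and must be rounded
    # to recover the integer outputs.
    print([round(v.real) for v in c])
```

Note that integer inputs come back as complex floating-point values, so a rounding step is needed to recover integer results.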
However, Cooley and Tukey [6] introduced the fast Fourier transform (FFT), 
which is an efficient algorithm to compute the DFT. Other efficient FFT algorithms can 
be found in [22]-[24]. While these algorithms require only $n \log_2 n$ multiplications to
map an input sequence of length n into the frequency domain, they have two primary 
disadvantages. One is that they produce significant round off errors [25],[26] and the 
other is that they are not very well suited for VLSI implementations [27]. Although the
FFT reduces the number of multiplications required for evaluating the DFT, the count of 
multiplications in itself does not determine the computational efficiency of the algorithm. 
With the widespread use of VLSI to build application specific integrated circuits (ASIC) 
the architectural details of implementation have gained significant importance. Reference 
[28] discusses Fourier transforms in VLSI, [29] discusses architectural issues in DSP 
applications, and [30] discusses multiplier policies in DSP applications. Many 
researchers have also explored the applicability of systolic array architectures to compute 
the DFT [31]-[33]. Some of the other irritants are block lengths and wordlengths [1]. 
While none of these are insurmountable they do require certain awkward design 
choices. The DFT can also be evaluated using number theoretic transforms (NTT) 
which are defined over finite fields and rings of integers with all the arithmetic 
performed modulo an integer. An evaluation of the various NTT algorithms as applied 
to digital filtering applications can be found in [34],[35].
The main focus of the various FFT and NTT algorithms [36] has been on 
reducing the number of multiplications. However, in real time applications large 
amounts of data have to be processed in relatively small periods of time and thus apart 
from reducing the number of computations, it has also become necessary to parallelize 
the computations. With the cost of hardware reducing and the acceptance of application 
specific integrated circuits (ASICs) increasing, it has become possible to build dedicated 
systems in an economical fashion. The inherent properties of the residue number system 
(RNS) make it a viable candidate for parallel computations [37]. Implementations 
using the RNS and the quadratic residue number system (QRNS) can be found in [38]- 
[41]. More recently, the polynomial residue number system (PRNS) has been 
developed [42],[43] which combines both features, i.e. reducing the number of 
computations while simultaneously increasing the level of parallelism.
We have so far described the importance of the cyclic convolution operation in 
digital signal processing applications. In our research we propose to develop new 
algorithms for computing the cyclic convolution of two n-point sequences by 
performing all required operations in a single domain, i.e. we will not map the points 
into a frequency domain or for that matter into any other domain. To do this 
successfully we will have to use fewer than $n^2$ multiplications. To achieve this reduction 
in the multiplication count all existing research has focused on mapping strategies that 
result in a reduced number of multiplications in the mapped domain. However, our 
research effort focuses on using squaring operations instead of multiplication 
operations. The next section provides an introduction to our approach.
1.2 Our approach
The vast amount of literature in the DSP area focuses on the convolution 
operation by exploiting the properties of the DFT and the algebraic field in which it is 
applied. Most of the algorithms therefore have specific properties and perform well in 
the environments that they were designed to function in. We have looked at the problem 
from a broader perspective and our approach therefore does not rely on the DFT at 
all. Instead we have focused on the definition of cyclic convolution and have attempted 
to develop efficient algorithms centered around an elementary arithmetic function, the 
squaring operation; hence the title of the dissertation, "Squared Law Algorithms: 
Theory and Applications." Our motivation is based on the general underlying theme of 
the DFT and FFT algorithms, which has primarily been to reduce the amount of 
hardware required to perform the convolution operation. Since the operation is 
multiplication intensive, the emphasis was first to reduce the number of 
multiplications and then, with the development of integrated circuit technology, 
to improve the implementation architecture. Similarly, we attempt to
reduce the amount of hardware by exploiting the fact that the squaring operation 
requires less hardware than the multiplication operation. One must also 
keep in mind that this reduction in hardware is not at the expense of speed; on the 
contrary, the squaring operation is also faster to evaluate than the multiplication operation.
Table look-up techniques for performing the multiplication operation using 
ROMs have been researched in [44]-[50]. Of these, [46] is based on the index calculus 
technique, which can only be used with prime moduli, and [47]-[50] are based on the 
quarter squared algorithm technique. These designs offer attractive speed-complexity 
trade-offs for small word lengths, while for large word lengths the ROM size increases to 
the point where it becomes unrealistic.
Consider a simple ROM based direct implementation of the multiplication 
operation. The two input operands of length, say L bits each, serve as the address to the 
ROM. The data stored at this location is the result of the multiplication. Such a ROM 
implementation of the multiplication operation would require a ROM of size $2^{2L} \times 2L$. 
On the other hand the squaring operation would have only one input operand and a 
ROM implementing this would be of size $2^L \times 2L$. The immediate savings in ROM bits 
is apparent. The motivation is now clear. The more important question is now therefore: 
how does one replace all the multiplication operations in a given application with the 
squaring operations? We researched this problem with the convolution operation as our 
application and have developed algorithms to compute the convolution operation using 
squaring operations as opposed to multiplication operations. While we reduce the 
number of squaring operations compared to the number of multiplication operations we 
do increase the number of additions. Initial results of our research were published in 
[51]. The next natural question is this: since the squaring operation is a special case of the 
multiplication operation, which in turn is repeated addition, how does the increase
in the number of additions compare with the decrease in the number of squarings? 
These two issues are addressed in detail in this dissertation.
The rest of the dissertation is organized as follows. Chapter 2 lays the 
mathematical foundations for the algorithms to compute the cyclic convolution using 
squaring operations. It also provides formulae for the count on the number of squares 
and the number of two-operand additions. Chapter 3 discusses various implementation 
issues including both non-ROM based and ROM based implementations. In this chapter 
the addition-squaring trade-off is also analyzed in detail. Some initial results on this 
were published in [52]. Since the focus of the research was on the usefulness of the 
squaring operation, the behavior with respect to hardware costs of the computation of 
this operation in modular rings was also studied. Initial results on this were published in 
brief in [53]. Details of these results are presented in chapter 4. Finally, chapter 5 

Chapter 2 
Mathematical Foundations




In this chapter we present the mathematical basis to prove the validity of our 
algorithms. The material in this chapter has a natural flow in the sense that it is presented 
in the order in which it was developed. This chapter also defines the extensive notations 
that are used throughout this dissertation.
2.1 Background
The starting point for this research has been reference [50], which described a 
novel approach for implementing convolutions with small tables. The algorithm 
developed in that paper, titled the one over eight squared algorithm, applies the idea of 
the quarter squared algorithm to compute a two-point cyclic convolution. The method is 
briefly described below.
The quarter squared algorithm technique [47]-[50] is based on the fact that the 
product of two n-bit numbers x and y can be given as
$$xy = \tfrac{1}{4}\{(x + y)^2 - (x - y)^2\} \qquad (2.1)$$
Here look up tables can be used to compute the values of $(x + y)^2$ and $(x - y)^2$. If the 
result of the operation xy is computed directly by using a ROM then the size of the ROM 
required would be $2^{2n} \times 2n$; however, if (2.1) is used then two ROMs each of size 
$2^{n+1} \times 2(n+1)$ would be required. Thus the use of (2.1) yields a total ROM bit requirement 
of $2^{n+2} \times 2(n+1)$ bits. Clearly for n > 2 the use of (2.1) requires fewer ROM bits; 
however, there is an overhead in terms of adders. In general, it can be said that the use 
of the quarter squared technique reduces the ROM bits from the order of $2^{2n}$ to $2^n$.
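
A minimal sketch of the quarter squared technique of equation (2.1), together with the two ROM bit counts quoted above, is given here. The names are illustrative assumptions. One simplification is also an assumption: a single Python table of squares serves both look-ups, and the difference is addressed by its absolute value, whereas the text describes two separate ROMs and does not say how the signed difference is addressed.

```python
def make_square_table(n):
    """Table of squares with n+1 address bits, enough for x + y of n-bit operands."""
    return [s * s for s in range(2 ** (n + 1))]


def quarter_squared_multiply(x, y, table):
    """Equation (2.1): xy = ((x + y)^2 - (x - y)^2) / 4, using table look-ups only."""
    # (x - y)^2 == |x - y|^2, so the absolute value is a valid address (assumption).
    # The difference of the two squares equals 4xy, so the division is exact.
    return (table[x + y] - table[abs(x - y)]) // 4


def direct_rom_bits(n):
    """One ROM of size 2^(2n) x 2n."""
    return (2 ** (2 * n)) * (2 * n)


def quarter_squared_rom_bits(n):
    """Two ROMs of size 2^(n+1) x 2(n+1), i.e. 2^(n+2) x 2(n+1) bits in total."""
    return (2 ** (n + 2)) * (2 * (n + 1))


if __name__ == "__main__":
    table = make_square_table(4)
    print(quarter_squared_multiply(13, 11, table))
    for n in (2, 3, 8):
        print(n, direct_rom_bits(n), quarter_squared_rom_bits(n))
```

Evaluating the two bit-count functions for small n shows the crossover stated in the text: at n = 2 the direct ROM is still smaller, and from n = 3 onward the quarter squared form needs fewer bits.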
Now consider the problem of obtaining the cyclic convolution of two two-point 
sequences. The cyclic convolution of two sequences $A = \{a_0, a_1\}$ and $B = \{b_0, b_1\}$ is
by definition given as $C = \{c_0, c_1\}$ where $c_0 = a_0 b_0 + a_1 b_1$ and $c_1 = a_0 b_1 + a_1 b_0$.
Define [50]
$$u = a_0 + a_1 + b_0 + b_1 \qquad (2.2)$$
$$v = -a_0 + a_1 - b_0 + b_1 \qquad (2.3)$$
$$w = -a_0 - a_1 + b_0 + b_1 \qquad (2.4)$$
$$x = -a_0 + a_1 + b_0 - b_1 \qquad (2.5)$$
Then the two points of the cyclic convolution can be given as
$$c_0 = \tfrac{1}{8}(u^2 + v^2 - w^2 - x^2) \qquad (2.6)$$
$$c_1 = \tfrac{1}{8}(u^2 + x^2 - v^2 - w^2) \qquad (2.7)$$
Equations (2.6) and (2.7) constitute the one over eight squared algorithm of [50] 
and they clearly demonstrate that the cyclic convolution of two two-point sequences can 
be obtained solely by the use of additions, subtractions, and squaring operations. A 
subtraction can be simply thought of as an addition as the hardware units that perform 
subtraction and addition are approximately equal in cost. Thus hereinafter the number of 
additions will include the number of subtractions. Also (2.6) and (2.7) show that the 
term $w^2$ always appears in the negative and hence the ROM that generates $w^2$ can be 
designed to directly generate $-w^2$.
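As a quick check of equations (2.2) to (2.7), the following sketch evaluates the two-point cyclic convolution with additions, subtractions and squares only. The function name is hypothetical, and plain integer arithmetic stands in for the ROM look-ups.

```python
def one_over_eight_squared(a0, a1, b0, b1):
    """Two-point cyclic convolution from four squares, following (2.2)-(2.7)."""
    u = a0 + a1 + b0 + b1
    v = -a0 + a1 - b0 + b1
    w = -a0 - a1 + b0 + b1
    x = -a0 + a1 + b0 - b1
    # Both bracketed sums are exact multiples of 8, so integer division is exact.
    c0 = (u * u + v * v - w * w - x * x) // 8
    c1 = (u * u + x * x - v * v - w * w) // 8
    return c0, c1


print(one_over_eight_squared(3, 5, 2, 7))
```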
The one over eight squared algorithm can also be applied in modular rings, 
provided the multiplicative inverse of 8 exists in the chosen ring. In the case when the 
chosen modulus m is odd, $\langle 8^{-1} \rangle_m$ always exists, where $\langle x \rangle_m$ is read as x modulo m.
This can easily be shown as follows: when m is odd, m + 1 is even, which implies that 
(m + 1)/2 is an integer. Therefore the multiplicative inverse of 2 modulo m can be 
given as $\langle 2^{-1} \rangle_m = (m + 1)/2$, as $2(m + 1)/2 = m + 1$ and $\langle m + 1 \rangle_m = 1$ [50]. Thus since $\langle 2^{-1} \rangle_m$
always exists, $\langle 8^{-1} \rangle_m$ also always exists as $\langle 8^{-1} \rangle_m = \langle (2^{-1})^3 \rangle_m$. Similarly, the
multiplicative inverses of all numbers that are powers of 2 exist when m is odd. 
However, when m is even $\langle 2^{-1} \rangle_m$ does not exist. To see this let us assume that it did 
exist and its value is k. We then have $\langle 2k \rangle_m = 1$, which implies that 2k = mx + 1 (x is 
some integer). But this is impossible as 2k and mx are even (m is even) and the 
difference of two even numbers can never be equal to 1. We thus have a problem and 
[50] provides some theorems to account for the round-off errors caused by this 
non-existence of $\langle 2^{-1} \rangle_m$ (m even).
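The argument about inverses of powers of 2 can be illustrated with a small sketch. The helper names are hypothetical; the even-modulus case is signalled with an exception here, which is a choice made for this illustration only.

```python
def inverse_of_two(m):
    """Multiplicative inverse of 2 modulo an odd m, namely (m + 1) / 2."""
    if m % 2 == 0:
        raise ValueError("2 has no multiplicative inverse modulo an even m")
    return (m + 1) // 2


def inverse_of_power_of_two(exponent, m):
    """Inverse of 2**exponent modulo an odd m, as a power of the inverse of 2."""
    return pow(inverse_of_two(m), exponent, m)


m = 31
inv8 = inverse_of_power_of_two(3, m)
print(inv8, (8 * inv8) % m)   # the second value is the residue of 8 * inv8 modulo m
```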
2.2 Algorithm for convolution using an exponential 
number of squares
The first effort in generalizing the one over eight squared algorithm resulted in 
an algorithm for doing convolution using an exponential number of squares. Since the 
algorithm used an exponential number of squares, it is impractical from the view point 
of the cost of its hardware implementation. However, the insight gained from this 
algorithm was that it might be impossible to obtain in an efficient manner each point of 
the cyclic convolution directly as a function of a summation of squares. We next present 
the algorithm along with an example.
Algorithm 2.1
Input: The points of two n-point sequences $\{a_0, a_1, \ldots, a_{n-1}\}$ and $\{b_0, b_1, \ldots, b_{n-1}\}$.
Output: The cyclic convolution $\{c_0, c_1, \ldots, c_{n-1}\}$ of the two given input sequences. 
Method: The procedure uses only addition and squaring operations.
Procedure: Each term of the cyclic convolution is given by
$$c_p = \frac{1}{2^{n+1}} \sum_{k=0}^{2^n - 1} z_{pk}^2 (-1)^k, \qquad p = 0, 1, \ldots, n-1$$
where the $z_{pk}$'s are terms of the matrix $Z_p$. Matrix $Z_p$ is of size $2^n \times 1$ and is formed as 
follows:
1) $Z_p = X_p \times Y$ where $X_p$ is a $2^n \times 2n$ matrix whose terms are +1 or -1 and Y is the 
$2n \times 1$ transpose of $[a_0, a_1, \ldots, a_{n-1}, b_0, b_1, \ldots, b_{n-1}]$.
2) The rows and columns of matrix $X_p$ are represented by subscripts i and j
respectively. Subscript i is in the range 0 to $2^n - 1$ and subscript j is in the range 1 
to 2n.
3) The terms of matrix $X_0$ are defined by the following set of rules.
a) $x_{ij} = 1$ if either i = 0 or j = 1.





-xi-l j if X Xi=i-ij
xi - l j ot terwise
= ± r j
c) For i > 0 and n+1 < j <= 2n, let rj = 2i'n_1 
Then
X jj = x i-lj
- X :li - l j
if i<rj
if i is an integer multiple of rj 
otherwise
4) The matrices $X_p$ for p = 1, 2, ..., n-1 are obtained from matrix $X_0$ by retaining 
its columns 1 through n as they are and by rotating right the columns n+1 through 2n 
by p positions.
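Example: Suppose we wish to compute the cyclic convolution of two 3-point
sequences $\{a_0, a_1, a_2\}$ and $\{b_0, b_1, b_2\}$.
From the above algorithm matrix Y is the transpose of $[a_0, a_1, a_2, b_0, b_1, b_2]$. 
From step 3 we have
$$X_0 = \begin{bmatrix}
+1 & +1 & +1 & +1 & +1 & +1 \\
+1 & +1 & +1 & -1 & -1 & -1 \\
+1 & +1 & -1 & +1 & -1 & +1 \\
+1 & +1 & -1 & -1 & +1 & -1 \\
+1 & -1 & +1 & +1 & +1 & -1 \\
+1 & -1 & +1 & -1 & -1 & +1 \\
+1 & -1 & -1 & +1 & -1 & -1 \\
+1 & -1 & -1 & -1 & +1 & +1
\end{bmatrix}$$
Multiplying the above matrix $X_0$ with matrix Y results in $Z_0$ which is an 8 x 1 
matrix with terms $z_{00}, z_{01}, \ldots, z_{07}$.
$$Z_0 = \begin{bmatrix} z_{00} \\ z_{01} \\ z_{02} \\ z_{03} \\ z_{04} \\ z_{05} \\ z_{06} \\ z_{07} \end{bmatrix} = \begin{bmatrix}
a_0 + a_1 + a_2 + b_0 + b_1 + b_2 \\
a_0 + a_1 + a_2 - b_0 - b_1 - b_2 \\
a_0 + a_1 - a_2 + b_0 - b_1 + b_2 \\
a_0 + a_1 - a_2 - b_0 + b_1 - b_2 \\
a_0 - a_1 + a_2 + b_0 + b_1 - b_2 \\
a_0 - a_1 + a_2 - b_0 - b_1 + b_2 \\
a_0 - a_1 - a_2 + b_0 - b_1 - b_2 \\
a_0 - a_1 - a_2 - b_0 + b_1 + b_2
\end{bmatrix}$$
Thus giving
$$c_0 = \tfrac{1}{16}\left[z_{00}^2 - z_{01}^2 + z_{02}^2 - z_{03}^2 + z_{04}^2 - z_{05}^2 + z_{06}^2 - z_{07}^2\right] = a_0 b_0 + a_2 b_1 + a_1 b_2$$
Multiplying matrix $X_1$ (obtained from $X_0$ by step 4) with matrix Y gives us $Z_1$ which is an 8 x 1 matrix whose 
terms are $z_{10}, z_{11}, \ldots, z_{17}$.
$$Z_1 = \begin{bmatrix} z_{10} \\ z_{11} \\ z_{12} \\ z_{13} \\ z_{14} \\ z_{15} \\ z_{16} \\ z_{17} \end{bmatrix} = \begin{bmatrix}
a_0 + a_1 + a_2 + b_0 + b_1 + b_2 \\
a_0 + a_1 + a_2 - b_0 - b_1 - b_2 \\
a_0 + a_1 - a_2 + b_0 + b_1 - b_2 \\
a_0 + a_1 - a_2 - b_0 - b_1 + b_2 \\
a_0 - a_1 + a_2 - b_0 + b_1 + b_2 \\
a_0 - a_1 + a_2 + b_0 - b_1 - b_2 \\
a_0 - a_1 - a_2 - b_0 + b_1 - b_2 \\
a_0 - a_1 - a_2 + b_0 - b_1 + b_2
\end{bmatrix}$$
Thus giving
$$c_1 = \tfrac{1}{16}\left[z_{10}^2 - z_{11}^2 + z_{12}^2 - z_{13}^2 + z_{14}^2 - z_{15}^2 + z_{16}^2 - z_{17}^2\right] = a_1 b_0 + a_0 b_1 + a_2 b_2$$
and finally,
Multiplying matrix $X_2$ with matrix Y gives us $Z_2$ which is an 8 x 1 matrix whose 
terms are $z_{20}, z_{21}, \ldots, z_{27}$.
$$Z_2 = \begin{bmatrix} z_{20} \\ z_{21} \\ z_{22} \\ z_{23} \\ z_{24} \\ z_{25} \\ z_{26} \\ z_{27} \end{bmatrix} = \begin{bmatrix}
a_0 + a_1 + a_2 + b_0 + b_1 + b_2 \\
a_0 + a_1 + a_2 - b_0 - b_1 - b_2 \\
a_0 + a_1 - a_2 - b_0 + b_1 + b_2 \\
a_0 + a_1 - a_2 + b_0 - b_1 - b_2 \\
a_0 - a_1 + a_2 + b_0 - b_1 + b_2 \\
a_0 - a_1 + a_2 - b_0 + b_1 - b_2 \\
a_0 - a_1 - a_2 - b_0 - b_1 + b_2 \\
a_0 - a_1 - a_2 + b_0 + b_1 - b_2
\end{bmatrix}$$
Thus giving
$$c_2 = \tfrac{1}{16}\left[z_{20}^2 - z_{21}^2 + z_{22}^2 - z_{23}^2 + z_{24}^2 - z_{25}^2 + z_{26}^2 - z_{27}^2\right] = a_2 b_0 + a_1 b_1 + a_0 b_2$$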
It appears that we are multiplying two matrices for each point. However, this is 
not the case as the matrix notation is only a convenient form to represent the several 
equations that are developed for each point. Also there is no actual division involved as 
the last four bits are zero and by simply ignoring them we achieve the division by 16.
Although this algorithm is well structured from the implementation point of 
view, it relies on squaring operations in the order of $n2^n$, plus additions and 
subtractions, and thus the cost is prohibitive. Clearly this is far greater than even the $n^2$ 
multiplications required by the definition of the problem and thus no further work was 
done on this algorithm. However this motivated us to look in other directions and our 
results are presented in the next section.
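The exponential-squares idea of Algorithm 2.1 can be sketched in a few lines. The sign rule used below is an assumption: it is chosen so that, for n = 3, it reproduces the rows of $Z_0$, $Z_1$ and $Z_2$ in the example above, and it is not claimed to be the exact rule set of step 3. All function names are hypothetical.

```python
def a_sign(row, u, n):
    """Assumed sign of a_u in a given row: a_0 is always +, a_u follows bit (n - u) of row."""
    return 1 if u == 0 else 1 - 2 * ((row >> (n - u)) & 1)


def exponential_squares_convolution(a, b):
    """Each c_p as a signed sum of 2^n squares divided by 2^(n+1)."""
    n = len(a)
    result = []
    for p in range(n):
        total = 0
        for row in range(2 ** n):
            alt = 1 - 2 * (row & 1)                       # (-1)^row
            z = sum(a_sign(row, u, n) * a[u] for u in range(n))
            # b_v carries the sign of its partner a_(p - v), times (-1)^row.
            z += sum(alt * a_sign(row, (p - v) % n, n) * b[v] for v in range(n))
            total += alt * z * z
        result.append(total // 2 ** (n + 1))              # exact: low n+1 bits are zero
    return result


print(exponential_squares_convolution([1, 2, 3], [4, 5, 6]))
```

The double loop makes the $n2^n$ squaring count visible: every one of the n output points walks over all $2^n$ rows.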
2.3 Direct extension of the one over eight squared 
algorithm
We now try to extend the one over eight squared algorithm to obtain the cyclic 
convolution of two 4-point sequences. We first re-write the equations of section 2.1 in a 
nicer form as follows. Equations (2.2) - (2.5) can be re-written as 
$$u = a_0 + a_1 + b_0 + b_1 \tag{2.8}$$
$$v = a_0 + a_1 - b_0 - b_1 \tag{2.9}$$
$$w = a_0 - a_1 + b_0 - b_1 \tag{2.10}$$
$$x = a_0 - a_1 - b_0 + b_1 \tag{2.11}$$
Then the two points of the cyclic convolution can be given as
$$c_0 = \tfrac{1}{8}(u^2 - v^2 + w^2 - x^2) \tag{2.12}$$
$$c_1 = \tfrac{1}{8}(u^2 - v^2 - w^2 + x^2) \tag{2.13}$$
Now, our objective is to extend this method to obtain the cyclic convolution of 
two 4-point sequences. Let the sequences be $A = \{a_0, a_1, a_2, a_3\}$ and $B = \{b_0, b_1, b_2, b_3\}$ 
and by definition, the cyclic convolution of the two sequences is given as 
$C = \{c_0, c_1, c_2, c_3\}$ where
$$c_0 = a_0 b_0 + a_3 b_1 + a_2 b_2 + a_1 b_3 \tag{2.14}$$
$$c_1 = a_1 b_0 + a_0 b_1 + a_3 b_2 + a_2 b_3 \tag{2.15}$$
$$c_2 = a_2 b_0 + a_1 b_1 + a_0 b_2 + a_3 b_3 \tag{2.16}$$
$$c_3 = a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3 \tag{2.17}$$
Now, re-defining equations u through x based on the pattern of terms and signs 
in equations (2.8) - (2.11), we have
$$u = a_0 + a_1 + a_2 + a_3 + b_0 + b_1 + b_2 + b_3 \tag{2.18}$$
$$v = a_0 + a_1 + a_2 + a_3 - b_0 - b_1 - b_2 - b_3 \tag{2.19}$$
$$w = a_0 - a_1 + a_2 - a_3 + b_0 - b_1 + b_2 - b_3 \tag{2.20}$$
$$x = a_0 - a_1 + a_2 - a_3 - b_0 + b_1 - b_2 + b_3 \tag{2.21}$$
Then we find that $\tfrac{1}{8}(u^2 - v^2 + w^2 - x^2) = c_0 + c_2$ and $\tfrac{1}{8}(u^2 - v^2 - w^2 + x^2) = c_1 + c_3$. 
This gives us the notion that for any two n-point sequences if equations are
built on lines similar to that of equations (2.8) - (2.11) and then plugged into equations 
of type (2.12) and (2.13) we can obtain $8 \sum c_{2i}$ and $8 \sum c_{2i+1}$. In the next section we 
reinforce this notion by providing four theorems on the summation of the even $c_i$ and 
odd $c_i$ taken separately. Going back to the cyclic convolution of the two 4-point 
sequences we observe that if we have the quantities $c_0 - c_2$ and $c_1 - c_3$ then they can be 
added to and subtracted from $c_0 + c_2$ and $c_1 + c_3$ to obtain the individual points of the cyclic
convolution. This gives us the indication that we have to have formulae like equations 
(2.8) - (2.11) that will generate the difference of $c_0$ and $c_2$, and of $c_1$ and $c_3$. In general, 
we would need formulae to generate the sum of the even $c_i$ with alternating terms having 
negative signs and similarly the sum of the odd $c_i$ with the alternating terms having 
negative signs. In the next section we provide two theorems that can be applied in 
general to n-point sequences.
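The observation that the 4-point versions of u, v, w and x yield only $c_0 + c_2$ and $c_1 + c_3$ can be checked numerically. The sketch below uses a hypothetical function name and compares against the definitions of $c_0$ through $c_3$ written out directly.

```python
def four_point_parity_sums(a, b):
    """Return (c0 + c2, c1 + c3) from the 4-point u, v, w, x and four squares."""
    sum_a, sum_b = sum(a), sum(b)
    alt_a = a[0] - a[1] + a[2] - a[3]
    alt_b = b[0] - b[1] + b[2] - b[3]
    u, v = sum_a + sum_b, sum_a - sum_b
    w, x = alt_a + alt_b, alt_a - alt_b
    even_sum = (u * u - v * v + w * w - x * x) // 8
    odd_sum = (u * u - v * v - w * w + x * x) // 8
    return even_sum, odd_sum


a, b = [1, 2, 3, 4], [5, 6, 7, 8]
c0 = a[0] * b[0] + a[3] * b[1] + a[2] * b[2] + a[1] * b[3]
c1 = a[1] * b[0] + a[0] * b[1] + a[3] * b[2] + a[2] * b[3]
c2 = a[2] * b[0] + a[1] * b[1] + a[0] * b[2] + a[3] * b[3]
c3 = a[3] * b[0] + a[2] * b[1] + a[1] * b[2] + a[0] * b[3]
print(four_point_parity_sums(a, b), (c0 + c2, c1 + c3))
```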
It is easy to see that if this method is extended further, say for two 8-point 
sequences, then the above promised extensions will yield $\sum c_{2i}$ and $\sum c_{2i}(-1)^i$; however, 
this would be inadequate. This is because the sum of $\sum c_{2i}$ and $\sum c_{2i}(-1)^i$ would give us 
a multiple of $c_0 + c_4$ and the difference a multiple of $c_2 + c_6$. (Similarly, for the odd points 
we would obtain $c_1 + c_5$ and $c_3 + c_7$.) Thus we would need to obtain $c_0 - c_4$, $c_2 - c_6$, $c_1 - c_5$, and $c_3 - c_7$. In 
other words we would need to obtain the alternating sum and difference of $c_i$ where the 
indices of two consecutive $c_i$ differ by 2, 4, ..., n/2. In later 
discussions we let j denote this difference between consecutive $c_i$. Also, we observe that
such a methodology of adding and subtracting different combinations of $c_i$ will be valid 
only when n is a power of 2.
We divide the methodology of computing the cyclic convolution of two n-point 
sequences based on squaring and addition operations into two parts: part 1 comprises 
the computation of $\sum c_{2i}$ and $\sum c_{2i+1}$ while part 2 that of $\sum c_{i+kj}(-1)^k$ for all 0 <= i,
j < n/2, k = 0, 1, 2, ..., (n/j - 1). (We note that j is a power of 2, >= 2). In section 2.4 
we present eight theorems and proofs that are required for our complete methodology of 
computing the cyclic convolution of two n-point sequences. Following that, in section
2.5 we compare and contrast three methods by which we can compute part 1 of the 
methodology. In section 2.6 we present a formula for the number of squares required 
for any given n while in section 2.7 we present a formula for the total number of 
additions required by our methodology. We conclude the chapter by presenting a 
full-blown example of computing the cyclic convolution of two 16-point sequences.
2.4 Squared law theorems for cyclic convolutions
In section 2.2 we have shown that trying to obtain each point of the cyclic 
convolution directly as a function of a summation of squares required a total of $n2^n$ 
squaring operations and thus was not of any practical use. In this section we present 
eight theorems that we have developed for computing the sum and difference of the even 
points taken together and the odd points taken together. Theorems 2.1 and 2.2 are taken 
from our paper titled "New Multipliers Modulo $2^N - 1$" [51]. They are presented here for 
both the sake of completeness and consistent notation.
Consider two sequences each of length n points given as $A = \{a_0, a_1, \ldots, a_{n-1}\}$ 
and $B = \{b_0, b_1, \ldots, b_{n-1}\}$. Then the cyclic convolution between these two sequences 
can be given as an n-point sequence $C = \{c_0, c_1, \ldots, c_{n-1}\}$ with each $c_i$ defined by
$$c_i = \sum_{k=0}^{n-1} a_{\langle i-k \rangle_n} b_k \quad \text{for } i = 0, 1, \ldots, n-1 \tag{2.22}$$
In the above and in the rest of the dissertation $\langle x \rangle_m$ denotes the operation x modulo m 
between integers. With reference to the above computation the following theorems 
apply.
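Definition (2.22) translates directly into code, which gives a reference against which the squared-law identities can be compared. The function name is hypothetical and the sketch uses $n^2$ ordinary multiplications.

```python
def cyclic_convolution(a, b):
    """Direct evaluation of (2.22): c_i is the sum over k of a[(i - k) mod n] * b[k]."""
    if len(a) != len(b):
        raise ValueError("both sequences must have the same length n")
    n = len(a)
    return [sum(a[(i - k) % n] * b[k] for k in range(n)) for i in range(n)]


print(cyclic_convolution([1, 2, 3], [4, 5, 6]))   # → [31, 31, 28]
```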
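Theorem 2.1: Assume that n is even and define
$$w_{n1} = a_0 + a_1 + \ldots + a_{n-1} + b_0 + b_1 + \ldots + b_{n-1} \tag{2.23}$$
$$w_{n2} = a_0 + a_1 + \ldots + a_{n-1} - b_0 - b_1 - \ldots - b_{n-1} \tag{2.24}$$
$$w_{n3} = a_0 - a_1 + \ldots - a_{n-1} + b_0 - b_1 + \ldots - b_{n-1} \tag{2.25}$$
$$w_{n4} = a_0 - a_1 + \ldots - a_{n-1} - b_0 + b_1 - \ldots + b_{n-1} \tag{2.26}$$
Then
$$w_{n1}^2 - w_{n2}^2 + w_{n3}^2 - w_{n4}^2 = 8 \sum_{i=0}^{n/2-1} c_{2i} = 8(c_0 + c_2 + \ldots + c_{n-2}) \tag{2.27}$$
Proof:
$$\begin{aligned}
w_{n1}^2 - w_{n2}^2 + w_{n3}^2 - w_{n4}^2 &= (w_{n1} + w_{n2})(w_{n1} - w_{n2}) + (w_{n3} + w_{n4})(w_{n3} - w_{n4}) \\
&= 4 \sum_{i=0}^{n-1} a_i \sum_{i=0}^{n-1} b_i + 4 \Big( \sum_{i=0}^{n/2-1} a_{2i} - \sum_{i=0}^{n/2-1} a_{2i+1} \Big) \Big( \sum_{i=0}^{n/2-1} b_{2i} - \sum_{i=0}^{n/2-1} b_{2i+1} \Big) \\
&= 4 \Big[ \Big( \sum_{i=0}^{n/2-1} a_{2i} + \sum_{i=0}^{n/2-1} a_{2i+1} \Big) \Big( \sum_{i=0}^{n/2-1} b_{2i} + \sum_{i=0}^{n/2-1} b_{2i+1} \Big) + \Big( \sum_{i=0}^{n/2-1} a_{2i} - \sum_{i=0}^{n/2-1} a_{2i+1} \Big) \Big( \sum_{i=0}^{n/2-1} b_{2i} - \sum_{i=0}^{n/2-1} b_{2i+1} \Big) \Big] \\
&= 8 \sum_{i=0}^{n/2-1} a_{2i} \sum_{i=0}^{n/2-1} b_{2i} + 8 \sum_{i=0}^{n/2-1} a_{2i+1} \sum_{i=0}^{n/2-1} b_{2i+1} \\
&= 8 \Big[ \sum_{\text{all possible products}} a_{even} b_{even} + \sum_{\text{all possible products}} a_{odd} b_{odd} \Big] \\
&= 8 \sum_{i=0}^{n/2-1} c_{2i}
\end{aligned}$$
and the proof of (2.27) is completed.
Note from equation (2.22) that $c_{2i} = \sum a_v b_w$ such that $\langle v + w \rangle_n = 2i$ or v + w =
nx + 2i. But since n is even it implies that v + w is even and therefore v and w are either 
both even or both odd. This justifies the last step of the proof.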
Theorem 2.2: Assume that n is even and define $w_{n1}$, $w_{n2}$, $w_{n3}$, and $w_{n4}$ as
given in theorem 2.1.
Then
$$w_{n1}^2 - w_{n2}^2 - w_{n3}^2 + w_{n4}^2 = 8 \sum_{i=0}^{n/2-1} c_{2i+1} = 8(c_1 + c_3 + \ldots + c_{n-1}) \tag{2.28}$$
Proof: The proof is similar to that of (2.27) and is thus omitted.
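Theorem 2.3: Assume that n is even and define
$$x_{n1} = a_0 + a_2 + \ldots + a_{n-2} + b_0 + b_2 + \ldots + b_{n-2} \tag{2.29}$$
$$x_{n2} = a_0 + a_2 + \ldots + a_{n-2} - b_0 - b_2 - \ldots - b_{n-2} \tag{2.30}$$
$$x_{n3} = a_1 + a_3 + \ldots + a_{n-1} + b_1 + b_3 + \ldots + b_{n-1} \tag{2.31}$$
$$x_{n4} = a_1 + a_3 + \ldots + a_{n-1} - b_1 - b_3 - \ldots - b_{n-1} \tag{2.32}$$
Then
$$x_{n1}^2 - x_{n2}^2 + x_{n3}^2 - x_{n4}^2 = 4 \sum_{i=0}^{n/2-1} c_{2i} = 4(c_0 + c_2 + \ldots + c_{n-2}) \tag{2.33}$$
Proof:
$$\begin{aligned}
x_{n1}^2 - x_{n2}^2 + x_{n3}^2 - x_{n4}^2 &= (x_{n1} + x_{n2})(x_{n1} - x_{n2}) + (x_{n3} + x_{n4})(x_{n3} - x_{n4}) \\
&= 4 \sum_{i=0}^{n/2-1} a_{2i} \sum_{i=0}^{n/2-1} b_{2i} + 4 \sum_{i=0}^{n/2-1} a_{2i+1} \sum_{i=0}^{n/2-1} b_{2i+1} \\
&= 4 \Big[ \sum_{\text{all possible products}} a_{even} b_{even} + \sum_{\text{all possible products}} a_{odd} b_{odd} \Big] \\
&= 4 \sum_{i=0}^{n/2-1} c_{2i}
\end{aligned}$$
and the proof of (2.33) is completed.
Note again from the definition of cyclic convolution that $c_{2i} = \sum a_v b_w$ such that
$\langle v + w \rangle_n = 2i$ or v + w = nx + 2i. But since n is even it implies that v + w is even and 
therefore v and w are either both even or both odd. This justifies the last step of the 
proof.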
Theorem 2.4: Assume that n is even and define
$$y_{n1} = a_0 + a_2 + \ldots + a_{n-2} + b_1 + b_3 + \ldots + b_{n-1} \tag{2.34}$$
$$y_{n2} = a_0 + a_2 + \ldots + a_{n-2} - b_1 - b_3 - \ldots - b_{n-1} \tag{2.35}$$
$$y_{n3} = a_1 + a_3 + \ldots + a_{n-1} + b_0 + b_2 + \ldots + b_{n-2} \tag{2.36}$$
$$y_{n4} = a_1 + a_3 + \ldots + a_{n-1} - b_0 - b_2 - \ldots - b_{n-2} \tag{2.37}$$
Then
$$y_{n1}^2 - y_{n2}^2 + y_{n3}^2 - y_{n4}^2 = 4 \sum_{i=0}^{n/2-1} c_{2i+1} = 4(c_1 + c_3 + \ldots + c_{n-1}) \tag{2.38}$$
Proof: The proof is similar to that of (2.33) and is thus omitted.
Theorems 2.1 and 2.3 both compute $\sum c_{2i}$; however, theorem 2.1 is more useful for 
computing the cyclic convolution in modular rings while theorem 2.3 is more hardware
efficient for computing the full precision cyclic convolution. The same can be said of 
theorems 2.2 and 2.4. The advantages are described in detail in section 2.5.
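Theorems 2.1 to 2.4 can be exercised side by side. In the sketch below the function names are hypothetical; the first function uses four squares over all 2n points and the second uses eight squares over n points each, matching the two ways of obtaining the even and odd sums.

```python
def parity_sums_four_squares(a, b):
    """Theorems 2.1 and 2.2: (8 * sum of even c_i, 8 * sum of odd c_i); n must be even."""
    alt_a = sum(a[0::2]) - sum(a[1::2])
    alt_b = sum(b[0::2]) - sum(b[1::2])
    w1, w2 = sum(a) + sum(b), sum(a) - sum(b)
    w3, w4 = alt_a + alt_b, alt_a - alt_b
    return (w1 * w1 - w2 * w2 + w3 * w3 - w4 * w4,
            w1 * w1 - w2 * w2 - w3 * w3 + w4 * w4)


def parity_sums_eight_squares(a, b):
    """Theorems 2.3 and 2.4: (4 * sum of even c_i, 4 * sum of odd c_i); n must be even."""
    ae, ao = sum(a[0::2]), sum(a[1::2])
    be, bo = sum(b[0::2]), sum(b[1::2])
    x = [ae + be, ae - be, ao + bo, ao - bo]
    y = [ae + bo, ae - bo, ao + be, ao - be]
    even = x[0] ** 2 - x[1] ** 2 + x[2] ** 2 - x[3] ** 2
    odd = y[0] ** 2 - y[1] ** 2 + y[2] ** 2 - y[3] ** 2
    return even, odd


a, b = [1, 2, 3, 4, 5, 6], [6, -5, 4, -3, 2, -1]
print(parity_sums_four_squares(a, b), parity_sums_eight_squares(a, b))
```

The first pair of results should be exactly twice the second, since (2.27) and (2.28) carry a factor of 8 while (2.33) and (2.38) carry a factor of 4.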
Theorem 2.5: Assume that n > 2, $n = 2^p$ (=> n = 4m), and define
$$z_{n1} = a_0 - a_2 + a_4 - a_6 + \ldots - a_{n-2} + b_0 - b_2 + b_4 - b_6 + \ldots - b_{n-2} \tag{2.39}$$
$$z_{n2} = a_0 - a_2 + a_4 - a_6 + \ldots - a_{n-2} - b_0 + b_2 - b_4 + b_6 - \ldots + b_{n-2} \tag{2.40}$$
$$z_{n3} = a_1 - a_3 + a_5 - a_7 + \ldots - a_{n-1} + b_1 - b_3 + b_5 - b_7 + \ldots - b_{n-1} \tag{2.41}$$
$$z_{n4} = a_1 - a_3 + a_5 - a_7 + \ldots - a_{n-1} - b_1 + b_3 - b_5 + b_7 - \ldots + b_{n-1} \tag{2.42}$$
Then
$$z_{n1}^2 - z_{n2}^2 - z_{n3}^2 + z_{n4}^2 = 4(c_0 - c_2 + c_4 - c_6 + \ldots - c_{n-2}) \tag{2.43}$$
Proof:
$$\begin{aligned}
z_{n1}^2 - z_{n2}^2 - z_{n3}^2 + z_{n4}^2 &= (z_{n1} + z_{n2})(z_{n1} - z_{n2}) - (z_{n3} + z_{n4})(z_{n3} - z_{n4}) \\
&= 4[(a_0 - \ldots - a_{n-2})(b_0 - \ldots - b_{n-2}) - (a_1 - \ldots - a_{n-1})(b_1 - \ldots - b_{n-1})] \\
&= 4[(a_0 + a_4 + a_8 + \ldots + a_{n-4}) - (a_2 + a_6 + a_{10} + \ldots + a_{n-2})] \times [(b_0 + b_4 + b_8 + \ldots + b_{n-4}) - (b_2 + b_6 + b_{10} + \ldots + b_{n-2})] \\
&\quad - 4[(a_1 + a_5 + a_9 + \ldots + a_{n-3}) - (a_3 + a_7 + a_{11} + \ldots + a_{n-1})] \times [(b_1 + b_5 + b_9 + \ldots + b_{n-3}) - (b_3 + b_7 + b_{11} + \ldots + b_{n-1})] \\
&= 4\Big[\Big(\sum_{k=0}^{m-1} a_{4k}\Big)\Big(\sum_{\lambda=0}^{m-1} b_{4\lambda}\Big) + \Big(\sum_{k=0}^{m-1} a_{4k+2}\Big)\Big(\sum_{\lambda=0}^{m-1} b_{4\lambda+2}\Big)\Big] \\
&\quad + 4\Big[\Big(\sum_{k=0}^{m-1} a_{4k+1}\Big)\Big(\sum_{\lambda=0}^{m-1} b_{4\lambda+3}\Big) + \Big(\sum_{k=0}^{m-1} a_{4k+3}\Big)\Big(\sum_{\lambda=0}^{m-1} b_{4\lambda+1}\Big)\Big] \\
&\quad - 4\Big[\Big(\sum_{k=0}^{m-1} a_{4k}\Big)\Big(\sum_{\lambda=0}^{m-1} b_{4\lambda+2}\Big) + \Big(\sum_{k=0}^{m-1} a_{4k+2}\Big)\Big(\sum_{\lambda=0}^{m-1} b_{4\lambda}\Big)\Big] \\
&\quad - 4\Big[\Big(\sum_{k=0}^{m-1} a_{4k+1}\Big)\Big(\sum_{\lambda=0}^{m-1} b_{4\lambda+1}\Big) + \Big(\sum_{k=0}^{m-1} a_{4k+3}\Big)\Big(\sum_{\lambda=0}^{m-1} b_{4\lambda+3}\Big)\Big] \\
&= 4(c_0 - c_2 + c_4 - c_6 + \ldots - c_{n-2})
\end{aligned}$$
and the proof of (2.43) is completed.
Note that for the positive terms in the above expression the sum of the indices 
modulo n of the a terms and b terms is always a multiple of four, thus the sum of all 
positive terms will be $4(c_0 + c_4 + c_8 + \ldots + c_{n-4})$. Similarly for the negative terms, the 
sum of the indices is a ((multiple of 4) + 2), thus the sum of all negative terms will be 
$4(c_2 + c_6 + c_{10} + \ldots + c_{n-2})$. This justifies the last step of the proof. Also observe that the 
theorem holds good for any n that is a multiple of 4, i.e. not necessarily a power of 2. 
However, since our main algorithm that describes our methodology requires n to be a 
power of 2, we focus only on such cases.
Theorem 2.6: Assume that n > 2, $n = 2^p$ (=> n = 4m), and define
$$z_{n5} = a_0 - a_2 + a_4 - a_6 + \ldots - a_{n-2} + b_1 - b_3 + b_5 - b_7 + \ldots - b_{n-1} \tag{2.44}$$
$$z_{n6} = a_0 - a_2 + a_4 - a_6 + \ldots - a_{n-2} - b_1 + b_3 - b_5 + b_7 - \ldots + b_{n-1} \tag{2.45}$$
$$z_{n7} = a_1 - a_3 + a_5 - a_7 + \ldots - a_{n-1} + b_0 - b_2 + b_4 - b_6 + \ldots - b_{n-2} \tag{2.46}$$
$$z_{n8} = a_1 - a_3 + a_5 - a_7 + \ldots - a_{n-1} - b_0 + b_2 - b_4 + b_6 - \ldots + b_{n-2} \tag{2.47}$$
Then
$$z_{n5}^2 - z_{n6}^2 + z_{n7}^2 - z_{n8}^2 = 4(c_1 - c_3 + c_5 - c_7 + \ldots - c_{n-1}) \tag{2.48}$$
Proof: The proof is similar to that of (2.43) and is thus omitted.
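Theorem 2.7: Assume that n is even and define
$$w_{n5} = a_0 - a_1 + a_2 - a_3 + \ldots - a_{n-1} + b_0 - b_1 + b_2 - b_3 + \ldots - b_{n-1} \tag{2.49}$$
$$w_{n6} = -a_0 + a_1 - a_2 + a_3 - \ldots + a_{n-1} + b_0 - b_1 + b_2 - b_3 + \ldots - b_{n-1} \tag{2.50}$$
Then
$$w_{n5}^2 - w_{n6}^2 = 4 \sum_{i=0}^{n-1} c_i (-1)^i \tag{2.51}$$
Proof:
$$\begin{aligned}
w_{n5}^2 - w_{n6}^2 &= (w_{n5} + w_{n6})(w_{n5} - w_{n6}) \\
&= 2\Big(\sum_{i=0}^{n/2-1} b_{2i} - \sum_{i=0}^{n/2-1} b_{2i+1}\Big) \, 2\Big(\sum_{i=0}^{n/2-1} a_{2i} - \sum_{i=0}^{n/2-1} a_{2i+1}\Big) \\
&= 4\Big[\Big(\sum_{i=0}^{n/2-1} a_{2i} \sum_{i=0}^{n/2-1} b_{2i} + \sum_{i=0}^{n/2-1} a_{2i+1} \sum_{i=0}^{n/2-1} b_{2i+1}\Big) - \Big(\sum_{i=0}^{n/2-1} a_{2i} \sum_{i=0}^{n/2-1} b_{2i+1} + \sum_{i=0}^{n/2-1} a_{2i+1} \sum_{i=0}^{n/2-1} b_{2i}\Big)\Big] \\
&= 4\Big[\Big(\sum_{\text{all possible products}} a_{even} b_{even} + \sum_{\text{all possible products}} a_{odd} b_{odd}\Big) - \Big(\sum_{\text{all possible products}} a_{even} b_{odd} + \sum_{\text{all possible products}} a_{odd} b_{even}\Big)\Big] \\
&= 4\Big(\sum_{i=0}^{n/2-1} c_{2i} - \sum_{i=0}^{n/2-1} c_{2i+1}\Big) \\
&= 4 \sum_{i=0}^{n-1} c_i (-1)^i
\end{aligned}$$
and the proof of (2.51) is completed.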
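The alternating-sign identities (2.43), (2.48) and (2.51) can be checked with a short sketch. The function names are hypothetical; each function returns the left-hand side of its theorem, which can then be compared with four times the corresponding alternating sum of a directly computed convolution.

```python
def alternating_sum(values):
    """v0 - v1 + v2 - v3 + ..."""
    return sum(v if i % 2 == 0 else -v for i, v in enumerate(values))


def theorem_2_5(a, b):
    """Left side of (2.43); len(a) must be a multiple of 4."""
    ae, ao = alternating_sum(a[0::2]), alternating_sum(a[1::2])
    be, bo = alternating_sum(b[0::2]), alternating_sum(b[1::2])
    z1, z2, z3, z4 = ae + be, ae - be, ao + bo, ao - bo
    return z1 * z1 - z2 * z2 - z3 * z3 + z4 * z4


def theorem_2_6(a, b):
    """Left side of (2.48); len(a) must be a multiple of 4."""
    ae, ao = alternating_sum(a[0::2]), alternating_sum(a[1::2])
    be, bo = alternating_sum(b[0::2]), alternating_sum(b[1::2])
    z5, z6, z7, z8 = ae + bo, ae - bo, ao + be, ao - be
    return z5 * z5 - z6 * z6 + z7 * z7 - z8 * z8


def theorem_2_7(a, b):
    """Left side of (2.51); len(a) must be even."""
    w5 = alternating_sum(a) + alternating_sum(b)
    w6 = -alternating_sum(a) + alternating_sum(b)
    return w5 * w5 - w6 * w6


a, b = [1, 2, 3, 4, 5, 6, 7, 8], [8, -7, 6, -5, 4, -3, 2, -1]
print(theorem_2_5(a, b), theorem_2_6(a, b), theorem_2_7(a, b))
```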
Theorem 2.5 generates the sum (or difference) of the even points of the cyclic 
convolution with the alternating points having negative signs. We need to generalize this 
theorem so that it is possible for us to automate the generation of the sum (or difference) 
of the even points where the indices of the even points differ by powers of two, i.e. we 
need a generalized theorem to generate $(c_0 - c_2 + c_4 - \ldots - c_{n-2})$, $(c_0 - c_4 + c_8 - \ldots - c_{n-4})$ 
and $(c_2 - c_6 + c_{10} - \ldots - c_{n-2})$, $(c_0 - c_8 + c_{16} - \ldots - c_{n-8})$, $(c_2 - c_{10} + c_{18} - \ldots - c_{n-6})$, 
$(c_4 - c_{12} + c_{20} - \ldots - c_{n-4})$, and $(c_6 - c_{14} + c_{22} - \ldots - c_{n-2})$, and so on. We would also need a 
generalized theorem corresponding to theorem 2.6, i.e. for the automatic generation of 
the odd points. We first present an algorithm to generate the equations that the 
generalized theorem will use. We then present the generalized theorem and its associated 
proof. We show that the same theorem can also be used for odd points.
Algorithm 2.2
Input: The points of two n-point sequences $\{a_0, a_1, \ldots, a_{n-1}\}$ and $\{b_0, b_1, \ldots, b_{n-1}\}$
and j, where j is the difference of the indices of two consecutive $c_i$ with 
$j = 2 \cdot 2^k$; k = 0, 1, ..., $(\log_2(n/2) - 1)$.
Output: $4 \sum_{k=0}^{n/j-1} c_{i+kj} (-1)^k$
Procedure:
Step 1: Initialization step.
k = 1, q = 0, p = i.
(Obtain equations of type $V_{nk}$)
Step 2: $V_{nk} = a_p - a_{\langle p+j \rangle_n} + a_{\langle p+2j \rangle_n} - \ldots - a_{\langle p-j \rangle_n} + b_q - b_{\langle q+j \rangle_n} + b_{\langle q+2j \rangle_n} - \ldots - b_{\langle q-j \rangle_n}$;
Obtain $V_{n(k+1)}$ by reversing signs of b terms in $V_{nk}$;
k <- k + 2; 
p <- $\langle p - 1 \rangle_n$;
q <- q + 1;
If k <= 2j - 1, go to step 2.
(Note that $\langle p + q \rangle_n = i$)
Step 3: $\sum_{k=1}^{2j} (V_{nk})^2 (-1)^{k+1} = 4 \sum_{k=0}^{n/j-1} c_{i+kj} (-1)^k$
Step 4: i <- i + 2;
If i <= j - 1, then k = 1; q = 0; p = i; and go to step 2
else STOP.
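Theorem 2.8: Assume that n is greater than 2, a power of 2, and for a given value
of i, j, and n define $V_{nk}$ based on steps 1 and 2 of algorithm 2.2.
Then
$$\sum_{k=1}^{2j} (V_{nk})^2 (-1)^{k+1} = 4 \sum_{k=0}^{n/j-1} c_{i+kj} (-1)^k \tag{2.52}$$
Proof:
$$\begin{aligned}
\sum_{k=1}^{2j} (V_{nk})^2 (-1)^{k+1} &= \sum (V_{nk} + V_{n(k+1)})(V_{nk} - V_{n(k+1)}) \quad \forall\, k \text{ odd and } < 2j \\
&= 4(a_{\langle p \rangle_n} - a_{\langle p+j \rangle_n} + \ldots - a_{\langle p-j \rangle_n})(b_{\langle q \rangle_n} - b_{\langle q+j \rangle_n} + \ldots - b_{\langle q-j \rangle_n}) \\
&\quad + 4(a_{\langle p-1 \rangle_n} - a_{\langle p-1+j \rangle_n} + \ldots - a_{\langle p-1-j \rangle_n})(b_{\langle q+1 \rangle_n} - b_{\langle q+1+j \rangle_n} + \ldots - b_{\langle q+1-j \rangle_n}) \\
&\quad + \ldots \\
&\quad + 4(a_{\langle p-j+1 \rangle_n} - a_{\langle p+1 \rangle_n} + \ldots - a_{\langle p+1-2j \rangle_n})(b_{\langle q+j-1 \rangle_n} - b_{\langle q+2j-1 \rangle_n} + \ldots - b_{\langle q-1 \rangle_n}) \\
&= 4\Big(\sum_{l=0}^{n/2j-1} a_{p+2lj} - \sum_{l=0}^{n/2j-1} a_{p+(2l+1)j}\Big)\Big(\sum_{l=0}^{n/2j-1} b_{q+2lj} - \sum_{l=0}^{n/2j-1} b_{q+(2l+1)j}\Big) \\
&\quad + \ldots \\
&\quad + 4\Big(\sum_{l=0}^{n/2j-1} a_{p-j+1+2lj} - \sum_{l=0}^{n/2j-1} a_{p-j+1+(2l+1)j}\Big)\Big(\sum_{l=0}^{n/2j-1} b_{q+j-1+2lj} - \sum_{l=0}^{n/2j-1} b_{q+j-1+(2l+1)j}\Big) \\
&= 4\Big(\sum_{l=0}^{n/2j-1} a_{p+2lj} \sum_{l=0}^{n/2j-1} b_{q+2lj} + \sum_{l=0}^{n/2j-1} a_{p+(2l+1)j} \sum_{l=0}^{n/2j-1} b_{q+(2l+1)j}\Big) \\
&\quad + \ldots \\
&\quad + 4\Big(\sum_{l=0}^{n/2j-1} a_{p-j+1+2lj} \sum_{l=0}^{n/2j-1} b_{q+j-1+2lj} + \sum_{l=0}^{n/2j-1} a_{p-j+1+(2l+1)j} \sum_{l=0}^{n/2j-1} b_{q+j-1+(2l+1)j}\Big) \\
&\quad - 4\Big(\sum_{l=0}^{n/2j-1} a_{p+(2l+1)j} \sum_{l=0}^{n/2j-1} b_{q+2lj} + \sum_{l=0}^{n/2j-1} a_{p+2lj} \sum_{l=0}^{n/2j-1} b_{q+(2l+1)j}\Big) \\
&\quad - \ldots \\
&\quad - 4\Big(\sum_{l=0}^{n/2j-1} a_{p-j+1+(2l+1)j} \sum_{l=0}^{n/2j-1} b_{q+j-1+2lj} + \sum_{l=0}^{n/2j-1} a_{p-j+1+2lj} \sum_{l=0}^{n/2j-1} b_{q+j-1+(2l+1)j}\Big) \\
&= 4\Big(\sum_{l=0}^{n/2j-1} c_{i+2lj} - \sum_{l=0}^{n/2j-1} c_{i+(2l+1)j}\Big) \\
&= 4 \sum_{k=0}^{n/j-1} c_{i+kj} (-1)^k
\end{aligned}$$
and the proof of (2.52) is completed.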
Notice, from the last but third step, that in all the positive terms the sum of the
indices of a and b is of the form p + q + 2lj and since $\langle p + q \rangle_n = i$ (from step 2 of algorithm
2.2), the sum of all positive terms reduces to $4 \sum_{l=0}^{n/2j-1} c_{i+2lj}$. A similar argument can be
made for the negative terms. Also, note that in the initialization step of algorithm 2.2, i 
takes a value passed by theorem 2.8 as opposed to being equal to zero. Other parts of 
the initialization step remain unchanged. Thus the theorem can be used for odd and even 
points alike.
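Steps 1 to 3 of algorithm 2.2, and hence the left-hand side of (2.52), can be sketched as below. The function names are hypothetical; the loop variable plays the role of the odd values of k, and p and q are updated as in step 2.

```python
def alternating_strided_sum(seq, start, stride):
    """seq[start] - seq[start + stride] + seq[start + 2*stride] - ..., indices modulo n."""
    n = len(seq)
    return sum((-1) ** l * seq[(start + l * stride) % n] for l in range(n // stride))


def theorem_2_8(a, b, i, j):
    """Sum of (V_nk)^2 (-1)^(k+1) for k = 1..2j, using 2j squares."""
    n = len(a)
    total = 0
    p, q = i, 0                                   # step 1
    for _ in range(j):                            # k = 1, 3, ..., 2j - 1
        a_part = alternating_strided_sum(a, p, j)
        b_part = alternating_strided_sum(b, q, j)
        v_odd = a_part + b_part                   # V_nk
        v_even = a_part - b_part                  # V_n(k+1): signs of b terms reversed
        total += v_odd * v_odd - v_even * v_even
        p = (p - 1) % n
        q = q + 1
    return total


a = list(range(1, 17))
b = [(-1) ** t * (t + 3) for t in range(16)]
print(theorem_2_8(a, b, 0, 4))    # four times (c0 - c4 + c8 - c12) for these inputs
```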
Algorithm 2.3
Input: The points of two n-point sequences $\{a_0, a_1, \ldots, a_{n-1}\}$ and $\{b_0, b_1, \ldots, b_{n-1}\}$.
Output: The cyclic convolution $\{c_0, c_1, \ldots, c_{n-1}\}$ of the two given input sequences. 
Method: The procedure uses only addition and squaring operations.
Procedure:
Step 1: r = 1, i = 0.
Do theorem 2.3. Set result to $X^i$.
We thus obtain $X^i = 4 \sum_{k=0}^{n/2-1} c_{2k}$
Step 2: $j = 2^r$.
Do theorem 2.8. Set result to Z.
We thus obtain $Z = 4 \sum_{k=0}^{n/j-1} c_{i+kj} (-1)^k$
Step 3: $X^i \leftarrow X^i + Z \cdot 2^{r-1}$; $X^{i+j} \leftarrow X^i - Z \cdot 2^{r-1}$
(both right-hand sides use the value of $X^i$ before the update).
If j = n/2, then set $c_i = X^i / (4 \cdot 2^r)$, $c_{i+j} = X^{i+j} / (4 \cdot 2^r)$, and go to step 5.
Step 4: If j <= n/2, then r <- r + 1 and go to step 2.
Step 5: i <- i + 2; If i <= (n/2 - 1), then $r = \lfloor \log_2 i \rfloor + 1$ and go to step 2
else STOP. (All $c_{2i}$ have been computed).
In order to compute $c_{2i+1}$, algorithm 2.3 can be employed with a few changes. 
The changes are in step 1. Step 1 will read,
Step 1 (odd $c_i$): r = 1, i = 1.
Do theorem 2.4. Set result to $X^i$.
We thus obtain $X^i = 4 \sum_{k=0}^{n/2-1} c_{2k+1}$
Steps 2 through 5 will remain unchanged with the obvious exception in step 5 wherein 
when the algorithm stops, all $c_{2i+1}$ would have been computed. An implementation of
the above algorithm in Mathematica for n = 32, along with the results can be found in 
the appendix.
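The control flow of algorithm 2.3 can be sketched in Python as follows. This is not the Mathematica implementation from the appendix: the function names are hypothetical, the even and odd passes are folded into one loop over a parity flag, and n is assumed to be a power of 2 with n >= 4.

```python
def strided_alt(seq, start, stride):
    n = len(seq)
    return sum((-1) ** l * seq[(start + l * stride) % n] for l in range(n // stride))


def theorem_2_8(a, b, i, j):
    """4 * sum over k of (-1)^k c_(i+kj), obtained from 2j squares."""
    n, total = len(a), 0
    for t in range(j):
        a_part, b_part = strided_alt(a, (i - t) % n, j), strided_alt(b, t, j)
        total += (a_part + b_part) ** 2 - (a_part - b_part) ** 2
    return total


def parity_sum(a, b, parity):
    """Theorem 2.3 (parity 0) or 2.4 (parity 1): 4 * sum of c_i with i % 2 == parity."""
    ae, ao, be, bo = sum(a[0::2]), sum(a[1::2]), sum(b[0::2]), sum(b[1::2])
    pairs = [(ae, be), (ao, bo)] if parity == 0 else [(ae, bo), (ao, be)]
    return sum((s + t) ** 2 - (s - t) ** 2 for s, t in pairs)


def squared_law_convolution(a, b):
    n = len(a)
    if n < 4 or n & (n - 1) or len(b) != n:
        raise ValueError("n must be a power of 2, at least 4, for both sequences")
    c = [0] * n
    for parity in (0, 1):
        x = {parity: parity_sum(a, b, parity)}               # step 1
        i, r = parity, 1
        while i <= n // 2 - 1:
            while True:
                j = 2 ** r                                   # step 2
                z = theorem_2_8(a, b, i, j) * 2 ** (r - 1)
                x[i], x[i + j] = x[i] + z, x[i] - z          # step 3 (old x[i] on both sides)
                if j == n // 2:
                    c[i] = x[i] // (4 * 2 ** r)
                    c[i + j] = x[i + j] // (4 * 2 ** r)
                    break
                r += 1                                       # step 4
            i += 2                                           # step 5
            r = i.bit_length()                               # floor(log2 i) + 1 for i >= 1
    return c


print(squared_law_convolution([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]))
```

The dictionary x holds the partial sums $X^i$; entries created for index i + j at one stage are picked up again when i later reaches that index, which is why step 5 restarts r at $\lfloor \log_2 i \rfloor + 1$ rather than at 1.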
2.5 Analysis of part 1
As explained in section 2.3, part 1 of the overall methodology for determining 
the cyclic convolution of the two n-point sequences comprises the computations 
$\sum c_{2i}$ and $\sum c_{2i+1}$. These can be obtained by using one of the following three methods.
The advantages and disadvantages of the three methods are compared and contrasted 
below.
Method 1:
We can use theorem 2.1 to obtain the sum of the cyclic convolution of all the 
even points and theorem 2.2 to obtain the sum of the cyclic convolution of all the odd 
points. Let us assume that each of the n points has k bits. Looking at equations (2.23)-
(2.26) we notice that each of the equations is a function of all 2n points. In equation 
(2.23) all terms have positive signs and therefore the length of the result is k + log2 2n.
In the other three equations, namely equations (2.24)-(2.26), half have positive signs 
and the other half have negative signs. In order to determine the length of the result, the 
worst case assumption would be when all the points with positive signs have a value of 
zero and all the points with negative signs have a value of $2^k - 1$, i.e. a maximum value. 
Such a situation will yield the smallest negative number of size k + log2 n. But since this
number is negative, in an actual hardware implementation it will be represented in its 
two's complement form and will therefore require an additional bit. Thus the number of 
bits required for each of equations (2.24)-(2.26) will be k + 1 + log2 n which is equal to
k + log2 2n. Thus we see that the number of bits required to represent a number is the 
same in all of the equations (2.23)-(2.26) and in general we can say that the number of 
bits required for any equation will be the same if either all terms have positive signs or 
half have positive and the other half have negative signs. Thus the ROMs required to 
generate the squares of each of these equations will be of the same size.
Each of the squares required by equation (2.27) can be generated by a ROM of 
size $2^{k + \log_2 2n} \times 2(k + \log_2 2n)$. Since there are four squares the total number of
ROM bits required is given by
$$\text{ROM bits - 4 squares} = 8 \times 2^{k + \log_2 2n} \times (k + \log_2 2n) \tag{2.53}$$
The number of additions required for adding 2n terms is equal to 2n -1. In order 
to get $w_{n1}$ we would need 2n - 1 additions. $w_{n2}$ can be written as $(a_0 + a_1 + \ldots + a_{n-1}) -
(b_0 + b_1 + \ldots + b_{n-1})$, thus needing only one more addition. (We assume that the cost 
of an adder is the same as that of a negator.) $w_{n3}$ can be written as $(a_0 + a_2 + \ldots + a_{n-2}) 
- (a_1 + a_3 + \ldots + a_{n-1}) + (b_0 + b_2 + \ldots + b_{n-2}) - (b_1 + b_3 + \ldots + b_{n-1})$, thus requiring 
only three additional additions, while $w_{n4}$ can be written as $(a_0 + a_2 + \ldots + a_{n-2}) - (a_1 + 
a_3 + \ldots + a_{n-1}) - ((b_0 + b_2 + \ldots + b_{n-2}) - (b_1 + b_3 + \ldots + b_{n-1}))$, thus requiring only one 
more addition as the rest is generated while generating $w_{n3}$. To do the additions and 
subtractions required by equations (2.27) and (2.28) we need a total of four more 
additions. Thus the total number of additions can be given by
Additions - 4 squares = 2n + 8 (2.54)
The time delay for performing all the additions required by equations (2.23) -
(2.26) can be given by log2 2n A as each equation has 2n terms. Equations (2.27) and
(2.28) will require an additional 2 A. Here we assume that A is the time required for
adding two l-bit numbers where l is in the range k < l < 2(k + log2 2n). Thus the total 
time delay can be given as
Time delay - 4 squares = (3 + log2 n) A (2.55)
Method 2:
In method 1 we used theorems 2.1 and 2.2 while the same results can be 
obtained by using theorems 2.3 and 2.4. Here since each of equations (2.29)-(2.32) and 
(2.34)-(2.37) have only half the number of total points, the length of the result of each 
of these equations will be k + log2 n. Thus the square of each of these equations can be 
implemented by using a ROM of size $2^{k + \log_2 n} \times 2(k + \log_2 n)$ and for 8 squares the 
total ROM bits can be given as
$$\text{ROM bits - 8 squares} = 16 \times 2^{k + \log_2 n} \times (k + \log_2 n) \tag{2.56}$$
The number of additions required for adding n terms is equal to n -1. In order to 
obtain $x_{n1}$ and $x_{n3}$ as given by equations (2.29) and (2.31) we would need 2(n-1) 
additions. $x_{n2}$ and $x_{n4}$ as given by equations (2.30) and (2.32) can then be obtained by 
just performing two more additions while $y_{n1}$ through $y_{n4}$ as given by equations (2.34) - 
(2.37) can be obtained by performing four more additions. Thus the total number of 
additions including those required by equations (2.33) and (2.38) can be given by
Additions - 8 squares = 2n + 10 (2.57)
The time delay for performing all the additions required by equations (2.29) - 
(2.32) and (2.34) - (2.37) can be given by log2 n A as each equation has n terms.
Further, equations (2.33) and (2.38) will require another 2 A thus giving the total time 
delay as
Time delay - 8 squares = (2 + log2 n)A (2.58)
Method 3:
Another way of obtaining the sum of the cyclic convolution of all the even points 
and all the odd points separately would be by using theorems 2.3 and 2.7. Theorem 2.3 
gives us the sum of the even points of the cyclic convolution while theorem 2.7 gives us 
the difference of the sum of the even points and the sum of the odd points. Therefore the 
sum of the odd points of the cyclic convolution can be obtained by taking the difference 
of the results of theorems 2.3 and 2.7.
In method 2 we have shown that the length of the result of each of the equations 
required by theorem 2.3 is equal to k + log2 n while their squares can be implemented 
by ROMs of size $2^{k + \log_2 n} \times 2(k + \log_2 n)$. We note that there are four such ROMs. In
theorem 2.7, both equations are a function of all 2n points and in method 1 we have 
shown that the result of such an equation is of length k + log2 2n. Each of the two 
squares can be implemented by using a ROM of size $2^{k + \log_2 2n} \times 2(k + \log_2 2n)$. 
Thus the total number of ROM bits required can be given as
$$\begin{aligned}
\text{ROM bits - 6 squares} &= [8 \times 2^{k + \log_2 n} \times (k + \log_2 n)] + [4 \times 2^{k + \log_2 2n} \times (k + \log_2 2n)] \\
&= 2^{k + 3 + \log_2 n} \times [2(k + \log_2 n) + 1]
\end{aligned} \tag{2.59}$$
We have already shown in method 2 that the number of additions required by 
equations (2.29)-(2.32) is equal to (2(n-1) + 2). Equation (2.49), which generates $w_{n5}$, can 
also be written as $x_{n1} - x_{n3}$, thus requiring only one more addition. Equation (2.50), which 
generates $w_{n6}$, can also be written as $x_{n4} - x_{n2}$, thus again requiring only one more 
addition. Equations (2.33) and (2.51) require four additions while to subtract equation 
(2.51) from (2.33) we would need one more addition. Thus the total number of 
additions can be given by
Additions - 6 squares = 2n + 7 (2.60)
To compute the time delay for performing all the additions required by this 
method it is easy for one to see that since equations (2.49) and (2.50) are the longest it 
will be sufficient to determine the time delay to compute these equations. It was shown 
in the analysis for method 1 that this is equal to log2 2n A. To obtain equation (2.51) 
we will need one additional time unit or the total time can be given as (log2 n + 2) A . 
The computation of equation (2.33) will also take (log2 n + 2) A, while to obtain the 
difference of equation (2.33) and (2.51) we will need one more time unit, thus giving 
the total time delay as
Time delay - 6 squares = (3 + log2 n ) A (2.61)
2.5.1 Comparison of methods 1 - 3
Comparing equations (2.53), (2.56), and (2.59) we can see that the number of 
ROM bits required by method 2 is the least. With respect to time delay for performing 
the additions, comparing equations (2.55), (2.58), and (2.61) we once again find that 
method 2 is the best although it is only marginally faster than the other two methods. 
With respect to the number of additions, comparing equations (2.54), (2.57), and 
(2.60) we find that method 3 is the best although again by only a marginal amount.
However, we should note that if computations are being performed in some 
modular ring then method 1 would be the best as the size of the equations does not grow 
and therefore the number of squares is of prime importance. Clearly, method 1 would 
be the best as it requires only four squares.
For the sake of clarity we have in the remainder of the chapter assumed that part 
1 is computed based on method 2 while keeping in mind that method 1 is to be used in 
the event arithmetic is done in some modular ring.
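The cost expressions (2.53) to (2.61) can be tabulated for concrete n and k to see the comparison of section 2.5.1 in numbers. The sketch below is only a transcription of those formulas; the function name and dictionary keys are hypothetical, n is assumed to be a power of 2, and the delay is expressed in units of A.

```python
def part1_costs(n, k):
    """ROM bits, additions and adder delay for methods 1-3 (n-point sequences, k-bit points)."""
    lg_n = n.bit_length() - 1      # log2 n for a power of 2
    lg_2n = lg_n + 1               # log2 2n
    return {
        "method 1": {
            "rom_bits": 8 * 2 ** (k + lg_2n) * (k + lg_2n),          # (2.53)
            "additions": 2 * n + 8,                                    # (2.54)
            "delay": 3 + lg_n,                                         # (2.55)
        },
        "method 2": {
            "rom_bits": 16 * 2 ** (k + lg_n) * (k + lg_n),            # (2.56)
            "additions": 2 * n + 10,                                   # (2.57)
            "delay": 2 + lg_n,                                         # (2.58)
        },
        "method 3": {
            "rom_bits": 2 ** (k + 3 + lg_n) * (2 * (k + lg_n) + 1),   # (2.59)
            "additions": 2 * n + 7,                                    # (2.60)
            "delay": 3 + lg_n,                                         # (2.61)
        },
    }


for name, cost in part1_costs(16, 8).items():
    print(name, cost)
```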
2.6 Number of squares
In section 2.5 we have shown that part 1 would need eight squares if method 2 
was used. Irrespective of the methods used for computing part 1 the methodology for 
part 2 remains the same. In this section we present a formula to compute the number of 
squares required to compute part 2.
There are two ways of computing the number of squares required for part 2 of 
the overall methodology. One is by examining algorithm 2.3 and determining how many 
times step 2 is executed. This is because in step 2, theorem 2.8 is evaluated and theorem 
2.8 in turn requires 2j squares to be computed. Thus by knowing the number of times 
step 2 is executed for different values of j, one can estimate the number of squares 
required. On the other hand, we can estimate the number of squares in a more intuitive 
fashion by looking more closely at the methodology of part 2. We recall that in 
algorithm 2.2 the following notations are used. The number of points being convolved 
is represented by the variable n while variable i is used to represent the index of the 
points of the cyclic convolution. The variable j is used to denote the difference between 
two consecutive indices of the points of the cyclic convolution while k is a local variable 
used to indicate the relationship between j and n. We now make the following 
observations:
1) Part 2 of the methodology treats the even points separately from the odd 
points, however either computation uses exactly the same concept. Therefore if we 
compute the number of squares required by the even points all we need to do is double 
that number to get the total number of squares.
2) For any given n the values of j are of the form $2^k$ with k = 1, 2, ..., 
$\log_2(n/2)$. The maximum value of j is equal to n/2 and we can say there are 
$\log_2(n/2)$ stages.
3) For any given value of n, i lies in the range 0 through (n/2 -1). Since we 
are computing the even points and odd points separately i is incremented by 2 every 
time. Thus the number of distinct i for a given n at every stage is equal to j/2.
4) The number of squares is a recursive formula, i.e., the number of 
squares required for any value of n is equal to the number of squares required when j = 
n/2 plus all the squares required for half the number of points i.e. when n := n/2. Thus 
at every stage we only need to determine the number of squares required for the 
maximum allowable value of j.
5) For every i there are 2j squares. Thus at every stage the number of 
squares is equal to the number of distinct i times 2j, thus giving $(j/2) \times 2j = j^2$.
Let $l_k$ represent the number of squares required at stage k. Then from the above 
observations it follows that $l_k = j^2 = 4^k$, k = 1, 2, 3, ..., $\log_2(n/2)$. Thus the total 
number of squares required is given by
\[ \text{number of squares} = \sum_{k=1}^{\log_2(n/2)} 4^k = \frac{4}{3}\left(4^{\log_2(n/2)} - 1\right) \tag{2.62} \]
Thus the total number of squares, for computing both odd and even points, can 
be given as
\[ \text{Total number of squares} = 2 \times \frac{4}{3}\left(4^{\log_2(n/2)} - 1\right) = \frac{2}{3}\left(n^2 - 4\right) \tag{2.63} \]
Table 2.1 compares the number of multiplication versus squaring operations required 
for cyclically convolving two n-point sequences. This table assumes that part 1 of our 
overall methodology is evaluated based on method 2. Thus the number of squares in 
this table is computed based on equation (2.63) + 8 = $\frac{2}{3}(n^2 + 8)$.

Table 2.1: Comparison of the number of multiplication versus squaring operations

n     Multiplications    Squares, 2/3 x (n^2 + 8)    % savings
4     16                 16                          0
8     64                 48                          25
16    256                176                         31.25
32    1024               688                         32.81
64    4096               2736                        33.20
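The entries of Table 2.1 can be regenerated from the stage-by-stage count behind equations (2.62) and (2.63). The following is a minimal sketch under stated assumptions: the function names are hypothetical, n is assumed to be a power of 2 with n >= 4, the multiplication count is taken as n^2 (the direct definition), and part 1 is assumed to cost eight squares (method 2).

```python
def squares_part2(n):
    """Squares used by part 2 for both even and odd points, per (2.62)-(2.63)."""
    stages = (n // 2).bit_length() - 1           # log2(n/2) for n a power of 2
    per_parity = sum(4 ** k for k in range(1, stages + 1))  # equation (2.62)
    return 2 * per_parity                         # equation (2.63)

def table_row(n, part1_squares=8):
    """Return (n, multiplications, squares, percent savings) for one table row."""
    multiplications = n * n                       # direct cyclic convolution
    squares = squares_part2(n) + part1_squares    # assumes method 2 for part 1
    savings = 100 * (multiplications - squares) / multiplications
    return n, multiplications, squares, savings

for n in (4, 8, 16, 32, 64):
    print(table_row(n))
```

Summing the stages explicitly, rather than coding the closed form, gives an independent check that 2/3 x (n^2 - 4) is what the stage counts add up to.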
2.7 Number of additions
We have already estimated the number of additions required for part 1 of the 
methodology. In this section we obtain a formula for determining the number of 
additions required for part 2 of the methodology. A brief description of the meaning of 
each of the variables can be found in the previous section.
For a given n, j has a maximum value of n/2. Also $j = 2 \times 2^k$; k = 0, 1, 2, ..., 
$\log_2(n/4)$. For every value of j we have 2j equations consisting of $a_i$ and $b_i$ terms. From 
theorem 2.8 we see that each equation contains n/j $a_i$ terms and n/j $b_i$ terms, and 
therefore requires (n/j - 1) additions for each group.
Also from algorithm 2.2, which shows how to construct these equations, we find that 
the $a_i$ and $b_i$ terms appear only j distinct times. The $a_i$ and $b_i$ terms appear in the same 
combinations for both odd and even points. The difference, however, is that the $a_i$ and 
$b_i$ terms are combined differently for even and odd points. Thus the number of additions 
for computing these equations can be given as
\[ \text{Term 1 - additions} = 2\sum_{\forall j} j\left(\frac{n}{j} - 1\right) + \sum_{\forall j} 2j \times j \tag{2.64} \]
or
\[ \text{Term 1 - additions} = 2n\log_2(n/2) - 2\sum_{\forall j} j + 2\sum_{\forall j} j^2 \tag{2.65} \]
We note that the first term of equation (2.64) represents the number of additions 
required for obtaining the sum of the $a_i$ terms and $b_i$ terms separately. For every j there 
are n/j $a_i$ terms and therefore we have (n/j - 1) additions. Also, for every j there are j sets 
of such equations. This explains the product under the summation. The 2 outside the 
summation accounts for the additions required by the $b_i$ terms. The second term 
represents the number of additions required for combining the $a_i$ and $b_i$ terms. For every 
j there are 2j equations and j sets of such equations. This explains the product under the 
second summation in equation (2.64). (Note that this includes both the odd and even 
points).
Since for every value of j there are 2j squares, to add-subtract these squares we 
would need 2j - 1 additions. Thus, we have
\[ \text{Term 2 - additions} = \sum_{\forall j} j(2j - 1) \tag{2.66} \]
The addition/subtraction of squares, which are accounted for by equation (2.66), 
generate for every j, $\sum_{k=0}^{n/j-1} c_{i+jk}(-1)^k$. These are in turn added with $X^i$ as shown in step 
3 of algorithm 2.3. Initially, $X^i$ is the sum of all even $c_i$, or in other words it consists of 
n/2 points. An individual $c_i$ is obtained by adding to this quantity the alternating 
difference of $c_i$ with varying distances between consecutive $c_i$. This process thus 
requires (n/2 - 1) additions and as many subtractions before the individual $c_i$ are computed. Thus 
the number of additions is 2 (n/2 - 1) or (n - 2) additions and a similar amount would be 
needed for the odd $c_i$. Thus we have the total number of additions for this process as
Term 3 - additions = 2(n - 2) (2.67)
Therefore the total number of additions for part 2 of the methodology can be 
given by the sum of equations (2.65) - (2.67), thus giving
\[ \begin{aligned}
\text{Part 2 - additions} &= 2n\log_2(n/2) - 2\sum_{\forall j} j + 2\sum_{\forall j} j^2 + \sum_{\forall j} j(2j - 1) + 2(n - 2) \\
&= 2\left(n + n\log_2(n/2) - 2\right) + 4\sum_{\forall j} j^2 - 3\sum_{\forall j} j \\
&= 2\left(n + n\log_2(n/2) - 2\right) + 16\sum_{k=0}^{\log_2(n/4)} 4^k - 6\sum_{k=0}^{\log_2(n/4)} 2^k \\
&= \frac{1}{3}\left[4n^2 + 6n\log_2(n/2) - 3n - 10\right]
\end{aligned} \tag{2.68} \]
Assuming that we use method 2 for part 1, we have
\[ \text{Total number of additions} = \frac{1}{3}\left[4n^2 + 6n\log_2(n/2) + 3n + 20\right] \tag{2.69} \]
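As a check on the algebra, the three terms (2.64), (2.66) and (2.67) can be summed directly over j = 2, 4, ..., n/2 and compared with the closed form (2.68). The sketch below is illustrative only; the function names are hypothetical and n is assumed to be a power of 2 with n >= 4.

```python
def part2_additions_direct(n):
    """Sum Term 1, Term 2 and Term 3 over all j = 2, 4, ..., n/2."""
    js = []
    j = 2
    while j <= n // 2:
        js.append(j)
        j *= 2
    term1 = 2 * sum(j * (n // j - 1) for j in js) + sum(2 * j * j for j in js)  # (2.64)
    term2 = sum(j * (2 * j - 1) for j in js)                                    # (2.66)
    term3 = 2 * (n - 2)                                                         # (2.67)
    return term1 + term2 + term3

def part2_additions_closed(n):
    """Closed form of equation (2.68)."""
    log_half = (n // 2).bit_length() - 1   # log2(n/2) for n a power of 2
    return (4 * n * n + 6 * n * log_half - 3 * n - 10) // 3

def total_additions_closed(n):
    """Closed form of equation (2.69), which assumes method 2 for part 1."""
    log_half = (n // 2).bit_length() - 1
    return (4 * n * n + 6 * n * log_half + 3 * n + 20) // 3

for n in (4, 8, 16, 32, 64):
    print(n, part2_additions_direct(n), part2_additions_closed(n))
```

Subtracting (2.68) from (2.69) leaves 2n + 10, which is the part 1 contribution that equation (2.69) implies for method 2.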
2.8 Example
In this section we present in detail all the necessary computations required to 
compute the cyclic convolution of two 16-point sequences. The purpose of this example 
is to illustrate the methodology, theorems, algorithms, and notations developed earlier in 
this chapter. We consider two sequences A and B whose points are given as 
$A = \{a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8, a_9, a_{10}, a_{11}, a_{12}, a_{13}, a_{14}, a_{15}\}$ and 
$B = \{b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7, b_8, b_9, b_{10}, b_{11}, b_{12}, b_{13}, b_{14}, b_{15}\}$ while their 
cyclic convolution is given by 
$C = \{c_0, c_1, c_2, c_3, c_4, c_5, c_6, c_7, c_8, c_9, c_{10}, c_{11}, c_{12}, c_{13}, c_{14}, c_{15}\}$ where each point 
is defined by equation (2.22) or
\[ c_i = \sum_{k=0}^{15} a_{\langle i-k \rangle_{16}}\, b_k \quad \text{for } i = 0, 1, 2, \ldots, 15 \]
In order to use our methodology, we have to run through the steps of algorithm 2.3. 
Procedure:
Step 1: r = 1, i = 0
Do theorem 2.3 
This will, based on equations (2.29) - (2.32), yield:
\[ \begin{aligned}
x_{n1} &= a_0+a_2+a_4+a_6+a_8+a_{10}+a_{12}+a_{14}+b_0+b_2+b_4+b_6+b_8+b_{10}+b_{12}+b_{14} \\
x_{n2} &= a_0+a_2+a_4+a_6+a_8+a_{10}+a_{12}+a_{14}-b_0-b_2-b_4-b_6-b_8-b_{10}-b_{12}-b_{14} \\
x_{n3} &= a_1+a_3+a_5+a_7+a_9+a_{11}+a_{13}+a_{15}+b_1+b_3+b_5+b_7+b_9+b_{11}+b_{13}+b_{15} \\
x_{n4} &= a_1+a_3+a_5+a_7+a_9+a_{11}+a_{13}+a_{15}-b_1-b_3-b_5-b_7-b_9-b_{11}-b_{13}-b_{15}
\end{aligned} \]
and from equation (2.33) we get
\[ x_{n1}^2 - x_{n2}^2 + x_{n3}^2 - x_{n4}^2 = 4\sum_{i=0}^{7} c_{2i} = 4(c_0 + c_2 + \cdots + c_{14}) \]
Set result to $X^i$ 
Thus,
\[ X^0 = 4(c_0+c_2+c_4+c_6+c_8+c_{10}+c_{12}+c_{14}) \]
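Step 2: $j = 2^r = 2$
Do theorem 2.8
This will call step 2 of algorithm 2.2 with i = 0, j = 2, k = 1, q = 0, p = i, and n = 16, 
thus yielding:
\[ \begin{aligned}
V^0_{16,1} &= a_0-a_2+a_4-a_6+a_8-a_{10}+a_{12}-a_{14}+b_0-b_2+b_4-b_6+b_8-b_{10}+b_{12}-b_{14} \\
V^0_{16,2} &= a_0-a_2+a_4-a_6+a_8-a_{10}+a_{12}-a_{14}-b_0+b_2-b_4+b_6-b_8+b_{10}-b_{12}+b_{14} \\
V^0_{16,3} &= -a_1+a_3-a_5+a_7-a_9+a_{11}-a_{13}+a_{15}+b_1-b_3+b_5-b_7+b_9-b_{11}+b_{13}-b_{15} \\
V^0_{16,4} &= -a_1+a_3-a_5+a_7-a_9+a_{11}-a_{13}+a_{15}-b_1+b_3-b_5+b_7-b_9+b_{11}-b_{13}+b_{15}
\end{aligned} \]
and based on equation (2.52) we get 
\[ \sum_{k=1}^{4} \left(V^0_{16,k}\right)^2(-1)^{k+1} = 4\sum_{k=0}^{7} c_{2k}(-1)^k = 4(c_0-c_2+c_4-c_6+c_8-c_{10}+c_{12}-c_{14}) \]
Set result to Z
\[ Z = 4(c_0-c_2+c_4-c_6+c_8-c_{10}+c_{12}-c_{14}) \]
Step 3:
$X^0 \leftarrow X^0 + Z$
= $8(c_0+c_4+c_8+c_{12})$
and $X^2 \leftarrow X^0 - Z$
= $8(c_2+c_6+c_{10}+c_{14})$
Also, $j \neq 8$, and therefore we go to step 4.
Step 4: j <= 8 is true and therefore $r \leftarrow r + 1 = 2$ and we go to step 2.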
Step 2: $j = 2^r = 4$
Do theorem 2.8
This will call step 2 of algorithm 2.2 with i = 0, j = 4, k = 1, q = 0, p = i, and n = 16, 
thus yielding:
\[ \begin{aligned}
V^0_{16,1} &= a_0 - a_4 + a_8 - a_{12} + b_0 - b_4 + b_8 - b_{12} \\
V^0_{16,2} &= a_0 - a_4 + a_8 - a_{12} - b_0 + b_4 - b_8 + b_{12} \\
V^0_{16,3} &= -a_3 + a_7 - a_{11} + a_{15} + b_1 - b_5 + b_9 - b_{13} \\
V^0_{16,4} &= -a_3 + a_7 - a_{11} + a_{15} - b_1 + b_5 - b_9 + b_{13} \\
V^0_{16,5} &= -a_2 + a_6 - a_{10} + a_{14} + b_2 - b_6 + b_{10} - b_{14} \\
V^0_{16,6} &= -a_2 + a_6 - a_{10} + a_{14} - b_2 + b_6 - b_{10} + b_{14} \\
V^0_{16,7} &= -a_1 + a_5 - a_9 + a_{13} + b_3 - b_7 + b_{11} - b_{15} \\
V^0_{16,8} &= -a_1 + a_5 - a_9 + a_{13} - b_3 + b_7 - b_{11} + b_{15}
\end{aligned} \]
and based on equation (2.52) we get
\[ \sum_{k=1}^{8} \left(V^0_{16,k}\right)^2(-1)^{k+1} = 4\sum_{k=0}^{3} c_{4k}(-1)^k = 4(c_0 - c_4 + c_8 - c_{12}) \]
Set result to Z
\[ Z = 4(c_0 - c_4 + c_8 - c_{12}) \]
Step 3:
$X^0 \leftarrow X^0 + Z \times 2$
= $16(c_0 + c_8)$
and $X^4 \leftarrow X^0 - Z \times 2$
= $16(c_4 + c_{12})$
Also, $j \neq 8$, and therefore we go to step 4.
Step 4: j <= 8 is true and therefore $r \leftarrow r + 1 = 3$ and we go to step 2.
Step 2: $j = 2^r = 8$
Do theorem 2.8
This will call step 2 of algorithm 2.2 with i = 0, j = 8, k = 1, q = 0, p = i, and n = 16, 
thus yielding:
\[ \begin{aligned}
V^0_{16,1} &= a_0 - a_8 + b_0 - b_8 & V^0_{16,2} &= a_0 - a_8 - b_0 + b_8 \\
V^0_{16,3} &= -a_7 + a_{15} + b_1 - b_9 & V^0_{16,4} &= -a_7 + a_{15} - b_1 + b_9 \\
V^0_{16,5} &= -a_6 + a_{14} + b_2 - b_{10} & V^0_{16,6} &= -a_6 + a_{14} - b_2 + b_{10} \\
V^0_{16,7} &= -a_5 + a_{13} + b_3 - b_{11} & V^0_{16,8} &= -a_5 + a_{13} - b_3 + b_{11} \\
V^0_{16,9} &= -a_4 + a_{12} + b_4 - b_{12} & V^0_{16,10} &= -a_4 + a_{12} - b_4 + b_{12} \\
V^0_{16,11} &= -a_3 + a_{11} + b_5 - b_{13} & V^0_{16,12} &= -a_3 + a_{11} - b_5 + b_{13} \\
V^0_{16,13} &= -a_2 + a_{10} + b_6 - b_{14} & V^0_{16,14} &= -a_2 + a_{10} - b_6 + b_{14} \\
V^0_{16,15} &= -a_1 + a_9 + b_7 - b_{15} & V^0_{16,16} &= -a_1 + a_9 - b_7 + b_{15}
\end{aligned} \]
and based on equation (2.52) we get
\[ \sum_{k=1}^{16} \left(V^0_{16,k}\right)^2(-1)^{k+1} = 4\sum_{k=0}^{1} c_{8k}(-1)^k = 4(c_0 - c_8) \]
Set result to Z
\[ Z = 4(c_0 - c_8) \]
Step 3:
$X^0 \leftarrow X^0 + Z \times 4$
= $32c_0$
and $X^8 \leftarrow X^0 - Z \times 4$
= $32c_8$
Also, j = 8, and we note that $c_0$ and $c_8$ are indeed equal to $X^0/32$ and $X^8/32$, and we go 
to step 5.
Step 5: $i \leftarrow i + 2 = 2$ and i <= 7 is true, therefore r = 2 and we go to step 2.
Step 2: $j = 2^r = 4$
Do theorem 2.8
This will call step 2 of algorithm 2.2 with i = 2, j = 4, k = 1, q = 0, p = i, and n = 16, 
thus yielding:
\[ \begin{aligned}
V^2_{16,1} &= a_2 - a_6 + a_{10} - a_{14} + b_0 - b_4 + b_8 - b_{12} \\
V^2_{16,2} &= a_2 - a_6 + a_{10} - a_{14} - b_0 + b_4 - b_8 + b_{12} \\
V^2_{16,3} &= a_1 - a_5 + a_9 - a_{13} + b_1 - b_5 + b_9 - b_{13} \\
V^2_{16,4} &= a_1 - a_5 + a_9 - a_{13} - b_1 + b_5 - b_9 + b_{13} \\
V^2_{16,5} &= a_0 - a_4 + a_8 - a_{12} + b_2 - b_6 + b_{10} - b_{14} \\
V^2_{16,6} &= a_0 - a_4 + a_8 - a_{12} - b_2 + b_6 - b_{10} + b_{14} \\
V^2_{16,7} &= -a_3 + a_7 - a_{11} + a_{15} + b_3 - b_7 + b_{11} - b_{15} \\
V^2_{16,8} &= -a_3 + a_7 - a_{11} + a_{15} - b_3 + b_7 - b_{11} + b_{15}
\end{aligned} \]
and based on equation (2.52) we get
\[ \sum_{k=1}^{8} \left(V^2_{16,k}\right)^2(-1)^{k+1} = 4\sum_{k=0}^{3} c_{2+4k}(-1)^k = 4(c_2 - c_6 + c_{10} - c_{14}) \]
Set result to Z
\[ Z = 4(c_2 - c_6 + c_{10} - c_{14}) \]
Step 3:
$X^2 \leftarrow X^2 + Z \times 2$
= $16(c_2 + c_{10})$
and $X^6 \leftarrow X^2 - Z \times 2$
= $16(c_6 + c_{14})$
Also, $j \neq 8$, and therefore we go to step 4.
Step 4: j <= 8 is true and therefore $r \leftarrow r + 1 = 3$ and we go to step 2.
Step 2: $j = 2^r = 8$
Do theorem 2.8
This will call step 2 of algorithm 2.2 with i = 2, j = 8, k = 1, q = 0, p = i, and n = 16, 
thus yielding:
\[ \begin{aligned}
V^2_{16,1} &= a_2 - a_{10} + b_0 - b_8 & V^2_{16,2} &= a_2 - a_{10} - b_0 + b_8 \\
V^2_{16,3} &= a_1 - a_9 + b_1 - b_9 & V^2_{16,4} &= a_1 - a_9 - b_1 + b_9 \\
V^2_{16,5} &= a_0 - a_8 + b_2 - b_{10} & V^2_{16,6} &= a_0 - a_8 - b_2 + b_{10} \\
V^2_{16,7} &= -a_7 + a_{15} + b_3 - b_{11} & V^2_{16,8} &= -a_7 + a_{15} - b_3 + b_{11} \\
V^2_{16,9} &= -a_6 + a_{14} + b_4 - b_{12} & V^2_{16,10} &= -a_6 + a_{14} - b_4 + b_{12} \\
V^2_{16,11} &= -a_5 + a_{13} + b_5 - b_{13} & V^2_{16,12} &= -a_5 + a_{13} - b_5 + b_{13} \\
V^2_{16,13} &= -a_4 + a_{12} + b_6 - b_{14} & V^2_{16,14} &= -a_4 + a_{12} - b_6 + b_{14} \\
V^2_{16,15} &= -a_3 + a_{11} + b_7 - b_{15} & V^2_{16,16} &= -a_3 + a_{11} - b_7 + b_{15}
\end{aligned} \]
and based on equation (2.52) we get
\[ \sum_{k=1}^{16} \left(V^2_{16,k}\right)^2(-1)^{k+1} = 4\sum_{k=0}^{1} c_{2+8k}(-1)^k = 4(c_2 - c_{10}) \]
Set result to Z
\[ Z = 4(c_2 - c_{10}) \]
Step 3:
$X^2 \leftarrow X^2 + Z \times 4$
= $32c_2$
and $X^{10} \leftarrow X^2 - Z \times 4$
= $32c_{10}$
Also, j = 8, and we note that $c_2$ and $c_{10}$ are indeed equal to $X^2/32$ and $X^{10}/32$, and we 
go to step 5.
Step 5: $i \leftarrow i + 2 = 4$ and i <= 7 is true, therefore r = 3 and we go to step 2.
Step 2: $j = 2^r = 8$
Do theorem 2.8
This will call step 2 of algorithm 2.2 with i = 4, j = 8, k = 1, q = 0, p = i, and n = 16, 
thus yielding:
\[ \begin{aligned}
V^4_{16,1} &= a_4 - a_{12} + b_0 - b_8 & V^4_{16,2} &= a_4 - a_{12} - b_0 + b_8 \\
V^4_{16,3} &= a_3 - a_{11} + b_1 - b_9 & V^4_{16,4} &= a_3 - a_{11} - b_1 + b_9 \\
V^4_{16,5} &= a_2 - a_{10} + b_2 - b_{10} & V^4_{16,6} &= a_2 - a_{10} - b_2 + b_{10} \\
V^4_{16,7} &= a_1 - a_9 + b_3 - b_{11} & V^4_{16,8} &= a_1 - a_9 - b_3 + b_{11} \\
V^4_{16,9} &= a_0 - a_8 + b_4 - b_{12} & V^4_{16,10} &= a_0 - a_8 - b_4 + b_{12} \\
V^4_{16,11} &= -a_7 + a_{15} + b_5 - b_{13} & V^4_{16,12} &= -a_7 + a_{15} - b_5 + b_{13} \\
V^4_{16,13} &= -a_6 + a_{14} + b_6 - b_{14} & V^4_{16,14} &= -a_6 + a_{14} - b_6 + b_{14} \\
V^4_{16,15} &= -a_5 + a_{13} + b_7 - b_{15} & V^4_{16,16} &= -a_5 + a_{13} - b_7 + b_{15}
\end{aligned} \]
and based on equation (2.52) we get
\[ \sum_{k=1}^{16} \left(V^4_{16,k}\right)^2(-1)^{k+1} = 4\sum_{k=0}^{1} c_{4+8k}(-1)^k = 4(c_4 - c_{12}) \]
Set result to Z
\[ Z = 4(c_4 - c_{12}) \]
Step 3:
$X^4 \leftarrow X^4 + Z \times 4$
= $32c_4$
and $X^{12} \leftarrow X^4 - Z \times 4$
= $32c_{12}$
Also, j = 8, and we note that $c_4$ and $c_{12}$ are indeed equal to $X^4/32$ and $X^{12}/32$, and we 
go to step 5.
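Step 5: $i \leftarrow i + 2 = 6$ and i <= 7 is true, therefore r = 3 and we go to step 2.
Step 2: $j = 2^r = 8$
Do theorem 2.8
This will call step 2 of algorithm 2.2 with i = 6, j = 8, k = 1, q = 0, p = i, and n = 16, 
thus yielding:
\[ \begin{aligned}
V^6_{16,1} &= a_6 - a_{14} + b_0 - b_8 & V^6_{16,2} &= a_6 - a_{14} - b_0 + b_8 \\
V^6_{16,3} &= a_5 - a_{13} + b_1 - b_9 & V^6_{16,4} &= a_5 - a_{13} - b_1 + b_9 \\
V^6_{16,5} &= a_4 - a_{12} + b_2 - b_{10} & V^6_{16,6} &= a_4 - a_{12} - b_2 + b_{10} \\
V^6_{16,7} &= a_3 - a_{11} + b_3 - b_{11} & V^6_{16,8} &= a_3 - a_{11} - b_3 + b_{11} \\
V^6_{16,9} &= a_2 - a_{10} + b_4 - b_{12} & V^6_{16,10} &= a_2 - a_{10} - b_4 + b_{12} \\
V^6_{16,11} &= a_1 - a_9 + b_5 - b_{13} & V^6_{16,12} &= a_1 - a_9 - b_5 + b_{13} \\
V^6_{16,13} &= a_0 - a_8 + b_6 - b_{14} & V^6_{16,14} &= a_0 - a_8 - b_6 + b_{14} \\
V^6_{16,15} &= -a_7 + a_{15} + b_7 - b_{15} & V^6_{16,16} &= -a_7 + a_{15} - b_7 + b_{15}
\end{aligned} \]
and based on equation (2.52) we get
\[ \sum_{k=1}^{16} \left(V^6_{16,k}\right)^2(-1)^{k+1} = 4\sum_{k=0}^{1} c_{6+8k}(-1)^k = 4(c_6 - c_{14}) \]
Set result to Z
\[ Z = 4(c_6 - c_{14}) \]
Step 3:
$X^6 \leftarrow X^6 + Z \times 4$
= $32c_6$
and $X^{14} \leftarrow X^6 - Z \times 4$
= $32c_{14}$
Also, j = 8, and we note that $c_6$ and $c_{14}$ are indeed equal to $X^6/32$ and $X^{14}/32$, and we 
go to step 5.
Step 5: $i \leftarrow i + 2 = 8$ and i <= 7 is false and so we stop. We observe that all $c_{2i}$ 
have been computed.
The $c_{2i+1}$ can be computed in a similar fashion and therefore are not presented 
here.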
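The worked example can be checked numerically. The sketch below is not algorithm 2.2 or 2.3 as stated in the dissertation: the rule used to build the V terms in `v_terms` is inferred from the pattern visible in the example (the a-part is negated once its index wraps past i), and the loop over r and i is restructured as a recursion that splits each $X^i$ into $X^i$ and $X^{i+j}$. All function names are hypothetical, and n is assumed to be a power of 2 with n >= 4.

```python
import random

def cyclic_convolution(a, b):
    """Direct definition: c_i = sum over k of a_<i-k>_n * b_k."""
    n = len(a)
    return [sum(a[(i - k) % n] * b[k] for k in range(n)) for i in range(n)]

def alternating_sum(seq, start, step):
    """seq[start] - seq[start+step] + seq[start+2*step] - ..."""
    return sum((-1) ** m * seq[start + m * step] for m in range(len(seq) // step))

def v_terms(a, b, i, j):
    """The 2j terms V^i_{n,1..2j}, built from the pattern seen in the example."""
    terms = []
    for s in range(j):
        r = (i - s) % j
        sign = 1 if s <= i else -1           # a-part negated once the index wraps
        p = sign * alternating_sum(a, r, j)
        q = alternating_sum(b, s, j)
        terms += [p + q, p - q]
    return terms

def signed_squares(terms):
    """Sum of (-1)^(k+1) * V_k^2 for k = 1, 2, ..., len(terms)."""
    return sum((-1) ** k * v * v for k, v in enumerate(terms))

def even_points(a, b):
    """Return ({i: c_i for even i}, number of squares used)."""
    n = len(a)
    result, squares = {}, 0
    # Step 1 of the example: four squares give 4 * (sum of all even points).
    ae, be, ao, bo = sum(a[0::2]), sum(b[0::2]), sum(a[1::2]), sum(b[1::2])
    x0 = signed_squares([ae + be, ae - be, ao + bo, ao - bo])
    squares += 4

    def split(i, j, x):
        nonlocal squares
        terms = v_terms(a, b, i, j)
        squares += len(terms)
        z = signed_squares(terms)
        low, high = x + z * (j // 2), x - z * (j // 2)   # step 3 of the example
        if j == n // 2:
            result[i], result[i + j] = low // (2 * n), high // (2 * n)
        else:
            split(i, 2 * j, low)
            split(i + j, 2 * j, high)

    split(0, 2, x0)
    return result, squares

a = [random.randint(-9, 9) for _ in range(16)]
b = [random.randint(-9, 9) for _ in range(16)]
points, used = even_points(a, b)
print(points, used)
```

For n = 16 the recursion visits (i, j) = (0, 2), (0, 4), (2, 4), (0, 8), (4, 8), (2, 8), (6, 8), the same seven pairs as the example, though in a different order.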
2.9 Summary
All the necessary theory required to compute the cyclic convolution of two n- 
point sequences where n is a power of 2 has been developed in this chapter. We have 
presented a new methodology for a hardware based implementation of the cyclic 
convolution operation. Eight theorems that were developed as a part of this dissertation 
form the mathematical basis for our methodology. Our methodology consists of two 
parts, part 1 and part 2. By selective utilization of the theorems, part 1 can be evaluated 
in three different ways, referred to as methods 1, 2, and 3. A comparative analysis of 
the three methods was provided in section 2.5. Part 2 can only be evaluated in a unique 
manner and the required equations are provided by algorithm 2.2. The overall 
methodology was described by algorithm 2.3. The algorithms did not use 
approximations of any kind and are therefore inherently free of any round-off errors, 
thus eliminating the need for error correcting hardware. To complete the theory we have 
also derived non-recursive formulae to obtain the number of squares and additions 
required by our methodology.
We have shown that our algorithms require fewer squares than the 
multiplications required by a traditional computation, but more two-operand 
additions. Specifically, the number of squares is approximately one-third less than the 
number of multiplications, while the number of additions has increased by about a third. 
However, we observe that this is not a zero gain, for we have decreased the number of expensive 
operations, namely, the multiplication operations, at the cost of increasing the 
inexpensive operations, namely, the addition operations. One must note that this is a fact 
independent of the technology of implementation. Also, we note that the formula on the 
number of additions is not an accurate reflection of the increase in hardware cost. This is 
because our implementations rely primarily on multi-operand additions and by selecting
a suitable implementation, one can not only reduce the amount of hardware but can also 
decrease the time delay associated with the computation. In the next chapter 
implementation issues of the convolution operation based on the definition and our 
method are discussed in detail.
Chapter 3 
Implementation Issues
In the previous chapter we have developed algorithms for performing the cyclic 
convolution of two n-point sequences. In place of the multiplication operation, we used 
the squaring operation. While our algorithms require fewer squaring operations when 
compared to the number of multiplication operations required by the traditional 
technique, we require more addition operations. However, since our equations require 
multi-operand adders the count of two-operand addition operations may be somewhat 
misleading. In this chapter we present carry-save adder (CSA) and read only memory 
(ROM) based implementations and discuss hardware cost and speed trade-offs. The 
purpose of these implementations is not to provide the DSP engineer with an 
off-the-shelf design but to analyze precisely the effect of the increase in 
the number of additions caused by our methodology. We then also show how the 
convolution operation can be applied to the problem of multiplying two numbers.
3.1 CSA implementation of the multiplication operation
Since our goal is to compare hardware requirements of algorithms based on the 
multiplication operation with that of algorithms based on the squaring operation, we first 
consider the implementation of the elementary functions. We illustrate this comparison 
by calculating the hardware required for a CSA implementation of an eight-by-eight 
multiplier and in the following section we calculate the hardware required for a CSA 
implementation of an eight bit squarer. These implementations are based on schemes 
for parallel multipliers offered by Dadda [54]. Dadda's method is based on successively
adding the corresponding significant columns of the partial products until only two 
numbers are left. The sum of these two numbers yields the product of the original two 
numbers. In [54], Dadda has shown that there exists a sequence of numbers which 
should be used to determine the appropriate height of the partial product matrix at each 
level. For the case when a full adder is used the sequence of numbers is: 2, 3, 4, 6, 9, 
13, 19, 28, etc. Each term in the sequence is obtained by multiplying the preceding 
term by 3/2 and taking the integral part. The use of such a sequence of numbers 
generally yields the fastest implementation with the least amount of hardware required.
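The rule for generating the sequence, and the way it fixes the number of CSA levels, can be illustrated as follows. This is a small sketch with hypothetical function names; it only restates the rule given above (multiply by 3/2 and take the integral part).

```python
def dadda_sequence(limit):
    """Terms 2, 3, 4, 6, 9, ... of the sequence that do not exceed limit."""
    sequence, term = [], 2
    while term <= limit:
        sequence.append(term)
        term = (3 * term) // 2     # multiply by 3/2 and take the integral part
    return sequence

def csa_levels(tallest_column):
    """Number of CSA levels: one per sequence term below the tallest column."""
    return sum(1 for term in dadda_sequence(tallest_column) if term < tallest_column)

print(dadda_sequence(28))
print(csa_levels(8))
```

With a tallest column of 8, the terms below 8 are 2, 3, 4 and 6, one per CSA level.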
Figure 3.1 shows the multiplier scheme for obtaining the product of two eight 
bit numbers. The multiplication of two eight bit numbers requires the summation of 
eight partial products each of length eight bits. Each 'x' in the figure represents a single 
bit of the partial product, with the least significant bit on the right most side.
The CSA tree implementation requires four levels, labeled 1 through 4 on the left 
side of figure 3.1, to obtain two numbers whose sum yields the product. Each 
level requires a delay equal to the delay through one full adder, in this context also 
known as a carry-save adder, and is denoted as $1 \times D_{CSA}$, while the final two numbers 
can be added using a fast adder such as the carry lookahead adder. For the purpose of 
providing a fair comparison between all methods we assume a simple ripple carry 
propagate adder (CPA). We denote the delay of such an adder as $1 \times D_{CPA}$. The number 
of full adders and half adders required in each of the levels is indicated on the right side 
of figure 3.1. (The notation of indicating the level numbers and the number of full and 
half adders will be used in all figures depicting CSA tree implementations.) The 
maximum height of a column in level 1 is 8. From the sequence given earlier, we find 
that the largest number less than 8 is 6. Therefore, the objective at this level is to ensure 
that the height of every column is not greater than 6 at the next level. Also, this is to be 
achieved by using a minimum number of half and full adders. Note that the use of a full 
adder reduces the height of the column by 2 while a half adder reduces the height of the 
column by 1. For instance, at level 1, columns 0 through 5 need no manipulations, for 
they all have a height less than or equal to 6. Column 6 uses one half adder to reduce its 
height from 7 to 6, while column 7 uses one half adder and one full adder to reduce its 
height from 8 to 6. The use of one half adder and one full adder actually reduces the 
height of column 7 by three but since there is a carry-in from column 6, the final height 
of column 7 in level 2 is 6. In a similar fashion the other columns are reduced. 
Eventually after level 4 we are left with two numbers that are added using a CPA to 
produce the final result. From figure 3.1 we see that an 8 x 8 
multiplier requires 35 full adders, 7 half adders, and one 15 bit CPA. If the CPA is 
based on a simple ripple carry adder then the CPA would require 14 full adders and 1 
half adder. Thus the total number of full adders required is 49 and the total number of 
half adders required is 8. The time delay for the entire computation can be given as 
$4 \times D_{CSA} + 1 \times D_{CPA}$.

Figure 3.1: CSA implementation of an 8 x 8 multiplier (dot diagram over bit positions 14 through 0; adders used at each level are listed below)

Level    FA    HA
1        3     3
2        12    2
3        9     1
4        11    1
In order to mechanize the computation of the amount of hardware required for a 
CSA based implementation of multi-operand addition, the above procedure is written as 
an algorithm and then coded in Mathematica. The Mathematica version can be found in 
the appendix.
Algorithm 3.1
Input: A list, L, whose elements are the heights of the columns of the array of 
numbers to be added, with the elements listed in the order of least to most 
significant. Let the elements of this set be $L_1, L_2, \ldots, L_j$.
Output: The amount of hardware required to add the array of numbers in terms of 
full adders, half adders, and two-input gates and the time delay required to 
perform this computation in terms of number of CSA and CPA levels.
Procedure:
Step 1: Denote the largest element in L as max[L]. Obtain a set, T, of the numbers 
of the sequence based on [54], with the last element of T being the largest 
number smaller than max[L]. Let the elements of this set be $T_1, T_2, \ldots, T_i$. 
Use variables fa and ha to keep track of the number of full adders and half 
adders respectively. Initially these variables are set to zero.
Step 2: The cardinality of set T is the number of CSA levels required to add the array 
of numbers. The number of CPA levels is always one.
Step 3: Set j = 1.
Step 4: If $L_j > T_i$ then set fa-temp to the integral part of $(L_j - T_i)/2$, and set 
ha-temp to 1 if $(L_j - T_i)/2$ has a non-zero fractional part and to 0 otherwise.
Step 5: Set $L_j = T_i$, and increase the height of the next column by the carries 
generated by the current column, i.e. the height will be increased by an 
amount equal to the sum of fa-temp and ha-temp.
Step 6: fa = fa + fa-temp and ha = ha + ha-temp.
Step 7: j = j + 1; repeat steps 4, 5, 6. (i.e. these steps are performed on all columns 
of set L.)
Step 8: When there are no more columns to update, repeat steps 3 through 6 for the 
next smaller element in T, i.e., $T_{i-1}$.
Step 9: When there are no more elements in set T, the number of full and half adders 
required by the CSA tree has been computed.
Step 10: Add to the number of full and half adders the size of the carry propagate 
adder in terms of full and half adders.
Step 11: Compute the number of two-input gates by treating a full adder as 5 two- 
input gates and a half adder as 2 two-input gates.
The justification for step 4 is given as follows: Each full adder reduces the column size 
by 2 bits. Therefore, if the number of bits to be reduced is odd, say e = 2k + 1, then k 
full adders and 1 half adder will be required to reduce the height by e bits. Since 1 half 
adder reduces the height by 1 bit, one can also say that 1/2 a full adder reduces the 
column size by 1 bit. Thus each non-zero fractional part of the computation of the 
number of full adders contributes a single half adder.
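The procedure can be sketched in Python as follows. This is not the Mathematica version from the appendix; it is a minimal re-expression under stated assumptions. The function names are hypothetical, the adder counts for a column are taken from the justification above (an excess of e bits costs e div 2 full adders plus one half adder when e is odd), and a carry out of the most significant column is assumed to open a new column.

```python
def csa_tree_cost(heights):
    """Return (full adders, half adders, CSA levels) for the given column
    heights, listed least significant column first."""
    cols = list(heights)
    tallest = max(cols)
    # Step 1: sequence 2, 3, 4, 6, 9, ... restricted to values below the tallest column.
    targets, term = [], 2
    while term < tallest:
        targets.append(term)
        term = (3 * term) // 2
    fa = ha = 0
    for target in reversed(targets):            # steps 3 to 8, largest target first
        j = 0
        while j < len(cols):
            if cols[j] > target:                 # step 4
                fa_temp, ha_temp = divmod(cols[j] - target, 2)
                cols[j] = target                 # step 5
                if j + 1 == len(cols):
                    cols.append(0)               # assumption: carries open a new column
                cols[j + 1] += fa_temp + ha_temp
                fa += fa_temp                    # step 6
                ha += ha_temp
            j += 1                               # step 7
    return fa, ha, len(targets)                  # step 2: levels = size of T

def two_input_gates(fa, ha):
    """Step 11: a full adder counts as 5 gates, a half adder as 2."""
    return 5 * fa + 2 * ha

# Column heights of the 8 x 8 partial product array, bit 0 first.
multiplier = [1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]
fa, ha, levels = csa_tree_cost(multiplier)
print(fa, ha, levels)
# Step 10 for a 15 bit ripple carry CPA: 14 full adders and 1 half adder.
print(two_input_gates(fa + 14, ha + 1))
```

Applied to the 8 x 8 multiplier heights, the sketch returns 35 full adders, 7 half adders and 4 levels, matching the per-level counts of figure 3.1.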
3.2 CSA implementation of the squaring operation
Now we consider the squaring of an eight bit number. Let the number be 
represented as A = a7a6a5a4a3a2a1a0. The square of A can be given as
\[ \begin{aligned}
A^2 ={}& a_7 2^{14} + a_7a_6 2^{14} + a_7a_5 2^{13} + a_6 2^{12} + a_7a_4 2^{12} + a_6a_5 2^{12} + a_7a_3 2^{11} + a_6a_4 2^{11} \\
&+ a_5 2^{10} + a_7a_2 2^{10} + a_6a_3 2^{10} + a_5a_4 2^{10} + a_7a_1 2^{9} + a_6a_2 2^{9} + a_5a_3 2^{9} + a_4 2^{8} \\
&+ a_7a_0 2^{8} + a_6a_1 2^{8} + a_5a_2 2^{8} + a_4a_3 2^{8} + a_6a_0 2^{7} + a_5a_1 2^{7} + a_4a_2 2^{7} + a_3 2^{6} \\
&+ a_5a_0 2^{6} + a_4a_1 2^{6} + a_3a_2 2^{6} + a_4a_0 2^{5} + a_3a_1 2^{5} + a_2 2^{4} + a_3a_0 2^{4} + a_2a_1 2^{4} \\
&+ a_2a_0 2^{3} + a_1 2^{2} + a_1a_0 2^{2} + a_0
\end{aligned} \tag{3.1} \]
Each of the product terms $a_ia_j$, also called summands, can be obtained by a 2-input 
AND gate since $a_i$ and $a_j$ are each one bit long. The addition of the various terms of 
equation 3.1 can be obtained by re-arranging them as an array of summands as shown 
in figure 3.2. Here the terms of each column have the same weight. In a sense, we can 
say that we are adding five 15-bit numbers many of whose individual bits are zero. 
These zero bits are not shown in the figure.
Column    Summands
14        a7, a7a6
13        a7a5
12        a6, a6a5, a7a4
11        a6a4, a7a3
10        a5, a5a4, a6a3, a7a2
9         a5a3, a6a2, a7a1
8         a4, a4a3, a5a2, a6a1, a7a0
7         a4a2, a5a1, a6a0
6         a3, a3a2, a4a1, a5a0
5         a3a1, a4a0
4         a2, a2a1, a3a0
3         a2a0
2         a1, a1a0
1         (empty)
0         a0
Figure 3.2: Array of summands for an 8 bit squarer
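The array of summands follows mechanically from equation 3.1: $a_i^2 = a_i$ lands in column 2i, and each cross product $a_ia_j$ with i > j appears twice and so lands in column i + j + 1. A short sketch with hypothetical function names builds the array for any word length and checks it against ordinary squaring.

```python
def squarer_columns(bits):
    """Map each column weight to its summands; a summand is a tuple of bit indices."""
    columns = {c: [] for c in range(2 * bits - 1)}
    for i in range(bits):
        columns[2 * i].append((i,))               # a_i * a_i = a_i
        for j in range(i):
            columns[i + j + 1].append((i, j))     # 2 * a_i * a_j
    return columns

def evaluate(columns, value):
    """Add up the weighted summands for the bits of `value`."""
    total = 0
    for weight, summands in columns.items():
        for summand in summands:
            product = 1
            for k in summand:
                product &= (value >> k) & 1       # AND of the selected bits
            total += product << weight
    return total

columns = squarer_columns(8)
print([len(columns[c]) for c in range(15)])       # column heights, bit 0 first
print(all(evaluate(columns, v) == v * v for v in range(256)))
```

The list of column heights produced this way is the input that the adder-counting procedure of section 3.1 operates on.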
Figure 3.3 shows the CSA implementation of the eight bit squarer. Here each term of 
figure 3.2 is represented by an x. The figure relies again on the sequence of numbers 
given in section 3.1, namely 2, 3, 4. Since the height of the tallest column is 5, in level 
1 the objective is to group the x's such that the height of no column is greater than 4. In 
level 2 the objective is to limit the height to 3 and in level 3 to 2.
From the figure we see that the amount of hardware required is 10 full adders, 5 
half adders, and one 15 bit CPA and the time delay is $3 \times D_{CSA}$. Comparing this with 
the hardware required by an 8 x 8 multiplier we find that the squarer requires around a 
third of that required by a multiplier and is faster by $1 \times D_{CSA}$. Clearly, there is an 
advantage to designing algorithms around the squaring operation. While designing CSA 
based implementations, one is generally guided by [54]-[56]. However, a closer look at 
the array of summands to be added for the squaring operations yields the configuration 
shown in figure 3.4. Here although the total amount of hardware is the same as that of 
figure 3.3, we find that there are fewer levels, i.e. it is faster by $1 \times D_{CSA}$. We observe 
that by treating the cost of a full adder as 5 two-input gates and the cost of a half adder 
as 2 two-input gates the total cost including the cost of $a_ia_j$ terms is 88 two-input gates 
plus one CPA.
In summary, it appears that the number of levels required to add a set of 
summands in a parallel fashion is not only a function of the height of the tallest column 
but is also a function of the heights of the other columns and their relative placements. 
One must note that such a situation does not arise in the multiplication of two distinct 
numbers as the height of the columns then monotonically increases, reaches a maximum, 
and then monotonically decreases.

Figure 3.3: CSA implementation of an 8 bit squarer.

Figure 3.4: Intuitive CSA implementation of an 8 bit squarer.
3.3 Alternate CSA implementation of the squaring operation
Other parallel implementations of the squaring operation can be found in [57], 
[58] while serial implementations can be found in [59]. Jayashree and Basu in [58] 
show that their method is both faster and cheaper than that of [57]. In this section we 
propose yet another parallel implementation that not only compares very well with [58] 
with respect to both cost and speed but is also regular, more modular, and easier to 
design.
We first present our alternate method and then compare it with reference [58]. 
Looking at figure 3.2 we notice that in each of the columns 2, 4, 6, 8, 10, 12, and 14 
there exist terms of the nature $a_i$ and $a_ia_{i-1}$. For instance, in column 6 we have the terms 
$a_3$ and $a_3a_2$. We can therefore substitute in place of these two terms their sum, $a_i\bar{a}_{i-1}$, 
and carry, $a_ia_{i-1}$. In the case of column 6 the sum $a_3\bar{a}_2$ replaces the terms $a_3$ and $a_3a_2$ 
and the carry $a_3a_2$ is placed in column 7. Thus we have reduced the height of column 6 
by one and at the same time increased the height of column 7 by one. Performing this 
simple manipulation on every such pair of terms yields the array as shown in figure 3.5.
We observe from this figure:
i) the height of the tallest column is less than that of figure 3.2 by one, and
ii) the array of summands to be added now exhibits a very regular structure.
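The substitution rests on a one-bit identity: for bits x and y, x + xy equals (x AND NOT y) + 2(x AND y). The sketch below, with hypothetical function names, checks the identity, rebuilds the reduced array for an 8 bit operand, and confirms both that it still evaluates to the square and that the tallest column drops from 5 to 4.

```python
from itertools import product

# One-bit identity behind the substitution: sum = x AND NOT y, carry = x AND y.
for x, y in product((0, 1), repeat=2):
    assert x + x * y == (x & (1 - y)) + 2 * (x & y)

def reduced_heights(bits):
    """Column heights of the reduced array, bit 0 first."""
    heights = [0] * (2 * bits)
    heights[0] = 1                         # a_0
    for i in range(1, bits):
        heights[2 * i] += 1                # sum term  a_i AND NOT a_(i-1)
        heights[2 * i + 1] += 1            # carry term a_i AND a_(i-1)
    for i in range(bits):
        for j in range(i - 1):             # cross products with i - j >= 2
            heights[i + j + 1] += 1
    return heights

def reduced_square(value, bits=8):
    """Evaluate the reduced array for the bits of `value`."""
    a = [(value >> k) & 1 for k in range(bits)]
    total = a[0]
    for i in range(1, bits):
        total += (a[i] & (1 - a[i - 1])) << (2 * i)
        total += (a[i] & a[i - 1]) << (2 * i + 1)
    for i in range(bits):
        for j in range(i - 1):
            total += (a[i] & a[j]) << (i + j + 1)
    return total

print(reduced_heights(8))
print(all(reduced_square(v) == v * v for v in range(256)))
```

The number of summands is unchanged at 36; the manipulation only moves one unit of height from each even column to the odd column above it.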
The method of [54] is then applied to this reduced regular array of summands to 
yield the final result. Figure 3.6 details the implementation and from this we see that the 
amount of hardware required is 9 full adders, 5 half adders, and one 15 bit CPA and the 
time delay is $2 \times D_{CSA} + 1 \times D_{CPA}$. Thus, compared with the implementation in section 
3.2, with no loss in speed we have reduced the number of full adders by 1 and have 
also produced a regular structure. However, our manipulations will require 7 additional 
two-input gates, which is comparable to the cost of a full adder. Treating the cost of a full 
adder as 5 two-input gates and the cost of the half adder as 2 two-input gates the total 
cost including the cost of $a_ia_j$ terms is 90 two-input gates plus one CPA. In summary 
our manipulations have not resulted in either hardware or speed improvement but have 
achieved regularity.

Column    Summands (ā denotes the complemented bit)
15        a7a6
14        a7ā6
13        a7a5, a6a5
12        a7a4, a6ā5
11        a7a3, a6a4, a5a4
10        a7a2, a6a3, a5ā4
9         a7a1, a6a2, a5a3, a4a3
8         a7a0, a6a1, a5a2, a4ā3
7         a6a0, a5a1, a4a2, a3a2
6         a5a0, a4a1, a3ā2
5         a4a0, a3a1, a2a1
4         a3a0, a2ā1
3         a2a0, a1a0
2         a1ā0
1         (empty)
0         a0
Figure 3.5: Reduced and regular array of summands for an 8 bit squarer

Figure 3.6: CSA implementation of reduced 8 bit squarer (full adders per level: 2 at level 1, 7 at level 2).
Our method and that of [58] rely on the same basic principle, i.e. first reduce the 
squaring matrix and then apply Dadda's scheme to obtain the final result. In the process 
of using Dadda's scheme both methods rely on a CPA to compute the final sum. Since 
this step requires the maximum amount of time the CPA is generally implemented using 
some fast carry lookahead adder. However, since both methods require this CPA, the 
cost and delay of this unit can be ignored without affecting the quality of the analysis. 
One might argue that the CPA in [58] is smaller than ours by 2 bits, but needless to say, 
this is marginal. Thus the comparison process reduces to estimating the hardware and 
time delay required for the data stream to reach the CPA. In [58] the authors rely on the 
properties of the squaring matrix which has the shape of a parallelogram while we rely 
on equation 3.1 and some simple manipulations.
We now estimate the amount of hardware and time delay required by [58] and
for the sake of clarity we use their notation. In [58], in order to reduce the height of the
columns, the authors define equations $L_1$ through $L_{16}$. The authors of [58] state that the
generation of these equations requires no full adders. While this is true, the hardware
and time delay required by these equations are about the same as those of a full adder. The cost of
hardware required by the terms $L_i$, where i = 1 through 16, is estimated as follows.
Terms $L_1$ and $L_2$ require no hardware. Terms $L_i$, where i = 5, 7, 9, 11, 13, are of the
form
$a_{(i+1)/2} a_{(i-1)/2} (a_{(i-3)/2} + a_{(i+3)/2}) + a_{(i-3)/2} (a_{(i+1)/2} \oplus a_{(i+3)/2})$.
Clearly this is a 3-level circuit with 6 two-input gates. A full adder can also be realized with 3 levels while
requiring only 5 two-input gates. Terms $L_i$, where i = 6, 8, 10, 12, 14, are of the form
$a_{(i-2)/2} (a_{i/2} \oplus a_{(i+2)/2}) + a_{(i-4)/2} a_{i/2} a_{(i+2)/2}$.
Each of these terms requires 5 two-input
gates and 3 levels. Each of the terms $L_3$ and $L_{16}$ requires one gate while each of the terms $L_4$
and $L_{15}$ requires two gates. From the above analysis it is clear that the generation of all $L_i$
terms requires the same delay as that of a full adder, i.e., $1 \times D_{CSA}$, while the total
hardware is 61 gates. The remaining $a_i a_j$ terms require 10 two-input gates, and reducing
the $L_i$ and $a_i a_j$ terms to two numbers that can serve as inputs to the CPA requires 2 full
adders and 3 half adders. These adders contribute another $1 \times D_{CSA}$. Thus the hardware
required is 87 two-input gates and the time delay is $2 \times D_{CSA}$. Once again, there is no
improvement in speed while the hardware cost is only marginally better. However, this
method requires four additional types of hardware units, over and above the full adder
and half adder. Two distinct types of units are needed to realize $L_4$ and $L_{15}$ while two
more distinct types of units are required to realize the other odd and even $L_i$ terms.
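The gate counts quoted in this paragraph can be re-added mechanically. The following Python sketch only tallies the per-term costs stated in the text; the dictionary layout and variable names are illustrative assumptions and are not taken from [58]. It assumes 5 two-input gates per full adder and 2 per half adder, as stated earlier, and it takes the one-gate term listed alongside L16 to be L3, the only index not otherwise accounted for.

```python
# Tally of the two-input gate counts quoted in the text for the method of [58].
# The grouping of terms mirrors the prose; all names are illustrative only.
FULL_ADDER_GATES = 5   # cost of a full adder assumed in the text
HALF_ADDER_GATES = 2   # cost of a half adder assumed in the text

gates_per_term = {}
for i in (1, 2):
    gates_per_term[i] = 0      # L_1 and L_2 need no hardware
for i in (3, 16):
    gates_per_term[i] = 1      # one gate each (L_3 is an assumption, see lead-in)
for i in (4, 15):
    gates_per_term[i] = 2      # two gates each
for i in (5, 7, 9, 11, 13):
    gates_per_term[i] = 6      # odd-indexed terms: 6 gates, 3 levels
for i in (6, 8, 10, 12, 14):
    gates_per_term[i] = 5      # even-indexed terms: 5 gates, 3 levels

l_term_gates = sum(gates_per_term.values())
product_term_gates = 10        # remaining a_i a_j terms
adder_gates = 2 * FULL_ADDER_GATES + 3 * HALF_ADDER_GATES
total_gates = l_term_gates + product_term_gates + adder_gates

print(l_term_gates, total_gates)
```

With these per-term costs the sixteen L terms add up to the 61 gates quoted above, and the overall figure to 87.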
In summary, we have presented an alternate parallel implementation for the 
squaring operation and have shown that while its performance and hardware cost are
approximately the same as those of [58], our implementation is not only regular but also
simpler to design and more modular in the sense of requiring fewer types of hardware 
units. Further analysis shows that the technique of [58] for higher word lengths 
produces hardware savings, but is slower. It appears that for small word lengths 
different designs yield similar hardware cost and speed functions. Thus for VLSI 
implementations it may be more important to focus on designs that are both regular and 
modular.
3.4 CSA implementation of the cyclic convolution
In the previous section we discussed CSA implementations of the squaring
operation in detail. We demonstrated that the computation of the squaring
operation is both faster and cheaper than that of a multiplication operation. We observe
that this is primarily due to the fact that the multiplication operation has $n^2$ summands
while the squaring operation has $\binom{n}{2} + n$ summands. While both operations require
summands in the order of $n^2$, the squaring operation in terms of absolute values
contains approximately half the summands. Thus, if a particular function can be
evaluated by using either squaring or multiplication operations and the number of
operations in either case is of the same order, then it is reasonable to expect
hardware savings in the magnitude of a factor as opposed to an order.
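As a quick numerical illustration of the "factor, not order" remark, the sketch below compares the two summand counts for a few word lengths. The function names are hypothetical and chosen only for this example.

```python
from math import comb


def multiplier_summands(n: int) -> int:
    """Partial-product bits of an n x n multiplication: n^2."""
    return n * n


def squarer_summands(n: int) -> int:
    """Summands of an n-bit square: C(n, 2) cross terms plus n diagonal terms."""
    return comb(n, 2) + n


for n in (8, 10, 16):
    m, s = multiplier_summands(n), squarer_summands(n)
    print(f"n={n}: multiplier={m}, squarer={s}, ratio={s / m:.3f}")
```

The ratio is (n + 1) / (2n), which tends to one half as the word length grows.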
Before we discuss implementations of the cyclic convolution operation, we
would once again like to emphasize that the purpose of these implementations is not to
provide the DSP engineer with an off-the-shelf design but rather to
analyze the effect of the increase in the number of additions introduced by our methodology.
Referring to equations (2.63) and (2.69) we find that we have reduced the number of 
squaring operations by one-third and at the same time we have increased the number of 
addition operations by a third. At this point we hypothesize that this is not a zero gain. 
In the ensuing sections we demonstrate the validity of this hypothesis by deriving the 
cost and speed functions of the cyclic convolution of 4, 8, and 16 points. We present 
three CSA based implementations, the first two are based on the definition and the third 
is based on our methodology. We call the first implementation, traditional, the second, 
modular, and the third, squares. We conclude our section by presenting a detailed 
discussion that analyzes all the results obtained.
3.4.1 4-point cyclic convolution-traditional
The cyclic convolution of two four point sequences is considered. Let the two
sequences be A and B with $A = \{a_3, a_2, a_1, a_0\}$ and $B = \{b_3, b_2, b_1, b_0\}$. Let each of
these points be of length 8 bits and represented in two's complement form. An
implementation by definition would require the computation of 16 products, i.e. every
point of a sequence is multiplied with every point of the other sequence. The cyclic
convolution C is given by $C = \{c_3, c_2, c_1, c_0\}$ with $c_0$, $c_1$, $c_2$, and $c_3$ defined as
$c_0 = a_0 b_0 + a_3 b_1 + a_2 b_2 + a_1 b_3$ (3.2)
$c_1 = a_1 b_0 + a_0 b_1 + a_3 b_2 + a_2 b_3$ (3.3)
$c_2 = a_2 b_0 + a_1 b_1 + a_0 b_2 + a_3 b_3$ (3.4)
$c_3 = a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3$ (3.5)
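For reference, equations (3.2)-(3.5) are the n = 4 case of the usual definition in which point k sums a[(k - j) mod n] * b[j] over j. A minimal Python sketch of this definition follows; the function name and the list-index convention (index 0 holds a0) are assumptions made for the example.

```python
def cyclic_convolution(a, b):
    """Cyclic convolution by definition: c[k] = sum_j a[(k - j) mod n] * b[j]."""
    n = len(a)
    if len(b) != n:
        raise ValueError("both sequences must have the same number of points")
    return [sum(a[(k - j) % n] * b[j] for j in range(n)) for k in range(n)]


# Index 0 holds a0 (respectively b0); each point uses n of the n * n products.
print(cyclic_convolution([1, 2, 3, 4], [5, 6, 7, 8]))
```

For n = 4 this evaluates the 16 products referred to in the text, four per output point.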
A pictorial representation of the cyclic convolution operation of two 4-point sequences,
each point consisting of 8 bits, is shown in figure 3.7. In this figure each point is
represented by +, the individual bits of each point by x, the product of two points by
⊗, and each point of the cyclic convolution by 0. Note that the product of two x's
gives another x and this operation is achieved by a two-input AND gate. Now, instead
of computing each product, i.e. evaluating ⊗, and then adding the four products to
obtain a point of the cyclic convolution, we can line up all the partial products of the
four products and then add them simultaneously using a CSA tree implementation. This
way, we would need only one CPA for each point of the cyclic convolution. We refer to
such a method of computation as traditional. The CSA tree is reduced based on the rules
given in [54]. The list L for ⊗, an 8 by 8 multiplication, is given as {1, 2, 3, 4, 5, 6, 7,
8, 7, 6, 5, 4, 3, 2, 1}. Since we are trying to add the partial products of four
multiplications, each element of this list has to be multiplied by 4. Applying this list to
algorithm 3.1, the following results are obtained:
Figure 3.7: Pictorial representation of a 4-point cyclic convolution
L = {4, 8, 12, 16, 20, 24, 28, 32, 28, 24, 20, 16, 12, 8, 4}
# of Full Adders = 222
# of Half Adders =13
# of Full Adders including CPA = 238
# of Half Adders including CPA = 14
# of CSA Levels = 8
# of CPA Levels = 1 
Size of CPA = 17
Number of 2-input gates including CPA =1218
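The input list and the final gate figure above can be reproduced with a few lines of Python. This is only a sketch of the bookkeeping: algorithm 3.1 and the reduction rules of [54] are not reproduced here, so the adder counts are taken from the results above rather than derived. The function names are hypothetical.

```python
def multiplier_column_heights(n):
    """Number of partial-product bits in each column of an n x n multiplier array."""
    return list(range(1, n + 1)) + list(range(n - 1, 0, -1))


def stacked_heights(n, k):
    """Column heights when the arrays of k separate n x n products are lined up."""
    return [k * h for h in multiplier_column_heights(n)]


def two_input_gates(full_adders, half_adders):
    """Gate cost with 5 gates per full adder and 2 per half adder, as assumed in the text."""
    return 5 * full_adders + 2 * half_adders


print(stacked_heights(8, 4))
print(two_input_gates(238, 14))
```

The same helper applies to the later lists, since every tabulated gate total in this section follows the 5 and 2 gate convention.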
Since there are four points, the total number of two-input gates is 4 x 1218 = 4872, while the
time delay remains $8 \times D_{CSA} + 1 \times D_{CPA}$. This is because we are performing the
computation of all four points in parallel. Also, each bit of the partial products requires a
two-input gate. Since each multiplication operation consists of 8 partial products each
with 8 bits, the number of two-input gates required to obtain these bits is equal to 64.
Since there are 16 multiplications, the total number of two-input gates required to
compute these bits is equal to 1024. The time delay to compute these bits is the delay of
one two-input AND gate; however, this delay is ignored because it is present no matter which
method is used. The results are summarized in the following two equations:
Hardware, 4T8 = 5896 (3.6)
Time Delay, 4T8 = $8 \times D_{CSA} + 1 \times D_{CPA}$ (3.7)
3.4.2 4-point cyclic convolution-modular
As the problem size becomes larger, i.e. as both the number of points and the size
of each point increase, it may not be possible to add the partial products of all the
multiplication operations simultaneously as done in the previous section. Therefore in
equations (3.2)-(3.5) each of the 16 products is first computed or, in other words,
referring to figure 3.7, each of the ⊗'s is evaluated. Then each point of the cyclic
convolution is obtained by adding its four associated ⊗'s. We refer to computation
based on such a method as modular. This addition is again done using a CSA tree
implementation. Thus, clearly there are two CPA delays, one for evaluating ⊗ and the
other for evaluating 0. Also there are some CSA delays that are associated with the
computation of ⊗ and 0. To compute the hardware and delays associated with the
computation of each of the 16 multiplication operations, ⊗, the list L = {1, 2, 3, 4, 5,
6, 7, 8, 7, 6, 5, 4, 3, 2, 1} is applied to algorithm 3.1. The following results are
obtained:
L = {1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1}
# of Full Adders = 35
# of Half Adders = 7
# of Full Adders including CPA = 49
# of Half Adders including CPA = 8
# of CSA Levels = 4
# of CPA Levels = 1 
Size of CPA =15
Number of 2-input gates including CPA =261
Since there are 16 such multiplications, the number of gates is 4176. Each one of these
multiplication operations produces a result that can be at most 16 bits long. Four such
results are added to obtain one point of the cyclic convolution. The hardware and delay
associated with such a computation can be obtained by applying the list L = {4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4} to algorithm 3.1. The following results are obtained:
L = {4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}
# of Full Adders = 30
# of Half Adders = 2
#  of Full Adders including CPA = 46
# of Half Adders including CPA = 3
# of CSA Levels = 2
#  of CPA Levels = 1 
Size of CPA =17
Number of 2-input gates including CPA = 236
Since there are four points, the total number of two-input gates is 944. Also, as explained
before, we need an additional 1024 two-input gates to generate the bits of the partial
products. Thus the total hardware and speed delays associated with the modular
approach can be summarized as
Hardware, 4M8 = 6144 (3.8)
Time Delay, 4M8 = $6 \times D_{CSA} + 2 \times D_{CPA}$ (3.9)
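The 4-point totals for the traditional and modular designs follow from the per-unit figures already tabulated. A short Python sketch of that arithmetic is given below; the variable names are illustrative, and delays are held as (CSA levels, CPA levels) pairs purely as a convenience for this example.

```python
POINTS = 4
PRODUCTS = POINTS * POINTS            # 16 multiplications
AND_GATES_PER_PRODUCT = 8 * 8         # one AND gate per partial-product bit

summand_gates = PRODUCTS * AND_GATES_PER_PRODUCT

# Traditional: one CSA tree plus CPA per output point (1218 gates each).
traditional_hw = POINTS * 1218 + summand_gates
traditional_delay = (8, 1)

# Modular: 16 multipliers (261 gates each), then one 4-operand adder per point (236 gates each).
modular_hw = PRODUCTS * 261 + POINTS * 236 + summand_gates
modular_delay = (4 + 2, 1 + 1)

print(traditional_hw, traditional_delay)
print(modular_hw, modular_delay)
```

The modular design trades two CSA levels for one extra CPA delay and roughly 4 percent more gates.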
3.4.3 4-point cyclic convolution-squares
In this section we estimate the hardware cost and speed required based on our
method. Since the points are expressed in two's complement form, the CSA tree
implementation can be designed with minor modifications [60]. These modifications
require no additional cost, and thus the negative signs in the equations can be treated
as positive signs and the design carried out as usual. Our method, outlined in chapter
2, is by nature modular, i.e. the design is broken into several small parts. Essentially the
following computations have to be performed: $c_0 + c_2$, $c_1 + c_3$, $c_0 - c_2$, $c_1 - c_3$. As
explained in chapter 2, the first two computations constitute part 1 of our methodology
while the latter two constitute part 2. Part 1 can be evaluated in three different ways
while there is only one way for part 2. In chapter 2 we compared the three
methods of part 1 based on ROM implementations. However, here we are interested in
CSA based implementations. Assuming for now that we use method 2 for part 1, then based
on theorems (2.3)-(2.6) developed in chapter 2, the following equations are defined.
$x_{41} = a_0 + a_2 + b_0 + b_2$ (3.10)
$x_{42} = a_0 + a_2 - b_0 - b_2$ (3.11)
$x_{43} = a_1 + a_3 + b_1 + b_3$ (3.12)
$x_{44} = a_1 + a_3 - b_1 - b_3$ (3.13)
$y_{41} = a_0 + a_2 + b_1 + b_3$ (3.14)
$y_{42} = a_0 + a_2 - b_1 - b_3$ (3.15)
$y_{43} = a_1 + a_3 + b_0 + b_2$ (3.16)
$y_{44} = a_1 + a_3 - b_0 - b_2$ (3.17)
$z_{41} = a_0 - a_2 + b_0 - b_2$ (3.18)
$z_{42} = a_0 - a_2 - b_0 + b_2$ (3.19)
$z_{43} = a_1 - a_3 + b_1 - b_3$ (3.20)
$z_{44} = a_1 - a_3 - b_1 + b_3$ (3.21)
$z_{45} = a_0 - a_2 + b_1 - b_3$ (3.22)
$z_{46} = a_0 - a_2 - b_1 + b_3$ (3.23)
$z_{47} = a_1 - a_3 + b_0 - b_2$ (3.24)
$z_{48} = a_1 - a_3 - b_0 + b_2$ (3.25)
The amount of hardware required for each of these equations can be obtained by 
applying the list L = {4, 4, 4, 4, 4, 4, 4, 4} to algorithm 3.1. Note that we are adding 
four operands each of length eight bits. Thus the height of each column is equal to 4. 
The following results are obtained:
L = {4, 4, 4, 4, 4, 4, 4, 4}
# of Full Adders = 14
# of Half Adders = 2
# of Full Adders including CPA = 22
# of Half Adders including CPA = 3
# of CSA Levels = 2
# of CPA Levels = 1 
Size of CPA = 9
Number of 2-input gates including CPA =116
Since there are 16 such equations, we have a total of 1856 two-input gates. Now
suppose we were to use theorems 2.1 and 2.2 for evaluating part 1. Then instead of the
8 equations given by (3.10)-(3.17) we would have four equations defined by (2.23)-(2.26).
The amount of hardware required for each of these equations can be obtained by
applying the list L = {8, 8, 8, 8, 8, 8, 8, 8} to algorithm 3.1, which gives the total
number of 2-input gates including the CPA as 275. Thus for four equations we would
need 1100 2-input gates, whereas the 8 equations of method 2 need 8 x 116 = 928.
Since 1100 (required by method 1) is greater than 928
(required by method 2), we can conclude that method 2 is better. A similar argument can
be constructed for method 3. We must also note that the terms of methods 1 and 3
contain more bits than those of method 2, which in turn implies that computation of their squares
will also require more hardware. Therefore from now on we will confine ourselves to
evaluating part 1 of our methodology based on method 2.
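The points of the cyclic convolution are given, based on theorems (2.3)-(2.6), by
$c_0 = \frac{1}{8}(x_{41}^2 - x_{42}^2 + x_{43}^2 - x_{44}^2 + z_{41}^2 - z_{42}^2 - z_{43}^2 + z_{44}^2)$ (3.26)
$c_1 = \frac{1}{8}(y_{41}^2 - y_{42}^2 + y_{43}^2 - y_{44}^2 + z_{45}^2 - z_{46}^2 + z_{47}^2 - z_{48}^2)$ (3.27)
$c_2 = \frac{1}{8}(x_{41}^2 - x_{42}^2 + x_{43}^2 - x_{44}^2 - z_{41}^2 + z_{42}^2 + z_{43}^2 - z_{44}^2)$ (3.28)
$c_3 = \frac{1}{8}(y_{41}^2 - y_{42}^2 + y_{43}^2 - y_{44}^2 - z_{45}^2 + z_{46}^2 - z_{47}^2 + z_{48}^2)$ (3.29)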
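Each of the terms to be squared in equations (3.26)-(3.29) is of length 10 bits. Thus we
first compute the cost and delay of a 10 bit squarer. Let A be a 10 bit number with
$A = a_9 a_8 \ldots a_0$. Then the square of A is given as
$A^2 = a_9 2^{18} + a_9 a_8 2^{18} + a_9 a_7 2^{17} + a_8 2^{16} + a_9 a_6 2^{16} + a_8 a_7 2^{16} + a_9 a_5 2^{15} + a_8 a_6 2^{15}$
$+ a_7 2^{14} + a_9 a_4 2^{14} + a_8 a_5 2^{14} + a_7 a_6 2^{14} + a_9 a_3 2^{13} + a_8 a_4 2^{13} + a_7 a_5 2^{13} + a_6 2^{12}$
$+ a_9 a_2 2^{12} + a_8 a_3 2^{12} + a_7 a_4 2^{12} + a_6 a_5 2^{12} + a_9 a_1 2^{11} + a_8 a_2 2^{11} + a_7 a_3 2^{11}$
$+ a_6 a_4 2^{11} + a_5 2^{10} + a_9 a_0 2^{10} + a_8 a_1 2^{10} + a_7 a_2 2^{10} + a_6 a_3 2^{10} + a_5 a_4 2^{10} + a_8 a_0 2^{9}$
$+ a_7 a_1 2^{9} + a_6 a_2 2^{9} + a_5 a_3 2^{9} + a_4 2^{8} + a_7 a_0 2^{8} + a_6 a_1 2^{8} + a_5 a_2 2^{8} + a_4 a_3 2^{8}$
$+ a_6 a_0 2^{7} + a_5 a_1 2^{7} + a_4 a_2 2^{7} + a_3 2^{6} + a_5 a_0 2^{6} + a_4 a_1 2^{6} + a_3 a_2 2^{6} + a_4 a_0 2^{5}$
$+ a_3 a_1 2^{5} + a_2 2^{4} + a_3 a_0 2^{4} + a_2 a_1 2^{4} + a_2 a_0 2^{3} + a_1 2^{2} + a_1 a_0 2^{2} + a_0$
(3.30)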
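The expansion (3.30) has the compact form: each bit a_i contributes a_i 2^(2i), because a_i squared equals a_i, and each pair i > j contributes a_i a_j 2^(i+j+1). The Python sketch below checks this form exhaustively for 10 bits. It treats A as an unsigned integer for simplicity, which is an assumption of the example; the function name is hypothetical.

```python
def square_from_bits(value: int, width: int = 10) -> int:
    """Rebuild value**2 from single-bit terms and two-input AND terms only."""
    bits = [(value >> i) & 1 for i in range(width)]
    total = 0
    for i in range(width):
        # Diagonal term: a_i * a_i = a_i, weight 2^(2i); no gate needed.
        total += bits[i] << (2 * i)
        for j in range(i):
            # Cross term: 2 * a_i * a_j * 2^(i+j) = (a_i AND a_j) * 2^(i+j+1).
            total += (bits[i] & bits[j]) << (i + j + 1)
    return total


# Exhaustive check over all 10-bit unsigned values.
assert all(square_from_bits(v) == v * v for v in range(1 << 10))
print(square_from_bits(1023))
```

The inner loop runs once per cross term, which is where the count of two-input AND gates for the squarer comes from.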
Each of the product terms $a_i a_j$ can be obtained using a two-input AND gate. Since there
are 10 bits, the number of gates required is $\binom{10}{2}$, or 45. We apply the manipulation
outlined in section 3.3 before we square the number. Such a manipulation does not yield
hardware savings; however, it results in a compact array of summands, thus allowing
the rules of [54] to be applied more effectively. Also, this manipulation adds a small
cost by increasing the number of summands, in this case by 9. Thus each term to be
squared requires 54 two-input gates and the total for all 16 squares is 864. The cost and
delay associated with the computation of such a square can be obtained by applying the
list L = {1, 0, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1} to algorithm 3.1. The
results are:
L = {1, 0, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1}
# of Full Adders = 20
# of Half Adders = 7
# of Full Adders including CPA = 39
# of Half Adders including CPA = 8
# of CSA Levels = 3
#  of CPA Levels = 1 
Size of CPA = 20
Number of 2-input gates including CPA = 211
Since there are 16 such squares, the total number of 2-input gates can be given as 3376.
Finally, equations (3.26)-(3.29) have to be evaluated. From these equations we find that
there are basically only four terms that have to be computed. These terms are
$P = x_{41}^2 - x_{42}^2 + x_{43}^2 - x_{44}^2$ (3.31)
$Q = y_{41}^2 - y_{42}^2 + y_{43}^2 - y_{44}^2$ (3.32)
$R = z_{41}^2 - z_{42}^2 - z_{43}^2 + z_{44}^2$ (3.33)
$S = z_{45}^2 - z_{46}^2 + z_{47}^2 - z_{48}^2$ (3.34)
Then equations (3.26)-(3.29) can be rewritten as
$c_0 = \frac{1}{8}(P + R)$ (3.35)
$c_1 = \frac{1}{8}(Q + S)$ (3.36)
$c_2 = \frac{1}{8}(P - R)$ (3.37)
$c_3 = \frac{1}{8}(Q - S)$ (3.38)
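The 4-point squares formulation can be checked numerically against the definition. The Python sketch below reads P, Q, R and S as signed sums of the squares of the terms in (3.10)-(3.25), which is how (3.26)-(3.29) are interpreted here. The function names are hypothetical, and list index 0 holds a0 (respectively b0).

```python
import random


def cyclic_convolution(a, b):
    """Reference result by definition: c[k] = sum_j a[(k - j) mod n] * b[j]."""
    n = len(a)
    return [sum(a[(k - j) % n] * b[j] for j in range(n)) for k in range(n)]


def convolution_by_squares(a, b):
    """4-point cyclic convolution using 16 squares and no general multiplications."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    x = [a0 + a2 + b0 + b2, a0 + a2 - b0 - b2, a1 + a3 + b1 + b3, a1 + a3 - b1 - b3]
    y = [a0 + a2 + b1 + b3, a0 + a2 - b1 - b3, a1 + a3 + b0 + b2, a1 + a3 - b0 - b2]
    z = [a0 - a2 + b0 - b2, a0 - a2 - b0 + b2, a1 - a3 + b1 - b3, a1 - a3 - b1 + b3,
         a0 - a2 + b1 - b3, a0 - a2 - b1 + b3, a1 - a3 + b0 - b2, a1 - a3 - b0 + b2]
    xs, ys, zs = ([v * v for v in t] for t in (x, y, z))
    p = xs[0] - xs[1] + xs[2] - xs[3]
    q = ys[0] - ys[1] + ys[2] - ys[3]
    r = zs[0] - zs[1] - zs[2] + zs[3]
    s = zs[4] - zs[5] + zs[6] - zs[7]
    # Each sum is an exact multiple of 8, so integer division loses nothing.
    return [(p + r) // 8, (q + s) // 8, (p - r) // 8, (q - s) // 8]


rng = random.Random(0)
for _ in range(1000):
    a = [rng.randint(-128, 127) for _ in range(4)]   # 8-bit two's complement range
    b = [rng.randint(-128, 127) for _ in range(4)]
    assert convolution_by_squares(a, b) == cyclic_convolution(a, b)
print(convolution_by_squares([1, 2, 3, 4], [5, 6, 7, 8]))
```

Each difference of two squares such as x41^2 - x42^2 equals four times a product of two sums, which is why the final division by 8 is exact.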
The cost and delay required to compute each of equations (3.31)-(3.34) can be obtained
by applying the list L = {4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4} to
algorithm 3.1. Note that although the size of the CPA required in the previous
computation was 20 bits, we know that there will be no carry-out as the square of a 10
bit number can be no more than 20 bits. Thus we are adding four 20-bit numbers and
therefore the height of each column is 4. The results are:
L = {4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}
# of Full Adders = 38
# of Half Adders = 2
# of Full Adders including CPA = 58
# of Half Adders including CPA = 3
# of CSA Levels = 2
# of CPA Levels = 1 
Size of CPA = 21
Number of 2-input gates including CPA = 296
Since there are four such equations, the total number of gates required is 1184. Finally,
equations (3.35)-(3.38) have to be evaluated. Each one of these equations requires a
CPA of size 22 bits, or 21 full adders and 1 half adder, giving a total of 107 gates. Four
CPAs would therefore require 428 two-input gates. In summary
Hardware, 4S8 = 7708 (3.39)
Time Delay, 4S8 = $7 \times D_{CSA} + 4 \times D_{CPA}$ (3.40)
To summarize briefly, we observe that our method based on squaring operations, in the
case of the 4-point cyclic convolution, achieves neither hardware savings nor a gain in speed
of computation. However, we note that there were no savings in the number of
operations to begin with.
3.4.4 8-point cyclic convolution-traditional
The cyclic convolution of two eight point sequences is now considered. Let the
two sequences be A and B with $A = \{a_7, a_6, a_5, a_4, a_3, a_2, a_1, a_0\}$ and
$B = \{b_7, b_6, b_5, b_4, b_3, b_2, b_1, b_0\}$. Let each of these points be of length 8 bits and represented in two's
complement form. A traditional implementation would require the computation of 64
products, i.e. every point of a sequence is multiplied with every point of the other
sequence. The cyclic convolution C is given by $C = \{c_7, c_6, c_5, c_4, c_3, c_2, c_1, c_0\}$ with
$c_0, \ldots, c_7$ defined as
$c_0 = a_0 b_0 + a_7 b_1 + a_6 b_2 + a_5 b_3 + a_4 b_4 + a_3 b_5 + a_2 b_6 + a_1 b_7$ (3.41)
$c_1 = a_1 b_0 + a_0 b_1 + a_7 b_2 + a_6 b_3 + a_5 b_4 + a_4 b_5 + a_3 b_6 + a_2 b_7$ (3.42)
$c_2 = a_2 b_0 + a_1 b_1 + a_0 b_2 + a_7 b_3 + a_6 b_4 + a_5 b_5 + a_4 b_6 + a_3 b_7$ (3.43)
$c_3 = a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3 + a_7 b_4 + a_6 b_5 + a_5 b_6 + a_4 b_7$ (3.44)
$c_4 = a_4 b_0 + a_3 b_1 + a_2 b_2 + a_1 b_3 + a_0 b_4 + a_7 b_5 + a_6 b_6 + a_5 b_7$ (3.45)
$c_5 = a_5 b_0 + a_4 b_1 + a_3 b_2 + a_2 b_3 + a_1 b_4 + a_0 b_5 + a_7 b_6 + a_6 b_7$ (3.46)
$c_6 = a_6 b_0 + a_5 b_1 + a_4 b_2 + a_3 b_3 + a_2 b_4 + a_1 b_5 + a_0 b_6 + a_7 b_7$ (3.47)
$c_7 = a_7 b_0 + a_6 b_1 + a_5 b_2 + a_4 b_3 + a_3 b_4 + a_2 b_5 + a_1 b_6 + a_0 b_7$ (3.48)
Now, as outlined in section 3.4.1, instead of computing each product and then adding 
the eight products to obtain a point of the cyclic convolution, we can line up all the 
partial products of the eight points and then add them simultaneously using a CSA tree 
implementation. This way, we would need only one CPA for each point of the cyclic 
convolution. The list L for an 8 by 8 multiplication is given as {1, 2, 3, 4, 5, 6, 7, 8, 7,
6, 5, 4, 3, 2, 1}. Since we are trying to add the partial products of eight multiplications, 
each element of this list has to be multiplied by 8. Applying this list to algorithm 3.1, the 
following results are obtained:
L = {8, 16, 24, 32, 40, 48, 56, 64, 56, 48, 40, 32, 24, 16, 8}
# of Full Adders = 476
# of Half Adders =18
# of Full Adders including CPA = 493
# of Half Adders including CPA = 19
# of CSA Levels =10
# of CPA Levels = 1 
Size of CPA =18
Number of 2-input gates including CPA = 2503
Since there are eight points, the total number of two-input gates is 20,024, while
the time delay remains $10 \times D_{CSA} + 1 \times D_{CPA}$. This is because we are performing the
computation of all eight points in parallel. Also, each bit of the partial products requires
a two-input gate. Since each multiplication operation consists of 8 partial products each
with 8 bits, the number of two-input gates required to obtain these bits is equal to 64.
Since there are 64 multiplications, the total number of two-input gates required to
generate these bits is equal to 4096. The time delay to compute these bits is the delay of
one two-input AND gate; however, again this delay is ignored because it is present no matter which
method is used. The results are summarized in the following two
equations:
Hardware, 8T8 = 24120 (3.49)
Time Delay, 8T8 = $10 \times D_{CSA} + 1 \times D_{CPA}$ (3.50)
3.4.5 8-point cyclic convolution-modular
As the problem size becomes larger, i.e. as both the number of points and the size
of each point increase, it may not be possible to add the partial products of all the
multiplication operations simultaneously. Therefore in equations (3.41)-(3.48) each of
the 64 products is first computed. Then each point is evaluated by adding its eight
associated operands. This addition is again done using a CSA tree implementation.
Thus, clearly there are two CPA delays and some CSA delays, which are determined as
follows. Since each of the 64 multiplications is of the same size as that in section
3.4.2, the hardware and delays associated with the computation are the same as those
estimated earlier. This cost is therefore 261 two-input gates for each multiplication, for a
total of 16704, and a time delay of $4 \times D_{CSA} + 1 \times D_{CPA}$ with a CPA size of 15. Each
one of these multiplication operations produces a result that can be at most 16 bits long.
Eight such results are added to obtain one point of the cyclic convolution. The hardware
and delay associated with such a computation can be obtained by applying the list L =
{8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8} to algorithm 3.1. The following results are
obtained:
L = {8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}
# of Full Adders = 92
# of Half Adders = 4
# of Full Adders including CPA = 109
# of Half Adders including CPA = 5
# of CSA Levels = 4
# of CPA Levels = 1 
Size of CPA =18
Number of 2-input gates including CPA = 555
Since there are eight points the number of two-input gates is 4440. Also, as explained 
before we need an additional 4096 two-input gates to generate the bits of the partial 
products. Thus the total hardware and speed delays associated with the modular 
approach can be summarized as
Hardware, 8M8 = 25240 (3.51)
Time Delay, 8M8 = $8 \times D_{CSA} + 2 \times D_{CPA}$ (3.52)
We must note that such a modular approach has, in all the steps limited the 
height of the tallest column in any CSA tree to a maximum of 8. Thus in order to ensure
a fair comparison we must make sure that our new proposed methods do not involve
steps that require CSA trees whose columns are much taller.
3.4.6 8-point cyclic convolution-squares
In this section we estimate the hardware cost and speed required based on our 
method, using a total of 48 squares. Our method can be divided into three distinct 
modules as shown in figures 3.8, 3.9, and 3.10. We do not develop all the equations as 
the intent in this section is primarily to estimate the hardware cost and speed. However, 
we list all the steps in each module and their associated costs and delays.
Figure 3.8: 8-point cyclic convolution-module 1
Figure 3.9: 8-point cyclic convolution-module 2
Figure 3.10: 8-point cyclic convolution-module 3
Module 1: Computes $(c_0 + c_4)$, $(c_2 + c_6)$, $(c_1 + c_5)$, $(c_3 + c_7)$
Step 1:
a) Use theorem 2.3 to generate terms $x_{81}$ through $x_{84}$.
b) Use theorem 2.4 to generate terms $y_{81}$ through $y_{84}$.
c) Use theorem 2.5 to generate terms $z_{81}$ through $z_{84}$.
d) Use theorem 2.6 to generate terms $z_{85}$ through $z_{88}$.
Each of the above 16 terms is of the same size and has an identical
structure, i.e. each is formed by adding/subtracting 8 points of the input sequences.
The hardware and delay associated with the computation of these terms can be estimated
by applying the list L = {8, 8, 8, 8, 8, 8, 8, 8} to algorithm 3.1. Note that we are
adding eight 8-bit numbers. Thus the height of each column is equal to 8. The results
are:
L = {8, 8, 8, 8, 8, 8, 8, 8}
# of Full Adders = 44
# of Half Adders = 4
# of Full Adders including CPA = 53
# of Half Adders including CPA = 5
# of CSA Levels = 4
# of CPA Levels = 1
Size of CPA = 10
Number of 2-input gates including CPA = 275
Since there are 16 such equations, we have a total of 4400 two-input gates. Results of
step 1 can be summarized as
Step 1, hardware = 4400 (3.52)
Step 1, time delay = $4 \times D_{CSA} + 1 \times D_{CPA}$ (3.53)
We note that the height of the tallest column in this step is 8.
Step 2:
Each of the terms obtained in step 1 is of length 11 bits and needs to be squared, 
resulting in a term that can have at most 22 bits. The cost and delay associated with the 
computation of the square can be obtained by applying the list L = {1, 0, 1, 2, 2, 3, 3, 
4, 4, 5, 5, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1}, to algorithm 3.1. The results are:
{1, 0, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1}
#  of Full Adders = 27
# of Half Adders = 9
# of Full Adders including CPA = 48
# of Half Adders including CPA = 10
# of CSA Levels = 3
# of CPA Levels = 1 
Size of CPA = 22
Number of 2-input gates including CPA = 260
Since there are 16 such squares, the total number of 2-input gates can be given as 4160.
Also, each of the 16 terms that need to be squared has $\binom{11}{2} + 10 = 65$ input bits of the
form $a_i a_j$, thus requiring a total of 1040 two-input AND gates. Results of step 2 can be
summarized as
Step 2, hardware = 5200 (3.54)
Step 2, time delay = $3 \times D_{CSA} + 1 \times D_{CPA}$ (3.55)
We note that the height of the tallest column in this step is 6.
Step 3:
The results of step 2 are used to generate $4\sum c_{2i}$, $4\sum c_{2i+1}$, $4\sum c_{2i}(-1)^i$,
and $4\sum c_{2i+1}(-1)^i$ based on equations (2.33), (2.38), (2.43), and (2.48). This is
achieved by grouping the 16 squares into sets of four and adding/subtracting the terms.
The hardware and delay associated with this computation is obtained by applying the list
L = {4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}. Note that we are
adding four 22-bit numbers. The results are:
L = {4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}
# of Full Adders = 42
# of Half Adders = 2
# of Full Adders including CPA = 64
# of Half Adders including CPA = 3
# of CSA Levels = 2
# of CPA Levels = 1 
Size of CPA = 23
Number of 2-input gates including CPA = 326
Since there are four such groups, the total is 1304. Results of step 3 can be summarized
as
Step 3, hardware = 1304 (3.56)
Step 3, time delay = $2 \times D_{CSA} + 1 \times D_{CPA}$ (3.57)
We note that the height of the tallest column in this step is 4. The output of each CPA at
this step has a length of 24 bits. However, since we know that the result is four times
the desired value, the last two bits of this computation are zero and can hence be ignored
to divide by four. Also, since each of $\sum c_{2i}$, $\sum c_{2i+1}$, $\sum c_{2i}(-1)^i$, $\sum c_{2i+1}(-1)^i$
is the sum of four points of the cyclic convolution, their length cannot be greater than 16
+ log2 32 = 21. Thus the most significant bit (MSB) is also stripped.
Step 4:
The following computations now need to be performed:
$\sum c_{2i} + \sum c_{2i}(-1)^i = 2(c_0 + c_4)$
$\sum c_{2i} - \sum c_{2i}(-1)^i = 2(c_2 + c_6)$
$\sum c_{2i+1} + \sum c_{2i+1}(-1)^i = 2(c_1 + c_5)$
$\sum c_{2i+1} - \sum c_{2i+1}(-1)^i = 2(c_3 + c_7)$
Each of these computations requires a CPA of size 21 bits. The cost of such a CPA is
equal to 102 two-input gates. Since we have four CPAs the total cost is 408 gates.
Results of step 4 can be summarized as
Step 4, hardware = 408 (3.58)
Step 4, time delay = $1 \times D_{CPA}$ (3.59)
Again, since we know that the result is twice the desired value, the last bit of this
computation is zero and can hence be ignored to divide by two. Also, since each of
$\sum c_{4i}$, $\sum c_{4i+1}$, $\sum c_{4i+2}$, $\sum c_{4i+3}$ is the sum of two points of the cyclic
convolution, their length cannot be greater than 16 + log2 16 = 20. Therefore the MSB
is also stripped. Thus these results are of length 20 bits.
Module 2: Computes $(c_0 - c_4)$, $(c_2 - c_6)$, $(c_1 - c_5)$, $(c_3 - c_7)$
Step 1:
a) Use theorem 2.8 to generate terms $v^0_{81}$ through $v^0_{88}$.
b) Use theorem 2.8 to generate terms $v^1_{81}$ through $v^1_{88}$.
c) Use theorem 2.8 to generate terms $v^2_{81}$ through $v^2_{88}$.
d) Use theorem 2.8 to generate terms $v^3_{81}$ through $v^3_{88}$.
Each of the above 32 terms is of the same size and has an identical
structure, i.e. each is formed by adding/subtracting 4 points of the input sequences.
The hardware and delay associated with the computation of these terms can be estimated
by applying the list L = {4, 4, 4, 4, 4, 4, 4, 4} to algorithm 3.1. Note that we are
adding four 8-bit numbers. Thus the height of each column is equal to 4. Previously,
the hardware cost for such a computation was calculated as 116 two-input gates and the
time delay as $2 \times D_{CSA} + 1 \times D_{CPA}$ with a CPA of size 9 bits. Since there are 32 such
terms, we have a total of 3712 two-input gates. Results of step 1 can be summarized as
Step 1, hardware = 3712 (3.60)
Step 1, time delay = $2 \times D_{CSA} + 1 \times D_{CPA}$ (3.61)
Step 2:
Each of the terms obtained in step 1 is of length 10 bits and needs to be squared,
resulting in a term that can have at most 20 bits. The cost and delay associated with the
computation of a 10 bit square was previously estimated at 211 two-input gates and a
time delay of $3 \times D_{CSA} + 1 \times D_{CPA}$ with a 20-bit CPA. Since there are 32 such squares,
the total number of 2-input gates can be given as 6752. Also, each of the 32 terms that
need to be squared has 54 input bits of the form $a_i a_j$, thus requiring a total of 1728
two-input AND gates. Results of step 2 can be summarized as
Step 2, hardware = 8480 (3.62)
Step 2, time delay = $3 \times D_{CSA} + 1 \times D_{CPA}$ (3.63)
We note that the height of the tallest column in this step is 5, the largest entry of the list L used for the 10 bit squarer.
Step 3:
The results of step 2 are used to generate $4\sum c_{4i}(-1)^i$, $4\sum c_{4i+1}(-1)^i$, $4\sum c_{4i+2}(-1)^i$,
and $4\sum c_{4i+3}(-1)^i$ based on equation (2.38). This is achieved by grouping the 32
squares into four sets and adding/subtracting the 8 terms within each set. The
hardware and delay associated with this computation is obtained by applying the list L
= {8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8} to algorithm 3.1. Note that
we are adding eight 20-bit numbers. The results are:
L = {8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}
# of Full Adders =116
# of Half Adders = 4
# of Full Adders including CPA = 137
# of Half Adders including CPA = 5
# of CSA Levels = 4
# of CPA Levels =1 
Size of CPA = 22
Number of 2-input gates including CPA = 695
Since there are four such groups, the total is 2780. Results of step 3 can be summarized
as
Step 3, hardware = 2780 (3.64)
Step 3, time delay = $4 \times D_{CSA} + 1 \times D_{CPA}$ (3.65)
We note that the height of the tallest column in this step is 8. Also, since we know that
the result is four times the desired value, the last two bits of this computation are zero
and can hence be ignored to divide by four. Moreover, since each of $\sum c_{4i}(-1)^i$,
$\sum c_{4i+1}(-1)^i$, $\sum c_{4i+2}(-1)^i$, $\sum c_{4i+3}(-1)^i$ is the sum of two points of the cyclic
convolution, their length cannot be greater than 16 + log2 16 = 20. Therefore the MSB
is also stripped. Thus the length of each of these results is 20 bits.
Module 3: Computes $c_0, c_4, c_2, c_6, c_1, c_5, c_3, c_7$
This module adds/subtracts the results of modules 1 and 2 to obtain the points of 
the cyclic convolution. All the computations are done in a single step with the help of 8 
CPAs of size 20 bits. The cost of such a CPA is 97 gates thereby giving a total of 776 
gates. Results of this module can be summarized as
Step 1, hardware = 776 (3.66)
Step 1, time delay = $1 \times D_{CPA}$ (3.67)
Again, since we know that the result is twice the desired value, the last bit of this 
computation is zero and can hence be ignored to divide by two. Also, since each point 
of the cyclic convolution cannot be of length greater than 19 bits, the MSB is also 
stripped.
In summary, adding the results of equations (3.52), (3.54), (3.56), (3.58), (3.60), 
(3.62), (3.64), and (3.66) we obtain the hardware cost of all modules as
Hardware, 8S8 = 27060 (3.68)
Looking at figures (3.8) and (3.9) we observe that these two modules operate in 
parallel and therefore the module with the higher delay and the delay of module 3 
account for the total delay. Thus adding the results of equations (3.53), (3.55), (3.57), 
(3.59), and (3.67) we obtain the total delay for the computation as
Time Delay, 8S8 = $9 \times D_{CSA} + 5 \times D_{CPA}$ (3.69)
3.4.7 16-point cyclic convolution-traditional
Since we have already presented the cyclic convolution of 4 and 8 points in 
detail, we keep the presentation over here very brief. Each point of the cyclic 
convolution is obtained by adding simultaneously the summands of 16, 8x8 products. 
The cost of such a computation is 5066 gates for a total of 81056 gates. The number of 
gates required for the summands is 256 times 64, or 16384. In summary,
Hardware, 16T8 = 97440 (3.70)
Time Delay, 16T8 = $11 \times D_{CSA} + 1 \times D_{CPA}$ (3.71)
We note that the height of the tallest column is 128.
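The 11 CSA levels can be sanity-checked with the standard carry-save reduction rule, in which every level turns each group of three operands into two. The sketch below is not algorithm 3.1. It is a simplified estimate that treats the height of the tallest column as the number of operands and ignores the shape of the other columns, and the function name `csa_levels` is hypothetical.

```python
def csa_levels(operands: int) -> int:
    """Levels of 3:2 carry-save adders needed to reduce `operands` rows to 2."""
    levels = 0
    while operands > 2:
        # Each full group of three rows becomes two; leftover rows pass through.
        operands = 2 * (operands // 3) + operands % 3
        levels += 1
    return levels


for height in (32, 64, 128):
    print(height, csa_levels(height))  # 32 -> 8, 64 -> 10, 128 -> 11
```

For heights 32, 64, and 128 the estimate gives 8, 10, and 11 levels, which agrees with the CSA delays listed for the traditional method in table 3.2.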
3.4.8 16-point cyclic convolution-modular
Each of the 16 products that constitute a single point of the cyclic convolution 
are first evaluated and then the 16 results are added. The cost of an 8x8 product is 261 
gates and since in all there are 256 such products we have a total of 66816 gates. The 
cost of adding 16 products is 1198 gates and for 16 such computations we have a total 
of 19168 gates. The number of gates required by the summands is as before, 16384. In 
summary,
Hardware, 16M8 = 102368 (3.72)
Time Delay, 16M8 = $10 \times D_{CSA} + 2 \times D_{CPA}$ (3.73)
We note that the height of the tallest column is 16.
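The two 16-point totals follow directly from the per-unit costs quoted in sections 3.4.7 and 3.4.8. A short tally, using only numbers stated in the text (the variable names are illustrative):

```python
# AND gates that form the partial-product bits: 256 products of 64 bits each.
SUMMAND_GATES = 256 * 64

# Traditional: 16 output points, each one a single 5066-gate adder tree.
traditional_16 = 16 * 5066 + SUMMAND_GATES

# Modular: 256 separate 8x8 products (261 gates each), then 16 adder trees
# (1198 gates each) that add 16 products apiece.
modular_16 = 256 * 261 + 16 * 1198 + SUMMAND_GATES

print(traditional_16, modular_16)  # 97440 102368
```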
3.4.9 16-point cyclic convolution-squares
Module 1 computes $(c_0 + c_4 + c_8 + c_{12})$, $(c_2 + c_6 + c_{10} + c_{14})$, $(c_1 + c_5 + c_9 + c_{13})$, and
$(c_3 + c_7 + c_{11} + c_{15})$. The cost and time delay associated with such a computation can
be summarized as
Hardware, 16S8, module 1 = 17552 (3.74)
Time Delay, 16S8, module 1 = $11 \times D_{CSA} + 4 \times D_{CPA}$ (3.75)
Module 2 computes $(c_0 - c_4 + c_8 - c_{12})$, $(c_2 - c_6 + c_{10} - c_{14})$, $(c_1 - c_5 + c_9 - c_{13})$, and
$(c_3 - c_7 + c_{11} - c_{15})$. The cost and time delay associated with such a computation can be
summarized as
Hardware, 16S8, module 2 = 23116 (3.76)
Time Delay, 16S8, module 2 = $11 \times D_{CSA} + 4 \times D_{CPA}$ (3.77)
Module 3 computes $(c_0 - c_8)$, $(c_2 - c_{10})$, $(c_4 - c_{12})$, $(c_6 - c_{14})$, $(c_1 - c_9)$, $(c_3 - c_{11})$,
$(c_5 - c_{13})$, $(c_7 - c_{15})$. The cost and time delay associated with such a computation can be
summarized as
Hardware, 16S8, module 3 = 61352 (3.78)
Time Delay, 16S8, module 3 = $11 \times D_{CSA} + 3 \times D_{CPA}$ (3.79)
Module 4 computes $c_0, c_1, c_2, c_3, c_4, c_5, c_6, c_7, c_8, c_9, c_{10}, c_{11}, c_{12}, c_{13}, c_{14}$, and $c_{15}$.
The cost and time delay associated with such a computation can be summarized as
Hardware, 16S8, module 4 = 1632 (3.80)
Hardware, 16S8 = 103652 (3.82)
Time Delay, 16S8 = $11 \times D_{CSA} + 6 \times D_{CPA}$ (3.83)
Table 3.1: Hardware cost in 2-input gates for cyclic convolution of 4, 8, and 16 points
Method         4 points    8 points    16 points
Traditional    5896        24120       97440
Modular        6144        25240       102368
Squares        7708        27060       103652
Table 3.1 summarizes the hardware cost in 2-input gates for computing the
cyclic convolution of 4, 8, and 16 points based on all three methods. At first glance it
appears that our method is at best competitive. But this is not the case for several
reasons.
1) In the traditional implementation there are basically no modules. Therefore the 
silicon area of the chip is directly a function of the height of the tallest column in the 
CSA tree. The height of the tallest column is the product of the number of points in the 
sequences to be convolved and the word length of each point. For instance, with a word 
length of 8 bits per point, in the case of four points the height is 4 times 8 or 32, for 8 
points it is 8 times 8 or 64, and for 16 points it is 16 times 8 or 128. Clearly as the 
number of points increases the height increases. If in a given technology, this height can 
be managed then the traditional method is the best method of implementation. On the 
other hand if the problem has to be broken into smaller components then one must have 
a systematic way of doing so. The modular approach is one method and our approach 
based on squares is the other. Thus in the event the traditional implementation is not 
feasible one might consider implementations based on these methods and thus the 
comparison is restricted to these two methods, modular and squares.
2) Looking at table 3.1 again, we find that in the case of 4 points our method based 
on squares is worse than the modular by 25%, in the case of 8 points by 7%, and in the 
case of 16 points by 1%. However, from table 2.1, we see that the savings in squaring 
operations in the three cases is 0%, 25%, and 31.25%. Thus it is only reasonable for us 
to speculate that with an increase in the number of points being convolved the hardware 
savings will increase.
3) Our main purpose of all the preceding analysis was to show that in spite of the 
increase in the number of additions caused by the use of our methodology, the decrease 
in squaring operations will be more beneficial. Now, here is the surprise. All the
preceding implementations were 100% parallel. In other words they used many more 
addition operations than that given by equation (2.69). This is because, in the 
construction of our equations, that need to be squared, there are several common terms. 
But for a parallel implementation these common terms are evaluated as many times as 
they are needed, thus increasing the hardware. By the same token, for the modular 
approach, there are no common terms in the first place and therefore there is no question 
of redundant computation. Thus our implementation model is very kind to the modular 
approach, in the sense that it is ideally suited to that approach. Thus it is only fair for us 
to conclude that if an implementation structure that is capable of exploiting the properties 
of our method is selected, then the savings in squaring operations will pay off.
4) One might argue as to why another model that would be more suitable to our 
approach was not selected. However, if one has to be fair to all methods, then the 
selection of such a model would be difficult if not impossible. Therefore we chose the 
worst case scenario for our method which at the same time is best case scenario for the 
modular method and have shown that in spite of being against all odds, we are at the 
least competitive.
5) With regard to the speed of computation, for the 100% parallel implementation, 
referring to table 3.2 we can see that our method is the slowest. However, our approach 
has a lot of other properties that can be exploited by a clever architecture. To illustrate, 
looking at figure 3.8 one can see that the four CPAs used in step 3 can also be used for 
step 4 with minor modifications. These minor modifications require negligible additional 
hardware and at the same time do not slow down the operations. Similarly we observe 
that the hardware required by module 3 is redundant as the same can be achieved by 
CPAs in step 3 of modules 1 and 2. Thus by simple modifications to the architecture we 
can reduce the hardware costs. Our method thus also provides for alternate
implementations while such is not the case with either the traditional or modular
approach. Further, the fine granularity of our approach makes it an ideal candidate for
incorporating several sophisticated methods, for example, pipe-lining and systolic
arrays, to improve the speed of the computation. However, while employing these and
other techniques one must also account for the interconnection delays. Analysis in this
area is highly application dependent. For instance since the convolution operation is
quite often required to be performed in real time, several different sequences have to be
processed one after the other. Thus, the throughput of the model should not only
include the number of points processed per cycle but also include the number of
sequences processed per cycle.
Table 3.2: Time delay of cyclic convolution of 4, 8, and 16 points
Method         4 points                  8 points                   16 points
Traditional    $8D_{CSA} + 1D_{CPA}$     $10D_{CSA} + 1D_{CPA}$     $11D_{CSA} + 1D_{CPA}$
Modular        $6D_{CSA} + 2D_{CPA}$     $8D_{CSA} + 2D_{CPA}$      $10D_{CSA} + 2D_{CPA}$
Squares        $7D_{CSA} + 4D_{CPA}$     $9D_{CSA} + 5D_{CPA}$      $11D_{CSA} + 6D_{CPA}$
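The percentages quoted in point 2) can be reproduced from the gate counts of table 3.1. The sketch below computes the extra hardware of the squares method relative to the modular method; the dictionary and function names are illustrative only.

```python
# Gate counts copied from table 3.1.
modular = {4: 6144, 8: 25240, 16: 102368}
squares = {4: 7708, 8: 27060, 16: 103652}


def overhead_percent(points: int) -> float:
    """Extra hardware of the squares method relative to the modular method."""
    return 100.0 * (squares[points] - modular[points]) / modular[points]


for points in (4, 8, 16):
    print(points, round(overhead_percent(points), 1))
```

The overhead shrinks from roughly 25% at 4 points to roughly 1% at 16 points, which is the trend the argument in point 2) relies on.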
In summary, while we have not provided an actual factor for the amount of 
savings in CSA based implementations, we have, however, convincingly demonstrated 
with the help of the preceding sections that in spite of an increase in the number of two- 
operand additions our method produces efficient designs. In the next section we 
consider hybrid implementations, i.e. we implement the squaring and multiplication 
operations using ROMs and show that in such a case our method yields phenomenal 
hardware savings in spite of using an unkind model.
3.5 Hybrid implementation of cyclic convolution
In a hybrid implementation of the cyclic convolution operation, we substitute the 
CSA implementation of the squaring or multiplication operation by a ROM 
implementation. The rest of the implementation is left unchanged. We first present a 
ROM model and derive its cost and speed functions. We then use these functions to 
estimate the cost and speed of the hybrid implementations of the cyclic convolution
operation. Since the traditional approach involves no multiplication operation (the partial 
products were simply added) there exists no hybrid implementation for that approach.
ROM Hardware Cost: The ROM can be designed based on the model given in
[55]. In such a model the address lines are split into two halves called the X and Y 
sections. The address lines in X and Y can then be decoded simultaneously. Let the 
ROM be of size $2^L \times n$, where L is the number of address lines and n is the number of
outputs. If there are L address lines then each half has L/2 address lines with $2^{L/2}$
minterms. In each of these decoders every minterm requires $(2^0 + 2^1 + \ldots + 2^{\log_2(L/2) - 1})$
2-input gates. The outputs of the diode matrix are logically ANDed with the outputs
of the Y decoder before being multiplexed to form the final output. The number of gates
required for this purpose is $(2^0 + 2^1 + \ldots + 2^{\log_2 2^{L/2}})$ 2-input gates. Adding and
simplifying we have,
ROM, Hardware = $2^{L/2}(2n + L - 2) - n$ (3.84)
In the above we note that each of the $2^{L/2}$ minterms is realized independently.
However, we assume that each of these minterms is realized only once for every output
of the ROM. Also we ignore the cost and delay of the diode matrix.
ROM Time delay: The delay of such a model is given by [55] in gate delays as
ROM, Delay = $2 + \log_2(L/2) + \log_2 2^{L/2}$ (3.85)
The above computation assumes that the X and Y decoders operate in parallel and the 
input line from the diode-matrix is multiplexed at the outputs of the Y decoder. We note 
that this model does not account for several unique characteristics of ROM 
implementations [55]. However, the purpose of using this model is to provide a more
meaningful hardware cost in terms of gates as opposed to ROM bits and also to estimate 
the time delay of the overall implementation.
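Equations (3.84) and (3.85) are simple enough to evaluate directly. The following sketch transcribes them as two functions (the names `rom_gates` and `rom_delay` are hypothetical) and evaluates them for a $2^{16} \times 16$ ROM, the size of a look-up table for an 8x8 product.

```python
import math


def rom_gates(L: int, n: int) -> float:
    """Equation (3.84): 2-input gate estimate for a 2^L x n ROM."""
    return 2 ** (L / 2) * (2 * n + L - 2) - n


def rom_delay(L: int) -> float:
    """Equation (3.85): delay in gate delays, using log2(2^(L/2)) = L/2."""
    return 2 + math.log2(L / 2) + L / 2


# A 2^16 x 16 ROM, i.e. 16 address lines and 16 output bits.
print(rom_gates(16, 16))  # 11760.0
print(rom_delay(16))      # 13.0
```

For odd L the expressions are no longer integers; how such values are rounded is not stated in the text.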
3.5.1 8-point cyclic convolution-hybrid, modular
In section 3.4.2, the results of the 64 multiplications were obtained by using 
CSA tree implementations. If instead we were to use ROMs for the multiplication
operations then we would need 64 ROMs, each of size $2^{16} \times 16$, or a total of 67,108,864
ROM bits. In terms of gates, from equation (3.84) we get the number of 2-input gates
as 752640. The time delay of this ROM is computed from equation (3.85) as 13 gate
delays. Over and above this we would need 4440 two-input gates for adding the results
of the ROMs with a time delay of $4 \times D_{CSA} + 1 \times D_{CPA}$. In summary,
Hardware, 8HM8 = 757080 (3.86)
Time Delay, 8HM8 = $13 + 4 \times D_{CSA} + 1 \times D_{CPA}$ (3.87)
3.5.2 8-point cyclic convolution-hybrid, squares
In section 3.4.6, the results of the 48 squares were obtained by using CSA tree 
implementations. If instead we were to use ROMs for the squaring operations then we
would need 16 ROMs, each of size $2^{11} \times 22$, and 32 ROMs, each of size $2^{10} \times 20$, or a
total of 1,376,256 ROM bits. In terms of gates the count is 86544 2-input gates with a
time delay of 10 gate delays. Over and above this we would need 13380 two-input gates for
adding the results of the ROMs with a time delay of $4 \times D_{CSA} + 3 \times D_{CPA}$. In
summary,
Hardware, 8HS8 = 99924 (3.88)
Time Delay, 8HS8 = $10 + 6 \times D_{CSA} + 4 \times D_{CPA}$ (3.89)
Clearly, from equations (3.86)-(3.89), one can see that our method is very
attractive. We require approximately one-eighth the gate count of the hybrid-modular
method while being slower by only about $3 \times D_{CPA}$.
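The two hybrid gate totals can be re-derived from equation (3.84). In the sketch below each ROM's gate count is rounded up to a whole gate. That rounding is an assumption made here (the text does not say how the fractional value for L = 11 is handled), adopted because it reproduces the quoted figure of 86544.

```python
import math


def rom_gates(L: int, n: int) -> int:
    """Equation (3.84), rounded up to a whole gate (the rounding is an assumption)."""
    return math.ceil(2 ** (L / 2) * (2 * n + L - 2) - n)


# Hybrid modular: 64 ROMs of size 2^16 x 16 plus 4440 adder gates.
modular_rom = 64 * rom_gates(16, 16)
modular_total = modular_rom + 4440

# Hybrid squares: 16 ROMs of 2^11 x 22 and 32 ROMs of 2^10 x 20,
# plus 13380 adder gates.
squares_rom = 16 * rom_gates(11, 22) + 32 * rom_gates(10, 20)
squares_total = squares_rom + 13380

print(modular_rom, modular_total)  # 752640 757080
print(squares_rom, squares_total)  # 86544 99924
print(round(modular_total / squares_total, 2))
```

The last line prints the ratio of the two totals, which is what the "approximately one-eighth" remark refers to.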
3.6 Applications to computer arithmetic
The multiplication operation is one of the four basic operations and is used 
extensively, both in general purpose and special purpose computing. As such there 
exists a vast amount of literature on the variety of multiplication algorithms, references 
[47]-[49],[56] to name a few. Recently new multipliers modulo ($2^N - 1$) [51] and
modulo ($2^N + 1$) [61] have been developed. Apart from the requirements of computer
arithmetic, the fields of digital signal processing and cryptography have several
algorithms that perform arithmetic in modular rings [17],[62]. Multipliers designed
using look-up tables [46],[47],[51] offer attractive speed-complexity trade-offs [61];
however, their main drawback has been excessive ROM sizes and thus the inability to
integrate the entire design on a single chip.
3.6.1 Modulo $2^N - 1$ multiplication
Consider the multiplication of two N-bit binary numbers A and B. Let each of
the numbers be decomposed into four parts given as $[a_3, a_2, a_1, a_0]$ and $[b_3, b_2, b_1, b_0]$.
The number A is then given as $a_3 2^{3N/4} + a_2 2^{2N/4} + a_1 2^{N/4} + a_0$, and number B can
be evaluated in a similar fashion. Then their product modulo $2^N - 1$ can be given as
$\langle A \times B \rangle_{2^N - 1} = \langle c_0 + c_1 2^{N/4} + c_2 2^{N/2} + c_3 2^{3N/4} \rangle_{2^N - 1}$ (3.90)
with $c_0$, $c_1$, $c_2$, and $c_3$ given by equations (3.2)-(3.5). Note that the $c_j$ in equations
(3.2)-(3.5) are the terms of cyclic convolution of two four point sequences with the
points being $a_j$ and $b_j$. Thus we can apply the theorems developed in chapter 2 to obtain
$c_0$, $c_1$, $c_2$, and $c_3$. However, since our objective here is to minimize the total
number of ROM bits and not the total number of squaring operations we use theorems
(2.3)-(2.6). We then define equations $x_{ij}$, $y_{ij}$, and $z_{ij}$ as given by equations (3.10)-
(3.25). Then theorems (2.3)-(2.6) give us $4(c_0 + c_2)$, $4(c_1 + c_3)$, $4(c_0 - c_2)$, $4(c_1 - c_3)$.
Finally equations (3.26)-(3.29) give the values of $c_0$, $c_1$, $c_2$, and $c_3$.
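The link between equation (3.90) and the four-point cyclic convolution can be checked numerically. The sketch below forms the convolution terms with ordinary products rather than with the squares-based theorems (2.3)-(2.6), so it illustrates only the decomposition, not the ROM-based evaluation. The function names are hypothetical.

```python
import random


def split4(x: int, N: int) -> list[int]:
    """Split an N-bit number into four N/4-bit parts [x0, x1, x2, x3] (x0 lowest)."""
    k = N // 4
    return [(x >> (k * i)) & ((1 << k) - 1) for i in range(4)]


def cyclic_convolution(a: list[int], b: list[int]) -> list[int]:
    """c[m] = sum of a[i] * b[j] over all i, j with (i + j) mod 4 == m."""
    c = [0] * 4
    for i in range(4):
        for j in range(4):
            c[(i + j) % 4] += a[i] * b[j]
    return c


def mul_mod_2n_minus_1(A: int, B: int, N: int) -> int:
    """Equation (3.90): recombine the convolution terms modulo 2^N - 1."""
    c = cyclic_convolution(split4(A, N), split4(B, N))
    k = N // 4
    return sum(c[m] << (k * m) for m in range(4)) % ((1 << N) - 1)


N = 16
for _ in range(1000):
    A, B = random.randrange(1 << N), random.randrange(1 << N)
    assert mul_mod_2n_minus_1(A, B, N) == (A * B) % ((1 << N) - 1)
print(mul_mod_2n_minus_1(54682, 57811, 16))  # 9307
```

The wrap-around of indices works because $2^N$ is congruent to 1 modulo $2^N - 1$, so a partial product weighted by $2^{(i+j)N/4}$ with $i + j \geq 4$ folds back onto the term with index $i + j - 4$.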
Each square required by equations (3.26)-(3.29) is realized using a ROM. The 
advantages of such techniques are detailed in references [47]-[51]. Although the number 
of squares required is more than that of [51] the total number of ROM bits required is 
less than that required by [51]. Hardware in terms of adders and subtracters is 
comparable with that required by [51]. Section 3.6.4 offers a detailed comparative 
analysis.
3.6.2 Extending the modulo $2^N - 1$ multiplier
Continuing with the same notation as before, the modulo $2^N + 1$ product of two
numbers A and B can be given as
$\langle A \times B \rangle_{2^N + 1} = \langle d_0 + d_1 2^{N/4} + d_2 2^{N/2} + d_3 2^{3N/4} \rangle_{2^N + 1}$ (3.91)
with $d_0$, $d_1$, $d_2$, and $d_3$ defined as
$d_0 = a_0b_0 - a_3b_1 - a_2b_2 - a_1b_3$ (3.92)
$d_1 = a_1b_0 + a_0b_1 - a_3b_2 - a_2b_3$ (3.93)
$d_2 = a_2b_0 + a_1b_1 + a_0b_2 - a_3b_3$ (3.94)
$d_3 = a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3$ (3.95)
Note that term $d_3$ of equation (3.95) is the same as term $c_3$ of equation (3.5) and
so no extra ROM bits are required for computing $d_3$. To compute $d_0$, $d_1$, and $d_2$ define
$g_0 = a_3b_1 + a_2b_2 + a_1b_3$ (3.96)
$g_1 = a_3b_2 + a_2b_3$ (3.97)
$g_2 = a_3b_3$ (3.98)
Then
$d_0 = c_0 - 2g_0$ (3.99)
$d_1 = c_1 - 2g_1$ (3.100)
$d_2 = c_2 - 2g_2$ (3.101)
The terms $g_0$, $g_1$, and $g_2$ can be computed by directly applying the quarter
squared algorithm [47]-[51]; the resulting equations are not presented here. Doubling
these terms is achieved by simply shifting the numbers to the left by one position
and thus needs no extra ROM bits.
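For readers unfamiliar with the quarter squared algorithm, it rests on the identity $ab = \lfloor (a+b)^2/4 \rfloor - \lfloor (a-b)^2/4 \rfloor$, so that a product is obtained from two table look-ups and a subtraction. The sketch below applies it to the terms $g_0$, $g_1$, and $g_2$ using the operand values of the example in section 3.6.3. For brevity it shares one table between the sum and difference look-ups, whereas the hardware counts in this chapter assume two ROMs per product term; the function names are hypothetical.

```python
def build_quarter_square_table(k: int) -> list[int]:
    """ROM contents: floor(s^2 / 4) for every (k+1)-bit address s."""
    return [(s * s) // 4 for s in range(1 << (k + 1))]


def quarter_square_product(a: int, b: int, table: list[int]) -> int:
    """a*b = floor((a+b)^2/4) - floor((a-b)^2/4).

    The two floors discard the same fraction because a+b and a-b always
    have the same parity.
    """
    return table[a + b] - table[abs(a - b)]


k = 4
table = build_quarter_square_table(k)
a1, a2, a3 = 9, 5, 13    # upper parts of A = 54682
b1, b2, b3 = 13, 1, 14   # upper parts of B = 57811

g0 = sum(quarter_square_product(x, y, table) for x, y in ((a3, b1), (a2, b2), (a1, b3)))
g1 = sum(quarter_square_product(x, y, table) for x, y in ((a3, b2), (a2, b3)))
g2 = quarter_square_product(a3, b3, table)
print(g0, g1, g2)  # 300 83 182
```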
The product of two numbers A and B modulo $2^N$ can be given as
$\langle A \times B \rangle_{2^N} = \langle e_0 + e_1 2^{N/4} + e_2 2^{N/2} + e_3 2^{3N/4} \rangle_{2^N}$ (3.102)
with $e_0$, $e_1$, $e_2$, and $e_3$ defined as
$e_0 = a_0b_0$ (3.103)
$e_1 = a_1b_0 + a_0b_1$ (3.104)
$e_2 = a_2b_0 + a_1b_1 + a_0b_2$ (3.105)
$e_3 = a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3$ (3.106)
Again, term $e_3$ of equation (3.106) is the same as term $c_3$ of equation (3.5) and so no
extra ROM bits are needed for this computation. The other terms can also be obtained
without the expense of any more ROM bits by using the following equations:
$e_0 = (c_0 + d_0) \times 1/2$ (3.107)
$e_1 = (c_1 + d_1) \times 1/2$ (3.108)
$e_2 = (c_2 + d_2) \times 1/2$ (3.109)
The full precision product of two integer numbers A and B can be given as
$A \times B = f_0 + f_1 2^{N/4} + f_2 2^{N/2} + f_3 2^{3N/4} + f_4 2^N + f_5 2^{5N/4} + f_6 2^{6N/4}$ (3.110)
with
$f_0 = a_0b_0 = e_0$ (3.111)
$f_1 = a_1b_0 + a_0b_1 = e_1$ (3.112)
$f_2 = a_2b_0 + a_1b_1 + a_0b_2 = e_2$ (3.113)
$f_3 = a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3 = e_3$ (3.114)
$f_4 = a_3b_1 + a_2b_2 + a_1b_3 = g_0$ (3.115)
$f_5 = a_3b_2 + a_2b_3 = g_1$ (3.116)
$f_6 = a_3b_3 = g_2$ (3.117)
Again, all of these computations need no extra ROM bits.
3.6.3 Example
In this section we present a numerical example to illustrate the various 
techniques described earlier in this section. Consider two 16-bit numbers A and B with
$A = 54682 = (1101010110011010)_2$ and $B = 57811 = (1110000111010011)_2$. We
decompose each number into four parts, each with four bits. Thus we have
$A = \{a_3, a_2, a_1, a_0\}$ and $B = \{b_3, b_2, b_1, b_0\}$ with $a_3 = (1101)_2 = 13$, $a_2 = (0101)_2 = 5$,
$a_1 = (1001)_2 = 9$, $a_0 = (1010)_2 = 10$ and $b_3 = (1110)_2 = 14$, $b_2 = (0001)_2 = 1$,
$b_1 = (1101)_2 = 13$, $b_0 = (0011)_2 = 3$. Here N = 16 and n = 4.
3.6.3.1 Modulo $2^N - 1$ product
Equations (3.10)-(3.25) give:
$x_{41} = 19$, $x_{42} = 11$, $x_{43} = 49$, $x_{44} = -5$, $y_{41} = 42$, $y_{42} = -12$, $y_{43} = 26$, $y_{44} = 18$,
$z_{41} = 1$, $z_{42} = 3$, $z_{43} = -5$, $z_{44} = -3$, $z_{45} = 4$, $z_{46} = 6$, $z_{47} = -2$, and $z_{48} = -6$.
Evaluating equations (3.26)-(3.29) we get $c_0 = 330$, $c_1 = 240$, $c_2 = 324$, $c_3 = 253$.
Evaluating equation (3.90) we have $\langle A \times B \rangle_{2^N - 1} = 9307$ and the result checks out.
3.6.3.2 Modulo $2^N + 1$ product
Equations (3.96)-(3.98) give $g_0 = 300$, $g_1 = 83$, and $g_2 = 182$.
Evaluating equations (3.99)-(3.101) we get $d_0 = -270$, $d_1 = 74$, and $d_2 = -40$.
Evaluating equation (3.91) we have $\langle A \times B \rangle_{2^N + 1} = 43907$ and the result checks out.
3.6.3.3 Modulo $2^N$ product
Equations (3.107)-(3.109) give $e_0 = 30$, $e_1 = 157$, and $e_2 = 142$.
Evaluating equation (3.102) we have $\langle A \times B \rangle_{2^N} = 26606$ and the result checks out.
3.6.3.4 Full precision product
Equations (3.111)-(3.117) give $f_0 = 30$, $f_1 = 157$, $f_2 = 142$, $f_3 = 253$, $f_4 = 300$,
$f_5 = 83$, and $f_6 = 182$.
Evaluating equation (3.110) we have $A \times B = 3161221102$ and the result checks out.
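The four results of this example can be reproduced in a few lines. The sketch below computes the partial sums with ordinary multiplications instead of ROM look-ups, so it checks the algebra of equations (3.90)-(3.117) only; the variable and function names are illustrative.

```python
a = [10, 9, 5, 13]   # a0..a3 of A = 54682
b = [3, 13, 1, 14]   # b0..b3 of B = 57811
N, k = 16, 4

# Full-precision terms f0..f6 of (3.111)-(3.117): f[m] sums a[i]*b[j] with i + j = m.
f = [0] * 7
for i in range(4):
    for j in range(4):
        f[i + j] += a[i] * b[j]

g = f[4:]                                               # g0, g1, g2 of (3.96)-(3.98)
c = [f[m] + (g[m] if m < 3 else 0) for m in range(4)]   # cyclic convolution terms
d = [f[m] - (g[m] if m < 3 else 0) for m in range(4)]   # (3.92)-(3.95)
e = [(c[m] + d[m]) // 2 for m in range(4)]              # (3.107)-(3.109)


def combine(terms):
    """Weight term m by 2^(m*N/4) and add."""
    return sum(t * (1 << (k * m)) for m, t in enumerate(terms))


print(c, combine(c) % (2**N - 1))   # [330, 240, 324, 253] 9307
print(d, combine(d) % (2**N + 1))   # [-270, 74, -40, 253] 43907
print(e, combine(e) % 2**N)         # [30, 157, 142, 253] 26606
print(combine(f))                   # 3161221102
```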
3.6.4 Hardware and speed analysis
All the analysis in this section is provided for the case when the product of two 
numbers is obtained by decomposing each number into four equal parts, say each with k 
bits. Four methods are compared:
i) traditional techniques,
ii) quarter squared algorithm,
iii) new multipliers modulo 2N -1 [51],
iv) techniques of this chapter.
The traditional way of computing $c_0$, $c_1$, $c_2$, and $c_3$ would be by using equations (3.2)-
(3.5). Here each product term can be realized by a ROM of size $2^{2k} \times 2k$. Since sixteen
product terms have to be realized, we would need a total of $k \times 2^{2k+5}$ ROM bits.
Direct application of the quarter squared algorithm to each term of equations
(3.2)-(3.5) would require for each term two ROMs, each of size $2^{k+1} \times (2k + 2)$. Thus
sixteen product terms would require a total of $(k + 1) \times 2^{k+7}$ ROM bits.
New multipliers modulo $2^N - 1$ [51] require a total of $(2k + 5) \times 2^{k+6}$ ROM
bits.
Based on our techniques, equations (3.26)-(3.29) dictate that 16 ROMs, each of
size $2^{k+2} \times (2k + 4)$ would be required, thus giving a total of $(k + 2) \times 2^{k+7}$ ROM
bits.
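The four ROM-bit totals can be tabulated for concrete part lengths. The sketch below builds each total from the ROM count and ROM size given above, except for reference [51], where only the quoted total is used. The dictionary keys are illustrative labels.

```python
def rom_bits(k: int) -> dict[str, int]:
    """Total ROM bits for a four-part decomposition with k-bit parts."""
    return {
        "traditional": 16 * 2 ** (2 * k) * (2 * k),            # k * 2^(2k+5)
        "quarter_squared": 32 * 2 ** (k + 1) * (2 * k + 2),    # (k+1) * 2^(k+7)
        "reference_51": (2 * k + 5) * 2 ** (k + 6),            # total quoted from [51]
        "this_section": 16 * 2 ** (k + 2) * (2 * k + 4),       # (k+2) * 2^(k+7)
    }


for k in (4, 8, 16):
    bits = rom_bits(k)
    saving = 100 * (1 - bits["this_section"] / bits["traditional"])
    print(4 * k, k, bits, round(saving, 2))
```

For k = 4 (N = 16) the techniques of this section need 12288 ROM bits against 32768 for the traditional approach, a saving of 62.5%.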
Table 3.3 summarizes these results and also presents data on the number of 
adders required by each method. Here we have assumed that the operation (a - b) 
requires only one adder. This is a reasonable assumption because our numbers are 
integers, and therefore the operation (a - b) can be realized by (a + (-b)), where (-b) is 
the two's complement representation of b. The hardware module for this adder can be 
suitably wired to obtain this function. Figures 3.11, 3.12, and 3.13 show the hardware 
structure needed to compute co for all methods except that of (iii) which can be found in 
[51]. These hardware structures can be replicated appropriately for the other terms. The 
number of levels through which the data has to flow is indicated on each figure, 
however this might be irrelevant if data is being processed continuously. In such a case 
the limiting factor will be the speed at which the ROM can deliver. Method (i) requires a 
ROM whose size is in the order of $O(2^{2k})$ while all the other three methods require
ROMs with sizes of order $O(2^k)$. Since method (i) requires the largest sized ROMs, it
will be the slowest. Also since the total number of ROM bits is very high, it will not be 
possible to integrate the entire design on a single chip. With respect to speed and 
number of ROM bits, methods (ii) through (iv) are comparable. With respect to the total 
number of ROM bits required, method (ii), i.e. the direct application of the quarter 
squared algorithm, appears to be the best but considering the fact that it requires 60 
adders it will be the most complex one to build. Our techniques in this chapter yield the 
best trade-off for speed and hardware; while they require more ROM bits than the
quarter squared algorithm, they require far fewer adders. Techniques of this chapter
outperform those of [51] in many respects, viz. smaller maximum ROM size, a total of
fewer ROM bits, and higher speed. Also based on the techniques of this chapter, all the
ROMs are of the same size and hence identical. This again is a big advantage for VLSI
designs. The regularity of the hardware architecture is clearly seen in figure 3.13.
Table 3.3: Hardware and speed comparison of various look-up table techniques
Method            Total ROM bits                Adders   Maximum ROM size               Order of ROM size
Traditional       $k \times 2^{2k+5}$           12       $2^{2k} \times 2k$             $O(2^{2k})$          NO, for N >= 16
Quarter Squared   $(k+1) \times 2^{k+7}$        60       $2^{k+1} \times (2k+2)$        $O(2^k)$             NO, for N >= 16
Reference [51]    $(2k+5) \times 2^{k+6}$       38       $2^{k+3} \times (2k+6)$        $O(2^k)$             Possible
This section      $(k+2) \times 2^{k+7}$        40       $2^{k+2} \times (2k+4)$        $O(2^k)$             Possible
Fig. 3.11: Hardware architecture to implement equation (3.2) using traditional techniques. Number of levels = 3 (2 adders + 1 ROM); size of each ROM = $2^{2k} \times 2k$.
Fig. 3.12: Hardware architecture for implementing equation (3.2) using the quarter squared algorithm. Number of levels = 5 (4 adders + 1 ROM); size of each ROM = $2^{k+1} \times (2k + 2)$.
Fig. 3.13: Hardware architecture to realize equations (3.26) and (3.28). Number of levels = 6 (5 adders + 1 ROM); size of ROM = $2^{k+2} \times (2k + 4)$.
Table 3.4 summarizes the ROM requirements for different wordlengths (N) for
the case when the numbers are each decomposed into four parts. Table 3.5 summarizes
the overhead ROM requirements for computing the product modulo $2^N + 1$,
modulo $2^N$, and the full precision product. Again, each product is computed by
decomposing each of the numbers into four equal parts. Overhead is defined as the
number of ROM bits needed over and above those required for the computation of the
modulo $2^N - 1$ product. The values are based on equations (3.96)-(3.98), are obtained in
a straightforward manner and are hence not detailed.
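The overhead figures can be recovered from the formulas alone. The sketch below assumes that the overhead consists of the six product terms appearing in $g_0$, $g_1$, and $g_2$, each realized with two quarter-square ROMs of size $2^{k+1} \times (2k+2)$. This reading is an inference, adopted because it reproduces the expression $6(k+1) \times 2^{k+3}$ of table 3.5; the function names are hypothetical.

```python
def base_bits(k: int) -> int:
    """ROM bits of the modulo 2^N - 1 multiplier: (k+2) * 2^(k+7)."""
    return (k + 2) * 2 ** (k + 7)


def overhead_bits(k: int) -> int:
    """Six extra product terms, two quarter-square ROMs each (assumed reading)."""
    products = 6                      # 3 in g0, 2 in g1, 1 in g2
    roms_per_product = 2
    bits_per_rom = 2 ** (k + 1) * (2 * k + 2)
    return products * roms_per_product * bits_per_rom


for k in (4, 8, 16):
    percent = 100 * overhead_bits(k) / base_bits(k)
    print(4 * k, k, base_bits(k), overhead_bits(k), round(percent, 2))
```

For k = 4 and k = 8 this gives 31.25% and 33.75%. For k = 16 the exact value is about 35.417%, which table 3.5 lists as 35.41.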
Table 3.4: Cost Comparison in ROM bits of the various techniques for computing
N    k    Traditional                Quarter squared       Reference [51]        This section          % savings (vs. trad. tech.)
16   4    $2^5 \times 2^{10}$        $10 \times 2^{10}$    $13 \times 2^{10}$    $12 \times 2^{10}$    62.50
32   8    $2^{10} \times 2^{14}$     $18 \times 2^{14}$    $21 \times 2^{14}$    $20 \times 2^{14}$    98.04
64   16   $2^{19} \times 2^{22}$     $34 \times 2^{22}$    $37 \times 2^{22}$    $36 \times 2^{22}$    99.99
Table 3.5: Cost in ROM bits for integrated multiplier, based on techniques of this section
Word      Decomposed    Mod ($2^N - 1$)           Mod ($2^N + 1$) product     Mod ($2^N$) product   Full precision     %
length    part length   product cost              overhead cost               overhead cost         product overhead   overhead
N         k             $(k+2) \times 2^{k+7}$    $6(k+1) \times 2^{k+3}$                           cost
16        4             $6 \times 2^{11}$         $30 \times 2^{7}$           None                  None               31.25
32        8             $10 \times 2^{15}$        $54 \times 2^{11}$          None                  None               33.75
64        16            $18 \times 2^{23}$        $102 \times 2^{19}$         None                  None               35.41
3.7 Summary
In this chapter we have discussed several implementation issues of the squaring
and convolution operations. In section 3.2 we presented an intuitive CSA based
implementation for the squaring operation that was faster than the schemes suggested by
[54]. We showed, by counterexample, that the number of levels required to add a set of
summands in a parallel fashion is not only a function of the height of the tallest column
but is also a function of the heights of the other columns and their relative placements.
In section 3.3 we presented an alternate implementation for the squaring operation and
compared its performance with existing schemes. We found that for VLSI
implementations of small wordlength squarers, the prime factors in the selection of a
design would be regularity and modularity. This was because different schemes for
small wordlength squarers had similar hardware costs. In section 3.4 we presented in
detail CSA based implementations for 4, 8, and 16 point cyclic convolutions. The
analysis showed that the increase in the number of addition operations does not
significantly diminish the hardware savings obtained by the reduction in the number of
the squaring operations. We again emphasize that the purpose of the implementations
was solely to argue the case in point and not for the purpose of field implementations.
We also clearly demonstrated that our approach is an excellent candidate for smart
architectures. In section 3.5 hybrid implementations of the convolution operation are
presented. Here, we have shown that if the multiplication and squaring operations are
implemented by ROMs then our method, while being a little slower, yields phenomenal
hardware savings. Finally, in section 3.6 we presented an application of the convolution
operation in the field of computer arithmetic, namely, the problem of integer
multiplication. We present the case of a modulo $2^N - 1$ multiplier and show how our
techniques can be extended to multiplication in other rings, namely, modulo $2^N + 1$ and
modulo $2^N$. We also present the case of full precision multiplication. We show that in all
cases our methods produce significant ROM bit savings when compared with traditional
implementations.
Chapter 4
ROM Based Methods for Computing the 
Squaring Operation in Modular Rings
In the previous chapters we developed algorithms for modular multiplication and
cyclic convolution that relied primarily on squaring operations. The focal point of those
algorithms was how best to reduce the number of squaring operations needed to perform the
desired computation. Also, these algorithms were discussed in the context of full
precision computation. However, signal processing applications often rely on the
properties of the residue number system (RNS) [38] to perform efficient computations.
In such an environment computations are performed over modular rings, the popular
choices being $2^n$, $2^n - 1$, and $2^n + 1$ [38], [47]. Therefore, in this chapter we focus on
hardware efficient compression schemes for computing the square of a number modulo
$2^n$, modulo $2^n - 1$, and modulo $2^n + 1$, using ROM look-up tables. In this process we
present several schemes and compare their relative merits and demerits.
In section 4.1 we attempt to motivate the reader by showing how a few simple 
arithmetic manipulations can reduce the size of the ROM required for the squaring 
operation. These schemes were presented in brief in [63]. In section 4.2 we present our 
newly proposed optimized schemes which were also presented very briefly in [53]. In 
this chapter these schemes are presented in detail, for both the sake of completeness and 
for comparison with the newly proposed schemes.
4.1 Memory compression schemes for arithmetic in
modulo $2^n$
Our objective here is to find efficient ways to compute the square of a number.
In this chapter we consider ROM based methods to perform this computation. Let us
consider a number A belonging to the modular ring $Z_{2^n} = \{0, 1, \ldots, 2^n - 1\}$. Then A has
an n-bit binary representation $A = a_{n-1}a_{n-2} \ldots a_1a_0$; $a_i \in \{0,1\}$. Our task is to
compute $\langle A^2 \rangle_{2^n}$, where $\langle x \rangle_m$ denotes the operation x modulo m. Our method basically
consists of decomposing the number A into two words, a high word, say $A_{Hi}$, and a
low word, say $A_{Li}$, and then performing certain arithmetic manipulations to yield
significant savings in ROM bits. We then show that by varying the lengths of $A_{Hi}$ and
$A_{Li}$ we can obtain more savings in ROM bits at the expense of an overhead consisting
of a few gates and multiplexers. We present the analysis for three different
decompositions of the number A.
4.1.1 Analysis when the high word is one bit long
Let $A_{H1}$ be a 1 bit word and $A_{L1}$ be an n-1 bit word with $A_{H1} = a_{n-1}$ and
$A_{L1} = a_{n-2} \ldots a_1a_0$. Then we have
$A = A_{H1} 2^{n-1} + A_{L1}$ (4.1)
and
$A^2 = A_{H1}^2 2^{2n-2} + A_{H1}A_{L1} 2^n + A_{L1}^2$ (4.2)
If n > 2, then 2n - 2 > n, $\langle 2^{2n-2} \rangle_{2^n} = 0$, and we get
$\langle A^2 \rangle_{2^n} = \langle A_{L1}^2 \rangle_{2^n}$ (4.3)
For a table look-up approach, as shown in figure 4.1, the square of the number
A can be computed by simply using a ROM of size $2^n \times n$. We shall refer to this as the
direct implementation. Equation (4.3) shows us that the same task can be accomplished
by using a ROM of size $2^{n-1} \times n$ as shown in figure 4.2. Thus we have a savings of 50%
in ROM bits with no additional overhead. In the next section, by applying the same
techniques, we analyze the effects of increasing the length of $A_{H1}$ and reducing the
length of $A_{L1}$ on the savings in ROM bits.
Figure 4.1: The direct computation of $\langle A^2 \rangle_{2^n}$, ROM size = $2^n \times n$
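Equation (4.3) can be checked exhaustively for a small word length. The sketch below models the ROM as a Python list with $2^{n-1}$ entries and verifies that looking up the low word alone gives the correct square modulo $2^n$; the function names are hypothetical.

```python
def build_rom(n: int) -> list[int]:
    """ROM with 2^(n-1) words of n bits: entry x holds x^2 mod 2^n."""
    return [(x * x) % (1 << n) for x in range(1 << (n - 1))]


def square_mod_2n(A: int, n: int, rom: list[int]) -> int:
    """Equation (4.3): drop the top bit of A and look up the low word."""
    low_word = A & ((1 << (n - 1)) - 1)   # A_L1 = a_{n-2} ... a_0
    return rom[low_word]


n = 8
rom = build_rom(n)
assert all(square_mod_2n(A, n, rom) == (A * A) % (1 << n) for A in range(1 << n))
print(len(rom))  # 128 entries instead of 256
```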
4.1.2 Analysis when the high word is two bits long
Let AH2 be a 2 bit word and AL2 be a n-2 bit word with AH2 = ^ -1 ^-2  and 
AL2 = an_3 ... aja0. Then we have
and
A = Am 2"'2 + AL2 (4.4)
A2 = A ^2 22n-4 + AH2AL2 2“-> + A l 2 (4.5)
If n > 4, then 2n -4 > n, < 2 2 n ' 4 > 2 n  = 0, and we get
< A 2 > 2 n  =  < A H 2 A L 2  2n 1 + A 2 2 > 2 n  (4.6)
The possible values of <AH2AL2 2n_1>2n are shown in Table 4.1. Further, 
<AL2 ^  = <(an-3 •" a la0 ^ n ^>2n= ^ ^ - 3  ••• al)2” + *>2n = a()2n *
(4.7)
Combining equation (4.7) and Table 4.1 we can write
, 2 .
<A2>2n =
<AL2>2n ifa n-2= 0
<a02n- 1+ A j2>2n i f a n.2=l
(4.8)
9
If <AL2>2nis represented in binary as b ^ b ^ .- .b jb o ,  and letting cn.j = a0 ® bn.j  
where © denotes the exclusive-or operation, equation (4.8) can then be rewritten as
< A 2>  _  fb n - l^ n -2 -^ 1 ^ 0  ^ an-2= ® l  (4  9)
<A >2n"  k - i b „ - 2 -b ibo  if an-2=l J (4-9)
117
\ r V 2 " ‘ 0
R O M
< A2> 2n - < Au ^ n  
Figure 4.2: The computation of <A2>2n based on equation (4.3), ROM size = 2n_1x n
Table 4.1: Values of <AH2AL2 2n’1>2n
A H2 
^ -1  ^ - 2
<A H2A L22" l>2n
0  0  
0  1 
1 0  
1 1
0




We note from Table 4.1 that the value of bit $a_{n-1}$ is irrelevant and thus does not
appear in equation (4.9).
To obtain $\langle A_{L2}^2 \rangle_{2^n}$ a ROM of size $2^{n-2} \times n$ would be sufficient as the length of
$A_{L2}$ is only n-2 bits. Thus we have a savings of 75% in ROM bits when compared with
the direct implementation. However, to obtain $\langle A^2 \rangle_{2^n}$ we need to realize equation (4.9)
and for this in addition to the ROM we would need an exclusive-or gate and a single
2 x 1 multiplexer (mux) as shown in figure 4.3. Thus while equation (4.9) produces
more savings in ROM bits than equation (4.3) it also has a small overhead.
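
As an informal check of equation (4.9), the ROM, the exclusive-or gate and the 2 x 1 multiplexer can be modeled in a few lines of Python. The function name is hypothetical and the multiplication stands in for the $2^{n-2} \times n$ ROM lookup; this is a sketch under those assumptions.

```python
def square_two_bit_split(a, n):
    """Sketch of equation (4.9) for n > 4; a is an n-bit integer."""
    mod = 1 << n
    a_low = a & ((1 << (n - 2)) - 1)   # A_L2, the low n-2 bits
    b = (a_low * a_low) % mod          # stands in for the 2^(n-2) x n ROM
    a0 = a & 1
    a_n2 = (a >> (n - 2)) & 1          # multiplexer select line a_{n-2}
    c_word = b ^ (a0 << (n - 1))       # XOR gate: c_{n-1} = a_0 xor b_{n-1}
    return c_word if a_n2 else b       # 2 x 1 multiplexer


print(square_two_bit_split(989, 10))
```

Bit $a_{n-1}$ is never read, which mirrors the observation drawn from Table 4.1.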
In the next section we increase further the length of the high word by one more 
bit and simultaneously reduce the length of the low word by one bit. The objective of 
the analysis here is to show that while more savings in ROM bits are obtained the 
overhead increases in a disproportionate fashion.
4.1.3 Analysis when the high word is three bits long
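Let $A_{H3}$ be a 3-bit word and $A_{L3}$ be an (n-3)-bit word with $A_{H3} = a_{n-1} a_{n-2} a_{n-3}$
and $A_{L3} = a_{n-4} \ldots a_1 a_0$. Then we have
$A = A_{H3} 2^{n-3} + A_{L3}$ (4.10)
and
$A^2 = A_{H3}^2 2^{2n-6} + A_{H3} A_{L3} 2^{n-2} + A_{L3}^2$ (4.11)
If $n > 6$, then $2n-6 > n$, $\langle 2^{2n-6} \rangle_{2^n} = 0$, and we get
$\langle A^2 \rangle_{2^n} = \langle A_{H3} A_{L3} 2^{n-2} + A_{L3}^2 \rangle_{2^n}$ (4.12)
The possible values of $\langle A_{H3} A_{L3} 2^{n-2} \rangle_{2^n}$ are shown in Table 4.2. Further,
remembering that $A_{L3} = a_{n-4} \ldots a_1 a_0$, we have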
Figure 4.3: The computation of $\langle A^2 \rangle_{2^n}$ based on equation (4.9), ROM size = $2^{n-2} \times n$




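Table 4.2: Values of $\langle A_{H3} A_{L3} 2^{n-2} \rangle_{2^n}$
$A_{H3} = a_{n-1} a_{n-2} a_{n-3}$ | $\langle A_{H3} A_{L3} 2^{n-2} \rangle_{2^n}$
0 0 0 | 0
0 0 1 | $\langle A_{L3} 2^{n-2} \rangle_{2^n}$
0 1 0 | $\langle A_{L3} 2^{n-1} \rangle_{2^n}$
0 1 1 | $\langle 3 A_{L3} 2^{n-2} \rangle_{2^n}$
1 0 0 | 0
1 0 1 | $\langle A_{L3} 2^{n-2} \rangle_{2^n}$
1 1 0 | $\langle A_{L3} 2^{n-1} \rangle_{2^n}$
1 1 1 | $\langle 3 A_{L3} 2^{n-2} \rangle_{2^n}$
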
$\langle A_{L3} 2^{n-2} \rangle_{2^n} = \langle (a_{n-4} \ldots a_1 a_0) 2^{n-2} \rangle_{2^n} = \langle (a_{n-4} \ldots a_2) 2^n + a_1 a_0 2^{n-2} \rangle_{2^n} = a_1 a_0 2^{n-2}$ (4.13)
and similarly
$\langle A_{L3} 2^{n-1} \rangle_{2^n} = \langle (a_{n-4} \ldots a_1 a_0) 2^{n-1} \rangle_{2^n} = \langle (a_{n-4} \ldots a_1) 2^n + a_0 2^{n-1} \rangle_{2^n} = a_0 2^{n-1}$ (4.14)
Adding (4.13) and (4.14) we get
$\langle 3 A_{L3} 2^{n-2} \rangle_{2^n} = \langle a_1 a_0 2^{n-2} + a_0 2^{n-1} \rangle_{2^n} = (a_1 \oplus a_0) a_0 2^{n-2}$ (4.15)
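If $\langle A_{L3}^2 \rangle_{2^n}$ is represented in binary as $d_{n-1} d_{n-2} \ldots d_1 d_0$, and letting $h_{n-1} = a_1 \oplus a_0$,
equations (4.12)-(4.15) combined with Table 4.2 can then be rewritten as
$\langle A^2 \rangle_{2^n} = \begin{cases} d_{n-1} d_{n-2} d_{n-3} \ldots d_1 d_0 & \text{if } a_{n-2} a_{n-3} = 00 \\ e_{n-1} e_{n-2} d_{n-3} \ldots d_1 d_0 & \text{if } a_{n-2} a_{n-3} = 01 \\ f_{n-1} d_{n-2} d_{n-3} \ldots d_1 d_0 & \text{if } a_{n-2} a_{n-3} = 10 \\ g_{n-1} e_{n-2} d_{n-3} \ldots d_1 d_0 & \text{if } a_{n-2} a_{n-3} = 11 \end{cases}$ (4.16)
where the bits $e_{n-1}$, $e_{n-2}$, $f_{n-1}$, and $g_{n-1}$ are given by
$e_{n-1} = a_1 \oplus d_{n-1} \oplus (a_0 \wedge d_{n-2})$ (4.17)
$e_{n-2} = a_0 \oplus d_{n-2}$ (4.18)
$f_{n-1} = a_0 \oplus d_{n-1}$ (4.19)
$g_{n-1} = h_{n-1} \oplus d_{n-1} \oplus (a_0 \wedge d_{n-2})$ (4.20)
with $\wedge$ denoting the AND operation. We once again see from Table 4.2 that the value of
$\langle A_{H3} A_{L3} 2^{n-2} \rangle_{2^n}$ is not a function of bit $a_{n-1}$ and this is accordingly reflected by
equation (4.16).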
Now, to obtain $\langle A_{L3}^2 \rangle_{2^n}$ a ROM of size $2^{n-3} \times n$ would be adequate as the
length of $A_{L3}$ is only n-3 bits. Thus we have a savings of 87.5% in ROM bits when
compared with the direct implementation. However, to compute $\langle A^2 \rangle_{2^n}$ we need to
implement equation (4.16) and for this in addition to the above ROM we would need 6
gates and a 4 x 1 multiplexer of word length two. This implementation is shown in
figure 4.4.
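
Equation (4.16) can be exercised with the following Python sketch. It is illustrative only: the function name is an assumption, the multiplication stands in for the $2^{n-3} \times n$ ROM, and the expression used for $g_{n-1}$ is the one quoted in the numerical example of section 4.3.3.

```python
def square_three_bit_split(a, n):
    """Sketch of equations (4.16)-(4.20) for n > 6; a is an n-bit integer."""
    a_low = a & ((1 << (n - 3)) - 1)      # A_L3, the low n-3 bits
    d = (a_low * a_low) % (1 << n)        # stands in for the 2^(n-3) x n ROM
    a0, a1 = a & 1, (a >> 1) & 1
    d_hi = (d >> (n - 1)) & 1             # d_{n-1}
    d_lo = (d >> (n - 2)) & 1             # d_{n-2}
    e_hi = a1 ^ d_hi ^ (a0 & d_lo)        # (4.17)
    e_lo = a0 ^ d_lo                      # (4.18)
    f_hi = a0 ^ d_hi                      # (4.19)
    g_hi = a1 ^ a0 ^ d_hi ^ (a0 & d_lo)   # (4.20)
    select = (a >> (n - 3)) & 0b11        # a_{n-2} a_{n-3}; a_{n-1} is unused
    top = [(d_hi, d_lo), (e_hi, e_lo), (f_hi, d_lo), (g_hi, e_lo)][select]
    rest = d & ((1 << (n - 2)) - 1)       # d_{n-3} ... d_0 pass through
    return (top[0] << (n - 1)) | (top[1] << (n - 2)) | rest


print(square_three_bit_split(989, 10))
```

The list indexed by `select` plays the role of the 4 x 1 multiplexer of word length two.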
The preceding analysis shows that a direct extension of the above method, i.e. 
increasing the length of the high word while simultaneously reducing the length of the 
lower word results in reducing the number of ROM bits but at the same time increases 
the overhead both in terms of gate count and complexity of the design. In the next 
section we show the optimal size for the high and low words to obtain not only the 
maximum savings in ROM bits but also an overhead that is less than that required by the 
above method and is also streamlined with respect to implementation. In a later section
we show that this overhead is also streamlined with respect to arithmetic modulo $2^n - 1$
and modulo $2^n + 1$.
4.2 Optimized memory compression schemes for 
arithmetic in modulo $2^n$
The following proposed schemes were published very briefly in [53]. In this
section the schemes are presented in detail followed by a comparative analysis. We use
the same notation as before and our task remains the same, i.e. we wish to compute
$\langle A^2 \rangle_{2^n}$, where once again $\langle x \rangle_m$ denotes the operation x modulo m.
Recognizing that the first bit that might produce an overflow when an n-bit (n
even) number is squared is located at the $2^{n/2}$ position, we decompose the number into
two parts each of length n/2 bits. The technique is explained in detail for the case when
the arithmetic is done modulo $2^n$ and n even. All other cases are summarized in Tables
4.3 and 4.4.
Figure 4.4: The computation of $\langle A^2 \rangle_{2^n}$ based on equation (4.16), ROM size = $2^{n-3} \times n$
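Table 4.3: Results when n is even. $A_H = a_{n-1} a_{n-2} \ldots a_{n/2}$, $A_L = a_{(n/2)-1} a_{(n/2)-2} \ldots a_1 a_0$,
and $QS = 2^{(n/2)-1} \{(A_H + A_L)^2 - (A_H - A_L)^2\}$
Operation | Formula | Cost in ROM bits | Overhead | % savings of ROM bits, n = 16 | % savings of ROM bits, n = 32
$\langle A^2 \rangle_{2^n}$ | $\langle A_L^2 + QS \rangle_{2^n}$ | $5 \times 2^{n/2} \times n$ | 4 adders | 98.04 | 99.99
$\langle A^2 \rangle_{2^n - 1}$ | $\langle A_H^2 + A_L^2 + QS \rangle_{2^n - 1}$ | $6 \times 2^{n/2} \times n$ | 5 adders | 97.65 | 99.99
$\langle A^2 \rangle_{2^n + 1}$ | $\langle -A_H^2 + A_L^2 + QS \rangle_{2^n + 1}$ | $6 \times 2^{n/2} \times n$ | 5 adders | 97.65 | 99.99

Table 4.4: Results when n is odd. $A_H = a_{n-1} a_{n-2} \ldots a_{(n+1)/2}$, $A_L = a_{(n-1)/2} \ldots a_1 a_0$,
and $QS = 2^{(n-1)/2} \{(A_H + A_L)^2 - (A_H - A_L)^2\}$
Operation | Formula | Cost in ROM bits | Overhead | % savings of ROM bits, n = 15 | % savings of ROM bits, n = 31
$\langle A^2 \rangle_{2^n}$ | $\langle A_L^2 + QS \rangle_{2^n}$ | $5 \times 2^{(n+1)/2} \times n$ | 4 adders | 96.09 | 99.99
$\langle A^2 \rangle_{2^n - 1}$ | $\langle 2 A_H^2 + A_L^2 + QS \rangle_{2^n - 1}$ | $11 \times 2^{(n-1)/2} \times n$ | 5 adders | 95.70 | 99.99
$\langle A^2 \rangle_{2^n + 1}$ | $\langle -2 A_H^2 + A_L^2 + QS \rangle_{2^n + 1}$ | $11 \times 2^{(n-1)/2} \times n$ | 5 adders | 95.70 | 99.99
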
4.2.1 Analysis when n is even
Consider a number A belonging to the modular ring $Z_{2^n}$ with an n-bit
representation as given before. Let $A_H = a_{n-1} a_{n-2} \ldots a_{n/2}$ and $A_L = a_{(n/2)-1} a_{(n/2)-2} \ldots a_1 a_0$.
Then we have
$A = A_H 2^{n/2} + A_L$ (4.21)
and
$A^2 = A_H^2 2^n + A_H A_L 2^{n/2+1} + A_L^2$ (4.22)
while
$\langle A^2 \rangle_{2^n} = \langle A_L^2 + A_H A_L 2^{n/2+1} \rangle_{2^n}$ (4.23)
The term $A_L^2$ can be computed using a ROM of size $2^{n/2} \times n$. The product $A_H A_L$ can be
realized by using the quarter squared algorithm [47], thus giving
$\langle 2^{n/2+1} A_H A_L \rangle_{2^n} = \langle 2^{(n/2)-1} \{(A_H + A_L)^2 - (A_H - A_L)^2\} \rangle_{2^n}$ (4.24)
Each of the square terms in equation (4.24) can be realized using a ROM of size
$2^{n/2+1} \times n$. Thus the total number of ROM bits required to compute $\langle A^2 \rangle_{2^n}$ is
$5 \times 2^{n/2} \times n$. For the case when n = 16, we have obtained a savings of 98% in ROM bits
while the overhead is two (n/2)-bit adders and two modulo $2^n$ adders of size n. Note
that as n increases the savings in ROM bits increases whereas the overhead remains
the same with respect to the count of adders, i.e. the number of adders is not a function
of n. Also, when it is required to compute terms like $-(A^2)$, a negator is not needed as
the ROM used for this purpose can be designed to directly generate the negative result.
The implementation of this technique is shown in figure 4.5.
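
The data path of equations (4.23) and (4.24) can be sketched in Python as follows. The function names are hypothetical, each squaring expression stands in for one of the three ROMs, and the shift models the scaler unit; this is an illustration under those assumptions rather than a description of the circuit of figure 4.5.

```python
def square_quarter_square(a, n):
    """Sketch of equations (4.23)-(4.24) for even n; a is an n-bit integer."""
    assert n % 2 == 0
    half = n // 2
    mod = 1 << n
    a_h = a >> half                       # A_H, upper n/2 bits
    a_l = a & ((1 << half) - 1)           # A_L, lower n/2 bits
    rom_low = (a_l * a_l) % mod           # 2^(n/2) x n ROM
    rom_sum = ((a_h + a_l) ** 2) % mod    # 2^(n/2+1) x n ROM
    rom_diff = ((a_h - a_l) ** 2) % mod   # 2^(n/2+1) x n ROM
    qs = ((rom_sum - rom_diff) << (half - 1)) % mod   # scaler by 2^(n/2 - 1)
    return (rom_low + qs) % mod


def rom_bit_savings(n):
    """Fraction of ROM bits saved relative to a 2^n x n direct table."""
    return 1 - (5 * 2 ** (n // 2) * n) / (2 ** n * n)


print(square_quarter_square(989, 10), rom_bit_savings(16))
```

For n = 16 the savings evaluate to 1 - 5/256, roughly 98%, in line with the figure quoted above.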
Figure 4.5: The computation of $\langle A^2 \rangle_{2^n}$ based on equations (4.23)-(4.24), total ROM
bits = $5 \times 2^{n/2} \times n$
Note: 1) The scaler unit simply shifts its input to the left by (n/2) - 1
positions. Thus the lower (n/2) - 1 bits are zeros and the upper
(n/2) - 1 bits are congruent to 0 modulo $2^n$ and are hence ignored. The
remaining bits can be simply hardwired at the appropriate
locations in the next unit and thus the scaler unit requires no
additional hardware.
2) The modulo $2^n$ adders are regular adders with only the
lower n significant bits, i.e. carries are ignored.
4.3 Numerical example
We now present a numerical example to illustrate the techniques presented in
sections 4.1 and 4.2. Our task is to compute $\langle A^2 \rangle_{2^n}$ with n = 10 and
$A = a_9 a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0 = 1111011101 = (989)_{10}$. We expect to obtain $\langle 989^2 \rangle_{1024} = 201$.
4.3.1 Illustrating techniques of section 4.1.1
Decomposing A into a 1-bit high word and a 9-bit low word, we get $A_{H1} = a_9 = 1$
and $A_{L1} = a_8 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0 = 111011101 = (477)_{10}$. Equation (4.3) gives
$\langle A^2 \rangle_{1024} = \langle A_{L1}^2 \rangle_{1024} = \langle 477^2 \rangle_{1024} = 201$ and this agrees with the expected result.
4.3.2 Illustrating techniques of section 4.1.2
Decomposing A into a 2-bit high word and an 8-bit low word, we get $A_{H2} = a_9 a_8 = 11 = (3)_{10}$
and $A_{L2} = a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0 = 11011101 = (221)_{10}$. Thus $\langle A_{L2}^2 \rangle_{1024} =
\langle 221^2 \rangle_{1024} = (713)_{10} = b_9 b_8 b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 = 1011001001$. Since $a_{n-2} = a_8 = 1$,
equation (4.9) gives $\langle A^2 \rangle_{1024} = c_9 b_8 \ldots b_1 b_0$ with $c_9 = a_0 \oplus b_9 = 0$. Plugging in the
values we have $\langle A^2 \rangle_{1024} = 0011001001 = (201)_{10}$ and this agrees with the expected
result.
4.3.3 Illustrating techniques of section 4.1.3
Decomposing A into a 3-bit high word and a 7-bit low word, we get $A_{H3} = a_9 a_8 a_7 = 111 = (7)_{10}$
and $A_{L3} = a_6 a_5 a_4 a_3 a_2 a_1 a_0 = 1011101 = (93)_{10}$. Thus $\langle A_{L3}^2 \rangle_{1024} =
\langle 93^2 \rangle_{1024} = (457)_{10} = d_9 d_8 d_7 d_6 d_5 d_4 d_3 d_2 d_1 d_0 = 0111001001$. Since $a_{n-2} a_{n-3} = a_8 a_7 = 11$,
equation (4.16) gives $\langle A^2 \rangle_{1024} = g_9 e_8 d_7 \ldots d_1 d_0$ with equations (4.18) and (4.20)
giving $g_9 = a_1 \oplus a_0 \oplus d_9 \oplus (a_0 \wedge d_8) = 0$ and $e_8 = a_0 \oplus d_8 = 0$. Plugging in the values we have
$\langle A^2 \rangle_{1024} = 0011001001 = (201)_{10}$ and this agrees with the expected result.
4.3.4 Illustrating techniques of section 4.2.1
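Decomposing A into a 5-bit high word and a 5-bit low word, we get $A_{H5} = a_9 a_8 a_7 a_6 a_5
= 11110 = (30)_{10}$ and $A_{L5} = a_4 a_3 a_2 a_1 a_0 = 11101 = (29)_{10}$. Thus $\langle A_{L5}^2 \rangle_{1024} =
\langle 29^2 \rangle_{1024} = (841)_{10}$. Equation (4.24) gives $\langle 2^6 A_H A_L \rangle_{2^n} = \langle 2^4 \{(A_H + A_L)^2 - (A_H - A_L)^2\} \rangle_{2^n}
= \langle 2^4 \{59^2 - 1^2\} \rangle_{1024} = 384$. Plugging the values into equation (4.23) we
have $\langle A^2 \rangle_{1024} = \langle 384 + 841 \rangle_{1024} = (201)_{10}$ and this agrees with the expected result.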
4.4 Comparing techniques of section 4.1 with 4.2
In order to make a fair comparison of the techniques presented in section 4.1 
with those of section 4.2 we decompose the number as given in section 4.2 and then 
apply the techniques of section 4.1. We compare the two techniques with respect to 
hardware cost and speed. The hardware cost is expressed as a function of 2-input gates 
while the speed as a function of gate delays. Since both methods are implementing 
equation (4.23) the cost for implementing the term $A_L^2$ does not need to be taken into
account as both methods implement this term in exactly the same fashion, viz. using a
ROM of size $2^{n/2} \times n$. The difference in cost and speed arises based on the manner in
which the other term, namely $A_H A_L 2^{n/2+1}$, is implemented and added to $A_L^2$.
4.4.1 Cost and speed analysis for section 4.1
In order to implement the term $A_H A_L 2^{n/2+1}$ we would need (n/2) - 1
multiplexers of size $2^{(n/2)-1} \times 1$. We arrive at this figure based on the following:
Recall that in this method we make use of the fact that the lower (n/2) + 1 bits of
the end result are the same as the lower (n/2) + 1 bits of the term $A_L^2$. This is simply
because the term $A_H A_L 2^{n/2+1}$ is the quantity $A_H A_L$ shifted to the left by (n/2) + 1
positions with zeroes filled in. Thus the remaining (n/2) - 1 bits of the end result are
determined by the summation of $A_H A_L$ with the upper (n/2) - 1 bits of the term $A_L^2$. This
accounts for the number of multiplexers. This scheme is pictorially shown in figure 4.6.
Let $A_H A_L$ be represented in binary as $r_{h-1} r_{h-2} \ldots r_0$ and the upper (n/2) - 1 bits of the term
$A_L^2$ as $s_{h-1} s_{h-2} \ldots s_0$ where the subscript h also represents the number of bits in the high
word.
Since $A_H$ is n/2 bits long its value lies in the range 0 to $2^{n/2} - 1$. However, the
most significant bit of $A_H$, i.e. $a_{n-1}$, has a weight of $2^{(n/2)-1}$ and when multiplied by $2^{n/2+1}$
it gives $a_{n-1} \times 2^n$ which modulo $2^n$ is equal to 0. Therefore the only bits of $A_H$ that are
of interest to us are $a_{n-2} a_{n-3} \ldots a_{n/2}$. This gives us $2^{(n/2)-1}$ different terms to be
multiplied with $A_L 2^{n/2+1}$ thus giving us the size of the multiplexer as $2^{(n/2)-1} \times 1$.
Let $A_L^2$ be represented in binary as $b_{n-1} b_{n-2} \ldots b_1 b_0$. Note that $A_L^2$ is inherently
an n-bit number. The inputs to the multiplexer are terms, each one of which is the sum of
$b_{n-1} b_{n-2} \ldots b_{(n/2)+1}$ and one of the $A_H A_L$ terms. (There are $2^{(n/2)-1}$ such terms.)
The following assumptions are made for calculating the amount of hardware:
1) We assume that the design is based on 2-input gates. We do not count the cost 
of inverters. We allow all types of 2-input gates including exclusive-or gates.
2) All gates have a fanout of 1. This assumption is necessary as the technique 
employed here is essentially bit manipulation and we are trying to give a general formula 
for any size n. While this assumption gives a conservative estimate of the number of gates
it is a fair assumption as the same criterion is applied to the techniques of section 4.2.
Also most units of section 4.2 have fanouts that are not a function of n and so to allow
an arbitrary fanout will not be fair as the size of a gate is also a function of the fanout.
Figure 4.6: Basic scheme for techniques of section 4.1
The size of the overall multiplexer is determined as follows:
Let p be the number of select lines and q be the number of bits in each input data word.
The number of 2-input gates for q multiplexers each of size $2^p \times 1$ is calculated as
follows:
Since there are p select lines, there are $2^p$ minterms each containing p bits. Each
minterm will have $\log_2 p$ stages, thus giving the number of gates as $(2^0 + 2^1 + \ldots +
2^{\log_2 p - 1})$. Normally the select lines of all the multiplexers would be tied together
implying that each minterm needs to be realized only once. However, since we are
assuming a fanout of 1 we cannot use this fact and so these terms have to be realized q
times. In each multiplexer, each minterm is combined with one bit of the input word.
Since there are $2^p$ minterms the number of stages required to transfer one bit of the q-bit
input to the output is equal to p, thus giving the number of gates as $(2^0 + 2^1 + \ldots + 2^p)$.
Thus the total number of 2-input gates for the multiplexer is given by,
mux hardware = $q 2^p (2^0 + 2^1 + \ldots + 2^{\log_2 p - 1}) + q (2^0 + 2^1 + \ldots + 2^p)$ (4.25)
In our case p = q = (n/2) - 1. Plugging this into the above equation and simplifying we
get
Mux h/w = $(n/2 - 1)[2^{(n/2)-1} (2^{\log_2 (n/2 - 1)} - 1) + (2^{n/2} - 1)]$ (4.26)
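
Equation (4.26) is easy to evaluate numerically. The short Python sketch below (the function name is an assumption) uses the closed forms of the two geometric sums in equation (4.25) and reproduces the section 4.1 column of Table 4.5.

```python
def mux_gate_count(n):
    """Evaluate equation (4.26) with p = q = n/2 - 1 (n even)."""
    p = q = n // 2 - 1
    minterm_tree = p - 1              # 2^0 + ... + 2^(log2 p - 1) = 2^(log2 p) - 1
    combine_tree = 2 ** (p + 1) - 1   # 2^0 + ... + 2^p
    return q * (2 ** p * minterm_tree + combine_tree)


for word_length in (16, 32):
    print(word_length, mux_gate_count(word_length))
```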
The computation for the number of gates required for obtaining each of the input
words is based on a recursive formula and is thus not as straightforward as the above
analysis. It is thus presented in detail. Let $G_h$ denote the number of gates required for
computing the input words to the multiplexer when the high word has h bits and let $g_i$
denote the number of gates required for the modulo $2^n$ addition of two i-bit words.
For the case when the high word is one bit long there is no additional hardware.
Thus $G_1 = 0$.
For the case when the high word is two bits long we require a 2 x 1 mux and
one gate. Using the notation introduced in this section this gate performs the addition of
$r_0$ and $s_0$. Thus $G_2 = 1$ and $g_1 = 1$.
For the case when the high word is three bits long we require a 4 x 1 mux and
10 gates. We explain below how the figure of 10 is obtained. The words of interest now
are $r_1 r_0$ and $s_1 s_0$ while the following summations need to be performed.
1) $s_1 s_0 + r_1 r_0$: 4 gates.
2) $r_1 r_0 + r_0$, with $r_0$ aligned with the upper bit (1 gate), then the result $+\ s_1 s_0$ (4 gates).
3) $s_1 s_0 + r_0$, with $r_0$ aligned with the upper bit: 1 gate.
Note that the summation of the last column is given by $G_2$. Thus the number of gates
can be given as $G_3 = 4 + 1 + 4 + G_2 = 10$. This checks with equations (4.17)-(4.20).
Here the first term is $g_2$ and is equal to 4.
For the case when the high word is four bits long we require an 8 x 1 mux and
60 gates. We explain below how the figure of 60 is obtained. The words of interest now
are $r_2 r_1 r_0$ and $s_2 s_1 s_0$ while the following summations need to be performed.
1) $s_2 s_1 s_0 + r_2 r_1 r_0$: 10 gates. The number 6 appears because to add $s_1$ and $r_1$ we
need a full adder. (For a fanout of one a full adder needs 6 gates.)
2) $r_2 r_1 r_0 + r_1 r_0$, shifted left by one position (4 gates), then the result $+\ s_2 s_1 s_0$ (10 gates).
3) $r_2 r_1 r_0 + r_0$, shifted left by two positions (1 gate), then the result $+\ s_2 s_1 s_0$ (10 gates).
4) $r_2 r_1 r_0 + r_1 r_0$, shifted left by one position (4 gates), then $+\ r_0$, shifted left by two
positions (1 gate), then the result $+\ s_2 s_1 s_0$ (10 gates).
5) In addition to the above we would need all the gates required for the case when
the high word had three bits.
Thus the number of gates can be given as $G_4 = 10 + 4 + 10 + 1 + 10 + 4 + 1 +
10 + G_3 = 60$. Here the first term is $g_3$ and is equal to 10.
From the above discussion we see that for each decomposition the term $g_i$ also
represents the number of gates required for summing the two words of interest viz.
$r_{h-1} r_{h-2} \ldots r_0$ and $s_{h-1} s_{h-2} \ldots s_0$. We let set $G = \{g_1, g_2, \ldots, g_{i-1}\}$ where i = h-1. Then the
general formula for the number of gates required for all the input words of the
multiplexer can be given for all h > 2, by
$G_h = g_i + \underbrace{\sum_{k=1}^{i-1} (g_k + g_i)}_{\binom{|G|}{1}\ \text{terms}} + \underbrace{\sum_{k<l} (g_k + g_l + g_i)}_{\binom{|G|}{2}\ \text{terms}} + \cdots + (g_1 + g_2 + \cdots + g_{i-1} + g_i) + G_{h-1}$ (4.27)
where $G_1 = 0$, $G_2 = 1$, $g_1 = 1$, $g_2 = 4$, and $g_i = 4 + 6(i-2)$ for i > 2 and $|G|$ is the
cardinality of set G.
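
The recursion can be checked against the two worked cases above. The Python sketch below rests on an assumed reading of equation (4.27): one chain of additions for every subset of $G$ (including the empty one), each chain finishing with an addition that costs $g_i$, plus the gates inherited from $G_{h-1}$. The function names are hypothetical. Under this reading the sketch reproduces $G_3 = 10$ and $G_4 = 60$.

```python
from itertools import combinations


def adder_gates(i):
    """g_i: gates for the modulo 2^n addition of two i-bit words."""
    if i == 1:
        return 1
    if i == 2:
        return 4
    return 4 + 6 * (i - 2)


def input_word_gates(h):
    """G_h under the subset reading of equation (4.27) described above."""
    if h == 1:
        return 0
    if h == 2:
        return 1
    i = h - 1
    smaller = [adder_gates(k) for k in range(1, i)]   # the set G
    total = adder_gates(i)                            # empty subset
    for size in range(1, len(smaller) + 1):
        for subset in combinations(smaller, size):
            total += sum(subset) + adder_gates(i)
    return total + input_word_gates(h - 1)


print(input_word_gates(3), input_word_gates(4))
```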
Thus the total hardware is given by
Total Hardware -Section (4.1) = Equation (4.26) + Equation (4.27) (4.28)
Table 4.5 lists the costs of the hardware for various values of n and also gives a
cost comparison of the two sections. Here the hardware cost is based on equation (4.26)
and not on equation (4.28) merely to illustrate the fact that in spite of ignoring the cost
of equation (4.27), section 4.2 is far more cost efficient. Also, because of this we use
the term minimum % savings as opposed to simply % savings.
We now present the analysis to compute the time delay associated with the
computation of the squaring operation. The delay of the multiplexer is given by $1 + p +
\log_2 p$ while the delay to compute the input terms of the mux is given by the time to
obtain the mod $2^n$ sum of $r_{h-1} r_{h-2} \ldots r_0$ and $s_{h-1} s_{h-2} \ldots s_0$. The worst case delay arises
when $r_{h-1} r_{h-2} \ldots r_0$ assumes its maximum value of $2^h - 1$. In such a situation h-1
summations have to be performed in a sequential fashion while the delay of each
summation is given by the time to compute the corresponding $g_i$'s. The time delay for $g_i$
is denoted by $t_{g_i}$, while $t_{g_1} = 1$, $t_{g_2} = 2$, $t_{g_i} = 2(i-1)$ for i > 2. Thus the total worst case
delay is given by
$T_h = 3 + \sum_{i=3}^{h-1} t_{g_i} = (h-3)[(h-3)(h+2) - 2] + 3$ for h > 2 (4.29)
We have assumed that the full adders are connected in a ripple fashion. Thus the total
delay in gate delays is given by
$1 + p + \log_2 p + (h-3)[(h-3)(h+2) - 2] + 3$ (4.30)
For the decomposition considered p = q = (n/2) - 1 and h = n/2 and plugging
these into the above equation we get
Time Delay - Section (4.1) = $n/2 + (1/8)(n-6)[(n-6)(n+4) - 8] + \log_2 (n/2 - 1) + 3$ (4.31)
Table 4.6 lists the time delays for various values of n and also gives a delay
comparison of the two sections. Once again, it is seen that techniques of section 4.2 are
better than those of section 4.1.

Table 4.5: Cost comparison in 2-input gates of techniques of section 4.1 with 4.2
Word length n | H/w cost of section 4.1 based on eqn (4.26) | H/w cost of section 4.2 based on eqn (4.34) | Minimum % savings
16 | 7161 | 1411 | 80.30
32 | 7864305 | 40339 | 99.48

4.4.2 Cost and speed analysis for section 4.2
The hardware cost of a modulo $2^n$ adder for n > 2 is given by $4 + 6(n-2)$.
Referring to figure 4.5 there are 2 adders of size n/2 + 1, one adder of size n, and one
adder of size n/2. (Note that the $A_H - A_L$ unit produces the result in 2's complement
form.) This gives a total adder cost of $15n - 13$. The cost of a ROM with L address
lines can be given based on [55] as
ROM Cost = $2 \times (2^0 + 2^1 + \ldots + 2^{\log_2 (L/2) - 1}) + (2^0 + 2^1 + \ldots + 2^{\log_2 2^{L/2}})$ (4.32)
Table 4.6: Speed comparison in 2-input gate delays of techniques of section 4.1 with 4.2
Word length n | Time delay of section 4.1 based on eqn (4.31) | Time delay of section 4.2 based on eqn (4.35) | Ratio of time delays of sections 4.1 and 4.2
16 | 254 | 62 | 4.10
32 | 3039 | 126 | 24.12

However, since we assume that each minterm is realized independently the first term is
to be multiplied by $2^{L/2}$. The second term has to be multiplied by n as the output of our
ROM has n bits. Here L = n/2, thus giving
ROM Cost = $2^{(n/4)-1} \times (5n - 4) - n$ (4.33)
Thus the total hardware cost can be given by
Total Hardware - Section (4.2) = $15n - 13 + 2 \times$ Equation (4.33) (4.34)
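
Equations (4.33) and (4.34) can be evaluated with the following Python sketch. It assumes the exponent in equation (4.33) reads $(n/4) - 1$, which is the reading that reproduces the section 4.2 column of Table 4.5; the function names and the dictionary of Table 4.5 values are introduced here for illustration.

```python
def rom_cost(n):
    """Equation (4.33), assuming the exponent is n/4 - 1 (n divisible by 4)."""
    return 2 ** (n // 4 - 1) * (5 * n - 4) - n


def total_cost_section_42(n):
    """Equation (4.34): adder cost plus two ROMs."""
    return 15 * n - 13 + 2 * rom_cost(n)


# Section 4.1 multiplexer costs as listed in Table 4.5.
SECTION_41_COST = {16: 7161, 32: 7864305}

for n, cost_41 in SECTION_41_COST.items():
    cost_42 = total_cost_section_42(n)
    savings = 100 * (1 - cost_42 / cost_41)
    print(n, cost_42, savings)
```

The printed savings agree with the minimum % savings column of Table 4.5 to within the last printed digit.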
Referring to figure 4.5 again the delay can be given by three levels of adders
plus one level of ROMs. The total delay of the adders in gate delays is given by $4n - 2$
while the delay of the ROM in gate delays [55] is given by $2 + \log_2 (L/2) + \log_2 2^{L/2}$.
However, referring to figure 4.6 we note that the same ROM delay is also associated 
with techniques of section 4.1. Therefore for comparison purposes we do not need to 
include the delay of the ROM unit. Thus we have
Time Delay -Section(4.2) = 4n -2 (4.35)
Tables 4.5 and 4.6 summarize the costs and delays for various values of n and 
also compare them with the techniques of section 4.1.
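
The entries of Table 4.6 follow directly from equations (4.31) and (4.35). The Python sketch below is illustrative (the function names are assumptions); the delays are rounded to whole gate delays before the ratio is formed, which matches the tabulated figures.

```python
from math import log2


def delay_section_41(n):
    """Equation (4.31), in 2-input gate delays."""
    return n / 2 + (n - 6) * ((n - 6) * (n + 4) - 8) / 8 + log2(n / 2 - 1) + 3


def delay_section_42(n):
    """Equation (4.35), in 2-input gate delays."""
    return 4 * n - 2


for n in (16, 32):
    d41 = round(delay_section_41(n))
    d42 = delay_section_42(n)
    print(n, d41, d42, round(d41 / d42, 2))
```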
We make the following observations:
1) From table 4.5 we see that techniques of section 4.2 result in considerable
savings in hardware, of up to 99.48%, when compared with those of section 4.1. Note
that in this table for section 4.1 we have only taken into account the cost of the
multiplexer. From table 4.6 we see that techniques of section 4.2 also yield much
faster hardware, by a factor of approximately 24 for a 32-bit word.
2) The bulk of the delay in equation (4.35) is due to the adder circuit. By using
better adders such as the carry look-ahead adder the timing can be drastically improved
for section (4.2) while it will make little difference for techniques of section (4.1) as,
referring to equation (4.31), the adder delay here is $O(n^3)$. Also in section (4.1) the
number of summands is a function of n while for section (4.2) it is a constant.
3) The ROM delay models used for section (4.2) are very conservative [55] as they
do not take into account the density, regular implementation structure, etc. while the
model used for section (4.1) is very generous as it does not take into account the delays of
interconnection wiring.
4) A big advantage of section (4.2) is that it is very modular. Thus in a practical
implementation a design change from n = 16 to say n = 32 will require much less
design turnaround time as only the blocks have to be changed while for techniques of
section (4.1) a complete new set of schematics will have to be created.
4.5 Memory compression schemes for arithmetic in 
modulo $2^n - 1$
In this section we present the arithmetic manipulations required to compute the
square of a number modulo $2^n - 1$. Our objective here again is to find ROM based
efficient methods to compute the square of a number modulo $2^n - 1$. Let us consider a
number A belonging to the modular ring $Z_{2^n - 1} = \{0, 1, \ldots, 2^n - 2\}$. Then A has an n-bit
binary representation as in $A = a_{n-1} a_{n-2} \ldots a_1 a_0$; $a_i \in \{0,1\}$. Our task is to compute
$\langle A^2 \rangle_{2^n - 1}$, where as usual $\langle x \rangle_m$ denotes the operation x modulo m. Our method is
essentially the same as that outlined in section 4.1. Here we present the analysis for two
different decompositions of the number A.
4.5.1 Analysis when the high word is one bit long
Let $A_{H1}$ be a 1-bit word and $A_{L1}$ be an (n-1)-bit word with $A_{H1} = a_{n-1}$ and
$A_{L1} = a_{n-2} \ldots a_1 a_0$. Then the value of A is given by equation (4.1) and its square by
equation (4.2). Evaluating equation (4.2) modulo $2^n - 1$ we get
$\langle A^2 \rangle_{2^n - 1} = \begin{cases} \langle A_{L1}^2 \rangle_{2^n - 1} & \text{if } a_{n-1} = 0 \\ \langle 2^{n-2} + A_{L1} + A_{L1}^2 \rangle_{2^n - 1} & \text{if } a_{n-1} = 1 \end{cases}$ (4.36)
In the computation of equation (4.36) we use the following: $\langle 2^n \rangle_{2^n - 1} = 1$, $\langle 2^{2n-2} \rangle_{2^n - 1}
= 2^{n-2}$, and $A_{H1} = a_{n-1} \in \{0,1\}$. Further, the summation of $2^{n-2}$ and $A_{L1}$ can be given
in binary as $a_{n-2} \bar{a}_{n-2} a_{n-3} \ldots a_1 a_0$, and is less than $2^n - 1$ as both $a_{n-2}$ and its complement
$\bar{a}_{n-2}$ cannot be one at the same time.
We realize the term $A_{L1}^2$ using a ROM thus needing a ROM of size $2^{n-1} \times n$.
Therefore we have a savings of 50% in ROM bits, however the overhead is one modulo
$2^n - 1$ adder and a single 2 x 1 multiplexer of word length n. As seen from equation
(4.36) the select line of the multiplexer is $a_{n-1}$. In the next section, by applying the same
techniques, we analyze the effects of increasing the length of $A_{H1}$ and reducing the
length of $A_{L1}$ on the savings in ROM bits.
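
Equation (4.36) and the bit-pattern remark that follows it can be checked with the Python sketch below. The function names are hypothetical, and the multiplication stands in for the $2^{n-1} \times n$ ROM.

```python
def square_mod_2n_minus_1(a, n):
    """Sketch of equation (4.36); a is an element of Z_(2^n - 1)."""
    mod = (1 << n) - 1
    a_low = a & ((1 << (n - 1)) - 1)       # A_L1
    rom = (a_low * a_low) % mod            # stands in for the 2^(n-1) x n ROM
    if (a >> (n - 1)) & 1 == 0:            # multiplexer select line a_{n-1}
        return rom
    return (rom + (1 << (n - 2)) + a_low) % mod   # one modulo 2^n - 1 adder


def offset_pattern(a_low, n):
    """Binary word a_{n-2} ~a_{n-2} a_{n-3} ... a_0 built bit by bit."""
    top = (a_low >> (n - 2)) & 1
    rest = a_low & ((1 << (n - 2)) - 1)
    return (top << (n - 1)) | ((top ^ 1) << (n - 2)) | rest


print(square_mod_2n_minus_1(200, 8))
```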
4.5.2 Analysis when the high word is two bits long
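Let $A_{H2}$ be a 2-bit word and $A_{L2}$ be an (n-2)-bit word with $A_{H2} = a_{n-1} a_{n-2}$ and
$A_{L2} = a_{n-3} \ldots a_1 a_0$. Then the value of A is given by equation (4.4) and its square by
equation (4.5). Evaluating equation (4.5) modulo $2^n - 1$ we get
$\langle A^2 \rangle_{2^n - 1} = \langle 2^{n-4} A_{H2}^2 + A_{H2} A_{L2} 2^{n-1} + A_{L2}^2 \rangle_{2^n - 1}$ (4.37)
In the above we have used the simplification $\langle 2^{2n-4} \rangle_{2^n - 1} = 2^{n-4}$. The possible values of
$\langle 2^{n-4} A_{H2}^2 + A_{H2} A_{L2} 2^{n-1} \rangle_{2^n - 1}$ are shown in Table 4.7 while the following can be
observed:
Case 1: $a_{n-1} a_{n-2} = 01$.
In this case, according to Table 4.7, $\langle 2^{n-4} + A_{L2} 2^{n-1} \rangle_{2^n - 1}$ needs to be
computed. Since $A_{L2} = a_{n-3} \ldots a_0$, $A_{L2} 2^{n-1} = (a_{n-3} \ldots a_1 a_0) 2^{n-1} = (a_{n-3} \ldots a_1) 2^n + a_0 2^{n-1}$
and $\langle A_{L2} 2^{n-1} \rangle_{2^n - 1} = \langle (000 a_{n-3} \ldots a_1) + (a_0 00 \ldots 00) \rangle_{2^n - 1} = a_0 00 a_{n-3} \ldots a_1$. One step
further we observe that
$\langle 2^{n-4} + A_{L2} 2^{n-1} \rangle_{2^n - 1} = \langle (00010 \ldots 0) + (a_0 00 a_{n-3} \ldots a_1) \rangle_{2^n - 1} = a_0 0 a_{n-3} \bar{a}_{n-3} a_{n-4} \ldots a_1$ (4.38)
Case 2: $a_{n-1} a_{n-2} = 10$.
In this case the desired $\langle 2^{n-2} + A_{L2} \rangle_{2^n - 1}$ is given by
$\langle 2^{n-2} + A_{L2} \rangle_{2^n - 1} = \langle (010 \ldots 0) + (00 a_{n-3} \ldots a_0) \rangle_{2^n - 1} = 01 a_{n-3} \ldots a_0$ (4.39)
Case 3: $a_{n-1} a_{n-2} = 11$.
Here we need to compute $\langle 2^{n-1} + 2^{n-4} + A_{L2} + A_{L2} 2^{n-1} \rangle_{2^n - 1}$. It is easy to see
that summation of equations (4.38), (4.39), and the quantity $2^{n-2}$ yields the desired
result. We thus have
$\langle 2^{n-1} + 2^{n-4} + A_{L2} + A_{L2} 2^{n-1} \rangle_{2^n - 1} = \langle (a_0 0 a_{n-3} \bar{a}_{n-3} a_{n-4} \ldots a_1) + (10 a_{n-3} \ldots a_1 a_0) \rangle_{2^n - 1}$ (4.40)

Table 4.7: Values of $\langle 2^{n-4} A_{H2}^2 + A_{H2} A_{L2} 2^{n-1} \rangle_{2^n - 1}$
$A_{H2} = a_{n-1} a_{n-2}$ | $\langle 2^{n-4} A_{H2}^2 + A_{H2} A_{L2} 2^{n-1} \rangle_{2^n - 1}$
0 0 | 0
0 1 | $\langle 2^{n-4} + A_{L2} 2^{n-1} \rangle_{2^n - 1}$
1 0 | $\langle 2^{n-2} + A_{L2} \rangle_{2^n - 1}$
1 1 | $\langle 2^{n-1} + 2^{n-4} + A_{L2} + A_{L2} 2^{n-1} \rangle_{2^n - 1}$

Finally, combining Table 4.7 and equations (4.38)-(4.40) we get
$\langle A^2 \rangle_{2^n - 1} = \begin{cases} \langle A_{L2}^2 \rangle_{2^n - 1} & \text{if } a_{n-1} a_{n-2} = 00 \\ \langle (a_0 0 a_{n-3} \bar{a}_{n-3} a_{n-4} \ldots a_1) + A_{L2}^2 \rangle_{2^n - 1} & \text{if } a_{n-1} a_{n-2} = 01 \\ \langle (01 a_{n-3} a_{n-4} \ldots a_1 a_0) + A_{L2}^2 \rangle_{2^n - 1} & \text{if } a_{n-1} a_{n-2} = 10 \\ \langle (a_0 0 a_{n-3} \bar{a}_{n-3} a_{n-4} \ldots a_1) + (10 a_{n-3} \ldots a_1 a_0) + A_{L2}^2 \rangle_{2^n - 1} & \text{if } a_{n-1} a_{n-2} = 11 \end{cases}$ (4.41)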
We realize the term $A_{L2}^2$ using a ROM thus needing a ROM of size $2^{n-2} \times n$.
Therefore we have an increase in savings to 75% in ROM bits, however the overhead is
two modulo $2^n - 1$ adders and a single 4 x 1 multiplexer of word length n. As seen from
equation (4.41) the select lines of the multiplexer are $a_{n-1} a_{n-2}$. It is easy to see that if this
method is extended further the savings in ROM bits will increase but at the same time
the number of adders and the size of the multiplexer will also increase.
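
The bit patterns of equations (4.38)-(4.40) can be assembled directly from the bits of A, which is what makes the multiplexer inputs cheap to form. The Python sketch below is an illustration only: the function name is hypothetical, the multiplication stands in for the $2^{n-2} \times n$ ROM, and n is assumed to be at least 5 so that all bit positions exist.

```python
def square_mod_2n_minus_1_two_bit(a, n):
    """Sketch of equation (4.41); a is an element of Z_(2^n - 1), n >= 5."""
    mod = (1 << n) - 1
    a_low = a & ((1 << (n - 2)) - 1)                  # A_L2
    rom = (a_low * a_low) % mod                       # 2^(n-2) x n ROM stand-in
    a0 = a & 1
    a_n3 = (a >> (n - 3)) & 1
    tail = (a_low >> 1) & ((1 << (n - 4)) - 1)        # a_{n-4} ... a_1
    # (4.38): a_0 0 a_{n-3} ~a_{n-3} a_{n-4} ... a_1
    word_438 = (a0 << (n - 1)) | (a_n3 << (n - 3)) | ((a_n3 ^ 1) << (n - 4)) | tail
    word_439 = (1 << (n - 2)) | a_low                 # (4.39): 0 1 a_{n-3} ... a_0
    word_10 = (1 << (n - 1)) | a_low                  # 1 0 a_{n-3} ... a_0 in (4.40)
    select = (a >> (n - 2)) & 0b11                    # a_{n-1} a_{n-2}
    offsets = [0, word_438, word_439, word_438 + word_10]
    return (rom + offsets[select]) % mod


print(square_mod_2n_minus_1_two_bit(195, 8))
```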
4.6 Optimized memory compression schemes for 
arithmetic in modulo $2^n - 1$
We present the analysis when n is even. Once again, we consider a number A
belonging to the modular ring $Z_{2^n - 1} = \{0, 1, \ldots, 2^n - 2\}$. Then A has an n-bit binary
representation as in $A = a_{n-1} a_{n-2} \ldots a_1 a_0$; $a_i \in \{0,1\}$. Let $A_H = a_{n-1} a_{n-2} \ldots a_{n/2}$ and $A_L
= a_{(n/2)-1} a_{(n/2)-2} \ldots a_1 a_0$. Then the value of A is given by equation (4.21) and its square
by equation (4.22). Evaluating equation (4.22) modulo $2^n - 1$ we get
$\langle A^2 \rangle_{2^n - 1} = \langle A_H^2 + A_L^2 + A_H A_L 2^{n/2+1} \rangle_{2^n - 1}$ (4.42)
Comparing this with equation (4.23) we find that it is very similar except for the
fact that this has the additional term $A_H^2$ which in turn can be realized using a ROM of
size $2^{n/2} \times n$. The other terms can be realized as outlined in section 4.2.1. Also note that
the ROM for $A_H^2$ is identical to the ROM for $A_L^2$. The similarities between equations
(4.23) and (4.42) yield an overhead that is very streamlined with respect to
implementation and is thus suited for VLSI implementation. We should note that
arithmetic in this ring makes use of the following: $\langle 2^n \rangle_{2^n - 1} = 1$, $\langle 2^{n+1} \rangle_{2^n - 1} = 2$.
Tables 4.3 and 4.4 summarize the results for the cases when n is even and n is odd.
4.7 Memory compression schemes for arithmetic in
modulo $2^n + 1$
The number A that needs to be squared now belongs to the ring $Z_{2^n + 1} = \{0, 1,
\ldots, 2^n\}$. If $A = 2^n$, then A has an (n+1)-bit representation as in $A = 100 \ldots 00 = \langle -1 \rangle_{2^n + 1}$
and $\langle A^2 \rangle_{2^n + 1} = 1$. For all the other cases A assumes an n-bit binary representation
as in $A = a_{n-1} a_{n-2} \ldots a_1 a_0$; $a_i \in \{0,1\}$. Considering the decomposition of A into $A_{H1} =
a_{n-1}$ and $A_{L1} = a_{n-2} \ldots a_0$, the following equation can be derived on lines similar to that
of equation (4.36).
$\langle A^2 \rangle_{2^n + 1} = \begin{cases} \langle A_{L1}^2 \rangle_{2^n + 1} & \text{if } a_{n-1} = 0 \\ \langle -(a_{n-2} \bar{a}_{n-2} a_{n-3} \ldots a_0) + A_{L1}^2 \rangle_{2^n + 1} & \text{if } a_{n-1} = 1 \end{cases}$ (4.43)
4.8 Optimized memory compression schemes for
arithmetic in modulo $2^n + 1$
We once again present the analysis when n is even. We consider a number A
belonging to the modular ring $Z_{2^n + 1} = \{0, 1, \ldots, 2^n\}$. Again, if $A = 2^n$, then A has an
(n+1)-bit representation as in $A = 100 \ldots 00 = \langle -1 \rangle_{2^n + 1}$ and $\langle A^2 \rangle_{2^n + 1} = 1$. For all
the other cases A assumes an n-bit binary representation as in $A = a_{n-1} a_{n-2} \ldots a_1 a_0$; $a_i
\in \{0,1\}$. Let $A_H = a_{n-1} a_{n-2} \ldots a_{n/2}$ and $A_L = a_{(n/2)-1} a_{(n/2)-2} \ldots a_1 a_0$. Then the value of A is
given by equation (4.21) and its square by equation (4.22). Evaluating equation (4.22)
modulo $2^n + 1$ we get
$\langle A^2 \rangle_{2^n + 1} = \langle -A_H^2 + A_L^2 + A_H A_L 2^{n/2+1} \rangle_{2^n + 1}$ (4.44)
Comparing this with equation (4.42) we find that it is very similar except for the
fact that the term $A_H^2$ is negative. But this does not need any extra hardware as the ROM
used to realize this term can directly generate the negative result. Therefore, the amount
of hardware required for realizing this equation is the same as that required for realizing
equation (4.42) plus a 2 x 1 multiplexer. The select line of this multiplexer is bit $a_n$, and
if this bit is equal to 1 then the output is set to one as explained before while if it is zero
the output is the result of equation (4.44). The similarities between equations (4.23),
(4.42), and (4.44) again suggest that the overhead is very streamlined with respect to
implementation and is thus well suited for VLSI implementation. We should note that
arithmetic in this ring makes use of the following: $\langle 2^n \rangle_{2^n + 1} = -1$, and $\langle 2^{n+1} \rangle_{2^n + 1} =
-2$. Tables 4.3 and 4.4 summarize the results for the cases when n is even and n is odd.
From these tables it is clearly seen that these techniques are also ideally suited for
building an integrated squarer, i.e. a unit that can compute any one of three operations
viz. $\langle A^2 \rangle_{2^n}$, $\langle A^2 \rangle_{2^n - 1}$, or $\langle A^2 \rangle_{2^n + 1}$. Note that $A_L^2$ is inherently an n-bit
representation and thus the same unit can be used in all three computations.
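
The idea of an integrated squarer can be illustrated with the three formulas of Table 4.3 (n even). The Python sketch below is not a hardware description: the function name and the string labels for the three rings are assumptions made here, and the arithmetic expressions stand in for the ROMs and adders.

```python
def integrated_square(a, n, ring):
    """Square a in one of the rings "2^n", "2^n-1", "2^n+1" (n even)."""
    assert n % 2 == 0
    half = n // 2
    if ring == "2^n+1" and a == 1 << n:
        return 1                                   # A = 2^n is -1 in this ring
    a_h = a >> half                                # A_H
    a_l = a & ((1 << half) - 1)                    # A_L
    # QS = 2^(n/2 - 1) {(A_H + A_L)^2 - (A_H - A_L)^2}
    qs = ((a_h + a_l) ** 2 - (a_h - a_l) ** 2) << (half - 1)
    if ring == "2^n":
        return (a_l ** 2 + qs) % (1 << n)
    if ring == "2^n-1":
        return (a_h ** 2 + a_l ** 2 + qs) % ((1 << n) - 1)
    if ring == "2^n+1":
        return (-(a_h ** 2) + a_l ** 2 + qs) % ((1 << n) + 1)
    raise ValueError("unknown ring")


print([integrated_square(200, 8, r) for r in ("2^n", "2^n-1", "2^n+1")])
```

The $A_L^2$ and QS terms are shared by all three branches; only the handling of $A_H^2$ and the final modulus differ.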
4.9 Conclusions
In this chapter we have presented in detail two ROM based methods that 
compute the squaring operation in modular rings. When compared with traditional 
techniques, both techniques reduce the number of ROM bits significantly. However, for 
a fair comparison of the two techniques the cost of the overhead must be included and in 
the ensuing analysis we show that techniques of section 4.2 are very optimal in all
respects viz. cost, speed, and regularity of the hardware structure. The techniques of 
section 4.2 are very systematic and result in a modular design, i.e.,
i) a modulo $2^n$ squarer unit can be easily extended to a modulo $2^n - 1$ or modulo
$2^n + 1$ squarer and
ii) design changes for different values of n are minimal.
While we have not presented the comparative analysis for arithmetic in modulo
$2^n - 1$ and modulo $2^n + 1$, one can see from equations (4.36)-(4.41) and (4.43) that the
techniques of section 4.2 will yield optimal results as the techniques of section 4.1
require the use of modulo adders and multiplexers whose input words have a length of n
bits. Also, the number of adders required is a function of the decomposition length. We
also note that the cost of computing the overhead in these rings is far simpler than when
the arithmetic is performed modulo $2^n$. This is because the size of the multiplexer data
words is always the same, i.e. it is not a function of the length of decomposition.
Chapter 5 
Conclusions
In this chapter we first summarize the results of this dissertation and then 
discuss avenues for further research initiated by this effort.
5.1 Summary
In this dissertation we have developed algorithms for obtaining the cyclic 
convolution of two n-point sequences where n is a power of two, with no restriction on 
the size of each point. These algorithms rely only on square, add, and subtract 
operations. All the necessary theory for computing the cyclic convolution operation is 
developed in chapter 2. The correctness of these algorithms is based on eight 
theorems, also developed in chapter 2. We have also derived non-recursive formulae 
for the numbers of addition and squaring operations. These formulae show that while we 
decrease the number of squaring operations we increase the number of addition 
operations. Issues relating to CSA and ROM based implementations were discussed 
in detail in chapter 3. The main purpose of this exercise was to demonstrate that the 
increase in the number of addition operations does not negate the decrease in the 
number of squaring operations. The results of that chapter demonstrate convincingly the 
usefulness of squared law algorithms. Further, we have shown that our methods are 
far superior to traditional methods when ROMs are used. Our methods also 
result in modular implementations and exhibit properties that can be exploited by 
clever architectural designs to obtain elegant and efficient implementations. Our 
methods also do not introduce any round-off errors and thereby eliminate the need for
error correction hardware. Some interesting observations were found in CSA based 
implementations of squarers and these along with schemes for multiplying two 
numbers based on the cyclic convolution operation were also presented in chapter 3. In 
chapter 4, the behavior of the squaring operation when computed in modular rings was 
examined. Two methods of this computation were presented and we have clearly 
shown that one is far better than the other, both in terms of speed and cost.
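To make concrete how a cyclic convolution can be built from square, add, and subtract operations alone, here is a minimal sketch that assumes the familiar quarter-square identity $ab = ((a+b)^2 - (a-b)^2)/4$. It is only an illustration of the principle: it does not reproduce the algorithms or theorems of chapter 2 and makes no attempt to reduce the number of squarings. The function name is hypothetical.

```python
def cyclic_convolution_by_squares(x, h):
    """Cyclic convolution of two equal-length integer sequences.

    Every product x[i]*h[j] is replaced by a difference of two squares,
    so the inner loop uses only squaring, addition, and subtraction.
    The single division by 4 per output is exact for integer inputs.
    """
    n = len(x)
    if len(h) != n:
        raise ValueError("sequences must have the same length")
    y = []
    for k in range(n):
        acc = 0
        for i in range(n):
            s = x[i] + h[(k - i) % n]
            d = x[i] - h[(k - i) % n]
            acc += s * s - d * d      # equals 4 * x[i] * h[(k - i) mod n]
        y.append(acc // 4)
    return y


if __name__ == "__main__":
    print(cyclic_convolution_by_squares([1, 2, 3, 4], [5, 6, 7, 8]))
```

Because all intermediate values are integers, no round-off occurs, which agrees with the remark above about error correction hardware.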
5.2 Future research emphasis
Since we have shown the usefulness of squared law algorithms in applications 
of digital signal processing and error control coding, further work can be classified 
under research and development.
Research: The research emphasis can be on finding other multiplication intensive
environments and deriving similar algorithms. For instance, some of the other useful 
operations are linear convolutions, skew-cyclic convolutions [35], and higher order 
correlations [64]. Since skew-cyclic convolutions are less symmetric than cyclic 
convolutions, algorithms developed on lines similar to that of this dissertation are 
likely to be less efficient. However, no such hypothesis can be made for triple 
correlations. While triple-order correlations contain more information on the signal 
they also require more computation. Thus, it might be useful to explore the 
applicability of our methods in these computations.
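For reference, the sketch below writes out the cyclic and skew-cyclic convolutions named above by their standard definitions and checks the well-known relation by which a linear convolution can be recovered from the pair. It is a direct, multiplication based illustration of the definitions, not a squared law algorithm; all function names are hypothetical.

```python
def cyclic(x, h):
    """Cyclic convolution: indices wrap around with a plus sign."""
    n = len(x)
    return [sum(x[i] * h[(k - i) % n] for i in range(n)) for k in range(n)]


def skew_cyclic(x, h):
    """Skew-cyclic convolution: wrapped terms enter with a minus sign."""
    n = len(x)
    y = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                y[i + j] += x[i] * h[j]
            else:
                y[i + j - n] -= x[i] * h[j]
    return y


def linear_from_pair(x, h):
    """Recover the (2n - 1)-point linear convolution from the two above.

    With l the linear result, cyclic[k] = l[k] + l[k+n] and
    skew[k] = l[k] - l[k+n], so each half follows from a sum or difference.
    """
    n = len(x)
    c, s = cyclic(x, h), skew_cyclic(x, h)
    low = [(c[k] + s[k]) // 2 for k in range(n)]
    high = [(c[k] - s[k]) // 2 for k in range(n)]
    return (low + high)[: 2 * n - 1]


if __name__ == "__main__":
    print(linear_from_pair([1, 2, 3, 4], [5, 6, 7, 8]))
```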
Development: For some specific needs hardware units and software programs
can be developed around our algorithms and their performance can be compared with 
existing products that have the same goals. Existing processors and routines do not 
exploit the properties of the squaring operation and multi-operand additions. Thus, for 
a fair comparison new units and routines have to be created.
References
[1] R. E. Blahut, "Algebraic fields, signal processing, and error control," 
Proceedings of the IEEE, vol. 73, no. 5, pp. 874-893, May 1985.
[2] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing. Englewood 
Cliffs, NJ: Prentice Hall, 1975.
[3] E. R. Berlekamp, Algebraic Coding Theory. New York: McGraw-Hill, 1968.
[4] J. H. McClellan and C. M. Rader, Number Theory in Digital Signal Processing. 
Englewood Cliffs, NJ: Prentice Hall, 1979.
[5] M.A. Soderstrand, W.K. Jenkins, G.A. Jullien and F.J. Taylor, Eds., Residue 
Number System Arithmetic: Modern Applications in Digital Signal Processing. 
New York: IEEE Press, 1986.
[6] J. W. Cooley and J. W. Tukey, "An algorithm for the machine computation of 
complex Fourier series," Mathematics of Computation, vol. 19, pp. 297-301, 
1965.
[7] R. C. Agarwal and C. S. Burrus, "Fast convolution using Fermat number 
transforms with applications to digital filtering," IEEE Transactions on 
Acoustics, Speech, and Signal Processing, vol. ASSP-22, pp. 87-97, April 
1974.
[8] R. C. Agarwal and C. S. Burrus, "Number theoretic transforms to implement 
fast digital convolution," Proceedings of the IEEE, vol. 63, no. 4, pp. 550-560, 
April 1975.
[9] I. S. Reed and T. K. Truong, "The use of finite fields to compute 
convolutions," IEEE Transactions on Information Theory, vol. IT-21, pp. 208- 
213, 1975.
[10] H. J. Nussbaumer, "Digital filtering using polynomial transforms," Electronics 
Letters, vol. 13, pp. 386-387, 1977.
[11] B. Rice, "Some good fields and rings for computing number theoretic 
transforms," IEEE Transactions on Acoustics, Speech, and Signal Processing, 
vol. ASSP-27, no. 4, pp. 432-433, August 1979.
[12] B. Gold and C. M. Rader, Digital Processing of Signals. New York: 
McGraw-Hill, 1969.
[13] C. S. Burrus, "Block realization of digital filters," IEEE Transactions on Audio 
Electroacoustics, vol. AU-20, pp. 230-235, October 1972.
[14] T. G. Stockham, "High speed convolution and correlation," Proceedings of the 
AFIPS Conference, 1966 Joint Computer Conference, vol. 28, pp. 229-233.
[15] D. H. Lehmer, "Large-scale digital calculating machinery," Proceedings of the 
2nd Symposium, Cambridge, MA: Harvard University Press, 1951, pp. 141- 
146.
[16] D. E. Knuth, The Art of Computer Programming, vol. 2, Seminumerical 
Algorithms. Reading, MA: Addison-Wesley, 1969.
[17] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. 
Englewood Cliffs, NJ: Prentice Hall, 1989.
[18] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal 
Processing. Englewood Cliffs, NJ: Prentice Hall, 1975.
[19] R.W. Ramirez, The FFT Fundamentals and Concepts. Englewood Cliffs, NJ: 
Prentice Hall, 1985.
[20] R. E. Blahut, Fast Algorithms for Digital Signal Processing. Reading, MA: 
Addison-Wesley, 1985.
[21] J. M. Pollard, "The fast Fourier transform in a finite field," Mathematics of 
Computation, vol. 25, pp. 365-374, April 1971.
[22] I. J. Good, "The interaction algorithm and practical Fourier analysis," Journal of 
the Royal Statistical Society, Section B-20, pp. 361-375, 1958 and B-22, pp. 372-375, 
1960.
[23] L. H. Thomas, "Using a computer to solve problems in physics," in Application 
of Digital Computers. Boston, MA: Ginn and Co., 1963.
[24] S. Winograd, "On computing the discrete Fourier transform," Proceedings of 
the National Academy of Sciences, USA, vol. 73, pp. 1005-1006, 1976.
[25] A. V. Oppenheim and C. Weinstein, "Effects of finite register length in digital 
filtering and the fast Fourier transform," Proceedings of the IEEE, vol. 60, pp. 
957-976, August 1972.
[26] A. Despain, "Very fast Fourier transform algorithms hardware for 
implementation," IEEE Transactions on Computers, vol. C-28, no. 5, May 
1979.
[27] J. Guo, C. Liu, and C. Jen, "The efficient memory-based VLSI array designs 
for DFT and DCT," IEEE Transactions on Circuits and Systems-II: Analog and 
Digital Signal Processing, vol. 39, no. 10, pp. 723-733, October 1992.
[28] C. D. Thompson, "Fourier transforms in VLSI," IEEE Transactions on 
Computers, vol. C-32, no. 11, pp. 1047-1057, November 1983.
[29] J. Allen, "Computer architecture for signal processing," Proceedings of the 
IEEE, vol. 63, no. 4, pp. 624-632, April 1975.
[30] G. Ma and F. J. Taylor, "Multiplier policies for digital signal processing," IEEE 
ASSP Magazine, pp. 6-20, January 1990.
[31] J. A. Beraldin, T. Aboulnasr, and W. Steenaart, "Efficient one-dimensional 
systolic array realization of discrete Fourier transform," IEEE Transactions on 
Circuits and Systems, vol. 36, pp. 95-100, January 1989.
[32] L. W. Chang and M. Y. Chen, "A new systolic array for discrete Fourier 
transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, 
vol. ASSP-36, pp. 1665-1667, October 1988.
[33] C. M. Liu and C. W. Jen, "A new systolic array algorithm for discrete Fourier 
transform," Proceedings of ISCAS, pp. 2212-2215, 1991.
[34] H.J. Nussbaumer, "Relative evaluation of various number theoretic transforms 
for digital filtering applications," IEEE Transactions on Acoustics, Speech and 
Signal Processing, ASSP-26, pp. 88-93, February 1978.
[35] H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms. Berlin, 
FRG: Springer-Verlag, 1982.
[36] D.F. Elliott and K.R. Rao, Fast Transforms: Algorithms, Analyses, 
Applications. New York: Academic Press, 1982.
[37] N.S. Szabo and R.I. Tanaka, Residue Arithmetic and its Applications to 
Computer Technology. New York: McGraw-Hill, 1967.
[38] F.J. Taylor, "Residue arithmetic: A tutorial with examples," IEEE Computer, 
vol. 17, no. 5, pp. 50-62, May 1984.
[39] S.H. Leung, "Application of residue number systems to complex digital filters," 
Proceedings of the Fifteenth Asilomar Conference on Circuits, Systems and 
Computers, Pacific Grove, CA, November 1981, pp. 70-74.
[40] J.V. Krogmeier and W.K. Jenkins, "Error detection and correction in quadratic 
residue number systems," Proceedings of the 26th Midwest Symposium on 
Circuits and Systems, Puebla, MX, August 1983, pp. 408-411.
[41] F.J. Taylor, G. Papadourakis, A. Skavantzos and A. Stouraitis, "A radix-4 FFT 
using complex RNS arithmetic," IEEE Transactions on Computers, vol. C-34, 
no. 6, pp. 573-576, June 1985.
[42] A. Skavantzos and F.J. Taylor, "On the polynomial residue number system," 
IEEE Transactions on Signal Processing, vol. 39, no. 2, pp. 376-382, February
1991.
[43] A. Skavantzos and N. Mitash, "Implementation issues of 2-dimensional 
polynomial multipliers for signal processing using residue arithmetic," IEE 
Proceedings-E, vol. 140, no. 1, pp. 45-53, January 1993.
[44] G.A. Jullien, "Residue number scaling and other operations using ROM 
arrays," IEEE Transactions on Computers, pp. 325-336, April 1978.
[45] C.H. Huang and F.J. Taylor, "A memory compression scheme for modular 
arithmetic," IEEE Transactions on Acoustics, Speech, and Signal Processing, 
vol. ASSP-27, no. 6, pp. 608-611, December 1979.
[46] G.A. Jullien, "Implementation of multiplication, modulo a prime number, with 
applications to number theoretic transforms," IEEE Transactions on 
Computers, vol. C-29, pp. 899-905, October 1980.
[47] F.J. Taylor, "Large moduli multipliers for signal processing," IEEE 
Transactions on Circuits and Systems, vol. CAS-28, no. 7, pp. 731-736, July 
1981.
[48] M.A. Soderstrand and E.L. Fields, "Multipliers for residue number arithmetic 
digital filters," Electronics Letters, vol. 13, no. 6, pp. 164-166, March 1977.
[49] M.A. Soderstrand and C. Vernia, "A high-speed low-cost modulo Pi multiplier 
with RNS arithmetic applications," Proceedings of the IEEE, vol. 68, no. 4, pp. 
529-532, April 1980.
[50] A. Skavantzos, "Novel approach for implementing convolutions with small 
tables," IEE Proceedings-E, vol. 138, no. 4, pp. 255-259, July 1991.
[51] A. Skavantzos and P. B. Rao, "New multipliers modulo $2^N - 1$," IEEE 
Transactions on Computers, vol. 41, no. 8, pp. 957-961, August 1992.
[52] P. B. Rao and A. Skavantzos, "New multiplier designs based on squared law 
algorithms and table look-ups," Proceedings of the 26th Annual Asilomar 
Conference on Signals, Systems and Computers, Pacific Grove, CA, October 
1992, pp. 686-690.
[53] P. B. Rao and A. Skavantzos, "Efficient computation of squaring operation in 
modular rings," Electronics Letters, vol. 28, no. 17, pp. 1628-1630, August
1992.
[54] L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza, vol. 34, pp. 
349-356, 1965.
[55] S. Waser and M. J. Flynn, Introduction to Arithmetic for Digital Systems 
Designers. Orlando, FL: Holt, Rinehart and Winston, 1982.
[56] K. Hwang, Computer Arithmetic. New York: John Wiley and Sons, 1979.
[57] T. C. Chen, "A binary multiplication scheme based on squaring," IEEE 
Transactions on Computers, vol. C-20, pp. 678-680, June 1971.
[58] T. Jayashree and D. Basu, "On binary multiplication using the quarter squared 
algorithm," IEEE Transactions on Computers, pp. 957-960, September 1976.
[59] L. Dadda, "Squarers for binary numbers in serial form," Proceedings of the 7th 
Symposium on Computer Arithmetic, Urbana, IL, June 1985, pp. 173-180.
[60] H. Kobayashi, "A multioperand two's complement addition algorithm," 
Proceedings of the 7th Symposium on Computer Arithmetic, Urbana, IL, June 
1985, pp. 16-17.
[61] A.V. Curiger, H. Bonnenberg, and H. Kaeslin, "Regular VLSI architectures for 
multiplication modulo ($2^n + 1$)," IEEE Journal of Solid-State Circuits, vol. 26, 
no. 7, pp. 990-994, July 1991.
[62] X. Lai and J.L. Massey, "A proposal for a new block encryption standard," 
presented at EUROCRYPT '90, Aarhus, Denmark, May 1990.
[63] A. Skavantzos, "ROM table reduction techniques for computing the squaring 
operation using modular arithmetic," Proceedings of the 25th Asilomar 
Conference on Circuits, Systems, and Computers, Pacific Grove, CA, October 
1991, pp. 413-417.
[64] A. W. Lohmann and B. Wirnitzer, "Triple correlations," Proceedings of the 














z[i_, r_] := z[i, r] := 4 * Sum[(c[i + k j[r]] * (-1)^k), {k, 0, (n/j[r]) - 1}]; 
x[i_, r_] := x[i, r] := Simplify[x[i, r-1] + z[i, r]*2^(r-1)]; 












cy[i] = x[i,r]/(4*2^r); 
cy[i+j[r]] = x[(i+j[r]),r]/(4*2^r); 
Print[cy[i]];
Print[cy[i+j[r]]]; 
i = i + 2;
If[i <= (n/2 -1),





If[j[r] <= (n/2), 





Output when the program is run for n = 32:
4 (c[0] + c[2] + c[4] + c[6] + c[8] + c[10] + c[12] + c[14] + c[16] + 
c[18] + c[20] + c[22] + c[24] + c[26] + c[28] + c[30])
01
4 (c[0] - c[2] + c[4] - c[6] + c[8] - c[10] + c[12] - c[14] + c[16] - 
c[18] + c[20] - c[22] + c[24] - c[26] + c[28] - c[30])
8 (c[0] + c[4] + c[8] + c[12] + c[16] + c[20] + c[24] + c[28])
8 (c[2] + c[6] + c[10] + c[14] + c[18] + c[22] + c[26] + c[30])
02
4 (c[0] - c[4] + c[8] - c[12] + c[16] - c[20] + c[24] - c[28])
16 (c[0] + c[8] + c[16] + c[24])
16 (c[4] + c[12] + c[20] + c[28])
03
4 (c[0] - c[8] + c[16] - c[24])
32 (c[0] + c[16])
32 (c[8] + c[24])
04






4 (c[2] - c[6] + c[10] - c[14] + c[18] - c[22] + c[26] - c[30])
16 (c[2] + c[10] + c[18] + c[26])
16 (c[6] + c[14] + c[22] + c[30])
23
4 (c[2] - c[10] + c[18] - c[26]) 
32 (c[2] + c[18])








4 (c[4] - c[12] + c[20] - c[28]) 
32 (c[4] + c[20])








4 (c[6] - c[14] + c[22] - c[30]) 
32 (c[6] + c[22])
32 (c[14] + c[30])
64

































cy[0] = c[0] 
cy[2] = c[2] 
cy[4] = c[4] 
cy[6] = c[6] 
cy[8] = c[8] 




cy[18] = c[18] 
cy[20] = c[20] 
cy[22] = c[22] 
cy[24] = c[24] 
cy[26] = c[26] 







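temp = Length[c];   (* number of columns in the input height list c *)
ha = 0;             (* running count of half adders *)
fa = 0;             (* running count of full adders *)
hal = 0; 
fal = 0;
ex = Ceiling[Log[2, Max[c]]]; 
Do[c = AppendTo[c, 0], {ex}];   (* append ex zero columns to c *)
jm = Length[c]; 
i = 1; 
t[1] = 2;
(* build the target heights t[i] = Floor[3/2 * t[i-1]]: 2, 3, 4, 6, 9, ... *)
While[t[i] < Max[c], 
i = i + 1;
t[i_] := t[i] = Floor[3/2 * t[i - 1]]
]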




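While[j < jm, 
If[(c[[j]] > t[i]),
fal = Floor[(c[[j]] - t[i])/2];   (* full adders used in column j *)
hal = Ceiling[(c[[j]] - t[i])/2] - fal;   (* one half adder if the excess is odd *)
c[[j]] = t[i];
c[[j+1]] = c[[j+1]] + fal + hal;   (* each adder passes one carry to column j+1 *)
fa = fa + fal; 
ha = ha + hal;
];
j = j + 1;
];
j = 1;
i = i - 1 
]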
zero = Take[c, {temp + 1, Length[c]}]; 
cpa = Length[c] - Count[zero, 0]; 
facpa = cpa - 1;
Print["# of Full Adders = ", fa];
Print["# of Half Adders = ", ha]; 
fa = fa + facpa; 
ha = ha + 1;
Print["# of Full Adders including CPA = ", fa];
Print["# of Half Adders including CPA = ", ha];
Print["# of CSA Levels = ", le];
Print["# of CPA Levels =1"];
Print["Size of CPA = ", cpa]; 
gates = 5*fa + 2*ha;
Print["Number of 2-input gates including CPA = ", gates]
Output when program is run:
{1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1}
# of Full Adders = 35
# of Half Adders = 7
# of Full Adders including CPA = 49
# of Half Adders including CPA = 8
# of CSA Levels = 4
# of CPA Levels =1
Size of CPA = 15
Number of 2-input gates including CPA = 261
Note: In this example the input numbers to be added are the partial products obtained 
when two eight bit numbers are multiplied. The results check with a conventional 
calculation, as shown in figure 3.1.
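The adder counts above can be cross-checked with a short independent sketch. The Python code below follows the column reduction scheme of Dadda [54] as it is commonly described, using the same target heights 2, 3, 4, 6, 9, ... as the listing; it is not a line-by-line transcription of the Mathematica program, and the function name `dadda_reduce` is hypothetical. It counts only the carry-save reduction stages and leaves out the final carry-propagate adder.

```python
def dadda_reduce(heights):
    """Count adders needed to reduce a column-height profile to two rows.

    heights[0] is the least significant column. Returns
    (full_adders, half_adders, levels, final_heights).
    """
    cols = list(heights)
    top = max(cols)
    # Target heights 2, 3, 4, 6, 9, ... with t[k] = floor(3/2 * t[k-1]);
    # only targets strictly below the tallest column are used.
    targets = [2]
    while targets[-1] < top:
        targets.append(targets[-1] * 3 // 2)
    targets = [t for t in targets if t < top]

    fa = ha = levels = 0
    for target in reversed(targets):
        carry = 0                     # carries arriving from the previous column
        col = 0
        while col < len(cols) or carry:
            if col == len(cols):
                cols.append(0)        # make room for a carry out of the top column
            h = cols[col] + carry
            carry = 0
            while h > target:
                if h == target + 1:
                    ha += 1           # half adder: 2 bits in, 1 sum bit stays
                    h -= 1
                else:
                    fa += 1           # full adder: 3 bits in, 1 sum bit stays
                    h -= 2
                carry += 1            # every adder sends one carry to the next column
            cols[col] = h
            col += 1
        levels += 1
    return fa, ha, levels, cols


if __name__ == "__main__":
    profile = [1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]
    print(dadda_reduce(profile)[:3])
```

For the 8 x 8 partial product profile the sketch gives 35 full adders, 7 half adders, and 4 levels, in agreement with the output above. Adding the listing's allowance for the carry-propagate adder (14 full adders and 1 half adder) and its weights of 5 and 2 gates gives 5*49 + 2*8 = 261.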
Vita
Poornachandra B. Rao received the B.E. degree from Osmania University, 
India, in 1984, and the M.S. degree from Louisiana State University in 1989, both in 
Electrical Engineering. From 1984-1987, he worked with Larsen & Toubro Ltd., India 
as an Electrical Systems Design Engineer. He has also held summer research positions 
at Ruhr University, Germany, in 1989 and Circuit Technology Group, Hewlett- 
Packard, in 1991. Currently, he is a candidate for the doctoral degree in the 
Department of Electrical and Computer Engineering at Louisiana State University. His 
research interests include application specific integrated circuit design, computer 
arithmetic, and parallel processing. He is a member of IEEE Computer Society and 
Eta Kappa Nu.
DOCTORAL EXAMINATION AND DISSERTATION REPORT
Candidate: Poornachandra B. Rao
Major Field: Electrical Engineering
Title of Dissertation: Squared Law Algorithms: Theory and Applications
Approved:
Major Professor and Chairman 
Dean of the Graduate School
EXAMINING COMMITTEE:
Date of Examination:
June 9, 1993
