Cyclic convolutions using low complexity dedicated hardware by Rahman, Nahid
Lehigh University
Lehigh Preserve
Theses and Dissertations
2006
Cyclic convolutions using low complexity
dedicated hardware
Nahid Rahman
Lehigh University
This Thesis is brought to you for free and open access by Lehigh Preserve. It has been accepted for inclusion in Theses and Dissertations by an
authorized administrator of Lehigh Preserve. For more information, please contact preserve@lehigh.edu.
Recommended Citation
Rahman, Nahid, "Cyclic convolutions using low complexity dedicated hardware" (2006). Theses and Dissertations. Paper 922.
Rahman, Nahid
Cyclic Convolutions using Low Complexity Dedicated Hardware
September 2006
CYCLIC CONVOLUTIONS USING
LOW COMPLEXITY DEDICATED
HARDWARE
by
Nahid Rahman
A Thesis
Presented to the Graduate Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science
in
Computer Engineering
Lehigh University
2006
© Copyright 2006 by Nahid Rahman
All Rights Reserved
Acknowledgements
I am truly thankful to my creator for all the wonderful people he has introduced me
to. My guess is, if you are reading this, you are certainly one of them. Please know
that I am honored and feel very fortunate that you are in my life. Thank you so
much for all your support.
- Nahid.
Abstract
Cyclic convolution plays a central role in digital signal processing. This thesis
develops an architecture to compute cyclic convolution of $N = 2^n$ points for arbitrary
n. We develop and use a bilinear algorithm as the basis of the hardware platform.
The architecture and its control were simulated using VHDL and synthesized as an
Application Specific Integrated Circuit (ASIC) using Mentor Graphics tools. This
provided us with the time and hardware complexity of the architecture.
The hardware complexity of our architecture is independent of the convolution
size, as compared to the complexity of a memoryless implementation. However, the
time complexity of our architecture is $O(N^{1.58})$. Thus this architecture will be ideal
in situations where low hardware complexity is more important than speed.
Contents
Acknowledgements
Abstract
1 Introduction
1.1 Linear and Cyclic Convolutions
1.1.1 Cyclic Convolution
1.2 Dedicated Hardware vs Microprocessors for Convolutions
1.3 Previous Work
1.4 Organization of Thesis
1.5 Summary of Results
2 Development of the Algorithm
2.1 Definition and Properties of a Toeplitz Matrix
2.2 Derivation of the Algorithm
2.3 Structure of a Toeplitz Multiplication Module
2.4 Complexity Analysis
3 Single Memory Implementation
3.1 Adapting the Algorithm to Hardware
3.1.1 Phase 0
3.1.2 Phase 1
3.1.3 Phase 2
3.1.4 Phase 3
3.1.5 Phase 4
3.1.6 Phase 5
3.1.7 Phase 6
3.2 Architecture of the Dedicated Hardware
3.3 Hardware Implementation and Timing Analysis
3.3.1 Phase 0
3.3.2 Phase 1
3.3.3 Phase 2
3.3.4 Phase 3
3.3.5 Phase 4
3.3.6 Phase 5
3.3.7 Phase 6
3.4 Overall Time Complexity
3.5 Hardware Complexity
3.5.1 Experimentation Results
4 Further Improvements
4.1 Two Memory Implementation
4.2 Division Modules
4.3 Exclusion of Baugh-Wooley Multiplier in synthesis
4.4 Time Optimization
4.5 Toeplitz Products in Calculations of Other Transforms
5 Conclusion
A
Bibliography
List of Tables
3.1 Time complexity of the proposed cyclic convolution architecture.
3.2 Hardware complexity of the proposed cyclic convolution architecture.
3.3 Hardware complexity for different lengths with variable-bit internal hardware.
3.4 Comparison of the hardware complexity (number of gates) of the proposed method with a combinational logic circuit to compute cyclic convolution.
4.1 Origin and destination addresses in a 4-point butterfly network.
4.2 Origin and destination addresses in an 8-point butterfly network.
List of Figures
2.1 2-point cyclic convolution
2.2 Calculating Fourier transform components for an 8-point network
2.3 Multiplication stage for an 8-point network
2.4 Conversion between a,b and p,q.
2.5 Flow diagram for an 8-point network
2.6 A 2 x 2 Toeplitz multiplication module TP2.
2.7 Generalized structure of a $2^n \times 2^n$ Toeplitz multiplication module TP$2^n$.
2.8 4 point cyclic convolution algorithm
2.9 8 point cyclic convolution algorithm
2.10 A 2 x 2 cyclic convolution algorithm.
2.11 A $2^n$ point cyclic convolution algorithm
3.1 Partitioning the cyclic convolution algorithm in 6 phases.
3.2 Phase 0 for an 8-point network
3.3 Premultiplication stage in a TP2 module (the second operand for each multiplication is precomputed and stored in memory)
3.4 Premultiplication in a TP4 and a TP8 module
3.5 A TP2 postmultiplication operation.
3.6 IDFT calculations in phase 6 for an 8-point network.
3.7 Architecture of the Dedicated Hardware
3.8 Timing diagram for phase 0
3.9 Timing diagram for phase 1
3.10 Timing diagram for phase 2
3.11 Timing diagram for phase 3
3.12 Timing diagram for phase 4
3.13 Timing diagram for phase 5
3.14 Timing diagram for phase 6
Chapter 1
Introduction
1.1 Linear and Cyclic Convolutions
Convolutions are used to compute the response of a linear system to an input signal.
The linear system is defined by its impulse response. The output signal is the
convolution of the input signal and the impulse response. Convolutions can also be
regarded as the time-domain equivalent of filtering in the frequency domain.
Suppose we have a linear, time-invariant system whose impulse response is h[n].
Whenever we apply an input x[n], it produces the output y[n], where
$$y[n] = x[n] * h[n] \tag{1.1}$$
or,
$$y[n] = \sum_{k=0}^{n} x[k]\, h[n-k] \tag{1.2}$$
Thus, y[n] is obtained by flipping the sequence h[n], shifting it by n positions
and then taking its inner product with the sequence x[n]. Note that the system response
h[n] has values for n ≥ 0 only. Whenever the index n - k falls outside the finite range
over which h is defined, the term h[n - k] evaluates to zero.
A property of linear convolutions is that even if x[n] and h[n] have finite lengths,
the length of y[n] could be greater. If x[n] and h[n] have lengths N and M respectively,
the length of y[n] would be N + M - 1.
Now, when we are given two DFTs (finite length sequences, usually of length
N), we cannot just multiply them together. Because DFTs are periodic, they have
non-zero values for n ≥ N and thus the multiplication of these two DFTs will
be non-zero for n ≥ N. Therefore, we need to define a new type of convolution
operation that will result in the convolved signal being zero outside of the range
n = {0, 1, 2, ..., N - 1}. This idea led to the development of circular convolution, also
called the cyclic or periodic convolution.
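The direct sum in equation 1.2 can be sketched in a few lines of Python. This is only an illustrative sketch; the function name `linear_convolution` is an assumption made here and is not part of the thesis.

```python
def linear_convolution(x, h):
    """Linear convolution y[n] = sum_k x[k] * h[n - k] of two finite sequences."""
    n_out = len(x) + len(h) - 1          # output length N + M - 1
    y = [0] * n_out
    for n in range(n_out):
        for k in range(len(x)):
            # h[n - k] is treated as zero outside 0 <= n - k < len(h)
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y


# A length-3 input and a length-2 response give a length-4 output.
print(linear_convolution([1, 2, 3], [1, 1]))
```

The output list has N + M - 1 entries, which is the length growth described above.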
1.1.1 Cyclic Convolution
If x[n] and h[n] are both of finite length N, such that
$$x[n] = \{x[0], x[1], \ldots, x[N-1]\} \tag{1.3}$$
and,
$$h[n] = \{h[0], h[1], \ldots, h[N-1]\}, \tag{1.4}$$
the cyclic convolution of x[n] and h[n] is defined as
$$y[j] = \sum_{i=0}^{N-1} x[i]\, h[(j - i) \bmod N] \tag{1.5}$$
where j = 0, 1, 2, ..., N - 1.
One can always write the convolution as a matrix product. To illustrate this,
consider N = 3. We have,
$$y[0] = \sum_{i=0}^{2} x[i]\, h[-i \bmod 3] \tag{1.6}$$
or,
$$y[0] = h[0]x[0] + h[2]x[1] + h[1]x[2], \tag{1.7}$$
similarly,
$$y[1] = h[1]x[0] + h[0]x[1] + h[2]x[2], \tag{1.8}$$
and,
$$y[2] = h[2]x[0] + h[1]x[1] + h[0]x[2]. \tag{1.9}$$
In matrix form, equations 1.7 to 1.9 may be expressed as:
$$\begin{pmatrix} y(0) \\ y(1) \\ y(2) \end{pmatrix} =
\begin{pmatrix} h(0) & h(2) & h(1) \\ h(1) & h(0) & h(2) \\ h(2) & h(1) & h(0) \end{pmatrix}
\begin{pmatrix} x(0) \\ x(1) \\ x(2) \end{pmatrix} \tag{1.10}$$
Properties of a Cyclic Convolution Matrix
From the matrix above, we can readily observe some basic properties of a cyclic
convolution matrix. We will exploit the following properties of cyclic convolution
matrices in our work:
• The diagonal elements in the convolution matrix or response matrix h are
identical.
• Every row in the h matrix is a cyclic shift of the previous row.
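As a small check of the definition in equation 1.5 and the matrix form in equation 1.10, the following Python sketch builds the cyclic convolution matrix from h and compares the matrix-vector product with the direct sum. The helper names (`cyclic_convolution`, `circulant`, `mat_vec`) are assumptions chosen for illustration.

```python
def cyclic_convolution(x, h):
    """Cyclic convolution y[j] = sum_i x[i] * h[(j - i) mod N]."""
    n = len(x)
    return [sum(x[i] * h[(j - i) % n] for i in range(n)) for j in range(n)]


def circulant(h):
    """Cyclic convolution matrix: entry (j, i) is h[(j - i) mod N]."""
    n = len(h)
    return [[h[(j - i) % n] for i in range(n)] for j in range(n)]


def mat_vec(m, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


h = [1, 2, 3]
x = [4, 5, 6]
print(circulant(h))                  # each row is a cyclic shift of the previous one
print(mat_vec(circulant(h), x))      # matches cyclic_convolution(x, h)
```

Both properties listed above can be read off the printed matrix: the main diagonal is constant and each row is the previous row rotated by one position.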
Conversion of a Linear Convolution to a Cyclic Convolution
If we compute the linear convolution of x[n] and h[n] from our previous example, it
can be expressed as
$$\begin{pmatrix} y(0) \\ y(1) \\ y(2) \\ y(3) \\ y(4) \end{pmatrix} =
\begin{pmatrix} h(0) & 0 & 0 \\ h(1) & h(0) & 0 \\ h(2) & h(1) & h(0) \\ 0 & h(2) & h(1) \\ 0 & 0 & h(2) \end{pmatrix}
\begin{pmatrix} x(0) \\ x(1) \\ x(2) \end{pmatrix} \tag{1.11}$$
This matrix can be expanded as:
$$\begin{pmatrix} y(0) \\ y(1) \\ y(2) \\ y(3) \\ y(4) \end{pmatrix} =
\begin{pmatrix}
h(0) & 0 & 0 & h(2) & h(1) \\
h(1) & h(0) & 0 & 0 & h(2) \\
h(2) & h(1) & h(0) & 0 & 0 \\
0 & h(2) & h(1) & h(0) & 0 \\
0 & 0 & h(2) & h(1) & h(0)
\end{pmatrix}
\begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ 0 \\ 0 \end{pmatrix} \tag{1.12}$$
The zeros added to the x column cancel the contribution of the last two columns of
the h matrix. But in the process, we have converted a length 3 linear convolution
into a length 5 cyclic convolution. Thus, any linear convolution can be converted to
a cyclic convolution, and we can take advantage of a cyclic convolution property that
states that if X[n] and H[n] are the Fourier transforms of x[n] and h[n] respectively,
then if y = x * h, then
$$Y = X \cdot H. \tag{1.13}$$
That is, Y is the multiplication of X and H point by point. So if the response
transform of a system H is known, we can figure out the output transform Y
corresponding to the transform of input signal X. In this respect, a good and efficient
algorithm to calculate cyclic convolutions is highly beneficial.
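The zero-padding argument of equation 1.12 and the transform property of equation 1.13 can both be checked numerically. The sketch below is a minimal illustration in plain Python; the function names and the naive O(N^2) DFT are assumptions made for this example and do not represent the hardware algorithm developed in this thesis.

```python
import cmath


def cyclic_convolution(x, h):
    """Cyclic convolution y[j] = sum_i x[i] * h[(j - i) mod N]."""
    n = len(x)
    return [sum(x[i] * h[(j - i) % n] for i in range(n)) for j in range(n)]


def linear_via_cyclic(x, h):
    """Linear convolution computed as a zero-padded cyclic convolution."""
    length = len(x) + len(h) - 1
    xp = list(x) + [0] * (length - len(x))
    hp = list(h) + [0] * (length - len(h))
    return cyclic_convolution(xp, hp)


def dft(v, inverse=False):
    """Naive discrete Fourier transform (the inverse includes the 1/N factor)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
           for j in range(n)]
    return [val / n for val in out] if inverse else out


def cyclic_via_dft(x, h):
    """Cyclic convolution from the pointwise product Y = X . H."""
    X, H = dft(x), dft(h)
    y = dft([a * b for a, b in zip(X, H)], inverse=True)
    return [val.real for val in y]


print(linear_via_cyclic([1, 2, 3], [4, 5, 6]))   # length-5 cyclic convolution
print(cyclic_via_dft([1, 2, 3], [4, 5, 6]))      # agrees with cyclic_convolution up to rounding
```

Note that the transform route works in complex floating point, so its results match the direct sum only up to rounding error.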
Importance and Applications of Cyclic Convolutions
Cyclic convolution is a very important tool in DSP applications. We have seen
that linear convolution, which is at the heart of digital filtering, can be computed
by transforming it into a cyclic convolution. Therefore, it is extremely
important to develop faster and more cost-effective cyclic convolution algorithms
and their hardware implementations. These algorithms are used in a wide range
of systems, including image filtering, polynomial multiplication, large integer
multiplication, and coding and data compression, where the objective is to reduce the
content of the original signal so that only relevant information is transferred or
transmitted. A Fourier transform matrix slightly varied becomes similar to a cyclic
convolution matrix slightly varied [1]. Some other very important transforms used
in image processing, such as the DCT and the Hartley transform, have transform matrices
that can be partitioned into cyclic convolution matrices [2] [3].
In this thesis, we discuss a bilinear algorithm for calculating cyclic convolutions
of length N, where $N = 2^n$ for any positive integer n, by the use of Toeplitz
products. We propose a hardware implementation scheme and analyze the cost and
benefits of such an approach.
1.2 Dedicated Hardware vs Microprocessors for
Convolutions
Despite their extensive use in digital filtering and other key applications, DSP
algorithms often pose challenges to general purpose processors or microprocessors
because of their intensive computational and hardware requirements. Even though
general purpose processors (GPPs) are capable of performing DSP tasks as well as
others, it is often a better idea to have dedicated hardware for DSP applications,
which may prove advantageous over microprocessors in numerous ways.
DSP algorithms differ from other tasks that microprocessors perform in ways that
justify the implementation of these algorithms as standalone applications [4]. For
instance, they are very computationally demanding and require attention to numeric
fidelity. As a result, they often require multiple parallel execution units, hardware
acceleration for common functions, accumulator registers, guard bits, saturation
hardware etc. They have very high memory bandwidth requirements and require
capability for large amounts of streaming data. Moreover, they most often have
predictable data access patterns implemented by specialized addressing modes, such
as modulo, bit-reversed etc. Also, they are math-centric and require single-cycle
multipliers or MAC units.
Using microprocessors or GPPs for DSP applications introduces the overhead
associated with increased hardware complexity. Superscalar, high performance GPPs
require greater silicon area and power consumption. Dedicated hardware for DSP
performs all key arithmetic in one cycle, whereas in GPPs, multiplications and
multi-bit shifts often take more than one cycle. In dedicated hardware, sufficient
support for managing numeric fidelity can be included by integrating shifters, guard
bits, saturation and rounding modes. Dedicated DSP hardware often uses a Harvard
architecture that allows up to 2-4 memory accesses per cycle, compared to the von
Neumann architecture of GPPs that allows only one access per cycle. Signal processing
applications' access patterns tend to be predictable and thus DMA may be preferable
to the caches that are typically used in microprocessors. Also, microprocessors often
have no separate address generation units and use general purpose addressing modes,
compared to the more efficient specialized addressing modes of DSP algorithms.
Generally speaking, GPPs do not have sufficient DSP horsepower. They have
limited memory bandwidth, high cost and power consumption and limited on-chip
integration in some cases. They lack execution-time predictability and for micropro-
cessors, there are few DSP-oriented development tools and few DSP-related software
libraries [4] .
In this thesis, we discuss the implementation of dedicated hardware for calculating
cyclic convolutions. We introduce a new method for calculating cyclic convolutions
of lengths that are powers of two using Toeplitz products. We discuss
the development and implementation of the algorithm and analyze its efficiency in
reducing hardware complexity.
1.3 Previous Work
Over time, in computing cyclic convolutions, a lot of effort has been made in
developing algorithms that converted one-dimensional cyclic convolutions into
multi-dimensional cyclic convolutions of smaller lengths, which were cyclic in all
dimensions. These algorithms significantly reduced the number of complex
additions and multiplications, but often could be extended only up to certain lengths
and were not scalable. Later on, new algorithms were developed that avoided the
complex Fourier domain altogether.
The Cook-Toom algorithm served as the building block of large convolution
algorithms[5]. This algorithm was able to compute length linear convolutions. To
compute linear convolutions of N and M points, the Cook-Toom algorithm computed
the Lagrange interpolation at L = M + N - 1 real points [6]. However, computation was
often tedious and could be carried out for only special integers.
S. Winograd [7] developed non-structured algorithms for evaluating cyclic
convolutions. He identified cyclic structures in transform matrices of small lengths and
developed fast algorithms to evaluate these cyclic convolutions. But in his approach,
multiplications involved complex numbers and the method was applicable to limited
length algorithms only.
Agarwal and Cooley's [8] algorithm for cyclic convolutions effectively avoided the
complex Fourier domain. They used the Chinese Remainder Theorem to convert
a one-dimensional cyclic convolution to a multi-dimensional convolution which is
cyclic in all dimensions. Small length algorithms were developed for lengths < 10.
Algorithms for lengths greater than 10 could be obtained from algorithms
of smaller, relatively prime lengths. But this approach was not scalable and the basic
algorithms were limited to lengths up to 9. With this limited set of algorithms, it
was not possible to generate a large number of algorithms through multi-dimensional
techniques [9].
Nussbaumer [10] and Quandalle took a direct approach to developing
two-dimensional cyclic algorithms [11]. Their algorithm was applicable to p x p input data,
where p is prime [12]. They demonstrated that repeated nesting of polynomial
transforms could lead to efficient algorithms for certain one and two dimensional cyclic
convolutions [9]. They showed that from a p x p two-dimensional cyclic convolution,
a $p^n \times p^n$ two-dimensional cyclic convolution algorithm could be obtained. The
theoretical complexity of their algorithm was fairly good, but the actual
implementation became very complex and posed certain challenges in real-time computation.
Wagh and Morgera [13] gave a procedure to design a bilinear algorithm for cyclic
convolution of any desired length, provided the length is not divisible by the field
characteristic. Later on, these bilinear algorithms proved highly efficient not only
for cyclic convolutions, but for other transforms as well. They could be used to
compute DCT sequences for an arbitrary number of points [14]. Further research on
the application of these bilinear algorithms in other transforms provided further
insights into their properties that could be extended to cyclic convolutions as well.
Earlier research had proposed highly effective algorithms for DFTs when the number
of data samples is prime [15] and also for prime factor FFTs [1]. Wagh and Venkatram
presented a strategy to design bilinear discrete cosine transform (DCT) algorithms
of prime lengths [16].
In this thesis, we describe a bilinear algorithm to compute cyclic convolutions of
length $2^n$, where n is any positive integer. Our algorithm computes the convolution
by converting the cyclic convolution into n products of Toeplitz matrices and
vectors. All the Toeplitz products involved in a $2^n$ point cyclic convolution algorithm
can be built systematically from a Toeplitz product of 2 points. The ultimate
cyclic convolution algorithm thus obtained has a highly regular structure and can
be applied to any length for any positive value of n. Our contention is that, since this is
a structured algorithm, it can be applied to any length N (where $N = 2^n$, a common
number of inputs for a digital system). The algorithm only involves calculations in
the real domain and thereby saves significant computational effort as compared to
complex DFT evaluation. Moreover, no matter how large the length of the
convolution is, only a single stage of multiplications is required, which contributes greatly
to reducing the hardware complexity of a dedicated integrated circuit chip.
1.4 Organization of Thesis
Chapter 2 contains a discussion about the development of the algorithm to use
Toeplitz products for cyclic convolutions. Here, we describe the mathematical
properties of the network and show its potential for reduction in chip area and hardware
complexity. In chapter 3, we show its actual implementation and include simulation
results obtained for cyclic convolutions whose lengths are powers of two, from 4
through 1024. Chapter 4 is a summary of further improvements
where we try to shed some light on potential future work. Chapter 5 provides final
discussion and the conclusion of the work.
1.5 Summary of Results
We implemented the algorithm for cyclic convolutions of lengths 4, 8, 16, 32, 64,
128, 256, 512 and 1024, and obtained simulation results for timing analysis and
synthesis results for hardware complexity analysis. Our results show that as the
length increases by a factor of two, total delay increases by a factor of three. But
hardware complexity of a dedicated VLSI chip using these methods is far lower
than other commonly used methods. For all lengths, it remained fairly constant
and shows significant reduction in the number of gates used.
Chapter 2
Development of the Algorithm
In this chapter, we discuss the development of an efficient bilinear algorithm that
one can use to compute cyclic convolutions. As noted earlier, cyclic convolutions
have the following property:
If X, H and Y are the Fourier transforms of input signal x, impulse response h,
and output signal y respectively, then,
$$Y = H \cdot X \tag{2.1}$$
where Y is the point-by-point multiplication of H and X. By taking the inverse
Fourier transform of Y, we can retrieve the signal y and thus get the result of the
cyclic convolution of x and h.
Unfortunately, this method has some disadvantages. For instance,
• The Fourier transforms of the signals involve performing computations in the
complex domain, which can often be imprecise and computationally exhaustive.
• Hardware requirements are heavy due to greater computational complexity.
Hence it would be very convenient if one can develop an algorithm where complex
numbers are avoided altogether. In this thesis, we have made an effort to do so.
When cyclic convolutions are computed in real applications such as a digital
filter, often one of the input signals x or h is pre-determined. Usually, the impulse
response h is known and can be stored in the system as a cyclic convolution matrix,
so that the input signal expressed as an input vector could be multiplied with this
matrix and produce the output as an output vector (the result of matrix-vector
multiplication).
Suppose we are calculating a length 2 cyclic convolution where the input vector is
$\{x_0, x_1\}$, the output vector is $\{y_0, y_1\}$, and the precomputed impulse response h matrix
is a length 2 cyclic convolution matrix, then
$$\begin{pmatrix} y_0 \\ y_1 \end{pmatrix} =
\begin{pmatrix} a & b \\ b & a \end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \end{pmatrix} \tag{2.2}$$
This convolution, when computed by the brute force method, would look like:
$$y_0 = a \cdot x_0 + b \cdot x_1 \tag{2.3}$$
$$y_1 = b \cdot x_0 + a \cdot x_1 \tag{2.4}$$
The brute force method requires four multiplications and two additions. However,
if we could rearrange the computation as:
$$\begin{pmatrix} y_0 \\ y_1 \end{pmatrix} =
\begin{pmatrix}
\{(a+b)/2\} \cdot (x_0 + x_1) + \{(a-b)/2\} \cdot (x_0 - x_1) \\
\{(a+b)/2\} \cdot (x_0 + x_1) - \{(a-b)/2\} \cdot (x_0 - x_1)
\end{pmatrix}, \tag{2.5}$$
we get the exact same output as in 2.3 and 2.4. But computational complexity
reduces.
The terms (a + b)/2 and (a - b)/2 can be pre-computed and stored in the system,
as previously mentioned. Therefore, computations for (a + b)/2 and (a - b)/2
may be omitted when determining the complexity of calculations. The computations
are limited to the additions or subtractions to compute $(x_0 + x_1)$ and $(x_0 - x_1)$,
multiplication of these terms with the pre-computed coefficients and the addition or
subtraction of the resultant products.
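For instance, if C = (a + b)/2 and D = (a - b)/2, then,
$$\begin{pmatrix} y_0 \\ y_1 \end{pmatrix} =
\begin{pmatrix} C \cdot (x_0 + x_1) + D \cdot (x_0 - x_1) \\ C \cdot (x_0 + x_1) - D \cdot (x_0 - x_1) \end{pmatrix} \tag{2.6}$$
From the equation above, we can readily observe a second advantage to this
re-arrangement. We need to calculate the terms $C \cdot (x_0 + x_1)$ and $D \cdot (x_0 - x_1)$ only
once each. Suppose,
$$E = C \cdot (x_0 + x_1) \tag{2.7}$$
and,
$$F = D \cdot (x_0 - x_1) \tag{2.8}$$
Then,
$$\begin{pmatrix} y_0 \\ y_1 \end{pmatrix} = \begin{pmatrix} E + F \\ E - F \end{pmatrix} \tag{2.9}$$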
The entire calculation now involves four linear additions or subtractions (i.e.
$(x_0 + x_1)$, $(x_0 - x_1)$, $(E + F)$ and $(E - F)$) and two multiplications (i.e. $C \cdot (x_0 + x_1)$
and $D \cdot (x_0 - x_1)$) in place of the four multiplications and two additions required in
the brute force method.
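The rearrangement in equations 2.5 to 2.9 can be sketched as follows. The function names `precompute` and `cyclic2` are assumptions introduced here for illustration; the sketch simply mirrors the symbols C, D, E and F defined above.

```python
def precompute(a, b):
    """Coefficients stored ahead of time: C = (a + b)/2 and D = (a - b)/2."""
    return (a + b) / 2, (a - b) / 2


def cyclic2(x0, x1, C, D):
    """2-point cyclic convolution with two multiplications and four add/subtracts."""
    E = C * (x0 + x1)        # first multiplication
    F = D * (x0 - x1)        # second multiplication
    return E + F, E - F      # y0, y1


a, b = 3, 5
C, D = precompute(a, b)
y0, y1 = cyclic2(2, 7, C, D)
# Brute force for comparison: y0 = a*x0 + b*x1, y1 = b*x0 + a*x1
print(y0, a * 2 + b * 7)
print(y1, b * 2 + a * 7)
```

Only the two products inside `cyclic2` depend on the input, since C and D are fixed once the impulse response is known.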
Our re-arranged algorithm is shown in figure 2.1.
Figure 2.1: 2-point cyclic convolution
From the figure, we can see that implementation of the hardware can reduce
costs significantly. This is because if we traverse the path from any input to any
output (the worst delay or critical path), we require only one multiplication and
two addition operations. In hardware, multiplications require the longest time and
largest power consumption as well. If all computations can be reduced to a single
stage of multiplications as described, this could be performed with only one multiplier
and would sharply reduce hardware requirements.
When we compute cyclic convolutions using the above mentioned procedure, our
timing becomes approximately that of a single stage of multiplication and hardware
requirements become very low. Also, if we observe the functionality of our network,
we can see that a set of linear additions or subtractions is performed first, which
is followed by a single multiplication stage and finally, a set of post-multiplication
addition or subtraction operations. Since the single multiplication stage is preceded
and followed by linear operations only, we have named this family of algorithms
bilinear algorithms. The following sections discuss a bilinear algorithm to compute
cyclic convolutions of length $2^n$ (where n is any positive integer), through efficient
use of Toeplitz products.
2.1 Definition and Properties of a Toeplitz Matrix
A Toeplitz matrix or diagonal constant matrix is a matrix in which each descending
diagonal from left to right is constant. For instance, matrix T shown below is an
example of a Toeplitz matrix:
$$T = \begin{pmatrix}
a & b & c & d & k \\
f & a & b & c & d \\
g & f & a & b & c \\
h & g & f & a & b \\
j & h & g & f & a
\end{pmatrix} \tag{2.10}$$
The following properties of a Toeplitz matrix are sufficient for our purposes:
• Addition or subtraction of two Toeplitz matrices results in a Toeplitz matrix.
• Any submatrix of a Toeplitz matrix is itself a Toeplitz matrix.
2.2 Derivation of the Algorithm
In algorithms like the FFT, larger algorithms for length $2^n$ (where n is any positive
integer) can be built from smaller algorithms. Our aim is to develop a similar
algorithm to compute the cyclic convolution of any length $2^n$ which is scalable to
any value of n.
Suppose we are to calculate a cyclic convolution of length $2^n$ for input sequence
$\{x_i\}$ and impulse response sequence $\{h_i\}$, where $i = \{0, 1, 2, \ldots, 2^n - 1\}$.
As we have mentioned before, we can compute the cyclic convolution by taking
the Fourier transform of both sequences, multiplying them point-by-point and taking
the inverse transform. This introduces calculations in the complex domain. But
we can work around complex computations by using a clever approach. In this
approach, we do take the Fourier transform, but instead of using actual numerical
values for W (where $W = e^{-2\pi j/N}$), we leave W as a variable associated with
the intermediate results of processed data and manipulate the properties of the
transform matrix in such a way that the actual values of W are never required in
the computation. This can be best described by an example.
A length 4 matrix for the Discrete Fourier transform ($W^4 = 1$ and $W^2 = -1$) is
shown in equation 2.11:
$$\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{pmatrix} =
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & W & W^2 & W^3 \\
1 & W^2 & 1 & W^2 \\
1 & W^3 & W^2 & W
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix} \tag{2.11}$$
In this equation, $x_0$, $x_1$, $x_2$ and $x_3$ are real.
By realizing that $W^2 = -1$, we get,
$$\begin{aligned}
X_0 &= x_0 + x_1 + x_2 + x_3 \\
X_1 &= (x_0 - x_2) + (x_1 - x_3) \cdot W \\
X_2 &= (x_0 + x_2) + (x_1 + x_3) \cdot W^2 \\
X_3 &= (x_0 - x_2) - (x_1 - x_3) \cdot W
\end{aligned}$$
Here, $X_1$ and $X_3$ look basically the same except for the fact that the second
term is negative in $X_3$. It would, therefore, suffice to calculate Fourier transform
components for only $X_0$, $X_1$ and $X_2$. This is because $X_3$ could be easily obtained
from $X_1$.
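The idea of carrying W along as a symbol can be illustrated for the 4-point case. In the sketch below, $X_1$ is stored as a pair of real coefficients (p, q) standing for p + q·W, and the numerical value W = -j is used only at the very end to check the result. The function names and the pair representation are assumptions made for this example.

```python
def dft4_components(x):
    """Return X0 and X2 as real numbers and X1 as a pair (p, q) meaning p + q*W."""
    x0, x1, x2, x3 = x
    X0 = x0 + x1 + x2 + x3
    X2 = (x0 + x2) - (x1 + x3)      # uses W**2 = -1
    X1 = (x0 - x2, x1 - x3)         # coefficients of 1 and W
    return X0, X1, X2


def x3_from_x1(X1):
    """X3 has the same components as X1 with the W coefficient negated."""
    p, q = X1
    return (p, -q)


def evaluate(pair, W):
    """Substitute a numerical value for W (only needed for checking)."""
    p, q = pair
    return p + q * W


X0, X1, X2 = dft4_components([1, 2, 3, 4])
W = -1j                              # e^{-2*pi*j/4}
print(X0, X1, X2)
print(evaluate(X1, W), evaluate(x3_from_x1(X1), W))
```

All arithmetic before the final `evaluate` call stays in the real domain.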
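A bigger example illustrates the same. For example, consider a Fourier transform
matrix for length 8 (where $W^4 = -1$):
$$\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \\ X_4 \\ X_5 \\ X_6 \\ X_7 \end{pmatrix} =
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & W & W^2 & W^3 & -1 & -W & -W^2 & -W^3 \\
1 & W^2 & -1 & -W^2 & 1 & W^2 & -1 & -W^2 \\
1 & W^3 & -W^2 & W & -1 & -W^3 & W^2 & -W \\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
1 & -W & W^2 & -W^3 & -1 & W & -W^2 & W^3 \\
1 & -W^2 & -1 & W^2 & 1 & -W^2 & -1 & W^2 \\
1 & -W^3 & -W^2 & -W & -1 & W^3 & W^2 & W
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{pmatrix} \tag{2.12}$$
In this case, if we calculate Fourier transform components for $X_0$, $X_1$, $X_2$ and
$X_4$, the rest of the transform coefficients can be derived from them.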
In general, for any length, if one separates the transform component indices into
groups that have the same greatest common divisor (GCD) with $2^n$, then we can
choose only one element from each group, and the rest of the components from that
group can be obtained from the representative element.
Clearly, there will be n groups for GCD = $2^i$, $0 \le i < n$. Index 0 also forms an
independent group.
For example, if n = 4, then $2^4 = 16$, and indices would range from 0, 1, ..., 15.
Group indices in this case would be 3, 2, 1 and 0.
In addition, {X(0)} forms its own group.
Group 3 consists of all the elements that have the same GCD as $2^3$ with $2^4$ and
is represented by element X(8). Similarly,
Group 2 consists of all the elements that have the same GCD as $2^2$ with $2^4$ and
is represented by element X(4).
Group 1 consists of all the elements that have the same GCD as $2^1$ with $2^4$ and
is represented by element X(2).
Group 0 consists of all the elements that have the same GCD as $2^0$ with $2^4$ and
is represented by element X(1).
In general one can see that we choose element $X(2^i)$ for group i.
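The grouping of indices by their GCD with $2^n$ can be sketched in Python as follows. The function name `gcd_groups` and the dictionary layout are assumptions made for illustration; index 0, which forms its own group, is left out of the dictionary.

```python
from math import gcd


def gcd_groups(n):
    """Group the indices 1 .. 2**n - 1 by i, where gcd(index, 2**n) == 2**i."""
    size = 2 ** n
    groups = {i: [] for i in range(n)}
    for k in range(1, size):
        g = gcd(k, size)                      # always a power of two 2**i with i < n
        groups[g.bit_length() - 1].append(k)
    return groups


for i, members in sorted(gcd_groups(4).items(), reverse=True):
    # The smallest member of group i is the representative 2**i.
    print("group", i, "representative", members[0], "members", members)
```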
Therefore, for length 8, we calculate the Fourier transform components for X(0),
X(1), X(2) and X(4). Before performing multiplications with the complex W terms,
let us leave them as they are and re-name these components as our Fourier
transforms. In this case, we compute the following:
$$X_0 = x_0 + x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 \tag{2.13}$$
$$X_4 = (x_0 + x_2 + x_4 + x_6) - (x_1 + x_3 + x_5 + x_7) \tag{2.14}$$
$$X_2 = (x_0 - x_2 + x_4 - x_6) + (x_1 - x_3 + x_5 - x_7) \cdot W^2 \tag{2.15}$$
$$X_1 = (x_0 - x_4) + (x_1 - x_5) \cdot W + (x_2 - x_6) \cdot W^2 + (x_3 - x_7) \cdot W^3 \tag{2.16}$$
The general form for $X_2$ is $X_2 = a + b \cdot W^2$, and for $X_1$ it is
$X_1 = c + d \cdot W + e \cdot W^2 + f \cdot W^3$, where a and b are components of $X_2$
and c, d, e and f are components of $X_1$.
So far, we haven't calculated anything in the complex domain. The flow diagram
for calculating Fourier transform components is shown in figure 2.2. Now, we are to
perform multiplication operations on these components. The multiplication for the
components of $X_2$ would be of the form $X_2 \cdot H_2$, where
$$X_2 = a + b \cdot W^2 \tag{2.17}$$
$$H_2 = a' + b' \cdot W^2 \tag{2.18}$$
Figure 2.2: Calculating Fourier transform components for an 8-point network
therefore,
$$Y_2 = (a + b \cdot W^2) \cdot (a' + b' \cdot W^2) \tag{2.19}$$
or, since $W^4 = -1$,
$$Y_2 = a \cdot a' + (a \cdot b' + a' \cdot b) \cdot W^2 - b \cdot b' \tag{2.20}$$
which is of the form $u + v \cdot W^2$, where $u = a \cdot a' - b \cdot b'$ and $v = a \cdot b' + a' \cdot b$.
We can call u and v the components of $Y_2$. Now, a and b are the components of
$X_2$ and the response matrix is $H_2$. In the form of a matrix, equation 2.20 looks like:
$$\begin{pmatrix} u \\ v \end{pmatrix} =
\begin{pmatrix} a' & -b' \\ b' & a' \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix} \tag{2.21}$$
Clearly, the $H_2$ matrix is a Toeplitz matrix. Hence, we can always precompute
the H components and arrange them as Toeplitz matrices to multiply with the
Fourier transform component vectors in order to get the Y component vectors.
This result is at the heart of our algorithm.
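The link between multiplying component polynomials in W and a Toeplitz matrix-vector product can be sketched as below. This is an illustrative sketch under stated assumptions: components are stored as coefficient lists, the reduction rule is that W raised to the number of components equals -1 (for the two components of $X_2$ the variable is $W^2$, so the rule is $W^4 = -1$), and the helper names are hypothetical.

```python
def reduced_product(x, h):
    """Multiply two coefficient lists, reducing with (variable)**m = -1, m = len(x)."""
    m = len(x)
    y = [0] * m
    for i in range(m):
        for j in range(m):
            if i + j < m:
                y[i + j] += x[i] * h[j]
            else:
                y[i + j - m] -= x[i] * h[j]   # wrap around with a sign change
    return y


def toeplitz_from_h(h):
    """Toeplitz matrix whose product with x equals reduced_product(x, h)."""
    m = len(h)
    return [[h[r - c] if r >= c else -h[m + r - c] for c in range(m)]
            for r in range(m)]


def mat_vec(mat, v):
    """Plain matrix-vector product."""
    return [sum(p * q for p, q in zip(row, v)) for row in mat]


# Components (a, b) of X2 and (a', b') of H2
a, b = 1, 2
a_p, b_p = 3, 4
print(toeplitz_from_h([a_p, b_p]))                   # [[a', -b'], [b', a']]
print(mat_vec(toeplitz_from_h([a_p, b_p]), [a, b]))  # [u, v]
print(reduced_product([a, b], [a_p, b_p]))           # same values
```

The same construction with four components gives the 4 x 4 Toeplitz matrix used for the components of $X_1$.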
The flow diagram for computing the components of X(i), i = 0, 1, 2, 4 when
$N = 2^3$ is shown in figure 2.3. Once these components are obtained, they can
be multiplied by corresponding Toeplitz matrices. After these multiplications, we
get the Y transform elements $Y_0$, $Y_4$, $Y_2$ and $Y_1$ in their component form. This is
illustrated in Fig. 2.3.
Figure 2.3: Multiplication stage for an 8-point network
From these Y components, we are to reconstruct the original signal y. But this
can be easily done by an inverse procedure to that of the pre-multiplication operations.
This concept can be clarified with a very simple example:
Suppose we have 2 components a and b. Now if a + b = p and a - b = q, then
a = (p + q)/2 and b = (p - q)/2, as shown in figure 2.4.
Figure 2.4: Conversion between a,b and p,q.
This implies that if we apply the results p and q to the exact same network,
we get scaled a and b back. To get the original a, b rather than scaled a, b, we
simply scale the Toeplitz matrix elements appropriately. For instance, instead of
multiplying by $H_0$, $H_4$, $H_2$ and $H_1$, we should multiply by $H_0/8$, $H_4/8$, $H_2/4$ and
$H_1/2$. The final algorithm for an 8-point cyclic convolution is shown in Fig. 2.5.
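The observation that the same add/subtract network inverts itself up to a scale factor can be seen in a two-line Python sketch. The function name `butterfly` is an assumption used here for the network of figure 2.4.

```python
def butterfly(a, b):
    """One add/subtract stage: returns (a + b, a - b)."""
    return a + b, a - b


a, b = 5, 3
p, q = butterfly(a, b)        # p = a + b, q = a - b
print(butterfly(p, q))        # (2*a, 2*b): the inputs come back scaled by 2
```

Each such stage contributes a factor of 2, which is why the scaling can be folded into the precomputed coefficients instead of being applied to the data.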
2.3 Structure of a Toeplitz Multiplication Module
If a Toeplitz matrix of dimension $2^n \times 2^n$ is divided into 4 submatrices, each will
be a Toeplitz matrix itself. We use this property to express $2^n$ point Toeplitz
multiplications in terms of $2^{n-1}$ point Toeplitz multiplications. Following this process
recursively, one can see that a $2^n$ point Toeplitz multiplication module will eventually
be expressed only in terms of 2-point Toeplitz modules.
To illustrate this, consider a 2 x 2 Toeplitz multiplication operation shown below:
$$\begin{pmatrix} y_0 \\ y_1 \end{pmatrix} =
\begin{pmatrix} a & b \\ c & a \end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \end{pmatrix}. \tag{2.22}$$
/
Figure 2.5: Flow diagram for an 8-point network
(Diagram not reproduced. Recoverable labels: input x0, outputs Y0 and Y4, a 2x2 Toeplitz multiplication and a 4x4 Toeplitz multiplication.)
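We can rewrite 2.22 as:
\[ \begin{pmatrix} y_0 \\ y_1 \end{pmatrix} = \begin{pmatrix} a \cdot (x_0 + x_1) - (a - b) \cdot x_1 \\ a \cdot (x_0 + x_1) - (a - c) \cdot x_0 \end{pmatrix} \tag{2.23} \]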
Expressed as in 2.23, the 2-point Toeplitz product requires 3 multiplications and
3 additions or subtractions, which, even though worse than the cyclic convolution
of length 2 (that requires 2 multiplications and 2 additions) is still better than the
brute force method that required 4 multiplications and 2 additions. Therefore, the
inner structure of 2 x 2 Toeplitz multiplication module is based on equation 2.23
and is shown in Fig. 2.6.
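A minimal sketch of equation 2.23 in C is given here. The function name `tp2` and its argument order are assumptions for illustration, and the differences (a - b) and (a - c) are assumed to be precomputed, as the response coefficients are elsewhere in the thesis.

```c
/* 2-point Toeplitz product
 *     [y0]   [a b] [x0]
 *     [y1] = [c a] [x1]
 * computed with 3 multiplications as in equation 2.23.
 * amb = a - b and amc = a - c are assumed to be precomputed. */
void tp2(long a, long amb, long amc, long x0, long x1, long *y0, long *y1)
{
    long s = x0 + x1;     /* pre-multiplication addition */
    long m0 = a * s;      /* multiplication 1 */
    long m1 = amb * x1;   /* multiplication 2 */
    long m2 = amc * x0;   /* multiplication 3 */
    *y0 = m0 - m1;        /* equals a*x0 + b*x1 */
    *y1 = m0 - m2;        /* equals c*x0 + a*x1 */
}
```

With a = 2, b = 3, c = 5 and inputs (7, 11), the call `tp2(2, 2 - 3, 2 - 5, 7, 11, &y0, &y1)` gives the same result as the direct 4-multiplication product.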
Name the module in Fig. 2.6 a TP2 module. A 2 x 2 Toeplitz multiplication, or TP2 multiplication, can be extended to any size $2^n \times 2^n$; we name each higher order module a TPN module and the corresponding Toeplitz multiplication a TPN multiplication. This basic Toeplitz structure is scalable to any length $2^n$ and can therefore be used to compute cyclic convolutions of any length $2^n$.

Figure 2.6: A 2 x 2 Toeplitz multiplication module TP2.
For any length $2^n$, the Toeplitz multiplication involving a $2^n \times 2^n$ Toeplitz matrix can be written as:
\[ \begin{pmatrix} Y_0 \\ Y_1 \end{pmatrix} = \begin{pmatrix} A & B \\ C & A \end{pmatrix} \begin{pmatrix} X_0 \\ X_1 \end{pmatrix} \tag{2.24} \]
where $X_0$, $X_1$, $Y_0$ and $Y_1$ are vectors of length $2^{n-1}$ and the submatrices A, B and C are Toeplitz matrices themselves. The addition of $x_0$ and $x_1$ of Fig. 2.7 can be extended here to imply the vector addition of $X_0$ and $X_1$. Similarly, instead of multiplying with (a - c), we will be multiplying the $X_0$ vector with the Toeplitz matrix (A - C), which is itself a Toeplitz multiplication of a lower order. Therefore, any $2^n \times 2^n$ Toeplitz multiplication can be represented by Fig. 2.7:
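The recursive structure can be sketched in C as follows. This is an illustration under stated assumptions, not the thesis implementation: the name `tp_mul`, the limit `TP_MAX_HALF`, the counter `tp_mult_count` and the storage of a Toeplitz matrix as its 2n-1 diagonals are all choices made for this example. The differences A - B and A - C are formed at run time here, whereas the hardware assumes precomputed coefficients.

```c
#define TP_MAX_HALF 64          /* largest supported half size (assumption) */

static long tp_mult_count;      /* counts scalar multiplications */

/* y = T x for an n x n Toeplitz matrix T, n a power of two, n <= 2*TP_MAX_HALF.
 * t holds the 2n-1 diagonals of T: T[i][j] = t[i - j + n - 1].
 * y must not overlap x. */
void tp_mul(int n, const long *t, const long *x, long *y)
{
    if (n == 1) {
        y[0] = t[0] * x[0];
        tp_mult_count++;
        return;
    }
    int h = n / 2;
    const long *A = t + h;      /* diagonal blocks */
    const long *B = t;          /* upper-right block */
    const long *C = t + n;      /* lower-left block */
    long amb[2 * TP_MAX_HALF], amc[2 * TP_MAX_HALF];
    long s[TP_MAX_HALF], p[TP_MAX_HALF], q[TP_MAX_HALF], r[TP_MAX_HALF];

    for (int k = 0; k < 2 * h - 1; k++) {
        amb[k] = A[k] - B[k];   /* diagonals of A - B */
        amc[k] = A[k] - C[k];   /* diagonals of A - C */
    }
    for (int i = 0; i < h; i++)
        s[i] = x[i] + x[i + h]; /* X0 + X1 */

    tp_mul(h, A, s, p);         /* A (X0 + X1) */
    tp_mul(h, amb, x + h, q);   /* (A - B) X1 */
    tp_mul(h, amc, x, r);       /* (A - C) X0 */

    for (int i = 0; i < h; i++) {
        y[i] = p[i] - q[i];     /* Y0 */
        y[i + h] = p[i] - r[i]; /* Y1 */
    }
}
```

Because every level replaces one product by three half-size products, a 4-point product uses 9 scalar multiplications and an 8-point product uses 27.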
The complete algorithms for 4-point and 8-point cyclic convolutions using Toeplitz products, developed as explained here, are shown in Figs. 2.8 and 2.9.

Figure 2.7: Generalized structure of a $2^n \times 2^n$ Toeplitz multiplication module $TP2^n$. (Diagram not reproduced.)
Figure 2.8: 4-point cyclic convolution algorithm. (Diagram not reproduced.)
Figure 2.9: 8-point cyclic convolution algorithm. (Diagram not reproduced.)
2.4 Complexity Analysis
In this section, we analyze the computational complexity of the cyclic convolution algorithm developed in previous sections.
As we discussed earlier, a cyclic convolution can be expressed as:
\[ \begin{pmatrix} Y_0 \\ Y_1 \end{pmatrix} = \begin{pmatrix} A \cdot (X_0 + X_1) - (A - B) \cdot X_1 \\ A \cdot (X_0 + X_1) - (A - C) \cdot X_0 \end{pmatrix} \tag{2.25} \]
Here, A and B are $2^{n-1} \times 2^{n-1}$ submatrices and $Y_0$, $Y_1$, $X_0$ and $X_1$ are subvectors of length $2^{n-1}$.
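For instance, if we have a 4 x 4 cyclic convolution matrix as follows:
\[ CC4 = \begin{pmatrix} a & b & c & d \\ d & a & b & c \\ c & d & a & b \\ b & c & d & a \end{pmatrix} \tag{2.26} \]
where
\[ A = \begin{pmatrix} a & b \\ d & a \end{pmatrix} \tag{2.27} \]
and,
\[ B = \begin{pmatrix} c & d \\ b & c \end{pmatrix} \tag{2.28} \]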
A and B are 2 x 2 submatrices each. To calculate $Y_0$ and $Y_1$, we need to compute the terms (A + B)/2 and (A - B)/2 and multiply them with the subvector addition result $(X_0 + X_1)$ and the subvector subtraction result $(X_0 - X_1)$, respectively. These matrices evaluate to equations 2.29 and 2.30.
\[ (A + B)/2 = \begin{pmatrix} (a + c)/2 & (b + d)/2 \\ (b + d)/2 & (a + c)/2 \end{pmatrix} \tag{2.29} \]
\[ (A - B)/2 = \begin{pmatrix} (a - c)/2 & (b - d)/2 \\ -(b - d)/2 & (a - c)/2 \end{pmatrix} \tag{2.30} \]
From equations 2.29 and 2.30 we observe that (A + B)/2 is a cyclic matrix and (A - B)/2 is a Toeplitz matrix.
As a result, the cyclic convolution algorithm is as shown in Fig. 2.10.
Figure 2.10: A 2 x 2 cyclic convolution algorithm.
(Diagram not reproduced. Recoverable labels: inputs x0 to x3 and the blocks (A+B)/2 and (A-B)/2.)
Here, we have divided the original length 4 X vector into 2 vectors of length 2 each ($X_0$ and $X_1$). $(X_0 + X_1)$ is multiplied with a 2 x 2 cyclic matrix, which means computing a cyclic convolution of length 2, and $(X_0 - X_1)$ is multiplied with a 2 x 2 Toeplitz matrix, which means computing a Toeplitz matrix multiplication of length 2. Thus the output of a length 4 cyclic convolution is the combination of two vectors: the sum of the length 2 cyclic convolution and the length 2 Toeplitz (TP2) multiplication, and the result of subtracting the TP2 multiplication from the length 2 cyclic convolution.
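The 4-point split can be checked with a short C sketch. The function name `cc4_split` is hypothetical, and for clarity the two 2 x 2 products are written out directly rather than through the CC2 and TP2 modules, so this sketch uses more multiplications than the final network.

```c
/* 4-point cyclic product y = CC4 * x, where CC4 has first row (a, b, c, d)
 * and every following row is the previous one rotated right (equation 2.26). */
void cc4_split(const double h[4], const double x[4], double y[4])
{
    double a = h[0], b = h[1], c = h[2], d = h[3];

    /* Entries of (A + B)/2 (cyclic) and (A - B)/2 (Toeplitz). */
    double sp = (a + c) / 2, sq = (b + d) / 2;
    double dp = (a - c) / 2, dq = (b - d) / 2;

    /* Pre-multiplication butterflies on the two halves of x. */
    double s0 = x[0] + x[2], s1 = x[1] + x[3];   /* X0 + X1 */
    double d0 = x[0] - x[2], d1 = x[1] - x[3];   /* X0 - X1 */

    /* Length 2 cyclic product U = ((A + B)/2)(X0 + X1). */
    double u0 = sp * s0 + sq * s1;
    double u1 = sq * s0 + sp * s1;

    /* Length 2 Toeplitz product V = ((A - B)/2)(X0 - X1). */
    double v0 = dp * d0 + dq * d1;
    double v1 = -dq * d0 + dp * d1;

    /* Post-multiplication butterflies: Y0 = U + V, Y1 = U - V. */
    y[0] = u0 + v0;
    y[1] = u1 + v1;
    y[2] = u0 - v0;
    y[3] = u1 - v1;
}
```

For the first row (1, 2, 3, 4) and input (5, 6, 7, 8), the result matches the direct matrix product.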
The algorithm above can be generalized to the calculation of a cyclic convolution of length $2^n$.
A cyclic convolution of length $2^n$, expressed as $CC2^n$, becomes 2 vector additions (consisting of $2^{n-1}$ adds), a cyclic convolution of length $2^{n-1}$, a Toeplitz multiplication of order (n - 1) (or $TP2^{n-1}$), and 2 vector subtractions (consisting of $2^{n-1}$ subtractions).
Or,
\[ CC2^n = 2^{n-1}\ \text{adds} + CC2^{n-1} + TP2^{n-1} + 2^{n-1}\ \text{subtractions} \]
Therefore,
\[ \begin{aligned} CC2^n = {} & 2^{n-1}\ \text{adds} \\ & + (2^{n-2}\ \text{adds} + CC2^{n-2} + TP2^{n-2} + 2^{n-2}\ \text{subtractions}) \\ & + (2^{n-2}\ \text{adds} + TP2^{n-2} + 2 \times 2^{n-2}\ \text{subtractions}) \\ & + 2^{n-1}\ \text{subtractions} \end{aligned} \tag{2.31} \]
This becomes a recursive algorithm for which the base condition occurs when we reach the calculation stages for TP2 and CC2, which are computed by the previously described methods with only one stage of multiplications. From the discussion above, we can express the complexity of a cyclic convolution of length $2^n$ as:
$CC2^n := CC2^{n-1} + TP2^{n-1} + 2^{n+1}$ linear additions or subtractions, where,
$TP2^n := 3 \times TP2^{n-1} + 3 \times 2^{n-1}$ linear additions or subtractions.
Now, the total number of additions for a TP2 module is $a_1 = 3$, and the total number of multiplications for a TP2 module is $m_1 = 3$. These will serve as the base conditions for the difference equations just mentioned. In a $TP2^n$ module, if the total additions and multiplications are $a_n$ and $m_n$ respectively, then
\[ m_n = 3 \times m_{n-1} = 3^2 \times m_{n-2} = \cdots = 3^{n-1} \times m_1 = 3^n \]
and,
\[ a_n = 3 \times 2^{n-1} + 3 \times a_{n-1} \]
which, after simplification, becomes
\[ a_n = \sum_{i=1}^{n} 3^i \cdot 2^{n-i} \]
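Now, equation 2.31 becomes,
\[ \begin{aligned} CC2^n &= CC2^{n-1} + TP2^{n-1} + 2^{n+1}\ \text{additions} \\ &= (CC2^{n-2} + TP2^{n-2} + 2^n\ \text{additions}) + TP2^{n-1} + 2^{n+1}\ \text{additions} \\ &= CC2\ \text{computations} + TP2^1 + TP2^2 + \cdots + TP2^{n-1}\ \text{computations} \\ &\quad + (2^3 + 2^4 + \cdots + 2^{n+1})\ \text{additions} \end{aligned} \]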
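A 2 point cyclic convolution requires 2 multiplications and 4 additions. Therefore, using $a_k = 3 \cdot (3^k - 2^k)$ for each $TP2^k$ module, the total computational complexity for a $CC2^n$ module is:
\[ \begin{aligned} CC2^n = {} & (2 + 3^1 + 3^2 + \cdots + 3^{n-1})\ \text{multiplications} \\ & + \left[ 3 \times \{ (3^1 + 3^2 + \cdots + 3^{n-1}) - (2^1 + 2^2 + \cdots + 2^{n-1}) \} + (2^3 + 2^4 + \cdots + 2^{n+1}) + 4 \right]\ \text{additions} \end{aligned} \]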
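The recurrences stated above for $TP2^n$ and $CC2^n$ can be tabulated with a short C sketch. The type `op_count` and the functions `tp_count` and `cc_count` are names assumed for this illustration; the base cases (3 multiplications and 3 additions for TP2, 2 multiplications and 4 additions for CC2) are the ones given in the text.

```c
/* Operation counts for a module of 2^n points. */
struct op_count { long mults; long adds; };

/* TP2^n: m_n = 3 m_(n-1), a_n = 3 * 2^(n-1) + 3 a_(n-1), with m_1 = a_1 = 3. */
struct op_count tp_count(int n)
{
    struct op_count c = { 3, 3 };
    for (int k = 2; k <= n; k++) {
        c.adds = 3 * (1L << (k - 1)) + 3 * c.adds;
        c.mults = 3 * c.mults;
    }
    return c;
}

/* CC2^n = CC2^(n-1) + TP2^(n-1) + 2^(n+1) additions,
 * with CC2 taking 2 multiplications and 4 additions. */
struct op_count cc_count(int n)
{
    struct op_count c = { 2, 4 };
    for (int k = 2; k <= n; k++) {
        struct op_count t = tp_count(k - 1);
        c.mults += t.mults;
        c.adds += t.adds + (1L << (k + 1));
    }
    return c;
}
```

Under these recurrences a 4-point cyclic convolution takes 5 multiplications and 15 additions, and an 8-point one takes 14 multiplications and 46 additions.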
Since a cyclic convolution algorithm involving Toeplitz multiplications finally reduces to CC2 and TP2 calculations, the complete algorithm has only pre-multiplication additions, a single stage of multiplications, and post-multiplication linear additions and subtractions. Hence, we call this a bilinear algorithm.
Complete networks for length 4 and length 8 cyclic convolutions have been shown in figures 2.8 and 2.9.
In the next chapter, we will elaborate on the implementation of our algorithm and discuss the hardware and time complexity of our method.
Figure 2.11: A $2^n$ point cyclic convolution algorithm
(Diagram not reproduced. Recoverable labels: a cyclic matrix of order $2^{n-1}$, a Toeplitz matrix of order $2^{n-1}$, and the output vectors Y0 and Y1.)
Chapter 3
Single Memory Implementation
We are primarily interested in building an architecture or dedicated hardware for calculating cyclic convolutions. As discussed previously in chapter 1, even though microprocessors or General Purpose Processors (GPPs) could be used to perform the task, our contention is that dedicated hardware, or a chip designed specifically for this computation, would be a more cost-effective alternative.
In this chapter, we discuss the actual implementation of the dedicated hardware. First, we describe in detail the sequential algorithm used for its implementation. We then proceed to illustrate its architecture and discuss in detail the stages associated with the algorithm. We also analyze the time complexity involved in the different stages and include a summary of simulation results obtained for convolutions of length $2^2$ through length $2^{10}$. Finally, we discuss hardware complexity and show improvements gained over a commonly used method to calculate cyclic convolutions.
We are assuming that this system uses a single memory block or on-chip RAM. We describe the system's address generation procedure for fetching and storing data from this memory.
3.1 Adapting the Algorithm to Hardware
Chapter 2 described the bilinear algorithm for cyclic convolution of length N = 2n
points. For the purpose of hardware implementation, we divide the computation
into six sequential phases. These six phases represent diverse tasks involved in the
algorithm. These phases may be described as follows:
• Phase 0: Computes pre-multiplication DFT forms or butterfly operations.
• Phase 1: Creates appropriate forms for Toeplitz multiplications.
• Phase 2: Computes the multiplication of the Toeplitz forms with their respective precomputed h coefficients.
• Phase 3: Computes post-multiplication IDFT components.
• Phase 4: Copies components computed in phase 3 into another array.
• Phase 5: Computes first 2 IDFT components and stores them.
• Phase 6: Computes output signal from the IDFT components by butterfly
operation.
In our hardware implementation of the algorithm, all the multiplication results are computed and stored before the inverse transforms are calculated. The procedure to do so has been described in terms of a set of sequential steps or phases. This set of steps is iterated $\log_2 N - 1$ times in order for all the multiplication results to be calculated first, where N is the length or total number of points. Our loop counter for this purpose is called j, which is initialized to zero and increased by 1 at the end of each iteration.
During the first iteration of j, for a length of N points, the last N/2 points are subjected to a Toeplitz multiplication of order $(\log_2 N - 1)$. In the second iteration, the second half, or N/4 points, of the remaining N/2 points are subjected to a Toeplitz multiplication of order $(\log_2 N - 2)$. This process is continued until only the first 4 points remain. In this final iteration of j, the second 2 points are applied to a TP2 multiplication. The algorithm then proceeds to multiply the first 2 points directly with their respective h coefficients. In this manner, the set of steps for Toeplitz multiplication is repeated $(\log_2 N - 1)$ times before starting the inverse transforms.
This means that, for a particular Toeplitz product, all the multiplication results are computed and stored before the computation for the next iteration of j even begins. For instance, if the total length is 16 (n = 4), the last 8 points are applied to a TP8 multiplication, and all its results are calculated and stored first. Then, of the remaining 8 points, the last 4 are subjected to a TP4 calculation. In the 3rd and final iteration ($\log_2 16 - 1 = 3$), the last 2 of the remaining 4 are applied to a TP2 multiplication and the iterative process completes.
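The iteration schedule just described can be written down as a small C sketch. The function `tp_schedule` and its output arrays are hypothetical names used only for this illustration.

```c
/* Fills start[] and len[] with the block of points handled by each iteration
 * of the j loop and returns the number of iterations (log2(N) - 1).
 * N is assumed to be a power of two with N >= 4. */
int tp_schedule(int N, int start[], int len[])
{
    int j = 0;
    for (int halfr = N / 2; halfr >= 2; halfr /= 2) {
        start[j] = halfr;   /* first index of the last half still under consideration */
        len[j] = halfr;     /* number of points in the Toeplitz product (8, 4, 2, ...) */
        j++;
    }
    return j;
}
```

For N = 16 this yields three iterations covering points 8 to 15 (TP8), 4 to 7 (TP4) and 2 to 3 (TP2), leaving points 0 and 1 for direct multiplication.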
For each iteration, the first 5 steps, namely phase 0 through phase 4, are repeated sequentially for a particular order of Toeplitz product calculation. Phase 5 and phase 6 are executed after all Toeplitz product calculations have been completed.
We assume that we have $N = 2^n$ points stored sequentially in an array called dbuff, with starting index 0. So we have N data points stored in locations dbuff[0] through dbuff[N-1].
The sequential algorithm can be summarized by the flow diagram in Fig. 3.1.
Figure 3.1: Partitioning the cyclic convolution algorithm in 6 phases
(Flow diagram not reproduced. Recoverable steps: initialize loop counter j = 0, N = 2^n, halfr = N/2; phase 1, calculate Toeplitz forms for the last N/2 points; phase 2, multiply in the Toeplitz domain; phase 3, perform post-multiplication additions; phase 4; then j = j + 1, N = N/2, halfr = halfr/2. Phase 5 calculates and stores dft[0] and dft[1]; phase 6 calculates the IDFT of all points in the dft array.)
We now describe the details of each phase.
3.1.1 Phase 0
In phase 0, we start out by considering all N points. At the end of each iteration, the total number of points in question becomes half that of the previous iteration. We use an index halfr, whose value is always half the total number of points currently being considered.
For example, for a 16-point network, during the first iteration (where the last 8 points would be applied to Toeplitz multiplication), halfr = 8. For the next iteration, halfr becomes 4, and so on. Here, we should note that even though we are computing forms for the last N/2 (here, 8) points, the calculations for phase 0 are being done for all 16 points. In the next iteration, this phase will be applicable to data stored in only the first 8 locations of the 16, and in the process, the last iteration will apply the first 4 data points to phase 0.
In phase 0, DFTs of the N points are calculated. The sum of each point and the point halfr apart is stored at the location of the first point. The result of subtracting the 2nd point from the 1st point is stored at the location of the subtrahend. For a length of 8 points, phase 0 is shown in Fig. 3.2.
3.1.2 Phase 1
Phase 1 is the premultiplication stage of the algorithm. In this phase, appropriate forms for multiplication in the Toeplitz domain are calculated.
In a Toeplitz module of the lowest order, that is, a TP2 module, the premultiplication additions are as follows: the 2 inputs are added, then the result of the addition and the 2 inputs are passed on to the multiplication stage. This results in 3 multiplication inputs, and therefore 3 multiplication outputs, as shown in Fig. 3.3.

Figure 3.2: Phase 0 for an 8-point network
(Diagram not reproduced. Recoverable labels: inputs x(0) to x(7) and the components for the TP4 calculation.)
Figure 3.3: Premultiplication stage in a TP2 module. (The second operand for each multiplication is precomputed and stored in memory.)
A similar concept can be applied to any $2^n$-point length. In that case, instead of adding 2 points, we add 2 vectors, the result of which is itself a vector. These 2 vectors and the resultant vector are applied to 3 Toeplitz matrix multiplications. The premultiplication stages for a 4-point and an 8-point multiplication are shown in Fig. 3.4.
Here, we observe that the $2^n$ data points applied to a TPN ($N = 2^n$) module are first divided into 2 groups, or 2 vectors, of $2^{n-1}$ data points each. After the addition of these 2 vectors, the resultant vector, along with the original input vectors, constitutes a total of $3 \times 2^{n-1}$ outputs. These $3 \times 2^{n-1}$ outputs are then applied to 3 TP(N - 1) multiplication modules, that is, modules of the next lower order. Within the TP(N - 1) modules, the same procedure occurs recursively until a TP2 calculation stage is reached. Therefore, for a Toeplitz multiplication of $2^n$ points, these $2^n$ points are subjected to $(\log_2 N - 1)$ iterations of phase 1, the premultiplication stage, to generate the appropriate number of multiplication outputs for the applied $2^n$ inputs.
In phase 1, the intermediate results, or vector addition results, are stored beginning after the last stored original input data point in the dbuff array. For example, in a 16-point network, in the first iteration of the j loop (j = 0), phase 1 initially applies the last 8 data points to a 4-point vector addition. The resultant vector of 4 points is stored at the end of the 16 points (dbuff[16] through dbuff[19]). Next, the input vectors and the resultant vector, a total of 12 points, are applied to 6 vector additions. The 6 addition results are stored at locations dbuff[20] through dbuff[25]. Finally, the 12 input points + 6 resultant points = 18 points are applied to 9 additions. These 9 results are stored in locations dbuff[26] through dbuff[34].
At the end of phase 1, from $2^n$ points, we have $3^n$ points ready for multiplication with their respective precomputed h coefficients.
3.1.3 Phase 2
In this phase, the points created for multiplication in phase 1 are multiplied with their corresponding impulse response or h coefficients. We have assumed that these coefficients are precomputed and ready to be applied whenever requested.

Figure 3.4: Premultiplication in a TP4 and a TP8 module
(Diagram not reproduced. The 4-point premultiplication stage is shown by the box 'Z'.)
3.1.4 Phase 3
Phase 3 is the post-multiplication addition stage. We have previously seen that at the beginning of the iteration of loop counter j, if we have $2^n$ points applied to Toeplitz multiplication, we have a total of $3^n$ resultant points. In phase 3, postmultiplication computations are applied to these $3^n$ points and we get $2^n$ resultant points at the end of the phase.
Phase 3 operations perform the inverse of those of phase 1. A TP2 postmultiplication operation is shown in Fig. 3.5.

Figure 3.5: A TP2 postmultiplication operation.

Here the 1st and the 3rd multiplication outputs are both subtracted from the 2nd multiplication output to generate, respectively, the 2nd and 1st outputs of the 2-point cyclic convolution. These outputs create a 2-point vector, which then becomes the first subtrahend vector for the next stage of postmultiplication calculations, where the 1st and 3rd 2-point vectors are subtracted from the 2nd 2-point vector to generate two 2-point output vectors. These vectors become a 4-point 1st subtrahend vector, and this process is continued $(\log_2 N - 1)$ times for phase 3.
Thus the sequential algorithm in phase 3 uses $3^n$ inputs, creates $3^{n-1}$ groups of 3 inputs each and generates $2 \times 3^{n-1}$ outputs. This process repeats until $2^n$ DFT points are computed. These $2^n$ points are stored in the dbuff array at the end of this phase.
3.1.5 Phase 4
In phase 4, the calculated DFT coefficients are copied into a separate memory block or array called the dft array. This is done because at the end of phase 4, the loop j completes 1 cycle of execution and j is increased by 1. For the next value of j, the array dbuff needs to be freed up in order for the procedure to be repeated for the next lower TP multiplication. This is required because each order of TP multiplication needs $3^n$ intermediate memory locations, which is greater than the initial length of the dbuff array. If the results from the previous iteration had not been copied elsewhere, they would get overwritten.
In each iteration of j, since the last N/2 DFT coefficients are calculated, these N/2 coefficients are stored in the last N/2 locations of the dft array. In the next iteration, N/4 coefficients get stored in the next to last N/4 locations. After all iterations are completed, the dft array is populated with N - 2 resultant points starting from the 3rd location (the 1st and 2nd locations are reserved for phase 5, as we shall soon see).
3.1.6 Phase 5
Since the 1st and 2nd points of the N points are not subjected to Toeplitz multiplication, after all the multiplications have been completed and stored through the previously described iterative process, only these 2 DFT coefficients are directly calculated in phase 5 and stored in the beginning locations of the dft array (dft[0] and dft[1]).
3.1.7 Phase 6
Having all the multiplication results stored in the appropriate locations of the array dft, the inverse procedure of phase 0 is carried out in phase 6 in order to generate the IDFT, or output of the cyclic convolution. halfr is initialized to 1 and gets multiplied by 2 at the end of each iteration of phase 6.
This phase differs from phase 0 in that the sum of dft[i] and dft[i + halfr] is stored in dft[i + halfr], while dft[i + halfr] - dft[i] is stored in location dft[i]. Also, phase 0 is carried out at the beginning of each iteration of the loop j for calculating TP multiplications of a certain order, but phase 6 is performed after all other phases have been completed. An inverse transform for an 8-point network is shown in Fig. 3.6.
Figure 3.6: IDFT calculations in phase 6 for an 8-point network
(Diagram not reproduced. Recoverable labels: inputs dft[0] to dft[7], the TP4 outputs and TP2 outputs, and the outputs y(0) to y(7).)
3.2 Architecture of the Dedicated Hardware
Figure 3.7 shows the architecture of the dedicated hardware. We have named the system My_Chip, and its Arithmetic Logic Unit and control units are shown. All the additions and subtractions in the different phases are performed using a Carry Propagation Adder (CPA). We also use an array multiplier to perform multiplications of Toeplitz forms with their respective precomputed response coefficients ('h' coefficients). Both the CPA and the array multiplier use 16-bit operands. The operands are read from the memory and stored in internal temporary registers. Temporary registers temp1 and temp2 hold data for the CPA, and registers temp3 and temp4 hold data for the multiplier.
The control unit generates logic for address generation and memory control. Data flow to and from the memory is bidirectional. Data is written to memory when the wr signal is enabled and read from memory when the rd signal is enabled.
3.3 Hardware Implementation and Timing Analysis
In this section, we analyze the phases in light of hardware implementation and the timing of the calculations involved. We describe the conversion of C code to Verilog code in generating appropriate addresses, as well as read and write signals from the chip to the memory. (The Verilog code is included in appendix A.)
Figure 3.7: Architecture of the Dedicated Hardware
(Block diagram not reproduced. Recoverable labels: MY CHIP with its control unit, temporary storage for flow control, address generation logic and memory control; MEMORY; and the signals clock, rd, wr, address and the data bus.)
3.3.1 Phase 0
In phase 0, the DFT of the points stored in the dbuff array is computed as described in the previous section. The C code to perform these operations would look like:
for (i = 0; i < halfr; i++)
{
    temp = dbuff[i + halfr];
    dbuff[i + halfr] = dbuff[i] - temp;
    dbuff[i] = dbuff[i] + temp;
}
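To show how this loop behaves across successive iterations of j, the following sketch wraps it in a function and applies it with a halving halfr. The names `phase0` and `phase0_all` are assumptions for this example, and phases 1 through 4, which run between the passes in the real algorithm, are left out.

```c
/* Phase 0 butterfly over the first 2*halfr entries of dbuff. */
void phase0(int dbuff[], int halfr)
{
    for (int i = 0; i < halfr; i++) {
        int temp = dbuff[i + halfr];
        dbuff[i + halfr] = dbuff[i] - temp;   /* difference replaces the subtrahend */
        dbuff[i] = dbuff[i] + temp;           /* sum replaces the first operand */
    }
}

/* Runs phase 0 once per iteration of j: halfr = N/2, N/4, ..., 2. */
void phase0_all(int dbuff[], int N)
{
    for (int halfr = N / 2; halfr >= 2; halfr /= 2)
        phase0(dbuff, halfr);
}
```

For the 8 inputs 1 to 8, the first pass leaves the sums 6, 8, 10, 12 in the first half and four differences of -4 in the second half; the second pass then reduces the first four entries to 16, 20, -4, -4.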
The statements in C are carried out sequentially. In order for them to be executed sequentially in hardware, we include a counter to control the order of statements. In our hardware, we have a carry propagation adder (CPA) to perform additions and/or subtractions between two operands. We assume 2 internal registers, temp1 and temp2, to hold the values of the operands. Our chip generates the correct addresses to read operands from the memory and to write the results of addition. Since a single memory module is being used, simultaneous reads and writes are not possible. Therefore, we require 2 clock cycles to read the operands and 2 clock cycles to write the results. The writes occur after temp1 and temp2 have been loaded with the appropriate values. A control flip-flop sub_not_add is included to indicate whether subtraction or addition is to occur.
A mod 4 counter cnt1 is sufficient for our purposes, since for each value of i we need 4 clock cycles to complete an iteration. A register called regi is used to hold the value of i. Register add1 holds the addresses generated by the chip for data to be fetched from or sent out to the memory. The 1-bit registers rd and wr generate active low read and write signals respectively.
regi, add1, rd, wr and cnt1 change on the positive edge of the clock. Registers temp1 and temp2 latch data on the negative edge of the clock. The sub_not_add signal also changes on the negative edge. We assume that the CPA needs only one clock period to compute the addition after the operands are clocked into registers temp1 and temp2. At the end of the clock period, wr is made to have a positive edge to write the results to memory.
cnt1 is initialized to zero and increases by 1 at every positive edge of the clock until it reaches the value 3, after which it is reset to 0. The sequence of actions is as follows:
• cnt1 = 0:
1. add1 is assigned the value of regi on the positive edge of clock.
2. rd is enabled on the positive edge of clock.
3. temp1 loads the value stored at dbuff[regi] on the negative edge of clock.
• cnt1 = 1:
1. add1 is assigned the value of regi|halfr on the positive edge of clock. (Here, we are using an OR gate to perform the addition of regi and halfr. As halfr is a power of two greater than regi, it gives the equivalent value but saves hardware.)
2. rd is enabled on the positive edge of clock.
3. temp2 loads the value stored at dbuff[regi + halfr] on the negative edge of clock.
4. sub_not_add is assigned 0 to indicate addition on the negative edge of clock.
• cnt1 = 2:
1. add1 is assigned the value of regi on the positive edge of clock.
2. wr is enabled on the positive edge of clock so that the result of the addition can be stored in dbuff[regi].
3. rd is disabled on the positive edge of clock.
4. sub_not_add is set to 1 to indicate subtraction on the positive edge of clock.
5. wr is disabled on the negative edge of clock.
• cnt1 = 3:
1. add1 is assigned regi|halfr on the positive edge of clock.
2. wr is enabled on the positive edge of clock (so that the edge triggered memory registers get the required second edge to write the subtraction results).
3. wr is disabled on the negative edge of clock.
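The four-state sequence above can be modelled cycle by cycle in C. This is a behavioural sketch only: the function name `phase0_cycles` is an assumption, clock edges and the rd, wr and sub_not_add signals are not modelled, and only the register names regi, cnt1, add1, temp1 and temp2 are taken from the text.

```c
/* Cycle-by-cycle model of phase 0 for one value of halfr (a power of two).
 * Returns the number of clock cycles used. */
long phase0_cycles(int dbuff[], int halfr)
{
    int temp1 = 0, temp2 = 0;
    int regi = 0, cnt1 = 0;
    long cycles = 0;

    while (regi < halfr) {
        /* regi < halfr and halfr is a power of two, so OR equals addition. */
        int add1 = (cnt1 & 1) ? (regi | halfr) : regi;

        switch (cnt1) {
        case 0: temp1 = dbuff[add1]; break;           /* read first operand */
        case 1: temp2 = dbuff[add1]; break;           /* read second operand */
        case 2: dbuff[add1] = temp1 + temp2; break;   /* write the sum */
        case 3: dbuff[add1] = temp1 - temp2;          /* write the difference */
                regi++;
                break;
        }
        cnt1 = (cnt1 + 1) & 3;                        /* mod 4 counter */
        cycles++;
    }
    return cycles;
}
```

Each value of regi consumes exactly 4 cycles, so a pass with halfr = 4 takes 16 cycles.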
The timing diagram for phase 0 is shown in Fig.3.8:
Timing Analysis for Phase 0
If we have a total of N points where $N = 2^n$, then the value of j ranges from 0 through n - 2, resulting in n - 1 iterations of the j loop. For each iteration, i ranges from 0 through $2^{n-j-1} - 1$. Therefore, the total number of iterations of i in phase 0 is $2^1 + 2^2 + \cdots + 2^{n-1} = 2^n - 2$.

Figure 3.8: Timing diagram for phase 0
(Diagram not reproduced. Signals shown: cnt1, regi, sub_not_add and wr.)

Each iteration requires 4 clock cycles to complete. Therefore, the total number of clock cycles used in phase 0 is $T_0 = (2^n - 2) \times 4$, or,
\[ T_0 = 2^{n+2} - 8 \tag{3.1} \]
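The cycle count for phase 0 can be cross-checked by walking the j loop directly. The function name `phase0_total_cycles` is an assumption for this sketch; the 4 cycles per value of i come from the counter description above.

```c
/* Counts phase 0 clock cycles for N = 2^n points by walking the j loop. */
long phase0_total_cycles(int n)
{
    long cycles = 0;
    for (int j = 0; j <= n - 2; j++) {
        long halfr = 1L << (n - j - 1);   /* i runs from 0 to halfr - 1 */
        cycles += 4 * halfr;              /* 4 clock cycles per value of i */
    }
    return cycles;
}
```

For n = 4 the loop gives 32 + 16 + 8 = 56 cycles, which is 4 times (2^4 - 2).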
3.3.2 Phase 1
Phase 1 creates the forms for Toeplitz multiplication. At the end of phase 0, the last N/2 points ($N/2 = 2^{n-1}$) are applied to phase 1, and we have $3^{n-1}$ outputs at the end of the phase that are ready to be multiplied with their respective filter coefficients.
Phase 1 requires more storage than $2^{n-1}$ locations, as we have 3 outputs for every 2 inputs. To keep track of where intermediate results are stored, the internal register address1 is used. Since this phase is associated with vector addition, the register vector_base contains the starting address of the current vector and is modified as necessary. Also, the length of a vector at the start of phase 1 is half of halfr, and is halved in every iteration thereafter. So we need another register, half_length, that specifies the current length of the vector addition. Register number_of_vectors keeps track of the current number of vectors.
The C code used for phase 1 was:
address1 = 2*halfr;
number_of_vectors = 1;
half_length = halfr/2;
for (k1 = 0; k1 < logr-j; k1++)
{   for (k = 0; k < number_of_vectors; k++)
    {   vector_base = halfr + k*2*half_length;
        for (i = 0; i < half_length; i++)
        {   dbuff[address1++] = dbuff[vector_base] +
                                dbuff[vector_base+half_length];
            vector_base++;
        }
    }
    /* assumed update: every vector yields 3 vectors for the next pass */
    number_of_vectors = 3*number_of_vectors;
    half_length = half_length/2;
}
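A self-contained version of this loop nest can be run against the 16-point example of section 3.1.2. In this sketch the function name `phase1` and the parameter `passes` (standing in for logr - j) are assumptions, and the tripling of number_of_vectors after each pass is an assumed update that is needed to reproduce the storage locations described in the text.

```c
/* Phase 1 for one iteration of the j loop, with halfr = 2^passes.
 * dbuff must have room for halfr + 3^passes entries.
 * Returns the final value of address1. */
int phase1(int dbuff[], int halfr, int passes)
{
    int address1 = 2 * halfr;
    int number_of_vectors = 1;
    int half_length = halfr / 2;

    for (int k1 = 0; k1 < passes; k1++) {
        for (int k = 0; k < number_of_vectors; k++) {
            int vector_base = halfr + k * 2 * half_length;
            for (int i = 0; i < half_length; i++) {
                dbuff[address1++] = dbuff[vector_base]
                                  + dbuff[vector_base + half_length];
                vector_base++;
            }
        }
        number_of_vectors *= 3;   /* assumed: each vector yields three */
        half_length /= 2;
    }
    return address1;
}
```

With halfr = 8 and 3 passes, the sums land in dbuff[16..19], dbuff[20..25] and dbuff[26..34], leaving 27 multiplication operands in dbuff[8..34].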
Here we have an outer loop with 2 inner loops, one nested inside the other. For loop control, we have 3 loop-control registers k1, k and i, corresponding to the variables in the code above, all of which are initialized to zero. As in phase 0, we re-use the mod-4 counter cnt1 to maintain the order of execution. This time, we implement the innermost loop i first, with each count of cnt1. At the beginning of each iteration of i, the loop termination conditions for i, k and k1 are checked and these registers are updated accordingly.
The 2-bit counter cnt1 is initialized to zero at the beginning of phase 0. At every positive edge of the clock, it increases by 1, automatically resetting to 0 after a cycle of 4 states. The sequence of actions can be described as follows:
• cnt1 = 0:
1. add1 is assigned the value of vector_base on the positive edge of the clock.
2. The rd signal is enabled on the positive edge of the clock.
3. The wr signal is disabled on the positive edge of the clock.
4. temp1 is assigned the value of dbuff[vector_base] on the negative edge of the clock.
• cnt1 = 1:
1. add1 is assigned the address of the second operand, vector_base + half_length in the C code above, on the positive edge of the clock.
2. temp2 is assigned the value stored at that address on the negative edge of the clock.
3. sub_not_add is assigned 0, indicating addition, on the negative edge of the clock.
• cnt1 = 2:
1. add1 is assigned the value of address1 on the positive edge of the clock.
2. The rd signal is disabled on the positive edge of the clock.
3. The wr signal is enabled on the positive edge of the clock.
• cnt1 = 3:
1. On the positive edge of the clock, the loop termination condition for i is checked first. If it is not a terminal condition, regi is increased by 1. Otherwise, the condition for k is checked; if it is not a terminal condition, i is set to 0 and k is increased by 1. Otherwise, k1 is checked for its terminal condition. If it is not terminal, k1 is increased by 1, and k and i are set to zero. Otherwise, control enters phase 2.
The timing diagram for phase 1 is shown in Fig. 3.9.

Figure 3.9: Timing diagram for phase 1
(Diagram not reproduced. Signals shown: clock, regi, cnt1, rd, wr and sub_not_add.)
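Timing Analysis for Phase 1
Summing over all iterations of j for $N = 2^n$ points, the total number of additions in phase 1 is
\[ \begin{aligned} A = {} & 2^0 \cdot \{1\} \\ & + 2^1 \cdot \{1 + (3/2)\} \\ & + 2^2 \cdot \{1 + (3/2) + (3/2)^2\} \\ & + \cdots \\ & + 2^{n-2} \cdot \{1 + (3/2) + (3/2)^2 + \cdots + (3/2)^{n-2}\} \end{aligned} \]
Therefore,
\[ A = \sum_{i=0}^{n-2} 2^i \cdot \sum_{j=0}^{i} (3/2)^j \tag{3.2} \]
Or,
\[ A = \{(3^n + 1)/2\} - 2^n \tag{3.3} \]
Total number of cycles $T_1 = 4 \times A = 2 \cdot (3^n + 1) - 2^{n+2}$.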
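The closed form for A can be checked by counting the additions pass by pass. The function name `phase1_additions` is an assumption for this sketch; it assumes, as in the 16-point example of section 3.1.2, that a Toeplitz product of $2^m$ points needs m passes and that the number of vectors triples after each pass.

```c
/* Counts phase 1 additions over all iterations of j for N = 2^n points. */
long phase1_additions(int n)
{
    long total = 0;
    for (int m = n - 1; m >= 1; m--) {       /* Toeplitz product of 2^m points */
        long vectors = 1;
        long half_length = 1L << (m - 1);
        for (int pass = 0; pass < m; pass++) {
            total += vectors * half_length;  /* one addition per result point */
            vectors *= 3;
            half_length /= 2;
        }
    }
    return total;
}
```

For n = 4 the passes contribute (4 + 6 + 9) + (2 + 3) + 1 = 25 additions, which agrees with (3^4 + 1)/2 - 2^4 = 25 and gives 100 clock cycles.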
CHAPTER 3. SINGLE MEMORY IMPLEMENTATION
3.3.3 Phase 2
In this phase, the Toeplitz operands created in phase 1 are multiplied with their respective h coefficients. The mod-4 counter previously used can be re-used here to execute the following C code:
for (i = halfr; i < address1; i++)
{   dbuff[i] = dbuff[i]*h2[address+i-halfr];
}
address = address + address1-halfr;
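Wrapped as a function, the same loop can be exercised on a small buffer. The name `phase2` and the choice to return the updated coefficient offset are assumptions for this sketch; the indexing is the one used in the code above.

```c
/* Phase 2 for one iteration of the j loop: multiplies dbuff[halfr .. address1-1]
 * by consecutive entries of h2 starting at h2[address].
 * Returns the offset into h2 to be used by the next iteration. */
int phase2(int dbuff[], const int h2[], int halfr, int address1, int address)
{
    for (int i = halfr; i < address1; i++)
        dbuff[i] = dbuff[i] * h2[address + i - halfr];
    return address + address1 - halfr;
}
```

Returning the advanced offset mirrors the final assignment to address, so successive Toeplitz products consume consecutive runs of the precomputed coefficients.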
The multiplication operands are fetched and stored in internal registers temp3 and temp4. We are assuming an array multiplier to perform the multiplication. Register regi is updated every fourth cycle of the clock. The sequence of operations is controlled by register cnt1 as follows:
• cntl = 0: i
1. addl is assigned the value of regi on the positive edge of the clock.
2. rd is enabled on the positive edge of the clock.
3. and wr is disabled on the positive edge of the clock.
4. ternp3 is assigned the value stored at dbuJj(T'egi} on the negative edge of
the clock.
• cntl = 1 :
1. addl is assigned the value of address + regi -,- halfr on the positive edge
of clock.
54
3,3. HARDWARE IMPLEMENTA1'IONAND TIMING ANALYSIS
2, ternp4 is assigned the value of h2[address + regi - halfr] on the negative
edge of clock.
• cntl = 2 :
1. add1 is assigned the value of regi on the positive edge of the clock.
2, rd is disabled on the positive edge of the clock.
3. and 'WT is enabled on the positive edge of the clock.
• cntl = 3 :
1. On the positive edge of the clock loop termination condition for regi is
checked, it condition not met, reg'i is increased by 1. Otherwise control
enters phase 3, and 'Wr is disabled.
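The four-state sequence above can be mimicked in software to see where the four cycles per multiplication come from. The following is a minimal sketch, not the thesis implementation: the function name and its cycle-count return value are assumptions, while dbuff, h2, halfr, address, address1, temp3 and temp4 follow the names used in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Software model of the phase 2 controller (illustrative only).
 * Each pass through the switch stands for one clock cycle. */
unsigned long phase2_model(int *dbuff, const int *h2,
                           size_t halfr, size_t address1, size_t address)
{
    unsigned long cycles = 0;
    unsigned cnt1 = 0;
    size_t regi = halfr;
    int temp3 = 0, temp4 = 0;

    assert(halfr <= address1);
    while (regi < address1) {
        switch (cnt1) {
        case 0: temp3 = dbuff[regi]; break;                 /* read operand */
        case 1: temp4 = h2[address + regi - halfr]; break;  /* read coefficient */
        case 2: dbuff[regi] = temp3 * temp4; break;         /* write product */
        case 3: regi++; break;                              /* advance the loop */
        }
        cnt1 = (cnt1 + 1) & 3u;   /* 2-bit counter wraps from 3 to 0 */
        cycles++;
    }
    return cycles;
}
```

Under these assumptions the model consumes exactly four cycles per product, which is the figure used in the timing analysis of this phase.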
The timing diagram for phase 2 is shown in Fig. 3.10:
[Timing diagram: waveforms of clock, cnt1, regi, rd and wr]
Figure 3.10: Timing diagram for phase 2
Timing Analysis for Phase 2
Total number of multiplications performed in phase 2, $M_2 = 3^1 + 3^2 + \cdots + 3^{n-1}$
Or,
\[ M_2 = (3^n - 3)/2 \tag{3.4} \]
Each multiplication requires 4 clock cycles. Therefore, the total clock cycles required in
phase 2,
\[ T_2 = 4 \times M_2 = 2 \cdot (3^n - 3) \tag{3.5} \]
3.3.4 Phase 3
Phase 3 is the post-multiplication addition/subtraction stage. Functionally, it can be
considered the inverse procedure of phase 1, and its implementation is also similar to
that of phase 1. If $2^m$ points were applied to phase 1, then at the end of phase 2 we
would have $3^m$ output points. The post-multiplication operations take in these $3^m$
points and produce $2^m$ outputs. Registers address1, half_length, number_of_vectors,
and vector_base are re-used for purposes similar to those in phase 1.
It should be noted here that Leonardo did not recognize the division operation.
Therefore, we assigned values by brute force for computations that require division
by 9 and division by 3.
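One way to picture this brute-force assignment is a small lookup over the values that number_of_vectors can take, which are powers of 3 (the appendix listing compares it against constants such as 27 and 9). The sketch below is hypothetical: the function names and the supported range are assumptions chosen for illustration.

```c
#include <assert.h>

/* Divide a power of 3 by 3 without a divider. Each case is a constant
 * assignment, which maps to simple selection logic in hardware.
 * The supported range (3 to 243) is an assumption for this example. */
unsigned div3_power_of_3(unsigned v)
{
    switch (v) {
    case 243: return 81;
    case 81:  return 27;
    case 27:  return 9;
    case 9:   return 3;
    case 3:   return 1;
    default:
        assert(0 && "value is not a supported power of 3");
        return 0;
    }
}

/* Division by 9 expressed as two successive divisions by 3. */
unsigned div9_power_of_3(unsigned v)
{
    return div3_power_of_3(div3_power_of_3(v));
}
```

The trade-off is that the table must be rewritten for each convolution length, which is the limitation a dedicated division module would remove.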
The C code for phase 3 is as follows:
half_length = 1;
number_of_vectors = number_of_vectors/9;
address1 = address1 - half_length;
for (k1 = 0; k1 < logr-j; k1++)
{ for (k = number_of_vectors-1; k >= 0; k--)
  { for (i = half_length-1; i >= 0; i--)
    { temp = dbuff[vector_base+i];
      dbuff[vector_base+i]
          = dbuff[address1+i]
          - dbuff[vector_base+i+half_length];
      dbuff[vector_base+i+half_length]
          = dbuff[address1+i] - temp;
    }
  }
  address1 = address1 - half_length;
  address1 = address1 - half_length;
  number_of_vectors = number_of_vectors / 3;
}
The nested loop structure of phase 3 was implemented exactly as was done in
phase 1, with the registers k, k1 and regi. The difference is that the innermost
loop now has 3 reads and 2 writes, so a 2-bit counter is no longer sufficient.
We therefore used a 3-bit counter cnt2 for flow control, which was reset
to 0 after 5 clock cycles. The sequence of operations is as follows:
• cnt2 = 0 :
1. add1 is assigned the value of vector_base + regi on the positive edge of
the clock.
2. rd signal is enabled on the positive edge of the clock.
3. wr signal is disabled on the positive edge of the clock.
4. temp0 is assigned the value of dbuff[vector_base + regi] on the negative
edge of the clock.
5. sub_not_add is assigned 1, indicating subtraction, on the negative edge of
the clock.
• cnt2 = 1 :
1. add1 is assigned the value of address1 + regi on the positive edge of the clock.
2. temp1 is assigned the value of dbuff[address1 + regi] on the negative
edge of the clock.
• cnt2 = 2 :
1. add1 is assigned the value of vector_base + regi + half_length on the
positive edge of the clock.
2. temp2 is assigned the value of dbuff[vector_base + regi + half_length]
on the negative edge of the clock.
• cnt2 = 3 :
1. add1 is assigned the value of vector_base + regi on the positive edge of
the clock.
2. rd signal is disabled on the positive edge of the clock.
3. wr signal is enabled on the positive edge of the clock.
4. temp2 is assigned the value stored in temp0 (dbuff[vector_base + regi]) on
the negative edge of the clock.
• cnt2 = 4 :
1. add1 is assigned the value of vector_base + regi + half_length on the
positive edge of the clock.
2. The loop termination condition for i is checked first. If it is not a terminal
condition, regi is decreased by 1. Otherwise, the condition for k is checked,
and if it is not a terminal condition, i is reset to its starting value
(half_length - 1) and k is decreased by 1. Otherwise, k1 is checked for its
terminal condition. If it is not terminal, k1 is increased, and k and i are
reset to their starting values. Otherwise, control enters phase 4.
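To make the "3 reads and 2 writes" concrete, the sketch below models one pass of the innermost phase 3 loop over a single pair of half-vectors, charging five cycles per index. It is an illustration under assumptions: the function name and the cycle counter are hypothetical, while dbuff, vector_base, address1, half_length and temp0 to temp2 follow the names used above.

```c
#include <assert.h>
#include <stddef.h>

/* One pass of the phase 3 inner loop, written as the five-state sequence
 * described in the text. Returns the number of clock cycles consumed.
 * Illustrative only; not the thesis implementation. */
unsigned long phase3_vector_model(int *dbuff, size_t vector_base,
                                  size_t address1, size_t half_length)
{
    unsigned long cycles = 0;
    assert(half_length > 0);
    for (size_t i = half_length; i-- > 0; ) {   /* i = half_length-1 .. 0 */
        int temp0 = dbuff[vector_base + i];                   /* cnt2 = 0: read  */
        int temp1 = dbuff[address1 + i];                      /* cnt2 = 1: read  */
        int temp2 = dbuff[vector_base + i + half_length];     /* cnt2 = 2: read  */
        dbuff[vector_base + i] = temp1 - temp2;               /* cnt2 = 3: write */
        dbuff[vector_base + i + half_length] = temp1 - temp0; /* cnt2 = 4: write */
        cycles += 5;
    }
    return cycles;
}
```

Holding the first operand in temp0 is what allows both results to be written back in place without a second read of the same location.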
The timing diagram for phase 3 is shown in Fig. 3.11:
[Timing diagram: waveforms of clock, cnt2, regi, rd and wr]
Figure 3.11: Timing diagram for phase 3
Timing Analysis for Phase 3
If $2^n$ points are applied to phase 1, then the total number of clock cycles required for
phase 3 is $T_3 = 5 \times A$ (where A is the total number of additions in phase 1). Or,
\[ T_3 = (5/2) \times (3^n + 1) - 5 \times 2^n \tag{3.6} \]
3.3.5 Phase 4
In phase 4, all the calculated DFT coefficients are stored in the dft array. This
requires only read and write cycles, and this sequence of actions can be controlled
by a 1-bit counter cnt3. The C code we used for this is a fairly simple loop.
for (i = halfr; i < 2*halfr; i++) { dft[i] = dbuff[i]; }
halfr = halfr/2;
}
In hardware, it translates to the following:
• cnt3 = 0 :
1. add1 is assigned the value of regi on the positive edge of the clock.
2. rd is enabled on the positive edge of the clock.
3. wr is disabled on the positive edge of the clock.
4. temp5 is assigned the value of dbuff[regi] on the negative edge of the clock.
• cnt3 = 1 :
1. add1 is assigned the value of regi on the positive edge of the clock.
2. rd is disabled on the positive edge of the clock.
3. wr is enabled on the positive edge of the clock.
The timing diagram for phase 4 is shown in Fig. 3.12:
[Timing diagram: waveforms of cnt3, rd and wr over cycles 0 to 5]
Figure 3.12: Timing diagram for phase 4
Timing Analysis for Phase 4
If $2^n$ points are applied to phase 1, then total cycles required in phase 4,
Or,
(3.7)
3.3.6 Phase 5
Phase 5 simply calculates the first 2 DFT coefficients and stores them in the
dft array. The C code used was:
dft[0] = (dbuff[0] + dbuff[1]) * h2[0];
dft[1] = (dbuff[0] - dbuff[1]) * h2[1];
We require 6 cycles to compute this phase. These 6 cycles were implemented with
a 3-bit control register as follows:
• cnt2 = 0 :
1. add1 is assigned 0 on the positive edge of the clock.
2. rd is enabled on the positive edge of the clock.
3. wr is disabled on the positive edge of the clock.
4. temp1 is assigned the value of dbuff[0] on the negative edge of the clock.
5. sub_not_add is assigned 0, indicating addition, on the negative edge of the clock.
• cnt2 = 1 :
1. add1 is assigned 1 on the positive edge of the clock.
2. temp2 is assigned the value of dbuff[1] on the negative edge of the clock.
• cnt2 = 2 :
1. add1 is assigned 0 on the positive edge of the clock.
2. temp3 is assigned the value of sum on the negative edge of the clock.
3. temp4 is assigned the value of h2[0] on the negative edge of the clock.
• cnt2 = 3 :
1. add1 is assigned 0 on the positive edge of the clock.
2. rd is disabled on the positive edge of the clock.
3. wr is enabled on the positive edge of the clock.
4. sub_not_add is assigned 1, indicating subtraction, on the negative edge of
the clock.
• cnt2 = 4 :
1. add1 is assigned 1 on the positive edge of the clock.
2. rd is enabled on the positive edge of the clock.
3. wr is disabled on the positive edge of the clock.
4. temp3 is assigned the value of sum on the negative edge of the clock.
5. temp4 is assigned the value of h2[1] on the negative edge of the clock.
• cnt2 = 5 :
1. add1 is assigned 1 on the positive edge of the clock.
2. rd is disabled on the positive edge of the clock.
3. wr is enabled on the positive edge of the clock.
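The six control states can be read as a straight-line schedule. The sketch below annotates each C statement with the state in which the corresponding register transfer happens. The function wrapper and the cycle counter are assumptions made for illustration; temp1 to temp4, dbuff, h2 and dft follow the names in the text.

```c
#include <assert.h>

/* Phase 5 as straight-line C with the six control states annotated.
 * Returns the number of clock cycles modeled. Illustrative only. */
unsigned phase5_model(const int *dbuff, const int *h2, int *dft)
{
    unsigned cycles = 0;
    int temp1, temp2, temp3, temp4;

    assert(dbuff && h2 && dft);
    temp1 = dbuff[0];          cycles++;  /* cnt2 = 0: read dbuff[0]           */
    temp2 = dbuff[1];          cycles++;  /* cnt2 = 1: read dbuff[1]           */
    temp3 = temp1 + temp2;                /* cnt2 = 2: latch the sum and ...   */
    temp4 = h2[0];             cycles++;  /*           read h2[0]              */
    dft[0] = temp3 * temp4;    cycles++;  /* cnt2 = 3: write first product     */
    temp3 = temp1 - temp2;                /* cnt2 = 4: latch the difference... */
    temp4 = h2[1];             cycles++;  /*           and read h2[1]          */
    dft[1] = temp3 * temp4;    cycles++;  /* cnt2 = 5: write second product    */
    return cycles;
}
```

The adder is reused for both the sum and the difference by switching sub_not_add between states 2 and 4, which is why a single adder and a single multiplier suffice here.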
The timing diagram for phase 5 is shown in Fig. 3.13:
[Timing diagram: waveforms of cnt2 and rd over cycles 0 to 5]
Figure 3.13: Timing diagram for phase 5
Time Complexity
The time complexity of phase 5 is constant at 6 cycles.
3.3.7 Phase 6
In phase 6, the inverse transform of the DFT coefficients was calculated from the
results stored in the dft array. The C code that was used for this purpose was:
halfr = 1;
for (j = 0; j < logr; j++)
{ for (i = 0; i < halfr; i++)
  { temp = dft[i+halfr];
    dft[i+halfr] = (dft[i] + temp);
    dft[i] = (dft[i] - temp);
  }
  halfr = halfr*2;
}
/* inverse DFT */
This process is quite similar to the operations performed in phase 0. The
sequence of operations can also be controlled by control register cnt1 in a similar
manner. The Verilog implementation of this loop, in terms of the value of cnt1, is as
follows:
• cnt1 = 0 :
1. add1 is assigned the value of regi + halfr on the positive edge of the
clock.
2. rd is enabled on the positive edge of the clock.
3. wr is disabled on the positive edge of the clock.
4. temp2 is assigned the value of dft[i + halfr] on the negative edge of the
clock.
5. sub_not_add is assigned 0, indicating addition, on the negative edge of the clock.
• cnt1 = 1 :
1. add1 is assigned the value of regi on the positive edge of the clock.
2. temp1 is assigned the value of dft[regi] on the negative edge of the clock.
• cnt1 = 2 :
1. add1 is assigned the value of regi + halfr on the positive edge of the
clock.
2. rd is disabled on the positive edge of the clock.
3. wr is enabled on the positive edge of the clock.
4. sub_not_add is assigned 1, indicating subtraction, on the negative edge of
the clock.
• cnt1 = 3 :
1. On the positive edge of the clock, add1 is assigned the value of regi. The loop
termination condition for regi is checked; if it is not met, regi is increased by
1. Otherwise, the loop termination condition for regj is checked; if it is not met,
regi is set to 0 and regj is increased by 1. Otherwise, the loop terminates and the
calculation is complete.
The timing diagram for phase 6 is shown in Fig. 3.14:
[Timing diagram: waveforms of cnt1, regi and wr over cycles 0 to 11]
Figure 3.14: Timing diagram for phase 6
Timing Analysis for phase 6
The total number of loop iterations for cnt1 in phase 6 is $L = 1 + 2 + 2^2 + \cdots + 2^{n-1}$.
Therefore, the total number of cycles required for phase 6 is $T_6 = L \times 4$.
Or,
\[ T_6 = 4 \times (2^n - 1) = 2^{n+2} - 4 \tag{3.8} \]
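The iteration count L can be cross-checked by walking the phase 6 loop nest and counting inner-loop iterations. The sketch below assumes logr = n and four cycles per iteration, as stated above; the function name is hypothetical.

```c
#include <assert.h>

/* Count inner-loop iterations of the phase 6 loop nest for logr = n and
 * convert them to clock cycles at 4 cycles per iteration.
 * Illustrative only. */
unsigned long phase6_cycles(unsigned logr)
{
    unsigned long iterations = 0;
    unsigned long halfr = 1;

    assert(logr >= 1 && logr < 31);
    for (unsigned j = 0; j < logr; j++) {
        for (unsigned long i = 0; i < halfr; i++)
            iterations++;          /* one butterfly per iteration */
        halfr *= 2;                /* halfr doubles after each stage */
    }
    return 4 * iterations;         /* L = 1 + 2 + ... + 2^(logr-1) */
}
```

Because L is a geometric series, the count doubles (up to a constant) whenever the length doubles, in contrast to the factor of 3 seen in phases 1 to 3.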
3.4 Overall Time Complexity
Analyzing the total number of cycles for each length, we see that as we increase
the cyclic convolution length by a factor of 2, the cycles required for phases 0, 4 and
6 increase by a factor of 2, since these phases are associated with the DFT and IDFT
computation stages. Phases 1, 2 and 3, however, increase their clock cycle requirement
by a factor of 3. Phase 5 consists of a negligible constant number of cycles.
The total clock cycles used in phases 1, 2 and 3 are much larger in number than the
cycles used in phases 0, 4 and 6. Therefore, when we take into account the total clock
cycles used to compute the cyclic convolution of a particular length, the contribution
of phases 1, 2 and 3 is far more significant than that of the others. As a result, we
can conclude that the total clock cycles used for cyclic convolution increase
approximately by a factor of 3 as the length increases by a factor of 2.
Table 3.1 shows the timing performance of the architecture. Note that the clock
period is independent of the convolution length, and for the 0.5 μ ASIC library it is 20
nsecs.
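The factor-of-3 trend can be read directly from the cycle counts reported in Table 3.1. The sketch below copies those counts into an array and computes the growth ratio between consecutive lengths. The function names are hypothetical. The relation time = 20 x cycles - 10 is only an observation that reproduces the tabulated times; it is not a formula stated in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Clock-cycle counts copied from Table 3.1 (lengths 4 to 1024). */
static const unsigned long table31_cycles[] = {
    50, 170, 528, 1596, 4786, 14326, 42884, 128432, 384822
};

/* Growth factor in clock cycles from entry idx to entry idx + 1,
 * i.e. when the convolution length doubles. */
double cycle_growth(size_t idx)
{
    size_t count = sizeof table31_cycles / sizeof table31_cycles[0];
    assert(idx + 1 < count);
    return (double)table31_cycles[idx + 1] / (double)table31_cycles[idx];
}

/* Time in nsecs for a 20 ns clock; the 10 ns offset is an observed
 * property of the tabulated values, not a claim from the text. */
unsigned long time_ns(unsigned long cycles)
{
    assert(cycles > 0);
    return 20ul * cycles - 10ul;
}
```

For the larger lengths the ratio settles just below 3, consistent with phases 1 to 3 dominating the total.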
Table 3.1: Time complexity of the proposed cyclic convolution architecture.
Cyclic Conv. Length    Clock Cycles    Time in nsecs
4                      50              990
8                      170             3390
16                     528             10550
32                     1596            31910
64                     4786            95710
128                    14326           286510
256                    42884           857670
512                    128432          2568630
1024                   384822          7696430
3.5 Hardware Complexity
Our system was described using the Verilog hardware description language. We used
Vsim for simulation purposes and, after the simulations showed that the design generated
accurate addresses, we used Leonardo Spectrum (Mentor Graphics) with the ASIC
SCL05 library technology to obtain synthesis results.
3.5.1 Experimentation Results
We implemented separate systems for calculating cyclic convolutions of lengths 4, 8,
16, 32, 64, 128, 256, 512 and 1024, and collected data for hardware and time
complexity. Our results show that as the length increases by a factor of 2, total time
increases by a factor of approximately 3. But for all lengths, the complexity of hardware
remains approximately constant. This can be of remarkable significance, because
theoretically, no matter how much we increase the length, the convolution can be computed
using minimal hardware, albeit with higher delays. In applications where computational
demands are extensive, but time is of relatively less importance, our approach
of computing cyclic convolution could be an extremely cost-effective method.
In our experimentation, we observed that hardware complexity was primarily
influenced by the bit widths of the internal registers. We assumed all registers to be 8
bits wide for lengths 4 through 32 and 16 bits wide for lengths 64 through 1024. The
results that were obtained are summarized in table 3.2.
Table 3.2: Hardware complexity of the proposed cyclic convolution architecture.
Cyclic Conv. Length    Number of gates
4                      5874
8                      6114
16                     6249
32                     6350
64                     15180
128                    15445
256                    15512
512                    15546
1024                   15611
We then optimized the synthesis according to the length of the convolution. We
used registers of smaller bit-widths for smaller lengths and adjusted the bit-widths
according to the requirements of larger lengths. For example, register 'halfr' used
in a length 4 Toeplitz system needed to be only 2 bits wide, whereas for length 1024
it was required to be 16 bits. In this case, the hardware complexity increased at a
constant rate, indicating that this increase was only due to the change in the size of
the internal registers, not the number of multiplications or additions involved in larger
lengths.
The hardware complexity for systems optimized according to their convolution
lengths is summarized in table 3.3.
Table 3.3: Hardware complexity for different lengths with variable-bit internal hardware.
Cyclic Conv. Length    Approx. internal register bit-width    Number of gates
4                      3                                      2809
8                      4                                      3279
16                     6                                      4760
32                     8                                      6111
64                     9                                      7067
128                    11                                     8988
256                    13                                     10898
512                    14                                     11947
1024                   15                                     12927
The alternative to our procedure is to synthesize the entire cyclic convolution
algorithm as a single combinational logic block, using as many adders and multipliers
as required. This would allow one to compute the output extremely fast, but the
drawback is that the hardware complexity increases very rapidly and,
beyond length 64, Leonardo is unable to complete the synthesis [17]. Our method can
easily synthesize lengths of 1024 or even greater.
The results of the two methods are compared in table 3.4.
Table 3.4: Comparison of the hardware complexity (number of gates) of the proposed
method with a combinational logic circuit to compute cyclic convolution.
Cyclic Conv. Length    Proposed Method    Memoryless Implementation
4                      2809               11344
8                      3279               30222
16                     4760               85825
32                     6111               250878
64                     7067               739620
128                    8988               -
256                    10898              -
512                    11947              -
1024                   12927              -
The very low hardware complexity achieved by the use of the proposed architecture
makes it a highly promising approach to using dedicated hardware for cyclic
convolutions, as well as other related DSP algorithms.
Chapter 4
Further Improvements
In this chapter we discuss some shortcomings of our procedures and suggest solutions
to overcome them. We aim to provide some insight into the real-life
constraints that we came up against during our experimentation.
4.1 Two Memory Implementation
In this thesis, we implemented the bilinear algorithm for $2^n$-point cyclic convolution.
Our architecture consisted of a single block of memory. This means that the initial
data points, the intermediate results, as well as the outputs were stored in the same
memory block. We could either read from or write to this memory in a single clock
cycle.
We took advantage of some specialized addressing modes in our procedure,
the control design of which was rather simple due to our single
memory structure. For example:
• The relationship between outputs and inputs in the network was such that
the locations of the outputs were in the bit-reversed order of the
locations of the inputs. For instance, in a length 4 network, if the inputs x0,
x1, x2 and x3 were located at 00, 01, 10 and 11, then the outputs y0, y1, y2
and y3 would be located at 00, 10, 01 and 11.
• Phase 0 and phase 6 used a butterfly network. The intermediate storage
locations for these phases used a specialized addressing mode. We can illustrate
the use of this addressing mode and then generalize it to a broader context:
The origin and destination addresses of a 4-point butterfly network are
summarized in table 4.1.
Table 4.1: Origin and destination addresses in a 4-point butterfly network.
Origin (C1 C0)    Destination (C0 C1)
00                00
01                10
10                01
11                11
Here we can observe that C0 has been shifted left once, or (2 - 1) times, when
the total number of points was 4 = 2^2. For an 8-point network, the addresses
are indicated in table 4.2.
C0 has been shifted twice, or (3 - 1) times, when the total number of points
was 8 = 2^3.
Our generalized addressing mode for phase 0 was that we shifted the least
significant bit column of the value of the order of calculation n - 1 times to
get the addresses to store the results.
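Read from the tables, moving the least significant bit up by n - 1 positions while the remaining bits move down by one amounts to a one-bit rotate right of the n-bit address. The sketch below expresses that reading in C; the function name is hypothetical, and the rotate interpretation is inferred from tables 4.1 and 4.2 rather than stated in the text.

```c
#include <assert.h>

/* Destination address for an n-bit origin address: the least significant
 * bit moves to the most significant position and the other bits move
 * down by one (C2 C1 C0 -> C0 C2 C1). Illustrative only. */
unsigned butterfly_dest(unsigned origin, unsigned n)
{
    unsigned lsb;
    assert(n >= 1 && n < 32);
    assert(origin < (1u << n));
    lsb = origin & 1u;
    return (origin >> 1) | (lsb << (n - 1));
}
```

For n = 2 the rotate coincides with bit reversal, which is why the 4-point example shows the same address pattern as the bit-reversed output ordering.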
Table 4.2: Origin and destination addresses in an 8-point butterfly network.
Origin (C2 C1 C0)    Destination (C0 C2 C1)
000                  000
001                  100
010                  001
011                  101
100                  010
101                  110
110                  011
111                  111
Future work to optimize our method could focus on implementing a dual memory
structure, so that data could be fetched from one memory and results written to the
other in the same cycle. We began designing an algorithm for doing so, but
due to time constraints it has not materialized. In our initial design attempts, a new
addressing mode was required and the specialized addressing mode just described
was no longer effective. Also, there was a time lag between when the results were ready
to be written and when they could actually be written to the second memory, which posed
a problem in our design efforts.
4.2 Division Modules
In our work, we had separate modules for additions or subtractions and 16-bit
multiplications. But we did not have a separate module for calculating divisions.
As a result, in phase 3, divisions by 3 and 9 were computed manually or directly
assigned by combinational logic manipulation. In future, inclusion of an efficient
division module would simplify the software complexity in phase 3.
4.3 Exclusion of Baugh-Wooley Multiplier in synthesis
When calculating the vector_base in phase 1 and phase 3 (vector_base <= halfr +
2 × (k + 1) × half_length, vector_base <= halfr + 2 × (k - 1) × half_length), the
Leonardo synthesis tool did not recognize the fact that multiplication by half_length
was just a left-shift operation, as half_length is always a power of 2. For these
operations, the Baugh-Wooley multiplier was synthesized twice, which increased
the hardware complexity more than we expected. But since the value of half_length
changes throughout the phase, this is a variable shift, and we would have had to use a
barrel shifter had we replaced the Baugh-Wooley multiplier. In future, we hope to
optimize the logic in phase 1 and phase 3 in such a way that synthesis of both the
hardware-intensive Baugh-Wooley multiplier and the barrel shifter can be avoided.
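The equivalence between the multiplication and a variable left shift can be sketched as follows. The register log2_half_length, which would hold the exponent s with half_length = 2^s, is hypothetical and does not appear in the thesis design; both function names are likewise assumptions for illustration.

```c
#include <assert.h>

/* vector_base = halfr + 2 * (k + 1) * half_length, with half_length = 2^s.
 * Tracking s (log2_half_length, a hypothetical register) turns the
 * product into a variable left shift by s + 1. */
unsigned vector_base_shift(unsigned halfr, unsigned k, unsigned log2_half_length)
{
    assert(log2_half_length < 16);
    return halfr + ((k + 1u) << (log2_half_length + 1u));
}

/* Reference version using a general multiplication. */
unsigned vector_base_mult(unsigned halfr, unsigned k, unsigned half_length)
{
    return halfr + 2u * (k + 1u) * half_length;
}
```

The shift amount still varies at run time, so in hardware this form corresponds to the barrel shifter mentioned above rather than to fixed wiring.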
4.4 Time Optimization
Even though the complexity of hardware remained fairly constant with the increase
of the length of convolution, the total time increases by a factor of three each time
the length doubles. This may not be very efficient in time-sensitive computations.
Therefore, instead of performing one addition or subtraction and one multiplication
(when necessary) per cycle, the number of CPAs or multipliers could be increased at
the cost of increased hardware, in an attempt to optimize performance or strike a
balance between hardware and timing requirements.
4.5 Toeplitz Products in Calculations of Other
Transforms
We have seen how easily and efficiently Toeplitz multiplication can be used to
compute cyclic convolutions of lengths $2^n$. This concept can be extended to other
transforms, such as the DCT [18] and Hartley transforms [19]. If scalable and low
complexity algorithms could be developed for Discrete Cosine Transforms, it would greatly
facilitate image processing, where computational demands are increasing rapidly.
Chapter 5
Conclusion
In this thesis, we have introduced the low-complexity hardware implementation of
highly efficient algorithms for computing cyclic convolutions. The use of Toeplitz
matrix multiplications for our purposes has enabled us to perform all calculations
in the real number domain. For digital signal processing, where computational
demands are extremely high, mostly due to complex arithmetic, this can prove to
be a very cost-effective approach.
The Toeplitz product algorithm is a bilinear algorithm with a single stage of
multiplications. The rest of the arithmetic in the entire network is either additions
or subtractions. Since the number of multiplications, the primary source of intensive
computation, is significantly reduced by this algorithm, it lowers the overall
computational complexity greatly.
Moreover, the algorithm is scalable to convolutions of any length N, where
$N = 2^n$. Previously developed algorithms were suitable for transform matrices
of dimensions that were relatively prime. Some of these algorithms were appropriate
for small lengths only and, even though a few larger algorithms could be
developed by combining smaller ones, they were limited in number and therefore not
scalable to arbitrary larger lengths.
Our simulation results indicated that as we increased the length by a factor of
2 (the algorithm was simulated and synthesized using ASIC technology for n =
2, 3, 4, ..., 10), hardware complexity did not increase and remained fairly constant. For
systems that involve large volumes of computations, and where hardware reduction
is more important than time complexity, this could be a very useful technique to
adopt.
In light of the advantages we can gain from using Toeplitz products in computing
cyclic convolutions, this algorithm does seem to be a very promising effort in the
field of digital signal processing.
Appendix A
//Verilog code for length 16 cyclic convolution
module CC16(clock, reset, add1, rd, my_wr, data1);
input clock, reset;
output[7:0] add1;
output rd, my_wr;
inout[15:0] data1;
reg[7:0] halfr, regj, regi, add1, vector_base,
         address1, number_of_vectors, k, k1, half_length, address;
reg rd, wr;
reg cnt3;
reg[1:0] cnt1;
reg[2:0] phase, cnt2;
reg[15:0] temp0, temp1, temp2, temp3, temp4, temp5;
reg sub_not_add;
wire[15:0] sum, mult_result;
always @(posedge clock)
  if (reset==1)
  begin
    phase<=0;
    regj<=0;
    regi<=0;
    k<=0;
    k1<=0;
    cnt1<=0;
    cnt2<=0;
    cnt3<=0;
    rd<=1;
    wr<=1;
    halfr<=8; //halfr=r/2;
    address<=2;
  end
  //reset condition or initializations
  //phase 0 at positive edge of clock
  else if(phase==0 && regj!=3 && regi==halfr)
  begin
    phase<=1;
    address1<=halfr<<1;
    number_of_vectors<=1;
    half_length<=halfr>>1;
    vector_base<=halfr;
    cnt1<=0;
    regi<=0;
    wr<=1;
  end
  else if(phase==0 && regj!=3)
  begin
    case(cnt1)
      2'b00:begin
        add1<=regi;
        rd<=0;
        wr<=1;
      end
      2'b01:begin
        add1<=regi|halfr; //add1<=regi+halfr
      end
      2'b10:begin
        wr<=0;
        rd<=1;
        add1<=regi;
      end
      2'b11:begin
        add1<=regi|halfr;
        if(regi!=halfr)regi<=regi+1;
      end
    endcase
    cnt1<=cnt1+1;
  end
  //phase 1 at positive edge of clock
  else if(phase==1 && k1==3-regj) //change
  begin
    regi<=halfr;
    wr<=1;
    k<=0;
    k1<=0;
    cnt1<=0;
    phase<=2;
    number_of_vectors<=number_of_vectors*3;
  end
  else if(phase==1)
  begin
    case(cnt1)
      2'b00:begin
        add1<=vector_base;
        rd<=0;
        wr<=1;
      end
      2'b01:begin
        add1<=vector_base|half_length;
      end
      2'b10:begin
        rd<=1;
        wr<=0;
        add1<=address1;
      end
      2'b11:begin
        wr<=1;
        address1<=address1+1;
        vector_base<=vector_base+1;
        if(regi!=half_length-1) regi<=regi+1;
        else if(k!=number_of_vectors-1)
        begin
          regi<=0;
          k<=k+1;
        end
        else if(k1!=3-regj) //change
        begin
          regi<=0;
          k<=0;
          k1<=k1+1;
          vector_base<=halfr;
          number_of_vectors<=number_of_vectors*3;
          half_length<=half_length/2;
        end
      end
    endcase
    cnt1<=cnt1+1;
  end
  //phase 2 at positive edge of clock
  else if(phase==2 && regi==address1)
  begin
    phase<=3;
    wr<=1;
    regi<=0;
    address<=address+address1-halfr;
    half_length<=1;
    address1<=address1-1;
    //number_of_vectors<=number_of_vectors/9;
    //k<=(number_of_vectors/9)-1;
    //vector_base<=halfr+(number_of_vectors/9-1)*2;
    if(number_of_vectors==81)
    begin
      k<=8;
      vector_base<=halfr+16;
    end
    else if(number_of_vectors==27)
    begin
      k<=2;
      vector_base<=halfr+4;
    end
    else
    begin
      k<=0;
      vector_base<=halfr;
    end
  end
  else if(phase==2)
  begin
    case(cnt1)
      2'b00:begin
        add1<=regi;
        rd<=0;
        wr<=1;
      end
      2'b01:begin
        add1<=address+regi-halfr;
      end
      2'b10:begin
        add1<=regi;
        rd<=1;
        wr<=0;
      end
      2'b11:if(regi!=address1)
      begin
        regi<=regi+1;
        wr<=1;
      end
    endcase
    cnt1<=cnt1+1;
  end
  //phase 3 at positive edge of clock
  else if(phase==3 && k1==3-regj)
  begin
    phase<=4;
    wr<=1;
    regi<=halfr;
    k1<=0;
  end
  else if(phase==3)
  begin
    case(cnt2)
      3'b000:begin
        add1<=vector_base+regi;
        rd<=0;
        wr<=1;
      end
      3'b001:begin
        add1<=address1+regi;
      end
      3'b010:begin
        add1<=vector_base+regi+half_length;
      end
      3'b011:begin
        add1<=vector_base+regi;
        wr<=0;
        rd<=1;
      end
      3'b100:begin
        add1<=vector_base+regi+half_length;
        if(regi!=0)regi<=regi-1;
        else if(k>0)
        begin
          k<=k-1;
          regi<=half_length-1;
          vector_base<=halfr+(k-1)*2*half_length;
          address1<=address1-half_length;
        end
        else if(k1!=3-regj) //change
        begin
          k1<=k1+1;
          regi<=2*half_length-1;
          address1<=address1-2*half_length;
          half_length<=half_length*2;
          //k<=number_of_vectors/3-1;
          //number_of_vectors<=number_of_vectors/3;
          //vector_base<=halfr+2*(number_of_vectors/3-1)*2*half_length;
          if (number_of_vectors==9)
          begin
            number_of_vectors<=3;
            k<=2;
            vector_base<=halfr+8*half_length;
          end
          else if(number_of_vectors==3)
          begin
            number_of_vectors<=1;
            k<=0;
            vector_base<=halfr;
          end
        end
      end
    endcase
    if(cnt2==4) cnt2<=0;
    else cnt2<=cnt2+1;
  end
  //phase 4 at positive edge of clock
  else if(regi==2*halfr && regj!=3 && phase==4)
  begin
    wr<=1;
    regi<=0;
    regj<=regj+1;
    halfr<=halfr/2;
    k<=0;
    k1<=0;
    cnt1<=0;
    cnt2<=0;
    cnt3<=0;
    if (regj==2)
    begin
      phase<=5;
    end
    else phase<=0;
  end
  else if(phase==4)
  begin
    case(cnt3)
      1'b0:begin
        add1<=regi;
        wr<=1;
        rd<=0;
      end
      1'b1:begin
        add1<=regi;
        wr<=0;
        rd<=1;
        if(regi!=2*halfr)
          regi<=regi+1;
      end
    endcase
    cnt3<=cnt3+1;
  end
  //phase 5 at positive edge of clock
  else if(phase==5 && cnt2==6)
  begin
    wr<=1;
    phase<=6;
    halfr<=1;
    regi<=0;
    regj<=0;
  end
  else if(phase==5)
  begin
    case(cnt2)
      3'b000:begin //rd dbuff[0]
        add1<=0;
        rd<=0;
        wr<=1;
      end
      3'b001:begin //rd dbuff[1]
        add1<=1;
      end
      3'b010:begin //rd h2[0]
        add1<=0;
      end
      3'b011:begin //wr to dbuff[0]
        add1<=0;
        wr<=0;
        rd<=1;
      end
      3'b100:begin //rd h2[1]
        add1<=1;
        wr<=1;
        rd<=0;
      end
      3'b101:begin //wr to dbuff[1]
        add1<=1;
        wr<=0;
        rd<=1;
      end
    endcase
    if(cnt2!=6)cnt2<=cnt2+1;
  end
  //phase 6 at positive edge of clock
  else if(phase==6 && regj==4)
  begin
    phase<=7;
    wr<=1;
  end
  else if(phase==6)
  begin
    case(cnt1)
      2'b00:begin //rd dft[i+halfr]
        add1<=regi+halfr;
        rd<=0;
        wr<=1;
      end
      2'b01:begin //rd dft[i]
        rd<=0;
        add1<=regi;
      end
      2'b10:begin //wr sum in dft[i+halfr]
        add1<=regi+halfr;
        wr<=0;
        rd<=1;
      end
      2'b11:begin //wr subtraction in dft[i]
        add1<=regi;
        if(regi!=halfr-1)
          regi<=regi+1;
        else if (regj!=4)
        begin
          regj<=regj+1;
          regi<=0;
          halfr<=halfr*2;
        end
      end
    endcase
    cnt1<=cnt1+1;
  end
//phase 0 at negative edge of clock
always @(negedge clock)
  if(phase==0)
  begin
    case(cnt1)
      2'b00:begin
        temp1<=data1;
      end
      2'b01:begin
        temp2<=data1;
      end
      2'b10:begin
        sub_not_add<=1;
      end
    endcase
  end
  //phase 1 at negative edge of clock
  else if(phase==1)
  begin
    case(cnt1)
      2'b00:begin
        temp1<=data1;
      end
      2'b01:begin
        temp2<=data1;
      end
    endcase
  end
  //phase 2 at negative edge of clock
  else if(phase==2)
  begin
    case(cnt1)
      2'b00:begin
        temp3<=data1;
      end
      2'b01:begin
        temp4<=data1;
      end
    endcase
  end
  //phase 3 at negative edge of clock
  else if(phase==3)
  begin
    case(cnt2)
      3'b000:begin
        temp0<=data1;
        sub_not_add<=1;
      end
      3'b001:begin
        temp1<=data1;
      end
      3'b010:begin
        temp2<=data1;
      end
      3'b011:begin
        temp2<=temp0;
      end
    endcase
  end
  //phase 4 at negative edge of clock
  else if(phase==4)
  begin
    case(cnt3)
      1'b0:begin
        temp5<=data1;
      end
    endcase
  end
  //phase 5 at negative edge of clock
  else if(phase==5)
  begin
    case(cnt2)
      3'b000:begin
        temp1<=data1;
        sub_not_add<=0;
      end
      3'b001:begin
        temp2<=data1;
      end
      3'b010:begin
        temp3<=sum;
        temp4<=data1;
      end
      3'b011:begin
        sub_not_add<=1;
      end
      3'b100:begin
        temp3<=sum;
        temp4<=data1;
      end
    endcase
  end
  //phase 6 at negative edge of clock
  else if(phase==6)
  begin
    case(cnt1)
      2'b00:begin //second operand dft[i+halfr]
        temp2<=data1;
        sub_not_add<=0;
      end
      2'b01:begin
        temp1<=data1;
      end
      2'b10:begin
        sub_not_add<=1;
      end
    endcase
  end
assign my_wr = wr|clock;
CPA16 add0(temp1, temp2^{16{sub_not_add}}, sub_not_add, sum, c_out);
MULT16 mult0(temp3, temp4, mult_result);
endmodule
/*
module test_CC16();
reg clock, reset;
wire[7:0] add1;
wire rd, my_wr;
wire[15:0] data1;
initial
begin
  clock<=0;
  reset<=1;
end
always #10
begin
  clock<=~clock;
end
always #30 reset<=0;
//commented for synthesis
CC16 c16(clock, reset, add1, rd, my_wr, data1);
endmodule
*/
Bibliography
[1] D. P. Kolba and T. W. Parks, "A prime factor FFT algorithm using high-speed
convolution," IEEE Transactions on Acoustics, Speech, and Signal Processing,
vol. 25, August 1977.
[2] M. D. Wagh and H. Ganesh, "A new algorithm for the discrete cosine transform
of arbitrary number of points," IEEE Transactions on Computers, pp. 269-277,
April 1980.
[3] S. Wolter and R. Laur, "An RNS implementation of a length-8 DCT based on a
cyclic convolution," IEEE, 1992.
[4] Berkeley Design Technology, Inc., Microprocessors vs DSPs: Fundamentals and
Distinctions, February 2004.
[5] J. W. Cooley and J. W. Tukey, "An algorithm for the machine computation
of complex Fourier series," Math. Comput., vol. 19, pp. 297-301, April 1965.
[6] Y. Wang and K. Parhi, "Explicit Cook-Toom algorithm for linear convolution,"
IEEE, 2000.
[7] S. Winograd, "On computing the discrete Fourier transform," Math. Comput.,
pp. 175-199, January 1978.
[8] R. C. Agarwal and J. W. Cooley, "New algorithms for digital convolution,"
IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 25, pp. 392-
410, October 1977.
[9] M. D. Wagh, "Modular algorithms for cyclic convolutions of arbitrary lengths,"
March 2006.
[10] H. J. Nussbaumer, "Digital filtering using polynomial transforms," in Electron.
Lett., vol. 14, pp. 386-387, June 1977.
[11] H. J. Nussbaumer and P. Quandalle, "New algorithms for convolution and DFT
based on polynomial transforms," in IBM Journal of Research and Development,
vol. 22, pp. 134-144, March 1978.
[12] H. J. Nussbaumer, "New polynomial transform algorithms for fast DFT
computation," in Conf. Record Int. Conf. Acoust. Speech Signal Processing, pp. 510-
513, 1979.
[13] M. D. Wagh and S. D. Morgera, "Structured design methods for convolutions
over finite fields," IEEE Trans. Info. Theory, vol. 32, pp. 175-199, October 1982.
[14] M. D. Wagh and H. Ganesh, "A new algorithm for the discrete cosine transform
of arbitrary number of points," IEEE Trans. Comput., vol. C-29, pp. 269-277,
April 1980.
[15] C. M. Rader, "Discrete Fourier transforms when the number of data samples is
prime," Proc. IEEE, vol. 56, pp. 1107-1108, June 1968.
[16] V. Muddhasani and M. D. Wagh, "Bilinear algorithms for discrete cosine
transforms of prime lengths," Signal Processing, vol. 86, pp. 2393-2406, 2006.
[17] K. Brownell, "Report of independent study," tech. rep., Lehigh University,
Spring 2006.
[18] D. F. Chiper, "A new systolic array algorithm for inverse DCT with high
throughput rate," in Proceedings of the IEEE International Symposium on Industrial
Electronics, vol. 1, pp. 201-206, 1996.
[19] D. F. Chiper and V. Munteanu, "A new design approach to VLSI parallel
implementation of discrete Hartley transform," in Proceedings of the IEEE
International Symposium on Industrial Electronics, vol. 1, pp. 207-212, 1996.
