An investigation on the applicability of multi-microprocessing in the two dimensional digital filtering problem by Whitcher, Timothy J.
Rochester Institute of Technology
RIT Scholar Works
Theses Thesis/Dissertation Collections
5-1-1980
An investigation on the applicability of multi-
microprocessing in the two dimensional digital
filtering problem
Timothy J. Whitcher
Follow this and additional works at: http://scholarworks.rit.edu/theses
This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion
in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.
Recommended Citation
Whitcher, Timothy J., "An investigation on the applicability of multi-microprocessing in the two dimensional digital filtering problem"
(1980). Thesis. Rochester Institute of Technology. Accessed from
AN INVESTIGATION ON THE APPL ICABIL ITY
OF MULTI -MI CROPROCESSING INTHE TWO
DIMENSIONAL DIG ITAL FILTERI NG PROBLEM
by
Timothy J . Wh itcher
A Thesi s Submitted
i n
Partia l Fulf i llment
,
of t he
Requ irements for the Degree of
MAS TER OF SCIENCE
in
Electrica l Eng ineer ing
App roved by :
(Thes i.. Advisor )Prof. R==-.::;==.,----- - - --
Prof. _
Prof.
Prof.
(Depar tment Head)
DEPARTMENT OF ELECTR ICAL ENGINEER ING
COL LEGE OF ENG INE ERING
ROCHES TER INST ITUTE OF TECHNOLOGY
ROCHESTER, NEWYORK
I1AY , 1980
ABSTRACT
Digital image processing has been receiving an increasing amount
of development in recent years, largely because high speed digital com
puters are becoming readily available. In addition, the advent of the
microprocessor has revolutionized the capabilities of compact electronic
systems.
This thesis examines the applicability of microprocessors to
digital image processing. Using a Z8000 microprocessor as a baseline,
the computation time for a two-dimensional fast Fourier transform is
estimated for various microprocessor architectures. These results are
later compared to the manufacturer's specified computation times on a
Floating Point. Systems AP-120B Array Processor.
TABLE OF CONTENTS
Page
I. List of Tables iv
II. List of Figures vi
III. List of Terms vi i
IV. List of Programs ix
V. Introduction 1
VI. Section 1 - Historical Review of the Fast Fourier Transform ... 6
VII. Section 2 - An Overview of the Theory of One and Two Dimensional
Fourier Transforms, Continuous and Discrete 9
2.1 The Continuous Fourier Transform in One and Two
Dimensions 10
2.2 The Discrete Fourier Transform and Convolution.
The One Dimensional Case 19
2.3 The Discrete Fourier Transform and Convolution.
The Two Dimensional Case 25
2.4 The One and Two Dimensional Fast Fourier
Transform 36
VIII. Section 3 - Examination of Architectures 44
3.1 Method of Examination 45
3.2 Presentation of Architecture I 49
3.3 Presentation of Architecture II 57
3.4 Presentation of Architecture III 59
3.5 Presentation of Architecture IV 63
3.6 Presentation of Architecture V 65
3.7 Presentation of Architecture VI 66
n
TABLE OF CONTENTS (Continued)
Pae
IX. Section 4 - Conclusions 74
X. Appendices 87
Appendix A - Program Listings 88
Appendix B - The 2D-FFT in Matrix Notation 134
Appendix C - FFT Error Analysis 139
Appendix D - A Comparison of Cycle Times 147
XI. References 157
XII. Bibliography 158
m
LIST OF TABLES
Section 2
Tab! e Number
2. 2-1
2. 2-2
2.3-1
2,4-1
Section 3
3.1
3.4
Title Page Number
Computational Comparison
DFT vs. Convolution 23
Computational Comparison
FFT vs. Convolution 24
Computational Comparison
2D-FFT vs. 2D Convolution 33
A Description of Bit Reversal Order . 41
Summary of Architectures 49
Essential Components, Hardware
Floating-Point Complex Mulitplier . . 2
Section 4
4.1
4.2
4.3
Architecture I - single processor . .
Architecture II - single processor,
Wj,1
look-up
Architecture III <- single processor,
feL1
look-up, hardware complex
multiplier, floating-point
78
79
80
IV
Table Number
4.4
4.5
4.6
4.7
Appendix D
D.l
D.2
D.3
D.4
D.5
D.6
D.7
D.8
D.9
Title Page Number
Architecture IV - single processor,
WN look-up table, hardware complex
multiplier, fixed-point 80
Architecture V - single processor
W.,1
look-up table, programmed multiply- 81
Architecture VI - the multiprocessor
system 81
Computation Time for Multiprocessor
Systems 82
Architecture I Machine Cycles .... 148
Architecture I Cycle Time 149
Architecture II Machine Cycles .... 150
Architecture II Cycle Time 151
Architecture III Summary 152
Architecture IV Machine Cycles .... 153
Architecture IV Cycle Time 154
Architecture V Summary 155
Architecture VI Summary 156
LIST OF FIGURES
Section 2 - An Overview of the Theory of One and Two Dimensional
Fourier Transforms, Continuous and Discrete
Figure Number Title Page Number
2.1-1
2.1-2
2.1-3
2.1-4
2.1-5
A Block Diagram of a One-Dimensional
Linear System
An Example of High Frequency Imagery
An Example of Low Frequency Imagery .
Fourier Transform of Figure 2.1-2 . .
Fourier Transform of Figure 2.1-3 . .
13
14
14
15
15
2.3-1
2.3-2
Convolution Approach (Image Processing). . 32
2-DDFT Approach (Image Processing) ... 32
2.4-1
2.4-2
Section 3 -
3.1-1
3.2-1
3,2-2
3.2-3
3.2-4
3.2-5
The Two Point DFT 39
The Eight Point DFT 40
Flow Chart Showing the Major Operation
of the 2D-DFT Computations 46
Butterfly Structures 49
Fourier Coefficients in the Z-plane- ... 50
Process Block A - Bit Reversal 51
Process Block B - A Linear FFT Algorithm 52
Process Block C - Array Transposition . . 55
vi
Figure Number Ti tl e Page Number
3.3-1 Process Block B - Architecture II 58
3.4-1 Hardware Floating Point Complex
Multiplier (Input-half) 60
3.4-2 Hardware Floating Point Complex
Multiplier (Output-half) 61
3.5-1 Fixed-Point hardware Multiplier- . 64
3.7-1 Computational Flow 68
3.7-2 Two Port Bus Control 72
Appendix C
C-l 2's Compliment Truncation 140
C-2 2's Compliment Rounding 140
C-3 4 Point FFT 141
vi i
LIST OF TERMS
LSI - large scale integration
MOS - metal oxide on silicon
DFT - discrete Fourier transform
DIT - decimation in time
FFT - fast Fourier transform
CPU - central processing unit
MMU - memory management unit
vin
LIST OF PROGRAMS
Page
A.l Base-Line Code for floating-Point 2D FFT in
Architecture I 89
A. 2 Code for the Computation of W* Coefficients
in Architecture II 116
A. 3 Base-Line Code for a Fixed-Point 2D-FFT in
Architecture IV U9
A. 4 Code for Butterfly Computation Using Fixed-
Point Software Multiply in Architecture V 130
IX
INTRODUCTION
With the advent of LSI circuits capable of n-bit multiplication and
division, specialized data processing machines are becoming more and more
prevalent in the world of computing. The advancement of large semi-con
ductor memory and bubble technology has enabled this new class of machines
to have data buffers with very low access times. One of the most frequent
applications of these data processors is the computation of the discrete
Fourier transform utilizing the fast Fourier transform algorithm. Because
of the inherent lack of drift associated with digital circuits, digital
signal processing is widely used in military applications such as radar,
sonar, and remote image sensing. The widely varying environment in which
military hardware must operate makes the drift-free characteristic of
digital circuits yery desirable. While digital computation suffers from
truncation errors and limit cycle behavior, these effects are consistent
in all environments. Consequently, the advantage of no drift enables
digital signal processing to outweigh its analog counterpart in most
applications.
Within the last decade, the one thing which has had probably the
greatest impact on the electronic world has been the microprocessor. These
inexpensive single-chip computers have evolved to the point where they
rival large scale computers in capability in several applications. In some
cases, these single chip processing units readily lend themselves to
distributed processing applications that would be cumbersome to implement
with their larger counterparts.
Despite the popularity of the microprocessor, they have not made a
serious impact in the area of data processors which are now commonly referred
to as array processors. There are several reasons for the lack of impact
by the microprocessor on the market of array processors. Most microprocessors
have a one-byte word size and consequently, computations requiring 32 bits
or greater precision are awkward to implement. Recently, sixteen-bit
microprocessors have arrived on the market, but these still lack the precision
necessary for many applications. Multiplication and division which are
two operations used frequently in signal processing algorithms can only be
implemented on most microprocessors through long sequential add and shift
routines. When such routines are used, the computation time of signal
processing algorithms become exceedingly long. Most of the new sixteen
bit mircroprocessors have special hardware which can shorten the computation
time needed for multiplication and division, but these are no match for
the speed presented by the array processor, largely because of the word
size required in most signal processing applications. Microprocessors
have yet another drawback which is responsible for their lack of popularity
in applications that require frequent multiplication and division. Because
most microprocessors use MOS technology, the clock rates used by a micropro
cessor are lower than that of an array processor which features bipolar
technology in its ALU. Thus, it becomes evident why microprocessors have
not made an impact in the area of array processing.
Despite the many drawbacks of the microprocessor, it cannot be ruled
out in all signal processing applications. A microprocessor represents a very
sophisticated building block for a system. The microprocessor has a
powerful instruction set that enables the system designer to build apparatus
capable of performing complex functions with a minimum amount of hardware.
If the volume of the product (that is the number to be produced) is small,
the programmability of the microprocessor enables changes in design to be
implemented with ease in comparison to a design change solely in hardware.
Several LSI circuits are designed to be microprocessor compatible. Included
in these are high speed expandable multipliers and a variety of memory
devices. In addition, most manufacturers of microprocessors offer a wide
variety of supportive services. It is these features that make it attractive
to use a microprocessor in situations where it is only marginally applicable
or where it may even be an overkill in the design.
Some areas of digital signal processing do not require very precise
arithmetic. One such problem is in the area of digital image processing.
If a quantized image is to be characterized by its spatial frequency
distribution, one may perform a low precision two-dimensional discrete
Fourier transform (DFT) to perform such a characterization. Simple
two-
dimensional pattern recognition can be performed utilizing low precision
DFT's along with a two-dimensional filter function. It is for
this kind
of problem, where only basic image characteristics are sought, that the
microprocessor may be applicable.
If a microprocessor is to perform image filtering or pattern recognition
in the spatial frequency domain, it must perform two primary tasks. The
most straight forward of the two is the multiplication of each quantized
element of an image's two-dimensional discrete Fourier transform by a
corresponding element of the two-dimensional filter function. The other
task is to process the array of picture elements or pixels that compose
the image and calculate the 2-D discrete Fourier transform of that image
through the use of the fast Fourier transform algorithm. In the problem
of digital filtering the inverse two-dimensional discrete Fourier transform
must be calculated, however, the inverse transform is essentially the same
as the forward transform with the exception of some changes in sign.
The calculation of the two-dimensional DFT is the most cumbersome task.
In this thesis, the computation of the two-dimensional discrete Fourier
transform by a microprocessor will be examined. Consideration will be
constrained to the decimation-in-time (DIT) version of the FFT algorithm.
The computational flow path will be extensively flow charted. Through
the use of flow charts and an example program, the computation time for
a two-dimensional discrete Fourier transform will be determined. In addition,
alternate methods of executing the FFT algorithm will be examined with the
goal of gaining a better understanding of the feasibility of using the
microprocessor to perform a low precision DFT.
SECTION 1
HISTORICAL REVIEW OF THE
FAST FOURIER TRANSFORM
The fast Fourier transform (FFT) algorithm is a method of computing
the discrete Fourier transform (DFT) of a series of N complex data points
in approximately N log2 N operations. When Cooley and Turkey [1 ] described
the algorithm in 1965, it was thought to be new by many people. Until that
time most people believed discrete Fourier analysis to be a process requiring
2
N operations.
In 1942 G.C. Danielson and C. Lanczos [2 ] published a paper that
described a method of Fourier analysis that is used in X-ray scattering
problems. This method had the number of computational operations proportional
to N log2 N.
The Danielson and Lanczos paper refers to two papers by C. Runge [3]
published in 1903 and 1905 as the source of the computational method. A
set of lecture notes of C. Runge and H. Konig [5 ] published in 1924 described
how one could use the periodicity of the sine and cosine functions to obtain
a procedure for taking a series of length N where N is a power of 2 and
splitting this series into log2 N subseries. Then using a doubling algorithm
a 2N point Fourier analysis is obtained from two N point analyses in N
operations. Successive applications of this doubling algorithm led to
the computation of a discrete Fourier transform in approximately N log2 N
operation.
The algorithm of Cooley and Turkey is more general in the sense that
methods are described to handle cases where N is a composite number not
necessarily a power of two.
L.H. Thomas [6 ] published a method of computation similar to that
of Cooley and Turkey in 1963. In this method, the factors of N must
be mutually prime.
Thus, while Cooley and Turkey receive much of the acclaim, there
were others who had previously developed the techniques utilized in the
fast Fourier transform.! 7 J
SECTION 2
AN OVERVIEW OF THE
THEORY OF ONE AND TWO
DIMENSIONAL FOURIER TRANSFORMS,
CONTINUOUS AND DISCRETE
10
2,1
The Continuous Fourier
Transform in One and
Two Dimensions
11
The Fourier transform is a frequently used mathematical tool that
transforms a variable (or variables) of one space to a different variable
(or variables) of another space. Such transformations are made generally
to enable a computation to be performed in a more convenient fashion or
to present information in a more useful manner.
In the field of signal processing, analog signals are converted from
the time domain to the frequency domain by the Fourier transform. The
Fourier transform of a signal s(t) can be given by
oo
S(f) =
/s(t)e-j2^ft
dt (2J.1)
- 00
Generally (2.1.1) will not be convergent. In many engineering applications
the Fourier transform is used with periodic signals where the real time
signal s(t) is composed of a finite number of known frequencies. The
transform may then be rewritten as
T
2
3(nf0) =f f s(t)e-J'2V dt (2.1.2)J
T
2
where: n is an index identifying the discrete frequencies that
make up s(t)
f is the lowest of these frequencies
o
T is the period of s(t)
t is time
(3 is the amplitude of the component at frequency nf
12
This form has the inverse transform
s(t) = B(nfo)
eJ2ft
{2]3)
n=- 0
Equations (2) and (3) are sometimes referred to as the exponential
form of the Fourier series.
If a signal is bandlimited but not necessarily periodic, the
form of (1) will converge and in such a case S(f) will be called the
spectrum of s(t). This leads the following one-dimensional transform
pair for continuous signals.
00
S(f) = / s(t)e"j27Tftdt
00
(2.1.4)
00
s(t) = / S(f)ef^ft df
00
The above is a one-dimensional transform pair. The Fourier transform
exists in multi -dimensional forms also. Here we are interested in the
two-dimensional form. The two-dimensional transform pair is as follows:
00
S(u,v) = ff s(x,y) e"j(ux+ vy) dx dy
00
(2.1.5)
00
s(x,y) =
1 ff S(u,v)
ej(ux
+ v^ dudv
4/ JJ
13
A sufficient condition for the existence of this transform pair
is that s(x,y) be absolutely integrable.
// S(x,y)|dx dy < (2.1.6)
In a one-dimensional application we are often interested in
the notion of filtering a signal with respect to frequency. This is often
represented in block diagram form as
s(t)
S(f)
q(t)
Q(f)
A Block Diagram of a One
Dimensional Linear System
Fig. 2.1-1
where H(f) is some filtering function and
Q(f) = H(f) S(f) (2-1-7)
This concept can also be extended to the two-dimensional case. Here
we would have
Q(u,v) = H(u,v) S(u,v) (2.1.8)
where Q(u,v) is a filtered output. H(u,v) is the two-dimensional filter
function, and S(u,v) is the input spectrum.
Just what is S(u,v)? One can consider s(x,y) to be an image function.
The value of s(x ,y ) at the point (x ,y ) may be considered the brightness
14
of the image at that point. Thus, the entire function s(x,y) may describe
a picture over the coordinates x andy. The function S(u,v) represents
the spatial frequency spectrum of s(x,y). The following figures serve
to clarify the concept of spatial frequency.
s-,(x,y)
An Example of
High Frequency
Imagery
Fig. 2.1-2
-> T, <-
s2(x,y)
An Example of
Low Frequency
Imagery
Fig. 2.1-3
In Figure 2, s-,(x,y) is said to be more highly modulated than s2(x,y)
of Figure 3. Consequently, s-,(x,y) has a greater content of higher spatial
frequencies than s2(x,y). Therefore, S-^yjv) and.S2(y,v) would look as
follows.
15
S-|(y,v)
Deviation from the y,v
plane can be represented
by increased brightness.
Fourier Transform of Fig. 2.1-2
Fin. 2.1-4
S2(y,v)
Here the magnitude falls
off faster indicating less
high freauency content.
Fourier Transform of Fig. 2.1-3
Fig. 2.1-5
When one understands the concept of image frequency, one can
readily comprehend how an image can be filtered. If an image is low
pass filtered, all sharp edges would be obscurred and only the largest,
most unmodulated features would be discernible. If an image is high pass
filtered, edges would be visible but not the broad unmodulated features.
The Fourier transform is a linear transformation in both the one and
multi -dimensional forms. This leads to a variety of properties, some
of which are given here. In each case, F is the Fourier transform of f ,
and F is the operation.
16
One Dimensional
a F(w) = F[af(x)]
F(w)+F2(w) = F[f1(x)+f2(x)]
Two Dimensional
a F(wx, wy) = F[af(x,y)]
Ft(wx.w ) + F2(wx,w )
(2.1.9)
(2.1.10)
FLf^x.y) + f2(x,y)]
1
F() = F[f(ax)]
a
'a1
"snF(T"
' ^ = FL"f<ax>b^ (2.1-n)
-jwx
e F(w) = F[f(x-xJ]
j(wxa + wyb)
F(wx,wy) = (2.1.12)
F[f(x-a, y-b)]
jw x
F(w - wn) =fie f(x)] [F(wx - ax, wy - by)]
j(a x + by)
F[e x ^ f(x,y)]
(2.1.13)
(jw)n
F(w) = F[^-f(x)]dx, -jwx
F(wx,w ) = F[^ f (x,y)] (2.1.14)
"jwy F(wx,wy)
= F[^f(x,y)]
F(-w) = F[f(-x)] F(-wx,-wy) = F[f(-x,-y)] (2.1.15)
F(w) 6(w) = F[f(x) *g(x)] F(wx,wy) G(wx,wy) (.2.1.16)
F[f(x,y) * g(x,y)]
17
where *- is the symbol for convolution which is
f(x) * g(x) = / f(x) g(x-x) dx ]
f(x,y ) *-g(x,y) =f f f(a,B) g(x-a, y-B) dadB
The two dimensional Fourier transform property of separability is shown
below
r r -j(xu + yv)
S(u,v) = J J <s(x,y) e dx dy
(2.1.18)
/ g(x,v) e_jxydx
= S(y,v)
Thus, the two dimensional Fourier transform can be separated into the
sequential application of two one-dimensional transforms. As a further
condition, if the image function is separable
-*(x,y)
= 4x(x) 4y(y) (2.1.19)
18
then
S(y,v) = J J 4(x,y) e"j(yx "y^ dxdy
-to -TOO
LL \(x)Vy) e-jVXe-J> dxdy
/ .(x) d)< /^(y)e-JVVdyX
-oo
(2.1.20)
= S (y) S (v)
y v
If the above is true, the complete image transform requires only the
computation of two one-dimensional transforms and their product. This
is different from the former case of sequential application of the
transform where a function of two variables was transformed.
19
2,2
The Discrete Fourier
Transform and Convolution
The one Dimensional Case
20
In modern day signal processing, computation is done digitally.
This requires that a signal be sampled in time and its value at that
time be represented by a digital word. In addition, the signal must
be bandlimited and the sampling rate must be equal to or greater than
the Nyquist rate in order for the signal to be reconstructed without
distortion. The Nyquist rate is twice the highest signal frequency.
If a continuous signal is sampled every T seconds, the sampled
signal can be represented by x(nT), n = 0,1,2... These sampled
values describe the sequence x(n). The sequence x(n) is of finite
length when one deals with real signals. The sequence x(n) is
of length N and it can be used to construct a periodic sequence
x(n) with a period of N. The discrete Fourier series of x(n) is
X(k) which is also periodic in N. This leads to the Discrete
Fourier Transform of x(n) given by
N-l
X(k)= x(n)
WNkn
(2.2.1)
n=0
where W., = e
J N
The Inverse Discrete Fourier Transform is given by
N-l
x(n) = x Y. *(k) Vkn (2.2,2)
k=0
It is important to note that both x(n) and X(K) are periodic in
N. This periodicity is especially important when one desires to do
21
digital filtering. Usually linear filtering is desired. Let H(K)
be a one dimensional linear filtering function and h(n) be its corres
ponding impulse response sequence. (Note that H(K) is the Z-transform,
H(z), of the sequence h(n) evaluated at specific points along the
unit circle in the z-plane.) If one performs linear convolution on
x(n), a sequence of length N, by h(m), a sequence of length M, the
result is y(n) a sequence of length N+M where y(n) is given by
N+M-l
y(n) = V x(m) h(n-m) (2.2.3)
m = 0
The sequence y(n) can also be obtained through the use of the discrete
Fourier transform. However, if we are dealing with the discrete Fourier
series, it must be emphasized that this series must be obtained from a
periodic sequence. Convolution can also be performed with periodic
sequences. This is called cyclic convolution. To perform cyclic
convolution, h(m) must become fi(m) and x(n) must become x(n). Assume
that N > M. Then zeroes must be appended to h(m) so that fi(m) and
x(n) both have a period of N. However, cyclic convolution of ff(m) and x(n)
leads to y,(n) which also has a period of N. This operation is given by
N-l
y,(n) = V x(m) fi(n-m) (2.2.4)
m=0
This is not the same sequence as y(n) which is of length M + N.
In order for the cyclic convolution to yield the sequence y(n),
zeroes must be appended to both h(m) and x(n) in such a way that
fi(m) and x(n) are both N+M long. Then the sequence formed by one
22
cycle of the cyclic convolution of x(n) and fi(m) will be the same
as y(n) formed by linear convolution of x(n) and h(m).
Convolution is a particular type of linear transformation. The
discrete Fourier transform is another type of linear transformation.
It is possible to perform in a single linear transformation, namely
convolution, the same operation performed by the successive applications
of several other linear transformations such as those utilizing the
DFT. The following discussion will eventually show that there are
conditions where the successive application of three linear transformations
namely a discrete Fourier transform, a simple filtering function and
an inverse discrete Fourier transform, will lead to fewer computations
than when performing the same operation in one consolidated linear
transform, namely convolution.
Using the discrete Fourier series, we can calculate y(n) through
the following steps. First x(n) and R(n) must be formed each having
a period of N + M. Then their DFS must be calculated using the DFT.
Next the DFS of y(n) is formed through the relation
Y(K) = H(K) Y(K) (2.2.5)
Taking the inverse discrete Fourier transform (IDFT) of Y(K) will
lead to y(n), one cycle of which would be equal to y(n) because the
periods of the sequences x(n) and R(n) were both expanded to N + M.
Thus, dimensionality is an important consideration when implementing
linear filtering with the discrete Fourier transform.
23
The number of computations needed to perform a particular trans
formation can be compared in a table. Let x(n) be of period N and
let h, (n), of period U, be the weighting function used in cyclic
convolution. Also let H2(k) be the function used to operate on the
DFS of x(n) which is X(k) where X (k) is of period N and H2(K) is of
period U.
Table 2.2-1, Computational Comparison DFT vs. Convolution
No. of Computations Type of Operation
k_NU convolution of x(n) with h\(n)
N2
finding DFS of x(n)
kdNU filtering X( k) with H2(k)
U2
finding the IDFT of Y(k)
The constants k k, are of a value that is dependent on the
number of zeroes in the sequences fi,(n) and H2(k), respectively.
For the DFT approach to involve fewer computations than convolution,
kd<kc-"LT-T (2.2.6)
If the fast Fourier transform algorithm is used to compute the
DFT and the IDFT, then the following comparison can be made.
24
Table 2.2-2 Computational Comparison, FFT vs. Convolution
No. of Computations Type of Operation
k NU convolution of x(n) with h-,(n)
N log2N finding the DFS of x(n)
kdNU filtering X(k) with H2(k)
U log2 U finding the IDFS of Y(k)
This leads to the relation
log2 N log2 U
kd * kc D (2.2.7)
as the condition that must be satisfied to enable the DFT approach
to have fewer computations than the convolution approach. The values
of k. and k are determined by the filtering functions selected [fi,(n)
and H'(k)]. In this thesis we are specifically dealing with the
implementation of the DFT approach.
25
2.3
The Discrete Fourier
Transform and Convolution
The Two Dimensional Case
26
An image, like a time varying, signal can be sampled. In addition,
the concept of Nyquist-rate can be extended to sampled images. If
it is desired to have an image resolution of Q lines per millimeter along a
given dimension, then there must be 2Q light to dark transitions per
millimeter. If the resolution is to be the same in both the x and y
dimensions, then the method chosen for image quantization must provide
for a total of 2Q light and dark regions along both the x and y axis.
This leads to an image sampling rate of 2Q samples per millimeter per
dimension which is analogous to the Nyquist rate. In an image the
sampling rate corresponds to the size of the picture element or
pixel. If an image is subdivided into many small squares, each of
these squares is a pixel. The average brightness of each pixel is the
digitized value that is used in digital image processing. The greater
the number of pixels in an image, the higher the resolution of the
quantized image, and the higher the image sample rate. It is the
values of the pixels that provide the input to the two-dimensional
discrete Fourier transform.
As was the case in the one-dimensional discrete Fourier transform,
the two-dimensional discrete Fourier transform deals with periodic
functions. Let p(n,,n2) represent the individual pixel values of an
image. Then we can envision a quantized image to be an array or matrix
with each element in the array being a pixel. This can be represented
mathematically as
p(nrn2) (2,3.1)
27
where P represents the array. To develop a two-dimensional discrete
Fourier transform, a periodic two-dimensional array
P P(n-,,n2) (2.3.2)
is constructed from the image array P. The two-dimensional discrete
Fourier transform relations can now be given by the following.
N-l N-l
P(krk2) = x Z X P(nTn2)
k,n,+k2n2
(2.3.3)
n-, =0 n2=0
N-l N-l
p(nrn2) = x 2_ Y. ^I'M
(k1n1+k2n2)
(2.3,4)
klnl+k2n2
k^O k2=0
.2tt
-J"X (k1n1+R2n2)
where W = e
and we are assuming that P is a square NxN array. Again, it is
important to note that both p(n.j,n2) andP(k],k2) are periodic in N in
two-dimensions. That is to say, they are periodic in NxN.
The two-dimensional discrete Fourier transform is a specific
kind of two-dimensional linear transformation. The general equation of
a two-dimensional linear transform is
N-l N-l
P(krk2) = V Y_ P(ni'n2^ e(n1,k1,n2,k2) (2.3.5)
n^O n^O
28
where 6(n-, ,k, ,n2,k2) is a weighting coefficient that describes the
transform and p(n-,,n2) is assumed to be an element of a square, NxN,
array. In the two-dimensional DFT,
(k-,n1+k2n2)
WN = W(n1,k1,n2,k2) = 6(n],k1,n2,k2). (2.3.6)
The two-dimensional DFT is also what is called a unitary transform
because it is exactly invertible and its weighting coefficients meet
the following conditions.
2- L- W(n],k1,n2,k2) W*(j1 ,k1 ,j2,k2) = 6(n1~j1 ,n2-j2)
kl k2
2. 2_W(n1,k1,n2,k2) (j-, ,k] ,j2,k2) = 6(n]-j'1 ,n2-j2)
2.3.7)
kl k2
2_ 2-W(ni'kTn2'k2^ w*(ni'min2'm2^ = <5(k-1-m1,k2-m2)
nl n2
2_
2.W"
(n1,k],n2,k2) (n1 ,m] ,n2,m2) = 6(k-]-m1 ,k2-m2)
nl n2
where
-j'X^lh +n2k21
W = e
-1 J"F(nlkl + n2k2)
W* = W
'
= e
The 2D-DFT is also separable as is the continuous case. In
the discrete case,
W(nI,k1,n2,k2) = W,_(ni,k1)WR(n2,k2)
(2.3.8)
W (n1,k1,n2,k2) = W[ (n^k^W^ (n2,k2)
Any NxN image array [p can be converted into a 1 x N image
vector p by a procedure called column scanning. This is described
in Appendix B.
Using matrix notation a 2D-DFT operating on the array IP can be
expressed as
Q-RPR
where ]]/J/| and }\/Jj R are NxN matrices defined in Appendix B.
When U has been converted to the vector p, this same DFT operation
can be expressed as
q = WJ P
where wNJ is an N x N matrix defined by
with > being the operator for the left direct product also defined in
Appendix B.
Two-dimensional convolution is another kind of two-dimensional
linear transformation. The equation for two-dimensional convolution is,
30
N-l N-l
P(krk2)=2_ Y P(nl'n2^ (k] - nr k2 - k2) (2.3.9)
n.j = 0 n2=0
Where the array
p(nrn2) P
is convolved with the array
r(krk2) r
which can be written as
p * r =[p(kvk2)] (2.3.10)
As in the case of one-dimensional convolution, the two-dimensional
case has both linear and cyclic convolution. We are now dealing with
periodicity in two-dimensions when we consider cyclic convolution.
The linear convolution of an N x N array with a M x M array leads
to an (N + M) x (N + M) result. If cyclic convolution is to be utilized,
then zeroes must be appended to both the NxN array and the M x M array
in such a way so that both arrays are of dimension (N + M) x (N + M)
before cyclic convolution is performed. Pictorially,
array 1
A
'11
'Nl *NN
31
(2.3.11)
and array 2
A.
'11
JM1
'1M
JMM
(2.3.12)
These are changed to the following
A i =
ln alN 1,N+1 "" Q1N+M
"Nl lNNl
0
N+1,1
0
N+M.l - 0N+M, N+M _L
(2.3.13)
and
K
Jn
'M,l
bl ,M 1 ,M+1
JM,m
;1 ,M+N
0
M+1,1
0M+N.l M+N,M+N J
(2.3.14)
From the two (M+N) x (M+N) arrays, two periodic arrays are contructed
each with period (N+M) in both dimensions. Two-dimensional cyclic
32
convolution can be performed on these periodic arrays. One (N+M) x (N+M)
period of the cyclic convolution will be the same as the linear convolution
result.
To implement a two-dimensional linear convolution using the two-
dimensional DFT, one must insure that the dimensionality of the arrays
is large enough to make the result using the DFT the same as that
using convolution. Not all image processing techniques require linear
convolution. However, when it is to be implemented, zeroes must be
appended to the original image array to provide for correct results.
Two-dimensional filtering can be performed by a single linear
transformation such as two-dimensional convolution or by several linear
transformations such as a 2-D DFT approach that transforms an image into
its spatial frequencies. Diagramatically, the two-dimensional convolution
approach can be represented as
P
NxN array
2-D
->f convolution
function
Convolution Approach
Fig. 2.3-1
Q
U x U array
The 2-D DFT approach can be represented as
UP 2D - DFT FLI
NxN array NxN array
2D-DFT Approach -> Spatial Filter
function U x U array
Fig. 2.3-2
h2D - IDFT > &U x U array
33
where in each case W is the input image, (^ is the output image, and
IP and G$ are the spatial frequency distributions ofPand Q. , respectively
The reason for using a three-step 2D-DFT approach in an image filtering
problem is to transfix the data into a form that is easier to work with.
The goal here is to minimize the number of computations needed to transform
P toQ.
To directly implement the 2D-DFT on an N x N array, N operations
are required. Similarly, a direct implementation of the 2D-IDFT on a
U x U array requires U operations. If a linear filtering operation is
2 2
to transform an N x N array into a U x U array, at most N U operations
are required. It appears that a three step transformation such as the
DFT approach requires more computations than a single two-dimensional
linear transformation such as a convolution approach. However, as in the
case of one-dimensional transforms, the three-step process can involve
fewer computations.
Using tables similar to those used in the one-dimensional case, the
number of computations needed in the DFT approach and in the convolution
approach can be compared, assuming Pand ($ as the input and output
arrays respectively.
Table 2.3-1 Computational Comparison, 2D-DFT vs. 2D Convolution
No. of computations Type of Operation
k
N2 U2 2D convolution
c
N4 2D - DFT
k,
N2 U2 2D spatial filteringd
U4 2D - IDFT
34
where kd and k are sparseness coefficients that have a value between
zero and one. The value of kd and kc will tend to zero as more weighting
coefficients of the given linear transformation become zero. This assumes
that our filtering algorithm has software that will skip an operation
when a particular weighting coefficient is equal to zero.
In order for the two-dimensional filtering problem to have fewer
computations using DFT techniques,
2 2
k < k _ IT _ UlKd Kc u2 N2 ' (2.3.15)
If the two-dimensional fast Fourier transform algorithm is used
to compute the 2D-DFT and the 2D-IDFT, then the following table applies.
Table 2.3-2 Computational Comparison 2D-FFT vs 2D Convolution
No. of Computations Type of Operation
kc
N2 U 2D convolution
2N2
log2N 2D-DFT
2 2
k . N U spatial filtering function
2U2
log2U 2D-IDFT
For the DFT approach to have fewer computations 4
kd<kc-7ig2N-^l092u (2-3J6)
This condition is readily achievable in most applications. Again, as
in the one-dimensional case, the values of k, and k are determined by
35
the type of filtering desired. In this thesis, only the implementation of
the 2D-DFT approach is considered. Specifically, the implementation of
the 2D-FFT algorithm will be examined.
36
2,4
The One and Two
Dimensional Fast Fourier
Transform
37
The fast Fourier transform algorithm is used almost exclusively
to compute the discrete Fourier transform of a kernel. The FFT has
several forms, with the decimation- in-time algorithm and the decimation-
in -frequency algorithm being the most often discussed algorithms in
textbooks. The algorithm was developed by exploiting various symmetry
properties found in the DFT. Consider the one-dimensional DFT relation
N-l
I
n=0
X(k) - Y x(n) kn 0,1, ..., N-l (2.4.1)
and the IDFT
N-l
x(n) = -i- Y X(k)
n=0
where
kn
k = 0,1, ..., N-l (2.4.2)
W
kn
= eWN
& ^
One symmetry property is
W
"kN
= eWN
N
=
ej2u
= ] (2.4.3)
If N is a multiple of two
W 2WN
2l _N_
"JN 2
= e
-J*
-= -1 (2.4.4)
Thus for some values of W.., no multiplication need be done in the operation
x(n)*W kn. If N is a power of two (N = 2V) a dramatic reduction in the
38
number of computations can be achieved by dividing the DFT into smaller
DFT computations until we are dealing only with two point DFT's.The first
step of the decomposition is repeated here to indicate the methods
involved. Algorithms that decompose the input sequence into smaller
subsequences are called dejcAxxitiAn-iti-tii-.'^, algorithms. With N = 2 ,
the DFT of a sequence x(n) with n = 0,1,2 ..., N-l is
N-l
X(k) = Y x(n) WNPl< k = 0,1, ..., N-l (2.4.5)
n=0
separating x(n) into its even and odd points
2 ' r|< ? kr
X(k) = V x(2r)(WN2) +
WNk V x(2r+l) (WN2) (2.4.6)
r=0 r^t)
where n = 2r for even points and
n = 2r + 1 for odd points
- 2^
"J ft,
2 -^ X
now WN = e = e = wn
T
so that
f-i T 4--i
X(k) = V x(2r) W
2
+WkY
r=i -I
(2.4.7)
X(k) = G(k) +
WNk
H(k)
39
This indicates that an N point DFT where N is a power of 2 can be
N N
decomposed into the sum of two -j- point DFT s. The -j- point DFT can
N
be decomposed into the sum of two -r- point DFT s. This is a process
p
where each p point transform is decomposed into a sum of two _ point
2
transforms can proceed until left with only two point transforms.
This requires v stages of computation where v = log2N.
This process leads to the FFT algorithm with the number of complex
multiplications and additions equal to N log2N for a one-dimensional
sequence of dimension N.
The two point DFT which is the fundamental computation in a more
complex FFT has come to be known as the butterfly computation. The
butterfly computation is shown diagramatically in its basic form in
Fig. 2.4-1
The Two Point DFT
xm+l(p)
xm+l(q)
Fig, 2.4-1
The following figure reveals the structure of an eight point
decimation-in-time FFT algorithm. This structure can easily be extended
to reveal the structure of computations needed for longer input sequences.
Eight Point DIT-FFT
40
>r-x(o)
Fig. 2.4-2
41
The eight point FFT flow graph reveals one significant
peculiarity about the decimation-in-time algorithm. The input
sequence must be in what is called bit-reversed-order if the
output sequence is to be in normal order. Bit reversed order
can be understood by the following table.
Table 2.4.1 A Description of Bit-Reversed-Order
Normal Order Binary
Designation
x(0) 000
x(l) 001
x(2) 010
x(3) Oil
x(4) 100
x(5) 101
x(6) 110
x(7) 111
Bit-Reversed Order in
Designation FFT Computation
000 0 1st
100 4 5th
010 2 3rd
110 6 7th
001 1 2nd
101 5 6th
Oil 3 4th
111 7 8th
42
By referring to the chart above and Fig. 2,4-23 the role of bit
reversal can be understood. If a decimation in time algorithm is to
be used in DFT computation, an additional algorithm must operate on the
input sequence to place the data in bit reversed order. This results
in only a small amount of additional overhead as will be shown later
in this paper. Other algorithms exist that do not require bit reversed
order, however, these have other shortcomings that will not be discussed
here.
The one-dimensional decimation in time FFT algorithm can be easily
implemented in a fashion that will compute a two-dimensional discrete
Fourier transform. This results from the separability of the two-dimensional
discrete Fourier transform. The computation of the two-dimensional discrete
Fourier transform can be done in the following way.
With the image arranged in an array or matrix of dimension N x N the
first operation is to perform separate DFT's on each of the N columns (or
rowsl of the -matrix. Each of these DFT's will be of dimension fl. When this
operation is complete, the N x N matrix resulting from the N one-dimensional
FFT operations is transposed. After the transposition, the row values are
now the column values and N one-dimensional DFT's are again performed on
the columns. (The purpose of the transposition is so that a separate
algorithm need not be written to perform N DFT's on the rows of the matrix.)
Thus, performing a two-dimensional DFT on an image represented by an N x N
matrix requires the application of an N point one-dimensional FFT algorithm
43
,.2,2 N times. This results in a total of 2N log2N operations needed for
a two-dimensional DFT.
Because of bit reversal operations, matrix transposition, and
general software overhead the total number of operations needed for the
2
two-dimensional DFT is somewhat more than 2N log2N. In this paper the
relative importance of this additional overhead will be indicated in
terms of computation time.
44
Section 3
Examination of Architectures
45
3.1 Method of Examination
This thesis will examine how different hardware microprocessor
architectures will perform when executing essentially the same decimation-
in-time FFT algorithm in the calculation of the two-dimensional discrete
Fourier transform.
It will be assumed in this thesis that the data to be transformed,
that is the array of pixels, is entirely memory resident and the operations
used to load this data into memory will be ignored.
With the above conditions, the overall computational path used by
any microprocessor architecture in the calculation of the 2D-DFT
is given in the following flow chart.
The loop labeled 1 in the flow chart is executed N times when
transforming the first N columns of an N x N array of pixels. It is then
executed another N times when the rows of pixel array are transformed.
This means that process blocks A + B are executed a total of 2N times.
The loop labeled 2 is executed once. Process C transposes the pixel
array so that the rows become columns. This permits the data addressing
schemes for process blocks A and B to be used again. Because of the use
of this data transposition, the completed 2D-DFT output will be trans
posed unless it is transposed back again by process C2> It is not
necessary to do this because the user may desire transposed data.
In addition, the data may be retained in memory for further processing
at the completion of the 2D-DFT computation. It is not necessary to
output the data from memory.
Flow Chart Showing the Major Operations of the
2D-DFT Computation
46
Input Pixel Data
into Memory
I
Initialize Software
Parameters
Perform Bit
Reversal on one
Column in Array
H is a Column
pointer 1 < H < N
<-
A
Perform a linear
FFT on one
Column in Array
Output Pixel Data
(optional )
\ from memory
Transpose
Pixel
Array
(optional )
<-
<
i
i
H-H + 1
->
'1
Transpose
Pixel
Array
Fig, 3,1-1
47
Implementation Scheme
The Z8000 microprocessor will be used as the baseline processor
in this thesis. All hardware architectures considered will use this
as their processing unit. This thesis will consider six hardware
architectures. These are summarized below.
Architecture I
This will be a single Z8000 MPU equipped with memory. In this
architecture and in all architectures, square pixel arrays of dimension
NxN will be processed. Complex floating point arithmetric will be
handled entirely in software with multiplication instructions exercising
the MPU's on chip multiplier. A 16-bit mantissa and 16-bit exponent
will be assumed. Coefficients used in the FFT computations will be
calculated recursively within the FFT algorithm. The pixel array
need only be square and its dimension N be a power of two.
Architecture II
This architecture is a single processor with memory. The pixel
array will be constrained to a fixed square dimension. The coefficients
used in the FFT calculations will now reside in a look-up table
eliminating the need for recursively calculating them. Floating point
complex -multiplication is implemented in software-
48
Architecture III
This architecture is the same as architecture II except that
a parallel complex hardware multiplier is assumed to exist. In such
a device, the processor loads the multiplicand and mulitplier
into hardware registers and retrieves the product from different
registers. It is assumed that the product can be formed fast enough
so that the processor does not have to wait for it.
Architecture IV
This architecture is the same as architecture III only in this case
it is assumed that 16-bit fixed-point arithmetric is ^used.
Architecture V
This architecture is the same as architecture IV only complex
multiplication will be performed in software.
Architecture VI
Here the effect of using more than one processor to compute the
2D-DFT of an array of pixels will be examined. Architecture II and
Architecture III will be used as building blocks in a multi-microprocessor
architecture.
49
Table 3.1 Summary of Architectures ,:
Architecture Coefficients Multiplication Arithmetric
I Calculated Software Floating Point
II Look-up table Software Floating Point
III Look-up table Hardware Floating Point
IV Look-up table Hardware Fixed Point
V Look-up table Software Fixed Point
VI Look-up table Hardware &
Software
Floating Point
3.2 Presentation of Architecture I
This architecture is a single processor with memory. The array
of pixels used as data for this architecture is assumed to be square
with a dimension NxN where N is a power of two. This architecture
and all architectures used in this thesis use a decimation-in-time FFT
algorithm. The Butterfly computation in this algorithm has the
structure given below.
X(I)
XUP)
x(i)
X(IP)
Butterfly
Structure
Fig. 3.2-1
The values of W
k
are calculated recursively in Architecture I by
the relation
50
WNk+1
=
WNk
AW (3.1.1)
where the meaning of the variables can be understood by examining the
Z-plane.
Z-plane
Fourier Coefficient
in the Z-plane
Fia. 3.2-2
The following pages present the details of the major process
blocks A, B, & C as they apply to Architecture P. At the top of each
flow chart the number of times a section of computations is entered
is given. Also given for each path on the flow graph is a number
sigma (E). This indicates the number of times a path is executed with
respect to the number of times the flow graph is entered at the top.
To detemrine the total number of times a path is executed, multiply
E by the number of times the flow chart is entered.
Process Block A - Bit Reversal
This Block is entered 2N times
51
I = 1 , J = 1
Yes
I > J
No
T * X(I+L)
X(J+L)^X(J+L)
X(J+L)< T
N - 2
v/2
K + N/2
E = v
E= N -2
J+K
I-I + 1
,v/2
N =
2v
H = Column Pointer
I < H s N
L = (H-l) N
I, J = Data Pointer
1 < I N
1 < J < N
K = Counter
1 < K < N/2
E = normalized
execution number
E =
E = N-l
J- K K/2
N - v - 1
The total number of executions
is equal to 2N times E.
H is incremented when changing
columns.
Fig. 3.2-3
Process Block B - A linear FFT algorithm
This Process Block is entered 2N times
52
LP = 1, PN = 2, G = 1 E =
a I'
\y
LE=2G, LEI = LE/2, S=0 E =
E = v -1
''
A -<- it/LEI, it shifted right G-l times
v
T^-A , a*
A2
N f
X+-A, y^A, t-(l/2l)A
V
LP^-PN + 2
Yl/PNl y
y-*-y x
J *
E = (LIMIT -l)v
N = 2
1 < G < v
\LP > 0^
^ No Y LP"*- +i
JYes E = -^-(limi t)v ,
LP^
-1 t---t
> '
s*-s + t E = (Limit)v
1
-(limit)>
LIMIT determines the
length of the infinite
series aoproximation
to COS (A).
FIG. 3.2-4
53
Process Block B continued, this process block is entered 2N times
V
E = v N = 2
v
PN-3, LP--1, D<-A, x<-A
S D
D*-x D
E = (LIMIT)v
E = l/2(LIMIT)v
,Yes
LP--1
JL
S-S + t
PN-PN + 2 )*-
No
E = (LIMIT -lh
> LIMIT,
Yes
Store I [AwJ
LP^+ 1
(LIMIT)v
E = l/2(LIMITh
E = v LIMIT determines the
length of the infinite
series approximation
to sin (A)
Fin- 3,2-4 continued
54
Process Block B continued, this process block is entered 2N times,
N = 2
LEI = 2
v
6-1
v
LE =
2G
1 < G
W * 1 + JO, J - 1 E = v
v - Nvi - 2
i
- N + 1 I * 0
fc.
''
IP + I + LEI
''
>
T - X(IP+L) W
fc
Floating Point
Complex
Multiplier
i '
X(IP+L) + X(I+L) - T
X(I+L) - X(I+L) + T
N
,E = x v
' t Butterfly
CalculationI * I + LE
>fes ^ I < N ^>
jT No
W * W AW
J - J + 1
E = N - 1
Ye
J < LEI ^
I = v-1
<?
No
E =v
E = 1
Yes
G <
End Process Block B
Fig. 3. 2-4 continued
55
Process Block C - Array Transposition
This process block is entered once
H+1, row counter, K=l column counter
i 2 2
1 ~ 2
"
2
H > K
No
R
Q
N(k-l) + H
N(H-l) + k
T + X(R)
X(R) + X(Q)
X(Q) - T
K + K + 1
H - H + 1
Yes
y _ Nl Ni -
2
-
2
Yes E = N
--
E = IT
Yes
A A
E = IT - N
E = N
Yes
E = N -1
* K - 1
End of Process Block C
Fig. 3.2-5
56
To obtain an idea of the computation needed for a single processor
to implement these operations, the programs in Appendix A were written.
These present one way of implementing the operations shown in the flow
graphs. The instructions used in these programs are taken from
advance product information on the Z8000 16-bit microprocessor.
Likewise, the cycle times and the required memory space for each
instruction are so referenced.
57
3.3 Presentation of Architecture II
When the dimension of the pixel array comprising the digital
picture is contrained to a fixed size, then the 2D-FFT software can
be simplified. Specifically, the Wj. coefficients in the FFT algorithm
need not be recursively calculated, but can be tabulated in a look-up
table.
When the array size is constrained to a fixed dimension only minor
changes occur in the computational flow as compared to Architecture I.
These changes occur within Process Block B. The new Process Block B
is presented in the following flow chart.
58
Process Block B - A linear FFT algorithm for Architecture II
This process block is entered 2N times
Fig. 3.3-1
1
G - 1, IN <-N/2, AD ^ 0 E = 1
LE=2", LEl=LE/2, IN=IN/2. J=l . AD=Q
1
E = v
N = 2
LEI = 2
,v
G-l
W + W(AD) E = N - 1
'
1
I + J
*
f\
IP <- I+LE1
' f
*
<
Floating Point
Complex
Multiplier
T + X(IP+L) U
i
X(IP+L) - X(I+L) - T
X(I+L) - X(I+L) + T
v _ NE = -k v
' '
I *- I + LE
Yes ^
>
'I
r
< N .
LE =
2U
1 < G < v
H = Column Pointer
L = (H-l) N
1 < H < N
N - 1
E = v
->. End Process Block B
59
3.4 Presentation of Architecture III
Multiplication is one of the more time consuming operations
in the FFT computation. In Architecture III it will be assumed that
a floating point complex multiplier has been built to assist the
processor with complex multiplication. The multiplier will be assumed
to operate asynchronous to the processor clock. This enables complex
multiplication to be done through load and read memory instructions.
The mantissa and exponents of the two words to be multiplied are loaded
into specific locations in the processor's memory space. The mantissa
and exponent of the product are then read from a different memory location.
The computational flow of Architecture III is the same as Architecture II
The difference is that the floating point multiplier subroutine is not used
in Architecture III. As a result, all computation times are the same for
Architecture III as for Architecture II except for the butterfly computation.
An asynchronous hardware multiplier is a very complex device to build
in hardware, especially when it is to handle complex numbers. The following
block diagrams indicate this.
Hardware Floating Point Complex Multiplier
Input Half
60
XRM
XIM
XRE /Tn WRE
_/Ty
XTF -(7^-
V
_____^_
vi WIE
' ' >
-y-
\ f
CHR4 CRH3 CHR2 CHR1
e>
-
>
MAN4 MANS MAN2 MAN!
WRM
WIM
Fig. 3.4-1
XRE, XIE, XRM, XIM, etc. are 16-bit registers, each addressed by the
CPU as 2 addressible bytes.
The multipliers may be considered to be TRW-MPY16A multipliers.
Each adder can be cascaded 4 or 8 bit parallel adders or and ALU
wired for 16 bit addition.
Programmable logic arrays can provide for address decoding for all input
and output registers.
MAN!, MAN2, etc. are 32-bit partial products.
CHR1, CHR2, etc, are 17-bit exponents.
61
Hardware Floating-Point Complex Multiplier Block Diagram
CHR 1 (17b) CHR4 (17b)
o
\s i-
Shifter Control
Man 1
H shi fter
32 b
G>
MS 16b
(REAL) PRMDR
16b
PRDCR
Man 4
Shifter
^
32b
CHR2 (17b) CHR3 (17b)
Man2
Shifter Control
> Shifter
32b
*
MS 16b
(IMAG) PRDMI
Man 3
Mshi fter
32b
16b
PRDCI
Fig. 3.4-2
62
The shifters and shifter controllers can be built in several
ways. They can be an array of 16 to 1 multiplexers in which case the shift
control logic can be a PLA array to control the shifters and to pass on the
greater input characteristic to the product characterisitc register. The
shifters may also be implemented with shift registers in which case the
shifter control will need counting and clocking logic. The clocking will
have to be at a rate fast enough to insure that a 32-bit shift can be
performed in a time less than that required for a memory reference
instruction if the hardware multiplier is to be transparent to the CPU.
In this thesis no attempt was made to design an operational hardware
floating point complex multiplier. The following table is only to suggest
what the complexity of a multiplier might be
Table 3.4 Essential Components
Hardware Floating-Point Complex Multiplier
Quantity Device
4 16-bit square parallel multipliers
4 16-bit registers
4 PLA (16 input 2 output)
4 17-bit adders
2 18-bit adders
64 16 to 1 multipliers (to shift mantissas before adding)
2 18 input 8 output PLA (to control shifters and the source of
the product's exponents)
32 2 to 1 multiplexers (sources the exponents)
4 16-bit tristate registers (to store the product)
63
3.5 Presentation of Architecture IV
Architecture IV is similar to Architecture III except that
fixed point arithmetic will now be used. The hardware complex
multiplier can be simplified as a result of this. The following block
diagram depicts the structure of the multiplier.
The flow chart for Architecture IV is the same as Architecture II
and Architecture III. In Architecture IV only half the previous number
of storage locations are needed in comparison to the previous architectures
as a result of the fixed point computation. The program code for
Architecture IV is given in Appendix A,
64
Essential Components
16-Bit Fixed-Point Complex Multiplier
This requires in addition to the registers, 4 16-bit parallel multipliers
and 2 32-bit adders. The inputs of the multipliers are connected so a
single shift right is performed to prevent overflow.
Fig. 3.5-1
65
3.6 Presentation of Architecture V
Architecture V is again a structure that uses fixed-point
arithmetic. It also shares the same flowgraph of Architecture II,
III, and IV. Architecture V employs essentially the same code as
Architecture IV with the exception that it uses programmed fixed-
point complex multiplication. Because the on chip multiplier can be
conveniently used in this fixed-point arithmetic, the complex multi
plication used in the butterfly computations is performed in-line in
Architecture V rather than calling on a subroutine as was the case
in Architecture I and II. The bit reversal operations for this
architecture are the same as for Architecture IV. The code for the
FFT butterfly computation differs from Architecture IV and is presented
in Appendix A.
66
3,7 Presentation of Architecture VI
This structure utilizes the technique known as multiporting. Here
a given sector of memory is addressible by more than one processor. In
the case considered here, two processors address one sector. The master
processor may accesss the entire memory space. It controls the flow of
data to and from peripherals. It also controls the operation of the slave
0
processors. Each slave processor accesses a sector of the 4N data block.
If there are P processors, P-l of them being of the slave type, then each
2 2
processor accesses 4N /P data words. To have these 4N /P words correspond
to an integer number of rows, let P =
2X
< N, X an integer. Utilizing either
multiprocessor control inputs or interrupt control, the master will switch
the two port memory sectors to the slave processors, at the same time
preventing itself from accessing these locations. The master will signal
the slaves to begin their FFT routine and then begin its own FFT operation
on its own sector. On the completion of this, it will wait until the other
processors are done after which the two port control is switched so that
the master will be able to access the entire address space. Then trans
position will be done by the master and the process repeated. Such
sequencing is outlined below.
1) Set a bit in control register to multiplex memory sectors to
slave processors.
2) Either interrupt slaves or start their FFT routines through
multiprocessor handshaking.
3) When master has completed its FFT, it checks a status register.
4) If the status register indicates all operations are complete,
switch multiplexers back to full master control.
5) Perform transposition.
6) Repeat 1-4.
Computational Flow
Architecture VI
68
Master
Start
_L_
Slaves
Bring up system
Input digital
picture
Initialize
Transposition
Pointer Data Ready
Change
Memory Ports
Start Slaves
e
Perform Bit
Reversal on Row
Perform Linear FFT
on Row
Yes
Perform Bit
Reversal on Row
Perform Linear FFT
on Row
No
Slaves are
done
Signal Master
Done
\y
A
Fig. 3.7-1
69
Master Slaves
Perform
Transposition
Change
Transposition
Pointer
U .
Output
Digital
Picture
Stop
A
v
Reinitialize
Pointers
Fig. 3.7-1 continued
70
The Z8000 series microprocessors are equipped with their own
multipliprocessor control pins. Through handshaking signals on these pins,
multiprocessor activity can be implemented easily on the Z8000. A
possible method of implementation of the master/slave scheme discussed
before would involve the following routines.
Master Processor
MRES
(load data)
<>
7 1
MULT I MBIT 7 1
JP N, ERR0R1 11 3
DFFT MSET
<>
(perform N
one dim, FFT's)
7 1
WAIT MBIT 7 1
JP N,.XPSN 11 3
INC.WAITCNT 16 3
JP OV, ERR0R2 11 3
JP WAIT 11 3
XPSN MRES
<>
(perform data
7 1
transposition and
determine if the
2-D DFT calc. is done)
deactivate all processors
make sure slaves are
inactive
start slaves
wait for all slaves
to finish
increment wait loop cntr.
waited too long for slaves
disable slaves
71
JP MULT1 11 begin transformation
in second dimension
Slave Processor
TEST MSET 7 1 indicate not busy
MBIT 7 1 did the master say start?
JP N, DFFT 11 3 yes
JP TEST 11 3 no
DFFT MSET
(perform N
7 1 not yet done
one dim. FFT'
MRES
s)
7 1 indicate to master that vi
MRDY MBIT
JP N,MRDY
JP TEST
7 1
11 3
11 3
done
has the master released you
released you
yes
Because of capacitive loading and fan-out limitations, the Z8000
CPU as well as most other microprocessors requires buffering between
the CPU/MMU and memory/peripherals. The AmZ8104 is a bidirectional
octal buffer with tri-state ports that can be used for this purpose.
The AmZ8104 can also be used as a means of fabricating two-port memory.
Figure 3.7-2 indicates how this might be done.
72
>
>
t-i an -^
3. Ci o; t q
?~>
10 cc:
X>
Ui
r>
R
7"
o
E
.3"
O
o
m
fco <
<s
. o
I
3.
13
a:
i.
r (U
O 4->
O in
CO (13
Busak
73
A multiprocessor architecture can be built by means other than
utilizing two-port memory techniques. Instead of using shared memory
as in two-porting, data blocks are transfered between different
processor/memory groups.
The Z8000 series offers what is called a Microprocessor Buffer
Unit (MBU). This is essentially a FIFO that can be used to connect
asynchronous processors to each other. Information between processors
is exchanged through DMA operations with the MBU. One processor loads
data into the MBU by DMA control. When the MBU is full, the second
processor is notified and data is transfered to it from the MBU by DMA
control. The direction of data flow can be reversed.
Using the MBU, the interprocessor overhead would be greater than
in two-porting the system memory. By either method of interprocessor
communication the time needed to move blocks of data is less than that
needed for the computation of the DFT.
74
Section 4
CONCLUSIONS
75
The various architectures used to calculate a two-dimensional
discrete Fourier transform mentioned in this paper are highly similar
from the standpoint of computational flow. Consequently, the
architectures have almost identical software. The hardware of these
architectures do, however, vary considerably. The following
outline places the various architectures into perspective.
1. Architecture I - single processor with memory, programmed
floating-point multiply, calculated coefficient values
A. The chip count is low, only a CPU, MMU, and memory are
required.
B. The cost would be low.
C. This architecture has the slowest throughput - 7020 sec
for 1024 x 1024 2D-FFT
2. Architecture II - single processor with memory, programmed
floating-point multiply, coefficient values are in look-up
table
A. Low chip count - CPU, MMU, and memory
B. The cost would be the same as Architecture I.
C. The throughput is inproved to 6846 sec for a 1024 x 1024 2D-FFT
76
3. Architecture III - single processor with memory hardware
floating-point multiplier, coefficient values in a look-up table
A. The chip count increases because of the hardware floating-point
multiplier. (Over 200 chips, most of them LSI are needed for
such a multiplier.)
B. The cost of Architecture III is higher than that of
Architecture I and II, most of this is hardware cost.
The cost of software is about the same.
C. About 3716 sec are needed to perform a 1024 x 1024 2D-FFT
4. Architecture IV - single processor w/memory, hardware fixed-point
multiplier, look-up table for coefficient values.
A. Chip count is less than Archicture III but greater than
I and II
B. Overall cost of this system follows that of A.
C. About 2472 sec are needed to perform a 1024 x 1024 2D-FFT
77
5. Architecture V - single processor with memory, programmed fixed-point
multiply, coefficients in look-up table.
A, Chip count is comparable to Architecture I and II, with
less memory needed for data.
B. Cost is the lowest of all Architectures.
C. About 5222 sec are needed to perform a 1024 x 1024 2D-FFT.
6. Architecture VI - multiprocessor systems with coefficient look-up,
without and with hardware parallel floating-point multiplication.
A, The chip count of such systems are very high, especially when
hardware multipliers are used,
B. Both the hardware and software cost of a multiprocessor
system is very high.
C. The computation of the 2D-FFT can be made quickly.
The following series of tables will compare the time needed by
the various architectures considered in this thesis to compute a two-
dimensional discrete Fourier transform using a fast Fourier transform
algorithm. In all cases presented, the array size used was 1024 x 1024.
78
Table 4.1 Architecture I - single processor
Bit Reversal Operations 160.90 sec
Set-up for W-Coefficient Computation 4.77 sec
Calculate real part of WN 71.91 sec
Calculate imaginery part of WN 41.40 sec
Set-up for Butterfly Computation .069 sec
Perform all Butterfly Computation with
W,.1
6517.90 sec
Change W..-coefficient value 154.59 sec
Change Span of Butterfly* .204 sec
Check for end of row transformations .020 sec
Check for transpose 7 ys
Perform transpose 69.19 sec
Total time - 7020 sec
*The span of a butterfly computation relates to what point are being
used in the computation. The general equation for the butterfly
computation can be written as
79
xmJ.i (i) = xm(T) + wm xm (1 + r)m+1 ' m N m
xm+i (i + r) = xm(i) ~ w xm (i + r)m+1 m n m
the value of r is the indicator of the span.
Table 4.2 Architecture II - single processor,
W..1
look-up table
Function Time
Bit Reversal Operations 160.90 sec
Set-up for Butterfly computation of access first
W.,1
.52 sec
Perform all Butterfly computations with W.. 6517.94 sec
Access New W^ 92.18 sec
Change Span of Butterfly .204 sec
Check for end of row transformations .020 sec
Check for transpose 7 ys
Perform transpose 69.19 sec
total time - 6846 sec
Table 4.3 Architecture III - single processor
WN look-up table hardware complex multiplier
(floating point)
Function Time
Bit Reversal Operations 160.90 sec
Set-up for Butterfly computations & access W.,1 .52 sec
Perform all Butterfly computations with W.,i 3388.90 sec
Access New W^1 92.18 sec
Change span of butterfly .204 sec
Check for end of row transformations .020 sec
Check for transpose 7 ys
Perform transpose 69.19 sec
total time - 3716 sec
Table 4.4 Architecture IV < single processor
WN look-up table hardware complex multiplier
(.fixed point)
Function Time
Bit Reversal Operations 141.60 sec
Set-up for Butterfly computations & access W^ .39 sec
Perform all Butterfly computations with
WNn 2170.60 sec
Access New W^ 77.93 sec
Change Span of Butterfly -204 sec
Check for end of row transformations .020 sec
Check for transpose 7 ys
Perform transpose 61.64 sec
total time - 2472 sec
80
81
Table 4.5 Architecture V - single processor
W^ look-up table programmed multiply
Function Time
Bit Reversal Operations 141.60 sec
Set-up for Butterfly computations & access W..1 .39 sec
Perform all Butterfly computations with wJ 4919.90 sec
Access new W^ 77.93 sec
Change Span of Butterfly .204 sec
Check for end of row transformations ,020 sec
Check for transpose 7 ys
Perform transpose 61.64 sec
total time - 5222 sec
Table 4.6- Architecture VI - the multiprocessor system
Here the multiprocessor environment is compared to the previous
cases. In this system there are several microprocessors with one of
the processors in the system the master, the rest being slaves. The
master processor is responsible for data transposition with an
estimated time for this function of 75 seconds..
The processors are all considered to be either Architecture II
or Architecture III.
2D-FFT Processing time by Architecture II - 6771 sec
2D-FFT Processing time by Architecture III - 3641 sec.
82
Table 4.7 Computation Time for Multiprocessor Systems
Number of
Processors in
System
2
4
8
16
32
64
128
256
Configuration
Architecture II Archiitecture III
3385 1820
1693 910
846 455
423 227
212 114
105 57
52 28
26 14
Using a hardware parallel multiplier to perform complex floating point
multiplication is roughly equivalent to doubling the number of processors.
The accuracy of the computations also has a bearing on what architecture
a designer might choose for his design. Opnenheim & Weinstein developed
formulae for noise-to-signal ratios for fixed-point and floating-point
fast Fourier transform computations. They assumed that the imperfect
results obtained from finite-length registers in digital computations
could be represented by a white noise source placed at every node where
complex multiplication occurred. The noise-to-signal ratio is the variance
of the noise generated from digital complex multiplications as it appears
at the end of the FFT computation divided by the variance of the output
83
sequence X(k). The assumption is made that the input sequence x(n)
was complex with uniformly distributed real and imaginary parts in
(-1/V/2N, 1/ v/2N). [8]
For fixed-point arithmetric, with scaling performed in stages,
the noise-to-signal ratio is (according to Oppenheim and Weinstein)
-^- =
5N2"2b
(4.1)
aX
where N is the length of the sequence and b is the number of bits, not
including the sign bit, im the registers used to perform the computation.
Likewise, for floating-point computation the noise to signal ratio
is given by Oppenhein and Weinstein as
2
%. = J^. 2"2b log2N (4.2)
aX
These formulae are for one-dimensional fast Fourier transforms.
In light of the fact that a two-dimensional discrete Fourier transform
on an N x N array can be represented as a one-dimensional transform
on an
N2
x 1 vector (see Appendix B), the formulae for noise-to-signal
ratios can be extended readily to two-dimensional algorithms.
For two-dimensional fixed-point FFT algorithm with the scaling done
in stages,
84
Or
5N2 2"2b (4-3)
and for floating-point algorithms,
8 9-2b , ..
T" 2 log9N (4.4)
where it is assumed that the characteristic of the floating-point
algorithm has sufficient bits to handle the necessary scaling of order
2
N in magnitude needed in the two-dimensional discrete Fourier transform.
In light of these formulae, if b, represents the number of significant
bits in a fixed-point FFT algorithm and b2 represents the number of
significant bits in a floating-point algorithm then for both algorithms
to give the same noise-to-signal ratios,
b-, = b2 lo^o
15N log2N| (4.5)
If N is 1024 and the floating-point algorithm uses a 16-bit mantissa,
then b2 = 15 and
bl = 15 " ~1 log 2
= 15 - (-17.5)
8
15(1024)
log2 1024]
(4.6)
~ 33 bits
Thus a fixed-point algorithm must have 34 bit registers (33 plus sign bit)
to give accuracy comparable to a floating-point algorithm with a 16-bit
mantissa.
85
This thesis has also shown that a multi -microprocessor system
with 256 processors can compute a 1024 x 1024 2D-FFT in approximately
26 seconds using 16-bit floating-point arithmetric.
In comparison, a SIGNAL PROCESSING SYSTEMS SPS-61 array processor
when coupled to a suitable minicomputer would need about 10 seconds
to compute a 1024 x 1024 2D-FFT using a 32-bit fixed-point algorithm. [9]
In addition, the FLOATING POINT SYSTEMS AP-120B array processor can
perform a 512 x 512 2D-FFT in 1.55 seconds using what the manufacturer
calls "38-bit normalized floating-point" [10J arithmetric. This
implies approximately 6,2 seconds for a 1024 x 1024 2D-FFT operation.
Consequently, a multiprocessor system can be made that would
give comparable performance to an array processor in speed and accuracy.
The discrete Fourier transform is an algorithm that involves
a great deal of numerical calculation. Despite the arithmetical nature
of the algorithm, this paper has shown that when the algorithm is
implemented on a programmable processor, a large number of operations
of a type different than arithmetic operations are required. That is
to say the loading and moving of data expend considerable time in the
algorithm.
This thesis has also shown that even when accuracy is sacrificed
using the short word length of a microprocessor, a single microprocessor
requires an exceedingly long time to compute a two-dimensional discrete
Fourier transform (about 2 hours for a 1024 x 1024 square array). This
is not unexpected.
86
The possibility of augmenting a microprocessor with a hardware
multiplier was investigated in this thesis. As a consequence of the
many data movement operations needed in the 2D-DFT algorithm, the
construction of a high speed hardware multiplier does little to
improve the computational speed of a microprocessor that has its own
internal multiplier. Certainly, the speed improvement is not enough
to justify the added cost and complexity to the system. If multiplication
can only be done in software when using a given microprocessor, one
should not use such a processor in DFT applications. The Z8000 micro
processor and others have hardware which enables the instructions set
to contain a multiply instruction, These processors should be used in
DFT applications. An additional hardware multiplier designed to
quickly perform the complex multiplications used in the DFT algorithm
offers little speed improvement.
This paper has shown that the best way to improve the computational
speed of a microprocessor implementation of a two-dimensional discrete
Fourier transform is the multiprocessor environment. Even though the
complexity of a multiprocessor system is high, the hardware is more
easily implemented than that which would be required for the complete
algorithm to be implemented entirely in hardware.
87
APPENDICES
88
APPENDIX A
PROGRAM LISTINGS
This is the program code that is used in the baseline architecture,
Architecture I.
TWO DIMENSIONAL FFT 90
ONDM
NXRW
This code initializes the dimension pointer that indicates
whether the rows or columns of the pixel array are being
transformed.
LD R15'*2 7 2
LD
.DMCN,R15 14 3
LD .HCNT, 17 4
LD Rg,-N 12 3
LD
.NDIM.Rg 14 3
LD Rin,-ONE 12 3
The following code forms the word pointer for addressing
the pixel data in a given column (or row).
LD ^p. HCNT 12 3 H
DEC Rirl 4 1 H-l
MULT Rir R9 70 1 (H-1)*N
ADDL RR-, 1 14 3 (H-1)*N+1
SRA Rg.l 16 2
LD
.NV2,Rg
14 3
LD R9,.N 12 3
DEC R9J 4 1
LD
.NMl,Rg
14 3
LD .JCNT, 17 4
LD .ICNT, 17 4
LD R12 ^ICNT 12 3
91
BRO
This code places data in Bit Reversed OrderProcess Block A
CP R12-JCNT 12 3
JP P.FIVE 10 3
SUBL RR1Q,1 14 3
LDL
-LM1'RR10 " 3 L=(H-1)*N+1
LD Rg,.JCNT 12 3
CLR R8 7 ^
ADDL RR10,RRg 8 i
SLAL RR1Q,1 16 2
LDL RR]4,.TEMTBL 15 3
LD Ro,*3 7 2
ADDL RR1Q,.XTABLE 18 3
LDIR RR14,RR10,R0 54 3 T<-X
LDL RR2,.LMI 15 3
CLR RQ 4 1
LD Rr;CNT 12 3
ADDL RR2,RR]0 8 1
SLAL RR2,2 19 2
ADDL RR2, STABLE 15 2
SUBL RR]0,4 14 3
LD R0,*3 7 2
LDIR RR10,RR2J*0 54 2 X.-X.
SUBL RR]4, 4 14 3
SUBL RR2, 4 14 3
LD RQ,*3 7 2
LDIR RR2,RR14,RQ 54 2 X.^T
92
FIVE LD R0%NV2 12 3
SIX CP RQ,.JCNT 12 3
JP P, SEVEN 10 3
LD Rr.JCNT 12 3
SUB RrRo 4 1
LD JCNT, R] 14 3
SRA R0,l 16 2
JP SIX 10 3
SEVEN ADD r0,.jcnt 12 3
INC ICNT.l 16 3
LD
.JCNT,RQ 14 3
LD r15,.nmi 12 3
CP R15,.ICNT 12 3
JP P, BRO 10 3
This completes the code for bit reversal. This is the end
of Process Block A.
This code is a One Dimensional FFT.
Process Block B
The following code calculates the values of AW.
FFT3
LD ..GCNT, 17
LD R5,.M 12
LD Ry,.GCNT 12
LD Rg, *1 7
SDA RgR7 15
LD LE, Rg 14
SRA Rg.l 16
LD
,LE1,R7
14
LD .URM,.ONK\ 17
LD .URE,.0NE 17
CLR .UIM 14
CLR .UIE 14
CLR .ACCE 14
CLR .ACCM 14
LD .PCNT, 2 17
LD .LPCNT, 17
LD . LIMIT, .SM 17
LD RQ,.PIM 14
LD Rr.PIE 14
LD R9,.GCNT 14
SRA Rg.l 16
SDA R-| >Rg 15
4
3
3
2
2
3
2
3
4
3
3
3
3
3
3
3
3
3
3
3
2
2
R? = G
Rg = 2
LE =
2C
LEI = LE/2 =
2"
U = 1 + jO
PCNT = 2
LPCNT 1
Load ir
tt/LEI
94
CALL SHIFT to
LD
.DRM,RQ 14
LD DRE,R1 14
LD R3,.DRM 12
MULT R3,.DRM 74
LD R5 , . DRE 12
SLA R5, 1 16
LDL
MSPL,RR2 17
LD
MSP, R5 17
CALL SHIFT 32 20
LD R5,.MSP 12
LDL RR2.MSPL 15
LD
.STRM,R2 14
LD STRE.R
5
14
LD
.TRM,R2 14
LD
.TRE,R5 14
LD R] 1 , . PCNT 12
SLA Rirl 16
LD Ry.FACT.R^ 13
LD Rg.FACT+l.R^ 13
CLR R2 4
LD R,,.TRM 12
3
3
3
3
3 store tt/LEI
3 form (tt/LEI)2
2
3
3
3 norm (tt/LEI)2
3
3
3
3
3
3
3 PCNT = 2
2 Rn - 4
3 get 2j mantissa
3 get tj-j- exponent
1
3
MULT R3,R7 70 1 first cosine term
store 16 bits of (tt/LEI) 2
95
AGAIN
POST
BIB
ADD Rg.Rg 4
INC PCNT, 2 16
LDL
MSPL,RR2 17
LD
MSP,R5 14
CALL SHIFT32 20
LD R5,.MSP 12
LDL RR2,.MSPL 15
LD R1 -,, LPCNT 12
JP P,POST 10
NEG R2 7
NEG .LPCNT 18
JP BIB 10
LD .LCNT,ONE 17
NEG .LPCNT 18
LD
.TEMP,P5 14
SUB R5,.ACCE 12
JP P,POSB 10
NEG R5 7
SDA R2'R5 15
ADD R,.ACCM 12
1
3
3
3
3
3
3
3
3
1
3
3
4
3
3
3
3
1
2
3
normalize
X^-X
.LPCNT
=
-1
LPCNT = -1
1 TT 2
store exp. of jHon )
exp of x - exp
+ X^R,
96
POSB
HUM
LD
.ACCM,R2
14
JP HUM 10
SDA
.ACCM,R5 17
ADD R2,.ACCM 12
LD
.ACCM,R2 14
LD R5,TEMP 12
LD ntut $ K^ j- 14
LD R5,. LIMIT 12
CP R5,.PCNT 12
JP N, OUT 10
LD R3,.TRM 12
MULT R3,.STRM 74
LD R5,.TRE 12
ADD R5,.STRE 12
LDL .MSPL,RR2 17
LD .MSP, R5 14
CALL SHIFT 32 20
LD R5,: MSP 12
LDL RR2,.MSPL 15
LD
TRM,R2 14
LD
.TRF.R^ 14
LD Rn, .PCNT 12
SLA Rir 1 16
LD R7, FACT, R^ 13
LD R9 .FACT+l,Rn 13
LD
.TEMP,R2 14
LD R3,.TEMP 12
MULT R3'R7 70
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
2
3
3
3
3
1
*-R2
normalize to x
X +2 + R2
*-R2
load new exponent
Mantissa ^(ttt)
x mantissa
normalize
store tt (j^)
97
OUT
ADD R5,R? 4
JP AGAIN 10
LD R2,.ONEE 12
SUB R2,.ACCE 12
LD RQ,.ACCM 12
SDA Rn,Ro'
2
15
ADD R2,.ACCE 12
LD R-j ,R2 4
CALL SHIFT 20
LD .WRM,R0
14
LD
.WRE.R-,
14
CLR .ACCM 14
CLR ACCE 14
LD .PCNT, 3 17
LD .CPCNT,-1 17
LD . LIMIT, SM 17
LD R3,.DRM 12
LD R5,.DRE 12
MULT R3,.DRM 74
ADD R5 , . DRE 12
LDL .MSPL,RR2 17
LD .MSP, R5 15
CALL SHIFT32 20
LD R5,.MSP 12
LDL RR2,.MSPL 15
LD .SDRM,R2
14
LD .SDRE,R5
14
LD R, , . DRM 12
1
3
3
3
3
2
3
1
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
Store real part of W
Calculation of
imaginary part of
W
form
(tt/LEI)'
store
(tt/LEI)'
98
LD
.ACCM,R] 14 3 d
LD Rp.DRE 12 3
LD
.ACCE,R] 14 3
AGAIN LD R3,.SDRM 12 3
MULT R3,.DRM 74 3 form x-d
ADD R5,.DRE 12 3
LDL
.MSPL,RR2 17 3
LD
.MSP,R5 14 3
CALL SHIFT32 20 3
LD R5,.MSP 12 3
LDL RR2,.MSPL 15 3
LD Rir. LPCNT 12 3
JP P, POSTA 10 3
NEG R6 7 1
NEG .LPCNT 18 3
JP AGB 10 3
POSTA NEG .LPCNT 18 3
LD
.TEMP,Rg
14 3
SUB Rg,.ACCE 12 3
JMP P, PSG. 10 3
NEG Rg 7 1
SDA R6,R5 15 2
ADD R6,.ACCM 12 3
LD
.ACCM,R6
14 3
JP SHBU 10 3
PSG SDA .ACCM,Rg 17 3
99
SHBU
ADD R6 , . ACCM 12 3
LD
.ACCM,R6 14 3
LD R5,.TEMP 12 3
LD
.ACCE,R5 14 3
INC .PCNT, 2
\<h
3
LD R5,. LIMIT 12 3
CP R5.PCNT 12 3
JP P, AGAIN 10 3
LD RQ,.ACCM 12 3
LD Rp.ACCE 12 3
CALL SHIFT 20 3
LD .WIM,R0
14 3
LD .WIE,R, 14 3
store imaginary
component of W
This completes the computation of AW.
100
BTRFY
FFT2
w|$+1
= WJj -AW by Architecture I,
The next three lines of code are used in the calculation
jkn
ID .JCNT,1 17 3 j = ]
LD R15,.JCNT 12 3
ICNT,R15 14 3 I=JLD
The following code is the Butterfly Computation. It is
used for Architectures I, II,VI.
FFT LD R3,.ICNT 12
CLR R2 4
ADDL RR2,.LEI 18
LDL .IP,RR2
17
ADDL RR2,.LMI 18
SLAL RR2,2 19
ADDL RR4,.XTABLE 18
LD R0,*3 7
ADDL RK , KKp 8
LDL RRga.ATABLE 15
LDIR RR6,RR4,RQ 54
LD V *3 7
3 form butterfly
1 span counter
3
3 IP = I + LEI
3 I + LEI + LMI
2 form word
3 get data table address
2
1
3 get ATABLE add
2 A X(IP+LMI)
2
101
LDL
LDL
LDIR
CALL
LD
LDL
LDL
LDIR
LD
LD
LDL
LDIR
LD
LDL
ADDL
SLAL
ADDL
LDL
LDIR
CALL
LDL
LD
LDL
ADDL
SLAL
ADDL
LDIR
RRg.BTABLE 15
RR4,.UTABLE 15
RR6,RR4,Ro 54
FPMULT
R *3K0' J
20
7
RR4,.TEMPTBL 15
RRg,.PRDTBL 15
RR4,RR6,R0 54
V *3 7
RR4,.TEMPTBL 15
RR6,.ATABLE 15
RRg,RR4,RQ
R *3
54
7
15
18
19
RR4,.ICNT
RR4,.LMI
RR4,2
RR4,.XTABLE 18
RR6,.BTABLE 15
RRg , RR4 , RQ
ADDNORM
RR4,.SUMTBL 15
R0,*3 7
RR6,. IP
RR6,.LMI
54
20
RR6,2
15
18
19
RR^.XTABLE 18
RR6,RR4,R0 54
3
3
2
3
2
3
3
2
2
3
3
2
2
3
3
2
3
3
2
3
3
2
3
3
2
3
2
X(IP+LMI)*U
TEMP - PRD
ATAB <- TEMP
Address of X(I+LMI)
B e-X(I+LMI)
address of X(IP+LMI)
102
NEG .TRM 18
NEG .TIM 18
LD R *3 7
LDL RR4,.TEMPTBL 15
LDL RRg , . ATABLE 15
LDIR RRg , RR4 , Rq 54
LD R0,*3 7
LD R5,.ICNT 12
CLR R4 4
ADDL RR4,.LMI 18
SLAL RR4,2 19
ADDL RR4,.XTABLE 18
LDL RR6,.BTABLE 15
LDIR RRg , RR4 , RQ 54
CALL ADDNORM 20
LDL RR6,.SUMTBL - 15
LD V *3 7
LD R5,.ICNT 12
CLR *4 4
ADDL RR4,.LMI 18
SLAL RR4, 2 19
ADDL RR4,.XTABLE 18
LDIR RR4 RR6 ' R0 54
LD R]3,.ICNT 12
ADD R13,.LE 12
3
3
2
3
3
2
2
3
1
3
2
3
3
2
3-
3.
2.
3 "
1.
3
2-
3'
2-
3
3-
Negate Temp
T -e--T
ATABLE --TEMP
address of X(I+LMI)
B -^-X(I+LMI)
<eX(I+LMI) -T
X(I+LIM)
ICNT ICNT+LE
103
LD
.ICNT,R13 14 3
LD R13,.N 12 3
CP R]3,.ICNT 12 3
JP P,FFT 10 3
The code for the Butterfly Computation ends here.
The code for the calculation
w!j+1
= w]j 'AW of
Architecture I follows. This code is replaced in
other Architectures.
LDL RR4,.UTABLE 15 3
LDL RRg,. ATABLE 15 3
LD RQ,*3 7 2
LDIR RRg,RR4,Ro 54 2
LD R3,*3 7 2
LDL RR4,.WTBL 15 3
LDL RR6,.BTABLE 15 3
LDIR RRg,RR4,RQ 54 2
CALL FPMULT 20 3
LD R0,*3 7 2
LDL RR4,.PRDTBL 15 3
LDL RR6,.UTABLE 15 3
LDIR, RRg , RR4 , RQ 54 2
IRC^. .JCNT,1 16 3
LD R]3,.LEI 12 3
CP R]3,.JCNT 12 c 3
JP P.FFT2 ,10 3
IS: N
Yes
B-AW
U*AW ->PRD
U*AW
J<LEI
Yes
104
The following code determines when to calculate a new
value of W in Architecture I. It is also replaced in other
Architectures.
INC .GCNT,1 16 3 G = G+l
LD R13-M 12 3
CP R13.GCNT 12 3 G^M
JP P.FFT3, 10 3 Yes
Process Block B ends here.
The following code determines when all columns (or rows)
have been transformed in one dimension.
INC .HCNT.l 16 3
LD R13-N 12 3
CP R]3,.HCNT 12 3 N>H
JP P, NXRW 10 3 Yes, next row
The following code determines if the pixel array has been
transformed in two dimensions.
DEC .DMCN.l 16 3
JP Z.OUTAL 10 3
The following code transposes the pixel array.
This is Process Block C.
105
XPSN LD "13'*1 7 2
LD
.HCNT,R13 14 3
LD .KCNT, R]3 14 3
HLP LD R^.HCNT 12 3
KLP LD R3,.KCNT 12 3
CP R3, R1 4 1
JP N, INKCH 10 3
DEC R3, 1 4 1
MULT R3,.N 74 3
CLR Ro 4 1
ADDL KKo KK 8 1
DEC Rrl 4 1
MULT Rr-N 74 3
LD R5,.KCNT 12 3
CLR R4 4 1
ADDL RRQ,RR4 8 1
CP RR , RR2 8 1
JP Z, INKCH 10 3
LD R7, *3 7 2
LDL RR8,.XTABLE 15 3
SLAL RR2,2 19 2
ADDL KKp KKo 8 1
LDL RR10,.TEMPTBL 15 3
LDIR 10' 8' 7 54 2
LD R7,*3 7 2
K-l
(k-l)N
RR2^H + (K-l)N
M-l
RR^ K +(H-1)N
RRg has address of
X(H+(K-1)N)
T<-X(H+(K-1)N)
106
INKCH
LUL KKg,.XTABLE 15 3
ADDL KKp , KK^ 8 1
LDL RR]0,.XTABLE 15 3
SLAL RR0,2 19 2 address pointer
ADDL RR10'RR0 8 1 formed
LDIR RRg, RR. q,R_ 54 2 X(H+(K-1)N)<-X(K+(H-1)N)
LD R7, *3 7 3
LDL RRg,. XTABLE 15 3
ADDL RRo > RRq 8 1
LDL RR10,. TEMPTBL15 3
LDIR RRg, RR. qRt 54 2 X(K+H-1)N) T
INC .KCNT.l 16 3
LD R]3,.N 12 3
CP R13,.KCNT 12 3
JP P, KLP 10 3
INC .HCNT.l 16 3
JP P, HLP 10 3
JP ONDM 10 3
SUBROUTINE ADDNORM
107
Label Instruction
LD Ry,.ARE
Cvcles N
12
lemory Loc
3
ations Comments
ADDNORM Get real A exp
SUB Ry,.BRE 12 3 sub real B Bexp
JP N,NEGA 10 3
LD Rg,.ARM 12 3 A > B, get real A mant
LD R11S.BRM 12 3 geat real B mant.
SDA Rl 1 'R7 15 2 adj. real B mant.
SRA R1V1 16 2
SRA R9>1 16 2
NEG Rg 7 1
ADD R9,Rn 4 1
LD .SUMRM,Rg
14 3
LD R7,.ARE 12 3
INC R7,l 13 3
LD .SUMRE,Ry
14 3
JP IMAG 10 3
NEGA NEG R7 7 1 A < B, neg. diff
LD Rg,.ARM 12 3 get real A mant.
LD R1 1 , . BRM 12 3 get real B mant.
SDA Rg,R7 15 2 adj real A mant.
SRA Rg.l 16 2 scale A mant.
SRA Rn,l 16 2 scale B mant.
NEG Rg 7 1 negate real A mant.
ADD RlljR9 4 1 B + A
LD .SUMRMjR^ 14
3 store sum
LD R^ , , BRE 12 3 get real B exp.
INC R7,l 13 2 adj. for scale
IMAG
NEGI
END
LD
.SUMRE,R7 14 3 store real exp iQg
LD R?,.AIE 12 3 get imag A exp
SUB R7,.BIE 12 3 adj imag. B mant.
JP N.NEGI 10 3 scale imag B mant.
LD Rg,.AIM 12 3 scale imag A mant.
LD Rir.BIM 12 3 negate A
SDA R11'R7 15 2 B + A (B - A)
SRA Rirl 16 2 store sum
SRA Rg.l 16 2 get imag A exp
NEG Rg 7 1 scale exp
ADD R9'Rn 4 1 store exp
LD
.SUMIM.Rg 14 3
LD R?,.AIE 12 3 B > A, neg.
INC R7,l 13 2 get imag A mant.
LD
.SUMIE,Ry 14 3 get imag B mant
JP END 10 3 adj A mant
NEG R7 7 1 scale B mant.
LD Rg,.AIM 12 3 neg A
LD Rn,.BIM 12 3 A + B
SDA RgjR7 15 2 store mant.
SRA Rg.l 16 2 get B exp
SRA Rirl 16 2 scale exp
NEG Rg 7 1 store exp
ADD RirRg 4 1
LD
.SUMIM.R^
14 3
LD Ry,.BIE 12 3
INC R7,l 13 2
LD
.SUMIE,Ry 14
2
RTN 13 1
SUBROUTINE FPMULT 109
Lable Instruction Cvcles Memory Locations Comments
OUT RTN 13 1
FPMULT LD Rr- ARM 12 3 load real mant. A
LD R3,. AIM 12 3 load imag. mant. A
LD R5,. BRM 12 3 load real mant. B
LD R7,- BIM 12 3 load imag mant. B
MULT Rr- BRM 74 3 real A x real B
LD R8,. ARE 12 3 load real exp A
ADD Rg,- BRE 12 3 add real exp B
LD .ARBRE, Rg 14 3 store sum
REAL MULT R3,. BIM 74 3 Imag A x Imag B
LD Rg> AIE 12 3 load imag exp. A
ADD Rg,- BIE 12 3 add imag. exp. B
LD .AIBIE, R? 14 3 store sum
SUB R8, Rg 4 1 find the larger exp
JP N, NEG 10 3 real prod, larger
SDLL RR2, R8 18 2 shift imag. mant.
SRAL RR2, 1 16 2 scale iman mant.
SRAL RR0, 1 16 2 scale real mant.
ADDL RR0, RR2 8 1 add mantissas
LDL .MSPL, RRQ 17 3 store real mant.
ADD Rg,. ARBRE 12 3 load real exp
INC Rg, 1 13 2 scale exp
LD .MSP ,R8 14
3 store real exp
CALL SHIFT32 20 3
110
FOR LD RR,. MSP 12 3 get norm, real
get norm, real mant.
store real exp
store real mant.
NEG
8
LDL RR .. MSPL
0
15 3
LD .RPDE, Rg 14 3
LD .RPDM, RQ 14 3
JP IMAG 20 3
NEG R8 7 1
SDLL RRo, Rg 18 2
SRAL RR0, 1 16 2
SRAL RR2, 1 16 2
ADDL
o' RR2 8 1
LDL .MSPL, RR
0
17 3
LD Rg,. AIBIE 12 3
INC R8, 1 13 2
LD MSP, Rg 14 3
CALL SHIFT32 20 3
JP FOR 20 3
MULT R5?. AIM 74 3
LD R8,. AIE 12 3
ADD Rp, BRE 12 3
LD
,AIBRE,Rg
14 3
MULT R7,. ARM 74 3
LD Rg,.ARE 12 3
ADD Rg,. BIE 12 3
LD .ARBIE, Rg 14 3
SUB Rg, Rg 4 1
JP N, NEGX 20 3
SDLL RR6, Rg 18 2
SRAL RR6, 1 16 2
imag product larger
shift real mant.
scale real mant.
scale imag. mant
add mantissas
store real mant.
load imag. exp.
scale exp.
store real exp.
IMAG ^, Real B x Imag A
load imag. exp. A
add real exp. B
store sum
Image B x Real A
load real exn A
add imag exp. B
store sum
find larger exp.
no jump, AIBR larger
shift ARBI mant.
scale ARBI mant.
Ill
SRAL RR4, 1 16
ADDL RR4, RRg 8
LDL
.MSPL,RR4 17
LD Rg, .AIBRE 12
INC Rg, 1 13
LD .MSP, Rg 14
GAMMA CALL SHIFT32 20
FIB LD Rg,. MSP 12
LDL RRQ,. MSPL 15
LD .IPDE, Rg 14
LD
.IPDM,Ro
14
x8
SDLL RR4,Rg
SRAL RR4, 1
JP OUT 20
NEGX NEG R0 7
18
16
SRAL RRg, 1 16
ADDL RR4, RRg 8
LDL .MSPL, RR4 17
LD Rg,.ARBIE 12
INC R8,l 13
LD .MSP, Rg 14
JP GAMMA 20
2
1
3
scale AIBR mantissa
add mantissas
store mantissa
3 get AIBR exp.
2 sclae exp.
3 store exp.
3
3 get norm imag exp.
3 get norm imag mant.
3 store imag. exp.
3 store imag mant.
3
1 ARBI larger
2 shift AIBR mant.
2 scale AIBR mant.
2 scale AIBR mant.
1 add mantissas
3 store mantissa
3 get ARBI exp.
2 scale exp.
3 store exp.
3
Label
SUBROUTINE SHIFT32
Instruction Cycles Memory Locations Comments
112
SHIFT32 LD R3,.MSP 12 3 get exponent
LDL RR4,,MSPL 15 3 get mantissa
LD .SCNT,*1 17 3 SCNT = 1
RESFLG,C 7 1 clear carry
SLAL RR4,1 16 2 shift mant. left
JP C,POS 10 3 carry clear, pos. no
DEC R3,l 13 2 adjust exp.
JP NEG 10 3 negative no.
POS SLAL RR4,1 16 2 shift mantissa
JP S,ONE 10 3 carry set
DEC R3,l 13 2 adjust rxp.
INC .SCONT,l 16 3 incre. bit count
CP .SC0NT,32. 17 4 all bits shifted?
JP Z,OUTl 10 3 yes
JP POS 10 3 no
ONE RRL RR4,2 8 1 shift back mant.
INC R3,l 13 2 adjust back exp.
JP OUT2 10 3
0UT1 CLR Rl 4 1
0UT2 JP HOP 10 3
NEG DEC R31 31 2 neg. no. , adj. exp.
SLAL RR4,1 16 2 shift mant.
JP S,MORE 10 3 another bit set
RRL RR4,1 8 1 a zero, shift back
SETFLG,C 7 1 set carry
INC R3,l 13 2 adj. back exp.
113
shift back sign bit
MORE DEC R-,,1 13 2 no zero, adj. exp.
incre. bit count
all bits shifted
yes
no
OUT3 ADD R.,,32. 11 2 restore exp
restore to -1
2's compliment
HOP LDL .MSPL,RR, 17 3 output mant.
output exp.
RRL RR4,1 7 1
JMP 0UT2 10 3
rl
INC .SC0NT,1 16 3
CP .SC0NT,32. 17 4
JP Z,0UT3 10 3
JP NEG 10 3
3
CLRL RR4 5 1
COML RR4 7 1
.MSPL,RR4
LD
.MSP,R3
14 3
RTN 13 1
SUBROUTINE SHIFT
114
Label Instructions Cycles Memory Locations Comments
SHIFT RESFLG,C 7 1 clear carry
LD R *1K1Q, 1 16 3
LD
.CONT,R10
14 3 bit count = 1
SLA V1 8 1 shift mant. left
JP CPOS 10 3 sign bit 0, pos.
DEC R151 13 2 dec. exp.
JP NEG 10 3 sign bit 1 , neg
POS SLA V 8 1 shift mant. legt
JP S,ONE 10 3 bit set
DEC Rpl 13 2 0, dec. exp.
INC .CONT,l 16 3 inc. bit count
CP .C0NT,*15 17 4 16 shift?
JP Z,OUT 1 10 3 yes
JP POS 10 3 no
ONE RR Ro>2 7 1 1 , shift backtwice
JMP OUT 2 10 3 word normalized
OUT 1 CLR Rl 4 1 clear exp.
OUT 2 RTN 13 1
NEG DEC Rrl 13 2 adj. exp.
SLA Ro>] 8 1 shift mant.
JP S,MORE 10 3
RR V 6 1
INC Rr2 13 2
SET FLG 7 1
RR Ro'] 6 1
JMP OUT 2 10 3
MORE
0UT3
INC .CONT,l 16
CP CONT,*}5 17
JP Z,0UT3 10
JP NEG 10
ADD Rp16. n
CLR Ro 4
COM Ro 4
RTN 13
115
3
4
3
3
2
1
inc. bit count
16 shift
yes
no
restore exp
restore the
mantissa
116
The following program listing shows that area of code in Architecture
II that differs from Architecture I. Refer back to the code for
Architecture I to locate the area of change.
The following code is used by Architecture II to
address the wft coefficients.
117
FFT3
FFT2
LD GCNT..ONE
LD R5..N
SRA R5,l
LD
.IN.Rg
LD R7,.GCNT
LD Rg.l
SDA RgR7
LD
-LE,Rg
SRA Rg,l
LD
.LEI.Rg
LD .JCNT..0NE
LD Rg,.IN
SRA R9J
LD
IN.Rg
CLR .AD
LD R15,.JCNT
LD
.ICNT,R15
CLR R2
LD R3,.AD
SLAL RR2,1
LDL RR0,.WTABLE
ADDL
0 C
LDL RR6,.UTABLE
17 4
12 3
16 2
14 3 IN - N/2
12 3
7 2
15 2 LE = 2G
14 3
16 2 LEI LE/2
14 3
17 4 J *- 1
12 3
16 2 IN-IN/2
14 3
14 3 AD r 0
12 3
14 3 I- J
4 1
12 3
16 2
15 3 Form Address
8 1 Pointers
15 3
118
FFT
LD R15>1 7 2
LDIR RR6,RR0,R15 30 2
LD R3,.ICNT 12 3
Code for butterfly computation
u*-w, ADN
,ADIN
NEND
JP P, FFT
INC .JCNT,1
LD R15,.LEI
CP R15,,JCNT
JP P,,ADIN
INC .GCNT.l
LD R15,.M
CP R15,.GCNT
JP P,FFT3
JP NEND
LD R^.IN
ADD R10,.AD
LD
.ADfR-jQ
JP FFT2
INC
HCNT.l
LD R15,.N
CP R15,.HCNT
JP P.NXRW
10 3
16 4 J*-J + 1
12 3
12 3 J < LEI
10 3
16 2
12 3
12 3 G < v
10 3
10 3
12 3
12 3 AD*- AD + IN
14 3
10 3
16 2 H H + 1
12 3
12 3 H < N
10 3
This program code is a baseline for Architecture IV which uses
fixed-point arithmetic and a hardware parallel comples multiplier.
Program Code for a 2D-DFT with fixed point arithmetic
120
NXRW
BRO
LD
,HCNT,.ONE 17 4 H = 1
LD Rg,.N 12 3
LD Rn,.HCNT 12 3
DEC R.l 4 1 H - 1
MULT R1TR9 70 3 (H-l)N
SRA Rg.l 16 2
LD
.NV2,Rg 14 3 NV2^ N/2
LD Rg,-N 12 3
DEC Rg.l 4 1
LD
.NMl,Rg 14 3 NM1 --N - 1
LD JCNT,.0NE 17 4 J= 1
LD ICNT,.0NE 17 4 I = 1
LD R-,2,.ICNT 12 3 Bit Reversal Begins
CP R]2,.JCNT 12 3 Here
JP P, FIVE 10 3 I > J
LDL
.LM1,RR1Q 17 3 LM1 (H-l)N
LD Rg,.JCNT 12 3
CLR R8 7 1
ADDL RR10, RRg 8 1 J + L
SLAL RR]0 16 2
ADDL RR10,,XTABLE 15 3 generate data
LDL RR14,.TEMPTBL 15 3 address
LD V1 7 2
LDIR RR14'RR10'Ro 30 2 T +- X(J+L)
LDL RR2,.LM1 15 3
121
CLR R.
LD Rp.ICNT
ADDL RR9,RR
^ o
SLA RR2, 1
ADDL RR2,.XTABLE
SUBL RR1Q,2
LD Ro>]
LDIR RR10,RR2,RQ
SUBL RR]4,2
SUBL RR2,2
LD V
LDIR RR2'RRi4'R0
FIVE LD R0,.NV2
SIX CP R0,.JCNT
JP P, SEVEN
LD Rp.JCNT
SUB Rl>Ro
LD
.JCNT,R1
SRA V
JP SIX
5EVEN ADD R0,.JCNT
INC ,ICNT,1
LD
.JCNT,R0
LD R15,.N
LDL RR10,.LM1
4 1
12 3
8 1 I + L
19 2 generate data
15 2 address
14 3
7 2
30 2 X(J+L)^X(I+L)
14 3
14 3
7 2
30 2 X(I+L) ^T
12 3 K -N/2
12 3 K > J
10 3
12 3
4 1 J *. J - K
14 3
16 2 K K/2
10 3
12 3 J - J + k
16 3 I + I+]
14 3
12 3
15 3
122
CP R15,.ICNT 12 3 I > N - 1
JP P,BRO 10 3 End of Bit Reversal
The lines of code between the label BRO and the end of this listing
on this page correspond to process block A.
The Butterfly Computation Begins Here, Process Block B
123
FFT3
FFT2
FFT
LD .GCNT..0NE
LD R5,.N
LD
.IN,R5
LD R7,.GCNT
LD Rg,l
SDA Rg,R7
LD
LE,Rg
SRA Rg>l
LD
.LEI.Rg
LD JCNT,.0NE
LD Rg,.IN
SRA Rgl
LD
IN,Rg
CLR .AD
LD R15,.JCNT
LD
.ICNT,R15
CLR R2
LD R3,.AD
SLAL RR2,1
LDL RRQ,.WTABLE
ADDL KK_ , KK/->
0 2
LDL RR6,.UTABLE
LD "iS'1
LDIR RRg,RRQ,R15
LD R3,.ICNT
17 4 G^-l
12 3
14 3 IN^-N/2
12 3
7 2
15 2 LE = 2G
14 3
16 2 LEI ~ LE/2
14 3
17 4 J ^ 1
12 3
16 2 IN*-IN/2
14 3
14 3 AD = 0
12 3
14 3 I- J
4 1
12 3
16 2
15 3 Form Address
8 1 Pointers
15 3
7 2
30 2 u*-wNAD
12 3
124
CLR R2
ADDL RR2,.LEI
LDL
IP,RR2
ADDL RR2,.LMI
SLRL RR2,1
CLR R8
ADDL RR4,.XTABLE
LD V1
ADDL RR* ,RRp
LDL RRg,. ATABLE
LDIR RR6,RR4,Ro
LD V
LDL RRg,.BTABLE
LDL RR4,.UTABLE
LDIR RR6,RR4,R0
LD R0,l
LDL RR4,,TEMPTBL
LDL RR6,.PRDTBL
LDIR RR4,RRg,RQ
SUBL RR4,2
CLR R10
LD Rn,,ICNT
ADDL RR10,.LMI
SLAL RR10,1
ADDL RR10,.XTABLE
LD R12,(? RR10,Rg
LD R14,@RR4,R8
SUB R12'R14
4 1
18 3
17 3
18 3
16 2
4 1
18 3
7 2
8 1
15 3
30 2
7 2
15 3
15 3
30 2
7 2
15 3
15 3
40 2
14 3
4 1
12 3
18 3
16 2
18 3
11 2
11 2
8 1
Ip<- I + LEI
Form Address
Pointers
X(IP+L)
B - U = W,
AD
N
T - X(IP + L)-U
Form Address
Pointers
to Real Part
RR
12 X(I+L)
RR
14
125
ADDL RR4,1 14 3 Form Address Pointer
ADDL RR10,1 14 3 to Imaginary Part
LD R14'@RfWR8 11 2 RR14 X(I+L)
LD R2,@RR4,R8 11 2 RR2 XT
SUB R]4> R2 8 1 X(I+L) - T
LDL RRg, .IP 15 3
ADDL RRg,.LMI 15 3 IP + L
SLAL RR6,1 16 2 Form Address
ADDL RR6,.XTABLE 18 3 Pointers
LDL
SMR, R12 14 3 Store R[X-T]
LDL
SMI, R]4 14 3 Store I[X-T]
SUBL RR4,2 14 3 start address of
LD V1 7 2
temptbl
LDIR RRg,RR4,R0 30 2 TEMP + X(IP+L)
SUBL RR4,2 14 3
LD R3,.ICNT 12 3
CLR R2 4 1
ADDL RR2,.LMI 18 3 I + L
SLAL RR2,1 16 2 Form Address
ADDL RR2,,XTABLE 15 3 Pointer
LDL RR6,,PRDTBL 15 3
LDL RR4,,TEMPTBL 15 3
LD Ro> ] 7 2
LDIR RR4,RRg,RQ 30 2 Temp*. X(IP+L)U
SUBL RR4,2 14 3
LD R-|2,@RR4,Rg 11 2 RR2^ R[Temp]
LD R-j4,RR2,Rg 11 2 RR2^ R[X(I+L)]
ADD R14, R12 8 1 R[X(I+L) + T]
126
RR2,1ADDL
LD R12,@RR4,Rg
LD R10,(3RR2 ,Rg
ADD R]2, R]0
LD .SMR, R,T
LD .SMI, R]
SUBL RR2,1
SUBL RR4,1
LD V
LDIR RR2,RR4,R
LD R15,.ICNT
ADD R15,.LE
LD .ICNT, R]5
LD R]5,.N
CP R]5,.ICNT
JP P, FFT
INC .JCNT,1
LD R]5,,LEI
CP R15,.JCNT
JP P..ADIN
INC .GCNT,1
LD Ri5"M
CP R15,.GCNT
JP
JP
.ADIN LD
ADD
LD
P,FFT3
NEND
R-,0,.IN
R10,-AD
14 3
11 2 RR12^I [T]
RR10 - I[X(I+L)]
,AD,R10
11 2
8 1 I[X(I+L) + T]
14 3 TEMP^X(I+L) + T
14 3
I4 3 add. of .XTABLE
I4 3 add> of .TEMPTBL
7 2
30 2 X(I+L) + TEMP
12 3
12 3
14 3 I -I + LE
12 3
12 3 I < N?
10 3
16 4 J^J + 1
12 3
12 3 J < LEI
10 3
16 2
12 3
12 3 G < v
10 3
10 3
12 3
12 3 AD^-Ad + IN
14 3
127
NEND
JP FFT2
INC .HCNT.l
LD R15,.N
CP R15,.HCNT
JP P,NXRW
10 3
16 2 H H + 1
12 3
12 3 H < N
10 3
Process Block B ends just before the lable NEND.
128
The following code transposes the pixel array.
This is Process Block C.
XPSN LD R10,*l 7 2
3
3
HLP LD RR.HCNT 12 3
KLP LD R0,.KCNT 12 3
3
LD
.HCNT,R]3 14
LD
.KCNT,R]3 14
3
CP R3, R] 4
JP N, INKCH 10
DEC R3, 1 4
MULT R3,.N 74
CLR Ro 4
ADDL RR2 , RRQ 8
DEC Rrl 4
MULT Rr.N 74
LD R5,.KCNT 12
CLR R4 4
ADDL RRQ,RR4 8
CP RR0RR2 8
JP Z, INKCH 10
LD R7, *1 7
LDL RRg,,XTABLE 15
SLAL RR2,2 19
ADDL RRp,RRo 8
LDL RR1Q,,TEMPTBL 15
LDIR 10*8' 7
30
LD R7,*l 7
1
3
1 K-l
3 (k-l)N
1
1 RR2 ^H + (K-l)N
1 M-l
3
3
1
1 RRQ^K + (H-l)N
1
3
2
3
2
1 RRg hai address of
3 X(H+(K-1)N)
2 T^X(H+(K-1)N)
2
129
INKCH
LUL KKg,.XIABLE 15 3
ADDL KKp , KKn 8 1
LDL RR10,.XTABLE 15 3
SLAL RR0,2 19 2 address pointer
ADDL RRio,RRo 8 1 formed
LDIR RRg, RR. q,R_ 30 2 X(H+(K-1)N)*X(K+(H-1)
LD R7, *l 7 3
LDL RRg,. XTABLE 15 3
ADDL RRg , RRq 8 1
LDL RR10,. TEMPTBL15 3
LDIR RRg , RR. q , R_ 30 2 X(K+H-1)N) T
INC .KCNT.l 16 3
LD R13,.N 12 3
CP R13,.KCNT 12 3
JP P, KLP 10 3
INC .HCNT.l 16 3
JP P, HLP 10 3
JP 0NDM 10 3
130
This is the code for the butterfly computation of Architecture V,
131
The Program Code for the Butterfly Computation in Architecture V.
EFT LD R3,.ICNT
CLR R2
ADDL RR2,.LEI
LDL
.IP,RR2
ADDL RR2,.LMI
SLRL RR2,1
ADDL RR4,.XTABLE
CLR M
LD R14,(3RR4,R1
SRA R14,l
LD RRg,.UTABLE
LD R12,@RRg,R1
MULT R12R]4
LD
.TEMP,R12
ADDL RRg,l
LD R12,@RRg,R1
MULT R12>R14
LD
'
,TEMP+2,R12
INCL RR4,1
LD R14,@RR4,R1
SRA RH,1
LD R12,(?RRg,R1
MULT R12,R14
LD .TEMP+1, R]2
SUBL RRg,l
12 3
4 1
15 3 IP I + LEI
17 3
18 3 IP + L
16 2 Form Address
18 3 Pointers
4 1
11 2 R14 R[X(IP+L)]
16 2 scale
18 3
11 2 R12 R[U]
70 1 R[X(IP+L)]-R[U] R12
14 3
14 3
11 2
70 1
14 3
16 2
11 2 R14 I[X(IP+L)]
16 2
11 2 R12 I[U]
70 1 RR12 I[X(IP+L)] I[U]
14 3
14 3
LD R12,@RRg,R1
MULT RR12,R]4
LD
.TEMP+3,R12
LD R4,.TEMP
ADD R4,.TEMP+1
LD
.PRDTBL,R4
LD R4,.TEMP+2
ADD R4,.TEMP+3
LD
.PRDTBL+1,R4
CLR R]0
LD Rir.ICNT
ADDL RR]0,.LMI
SLAL RR1Q,1
ADDL RRQ,.XTABLE
LD R12,@RR1Q,R1
SUB R12,.PRDTBL
ADDL RR-io'1
LDL R14,@RR1Q,R1
SUB R14,.PRDTBL+1
LDL RRg,. IP
ADDL RRg,.LMI
SLAL RRg,l
ADDL RRg,.XTABLE
LDL RR4,.ADSPRD
LD R0,l
.ADSI,R12
.ADS2,R]4
LDIR RRg,RR4,R0
132
11 2
LD
LD
70 1 RR -hI[X(IP+L)].R[U
14 3
12 3
12 3
14 3 R[X(IP+L)-U]
12 3
12 3
14 3 I[X(IP+L) U]
4 1
12 3 Form Address
15 3 Pointer
16 2
18 3 Ad. X(I+L)
11 2
12 3
14 3
11 3
12 3 I[X(I+L) - T]
15 3
18 3 IP + L
16 2 Form Address
18 3 Pointer
15 3
7 2
14 3
14 3
30 2 X(IP+L) X(I+D-
133
SUBL RR10,1 14 3 RR1Q ADD ofRX(I+L)
LD R-|2,(9RR.
g,R-| 11 2
ADD R12,.PRDTBL 18 3 R[X(I+L) + T]
LD
.ADSI,R12 14 3
ADDL RR1Q,1 14 3 ADD of I[X(I+L)]
LD R-i p @RR-i qR-j 11 2
ADD R12,.PRDTBL+1 18 3
LD
.ADSI,R12 14 3 I[X(I+L) + T]
SUBL RR10,1 14 3 ADDR of X(I+L)
LD V1 7 2
LDL RR4,.ADSPRD 15 3
LDIR RR10,RR14,R0 30 2 X(I+LhX(I+L) + T
LD R15,.ICNT 12 3
ADD R15,.LE 12 3
LD
.ICNT,R]5
14 3
LD R15,.N 12 3
CP R15,.ICNT 12 3
JP R, FFT 10 3
Total cycles 1200
2xNxl^xvx 1200 = 1.25829 x
1010
cycles for N = 1024
3145 sec .87 hr
134
APPENDIX B
A general two-dimensional linear transform is of the following
form.
1 rA\
q(uru2) =\ \ p(n-,,n2) e(n] ,n2,u1 ,u2) (B.l)
n-j
= l n2 =1
where p(n,,n2) is an element of the input array and 6(n, ,n2,u, ,u2) is an
element of transformation and q(u-,,u2) is the output array.
The input array can be assumed to be N2 x N. and represented by P.
P(n-|,n2) P (B.2)
Any array can be rewritten into a vector by a procedure called column
scanning. That is to say that the elements of an N2 x N-| array can be
reorganized into an N2N-. x 1 vector. This operation can be represented
mathematically as shown below,
Nn
-EVp V, (B.3)
n = 1
where V. is N] x 1 operational vector
135
Vn =
1
IL1
n-l
n
n+1
andNN is a N^-, x N2 operational matrix
Nn =
o
I
Q)
n-l
n
n+1
(B,4)
Q) is the null matrix
Jj_ is the identity matrix
(B.5)
and p is the N2 x N] image array and p is the N^ x 1 image vector.
If the transformation is separable as in the case of 2D-DFT the
following will hold
N2 N]
q(uru2) = P(nlfn2) Wfr^n^.u,,)
n2=l n-,=l
N2 Nl
Y Y p(nl'n2 W(nrui) W(n2,u2)
n2=l n.=l
<- i
y W(n2,u2) y p(nrn2) W(n1,u1)
n2=l n^=l
(B.6)
n2=l
If we use matrix notati
d.
y W(n2,u2) T(n2,u])
on.
|[p(n1,n2)_
[W(n1,u1)|
|w(n2,u2)
T(n2,u-,)
P , N2 x N1
14 R Nl X Ul
W L' U2 X N2
T ' N2 X Ul
the following equations can be written.
T- PW
Q =
R
(B.7)
(B.8)
137
The two-dimensional discrete Fourier transforms can be performed
on an image {P after it has been converted into a vector p by column
scanning. This operation can be represented by the equation
=Wi (B.9)
where p is an image vector, q is the output vector of the DFT operation
described by
It can be shown that the one-dimensional operator \^) is related
to the two-dimensional operators fy]}, and "j^JJ R by
M -- hDLh8R (B.10)
where 0 represents the left direct product operator.
ma
The left direct product of an N-| x N2 matrix $\ and an N3 x N4
trix B is an N-, N3 x N2 N4 matrix C defined by
AB = B(1,1)A B(l,2)/\
B(2,1)A. B(2,2)A
B(Nvl)/\
B(1,N4) fi\
B(2,N4)A
B(N3,N4) j\
The elements of the operator matrices of the 2-D DFT are given by
[Tw w=
n^ = D4 = .X_
IL
W
W
w
w
2
W
W
w
N-lW
2(N-1)
when we assume square arrays, NjtN and W*
W
W(N-1)
-j^xJ N x
1
(B.
139
APPENDIX C
Error Analysis
When digital computations are used in signal processing or picture
processing, errors result from inherent characteristics of digital
processes. In the macroscopic world all things are continuous in time
and space. When digital processing is used, continuous signals must be
quantized. Because a given range of signals is quantized into a finite
number of discrete levels, the digital representation of a signal is not
exactly accurate. This results in "quantization
error." Quantization
error is present in all digital processing systems.
Another source of error exists in digital computations. This is
the effect of fixed register length. In any computational machine
the accuracy with which a number can be represented is fixed, regardless
of the radix of the computation, For instance, multiplication of
two sixteen bit binary numbers produces a thirty-two bit binary product.
The computer may only be able to retain 16 bits of this product. This
results in what is called truncation error. Some machines will round
off the result to the nearest least significant bit of the machine. This
is called rounding error.
A number, x, when represented in a finite length register can take
on a finite number of quantized values Q(x). The relationship between
x and Q(x) is shown in Figure C-l for two's complement truncation.
140
2's Complement Truncation
Q(x)
0 > Qtx) - x >
2"
Figure C-l
Here 1 + b is the number of binary bits in a register and 2"b
is the least significant bit,
For the case of two's complement rounding, the relationship is
shown in Fig. C-2
2's Compliment Rounding
Q(x)
o-b
Figure C-2
2-2"b< Q(x) x < \ 2"b
In the case of floating-point calculations, the error is carried in the
mantissa. The characteristic or exponent carries no error but it effects
141
the weight of the least significant bit. Thus for floating-point
operations using truncation the error range is
0 * Q(x) - x >
2C 2"b
(CI)
and for rounding it is
r
?"b
r
?~b
2C. ^- < Q(x) - x < 2C- ^- CC2)
where 2 is the weighting factor of the characteristic.
In FFT calculations, complex multiplications are generally
performed. These consist of four real multiplications each introducing
error. Oppenheim and Weinstein have made an error analysis for FFT
calculaitons. They assumed multiplication could be represented as a
point introducing white noise into a calculation. With this assumption
each node in an FFT calculation had a additive white noise source. [8]
4 Point FFT
T
3-
Each 0 represents
a white noise source
Figure C-3
It was assumed that each noise source was uncorrected to the
rest. The mean squared variance resulting from the 2's complement
representation of a value x is (for both trunication and rounding)
2 1 9-2b
, N
aE
=
72 2 . (C3)
Since complex multiplication involves four real multiplications, the
complex variance is
0
2
= -L
2'2b (C4)
aB 3 L
Since it was assumed that all noise sources were uncorrected, the
2
multiplicative noise variance at any output node is equal to ag
times the number of noise sources that propagate to that node. This
leads to
= = (N-l) (C.5)
142
for an N point FFT, When N is large
N or
These are for fixed point FFT operations.
(C6)
The generalized butterfly equation of the FFT can be written as
Vi x (il - w "n01
(C.7)
143
To protect against overflow,
max I |yi)| , JXm (j)| J
< ^x |Xm+1(i)| , |Xm+1(j)| J (C.8)
and max [ 1X^(1)1 , |Xm+1(j)| J
< 2 max I jXm(i)| , | yj)| ]
In light of these constraints, we must scale the input to the FFT
calculation for fixed point computation according to the relation
N-l
lX<k>U <- W"'l a* 1^1 =Nlx(nLax <>
n=0
Thus the input sequence is to be bounded so that |x(n)| < -rr-
This scaling can be done all at once in the beginning or the values
into each stage of the butterfly computations can be multiplied by 1/2.
Since a N-point FFT has v stages of butterfly computations (N = 2V),
the scaling is the same for the two cases.
Oppenheim and Weinstein assumed that x(n) was complex with uniformly
distributed real and imaginary parts in (-l/\/2 N, 1/72 N). This led
to a output variance of
144
a
2
=
Ui,m2
- m _ 2 _ mLm,2 _ 1[x(k)r =Na/ = N|x(n)r = 4 (CIO)
This leads to a noise- to-signal ratio of
\ - 3N2aB2 . N22"2b f0'11'
when all scaling is done at the input. The noist-to-signal ratio
can be improved if the input is scaled by 1/2 at each stage of the FFT
computation. Oppenheim and Weinstein have calculated the improved ratio
to be
^2 = 6 N = 5B2"2b (C12)
1
The important point to note is for fixed-point FFT computation, when
the input is scaled all at once in the beginning, the noise-to-signal
2
ratio is proportional to N . When the scaling is done in stages, the
noise-to-signal ratio is proportional to N.
When floating-point arithmetic is used, addition also introduces
error because the characteristics of two numbers must be equal in order
to add them. Several models have been developed by various authors to
analyze the floating-point problem. Oppenheim and Weinstein give the
following for noise-to-signal ratio when the input signal is assumed
white.
145
= 4 v = -j-
2"2b
log2 N (C.13)
It can easily be seen that the noise-to-signal level is further
reduced using floating-point arithmetic.
These formulae of Oppenheim and Weinstein did not account for
the fact that multiplication by 1 can be performed noiselessly. In
1976 Tran-Thong and Bede Liu developed detailed formulae for fixed-point
FFT computation error. The following are presented for decimation-in-time
algorithms. [11]
Rounding, all scaling at the beginning,
Expected value = }/} error of point p = e(p)
I e(p)2} . n _N_ _N_ 3NP ~ ' 4 ' 2 ' 4 (C14)
^TpFT
- 1 otherwise
M
where P = )_, Pi 2
i=l
i-1
;(p) = 2 + min ( j: Pj = ij
N-l
E fe<P>2}
p=0
v ;
,-2b
M
m 4
Rounding, scaling done in steps
DFT[']n N-Point DFT of the sequence
(DFT f] )p Ptn coefficient of the DFT
(C.15)
[e(p)| = 2__ Q + jKDFT^2g(n) n=0 ... N-l-
where g(n) = f M n = 0
M + 1 - max |m;njii = H otherwise
N = 2M
[e(p)2j = |[e(p)j| +1^ N(N.1} (CJ6)
+ ' 0 0 _N_ J_ 3NP u' 4 ' 2 ' T
,-2b
-j- N2N - 2s (pb otherwise
N-l
p=0 [e(p)2j = 2~2b [f N3-(2 + M + ^-M) N2] (C.17)
147
APPENDIX D
The following tables provide for a means of comparing the
efficiency of the various architectures considered in this thesis.
The tables indicate how many machine cycles, as a function of the
array size N, are needed to perform a given function. Then N
is assumed to by 1024 and some tangible numbers are developed for
the comparison of the computation time required by each architecture
when executing a two-dimensional discrete Fourier transform
148
Table D.l Architecture I, Machine Cycles
Function Step No. of Machine Cycles
set up spatial dimension counter 1 51 x 2
set up for row counter 2 223 x N x 2
N-2+2v/2
perform bit reversal 3 34 x - 2 x N x 2
N-2v/2
4 439 x ^ x N x 2
5 98 x (N-v-1) x N x 2
6 22 x v x N x 2
7 76 x (N-2) x N x 2
8 29 x N x 2
set up for Aw 9 1166xvxNx2
calculate RAw] 10 535 x L x v x N x 2
11 664 x (L-l) x v x N x 2
12 237 x v x N x 2
calculate IIAw] 13 598 x v x N x 2
14 621 x (L-l) x v x N x 2
15 815xvxNx2
set up for butterfly 16 17xvxNx2
perform butterfly 17 26 x (N-l) x N x 2
18 3108 x -J!- xvxNx2
19 343 x N x (N-l) x 2
20 50 x v x N x 2
check for end of 1st spatial dim. 21 50 x N x 2
check for transpose 22 35
23 236 x y-j- + -j-
24 324 x [^- - 4r
25 50 x
N2
26 36 x N
149
Table p,2 Architecture I, Cycle Time
(Specific values for N = 1024
Step No. of Machine Cycles Cycle Time
1 102 20.4 ys
2 456702 .0913 sec
3 36696064 7.33 sec
4 445939712 89.18 sec
5 161820672 32.36 sec
6 450560 .090 sec
7 159072256 31.84 sec
8 59392 .012 sec
9 23879680 4.77 sec
10 164352000 32.87 sec
11 190382080 38.07 sec
12 4853760 .970 sec
13 12247040 2.45 sec
14 178053120 35.61 sec
15 16691200 3.34 sec
16 348160 . .069 sec
17 54472704 10.89 sec
18 3,258 x
1010
6517.9 sec
ia 718620672 143.7 sec
20 1 024000 .204 sec
21 102400 .0204 sec
22 35 7 ys
23 12385800 24.77 sec
24 169703424 33.94 sec
25 52428800 10.48 sec
26 36864 .00737 sec
150
Table D.3 Architecture II, Machine Cycles
Steps 1 - 8 the same as I
Function
initialize U
perform butterfly
check for end of 1st spatial dim.
check for transpose
perform transpose
Step No. of machine cycles
9 127 x v x N >. 2
10 25 x (N-l) x N x 2
11 3108 x -j- x v N x 2
12 194 x (N-l) >c N x 2
13 50 x v x N x 2
14 50 x N x 2
15 35
(z \
16 236 x y
+ *)
17
In2
324 x ^- - 2
18 50 x
N2
i
19 36 x N
Instead of calculating Awi and then UL+1
= U. x Awi , Ui is looked-up
from a w. table.
151
Table D.4 Architecture II, Cycle Time (N = 1024)
Step No. of Machine Cycles Cycle Time
9 2600960 .520 sec
10 5447204 10.89 sec
11 3.2589 x
1010
6517.94 sec
12 406450176 81.29 sec
13 1024000 .204 sec
14 102400 .0204 sec
15 35 7 ys
16 12385800 24.77 sec
17 169703424 39.94 sec
18 52428800 10.48 sec
19 36864 .00737 sec
152
Table p. 5 Architecture^!!!, Summary
This is the same as II, only a hardware multiplier is used for the floating
point multiplications. Only the butterfly operation has its time affected.
Shift 32 was called twice within FPMULT
FPMULT was called once during each butterfly
1492 total cycles in FPMULT with single shift normalization by
SHIFT32
1616 xvx-pJ- xNx2 = 1.69449 x
1010
= 3388.9 sec
St- = 3716 sec * 1,03 hr
153
Table D.6 Architecture IV, Machine Cycles
Function
set up spatial dimension counter
set up for row counter
perform bit reversal
initialize U
perform butterfly
check for end of 1st spatial dim.
check for transpose
perform transpose
Step vNo. of Machine Cycles
1 51 x 2
2 223 x N x 2
3
/m_? + 2v/2\
Jt -XI o 1 A IN A
4
\ /
/N-2v/2\0AA Y Y N Y nJHt A 1 r\ 1 A \\ A C
5 22 x v x N x 2
6 78 x (N-v-1) x N x 2
7 76 x (N-2) x N x 2
8 29 x N x 2
9 96 x v x N x 2
10 26 x (N-l) x N x 2
11 1035 x-^- xvxNx2
12 160 x CN-D x N x 2
13 50 x N x 2
14 50 x N x 2
15 35
16
,
(n2
. n\
236 x l-p- + -p 1
17 252x(^- 1)
18 50 x
N2
19 36 x
N2
154
Table D.7 Architecture IV, Cycle Time
(Specific values for N = 1024
Step No. of Machine Cycles Cycle Time
1 102 20.4 ys
2 456702 .0913 sec
3 36606064 7.33 sec
4 3494377952 69.88 sec
5 450560 .09 sec
6 161820672 32.36 sec
7 159072256 31.84 sec
8 59392 .012 sec
9 1966080 .393 sec
10 5447204 10.89 sec
11 1.085329 x
1010
2170.6 sec
12 33521664,0 67.04 sec
13 1024000 .204 sec
14 102400 .0204 sec
15 35 7 ys
16 12385800 24.77 sec
17 13ia91552 26.39 sec
18 52428800 10.48 sec
19 36864 .00737 sec
Yt. = 2472 sec * .686 hr
155
Table p. 8 Architecture V, Summary
All operations are the same as IV except step 11 where we now have
programmed multiply.
2346 x ~ xvxNx2 = 2.4599 x
1010 4919.9 sec
Iv= 302 + 4920 = 5222 sec = 1.45 hr,
156
Table D.9 Architecture VI, Summary
Transposition time by the master processor 75 sec
Processing time by Architecture II 6771 sec
Processing time by Architecture III 3641 sec
Multiprocessor Computation Times
Number of Architecture Architecture
Processors II III
2 3385 1820
4 1693 910
6 846 455
8 423 227
16 212 114
32 105 57
64 52 28
128 26 14
Using a hardware parallel multiplier to do just floating-point
multiplication is roughly equivalent to doubling the number of processors.
References
1. J.W. Cooley and J.W. Turkey, "An Algorithm for the machine
calculation of complex Fourier Series," Mathematics of Computers,
Vol. 19, pp. 297-301, April 1965.
2. S.C. Danielson and C. Lanezos, "Some improvements in practical
Fourier analysis and their application to X-ray scattering from
liquids," J. Franklin Inst., Vol. 233, pp. 365-380 and 435-452,
April 1942":
3. C. Runge, Zeit fur Math, und Physik, Vol. 48, p. 443, 1903.
4. C, Runge, Zeit fur Math, und Physik, Vol. 53, p. 117, 1905.
5. C. Runge and H, Konig, "Die Grundlehren der Mathematischen
Wissenschaften," Vorlesungen urber Numerisches Rechnen, Vol. 11,
Berlin: Julius Springer, ia24.
6, L.H, Thomas, "Using a computer to solve problems in physics,"
Application of Digital Computers, Boston, Mass, Ginn, 1963.
7. J.W, Cooley, Peter A.W. Lewis, and Peter D. Welch, "Historical
Notes on the Fast Fourier Transform," IEEE Transactions on
Audio and Electroacoustics, Vol. AL1-15, No. 2, June 1967.
8, A.V, Oppenheim and C.J, Weinstein, ''Effects of Finite Register
Length in Digital Filtering and the Fast Fourier
Transform,"
Proceedings of the IEEE, Vol. 60, No. 8, August 1972,
9. Signal Processing System, SPS-61 Array Processor Data Sheet.
10. Floating Point Systems, AP-120B Array Processor Data Sheet,
Rev. 1, March 1979.
11. Tran-Thong and Bede Lui, "Fixed-Point Fast. Fourier Transform
Error Analysis," IEEE Transactions on Acoustics, Speech, and
Signal Processing, Vol, ASSP-24, No. 6, December 1976,
158
BIBLIOGRAPHY
1. Am Z8001/Am Z8002 Processor Interface, Advanced Micro Devices, Inc. 1979
2. Am Z8010 Memory Management, Advanced Micro Devices, Inc., 1979
3. Am Z8001/2 Processor Instruction Set. Advanced Micro Devices, Inc., 1979
4. Cooley, James W. , Peter A. W. Lewis, and Peter D. Welch, "Historical Notes
on the Fast Fourier Transform," IEEE Transactions on Audio and Electro-
acoustics, June 1967.
"
5. Cooley, James W., Peter A. W. Lewis, and Peter D. Welch, "Application of
the Fast Fourier Transform to Computation of Fourier Integrals, Fourier
Series, and Convolution Integrals," IEEE Transactions on Audio and Electro-
acoustics, Vol. AU-15, No. 2, June 196T
"
6. Gonzalez, Rafael C. and Paul Wintz, Digital Image Processing, Addison-Wesley,
Reading, Mass., 1977.
7. James, David V., "Quantization Errors in the Fast Fourier Transform," IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23,
No. 3, June 1975.
8. Kuo, Franklin F., Network Analysis and Synthesis, Second Edition, John
Wiley & Sons, New York, 1966.
9. MPY-Series Multipliers, Electronic Systems Divison, TRW Inc., 1976.
10. Oppenheim, Alan V. and Ronald W. Schafer, Digital Signal Processing. Prentice-
Hall, Inc., New Jersey, 1975.
11. Oppenehim, Alan V. and Clifford J. Weinstein, "Effects of Finite Register
Length in Digital Filtering and the Fast Fourier
Transform," Proceedings
of the IEEE, Vol. 60, No. 8, August 1972.
12. Pratt, William K., Digital Image Processing, Wiley Interscience, New York,
1978
159
BIBLIOGRAPHY (Continued)
13. Tran-Thong and Bede Liu, "Fixed-Point Fast Fourier Transform Error Analysis,"
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24,
No. 6, December 1976.
14. Tretter, Steven A., Introduction to Discrete-Time Signal Processing, John
Wiley & Sons, New York, 1976.
15. TRW LSI Multiplier-Accumulator, TRW Inc., 1978.
16. Zilog Z8000 Advance Specification, Zi log Inc., 1978.
17. Zilog Z8000 Architectural Overview, Zilog Inc., March 1978.
18. The Z8000 Family of Microcomputer Components, Zilog Inc., 1978.
