Feasibility study, software design, layout and simulation of a two-dimensional Fast Fourier Transform machine for use in optical array interferometry by Boriakoff, Valentin
NASA-CR-196927
7P
Final Report on the NASA FFT Project
Feasibility study, Software Design, Layout and
Simulation of a Two-Dimensional Fast Fourier
Transform Machine for use in Optical Array
Interferometry
Principal Investigator: Valentin Boriakoff
Worcester Polytechnic Institute
August, 1994
NASA Grant Number NAG 5 1138
Covering the Period: June 1, 1989, to September 1, 1994.
(NASA-CR-196927) FEASIBILITY
STUDY, SOFTWARE DESIGN, LAYOUT AND
SIMULATION OF A TWO-DIMENSIONAL
FAST FOURIER TRANSFORM MACHINE FOR
USE IN OPTICAL ARRAY INTERFEROMETRY
Final Report, 1 Jun. 1989 - 1 Sep.
1994 (Worcester Polytechnic Inst.)
7 p
N95-11674
Unclas
G3/74 0024948
https://ntrs.nasa.gov/search.jsp?R=19950005261 2020-06-16T10:11:08+00:00Z
Summary Report
The goal of this project was a feasibility study of a particular architecture for a digital
signal processing machine that operates in real time and computes, in pipeline fashion, the
Fast Fourier Transform of a time-domain sampled complex digital data stream. The
architecture is described in the enclosed paper "FFT Computation With Systolic Arrays -
A New Architecture" (IEEE Trans. on Circuits and Systems-II: Analog and Digital Signal
Processing, 41, p.278, 1994). It uses simple identical processors (called Inner Product
Processors) in a linear organization called a systolic array. By definition, a systolic array
is an organization of such processors in one, two or more dimensions in which the only
communication with the outside world takes place at the edge of the array, and each
processor exchanges data exclusively with its closest neighbours. The system clock is
common: it is applied system-wide to all the processors in synchronism.
Many processing organizations using systolic arrays have been devised (see the enclosed
paper for a partial list and analysis of them). However, many of them are not economical
or do not work in a pipeline fashion: they require large storage memories between
processing stages, including at the input of the first processing stage. In many cases the
stored elements are read in a different order from that in which they were written, so two
memory sets in a ping-pong arrangement are required for faultless access.
The advantages of our proposed organization are multiple. It is very economical in
hardware and requires no stream switching or data rearranging during the processing. It
operates continuously, without interrupting the incoming data stream even when new
blocks of data are applied, and transformed complex sequences exit from the systolic
array at the same rate as the sampled data comes in. The storage elements work in a
sequential fashion, so they could even be replaced with shift registers or FIFOs; no
data resequencing is necessary anywhere.
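The sequential storage described above can be pictured as a fixed-length delay line. The following is a minimal sketch in Python, not the report's hardware: the class name `DelayLine`, the helper `delayed`, and the zero initial fill are assumptions made for illustration. Each sample written in comes out unchanged a fixed number of steps later, in the same order, which is why a shift register or FIFO would suffice.

```python
from collections import deque


class DelayLine:
    """Fixed-length FIFO: push() returns the sample pushed `depth` steps earlier.

    Hypothetical illustration of a sequential storage element; depth must be >= 1.
    """

    def __init__(self, depth, fill=0):
        # The buffer always holds exactly `depth` samples.
        self._buf = deque([fill] * depth, maxlen=depth)

    def push(self, sample):
        oldest = self._buf[0]      # sample about to leave the far end
        self._buf.append(sample)   # maxlen drops the oldest entry automatically
        return oldest


def delayed(stream, depth):
    """Pass a whole sequence through a delay line of the given depth."""
    line = DelayLine(depth)
    return [line.push(x) for x in stream]
```

For example, `delayed([1, 2, 3, 4, 5], 2)` yields the initial fill twice, followed by the first three samples in their original order.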
An interesting application of this processing system arose when the architecture was
considered for the correction of optical distortion caused by turbulence of the Earth's
atmosphere in the viewing volume of a telescope. The distortion would be corrected with
a compensating system. Ultimately the compensating system could be a "rubber" mirror,
that is, an optical mirror whose optical figure is deformed by a set of electromechanical
actuators located on the back of the mirror. Another compensating system could simply
be the selection, for integration, of observing time intervals in which the atmosphere is
minimally perturbed. Another possibility is to invert the process and use the atmospheric
turbulence to perform speckle interferometry, a way of obtaining high angular resolution
observations. It is in this configuration that this project was worked out (G. Chin et al.,
1988).
Data obtained from a 64 by 64 point Charge Coupled Device are applied to a 2D FFT
processor, which can be based on the FFT processor described above operating on a 64
point data block. A detailed description of the processing sequence and scheme can be
found in the Chin et al. paper.
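One standard way to build a 2D transform from a 1D processor is the row-column decomposition: transform every row, then every column of the result. The sketch below assumes that decomposition; the report itself defers the actual processing sequence to the Chin et al. paper. The function names are hypothetical, and a plain DFT stands in for the 64 point FFT processor to keep the example short.

```python
import cmath


def dft(x):
    """Direct 1D DFT (stand-in for the 1D FFT processor)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def dft2_row_column(image):
    """2D DFT as 1D transforms over rows, then over columns."""
    rows = [dft(list(r)) for r in image]        # transform each row
    cols = [dft(list(c)) for c in zip(*rows)]   # transform each column of that result
    return [list(r) for r in zip(*cols)]        # transpose back to [row][column] order


def dft2_direct(image):
    """Direct 2D DFT from the definition, used only as a cross-check."""
    n_rows, n_cols = len(image), len(image[0])
    return [[sum(image[r][c]
                 * cmath.exp(-2j * cmath.pi * (u * r / n_rows + v * c / n_cols))
                 for r in range(n_rows) for c in range(n_cols))
             for v in range(n_cols)]
            for u in range(n_rows)]
```

On a small test image the two functions agree to within floating point rounding, which is the property that lets a 64 by 64 frame be handled by 64 point transforms.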
At the beginning of this project the described algorithm was a theoretical conception,
and although it seemed a fully realistic scheme, no effort had been made to test its
feasibility. This project comprised two phases: a simulation of the whole system on a
computer in which all bits were represented, and a study of the feasibility of implementing
the Inner Product Processor. Both phases were pursued in parallel.
Simulation. Simulation of the architecture was carried out in a computer program
where the individual bits of the digital words of the data streams of the systolic array
were represented separately. Assembled together into digital words, they were used in the
computation of the complex arithmetic of the Inner Product Processor, $C_{out} = C_{in} + A \cdot B$.
Storage delays were represented by arrays of data words. The simulation was carried out
in fixed point arithmetic with 22 bit accuracy, and an FFT machine of 1024 points was
assembled in software. The input signal was a sine wave with a variable amount of Gaussian
noise added. This signal was processed with the architecture simulated in a C program.
An accurate result in the frequency domain, with the expected signal-to-noise ratio, was
obtained. In a second, more complex simulation, a wide band impulse in the time domain
was dispersed according to the inverse frequency squared law of a cold, tenuous plasma
(interstellar plasma). This signal was applied to a simulated 1024 point FFT architecture,
the frequency domain signal went through an all-pass digital filter with a frequency
characteristic opposite to that of the interstellar medium cold plasma filter, and a second
simulated 1024 point inverse FFT machine brought the signal back into the time domain.
Comparison of the input and output signals showed correct operation of the simulated
architecture. This work was done in collaboration with
Sivakumar Maldneni.
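The first test (a noisy sine wave transformed in 22 bit fixed point) can be imitated on a small scale. The sketch below is written in Python rather than the C of the original program and rests on several assumptions not stated in the report: a word layout of one sign bit plus 21 fractional bits, truncation after each multiply, unbounded accumulators, and a direct DFT built from repeated $C_{out} = C_{in} + A \cdot B$ steps instead of the systolic FFT itself. All names are hypothetical.

```python
import math
import random

FRAC_BITS = 21  # assumption: 22-bit word = 1 sign bit + 21 fractional bits


def to_fixed(x):
    """Quantize a real value to a 22-bit two's-complement integer, saturating."""
    q = int(round(x * (1 << FRAC_BITS)))
    return max(-(1 << FRAC_BITS), min((1 << FRAC_BITS) - 1, q))


def fixed_dft_bin(samples, k):
    """One DFT bin computed with integer multiply-accumulate steps."""
    n = len(samples)
    c_re = c_im = 0  # Python integers: accumulator overflow is not modelled
    for t, a in enumerate(samples):  # a is a real-valued fixed-point sample
        angle = -2 * math.pi * k * t / n
        b_re, b_im = to_fixed(math.cos(angle)), to_fixed(math.sin(angle))
        c_re += (a * b_re) >> FRAC_BITS  # rescale the product back to 21 fractional bits
        c_im += (a * b_im) >> FRAC_BITS
    return c_re, c_im


def peak_bin(samples):
    """Index of the strongest bin in the lower half of the spectrum."""
    power = [sum(v * v for v in fixed_dft_bin(samples, k))
             for k in range(len(samples) // 2)]
    return power.index(max(power))


def noisy_sine(n, cycles, amplitude, sigma, seed=0):
    """Fixed-point sine wave with additive Gaussian noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [to_fixed(amplitude * math.sin(2 * math.pi * cycles * t / n)
                     + rng.gauss(0.0, sigma))
            for t in range(n)]
```

With a 64 point block, `peak_bin(noisy_sine(64, 5, 0.5, 0.05))` locates the sine wave in bin 5; raising `sigma` lowers the margin between that bin and the noise floor.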
Inner Product Processor. Two implementations of the Inner Product Processor were
laid out as 2 micron CMOS integrated circuits, one in floating point arithmetic (in
collaboration with Peter DelVecchio and Wei Chen) and the other in fixed point arithmetic
(in collaboration with Emad Afifi). Both used the maximum silicon area that standard
MOSIS chips may have, 7.9 mm by 9.2 mm. Both use standard p-well VLSI design. The
complex arithmetic operations carried out by both implementations are:
$$C_{out,real} = C_{in,real} + A_{real} \cdot B_{real} - A_{imag} \cdot B_{imag}$$
$$C_{out,imag} = C_{in,imag} + A_{real} \cdot B_{imag} + A_{imag} \cdot B_{real}$$
where A is the input data, B the FFT coefficients (twiddle factors), and C the processed
data.
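The two real-valued equations above are one complex multiply-accumulate written out in its real and imaginary parts. A minimal sketch follows; the function name and the use of (real, imag) tuples are illustrative choices, not taken from the chip design.

```python
def inner_product_step(c_in, a, b):
    """Return c_out = c_in + a * b, with each value given as a (real, imag) pair."""
    c_re = c_in[0] + a[0] * b[0] - a[1] * b[1]
    c_im = c_in[1] + a[0] * b[1] + a[1] * b[0]
    return (c_re, c_im)
```

Chaining the step, with each output fed to the next call as `c_in`, accumulates a complex inner product term by term, and the result can be cross-checked against Python's built-in complex arithmetic.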
Floating Point Inner Product Processor. Because the standard format of the digital
words in the whole processing system was the IEEE floating point format (IEEE Standard,
1985), the main design of the Inner Product Processor was to be in this format. This
format is not very convenient for digital adders and multipliers, so the format was
converted inside the integrated circuit for arithmetic convenience, and at the exit of the
system it was converted back into IEEE floating point to conform to the machine
standard. The arithmetic operations are computed in this architecture with one hardware
24 bit by 24 bit multiplier and two hardware adders, one of 49 bits and the other of
8 bits. In addition to the actual additions in the products, the exponents of the floating
point numbers must be added. A 1 µsec operation cycle was the goal for this design; to
meet this goal, four computational cycles of 250 nsec each were implemented. This requires
a 4 MHz clock. Since 64 point transforms were required, the values of the coefficients B of
the FFT algorithm were computed on chip by a special circuit that took into account the
location of this particular Inner Product Processor in the architecture. The location was
encoded by hardwiring inputs to the integrated circuit. To accommodate as many bits in
parallel as possible at the inputs and outputs of the integrated circuit, a ceramic body
with the maximum set of 132 pins was selected.
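The widths quoted above match the fields of an IEEE 754 single precision word: an 8 bit exponent and a 24 bit significand once the hidden leading 1 is restored. The sketch below shows how a product can be formed from those fields with a 24 by 24 bit integer multiply and an exponent addition. It is an illustration of the standard format only, not the chip's internal format, which the report does not describe. It handles normal numbers only, truncates instead of rounding, and ignores zero, subnormals, infinities, NaN and exponent overflow; all function names are hypothetical.

```python
import struct


def unpack_float32(x):
    """Split a normal IEEE 754 single into (sign, biased exponent, 24-bit significand)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    significand = (bits & 0x7FFFFF) | 0x800000  # restore the hidden leading 1
    return sign, exponent, significand


def pack_float32(sign, exponent, significand):
    """Reassemble the fields into a Python float (hidden bit dropped again)."""
    bits = (sign << 31) | (exponent << 23) | (significand & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def multiply_float32(x, y):
    """Multiply two normal singles using integer operations on their fields."""
    sx, ex, mx = unpack_float32(x)
    sy, ey, my = unpack_float32(y)
    sign = sx ^ sy
    exponent = ex + ey - 127      # add exponents, remove the bias counted twice
    product = mx * my             # 24 x 24 bit multiply, up to 48 bits wide
    if product >> 47:             # significand product in [2, 4): renormalize
        product >>= 1
        exponent += 1
    significand = product >> 23   # keep the top 24 bits (truncation, no rounding)
    return pack_float32(sign, exponent, significand)
```

For operands whose product is exactly representable, such as 1.5 and -2.25, the result equals the ordinary floating point product.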
Fixed Point Inner Product Processor. A fixed point version of the Inner Product
Processor was also designed and built. This chip has a 22-bit accuracy, sufficient for
FFTs of up to $10^6$ points. A somewhat different chip architecture was chosen: two units
consisting of a multiplier and an adder each operate in parallel. The B coefficients are
applied externally to this chip, hence there is no internal generator as there is in the floating
point chip. For a goal of 1 MHz operation the multiplexing and computation require 5
computational cycles of 200 nsec each, hence a 5 MHz clock is required. A small shift
register was implemented at the output of the processor, in $C_{out}$. Its purpose is to
serve as an interstage delay when necessary. The design was encapsulated in a
ceramic package with the maximum number of available pins, 132.
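The clock figures for both chips follow from one relation: the clock rate is the data rate multiplied by the number of computational cycles spent on each sample. A small sketch of that arithmetic, with hypothetical function names and integer units (Hz and nsec):

```python
def clock_rate_hz(data_rate_hz, cycles_per_sample):
    """Clock needed to finish `cycles_per_sample` cycles within one sample period."""
    return data_rate_hz * cycles_per_sample


def cycle_time_ns(clock_hz):
    """Duration of one clock cycle in nanoseconds (integer division)."""
    return 1_000_000_000 // clock_hz


def data_rate_hz(clock_hz, cycles_per_sample):
    """Sample rate sustained by a given clock."""
    return clock_hz // cycles_per_sample
```

With a 1 MHz data rate, four cycles per sample give the floating point chip's 4 MHz clock and 250 nsec cycle, and five cycles per sample give the fixed point chip's 5 MHz clock and 200 nsec cycle.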
Integrated Circuit Simulation and Testing. Before the designs were sent for
fabrication, both were simulated in software. The floating point version was run through
the standard software simulation (RSIM), which consists of applying a set of random input
vectors to a circuit equivalent extracted from the layout itself and comparing the output
these random vectors produce with the expected output, computed on the basis of what the
circuit should do. The formal verification procedure Nuprl was also run on part of the
design: the Mantissa Adjustor and Exponent Calculator. Another round of simulation was
run after final completion of the floating point version, and several errors were
corrected. A data rate of 1 MHz and a clock rate of 4 MHz were sustained correctly
in the simulation. Simulation of the fixed point version was also carried out with RSIM,
and the results were verified in the same way. In the simulation the clock rate was raised
from 5 MHz to 10 MHz, and the data rate from 1 MHz to 2 MHz, without errors, showing that
the design had ample operational margin.
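The random-vector procedure described above compares a low-level model of the circuit against an independent statement of what it should compute. The sketch below illustrates the idea only. It does not use RSIM or an extracted netlist: a gate-level ripple-carry adder, written out by hand, stands in for the circuit under test, and ordinary integer addition is the reference. All names, the seed, and the choice of an adder are assumptions.

```python
import random


def full_adder(a, b, carry_in):
    """One-bit full adder expressed with gate-level boolean operations."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def ripple_add(x, y, width):
    """Bit-level model standing in for the circuit under test."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result  # the carry out of the top bit is discarded


def reference_add(x, y, width):
    """Expected behaviour: addition modulo 2**width."""
    return (x + y) & ((1 << width) - 1)


def random_vector_test(width, n_vectors, seed=0):
    """Apply random input vectors and collect those where model and reference differ."""
    rng = random.Random(seed)
    failures = []
    for _ in range(n_vectors):
        x, y = rng.getrandbits(width), rng.getrandbits(width)
        if ripple_add(x, y, width) != reference_add(x, y, width):
            failures.append((x, y))
    return failures
```

An empty list from `random_vector_test(22, 1000)` means no mismatch was found among the vectors tried; as with any random testing, that is evidence of correctness over those vectors, not a proof.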
After fabrication the integrated circuits had to be tested, and the problem arose of
finding an integrated circuit tester capable of applying 132 signals simultaneously to a
Device Under Test. After many aborted attempts to operate a Tektronix LT1000 tester, or
to obtain access to other testers with the necessary capability, a solution was found at a
commercial company, Testware, of Hudson, MA, where General Radio type 2286 integrated
circuit and board testers were available. The maximum clock rate was 4 MHz, but the number
of available pins was substantially larger than 132. Both sets of integrated circuits were
tested. The floating point version was found to be non-operating, with different defects
precluding operation on all 30-plus copies of the integrated circuit. The fixed point
version operated successfully; the test results, obtained at clock rates of up to 4 MHz,
led to the prediction that it would operate correctly up to a 10 MHz clock rate and a
2 MHz data rate. This part of the work was carried out with M. Lopresti.
Conclusions
1) Through computer simulation the new architecture to compute the FFT with Systolic
Arrays was proved to be viable, and computed the FFT correctly and with the
predicted particulars of operation.
2) Integrated circuits to compute the operations expected of the vital node of the
Systolic Architecture were proven feasible, and even with a 2 micron VLSI technology can
execute the required operations in the required time. Actual construction of the integrated
circuits was successful in one variant (fixed point) and unsuccessful in the other (floating
point).
Bibliography
Chin, G., Florez, J., Borelli, R., Fong, W., Miko, J., Trujillo, C., 1989, "Real Time
Processor for Array Speckle Interferometry", IEEE Transactions on Nuclear Science, Pt.
I, Instrumentation for Space Physics and Astronomy, p.958.
IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985, New
York: The Institute of Electrical and Electronics Engineers.
