FFTCOM, a commutator for radix-2 fast Fourier Transform by Sidharta, Irwan Laurentius
Lehigh University
Lehigh Preserve
Theses and Dissertations
1989
Recommended Citation
Sidharta, Irwan Laurentius, "FFTCOM, a commutator for radix-2 fast Fourier Transform" (1989). Theses and Dissertations. 5248.
https://preserve.lehigh.edu/etd/5248
FFTCOM: A Commutator for
radix-2 Fast Fourier Transform
by 
Irwan Laurentius Sidharta 
A Thesis 
Presented to the Graduate Committee 
of Lehigh University 
in candidacy for the degree of 
Master of Science in Electrical Engineering 
Lehigh University 
Bethlehem, Pennsylvania 
1989 
This thesis is accepted and approved in partial fulfillment of the require-
ments for the degree of Master of Science in Electrical Engineering. 
Date
Dr. Weiping Li
Advisor in Charge
Dr. L. J. Varnerin
CSEE Department Chairperson
To Ratna,
the Sidhartas and the Budiartos
Acknowledgements 
I wish to express my sincere gratitude to Prof. Weiping Li, whose advice,
support and encouragement throughout all phases of this project have been
tireless. The many discussions we had as I learned from him about the idea of
FFTCOM and the ropes of high performance digital signal processing design have
helped focus and clarify numerous issues concerning the implementation of the
subject matter. I also wish to thank Dr. L. J. Varnerin for the useful guidance
he has provided from time to time.
With my apologies in advance for certainly failing to mention all colleagues
who have helped me one way or the other, I thank Lip, Ray, Yosi, Rick,
Chandra, Nanang, Joseph, Dajen, Mike, Binod, Keith, Joes, and Bob. Special
thanks to Cathy and Linda for making sure that our department runs
smoothly, and to Nick for providing useful comments on my earlier
manuscripts.
This work is especially dedicated to my wife Ratna, whose love, faith, and
support have made all the difference.
Table of Contents 
Abstract 
1. Introduction 
1.1 Fourier Transform
1.2 Fourier Transform of discrete-time signals
1.3 The Discrete Fourier Transform 
1.4 The Fast Fourier Transform 
1.5 Decimation-in-time in FFT 
1.6 Recursive Formulas for DIT-FFT 
1.6.1 Butterfly structure 
1.6.2 In-place computation 
1.6.3 Bit reversing 
1.7 Decimation-in-frequency in FFT 
1.8 Overview 
2. Architecture 
2.1 Stage Independent Addressing 
2.2 Same-Geometry Flow Graph 
2.3 The FFTCOM - Butterfly Processor Combination 
2.4 FFTCOM architecture 
2.4.1 Input node 
2.4.2 Rb1 - input control ring
2.4.3 Rb2 - output control ring 
2.4.4 Basic Data Storage Registers 
2.4.5 Output Registers 
3. Design 
3.1 Functional block diagram 
3.2 Design Flowchart
3.3 Behavioral simulation 
3.3.1 The Command Line Interpreter 
3.3.2 The Irene Interface 
3.3.3 The User Model 
3.3.4 Procedure 
3.4 Structural simulation 
3.4.1 Procedure 
3.4.2 "NET" example 
3.5 Test vectors 
4. Implementation 
4.1 Subcell implementation 
4.2 Floorplan 
4.3 Functional Simulation 
4.3.1 Procedure 
4.3.2 "RSIM" simulator 
4.4 Performance Measurement 
4.4.1 Timing 
4.4.2 Critical Path Analysis 
4.5 Fabrication 
5. Summary and Future Work 
5.1 Summary Of Current Implementation 
5.1.1 Achievements 
5.1.2 Limitations 
5.2 Future Work 
5.2.1 Performance Speed 
5.2.2 Transform Size 
Appendix A. BSIM workfiles 
Appendix B. NET workfiles 
Appendix C. MAGIC workfiles 
Appendix D. Layout Cells 
References 
Vita 
List of Figures 
Figure 1-1: Butterfly Processor
Figure 1-2: In-Place Signal Flow Graph
Figure 2-1: Stage independent addressing
Figure 2-2: Same Geometry Signal Flow Graph
Figure 2-3: Proposed architecture combination of FFTCOM and butterfly processor
Figure 2-4: The architecture for "FFTCOM", a size 16 example
Figure 2-5: Data register implemented
Figure 3-1: Functional block diagram of FFTCOM
Figure 3-2: VLSI design flowchart
Figure 4-1: Chip Floorplan
Figure A-1: BSIM files hierarchical tree
Figure B-1: NET files hierarchical tree
Figure C-1: MAGIC files hierarchical tree
Figure D-1: sregand1 circuitry
Figure D-2: sregand2a circuitry
Figure D-3: sregand2b circuitry
Figure D-4: "sregand1.mag" - for ring Rb1
Figure D-5: "sregand2a.mag" - left side cell of ring Rb2
Figure D-6: "sregand2b.mag" - right side cell of ring Rb2
Figure D-7: muxr1 circuitry
Figure D-8: muxr2a circuitry
Figure D-9: muxr2b circuitry
Figure D-10: "muxr1.mag" - multiplexor for ring Rb1
Figure D-11: "muxr2a.mag" - multiplexor for ring Rb2 top
Figure D-12: "muxr2b.mag" - multiplexor for ring Rb2 bottom
Figure D-13: ptregpt circuitry
Figure D-14: "ptregpt.mag" - main data register
Figure D-15: R0out circuitry
Figure D-16: R1out circuitry
Figure D-17: "R0out.mag" - output register for R0
Figure D-18: "R1out.mag" - output register for R1
Figure D-19: Rb2 circuitry
Figure D-20: "Rb2.mag" - how Rb2 is assembled
Figure D-21: Buffer1 circuitry
Figure D-22: Buffer2 circuitry
Figure D-23: "buffer1" - driver for Phi1
Figure D-24: "buffer2" - driver for Phi2
Figure D-25: Actual Chip Layout
Abstract
FFTCOM: A Commutator for 
radix-2 Fast Fourier Transform 
Irwan L. Sidharta 
A commutator to complement a two-cycle FFT butterfly processor is
presented. Behavioral simulation, structural simulation, chip floor planning,
and chip fabrication in CMOS VLSI technology are discussed. By implementing
a stage independent data flow arrangement, this architecture performs identical
operations at every FFT stage, eliminates the time consuming data addressing
task, and allows the combination FFTCOM-butterfly processor to perform
radix-2 FFT computation at a high speed. Such an implementation requires no
real time controls, no random access memory, and small power dissipation, and
currently runs at 31 MHz. Speed improvements are discussed.
Index Terms: Commutator, Fast Fourier Transform, FFT Butterfly, stage 
independent addressing, CAD tools. 
Chapter 1 
Introduction 
The benefit of employing the Fast Fourier Transform (FFT) in many
applications of Digital Signal Processing has continued to challenge many
researchers to develop faster and more efficient ways to compute the FFT¹. In an
increasing number of applications, it is becoming more obvious that a device
capable of computing the FFT at a high speed would benefit the application
greatly, and open doors to applications that could not be realized at lower
speeds.
Even with the same algorithm, different implementations vary in terms of
efficiency and requirements. Some applications dictate the limitations that an
implementation has to work within, and it is in pursuit of a better
implementation that this thesis is presented.
The thesis presents a device called FFTCOM which, coupled with a two-cycle
butterfly processor, is able to compute the FFT at high speed, efficiently, and
with minimum memory requirements. The type of butterfly processor that would
complement FFTCOM is a two-cycle radix-2 device. These devices are available
now, and have proven effective. FFTCOM's implementation will be discussed in
depth in the next chapters, and the background that leads into the subject
matter is presented in the current chapter.
¹ Since the introduction of the first efficient algorithm for computing the DFT by Cooley and
Tukey in 1965, a wealth of efficient DFT algorithms has been presented; see [1].
1.1 Fourier Transform 
In simple terms, the Fourier Transform can be thought of as a useful
tool to convert a time domain signal representation into a frequency domain
representation. Advantages that can be gained through the use of such a
transformation are discussed in many introductory textbooks on digital signal
processing [2] [3] [4]. These advantages are usually in the form of a better
understanding of the signal features.
Analog signals are continuous waveforms, and some of them are of infinite
time duration. Often we would like to study these signals in more detail.
Since we are equipped with many useful tools for manipulating signals in the
frequency domain representation, it is to our advantage to study those signals
in their frequency domain representation. This conversion can be expressed by
the following equations:
$$F(f) = \int_{-\infty}^{\infty} f(t)\, e^{-j 2\pi f t}\, dt, \qquad f(t) = \int_{-\infty}^{\infty} F(f)\, e^{j 2\pi f t}\, df \tag{1.1}$$
[Fourier Transform and Inverse Fourier Transform]
The signal f(t) is the original waveform with reference to time t. By
multiplying f(t) by an exponential term, we now reflect the signal to a
frequency axis. This transformation is reversible; both equations are given as
a pair in equation (1.1) above, and no information is lost throughout the
process. It is often easier to work with the representation F(f) rather than
with the signal f(t) itself. Since the information contained in the signal is
neither created nor destroyed, this is just 'displaying' it differently. In
this case, the frequency takes real values, not simply the harmonics or integer
values, so equation (1.1) is a continuous-frequency representation, and the
duration can be infinite. By applying this operation to f(t), we are extracting
the Fourier spectrum of that signal.
1.2 Fourier Transform of discrete-time signals
Through sampling, we can represent a continuous signal waveform in the
form of a sequence of real values that reveal the signal at our sampling
instants. This is called a discrete-time representation of that particular
signal, and should be understood as equally spaced points along the time axis
where the values of the signal are known. The values of the signal between
those sampling points are not known, and we will discuss the effect this could
have on our goal of retaining correct information. The equations below are
known as the Discrete Time Fourier Transform (DTFT) and its inverse operation:
$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n}, \qquad x(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega \tag{1.2}$$
[DTFT and Inverse DTFT]
Note that although the time axis is now in discrete form, the sequence
length is still of infinite duration. This is for signals that are not time
limited. Obviously, such a signal is impossible to capture, unless the designer
is willing to wait for an infinite amount of time².
1.3 The Discrete Fourier Transform 
If the transform is applied to a sequence of data, the resulting entity is 
called the signal spectrum of that data sequence. If we observe a signal at a 
finite number of points and during a finite time duration, we can take a discrete 
Fourier Transform of that signal. In order for us to use the digital computer to 
process these data and perform FT computations, we must represent the signal 
in a way manageable to the computer. That means truncating the signal's dura-
tion and capturing the information in a finite time length. This will be the Dis-
crete Fourier Transform, and it is the basis of our analysis in subsequent chap-
ters. 
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{N}kn}, \qquad x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j\frac{2\pi}{N}kn} \tag{1.3}$$
[Discrete Fourier Transform and Inverse Discrete Fourier Transform]
It is an implementation of how to calculate this DFT that we will be
scrutinizing closely in the next chapters. Methods for efficient DFT
computation are introduced in the next section.
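As a point of reference for what the faster methods must reproduce, equation (1.3) can be evaluated directly. The following is a minimal sketch in Python; the function names `dft` and `idft` are chosen for this illustration only and do not come from the thesis. Each of the N outputs needs N complex multiplications, so the cost grows as N squared.

```python
import cmath


def dft(x):
    """Directly evaluate X(k) = sum over n of x(n) * exp(-j*2*pi*k*n/N)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def idft(X):
    """Directly evaluate x(n) = (1/N) * sum over k of X(k) * exp(+j*2*pi*k*n/N)."""
    n_points = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * n / n_points)
            for k in range(n_points)) / n_points
        for n in range(n_points)
    ]
```

Applying `idft` to the result of `dft` returns the original sequence up to rounding error, which illustrates that the pair in (1.3) is reversible.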
² "That's a very long wait"
1.4 The Fast Fourier Transform 
The Fast Fourier Transform is best viewed as a collection of very efficient
ways of computing the DFT. It is the success of these Fast Fourier Transform
algorithms that has pushed the rapid development and advancement of deeper
FFT studies. These algorithms derive their efficiency from observations about
the symmetry and periodicity properties of the complex exponential
$W_N^{kn} = e^{-j\frac{2\pi}{N}kn}$, which reduce the number of multiplications
and thereby allow a faster computation, hence the name Fast Fourier Transform³.
For an N-point transform, the operation count drops from the order of N² to the
order of N log₂ N. The idea behind the FFT is to decompose the original signal
sequence into smaller sequences and to perform simpler (smaller size) DFTs on
those smaller sequences successively, thereby reducing the complexity of the
calculations. Through careful observation of symmetry features, a significant
reduction of computations (multiplications and additions) can be gained, and
computational time decreased. In decimation-in-time FFT, the sequence that gets
taken apart is the data sequence x(n), which is in the time domain.
Alternatively, we could similarly break apart the transform sequence X(k)
instead, thus decimating the signal in the frequency domain, hence the name
decimation-in-frequency.
1.5 Decimation-in-time in FFT
Decimation of the time domain sequence into smaller parts of successive
DFT computations allows us to compute the FFT efficiently. This efficiency is
derived from the symmetry and periodicity properties of the complex exponential
$W_N^{kn} = e^{-j\frac{2\pi}{N}kn}$, which reduce the number of multiplications
required for the complete computation.
³ The term FFT refers more to a collection of computationally efficient algorithms than to
any one particular method.
1.6 Recursive Formulas for DIT-FFT
A basic recursive formula for decimation-in-time FFT is depicted in the
following series of expressions:
$$\ldots \qquad \text{for } k = 0, 1, \ldots, \left(\tfrac{N}{2} - 1\right) \text{ and } W_N^{kn} = e^{-j\frac{2\pi}{N}kn} \tag{1.4}$$
[DIT-FFT Recursive Formula]
Such recursive formulas give rise to a recursive structure for processing 
architectures as can be seen in the following sections. 
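As a hedged illustration of the recursive idea, the sketch below uses the standard textbook radix-2 decimation-in-time recursion, $X(k) = G(k) + W_N^k H(k)$ and $X(k + N/2) = G(k) - W_N^k H(k)$ for $k = 0, 1, \ldots, N/2 - 1$, where G and H are the N/2-point DFTs of the even- and odd-indexed samples. The symbols G and H and the function name `fft_dit` are assumptions made for this example and may not match the notation of equation (1.4).

```python
import cmath


def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n_points = len(x)
    if n_points == 1:
        return list(x)
    if n_points % 2:
        raise ValueError("length must be a power of two")
    G = fft_dit(x[0::2])  # N/2-point DFT of the even-indexed samples
    H = fft_dit(x[1::2])  # N/2-point DFT of the odd-indexed samples
    X = [0j] * n_points
    for k in range(n_points // 2):
        t = cmath.exp(-2j * cmath.pi * k / n_points) * H[k]  # W_N^k * H(k)
        X[k] = G[k] + t
        X[k + n_points // 2] = G[k] - t
    return X
```

Each level of the recursion halves the transform size, which is where the recursive structure of the processing architecture comes from.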
1.6.1 Butterfly structure 
The basic DIT-FFT processor has a regular and symmetrical signal flow,
and one butterfly processor can be used as long as the data ordering is
presented correctly. For decimation-in-time, the butterfly structure is
illustrated in figure (1-1).
Figure 1-1: Butterfly Processor
The term $W_N^k$ is the complex exponential term. It is multiplied
into the lower signal input before the addition and subtraction are done. This
option was chosen because such a butterfly processor offers more possibilities
to select some special elements for high speed performance than a DIF
butterfly [5].
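The butterfly described above can be written in a few lines. This is a minimal sketch; the function name `dit_butterfly` and its argument names are hypothetical and are not taken from the thesis.

```python
def dit_butterfly(upper, lower, w):
    """Decimation-in-time butterfly: multiply the lower input by w, then add and subtract."""
    t = w * lower              # twiddle multiplication happens first
    return upper + t, upper - t
```

One butterfly therefore costs one complex multiplication and two complex additions, regardless of the stage in which it is used.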
1.6.2 In-place computation 
An in-place computation of FFT features a structure where even and odd 
numbered points are separated, producing a signal flow structure that main-
tains the same node index for both input and output. Such an arrangement lets 
us use the same array to hold data. Displayed in figure (1-2) is such a signal 
flow graph. 
[Signal flow graph: inputs x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7); outputs X(0) through X(7)]
Figure 1-2: In-Place Signal Flow Graph
It is clear from figure (1-1) that one array of N storage registers is
physically necessary to implement the complete computation⁴ for N points.
Compared to this arrangement, the architecture proposed in this thesis
eliminates the need to maintain that register array.
1.6.3 Bit reversing 
In order for a signal flow graph to operate properly, data must be presented
in the expected order. An arrangement such as figure (1-2) needs the
input data sequence to appear in bit-reversed order at the input nodes. The
transformed output will be in normal order. Our architecture takes care
of this requirement, thereby relieving the butterfly processor of such a task.
The explanation of why bit-reversing is necessary for in-place computation can
be found in [6].
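The two ideas of this section, in-place computation and bit reversing, can be sketched together in software. This is a conventional textbook arrangement and not the FFTCOM design; the function names `bit_reverse` and `fft_in_place` are chosen for this illustration. The input is first permuted into bit-reversed order, after which every butterfly overwrites the two array entries it read.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_in_place(x):
    """In-place radix-2 DIT FFT of a list of complex values (power-of-two length)."""
    n_points = len(x)
    bits = n_points.bit_length() - 1
    if n_points < 1 or n_points != 1 << bits:
        raise ValueError("length must be a power of two")
    # Reorder the input into bit-reversed order.
    for i in range(n_points):
        j = bit_reverse(i, bits)
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Each stage doubles the distance between the two inputs of a butterfly.
    span = 1
    while span < n_points:
        for start in range(0, n_points, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                p, q = start + k, start + k + span
                t = w * x[q]
                x[p], x[q] = x[p] + t, x[p] - t  # results reuse the same two cells
        span *= 2
    return x
```

For N = 8, `[bit_reverse(i, 3) for i in range(8)]` gives `[0, 4, 2, 6, 1, 5, 3, 7]`, which is the input order shown in figure (1-2). Note that the distance between butterfly inputs changes from stage to stage in this arrangement.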
1.7 Decimation-in-frequency in FFT
Instead of decomposing the data sequence x(n), we could similarly choose 
to decompose the transformed sequence X(k). Since X(k) is in the frequency 
domain, this method is called decimation-in-frequency FFT (DIF-FFT). It is 
true that there is a very high degree of similarity between DIT-FFT and DIF-
FFT. 
1.8 Overview 
The purpose of this thesis is to introduce a new way of implementing FFT 
computation, particularly by the use of a device that would commute data flow 
so that data addressing would not present a bottleneck effect for the butterfly 
processor. 
⁴ Note from figure (1-1) that $X_{m+1}(p)$ and $X_{m+1}(q)$ are stored in the same two storage
registers as $X_m(p)$ and $X_m(q)$.
Chapter 2 discusses the selected architecture, which incorporates FFTCOM
for FFT computation. It follows a dataflow concept that produces
the same geometry at every stage of the FFT signal flow graph, and has stage
independent addressing for each data entry. This implementation is presented
for an efficient combination of FFTCOM and butterfly processor.
Chapter 3 explains all details of FFTCOM design. It discusses in depth 
all simulation stages that were performed to obtain the final product that is 
transferred on silicon. These simulations are the behavioral, structural, and 
functional simulations. The tools used are quite sophisticated CAD tools, known 
as BSIM, NET, RSIM, and MAGIC [7] [8] [9] [10]. 
Chapter 4 goes in depth on how FFTCOM as the final chip product is laid 
out. All pin assignments and architectural contents of the chip are shown in 
descriptive diagrams. 
Chapter 5 concludes with a summary of what this thesis has achieved and 
what problems are solved by the introduction of this new implementation. This 
chapter also discusses the work that can be done to further enhance the current 
implementation and produce faster performance. Some limitations with the cur-
rent architecture are also addressed. 
Appendix A gives a brief explanation of what kind of files are involved in 
the behavioral simulation process and how they relate to each other. 
Appendix B covers the structural simulation files and their significance. 
Appendix C displays a hierarchical view of the subcells that make up the
complete layout design for the chip fabrication. Only a brief explanation of
what each subcell is for will be offered.
Appendix D shows some basic layout cells of the Data Path implemen-
tation. The Pads are standard cells and are not shown herein. 
Readers who are interested in obtaining the source files for this design are
welcome to contact the CSEE Department of Lehigh University, where a
complete listing of the source files is kept. These files are not included
herein for the sake of brevity.
Chapter 2
Architecture 
Each FFT implementation is different in terms of computational
efficiency. It is up to the designer to select the type of implementation that
would best suit his needs. Equipped with the knowledge of what different
implementations have to offer, he can then make an intelligent choice. The
architecture featured in this thesis is based on the idea of rearranging the
FFT butterfly signal flow so that every stage has the same geometry (this idea
was first proposed by [11]), thus allowing for sequential data accessing and
storage. We chose a decimation-in-time decomposition, and we take care of the
bit reversing problem at the data sequence side. The butterfly processors will
receive data that is already in bit-reversed order, and can therefore process
them immediately.
2.1 Stage Independent Addressing 
The addressing scheme illustrated in figure (2-1) constitutes a stage-
independent addressing method. It has the advantage of functioning identically 
at any given stage. By employing this, we can be sure that the data flow will be 
regular for the butterfly processors. 
Figure (2-1) illustrates a size-8 addressing scheme. Using registers for
this kind of addressing is beneficial since we can eliminate the need for
memory. All connections are fixed, so it functions as a fixed data channel,
taking inputs from a known branch and delivering outputs to a known node.
[Diagram: node labels j and j+N/2]
Figure 2-1: Stage independent addressing
2.2 Same-Geometry Flow Graph
A flow graph to illustrate this stage independency is included below. It
shows that identical operations are involved at every stage.
For a given transform size, we could precompute the coefficients off-line
and have those values ready for multiplication with signal values at the
appropriate nodes. The input is in bit-reversed order, the output is in normal
order. If we use a commutator, we can eliminate the need to maintain
an array of memory cells and later read contents from these cells through
time consuming address decoding. Freed from these tasks, the butterfly
processor can perform the butterfly computation for the FFT stages quickly and
need not worry about which memory address to read data from. It can in essence
take it for granted that data will appear at its inputs correctly, because of
the service provided by the commutator.
[Signal flow graph: inputs x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7); branch transmittances labeled W0 and W2; outputs X(0) through X(7)]
Figure 2-2: Same Geometry Signal Flow Graph
As can be seen from the signal flow graph of figure (2-2), the branch-to-node
connections are repeated identically at every stage. For an entry
with array index "j" there is an entry a distance "N/2" away to form a
butterfly that produces output entries with indexes "2j" and "2j+1", where
j = 0, ..., N/2-1. For every stage, the geometry is identical; only the branch
transmittances change. The key point of this arrangement is the possibility to
access data sequentially.
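The index relation stated above can be tabulated with a short sketch. The function name `same_geometry_pairs` is hypothetical; the pairing itself (entries j and j+N/2 feeding outputs 2j and 2j+1) is taken from the text.

```python
def same_geometry_pairs(n_points):
    """List ((j, j + N/2), (2j, 2j + 1)) for every butterfly of one stage."""
    if n_points < 2 or n_points % 2:
        raise ValueError("transform size must be even")
    half = n_points // 2
    return [((j, j + half), (2 * j, 2 * j + 1)) for j in range(half)]


for inputs, outputs in same_geometry_pairs(8):
    print(inputs, "->", outputs)
```

Because the list does not depend on the stage number, the same connections serve every stage, and both halves of the input and the whole of the output are visited in sequential order.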
An example in [6] described a situation where we have four magnetic tape
units or four sequential areas of disc storage, and the first half of the input
data (which is in bit-reversed order) is stored on one tape and the second half
is stored on another tape. Then data can be accessed sequentially on tapes 1
and 2 and the results written sequentially on tape 3 (for the first half) and
tape 4 (for the second half). For the next stage of computation, we switch pair
3&4 with pair 1&2, so that tapes 3 and 4 are the input and tapes 1 and 2 will
store the output. This is repeated for each stage.
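The tape example can be sketched in software with two buffers that swap roles after every stage. This is a minimal sketch, not the thesis design: the function name `fft_same_geometry`, the twiddle-exponent expression, and the direction of the fixed wiring are assumptions. Here each butterfly reads two adjacent entries (2j and 2j+1) from the source buffer and writes one result to each half of the destination buffer (j and j+N/2), which is the direction that, in this sketch, turns bit-reversed input into normal-order output with the multiply-before-add butterfly. The wiring is still identical at every stage; only the twiddle factors change.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_same_geometry(x_bit_reversed):
    """Same-geometry radix-2 DIT FFT: bit-reversed input, normal-order output."""
    n_points = len(x_bit_reversed)
    stages = n_points.bit_length() - 1
    if n_points < 2 or n_points != 1 << stages:
        raise ValueError("length must be a power of two, at least 2")
    half = n_points // 2
    src = list(x_bit_reversed)   # plays the role of tapes 1 and 2
    dst = [0j] * n_points        # plays the role of tapes 3 and 4
    for stage in range(1, stages + 1):
        shift = stages - stage
        for j in range(half):
            exponent = (j >> shift) << shift      # assumed twiddle index for this stage
            w = cmath.exp(-2j * cmath.pi * exponent / n_points)
            t = w * src[2 * j + 1]                # sequential reads from the source
            dst[j] = src[2 * j] + t               # sequential write, first half
            dst[j + half] = src[2 * j] - t        # sequential write, second half
        src, dst = dst, src                       # swap the buffer pairs for the next stage
    return src
```

For N = 8 the twiddle exponents used by butterflies j = 0..3 are 0, 0, 0, 0 in the first stage, 0, 0, 2, 2 in the second, and 0, 1, 2, 3 in the third. A caller would prepare the input as `[x[bit_reverse(i, stages)] for i in range(N)]`.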
2.3 The FFTCOM - Butterfly Processor Combination
Below is the proposed architecture of how FFTCOM would function in
combination with a butterfly processor. Assuming that the butterfly processor
is a two-cycle device, FFTCOM would provide data for a cascaded combination
with the butterfly processor as illustrated in figure (2-3).
[Block diagram: serial in, SIPO, two PISO blocks, Butterfly Processor]
Figure 2-3: Proposed architecture combination
of FFTCOM and butterfly processor
The combination allows FFTCOM to provide two data values as outputs,
and to hold them available for two clock cycles, so that the butterfly processor
can read the data in as the first and second entries. While FFTCOM accepts
inputs and internally rearranges them, it continues to deliver at its outputs
(which are inputs of the butterfly processor) data in a sequential order (normal
order). This process is repeated until all data inputs are exhausted.
This eliminates the need to maintain an array of memory cells, which
means eliminating the time consuming addressing task of reading out the
contents to the output. And by having the FFTCOM perform this data commuting
continuously, we can pipeline the operation to some degree. This pipelining is
possible because there is no data dependency in FFTCOM's operations, and
because a fixed period of latency⁵ exists between the time a particular data
entry is read into FFTCOM and the time when that data is delivered to the
inputs of the Butterfly Processor.
2.4 FFTCOM architecture 
To illustrate FFTCOM's architecture in better detail, let us take as an
example a 16-point transform as described in figure (2-4). For clarity, only
one "leaf cell" is shown in that figure, which means that the configuration
describes data values of 1 bit width⁶.
There are two rings depicted towards the bottom of figure (2-4): the left
ring is Rb1, which controls the input side, and the right ring is
Rb2, which controls the output side. The array of N wire leads that connect to
the data registers ("ptregpt") is controlled by these two rings. The difference
is that at the output side we break the array into two halves of length N/2,
although still controlled the same way by ring Rb2. The top half is for the
"even" values, while the bottom half takes care of the "odd" values.
Theoretically, the two halves of the output array could share the same
connections, but to allow for future developments and for routing reasons which
will be discussed in Chapter 4, we maintain two separate rings. For now,
suffice it to say that both halves of the output array are controlled by ring
Rb2 in an identical manner.
⁵ Depending on the transform size; this chip is for 64 points.
⁶ The chip FFTCOM has 16 leaf cells since it is designed for 1 word = 16 bits.
[Architecture diagram; recoverable labels: Xin, Rx1, X2out, MUX, IN1, IN2, ST1, ST2]
Figure 2-4: The architecture for "FFTCOM", a size 16 example
The rings Rb1 and Rb2 consist of a series connection of 1-bit registers.
These rings control the entire word length (all 16 bits). There are multiplexors
built into the rings Rb1 and Rb2 for the purpose of selecting between two modes
of operation for the rings. Mode 1 (ST#=1) means that we want to let a new
bit-value (0 or 1 at IN#) enter the ring, while mode 0 (ST#=0) tells the ring to
loop around and not pay attention to whatever bit-value exists on IN#⁷.
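The two modes of a control ring can be modeled behaviorally as a shift register with a multiplexor at its input. This is a small sketch under stated assumptions: the shift direction and the position of the multiplexor are not specified in the text, and the class name `ControlRing` and its method are hypothetical.

```python
class ControlRing:
    """Behavioral model of a ring of 1-bit registers with a load/loop multiplexor."""

    def __init__(self, length):
        self.bits = [0] * length

    def clock(self, st, in_bit=0):
        """Advance one clock: st=1 loads in_bit, st=0 recirculates the last bit."""
        feed = in_bit if st == 1 else self.bits[-1]  # multiplexor selection
        self.bits = [feed] + self.bits[:-1]          # assumed shift direction
        return list(self.bits)
```

Loading a single "1" with `clock(st=1, in_bit=1)` and then clocking with `st=0` makes that bit march around the ring indefinitely; whatever is present on the input is ignored while `st` is 0.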
The middle array of registers in the center of figure (2-4) are data
registers (denoted by the labels Rs#). This N-size array of storage holds
incoming data, and delivers only the entries that are selected by the switches
(provided by the pass transistors driven by the rings at the bottom of the
figure) to the output registers (Rx0 and Rx1).
At any given time, there is only one switch transistor that is on, due to the
fact that there is only one "1" bit (and the rest are "0"s) in each of the rings
Rb1 and Rb2. This makes sure that only one data register is "communicating"
with either the input side or the output side (on each half of the array⁸), so
only one data value enters the array of data registers (Rs#), and only one of
them appears at each of the output registers (Rx#).
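The selection mechanism described above can be sketched as one clock step of a simplified register-array model. The function name `commutator_step`, the argument names, and the choice to perform the write before the read within a step are assumptions made for this illustration; the actual timing is governed by the clock phases of the chip.

```python
def commutator_step(rs, write_select, read_select, xin):
    """One step of a simplified model of the data register array.

    rs           -- list of N stored values (modified in place), models Rs#
    write_select -- one-hot list of length N, models ring Rb1
    read_select  -- one-hot list of length N/2, models ring Rb2 on both halves
    xin          -- value present at the shared input
    Returns the pair (rx0, rx1) taken from the top and bottom halves.
    """
    half = len(rs) // 2
    if sum(write_select) != 1 or sum(read_select) != 1:
        raise ValueError("each ring must hold exactly one '1' bit")
    rs[write_select.index(1)] = xin   # only the selected register accepts the input
    j = read_select.index(1)
    return rs[j], rs[j + half]        # one entry from each half reaches the outputs
```

With an 8-entry array, a write select of `[0, 0, 1, 0, 0, 0, 0, 0]` and a read select of `[0, 1, 0, 0]`, the input is stored in entry 2 while entries 1 and 5 are delivered to the two outputs.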
2.4.1 Input node 
The 16-bit input Xin is shared with all paths leading into the data
registers (Rs#). Only one of the paths, however, will be "on" at a given time
and allow data to enter the corresponding Rs register when the clock cycles.
These selections are made by incorporating NMOS pass transistors as switches,
which use Phi1 excitation. The choice of a pass transistor as opposed to a
transmission gate has been investigated. It is generally known that a
transmission gate dissipates less power, and provides a more stable signal
(less degradation through current leakage).
⁷ The "#" sign indicates that the explanation is applicable for both Rb1 and Rb2.
⁸ Recall we discussed that the "even" and "odd" parts are separated.
However, if we chose to use a transmission gate, the larger area and in-
creased number of transistors are too penalizing. It is our goal to obtain a very 
fast and dense design to accommodate as big a transform size and as fast an 
operation as possible. We have taken efforts to custom design each cell to a 
minimum size. With this objective, the selection of using pass transistors was 
made. 
2.4.2 Rb1 - input control ring
For a 64 point transform, which FFTCOM currently supports, we have 64
1-bit registers that are serially connected together to form ring Rb1. There is
a multiplexor within the ring to allow the user to control whether a new bit
enters the ring or the ring loops around (closed). The control signals for the
multiplexor are accessible from the chip pins (the Pads). These Pads use the
labels IN1 and ST1. When ST1=1, the value assigned to IN1 enters the ring.
When ST1=0, ring Rb1 is closed and the existing bits march forward and loop
around as driven by the clock cycle.
2.4.3 Rb2 · output control ring 
Ring Rb2 is very similar to ring Rbl. The difference is that we incor-
porated two separate rings for the top half and bottom half of the output array9• 
This is done for a several of reasons. First it aides in the routing of wires to 
connect to the Data Path (it allows for common metal-1 connection rather than 
using m2c transfers)10. The second reason is anticipation of future expansion 
that might require independent control for the even- and odd- points of the 
transform array. Currently, since that requirement is not imposed, and because 
9This difference is not shown here, as it will be much clearer from the layout diagram in
Appendix D
10These hardware details are also covered in more depth in Chapter 4
of pin-count limitations, both rings (top and bottom) are controlled by the same 
Pads (IN2 and ST2). 
2.4.4 Basic Data Storage Registers 
The basic storage cells of FFTCOM are the data registers, labeled
"ptregpt.mag" for "pass transistor - register - pass transistor". The name
reflects the fact that each cell contains three clocking switches used for the
transfer of data.
The first switch is q1f1, which is Phi1 qualified by ring Rb1. An
"AND" gate takes inputs from clock "Phi1" and a particular cell in ring Rb1. If
"Phi1" and the cell Rb1.i (where i is the location index) are both "high" ("1"),
the "AND" gate generates a pulse that turns on the pass transistor whose
gate it drives.
The second switch is the clock signal "Phi2". This signal is taken directly 
from the input Pad "Phi2". When "Phi2" is high, data value advances internally 
inside the register (figure 2-5). Data value would then be held by the end part of 
the register, awaiting a signal for it to flow out of FFTCOM and into the But-
terfly Processor. 
[Figure 2-5: Data register implementation. Schematic labels: input Xin, clock Phi2, forward drivers sized 16/2, feedback devices sized 3/8.]
The last switch is q2f1, a qualified clock similar to q1f1. The difference
is that here the "AND" gate is qualified by ring Rb2 instead of ring Rb1.
As with ring Rb1, ring Rb2 holds exactly one "1" bit and the rest are
"0". However, the location of this bit on the ring is not the same. As will be
explained in Chapter 3, there is a required latency of N/2 between the "1" bits
in these two rings.
Each of these clocking switches is a pass transistor. The register itself is
depicted in figure (2-5). The feedback drive (ratio 3/8) is chosen to be weak
enough that no disturbing signal feeds back from the output to
the input of each 16/2 driver. The forward transistor size of 16/2 was also
selected so that it can drive the large fan-out of a 32-node junction
leading to the output registers.
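The interplay of the three clocking switches can be illustrated with a small Python sketch of one register bit. It is a simplified logical model under stated assumptions: the class name `DataRegisterBit` and its two internal nodes are hypothetical, the two clock phases are assumed never to be high together, and electrical effects such as the weak feedback are not modeled.

```python
class DataRegisterBit:
    """One bit of a data register guarded by three pass-transistor switches."""

    def __init__(self):
        self.front = 0   # node written when q1f1 is high
        self.back = 0    # end part of the register, holding the value after Phi2

    def clock(self, phi1, phi2, rb1_bit, rb2_bit, xin):
        q1f1 = phi1 & rb1_bit   # Phi1 qualified by ring Rb1 (input switch)
        q2f1 = phi1 & rb2_bit   # Phi1 qualified by ring Rb2 (output switch)
        # The output switch passes the held value; None means "not driving".
        out = self.back if q2f1 else None
        if q1f1:
            self.front = xin          # input switch on: capture Xin
        if phi2:
            self.back = self.front    # Phi2 advances the value internally
        return out


cell = DataRegisterBit()
cell.clock(phi1=1, phi2=0, rb1_bit=1, rb2_bit=0, xin=1)   # selected by Rb1: load
cell.clock(phi1=0, phi2=1, rb1_bit=0, rb2_bit=0, xin=0)   # Phi2: move to end part
value = cell.clock(phi1=1, phi2=0, rb1_bit=0, rb2_bit=1, xin=0)  # selected by Rb2
```

A cell whose Rb1 bit is "0" ignores Xin entirely, which is what allows all registers to share the same input lines.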
2.4.5 Output Registers 
There are two output signals, X0out and X1out, which are held in output
registers Rx0 and Rx1 respectively. Each of these signals is 16 bits wide,
matching the 16-bit width of Xin.
Chapter 3
Design 
3.1 Functional block diagram 
At the functional level, FFTCOM operates on input signals as indicated in 
figure (3-1). It is clearly shown that with one input stream Xin, the device will 
deliver two output streams at the same time. This is in a sense a serial-in-
parallel-out conversion (SIPO). As the figure shows, some control signals are 
needed for this operation as well. In order for us to realize this idea, we need to 
follow a design procedure that will be discussed in the next section. 
[Figure 3-1: Functional block diagram of FFTCOM. The commutator block takes the data input Xin, the clocks Phi1 and Phi2, and the controls ST1, IN1, ST2 and IN2, and delivers the outputs R0out and R1out.]
The data signals are Xin (16 bits wide), R0out (16 bits wide),
and R1out (16 bits wide). The control signals are ST1, IN1, ST2, and IN2.
Two-phase clocking is provided by the Phi1 and Phi2 signals.
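At the purely functional level, the serial-in-parallel-out behavior can be sketched as follows. The sketch assumes that sample i is paired with sample i + N/2, which is how the structural description later in this chapter connects the storage registers to the two output registers; the function name `commutate` is hypothetical, and clocking and latency are not modeled.

```python
def commutate(block):
    """Split one block of N serial samples into two parallel streams.

    Returns a list of (upper, lower) pairs: (block[i], block[i + N/2]).
    """
    n = len(block)
    if n % 2:
        raise ValueError("block length must be even")
    half = n // 2
    return [(block[i], block[i + half]) for i in range(half)]


pairs = commutate(list(range(8)))
# pairs == [(0, 4), (1, 5), (2, 6), (3, 7)]
```

For the 64-point case, each block of 64 input words yields 32 output pairs.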
3.2 Design Flowchart 
A VLSI chip implementation such as the one for this project11 requires
that we follow certain guidelines, both to keep development efficient
and to avoid losing sight of the final goal. As illustrated in figure (3-2), the
process starts from the idea of the application. After we formulated the
specifications needed for the FFTCOM implementation proposed for this
project, we investigated the available technologies that we could use. Since
computational speed and low power dissipation are the commanding criteria,
a CMOS implementation was chosen. The technology available to us was a
2µ n-well double-metal CMOS process. The next step was to design a functional
architecture that would best serve this implementation. Since we focused on
reducing shift register activity to the absolute minimum, we designed the
architecture shown in figure (2-4).
In order to appreciate the savings in shift register activity that the
architecture in figure (2-4) offers, let us examine how the task would be done
in a direct approach. The direct way to sequentially enter data into an array of
registers is a "First-In-First-Out (FIFO) pipe": the user injects data into the
first register and then, while injecting the next data into the same register,
lets the previous data move forward along the chain into the next register,
and so on. This arrangement necessarily has all data moving at all times,
pushed from behind toward the next register in the chain. Data constantly
shifting from register to register would generate a lot of heat, so power
consumption would be prohibitive and performance slower. That is why we use
11This chip actually falls into the LSI category, since its transistor count (about 16,000) is
between 10,000 and 100,000, while VLSI implies more than 100,000 transistors
[Figure 3-2: VLSI design flowchart. Stages, with the CAD tools used at each: application specification; feasibility study; technology specification; architecture design; behavioral model in "C" (BSIM); structural model and logic simulation (NET, RSIM); chip layout design (MAGIC, Auto Router, Cellframe, Pwrframe, Pad2frame, Framegen, Chipframe); functional simulation (Ext2sim, Presim, RSIM); check against the requirement specification, looping back through circuit and layout improvements if it is not met; fabrication; actual chip testing.]
pass transistors as switches, while the data register array stays dormant (not
shifting its data to the next register in the array). Our arrangement actually
reverses the process: the first data enters and stays in the first
register Rs0, the second data enters the second register Rs1, and so forth.
Hence the array contains data in the sequential order in which they were entered.
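The difference in register activity between the two approaches can be made concrete with a rough count. The following Python sketch is an illustration only: it counts one "update" each time a register takes a new value, treats every register in the FIFO chain as clocked on every cycle, and uses hypothetical function names.

```python
def load_by_shifting(words):
    """FIFO-style loading: every register in the chain shifts on each cycle."""
    n = len(words)
    chain = [None] * n
    updates = 0
    for w in words:
        chain = [w] + chain[:-1]   # all n registers take a new value
        updates += n
    return chain, updates


def load_by_addressing(words):
    """Ring-addressed loading: only the selected register is written."""
    n = len(words)
    regs = [None] * n
    updates = 0
    for slot, w in enumerate(words):   # the single "1" in the ring selects `slot`
        regs[slot] = w
        updates += 1
    return regs, updates


words = list(range(64))
chain, fifo_updates = load_by_shifting(words)        # 64 * 64 = 4096 updates
regs, direct_updates = load_by_addressing(words)     # 64 updates
```

Under this counting, the addressed scheme needs N register updates per block instead of N squared, and the words end up stored in the order in which they arrived.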
The following discussion will address the issues of each stage of the 
development process. It will follow the flowchart depicted in figure (3-2) closely. 
The CAD tools used for this project were installed on a Sun 3/60
workstation12, running under the UNIX operating system13.
3.3 Behavioral simulation
As soon as an architecture was decided upon, we created a model in the
"C" language that describes the behavior of our chip. A CAD (Computer
Aided Design) tool, "BSIM", was used for this stage14 [7]. "BSIM" allows us to
model our circuit in close agreement with the actual hardware. Registers, for
example, are controlled by clock excitations as they would be when the real
chip performs as part of a larger system. Another strength of "BSIM" is the
IRENE tester that is built into the package. This tester provides the capability
to match simulation and hardware functionality and performance during the
test. The simulation can run interactively or, if the user prefers, in batch
mode15. The model is captured in a file called model.c, and additional "BSIM"
files need to be adjusted by the designer so that the simulator is equipped with
12Product of Sun Microsystems, Inc. in Mountain View, CA
13UNIX is a trademark of AT&T
14From the words "behavioral simulator"
15Convenient for large simulations that take a long time to run
the knowledge of the chip parameters16. Information about how these files
relate to "BSIM" can be found in appendix A. The complete Behavioral Simulator
package has three parts:
1. The Command Line Interpreter 
2. The Irene interface 
3. The User Model 
3.3.1 The Command Line Interpreter
The Command Line Interpreter is a program created from
the designer's description of his chip (in the file "model.c"). This "BSIM"
program is a binary executable that provides an interactive command line
interface to the user-written "C" program "model.c". "BSIM" has a command line
similar in syntax to the "RSIM" commands to be discussed in later sections.
Several data files supply "BSIM" with required parameters
of the chip. Templates for these specialized files are provided with the package,
and the user only needs to make minimal changes that are directly
related to the current chip. The data file "padlist.l" defines the pad vectors
which the model will use. Its descriptions include information about
whether a particular pad is an input, output, or bidirectional pad. It can also
define some internal nodes and clocks. Another data file that "BSIM" uses when
the tester is employed is "pads.irene". This file describes which pins on the
Irene tester correspond to which pads in the model. As for value
representation, "BSIM" only allows pad vectors up to 16 bits wide.
16In our case, only one file actually needed to be adjusted, namely the file
"padlist.1"
3.3.2 The Irene Interface
The Irene interface is the tester used for test matching once the chip
has been fabricated. The Irene tester can test for behavioral
matching between the user model and the physical hardware. This is very
useful in determining the yield when chip samples are obtained from fabrication.
3.3.3 The User Model
The User Model is a program written in the "C" language that captures the
description of the chip. "C" was chosen because it offers flexible, useful
operations such as bit-wise control, which can mimic real hardware operation.
If the user model is written following the guidelines provided in [7], the "C"
code can closely match the structure of the hardware that it is modeling.
3.3.4 Procedure 
The user must carry out a set of steps in the correct sequence to
prepare a "BSIM" simulation. The following list explains each step:
1. Prepare "model.c": it is convenient to start from the template
already provided for "model.c" and write "C" code to model our chip.
2. Prepare the data files: modify the file "padlist.l" with the
correct parameters for the pad vectors specific to our chip.
3. Execute the makefile: a makefile is provided that compiles
"model.c" and links it with the required libraries of routines. Execute
it to create the executable image "bsim*".
4. Command file: this is optional. If we choose to run "BSIM" by
letting it read commands from a specified file, we must prepare the
command file. We can then capture the output in a file. In
UNIX the syntax would be:
   bsim < file.bat > file.out
where "file.bat" contains:
   @ file.cmd
(note the required space). The file "file.cmd" is the command file
that contains the user instructions that "BSIM" should run.
Behavioral simulation is an important task and should be understood well,
since there is much to gain from it and it can lead to better
and more detailed structural or even layout simulations. The fundamental
reasons are simplicity and speed. It is much easier and takes less time for the
designer to model his chip first in a flexible programming environment (to find
out whether his idea works) than to do a detailed circuit design at this early
stage of the development.
3.4 Structural simulation
When the designer has successfully described the circuit's behavior and
has determined the parameters required for a hardware implementation,
a structural description is needed. This structural information
details all physical connections of the transistors, specifying where the gate,
source, and drain of each transistor are connected. The CAD tool "NET" was
chosen for this stage of simulation. "NET" has its own syntax, and is combined
with the simulator "RSIM" [8]. The syntax of "NET" offers convenient ways to
describe the repetitive structures that are common in large integrated
circuits. The language can be used to describe both NMOS and CMOS circuits, and
contains convenient constructs for macros and iterations.
3.4.1 Procedure 
The steps that must be followed to prepare this structural simulation 
properly are outlined as follows: 
1. The program net expands a description prepared by the user in
"file.net" into a list of transistors in "file.sim". The syntax is
   net file.net file.sim
2. The program presim processes the transistor information in "file.sim" into an executable image suitable for running with the simulator "RSIM", namely the file "file.rsm". The syntax for this command is
   presim file.sim file.rsm cmos100.prm -nostack
3. The simulator rsim is an event-driven logic simulator for transistor circuits [9]. It executes "file.rsm" and runs interactive commands issued by the user. It can also read commands from a file prepared by the user. We used the latter method for our simulations. The syntax for this command is
   rsim file.rsm -file.cmd > file.out
or
   nohup nice rsim file.rsm -file.cmd > file.out &
The latter command is useful for simulations that take a long time to run: it puts the job in the background and gives it a "nice" priority17.
3.4.2 "NET" example
We hope that the following explanation of the "NET" syntax will
benefit readers who are interested in using the simulator. We explain the
important parts of the syntax using parts of the actual circuit description
for FFTCOM as examples.
"NET" supports include files, which is useful for keeping the basic
gates in a separate library file and calling them as macros when we need
them in our circuit. Here is an example of how an "AND" gate is defined in
"NET" syntax; this part is located in the file "gates.net":
17All the commands given here are for the UNIX environment
(macro and2 (in1 in2 out)
  (local t1 outb)
  (etrans in1 gnd t1 6 2)
  (etrans in2 t1 outb 6 2)
  (ptrans in1 outb vdd 6 2)
  (ptrans in2 outb vdd 6 2)
  (inv outb out 6 2)
)
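To see why this macro behaves as an AND gate, its transistor network can be evaluated at switch level. The Python sketch below is an illustration only; it assumes the macro describes a CMOS NAND (two n-channel devices in series to gnd, two p-channel devices in parallel to vdd) followed by an inverter, and the function name `and2` simply mirrors the macro name.

```python
def and2(in1, in2):
    """Evaluate the and2 macro at switch level: a CMOS NAND followed by an inverter."""
    # Two n-channel devices in series pull outb to gnd only when both inputs are 1.
    pull_down = (in1 == 1) and (in2 == 1)
    # Two p-channel devices in parallel pull outb to vdd when either input is 0.
    pull_up = (in1 == 0) or (in2 == 0)
    # Complementary networks: exactly one of them conducts for any input pair.
    assert pull_down != pull_up
    outb = 0 if pull_down else 1   # NAND output on the local node outb
    return 1 - outb                # the inverter restores the AND polarity


truth_table = [(a, b, and2(a, b)) for a in (0, 1) for b in (0, 1)]
# truth_table == [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
```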
The following are parts of the main file "FFTCOM.net". We
first show how to define node names so that the simulator recognizes
them:
(include "gates.net")
(node Phi1 Phi2)
(node Q1phi1 Q2phi1)
(node Xin)
(node Rsi)
(node Rso)
(node Rb1)
(node Rb2)
(node Rx0i)
(node Rx0o)
(node Rx1i)
(node Rx1o)
(node muxr1out ST1 IN1)
(node muxr2out ST2 IN2)
The following code is taken from the main structural description of the
circuit FFTCOM. We explain what each part does in the hope that the
reader will appreciate the convenience that the "NET" syntax offers for
common VLSI structures.
(mux2 Rb1.63 IN1 ST1 muxr1out)
(sreg muxr1out Phi1 Phi2 Rb1.0)
(repeat i 0 62
  (sreg Rb1.i Phi1 Phi2 Rb1.(+ i 1))
)
The above piece of code uses two macros, namely "mux2" and "sreg".
These macros are defined in "gates.net", a file that is pulled in by the command
(include "gates.net") within FFTCOM.net. The number of parameters that must
be passed to a macro when calling it is fixed by how that macro was
defined. In the example above, the macro "mux2" expects 4 nodes.
The "repeat" command shows how to build a repetitive structure.
Note that the index increment/decrement capability provides great
flexibility for this purpose.
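The effect of the "repeat" construct can be illustrated by expanding the ring description into flat instances. The Python sketch below is not part of the "NET" tool: the function name `expand_ring` is hypothetical, and it assumes that the repeat bounds are inclusive, which is inferred from the use of node 63 by the multiplexor.

```python
def expand_ring(name, n):
    """Expand the mux2/sreg/repeat description of a control ring into flat instances."""
    lines = [
        f"(mux2 {name}.{n - 1} IN1 ST1 muxr1out)",
        f"(sreg muxr1out Phi1 Phi2 {name}.0)",
    ]
    # Equivalent of (repeat i 0 n-2 (sreg name.i Phi1 Phi2 name.(+ i 1)))
    for i in range(n - 1):
        lines.append(f"(sreg {name}.{i} Phi1 Phi2 {name}.{i + 1})")
    return lines


instances = expand_ring("Rb1", 64)
sreg_count = sum(1 for line in instances if line.startswith("(sreg"))
```

With these assumptions the three source statements expand to one multiplexor and 64 shift register cells, matching the 64 1-bit registers of ring Rb1.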
(repeat i 0 31
  (and2 Phi1 Rb2.(+ i i) Q2phi1.i)
  (repeat j 0 15
    (etrans Q2phi1.i Rso.i.j Rx0i.j 6 2)
    (etrans Q2phi1.i Rso.(+ i 32).j Rx1i.j 6 2)
  )
)
The example above shows that nested iterations are just as convenient to
describe. As in a programming language, this feature makes
more complex structural descriptions possible.
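A quick count shows how much hardware the nested iteration stands for. The Python sketch below mirrors the loop structure only; it assumes inclusive repeat bounds and reads the ring index expression as (+ i i), that is 2i, which agrees with the footnote in Chapter 4 stating that only the even-numbered Rb2 cells drive the output switches.

```python
and2_gates = 0
pass_transistors = 0
rb2_taps = []

for i in range(32):            # (repeat i 0 31 ...), bounds assumed inclusive
    and2_gates += 1            # one qualified clock Q2phi1.i per word pair
    rb2_taps.append(i + i)     # Rb2.(+ i i): the ring cell that qualifies Phi1
    for j in range(16):        # (repeat j 0 15 ...), one pass per data bit
        pass_transistors += 2  # one etrans toward Rx0i.j, one toward Rx1i.j

# and2_gates == 32, pass_transistors == 1024, rb2_taps == [0, 2, ..., 62]
```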
For a more complete description of the files needed for the "NET" struc-
tural simulation package, the "RSIM" simulator, and how they relate to each 
other, we have included appendix B. 
3.5 Test vectors 
We have seen that at every stage of the simulation, there is a "file.cmd" to 
be used for the simulator to read user instructions from. This "file.cmd" 
provides the test vectors that would test the functionality of the circuit. Hence, 
it is obviously beneficial to use the same "file.cmd" so that we could be sure we 
are testing the identical (and complete) circuit functionalities. It is also a good 
way to do cross reference checking between stages of simulations, to assure that 
circuit integrity has been preserved. 
It is therefore fortunate for us to have the CAD tools "NET" and "MAGIC"
together, since both produce binary simulation images that are compatible
with "RSIM". Although "MAGIC" is discussed in the next chapter, it is
appropriate to describe here the structure of the "file.cmd" that
contains all test vectors.
For our circuit, the FFTCOM operating on a 64-point transform, we
have prepared a command file "file.cmd" that controls "RSIM" execution and
tests all functional responses. The structure of this file is heavily dependent
on the implementation. We outline the sequence of commands issued
to "RSIM" to get FFTCOM up and running. The following are the steps we used:
1. Assign the inputs IN1=IN2=0, and set the multiplexor controls
ST1=ST2=1 so that these input values can enter the rings Rb1 and
Rb2.
2. Cycle the clock 64 times to initialize all 1-bit registers in both
rings to 0.
3. Let IN1=1, introducing the excitation bit into ring Rb1.
4. Cycle 1 time to allow this excitation bit to pass through the register
regm, which is located before the multiplexor muxr1.
5. Let ST1=ST2=0 to tell both rings to loop around and close the path
from their input Pads.
6. Cycle 1 time to allow the excitation bit "1" to reach the first 1-bit
register in ring Rb1 (denoted Rb1.0).
7. Provide values to the input vector Xin (16 bits) and cycle 31 times;
this fills the first 32 data registers with the corresponding
Xin values.
8. Let IN2=ST2=1 to allow an excitation bit "1" to enter ring Rb2.
9. Continue by assigning another Xin value and cycling 1 time.
10. Let ST2=IN2=0 to close ring Rb2 so that it loops around.
11. Continue to assign values to Xin while cycling the clock; repeat
this step 32 times, and we will be back to step 1. At this point
all operations are fully running, and FFTCOM executes at full
speed.
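The clock cycles issued by this sequence can be tallied with a short Python sketch. The table simply restates the cycle counts given in the steps above; the step descriptions are paraphrased, and the split into "initialization" (steps up to 6) and "data" (steps from 7 on) is our reading of the procedure.

```python
# (step number, what the cycles accomplish, clock cycles issued in that step)
steps = [
    (2, "clear both rings", 64),
    (4, "excitation bit passes the register ahead of the multiplexor", 1),
    (6, "excitation bit reaches the first register of ring Rb1", 1),
    (7, "load the first Xin values", 31),
    (9, "load one more Xin value while ring Rb2 receives its bit", 1),
    (11, "load the remaining Xin values", 32),
]

init_cycles = sum(cycles for step, _, cycles in steps if step <= 6)   # 64 + 1 + 1 = 66
data_cycles = sum(cycles for step, _, cycles in steps if step >= 7)   # 31 + 1 + 32 = 64
```

The 66 set-up cycles are close to N = 64, and the 64 data cycles correspond to one full block of input words.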
This procedure shows that a certain initialization period is required
to set up FFTCOM before it can execute fully. For an N-point transform,
this period is approximately N clock cycles. The initialization period
basically fills the two rings Rb1 and Rb2 with "0", since we do not want any
undesirable values resident in any of the 1-bit registers. Recall that there
should be only one "1" in each ring at all times.
As soon as all values in the rings are known, FFTCOM can start
execution. This initialization is only needed at power-up, and does not have
to be repeated during normal run time.
Chapter 4
Implementation 
The FFTCOM chip was fabricated in a standard MOSIS 64P69X68 package,
with 64 pins and a design area of approximately 68x67 mm2. The technology
is the MOSIS 2µ CMOS N-Well Double Metal process18. In the following sections,
we discuss the hardware implementation of the chip. After completing the
design process, from idea conception through the various levels of simulation
(behavioral and structural), chip layout, and fabrication, several
observations are vivid in the designer's mind. The first
observation is that accurate simulation tools are indispensable for
guaranteeing the quality of the design. If these tools are powerful enough to
model the real physical chip, design parameters can be determined and tested
more accurately. For example, the capacitive load on the wires, which usually
affects line delays, can be closely watched and timed, allowing easier and
more precise critical path analysis.
The current implementation features a 64-point transform size and a data
width of 16 bits. It is common for data signals of this nature to
maintain 16-bit accuracy. Higher accuracy and larger data widths would
require more advanced CAD tools. "BSIM", for example, can only handle vectors
of up to 16 bits. For our purposes, this is not a severe limitation. We set out
to prove that the new idea of incorporating a commutator in FFT computation can
work well.
18MOSIS Technologies, Inc. is located in California
4.1 Subcell implementation
Each word is implemented as a series of 16 "ptregpt.mag" registers,
where one row represents one word of 16 bits. We therefore have
an array of 64 rows. Data enters each register from the metal-2 columns
that extend from top to bottom, connecting all cells in the array that share the
same data bit (this is true for both Xin and Xout). Each register cell is
connected in common to "q1f1", "Phi2", and "q2f1" by perfectly matching
metal-1 (blue) lines across subcells.
The output registers are implemented as a series of 16 register subcells,
one row (R0out) above the array of "ptregpt.mag", and one row (R1out) below
this array. This was done after careful estimation of which pads to assign these
wires to. Again, all metal-1 (blue) wires carry the clocking signals, as is the case
for the array described above.
The multiplexors are placed in close proximity to the rings they
control. This is done to avoid long wire runs for signals. We also assign Pads
that are as close as possible to these cells for their control signals (IN1, ST1,
IN2, ST2).
Because this chip uses NMOS pass transistors as switches, it can operate
at a high speed. Care should still be taken over how strongly to drive the gates of
these transistors, since a drive that is too weak compared to the optimum
causes very significant performance degradation, while a drive that is too
strong would be simply wasteful.
Another benefit of implementing this circuit in CMOS is the small power 
dissipation associated with the technology. Since in CMOS current only flows 
during a switching period, the overall amount of current flow is minimal, and 
the overall heat dissipation is reduced as well. 
4.2 Floorplan 
FFTCOM has been designed with the goal of accommodating as big a
transform size as possible. With the decision to incorporate a radix-2 FFT, we are
limited to transform sizes that are powers of 2. In a design space of 68x67 mm,
we could only fit a 64-point FFTCOM. The next step up would be
a 128-point transform size, but there was not enough space to accommodate it
with the current cell dimensions. Below we examine the block structure of this
floorplan.
As with any type of floorplanning, it is the designer's responsibility
to decide where each cell is placed. It is somewhat similar to
rearranging furniture in your home: you want to get the most
use and practical function out of the arrangement. For example, you do
not want the family couch to block the view of the television. It is therefore
hard to dispute that this is one of the most thrilling and important tasks in
the design process.
It is a two-way street between placing the cells and designing them. As
was done for FFTCOM, it is sometimes very beneficial to redesign some parts of
a cell after realizing that doing so would aid floorplanning tremendously. In
trying to design a good floorplan, we were guided by a set of useful guidelines,
enumerated as follows:
1. Sub-cells should connect to each other perfectly: avoid gaps
and connection failures when subcells are placed side by side.
2. Avoid the need for additional wiring between subcells where
possible: this makes routing much easier and produces more
compact routing channels.
3. Use metal-1 (blue) for power rails in subcells (Vdd and Gnd), and
avoid using metal-2 in subcell-level design.
4. Adopt the rule that all metal-1 (blue) runs horizontally and
metal-2 runs vertically.
[Figure 4-1: Chip Floorplan. Nested cells from the outside in: top, chip2, chip1, chip64x16.mag. Inside the data path, R0out.mag lies above the array of ptregpt.mag and R1out.mag below it.]
By observing the above guidelines, we obtain subcells designed in such
a way that, when we place them together to form a larger structural
block, they fit perfectly. This enables us to take advantage of
the array capabilities of "MAGIC", making large repetitions for VLSI design
manageable. When all the connections are made simply by abutting these
subcells, we only have to worry about routing the end cells (the cells at the
outer boundaries of the array).
This set of guidelines actually forced us to redesign ring Rb2, with an ad-
ditional cell ("Rb2b.mag") that is different from "Rb2a.mag" only in the last pass 
transistor switch part. "Rb2b.mag" does not need this switch because this out-
put ring Rb2 only drives the switches at an interval of 2, as evident from figure 
(2-4)19. 
Figure (4-1) shows how we arranged the cell blocks in the available
design space. Several floorplans were considered, and the best tradeoff
was chosen. The underlying criterion is to come up with a plan that
occupies the least space, which usually also means that the wire run lengths for
the circuitry are minimal20.
19Therefore only cells Rb2: 0, 2, 4, 6, 8, ..., 62 drive the pass transistor switches at the
output side of FFTCOM
20This is not always necessarily true, which is why we say "usually"
4.3 Functional Simulation
Working with MAGIC has been a most rewarding experience. Introduced only a few
years ago, this package is relatively new and offers many useful features not
readily found in older layout editors. Features like automatic and continuous
design-rule checking (drc) significantly help the designer to create cells that
are error free. Rather than finding out about an error late in the process and
having to go back to the drawing board, the designer is alerted immediately
if his design violates any rules. This has been shown to cut down tremendously
on design time [10].
The node count for FFTCOM is 16639, while the transistor count is
9579 n-channel and 6063 p-channel devices (a total of 15,642 transistors). Because a
simulation of this size takes about 6 hours on the Sun 3/60 Workstation that we
used, it is important to prepare as well as possible for the functional simulation.
From the layout, we can extract the circuit information into a transistor descrip-
tion, and then produce an executable binary image that can be tested by the 
"RSIM" simulator for functionality. Below we will describe the steps that we 
took to carry out this stage of simulation. 
4.3.1 Procedure 
After the subcell design is completed, there is a preset procedure that 
should be followed in the correct sequence. These steps include the use of 
several CAD tools that will eventually produce the fabrication file "file.cif' that 
we can send to the fabrication house. The following are the necessary steps and 
a brief explanation for each step, using the actual files for FFTCOM as an ex-
ample, and described in figure (C-1). 
1. Create the Data Path main file that is assembled collectively from 
the subcells. This file is called "chip64x16.mag" in our case. 
2. Generate the DataPath Power Frame: we have two tools (cellframe
and pwrframe) to do the job for us. The first tool extends all
terminals, and the second routes the Vdd and Gnd lines. The syntax
of these commands is:
   cellframe chip64x16 < edgecmds > chip1.mag
and then,
   pwrframe chip1 > chip2.mag
3. Generate the Padframe: the Padframe is a frame containing the
Pads that represent the actual pins of the chip package we selected.
At this step we can request that specific pins be connected
to specific terminals from the DataPath. We must create a file
called "pads.info" that contains our instructions about which label
to assign to which Pad number. This enables us to identify each
Pad when we get to the Routing step. The syntax for these commands is:
   pad2frame < pads.info > pads.input
and then,
   framegen < pads.input -f pads.params
which produces an output file called "padframe.mag".
4. Merge the DataPath inside the Padframe with the command:
   chipframe chip.spec > final.mag
5. Put additional cells: At this stage we can place some additional 
cells that are to be placed outside the Data Path. Examples of 
these are the clockdriver cells, "bufferl" and "buffer2" (for our 
case), the designer's cells, logo cells, etc. 
6. Then we check for Design-Rule errors. When there is none, we can 
proceed with the next step. 
7. Routing step: here we use an automatic tool that would route the 
wires according to a netlist that we specify. The netlist file is in 
"netlist.l", and contains information of which labels we would like 
the Router to connect together. The steps are: 
:specialopen netlist 
:route 
8. After all routing is done, we check again for design rule errors, 
making sure that the Routing did not overlook required connec-
tions or produce violations. 
9. Then we create a file that has all common Vdd and Gnd nodes connected
together. This helps a great deal in reducing the "RSIM" simulation
run-time. From this file (called "TOP.mag"), we extract the
circuit transistor information so that we can produce the binary
image for use with "RSIM". The command is:
   cadmake top.mag top.rsm |& tee top.log
Cadmake is a batch file that runs the programs EXT2SIM and
PRESIM to flatten the circuit description and prepare an "RSIM"
run file. The above cadmake command reads the file
"top.mag" and produces "top.rsm", while recording all processing details
in the file "top.log".
10. After the "RSIM" file is created, we can issue the command:
   nohup nice rsim top.rsm -top.cmd >& top.out &
to let "RSIM" run "top.rsm", take input vectors from "top.cmd", and
record outputs in the file "top.out". Executed this way,
the job runs in the background.
The process described above took about 2 hours for our circuit; the
time-consuming steps are the design-rule checking and the circuit extraction.
Compared to several years ago, however, this is considered very fast,
since such delays were then measured in days, not hours.
4.3.2 "RSIM" simulator
In running "RSIM", recall that we should use the same "file.cmd" that we
used for "NET-RSIM". This file tells "RSIM" to apply exactly the same
test patterns that proved effective when we tested the circuit obtained
from the structural descriptor "NET". There are commands in "RSIM" that help
us measure the line delays of the circuit. These delays can have many
causes, a common one being heavy load capacitance.
"RSIM", however, does not allow us to modify our circuit. In other words,
if we need to make a change to the circuit, we must go back to the layout level
and repeat the whole extraction process to produce a new image file for "RSIM".
In "RSIM" we can define many nodes to be represented by one vector. It is 
more convenient to assign values to those nodes by using this feature, and it 
makes the output readout more readable. 
"RSIM" can execute two possible modes of simulation:
1. The switch model: where each transistor is modeled as a voltage-
controlled switch. 
2. The linear model: where each transistor is modeled as a resistor in 
series with a voltage-controlled switch. 
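To make the linear model concrete, the sketch below treats a conducting transistor as a fixed resistor charging a load capacitor and uses the standard first-order RC result, t = R C ln 2, for the time to reach half of the voltage swing. It is not "RSIM" code, and the on-resistance and load capacitances are hypothetical values chosen only to show that delay grows in proportion to the load.

```python
import math


def rc_delay_ns(r_on_ohms: float, c_load_farads: float, threshold: float = 0.5) -> float:
    """Time (ns) for a first-order RC node to cover `threshold` of its swing."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    tau = r_on_ohms * c_load_farads          # RC time constant in seconds
    return -tau * math.log(1.0 - threshold) * 1e9


# Hypothetical on-resistance of a conducting transistor (not from the thesis).
R_ON_OHMS = 10e3

for c_pf in (0.1, 1.0, 5.0):                 # hypothetical load capacitances in pF
    delay = rc_delay_ns(R_ON_OHMS, c_pf * 1e-12)
    print(f"C = {c_pf:4.1f} pF -> 50% delay = {delay:6.2f} ns")
```

Under this model a node that drives five times the capacitance switches five times more slowly, which is the effect pursued in the performance measurements that follow.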
4.4 Performance Measurement 
Timing analysis of the circuit reveals that capacitive loading greatly affects
circuit performance. This timing information is provided by "RSIM" as we trace
the important nodes with the "t" command (for "trace"). From all the nodes that
are traced, we try to find out which node is the slowest, since this node is
the "bottleneck" for the overall performance of the chip. We will discuss below
how our observations on the timing led us to significant circuit improvements.
Five different designs were investigated for this particular implementation,
and their performance increases gradually as improvements are made to
subsequent designs. The design started with an initial operating speed of
11.4 MHz, which corresponds to a longest delay of 87.6 ns.
4.4.1 Timing 
We found that the first design has a delay of 87.6 ns, associated with the
node for "Phi2". This delay would limit the chip performance to approximately
11.4 MHz. A quick study of the node revealed that the slow switching was due to
the heavy load that "Phi2" had to drive. It is a common connection for all
"Phi2" inputs in the entire Data Path, and the signal was evidently suffering
from the huge fan-out.
At this point we investigated methods to strengthen the signal "Phi2" so
that it could manage the large load it must drive. The first step was to place
an inverter at every horizontal metal-1 "Phi2" line. The inputs to these
inverters come from the inverted signal from the Pad. This improved the
performance from 87.6 ns to 51.7 ns (design #2).
Then we decided to place a driving cell to strengthen the source signal
"Phi2" from the Pads, while still maintaining the inverter-buffers at the ends
of every row inside the Data Path block. Since we decided to use a 3-stage
ratioed-inverter cell, we switched back to taking Phi2 from the "high" wire
from the Pad (as it was for design #1; recall that we switched to taking the
"low" wire for design #2). This is a welcome situation, since the Pad drives
its "high" wire more strongly than its "low" wire. This change improved
performance from 51.7 ns to 41.6 ns (design #3). At this point the bottleneck
node was no longer "Phi2", but rather "Phi1".
Continuing these performance improvement steps, we then placed "Buffer1",
which like "Buffer2" is a 3-stage ratioed inverter, beneath the Pad "Phi1".
This improved the "Phi1" switching time from 41.6 ns to 32.4 ns (design #5).
Design #4 shows a slightly faster speed, but it is an incomplete test run,
since the simulation was aborted before normal termination. We include its
timing information, however, to illustrate that each design must be tested in
exactly identical environments and run on identical test patterns for its
timing figures to be valid.
The following table summarizes the performance numbers:
Performance Timing Table

Design      Speed       Delay
design 1    11.4 MHz    87.6 ns
design 2    19.3 MHz    51.7 ns
design 3    24.0 MHz    41.6 ns
design 4    32.0 MHz    31.2 ns
design 5    30.8 MHz    32.4 ns
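The speed column follows from the delay column by taking the reciprocal of the longest delay. The short sketch below redoes that arithmetic for the five designs. The delays are copied from the table; the function name is only illustrative. The computed values agree with the tabulated speeds to within 0.1 MHz.

```python
# Longest delay per design, in nanoseconds, as listed in the table.
DELAYS_NS = {1: 87.6, 2: 51.7, 3: 41.6, 4: 31.2, 5: 32.4}


def max_clock_mhz(delay_ns: float) -> float:
    """Highest clock rate (MHz) whose period still covers the given delay."""
    return 1000.0 / delay_ns


for design, delay in DELAYS_NS.items():
    print(f"design {design}: {delay:5.1f} ns -> {max_clock_mhz(delay):5.2f} MHz")
```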
4.4.2 Critical Path Analysis 
Since today's FFT butterfly processors are estimated to be able to run at
100 MHz, the longest delay a useful commutator can have is 10 ns. Otherwise it
would slow down the pipeline and create a bottleneck for those processors. At
the time this design was finalized, most of the line delays were less than
9.5 ns. Only one section has a delay of 32.4 ns. This isolated case indicates
that the overall performance could reach 100 MHz if this particularly slow
section could be eliminated. This is the use of critical path analysis. It must
be understood that driving the current harder is not the only way to improve
circuit performance. This method has an optimum level, and it would be wasteful
to drive the current beyond that level. If the designer still wants to speed up
his chip, a study of the gating circuits might prove useful.
To illustrate this, consider our FFTCOM gating circuit. At the output
side, there are two connecting wires that tie 32 rows together, and this loads
the pass transistor switch that must drive a selected data value into the
output registers (R0out and R1out). It is probably possible to come up with a
different circuit topology that could limit the large load, or at least provide
some control to disconnect parts of the load before switching. This would be a
major reconstruction of the output side, and is left for future work with our
suggested recommendations.
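The critical path argument above can be restated as a small check: the slowest traced node sets the clock, and a target rate is met only if every node fits within the clock period. The sketch below uses the two figures quoted in this section (32.4 ns for "Phi1", and 9.5 ns as the stated upper bound for the remaining lines). The dictionary layout and function names are illustrative assumptions, not "RSIM" output.

```python
# Delays quoted in the text; "other lines" is an upper bound, not a measurement.
NODE_DELAYS_NS = {"Phi1": 32.4, "other lines (upper bound)": 9.5}


def critical_node(delays_ns: dict) -> tuple:
    """Return (name, delay) of the slowest node, i.e. the bottleneck."""
    name = max(delays_ns, key=delays_ns.get)
    return name, delays_ns[name]


def meets_target(delays_ns: dict, target_mhz: float) -> bool:
    """True if every node delay fits within one period of the target clock."""
    budget_ns = 1000.0 / target_mhz
    return all(delay <= budget_ns for delay in delays_ns.values())


name, delay = critical_node(NODE_DELAYS_NS)
print(f"bottleneck: {name} at {delay} ns")
print("meets 100 MHz:", meets_target(NODE_DELAYS_NS, 100.0))

# Hypothetical case: the Phi1 section reworked to match the other lines.
reworked = dict(NODE_DELAYS_NS, Phi1=9.5)
print("meets 100 MHz after rework:", meets_target(reworked, 100.0))
```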
4.5 Fabrication 
We have indicated that FFTCOM was fabricated on a 64-pin chip. None of the
pins is unused; in fact, the large Data Path section could use another pair of
pins if we adopted the rule of thumb of placing a power (Vdd or Gnd) Pad after
every 4 pins. We used a count of 5 to spread the power pins further apart and
cut down on the pin count. MOSIS Technologies, Inc. is the fabricator, and
their part number for this package is 64P89X69.
In order for the fabricator house to know the exact details of a chip, they
must receive a CIF file from the designer²¹. We prepared our CIF file from the
"final.mag" file. Only after the designers are satisfied with the performance
of the circuit can we go back to "MAGIC" and produce the CIF file²². Recall
that we use "final.mag" instead of "top.mag", since the latter was created only
to make the "RSIM" simulation run faster. This CIF file can easily be
transmitted across computer networks to reach the fabricator house.
At the time of this writing, the fabricator house has not delivered the chip,
so the testing results that were originally planned cannot be included in this
thesis.
²¹ CIF = Caltech Intermediate Form, a low-level graphics language for
specifying the geometry of integrated circuits
²² In "MAGIC", use :cif ostyle mosis1.0(nwell)
Chapter 5 
Summary and Future Work 
5.1 Summary of Current Implementation
A signal flow structure for FFT computation that is based on stage
independent addressing was introduced as the working environment that FFTCOM
supports. FFTCOM was implemented and shown to effectively serve a two-cycle
Butterfly Processor in performing the above FFT computation. The current
implementation supports a 64-point FFT transform size. The FFTCOM chip was
developed using the CAD tools "BSIM", "NET", "RSIM", and "MAGIC". "BSIM", as a
behavioral simulator, was very useful in allowing the designer to model his
idea and find the parameters for a circuit design. "NET" has been a great aid
for investigating structural connections for the circuit and obtaining
preliminary timing information. "RSIM" is the logic simulator that tests the
functional correctness of both the "NET" and the "MAGIC" executable binary
files. "MAGIC" has been an exceptional chip layout editor. The ease of use and
excellent performance of this package cut down the design time for this stage
tremendously. The chip was then fabricated on a MOSIS 64P68X69 package, which
is a 2µ N-Well Double Metal technology.
Different designs illustrating successive chip performance improvements
have been discussed. The final design incorporates 15,642 transistors. The
design space was 68x67 mm2, but only about half of that is filled by the Data
Path block. Investigation showed, however, that the available design space
could not accommodate a 128-point FFTCOM with the current circuit
implementation.
5.1.1 Achievements 
FFTCOM has been successful in proving that it could benefit a two-cycle
Butterfly Processor implemented for a stage independent FFT signal flow
computation. The chip also eliminates the need for random access memory.
Sequential data access enables it to serve the processor by providing a
sequential order of data values, already in bit-reversed ordering as required.
The current speed performance of FFTCOM is 31 MHz. No controls are necessary
during operation, and only simple procedures need to be executed when the user
wants to enter new bit-values into the control rings.
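For readers unfamiliar with the term, the sketch below computes the conventional bit-reversed index order for a 64-point transform. It illustrates only the ordering itself, using the textbook definition. It does not model how FFTCOM's control rings produce that order in hardware.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit of index up
        index >>= 1
    return result


N = 64                       # transform size supported by the current chip
BITS = N.bit_length() - 1    # 6 address bits for N = 64

order = [bit_reverse(i, BITS) for i in range(N)]
print(order[:8])
```

Applying the permutation twice returns the natural order, so the same routine can be used to check a reordered sequence.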
5.1.2 Limitations
The limitations of the current FFTCOM chip are the transform size it
accommodates and its speed performance. Being limited to a 64-point FFT
computation makes this chip too size-specific. It would be desirable for this
chip to serve any power-of-2 size of FFT computation. Cascading methods should
be further investigated.
The speed performance should also be optimized further, possibly requiring
some re-design of certain parts of the circuit. It was noted that heavy loading
caused slow switching speeds.
In layout design, care should be taken that the signal spacing is at least
10λ wide. This allows the Auto-Router tool to route these signals properly
without collision.
As for signal labels, watch out for the reserved signal names. "Phi1" and
"Phi2" are certainly reserved. Any signal names that contain those words would
be merged together. For example, we used the names "Q1phi1" and "Q2phi2"
earlier, and we did not realize that "MAGIC" will merge these two signals with
"Phi1". This should be avoided by using names that are completely different and
do not contain the reserved words. We quickly solved it by replacing the names
with "q1f1" and "q2f1".
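A simple pre-layout check can catch label clashes of this kind. The sketch below flags any proposed signal name that contains a reserved clock name as a substring. The case-insensitive substring rule is an assumption drawn from the behaviour described above, not a documented "MAGIC" rule, and the function name is hypothetical.

```python
# Reserved clock names from the text, compared in lower case (assumption).
RESERVED = ("phi1", "phi2")


def clashes(signal_name: str, reserved=RESERVED) -> list:
    """Return the reserved words hidden inside `signal_name`, if any."""
    lowered = signal_name.lower()
    if lowered in reserved:
        return []            # the clock signal itself is not a clash
    return [word for word in reserved if word in lowered]


for name in ("Phi1", "Q1phi1", "Q2phi2", "q1f1", "q2f1"):
    found = clashes(name)
    status = "clashes with " + ", ".join(found) if found else "ok"
    print(f"{name:8s} {status}")
```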
5.2 Future Work 
We would highly recommend future development of FFTCOM. If the next
designer can enhance the chip so that it serves any power-of-2 size of FFT
computation, the usefulness of FFTCOM would have a much greater impact. The
speed performance could also be studied more closely and new designs tested.
5.2.1 Performance Speed 
Performance has always been a major factor in determining how well a chip
will fare in the market. Butterfly processors can now be pushed to run at
100 MHz, which means that chips designed to complement them must also be able
to keep pace. FFTCOM has shown that almost all switching takes less than
10 ns. The limitation was a section that imposed a 32 ns delay. This is the
"Phi1" signal, which has a heavy load to drive. If these clocking signals could
also be limited to a delay of under 10 ns, then we would achieve the desired
100 MHz performance.
Future work could concentrate on designing new circuitry to eliminate the
constant heavy loading. A special "OR" gate might be useful to disconnect
unnecessary loading when possible, since only one data register delivers a
data value to each output register at a time.
5.2.2 Transform Size
The current design implementation only allows for a 64-point transform
size. It was originally our goal to incorporate a multi-chip capability as
well, so that the user could cascade these chips to accommodate any radix-2
FFT transform size that he needs. But difficulties arose at the output control
side, since that capability would not allow this chip to hardwire some
important control nodes, thus defeating the other goal, which is very simple
control signals. In the current implementation of FFTCOM, we already have two
rings of 1-bit registers that could easily be designed to be totally
independent, but they were tied to the same two control Pads IN2 and ST2 on
this chip because of pin count availability. If it is desired to control the
upper and lower halves of the array of N data registers separately, no more
design space is required. A chip package with more pins might be required
(MOSIS also offers an 84-pin package instead of the 64-pin package this chip
is fabricated on), and more freedom can be gained from the added ability to
control the two output rings independently.
Strictly for cascading purposes, future work should investigate how to con-
nect the control rings between chips. Assigning two more pins as flow-in and 
flow-out for each of these rings might provide the necessary connection. Then 
the more difficult part is separating the output registers. Care should be taken 
as to which output register to enable for a particular transform size. Also, in the 
cascading mode, the two halves (previously for even and odd) on each chip must
be connected together since they are now representing the same 'half', either
all even or all odd.
Appendix A 
BSIM workfiles
This appendix gives an overview of the work files that are needed for the
behavioral simulation, describes what each file is used for, and shows them in
a relative hierarchy structure.
BEHAVIORAL SIMULATION
Tools: BSIM and "C" language 
DIAGRAM OF WORK FILES:
[Diagram: model.c, model.cmd, lib.c, mylib.c, padlist.l and the other files
are shown as inputs to BSIM, which produces the output file.]
Figure A-1: BSIM files hierarchical tree
DESCRIPTION OF WORK FILES:
model.c       a "C" language depiction of the circuit being modelled. This is
              the main file, prepared by the user, that contains all the
              internal workings of his circuit.
lib.c         the default library file; contains the primitives used as a
              template for model.c. These are generic functions, and any
              other functions that the user's model.c needs must be added to
              the file called mylib.c.
mylib.c       a library of functions that are repeatedly used by the main
              file, model.c, collected in this separate file so that model.c
              is not cluttered and modular programming is easier. This file
              contains functions that the user needs and that are not
              provided in the file lib.c. Some examples of the kinds of
              routines found in such a library are:
              • register - defines a register element mimicking a real
                register with Phi1 and Phi2 excitation clocking.
              • add42 - implements a 4-2 Adder for adding four numbers.
              • pipereg - a motion definition to move data through the
                register.
              • Xreg - initializes a register to value X, which is the
                undefined value, to ensure no unwanted initial values are
                assumed by any register.
              • n_print - displays the content of a node.
              • cpa - implements a carry propagate adder.
padlist.l     defines the Pad vectors which the model will use; describes
              whether a Pad is an input or output, how many bits a vector
              consists of, and directional information.
model.cmd     a command file to let the simulation run in batch mode;
              contains vector definitions, watch commands that inspect
              specified nodes of interest, and the test input values.
other files   the files that support the "BSIM" package. These are included
              with the package at the time of installation.
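To give a feel for what routines such as "register" and "Xreg" model, the following sketch mimics a register clocked by two non-overlapping phases and initialised to the undefined value X. It is written in Python purely for illustration. It does not reproduce the "BSIM" library or its C interfaces, and the class and method names are hypothetical.

```python
X = "X"   # stand-in for the undefined value


class TwoPhaseRegister:
    """Behavioral sketch: data is captured on Phi1 and released on Phi2."""

    def __init__(self):
        # Both internal stages start undefined, as an Xreg-style reset would leave them.
        self.master = X
        self.slave = X

    def clock(self, data_in, phi1: bool, phi2: bool):
        """Apply one clock phase and return the value seen at the output."""
        if phi1 and phi2:
            raise ValueError("Phi1 and Phi2 must not overlap")
        if phi1:
            self.master = data_in      # input stage is transparent during Phi1
        if phi2:
            self.slave = self.master   # output stage updates during Phi2
        return self.slave


reg = TwoPhaseRegister()
print(reg.clock(1, phi1=True, phi2=False))   # output still undefined
print(reg.clock(0, phi1=False, phi2=True))   # captured value now visible
```

Starting from X rather than 0 makes it obvious in the simulation output when a register is read before anything has been clocked into it.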
Appendix B
NET workfiles
This appendix gives an overview of the work files that are needed for the
structural simulation, describes what each file is used for, and shows them in
a relative hierarchy structure.
STRUCTURAL SIMULATION 
Tools: NET and RSIM 
DIAGRAM OF WORK FILES:
[Diagram: the files model.net, model.sim, model.rsm, model.cmd, model.out and
cmos100.prm are shown together with the CAD tool NET.]
Figure B-1: NET files hierarchical tree
DESCRIPTION OF WORK FILES:
model.net     a file written in the language for "NET" that describes how the
              transistors interconnect to form the circuit. Transistor sizes
              are also realistically implemented, so the user will be able to
              get useful timing information from this simulation.
model.sim     an intermediate file translated from model.net; it is used as
              input to the program "presim".
model.rsm     a binary executable file generated by the program "presim" from
              the input model.sim. This is the simulation running file.
model.cmd     a command file to run the simulation in batch mode, allowing
              long simulation processes to be run overnight if necessary,
              requiring no user interaction.
model.out     a file containing the output results of the "NET-RSIM"
              simulation.
Appendix C
MAGIC workfiles
This appendix gives an overview of the work files that are needed for the
layout and functional simulation, describes what each file is used for, and
shows them in a relative hierarchy structure.
LAYOUT and FUNCTIONAL SIMULATION 
Tools: MAGIC and RSIM 
DIAGRAM OF WORK FILES:
[Diagram, top to bottom: TOP.mag (this level is only useful for RSIM
simulation); FINAL.mag (the 'final' product, which merges DataPath, Padframe
and others, and from which the CIF file for fabrication is produced);
CHIP2.mag (chip2 is the datapath after pwrframe), padframe.mag (individual Pad
files) and other cells (logo, ratioed inverters, etc.); CHIP1.mag (chip1 is
the data path after cellframe) and the ring cells sregand.mag; chip64x16.mag
(the data path main design); and the subcells ptregpt.mag (data register),
muxr.mag (multiplexor), R0out.mag and R1out.mag (output registers).]
Figure C-1: MAGIC files hierarchical tree
DESCRIPTION OF WORK FILES:
top.mag         the top level cell, which was created only for the purpose of
                running RSIM faster. It is almost identical to final.mag, but
                top.mag has all Vdd and Gnd connected together, hence
                allowing RSIM to run faster.
final.mag       the file that produces final.cif, which is the file that gets
                sent to the fabrication house for chip fabrication.
                final.mag contains the subcells and the Pads layout.
PadIn           input pad for signals from outside to enter the chip's
                physical medium.
PadOut          output pad for signals from inside the chip to go out.
PadVdd          pad to connect the common Vdd ring.
PadGnd          pad to connect the common Gnd ring.
PadVddToDP      pad to connect to the data path's Vdd power bus.
PadGndToDP      pad to connect to the data path's Gnd power bus.
logo            subcell to identify Lehigh University's copyright.
designer name   subcell to identify the designers.
chip64x16       the ensemble of the main custom designed subcells, but
                without the power rails. Power rail and label extraction are
                done by tool programs.
ptregpt         data register that holds the data value. Data enters the
                register when Phi1 is high, and leaves when Phi2 is high. But
                this cell is driven by two qualified clocks, named q1f1 and
                q2f1. q1f1 is an AND relationship of Phi1 and the ring Rb1.
                q2f1 is excited by Phi1 and the ring Rb2.
sregand1        ring Rb1. This is a ring of N=64 1-bit registers in total,
                which during operation has only one cell with a value "1" and
                the rest "0". It is used to excite the qualified clock q1f1.
sregand2a and sregand2b
                ring Rb2. This is a ring of N=64 1-bit registers in total,
                which during operation has only one cell with a value "1" and
                the rest "0". The latency of the location of the "1" relative
                to ring Rb1 is a count of N/2+1. This conforms to the stage
                independent addressing count for the bit reversed order of
                the input sequence.
muxr1           multiplexor to select whether to let a new value from IN1
                enter the ring Rb1 or to let the ring loop around. When
                ST1=1, IN1 gets in.
bufmr           buffer for the register that holds ST1 and IN1 for muxr1.
buf             buffers to drive Phi2 into muxr1.
bufn            buffer to drive Phi2 into the row of output registers R1out.
R0out           output registers that are qualified by Phi2 only.
R1out           output registers that are qualified by Phi2 only.
muxr2a          multiplexor for ring Rb2.
muxr2b          multiplexor for ring Rb2.
buffer1         a 3-stage inverter cascade with ratio 3 to drive Phi1
                stronger.
buffer2         a 3-stage inverter cascade with ratio 3 to drive Phi2
                stronger.
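The control rings and their input multiplexors described in this appendix can be pictured with a small behavioral sketch: a ring of 1-bit cells that either loops around or, when its ST input is 1, accepts a new bit from its IN input. The shift direction, the cell numbering and the class name are assumptions made for illustration. The sketch uses a ring of 8 cells instead of 64 to keep the trace short.

```python
class ControlRing:
    """Behavioral sketch of a ring of 1-bit registers fed through a multiplexor."""

    def __init__(self, n: int = 64):
        self.cells = [0] * n

    def step(self, st: int = 0, in_bit: int = 0) -> None:
        """Advance one clock: load `in_bit` when st == 1, otherwise loop around."""
        head = in_bit if st else self.cells[-1]   # the multiplexor's choice
        self.cells = [head] + self.cells[:-1]     # shift every cell by one place

    def hot_positions(self) -> list:
        """Indices of the cells currently holding a 1."""
        return [i for i, bit in enumerate(self.cells) if bit]


ring = ControlRing(n=8)

# Load a one-hot pattern: a single 1 followed by zeros, with ST held at 1.
ring.step(st=1, in_bit=1)
for _ in range(7):
    ring.step(st=1, in_bit=0)
print("after loading:", ring.hot_positions())

# With ST = 0 the ring loops and the single 1 keeps circulating.
for _ in range(3):
    ring.step()
    print("looping:", ring.hot_positions())
```

Because exactly one cell holds a 1 at any time, the ring enables one data register per clock without any address decoding.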
Appendix D
Layout Cells
A few of the most important layout cells are displayed in this appendix to
show how the designer implemented each cell. Also shown is the circuitry each
cell represents. This is to aid the reader in quickly recognizing what part of
the circuit those cells belong to.
These cells are: 
1. sregand1.mag - this cell creates ring 1, which is the ring that controls
the input side.
2. sregand2a.mag - this cell creates ring 2, which is the ring that controls
the output side.
3. sregand2b.mag - this cell also creates ring 2, which is the ring that
controls the output side.
4. ptregpt.mag - the main data register cell. Data is qualified to enter this
cell by q1f1 (Phi1 qualified by ring 1), appears at the output of the register
at Phi2 excitation, and is then selected out to the output register by q2f1
(Phi1 qualified by ring 2).
5. muxr1.mag - a multiplexor to select whether new data will enter ring 1 or a
loop should occur.
6. muxr2a.mag - a multiplexor to select whether new data will enter ring 2 (at
the top half) or a loop should occur.
7. muxr2b.mag - a multiplexor to select whether new data will enter ring 2 (at
the bottom half) or a loop should occur.
8. R0out.mag - output register
9. R1out.mag - output register
10. Rb2.mag - shows in more detail how the two cells "2a" and "2b" make up a
joint ring 2 (Rb2); this saves space and allows independent control of each
half of ring 2.
sregand1.mag: the cell for each 1-bit register in ring Rb1
Figure D-1: sregand1 circuitry
sregand2a.mag: the left cell for each 1-bit register in ring Rb2
Figure D-2: sregand2a circuitry
sregand2b.mag: the right cell for each 1-bit register in ring Rb2
Figure D-3: sregand2b circuitry
Figure D-4: "sregand1.mag" - for ring Rb1
Figure D-5: "sregand2a.mag" - left side cell of ring Rb2 
Figure D-6: "sregand2b.mag" - right side cell of ring Rb2
muxr1.mag: the multiplexor for ring Rb1 (inputs IN1 and ST1)
Figure D-7: muxr1 circuitry
muxr2a.mag: the multiplexor for ring Rb2 top side (inputs IN2 and ST2)
Figure D-8: muxr2a circuitry
muxr2b.mag: the multiplexor for ring Rb2 bottom side (inputs IN2 and ST2)
Figure D-9: muxr2b circuitry
Figure D-10: "muxr1.mag" - multiplexor for ring Rb1
Figure D-11: "muxr2a.mag" - multiplexor for ring Rb2 top
Figure D-12: "muxr2b.mag" - multiplexor for ring Rb2 bottom
ptregpt.mag: the data register cell, also depicted as Rs
Figure D-13: ptregpt circuitry
Figure D-14: "ptregpt.mag" - main data register
R0out.mag: the output cell for the top half of the data registers
Figure D-15: R0out circuitry
R1out.mag: the output cell for the top half of the data registers
Figure D-16: R1out circuitry
Figure D-17: "R0out.mag" - output register for R0
Figure D-18: "R1out.mag" - output register for R1
Figure D-19: Rb2 circuitry (the diagram shows sregand2a.mag cells,
sregand2b.mag cells, and a mux)
Figure D-20: "Rb2.mag" - how Rb2 is assembled
buffer1.mag: a 2-stage inverter cell with ratio 4
    inverter 1: (100/2) x 1
    inverter 2: (133/2) x 3
Figure D-21: Buffer1 circuitry
buffer2.mag: a 3-stage inverter cell with ratio 3
    inverter 1: (72/2) x 1
    inverter 2: (72/2) x 3
    inverter 3: (162/2) x 4
Figure D-22: Buffer2 circuitry
Figure D-23: "buffer1" - driver for Phi1
Figure D-24: "buffer2" - driver for Phi2
Figure D-25: Actual Chip Layout
f 
[1] 
References 
J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calcula-
tion of Complex Fourier Series", Math. Computation, IEEE, IEEE, 1965, 
pp. Vol. 19, pp.297-301. 
[2] Roman Kuc, Introduction to Digital Signal Processing, McGraw-Hill, 
New York, 1988. 
[3] William D. Stanley, Gary R. Dougherty, Ray Dougherty, Digital Signal 
Processing, Reston Publishing Company, Reston, VA, 1984. 
[4] Maurice Bellanger, Digital Processing of Signals, John Wiley & Sons 
Ltd., Paris, 1984. 
[5] 6 Alfred J. Eiblmeier, "A Reduced Coefficient FFT Butterfly Processor''. 
[6] Alan V. Oppenheim, Ronald W. Schafer, Digital Signal Processing, 
Prentice-Hall, Inc., Englewood Cliffs, 1975. 
[7] James. B. Burr, Stanford University, BSIM User Guide, Stanford, CA, 
1988. 
[8] NET User Guide, 1988. 
[9] RSIM User Guide, 1988. 
[10] J. K Ousterhout, et. al., "The MAGIC VLSI Layout System", IEEE 
Design & Test of Computers, pp. 19-30, Vol. Feb85, 1985. 
[11] R. C. Singleton, "A Method for Computing the Fast Fourier Transform 
with Auxiliary Memory and Limited High-Speed Storage"", IEEE Trans. 
Audio Electroacoustics, IEEE, IEEE, 1967, pp. Vol AU-15, June 1967, 
pp.91-97. 
[12] Karoran Eshraghian, Douglas A. Pucknell, Basic VLSI Design - Systems 
and Circuits, Prentice Hall, Sydney, 1988. 
[13] Douglas F. Elliott, Handbook of Digital Signal Processing, Academic 
Press, Inc., London, 1987. 
[14] Thomas Young, Linear Systems and Digital Signal Processing, Prentice-
Hall, Inc., Englewood Cliffs, NJ, 1985. 
[15] Thomas M. Frederiksen, Intuitive Digital Computer Basics, McGraw-
Hill, Inc., New York, 1988. 
[16] C. S. Burrus, T. W. Parks, DFT IFFT and Convolution Algorithms, Wiley 
Interscience, New York, 1985. 
[17] Hassan K. Reghbati, Tutorial: VLSI Testing & Validation Techniques, 
IEEE Computer Society Press, Washington D.C., 1985. 
77 
[18] K. Ramaroohan Rao, Discrete Transforms and their Applications, Van 
Nostrand Reinhold Company, New York, 1985. 
[19] Mark A. Richards, "On Hardware Implementation of the Split-Radix 
FFT", IEEE Transactions on Acoustics, Speech., and Signal Processing, 
IEEE ASSP Society, Piscataway, New Jersey, 1988, pp. 1575-1581. 
[20] Zhi-Jian Mou, Pierre Duhamel, "In-Place Butterfly-Style FFT for 2-D 
Real Sequences", IEEE Transactions on Acoustics, Speech, and Signal 
Processing, IEEE ASSP Society, Piscataway, New Jersey, 1988, pp. 
1642-1650. 
[21] Ja-Ling Wu, Chau-Yun Hsu, "Comments on "A two-Stage Representation 
of DFT and its Applications", IEEE Transactions on Acoustics, Speech, 
and Signal Processing, IEEE ASSP Society, Piscataway, New Jersey, 
1988, pp. 1687. 
[22] James B. Burr, Frame's User Guide, 1988. 
[23] Neil Weste, Kamran Eshragian, Principles of CMOS VLSI Design, 
Addison-Wesley, Don Mills, Ontario, 1985. 
78 
/ --
/ 
Vita 
The author was born in Surabaya, Indonesia on September 12, 1963, to 
Ronald Vincentius Sidharta M.D. and Theresia Maudy Goei. After graduation 
from St. Louis Senior High School in Surabaya, Indonesia, in May 1982, he en-
rolled in Stevens Institute of Technology in Hoboken, New Jersey, pursued a 
Computer Science curriculum, and then continued at Rutgers College of
Engineering in Piscataway, New Jersey, to pursue Electrical Engineering. He
worked as a computer operator at Rutgers Univ. Computing Center during his
last year, received his B.S.E.E. degree in May 1986, and then joined Fermi
National Accelerator Laboratory, in Batavia, Illinois, as a project engineer and
completed a high energy physics experimental project in 1987. In Fall 1987, he 
enrolled in the Department of CSEE at Lehigh University, Bethlehem, Pennsyl-
vania and concentrated on his areas of interest that include high performance 
digital signal processing design, logic design, parallel computer architecture, 
microprocessor design, and digital communications. He also received a Lehigh 
University graduate scholarship award and completed his M.S.E.E. degree in 
June 1989. 
