VLSI implementation of a discrete cosine transform by Crook, D.
University of Wollongong 
1991 
Recommended Citation 
Crook, D., VLSI implementation of a discrete cosine transform, thesis, , University of Wollongong, 1991. 
https://ro.uow.edu.au/theses/3020 

VLSI IMPLEMENTATION OF A DISCRETE 
COSINE TRANSFORM
A thesis submitted in partial fulfilment of the 
requirements for the award of the degree






THE UNIVERSITY OF WOLLONGONG
by
D. Crook, B.E. (Mech),
Graduate Diploma (Comp. Sc.)
Department of Computer Science
1991
ACKNOWLEDGEMENTS
I would like to give my sincere thanks and gratitude to my supervisor, John 
Fulcher, for his continual assistance and guidance throughout the course of this project.
I would also like to extend my gratitude to Peter Gray for his patience and 
assistance with the system problems I encountered. To Van Dao Mai of the 
Information Technology Centre, University of Wollongong, I would like to extend special 
thanks for his invaluable help at various times during the project.
Chong Hee Tan of the University of New South Wales must be thanked for his 
guidance and assistance in using the UNSW VLSI design tool package. Henry Wu, 
formerly of the Electrical Engineering Department, University of Wollongong, must also be 
thanked for his assistance in deciphering the discrete cosine transform algorithm.
To my wife, Kelly, whose patience I nearly exhausted, I would like to thank for her 




Discrete cosine transforms (DCTs) were originally introduced in digital signal 
processing (DSP) for the purposes of pattern recognition and Wiener filtering. The 2D DCT 
is used for transform coding of images in telecommunications. It can be implemented by 
fast algorithms in either software or hardware.
There are many one dimensional Fast Cosine Transform (FCT) algorithms. 2D 
transforms are usually performed by a row then column approach using a one dimensional 
FCT.
In this report, various designs of VLSI DCT chips using the UNSW multi-project 
chip (MPC) software design tools are discussed. The DCT is implemented in hardware 
because the specialist nature of the chip design makes both parallel and pipelined data 
processing possible.
The designs discussed use the simple 1D DCT algorithm proposed by Lee to 
process data into a 2D transform. One design is based on a conventional bit parallel design 
and the other designs are bit serial. Designs are evaluated with respect to silicon real estate 
and processing speed with the more favourable design committed to silicon at the AWA 
CMOS fabrication facility.
The final design has the following characteristics:
- 15 MHz master clock
- parallel data input and output, with 8 pulses required to read/write the data from the chip
- two buffers to store the 8 x 12 bit input and transformed data words
- single pass ALU
- 22 pulses required to transform the data
The resultant design using a 15 MHz clock cycle is able to process an eight data 
word set in 2001 ns, thereby allowing the possibility of processing an eight by eight data 
























1.2 Lee Algorithm
1.3 Mathematical Definition
1.4 Forward DCT Equations
2.0 DESIGN PARAMETERS
2.1 Parameters
2.2 Data Length
3.0 DCT IMPLEMENTATION
3.1 Introduction
3.2 Bit Parallel Versus Bit Serial
3.3 Revised Butterfly Design
3.4 Multiplication Accuracy
4.0 DESIGN DESCRIPTION
4.1 Data Input / Output
4.2 Arithmetic Logic Unit
4.3 Timing
4.4 Control
4.5 Design Cycle Time
5.0 PARALLEL DESIGN
5.1 Introduction
5.2 ALU Design
5.3 Timing and Control
5.4 Bit Parallel Process Flow
6.0 SERIAL DESIGNS
6.1 Introduction
6.2 Design A2-1 Lee Butterfly Implementation
6.3 Design A2-1 Special Considerations
6.3.1 Data Sequencing
6.3.2 Multiplication Magnitudes
6.3.3 Data Stream Length
6.4 Design A2-2 Expanded Butterfly, 2 Stages
7.0 DESIGN DECISION ANALYSIS
8.0 MODULE DESIGN AND CONTROL
8.1 Multiplication
8.1.1 Negative Multiplication
8.1.2 Least Significant Coefficient
8.2 Adders
8.3 Subtraction
8.4 Delay Elements
8.5 Input / Output Buffers
8.6 Clocking Strategy
8.7 Clock Skew
8.8 Counters
8.9 Chip Control
8.10 Chip Control and Data Lines
8.11 Read / Write Enables
9.0 RESULTS DISCUSSION AND ANALYSIS
9.3 Expected Design Performance
9.3.1 Design Cycle Time
9.3.2 Transformation Accuracy
9.3.3 Algorithm Verification
9.3.4 Simulator Results
9.3.4.1 Spice Simulations
9.3.4.2 Trek Simulations
9.4 Results Evaluation
9.4.1 Algorithm Verification
9.4.2 Spice Simulations
9.4.3 Trek Simulations
9.5 Design Tools
9.5.1 Design Tool Idiosyncrasies
10.0 CONCLUSIONS
11.0 FURTHER WORK
APPENDICES
A.1 ALGORITHM VERIFICATION PROGRAM AND RESULTS
A1.1 Verification Program
A1.2 Verification Data Set
A1.3 Verification Program Results
A.2 SPICE SIMULATION RESULTS
A.3 TREK SIMULATION RESULTS
A.4 FIFO CHIP DATA
A.5 SPICE OUTPUT DATA FILTER
A.6 CIF PLOT DATA FILTER
A.7 CHIP PLOT
A.8 COMPARISON TO INMOS DESIGN
A.9 TWO PHASE CLOCK ANALYSIS
A.10 CIRCUIT DIAGRAMS
BIBLIOGRAPHY
CHAPTER 1 - DISCRETE COSINE TRANSFORM
1.1 Introduction
Most of the information in this chapter is derived from Wu's doctoral thesis [5].
The DCT was first introduced into digital signal processing for the purposes of 
pattern recognition and Wiener filtering [6]. It has since been incorporated in the Joint 
Photographic Experts Group (JPEG) standard [8] and in the Video Codec for Audiovisual 
Services, Recommendation H.261 [4].
The 2D DCT is used for transform coding of images in telecommunications, 
especially prior to transmission over packet switching ISDN networks. It can be 
implemented by fast algorithms in either software or hardware form, and its performance is 
near optimal, being virtually indistinguishable from that of the Karhunen-Loeve Transform [5].
There are many one dimensional Fast Cosine Transform (FCT) algorithms 
available, but 2D transforms are usually performed by a row then column approach using a 
one dimensional FCT algorithm.
Lee first proposed his algorithm in 1984, and this is the algorithm that was used as 
the basis of the chip design in the current project [5]. Other DCT designs were considered, 
but were found to be no more efficient than Lee's [5].
1.2 Lee Algorithm
One of the main advantages of the DCT and the resultant FCT is the reduction in 
the number of multiplications compared to the Fast Fourier Transform. The number of 
real multiplications required for an FCT is
\[ \frac{N}{2} \log_2 N \]
for an N point DCT with $N = 2^m$, which is about half that of FFT algorithms.
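As a rough illustration of the multiplication count quoted above, the following Python sketch evaluates (N/2) log2 N for a few transform sizes and sets it beside the N x N count of a direct matrix evaluation. The function names are illustrative only and do not come from the thesis.

```python
def fct_multiplications(n: int) -> int:
    """Real multiplications for an N point fast cosine transform, N = 2**m."""
    m = n.bit_length() - 1
    if n < 2 or (1 << m) != n:
        raise ValueError("N must be a power of two")
    return (n // 2) * m


def direct_multiplications(n: int) -> int:
    """Multiplications when every output is a weighted sum of all N inputs."""
    return n * n


for size in (8, 16, 32):
    print(size, fct_multiplications(size), direct_multiplications(size))
```

For the eight point transform used in this project the fast algorithm needs 12 multiplications.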
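1.3 Mathematical Definition
Let
x(k) = the time domain data sequence, with k = 0, 1, ..., N-1
X(n) = the (cosine transformed) transform domain data sequence, with n = 0, 1, ..., N-1
The discrete cosine transform pair connecting these 2 data sequences is

Inverse transform
\[ x(k) = \sum_{n=0}^{N-1} e(n)\, X(n) \cos\left[\frac{(2k+1)n\pi}{2N}\right] \]
Forward transform
\[ X(n) = \frac{2}{N}\, e(n) \sum_{k=0}^{N-1} x(k) \cos\left[\frac{(2k+1)n\pi}{2N}\right] \]
where $e(n) = 1/\sqrt{2}$ for $n = 0$ and $e(n) = 1$ otherwise.

Also let the relation
\[ C_{2N}^{(2k+1)n} = \cos\left[\frac{(2k+1)n\pi}{2N}\right] \]
hold, and let $\hat{X}(n) = e(n) X(n)$. The inverse transform then becomes
\[ x(k) = \sum_{n=0}^{N-1} \hat{X}(n)\, C_{2N}^{(2k+1)n} \]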
By decomposing the functions into odd and even indices of n and k, and assuming 
N is even, an N point DCT and inverse DCT (IDCT) can be transformed into two N/2 point 
DCTs and two N/2 point IDCTs respectively.
If this is continued until n,k = 0 then, for N = 8, the sequence of equations that 
constitute the butterfly module in Figure 1 results for the DCT.
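The transform pair of Section 1.3 can be checked numerically. The sketch below is a direct, unoptimised evaluation of the two sums, not Lee's fast algorithm. It assumes the usual scale factor e(0) = 1/sqrt(2) and e(n) = 1 otherwise, and the function names are illustrative.

```python
import math


def e(n: int) -> float:
    """Scale factor: 1/sqrt(2) for n = 0, otherwise 1 (assumed convention)."""
    return 1.0 / math.sqrt(2.0) if n == 0 else 1.0


def dct(x):
    """Forward transform: X(n) = (2/N) e(n) sum_k x(k) cos((2k+1) n pi / 2N)."""
    size = len(x)
    return [
        2.0 / size * e(n) * sum(
            x[k] * math.cos((2 * k + 1) * n * math.pi / (2 * size))
            for k in range(size)
        )
        for n in range(size)
    ]


def idct(big_x):
    """Inverse transform: x(k) = sum_n e(n) X(n) cos((2k+1) n pi / 2N)."""
    size = len(big_x)
    return [
        sum(
            e(n) * big_x[n] * math.cos((2 * k + 1) * n * math.pi / (2 * size))
            for n in range(size)
        )
        for k in range(size)
    ]


pixels = [10, 20, 30, 40, 50, 60, 70, 80]
recovered = idct(dct(pixels))
print([round(v, 6) for v in recovered])
```

Applying the inverse to the forward result returns the original eight values to within floating point error, which confirms that the two sums form a consistent pair.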
It can also be shown that the DCT relating x(k) to X(n) can be realised simply by 
transposing the IDCT relating X(n) to x(k). By transposition we mean reversing the direction 
of flow of a system such that the input and output are reversed and all multipliers remain the 
same. This is due to the fact that the matrix C representing the inverse DCT, such that
\[ x = C \hat{X} \]
for the vectors $x = (x(0), x(1), \ldots, x(N-1))$ and 
$\hat{X} = (\hat{X}(0), \hat{X}(1), \ldots, \hat{X}(N-1))$, with $\hat{X}(n) = e(n) X(n)$ as defined above, is 
orthogonal. As a consequence we have the relation
\[ \hat{X} = \frac{2}{N} C^{T} x \]
for the forward transform. For an eight pixel set, N = 8, and the equation reduces to
\[ \hat{X} = \frac{1}{4} C^{T} x \]
for the forward transform.
Figure 1 - 8 Point Forward DCT
1.4 Forward DCT Equations
The resultant equations that can be derived from Figure 1 constitute the forward 
DCT. The positions of the intermediate values b0 .. e7 are shown in Figure 1.

b0 = x0 + x7    c0 = b0 + b2          d0 = c0 + c4
b4 = x1 + x6    c4 = b4 + b6          d4 = c0 - c4
b2 = x2 + x5    c2 = b0 - b2          d2 = B1 c2 + B2 c6
b6 = x3 + x4    c6 = b4 - b6          d6 = B1 c2 - B2 c6
b1 = x0 - x7    c1 = E1 b1 + E4 b3    d1 = c1 + c5
b5 = x1 - x6    c5 = E2 b5 + E3 b7    d5 = c1 - c5
b3 = x2 - x5    c3 = E1 b1 - E4 b3    d3 = B1 c3 + B2 c7
b7 = x3 - x4    c7 = E2 b5 - E3 b7    d7 = B1 c3 - B2 c7

e0 = d0           X0 = e0 / (4 root2)
e4 = A d4         X4 = e4 / 4
e2 = d2 + A d6    X2 = e2 / 4
e6 = A d6         X6 = e6 / 4
e1 = d1           X1 = (e1 + e3) / 4
e5 = A d5         X5 = (e5 + e7) / 4
e3 = d3 + A d7    X3 = (e3 + e5) / 4
e7 = A d7         X7 = e7 / 4

with, for $C_{M}^{j} = \cos(j\pi/M)$,

A  = $1/(2 C_{4}^{1})$  = 0.707107
B1 = $1/(2 C_{8}^{1})$  = 0.541196
B2 = $1/(2 C_{8}^{3})$  = 1.306563
E1 = $1/(2 C_{16}^{1})$ = 0.509796
E2 = $1/(2 C_{16}^{3})$ = 0.601345
E3 = $1/(2 C_{16}^{5})$ = 0.899976
E4 = $1/(2 C_{16}^{7})$ = 2.562915
CHAPTER 2 - DESIGN PARAMETERS
2.1 Parameters
The parameters that the chip must conform to for a one dimensional transform are
a) operate at 10-15 MHz
b) operate in real time
c) process pixels at a rate of
- 288 * 352 pixels/frame
- 30 frames/sec
The Video Codec recommendation H.261 [6] requires that eight pixels be 
processed at a time. This corresponds to processing
(288 * 352 * 30) / 8 = 380160 eight pixel packets per second
ie. 1 / 380160 sec per 8 pixel packet = 2.6305 µsecs / pkt
= 2630.5 nsecs / pkt
This requires that all data be read, written and processed in the 2.6305 µsecs.
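The packet rate and time budget above follow directly from the frame parameters. A minimal Python sketch of the arithmetic, with variable names chosen here for illustration:

```python
ROWS = 288               # pixel rows per frame
COLUMNS = 352            # pixel columns per frame
FRAMES_PER_SECOND = 30
PIXELS_PER_PACKET = 8    # one row of an 8 x 8 block

packets_per_second = ROWS * COLUMNS * FRAMES_PER_SECOND // PIXELS_PER_PACKET
nsecs_per_packet = 1e9 / packets_per_second

print(packets_per_second)          # packets that must be handled each second
print(round(nsecs_per_packet, 1))  # time available per packet in nsecs
```

Any design that reads, transforms and writes an eight pixel packet in less than this per-packet time meets the real time requirement.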
To achieve this the possible design configurations are
(1) read, process and write in series
(2) read and write in parallel with each other, and process in series with the I/O.
This leaves the processing time to be whatever time is left after the 
parallel I/O has been completed.
(3) read, write and process all in parallel.
This allows the full 2630 nsecs to process the 8 pixels.
2.2 Data Length
The input data length of the single transform is 8 bits. The 1D and 2D transform 
internal processing and output data lengths are a maximum of 12 bits, with the 12th bit being 
the 2's complement sign bit.
This data length was quoted by other designs [1] [2] as being required to perform 
the transformations and was also verified with the verification program in Appendix A1.1.
Therefore, the design should be able to handle 12 bit I/O in the 2630 nsec time frame.
CHAPTER 3 - DCT IMPLEMENTATION
3.1 Introduction
In deciding upon the design to be implemented, various bit parallel and bit serial 
designs were trialled, both to process the data in the required time and to fit the 
design within the available silicon area of 6.8mm x 6.8mm.
The final design implementation is outlined in this chapter. The other designs 
considered, along with the decision analysis as to which design was finally adopted, 
can be found in Chapter 8.
3.2  Bit Parallel Versus Bit Serial Design
The final design adopted is a bit serial design. A bit parallel design with a single 
ALU does not provide comparable data throughput, and a second ALU is required for its 
performance to be comparable with that of bit serial designs. The bit serial 
design, on the other hand, provides sufficient data throughput with comparable silicon area.
A bit parallel design is limited by the settle time of the slowest combinational 
element, which in our case is the ALU. In the parallel design studied, the most efficient 
processing of the data required a settle time of 27 adder cells, which corresponds to 
approximately 135 nsecs per ALU settle time.
If the bit parallel ALU has an adder length of 'n', then at any point in time only 1/n of 
the adder sections will be in the process of being evaluated. The other (1-1/n) sections will 
either be awaiting data to filter through to them or will already have been evaluated. This leads to a 
majority of the sections being non productive at any point in time.
In bit serial designs more data elements can be evaluated at any one time and so 
the throughput is increased. The bit parallel and bit serial data throughputs are 
detailed in Chapters 5 and 6.
3.3 Revised Butterfly Design
In studying the various designs already implemented, the one adopted by 
INMOS [1] triggered the design that I have adopted. The INMOS design solves the 
transformation by making each of the eight transformed values a function of the eight input 
values.
X(0) = fn (x(0), x(1), ..., x(7))
X(1) = fn (x(0), x(1), ..., x(7))
...
X(7) = fn (x(0), x(1), ..., x(7))
This is achieved by using an 8 x 8 multiplication matrix, with each row containing the 
coefficients that produce the values to be summed into a single transformed 
value.
X(0) = x(0)m(0,0) + x(1)m(0,1) + ... + x(7)m(0,7)
...
X(7) = x(0)m(7,0) + x(1)m(7,1) + ... + x(7)m(7,7)
This design therefore consists of 64 multiplications and 64 additions per 
transform. This is a very computationally expensive design and at face value is not as efficient 
as other designs. However, due to its bit serial implementation, this design is the fastest of all the 
designs researched.
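The matrix formulation described above can be sketched in Python as follows. This is only an illustration of the 8 x 8 multiply and accumulate structure, not the INMOS circuit. The matrix entries m(n,k) are assumed to be the forward DCT coefficients (2/N) e(n) cos((2k+1)n pi/2N), and the operation counters are added purely to show where the 64 multiplications and 64 additions come from.

```python
import math

N = 8


def m(n: int, k: int) -> float:
    """Assumed matrix entry m(n,k): forward DCT coefficient for output n, input k."""
    scale = 1.0 / math.sqrt(2.0) if n == 0 else 1.0
    return 2.0 / N * scale * math.cos((2 * k + 1) * n * math.pi / (2 * N))


MATRIX = [[m(n, k) for k in range(N)] for n in range(N)]


def matrix_dct(x):
    """Return (X, multiplications, additions) for one 8 point transform."""
    multiplications = additions = 0
    outputs = []
    for row in MATRIX:
        accumulator = 0.0
        for coefficient, sample in zip(row, x):
            accumulator += coefficient * sample  # one multiply, one accumulate
            multiplications += 1
            additions += 1
        outputs.append(accumulator)
    return outputs, multiplications, additions


result, mults, adds = matrix_dct([1] * N)
print(mults, adds)
```

Every output row is computed independently of the others, which is what makes the structure attractive for a bit serial implementation despite the high operation count.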
When studying this design it was noticed that many of the matrix coefficients are 
duplicated. This led to expansion of the Lee algorithm equations to a point where the 
eight transformed values are a function of the eight input values, similar to that of the INMOS 
design [1]. It then becomes apparent that not only are many of the coefficients duplicated, 
but also that each transform equation takes one of three forms, containing 1, 2 or 4 
multiplications respectively.
This enables the mathematics to be reduced to 22 multiplications and 28 
additions/subtractions, a vast improvement on the INMOS design. It is this revised 
design that has been adopted henceforth, with the transform being conducted in a single 
pass through the ALU.
s1 = x0-x7    s2 = x1-x6    s3 = x2-x5    s4 = x3-x4
s5 = a1-a4    s6 = a2-a3    s7 = a5-a6
a1 = x0+x7    a2 = x1+x6    a3 = x2+x5    a4 = x3+x4
a5 = a1+a4    a6 = a2+a3    a7 = a5+a6

X(0) = A*a7                            where A = 0.176776
X(4) = A*s7                                  B = 0.230969
X(2) = C*s6 + B*s5                           C = 0.095670
X(6) = C*s5 - B*s6                           D = 0.245196
X(1) = D*s1 + E*s2 + F*s3 + G*s4             E = 0.207867
X(5) = F*s1 - D*s2 + G*s3 + E*s4             F = 0.138958
X(3) = E*s1 - G*s2 - D*s3 - F*s4             G = 0.048772
X(7) = G*s1 - F*s2 + E*s3 - D*s4





















































































































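A = 0.176776 (base 10) = 0.00101101 (base 2)
B = 0.230969 (base 10) = 0.00111011 (base 2)
C = 0.095670 (base 10) = 0.00011000 (base 2)
D = 0.245196 (base 10) = 0.00111111 (base 2)
E = 0.207867 (base 10) = 0.00110101 (base 2)
F = 0.138958 (base 10) = 0.00100100 (base 2)
G = 0.048772 (base 10) = 0.00001100 (base 2)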
Figure 2 - Expanded Lee Butterfly Equations
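The expanded equations of Figure 2 can be exercised with a short Python sketch. It uses the decimal constants as printed in the figure and compares the result with a direct evaluation of the forward transform definition. The X(7) line is derived here from that definition and should be treated as an assumption, as are all function names.

```python
import math

# Decimal coefficients as printed in Figure 2.
A, B, C, D, E, F, G = (0.176776, 0.230969, 0.095670, 0.245196,
                       0.207867, 0.138958, 0.048772)


def expanded_dct(x):
    """Eight point forward DCT using the expanded butterfly equations."""
    x0, x1, x2, x3, x4, x5, x6, x7 = x
    s1, s2, s3, s4 = x0 - x7, x1 - x6, x2 - x5, x3 - x4
    a1, a2, a3, a4 = x0 + x7, x1 + x6, x2 + x5, x3 + x4
    a5, a6 = a1 + a4, a2 + a3
    s5, s6 = a1 - a4, a2 - a3
    a7, s7 = a5 + a6, a5 - a6
    out = [0.0] * 8
    out[0] = A * a7
    out[4] = A * s7
    out[2] = C * s6 + B * s5
    out[6] = C * s5 - B * s6
    out[1] = D * s1 + E * s2 + F * s3 + G * s4
    out[5] = F * s1 - D * s2 + G * s3 + E * s4
    out[3] = E * s1 - G * s2 - D * s3 - F * s4
    out[7] = G * s1 - F * s2 + E * s3 - D * s4  # assumed, derived from definition
    return out


def reference_dct(x):
    """Direct evaluation of X(n) = (2/N) e(n) sum_k x(k) cos((2k+1) n pi / 2N)."""
    size = len(x)
    return [
        2.0 / size * (1.0 / math.sqrt(2.0) if n == 0 else 1.0) * sum(
            x[k] * math.cos((2 * k + 1) * n * math.pi / (2 * size))
            for k in range(size)
        )
        for n in range(size)
    ]


samples = [10, 20, 30, 40, 50, 60, 70, 80]
fast = expanded_dct(samples)
exact = reference_dct(samples)
print(max(abs(p - q) for p, q in zip(fast, exact)))
```

The residual difference between the two results comes from the six decimal place rounding of the printed coefficients.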
3.4 Multiplication Accuracy
In expanding the equations the multiplication coefficients are reduced to 
magnitudes that are less than the original algorithm coefficients. For example, the largest 
multiplication value in the expanded form is D, which is 0.245196, whereas the largest 
coefficient of the original algorithm was E4 at 2.562915448.
Having all the multiplication values less than 1 simplifies the multiplication design, 
as outlined in Chapter 8, Module Design.
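To illustrate why coefficients below 1 are convenient, the sketch below converts the expanded coefficients to 8 bit binary fractions and performs an integer multiplication in which the 8 least significant product bits are discarded. Rounding to the nearest 1/256 is an assumption made for this sketch, and the helper names are hypothetical.

```python
FRACTION_BITS = 8

COEFFICIENTS = {
    "A": 0.176776, "B": 0.230969, "C": 0.095670, "D": 0.245196,
    "E": 0.207867, "F": 0.138958, "G": 0.048772,
}


def quantise(value: float, bits: int = FRACTION_BITS) -> int:
    """Integer numerator of value over 2**bits (round to nearest, assumed)."""
    return round(value * (1 << bits))


def to_binary_fraction(numerator: int, bits: int = FRACTION_BITS) -> str:
    """Render the numerator as a binary fraction string such as 0.00101101."""
    return "0." + format(numerator, "0{}b".format(bits))


def fixed_multiply(data: int, numerator: int, bits: int = FRACTION_BITS) -> int:
    """Integer multiply, then drop the low 'bits' bits of the product."""
    return (data * numerator) >> bits


for name, value in COEFFICIENTS.items():
    print(name, to_binary_fraction(quantise(value)))

print(fixed_multiply(1000, quantise(COEFFICIENTS["D"])))
```

Because every numerator is below 64, the two most significant fraction bits are always zero, so no integer bits or alignment shifts are needed in the multiplier.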
CHAPTER 4 - CHIP DESIGN
4.1 Data Input / Output
The amount of data that must pass through the chip requires that minimal time is 
spent loading the chip with data. This can be achieved by having a FIFO chip at both the 
front and back ends of the chip. This allows the chip to simply sample the input FIFO to 
determine if any data is available and, if so, read it in a single clock cycle. Similarly, for data 
output the output FIFO can be sampled to check that it is not full and, if not, data can be output to 
it. This I/O can be handled in parallel, as the reading and writing of data are independent 
processes.
The loading of the front end FIFO and the reading of the output FIFO is then 
handled by the host processor.
The chip buffers used for the I/O are typical RAM cells with two enable and data 
lines. This allows data to be accessed both in word parallel form, for reading/writing to the FIFOs, 
and in word serial form, for pulsing the data through the chip. Chapter 8 has further details on the 
design and functioning of the I/O buffers.
4.2 Arithmetic Logic Unit
Data is pulsed through the ALU in a bit serial form by the respective bits of the 
input buffers being read as required.
Because the clock pulse is much greater than the settle time of an adder cell, 
multiple adder cells are evaluated in a single cycle. This allows for a single bit to traverse from 








adder settle time and the data path being 8 adder cells long then sufficient time is available to
allow the data to filter through to the output buffers.
The 8 cell data path length results from 3 adder/sub stages per equation and a 
maximum of 5 adders being present in any of the multipliers. Therefore, the lsb is available 
on the first pulse through the ALU. The msb is available from the ALU after the 12th data bit 
is pulsed into the ALU and the respective delay cells are then cleared. This takes 10 cycles 
after the last data bit is pulsed into the ALU, the 10 being the result of 3 add/sub stages and 7 
multiplier delay stages. Therefore, the msb is available on the 22nd pulse through the ALU.
The ALU consists of the 22 multiplications and the 28 add/sub stages to realise 
the 8 equations in Section 3.3.
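The bit serial processing described in this section can be modelled behaviourally. The sketch below is not the chip's circuit. It only shows the principle of a one bit adder with a carry buffer that consumes two 12 bit two's complement words least significant bit first, one bit per clock pulse. All names are illustrative.

```python
WORD_LENGTH = 12


def to_serial(value: int, width: int = WORD_LENGTH):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_serial(bits) -> int:
    """Rebuild a signed integer from least significant bit first bits."""
    raw = sum(bit << i for i, bit in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw


def serial_add(a_bits, b_bits):
    """One full adder cell reused every pulse, with the carry held between pulses."""
    carry = 0
    result = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        result.append(total & 1)  # sum bit leaves the adder this pulse
        carry = total >> 1        # carry is buffered for the next pulse
    return result


print(from_serial(serial_add(to_serial(100), to_serial(-37))))
```

Only a single adder cell has to settle on each pulse, which is why the clock period can be short compared with the settle time of a full width parallel adder.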
4 .3  Timing
Although the system clock of 10 to 15 MHz could be used in pulsing the data 
through the ALU, problems may arise if a 10 MHz clock is used, as the ALU process time will 
be 22 * 100 nsecs = 2200 nsecs. This leaves only 2630 - 2200 = 430 nsecs to read/write the data, and so a 
second set of I/O buffers would be required to achieve real time processing.
Also, as only 8 adder cells need to be evaluated per clock cycle, a maximum of 40 to 
50 nsecs is required. Therefore, as data is latched on the first pulse and read on the second 
pulse, a clock cycle of 50 nsecs would be sufficient.
To account for data line delays apart from the adder cells, an ideal design will have 
an internal clock cycle of 67 nsecs (15 MHz) to ensure that data is cycled through the ALU 
at a sufficient rate. Therefore, the process time of the ALU is 67 x 22 = 1474 nsecs.
This leaves 2630 - 1474 = 1156 nsecs to I/O data. With the FIFO I/O strategy 
employed and 8 cycles required to I/O data, 8 x 67 = 536 nsecs are required to I/O data 
when reading and writing take place in parallel. This leaves 1156 - 536 = 620 nsecs in reserve. Therefore, 
the I/O of the data can in fact be handled in series rather than in parallel, since 2 x 536 = 1072 nsecs 
still fits within the 1156 nsecs available.
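The timing budget of this section can be reproduced with a few lines of Python. The constants are the figures quoted in the text and the variable names are chosen here for illustration.

```python
PACKET_BUDGET_NS = 2630  # time available per eight pixel packet
CLOCK_PERIOD_NS = 67     # 15 MHz internal clock, rounded as in the text
ALU_CYCLES = 22          # pulses needed to transform the data
IO_CYCLES = 8            # pulses needed to read (or write) eight words

alu_time_ns = ALU_CYCLES * CLOCK_PERIOD_NS
io_budget_ns = PACKET_BUDGET_NS - alu_time_ns
parallel_io_ns = IO_CYCLES * CLOCK_PERIOD_NS
reserve_ns = io_budget_ns - parallel_io_ns
serial_io_ns = 2 * parallel_io_ns  # read then write, one after the other

print(alu_time_ns, io_budget_ns, parallel_io_ns, reserve_ns, serial_io_ns)
```

Even with reading and writing performed one after the other, the I/O time stays inside the budget left over once the ALU has finished.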
4 .4  Control
Control of the chip is via 2 set/reset latches, one for input and one for output. The 
I/O is performed in parallel while the ALU is disabled. Once data has been read into and written out of the chip, 
the ALU is enabled so that the data is processed, and the sequence then begins again with data 
I/O.
Data I/O requires eight read/write cycles. Once these cycles are complete the I/O is 
disabled and the ALU is enabled. The ALU then uses 21 cycles to pulse the 
data through itself, and upon the 22nd cycle the ALU is disabled and the I/O enabled.
Data is read in bit parallel form and after the eight data words are read, the least 
significant bit of each data word is then pulsed into the ALU, as indicated in Figure 3. On 
each successive pulse the next significant bit is pulsed into the ALU until all the twelve data 
bits are present. The least significant bit produced from the ALU is available two cycles after 
the LSB is pulsed into the ALU. However, as the coefficients used have an accuracy of eight 
bits the least significant eight bits are ignored and the ninth bit generated from the ALU is 
the first bit stored in the output buffers. This occurs on pulse ten as there is a single cycle 
buffer on either side of the ALU.
Figure 3 - Chip Control (data in: 12 bit word; data out: 12 bit word)
4 .5  Design Cycle Time
As outlined in Section 4.4 the design:
- requires eight cycles to read and write eight data words in parallel
- requires twenty one cycles to cycle the data through the ALU
- has a master clock rate of 15 MHz (66.7 ns period)
The twenty one cycles for the ALU processing are derived as follows
cycle 1 - bit is loaded from input buffers to pre ALU buffer on phase 1
- the bit is then gated to the ALU
cycle 12 - final data bit is gated into the pre ALU buffer on phase 1 and
gated into the ALU on phase 2
The ALU multiplication coefficients are of eight bit accuracy and so the least 
significant eight bits of a multiplication are discarded for integer multiplication. As the 
coefficients all have the two msb equal to zero (Figure 2), these two msb cells are not 
included in the design and the result is therefore available two cycles earlier than if all eight 
coefficient bits were used. Also, the first coefficient of a multiplication does not require a delay cell 
to buffer the input data stream (Figure 9). Therefore, the total cycles required to pass a data 
bit into a multiplier and to pulse the results out is
- 1 cycle to enter the data bit into the multiplier
- 1 cycle per buffer
As there are five such buffers, a total of six cycles are required to pulse a data bit 
through a multiplier. Therefore, the total cycles required to pulse a data word through a 
multiplier is the word length plus five.
Each of the transformed coefficients requires three addition/subtraction stages in 
conjunction with the multiplications (Figure 2). This section then adds a maximum of three 
data bits to the word length.
The resultant maximum word length is therefore
12 initial bits
+ 3 add bits
+ 5 multiplication bits
- 8 lsb bits for integer multiplication.
This leaves the resultant data word being twelve bits long.
After the msb of the input data word is cycled into the ALU the msb of the result is 









phase 1, msb of data cycled into pre ALU buffer 
phase 2, msb of data cycled into ALU 
phase 1, first adder carry read into carry buffer 
phase 2, first adder carry cycled into adder, (last bit cycled in) 
phase 1, second adder carry read into carry buffer 
phase 2, second adder carry cycled into adder 
phase 1, third adder carry read into carry buffer 
phase 2, third adder carry read into adder, and result is directly 
fed into first stage of multiplier
phase 1, first stage of multiplier has its carry read by its carry buffer, (recall, 
there are five delay stages)
phase 1, last stage of multiplier has its carry read by its carry buffer 
phase 2, last stage has its carry fed back into multiplier adder
and the result is directly fed into the post ALU buffer 
phase 1, post ALU buffer reads multiplier result 
phase 2, result cycled to output buffers 































CHAPTER 5 - BIT PARALLEL DESIGN
5.1 Introduction
The bit parallel design of Arnould and Dugre [2] was used as a base model for the 
current design, with the aspect of pipeline processing adopted for the Lee algorithm.
5.2 ALU Design
Arnould and Dugre [2] used a different DCT algorithm from Lee's, and also 
incorporated pipelining of data through adders and multipliers. Their design incorporates a 
single parallel multiplier with a pipelined subtractor and parallel adder. The current design 
centres around an ALU which has a parallel multiplier with a subtractor pipelined on its front 
end. This is to maximize the data flow through the ALU, as all but one multiplication is 
preceded by a subtraction, as can be seen in Figure 1. An adder is set up in parallel to the 
sub/mult stage, as the Lee butterfly has simultaneous additions and subtractions of two data 
values (Figure 1). Therefore, two data values are fed simultaneously into the adder and 
subtractor, and the subtractor result is then pipelined into the multiplier. On the read cycle the 
results of the adder and multiplier are placed back onto the data bus and stored in 
the buffer positions from which the original two data values were read.
This requires 13 passes through the ALU for multiplications and a further 5 passes 
through the ALU to evaluate 5 final additions.
The subtractor and adder are 'carry select' designs and so their settle times are 
reduced from 12 adder cells to 4 + 1 + 1 = 6. This is achieved by having the 12 bits broken 
into 3 x 4 bit adder sections, with the last 2 sections evaluated for both the carry high and carry low cases. 
When the carry from the first stage filters through to the second stage, a multiplexer selects 
the carry high or carry low result respectively, and similarly the second stage carry selects the third 
stage result. This design follows that of Weste and Eshraghian [3].
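A behavioural sketch of the carry select scheme described above is given below. It is not a gate level model of the design in [3]. Each 4 bit section is simply added with ordinary integer arithmetic for both possible carry inputs, and the incoming carry then picks which precomputed result is used. The function names are hypothetical.

```python
def section_add(a: int, b: int, carry_in: int, width: int):
    """Add one section. Returns (sum bits, carry out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a: int, b: int, width: int = 12, section: int = 4):
    """Add two unsigned 'width' bit values using carry select sections."""
    mask = (1 << section) - 1
    result = 0
    carry = 0
    for index, shift in enumerate(range(0, width, section)):
        a_part = (a >> shift) & mask
        b_part = (b >> shift) & mask
        if index == 0:
            # First section has a known carry in of zero.
            part, carry = section_add(a_part, b_part, 0, section)
        else:
            # Later sections are evaluated for both carry cases in advance.
            carry_low = section_add(a_part, b_part, 0, section)
            carry_high = section_add(a_part, b_part, 1, section)
            part, carry = carry_high if carry else carry_low
        result |= part << shift
    return result, carry


print(carry_select_add(0x0FF, 0x001))
```

Since the later sections settle at the same time as the first, the overall delay is one section of four cells plus one multiplexer selection for each further section, which gives the 4 + 1 + 1 = 6 figure quoted in the text.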
The bit parallel multiplier is based on the design of Weste and Eshraghian [3] (page 
346) and requires coefficients with 8 bits. Multiplication constant E4 has a value of 
2.562915, and so with its 8 bit representation the multiplication product has to be shifted two 
bits to align it with the other multiplication results with constants less than 1.
Therefore, with 12 data bits and 8 bit multiplier coefficients, the settle time of the 
multiplier is 8 + 12 + 1 = 21 adder settle times. The subtractor settle time will be four adder 
settles for the first stage and 1 adder settle for each of the successive stages, ie. 4 + 1 + 1 = 6 
adder settle times. So the total settle time of the ALU is 6 + 21 = 27 adder settle times. With 5 
nsecs being the worst case adder settle time, this leads to 135 nsecs per ALU pass.
Also, for a 50 nsec clock period, one cycle is used to load the adder and one cycle 
to read the results. For a 25 nsec period, two clock periods are required to load and settle the 
adder and one period to read the result. Figure A1.2 shows the effect that reducing the clock 
period has on the processing time of the ALU.
Therefore, the total ALU processing time is a minimum of 135 nsecs for 
multiplications and 30 nsecs for additions only. The data can then be read on the next clock 
cycle following these settle times.
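The settle time figures quoted in this section reduce to simple arithmetic, sketched below in Python with illustrative variable names.

```python
ADDER_SETTLE_NS = 5   # worst case settle time of one adder cell
DATA_BITS = 12
COEFFICIENT_BITS = 8

# Carry select subtractor: 4 cells for the first section, then 1 per later section.
subtractor_cells = 4 + 1 + 1
# Parallel multiplier: coefficient bits + data bits + 1.
multiplier_cells = COEFFICIENT_BITS + DATA_BITS + 1

multiply_pass_cells = subtractor_cells + multiplier_cells
multiply_pass_ns = multiply_pass_cells * ADDER_SETTLE_NS
add_only_ns = subtractor_cells * ADDER_SETTLE_NS

print(multiply_pass_cells, multiply_pass_ns, add_only_ns)
```

The multiplier dominates the pass time, so it sets the minimum time between loading the ALU and reading its result.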
5 .3  Timing and Control
The data is fed through the ALUs according to the equations outlined in Section 5.2
Loading data into the multiplication ALU can be done on the first cycle of the ALU 
evaluation period and 135 nsecs are then required to allow the ALU to settle. Reading of the 
ALU will then be required on the next clock cycle.
Figure 6 shows that the best results can be achieved with a 20 nsec clock period. 
As 20 nsecs corresponds to a 50 Mhz dock rate this option was not considered as such a fast 
clock rate is beyond the capability of the target fabrication technology. The next best 
performance was with a 30 nsecs clock period of 2640 nsec. This processing time is well 
outside the processing speed attainable with a bit serial design.
To achieve real time processing with this design two sets of I/O buffers would be 
necessary so that while one set of data is being processed other sets of data are being read 
and written from the second set of buffers. This in turn complicates the control of the chip.
The reason for this ALU restriction is that a bit parallel design is limited by the settle 
time of the slowest combinational element, which in our case is the multiplication ALU. In the 
parallel design studied the most efficient processing of the data required a settle time of 27 
adder cells, which corresponds to approximately 135 nsecs per ALU settle time.
If the bit parallel ALU has an 'n' adder length then at any point in time only 1/n of the 
adder sections will be in the process of being evaluated. The other (1 -1/n) sections will either 
be awaiting data to filter through to them or will have been evaluated. This leads to a majority 

Clock period   Processing time (nsec)
50 nsec        (4x50x13) + (5x50x2) = 3100
40 nsec        (5x40x13) + (5x40x2) = 3000
30 nsec        (6x30x13) + (5x30x2) = 2640
25 nsec        (7x25x13) + (5x50x3) = 3025
20 nsec        (8x20x13) + (5x40x3) = 2680
ALU settle time = 135 nsecs 
Figure 6 - Parallel Multiplier Timing Comparisons
5.4 Bit Parallel Process Flow
Refer to Figure 1 for the location of values b0, ... etc.
Cycle  adder      sub/mult          buffer contents
                                    x0,x1,x2,x3,x4,x5,x6,x7
1      x0+x7=b0   x0-x7=b1*E1=m1    b0,m1,x2,x3,x4,x5,x6,x7
2      x2+x5=b2   x2-x5=b3*E4=m2    b0,m1,b2,m2,x4,x5,x6,x7
3      x1+x6=b4   x1-x6=b5*E2=m3    b0,m1,b2,m2,b4,m3,x6,x7
4      x3+x4=b6   x3-x4=b7*E3=m4    b0,m1,b2,m2,b4,m3,b6,m4
5      b0+b2=c0   b0-b2=c2*B1=m5    c0,m1,m5,m2,b4,m3,b6,m4
6      b4+b6=c4   b4-b6=c6*B2=m6    c0,m1,m5,m2,c4,m3,m6,m4
7      m1+m2=c1   m1-m2=c3*B1=m7    c0,c1,m5,m7,c4,m3,m6,m4
8      m3+m4=c5   m3-m4=c7*B2=m8    c0,c1,m5,m7,c4,c5,m6,m8
9      c0+c4=d0   c0-c4=d4*A=e4     d0,c1,m5,m7,e4,c5,m6,m8
10     m5+m6=d2   m5-m6=d6*A=e6     d0,c1,d2,m7,e4,c5,e6,m8
11     c1+c5=d1   c1-c5=d5*A=e5     d0,d1,d2,m7,e4,e5,e6,m8
12     m7+m8=d3   m7-m8=d7*A=e7     d0,d1,d2,d3,e4,e5,e6,e7
13     d2+e6=e2                     d0,d1,e2,d3,e4,e5,e6,e7
14     d3+e7=e7                     e0,e1,e2,e3,e4,e5,e6,e7
15     e1+e3=g1                     e0,g1,e2,e3,e4,e5,e6,e7
16     e3+e5=g3                     e0,g1,e2,g3,g4,e5,e6,e7
17     e5+e7=g2                     e0,g1,e2,g3,e4,g2,e6,e7
18                e0*1/root2=g0     g0,g1,e2,g3,e4,g2,e6,e7
All eight values are then divided by four, ie. shifted right by 2 places, to give the transformed values.
CHAPTER 6 - BIT SERIAL DESIGNS
6.1 Introduction
Two other designs were investigated apart from the design adopted. These are:
a) Lee butterfly implementation
b) Expanded butterfly in 2 stages.
6.2 Design A2-1 Lee Butterfly Implementation
This design was the first bit serial design investigated and involves implementing 
the design as per the butterfly in Figure 1, in a bit serial form. Specifically, the lines in Figure 
1 are data lines and the junctions are adders/subtractors respectively. Moreover, the 
multiplication nodes are serial/parallel multipliers, with the multiplier coefficients fed in parallel 
form into the multiplier and the data serially into the multiplier. This design required 13 
multiplications and 29 add/subtractions.
6.3 Design A2-1 Special Considerations
On the surface this design looks relatively easy to implement. However, upon 
closer inspection various problems need to be overcome.
6.3.1 Data Sequencing
To ensure data arrives at each node in the correct sequence, stream equalizers 
must be incorporated at certain points, for example after the B1 multiplications, so that the 
B1 stream is equal to the B2 stream at the next node.
A minimum of 5 of these stream equalizers would be necessary to ensure the data 
is evaluated correctly. However, not all streams would be emerging from the ALU at the same 
time. To align all streams then 11 equalizers would be necessary. A stream equalizer is 
simply 8 consecutive delay cells to delay a bit by 8 cycles.
The maximum time to produce data from the ALU is via the x(4) stream as it has the 
longest length. This stream length is :
12 cycles - get msb of data into ALU
 1 cycle  - clear adder carry prior to E4
 7 cycles - pulse msb through E4
 2 cycles - delay data after E4 as E4 > 2
 1 cycle  - clear adder prior to B1
 7 cycles - pulse data through B1
 1 cycle  - clear adder carry after B1
 7 cycles - stream equalizer to align with bottom stream
 1 cycle  - clear adder after equalizer
 1 cycle  - clear last adder carry
40 cycles
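As a quick check of the total, the contributions listed above can be summed in Python. The list below simply restates the entries of the x(4) stream; the variable names are illustrative.

```python
# (cycles, purpose) for the x(4) stream, restated from the list above
X4_STREAM = [
    (12, "get msb of data into ALU"),
    (1, "clear adder carry prior to E4"),
    (7, "pulse msb through E4"),
    (2, "delay data after E4 as E4 > 2"),
    (1, "clear adder prior to B1"),
    (7, "pulse data through B1"),
    (1, "clear adder carry after B1"),
    (7, "stream equalizer to align with bottom stream"),
    (1, "clear adder after equalizer"),
    (1, "clear last adder carry"),
]

total_cycles = sum(cycles for cycles, _ in X4_STREAM)
print(total_cycles)
```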
6.3.2 Multiplication Magnitudes
As not all the multiplication coefficients are of the same power, buffers must be 
installed after the coefficients which are greater than 1, such as E4 and 'root 2', to delay 
these multiplication results so that their arrival at the next node is correctly sequenced.
6.3.3 Data Stream Length
As each data stream is multiplied, added and subtracted accordingly the data 
stream increases 8 bits for each multiplication and 1 bit per addition. The resultant data 
stream needs to be filtered at the end so that only the most significant bits are generated 
from the ALU.
6.4 Design A2-1 Expanded Butterfly, 2 Stages
Another design that was considered was the actual design adopted but performed 
in 2 separate passes through the same ALU. X0, 1, 2 and 3 are evaluated in stage 1 and X4, 5, 6 
and 7 are evaluated in a second stage.
This results in the following changes to the performance and design:
a) total ALU processing requires twice the time
b) ALU area is halved as the ALU is used twice
c) extra control is required to direct data into adders/subtractors and 
multipliers correctly - this is achieved with the use of multiplexers.
d) extra control is required to read the input buffers twice
e) extra control is required to write data to the output buffers correctly at each 
stage.
CHAPTER 7 - DESIGN DECISION ANALYSIS
The decision as to which design would be used depends upon satisfaction of the 
following criteria.
ESSENTIALS
a) process 1D transform in real time
b) fit into an area 6.8mm x 6.8 mm
DESIRABLE
a) minimum area
b) able to adapt to 2D transformations
c) simple timing and control
The adaptability to 2D transforms includes the ability of the design to be altered to 
be able to process a second transform in the time frame. Only the expanded design allows 
this as its processing time is half that of the other bit serial designs.
As the bit parallel design requires two sets of I/O buffers and the processing time 
is significantly slower than the bit serial designs for similar silicon area it was not considered 
the best option. Also to be a real time processor 2 ALUs are required or 2 sets of I/O buffers 
are required and this in turn increases the complexity of the chip.
Of the bit serial designs, the expanded algorithm design was 
selected as it conformed to the ESSENTIALS and, of the DESIRABLES, it gave the simplest 
processing control and was twice as fast as the other bit serial designs, so only one set of I/O 
buffers is required.
There is a trade off in that the area is greater than the other designs but as it fits 
within the 6.8 x 6.8 limitation, this is a tolerable concession.
CHAPTER 8 - MODULE DESIGN AND CONTROL
8.1 Multiplication
Due to the multiplication coefficients being known and multiple data streams being 
processed in parallel the multiplications are performed with the coefficients incorporated 
into the hardware design rather than having generic multipliers capable of using more than 
one coefficient. This design was adopted as the resultant circuitry was less than that 
required to implement generic multipliers.
Figure 7 shows a generic bit serial multiplier cell and each stage requires an adder 
and associated ‘and’ and delay elements. The current design makes use of the fact that for 
a single clock pulse several consecutive adders can be evaluated and so the delay 
elements between the adder cells and the data stream are removed. This also increases the 
data throughput of the multiplier, as shown in Figure 9.
As a coefficient of 0 results in no addition then the adder cell and the ‘and’ cell 
may be removed from stages where the coefficient is 0 and replaced with delay cells for the 
data transfer. It is the removal of these cells that greatly reduces the circuitry required. 
Therefore to have 22 generic multipliers requires more circuitry than to have 22 specific 
multipliers with some multipliers duplicated to enable parallel processing with similar 





Figure 8 - Revised Bit Serial Multiplier Cell
8.1.1 Negative Multiplication
When multiplying a number in 2's complement form, special care must be taken to 
ensure that if a data stream is a negative number, represented by the msb being 1, then 
after the 12th bit is fed into the multiplier the 13th to 21st bits must also be 1 to ensure 
a correct multiplication product. For example,
-8 * 0.625 with 8 bit word length
 8 = 0000 1000
-8 = 1111 0111 + 1 = 1111 1000 (8 inverted + 1)
0.625 = 0.101 (binary)






- truncate 3 least significant bits = 1001 1011 (base 2)
= 101 (base 10)
- what is required is generation of a 1, (high) after the msb for each partial product if
a negative number is used
ie. 111111000
11 0000 0000 
11 1110 0000 
111 1101 1000
- or generation of 11 0000 0000
Therefore, - if (msb of partial product is high) then
generate n-1 1s after msb.
This generation of 1s is accomplished by continually regenerating the 12th data 
bit of each word on the 13th to 21st pulse. This will result in negative number generation
within the subtractors and so the multiplication cells will see the stream of 1’s required to 
give the correct result.
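The effect of repeating the msb can be illustrated with a small arithmetic model in Python. This is a sketch only: it models the arithmetic rather than the bit serial hardware, the function names are hypothetical, and the coefficient 0.101 (binary) is represented as the integer 5 with 3 fractional bits.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def fixed_point_multiply(word, coeff, word_bits, frac_bits, sign_extend):
    """Multiply a two's complement word by coeff / 2**frac_bits.

    With sign_extend=True the msb is repeated frac_bits more times before
    the partial products are summed, as the text requires.
    """
    width = word_bits + frac_bits
    pattern = word & ((1 << word_bits) - 1)
    if sign_extend and pattern >> (word_bits - 1):
        pattern |= ((1 << frac_bits) - 1) << word_bits
    product = (pattern * coeff) & ((1 << width) - 1)
    # drop the fractional bits and read the result back as a signed word
    return to_signed(product >> frac_bits, word_bits)


print(fixed_point_multiply(-8, 0b101, 8, 3, sign_extend=True))
print(fixed_point_multiply(-8, 0b101, 8, 3, sign_extend=False))
```

With the msb repeated the model returns -5, the expected value of -8 * 0.625. Without it the truncated bit pattern is 1001 1011, which is not the required product.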
8.1.2 Least Significant Coefficient
The arrangement of the coefficients within the multiplier is shown in Figure 9. 
If the least significant coefficient is a 1 then the Si is set to the data value; otherwise, if the 
coefficient is a 0, then the Si is set to 0, ie. ground.
This results in seven delay elements for the data stream. However, the multiplication 
values used in our design all have 0 for the two most significant coefficients. As these are 
the last coefficients in the multiplier they are not required, so the 
multiplication length can be reduced from 8 to 6 and the delay elements required in the data 
stream can be reduced from 7 to 5 accordingly.
[Figure 9 diagram: coefficient stages C4 C3 C2 C1 C0; signals xi, xo, So]
Note : if C4 = 0 then Si = 0 instead of xi
Figure 9 - Specific Multiplier
Coefficients = 10101
input : bit stream
output : product of multiplication
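The saving obtained from a coefficient specific multiplier can be sketched in Python as a shift and add that only builds a stage where the coefficient bit is 1. This is an arithmetic illustration under stated assumptions, not the circuit of Figure 9: the coefficient string is treated as a plain binary integer with the most significant bit first, and the function name is hypothetical.

```python
def specific_multiplier(data, coefficients):
    """Shift-and-add multiply that only uses stages with a '1' coefficient.

    coefficients is a bit string, most significant bit first.
    Returns (product, number_of_stages_used).
    """
    shifts = [i for i, bit in enumerate(reversed(coefficients)) if bit == "1"]
    product = sum(data << shift for shift in shifts)
    return product, len(shifts)


product, stages = specific_multiplier(12, "10101")
print(product, stages)
```

For the coefficients 10101 only three of the five stages contribute a partial product; the remaining stages reduce to delays for the data stream.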
8.2 Adders
All adders are full combinational adders based on the design of [5]. The adder 
carry is fed back into the adder for the next adder cycle.
input: a, b bit data streams
reset for setting the carry low, 0, on the first add cycle, 
output : sum, which is fed onto the next stage
carry, that is fed back into the adder for the next add cycle.
8.3 Subtraction
All subtractions are based on the adder circuit with one of the data streams inverted 
and the carry set high, 1, on the first subtraction cycle. The inversion of the data stream and 
the setting of the carry to high applies the 2's complement logic for converting a positive 
number to a negative number, which is then added to the unaltered data stream.
input: a, b bit data streams
reset for setting the carry high, 1, on the first subtraction cycle, 
output : sum, which is fed onto the next stage
carry, that is fed back into the subtractor for the next subtraction cycle.
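The adder and subtractor behaviour described in Sections 8.2 and 8.3 can be sketched in Python as a one bit full adder used once per cycle, with the carry fed back. The sketch assumes the data streams arrive least significant bit first; the helper names are illustrative and do not come from the design.

```python
def to_bits(value, width):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits for a two's complement word."""
    value = sum(bit << i for i, bit in enumerate(bits))
    if bits[-1]:
        value -= 1 << len(bits)
    return value


def serial_add(a_bits, b_bits, subtract=False):
    """One bit full adder applied cycle by cycle with carry feedback."""
    carry = 1 if subtract else 0   # reset: carry low for add, high for subtract
    result = []
    for a, b in zip(a_bits, b_bits):
        if subtract:
            b ^= 1                 # invert the stream being subtracted
        total = a + b + carry
        result.append(total & 1)   # sum bit, fed onto the next stage
        carry = total >> 1         # carry, fed back for the next cycle
    return result


print(from_bits(serial_add(to_bits(5, 12), to_bits(3, 12))))
print(from_bits(serial_add(to_bits(5, 12), to_bits(8, 12), subtract=True)))
```

The only differences between the two operations are the inversion of one stream and the initial carry value, which is why the subtractor can reuse the adder circuit.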
8.4 Delay Elements
A delay element simply delays the transfer of a data bit for one clock cycle. This is 
achieved by employing the design of [5]. These delays are pseudo 2 phase latches that 
read the incoming data on clock cycle one, invert it so as to buffer the data, and then on the 
second clock cycle the data is transferred out of the buffer after another inversion.
input: data bit stream
output : data stream delayed one cycle
8.5 Input / Output buffers
These buffers are based on the design of [5], and are an array of 2 port RAM cells.
Data is transferred to and from the outside world in parallel form using one of the 
ports and its associated enable. In transferring data to/from the ALU data is accessed in a word 
form but each bit is from a separate data word. This is achieved by having the second port 
and enable lines running perpendicular to the first.
input: data word in column form 
output : data word in row form
NOTE : the output buffer is the reverse of the input buffer.
In reading data to the input buffer a word of data cells is enabled and the ‘read 
enable’ is enabled to signal data is required. In writing to the output buffer from the ALU the 
data lines are setup on phase 2 and the writing to the cells is enabled on the subsequent 
phase 1.
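The effect of running the second port perpendicular to the first is a transposition of the stored data, which can be sketched in Python as follows. The row and column orientation chosen here is arbitrary and the function names are hypothetical; the sketch only illustrates that whole words go in through one port and a slice holding the same bit of every word comes out through the other.

```python
def write_words(words, width):
    """Store each word as one row of bits (lsb first) through the first port."""
    return [[(word >> bit) & 1 for bit in range(width)] for word in words]


def read_bit_slice(cells, bit):
    """Read the same bit position of every stored word through the second port."""
    return [row[bit] for row in cells]


cells = write_words([5, 3, 0, 7, 1, 2, 4, 6], width=12)
print(read_bit_slice(cells, 0))
print(read_bit_slice(cells, 1))
```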
To correct any data skew problems that may occur in transferring data to/from the 
ALU from the buffers, a set of buffers is installed between the I/O buffers and the ALU. 
These interim buffers are loaded and read on successive phase 1 and 2.
Phase 1 = read input buffer cells and gate into buffer
        = enable the output buffer write enables
Phase 2 = gate input data into ALU from buffers 
        = gate output data to output buffers
8.6 Clocking Strategy
The control of the ALU is by a two phase clock. This two phase clock is produced 
by taking the single phase mother cycle, 15 MHz, and passing it through a flip flop that 
produces one phase in sequence with the mother cycle and the second phase is a non 
overlapping phase that is high when phase one is low. This ensures that the two cycles are 
always non overlapping and prevents any clock skew problems.
input : 15 MHz clock cycle
output : two phase clock each with a cycle of 15 MHz
8.7 Clock Skew
Clock skew occurs from the phase clock generation to the ALU. This is a problem 
if the negated phase signals are also generated from the same module.
To overcome this problem the primary signals, phase 1 and 2 and the adder/sub 
reset signals are generated from the central control module and the secondary negated 
signals are generated locally at separate locations. These separate locations are :
- before each multiplication module
- before the adders/subs prior to the multiplication module
- before the adders/subs after the multiplication module
- before the input buffer reading enable module
- before the output buffer writing enable module
8.8 Counters
Counters are constructed by cascading a number of flip flops together with each 
flip flop input being the preceding flip flop's output. This has the effect of multiplying by 2 the 
cycle for each successive stage.
input: counter pulse
output : n data lines for n flip flops, each data line representing the state of the 
corresponding count bit.
The reset to the counters is high while the counters are not required and upon the 
reset being negated the counters are zeroed.
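The cascaded counter can be modelled in Python as a chain of toggling stages. The sketch assumes each stage toggles on the falling edge of the preceding stage's output; the thesis does not state which edge is used, and the function name is illustrative.

```python
def ripple_counter(num_stages, num_pulses):
    """Cascade of toggle flip flops; a stage toggles when the previous one falls."""
    state = [0] * num_stages
    history = []
    for _ in range(num_pulses):
        stage = 0
        while stage < num_stages:
            state[stage] ^= 1
            if state[stage] == 1:   # no falling edge, so the ripple stops here
                break
            stage += 1              # falling edge clocks the next stage
        history.append(sum(bit << i for i, bit in enumerate(state)))
    return history


print(ripple_counter(3, 10))
```

Each stage changes half as often as the one before it, which is the doubling of the cycle per stage described above, and the n output lines together give the binary count.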
8.9 Chip Control




Remember that the input and output are performed in parallel.
The interaction of the states is shown in Figure 10, the circuitry logic is shown in 








if (clock high) then
    if (data available) then
        if (chip enabled) then
            read data enable 
            increment input counter
NOTES: (1) the counter is incremented on negation of read data enable signal
       (2) a count of 8 disables the input enable
Figure 11 - Input Control
The output is nearly identical to the input.
OUTPUT
while (output enabled)
    if (clock high) then
        if (output fifo not full) then
            if (chip enabled) then
                write data enable 
                increment output counter
NOTES: (1) the counter is incremented on negation of write data enable signal
       (2) a count of 8 disables the output enable
Figure 12 - Output Control
ALU
while (ALU enabled)
    if (count = 0) then
        reset adder/subs carry to 0,1 respectively 
        set read enable word 1 
    if (count = 10) then
        set write enable word 1 
    if (count = 22) then
        reset read enable word 1 
        disable ALU
Figure 13 - ALU Control 
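The ALU control sequence of Figure 13 can be restated as a small Python generator. This is an illustrative sketch only: the action strings are taken from the figure, while the function name and the idea of yielding (count, actions) pairs are assumptions made for the example.

```python
def alu_control(max_count=22):
    """Yield (count, actions) for one ALU evaluation, following Figure 13."""
    for count in range(max_count + 1):
        actions = []
        if count == 0:
            actions.append("reset adder carry to 0 and subtractor carry to 1")
            actions.append("set read enable word 1")
        if count == 10:
            actions.append("set write enable word 1")
        if count == 22:
            actions.append("reset read enable word 1")
            actions.append("disable ALU")
        if actions:
            yield count, actions


for count, actions in alu_control():
    print(count, actions)
```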
8.10 Chip Control and Data Lines
The interface with the real world consists of the following lines:
- chip enable : set by processor
- data available from input FIFO : set by input FIFO
- read data from input FIFO : set by DCT chip
- output FIFO not full : set by output FIFO
- write data to output FIFO : set by DCT chip
- input data lines (12) : data placed on by input FIFO
- output data lines (12) : data placed on by DCT chip
- master clock pulse : from host processor
8.11 Read/Write Enable
Typically to enable a word to be pulsed into the ALU would simply require a 
counter enabling the respective enable lines. As we have 12 consecutive words the 
circuitry can be reduced by simply having a daisy chain of delays and so for every cycle the 
enable signal is incremented along the enable lines, as shown in Figure 14. The output 
chain is the same except that the input enable is initialized on count 10.
[Figure 14 diagram: chain of delay elements driving enable lines 1, 2, ... 11, 12, started at count = 0; enables are high after inversion in the delay element]
NOTE - output chain is the same
Figure 14 - Input Enable Daisy Chain
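The daisy chain can be modelled in Python as a single enable pulse shifted along a chain of delay elements. The sketch ignores the inversion inside each delay element and uses hypothetical names; it only shows that one pulse injected at the start count enables words 1 to 12 on successive cycles without a decoder.

```python
def daisy_chain(num_words=12, num_cycles=14, start_count=0):
    """Shift a single enable pulse along a chain of delay elements.

    Returns, for each cycle, the number of the enabled word (or None).
    """
    chain = [0] * num_words
    trace = []
    for count in range(num_cycles):
        pulse_in = 1 if count == start_count else 0
        chain = [pulse_in] + chain[:-1]   # every element passes its bit along
        trace.append(chain.index(1) + 1 if 1 in chain else None)
    return trace


print(daisy_chain())
print(daisy_chain(start_count=10))
```

The second call corresponds to the output chain, which is started on count 10.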
CHAPTER 9 - DISCUSSION AND ANALYSIS
9.1 Introduction
The discussion of the results progresses through several stages, beginning with 
an outline of the circuit design adopted and continuing through to verification of this design. The 
verification of the design incorporates the verification of the algorithm adopted and the 
results of the circuit simulations using both the Trek and Spice simulators, and finally a 
discussion of these results. A discussion of the tools used in the design is also 
provided.
9.2 Circuit Design
The objective of the project, as outlined in the abstract, is to create a VLSI double 
metal CMOS design to perform the discrete cosine transform.
This by definition, provides the opportunity to implement any DCT algorithm in 
whatever silicon arrangement is available. Therefore, the range of solutions that can be 
implemented is vast. In considering any specific solutions, it is necessary to have an 
understanding of the performance limitations of the design in question. The resultant 
investigation and verification then has to ensure that the design adopted meets both the design 
parameters and the required performance criteria.
Investigation of alternate designs [1] [2] exposed some of these design limitations. 
Creating low level modules such as adders, subtractors, buffers and so on with the design 
tools [7], exposed other limitations that were a product of the VLSI design technology. The 
design itself has been the centre of discussion of the preceding chapters; however, further 
comments need to be made in order to complete the discussion. These comments are 
expanded in Section 9.2.1.
Adoption of the Lee algorithm [5] to perform the DCT was a result of reviewing the 
work of Wu [5]. The Lee algorithm was one of the main algorithms investigated by Wu and it 
was found to be one of the most computationally efficient algorithms. Other algorithms such 
as that derived by Hu [5] could have also been adopted, and in fact the designs are very similar.
9.2.1 VLSI Design
Initial design investigations concentrated on a bit parallel design as this 
arrangement is the one most documented and reported in the literature. As the low level 
modules were created and their performance investigated, it became apparent that to 
provide the necessary data throughput two ALUs were required. This verifies the design of 
Arould and Dugre [2] in which four separate processing units are adopted. The resultant 
silicon required would then exceed the silicon area limitations. Also the resultant 
implementation of the Lee algorithm requires a minimum number of multiplications as the 
parallel multiplier is the processing bottleneck. The original Lee algorithm with thirteen 
multiplications is the most efficient means of adopting a parallel design as the pipelining of 
the subtractions and multiplications reduces the number of passes through the ALU to 
eighteen. The resultant eighteen passes with the ALU settle time results in a transformation 
time insufficient to process the required data and so a second ALU is required.
The investigation then turned to a bit serial implementation. As the preceding 
chapters outline, a revised version of the Lee algorithm was found to be the most 
computationally efficient means of implementing a bit serial design. To implement this bit 
serial design it was necessary to design basic arithmetic units such as adders, subtractors 
and multipliers to enable the data throughput to be maximized in terms of parallel and 
pipelined data processing. With a single pass through the ALU and the number of 
multiplications and additions reduced from other serial designs [1], the final design was 
considered to be the most efficient possible.
The time required to transform a 1D eight word data set is 2001 
ns, and a 2D transform (of an 8 x 8 word data set) takes 32 µs, assuming a 15 MHz clock 
cycle. This could be improved if the data processing and data I/O were rearranged into a 
parallel design. This would reduce the processing time to 1407 ns for an eight word data set 
and 22 µs for a 2D transform of an 8 x 8 word data set. The design of INMOS [1] performs a 
2D transform of an 8 x 8 data set in 3 µs with an inferior mathematical algorithm. This superior 
performance would be due to the processor array design as adopted by the INMOS  
transputer designs.
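The transform times quoted above can be approximated with a short Python sketch. The cycle counts are assumptions inferred from the quoted figures rather than stated in this paragraph: 30 cycles for a 1D transform with serial I/O (8 I/O cycles plus 22 ALU cycles), 21 cycles with parallel I/O, and 16 1D transforms (8 rows and 8 columns) for an 8 x 8 data set.

```python
CLOCK_HZ = 15e6
CYCLE_NS = 1e9 / CLOCK_HZ   # about 66.7 ns per clock cycle


def transform_time_ns(cycles, transforms=1):
    """Time for `transforms` passes of `cycles` clock cycles each."""
    return cycles * CYCLE_NS * transforms


for label, cycles in (("serial I/O", 30), ("parallel I/O", 21)):
    one_d = transform_time_ns(cycles)
    two_d = transform_time_ns(cycles, transforms=16)
    print(label, round(one_d), round(two_d))
```

The results are within rounding of the 2001 ns and 1407 ns figures quoted for the 1D case, and give roughly 32,000 ns and 22,400 ns for the 2D case.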
The design philosophy of having data and control lines running perpendicular was 
implemented wherever possible. With the second metal layer being available it allows greater 
flexibility in routing wires both through individual circuits and between circuits. Due to the 
hybrid designs developed in the design, especially in the multiplying modules, it became 
necessary to run the control lines parallel to the data lines. This does not create any problems 
as the second layer of metal allows the crossover sections to be constructed without any 
layout restrictions.
9.3 Expected Design Performance
Having decided upon the final algorithm and circuit implementation it is necessary to 
determine the expected design performance so that when the verification tools are used a 
measure of the success of the design can be undertaken.
9.3.1 Design Cycle Time
As outlined in Section 4.5, the cycle time for the design to process eight data words is :
- eight cycles to read/write data to/from the design
- twenty one cycles to cycle the data through the ALU
- the data is available to be written out of the design on cycle twenty two
9.3.2 Transformation Accuracy
Due to integer multiplication it is expected that errors will occur as a result of the 
inherent truncation of the least significant eight bits. The maximum error that would be 
expected would be a result of multiplying the largest possible result (transform value X0) by 
half the accuracy of the multipliers. In other words, if all the eight input data words were 255 
then the maximum error would be
maximum value * multiplication lsb magnitude * 50% accuracy
= (255 * 8) * (1/256) * (0.5)
= 3.9843
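The same bound can be evaluated in Python. The function and parameter names are illustrative; the values (eight inputs of 255, 8 bit multiplier coefficients, half an lsb of accuracy) are those used in the text.

```python
def max_truncation_error(max_input=255, words=8, coeff_bits=8):
    max_value = max_input * words      # largest possible X0
    lsb = 1 / (1 << coeff_bits)        # multiplication lsb magnitude, 1/256
    return max_value * lsb * 0.5       # 50% accuracy


print(max_truncation_error())
```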
9.3.3 Algorithm Verification
Before the design was tested, the algorithm used in the design needed to be 
verified by way of a simulation.
This verification consists of running the C program listed in Appendix 1 to simulate 
the original Lee forward and inverse transform algorithms, as well as the revised forward 
transform adopted in the design. A total of twenty separate test runs are presented in 
Appendix 1.
9.3.4 Simulator Results
There are two simulators available for testing
a) Trek [7], this gives fast turn around time to simulations by examining the logic
design of the circuit only.
b) Spice [7], gives voltage level analysis of all points in the circuit.
The Spice simulations are the preferred method of testing the designs.
It was found when running the Spice simulations that as the circuit being tested 
increases in size the Spice analysis takes considerably longer to reach a conclusion. It 
becomes so slow that only the fundamental building blocks of the design can be properly 
tested with Spice. It became apparent that as Spice is such a fine detail analysis it will not give 
results when testing large sections of the design. The total design has of the order of 18000 
points to be analysed.
The Trek simulator on the other hand does give results for larger designs, and so 
the results are presented in two sections :
a) Spice simulations of the fundamental building blocks to show their 
functionality at the lowest level
b) Trek simulations of the total design to show the functionality of the design 
as a whole.
9.3.4.1 Spice Simulations
Attempts were made to analyse all the low level modules with Spice. However, 
due to problems with Spice not performing correctly when conducting transient analysis on 
larger designs, not all lower level modules could be fully tested with Spice. Consequently 
these modules had to be verified with the Trek simulator. A subset of the modules 
successfully tested with Spice is presented to show the typical performance obtained. The 






Multiplier cell : The multiplier cell is a unit that incorporates an adder and a carry feedback 
delay cell required for a single multiplier stage. The initial 20 ns of the cell 
settle time is presented to show the worst case settle time required by 
such a cell.
Delay buffer : A typical bit buffer adopted throughout the design to store a bit value for a 
single clock cycle.
Clock driver : Given a master clock cycle the driver creates the second clock phase.
Input buffer : The buffer cells adopted for the input buffers.
Output buffer : The buffer cells adopted for the output buffers.
Refer to Appendix 2 for the plots of the Spice simulations.
9.3.4.2 Trek Simulations
The Trek simulator was used to verify the design with respect to the result obtained 
with the verification program. (Appendix A1.1)
The tests consist of :
a) running the same twenty verification tests used in Section 9.2
b) running the results of the last eight verification tests as inputs, this simulates a 2D 
transform
c) running two tests to show the performance when the input and output 
control are interrupted
d) a final test is run with two transforms performed in succession to show the cyclic
performance of the design
Refer to Appendix 3 for the plotted results of the Trek simulations.
9.4 Results Evaluation
The results and evaluation are presented in the same sequence as in Section 9.3, 
namely algorithm verification, Spice simulations and finally Trek simulations.
9.4.1 Algorithm Verification
The results of the verification are shown in Appendix 1. The Lee algorithm is 
verified since the inverse transform results produce the same data set as used in the forward 
transform input.
Moreover, the revised algorithm used in the current design is also verified as it 
produces the same results as the Lee forward transform.
9.4.2 Spice Simulations
Spice is the preferred simulator to verify a design and attempts were made to 
thoroughly test each low level module. During the simulations it became apparent that as the 
complexity of the design increased the time required by Spice to complete an analysis 
became considerably longer and only the simplest of the low level modules could be 
successfully tested.
To perform a Spice simulation the UNSW VLSI tool “Runspice” was used to create 
the Spice “deck” which in turn is fed into the Spice simulator. Runspice requires the circuit 
initial conditions and pulsed data waveform characteristics, which it then transfers into a Spice 
“deck” to perform a “transient” analysis. The specification of the transient analysis command 
within the Spice deck has the following form:
.TRAN step stop [start] [max step] [use initial conditions]
where the arguments “start” and “use initial conditions” are optional and time 
magnitudes are specified with descriptors such as 10N for 10 ns and 100M for 100 ms 
respectively. The "step" time increment is used for plotting and/or printing results, it is not 
necessarily the time step used for computations. The "start" time is 0 by default.
To calculate the calculation time step of the simulation Spice selects the minimum 
of “step” and "(stop - start)/ 50" and sets a maximum time step of “max step” [9]. The 
calculation start time is always at 0 secs.
It was found that irrespective of the manipulation of these timing parameters and 
the circuit initial conditions the simulation always commenced with a calculation time step in 
the order of 1 x 10 '1^ secs and therefore the simulation results contained a considerable
amount of superfluous data. Once the simulation reached the time frame that was required, 
usually after 1 ns, the calculation time step was as expected, but only for smaller designs. As 
the circuit being tested became more complex the initial superfluous data increased 
accordingly.
In the case where the circuit has several pulsed inputs, such as the multiplier cells, 
adders and subtractors, which have pulsed inputs of phase 1, phase 2 and their inverse 
phases, simulation of even the smallest section of these designs is possible for a short 
period only. It was found that if a simulation period was less than approximately 100 ns then 
the simulation would run to completion. If however, the simulation was considerably longer 
then it would terminate earlier than the shorter simulations. Investigations failed to reveal why 
this was occurring. It was therefore not possible to obtain results of a time window within a 
larger simulation time frame.
For the Spice simulations that were successful a subset of results are presented to 
show typical results that were obtained.
a) Multiplier Cell (refer page XIV)
This cell has an adder and a carry feedback delay for a single stage of a multiplier. 
The time frame used is the first 20 ns to show the worst case settle time required for such a 
cell. To create the worst case the inputs and outputs are set to initial states opposite to their 
final states. The multiplier partial sum output, (green) is set high when the two input streams 
are set high thereby creating a carry high and partial sum low. The settle time is shown to be
7.5 ns and the next stage would have a useable signal at approximately 5 ns at which point 
the voltage level is 90% of the final voltage. Phase 2, (orange) is the phase that the module 
receives data from the preceding module.
If 7 ns is used as the worst case settle time, and recalling that the maximum number 
of adder stages in a data stream is eight (five multiplier stages and three adder stages), 
then 8 x 7 = 56 ns is required per data stream settle time. If a worst case wire transmission time 
of 2ns is assumed [3] then the total worst case settle time of the ALU is 58ns between the 
ALU input and output buffers. The timing pulse of 67 ns for a 15 MHz clock rate is sufficient 
for the data input to be latched on phase 2 and results to be read for the next stage on the 
following phase 1.
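The timing budget above can be checked with a few lines of Python. The constants are the figures quoted in the text (7 ns per stage, eight stages, 2 ns wire delay, 15 MHz clock); the names are illustrative.

```python
STAGE_SETTLE_NS = 7   # worst case settle time per adder stage
MAX_STAGES = 8        # five multiplier stages and three adder stages
WIRE_DELAY_NS = 2     # assumed worst case wire transmission time
CLOCK_HZ = 15e6


def alu_settle_ns():
    return MAX_STAGES * STAGE_SETTLE_NS + WIRE_DELAY_NS


def clock_period_ns():
    return 1e9 / CLOCK_HZ


margin_ns = clock_period_ns() - alu_settle_ns()
print(alu_settle_ns(), round(clock_period_ns(), 1), round(margin_ns, 1))
```

A positive margin indicates that, under these worst case assumptions, the ALU settles within one clock period.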
b) Delay Buffer (refer pages XV and XVI)
The plot on page XVI is presented to show the input and output pulses separately 
as the plot on page XV is confusing when examined alone.
The input data, (green) is fed into the buffer on phase 1, (orange trace) and written 
on phase 2, (black). The input data pulse is aligned with phase 1 for a single pulse and the 
resultant output data is seen to be buffered for the full phase 2 clock cycle. The output is 
lowered when the next low data cycle is pulsed through the buffer.
c) Clock Driver (refer page XVII)
The mother pulse, (green) is converted to phase 1, (blue), and phase 2, (orange). 
The two phases are not overlapping for approximately 5ns.
Throughout the design there are drivers for the phase pulse negation as outlined 
in Section 8.7.
d) Input Buffer (refer pages XVIII and XIX)
The buffer reads the data (green) and inverse data lines when the read enable, 
(blue) is high. The writing of the data to the output, (black) is when the write enable line, 
(orange) is negated.
The plot on page XVIII shows that when a high value is stored the resultant output 
value becomes slightly unstable after the write enable is raised and is not fully grounded 
when a corresponding low value is written. This is due to the data being output to the output 
lines via a "p" type transistor [3]. A "p" type transistor will not conduct a low state efficiently. 
To overcome this the output line has to be preset to ground prior to writing data from the cell 
[3]. An alternative method to filter the data is to pass the output through a double inverter so 
that the unstable raising and lowering of the output is corrected. This double inverter acts 
like a buffer. The result of this filtering is seen on page XIX where the output signal is 
latched high and low correctly. In the current design this filtering is achieved by the buffer at 
the input of the ALU.
In the current design it is necessary to pulse the msb of the input data through the 
ALU on the 13th through to the 21st pulses so that the correct subtraction and addition can be 
achieved. Originally it was intended to install control to retransmit the msb on these cycles. 
However, it was found that there is sufficient capacitance in the output buffer signals to 
maintain the signal after the msb of the input data is read.
e) Output Buffer (refer pages XX and XXI)
As this buffer has the same design philosophy as the input buffer the behaviour of 
the buffer is identical to that of the input buffer. The buffer output filtering in the current 
design is achieved by installing a double inverter between the buffers and the output pads.
9.4.3 Trek Simulations
Table 9.1 summarizes the Trek results and the resultant error that occurred. It can 
be seen that the average error experienced is one when the simulation results are compared 
to the verification results, with the latter rounded to the nearest integer. The maximum error 
is two, which is well inside the maximum expected error of 3.9.
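The error figure used in Table 9.1 can be sketched in C as follows. The function names are hypothetical; the only rule taken from the text is that the verification value is rounded to the nearest integer before it is compared with the integer Trek result.

```c
#include <assert.h>
#include <stdlib.h>

/* Round to the nearest integer, halves away from zero. */
static long round_nearest(double v)
{
    return (long)(v >= 0.0 ? v + 0.5 : v - 0.5);
}

/* Error of one coefficient: the verification value is rounded to the
 * nearest integer and compared with the integer simulation result. */
long coeff_error(double verification, long simulated)
{
    return labs(round_nearest(verification) - simulated);
}

/* Largest error over a set of n coefficients. */
long max_error(const double *verification, const long *simulated, int n)
{
    long worst = 0;
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        long e = coeff_error(verification[i], simulated[i]);
        if (e > worst)
            worst = e;
    }
    return worst;
}
```

Applied to the first test in Table 9.1 (verification values 17.7, -24.5, 23.1, -20.8, 17.7, -13.9, 9.6, -4.9 against Trek results 17, -25, 23, -21, 17, -15, 9, -5), the per-coefficient errors are 1, 0, 0, 0, 1, 1, 1, 0, as tabulated.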
Tests 21 to 28 inclusive are run with the output of tests 13 to 20 inclusive as their 
inputs. The aim of these tests is to verify that the internal processing can handle the 
magnitudes required to perform 2D transforms. In a normal 2D transform a set of 8x8 input 
words is transformed row by row and these results are then transformed column by 
column. Tests 21 to 28 each take a single set of 8 transformed results and perform a 
second transform on this set. Strictly speaking this is not a true 2D transform, but it 
is a valid test of the internal functioning of the design in handling the required data 
magnitudes and negative numbers. The results of these simulations are in accordance with 
the verification tests.
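For comparison with the single-set tests, the row by row and column by column procedure of a true 2D transform can be sketched in C. This is not the algorithm implemented on the chip: it uses a direct-form 8-point DCT with the scaling that reproduces the verification values in Table 9.1, and a small cosine table so that no maths library is needed. All names are hypothetical.

```c
#include <assert.h>

#define N 8

/* cos(m*pi/16) for m = 0..8; other angles follow from symmetry. */
static const double COS16[9] = {
    1.0, 0.9807852804, 0.9238795325, 0.8314696123, 0.7071067812,
    0.5555702330, 0.3826834324, 0.1950903220, 0.0
};

/* cos(m*pi/16) for any non-negative integer m. */
static double cos16(int m)
{
    assert(m >= 0);
    m %= 32;                    /* period of 2*pi */
    if (m > 16)
        m = 32 - m;             /* cos(2*pi - a) = cos(a) */
    return (m > 8) ? -COS16[16 - m] : COS16[m];   /* cos(pi - a) = -cos(a) */
}

/* Direct-form forward DCT:
 * X[k] = (c(k)/4) * sum over n of x[n] * cos((2n+1)*k*pi/16),
 * with c(0) = 1/sqrt(2) and c(k) = 1 otherwise. */
void dct_1d(const double in[N], double out[N])
{
    for (int k = 0; k < N; k++) {
        double sum = 0.0;
        for (int n = 0; n < N; n++)
            sum += in[n] * cos16((2 * n + 1) * k);
        out[k] = 0.25 * (k == 0 ? COS16[4] : 1.0) * sum;
    }
}

/* Row-column 2D transform: every row first, then every column. */
void dct_2d(double in[N][N], double out[N][N])
{
    double rows[N][N], col_in[N], col_out[N];

    for (int r = 0; r < N; r++)
        dct_1d(in[r], rows[r]);
    for (int c = 0; c < N; c++) {
        for (int r = 0; r < N; r++)
            col_in[r] = rows[r][c];
        dct_1d(col_in, col_out);
        for (int r = 0; r < N; r++)
            out[r][c] = col_out[r];
    }
}
```

The intermediate array rows plays the role of the storage between the two passes. For a constant block of 255 only the first coefficient is non-zero and, with this scaling, it equals 510, which indicates the word size the second pass has to accommodate.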
Test 29 is a repeat of test 13 with the input data control interrupted and test 30 is a 
repeat of test 13 with the output data control interrupted. Test 31 has data sets 19 and 20 as 
its inputs. All three of these tests produce the expected results.
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
1 0 17.7 17 1 2 0 17.7 17 1
0 -24.5 -25 0 0 -20.8 -21 0
0 23.1 23 0 0 9.6 9 1
0 -20.8 -21 0 0 4.9 4 1
0 17.7 17 1 0 -17.7 -17 1
0 -13.9 -15 1 0 24.5 24 1
0 9.6 9 1 100 -23.1 -24 1
100 -4.9 -5 0 0 13.9 15 1
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
3 0 17.7 17 1 4 0 17.7 17 1
0 -13.9 15 1 0 -4.9 -5 0
0 -9.5 -9 1 0 -23.1 -24 1
0 24.5 24 1 0 13.9 15 1
0 -17.7 -17 1 100 17.7 17 1
100 -4.9 -5 0 0 -20.9 -21 0
0 23.1 23 0 0 -9.6 -9 1
0 -20.9 -21 0 0 24.6 24 1
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
5 0 17.7 17 1 6 0 17.7 17 1
0 4.9 4 1 0 13.9 15 1
0 -23.1 -24 1 100 -9.6 -9 1
100 -13.9 -15 1 0 -24.6 -27 2
0 17.7 17 1 0 -17.7 -17 1
0 20.9 20 1 0 4.9 4 1
0 -9.6 -9 1 0 23.1 23 0
0 -24.6 -27 2 0 20.9 20 1
Table 9.1 - Trek / Algorithm Verification Comparison
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
7 0 17.7 17 1 8 100 17.7 17 1
100 20.8 20 1 0 24.5 24 1
0 9.6 9 1 0 23.1 23 0
0 -4.9 -5 0 0 20.8 20 1
0 -17.7 -17 1 0 17.7 17 1
0 -24.5 -25 0 0 13.9 15 1
0 -23.1 -24 1 0 9.6 9 1
0 -13.9 -15 1 0 4.9 4 1
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
9 0 0 0 0 10 255 360.6 359 2
0 0 0 0 255 0 0 0
0 0 0 0 255 0 0 0
0 0 0 0 255 0 0 0
0 0 0 0 255 0 0 0
0 0 0 0 255 0 0 0
0 0 0 0 255 0 0 0
0 0 0 0 255 0 0 0
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
11 10 63.6 63 1 12 80 63.6 63 1
20 -32.2 32 0 70 32.2 32 0
30 0 0 0 60 0 0 0
40 -3.3 -3 0 50 3.3 3 0
50 0 0 0 40 0 0 0
60 -1.0 1 2 30 1.0 1 0
70 0 0 0 20 0 0 0
80 -0.2 0 0 10 0.2 0 0
Table 9.1 - Trek / Algorithm Verification Comparison, (continued)
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
13 1 175.0 175 0 14 126 175.0 175 0
63 -25.8 -25 1 154 -11.0 -12 1
217 -81.6 -81 1 128 32.1 31 1
128 -25.3 -25 0 63 32.7 32 1
247 2.5 3 0 1 -88.7 -89 0
154 -41.3 -41 0 217 41.4 41 0
54 35.0 35 0 247 -1.9 -3 1
126 35.0 35 0 54 -17.3 -17 0
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
15 176 223.2 223 0 16 125 223.3 233 0
125 -8.9 -9 0 248 9.4 9 0
71 -28.2 -29 1 228 45.1 44 1
248 56.0 55 1 38 -52 -53 1
122 -16.8 -17 0 176 -13.3 -13 1
228 69.6 71 1 71 -70.1 -71 1
255 -33.7 -33 1 122 -0.5 -1 0
38 -38.9 -39 0 255 42.8 43 0
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
17 36 55.9 55 1 18 110 150.6 149 2
37 -3.2 -4 1 109 3.2 3 0
38 0 0 0 108 0 0 0
39 -0.3 -1 1 107 0.3 0 0
40 0 0 0 106 0 0 0
41 -0.1 -1 1 105 0 0 0
42 0 0 0 104 0 0 0
43 0 0 0 103 0 0 0
Table 9.1 - Trek / Algorithm Verification Comparison, (continued)
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
19 201 289.2 287 2 20 255 289.2 287 2
202 -3.2 -4 1 254 3.2 3 0
203 0 0 0 253 0 0 0
204 -0.3 -1 1 252 0.3 0 0
205 0 0 0 251 0 0 0
206 -0.1 -1 1 250 0.1 0 0
207 0 0 0 249 0 0 0
208 0 0 0 248 0 0 0
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
21 175.0 12.9 13 0 22 175.0 28.7 27 2
-25.8 14.7 15 0 -11.0 49.9 49 1
-81.6 66.4 65 1 32.1 41.1 41 0
-25.3 45.8 45 1 32.7 25.8 25 1
2.5 53.2 52 1 -88.8 7.2 7 0
-41.3 26.6 27 0 41.4 53.9 53 1
35.0 -8.2 -9 1 -1.9 40.4 40 0
35.0 13.7 13 1 -17.3 -21.1 -21 0
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
23 223.2 39.3 39 0 24 223.2 32.6 31 2
-8.9 59.3 59 0 9.4 60.4 60 0
-28.2 25.5 25 1 45.1 79.8 79 1
56.0 67.2 67 0 -52.5 14.2 15 1
-16.8 39.7 39 1 -13.2 38.3 39 1
69.6 40.7 41 0 -70.0 20.1 19 1
-33.7 33.3 33 0 -0.5 23.9 23 1
-38.9 -28.9 -29 0 42.8 41.1 40 1
Table 9.1 - Trek / Algorithm Verification Comparison, (continued)
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
25 55.9 9.2 8 1 26 150.6 27.3 27 0
-3.2 13.0 12 1 3.2 37.6 37 1
0 12.7 12 1 0 35.0 35 0
-0.3 11.8 11 1 0.3 31.1 31 0
0 10.4 11 1 0 26.1 25 1
-0.1 8.5 8 1 0.1 20.2 20 0
0 6.1 5 1 0 13.7 13 1
0 3.3 3 0 0 6.8 7 0
Test Input Lee Fwd TREK error Test Input Lee Fwd TREK error
27 289.2 50.4 49 1 28 355.7 63.5 63 0
-3.2 70.3 69 1 3.2 87.9 87 1
0 66.6 67 0 0 82.4 83 1
-0.3 60.3 59 1 0.3 73.8 73 1
0 51.7 51 1 0 62.4 61 1
-0.1 40.9 41 0 0.1 48.7 49 0
0 28.4 27 1 0 33.3 32 1
0 14.7 15 0 0 16.8 16 1
Table 9.1 - Trek/Algorithm Verification Comparison, (continued)
9.5 Design Tools
There are many tools available within the UNSW VLSI package [7] and not all of 
them were required in developing the design.
To develop a design the normal sequence of tools would be:
(1) Ingred. This is an interactive graphics editor, used to construct the initial design. The 
output of an Ingred session is a text file in the Silo format.
(2) Jigsaw. This tool is a symbolic composer and spacer that spaces the design created with 
the Ingred tool to rules and contracts valid for the design technology used. Jigsaw output is 
in Caltech Intermediate Form, CIF.
(3) Galah. The Galah tool is a circuit extractor and design rule checker.
(4) Spice. Spice is an electrical circuit simulator that is used to produce transient simulations 
with the Runspice tool. Runspice accepts a text file that contains circuit data and produces a 
spice deck suitable for using with the Spice simulator.
(5) Trek. This is an event driven level timing simulator. The simulator is not meant to be a 
replacement for Spice. It provides a fast turn around time for MOS logic simulations.
(6) Simplot. The Simplot tool is a waveform plotter of simulated results and produces plots 
directly from the output files of Trek simulations.
(7) Spiceplot. This is a superimpose waveform plotter for Spice simulations. The file that is 
produced from the Spice simulations has to be edited to produce a format acceptable for the 
Spiceplot tool. The filter used was developed for this project by the author and is shown in 
Appendix 5.
(8) Cifplot. This produces a graphics plot of the CIF format.
9.5.1 Design Tool Idiosyncrasies
A great deal of time was spent at the commencement of the project in familiarization 
with VLSI design principles generally and CMOS VLSI design techniques in particular.
After this initial learning curve, an investigation of how the tools could be used was 
undertaken. From this point through to the later stages of the project various idiosyncrasies 
of the tools became apparent. This section draws attention to some of the problems 
encountered during the project when using the UNSW VLSI tools.
The most significant problem relates to the fact that the only method available to 
connect separate modules together was with an abutment command. The version of the tool 
package available to the author had neither manual nor automatic routing facilities. The net 
result of using the 'abutment' facility was that the Jigsaw processing of a design resulted in 
the design being spaced in such a way that the connections ran either vertically or 
horizontally. This leads to Jigsaw having to move sections of the design to align the 
connecting ports. The resultant area of the design is consequently enlarged. The magnitude 
of this problem was reduced slightly by obtaining a copy of the facility 'mingrid' within the 
Ingred tool, which was not present in the original version of the VLSI design suite. This 
allowed more efficient connection pads to be designed, but extensive use of the 'abutment' 
facility was still necessary.
The next most significant problem relates to the performance of the Jigsaw tool. 
The tool can be used to produce finely spaced designs or to simply take the spacing of the 
sub modules and connect them together in a sparse spacing mode. The fine spacing facility 
did produce designs superior to the sparse spacing but the advantage was reduced due to 
the 'abutment' problem discussed above. Also, the fine spacing of designs takes significantly 
longer than the sparse spacing and consequently the design process no longer stays an 
interactive one, as the time required for Jigsaw to complete the spacing of larger sections of 
the design approached and usually exceeded thirty minutes.
The processing times and file sizes created by Jigsaw and the plotting tools for the 
final design resulted in the need to alter the allocated resources to a data segment size of 
20MB and a stack size of 15MB.
CHAPTER 10 - CONCLUSIONS
Discrete cosine transforms are being adopted widely in the digital signal 
processing field due to their more efficient processing ability when compared to other 
transform approaches such as FFTs.
Several approaches have been taken to implement the transform. Bit parallel 
designs have been adopted but require more than one processing unit to process the 
data adequately in real time [2]. An analysis of the requirements of a pipelined bit 
parallel design exposes why this is so: the single multiplication unit has a settle time 
of several clock cycles, and reducing the clock period does not satisfactorily improve 
the performance, so more than one processing element is required.
Serial designs such as that by INMOS [1], while using a single processing unit to 
perform a single transformation, require a large number of multiplications, 64. The design 
adopted is a refined version of the Lee algorithm [5], with the number of multiplications 
being 22, all performed in parallel. The original Lee algorithm, while only requiring 13 
multiplications, would require several passes through an ALU, or an ALU with three 
multiplication stages with an associated processing time of 41 cycles.
With the multiplications performed in parallel, only a single pass through the ALU 
is required to perform a single transform. Extending the design to a 2D transform would 
require storing the 64 first transform results and then feeding these in the alternate 
dimension into another ALU. These two processing units and the intermediate storage unit 
could be pipelined to form a single pass processing unit.
The accuracy of the integer multiplications results in errors with a maximum 
magnitude of two, which is within the maximum expected error of 3.9. The accuracy 
could be improved by incorporating circuitry to round results to the nearest 
integer, or by implementing floating point arithmetic logic in the design.
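The effect of rounding rather than discarding the fraction of an integer product can be illustrated with a small fixed-point sketch in C. It is not a model of the ALU: the 12 fraction bits, the function names and the restriction to non-negative operands are assumptions made only for this illustration.

```c
#include <assert.h>

#define FRAC_BITS 12   /* hypothetical coefficient precision, not taken from the design */

/* Convert a coefficient in [0, 1) to an unsigned fixed-point constant. */
long to_fixed(double coeff)
{
    assert(coeff >= 0.0 && coeff < 1.0);
    return (long)(coeff * (double)(1L << FRAC_BITS) + 0.5);
}

/* Multiply a non-negative sample by a fixed-point coefficient and
 * discard the fraction bits (truncation). */
long mul_truncate(long sample, long coeff_fixed)
{
    assert(sample >= 0 && coeff_fixed >= 0);
    return (sample * coeff_fixed) >> FRAC_BITS;
}

/* Same product, but half an LSB is added before the fraction bits are
 * discarded, which rounds the result to the nearest integer. */
long mul_round(long sample, long coeff_fixed)
{
    assert(sample >= 0 && coeff_fixed >= 0);
    return (sample * coeff_fixed + (1L << (FRAC_BITS - 1))) >> FRAC_BITS;
}
```

For a sample of 100 and a coefficient of 0.176776 the exact product is about 17.68. Truncation yields 17, while adding half an LSB first yields 18, the nearest integer. The extra adder is the kind of rounding circuitry referred to above.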
Initial investigations into the design of the circuitry indicated that the final 
design would be within the 6.8mm x 6.8mm size constraints. However, as the 
design reached its final stages it extended beyond this boundary.
One of the reasons for this is the unavailability of manual or automatic 
routing tools. All routing had to be carried out using the abutment facility, which requires 
that connection ports be aligned horizontally or vertically. This results in the 
design being shuffled when the “jigsaw” [7] tool is used to create the “cif” [7] form of the 
design, as it makes the connections perpendicular. To alleviate this problem an updated 
version of the tools was ported from UNSW which incorporated the tool “mingrid” [7]. The 
“mingrid” tool allowed connection pads to be created easily between cells. However, the 
same abutment facility had to be used and so the problem was not totally overcome.
This problem also contributed to the inability to use common rails successfully 
in the stacking of the multiplication modules in the ALU. Common rails, such 
as GND, Vdd, p1, p2 and reset, could have been shared by adjacent multiplication modules. 
However, each module had to have its own set of rails, which consequently increases the 
size of the module. When the common rail design was attempted the abutment of the 
common rail module and the adjacent multiplication modules resulted in the design being 
distorted to the extent that it was more efficient for the modules to have their own power rails.
Verification of the design began with verification of the original Lee algorithm and 
the revised algorithm adopted in the design. Simulations using Spice were used to verify 
the design of the modules used in the design, whereas the overall design was verified with 
the simulator Trek, due to the inability of Spice to cope with the final design size.
CHAPTER 11 - FURTHER WORK
During the course of the design it became apparent that designing such a device 
requires considerable revision of design concepts and layouts and that many iterations 
through the
design -> implement -> test -> revise design
cycle are required.
The final design is far from being at a stage that would be suitable for mass 
production as there are facets of the design that could be revised to improve the 
functionality and silicon area utilization.
The areas that could be improved are
(1) Transformation Accuracy
By incorporating circuitry to round the results to the nearest integer or by 
implementing floating point computations the small error resulting from the design would be 
eliminated.
(2) Extension to 2 Dimensional Transforms
The design implements a one dimension transformation. To implement a second 
dimension the inclusion of a storage medium to store the first transform results is required 
so that these could be fed into a second processor to perform the second dimension 
transformation. The two ALUs and the storage medium could be pipelined to produce a 
single pass processing element.
Alternatively, if the number of cycles required to process the data could be 
reduced to 19 or less then a single processor could be used to perform both dimensions of 
the transform. This would require reading and writing data in parallel to the processor and 
having the processor occupied with processing data at all times.
To achieve this two sets of input and output buffers would be required, together 
with their associated control circuitry.
(3) Design Silicon Area Reduction
As the design proceeds the layout of the modules needs to be continually 
reviewed to ensure the layout is satisfactory. There is a considerable scope to reduce the 
size of the silicon area required by refining the design of each module.









/*
 * This program performs 3 tasks
 *
 * a) Verifies the Lee algorithm by performing the forward
 *    and inverse transforms on a set of data.
 *    This shows that the interpretation of the algorithm is correct
 *    as the inverse transform results should match the initial data set.
 * b) Verifies the revised algorithm by showing that the results
 *    are the same as the original Lee algorithm.
 * c) Performs a second transform upon the results of the
 *    first Lee transform.
 *
 * Input:  20 sets of 8 integers in the range of 0..255,
 *         from stdin.
 *
 * Output: a) Lee forward transform coefficients
 *         b) Revised algorithm forward transform
 *         c) Lee inverse transform of coefficients
 *         d) Lee second transform of 1D results
 */




#include <stdio.h>

/* variable declarations */
float alpha, beta1, beta2, gama1, gama2, gama3, gama4, root2,
      T0, T1, T2, T3, T4, T5, T6, T7,
      t0, t1, t2, t3, t4, t5, t6, t7,
      lee_f[8], new_f[8], lee_i[8], lee_f2[8],
      A, B, C, D, E, F, G;
float x[8];

/* constant definitions */
void init_constants()
{
    /* multipliers of the Lee algorithm */
    alpha = 0.70733; beta1 = 0.54123; beta2 = 1.30806;
    gama1 = 0.50979; gama2 = 0.60144; gama3 = 0.90051; gama4 = 2.57006;
    root2 = 0.70711;
    /* multipliers of the revised algorithm */
    A = 0.176776; B = 0.230969; C = 0.095670; D = 0.245196;
    E = 0.207867; F = 0.138958; G = 0.048772;
}
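/* revised forward transform algorithm */
void revised_DCT(float *array)
{
    float s1, s2, s3, s4, s5, s6, s7, a1, a2, a3, a4, a5, a6, a7;

    /* differences and sums of the mirrored input samples */
    s1 = x[0]-x[7]; s2 = x[1]-x[6]; s3 = x[2]-x[5]; s4 = x[3]-x[4];
    a1 = x[0]+x[7]; a2 = x[1]+x[6]; a3 = x[2]+x[5]; a4 = x[3]+x[4];
    a5 = a1+a4; a6 = a2+a3; a7 = a5+a6;
    s5 = a1-a4; s6 = a2-a3; s7 = a5-a6;
    /* the remaining statements of this function are missing from the source */
}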




/* LEE forward transform algorithm */
void forward_DCT(float *in, float *out)
{
    /* first stage: sums and scaled differences of mirrored samples */
    t0 = in[0]+in[7]; t1 = in[1]+in[6]; t2 = in[3]+in[4]; t3 = in[2]+in[5];
    t4 = gama1 * (in[0]-in[7]); t5 = gama2 * (in[1]-in[6]);
    t6 = gama4 * (in[3]-in[4]); t7 = gama3 * (in[2]-in[5]);

    /* second stage */
    T0 = t0 + t2; T1 = t1 + t3; T2 = beta1 * (t0-t2); T3 = beta2 * (t1-t3);
    T4 = t4 + t6; T5 = t5 + t7; T6 = beta1 * (t4-t6); T7 = beta2 * (t5-t7);

    /* third stage */
    t0 = root2 * (T0+T1); t1 = alpha * (T0-T1);
    t2 = T2+T3; t3 = alpha * (T2-T3);
    t4 = T4+T5; t5 = alpha * (T4-T5);
    t6 = T6+T7; t7 = alpha * (T6-T7);

    /* final additions */
    t2 = t2+t3; t6 = t6+t7;

    T4 = t4+t6; T5 = t5+t7; T6 = t6+t5;

    /* scale and reorder the coefficients */
    out[0] = t0/4; out[4] = t1/4; out[2] = t2/4; out[6] = t3/4;
    out[1] = T4/4; out[5] = T5/4; out[3] = T6/4; out[7] = t7/4;
}
/* LEE inverse transform algorithm */
void reverse_DCT(float *in, float *out)
{
    float w0, w1, w2, w3, w4, w5, w6, w7;

    /* reorder the coefficients */
    w0 = in[0]; w1 = in[4]; w2 = in[2]; w3 = in[6];
    w4 = in[1]; w5 = in[5]; w6 = in[3]; w7 = in[7];

    T0 = w0*root2; T1 = w1*alpha; T2 = w2; T3 = alpha*(w2+w3);
    T4 = w4; T5 = alpha*(w5+w6); T6 = w4+w6; T7 = alpha*(w4+w5+w6+w7);

    w0 = T0 + T1; w1 = T0 - T1; w2 = T2 + T3; w3 = T2 - T3;
    w4 = T4 + T5; w5 = T4 - T5; w6 = T6 + T7; w7 = T6 - T7;

    T0 = w0 + (w2*beta1); T1 = w1 + (w3*beta2);
    T2 = w0 - (w2*beta1); T3 = w1 - (w3*beta2);
    T4 = w4 + (w6*beta1); T5 = w5 + (w7*beta2);
    T6 = w4 - (w6*beta1); T7 = w5 - (w7*beta2);

    out[0] = T0 + (T4*gama1); out[1] = T1 + (T5*gama2);
    out[3] = T2 + (T6*gama4); out[2] = T3 + (T7*gama3);
    out[7] = T0 - (T4*gama1); out[6] = T1 - (T5*gama2);
    out[4] = T2 - (T6*gama4); out[5] = T3 - (T7*gama3);
}
void print_results(int j)
{
    int i;

    printf ("\n test input Lee 1D New 1D Lee Inv Lee 2D");
    printf ("\n -------------------------------------------------------------- ------- ");
    printf ("\n\t%2d\t%3.0f\t%10.3f\t%10.3f\t%10.3f\t%10.3f",
            j+1, x[0], lee_f[0], new_f[0], lee_i[0], lee_f2[0]);
    for (i=1;i<=7;i++) {
        printf ("\n\t\t%3.0f\t%10.3f\t%10.3f\t%10.3f\t%10.3f",
                x[i], lee_f[i], new_f[i], lee_i[i], lee_f2[i]);
    }
    printf ("\n");
}

void enter_data()
{
    /* body missing from the source; reconstructed from the program
       header: reads one set of 8 values from stdin into x[] */
    int i;

    for (i=0;i<8;i++) scanf ("%f", &x[i]);
}

int main()   /* function header missing from the source; reconstructed */
{
    int i;

    for (i=0;i<20;i++) {   /* one pass per data set, 20 sets */
        enter_data ();
        init_constants ();
        forward_DCT (&x[0], &lee_f[0]);
        forward_DCT (&lee_f[0], &lee_f2[0]);
        reverse_DCT (&lee_f[0], &lee_i[0]);
        revised_DCT (&new_f[0]);
        print_results (i);
    }
    return 0;
}
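The output stage of the revised algorithm can be sketched from the definition of the transform and the constants A to G in the listing. The sketch below is a reconstruction under that assumption, not a copy of the author's function. The name revised_dct_sketch and the form of the eight output equations are hypothetical. They use 22 multiplications, which agrees with the count quoted for the revised algorithm, and they reproduce the "New 1D" column of the results table for the impulse input.

```c
#include <assert.h>

/* Coefficients as listed in init_constants(). */
static const float A = 0.176776f, B = 0.230969f, C = 0.095670f,
                   D = 0.245196f, E = 0.207867f, F = 0.138958f,
                   G = 0.048772f;

/* Revised 8-point forward DCT: additions and subtractions first,
 * followed by 22 multiplications that do not depend on each other. */
void revised_dct_sketch(const float x[8], float out[8])
{
    float s1 = x[0] - x[7], s2 = x[1] - x[6], s3 = x[2] - x[5], s4 = x[3] - x[4];
    float a1 = x[0] + x[7], a2 = x[1] + x[6], a3 = x[2] + x[5], a4 = x[3] + x[4];
    float a5 = a1 + a4, a6 = a2 + a3, a7 = a5 + a6;
    float s5 = a1 - a4, s6 = a2 - a3, s7 = a5 - a6;

    /* Output stage: assumed, derived from the DCT definition. */
    out[0] = A * a7;
    out[4] = A * s7;
    out[2] = B * s5 + C * s6;
    out[6] = C * s5 - B * s6;
    out[1] = D * s1 + E * s2 + F * s3 + G * s4;
    out[3] = E * s1 - G * s2 - D * s3 - F * s4;
    out[5] = F * s1 - D * s2 + G * s3 + E * s4;
    out[7] = G * s1 - F * s2 + E * s3 - D * s4;
}
```

Because none of the products feeds another product, all 22 can be evaluated side by side, which is the property the design relies on for a single pass through the ALU.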
A1.2 Verification Data Set
This is the set of twenty tests that are used with the verification program in 
Appendix A1.1 to verify the Lee algorithm and the revised algorithm design.
0 0 0 0 0 0 0 100
0 0 0 0 0 0 100 0
0 0 0 0 0 100 0 0
0 0 0 0 100 0 0 0
0 0 0 100 0 0 0 0
0 0 100 0 0 0 0 0
0 100 0 0 0 0 0 0
100 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
255 255 255 255 255 255 255 255
10 20 30 40 50 60 70 80
80 70 60 50 40 30 20 10
1 63 217 128 247 154 54 126
126 154 128 63 1 217 247 54
176 125 71 248 122 228 255 38
125 248 228 38 176 71 122 255
36 37 38 39 40 41 42 43
110 109 108 107 106 105 104 103
201 202 203 204 205 206 207 208
255 254 253 252 251 250 249 248
A1.3 Verification Program Results
The following table contains the results of the twenty verification tests used to 
verify the Lee algorithm and the revised transform that was adopted in the design.
test input    Lee 1D    New 1D   Lee Inv    Lee 2D
   1     0    17.678    17.678     0.001     0.698
         0   -24.522   -24.520     0.002     1.706
         0    23.101    23.097    -0.005     1.363
         0   -20.792   -20.787     0.005     2.631
         0    17.683    17.678    -0.011     2.729
         0   -13.894   -13.896     0.012     5.296
         0     9.571     9.567    -0.040     7.110
       100    -4.879    -4.877   100.037    23.013
   2     0    17.678    17.678     0.002     1.588
         0   -20.792   -20.787    -0.016     0.430
         0     9.571     9.567     0.005     2.786
         0     4.879     4.877     0.012     1.203
         0   -17.683   -17.678    -0.005     5.056
         0    24.547    24.520    -0.119     3.919
       100   -23.131   -23.097   100.162    22.300
         0    13.912    13.896    -0.040    -8.811
   3     0    17.678    17.678    -0.005    -0.271
         0   -13.894   -13.896     0.005     3.156
         0    -9.571    -9.567    -0.059    -0.043
         0    24.542    24.520     0.002     5.078
         0   -17.683   -17.678    -0.040     1.582
       100    -4.906    -4.877   100.206    23.014
         0    23.131    23.097    -0.119    -6.443
         0   -20.829   -20.787     0.012    -4.329
   4     0    17.678    17.678     0.005     2.737
         0    -4.879    -4.877     0.012    -1.218
         0   -23.101   -23.097     0.002     5.288
         0    13.925    13.896    -0.219    -0.592
       100    17.683    17.678   100.317    23.392
         0   -20.850   -20.787    -0.040    -3.007
         0    -9.571    -9.567    -0.005    -5.803
         0    24.597    24.520    -0.011    -0.535
   5     0    17.678    17.678    -0.011    -1.786
         0     4.879     4.877    -0.005     5.722
         0   -23.101   -23.097    -0.040    -2.700
       100   -13.925   -13.896   100.317    23.273
         0    17.683    17.678    -0.279     0.669
         0    20.850    20.787     0.002    -6.420
         0    -9.571    -9.567     0.012    -0.458
         0   -24.597   -24.520     0.005    -1.328
   6     0    17.678    17.678     0.012     5.063
         0    13.894    13.896    -0.119    -5.039
       100    -9.571    -9.567   100.206    22.641
         0   -24.542   -24.520    -0.040     4.303
         0   -17.683   -17.678     0.002    -6.380
         0     4.906     4.877    -0.059    -0.311
         0    23.131    23.097     0.005    -1.917
         0    20.829    20.787    -0.005    -0.197
   7     0    17.678    17.678    -0.040    -6.384
       100    20.792    20.787   100.162    22.244
         0     9.571     9.567    -0.119     7.292
         0    -4.879    -4.877    -0.005    -5.731
         0   -17.683   -17.678     0.012    -0.262
         0   -24.547   -24.520     0.005    -2.050
         0   -23.131   -23.097    -0.016    -0.403
         0   -13.912   -13.896     0.002    -0.612
   8   100    17.678    17.678   100.037    23.356
         0    24.522    24.520    -0.040     7.678
         0    23.101    23.097     0.012    -3.955
         0    20.792    20.787    -0.011    -0.761
         0    17.683    17.678     0.005    -1.778
         0    13.894    13.896    -0.005    -0.792
         0     9.571     9.567     0.002    -0.852
         0     4.879     4.877     0.001    -0.302
   9     0     0.000     0.000     0.000     0.000
         0     0.000     0.000     0.000     0.000
         0     0.000     0.000     0.000     0.000
         0     0.000     0.000     0.000     0.000
         0     0.000     0.000     0.000     0.000
         0     0.000     0.000     0.000     0.000
         0     0.000     0.000     0.000     0.000
         0     0.000     0.000     0.000     0.000
  10   255   360.626   360.623   255.002    63.751
       255     0.000     0.000   255.002    88.431
       255     0.000     0.000   255.002    83.310
       255     0.000     0.000   255.002    74.980
       255     0.000     0.000   255.002    63.770
       255     0.000     0.000   255.002    50.105
       255     0.000     0.000   255.002    34.514
       255     0.000     0.000   255.002    17.595
  11    10    63.640    63.639     9.997     4.739
        20   -32.217   -32.214    19.990     8.944
        30     0.000     0.000    29.988    12.433
        40    -3.359    -3.367    39.987    15.076
        50     0.000     0.000    50.014    16.491
        60    -1.009    -1.009    60.013    16.134
        70     0.000     0.000    70.011    13.607




  12    80    63.640    63.639    80.004    17.762
        70    32.217    32.214    70.011    22.267
        60     0.000     0.000    60.013    16.970
        50     3.359     3.367    50.014    11.388
        40     0.000     0.000    39.987     6.016
        30     1.009     1.009    29.988     1.550
        20     0.000     0.000    19.990    -1.426
        10     0.249     0.250     9.997    -2.426
  13     1   175.010   175.008     0.994    12.992
        63   -25.834   -25.828    62.867    14.714
       217   -81.601   -81.580   217.230    66.444
       128   -25.320   -25.334   127.643    45.833
       247     2.476     2.475   247.360    53.213
       154   -41.297   -41.240   154.046    26.612
        54    35.017    34.940    53.857    -8.262
       126    35.042    34.927   126.012    13.736
  14   126   175.010   175.008   125.987    28.690
       154   -11.022   -11.021   154.017    49.904
       128    32.157    32.150   127.950    41.102
        63    32.717    32.709    63.160    25.789
         1   -88.770   -88.742     0.750     7.240
       217    41.393    41.355   217.084    53.852
       247    -1.851    -1.837   247.112    40.445
        54   -17.338   -17.268    53.949   -21.146
  15   176   223.270   223.268   175.995    39.326
       125    -8.856    -8.857   125.022    59.386
        71   -28.286   -28.282    70.797    25.508
       248    56.021    56.013   248.428    67.190
       122   -16.799   -16.794   121.612    39.747
       228    69.654    69.586   228.080    40.710
       255   -33.666   -33.633   255.137    33.298







  16   125   223.270   223.268   124.981    32.569
       248     9.400     9.402   248.088    60.419
       228    45.144    45.133   228.127    79.884
        38   -52.492   -52.488    37.541    14.231
       176   -13.262   -13.258   176.429    38.257
        71   -70.062   -69.988    70.832    20.111
       122    -0.535    -0.518   121.982    23.941
       255    42.775    42.623   255.030    41.070
  17    36    55.862    55.861    36.000     9.224
        37    -3.222    -3.221    36.999    13.032
        38     0.000     0.000    37.999    12.678
        39    -0.336    -0.337    38.999    11.799
        40     0.000     0.000    40.002    10.402
        41    -0.101    -0.101    41.002     8.491
        42     0.000     0.000    42.001     6.098
        43    -0.025    -0.025    43.001     3.279
  18   110   150.614   150.613   110.001    27.276
       109     3.222     3.221   109.002    37.599
       108     0.000     0.000   108.002    35.021
       107     0.336     0.337   107.002    31.131
       106     0.000     0.000   106.000    26.110
       105     0.101     0.101   105.000    20.197
       104     0.000     0.000   104.000    13.663
       103     0.025     0.025   103.001     6.795
  19   201   289.208   289.206   201.001    50.474
       202    -3.222    -3.221   202.001    70.252
       203     0.000     0.000   203.001    66.584
       204    -0.336    -0.337   204.000    60.315
       205     0.000     0.000   205.003    51.665
       206    -0.101    -0.101   206.003    40.911
       207     0.000     0.000   207.003    28.431
       208    -0.025    -0.025   208.002    14.664
  20   255   355.676   355.673   255.003    63.527
       254     3.222     3.221   254.003    87.884
       253     0.000     0.000   253.004    82.393
       252     0.336     0.337   252.004    73.767
       251     0.000     0.000   251.001    62.371
       250     0.101     0.101   250.001    48.688
       249     0.000     0.000   249.001    33.289
       248     0.025     0.025   248.002    16.801
Table A1.1 - Verification Program Test Results
A2 - SPICE SIMULATION RESULTS
The plots in this appendix are those of several Spice simulation results. The plots 
are labelled with a title and the plots represented are:
1) multiplication cell, first 20 ns (mult_cell1.1)
2) delay buffer, with phase pulse plotted (delay_a1.1)
   delay buffer, without phase pulse plotted (delay_a1.2)
3) phase clock generator (clock.1)
4) input buffer cell, without output filter (buf_cell_in.1)
   input buffer cell, with output filter (test_buf_in.1)
5) output buffer cell, without output filter (buf_cell_out.1)
   output buffer cell, with output filter (test_buf_out.1)
[Page XIV: Spice output from mult_cell1.1, Mon Jun 3 1991]
[Page XV: Spice output from delay_a1.1, Mon Jun 3 1991; voltage against time in seconds, 0 to 3e-07]
[Page XVI: Spice output from delay_a1.2, Mon Jun 3 1991; voltage against time in seconds, 0 to 3e-07]
[Page XVII: Spice output from clock.1, Mon Jun 3 1991]
[Page XVIII: Spice output from buf_cell_in.1, Mon Jun 3 1991; voltage against time in seconds, 0 to 3e-07]
[Page XIX: Spice output from test_buf_in.1, Mon Jun 3 1991]
[Page XX: Spice output from buf_cell_out.1, Mon Jun 3 1991]
[Page XXI: Spice output from test_buf_out.1, Mon Jun 3 1991]
A3 - TREK SIMULATION RESULTS
The following thirty one plots are the results of simulations set out in Chapter 9. 
The plots are labelled with a title, for example “dct1” for the first plot.
To understand the format of the data I/O for the simulation refer to the first plot. The 
input values are entered so that they are correctly loaded into the input buffer. This requires 
that the x7 data word entered equals “001001100000”. This can be seen by reading the 
data vertically from bottom to top on the last read enable pulse. Data word x0 is the first 
data entered.
On reading data from the plot the first data word output is X0, “100010000000” 
and again it is read vertically from bottom to top at the first write enable pulse. X7 is the last 
data word output.
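The two bit strings quoted above can be decoded with a short C routine. The sketch assumes that each 12-character string lists the least significant bit first, which is consistent with the first test: the x7 input of 100 and the X0 result of 17. The function name is hypothetical, and treating a set top bit as a two's complement sign is an assumption that the text does not confirm.

```c
#include <assert.h>
#include <string.h>

#define WORD_BITS 12

/* Decode a data word written in the order it is read from a plot,
 * least significant bit first. A set top bit is treated as a two's
 * complement sign bit (assumption). */
int decode_lsb_first(const char *bits)
{
    int value = 0;

    assert(strlen(bits) == WORD_BITS);
    for (int i = 0; i < WORD_BITS; i++) {
        assert(bits[i] == '0' || bits[i] == '1');
        if (bits[i] == '1')
            value |= 1 << i;
    }
    if (value & (1 << (WORD_BITS - 1)))   /* sign bit set */
        value -= 1 << WORD_BITS;
    return value;
}
```

decode_lsb_first("001001100000") returns 100 and decode_lsb_first("100010000000") returns 17, matching the input and the Trek result of test 1 in Table 9.1.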
[Plots dct1 to dct31, pages XXIII to LIII: TREK v1.0 simulation output, Monday 3 June 1991. 
Each plot shows the clock phases p1 and p2, the data lines, the read enable and data 
available signals and the input events against time (x 1.0e-6 seconds).]
IX L R URI
IX L _ r -
ex L i r
IX L -
y  L -






ox L i n -
ix U i n -
ex LRJ -
ex 1 1 -
IX R J -
5X J R -
9X JUU -
L ì JUU -





t66í £ JtspuoH g  :ß> o't» H i  ÎEW
APPENDIX A4 - FIFO DATA
Several options are available to enable data to be processed through the chip at a real time 
rate of 8 data words per 2.6 µsecs. These options are:
1) access data directly from the host system memory by taking control of the data bus;
2) interface the host processor to the chip so that data is loaded from the host to the 
chip by a driver on the host processor;
3) place FIFO buffers at the front and back end of the DCT chip and have the host 
processor write data to, and read data from, these buffers.
The first option suffers because delivering the data to the chip 8 words at a time makes 
inefficient use of the data bus, and the data transfer times would be unacceptable. The 
second option suffers from a similar problem. For either of these options to work 
efficiently there needs to be some form of buffer into and out of the DCT chip. Data can then 
be transferred in bulk, enabling efficient use of the data bus and allowing data to be fed 
through the chip at a sufficiently fast rate.
To achieve this, FIFO buffers are employed so that data can be fed in large quantities to 
the input FIFO. The DCT chip simply checks whether data is available and, if so, reads it. 
When data has been processed, the DCT chip checks that the output FIFO is not full and, if 
so, writes the data to it.
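The buffered data flow described above can be sketched in software. This is only an illustrative model and not the hardware interface: the names `run_dct_with_fifos`, `transform` and `out_capacity` are assumptions introduced here, and the transform is passed in as a placeholder function.

```python
from collections import deque


def run_dct_with_fifos(words, transform, block=8, out_capacity=16):
    """Model the flow host -> input FIFO -> chip -> output FIFO -> host.

    All names and sizes are illustrative assumptions, not thesis values.
    """
    in_fifo = deque(words)        # the host loads the input FIFO in bulk
    out_fifo = deque()
    results = []
    while len(in_fifo) >= block:
        # The chip reads only when a full block of data is available.
        vector = [in_fifo.popleft() for _ in range(block)]
        processed = transform(vector)
        for word in processed:
            # The chip writes only while the output FIFO is not full;
            # here the host drains the FIFO whenever it fills up.
            if len(out_fifo) >= out_capacity:
                results.extend(out_fifo)
                out_fifo.clear()
            out_fifo.append(word)
    results.extend(out_fifo)      # the host reads the remaining output
    return results
```

Words that do not make up a complete block of eight stay in the input FIFO in this model, which mirrors the chip waiting until data is available.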
The timing diagrams of the data input and output are shown in Figures A4.1 and A4.2.
where
RC  = read cycle time              WC  = write cycle time
A   = access time                  WPW = write pulse width
RR  = read recovery time           WR  = write recovery time
RPW = read pulse width             DS  = data setup time
RL  = dR low to low Z              DH  = data hold time
DV  = data valid from dR high      WFF = dW low to dFF low
RHZ = dR high to high Z            FFW = dFF high to valid write
REF = dR low to dEF low            RFF = dR high to dFF high
EFR = dEF high to valid read       WPI = write protect indeterminate














Figure A4.2 - Write and Full Flag Timing
APPENDIX A5 - SPICE OUTPUT DATA FILTER
This filter is required to rearrange the data from the Spice output to enable the 
Spiceplot tool to plot the data. The input file is "file".spout and the filtered file is "file".spplot.
Spice Output
headers
index   TIME   V(0)   V(1)   V(n)
1       nnn    nnn    nnn    nnn
2       nnn    nnn    nnn    nnn
n       nnn    nnn    nnn    nnn
index   TIME   V(0)   V(1)   V(n)
new page of data
Spiceplot Required Input Format
headers
TIME   V(0)   V(1)   V(n)
X
nnn    nnn    nnn    nnn
nnn    nnn    nnn    nnn




The filter is written in the Unix system command interpreter language, 'shell'.
cat "$1".spout | sed '/Ind/q' | sed 's/Index//' > $1.1
echo X >> $1.1
cat "$1".spout | sed '1,12d
/Total run time/q' > $1.2




/Total/d' > $1.3
echo Y >> $1.3
cat $1.1 $1.3 > $1.spplot
rm $1.1 $1.2 $1.3
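The rearrangement performed by this filter can also be sketched in Python. This is not the shell script itself, part of which is not legible, but a minimal model under stated assumptions: the function name `spice_to_spiceplot` is introduced here, each page of the table is assumed to start with a line whose first field is 'Index', and data rows are assumed to start with an integer index. The 'X' and 'Y' marker lines follow the `echo` commands in the script.

```python
def spice_to_spiceplot(lines):
    """Rearrange paginated Spice table output into one Spiceplot table.

    Assumed layout (hypothetical): free-form header lines, then pages
    that each begin with an 'Index TIME V(..)' line followed by rows
    whose first field is an integer index.
    """
    out = []
    seen_table = False
    for line in lines:
        fields = line.split()
        if fields and fields[0].lower() == "index":
            if not seen_table:
                # Keep only the first column header, without 'Index'.
                out.append(" ".join(fields[1:]))
                out.append("X")
                seen_table = True
        elif not seen_table:
            out.append(line)                    # header text before the table
        elif fields and fields[0].isdigit():
            out.append(" ".join(fields[1:]))    # data row, index dropped
    out.append("Y")
    return out
```

Lines that are neither headers nor data rows, such as a closing run time summary, are simply dropped in this sketch.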
APPENDIX A6 - CIF PLOT DATA FILTER
This filter takes a file from stdin and writes it to several files, each containing up to MAX_SIZE 
characters. This filter was needed to create the cifplot of the final design, as 
the system used for the design had a maximum file size of 3 MB and the file created for the 
plot was over 6 MB in size. The plotting data in the separate files is transferred to the 
system on which the plots are produced and the files are then concatenated.
/* file filter - to break a stream from stdin into files with MAX_SIZE characters each */




FILE *openfile (cnt) 
int cnt;
{
FILE *fp, *fopen(); 
switch (cnt) {
case 0 : fp = (fopen ("dc.0", "w")); break; 
case 1 : fp = (fopen ("dc.1", "w")); break; 
case 2 : fp = (fopen ("dc.2", "w")); break; 
case 3 : fp = (fopen ("dc.3", "w")); break; 
case 4 : fp = (fopen ("dc.4", "w")); break; 
case 5 : fp = (fopen ("dc.5", "w")); break; 
case 6 : fp = (fopen ("dc.6", "w")); break; 
case 7 : fp = (fopen ("dc.7", "w")); break; 
case 8 : fp = (fopen ("dc.8", "w")); break; 
case 9 : fp = (fopen ("dc.9", "w")); break; 







int file_cnt, char_cnt; 
int c;
FILE *fp;
file_cnt = 0; 
char_cnt = 0;
fp = openfile (file_cnt); 
c = getchar ();
while (c != EOF) { 
putc (c, fp); 
char_cnt++;
if (char_cnt >= MAX_SIZE) { 
char_cnt = 0; 
fclose (fp); 
file_cnt++;
fp = openfile (file_cnt);
}
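The chunking logic of this filter can be restated compactly in Python. This is an illustrative sketch only: the function name `split_into_chunks` is an assumption, and the chunks are returned as strings rather than written to the files 'dc.0', 'dc.1' and so on.

```python
def split_into_chunks(chars, max_size):
    """Split a character stream into chunks of at most max_size characters.

    As in the C loop, the next chunk is opened as soon as the current
    one is full, so the final chunk may be empty when the input length
    is an exact multiple of max_size.
    """
    chunks = [[]]
    for c in chars:
        chunks[-1].append(c)
        if len(chunks[-1]) >= max_size:
            chunks.append([])      # corresponds to opening the next file
    return ["".join(chunk) for chunk in chunks]
```

Concatenating the returned chunks in order reproduces the original stream, which is the property relied on when the plot files are rejoined on the plotting system.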





APPENDIX A7 - CHIP PLOT
The plot of the design is the cifplot plotted on an HP 7585B plotter. Figure A7.1 
shows the arrangement of the modules within the design and the actual plot can be found 
in the pocket inside the back cover of the thesis.
Examining the plot reveals the extent of the vacant silicon area that has resulted 













Figure A7.1 - Chip layout
APPENDIX A8 - COMPARISON TO INMOS DESIGN
A8.1 INMOS Design
The Inmos design involves an 8*8 matrix multiplication in order to achieve a one dimensional 
transformation of the input vectors. This amounts to 64 cycles, or 3.2 µsecs, per transformation 
using a 20 MHz system clock [1].
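Input Matrix x Transformation Matrix = Transformed Matrix

$$
\begin{bmatrix}
i_{00} & i_{01} & \cdots & i_{0n} \\
\vdots & \vdots &        & \vdots \\
i_{n0} & i_{n1} & \cdots & i_{nn}
\end{bmatrix}
\times
\begin{bmatrix}
t_{00} & t_{01} & \cdots & t_{0n} \\
\vdots & \vdots &        & \vdots \\
t_{n0} & t_{n1} & \cdots & t_{nn}
\end{bmatrix}
=
\begin{bmatrix}
i_{00}t_{00} + i_{01}t_{10} + \cdots + i_{0n}t_{n0} & \cdots \\
\vdots & \ddots
\end{bmatrix}
$$

where : i_{ij} = n x n input matrix, i = 0..n, j = 0..n
t_{ij} = n x n transformation matrix, i = 0..n, j = 0..n
Figure A8.1 - Single 8*8 Matrix Transformation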
For each row of this transformation, the input vector row is multiplied by the transformation 
matrix column, to produce a transformation of that single vector. This involves 8 multiplications 
and 7 additions for each element of the output matrix. Thus, the transformation of all 64 
elements requires 512, (8*64), multiplications and 448, (7*64), additions.
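As an illustration only, the operation counts quoted above can be checked with a short Python sketch of a direct matrix multiplication. The function name `matmul_with_counts` is introduced here for the purpose and is not part of the Inmos design or of this project.

```python
def matmul_with_counts(a, b):
    """Multiply two n x n matrices directly, counting the operations."""
    n = len(a)
    mults = 0
    adds = 0
    result = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = a[i][0] * b[0][j]          # first product, no addition
            mults += 1
            for k in range(1, n):
                acc += a[i][k] * b[k][j]     # one multiply and one add
                mults += 1
                adds += 1
            result[i][j] = acc
    return result, mults, adds
```

For n = 8 each output element costs 8 multiplications and 7 additions, so the 64 elements cost 512 multiplications and 448 additions.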
Following this initial transformation, the resultant vector is rounded, transposed and then 
transformed a second time in order to produce a two dimensional transformation. This second 
transformation is similar to the first, in that the input vector is multiplied by an 8*8 transformation 
matrix.
The Inmos design is fully pipelined, with data sampled on the input at the full clock rate and the 
transformation result appearing 128 clock cycles, or 6.4 µsecs, later.
A8.2 Revised Algorithm Design
The revised algorithm used in the present project takes 2.01 µsecs per transformation of each 
row of the input matrix. This transformation involves 22 multiplications and 28 additions 
(or subtractions). Compared with the Inmos design, which needs 64 multiplications and 
56 additions per row, this design is computationally more efficient.
A two dimensional transformation based on the revised design requires sixteen passes 
through the chip, or 32.16 µsecs: one pass for each row and one pass for each 
column. This figure compares unfavourably with the 6.4 µsecs of the Inmos design, due to several 
factors. First is the lack of concurrent multiplications in the design adopted in the project. 
Second, the Inmos design uses a high degree of pipelining. Finally, the Inmos design uses a 
20 MHz system clock while the project uses a 15 MHz system clock.
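The timing figures in this comparison follow from simple arithmetic, sketched below in Python. The helper name `cycles_to_microseconds` is an assumption made for this illustration, and the times are taken to be in microseconds, which is what 64 cycles of a 20 MHz clock works out to.

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a cycle count to microseconds for a clock given in MHz."""
    return cycles / clock_mhz


# Inmos design, 20 MHz clock (cycle counts quoted in Section A8.1).
inmos_1d = cycles_to_microseconds(64, 20)
inmos_2d = cycles_to_microseconds(128, 20)

# Revised design: sixteen passes (8 rows and 8 columns) at 2.01 per pass.
passes = 8 + 8
project_2d = passes * 2.01

# How many times slower the revised design is for a 2-D transform.
slowdown = project_2d / inmos_2d
```

Under these figures the two dimensional transformation of the revised design is roughly five times slower than the pipelined Inmos result.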
Inmos were able to take full advantage of multiprocessing units in VLSI form. Parallel 
multiplication units could not be used in the present design due to the chip size restriction of 
8mm*8mm of silicon area. As outlined in Chapter 3, there would need to be at least two parallel 
multiplication units to achieve the required data processing speed, and during the course of the 
project it became apparent that having two parallel multiplication units would make the final 
design too large. For comparison, the 22 multiplication modules adopted 
in the revised algorithm comprise 3220 transistors, including all the timing circuits and 
buffers. A single 8*12 bit parallel multiplier comprises 2688 transistors, where 
each of the adders in the design is a full combinational adder (28 transistors per adder). 
Therefore, two of these multipliers would amount to 5376 transistors, which is a 67% increase 
over the transistor requirements of the multipliers adopted in the design.
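The transistor comparison can be checked with a few lines of Python. The constant and function names are assumptions introduced for this illustration; the numbers are those quoted in the text.

```python
SERIAL_MULTIPLIERS = 3220     # 22 bit serial multiplier modules, with timing and buffers
PARALLEL_MULTIPLIER = 2688    # one 8*12 bit parallel multiplier
ADDER = 28                    # transistors per full combinational adder


def percent_increase(new, old):
    """Percentage by which `new` exceeds `old`."""
    return 100.0 * (new - old) / old


two_parallel = 2 * PARALLEL_MULTIPLIER
adders_per_multiplier = PARALLEL_MULTIPLIER // ADDER
increase = percent_increase(two_parallel, SERIAL_MULTIPLIERS)
```

The figure of 2688 transistors corresponds to 96 adders, that is one adder for each cell of the 8*12 array, and two such multipliers exceed the bit serial multipliers by about 67%.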
The final design covered 4mm*5mm of silicon. In hindsight it could be argued that a parallel 
design with two parallel multiplication units may have fit in the maximum silicon area of 
8mm*8mm. However, during the course of the project when a decision had to be made as to 
which design to adopt, the bit serial design gave a simpler and more compact design. This 
decision had to take into account the problems being experienced with the VLSI package that 
enlarged the circuit design due to the behaviour of the 'abutment' facility. These problems are 
outlined in Chapter 9. These problems were of sufficient magnitude to prohibit the pursuit of 
the parallel multiplier design.
[1] Inmos (1989) IMS A121 2-D DCT Image Processing, Advanced Information
APPENDIX A9 - 2 PHASE CLOCK ANALYSIS
As outlined in Section 8.6, the 2 phase strategy is attained by passing a single phase clock 
through a flip flop, in order to produce a 2 phase non overlapping clock.
The circuit diagram of this two phase clock is shown in Figure A9.1. The Spice simulation of this 
clock is shown in Figure A9.2. The circuit of the flip flop is based on the designs of Eshraghian [1].
The input clock is on line 'clin' with the two output clock phases being on 'p1' and 'p2', 
respectively. On the left hand side of Figure A9.1 is the flip flop from which 'p1' and 'p2' are 
produced and then fed separately into two sets of double inverters. Each inverter is composed of 
four sets of transistors so that a full five volt differential is produced. 'p1' is produced from the left 
hand set of inverters and 'p2' is produced from the right hand set of inverters.
Examination of the Spice output reveals the behaviour of the flip flop.
Input single phase, V(5), (green plot),
Output phase 1, V(6), (blue plot),
Output phase 2, V(7), (yellow plot).
The 'input clock' and 'p1' are seen to be in phase, with 'p2' being out of phase from 'p1'.




Figure A9.1 - Phase Clock Circuit Diagram
Spice output from phasecM.spplot
Time in seconds
Figure A9.2 - Spice Output of Phase Clock Circuit
APPENDIX A10 - CIRCUIT DIAGRAMS
A10.1 Introduction
This appendix contains all the transistor circuits contained in the design along with a brief 
description of how each circuit operates. All the main modules are shown along with a number of 
the sub modules. In Section A10.2 all the circuits are listed and the circuits that are displayed are in 
bold type. The plots were produced on an HP 7585B plotter.
The circuits are shown in a top down fashion. That is, the overall design is shown as a module and 
then each submodule is then broken down to its basic circuits.
To maximize the resolution of each circuit, it is plotted either normally or rotated ninety degrees. 
The circuit name is placed in the top right hand corner of the page. No headers were included as all 
the colour and layer combinations are the same for each plot and these combinations are listed in 
Section A10.1.2. Also, in the module diagrams, only the significant interface between modules is 
shown to maximize the clarity of the diagrams.
A10.1.1 Manual Intervention In Plotting
To gain the maximum resolution when plotting the individual circuits the X and Y axes are scaled to 
maximize the resolution along the respective axes. The scaling is performed by editing the text file 
produced by the plotting package. The line within the text file is the 'SC' line that contains the X 
and Y scale factors. This leads to the thickness of the wires in the X and Y directions not always 
being equal. Also, from plot to plot the wire dimensions are not consistent.
The plots of the module designs were not clear. In all cases the module labels would overlap and 
were unreadable. Both the Silo plot, 'slplot' and the Cif plot, 'cifplot', packages were used to try to 
obtain a clear module plot. However, both of these packages failed to give module plots of 
acceptable clarity. Therefore, all the module diagrams in this appendix were produced with the 
Aldus Superpaint 3.0 on a Macintosh, so that clear module designs are presented and the reader 
can easily follow the design.
A10.1.2 Circuit Colours And Circuit Example
There are seven colours associated with the plots and these are
pen number 1 black: background, (not shown)
2 red: polysilicon
3 blue: metal 1








Black is used for 'metal 2' as the normal colour, 'magenta', that is specified in the CIF plotting 
manuals, was not available.
As an example of a circuit, the inverter circuit is shown in Figure A10.1. Here the input 'in' is fed 
to a 'P' transistor that is connected to the power rail, 'Vdd', and an 'N' transistor that is 
connected to the ground rail, 'GND'. If the input is high, the 'N' transistor conducts to ground and the 
output, 'out', is low. Alternatively, if the input is low, the 'P' transistor conducts to the 'Vdd' rail and 
the output is high.
Note the 'p well' around the 'P' transistor. All 'P' transistors in all the circuits are layered in a 'p well' 






The following hierarchy is a list of all the cells and modules in the design. Under each level are the 
level's contents. For example, 'alu' is a submodule of 'dct' and 'multa' is a submodule of 'alu' etc.
The lowest order levels are in italics and these levels are the 'leafcells' which contain the transistor 
circuits. Those elements that have 'pad' in their name are simply routing pads, for example 'control_padr'.
Those names in bold type are those circuits that are shown in this appendix. It can be seen that all 
the leafcell plots are presented except for the routing pads which only contain wires.





delay_a1
mult_cell1
cell1_1
multg
delay_a1
mult_cell1
cell0_1
multd
mult_cell1
cell1_1
multb
delay_a1
mult_cell1
cell1_1
mult_pad3
mult_driver_l
mult_driver
multe
mult_cell1
delay_a1
cell1_1
multf
cell0_1
mult_cell1
delay_a1
multc
mult_cell1
delay_a1
cell0_1
mult_pad4
mult_pad2
mult_driver_r
mult_driver
mult_io
input_delay
delay_buf1
mult_b_pad1
add_driver1
add_driver
mult_output
mult_io_vpad
mult_out_padout
mult_out_padin
sub_g
adder_v
adder_g
sub_v
mult_input
mult_input_padin
mult_input_padout
sub_g
sub_v
adder_v





mult_driver
inter_pad3
inter_pad1
buf_out_ens
delay_buf_out
buf_out_inv
buf_out_inv_c
tc
tc_input
input_cnt_dcode
count_cell
tc_input_pad2
io_enable
tc_pad5
out_write_en
tc_pad4
out_write_en
tc_control
tc_and21
tc_latch
tc_pad2
tc_pad1
tc_and3
tc_or
tc_and2
tc_bfly
phase_clock
count_cell
bfly_cnt_dcode
tc_bfly_pad1
buf_out_ens2
buf_out_and1
buf_out
buf_outx12
buf_cell_out
driver1
inter_pad2
buf_in_pad1
buf_out_inv2
buf_out_inv_c2
inter_pad4
buf_in
buf_inx12
buf_cell_in
buf_in_ens
delay_buf
buf_in_inv
buf_in_inv_c
out_filter
buf_filter
padout
padin
chip_pad2
chip_pad3
padvdd
padgnd
chip_pad1
chip_pad5
chip_pad4
A10.3 Chip Circuits And Modules
Chip
Inputs: input data, (x0-x11)             Outputs: output data, (X0-X11)
        data available, (data_AV)                 read enable, (read_EN)
        clock, (cl)                               write enable, (write_EN)
        chip enable, (chip_EN)
        output not full, (output_NFULL)
The chip layout can be seen in Figure A10.2. It is obvious by looking at the layout of the chip that 
the 'dct' module contains all the circuitry except for the routing and contact pads.




chip enable     - set by the host processor; if it is low then the on board chip reset is set high
data available  - set by the input FIFO to tell the chip that data is available
clock           - driven by the host processor
output not full - set by the output FIFO to indicate that the output FIFO is not full and that data can 
                  be read from the chip
read enable     - set by the chip to indicate that data can be written to the chip from the input FIFO
write enable    - set by the chip to indicate that data can be read from the chip to the output FIFO
The main interface wires with the 'dct' module are shown. Not all of the routing wires are shown, so as 
to make the diagram as clear as possible. The data is read into and written from the chip in parallel form.
Figure A10.2 - Chip Layout
A10.3.1 Discrete Cosine Transform, (dct), Module
Inputs: input data (in0-in11)            Outputs: output data (out0-out11)
        data available, (data_AV)                 input enable, (in_EN)
        clock, (cl)                               out enable, (out_EN)
        chip enable, (chip_EN)
        output not full, (out_NFULL)
This module is shown in Figure A10.3 and contains all the circuitry for processing the data. The 
control lines are the same as for the chip module except that some of them are renamed.
in - parallel input data
out - parallel output data
input enable - set by the 'dct' module to indicate data can be written to the chip by the host 
processor from the input FIFO
output enable - set by the 'dct' module to indicate that data can be read by the output 
FIFO
There are a number of onboard lines internal to the 'dct' module.
x0-x7 - serial lines for the eight input data words
X0-X7 - serial lines for the eight output data words
P1 - phase 1
P2 - phase 2
reset - chip reset
Phase 1 and 2 are created in the 'control' module from the 'cl' signal and the 'reset' signal is driven 
using the 'chip enable' signal. The input data is converted to bit serial form and the processed bit 
serial data is converted into parallel data for writing from the module.




Figure A10.3 - DCT Module Layout 
A10.4 Arithmetic Logic Unit (ALU) Module
Inputs: input data (x0-x11)              Outputs: output data (X0 - X11)
        phase 1, (p1)
        phase 2, (p2)
        reset, (R)
This module is really only an arithmetic unit rather than an arithmetic logic unit. It receives the eight 
data streams pulsed into the module and passes the data through the respective adders and 
subtractors in the 'mult_io' module before passing them on to the multiplier modules. After the 
multiplications the data streams are then passed through some more adders and subtractors, 
depending on the data stream. See Figure 2, in Chapter 3, for more details on the requirements 
for multipliers, adders and subtractors for each data stream.
The modules 'mult_pad_l' and 'mult_pad_r' are used to overcome the clock skew problems of 
routing the inverse phase signals over the entire chip. These pads contain circuitry to invert 
phases 1 and 2 so that the clock pulses through the multipliers don't have any clock skew. Also, in 
the 'mult_io' module there is a similar circuit to create the inverse phases for the 'mult_io' module.
Figure A10.4 - ALU Module Layout 
A10.4.1 Multiplier Input/Output, (mult_io) Module
Inputs: input data, (x0-x7)              Outputs: output data, (X0 - X7)
        reset, (R)                                multiplier inputs, (x22)
        phase 1, (p1)
        phase 2, (p2)
        multiplier outputs, (x22)
This module is shown in Figure A10.5 and receives the input data from the input buffer, (x0 - x7), and 
passes the data through the adders and subtractors in the 'mult_input' module. The outputs from this 
module are then passed to the multipliers in the 'alu' module. After passing through the multipliers the 
data is then passed through the adders and subtractors in the 'mult_output' module to produce the 
transformed data, (X0 - X7). The data is finally passed to the output buffer.
The 'add_driver1' modules on entry to the 'input_delay' and 'output_delay' modules simply produce 
the inverse phase signals for these modules. Similarly, the 'add_driver1' modules on entry to the 
'mult_input' and 'mult_output' modules produce the inverse phase and inverse reset signals for 
these modules. The 'input_delay' module pulses the data through to the 'mult_input' module by 
using the circuits shown in 'delay_buf1'. Similarly, the 'output_delay' module uses the 'delay_buf2' 










Figure A10.5 - Mult_io Module Layout
Figure A10.6 - Input_delay Module Layout





Figure A10.7 - Output_delay Module Layout




Figure A10.8 - Mult_input Module Layout
Figure A10.9 - Mult_output Module Layout
From the 'mult_input' layout it can be seen that the inputs to the multipliers are :
S0  = x0 + x7                            Xs6 = S3 - S4 = x1 + x6 - x2 - x5
S1  = x2 + x3                            X0  = S2 + S5 = x0 + x7 + x2 + x3 + x1 + x6 + x2 + x5
S2  = S0 + S1 = x0 + x7 + x2 + x3        X4  = S2 - S5 = x0 + x7 + x2 + x3 - x1 - x6 - x2 - x5
Xs5 = S0 - S1 = x0 + x7 - x2 - x3        Xs1 = x0 - x7
S3  = x1 + x6                            Xs2 = x1 - x6
S4  = x2 + x5                            Xs3 = x2 - x5
S5  = S3 + S4 = x1 + x6 + x2 + x5        Xs4 = x3 - x4
A10.4.1.1 Multiplier Input/Output Adders, (adder_v, adder_g)
Inputs: input, (a)                       Outputs: sum, (So)
        input, (b)
        phases (p1, dp1, p2, dp2)
        reset, (R) and dreset, (dR)
This circuit adds together the three data streams 'a', 'b' and 'carry' to produce the sum 'So' and an 
updated 'carry'. On phase 1 the 'carry' signal is inverted and on phase 2 this signal is reinverted and 
fed back into the adder. The values of 'a' and 'b' are fed into the adder on phase 2 and the results of 
the adder are produced as soon as the adder settles. If the 'reset' is high then the 'carry' is set low after 
the phase 2 inversion, otherwise if the 'reset' is low the 'carry' is passed unaltered into the adder.
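The behaviour of the bit serial adder can be modelled in a few lines of Python. This is a behavioural sketch only, not a model of the transistor circuit: the function names are introduced here, and the streams are assumed to be presented least significant bit first, which is what feeding the carry forward to the next cycle requires.

```python
def serial_add(a_bits, b_bits):
    """Add two bit streams presented least significant bit first.

    Each cycle adds 'a', 'b' and the stored 'carry', emits the sum bit
    'So' and stores the new carry. The carry starts at 0, as it would
    after 'reset' has been high.
    """
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        sum_bits.append(total & 1)    # sum output 'So'
        carry = total >> 1            # fed back into the adder next cycle
    return sum_bits


def to_bits(value, width):
    """Return `value` as a list of `width` bits, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from bits given least significant first."""
    return sum(bit << i for i, bit in enumerate(bits))
```

Because the word length is fixed, a carry out of the last bit is lost in this model and the sum wraps around.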
To minimize the power and clock phase rails in this section of the design, each adder has either the 
'Vdd', 'dp1', 'dp2' and 'R' rails or the 'GND', 'p1', 'p2' and 'dR' rails, and adjacent modules are inverted 
horizontally so that there is only a single set of rails between each circuit. Therefore, there are two 
types of adders, 'adder_v' and 'adder_g', which have the 'Vdd' or 'GND' set of rails respectively.
A10.4.1.2 Multiplier Input/Output Subtractors, (sub_v, sub_g)
Inputs: input, (a)                       Outputs: sum, (So)
        input, (b)
        phases (p1, dp1, p2, dp2)
        reset, (R) and dreset, (dR)
These circuits are nearly identical to the adder circuits except that for a subtraction the data stream 'b' is 
inverted. Again there are two versions, 'sub_v' and 'sub_g', to save on silicon real estate.
A10.4.1.3 ALU Multiplier Drivers
Inputs: phase 1, (p1)                    Outputs: inverse phase 1, (dp1)
        phase 2, (p2)                             inverse phase 2, (dp2)
        reset, (R)                                inverse reset, (dR)
Prior to the data entering the multipliers there are two modules, 'mult_driver_l' and 'mult_driver_r', on 
the left and right hand sides of the 'alu' module. These modules contain a set of 'mult_driver' circuits 
that produce the inverse signals of the 'p1', 'p2' and 'reset' signals. These circuits are required to 
prevent any clock skew in the multiplication units.














































Figure A10.13 - Delay_buf2 Circuit
adderv.cif
adderg.cif
subv.cif
subg.cif
multdriver.cif
Figure A10.18 - Mult_driver circuit
A10.4.2 Multipliers
For each multiplier:
Inputs: data in, (ai)                    Outputs: product, (So)
        reset, (R)
        dreset, (dR)
        phases (p1, dp1, p2, dp2)
The multipliers are arranged with six modules and the arrangement of these modules depends on the 
multiplier. Figure A10.19 shows the arrangement of the multiplier for 'multa', which is the multiplier 'A' 
in Chapter 3, Figure 2.
The first module sets the initial partial product stream according to the first coefficient presented, 
which is the least significant of the retained coefficients. That is, if 
the coefficient is '1' then the partial product is set to the input data, otherwise if the coefficient is '0', 
then the partial product is set low, '0'. The other modules either perform an addition if the coefficient is 
'1', or simply delay the data stream flow if the coefficient is '0'. See Chapter 8 for more details on the 
arrangement of the modules.
As an example the layout of 'multa' is shown; 'multa' is the multiplying module for the multiplier 'A'. In 
Section 3.3 the value of 'A' is 0.176776₁₀ = 0.00101101₂. The coefficients of 'A' are arranged in 
reverse order as the input stream is multiplied by the least significant coefficient first. All the multiplier 
values 'A' through to 'G' have their two most significant coefficients equal to zero. Also, the least significant 
coefficient is handled by setting the initial partial product stream equal to either the input data or '0', 
according to whether that coefficient is 1 or 0. 
Therefore, the resultant arrangement of 'multa' is :
* remove the two most significant coefficients: 00101101 → 101101
* reverse the order of the coefficients: 101101 → 101101
Apart from the first coefficient, the 'mult_cell1' module performs the addition where there is a coefficient of 
'1'. The 'delay_a1' cell performs a one cycle delay where the coefficient is '0'.
In the case of 'multa' the second step doesn't change the order of the coefficients, so 'multc' is presented as 
a further example.
* multc = 0.00110000₂
* remove the leading two zeros: 110000
* reverse the order of the coefficients: 000011
The circuit diagram of 'multc' can be seen in Figure A10.23 and has the partial product connected to 
ground followed by three 'delay_a1' modules and then two 'mult_cell1' modules.
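The derivation of a multiplier's cell arrangement from its coefficient can be written out as a short Python sketch. The function names `multiplier_cells` and `coefficient_value` are assumptions introduced for this illustration; the cell names and the two worked examples are those given in the text.

```python
def multiplier_cells(coefficient_bits):
    """Derive the cell arrangement from an 8 bit binary fraction.

    `coefficient_bits` holds the digits after the binary point, most
    significant first, e.g. '00101101'. Returns the source of the
    initial partial product and the list of cells that follow it.
    """
    if not coefficient_bits.startswith("00"):
        raise ValueError("expected two leading zero coefficients")
    trimmed = coefficient_bits[2:]      # remove the two leading zeros
    order = trimmed[::-1]               # least significant coefficient first
    # The first coefficient selects the initial partial product.
    initial = "ai" if order[0] == "1" else "GND"
    cells = ["mult_cell1" if bit == "1" else "delay_a1" for bit in order[1:]]
    return initial, cells


def coefficient_value(coefficient_bits):
    """Decimal value of the binary fraction 0.<coefficient_bits>."""
    return sum(int(bit) / 2 ** (i + 1) for i, bit in enumerate(coefficient_bits))
```

For 'multc' this gives a grounded partial product followed by three 'delay_a1' cells and two 'mult_cell1' cells, in agreement with the description of Figure A10.23. The 8 bit value of 'A' evaluates to 0.17578125, the truncated approximation of 0.176776.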
A10.4.2.1 Multiplier Cell, (mult_cell1)
Inputs: data, (ai)




This circuit is based on the combinational adders of [1] with some extra circuitry to accommodate the 
clocking of the data through the adder and the feedback of the carry bit into the adder.
Data is fed in via 'ai' and 'Si'. 'Si' is the partial product data stream and 'ai' is the unaltered data stream. The 
output stream 'ao' is the same data as 'ai' except that it is delayed by one cycle. Beginning from the start 
of the circuit diagram of 'mult_cell1', 'ai' is pulsed into an inverter on phase 1 and then on phase 2 the 
inverted signal is passed through another inverter to return the signal to its original form, and it is then 
passed on to the combinational adder. The second set of clock phase gates gates the 'carry' signal 
back into the adder. If the 'reset' is high then the 'carry' is set low, '0', and if the 'reset' is low, the 'carry' is 
passed unaltered. The adder adds the three signals, 'ai', 'Si' and the 'carry', and sets 'So' and 'carry' 
accordingly. The partial product is then output on 'So' and the carry is pulsed back through the circuit 
on the next clock phase. The data stream 'ao' is the delayed data 'ai'.
A10.4.2.2 Multiplier Delay Cell, (delay_a1)
Inputs: data, (ai)                       Outputs: data, (ao)
        input partial product, (Si)               output partial product, (So)
        reset, (R)
        dreset, (dR)
        phases (p1, dp1, p2, dp2)
Outputs: data, (ao)
output partial product, (So)
This circuit simply passes the data 'ai' on phase 1 into an inverter and then passes the output of that 
inverter into another inverter on phase 2. This cell simply delays the data 'ai' by one clock cycle and is used 
where the multiplier coefficient is '0', which means there is no alteration to the partial product.
A10.4.2.3 Multiplier Circuits
The 'mult_cell1' and the 'delay_a1' circuits are then arranged to perform the required multiplications as 
shown in Chapter 3, Figure 2. As an example, the circuit of 'multa' is shown in Figure A10.22. Note that at 
the commencement of the circuit the signal 'ai' is fed into the partial product stream, as the first 
coefficient of 'multa' in the reversed order is a 1. The other example, 'multc', shows the initial product signal 
being set low, to 'GND', as the first coefficient of 'multc' in the reversed order is a 0.






























































































Figure A10.20 - Mult_cell1 Circuit
delaya1.cif
Figure A10.21 - Delay_a1 Circuit
Figure A10.22 - Multa Circuit
multc.cif
A10.5 Timing And Control
Inputs: data available, (data_AV)
        clock, (cl)
        chip enable, (chip_EN)
        output not full, (output_NFULL)
        input data, (x0 - x11)
        data from alu
Outputs: input enable, (in_EN)
         output enable, (out_EN)
         phase 1, (p1)
         phase 2, (p2)
         alu reset, (R)
         read buffer enable, (en0 - en7)
         output data, (X0 - X11)
         data to alu
The central module of Figure A10.24 controls the chip by using the four external inputs to control the 
alu and the input and output buffers. Data is read in bit parallel form into the input buffer and then 
pulsed out in bit serial form to the 'alu'. Conversely, the data processed by the 'alu' is pulsed into the 
output buffer from the 'alu' in bit serial form and then written from the chip in bit parallel form. There are 
three main modules, the timing and control module, 'tc', the input buffer and the output buffer.
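The conversion between bit parallel and bit serial form performed by the buffers can be illustrated with a short Python sketch. The function names are assumptions introduced here, and the bit order (least significant bit first) is also an assumption made for this illustration.

```python
WORD_BITS = 12    # the parallel data words are 12 bits wide (x0 - x11)


def parallel_to_serial(word, width=WORD_BITS):
    """Return the bits of `word` as a stream, least significant first."""
    return [(word >> i) & 1 for i in range(width)]


def serial_to_parallel(bits):
    """Reassemble a word from a stream of bits, least significant first."""
    word = 0
    for position, bit in enumerate(bits):
        word |= bit << position
    return word


def load_input_buffer(words):
    """Turn eight parallel words into the eight serial streams for the alu."""
    if len(words) != 8:
        raise ValueError("the input buffer holds eight data words")
    return [parallel_to_serial(word) for word in words]
```

Applying `serial_to_parallel` to each stream recovers the original words, which corresponds to the role of the output buffer.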
Figure A10.24 - Control Module Layout
A10.5.1 Timing And Control, (tc), Module
Inputs: data available, (data_AV)
clock, (cl)
chip enable, (chip_EN)
output FIFO not full, (output_NFULL)
Outputs: input enable, (in_EN)
phase 1, (p1)
output enable, (out_EN)
phase 2, (p2)
alu reset, (R)
input buffer read, (en0 - en7)
output buffer write, (en0 - en7)
alu count 0, (cnt0)
alu count 8, (cnt8)
alu count 20, (cnt20)
This module consists of four main sections and is shown in Figure A10.17. The first is the 'tc_module',
which produces the main control signals for the chip. The second is the input buffer control, made up
of the 'tc_input', 'out_write_en' and 'io_enable' modules. The third is the output buffer control, which
consists of the same types of module as the input buffer control. The last is the 'tc_bfly' module, which
controls the sequencing of data through the 'alu' module.
The input buffer control and the output buffer control therefore consist of the same three modules.
The output control consists of the top set of these modules. This similarity in functionality is detailed
in Section A10.5.3.
Figure A10.25 - Timing And Control Module Layout
A10.5.2 Timing Module, (tc_control) Module
Inputs: data available, (data_AV)
clock, (cl)
chip enable, (chip_EN)
output FIFO not full, (output_NFULL)
input count reset, (in_cnt_R)
alu count reset, (bf_cnt_R)
output count reset, (out_cnt_R)
Outputs: clock, (cl)
input disable, (in_D)
input pulse, (in_pls)
output disable, (out_D)
output pulse, (out_pls)
alu disable, (bf_D)
This module implements the logic outlined in Section 8.9. Figure A10.26 gives a clearer outline of
how the control is configured. The circuits on the left hand side of the module generate the input
buffer control signals, the circuits on the right generate the output buffer control signals, and the
circuits in the middle generate the 'alu' control signals.
















Signal functions labelled in Figure A10.26:
increment the output buffer control counter to enable the data word to be written from the buffer
disable the output buffer control
disable the alu control
disable the input buffer control
increment the input buffer counter so that the next data word will be read into the next word buffer
indicates that eight data words have been read to the input buffer
indicates twenty alu cycles have been completed and so alu processing is finished
indicates eight data words have been written from the output buffer
Figure A10.26 - Timing Module Layout




Figure A10.27 - Tc_control Circuit
Figure A10.28 - Tc_and3 Circuit
Figure A10.30 - Tc_and21 Circuit
Figure A10.31 - Tc_latch Circuit
Figure A10.32 - Tc_or Circuit
A10.5.3 Input And Output Buffer Control Modules
Inputs: reset
clock
Outputs: count signals, (cnt0 - cnt8)
I/O enable
As mentioned in Section A10.5.1, the input and output control is performed by two sets of identical
modules. This is possible because the two functions are nearly identical.
To input data to the input buffer, the required input buffer word, that is, a row of the buffer, has to
be enabled. Reading one word takes one clock cycle, and after eight cycles the reading of data is
disabled. The data is then pulsed in bit serial form from the input buffer through the 'alu' module. It
takes twelve cycles to write all the data from the input buffer. Conversely, after data is processed in the
'alu' it needs to be read into the output buffer in bit serial form, and this takes twelve cycles to read all
the data. When data is ready for writing from the output buffer, eight cycles are required to write the
data out in parallel form. The input and output buffer controls are therefore similar in that both
require eight enable signals to enable the buffer to read or write the twelve bit words respectively. In
addition, each needs an initiation signal to pulse the data in bit serial form from the input buffer and
to the output buffer respectively.
Therefore, both the input and output control modules require an initiation to commence a count
of eight, the generation of eight separate enables and the generation of a signal to commence the bit
serial pulsing of data. These functions are performed by the 'tc_input', 'out_write_en' and 'io_enable'
modules.
The input control commences counting when the disable signal from the 'tc_control' module is set low.
Counting continues for eight cycles, after which the input control signals to the 'tc_control' module that
counting is finished by setting the 'cnt8' signal high. The 'tc_control' module uses this signal to
disable the input control module and enable the 'alu' module. The pulsing of the data in bit serial form
from the input buffer is achieved by routing the 'alu' count 0 signal from the 'tc_bfly' module to the
'buf_in_ens' module. The 'buf_in_ens' module sets the respective write enable signal high for the
column of RAM cells in the buffer.
The output buffer module is controlled in a similar fashion. Once the 'alu' module has processed the
first bit of data, the 'tc_bfly' module initiates the pulsing of the data into the output buffer by setting its
'cnt8' signal high. This signal is then pulsed through the 'buf_out_ens' module, which enables the
respective column of RAM cells to receive the eight parallel streams of serial data. When all the data has
been processed by the 'alu' module, the 'tc_bfly' module sets the 'bfly_cnt_R' signal high, which is used by
the 'tc_control' module to disable the 'alu' module and to enable the writing of data from the output
buffer. The 'tc_control' module also uses the end of the 'alu' processing to enable the counter in the
output buffer control module. The counter enables the respective row of RAM cells in the output
buffer to be read from the chip, and after eight cycles the output buffer control sets the 'cnt8' signal
high, which the 'tc_control' module uses to disable the output buffer control.
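The sequence described above can be laid out cycle by cycle. The following Python sketch is only an illustration and makes several assumptions that are not stated in the text: the three stages run strictly one after the other for a single block, the handshaking with 'data_AV' and 'output_NFULL' is ignored, and the function name `block_schedule` is hypothetical. It also relies on the inference that the twenty 'alu' cycles correspond to eight cycles before 'cnt8' plus twelve serial bits.

```python
# Hypothetical cycle-level model of one eight-word block (not the thesis circuit).
INPUT_WORDS = 8     # words read into the input buffer, one per clock cycle
WORD_BITS = 12      # bits pulsed serially for each word
ALU_CYCLES = 20     # 'cnt20' marks the end of 'alu' processing
ALU_OUT_START = 8   # 'cnt8' starts the serial capture in the output buffer


def block_schedule():
    """Return a list of (cycle, stage, event) tuples for one block."""
    events = []
    cycle = 0
    # Stage 1: the input control counts through eight row enables.
    for row in range(INPUT_WORDS):
        events.append((cycle, "input", f"read word into row en{row}"))
        cycle += 1
    # Stage 2: the 'alu' counter runs from 0 to 19.
    for cnt in range(ALU_CYCLES):
        actions = []
        if cnt < WORD_BITS:
            actions.append(f"input buffer pulses out bit column {cnt}")
        if cnt >= ALU_OUT_START:
            actions.append(
                f"output buffer captures bit column {cnt - ALU_OUT_START}"
            )
        events.append((cycle, "alu", "; ".join(actions)))
        cycle += 1
    # Stage 3: the output control counts through eight row enables.
    for row in range(INPUT_WORDS):
        events.append((cycle, "output", f"write word from row en{row}"))
        cycle += 1
    return events


if __name__ == "__main__":
    for cycle, stage, event in block_schedule():
        print(f"{cycle:2d} {stage:6s} {event}")
```

Under these assumptions a block occupies 8 + 20 + 8 clock cycles; a real implementation that overlaps the loading of the next block with processing would behave differently.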
A10.5.3.1 'tc_input' Module
Inputs: reset
clock
Outputs: counts 0-8, (cnt0 - cnt8)
While the 'reset' signal from the 'tc_control' module is high the module is disabled. In this state all the
'count_cell' 'q' outputs are set low and the 'dq' outputs are set high. The 'count_cell' module is a flip flop
with a built in 'reset' and forms one stage of the counter. Once the 'reset' signal is set low the counter
uses the clock to count up to eight. After the 'reset' is set low the first clock pulse causes all the
'count_cells' to flip state, so that the 'q' states are high and the 'dq' states are low. Therefore, a count
of zero occurs when all the 'dq' states are low. Each 'count_cell' sets its 'q' and 'dq' outputs and these
are routed to the 'input_cnt_dcode' module. This input decoding module is simply a set of nine 'or' cells
that each 'or' four of the outputs from the count cells and generate the respective count lines. For
example, 'cnt0' is set high when all of the 'dq' signals are low. Similarly, 'cnt1' is set high if the first
count cell 'q' is low and the other count cell 'dq' signals are low.
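The decoding rule in the preceding paragraph can be checked with a small behavioural model. The Python sketch below is a hedged reading of the text, not the thesis circuit: it assumes four count cells (inferred from the decoder combining four signals per count line), models the counter as a plain binary count rather than at gate level, and treats each 'or' cell as raising its count line when all four selected signals are low. The names `cell_outputs` and `decode` are hypothetical.

```python
N_CELLS = 4  # assumed: each decoder cell combines four count-cell outputs


def cell_outputs(count, reset):
    """Return the (q, dq) output lists of the count cells.

    In reset every q is low and every dq is high. While counting, dq[i]
    carries bit i of the count and q[i] is its inverse, so a count of zero
    gives all q high and all dq low, as the text describes.
    """
    if reset:
        return [0] * N_CELLS, [1] * N_CELLS
    dq = [(count >> i) & 1 for i in range(N_CELLS)]
    q = [1 - bit for bit in dq]
    return q, dq


def decode(q, dq):
    """Return count lines cnt0..cnt8.

    Line k selects q[i] where bit i of k is one and dq[i] otherwise, and
    goes high only when all four selected signals are low.
    """
    lines = []
    for k in range(9):
        selected = [q[i] if (k >> i) & 1 else dq[i] for i in range(N_CELLS)]
        lines.append(0 if any(selected) else 1)
    return lines


if __name__ == "__main__":
    print("reset :", decode(*cell_outputs(0, reset=True)))
    for k in range(9):
        print(f"count {k}:", decode(*cell_outputs(k, reset=False)))
```

With this reading no count line is high while the module is held in reset, and exactly one line is high for each count from zero to eight.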
Figure A10.33 - 'tc_input' Control Module
A10.5.3.2 'out_write_en' Circuit





This module is a series of 'and' circuits that receives the eight count signals from the 'tc_input' module,
'cnt0 - cnt7', and 'and's each signal with the inverse phase 1 signal, 'dp1', to generate the respective
RAM enable signal.
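As a minimal sketch of this gating, assuming active high count lines and a hypothetical function name `out_write_en`, the eight enables could be modelled in Python as follows.

```python
def out_write_en(cnt, p1):
    """AND each count line cnt0..cnt7 with the inverse of phase 1 ('dp1')."""
    dp1 = 1 - p1
    return [c & dp1 for c in cnt[:8]]


if __name__ == "__main__":
    counts = [0, 0, 1, 0, 0, 0, 0, 0, 0]   # cnt2 high; cnt8 is not used here
    print(out_write_en(counts, p1=1))      # enables held low during phase 1
    print(out_write_en(counts, p1=0))      # en2 raised once phase 1 is low
```

Gating with 'dp1' means an enable can only be raised while phase 1 is low, which confines each RAM access to one half of the clock cycle.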
A10.5.3.3 'io_enable' Circuit
Input: counts, (cnt0 - cnt8)
Outputs: enables, (en0 - en8)
I/O enable
This circuit takes the eight enable signals set by the 'out_write_en' circuit and sets an 'enable' signal
low if all eight signals are high. Consequently, if any of these signals becomes low, indicating that
the module is counting, the 'enable' is set high. This 'enable' signal is used to indicate that the
respective buffer is available for use. That is, the input buffer enable being high indicates that the
control module is reading data into the buffer. Similarly, the output buffer enable being high indicates
that the output buffer is being unloaded.
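Taken literally, the rule in this paragraph is a NAND of the eight signals. The short Python sketch below assumes exactly that reading, including the implied convention that a low signal marks an active row; the function name `io_enable` is hypothetical.

```python
def io_enable(signals):
    """Return 0 when all eight signals are high, otherwise 1."""
    assert len(signals) == 8
    return 0 if all(signals) else 1


if __name__ == "__main__":
    idle = [1] * 8
    counting = [1, 1, 0, 1, 1, 1, 1, 1]   # one row signal pulled low
    print(io_enable(idle), io_enable(counting))
```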
The following CIF plots are for the circuits:
Count_cell
Input_cnt_dcode
Figure A10.35 - Input_cnt_dcode Circuit
Figure A10.38 - Io_enable Circuit
A10.5.4 Arithmetic Logic Unit Control, (tc_bfly), Module
Inputs: clock, (cl)
reset, (bfly_reset)
Outputs: count 0, (cnt0)
count 8, (cnt8)
count 20, (cnt20)
The module of Figure A10.31 firstly creates the clock phases 'p1' and 'p2' for the 'alu' module. In
addition, a set of five 'count_cell' circuits forms a counter that can count up to twenty.
This counting is enabled when the 'alu' processing is enabled, and once a count of twenty is reached
the 'cnt20' signal is sent to the 'tc_control' module to disable the 'alu' control module. The 'cnt8'
signal is also set to indicate the commencement of pulsing of the data into the output buffer.
The phase clock simply comprises a latch on its input side to create the inverse phase, and then the
two phases are fed into separate sets of double inverters. Each inverter consists of four sets of
transistors. This arrangement was required to eradicate any clock skew problems between the two
phases.
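The behaviour of this module can be summarised with a short Python model. This is a sketch under stated assumptions rather than a description of the layout: phase 1 is taken to follow the clock and phase 2 to be its inverse, the counter is modelled as an integer that stops at twenty, and the names `phases` and `tc_bfly` are used here only for illustration.

```python
N_CELLS = 5  # five count cells give 2**5 = 32 states, enough to reach twenty


def phases(cl):
    """Return (p1, p2) for a clock level, assuming p1 follows cl and p2 is its inverse."""
    return cl, 1 - cl


def tc_bfly(n_clocks):
    """Yield (count, cnt0, cnt8, cnt20) for each clock after reset is released."""
    count = 0
    for _ in range(n_clocks):
        yield count, int(count == 0), int(count == 8), int(count == 20)
        if count < 20:
            count += 1


if __name__ == "__main__":
    for count, cnt0, cnt8, cnt20 in tc_bfly(21):
        print(f"count={count:2d} cnt0={cnt0} cnt8={cnt8} cnt20={cnt20}")
```

In this model 'cnt0' marks the start of serial output from the input buffer, 'cnt8' the start of capture in the output buffer and 'cnt20' the end of 'alu' processing.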
The processing of the counter outputs is similar to that of the input and output buffer modules. The 




The following CIF plots are for the circuits Tc_bfly, Phase_clock and Bfly_cnt_dcode.
Figure A10.40 - Tc_bfly Circuit
Figure A10.41 - Phase_clock Circuit
Figure A10.42 - Bfly_cnt_dcode Circuit
A10.5.5 Input Buffer, (buf_in) Module
Inputs: parallel data in, (b0 - b7)
inverse data, (db0 - db7)
row write enable, (en0 - en7)
column write enable, (en)
Outputs: serial data out, (x0 - x7)
This module consists of a set of 'buf_cell_in' circuits arranged in eight rows of twelve circuits. The data
is read in parallel form into each row and is written to the 'alu' module on eight bit serial lines.
The 'buf_cell_in' circuit is a typical two input RAM cell. As data has to be accessed via two separate
paths, it is written to the cell via one path and read via another. The input data, 'b1', is also transmitted
to the cell in its inverse form, 'db1'. To write data to the cell the control line 'w1' is set high and
whichever of 'b1' or 'db1' is low is passed into the cell. If 'b1' is low the latch is set low, and if 'db1' is
low the latch is set high. To write data from a cell the enable line 'dw2' is set low and the latch
contents are fed out of the cell onto the 'b2' line if the contents are high. If the contents are low then
no value is fed out and the value of the cell is read as low.
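The parallel to serial conversion performed by this buffer amounts to writing rows and reading columns. The Python sketch below illustrates the idea only; it assumes the least significant bit is pulsed out first, which the text does not state, and the names `load_rows` and `pulse_columns` are hypothetical.

```python
ROWS, BITS = 8, 12


def load_rows(words):
    """Write eight 12-bit words in parallel, one word per buffer row."""
    assert len(words) == ROWS
    return [[(w >> b) & 1 for b in range(BITS)] for w in words]


def pulse_columns(buf):
    """Yield one column per cycle: eight bits, one for each serial line."""
    for b in range(BITS):
        yield [buf[r][b] for r in range(ROWS)]


if __name__ == "__main__":
    words = [0x001, 0x800, 0xFFF, 0x000, 5, 6, 7, 8]
    for cycle, column in enumerate(pulse_columns(load_rows(words))):
        print(cycle, column)
```

Loading takes one row enable per word and unloading takes one column enable per bit, which matches the eight and twelve cycle counts given in Section A10.5.3.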
A10.5.6 Output Buffer, (buf_out) Module
Inputs: serial data in, (b0 - b11)
inverse data, (db0 - db11)
column write enable, (en)
row write enable, (en0 - en7)
Outputs: parallel data out, (X0 - X11)
This module is similar to the 'buf_in' module except that the functionality is rotated by ninety degrees.
The module consists of eight rows of twelve 'buf_cell_out' circuits. The data is read into the cells
horizontally and written out of the cells vertically. The input data, 'b1', is fed to the cell along with its
inverse, 'db1'. Data is written to the cell by enabling 'w1', and the circuit design is the same as for
'buf_cell_in' except that the transistors are arranged differently. To read data from the cell the enable
line 'dw2' is set low and the contents of the latch are written out to 'b2' if the contents are high. If the
contents are low no contents are written out and 'b2' is set low.
The following CIF plots are for the circuits Buf_cell_in and Buf_cell_out.
Figure A10.43 - Buf_cell_in Circuit
Figure A10.44 - Buf_cell_out Circuit
A10.5.7 Other Control Modules
Several modules, apart from the main ones listed above, are required to control the chip.
These modules each consist of an array of a single component. Each module is described below and
their component circuits are listed together in one group.
A10.5.7.1 'buf_in_inv' Module
This module creates the inverse bit values to be used for writing data to a RAM cell. The module
consists of twelve 'buf_in_inv_c' circuits, one for each column of the input buffer module.
A10.5.7.2 'buf_in_ens' Module
This module channels the column enable signal for pulsing a column of data from the input buffer to the
'alu' module. There are twelve 'delay_buf' circuits in this module, one for each column of the buffer.
The enable signal generated by the 'tc_bfly' module is fed into the right hand end of the module via the
'ai' port. On phase 1, 'p1', 'ai' is fed into an inverter and the output of the inverter is fed into the column
of buffer cells. On phase 2, 'p2', the inverter output is fed into a second inverter for passing on to the
next cell. Therefore, the inverted enable signal required for the 'buf_cell_in' to read the RAM cell
contents is provided by inverting the enable signal on phase 1. Included in the circuit is a reset that
sets the enable line high.
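The chain of 'delay_buf' circuits behaves like a shift register that moves an enable from one column to the next each clock cycle. The Python sketch below is a behavioural approximation under several assumptions: 'ai' is a single cycle pulse, the column enables are active low, and each cell passes its value on exactly one clock later. The function name `enable_chain` is hypothetical.

```python
N_COLS = 12


def enable_chain(ai_stream):
    """Return the twelve column enable lines (active low) for each clock cycle.

    stage[i] models the value held at the output of the second inverter of
    cell i, which becomes the input of cell i + 1 on the next cycle.
    """
    stage = [0] * N_COLS          # reset state: no enable anywhere in the chain
    history = []
    for ai in ai_stream:
        # Phase 1: each cell inverts its input to drive its buffer column.
        inputs = [ai] + stage[:-1]
        history.append([1 - x for x in inputs])
        # Phase 2: the second inverter restores the value for the next cell.
        stage = inputs
    return history


if __name__ == "__main__":
    for cycle, lines in enumerate(enable_chain([1] + [0] * 11)):
        print(cycle, lines)
```

With a single pulse on 'ai' the low enable visits each of the twelve columns once, so one column of bits is released per cycle.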
A10.5.7.3 'buf_out_inv' Module
This module provides a similar function to the 'buf_in_inv' module for the input buffer. The module
consists of eight 'buf_out_inv_c' circuits, one for each row, that invert the incoming signal for
storage in the RAM cells.
A10.5.7.4 'buf_out_inv2' Module
This is a set of eight 'buf_out_inv_c2' circuits that invert the enable signals for each row of the output
buffer when data is being written from the buffer.
A10.5.7.4 'buf_out_ens' Module
This module is similar to the 'buf_in_ens' module for the input buffer. It consists of twelve
'delay_buf_out' circuits. On phase 1 the enable signal 'ai' is passed to an inverter and the inverter
output is fed to the column of buffer RAM cells. On phase 2 the inverter output is inverted and passed
on to the next circuit. Included in the circuit is a reset that sets the enable line high.
A10.5.7.5 'buf_out_ens2' Module
This module consists of twelve 'buf_out_and1' circuits. Each circuit simply takes the enable signal
produced by the 'delay_buf_out' circuit and 'and's it with phase 1. This ensures that the output buffer
column enables are synchronized with phase 1, as the data is produced from the 'alu' module on phase
2.
A10.5.7.6 'driver1' Circuit
Before the first data arrives in the 'alu' module the 'alu' reset is confirmed by testing whether any of the
'alu' reset, 'in_pls' or input 'cnt0' signals is high. If any of these signals is high then the 'alu' reset is set
high to ensure all the data lines in the 'alu' are reset and no data corruption occurs. This is performed
by a simple 'or' circuit using these three signals. This circuit was found to be necessary when the data
processing is interrupted.
A10.5.7.7 'mult_driver' Circuit
On either side of the 'tc' module there is a 'mult_driver' circuit, the same circuit as those on the input of
the multiplier modules. These two circuits create the inverse phase 1, phase 2 and reset signals
required for the 'buf_in_ens' and 'buf_out_ens' modules. The 'mult_driver' circuit was shown as part
of the multiplier circuits.
A10.5.7.8 'out_filter' Module
This module contains a set of twelve 'buf_filter' circuits that simply double invert the line signal to
refresh its state. These circuits were necessary to ensure the voltage levels were correct at the output
of the chip.
A10.6 Chip Pads
As part of the NSW VLSI package there were several contact pads that could be utilized to route the
data and control lines onto the chip. These pads were the only features of the double metal CMOS
library designs that could be used in the design. There are four pads altogether: padVdd, padGND,
padout and padin.









Inmos (1989) IMS A121 2-D DCT Image Processing, Advanced Information
Figure A10.48 - Buf_out_inv_c2 Circuit
Figure A10.49 - Delay_buf_out Circuit
Figure A10.50 - Buf_out_and1 Circuit
Figure A10.52 - Buf_filter Circuit
Figure A10.53 - padVdd
Figure A10.54 - padGND
Figure A10.55 - padout
Figure A10.56 - padin
BIBLIOGRAPHY
[1] INMOS IMS A121 2D DCT Image Processor Advanced Information,
1986.
[2] ICASSP 1984 IEEE International Conference on Acoustic Speech and Signal 
Processing.
"Real Time Discrete Cosine Transform, an Original Architecture".
E. Arould, J. Dugre,
1984, IEEE.
[3] "Principles of CMOS VLSI Design, A Systems Perspective"
Neil H. E. Weste, Kamran Eshraghian,
Addison-Wesley Publishing Company,
1985.
[4] CCITT Recommendation H.261,
"Video Codec for Audiovisual Services of x64 kbit/sec",
1990.
[5] H. Wu Doctorate






B. D. O. Anderson, J. B. Moore,
Prentice Hall Inc.,
Englewood Cliffs, New Jersey,
1979.
[7] Joint Microelectronic Research Centre, 
University of New South Wales,
VLSI Design Tools,
1990.
[8] Joint Photographic Experts Group Standards,
J.P.E.G.,
1990.
[9] "Computer Aided Circuit Analysis Using Spice",
Walter Banzhaf,
Prentice Hall,
Englewood Cliffs, New Jersey,
1989.
