A high speed 2-D DCT/IDCT processor by Slawecki, Darren
Lehigh University
Lehigh Preserve
Theses and Dissertations
1991
A high speed 2-D DCT/IDCT processor
Darren Slawecki
Lehigh University
Part of the Electrical and Computer Engineering Commons
Recommended Citation
Slawecki, Darren, "A high speed 2-D DCT/IDCT processor" (1991). Theses and Dissertations. 5520.
https://preserve.lehigh.edu/etd/5520
A High Speed 2-D DCT/IDCT Processor 
by 
Darren Slawecki 
A Thesis 
Presented to the 
Graduate Committee of Lehigh University 
in Candidacy for the Degree of 
Master of Science 
in
Electrical Engineering 
Lehigh University 
August 1991 
This thesis is accepted and approved in partial fulfillment of the 
requirements for the degree of Master of Science in Electrical Engineering. 
Date 
CSEE Department Chairperson 
Acknowledgements 
I would like to thank Dr. Li for his unwavering support of this project, 
my capabilities, and for listening, encouraging, and responding to my 
questions, comments and suggestions. I would also like to thank him for 
encouraging me to attend and allowing me to present this project at the 
International Symposium on Circuits and Systems in Singapore, June 1991. 
I thank my parents and my sisters, Melissa and Tania, for their financial, 
physical and moral support. I thank Jacqueline Benner for typing, 
proofreading and for bearing with me through my Masters program. I 
would like to thank Steve, Ed, Bill, Richard, Keith and all the people that 
keep the labs running. I would also like to thank the original class 
members William Migatz, Arindam Saha, and Luis Shirley for their
contributions to the project. 
Table of Contents 
Abstract
Chapter 1: Introduction 
Chapter 2: Review of Computational Algorithms for the DCT 
Chapter 3: Choice of Implementation 
Chapter 4: Outline of the fast algorithm 
Chapter 5: Architecture of the Processor 
Chapter 6: Circuit Design Issues 
Chapter 7: Results & Analysis 
Chapter 8: Standard design procedures 
Chapter 9: Future work 
References 
Appendix A: The areas of common cells tabulated. 
Appendix B: The complete pin description of the IC. 
Appendix C: Sparkstation Tools. 
Biography 
List of Tables 
Table 1: Area of selected cells.
Table 2: Summary of the IC's pins.
Table 3: A comparison of commercially available ICs for N=8.
Table 4: A synopsis of common procedures.
List of Figures 
Figure 1: An image divided into 8x8 blocks.
Figure 2: Compression performance vs. computational complexity (O(N^2)).
Figure 3: A block diagram of the 8x4 unit.
Figure 4: A block diagram of the 4x2 unit.
Figure 5: (a) A block diagram of the 2x1 unit. (b) A block diagram of the scaler unit.
Figure 6: A block diagram of the IDCT operation.
Figure 7: Decomposition of the 2-D DCT into two 1-D DCTs.
Figure 8: An illustration of a 3x3 transpose memory.
Figure 9: A block diagram of the entire DCT processor.
Figure 10: The block diagram of the 1-D DCT.
Figure 11: A detailed view of the Parallel to Serial convertor.
Figure 12: A detailed view of the Pre-Add unit.
Figure 13: A building block for the 8x4 and 4x2 units.
Figure 14: A block diagram for the 2x1 and Scaler units.
Figure 15: A block diagram of the Post-Processor.
Figure 16: A detail view of the Serial to Parallel convertor.
Figure 17: The transpose memory.
Figure 18: A two input register.
Figure 19: A two input XOR gate.
Figure 20: A standard serial adder.
Figure 21: The divide by two and sign extension serial adder.
Figure 22: A shift register implementation of a MOD 8 counter.
Figure 23: A PC based high speed IC tester.
Abstract 
Recently a new fast algorithm to compute the Discrete
Cosine Transform (DCT) and its inverse (IDCT) was
derived. Based on the new algorithm, a high-speed 8x8
2-D DCT/IDCT processor chip was designed. In this paper,
the algorithm is outlined, the chip architecture
presented, the design methodology reviewed, and special
circuit designs discussed. The chip measures 7.9 x 9.2
mm^2. It is designed using the MOSIS 2µ SCMOS
technology. It takes 16-bit inputs, uses 16-bit internal
memory for coefficients and data, and generates 16-bit
outputs. A single input line determines if the chip
computes the DCT or the IDCT. The chip is highly
pipelined with a latency of 127 cycles and a maximum
delay time of 17.3 ns. A comparison of commercial chips
reveals that our DCT chip has the highest throughput
even though the commercial chips are fabricated in more
advanced processes.
Chapter 1: Introduction 
Since the introduction of the Discrete Cosine Transform (DCT) in
the 1970's, considerable research has been performed on algorithms,
methods, and architectures for computing the DCT. The DCT closely
approximates the optimal Karhunen-Loeve Transform (KLT). The
KLT depends on the statistics of (image) data, and does not have a
good mathematical structure for its computation. Unlike the KLT, the
DCT has a nice mathematical structure, and therefore, fast algorithms
can be derived for its computation. Applications of the DCT are in
the area of speech and image processing. We use the DCT as a key
component for high data rate image compression. The DCT
decorrelates data and moves the energy into the lower transform
coefficients. Typically a two dimensional DCT operates on square
blocks of data (NxN), as shown in Fig. 1, where each data point, or
pixel, is represented in eight bits. The output is an NxN block of data,
where each point is represented by 12 or more bits.
Figure 1: An image divided into 8x8 blocks (D denotes an 8x8 pixel block).
In a two dimensional DCT when the input data is highly
correlated (ρ = 0.9), the energy of the transform coefficients is
concentrated in the upper left corner, with the largest coefficient
being the upper left one, denoted as the D.C. component. Since we
receive the same number of coefficients as original data points, we
see that the DCT itself does not compress the data. The compression
is achieved through suitable coding algorithms that operate on the
DCT coefficients and that will:
1. Throw away higher order coefficients (which are near zero)
according to a specified criterion, or by thresholding.
2. Code the higher order coefficients with fewer bits, and the
lower order with more, so that the total number of bits is
reduced.
3. Use a combination of both methods.
This thesis describes the implementation of an architecture based
on a fast algorithm to compute the DCT which was derived by W. Li
[1]. The Very Large Scale Integrated (VLSI) circuit operates on 8x8
blocks of data. It takes 16-bit inputs, uses 16 bits for coefficients
and internal memory, and produces 16-bit results.
This thesis is organized into nine chapters. Chapter 2 presents
the history of the DCT and reviews computational algorithms used for
implementing the DCT. In Chapter 3, we justify our choice of
implementation. Chapter 4 presents the fast algorithm used to
compute the DCT, and gives the details of the calculations for the
case N=8. The IDCT and the DCT/IDCT in two dimensions are discussed,
with an overview of the architecture. The architecture is presented
in Chapter 5, from the viewpoint of data traveling through the
Integrated Circuit (IC). Chapter 6 reveals the special purpose circuits
designed for our implementation, and their operations are
highlighted by examples. Chapter 7 contains our research results, as
well as end user information (i.e., pinouts, clock specifications, etc.).
Chapter 7 also includes a comparison of commercially available ICs.
Chapter 8 contains our design approach and the tools used (when to use
them, and where they are located), complete with examples of
common procedures. It also gives hints about troubleshooting the
simulators. Chapter 9 is the final chapter, which discusses
enhancements to the project and is entitled "Future work".
Chapter 2: Review of Computational Algorithms for the DCT 
Historically, N. Ahmed, T. Natarajan, and K. R. Rao defined the DCT,
highlighted its advantages, and indicated that a 2N point
Fast Fourier Transform (FFT) can be used to compute the DCT [2].
The DCT of an N point sequence {x(0), x(1), ..., x(N-1)} is denoted as
{X(0), X(1), ..., X(N-1)}. The DCT and its inverse (IDCT) are defined as
follows:
DCT
$$X(0) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(n)$$
$$X(k) = \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} x(n) \cos\left(\frac{\pi(2n+1)k}{2N}\right), \quad k = 1, \ldots, N-1 \tag{1}$$
IDCT
$$x(n) = \frac{1}{\sqrt{N}} X(0) + \sqrt{\frac{2}{N}} \sum_{k=1}^{N-1} X(k) \cos\left(\frac{\pi(2n+1)k}{2N}\right), \quad n = 0, 1, \ldots, N-1 \tag{2}$$
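As a reference point for the fast algorithm, definitions (1) and (2) can be evaluated directly. The following is a minimal sketch in Python, not part of the chip design; the function names `dct` and `idct` are illustrative, and the scale factors 1/sqrt(N) and sqrt(2/N) are assumed from the equations above.

```python
import math

def dct(x):
    """Forward DCT evaluated directly from definition (1)."""
    N = len(x)
    X = [sum(x) / math.sqrt(N)]  # X(0)
    for k in range(1, N):
        acc = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                  for n in range(N))
        X.append(math.sqrt(2.0 / N) * acc)
    return X

def idct(X):
    """Inverse DCT evaluated directly from definition (2)."""
    N = len(X)
    x = []
    for n in range(N):
        acc = sum(X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                  for k in range(1, N))
        x.append(X[0] / math.sqrt(N) + math.sqrt(2.0 / N) * acc)
    return x
```

This direct form costs on the order of N^2 multiplications per transform, which is the cost the fast algorithms reviewed in this chapter aim to reduce.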
Although the equation defined in (1) is slightly different from their
original definition, it is equivalent. It was shown that, by a
re-arrangement of the input data (an input mapping), an N point
DCT can be performed by an N point Discrete Fourier Transform
(DFT), which can be performed by an N point FFT when N is a power
of 2 [3]. We see that it is possible to perform an N point DCT as
efficiently as an N point FFT, instead of a 2N point FFT as stated
previously. This algorithm was generalized to accept even or odd N
and extended to two dimensions [4]. Further research yielded an
algorithm which utilizes only real operations to compute an N point
DCT three times faster than an N point FFT [5]. This was
accomplished via a butterfly matrix structure alternating between
cosine and sine operators, and re-ordering matrix elements.
Researchers at Bell Communications Research have designed a DCT
chip using distributed arithmetic [6, 7]. Algorithm research based on
distributed arithmetic has become active too. P. Duhamel and H.
H'Mida derived a two-step mapping algorithm that converts the DCT
to a set of circular convolutions [8]. A. Leger et al. presented a VLSI
implementation of the algorithm [9]. Recently a single-step mapping
algorithm to convert the DCT to circular convolutions was derived by
W. Li [1]. We used this algorithm as a basis for our architecture. The
algorithm is described in Chapter 4, and the computation is explicitly
defined for the case N=8.
Chapter 3: Choice of Implementation 
All of the algorithms mentioned in Chapter 2 can be realized in
several ways. The easiest method is to write a software program in a
high level language to perform an N point DCT from the definition,
which can be extended into two dimensions. If higher performance
is desired/required, a program can re-order data according to an
algorithm and then call an FFT subroutine. Also, if it is available, a
co-processing chip, or a parallel-processing machine, can speed up the
computation. If this is inadequate, we can move up to a Digital Signal
Processing (DSP) IC, such as the AT&T DSP-32, the AT&T DSP-16, or
the Texas Instruments TMS320 family. These ICs are optimized for
data filtering or similar tasks, and often can be placed on a single
board inside a host computer.
As we move from software to hardware implementations, we
increase performance, but decrease functionality. For instance, a
central processing unit (CPU) can control a disk drive as well as
compute a DCT, while a DSP IC cannot. For the next level of
performance, we need to design an Application Specific Integrated
Circuit (ASIC), which generally has the operation fixed. An ASIC is
usually not re-configurable, and may have fixed filter coefficients, a
fixed transform size, or other parameters fixed. ASICs have realized
the DCT using a butterfly architecture and a distributed arithmetic
architecture [6,7,10]. Distributed arithmetic has been shown to yield
higher performance [11] under certain conditions, which the DCT
meets. The desired operation is broken into several smaller
operations which are performed independently and then may be
combined to produce the result, like a parallel-processing
architecture. The operation is performed by multiple simple
structures instead of the single fast multiplier which butterfly
architectures commonly use. Before justifying the type of
implementation chosen for this project, the project goals must be
described:
1. > 54 MHz data rate.
2. 8x8 transform size.
3. Single chip implementation to compute the DCT and the IDCT.
Some of the constraints are:
1. 2µ double metal CMOS process technology.
2. 7.9 mm x 9.2 mm MOSIS package.
3. University design tools.
The 54 MHz data rate is derived from four times the current
video rate for the National Television Standards Committee (NTSC)
standard, to approximate the higher resolution defined by High
Definition Television (HDTV) proposals, which roughly encompass
four times the current viewing area. The 8x8 transform size was
chosen for this project considering:
1. N must be a power of 2, a requirement of the algorithm.
2. As we increase N from N=4 to N=8, the compression
performance increases at the expense of increasing the
computational complexity. As we increase N to N=16 from
N=8, we again increase compression performance, but the
performance increase is less than in the previous case. Due to
diminishing returns, we chose the cutoff point to be N=8.
The diminishing return is illustrated in Fig. 2.
[Plot: compression performance against transform size N, with N = 8 and N = 16 marked and the annotation y1 < y2.]
Figure 2: Compression performance vs. computational complexity (O(N^2)).
3. Area estimates indicate that an 8x8 two dimensional 
DCT/IDCT processor can be realized inside the available 
area. 
4. The transform size of N=8 is specified in several 
international standards, such as JPEG [12], MPEG [13], etc. 
Since we chose N=8, we can compare the performance of our 
design with commercially available ICs which meet the 
standard. 
The high data rate and the large estimate of the number of 
transistors were the key factors in determining the selection of a 
VLSI implementation. 
In practical applications, it is very desirable to have one 
processor compute either the DCT or the IDCT depending on a single 
line of control, without doubling the hardware. The algorithm is 
suitable for this, and the architecture was designed accordingly. In 
fact, the architecture uses a majority of the hardware for both the 
computation of the DCT and computation of the IDCT.
Chapter 4: Outline of the fast algorithm 
The Forward DCT 
The DCT defined in (1) can be decomposed into the computation
of the even indexed components and the odd indexed components.
Since the computation of the even indexed DCT components is
equivalent to an N/2 point DCT of the sequence {x(n)+x(N-1-n),
n=0,1,2,...,N/2-1}, an algorithm for computing the odd indexed DCT
components is all we need, since we can recursively apply the
algorithm to the even indexed components. The new algorithm
derived in [1] computes the odd indexed DCT components in a four
step process as follows:
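1. Input Mapping.
$$\{x(0), x(1), \ldots, x(N-1)\} \Rightarrow \{x(n_0), x(n_1), \ldots, x(n_{N-1})\} \tag{3}$$
where
$$n_i = \frac{3^i \bmod (4N) - 1}{2} \quad \text{if} \quad \frac{3^i \bmod (4N) - 1}{2} < N \tag{4}$$
$$n_i = 2N - 1 - \frac{3^i \bmod (4N) - 1}{2} \quad \text{if} \quad \frac{3^i \bmod (4N) - 1}{2} \geq N \tag{5}$$
2. Subtraction.
$$y(i) = x(n_i) - x(n_{\frac{N}{2}+i}), \quad i = 0, 1, \ldots, \frac{N}{2}-1 \tag{6}$$
3. Skew-Circular Convolution.
$$\{y(0), y(1), \ldots, y(\tfrac{N}{2}-1)\} \Rightarrow \{X'(k_0), X'(k_1), \ldots, X'(k_{\frac{N}{2}-1})\} \tag{7}$$
where
$$X'(k_j) = \sum_{i=0}^{\frac{N}{2}-1} y(i) \left[ \sqrt{\frac{2}{N}} \cos\left(\frac{2\pi \, 3^{i+j}}{4N}\right) \right] \tag{8}$$
4. Output mapping.
$$\{X'(k_0), X'(k_1), \ldots, X'(k_{\frac{N}{2}-1})\} \Rightarrow \{X(1), X(3), \ldots, X(N-1)\} \tag{9}$$
where
$$k_j = \begin{cases} \dfrac{3^j \bmod (4N) - 1}{2} & \text{if } \dfrac{3^j \bmod (4N) - 1}{2} < N \\[2ex] 2N - 1 - \dfrac{3^j \bmod (4N) - 1}{2} & \text{if } \dfrac{3^j \bmod (4N) - 1}{2} \geq N \end{cases} \tag{10}$$
$$\begin{aligned} X(2k_j + 1) &= X'(k_j) & \text{if } k_j < \tfrac{N}{2} \\ X(2(N-1-k_j) + 1) &= -X'(k_j) & \text{if } k_j \geq \tfrac{N}{2} \end{aligned} \tag{11}$$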
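The four steps can be sketched in software to check them against the definition of the DCT. The following Python sketch is an illustration under stated assumptions, not the hardware implementation: the function names are hypothetical, the boundary cases in (5), (10) and (11) are taken as "greater than or equal", and the power 3^(i+j) is reduced modulo 4N before the cosine is evaluated, which leaves the value unchanged because the cosine in (8) is periodic in 4N.

```python
import math

def index_map(i, N):
    """Index n_i (or k_j) from equations (4), (5) and (10)."""
    v = (pow(3, i, 4 * N) - 1) // 2
    return v if v < N else 2 * N - 1 - v

def odd_dct_coefficients(x):
    """Odd indexed DCT coefficients of x via steps 1 to 4; N must be a power of 2."""
    N = len(x)
    half = N // 2
    n = [index_map(i, N) for i in range(N)]                 # step 1: input mapping
    y = [x[n[i]] - x[n[half + i]] for i in range(half)]     # step 2: subtraction
    scale = math.sqrt(2.0 / N)
    Xp = [sum(y[i] * scale * math.cos(2 * math.pi * pow(3, i + j, 4 * N) / (4 * N))
              for i in range(half))
          for j in range(half)]                             # step 3: SCC
    X = {}
    for j in range(half):                                   # step 4: output mapping
        k = index_map(j, N)
        if k < half:
            X[2 * k + 1] = Xp[j]
        else:
            X[2 * (N - 1 - k) + 1] = -Xp[j]
    return X

def direct_coefficient(x, k):
    """X(k), k > 0, evaluated directly from definition (1), for comparison."""
    N = len(x)
    return math.sqrt(2.0 / N) * sum(
        x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
```

For N=8 the input mapping produced by `index_map` is 0, 1, 4, 2, 7, 6, 3, 5, which agrees with the mapped sequence used for the N=8 case later in this chapter.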
The even indexed coefficients, as stated previously, define a new
sequence of length N/2, to which we now apply the algorithm.
Eventually, after dividing the data sequence into even and odd parts
log2 N times, the final sequence is of length one, which is
{x(0)+x(1)+x(2)+...+x(N-1)}. We denote it as the scaler case since a one
point Skew-Circular Convolution (SCC) is only a scalar multiplication.
We only need to multiply this sequence by 1/sqrt(N) to obtain X(0).
The computation of each coefficient for the case N=8 is now
defined explicitly. The eight point sequence in (12) is mapped to
another eight point sequence in accordance with the input mapping
equation (3). After step two, the subtraction, the number of points in
the sequence has been reduced to four. A four point SCC is
performed with the constant matrix defined in (18). The
computation of the odd coefficients in (17) is complete after output
mapping.
$$\{x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7)\} \tag{12}$$
$$\Downarrow$$
$$\{x(0), x(1), x(4), x(2), x(7), x(6), x(3), x(5)\} \tag{13}$$
$$\Downarrow$$
$$\{x(0)-x(7),\; x(1)-x(6),\; x(4)-x(3),\; x(2)-x(5)\} \tag{14}$$
$$= \{y(0), y(1), y(2), y(3)\} \tag{15}$$
$$\Downarrow$$
$$\begin{bmatrix} X'(k_0) \\ X'(k_1) \\ X'(k_2) \\ X'(k_3) \end{bmatrix} = \begin{bmatrix} C(0) & C(1) & C(2) & C(3) \\ C(1) & C(2) & C(3) & -C(0) \\ C(2) & C(3) & -C(0) & -C(1) \\ C(3) & -C(0) & -C(1) & -C(2) \end{bmatrix} \begin{bmatrix} y(0) \\ y(1) \\ y(2) \\ y(3) \end{bmatrix} \tag{16}$$
$$\Downarrow$$
$$\{X'(k_0), X'(k_1), X'(k_3), -X'(k_2)\} = \{X(1), X(3), X(5), X(7)\} \tag{17}$$
where
$$C(m) = \sqrt{\frac{2}{N}} \cos\left(\frac{\pi 3^m}{2N}\right) = \frac{1}{2} \cos\left(\frac{\pi 3^m}{16}\right) \tag{18}$$
A block diagram of the computation of the odd coefficients is shown 
in Fig. 3. Since we start the algorithm with eight numbers and 
produce four, we shall call the hardware to perform this calculation 
the 8x4 unit. 
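The N=8 computation above can be checked numerically. The sketch below hard-codes the input mapping, the subtraction, the constant matrix with C(m) = (1/2) cos(π 3^m / 16), and the output mapping, and compares the result with the definition of the DCT. It is a floating point illustration only; the names `unit_8x4` and `direct_dct` are hypothetical and do not describe the fixed point hardware.

```python
import math

def C(m):
    """Constant C(m) = (1/2) cos(pi * 3^m / 16) for N = 8."""
    return 0.5 * math.cos(math.pi * 3 ** m / 16)

def unit_8x4(x):
    """Return [X(1), X(3), X(5), X(7)] for an eight point sequence x."""
    mapped = [x[0], x[1], x[4], x[2], x[7], x[6], x[3], x[5]]  # input mapping
    y = [mapped[i] - mapped[i + 4] for i in range(4)]          # subtraction
    M = [[C(0),  C(1),  C(2),  C(3)],
         [C(1),  C(2),  C(3), -C(0)],
         [C(2),  C(3), -C(0), -C(1)],
         [C(3), -C(0), -C(1), -C(2)]]
    Xp = [sum(M[j][i] * y[i] for i in range(4)) for j in range(4)]  # four point SCC
    return [Xp[0], Xp[1], Xp[3], -Xp[2]]                       # output mapping

def direct_dct(x, k):
    """X(k), k > 0, from the definition of the DCT with N = 8."""
    return 0.5 * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / 16)
                     for n in range(8))
```

The SCC uses only four distinct constants, each appearing once per row with a sign, which is what makes a small ROM based implementation practical.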
[Block diagram: inputs x(0)...x(7) pass through the Input Mapping to x(n0)...x(n7), are subtracted to give y(0)...y(3), feed the Skew Circular Convolution producing X'(k0)...X'(k3), and pass through the Output Mapping to X(1), X(3), X(5), X(7).]
Figure 3: A block diagram of the 8x4 unit.
The computation of X(2) and X(6) starts from creating a new
sequence from the original eight numbers as defined in (20). The
four steps of the algorithm are then performed, utilizing the constant
matrix (25) with N => N/2. Since we start with four numbers in (20),
and compute two numbers, we call the hardware to perform this
operation the 4x2 unit. Figure 4 shows a block diagram of the 4x2
unit.
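$$\{x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7)\} \tag{19}$$
$$\Downarrow$$
$$\{x_1(0) = x(0)+x(7),\; x_1(1) = x(1)+x(6),\; x_1(2) = x(2)+x(5),\; x_1(3) = x(3)+x(4)\} \tag{20}$$
$$\Downarrow$$
$$\{x_1(0), x_1(1), x_1(3), x_1(2)\} \tag{21}$$
$$\Downarrow$$
$$\{y_1(0) = x_1(0) - x_1(3),\; y_1(1) = x_1(1) - x_1(2)\} = \{y_1(0), y_1(1)\} \tag{22}$$
$$\Downarrow$$
$$\begin{bmatrix} X(2) \\ X(6) \end{bmatrix} = \begin{bmatrix} C_1(0) & C_1(1) \\ C_1(1) & -C_1(0) \end{bmatrix} \begin{bmatrix} y_1(0) \\ y_1(1) \end{bmatrix} \tag{23}$$
$$\Downarrow$$
$$\{X(2), X(6)\} \tag{24}$$
where
$$C_1(m) = \frac{1}{2} \cos\left(\frac{\pi 3^m}{8}\right) \tag{25}$$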
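The 4x2 computation can be checked the same way. In the sketch below, the constant C1(m) = (1/2) cos(π 3^m / 8) is an assumption derived from definition (1) with N=8 (the scale factor stays at sqrt(2/8) while the cosine argument uses N/2); the function names are illustrative only.

```python
import math

def C1(m):
    """Assumed 4x2 constant: (1/2) cos(pi * 3^m / 8)."""
    return 0.5 * math.cos(math.pi * 3 ** m / 8)

def unit_4x2(x):
    """Return (X(2), X(6)) for an eight point sequence x."""
    x1 = [x[n] + x[7 - n] for n in range(4)]   # pre-additions
    y1 = [x1[0] - x1[3], x1[1] - x1[2]]        # input mapping and subtraction
    X2 = C1(0) * y1[0] + C1(1) * y1[1]         # two point SCC
    X6 = C1(1) * y1[0] - C1(0) * y1[1]
    return X2, X6

def direct_dct(x, k):
    """X(k), k > 0, from the definition of the DCT with N = 8."""
    return 0.5 * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / 16)
                     for n in range(8))
```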
[Block diagram: inputs x1(0)...x1(3) pass through the Input Mapping to x1(n0)...x1(n3), the Subtracter forms y1(0) and y1(1), the 2 pt. SCC computes X(2) and X(6), and the output mapping is transparent.]
Figure 4: A block diagram of the 4x2 unit.
The 2x1 unit (two numbers input, one number output) takes the
sequence in (26) and computes X(4) with the constant defined in
(29). Note that the SCC has been reduced to scalar multiplication as
shown in Fig. 5a.
$$\{x_2(0) = x_1(0) + x_1(3),\; x_2(1) = x_1(1) + x_1(2)\} \tag{26}$$
$$\Downarrow$$
$$\{y_2(0) = x_2(0) - x_2(1)\} \tag{27}$$
$$\Downarrow$$
$$X(4) = y_2(0) \cdot C_2 \tag{28}$$
where
$$C_2 = \sqrt{\frac{2}{N}} \cos\left(\frac{2\pi}{N}\right) = \frac{1}{2} \cos\left(\frac{\pi}{4}\right) \tag{29}$$
The final computation, shown in Fig. 5b, is for X(0). As stated
previously, this only requires multiplication of x3, obtained using
equation (31), by 1/sqrt(N). Note that the constant for the scaler and
2x1 units is the same, and will be in general, for any valid value of N.
The computation of the forward transform is now complete.
$$\{x_2(0), x_2(1)\} \tag{30}$$
$$\Downarrow$$
$$\{x_3(0) = x_2(0) + x_2(1)\} \tag{31}$$
$$\Downarrow$$
$$X(0) = x_3(0) \cdot C_3 \tag{32}$$
where
$$C_3 = \frac{1}{\sqrt{N}} = \frac{1}{\sqrt{8}} \tag{33}$$
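A short numerical sketch of the 2x1 and scaler computations follows. It assumes C2 = sqrt(2/N) cos(2π/N) and C3 = 1/sqrt(N) with N=8, and checks both that the two constants coincide and that the results agree with the definition of the DCT. The function name is hypothetical.

```python
import math

N = 8
C2 = math.sqrt(2.0 / N) * math.cos(2 * math.pi / N)  # (1/2) cos(pi/4)
C3 = 1.0 / math.sqrt(N)                              # scaler constant

def unit_2x1_and_scaler(x):
    """Return (X(4), X(0)) for an eight point sequence x."""
    x1 = [x[n] + x[7 - n] for n in range(4)]   # first stage of pre-additions
    x2 = [x1[0] + x1[3], x1[1] + x1[2]]        # second stage
    y2 = x2[0] - x2[1]                         # 2x1 subtraction
    x3 = x2[0] + x2[1]                         # third stage (sum of all inputs)
    return y2 * C2, x3 * C3
```

Since cos(π/4) = 1/sqrt(2), the product sqrt(2/N) cos(π/4) reduces to 1/sqrt(N), which is why the two constants are equal.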
[Block diagrams: (a) x1(0)...x1(3) are added in pairs to form x2(0) and x2(1), which are subtracted to give y2(0); the 1 point Skew Circular Convolution (a scaler) produces X(4). (b) x2(0) and x2(1) are added to form x3(0), which the scaler multiplies to produce X(0).]
Figure 5: (a) A block diagram of the 2x1 unit. (b) A block diagram of the scaler unit.
The Inverse DCT 
The algorithm computes the IDCT defined in (2) in the same
manner that it computes the DCT, namely recursively computing odd
indexed DCTs. An IDCT is performed when x1, x2, and x3 are supplied
with transformed data (X(k) values), and the results of the
computational units are combined linearly as defined in (34) and (35).
This can be seen by substituting the definitions for S1 and S2 into
equations (34) and (35).
$$x(n) = S_1(n) + S_2(n), \quad n = 0, 1, 2, \ldots, \frac{N}{2}-1 \tag{34}$$
$$x(N-1-n) = S_1(n) - S_2(n), \quad n = 0, 1, 2, \ldots, \frac{N}{2}-1 \tag{35}$$
where
$$S_1(n) = \sum_{k=0}^{\frac{N}{2}-1} X(2k) \cos\left(\frac{\pi(2n+1)k}{N}\right), \quad n = 0, 1, 2, \ldots, \frac{N}{2}-1 \tag{36}$$
and
$$S_2(n) = \sum_{k=0}^{\frac{N}{2}-1} X(2k+1) \cos\left(\frac{\pi(2n+1)(2k+1)}{2N}\right), \quad n = 0, 1, 2, \ldots, \frac{N}{2}-1 \tag{37}$$
Notice that S2 is really an odd indexed DCT and that S1 defines an N/2
point IDCT, which we recursively break into another odd indexed DCT
(a new S2), and a half size IDCT (a new S1). We recursively break S1
down until we reach the final stage, which is a one point IDCT.
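The even/odd split of the IDCT can be illustrated with one level of the recursion. The sketch below is an assumption-laden illustration: it restores the scale factors of definition (2), which (36) and (37) leave out, so that the result can be compared with a direct evaluation of (2). The function names are hypothetical.

```python
import math

def idct_by_halves(X):
    """One level of the S1/S2 split of (34)-(37), with the scale factors of (2) restored."""
    N = len(X)
    half = N // 2
    scale = math.sqrt(2.0 / N)
    x = [0.0] * N
    for n in range(half):
        # S1: contribution of the even indexed coefficients (a half size IDCT)
        s1 = X[0] / math.sqrt(N) + scale * sum(
            X[2 * k] * math.cos(math.pi * (2 * n + 1) * k / N)
            for k in range(1, half))
        # S2: contribution of the odd indexed coefficients
        s2 = scale * sum(
            X[2 * k + 1] * math.cos(math.pi * (2 * n + 1) * (2 * k + 1) / (2 * N))
            for k in range(half))
        x[n] = s1 + s2
        x[N - 1 - n] = s1 - s2
    return x

def idct_direct(X):
    """IDCT evaluated directly from definition (2), for comparison."""
    N = len(X)
    return [X[0] / math.sqrt(N) + math.sqrt(2.0 / N) * sum(
                X[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k in range(1, N))
            for n in range(N)]
```

Each pair of outputs x(n) and x(N-1-n) shares the same S1 and S2 values, so only N/2 evaluations of each sum are needed, followed by one addition and one subtraction.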
In terms of the application when N=8, the S2 values are calculated
by the 8x4 unit when the data supplied to it is the sequence defined
in (38).
$$\{X(1), X(3), X(5), X(7), 0, 0, 0, 0\} \tag{38}$$
The S1 terms are generated from a linear combination of the results
of the 4x2, 2x1, and scaler units. The sequences supplied to the 4x2
and 2x1 units are defined in (39) and (40) respectively. The scaler
unit only need be supplied with X(0).
$$\{X(2), X(6), 0, 0\} \tag{39}$$
$$\{X(4), 0\} \tag{40}$$
Note the pre-additions that normally occur for the DCT have been
bypassed, and that zero padding of the data is performed when there
are only half the values. Figure 6 shows the block diagram of the
IDCT operation, and how to combine the results of the computational
units to obtain the original data.
[Block diagram: X(1), X(3), X(5), X(7) feed the 8x4 unit to give S2(0)...S2(3); X(2), X(6) feed the 4x2 unit, X(4) the 2x1 unit, and X(0) the scaler; the Post-Add Unit combines their results into S1(0)...S1(3) and forms the outputs x(n) by addition and subtraction.]
Figure 6: A block diagram of the IDCT operation.
Two Dimensional Considerations 
The calculations thus far have been for a one dimensional 
transform. How does this help us compute a two dimensional 
transform? A fairly common method of computing a two 
dimensional DCT as defined in (41) is to first compute the transform
of all the rows of an NxN block, then compute the transform of the 
columns. This is possible as shown in ( 4~) since the DCT is a 
separable orthogonal transform. 
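$$X(k_1,k_2) = D(k_1,k_2) \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x(n_1,n_2) \cos\left(\frac{\pi(2n_1+1)k_1}{2N_1}\right) \cos\left(\frac{\pi(2n_2+1)k_2}{2N_2}\right), \tag{41}$$
$$k_1 = 0, 1, 2, \ldots, N_1-1, \quad k_2 = 0, 1, 2, \ldots, N_2-1$$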
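where
$$D(k_1,k_2) = \begin{cases} \dfrac{1}{\sqrt{N_1 N_2}} & \text{if } k_1 = 0,\ k_2 = 0 \\[2ex] \dfrac{\sqrt{2}}{\sqrt{N_1 N_2}} & \text{if } k_1 = 0,\ k_2 > 0 \text{ or } k_1 > 0,\ k_2 = 0 \\[2ex] \dfrac{2}{\sqrt{N_1 N_2}} & \text{if } k_1 > 0,\ k_2 > 0 \end{cases} \tag{42}$$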
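When we equate N1=N2=N, we define a square matrix, which
simplifies the D(k1,k2) equation to (44).
$$D(k_1,k_2) = \begin{cases} \dfrac{1}{N} & \text{if } k_1 = 0,\ k_2 = 0 \\[2ex] \dfrac{\sqrt{2}}{N} & \text{if } k_1 = 0,\ k_2 > 0 \text{ or } k_1 > 0,\ k_2 = 0 \\[2ex] \dfrac{2}{N} & \text{if } k_1 > 0,\ k_2 > 0 \end{cases} \tag{44}$$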
The actual computation the IC performs for the DCT is defined in
(45). The difference is the factor of 1/64 that each coefficient is
multiplied by to keep the internal registers from overflowing.
$$X(k_1,k_2) = \frac{D(k_1,k_2)}{64} \sum_{n_1=0}^{N_1-1} \left[ \sum_{n_2=0}^{N_2-1} x(n_1,n_2) \cos\left(\frac{\pi(2n_2+1)k_2}{2N_2}\right) \right] \cos\left(\frac{\pi(2n_1+1)k_1}{2N_1}\right) \tag{45}$$
Ideally, we would like as much repetition as possible in a VLSI IC
to save design time. If we can use the same hardware that computes the
row-transform to compute the column-transform, we have saved
ourselves a tremendous amount of work. Since all the data for the
second dimension is not available until all of the rows are
transformed, a storage area for the row-transform coefficients is
needed. This storage area, or memory, should re-sequence the data
to supply the second dimension with correct data. Notice that the
shuffling of data that the memory performs is a matrix transpose
operation. Figure 7 shows the two dimensional transform
decomposed into two 1-D transforms with a transpose memory in the
middle.
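The row-column decomposition can be sketched in software as two passes of the same 1-D routine separated by a transpose. The following Python sketch is illustrative only: the function names are hypothetical, the factor of 1/64 applied by the IC is omitted, and the direct 2-D evaluation uses the square block normalization 1/N, sqrt(2)/N, 2/N.

```python
import math

def dct_1d(x):
    """1-D DCT from the definition (orthonormal scaling)."""
    N = len(x)
    X = [sum(x) / math.sqrt(N)]
    for k in range(1, N):
        X.append(math.sqrt(2.0 / N) * sum(
            x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N)))
    return X

def transpose(m):
    return [list(row) for row in zip(*m)]

def dct_2d_row_column(block):
    """Row transform, transpose, row transform again, then restore the [k1][k2] order."""
    first = [dct_1d(row) for row in block]              # first dimension
    second = [dct_1d(row) for row in transpose(first)]  # transpose, second dimension
    return transpose(second)

def dct_2d_direct(block):
    """Direct evaluation of the 2-D DCT for a square NxN block."""
    N = len(block)
    def D(k1, k2):
        if k1 == 0 and k2 == 0:
            return 1.0 / N
        if k1 == 0 or k2 == 0:
            return math.sqrt(2.0) / N
        return 2.0 / N
    return [[D(k1, k2) * sum(
                block[n1][n2]
                * math.cos(math.pi * (2 * n1 + 1) * k1 / (2 * N))
                * math.cos(math.pi * (2 * n2 + 1) * k2 / (2 * N))
                for n1 in range(N) for n2 in range(N))
             for k2 in range(N)] for k1 in range(N)]
```

The row-column form needs 2N one-dimensional transforms per block instead of a double sum per coefficient, and it lets both dimensions reuse one 1-D design.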
[Block diagram: a 2-D DCT/IDCT block mapping x(n1,n2) to X(k1,k2) is decomposed into a 1-D DCT/IDCT, a Transpose Memory, and a second 1-D DCT/IDCT, with a common DCT/IDCT control line.]
Figure 7: Decomposition of the 2-D DCT into two 1-D DCTs.
The transpose memory needs to be fast to keep up with the data
supplied from the first dimension, and to supply the second
dimension. The transpose memory holds the coefficients for an
entire NxN block of data. Although it is common to use a Random
Access Memory (RAM) for storage (on chip or off chip), special
addressing techniques are needed to read out the data in the
prescribed format. We chose to use a custom design that
automatically performs the transpose operation, and is compact
enough to fit on the chip. Although we used an 8x8 transpose
memory in our design, we will illustrate the design with a 3x3
transpose memory. Figure 8 shows a 3x3 transpose memory. The
transpose memory shown contains nine bit storage locations. We
would use 16 of these bit-transpose memories to store 16-bit
numbers in parallel.
[Diagram: a 3x3 array of storage cells with Row Control and Column Control lines, a data input (D in), and a memory output.]
Figure 8: An illustration of a 3x3 transpose memory.
When the transpose memory is loaded from the bottom with the
sequence {1,2,3,4,5,6,7,8,9}, the output sequence after the ninth input
will be {1,4,7,2,5,8,3,6,9}. This output is unloaded from the right
side, while we simultaneously load new values from the left side.
The numbers inside each cell represent the order that data is stored
when the memory is loaded from the bottom. Only one of the
column or row control lines is active during a phi 1 phase. Although
specific locations cannot be accessed at random, the transpose
operation is performed automatically. Also, the transpose memory
loads and unloads words simultaneously, like a dual-port memory.
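The load and unload behaviour described above can be modelled in a few lines. The sketch below is a behavioural model under an assumption that is not taken from the circuit itself: the cells are treated as one shift chain that runs either up the columns or along the rows, with both chains starting at the bottom-left cell and ending at the top-right cell. The class name and method names are hypothetical.

```python
class TransposeMemory:
    """Behavioural model of one bit plane of an n x n transpose memory."""

    def __init__(self, n):
        self.cells = {(r, c): None for r in range(n) for c in range(n)}
        # Row 0 is the top row. Both chains start bottom-left and end top-right.
        self.column_chain = [(r, c) for c in range(n) for r in reversed(range(n))]
        self.row_chain = [(r, c) for r in reversed(range(n)) for c in range(n)]

    def shift(self, value, by_columns):
        """Shift one value in and return the value that leaves the memory."""
        chain = self.column_chain if by_columns else self.row_chain
        out = self.cells[chain[-1]]
        for i in range(len(chain) - 1, 0, -1):
            self.cells[chain[i]] = self.cells[chain[i - 1]]
        self.cells[chain[0]] = value
        return out

mem = TransposeMemory(3)
for v in range(1, 10):
    mem.shift(v, by_columns=True)            # load the first block from the bottom
# Unload the first block from the right while loading a second block from the left.
first_out = [mem.shift(v, by_columns=False) for v in range(10, 19)]
# Unload the second block using the column direction again.
second_out = [mem.shift(None, by_columns=True) for _ in range(9)]
```

In this model `first_out` is 1, 4, 7, 2, 5, 8, 3, 6, 9, matching the sequence given in the text, and the second block comes out transposed as well because the shift direction alternates from block to block.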
In our initial design, we assumed the output of the first
dimension would be a 16-bit word every cycle, and the input to the
second dimension would likewise be a 16-bit word each cycle. Without
delving into too much of the architecture at once, we noticed that if
we modify the tail end of the first dimension to output eight 2-bit
words, and modify the front end of the second dimension to receive
eight 2-bit words, we can save sixteen cycles of latency and a
significant amount of chip area. The modifications to both dimensions
were slight, although they required redesigning the transpose memory to
accommodate the different input/output (i/o) format. The details of
the current transpose memory design are discussed in Chapter 5.
A block diagram view of the entire IC is shown in Fig. 9 with the
first and second dimensions outlined. The computational units inside
each dimension are outlined also. Chapter 5 describes in detail the
operation each block performs, and is presented in the order that
data moves forward through the chip.
[Block diagram: the 16-bit input x(n1,n2) enters the first dimension (Parallel to Serial, Pre-Add, the 8x4, 4x2, 2x1 and Scaler units, Post-Add), passes through the Transpose Memory, and enters the second dimension (Pre-Add, the 8x4, 4x2, 2x1 and Scaler units, Post-Add, Serial to Parallel), which produces the 16-bit output X(k1,k2). A DCT/IDCT control line feeds both dimensions.]
Figure 9: A block diagram of the entire DCT processor.
Chapter 5: Architecture of the Processor 
The algorithm ensures that the computations of the DCT and the
IDCT share the same building blocks. The DCT needs some
pre-additions while the IDCT needs some post-additions. Therefore, we
can design a DCT/IDCT processor with a single control line which
determines computation of the DCT or the IDCT.
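[Block diagram: a 16-bit input feeds the Pre-processor, which drives the 8x4, 4x2, 2x1 and Scaler units; their results feed the Post-processor, which produces the 16-bit output. The DCT/IDCT control line feeds the Pre-processor and the Post-processor.]
Figure 10: A block diagram of the 1-D DCT.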
Figure 10 is a block diagram of such a processor. The Pre-Processing
unit generates the x1, x2, and x3 values (see equations 20, 26, and 31)
needed for performing the DCT or zero-padding for the IDCT. The
Post-Processing unit performs either data routing for the DCT or
additions and subtractions (S1+S2, S1-S2) for the IDCT. The
computational units, 8x4, 4x2, 2x1, and scaler (contained by the
inner dotted lines in Fig. 9) perform the same computations for the
DCT and the IDCT, and therefore do not require the DCT/IDCT control
line. The 2-D DCT/IDCT processor uses two 1-D units and an
intermediate transpose memory as shown in Fig. 7. A detailed
discussion of each of the units now follows.
The Pre-Processor 
The Pre-Processor contains a parallel to serial converter (PtoS)
and a Pre-Add unit to feed the computational units when the DCT is
desired. It distributes inputs and pads zeros when the IDCT is
desired. Since a 16-bit word is clocked into the Pre-Processor each
cycle (8-bit values are zero padded to 16 bits with the data as the
most significant byte), a 1-D DCT transform can be performed after
eight cycles. The PtoS, shown in Fig. 11, loads 16-bit words for eight
cycles, then on the proper control signal transfers the eight words to
another bank of registers. The registers are connected so that the
eight even bits of each word are saved in parallel to an 8-bit shift
register, and the eight odd bits are likewise saved. During the next
cycle, when the first value is reloaded in from the top, we receive
from the right hand side eight 2-bit words. The eight 2-bit words (least
significant two bits first) are unloaded in eight cycles.
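The serialization of one 16-bit word into eight 2-bit words can be sketched as follows. This is a behavioural illustration only; the function names are hypothetical, and the pairing of one even-bit register with one odd-bit register follows the description above.

```python
def parallel_to_serial(word):
    """Split a 16-bit word into eight 2-bit words, least significant pair first."""
    word &= 0xFFFF
    even_reg = [(word >> (2 * t)) & 1 for t in range(8)]      # bits 0, 2, ..., 14
    odd_reg = [(word >> (2 * t + 1)) & 1 for t in range(8)]   # bits 1, 3, ..., 15
    # Each cycle shifts one bit out of each register to form a 2-bit word.
    return [(odd_reg[t] << 1) | even_reg[t] for t in range(8)]

def serial_to_parallel(digits):
    """Reassemble the 16-bit word from eight 2-bit words (inverse operation)."""
    return sum(d << (2 * t) for t, d in enumerate(digits))
```

For example, `parallel_to_serial(0x1234)` yields the 2-bit words 0, 1, 3, 0, 2, 0, 1, 0 over eight cycles.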
The PtoS feeds the Pre-Add unit with eight 2-bit words. The
Pre-Add unit, shown in Fig. 12, has two functions. For the DCT case, the
eight 2-bit words are used to compute the x1, x2, and x3 values, as
defined in Chapter 4. The Pre-Add unit supplies the correct inputs to
the 4x2, 2x1, and scaler units, while the 8x4 unit receives all eight
words directly. Special serial adders were designed to compute the
addition of two 16-bit words in eight cycles. These adders are
capable of dividing the answer by two and sign extending the result.
We utilize this capability to provide a guard band against overflow.
The first two stages of adders (they feed the 4x2 and 2x1 units) are
the dividing type, while the adder that feeds the scaler unit is the
standard (non-dividing) type. A detailed description of the serial
adders is provided in Chapter 6. For the IDCT case, the Pre-Add
serves to route the required inputs to the computational units (8x4,
4x2, 2x1, and scaler units), zero padding values as necessary. A
multiplexer determines the operation of the Pre-Add unit.
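The behaviour of a serial adder that consumes two bits per cycle, with an optional divide by two and sign extension, can be sketched as below. This is an arithmetic model only, written under the assumption of 16-bit two's complement operands; it does not describe the circuit of Chapter 6, and all names are hypothetical.

```python
def to_digits(value):
    """16-bit two's complement value as eight 2-bit words, least significant first."""
    word = value & 0xFFFF
    return [(word >> (2 * t)) & 0b11 for t in range(8)]

def from_digits(digits):
    """Inverse of to_digits, returning a signed Python integer."""
    word = sum(d << (2 * t) for t, d in enumerate(digits))
    return word - (1 << 16) if word & 0x8000 else word

def serial_add(a_digits, b_digits, halve=False):
    """Add two serial operands in eight cycles; optionally divide the sum by two."""
    carry = 0
    sum_bits = []
    for da, db in zip(a_digits, b_digits):        # one 2-bit word per cycle
        total = da + db + carry
        sum_bits += [total & 1, (total >> 1) & 1]
        carry = total >> 2
    if halve:
        # Bit 16 of the exact sum, using the sign extended operands.
        sign_a = a_digits[-1] >> 1
        sign_b = b_digits[-1] >> 1
        bit16 = (sign_a + sign_b + carry) & 1
        sum_bits = sum_bits[1:] + [bit16]         # drop the LSB, sign extend
    return [sum_bits[2 * t] | (sum_bits[2 * t + 1] << 1) for t in range(8)]
```

With `halve=True`, adding 30000 and 20000 gives 25000 instead of overflowing the 16-bit range, which illustrates the guard band mentioned above.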
Figure 11: A detailed view of the Parallel to Serial convertor.
Figure 12: A detailed view of the Pre-Add unit.
8x4 and 4x2 units
The 8x4 computational unit performs four subtractions and a four
point SCC. It accepts eight 2-bit words and outputs four 2-bit words.
A building block for the 8x4 and 4x2 units is shown in Fig. 13. It is
comprised of a subtracter, two ROMs (Read Only Memory), a Biased
Redundant Binary Adder (BRBA) [11] with feedback, a PtoS, and a
carry propagate adder (CPA).
Figure 13: A building block for the 8x4 and 4x2 units.
The 8x4 unit requires four building blocks. The building block
contains two identical ROMs, but the contents of the ROMs are
different in different blocks. The subtraction produces an 8-bit
result which is used to address the contents of a ROM, which contains
precalculated values. Since an 8-bit address can point to any of 256
values, this is the required number of words in the ROM. We chose
to break the address into a high and a low value of 4 bits each, which
is shared between blocks. We then need two ROMs of 16 words each (a
total of 32 words) instead of one ROM of 256 words, but we now need to
sum the outputs of the ROMs. The ROM contents are the C(m) matrix
as in (18), which already incorporate scaling by 1/2 and reflect any
output mapping (sign change). The width of the ROMs is 16 bits.
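The reason two 16-word ROMs can stand in for one 256-word ROM is that the stored value is a linear function of the address bits. The following Python sketch illustrates this under stated assumptions: the coefficient values are made up (the real words come from the C(m) matrix), the "high" address is assumed to collect the upper bit of each 2-bit difference and the "low" address the lower bit, and the factor of two on the high ROM output is assumed to be obtained by the relative bit alignment of the two ROM outputs.

```python
from itertools import product

# Hypothetical integer coefficients; the real ROM words come from the C(m)
# matrix of the thesis and are not reproduced here.
COEFFS = [91, -75, 50, -18]

# One 16-word ROM: the sum of the coefficients selected by a 4-bit address.
ROM16 = [sum(c for i, c in enumerate(COEFFS) if (addr >> i) & 1)
         for addr in range(16)]


def lookup_256(digits):
    """Value a single 256-word ROM would hold for four 2-bit digits."""
    return sum(c * d for c, d in zip(COEFFS, digits))


def lookup_split(digits):
    """Same value from two 16-word lookups plus one addition."""
    high = sum(((d >> 1) & 1) << i for i, d in enumerate(digits))
    low = sum((d & 1) << i for i, d in enumerate(digits))
    return 2 * ROM16[high] + ROM16[low]   # high bits carry twice the weight
```

Checking all 256 digit combinations, for example with `all(lookup_256(d) == lookup_split(d) for d in product(range(4), repeat=4))`, confirms that the two approaches agree in this model.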
Since the basic operation the 8x4 provides is multiplication, we
accumulate the result in eight cycles by adding a new partial product
each cycle. The sum is fed back, shifted to the right and sign
extended. The accumulation technique used is truncation, since we
keep only 16 bits. The contents of the high ROM are inverted on
the final cycle, in accordance with the 2's complement number
system. We use a BRBA to sum the ROM contents and to accumulate
the result. After eight cycles, we capture the BRBA outputs (two 16-
bit words), which when summed are the final result of the four point
SCC. We also reset the feedback registers to zero to be ready for the
next accumulation. We sum the two parallel outputs of the BRBA by
first converting them to bit-serial words in a smaller version of the
PtoS, and then feeding the bit-serial words to a CPA. This CPA is
identical to the serial adders (dividing type) described in the Pre-
Add unit. The propagation delay time of the BRBA is approximately
equal to that of the serial adders. Note that as we are summing the
result in the CPA, a new multiplication is accumulating in the BRBA.
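The eight-cycle accumulation can be modeled as follows. This is a sketch under our own assumptions rather than a description of the BRBA circuit: the coefficients are hypothetical integers, the inversion of the high ROM on the final cycle is modeled as a plain negation of its contribution, and exact fractions are used instead of 16-bit truncation so that the result can be compared against a direct sum of products.

```python
from fractions import Fraction

COEFFS = [91, -75, 50, -18]   # hypothetical coefficients, as integers


def partial_product(digits, last_cycle):
    """Contribution of one 2-bit digit per input (low ROM + high ROM)."""
    low = sum(c * (d & 1) for c, d in zip(COEFFS, digits))
    high = sum(2 * c * (d >> 1) for c, d in zip(COEFFS, digits))
    # On the last cycle the high bit is the sign bit, so its weight is negative.
    return low - high if last_cycle else low + high


def accumulate(inputs):
    """Eight-cycle shift-and-add of sum(c * x) for 16-bit signed inputs."""
    acc = Fraction(0)
    for k in range(8):
        digits = [(x >> (2 * k)) & 0b11 for x in inputs]
        acc = (acc + partial_product(digits, last_cycle=(k == 7))) / 4
    return acc   # equals sum(c * x) / 2**16 when nothing is truncated
```

In this model the shift to the right by two bits per cycle appears as the division by four; the hardware keeps only 16 bits at each step, so its result differs from the exact value by the truncation error.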
The 4x2 unit needs only two building blocks since it receives four 
2-bit words. The subtracters produce a 4-bit result which is broken 
into two 2-bit addresses. The ROMs for the 4x2 contain only four 16-
bit words. The rest of the hardware operates as described for the 
8x4 unit. 
Scaler and 2x1 units
A different architecture was used to implement the scaler and 2x1
units, as shown in Fig. 14. A five-word, 16-bit ROM was used to store
C, 2C, 3C, -C, and -2C, where C is the scaling constant. An additional
control line is generated to select -C and -2C during the last cycle of
accumulation. A carry save adder (CSA) accumulates the results
through shifting and feedback, similar to the 8x4 and 4x2 units.
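One way to see why five ROM words suffice is to note that a 2-bit digit selects 0, C, 2C or 3C on an ordinary cycle, while on the last cycle the upper bit is the sign bit, so the digit values 01, 10 and 11 stand for +1, -2 and -1. The Python sketch below illustrates this reading; the value of C, the table names and the treatment of a zero digit (no ROM word needed) are our assumptions, and the digits are weighted directly rather than through shifted feedback.

```python
C = 181   # hypothetical integer scaling constant

ROM = {"C": C, "2C": 2 * C, "3C": 3 * C, "-C": -C, "-2C": -2 * C}

NORMAL_CYCLE = {0: None, 1: "C", 2: "2C", 3: "3C"}
LAST_CYCLE = {0: None, 1: "C", 2: "-2C", 3: "-C"}   # high bit is the sign bit


def rom_word(digit, last_cycle):
    """ROM word selected by one 2-bit digit (zero needs no ROM word)."""
    name = (LAST_CYCLE if last_cycle else NORMAL_CYCLE)[digit]
    return 0 if name is None else ROM[name]


def scale(x):
    """Multiply a 16-bit two's complement value by C, two bits per cycle."""
    total = 0
    for k in range(8):
        digit = (x >> (2 * k)) & 0b11
        total += rom_word(digit, last_cycle=(k == 7)) * 4 ** k
    return total
```

For any 16-bit signed input the function returns C times the input, which can be checked with values such as -32768, -1 and 32767.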
Figure 14: A block diagram for the 2x1 and Scaler units.
The only difference between the scaler unit and the 2x1 unit is that
the 2x1 unit requires a subtracter (the dotted box in Fig. 14) while
the scaler unit does not. The ROM contents of the two units are
identical because
$\frac{1}{2\sqrt{2}} = \frac{1}{2}\cos\left(\frac{\pi}{4}\right)$.
The Post-Processor 
For the DCT case, the Post-Processor, shown in Fig. 15, routes the
data around the Post-Add unit to the serial to parallel converter
(StoP) so that the output of the 1-D DCT becomes 16-bit words. For
the IDCT case, the Post-Processor performs additions and
subtractions as well as StoP conversion. The adders are identical to
those used in the Pre-Processor, and the subtracters are modified
adders. The StoP, shown in Fig. 16, is very similar to the PtoS in the
Pre-Processor, except that it receives eight 2-bit numbers per cycle
and outputs one 16-bit word per cycle. The different delays for data
traveling through different numbers of adders/subtracters are
compensated in the Post-Processor through a small number of delay
registers.
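As a behavioral illustration of the StoP, the Python sketch below reassembles eight 16-bit words from eight cycles of eight 2-bit digits. The function name and the data layout (cycle index first, word index second, least significant digit first) are our assumptions; the register arrangement of Fig. 16 is not modeled.

```python
def serial_to_parallel(cycles):
    """Assemble eight 16-bit words from eight cycles of eight 2-bit digits.

    cycles[k][i] is the digit of word i that arrives on cycle k,
    least significant digit first.
    """
    words = [0] * 8
    for k, digits in enumerate(cycles):
        for i, digit in enumerate(digits):
            words[i] |= (digit & 0b11) << (2 * k)
    return words
```

Feeding this function the digit streams of any eight 16-bit words returns the original words, which is the sense in which the StoP undoes the PtoS.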
Figure 15: A block diagram of the Post-Processor.
Figure 16: A detailed view of the Serial to Parallel convertor.
The Transpose Memory 
The original design for the transpose memory (refer to Fig. 8)
accommodated 16-bit words. We redesigned the transpose memory
to accept eight 2-bit words due to the new i/o requirements described
in Chapter 4. The block diagram of the new design for the transpose
memory is shown in Fig. 17. It consists of 64 "M" cells, each of which
can contain a 16-bit word. The left most registers in the "M" cell can
receive inputs from the bottom or the left. The transpose memory
receives eight 2-bit words from the first dimension, which it routes to
the bottom and left inputs of the transpose memory in parallel, as
shown in Fig. 17.
Figure 17: The transpose memory. (Each M(Row, Column) cell contains
16 registers = 1 word.)
The first row coefficients are loaded into the bottom row (row 7) in
parallel during the first eight cycles (assuming valid data). During
the next eight cycles, the bottom row is filled with the second row
coefficients, and the first row coefficients are shifted into row 6. We
continue loading data in this manner until the first row's coefficients
are located in row 0. The transpose memory is now entirely full. We
then toggle the multiplexer (MUX) control line, called MOD 64, to
unload from the bottom since we now begin to load the transpose
memory from the left side. As we re-load the memory, the
transpose of the coefficients is shifted out of the right side of the
memory and up through the MUX. We continue loading the memory
from the left until we have removed all of the previous coefficients.
At this time a completely new set of coefficients has been loaded
and is ready to be unloaded, but we need to change the
modulo (MOD) 64 control signal so that the memory is unloaded from
the top (through the left side of the MUX). The memory alternates
the loading and unloading directions every 64 cycles. This structure
is highly repetitive, compact and requires simple control signals. The
only drawback is that every register cell is active every cycle, which
can consume a significant amount of power in complementary metal
oxide semiconductor (CMOS) designs. The MOD 64 counter was
constructed from a MOD 8 counter which clocks a MOD 16 counter.
The MOD 16 counter toggles a flip-flop whose output is connected to
the MOD 64 control line. The control was constructed in this manner
to facilitate debugging and because it was constructed this way in the
original transpose memory design.
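The row-in, column-out behaviour of such a memory can be illustrated with a small Python model. This is a word-level sketch under our own assumptions: each cell holds one number instead of 16 registers, one list operation stands for eight clock cycles, and the helper names and test blocks are hypothetical. In particular, the order in which the columns emerge in this model (last column first) follows from the simplified shifting and is not claimed to match the wiring of Fig. 17.

```python
N = 8


def shift_up(mem, new_row):
    """Load new_row at the bottom (row 7); return the row leaving the top."""
    leaving = mem[0]
    mem[:] = mem[1:] + [list(new_row)]
    return leaving


def shift_right(mem, new_col):
    """Load new_col at the left (column 0); return the column leaving the right."""
    leaving = [row[-1] for row in mem]
    for row, value in zip(mem, new_col):
        row[:] = [value] + row[:-1]
    return leaving


# Made-up 8x8 blocks whose entries encode their own (row, column) position.
block_a = [[10 * r + c for c in range(N)] for r in range(N)]
block_b = [[100 + 10 * r + c for c in range(N)] for r in range(N)]

mem = [[0] * N for _ in range(N)]
for row in block_a:                      # first 64 cycles: fill from the bottom
    shift_up(mem, row)
# next 64 cycles: load block_b from the left while block_a drains to the right
drained = [shift_right(mem, row) for row in block_b]
```

Every vector in `drained` is a column of `block_a`, so rows written in one direction are read back as columns while the next block is being written, with no idle cycles between blocks.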
Chapter 6: Circuit Design Issues 
In addition to standard gates and registers, we designed several 
special purpose circuits. The 2-input register used extensively
throughout the design is essentially a 2x1 MUX and a standard static
register combined, as shown in Fig. 18.
Figure 18: A two input register.
Note that Control1 and Control2 must be qualified with phi1 for
proper operation. The lower inverters are stronger and therefore
determine the direction of data flow, which is from left to right in
Fig. 18.
The BRBA consists of two 3-input majority voting circuits and
four 2-input exclusive-or (XOR) gates. Since the gate implementation
of an XOR consumes a fairly large area, a six transistor XOR gate was
used, as shown in Fig. 19. Proper driving of the inputs must be
considered. For instance, if at the beginning of a cycle both the A
and B inputs are a "0", the node before the inverter is pulled up to a
good "1". If the B input becomes a "1", we would expect this node to
flip to a "0"; instead, the good "1" travels down through the pass
transistor (whose source is connected to the A input), which was just
turned on, and the A input changes to a "1". The gate is still
performing the XOR operation, but sometimes it is the inputs that
change, not the output as desired. This can be compensated for by
careful driving of the inputs.
Figure 19: A two input XOR gate.
The specially designed 2-bit adder that performs the serial
addition of two 16-bit words is shown in Fig. 20. It is constructed
from a 2-input register, two standard registers, and two full adders.
Figure 20: A standard serial adder. Example (most significant 2-bit
word on the left): A = 00 01 10 (6), B = 00 00 11 (3),
ANS = 00 10 01 (9).
The 2-input register pre-loads the input carry to zero during the first
cycle and saves the carry out during the other seven cycles. The
standard registers provide buffering to the next stage. Figure 21
shows an adder with the capability of dividing by two and sign
extending the result. A simple numerical example demonstrates this
capability, which is utilized to ensure that the results of additions do
not overflow internal registers, and so that the outputs of each
computational unit are uniformly weighted. The extra 2-input
register on the output shifts the result to the right and performs the
sign extension. This requires the generation of two extra control
signals. Note that the MSB (most significant bit) and the LSB (least
significant bit) have been reversed with respect to the standard
serial adder. The subtracters used in the Post-Add unit are the serial
adders used in the Pre-Add unit, modified by complementing the "B"
inputs and pre-loading the input carry to one. Subtracters with the
capability of dividing the result by two and sign extending it are also
used.
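The behaviour of these adders can be sketched in Python as follows. The serial adder and subtracter are modeled cycle by cycle, two bits at a time, with the carry pre-loaded as described above. The dividing adder is modeled only at the word level (add, then shift right by one bit with sign extension), which is our simplification of the extra output register; all function names are hypothetical.

```python
def to_digits(x):
    """Eight 2-bit digits of a 16-bit two's complement value, LSB digit first."""
    return [(x >> (2 * k)) & 0b11 for k in range(8)]


def from_digits(digits):
    """Signed 16-bit value represented by eight 2-bit digits."""
    raw = sum(d << (2 * k) for k, d in enumerate(digits))
    return raw - 0x10000 if raw & 0x8000 else raw


def serial_add(a_digits, b_digits, subtract=False):
    """Add two digit streams two bits per cycle; optionally compute A - B."""
    carry = 1 if subtract else 0           # carry pre-loaded on the first cycle
    result = []
    for a, b in zip(a_digits, b_digits):
        if subtract:
            b ^= 0b11                      # complement the B inputs
        total = a + b + carry              # two full adders' worth of bits
        result.append(total & 0b11)
        carry = total >> 2                 # carry out saved for the next cycle
    return result


def add_and_halve(a_digits, b_digits):
    """Word-level model of the dividing adder: (A + B) / 2, sign extended."""
    return to_digits((from_digits(a_digits) + from_digits(b_digits)) >> 1)
```

With these helpers, adding 6 and 3 gives 9 as in the example of Fig. 20, and adding 2 and -6 with the dividing adder gives -2 as in the example of Fig. 21. The halving also shows the guard band at work: 30000 + 30000 would overflow 16 bits, but the halved result, 30000, does not.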
Figure 21: The divide by two and sign extension serial adder. Example
(most significant 2-bit word on the left): A = 00 00 10 (2),
B = 11 10 10 (-6), sum = 11 11 00 (-4), ANS = 11 11 10 (-2).
Each sub-section of the design requires control signals, which are
derived from a MOD 8 counter, except for the transpose memory.
The MOD 8 counter was implemented using an array of eight 2-input
registers which are initially loaded in parallel with a single one, as
shown in Fig. 22.
Figure 22: A shift register implementation of a MOD 8 counter.
We chose to implement the MOD 8 counter with a register ring since
it is easy to debug, is highly repetitive, and most importantly, all of
the outputs become valid concurrently. Other counter designs
usually require gated outputs, which often introduce unequal delays
in the outputs.
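A register ring of this kind behaves as a one-hot counter, which the following Python sketch illustrates. The generator name and the choice of which register holds the initial one are our assumptions.

```python
def ring_counter(cycles):
    """One-hot MOD 8 counter: a single 1 circulating in eight registers."""
    state = [1, 0, 0, 0, 0, 0, 0, 0]      # loaded in parallel on start
    for _ in range(cycles):
        yield tuple(state)
        state = [state[-1]] + state[:-1]  # shift by one register each clock
```

Each register output is directly usable as the control signal for one of the eight cycles, so no decoding logic is needed; the pattern repeats after eight clocks.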
Chapter 7: Results & Analysis 
Timing results were generated from the layout RSIM maxtime
statistics. There are two important timing figures. The first is the
maximum frequency of operation achievable when using the on-chip
clock driver. The internal clock driver has a preset gap of 3.9
nanoseconds (ns). The longest delay time, called the critical path, is
17.3 ns. The inverse of their sum is the simulated maximum clock
frequency, which equates to 47 MHz. We expect the IC to operate at
approximately this speed; the exact figure will depend on process
variations. In terms of calculations, a throughput of 47 MHz is
equivalent to computing 752 million multiplies and 658 million adds
per second. The second figure is the maximum frequency of operation
achievable when using an external clock. If we clock the IC
externally, the critical path time remains unchanged, but we may be
able to reduce the gap time, or skew the phi1 and phi2 times, so that
we expect the IC to operate at frequencies greater than 47 MHz, but
less than 58 MHz. The higher frequency is the inverse of the critical
path time. Please note that since we use static registers in the
design, the IC can operate in the frequency range from DC to the
stated maximum. The latency, or number of clock cycles until the
first result, is 127 cycles. Keep in mind that 64 cycles are required to
clock out the entire 8x8 block of data.
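The arithmetic behind these figures can be reproduced with a few lines of Python. The variable names are ours; the per-cycle operation counts and the times in microseconds are simply derived from the numbers quoted above and are not stated in this form in the text.

```python
critical_path_ns = 17.3
clock_gap_ns = 3.9

# 1000 / (period in ns) gives the frequency in MHz.
f_internal_mhz = 1e3 / (critical_path_ns + clock_gap_ns)   # about 47.2 MHz
f_external_limit_mhz = 1e3 / critical_path_ns              # about 57.8 MHz

# Operations per clock cycle implied by the quoted throughput at 47 MHz.
multiplies_per_cycle = 752 / 47
adds_per_cycle = 658 / 47

# Cycles divided by MHz gives microseconds.
latency_us = 127 / 47        # time to the first result at 47 MHz
block_time_us = 64 / 47      # time to clock out one 8x8 block at 47 MHz
```

The quoted throughput therefore corresponds to 16 multiplies and 14 adds per clock cycle, and at 47 MHz the first result appears after roughly 2.7 microseconds.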
The total number of transistors is 67,929. The die size is 7.9 mm x
9.2 mm. The area of selected cells has been tabulated in Table 1. We
can only use 79% of the die for the design, due to the area the i/o
pads consume. An estimate of routing area was calculated. The
estimate of 35% routing area is deceiving, since it includes the area
occupied by the large global Vdd and GND busses and was calculated
by subtracting the area of each unit shown in Fig. 9 from the total
usable design area. Also, the first dimension figures do not include
the PtoS area, and likewise, the second dimension figures do not
include the StoP. Appendix A has the areas for each of the basic
building blocks for all units.
Unit              height    width     area     % of    % of     aspect ratio
                  (lambda)  (lambda)  (mm^2)   total   design   (height to width)
total area        9200      7900      72.68    100%             1.16
used area         9144      7844      71.73    99%              1.17
design area       8267      6964      57.57    79%     100%     1.19
pad area                              15.11    21%
first dimension   3047      6110      18.62    26%     32%      0.50
memory            1106      4533      5.01     7%      9%       0.24
second dimension  3351      6140      20.58    28%     36%      0.55
PtoS              1057      1412      1.49     2%      3%       0.75
Pre-Add           1055      797       0.84     1%      1%       1.32
scaler            1432      722       1.03     1%      2%       1.98
2x1               1438      759       1.09     2%      2%       1.89
4x2               1374      2530      3.48     5%      6%       0.54
8x4               2976      2546      7.58     10%     13%      1.17
Post-Add          2830      1048      2.97     4%      5%       2.70
StoP              324       5288      1.71     2%      3%       0.06
clock driver      399       745       0.30     <1%     1%       0.54

routing area      area      % of area used for routing
                  (mm^2)
first dimension   1.63      9%
second dimension  3.59      17%
design area       20.40     35%
Table 1: Area of selected cells.
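The area column of Table 1 can be cross-checked from the height and width columns. The short Python sketch below does so under the assumption that one lambda equals 1.0 micron, which is consistent with the tabulated areas; the names are ours.

```python
LAMBDA_UM = 1.0   # assumed: one lambda is 1.0 micron for this process


def area_mm2(height_lambda, width_lambda):
    """Area in mm^2 of a cell whose sides are given in lambda units."""
    return height_lambda * LAMBDA_UM * width_lambda * LAMBDA_UM / 1e6


total = area_mm2(9200, 7900)
design = area_mm2(8267, 6964)
pads = total - design                 # area lost to the pad ring
design_share = design / total         # fraction of the die usable for design
routing_share = 20.40 / design        # tabulated routing area over design area
```

With these inputs the design area comes to about 79% of the die, the pad ring to about 15.1 mm^2, and the routing estimate to about 35% of the design area, in agreement with the table.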
Table 2 contains a summary of the number and type of pads
used. Appendix B contains a full description of the IC's pinout. We
had the option of a 64 or 84 pin package. Although we normally like
ICs to have a minimum number of pins, we chose the 84 pin grid
array (PGA) package since it allows enough pins for a 16-bit test
port. The location of the test port is the output of the Post-Add unit
of the first dimension, which is actually eight 2-bit words. Normally
we can examine the outputs of the first dimension, or in test mode, we
can insert data into the transpose memory. By utilizing the test port
and analyzing the results, we can determine whether a fault exists in
the first dimension, second dimension, or transpose memory.
Number  Function      Type
16      input         IN
16      output        OUT
16      test          I/O
1       test control  IN
1       start         IN
1       DCT control   IN
4       Clock         I/O
12      Vdd           VDD
16      Gnd           GND
1       Blank
Table 2: Summary of the IC's pins.
A comparison with commercial ICs is tabulated in Table 3. Although
the commercial ICs listed have been fabricated using more advanced
process technologies, our IC is expected to achieve a higher
frequency of operation. When deciding the merits of a particular
design, we must take into consideration the extra features that
commercial ICs have incorporated into their designs, for instance
hardware to perform zig-zag scanning of the DCT coefficients and
coding of the coefficients. Since chip area is always limited,
designers must choose between additional features and higher
throughput. We chose to create a high throughput DCT engine which
can be used as the core of a high performance image processing chip
set, or as a building block of an image processing chip which would
include these additional features and could be realized in 1.2µ CMOS.
Manufacturer              Max. Clock   Tech.      Latency   Package       I/O   Coef.  Number of
                          Freq. (MHz)             (cycles)                bits  bits   Transistors
Lehigh University         >47          2.0µ CMOS  127       84 PGA        16    16     67,929
C-Cube Microsystems [14]  29.41        1.2µ CMOS  >320      144 PGA       16           ~400,000
Inmos-Thomson [15]        20           1.2µ CMOS  128       44 pin PLCC   12    14     86,008
LSI Logic [16]            30/40                   98        68 pin CPGA   12
                                                            or PLCC
Table 3: A comparison of commercially available ICs for N=8.
Chapter 8: Standard design procedures 
The notation for this and subsequent chapters is: computer
commands are in boldface type, and file and program names are
in italic type. The system prompt is a percent sign (%).
A high level language description of the DCT algorithm is the first
step of the design and serves as a reference for all other models.
The high level language description verifies the properties of the
algorithm and provides a benchmark in terms of functionality and
accuracy. The high level description was written by Dr. Li in the C
programming language. The program is quite generic, being able to
compute several variations of the DCT from several input
sources.
The second stage of the design functionally describes the
architecture to the bit level and is classified as a behavioral
simulation. We used the bsim simulator, which uses C subroutines to
mimic hardware operations. The original bsim model of the chip was
written by the four members of Dr. Li's VLSI Signal Processing class
in the fall of 1989. Dr. Li merged each member's contribution to
create a working bsim model of the chip, and made all the
subsequent changes to the description. Although bsim verifies
functionality, it does not determine any timing information other
than the latency, and it is not affected by fan-in or fan-out problems.
The third stage of the design is a transistor level description of
the architecture using the net language. The net language has a
Lisp-like interface from which macros for gates and other structures
can be built. Once a net description is compiled, it is simulated using
RSIM. RSIM is an event driven simulator that models each transistor
as a switch with an associated RC value for timing analysis. We used
net and RSIM to locate the critical path(s) and discover fan-in/fan-
out problems. The net description language uses conservative values
for resistors (wiring, transistor on, transistor off, etc.) and capacitors
(gate, wiring, source, drain, etc.), and modifies the appropriate values
when a specific width to length ratio is given. The class provided the
original net descriptions, which Dr. Li merged and modified as he did
the bsim model. The net description gives a good estimate of the
frequency of operation the IC will achieve.
Generating the mask layout is the final stage of the design. We
used the magic layout editor for creating all layouts. Magic gives the
area of the design and allows functionality and timing to be verified.
The procedure to verify a layout is as follows. First, an extraction is
performed, which calculates all the key parameters from the spatial
arrangement of a layout for each cell and stores them in separate
files. Second, a compilation is performed to condense all of the
extracted files and parameters into one file. Finally, the design can be
verified and compared to the net description using RSIM. Although
the layout is the final stage of the design, it certainly is not a one
way path. Many of the design changes were incorporated due to
constraints discovered at the layout stage. Since there is always a
limit to the area a design can be realized in, the area constraint
caused the largest number of modifications. Any modifications to the
layout need to be reflected in the net and bsim descriptions so that
functionality and accuracy can be verified. Since changes to the
layout are often time consuming, if at all possible, changes should be
made to the net description first. Then, if the net description
numerical results agree with the bsim results, it is safe to change the
layout. A judgment call is to be made as to whether or not to change
the bsim model when the layout and net description change, since
changes to the net and magic descriptions may alter timing and not
functionality. Although the net description should reflect
modifications first, it is common for the layout to be changed first.
Again, the bsim model may be different from the net, but the net and
magic descriptions should be identical (within reason, i.e., global
drivers that are in the layout are not necessary in the net description,
but may be incorporated if desired). Once completed, the layout is
extracted in a slightly different manner to create a Caltech
Intermediate Format (CIF) file, which is submitted to the MOS
Implementation Service (MOSIS) for fabrication.
Computer Systems 
All levels of simulation were performed on Sun 3/60,
SPARCstation 1, and SPARCstation 1+ workstations. The high level
description could have been written on a Personal Computer (PC), but
this would have been inconvenient due to file transfers to the Sun
computers. Documentation, generation of transparencies,
submissions to journals, and this thesis were completed using
Macintosh computers. The high level language description was
written on a Sun 3/60 workstation, but could be compiled on the
SPARCstation computers if the computation time became too lengthy.
The bsim model was written exclusively on Sun 3/60 workstations,
since bsim uses support files specifically compiled for the Sun 3/60
workstations. The net description was written on both the Sun 3/60
and SPARCstation computers, although slightly different commands
need to be issued to do so. The compiled net description is simulated
using RSIM, available on both Sun systems. The default system is the
Sun 3/60; running a desired program on a SPARCstation instead
usually requires adding the prefix "S" to the command.
For instance, to run RSIM on the Sun 3/60s type
%rsim
and on a SPARCstation type
%Srsim
Once the design of a sub-unit exceeds approximately 8k transistors,
we recommend using a SPARCstation to run RSIM simulations, since
the SPARCstations have greater processing power and complete
simulations more quickly.
Layout can likewise be performed on either Sun system, with the
'S' prefixed to the program name magic on the SPARCstations. The
compiling of a magic layout uses support files and programs which
have only been compiled for the Sun 3/60s. The extraction process is
the first step when compiling a layout, and is internal to magic. The
compiling time of the layout became lengthy (~3 hours) when all the
sub-units were united. Also, the RSIM simulation time of a layout for
three 8x8 blocks of data is approximately 90 minutes on a
SPARCstation 1+ computer. We expect the RSIM time for the layout
simulation to be greater than that of the net simulation for two
reasons. The first is that RSIM uses a description for each unique
transistor; since the net description does not reflect any spatial
information, all transistors are the same distance away (which is
certainly not true of the magic layout), so the number of descriptions
is significantly less for the net-RSIM. The second is that RSIM keeps
timing information (max. time) for each node, which was turned off
since we were only interested in comparing the numerical results of
the net-RSIM and the magic-RSIM. The final RSIM data file is quite
large, and the machine that will be performing the simulations should
have at least 1.2 times the file size (in bytes) of RAM. For instance,
the final RSIM data file is ~15 Mbytes and simulates on machines with
more than 24 Mbytes of memory. Although it will simulate on
machines with less memory, the process takes much longer since the
machine is continually swapping memory to disk and back. This
memory constraint limits the number of machines available to
perform large simulations.
Location of Files 
As of August 1991, Dr. Li's files and the files of all his research
students are kept on a disk drive attached to the Sun 3/80 computer
named jupiter in the VLSI lab. The files located on jupiter can be
accessed transparently from any of the Sun computers in the
Artificial Intelligence (AI) lab, and also from other machines (such as
the machines in the Sun lab, PL118) through ftp, rcp or other
intra-system programs. The root directory for Dr. Li's files is
/home/jupiter2/li, and the Author's root directory is
/home/jupiter1/dslaweck.
The bsim data files are located in the directory
~li/chips/dct/bsim/twod; the two key files are model.c and padlist.l.
The net data files are located in the directory
~li/chips/dct/net/twod; the key files are twod.net, gates.net, and
cells.net.
Although magic runs on the Sun 3/60 computers, we recommend
the SPARCstation version since the SPARCstations are much faster
machines. The procedure for running magic on a SPARCstation can be
found in Appendix C. The magic files for the dct chip are located in
~dslaweck/magic/scmos/dctchip. They have been compressed to
save space on the disk drive. To view these files, first copy them to
the current directory
% cp ~dslaweck/magic/scmos/dctchip/*.mag.Z .
and then type
% uncompress *.mag.Z
The highest level cell for simulations is top.mag.
Examples of Standard Procedures 
The high level language description was written in the C
language, with an available text editor such as emacs or vi. The
commands for compiling C programs are available through the Sun's
on-line help by typing
%man cc
In order to run most of Dr. Li's software, modify the search path to
include the location of the programs. This is accomplished by adding
the line
set path = ($path ~cad/bin ~li/cad/bin ~li/bin)
to the .cshrc file in one's home directory.
Initialization of bsim requires a two step process. First, to create
a bsim directory, which we will call "bsimdir", type
%mkdir bsimdir
%cd bsimdir
%get_bsim
This places and links all the necessary files to run bsim in the
directory "bsimdir", including the files model.c and padlist.l, which
are edited to create a bsim model. After a model is written, the
compile command (which is a multi-stage process) is
%make
To run bsim type
%bsim
This invokes bsim in the interactive mode. One commonly has
standard commands, test vectors, and nodes to watch, which can be
placed in a file which we will call file.cmd. The command to execute
the file is (at the bsim prompt)
bsim -> @ file.cmd
To further automate the process, and to store the bsim program
output in a file, type
% bsim < filename.in > filename.out
where filename.in contains "@ file.cmd". Use an editor to create these
files and to examine the output. A complete bsim manual can be
obtained from Dr. Li. One note of caution is that only one model can
be placed in a directory, since the bsim make script always uses the
file model.c when compiling. Perhaps an ambitious student can
generalize the make script to accept other file names.
Setting up net follows a similar procedure:
%mkdir netdir
%cd netdir
%get_net
Unlike bsim, net gives the flexibility of several net descriptions per
directory. For instance, the same directory contained the files
8x4.net and 4x2.net. They each point to the file gates.net, which
contains macros of gates and other commonly used structures. The
gates.net file can be extended with user macros, although it is
common to place user-specific macros in another appropriately
titled file, such as gates2.net or cells.net. If user-specific macros have
been written inside the file cells.net, the line
(include "cells.net")
must be included inside the main net file (e.g., 8x4.net). Compiling a
net description on a Sun 3/60 is accomplished by typing
% makersm file
The makersm program operates on file.net and produces two files.
The first is file.sim, a text file that contains transistor,
resistor, capacitor, and interconnect information. This file is handy
for debugging and tracing inadvertent connections. The second is
file.rsm, a binary data file that RSIM uses. RSIM is started in
the interactive mode by the command
% rsim file.rsm
If a standard input file has been created (similar to bsim), the
command
% rsim file.rsm -file.cmd >file.out
runs RSIM with the commands in file.cmd and directs the output to
file.out. A complete set of manuals for net and RSIM, as well as
several other programs for RSIM, are available from Dr. Li.
Magic is started on the Sun 3/60s by typing
%magic
The procedure for setting up magic on a SPARCstation is discussed in
Appendix C. Magic is a fairly extensive program with tutorials for
many basic and advanced operations. We highly recommend that all
tutorials be explored and reviewed from time to time. Verifying a
magic layout is a multi-step process. Although a script has been
written to create file.rsm from file.mag, we found that performing
the first step, the extraction, while still in magic is quicker and less
likely to cause an error in the compiling process. To extract a layout,
first set the extraction style by typing (inside magic)
:ext style lambda=1.0
then type
:ext
Magic extracts the current edit cell and all of its subcells (children),
and generates warnings if it finds any errors in the design. It is
easier to catch and correct errors while in magic than to interpret
error messages generated by the cadmake script. After
the extraction type (on a Sun 3/60)
%cadmake file.rsm
The cadmake script first checks to see if any files need to be
extracted, and invokes magic if they do; then it creates file.sim from
all the .ext files via the ext2sim program. Once we have file.sim, we
would normally create file.rsm, but the file.sim that the ext2sim
program creates is slightly non-standard. First, all the [ ] brackets
are changed to { } brackets by a program called arrayfix. Then the
program powerstrip corrects any spelling-case variation of Vdd and
GND so that power is applied to the correct places. Next, the compact
program replaces the lengthy names that magic generates with single
node numbers, with an alias table linking node names to node
numbers. This drastically reduces the size of file.sim and
consequently file.rsm. Finally, the presim program appends
technology specific information and the setprm program generates
file.rsm. This is the recommended procedure on the Sun 3/60s. Since
the cadmake script calls programs specifically compiled for the Sun
3/60, we always need to run cadmake on a Sun 3/60.
Although there is a routing tool in magic, it was determined that 
the spatial overhead that the routing tool requires was too costly, 
and consequently all routing was performed by hand. Dr. Li has 
several other tools to automate the layout design process. An 
overview of the tools can be found in the file ~li/chips/doc/proc.doc.
Most of these tools were not applicable to this project. The one tool
set that we did use was pad2frame and framegen. The pad2frame
program automates the creation of a pads.input file, and the
framegen program creates padframe.mag from pads.input.
Pad2frame uses a data file which contains information as to the type
of pad (input, output, bidirectional, or power), desired location, and
the name of the pad. The framegen program creates a magic layout file in
which the user's design is placed. The pad2frame program is only
available on the Sun 3/60 computers, and is invoked by typing
% padframe <pads.info >pads.input
This creates pads.input for the framegen program to create
padframe.mag by typing
% framegen <pads.input -f pads.params
The commands for generating the CIF file are the same on the
Sun 3/60 and Sparkstation computers since CIF generation is
internal to magic. First set the ostyle by typing
:cif ostyle lambda=1.0(nwell)
Since the ostyle varies with the technology used, type
:cif ostyle ?
to see all valid ostyles. To create the CIF file type
:cif
from the parent cell of the design. This creates file.cif from file.mag.
The parent cell of our design is final.mag, which is slightly different 
from top.mag, which was used for the top-level simulation. Magic 
generates the CIF file, informing the user of any CIF design rule 
errors that it encounters. 
Sun 3/60 computer
Program      Operation      Command                        Comments
High level   compile        % cc prog.c -o prog            directs output to prog
language                                                   rather than a.out
             run prog       % prog > file.out
bsim         compile        % make
             run bsim       % bsim < file.in > file.out
net          compile net    % makersm file                 makes file.rsm from
             description                                   file.net
RSIM         run RSIM       % rsim file.rsm -file.cmd
                              > file.out
magic        run magic      % magic file                   opens magic with
                                                           file.mag as the edit cell
             compile        % cadmake file.rsm             file.mag is the parent
                                                           cell of the design to
                                                           be verified

Sparkstation computer
Program      Operation      Command                        Comments
High level   compile        % cc prog.c -o prog            directs output to prog
language                                                   rather than a.out
             run prog       % prog > file.out
bsim         n/a                                           must run on a Sun 3/60
net          compile net    % net file.net file.sim        two step process
             description    % Spresim file.sim file.rsm
                              cmos100.prm -nostack
RSIM         run RSIM       % Srsim file.rsm -file.cmd
                              > file.out
magic        run magic      % Smagic file                  opens magic with
                                                           file.mag as the edit cell
             compile        % cadmake file.sim             cadmake must be
                            % Ssetprm                      performed on a Sun 3/60!

Table 4: A synopsis of common procedures.
A summary of the common commands for both the Sun 3/60 and
Sparkstation computers can be found in Table 4.
System utilities 
The Sun computers use the operating system SunOS, which is a 
variation of the UNIX operating system. The operating system has 
standard commands to manipulate files, plus utilities which are 
helpful when manipulating text files. This is valuable since all files 
with the extensions .net, .mag, .sim, and .c are text files. We
recommend becoming familiar with the utilities/commands more,
grep, sed, rsh, rcp, ftp, and rlogin.
Simulator problems/tips 
The RSIM simulator does not detect certain types of errors. For
instance, if Vdd and GND are shorted together, the simulator will
perform the simulation, but no events occur. This is also true if phi1
or phi2 becomes shorted to a power supply rail. Another quirk of
RSIM to be aware of is that a node name of "t" is not to be used.
RSIM uses "t" as the time command, and does not generate a warning
if it is used for a node name.
The ext2sim program merges names that are
identical in a cell, and also names that are similar to phi1 or phi2. It
is easy to catch the error of two separate nodes in a cell with the
same name, because magic generates a warning when the extraction
is performed. However, if the name "phi1start" is used, ext2sim will
merge it with phi1. This can sneak up when creating global buffers,
since the node name is usually the same on both sides. This leads to
the same maxtime before and after the buffer, since the buffer is
effectively shorted through names. We recommend using the name
"p1" instead of phi1 and "p2" instead of phi2 inside all cells,
reserving phi1 and phi2 for the parent cell. Buffer outputs can be
called "p1buff" and "p2buff". A worst case scenario is that all nodes
needing phi1 are called phi1, but a connection to the global phi1 is
forgotten. The simulator will give good results while the fabricated
chip will not. Even if all phi1 nodes are connected to global phi1, the
global phi1 buffer may not be strong enough, since it is shorted to
the simulator phi1 signal, which has infinite drive. Finally, since the
powerstrip program corrects for variations of Vdd and GND, do not
use Vdd! and GND! as the magic tutorials advise.
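The naming pitfalls above lend themselves to a simple check that could be run over the node names of a cell before simulation. The Python sketch below is hypothetical: the function name is invented, and the exact rule ext2sim uses to decide that a name is "similar to" phi1 or phi2 is not documented here, so a plain prefix match (as in the "phi1start" case) is assumed.

```python
def check_node_name(name: str, is_parent_cell: bool = False):
    """Return a list of warnings for a node name, based on the tips in the text."""
    warnings = []
    if name == "t":
        # RSIM treats "t" as its time command and gives no warning.
        warnings.append("'t' collides with the RSIM time command")
    if not is_parent_cell and name.startswith(("phi1", "phi2")):
        # Assumed rule: any name beginning with phi1/phi2 may be merged
        # with the global clock by ext2sim.
        warnings.append("name may be merged with global phi1/phi2")
    if name in ("Vdd!", "GND!"):
        warnings.append("use Vdd/GND without '!'")
    return warnings


for node in ["p1buff", "phi1start", "t", "Vdd!"]:
    print(node, check_node_name(node))
```

Names such as "p1" and "p1buff" pass cleanly, which is the reason for reserving phi1 and phi2 for the parent cell.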
The chip layout requires the greatest amount of time since the
layout is performed by hand. Listed below are some tips that we
think will aid the layout process.
• Use separate directories for each unit of the design, except those
that are very similar. Since all magic data files will eventually be
copied into one directory, use unique cell names so that files are
not overwritten.
• Think about the routing of control signals to a cell before placing
other cells around it.
• Do not be afraid to create new cells with the same functionality.
We have a dozen or so registers which are used to drive different
loads.
• Create standard widths for Vdd and GND. We recommend 8
lambda for the power supply rails inside cells. Often two cells can
be merged to share a larger than normal Vdd or GND. We found
that the layout occupies less chip area if the cells are combined to
create alternating pairs of two Vdds and two GNDs instead of Vdd
alternating with GND.
• Create a standard width that Vdd and GND are separated by, for
instance 60 to 100 lambda.
• Become familiar with the process technology. The SCMOS design
rules allow smaller contacts than the CMOS design rules. A typical
case is a Metal1 (M1) to Metal2 (M2) contact. In CMOS the contact
occupies 25 lambda^2, while the same contact in SCMOS occupies
only 16 lambda^2, a savings of 36%. Also, the technology file
contains information about the resistance and capacitance of
different layers. For instance, in the SCMOS rules M2 over
substrate has zero capacitance, while M2 over M1 has the same
capacitance as M1 over substrate. In other words, use M2 for long
wires, even over top of M1.
• Leave a few lambda of room to make transistors larger. If we
make all transistors large, we use more area and more power than
necessary. Use reasonable sizes, which can be made larger to
accommodate different loads.
• Perform as much redesigning as possible in bsim and net before
changing the layout.
Chapter 9: Future work 
Since there always seems to be room for improvement of a
design, a discussion of several ideas that improve upon the design
now follows.
The architecture we used for the transpose memory can also be
used to realize the PtoS and StoP units. The area of these units
would be reduced to about one half of their current area. As noted
before, since every register is active every cycle, power requirements
increase dramatically. This is especially true because there is a
significantly larger number of transistors that will be operating each
cycle inside a smaller area.
Since the 4x2 unit's ROMs only contain three words (three words
plus zero), perhaps the architecture for the scaler and 2x1 units is
more appropriate since it occupies less chip area. The required
number of words in the 4x2 unit's ROMs would increase to 32 16-bit
words, or only 16 16-bit words if we include the ability to invert the
output of the ROM. A 32 word ROM design may be too slow, so a 16
word ROM with XOR gates to invert the output as needed is
preferred.
Currently the architecture realizes the equation defined in (45)
for the DCT. The difference from the desired DCT (43) is the factor of
1/64 that all coefficients are multiplied by so that data registers
inside the IC do not overflow. A numerical analysis indicates it is
only necessary to divide by 16 to keep from overflowing any
register. We analyzed this change, and determined that the change
to the magic layout is fairly small, namely in the CPAs at the end of
the computational units, and in the Post-Processor. The benefit of
this change would be two more significant bits, or four times the
accuracy of all coefficients.
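The "two more significant bits" figure follows directly from the ratio of the two scale factors. A minimal Python check of that arithmetic is given below; the function name is illustrative, and the calculation assumes the ratio of the divisors is a power of two, as it is for 64 and 16.

```python
def extra_bits(old_divisor: int, new_divisor: int) -> int:
    """Number of significant bits gained by dividing by new_divisor
    instead of old_divisor (assumes the ratio is a power of two)."""
    ratio = old_divisor // new_divisor
    return ratio.bit_length() - 1


bits = extra_bits(64, 16)  # 64/16 = 4 = 2**2, so 2 bits
gain = 1 << bits           # resolution improves by a factor of 4
print(bits, gain)
```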
A complete accuracy study of this architecture needs to be
completed. Since we would like our IC to meet the International
Telegraph and Telephone Consultative Committee (CCITT) standard
for DCT ICs, different bsim models are needed which can discriminate
between truncation, rounding, and full precision
techniques of data handling. Different bsim models would reveal the
areas where accuracy is most critical, and thus deserve more
attention. Upon completion of a bsim model that meets the CCITT
standard, an impact analysis can be performed on the architecture.
Currently we are using 16-bit data paths throughout the chip. The
delay time would be increased only by the delay time of one full
adder for increasing the data path to 24 bits.
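To make the distinction between truncation and rounding concrete, the following Python sketch compares the two against the exact quotient when a value is scaled down by 2^k. It is not a bsim model and does not reproduce the chip's data path; the operand range and the choice of k = 6 (a divide by 64) are assumptions picked only for illustration.

```python
def shift_truncate(x: int, k: int) -> int:
    """Drop the k low-order bits (arithmetic shift, rounds toward -infinity)."""
    return x >> k


def shift_round(x: int, k: int) -> int:
    """Add half an LSB before shifting (round half up)."""
    return (x + (1 << (k - 1))) >> k


def max_abs_error(fn, k: int, values) -> float:
    """Largest deviation from the full precision quotient x / 2**k."""
    return max(abs(fn(x, k) - x / (1 << k)) for x in values)


values = range(-1024, 1024)  # assumed operand range
trunc_err = max_abs_error(shift_truncate, 6, values)
round_err = max_abs_error(shift_round, 6, values)
print(trunc_err, round_err)
```

Over this range truncation can be off by almost one output LSB, while rounding is off by at most half an LSB. A study of the kind described would need to track how such errors accumulate through each stage of the architecture.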
When the IC is returned from fabrication we need to test both
functionality and timing. We can test the functionality at Lehigh
University using a PC which has a custom board designed specifically
to automate the testing of custom ICs. This board was developed at
Lehigh University by two seniors as their senior project.
Unfortunately, it cannot perform timing or throughput analysis. For
timing analysis, a similar type of board could be developed which
contains a storage area. The storage area would be used to store test
vectors, control words, and the IC's output(s). The storage area, or
memory, can be loaded at a convenient speed for the PC. Then the
memory and IC would be clocked at a much higher rate, storing the
IC's result back into the memory for accuracy analysis by the PC.
The memory needs to be constructed from a fast technology (i.e.,
emitter coupled logic), and fast conversion buffers to handle the
interface between different technologies are also needed. These
buffers would be required between the memory and IC, and between the
memory and PC. As stated previously, the memory need not contain
just data, since we would also need to "clock in" control signals at the
same rate as data. A logical interface between the memory and PC
is required, which would incorporate operations such as a method of
controlling the simulation size (i.e., simulate for C cycles). Although
our design contains an on-chip clock driver, a general purpose clock
driver should be available so that all clock parameters (phi1, phi2,
gap times) can be modified, and the resulting performance analyzed.
Such a tester is envisioned in Fig. 23.
[Block diagram with the following labeled blocks: PC, Communications
Bus, Logic and Control, Input Memory, Buffers, IC Under Test, Output
Memory, Clock Driver (Phi1, Phi2).]
Figure 23: A PC based high speed IC tester.
The improvements suggested so far have been to reduce
hardware and to increase the computational accuracy. The
frequency of operation can be increased in two ways. The first
would require major modifications to the net model and to the magic
layout. Currently the delay time of the serial adders/subtracters is
matched to the delay time of the BRBA. The BRBA has a delay time
of approximately two cascaded CSAs. We can increase throughput by
inserting more pipeline stages into the critical path(s). This would
double the number of pipeline registers, and split the BRBA into two
single stage CSAs. The throughput would increase, but as always, at
the expense of area. The ROMs may also need a stage of pipelining
that is not present in the current design. The second way of
increasing throughput requires no redesigning at all. Since the IC
was laid out using scalable CMOS rules, it can be fabricated in 1.2 µm
CMOS by changing the extraction and CIF parameters. The only work
required would be to ensure correct placement of the i/o pads.
References 
[1] W. Li, "A New Algorithm to Compute the DCT and Its Inverse",
IEEE Trans. Signal Proc., vol. 39, no. 6, (June 1991),
pp. 1305-1313.
[2] N. Ahmed, et al., "Discrete Cosine Transform", IEEE Trans.
Comput., (January 1974), pp. 90-93.
[3] N.J. Narasimha and A.M. Peterson, "On the Computation of the
Discrete Cosine Transform", IEEE Trans. Commun., (June 1978),
pp. 934-936.
[4] J. Makhoul, "A Fast Cosine Transform in One and Two
Dimensions", IEEE Trans. ASSP, (February 1980), pp. 27-34.
[5] W. Chen, et al., "A fast computational algorithm for the discrete
cosine transform", IEEE Trans. Commun., vol. COM-25, no. 9,
(September 1977), pp. 1004-1009.
[6] M. T. Sun, et al., "A Concurrent Architecture for VLSI
Implementation of Discrete Cosine Transform", IEEE Trans.
Circuits Syst., vol. CAS-34, no. 8, (August 1987), pp. 992-994.
[7] T. C. Chen, et al., "VLSI Implementation of A 16x16 DCT",
in Proc. IEEE ICASSP'88, (1988), pp. 1973-1976.
[8] P. Duhamel and H. H'Mida, "New 2^n DCT Algorithms Suitable for
VLSI Implementation", in Proc. IEEE ICASSP'87, (1987),
pp. 1805-1808.
[9] A. Leger, et al., "Distributed Arithmetic Implementation of the
DCT for Real Time Photovideotex on ISDN", Proc. SPIE Int. Soc.
Opt. Eng., vol. 804, (1987), pp. 364-370.
[10] M. Vetterli and A. Ligtenberg, "A discrete Fourier-cosine
transform chip", IEEE J. Selected Areas Commun., vol. SAC-4,
no. 1, (January 1986), pp. 49-61.
[11] W. Li, Class notes for ECE450-10, Lehigh University, Fall 1989. 
[12] G.K. Wallace, "The JPEG Still Picture Compression Standard", 
Commun. ACM, vol. 34, no. 4, (April 1991), pp. 30-44. 
[13] D. LeGall, "MPEG: A Video Compression Standard for Multimedia 
Applications", Commun. ACM, vol. 34, no. 4, (April 1991), 
pp. 46-58. 
[14] CL550 JPEG Image Compression Processor, C-Cube 
Microsystems, (February 1990). 
[15] IMS A121 2-D Discrete Cosine Transform Image Processor, 
INMOS, (February 1989). 
[16] L64730 Discrete Cosine Transform Processor, LSI Logic, 
(July 1990). 
APPENDIX A: The areas of common cells tabulated. 
                                        height    width     area        aspect
Cell                                    (lambda)  (lambda)  (lambda^2)  ratio (h/w)
serial adder w/o regs                   122       211       25,742      0.58
serial adder /2                         123       274       33,702      0.45
full adder                              54        130       7,020       0.42
reg                                     62        80        4,960       0.78
reg                                     56        81        4,536       0.69
reg                                     54        66        3,564       0.82
reg                                     70        68        4,760       1.03
2 - reg                                 70        68        4,760       1.03
qualify circuit                         51        80        4,080       0.64
scaler rom                              167       467       77,989      0.36
decoder                                 127       104       13,208      1.22
word contents                           80        192       15,360      0.42
all pullups                             34        193       6,562       0.18
eval trans                              82        66        5,412       1.24
16 bit csa/feedback & regs              708       554       392,232     1.28
1 bit csa/feedback & regs               174       133       23,142      1.31
full adder                              51        130       6,630       0.39
reg                                     52        66        3,432       0.79
invert and circuit                      59        73        4,307       0.81
4 word PtoS                             222       555       123,210     0.40
reg                                     57        68        3,876       0.84
CPA                                     71        555       39,405      0.13
full adder                              51        130       6,630       0.39
2 - reg                                 57        68        3,876       0.84
4x2 decoder                             149       78        11,622      1.91
16 bit brba w feedback & regs & PtoS    665       1,422     945,630     0.47
16 bit brba                             196       1,420     278,320     0.14
brbaMSB cell                            196       99        19,404      1.98
brba cell                               196       88        17,248      2.23
16 word x 16-bit rom w/ decoder         370       490       181,300     0.76
8x4 decoder                             339       227       76,953      1.49
16 words                                240       192       46,080      1.25
serial subtracter                       123       306       37,638      0.40
serial subtracter /2                    123       277       34,071      0.44
back to back regs (2)                   100       62        6,200       1.61
mod 16 counter                          919       74        68,006      12.42
mod 8 counter                           456       82        37,392      5.56
M cell                                  114       473       53,922      0.24
Note that multiple "reg" cells were included to show the impact that 
control routing has on cell area. 
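The area and aspect ratio columns are derived from the height and width columns, which makes the table easy to cross-check or extend. The short Python sketch below recomputes three rows; the helper name and the choice of rows are illustrative only.

```python
def area_and_aspect(height: int, width: int):
    """Return (area in lambda^2, aspect ratio h/w rounded to two places)."""
    return height * width, round(height / width, 2)


# (cell name, height in lambda, width in lambda), taken from the table above
cells = [
    ("full adder", 54, 130),
    ("CPA", 71, 555),
    ("mod 16 counter", 919, 74),
]

for name, h, w in cells:
    area, aspect = area_and_aspect(h, w)
    print(f"{name:16s} {area:8,d} lambda^2  h/w = {aspect}")
```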
APPENDIX B: The complete pin description of the IC. 
frame 84P79X92
Pin name    Pad Location    Type          Pin Number
PadGnd0     top 20          PadGnd        1
Test5       top 19          PadTri        2
Test6       top 18          PadTri        3
Test7       top 17          PadTri        4
PadVdd0     top 16          PadVdd        5
Test8       top 15          PadTri        6
Test9       top 14          PadTri        7
Test10      top 13          PadTri        8
Test11      top 12          PadTri        9
PadGnd1     top 11          PadGnd        10
Test12      top 10          PadTri        11
Test13      top 9           PadTri        12
Test14      top 8           PadTri        13
Test15      top 7           PadTri        14
blank       top 6           PadBlank      15
PadVdd1     top 5           PadVdd        16
Xin0        top 4           Padin         17
Xin1        top 3           Padin         18
Xin2        top 2           Padin         19
Xin3        top 1           Padin         20
PadGnd2     top 0           PadGnd        21
PadGnd3     left 20         PadGnd        22
Xin4        left 19         Padin         23
Xin5        left 18         Padin         24
Xin6        left 17         Padin         25
Xin7        left 16         Padin         26
Vdd0        left 15         PadVddToDP    27
PadVdd2     left 14         PadVdd        28
Xin8        left 13         Padin         29
Xin9        left 12         Padin         30
PadGnd4     left 11         PadGnd        31
Xin10       left 10         Padin         32
Xin11       left 9          Padin         33
Vdd1        left 8          PadVddToDP    34
PadVdd3     left 7          PadVdd        35
Xin12       left 6          Padin         36
Xin13       left 5          Padin         37
PadGnd5     left 4          PadGnd        38
Xin14       left 3          Padin         39
Xin15       left 2          Padin         40
Vdd2        left 1          PadVddToDP    41
PadVdd4     left 0          PadVdd        42
START       bottom 0        Padin         43
DCT         bottom 1        Padin         44
ClkEn       bottom 2        PadClkIn      45
PadGnd6     bottom 3        PadGnd        46
Phi1        bottom 4        PadTri        47
Phi2        bottom 5        PadTri        48
Clk         bottom 6        PadClkIn      49
PadVdd5     bottom 7        PadVdd        50
Yout0       bottom 8        PadOut        51
Yout1       bottom 9        PadOut        52
PadGnd7     bottom 10       PadGnd        53
Yout2       bottom 11       PadOut        54
Yout3       bottom 12       PadOut        55
Yout4       bottom 13       PadOut        56
Yout5       bottom 14       PadOut        57
PadVdd6     bottom 15       PadVdd        58
Yout6       bottom 16       PadOut        59
Yout7       bottom 17       PadOut        60
Yout8       bottom 18       PadOut        61
Yout9       bottom 19       PadOut        62
PadGnd8     bottom 20       PadGnd        63
PadGnd9     right 0         PadGnd        64
Gnd0        right 1         PadGndToDP    65
Yout10      right 2         PadOut        66
Yout11      right 3         PadOut        67
PadVdd7     right 4         PadVdd        68
Yout12      right 5         PadOut        69
Yout13      right 6         PadOut        70
Yout14      right 7         PadOut        71
PadGnd10    right 8         PadGnd        72
Gnd1        right 9         PadGndToDP    73
PadGnd11    right 10        PadGnd        74
Yout15      right 11        PadOut        75
TestEn      right 12        PadIo         76
Test0       right 13        PadTri        77
Test1       right 14        PadTri        78
PadVdd8     right 15        PadVdd        79
Test2       right 16        PadTri        80
Test3       right 17        PadTri        81
Test4       right 18        PadTri        82
Gnd2        right 19        PadGndToDP    83
PadGnd12    right 20        PadGnd        84
APPENDIX C: Sparkstation Tools. 
The Sun operating system allows a substitution starting with the
user name. For example, to access the magic system files for the
Sparkstations, change directory (cd) to the Smagic account by typing
% cd /home/jupiterl/dslaweck/magic/sys
or equivalently
% cd ~dslaweck/magic/sys
Either command will make the current directory point to the Smagic
system directory. The second command should be used in case the
system administrator relocates these files to another partition or
drive. The first command should be used if access is required within
a shell script. A shell script is a file that contains one or more
commands that are normally typed in at the system prompt, plus
shell scripts have a few extra functions to automate simple decision 
processes. For example one might make a shell script to compile a C 
program, run the program with input and output files specified from 
the command line, and edit the output file. Normally this would 
require at least three commands, but now one only needs to type the 
name of the script file. 
To compile a net description on a Sparkstation type 
% net file.net file.sim 
to make file.sim from file.net. 
Next type 
% Spresim file.sim file.rsm cmos100.prm -nostack
to create file.rsm from file.sim 
RSIM on the Sparkstations is invoked by the command 
% Srsim file.rsm 
and functions exactly as the Sun 3/60 version. 
The magic program is a general layout editor and requires 
several support files. Magic has the location of these support files 
hard coded into its program starting with the username "cad". Magic
for the Sun 3/60s is kept in ~cad/magic/sys, with further
subdirectories containing support files for different technologies.
Although magic requires a user name of "cad", one can relocate the
files elsewhere, and in fact one does not even need a ~cad directory.
The Sparkstation version of magic is located in the directory
~dslaweck/magic/sys. The drawback is that in each directory that
Smagic (spark version) is to run, a soft link to Smagic and all the
support files is needed. This seems tedious, but two scripts were
written to automate the linking. The scripts are called link and
unlink, and are kept in ~dslaweck/magic/sys. To run Smagic, copy
the file link to the current directory by issuing the command 
% cp ~dslaweck/magic/sys/link . 
then type 
%link 
and then 
%Smagic 
If the directory is examined, notice that there are several soft links
back to the ~dslaweck/magic/sys directory. Only type the link command
once to create the links, and Smagic thereafter. The unlink command
removes each of the soft links for backups. It is a good idea to
back up all data files at least daily, to a different subdirectory, and
also to another disk (different computer system or floppy).
Unfortunately, when performing a backup, the copy command
duplicates files attached to soft links. This leads to many copies of
the entire magic system on your backup medium. This takes time
and space. The solution is to copy the unlink file to the current
directory
% cp ~dslaweck/magic/sys/unlink .
When making backups, first type
% unlink
then type the command to back up all the data files, for instance
% cp *.mag ~/backup
then re-link the files by typing
% link
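The reason for unlinking first is that a plain copy follows soft links and duplicates whatever they point to. The same effect can be had by copying only regular files and skipping links. The Python sketch below illustrates that idea; it is an alternative to, not a description of, the link and unlink scripts, and the function name and directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def backup_files(src_dir: str, dest_dir: str, pattern: str = "*"):
    """Copy regular files matching pattern from src_dir to dest_dir,
    skipping soft links so linked system files are not duplicated."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob(pattern)):
        if path.is_symlink() or not path.is_file():
            continue  # leave soft links and subdirectories alone
        shutil.copy2(path, dest / path.name)
        copied.append(path.name)
    return copied
```

For example, backup_files(".", "backup", "*.mag") would copy only the layout files that actually live in the working directory.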
The magic files for the DCT chip are located in
~dslaweck/magic/scmos/dctchip. They have been compressed to
save space on the disk drive. To view these files first copy them to
the current directory
% cp ~dslaweck/magic/scmos/dctchip/*.mag.Z .
and then type
% uncompress *.mag.Z
The highest level cell for simulations is top.mag.
To create an RSIM simulation of a layout on a Sparkstation, follow 
the procedure for extraction explained in Chapter 8, then type (on a 
Sun 3/60) 
%cadmake file.sim 
That will correctly make file.sim. After cadmake finishes type 
%Ssetprm 
on the Sparkstation to create file.rsm from file.sim. Running RSIM on 
a Sparkstation is the same as described in Chapter 8. Table 4 
contains a summary of common commands for both computer 
systems. 
Biography 
Darren Slawecki received his BSEE from Drexel University in 1989, 
and his MSEE from Lehigh University in 1991. He was a Teaching 
Assistant from 1989 to 1991 at Lehigh University. His research 
interests are in the area of Digital Signal Processing, Image 
Processing, and VLSI design. 
