VLSI implementation of distributed arithmetic by Huang, Dajen
Lehigh University
Lehigh Preserve
Theses and Dissertations
1989
VLSI implementation of distributed arithmetic
Dajen Huang
Lehigh University
Follow this and additional works at: https://preserve.lehigh.edu/etd
Part of the Electrical and Computer Engineering Commons
This Thesis is brought to you for free and open access by Lehigh Preserve. It has been accepted for inclusion in Theses and Dissertations by an
authorized administrator of Lehigh Preserve. For more information, please contact preserve@lehigh.edu.
Recommended Citation
Huang, Dajen, "VLSI implementation of distributed arithmetic" (1989). Theses and Dissertations. 5220.
https://preserve.lehigh.edu/etd/5220
VLSI IMPLEMENTATION OF 
DISTRIBUTED ARITHMETIC 
by 
Dajen Huang 
A Thesis 
Presented to the Graduate Committee 
of Lehigh University 
in Candidacy for the degree of 
Master of Science 
in
Electrical Engineering
Lehigh University
1989 
This thesis is accepted and approved in partial fulfillment of the requirements for 
the degree of Master of Science in Electrical Engineering. 
Date
Date CSEE Department Chairperson
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1  INTRODUCTION
CHAPTER 2  DISTRIBUTED ARITHMETIC
    2.1 Conventional Descriptions of Digital Filter
    2.2 Distributed Arithmetic
    2.3 Digital Filter Structures Described by Distributed Arithmetic
CHAPTER 3  A 4-TAP CONVOLUTION PROCESSING UNIT
    3.1 Chip Architecture
    3.2 VLSI Implementation
    3.3 Sign Bit Treatments
    3.4 Layout Considerations
CHAPTER 4  CHIP DESIGN METHODOLOGY
    4.1 Design Procedure
    4.2 Testability
    4.3 Characterization of the Digital Filter
CHAPTER 5  A DISCRETE COSINE TRANSFORM CHIP
    5.1 Introduction of the DCT
    5.2 The New Computing Algorithm
    5.3 Chip Architecture
    5.4 Comparison of Different Computing Methods
CHAPTER 6  CONCLUSION
    6.1 Future Developments
    6.2 Prospective
BIBLIOGRAPHY
APPENDIX 1: Behavior Modeling
    4-tap digital filter
    32x1 DCT
APPENDIX 2: RSIM Simulation
    4-tap digital filter
APPENDIX 3: Design files index
LIST OF TABLES
Table
I     Sign Bit Treatments
II    Comparison of 32x32 DCT Computing Algorithms
LIST OF FIGURES
Figure
3-1   Chip Architecture of the 4-tap Digital Filter
3-2   Register Circuit
3-3   Decoder Circuits
3-4   Bonding Diagram
3-5   Die Photo
4-1   VLSI Design Flowchart
4-2   Test Pattern for Performance Testing
4-3   Parallel/Serial Mixed Output Register
5-1   Row-Column Decomposition of Two-dimensional DCT
5-2   Partition Algorithm for 32-point DCT Implementation
5-3   Chip Architecture of 32x1 DCT Block
ABSTRACT 
The advancement of microelectronic technology and the development of new algorithms depend on each other. Current VLSI technology opens the way to implementing digital signal processing (DSP) chips for applications involving large signal bandwidths and high-speed computation. When the fundamental operations of convolution and multiplication are mixed, distributed arithmetic (in contrast to the classical "concentrated arithmetic") is very attractive for DSP hardware realizations because of its compatibility and flexibility.
A 4-tap convolution processing unit is designed and fabricated in double-metal 2 µm CMOS technology. At its 33 MHz operating frequency it completes the equivalent of 99 million additions and 132 million multiplications per second. This characterized 4-tap unit could easily be expanded to larger tap sizes (32, for example) and higher precision, limited only by the available silicon area. By applying specific permutations to the input data sequence, the Discrete Cosine Transform (DCT) becomes a circular correlation, and distributed arithmetic is also efficient for this computation. A DCT chip for use in image coding, polyphase filter banks, and FFT evaluation is investigated and is ready for VLSI implementation. Distributed arithmetic is therefore shown to be an efficient algorithm for the VLSI implementation of digital filters (linear convolution), the DFT (circular convolution), and the DCT (circular correlation).
CHAPTER 1 
INTRODUCTION 
The aim of this study is to investigate the VLSI (Very Large Scale Integrated circuit) implementation of "distributed arithmetic" for DSP (digital signal processing) applications. The vehicle chosen is the fundamental building block in digital system design: the digital filter, or convolution processor. To show the feasibility of this algorithm for a different operation, a DCT (discrete cosine transform) chip was also derived as another example.
The techniques and applications of the DSP field are as old as Newton and Gauss and as new as digital computers and integrated circuits. During the 1950s, digital computers came into use in signal processing because their flexibility made it possible to simulate a signal processing system before implementing it in analog hardware. In this way, a new signal processing algorithm, or system, could be studied in a flexible experimental environment before economic and engineering resources were committed to constructing it. However, the advent of modern semiconductor technology has caused a revolution in which more and more digital algorithms are implemented in special-purpose hardware rather than as software for a general-purpose computer. Furthermore, some sophisticated signal processing algorithms that previously appeared impractical now have practical implementations with current VLSI technology. This also explains the current interest in new digital signal processing algorithms for VLSI.
In this study, the mathematical description and design of a 4-tap convolution unit and a 32x32 DCT chip are discussed. The design procedure is described in detail and involves many CAD tools: SPICE (Simulation Program with Integrated Circuit Emphasis) [1] for circuit simulation, BSIM (Behavior SIMulation) [2] and RSIM [3] for logic/timing simulation, NET [4] for hardware description, MAGIC [5] for layout, and others. The use of appropriate computer simulations and checks makes first-pass IC design work possible.
The idea of distributed arithmetic is outlined in the next chapter. It shows how linear convolution (digital filter) and circular correlation (DCT) can be realized without using a multiplier. The DCT example requires a permutation of the input data sequence before it can be treated as a cyclic convolution, while the linear convolution case is direct. Several different hardware mechanisms for implementing distributed arithmetic, that is, trade-offs between arithmetic operations and silicon area, are also discussed.
Hardware realization, the path from abstract algorithm to real chip, is examined in Chapter 3. The chip structure and leaf-cell design are crucial in determining chip speed, so they are analyzed in detail. The sign bit problem that arises in arithmetic operations on 2's complement numbers is also covered.
In Chapter 4, the design procedure and the CAD tools used throughout the project are presented. A multiplexed serial output and a single-phase speed testing method are considered to make the chip testable under a pipelined architecture. The characterization of the fabricated convolution chip is described, including functionality and speed performance.
The well-simulated DCT chip, based on a newly developed algorithm, is presented in Chapter 5. Its behavior model and simulation results are included in the appendix for reference.
CHAPTER 2 
DISTRIBUTED ARITHMETIC 
2.1 CONVENTIONAL DESCRIPTIONS OF DIGITAL FILTER 
In the 1980s, VLSI developments have dramatically reduced the cost and power consumption of digital filters and have led to much more widespread application of digital signal processing.
A digital filter is superior to its analog counterpart in the following aspects:
1. Programmable (filter characteristics easily changed)
2. Reliable and repeatable
3. Free from component drift
4. No tuning required
5. No precision components, no component matching
6. Superior performance (linear phase, for example)
Different mathematical descriptions of a digital filter may suggest different hardware realizations. After a brief review of general digital filter forms, the use of distributed arithmetic in digital filter implementation is discussed.
The transfer characteristics of a digital filter are commonly described in terms of its Z-domain transfer function [6],
\[ H(z) = \frac{\sum_{i=0}^{N} b_i z^{-i}}{1 + \sum_{i=1}^{N} a_i z^{-i}} \tag{2.1} \]
where $z^{-1}$ represents the unit delay. The corresponding difference equation is
\[ y(k) = \sum_{i=0}^{N} b_i\,x(k-i) - \sum_{i=1}^{N} a_i\,y(k-i), \tag{2.2} \]
from which the directly equivalent digital circuit form can be realized. However, three canonical forms, or variations thereof, are most often employed. These forms are canonical in the sense that a minimum number of adders, multipliers, and delays (shift registers) are required to realize (2.1) in the general case. The first of these forms reduces the number of delays to N by considering
\[ H(z) = H_1(z) \cdot H_2(z) = \frac{W(z)}{X(z)} \cdot \frac{Y(z)}{W(z)} \]
or, in terms of difference equations,
\[ w(k) = x(k) - \sum_{i=1}^{N} a_i\,w(k-i), \tag{2.3} \]
\[ y(k) = \sum_{i=0}^{N} b_i\,w(k-i). \tag{2.4} \]
However, it has been pointed out in [7], [8] that use of this direct form is avoided because the accuracy requirements on the coefficients $\{a_i\}$ and $\{b_i\}$ are often severe (finite register effect). The second canonical form corresponds to a factorization of the numerator and denominator polynomials of (2.1) to produce
\[ H(z) = c \cdot \prod_{i=0}^{m} H_i(z) = c \cdot \prod_{i=0}^{m} \frac{b_{2i} z^{-2} + b_{1i} z^{-1} + 1}{a_{2i} z^{-2} + a_{1i} z^{-1} + 1}, \tag{2.5} \]
where m is the integer part of (N+1)/2. This is the cascade form of a digital filter. Note that $a_{2i}$ and $b_{2i}$ could be zero for some i. The third canonical form results from a partial fraction expansion of (2.1), such that
\[ H(z) = c + \sum_{i=0}^{m} H_i(z) = c + \sum_{i=0}^{m} \frac{b_{1i} z^{-1} + b_{0i}}{a_{2i} z^{-2} + a_{1i} z^{-1} + 1}, \tag{2.6} \]
where $c = b_N / a_N$. This realization is called a parallel form.
In summary, all three canonical forms are entirely equivalent with regard to the amount of storage required (N shift registers) and the number of arithmetic operations required (2N+1 multipliers and 2N adders). Next, we will see how distributed arithmetic is applied to digital filter implementations.
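The equivalence of the direct difference equation (2.2) and the first canonical form (2.3), (2.4) can be illustrated with a short sketch. The function names, the coefficient lists `b` and `a` (with `a[0]` unused and taken as 1), and the test values are assumptions made for this illustration only; they are not taken from the thesis.

```python
def direct_form(b, a, x):
    """Evaluate (2.2): y(k) = sum b[i] x(k-i) - sum_{i>=1} a[i] y(k-i)."""
    y = []
    for k in range(len(x)):
        acc = sum(b[i] * x[k - i] for i in range(len(b)) if k - i >= 0)
        acc -= sum(a[i] * y[k - i] for i in range(1, len(a)) if k - i >= 0)
        y.append(acc)
    return y


def canonical_form(b, a, x):
    """Evaluate (2.3) and (2.4) with a single delay line of N stored values."""
    n = len(b) - 1
    w = [0.0] * (n + 1)          # w[0] is w(k); w[i] is w(k-i)
    y = []
    for xk in x:
        w[0] = xk - sum(a[i] * w[i] for i in range(1, n + 1))   # (2.3)
        y.append(sum(b[i] * w[i] for i in range(n + 1)))        # (2.4)
        w = [0.0] + w[:-1]       # shift the delay line by one sample
    return y


# Hypothetical second-order example; a[0] is a placeholder and is never read.
b = [1.0, 0.5, 0.25]
a = [1.0, -0.5, 0.25]
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Both functions produce the same impulse response, while the canonical form stores only N past values of w(k) instead of N past inputs and N past outputs.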
2.2 DISTRIBUTED ARITHMETIC
Distributed arithmetic was suggested in the mid-1970s as a new hardware realization [9] [10], based on the fact that all the coefficients of a digital filter are constant, so that it is possible to use ROM (Read Only Memory) instead of multipliers. Later, Burrus [11] provided a mathematical framework for this new approach to digital filter implementation.
It has been noticed that multidimensional convolution techniques, when used with various fast algorithms for short-length convolutions, improve the efficiency of one-dimensional convolution of the original signals [12]. The basic idea of distributed arithmetic is to observe that conventional convolution is already two-dimensional:
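\[ y(k) = \sum_{i=0}^{N-1} h(i)\,x(k-i) = \sum_{i=0}^{N-1} \Big[ \sum_{b_1=0}^{B_h-1} h(i,b_1)\,2^{-b_1} \Big] \Big[ \sum_{b_2=0}^{B_x-1} x(k-i,b_2)\,2^{-b_2} \Big] \tag{2.7} \]
where
\[ h(i,b_1),\; x(k-i,b_2) \in \begin{cases} \{0,-1\}, & \text{if } b_1 = 0 \text{ or } b_2 = 0, \\ \{0,1\}, & \text{else.} \end{cases} \tag{2.8} \]
Interchanging the order of summation gives
\[ y(k) = \sum_{b_1=0}^{B_h-1} \sum_{b_2=0}^{B_x-1} \sum_{i=0}^{N-1} h(i,b_1)\,x(k-i,b_2)\,2^{-(b_1+b_2)}. \tag{2.9} \]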
This is a standard two-dimensional convolution form. The data sequence here is viewed as a two-dimensional binary signal (again, note that h(·) and x(·) ∈ {0,1}) with the rows giving the words and the columns giving the binary positions. Thus a one-dimensional convolution of number strings becomes a two-dimensional convolution of bit arrays. For example:
1-D: Let
\[ x(k) * h(k) = \{x(0),\, x(1)\} * \{h(0),\, h(1)\} = \{x(0)\,h(0),\; x(0)\,h(1) + x(1)\,h(0),\; x(1)\,h(1)\} = y(k). \tag{2.10} \]
2-D: Assume all words are two bits long; then the output y is three bits long.
\[ \begin{bmatrix} y_{0,0} & y_{0,1} & y_{0,2} \\ y_{1,0} & y_{1,1} & y_{1,2} \\ y_{2,0} & y_{2,1} & y_{2,2} \end{bmatrix} = \begin{bmatrix} x_{0,0} & x_{0,1} \\ x_{1,0} & x_{1,1} \end{bmatrix} * \begin{bmatrix} h_{0,0} & h_{0,1} \\ h_{1,0} & h_{1,1} \end{bmatrix} \tag{2.11} \]
Now, the operation along the horizontal dimension is the convolution (word multiplication is bit convolution), for example
\[ y(0) = x(0) \cdot h(0) = [x_{0,0} \;\; x_{0,1}] * [h_{0,0} \;\; h_{0,1}], \]
and the operation along the vertical dimension is the usual multiplication.
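The remark that word multiplication is bit convolution can be checked with a small sketch. It assumes unsigned integer words with LSB-first bit lists rather than the fractional 2's complement words of (2.7) and (2.8); the helper names are hypothetical and chosen only for this illustration.

```python
def bits_lsb_first(value, width):
    """Return the `width` low-order bits of a non-negative integer, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def convolve(u, v):
    """Plain linear convolution of two sequences."""
    out = [0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out


def multiply_by_bit_convolution(x, h, width):
    """Multiply two unsigned words by convolving their bit strings."""
    c = convolve(bits_lsb_first(x, width), bits_lsb_first(h, width))
    # Entry k of the convolution carries weight 2^k.
    return sum(ck << k for k, ck in enumerate(c))
```

For two-bit words the convolution has three entries, matching the three columns of y in (2.11); the entries can exceed 1 until carries are resolved by the weighted sum.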
Down at the bit level, arithmetic operations can be "re-distributed" because the order of the two operations, multiplication and convolution, is interchangeable; hence the name "distributed arithmetic". Furthermore, if we convert (2.9) into a one-dimensional bitstring-to-bitstring convolution, several interesting mechanisms can be developed by using different block lengths [10]. One of the extreme cases is the convolution of a word string with a bit string,
\[ y(k) = \sum_{b_1=0}^{B_x-1} \Big[ \sum_{i=0}^{N-1} h(i)\,x(k-i,b_1) \Big] 2^{-b_1}. \tag{2.12} \]
This is the scheme used throughout this paper. Note that the x(·) sequence is "distributed" in the arithmetic operation. The scheme is very attractive because a table look-up can be used to pre-calculate the partial products, $\sum_{i=0}^{N-1} h(i)\,x(k-i,b_1)$, which previously required sophisticated multiplications. The detailed hardware realization will be discussed later.
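A minimal sketch of the table look-up scheme in (2.12) follows. It assumes integer coefficients and integer 2's complement inputs with weights $2^{b}$ (LSB first, sign bit weighted negatively), a rescaled version of the fractional $2^{-b_1}$ convention used in the text; `build_table` and `da_output` are hypothetical names for this illustration.

```python
def build_table(h):
    """Precompute all 2^N sums of coefficients; address bit i selects h[i]."""
    n = len(h)
    return [sum(h[i] for i in range(n) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_output(table, window, bits):
    """Compute sum h[i] * window[i] using only look-ups, shifts, and adds.

    window[i] holds x(k-i) as a signed integer that fits in `bits`-bit
    two's complement.
    """
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        # Gather bit b of every input word into one memory address.
        addr = 0
        for i, xv in enumerate(window):
            addr |= (((xv & mask) >> b) & 1) << i
        partial = table[addr]
        # The sign bit position carries a negative weight.
        weight = -(1 << b) if b == bits - 1 else (1 << b)
        acc += weight * partial
    return acc


# Hypothetical 4-tap example.
h = [3, -1, 4, 2]
table = build_table(h)
```

With four taps the table holds 16 words, which is the ROM depth quoted later for the fabricated chip; each output needs one look-up per input bit and no multiplier.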
2.3 DIGITAL FILTER STRUCTURES DESCRIBED BY DISTRIBUTED 
ARITHMETIC 
It is the implementation of digital filters with all precalculated partial products stored in memory that has received the greatest interest. Observing (2.12), the $B_x$ partial products $y_1(k,b_1)$ are nothing but the possible sums of the filter coefficients h(i), because $x(k-i,b_1) \in \{0,1\}$. Therefore all $2^N$ partial products can be precalculated and stored in a memory, and the input data sequence {x(k)} is then used to address out $y_1(k,b_1)$ for $b_1 \in [0, B_x-1]$. Once they are read out, y(k) can be obtained by right-shifting each $y_1(k,b_1)$ by $b_1$ bits and summing over $b_1$. Next, several different mechanisms for hardware realization [13] are presented; they trade arithmetic against memory.
(a) Single memory
\[ y(k) = \sum_{b_1=0}^{B_x-1} \sum_{i=0}^{N-1} h(i)\,x(k-i,b_1)\,2^{-b_1} \tag{2.13} \]
Two-dimensional convolution with the h(·) word sequence as one dimension and the x(·) bit string as the other.
(b) Memory partition
\[ y(k) = \sum_{b_1=0}^{B_x-1} \sum_{q=0}^{Q-1} \sum_{i=qP}^{(q+1)P-1} h(i)\,x(k-i,b_1)\,2^{-b_1} \tag{2.14} \]
If $2^N$ is too big, add one more dimension by partitioning the N points of {h(i)} into Q blocks with each block having P words.
(c) L bits per word addressing of partial products
\[ y(k) = \sum_{g=0}^{G-1} \Big[ \sum_{l=0}^{L-1} \sum_{i=0}^{N-1} h(i)\,x(k-i,\, g \cdot L + l)\,2^{-l} \Big] 2^{-g \cdot L} \tag{2.15} \]
If $B_x$ is too long (too many addressed-out terms), add one more dimension by partitioning each x(·) word into G groups with each group having L bits, such that $G \cdot L = B_x$. In this way, N·L bits are used to address the memories at a time. This is called multiprecision arithmetic. Also note that we save a factor of L in arithmetic operations at the cost of an increase in memory size ($2^N \to 2^{LN}$).
(d) Input data word partition
\[ y(k) = \sum_{g=0}^{G-1} \sum_{l=0}^{L-1} \sum_{i=0}^{N-1} h(i)\,x(k-i,\, g \cdot L + l)\,2^{-(g \cdot L + l)} \tag{2.16} \]
This has the same form as (c) but uses 1-bit-per-data-word addressing; it needs (G·L - G) more additions in exchange for a memory size of only $2^N$. It is a trade-off between memory layout and bus wiring when compared with (a).
(e) Mixed memory partition and input data word partition
\[ y(k) = \sum_{g=0}^{G-1} \sum_{q=0}^{Q-1} \sum_{l=0}^{L-1} \sum_{i=qP}^{(q+1)P-1} h(i)\,x(k-i,\, g \cdot L + l)\,2^{-l}\,2^{-g \cdot L} \tag{2.17} \]
A mixture of (b)'s h(k) word blocks and (c)'s x(k) bit groups, constituting a four-dimensional convolution.
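The memory partition of mechanism (b) can be sketched as follows: N coefficients are split into Q blocks of P words, so Q tables of $2^P$ words replace one table of $2^N$ words, at the price of Q - 1 extra additions per bit position. The sketch assumes integer 2's complement data with LSB-first weights (a rescaling of the fractional convention in the text); all names and the example values are hypothetical.

```python
def build_partitioned_tables(h, p):
    """Build one look-up table per block of p coefficients."""
    tables = []
    for start in range(0, len(h), p):
        block = h[start:start + p]
        tables.append([sum(c for i, c in enumerate(block) if (addr >> i) & 1)
                       for addr in range(1 << len(block))])
    return tables


def partitioned_output(tables, window, bits, p):
    """Evaluate sum h[i] * window[i] with one small look-up per block and bit."""
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        partial = 0
        for q, table in enumerate(tables):
            addr = 0
            for i, xv in enumerate(window[q * p:(q + 1) * p]):
                addr |= (((xv & mask) >> b) & 1) << i
            partial += table[addr]          # extra adds across the Q blocks
        weight = -(1 << b) if b == bits - 1 else (1 << b)
        acc += weight * partial
    return acc


# Hypothetical 8-tap filter split into Q = 2 blocks of P = 4 words.
h8 = [1, -2, 3, -4, 5, -6, 7, -8]
tables8 = build_partitioned_tables(h8, 4)
```

In this example the two tables hold 32 words in total, compared with 256 words for a single memory addressed by all eight inputs.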
In conclusion, it is possible to obtain various distributed arithmetic structures by the techniques of bit grouping, word blocking, and concurrent processing. As discussed in [11], whenever a digital filter structure is described down to the bit level, the size of the grouping or blocking need not be a multiple of the data sequence's word length. Equation (2.12) is just one extreme case: the convolution of a bit string and a word string.
CHAPTER 3 
A 4-TAP CONVOLUTION PROCESSING UNIT
3.1 CHIP ARCHITECTURE 
Digital signal processing involves a large amount of input data that requires repetitive calculations. Three classes of computer architecture are usually applied to achieve high computational throughput [14]: bus-based, pipelined, and parallel. The following 4-tap convolution chip is designed as a combined parallel and pipelined architecture [15].
• Parallel and Pipeline 
To have the input signal processed in real time or with a reasonable delay, the chip shown in Fig. 3-1 is designed as a combination of parallel and pipelined architecture. A parallel processor architecture requires duplication of the hardware as the price of increased throughput and reduced data latency, while pipeline processing does not require this replication. Pipelining is best suited to special-purpose applications where data flow and hardware requirements are fixed. Once data enter the pipeline, a prescribed rigid sequence of operations is performed. Control is only required to initiate the processing sequence. After a certain latency period, data exit the pipeline. Pipelining can compensate for latency by increasing bandwidth, but it also results in a higher gate count due to the additional latches [16].
Fig. 3-1: Chip architecture of the 4-tap digital filter (input shift registers X3 to X0, decoders, precharged ROM, 4-2 adder tree, output registers, and multiplexers feeding the parallel outputs OutC and OutS and a serial-out shift register; clocks Phi1, Phi2, Phi1E, and Psi; control SOE)
• Clocking scheme
A two-phase non-overlapping clock scheme is used.
3.2 VLSI IMPLEMENTATION 
The basic cells and circuit elements used to build a VLSI chip are crucial in defining the speed and performance of the chip. A good example [17] is the investigation made of the exclusive-OR logic gate. Even such a simple logic function can vary a lot in its implementations. The considerations include speed, number of transistors (layout size), and the logic family to be used. The same philosophy applies to the other basic gates, which are briefly reviewed as follows:
• Register 
A static CMOS latch is used. Note that the feedback inverter must be weak enough to remove the potential for charge sharing (signal fighting). Replacing the transmission gate with a pass transistor saves transistor count, at the cost of signal degradation and leakage current through the slightly turned-on PMOS transistor (power dissipation); see Fig. 3-2.
Fig. 3-2: Register circuit
Fig. 3-3: Decoder circuits: (A) static 4-input NAND, (B) AND plane with domino output, (C) NOR plane
• Decoder 
Considering speed and testability, the static NAND decoder with a NOR-gate qualified clock, Fig. 3-3(a), is selected over the other two, (b) and (c). Because the ROM is a precharged logic, the signal that precedes it, the decoder output, must remain at logic low during the precharge period. This can be achieved by using a qualified clock as in (a), a domino logic output as in (b), or an added evaluate NMOS transistor at the bottom of the ROM plane.
• ROM 
The access time is a major factor in memory design. The memory cell used [18] is simulated using the method suggested by Glasser and Dobberpuhl. The result gives an 8 ns access time for a 16 x 18-bit ROM. Another feature of this ROM is the automatic generation of its m2c (metal-2 contact) photomask. A C program that takes the digital filter's coefficients as input from the designer generates an m2c layout file in MAGIC format. The architecture is very flexible and suitable for the automatic design of digital filter chips.
• 4-2 Adder 
It is clear by now that no multiplier is required in our chip, but adder circuitry still weighs heavily in the structure. A specially designed 4-2 adder [19] is used to construct the adder tree. It reduces the stage count in the tree from $\log_{1.5} N$ to $\log_2 N$ compared with a conventional full-adder implementation, where N is the total number of addends.
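The behavior of a 4-2 adder tree can be sketched at the word level. This is a behavioral model only, assuming non-negative integer addends and modeling each 4-2 adder as two cascaded carry-save (3-2) steps; it is not the transistor-level cell of [19], and the function names are hypothetical. The stage count returned covers only the reduction to a (sum, carry) pair, not a final carry-propagate addition.

```python
def carry_save_3to2(a, b, c):
    """Compress three addends into a sum word and a carry word."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry


def adder_4to2(a, b, c, d):
    """Compress four addends into two; the total a+b+c+d is preserved."""
    s1, c1 = carry_save_3to2(a, b, c)
    return carry_save_3to2(s1, c1, d)


def reduce_with_4to2(addends):
    """Reduce addends to at most two words; return (words, stage count)."""
    values = list(addends)
    stages = 0
    while len(values) > 2:
        nxt = []
        for i in range(0, len(values), 4):
            group = values[i:i + 4]
            group += [0] * (4 - len(group))   # pad an incomplete group
            nxt.extend(adder_4to2(*group))
        values = nxt
        stages += 1
    return values, stages
```

Each stage halves the number of words, which is where the logarithm to base 2 comes from; a tree of 3-2 adders only reduces the count by a factor of 1.5 per stage.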
3.3 SIGN BIT TREATMENTS 
In distributed arithmetic structures, each point of the filter output is obtained by summing up $B_x$ numbers ($AA_i$, i = 0, 1, ..., $B_x - 1$). These numbers are addressed out from the memories by the input {x(k)}. This operation involves two problems: 1) 2's complement addition requires sign bit extension, and 2) the sign bits of {x(k)} should address out a negative $AA_0$; recall (2.8).
For the first problem, the straightforward method is to sign-extend each $AA_i$ to the sign bit position of $AA_0$. But this would cause a large fanout at the memory output and slow down the circuit. The second problem can be solved by taking the one's complement of $AA_0$ and then adding 1. The difficulty is handling the extra "1" in the adder tree. H. Yamauchi et al. [20] suggested an "add one" method in their multiplier design. Their algorithm can be modified as follows to fix the two problems mentioned.
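Suppose 4 numbers are addressed out, each 4 bits long. After the weight shift between the $AA_i$ and sign bit extension, we compute the total sum as
\[ \Big[ a_s (2^7 - 2^3) + \sum_{i=0}^{2} a_i 2^i \Big] + \Big[ b_s (2^7 - 2^4) + \sum_{i=1}^{3} b_i 2^i \Big] + \Big[ c_s (2^7 - 2^5) + \sum_{i=2}^{4} c_i 2^i \Big] + \Big[ d_s (2^7 - 2^6) + \sum_{i=3}^{5} d_i 2^i \Big] \tag{3.1} \]
where $a_s$, $b_s$, $c_s$, $d_s$ are the sign bits of $AA_3$, $AA_2$, $AA_1$, $AA_0$, respectively. Next, consider only the sign bit extension part and use the replacement $a_s = 1 - \bar{a}_s$ (and likewise for the other sign bits). We have
\[ \begin{aligned} S &= (1 - \bar{a}_s)(2^7 - 2^3) + (1 - \bar{b}_s)(2^7 - 2^4) + (1 - \bar{c}_s)(2^7 - 2^5) + (1 - \bar{d}_s)(2^7 - 2^6) \\ &= \bar{a}_s 2^3 + \bar{b}_s 2^4 + \bar{c}_s 2^5 + \bar{d}_s 2^6 - (2^3 + 2^4 + 2^5 + 2^6) + \big(4 - (\bar{a}_s + \bar{b}_s + \bar{c}_s + \bar{d}_s)\big) \cdot 2^7 \\ &= (\bar{a}_s + 1) \cdot 2^3 + \bar{b}_s \cdot 2^4 + \bar{c}_s \cdot 2^5 + \bar{d}_s \cdot 2^6 + \big(3 - (\bar{a}_s + \bar{b}_s + \bar{c}_s + \bar{d}_s)\big) \cdot 2^7 \end{aligned} \tag{3.2} \]
The last term above is at weight $2^7$ and can be discarded. The remaining expression is interpreted as:
a) invert the $AA_3$ sign bit and add it to the left of the sign bit.
b) invert all other $AA_i$ sign bits.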
Next, we discuss the problem resulting from the sign bits of {x(k)}. In the above example, $AA_0$ is the number addressed by these sign bits, and it should be negated, so that
\[ (\text{2's complement of } AA_0) = (\text{1's complement of } AA_0) + 1. \tag{3.3} \]
After $AA_0$ is left-shifted 3 bits, the "1" here actually means $1 \cdot 2^3$. Using the fact that $2^3 = 2^2 + 2 \cdot 2^1$, this "1" can be replaced by modifying $AA_0$ and $AA_1$ as
\[ AA_0 = \Big( d_s \cdot 2^6 + \sum_{i=3}^{5} \bar{d}_i \cdot 2^i \Big) + 2^2 + 2^1, \quad \text{and} \quad AA_1 = \Big( \bar{c}_s \cdot 2^5 + \sum_{i=2}^{4} c_i \cdot 2^i \Big) + 2^1. \tag{3.4} \]
In the next section, it will be explained that the sign bit treatments can actually be done in the ROM. Instead of using (3.4), we can simply take the 2's complement of $AA_0$ before writing it into the memory. In summary, the difficulties with the sign bit can be fixed in a hardware-economical way by the algorithm below.
Table I: Sign Bit Treatments
Assume a total of n addends $AA_0$, $AA_1$, ..., $AA_{n-1}$ are to be summed, and $AA_0$ is the highest-weight number.
a) invert the sign bit of $AA_{n-1}$ and add it to the left of the sign bit.
b) invert the sign bit in each of $AA_1$ through $AA_{n-2}$.
c) invert every bit in $AA_0$ except the sign bit.
d) do the weight shifting of the $AA_i$ for the addition operation.
e) in $AA_0$, set the next two bits to the right of the original LSB to "1".
f) in $AA_1$, set the bit to the right of the original LSB to "1".
g) sum them up.
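The steps of Table I can be checked exhaustively for the four-addend, 4-bit example of this section. The sketch below assumes the weight shifts of that example ($AA_3$ unshifted, $AA_0$ shifted left by 3 and negated) and compares results modulo $2^7$, since terms at weight $2^7$ are discarded; the function names and the bit-pattern encoding are assumptions made for this illustration.

```python
from itertools import product

MOD = 1 << 7  # terms at weight 2^7 and above are discarded


def signed4(pattern):
    """Interpret a 4-bit pattern as a 2's complement integer."""
    return pattern - 16 if pattern & 0b1000 else pattern


def reference_sum(aa0, aa1, aa2, aa3):
    """Weighted sum with AA0 negated, computed with ordinary arithmetic."""
    total = signed4(aa3) + 2 * signed4(aa2) + 4 * signed4(aa1) - 8 * signed4(aa0)
    return total % MOD


def treated_sum(aa0, aa1, aa2, aa3):
    """Same sum using the Table I treatment, with no sign extension at all."""
    def split(pattern):
        return (pattern >> 3) & 1, pattern & 0b111   # sign bit, low bits

    s0, l0 = split(aa0)
    s1, l1 = split(aa1)
    s2, l2 = split(aa2)
    s3, l3 = split(aa3)
    t3 = l3 + (((1 - s3) + 1) << 3)                        # step a
    t2 = (l2 << 1) + ((1 - s2) << 4)                       # steps b, d
    t1 = (l1 << 2) + ((1 - s1) << 5) + (1 << 1)            # steps b, d, f
    t0 = ((l0 ^ 0b111) << 3) + (s0 << 6) + (1 << 2) + (1 << 1)  # steps c, d, e
    return (t3 + t2 + t1 + t0) % MOD                       # step g


def check_all():
    """Compare both sums over every combination of four 4-bit patterns."""
    return all(treated_sum(*v) == reference_sum(*v)
               for v in product(range(16), repeat=4))
```

The check only establishes agreement modulo $2^7$; whether the true sum fits in that range depends on the filter coefficients, which this sketch does not model.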
3.4 LAYOUT CONSIDERATIONS 
A special feature of the chip layout is that the sign bit treatments are completed in the ROM. This saves a great deal of river routing from the ROM outputs to the 4-2 adder tree, at the cost of a small increase in ROM word length from 18 bits to 21 bits. In other words, the partial products are pre-calculated, shifted, and sign-compensated, then programmed into the ROM according to the order in which they occur in the adder tree.
Special efforts are made to keep sub-cells pitch-matched so that there is almost no channel routing between modules. This is very important not only for saving layout area in the current design but also for future expansions based on this 4-tap unit.
The chip is packaged in a 28-pin, 4.6 x 3.4 mm² ceramic DIP. Its bonding diagram and die photo are shown in Fig. 3-4 and Fig. 3-5, respectively.
Fig. 3-4: Bonding diagram
Fig. 3-5: Bar photo
CHAPTER 4 
CHIP DESIGN METHODOLOGY 
4.1 DESIGN PROCEDURE
The complexity of VLSI design makes it impractical for a human being to carry through a design without the help of a computer. A well-followed design procedure coupled with software tools is the only way to complete a VLSI design.
In Fig. 4-1, VLSI implementation starts from algorithm development, which is verified by software simulation. From there it moves to the real chip design. After the chip architecture has been worked out, behavior modeling is one further step in verifying the feasibility of the developed algorithm. A high-level programming language, C, is used in this step. BSIM is an interface C program that helps the designer with simulation. Next come NET for transistor network description and RSIM for logic simulation. At the same time, all circuit elements must be decided and the critical path checked with the detailed circuit simulator SPICE. After this, chip layout follows. MAGIC is used for cell layout, together with a few other programs: "cellframe" to extend cell terminals, "pwrframe" to route power and ground lines, "pad2frame" and "framegen" to generate a pad frame, and "chipframe" to merge all of the above. Later, the Magic router is used to complete the routing between the datapath and the pads. Finally, the layout is extracted by EXT2SIM and simulated using RSIM.
Fig. 4-1: VLSI design flowchart (algorithm development and verification; chip architecture design; behavior modeling in C with BSIM; structure design and logic simulation with NET, RSIM, and MAXTIME; circuit element design and simulation with SPICE; layout and DRC with MAGIC, ROUTER, CELLFRAME, PWRFRAME, PADFRAME, and CHIPFRAME; layout extraction and logic simulation with EXT2SIM, PRESIM, and RSIM; CIF file output; chip fabrication; testing. Failing to meet specifications leads back to circuit improvement or transistor sizing, or to layout optimization.)
In behavior modeling, structure simulation, and extracted layout simulation, an identical set of test vectors should be applied throughout so that cross-checking can be carried through the design. If the final chip's performance does not meet the specification, further circuit changes or layout optimization must be done, which induces different depths of rework.
4.2 TESTABILITY
The key to designing a "testable" VLSI chip relates to two concepts: controllability and observability. Special techniques stemming from this need include LSSD (Level Sensitive Scan Design) [21], BILBO [22], Design for Autonomous Test [23], and so on. However, achieving this level of testability means trading some area (increased gate count) and possibly some speed (extra loading).
In our design, speed testing is a problem. Conservatively speaking, a 20 MHz high-frequency tester is required if we assume the chip's maximum delay is 50 ns. One way to solve this problem for a combinational circuit, with or without pipeline registers, is to hold both Phi1 and Phi2 high (Fig. 3-1) so that a toggled input passes through the data path and shows a change on the output. The total delay between the input and output waveforms divided by the number of pipeline stages is an estimate of the chip's throughput. However, this possibility is excluded in this project because of the use of precharge logic in the ROM circuit. Another trick is therefore considered, shown in Fig. 4-2. In this way, it is possible to test the output's $T_{PLH}$ using a general waveform generator and an oscilloscope. $T_{PHL}$ cannot be tested by this method because the ROM output cannot go from discharged to charged while Phi1 is held high.
The second testing problem is the pin-count limitation, which makes some of the output signals unreachable from the pins. The chip needs 21 pins for the carry-bit outputs and 21 pins for the sum-bit outputs, but the package used has 28 pins. An approach similar to the LSSD structure is used to solve this problem: a parallel/serial mixed shift register, shown in Fig. 4-3.
~ 
4.3 CHARACTERIZATION OF THE DIGITAL FILTER 
A. Functional Tests
Depending on the value of the SOE pin, the chip can operate in normal parallel-output mode or in serial-output mode. Complete functional testing should include connectivity tests and matrix-type test patterns. Under such detailed tests, wafer processing defects, if any, can be identified.
As to the equipment used, the BSIM program also has the capability to generate test vectors in Sieve format [24] and GenRad format [25] for functionality and performance testing. "Sunkits" is the simple tester used here.
A behavior model for the chip and its BSIM and RSIM input/output files are listed in the appendix. The completed testing gives a 75% yield (9 out of 12) (notes 1, 2).
Fig. 4-2: Test pattern for performance testing (waveforms between 0 V and 5 V: Phi1, Xin, Xout; annotations: ROM precharge and evaluate, reset decoder, reset output, output change, Tpd)
Fig. 4-3: Parallel/serial shift register (signals: Phi1, Phi1E, SOE; serial output, parallel output)
B. Performance Tests
The test pattern is the one suggested in Section 4.2. Again, a behavior model with a command input file was prepared. The test result gives a Tplh of 82 ns (note 3).
1. Special thanks to James B. Burr at Stanford University, who did the chip testing.
2. The chip was fabricated through the MOSIS service.
3. Data is obtained from RSIM simulation. The chip's actual speed is not available at this moment.
CHAPTER 5 
A DISCRETE COSINE TRANSFORM CHIP 
5.1 INTRODUCTION OF THE DCT 
The discrete cosine transform (DCT) is one of the most popular techniques used in video data compression. The DCT, developed in 1973 [26], has been verified to perform closest, within the class of orthogonal transformations, to the Karhunen-Loeve transform (KLT), which is known to be optimal in some sense but has no general algorithm for fast computation. To get a feeling for why image data compression is needed, consider that a full-motion color video signal requires nearly 100 Mbits/sec! Even a still picture at 512x512 resolution requires 6.3 Mbits (3x8 bits/pixel with the three RGB colors) of storage. Now, suppose an image is broken into 32x32 blocks of data and transformed by the DCT at a compression ratio of 32:1 (0.75 bits/pixel). That image then takes only approximately 1.9 ms to transmit its 6.3M/32 bits of data (using the 10 µs figure in Table II). This is an excellent result.
The DCT consists of a set of basis vectors that are sampled cosine functions:

\[
X(0) = \frac{\sqrt{2}}{N} \sum_{n=0}^{N-1} x(n), \qquad
X(k) = \frac{2}{N} \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)k\pi}{2N}, \quad k = 1, 2, \ldots, N-1,
\tag{5.1}
\]

where x(n) is the input data sequence, X(k) is the kth DCT component, and {1/√2, cos((2n+1)kπ/2N)} is the set of basis vectors, which is also known as a class of discrete Chebyshev polynomials. Traditionally, the DCT can be computed using the fast Fourier transform (FFT) [27] [28]:

\[
X(0) = \frac{\sqrt{2}}{N} \sum_{n=0}^{N-1} x(n), \qquad
X(k) = \frac{2}{N} \operatorname{Re}\left\{ e^{-j\frac{k\pi}{2N}} \sum_{n=0}^{2N-1} x(n)\, e^{-j\frac{2\pi k}{2N} n} \right\}, \quad k = 1, 2, \ldots, N-1,
\tag{5.2}
\]

with x(n) taken as zero for n ≥ N.
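The equivalence of (5.1) and (5.2) can be checked numerically. The following is a minimal sketch, not the thesis code: the function names are hypothetical, the input is assumed to be zero-padded to 2N points, and the 2N-point DFT is evaluated directly rather than by a fast algorithm.

```python
import cmath
import math


def dct_direct(x):
    """Evaluate (5.1) term by term."""
    n_pts = len(x)
    out = [math.sqrt(2) / n_pts * sum(x)]
    for k in range(1, n_pts):
        acc = sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                  for n in range(n_pts))
        out.append(2.0 / n_pts * acc)
    return out


def dct_via_dft(x):
    """Evaluate (5.2): 2N-point DFT of the zero-padded input, then a phase rotation."""
    n_pts = len(x)
    padded = list(x) + [0.0] * n_pts  # assumption: x(n) = 0 for n >= N
    out = [math.sqrt(2) / n_pts * sum(x)]
    for k in range(1, n_pts):
        dft_k = sum(padded[n] * cmath.exp(-2j * math.pi * k * n / (2 * n_pts))
                    for n in range(2 * n_pts))
        rotated = cmath.exp(-1j * k * math.pi / (2 * n_pts)) * dft_k
        out.append(2.0 / n_pts * rotated.real)
    return out
```

Both functions should agree to within floating-point rounding for any real input sequence.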
So the DCT can be realized by using multipliers to implement the butterfly structure of various FFT algorithms. The fast algorithm [29] proposed by Chen, through a direct decomposition of the DCT matrix, provided another way of computing the DCT. Nevertheless, all these approaches focused on reducing the number of multipliers. They are good for computing the DCT on a general-purpose computer but not so efficient for VLSI implementations.
Distributed arithmetic is again feasible for DCT realization, by noting the following facts: 1) The transform coefficients, {1/√2, cos((2n+1)kπ/2N)}, are fixed constants for a given DCT size. 2) The matrix-vector product of the transform operation can be realized by a large number of concurrent vector inner products. This new approach leads to a new DCT architecture which consists of registers, memories, and adders only; no multiplier or butterfly structure is needed.
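To make the "registers, memories, and adders only" claim concrete, here is a minimal sketch of a distributed-arithmetic inner product with fixed coefficients. It is an illustration under stated assumptions, not the chip's datapath: inputs are assumed to be two's complement integers, the table stands in for the ROM, and the names `build_lut` and `da_inner_product` are hypothetical.

```python
def build_lut(coeffs):
    """ROM contents: for every bit pattern, the sum of the selected coefficients."""
    n = len(coeffs)
    return [sum(coeffs[i] for i in range(n) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_inner_product(lut, xs, bits):
    """Inner product of the fixed coefficients with `bits`-bit two's complement inputs."""
    acc = 0
    for m in range(bits):
        # The ROM address is formed from the m-th bit of every input word.
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> m) & 1) << i
        term = lut[addr] << m
        # The most significant bit carries negative weight in two's complement.
        acc += -term if m == bits - 1 else term
    return acc
```

For example, with the coefficients {-2, 5, 2, 1} and the 4-bit inputs 7, -8, 3, -1, the loop performs four table look-ups and shift-adds and returns -49, the same value as the direct sum of products.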
5.2 THE NEW COMPUTING ALGORITHM
VLSI technology today makes hardware implementation of the DCT using the distributed arithmetic algorithm possible. Some real examples have been reported in recent years [30] [31] [32] [33] [34], and [34] even presented a completed chip.
The two-dimensional patterns in image coding, corresponding to the rows and columns of a data matrix, require a two-dimensional DCT. The two-dimensional DCT of size N x N can be defined as

\[
Z = C^{t} \cdot X \cdot C
\tag{5.3}
\]

where C is the Nth-order DCT matrix, C^t is the transpose of C, and X is the input data matrix. Using the row-column decomposition technique shown in Fig. 5-1, T. C. Chen et al. used bit-serial and bit-parallel data I/O, decimation-in-frequency, and ROM partitioning to reduce the memory size. They constructed a 16x16 DCT chip with a 14.3 MHz sample rate of video data for real-time processing. This is a straightforward application of distributed arithmetic to the inner product computation between two matrices, with little special investigation of the mathematical structure of the DCT.
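The row-column decomposition can be sketched in a few lines. This is only an illustration of the data flow (N x 1 DCT, transposition memory, N x 1 DCT); the matrix helpers and the scaling, taken from (5.1), are assumptions made for the example and all function names are hypothetical.

```python
import math


def dct_matrix(n_pts):
    """C[n][k]: sample n of the k-th basis vector, scaled as in (5.1)."""
    c = [[0.0] * n_pts for _ in range(n_pts)]
    for n in range(n_pts):
        c[n][0] = math.sqrt(2) / n_pts
        for k in range(1, n_pts):
            c[n][k] = 2.0 / n_pts * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
    return c


def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2_row_column(x):
    """Two passes of a one-dimensional DCT with a transposition in between."""
    c = dct_matrix(len(x))
    stage1 = matmul(x, c)                  # first N x 1 DCT applied to every row
    stage2 = matmul(transpose(stage1), c)  # transposition memory, then second pass
    return transpose(stage2)               # restore the orientation of Z


def dct2_direct(x):
    """Reference: Z = C^t . X . C evaluated as two matrix products."""
    c = dct_matrix(len(x))
    return matmul(matmul(transpose(c), x), c)
```

The two functions return the same matrix, which is why a single N x 1 DCT block plus an N x N transposition memory suffices for the two-dimensional transform.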
Fig. 5-1: Row-column decomposition of two-dimensional DCT (blocks: N x 1 DCT, N x N transposition memory, N x 1 DCT; clock, counter, timing control)
Fig. 5-2: Partition algorithm for 32-point DCT implementation (blocks: DIF, permute, 16-point skew circular correlation, permute & negate, direct computation of 8-point DCT; outputs X(2k+1), X(2(2k+1)), X(2(2k)))
In [28], it has been pointed out that the DCT computation can be regarded as polynomial products after the data matrix is permuted in some way, and that these polynomial products can be evaluated by distributed arithmetic. Below, a new algorithm is presented [35] which uses a different permutation and turns the DCT operation into a circular correlation between the data sequence and the transform coefficients.
• The New Computing Algorithm
The idea behind Li's algorithm is based on a one-to-one mapping between {odd integers in [0, 4N]} and {(-1)^l 3^i mod(4N), l = 0, 1 and i = 0, 1, ..., N-1}. The product relationship between the input index and the output index is changed into a sum relationship via this exponential operation, so the transform operation becomes a correlation operation. In short, the odd terms in the DCT computation,

\[
X(2k+1) = \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)(2k+1)\pi}{2N}, \quad k = 0, 1, 2, \ldots, \frac{N}{2}-1,
\tag{5.4}
\]

become the circular correlation (5.5) via the permutations (5.6) and (5.7):

\[
X''(j) = \sum_{i=0}^{N-1} x''(i) \cos\left( \frac{\pi}{2N} \cdot 3^{i+j} \bmod (4N) \right),
\tag{5.5}
\]

\[
x''(i) = x(n), \quad i = 0, 1, \ldots, N-1,
\tag{5.6}
\]
\[
n = \begin{cases}
\frac{1}{2}\left( (3^{i} \bmod 4N) - 1 \right), & \text{if } \frac{1}{2}\left( (3^{i} \bmod 4N) - 1 \right) < N, \\
(2N-1) - \frac{1}{2}\left( (3^{i} \bmod 4N) - 1 \right), & \text{if } \frac{1}{2}\left( (3^{i} \bmod 4N) - 1 \right) \ge N,
\end{cases}
\]

\[
X(2k+1) = X''(j), \quad j = 0, 1, \ldots, N-1,
\tag{5.7}
\]
\[
k = \begin{cases}
\frac{1}{2}\left( (3^{j} \bmod 4N) - 1 \right), & \text{if } \frac{1}{2}\left( (3^{j} \bmod 4N) - 1 \right) < N, \\
(2N-1) - \frac{1}{2}\left( (3^{j} \bmod 4N) - 1 \right), & \text{if } \frac{1}{2}\left( (3^{j} \bmod 4N) - 1 \right) \ge N.
\end{cases}
\]
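The index mapping can be exercised numerically. The sketch below is an illustration under stated assumptions, not Li's published procedure: it builds the input permutation so that 2n+1 equals plus or minus 3^i modulo 4N, evaluates the circular correlation, and compares each output with the cosine sum taken at the odd index 3^j mod 4N. The function names are hypothetical.

```python
import math


def odd_term(x, m):
    """Unscaled sum of x(n) cos((2n+1) m pi / 2N) for an odd index m, as in (5.4)."""
    n_pts = len(x)
    return sum(x[n] * math.cos((2 * n + 1) * m * math.pi / (2 * n_pts))
               for n in range(n_pts))


def input_permutation(n_pts):
    """Index n chosen for each i so that 2n+1 = +/- 3^i (mod 4N)."""
    perm = []
    for i in range(n_pts):
        half = (pow(3, i, 4 * n_pts) - 1) // 2
        perm.append(half if half < n_pts else (2 * n_pts - 1) - half)
    return perm


def circular_correlation(x):
    """Correlate the permuted input with cos(pi/2N * (3^(i+j) mod 4N))."""
    n_pts = len(x)
    xpp = [x[n] for n in input_permutation(n_pts)]
    return [sum(xpp[i] * math.cos(math.pi / (2 * n_pts) * pow(3, i + j, 4 * n_pts))
                for i in range(n_pts))
            for j in range(n_pts)]
```

For N = 8 the permutation visits every input exactly once, and output j of the correlation equals the cosine sum at index 3^j mod 32, so j = 0 and j = 1 give X(1) and X(3) directly.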
As for the even-indexed terms, they are computed as an N/2-point DCT in the same way after performing recursive decimation-in-frequency on the data sequence, as described in [35]. Thus, it is shown that the DCT can be computed as a circular correlation by performing permutations on both the input and output sequences.
Note that the direct inner product computation of the DCT suggested by T. C. Chen using distributed arithmetic requires N different memory cells to store N different sets of coefficient summations; in total N·2^N words are needed. With the help of DIF and memory partition, N·(N/8)·2^4 is the final memory size for an N x 1 DCT computing block. The new algorithm, which computes the DCT as a circular correlation, requires Bx identical sets of coefficients and consequently needs approximately 2·Bx·(N/8)·2^4 words of memory. This allows a hardware realization using only half the memory size, and provides faster operation, in the N = 32 and Bx = 8 case. More explanation follows.
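The memory comparison can be checked with a few lines of arithmetic. The sketch below assumes the partitioned sizes are N·(N/8)·2^4 words for Chen's direct method and 2·Bx·(N/8)·2^4 words for the circular-correlation method; the function names are hypothetical.

```python
def chen_direct_words(n_pts):
    """Unpartitioned direct inner product: N tables of 2^N words."""
    return n_pts * 2 ** n_pts


def chen_partitioned_words(n_pts):
    """After DIF and partition into 4-input ROM units of 2^4 words (assumed form)."""
    return n_pts * (n_pts // 8) * 2 ** 4


def new_algorithm_words(n_pts, bx):
    """Circular correlation with Bx identical coefficient sets (assumed form)."""
    return 2 * bx * (n_pts // 8) * 2 ** 4
```

With N = 32 and Bx = 8 these give 2048 and 1024 words, which is the factor of two quoted in the text.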
First, to recognize that distributed arithmetic is suitable for implementing (5.5), let us review the expression

\[
X''(j) = \sum_{i=0}^{N-1} x''(i) \cos\left( \frac{\pi}{2N} \cdot 3^{i+j} \bmod (4N) \right).
\]

Dropping the primes, it can be written as

\[
X(j) = \sum_{i=0}^{N-1} x(i) \cdot C^{\circ}(i+j), \quad \text{where } C^{\circ} \text{ is a periodic sequence in } N,
\]
\[
X(j) = \sum_{i=0}^{N-1} x^{\circ}(i-j) \cdot C(i), \quad \text{where } x^{\circ} \text{ is a periodic sequence in } N,
\tag{5.8}
\]
\[
X(j) = \sum_{m} \sum_{i=0}^{N-1} x\big( (i-j) \bmod N,\; m \big) \cdot C(i) \cdot 2^{-m},
\tag{5.9}
\]

where x(i, m) denotes the mth bit of x(i). By circulating the input data sequence, each output is obtained by solving the inner product of two matrices, and distributed arithmetic is an efficient way to do it. Before going to the chip architecture design, a detailed investigation of the arithmetic properties of (5.5) is necessary for such a huge number cruncher to be integrated on a chip.
1. DIF before permutation of the input data sequence
The basis function cos(·) is symmetric over the 2π range. This property can be used to decompose the primitive DCT expression as:

\[
X(k) = \sum_{n=0}^{\frac{N}{2}-1} \left\{ x(n) - x(N-1-n) \right\} \cos\frac{(2n+1)k\pi}{2N}, \quad \text{if } k \text{ is odd, and}
\]
\[
X(k) = \sum_{n=0}^{\frac{N}{2}-1} \left\{ x(n) + x(N-1-n) \right\} \cos\frac{(2n+1)k\pi}{2N}, \quad \text{if } k \text{ is even.}
\tag{5.10}
\]

An N-point DCT is now traded for two N/2-point DCTs. It is a big saving in computation. Consecutive decimation of the even part is permissible, but at the expense of increasing irregularity in hardware [33].
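The folding in (5.10) is easy to verify numerically. The following minimal sketch compares the N-point cosine sums with the folded N/2-point sums; scaling factors are omitted, and the function names are hypothetical.

```python
import math


def dct_terms(x):
    """Unscaled X(k) = sum_n x(n) cos((2n+1) k pi / 2N) for k = 0..N-1."""
    n_pts = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                for n in range(n_pts))
            for k in range(n_pts)]


def dct_terms_dif(x):
    """Same terms via (5.10): fold the input first, then sum over N/2 points."""
    n_pts = len(x)
    half = n_pts // 2
    diff = [x[n] - x[n_pts - 1 - n] for n in range(half)]  # feeds odd k
    summ = [x[n] + x[n_pts - 1 - n] for n in range(half)]  # feeds even k
    out = []
    for k in range(n_pts):
        folded = diff if k % 2 else summ
        out.append(sum(folded[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                       for n in range(half)))
    return out
```

Each output of the folded version needs only N/2 products, at the cost of N/2 extra additions or subtractions performed once on the input.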
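2. DIF after permutation of the input data sequence
With regard to the new algorithm, this cosine symmetry is of course preserved, but in a somewhat different form because of the permutation performed on the data sequence. From (5.5),

\[
X(j) = \sum_{i=0}^{N-1} x(i) \cos\left( \frac{\pi}{2N} \cdot 3^{i+j} \bmod (4N) \right)
\]
\[
= \sum_{i=0}^{\frac{N}{2}-1} x(i) \cos\left( \frac{\pi}{2N} \cdot 3^{i+j} \bmod (4N) \right)
+ \sum_{i=0}^{\frac{N}{2}-1} x\left(i+\tfrac{N}{2}\right) \cos\left( \frac{\pi}{2N} \cdot \left( 3^{\frac{N}{2}} \cdot 3^{i+j} \right) \bmod (4N) \right).
\]

Using the fact [36]

\[
3^{\frac{N}{2}} \bmod (4N) = 2N + 1, \quad N = 2^{t+1}, \quad t = 2, 3, 4, \ldots
\tag{5.11}
\]

we get

\[
X(j) = \sum_{i=0}^{\frac{N}{2}-1} x(i) \cos\left( \frac{\pi}{2N} \cdot 3^{i+j} \bmod (4N) \right)
+ \sum_{i=0}^{\frac{N}{2}-1} x\left(i+\tfrac{N}{2}\right) \cos\left( \frac{\pi}{2N} \cdot (2N+1) \cdot 3^{i+j} \bmod (4N) \right)
\]
\[
= \sum_{i=0}^{\frac{N}{2}-1} \left( x(i) - x\left(i+\tfrac{N}{2}\right) \right) \cos\left( \frac{\pi}{2N} \cdot 3^{i+j} \bmod (4N) \right), \quad j = 0, 1, \ldots, N-1.
\tag{5.12}
\]

This proves that the N-point circular correlation is decimated into an N/2-point skew circular correlation. "Skew circular" comes from the observation:

\[
\cos\left( \frac{\pi}{2N} \cdot 3^{i+j+\frac{N}{2}} \bmod (4N) \right) = -\cos\left( \frac{\pi}{2N} \cdot 3^{i+j} \bmod (4N) \right).
\tag{5.13}
\]

Using distributed arithmetic, the interchangeable property of correlation allows us to do the skew circulation on x(n) and fix one set of N/2-point coefficients in the memories.
Though (5.12) has saved us half of the computations for each X(j), recall from (5.7) that the N-point X(j) is mapped to the N/2-point odd terms X(k) of the final DCT output. This redundancy is eliminated by noting

\[
X\left(\tfrac{N}{2}+j\right) = \sum_{i=0}^{\frac{N}{2}-1} \left( x(i) - x\left(i+\tfrac{N}{2}\right) \right) \cos\left( \frac{\pi}{2N} \cdot \left( 3^{\frac{N}{2}} \cdot 3^{i+j} \right) \bmod (4N) \right)
= -X(j), \quad j = 0, 1, \ldots, \frac{N}{2}-1.
\tag{5.14}
\]

This implies that the second half (N/2 points) of the X(·) output can be obtained from the first half (N/2 points) by taking a minus sign. Finally, we have the N/2-point DCT output by doing N/2 computations of an N/2-point skew circular correlation.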
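The halving described above can be checked with a short numerical sketch. It assumes the input has already been permuted, tests the identity 3^(N/2) mod 4N = 2N+1 for a few power-of-two sizes, and compares the full N-point circular correlation with the folded N/2-point version. The function names are hypothetical.

```python
import math


def full_correlation(xpp):
    """N-point circular correlation with cos(pi/2N * (3^(i+j) mod 4N))."""
    n = len(xpp)
    return [sum(xpp[i] * math.cos(math.pi / (2 * n) * pow(3, i + j, 4 * n))
                for i in range(n))
            for j in range(n)]


def skew_half_correlation(xpp):
    """Fold the input, correlate over N/2 points, then negate for the second half."""
    n = len(xpp)
    half = n // 2
    folded = [xpp[i] - xpp[i + half] for i in range(half)]
    first = [sum(folded[i] * math.cos(math.pi / (2 * n) * pow(3, i + j, 4 * n))
                 for i in range(half))
             for j in range(half)]
    return first + [-v for v in first]
```

Only N/2 outputs of N/2 products each are actually computed; the remaining outputs follow from a sign change.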
3. DIF in the memory part
A third place where the cosine symmetry property can be applied is in the memory in which we store all possible summations of coefficients. A memory unit of 2^4 words (the commonly used size after partition) can be reduced to only 4 words if the coefficients are specially grouped. For example, the 16 combinations of the 4 coefficients {a, b, -a, -b} are condensed into the 4 words {a, b, a+b, a-b} plus a minus sign. This allows the memory size to decrease from 2^4 to 4. The drawback is the increased complexity of the decoder design, because each group of four matched signals must be deliberately collected together so that they fetch the corresponding condensed memory word; furthermore, the minus sign also needs special care in hardware. Worst of all, all structures before the memory part will be doubled in size, since the DIF operation is postponed.
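The 16-to-4 reduction can be illustrated with a small decoder model. This is a sketch of the idea only: the treatment of the all-cancelled case as sign 0 is an assumption not spelled out in the text, and the names `folded_rom` and `decode` are hypothetical.

```python
def folded_rom(a, b):
    """The four stored words that replace the 16-word table."""
    return {"a": a, "b": b, "a+b": a + b, "a-b": a - b}


def decode(c0, c1, c2, c3):
    """Map select bits for {a, b, -a, -b} to (sign, word name).

    A sign of 0 means the selected coefficients cancel and the output is zero
    (assumed handling; the word name is then irrelevant).
    """
    u = c0 - c2  # net weight on a: -1, 0 or +1
    v = c1 - c3  # net weight on b: -1, 0 or +1
    if u == 0 and v == 0:
        return 0, "a"
    if v == 0:
        return u, "a"
    if u == 0:
        return v, "b"
    if u == v:
        return u, "a+b"
    return u, "a-b"
```

Every one of the 16 select patterns is covered by one stored word and a sign, which is where the extra decoder complexity mentioned above comes from.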
In conclusion, the symmetry property of the cosine function allows us to do DIF during the computation, and it saves a lot of processing. A choice exists in considering "where", or say "when", is the best spot to "use up" this unique advantage. A further split of the basis vectors gives a cosine-and-sine relationship between them, which obviously does not help.
5.3 CHIP ARCHITECTURE 
The 32x1 DCT module proposed in Fig. 5-3 is a mixture of the new algorithm and Chen's approach. The reason is that the new algorithm only exhibits its superiority when N > Bx. With the usual 8-bit representation for image data, Chen's approach is preferred when N is not much larger than 8, because additional hardware for I/O permutation must be paid for in the new algorithm. More detailed comparisons are summarized in the next section. Fig. 5-2 demonstrates how the partition can be done to gain the advantages of the two different algorithms. The same concept can be applied to larger DCT sizes, N = 64, 128, ...
Below, some features of the hardware implementation are discussed:
1. chip architecture
A high degree of processing concurrency is achieved through the use of three pipelined channels in parallel. A second set of clocks, Phi1d and Phi2d, operating at half the frequency of Phi1 and Phi2 respectively, is used to achieve more efficient hardware utilization.
Fig. 5-3: Chip architecture of 32x1 DCT block (blocks: input shift registers (32 x Bx), permute, shift registers 16 x (Bx+1), decoder 4 x (Bx+1), ROM (Bx+1) x 4 units, 4-2 adder tree with 5 pipe stages, CPA/accumulator, negate, shift register 16x13, permute; two further channels with decoder, ROM 8 x 2 units, CPA, and pipeline register 8 x 1; outputs X(2k+1), X(2(2k+1)), X(2(2k)); clocks Phi1, Phi2, Phi1d, Phi2d)
2. permutation
Based on the algorithm, permutation of the sequences is required on both input and output. Permuting one 32-word sequence into another would require heavy routing. One choice is to use a (serial-in, parallel-out), (parallel-in, serial-out) structure [30]. This is another example of trading time for space.
3. control signal
Because two different algorithms are used in the implementation, the three parallel pipeline channels execute at different speeds. This makes dedicated control signals necessary to trigger particular operations at the right time. The use of DIF is another reason for this increased complexity, because each DIF adds one bit to the data word length, and one more bit in word length takes one more cycle to complete its computation. To avoid the use of many counters and combinational logic, a counter coupled with a set of shift registers can provide all the required trigger signals in the circuit.
5.4 COMPARISON OF DIFFERENT COMPUTING METHODS
In this section, we compare three approaches to DCT computation: 1) using the AT&T DSP16A general-purpose DSP chip [37], 2) VLSI implementation using Chen's algorithm (direct inner product computation), and 3) VLSI implementation using the new algorithm (circular correlation). A 32x32 2-D DCT requires 2·2^5·2^5·2^5 = 2^16 multiplications for each block transformation. One block of input means 32x32 = 1024 points of coded image data. Here we assume 9-bit data input.
Table II: Comparison of speed performance and hardware requirements of
two-dimensional 32-point DCT using different computing methods
(listed value is the required number of circuit elements)

mechanism   register   decoder   full adder   4-2 adder   CPA        ROM       expected
            (1-bit)    (4-bit)   (1-bit)      (9-bit)     (13-bit)   (9-bit)   speed
DSP16A      -          -         -            -           -          -         4.1 ms
Chen's      2816       256       64           64          64         4096      72 µs
New         5896       1160      96           30          66         2176      10 µs

- NOTE -
1) Speed is the time consumed to complete the transformation of 32x32 data points.
2) The DSP16A takes 1.9 instruction cycles to complete one multiply at 33 ns/cycle.
3) Chen's design is considered for a 14.3 MHz real-time video signal input, equivalent to a 1/70 ns operating frequency. It takes 32x32 cycles to complete.
4) The new method uses a fully parallel and pipelined structure, which hopefully gives a 1/10 ns operating frequency (note that the computing part of the chip, Fig. 5-3, is operated at half frequency so that 20 ns is available for addition and memory fetch). It also takes 32x32 cycles to complete one block transform.
5) Both Chen's and the new method need another 32x32 RAM (word length 13 bits) as the transposition memory.
6) The accuracy of the DCT computation is decided by the word length used in each section's register. The numbers of bits used here are based on M. T. Sun's study [29], and assume that the second 32 x 1 DCT block is the same as the first one.
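The expected-speed column follows from simple arithmetic on the figures given in the notes. The short sketch below reproduces it; the constant and function names are hypothetical, and the cycle times are the ones stated in the notes (33 ns and 1.9 cycles per multiply for the DSP16A, 70 ns for Chen's design, 10 ns for the new method).

```python
MULTS_PER_BLOCK = 2 * 32 * 32 * 32  # 2^16 multiplications per 32x32 block


def dsp16a_time_s(cycles_per_mult=1.9, cycle_ns=33.0):
    """Time for one block on the general-purpose DSP, multiplies only."""
    return MULTS_PER_BLOCK * cycles_per_mult * cycle_ns * 1e-9


def pipelined_time_s(cycle_ns, cycles=32 * 32):
    """Time for one block when one data point is accepted per clock cycle."""
    return cycles * cycle_ns * 1e-9
```

These give about 4.1 ms for the DSP16A, about 72 µs for Chen's design at 70 ns per cycle, and about 10 µs for the new method at 10 ns per cycle, a ratio of seven between the last two.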
From the table, we notice that the new algorithm requires half the memory size of Chen's design but costs more registers and decoders. Heavy pipelining and the algorithm itself are responsible for this extra cost. However, it is still the most attractive design in the N = 32 case when one recognizes that the layout size ratio of a 1-bit register to a 1-bit memory is about 1:5, and that fewer 4-2 adders are required (a 1-bit 4-2 adder has 72 transistors). Besides, it gives seven times faster execution than Chen's. Based on the cell layout in the fabricated digital filter chip, the proposed 32-point DCT chip would require 107 mm^2 of silicon area for the datapath only.
CHAPTER 6 
CONCLUSION 
6.1 FUTURE DEVELOPMENTS 
A 4-tap convolution unit has been designed, fabricated, and tested. A new DCT algorithm has been developed and its VLSI implementation proposed. Future work may include:
• 32-tap digital filter:
According to the current MOSIS 2-µm process and the 2.0 x 1.6 mm^2 area taken by the 4-tap unit, 32 taps with 8-bit data length might be the largest tap size that can fit into the 7.9 x 9.2 mm^2 padframe available from MOSIS. To accomplish such a design, a good arrangement of the memory blocks must be ensured to avoid routing overhead. The discussion in Section 2.3 shows some examples of this kind of consideration. Another task is to design a fast CLA (Carry Lookahead Adder) which would sum the last two numbers from the 4-2 adder tree output. Its maximum delay should not be larger than that of the 4-2 adder; otherwise, it would become the new critical path. The conventional CLA design has the drawback of a very irregular layout. A Manchester carry adder [38], [16] with a carry lookahead circuit is a prospective structure. Two levels of carry lookahead must be used because the length of the two numbers is over 16 bits.
• Digital filter silicon compiler:
In fact, a digital filter implemented using distributed arithmetic is very regular and flexible. Only a few types of circuit elements are needed (register, decoder, ROM, 4-2 adder, and CLA). It is therefore very practical to consider the topic "Automatic digital filter design using distributed arithmetic structure". IIR (Infinite Impulse Response) filters could also be covered once a feedback path (register) is provided. The complete project would involve three parts: a) a program to generate filter coefficients according to the user's input specifications, b) automatic generation of the behavior model and net description files, and their simulation, c) generation of the chip layout and simulation of the extracted output. The whole idea lies between a cell-based system and procedural design: the designer writes a program and generates the final chip layout.
• 32x32 DCT chip
Chip architecture design and behavior modeling have been completed. Structure design and chip layout should be continued to get the chip completed. Such a chip is estimated to have 200,000 transistors and take over 107 mm^2 of die size. In simulation, at least 1024 input vectors are required, and the first block of DCT output will not be ready until about 8x1024 cycles of clocking. All of this implies a very heavy load on computer resources and on the designer's effort.
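Several of the items above rely on the 4-2 adder, whose two output words must finally be summed by a carry-propagate stage such as the proposed CLA. The sketch below is a bit-level model of one 4-2 adder row, following the Boolean equations of the adder() routine in the appendix; the function name `add_4_2` and the word-packing of the outputs are assumptions made for the illustration.

```python
def add_4_2(a, b, c, d, width):
    """Reduce four words to a sum word and a carry word, bit slice by bit slice."""
    cin = 0
    sum_word = 0
    carry_word = 0
    for i in range(width):
        ai, bi, ci, di = ((v >> i) & 1 for v in (a, b, c, d))
        cout = (di & (bi | ci)) | (bi & ci)  # majority of b, c, d; feeds the next slice
        bcd = bi ^ ci ^ di
        carry_word |= ((bcd & (cin | ai)) | (cin & ai)) << i
        sum_word |= (cin ^ ai ^ bcd) << i
        cin = cout
    return sum_word, carry_word
```

The four inputs add up to sum_word + 2*carry_word modulo 2^width, which is the addition left for the final adder. For the partial products 0xffffe, 0x80002, 0x40000, and 0x40000 with a width of 21 bits, the model returns the sum word 0x0ffffc and the carry word 0x080002, the same bit patterns as the outs and outc lines printed for those partial products in the speed-test log of the appendix.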
6.2 PROSPECTIVE 
Mapping DSP algorithms into VLSI chips is a very promising topic for now and the future. As semiconductor technology continues to increase the speed and the integration density and to decrease the cost of processing, more and more (new or old) sophisticated algorithms could become practical. Thus the theory and chip implementation of digital signal processing becomes more significant as time goes on.
"The importance of digital signal processing appears to be increasing with no visible sign of saturation. Indeed, the application in this field is expanding."
BIBLIOGRAPHY 
[1] L. W. Nagel, "SPICE2: A computer program to simulate semiconductor circuits", Memo ERL-M520, U. C. Berkeley, May 9, 1975.
[2] J. B. Burr, "BSIM user's guide", Stanford University, 1987.
[3] C. J. Terman, "Simulation tools for digital LSI design", Ph.D. thesis, MIT, 1983.
[4] C. J. Terman, "User's guide to NET, PRESIM, and RNL/NL", VLSI memo 82-112, MIT, July 1982.
[5] J. K. Ousterhout, et al., "1985 VLSI tools: more works by the original artists", U. C. Berkeley 1985 VLSI tools distribution.
[6] R. A. Roberts and C. T. Mullis, Digital Signal Processing. Addison Wesley, pp. 72-76, 1987.
[7] L. B. Jackson, J. F. Kaiser, and H. S. McDonald, "An approach to the implementation of digital filters," IEEE Trans. Audio Electroacoust. (Special Issue on Digital Filters: The Promise of LSI Applied to Signal Processing), vol. AU-16, pp. 413-421, Sept. 1968.
[8] J. F. Kaiser, "Some practical considerations in the realization of linear digital filters," 1965 Proc. 3rd Allerton Conf. on Circuit and System Theory, pp. 621-633.
[9] A. Croisier, et al., "Digital filter for PCM encoded signals", U.S. patent 3,777,130, Dec. 4, 1973.
[10] A. Peled and B. Liu, "A new hardware realization of digital filters," IEEE Trans. A.S.S.P., vol. ASSP-22, pp. 456-462, Dec. 1974.
[11] C. S. Burrus, "Digital filter structures described by distributed arithmetic," IEEE Trans. on Circuits and Systems, vol. CAS-24, no. 12, Dec. 1977.
[12] R. C. Agarwal and C. S. Burrus, "Fast one dimensional digital convolution by multidimensional techniques," IEEE Trans. Acoust., Speech, Signal Proc., vol. ASSP-22, pp. 1-10, Feb. 1974.
[13] F. J. Taylor, Digital Filter Design Handbook. Marcel Dekker, Inc., 1983, pp. 678-700.
[14] B. C. McKinney and F. E. Guibaly, "A multiple-access pipeline architecture for digital signal processing," IEEE Trans. on Computers, vol. 37, no. 3, pp. 283-290, Mar. 1988.
[15] W. Li, J. B. Burr, and A. M. Peterson, "A fully parallel VLSI implementation of distributed arithmetic," Proceedings of IEEE ISCAS'88, Finland, Jun. 1988, pp. 1511-1515.
[16] S. Waser and M. J. Flynn, Introduction to Arithmetic for Digital Systems Designers, CBS College Publishing, 1982.
[17] D. Huang and W. Li, "On CMOS Exclusive OR Design", internal documentation, Mar. 1989.
[18] L. A. Glasser and D. W. Dobberpuhl, "The design and analysis of VLSI circuits", Addison Wesley Publishing, 1985.
[19] W. Li and J. B. Burr, "Parallel multiplier accumulator using 4-2 adders", US patent pending, application number 088,096, filing date Aug. 21, 1987.
[20] H. Yamauchi et al., "10ns 8x8 multiplier LSI using super self-aligned process technology", IEEE Journal of Solid-State Circuits, vol. SC-18, no. 2, pp. 204-210, April 1983.
[21] E. B. Eichelberger and T. W. Williams, "A logic design structure for VLSI testing", Proc. 14th Design Automation Conference, pp. 462-468, June 1977.
[22] B. Koenemann, J. Mucha and G. Zwiehoff, "Built-in logic block observation techniques," Digest 1979 Test Conference, 79CH1509-9C, pp. 37-41, October 1979.
[23] E. J. McCluskey and S. Bozorgui-Nesbat, "Design for autonomous test," IEEE Trans. on Computers, vol. C-30, no. 11, pp. 866-875, Nov. 1981.
[24] Sunkit-1 Test Software Distribution, the MOSIS service, USC/ISI, 4676 Admiralty Way, Marina Del Rey, CA 90292-6695.
[25] "GR160/180 Programmer's guide", Part no. 2420-0136, GenRad, 510 Cottonwood Drive, Milpitas, CA 95035.
[26] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. on Computers, vol. C-23, pp. 90-93, Jan. 1974.
[27] M. J. Narasimha and A. M. Peterson, "On the computation of the discrete cosine transform", IEEE Trans. on Communications, vol. COM-26, no. 6, pp. 934-936, June 1978.
[28] H. Malvar, "Fast computation of discrete cosine transform through fast Hartley transform", Electronics Letters, vol. 22, no. 7, pp. 352-353, Mar. 1986.
[29] W. H. Chen et al., "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Commun., vol. COM-25, pp. 1004-1009, Sept. 1977.
[30] M. Vetterli and A. Ligtenberg, "A discrete Fourier-cosine transform chip," IEEE J. Selected Areas in Communications, vol. SAC-4, no. 1, pp. 49-61, Jan. 1986.
[31] M. Vetterli and H. J. Nussbaumer, "Simple FFT and DCT algorithms with reduced number of operations", Signal Processing, vol. 6, no. 4, pp. 267-278, 1984.
[32] P. Duhamel and H. H'mida, "New 2^n DCT algorithms suitable for VLSI implementation", Proceedings of IEEE ICASSP, pp. 1805-1808, 1987.
[33] M. T. Sun, L. Wu, and M. L. Liou, "A concurrent architecture for VLSI implementation of discrete cosine transform", IEEE Trans. on Circuits and Systems, vol. CAS-34, no. 8, pp. 992-994, 1987.
[34] T. C. Chen, M. T. Sun, and A. M. Gottlieb, "VLSI implementation of a 16x16 DCT", Proceedings of IEEE ICASSP, pp. 1973-1976, 1988.
[35] W. Li, "A DCT algorithm," to be published.
[36] W. Li, internal documentation, Mar. 1989.
[37] D. M. Blaker, "Using the DSP16/DSP16A for image compression", AT&T DSP Review, vol. 2, issue 1, pp. 4-5, Winter 1989.
[38] N. Weste and K. Eshraghian, "CMOS VLSI design", pp. 320-326, 1985.
APPENDIX 1 
BEHAVIOR MODELING 
This section lists behavior models of the 4-tap digital filter and the 32x1 DCT. The programs are abbreviated to keep them short. As to the simulation, only the outputs are quoted. The corresponding BSIM input commands can be figured out from the outputs themselves.
• Behavior model of 4-tap Digital Filter 
model()
{
    mix();
    S_reg();
    dec();
    ROM();
    adder(carry, sum);
    L_pipe(carry, sum, 21);
}
mix()
{
    int i, a, b, c, d;
    {
        a = i & 1;
        b = (i>>1) & 1;
        c = (i>>2) & 1;
        d = (i>>3) & 1;
    }
}
S_reg()
{
    int i, j;
    if(Xin == X) ;
    else
        for(j=0; j<Bx; j++) {
            if(Phi1 == 1)
                sreg[3][j].in = (Xin>>j) & 1;
            for(i=N-1; i>-1; --i) {
                if(Phi1 == 1)
                    sreg[i][j].s = sreg[i][j].in;
                if(Phi2 == 1)
                    sreg[i][j].out = sreg[i][j].s;
                if(i == 0) ;
                else
                    sreg[i-1][j].in = sreg[i][j].out;
            }
        }
}
dec()
{
    int i, j, index;
    double p;
    for(j=0; j<Bx; j++)
    {
        index = 0;
        if(sreg[0][j].out == X) code[j] = select[j] = X;
        else
        {
            for(i=0, p=N-1; i<N; i++, p--)
                index += sreg[i][j].out*(int)(pow(2.0, p));
            code[j] = index;
            if(Phi1 == 1)
                select[j] = code[j];
            if(Phi1 == 0)
                select[j] = 0;
        }
    }
}
ROM()
{
    static int hp[Bx], ppb[Bx];
    int j, B, loc;
    hp[0] = ( ((h_mix[select[0]] & 0x20000) ^ 0x20000) <<1 ) | (h_mix[select[0]] & 0x03ffff);
    hp[1] = ( (h_mix[select[1]] ^ 0x20000) <<1 ) & 0x07fffe;
    hp[2] = ( ((h_mix[select[2]] ^ 0x20000) <<2 ) & 0x0ffffc) | 0x2;
    hp[3] = ( ((h_mix[select[3]] ^ 0x1ffff) <<3 ) & 0x1ffff8) | 0x6;
    for(j=0; j<Bx; j++){
        if(Phi1 == 0) ppb[j] = 0x1fffff;          /* precharge */
        if(Phi1 == 1 && select[j] != X){          /* evaluate */
            for(B=0; B<21; ++B)
                if((((ppb[j]>>B)&1) == 1) && (((hp[j]>>B)&1) == 1)){
                    loc = (int) pow(2.0, (double) B);
                    ppb[j] ^= loc;
                }
            pp[j].in = ppb[j] ^ 0x1fffff;
        }
    }
    pipe(pp, Bx);
}
adder(outc, outs)
reg *outs, *outc;
{
    unsigned i, a, b, c, d, cin, cout, bcd;
    if(pp[0].out == X) ;
    else
    {
        cin = 0;
        for(i=0; i<Bh+5; i++, outc++, outs++)
        {
            a = (pp[3].out >> i) & 1;
            b = (pp[2].out >> i) & 1;
            c = (pp[1].out >> i) & 1;
            d = (pp[0].out >> i) & 1;
            cout = d & (b | c) | (b & c);
            bcd = b ^ c ^ d;
            outc->in = bcd & (cin | a) | (cin & a);
            outs->in = cin ^ a ^ bcd;
            cin = cout;
        }
    }
}
L_pipe(outc, outs, n)
reg *outs, *outc;
int n;
{
    int i;
    if(outc->in == X) ;
    else{
        if(SOE == 1){                  /* normal mode */
            pipe(outc, 21);
            pipe(outs, 21);
        }
        if(SOE == 0)                   /* serial out mode */
            for(i=0; i<n; i++, outc++, outs++){
                if(Phi1 == 1){
                    outs->s = outc->out;
                    outc->s = (outs+1)->out;
                }
                if(Phi2 == 1){
                    outc->out = outc->s;
                    outs->out = outs->s;
                }
            }
    }
}
pipe(data, n)
reg *data;
int n;
{
    int i;
    if(Phi1 == 1)
        { for(i=0; i<n; i++) (data+i)->s = (data+i)->in; }
    if(Phi2 == 1)
        { for(i=0; i<n; i++) (data+i)->out = (data+i)->s; }
}
Xfilter()
{
    /* initialize all internal nodes to be unknown states. */
}
• Digital filter BSIM output, normal mode: parallel output of carry and sum
**** BSIM 1.0 ****
bsim --> w S0 C0 S1 C1 S9 C9 S10 C10 S18 C18 S19 C19
bsim --> = SOE 1
bsim --> = Xin 0
bsim --> c
S0=X C0=X S1=X C1=X S9=X C9=X S10=X C10=X S18=X C18=X S19=X
C19=X
bsim --> p
sreg[0]= X X X X
sreg[1]= X X X X
sreg[2]= X X X X
sreg[3]= 0 0 0 0
code(0)=X code(1)=X code(2)=X code(3)=X
The output of ROM (Partial Product) are:
pp0=X pp1=X pp2=X pp3=X
outc=X
outs=X
.
# doing the same way, continually input Xin as 7, 0, 7, 1
.
bsim --> = Xin 2 # we assume h(k) = {-2, 5, 2, 1} of {h0, h1, h2, h3}
bsim --> c # Y(3) when Xin(k) = {7, 0, 7, 0}, of {x3, x2, x1, x0}
S0=0x0 C0=0x0 S1=0x0 C1=0x1 S9=0x1 C9=0x0 S10=0x1 C10=0x0 S18=0x1
C18=0x0 S19=0x1 C19=0x1
bsim --> p
sreg[0]= 0 0 0 0
sreg[1]= 0 1 1 1
sreg[2]= 0 0 0 1
sreg[3]= 0 0 1 0
code(0)=6 code(1)=5 code(2)=4 code(3)=0
The output of ROM (Partial Product) are:
pp0=0x40004 pp1=0x4000c pp2=0x8001a pp3=0xffffe
outs=011111111111111111100
outc=010000000000000000010
bsim --> = Xin 8
bsim --> c # X(k) = {1, 7, 0, 7}
S0=0x0 C0=0x0 S1=0x0 C1=0x1 S9=0x1 C9=0x0 S10=0x1 C10=0x0 S18=0x1
C18=0x0 S19=0x1 C19=0x1
bsim --> p
sreg[0]= 0 1 1 1
sreg[1]= 0 0 0 1
sreg[2]= 0 0 1 0
sreg[3]= 1 0 0 0
code(0)=12 code(1)=10 code(2)=8 code(3)=1
The output of ROM (Partial Product) are:
pp0=0x40007 pp1=0x40000 pp2=0x8000a pp3=0xffffe
outs=011111111111111110100
outc=010000000000000011010
bsim --> = Xin 10
bsim --> c # X(k) = {2, 1, 7, 0}
j . 
SO=Oxl CO=OxO Sl=Oxl Cl=OxO S9=0xl C9'=0x0 S10=0x1 ClO=OxO S18=0xl 
C18=0x0 S19=0xl C19=0xl 
bsim --> p 
sreg[O)== 0 0 0 1 
sreg[l]== 0 0 1 0 
sreg[2]= 1 0 0 0 
sreg[3) == 1 0 1 0 
code(0)=8 code(1)=5 code(2)=0 code(3)=3 
... 
The output of ROM (Partial Product) are: 
pp0=0x40003 ppl=Ox4000c pp2=0x80006 pr>3=0x10000e 
outs=OllllllllllllllllOlll 
outc=010000000000000001100 
bsim --> c # X(k) = {-8, 2, 1, 7}
S0=0x1 C0=0x0 S1=0x1 C1=0x0 S9=0x0 C9=0x0 S10=0x0 C10=0x0 S18=0x0
C18=0x0 S19=0x0 C19=0x1
bsim --> p
sreg[0]= 0 0 1 0
sreg[1]= 1 0 0 0
sreg[2]= 1 0 1 0
sreg[3]= 1 0 1 0
code(0)=0 code(1)=11 code(2)=0 code(3)=7
The output of ROM (Partial Product) are:
pp0=0x40001 pp1=0x40000 pp2=0x80002 pp3=0xfffe6
outs=100000000000000001011
outc=010000000000000001100
bsim --> q
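The code(j) values printed above can be reproduced from the sreg rows: each code looks like a bit slice taken across the four input shift registers. The following is a minimal sketch in C, not the BSIM source used in the thesis. The helper names `rom_code` and `rom_codes` are hypothetical, and the bit ordering (sreg[0] supplying the most significant address bit) is an assumption inferred from the logged values.

```c
#include <stdint.h>

enum { TAPS = 4, BITS = 4 };

/* Hypothetical helper (not from the thesis): form the ROM address for bit
 * position j by collecting bit j of every shift-register word. Assumption:
 * sreg[0] supplies the most significant address bit, as the log suggests. */
static unsigned rom_code(const uint8_t sreg[TAPS], unsigned j)
{
    unsigned code = 0;
    for (unsigned i = 0; i < TAPS; ++i) {
        code = (code << 1) | ((sreg[i] >> j) & 1u);
    }
    return code;
}

/* Fill codes[0..BITS-1] for one state of the shift registers. */
static void rom_codes(const uint8_t sreg[TAPS], unsigned codes[BITS])
{
    for (unsigned j = 0; j < BITS; ++j) {
        codes[j] = rom_code(sreg, j);
    }
}
```

For sreg = {0000, 0111, 0001, 0010} this yields codes 6, 5, 4, 0, and for sreg = {0010, 1000, 1010, 1010} it yields 0, 11, 0, 7, which agree with the first and last `p` outputs of the operation-mode run.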
• Digital filter BSIM output, speed test mode, use the special waveform in Fig. 4-2.
**** BSIM 1.0 ****
bsim --> w S0 C0 S1 C1 S9 C9 S10 C10 S18 C18 S19 C19
bsim --> = SOE 1
bsim --> = Phi2 1
bsim --> l Phi1
bsim --> = Xin 0
bsim --> s
bsim --> h Phi1
bsim --> s
S0=0x0 C0=0x0 S1=0x0 C1=0x1 S9=0x1 C9=0x0 S10=0x1 C10=0x0 S18=0x1
C18=0x0 S19=0x1 C19=0x1
bsim --> p
sreg[0]= 0 0 0 0
sreg[1]= 0 0 0 0
sreg[2]= 0 0 0 0
sreg[3]= 0 0 0 0
The output of ROM (Partial Product) are:
pp0=0x40000 pp1=0x40000 pp2=0x80002 pp3=0xffffe
outs=011111111111111111100
outc=010000000000000000010
bsim --> = Xin 1
bsim --> s
S0=0x0 C0=0x0 S1=0x1 C1=0x0 S9=0x1 C9=0x0 S10=0x1 C10=0x0 S18=0x1
C18=0x0 S19=0x1 C19=0x1
bsim --> p
sreg[0]= 0 0 0 1
sreg[1]= 0 0 0 1
sreg[2]= 0 0 0 1
sreg[3]= 0 0 0 1
The output of ROM (Partial Product) are:
pp0=0x40006 pp1=0x40000 pp2=0x80002 pp3=0xffffe
outs=011111111111111111110
outc=010000000000000000100
bsim --> = Xin 0
bsim --> s
S0=0x0 C0=0x0 S1=0x1 C1=0x0 S9=0x1 C9=0x0 S10=0x1 C10=0x0 S18=0x1
C18=0x0 S19=0x1 C19=0x1
bsim --> p
sreg[0]= 0 0 0 0
sreg[1]= 0 0 0 0
sreg[2]= 0 0 0 0
sreg[3]= 0 0 0 0
The output of ROM (Partial Product) are:
pp0=0x40006 pp1=0x40000 pp2=0x80002 pp3=0xffffe
outs=011111111111111111110
outc=010000000000000000100
bsim --> l Phi1
bsim --> s
S0=0x0 C0=0x0 S1=0x1 C1=0x0 S9=0x1 C9=0x0 S10=0x1 C10=0x0 S18=0x1
C18=0x0 S19=0x1 C19=0x1
bsim --> p
sreg[0]= 0 0 0 0
sreg[1]= 0 0 0 0
sreg[2]= 0 0 0 0
sreg[3]= 0 0 0 0
The output of ROM (Partial Product) are:
pp0=0x40006 pp1=0x40000 pp2=0x80002 pp3=0xffffe
outs=011111111111111111110
outc=010000000000000000100
bsim --> h Phi1
bsim --> s
S0=0x0 C0=0x0 S1=0x0 C1=0x1 S9=0x1 C9=0x0 S10=0x1 C10=0x0 S18=0x1
C18=0x0 S19=0x1 C19=0x1
bsim --> p
sreg[0]= 0 0 0 0
sreg[1]= 0 0 0 0
sreg[2]= 0 0 0 0
sreg[3]= 0 0 0 0
The output of ROM (Partial Product) are:
pp0=0x40000 pp1=0x40000 pp2=0x80002 pp3=0xffffe
outs=011111111111111111100
outc=010000000000000000010
bsim --> q
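The ROM words in these logs follow a regular pattern, and the speed test exercises its two extremes for pp0: address 0 gives 0x40000 and address 15 gives 0x40006. The sketch below is a hypothetical C model of that pattern, not the ROM generator used in the thesis. The per-ROM offsets are read off the all-zero state, the coefficient set h(k) = {-2, 5, 2, 1} is taken from the log comment, and the mapping of address bits to coefficients (address bit 0 selects h0) is an assumption inferred from the printed values.

```c
#include <stdint.h>

/* Coefficients from the log comment: h(k) = {-2, 5, 2, 1} of {h0, h1, h2, h3}.
 * Assumption: address bit k of a ROM code selects H[k]. */
static const int H[4] = { -2, 5, 2, 1 };

/* Sum of the coefficients selected by a 4-bit ROM address. */
static int selected_sum(unsigned code)
{
    int s = 0;
    for (unsigned k = 0; k < 4; ++k) {
        if (code & (1u << k)) {
            s += H[k];
        }
    }
    return s;
}

/* Hypothetical model of ROM j: a fixed offset plus the coefficient sum
 * weighted by 2^j. The sign-bit ROM (j = 3) subtracts instead of adding.
 * The offsets are the words logged for the all-zero input state. */
static uint32_t partial_product(unsigned j, unsigned code)
{
    static const int32_t offset[4] = { 0x40000, 0x40000, 0x80002, 0xffffe };
    int32_t s = selected_sum(code);
    int32_t weighted = (j == 3) ? -8 * s : (int32_t)(1 << j) * s;
    return (uint32_t)(offset[j] + weighted);
}
```

Under these assumptions the four offsets add up to 2^21, so they vanish when the partial products are summed in a 21-bit datapath. The model also reproduces words from the operation-mode run, for example 0x4000c for ROM 1 at address 10 and 0x10000e for ROM 3 at address 1.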
• Behavior Model of 32x1 DCT
model()
{
    Li_coeff(16, ROM1);
    Chen_coeff(8, 1, ROM2);
    Chen_coeff(8, 0, ROM3);
    counter();
    data_in(column);
    split_xn(column);
    inner_p(bits_n, ROM2, ppB, 8);
    inner_p(bits_p, ROM3, ppC, 8);
    correlate(Vx16, ROM1, ppA, Bx+1, 16);
    S_sum(ppB, total_n, accu_n, 1, 8);
    S_sum(ppC, total_p, accu_p, 0, 8);
    sum_up(ppA, Bx+1, 16);
}
Li_coeff(n, cell)
int n;
double cell[][16];
{
    /* use chapter 5, (5.5) to calculate 16 coefficients. */
    mix(coeff, cell, 0, 16);
}
Chen_coeff(size, odd_even, cell)
int size, odd_even;
double cell[][16];
{
    /* use chapter 5, (5.1) to calculate 8 coefficients. */
    mix(coeff, cell, 2*k, 8);
}
counter()
{
    /* reset some registers' values and also generate Phi1d and Phi2d. */
}
data_in(out_bit)
int *out_bit;
{
    vsi_reg(xn, Vx32, N);
    if (count32 == 32) {
        dump(Vx32, Lx32, N);
        count32 = 0;
    }
    lso_reg(Lx32, out_bit, N);
}
split_xn(data)
int *data;
{
    int i, i_map[N], xn_n16[16];
    i_permute(32, i_map);
    semi(data, i_map, C116, temp16, 32);
    lsi_reg(temp16, Lx16, 16);
    if (flag[Bx+1].out == 1)
        dump(Lx16, Vx16, 16);
    DIF(data, Ci_n16, Ci_p16, xn_n16, xn_p16, flag[Bx].out, 32);
    DIF(xn_p16, Ci_n8, Ci_p8, xn_n8, xn_p8, flag[Bx+1].out, 16);
    for (i = 0; i < 8; ++i) {
        bits_n[i].in = xn_n8[i];
        bits_p[i].in = xn_p8[i];
    }
    pipe_l(bits_n, 8);
    pipe_l(bits_p, 8);
}
correlate(data, ROM_id, ppout, wide, n)
reg *data;
double ROM_id[][16];
dreg *ppout;
int wide, n;
{
    s_cir(data, n);
    L_deco(data, L_code, wide, n);
    L_fetch(L_code, ROM_id, ppout, wide, n);
    pipe(ppout, wide*n/4);
}
inner_p(data, ROM_id, ppout, n)
reg *data;
double ROM_id[][16];
dreg ppout[8*8/4];
{
    int i;
    C_deco(data, C_code, n);
    C_fetch(C_code, ROM_id, ppout, n);
    pipe(ppout, n*n/4);
}
sum_up(terms, width, n)
dreg *terms;
int width, n;
{
    double sum, add_up();
    if (terms->out == X)
        sum = X;
    else
        sum = add_up(terms, width, n);
    vsi_reg_F(sum, out16, 16);
    if (flag[0].out == 1 && out16->out != X)
        dump_x(out16, Xk1, 16);
}
S_sum(terms, total, accu, e_o, n)
dreg *terms, *total, *accu;
int e_o, n;
{
    int i;
    cpa1(terms, total, n);
    pipe(total, n);
    cpa2(total, accu, e_o, n);
    pipe(accu, n);
    if (flag[14].out == 1)
        dump_e(accu, Xk2, e_o, n);
}
XDCT()
{
    /* initialize the counter and reset all nodes to be unknown states. */
}
• 32x1 DCT BSIM output
**** BSIM 1.0 ****
bsim --> = xn 1
bsim --> c
bsim --> p # first point in
0's count32=1, Phi2d=0
flag[ 0] = 0 flag[ 1] = 0 flag[ 2] = 0 flag[ 3] = 0
flag[ 4] = 0 flag[ 5] = 0 flag[ 6] = 0 flag[ 7] = 0
flag[ 8] = 0 flag[ 9] = 0 flag[10] = 0 flag[11] = 0
flag[12] = 0 flag[13] = 0 flag[14] = 0 flag[15] = 0
Vx32[31] = 1
bsim --> = xn 3
bsim --> c
Continue in the same way to input xn = 32, 23, 49, 4, 43, 71, 18, 8, 10, 54, 0, 43, 89,
92, 38, 6, 9, 26, 31, 24, 12, 84, 32, 90, 30, 24, 38, 28, 84, 7.
bsim --> c
bsim --> p # Vx16 skew circular
1's count32=26, Phi2d=1
flag[ 0] = 0 flag[ 1] = 0 flag[ 2] = 0 flag[ 3] = 0
flag[ 4] = 0 flag[ 5] = 0 flag[ 6] = 0 flag[ 7] = 0
flag[ 8] = 0 flag[ 9] = 0 flag[10] = 0 flag[11] = 0
flag[12] = 1 flag[13] = 0 flag[14] = 0 flag[15] = 0
Vx32[ 0] = 30 Vx32[ 1] = 24 Vx32[ 2] = 38 Vx32[ 3] = 28
Vx32[ 4] = 84 Vx32[ 5] = 7 Vx32[ 6] = 0 Vx32[ 7] = 1
Vx32[ 8] = 2 Vx32[ 9] = 3 Vx32[10] = 4 Vx32[11] = 5
Vx32[12] = 6 Vx32[13] = 7 Vx32[14] = 8 Vx32[15] = 9
Vx32[16] = 10 Vx32[17] = 11 Vx32[18] = 12 Vx32[19] = 13
Vx32[20] = 14 Vx32[21] = 15 Vx32[22] = 16 Vx32[23] = 17
Vx32[24] = 18 Vx32[25] = 19 Vx32[26] = 20 Vx32[27] = 21
Vx32[28] = 22 Vx32[29] = 23 Vx32[30] = 24 Vx32[31] = 25
Lx32[ 0] = 0 Lx32[ 1] = 0 Lx32[ 2] = 0 Lx32[ 3] = 0
Lx32[ 4] = 0 Lx32[ 5] = 0 Lx32[ 6] = 0 Lx32[ 7] = 0
Lx32[ 8] = 0 Lx32[ 9] = 0 Lx32[10] = 0 Lx32[11] = 0
Lx32[12] = 0 Lx32[13] = 0 Lx32[14] = 0 Lx32[15] = 0
Lx32[16] = 0 Lx32[17] = 0 Lx32[18] = 0 Lx32[19] = 0
Lx32[20] = 0 Lx32[21] = 0 Lx32[22] = 0 Lx32[23] = 0
Lx32[24] = 0 Lx32[25] = 0 Lx32[26] = 0 Lx32[27] = 0
Lx32[28] = 0 Lx32[29] = 0 Lx32[30] = 0 Lx32[31] = 0
Lx16[ 0] = 4090 Lx16[ 1] = 4015 Lx16[ 2] = 25 Lx16[ 3] = 34
Lx16[ 4] = 66 Lx16[ 5] = 4049 Lx16[ 6] = 26 Lx16[ 7] = 4070
Lx16[ 8] = 4042 Lx16[ 9] = 83 Lx16[10] = 4073 Lx16[11] = 4
Lx16[12] = 39 Lx16[13] = 4 Lx16[14] = 4081 Lx16[15] = 4082
Vx16[ 0] = -1010 Vx16[ 1] = 1018 Vx16[ 2] = 943 Vx16[ 3] = 25
Vx16[ 4] = 34 Vx16[ 5] = 66 Vx16[ 6] = 977 Vx16[ 7] = 26
Vx16[ 8] = 998 Vx16[ 9] = 970 Vx16[10] = 83 Vx16[11] = 1001
Vx16[12] = 4 Vx16[13] = 39 Vx16[14] = 4 Vx16[15] = 1009
bsim --> c
bsim --> p # Chen's 1st set of DCT out
1's count32=30, Phi2d=1
Xk2[ 0] = 1103.000 Xk2[ 2] = -69.670
Xk2[ 4] = -31.006 Xk2[ 6] = -214.594
Xk2[ 8] = 149.856 Xk2[10] = -31.645
Xk2[12] = -65.075 Xk2[14] = -206.995
Xk2[16] = 51.619 Xk2[18] = 17.386
Xk2[20] = -116.116 Xk2[22] = -12.637
Xk2[24] = -105.698 Xk2[26] = -128.523
Xk2[28] = 77.038 Xk2[30] = 69.795
bsim --> c
bsim --> p # Li's 1st of DCT out, 3rd Lx32 dumped
3's count32=2, Phi2d=1
Xk1[ 1] = -134.653 Xk1[ 3] = -79.446
Xk1[ 5] = 81.584 Xk1[ 7] = -191.323
Xk1[ 9] = 69.559 Xk1[11] = -113.796
Xk1[13] = 124.525 Xk1[15] = -57.407
Xk1[17] = 221.764 Xk1[19] = 28.203
Xk1[21] = -35.964 Xk1[23] = 18.993
Xk1[25] = 126.963 Xk1[27] = 205.545
Xk1[29] = 7.822 Xk1[31] = -120.301
bsim --> q
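The behavior model splits the 32 inputs with two DIF() stages before the ROM lookups. The sketch below shows what such a split could look like, assuming DIF() denotes the usual decimation-in-frequency butterfly (sums feed the even-indexed coefficients, differences feed the odd-indexed ones). The function names `dif_split` and `sum_of` are hypothetical, and the input array repeats the xn values entered in the log above.

```c
enum { NPTS = 32 };

/* Input samples in the order they were entered in the BSIM session. */
static const int XN[NPTS] = {
     1,  3, 32, 23, 49,  4, 43, 71,
    18,  8, 10, 54,  0, 43, 89, 92,
    38,  6,  9, 26, 31, 24, 12, 84,
    32, 90, 30, 24, 38, 28, 84,  7
};

/* Hypothetical butterfly (an assumption about what DIF() computes):
 * pair sample i with sample n-1-i and emit their sum and difference. */
static void dif_split(const int *x, int n, int *sum, int *diff)
{
    for (int i = 0; i < n / 2; ++i) {
        sum[i]  = x[i] + x[n - 1 - i];
        diff[i] = x[i] - x[n - 1 - i];
    }
}

/* Plain sum of n values; for an unnormalized DCT this is the k = 0 term. */
static int sum_of(const int *x, int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}
```

The samples add up to 1103, and the total is preserved by each sum branch, which is consistent with Xk2[ 0] = 1103.000 in the log. The first two differences are -6 and -81; read as 12-bit two's complement these are 4090 and 4015, the values shown for Lx16[ 0] and Lx16[ 1], although that reading of Lx16 is an inference rather than something the listing states.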
APPENDIX 2
RSIM SIMULATION
• 4-Tap Digital Filter
In this section, the RSIM input commands and the corresponding outputs are listed. The
testing covers operation mode, serial output mode, and speed test mode. However, the
part for serial output mode is omitted because it is the same as operation mode except
that SOE=0 and the output is taken from the SO pin.
• RSIM input command of operation mode, carry and sum are parallel out.
(vector definitions are abridged to keep the text short)
vector xk
chip1_0/chip_0/regA_l_0/con5_0/x2
chip1_0/chip_0/regA_r_0/con6_0/x0
chip1_0/chip_0/regA_l_0/con5_0/x3
chip1_0/chip_0/regA_r_0/con6_0/x1
vector sx0 chip1_0/chip_0/regA_l_0/regcon_0{0}/sx0
chip1_0/chip_0/regA_l_0/regcon_0{1}/sx0
chip1_0/chip_0/regA_r_0/regcon_0{0}/sx0
chip1_0/chip_0/regA_r_0/regcon_0{1}/sx0
vector code0 chip1_0/chip_0/decA_r_0/dec_r_0{0,1}/select
chip1_0/chip_0/decA_r_0/dec_r_0{1,1}/select
chip1_0/chip_0/decA_r_0/dec_r_0{2,1}/select
.
.
chip1_0/chip_0/decA_r_0/dec_r_0{15,1}/select
. .
|
vector pp1
chip1_0/chip_0/addA_0/add_0{1}/IND
.
.
chip1_0/chip_0/addA_0/add_0{20}/IND
chip1_0/chip_0/addA_0/add_0{0}/IND
vector outs chip1_0/chip_0/outregA_0/outreg_0{0}/outs
chip1_0/chip_0/outregA_0/outreg_0{1}/outs
chip1_0/chip_0/outregA_0/outreg_0{2}/outs
.
.
chip1_0/chip_0/outregA_0/outreg_0{19}/outs
chip1_0/chip_0/outregA_0/outreg_0{20}/outs
vector outc chip1_0/chip_0/outregA_0/outreg_0{0}/outc
chip1_0/chip_0/outregA_0/outreg_0{1}/outc
chip1_0/chip_0/outregA_0/outreg_0{2}/outc
.
.
chip1_0/chip_0/outregA_0/outreg_0{19}/outc
chip1_0/chip_0/outregA_0/outreg_0{20}/outc
w outc
w outs
w pp3
w pp2
w pp1
w pp0
w code3 code2 code1 code0
w sx0
w sx1
w sx2
w sx3
w xk
|THIS IS OPERATION MODE
stepsize 1000
h outen_l_0_SOE outen_r_0_SOE
clock phi1_left 1 0
clock phi2_left 0 1
max on
period 1000
|
V chip1_0/chip_0/regA_r_0/con6_0/x0 0 1 0 1 1 0 0 0
|
V chip1_0/chip_0/regA_r_0/con6_0/x1 0 1 0 1 0 1 0 1
V chip1_0/chip_0/regA_l_0/con5_0/x2 0 1 0 1 0 0 0 0
V chip1_0/chip_0/regA_l_0/con5_0/x3 0 0 0 0 0 0 1 1
|
R
C
C
| the latest signals
pn 3 0
exit
• RSIM output, operation mode, carry, sum are parallel out
*** RSIM Version 6.0 ***
11990 nodes, transistors: n-channel=3534 p-channel=1956
xk=0000 sx3=0000 sx2=XXXX sx1=XXXX sx0=XXXX code0=X1X1X1X1X1X1X1X1
code1=X1X1X1X1X1X1X1X1 code2=X1X1X1X1X1X1X1X1 code3=X1X1X1X1X1X1X1X1
pp0=XXXXXXXXXXXXXXXXXXXXX pp1=XXXXXXXXXXXXXXXXXXXXX
pp2=XXXXXXXXXXXXXXXXXXXXX
pp3=XXXXXXXXXXXXXXXXXXXXX outs=XXXXXXXXXXXXXXXXXXXXX
outc=XXXXXXXXXXXXXXXXXXXXX
time = 200.0ns
time = 1000.0ns
.
.
xk=0010 sx3=0010 sx2=0001 sx1=0111 sx0=0000 code0=1111110111111111
code1=1111101111111111 code2=1111011111111111 code3=0111111111111111
pp0=001000000000000000100 pp1=001000000000000001100
pp2=010000000000000011010
pp3=011111111111111111110 outs=011111111111111111100
outc=010000000000000000010
time = 1200.0ns
xk=0010 sx3=0010 sx2=0001 sx1=0111 sx0=0000 code0=1111110111111111
code1=1111101111111111 code2=1111011111111111 code3=0111111111111111
pp0=001000000000000000100 pp1=001000000000000001100
pp2=010000000000000011010
pp3=011111111111111111110 outs=011111111111111111100
outc=010000000000000000010
time = 1200.0ns
xk=1000 sx3=1000 sx2=0010 sx1=0001 sx0=0111 code0=1111111111110111
code1=1111111111011111 code2=1111111101111111 code3=1011111111111111
pp0=001000000000000000111 pp1=001000000000000000000
pp2=010000000000000001010
pp3=011111111111111111110 outs=011111111111111110100
outc=010000000000000011010
time = 1400.0ns
xk=1000 sx3=1000 sx2=0010 sx1=0001 sx0=0111 code0=1111111111110111
code1=1111111111011111 code2=1111111101111111 code3=1011111111111111
pp0=001000000000000000111 pp1=001000000000000000000
pp2=010000000000000001010
pp3=011111111111111111110 outs=011111111111111110100
outc=010000000000000011010
time = 1400.0ns
xk=1010 sx3=1010 sx2=1000 sx1=0010 sx0=0001 code0=1111111101111111
code1=1111101111111111 code2=0111111111111111 code3=1110111111111111
pp0=001000000000000000011 pp1=001000000000000001100
pp2=010000000000000000110
pp3=100000000000000001110 outs=011111111111111110111
outc=010000000000000001100
time = 1600.0ns
xk=1010 sx3=1010 sx2=1000 sx1=0010 sx0=0001 code0=1111111101111111
code1=1111101111111111 code2=0111111111111111 code3=1110111111111111
pp0=001000000000000000011 pp1=001000000000000001100
pp2=010000000000000000110
pp3=100000000000000001110 outs=011111111111111110111
outc=010000000000000001100
time = 1600.0ns
xk=1010 sx3=1010 sx2=1010 sx1=1000 sx0=0010 code0=0111111111111111
code1=1111111111101111 code2=0111111111111111 code3=1111111011111111
pp0=001000000000000000001 pp1=001000000000000000000
pp2=010000000000000000010
pp3=011111111111111100110 outs=100000000000000001011
outc=010000000000000001100
time = 1800.0ns
xk=1010 sx3=1010 sx2=1010 sx1=1010 sx0=1000 code0=0111111111111111
code1=1111111011111111 code2=0111111111111111 code3=1111111111111110
pp0=001000000000000000000 pp1=001000000000000001000
pp2=010000000000000000010
pp3=011111111111111010110 outs=011111111111111100101
outc=010000000000000000010
time = 2000.0ns /* timing statistics, shows that max. delay is 12.9ns */
120 129 
112 126 
112 126 
8120 
8112 
8112 
1129 
1126 
1126 
82 0 386 
75 0 324 
75 0 320 
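The outs and outc vectors in the RSIM output above can be cross-checked against the filter arithmetic. The following C sketch is a hypothetical post-processing aid, not part of the RSIM flow described in the thesis. It assumes, based only on the logged values, that the carry vector carries twice the weight of the sum vector and that the 21-bit result is two's complement. The names `parse_bits`, `resolve_carry_save`, `active_low_index`, and `fir4` are introduced here for illustration.

```c
#include <stdint.h>

enum { WIDTH = 21 };

/* Parse a run of '0'/'1' characters, leftmost bit most significant.
 * No validation: parsing stops at the first other character. */
static int32_t parse_bits(const char *s)
{
    int32_t v = 0;
    for (; *s == '0' || *s == '1'; ++s) {
        v = (v << 1) | (*s - '0');
    }
    return v;
}

/* Assumption inferred from the log: result = outs + 2*outc modulo 2^21,
 * interpreted as a 21-bit two's-complement number. */
static int32_t resolve_carry_save(const char *outs, const char *outc)
{
    int32_t y = (parse_bits(outs) + 2 * parse_bits(outc)) & ((1 << WIDTH) - 1);
    if (y & (1 << (WIDTH - 1))) {
        y -= (1 << WIDTH);
    }
    return y;
}

/* Position, counted from the left, of the single '0' in a one-hot
 * active-low select vector; -1 unless there is exactly one '0'. */
static int active_low_index(const char *s)
{
    int index = -1;
    for (int i = 0; s[i] != '\0'; ++i) {
        if (s[i] == '0') {
            if (index >= 0) {
                return -1;
            }
            index = i;
        }
    }
    return index;
}

/* Direct 4-tap convolution with h = {-2, 5, 2, 1}; x[0] is the newest sample. */
static int fir4(const int x[4])
{
    static const int h[4] = { -2, 5, 2, 1 };
    int y = 0;
    for (int k = 0; k < 4; ++k) {
        y += h[k] * x[k];
    }
    return y;
}
```

With these assumptions, the vectors printed at 1400.0ns, 1600.0ns, and 1800.0ns resolve to 40, 15, and 35, the same values that `fir4` gives for the inputs {1, 7, 0, 7}, {2, 1, 7, 0}, and {-8, 2, 1, 7} annotated in the BSIM session. The select vector code0=1111110111111111 decodes to index 6, matching code(0)=6 in the BSIM output.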
• RSIM input command of speed test mode, use the special waveform in Fig. 4-2.
vector xk
chip1_0/chip_0/regA_l_0/con5_0/x2
chip1_0/chip_0/regA_r_0/con6_0/x0
chip1_0/chip_0/regA_l_0/con5_0/x3
chip1_0/chip_0/regA_r_0/con6_0/x1
|
| ..... define all vectors like we did in operation mode .....
w outc
w outs
w pp3
w pp2
w pp1
w pp0
w code3 code2 code1 code0
w sx0
w sx1
w sx2
w sx3
w xk
|THIS IS SPEED TEST MODE
stepsize 1000
w -pp0 -pp1 -pp2 -pp3
h phi2_left outen_l_0_SOE outen_r_0_SOE
clock phi1_left 0 1 1 1
| xk
clock chip1_0/chip_0/regA_r_0/con6_0/x0 0 0 1 0
t chip1_0/chip_0/outregA_0/outreg_0{19}/outs
t chip1_0/chip_0/outregA_0/outreg_0{19}/outc
C
C
| now, repeat it again, but step by step !!!
|
clock
l phi1_left chip1_0/chip_0/regA_r_0/con6_0/x0
s
h phi1_left
s
h chip1_0/chip_0/regA_r_0/con6_0/x0
s
l chip1_0/chip_0/regA_r_0/con6_0/x0
s
l phi1_left
s
| the latest signals
pn 3 0
exit
• RSIM output of speed test mode.
*** RSIM Version 6.0 ***
11990 nodes, transistors: n-channel=3534 p-channel=1956
xk=0000 sx3=0000 sx2=0000 sx1=0000 sx0=0000 code0=0111111111111111
code1=0111111111111111 code2=0111111111111111 code3=0111111111111111
outs=X1XXXXXXXXXXXXXXXXXX1 outc=01XXXXXXXXXXXXXXXX1X0
time = 400.0ns
[event #4543] node 544: X -> 0 @ 511.Sns
[event #4898] node 541: X -> 0 @ 516.9ns
[event #5071] node 544: 0 -> 1 @ 524.0ns
[event #5770] node 544: 1 -> 0 @ 633.7ns
[event #5786] node 541: 0 -> 1 @ 634.4ns
xk=0000 sx3=0000 sx2=0000 sx1=0000 sx0=0000 code0=0111111111111111
code1=0111111111111111 code2=0111111111111111 code3=0111111111111111
outs=011000000000000000111 outc=010111111111111111100
time = 800.0ns
xk=0000 sx3=0000 sx2=0000 sx1=0000 sx0=0000 code0=0111111111111111
code1=0111111111111111 code2=0111111111111111 code3=0111111111111111
outs=011000000000000000111 outc=010111111111111111100
time = 900.0ns
[event #7524] node 541: 1 -> 0 @ 916.6ns
[event #7688] node 544: 0 -> 1 @ 924.0ns
xk=0000 sx3=0000 sx2=0000 sx1=0000 sx0=0000 code0=0111111111111111
code1=0111111111111111 code2=0111111111111111 code3=0111111111111111
outs=011111111111111111100 outc=010000000000000000010
time = 1000.0ns
[event #8371] node 544: 1 -> 0 @ 1033.7ns
[event #8387] node 541: 0 -> 1 @ 1034.4ns
xk=0001 sx3=0001 sx2=0001 sx1=0001 sx0=0001 code0=1111111111111110
code1=0111111111111111 code2=0111111111111111 code3=0111111111111111
outs=011000000000000000111 outc=010111111111111111100
time = 1100.0ns
xk=0000 sx3=0000 sx2=0000 sx1=0000 sx0=0000 code0=0111111111111111
code1=0111111111111111 code2=0111111111111111 code3=0111111111111111
outs=011000000000000000111 outc=010111111111111111100
time = 1200.0ns
xk=0000 sx3=0000 sx2=0000 sx1=0000 sx0=0000 code0=0111111111111111
code1=0111111111111111 code2=0111111111111111 code3=0111111111111111
outs=011000000000000000111 outc=010111111111111111100
/* timing statistics, shows path delay is 59.2ns when all pipeline stages on */
time = 1300.0ns
- 592 
- 582 
579 
• 
1592 161 1 631 
1582 118 0 659 
1579 153 0 657 
APPENDIX 3: Design Files Index
top (filter + global VDD/GND, to speed up simulation)
|
filter (datapath + padframe)
|
chip2 (datapath after pwrframe)
chip1 (datapath after cellframe)        others (see next page)
chip (datapath)
|- Input reg. part: regA_l - reg, regcon
|                   con5
|                   invA_l - inv_l
|- Decoder part: decA_l - dec_l, dec_m2c
|                quaA_l - qua_l
|- ROM part: rom - mem, pmos, rom_m2c
|- Adder tree: addA - add
|- Output reg. part: outregA - outreg
|                    outen_l
|                    bufA - buf
|- Others: pregA - preg
# for xxx.mag files
con_l # xxx_r.mag is like xxx_l.mag
• LOGO xxx.mag files
1.mag, N60.mag, 8.mag, O60.mag, 9.mag, P60.mag, A60.mag, B60.mag,
C60.mag, logo.mag, D120.mag, D60.mag, E60.mag, F60.mag, G60.mag,
H120.mag, R60.mag, H60.mag, S60.mag, HUANG.mag, T60.mag,
copyright.mag, I60.mag, U60.mag, J60.mag, UNIVERSITY.mag, L120.mag,
V60.mag, L60.mag, W120.mag, LEHIGH.mag, W60.mag, LI.mag, Y60.mag,
M60.mag, a60.mag.
• standard pad layout
PadClkIn.mag, PadGnd.mag, PadGndToDP.mag, PadIn.mag, PadOut.mag,
PadTri.mag, PadVdd.mag, PadVddToDP.mag.
• padframe
pad.info: pad listing from user.
pad.input: pad.info after pad2frame.
padframe.mag: pad layout file, pad.input after doing framegen.
• RSIM files
command files: pad0.cmd, pad1.cmd, pad2.cmd
output files: pad0.out, pad1.out, pad2.out
0: normal mode, 1: serial output mode, 2: speed test mode
• Others
chip_net.net: routing description file between datapath & pads.
Top.bug: log file for top.mag rsim result.
VITA
Name: Dajen Huang
Birth date/place: May 5, 1959 / Char-I, Taiwan.
Father: Thai-Z Huang, Mother: Soon C. Huang
Education: received the B.S. degree in electrical engineering from
National Tsing Hua University, Taiwan, in 1983, and the M.S. degree
in electrical engineering from Lehigh University, Pennsylvania, in
1989.
Work Experience: employed as an integrated circuit design engineer at
Texas Instruments, Taiwan, 1983-1987.
