VLSI implementation of adaptive BIT/serial IIR filters by Kiaei, Sayfe
AN ABSTRACT OF THE THESIS OF
Rajeev Badyal for the degree of Master of Science in Electrical and Computer
Engineering presented on January 29, 1992.
Title: VLSI Implementation of Adaptive BIT/Serial IIR Filters.
Abstract approved: Redacted for Privacy
Sayfe Kiaei
A new structure for the implementation of a bit/serial adaptive IIR filter is
presented. The bit level system consists of gated full adders for the arithmetic
unit and data latches for the data path. This approach allows the recursive
operation of the IIR filter to be implemented without any global
interconnections and with minimal delay time, chip area and I/O pins. The
coefficients of the filter can be updated serially in real time for time invariant
and adaptive filtering. A fourth order bit/serial IIR filter is implemented in a
2 micron CMOS technology clocked at 55 MHz.
VLSI Implementation of Adaptive BIT/Serial IIR Filters
by
Rajeev Badyal
A THESIS
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Master of Science
Completed January 29, 1992
Commencement April 1992
APPROVED:
Redacted for Privacy
Assistant Professor of Electrical and Computer Engineering, in charge of
major
Redacted for Privacy
Head of Department of Electrical and Computer Engineering
Redacted for Privacy
Dean of Graduate School
Date thesis is presented January 29, 1992
Typed by Rajeev Badyal for Rajeev Badyal
TABLE OF CONTENTS
1. INTRODUCTION
1.1 GOALS AND OUTLINE
2. SYSTOLIC IMPLEMENTATION OF IIR FILTER
2.1 IIR FILTER STRUCTURE
2.2 OVERVIEW OF THE MULTIPLIER CELL DESIGN
3. IMPLEMENTATION OF IIR FILTER CHIP
3.1 FILTER CELL
3.2 FOURTH ORDER IIR FILTER
3.3 LAYOUT
3.4 TIMING
4. CONCLUSION
BIBLIOGRAPHY
LIST OF FIGURES
FIGURE
1 IIR FILTER BLOCK
2 FILTER CELL
3 INTERACTION OF THE BITS IN WORDS A1K AND BK1 TO FORM C11
4 BIT LEVEL CELL FOR MULTIPLICATION
5 BASIC ARRAY REQUIRED FOR PIPELINED MULTIPLICATION
6 FORMATION OF INNER PRODUCT
7 MAIN ARRAY CELL SHOWING LOGIC FUNCTION FOR 2'S COMPLEMENT ARITHMETIC
8 ACCUMULATOR CELL SHOWING LOGIC FUNCTION OPERATION
9 SERIAL/PARALLEL MULTIPLIER
10 CIRCUIT FOR LOADING COEFFICIENTS
11 FULL ADDER
12 LAYOUT OF A 1-BIT STATIC FULL ADDER
13 LAYOUT OF A 1-BIT MULTIPLIER CELL
14 D FLIP FLOP
15 SPICE SIMULATION OF THE CRITICAL PATH
VLSI IMPLEMENTATION OF ADAPTIVE BIT/SERIAL IIR FILTERS
Chapter 1
INTRODUCTION
The demand for high speed real-time signal processing has led to
many new and innovative array architectures for digital filtering. A
digital filter is a computational process in which a sequence of input
signals is converted into a sequence of output signals representing the
alteration of the data in some prescribed manner. A common example is
the process of passing a certain range of frequencies in a signal while
rejecting all other frequencies.
Systolic arrays have played a significant role and have been applied
in many algorithms for digital signal processing. A systolic system
consists of a set of interconnected cells, each capable of performing
simple operations. The uniform, regular communication and control
structures of these arrays offer a substantial advantage for the VLSI
design and implementation of many algorithms. Information in a
systolic system flows between cells in a local pipelined fashion.
Communication with the outside world occurs only at the "boundary"
cells: only those cells on the array boundaries may be I/O ports of
the system. Advantages of these arrays include modular
expandability, regular data and control flows, uniform cells,
elimination of global broadcasting, limited fan-in and fast response time.
Bit level systolic techniques have been applied to the design of
many non-recursive components such as multipliers/adders, FIR filters,
correlators, discrete Fourier transforms and other DSP algorithms [1-2].
This approach, however, has had only limited application to the
implementation of recursive algorithms such as IIR filters [3-4].
1.1 GOALS AND OUTLINE
This thesis introduces a bit serial implementation of a high speed
adaptive serial/parallel IIR filter using the architectures presented in [5]
and discusses some of the practical issues involved in the VLSI
implementation of these architectures. The filter is adaptive in that the
coefficients $a_1, a_2, \dots, a_n$ and $b_1, b_2, \dots, b_n$ can be altered on line to
generate high pass, low pass or bandpass filtering. The serial/parallel
implementation reduces the number of I/O pads, owing to the serial data
input, and reduces the chip area.
Figure 1 shows an architecture of a fourth order IIR filter and the
internal structure of each cell is depicted in figure 2. It is necessary that
the cell area is minimized and at the same time high throughput is
obtained. Each cell consists of two multipliers, which occupy the largest
area. The goal is to implement area efficient multipliers.
Highly parallel algorithms for multiplication have been presented
in [6-8]. The details of the different multipliers are discussed in the
next chapters. These parallel array multipliers provide high throughput
but require large area as well as a large number of I/O pins. Parallel array
multipliers are expensive and impractical for large order filter
implementation. Alternative bit-serial methods are investigated in this
thesis.
Chapter two discusses the systolic architectures for IIR filters
presented in [5]. Different approaches for the internal arithmetic
architectures and their advantages and disadvantages are discussed in
this chapter. Chapter three describes the complete design and
implementation of a fourth order IIR filter chip using the filter cell approach
discussed in Chapters 1 and 2. It covers the general
structure of the 8-bit filter cell, followed by the implementation and
operation of the fourth order filter. The layout, pinout and size of the
chip and the timing requirements of the chip are also discussed in chapter
three, followed by a conclusion.
Figure 1: IIR filter block
Figure 2: Filter cell
Chapter 2
SYSTOLIC IMPLEMENTATION OF IIR FILTER
Traditionally IIR filters have been implemented using
microprocessors programmed to perform the necessary arithmetic
operations on the digital data to obtain the desired filter type. This
sequential approach limits the throughput of the system, and the
parallelism of the filter is not exploited.
One of the most important applications of systolic arrays has
been in filter designs. The systolic array approach exploits the
parallelism of the IIR filters.
2.1 IIR FILTER STRUCTURE
An IIR filter can be expressed in the recursive form as
$$y(k) = a_1 y(k-1) + a_2 y(k-2) + \dots + a_n y(k-n) + b_0 u(k) + b_1 u(k-1) + b_2 u(k-2) + \dots + b_n u(k-n) \tag{1}$$
Taking the z transform of (1),
$$Y(z) = a_1 Y(z) z^{-1} + a_2 Y(z) z^{-2} + \dots + a_n Y(z) z^{-n} + b_0 U(z) + b_1 U(z) z^{-1} + \dots + b_n U(z) z^{-n}$$
so that
$$H(z) = \frac{Y(z)}{U(z)} = \frac{N(z)}{D(z)} = \frac{b_0 + b_1 z^{-1} + \dots + b_n z^{-n}}{1 - a_1 z^{-1} - \dots - a_n z^{-n}}$$
The first part, $\sum_{i=1}^{n} a_i y(k-i)$, or 1/D(z), is known as the
Auto-Regressive (AR) part and the second part, $\sum_{i=0}^{n} b_i u(k-i)$, or N(z), is
the moving average (MA) part of the filter. An architecture for the
direct implementation of the IIR filter is shown in figure 1. This Nth
order filter architecture is a cascade of basic filter cells, where the
number of cells used for the filter is equal to the order of the filter.
Figure 2 shows the internal design of the filter cell. Each cell consists of
two multipliers, two adders and two n-bit registers containing the
coefficients $a_1, a_2, \dots, a_n$ and $b_1, b_2, \dots, b_n$. Efficient VLSI
implementation of this filter requires minimal area and high
throughput of the inner product (multiplier and adder) block for each
cell. Highly parallel algorithms for multiplication have been presented
in [6-8]. In general these algorithms require $n^2$ basic cells (full adders),
where n is the number of bits of the input data word. For the IIR filter
cell, $2n^2$ basic cells are required and hence $2Nn^2$ multiplying basic cells
are required for an Nth order filter. These multipliers provide high
throughput but require large area and many I/O pins, which makes them
prohibitive for higher order filter design.
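As a point of reference, the recursion in equation (1) can be evaluated directly in software. The following is a minimal word-level sketch, assuming zero initial conditions and floating point arithmetic; the function name `iir_filter` and its argument layout are illustrative and do not come from the thesis, which implements the recursion bit-serially in hardware.

```python
def iir_filter(a, b, u):
    """Evaluate y(k) = sum_{i=1..n} a_i*y(k-i) + sum_{i=0..n} b_i*u(k-i).

    a holds a_1..a_n, b holds b_0..b_n; samples before k = 0 are taken as zero.
    """
    n = len(a)
    y = []
    for k in range(len(u)):
        # Auto-regressive (AR) part: feedback from earlier outputs.
        ar = sum(a[i - 1] * y[k - i] for i in range(1, n + 1) if k - i >= 0)
        # Moving-average (MA) part: current and earlier inputs.
        ma = sum(b[i] * u[k - i] for i in range(len(b)) if k - i >= 0)
        y.append(ar + ma)
    return y


# First order example: impulse response of y(k) = 0.5*y(k-1) + u(k).
impulse_response = iir_filter([0.5], [1.0, 0.0], [1.0, 0.0, 0.0, 0.0])
```

Note that the feedback coefficients enter with a positive sign, matching equation (1); some software libraries use the opposite sign convention for the denominator.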
2.2 OVERVIEW OF THE MULTIPLIER CELL DESIGN
In this section various approaches for the design of multipliers
presented in [6-8] are discussed.
McCanny et al. [6]: They have presented a bit level multiplier as
illustrated in figure 3. Each square represents a gated full adder unit
into which the input bits are latched. The data bits are expanded
horizontally. The organization of bits is such that each word enters the
array in a bit-serial manner, the words $a_{1k}$ from the right and the words $b_{k1}$
from the left, with the least significant bit in each word ($a_{1k}^{0}$, $b_{k1}^{0}$)
entering ahead of the next significant bit ($a_{1k}^{1}$, $b_{k1}^{1}$) and so on. On
each pulse of the system clock, bits in the words $a_{1k}$ move one cell to the
left whilst bits in the words $b_{k1}$ move one cell to the right, and as these
interact any partial products which are formed are passed vertically
downwards. Any carry bits which are generated in the process remain
on fixed sites and are latched from the output of a cell to its input
on each clock cycle. Each row of cells within a diamond shaped region
contains all the partial product sums required to form a bit of given
significance in the result. In order to complete the full sum of products
operation, the partial product bits which are formed within each row
must be accumulated as the diamond emerges from the array. This
final accumulation process can be carried out by adding a pipelined tree
of adder cells to the bottom of the array. The results from this can be
clocked out in a bit serial manner, least significant bit first, and a
complete result is formed every $2m-1$ clock cycles. This architecture
requires $2n^2-n$ cells, each as shown in figure 4, and has a 50%
efficiency. For the IIR filter cell, $2(2n^2-n)$ cells are required and hence
$2N(2n^2-n)$ cells are required for an Nth order filter.
Figure 3: Interaction of the bits in words $a_{1k}$ and $b_{k1}$ to form $c_{11}$
c <-- c' XOR (a.b) XOR cy'
cy <-- (a.b).c' + (a.b).cy' + c'.cy'
Figure 4: Bit level cell for multiplication
McCanny et al. [7]: They have presented a completely iterative,
pipelined multiplier array as shown in figure 5. It comprises a
diamond-shaped array of $m^2$ latched, gated full adder cells, each of
which is connected only to its nearest neighbors. All the inputs and
outputs of each cell are latched. The operation of each cell is illustrated
in figure 5. It performs the 1-bit logic function:
p = a * b; s = s' XOR c' XOR p;
c = (s' * c') + (c' * p) + (s' * p)
where a and b represent individual bits of the two numbers to be
multiplied, s' is one bit of the accumulating sum of partial products
and c' is a bit carried in from the previous stage of the calculation. The
resulting value of s, the corresponding carry bit c and the input bits a
and b are all stored in latches and then passed on to neighboring cells.
The value of s' is also latched on input to each cell.
The nth pair of numbers a(n) and b(n) to be multiplied are
input to the circuit along the upper edges of the array with their
constituent bits $a_k(n)$ and $b_k(n)$ staggered in time, as indicated by
means of the external latches in figure 5. The numbers are arranged such
that the most significant bit of a(n) (i.e. $a_{m-1}(n)$) and the least
significant bit of b(n) (i.e. $b_0(n)$) enter the circuit on one clock cycle, the
second most significant bit of a(n) and the second least significant bit of
b(n) on the next clock cycle, and so on. This arrangement ensures that
as each bit of a(n) moves across the array, it meets every bit of b(n),
one at each of the cells which it crosses. The kth bit of each partial
product $a_{k-i}(n) b_i(n)$ is formed on one of the cells within the kth
vertical column, and the kth bit of the product
$$s_k(n) = \sum_{i=0}^{k} a_{k-i}(n)\, b_i(n)$$
is formed by letting these components accumulate as $s_k(n)$ passes
down that column. The sum bits $s'_k(n)$ and carry bits $c'_k(n)$ which
enter the array boundaries are set equal to zero, and the carries which
are generated at each stage within the array are simply passed to the left
(the next most significant column) on the next clock cycle.
This circuit constitutes a pipelined carry-save multiplier and, since
the carries do not have to ripple through at each stage of the
calculation, the clock speed is limited only by the propagation delay
within a single cell. However, as with any carry-save device, the
residual carry bits which leave the basic multiplier circuit across the
lower left-hand boundary in figure 5 must be added into the sum of
partial products in order to complete the multiplication. This
architecture requires $n^2$ cells and hence $2Nn^2$ cells for an Nth order
filter.
[Figure 5 cell function; the external latches at the array inputs each provide one delay]
S <-- S' xor (a.b) xor c'
c <-- (a.b).S' + (a.b).c' + S'.c'
Figure 5: Basic array required for pipelined multiplication
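The 1-bit cell function quoted above is a full adder whose third input is the gated product bit. A minimal sketch of that function follows, assuming single-bit integer inputs; the name `multiplier_cell` is illustrative only, and the latching of inputs and outputs described in the text is not modelled.

```python
def multiplier_cell(a, b, s_in, c_in):
    """Gated full adder: add the product bit a AND b to the incoming sum and carry."""
    p = a & b                                    # gated partial product bit
    s = s_in ^ c_in ^ p                          # sum output
    c = (s_in & c_in) | (c_in & p) | (s_in & p)  # carry output (majority function)
    return s, c


# Exhaustive check: the cell outputs always encode p + s_in + c_in as 2*c + s.
for a in (0, 1):
    for b in (0, 1):
        for s_in in (0, 1):
            for c_in in (0, 1):
                s, c = multiplier_cell(a, b, s_in, c_in)
                assert 2 * c + s == (a & b) + s_in + c_in
```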
R.B. Urquhart et al. [8]: They have proposed a 100% efficient
architecture for multiplication. Consider the formation of a product G
of a data word X and coefficient word W, where X and W are of B and C
bits, respectively, i.e.
X = (x(B-1), x(B-2), ..., x(1), x(0))
W = (w(C-1), w(C-2), ..., w(1), w(0))
G = (..., g(0))
G can be formed by summing the partial products $g_{ij} = x(i) w(j)$,
provided the significance of the partial products is taken into account. If
we store the bits of W across a linear chain of C cells, a parallelogram of
partial products can be formed by passing the bits of the X word across
the chain. If the least significant bits of each word interact first, partial
products of equal significance will emerge from the chain
simultaneously. Alternatively, if the bits of the coefficient word W are
reversed, partial products of equal significance appear at a slant within
the parallelogram. In either case, if a sequence of X words is
multiplied by W, their parallelograms of partial products will butt,
resulting in a 100% utilization of processing elements.
Both of these schemes are easily extended to the formation of the
inner product G* of vectors X = (X1, X2, ..., XN) and W = (W1, W2, ..., WN),
i.e.
$$G^* = \sum_{k=1}^{N} X_k W_k$$
Input data is skewed and enters the array least significant bit first (Fig. 6).
Each main array cell comprises a gated full adder and four latches for
positive valued data or a gated full adder and five latches for two's
complement data (Fig. 7). The parallelogram of partial products will
move down through the array accumulating contributions from each
term in the inner product. Since succeeding partial products are of
greater significance, carries are recirculated. Wordlength growth is
allowed for by adding a guard band of zeros ($\log_2 N$ bits wide) at the end
of each input vector. The inner product word G* must be obtained by
accumulating the parallelogram of partial products. Each accumulator
cell (Fig. 8) consists of a full adder and full latches. This architecture
requires $n^2$ processing elements and hence $2Nn^2$ cells for an Nth order
filter.
Figure 6: Formation of an inner product
S' <-- S XOR (xw XOR CTRL) XOR C
C' <-- SC + S(xw XOR CTRL) + C(xw XOR CTRL)
Figure 7: Main array cell showing logic function for
2's complement arithmetic
S' <-- S XOR g XOR C
C' <-- SC + Sg + gC
Figure 8: Accumulator cell showing logic
function operation
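The statement that G is obtained by summing the one-bit partial products x(i)w(j) with their significance taken into account can be illustrated at the arithmetic level. The sketch below assumes unsigned (positive valued) words and ignores the skewing, latching and carry recirculation of the array; the helper names `to_bits` and `product_from_partial_products` are illustrative, not taken from [8].

```python
def to_bits(value, width):
    """Return the bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def product_from_partial_products(x, w, B, C):
    """Sum the one-bit partial products x(i) AND w(j), each weighted by 2**(i + j)."""
    x_bits = to_bits(x, B)
    w_bits = to_bits(w, C)
    # columns[k] counts the partial products of significance k, i.e. the
    # entries that line up in one column of the parallelogram.
    columns = [0] * (B + C - 1)
    for i in range(B):
        for j in range(C):
            columns[i + j] += x_bits[i] & w_bits[j]
    return sum(count << k for k, count in enumerate(columns))


def inner_product(xs, ws, B, C):
    """Inner product G* as the sum of the individual word products."""
    return sum(product_from_partial_products(x, w, B, C) for x, w in zip(xs, ws))
```

For example, `inner_product([1, 2, 3], [4, 5, 6], 3, 3)` accumulates 4 + 10 + 18.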
R.F. Lyon [9]: An alternative approach to these multipliers is the bit
serial/parallel multiplier presented by R.F. Lyon [9]. This
implementation requires n processing elements for an n-bit data word
and takes 2n clock cycles to multiply two n-bit numbers, as shown in
figure 9. Each n-bit data word is followed by n zero bits in order to
flush the multiplier pipeline. For large n the serial/parallel architecture
occupies a smaller area and requires fewer I/O pins. The disadvantage of
this multiplier compared with the array multipliers is that the bit serial/parallel
multiplier has a lower throughput (one result every 2n clock cycles, as compared to
one every clock cycle for an array multiplier).
Clock   Out
1       a2x2
2       a1x2+a2x1
3       a0x2+a1x1+a2x0
4       a0x1+a1x0
5       a0x0
6       carry
Figure 9: Serial/parallel multiplier (FA: full adder, d: delay; serial input bits x0, x1, x2)
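The 2n-cycle behaviour of a serial/parallel multiplier can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not Lyon's circuit: it handles unsigned operands only, it presents each serial input bit to all n cells in the same cycle rather than pipelining it through delays, and the name `serial_parallel_multiply` is hypothetical. It does keep the features described in the text: n gated full adder cells, carries held at fixed sites, least significant bit first, and n trailing zeros to flush the result.

```python
def serial_parallel_multiply(a, x, n):
    """Multiply two unsigned n-bit numbers in 2n clock cycles, one output bit per cycle."""
    a_bits = [(a >> j) & 1 for j in range(n)]              # coefficient held in parallel
    x_stream = [(x >> t) & 1 for t in range(n)] + [0] * n  # data LSB first, then n zeros
    s = [0] * (n + 1)   # sum latches; s[n] stays 0 and feeds the top cell
    c = [0] * n         # carry latches, one per cell
    out_bits = []
    for x_bit in x_stream:
        new_s = [0] * (n + 1)
        for j in range(n):
            p = a_bits[j] & x_bit          # gated partial product bit
            s_in = s[j + 1]                # sum shifted in from the next higher cell
            new_s[j] = p ^ s_in ^ c[j]
            c[j] = (p & s_in) | (p & c[j]) | (s_in & c[j])
        s = new_s
        out_bits.append(s[0])              # serial product bit for this cycle
    return sum(bit << t for t, bit in enumerate(out_bits))
```

With n = 8 the model produces a 16-bit product after 16 cycles, which matches the 8-bit input and 16-bit output word lengths used for the chip in Chapter 3.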
Table 1 shows the comparison of the serial/parallel architecture
and the array multiplier implementations.

                                  ARRAY/PARALLEL         SERIAL/PARALLEL
                                  MULTIPLIER             MULTIPLIER
# of one bit multiplying          2n^2                   2n
elements per cell
Area per cell                     O(2n^2)                O(2n)
Throughput                        every clock cycle      every 2n clock
                                  (after initial         cycles
                                  pipeline is full)
# of one bit multiplying          2Nn^2                  2Nn
elements per Nth order filter

Table 1: Comparison of serial/parallel and array multipliers
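To make the element counts concrete, the expressions quoted in this chapter can be evaluated for a given filter order N and word length n. The sketch below simply encodes those formulas; the function name and dictionary keys are illustrative labels, not terminology from the thesis.

```python
def multiplier_element_counts(order, word_bits):
    """One-bit multiplying elements for an Nth order filter with n-bit words."""
    N, n = order, word_bits
    return {
        "bit_level_array_ref6": 2 * N * (2 * n * n - n),  # 2N(2n^2 - n)
        "array_ref7_ref8": 2 * N * n * n,                 # 2Nn^2
        "serial_parallel_ref9": 2 * N * n,                # 2Nn
    }


# Fourth order filter with 8-bit words, the configuration built in Chapter 3.
counts = multiplier_element_counts(4, 8)
```

For N = 4 and n = 8 this gives 960, 512 and 64 elements respectively, so the serial/parallel approach uses n times fewer multiplying elements than the array multipliers of [7] and [8].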
CHAPTER 3
IMPLEMENTATION OF THE IIR FILTER CHIP
This chapter describes the design and implementation of an 8-bit
input, 16-bit output fourth order serial/parallel IIR filter chip using
the approach discussed earlier in Chapters 1 and 2. Section 3.1 discusses
the general structure of the 8-bit filter cell. Section 3.2 discusses the
implementation and operation of the fourth order filter using the filter
cell. Section 3.3 discusses the layout, pinout and size of the chip and
section 3.4 describes the timing requirements of the system.
3.1 FILTER CELL
The general structure of the cell is shown in figure 2. An 8-bit
serial/parallel implementation of this filter cell consists of two 8-bit
serial/parallel multipliers, two 8-bit shift registers and two 1-bit carry
ripple full adders, as shown in figure 9.
The two serial/parallel multipliers receive bit serial inputs from
the previous stage. The output of each stage is delayed by the 16-bit
serial shift register on its way to the input of the next cell. The Moving
Average (MA) part of the cell receives the u(k) inputs from the boundary
cell, and the output y(k) is fed back to form the Auto-Regressive (AR) part.
The filter coefficients a and b are loaded serially from an external
pin (section 3.3). The coefficient loading is performed by a chain of
eight serially connected D flip flops, as shown in figure 10. The clock
signal to these flip flops is ANDed with a load coefficient signal
provided externally, which is held high for eight clock cycles to load the
coefficients serially.
Figure 10: Circuit for loading coefficients (chain of eight serially connected D flip flops)
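The gated-clock loading scheme can be illustrated with a short behavioural sketch. It assumes that the serial bit enters the first flip flop and shifts toward the last on each enabled clock; the bit ordering and the name `load_coefficient` are assumptions made for illustration, since the thesis does not state them.

```python
def load_coefficient(serial_bits, load_enable, width=8):
    """Shift serial bits into a chain of D flip flops whose clock is gated by load_enable."""
    register = [0] * width
    for bit, enable in zip(serial_bits, load_enable):
        if enable:
            # Gated clock: the chain only shifts while the load signal is high.
            register = [bit] + register[:-1]
        # With the load signal low the flip flops see no clock edge and hold.
    return register


# Load signal high for eight cycles, then low: later pin activity is ignored.
bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
enable = [1] * 8 + [0] * 2
coefficient = load_coefficient(bits, enable)
```

After the eight enabled cycles the first bit presented sits in the last flip flop of the chain, and the register contents no longer change.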
3.2 FOURTH ORDER IIR FILTER
The fourth order system consists of four filter cells with two
multiply and add units in each cell. The last cell feeds back the output
signal y(k) as shown in figure 1. The feedback consists of a 16-bit serial
shift register which provides the necessary delay to the u(k) inputs.
Further, this register also performs the truncation necessary to generate
zeros to flush the multiplier results. The data among cells is
transferred through a 16-bit shift register which provides the delay. The
system input requirement is an 8-bit bit serial input, LSB first,
followed by an 8-bit string of zeros. These zeros are added to flush the
higher 8 bits of the product $a_n u(k)$.
A complete snapshot of the filter operation is shown below:
[Snapshot diagrams of the filter operation at successive steps, showing the inputs u(2) to u(5), the outputs y(2) to y(5) and the words b1, b2, b3 followed by zeros]
After the system is initialized, an output data bit is obtained at each
consecutive clock cycle. The length of the output word is 16 bits.
The intermediate truncations are provided by ANDing the input
to the 16-bit shift register with the externally provided AND control
signal.
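The input framing and the AND gating described in this section can be sketched as follows. The helper names `frame_sample` and `and_gate_word` are illustrative, and the control pattern (low for 8 cycles, then high for 8) is the one given in the timing section; how the gated word is aligned with the rest of the data path is not modelled here.

```python
def frame_sample(sample, data_bits=8, pad_bits=8):
    """Serialize one sample LSB first and append the flushing zeros."""
    if not 0 <= sample < (1 << data_bits):
        raise ValueError("sample does not fit in data_bits")
    return [(sample >> i) & 1 for i in range(data_bits)] + [0] * pad_bits


def and_gate_word(bits, control):
    """AND each serial bit with the control level present on the same clock."""
    return [bit & level for bit, level in zip(bits, control)]


# Control pattern from the timing section: low for 8 cycles, high for 8 cycles.
AND_CONTROL = [0] * 8 + [1] * 8

frame = frame_sample(0b10110010)
gated = and_gate_word([1] * 16, AND_CONTROL)
```

Each framed word occupies 16 clock cycles, which matches the 16-bit shift registers between cells and the 2n cycles needed by an 8-bit serial/parallel multiplier.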
3.3 LAYOUT
The fourth order IIR filter was implemented using the Mentor
Graphics CAD tools in CMOS 2 micron technology. The size of the
chip without the pads is 3600 microns x 3400 microns. The system was
designed using a static CMOS design approach. The static design
approach allowed for a simple single phase clocking scheme. Figures 11
and 12 show the circuit and the layout of the static full adder. The size
of the full adder was 228 microns x 183 microns, and it was used for the
multiplier and the add unit. The layout of a 1-bit multiplier cell, shown
in figure 13, measures 290 microns x 387 microns. It consists of a full
adder, three D flip flops, a two input NAND gate and an inverter. The
one bit multiplier was cascaded with seven similar cells in series to
implement the 8-bit multiplier.
Figure 11: Full adder (supplies VDD and GND; outputs CARRY and SUM)
Figure 12: Layout of a 1-bit static full adder
Figure 13: Layout of a 1-bit multiplier cell
The 16-bit shift register was implemented by serially cascading
sixteen D flip flops. Figure 14 shows the circuit diagram of the D flip
flop. To ease interconnection among D flip flops, the input, output and
CLK signal ports have the same height in the layout.
Figure 14: D flip flop
The entire chip requires 16 pins and is shown below.
PINOUT FOR THE IIR FILTER CHIP
[Pin labels recoverable from the diagram: VDD, CLK, CLK', output, load coef., AND control, data input, Coef. A1 to Coef. A4, Coef. B1 to Coef. B4]
3.4 TIMING
SPICE simulations for different blocks of the chip indicate that the
maximum delay occurs at the output of the full adder driving a D flip
flop. Figure 15 shows the SPICE simulation for this block. A delay of 9
ns for the high to low and the low to high transitions resulted in a 55 MHz maximum operating frequency.
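One reading that is consistent with these figures, offered here as an assumption rather than as a derivation stated in the thesis, is that a clock period must accommodate one high to low and one low to high transition on the critical path:

$$T_{\min} \approx t_{HL} + t_{LH} = 9\,\text{ns} + 9\,\text{ns} = 18\,\text{ns}, \qquad f_{\max} \approx \frac{1}{T_{\min}} = \frac{1}{18\,\text{ns}} \approx 55.6\,\text{MHz}$$

which rounds to the quoted 55 MHz.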
The timing requirements for the chip are shown below.
[Timing diagram: Clock, Input and AND control waveforms; the AND control is low for 8 cycles, then high for 8 cycles]
The timing diagram for the chip indicates that the input is latched at
the leading edge of the clock and a valid output is obtained at the
trailing edge of the clock. The AND control signal is held low
for the first 8 clock cycles and high for the next 8 cycles, and this
pattern repeats.
Figure 15: SPICE simulation results of the critical path
CHAPTER 4
CONCLUSION
In this thesis, the implementation of a fourth order bit/serial
adaptive IIR filter was presented. This was achieved by designing and
simulating a set of bit level cells such as multipliers and adders. The
most important feature of the filter structure is that it allows the recursive
computation to be implemented directly, and the system is modular for
higher order filters.
A direct comparison between bit/parallel and bit/serial
architectures was presented. The results indicated that the bit/serial
method is more efficient in terms of chip area and I/O interface, with
minimum delay between cells. Based on the approach by R.F. Lyon [9],
a fourth order 8-bit IIR filter was implemented. The size of the system
was 3600 microns x 3400 microns with a speed of 55 MHz.
BIBLIOGRAPHY
[1] J.S. Ward et al., "Bit-level Systolic Array implementation of the
Winograd Fourier Transform," IEE Proc., Vol. 132, Oct 1985.
[2] A. Corry and K. Patel, "Architecture of a CMOS Correlator," Proc.
International Symposium on Circuits And Systems, 1983.
[3] R.F. Woods et al., "Systolic IIR filters with bit level pipelining," Proc.
ICASSP, 1988.
[4] S.C. Knowles et al., "Bit-level systolic arrays for IIR filtering," Int.
Conf. on Systolic Arrays, San Diego, May 1988.
[5] Kiaei, S., Rajopadhya, S., "Automatic derivation of systolic arrays
for IIR filters," CAD tools for VLSI signal processing, 1989.
[6] McCanny, J.V., Wood, K.W., McWhirter, J.G., and Oliver, C.J., "The
relationship between word and bit level systolic arrays as applied to
matrix x matrix multiplication," SPIE Real Time Signal Process. VI,
1983, 431, pp. 114-120.
[7] McCanny, J.V., Phil, D., McWhirter, J.G., "Completely iterative,
pipelined multiplier array suitable for VLSI," IEE Proc., Vol. 129, Pt. G,
No. 2, April 1982, pp. 40-45.
[8] Urquhart, R.B. and Wood, D., "Systolic matrix and vector
multiplication methods for signal processing," IEE Proceedings, Vol.
131, Pt. F, No. 6, Oct 1984, pp. 623-631.
[9] Lyon, R.F., "Two's complement pipeline multiplier," IEEE Trans.
Commun., COM-24, pp. 418-425, 1976.