Design of an asynchronous third-order finite impulse response filter by Kiaei, Sayfe
AN ABSTRACT OF THE THESIS OF
Joel A. Oren for the degree of Master of Science in
Electrical and Computer Engineering presented on 8 February 1994.
Title: Design of an Asynchronous Third-Order Finite Impulse Response Filter
Abstract approved:
With the increased demand for complex digital signal processing systems,
real-time signal processing requires higher throughput systems. In the past, the
throughput has been increased by increasing the clock rates, but
synchronization can become increasingly difficult. Recently there has
been renewed interest in designing asynchronous digital systems. In an
asynchronous system, there is no global clock, and all modules communicate
through handshaking. In this thesis we demonstrate an implementation of an
FIR filter using asynchronous digital circuit techniques. These asynchronous
design techniques are used to test whether a practical signal processing filter
can be implemented with asynchronous logic. A third-order four-bit filter is
developed and simulated with SPICE, comparing favorably with other available
technologies in speed and power consumption. Although in practice 8-16 bits
are needed, this work is sufficient to demonstrate the feasibility of asynchronous
circuits for filtering applications. A chip is laid out in 2 micron CMOS, and
testing shows that it has a speed-power product comparable with asynchronous
designs fabricated by others.
Design of an Asynchronous Third-Order
Finite Impulse Response Filter
by
Joel A. Oren
A THESIS
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Master of Science
Completed February 8, 1994
Commencement June 1994
APPROVED:
Assistant Professor of Electrical and Computer Engineering in charge of major
Head of Department of Electrical and Computer Engineering
Dean of Graduate School
Date thesis is presented February 8, 1994
Typed by researcher for Joel A. Oren
TABLE OF CONTENTS
1. INTRODUCTION
1.1 FIR filter design with ECDL asynchronous circuits
1.2 Historical Background
1.3 Organization of this Thesis
2. DESIGN OF THE FIR FILTER
2.1 FIR Filter description
2.2 Operation of the FIR filter in a pipeline
2.3 Data-Flow Interpretation of the Architecture
2.4 Sequence of operation of the Data Flow filter stage
2.5 Initialization of the Pipeline
3. ENABLE/DISABLE CMOS DIFFERENTIAL LOGIC (ECDL)
3.1 Micropipelines
3.2 The basic ECDL circuit
3.3 Initial Token Generation
4. CIRCUIT DESIGN, SIMULATION, AND LAYOUT
4.1 Overall organization of the Processing Elements
4.2 The Control Unit
4.2.1 The Muller C-element
4.2.2 The XNOR
4.2.3 The Flip-Flops
4.2.4 Simulations
4.3 Data Path
4.3.1 Multiplier
4.3.2 Adder
4.3.3 Simulations
4.4 Overall Structure of the Filter
4.4.1 Floor plan of the Filter stage
4.4.2 Layout issues for the filter stage
4.4.3 Simulation results for the filter stage
5. TESTING AND EXPERIMENTAL RESULTS
5.1 Measurement of Speed and Delay
5.2 Test Results
6. CONCLUSION
BIBLIOGRAPHY
LIST OF FIGURES
1. General 3-stage filter
2. Four steps in an FIR filter operation
3. A non-recursive convolution filter
4. Data-Flow Architecture of the Filter
5. Four steps in the operation of the Data-Flow filter stage
6. Sutherland's Micropipelines
7. Timing Diagram for the Micropipeline
8. Modified ECDL Micropipeline
9. Basic ECDL circuit
10. Timing Diagram for the Micropipeline
11. Two-stage ECDL pipeline
12. Muller C-Element logic symbol, circuit diagram, and state diagram
13. Basic XNOR circuit
14. Double-Edge-Triggered D Flip-Flop Diagram
15. SETDFF (Single-Edge-Triggered D Flip-Flop)
16. Multiplier Block Diagram
17. Adder Block Diagram
18. Multiplier Simulation Results
19. Adder Simulation Inputs
20. Adder Simulation Results
21. Floor diagram of the Triplet filter stage
22. Plot of the filter chip
23. Triplet Simulation Results
24. Triplet Simulation Results
25. The critical path through the last stage
26. State analyzer graph for the filter chip
27. Oscilloscope plot for the filter chip
LIST OF TABLES
1. Logic Table for the Muller C-element with 1 inverted input
2. State Transition Table for the Muller C-element with one inverter
3. State Transition Table for the XNOR circuit
4. State Transition Table for the DETDFF Latch 1 circuit
5. Summary of simulation results for the Control-Path circuits
6. Signal names for the Multiplier simulation results
7. Signal names for the Adder simulation results
8. Summary of simulation results for the Data-Path circuits
9. Summary for the Triplet cell
10. Signal names for the Triplet simulation results
11. Comparison of 2u and 1.2u feature sizes for the Triplet filter cell
12. Comparison of 4- and 8-bit Triplet filter cells
13. Maximum cycle time for the filter stage
DESIGN OF AN ASYNCHRONOUS THIRD-ORDER
FINITE IMPULSE RESPONSE FILTER
1. INTRODUCTION
1.1 FIR filter design with ECDL asynchronous circuits
Finite Impulse Response (FIR) filters are increasingly found in
applications such as Delta-Sigma modulators, as decimation filters to reduce the
noise. An FIR filter is a filter that has a finite response to a finite input. The
transfer function of a typical FIR filter can be written as:
$$h(n) = \sum_{i=1}^{M} a_i \, x(n-i) \qquad (1.1)$$
The advantages of FIR filters are that they have constant phase and
group delay, and are stable. Disadvantages include the roundoff noise
associated with all digital filters, the need for a large number of stages to get
sharp cutoff characteristics, and complex design compared to IIR (Infinite
Impulse Response) filters [RO86]. As the speed requirements for these filters
increase, the difficulty in realizing a synchronously clocked chip increases due to
area, resistance, and clock skew effects [SU89] [WE88]. In response to such
clocking difficulties, asynchronous or self-timed circuits are being investigated.
This thesis presents a self-timed design for a four-bit third-order FIR filter
using the two-cycle micropipeline for control [LU88]. We will compare this
design with other available circuits and filter architectures, and describe its
advantages and disadvantages in terms of power consumption, speed, and
area. The filter designed in this thesis has been fabricated in a 2 micron CMOS
process, and both simulated and measured results will be presented.
Asynchronous systems excel over synchronous systems in that they can
proceed with processing upon the arrival of the input signals. Furthermore, since
there is no global clock, the control of the system is easier and the delays are
substantially lower. In asynchronous circuits, handshaking is used to pass
values from one stage to the next in a decentralized fashion. This allows the
calculations within the FIR filter to proceed as fast as possible. As larger filters
are constructed with longer latencies and more complex layouts, the use of
asynchronous circuits will give greater performance improvements than
comparable filters implemented in standard synchronous logic. This is because
the asynchronous filters will not have the time delays between cycles that must
be built into synchronously clocked filters.
Another method to synthesize self-timed circuits is by using data-flow
diagrams. Asynchronous circuits are data-flow machines where each data value has
an associated token. The computation flows through the data-flow machine,
which consists of nodes. These nodes "fire", or perform their logic function,
whenever all the tokens and associated data are available. The goal of the FIR
filter design presented here is to use the asynchronous architecture to speed up
calculations by allowing the availability of data to determine how fast the
computation proceeds.
In order to fully exploit the advantages of the self-timed architecture, the
logic family used must be able to perform handshaking and generate a
completion signal. The circuit type used for the filter described in this thesis is
based on Enable/Disable CMOS Differential Logic (ECDL) [LU91A]. ECDL is a
differential logic family which can generate both signals and their complements.
ECDL is based on the static CMOS logic family and it has the CMOS
advantages of lower static power, higher noise margins, and easier
implementation.
1.2 Historical Background
There have been several recent works on FIR filter design using systolic
arrays for the processing of signals [QU91]. Using pipelining, fast processors
can achieve orders of magnitude higher throughput. Prior work in asynchronous
signal processing circuits was based on Differential Cascode Voltage Switch
Logic (DCVSL). Practical filters were designed using DCVSL [JA88]. The
DCVSL-based filters used Sutherland's four-cycle micropipelines for control
[SU89]. This work will apply the two-cycle technique to implement the system.
1.3 Organization of this Thesis
The organization of this thesis is as follows: Chapter 2 will describe the
overall architecture, demonstrate the operation of the filter, and discuss the
data-flow architecture and pipelining of FIR filters. Chapter 3 will discuss the
ECDL circuit, the micropipeline, and initial token generation in the ECDL
micropipeline. Chapter 4 will present the simulation results of the filter
building blocks. A summary of the results comparing delay times and power
consumption with other technologies will be given. Chapter 5 discusses the
testing and experimental measurement from the fabricated IC. Finally, Chapter 6
will summarize the advantages and disadvantages of this design and
give some ideas for future improvements.
2. DESIGN OF THE FIR FILTER
In this chapter we will discuss the architecture of the general purpose FIR
filter. We will begin with an overview of the architecture at the block level and
the algorithmic level. Next we will show the operation at the block level with a
series of examples. The third part of this chapter discusses the data-flow
architecture of this chip. Operation of the circuit is demonstrated by a series of
"snapshots" of the filter. The last section introduces the topic of initial token
generation.
2.1 FIR Filter description
Figure 1 shows the pipeline diagram of a third order FIR filter. The third
order FIR filter realizes equation 2.1 below.
$$Y_{out}(n) = Y_{in}(n) + a_1 x(n) + a_2 x(n-1) + a_3 x(n-2) \qquad (2.1)$$
where a1-a3 are the coefficients of the filter, the x values are the data inputs to
the system, and the Y values are the results flowing through the system. The
FIR filter has advantages compared to IIR (Infinite Impulse Response) filters. It
has a linear phase response and is easier to design than an IIR filter.
Figure 1. General 3-stage filter
In general, each block of this filter performs the following operation:
Yout := Yin + a * Xin (2.2)
Using these basic modular blocks, filters of higher order can be
implemented. Each stage of the filter requires one adder and one multiplier, and
for the Nth order pipelined filter, N adders and N multipliers are required.
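The relationship between the per-stage operation (2.2) and the overall filter equation (2.1) can be checked with a short Python sketch. The function names, coefficients, and sample values below are illustrative assumptions and not part of the original design; samples before n = 0 are assumed to be zero.

```python
def stage(y_in, a, x_in):
    """One filter block, equation (2.2): Yout := Yin + a * Xin."""
    return y_in + a * x_in


def sample(x, n):
    """Return x(n), assuming samples before n = 0 are zero."""
    return x[n] if n >= 0 else 0


def fir3_direct(a, x, n, y_in=0):
    """Equation (2.1) evaluated in a single expression."""
    return (y_in + a[0] * sample(x, n) + a[1] * sample(x, n - 1)
            + a[2] * sample(x, n - 2))


def fir3_cascade(a, x, n, y_in=0):
    """Equation (2.1) built from three blocks of equation (2.2).

    The Y value enters at the last stage and picks up one product
    per stage on its way back to the first stage.
    """
    y = y_in
    for k in (2, 1, 0):
        y = stage(y, a[k], sample(x, n - k))
    return y


a = [1, 2, 3]       # hypothetical coefficients a1..a3
x = [4, 5, 6, 7]    # hypothetical four-bit input samples
results = [fir3_cascade(a, x, n) for n in range(len(x))]
```

Three applications of the block give the same value as the direct sum, which is why an Nth order pipelined filter needs N adders and N multipliers.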
A direct implementation of the FIR filter uses only one adder and one
multiplier to compute the result Yout. It could do the computation by computing
all sums and products on one modular block, by multiplexing the filter block in
time. This would result in a reduction of the computation rate. The FIR filter is a
non-recursive convolution network. The non-recursive aspect of the filter gives it
its finite output response. The block diagram of a fourth order, four-bit FIR filter
in a pipelined (systolic) [QU91] fashion is shown in Figure 3.
2.2 Operation of the FIR filter in a pipeline
This section examines the operation of the pipelined filter in more detail.
Figure 2 shows snapshots of the array of filter stages. The data stream x values
flow from left to right without being modified. We are assuming that
there is a host I/O interface. The other stream is the result y stream, which flows
from right to left. The initial value of Yin is either from a preceding module or
else it is zero.
In part (a) of Figure 2, the token Yi is being processed, along with Xi, in
the first stage of the pipeline. The tokens Yi+2 and Xi..2 have just been
presented. In part (b) the token Yi is output, and the next input token Xi+1 is
being presented. Parts (c) and (d) of Figure 2 trace another cycle of operation
of this pipeline.
It should be pointed out that in part (a) the first and third stages are
processing and the second and fourth stages are idle. The opposite is true in
part (b), which means that each stage is only operating half of the time. The
ECDL micropipeline filter design presented here will remove this inefficiency.
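The half-idle behaviour described above can be reproduced with a minimal Python sketch of a clocked array in which x tokens move right and y tokens move left by one cell per cycle. The injection schedule (a new token on every second cycle, with the x stream offset by K - 1 cycles), the zero initial Y value, and all names are assumptions made for illustration; the sketch models a synchronous array of this general kind, not the ECDL design itself.

```python
def systolic_fir(a, xs):
    """Clocked array: x tokens move right, y tokens move left."""
    K, N = len(a), len(xs)
    x_out = [None] * K    # x token leaving each cell in the previous cycle
    y_out = [None] * K    # y token leaving each cell in the previous cycle
    outputs, busy_log = [], []
    for t in range(2 * N + K):
        # New tokens enter on every second cycle: x at the left end
        # (offset by K - 1 cycles), y = 0 at the right end.
        j, rem = divmod(t - (K - 1), 2)
        new_x = xs[j] if rem == 0 and 0 <= j < N else None
        new_y = 0 if t % 2 == 0 and t // 2 < N else None
        x_in = [new_x] + x_out[:-1]
        y_in = y_out[1:] + [new_y]
        y_out = [None] * K
        busy = []
        for k in range(K):
            if y_in[k] is not None:
                # A missing x token stands for a sample before n = 0.
                xv = x_in[k] if x_in[k] is not None else 0
                y_out[k] = y_in[k] + a[k] * xv
                busy.append(k)
        if y_out[0] is not None:
            outputs.append(y_out[0])
        busy_log.append(busy)
        x_out = x_in
    return outputs, busy_log


outputs, busy_log = systolic_fir([1, 2, 3], [4, 5, 6, 7])
```

In this sketch `busy_log` lists the cells that process a Y token in each cycle. With three cells, cells 0 and 2 work together while cell 1 is idle, and the roles swap on the next cycle, so no cell is busy on two consecutive cycles.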
Figure 2. Four steps in an FIR filter operation
Figure 3. A non-recursive convolution filter
Figure 3 shows one of the PE blocks in more detail. Since the circuit will
be asynchronous, storage will be necessary to hold the tokens while the pipeline
has stopped. The input token Xin will flow through the multiplier, where it will be
multiplied by the coefficient a, and then will be used by the adder. The input
token Yin is flowing into the adder, and may either come from a subsequent
stage or it may be a zero (for the last stage).
2.3 Data-Flow interpretation of the Architecture
The FIR filter can also be described using the data-flow graph [AR86]
[LA88]. The data flow graph contains nodes for adders, multipliers, forks, and
joins. There is a fork required at the beginning of each input stage. This fork is
required to split the input X token to be used by both the multiplier and the next
stage. The coefficients are assumed to be constant, and so the multiplier has
only one asynchronous input, for the Xin token. The join is required to
synchronize the result token of the multiplication with the Yin token generated by
the subsequent stage. An adder generates the Yout token to pass to the next
stage, or in the case of the last stage, to the outside of the chip.
Figure 4. Data-Flow Architecture of the Filter
2.4 Sequence of operation of the Data Flow filter stage
Similar to the previous snapshots, we will look at the operation of the data-flow
pipeline in greater detail. In Figure 5, the operation of one stage of the
data-flow filter is examined. As with the FIR block diagram in Figure 3, the X
data values are flowing from left to right, and the Y result values are flowing from
right to left. In this case, the Yin values are usually nonzero, except in the case
of the last stage in the filter. The X data values are not modified, but are
delayed one cycle.
Figure 5. Four steps in the operation of the Data-Flow filter stage
Figure 5(a) shows Xin data and Yin result tokens (filled) waiting at the
input to the filter stage. These tokens do not necessarily have to appear
simultaneously. Figure 5(b) shows the X data token going through the fork. The
fork produces two X data tokens, which go to the delay and the multiplier. In
Figure 5(c) the X token goes through the delay, and then the multiplier produces
a product token for X times the coefficient a1. The coefficient a1 is a constant
and is not set up as a token. The product token Xin(1) and the Yin(1) input
token are now waiting at the join, which is enabled as shown in Figure 5(c). In
Figure 5(d), the join has fired and passed the two matched data tokens on to
the adder, which has produced the final output Y token.
If there are no tokens in the pipeline when the Xin data tokens arrive, the
circuit will not operate properly. For example, consider a single Xin token
presented to the pipeline in Figure 4. As Xin enters the first stage, it is split and
goes on to the current stage multiplier and to the next stage. In the first stage,
the multiplication result a1*Xin will be waiting for a Yin token to arrive before the
adder can fire. The same thing could happen in the next stage, resulting in three
multiplication values waiting at the multiplier inputs to the joins. In this case,
when the Yin(0) token is presented, it triggers the last stage's join, which fires
and enables the last stage's adder. The Yin(0) token plus the result of the
multiplication in stage 2, a3*Xin(0), are added and sent to stage 1. However,
there is already a token, a2*Xin(0), waiting at the input to the join in stage 1.
The join fires, enabling the adder, which adds Yin(0)+a3*Xin(0) to a2*Xin(0). A
similar operation takes place in stage 0, resulting in:
Yout(0) = Yin(0) + a3*Xin(0) + a2*Xin(0) + a1*Xin(0). (2.3)
The above is not the correct output. The Yout token should depend only on the
last three values of Xin and the Yin token from three cycles before. However, in
the case described above, the Yout(0) token depends only on the current value
of Yin and Xin tokens. The problem is that the first Yin token presented
immediately flows through all of the stages, instead of firing the last stage only.
The solution is to place initial tokens in the pipeline as space holders, to set up
the pipeline properly. The technique for generating initial tokens is the subject
of the next section.
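The failure case of equation (2.3) can be traced with a few lines of Python. This is a minimal sketch under stated assumptions: the function name and the numeric values are hypothetical, and each join is assumed to fire as soon as its Y token arrives because a product token is already waiting.

```python
def first_output_without_initial_tokens(coeffs, x0, y_in0):
    """Trace the first Yin token through a pipeline with no initial tokens."""
    # Xin(0) has forked through every stage, so one product waits at each join.
    products = [a * x0 for a in coeffs]
    y = y_in0
    # Yin(0) enters at the last stage and fires every join on its way out.
    for k in reversed(range(len(coeffs))):
        y += products[k]
    return y


a1, a2, a3 = 1, 2, 3    # hypothetical coefficients
y_out0 = first_output_without_initial_tokens([a1, a2, a3], x0=4, y_in0=10)
```

With these values the first output is 34, that is Yin(0) + (a1 + a2 + a3) * Xin(0), which matches equation (2.3): the single X sample has been weighted by all three coefficients.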
2.5 Initialization of the Pipeline
In order for the pipeline to operate correctly, a sequence of initial tokens
must be automatically generated each time the pipelined circuit is activated.
These initial dummy tokens, generated by the pipeline in the initial reset, will
keep the filter from running through an entire sequence of operations on the first
data token presented.
This problem is solved by inverting some of the tokens in the Yout result
stream. It was found through simulations that inverting the tokens in alternate
stages will generate the initial tokens needed without interfering with the normal
operation of the pipeline. This solution has the added benefit that it does not
cause excessive slowing of the pipeline. Alternate token inversion causes the
pipeline to pause for one cycle and wait for the next data to be presented before
continuing. This is especially important at the initialization of the pipeline.
To demonstrate the initialization, we will refer to Figures 4 and 5.
1. As the first data value Xin(0) enters the first stage, it goes through the fork in
Processing Element 0 (PE0) and is split into two tokens, as demonstrated in
Figure 5(b).
2. The first token goes through the multiplier in PE0 to the join. There is a
Yin(0) token waiting at the join, and the token from the multiplier is added to the
Yin(0) token, and the Yout(0) result token flows out. The value of the Yout(0)
token depends only on the value of the Xin(0) and Yin(0) tokens.
3. The second Xin(0) token generated by the fork in PE0 goes on to the fork in
PE1, and is again split into two. One of these two Xin tokens goes on to PE2
and the other token goes through the multiplier in PE1 to the join.
4. There is a Yin(1) token waiting at the adder in PE1, generated by the
initialization, so the (Xin*a2) token continues through the adder in PE1 resulting
in a Yout = ((Xin*a2) + Yin) result token.
5. The result token from the adder in PE1 flows back to PE0, where it is now a
token waiting at the Yin input to the adder in PE0. The Xin(0) token in PE0 was
already used and the multiplier in PE0 has already fired, so the Yin token in PE0
must wait for the next Xin token to be presented to the pipeline. It is the first Y
token waiting in the pipeline.
6. The first token generated in step 3 above by the fork in PE1 flows straight
through the multiplier in PE2, since there is no fork in PE2. It waits temporarily at
the join in PE2 for the Yin(2) token.
7. When the Yin(2) token is input from an external source, the join in PE2 fires
and triggers the adder in PE2. The adder result token Yout = Yin(2) + (Xin(0)*a3)
is sent back to PE1 as the Yin(2) input token to the adder/join in PE1.
8. At the join in PE1 the result token from PE2 must wait, because it is the only
token waiting at the join in PE1. There are now two Y tokens waiting in the
pipeline.
To summarize this process, the result of inputting one Xin token and one
Yin token is one Yout token and two partial result Yin tokens waiting in the
pipeline. The pipeline ends up the same as it was when it was initialized. It has
two tokens waiting, one each at the joins in PE0 and PE1. It is now set up for
the next pair of inputs.
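The eight steps above can be followed with a small token-level model in Python. This is a sketch under stated assumptions rather than the thesis circuit: the class and method names are hypothetical, the queues are unbounded (the real pipeline holds one token per stage), and the initial placeholder tokens are assumed to carry the value zero.

```python
from collections import deque


class TokenPipeline:
    """Data-flow model of the filter: one fork, multiplier, and join per PE."""

    def __init__(self, coeffs, initial_tokens=True):
        self.a = list(coeffs)
        n = len(self.a)
        self.prod = [deque() for _ in range(n)]   # products waiting at each join
        self.yin = [deque() for _ in range(n)]    # Y tokens waiting at each join
        self.out = []                             # Yout tokens leaving PE0
        if initial_tokens:
            for k in range(n - 1):
                self.yin[k].append(0)   # assumed value of a placeholder token

    def _fire(self):
        """Fire every join that holds both a product and a Y token."""
        fired = True
        while fired:
            fired = False
            for k in range(len(self.a)):
                if self.prod[k] and self.yin[k]:
                    y = self.yin[k].popleft() + self.prod[k].popleft()
                    target = self.out if k == 0 else self.yin[k - 1]
                    target.append(y)
                    fired = True

    def push_x(self, x):
        """Present one Xin token; the forks deliver it to every multiplier."""
        for k, a in enumerate(self.a):
            self.prod[k].append(a * x)
        self._fire()

    def push_y(self, y):
        """Present one external Yin token at the last PE."""
        self.yin[-1].append(y)
        self._fire()

    def waiting(self):
        return [len(q) for q in self.yin]


pipe = TokenPipeline([1, 2, 3])
states = []
for x, y in zip([4, 5, 6], [10, 0, 0]):
    pipe.push_x(x)
    pipe.push_y(y)
    states.append(pipe.waiting())
```

After every Xin/Yin pair the model again holds one Y token at the joins of PE0 and PE1, as in the summary above. The third output, 38, is 1*6 + 2*5 + 3*4 plus the external Yin value 10 that entered with the first sample.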
Before we explain how the initial token generation is done, the ECDL
circuit used is described. The ECDL circuit is at the heart of the modified
micropipeline, and its characteristics make initial token generation possible for
this filter. After we have covered the ECDL circuit and the micropipeline in the
next chapter, we will revisit the problem of initial token generation.
3. ENABLE/DISABLE CMOS DIFFERENTIAL LOGIC (ECDL).
The ECDL logic family was proposed by [LU88A, 88B]. It is a static logic
family, but it differs from previous static differential logic families in that it
dissipates very little static power. Differential logic families require that both an
input and its complement be presented, and generate both the output and its
complement. The ECDL family uses a single-phase clock, which reduces
problems caused by clock skews.
Some previous designs of asynchronous logic systems have been based
on Differential Cascode Voltage Switch Logic, or DCVSL [JA88]. The DCVSL
family dissipates more static power and uses more devices than comparable
circuits implemented in either static CMOS or ECDL. ECDL was developed to
reduce the static power consumption and improve the speed/power product in
asynchronous circuits.
In this chapter, we will first examine the micropipeline control structure
designed by Sutherland [SU89], and its modifications to take advantage of the
special characteristics of the ECDL circuit. Application of ECDL to micropipeline
control and the initial token generation for the FIR filter are then examined.
3.1 Micropipelines
The basic asynchronous circuit consists of the logic for the signal
calculation and the controller, called the micropipeline. The micropipeline is an
asynchronous self-timed pipeline developed by Sutherland [SU89], which
contains a string of stages, each stage containing computation and control logic.
The control logic, shown in Figure 6, is composed of a Muller C-element, storage
for all variables, and a delay element. The Muller C-element synchronizes the
current stage to the previous and next stage. Variable storage is needed since
the previous stage can change its outputs as soon as the current stage has
acknowledged receiving them, and a delay element is required to guarantee that
a request is not generated to the next stage before the current stage has
finished its computation. The micropipeline used in this work was modified to
take advantage of the completion signal generated by the ECDL circuit [LU88A].
The token storage shown in Figure 6 is necessary in order to store tokens
between processing stages, so that the stage generating the token can proceed
with its processing of new tokens. The token storage in the modified ECDL
micropipeline should be contrasted with the token storage used in Sutherland's
micropipeline. A double-edge-triggered flip-flop is used as the variable storage
in Lu's micropipeline because of the two-cycle nature of the micropipeline
control. The handshaking signals for Sutherland's storage register are as
follows: the Capture input, C, signals the availability of the variable from the
previous stage; the Capture Done signal, Cd, signals when this variable has
been stored in the register (captured); similarly, Pass (P) and Pass Done (Pd)
signal when the previously captured data is output to the current stage. To
simplify this, the double-edge-triggered D flip-flop is used in the modified ECDL
pipeline. The remaining signals, R and A, are Request and Acknowledge
handshaking signals between stages.
Figure 6. Sutherland's Micropipelines
Figure 7. Timing Diagram for the Micropipeline.
The timing diagram in Figure 7 shows the sequence of events in the
micropipeline: (1) At time a, the input data becomes valid for the first stage.
The first stage has completed its function and the dataout signal lines are
stable. (2) At time b the Requestout is generated by the first stage and is
received by the second stage. (3) At time c, the second stage has entered the
data values into a register, and signals the first stage via the Acknowledgein line
that it no longer needs the data.
At this point, the next cycle starts, and the first stage can begin to
compute the next data value. The sequence continues similarly to the first cycle
except that the Request and Acknowledge signal lines have the opposite polarity
from the first cycle.This does not affect circuit operation, since positive and
negative signal transitions are equivalent.
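The three events of one cycle, and the reversed polarity of the next cycle, can be sketched in Python as transitions on two wires. The class and method names are hypothetical and the model is purely behavioural: a transition is represented by toggling a bit, and a request is assumed to be outstanding whenever the two wires differ.

```python
class TwoPhaseLink:
    """Request/Acknowledge pair between two stages, transition signalling."""

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.data = None

    def send(self, value):
        """Sender: make the data valid (event a), then toggle Request (event b)."""
        assert self.req == self.ack, "previous transfer not yet acknowledged"
        self.data = value
        self.req ^= 1

    def receive(self):
        """Receiver: store the data, then toggle Acknowledge (event c)."""
        assert self.req != self.ack, "no request outstanding"
        value = self.data
        self.ack ^= 1
        return value


link = TwoPhaseLink()
trace = []
for value in (3, 7, 9):
    link.send(value)
    trace.append((link.receive(), link.req, link.ack))
```

The recorded wire levels are (1, 1) after the first transfer and (0, 0) after the second, yet both transfers behave identically, since only the transitions carry meaning.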
Figure 8 shows the modified ECDL version of the micropipeline. This
version has three differences compared to the micropipeline discussed above.
First, the delay element is eliminated since ECDL can generate a completion
signal with the addition of a NOR gate. Second, the register REG in Figure 6 for
the input signal has been replaced with a Double-Edge-Triggered D flip-flop
(DET in Figure 8) [LU90] [LU88A]. Third, the clock signal for the SETDFF register
generating REQout flows through the DETDFF input token storage register,
unlike the Sutherland micropipeline where separate Capture Done and
Pass Done signals controlled the input token storage register operation.
Figure 8. Modified ECDL Micropipeline.
ECDL is inherently a four-phase circuit and is level-sensitive rather than
edge-sensitive. In a four-phase system, the signals are level sensitive and must
make a full transition from low to high to low. Two-phase circuits are triggered
by both the rising and falling edge, which consumes less energy and time.
The micropipeline is a two-phase circuit; therefore the modified ECDL
micropipeline will require a circuit to convert between two- and four-phase and
back again, which adds an XNOR and a SETDFF (Single-Edge-Triggered D
Flip-Flop). The XNOR circuit is used to convert from two-phase to four-phase, and
the SET D flip-flop is used to convert from four-phase back to two-phase.
There are two Acknowledge lines in each stage of the micropipeline, one
going to the previous stage and one coming from the next stage. They will never
change at the same time, because the circuit will not accept new data from the
previous stage until after the next stage has accepted the current data outputs.
By connecting the XNOR gate to both the Request and Acknowledge lines, the
XNOR will produce two transitions for each two-phase control cycle.
The SETDFF converts the four-phase output of the ECDL gate back into
two-phase control logic by connecting the clock input of the SETDFF to the
output of the ECDL gate to trigger the flip-flop once for every complete cycle of
the ECDL circuit. Connecting the D input of the flip-flop to the Acknowledgein
line (which is already two-phase) will generate one transition for every two
transitions of the ECDL gate.
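The two conversions can be illustrated by counting transitions in a short Python sketch. Several details are assumptions made only for this illustration: the names are hypothetical, the flip-flop is taken to respond to the high-to-low edge, the XNOR output stands in for the four-phase ECDL completion signal that clocks the SETDFF, and the D input is wired to toggle rather than to the Acknowledgein line described above.

```python
def xnor(a, b):
    """Two-input XNOR on 0/1 levels."""
    return 1 - (a ^ b)


def count_transitions(levels):
    """Number of level changes in a sequence of 0/1 samples."""
    return sum(1 for p, q in zip(levels, levels[1:]) if p != q)


class SingleEdgeDFF:
    """D flip-flop that samples D on one clock edge only."""

    def __init__(self, q=0, clk=1):
        self.q = q
        self._clk = clk

    def step(self, clk, d):
        if self._clk == 1 and clk == 0:   # assumed active edge: high-to-low
            self.q = d
        self._clk = clk
        return self.q


# Two complete two-phase cycles: Request toggles, then Acknowledge toggles.
two_phase = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
request = [r for r, _ in two_phase]
four_phase = [xnor(r, a) for r, a in two_phase]

ff = SingleEdgeDFF()
q_levels = [ff.q]
for level in four_phase[1:]:
    q_levels.append(ff.step(level, 1 - ff.q))   # D wired to toggle (assumption)
```

The Request wire makes two transitions over the two cycles, the XNOR output makes four (a full high-low-high pulse per cycle), and the flip-flop output makes two again, so the control signal is back in two-phase form.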
Another way of looking at the Micropipeline is to see it as a sequence of
unstable loops. There is exactly one inversion in each loop around a stage in
the Micropipeline, and we know that a signal path with an odd number of
inverters will oscillate. The Muller C-elements and SETDFFs act to control the
oscillation based on the state of the ECDL circuit.
Figure 9. Basic ECDL circuit
3.2 The basic ECDL circuit
The ECDL cell shown in Figure 9 has several basic elements. In the
center of the cell is the memory element, composed of the cross-coupled
inverters M2-M5. Three transistors are used for control: M1 enables the circuit
by connecting it to VDD; M6-M7 reset the circuit by connecting the outputs to
VSS. The differential logic tree generates the output by allowing one output to
go to VSS and holding the other output high, depending on the input signals.
Lastly, the NOR circuit generates a completion signal when the output values
are valid.
When the data is available from the previous stage, the Donein signal
goes low and activates the cell. The flip-flop value is set by the differential input
tree structure connected to the outputs of the flip-flop. When the Donein signal
is high the cell is disabled. Devices M6 and M7 are enabled, and both outputs
are low. The NOR circuit has both inputs low, and its output is high, presenting
a disable signal to the next stage.
When the Donein input signal goes low the circuit is enabled (event a in
Figure 10). Depending on the input signals, the differential tree holds one
output at ground while the other output is left open. The output left open is
pulled high by the action of one of the cross-coupled inverters in the memory
element.
Since the circuit had the low voltage on the inputs to both of the inverters,
both of the inverters will be trying to bring their output positive. Only one will be
successful, the other being overridden by the differential tree structure. The
remaining two transistors are used to drag the outputs of the flip-flop (the inputs
of the two inverters) low during the inactive phase of the circuit. The delay time
required for the logic tree and the inverters to bring one of the outputs high is
known as the forward delay.
The last four transistors, M8-M11, perform a NOR operation. These
transistors are used to generate the DONEout signal from the signal OUT and
its inverse. The NOR output is low whenever one or both of the inputs are high.
As the ECDL circuit is enabled, one of the outputs will go high, which causes
the NOR to go from high to low, resulting in a DONEout signal.
In the timing diagram of the handshaking sequence shown in Figure 10,
the circuit is triggered by a high-to-low transition on the Donein signal (event a in
Figure 10). The output of the cell is generated during the period when Donein is
low. When the Donein signal goes high again (event d), the two outputs are reset
low (event f) and the output value is invalid. This makes the ECDL circuit inherently
four-phase with regard to clocking.
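The level behaviour of the cell over one enable/disable cycle can be summarized in a small Python sketch. It is a purely logical model under stated assumptions: the function names are hypothetical, delays and the memory element are not modelled, and the logic tree is represented by a precomputed value (here a two-input AND chosen only as an example).

```python
def ecdl_cell(done_in, logic_value):
    """Return (OUT, OUT complement, DONEout) for the given Donein level."""
    if done_in:
        # Disabled: the reset devices hold both outputs low.
        out, out_bar = 0, 0
    else:
        # Enabled: the logic tree decides which output stays at ground.
        out, out_bar = logic_value, 1 - logic_value
    done_out = 1 - (out | out_bar)   # NOR of the two outputs
    return out, out_bar, done_out


def and_gate(a, b):
    """Example logic function standing in for the differential tree."""
    return a & b


# One four-phase cycle: disabled, enabled, disabled again.
cycle = [ecdl_cell(done_in, and_gate(1, 1)) for done_in in (1, 0, 1)]
```

DONEout is high while the cell is disabled, falls when one output rises, and returns high only after both outputs have been reset, which is the full low-high-low excursion that makes the cell four-phase.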
Figure 10. Timing Diagram for the Micropipeline.
The ECDL has a delay, from event a to events b and c, called the forward
delay. This is the time required for the circuit to generate the outputs and the
completion signal. The ECDL circuit also has a backward delay from event d to
events e and f. This is the time required to reset both outputs low (event f).
Both outputs must be low before the circuit can perform its next cycle, since
otherwise Doneout will not go from low to high (event e). The backward delay
time (or reset time) can add significantly to the delay of the cell, as will be seen
when the simulation results from the multiplier are discussed in Chapter 4. In the
next section, we will discuss the generation of initial tokens.
3.3 Initial Token Generation
Having examined the ECDL circuit and the modified micropipeline based
on it, we will now describe initial token generation. The initial tokens are
generated internally. The "token" consists of putting a micropipeline stage in a
different state than the state before or after it. With a token present, the Muller
C-element will have its inputs at 00 or 11. If either an acknowledge comes from
the next stage or a request comes from the previous stage, the Muller C-element
will trigger and begin a processing cycle in the current stage.
The initial token generation is done by inverting tokens across the
boundaries between micropipeline stages. The token inversions can be done in
two different ways. The first method is to put an inverter on the Requestin
and Acknowledgein lines. This causes the tokens to be inverted as they go
between stages, and generates the initial tokens. This also introduces delays
in the control circuitry, as the signals have to pass through the inverter.
A better method is to use preset and reset of the control circuitry to set up
adjacent stages where the initial token is required. One stage is preset, and
the next or previous stage (or both, if two tokens are desired) is reset. The
only circuit elements which must be affected in this manner are the Muller
C-element and the SETDFF, because they are the only elements that must be
reset or preset to set up the circuit. Once the Muller C-element and the
SETDFF are in a definite state, the states of the XNOR and the ECDL circuits
are determined.

[Block diagram: Stage 0 and Stage 1, each with a register (reg), an ECDL
logic block with DONE in/out, and REQ, ACK, DATA, and CLK connections between
the stages.]
Figure 11. Two-stage ECDL pipeline.
In the following steps we will describe the generation of a token using the
inverter method described above. It should be noted that a token exists when
there is a request flowing through a pipeline, or when a request is waiting,
such as at a join. Refer to Figure 11 as the following sequence of events is
traced.
(1) The circuit is initialized with Muller C-elements = low, SETDFF = low, the
ACKin output of the second stage = low, and the ACKout input of the first stage
= high. The first stage Muller C-element is stable with any input to the REQin.
The XNOR will output a low and trigger the ECDL circuit in the first stage. The
REQout output from the first stage = low, resulting in the REQin input to the
second stage = high. Assuming a low on the second stage ACKout input, the
second stage Muller C-element is unstable.
(2) When the initialization signal is removed, the second stage Muller
C-element = high.
(3) The second stage XNOR is triggered and its ECDL cell is enabled, causing
an operation to be performed on input data. At the same time, the second stage
ACKin signal = high, and the ACKout input of stage one = low.
(4) Assuming the REQin input is low, the Muller C-element in stage one remains
stable. The XNOR inputs are now equal and its output goes high. The ECDL
circuit in stage one will reset.
(5) In the second stage, the D input of the SETDFF is now one. As the ECDL
circuit completes its task, it sends a high-to-low transition to the SETDFF,
changing the output state Q from low to high.
(6) This change on the REQout line triggers the next stage, sending the token
on down the micropipeline.
The chip was designed so that two or more of the chips could be placed
in series, which gives a FIR filter with more than three stages. To facilitate
modularity for the FIR filter, all control signals are sent outside the chip.
This also makes testing easier. More importantly, there had to be a Yin and an
Xout port. The Yin port may not be used in a small one-chip filter, and it may
not be used in the first chip of a multi-chip filter. This is because the Yin
is often zero for the first stage of a FIR filter. However, for multi-chip
filters it must be included to carry through the results of previous chip
stages. Similarly for Xout, we wouldn't normally need the delayed values of the
X variable coming out of the last stage. However, the next chip in a multi-chip
filter will need these delayed X values. The only other choice is to delay the
X variable in a shift register outside the chip.
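The role of the Yin and Xout ports in a multi-chip filter can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the chip's actual data path: the function name `fir_chip` is hypothetical, arithmetic uses unbounded Python integers (the 4-bit width and coefficient scaling are ignored), and pipeline latency and handshaking are not modeled.

```python
from typing import List, Sequence, Tuple


def fir_chip(x: Sequence[int], y_in: Sequence[int],
             coeffs: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Behavioral model of one filter chip; returns (x_out, y_out).

    y_out[n] = y_in[n] + sum_k coeffs[k] * x[n - k]
    x_out[n] = x[n - len(coeffs)]   (delayed X values for the next chip)
    """
    taps = len(coeffs)
    x_out, y_out = [], []
    for n in range(len(x)):
        acc = y_in[n]
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * x[n - k]
        y_out.append(acc)
        x_out.append(x[n - taps] if n - taps >= 0 else 0)
    return x_out, y_out


x = [1, 2, 3, 4, 5, 6, 7, 8]
zeros = [0] * len(x)
c1, c2 = [1, 2, 3], [4, 5, 6]            # illustrative coefficients

x_mid, y_mid = fir_chip(x, zeros, c1)    # first chip: Yin tied to zero
_, y = fir_chip(x_mid, y_mid, c2)        # second chip: fed by Xout and Yout
```

In this model the two chained three-tap chips produce the same output as a single six-tap filter with coefficients `c1 + c2`, which is why the second chip needs both the accumulated Y values and the delayed X values from the first.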
4. CIRCUIT DESIGN, SIMULATION, AND LAYOUT.
4.1 Overall organization of the Processing Elements
In this chapter the design, simulation, and layout of the various subcircuits
of the FIR filter design are discussed. First the control section, the Muller
C-element, the XNOR, the DETDFF, and the SETDFF are discussed, followed by
the arithmetic blocks that form the data path. In the last part of this
chapter, the control and data paths are integrated in an examination of the
basic filter stage.
4.2 The Control Unit
The control unit of the FIR filter stage has two functions. First, it controls
the flow of tokens into and out of the filter stage. Second, it must enable and
disable the ECDL processing circuit based on the availability of tokens to be
processed. There are four very basic circuit blocks used in the FIR filter
control unit design: the XNOR, the Muller C-element, the SETDFF, and the
DETDFF. We begin with the Muller C-element.
4.2.1 The Muller C-element
The Muller C-element performs an AND function for tokens, synchronizing
the current filter stage with the next and previous stages. In the
micropipeline, the current filter stage cannot fire until two events have
occurred: (1) The previous stage(s) must have completed their processing and
output token(s) to the current stage. (2) The next stage(s) must have read in
the token from the current stage, allowing the current stage to change its
outputs.
The Muller C-element is connected to the Requestin signal from the
previous filter stage, which signals when the previous stage has completed its
processing. The other (inverted) input is connected to the Acknowledgein signal
from the next filter stage, which acknowledges the receipt of the current
stage's output token. These two connections satisfy the requirements for
synchronizing token exchange between filter stages. Figure 12 (a) shows the
logic symbol for the Muller C-element with one inverted input.
[Figure panels: logic symbol with inputs X and Y and output Z; circuit diagram
with transistors M1-M14; two-state state diagram with transitions labeled by
the XY input values.]
Figure 12. Muller C-Element logic symbol, circuit diagram, and state diagram.
As the state diagram in Figure 12 (c) shows, the Muller C-element has
two states. Whenever the input X and the inverted input Y match, the output Z
of the Muller C-element assumes the value of the X input. In all other cases,
the output remains in its current state. Table 1 shows the logic table for this
circuit.
Table 1. Logic Table for the Muller C-element with 1 inverted input.
INPUT X   INPUT Y   CURRENT STATE   NEXT STATE
   0         0            Q             Q
   0         1            X             0
   1         0            X             1
   1         1            Q             Q
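The behavior in Table 1 can be written as a short next-state function. This is a minimal behavioral sketch; the function name `muller_c` and the 0/1 integer encoding are assumptions for illustration and say nothing about the transistor-level implementation.

```python
def muller_c(x: int, y: int, state: int) -> int:
    """Next state of a Muller C-element whose Y input is inverted (Table 1)."""
    y_inverted = 1 - y
    if x == y_inverted:
        return x          # inputs agree after inversion: output follows X
    return state          # otherwise the element holds its current state


# Enumerate Table 1 for both possible current states.
table = {(x, y, q): muller_c(x, y, q)
         for x in (0, 1) for y in (0, 1) for q in (0, 1)}
```

For X = 0, Y = 1 the output is forced to 0, for X = 1, Y = 0 it is forced to 1, and for the other two input pairs the entry equals the current state Q.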
The following table shows which transistors are on and off with varying
inputs to the circuit and varying current states.
Table 2. State Transition Table for the Muller C-element with one inverter.
X  Y  Z-1  M1   M2   M3   M4   M5   M6   M7   M8   M9   M10  M11  M12  M13  M14  Z
0  0  0    ON   OFF  ON   OFF  ON   OFF  OFF  ON   ON   OFF  ON   OFF  ON   OFF  0
0  0  1    ON   OFF  ON   OFF  OFF  ON   ON   OFF  ON   OFF  ON   OFF  ON   OFF  1
0  1  X    ON   ON   ON   ON   ON   OFF  OFF  ON   OFF  OFF  OFF  OFF  OFF  ON   0
1  0  X    OFF  ON   OFF  ON   OFF  ON   ON   OFF  OFF  ON   OFF  ON   OFF  ON   1
1  1  0    OFF  OFF  OFF  OFF  ON   OFF  OFF  ON   ON   ON   ON   ON   ON   OFF  0
1  1  1    OFF  OFF  OFF  OFF  OFF  ON   ON   OFF  ON   ON   ON   ON   ON   OFF  1
In the circuit diagram in Figure 12 (c), some transistors are shown larger
than others, since they are used to change the state of the C-element. The
smaller transistors are only used to hold the value in its current state,
making the gate fully static.
4.2.2 The XNOR
The XNOR logic gate is implemented in standard CMOS logic, shown in
Figure 13 [WE88]. The XOR circuit (M1-M6) consists of two inverters (M1-M2,
M3-M4) and a pass transistor pair or transmission gate (M5-M6). A third
inverter (M7-M8) is used to invert the A⊕B signal to get the complemented
(XNOR) output. Table 3 shows the state transitions for the XNOR circuit.
[Circuit diagram: inputs A and B, transistors M1-M8, and the A⊕B node.]
Figure 13. Basic XNOR circuit
Table 3. State Transition Table for the XNOR circuit.
A  B  M1   M2   M3   M4   M5   M6   M7   M8   A⊕B
0  0  ON   OFF  ON   OFF  OFF  OFF  ON   OFF  0
0  1  ON   OFF  OFF  ON   OFF  OFF  OFF  ON   1
1  0  OFF  ON   ON   OFF  ON   ON   OFF  ON   1
1  1  OFF  ON   OFF  ON   ON   ON   ON   OFF  0
4.2.3 The Flip-Flops
The FIR filter stage uses a SETDFF and a DETDFF. The SETDFF is
level triggered, and the DETDFF is edge triggered. The SETDFF is
implemented using pass transistors and NOR gates. The DETDFF uses an
optimized design found in [LU90]. The SETDFF has both set and reset
available, whereas the DETDFF does not. The SETDFF uses a 2-phase clock,
requiring that both CLOCK and CLOCKbar be input. The DETDFF uses a single
clock phase, but requires both D and Dbar. The logic symbol for the DETDFF is
shown in Figure 14 (a).
[Circuit diagram panels: (a) logic symbol with inputs D and CLK and output Q;
(b) latch 1; (c) latch 2; (d) final logic.]
Figure 14. Double-Edge-Triggered D flip-flop Diagram.
The DETDFF circuit shown in Figure 14 has three main parts. The latch
sections shown in Figure 14 (b) and (c) are used to generate partial results,
one for each clock transition. Each latch is composed of a central flip-flop
made up of two cross-coupled inverters (M3-M6 in Figure 14 (b), M12-M15 in
Figure 14 (c)). The other transistors control the cross-coupled inverters. The
latch in (b) is active during the low-to-high transition and the latch in (c)
is active during the high-to-low transition. The third part of the circuit is
the final logic, shown in Figure 14 (d). It takes the two values generated by
the latches and combines them to produce the output Q, which copies the input D
during both clock transitions.
Table 4. State Transition Table for the DETDFF Latch 1 circuit.
D  CLK  M1   M2   M3   M4   M5   M6   M7   M8   M9   Q1  Q1bar
0  ↓    OFF  ON   OFF  OFF  ON   ON   ON   OFF  ON   1   1
0  ↑    ON   OFF  ON   OFF  OFF  ON   ON   OFF  OFF  0   1
1  ↓    OFF  ON   OFF  OFF  ON   ON   OFF  ON   ON   1   1
1  ↑    ON   OFF  OFF  ON   ON   OFF  OFF  ON   OFF  1   0
We will examine the operation of the latch 1 and final logic circuits of the
DETDFF, since latch 2 is a mirror image of latch 1 and operates in much the
same way. The sequence of events as the DETDFF goes through one cycle is
as follows:
(1) The two outputs in latch 1, Figure 14 (b), are held to Vdd when the latch
is off (when the clock is low). Transistor M9 is on, and either M7 or M8 will be
conducting Vdd to one of the outputs, with a slight reduction due to Vt.
Transistor M2 will be conducting Vdd to the other, with another reduction in
voltage due to Vt. In the final logic (d), with both inputs high (Q1 and Q1bar)
the lower pass transistors M21-M22 are cut off.
(2) When the clock transitions high, the path to Vdd through M9 and the path
shorting the two outputs through M2 are cut off. The transistor M1 is now on,
which provides a pathway to ground for the inverters.
(3) With the inputs to both inverters high, both inverters (M3, M5) and (M4, M6)
will be trying to bring their outputs low. The side that was already lower in
voltage due to the Vt drop through M2 will be successful, and the other will
remain high.
(4) In the final logic, one of the transistors M21 or M22 will be on, bringing
the output Q1 to the input of the series inverters. The output of the inverters
will generate both Q and Qbar.
Latch 2 operates in the same way, except that the outputs are driven low
instead of high in the off clock phase. One side will be driven high by the
inverter pair. The DETDFFs are used to store the 4-bit token values in each
filter stage. There are 16 of them in each filter stage, organized in four
banks of four each.
[Circuit diagram of the flip-flop with SET and RESET inputs.]
Figure 15. SETDFF (Single-Edge-Triggered D Flip-Flop).
The SETDFF in Figure 15 is a design found and described in [WE88]. It
has both Set and Reset available, and is triggered on the falling edge. The NOR
gates are only needed when Set and Reset are desired; otherwise the circuit
could have used inverters instead. The pass transistors and NOR gates act as a
two-stage flip-flop, holding the value and passing it to the output when the
clock has made one complete low-high-low cycle. In Figure 15 (b) transmission
gates (T1-T2) and NOR gates (NOR1-NOR2) are the first stage, while (T3-T4) and
(NOR3-NOR4) are the second stage.
The circuit requires a two-phase clock, which leaves it open to clock skew
problems. If the clocks were both positive or both negative at the same time,
the input D could be connected to the output of NOR2 and the outputs of NOR1
and NOR4 could be shorted together.
The SETDFF circuit in Figure 15 (b) performs the following operations
during each cycle: (1) The clock is high and input D is valid, so T1 is on and
input D determines the output of NOR1. T2 and T4 are off, but T3 is on and
provides a feedback path for NOR3 and NOR4, maintaining the output in its
present state. (2) When the clock goes low, T1 turns off and D is disconnected.
T2 is on and provides a feedback path for NOR1 and NOR2, maintaining the
input D for the second stage. T3 provides a path from the output of the first
stage to NOR3, setting the second stage output Q to reflect the input D.
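The two-stage operation just described can be summarized in a behavioral model. This is a sketch only: the class name `SetDff` and its method names are hypothetical, the signal inversions through the NOR gates are ignored, and the two clock phases are assumed to be perfectly complementary.

```python
class SetDff:
    """Behavioral two-stage model of a falling-edge-triggered D flip-flop."""

    def __init__(self) -> None:
        self.first_stage = 0   # value held by the first stage
        self.q = 0             # value held by the second stage (output Q)

    def apply(self, clock: int, d: int) -> int:
        if clock:
            # Clock high: the first stage follows D; the second stage holds Q.
            self.first_stage = d
        else:
            # Clock low: D is disconnected and the second stage takes the
            # value captured by the first stage.
            self.q = self.first_stage
        return self.q

    def preset(self) -> None:
        self.first_stage = self.q = 1

    def reset(self) -> None:
        self.first_stage = self.q = 0


ff = SetDff()
# (clock, D) pairs: load a 1, let the clock fall, load a 0, let the clock fall.
trace = [ff.apply(clock, d) for clock, d in [(1, 1), (0, 1), (1, 0), (0, 0)]]
```

In this model Q changes only on the steps where the clock is low after having been high, which matches the falling-edge behavior described for the SETDFF.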
4.2.4 Simulations
The simulation methodology for the Muller C-element was as follows:
First the circuit parameters were extracted from the cell layout, and then the
simulations were done using HSPICE for 2.0 micron technology. The
simulations were performed on the control and data paths of one PE in a
hierarchical manner. The simulation results for the control path circuits are
summarized in Table 5 below.
Simulations revealed that the Muller C-element had an average delay of
3.5 ns. This is very high, and is due to transistors M1-M2 and M7-M10 being
smaller than they should be. All of the other transistors in this layout are
minimum size. Since it is the function of M1-M2 and M9-M10 in Figure 12 to
override M5-M6, they should be much larger than M5-M6. In the cell layout they
are only 1.5 times as large. Because the Muller C-element delay is in the
critical path of the circuit, the larger-than-expected delay of the Muller
C-element will cause the FIR filter stage to have lower throughput.
It was expected that there would be current spikes when the output of the
Muller C-element changes, but simulation showed that there were also current
spikes when the inputs are changed from X≠Y to X=Y. These spikes are caused
by changes in the circuit feedback path, which holds the output steady. When
the inputs differ, the output is held constant by the action of M7-M8 in
Figure 12. When the inputs are equal, the output is maintained by transistors
M5-M6 in Figure 12.
Since the heart of the Muller C-element is the two cross-coupled inverters
M5-M8, the power consumption is similar to that of an inverter. When the output
is changing state, the peak power consumption is 2.8 mW. At 18 MHz, the
average power consumption is 0.2 mW. When the Muller C-element has static
inputs, like an inverter, it consumes only leakage current. The Muller
C-element has a total of 14 transistors. With 2.0 micron CMOS layout rules, it
uses an area of 90 by 80 microns.
The XNOR has a peak power consumption of 3.6 mW at switching, and
its average power consumption is 0.06 mW at 18 MHz. Its delay is 0.9 ns, and
its power-delay product is 0.1 picojoule. The cell layout takes an area of 40
by 80 microns.
The DETDFF uses a space of 70 by 145 microns when laid out in 2
micron CMOS. It has a peak power consumption of 3.0 mW, and an average
power consumption of 0.1 mW at 18 MHz. The delay time from clock transition
to output Q changing is 1.9 ns. The other flip-flop, the SETDFF, uses an area
of 115 by 90 microns, about the same as the DETDFF. Its peak power
consumption is 2.0 mW, with an average power consumption of 0.1 mW at 18
MHz. As it was laid out, a large amount of polysilicon layer was used. This
contributed to the 1.6 ns delay.
Table 5. Summary of simulation results for the Control-Path circuits.
Circuit    Delay    Peak     Avg Power   Power-Delay   Number of     Area
                    Power    @ 18 MHz    Product       Transistors
Muller C   3.5 ns   2.8 mW   0.2 mW      0.7 pJ        7 P, 7 N      7200 µm²
XNOR       3.6 ns   0.9 mW   0.06 mW     0.2 pJ        4 P, 4 N      3200 µm²
SETDFF     1.6 ns   2.0 mW   0.1 mW      0.2 pJ        12 P, 12 N    10,350 µm²
DETDFF     1.9 ns   3.0 mW   0.1 mW      0.2 pJ        13 P, 13 N    10,150 µm²
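The power-delay and area columns of Table 5 can be reproduced from the other figures. The sketch below assumes that the power-delay product is the average power at 18 MHz multiplied by the delay (mW times ns gives pJ) and that the area is the product of the cell dimensions quoted in the text; the delay and power values are taken from the table as listed, and the dictionary names are illustrative.

```python
control_cells = {
    # name: (delay in ns, average power in mW at 18 MHz, width um, height um)
    "Muller C": (3.5, 0.2, 90, 80),
    "XNOR": (3.6, 0.06, 40, 80),
    "SETDFF": (1.6, 0.1, 115, 90),
    "DETDFF": (1.9, 0.1, 70, 145),
}

# Power-delay product in pJ, rounded to one decimal place as in the table.
power_delay_pj = {name: round(delay * power, 1)
                  for name, (delay, power, _, _) in control_cells.items()}

# Cell area in square microns.
area_um2 = {name: width * height
            for name, (_, _, width, height) in control_cells.items()}
```

Under these assumptions the computed values agree with the Power-Delay Product and Area columns of the table.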
4.3 Data Path
There were two different arithmetic circuit blocks used in the data path of
each filter stage. They are the adder and the multiplier. Each uses unsigned
integer values, and both are four bits wide. They were implemented in ECDL,
and were developed and laid out by [LU91A]. The multiplier was the larger of
the two, occupying a space of 475 by 750 microns. It is an array multiplier,
with a four by five cell organization. The array organization works well with
an ECDL circuit, since this lets calculation proceed as fast as the circuits
will allow.

[Block diagram: array of multiplier cells M1, M2, and M3 above a row of ripple
adder cells; four of the product outputs are used.]
Figure 16. Multiplier Block Diagram.
4.3.1 Multiplier
The basic principle of operation of an array multiplier is the generation of
partial products in parallel. There are two basic cell types in the array.
There are the multiplier cells, M1-M3 in Figure 16, which are used to compute
the partial products and low-nibble multiplier results. At the bottom of
Figure 16 are the ripple carry adder cells, which are used to generate the
final product. For the multiplier cells, there are several different types. The
most general circuit, M3, is used in the center of the multiplier array. It is
the most complex of all, since it must handle the most inputs. The other
multiplier circuits, M1 and M2, are simplified versions which can be used in
the first and second rows and first column, where some of the inputs are always
zeroes.
This multiplier takes two four-bit inputs, and generates an eight-bit output.
There is no need for all eight of the bits, since the next stage in the
pipeline will only take a four-bit input. Since the coefficient, which is one
input to the multiplier, is assumed to be between zero and one, the output of
the multiplier will have a binary point and a fractional part. If we let a
coefficient input of 1000₂ be the "one" value, and 0000₂ be the "zero" value,
then the binary point will be between bits 2 and 3. This can be seen from
identity, since 1000₂ times 1000₂ will give 01000.000₂ on the output of the
multiplier. To preserve scaling of token values throughout the pipeline, only
bits 3 through 6 should be connected to the next circuit.
Another way to organize the data path blocks is to assume that a
coefficient input of 10000₂ would be 1. Such a 5-bit value could not actually
be input, so a coefficient of 1111₂ would be the closest possible value to one.
The binary point on the output would now be between bits 3 and 4, and bits 4-7
would be the output to the next filter stage. The advantage with this scheme is
that the coefficients can be more accurate, since there are almost twice as
many possible values for the coefficient. The disadvantage is that the filter
cannot have coefficients of one, so it could not perform a simple delay
function without attenuating the signal passing through it.
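The two scaling schemes can be compared with a few lines of integer arithmetic. This is an illustrative sketch: the function names are hypothetical, bit 0 is taken to be the least significant product bit, and the multiplier is modeled as an ideal unsigned 4 by 4 multiply.

```python
def scale_unity_1000(x: int, coeff: int) -> int:
    """Coefficient 1000 (binary) represents one: keep product bits 3 to 6."""
    product = (x & 0xF) * (coeff & 0xF)    # 8-bit unsigned product
    return (product >> 3) & 0xF


def scale_unity_10000(x: int, coeff: int) -> int:
    """Coefficient 10000 (binary) would represent one: keep product bits 4 to 7."""
    product = (x & 0xF) * (coeff & 0xF)
    return (product >> 4) & 0xF


identity = scale_unity_1000(0b1000, 0b1000)     # one times 1000 (binary)
pass_through = scale_unity_1000(0b0101, 0b1000)
halved = scale_unity_1000(0b0110, 0b0100)
attenuated = scale_unity_10000(0b1000, 0b1111)  # largest coefficient, still below one
```

With the first scheme a coefficient of 1000 (binary) passes the input through unchanged, while with the second scheme even the largest coefficient 1111 (binary) reduces an input of 8 to 7, which is the attenuation the text refers to.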
4.3.2 Adder
The other arithmetic circuit block is the adder. It is a four-bit carry ripple
adder based again on ECDL [LU91B]. Adders of this type tend to be much
slower than adders which try to predict what the carry bits will be to speed up
the calculation. However, they have an advantage in that they are simpler, more
regular, and much smaller due to the lack of complicated carry lookahead logic.
The block diagram in Figure 17 shows the organization of the ECDL adder.
[Block diagram: four full adders in a ripple chain with inputs A0, B0 through
A3, B3, carry in and carry out on each cell, and outputs SUM0 through SUM3 plus
the final carry.]
Figure 17. Adder Block Diagram.
An adder of this type implemented in ECDL has another advantage in that
it will not produce false values during the carry ripple period. This is
because the Donein control circuitry in the ECDL cell will not "fire" until all
inputs are ready. By connecting the Donein of one stage to the Doneout of the
next lower bit carry subcell, the stage will not begin to produce a value until
the previous stage has completed its carry. Other circuit types such as static
CMOS may produce false values on their outputs because the carry input to a
stage may change during the setup of that stage. Static CMOS circuits
continuously evaluate their inputs and may produce temporary glitches before
settling to the correct value.
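The bit-by-bit ordering enforced by the Donein/Doneout chaining can be mimicked in software. The following is a behavioral sketch of a four-bit ripple carry adder in which each bit position is evaluated only after the carry from the position below it is known; the function name `ripple_add4` is an assumption, and no circuit timing is modeled.

```python
from typing import Tuple


def ripple_add4(a: int, b: int, carry_in: int = 0) -> Tuple[int, int]:
    """Add two 4-bit unsigned values; returns (4-bit sum, carry out)."""
    carry = carry_in
    total = 0
    for i in range(4):                       # least significant bit first
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        # Each position waits for the carry from the position below it.
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        total |= sum_bit << i
    return total, carry


result, carry_out = ripple_add4(0b1011, 0b0110)
```

Because every position uses a carry that is already final, no intermediate (false) sum value is ever produced, which is the property the ECDL completion signals provide in hardware.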
4.3.3 Simulations
Simulation of the multiplier block indicated that there was a forward delay
of about 15.6 ns. This is a reasonable value considering that it is implemented
in 2 micron line width. To compare the simulation results from the multiplier
with the results from the control circuits, consider that the longest path
through the multiplier is 8 stages. This comes out to about 2.0 ns per stage.
Interestingly, the backward delay was simulated at 19.75 ns, more than 4 ns
longer than the forward delay. This is probably caused by the fact that the
transistors which discharge the output nodes of the ECDL gate are smaller than
the transistors which make up the logic tree. The peak supply current is
9.68 mA. This compares favorably with other circuit designs for a 4 bit
multiplier. The simulation signal graphs for the control signals of the
multiplier can be found in Figure 18 below. The inputs are not shown, but all
other signals are shown and their names are given in Table 6 below.
Table 6. Signal names for the Multiplier simulation results.
Simulation subgraph number   Signal name
100                          Donein
101                          Doneout
-I(Vdd)                      Current for Vdd
31-37, 41-47                 Output bits 1-4, 5-6
[HSPICE plot, MULT4X4A.TR0: traces 100, 101, -I(Vdd), 31, 33, 35, 37, 41, 43,
45, and 47 versus time (linear), 0 to 640 ns.]
Figure 18. Multiplier Simulation Results
The adder block is smaller, and simulation of the adder indicated that it
had a forward delay of only 7.4 ns. The backward delay was measured at 5.1
ns. The peak supply current was 2.8 mA, and the average power at 18 MHz is
1.6 mW. Like the multiplier, this also compares favorably with results from
other circuit types. Since the backward delay of the multiplier is so much
longer than the forward delay of the adder, the adder will have long since
completed its cycle before the multiplier has the next input ready for the
adder. This suggests that the multiplier is indeed the slowest section of the
overall filter, and that it is critical to reduce the time taken by the
multiplier to compute its output. Signal graphs from the HSPICE simulation of
the adder block can be found in Figures 19 and 20 below. The completed adder
cell used a space of 220 x 420 microns.
[HSPICE plot of the four-bit adder input traces versus time (linear), 0 to
320 ns.]
Figure 19. Adder Simulation Inputs
[HSPICE plots of the four-bit adder output, carry, Donein, Doneout, and Vdd
current traces versus time (linear), 0 to 320 ns.]
Figure 20. Adder Simulation Results
Table 7. Signal names for the Adder simulation results.
Simulation subgraph number   Signal name
11-17                        Input A
21-27                        Input B
41-44                        Output bits 1-4
31                           Carry output
100                          Donein
101                          Doneout
-I(Vdd)                      Current for Vdd
Table 8. Summary of simulation results for the Data-Path circuits.
Circuit      Forward   Backward   Peak      Avg Power   Power-Delay   Area
             Delay     Delay      Power     @ 18 MHz    Product
Multiplier   15.6 ns   19.75 ns   48.4 mW   5.7 mW      89 pJ         356,250 µm²
Adder        7.4 ns    5.1 ns     14 mW     1.6 mW      12 pJ         92,400 µm²
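Several entries in Table 8 follow from figures quoted in the text. The sketch below assumes a 5 V supply, which is not stated in the text but reproduces the peak power entries from the quoted peak currents, and it assumes the power-delay product is the average power times the forward delay. The variable names are illustrative.

```python
VDD_VOLTS = 5.0  # assumed supply voltage; not stated in the text

data_path_blocks = {
    # name: (forward delay ns, peak current mA, avg power mW, width um, height um)
    "Multiplier": (15.6, 9.68, 5.7, 475, 750),
    "Adder": (7.4, 2.8, 1.6, 220, 420),
}

summary = {}
for name, (delay_ns, peak_ma, avg_mw, width, height) in data_path_blocks.items():
    summary[name] = {
        "peak_power_mw": round(peak_ma * VDD_VOLTS, 1),  # mA times V gives mW
        "power_delay_pj": round(avg_mw * delay_ns),      # mW times ns gives pJ
        "area_um2": width * height,
    }

# The longest path through the multiplier is 8 stages.
delay_per_stage_ns = 15.6 / 8
```

Under these assumptions the computed peak power, power-delay product, and area match the table, and the per-stage delay comes out just under the quoted "about 2.0 ns".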
4.4 Overall Structure of the Filter
In this section I will combine all of the basic circuit blocks of the previous
sections and discuss the layout of the main filter stages. I will begin by
discussing the floor plan of the filter chip. Next I will present some of the
layout issues, and discuss the simulation results for the filter stage cell.
4.4.1 Floor plan of the Filter stage
There are three major almost-identical cells on each chip. These cells,
named triplet, are the individual three stages in the FIR filter. Each contains
an adder and a multiplier, storage registers, and various control logic blocks.
The organization of these blocks is shown in Figure 21. This figure shows the
organization the same way as the plot of the actual fabricated chip in
Figure 22.
[Floor plan: adder, multiplier, four banks of four DETDFFs, delay block, and
control logic, with the X input, Y input, coefficient (B) input, and output
marked. Expanded control logic section: Muller C-elements (adder, fork, delay,
join, multiplier), XNOR gates, inverters, S.E.T. D flip-flops, and delay
elements for the adder and multiplier.]
Figure 21. Floor diagram of the Triplet filter stage.
Figure 22. Plot of the filter chip.
Table 9 below summarizes the amount of chip area required for the
Triplet cell. The multiplier was the largest cell by far, and uses up about 32
percent of the cell area. The DETDFF is the next largest area consumer, since
there are 16 of them. They used 15 percent of the area. The adder is third,
using about 8 percent of the cell area. The only thing that consumed more area
than the multiplier was the interconnection wiring between all of the cells.
The total area taken by the triplet cell is about 1700 microns long by 660
microns wide. Three of these triplet cells are placed side by side in the
topmost level of the chip layout. With power busses, signal wiring, and space
for I/O pads and circuitry, the finished chip is 2300 by 2250 microns.
Table 9. Summary for the Triplet cell.
Cell Name              Number used   Total area used   Percent of area
Muller C               5             36,000 µm²        3 %
XNOR                   2             6400 µm²          <1 %
DETDFF                 16            162,400 µm²       15 %
SETDFF                 2             20,700 µm²        2 %
Adder                  1             92,400 µm²        8 %
Multiplier             1             356,250 µm²       32 %
Signal Interconnects   N/A           448,000 µm²       40 %
Triplet Cell Total     N/A           1,122,000 µm²     100 %
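The area budget in Table 9 can be rebuilt from the per-cell areas and the triplet cell dimensions. The sketch below assumes the interconnect area is simply the triplet area left over after the listed cells are subtracted; the dictionary and variable names are illustrative.

```python
cells = {
    # name: (number used, area of one cell in square microns)
    "Muller C": (5, 7200),
    "XNOR": (2, 3200),
    "DETDFF": (16, 10150),
    "SETDFF": (2, 10350),
    "Adder": (1, 92400),
    "Multiplier": (1, 356250),
}

triplet_area = 1700 * 660                       # triplet cell, square microns
cell_area = {name: count * area for name, (count, area) in cells.items()}
interconnect_area = triplet_area - sum(cell_area.values())

# Share of the triplet cell taken by each item, in percent.
percent = {name: 100.0 * area / triplet_area for name, area in cell_area.items()}
percent["Signal Interconnects"] = 100.0 * interconnect_area / triplet_area
```

The per-cell totals match the "Total area used" column, the leftover area is within a few hundred square microns of the 448,000 listed for interconnect, and each computed share lands within one percentage point of the table.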
4.4.2 Layout issues for the filter stage
Alternate filter stages must be slightly different, since the control section
of one must have both resets and presets, and the next must have resets only.
In practice, all three filter stages had to be different due to interfacing
with off-chip components. At one side of the pipeline, the X token must be
input, requiring both the individual bits of the token and their inverse. On
the same side, the Y value had to be output, requiring only the Y value's bits
and not their inverse. On the other side of the pipeline, the reverse was true,
requiring a reversed wiring layout. The stage in the middle of the filter had
to be different from the outside two to meet the requirement of alternating
Preset/Reset stages.
The control circuitry for each filter stage was kept together in one single
large block in order to facilitate wiring. The connections between the various
control logic circuits would have required more area and would have been more
difficult to route if they had to be routed across the filter stage cell. The
centralized control block arrangement had the added benefit of allowing fast
signal travel between the various control logic blocks. This signal timing
benefit was not without its perils. The short wires between the various blocks
in one instance caused a timing error that was impossible to fix.
In the original micropipeline by Sutherland, the bundled convention for the
control signals states that the control signals should be kept bundled with the
data signals. This means that they should be routed the same way on-chip. The
circuit is only guaranteed to operate properly when the delay of the Request
signal is the same as the delay of the data signals.
Simulation revealed that the difference in routing between control and data
signals had caused a timing problem within each of the stages. The problem
was caused by the adder's Donein signal reaching the adder before the data to
the adder was fully set up. This signal goes low as soon as the data have
arrived and can be used. It is generated by the XNOR gate from the
Acknowledgein signal and the output of the adder's Muller C gate. The input
data that were not reaching the adder on time were from the multiplier, being
generated on the opposite side of the single-stage cell. The signals had to
travel across wires that were as long as 1 mm, then through the DET D flip-flop
at the input to the adder, and then to the input of the adder itself. The
control signal, by contrast, had to flow out of the multiplier, across a
shorter path of approximately 0.5 mm, and then through the logic blocks.
The path through the control logic consisted of the SETDFF, the
Acknowledgein wire to the adder's control logic, and the XNOR gate. The
control signal delay is shorter than the data signal delay by approximately
4 ns. This resulted in the lowest bit being incorrectly received approximately
half of the time. Because the adder operates starting from the LSB and moving
up to the MSB, only the accuracy of the LSB is affected. Since the adder
performs the calculation of the carry bit first, and the sum bit second, the
LSB carry bit is most likely to be incorrect. The upper input bits can tolerate
extra delay because they are not needed until the lower bits have been
computed.
We tried to fix the incorrect timing in the control block by placing delay
inverters in the spaces available between the control logic cells and the adder
cells. There was room for up to three inverters, but since signal inversion
would have caused errors, only two inverters could be used, giving a delay of
about 0.8 ns. We needed to delay the signal at least 2 ns to guarantee accuracy
in the LSB of the adder. To get more delay, we used a long diffusion run
instead of a wire to connect the output of the delay inverters to the DONEin
input of the adder. The large capacitance and large resistance of the diffusion
run add a delay of about 1.5 ns, delaying the control signal by about 2.3 ns.
Timing was still marginal, and the circuit still exhibited errors in certain
circumstances. When more than one bit is changing at one time in the output
stream of the multiplier, the adder gives incorrect results in the lowest bit.
These errors can be predicted easily in advance and can be factored into the
results. The results must then be calculated on a computer and compared with
the actual output of the chip. We will reexamine this timing problem when we
show the simulation results.
4.4.3 Simulation results for the filter stage
The triplet cell was slow and difficult to simulate because it has 1300
transistors. The HSPICE program indicated that it had a peak power
consumption of 45 mW, and an average power consumption of 3.5 mW at 18 MHz.
The average cycle time or latency from RequestinX to RequestoutY was 37.8 ns.
The graphs of the simulation results for the Triplet filter stage are in Figures 23
and 24. Table 10 shows the signals corresponding to the node numbers on the
SPICE graphs.
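As a rough figure of merit, the simulated power and speed can be combined into an energy per cycle. The sketch below is not a result reported in this thesis; it simply divides the quoted average power by the quoted operating frequency, and the names are illustrative.

```python
# Energy per cycle and peak-to-average power ratio from the HSPICE figures.
AVG_POWER_W = 3.5e-3    # average power at 18 MHz
PEAK_POWER_W = 45e-3    # peak power
FREQ_HZ = 18e6          # operating frequency used in the simulation

energy_per_cycle_j = AVG_POWER_W / FREQ_HZ
peak_to_avg = PEAK_POWER_W / AVG_POWER_W

print(f"energy per cycle: {energy_per_cycle_j * 1e9:.3f} nJ")
print(f"peak-to-average power ratio: {peak_to_avg:.1f}")
```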
[Figure 23 plot: HSPICE traces from TRIPLET.TR0 for nodes 101, 102, 103, and 104 and the supply current, plotted against time (LIN) with ticks up to 300 ns.]
Figure 23. Triplet Simulation Results
[Figure 24 plot: HSPICE traces from TRIPLET.TR0 for the REQ/ACK handshake nodes (see Table 10), plotted against time (LIN).]
Figure 24. Triplet Simulation Results
Table 10. Signal names for the Triplet simulation results.
Signal subgraph number    Signal name
61                        REQinX
62                        ACKinX
63                        REQoutX
64                        ACKoutX
65                        REQinY
66                        ACKinY
67                        REQoutY
68                        ACKoutY
102                       Adder DONEin
101                       Adder DONEout
104                       Multiplier DONEin
103                       Multiplier DONEout
-I(Vdd)                   Vdd current
For purposes of making the results of this research more easily
comparable to other technologies, I have scaled the layout down to a 1.2 micron
line width, and also scaled the results up to 8 bits at 2 micron. For 1.2 micron
line width, the peak power consumption declines to 35 mW. The latency or
delay time is reduced to 25.5 ns. When the design is scaled up to 8 bits, the
delay or latency increases to 60.2 ns. The cycle time at 8 bits is 80.5 ns, giving
a maximum speed of 12.4 MHz. The scaling from 2.0 µ to 1.2 µ and the scaling
from 4 to 8 bits are summarized in Tables 11 and 12 below. In the next section I
will discuss some aspects of the layout of the filter and the circuit blocks, and
also the extraction and simulation of the circuit blocks.
Table 11. Comparison of 2 µ and 1.2 µ feature sizes for the Triplet filter cell.
Feature Size      Area               Power    Maximum Speed    Latency
2.0 micron        1.12E6 microns²    45 mW    18 MHz           37.8 ns
1.2 micron        0.4E6 microns²     35 mW    26 MHz           25.5 ns

Table 12. Comparison of 4- and 8-bit Triplet filter cells.
Number of Bits    Area               Power    Maximum Speed    Latency
4                 1.12E6 microns²    45 mW    18 MHz           37.8 ns
8                 ??E6 microns²      ?? mW    12.4 MHz         60.2 ns

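The relationship between cycle time and maximum speed, and the scaling ratios implied by the figures above, can be worked out as follows. This is a sketch using only the quoted numbers; the helper name `max_speed_mhz` is hypothetical.

```python
def max_speed_mhz(cycle_time_ns: float) -> float:
    """Maximum speed in MHz is the reciprocal of the cycle time in ns, times 1000."""
    return 1e3 / cycle_time_ns

# 8-bit cycle time quoted in the text.
speed_8bit_mhz = max_speed_mhz(80.5)

# Ratios between the quoted latencies and peak powers.
latency_ratio_1p2_vs_2p0 = 25.5 / 37.8   # 1.2 micron relative to 2.0 micron
latency_ratio_8_vs_4_bit = 60.2 / 37.8   # 8 bits relative to 4 bits
power_ratio_1p2_vs_2p0 = 35.0 / 45.0     # peak power, 1.2 micron relative to 2.0 micron

print(f"8-bit maximum speed: {speed_8bit_mhz:.1f} MHz")
print(f"latency ratio, 1.2 vs 2.0 micron: {latency_ratio_1p2_vs_2p0:.2f}")
print(f"latency ratio, 8 vs 4 bits: {latency_ratio_8_vs_4_bit:.2f}")
print(f"peak power ratio, 1.2 vs 2.0 micron: {power_ratio_1p2_vs_2p0:.2f}")
```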
5. TESTING AND EXPERIMENTAL RESULTS
In presenting the results from the testing of this filter chip, I will begin by
discussing the proper measurement of the output of the chip, and continue with
a description of the critical path through the chip. I will then present the
simulation results, starting with the smaller parts and working up to the filter
stage as a whole. Lastly I will examine the results of the testing of the actual
fabricated chip.
5.1 Measurement of Speed and Delay
Testing of this chip is made difficult by the fact that it is an asynchronous
chip. With a synchronous design, we would only have to connect the chip to a
clocked input circuit, and determine how fast we could clock the input before the
filter failed. With an asynchronous chip, the input is assumed to come from a
buffer. The simple input of the synchronous design cannot test the maximum
capabilities of the filter, since the filter may take inputs faster or slower at various
times. Nonetheless, the latency or delay time is well defined for both cases, and
can be measured in a straightforward way.
We are interested in two different characteristics of the chip. The speed
or cycle time is how fast the data can be clocked into the chip. The latency or
delay time is how long it takes for a result to come out after a data value is input.
The speed of the design in the synchronous case is well defined. It is the
reciprocal of the fastest possible input period where the filter functions correctly.
The asynchronous case is not so well defined. The speed must be measured
over a period of time so as to get an average. This is because the speed, as
well as the latency, may change drastically depending on the value of the inputs.
There is a distribution of speeds associated with the range of possible
input values. The exact nature of this distribution will depend on the exact
characteristics of the input source. This makes it necessary for the designer to
look at the type of input values that are expected, and design the test cases
accordingly. The larger the test suite, the more accurate the speed assessment
will be.
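The difference between an average-case and a worst-case speed figure can be illustrated with a short calculation. The cycle times below are hypothetical values chosen for illustration, not measurements from the chip, and `speeds_mhz` is an illustrative helper.

```python
from statistics import mean

def speeds_mhz(cycle_times_ns):
    """Return (average-case speed, worst-case speed) in MHz.

    The average speed is the reciprocal of the mean cycle time
    (samples processed divided by total time), not the mean of the
    individual reciprocals.
    """
    average_cycle_ns = mean(cycle_times_ns)
    worst_cycle_ns = max(cycle_times_ns)
    return 1e3 / average_cycle_ns, 1e3 / worst_cycle_ns

# Hypothetical per-input cycle times for a small test suite, in ns.
cycle_times_ns = [40.0, 45.0, 50.0, 56.0, 44.0]

avg_speed, worst_speed = speeds_mhz(cycle_times_ns)
print(f"average-case speed: {avg_speed:.1f} MHz")
print(f"worst-case speed:   {worst_speed:.1f} MHz")
```

A larger and more representative list of cycle times would give a better estimate of the average, which is the point made above about the size of the test suite.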
The worst case value is usually used to get the speed of a design. This is
so that the person designing with the chip can have a safety margin when
designing a system around the part. This is the case for a synchronous design,
where the input speed will not be changeable, and inputs that come too fast will
produce incorrect results and failure. This cannot be done for an asynchronous
design. For an asynchronous design, we want to look at the average case over
the expected range of input values. The size of the input buffer will help
determine the robustness of the overall design. A large input buffer will allow
large swings in the type of input values, so that the actual performance of the
design when looking from outside the filter and its buffer will come closer to the
average case.
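The smoothing effect of an input buffer can be sketched with a small simulation. The model below is an assumption made for illustration: samples arrive at a fixed period, the filter takes a data-dependent time per sample, and the function reports how many earlier samples are still waiting or in service when a new one arrives. All timing values are hypothetical.

```python
def required_buffer_depth(arrival_period_ns, service_times_ns):
    """Largest number of earlier samples still pending when a new sample arrives."""
    finish_times = []
    previous_finish = 0.0
    depth = 0
    for index, service_ns in enumerate(service_times_ns):
        arrival = index * arrival_period_ns
        # Earlier samples that have not yet left the filter at this arrival.
        pending = sum(1 for finish in finish_times if finish > arrival)
        depth = max(depth, pending)
        # The filter starts this sample once it has arrived and the filter is free.
        start = max(arrival, previous_finish)
        previous_finish = start + service_ns
        finish_times.append(previous_finish)
    return depth

# Hypothetical case: mean service time is below the 50 ns input period,
# but two samples take 70 ns each.
varying = [40.0, 70.0, 70.0, 30.0, 30.0, 40.0]
print(required_buffer_depth(50.0, varying))        # short bursts need some buffering
print(required_buffer_depth(50.0, [40.0] * 6))     # uniform fast samples need none
```

In this toy case a single buffered sample is enough to absorb the slow inputs, so the source sees the average rate instead of the worst case.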
As a practical matter, the chip must be tested in a synchronous manner.
We have to use clocks to generate the control signals. To try to generate these
signals in an asynchronous manner will cause too much clock skew and other
problems, which will get worse as we get into the upper range of the filter. The
best way to test a filter like this one is to put it into an asynchronous system. The
next best way is to connect it to a synchronous test setup.
The critical path through the design will determine the speed of the filter.
The critical path is the path from the input of values to the input of the next
value. It is the path that needs to be evaluated most closely to get an accurate
speed assessment. The worst case on this path will tell us something about the
performance of the filter. It will certainly tell us where the filter needs to be
improved, and how large the safety margins are compared to the average case.
But the average case through the critical path will give us a better indication of
how fast the filter chip can be clocked.
In defining the critical path we must also define our assumptions about
the current state of the filter chip. We will define the current state as having
values in all registers, and tokens waiting at all of the joins. This is the state at
rest after the chip has been operating for some time. Also, signals have already
been presented (the Requestin signals) for the values for the next operation.
The operations critical for the calculation of the next output value will now be
determined.
The critical path for the triplet filter stage is shown in Figure 25 below.
Only the output stage is in the critical path through the filter. This can be seen
since the next value to be output depends only on the previous stage's output
value, and on the next X input value. The previous two stages can then be
disregarded. The result of the previous stage's addition is waiting on the input to
the last stage's adder. The next output cycle begins when the Requestin is
presented to the Xin port. The Xin value itself is assumed to have already been
set up. The Requestin signal will cause the Muller C-element in the multiplier
part of this filter stage to change state, after the Muller C-element delay. This
will trigger the DETDFFs to bring in the X value, and also will trigger the XNOR
gate. The XNOR gate will go low after a delay, triggering the ECDL multiplier.
After a very long delay, the multiplier's Doneout signal will go low, triggering the
SET D flip-flop. There will be a delay associated with that flip-flop, and then a
delay associated with the Join Muller C-element. The output of the multiplier is
complete at this point.
[Figure 25 diagram: block diagram of the last filter stage showing the Muller C-elements, the ECDL multiplier and ECDL adder with their DONE in/out signals, the FORK, JOIN, and DELAY elements, the D flip-flops, and the REQ X, ACK X, REQ Y, and ACK Y handshake lines, with the critical path marked.]
Figure 25. The critical path through the last stage.
The output of the Join Muller C-element will trigger the Muller C-element
in the adder stage. After the delay, the adder's Muller C-element will change
state, triggering the DET D flip-flop, the XNOR gate in the adder stage, and the
inverter on the Acknowledge line going back to the multiplier stage. The
inverter, after a very short delay, resets the multiplier's Muller C-element. After a
delay of about 4 ns, the multiplier's Muller C-element changes state, triggering
the multiplier's XNOR gate to change back to the one state. This starts the reset
cycle of the multiplier ECDL circuit, the longest delay in the whole cycle. At this
point, the path diverges and we have separate cycle and delay paths.
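The behavior of the Muller C-elements that sequence these events can be captured in a small behavioral model. This is a generic textbook C-element, not a model of the specific cells on this chip (which may have inverted inputs or reset pins); the class name `MullerC` is illustrative.

```python
class MullerC:
    """Behavioral model of a two-input Muller C-element.

    The output follows the inputs when they agree and holds its
    previous value when they differ.
    """

    def __init__(self, initial=0):
        self.output = initial

    def update(self, a, b):
        if a == b:
            self.output = a
        return self.output

# A join waits until both of its inputs have made a transition.
join = MullerC(initial=0)
input_sequence = [(1, 0), (1, 1), (0, 1), (0, 0)]
trace = [join.update(a, b) for a, b in input_sequence]
print(trace)
```

In the trace, the output changes only on the second and fourth steps, when both inputs have reached the same value. This is what lets a join hold off the adder until both the multiplier result and the previous stage's sum are ready.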
Only after the multiplier is properly reset can the next X data value be
input. If the next request comes too soon, the Muller C-element and the XNOR
will change state properly, but the ECDL multiplier circuit will not have gotten a
chance to go high. Since it did not return high, the SETDFF will not function,
and the circuit will be locked up. This is the backward delay of the circuit, and it
is important to realize that it affects the maximum cycle rate at which data values
can be presented. The backward delay in this case is so long that the adder will
have completed its work before the multiplier starts on a new data value. In this
way, the pipelining of the stages is of limited value internally, except in terms of
delay for a given data value. The complete cycle time contains the backward
delay of the multiplier, minus the delay of the Muller C-element and the XNOR
circuit. Table 13 shows the calculations for the maximum time required to
complete a cycle.
Table 13. Maximum cycle time for the filter stage.
Critical path element               Delay
Muller C-element                    3.5 ns
XNOR gate                           0.9 ns
ECDL Multiplier, forward delay      15.6 ns
SETDFF                              1.6 ns
Muller C-element; Join              3.5 ns
Muller C-element; Adder             3.5 ns
Inverter                            0.4 ns
XNOR gate                           0.9 ns
ECDL Multiplier, backward delay     19.7 ns
Muller C-element                    -3.5 ns
XNOR gate                           -0.9 ns
TOTAL                               45.2 ns
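The total in Table 13 can be reproduced by summing the individual entries. The sketch below uses only the delays listed in the table; the share attributed to the multiplier is a derived number added here for illustration.

```python
# Critical path delays from Table 13, in ns. Negative entries are the
# Muller C-element and XNOR delays subtracted from the backward delay.
CRITICAL_PATH_NS = [
    ("Muller C-element", 3.5),
    ("XNOR gate", 0.9),
    ("ECDL Multiplier, forward delay", 15.6),
    ("SETDFF", 1.6),
    ("Muller C-element; Join", 3.5),
    ("Muller C-element; Adder", 3.5),
    ("Inverter", 0.4),
    ("XNOR gate", 0.9),
    ("ECDL Multiplier, backward delay", 19.7),
    ("Muller C-element", -3.5),
    ("XNOR gate", -0.9),
]

total_ns = sum(delay for _, delay in CRITICAL_PATH_NS)
multiplier_ns = sum(
    delay for name, delay in CRITICAL_PATH_NS if name.startswith("ECDL Multiplier")
)
multiplier_share = multiplier_ns / total_ns

print(f"total cycle time: {total_ns:.1f} ns")
print(f"multiplier forward + backward delay: {multiplier_ns:.1f} ns "
      f"({multiplier_share:.0%} of the cycle)")
```

The multiplier's forward and backward delays together account for most of the cycle, which is why the multiplier dominates the achievable speed.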
At the same time the inverter between the adder and multiplier is
changing state, the XNOR output is going low, enabling the ECDL adder. After
the adder's delay the ECDL adder's Doneout signal will go low, clocking the
output of the adder's Muller C-element into the SETDFF. This will cause the
RequestoutY signal to change state, informing the next circuit or buffer that
another output value has been computed and is ready.
The actual time to complete the cycle is slightly longer, according to
simulations. The simulations done from extract files on the triplet cell indicated
that it could not be clocked with a period shorter than 56 ns with a synchronous
clock. It takes so much longer than the sum of the times from the
individual cells because of the capacitance and resistance of the
interconnections. There are many long runs of metal1 and metal2 layers
connecting the various cells, and each has an associated resistance and
capacitance. These resistances and capacitances slow down the output
transistors in the cells driving these lines. These effects add up over many cells
to make a difference of approximately 16 ns in the overall performance of the
filter cell.
5.2 Test Results
In this section I will summarize the results from the testing of the chip.
We received four copies of the chip, fabricated through MOSIS. All of the
characteristics listed here are averages of the four chips. The average latency
was 70.5 ns. The average power consumption was 36 mW at 4 MHz, including
the power consumed by the I/O pads. The maximum speed could not be
measured because it was necessary to divide down the clock generator to
provide the clock phases needed. The maximum speed of the fabricated chip
would be slower than the maximum speed in simulation due to the same I/O pad
delays that slow down the latency path. The chip performed very well compared
to expectations, with all values comparable to the values predicted by the
HSPICE simulations.
In Figure 26 below, we show some of the results from the testing. The
first graph is the output of a state analyzer, connected up to one filter chip. The
graph shows the inputs and outputs at the four ports of the filter chip. The
control signals (ACKinX, REQinX, etc.) cannot be shown on this graph, because
they are too fast for the state analyzer to record. The X input to the filter is a
ramp, starting at one at initialization, and proceeding to count up to 15. The
coefficients are 1, 1/2, and 1/4. The Yout port is seen to count up to 15 twice
during the input sequence, which is correct.
[Figure 26 plot: state analyzer display of the four 4-bit ports, XIN 0-3, XOUT 0-3, YIN 0-3, and YOUT 0-3.]
Figure 26. State analyzer graph for the filter chip.
An oscilloscope plot of the filter output pins is shown below in Figure 27.
The two signals shown are REQoutX and ACKinY. Both of these signals are
generated by the filter chip, in response to externally generated signals REQinX
and REQinY. From this oscilloscope plot, the signals appear to be very noisy.
This is mostly due to the noise generated by the oscilloscope probe.
[Figure 27 plot: two-channel oscilloscope capture at 100 ns/div, with both channels at about 2 V/div.]
Figure 27. Oscilloscope plot for the filter chip.
6. CONCLUSION
A third-order self-timed filter has been produced. This filter uses four-bit
coefficients, and has been produced for demonstration purposes. It uses the
micropipeline control system developed by Sutherland, and the ECDL logic
family developed by Lu.
Simulation indicates that the filter will work dependably up to about 18
MHz, with a 56 ns cycle time and a 37.8 ns latency. The simulated power
consumption was 3.5 mW average, at 18 MHz. Experiments with the chip
produced through MOSIS indicate a power consumption of about 36 mW from a
5V power supply, running at 4 MHz. The majority of this power was used by the
I/O pin drivers. The measured delay time or latency from the time a data token
is presented to when the result is output is 70.5 ns, including delays due to chip
I/O pads. Simulations indicated that the same chip laid out in 1.2 micron line
width would have a delay time of 25.5 ns, not including delays due to chip I/O
pads.
One disadvantage in this filter architecture is that the backward delay of
the ECDL multiplier is so large that the adder has long since finished its
calculation before the multiplier is ready to receive another input data token.
This limits the effectiveness of the two-phase pipelined filter stage. The chip has
advantages in speed and power consumption when compared to other designs.
Improvement of the multiplier should be the main focus of future research.
It may be necessary to use a different coding scheme such as Booth encoding to
speed up calculation of the results. It would be possible to use more than one
multiplier, multiplexed, per stage, or alternatively to use one adder with more
than one stage and multiplier to save on chip area. It might be possible to
improve the backward delay of the multiplier through an improved circuit. For a
practical design, two's complement arithmetic must be used to allow both
positive and negative data values without distortion.
BIBLIOGRAPHY
[AR86] Arvind and D. E. Culler, "Dataflow Architectures," Annual Reviews in
Computer Science 1986, vol. 1, pp. 225-253, Annual Reviews Inc., 1986.
[JA88] G. M. Jacobs and R. W. Brodersen, "Self-timed integrated circuits for
digital signal processing applications," VLSI Signal Processing III, pp. 197-208,
IEEE Press, 1988.
[LA88] C. H. Lau, D. Renshaw, and J. Mavor, "Data Flow approach to self-timed
logic in VLSI," Proc. ISCAS, 1988, pp. 479-482.
[LU91] S. Lu and M. D. Ercegovac, "Evaluation of Two-Summand Adders
Implemented in ECDL CMOS Differential Logic," IEEE J. Solid-State Circuits,
vol. 26, no. 8, pp. 1152-1160, Aug. 1991.
[LU88] S. Lu, "Implementation of Iterative Networks with CMOS Differential
Logic," IEEE J. Solid-State Circuits, vol. 23, no. 4, pp. 1013-1017, Aug. 1988.
[LU90] S. Lu and M. D. Ercegovac, "A Novel CMOS Implementation of Double-
Edge-Triggered flip-flops," IEEE J. Solid-State Circuits, vol. 25, no. 4, pp. 1008-
1010, Aug. 1990.
[LU88] S. Lu, "A safe single-phase clocking scheme for CMOS circuits," IEEE J.
Solid-State Circuits, vol. 23, no. 1, pp. 280-283, Feb. 1988.
[LU93] S. Lu, "Design of Hardware Efficient Self Timed Circuits," Electronics
Letters, vol. 29, no. 1, pp. 6-7, 7 January 1993.
[LU88] S. Lu, "Implementation of Micropipelines with ECDL," accepted for
publication, IEEE Transactions in VLSI Systems, September 1994.
[QU91] P. Quinton and Y. Robert, "Systolic Algorithms and Architectures,"
Prentice-Hall, 1991.
[RO86] B. Rorabaugh, "Signal Processing Design Techniques," TAB Books,
1986.
[SU89] I. E. Sutherland, "Micropipelines," CACM, vol. 32, no. 6, pp. 720-738,
June 1989.
[UN81] S. H. Unger, "Double-Edge-Triggered Flip-Flops," IEEE Trans.
Computers, vol. C-30, no. 6, pp. 447-451, June 1981.
[WE88] N. Weste and K. Eshraghian, "Principles of CMOS VLSI Design: A
Systems Perspective," Addison-Wesley, 1988.