A Self-timed implementation of the bi-way sorter systolic array processor by Diamond, Mitchell




A Self-timed implementation of the bi-way sorter
systolic array processor
Mitchell Diamond
Recommended Citation
Diamond, Mitchell, "A Self-timed implementation of the bi-way sorter systolic array processor" (1993). Thesis. Rochester Institute of
Technology. Accessed from






Partial Fulfillment of the




Approved by: Graduate Advisor - Prof. George A. Brown
Department Chair - Dr. Roy Czernikowski
Reader - Dr. Tony Chang
DEPARTMENT OF COMPUTER ENGINEERING
COLLEGE OF ENGINEERING
ROCHESTER INSTITUTE OF TECHNOLOGY
ROCHESTER, NEW YORK
October 1993
THESIS RELEASE PERMISSION FORM
ROCHESTER INSTITUTE OF TECHNOLOGY
COLLEGE OF ENGINEERING
Title of Thesis: A Self-Timed Implementation of the Bi-Way Sorter Systolic Array
Processor.
I, Mitchell S. Diamond, hereby grant permission to the Wallace Memorial Library of




Self-timed circuits with an appropriate handshake control circuit can be used to
replace the global clock in a VLSI chip. Removing the global clock also removes many
of the problems it causes for designers. Some of these problems are due to clock skew
and to capacitance that does not scale with smaller feature sizes. The wire
capacitance cannot scale below a certain limit due to two-dimensional effects;
therefore the RC delays associated with the interconnect layers do not scale
proportionally to the feature size. The resulting increase in wire delay makes it
difficult to distribute a global clock at a high frequency.
This project takes an existing synchronous systolic array, the bi-way sorter,
and implements the sorter algorithm using a self-timed approach. By using self-timed
instead of synchronous approaches, many of the problems associated with
synchronous circuits, such as clock skew and large line capacitance, are avoided. In
this thesis, a 2-bit, four number sorter will be designed and simulated and the




1.1 What is Self-timed?
1.2 Self-Timed vs. Synchronous Circuits
2.0 Theory of the Bi-Way Sorter Algorithm and the Handshake Controller Circuit
2.1 The Bi-Way systolic sorting algorithm
2.2 The Self-Timed Circuit
2.2.1 The Self-Timed Logic Operator
2.2.2 The Handshake Control Circuit (HCC)
3.0 Implementation of the Bi-Way Sorter Cell
3.1 VHDL Overview
3.2 The Sequential Bi-Way Sorter
3.2.1 The VHDL Model of the Sequential Bi-way Sorter Cell
3.2.2 The Transistor Model of the Sequential Bi-way Sorter Cell
3.3 Implementation of the Self-timed Sorter Cell
3.3.1 The VHDL Model of the Self-timed Sorter Cell
3.3.2 The Transistor Model of the Self-timed Sorter cell
3.3.3 The Layout of the Self-timed Sorter cell
4.0 Implementation of the Handshake Controller Circuit (HCC)
4.1 Implementation
4.1.1 VHDL model of the Handshake Controller Circuit
4.1.2 The Transistor model of the Handshake Controller Circuit
4.1.3 The Layout of the Handshake Controller Circuit
5.0 A Self-timed Four 2-bit Number Sorter Array
5.1 Overview of a Systolic Array Processor
5.2 Description of the Self-timed Sorter Array Processor
5.2.1 Data Dependencies Between Cells in the Array
5.2.2 Operation of the Self-timed Bi-way Sorter Array
5.3 Simulation results of the self-timed bi-way sorter
6.0 Conclusion
Appendix A Example of Sorting four 2-bit numbers using the Synchronous Bi-way sorter array processor
Appendix B Circuits and simulation for the latch and the multiplexor used in the synchronous bi-way sorter
Appendix C Quine-McCluskey method for sorter logic implementation
Appendix D VHDL code for the number generator module and the b control signal module
Appendix E Schematic and simulation of components in the self-timed sorter module
Bibliography
List of Figures
Figure 1.1 Diagram of the contributions which affect the driving capacity of the clock signal
Figure 2.1 Diagram of a bi-way cell
Figure 2.2 3x2 array of bi-way cells
Figure 2.3 DCVSL logic with ready signal
Figure 2.4a Petri net describing the operation of the HCC
Figure 2.4b Logic diagram of the interconnection between logic modules
Figure 3.1 Circuit of the sequential bi-way cell
Figure 3.2 VHDL Code for Sequential bi-way sorter
Figure 3.3 Simulation of the sequential VHDL code for the bi-way sorter
Figure 3.4 Schematic of the synchronous sorter module
Figure 3.5 Schematic for the control module within the bi-way cell
Figure 3.6 Simulation and timing results for the control gate circuit
Figure 3.7 Simulation of the synchronous sorter cell
Figure 3.8 VHDL program for the Self-timed Bi-Way Sorter cell
Figure 3.9 Schematic for the Self-timed sorter cell
Figure 3.10 Schematic of the regular nmos tree
Figure 3.11 Schematic of the inverse nmos tree
Figure 3.12 Schematic of the ready signal generator circuit
Figure 3.13 Simulation of the ready signal circuit
Figure 3.14 Simulation of the one-way cell
Figure 3.15 Layout of the bi-way sorter cell
Figure 4.1 Modified handshake controller circuit
Figure 4.2 VHDL code describing the Handshake Controller Circuit
Figure 4.3 Logic Diagram of the HCC
Figure 4.4 Simulation of the Handshake Controller Circuit
Figure 4.5 Schematic of the Handshake Controller Circuit
Figure 4.6 Test Circuit used to determine timing characteristics of the HCC
Figure 4.7 Simulation of the test circuit for the Handshake Controller Circuit
Figure 4.8 Layout of the Handshake Controller Circuit
Figure 5.1 Example of the self-timed array processor
Figure 5.2 Schematic of the self-timed bi-way sorter processor
Figure 5.3 Operation flow of the Self-timed Bi-way Sorter
Figure 5.4 Simulation waveforms of the self-timed sorter processor
Figure A1 Example of sorting four 2-bit numbers using the Synchronous Bi-way sorter processor
Figure C1 Karnaugh maps used to reduce the logic equations for the nmos trees
Figure D1 Generator module used to generate the unsorted numbers
Figure D2 Generator for the b control signal used to control the control modules
List of Tables
Table 2.1 Selection of x and y based on the b input
Table 3.1 Truth table describing operation of the one-way cell
Table 4.1 Timing determined from the HCC simulation
Table 5.1 Simulation results for the self-timed bi-way sorter
Table C1 Quine-McCluskey reduction
Table C2 Quine-McCluskey reduction
Table C3 Quine-McCluskey reduction




















Very High Speed Integrated Circuit Hardware Description Language
Differential Cascode Voltage Switch Logic
A dedicated high speed sorter processor
Module Controlling data swapping between modules
A design where the control over the activity in the




1.1 What is Self-Timed?
The operation of a synchronous system is reminiscent of marchers moving
uniformly to the beat of the drum in a parade. The temporal controls of the marchers
are centralized in a single "authority", and the marchers respond by doing their tasks in
synchronism with the marching cadence. This type of control results in a simple form
of organization which people seem to associate with the efficiency of machines.
However, this rigid behavior is not the only way to coordinate these tasks, nor is it
efficient unless all the tasks are very well matched.
Self-timed systems use a different approach to managing the tasks that have to
be performed. This method allows each system module to have some control over
when it does its tasks while still conforming to the overall operation of the system. A
designer must assure that all system tasks occur in a proper sequence, but nothing ever
has to occur at a particular time.
Self-timed circuits are categorized as a cross between synchronous and
asynchronous circuits. The self-timed methodology more closely resembles the
asynchronous circuit. A synchronous circuit uses a global clock that controls the
operation of the circuit; every module reacts to the clock pulse at the same time. An
asynchronous system does not use a clock and the control of the circuit is usually
sporadic and does not appear to be rigidly organized.
The self-timed concept arose from asynchronous interfaces used at the board
level. The term "self-timed" indicates that the circuit does not use a global clock;
instead it functions through data dependencies between modules. A module starts the
next cycle of its operation once the current task is completed. In a self-timed system,
the timing control is left to the modules themselves. Each module communicates with
other modules via an asynchronous communication circuit known as the Handshaking
Control Circuit (HCC).
Self-timed circuits eliminate the global timing problems caused by the delays of logic
elements and interconnects, placing more emphasis on individual element design. This is the
central idea of self-timed systems: each system "part" controls its time of operation.
1.2 Self-Timed vs. Synchronous Circuits
The main reason for the use of self-timed circuits is that the synchronous
approach becomes less suitable for present integrated circuit (IC) technology and will
become even less appropriate with future technology. The total chip area for most
designs has remained constant or in some cases has increased as the feature size has
been reduced. This has allowed the number of active on chip devices (transistors) to
increase dramatically. Therefore although the gate capacitance of the individual
transistors has been reduced with the feature size, the increase in the number of
devices being driven by the system clock has increased the total capacitance load on
the clock driver.
Synchronous technology is hindering IC miniaturization by increasing the
driving capabilities needed for a clock driver circuit. To illustrate this point, consider a
global clock which will drive a wire with a number of transistors (N) attached to it. In
this example some assumptions will be made. It will be assumed that the wire is made
of one type of metal and that each transistor along the wire has the same dimensions.
A diagram of the circuit and the variables being used are shown in figure 1.1. To find
the total delay of the clock signal based on the example circuit we have to find the
capacitance and the resistance of the wire and the transistors. Equations 1, 2, and 3
show the equations needed to find the capacitance and the resistance of each element
of the circuit. The approximate total delay can be determined using the resistance
times the capacitance which is also known as the RC delay of the circuit. The
equation for the total delay of the circuit is found by substituting equations 1, 2, and 3
into the delay equation, which is shown in Eq. 4. To analyze the effect of smaller
feature sizes upon the delay equation, the effects of each variable will be examined.
The sheet resistance (Rsheet) of the wire will remain constant due to the nature of the
metal. The same applies to the permittivity constants εline and εox. The area of the
IC will stay constant, which means the interconnect wires will remain about the
same length (l). The length (L) and width (W) of the transistors will decrease in size.
The spacing between the metal and the poly (d) will remain the same due to the
fabrication technology. The thin oxide (Xox) of the gate will decrease with the feature
size. By examining the total contributions of each variable, the overall delay will be
reduced slightly with the decreasing feature size. But the new effects imposed on the
delay of the clock signal will increase dramatically. The total gate driving capacitance
with the new feature size will be much larger than with the smaller feature sizes





































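As a rough numerical illustration of the clock-loading argument in this section, the sketch below evaluates the RC product for a clock wire that drives N identical gates. It assumes the usual first-order models (wire resistance as sheet resistance times length over width, and parallel-plate capacitance for both the wire and the gate). The wire width `w_wire`, the function name, and every numeric value are hypothetical and are not taken from the thesis.

```python
EPS0 = 8.854e-12  # permittivity of free space, F/m


def clock_rc_delay(r_sheet, l_wire, w_wire, d, eps_line,
                   n_gates, w_gate, l_gate, x_ox, eps_ox):
    """First-order RC delay of a clock wire loaded by n_gates identical gates."""
    r_wire = r_sheet * l_wire / w_wire         # wire resistance (ohms)
    c_wire = eps_line * l_wire * w_wire / d    # wire-to-substrate capacitance (F)
    c_gate = eps_ox * w_gate * l_gate / x_ox   # capacitance of one gate (F)
    return r_wire * (c_wire + n_gates * c_gate)


# Hypothetical wire: same length, width and spacing in both technologies.
wire = dict(r_sheet=0.05, l_wire=5e-3, w_wire=2e-6, d=1e-6, eps_line=3.9 * EPS0)

# Older process: larger transistors, fewer of them on the clock line.
old = clock_rc_delay(n_gates=1000, w_gate=4e-6, l_gate=2e-6,
                     x_ox=40e-9, eps_ox=3.9 * EPS0, **wire)

# Scaled process: W, L and Xox halved, but four times as many devices
# fit in the same chip area and hang on the same clock wire.
new = clock_rc_delay(n_gates=4000, w_gate=2e-6, l_gate=1e-6,
                     x_ox=20e-9, eps_ox=3.9 * EPS0, **wire)

print(f"old delay: {old:.3e} s, scaled delay: {new:.3e} s")
```

Under these assumed numbers each gate presents half the capacitance after scaling, but the fourfold increase in device count doubles the total gate load while the wire term stays fixed, which is the trend the text describes.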
\tau = R\,(C_{line} + N\,C_{gate}) \qquad (4)
Another problem that affects the use of future IC technology is
clock skew. Clock skew complicates the problem of distributing the clock signal to
all parts of the circuit at the same time. If IC feature size keeps getting smaller and the
total design area gets larger, driving the increased capacitance of the
circuit, caused by clock signal distribution, will become close to impossible. Designers
today are trying to overcome the problem of clock skew by introducing specific delays
into their circuits to compensate for the non-synchronous behavior. Two methods of
accomplishing this are to position each module on the chip in a certain place to
create a delay, or to introduce delay buffers into the circuit. This cannot continue for long
if the industry desires to increase clock frequency, because circuits would have to be
redesigned to compensate for the delays that changed from the original circuit.
Until now, the asynchronous approach has not been popular because of circuit
overhead, design difficulty, and poor performance because of the added delay of
managing data flow through the system. With the advance of technology these
obstacles may be overcome. The extra area needed for handshaking controllers is less
of a penalty now because of smaller feature sizes of IC chips. Also, with faster IC
technology the overhead due to these handshaking controllers becomes nominal.
Meanwhile, the synchronous approach is becoming increasingly inefficient as the
maximum feasible system clock speed is determined by gate delays. Thus, the
self-timed approach will become more attractive to IC designers when
circuits can operate beyond the maximum feasible global system clock speed. The
optimal performance of self-timed circuits is determined by the speed at which data can
be transferred between modules and the speed with which the HCCs can accomplish
their tasks.
2.0 Theory of the Bi-Way Sorter Algorithm and the Handshake
Controller Circuit
2.1 The Bi-Way systolic sorting algorithm
Sorting is one of the most frequent tasks that a computer performs. In order
for humans to understand and coordinate large amounts of data, there must be some
order to the data. Applications which require sorting include such diverse tasks as
database management, graphics, and weather forecasting. Sometimes a general
purpose processor would not suffice for some applications, so a dedicated processor is
needed. The bi-way sorter is one such processor.
The bi-way sorter processor was first proposed in a paper entitled "Bi-way
Sorter: a Two-Dimensional Systolic Array" [1]. The objective was to modify an
existing sorting processor and increase its speed while decreasing its area. The bi-way
sorter is a synchronous systolic array composed of modules called "bi-way cells."
Each bi-way cell is made up of one-way cells by multiplexing between the bottom set
of signals (z1 and w2) and the top set of signals (z2 and w1). The bi-way cells are
interconnected in an array structure. By exchanging commands, data flow is directed
through the array.
In Figure 2.1 there is a diagram of the bi-way cell showing the connections of
the inputs and outputs to the cell. Each cell has to make a decision depending on the
value of the bit stored in the cell and the new bit coming into the cell. Two control
signals r and s determine the comparison decision: no decision, store the incoming
number, or bypass the incoming number. There is also a b control line that controls
which inputs are directed into the cell. The b input also controls the flow of data in














Figure 2.1 Diagram of a bi-way cell
The one-way cell takes data inputs x, y and control inputs r, s, and
outputs signals u, v, r', s'. The bi-way cell uses the b input to
multiplex the w1, z2 or w2, z1 signals into the x and y inputs of
the one-way cell. The w1 and w2 signals are the u and v signals
of the one-way cell fed back into the bi-way cell.
The one-way cell takes four inputs r, s, x, y and produces outputs r', s', u, v. To
direct either the top or the bottom signals of the bi-way cell into the one-way cell's
inputs x, y, the b input is used based on the values in Table 2.1.
The heart of the bi-way sorter is the method of choosing which inputs are
presented to the one-way cell. If the b input is low, then x takes the
input from the top, z1, and y takes the signal w2 that is output to the bottom. If the b input
is high, then x takes the signal w1 that is output to the top and y takes the input from
the bottom, z2. Once the specific x and y values are chosen the one-way cell
steers the one of the two bits corresponding to the smaller number to the u output and




lines indicate the result of the
comparison: (r',s')=(0,0) for no decision, (r',s')=(1,0) to switch x and y (or store the
incoming bit) and (r',s')=(0,1) to not switch x and y (or let the incoming bit bypass the
cell).
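The decision codes above can be sketched as a small behavioural model. This is only an illustration under stated assumptions: the truth table of the one-way cell (Table 3.1) is not reproduced in this section, so the rule used while the cell is still undecided (switch when x is greater than y, so that the bit of the smaller number reaches u) and the treatment of the unused code (1,1) are assumptions, and the function names are hypothetical.

```python
def one_way_cell(r, s, x, y):
    """Return (r_out, s_out, u, v) for one bit position.

    (r, s) = (0, 0): no decision yet, compare x and y here.
    (r, s) = (1, 0): switch x and y.
    (r, s) = (0, 1): do not switch.
    """
    if (r, s) == (0, 0):
        if x > y:
            r, s = 1, 0          # assumed: y belongs to the smaller number
        elif x < y:
            r, s = 0, 1
    if (r, s) == (1, 0):
        u, v = y, x              # switched
    else:
        u, v = x, y              # passed straight through
    return r, s, u, v


def steer_numbers(x_bits, y_bits):
    """Chain cells MSB first; u collects the smaller number, v the larger."""
    r, s = 0, 0
    u_bits, v_bits = [], []
    for x, y in zip(x_bits, y_bits):
        r, s, u, v = one_way_cell(r, s, x, y)
        u_bits.append(u)
        v_bits.append(v)
    return u_bits, v_bits


print(steer_numbers([1, 0], [0, 1]))
```

Because the decision made at the most significant differing bit is passed along through r' and s', the less significant bits follow the same routing without being compared again.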
By only using one-way cells, numbers can only be sorted in one direction
through the array. This means that the number of cells required is equal to the product
of the size of the array to be sorted and the bit-size of the array elements. The bi-way
cell algorithm was derived to cut the number of required cells in half while still sorting
a data set of the same size. This is accomplished by letting the numbers flow into the
array then back out of the array in the opposite direction, hence doubling the number
of comparisons which can be handled by the smaller array. Another feature of the
bi-way sorter is that it can sort numbers in two directions. This is accomplished by
letting data flow in from either the bottom or the top, depending on the order required. If
the data flows in from the top the data will be sorted from largest to smallest. If the
data flows in from the bottom the data will be sorted from smallest to largest. By
sorting in two directions the sorter is very versatile and easily adaptable to many tasks.
Figure 2.2 shows a diagram of a 3x2 array.
b    data flow    stored num    x     y
0    up           largest       z1    w2
1    down         smallest      w1    z2
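The selection in the table can be written as a two-way multiplexer. The sketch below is a minimal illustration; the function name is hypothetical, and the mapping follows the table and the mux1 process of Figure 3.2.

```python
def select_inputs(b, z1, z2, w1, w2):
    """Return the (x, y) pair presented to the one-way cell for control bit b."""
    if b == 0:
        return z1, w2   # b low: x = z1, y = w2
    return w1, z2       # b high: x = w1, y = z2


# With b low the cell sees (z1, w2); with b high it sees (w1, z2).
print(select_inputs(0, z1=1, z2=0, w1=0, w2=1))
print(select_inputs(1, z1=1, z2=0, w1=0, w2=1))
```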


























Figure 2.2 3x2 array of bi-way cells
The array can be increased in size by adding modules as shown by
dotted lines. Each module is a bi-way cell as shown in Figure 2.1.
At every clock pulse the data flows into each cell and is compared with the bit
already stored in that cell. The cell makes a decision based on the bit coming in and the
stored bit. One of three decisions can be made: allow the bit to pass through the array,
hold the bit and send the previously stored bit to the next cell, or no decision. The bit
is stored in the cell by using feedback on the u and v outputs. By repeating this
process each bit is compared with all the other bits in its column.
The first column of the array is composed of control cells. These cells do not
have data pass through them; rather, they send out the control signals r', s', b' that control
the sorting process of the rest of the array. Data flows from top to bottom or from
bottom to top depending on the b input, and control signals flow from left to right.
The data is sent into the array in a skewed-bit-parallel data format, in which the
MSB enters the cells first, followed by the next most significant bit one clock cycle
later, and so on until the LSB is reached. The skewed
data format allows each bit of a number to arrive at the bit-comparison cell in its
column at the same time as the comparison decision from the next most significant cell
in that row. Below is an example of a set of two-bit data values and the conversion to
a skewed-bit-parallel format.






Example of a conversion to skewed-bit-parallel format
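The skewing rule described above (each less significant bit column lags the one before it by one clock cycle) can be sketched as follows. The function name, the use of `None` for cycles in which a column carries no data, and the sample values are illustrative assumptions rather than the thesis's own example.

```python
def to_skewed(values, n_bits):
    """Return one tuple per clock cycle; element k is the bit fed to column k.

    Column 0 carries the MSB. Column k carries bit k of the value that
    entered column 0 k cycles earlier, or None when no data is present.
    """
    n = len(values)
    frames = []
    for t in range(n + n_bits - 1):
        frame = []
        for k in range(n_bits):
            i = t - k                      # index of the value seen by column k
            if 0 <= i < n:
                frame.append((values[i] >> (n_bits - 1 - k)) & 1)
            else:
                frame.append(None)
        frames.append(tuple(frame))
    return frames


for cycle, frame in enumerate(to_skewed([2, 1, 3, 0], n_bits=2)):
    print(cycle, frame)
```

For the four 2-bit values 2, 1, 3, 0 this yields five cycles: the MSB column is busy during cycles 0 to 3 and the LSB column during cycles 1 to 4.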
The number of cells needed in the array is based on two variables:
The number of data values to be sorted.
The number of bits needed to represent each data value.
The array is set up in the following fashion. The left-hand column is made up of the
control modules. One additional column is needed for each bit of the data values.
The number of rows needed is equal to the total number of data
values divided by two. With these rules, expansion is easily done. For example,
to set up an eight-bit, twenty-data-value array you would need nine columns and ten
rows. In Appendix A an example of sorting four 2-bit numbers can be found.
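The sizing rule can be checked with a few lines of arithmetic. In this sketch the function name is hypothetical, and rounding up for an odd number of data values is an assumption, since the text only covers even counts.

```python
import math


def array_dimensions(n_values, n_bits):
    """Return (rows, columns) of bi-way cells, including the control column."""
    rows = math.ceil(n_values / 2)   # two data values share each row
    columns = n_bits + 1             # one control column plus one per bit
    return rows, columns


print(array_dimensions(20, 8))   # the eight-bit, twenty-value example
print(array_dimensions(4, 2))    # the four 2-bit number sorter
```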
2.2 The Self-Timed Circuit
The self-timed circuit is composed of two parts:
1. The logic operator, which performs a specific task.
2. The Handshaking Controller, which controls the data transfer between
logic operators.
2.2.1 The Self-Timed Logic Operator
The logic operator is a module that evaluates a particular function. The
difference between a synchronous and a self-timed logic operator is the production of
a ready signal. Synchronous circuits all run on a global clock. The global clock is set
up with a period equal to the delay of the slowest module in the system. Operators have a rigidly
defined number of periods necessary to evaluate their function. With self-timed
operators no global clock is used. This means that there has to be a different
mechanism set up to accommodate the operator's evaluation delay. A mechanism
must be devised which indicates to the controller when the evaluation has been
completed. This mechanism is called the ready signal.
There are many ways of generating the ready signal. The following two
methods are examined to demonstrate two popular extremes. One way is to artificially
mimic the delay of the logic operator. This is done by inserting a delay element within
the logic module that will send out a ready signal based on the maximum time needed
to evaluate the function. By creating the ready signal this way, we are reverting to the
synchronous methodology which slows down the operation of the system based on a
maximum delay time. This approach is unacceptable and cannot be considered in a
self-timed design because it does not allow the logic module to operate at its own
speed. Once the operator's task is complete it should indicate this to the controller.
Another method, which is more acceptable, generates a natural ready signal.
This signal is created when the logic operator has completed its task and not after an
artificial delay. One convenient way to produce the natural ready signal is to
implement the logic operator with a dynamic differential logic family. The most
popular circuit is the DCVSL (Differential Cascode Voltage Switch Logic). Figure
2.3 shows a diagram of a DCVSL logic operator.
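The cost of the delay-element approach relative to a natural ready signal can be seen with a short calculation. The evaluation delays below are hypothetical numbers chosen only for illustration, and the function names are not from the thesis.

```python
def total_time_matched_delay(delays):
    """Every evaluation waits for the worst-case delay element."""
    return max(delays) * len(delays)


def total_time_completion_signal(delays):
    """Every evaluation ends as soon as its own ready signal rises."""
    return sum(delays)


# Hypothetical data-dependent evaluation delays of one operator, in ns.
eval_delays_ns = [3.0, 7.5, 4.2, 9.0, 5.1]

print(total_time_matched_delay(eval_delays_ns))
print(total_time_completion_signal(eval_delays_ns))
```

With a fixed delay element the operator is always charged its worst case, so the first total can never be smaller than the second.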
There are two important properties of domino DCVSL that lend this type of
circuit to the implementation of the self-timed logic operator.
One property is that the circuit is free of delay hazards[2]. A circuit is said to
have delay hazards if the output depends on the relative delay of the signal path
through the logic. This is detrimental because a false output might register as a
completion that will set off the ready signal prematurely. DCVSL logic is free of delay
hazards because the design is organized to prevent feedback from other logic modules
being used to produce the final output answer.
Another property that is important to the stability of the ready signal is the case
of race states. The logic module has to be designed in such a way as to not let race
states occur. This may be accomplished through the use of logic circuits in a
precharge-evaluate scheme. In the precharge state, the outputs of the logic structure
are switched to their high-voltage states. When the circuit goes into the evaluation
phase the outputs are then free to assume their logical value determined by two nmos
pull-down trees. If the nmos trees are duals of one another, the output of one tree will
be high and the other low during the evaluation phase producing the output signal and
its complement. By using this method a transition can only occur from "11" (i.e.
charging phase) to either "10" or "01". There can never be a condition where both the











Figure 2.3 DCVSL logic with ready signal
The ready signal is generated by logically ORing the output
and its complement together.
Since only one output will change, there cannot be a race state in the system.
Because of these special properties, a reliable ready signal can easily be derived
from the output terminals of the domino DCVSL. This is done by creating a ready
signal generator circuit which checks whether the output function and its inverse have
different values. If they do, the answer must be valid and the ready signal goes
high. While in the precharge phase both outputs are "1"

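The behaviour of the ready signal described above can be sketched at the logic level. This is a behavioural illustration only, using the output encoding given in the text (both outputs high during precharge, then "10" or "01" after evaluation); the function names are hypothetical and no transistor-level detail is modelled.

```python
def precharge():
    """Precharge phase: both outputs are driven high."""
    return 1, 1


def evaluate(f_value):
    """Evaluation phase: one pull-down tree conducts, giving (f, not f)."""
    return (1, 0) if f_value else (0, 1)


def ready(out, out_bar):
    """Ready is high only when the output and its complement differ."""
    return int(out != out_bar)


print(ready(*precharge()))       # during precharge
print(ready(*evaluate(True)))    # after evaluating a logic 1
print(ready(*evaluate(False)))   # after evaluating a logic 0
```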



The unpopularity of the asynchronous approach in the past was caused
primarily by the extra hardware circuitry required for the generation of the ready
signal. At that time domino DCVSL logic did not exist. The design of hazard-free
asynchronous circuits involved adding redundant states into the logic that tended to
make the overall circuit much larger. Also, creating the inputs needed for the logic
was more difficult to implement than a global clock circuit. Logic was slow relative to
an easily obtainable clock frequency and the circuit overhead was substantial in size
and performance. Now the technology exists to make self-timed circuits a more viable
solution.
2.2.2 The Handshake Control Circuit (HCC)
The Handshaking controller is the heart of any self-timed system. Its job is to
control the logic operator's evaluation of the incoming data, and also to synchronize
the passing of data from one module to another.
The HCC must have the following properties if it is to perform its job as fast
and as flawlessly as possible. One property which is associated with the HCC is its
immunity to delays from either the logic circuit or the interconnections between
modules. Since in a self-timed circuit every module is evaluating at different times,
any one module has no knowledge of when it must begin its evaluation.
The HCC must be small and execute its task with great speed. The time for
the HCC to execute the transfer of data should be less than the time it takes for the
logic operator to evaluate its function. If this were not so, then the HCC would add a
delay to the operation of the circuit and the asynchronous self-timed approach would
be counterproductive. For this reason great care must be taken in the design of the
HCC to maximize speed.
The HCC uses the ready signal information to control the transfer of data
between the logic operators. A 4-cycle handshaking protocol is suitable when
employing domino DCVSL. In 4-cycle handshaking, for each request signal sent out
by the HCC there is an acknowledge signal which follows it. Data is valid only while
the ready signal is active. The handshake signals are usually called REQuest and
ACKnowledge. A ready signal generated by a self-timed logic operator indicates to
the HCC that the operator is finished doing its computations. This causes the HCC to
signal the next operator to perform a data transfer. When the transfer is done, the
controller which sent the REQ signal will receive an ACK signal from the next
controller. Once received the controller will start the computation cycle over again.
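The order of events in a 4-cycle handshake can be sketched as a sequential trace. This follows the generic four-phase (return-to-zero) convention and is not a model of the thesis's HCC circuit; the `Link` class and function name are hypothetical.

```python
class Link:
    """REQ/ACK wire pair plus a data bus between two neighbouring controllers."""

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.data = None


def four_phase_transfer(link, value):
    """Move one value across the link; return (latched value, event trace)."""
    trace = []
    # Phase 1: the sender's operator is ready, so data is valid and REQ rises.
    link.data = value
    link.req = 1
    trace.append(("REQ", 1))
    # Phase 2: the receiver latches the data and raises ACK.
    latched = link.data
    link.ack = 1
    trace.append(("ACK", 1))
    # Phase 3: the sender sees ACK and withdraws REQ.
    link.req = 0
    trace.append(("REQ", 0))
    # Phase 4: the receiver sees REQ low and withdraws ACK; the link is idle.
    link.ack = 0
    trace.append(("ACK", 0))
    return latched, trace


link = Link()
latched, trace = four_phase_transfer(link, 0b10)
print(latched, trace)
```

Both wires return to zero at the end of every transfer, which is why four signal transitions are needed per data item.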
In Figure 2.4a there is a diagram in the form of a Petri net and in Figure 2.4b
there is a logical model diagram describing the workings of the HCC. In order for
data to be valid from the previous module three signals must be set. First, the previous
module has to set its REQuest out signal to indicate to the present module that it is
finished with its computation (P1). Second, the present module has to be done with its
previous computations (P3), and finally the next module must have accepted the newly
computed data from the present module (P5). Once the above occurs, the new data is
latched (T1), an ACKnowledge signal is sent to the previous module (P2), and a
new computation can begin (T2). When the present module is done with its
computation, a REQuest out signal is sent to the next HCC (P4). The cycle will then




































Figure 2.4b Logic diagram of the interconnection between logic modules
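The firing rule of the Petri net described for Figure 2.4a can be sketched with sets of marked places. Only the places and transitions named in the text (P1 to P5, T1, T2) come from the thesis; the intermediate place `LATCHED`, the choice to let T2 stand for the whole computation, and the function names are assumptions, and the arcs that re-mark P1, P3 and P5 are left out.

```python
# Each transition maps to (input places, output places).
TRANSITIONS = {
    "T1": ({"P1", "P3", "P5"}, {"P2", "LATCHED"}),  # latch data, ACK previous
    "T2": ({"LATCHED"}, {"P4"}),                    # compute, then REQ next
}


def enabled(marking, name):
    """A transition is enabled when all of its input places are marked."""
    inputs, _ = TRANSITIONS[name]
    return inputs <= marking


def fire(marking, name):
    """Consume the input tokens and produce the output tokens."""
    if not enabled(marking, name):
        raise ValueError(f"{name} is not enabled")
    inputs, outputs = TRANSITIONS[name]
    return (marking - inputs) | outputs


marking = {"P1", "P3", "P5"}      # request in, module idle, next module free
after_t1 = fire(marking, "T1")
after_t2 = fire(after_t1, "T2")
print(sorted(after_t1), sorted(after_t2))
```

T1 cannot fire until all three conditions hold, which is how the net expresses that new data is latched only when the previous, present and next modules are all in the right state.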
3.0 Implementation of the Bi-Way Sorter Cell
3.1 VHDL Overview
The one-way cell was first implemented using the VHDL (Very High speed
integrated circuit Hardware Description Language) programming language, which is
an excellent candidate for the creation and simulation of hardware circuit models. One
reason for VHDL's popularity is that the language is able to describe the hardware
model at different design levels. There is the functional behavioral model which
simulates the hardware in a fashion much like the C programming language. Timing
factors are not introduced into this model and the program describes the behavior of
the circuit and not the actual hardware implementation. The behavioral model is a
good starting point in designing a circuit because emphasis is on the proper operation
of the system without regard to the actual implementation.
The next level of complexity is the dataflow model, which simulates the actual
circuit modules and the data that flows between them. The model is usually broken up
into processes. Each process describes a module within the circuit. In VHDL,
processes run concurrently thus simulating hardware parallelism. Processes can
communicate with each other using objects called signals. Signals can be described as
the wires between modules. Data flow models ordinarily don't involve timing concerns
and are mostly used to verify that modules work together.
Finally, there is the structural model, which strictly simulates the hardware
circuit. The structural model incorporates full timing characteristics of the circuit
elements and modules as well as a precise description of the interconnects between
modules.
VHDL aids hardware designers by permitting them to design and simulate a
circuit without actually building it. In addition, technological advances may allow
VHDL models to be converted directly to hardware implementations using synthesis
design automation tools.
3.2 The Sequential Bi-Way Sorter
The sequential one-way cell was implemented using steering logic. Steering
logic is a methodology which controls the flow of the input bits through a cell to their
designated outputs using certain control signals. The flow of the bits is controlled
using multiplexors and a control module signal which selects the output of a
multiplexor.
The architecture of the bi-way sorter cell is shown in Figure 3.1. The cell
contains seven inputs and five outputs which were described in chapter 2. The
execution of the one-way cell is as follows: First the data inputs are steered to x and y




outputs, depending on the output of the r and s control module. Finally, x and





Figure 3.1 Circuit of the sequential bi-way cell
The cell is implemented using steering logic. The
multiplexors steer the inputs through the cell to their
respective outputs.
3.2.1 The VHDL Model of the Sequential Bi-way Sorter Cell
The sorter cell was first implemented using the VHDL programming language.
The model description used is a combination of the data flow and the structural model
types. By using a combination of the two models, components can be described using
the data flow style while also incorporating some aspects from the structural model.
By adding specific timing values to the data flow model the circuit can more accurately
simulate the hardware while keeping the code less complex than the structural model
type.
The code was broken up into processes which describe the operation of the
multiplexors and the control module. The processes are interconnected using signal
declarations. The code for the sequential one-way sorter can be found in Figure 3.2.
The code was written and compiled using VHDL version 8.1 from Mentor Graphics
Corporation. At the beginning of the source code, library statements were added to
load extra command logic states, and data types which are specific to the Mentor
software.
The entity declaration follows the loading of the libraries. The entity
declaration specifies the name of the circuit and defines the input and output
ports of the circuit. The signals declared in the port statement are the same signals
which were discussed in chapter 2. The data type QSIM_STATE is a special data
type used by Mentor's simulators, and has four possible values: the Z state (high
impedance), the X state (indeterminate value), and 1 and 0 (logic high and logic low).
Each port signal was given an initial value of either 1 or 0. If the ports were not
initialized, the simulator would default their values to the X state, causing the logic
equations to evaluate incorrectly.
-- Mitchell Diamond Thesis Project 6/10/93
-- This is a structural model of the Sequential bi-way sorter cell




USE mgc_portable.qsim_relations.ALL;
USE ieee.std_logic_1164.ALL;
ENTITY str_beh_sorter_new IS
PORT ( Clk: IN QSIM_STATE
b: IN QSIM_STATE
r, s: IN QSIM_STATE
z1, z2: IN QSIM_STATE
bp: OUT QSIM_STATE
rp, sp: OUT QSIM_STATE
w1, w2: IN QSIM_STATE
U, V: OUT QSIM_STATE
END str_beh_sorter_new;









SIGNAL x: QSIM_STATE := '0';
SIGNAL y: QSIM_STATE := '0';
SIGNAL sptmp: QSIM_STATE := '0';
SIGNAL ctl: QSIM_STATE := '0';
BEGIN
bp <= b;





IF b = '0' THEN
x<=z1 AFTER 0.69 ns;
y<=w2 AFTER 0.69 ns;
ELSE
x<=w1 AFTER 0.69 ns;
y<=z2 AFTER 0.69 ns;
END IF;
END IF;
END PROCESS mux1;








rp<=r AFTER 0.69 ns;
sp<=s AFTER 0.69 ns ;
sptmp<=s AFTER 0.69 ns;
rp<=x AFTER 0.69 ns;
sp<=y AFTER 0.69 ns;












sel sig mux3 .
sel sig for mux2 .





choose 1 sel . ' s
send r to rp out.
send s to sp out.
make Ctrl sig.
send x to rp out.









END PROCESS mux2 ;










u<=y AFTER 0.69 ns;
v<=x AFTER 0.69 ns;
ELSE
u<=x AFTER 0.69 ns;
v<=y AFTER 0.69 ns;
END IF;
END IF;
END PROCESS mux3 ;











Figure 3.2 VHDL Code for Sequential bi-way sorter
The architecture section is where the specific model is formed. An entity can
have multiple architectures describing the circuit, each having its own model style. In
the architecture body several signals are declared. The x and y signals represent the
internal signals of the one-way cell, and the ctl signal is the output control line from the
control process. The sptmp signal is used to control the selection of the outputs in
the mux3 process and is derived from the s' output of the mux2 process.
The multiplexor processes all have the same general implementation. An IF-THEN
statement is used to control which set of input signals is directed to the
output signals. In the process declaration, signals which are used within the process
are listed in parentheses after the PROCESS statement. This list is called the
sensitivity list. The simulator will evaluate the process whenever any signal in this
list changes in value. The clk signal appears in each sensitivity list in order to
allow each process to be evaluated during each clock cycle. The control process uses
a logic equation which produces the correct ctl signal based on the input signals
x, y, r, and s.
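The steering behavior described above can be sketched in software. The following
Python model is only an illustration under stated assumptions, not the thesis code:
the mux1 selection follows the listing in Figure 3.2, while the select conditions of
the control process and of mux3 (marked ASSUMED) are not recoverable from the
listing and were chosen so that the outputs agree with the truth table of the
one-way cell (Table 3.1). Timing is ignored.

```python
# Behavioral sketch of the steering-logic cell (timing ignored).
# Conditions marked ASSUMED are illustrative choices, not taken from Figure 3.2.

def mux1(b, z1, z2, w1, w2):
    """Input steering as in the mux1 process: b = 0 takes (z1, w2), else (w1, z2)."""
    return (z1, w2) if b == 0 else (w1, z2)

def control(r, s, x, y):
    """ASSUMED control rule: steer x, y onto r', s' only when r = s = 0 and x != y."""
    return int(r == 0 and s == 0 and x != y)

def mux2(ctl, r, s, x, y):
    """Pass r, s through unchanged, or replace them with x, y when ctl is set."""
    return (x, y) if ctl else (r, s)

def mux3(sp, x, y):
    """ASSUMED select: s' = 1 passes (x, y) straight through, s' = 0 swaps them."""
    return (x, y) if sp else (y, x)

def one_way_cell(r, s, x, y):
    """Return (r', s', u, v) for one evaluation of the one-way cell."""
    rp, sp = mux2(control(r, s, x, y), r, s, x, y)
    u, v = mux3(sp, x, y)
    return rp, sp, u, v

# Example: r = s = 0 with x > y sets r' and places the smaller bit on u.
print(one_way_cell(0, 0, 1, 0))  # (1, 0, 0, 1)
```

With these assumed select rules the model reproduces all twelve rows of Table 3.1,
which is the only sense in which it is claimed to match the cell.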
Within the code, AFTER clauses are used in every signal assignment. The
AFTER clauses notify the simulator to assign the computed value of the signal to the
output after a fixed time. The timing for each clause was determined from the
simulation of the transistor implementation of the cell and will be discussed in the next
section.
The VHDL program was tested using the Quicksim logic simulator software
version 8.1 from Mentor Graphics Corporation. A trace of the simulation of the
bi-way cell VHDL code is shown in Figure 3.3. To test the cell, all combinations of inputs




Figure 3.3 Simulation of the sequential VHDL code for the
bi-way sorter
The simulation shows a test pattern consisting of all













3.2.2 The Transistor Model of the Sequential Bi-way Sorter Cell
The sequential sorter cell was modeled using CMOS logic circuits. The sorter
cell described in [Orton,Peppard,Aki,1992] was implemented using three-micron
technology. In this paper the sorter cell was implemented using a two-micron
technology in order to make the comparison more consistent with the self-timed
implementation of the same sorter cell which was also designed using a two-micron
technology.
Figure 3.4 is a schematic diagram of the sorter cell circuit. The circuit consists
of multiplexors, latches, and a control module; schematics and timing results of the
multiplexors and the latches can be found in Appendix B. A single shift-register was
used as a clocked latch to store the output results after each clock cycle. The control
module circuit is shown in Figure 3.5. Equation 1 describes the output signal of the
control module.
ctl = (rVXy'+x') (1).
A simulation of the control circuit is shown in Figure 3.6. The control circuit was
tested with all possible values of the four signal inputs (x, y, r, s). The ctl output
waveforms and the timing results are also shown in the simulation. The delay of the
output signal was determined by averaging the time taken for the output to settle over
all of the input combinations. The average delay time was calculated to be 0.42 nanoseconds.
The entire cell was then simulated and tested for accuracy in evaluating all
combinations of the inputs. The simulation results can be found in Figure 3.7. The
top set of waveforms are the inputs to the cell and the bottom set of waveforms are the
output results of the cell. The delay of the cell outputs was measured to be 2.1
nanoseconds.
Figure 3.4 Schematic of the synchronous sorter module
The sorter cell consists of nine inputs and five outputs. The
transistor sizing for each module was computed depending on
the loads each module had to drive. The sizing characteristics



























Figure 3.5 Schematic for the control module within the bi-way
cell
The control gate produces a signal ctl which is used to produce
the r and the s control signals by selecting outputs on the




























Figure 3.6a and Figure 3.6b Simulation and timing results for
the control gate.
Figure 3.6a uses a test pattern of all combinations of r, s, x, y
signal values to produce the ctl signal which can be seen in the
simulation. In Figure 3.6b a close-up of the ctl and the clk
signals is shown, which was used to determine the delay of



































































Figure 3.7 Simulation of the synchronous sorter cell
The synchronous sorter cell was tested using all combinations
of input values. The top simulation shows the input waveforms




3.3 Implementation of the Self-timed Sorter Cell
The procedure for transforming a synchronous circuit to a self-timed circuit is
not straightforward. There are many different design aspects which must be analyzed
before a specific design methodology can be adopted. After studying different ways to
implement the sorter cell, one method was chosen.
The design criteria which led to the choice of a specific technique were very
important to the design of the sorter cell, which had to have both speed and capability
to be a viable choice for a self-timed logic module. One desired feature was the
implementation of a natural ready signal (refer to chapter 2). The sequential sorter cell
used multiplexors to implement steering logic. This is an appropriate choice for a
globally clocked circuit but is not advantageous for a self-timed circuit. The reason
for this is that the outputs and the complements of the outputs must be derived from
separate circuits. If only one circuit were used and its output were sent through an
inverter to produce the complement, then the logic tree in the precharge stage would
never be high. Different logic families were considered and one was finally chosen.
The cell was implemented using domino DCVSL logic. The advantages of using
DCVSL were described in chapter 2.
DCVSL logic creates its logic functions using nmos pull-down trees. To
derive the equations needed, the Quine-McCluskey minimization method was used to
minimize the logic equations (see Appendix C). The truth table describing the
operation of the one-way cell is shown in Table 3.1.
 r  s  x  y | rp sp  u  v
 0  0  0  0 |  0  0  0  0
 0  0  0  1 |  0  1  0  1
 0  0  1  0 |  1  0  0  1
 0  0  1  1 |  0  0  1  1
 0  1  0  0 |  0  1  0  0
 0  1  0  1 |  0  1  0  1
 0  1  1  0 |  0  1  1  0
 0  1  1  1 |  0  1  1  1
 1  0  0  0 |  1  0  0  0
 1  0  0  1 |  1  0  1  0
 1  0  1  0 |  1  0  0  1
 1  0  1  1 |  1  0  1  1
Table 3.1. Truth table describing operation of one way cell
Equations for both the results and their complement were needed to implement the
DCVSL logic module. Below are the derived equations.
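As a cross-check, the sum-of-products forms that appear in the VHDL listing of
Figure 3.8 can be compared against Table 3.1. The Python sketch below is an
illustration only. The equations for rp, sp, and u are transcribed from that listing;
the third product term of v (x AND y) does not survive in the listing and is an
ASSUMPTION, chosen because it makes v agree with every row of the table. The
complement equations are not reproduced here.

```python
# Table 3.1 as a mapping (r, s, x, y) -> (rp, sp, u, v).
TABLE_3_1 = {
    (0, 0, 0, 0): (0, 0, 0, 0), (0, 0, 0, 1): (0, 1, 0, 1),
    (0, 0, 1, 0): (1, 0, 0, 1), (0, 0, 1, 1): (0, 0, 1, 1),
    (0, 1, 0, 0): (0, 1, 0, 0), (0, 1, 0, 1): (0, 1, 0, 1),
    (0, 1, 1, 0): (0, 1, 1, 0), (0, 1, 1, 1): (0, 1, 1, 1),
    (1, 0, 0, 0): (1, 0, 0, 0), (1, 0, 0, 1): (1, 0, 1, 0),
    (1, 0, 1, 0): (1, 0, 0, 1), (1, 0, 1, 1): (1, 0, 1, 1),
}

def inv(bit):
    """Logical complement of a 0/1 value."""
    return 1 - bit

def sop_outputs(r, s, x, y):
    """Evaluate the sum-of-products equations for (rp, sp, u, v)."""
    rp = (inv(s) & x & inv(y)) | r
    sp = (inv(r) & inv(x) & y) | s
    u = (x & y) | (s & x) | (r & y)
    # Last term (x & y) is ASSUMED; see the note above.
    v = (inv(r) & inv(x) & y) | (inv(s) & x & inv(y)) | (x & y)
    return rp, sp, u, v

mismatches = [row for row, want in TABLE_3_1.items() if sop_outputs(*row) != want]
print(mismatches)  # []
```

The four rows with r = s = 1 are absent from the table and are therefore not
checked.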

















3.3.1 The VHDL Model of the Self-timed Sorter Cell
The VHDL model of the self-timed bi-way cell was designed using VHDL
version 8.1 from Mentor Graphics Corporation. A listing of the data
flow model of the sorter cell can be found in Figure 3.8.
To fully implement dynamic logic using VHDL, some specifics about the
precharge-evaluation method had to be addressed. DCVSL logic uses a
precharge-evaluation method which more closely resembles analog behavior than
digital behavior; each nmos tree evaluates its outputs by sinking the precharge
capacitance voltage to ground.
-- Mitchell Diamond Thesis project 6/8/93
















rp , sp :
bp:











































ARCHITECTURE struct OF self_sorter IS
One_Way_Sorter : PROCESS (Rin)
VARIABLE X,y: QSIM_STATE;



















































used for sim of mux
u output
v output









x : =u ;
y:=z2;
END IF;
rp<=((not s) AND (x) AND (not y)) OR (r);
sp<=((not r) AND (not x) AND (y)) OR (s);
u:=((x) AND (y)) OR ((s) AND (x)) OR ((r) AND (y));
v:=((not r) AND (not x) AND (y)) OR ((not s) AND (x) AND (not y))




AFTER 1.2 ns ;
Figure 3.8 VHDL Program for the Self-timed Bi-way Sorter cell
The precharging mode results in a low ready signal and a high value on the
output signals. The evaluation mode leaves the cell free to evaluate the input
logic signals. To implement this in VHDL, a mechanism had to be derived. By
looking at the operation of the cell one can see that the cell has two distinct modes.
To implement these modes the logic equations were entered into the code
without any delay. The delay was introduced into the ready signal. Since the logic
operator could not start until rin went high and the evaluation wasn't finished until the
ready signal also went high, the delayed ready signal simulates the nmos tree. By
having the ready signal inherit the delay, the logic equations were allowed to evaluate
freely, but the answers should not be used by other cells until after the ready signal
goes high. This is implemented in the code by adding an AFTER statement to the
ready signal assignment statements. The delay time was determined by simulating the
circuit which is described in the next section.
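The idea of letting the ready signal carry the whole delay can be illustrated
outside VHDL. The short Python sketch below is a hypothetical model, not the code of
Figure 3.8: the function names are invented for illustration, times are kept in
integer picoseconds, and the 1.2 ns figure is the rin-to-readyn delay reported for
the transistor model.

```python
READY_DELAY_PS = 1200  # 1.2 ns, rin-to-readyn delay from the transistor simulation

def precharge():
    """Precharge mode: ready is low and every output is forced high."""
    return {"ready": 0, "outputs": (1, 1, 1, 1)}

def evaluate_cell(rin_rise_ps, logic, inputs):
    """Evaluation mode for one request.

    The logic equations are applied with no delay. Only the ready signal is
    delayed, so a consumer must wait until the returned ready time before it
    uses the outputs.
    """
    outputs = logic(*inputs)
    ready_time_ps = rin_rise_ps + READY_DELAY_PS
    return outputs, ready_time_ps

def pass_through(r, s, x, y):
    """Placeholder logic function used only for this demonstration."""
    return (r, s, x, y)

outputs, ready_at = evaluate_cell(5000, pass_through, (0, 1, 1, 0))
print(outputs, ready_at)  # (0, 1, 1, 0) 6200
```

Any logic function with four inputs can be substituted for the placeholder; the
timing behavior does not depend on it.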
3.3.2 The Transistor Model of the Self-timed Sorter cell
The transistor model of the self-timed sorter cell was simulated using Design
Architect version 8.1 from Mentor Graphics Corporation. A schematic of the bi-way
sorter cell can be found in Figure 3.9. The circuit consists of a 4-2 multiplexor which
selects which inputs reach the cell, two nmos trees which derive the outputs for the
cell as well as their complements, and a natural ready signal generator module.
The nmos trees are shown in Figures 3.10 and 3.11. Each tree has eight
inputs: r, rp, s, sp, x, xp, y, yp. Each branch of the tree signifies a prime implicant of the
logic equations. The output signals are taken from the connection between the pull-up
Figure 3.9 Schematic for the Self-timed sorter cell
The sorter cell is made up of two nmos trees which produce the
outputs of the cell and their complements. There is a 4-2
multiplexor which selects either inputs from the bottom cell or
the top cell depending on the b input.
Figure 3.10 Schematic of the regular nmos tree
The nmos tree takes four inputs r, s, x, y and their complements
rp, sp, xp, yp and has outputs rout, sout, xout, yout.
Figure 3.11 Schematic of the inverse nmos tree
The inverse tree takes the same inputs as the regular tree but
produces the complementary outputs of the regular tree.
Schematic annotation: N-transistors L = 2 um, W = 4 um; P-transistors L = 2 um, W = 8 um.
transistor and the branches below it. The pull-up transistor is a pmos transistor used
to precharge the output to a high state. When the rin signal is in a low state the
precharge transistor is on and the nmos transistor at the bottom of the tree is off, thus
preventing the charge at the top of the tree from discharging to ground. When rin
goes high the pmos transistor is turned off and the nmos transistor is turned on,
allowing the branch to sink the charge to ground if the corresponding inputs allow.
Transistor sizes were determined based on layout sizes and the load capacitances and
are shown on the schematics.
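The precharge and evaluation behavior of a single output node can be summarized
in a few lines. The following Python function is a simplified digital abstraction,
offered as a sketch only: it ignores charge sharing and leakage, and the argument
names are chosen here for illustration.

```python
def dynamic_node(rin, branch_conducts, previous=1):
    """Logic value of one precharged output node of an nmos tree.

    rin = 0: the pmos pull-up is on and the nmos at the bottom of the tree is
             off, so the node is precharged high whatever the inputs are.
    rin = 1: the pull-up is off and the bottom nmos is on; the node discharges
             only if some branch of the tree conducts, otherwise it keeps the
             value it had (normally the precharged high).
    """
    if rin == 0:
        return 1
    return 0 if branch_conducts else previous

print(dynamic_node(rin=0, branch_conducts=True))   # 1
print(dynamic_node(rin=1, branch_conducts=True))   # 0
print(dynamic_node(rin=1, branch_conducts=False))  # 1
```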
The readyn signal takes the outputs of the nmos trees and determines whether
the evaluation has finished. Determining the integrity of the output from the nmos
trees is simply a matter of checking whether each pair of outputs (e.g. w and w') are
complements of each other. The circuit for implementing the natural ready signal is
shown in Figure 3.12. The circuit was derived from the following concept.
The ready signal will go high if all output pairs are
complements of each other. In all other cases the ready
signal will remain low.
By using this concept Eq. 1 was formed. The simulation for the ready signal generator
can be seen in Figure 3.13. The simulation shows inputs changing at different times
until the conditions enabling the ready signal to go high are met.
R = (rout NAND routp)
S = (sout NAND soutp)
X = (xout NAND xoutp)
Y = (yout NAND youtp)
readyn = (R NAND S) NOR (X NAND Y) (1)
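The ready equation can be exercised directly. The Python sketch below evaluates
the gate network for given tree outputs, pairing yout with youtp for the Y term. It
is an illustration of the logic only and carries no timing information.

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)

def nor(a, b):
    """Two-input NOR on 0/1 values."""
    return 1 - (a | b)

def readyn(rout, routp, sout, soutp, xout, xoutp, yout, youtp):
    """Natural ready signal built from the four output pairs."""
    R = nand(rout, routp)
    S = nand(sout, soutp)
    X = nand(xout, xoutp)
    Y = nand(yout, youtp)
    return nor(nand(R, S), nand(X, Y))

print(readyn(1, 1, 1, 1, 1, 1, 1, 1))  # 0: all pairs still precharged high
print(readyn(1, 0, 0, 1, 1, 0, 0, 1))  # 1: every pair is complementary
print(readyn(1, 0, 0, 1, 1, 1, 0, 1))  # 0: the x pair has not evaluated yet
```

Each NAND only detects the "both high" condition, so a pair in which both nodes
were low would also be accepted. In this circuit that case is not expected to arise,
because every node starts from the precharged high state and only one tree of
each pair should discharge.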
All the modules were then assembled together to create a one-way sorter
cell. Once assembled, the circuit was then simulated and tested using the same test
Figure 3.12 Schematic of the ready signal generator circuit
The ready signal schematic takes the outputs from each nmos





























Figure 3.13 Simulation of the ready signal circuit
In this simulation only the regular inputs are shown for clarity.
The complement inputs all started high and each regular input















































Figure 3.14 Simulation of the one way cell
The simulation shows the output results of the cells for all
combinations of input signals (not shown for clarity). The












































pattern that was used by the VHDL model. Components not mentioned in this section are
found in Appendix E. A simulation result can be found in Figure 3.14. Output
complements are omitted for clarity. One can see that when the rin signal goes high
the outputs are evaluated; after a delay the readyn signal goes high. When the
outputs evaluated to a high value the voltage would increase by 0.5V. The reason for
this is that during evaluation the pull-up transistor is off, and the charge collected on
the parasitic capacitances of the transistors in the pull-down trees is added to the
output load capacitance charge. This phenomenon will not affect the result because
the output remains above the 5 volt threshold. The delay from the rise of the rin
signal to the rise of the readyn signal is found to be 1.2 nanoseconds.
3.3.3. The Layout of the Self-timed Sorter cell
The layout of the Self-timed sorter cell was done using IC Station version 8.1
from Mentor Graphics Corporation. The layout was done in 2 micron technology and
used the MOSIS rules to check the soundness of the circuit. The layout is shown in
Figure 3.15.
A design decision had to be made as to how to structure the nmos trees. The
trees are the largest part of the cell, which made them the limiting factor in
minimizing area. The structure used to implement the nmos trees was a NAND-NOR
network, which produces all the AND terms in the equations and then ORs the terms
needed for each output. This structure was used for the following reasons. First,
eight inputs needed to be dispersed throughout the tree structure. The NAND-NOR
network organizes the distribution by pulling the required inputs off of busses which
run through the tree. Another reason for choosing this structure is the size of the cell.
The NAND-NOR network is symmetrical, making the task of minimizing the empty
space between modules simpler. See Appendix E for the complete layout of the bi-way
module.
Figure 3.15 Layout of the bi-way sorter cell
The Figure shows the layout of the bi-way cell in two-micron




Figure 3.15 Layout of the Nmos logic trees
4.0 Implementation of the Handshake Controller Circuit (HCC)
4.1. Implementation
Self-timed circuits have the unique property of not using a global clock to
control the operation of the system. Instead the self-timed system uses a module
called the Handshake Control Circuit which controls the operation of the array by
organizing the data flow between modules. The Handshake Control Circuit
chosen was the HCC_B, taken from [Sim, Gunawan, 1992].
By looking at the data dependencies between modules which are formed in the
synchronous array, a modified handshake controller was designed before deciding on
the HCC_B controller. Following the flow of data through the cell, the control
signals enter the cell from the left side and data enters from either the top or the
bottom depending on the b control line. Since data can arrive at the cell from two
directions, the handshake controller has to decide to which module to send its request
and acknowledge signals. The modification which was chosen was to add circuitry to
the HCC to support this added function.
A diagram of the modified controller can be found in Figure 4.1. Added
circuitry was used to modify the Ain(n) and Ok(n) signals. The added multiplexors
select which modules the request or the acknowledge signals are accepted from. In
order for a module to accomplish its task, valid control signals must be present from
the cell directly to the left and valid data from above or below the present cell must be
available. When the controller receives a request or an acknowledge signal its b input
will direct the appropriate signals to the cell. The cell was designed and simulated and
was shown to function properly.
The reason this cell was not chosen for the final system was that it contained
too many input and output lines to be considered a viable solution. The Handshake
controller should not impose on the overall system by either slowing the system down
or by drastically increasing the number of pins on the chip. If this design were used the
increase in the number of pins would be four for each module. Another possible thesis
project would be to design a way to keep this scheme while decreasing the overhead











Figure 4.1 modified handshake controller circuit
4.1.1. The VHDL model of the Handshake Controller Circuit
The VHDL Code for the unmodified Handshake Controller Circuit was written
and compiled using VHDL version 8.1 from Mentor Graphics Corporation. The
VHDL code can be found in Figure 4.2. The HCC code was written using a structural
style. By using the structural style, the circuit's operation can be precisely simulated
and the circuit timing directly incorporated into the code.
At the beginning of the code, required libraries are loaded to facilitate the use
of the special variable and signal types. The next section of code is the entity
declaration. The entity declaration states that the name of this component is
struct_hcc and contains the port declaration. The circuit has four input ports:
init, okn_l, ainn, readyn. These signals are declared as QSIM_STATE and
initialized to appropriate values. This avoids the problem of the simulator not
working with uninitialized signals. The circuit also has three output ports: aoutn,
routn, okn. These ports are declared with mode INOUT, which makes each port
bi-directional. These signals were declared as INOUT so that the process can read
them back, incorporating feedback into the VHDL code.
The architecture body follows the entity declaration. In the architecture body,
two internal signals are declared: atmp and btmp. These signals are used for internal
computation in the controller circuit. The architecture body contains one process
which simulates the operation of the controller circuit. The equations which describe
the operation of the circuit are shown in the process body. A logic diagram can be
found in Figure 4.3.
Timing was introduced into the code by adding AFTER clauses following each
signal assignment line. The circuit was first simulated before the timing was
introduced, which resulted in unreadable signal waveforms. After close study of
-- Handshake controller circuit (HCC)
-- Mitchell Diamond 130-70-8065






PORT ( init: IN QSIM_STATE := '1';
aoutn: INOUT QSIM_STATE := '0';
okn_l: IN QSIM_STATE := '1';




ainn: IN QSIM_STATE := '0';












-- incoming ready signal
END struct_hcc;





hcc: PROCESS (init, okn_l, ainn, readyn, atmp, btmp, aoutn, routn,
okn)
BEGIN
atmp<=(routn NOR aoutn) AFTER 1.2 ns;
btmp<=(readyn OR routn) AFTER 0.7 ns;
aoutn<=(okn_l NOR btmp) AFTER 0.8 ns;
routn<=NOT(init OR ainn OR atmp) AFTER 1.6 ns;






-- logic equation for atmp
-- logic equation for btmp
-- logic equation for aout
-- logic equation for rout
-- logic equation for okn




Figure 4.3 Logic Diagram of the HCC
The diagram shows a logic schematic of the Handshake
Controller Circuit. The circuit operates as a state machine by





a solution was found. Since the circuit is self-timed and there is no clock to
synchronize the signal changes, the outputs of each process statement occurred after a
delta delay from an input event. Delta delays do not affect simulator time; therefore all
outputs appeared to change instantaneously. These waveforms appeared close
together and gave the illusion of the signals not changing state. After adding timing
declarations to each signal assignment the waveforms appeared correctly during
simulation. This problem demonstrated that when simulating a self-timed circuit using
VHDL, timing must be incorporated into the code in order to verify correct simulation
of the self-timed system. A simulation of the Handshake controller circuit is shown in
Figure 4.4, and shows the time needed to complete a cycle of the handshake
controller. The simulated time was 6.656 nanoseconds for one cycle of the HCC's
task.
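The delta-delay effect can be reproduced with a small Python sketch of the four
signal assignments that are legible in Figure 4.2 (atmp, btmp, aoutn, routn), with
the AFTER clauses removed. This is an illustration, not the thesis code. Two
assumptions are made: the okn assignment is omitted, and a request is modeled as
okn_l falling from 1 to 0 after reset. Each pass of delta_step corresponds to one
delta cycle, in which every assignment reads the values from the previous cycle.

```python
def nor(*bits):
    """N-input NOR on 0/1 values."""
    return 0 if any(bits) else 1

def delta_step(state, init, okn_l, ainn, readyn):
    """One delta cycle: all assignments read the old signal values."""
    return {
        "atmp": nor(state["routn"], state["aoutn"]),
        "btmp": readyn | state["routn"],
        "aoutn": nor(okn_l, state["btmp"]),
        "routn": nor(init, ainn, state["atmp"]),
    }

def settle(state, init, okn_l, ainn, readyn, max_cycles=50):
    """Run delta cycles until the signals stop changing; return all states."""
    history = [state]
    for _ in range(max_cycles):
        nxt = delta_step(history[-1], init, okn_l, ainn, readyn)
        if nxt == history[-1]:
            break
        history.append(nxt)
    return history

zeros = {"atmp": 0, "btmp": 0, "aoutn": 0, "routn": 0}
reset = settle(zeros, init=1, okn_l=1, ainn=0, readyn=0)[-1]
trace = settle(reset, init=0, okn_l=0, ainn=0, readyn=0)  # ASSUMED request polarity
for cycle, signals in enumerate(trace):
    print("delta", cycle, signals)  # every line belongs to the same simulated time
```

In this run aoutn rises and falls again within the delta cycles of a single time
step, so on a waveform display it would have zero width. Once each assignment
carries a real delay, the same sequence of changes is spread over simulated time.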
4.1.2. The Transistor model of the Handshake Controller Circuit
The transistor model was designed using Design Architect version 8.1 from
Mentor Graphics Corporation. The design was simulated using CMOS technology
and gate sizes were minimum except for certain driving outputs. The schematic of the
handshake controller circuit is found in Figure 4.5.
The circuit is composed of two nor gates connected in a feedback
configuration to form a flip-flop which controls the changes between states. There is
also a two input nand gate which takes the routn and readyn signals as inputs and
produces the okn signal when the logic operator has completed its task. The circuit
also uses two transistors (pmos and nmos) arranged in an inverter fashion but with one
difference (See Figure 4.3). The drain of the nMOS transistor, instead of being
connected to ground, is connected to the readyn input. The output of this sub-circuit
Figure 4.4 Simulation of the Handshake Controller Circuit
The simulation shows the waveforms for the HCC which were
generated by sending a request to the HCC. The simulation


















Figure 4.5 Schematic of the Handshake Controller Circuit
The HCC was designed using minimum sized gates except for
the gates which produce the routn signal because of the driving



















is fed back into the system to produce the changing states. When readyn is low then
the sub-circuit acts as an inverter and sends the correct value of the routn signal back
to the system. If readyn is high the circuit will produce a high value back into the
system if the routn signal is high. If the routn signal goes low then the output of the
sub-circuit will be floating. In the operation of the circuit the routn signal and the
readyn signal will both be high until the logic circuit is again precharged, which causes
the readyn signal to go low again. The small amount of time in which the output of the
sub-circuit is floating will have no effect on the overall operation of the circuit. This
sub-circuit was introduced to limit the number of gates needed to produce the correct
operation.
To simulate the circuit and find its timing characteristics, a test system was
developed. The test system is shown in Figure 4.6. The test
circuit is set up in such a way as to let the HCC produce a full cycle without any
outside interference. Three HCCs were connected together, with the last HCC fed
back into itself in order to have it complete its cycle. Timing was derived from the
center HCC. The simulation results from the Accusim simulator from Mentor
Graphics Corporation are found in Figure 4.7. Each HCC had a dummy logic module
which produced a zero delay effect. This was accomplished by using the routn signals
of the HCC to feedback onto itself, by connecting it to the readyn signal. The system
was first initialized and then started by sending a request signal (OKN_l) to the left
most HCC. Table 4.1 shows the timing characteristics for each signal. The timing
results were then incorporated into the VHDL code which was described in the
previous section. The timing was determined by subtracting the time at the half-way













signal      atmp   btmp   aoutn   routn   okn
time (ns)   1.2    0.7    0.8     1.6     0.8
Table 4.1 Timing determined from The HCC simulation
Figure 4.6 Test circuit used to determine timing characteristics
of the HCC.
The test circuit consists of three HCC modules. The timing
characteristics were derived from the center HCC. The cycle
was started from the left most controller and the circuit was


























Figure 4.7 Simulation of the test circuit for the Handshake
Control Circuit
The test circuit was simulated to verify accuracy and timing
characteristics. The waveforms shown are internal and external






















4.1.3 The Layout of the Handshake Controller Circuit
The layout of the Handshake Controller Circuit was done using IC Station
version 8.1 from Mentor Graphics Corporation. The layout was done using two-micron
technology. The layout of the control circuit is found in Figure 4.8.
For a self-timed circuit to be effective, the HCC should be as small as
possible. By keeping the HCC as small as possible, the overall area of the circuit
would not increase drastically compared to the sequential circuit. The layout of the
HCC was carefully planned to minimize its area.
Figure 4.8 Layout of the Handshake Controller Circuit
The HCC was laid out in two-micron technology. Keeping the






5.0 A Self-timed Four 2-bit Number Sorter Array
5.1 Overview of a Systolic Array Processor
The bi-way sorter processor is structured in the form of a systolic array. A
systolic array is a type of processor in which modules are interconnected in a grid
formation. The data usually enters on one or more sides of the grid and leaves
through the same or opposite side. This data enters each module where a computation
is performed. When the computation is done, the result is transmitted to the next
connected module and the process repeats. This type of processor is described as
"systolic" because the data flows through the array in a cellular fashion, reminiscent of
red blood cells flowing through the human body.
Systolic arrays have two distinct design methodologies: synchronous and self-
timed. The synchronous systolic array responds to a global clock. Each module
performs its task and sends the result on to the next module during each clock cycle.
This type of design is advantageous for smaller arrays (i.e. number of cells) because
the layout area is small enough to allow a global clock to propagate through the array
with minimal delay. Once the area of the circuit becomes too large for a global clock
to propagate through the array in an acceptable delay time, then the self-timed
methodology would be more suitable.
A systolic array which uses the self-timed methodology is known as a wave
front array processor. The difference between the wave front processor and the
synchronous processor is that the modules are not synchronized with each other. In
the wave front processor, when data is valid at a module the module will compute its
result and then send it on to the next module. When the module is done it continues
with the next task without waiting for the entire array to be finished with its cycle.
The flow of the data through the array acts like wave fronts propagating
across the ocean. This method can be used with larger arrays because it removes the
need to propagate a global clock through the circuit.
5.2 Description of the Self-timed Sorter Array Processor
5.2.1 Data Dependencies Between Cells in the Array
To design the self-timed array processor the data dependencies of the
sequential sorter array were examined. In order for a cell to perform its task, valid
control signals must be present from the cell directly to the left of the cell and valid
data must be present either from the cell above or below the current cell. The results
of each cell are then transferred into latches waiting for the next cell in the sequence to
take the results. In the synchronous array all modules act on their specific input data
at the same time. This means that the latches are holding data which is one clock
cycle old. In the self-timed processor each module acts on its data at different times.
A problem arises when a module sends new data into its latch before the next module
has taken the old data from that latch.
To demonstrate this point, consider the array in Figure 5.1. This array is
structured in the same way as the self-timed sorter array processor designed in this
paper. The handshake controllers have been removed for clarity. The arrows around
the outside of the array show the order in which the cells accomplish their tasks. The
boxes with the letter "L" are single stage latches. The array's initial conditions before
starting the processor are stored in the latches. For this example, the initial condition
for the latches is set to a high state. When the processor is started module A
computes its result based on the values stored in the latches connected to its inputs.
For this example input z1 on module A is high. After module A is finished computing,
it stores its results in the latches connected to its outputs. For this example module A
produces a low state on the w2 output and stores it in the connected latch. Now when




















Figure 5. 1 Example of the self-timed array processor.
62
module B, its z2 inputs have a low value instead of the original high value which
should be used for the computation. To fix the problem a two stage latch must be
connected to the w2 output pin in order to delay the new result one cycle. By delaying
the new result module B can now compute its result on the correct data.
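The hazard and its fix can be illustrated with a small behavioral sketch. It is not the thesis's circuit: the `Latch` class and the `module_b_sees` helper are hypothetical names, and the latch is modeled simply as a short shift chain whose output is the oldest stored value.

```python
class Latch:
    """Hypothetical model: a chain of `stages` one-bit cells; reads return the oldest value."""

    def __init__(self, stages, initial):
        self.cells = [initial] * stages

    def write(self, value):
        # New data enters the first stage; older data shifts toward the output.
        self.cells = [value] + self.cells[:-1]

    def read(self):
        return self.cells[-1]


def module_b_sees(stages):
    """Value module B samples on z2 after module A has already stored a new result."""
    w2 = Latch(stages, initial=1)  # initial condition of the latch: high
    w2.write(0)                    # module A finishes first and stores a low
    return w2.read()               # module B now samples its z2 input


print(module_b_sees(1))  # single stage latch: the original high value is lost
print(module_b_sees(2))  # two stage latch: the original high value is still visible
```

With one stage module B reads the new low value; with two stages it still reads the original high value, which is the behavior the two stage latch is meant to provide.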
5.2.2 Operation of the Self-timed Bi-way Sorter Array
Figure 5.2 shows a schematic of the self-timed bi-way sorter which was
simulated in this paper. The sorter is a three by two array which sorts four 2-bit
numbers. Each cell of the array is made up of an nmos logic tree, an HCC module and
three latches. Each cell is started by a request entering on the okn_l input signal of
the HCC. Once received, the HCC starts the logic module and awaits the readyn
signal from the logic module which signifies that the logic module has completed the
computation and the results are now valid. The HCC will then store the results in the
connected latches and send a request signal to the next module.
Figure 5.2 Schematic of the self-timed bi-way sorter processor
This Figure is a schematic of a four 2-bit number sorter
processor. The array is made up of six cells, a b control
generator module, and a number generator module. The inputs
and outputs are derived from the modules in the first and
The order in which each module executes its task is shown in Figure 5.3. The
sorter starts at the top left cell of the array. At the start of the sorting algorithm this
cell has all the data it needs to begin computing. Once this cell is finished, the next
cell in the column executes. Once the first column is finished the next column does its
computations, starting with the bottom cell in that column. Once the first two columns
are finished, the input data is valid for the first column to compute the next
cycle. After the second column is finished, the third column is also started. The
wave will propagate through columns 4, 5, 6, ..., n while the next wave is being generated
by the first column. This method of traversing the array was chosen because it
resulted in the minimum number of input and output signals, and each
HCC only has to communicate with one cell rather than several. If the
modified HCC (see chapter 4) were used instead of the HCC used in this
paper, the additional pins and signal crossings would increase the
difficulty of the layout. This method of traversing the array is not the only one
which can be used; other ways are left to other thesis projects to research.
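The traversal described above can be sketched as a short function that lists the cells of one wave in firing order. This is only an illustration: `firing_order` is a hypothetical name, cells are written as (row, column) pairs with row 0 at the top, and the assumption that the direction keeps alternating for columns beyond the second is not stated in the text.

```python
def firing_order(rows, cols):
    """List (row, col) cells in the order one wave visits them.

    Column 0 runs top to bottom and column 1 bottom to top, as described in
    the text. Alternating the direction for later columns is an assumption.
    """
    order = []
    for c in range(cols):
        if c % 2 == 0:
            row_sequence = range(rows)              # top to bottom
        else:
            row_sequence = range(rows - 1, -1, -1)  # bottom to top
        order.extend((r, c) for r in row_sequence)
    return order


# The simulated sorter: two rows, three columns (control, MSB, LSB).
print(firing_order(2, 3))
```

Because each cell hands its request to exactly one successor in this list, each HCC needs only one request/acknowledge partner, which is the property the text gives as the reason for choosing this order.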
A number generator module was used to hold the unsorted numbers before
they enter the sorter processor. The number generator module, labeled gen in Figure
5.2, is a two-dimensional array of shift registers structured to produce a
skewed-bit-parallel data format. The number of shift registers in each column (k) is
equal to the column number, starting with the MSB as number 1. On the generator module
there are separate signal lines which are used to select the shift register column that
produces the next bit. The select signals are taken from the okn_l signal on the HCC
module. The number generator module was simulated as a behavioral model written in
VHDL, as shown in Appendix D. The module, if implemented in
hardware, would be a separate chip connected to the main sorter processor.
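The skewed-bit-parallel format can be illustrated with a short sketch. The function name `skew` and the use of the character "x" for an empty bit position are assumptions made for this example; numbers are given as bit strings with the MSB first.

```python
def skew(numbers, pad="x"):
    """Convert equal-length bit strings (MSB first) to skewed-bit-parallel rows.

    Row t holds the bits presented to the array at step t. Column k
    (k = 0 is the MSB) lags the MSB column by k steps.
    """
    width = len(numbers[0])
    rows = []
    for t in range(len(numbers) + width - 1):
        row = ""
        for k in range(width):
            i = t - k  # index of the number whose bit k is presented at step t
            row += numbers[i][k] if 0 <= i < len(numbers) else pad
        rows.append(row)
    return rows


# The four numbers used in the simulation of section 5.3.
print(skew(["00", "10", "01", "11"]))
```

Applied to the numbers 10, 01, 11, 00, the same function reproduces the input sequence 1x, 00, 11, 01, x0 listed in Appendix A.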
Another off-chip controller is the B signal generator module. This module is labeled
b_gen in Figure 5.2. The B generator module is used to produce the b
inputs for the control column of cells. The b input starts out with a high value, which
causes the data in the array to flow downward. Once the last MSB has entered the
array, the b input toggles to a low state, which reverses the data flow to upward. By
toggling the b input, the data flows into the array and then back out, allowing each bit
to be compared with all the other bits in the column. This is shown in the example in
Appendix A. The B generator module is modeled as a counter in VHDL using the
behavioral method. The VHDL code for this module is shown in Appendix D.
To start the system an AND gate is connected to the okn_l input of the first
control module's HCC. One input to the AND gate comes from an off chip start signal
















Figure 5.3 Operational flow of the Self-timed Bi-way Sorter
Figure 5.3). Once the data is loaded into the number generator module, the start signal
goes low, which starts the computation of the first control module. The start signal
is a pulse lasting a minimum of two nanoseconds. Once the second column is finished,
the last HCC pulses the AND gate's input, starting the cycle over again. The last
HCC in the chain has an inverter redirecting its request signal back into its
acknowledge input. To stop the array, the init line, which is part of the HCC, is
pulsed. This initializes the array and allows another sort process to occur.
5.3 Simulation results of the self-timed bi-way sorter
A sorter processor for four 2-bit numbers was designed by connecting the
VHDL modules for the individual components. Timing was determined by designing
each component at the transistor level and simulating it with the Accusim simulator.
The times derived were then incorporated into the VHDL code for each module.
The sorter was then simulated using the QuickSim II logic simulator from
Mentor Graphics Corporation. The numbers that were sorted were 00, 10, 01, 11
respectively. These numbers were hard coded into the number generator module
before the start of the simulation.
The simulation results are shown in Figure 5.4. Only the most important
signals are shown to keep the results readable. Before the system starts, the
init signal goes high, which initializes the array. The start signal is then pulsed,
which initiates the sorting process. Once the start signal goes low, the okn_l signal
of the first HCC also drops low. The okn_l signal is shown in the simulation
because it indicates when one wave has been sent and another is being created. The
bout signal comes from the b signal generator and is used to manipulate the control
modules. The bout signal starts high and is toggled low to reverse the flow of data in
the array. The next group of signals, rin, z2, and w1, are inputs and outputs of
two modules in the first row, columns two and three (see Figure 5.3). The
I$24 signals represent the MSB and belong to the module in column two, and the
I$10 signals represent the LSB and belong to the module in column three. The rin
signals are shown because they show when the logic module begins and finishes its
computations. The output data is valid after the rin signal goes low. The z2 signals
originate from the number generator module and are the data bits which enter the
array. The w1 signals are the output of the array and contain the sorted numbers.
The results of the sorter are shown in Table 5.1. The data outputs are valid
after each rin signal. At time 0 the sorter is initialized to the values shown in the
diagram. At time 56 control signals enter the sorter so that the
control modules can produce their first control outputs. After that, data is sent out of
the generator and into the z2 inputs in a skewed-bit-parallel format. The resulting input
and output values can be followed from the example in Appendix A.
The time needed to sort the four numbers was determined to be 220
nanoseconds. This was found by subtracting the time at which the start signal was
issued from the time at which the final answer (the last LSB) was produced.
MSB LSB
time (ns) z2 w1 z2 w1
0 0 1 0 1
56 0 1 0 1
84 0 0 0 1
112 1 0 0 0
120 0 0 0 0
150 1 0 0
175 0 0 0
200 0 1 1
224 0 1 0
250 0 X 1
Table 5.1 Simulation results for the self-timed bi-way sorter
Figure 5.4 Simulation waveforms of the self-timed sorter
processor
The Figure shows the simulation results of sorting 4 numbers
using the self-timed bi-way sorter. The numbers enter the z2
inputs and the sorted numbers leave the w2 inputs. The I$24














































































6.0 Conclusion
A modification of the sequential bi-way sorter processor was achieved which
enables the new sorter to operate using a self-timed methodology rather than a
sequential clocking scheme. The self-timed implementation introduces into the system
a controller module called the HCC, which controls the operation of the system in place
of the global clock used in the original sorter processor. Each sorter
methodology was designed and simulated at the VHDL level and also at the transistor
level.
The transistor level of the sequential bi-way sorter was simulated using the
schematic found in the paper "Bi-way sorter: a two-dimensional systolic array" [2].
The circuit was designed using multiplexors and control logic to control the flow of the
data through the cell. The sequential implementation used a total of 38 nmos and pmos
enhancement transistors. The self-timed implementation used 106 nmos and pmos
enhancement transistors. These totals count only the transistors used in the cell itself
and not the latches used for data storage. The latches were excluded because they are
the same for both methodologies and only the differences between the two circuits are
of interest. The transistor count for the self-timed implementation comes from the two
nmos trees, the natural ready signal circuitry, and the HCC. The dimensions of the
self-timed sorter cell were 554 x 424 µm and the HCC's dimensions were 112 x 86 µm.
Comparing the size of the HCC to the entire sorter cell shows that the HCC takes up a
very small portion of the overall area and will not be a major factor in larger module
designs.
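The size comparison can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and both sets of dimensions are assumed to be in micrometres.

```python
# Figures quoted in the text (dimensions assumed to be in micrometres).
cell_area = 554 * 424   # self-timed sorter cell, in square micrometres
hcc_area = 112 * 86     # HCC alone, in square micrometres

hcc_fraction = hcc_area / cell_area   # share of the cell occupied by the HCC
transistor_ratio = 106 / 38           # self-timed vs. sequential transistor count

print(f"HCC share of cell area: {hcc_fraction:.1%}")
print(f"Transistor ratio (self-timed / sequential): {transistor_ratio:.2f}")
```

Under these assumptions the HCC occupies roughly 4% of the cell area, while the self-timed cell uses a little under three times as many transistors as the sequential cell.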
The difference in the number of transistors for each cell is alarming at first, but
how it affects the overall system should be examined more closely. In the
sequential bi-way sorter circuit each cell takes up less area than its self-timed
counterpart. This is beneficial when the sorter array is small. As
the sorter array grows larger, the problems associated with clock distribution, such as
clock skew and the effects of shrinking IC feature sizes, become more evident. With the
technology heading towards smaller feature sizes, the clock driver needed to drive the
load of the system will have to increase in size, taking away large sections of
real estate on the IC. With the self-timed methodology, shrinking IC technology
does not play as big a part in the overall design as it does in sequential designs. The
clock driver required by the sequential design is no longer needed, which frees up
space on the IC. By comparing the total area a
clock driver uses on an IC with the area used by the extra
transistors within the self-timed cells, it can be shown that the overall area of the
self-timed sorter will be smaller than that of the sequential design. In the case of
the bi-way sorter cell, the transistors used in the self-timed cell are mostly minimum
feature size and will not take up much space, but the clock driver's size increases
exponentially as the number of stages needed increases.
If the clock driver needs to increase in size to drive the increasing capacitance of
the system, its driving transistors can reach widths greater than 1000 microns. This
driver area must be added to the sorter cell area in the total design.
The self-timed cell took 7.8 nanoseconds to run through one complete cycle.
Most of the delay was due to the overhead of the HCC controller. What was learned
from including an HCC in this type of system is that the overall effect of the HCC on
the system decreases as the delay of each logic module increases.
For the self-timed circuit the logic modules evaluated in 1.2 nanoseconds
and the HCC took 6.658 nanoseconds to complete a cycle. Also, a four number sorter was
simulated and the timing was determined to be 200 nanoseconds. If the logic module is
significantly slower than the HCC, then the HCC's effect will not add much to the
timing of the system. To verify this conjecture a system should be tested where the
logic module is slower than the HCC. This is left to further research by
other graduate students.
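The conjecture can be put in numbers with a rough sketch. It assumes, as a simplification not stated in the text, that one cell cycle is the sum of the logic evaluation time and the HCC time (1.2 ns + 6.658 ns is close to the reported 7.8 ns cycle). The function name `hcc_share` is hypothetical.

```python
def hcc_share(t_logic_ns, t_hcc_ns=6.658):
    """Fraction of one cell cycle spent in the HCC.

    Assumes the cycle time is simply logic delay plus HCC delay.
    """
    return t_hcc_ns / (t_logic_ns + t_hcc_ns)


# Logic delay measured for the sorter cell, then two hypothetical slower modules.
for t_logic in (1.2, 6.658, 66.58):
    print(f"logic {t_logic:7.3f} ns -> HCC share {hcc_share(t_logic):.1%}")
```

Under this simple model the HCC accounts for about 85% of the cycle with the 1.2 ns logic module, but only about 9% when the logic module is ten times slower than the HCC.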
What was shown in this paper is that a synchronous system can be converted
to a self-timed methodology while preserving the operation of the sequential
circuit. It was also shown that, as the technology advances, the self-timed
concept should grow in popularity and eventually become a necessity to
overcome the problems with synchronous systems.
Appendix A
Example of Sorting four 2-bit numbers using the Synchronous Bi-way
sorter array processor.
In order to sort four 2-bit numbers an array of six cells is required. The
leftmost column contains the control cells. The control cells do not take part in the
sorting process; rather, they send out control signals which control the operation of the
array. The next two columns are used to sort the numbers. The MSB enters the
middle column and the LSB enters the rightmost column. The example on the
next page shows the state of the array right before each clock pulse. The numbers
to be sorted, in skewed-bit-parallel format, are
1x, 00, 11, 01, x0 respectively. The key shows which value each number represents in
the cell. The b input decides which inputs and outputs to use for incoming data based
on table 2.1 and figure 2.1.
Before the array begins sorting it contains invalid data which is shown in the
first panel of the example. In the second panel, after the first clock pulse, the first
input bit reaches the array. The MSB of the first number is a 1 which enters the
second column's top cell. Since the r and s inputs to that cell are a 1 and a 0




variables are set to 1 and 0 respectively. The new values can be seen in the third
panel. The results signify that a new bit entered the array and, since there was no bit
already stored in the array, the cell was told to store the incoming MSB and notify the
LSB's cell to store its incoming bit when it enters the cell. The following clock pulses
repeat the same set of calculations for each cell in the array based on the
cell's input signals. The output results after sorting are 0x, 00, 11, 10, x1
respectively. The output of the sorter is in skewed-bit-parallel format. To convert
back to a regular format, read the numbers starting with the first MSB; each following
bit place is one row down. For example, to convert the output of this
example, start with the first MSB (0) and take the LSB one row down (0), so the first
number is 00.
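The conversion back to a regular format can be sketched in a few lines. The function name `unskew` is hypothetical; each row is a string with the MSB column first, and "x" marks an empty position as in the example.

```python
def unskew(rows):
    """Recover the numbers from skewed-bit-parallel rows.

    Bit k of number i is found in column k of row i + k, that is, each
    following bit place is read one row further down.
    """
    width = len(rows[0])
    count = len(rows) - width + 1
    return ["".join(rows[i + k][k] for k in range(width)) for i in range(count)]


# Output sequence of the example in this appendix.
print(unskew(["0x", "00", "11", "10", "x1"]))
```

For the output sequence of this example the recovered numbers come out in ascending order, and applying the same function to the input sequence recovers the unsorted numbers.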

















































































































































Figure A1 Example of sorting four




Appendix B
Circuits and simulations for the latch and the multiplexor used in the
synchronous bi-way sorter
The synchronous bi-way sorter is made up of three types of components:
multiplexors, latches, and a control module. The control module is described in
chapter 3. This appendix describes the circuits and simulations of the multiplexor and
the latch.
Figure B1 shows a schematic of the static shift register which is used as the latch
that stores the output results. A static shift register was chosen because of its ability
to act as a sense amplifier which will rectify the raw output signal from the cell. The
latch was simulated using Accusim from Mentor Graphics Corporation. Figure
B2 shows the simulation of a single latch loaded with the capacitance measured
from the layout of the cell. The simulation shows a low and a high value
being shifted through the latch. The timing was determined to be 0.69 nanoseconds. This
value was incorporated into the VHDL code used for the
simulation of the sequential bi-way sorter.
Figure B3 shows a schematic for a 4-2 multiplexor. The multiplexor was
designed using transmission gates. A select signal selects which transmission
gate is activated to allow the incoming signal to pass to the output. The multiplexors
are used to control the flow of the data through the sequential bi-way cell. The
multiplexor was simulated with an appropriate loading capacitance; the result is shown
in Figure B4. The simulation shows the select signal toggling between one input and the
other. The timing determined from the simulation was used


























































































































Figure B3 Schematic for a 4-2 multiplexor


























Figure B4 Simulation of the 4-2 multiplexor
Appendix C
Quine-McCluskey method for sorter logic implementation
The Quine-McCluskey method was used to reduce the truth table in table 3.1,
which describes the functionality of the one way sorter cell. This appendix shows
the steps involved in reducing the truth table. Each table shows one
iteration of the method. After each iteration the terms are reduced until only the prime
implicants are left. The final results for the complement of the output signals are
found after the fourth table. The following equations were minimized simultaneously
using this procedure.









Table 1
INDEX MINTERM TERM rp sp u v
one 1 * 0001 0 1 0 1
2 * 0010 1 0 0 1
4 * 0100 0 1 0 0
8 * 1000 1 0 0 0
two 3 * 0011 0 0 1 1
5 * 0101 0 1 0 1
6 * 0110 0 1 1 0
9 * 1001 1 0 1 0
10 * 1010 1 0 0 1
12 * 1100 1 1 1 1
three 7 * 0111 0 1 1 1
11 * 1011 1 0 1 1
13 * 1101 1 1 1 1
14 * 1110 1 1 1 1
four 15 * 1111 1 1 1 1
Table 2
INDEX MINTERM TERM rp sp u v
one 1,3 # 00-1 0 0 0 1
1,5 # 0-01 0 1 0 1
2,3 * 001- 0 0 0 1
2,10 # -010 1 0 0 1
4,5 * 010- 0 1 0 0
4,12 * -100 0 1 0 0
8,9 * 100- 1 0 0 0
8,10 * 10-0 1 0 0 0
8,12 * 1-00 1 0 0 0
two 3,7 * 0-11 0 0 1 1
3,11 * -011 0 0 1 1
5,13 * -101 0 1 0 1
6,7 * 011- 0 1 1 0
6,14 * -110 0 1 1 0
9,11 * 10-1 1 0 1 0
9,13 * 1-01 1 0 1 0
10,11 * 101- 1 0 0 1
10,14 * 1-10 1 0 0 1
12,13 * 110- 1 1 1 1
12,14 * 11-0 1 1 1 1
three 7,15 # -111 0 1 1 1
11,15 # 1-11 1 0 1 1
13,15 * 11-1 1 1 1 1
14,15 * 111- 1 1 1 1
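The step that turns the terms of one table into the terms of the next can be sketched as follows. This is a minimal illustration of the standard Quine-McCluskey combining rule only; it ignores the per-output columns (rp, sp, u, v) that the tables carry for simultaneous minimization, and the function name `combine` is hypothetical.

```python
def combine(a, b):
    """Merge two terms that differ in exactly one defined bit.

    Terms are strings over '0', '1' and '-'. Returns the merged term with a
    '-' in the differing position, or None if the terms cannot be combined.
    """
    differing = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(differing) != 1:
        return None
    i = differing[0]
    if a[i] == "-" or b[i] == "-":
        return None  # a dash may only pair with a dash in the same position
    return a[:i] + "-" + a[i + 1:]


print(combine("0001", "0011"))  # minterms 1 and 3
print(combine("00-1", "01-1"))  # terms 1,3 and 5,7
print(combine("0001", "0010"))  # differ in two bits, cannot be combined
```

The first call corresponds to the first row of Table 2 and the second to the first row of Table 3.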
Table 3
INDEX MINTERM TERM rp sp u v
one 1,5,3,7 # 0--1 0 0 0 1
2,3,10,11 # -01- 0 0 0 1
4,5,6,7 * 01-- 0 1 0 0
4,5,12,13 * -10- 0 1 0 0
4,12,6,14 * -1-0 0 1 0 0
8,9,10,11 * 10-- 1 0 0 0
8,9,12,13 * 1-0- 1 0 0 0
8,10,12,14 * 1--0 1 0 0 0
two 3,7,11,15 # --11 0 0 1 1
5,13,7,15 # -1-1 0 1 0 1
6,7,14,15 # -11- 0 1 1 0
9,11,13,15 # 1--1 1 0 1 0
10,11,14,15 # 1-1- 1 0 0 1
12,13,14,15 # 11-- 1 1 1 1
Table 4
-1-- 0 1 0 0
8,9,10,11,12,13,14,15



























































Karnaugh maps used to derive the equations for the positive nmos tree
equations
The Karnaugh map method of reducing truth tables was used for the non inverted














Figure C1 Karnaugh maps used to reduce the logic equations
for the nmos trees.
Appendix D
VHDL code for the number generator module and the b control
signal module
This appendix shows the VHDL code for the two off-chip modules used to
operate the self-timed bi-way sorter. The number generator is implemented as a
two-dimensional array of shift registers. The shift registers are organized so as
to produce the generated numbers in skewed-bit-parallel format.
This module would be a chip separate from the main sorter processor. It is
kept off chip to allow the maximum number of cells to be put on the chip.
The number generator module was modeled in VHDL and is shown in Figure D1. To
simulate the array of shift registers a two-dimensional array was declared. The array is
loaded with the unsorted numbers when the init signal goes high. Once the numbers
are stored in the array the module waits for either the sel1 or the sel2 signal to go
high. When one of these signals goes high the corresponding index into the array is
incremented and the selected bit is put on the output signal.
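The behavior described above can be mirrored in a small Python model. It is a sketch, not a translation of the VHDL in Figure D1: the class and method names are hypothetical, columns are numbered from 0 (MSB) rather than 1, and starting each index one position before the first entry is an assumption made so that the first select returns the first stored bit.

```python
class NumberGenerator:
    """Hypothetical behavioral model of the off-chip number generator."""

    def __init__(self, numbers):
        # Loading step (done when init goes high in the description above).
        # columns[c] holds bit c (0 = MSB) of every number, in entry order.
        width = len(numbers[0])
        self.columns = [[n[c] for n in numbers] for c in range(width)]
        self.index = [-1] * width  # assumed starting position of each column

    def select(self, column):
        """Model one select pulse: advance that column's index, output the bit."""
        self.index[column] += 1
        return self.columns[column][self.index[column]]


gen = NumberGenerator(["00", "10", "01", "11"])
print([gen.select(0) for _ in range(4)])  # bits delivered to the MSB column
print([gen.select(1) for _ in range(4)])  # bits delivered to the LSB column
```

In this model each column advances only when its own select line is pulsed, so the relative timing of the two bit streams is set by the cells that issue the selects rather than by the generator.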
The b control signal generator is another off-chip module, which provides the b
control signal to the column of control cells. The VHDL code is shown in Figure D2.
The module is modeled in VHDL as a counter connected to a
toggle flip-flop. The generator initially produces a high value on the bout signal and
toggles it after four pulses have entered the input port. Toggling the bout signal
reverses the flow of data in the array, allowing all the numbers to be compared against
each other.
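The counter and toggle behavior can be sketched in Python as follows. The class name, the `toggle_after` parameter, and the resetting of the counter after each toggle are assumptions made for this illustration; the text only states that bout starts high and toggles after four input pulses.

```python
class BGenerator:
    """Hypothetical behavioral model of the b control signal generator."""

    def __init__(self, toggle_after=4):
        self.bout = 1                    # bout starts high: data flows downward
        self.count = 0
        self.toggle_after = toggle_after

    def pulse(self):
        """Model one pulse on the input port and return the resulting bout."""
        self.count += 1
        if self.count == self.toggle_after:
            self.bout ^= 1               # toggle flip-flop
            self.count = 0               # assumed: counter restarts after a toggle
        return self.bout


b_gen = BGenerator()
print([b_gen.pulse() for _ in range(4)])
```

With four numbers to sort, bout stays high for the first three pulses and goes low on the fourth, which is the point at which the data flow in the array reverses.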
-- Mitchell Diamond Number generator module




























select for MSB bit




ARCHITECTURE behave OF num_gen IS
TYPE num_type IS ARRAY (INTEGER RANGE 1 TO 2, INTEGER RANGE 0 TO 5)
OF QSIM_STATE;
BEGIN



































num2 <= numbers(2, count2);






















if a select for
MSB enters




Figure D1 Generator module used to generate the
unsorted numbers











USE ieee.std_logic_1164.ALL;
ENTITY b_input_module IS
PORT( input : IN QSIM_STATE := '1';
bout : INOUT QSIM_STATE := '1');
END b_input_module;










count := count + 1;
IF count=4 THEN








-- the b out signal
counter variable
--we have 4 numbers
--





Figure D2 Generator for the b control signal used to control the control modules
Appendix E
Schematics and simulations of components used in the self-timed
sorter module
The following components were used in the design of the self-timed sorter
module: a 4-2 multiplexor, a self-timed latch, two nmos trees, and an HCC. This
appendix shows the schematics and simulations of the 4-2 multiplexor and the
self-timed latch as well as the complete layout of the self-timed sorter module.
A 4-2 multiplexor was designed using transmission gate logic. Figure E1
shows the schematic of the multiplexor. The multiplexor has four inputs and two
outputs which are selected by the sel signal. The two outputs are connected to
inverters which provide the complements of the results. The multiplexor is used to
direct either the top or the bottom inputs into the sorter cell. The inputs are selected
by the cell's b input. The inverse outputs are supplied because the nmos trees
need both the results and their complements to produce their answers. The multiplexor
was simulated and the results are shown in Figures E2 and E3. Figure E2 shows the
multiplexor being simulated with constant inputs and the sel input being pulsed.
Figure E3 shows a close-up of the signal changes, which was used to acquire the
timing for the multiplexor. From the simulation the timing was determined to be 0.2
nanoseconds.
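The logical function of the 4-2 multiplexor can be sketched as below. The function name `mux42` and the polarity of sel (1 selects the top pair) are assumptions for this illustration; the text does not state which value of the b input selects which pair.

```python
def mux42(top, bottom, sel):
    """Hypothetical model of the 4-2 multiplexor.

    top and bottom are (bit, bit) input pairs. Returns the selected pair and
    its complements, as produced by the output inverters. The polarity
    (sel = 1 selects the top pair) is an assumption.
    """
    selected = top if sel else bottom
    complements = tuple(1 - bit for bit in selected)
    return selected, complements


print(mux42((1, 0), (0, 0), sel=1))
print(mux42((1, 0), (0, 0), sel=0))
```

Returning both the selected values and their complements mirrors the reason given above for the output inverters: the nmos trees need each signal in both polarities.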
A schematic of the self-timed latch is shown in Figure E4. The latch uses two
back-to-back inverters in a feedback arrangement to store the incoming bit.
Two nmos transistors are used to either pass the input signal or hold the feedback
signal, and hence store or latch the input signal. The latch is known as a
self-timed latch because the data can be stored almost indefinitely by the
feedback mechanism. In a self-timed circuit the delay between a cell completing
its task and the data being taken from the latch is indeterminate. For this reason
the latch might need to hold the data for a long time. The feedback regenerates the
stored bit and keeps it at a valid state. The circuit was simulated
and the result is shown in Figure E5. The init signal goes low, which causes the latch
to be initialized to a high state. The simulation shows a low state being passed from the
input to the output and latched when the sel signal goes high.
Finally, all the components were integrated to form a self-timed sorter cell.
The layout of the self-timed sorter cell is shown in Figure E6. The layout was done in
2 micron technology. The inputs and outputs of the cell were strategically placed to
facilitate the interconnection of the cells. The cell uses an 8-bit driver circuit to drive
the input values of the nmos tree because of the capacitive load of the poly lines. By
looking at the layout one can see that the HCC takes up a small section of the cell. By
taking up a small section of the cell the overall area of the circuit will not be increased































































































































































































































































































Figure E5 Simulation of the self-timed latch
Figure E6 Layout of the self-timed bi-way sorter cell
Bibliography
[1] Nouta, M.Sim, Gunawan, "Two Self-timed Handshake Controllers for High Speed
Applications", IEEE International Symposium on Circuits and Systems, vol. 5, pp.
2124-2127, May 1992
[2] Orton, Peppard, Aki, "Bi-way sorter: a two-dimensional systolic array", IEEE
Proceedings (Computers and Digital Techniques), vol. 139, no. 2, pp. 147-155, March
1992
[3] Wolf, Stanley, "Silicon Processing for the VLSI Era, Volume 2: Process
Integration", Lattice Press, 1990
[4] Uyemura, John P., "Circuit Design for CMOS VLSI", Kluwer Academic Publishers,
1992
[5] Bhasker, Jayaram, "A VHDL Primer", Prentice Hall, 1992
