




Methodology and Testing of a
Systolic Floating Point
Processing Element






A thesis submitted in fulfilment of the requirement for the degree of
Master of Engineering Science
The University of Adelaide
Faculty of Engineering












1.1 High Performance Computing
1.2 Systolic Processing for Matrix Computations
1.2.1 Algorithms and Architecture
1.2.2 Granularity
1.2.3 Implementations and Issues
1.3 Gallium Arsenide Technology
1.3.1 Gallium Arsenide Devices
1.4 Contribution of the Thesis
1.5 Outline of the Thesis
2 Gallium Arsenide Technology
























2.1.5 Second Order Effects
2.1.6 Simulating Worst Case
2.2 GaAs MESFET Logic Classes
2.2.1 Direct Coupled FET Logic
2.2.2 Source Follower Direct Coupled FET Logic
2.2.3 Super Buffer FET Logic
2.2.4 Performance Comparison
2.3 Design Methodology
2.3.1 Layout Style
2.3.2 Design Tools






2.4.6 Line Delay





2.4.12 Power Supply and Ground Lines
2.5 Summary
3 Systolic Ring Processing Element
3.1 Digit-Serial Multiplication
3.2 Digit-Serial Floating Point Multiplication
3.2.1 Floating Point Numbers



































3.2.3 Digit-serial Floating Point Multiplier Model
3.3 A Systolic Ring Floating Point Processing Element
3.4 Performance Metric of a Rectangular Systolic Array Processor
3.5 Architecture Optimisation
3.5.1 Area-Time Model Evaluation
3.5.2 Processor Bandwidth Requirement
3.6 Summary
4 Design, Layout and Simulation
4.1 Introduction
4.2 Floating Point Representation
4.3 GaAs Circuit Design, Simulation and Layout
4.3.1 Data Flip-Flop
4.3.2 Full Adder









4.4 Fabrication and Packaging















































5.3.2 Fix 1 : Ground Bounce
5.3.3 Fix 2 : Separate Power Supplies
5.4 Fingered MESFET Test Structures
5.5 Systolic Cell Functional Testing
5.5.1 Generating Test Vectors
5.6 Systolic Ring Testing
5.7 Clock Generation Circuit
6 Discussion and Future Work
6.1 Discussion
6.2 Future Work
Appendix A : GaAs Digital Logic Performance Specifications
Appendix B : PE Chip Pin Allocation


















Despite the recent advances in computing performance, there remain many signal
processing tasks that are beyond the capabilities of current off-the-shelf computing systems.
These tasks include matrix intensive operations such as real-time digital Kalman
filtering, signal processing and computer simulation algorithms for electronic circuits,
mechanical structures and thermal system modelling. This dependence on O(n^3) matrix
operations leads to a requirement for a parallel computer architecture in the form of a
multi-dimensional array of processing elements.
A general purpose matrix processing engine is described in this thesis, which deals in
particular with the implementation of a single processing element which forms part of a two
dimensional processing array. The processing element performs addition, multiplication
and multiplication-accumulation of two floating point numbers.
An architecture for a class of digit-serial systolic ring floating point processing elements
is investigated and a 0.8 µm gallium arsenide implementation is realised using Vitesse
HGAAS-II technology. Gallium arsenide technology was chosen to implement the
processing element because of its high speed and low power advantages over conventional
technologies such as silicon ECL. Studies were conducted to develop an optimised logic
class for this technology. A mixed logic approach using DCFL (direct coupled FET logic),
SDCFL (source follower DCFL) and SBFL (super buffer FET logic) was used.
A new physical layout strategy, 'ring notation', was developed which was shown to be
suitable for the design of high speed circuits using these classes of logic. This strategy
achieves good power supply isolation from high speed signal interconnects and high
packing density for these types of circuits.
A CAD environment for gallium arsenide was developed which includes the modelling of
circuit parasitics, layout, circuit extraction and technology files. Circuit primitives were
designed including flip-flops, adders and multiplexers.
Architectural studies were carried out to determine the optimum architecture for this
technology. It is shown that the area-time metric should be used to optimise these
processors.
A four bit per digit implementation of the systolic ring floating point processing element
was realised for an extended floating point format. A chip was successfully fabricated
using the HGAAS-II process and measured 3 mm x 5.7 mm. It contained 12,000 devices
and has a maximum operating speed of 300 MHz, producing 11 Mflops for
multiply-accumulate operations. The chip was tested and found to be fully functional at 128 MHz
(due to process variation) to produce a computation rate or.\Mflops.
Declaration
This thesis contains no material which has been accepted for the award of any other
degree or diploma in any university or other tertiary institution and, to the best of my
knowledge and belief, contains no material previously published or written by another
person, except where due reference has been made in the text.
I give consent to this copy of my thesis, when deposited in the University Library, being




Firstly, I would like to thank my supervisors, Dr. Kamran Eshraghian and Dr. Cheng-Chew
Lim, for their guidance and assistance with the work and the preparation of this
thesis.
I would like to thank Dr. Warren Marwood for the enlightenment he has given me through
the work and papers we have written jointly, for proof reading the thesis and countless
interesting late night discussions.
Thanks also to my colleagues in the department and from other Universities including
Mr. Ali Moini, Mr. Michael Liebelt, Dr. Jens Jakobsen (Jydsk Telefon, Denmark), Mr.
Eric Chu, Mr. Mike McGeever, Mrs. Song Cui, Mr. Tim Shaw and Mr. Gyudong Kim
(Seoul National University). Mr. Mike McGeever also assisted this work by characteris-
ing the I-V curves of the fabricated MESFET devices.
The support of The Australian Research Council and the Sir Ross & Sir Keith Smith
Foundation is gratefully acknowledged.





The following is a list of publications by the author and colleagues which are related to
this thesis:
A. Beaumont-Smith, W. Marwood, C.C. Lim and K. Eshraghian. "Design and
Implementation of a GaAs Systolic Floating Point Processing Element". Submitted to
IEE Proceedings-E, Computers and Digital Techniques, 1995.
A. Beaumont-Smith, W. Marwood and C.C. Lim. "A CMOS Linear Systolic Processing
Element". Proc. 13th Australian Microelectronics Conference, pp. 74-79, July 1995.
A. Beaumont-Smith, W. Marwood, K. Eshraghian and C.C. Lim. "The Gallium
Arsenide Implementation of a Systolic Floating Point Processing Element". Proc. 12th
Australian Microelectronics Conference, pp. 255-260, October 1993.
W. Marwood and A. Beaumont-Smith. "The Implementation of a Generalised Systolic
Serial Floating Point Multiplier". Proc. APCCAS '92, IEEE Asia-Pacific Conference on
Circuits and Systems, pp. 513-518, December 1992.
W. Marwood and A. Beaumont-Smith. "The Architecture and Optimisation of Systolic
Ring Processors". Proc. TENCON '92: IEEE Region 10 Conference, pp. 735-739,
November 1992.
W. Marwood, C.C. Lim, K. Eshraghian and A. Beaumont-Smith. "Systolic Matrix
Processor Architecture for Very High Speed Signal Processing". Proc. IREECON
International Convention, 1991.
A. Beaumont-Smith, W. Marwood, C.C. Lim and K. Eshraghian. "Ultra High Speed
Gallium Arsenide Systems: Design Methodology, CAD tools and Architecture". Proc.
Microelectronics '91, I.E.Aust Conference, pp. 85-90, June 1991.
Software:
A. Beaumont-Smith. "EXT2HSP - A conversion program from MAGIC to HSPICE for
GaAs circuits", The University of Adelaide, Adelaide, 1992.
A. Beaumont-Smith. "GAASNET V2.0 - A gallium arsenide network extractor",




































Gate to Channel Spacing
Transconductance Parameter or Number Base
Saturation Factor
Channel Length Modulation Parameter
Drain Voltage Induced Threshold Voltage Lowering Coefficient
Feedback Factor for TOM Model or Skin Depth
Critical Field for Mobility Degradation
Drain to Source Current




Gate to Source Voltage
Drain to Source Voltage
Threshold Voltage
Gate Voltage Exponent
Drain to Source Voltage for the Curtice II model
Diode Ideality Factor
Temperature in Kelvin




k = 1.38062 x 10^-23




Permittivity of Free Space
Permeability of Free Space
Number of Bits per Digit
Sign Bit
Number of Mantissa Bits
Number of Exponent Bits
Number of Systolic Cells
Number of Guard Digits
Number of Digits in a Floating Point Operand
Order of a Square Systolic Array
Number of Systolic Cells in a Systolic Ring
Number of Digit Delay Cells in a Systolic Ring
Number of Circulations of Operands in a Systolic Ring
Total Active Area of a Systolic Processing Array
Active Area of a Processing Element

























Giga Floating Point Operations per Second
High Electron Mobility Transistor
Leaded Chip Carrier
MATrix Reduced Instruction Set Computer
Metal Semiconductor FET
Mega Floating Point Operations per Second
Mega Instructions per Second
Processing Element
Self-Aligned Gate
Super Buffer FET Logic
Source Follower Direct Coupled FET Logic
Silicon
Semi Insulating
Single Instruction, Multiple Data































Matrix processor architecture (MATRISC).
A constant bandwidth mesh connected systolic array.
A performance comparison of a conventional systolic array and a constant
bandwidth array when implementing a block QR factorisation algorithm.
Simulated performance for FIR filters implemented on an order 40
MATRISC processor.
A 3 x 3 engagement processor with input matrices A and B.









GaAs MESFET equivalent circuit.
Cross section of a MESFET device
Simulated I-V characteristics for an EFET, L = 0.8 µm, W = 10 µm.
(a) DCFL inverter, (b) 2 input NOR gate, (c) equivalent circuit.
Drain current for a DFET with Vgs = 0 for 1.2, 2 and 3 µm gate lengths.
Average noise margin of a three input DCFL NOR gate as a function of' W"'
Propagation delay of a DCFL inverter as a function of fan-out (capacitive
load).
2.8 (a) SDCFL inverter, (b) SDCFL inverter with extra supply, (c) equivalent
circuit.
2.9 OR-AND-INVERT (OAI) logic structure.
2.10 SBFL inverter.
Schematic, ring notation and layout for an SDCFL inverter and OAI structure.
(a) Physical layers used for layout and (b) key for ring notation.




































DCFL 3 input NOR gate layout.
SDCFL buffer layout.
SDCFL 2 input OR gate buffer layout.
Inter-nodal capacitance of two neighbouring wires on 100 µm [ε_r = 4 (*)],
450 µm [ε_r = 4 (o)] and 100 µm [ε_r = 8 (□)] thick GaAs SI substrate with a
backplane and dielectric constant for the inter-level dielectric.
Total capacitance of the centre 2 µm wide wire on 100 µm [ε_r = 4 (+)],
450 µm [ε_r = 4 (o)] and 100 µm [ε_r = 8 (□)] thick GaAs SI substrate with a
backplane and dielectric constant for the inter-level dielectric.
Five equally spaced conductors on a GaAs substrate.
Coupling capacitance between electrodes for 5 equal width and spacing
electrodes on 100 µm thick GaAs SI substrate embedded in dielectric (ε_r =
4), 2 µm thick with a backplane metallisation (o = C12, * = C13, □ =
C14, x = C15, △ = C23, * = C24).
Total coupling capacitance for electrodes 1 (o), 2 (+) and 3 (□) for 5 equal
width and spacing electrodes on 100 µm thick GaAs SI substrate embedded
in dielectric (ε_r = 4), 2 µm thick with a backplane metallisation.
Coplanar waveguide (cross section).
Coplanar strips (cross section).
Characteristic impedance of the centre 2 µm wide wire on 100 µm [ε_r =
4 (o)], 450 µm [ε_r = 4 (*)] and 100 µm [ε_r = 8 (□)] GaAs SI substrate with
a backplane and dielectric constant for the inter-level dielectric.
Resistance (10^-2 Ω/µm) of a wire as a function of design rule.
Self plus mutual inductance of the centre 2 µm wide wire on 100 µm [ε_r =
4 (+)], 450 µm [ε_r = 4 (o)] and 100 µm [ε_r = 8 (□)] thick GaAs SI substrate
with a backplane and dielectric constant for the inter-level dielectric.
Mutual inductance of the centre 2 µm wide wire on 100 µm [ε_r = 4 (+)],
450 µm [ε_r = 4 (o)] and 100 µm [ε_r = 8 (□)] thick GaAs SI substrate with
a backplane and dielectric constant for the inter-level dielectric.






























Self plus mutual inductance of 5 equal width and spaced interconnects on
a 100 µm GaAs SI substrate with a backplane (o = L11 = L55, + = L22 =
L44, □ = L33).
Mutual inductance of 5 equal width and spaced interconnects on a 100 µm
GaAs SI substrate with a backplane (o = L12, + = L13, □ = L14,
x = L15, △ = L23, * = L24).
Transient analysis of a 100 µm and 1 mm line modelled as a lumped capacitor.
Transient analysis of a 100 µm and 1 mm line modelled as a lossy TL and
a lumped capacitor.
Transient analysis or. a 2mm and a \mm line modelled as a lossy TL
(signals plotted at start and end of the TL). 57
Transient analysis of a 100 µm and 1 mm line modelled as a lossless TL.
Equivalent model for a LCC pad and bond, L:I' nH, C:40f F' 59
A pipelined four-bit per digit multiplier.




3.3 A digit-serial multiplier array.
3.4 A digit-serial multiplier cell.
3.5 Operand movement through a four cell linear array of recurrence cells.
3.6 Operand movement through a modified four cell linear array of recurrence
cells
3.7 The systolic ring multiply/accumulate processing element.
3.8 The systolic multiply/accumulate cell.
3.9 The logical function of the systolic cell during multiplication.
3.10 The logical function of the systolic cell during denormalisation.
3.11 The AT metric for the systolic ring multiplier.
3.12 The AT metric for the systolic ring multiplier for a continuous model and
a constrained model with the maximum number of systolic cells versus the













3.13 The AT metric for the systolic ring multiplier for a continuous model and
a constrained model with the maximum number of systolic cells versus the
number of bits per digit for m = 64 and e = 16.
3.14 The bandwidth metric for the systolic ring processing element under the
constraint of constant area.





















Data flip-flop 1: schematic of a 6-NOR data flip-flop with a single input.
Data flip-flop 2: schematic of a 6-NOR data flip-flop with D and D̄ inputs.
Data flip-flop 3: schematic of a 6-NOR data flip-flop with clear.
Processing element architecture.
Master-slave half latch (1).
Master-slave half latch (2).
Master-slave half latch (3).
Clock Scheme 1
clock.
Final version of the data flip-flop schematic with clear.
Toggle flip-flop layout.
SDCFL implementation of a full adder using adder half equations with (a)
SDCFL outputs and (b) DCFL outputs.
DCFL implementation of a full adder using adder half equation with DCFL
outputs.
Full adder sum generation circuits.
Full adder carry generation circuits.
4.10 Data flip-flop 4: schematic of a 6-NOR data flip-flop with improved clear.
4.11 Data flip-flop 5: schematic of a 6-NOR data flip-flop with clear, D and
D̄ inputs.
4.12 Ring notation of a GaAs data flip-flop with clear or preset.
4.13 Layout of a GaAs data flip-flop using ring notation.


























Carry generation circuit used in the final design.
Sum generation circuit used in the final design.
Full adder layout.
Nibble-serial multiplier schematic.
Schematic of the systolic cell.
Layout of the systolic cell.
SPICE simulation of the critical path through the digit-serial multiplier.
Schematic of the input pad.
Layout of the input pad.
Simulation of an input pad receiver with VREF = 0.7 V showing input,










and slow-slow-2 process parameters.
4.31 Schematic of the output pad.
4.32 Layout of the output pad.
4.33 Simulation of an output pad showing the voltage response and current
drawn for typical-typical, slow-slow-1 and slow-slow-2 process parameters.
4.34 State transition diagram for the ring controller.
4.35 Circuit schematic of the ring controller.
4.36 Functional simulation of the ring controller using IRSIM.
4.37a Schematic of a 4-bit multiplexer.
4.37b Schematic of the I/O multiplexer.
4.38 Schematic of the flag generation circuit.
4.39 Layout of the flag generation circuit.
4.40 Clock architecture.
4.41 Clock generator layout.
4.42 Clock generator transient simulation for two clock rates.
4.43 (a) A clocked ring where the arrows indicate the delay from the clock
generator to the latch, (b) Clock timing where arrows indicate clock skew
between adjacent latches.
4.44 Two stages of super buffers used to drive the clock tree.










SPICE simulation of the clock distribution across the chip.
Layout of the clock distribution circuit.
Schematic of the systolic ring processing element.
A 4-bit delay element.
A 16-bit delay element used in the systolic ring.
Floorplan of the processing element chip including test structures.
Data flip-flop simulation showing supply current.
Micrograph of the fabricated GaAs systolic PE chip with the floorplan
overlaid.
Top layer of the test fixture PCB.
Bottom layer negative of the test fixture PCB.
Equivalent circuit of a signal driven off chip.
Simulation of a 40 mm long line with no Rs and Rt = 25, 50, 75 and 100 Ω































Simulation of a 40 mm long line with Rt =
widths of 0.3048 mm, 0.381 mm and 0.5 mm.





widths of 0.3048 mm, 0.381 mm and 0.5 mm.
5.8 Simulation of a 40 mm long line with Rt = 25 Ω and Rs = 25 Ω for track
widths of 0.3048 mm, 0.381 mm and 0.5 mm. The PCB is +" thick.
5.9 Simulation of a 40 mm long line with Rt = 25, 50, 75 and 100 Ω, signal is
being driven onto the chip.
5.10 Simulation of the four possible interconnect types on the PCB with 47 Ω
source and terminating resistors.







Cross section through the PCB.
Photograph of the high speed test jig with a chip and heat-sink installed.
External clock input and chip clock output waveforms
Signal p2out and the pad ground showing ground bounce.










5.17 Layout of a fingered enhancement mode MESFET (5 fingers x 74.8 µm wide).
5.18 Layout of a fingered depletion mode MESFET (5 fingers x 74.8 µm wide).
5.19 Photograph of the EFET I-V characteristics from the curve tracer.
5.20 Comparison of measured and simulated EFET I-V characteristics using
typical process parameters.
5.21 Photograph of the DFET I-V characteristics from the curve tracer.
5.22 Comparison of measured and simulated DFET I-V characteristics.
5.23 Instruction nibble for multiplication mode.
5.24 Systolic cell testing in multiplication mode at 50 MHz (zero by zero).
5.25 Systolic cell testing in multiplication mode at 50 MHz.
5.26 Results of the systolic cell in multiplication mode at 50 MHz.
5.27 Test results of the PE in floating point addition mode at 128 MHz.
5.28 Test results of the PE in floating point denormalisation mode at 128 MHz.
5.29a Systolic ring operating in floating point multiplication mode at 91 MHz
where Xin = 0 x e0, Yin = 0 x e0, Pout = 0 x e0.
5.29b Systolic ring operating in floating point multiplication mode at 91 MHz






5.29c Systolic ring operating in floating point multiplication mode at 91 MHz
- where Xin : 0'00010000 X €031, Yn : 0'0045fr100 x eoFs aîd Pou¿ :
0.000000,4'5F x e726. t62
5.29d Systolic ring operating in floating point multiplication mode at 91 MHz
where X¿n : 0.00010488 X €031, Yn -- 0.0045F000 x e0Fs arld Pou¿ :
0.00000048D x er26. 162
5.30a Systolic ring operating in floating point multiplication mode at 128 MHz
where Xin = 0 x e0, Yin = 0 x e0 and Pout = 0 x e0.
5.30b Systolic ring operating in floating point multiplication mode at 128 MHz








5.30c Systolic ring operating in floating point multiplication mode at 128 MHz
where Xin : 04iF9060 X 
"ooB, 
Yn : ¡',45f-6802 x eoFC and Po,¿ -
0.04085C612 x eroT
5.31 Test results of the PE in floating point multiplication mode at 350 MHz.
5.32 Clock generator set for 37.\MHz operation showing feedthrough from the
source (Vdd :1.8914.
5.33 Clock generator set for t\\MHz operation showing measured frequency of
9lMHz.
5.34 Variation of clock frequency with power supply voltage fot CIftate: I'
5.35 Variation of clock frequency with power supply voltage for Clftate :0'
5.36 Variation of peak-peak output voltage with pad power supply voltage for











Micrograph of the fabricated GaAs systolic PE chip.
Voltage swing, delay, rise and fall time measurements.
Noise margin measurement methods using (a) NSC and (b) MPC or MEC
if the rectangle becomes square














1.1 Comparison of GaAs and silicon physical characteristics
2.L Table of simulated DCFL gates with L" : r'2\'m' W¿: ZP'm' vdd: 2v
and ? :70oC
2.2 Table of simulated 3-input SDCFL NOR gates with 2 inputs tied to GND
and a fan-out o1 5, L" : L"e: L"d, : I.2p,m., W"d -- 2p*, Vd'd:2V and
T:70'C. 34
GaAs logic circuit characteristics for DCFL, SDCFL and SBFL
Total and inter-pad capacitance simulation for three adjacent bonding
pads on a SI GaAs substrate
2.5 Pad capacitance for various pad sizes and substrate thicknesses (ε_r = 13.1).





















Characteristics of the data flip-flops.
Area (device)-delay(ps) product for various full adder implementations.
Systolic cell instructions.
Specification of input and output operands flag nibble
Logic length for the two ring oscillator configurations.
Clock control signals.






Assignment of pins to signals, power and ground.
Assignment of pins to signals, power and ground (cont.).






1.1 High Performance Computing
Despite the recent advances in computing performance, there remain many tasks that
are beyond the capabilities of current off-the-shelf computing systems, such as signal
processing [LoLi95], real-time control [Marw90a, KuHw91] and computer simulation for
electronic devices [NaHi91]. This is due to the rapid growth in algorithm development for
better task performance and the general purpose nature of computer architecture design,
so there will remain a niche market for application specific processors in the foreseeable
future.
The recent trends in computer architecture have been toward massively parallel
processors based on Reduced Instruction Set Computer (RISC) nodes with support for vector
processing. The support for matrix algorithms has been at the software level but
matrix computation rates in the range of Gflops are not possible on general purpose RISC
machines due to non-optimal processor architectures and software overheads for matrix
algorithms. A natural extension for these architectures is to provide hardware support
to a RISC processor for a set of matrix operations such as the matrix outer product,
matrix addition and subtraction, element-wise matrix multiplication (Schur or Hadamard
product), tensor product and tensor sum. The matrix product is an order n^3 (O(n^3))
operator whereas matrix addition, subtraction and the element-wise product are O(n^2)
operators. A matrix processor which supports these tasks has been proposed by Marwood
et al. [Marw94, MaLi91]. This processor is referred to as a MATrix Reduced Instruction
Set Computer (MATRISC) due to similarities to the RISC philosophies. The core of
the processor architecture is characterised by a two dimensional mesh connected systolic
array of parallel processing elements (PEs) as shown in Figure 1.1 [Marw94]. The systolic
array is fed operands on two wavefronts by address generators from a high bandwidth
main memory. The address generators are programmable devices that map the matrix
operands from the linear main memory into the systolic array. The RISC/CISC CPU is
used to perform scalar operations on the data since the array has significant performance
penalties for simpler operations. Caches are used to increase the performance of the
system through re-use of input operands and results without accessing main memory.
The system bus links the main memory to the address generators, caches and a host
workstation.
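The difference between the O(n^3) matrix product and the O(n^2) element-wise operations mentioned above can be illustrated with a minimal Python sketch. The function names `matmul` and `hadamard` and the explicit operation counters are illustrative assumptions and are not part of the MATRISC design; the sketch only counts arithmetic steps on small square matrices.

```python
def matmul(a, b):
    """Matrix product of two n x n matrices, counting multiply-accumulate steps."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                macs += 1          # one multiply-accumulate per (i, j, k)
    return c, macs


def hadamard(a, b):
    """Element-wise (Schur or Hadamard) product, counting multiplications."""
    n = len(a)
    c = [[a[i][j] * b[i][j] for j in range(n)] for i in range(n)]
    return c, n * n                # one multiplication per element


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul(a, b))    # product and n^3 step count
    print(hadamard(a, b))  # element-wise product and n^2 step count
```

For a fixed n the product needs n times as many arithmetic steps as the element-wise operations, which is why the matrix product dominates the computational load.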
Hardware supported matrix operations provide an architecture [Marw94] which:
• provides a well defined framework for problem definition and expression
• allows serial code to implement scalar, vector or matrix algorithms
• utilises a well defined RISC architecture to describe the parallel architecture
• improves on current implementations by orders of magnitude in performance using
current technologies
The implementation of the PEs in the two dimensional systolic array uses systolic
ring techniques to achieve scalable floating point precision, size and computation time.
The PE architecture is described in Chapter 3 and allows PEs of different size
and speed to be used as a function of their location in the systolic array. This is done
by trading off the length of the floating point representation and the number of computation
cells against the time taken to complete a floating point computation. It is then possible
to match the order of the array to the matrix problem and maintain constant bandwidth
as faster elements are placed towards the origin of the array as indicated in Figure 1.2
[Marw94]. For example, an array of order N/2 would contain PEs with computation rates
twice as fast as an array of order N to maintain constant memory bandwidth and hence



































Figure 1.2: A constant bandwidth mesh connected systolic array
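The constant bandwidth idea can be sketched numerically. The following Python fragment assumes a simple model that is not taken from the thesis: a square array of a given order consumes one operand per edge PE on each of its two input wavefronts for every PE computation, so the operand bandwidth is proportional to the order multiplied by the PE computation rate. The function names are hypothetical.

```python
def required_bandwidth(order: int, pe_rate: float) -> float:
    """Operands per second drawn from memory under the assumed model:
    two wavefronts of `order` operands each, once per PE computation."""
    return 2 * order * pe_rate


def pe_rate_for_bandwidth(order: int, bandwidth: float) -> float:
    """PE computation rate that exactly uses a given operand bandwidth."""
    return bandwidth / (2 * order)


if __name__ == "__main__":
    full = required_bandwidth(order=40, pe_rate=1.0e6)
    # Halving the order leaves room for PEs that run twice as fast.
    print(pe_rate_for_bandwidth(order=20, bandwidth=full))
```

Under this assumed model a smaller array built from proportionally faster PEs draws the same bandwidth from main memory as a larger array of slower PEs.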
Figures 1.3 and 1.4 show the simulated performance of a MATRISC processor presented
in Marwood [Marw94]. The performance of the processor in Figure 1.3 is nearly doubled
ro T1yMfl,ops by using a constant bandwidth array. The irregularities in the simulation
are due to cache and memory behaviour. In Figure 1.4 a peak computation rate of 6
Gflops is achieved for an array of order 40.
For such a processor to support high computation rates, the technology for realisation of
the PEs has a significant impact on the architecture and hardware realisation. Gallium
Arsenide (GaAs) Metal Semiconductor FET (MESFET) technology is a contender for
implementing high performance electronic components to rival silicon Emitter Coupled
Logic (ECL) in speed and power performance [LoBu89, Eshr91]. Integration densities
have also improved dramatically in recent years and more than one million devices may
be integrated on a single chip in a gate array [Vite92].
In the following sections an overview of systolic arrays, matrix processing algorithms and



















Figure 1.3: A performance comparison of a conventional systolic array and a constant
bandwidth array when implementing a block QR factorisation algorithm (horizontal axis:
matrix order).




Figure 1.4: Simulated performance for
FIR filters implemented on an order 40
MATRISC processor.
1.2 Systolic Processing for Matrix Computations
Systolic arrays have regular and modular structures that match the computational
requirements of many algorithms. Their implementation requires that a wealth of subsumed
concepts and engineering solutions be mastered and understood.
- J.Fortes and B.Wah, 1987: p.12 [FoWa87].
This statement reflects the fact that systolic array design covers many inter-related dis-
ciplines including mathematics, VLSI and Computer Architecture. Systolic arrays were
first proposed by Kung and Leiserson in 1978 [KuLe78] as a parallel processing technique.
Since this time many different systolic array algorithms and architectures have been pro-
posed for applications including matrix arithmetic, signal processing, image processing,
language recognition, relational database operations, data structure manipulation and
character string manipulation [JoHu93, Kung88]. The term 'systolic' refers to the way
data is pipelined rhythmically along the communication channels between an array of
nodes. The nodes that the data visit may be arranged in a single- or multi-dimensional
array with a fixed or configurable interconnection structure. The idea is to re-use data
already entered into the systolic array as it passes through the pipeline to achieve very






faster than traditional computer architectures since each operand is re-used inside the
array and not fetched and written to memory each time it is used. This has an advantage
over conventional processor architectures by using a substantially lower memory band-
width. This implies that the computation performance of bandwidth limited systems can
be increased through the use of systolic arrays.
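The bandwidth advantage of operand re-use can be made concrete with a small count of memory reads for an n x n matrix product. This sketch assumes an idealised comparison that is not taken from the thesis: the conventional processor has no cache and fetches both operands for every multiply-accumulate, while the systolic array loads each matrix element exactly once. The function names are illustrative.

```python
def reads_conventional(n: int) -> int:
    """Operand reads when both operands are fetched for each of the n^3 MACs."""
    return 2 * n ** 3


def reads_systolic(n: int) -> int:
    """Operand reads when each element of A and B enters the array once."""
    return 2 * n ** 2


def reuse_factor(n: int) -> float:
    """Ratio of conventional to systolic operand reads."""
    return reads_conventional(n) / reads_systolic(n)


if __name__ == "__main__":
    for n in (4, 40, 400):
        print(n, reads_conventional(n), reads_systolic(n), reuse_factor(n))
```

In this idealised model the saving grows linearly with the matrix order, since each operand is used n times once it is inside the array.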
1.2.1 Algorithms and Architecture
Examples of matrix based systolic array architectures include the inner product step pro-
cessor of Kung and Leiserson [KuLe78] which was a fundamental building block that could
be configured as a linear, orthogonally- (mesh) or hexagonally-connected array. The
proposed matrix based algorithms included matrix-vector multiplication using a linearly
connected network, matrix multiplication using a hexagonal array and LU-decomposition
of a matrix using a hexagonal array [MeCo80]. An inner product accumulate systolic cell
connected in a mesh was proposed by Whitehouse and Speiser [WhSp81] in 1981. This
was called the engagement processor and each cell at position (i, j) in the array computes
the inner product $c_{ij}$, which is stored in the cell and given by
$$c_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj}$$
where $a_{ik}$ and $b_{kj}$ are the elements of the matrices A and B. A 3 x 3 engagement processor
is shown in Figure 1.5.
The efficiency of this systolic array approaches 100% for large matrices under the
assumption that end effects at the start and end of the matrix operation are ignored.
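The inner product accumulation performed by a mesh of cells can be sketched as a cycle-by-cycle simulation. The data flow used here is an assumption chosen for illustration (rows of A enter from the left edge, columns of B enter from the top edge, each skewed by one cycle per row or column) and is not claimed to be the exact scheme of [WhSp81]; the function name `engagement_multiply` is hypothetical.

```python
def engagement_multiply(A, B):
    """Simulate an n x n mesh in which cell (i, j) accumulates sum_k A[i][k]*B[k][j]."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]      # c_ij stored in each cell
    a_reg = [[0] * n for _ in range(n)]    # operand moving left to right
    b_reg = [[0] * n for _ in range(n)]    # operand moving top to bottom
    for t in range(3 * n - 2):             # cycles until the last operand pair meets
        # Shift operands one cell to the right and one cell down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed wavefronts at the array boundary (zeros outside the matrix).
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every cell performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


if __name__ == "__main__":
    print(engagement_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

With this skew, cell (i, j) sees the operand pair with index k = t - i - j on cycle t, so matching elements of A and B always arrive together and no operand is fetched from memory more than once.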
1.2.2 Granularity
The operation performed in each cycle by a PE can range from a bit-wise to a word-level
operation such as multiplication and addition. This is the granularity of the systolic array.
Bit level systolic arrays have low (or fine) granularity and use bit-serial data transfers
between PEs. This work, as reported in [DeRe85], has a low Input/Output (I/O)
requirement which is attractive for I/O constrained systems; however, this limits the throughput










Figure 1.5: A 3 x 3 engagement processor with input matrices A and B.
serial arithmetic, however, is an area-time efficient method of performing high speed
arithmetic calculations [HaCo90, CoHa92]. The digit size can be appropriately chosen for the
throughput to match the design needs. Configurable systolic architectures such as the
Configurable Highly Parallel (CHiP) computer [Snyd82] overcome the difficulty of being
limited to a fixed array architecture as the connections between PEs are configurable
through switches to suit the algorithm. Special-purpose single chip processors which can
be used to build systolic arrays, such as the Programmable Systolic Chip [FiKu83] and
the Orthogonal Multi-Processor [HwDu90] based on i860 processors, are flexible in the
types of algorithms that can be run but too complex to implement since each node must
be programmed. These are examples of systems with high (or coarse) granularity.
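The trade-off between digit size and throughput can be illustrated with a digit-serial adder, in which one digit of each operand is processed per cycle and the carry is passed to the next cycle. This is a minimal Python sketch under stated assumptions (unsigned operands, least significant digit first, word width a multiple of the digit size); the names `to_digits` and `digit_serial_add` are illustrative and do not come from the thesis.

```python
def to_digits(value: int, width: int, digit_bits: int) -> list:
    """Split an unsigned `width`-bit word into digits, least significant first."""
    assert width % digit_bits == 0, "sketch assumes width is a multiple of the digit size"
    mask = (1 << digit_bits) - 1
    return [(value >> (d * digit_bits)) & mask for d in range(width // digit_bits)]


def digit_serial_add(x: int, y: int, width: int, digit_bits: int):
    """Add two words one digit per cycle; returns (sum, carry_out, cycles)."""
    mask = (1 << digit_bits) - 1
    carry = 0
    total = 0
    cycles = 0
    for cycle, (a, b) in enumerate(zip(to_digits(x, width, digit_bits),
                                       to_digits(y, width, digit_bits))):
        s = a + b + carry                          # one digit-wide addition per cycle
        total |= (s & mask) << (cycle * digit_bits)
        carry = s >> digit_bits                    # carry saved for the next digit
        cycles = cycle + 1
    return total, carry, cycles


if __name__ == "__main__":
    for bits in (1, 4, 16):                        # bit-serial, nibble-serial, word-parallel
        print(bits, digit_serial_add(0xABCD, 0x1234, 16, bits))
```

A one bit digit corresponds to bit-serial arithmetic and needs as many cycles as there are bits in the word, while a wider digit needs fewer cycles at the cost of a wider adder and more I/O lines per PE.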
1.2.3 Implementations and Issues
Since systolic arrays are regular two dimensional structures of nodes or PEs, they lend
themselves very well to Very Large Scale Integration (VLSI) implementation. This
presents a number of VLSI design challenges since the order of systolic arrays tends to
be large and the connections between the nodes can be complex as in the case of a recon-
figurable array. Implementation techniques such as Wafer Scale Integration (WSI) and








These bring the circuits physically closer together which enables them to achieve higher
throughput.
WSI is the process used to integrate and interconnect circuits on a single processed wafer
where the individual chips are not cut from the wafer and packaged. An MCM is a
structure housing two or more integrated circuits electrically connected to a common circuit
base and interconnected by conductors in that base. WSI is simpler for the construction
of large systolic arrays all using the same technology but the number of defects in the sil-
icon substrate is constant. This implies a high probability that at least one node will not
work in a WSI system, so fault tolerance through redundancy or array reconfiguration is
required. The identification of a non functional PE using self testing and the replacement
and/or bypassing of it become design considerations as the order of the systolic array
grows [LiJe89]. Kanopoulos [Kano85] describes a bit-serial systolic array for signal pro-
cessing applications that uses a self testing scheme and a voting circuit to identify single
permanent faults and isolate a particular storage or arithmetic unit during processing.
This increases the control overhead and circuit complexity, ultimately resulting in a
performance penalty. Furthermore, integration of memory chips and specialised processor
chips is not possible using WSI if different process technologies are used. MCMs however
are more difficult and expensive to construct but can be repaired since individual
components may be replaced. Another advantage of MCM technology is being able to test
chips before they are assembled.
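The argument that a large WSI array almost certainly contains a faulty node can be illustrated with a short sketch. It assumes that node faults are independent and that every node has the same fault probability; the function name and the 1% per-node figure are hypothetical and are not taken from the references above.

```python
def prob_at_least_one_faulty(n_nodes: int, p_node_fault: float) -> float:
    """Probability that at least one of n_nodes nodes is faulty.

    Assumes independent faults with identical per-node probability
    (a simplification; real wafer defects tend to cluster).
    """
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    if not 0.0 <= p_node_fault <= 1.0:
        raise ValueError("p_node_fault must lie in [0, 1]")
    return 1.0 - (1.0 - p_node_fault) ** n_nodes


if __name__ == "__main__":
    # Hypothetical 1% per-node fault probability, increasing array order.
    for n in (16, 64, 256, 1024):
        print(n, prob_at_least_one_faulty(n, 0.01))
```

Under these assumptions the probability rises quickly with the number of nodes, which is why redundancy or reconfiguration becomes a design consideration as the order of the array grows.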
A recent MCM implementation of a systolic array for matrix computation was the SCalable
Array Processor (SCAP) [ClCl92, Marw94, MaCl95], which is the first systolic co-processor
subsystem to implement the set of matrix operations {multiplication, addition,
element-wise multiplication, transposition, permutation}. SCAP uses an IEEE single
precision floating point data format and is coupled to a SUN SPARCstation 1. The processor
module has four hundred 1.7 Mflops processors and the system can perform matrix
products around 150 times faster than a SUN SPARCstation 1. The scalable array of
PE chips and data formatting chips were implemented in 1.2 μm CMOS. Each PE chip
contained a 4 x 5 array of PEs in a mesh connected array. The ceramic MCM was
constructed using MCM-C technology and contained a 5 x 4 array of PE chips which were
wire bonded in place. A failure rate analysis of the completed MCMs gave a yield of
around 30%. Given the high cost of producing each processor array, there was a high
probability that at least one PE chip would have to be replaced or repaired. It is not
possible to repair WSI systems in this manner.
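The SCAP figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the second function assumes that each assembled module passes independently with probability equal to the quoted yield, which the SCAP references do not state.

```python
def total_pes(chip_rows: int, chip_cols: int, pe_rows: int, pe_cols: int) -> int:
    """Number of PEs in a module built from a grid of identical PE chips."""
    return chip_rows * chip_cols * pe_rows * pe_cols


def expected_builds_per_good_module(module_yield: float) -> float:
    """Mean number of modules assembled per working module.

    Assumes independent builds, each working with probability module_yield.
    """
    if not 0.0 < module_yield <= 1.0:
        raise ValueError("module_yield must lie in (0, 1]")
    return 1.0 / module_yield


if __name__ == "__main__":
    # 5 x 4 array of PE chips, each holding a 4 x 5 array of PEs.
    print(total_pes(5, 4, 4, 5))
    # Yield of around 30% for completed MCMs.
    print(expected_builds_per_good_module(0.30))
```

The PE count agrees with the four hundred processors quoted for the module, and the low yield shows why the ability to replace individual chips matters for MCMs.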
Hein et al. [HeZi87] report on the design of a GaAs systolic array for an adaptive null
steering beamformer. The processing array was configured as a SIMD machine and
specially designed parts included a 32-bit GaAs ALU, a 500 Mbit per second (bps) Manchester
encoder, a 200 MIPS, RISC, 8-bit microprocessor and a Manchester decoder. The
clock rate was L2\MHz and the system updates the coefficients for the multiple beams
every 5 μs. This real-time performance was unmatchable at that time by any realisable
uniprocessor system.
Fouts and Butner [FoBu91] proposed GASP, a GaAs supercomputer which contained a
hexagonally connected homogeneous systolic array of PEs. The design uses 32-bit integer
arithmetic and has a peak computation rate of 30,000 MIPS with 65 PEs. An MCM
solution was proposed for integration of GASP with a 500 MHz instruction issue clock and
1 GHz subsystem clocks. System simulation predicted an improvement in performance
by a factor of 8.3 and 457 over a Sun 4/280 for heap sorting and Gaussian elimination
algorithms, respectively. Problems identified with the design were the availability of
high density, high speed (2 ns) RAMs and the relatively low pin count on hybrid modules,
which limits data transfer rates. The processor dissipates 680 W and requires refrigerant cooling.
Most systolic array designs to date use integer arithmetic in the PEs. Some exceptions
include the SAXPY Matrix-1 [FoSc87] and the Warp computer [AnAr87]. For general
purpose use and to be compatible with existing RISC workstations, the PEs of the
proposed MATRISC processor need to comply with the IEEE-754 floating point standard
[ieee85].
1.3 Gallium Arsenide Technology
Gallium arsenide is the technology of the future, always has been, always will be
(humorous moment at a conference, source unknown, circa 1989)
The group III-V compound gallium arsenide (GaAs) was first discovered in 1926, but
its high speed potential as a semiconductor was not realised until the 1960's [PuEs88].
The first GaAs analogue products appeared in the 1970's with the development of IC
fabrication, and the advances in ion implantation in the 1980's have made digital GaAs
VLSI technology a commercial reality in the 1990's. GaAs will not be a technology for
mainstream computer and systems applications in the foreseeable future due to continuing
advances and research and development support for CMOS technology. The
characteristics of GaAs make it suitable for specific niche applications where it shows a clear
advantage over silicon implementations. These applications include communications
devices such as an optical fibre front end that processes high speed serial data, automotive
sensors and specialised high speed computers [Dyks90] such as the CRAY-3 [KiHe97].
Figure 1.6 shows a comparison of the speed versus power characteristics for GaAs, CMOS,
BiCMOS, nMOS and ECL technologies. This shows that GaAs would be a favourable
choice of technology where outright speed or speed and power are critical design
parameters for a particular problem.
The advantages of GaAs material over silicon include: [LoBu89, Eshr91, Vite92, TriQ91,
Rocc90, Giga91, Beau93]
o a six to seven times higher electron mobility than silicon. MESFETs with typical
gate lengths of around 0.8 μm with transit times as small as 10 to 15 ps produce
current gain-bandwidth products in the range of 15 to 25 GHz. This is a three to
five times improvement over silicon.
o smaller interconnect capacities than silicon as a consequence of the substrate being
a semi-insulator rather than a semiconductor.
o higher electron saturation velocity at lower electric field strengths than silicon. This
implies faster switching speeds and up to a 70% reduction in power dissipation over
silicon ECL.
Figure 1.6: Speed versus power for GaAs, CMOS, BiCMOS, nMOS and ECL technologies
(horizontal axis: power dissipation per gate, 100 μW to 100 mW)
o smaller speed-power product than silicon (ECL).
o the direct bandgap of GaAs allows the efficient radiative recombination of carriers.
This provides a mechanism for integrated high bandwidth optical communications.
o there is no gate oxide to trap charges which makes the device more ionising radiation
resistant than silicon. This is of benefit in space-borne applications.
o GaAs devices are more temperature tolerant due to the larger bandgap (1.42 eV) of
the material.
Disadvantages associated with GaAs circuits lie in problems with the device physics
which mainly result in high fabrication costs and low yield. The disadvantages and
reasons include:
o lower yield than achievable with silicon due to a large density of dislocations in the







and threshold voltage over the wafer. Threshold voltage should be controlled to
less than 20mV between devices, otherwise the circuit may become inoperable.
o there is no gate oxide in a MESFET to isolate the gate, therefore the gate may only
be forward biased to around 0.7 to 0.8 V before large currents begin to flow. This
limits the voltage swing for many logic classes and makes them incompatible with
other technologies such as CMOS.
o different device offset voltages produce an intrinsic bias in different parts of the
circuit which degrades both the noise margin of logic circuits and the yield. Device
offset in GaAs is caused by threshold variation, component mismatch and low
frequency (1/f) noise. Material non-uniformity and threshold variation are the
main contributors.
o backgating or sidegating in GaAs circuits causes a reduction in the drain current of
a device when the substrate (backgate) or a neighbouring device (sidegate) is biased
negatively with respect to the source of that device. This causes an increase in the
size of the space-charge layer at the channel/substrate interface which shifts the
threshold voltage of the device higher. The effect also depends on the distance between
active devices and can be reduced through ion implantation and mesa etching
around active regions. The alternative is to space devices farther apart, which is
good neither for device matching nor for circuit density.
o drain current hysteresis effects due to charges stored in the substrate traps. These
are frequency dependent and have the most influence at frequencies less than 100 Hz.
Table 1.1 summarises the physical characteristics of GaAs and silicon [Glon88, Sze83].
1.3.1 Gallium Arsenide Devices
There are two distinct generations of GaAs devices [PuEs88]. First generation devices
have typical switching delays of 70ps for a simple inverter and power dissipations around
0.1 to 0.2W. GaAs devices include:






















Physical property                                  GaAs    Si
electron mobility (cm²/V s)
maximum electron drift velocity (cm/s)
hole mobility (cm²/V s)
energy gap (eV)
gap type
density of states in conduction band (cm⁻³)
maximum resistivity (Ω cm)
minority carrier life (s)
breakdown field (V/cm)
Schottky barrier height (V)
Table 1.1: Comparison of GaAs and silicon physical characteristics
o enhancement mode JFET,
o complementary enhancement mode JFET
First Generation Devices
Complementary GaAs logic suffers from a poor P-type transistor due to its low mobility,
so high performance logic has been restricted to normally-off and normally-on classes
employing MESFETs. The two types of MESFET are enhancement and depletion mode.
The enhancement mode MESFET has a positive threshold voltage and the depletion mode
MESFET has a negative threshold voltage. In the fabrication process for MESFETs,
conductive transistor channels are formed by implanting silicon atoms into the substrate.
A two step implant scheme is used to pattern the channels for enhancement and depletion
mode devices, where the threshold voltage is adjusted by the depth of the N- implant in the
channel region. A refractory metal is then deposited to form the gate and a high dose N+
implant is used to lower the resistance in the source and drain regions. Ohmic contacts for
the source and drain connections are subsequently formed. The following steps include
the patterning and deposition of dielectric films and interconnect metallisation. Since
MESFET fabrication is a planar process, up to four layers of aluminium interconnect can
be used with some processes. The GaAs MESFET has a similar lithographic
process to silicon and the metallisation process is identical (except for airbridges). Eleven
mask steps are needed for a three metal process, which is half the number of mask layers
required for silicon ECL. Further information on fabrication techniques for MESFETs
has been presented elsewhere [PuEs88, TriQ91, Rocc90, LoBu89, Vite92, Giga91].
The MESFET is the most mature GaAs device. Millions of devices can be
integrated onto a single chip [Vite92] and many standard MESFET based products are
available for applications such as encoding/decoding, multiplexing, crosspoint switches
and gate array products.
Second Generation Devices
Second generation devices include the High Electron Mobility Transistor (HEMT) and
Heterojunction Bipolar Transistor (HBT). These devices have different structures to
achieve up to five times higher electron mobility than the first generation. For example,
typical depletion mode Pseudo-morphic HEMT (PM-HEMT) devices have short
circuit current gain-bandwidth products ($f_T$) of around 50 to 100 GHz. Of these devices
the HEMT holds the most promise for future digital GaAs implementations. The HEMT
device was first developed in 1980 [HiLa86] and has since progressed rapidly. These
sub-micron gate devices have less than 10 ps switching delays at 300 K with a typical $f_T$ of
around 70 GHz for quarter micron devices. HEMTs exploit the superior transport
properties of electrons moving along the heterojunction interface between two lattice matched
compound semiconductor materials which have been grown using molecular beam
epitaxy. HEMTs are also known as two dimensional electron gas FET (TEGFET),
modulation doped FET (MODFET) and selectively doped heterojunction transistor (SDHT),
depending on the process or resultant device characteristics. They offer superior gain
and speed among most known semiconductor devices [HiLa86].
Other more recent devices include the semiconductor-insulator-semiconductor FET
(SISFET) and heterostructure insulated-gate FET (HIGFET). HEMTs are suitable for use
in analogue MMIC and high speed digital circuits and have performance benefits over
GaAs MESFETs including low noise, low power and high speed. Logic classes applicable
to HEMT include high/low power buffered FET logic (BFL), direct coupled FET logic
(DCFL, for E/D mode processes), high/low power for both E/D and D mode source
coupled FET logic (SCFL) and capacitor enhanced logic (CEL). LSI/VLSI applications
are possible and some circuit applications employing HEMT devices include
multipliers [Ber91, TaNi92], static RAMs (SRAMs), ALUs and demultiplexers/multiplexers
[Nowo91]. Other applications include high definition television and optical fibre
telecommunication systems such as SONET.
Both generations of GaAs devices employ Schottky barrier diodes for tasks such as logic
level shifting and ESD protection. Schottky barriers can be made with metals such as
aluminium, platinum and titanium. They have low reverse currents (< 1 A/cm²) and
near-unity ideality factors (< 1.1).
Digital HEMT processes are not as mature as the digital MESFET processes and VLSI
circuits suitable for computer systems applications are still currently the domain of the
MESFET. The remainder of this thesis focuses on the Enhancement/Depletion (E/D)
MESFET process.
1.4 Contribution of the Thesis
Due to the high speed nature of GaAs, diversions from traditional VLSI design principles
are required at the transistor and chip architectural level. In deriving architectures
suitable for high speed systems, the following problems must also be addressed:
o selection of logic classes suitable for high speed systems
o transmission line modelling of longer interconnects
o minimisation of crosstalk and inductive spikes through architecture design
o clock distribution across large chips, guaranteeing synchronism
o multi-chip interconnection
o thermal management of chips
The research in this thesis addressed these issues and the main contributions include:
o an evaluation of digital GaAs MESFET logic classes
o characterisation and modelling of GaAs interconnects and parasitics
o the design and optimisation of a matrix processor suitable for implementation in
GaAs
o the design of a floating point systolic PE in GaAs
o testing of the high speed GaAs PE
1.5 Outline of the Thesis
Chapter 2 presents a discussion of digital GaAs circuits, design methodology and circuit
models. GaAs MESFET devices and digital GaAs logic classes suitable for
implementation are reviewed. A layout strategy called 'ring notation' is developed for the physical
layout of the circuit primitives. An analysis of GaAs interconnect structures and par-
asitics is presented to characterise the process for accurate modelling and simulation.
CAD tools are discussed which were developed to help facilitate GaAs circuit design by
modification of existing silicon CAD tools.
In Chapter 3 digit-serial multiplication is reviewed and a parallel digit-serial multiplier
presented for use in a systolic multiplier cell. A class of systolic ring PE is proposed
using the systolic multiplier cell for floating point multiplication and accumulation of two
input operands. A performance metric is derived by minimising the job time for a matrix
product on a two-dimensional mesh connected array of PEs. The performance metric
is evaluated for a target GaAs technology and number representation to determine the
optimum PE architecture. The required memory bandwidth for an array of PEs is also
discussed.
Chapter 4 presents the circuit design, layout, simulation and implementation of the
GaAs systolic ring PE. Basic circuit elements including data flip-flops, a full adder, a





are used to build the parts of the PE including the systolic cell, ring controller, flag
generator and I/O multiplexer. A variable frequency clock generator is implemented to allow
testing of the chip at different clock rates. A clock and power distribution system is
also designed and implemented. Finally, the chip floorplan and details of the
fabricated PE chip are presented.
Chapter 5 reviews the testing of the systolic ring PE chip. A test fixture is designed
and constructed to facilitate testing of the PE chip. MESFET test devices are measured
to characterise and verify the models used in the design of the PE chip. A test
procedure is developed for low and high speed functional testing of the PE using a digital test
system. This test procedure is subsequently used to verify the operation of the PE chip.
The clock generator output frequency is measured over its range of operation and as a
function of supply voltage to further characterise the process.











This chapter presents a discussion of digital GaAs circuit simulation and design method-
ology as the basis for designing the processing element chip. GaAs MESFET models
are reviewed and digital GaAs logic classes suitable for implementation are presented
and optimised for speed, area and noise margin. Circuit parasitics are also investigated
and transmission line models are discussed for high speed interconnections on a GaAs
substrate. Layouts of circuit primitives are designed using these results.
2.1 GaAs MESFET Device Modelling
GaAs digital circuits have small voltage swings and accurate modelling in all regions is
required to predict circuit performance [LoBu89, Wing90]. The GaAs MESFET is
characterised by operation in several regions: cutoff, where the channel is pinched off by the
gate depletion region and no drain current flows; the linear region, where the behaviour is
similar to a resistor; and saturation, where the behaviour is similar to a current source
due to velocity saturation of the electrons in the channel. The inverse and subthreshold
regions are of secondary importance.
The parameters in the MESFET models may not correspond to
physical device parameters because a purely algebraic representation of the device
has been used to curve fit the actual characteristics. The models which are used for








Mathematical models suitable for simulation of the MESFET have been developed for
use in SPICE-like circuit simulators using the JFET equations as a basis. Figure 2.1












Figure 2.1: GaAs MESFET equivalent circuit.
uses a hyperbolic tangent function which fits all regions of the model and is continuous in
all of its derivatives. The most developed equation is the Statz-Raytheon model [StNe87]
which includes modelling effects due to velocity saturation.
2.1.1 Drain Current
Modelling of MESFET drain current characteristics is performed by curve fitting a
formula to data and a range of MESFET compatible models are commonly available in most
SPICE simulators. The drain current is set to zero for the cutoff region ($V_{gs} < V_{T0}$) and
the equations for the linear and saturation regions are given below. A description of the
parameters in the equations is given in the "List of Symbols" at the start of this thesis.
o Curtice model [Curt80]
$$I_{ds} = \beta (V_{gs} - V_{T0})^{2} (1 + \lambda V_{ds}) \tanh(\alpha V_{ds})$$
o Curtice model with user-defined gate voltage exponent and $V_{gs}$ in the hyperbolic
tangent function
$$I_{ds} = \beta (V_{gs} - V_{T0})^{\mathrm{VGEXP}} (1 + \lambda V_{ds}) \tanh\left(\frac{\alpha V_{ds}}{V_{gs} - V_{T0}}\right)$$
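o Statz-Raytheon model [StNe87]
For $V_{ds} < 3/\alpha$
$$I_{ds} = \frac{\beta (V_{gs} - V_{T0})^{2}}{1 + b (V_{gs} - V_{T0})} (1 + \lambda V_{ds}) \left[1 - \left(1 - \frac{\alpha V_{ds}}{3}\right)^{3}\right]$$
For $V_{ds} > 3/\alpha$
$$I_{ds} = \frac{\beta (V_{gs} - V_{T0})^{2} (1 + \lambda V_{ds})}{1 + b (V_{gs} - V_{T0})}$$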
o Meta software variable saturation model [Meta92]
This is the same as the Statz-Raytheon model except that more flexibility has been
allowed by parameterising the gate voltage exponent (VGEXP) and the saturation
exponent, which is 3 in the Statz-Raytheon model. A more flexible model can be made by
building a hybrid model from the Curtice, Statz-Raytheon and TriQuint's Own Model (TOM)
models [Goli91]. Note that Golio [Goli91] states that the saturation function is the
hyperbolic tangent function, which is contrary to the HSPICE manual, although
the cubic saturation function is a truncated Taylor series representation of the tanh
function.
$$I_{ds} = \frac{\beta (V_{gs} - V_{T0})^{\mathrm{VGEXP}}}{1 + b (V_{gs} - V_{T0})} (1 + \lambda V_{ds}) \left[1 - \left(1 - \frac{\alpha V_{ds}}{\mathrm{SATEXP}}\right)^{\mathrm{SATEXP}}\right]$$
o TOM [McCa90]
For $V_{ds} < 3/\alpha$
$$I_{ds} = \frac{I_{ds0}}{1 + \delta V_{ds} I_{ds0}}$$
where
$$I_{ds0} = \beta (V_{gs} - V_{T0} + \gamma V_{ds})^{\mathrm{VGEXP}} \left[1 - \left(1 - \frac{\alpha V_{ds}}{3}\right)^{3}\right]$$
It is unclear whether the equation has been implemented correctly in HSPICE
because the manual [Meta92] does not state any formula. HSPICE may have
implemented the tanh function or the cubic approximation. The original paper states
the former.
The common $I_{ds}$ and capacitor parameters for the Statz capacitor model may be made
independent of the $I_{ds}$ model. Some researchers such as Golio [Goli91] have incorrectly
stated the TOM model by putting the hyperbolic tangent function outside of the feedback
equation for $I_{ds}$. In the original paper [McCa90], this function is inside the feedback
equation.
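The drain current models discussed in this section can be compared numerically with a short sketch. It uses the commonly published forms of the Curtice, Statz-Raytheon and TOM equations, not the HSPICE source, so it should be read as an illustration under stated assumptions: all function names and parameter values are hypothetical, `b` is the Statz doping-profile parameter, `q` stands in for the gate voltage exponent, and $V_{ds} \ge 0$ is assumed throughout.

```python
import math


def cubic_saturation(vds: float, alpha: float) -> float:
    """Statz cubic saturation term: 1 - (1 - alpha*vds/3)^3, clamped to 1 above 3/alpha."""
    if vds >= 3.0 / alpha:
        return 1.0
    return 1.0 - (1.0 - alpha * vds / 3.0) ** 3


def curtice_ids(vgs, vds, beta, vt0, lam, alpha):
    """Quadratic Curtice model with tanh saturation; zero in cutoff."""
    vgst = vgs - vt0
    if vgst <= 0.0:
        return 0.0
    return beta * vgst ** 2 * (1.0 + lam * vds) * math.tanh(alpha * vds)


def statz_ids(vgs, vds, beta, vt0, lam, alpha, b):
    """Statz-Raytheon form: Curtice prefactor divided by (1 + b*vgst), cubic saturation."""
    vgst = vgs - vt0
    if vgst <= 0.0:
        return 0.0
    return (beta * vgst ** 2 / (1.0 + b * vgst)
            * (1.0 + lam * vds) * cubic_saturation(vds, alpha))


def tom_ids(vgs, vds, beta, vt0, alpha, gamma, delta, q=2.0):
    """TOM form: threshold shifted by gamma*vds, feedback term delta*vds*ids0."""
    vgst = vgs - vt0 + gamma * vds
    if vgst <= 0.0:
        return 0.0
    ids0 = beta * vgst ** q * cubic_saturation(vds, alpha)
    return ids0 / (1.0 + delta * vds * ids0)


if __name__ == "__main__":
    # Hypothetical parameter values chosen only to exercise the functions.
    for vds in (0.1, 0.5, 1.0, 2.0):
        print(vds,
              curtice_ids(0.6, vds, 2.5e-4, 0.2, 0.1, 3.0),
              statz_ids(0.6, vds, 2.5e-4, 0.2, 0.1, 3.0, 0.5),
              tom_ids(0.6, vds, 2.5e-4, 0.2, 3.0, 0.05, 0.2))
```

In this sketch the saturation term sits inside `ids0`, so it is also inside the feedback expression, which matches the placement described for the original TOM paper.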
2.1.2 Diode
There are two Schottky gate diodes, one from gate to source and the other from gate to
drain in MESFET devices. The diode current is given by:
$$I_D = I_s \left[\exp\left(\frac{q V_d}{N k T}\right) - 1\right]$$
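The diode equation can be evaluated directly, as in the following sketch. The saturation current and ideality factor used in the example are hypothetical values chosen for illustration and are not taken from any process data.

```python
import math

K_BOLTZMANN = 1.380649e-23     # J/K
Q_ELECTRON = 1.602176634e-19   # C


def schottky_diode_current(vd: float, i_s: float, n: float = 1.1,
                           temp_k: float = 300.0) -> float:
    """Diode current I_D = I_s * (exp(q*V_d / (N*k*T)) - 1)."""
    thermal_voltage = K_BOLTZMANN * temp_k / Q_ELECTRON
    return i_s * (math.exp(vd / (n * thermal_voltage)) - 1.0)


if __name__ == "__main__":
    # Hypothetical saturation current of 1e-14 A.
    for vd in (-1.0, 0.0, 0.4, 0.7):
        print(vd, schottky_diode_current(vd, 1e-14))
```

The steep rise in forward current with voltage is what clamps the gate of a MESFET once the gate diode is forward biased.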
2.1.3 Parasitic Capacitances
Capacitors in the MESFET model characterise charge storage within the physical device
which provides information about the transient operation and ultimate speed of a circuit.
Figure 2.2 shows the physical interpretation of the MESFET parasitic capacitances with
an exploded view of a typical device.
Figure 2.2: Cross section of a MESFET device
The two major capacitances associated with a
MESFET are the gate to source and gate to drain capacitances. They arise from the storage of
charge in the gate to source and gate to drain depletion regions and are non-linear.
For MESFET devices, the Statz-Raytheon capacitance model [StNe87] is a function
of $V_{ds}$ and $V_{gs}$ and is an accurate analytical expression to use for large signal analysis
[McCa90]. This model provides symmetric modelling of the gate to source and gate
to drain capacitances and is more accurate than the Curtice model [Curt80] which is
a function of $V_{gs}$ only. Other capacitances are the fringing capacitances from the gate
depletion region to the source and drain, which arise because the depletion layer extends beyond
the edge of the gate. The smaller the gate length, the larger these capacitances become.
They are modelled by fixed capacitances connected from the gate to the source and drain
intrinsic terminals. HSPICE does not accurately model the gate to source and gate to
drain capacitance values at all bias conditions, therefore HSPICE can only be used up to
2 GHz.
2.1.4 Parasitic Resistances
The parasitic resistances of the MESFET occur in series with the drain and source
connections. The physical interpretation is the resistance formed at the connection of the
metal 1 layer to the diffusion via a low resistance ohmic contact. To minimise these
resistances the source and drain connections must be placed as close to the gate as possible.
In early models the values of the drain and source resistances were adjusted to enable a
better fit to the MESFET characteristics but this caused the drain resistance to become
quite large and the source resistance to drop to zero. This means that the symmetry of
the device is lost (JFET model). However, later models overcame this deficiency
(Statz-Raytheon model [StNe87]) and the source and drain resistances regained their physical
meaning. The series gate resistance is so small that it is often ignored.
2.1.5 Second Order Effects
Second order effects of Gallium Arsenide circuits are not necessarily taken into
consideration for modelling purposes. Second order effects are listed below.
1. Backgating or sidegating is similar to the body effect in MOSFETs. Sidegating is
caused by negatively biased neighbouring FETs which cause the threshold voltage
to increase and therefore reduce the drain current. Backgating has the same effect
but is due to a negative substrate bias. All active devices are susceptible to these
effects including FETs, diodes and resistors. Even a horse shoe shaped resistor
exhibits self sidegating although a positively biased guard ring around such devices
can help. Sidegating is dependent on the distance between active devices and it has
been reduced through ion implantation and mesa etching around active regions.
The backgating or sidegating effect cannot currently be modelled by extraction
from a layout since it depends on the distance and relative potentials of surrounding
active regions. TriQuint [TriQ91] recommend that the backside metal be biased
at the highest power supply potential of the circuit and that a separation of 3 μm/V be
allowed between devices to help reduce this effect. The threshold voltage decreases rapidly
below 10°C so worst case modelling would be at low temperature and high power
supply voltages. Prediction of the backgating voltage is unreliable due to large
variations in backgating from substrate to substrate. Thus backgating is ignored
for low to moderate power supply voltages and higher than room temperature
modelling. A constant substrate bias (backgate) may be specified and can be used
with a model parameter, K1, in HSPICE to shift the threshold voltage of the
simulated devices.
2. Drain current transient lag effects are due to deep level traps in the substrate
below the channel which accumulate electrons injected into the substrate. It takes
longer for the traps to release electrons than to capture them, hence the effect of
overshoot in the drain current and slow recovery to a step in the drain to source
voltage. This effect is also observed as an increase in the small signal output
conductance by a factor of as much as three when the FET is in saturation. This is
because the traps under the channel shield the drain to channel capacitance. This
frequency dependent effect is modelled by changing the parameter LAMBDA in
the MESFET model for the high and low frequency case. The effect is to increase
LAMBDA for the high frequency case which increases the slope of the I-V curve in
the saturation region. The variation of the drain-source characteristics
with frequency is not a simple function. There is also a smaller effect of increasing
the transconductance with increasing frequency although this is limited to a few
percent and is not modelled. A higher quality substrate material with fewer traps
would reduce these effects.
3. Subthreshold current flows from drain to source when the gate to source voltage is
below the pinch off voltage. This occurs when the electrons are transported across
the channel by diffusion and drift. The subthreshold current has little influence
on the DCFL circuits which spend most of their time operating in the saturation
region.
4. Temperature dependence is characterised by two physical effects: the variation of
the built-in voltage of the channel/substrate interface and the channel
transconductance factor, $\beta$. They both affect the threshold voltage of the MESFET but the
built-in voltage increases the threshold as temperature is increased and $\beta$ decreases
the threshold voltage as temperature is increased. The net effect on the threshold
voltage (and hence on the drain current, which depends on the gate to source voltage) is
complex and is not modelled. However, different libraries of models have been
formulated to model devices at specific temperatures, usually two extremes such as
0°C and 125°C. If operation at any other temperature is required, interpolation of
the results would be the best option.
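The interpolation suggested in item 4 can be sketched as follows. Linear interpolation is an assumption made here for illustration (the text does not specify the interpolation method), and the function name and the example delay values are hypothetical.

```python
def interpolate_between_corners(temp_c: float,
                                value_at_low: float,
                                value_at_high: float,
                                t_low: float = 0.0,
                                t_high: float = 125.0) -> float:
    """Linearly interpolate a simulated result between two temperature corners."""
    if not t_low <= temp_c <= t_high:
        raise ValueError("temperature lies outside the modelled range")
    fraction = (temp_c - t_low) / (t_high - t_low)
    return value_at_low + fraction * (value_at_high - value_at_low)


if __name__ == "__main__":
    # Hypothetical gate delays (ps) simulated at the 0 C and 125 C corners.
    print(interpolate_between_corners(25.0, 70.0, 95.0))
```

Extrapolation outside the two modelled temperatures is rejected rather than guessed, since the text notes that the net temperature dependence is complex.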
2.1.6 Simulating Worst Case
Worst case conditions for a particular device may include a selection of a poor
parameter or group of parameters that cause poorer device operation and hence a reduction in
voltage swing, noise margins and speed of operation. Worst case parameters are usually
determined at 0°C and 125°C and characterised by variations in threshold voltage
(either positive or negative), degraded transconductance (smaller), resistance (larger) and
capacitance (larger). This is opposed to nominal circuit conditions where parameters are
determined at room temperature (around 25°C) which are used to initially verify the
functionality of the circuit.
2.2 GaAs MESFET Logic Classes
Normally-on logic classes use only depletion mode MESFETs and typically require some
voltage level shifting of the gate output to be compatible with the next stage. Larger
supply voltages are needed than with the normally-off logic classes and so the power
dissipation is higher. The complexity is also generally higher in normally-on logic classes
but the speed may be greater than normally-off logic classes. Level shifting may be done
by using Schottky diodes. Normally-on logic classes include [LoBu89, PuEs88, Eshr91,
KaNa85, Wing90]:
o Buffered FET Logic (BFL)
o Capacitively Coupled Domino Logic (CCDL)
o Capacitor Coupled FET Logic (CCFL)
o Capacitor Diode FET Logic (CDFL)
o Feed-Forward Static Logic (FFSL)
o Inverted Common Drain Logic (ICDL)
o Schottky Diode FET Logic (SDFL)
o Source Coupled FET Logic (SCFL)
o Two-Phase Dynamic FET Logic (TDFL)
o Unbuffered FET Logic (UFL)
Normally-off logic uses enhancement type MESFETs as a switch and depletion type
MESFETs (or a resistor) as a load. Normally-off logic classes include [LoBu89, PuEs88,
Eshr91, Wing90]:
o Direct Coupled FET Logic (DCFL)
o Feedback FET Logic (FBFL)
o FET FET Logic (FFL)
o Junction FET Logic (JFL)
o Pseudo Current Mode Logic (PCML)
o Quasi FET Logic (QFL)
o Super Buffer FET Logic (SBFL)
o Source Follower Direct Coupled FET Logic (SDCFL)
o Source Follower FET Logic (SFFL)
Other logic classes include Differential Pass Transistor Logic (DPTL) which has a low
density due to its differential nature and the frequent buffering required. The logic classes
above can be further broken down into static and dynamic logic. Dynamic logic gates may be
simpler and lower power but have a minimum frequency of operation, whereas
static logic can operate down to DC. One requirement of the chip is complete testability,
so it should be operable at clock frequencies down to DC; therefore a dynamic approach
is not suitable.
The H-GAAS II E/D MESFET process, supplied by Vitesse Semiconductor Inc., USA
and Thomson-CSF Semiconducteurs Specifiques, France (as a second source) has been
tuned for using normally-off DCFL derived logic families. A buffered logic family, Source
Follower Direct Coupled FET Logic (SDCFL), was mixed with an unbuffered logic family
(DCFL) to optimise the speed and layout density for most of the chip. Super buffered
DCFL (SBDCFL) was used where high drive capability is required including clock lines
and long interconnects. Studies of this mixed logic approach have shown that good VLSI
density, noise immunity and speed can be achieved [BeMa91].
The limits of operation of the logic classes are:
o Power supply: 1.2 to 2.5V
o Temperature: 0 to 125°C
o 0.5σ fast-fast and 0.5σ slow-slow models
Unfortunately, accurate temperature models and process spread models were not avail-
able when this work was carried out.
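The operating limits listed above define a small set of simulation corners. A minimal sketch of how such corners might be enumerated is given below; it assumes that only the extreme values of each limit are simulated, and the names used are hypothetical rather than part of any simulator interface.

```python
from itertools import product

SUPPLY_VOLTS = (1.2, 2.5)                    # power supply limits
TEMPERATURES_C = (0, 125)                    # temperature limits
PROCESS_MODELS = ("fast-fast", "slow-slow")  # process spread models


def simulation_corners():
    """Return every combination of supply, temperature and process extremes."""
    return [
        {"vdd": vdd, "temp_c": temp, "process": proc}
        for vdd, temp, proc in product(SUPPLY_VOLTS, TEMPERATURES_C, PROCESS_MODELS)
    ]


if __name__ == "__main__":
    for corner in simulation_corners():
        print(corner)
```

Each dictionary could then be used to select a model library and set the supply and temperature for one simulation run.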
The devices available and the corresponding models and device sizes are:
o Enhancement MESFET (EFET): L = 1.2 μm, I.\p'm
o Depletion MESFET (DFET): L = 1.2 μm, 2.4 μm, 3.2 μm
All gate lengths are specified "as drawn" and are shrunk by 0.4 μm when processed. All
device modelling was done using HSPICE [Meta92] and the models supplied by MOSIS













































Figure 2.3: Simulated I-V characteristics for an EFET, L = 0.8 μm, W = Ilp'm.
Appendix A contains the performance criteria and specifications for the classes of logic
investigated.
2.2.1 Direct Coupled FET Logic
DCFL is the simplest logic class for digital GaAs design and has the smallest
power-delay product of the current GaAs normally-off logic classes. It is comparable to nMOS
in silicon VLSI design. An EFET operates as a voltage controlled resistor which pulls
the output down as a function of the applied gate voltage, while a DFET operating in
the saturation region provides the active pull up as shown in Figure 2.4. (A resistor may
replace the depletion mode MESFET in some cases.) When a DCFL gate drives another
DCFL gate, the high level output of the first gate is clamped to about 0.7 V by the
Schottky diode at the input of the second gate. This limits the voltage swing of the gate and
hence the noise margin. By varying the pull up and pull down MESFET widths, the
gate can be tuned for speed, noise margin, power and load drive. The pull down to pull





Figure 2.4: (a) DCFL inverter, (b) 2 input NOR gate, (c) equivalent circuit.
In high speed circuits with small voltage swings, the noise margin is the most critical
parameter to consider when designing for correct circuit operation. The noise margin may
be increased by reducing the on resistance of the EFET. This is achieved by increasing
the EFET width with respect to the DFET load, at the expense of decreasing the speed
of the gate. Figure 2.5 shows the drain current of a DFET when used as a pull up device
(Vgs = 0) for different device sizes. A relatively small (40%) change in current drawn
Figure 2.5: Drain current for a DFET with Vgs = 0 for a 1.2, 2 and 3µm gate length.
from the supply in the high and low logic states also contributes to circuit stability. In
this respect, DCFL produces a quieter power bus since the DFET operates in saturation
as a current source. Increasing the power supply voltage pushes the DFET further into
saturation and makes the changes in supply current even smaller, at the expense of power
dissipation. To achieve low power the current in the DFET must be made small, so a
gate length of 3.2µm or 2µm can be used. For a DFET gate length of Ld = 3.2µm, a
DCFL inverter driving another DCFL load can only drive 1 fan-out. For larger fan-outs
or for driving longer wires a DFET gate length of Ld = 2.4µm must be used. A major
drawback of DCFL is its poor load drive capability, since the DFET is always on and
the switching EFET must supply current both to pull the load down and to supply the
load DFET. Complementary logic classes drive the load with just one device while the
other is cut off. Figure 2.7 shows that the delay of a DCFL inverter grows almost linearly
with the load (fan-out), and fan-outs greater than three or four lead to long gate delays.
A buffered logic should be used to drive these higher loads. The wire delay of 70µm of
interconnect is approximately equal to 1 fan-out.
Figure 2.6 shows the average noise margin as a function of EFET width
(We) with a fan-out of three for a three input NOR gate with only one input signal
driven and the other two inputs tied to ground. Definitions of noise margin parameters
and measurement techniques are presented in Appendix A. The average noise margin
degrades as the fan-out and fan-in increase and as fewer of the input signals are driven
high, indicating that this will be the worst case noise margin. The high noise margin in
this case is close to zero at around We = 10µm but the low noise margin is always above
150mV. Since this is the absolute worst case, We = 8µm was chosen. Table 2.1 shows the
characteristics of DCFL circuits that were simulated using HSPICE for different device
sizes.









































Figure 2.7: Propagation delay of a DCFL inverter as a function of fan-out (capacitive
load).

























































































































































The DCFL design guidelines used for an inverter and NOR gates are as follows:
o DCFL with one fan-out
- Wd = 2µm, Ld = 3.2µm
- We = 6µm, Le = 1.2µm
- fan-in = 3 maximum
o DCFL with two or three fan-outs
- Wd = 2µm, Ld = 2.4µm
- We = 8µm, Le = 0.8µm
- fan-in = 3 maximum
2.2.2 Source Follower Direct Coupled FET Logic
SDCFL is a buffered version of DCFL to improve the load drive capability, voltage swing
and noise margin. The buffer is a source follower using an EFET as a pull up and a
DFET as a pull down load, as shown in Figure 2.8(a). The output of the DCFL stage is
clamped at two diode drops. The first diode is across the EFET in the source follower
stage and the second is the input diode of the DCFL load. The voltage swing at the input
to the SDCFL buffer stage is improved over the DCFL class. There is a Vgs voltage drop
across the EFET in the source follower, so the logic low level is improved. A negative
supply for the source follower may be used to further improve the voltage swing, as shown
in Figure 2.8(b). A separate negative supply may be used for the ground of the buffer
stage to further improve the noise margin and isolate the higher switching currents of the
source follower from the DCFL stage. However, the extra power rail requires extra area
for power distribution and the small benefit in noise margin performance (10 to 20mV)
did not justify this overhead. There is a trade off in sizing the ratio of the EFET and the
DFET in the source follower because a larger DFET can discharge the output node faster
but the EFET must be made larger to supply enough current for the DFET as well as
charge up the load capacitance. A large load capacitance leads to longer fall times, since
the current supplied by the DFET is constant, but the current supplied by the EFET is













[Figure: DCFL stage, buffer stage, load]
























































































































Table 2.2: Table of simulated 3-input SDCFL NOR gates with 2 inputs tied to GND
and a fan-out of 5, Le = Lse = Lsd = 1.2µm, Wsd = 2µm, Vdd = 2V and T = 70°C.
through the source follower depends on the logic state and therefore switching transients
are produced in the power rails, which lead to ground bounce and noise injection into
other circuits. This is caused by the DFET coming in and out of saturation when the
circuit switches. The current is around 150µA in the output logic low state and 700µA
in the output logic high state. The change in current is significant (460%) compared to
DCFL (150%) because the buffer switches off in the logic low state. However, SDCFL
has a higher noise margin than DCFL and can have a fan-in of up to five.
OR-AND-INVERT (OAI) structures can be made using SDCFL, as shown in Figure 2.9,

\[ Z = \bar{A}\cdot\bar{B}\cdot\bar{C} + \bar{D}\cdot\bar{E}\cdot\bar{F} \]

where A, B, C, D, E and F are inputs and Z is the output. Table 2.2 shows the
characteristics of SDCFL circuits that were simulated using HSPICE for different device
sizes.
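The OAI expression can be checked exhaustively. The sketch below assumes the gate groups the inputs as (A, B, C) and (D, E, F), which is inferred from the expression rather than from Figure 2.9, and compares the OR-AND-INVERT form with its De Morgan equivalent. The function names are illustrative only.

```python
from itertools import product


def oai(a, b, c, d, e, f):
    """OR-AND-INVERT: OR each group of three inputs, AND the groups, invert."""
    return not ((a or b or c) and (d or e or f))


def sum_of_products(a, b, c, d, e, f):
    """De Morgan form: (not A)(not B)(not C) + (not D)(not E)(not F)."""
    return (not a and not b and not c) or (not d and not e and not f)


def forms_agree():
    """Compare both forms over all 64 input combinations."""
    return all(
        oai(*bits) == sum_of_products(*bits)
        for bits in product([False, True], repeat=6)
    )


if __name__ == "__main__":
    print("forms agree:", forms_agree())
```

The output only goes high when every input of at least one OR group is low, which is the behaviour expected of two wired NOR-style groups feeding a common output.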
The design guidelines used for SDCFL circuits were:
o SDCFL with up to five fan-outs
- Wd = 2µm, Ld = 2.4µm
- We = 8µm, Le = 1.2µm
- Wsd = 2µm, Lsd = 1.2µm
- Wse = 11µm, Lse = 1.2µm
- fan-in = 3 maximum
2.2.3 Super Buffer FET Logic
SBFL improves the load capacitance drive capability of DCFL and SDCFL by utilising a
push-pull super buffer, as shown in Figure 2.10. The disadvantage of using this gate is
the noise produced on the power rails because of a conduction path from Vdd to ground
when the gate changes state. With limited use and careful power rail design, it can be
successfully used to drive high loads such as clock lines, buses and high fan-out loads. It
has a higher noise margin (192mV) than DCFL and SDCFL but a higher power dissipation,
and the supply current changes by 200% between the high and low states. The use of this gate is
restricted to an inverter driver only, since logic gates are more complex and require more
area than DCFL or SDCFL.
Optimising the gate ratios is similar to SDCFL, except that the devices in the output stage
have the same sizes. The design guideline used for SBFL is as follows:
o SBFL with up to seven fan-outs
- Wd = 2µm, Ld = 2.4µm
- We = 5µm, Le = 1.2µm
- Wse = 12µm, Lse = 1.2µm
- Wse1 = 12µm, Lse1 = 1.2µm
For driving high loads such as clock lines the ratio of these sizes is adhered to. Careful









Figure 2.10: SBFL inverter.
2.2.4 Performance Comparison
Table 2.3 is a summary of the characteristics of the DCFL, SDCFL and SBFL inverters
discussed.
Parameter                               DCFL Inverter   SDCFL Inverter   SBFL Inverter
Delay                                   70              120              120
Noise margin (mV average)               107             142              192
Number of devices for an N-input gate   N+1             N+3              2N+2
Power-delay product (fJ)                10              35               40
Table 2.3: GaAs logic circuit characteristics for DCFL, SDCFL and SBFL.
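The fan-out limits in the design guidelines of Sections 2.2.1 to 2.2.3 can be collected into a small lookup. The sketch below is only an illustration: the helper names are hypothetical, and treating wire length as additional fan-out at 70µm per fan-out is an assumption based on the equivalence quoted in Section 2.2.1.

```python
WIRE_UM_PER_FANOUT = 70.0  # text: about 70 um of wire loads like one fan-out

# (label, maximum fan-out) taken from the design guidelines in this chapter
GUIDELINES = [
    ("DCFL, Ld = 3.2um", 1),
    ("DCFL, Ld = 2.4um", 3),
    ("SDCFL", 5),
    ("SBFL", 7),
]


def effective_fanout(gates, wire_um=0.0):
    """Driven gates plus the wire expressed as equivalent fan-outs (assumption)."""
    return gates + wire_um / WIRE_UM_PER_FANOUT


def pick_logic_class(gates, wire_um=0.0):
    """Return the first guideline whose fan-out limit covers the load, else None."""
    load = effective_fanout(gates, wire_um)
    for label, max_fanout in GUIDELINES:
        if load <= max_fanout:
            return label
    return None  # load is beyond the guidelines listed in the text


if __name__ == "__main__":
    for gates, wire in [(1, 0), (2, 70), (4, 0), (3, 280), (8, 0)]:
        print(gates, wire, pick_logic_class(gates, wire))
```

Note that the text restricts SBFL to inverter drivers, so a result of "SBFL" implies buffering the signal rather than building the logic function in that class.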
2.3 Design Methodology
The characteristics of digital GaAs technology which require special attention so as to
avoid performance penalties include:
o Lower integration levels than possible with silicon. This is because of the larger
device width which is a consequence of using ratioed logic. Complex logic gates
aren't available (unlike CMOS).
o Lower yields than equivalent silicon systems.
o Logic families suited to VLSI are characterised by supply voltages approaching the
thresholds of the transistors and also suffer from poor noise margins.
o The rise times of 'long' interconnects are degraded by transmission line reflections
at impedance discontinuities.
Reducing device size leads to lower parasitics and higher integration levels, but designers
only have control over how the circuits are realised in layout.
2.3.1 Layout Style
Ring notation is used to lay out the logic gates and has been previously reported [BeMa91,
SaCa92, Eshr91a]. This technique made possible the rapid layout of the regular high
performance GaAs circuits required for the chip. The circuit design should take the following
practices into consideration:
o minimisation of coupling and clock feedthrough by using separate clock and
signal lines
o minimisation of ground bounce by making the circuits quiet when switching
current to ground
o close placement of the devices to achieve good device matching
o separation of larger devices to minimise sidegating
o placement of all gates in one direction (horizontal) to gain maximum mobility
o minimisation of interconnect lengths and coupling
o reduction of inductance and increase in capacitance of power buses
o high packing density
Traditional CMOS layout technique involves placing logic in between the ground and
Vdd power buses. Ring notation places the power buses bunched next to, or on top of,
each other. The connections to the transistors and other devices are in the form of rings
from Vdd to ground, as shown in Figure 2.11. Enhancement mode MESFETs are
drawn as a dashed line while depletion mode MESFETs are drawn as solid lines. Gate
connections are drawn as an arrow head crossing the line and interconnections between
gates are simply drawn as lines. This gives GaAs designers a simpler method for laying
out subcells than the stick style used for nMOS. Placing the power rails close together
increases the capacitance between them, and separating the signals into a wiring channel
between the gates and power rails leads to a quieter power bus. Figure 2.12 shows the
physical layers used for layout of the GaAs circuits and the key for ring notation used in
Figure 2.11. Some of the basic building blocks of the layout are a DCFL 2 input NOR
gate (Figure 2.13), a DCFL 3 input NOR gate (Figure 2.14), an SDCFL source follower buffer
(Figure 2.15) and an SDCFL source follower OR gate (Figure 2.16).
2.3.2 Design Tools
CAD for integrated circuits requires a layout editor with some automatic design rule
checking, a layout network extractor and circuit simulators. Mapping to various formats
such as CIF and CALMA is also required. The layout tool MAGIC [Magi90] was
used to design the PE chip. A GaAs network extractor called 'gaasnet' [Beau91] was
developed from the ISD phase-1 design suite but its use was discontinued as MAGIC had
developed beyond gaasnet's parasitic extraction capabilities. A program called 'ext2hsp'
[Beau92] was written to generate SPICE decks suitable for input directly into the HSPICE
[Meta92] circuit simulator. This checks transistor models against MESFET types and
gate lengths to ensure selection of the correct device simulation model. Another
script called 'ext2sp' does label name substitution in the SPICE deck. The MAGIC










































Figure 2.13: DCFL 2 input NOR gate layout
Figure 2.14: DCFL 3 input NOR gate layout
Figure 2.15: SDCFL buffer layout




technology file 'edgaas.tech' was supplied by MOSIS¹ and modified both to correct design
rule information and to enhance circuit extraction. The circuit simulator IRSIM was used for
functional simulation since it uses a simpler transistor model than SPICE for fast
turn-around. An HGAAS-II parameter file was written for IRSIM based on timing parameters
from HSPICE.
2.4 Circuit Modelling and Parasitic Extraction
2.4.1 Interconnect Analysis
Circuit parasitics play a major role in determining the ultimate performance of an in-
tegrated circuit. As the operating frequencies increase, particularly with high speed
technologies such as GaAs MESFET and HEMT, the nature of the parasitics changes
from being mainly capacitive (e.g. CMOS) to a combination of inductive, resistive and
capacitive. The relative magnitudes of these parasitic elements on a GaAs chip were
investigated. As the wavelengths of the signals become comparable to the length of the
interconnect the transmission line (TL) effects become important, as reflections from
discontinuities may degrade circuit performance. Transmission line effects become signif-
icant at the chip level when signal rise times go below 150ps [Bako90].
The electric field lines of an interconnection may terminate at the adjacent interconnec-
tion lines because integrated circuits are inherently densely packed and this particularly
occurs in higher metal layers of a multi-level interconnection. The adjacent lines are not
uniform in structure and hence there is not such a stable capacitance to ground. This
means that the crosstalk between conductors must be carefully modelled for high speed
circuits. A software package, 'Raphael' [TeMo93], was used to model and analyse the
interconnect structures in the following sections.
¹ MOS Implementation Service at the University of Southern California.
2.4.2 Capacitance
Bulk GaAs is a semi-insulating material, so the capacitance to the substrate is lower than
for CMOS technology. The capacitance to substrate of a single wire on silicon is about
0.08fF/µm [Rocc90] whereas it is around 0.05fF/µm for wires on GaAs substrates.
Figure 2.17 shows the inter-nodal capacitance for two neighbouring wires as a function
of metal pitch [Beau93]. This shows that the dominant capacitance load is the nearest















Figure 2.17: Inter-nodal capacitance of two neighbouring wires on 100µm [εr = 4 (*)],
450µm [εr = 4 (o)] and 100µm [εr = 8 (□)] thick GaAs SI substrate with a backplane and
dielectric constant εr for the inter-level dielectric.
of the centre interconnect (Figure 2.18) is the sum of the capacitance to substrate and
the inter-nodal capacitances to neighbouring structures. This shows that the substrate
thickness has a negligible effect on the capacitance per micron length of the wire but the
effect of the dielectric constant is more significant (25% increase). To study the coupling
capacitances, five conductors on a substrate were simulated (Figure 2.19). The coupling
capacitances as a function of the design rule are plotted in Figure 2.20, where Cxy denotes
the capacitive coupling from electrode 'x' to electrode 'y'. The coupling to the closest
neighbour accounts for 80 to 90% of the total capacitance for tightly spaced lines while



























Figure 2.18: Total capacitance of the centre 2µm wide wire on 100µm [εr = 4 (*)], 450µm
[εr = 4 (o)] and 100µm [εr = 8 (□)] thick GaAs SI substrate with a backplane and dielectric
constant εr for the inter-level dielectric.
mum total capacitance for these structures, as in silicon technology, as the design rule is
changed [Rocc90]. To minimise the total capacitance the interconnects must be spaced
as far apart as is practical or allowed.
The total capacitance of the wires was plotted as a function of the design rule in Figure
2.21. As expected the capacitance increases with the design rule. The capacitance
converges as proximity becomes less important and the major influence becomes the
capacitance to substrate. The capacitance of wire 1 equals wire 5 and wire 2 equals wire
4 because of symmetry.
2.4.3 Characteristic Impedance
A coplanar waveguide TL can be used to analyse the characteristic impedance as shown
in Figure 2.22. The line has semi-infinite ground planes placed either side (it is assumed
that adjacent strips of an interconnect will be ground or a reference).
The formula for an ideal coplanar waveguide [LoBu89] ignores the effect of a ground plane








































Figure 2.20: Coupling capacitance between electrodes for 5 equal width and spacing
electrodes on 100µm thick GaAs SI substrate embedded in dielectric (εr = 4), 2µm thick
with a backplane metallisation.






































Figure 2.21: Total coupling capacitance for electrodes 1 (o), 2 (*) and 3 (□) for 5 equal
width and spacing electrodes on 100µm thick GaAs SI substrate embedded in dielectric
(εr = 4), 2µm thick with a backplane metallisation.






Figure 2.23: Coplanar strips (cross section)
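impedance. Thus

\[ Z_0 = \frac{30\pi}{\sqrt{\epsilon_{eff}}}\,\frac{K(k)}{K'(k)} \tag{2.1} \]

where

\[ \epsilon_{eff} = \frac{\epsilon_r + 1}{2} = 7.05\ \text{(GaAs)}, \quad k = \frac{a}{b}, \quad k' = \sqrt{1 - k^2} \]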
and K(k) is the elliptic integral of the first kind and K'(k) is the complementary
function. Tables and formulae for this function can be found in [LoBu89]. The validity of
this equation assumes that the substrate thickness is much greater than the line spacing
b, which is greater than the line width, a. For lines of equal width and spacing,
k = 1/3 and the elliptic function ratio becomes K(k)/K'(k) = 0.64. Substituting into equation
2.1 gives the characteristic impedance of the line, Z0 = 22.7Ω. A correction may be
applied to account for the increase in effective dielectric constant due to a thin dielectric
coating, e.g. polyimide. The effective dielectric constant is multiplied by B, which is
given by:
\[ B = 1 + \frac{\epsilon_d}{\epsilon_r}\left[1 - \exp\left(-\frac{2t}{a+b}\right)\right] \tag{2.2} \]
If a = 1µm and b = 3µm, the thickness t of polyimide is 2µm and the dielectric
constant of the polyimide is εd = 4.0, then by equation 2.2, B = 1.19 and εeff can be corrected.
Substituting into equation 2.1 produces the characteristic impedance, Z0 = 20.8Ω. This
value seems quite low compared to the results of the two-dimensional simulation of three


















Figure 2.24: Characteristic impedance of the centre 2µm wide wire on 100µm [εr = 4 (o)],
450µm [εr = 4 (+)] and 100µm [εr = 8 (□)] GaAs SI substrate with a backplane and
dielectric constant εr for the inter-level dielectric.
A more accurate approach may be to consider two coplanar strips that are at some
distance from other lines, as shown in Figure 2.23. The characteristic impedance may be
given by:

\[ Z_0 = \frac{120\pi}{\sqrt{\epsilon_{eff}}}\,\frac{K(k)}{K(k')} \tag{2.3} \]

where the parameters are defined as being the same as the coplanar waveguide case above.
For a line width and spacing of 2µm, the characteristic impedance given by equation 2.3
becomes Z0 = 83.2Ω. Other methods of characterisation, such as microstrip analysis, are
not appropriate since the distance to the backplane is around 0.1 to 0.5mm (if a backplane
exists). This gives a w/h ratio of several hundred for VLSI type interconnections, where
w is the width of the interconnect and h is the height of the strip above the ground plane.
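The three coplanar impedance values quoted above (22.7Ω, 20.8Ω and 83.2Ω) can be reproduced numerically. The sketch below is a check under stated assumptions: it takes the coplanar waveguide expression to be Z0 = 30π/√εeff · K(k)/K'(k) and the coplanar strip expression to be Z0 = 120π/√εeff · K(k)/K(k'), forms chosen because they reproduce the quoted numbers, and it uses the quoted correction factor B = 1.19 directly rather than recomputing it. The elliptic integral is evaluated with the arithmetic-geometric mean so that only the standard library is needed.

```python
import math

EPS_R_GAAS = 13.1
B_POLYIMIDE = 1.19  # correction factor as quoted in the text


def agm(x, y):
    """Arithmetic-geometric mean of two positive numbers."""
    for _ in range(40):  # converges quadratically; 40 steps is ample
        x, y = 0.5 * (x + y), math.sqrt(x * y)
    return x


def ellipk(k):
    """Complete elliptic integral of the first kind for modulus 0 <= k < 1."""
    return math.pi / (2.0 * agm(1.0, math.sqrt(1.0 - k * k)))


def k_ratio(k):
    """K(k) / K'(k), where K'(k) = K(sqrt(1 - k^2))."""
    return ellipk(k) / ellipk(math.sqrt(1.0 - k * k))


def z0_cpw(k, eps_eff):
    """Coplanar waveguide impedance in ohms (assumed form of equation 2.1)."""
    return 30.0 * math.pi / math.sqrt(eps_eff) * k_ratio(k)


def z0_cps(k, eps_eff):
    """Coplanar strip impedance in ohms (assumed form of equation 2.3)."""
    return 120.0 * math.pi / math.sqrt(eps_eff) * k_ratio(k)


if __name__ == "__main__":
    k = 1.0 / 3.0                      # a = 1 um, b = 3 um
    eps_eff = (EPS_R_GAAS + 1.0) / 2   # 7.05 for GaAs
    print("K/K'            :", round(k_ratio(k), 3))
    print("CPW, uncorrected:", round(z0_cpw(k, eps_eff), 1))
    print("CPW, corrected  :", round(z0_cpw(k, eps_eff * B_POLYIMIDE), 1))
    print("CPS, corrected  :", round(z0_cps(k, eps_eff * B_POLYIMIDE), 1))
```

The 83.2Ω coplanar strip figure is only recovered when the polyimide-corrected εeff is used, which suggests the correction was carried through to equation 2.3.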
The characteristic impedance of a microstrip line is given by:

\[ Z_0 = \frac{60}{\sqrt{\epsilon_{eff}}}\,\ln\left(\frac{8h}{w} + \frac{w}{4h}\right) \]

which is valid for w/h < 1.0 [LoBu89], and εeff is given by:






The width of the interconnect would have to be 72µm for a 100µm thick substrate with
a grounded backplane and a 50Ω characteristic impedance line. For h = 0.1mm and
εr = 13.1 the width of the line is 72µm, which is too wide and not suitable for densely
packed integrated circuits. The impedance is not well controlled to ground since the
coupling to the nearest interconnect will dominate. A coplanar waveguide structure may
be built if a ground plane is placed either side of the interconnect. This would be suitable
for making controlled impedance lines connected to pads for wafer probing.
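The 72µm width quoted for a 50Ω microstrip line can be checked with a short calculation. This sketch assumes the narrow-strip form Z0 = 60/√εeff · ln(8h/w + w/4h) and, because the εeff expression does not survive in this copy of the text, a commonly used closed form εeff = (εr + 1)/2 + (εr − 1)/(2√(1 + 12h/w)). That second expression is an assumption and may differ from the one in [LoBu89].

```python
import math


def eps_eff_microstrip(eps_r, w, h):
    """Effective dielectric constant (assumed closed form, not from the text)."""
    return (eps_r + 1.0) / 2.0 + (eps_r - 1.0) / (2.0 * math.sqrt(1.0 + 12.0 * h / w))


def z0_microstrip(eps_r, w, h):
    """Characteristic impedance in ohms for a narrow strip (w/h below about 1)."""
    eps_eff = eps_eff_microstrip(eps_r, w, h)
    return 60.0 / math.sqrt(eps_eff) * math.log(8.0 * h / w + w / (4.0 * h))


if __name__ == "__main__":
    h = 100e-6  # substrate thickness in metres
    for w in (2e-6, 72e-6):
        z0 = z0_microstrip(13.1, w, h)
        print(f"w = {w * 1e6:.0f} um -> Z0 = {z0:.1f} ohm")
```

Under these assumptions a 72µm strip on 100µm GaAs comes out within about one ohm of 50Ω, while a 2µm strip is well above 100Ω, which is consistent with the argument that a 50Ω microstrip is impractically wide for dense layouts.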
2.4.4 Resistance
The series resistance of an interconnect increases as the feature size is scaled down. This
may lead to RC delays in signal lines and ohmic drops in power lines. The resistance of
an aluminium line is given by:

\[ R = \rho\,\frac{l}{w\,t} \]

where ρ = 2.74 × 10⁻⁶ Ω·cm is the bulk resistivity of aluminium, l is the length, w is
the width and t is the thickness of the line. Wire resistance is a function of l/w. Figure
2.25 shows the resistance per micron of a wire versus the design rule (width and spacing)
modelled using Raphael with the equally spaced five parallel trace model from the
Raphael Interconnect Library [TeMo93].
The skin effect is the exponential decay of the electric field as it penetrates the conductor
at high frequencies, which increases the resistive loss. The skin depth is given by:

\[ \delta = \sqrt{\frac{2}{\omega\mu_0\sigma}} \]

where ω is the frequency in rad/s, µ0 is the permeability and σ is the conductivity of
the conductor. The bandwidth of a 100ps wide gaussian pulse is about 3GHz, so for
aluminium wires, σ = 1/ρ = 3.65 × 10⁵ (Ω·cm)⁻¹ and δ = 1.5µm. This is much greater than










Figure 2.25: Resistance (10⁻² Ω/µm) of a wire as a function of design rule.
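The resistance and skin depth figures in this section follow directly from the bulk resistivity of aluminium. The sketch below is a minimal check assuming R = ρl/(wt) and δ = √(2/(ωµ0σ)). The 0.5µm metal thickness is borrowed from the inductance example in Section 2.4.5 and is an assumption here.

```python
import math

RHO_AL = 2.74e-8          # ohm*m, i.e. 2.74e-6 ohm*cm as quoted in the text
MU_0 = 4e-7 * math.pi     # H/m


def wire_resistance(length, width, thickness, rho=RHO_AL):
    """Series resistance in ohms of a rectangular wire (all lengths in metres)."""
    return rho * length / (width * thickness)


def skin_depth(freq_hz, rho=RHO_AL, mu=MU_0):
    """Skin depth in metres at the given frequency in Hz."""
    omega = 2.0 * math.pi * freq_hz
    return math.sqrt(2.0 * rho / (omega * mu))


if __name__ == "__main__":
    w, t = 2e-6, 0.5e-6   # 2 um wide, 0.5 um thick (thickness is assumed)
    print("R per micron  :", wire_resistance(1e-6, w, t), "ohm")
    print("R of 600 um   :", wire_resistance(600e-6, w, t), "ohm")
    print("skin depth 3GHz:", skin_depth(3e9) * 1e6, "um")
```

With these values the resistance is a few hundredths of an ohm per micron, matching the scale of Figure 2.25, and a 600µm line stays under the 18Ω bound used in Section 2.4.7.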
2.4.5 Inductance
The partial self and mutual inductances of interconnect structures may be found using
some simple formulas derived by Grover and reproduced in reference [TriQ92]. The partial
self inductance for a rectangular bar may be approximated by the following equation:
\[ L_{self} = \frac{\mu l}{2\pi}\left[\ln\left(\frac{2l}{w+t}\right) + \frac{1}{2}\right] \]
where w is the width of the line, l is the length, t is the thickness and µ = 4π nH/cm. For
a 2µm wide line which is 0.5µm thick and 10µm long, the total partial self inductance
is 5.1pH or 0.51pH per micron length. For a 100µm line, the partial self inductance is
97pH or 0.97pH/µm. The total inductance per unit length l of two parallel lines when
they form part of a complete loop may be approximated by [TriQ92]:
\[ \frac{L}{l} = \frac{L_{self,1} + L_{self,2} - 2M_{1,2}}{l} = \frac{\mu}{\pi}\left[\ln\left(\frac{d}{w+t}\right) + 1.5\right] \]
where it is assumed that the length is much greater than the spacing d between the lines
(as is usually the case in a dense integrated circuit). The permeability is µ, w is the
width and t is the thickness of the conductor. If d = w = 2µm and t = 0.5µm the total
inductance per unit length is 0.51pH/µm. These results compare well with the results
of the two-dimensional simulation shown in Figure 2.26 of coplanar strips on a GaAs
substrate. Figure 2.26 shows the self inductance plus the mutual inductance per micron
length and Figure 2.27 shows the mutual inductance per micron length only. Note that
the difference between Figure 2.26 and Figure 2.27 is the self inductance. Figures 2.28
and 2.29 show the self plus mutual and the mutual inductance per micron length of five equal
width and spaced interconnects on a 100µm thick substrate. The same simulation input
file was used for the capacitance calculation (from the Raphael Interconnect Library)
using the same conditions. The results show an asymptotic decrease of the inductance
per micron length with an increasing design rule. These results agree with the results for
the three conductor case shown in Figures 2.26 and 2.27.
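The hand calculations of partial self inductance and loop inductance quoted in this section can be reproduced in a few lines. The sketch assumes the two-term approximation L = (µl/2π)[ln(2l/(w+t)) + 1/2] for a rectangular bar and L/l = (µ/π)[ln(d/(w+t)) + 1.5] for a pair of parallel lines. These forms were chosen because they reproduce the quoted numbers and may omit small correction terms present in the original reference.

```python
import math

MU_PH_PER_UM = 0.4 * math.pi  # 4*pi nH/cm expressed in pH per micron


def partial_self_inductance(length_um, width_um, thickness_um):
    """Partial self inductance in pH of a rectangular bar (dimensions in um)."""
    bracket = math.log(2.0 * length_um / (width_um + thickness_um)) + 0.5
    return MU_PH_PER_UM * length_um / (2.0 * math.pi) * bracket


def loop_inductance_per_um(spacing_um, width_um, thickness_um):
    """Loop inductance in pH/um of two long parallel lines (dimensions in um)."""
    bracket = math.log(spacing_um / (width_um + thickness_um)) + 1.5
    return MU_PH_PER_UM / math.pi * bracket


if __name__ == "__main__":
    print("10 um bar :", partial_self_inductance(10, 2, 0.5), "pH")
    print("100 um bar:", partial_self_inductance(100, 2, 0.5), "pH")
    print("pair      :", loop_inductance_per_um(2, 2, 0.5), "pH/um")
```

The partial self inductance per micron grows slowly with length through the logarithm, which is why the 10µm and 100µm bars give different per-micron values.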
2.4.6 Line Delay
It has been shown [Bako90] that for various line lengths and driver impedances the
interconnect delay for aluminium is constant and minimum past a width of 2µm. The
RC delay is proportional to the square of the line length whereas the electromagnetic LC
delay is proportional to length. The RC delay along an interconnect of length l is the
time to charge the end of the line to 50% of the final value and is given by:

\[ t_{RC} = 0.69\,R\,C\,\frac{l^2}{w} \tag{2.4} \]

where R is the sheet resistance, w is the width and C is the capacitance per unit length
of the conductor. The electromagnetic transit delay is given by:

\[ t_{LC} = l\sqrt{LC} \]

where l is the length of the interconnect, L = 6 × 10⁻⁶ H/m is the inductance per unit
length and C = 0.2 × 10⁻⁹ F/m is the capacitance per unit length for 2µm aluminium
wires. The critical line length at which an interconnect must be treated as a TL occurs
when the rise time tr of the signal is the same as the time of flight down the line, tf.
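To give a feel for the magnitudes involved, the two delay expressions can be evaluated for the per-unit-length values quoted above. This is a sketch under assumptions: the sheet resistance of 0.0548Ω per square corresponds to 0.5µm thick aluminium, which is not stated in this section, and the critical length uses the factor of 2.5 between rise time and time of flight that this section attributes to [Bako90].

```python
import math

L_PER_M = 6e-6      # H/m, from the text
C_PER_M = 0.2e-9    # F/m, from the text
R_SHEET = 0.0548    # ohm per square, assumes 0.5 um thick aluminium


def t_lc(length):
    """Electromagnetic transit delay in seconds for a line of given length (m)."""
    return length * math.sqrt(L_PER_M * C_PER_M)


def t_rc(length, width, r_sheet=R_SHEET):
    """RC delay in seconds: 0.69 * R_sheet * C * l^2 / w."""
    return 0.69 * r_sheet * C_PER_M * length ** 2 / width


def critical_length(rise_time, ratio=2.5):
    """Length (m) at which the time of flight equals rise_time / ratio."""
    return (rise_time / ratio) / math.sqrt(L_PER_M * C_PER_M)


if __name__ == "__main__":
    print("t_LC, 1 mm :", t_lc(1e-3) * 1e12, "ps")
    print("t_RC, 1 mm :", t_rc(1e-3, 2e-6) * 1e12, "ps")
    print("critical l :", critical_length(100e-12) * 1e3, "mm")
```

Under these assumptions the LC delay dominates the RC delay for millimetre-scale lines, and a 100ps edge reaches the transmission line regime at a little over 1mm.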
Substituting in the formula [Bako90]:

\[ 2.5\,t_f = t_r \]

for a 100ps rise time, which is routinely observed in GaAs HEMT technology, gives tf =




























Figure 2.26: Self plus mutual inductance of the centre 2µm wide wire on 100µm [εr =
4 (+)], 450µm [εr = 4 (o)] and 100µm [εr = 8 (□)] thick GaAs SI substrate with a backplane
and dielectric constant εr for the inter-level dielectric.















Figure 2.27: Mutual inductance of the centre 2µm wide wire on 100µm [εr = 4 (+)],
450µm [εr = 4 (o)] and 100µm [εr = 8 (□)] thick GaAs SI substrate with a backplane and
dielectric constant εr for the inter-level dielectric.
































Figure 2.28: Self plus mutual inductance of 5 equal width and spaced interconnects on a
100µm GaAs SI substrate with a backplane (o = L11 = L55, + = L22 = L44, □ = L33).
Figure 2.29: Mutual inductance of 5 equal width and spaced interconnects on a 100µm


















2.4.7 Source Impedance of the Driving Gate
The source resistance of the driving circuit relative to the characteristic impedance of
the line determines the behaviour of the signal on the line. If the source resistance is
low compared to the line impedance, reflections may be observed on the line. However,
if the source resistance is high compared to the line impedance, a lumped capacitor
approximation may be used since the voltage at the end of the line will rise slowly. The
characteristic impedance of packed 2µm aluminium lines ranges from 47 to 150Ω with
the pitch ranging from 4 to 20µm, as determined by two-dimensional simulation using
Raphael. The source impedance of a source follower is typically several hundreds of
ohms. Short lines (< 600µm, as determined in the resistance calculation) have a total
line resistance less than 18Ω for 2µm wide aluminium wire, therefore Rline/Z0 < 0.5 and
Rsource/Z0 > 2. It has been stated that a lumped capacitor model may be used in this case for
a delay accuracy within 10 percent [LoBu89]. If the size of the driver is increased (Rsource
becomes smaller) or the line is long, a more complex model for the interconnect should
be used.
2.4.8 Interconnect Models
An interconnect line may be modelled as a lumped capacitance, an L-shaped RC circuit,
a hybrid-π circuit, a T model, a T2 model, n RLC segments, an ideal TL, or a lossy TL
[LoBu89, Bako90]. The choice of model depends on the required accuracy and the
effects to be modelled. Two models were considered, a lumped capacitance model and a
lossy TL, which are the simplest and the most advanced models, respectively. For circuit
simulation, the simplest model satisfying the required accuracy should be chosen. VLSI
interconnect lengths typically fall into two groups, short local gate to gate and long cell
to cell connections. The gate to gate connection was considered to be around 100µm
long and the cell to cell connection to be around 1mm.
Interconnections with a high speed technology were studied using some P-HEMT circuits
which are potentially faster than MESFET circuits [Beau93]. The device models were
characterised from prototype P-HEMT devices fabricated by the Department of Electronics
Engineering at Seoul National University, Korea. A source follower model, which is used
to buffer the output of the SCFL gates, was chosen to drive the circuit. This
uses a ¡1¡"tm wide P-HEMT, two level shift diodes and a minimum size current source
(300µA) with a 5V supply. A 1.5V step input is applied to the input of a buffered source
follower to launch a signal into the model and the waveform response at each end of the
line is measured.
Lumped Capacitor Model
A lumped capacitor cannot model any ringing effects on long lines but can be reasonably
accurate if the waveforms are well behaved. It is the simplest interconnection model for
implementation on a circuit simulator. The lumped capacitive load of a 100µm and a
1mm line is 23fF and 230fF, respectively. Figure 2.30 shows the simulation of a step
input applied to these lines. The lumped capacitor model is valid for small buffer stages

























Figure 2.30: Transient analysis of a 100µm and 1mm line modelled as a lumped capacitor.
Lossy Transmission Line Model
The lossy TL model is a complete model of a line with distributed series resistance,
inductance and shunt capacitance. The value of shunt conductance is negligible and is
ignored. The number of lumped sections in the model can be increased to improve the
accuracy, but the simulation time is longer. The parameters for the TL model are from
the results of the Raphael simulation for three 2µm width and spacing interconnects on
100µm thick SI GaAs substrate embedded in a dielectric with εr = 4.
o C = 0.2056 × 10⁻⁹ F/m
o L = 5.984 × 10⁻⁶ H/m
o R = 3 × 10⁻⁸ Ω/m
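One way to picture the lumped-section approximation of the lossy line is to split the per-unit-length values above equally among n series R-L sections, each followed by a shunt C. The sketch below is an illustration only: the helper names, node names and the generic SPICE-style element lines are assumptions and are not the netlist used in this work. The R value is taken exactly as listed above.

```python
C_PER_M = 0.2056e-9   # F/m, as listed in the text
L_PER_M = 5.984e-6    # H/m, as listed in the text
R_PER_M = 3e-8        # value as listed in the text


def ladder_sections(length_m, n):
    """Per-section R, L and C when a line is split into n equal sections."""
    seg = length_m / n
    return {"R": R_PER_M * seg, "L": L_PER_M * seg, "C": C_PER_M * seg}


def ladder_netlist(length_m, n):
    """Generic SPICE-style element lines for an n-section RLC ladder."""
    sec = ladder_sections(length_m, n)
    lines = []
    for i in range(n):
        left, mid, right = f"n{i}", f"m{i}", f"n{i + 1}"
        lines.append(f"R{i} {left} {mid} {sec['R']:.4g}")   # series resistance
        lines.append(f"L{i} {mid} {right} {sec['L']:.4g}")  # series inductance
        lines.append(f"C{i} {right} 0 {sec['C']:.4g}")      # shunt capacitance
    return lines


if __name__ == "__main__":
    print(ladder_sections(1e-3, 20))
    print("\n".join(ladder_netlist(1e-3, 20)[:6]))
```

For a 1mm line split into 20 sections this gives roughly 0.3nH and 10fF per section. Increasing n shortens each section and improves accuracy at the cost of more circuit elements to simulate.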
Figure 2.31 shows the transient analysis of this circuit for a 100µm and a 1mm line with
the results for the lumped capacitor model superimposed. Good accuracy was achieved
with 20 lumped sections. A 5mm line was simulated to observe any TL effects with this

















Figure 2.31: Transient analysis of a 100µm and 1mm line modelled as a lossy TL and a
lumped capacitor.
model. The results of a transient analysis are shown in Figure 2.32. The ripples in the


















Figure 2.32: Transient analysis of a 2mm and a 5mm line modelled as a lossy TL (signals
plotted at start and end of the TL).
Effect of Resistance
The lossy TL circuit was re-simulated without the distributed resistance. The results of
a transient analysis are shown in Figure 2.33. Comparing the lossless TL (Figure 2.33)
with the lossy TL (Figure 2.31) shows there is no difference in the results. Interconnects
of this type may be considered as lossless without loss of simulation accuracy.
2.4.9 Crosstalk
Crosstalk occurs when a signal transition on one wire influences the signal on a
neighbouring wire. Crosstalk is mainly due to inter-nodal coupling capacitances between lines
in VLSI circuits. Crosstalk may be minimised by using a circuit layout style where lines
do not run parallel for 'long' distances, adjacent signal lines are spaced far apart and
crossovers are avoided. Ground lines either beside or above a signal reduce crosstalk to






















Figure 2.33: Transient analysis of a 100µm and 1mm line modelled as a lossless TL
lines in a circuit uses more area although this must be used in some instances,
particularly at the chip boundary. SCFL gates can have differential input and output signals,
which means there are twice as many signal lines to consider as in other logic classes
such as DCFL. There is a virtual ground between differential signals, so placing a power
bus between the differential wires spaces them further apart and there is less signal-signal
coupling. This also means that the power bus should be cleaner than if there were just
a single coupled signal line. The capacitive coupling between lines plotted as a function
of spacing can be seen in Figure 2.20.
2.4.10 Package Parasitics
Parasitics associated with the packaging limit the communication bandwidth across a
chip boundary. The bond wire, trace and external lead from the package form a TL
which may be modelled as a coplanar waveguide or a stripline if the package has a metal
floor and lid. A first order equivalent 'T' circuit model for a pad and package bond for a
24-pin leadless chip carrier (LCC) [LoBu89] is shown in Figure 2.34. The effective trace







Figure 2.34: Equivalent model for an LCC pad and bond, L = 1.4 nH, C = 40 fF.
layer ceramic) packages so the model must be determined by the package structure. The
pad to pin characteristic impedance is 50 Ω with a pad to pin delay of about 80 ps. First
order effects of pad, bond wire and package lead parasitics can be modelled and are
probably sufficient to characterise the circuit. The bond wire inductance is about 1 to
2nH for LCC type packages [TriQ92] which can be an order of magnitude higher for
needle probes and dual-in-line (DIL) type packages.
2.4.11 Pad Structures
A three-dimensional field analysis using Raphael [TeMo93] was performed for a single
pad on a 450 µm thick GaAs substrate with a backplane metallisation. This showed the
total pad capacitance to be 26.9 fF for a 100 µm square pad and 34.2 fF for an 80 µm
square pad. The same simulation was performed for three co-linear pads for which the
total and inter-pad coupling is shown in Table 2.4. The total coupling referred to ground











Table 2.4: Total and inter-pad capacitance simulation for three adjacent bonding pads
on a SI GaAs substrate.
(if the pad either side of the centre pad is grounded) is 41 fF. This agrees with the results
published by TriQuint [TriQ92] and is higher than for the case of a single pad (26.9 fF).
The capacitance of a square pad from the following formula which uses an approximate











€r*1 , er-leeÍÍ-- 2'4t+t4n¡w¡¡",
W is the size of a square pad and h is the height of the pad above the backplane. Table

































Table 2.5: Pad capacitance for various pad sizes and substrate thicknesses ($\epsilon_r = 13.1$).
2.4.12 Power Supply and Ground Lines
Current density must be kept below a limit to prevent electromigration and subsequent
open circuits. The safe limit for the current density of aluminium interconnect is
$2 \times 10^5$ A/cm² up to a temperature of 125°C. This equates to 1 mA/µm width for 0.5 µm
thick lines. Self inductance in power and ground lines may cause voltage transients.
To calculate the magnitude of the voltage spike the self inductance
and change in current per unit time need to be determined. The magnitude of the voltage
spike is given by the familiar equation:
$$\Delta V = L \frac{\Delta I}{\Delta t}$$
where $\Delta I$ is the change in current in time $\Delta t$. This may be used to check the variation in
supply voltage on a power bus for a group of logic gates as they switch current from one
state to another. Ohmic drops in the power supply need to be avoided. The sheet resistance
of a 0.5 µm thick aluminium interconnect is 0.06 Ω/square. If the current is at its highest
allowable density of 1 mA/µm width, the voltage drop per unit length becomes 60 mV/mm
or 3% of a 2 V supply. Typically, a 5% supply variation (100 mV) can be tolerated but
this depends on the noise margin of the logic gates.
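The three supply-line checks described above (electromigration limit, inductive spike and ohmic drop) can be sketched in a few lines of Python. This is only an illustrative calculator: the function names are hypothetical, the default limits are the 1 mA/µm and 0.06 Ω/square figures quoted in the text, and the inductance and switching values in the usage note are made-up placeholders rather than measured data.

```python
def max_current_mA(width_um, limit_mA_per_um=1.0):
    """Largest current (mA) a line of the given width may carry.

    Assumes the 1 mA per µm of width limit quoted for 0.5 µm thick aluminium.
    """
    return width_um * limit_mA_per_um


def inductive_spike_V(inductance_H, delta_i_A, delta_t_s):
    """Voltage transient from delta_V = L * delta_I / delta_t."""
    return inductance_H * delta_i_A / delta_t_s


def ir_drop_mV_per_mm(current_mA, width_um, sheet_res_ohm_sq=0.06):
    """Ohmic drop per millimetre of line length.

    One millimetre of a line width_um wide contains 1000 / width_um squares,
    and mA times ohm gives mV.
    """
    squares_per_mm = 1000.0 / width_um
    return current_mA * sheet_res_ohm_sq * squares_per_mm
```

For example, `ir_drop_mV_per_mm(max_current_mA(10.0), 10.0)` evaluates the drop for a 10 µm line carrying its maximum allowed current, and `inductive_spike_V(1e-9, 10e-3, 100e-12)` evaluates the spike for a hypothetical 1 nH line switching 10 mA in 100 ps.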
2.5 Summary
A review of MESFET models suitable for use in SPICE circuit simulators has been
presented. The MESFET models used are from Vitesse Semiconductor and are implemented
using the Statz-Raytheon model in the HSPICE circuit simulator. GaAs MESFET digital
logic classes are also reviewed and DCFL, SDCFL and SBFL are chosen for implementing
GaAs digital circuits. These logic classes are optimised for speed, area and noise margin
by adjusting the sizes of the MESFETs in the logic gates. A design methodology is
developed along with a layout style called 'ring notation' which is used to design layout
primitives including NOR, OR, source followers and OR-AND-INVERT structures. A
study of the interconnect parasitics and suitable models for the behaviour of high speed
signals on GaAs substrates is given. The study included capacitive coupling, resistance,
inductance, characteristic impedance, line delay and source impedance. An evaluation of
interconnect models found that short lines (< 600 µm) need to be modelled as a lumped
capacitance but long wires may be modelled as non-lossy transmission lines. A SPICE
deck can be extracted directly from the layout using a program called 'ext2sp'. The
interconnect extraction is limited to inter-nodal and ground lumped capacitance. SPICE
circuits for long wires must be created by hand and incorporated into the extracted circuit.









3 Systolic Ring Processing Element
The PE forms the basic computational unit in the mesh connected systolic array. It
performs multiplication, addition and multiplication-accumulation of the two input operands.
Architectures for a class of digit-serial systolic ring floating point PEs, targeted for
fabrication in Gallium Arsenide technology and optimised for matrix processing,
are discussed in this chapter. Digit-serial multiplication and floating point numbers
are discussed and a systolic cell is presented with two models for floating point
multiplication. The PE is subsequently improved with additional circuits to allow it to perform
floating point addition. A performance metric is derived by minimising the total job time
for a matrix product using a systolic array. This performance metric is used to optimise
the architecture of the PE. The memory bandwidth requirement of the systolic array is
also discussed.
3.1 Digit-Serial Multiplication
Previous work on integer digit-serial processing techniques can be found in [HaCo90],
[CoHa92] and [Parh89] and in the discussion in Chapter 1. Much of this work is concerned
with the partitioning of a parallel operation into a sequence of smaller-radix digit-serial
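operations. To derive a digit-serial multiplier cell, consider X, an M-digit number
$\{x_0, x_1, \ldots, x_{M-1}\}$ represented in base $\beta$ as:
$$X = \sum_{i=0}^{M-1} x_i \beta^i \qquad (3.1)$$
and similarly Y, an N-digit number:
$$Y = \sum_{j=0}^{N-1} y_j \beta^j \qquad (3.2)$$
The product of X and Y is given by:
$$XY = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} x_i y_j \beta^{i+j} \qquad (3.3)$$
Let each digit $x_i$ and $y_j$ of X and Y have an r-bit binary representation given by:
$$x_i = \sum_{k=0}^{r-1} x_{ik} 2^k \qquad (3.4)$$
and
$$y_j = \sum_{l=0}^{r-1} y_{jl} 2^l \qquad (3.5)$$
Substituting equations 3.4 and 3.5 into equation 3.3 gives:
$$XY = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \sum_{k=0}^{r-1} \sum_{l=0}^{r-1} x_{ik} y_{jl} 2^{k+l} \beta^{i+j} \qquad (3.6)$$
Re-writing the innermost summation gives:
$$XY = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \sum_{k=0}^{r-1} \left[ \sum_{l=0}^{r-1-k} x_{ik} y_{jl} 2^{k+l} + \sum_{l=r-k}^{r-1} x_{ik} y_{jl} 2^{k+l} \right] \beta^{i+j} \qquad (3.7)$$
Since $\beta = 2^r$, every term of the second inner sum has $k + l \ge r$ and so carries a factor of $\beta$. Let
$$A_{ijk} = \sum_{l=0}^{r-1-k} x_{ik} y_{jl} 2^{k+l} \qquad (3.8)$$
$$B_{ijk} = \sum_{l=r-k}^{r-1} x_{ik} y_{jl} 2^{k+l-r} \qquad (3.9)$$
Then equation 3.7 can be re-written as:
$$XY = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \sum_{k=0}^{r-1} \left[ A_{ijk} \beta^{i+j} + B_{ijk} \beta^{i+j+1} \right] \qquad (3.10)$$
The $A_{ijk}$ are associated with partial digit sums of weight $\beta^{i+j}$ and the $B_{ijk}$ are associated
with partial sums of weight $\beta^{i+j+1}$.
Expanding the $A_{ijk}$ and $B_{ijk}$ terms in this equation gives:
$$XY = \sum_{i=0}^{M-1} \sum_{k=0}^{r-1} \left[ A_{i0k}\beta^{i} + B_{i0k}\beta^{i+1} + A_{i1k}\beta^{i+1} + B_{i1k}\beta^{i+2} + A_{i2k}\beta^{i+2} + B_{i2k}\beta^{i+3} + \cdots + A_{i(N-1)k}\beta^{i+N-1} + B_{i(N-1)k}\beta^{i+N} \right]$$
$$= \sum_{i=0}^{M-1} \sum_{k=0}^{r-1} \left[ A_{i0k}\beta^{i} + \sum_{j=0}^{N-2} \left( A_{i(j+1)k} + B_{ijk} \right) \beta^{i+j+1} + B_{i(N-1)k}\beta^{i+N} \right] \qquad (3.11)$$
$$XY = \sum_{i=0}^{M-1} \left[ \sum_{k=0}^{r-1} \sum_{l=0}^{r-1-k} x_{ik} y_{0l} 2^{k+l} \beta^{i} + \sum_{j=0}^{N-2} \left[ \sum_{k=0}^{r-1} \sum_{l=0}^{r-1-k} x_{ik} y_{(j+1)l} 2^{k+l} + \sum_{k=0}^{r-1} \sum_{l=r-k}^{r-1} x_{ik} y_{jl} 2^{k+l-r} \right] \beta^{i+j+1} + \sum_{k=0}^{r-1} \sum_{l=r-k}^{r-1} x_{ik} y_{(N-1)l} 2^{k+l-r} \beta^{i+N} \right] \qquad (3.12)$$
Consider a four-bit per digit representation (r = 4). The two terms in the inner brackets
give the following for $(x_i y_j)$ and $(x_i y_{j+1})$ (bit weights in equation 3.13 are shown before division by $\beta = 2^4$):
$$\sum_{k=0}^{3} \sum_{l=4-k}^{3} x_{k} y_{l} 2^{k+l} = x_3 y_3 2^6 + x_3 y_2 2^5 + x_2 y_3 2^5 + x_3 y_1 2^4 + x_2 y_2 2^4 + x_1 y_3 2^4 \qquad (3.13)$$
$$\sum_{k=0}^{3} \sum_{l=0}^{3-k} x_{k} y_{l} 2^{k+l} = x_0 y_3 2^3 + x_0 y_2 2^2 + x_0 y_1 2^1 + x_0 y_0 2^0 + x_1 y_2 2^3 + x_1 y_1 2^2 + x_1 y_0 2^1 + x_2 y_1 2^3 + x_2 y_0 2^2 + x_3 y_0 2^3 \qquad (3.14)$$
where the terms in equation 3.13 are the high-order components of the digit product
$(x_i y_j)$ and the terms in equation 3.14 are the low-order components of the product
$(x_i y_{j+1})$. The i, j and j + 1 subscripts on the right hand side of equations 3.13 and 3.14
have been omitted for clarity of presentation.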
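The split of a digit product into low-order and high-order bit terms can be checked numerically. The following Python sketch is not from the thesis: the function name `split_digit_product` is hypothetical, and it assumes $\beta = 2^r$ with unsigned r-bit digits. It sums the bit products with $k + l \le r - 1$ separately from those with $k + l \ge r$ and confirms that the two parts recombine to the full digit product.

```python
def split_digit_product(x, y, r=4):
    """Split the product of two r-bit digits into low- and high-order sums.

    low  = sum of x_k * y_l * 2^(k+l)    over k + l <= r - 1
    high = sum of x_k * y_l * 2^(k+l-r)  over k + l >= r
    so that x * y == low + high * 2^r.
    """
    low = high = 0
    for k in range(r):
        x_k = (x >> k) & 1
        for l in range(r):
            y_l = (y >> l) & 1
            if k + l <= r - 1:
                low += (x_k * y_l) << (k + l)
            else:
                high += (x_k * y_l) << (k + l - r)
    return low, high
```

Note that `low` is a sum of bit products and may itself exceed $2^r - 1$; it is the low-order group of terms, not a finished r-bit digit, which is why a carry path is still needed when the terms are accumulated.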
This re-formulation of the XY product shows that partial products of weight $\beta^{i+j+1}$ are
formed by summing the digit product $(x_i y_{j+1})$ with the digit product $(x_i y_j)$. To form
digits of weight $\beta^{i+j+1}$ a structure is required which in each time period can compute and
accumulate the two different partial products from adjacent time periods, and then
accumulate the result with partial products computed in other cells. The function required
is
Z = XY + PP + V (3.15)
where PP is the r-bit partial product input and V is the r-bit carry. X and Y are r-bit
input operands. The term XY implies the pipelined computation and accumulation of the
high- and low-order digits as discussed. To implement this function a pipelined parallel
multiplier structure is used. Pipelining of the high-order output of this multiplier with
one level of registers delays the high-order digit by one clock cycle. This delayed digit is
then fed back to the V input during the next computation to allow its accumulation with
the next low-order digit, and so forms the desired term in equation 3.12. The structure
of a multiplier which implements this operation is shown in Figure 3.1 and is a direct
result of equation 3.15. The need for the single level of registers to properly sequence
the digit-wise addition within the multiplier allows the minimisation of the critical path.
In fact it is possible to almost halve the number of delays present in the critical path
with an appropriate placement of these registers, as shown in Figure 3.2. The immediate
consequence is that the optimised multiplier can function at double the clock speed of a
conventional parallel multiplier array. The number of registers required for the optimised
multiplier with the shortest critical path (2n - 1) is approximately two-thirds of the
number of registers required for the direct implementation (3n - 1), as can be seen by
comparing Figures 3.1 and 3.2. The first use of this re-organised multiplier array was
reported by Braun [Brau63] and is modified to include the sum of two extra nibbles in
the least significant digit.
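The cell function Z = XY + PP + V can be illustrated with a small software model. The sketch below is a sequential simulation under stated assumptions, not the pipelined hardware of Figures 3.1 and 3.2: all names (`cell`, `to_digits`, `from_digits`, `multiply`) are hypothetical, digits are unsigned and r bits wide, and the register timing is ignored. It relies on the fact that for r-bit inputs the largest possible Z is $(2^r - 1)^2 + 2(2^r - 1) = 2^{2r} - 1$, so the result always fits in one low-order digit plus one high-order digit that is fed back as V.

```python
def cell(x, y, pp, v, r=4):
    """One digit step: Z = x*y + pp + v, returned as (low digit, high digit)."""
    mask = (1 << r) - 1
    z = x * y + pp + v
    return z & mask, z >> r


def to_digits(value, count, r=4):
    """Least-significant-digit-first list of `count` r-bit digits."""
    mask = (1 << r) - 1
    return [(value >> (r * i)) & mask for i in range(count)]


def from_digits(digits, r=4):
    """Inverse of to_digits."""
    return sum(d << (r * i) for i, d in enumerate(digits))


def multiply(x, y, m_digits, n_digits, r=4):
    """Multiply two unsigned numbers digit by digit using only cell()."""
    xd = to_digits(x, m_digits, r)
    yd = to_digits(y, n_digits, r)
    pp = [0] * (m_digits + n_digits)      # running partial product digits
    for j, y_j in enumerate(yd):
        v = 0                             # high-order digit fed back to V
        for i, x_i in enumerate(xd):
            pp[i + j], v = cell(x_i, y_j, pp[i + j], v, r)
        pp[j + m_digits] = v              # final high-order digit of this row
    return from_digits(pp, r)
```

A call such as `multiply(0xBEEF, 0xCAFE, 4, 4)` exercises every cell input, including the worst case where all four inputs to `cell` are 15.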


























Figure 3.2: A pipelined four-bit per digit multiplier optimised for both area and critical
path.
3.2 Digit-serial Floating Point Multiplication
3.2.1 Floating Point Numbers
To be compatible with most of today's scientific and engineering computers, a general
purpose co-processor should use a floating point standard such as IEEE-754 (1985)
[ieee85] for implementation of arithmetic logic units. Work on floating point numbers
and floating point arithmetic can be found in [ieee85, Ster74, Zyne88]. The floating point
representation in its most basic form is characterised by four integers: the base $\beta$, the
precision m, a sign bit s and the exponent range e. A floating point number, N, is given
by:
$$N = (-1)^s \times 0.d_0 d_1 d_2 d_3 \ldots d_{m-1} \times \beta^{x_0 x_1 x_2 x_3 \ldots x_{e-1}} \qquad (3.16)$$
where $d_i$ are the mantissa bits, $x_i$ are the exponent bits for a signed exponent field and
s = 0 or 1 is the sign bit. The fractional part of the number is to the right of the '.' and
$\beta$ is 2 for binary numbers. In addition, a signalling and a quiet Not a Number (NaN) and the
two infinities ($\pm\infty$) must be encoded in the representation.
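As a small worked illustration of this representation, the Python sketch below evaluates a number from its sign bit, mantissa bits and exponent. It is a simplification under assumptions not taken from the thesis: the function name `fp_value` is hypothetical, the exponent is passed in as an already decoded signed integer (the encoding of the signed exponent field is not modelled), and NaN and infinity encodings are ignored.

```python
def fp_value(sign, mantissa_bits, exponent, beta=2):
    """Value of (-1)^s * 0.d0 d1 ... d(m-1) * beta^exponent.

    mantissa_bits lists d0 first, i.e. the digit immediately to the
    right of the radix point comes first.
    """
    fraction = sum(d * beta ** -(i + 1) for i, d in enumerate(mantissa_bits))
    return (-1) ** sign * fraction * beta ** exponent
```

For instance, `fp_value(0, [1, 0, 1], 3)` evaluates the binary fraction 0.101 scaled by $2^3$.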
3.2.2 A Systolic cell for Floating Point Multiplication
The digit-serial multiplier shown in Figure 3.3 is constructed from a linear array of sim-
ple systolic cells and implements the multiplication algorithm presented in the previous
section. The operands indicated in Figure 3.3 are a digit-serial sequence and the mode
input aligns the {X} and {Y} input operand formatting as the operands pass through
each cell. The systolic cell is shown in Figure 3.4 and consists of a number of delay cells,
































Figure 3.4: A digit-serial multiplier cell
The speed of this multiplier is limited by the maximum speed of the multiplication and
addition elements. The cell control comes from the mode input and operates the
multiplexers and the digit-serial multiplier (shown in Figure 3.4) to reconfigure the cell
depending on the operation required. The simple cell control together with the reformu-
lation of the multiplication algorithm in terms of a pipelined digit multiplication leads
to an elegant systolic multiplier cell.
3.2.3 Digit-serial Floating Point Multiplier Model
A multiplier model implementing floating point multiplication is described briefly. Let
{X} and {Y} be two sequences of digits entered in parallel into a machine M and let {Z}
be a sequence of digits output from the machine. The sequences are constructed from k
digit 2-tuples. Each 2-tuple represents a discrete floating point number and consists of
an ordered exponent and mantissa number pair. Each number is entered least significant
digit first. The first e digits in a 2-tuple represent the exponent, and the remaining (k-e)
digits represent the mantissa. A mode signal is used to differentiate between exponent













Let the state of the machine at time n be $\{S_p(n, X_0, X_1, X_2, Y_0, Y_1, Y_2, P) : p = 1, \ldots, m\}$
where the state variables $X_i$, $Y_i$ and P represent storage nodes for digits and i is a
non-negative integer. The states $X_i$ and $Y_i$ are indicated on the systolic cell in Figure
3.4. The behaviour of the pth cell in the machine is defined by the following recurrence
relations:
$$
\begin{aligned}
X_0(p,n) &= X_2(p-1,n) \\
X_1(p,n) &= X_0(p,n-1) \\
X_2(p,n) &= X_1(p,n-1) \\
Y_0(p,n) &= Y_2(p-1,n) \\
Y_1(p,n) &= \begin{cases} Y_0(p,n-1) & ik + 2p \le n < ik + 2p + e + 1 \\ Y_1(p,n-1) & ik + e + 2p + 1 \le n < (i+1)k + 2p \end{cases} \\
Y_2(p,n) &= \begin{cases} Y_1(p,n-1) & ik + 2p + 1 \le n < ik + 2p + e + 1 \\ Y_0(p,n-1) & ik + e + 2p + 1 \le n < (i+1)k + 2p + 1 \end{cases} \\
P(p,n) &= \begin{cases} X_1(p,n) + Y_1(p,n) & ik + 2p \le n < ik + 2p + e \\ P(p-1,n-1) + X_1(p,n)\,Y_1(p,n) & ik + 2p + e \le n < (i+1)k + 2p \end{cases}
\end{aligned}
$$
In the following, a two-dimensional mapping $n = ki + j + 1$ is used to express the nth
digit of the linear input and output sequences in terms of the ith 2-tuple. Using this
mapping, the one-dimensional sequences {X}, {Y} and {Z} can all be written in the
form $\{u(i,j) : \forall i \ge 0 : 0 \le j < k\}$ where the element $u(i,j)$ is the jth digit of the ith
2-tuple.
Lemma 1: The X1 state of the pth cell, $X_1(p,n)$, is expressed in terms of the input digit
sequence {X} as
$$X_1(p,n) = X(i, (n - 2p)_k) \qquad (3.17)$$
where $(\cdot)_k$ is the remainder modulo k.
Proof: By induction.
Lemma 1 states that the digit sequence through $X_1(p,n)$ is circular.
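The digit index claimed by Lemma 1 can be checked by simulating the three X recurrences directly. The Python sketch below is only an illustration under stated assumptions: the helper names `make_x_path`, `stream` and `x` are hypothetical, cell 1 is assumed to read the input stream directly (that is, $X_2(0,n)$ is taken to be the input digit at clock n), and the input is assumed to follow the mapping $n = ki + j + 1$. Each register value is represented by the pair (i, j) identifying the digit it holds.

```python
from functools import lru_cache


def make_x_path(k):
    """Return x(reg, p, n): the (i, j) digit held in register X_reg of cell p at clock n."""

    def stream(n):
        # Assumed input: digit j of 2-tuple i arrives at clock n = k*i + j + 1.
        return divmod(n - 1, k)              # (i, j)

    @lru_cache(maxsize=None)
    def x(reg, p, n):
        if n < 1:
            return None                      # before the first digit arrives
        if reg == 0:
            # X0(p, n) = X2(p - 1, n); cell 1 reads the input stream directly.
            return stream(n) if p == 1 else x(2, p - 1, n)
        # X1(p, n) = X0(p, n - 1) and X2(p, n) = X1(p, n - 1)
        return x(reg - 1, p, n - 1)

    return x
```

With `x = make_x_path(6)`, the second element of `x(1, p, n)` can be compared against `(n - 2 * p) % 6` for any cell p and any clock `n >= 2 * p`; each cell adds two clocks of delay, which is where the 2p offset in the lemma comes from.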
Lemma 2: The Y1 state of the pth cell, $Y_1(p,n)$, is expressed in terms of the input digit
sequence {Y} as
$$Y_1(p,n) = Y(i, (n - 2p)_k) \qquad ik + 2p \le n < ik + 2p + e \qquad (3.18)$$
where $(\cdot)_k$ is the remainder modulo k.
Proof: By induction.
Lemma 2 states that the Y1 state is a cyclic sequence of exponent digits for the given
range.
Lemma 3: The Y1 state of the pth cell of the machine M is given in terms of the digit
sequence
$$Y_1(p,n) = Y(i, p + e - 1) \qquad ik + 2p + e \le n < (i+1)k + 2p \qquad (3.19)$$
Proof: By induction.
Lemma 3 states that the Y1 state for the pth cell is the (p + e - 1)th digit in the sequence
for all n in the range given, i.e. the Y digit is stored so it can be multiplied with X digits
to form partial product terms.
The following theorem can be proven:
Theorem: For the machine inputs {X} and {Y} defined above, the state P(p,n) of the
pth stage of the machine M at time n is given by the following:
$\forall i \ge 0 : 0 \le j < e$ where $j = n - 2p$
$$P(p,n) = x(i,j) + y(i,j) \qquad (3.20)$$
$\forall i \ge 0 : e + p - 1 \le j < k - 1$ where $j = n - p - 1$
$$P(p,n) = \sum_{s=0}^{p-1} x(i, j - s)\, y(i, s + e) \qquad (3.21)$$
$\forall i \ge 0 : 0 \le r < p$ where $r = n - p$
p-l-r




The theorem can be interpreted as follows:
• In the interval defined by equation 3.20 the exponent elements of the input 2-tuples
are added independently in every cell. Only the final cell contributes to the output
digit sequence.
• In the interval defined by equation 3.21, the digits output from the pth cell (p < q)
are the low-order digits of the product of the ith input mantissae. The expression
is not defined for p = q as the low-order digits do not reach the last cell of the
machine and do not constitute any part of the output digit sequence.
• In the interval defined by equation 3.22 the digits output from the last cell of the
machine (when p = q) are the most significant digits of the product of the ith input
mantissae.
Figure 3.5 illustrates the movement of data through the Y operand path of an array of
four cells which implement these recurrences. Note that no mantissa digits are output
from the final cell (Cell 4). It is not desirable to lose the input operand since it should be
passed to the next processing element in a systolic array. To overcome this an equivalent
set of recurrences which can be shown to implement the same multiplication algorithm
are
$$
\begin{aligned}
X_0(p,n) &= X_2(p-1,n) \\
X_1(p,n) &= X_0(p,n-1) \\
X_2(p,n) &= X_1(p,n-1) \\
Y_0(p,n) &= Y_2(p-1,n) \\
Y_1(p,n) &= \begin{cases} Y_0(p,n-1) & ik + 2p \le n < ik + 2p + e + 1 \\ Y_1(p,n-1) & ik + e + 2p + 1 \le n < (i+1)k + 2p \end{cases} \\
Y_2(p,n) &= \begin{cases} Y_1(p,n-1) & ik + 2p \le n < ik + 2p + e + 1 \\ Y_0(p,n-1) & ik + e + 2p + 1 \le n < (i+1)k + 2p \end{cases} \\
P(p,n) &= \begin{cases} X_1(p,n) + Y_1(p,n) & ik + 2p \le n < ik + 2p + e \\ P(p-1,n-1) + X_1(p,n)\,Y_1(p,n) & ik + 2p + e \le n < (i+1)k + 2p \end{cases}
\end{aligned}
$$
where i is a non-negative integer. These recurrences differ from the earlier set only in the
definition of the terms involving $Y_2$. The modified definition of $Y_2$ has two advantages:
1. one gate is removed from the implementation of the control signals driving the
multiplexers for the Y operands, and more significantly,
2. the digit sequence output from the Y port of the last cell in the machine is identical
to the digit sequence input to the Y input ports of the first cell of the machine.





























































































































































































































































Figure 3.5: Operand movement through a four cell linear array of recurrence cells.
Figure 3.6: Operand movement through a modified four cell linear array of recurrence cells.
The second item has major significance to the testing and verification of a multiplier
implementation. Figure 3.6 illustrates the movement of data through an array of cells
which implement the Y recurrences. Note that the Y output from successive cells rotates
the mantissa sequence by one digit so that at the final stage the sequence is identical to
the input to the first stage.
3.3 A Systolic Ring Floating Point Processing Element
Floating point multiplication has been considered in previous sections of this chapter.
To complete the design of a PE, floating point addition must be included. The floating
point multiplication algorithm is not significantly more complex than the integer algo-
rithm. However the algorithm for floating point addition or accumulation is substantially
more complex than the integer operation due to a need for the denormalisation of one
operand. In previous work the denormalisations were handled using dedicated shift units
[AwTa93, BrBa92] and time optimal implementations for pipelined scalar and scalar mul-
tiplication [AwTa93, TaNi92]. The new digit-serial architecture unifies the floating point
multiplication and addition operations into one architecture where the operands move
through a reconfigurable systolic computation cell. A new systolic ring PE shown in
Figure 3.7 implements the combined function of multiplication and accumulation and
appears at the block schematic level to be identical to that used for multiplication. The
difference is an increased complexity in both the systolic cells and the single logic element.
Interconnections are made only to nearest neighbours, as is characteristic of systolic
architectures.
The PE consists of an I/O-control unit and a circular ring of delay and systolic cells.
The systolic cells perform multiplication and accumulation on the input operands. (A
schematic of the systolic cell is shown in Figure 3.4.) The systolic cells implement
recurrence relations to perform denormalisation, multiplication or addition
depending on the instruction field to the ring which is encoded into the mode digit.
The mode input also differentiates between the exponent and mantissa elements of the









Figure 3.7: The systolic ring multiply/accumulate processing element.












operands. It allows the processing of different formats. For an m digit mantissa, it is
necessary to apply m recurrences to compute the product. The data format required by
the ring processor is shown in Figure 3.7. The last element in the systolic ring is a delay
cell. The number of delay cells in a ring is chosen so that the length of the ring is equal
to the length of the operands.
The operation can be described as follows: two operands, X and Y, are input through
a multiplexer into the ring with a mode signal, the ring is closed and the operands
are circulated an integral number of cycles. The ring is then opened to output the
results and input the next operands at the same time. Consider an operand format of k
digits¹, $M = \lceil m/r \rceil$ of which represent the mantissa, and the remaining k - M represent
the exponent, sign, instruction and guard digits. A systolic ring can be constructed
from q cells and k - 2q state registers, where $q \le M/2$. Note that m is the number
of mantissa bits and r is the number of bits per digit as defined previously. The state
register cells may be lumped or distributed. The k digits of the operands are input into
the ring and the M recurrences are applied by circulating the operands $\lceil M/q \rceil$ times
for the multiplication, and $\lceil M/q \rceil + 1$ times for accumulation. The next computation is
fully pipelined. New operands are entered into the ring as the results of the previous
computation are being output. The length of the ring is determined by the floating point
operand representation. For a representation of m mantissa bits and e exponent bits in
an r-bit per digit representation, the ring has a length² L given by:
$$L = \frac{m}{r} + \frac{e}{r} + g \qquad (3.23)$$
where g is the number of guard digits. It is assumed that $\lceil m/r \rceil = m/r$ and $\lceil e/r \rceil = e/r$.
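The ring sizing rules described above can be collected into a short Python sketch. It is a convenience calculator rather than anything from the thesis: the function name `ring_parameters` and the dictionary keys are hypothetical, the ring length is taken as $L = \lceil m/r \rceil + \lceil e/r \rceil + g$, each systolic cell is assumed to account for two storage positions in the ring, and the circulation counts use $\lceil M/q \rceil$ for multiplication and one more for accumulation.

```python
from math import ceil


def ring_parameters(m, e, r, g, q):
    """Ring length, delay cell count and circulation counts.

    m, e : mantissa and exponent lengths in bits
    r    : bits per digit
    g    : guard (and control) digits
    q    : number of systolic cells, at most half the mantissa digits
    """
    mantissa_digits = ceil(m / r)
    exponent_digits = ceil(e / r)
    if 2 * q > mantissa_digits:
        raise ValueError("q must not exceed half the number of mantissa digits")
    ring_length = mantissa_digits + exponent_digits + g
    mult = ceil(mantissa_digits / q)
    return {
        "ring_length": ring_length,
        "delay_cells": ring_length - 2 * q,   # each systolic cell holds two digits
        "mult_circulations": mult,
        "acc_circulations": mult + 1,
    }
```

For a 32-bit mantissa and 16-bit exponent with four bits per digit, two guard digits and four cells, `ring_parameters(32, 16, 4, 2, 4)` gives the ring length and the two circulation counts in one call.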
The number of operand circulations around the ring which are required to complete an
operation is determined by the ratio of the number of digits in the mantissa to the number
of systolic cells in the ring, $n_c$. The number of circulations, C, required for a multiplication
is $\frac{m}{r n_c}$ and for an accumulation $\frac{m}{r n_c} + 1$. In the illustrated format in Figure 3.7 three digits
are dedicated to instruction and guard digits. The limiting case for a processor is a single
computational cell with k - 2 delay cells. In this case the ring can process operands whose
specifications range from k - 4 mantissa digits and a single exponent digit to k - 4 exponent
digits and a single mantissa digit. This PE provides a wide range of possible dynamic
range and precision options in a single hardware implementation. The architecture can
be optimised with respect to the number of systolic cells $n_c$, the number of circulations of
the operands in the processing element C, and the number of bits per digit r. The cost of
this flexibility is the number of recirculations needed for each product. The consequence
of the two architectural improvements is a different systolic cell. Partial products formed
in the cells of the multiplier are accumulated with the uncommitted PP multiplier input.
This allows the accumulation to be performed in parallel with the multiplication and so
does not contribute to the overall cell cycle time. A less detailed version of the cell as
presented previously is shown in Figure 3.8. The function of the cell is (from equation
3.15):
$$Z = XY + PP + V$$
¹ k now includes a sign and guard digit, mantissa and exponent digits.
² Length refers to the number of storage cells for an operand in the systolic ring.
This cell function is used in two ways during floating point multiplication. During the
mantissa multiplication the algorithm implemented is:
$$PP_{out} = XY + PP_{in} + V$$
where V is the high-order digit generated by the pipelined multiplier and $PP_{in}$ is the
partial product input from the previous cell. During the exponent addition mode, the
following function is implemented:
$$PP_{out} = X \times 1 + V + Y = X + Y$$
The value of V is zero in this part of the computation as there is no high-order output
from the product X × 1. This implements the exponent addition in each cell.
Figure 3.9 shows the logical function of the systolic cell when it is performing a mul-
tiplication operation on the mantissa of two floating point operands. The output from
the systolic ring multiplier (high-order digits denoted by XY') using the results of the
section 3.1, equation 3.12 is:
N-t c-lÐt
i=0 À=0 le*rø' 














Figure 3.8: The systolic multiply/accumulate cell.





















Figure 3.10: The logical function of the systolic cell during denormalisation.
Figure 3.10 shows the logical function of the systolic cell when it is performing a denor-
malisation operation for the floating point accumulation function. Two operations are
required in this mode; one to increment an exponent difference, and the other to shift
the appropriate mantissa. Both operations are performed by the cell on the exponent
and mantissa fields of the required operand.
3.4 Performance Metric of a Rectangular Systolic Array Processor
It has been common to use a variety of performance metrics when designing arithmetic
units. These metrics are typically functions of execution time, power and area. Rather
than restrict the study to the optimisation of a single PE using an arbitrary metric, a
more realistic goal is the minimisation of the total job time for the computation of large
order matrix products when executed on a square array of PEs. Implementation of a
real systolic array constructed from a rectangular array of elementary inner-product-
accumulate processors imposes some physical constraints upon the system architect. It









1. the design is limited by some maximum area of active circuitry, determined by
either physical limitations such as thermal dissipation, or cost;
2. the design is limited by some maximum memory bandwidth.
To analyse the performance of the processing array it is assumed that the architecture
of the PE allows variation in both execution time and chip area. Architecture classes
ranging from bit-serial to fully parallel implementations of PEs are two extremes of these
variables. Let the chip area of a PE be $A_{pe}$ and let $T_{pe}$ be the execution time required to
process one set of operands or a wavefront in a PE. It is assumed that the area constraint
is $A_{proc}$, so the processor active area is less than or equal to $A_{proc}$, and the maximum
bandwidth constraint for the processor is $B_{proc}$. Let p be the order of a square systolic
array. The number of PEs is $p^2$. Under the above assumptions, the number of PEs in
the system is expressed as a function of the total active area, $A_{proc}$, as:




$$p = \sqrt{A_{proc} / A_{pe}} \qquad (3.26)$$
The number of operands required to drive the array inputs for each wavefront entered
into the array is 2p. The time to fetch these operands from memory, or the time for one
wavefront, is $T_{wf} = 2p / B_{proc}$. This is an upper bound for the execution time of each PE.
Thus the bandwidth constraint provides an execution time constraint for the processing
elements of the form:
$$T_{pe} \le 2p / B_{proc} \qquad (3.27)$$
Hence
$$B_{proc} \le 2p / T_{pe}$$








Consider the execution of a product of square matrices of order N on a systolic array of
order p, where $N \gg p$ and for simplicity $N \bmod p = 0$. The number of partitions to be
computed in the product is $\lceil N/p \rceil^2$, where $\lceil x \rceil$ represents the least integral value greater
than or equal to x. The pipelined time to compute each partition is given by the time
for the array to process all of the wavefronts in any given partition. If the processing
time for one wavefront is $T_{pe}$, then the time to compute one partition is $N \times T_{pe}$. It is
convenient to assume that all partitions are the same size, in which case the time required
to compute the matrix product, $T_{job}$, is:
$$T_{job} = N T_{pe} \left\lceil \frac{N}{p} \right\rceil^2 = \frac{N^3}{A_{proc}} T_{pe} A_{pe} \qquad (3.29)$$
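The job time model can be tried out numerically with the Python sketch below. The function names (`array_order`, `job_time`, `max_bandwidth`) are hypothetical and the units are arbitrary. The sketch assumes integer areas, takes the array order as the largest p for which $p^2$ PEs fit in the available area, and ignores start-up delays.

```python
from math import ceil, isqrt


def array_order(a_proc, a_pe):
    """Largest p such that p*p PEs of area a_pe fit in a_proc (integer areas)."""
    return isqrt(a_proc // a_pe)


def job_time(n, t_pe, a_pe, a_proc):
    """Time for an order-n matrix product: n wavefronts per partition,
    ceil(n / p)**2 partitions."""
    p = array_order(a_proc, a_pe)
    partitions = ceil(n / p) ** 2
    return n * t_pe * partitions


def max_bandwidth(p, t_pe):
    """Memory bandwidth (operands per unit time) that keeps 2p inputs fed
    once per PE execution time."""
    return 2 * p / t_pe
```

When n is a multiple of p and the area ratio is a perfect square, `job_time` agrees with the closed form $N^3 T_{pe} A_{pe} / A_{proc}$, so halving the area-time product of the PE halves the job time.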
Start-up delays are ignored and it is assumed that all partition computations are fully
pipelined. Under these assumptions the job execution time is minimised by minimising
the area-time (AT) product of the PEs. If the bandwidth constraint (3.27) is considered
in terms of the area constraint (3.25), the following expression relates the area and time
metrics of the PE:
$$T_{pe} \le \frac{2}{B_{proc}} \sqrt{\frac{A_{proc}}{A_{pe}}} \qquad (3.30)$$
A matched system is one in which the processor fully utilises the available bandwidth
from the memories to obtain maximum processing performance. For a matched system,












To optimise the PE over the range of architectural possibilities a number of GaAs tech-
nologies have been studied. The HGAAS-II [Vite92] process has been characterised in
terms of the properties of a limited set of fundamental circuits, typically a logic gate, a 2:1

















Circuit Element    Area (µm²)
$a_a = 11 \times a_g$
$a_r = 7 \times a_g$
$a_{mul} = r^2 \times a_a + (2r - 1) a_r$
Table 3.1: Area and time metrics of characteristic circuits for an E/D GaAs process.
metrics of these cells is given in Table 3.1. Typical values for propagation delay and area
of a GaAs NOR gate are 150 ps and 900 µm², respectively.
The number of delay cells, $n_d$, can be expressed in terms of the delay length of the ring,
L, and the number of computation cells, $n_c$, where each computation cell has two delays:
$n_d = L - 2n_c$. It can be seen from Figure 3.4 that the systolic cell area, $A_{cell}$, can be
approximated by the area of its constituents: seven r-bit registers, two 2-bit registers,
three r-bit multiplexers and an r × r pipelined digit-serial multiplier. An estimate of
ten gates has been used for instruction decoding. Using the characteristics of this E/D
GaAs process shown in Table 3.1, the area estimate for the systolic cell is:
$$A_{cell} = 7r a_r + 4a_r + 3r a_m + 10a_g + r^2 a_a + (2r - 1)a_r = Ar^2 + Br + C$$
where $A = a_a$, $B = 9a_r + 3a_m$ and $C = 3a_r + 10a_g$. The area of a delay cell is
$A_d = (4r + 2)a_r$. Thus the area of a PE, $A_{pe}$, can be written as:
$$A_{pe} = n_c A_{cell} + n_d A_d + A_{con}$$
where $A_{con}$ is the area of the PE control. The delay of a systolic cell, $T_{cell}$, is assumed
to be determined by the setup and hold time of the registers plus the critical path delay
of the digit-serial multiplier discussed earlier. The multiplier critical path contains r + 1
full adders and one AND gate. Letting the gate delay be $t_g$, the total multiplier delay is
$(2r + 3)t_g$ and the register delay is $5t_g$, giving a total cell delay of $T_{cell} = 2(r + 4)t_g$. Using
equation 3.23, the time to circulate an operand through the ring, $T_c$, is given by:
$$T_c = \left( \frac{m}{r} + \frac{e}{r} + g \right) 2(r + 4)\, t_g$$
The number of circulations required for a multiplication is $\frac{m}{r n_c}$ and for an accumulation




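the number of clocks to complete the required multiply/accumulate operation as follows:
$$T_{pe} = \left( \frac{m}{r n_c} + 1 \right) \left( \frac{m}{r} + \frac{e}{r} + g \right) T_{cell}$$
$$= \left( \frac{8m(m+e)}{n_c r^2} + \frac{2m(m+e)}{n_c r} + \frac{8}{r}\left( m + e + \frac{mg}{n_c} \right) + 2m + 2e + \frac{2mg}{n_c} + 8g + 2rg \right) t_g$$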
The AT product of the PE can be represented by the polynomial expression
$$A_{pe}T_{pe} = (a_0 r^2 + a_1 r + a_2 + a_3 r^{-1})(t_0 r^{-2} + t_1 r^{-1} + t_2 + t_3 r) \qquad (3.32)$$
where
$a_0 = n_c a_a$, $a_1 = n_c(B - 8a_r) + 4a_r g$,
$a_2 = n_c(C - 4a_r) + 4a_r(m + e) + 2g a_r + A_{con}$,
$a_3 = 2(m + e)a_r$,
t _ 16m(rn*e)Lo - ---T-;
h:9#ùi8m-t\etff,
tz = 2m *2e -l8g * T, ts:29"
The evaluation of the partial derivative of the AT product with respect to the number
of cells gives a result which is always negative. Hence, the AT product is minimised by
using the maximum number of systolic cells $n_c = \frac{m}{2r}$, which requires two circulations
of the operands for a multiplication or denormalisation and three for an accumulation
operation. Expressing the number of cells $n_c$ as a function of the number of bits per digit,
r, allows the AT product to be expressed as a function of r:
$$A_{pe}T_{pe} = \left[ r\left( \frac{m a_a}{2} + 4a_r g \right) + 0.5m(B - 8a_r) + 4a_r(m + e) + 2g a_r + A_{con} + 0.5 r^{-1} m (C - 4a_r) + 2(m + e) a_r r^{-1} \right] 6 \left( \frac{m + e}{r} + g \right)(r + 4)\, t_g \qquad (3.33)$$
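The area and time model can be evaluated numerically with the Python sketch below. It is a hedged illustration only: the function name `pe_area_time` is hypothetical, every area and delay parameter is supplied by the caller, and the multiplexer area `a_m` and control area `a_con` in particular are placeholders because no values for them survive in this section. The sketch uses the continuous model with the maximum number of cells $n_c = m/2r$ and three circulations (an accumulation), so $n_c$ and the ring length need not be integers.

```python
def pe_area_time(r, m, e, g, a_g, a_a, a_r, a_m, a_con, t_g):
    """Return (A_pe, T_pe) for the continuous model with n_c = m / (2r).

    a_g, a_a, a_r, a_m : areas of a gate, full adder, register bit and
                         multiplexer bit (caller-supplied)
    a_con              : area of the PE control
    t_g                : gate delay
    """
    n_c = m / (2 * r)                       # maximum number of systolic cells
    ring_len = (m + e) / r + g              # ring length L
    n_d = ring_len - 2 * n_c                # delay cells
    a_cell = a_a * r**2 + (9 * a_r + 3 * a_m) * r + (3 * a_r + 10 * a_g)
    a_delay = (4 * r + 2) * a_r
    a_pe = n_c * a_cell + n_d * a_delay + a_con
    t_cell = 2 * (r + 4) * t_g              # register plus multiplier delay
    t_pe = 3 * ring_len * t_cell            # three circulations per accumulation
    return a_pe, t_pe
```

A sweep such as `[(r, pe_area_time(r, 32, 16, 2, 1, 11, 7, 3, 0, 1)) for r in (1, 2, 4, 8, 16)]` shows how the two factors trade off as the digit size changes; the position of the minimum depends on the placeholder parameter values, so no particular optimum should be read into this example.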
In the above analysis it has been assumed that all functions are continuous and
differentiable. These assumptions neglect physical restrictions such as the requirement for
the number of delay cells to be integral. As a consequence the model was extended
to incorporate into equation 3.33 the following constraints: $\frac{e}{r} = \lceil \frac{e}{r} \rceil$, $\frac{m}{r} = \lceil \frac{m}{r} \rceil$ and






3.5.1 Area-Time Model Evaluation
Evaluation of the resulting expression derived in equation 3.32 for a mantissa length of
32-bits, and an exponent length of 16-bits gives a three-dimensional plot of the AT
metric for the systolic ring processor shown in Figure 3.11. This is a function of both the









Figure 3.11: The AT metric for the systolic ring multiplier.
Figure 3.12 is a plot of the evaluation of a continuous time and discrete time model
for the AT performance metric versus the number of bits per digit. Figure 3.12 is for
multipliers with 32 mantissa bits, 16 exponent bits and two guard or control digits,
implemented with the maximum number of systolic cells permissible in a ring, i.e. $m/(2r)$
cells. Figure 3.12 shows that a PE with 4-bits per digit would be AT minimum for
this particular processor and number format. Local minima in the AT product for the
discrete time curve are associated primarily with ? : l|l and so for mantissa lengths
which differ from this example, the optimal number of bits per digit may differ from
four. It is apparent from the graph that the continuous model represents a lower bound
for the AT performance metric of the processor. The AT product over all $r$ and $n_c$ has
also been evaluated for a number representation with 64 mantissa bits, 16 exponent bits
and two guard or control digits. The results of this evaluation are shown in Figure 3.13
and it can be seen that 4-bits per digit is still an optimal solution for double precision










Figure 3.12: The AT metric for the systolic ring multiplier for a continuous model and a
constrained model with the maximum number of systolic cells versus the number of bits
per digit for m = 32 and e = 16.























Figure 3.13: The AT metric for the systolic ring multiplier for a continuous model and a
constrained model with the maximum number of systolic cells versus the number of bits
per digit for m = 64 and e = 16.
double precision computation using the systolic array
3.5.2 Processor Bandwidth Requirement
The bandwidth requirements of the ring under the constraint of a given maximum area
are determined by equation 3.31, which is plotted as a function of both the number of
bits per digit and the number of systolic cells in the systolic ring in Figure 3.14. As the
number of systolic cells increases, the bandwidth requirement rises sharply but, as the
number of bits per digit in the number representation increases, the bandwidth curves
saturate. This is due to the simple models used to implement the arithmetic units. If
arithmetic speed-up techniques were used, such as Booth encoding for larger bit per digit
implementations, the bandwidth curves would continue to rise in line with the increased


















Figure 3.14: The bandwidth metric for the systolic ring processing element under the
constraint of constant area.
Since the processor performance is limited by the available bandwidth, it is useful to know
which architectures are under- or over-utilised. The constraint on the system imposed by
the memory bandwidth in equation 3.28 can be rewritten as:
Bpro" - 0 (3.34)
Equation 3.34 is graphed in Figure 3.15 for a range of values of r and n with an active area
of ten million gates assumed for the complete processor. Figure 3.15 shows that while
equation 3.34 is greater than zero the constraint is met and the processor bandwidth
is fully utilised. When the expression equals zero the processor and memory subsystem
requirements are matched, and when the expression becomes negative, the bandwidth is
no longer fully utilised. An increase in the order of the array can return the bandwidth
to full utilisation. The available memory bandwidth is 400 Mbytes/s. A significant set of











Figure 3.15: The bandwidth constraint for the systolic ring processing element for ten
million gates (axes: 1 to 32 systolic cells and 1 to 32 bits per digit).
3.6 Summary
Digit-serial arithmetic has been studied for a systolic cell which performs multiplication.
The algorithm for the multiplication of an N digit number by an M digit number, where
the digit base B is $2^r$, is rewritten in terms of the binary representation of the digits. The
result is used to show that for digit-serial multiplication a full r x r-bit multiplication of
digit pairs is not required at each time step. In particular, it shows that the digits which
contribute to the output sequence are formed from the accumulation of partial multiplications
whose critical paths are approximately half that of an r x r-bit parallel multiplier.
A systolic ring digit-serial multiplier is described which uses a single-level pipelined parallel
multiply/accumulate cell to implement the partial multiplications indicated by the
decomposition. An architecture for a systolic floating point processing element which can
perform multiplication, accumulation and denormalisation of two floating point operands
has been proposed. The performance of a rectangular systolic array of PEs was analysed
using the metrics of area, time and bandwidth. It was found that the AT metric should be
minimised to give the smallest job time for a matrix product. The architecture has been
optimised for use in a class of systolic array processors to perform matrix computations
by minimising the AT metric of the processing element. An optimal implementation has
been shown to consist of arithmetic units with 4-bits per digit (nibble) and four systolic
cells in the ring for HGAAS-II GaAs technology using an IEEE single precision format.
Chapter 4
Design, Layout and Simulation
4.1 Introduction
In this chapter the PE is designed and simulated and a layout produced for remote
fabrication. The building blocks for the systolic ring PE include a data flip-flop, toggle
flip-flop, full adder, multiplexers, clock generation, clock distribution circuits and bonding
pads. A conventional 6-NOR gate data flip-flop is modified and several new versions
produced to incorporate clear and preset functions. The signal integrity of the data flip-flop
is critical for the correct operation of the chip since data storage circuits take up
most of the layout area. Various adder circuit implementations constructed from DCFL
and SDCFL classes are studied. The full adder circuit is used throughout the PE and so
the area-time characteristics of each implementation were studied to find a minimum to
satisfy the processor model from Chapter 3. The architectural studies presented in the
previous chapter show clearly that four-bits per digit is an optimal implementation for
a particular class of PE using the HGAAS-II process. It was decided to implement the
PE with the following requirements:
o use digital GaAs process (from Thomson-CSF, 0.8 µm SAGA)
o area available is 15 mm² due to cost constraints
o PE chip functions
  - floating point multiplication
  - floating point addition (includes denormalisation)
  - floating point flags
o extended single precision floating point format
o state machine controller for data I/O
o individual circuits must be testable, therefore include separate test structures
Schematics and layouts are produced using basic circuits for the systolic cell, systolic ring
controller, delay cell and flag checking. A variable speed clock generator was needed and
a design based on ring oscillators was produced. Clock distribution for a systolic
ring is studied. Design constraints were derived from data flip-flop timing characteristics
so the correct transfer of data could be achieved. The physical clock distribution circuit
must be carefully simulated and includes buffers and an H-tree interconnection structure
to distribute the clock signal to the data flip-flops in the PE. Power circuits are designed
and checked for current density limits and voltage variation limits due to resistance and
inductance in the power rails. A floorplan of the chip is presented and a final layout
produced. Fabrication and packaging details are also discussed.
4.2 Floating Point Representation
An extended floating point representation is defined which exceeds both the dynamic
range and precision of the single precision IEEE-754 standard [ieee85]. Referring to
equation 3.16 in Chapter 3, the specified minimum number of bits in the representation
is $e \ge 11$, $m \ge 32$ where $e$ is the number of exponent bits and $m$ is the number of
mantissa bits. The exponent bias is unspecified, however the minimum exponent value
is $E_{min} \le -1022$ and the maximum exponent value is $E_{max} \ge 1023$. Note the encoding
of non-zero values may only be used in extended formats. To align the number of bits
with the total number of digits in the PE for a 4-bit per digit implementation, $e = 12$
and $m = 32$. The operand format consists of eight mantissa digits, three exponent digits,
one flag and one guard digit. The flag digit contains a zero flag and a sign flag. The
guard digit stores the most significant digit of the result. This architecture implements a
multiplication with two circulations of the data around the ring, so the processor requires
a total of 55 clocks to perform a multiplication and accumulation. Rounding is not
implemented because an extended number format is used and all bits are carried to the
next operation unmodified. Conversion to other formats for floating point compatibility











Figure 4.1: Processing element architecture.
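The digit count of the operand format in Section 4.2 can be checked with a few lines of arithmetic. The following is a minimal sketch only; the function names and the default of one flag digit and one guard digit are illustrative assumptions taken from the description in the text, not part of the chip design files.

```python
import math

def operand_digits(m_bits, e_bits, r, flag_digits=1, guard_digits=1):
    """Digits per operand: mantissa, exponent, flag and guard fields.

    Hypothetical helper; r is the number of bits per digit.
    """
    mantissa_digits = math.ceil(m_bits / r)
    exponent_digits = math.ceil(e_bits / r)
    return mantissa_digits + exponent_digits + flag_digits + guard_digits

def meets_extended_minimum(m_bits, e_bits):
    """Minimum field widths quoted in the text: e >= 11 and m >= 32."""
    return e_bits >= 11 and m_bits >= 32

digits = operand_digits(m_bits=32, e_bits=12, r=4)
print(digits, "digits per operand,", digits * 4, "bits per operand")
print("meets extended minimum:", meets_extended_minimum(32, 12))
```

With m = 32, e = 12 and r = 4 this gives the eight mantissa digits, three exponent digits, one flag digit and one guard digit listed in the text.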
4.3 GaAs Circuit Design, Simulation and Layout
This section deals with the design of the components of the chip that form the building
blocks for the PE.
4.3.1 Data Flip-Flop
A data latch is required to store data that is recirculated around the systolic ring. The
following requirements are considered essential in the design:
o operation from DC to 1 GHz
o area-time efficient storage since latches make up around 50% of a layout
o ability to be cleared or preset into a state
o single ended clock input
o complementary output available
Master-slave latches may be implemented in a variety of ways using current mode techniques
or conventional logic gates (e.g. NOR). Because a small power supply voltage (1 to 2 V) is
used, current mode techniques cannot be used since they require a higher power supply
voltage to correctly bias the circuit. Multiple power supply voltages may be used to overcome
this but this increases the layout complexity. Figures 4.2, 4.3 and 4.4 show different
half latches using simple gates which are transparent on half of the clock cycle. These
may be implemented in Scheme 1 (Figure 4.5) where two phase non-overlapping clocks
are used or Scheme 2 (Figure 4.6) where a single clock line is used to avoid distributing
two clock phases. Note that in Scheme 2 only half of the logic is active at any time but









Figure 4.3: Master-slave half latch (2)
The 6-gate data flip-flop is a good approach and has been used successfully by foundries















Two phase non-overlapping clocks, P1 and P2.
Timing constraint: Thi + Tl > Tcl; Th is the hold time.










Single phase clock, P1, with P2 generated at each slave.
Figure 4.6: Clock Scheme 2.
negative edge triggered data flip-flop with a single D input as shown in Figure 4.7.
Figure 4.8 shows the same data flip-flop but with $\bar{D}$ being required as well. Figures
4.9, 4.10 and 4.11 show similar arrangements but with a clear signal incorporated. All
flip-flops have the $Q$ and $\bar{Q}$ outputs available and are negative edge triggered. Table 4.1
shows the characteristics of the data flip-flops. Data flip-flops that have both $D$ and
$\bar{D}$ inputs will toggle at a higher frequency and may make the overall circuit operation
slightly faster if $\bar{D}$ does not have to be generated. The interconnection would be simpler
and the number of gates used would be reduced slightly if a single input was used.
A problem can occur for the latch shown in Figure 4.9: if the clock is low and D is low
when the clear goes low, a '1' is latched to Q instead of '0'. This problem disappears if
clear goes low when the clock is high. To overcome this, the clear signal must be held
low long enough for D = 0 to propagate to the next latch. Alternatively, the circuit
shown in Figure 4.10 overcomes the problem by gating the D input with the clear signal.
The latch shown in Figure 4.11 is impractical for design in GaAs since there is a 4-input








Figure 4.7: Data flip-flop 1: Schematic of a 6-NOR data flip-flop with a single input.
Figure 4.8: Data flip-flop 2: Schematic of a 6-NOR data flip-flop with $D$ and $\bar{D}$ inputs.
Figure 4.9: Data flip-flop 3: Schematic of a 6-NOR data flip-flop with clear.
Figure 4.10: Data flip-flop 4: Schematic of a 6-NOR data flip-flop with improved clear.
Figure 4.11: Data flip-flop 5: Schematic of a 6-NOR data flip-flop with clear, $D$ and $\bar{D}$
inputs.




























































immunity. The latch in Figure 4.10 is considered the best alternative if a reset or clear
function is required otherwise the latch in Figure 4.11 could be used. Ring notation for
the latch with clear or preset is shown in Figure 4.12. Figure 4.13 shows the resulting








Figure 4.12: Ring notation of a GaAs data flip-flop with clear or preset.
Figure 4.13: Layout of a GaAs data flip-flop using ring notation.
Set-up and Hold
The set-up time of a latch is the time before the clock edge where the input must be
held stable. The hold time of a latch is the time after the clock edge where the input
must be held stable. To find the point at which a latch becomes metastable, a simulation
of the latch was carried out with a transition occurring near the clock edge. The set-
up time was found to be 150ps, the hold time is 120ps and the propagation delay is 360ps.
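As a rough cross-check of the 1 GHz requirement, the simulated set-up time and propagation delay can be combined into a minimum clock period. The sketch below assumes a simple single-cycle model in which data launched on one clock edge must satisfy the set-up time at the next edge; the function names, and the optional logic delay and skew parameters, are illustrative assumptions and not part of the thesis.

```python
# Latch timing figures quoted in the text (picoseconds).
T_SETUP_PS = 150.0
T_HOLD_PS = 120.0
T_PROP_PS = 360.0

def min_clock_period_ps(logic_delay_ps=0.0, skew_ps=0.0):
    """Smallest period for which data launched at one edge meets set-up at the next.

    skew_ps is (receiving latch clock time - sending latch clock time);
    a positive skew gives the data more time to arrive.
    """
    return T_PROP_PS + logic_delay_ps + T_SETUP_PS - skew_ps

def max_clock_mhz(logic_delay_ps=0.0, skew_ps=0.0):
    """Maximum clock frequency in MHz under the same assumptions."""
    return 1e6 / min_clock_period_ps(logic_delay_ps, skew_ps)

print(min_clock_period_ps())      # directly connected latches, no skew
print(round(max_clock_mhz(), 1))  # corresponding frequency in MHz
```

Under these assumptions two directly connected flip-flops need a period of at least 510 ps, which leaves margin against the 1 GHz target; any logic between the latches adds directly to the required period.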












































Figure 4.14: SPICE transient response simulation of a GaAs data flip-flop with a 1 GHz
Toggle Flip-Flop
Toggle flip-flops are needed in the clock divider circuit and operate by inverting their
outputs at each clock cycle. The toggle flip-flop was constructed from a data flip-flop
with the outputs fed back to the inputs, $D = \bar{Q}$ and $\bar{D} = Q$. The resulting layout is shown
in Figure 4.16.
4.3.2 Full Adder
Full and half adders with equal sum and carry times were required for the digit-serial
multiplier accumulator. The equations for the sum and carry terms generated from the
a, b and c inputs were:
\[ H_s = a \oplus b \]
\[ S = H_s \oplus c \]






Figure 4.15: Final version of the data flip-flop schematic with clear.











The signals available were $a$, $\bar{a}$, $b$, $\bar{b}$, $c$ and $\bar{c}$ and the outputs $S$, $\bar{S}$, $C$ and $\bar{C}$ had to be
produced. There are several ways to generate the sum and carry using either DCFL
























Figure 4.17: SDCFL implementation of a full adder using adder half equations with (a)














The full adder sum and carry equations are written below:
\[ S = a.b.c + a.\bar{b}.\bar{c} + \bar{a}.\bar{b}.c + \bar{a}.b.\bar{c} \]
\[ C = a.b + a.c + b.c \] (4.2)
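The sum-of-products form of the full adder sum and the majority form of the carry can be checked exhaustively against ordinary addition. This is a minimal verification sketch; the function names are illustrative and it models only the Boolean equations, not any of the GaAs circuit implementations.

```python
from itertools import product

def full_adder(a, b, c):
    """Evaluate the sum-of-products sum and majority carry for bits a, b, c."""
    na, nb, nc = 1 - a, 1 - b, 1 - c  # complemented inputs
    s = (a & b & c) | (a & nb & nc) | (na & nb & c) | (na & b & nc)
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def check_full_adder():
    """Compare the Boolean equations with integer addition for all 8 input cases."""
    for a, b, c in product((0, 1), repeat=3):
        s, carry = full_adder(a, b, c)
        assert 2 * carry + s == a + b + c
    return True

print(check_full_adder())
```

The same loop also confirms that the sum equals the two-stage exclusive-OR form (a XOR b) XOR c, since both agree with the arithmetic sum bit.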
Various independent implementations for sum and carry generation are shown in Figures
4.19 and 4.20, respectively, using equations 4.2. The adder may be composed of any
combination of these circuits. The circuit area is proportional to the number of devices
if the same layout strategy is used. Table 4.2 shows the area (device)-delay product for















Figure 4.19: Full adder sum generation circuits
The device-delay product for the full adder in Figure 4.17a is 24,960. The full adder
in Figure 4.17b is not considered since the DCFL outputs would be unable to drive the
required load. The delay for the cases in Table 4.2 is less than for Figures 4.17 and






























Figure 4.20: Full adder carry generation circuits









sum-b 4.r sum-c 4.19c






combinational circuit stabilises. In high speed circuits glitches should be avoided because
they cause additional noise to be generated on power and signal buses. In the case of
the full adder, glitches can arise due to unequal delay paths for the cases where the half
sum is generated. The circuits summarised in Table 4.2 have almost equal delay and are
glitch free if the inputs are synchronous. Another consideration is the amount of routing
between cells. In all cases, three inputs and their inverses need to be provided, which
must be propagated between cells. A method to slightly reduce the number of devices
and eliminate the need to generate the inverse of each signal is shown in Figures 4.21
and 4.22. The inputs are only a, b and c, which are used directly by the carry generation
circuit to generate $\overline{Carry}$ as shown in Figure 4.21 with a delay of 370 ps, which is less
than other implementations since Carry does not have to be generated. The results of
the first set of gates are a.c, a.b and b.c which are fed into the sum generation circuit in
Figure 4.22. The result, S, is calculated:
S: a.b.cIa.a.c..l6+b.ã. I añc.6i
$= a.b.c + a.\bar{b}.\bar{c} + \bar{a}.b.\bar{c} + \bar{a}.\bar{b}.c$
The delay through the sum path is 550 ps. The device-delay product for this implementation
is 26,400 which is nearly the same as the cases in Table 4.2 except the carry path
is shorter. The implementations shown in Figures 4.21 and 4.22 were used in the final












Figure 4.22: Sum generation circuit used in the final design.






The systolic cell implements digit-serial arithmetic on the three operands X, Y and PP
where X, Y and PP are single precision floating point numbers which have mantissa,
exponent, flag and guard fields. The design of the systolic cell presented in Chapter
3, Figure 3.4 is modified to also perform the accumulation operation. In the following
discussion, a subscript 'm' indicates the mantissa part and 'e' denotes the exponent part
of an operand. Each number is entered least significant digit first. There are three modes
of operation: multiply, add and denormalisation. These are determined by the 'c' and










INSTR c IN Cell Operationd
Table 4.3: Systolic cell instructions
Multiplication Mode
In multiplication mode, the Y operand is the multiplicand and the X operand is the
multiplier. To form partial product terms, the first nibble of the Y mantissa is stored
and multiplied with each nibble of the X operand and the result is accumulated with the
input partial product and output to $PP_{out}$. The Y mantissa is nibble-wise rotated as it
passes through each systolic cell. To complete a mantissa multiplication, the mantissa
must pass through the same number of systolic cells as there are nibble-digits in the
mantissa. Hence, the number of systolic cells in the ring multiplied by the number of
complete rotations of the operands around the ring must equal the number of
nibble-digits in the mantissa. The cell function is used in two ways during floating point
multiplication. During the mantissa multiplication the algorithm implemented is
\[ PP_{out} = (XY + V) + PP_{in} \]
where $V$ is the high-order digit generated by the pipelined multiplier and $PP_{in}$ is the partial
product input from the previous cell. During the exponent addition mode, the following
function is implemented:
\[ PP_{out} = (X \times 1 + V) + Y = X + Y \]
The value of $V$ is zero in this part of the computation as there is no high-order output
from the product $X \times 1$. This implements the exponent addition in each cell.
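The mantissa recurrence can be illustrated in software. The sketch below is an assumption-laden behavioural model, not the pipelined hardware: it applies the multiply/accumulate step digit by digit with 4-bit digits, carrying the high-order digit V to the next step, and stores one Y nibble per pass as each cell does. The names `cell_step` and `digit_serial_multiply` are hypothetical.

```python
R = 4                 # bits per digit (nibble)
MASK = (1 << R) - 1   # 0xF

def to_digits(value, n):
    """Split an integer into n digits, least significant digit first."""
    return [(value >> (R * i)) & MASK for i in range(n)]

def from_digits(digits):
    """Reassemble an integer from digits given least significant first."""
    return sum(d << (R * i) for i, d in enumerate(digits))

def cell_step(x, y, v, pp_in):
    """One step of PP_out = X*Y + V + PP_in; returns (output digit, next V)."""
    t = x * y + v + pp_in          # at most 15*15 + 15 + 15 = 255
    return t & MASK, t >> R

def digit_serial_multiply(x_val, y_val, n):
    """Multiply two n-digit mantissas using only the cell recurrence."""
    x, y = to_digits(x_val, n), to_digits(y_val, n)
    pp = [0] * (2 * n)
    for j, yj in enumerate(y):     # one pass per stored Y nibble
        v = 0
        for i, xi in enumerate(x): # X digits stream through, LSD first
            pp[i + j], v = cell_step(xi, yj, v, pp[i + j])
        pp[j + n] = v              # final high-order digit of this pass
    return from_digits(pp)

print(hex(digit_serial_multiply(0x12345678, 0x9ABCDEF0, 8)))
```

Because the intermediate value never exceeds 255, both the output digit and V always fit in one nibble, which is why a single high-order digit is sufficient in the recurrence.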
Addition Mode
In addition mode, the X and Y operand exponents are assumed to be equal and the
function of the PE is to add the mantissae. The $Y_e$ (Y exponent) is directed to the $PP_{out,e}$
exponent and the mantissa result is $PP_{out,m} = X_m + Y_m$. In terms of the systolic cell
function, $Z = XY + PP + V$ when the X input is set to '1' and the X operand is
multiplexed to PP when the mantissa is passed through the cell. Every cell performs
this computation but only one cell is required to perform the mantissa addition. Since
the $PP_{in}$ to a systolic cell is blocked, all cells perform the mantissa addition and routing
of $Y_e$ but only the last cell actually provides the result $PP_{out}$ at the output.
Denormalisation mode
The function of denormalisation is to shift the mantissa of the smaller of the two operands and
increment the exponent. Only the systolic cell functionality is tested here. Each cell
operates independently and, when presented with an instruction to denormalise, the Y
operand is shifted with respect to the X operand by one digit by bypassing a delay cell.
The exponent field is incremented by one by using the cell function:
\[ PP_{out,e} = Z = 1 \times Y + V + 1 \]
One circulation of the ring shifts the Y operand by 4 digits. For full denormalisation
capability an exponent subtraction would determine which operand is to be denormalised
and the operands would be exchanged if necessary. A difference counter would then determine
how many cells the operand would need to pass through to align the mantissae. To
complete a cycle of the ring any excess instructions would be NOPs. A schematic of
the 4-bit digit-serial multiplier cell is shown in Figure 4.24 which incorporates the extra
functions required for denormalisation and addition. The schematic for the complete
systolic cell is shown in Figure 4.25 and the corresponding layout is shown in Figure 4.26.
Figure 4.24: Nibble-serial multiplier schematic.
A SPICE simulation of the critical path through the digit-serial multiplier carry path
(Figure 4.27) shows a delay of 3.2 ns, which indicates that the maximum clock speed at
which the processor will function correctly is greater than 300 MHz. It takes 55 clock cycles
to do a multiplication-accumulation, therefore the floating point performance which can
be expected from the device is approximately 11 Mflops.
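The performance estimate follows from simple arithmetic on the critical path delay and the cycle count. The sketch below reproduces it under one stated assumption: a multiplication-accumulation is counted as two floating point operations, which is what makes 300 MHz and 55 clocks come out at roughly 11 Mflops. The function names are illustrative.

```python
def max_clock_mhz(critical_path_ns):
    """Clock frequency limit implied by a critical path delay in nanoseconds."""
    return 1e3 / critical_path_ns

def mflops(clock_mhz, clocks_per_mac, flops_per_mac=2):
    """Floating point rate, assuming each multiply-accumulate counts as two operations."""
    return clock_mhz * flops_per_mac / clocks_per_mac

print(max_clock_mhz(3.2))            # limit from the 3.2 ns critical path
print(round(mflops(300.0, 55), 1))   # rate at a 300 MHz clock
```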
4.3.4 I/O Pads
The input and output pads interface the chip to the outside world. They have a low
voltage swing of around 0.8 V and provide a high bandwidth electrical interface to the















Figure 4.25: Schematic of the systolic cell.


























0, 0t ' Time
l3 0N 0t
Figure 4.27: SPICE simulation of the critical path through the digit-serial multiplier
Barbara and supplied by MOSIS. They were modified to suit our pad-ring requirements
Input Pad
The non-inverting input pad protects the chip from ESD and allows a signal to enter
the chip with a high bandwidth of up to 500 MHz. A schematic of the input pad is
shown in Figure 4.28 and the layout is shown in Figure 4.29. Figure 4.30 shows a simulation
of the input pad. The pad requires an external reference voltage, VREF, to be
supplied to the comparators in the pad. It was found that VREF = 0.7 V provides a symmetrical
response. In the top simulation in Figure 4.30 the signal pad-in is the input signal
with a 1.3 V amplitude and a 1 ns rise and fall time. The three signals out-tt, out-ss1 and
out-ss2 correspond to the DCFL signal on the chip for typical, 1σ- and 2σ-slow MESFET
parameters, respectively. The delay through the pad ranges from 1 to 2 ns. The lower
simulation in Figure 4.30 shows the current drawn from the pad supply, I(Vddp), and
from the input, I(Vin). There is negligible current drawn in the low state but around
1.6 mA is drawn from the supply in the high state, giving a power dissipation of 3.2 mW.
There is a large change in current drawn from the supply when switching between logic
states.
Output Pad












Figure 4.28: Schematic of the input pad.




















Figure 4.30: Simulation of an input pad receiver with VREF = 0.7 V showing input,








Figure 4.31: Schematic of the output pad.
ing layout is shown in Figure 4.32. A simulation of the output driver pad is shown in
Figure 4.33. The upper simulation in Figure 4.33 shows the response of the pad to a rising
and falling edge on the chip (chip-out) with a 700 mV amplitude and a 1 ns rise and fall
time. The output pad drives into an external 50 Ω, 1 pF load. The three output responses
pad-tt, pad-ss1 and pad-ss2 correspond to typical, 1σ- and 2σ-slow MESFET parameters,
respectively. The delay through the pad is less than 1 ns. The lower simulation in Figure
4.33 shows that the current drawn from the pad supply, I(VDDP), is negligible in the
low state but around 2 mA in the high state, giving a power dissipation of 4 mW. There is
a large change in current drawn from the supply when switching between logic states, as
with the input pad.
4.3.5 Ring Controller
A finite state machine controller was used to control the I/O of data from the systolic
ring. The input to the controller is the instruction inputs a and b in both the input
(oo,bor) and inside the ring (o,inn,b,¿nn). The controller has eight states, s0-s7 and two






































Figure 4.33: Simulation of an output pad showing the voltage response and current drawn
for typical-typical, slow-slow-1 and slow-slow-2 process parameters.
outputs, I0 and I1. The state transition diagram is shown in Figure 4.34 and I0 opens
the I/O multiplexer for the ring to feed the operands into the ring. Once the operands
are loaded, the ring is closed and one more circulation of the operands is carried out
to complete the computation. The ring is then opened, and the operands and result are
output at the same time the new operands are loaded (i.e. the cycle is repeated). The
input/output multiplexer to the ring is controlled by the I0 signal. Currently, I1 is not
used. The states s1, s2 and s3 are used by the flag generation circuit to generate the flag
bits. The schematic diagram of the controller is shown in Figure 4.35 and a functional
















Figure 4.34: State transition diagram for the ring controller
simulation using IRSIM is shown in Figure 4.36 which verifies the operation. The
outputs are multiplexed between the first cell and the ring by the multiplexer under the
control of the ring controller. The schematic of a 4-bit multiplexer is shown in Figure







































































Figure 4.37b: Schematic of the I/O multiplexer
4.3.6 Flag Checking
The flag field for both X and Y input operands and the result $P_{out}$ have the specification








Table 4.4: Specification of input and output operands flag nibble.
and indicates the result, $P_{out}$, is zero. $X_{m,sign}$ and $Y_{m,sign}$ are the sign bits for the mantissa
of the X and Y operands, respectively. These are fed into an EX-OR function to form
the sign result for the multiplication. The circuit schematic of the flag generation circuit
is shown in Figure 4.38 and the signals s1, s2 and s3 from the ring controller are used
to define when the flag checking circuit is operational. The resulting layout is shown in
Figure 4.39.
4.3.7 Clock Generation
A two speed single-phase clock is generated on-chip using a DCFL ring oscillator which
can run at either 1 GHz or 600 MHz. To provide a range of possible clock speeds for
low speed functional testing, high speed performance verification and process characterisation,
the output from the two speed oscillator is divided by two additional modulo-4
counters. External signals are used to multiplex between the direct and derived clocks as
well as an external clock source. The internally generated clock frequencies are 37.5, 62.5,
150, 250, 600 and 1000 MHz. There is no constraint on the external clock frequency, so it
can be used for DC testing. Figure 4.40 shows the clock architecture, Figure
4.41 shows the layout and Table 4.6 shows the clock control signals. The signal RST is
the active high ring oscillator reset signal and the rate signal determines the path through
the ring as shown in Table 4.5. CLKpad is the external clock input. A SPICE simulation






Figure 4.38: Schematic of the flag generation circuit.














































Figure 4.40: Clock architecture.
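The six internally generated clock frequencies follow from the two oscillator rates and the two modulo-4 counters. The sketch below assumes the counters are cascaded so that each stage divides the previous output by four; that arrangement and the function name are assumptions made for illustration, chosen because they reproduce the frequencies listed in the text.

```python
def clock_frequencies_mhz(oscillator_rates_mhz=(1000.0, 600.0),
                          divider_modulus=4, divider_stages=2):
    """List every selectable clock: each oscillator rate after 0, 1 or 2 divider stages."""
    freqs = set()
    for rate in oscillator_rates_mhz:
        for stage in range(divider_stages + 1):
            freqs.add(rate / divider_modulus ** stage)
    return sorted(freqs)

print(clock_frequencies_mhz())
```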































Figure 4.42: Clock generator transient simulation for two clock rates















































There is a significant overhead in generating and distributing a synchronous clock across a
VLSI system [Come92]. A global synchronous clock scheme could be used for small arrays,
but for large arrays clock synchronisation in a fully synchronous system will not work if
significant clock skew exists between iso-synchronous zones. To overcome this problem,
techniques such as clock frequency multiplication in each chip from a global lower rate
clock and balancing skew using H-Trees [Bako90] may be used. As the order of the array
increases, the communication between the PEs would need to be asynchronous, but each
PE would have its own synchronous clock which may be derived from a global or locally
generated clock. The PE is characterised by a ring of registers, some with a delay path
between adjacent registers and some without delay. A set of clocked data latches $\{L_i\}$
can be described in terms of the following timing characteristics:
o $T_{cl}$ is the clock period
o $T_h$ is the hold time
o $T_s$ is the set-up time
o $T_p$ is the propagation delay from the clock to the output
o $T_i$ is the time at which latch $L_i$ is clocked
The condition under which data is transferred correctly between the adjacent latches, $L_i$
and $L_{i+1}$, in a linear array is
\[ -T_{cl} + T_p + T_s < \Delta T_i < T_p - T_h \]
where the clock skew $\Delta T_i = T_{i+1} - T_i$. In an ideal synchronous system there would not
be any clock skew, that is:
\[ \Delta T_i = 0 \quad \forall i \]
Consider a ring structure of $N$ latch elements shown schematically in Figure 4.43 for the
case of $N = 8$. Extending the definition of a linear array to a ring, the skew, $\Delta T_i$, between
two adjacent latches $L_i$ and $L_{(i+1) \bmod N}$ is:
\[ \Delta T_i = T_{(i+1) \bmod N} - T_i \]




The sum of the clock skews in a closed ring is zero. Figure 4.43 shows an example of
an imperfectly clocked ring from which it can be seen that any positive clock skew must
be matched by an equivalent negative skew. The clock signal between latches 3 and 4 has a
positive clock skew while the clock signal between latches 4 and 5 has a negative clock
skew (Figure 4.43). The clock signal between latches 1 and 2 has no clock skew. It is also
apparent that there is no upper limit to total skew across $j$ elements provided that the
constraint on each $\Delta T_i$ is satisfied. The simple design constraint which was adopted to
guarantee correct operation of the ring was that any two latches in the ring should be
able to communicate with each other. Hence, for latches $L_i$ and $L_j$, $\Delta T_{i,j}$ has an upper
bound given by:
\[ \Delta T_{i,j} < T_p - T_h, \quad \forall i, j \]
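The skew constraints can be exercised numerically. The sketch below is illustrative only: it uses the simulated latch figures from Section 4.3.1 (set-up 150 ps, hold 120 ps, propagation 360 ps), assumes a 1 GHz clock period, and takes a made-up list of clock arrival times for an eight-latch ring. All function names and the example arrival times are hypothetical.

```python
T_S, T_H, T_P = 150.0, 120.0, 360.0  # set-up, hold, propagation delay (ps)
T_CL = 1000.0                        # assumed clock period (ps), i.e. 1 GHz

def skew_bounds_ps(t_cl=T_CL):
    """(lower, upper) bound on the adjacent skew: -T_cl + T_p + T_s and T_p - T_h."""
    return (-t_cl + T_P + T_S, T_P - T_H)

def ring_skews(clock_times_ps):
    """Skew between each latch and its successor, wrapping around the ring."""
    n = len(clock_times_ps)
    return [clock_times_ps[(i + 1) % n] - clock_times_ps[i] for i in range(n)]

def adjacent_ok(clock_times_ps, t_cl=T_CL):
    """True if every adjacent pair satisfies the linear-array condition."""
    lo, hi = skew_bounds_ps(t_cl)
    return all(lo < d < hi for d in ring_skews(clock_times_ps))

def any_pair_ok(clock_times_ps):
    """Stricter design rule: the skew between any two latches is below T_p - T_h."""
    return max(clock_times_ps) - min(clock_times_ps) < T_P - T_H

arrivals = [0, 0, 20, 80, 30, 10, 0, 5]  # hypothetical clock arrival times (ps)
print(ring_skews(arrivals), sum(ring_skews(arrivals)))
print(adjacent_ok(arrivals), any_pair_ok(arrivals))
```

The skews around the closed ring always sum to zero, whatever the arrival times, and with these latch figures the any-pair rule allows up to 240 ps between the earliest and latest clock arrival.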













Figure 4.43: (a) A clocked ring where the arrows indicate the delay from the clock gen-

































































































' H' -tree distribution networks






















Figure 4.46: SPICE simulation of the clock distribution across the chip.











Equal length clock lines using an H-tree layout style were used to distribute the clock to
the 266 flip-flops (532 EFET loads) in the chip. Figure 4.47 shows the layout of the clock
distribution system including buffers. A two stage system of super buffers was simulated
and optimised to drive the long buses in the clock tree. Figures 4.44 and 4.45 show the
design of the first and second stages of the super buffers. Transmission line models were
also used to simulate the clock system and Figure 4.46 shows a SPICE simulation of the
clock waveforms at the leaves of the H-tree. The simulation shows that the clock skew,
$\Delta T_{i,j}$, between any two latches has been controlled to within 100 ps to ensure correct latch
operation.
4.3.9 Systolic Ring
The complete processing element is a systolic ring consisting of three elements and a ring
controller. The first is an I/O logic element in which I/O and logical operations are performed.
The second is a systolic cell which implements two distinct recurrence relations
upon operands circulating in the ring. The selection of the appropriate recurrences to be
applied at a given time is determined by an instruction nibble, INSTR, which circulates
with the operands and includes the mode which was defined previously. The third element
of the ring is a delay cell. The number of delay cells in a ring is chosen so that the
length of the ring is equal to the length of the operands. The mantissa length is 32-bits
and r = 4-bits per digit. Therefore, there are eight mantissa digits and if there are four
systolic cells in the ring, two circulations of the operands are required. The schematic
of the PE is shown in Figure 4.48 which includes the controller (Figure 4.35), the flag
generation circuit (Figure 4.38), four systolic cells (Figure 4.25) and six 16-bit delay cells
(Figure 4.50) which are built from the 4-bit delay elements shown in Figure 4.49.
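The relationship between mantissa length, digit size, cell count and circulations can be written as a small check. This is a sketch under the rule stated in Section 4.3.3, that the number of systolic cells multiplied by the number of circulations must equal the number of mantissa digits; the function name is hypothetical.

```python
def circulations_for_multiply(mantissa_bits, bits_per_digit, systolic_cells):
    """Circulations needed so that cells * circulations equals the mantissa digit count."""
    digits, rem = divmod(mantissa_bits, bits_per_digit)
    if rem:
        raise ValueError("mantissa length must be a whole number of digits")
    circulations, rem = divmod(digits, systolic_cells)
    if rem:
        raise ValueError("digit count must be a multiple of the number of cells")
    return circulations

print(circulations_for_multiply(32, 4, 4))  # the implemented PE
```

For the implemented PE (32-bit mantissa, 4 bits per digit, four cells) this returns two circulations, matching the text; a ring with eight cells would need only one.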
4.3.10 Floorplan
The floorplan of the overall chip is shown in Figure 4.51. The ring structure of the
processing element is in the centre of the chip with the I/O on one side of the chip and
test structures on the other. Some basic circuit elements were placed on the chip as test







Figure 4.48: Schematic of the systolic ring processing element
Figure 4.50: A 16-bit delay element used in the systolic ring
4.3.11 Power Circuit
The design specification adopted for the power distribution circuit was that the voltage
should not drop by more than 5% due to resistive effects across the chip. The design
of the power distribution circuit took into consideration the maximum allowable current
density and the self inductance, and maximised the supply to ground capacitance of the power
buses. Simulated total power dissipation of the PE chip is 2.2 W with a 2 V supply. The
power is dissipated using a 132-pin Multi-Layer Ceramic (MLC) package with a finned
heat-sink. Buffered DCFL is the major logic class used. It is a normally-on class in
which there is little dynamic power dissipation. The voltage swing is small (0.6 V) and
the operation of a DCFL gate is to switch current from the pull-down FET in the output
logic low state to the forward biased Schottky diode on the gate of the load device. Ring
notation places gate structures in a local area so the change in current into this local area
is small.
Clock Distribution
The clock distribution and the power rails for the logic circuits were separated to
minimise noise coupling. The peak current in the clock circuit for the feed wires is 20 mA.































































Current Density Limits in Power Buses
Each data flip-flop dissipates around 1 mW of power and a full adder requires 2 mW.
For each power bus there are four full adders and four latches which require 12 mW or
6 mA of current. Power supply rails fed from each end have a maximum current
at the ends of 3 mA. A systolic cell draws 100 mA of current and is double-end fed so
the 240 μm wide power bus has a peak of 50 mA passing through it to give a current
density of about 0.2 mA/μm. This is 20% of the metal-3 current density limit of 2.8 mA/μm.
The twenty-four metal-2 buses which take the current to the circuits from the metal-3
buses are each 3 μm wide to produce a maximum current density of 0.66 mA/μm. A gold
bond wire is 25 μm in diameter, which gives a cross sectional area of 491 μm². The
maximum current density for gold is Jmax,Au = 6 × 10^5 A/cm² and therefore a bond wire
can only carry 2.95 A. A 160 μm wide pad made from metal-3 has a current limit of
179 mA, therefore the maximum current through a pad is limited by the metal-3
connection from the pad. The input pad draws 5 mA and the output pads draw 25 mA of
current and therefore one set of Vdd and GND pads should only supply four output pads.
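The current budget above can be reproduced with a few lines of arithmetic. The Python sketch below is only a check of the quoted figures (12 mW per bus at 2 V, a 25 μm gold bond wire and a limit of 6 × 10^5 A/cm²); the function names are hypothetical and are not part of the original design flow.

```python
import math

SUPPLY_V = 2.0  # supply voltage quoted in the text


def bus_current_ma(n_adders=4, n_latches=4, adder_mw=2.0, latch_mw=1.0):
    """Current drawn from one power bus in mA (total power / supply voltage)."""
    return (n_adders * adder_mw + n_latches * latch_mw) / SUPPLY_V


def bond_wire_limit(diameter_um=25.0, j_max_a_per_cm2=6e5):
    """Return (cross-sectional area in um^2, maximum current in A) for a round wire."""
    area_um2 = math.pi * (diameter_um / 2.0) ** 2
    area_cm2 = area_um2 * 1e-8  # 1 um^2 = 1e-8 cm^2
    return area_um2, area_cm2 * j_max_a_per_cm2


def current_density_ma_per_um(current_ma, width_um):
    """Linear current density of a bus in mA per um of width."""
    return current_ma / width_um


if __name__ == "__main__":
    print(bus_current_ma())                      # mA per bus
    print(bond_wire_limit())                     # (area, current limit)
    print(current_density_ma_per_um(50.0, 240))  # metal-3 bus, one fed end
```

A bus fed from both ends carries half of `bus_current_ma()` at each end, which corresponds to the 3 mA figure in the text.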
Inductance Limit
Consider the inductance and subsequent voltage induced in a metal-2 power line running
through a systolic cell. For two parallel metal-2 wires with a pitch of 2 μm and a width
of 1 μm, the mutual inductance is given by [LoBu89]:
$$L = \frac{120\pi}{c} K$$
where K = 0.33 is an elliptic integral function which depends on the width and pitch
of the wires and c is the speed of light. In this case L = 8 nH/cm. Note that these
wires are thinner than actual metal-2 power distribution wires and the inductance is
overestimated. The voltage induced between the wires is given by:
$$\Delta V = L \frac{dI}{dt}$$
The current change in the power supply for a single data flip-flop is approximately 0.4 mA
as seen in Figure 4.52. A single Vdd bus may supply ten data flip-flops where, in the





























Figure 4.52: Data flip-flop simulation showing supply current
will be 4 mA. For a 1 mm long Vdd bus and a 4 mA current ramp with a typical rise time of
100 ps, ΔV = 32 mV. This is less than our design limit in Chapter 3 which was 5% of 2 V
(100 mV).
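The induced-voltage estimate can be checked numerically from ΔV = L dI/dt. The sketch below uses the values quoted in the text (8 nH/cm, a 1 mm bus, a 4 mA ramp in 100 ps); the function and argument names are assumptions made for illustration.

```python
def induced_voltage(l_per_cm_h=8e-9, length_cm=0.1, delta_i_a=4e-3, rise_time_s=100e-12):
    """Voltage induced on a supply line: delta_V = L * dI/dt, with L = L' * length."""
    inductance_h = l_per_cm_h * length_cm
    return inductance_h * delta_i_a / rise_time_s


if __name__ == "__main__":
    dv = induced_voltage()
    limit = 0.05 * 2.0  # 5% of the 2 V supply
    print(f"delta V = {dv * 1e3:.1f} mV, limit = {limit * 1e3:.0f} mV")
```

The result of 32 mV sits well inside the 100 mV limit, in agreement with the text.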
4.4 Fabrication and Packaging
The chip was successfully fabricated by Thomson-CSF Semiconducteurs Specifiques,
France using the HGAAS-II process licenced from Vitesse Semiconductor Inc., USA
on their first fabrication run with this process. The chip was fabricated in a Gallium
Arsenide 0.8 μm E/D MESFET process. The total chip size including pads and test
structures is 3.1 mm × 5.8 mm and includes 16,000 devices. The dimensions of the PE by
itself are 1.7 mm × 4.5 mm giving an active area of 7.5 mm² with 12,000 devices, resulting in
a device density of 1600 FETs/mm². The chip was bonded into a 132/84 pin MLC package
supplied by TriQuint Semiconductor [TriQ91]. This is a high speed package which
supports the special requirements of very high performance ICs. There are 84 signal lines
and two power supplies with internal decoupling capacitors between the internal power
and ground planes which minimise switching noise on the power supplies. Signals are
carried on 50 Ω controlled impedance transmission lines between the package leads and





to 4 to 5 W with a finned heat-sink. The package is mounted upside-down to allow the
heat-sink to be attached to the back of the package. The gull wing leads make it suitable
for surface mounting or contact mounting using an elastomer ring. The delay of signals
through the package ranges from 70 ps to 110 ps. Appendix B shows the pin assignment
to the package. A micrograph of the fabricated GaAs systolic PE chip after packaging is
shown in Figure 4.53 with the floorplan overlaid.








Testing the Processing Element Chip
This chapter presents the test environment, test procedure and results of testing the PE
chip. A test jig was designed and constructed to mount the PE chip and interface it to
a digital tester and other test equipment. The wires from the test jig to the probe tips
of the digital tester must propagate digital signals at a 300MHz rate without significant
distortion. The discrete enhancement and depletion mode MESFETs are characterised.
A digital tester was used to test the functionality of the PE.
5.1 Test Fixture
A custom designed test fixture (test jig) provides a platform to quickly test the packaged
chips. The 132/84 pin MLC package is surface mounted to the PCB with an elastomer
ring (pressure contact) to allow easy mounting and unmounting of test chips. The
footprint is the same as that which would be used for solder reflow assembly. The board
is designed to interface to the Tektronix DAS-9200 digital tester. The connections are
standard gold PCB pins with a 0.1" (inch) pitch. A thin, low dielectric, double sided
Teflon PCB provides controlled impedance lines to the test interface pins. Solder pads
are provided to connect chip resistors or capacitors to either end of the board trace.
The test fixture provides:
o a precise environment for high speed digital circuits
o easy mounting/unmounting of packaged chips in a 132/84 MLC package
o 132 external pin connections
o direct interface to the Tektronix DAS
o controlled impedance lines
o provision for chip resistors or capacitors at either end of a trace
5.1.1 PCB Design
The PCB is 4.1" square and the top and bottom artwork of the board
are shown in Figures 5.1 and 5.2, respectively.
Figure 5.1: Top layer of the test fixture PCB.
The output pads of the chip drive into a resistor to ground whose resistance is nominally
50 Ω. The design parameters are:
o signal traces 0.015" wide with 0.010" spacing (0.025" pitch)
o available board material is Teflon (εr = 2.55), double sided 1-ounce copper. Thick-




























Figure 5.2: Bottom layer negative of the test fixture PCB
o pins to the DAS must be a signal-ground pair which can be arranged in a group of
eight or individually. The pin spacing in both directions is 0.100". They must be
gold to make reliable contact and to prevent metal migration of tin into the DAS
probe tips
o due to the compact nature of the test board, trace lengths could not be equalised
so there is a difference in delay between some traces
The traces on the PCB are treated as lossy transmission lines. The structure is a
microstrip where the backside ground plane reflects the signal to produce its dual. The
impedance may be approximated by:
$$Z_0 = \frac{87}{\sqrt{\epsilon_r + 1.41}} \ln\left(\frac{5.98h}{0.8w + t}\right)$$
where εr = 2.55 is the relative dielectric constant of Teflon, t = 0.0013" is the thickness
of the copper, w = 0.015" is the width of the track and h = 1/64" (0.0156") is the thickness
of the dielectric. The characteristic impedance is Z0 = 85 Ω. A board thickness of





























54 Ω on a 1/64" board, the track width must be doubled to w = 0.030", which is impractical
in this application. A board trace may be designed in one of three ways:
o line terminating resistor to ground (Rt)
o line source resistor to ground and terminating resistor to ground (Rs, Rt)
o line series source resistor and terminating resistor to ground (Rss)
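The microstrip impedance quoted for the PCB traces can be evaluated numerically. The sketch below assumes the expression in the text is the widely used closed-form microstrip approximation Z0 = 87/sqrt(εr + 1.41) · ln(5.98h/(0.8w + t)), and uses the stated board parameters; the function name and argument names are illustrative assumptions.

```python
import math


def microstrip_z0(eps_r, h, w, t):
    """Approximate microstrip characteristic impedance in ohms.

    h: dielectric thickness, w: track width, t: copper thickness.
    All three lengths must use the same unit (inches here).
    """
    return 87.0 / math.sqrt(eps_r + 1.41) * math.log(5.98 * h / (0.8 * w + t))


if __name__ == "__main__":
    h = 1.0 / 64.0  # board thickness in inches
    print(round(microstrip_z0(2.55, h, 0.015, 0.0013), 1))  # 0.015" track
    print(round(microstrip_z0(2.55, h, 0.030, 0.0013), 1))  # track width doubled
```

For the 0.015" track this evaluates to roughly 85 Ω, and doubling the track width lowers the impedance towards the mid-50 Ω range, which is the trade-off discussed above.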
Provision was made for source and terminating resistors or capacitors to ground on each
signal line. HSPICE was used to model a signal trace driven by one of the output pads
through a bond wire, lead and transmission line, terminating in a probe. The equivalent


















[Figure 5.3 labels: probe load; lossy transmission line, Z0 = 85 ohms, length = 4.5 - 17.8 mm]
Figure 5.3: Equivalent circuit of a signal driven off chip
the signals, the following simulations were done
o Using a å" thick Teflon PCB:
- A 40 mm long line with no source resistance and terminating resistors of 25, 50,
75 and 100 Ω. Figure 5.4 shows the 50 Ω load has the best damping although









- A 40 mm long line with a 50 Ω terminating resistance for source resistance
values of 25, 50, 75 and 100 Ω. Figure 5.5 shows the 75 Ω source resistor gives
the best result with a slight overshoot and a 1 V swing.
- A 40 mm long line with a 50 Ω terminating resistance and a 50 Ω source
resistance with track widths of 0.3048 mm, 0.381 mm and 0.5 mm (Figure 5.6).
- A 40 mm long line with a 25 Ω terminating resistance and a 25 Ω source
resistance with track widths of 0.3048 mm, 0.381 mm and 0.5 mm (Figure 5.7).
This has a smaller output voltage swing than the 50 Ω case (Figure 5.6).
o Using a 1/64" thick Teflon PCB:
- A 40 mm long line with a 25 Ω terminating resistance and a 25 Ω source
resistance with track widths of 0.3048 mm, 0.381 mm and 0.5 mm (Figure 5.8).
All simulation results show the on-chip pad input signal as well as the response at the
start (Rs) and end (Rt) of the PCB line. These simulations show there is negligible
difference in the response due to changes in track width. The source and terminating
resistors should be about the same for a good response (fast rise time with a small
overshoot). The delay from the input to the PCB pin is about 1 ns. Figure 5.9 shows a
signal being driven onto the chip through a board trace. The board it å" thick and the
line is terminated with a resistor, Rt, near the chip. Using these results, the board was
designed and the length of the tracks measured and resimulated to determine the skew
and the resistor values to be used. Four cases are considered: short lines and long
lines, with source resistors underneath or alongside the chip. This arises because of area
constraints around the chip. The four possible interconnect types are:
o 17.8 mm line with the source resistor under the chip
o 19.05 mm line with the source resistor outside the chip
o 32.6 mm line with the source resistor outside the chip



























































































































Figure 5.6: Simulation of a 40 mm long line with Rt = 50 Ω and Rs = 50 Ω for track








































Figure 5.7: Simulation of a 40 mm long line with Rt = 25 Ω and Rs = 25 Ω for track

























Figure 5.8: Simulation of a 40 mm long line with Rt = 25 Ω and Rs = 25 Ω for track
















































Figure 5.9: Simulation of a 40 mm long line with Rt = 25, 50, 75 and 100 Ω; the signal is
being driven onto the chip.
The final design details are:
o track width is 0.381 mm (0.015")
o track pitch is 0.635 mm (0.025")
o PCB thickness is 0.396 mm (1/64")
o source resistor to ground is 47 Ω
o terminating resistor to ground is 47 Ω
Figure 5.10 shows a simulation of the four line types with 47 Ω source and terminating
resistors to ground. The maximum skew between the signal lines is around 350 ps and
the worst case skew between two signal lines from the pad to the PCB pin is 390 ps. The
output voltage swing is 0.9 V but the resistors can be increased to around 68 Ω to obtain



























Figure 5.10: Simulation of the four possible interconnect types on the PCB with 47 Ω
source and terminating resistors.
5.1.2 Construction
The test jig was made from aluminium with a sealed cavity under the PCB. Figure











Figure 5.11: High speed test jig for 132/84 MLC packages
surface mount 805 type package which are low inductance and suitable for high frequency
applications. Terminating resistors for chip output lines are mounted next to the PCB
pins as shown in Figure 5.12. Source resistors are mounted as close to the chip as possible.
All power pins have decoupling capacitors (0.1 to 0.47 μF) connected to ground near the
chip. A photograph of the test jig with the chip mounted is shown in Figure 5.13.
5.2 Test Equipment and Set-up
The arrangement of power supply connections to the chip is critical to the correct op-






[Figure 5.12 labels: double row gold plated PCB pins; Teflon PCB (double sided copper), 1/64" thick, εr = 2.55; chip capacitor or resistor (805 package)]
Figure 5.12: Cross section through the PCB
Figure 5.13: Photograph of the high speed test jig with a chip and heat-sink installed
the I/O signals. Appendix C contains a description of the power supply connections and
sequencing to avoid ground loops, ground bounce and crosstalk. Low speed functional
and high speed testing was carried out using a Tektronix Digital Analysis System (DAS)
9200. The DAS has two 92S16 pattern generation cards (18 signal lines at 50 MHz) and a
92A96 data acquisition card which can monitor up to 24 channels at a 400 MHz
acquisition rate. The support software allows test vectors to be generated and test results
to be displayed and stored. Appendix C contains a description and specifications of the DAS
and its pattern generation and acquisition modules. A LeCroy 9360 digital storage oscilloscope
was used for measurement of high speed signals and was useful in de-bugging the system.
The oscilloscope has a sampling rate of 5 GSamples/s and a 600 MHz internal bandwidth.
The active probes used have a 1 GHz bandwidth.
5.3 Circuit Testing
Testing of the following structures was performed:
o enhancement and depletion MESFETs
o systolic cell
o systolic ring
o clock generation circuit
In all testing, the heat-sink was in place and the chip was allowed to reach a steady
operating temperature. The input and output of data to and from the ring is controlled
by the ring controller which is clocked at the same rate as the rest of the chip. For
testability purposes, two outputs are provided. One is from the 16-bit output of the
first systolic cell and the other is the 16-bit output of the ring. This allows independent
functional testing of both a systolic cell and the systolic ring, which can be carried out
under DC conditions with external clock control. High speed testing of the systolic
ring is carried out by loading the input operands into the systolic ring at low speed for
one circulation around the ring, at which point the ring closes. During loading, the clock
for the chip is provided by the DAS. The clock is then switched to being internally generated
(up to 1 GHz) and the operands are recirculated internally at high speed while the output is
monitored for the result. Appendix C shows the channel allocation and the test programs
used to test the PE chip.
5.3.1 Practical Terminations
The output of the ring oscillator was used to check the terminations on the test jig.
With a 68 Ω resistor at each end of the PCB trace, the terminating resistance was too
small to give an adequate voltage swing. The source resistor was removed and the voltage
swing increased to around 1.1 V. This may be due to a larger than expected series resistance
at the source of the PCB trace, possibly due to contact resistance of the package on
the PCB. Figure 5.14 shows the clock output waveform with a 67 Ω terminating resistor
measured using the oscilloscope. From Figure 5.14 the delay of the clock input to output
signal is around 2 to 2.5 ns and the rise and fall time of the output clock was measured




Figure 5.14: External clock input and chip clock output waveforms.
5.3.2 Fix 1 : Ground Bounce
Initial testing showed that there was a severe ground bounce problem with the four







is 'p2out' and signal '2' is the ground for that set of pads measured at the pin of the
chip. 1.6 V peak-peak of bounce was observed when any of the Pout signals changed. The
ground connection for the four 'PP' pads is not connected to the internal ground plane
of the package but to a signal line which was externally grounded. This was done
because of pin limitations in the package. The solution was to solder the package
pin (114) to an adjacent ground pin (115) that is connected to the ground plane, thus
considerably shortening the ground loop. The same signals are shown in Figure 5.16.
This illustrates the magnitude of noise that may be generated through a deficient power
supply and ground scheme.
1V/division
5ns/division
Figure 5.15: Signal p2out and the pad ground showing ground bounce.
5.3.3 Fix 2 : Separate Power Supplies
To improve the stability of the circuits, the circuit and pad supplies should be separated.
Unfortunately, a bonding error left both a circuit supply and the pad supply for 'Xout'
connected to 'PWR2' in the package. Since the circuit has several other sources for power,
the bond wire to the centre Vdd supply on the chip (connected to 'P2') was removed and











Figure 5.16: Signal p2out and the pad ground with the ground bounce solved
supplies made the circuit considerably more stable.
5.4 Fingered MESFET Test Structures
In order to allow verification of the SPICE models used in the design, an enhancement
mode fingered MESFET and a depletion mode fingered MESFET were fabricated. The
layouts of the EFET and DFET are shown in Figures 5.17 and 5.18, respectively. The
fingered structure is an economical way of producing a wide transistor capable of handling
large currents and provides some degree of protection against damage by the testing
equipment. A Tektronix transistor tracer was used to test the MESFETs.
The drain-source voltage was swept over a 0 to 4 V range when testing each MESFET. The
gate-source voltage is varied using a step voltage control and choosing the number of
steps to be traced. A capacitor was connected between the drain and source terminals of
each MESFET to prevent oscillations between the drain and source. These are caused by
the parasitic capacitances of the MESFET and the inductance and capacitance of the test
circuit acting as a resonant circuit and oscillating, producing a negative resistance effect











Figure 5.17: Layout of a fingered enhancement mode MESFET (5 fingers x 74.8 μm wide).
Figure 5.18: Layout of a fingered depletion mode MESFET (5 fingers x 74.8 μm wide).
intrinsic capacitance of the MESFET. A 0.1 μF chip capacitor was used and mounted on
the PCB as close to the PE chip package as possible.
Enhancement Mode MESFET
The tracer was set to provide six 0.1 V steps of gate voltage starting at 0.2 V, i.e. Vgs = 0.2,
0.3, 0.4, 0.5, 0.6, 0.7 V, to test the EFET. A photograph of the transistor tracer screen
showing the EFET I-V characteristics is shown in Figure 5.19. The horizontal scale is
0.5 drain-source volts/division and the vertical scale is \mA of drain current/division.





Figure 5.19: Photograph of the EFET I-V characteristics from the curve tracer.
transistor characteristics. This is due to charge being trapped in the substrate below the
transistor. As a consequence of the high resistance of the semi-insulating GaAs substrate,
the charge is dissipated slowly and so hysteresis is predominantly a low frequency effect.
The low scan frequency of the transistor tracer (approximately 1 kHz maximum) causes
a noticeable hysteresis effect in the characteristics; however, at higher frequencies the
problem diminishes. The amount of hysteresis will also depend on the amount of charge










facilitate a comparison between simulation and measurement, the I-V characteristics were
measured from the photograph and imported into Matlab. Because the hysteresis effect
is not modelled, the midpoint of each curve was taken. To determine the appropriate
simulation characteristics, the temperature and process variation must be found. During
measurement, the ambient temperature was about 30 °C so the temperature of the device
was estimated to be 30 to 50 °C. Different process variation models were tried on the
SPICE deck generated from the transistor layout until the closest match to the measured
characteristics was found. The results of the HSPICE simulation at 50 °C using
typical-typical process parameters are shown in Figure 5.20 superimposed on the measured I-V

















Figure 5.20: Comparison of measured and simulated EFET I-V characteristics using
typical process parameters.
Depletion Mode MESFET
The DFET was tested using the same procedure used for the EFET. Due to a lack of
response from the DFET on the chip from which the EFET characteristics were taken,
a second chip had to be used to test the DFET. This chip was found to be slower than
the first chip. The tracer was set up to provide eight steps for the gate-source voltage at
0.2 volts per step starting at -1.2 volts, i.e. Vgs = -1.2, -1.0, -0.8, -0.6, -0.4, -0.2,
0, 0.2 V. A photograph of the DFET I-V characteristics is shown in Figure 5.21. The
scales used are 0.5 volt/division for drain-source voltage (Vds) and 10 mA/division for




Figure 5.21: Photograph of the DFET I-V characteristics from the curve tracer
were also done for the DFET. The closest simulation was found using slow-slow process
parameters at 50 °C. A graph comparing the measured and simulated FET characteristics
is shown in Figure 5.22.
Discussion
It can be seen that in both cases the model provides a reasonable approximation of the
measured transistor characteristics. At low values of Vgs, the model appears to
over-estimate the measured characteristics, while at high values of Vgs, there appears to be an
under-estimation. Middle values of Vgs show close correlation. There are some factors

























Figure 5.22: Comparison of measured and simulated DFET I-V characteristics
o The models were not designed for such wide transistors; they were determined
using 10 μm wide MESFETs. It is known that transistor characteristics do not
scale linearly [PuEs88].
o The simulation does not take into account the geometry of the finger structure, and
treats the fingered MESFET simply as five MESFETs in parallel.
o The models were sourced from Vitesse Semiconductor Inc. through MOSIS. The
chips were actually fabricated at Thomson-CSF. Although the process has been
replicated, there will undoubtedly be some variation in performance between the
two. Thomson-CSF has not supplied models derived from their foundry process.
o The DFET is rotated by 90 degrees with respect to the GaAs crystal plane
alignment position for maximum transconductance. This may account for the slow
DFET.











5.5 Systolic Cell Functional Testing
The functionality of the systolic cell was tested using exhaustive test data generated
by the C program which checked the design functionality using IRSIM. The chip and
DAS were configured as detailed in Appendix C. To configure the clock for externally
applied input, pin 22 (CKs1) is connected to the logic level high power supply and pins
20 (CKs2), 23 (CKstop) and 25 (CKrate) are connected to GND.
5.5.1 Generating Test Vectors
For the systolic cell, the SELO input should be tied high to multiplex the input operands
permanently to the first systolic cell. In this mode the output of the PE is permanently
tied to the output of the first systolic cell. The length of the mantissa and exponent
operands may be arbitrarily long provided they follow the format shown in Figure 5.23.
















Figure 5.23: Instruction nibble for multiplication mode
tiplication mode at a 50 MHz clock rate. Figure 5.24 shows the correct result of zero
by zero. In the systolic cell test mode, the X, Y and PP operands are inverted in the
test result figures. Figure 5.25 shows the systolic cell operating in multiplication mode
where the input operands are Xin = 0.00000001 × e001 and Yin = 0.00000001 × e001. The
convention used in this chapter is that the signal Sout denotes S-out in the hexadecimal
format shown in the DAS test result figures. The outputs Xout = 0.00000001 × e001,
Yout = 0.10000000 × e001 and INSTRout are correct for three successive cycles. Yout
has been shifted eight places in the mantissa and the exponent is unchanged due to the
systolic cell holding the first mantissa digit and releasing it at the end of the mantissa
having multiplied it with each of the X mantissa digits. The X and Y exponents are
digit-serially added. The resultant partial product is Pout = 0.00000001 × e002.






























Figure 5.24: Systolic cell testing in multiplication mode at 50MHz (zero by zero)
Figure 5.26 shows the acquired data from the DAS of the systolic cell in multiplication
mode at 50 MHz. These results show the correct operation of the systolic cell.
5.6 Systolic Ring Testing
The maximum clock speed of the DAS is 50MHz for data generation, so a scheme to
determine the maximum clock frequency was devised given that data can be acquired
at a 400MHz rate. Operands are loaded into the ring under external low speed clock
control up to 50MHz. The internal high speed clock is stopped while the clock generator
multiplexers configure the clock to run at one of the specified speeds. The stop signal is
released and the outputs are monitored while the operands are recirculated and unloaded




















































Figure 5.25: Systolic cell testing in multiplication mode at 50MHz.









Figure 5.26: Results of the systolic cell in multiplication mode at 50MHz.
with the internal clock generator configured for nominal 250 MHz operation. The actual
frequency was measured using the oscilloscope to be 128 MHz. The result of an addition
of:
Xout = 0.00000C21 × e041
and
Yut :0.00000831 x "o4'
IS:
Poú :0'00001452 x eot'
after the second appearance of the data at the output (unloading cycle) of Figure 5.27.
The first cycle is the single recirculation. This computation excludes the denormalisation
operation. Although the input operands do not have the same exponent, the addition of
the X and Y mantissas is correct. Figure 5.28 shows the PE operating in floating point
denormalisation mode, where the instruction given to every systolic cell is to denormalise
the Y operand. The resultant Yout mantissa is shifted by eight digits (four cells with two
circulations) and the exponent has been incremented by eight, where:
Yin = 0.00000831 × e042,
X¿n :0.00000C21 x eoa6
and Pou¿ is the result of adding X and the shifted Y operands:
Pout = 0.00001452 × e04A
The denormalisation operation has been shown to work with this architecture; however,
more control is required to subtract the exponents and control how many digit
denormalisations are to be carried out before the addition takes place. This has not been
implemented in this chip. Figures 5.29a, 5.29b, 5.29c and 5.29d show the systolic ring
PE operating at 91 MHz in multiplication mode on a variety of input operands. Note
the resultant Pout is the high order result of the multiplication. Figures 5.30a, 5.30b
and 5.30c show the PE operating at 128 MHz in multiplication mode, including the final
circulation of the operands around the ring and output. The exponent field is where
INSTRout = 1000 and the mantissa field is where INSTRout = 1001. In some cases,
the clock was not completely recovered due to a poor 50 Ω load from the oscilloscope.
The computation is Pout = Xin × Yin where:
Xin = Xout = 0.041F9060 × e00B, Yin = Yout = 0.FA5F6802 × e0FC
and
Pout = 0.04085C612 × e107
Note that the Y operand has been rotated by four digits in the first circulation of data.
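The reported multiplication results can be cross-checked in software. The sketch below is not the chip's digit-serial algorithm; it simply multiplies the mantissas as hexadecimal fractions and adds the exponents. It assumes eight-digit mantissas, that the high order result is shown to nine hexadecimal digits by truncation, and that the X exponent of the 128 MHz test is 00B (as quoted for Xout elsewhere in this chapter). The function name is hypothetical.

```python
def fp_multiply(x_mant, x_exp, y_mant, y_exp, digits=8, out_digits=9):
    """Reference multiply for hex-fraction operands held as integers.

    x_mant and y_mant contain the `digits` hexadecimal digits after the radix
    point. Returns (leading `out_digits` hex digits of the product, exponent sum).
    """
    product = x_mant * y_mant                    # 2*digits hex digits after the point
    shift_bits = 4 * (2 * digits - out_digits)   # drop low-order digits (truncation)
    return product >> shift_bits, x_exp + y_exp


if __name__ == "__main__":
    mant, exp = fp_multiply(0x041F9060, 0x00B, 0xFA5F6802, 0x0FC)
    print(f"0.{mant:09X} x e{exp:03X}")
    mant, exp = fp_multiply(0x00010000, 0x031, 0x0045F100, 0x0F5)
    print(f"0.{mant:09X} x e{exp:03X}")
```

Under these assumptions the sketch reproduces the mantissa 04085C612 with exponent 107 for the 128 MHz test, and exponent 126 for the operands of Figure 5.29c.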
Figure 5.31 shows the PE operating in multiplication mode at 350 MHz (clock operation
set for 600 MHz typical-typical). The inputs were Yin = 0.FA5F6802 × e0FC, Xin =
0.041F9060 × e00B. While the result, Pout, is not correct, INSTRout, Xout = 0.041F9060 ×
e00B and the exponent of Yout and most of the mantissa are correct. This shows high speed
synchronous clock operation is possible. The computation fails because the critical paths
of the chip cannot complete the computation before the next clock cycle. These
results show that the PE chip is fully functional; however, the chips tested were at least






























Figure 5.27: Test results of the PE in floating point addition mode at 128MHz
5.7 Clock Generation Circuit
To test the clock generation circuit, the chip should be powered in the same way as the

































Figure 5.28: Test results of the PE in floating point denormalisation mode at 128 MHz
(CKrate, CKs1 and CKs2) were connected as shown in Table 5.1 to achieve the desired
output frequency. CKstop is the active high reset for the ring oscillator and was tied
to logic low. CKin is the external clock input which was connected to the DAS pattern
generation probe clock signal. The clock output (CKout) had no terminating resistor on
the PCB but was instead connected to the Tektronix 7514 Storage Oscilloscope through
a 50 Ω coaxial line to the sampling head, which has a 50 Ω input impedance. The clock
frequency was measured using a frequency meter.
Figure 5.32 shows the output of the clock generator set for 37.5 MHz operation with a
measured frequency of 30.I2MHz" In most cases the frequency was stable to LL\IiHz-
There was feedthrough observed from the base clock frequency (482 MHz measured) in
the high output state with Vdd = 1.89 V. Exactly eight cycles were observed, indicating
the ×16 clock was being fed through one of the multiplexer stages. Figures 5.34 and
5.35 show the variation of oscillator frequency with power supply voltage for a ring length
of seven (CKrate = 1) and thirteen (CKrate = 0) gates, respectively (set Vddp = 2.0 V
±20 mV, VREF = 0.7 V ±20 mV, Vhigh = 1.3 V ±20 mV). Observed rise and fall times fall
between 1.0 and 1.5 ns. It was also noted that Vdd > 1.75 V produced feedthrough of the






























Figure 5.29a: Systolic ring operating in floating point multiplication mode at 91 MHz









































Figure 5.29b: Systolic ring operating in floating point multiplication mode at 91 MHz







































Figure 5.29c: Systolic ring operating in floating point multiplication mode at 91 MHz
where Xin = 0.00010000 × e031, Yin = 0.0045F100 × e0F5 and Pout = 0.00000045F × e126.





























Figure 5.29d: Systolic ring operating in floating point multiplication mode at 91 MHz



















Figure 5.30a: Systolic ring operating in floating point multiplication mode at 128 MHz












































Figure 5.30b: Systolic ring operating in floating point multiplication mode at 128 MHz




































Figure 5.30c: Systolic ring operating in floating point multiplication mode at 128 MHz
where Xin :041F9060 X €ooB , Yn : F A5F6802 x eoFC and Pou¿: 0.04085C6L2 x eroT















































































Table 5.1: Simulated and observed clock frequencies for different clock rates.
[Oscilloscope scale: 1 V/division, 5 ns/division]

























threshold has just been reached in the multiplexer circuits of the clock generator, and is































Figure 5.35: Variation of clock frequency with power supply voltage for CKrate = 0.
5.36 shows the variation of oscillator output amplitude (peak-peak) with pad power
supply voltage (Vddp) for a ring length of seven (CKrate = 1) and thirteen (CKrate = 0)
















[Plot axis: pad voltage, Vddp]
Figure 5.36: Variation of peak-peak output voltage with pad power supply voltage for
Vddc = 1.6 V, CKrate = 1 and Vddc = 1.4 V, CKrate = 0.
Chapter 6
Discussion and Future Work
6.1 Discussion
This thesis has presented the design methodology, simulation, implementation and test-
ing of a systolic ring floating point processing element (PE).
Gallium Arsenide (GaAs) was chosen as the technology for implementing the PE because
of its speed and power advantages over silicon and to assess new architectures to make
best use of the characteristics of GaAs. GaAs technology was studied in Chapter 2, in-
cluding a review of MESFET and HEMT technology, large signal MESFET models, and
MESFET logic classes. It was found that DCFL, SDCFL and SBFL may be mixed to
provide a library of primitive cells from which circuits can be made. The logic classes
were optimised for delay, area and noise margin to achieve both the smallest and fastest
possible circuits. A set of design guidelines was established to build circuits. A ring
notation layout style was presented which improves the performance, area and regularity
of circuit structures for GaAs over the more traditional styles of CMOS design. An abstract
design style (analogous to stick diagrams in CMOS design) was developed to aid the full
custom layout. Circuit primitives included inverters, NOR gates, a 2-input OR gate and
buffers to drive large fan-out loads. Circuit parasitics were investigated for GaAs circuits.
The interconnect parasitics have a significant effect on circuit operation due to the fast
transition times of the logic gates. These parasitics cause crosstalk, ringing and poor
signal delays if not modelled properly. An electromagnetic field simulator, 'Raphael', was
used to study circuit interconnect structures. Models for interconnects, including a
transmission line (TL) and a lumped capacitance, were simulated using HSPICE. It was found
that for short wires (< 600 µm) a lumped capacitor model may be used with little error
to model the wire. For longer wires, a simulation using a more complex model such as a
lossless TL should be used. To facilitate full custom GaAs design, some design tools were
modified from their use in silicon design. A program 'ext2sp' was developed during the
course of the research to correctly extract GaAs devices and parasitics from the layout for
simulation in HSPICE. Technology files were further developed for use with 'MAGIC'.
Models for parasitics of pads, bonding wires and package leads were investigated and
their magnitude found from simulations.
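The wire-length guideline above can be expressed as a small helper. This is an illustrative sketch only: the function name, the returned labels and the treatment of a wire of exactly 600 µm (assigned here to the transmission line model) are assumptions and not part of the tool flow described in this thesis.

```python
# Threshold quoted in the text: wires shorter than this may use a lumped capacitor.
SHORT_WIRE_LIMIT_UM = 600.0


def interconnect_model(length_um: float) -> str:
    """Suggest a simulation model for a wire of the given length in micrometres.

    The labels returned are hypothetical names chosen for this sketch.
    """
    if length_um < 0:
        raise ValueError("wire length must be non-negative")
    if length_um < SHORT_WIRE_LIMIT_UM:
        return "lumped_capacitor"
    # Assumption: the boundary case (exactly 600 um) is treated as a long wire.
    return "lossless_transmission_line"


if __name__ == "__main__":
    for length in (150.0, 599.0, 600.0, 2500.0):
        print(length, interconnect_model(length))
```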
A new PE architecture for integrated multiplication and accumulation of two floating
point numbers was developed in Chapter 3. A digit-serial multiplication algorithm was
presented and a simple digit-serial multiplier was designed for the case of 4 bits per digit.
The digit-serial multiplier cell is a re-organised parallel multiplier which is pipelined and
optimised for fast propagation through the critical path (critical path length is five full
adder delays). A model for a systolic cell was presented which used the digit-serial multi-
plier; when systolic cells were placed in a linear array, they performed multiplication
on two arbitrary precision floating point numbers. The systolic cell was then extended
to perform the basic functions of floating point accumulation, namely denormalisation
and addition. A ring of systolic cells and delay cells can perform these floating point op-
erations and a range of architectures is possible which is variable in the precision of the
operands, the number of bits per digit in the number representation and the number of
systolic cells around the ring. To optimise this architecture for a MATRISC processor, it
was necessary to develop a performance metric. The appropriate performance metric was
shown to be area × time (AT) for these types of systolic array processors by minimising
the total job time for a matrix product on a rectangular systolic array. The model was
then evaluated for an IEEE extended single precision floating point number representa-
tion and two circulations of data in the PE using the Vitesse HGAAS-II (E/D MESFET)
process as the target technology. The results show that an optimal implementation in
the target GaAs technology consists of arithmetic units with four bits per digit and four
systolic cells in the ring.
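The idea of consuming one 4-bit digit of an operand per step can be illustrated with a short behavioural model. The sketch below is not the pipelined multiplier cell developed in Chapter 3: it only shows how a product is accumulated digit by digit, and the function names and the integer operand format are assumptions made for the example.

```python
DIGIT_BITS = 4  # four bits per digit, as in the case described above


def split_digits(value: int, n_digits: int) -> list:
    """Split a non-negative integer into n_digits digits, least significant first."""
    mask = (1 << DIGIT_BITS) - 1
    return [(value >> (DIGIT_BITS * i)) & mask for i in range(n_digits)]


def digit_serial_multiply(x: int, y: int, n_digits: int) -> int:
    """Multiply x by y, taking one 4-bit digit of y per step.

    Behavioural model only: each step forms the partial product of x with
    one digit of y and adds it in at that digit's weight.
    """
    if x < 0 or y < 0:
        raise ValueError("operands must be non-negative")
    if y >> (DIGIT_BITS * n_digits):
        raise ValueError("y does not fit in n_digits digits")
    product = 0
    for position, y_digit in enumerate(split_digits(y, n_digits)):
        partial = x * y_digit                      # x times a single digit
        product += partial << (DIGIT_BITS * position)
    return product


if __name__ == "__main__":
    print(hex(digit_serial_multiply(0xABCD, 0x1234, 4)))
```

The result agrees with ordinary integer multiplication for any operands that fit the stated digit count, which makes the model convenient for checking test vectors by hand.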
To build the physical layout of the PE chip, fast area efficient data flip-flops were de-
signed, based on the edge triggered 6-NOR flip-flop and adapted for use with DCFL and
SDCFL. A toggle flip-flop was designed based on the data flip-flop. GaAs full adder
circuits were investigated and a small AT metric implementation was chosen. Other cir-
cuits designed include a ring controller to control the I/O of operands from the PE, a
flag checking circuit, multiplexers and a clock generator based on a variable length ring
oscillator which can run up to 1 GHz and has a selectable output from divider stages. A
clock distribution system was designed based on the H-Tree approach to minimise clock
skew to the 266 flip-flops on the chip. A design approach was developed for clocking
synchronous rings of latches which found that the sum of the skew between all latches in
a closed ring is zero. The clock distribution and buffer circuits were simulated using π
models to show that the skew between adjacent latches was less than 100ps to guarantee
correct data transfer. The power circuit was designed to have less than a 5% fluctuation
across the chip due to inductive spikes and ohmic losses and to be within safe limits for
metal migration.
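The statement that the skews around a closed ring of latches sum to zero follows from the fact that each skew is a difference of two clock arrival times, so the terms cancel around the loop. The sketch below illustrates this with made-up arrival times in picoseconds; the function names are hypothetical and the 100 ps limit is the figure quoted above.

```python
def ring_skews(arrival_ps: list) -> list:
    """Skew from each latch to the next one around a closed ring.

    arrival_ps[i] is the clock arrival time at latch i (integer picoseconds).
    """
    n = len(arrival_ps)
    return [arrival_ps[(i + 1) % n] - arrival_ps[i] for i in range(n)]


def within_skew_limit(arrival_ps: list, limit_ps: int = 100) -> bool:
    """Check that every adjacent-latch skew is smaller in magnitude than limit_ps."""
    return all(abs(s) < limit_ps for s in ring_skews(arrival_ps))


if __name__ == "__main__":
    arrivals = [0, 40, 15, 90]        # illustrative values only
    skews = ring_skews(arrivals)
    print(skews, sum(skews))          # the sum is zero for any closed ring
    print(within_skew_limit(arrivals))
```

Because the total is fixed at zero, a positive skew between one pair of latches must be balanced by negative skew elsewhere in the ring, which is why the adjacent-latch limit is the quantity that has to be checked.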
The chip was successfully fabricated in a GaAs 0.8 µm E/D MESFET process by Thomson-
CSF Semiconducteurs Spécifiques, France on their first fabrication run with the HGAAS-
II process licenced from Vitesse Semiconductor, USA. A micrograph of the chip is shown
in Figure 6.1. The total chip size including pads and test structures is 3.1 mm × 5.8 mm
and includes 16,000 devices. The dimensions of the processing element are 1.7 mm × 4.5 mm,
giving an active area of 7.5 mm² with 12,000 devices, resulting in 1600 devices/mm². The
chip was bonded into a 132184 pin MLC package with a heat-sink. A test fixture was
designed and constructed to facilitate testing of the PE chip. A å" thick Teflon PCB
was designed with PCB pins connecting the pressure mounted chip to the tester. The
GaAs I/O pads have signal transitions of well under 500 ps with a 1.3 V swing, so the PCB
wires were modelled with source and terminating resistors to find the best response. Ter-
minating resistors (63 Ω) were used in the final test circuit and signal skew between lines
was less than 350 ps.
The chips were tested and oscillator speeds were measured as a function of supply volt-
age. The chips tested were found to operate correctly except that most showed oscilla-
tor speeds corresponding to 0.\o-slow simulations. The power dissipation was 1.5 W at
Vdd = 1.5 V. Most chips would not work at Vdd = 2 V, which was the designed supply
voltage. This may be due to the process spread observed. The functionality of the chips
was tested using a Tektronix DAS-9200. Input operands for testing were generated and
programmed into the DAS and the results in Chapter 5 show the correct operation of
the systolic cell and the complete systolic ring with operation at 128 MHz. At this clock
nte \Mfl,ops was achieved. Synchronous operation was shown for a 350MHz clock rate
except that the result was incorrect due to failure of the critical path, however the X
operand was fully recovered from the ring. For typical process parameters, a maximum
clock speed above 300MHz would be expected with a corresponding computation rate of
LlMfl,ops for the PE chip.
The PE is a computation node in a two dimensional mesh connected systolic array which
forms part of a proposed MATRISC processor to perform matrix operations. The sim-
ulated performance of such a device when executing matrix problems is in the range of
Gflops which is well in excess of the capabilities of current generation engineering work-
stations. The MATRISC processor closely follows the RISC philosophy of providing a
smaller set of commonly executed hardware operations. It is proposed that the matrix
hardware extension be integrated into a RISC processor system.
This thesis has shown that high performance computing components can be implemented
using GaAs technology if proper consideration is given to the characteristics of the tech-
nology to produce optimal processor architectures.
The development of GaAs layout strategies and systolic ring architectures for the PE has
been published previously [BeMa91]. The design of the systolic matrix processor and the
optimisation of the PE architecture for a target technology has been reported elsewhere
[MaBe92a, MaBe92]. The work on the design, layout and simulation of the PE chip has
been published in reference [BeMa93]. Finally, the design, simulation and testing of the
PE chip is reported in reference [BeMa95].
Figure 6.1: Micrograph of the fabricated GaAs systolic PE chip.
6.2 Future Work
Following a top-down methodology for future work:
o A complete MATRISC system needs to be thoroughly studied and simulated. Such
a study must take into account the types of algorithms that suffer large perfor-
mance penalties when executed on conventional computer systems. The optimal
MATRISC architecture should also take into account current memory speeds and
sizes to provide adequate bandwidth for such a processor. Additional components
such as caches will also improve system performance. Design tools such as VHDL









speed of the components of the MATRISC system will need to be studied possibly
by building some components.
o By breaking down the MATRISC system into components and studying each part to
determine a method for improving processor performance, parts such as memories,
scalar processors and caches are best found as "off the shelf" items. The following
require special design to gain optimum system performance:
- Processing elements do the computation work in the systolic array. They must
be both small in chip area and fast in execution speed. They must also operate
on multi-precision data, such as double precision floating point. Ideally, the
PEs should be fault tolerant and have a simple built-in self test mechanism so
faulty PEs can be bypassed or replaced. General future trends for PEs of this
type will be an increase in complexity, reconfigurability (multi-precision and
format), faster and more of them integrated onto a single chip to allow larger
arrays to be built.
- Buses link the system together and must provide an efficient way of trans-
ferring data to maximise throughput for a variety of matrix algorithms. The
maximum allowable pin density and data transfer speed determine the bus
speeds and hence the system bandwidth and overall performance.
- A possible solution for the memory is 'Rambus' [Ramb93] which provides
multiple 500Mbytes/s channels which may meet the bandwidth requirements
for a MATRISC processor.
- System integration is a significant problem when designing large systems with
chips fabricated using different technologies such as memories, PEs and caches.
A fine-line PCB solution can be used at the top level but due to the high data
transfer rates (up to 500 MHz) between chips, a multi-chip module technology
should be used. This improves both the density and hence the execution speed
of the system.
o Better logic families for faster circuit operation with higher levels of integration and
low power are needed. Complex gate structures in GaAs may provide an alternative




o The currently available technology and integration techniques drive the possible
range of architectural solutions in computer design. GaAs was investigated because
of its advantages in both speed and power over silicon technology. The solution may
lie in a different technology in the future as processes improve, integration levels
become higher and new logic classes are investigated. Motorola has announced a
complementary GaAs process, 'CGaAs' which holds much promise for the future of
high speed technologies. The largest deficiency of MESFET technology is the poor
integration level when compared to a similar gate length CMOS technology. CGaAs
has an integration level similar to that of CMOS and is claimed to be faster than
DCFL using MESFETs. The process is also simpler than CMOS, with no substrate
contacts required which saves chip area. This would make CGaAs an attractive







Appendix A : GaAs Digital Logic
Performance Specifications
Measurements are carried out on the middle gate in a chain of three identical gates so the
input and output to the gate under observation are realistic. Voltage swing was measured
as the difference in output voltage in the static logic low and the static logic high state.
The speed or delay of a logic gate was measured from the time at 50% of the input voltage
swing to 50% of the output voltage swing in response to an input with a voltage swing
and slew rate the same as the output. The rise or fall time of the gate was measured as
the time for the signal to rise or fall from 20% to 80% of the output voltage swing. This
differs from conventional CMOS which takes 10% to 90% of the output voltage swing.
This is because GaAs has a much smaller voltage swing and the relative noise in each
















discussion of noise margins can be found in [Hill86, Lohs79, Haus93] and [Wing90] has a
discussion relevant to GaAs.
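The delay and edge-time definitions given at the start of this appendix (50% to 50% of the swing for delay, 20% to 80% for rise or fall time) can be applied to sampled waveforms as sketched below. This is a minimal illustration that assumes linear interpolation between samples; the function names and the sample waveforms are invented for the example and are not taken from the HSPICE measurement set-up used in this work.

```python
def crossing_time(t: list, v: list, level: float) -> float:
    """Time of the first crossing of `level`, by linear interpolation."""
    for i in range(1, len(v)):
        lo, hi = v[i - 1], v[i]
        if lo != hi and (lo - level) * (hi - level) <= 0:
            frac = (level - lo) / (hi - lo)
            return t[i - 1] + frac * (t[i] - t[i - 1])
    raise ValueError("waveform never crosses the requested level")


def swing_level(v_low: float, v_high: float, fraction: float) -> float:
    """Voltage at the given fraction of the swing between v_low and v_high."""
    return v_low + fraction * (v_high - v_low)


def edge_time(t, v, v_low, v_high, lo_frac=0.2, hi_frac=0.8) -> float:
    """Rise or fall time between the 20% and 80% points of the swing."""
    t_lo = crossing_time(t, v, swing_level(v_low, v_high, lo_frac))
    t_hi = crossing_time(t, v, swing_level(v_low, v_high, hi_frac))
    return abs(t_hi - t_lo)


def propagation_delay(t, v_in, v_out, v_low, v_high) -> float:
    """Delay from the 50% point of the input to the 50% point of the output."""
    mid = swing_level(v_low, v_high, 0.5)
    return crossing_time(t, v_out, mid) - crossing_time(t, v_in, mid)


if __name__ == "__main__":
    # Illustrative inverter waveforms: time in ps, voltages in V (made-up data).
    t = [0.0, 50.0, 100.0, 150.0]
    v_in = [0.0, 0.3, 0.6, 0.6]
    v_out = [0.6, 0.6, 0.3, 0.0]
    print(propagation_delay(t, v_in, v_out, 0.0, 0.6))
    print(edge_time(t, v_out, 0.0, 0.6))
```

Passing `lo_frac=0.1` and `hi_frac=0.9` gives the conventional CMOS definition, which makes it easy to compare the two conventions on the same waveform.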
Noise margin for a logic gate is defined as the maximum amount of noise applied to
the input in each logic state while the output remains in the correct state. There are
static and dynamic noise margins for any gate but here we only consider the static noise
margins since robust operation under static conditions guarantees the dynamic operation




A negative noise margin indicates the logic gate will not settle into that logic state. There
are five ways to measure the static noise margin of a logic gate as defined in [Haus93]:
o NSC (negative slope criteria) selects the unity gain point in the gate input-output
voltage characteristics as the switching point where the gate moves from a logic
state to a metastable state. This may be calculated mathematically which can be
useful.
o MSC (maximum sum criteria) of NM_H + NM_L is identical to the NSC method for
most transfer characteristics but may predict a zero for one of the noise margins
since it is not concerned with the individual values.
o MNSC (modified negative slope method) is used in many textbooks and leads to
unconservative results for noise margin and has a poor theoretical basis [Haus93]
and therefore is not used.
o MEC (maximum equal criteria) or mirror and maximum square method [HiLa86]
constrains the high and low noise margins of the gate to be equal (NM_H = NM_L)
and produces the worst case equal noise margin. This may be too restrictive on the
optimisation of a logic gate and gives a more average result of high and low noise
margins.
o MPC (maximum product criteria) maximises the area of a rectangle and hence














Figure 4.3: Noise margin measurement methods using (a) NSC and (b) MPC or MEC
if the rectangle becomes square.
Only the last two techniques, namely MEC and MPC, give valid results for a wide range of
gate transfer curves. The DCFL transfer curves shown in Figure 4.3 are quite symmetric.
For this case we can use either the NSC, MPC or MEC methods, which would all give
similar results. Noise margin was measured using the gain method where the points have
unity gain. This means that the transition point from logic high to the metastable state
is where the rate of change of output divided by input is unity.
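The unity gain (NSC) measurement described above can be sketched for a sampled inverter transfer curve as follows. The example locates the region where the slope magnitude is at least one and takes its end points as the unity gain points, so the result is only as accurate as the sample spacing. The function name and the transfer curve values are assumptions made for illustration and are not measured DCFL data.

```python
def nsc_noise_margins(vin: list, vout: list) -> tuple:
    """Return (NM_L, NM_H) from the unity-gain points of an inverter curve.

    vin must be strictly increasing. The unity-gain points are approximated
    by the ends of the region whose segment slope is -1 or steeper.
    """
    if len(vin) != len(vout) or len(vin) < 3:
        raise ValueError("need two equal-length lists with at least three samples")
    steep = []
    for i in range(len(vin) - 1):
        slope = (vout[i + 1] - vout[i]) / (vin[i + 1] - vin[i])
        if slope <= -1.0:
            steep.append(i)
    if not steep:
        raise ValueError("no region with gain magnitude of one or more")
    first, last = steep[0], steep[-1]
    v_il, v_oh = vin[first], vout[first]          # low-side unity-gain point
    v_ih, v_ol = vin[last + 1], vout[last + 1]    # high-side unity-gain point
    nm_low = v_il - v_ol
    nm_high = v_oh - v_ih
    return nm_low, nm_high


if __name__ == "__main__":
    # Made-up transfer curve in volts, for illustration only.
    vin = [0.0, 0.2, 0.3, 0.4, 0.6]
    vout = [0.6, 0.58, 0.3, 0.05, 0.02]
    print(nsc_noise_margins(vin, vout))
```

A negative value returned for either margin corresponds to the case noted earlier, where the gate will not settle into that logic state.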
Appendix B : PE Chip Pin
Allocation
The following is the package pin to signal allocation for the PE chip and a description
of power supply signals. The pins are numbered clockwise from the angled corner of
the package. Note that when the package is mounted upside-down, the numbers run
anticlockwise when viewed from the top. Key to symbols:
NC - not connected
Vdd - Pad and circuit power +1.5 V (decoupled to GND)
Vddp - Pad power +2 V
Vddc - Circuit power +1.5 V
GNDp - Pad ground 0 V
GNDc - Circuit ground 0 V
GND - Package ground plane 0 V (decoupled to Vdd)
VREF - Input pad reference supply +0.7 V
TSD - Depletion mode MESFET test structure
TSE - Enhancement mode MESFET test structure
TSFF - Data flip-flop test structure
TSINV - DCFL inverter gates test structure











































































































































































































































Package Pin   Signal Name   Function   Package Pin   Signal Name   Function



































































































































































Package Pin   Signal Name   Function   Package Pin   Signal Name   Function
Table B.2: Assignment of pins to signals, power and ground (cont.)




The following is a description of the test environment set-up and equipment used for
testing the PE chip.
Power Supply Connections and Sequencing
All pad power supply circuits were driven separately from the logic circuit supply to
minimise noise and ground bounce. In addition, separate high logic level and pad voltage
references were used. The nominal circuit supply voltage (Vdd, Vddc) is 1.5 V (±5%) and
the pad power supply (Vddp) should be set to 2.0 V (±20 mV). The input pads require
a reference voltage (VREF) of 0.7 V and the high logic level reference should be set to
1.3 V (±20 mV). The total current drawn from all power supplies is around 1 A.
The circuit power supply voltage should not be higher than 50% above its nominal value
for any long period of time since excessive power dissipation may damage the chip.
The input and output pads are single ended and no differential signals are used and
each output pad is of the open-source type and can sink up to 2TmA of current. The
signal levels on all pads are 1.0 to 1.3 V for logic high and 0 V for logic low. All ground
connections to power supplies were connected to a common point on the circuit chassis to
prevent current loops. All power supply leads were shielded to prevent electromagnetic
coupling, with the shields connected back to the test jig chassis ground. The chip
was configured as follows (with reference to the pin allocation in Appendix B):
o all pins marked GND have a shorting PCB jumper connected to the adjacent
grounded pin
o pins 17 and 44 are connected to VREF supply
o pins 1, 26, 66, 67, 98, 122, 132 are connected to the pad power supply (Vddp)
o pins 29,, 3it, 33, 34, 71, 72, 99, 100, 104, 105 are connected to the circuit power
supply
Before the power for ANY power supply was turned on, the output voltage was set to
zero and then turned up to the correct voltage to avoid spikes from the power supply
entering the circuit. The input pad reference voltage supply was switched on first, then
the pad supply, the high logic level reference supply and finally the circuit supply.
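The power-up procedure above can be summarised as an ordered plan. The sketch below is illustrative only: the supply labels, the number of ramp steps and the helper names are assumptions, while the switch-on order and nominal voltages are those stated in this section.

```python
# Switch-on order stated in the text: input pad reference, pad supply,
# high logic level reference, then circuit supply.
POWER_UP_ORDER = ["VREF", "Vddp", "Vhigh", "Vddc"]

# Nominal settings in volts, as given in this appendix.
NOMINAL_V = {"VREF": 0.7, "Vddp": 2.0, "Vhigh": 1.3, "Vddc": 1.5}


def power_up_plan(steps: int = 4) -> list:
    """Return (supply, voltage) settings: each supply ramps from 0 V to nominal.

    The number of ramp steps is an assumption for this sketch; the text only
    requires that each supply starts at zero and is then turned up.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    plan = []
    for name in POWER_UP_ORDER:
        target = NOMINAL_V[name]
        for k in range(steps + 1):
            plan.append((name, round(target * k / steps, 3)))
    return plan


def order_is_valid(order: list) -> bool:
    """Check a proposed switch-on order against the one used in this work."""
    return list(order) == POWER_UP_ORDER


if __name__ == "__main__":
    for supply, volts in power_up_plan():
        print(f"{supply}: {volts} V")
```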
Test Equipment
DAS 9200 High Speed Digital Tester
The Tektronix Digital Analysis System (DAS) 9200 is a tool which provides the operating
environment for pattern generator modules, data acquisition modules, and the software
to control them. The software is configured to run via a host system using an X-windows
interface. The software allows programming of the pattern generation modules with test
vectors after which, the program can be run which sends signals to the chip under test.
The resulting signals read on the data acquisition probes (the chip outputs) can then be
displayed.
Pattern Generation Module
The 92S16 pattern generator module connects the two P6464 pattern generator pods to
the DAS. The pods are labelled A0 and A1. The P6464 provides a total of nine signal
lines (bits 0-8) and a clock line (clk). The maximum clock frequency available is 50 MHz
and the P6464 can supply signals at either TTL or ECL levels. The probe tips should
be directly connected to the PCB gold connectors with the signal label (white) on the
innermost pin and the reference (black label) on the outermost pin. The specifications
of the 92S16 pattern generator are shown in Table C.3. The P6464 receives power via
three sense leads connected to the probe: a red line for VH (voltage high), a black line
for VL (voltage low) and a green line for ground. The three power wires were connected
20mA sink or sourceCurrent Capability
Vg -IVECL I/ø Out
Vs - 7.75VECL V¿ Out
Vu- L.LVTTL I/a Out
Vn * 0.8VTTL V¿ Out
4.8V to 5.2VVn -Vr
+O.gV to -5.5V @ 100rnÁ * IrotnLow Voltaee (Vr)
-0.5V to -15.5V @ 100rnAl InotnHigh Voltaee (Vn)
50 (20ns)Maximum Clock Frequency (period)
Characteristic ecification
Table C.3: DAS 92S16 Pattern Generator Specifications
for TTL output as follows: VH = +2.4 V, VL = -2.6 V and the green wire was connected
to the test jig chassis ground. There must be a connection between the power supply
ground and the circuit chassis ground. This provides a logic high of 1.3 V and a logic low
of -2.0 V at the probe tip. A negative logic low will not affect the chip operation since
it must only be below the logic threshold to be off. ECL logic levels could not be used
since the voltage swing was too small (0.8 V).
Data Acquisition Module
The 92A96 data acquisition module was used in 24 channel high speed acquisition mode
with a resolution of 2.5 ns. The signal side of the probe tips is marked with a colour and
should point towards the chip.
DAS probe allocation
The DAS was configured through the software interface as follows
o define a cluster ('Sys-Config' menu) of the 92A96-1 and 92S16-1 modules
o under the 'Cluster Setup' menu, select run and start modes as normal and the stop
mode as manual to allow debugging of the test head while the DAS keeps running
o 92S16-1 module
















































































































Pin NumberPattern Generator Pod
Table C.4: DAS probe allocation



















































































































































Figure C.4: DAS test vector program to test the systolic ring floating point multiplication
Bibliography
[AnAr87] M. Annaratone, E. Arnould, T. Gross, H.T. Kung, M. Lam, O. Menzilcioglu
and J.A. Webb. "The Warp Computer: Architecture, Implementation and
Performance". IEEE Transactions on Computers, C-36(12), pp. 1523-1538, De-
cember 1987.
[AwTa93] M. Awaga and H. Takahashi. "The µVP 64-Bit Vector Coprocessor: A New
Implementation of High-Performance Numerical Computation". IEEE Micro,
pp. 24-36, October 1993.
[Bako90] H. Bakoglu. Circuits, Interconnections and Packaging for VLSI. Addison-
Wesley, 1990.
[Beau91] A. Beaumont-Smith. "GAASNET V2.0 - A gallium arsenide network extrac-
tor". Integrated Silicon Design Pty. Ltd., Adelaide, 1991.
[Beau92] A. Beaumont-Smith. "EXT2HSP - A conversion program from MAGIC to
HSPICE for GaAs circuits". The University of Adelaide, Adelaide, 1992.
[Beau93] A. Beaumont-Smith. "SCFL circuit design and Designer Interface for In-
GaAs/AlGaAs HEMT". Seoul National University Report, Department of Elec-
tronics Engineering, July 1993.
[BeMa91] A. Beaumont-Smith, W. Marwood, C.C. Lim and K. Eshraghian. "Ultra High
Speed Gallium Arsenide Systems: Design Methodology, CAD tools and Ar-
chitecture". Proc. Microelectronics '91, I.E.Aust Conference, pp. 85-90, June
1991.
[BeMa93] A. Beaumont-Smith, W. Marwood, K. Eshraghian and C.C. Lim. "The Gal-
lium Arsenide Implementation of a Systolic Floating Point Processing Ele-
ment". Proc. 12th Australian Microelectronics Conference, pp. 255-260, Octo-
ber 1993.
[BeMa95] A. Beaumont-Smith, W. Marwood, C.C. Lim and K. Eshraghian. "Design
and Implementation of a GaAs Systolic Floating Point Processing Element".
Submitted to IEE Proceedings-E, Computers and Digital Techniques, 1995.
[BeMa95a] A. Beaumont-Smith, W. Marwood and C.C. Lim. "A CMOS Linear Systolic
Processing Element". Proc. 13th Australian Microelectronics Conference, pp.
74-79, July 1995.
[Ber91] M. Berroth, V. Hurm, U. Nowotny, A. Hulsmann, G. Kaufel, K. Kohler,
B. Raynor and Jo. Schneider. "A 2.5 ns 8x8-b Parallel Multiplier Using 0.5 µm
GaAs/AlGaAs Heterostructure Field Effect Transistors". Microelectronic En-
gineering, 15, Elsevier Science Publishers B.V., pp. 327-330, 1991.
[Brau63] E.L. Braun. Digital Computer Design, Logic, Circuitry, and Synthesis. Aca-
demic Press, 1963.
[BrBa92] R.B. Brown, P. Barker, A. Chandna, T.R. Huff, A.I. Kayssi, R.J. Lomax,
T.N. Mudge, D. Nagle, K.A. Sakallah, P.J. Sherhart, R. Uhlig, and M. Upton.
"GaAs RISC Processors". Proc. IEEE GaAs IC Symposium, pp. 81-84, 1992.
[ClCl92] A.P. Clarke, R.J. Clarke, LA. Curtis and W. Marwood. "A Floating Point Ma-
trix Arithmetic Processor: An Implementation of the SCAP Concept". Proc.
APCCAS '92, IEEE, IREE and IEAust Asia-Pacific Conference on Circuits
and Systems, December 1992.
[CoHa92] P. Corbett and R. Hartley. "Designing Systolic Arrays Using Digit-Serial Arith-
metic". IEEE Transactions on Circuits and Systems - II: Analog and Digital
Signal Processing, Vol. 39, No. 1, January 1992.
[Curt80] W. Curtice. "A MESFET Model for Use in the Design of GaAs Integrated Cir-
cuits". IEEE Transactions on Microwave Theory and Techniques, Vol. MTT-
28, No. 5, May 1980.
[DeRe85] P.B. Denyer and D. Renshaw. VLSI Signal Processing: A Bit-Serial Approach.
Addison-Wesley, England, 1985.
[DoFr93] D.A. Doane and P.D. Franzon (Editors). Multichip Module Technologies and
Alternatives - The Basics. Van Nostrand Reinhold, New York, 1993.
[Dyks90] J.A. Dykstra. "High-Speed Microprocessor Design with Gallium Arsenide Very
Large Scale Integrated Digital Circuits". Ph.D. Thesis, The University of
Michigan, 1990.
[Eshr91] K. Eshraghian. "Fundamentals of Very High Speed Systems: Gallium Arsenide
VLSI Technology Course Notes". Centre for GaAs VLSI Technology, The Uni-
versity of Adelaide, South Australia, 1991.
[Eshr91a] K. Eshraghian, R. Sarmiento, P.P. Carballo and A. Nunez. "Speed-area-power
optimization for DCFL and SDCFL class of logic using ring notation". Micro-
processing and Microprogramming, 32, (1-5), pp. 75-82, 1991.
[FiKu83] A.L. Allan, H.T. Kung, L.M. Monier, H. Walker and D. Yasunori. "Design of
the PSC: A Programmable Systolic Chip". Third Caltech Conference on Very
Large Scale Integration, Pasadena, CA, USA, pp. 287-302, March 1983.
[FoBu91] D.J. Fouts and S.E. Butner. "Architecture and Design of a 500-MHz Gallium-
Arsenide Processing Element for a Parallel Supercomputer". IEEE Journal of
Solid-State Circuits, Vol. 26, No. 9, pp. 1199-1211, September 1991.
[FoSc87] D.E. Foulser and R. Schreiber. "The SAXPY Matrix 1: A General Purpose
Systolic Computer". IEEE COMPUTER, pp. 35-43, July 1987.
[FoWa87] J.A.B. Fortes and B.W. Wah. "Systolic Arrays - From Conception to Imple-
mentation". IEEE COMPUTER, pp. 12-17, July 1987.
[Giga91] GaAs IC Data Book and Designer's Guide. GigaBit Logic, 1991.
[Gloa88] M. Gloanec et al. GaAs Digital Integrated Circuits (Chapter 8, GaAs MESFET
Circuit Design). Artech House, 1988.
[Goli91] J.M. Golio. Microwave MESFETs & HEMTs. Artech House, 1991.
[HaCo90] R. Hartley and P. Corbett. "Digit-Serial Processing Techniques". IEEE Trans-
actions on Circuits and Systems, Vol. 37, No. 6, pp. 707-719, June 1990.
[Haus93] J.R. Hauser. "Noise Margin Criteria for Digital Logic Circuits". IEEE Trans-
actions on Education, Vol. 36, No. 4, pp. 363-368, November 1993.
[HeZi87] C.E. Hein, R.M. Zeiger and J.A. Urbano. "The Design of a GaAs Systolic Array
for an Adaptive Null Steering Beamforming Controller". IEEE COMPUTER,
pp.92-93, July 1987.
[HiLa86] A.J. Hill and P.H. Ladbroke. "High Electron Mobility Transistors (HEMTs) -
A Review". GEC Journal of Research, Vol. 4, No. 1, pp. 1-14, 1986.
[Hill86] C.F. Hill. "Noise margin and noise immunity in logic circuits". Microelectron.,
Vol. 1, pp. 16-21, April 1968.
[HwDu90] K. Hwang, M. Dubois, D.K. Panda, S. Rao, S. Shang, A. Uresin, W. Mao,
H. Nair, M. Lytwyn, F. Hsieh, J. Liu, S. Mehrotra and C.M. Cheng. "OMP:
A RISC-based multiprocessor using orthogonal-access memories and multiple
spanning buses". Proc. ACM International Conference on Supercomputing, pp.
7-22, Amsterdam, June 1990.
[ieee85] IEEE STANDARD FOR BINARY FLOATING POINT ARITHMETIC,
ANSI/IEEE Std 754-1985, pp. 260-270, 1985.
[Come92] R. Comerford. "How DEC developed the Alpha". IEEE Spectrum, pp. 26-31,
July 1992.
[JoHu93] K.T. Johnson, A.R. Hurson and B. Shirazi. "General-Purpose Systolic Arrays".
IEEE COMPUTER, pp. 20-31, November 1993.
[Kano85] N. Kanopoulos. "A Bit-Serial Architecture for Digital Signal Processing". IEEE
Transactions on Circuits and Systems, Vol. CAS-32, No. 3, March 1985.
[KaNa85] S. Katsu, S. Nambu, A. Shimano and G. Kano. "A Source Coupled FET Logic
- A New Current-Mode Approach to GaAs Logics". IEEE Transactions on
Electron Devices, Vol. ED-32, No. 6, pp. 1114-1118, June 1985.
[KiHe87] D. Kiefer and J. Heightley. "CRAY-3: A GaAs Implemented Supercomputer
System". Proc. IEEE GaAs IC Symposium, pp. 3-6, 1987.
[KuHw91] S.Y. Kung and J-N. Hwang. "Systolic Array Designs for Kalman Filtering".
IEEE Transactions on Signal Processing, Vol. 39, No. 1, January 1991.
[KuLe78] H.T. Kung and C.E. Leiserson. "Systolic Arrays (for VLSI)". Proc. Symposium
on Sparse Matrix Computations and their Applications, Duff and Stewart Ed-
itors, 1978.
[Kung88] S.Y. Kung. VLSI Array Processors. Prentice Hall, 1988.
[LiJe89] C.-M. Liu and C.-W. Jen. "Design of algorithm-based fault-tolerant VLSI array
processor". IEE Proceedings, Vol. 136, Pt. E, No. 6, November 1989.
[LoBu89] S. Long and S. Butner. Gallium Arsenide Digital Integrated Circuit Design.
McGraw-Hill, New York, 1989.
[Lohs79] J. Lohstroh. "Static and Dynamic Noise Margins of Logic Circuits". IEEE
Journal of Solid-State Circuits, Vol. SC-14, No. 3, pp. 591-598, June 1979.
[LoLi95] P. Lozo, C.C. Lim and D. Nandagopal. "Translation Invariant Pattern Recog-
nition: A Real-time Neural Network Architecture Based on Biological Visual
Spatial Attention". Australian Journal of Intelligent Information Processing
Systems, Vol. 2, No. 1, Autumn 1995.
[MaBe92] W. Marwood and A. Beaumont-Smith. "The Architecture and Optimisation of
Systolic Ring Processors" . Proc. TENCON '92: IEEE Region 10 Conference,
pp. 735-739, November 1992.
[MaBe92a] W. Marwood and A. Beaumont-Smith. "The Implementation of a Gener-
alised Systolic Serial Floating Point Multiplier". Proc. APCCAS'92, IEEE
Asia-Pacif,c Conference on Circuits and Systems, pp. 513-518, Decembet1992.
[MaCl95] W. Marwood, A.P. Clarke, T.C. Thrum, O. Reinhold and M. Wise. "A
Multi-Chip Module Technology and Application". Proc. 13th Australian
Microelectronics Conference, pp. 40-45, July 1995.
[MaLi91] W. Marwood, C.C. Lim, K. Eshraghian and A. Beaumont-Smith. "Systolic
Matrix Processor Architecture for Very High Speed Signal Processing".
Proc. IREECON International Convention, 1991.
[Magi90] R.N. Mayo et al. "1990 DECWRL/Livermore Magic Release", WRL Research
Report 90/7, Western Digital Laboratory, September 1990.
[Marw90] W. Marwood. "A Generalised Systolic Ring Serial Floating Point Multiplier".
Electronics Letters, Vol. 26, No. 11, pp. 753-754, May 1990.
[Marw90a] W. Marwood and C.C. Lim. "A GaAs Systolic Processor for Implementing a
Kalman Filter", Proc. I.E.Aust. Conference, Microelectronics '90, 1990.
[Marw91] W. Marwood. "A Generalised Systolic Ring Serial Floating Point Multiplier".
PCT Patent Application No. PCT/AU91/0027, July 1991.
[Marw91a] W. Marwood. "A Generalised Systolic Ring Serial Floating Point Adder and
Accumulator". Australian Patent, July 1991.
[Marw94] W. Marwood. "An Integrated Multiprocessor for Matrix Algorithms". Ph.D.
Thesis, Department of Electrical and Electronic Engineering, The University
of Adelaide, South Australia, 1994.
[McCa90] A.J. McCamant, G.D. McCormack and D.H. Smith. "An improved GaAs
MESFET Model for SPICE". IEEE Transactions on Microwave Theory and
Techniques, Vol. 38, pp. 822-824, June 1990.
[MeCo80] H.T. Kung and C.E. Leiserson. "Systolic Arrays for VLSI", chapter in C. Mead
and L. Conway. Introduction to VLSI Systems. Addison-Wesley, October 1980.
[Meta92] HSPICE User's Manual. META Software. Version H92, 1992.
[NaHi91] T. Naritomi, H. Aso and M. Kimura. "A Fast Processor for 3-D Device
Simulation Using Systolic Arrays". Systems and Computers in Japan, Vol. 22, No.
1, pp. 39-47, 1991.
[Nowo91] U. Nowotny, M. Lang, M. Berroth, V. Hurm, A. Hulsmann, G. Kaufel,
K. Kohler, B. Raynor and Jo. Schneider. "20Gbit/s 2:1 Multiplexer Using
0.3 µm Gate Length Double Pulse Doped Quantum Well GaAs/AlGaAs
Transistors", Microelectronic Engineering 15, Elsevier Science Publishers B.V., pp.
323-326, 1991.
[Parh89] K.K. Parhi. "Nibble-Serial Arithmetic Processor Designs via Unfolding". Proc.
1989 Int. Symp. on Circuits and Systems, pp. 635-640, 1989.
[PuEs88] D.A. Pucknell and K. Eshraghian. BASIC VLSI DESIGN - Systems and
Circuits. Prentice Hall, 1988.
[Ramb93] RAMBUS - ARCHITECTURAL OVERVIEW. Rambus Inc. Mountain View,
California USA, 1993.
[Rocc90] M. Rocchi. High Speed Digital IC Technologies. Artech House, 1990.
[SaCa92] R. Sarmiento, P.P. Carballo and A. Nunez. "High speed primitives of hardware
accelerators for DSP in GaAs technology". IEE Proc.-G, Vol. 139, No. 2, pp.
205-216, April 1992.
[Snyd82] L. Snyder. "Introduction to the Configurable, Highly Parallel Computer",
IEEE Computer, pp. 47-56, January 1982.
[StNe87] H. Statz, P. Newman, I.W. Smith, R.A. Pucel and H.A. Haus. "GaAs FET
Device and Circuit Simulation in SPICE". IEEE Transactions on Electron
Devices, Vol. ED-34, February 1987.
[Ster74] P.H. Sterbenz. Floating-Point Computation. Prentice-Hall, 1974.
[Sze83] S.M. Sze. VLSI TECHNOLOGY. McGraw-Hill, 1983.
[TaNi92] L.R. Tate, R.J. Niescier, A.C. Hu, J. Scorzelli, W. Leung, C.H. Tzinis,
P.J. Robertson and A. Baca. "32 Bit GaAs HFET IEEE Floating Point
Multiplier", IEEE GaAs IC Symposium, pp. 85-88, 1992.
[TeMo93] Raphael Interconnect Analysis Program Manual, Version 2. Technology
Modelling Associates Inc. 1993.








[TriQ92] GaAs IC Design Course Notes. TriQuint Semiconductor Corp. May 1992.
[Vite92] Foundry Design Manual, Version 5. Vitesse Semiconductor Corp. 1992.
[Vite92a] 1992 Product Data Book. Vitesse Semiconductor Corp. 1992.
[WeEs85] N.H.E. Weste and K. Eshraghian. Principles of CMOS VLSI Design - A
Systems Perspective. Addison-Wesley, October 1985.
[WhSp81] H.J. Whitehouse and J.M. Speiser. "SONAR Applications of Systolic Array
Technology", Conference Record, IEEE EASCON, Washington, D.C.,
November 17-19, 1981.
[Wing90] O. Wing. Gallium Arsenide Digital Circuits. Kluwer Academic Publishers,
1990.
[Zyne88] G.B. Zyner. "Design of Arithmetic Systems in VLSI". Ph.D. Thesis, The
University of Adelaide, October 1988.
